-
Journal of Econometrics Aug 2023
Differential privacy is becoming a gold standard for protecting the privacy of publicly shared data. It has been widely used in social science, data science, public health, information technology, and the U.S. decennial census. Nevertheless, to guarantee differential privacy, existing methods may unavoidably alter the conclusion of original data analysis, as privatization often changes the sample distribution. This phenomenon is known as the trade-off between privacy protection and statistical accuracy. In this work, we mitigate this trade-off by developing a distribution-invariant privatization (DIP) method to reconcile high statistical accuracy with strict differential privacy. As a result, any downstream statistical or machine learning task yields essentially the same conclusion as if one used the original data. Numerically, under the same strictness of privacy protection, DIP achieves superior statistical accuracy in a wide range of simulation studies and real-world benchmarks.
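The trade-off named in this abstract can be illustrated with a minimal sketch. It uses the standard Laplace mechanism, which is not the DIP method of the paper, applied record by record to hypothetical data bounded in [0, 1]. The data, the privacy budget `epsilon`, and all function names are assumptions made for illustration only.

```python
import random
import statistics

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize(values, epsilon, sensitivity, rng):
    """Add Laplace noise with scale sensitivity/epsilon to every value."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]

rng = random.Random(0)
original = [rng.random() for _ in range(5000)]  # hypothetical data in [0, 1]
private = privatize(original, epsilon=1.0, sensitivity=1.0, rng=rng)

# The privatized sample is far more dispersed than the original one,
# so statistics computed from it no longer describe the original distribution.
var_original = statistics.variance(original)
var_private = statistics.variance(private)
print(var_original, var_private)
```

The added noise has variance 2 × scale², so a stricter budget (smaller `epsilon`) distorts the sample distribution more. That distortion is the effect the abstract describes.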
PubMed: 37701878
DOI: 10.1016/j.jeconom.2022.05.004 -
Biology Open Dec 2023
Complex allometry describes a smooth, curvilinear relationship between logarithmic transformations of a biological variable and a corresponding measure for body size when the observations are displayed on a bivariate graph with linear scaling. The curvature in such a display is commonly captured by fitting a quadratic equation to the distribution; and the quadratic term is typically interpreted, in turn, to mean that the mathematically equivalent equation for describing the arithmetic distribution is a two-parameter power equation with an exponent that changes with body size. A power equation with an exponent that is itself a function of body size is virtually uninterpretable, yet numerous attempts have been made in recent years to incorporate such an exponent into theoretical models for the evolution of form and function in both plants and animals. However, the curvature that is described by a quadratic equation fitted to logarithms usually means that an explicit, non-zero intercept is required in the power equation describing the untransformed distribution - not that the exponent in the power equation varies with body size. Misperceptions that commonly accompany reports of complex allometry can be avoided by using nonlinear regression to examine untransformed data.
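The central claim of this abstract can be checked numerically with a small sketch. A power equation with a fixed exponent and a non-zero intercept, y = y0 + a·x^b, is evaluated on a log-log scale, and its local slope is compared at small and large body sizes. The parameter values are hypothetical and chosen only for illustration.

```python
import math

def power_with_intercept(x, y0=5.0, a=2.0, b=0.75):
    """Three-parameter power equation y = y0 + a * x**b (hypothetical values)."""
    return y0 + a * x ** b

def loglog_slope(x, factor=1.01):
    """Finite-difference slope of log(y) against log(x) near x."""
    x2 = x * factor
    dy = math.log(power_with_intercept(x2)) - math.log(power_with_intercept(x))
    return dy / (math.log(x2) - math.log(x))

slope_small = loglog_slope(1.0)   # well below the exponent b
slope_large = loglog_slope(1e6)   # approaches the exponent b
print(slope_small, slope_large)
```

The log-log slope changes with body size even though the exponent `b` is constant, so curvature in the logarithmic display does not by itself imply a size-dependent exponent.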
Topics: Animals; Body Size; Models, Statistical; Models, Biological
PubMed: 38126464
DOI: 10.1242/bio.060148 -
Journal of Korean Medical Science Jan 2024 (Review)
Determining whether the frequency distribution of a given data set follows a normal distribution is among the first steps of data analysis. Visual examination of the data, commonly by Q-Q plot, although accepted by many scientists, is considered subjective and unacceptable by other researchers. The one-sample Kolmogorov-Smirnov test with Lilliefors correction (for a sample size ≥ 50) and the Shapiro-Wilk test (for a sample size < 50) are common statistical tests for checking the normality of a data set quantitatively. As parametric tests, which assume that the data distribution is normal (Gaussian, bell-shaped), are more robust compared to their non-parametric counterparts, we commonly use transformations (e.g., log-transformation, Box-Cox transformation, etc.) to make the frequency distribution of non-normally distributed data close to a normal distribution. Herein, I present how to work with these statistical methods in practice by examining real data sets.
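The effect of a log transformation on right-skewed data can be sketched with the standard library alone. The sketch uses sample skewness as a crude stand-in for a formal normality test; it does not perform the Shapiro-Wilk or Lilliefors tests named in the abstract. The simulated lognormal sample and the seed are assumptions for illustration.

```python
import math
import random

def skewness(values):
    """Moment-based sample skewness (close to 0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]  # right-skewed sample
logged = [math.log(v) for v in raw]                        # log-transformed sample

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(skew_raw, skew_logged)
```

In practice a formal test from a statistics package would be run on both versions of the data. Skewness is used here only to show the direction of the change.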
Topics: Humans; Data Analysis; Physicians; Research Personnel; Statistics, Nonparametric
PubMed: 38258367
DOI: 10.3346/jkms.2024.39.e35 -
Nanomaterials (Basel, Switzerland) Dec 2023
Graphene is a two-dimensional carbon allotrope which exhibits exceptional properties, making it highly suitable for a wide range of applications. Practical graphene fabrication often yields a polycrystalline structure with many inherent defects, which significantly influence its performance. In this study, we utilize a Monte Carlo approach based on the optimized Wooten, Winer and Weaire (WWW) algorithm to simulate the crystalline domain coarsening process of polycrystalline graphene. Our sample configurations show excellent agreement with experimental data. We conduct statistical analyses of the bond and angle distribution, temporal evolution of the defect distribution, and spatial correlation of the lattice orientation that follows a stretched exponential distribution. Furthermore, we thoroughly investigate the diffusion behavior of defects and find that the changes in domain size follow a power-law distribution. We briefly discuss the possible connections of these results to (and differences from) domain growth processes in other statistical models, such as the Ising dynamics. We also examine the impact of buckling of polycrystalline graphene on the crystallization rate under substrate effects. Our findings may offer valuable guidance and insights for both theoretical investigations and experimental advancements.
PubMed: 38133024
DOI: 10.3390/nano13243127 -
Mathematical Biosciences and... Jun 2023
This study aims to develop appropriate models for income distribution in Iran using the econophysics approach for the 2006-2018 period. For this purpose, the three improved distributions of the Pareto, Lognormal, and Gibbs-Boltzmann distributions are analyzed with the data extracted from the target household income expansion plan of the statistical centers in Iran. The research results indicate that the income distribution in Iran does not follow the Pareto and Lognormal distributions in most of the study years but follows the generalized Gibbs-Boltzmann distribution function in all study years. According to the results, the generalized Gibbs-Boltzmann distribution also properly fits the actual data distribution and could clearly explain the income distribution in Iran. The generalized Gibbs-Boltzmann distribution also fits the actual income data better than both Pareto and Lognormal distributions.
PubMed: 37501483
DOI: 10.3934/mbe.2023587 -
Bioinformatics (Oxford, England) Feb 2024
MOTIVATION
Insertions and deletions (indels) of short DNA segments, along with substitutions, are the most frequent molecular evolutionary events. Indels were shown to affect numerous macro-evolutionary processes. Because indels may span multiple positions, their impact is a product of both their rate and their length distribution. An accurate inference of indel-length distribution is important for multiple evolutionary and bioinformatics applications, most notably for alignment software. Previous studies counted the number of continuous gap characters in alignments to determine the best-fitting length distribution. However, gap-counting methods are not statistically rigorous, as gap blocks are not synonymous with indels. Furthermore, such methods rely on alignments that regularly contain errors and are biased due to the assumption of alignment methods that indel lengths follow a geometric distribution.
RESULTS
We aimed to determine which indel-length distribution best characterizes alignments using statistically rigorous methodologies. To this end, we reduced the alignment bias using a machine-learning algorithm and applied an Approximate Bayesian Computation methodology for model selection. Moreover, we developed a novel method to test if current indel models provide an adequate representation of the evolutionary process. We found that the best-fitting model varies among alignments, with a Zipf length distribution fitting the vast majority of them.
AVAILABILITY AND IMPLEMENTATION
The data underlying this article are available in Github, at https://github.com/elyawy/SpartaSim and https://github.com/elyawy/SpartaPipeline.
Topics: Bayes Theorem; Sequence Alignment; Software; Algorithms; INDEL Mutation; Evolution, Molecular
PubMed: 38269647
DOI: 10.1093/bioinformatics/btae043 -
BMC Medical Research Methodology Feb 2024
BACKGROUND
When studying the association between treatment and a clinical outcome, a parametric multivariable model of the conditional outcome expectation is often used to adjust for covariates. The treatment coefficient of the outcome model targets a conditional treatment effect. Model-based standardization is typically applied to average the model predictions over the target covariate distribution, and generate a covariate-adjusted estimate of the marginal treatment effect.
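The difference between the conditional treatment coefficient and the standardized marginal effect described above can be sketched as follows. The logistic coefficients and the covariate grid are hypothetical: no model is fitted here, and this is plain model-based standardization for a point estimate, not the MIM procedure introduced in the paper.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1.0 - p))

# Hypothetical fitted outcome model: logit P(Y=1) = B0 + B_TRT*T + B_X*X
B0, B_TRT, B_X = -1.0, 1.0, 2.0

def marginal_log_odds_ratio(covariates):
    """Average predictions under T=1 and T=0 over the covariates, then contrast."""
    n = len(covariates)
    p1 = sum(expit(B0 + B_TRT + B_X * x) for x in covariates) / n
    p0 = sum(expit(B0 + B_X * x) for x in covariates) / n
    return logit(p1) - logit(p0)

# Hypothetical target covariate distribution: an even grid on [-2, 2].
target_covariates = [-2.0 + 4.0 * i / 200 for i in range(201)]
marginal = marginal_log_odds_ratio(target_covariates)
print(B_TRT, marginal)
```

The marginal log odds ratio is smaller in magnitude than the conditional coefficient `B_TRT` because the odds ratio is non-collapsible. This is why the averaging step is needed when the marginal effect is the target.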
METHODS
The standard approach to model-based standardization involves maximum-likelihood estimation and use of the non-parametric bootstrap. We introduce a novel, general-purpose, model-based standardization method based on multiple imputation that is easily applicable when the outcome model is a generalized linear model. We term our proposed approach multiple imputation marginalization (MIM). MIM consists of two main stages: the generation of synthetic datasets and their analysis. MIM accommodates a Bayesian statistical framework, which naturally allows for the principled propagation of uncertainty, integrates the analysis into a probabilistic framework, and allows for the incorporation of prior evidence.
RESULTS
We conduct a simulation study to benchmark the finite-sample performance of MIM in conjunction with a parametric outcome model. The simulations provide proof-of-principle in scenarios with binary outcomes, continuous-valued covariates, a logistic outcome model and the marginal log odds ratio as the target effect measure. When parametric modeling assumptions hold, MIM yields unbiased estimation in the target covariate distribution, valid coverage rates, and precision and efficiency similar to those of the standard approach to model-based standardization.
CONCLUSION
We demonstrate that multiple imputation can be used to marginalize over a target covariate distribution, providing appropriate inference with a correctly specified parametric outcome model and offering statistical performance comparable to that of the standard approach to model-based standardization.
Topics: Humans; Bayes Theorem; Linear Models; Computer Simulation; Logistic Models; Reference Standards; Models, Statistical
PubMed: 38341552
DOI: 10.1186/s12874-024-02157-x -
JAMA Otolaryngology-- Head & Neck... Jul 2023
IMPORTANCE
Allostatic load, the cumulative strain that results from the chronic stress response, is associated with poor health outcomes. Increased cognitive load and impaired communication associated with hearing loss could potentially be associated with higher allostatic load, but few studies to date have quantified this association.
OBJECTIVE
To investigate if audiometric hearing loss is associated with allostatic load and evaluate if the association varies by demographic factors.
DESIGN, SETTING, PARTICIPANTS
This cross-sectional survey used nationally representative data from the National Health and Nutrition Examination Survey. Audiometric testing was conducted from 2003 to 2004 (ages 20-69 years) and 2009 to 2010 (70 years or older). The study was restricted to participants aged 50 years or older, and the analysis was stratified based on cycle. The data were analyzed between October 2021 and October 2022.
EXPOSURE
A 4-frequency (0.5-4.0 kHz) pure tone average was calculated in the better-hearing ear and modeled continuously and categorically (<25 dB hearing level [dB HL], no hearing loss; 26-40 dB HL, mild hearing loss; ≥41 dB HL, moderate or greater hearing loss).
MAIN OUTCOME AND MEASURES
Allostatic load score (ALS) was defined using laboratory measurements of 8 biomarkers (systolic/diastolic blood pressure, body mass index [calculated as weight in kilograms divided by height in meters squared], and total serum and high-density lipoprotein cholesterol, glycohemoglobin, albumin, and C-reactive protein levels). Each biomarker was assigned a point if it was in the highest risk quartile based on statistical distribution and then summed to yield the ALS (range, 0-8). Linear regression models adjusted for demographic and clinical covariates. Sensitivity analysis included using clinical cut points for ALS and subgroup stratification.
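The quartile-based scoring rule described above can be sketched as follows. Only two of the eight biomarkers are shown. The field names, the direction of risk for each biomarker, the quartile method, and the cohort values are all assumptions for illustration, not details taken from the study.

```python
import statistics

# True if higher values are riskier (illustrative assumption per biomarker).
HIGH_IS_RISKY = {"systolic_bp": True, "hdl_cholesterol": False}

def risk_cutoff(sample, high_is_risky):
    """Upper quartile if high values are risky, lower quartile otherwise."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    return q3 if high_is_risky else q1

def allostatic_load_score(person, cohort):
    """One point per biomarker that falls in the highest-risk quartile."""
    score = 0
    for name, high_is_risky in HIGH_IS_RISKY.items():
        cutoff = risk_cutoff([p[name] for p in cohort], high_is_risky)
        value = person[name]
        in_risk_quartile = value > cutoff if high_is_risky else value < cutoff
        score += int(in_risk_quartile)
    return score

# Hypothetical cohort of eight participants.
systolic = [110, 115, 120, 125, 130, 135, 140, 160]
hdl = [65, 60, 55, 50, 48, 45, 40, 30]
cohort = [{"systolic_bp": s, "hdl_cholesterol": h} for s, h in zip(systolic, hdl)]
scores = [allostatic_load_score(p, cohort) for p in cohort]
print(scores)
```

With all eight biomarkers in `HIGH_IS_RISKY`, the same loop would yield a score in the 0-8 range stated in the abstract.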
RESULTS
In 1412 participants (mean [SD] age, 59.7 [5.9] years; 293 women [51.9%]; 130 [23.0%] Hispanic, 89 [15.8%] non-Hispanic Black, and 318 [55.3%] non-Hispanic White individuals), a modest association was suggested between hearing loss and ALS (ages 50-69 years: β = 0.19 [95% CI, 0.02-0.36] per 10 dB HL; 70 years or older: β = 0.10 [95% CI, 0.02-0.18] per 10 dB HL) among non-hearing aid users. Results were not clearly reflected in the sensitivity analysis with clinical cut points for ALS or modeling hearing loss categorically. Sex-based stratifications identified a stronger association among male individuals (men 70 years or older: β = 0.22 [95% CI, 0.12-0.32] per 10 dB HL; women: β = 0.08 [95% CI, -0.04 to 0.20] per 10 dB HL).
CONCLUSION AND RELEVANCE
The study findings did not clearly support an association between hearing loss and ALS. While hearing loss has been shown to be associated with increased risk for numerous health comorbidities, its association with the chronic stress response and allostasis may be less than that of other health conditions.
Topics: Aged; Female; Humans; Male; Middle Aged; Allostasis; Audiometry, Pure-Tone; Cross-Sectional Studies; Deafness; Hearing Loss; Nutrition Surveys
PubMed: 37200015
DOI: 10.1001/jamaoto.2023.0948 -
PloS One 2023
Testing whether data are from a normal distribution is a traditional problem and is of great concern for data analyses. Normality is the premise of many statistical methods, such as the t-test, Hotelling T2 test and ANOVA. There are numerous tests in the literature, and the commonly used ones are the Anderson-Darling test, Shapiro-Wilk test and Jarque-Bera test. Each test has its own advantages, since each was developed for specific patterns, and no method consistently performs optimally in all situations. Since the data distribution of practical problems can be complex and diverse, we propose a Cauchy Combination Omnibus Test (CCOT) that is robust and valid in most data cases. We also give some theoretical results to analyze the good properties of CCOT. CCOT has two obvious advantages: it has an explicit expression for calculating statistical significance, and extensive simulation results show its robustness regardless of the shape of the distribution the data come from. Applications to South African Heart Disease and Neonatal Hearing Impairment data further illustrate its practicability.
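The general Cauchy combination rule for merging several p-values into one can be sketched in a few lines. The abstract does not state which component tests or weights CCOT uses, so the equal weights and the example p-values below are assumptions, and the function is a sketch of the generic technique rather than of CCOT itself.

```python
import math

def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule.

    Each p-value is mapped to a standard Cauchy variate, the variates are
    averaged with weights that sum to 1, and the average is mapped back.
    """
    n = len(p_values)
    if weights is None:
        weights = [1.0 / n] * n  # equal weights (assumption)
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    return 0.5 - math.atan(t) / math.pi

# Hypothetical p-values from three different normality tests on the same data.
combined = cauchy_combination([0.01, 0.20, 0.65])
print(combined)
```

Because the combined p-value comes from a closed-form expression, no resampling is needed to assess significance.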
Topics: Computer Simulation; Normal Distribution; Sample Size; Data Analysis
PubMed: 37535617
DOI: 10.1371/journal.pone.0289498 -
JCO Clinical Cancer Informatics Sep 2023
Waterfall plots have gained popularity as a visualization tool to present antitumor activity of treatments in oncology, especially for phase I and II trials. The typical waterfall plot in oncology is a bar plot with each bar representing the best percent tumor size reduction from baseline for a patient, sorted in descending order along the x-axis. As new therapies are routinely developed in combination with standard of care or other investigational treatments, waterfall plot comparison between combination therapy and monotherapy may facilitate development decisions in addition to overall response rate or duration of response. However, waterfall plots are often assessed heuristically in practice, with a lack of statistical rigor. In this work, we examine the correspondence between the waterfall plot and the empirical cumulative distribution function. We demonstrate how to derive key summary statistics directly from the waterfall plot. Using real examples from published waterfall plots, we show how comparisons of waterfall plots can elucidate clinically meaningful information, such as treatment effect patterns in progression-free survival and overall survival.
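The correspondence between a waterfall plot and the empirical cumulative distribution function can be sketched with a handful of values. The sketch assumes that bars encode best percent change from baseline, with negative values meaning shrinkage. The patient values and the -30% threshold are illustrative assumptions and are not taken from the article.

```python
import statistics

def waterfall_order(changes):
    """Bar heights in the order they appear on a waterfall plot (descending)."""
    return sorted(changes, reverse=True)

def ecdf(changes, threshold):
    """Empirical CDF: fraction of patients with best change <= threshold."""
    return sum(1 for c in changes if c <= threshold) / len(changes)

# Hypothetical best percent changes from baseline for eight patients.
best_changes = [25.0, 10.0, -5.0, -20.0, -35.0, -50.0, -80.0, -100.0]

bars = waterfall_order(best_changes)
shrinkage_rate = ecdf(best_changes, -30.0)       # share with at least 30% shrinkage
median_change = statistics.median(best_changes)  # read off the middle bars
print(bars, shrinkage_rate, median_change)
```

Reading the plot from right to left up to a chosen bar height gives the same proportion as evaluating the ECDF at that height, so summary statistics can be derived from the sorted bars alone.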
Topics: Humans; Medical Oncology; Data Visualization
PubMed: 37906725
DOI: 10.1200/CCI.23.00132