Anesthesia and Analgesia Oct 2017
Review
Designing, conducting, analyzing, reporting, and interpreting the findings of a research study require an understanding of the types and characteristics of data and variables. Descriptive statistics are typically used simply to calculate, describe, and summarize the collected research data in a logical, meaningful, and efficient way. Inferential statistics allow researchers to make a valid estimate of the association between an intervention and the treatment effect in a specific population, based upon their randomly collected, representative sample data. Categorical data can be either dichotomous or polytomous. Dichotomous data have only 2 categories, and thus are considered binary. Polytomous data have more than 2 categories. Unlike dichotomous and polytomous data, ordinal data are rank ordered, typically based on a numerical scale that is composed of a small set of discrete classes or integers. Continuous data are measured on a continuum and can have any numeric value over this continuous range. Continuous data can be meaningfully divided into smaller and smaller or finer and finer increments, depending upon the precision of the measurement instrument. Interval data are a form of continuous data in which equal intervals represent equal differences in the property being measured. Ratio data are another form of continuous data, which have the same properties as interval data, plus a true definition of an absolute zero point, and the ratios of the values on the measurement scale make sense. The normal (Gaussian) distribution ("bell-shaped curve") is one of the most common statistical distributions. Many applied inferential statistical tests are predicated on the assumption that the analyzed data follow a normal distribution. The histogram and the Q-Q plot are 2 graphical methods to assess if a set of data have a normal distribution (display "normality"). The Shapiro-Wilk test and the Kolmogorov-Smirnov test are 2 well-known and historically widely applied quantitative methods to assess for data normality. Parametric statistical tests make certain assumptions about the characteristics and/or parameters of the underlying population distribution upon which the test is based, whereas nonparametric tests make fewer or less rigorous assumptions. If the normality test concludes that the study data deviate significantly from a Gaussian distribution, rather than applying a less robust nonparametric test, the problem can potentially be remedied by judiciously and openly: (1) performing a data transformation of all the data values; or (2) eliminating any obvious data outlier(s).
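As a minimal illustration of the normality checks named in this abstract (a sketch only, assuming Python with NumPy, SciPy, and Matplotlib, and simulated rather than study data):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=0.0, sigma=0.6, size=80)  # right-skewed simulated "study" data

# Quantitative checks: Shapiro-Wilk, and Kolmogorov-Smirnov against a fitted normal
w_stat, w_p = stats.shapiro(sample)
ks_stat, ks_p = stats.kstest(sample, "norm",
                             args=(sample.mean(), sample.std(ddof=1)))
print(f"Shapiro-Wilk p = {w_p:.4f}; Kolmogorov-Smirnov p = {ks_p:.4f}")

# Graphical checks: histogram and Q-Q plot
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(sample, bins=15)
axes[0].set_title("Histogram")
stats.probplot(sample, dist="norm", plot=axes[1])  # Q-Q plot against the normal
axes[1].set_title("Q-Q plot")
plt.tight_layout()
plt.show()

# One judicious remedy for right-skewed data: a log transformation of all values
log_sample = np.log(sample)
print(f"Shapiro-Wilk p after log transform = {stats.shapiro(log_sample).pvalue:.4f}")
```

Because the Kolmogorov-Smirnov comparison here uses a mean and SD estimated from the same sample, its p-value is only approximate; the log transformation is shown as one example of the data transformation the abstract mentions.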
Topics: Biomedical Research; Data Interpretation, Statistical; Humans; Normal Distribution; Sample Size
PubMed: 28787341
DOI: 10.1213/ANE.0000000000002370
Cognition Oct 2022
Humans can rapidly estimate the statistical properties of groups of stimuli, including their average and variability. But recent studies of so-called Feature Distribution Learning (FDL) have shown that observers can quickly learn even more complex aspects of feature distributions. In FDL, observers learn the full shape of a distribution of features in a set of distractor stimuli and use this information to improve visual search: response times (RT) are slowed if the target feature lies inside the previous distractor distribution, and the RT patterns closely reflect the distribution shape. FDL requires only a few trials and is markedly sensitive to different distribution types. It is unknown, however, whether our perceptual system encodes feature distributions automatically and by passive exposure, or whether this learning requires active engagement with the stimuli. In two experiments, we sought to answer this question. During an initial exposure stage, participants passively viewed a display of 36 lines that included one orientation singleton or no singletons. In the following search display, they had to find an oddly oriented target. The orientations of the lines were determined either by a Gaussian or a uniform distribution. We found evidence for FDL only when the passive trials contained an orientation singleton. Under these conditions, RTs decreased as a function of the orientation distance between the target and the mean of the exposed distractor distribution. These results suggest that passive exposure to a distribution of visual features can affect subsequent search performance, but only if a singleton appears during exposure to the distribution.
Topics: Attention; Humans; Learning; Reaction Time; Statistical Distributions; Visual Perception
PubMed: 35785655
DOI: 10.1016/j.cognition.2022.105211
BMC Bioinformatics May 2022
BACKGROUND
Cluster algorithms are gaining in popularity in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and classification accuracy for common analysis pipelines through simulation. We systematically varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we directly compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis).
RESULTS
We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N = 20 per subgroup), provided cluster separation is large (Δ = 4). Finally, we demonstrated that fuzzy clustering can provide a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ = 3).
CONCLUSIONS
Traditional intuitions about statistical power only partially apply to cluster analysis: increasing the number of participants above a sufficient sample size did not improve power, but effect size was crucial. Notably, for the popular dimensionality reduction and clustering algorithms tested here, power was only satisfactory for relatively large effect sizes (clear separation between subgroups). Fuzzy clustering provided higher power in multivariate normal distributions. Overall, we recommend that researchers (1) only apply cluster analysis when large subgroup separation is expected, (2) aim for sample sizes of N = 20 to N = 30 per expected subgroup, (3) use multi-dimensional scaling to improve cluster separation, and (4) use fuzzy clustering or mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.
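A minimal sketch of this kind of simulation-based power estimate (assuming Python with NumPy and scikit-learn; the subgroup sizes, the separation Δ placed along a single feature, the k-means-only pipeline, and the ARI threshold for a "hit" are illustrative choices, not the authors' exact protocol):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(7)

def simulated_power(n_per_group=20, n_features=10, delta=4.0,
                    n_sims=200, ari_threshold=0.5):
    """Empirical 'power': share of simulated datasets in which k-means
    recovers two planted subgroups (adjusted Rand index above a threshold)."""
    mu1 = np.zeros(n_features)
    mu2 = np.zeros(n_features)
    mu2[0] = delta                      # centroid separation along one feature
    labels_true = np.repeat([0, 1], n_per_group)
    hits = 0
    for _ in range(n_sims):
        x = np.vstack([
            rng.normal(mu1, 1.0, size=(n_per_group, n_features)),
            rng.normal(mu2, 1.0, size=(n_per_group, n_features)),
        ])
        labels_hat = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)
        if adjusted_rand_score(labels_true, labels_hat) >= ari_threshold:
            hits += 1
    return hits / n_sims

for delta in (1.0, 2.0, 3.0, 4.0):
    print(f"delta = {delta:.0f}: estimated power = {simulated_power(delta=delta):.2f}")
```

Swapping in other clusterers, fuzzy c-means, or a dimensionality-reduction step would extend the same simulation loop.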
Topics: Algorithms; Cluster Analysis; Humans; Normal Distribution; Sample Size; Software
PubMed: 35641905
DOI: 10.1186/s12859-022-04675-1
Medical Physics Oct 2020
PURPOSE
Radiotherapy, especially with charged particles, is sensitive to executional and preparational uncertainties that propagate to uncertainty in dose and plan quality indicators, for example, dose-volume histograms (DVHs). Current approaches to quantify and mitigate such uncertainties rely on explicitly computed error scenarios and are thus subject to statistical uncertainty and limitations regarding the underlying uncertainty model. Here we present an alternative, analytical method to approximate moments, in particular expectation value and (co)variance, of the probability distribution of DVH-points, and evaluate its accuracy on patient data.
METHODS
We use Analytical Probabilistic Modeling (APM) to derive moments of the probability distribution over individual DVH-points based on the probability distribution over dose. By using the computed moments to parameterize distinct probability distributions over DVH-points (here normal or beta distributions), not only the moments but also percentiles, that is, α-DVHs, are computed. The model is subsequently evaluated on three patient cases (intracranial, paraspinal, prostate) in 30- and single-fraction scenarios by assuming the dose to follow a multivariate normal distribution, whose moments are computed in closed-form with APM. The results are compared to a benchmark based on discrete random sampling.
RESULTS
The evaluation of the new probabilistic model on the three patient cases against a sampling benchmark proves its correctness under perfect assumptions as well as good agreement in realistic conditions. More precisely, ca. 90% of all computed expected DVH-points and their standard deviations agree within 1% volume with their empirical counterpart from sampling computations, for both fractionated and single fraction treatments. When computing α-DVHs, the assumption of a beta distribution achieved better agreement with empirical percentiles than the assumption of a normal distribution: while in both cases probabilities locally showed large deviations (up to ±0.2), the respective α-DVHs for α = {0.05, 0.5, 0.95} only showed small deviations in respective volume (up to ±5% volume for a normal distribution, and up to 2% for a beta distribution). A previously published model from literature, which was included for comparison, exhibited substantially larger deviations.
CONCLUSIONS
With APM we could derive a mathematically exact description of moments of probability distributions over DVH-points given a probability distribution over dose. The model generalizes previous attempts and performs well for both choices of probability distributions, that is, normal or beta distributions, over DVH-points.
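A minimal sketch of the moment-matching step described above (assuming Python with SciPy; the mean and variance of the single DVH-point are placeholder values, not APM output):

```python
from scipy import stats

def beta_from_moments(mean, var):
    """Method-of-moments beta parameters for a DVH volume fraction in [0, 1];
    requires var < mean * (1 - mean)."""
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

# Placeholder moments for one DVH-point (expected volume fraction and its variance)
mean, var = 0.85, 0.002

a, b = beta_from_moments(mean, var)
for alpha in (0.05, 0.5, 0.95):
    v_beta = stats.beta.ppf(alpha, a, b)
    v_norm = stats.norm.ppf(alpha, loc=mean, scale=var ** 0.5)
    print(f"alpha = {alpha:.2f}: beta-based volume {v_beta:.3f}, normal-based volume {v_norm:.3f}")
```

Repeating this per dose level and per α yields the α-DVH curves compared in the study; the beta parameterization keeps the percentiles inside the physically meaningful [0, 1] volume range.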
Topics: Humans; Male; Models, Statistical; Normal Distribution; Probability; Radiotherapy Dosage; Radiotherapy Planning, Computer-Assisted
PubMed: 32740930
DOI: 10.1002/mp.14414
Current Protocols in Neuroscience Jan 2018
Review
There is a vast array of new and improved methods for comparing groups and studying associations that offer the potential for substantially increasing power, providing improved control over the probability of a Type I error, and yielding a deeper and more nuanced understanding of data. These new techniques effectively deal with four insights into when and why conventional methods can be unsatisfactory. For the non-statistician, however, this array of techniques can seem daunting simply because so many new methods are now available. This unit briefly reviews when and why conventional methods can have relatively low power and yield misleading results. The main goal is to suggest some general guidelines regarding when, how, and why certain modern techniques might be used.
Topics: Animals; Data Interpretation, Statistical; Humans; Neurosciences; Statistical Distributions
PubMed: 29357109
DOI: 10.1002/cpns.41
Behavior Research Methods Feb 2021
Review
Ceiling and floor effects are often observed in social and behavioral science. The current study examines ceiling/floor effects in the context of the t-test and ANOVA, two frequently used statistical methods in experimental studies. Our literature review indicated that most researchers treated ceiling or floor data as if these data were true values, and that some researchers handled such data with ad hoc strategies such as discarding ceiling or floor observations before conducting the t-test or ANOVA. The current study evaluates the performance of these conventional methods for the t-test and ANOVA with ceiling or floor data. Our evaluation also includes censored regression with regard to its capacity for handling ceiling/floor data. Furthermore, we propose an easy-to-use method that handles ceiling or floor data in the t-test and ANOVA by using properties of truncated normal distributions. Simulation studies were conducted to compare the performance of the methods in handling ceiling or floor data for the t-test and ANOVA. Overall, the proposed method showed greater accuracy in effect size estimation and better-controlled Type I error rates than the other evaluated methods. We developed an easy-to-use software package and web applications to help researchers implement the proposed method. Recommendations and future directions are discussed.
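As a rough sketch of recovering the latent mean and SD from ceiling-censored scores via a censored-normal (Tobit-style) likelihood (assuming Python with NumPy and SciPy; this illustrates the general idea rather than the authors' truncated-normal estimator or their software package):

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(3)
ceiling = 100.0
latent = rng.normal(loc=92.0, scale=10.0, size=200)   # unobserved "true" scores
observed = np.minimum(latent, ceiling)                 # scores pile up at the ceiling
at_ceiling = observed >= ceiling

def neg_log_lik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    # Uncensored values contribute the normal density;
    # ceiling values contribute P(latent >= ceiling) under the same normal.
    ll = stats.norm.logpdf(observed[~at_ceiling], mu, sigma).sum()
    ll += at_ceiling.sum() * stats.norm.logsf(ceiling, mu, sigma)
    return -ll

fit = optimize.minimize(neg_log_lik,
                        x0=[observed.mean(), np.log(observed.std(ddof=1))],
                        method="Nelder-Mead")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print(f"naive mean = {observed.mean():.1f}, censored-ML mean = {mu_hat:.1f}")
print(f"naive SD   = {observed.std(ddof=1):.1f}, censored-ML SD   = {sigma_hat:.1f}")
```

The naive sample mean and SD are biased toward the ceiling, while the censored-likelihood estimates recover values close to the generating parameters; group comparisons can then be based on the recovered parameters.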
Topics: Analysis of Variance; Humans; Normal Distribution; Research Design
PubMed: 32671580
DOI: 10.3758/s13428-020-01407-2
PloS One 2023
Review
The log-normal distribution, often used to model animal abundance and its uncertainty, is central to ecological modeling and conservation but its statistical properties are less intuitive than those of the normal distribution. The right skew of the log-normal distribution can be considerable for highly uncertain estimates and the median is often chosen as a point estimate. However, the use of the median can become complicated when summing across populations since the median of the sum of log-normal distributions is not the sum of the constituent medians. Such estimates become sensitive to the spatial or taxonomic scale over which abundance is being summarized and the naive estimate (the median of the distribution representing the sum across populations) can become grossly inflated. Here we review the statistical issues involved and some alternative formulations that might be considered by ecologists interested in modeling abundance. Using a recent estimate of global avian abundance as a case study (Callaghan et al. 2021), we investigate the properties of several alternative methods of summing across species' abundance, including the sorted summing used in the original study (Callaghan et al. 2021) and the use of shifted log-normal distributions, truncated normal distributions, and rectified normal distributions. The appropriate method of summing across distributions was intimately tied to the use of the mean or median as the measure of central tendency used as the point estimate. Use of the shifted log-normal distribution, however, generated scale-consistent estimates for global abundance across a spectrum of contexts. Our paper highlights how seemingly inconsequential decisions regarding the estimation of abundance yield radically different estimates of global abundance and its uncertainty, with conservation consequences that are underappreciated and require careful consideration.
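A minimal Monte Carlo illustration of why the sum of per-population medians and the median of the summed total diverge for log-normal abundances (assuming Python with NumPy; the abundance parameters are arbitrary and not taken from Callaghan et al. 2021):

```python
import numpy as np

rng = np.random.default_rng(11)
n_pops, n_draws = 200, 20_000

# Arbitrary per-population log-normal abundance estimates with large uncertainty
mu = rng.uniform(2.0, 6.0, size=n_pops)      # means on the log scale
sigma = rng.uniform(0.5, 1.5, size=n_pops)   # SDs on the log scale

sum_of_medians = np.exp(mu).sum()            # the median of each log-normal is exp(mu)
draws = rng.lognormal(mu, sigma, size=(n_draws, n_pops))
totals = draws.sum(axis=1)                   # Monte Carlo distribution of the summed abundance

print(f"sum of per-population medians: {sum_of_medians:,.0f}")
print(f"median of the summed total:    {np.median(totals):,.0f}")
print(f"mean of the summed total:      {totals.mean():,.0f}")
```

The median of the summed total sits well above the sum of the constituent medians (and close to the sum of the means), which is the scale-sensitivity the abstract describes.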
Topics: Animals; Normal Distribution; Statistical Distributions; Birds
PubMed: 36634090
DOI: 10.1371/journal.pone.0280351
Anaesthesia Jan 2017
Topics: Data Interpretation, Statistical; Humans; Hydrogen-Ion Concentration; Infant, Newborn; Reference Values; Statistical Distributions; Umbilical Arteries; Umbilical Veins
PubMed: 27858980
DOI: 10.1111/anae.13753
General two-parameter distribution: Statistical properties, estimation, and application on COVID-19. PloS One 2023
In this paper, we introduced a novel general two-parameter statistical distribution which can be presented as a mix of both exponential and gamma distributions. Some statistical properties of the general model were derived mathematically. Several estimation methods were examined for estimating the parameters of the proposed model. A new statistical model was presented as a particular case of the general two-parameter model and used to study the performance of the different estimation methods on randomly generated data sets. Finally, the COVID-19 data set was used to show the superiority of the particular case for fitting real-world data sets over other well-known models.
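The general two-parameter density itself is not reproduced in this abstract, so the following sketch only illustrates the generic workflow of fitting and comparing candidate distributions on a data set (assuming Python with NumPy and SciPy, and simulated placeholder data in place of the COVID-19 set):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.gamma(shape=2.5, scale=3.0, size=300)   # placeholder for a real data set

candidates = {"exponential": stats.expon, "gamma": stats.gamma,
              "log-normal": stats.lognorm, "Weibull": stats.weibull_min}

for name, dist in candidates.items():
    params = dist.fit(data, floc=0)                 # ML fit with the location fixed at 0
    k = len(params) - 1                             # free parameters (loc is fixed)
    loglik = dist.logpdf(data, *params).sum()
    aic = 2 * k - 2 * loglik
    ks = stats.kstest(data, dist.cdf, args=params).statistic
    print(f"{name:11s}  AIC = {aic:8.1f}  KS distance = {ks:.3f}")
```

A newly proposed distribution would enter the same comparison once its density and fitting routine are implemented; lower AIC and KS distance indicate the better-fitting model.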
Topics: Humans; COVID-19; Models, Statistical; Statistical Distributions
PubMed: 36753497
DOI: 10.1371/journal.pone.0281474
Journal of Neurophysiology Apr 2021
Extracellular recordings of brain voltage signals have many uses, including the identification of spikes and the characterization of brain states via analysis of local field potential (LFP) or EEG recordings. Though the factors underlying the generation of these signals are time varying and complex, their analysis may be facilitated by an understanding of their statistical properties. To this end, we analyzed the voltage distributions of high-pass extracellular recordings from a variety of structures, including cortex, thalamus, and hippocampus, in monkeys, cats, and rodents. We additionally investigated LFP signals in these recordings as well as human EEG signals obtained during different sleep stages. In all cases, the distributions were accurately described by a Gaussian within ±1.5 standard deviations from zero. Outside these limits, voltages tended to be distributed exponentially, that is, they fell off linearly on log-linear frequency plots, with variable heights and slopes. A possible explanation for this is that sporadically and independently occurring events with individual Gaussian size distributions can sum to produce approximately exponential distributions. For the high-pass recordings, a second explanation results from a model of the noisy behavior of ion channels that produce action potentials via Hodgkin-Huxley kinetics. The distributions produced by this model, relative to the averaged potential, were also Gaussian with approximately exponential flanks. The model also predicted time-varying noise distributions during action potentials, which were observed in the extracellular spike signals. These findings suggest a principled method for detecting spikes in high-pass recordings and transient events in LFP and EEG signals. We show that the voltage distributions in brain recordings, including high-pass extracellular recordings, the LFP, and human EEG, are accurately described by a Gaussian within ±1.5 standard deviations from zero, with heavy, exponential tails outside these limits. This offers a principled way of setting event detection thresholds in high-pass recordings. It also offers a means for identifying event-like, transient signals in LFP and EEG recordings which may correlate with other neural phenomena.
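A minimal sketch of the thresholding idea suggested above (assuming Python with NumPy and a simulated high-pass trace; the MAD-based estimate of the Gaussian core's SD and the -4σ detection threshold are common conventions, not parameters reported in the study):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated high-pass trace: Gaussian background plus sparse, large negative deflections
n = 200_000
signal = rng.normal(0.0, 1.0, size=n)
event_idx = rng.choice(n, size=500, replace=False)
signal[event_idx] -= rng.exponential(scale=4.0, size=500)

# Robust estimate of the Gaussian core's SD via the median absolute deviation
sigma = np.median(np.abs(signal - np.median(signal))) / 0.6745

# Fraction of samples within +/- 1.5 SD (about 86.6% for a pure Gaussian)
core_fraction = np.mean(np.abs(signal) <= 1.5 * sigma)
print(f"core SD = {sigma:.3f}, fraction within 1.5 SD = {core_fraction:.3f}")

# The exponential flank: log-counts of the negative tail fall off roughly linearly
tail = -signal[signal < -1.5 * sigma]
counts, edges = np.histogram(tail, bins=40)
centers = 0.5 * (edges[:-1] + edges[1:])
nonzero = counts > 0
slope, _ = np.polyfit(centers[nonzero], np.log(counts[nonzero]), deg=1)
print(f"tail decay constant ~ {-1.0 / slope:.2f} (signal units)")

# Event-detection threshold as a multiple of the robustly estimated core SD
threshold = -4.0 * sigma
print(f"samples below threshold: {(signal < threshold).sum()}")
```

Estimating the core SD robustly (rather than from the raw SD, which is inflated by the events themselves) is what makes a fixed multiple of σ a principled detection threshold when the background is Gaussian within ±1.5 SD.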
Topics: Adult; Animals; Cats; Cerebral Cortex; Electroencephalography; Electrophysiological Phenomena; Humans; Macaca; Mice; Models, Statistical; Normal Distribution; Rats
PubMed: 33689506
DOI: 10.1152/jn.00633.2020