Cognition Oct 2022
Humans can rapidly estimate the statistical properties of groups of stimuli, including their average and variability. But recent studies of so-called Feature Distribution Learning (FDL) have shown that observers can quickly learn even more complex aspects of feature distributions. In FDL, observers learn the full shape of a distribution of features in a set of distractor stimuli and use this information to improve visual search: response times (RT) are slowed if the target feature lies inside the previous distractor distribution, and the RT patterns closely reflect the distribution shape. FDL requires only a few trials and is markedly sensitive to different distribution types. It is unknown, however, whether our perceptual system encodes feature distributions automatically and by passive exposure, or whether this learning requires active engagement with the stimuli. In two experiments, we sought to answer this question. During an initial exposure stage, participants passively viewed a display of 36 lines that included one orientation singleton or no singletons. In the following search display, they had to find an oddly oriented target. The orientations of the lines were determined either by a Gaussian or a uniform distribution. We found evidence for FDL only when the passive trials contained an orientation singleton. Under these conditions, RTs decreased as a function of the orientation distance between the target and the mean of the exposed distractor distribution. These results suggest that passive exposure to a distribution of visual features can affect subsequent search performance, but only if a singleton appears during exposure to the distribution.
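For readers who want to prototype the paradigm, here is a minimal Python sketch of how distractor orientations for such a 36-line display could be generated; the spread parameters (SD, uniform half-range) and the 90-degree reference orientation are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def distractor_orientations(mean_deg, dist="gaussian", n=35,
                            sd=10.0, half_range=20.0):
    """Orientations (degrees, mod 180) for the distractor lines.
    sd and half_range are illustrative; the paper's values may differ."""
    if dist == "gaussian":
        return rng.normal(mean_deg, sd, n) % 180
    return rng.uniform(mean_deg - half_range, mean_deg + half_range, n) % 180

# A 36-line display: 35 distractors plus one orientation singleton
distractors = distractor_orientations(90, dist="gaussian")
singleton = (90 + 60) % 180  # placed well outside the distractor distribution
print(distractors.round(1), singleton)
```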
Topics: Attention; Humans; Learning; Reaction Time; Statistical Distributions; Visual Perception
PubMed: 35785655
DOI: 10.1016/j.cognition.2022.105211
BMC Bioinformatics May 2022
BACKGROUND
Cluster algorithms are gaining in popularity in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and classification accuracy for common analysis pipelines through simulation. We systematically varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we directly compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis).
RESULTS
We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N = 20 per subgroup), provided cluster separation was large (Δ = 4). Finally, we demonstrated that fuzzy clustering can provide a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ = 3).
CONCLUSIONS
Traditional intuitions about statistical power only partially apply to cluster analysis: increasing the number of participants above a sufficient sample size did not improve power, but effect size was crucial. Notably, for the popular dimensionality reduction and clustering algorithms tested here, power was only satisfactory for relatively large effect sizes (clear separation between subgroups). Fuzzy clustering provided higher power in multivariate normal distributions. Overall, we recommend that researchers (1) only apply cluster analysis when large subgroup separation is expected, (2) aim for sample sizes of N = 20 to N = 30 per expected subgroup, (3) use multi-dimensional scaling to improve cluster separation, and (4) use fuzzy clustering or mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.
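The power-by-simulation idea in this abstract is straightforward to prototype. Below is a minimal Python sketch (not the authors' pipeline): two multivariate normal subgroups with centroids Δ apart, k-means recovery scored by adjusted Rand index, and "power" taken as the proportion of runs above an ARI threshold. The threshold and all settings are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

def simulate_power(n_per_group=20, n_features=10, delta=4.0,
                   n_sims=200, ari_threshold=0.8):
    """Proportion of simulations in which k-means recovers two
    multivariate normal subgroups (identity covariance, centroids
    delta apart) with ARI above ari_threshold."""
    shift = np.zeros(n_features)
    shift[0] = delta                      # centroids differ along one axis
    labels = np.repeat([0, 1], n_per_group)
    hits = 0
    for _ in range(n_sims):
        X = np.vstack([rng.normal(size=(n_per_group, n_features)),
                       rng.normal(size=(n_per_group, n_features)) + shift])
        pred = KMeans(n_clusters=2, n_init=10).fit_predict(X)
        hits += adjusted_rand_score(labels, pred) >= ari_threshold
    return hits / n_sims

print(simulate_power(delta=4.0))  # large separation: power near 1
print(simulate_power(delta=1.0))  # weak separation: power collapses
```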
Topics: Algorithms; Cluster Analysis; Humans; Normal Distribution; Sample Size; Software
PubMed: 35641905
DOI: 10.1186/s12859-022-04675-1
General two-parameter distribution: Statistical properties, estimation, and application on COVID-19. PloS One 2023
In this paper, we introduced a novel general two-parameter statistical distribution that can be presented as a mix of the exponential and gamma distributions. Some statistical properties of the general model were derived mathematically. Several methods for estimating the proposed model's parameters were studied. A new statistical model was presented as a particular case of the general two-parameter model and was used to compare the performance of the different estimation methods on randomly generated data sets. Finally, a COVID-19 data set was used to show that the particular case fits real-world data sets better than other well-known models.
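The abstract does not give the density of the proposed family, so as a generic illustration of fitting an exponential–gamma mix by maximum likelihood, here is a Python sketch with an assumed two-component form (weight p, shared rate θ, gamma shape fixed at 2); the actual model in the paper will differ.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def neg_log_lik(params, x):
    """Negative log-likelihood of an assumed exponential + gamma mix:
    p * Exp(theta) + (1 - p) * Gamma(shape=2, rate=theta)."""
    p, theta = params
    pdf = (p * stats.expon.pdf(x, scale=1 / theta)
           + (1 - p) * stats.gamma.pdf(x, a=2, scale=1 / theta))
    return -np.sum(np.log(pdf + 1e-300))

rng = np.random.default_rng(42)
x = np.concatenate([rng.exponential(2.0, 300),   # true theta = 0.5
                    rng.gamma(2, 2.0, 700)])     # true p = 0.3
fit = minimize(neg_log_lik, x0=[0.5, 1.0], args=(x,),
               bounds=[(1e-6, 1 - 1e-6), (1e-6, None)])
print(fit.x)  # maximum-likelihood estimates of (p, theta)
```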
Topics: Humans; COVID-19; Models, Statistical; Statistical Distributions
PubMed: 36753497
DOI: 10.1371/journal.pone.0281474
PloS One 2018
Fame and celebrity play an ever-increasing role in our culture. However, despite the cultural and economic importance of fame and its gradations, there exists no consensus method for quantifying the fame of an individual, or of comparing that of two individuals. We argue that, even if fame is difficult to measure with precision, one may develop useful metrics for fame that correlate well with intuition and that remain reasonably stable over time. Using datasets of recently deceased individuals who were highly renowned, we have evaluated several internet-based methods for quantifying fame. We find that some widely used internet-derived metrics, such as search engine results, correlate poorly with human subject judgments of fame. However, other metrics exist that agree well with human judgments and appear to offer workable, easily accessible measures of fame. Using such a metric we perform a preliminary investigation of the statistical distribution of fame, which has some of the power law character seen in other natural and social phenomena such as landslides and market crashes. In order to demonstrate how such findings can generate quantitative insight into celebrity culture, we assess some folk ideas regarding the frequency distribution and apparent clustering of celebrity deaths.
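A common way to quantify the "power law character" mentioned above is the continuous maximum-likelihood exponent estimator of Clauset, Shalizi & Newman (2009). The sketch below applies it to synthetic Pareto-tailed data; the variable names and x_min choice are illustrative, not drawn from the paper.

```python
import numpy as np

def powerlaw_alpha(x, xmin):
    """Continuous power-law exponent via maximum likelihood
    (Clauset, Shalizi & Newman 2009): alpha = 1 + n / sum(log(x/xmin))."""
    tail = np.asarray(x, dtype=float)
    tail = tail[tail >= xmin]
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))

# Synthetic check: inverse-CDF sampling from a Pareto tail with alpha = 2
rng = np.random.default_rng(7)
alpha, xmin = 2.0, 1.0
samples = xmin * (1.0 - rng.random(10_000)) ** (-1.0 / (alpha - 1.0))
print(powerlaw_alpha(samples, xmin))  # recovers a value close to 2
```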
Topics: Famous Persons; Female; Humans; Internet; Judgment; Male; Probability; Statistical Distributions; Surveys and Questionnaires
PubMed: 29979792
DOI: 10.1371/journal.pone.0200196
Computational Intelligence and... 2022
In this study, a new one-parameter count distribution is proposed by combining the Poisson and XLindley distributions. Some of its statistical and reliability properties, including order statistics, hazard rate function, reversed hazard rate function, mode, factorial moments, probability generating function, moment generating function, index of dispersion, Shannon entropy, Mills ratio, mean residual life function, and associated measures, are investigated. All these properties can be expressed in explicit forms. It is found that the new probability mass function can be utilized to model positively skewed data with a leptokurtic shape. Moreover, the new discrete distribution is a suitable tool for modelling equi- and over-dispersed phenomena with an increasing hazard rate function. The distribution parameter is estimated by six different estimation approaches, and the behavior of these methods is explored using Monte Carlo simulation. Finally, two applications to real data are presented to illustrate the flexibility of the new model.
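The paper derives the probability mass function in closed form; a quick way to reproduce it numerically is to mix a Poisson over the XLindley density, assuming the commonly cited XLindley form f(x) = θ²(2 + θ + x)e^(−θx)/(1 + θ)². A Python sketch under that assumption:

```python
import numpy as np
from scipy import integrate, stats

def xlindley_pdf(lam, theta):
    """Assumed XLindley density (Chouia & Zeghdoudi 2021):
    f(x) = theta^2 (2 + theta + x) exp(-theta x) / (1 + theta)^2."""
    return theta**2 * (2 + theta + lam) * np.exp(-theta * lam) / (1 + theta) ** 2

def poisson_xlindley_pmf(k, theta):
    """P(K = k) by numerically mixing Poisson(lambda) over XLindley;
    the paper gives this pmf in closed form, not reproduced here."""
    integrand = lambda lam: stats.poisson.pmf(k, lam) * xlindley_pdf(lam, theta)
    return integrate.quad(integrand, 0, np.inf)[0]

pmf = np.array([poisson_xlindley_pmf(k, theta=1.0) for k in range(25)])
ks = np.arange(25)
mean = (ks * pmf).sum()
var = ((ks - mean) ** 2 * pmf).sum()
print(pmf.sum(), var / mean)  # mass ~1; dispersion index > 1 (over-dispersed)
```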
Topics: Computer Simulation; Likelihood Functions; Models, Statistical; Monte Carlo Method; Poisson Distribution; Reproducibility of Results; Statistical Distributions
PubMed: 35463286
DOI: 10.1155/2022/6503670
Scientific Reports Feb 2023
One clear aspect of behaviour in the COVID-19 pandemic has been people's focus on, and response to, reported or observed infection numbers in their community. We describe a simple model of infectious disease spread in a pandemic situation where people's behaviour is influenced by the current risk of infection and where this behavioural response acts homeostatically to return infection risk to a certain preferred level. This homeostatic response is active until approximate herd immunity is reached: in this domain the model predicts that the reproduction rate R will be centred around a median of 1, that proportional change in infection numbers will follow the standard Cauchy distribution with location and scale parameters 0 and 1, and that high infection numbers will follow a power-law frequency distribution with exponent 2. To test these predictions we used worldwide COVID-19 data from 1st February 2020 to 30th June 2022 to calculate [Formula: see text] confidence interval estimates across countries for these R, location, scale and exponent parameters. The resulting median R estimate was [Formula: see text] (predicted value 1), the proportional change location estimate was [Formula: see text] (predicted value 0), the proportional change scale estimate was [Formula: see text] (predicted value 1), and the frequency distribution exponent estimate was [Formula: see text] (predicted value 2); in each case the observed estimate agreed with model predictions.
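The homeostatic mechanism is easy to simulate. The sketch below is our own toy discrete-time SIR with a contact rate scaled down whenever infections exceed a preferred level I*, not the authors' exact model, and all parameter values are illustrative. With the feedback active, the effective reproduction number should hover around a median of 1, as the abstract predicts.

```python
import numpy as np

rng = np.random.default_rng(3)

N, I_star = 1_000_000, 2_000   # population size and preferred infection level
beta0, gamma = 0.4, 0.2        # baseline transmission and recovery rates
S, I = N - 100, 100
R_t = []

for t in range(500):
    beta = min(beta0, beta0 * I_star / max(I, 1))  # homeostatic contact reduction
    R_t.append(beta * S / (gamma * N))             # effective reproduction number
    new_inf = min(rng.poisson(beta * S * I / N), S)
    new_rec = min(rng.poisson(gamma * I), I + new_inf)
    S -= new_inf
    I += new_inf - new_rec

print(np.median(R_t[50:]))  # sits near the predicted median of 1
```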
Topics: Humans; COVID-19; Pandemics; Reproduction; Statistical Distributions
PubMed: 36765110
DOI: 10.1038/s41598-023-28752-4
Statistics in Medicine Oct 2015
Zero-inflated Poisson (ZIP) and negative binomial (ZINB) models are widely used to model zero-inflated count responses. These models extend the Poisson and negative binomial (NB) to address excessive zeros in the count response. By adding a degenerate distribution centered at 0 and interpreting it as describing a non-risk group in the population, the ZIP (ZINB) models a two-component population mixture. As in applications of Poisson and NB, the key difference between ZIP and ZINB is the allowance for overdispersion by the ZINB in its NB component in modeling the count response for the at-risk group. In practice, overdispersion often does not follow the NB, and applying the ZINB to such data yields invalid inference. If sources of overdispersion are known, other parametric models may be used to model the overdispersion directly, but such models, too, rest on assumed distributions, and this approach may not be applicable if information about the sources of overdispersion is unavailable. In this paper, we propose a distribution-free alternative and compare its performance with these popular parametric models as well as a moment-based approach proposed by Yu et al. [Statistics in Medicine 2013; 32: 2390-2405]. Like generalized estimating equations, the proposed approach requires no elaborate distribution assumptions. Compared with the approach of Yu et al., it is more robust to overdispersed zero-inflated responses. We illustrate our approach with both simulated and real study data.
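For context, the parametric baselines discussed here (ZIP and ZINB) can be fitted directly in statsmodels. The sketch below simulates zero-inflated, overdispersed counts and compares the two fits by AIC; it illustrates the baselines only, not the paper's distribution-free estimator, and the simulation settings are illustrative.

```python
import numpy as np
from statsmodels.discrete.count_model import (ZeroInflatedPoisson,
                                              ZeroInflatedNegativeBinomialP)

rng = np.random.default_rng(11)
n = 2000

# 30% structural zeros (non-risk group); at-risk counts are drawn from a
# gamma-mixed Poisson, i.e. deliberately overdispersed relative to Poisson.
at_risk = rng.random(n) > 0.3
lam = rng.gamma(shape=2.0, scale=1.5, size=n)
y = np.where(at_risk, rng.poisson(lam), 0)

X = np.ones((n, 1))  # intercept-only design for both components
zip_fit = ZeroInflatedPoisson(y, X).fit(maxiter=200, disp=False)
zinb_fit = ZeroInflatedNegativeBinomialP(y, X).fit(maxiter=200, disp=False)
print(zip_fit.aic, zinb_fit.aic)  # the ZINB should win on these data
```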
Topics: Binomial Distribution; Biometry; Computer Simulation; HIV Infections; Humans; Likelihood Functions; Male; Models, Statistical; Poisson Distribution; Randomized Controlled Trials as Topic
PubMed: 26078035
DOI: 10.1002/sim.6560
PloS One 2023 (Review)
The log-normal distribution, often used to model animal abundance and its uncertainty, is central to ecological modeling and conservation but its statistical properties are less intuitive than those of the normal distribution. The right skew of the log-normal distribution can be considerable for highly uncertain estimates and the median is often chosen as a point estimate. However, the use of the median can become complicated when summing across populations since the median of the sum of log-normal distributions is not the sum of the constituent medians. Such estimates become sensitive to the spatial or taxonomic scale over which abundance is being summarized and the naive estimate (the median of the distribution representing the sum across populations) can become grossly inflated. Here, we review the statistical issues involved and some alternative formulations that might be considered by ecologists interested in modeling abundance. Using a recent estimate of global avian abundance as a case study (Callaghan et al. 2021), we investigate the properties of several alternative methods of summing across species' abundance, including the sorted summing used in the original study and the use of shifted log-normal distributions, truncated normal distributions, and rectified normal distributions. The appropriate method of summing across distributions was intimately tied to the use of the mean or median as the measure of central tendency used as the point estimate. Use of the shifted log-normal distribution, however, generated scale-consistent estimates for global abundance across a spectrum of contexts. Our paper highlights how seemingly inconsequential decisions regarding the estimation of abundance yield radically different estimates of global abundance and its uncertainty, with conservation consequences that are underappreciated and require careful consideration.
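The scale-sensitivity of medians under summation is easy to demonstrate. The following Python sketch, with illustrative parameters, shows that the Monte Carlo median of a sum of log-normal abundance estimates exceeds the sum of the per-population medians:

```python
import numpy as np

rng = np.random.default_rng(5)

# 100 populations, each with a log-normal abundance estimate;
# mu and sigma are illustrative (sigma = 1 means high uncertainty).
mu, sigma, n_pop = np.log(1e4), 1.0, 100
draws = rng.lognormal(mu, sigma, size=(100_000, n_pop))

sum_of_medians = n_pop * np.exp(mu)           # each log-normal has median exp(mu)
median_of_sum = np.median(draws.sum(axis=1))  # Monte Carlo median of the total

print(f"{sum_of_medians:.3e} vs {median_of_sum:.3e}")
# The median of the sum is well above the sum of the medians, so point
# estimates built from per-population medians are not scale-consistent.
```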
Topics: Animals; Normal Distribution; Statistical Distributions; Birds
PubMed: 36634090
DOI: 10.1371/journal.pone.0280351
Anaesthesia Jan 2017
Topics: Data Interpretation, Statistical; Humans; Hydrogen-Ion Concentration; Infant, Newborn; Reference Values; Statistical Distributions; Umbilical Arteries; Umbilical Veins
PubMed: 27858980
DOI: 10.1111/anae.13753
Genetic Epidemiology Feb 2022
Count data with excessive zeros are increasingly common in genetic association studies; neuritic plaque counts in brain pathology for Alzheimer's disease are one example. Here, we developed gene-based association tests that model such data as a mixture of two distributions: one for the structural zeros, contributed by the Binomial distribution, and the other for the counts, from the Poisson distribution. We derived score statistics for the rare-variant parameters in the zero-inflated Poisson regression model and then constructed burden (ZIP-b) and kernel (ZIP-k) association tests, and we evaluated omnibus tests that combined both. Through simulated sequence data, we illustrated the potential power gain of our proposed method over a two-stage method that analyzes binary and non-zero continuous data separately, for both burden and kernel tests. The ZIP burden test outperformed the kernel test, as expected, in all scenarios except when variants had a mixture of directions in their genetic effects. We further demonstrated its application to the neuritic plaque data in the ROSMAP cohort. We expect our proposed tests to be useful in practice, being more powerful than, or complementary to, the two-stage method.
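A rough flavour of the burden-style approach can be sketched as follows: collapse rare-variant genotypes into a burden score and test its coefficient in a zero-inflated Poisson regression. Note that this sketch uses a Wald test on a fitted model, whereas the paper derives score statistics; all data and parameters here are simulated and hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(2)
n, n_variants = 1000, 20

# Hypothetical simulated data: rare-variant genotypes, a burden score,
# and a zero-inflated count phenotype (e.g. plaque counts).
G = rng.binomial(2, 0.01, size=(n, n_variants))
burden = G.sum(axis=1).astype(float)
structural_zero = rng.random(n) < 0.3
y = np.where(structural_zero, 0, rng.poisson(np.exp(0.5 + 0.3 * burden)))

# Wald test on the burden coefficient in the Poisson component
# (the paper's score statistics avoid fitting under the alternative).
X = sm.add_constant(burden)
res = ZeroInflatedPoisson(y, X).fit(maxiter=200, disp=False)
print(res.params)
print(res.pvalues)
```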
Topics: Binomial Distribution; Humans; Models, Genetic; Models, Statistical; Phenotype; Poisson Distribution
PubMed: 34779034
DOI: 10.1002/gepi.22438