BMC Bioinformatics, May 2022
BACKGROUND
Cluster algorithms are gaining in popularity in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and classification accuracy for common analysis pipelines through simulation. We systematically varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we directly compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis).
RESULTS
We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N = 20 per subgroup), provided cluster separation is large (Δ = 4). Finally, we demonstrated that fuzzy clustering can provide a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ = 3).
CONCLUSIONS
Traditional intuitions about statistical power only partially apply to cluster analysis: increasing the number of participants above a sufficient sample size did not improve power, but effect size was crucial. Notably, for the popular dimensionality reduction and clustering algorithms tested here, power was only satisfactory for relatively large effect sizes (clear separation between subgroups). Fuzzy clustering provided higher power in multivariate normal distributions. Overall, we recommend that researchers (1) only apply cluster analysis when large subgroup separation is expected, (2) aim for sample sizes of N = 20 to N = 30 per expected subgroup, (3) use multi-dimensional scaling to improve cluster separation, and (4) use fuzzy clustering or mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.
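The simulation logic described in this abstract can be sketched in a few lines: generate multivariate normal subgroups with a fixed centroid separation, cluster each replicate, and count the fraction of replicates in which the true structure is recovered. The subgroup sizes, separation Δ, replicate count, and ARI success threshold below are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def cluster_power(n_per_group=20, delta=4.0, n_features=2,
                  n_reps=100, ari_threshold=0.7, seed=0):
    """Estimate 'power' as the fraction of simulated datasets in which
    k-means recovers the true two-subgroup structure (ARI >= threshold)."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_reps):
        # Two multivariate normal subgroups whose centroids differ by
        # delta standard deviations along the first feature.
        a = rng.normal(0.0, 1.0, size=(n_per_group, n_features))
        b = rng.normal(0.0, 1.0, size=(n_per_group, n_features))
        b[:, 0] += delta
        X = np.vstack([a, b])
        truth = np.repeat([0, 1], n_per_group)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        if adjusted_rand_score(truth, labels) >= ari_threshold:
            successes += 1
    return successes / n_reps

power = cluster_power()
```

With Δ = 4 the subgroups barely overlap, so even N = 20 per subgroup yields high power, matching the abstract's conclusion; lowering `delta` in the sketch shows power degrading quickly.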
Topics: Algorithms; Cluster Analysis; Humans; Normal Distribution; Sample Size; Software
PubMed: 35641905
DOI: 10.1186/s12859-022-04675-1
Journal of Mathematical Biology, Nov 2021
In many phylogenetic applications, such as cancer and virus evolution, time trees (evolutionary histories in which speciation events are timed) are inferred. Of particular interest are clock-like trees, where all leaves are sampled at the same time and have equal distance to the root. One popular approach to modelling clock-like trees is coalescent theory, which is used in various tree inference software packages. Methodologically, phylogenetic inference methods require a tree space over which the inference is performed, and the geometry of this space plays an important role in the statistical and computational aspects of tree inference algorithms. It has recently been shown that coalescent tree spaces possess a unique geometry, different from that of classical phylogenetic tree spaces. Here we introduce and study a space of discrete coalescent trees, which assume that time is discrete; this is natural in many computational applications. This tree space is a generalisation of the previously studied ranked nearest neighbour interchange (RNNI) space, and is built upon tree-rearrangement operations. We generalise existing results about ranked trees, including an algorithm for computing distances in polynomial time, and in particular provide new results for both the space of discrete coalescent trees and the space of ranked trees. We establish several geometrical properties of these spaces and show how these properties impact various algorithms used in phylogenetic analyses. Our tree space is a discretisation of a previously introduced time tree space, called t-space, and hence our results can be used to approximate solutions to various open problems in t-space.
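As a toy illustration of the discrete-time setting (not the paper's construction): a discrete coalescent tree can be represented as a sequence of pairwise merge events at strictly increasing integer times, giving a ranked tree with integer-valued internal node times. The waiting-time distribution below is an arbitrary assumption for illustration.

```python
import random

def simulate_discrete_coalescent(n_leaves, seed=0):
    """Toy sketch: build a discrete coalescent tree over n_leaves leaves.
    Lineages merge pairwise at strictly increasing integer times, so the
    result is a ranked tree whose internal nodes carry discrete times."""
    rng = random.Random(seed)
    lineages = list(range(n_leaves))   # leaf labels 0..n-1
    next_id = n_leaves                 # internal node labels continue upward
    time = 0
    events = []                        # (time, child_a, child_b, parent)
    while len(lineages) > 1:
        time += rng.randint(1, 3)      # discrete waiting time >= 1
        a, b = rng.sample(lineages, 2)
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_id)
        events.append((time, a, b, next_id))
        next_id += 1
    return events

events = simulate_discrete_coalescent(5)
```

A tree with n leaves always produces n - 1 merge events, and the strictly increasing times encode the ranking that the RNNI-style rearrangement operations act on.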
Topics: Algorithms; Cluster Analysis; Phylogeny
PubMed: 34739608
DOI: 10.1007/s00285-021-01685-0
Statistical analysis and handling of missing data in cluster randomized trials: a systematic review. Trials, Feb 2016
Review
BACKGROUND
Cluster randomized trials (CRTs) randomize participants in groups rather than as individuals, and are key tools for assessing interventions in health research when treatment contamination is likely or individual randomization is not feasible. Two major potential pitfalls exist in CRTs: mishandling missing data and failing to account for clustering in the primary analysis. The aim of this review was to evaluate approaches for handling missing data and for the statistical analysis of the primary outcome in CRTs.
METHODS
We systematically searched for CRTs published between August 2013 and July 2014 using PubMed, Web of Science, and PsycINFO. For each trial, two independent reviewers assessed the extent of the missing data and method(s) used for handling missing data in the primary and sensitivity analyses. We evaluated the primary analysis and determined whether it was at the cluster or individual level.
RESULTS
Of the 86 included CRTs, 80 (93%) reported some missing outcome data. Among those, the median percentage of individuals with a missing outcome was 19% (range 0.5% to 90%). The most common way to handle missing data in the primary analysis was complete case analysis (44 trials, 55%), whereas 18 (22%) used mixed models, six (8%) used single imputation, four (5%) used unweighted generalized estimating equations, and two (2%) used multiple imputation. Fourteen (16%) trials reported a sensitivity analysis for missing data, but most assumed the same missing data mechanism as in the primary analysis. Overall, 67 (78%) trials accounted for clustering in the primary analysis.
CONCLUSIONS
High rates of missing outcome data are present in the majority of CRTs, yet the handling of missing data in practice remains suboptimal. Researchers and applied statisticians should use missing data methods that are valid under plausible assumptions, in order to increase statistical power and reduce the possibility of bias. Sensitivity analyses should be performed with weakened assumptions about the missing data mechanism, to explore the robustness of the results reported in the primary analysis.
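One simple way to weaken the missing-at-random assumption, as the conclusions recommend, is a delta-adjustment sensitivity analysis: impute missing outcomes as draws from the observed distribution shifted by a range of offsets δ and see how the estimate moves. The function name, arm sizes, and δ grid below are illustrative assumptions, not a method from the review.

```python
import numpy as np

def delta_sensitivity(observed, n_missing, deltas, seed=0):
    """Toy delta-adjustment sensitivity analysis for one trial arm:
    missing outcomes are imputed as resampled observed values shifted by
    delta, and the arm mean is re-estimated under each shift."""
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, dtype=float)
    base = rng.choice(observed, size=n_missing)  # one shared draw per scenario
    return {d: float(np.mean(np.concatenate([observed, base + d])))
            for d in deltas}

# Hypothetical trial arm: 30 observed outcomes, 10 missing
rng = np.random.default_rng(1)
observed = rng.normal(10.0, 2.0, size=30)
estimates = delta_sensitivity(observed, n_missing=10, deltas=[-2.0, 0.0, 2.0])
```

Because the same imputed draws are reused across scenarios, the estimate is an affine function of δ: each unit of shift moves the arm mean by `n_missing / n_total`, making the sensitivity of the conclusion to the missingness assumption explicit.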
Topics: Cluster Analysis; Data Interpretation, Statistical; Randomized Controlled Trials as Topic; Sample Size
PubMed: 26862034
DOI: 10.1186/s13063-016-1201-z
Nutrients, May 2022
Review
Evidence-based knowledge of the relationship between foods and nutrients is needed to inform dietary-based guidelines and policy. Proper and tailored statistical methods to analyse food composition databases (FCDBs) could assist in this regard. This review aims to collate the existing literature that used any statistical method to analyse FCDBs, to identify key trends and research gaps. The search strategy yielded 4238 references from electronic databases of which 24 fulfilled our inclusion criteria. Information on the objectives, statistical methods, and results was extracted. Statistical methods were mostly applied to group similar food items (37.5%). Other aims and objectives included determining associations between the nutrient content and known food characteristics (25.0%), determining nutrient co-occurrence (20.8%), evaluating nutrient changes over time (16.7%), and addressing the accuracy and completeness of databases (16.7%). Standard statistical tests (33.3%) were the most utilised followed by clustering (29.1%), other methods (16.7%), regression methods (12.5%), and dimension reduction techniques (8.3%). Nutrient data has unique characteristics such as correlated components, natural groupings, and a compositional nature. Statistical methods used for analysis need to account for this data structure. Our summary of the literature provides a reference for researchers looking to expand into this area.
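The abstract notes that nutrient data have a compositional nature that analyses must account for. A standard way to do so (assumed here as an illustration; the review does not prescribe this specific method) is the centred log-ratio (CLR) transform, which removes the unit-sum constraint before clustering or regression.

```python
import numpy as np

def clr(composition, pseudocount=1e-6):
    """Centred log-ratio transform for compositional data: each row
    (a food item's nutrient proportions) is mapped to
    log(x / geometric mean of the row), so transformed rows sum to zero
    and Euclidean-based methods become applicable."""
    x = np.asarray(composition, dtype=float) + pseudocount
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

# Hypothetical macronutrient proportions (carbohydrate, fat, protein) per food
foods = np.array([[0.70, 0.10, 0.20],
                  [0.05, 0.80, 0.15]])
z = clr(foods)
```

After the transform, clustering or dimension reduction on `z` respects the correlated, constrained structure of the original proportions; the small pseudocount guards against zero entries common in food composition databases.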
Topics: Cluster Analysis; Databases, Factual; Food; Food Analysis; Nutrients; Nutrition Policy
PubMed: 35683993
DOI: 10.3390/nu14112193
Briefings in Bioinformatics, Jul 2022
Review
The microbiome is a complex and dynamic community of microorganisms that co-exist interdependently within an ecosystem and interact with their host or environment. Longitudinal studies can capture temporal variation within the microbiome to gain mechanistic insights into microbial systems; however, current statistical methods are limited by the complex and inherent features of the data. We have identified three analytical objectives in longitudinal microbial studies: (1) differential abundance over time and between sample groups, demographic factors, or clinical variables of interest; (2) clustering of microorganisms evolving concomitantly across time; and (3) network modelling to identify temporal relationships between microorganisms. This review explores the strengths and limitations of current methods for fulfilling these objectives, compares different methods in simulation and case studies for objectives (1) and (2), and highlights opportunities for further methodological development. R tutorials are provided to reproduce the analyses conducted in this review.
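Objective (2), clustering microorganisms that evolve concomitantly across time, is often approached via a correlation-based distance between taxa trajectories. The review's tutorials are in R; the sketch below is a Python analogue on hypothetical data (taxon counts, trajectory shapes, and noise level are all assumptions).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
t = np.arange(10, dtype=float)

# Hypothetical abundance trajectories for six taxa over ten time points:
# three rise over time, three decline, each with small noise.
rising = t + rng.normal(0.0, 0.2, size=(3, t.size))
falling = -t + rng.normal(0.0, 0.2, size=(3, t.size))
X = np.vstack([rising, falling])

# Correlation distance (1 - Pearson r) groups taxa whose trajectories
# move together, regardless of absolute abundance.
Z = linkage(X, method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Average-linkage hierarchical clustering on this distance recovers the two trajectory families; in practice, compositional normalisation of the counts would precede this step.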
Topics: Cluster Analysis; Data Analysis; Longitudinal Studies; Microbiota; RNA, Ribosomal, 16S
PubMed: 35830875
DOI: 10.1093/bib/bbac273
Briefings in Bioinformatics, Jan 2022
Missing values are common in high-throughput mass spectrometry data. Two strategies are available to address missing values: (i) eliminate or impute the missing values and apply statistical methods that require complete data and (ii) use statistical methods that specifically account for missing values without imputation (imputation-free methods). This study reviews the effect of sample size and percentage of missing values on statistical inference for multiple methods under these two strategies. With increasing missingness, the ability of imputation and imputation-free methods to identify differentially and non-differentially regulated compounds in a two-group comparison study declined. Random forest and k-nearest neighbor imputation combined with a Wilcoxon test performed well in statistical testing for up to 50% missingness with little bias in estimating the effect size. Quantile regression imputation accompanied with a Wilcoxon test also had good statistical testing outcomes but substantially distorted the difference in means between groups. None of the imputation-free methods performed consistently better for statistical testing than imputation methods.
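The best-performing pipeline reported here, k-nearest neighbour imputation followed by a Wilcoxon rank-sum test, can be sketched as follows. The dataset dimensions, fold change, and 20% missingness are illustrative assumptions, not the study's settings.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Hypothetical intensities: 30 samples x 5 compounds; compound 0 is shifted
# upward in group B (differentially regulated). Then 20% of entries go missing.
n = 30
X = rng.lognormal(mean=2.0, sigma=0.3, size=(n, 5))
group = np.repeat(["A", "B"], n // 2)
X[group == "B", 0] *= 1.8
mask = rng.random(X.shape) < 0.2
X_missing = np.where(mask, np.nan, X)

# Impute with k-nearest neighbours, then test each compound between groups
# with a Wilcoxon rank-sum (Mann-Whitney) test.
X_imp = KNNImputer(n_neighbors=5).fit_transform(X_missing)
pvals = [mannwhitneyu(X_imp[group == "A", j], X_imp[group == "B", j]).pvalue
         for j in range(X_imp.shape[1])]
```

The rank-based test is robust to the skewed intensity scale, which is part of why it pairs well with imputation in this study's comparisons.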
Topics: Bias; Cluster Analysis; Mass Spectrometry; Research Design
PubMed: 34472591
DOI: 10.1093/bib/bbab353
Journal of Comparative Effectiveness..., May 2014
Review
Cluster randomized trials are trials that randomize clusters of people, rather than individuals. They are becoming increasingly common. A number of innovations have been developed recently, particularly in the calculation of the required size of a cluster trial, the handling of missing data, designs that minimize recruitment bias, the ethics of cluster randomized trials, and the stepped wedge design. This article will highlight and illustrate these developments. It will also discuss issues with regard to the reporting of cluster randomized trials.
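The core of cluster-trial sample size calculation is inflating the individually randomized sample size by the design effect DE = 1 + (m - 1)·ICC, where m is the cluster size and ICC the intraclass correlation coefficient. A minimal sketch (the numeric inputs are illustrative assumptions):

```python
import math

def clusters_per_arm(n_individual, cluster_size, icc):
    """Inflate a per-arm sample size from an individually randomized design
    by the design effect DE = 1 + (m - 1) * ICC, then convert the inflated
    total into a whole number of clusters per arm."""
    de = 1.0 + (cluster_size - 1) * icc
    n_inflated = n_individual * de
    return de, math.ceil(n_inflated / cluster_size)

# e.g. 128 participants per arm under individual randomization,
# clusters of 20 participants, ICC = 0.05 (illustrative values)
de, k = clusters_per_arm(128, 20, 0.05)
```

Even a modest ICC of 0.05 nearly doubles the required sample here (DE = 1.95), which is why failing to account for clustering in sample size calculations, a deficiency this literature repeatedly flags, leads to underpowered trials.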
Topics: Algorithms; Bias; Cluster Analysis; Data Interpretation, Statistical; Humans; Patient Selection; Randomized Controlled Trials as Topic; Research Design
PubMed: 24969154
DOI: 10.2217/cer.14.21
International Journal of Environmental..., Jan 2014
Review
Residential clusters of non-communicable diseases are a source of enduring public concern, and at times, controversy. Many clusters reported to public health agencies by concerned citizens are accompanied by expectations that investigations will uncover a cause of disease. While goals, methods and conclusions of cluster studies are debated in the scientific literature and popular press, investigations of reported residential clusters rarely provide definitive answers about disease etiology. Further, it is inherently difficult to study a cluster for diseases with complex etiology and long latency (e.g., most cancers). Regardless, cluster investigations remain an important function of local, state and federal public health agencies. Challenges limiting the ability of cluster investigations to uncover causes for disease include the need to consider long latency, low statistical power of most analyses, uncertain definitions of cluster boundaries and population of interest, and in- and out-migration. A multi-disciplinary Workshop was held to discuss innovative and/or under-explored approaches to investigate cancer clusters. Several potentially fruitful paths forward are described, including modern methods of reconstructing residential history, improved approaches to analyzing spatial data, improved utilization of electronic data sources, advances using biomarkers of carcinogenesis, novel concepts for grouping cases, investigations of infectious etiology of cancer, and "omics" approaches.
Topics: Cluster Analysis; Forecasting; Humans; Neoplasms
PubMed: 24477211
DOI: 10.3390/ijerph110201479
Statistics in Medicine, May 2022
A practical limitation of cluster randomized controlled trials (cRCTs) is that the number of available clusters may be small, resulting in an increased risk of baseline imbalance under simple randomization. Constrained randomization overcomes this issue by restricting the allocation to a subset of randomization schemes where sufficient overall covariate balance across comparison arms is achieved. However, for multi-arm cRCTs, several design and analysis issues pertaining to constrained randomization have not been fully investigated. Motivated by an ongoing multi-arm cRCT, we elaborate the method of constrained randomization and provide a comprehensive evaluation of the statistical properties of model-based and randomization-based tests under both simple and constrained randomization designs in multi-arm cRCTs, with varying combinations of design and analysis-based covariate adjustment strategies. In particular, as randomization-based tests have not been extensively studied in multi-arm cRCTs, we additionally develop most-powerful randomization tests under the linear mixed model framework for our comparisons. Our results indicate that under constrained randomization, both model-based and randomization-based analyses could gain power while preserving nominal type I error rate, given proper analysis-based adjustment for the baseline covariates. Randomization-based analyses, however, are more robust against violations of distributional assumptions. The choice of balance metrics and candidate set sizes and their implications on the testing of the pairwise and global hypotheses are also discussed. Finally, we caution against the design and analysis of multi-arm cRCTs with an extremely small number of clusters, due to insufficient degrees of freedom and the tendency to obtain an overly restricted randomization space.
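The constrained randomization procedure described here (enumerate candidate allocation schemes, score their covariate balance, and randomize within a well-balanced subset) can be sketched as follows. This is a two-arm simplification for brevity; the balance metric, candidate-set fraction, and covariate values are illustrative assumptions, not the trial's design.

```python
import itertools
import random

def constrained_randomization(covariate, top_fraction=0.2, seed=0):
    """Two-arm sketch of constrained randomization: enumerate all equal
    splits of clusters into arms, score each scheme by the absolute
    difference in arm means of a baseline covariate, keep the
    best-balanced fraction, and randomize uniformly within that set."""
    n = len(covariate)
    clusters = list(range(n))
    schemes = []
    for arm1 in itertools.combinations(clusters, n // 2):
        arm2 = tuple(c for c in clusters if c not in arm1)
        m1 = sum(covariate[c] for c in arm1) / len(arm1)
        m2 = sum(covariate[c] for c in arm2) / len(arm2)
        schemes.append((abs(m1 - m2), arm1, arm2))
    schemes.sort(key=lambda s: s[0])           # best balance first
    candidate_set = schemes[: max(1, int(top_fraction * len(schemes)))]
    return random.Random(seed).choice(candidate_set)

# 8 clusters with one baseline covariate (e.g. cluster-level prevalence)
imbalance, arm1, arm2 = constrained_randomization([3, 9, 4, 8, 2, 7, 5, 6])
```

Restricting to the best-balanced 20% of the 70 possible splits guarantees near-zero imbalance here while retaining enough schemes for valid randomization-based inference; as the abstract cautions, with very few clusters the restricted space can become too small for that inference to work.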
Topics: Cluster Analysis; Humans; Random Allocation; Randomized Controlled Trials as Topic; Research Design
PubMed: 35146788
DOI: 10.1002/sim.9333
Contemporary Clinical Trials, Dec 2022
Review
BACKGROUND
In a cluster randomized trial, groups of individuals (e.g., clinics, schools) are randomized to conditions. The design and analysis of cluster randomized trials can require more care than those of individually randomized trials. Past reviews have noted deficiencies in the use of appropriate statistical methods for such trials.
METHODS
We reviewed cluster randomized trials of cancer screening interventions published 1995-2019 to determine whether appropriate statistical methods had been used for sample size calculation and outcome analysis and whether they reported intraclass correlation coefficient (ICC) values. This work expanded a previous review of articles published 1995-2010.
RESULTS
Our search identified 88 articles published 1995-2020 that reported outcomes of cluster randomized trials of breast, cervix, and colorectal cancer screening interventions. There was increased reporting of the trials' sample size calculations over time, with the percentage increasing from 31% in 1995-2004 to 77% in 2014-2019. However, the percentage of calculations failing to account for cluster randomization did not change over time and was 17% of studies in 2014-2019. There was a nonsignificant trend towards increased use of outcome analysis methods that accounted for the cluster randomized design. However, in lower impact journals, use of appropriate analysis methods was only 80% in 2014-2019. Only 33% of studies reported ICC values in 2014-2019.
CONCLUSION
For cluster randomized trials with cancer screening outcomes, there have been improvements in the reporting of sample size calculations but methodological and reporting deficiencies persist. Efforts to disseminate, adopt and report the use of appropriate statistical methodologies are still needed.
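The ICC values whose under-reporting this review highlights are typically estimated from a one-way ANOVA decomposition of the outcome. A minimal sketch, assuming equal cluster sizes for simplicity (the simulated data and its parameters are illustrative):

```python
import numpy as np

def anova_icc(values, cluster_ids):
    """One-way ANOVA estimator of the intraclass correlation coefficient:
    ICC = (MSB - MSW) / (MSB + (m - 1) * MSW), where MSB and MSW are the
    between- and within-cluster mean squares and m is the common cluster
    size (equal cluster sizes assumed in this sketch)."""
    values = np.asarray(values, dtype=float)
    groups = [values[cluster_ids == c] for c in np.unique(cluster_ids)]
    k = len(groups)
    m = len(groups[0])                       # assumes equal cluster sizes
    grand = values.mean()
    msb = m * sum((g.mean() - grand) ** 2 for g in groups) / (k - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)

# Hypothetical data: 40 clusters of 10, cluster-effect variance 0.25,
# within-cluster variance 1, so the true ICC is 0.25 / 1.25 = 0.2.
rng = np.random.default_rng(0)
cluster_ids = np.repeat(np.arange(40), 10)
y = rng.normal(size=400) + np.repeat(rng.normal(0.0, 0.5, 40), 10)
icc = anova_icc(y, cluster_ids)
```

Reporting estimates like this alongside trial results is what lets future investigators plug a realistic ICC into the design-effect sample size calculations the review discusses.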
Topics: Female; Humans; Early Detection of Cancer; Randomized Controlled Trials as Topic; Cluster Analysis; Research Design; Neoplasms
PubMed: 36343881
DOI: 10.1016/j.cct.2022.106974