BMC Bioinformatics, May 2022
BACKGROUND
Cluster algorithms are gaining in popularity in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and classification accuracy for common analysis pipelines through simulation. We systematically varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we directly compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis).
RESULTS
We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N = 20 per subgroup), provided cluster separation is large (Δ = 4). Finally, we demonstrated that fuzzy clustering can provide a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ = 3).
CONCLUSIONS
Traditional intuitions about statistical power only partially apply to cluster analysis: increasing the number of participants above a sufficient sample size did not improve power, but effect size was crucial. Notably, for the popular dimensionality reduction and clustering algorithms tested here, power was only satisfactory for relatively large effect sizes (clear separation between subgroups). Fuzzy clustering provided higher power in multivariate normal distributions. Overall, we recommend that researchers (1) only apply cluster analysis when large subgroup separation is expected, (2) aim for sample sizes of N = 20 to N = 30 per expected subgroup, (3) use multi-dimensional scaling to improve cluster separation, and (4) use fuzzy clustering or mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.
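The simulation logic described in this abstract can be sketched in a few lines: generate multivariate normal subgroups with a fixed centroid separation, cluster each replicate, and count the fraction of replicates in which the true structure is recovered. The subgroup sizes, separation Δ, replicate count, and ARI success threshold below are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def cluster_power(n_per_group=20, delta=4.0, n_features=2,
                  n_reps=100, ari_threshold=0.7, seed=0):
    """Estimate 'power' as the fraction of simulated datasets in which
    k-means recovers the true two-subgroup structure (ARI >= threshold)."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_reps):
        # Two multivariate normal subgroups whose centroids differ by
        # delta standard deviations along the first feature.
        a = rng.normal(0.0, 1.0, size=(n_per_group, n_features))
        b = rng.normal(0.0, 1.0, size=(n_per_group, n_features))
        b[:, 0] += delta
        X = np.vstack([a, b])
        truth = np.repeat([0, 1], n_per_group)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        if adjusted_rand_score(truth, labels) >= ari_threshold:
            successes += 1
    return successes / n_reps

power = cluster_power()
```

With Δ = 4 the subgroups barely overlap, so even N = 20 per subgroup yields high power, matching the abstract's conclusion; lowering `delta` in the sketch shows power degrading quickly.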
Topics: Algorithms; Cluster Analysis; Humans; Normal Distribution; Sample Size; Software
PubMed: 35641905
DOI: 10.1186/s12859-022-04675-1
Journal of Mathematical Biology, Nov 2021
In many phylogenetic applications, such as cancer and virus evolution, time trees (evolutionary histories in which speciation events are timed) are inferred. Of particular interest are clock-like trees, where all leaves are sampled at the same time and have equal distance to the root. One popular approach to modelling clock-like trees is coalescent theory, which is used in various tree inference software packages. Methodologically, phylogenetic inference methods require a tree space over which the inference is performed, and the geometry of this space plays an important role in the statistical and computational aspects of tree inference algorithms. It has recently been shown that coalescent tree spaces possess a unique geometry, different from that of classical phylogenetic tree spaces. Here we introduce and study a space of discrete coalescent trees, which assume that time is discrete; this is natural in many computational applications. This tree space is a generalisation of the previously studied ranked nearest neighbour interchange (RNNI) space, and is built upon tree-rearrangement operations. We generalise existing results about ranked trees, including an algorithm for computing distances in polynomial time, and in particular provide new results for both the space of discrete coalescent trees and the space of ranked trees. We establish several geometrical properties of these spaces and show how these properties impact various algorithms used in phylogenetic analyses. Our tree space is a discretisation of a previously introduced time tree space, called t-space, and hence our results can be used to approximate solutions to various open problems in t-space.
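As a toy illustration of the discrete-time setting (not the paper's construction): a discrete coalescent tree can be represented as a sequence of pairwise merge events at strictly increasing integer times, giving a ranked tree with integer-valued internal node times. The waiting-time distribution below is an arbitrary assumption for illustration.

```python
import random

def simulate_discrete_coalescent(n_leaves, seed=0):
    """Toy sketch: build a discrete coalescent tree over n_leaves leaves.
    Lineages merge pairwise at strictly increasing integer times, so the
    result is a ranked tree whose internal nodes carry discrete times."""
    rng = random.Random(seed)
    lineages = list(range(n_leaves))   # leaf labels 0..n-1
    next_id = n_leaves                 # internal node labels continue upward
    time = 0
    events = []                        # (time, child_a, child_b, parent)
    while len(lineages) > 1:
        time += rng.randint(1, 3)      # discrete waiting time >= 1
        a, b = rng.sample(lineages, 2)
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_id)
        events.append((time, a, b, next_id))
        next_id += 1
    return events

events = simulate_discrete_coalescent(5)
```

A tree with n leaves always produces n - 1 merge events, and the strictly increasing times encode the ranking that the RNNI-style rearrangement operations act on.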
Topics: Algorithms; Cluster Analysis; Phylogeny
PubMed: 34739608
DOI: 10.1007/s00285-021-01685-0
Statistical analysis and handling of missing data in cluster randomized trials: a systematic review. Trials, Feb 2016
Review
BACKGROUND
Cluster randomized trials (CRTs) randomize participants in groups rather than as individuals, and are key tools for assessing interventions in health research when treatment contamination is likely or individual randomization is not feasible. Two major potential pitfalls exist in CRTs: mishandling missing data and failing to account for clustering in the primary analysis. The aim of this review was to evaluate approaches for handling missing data and for the statistical analysis of the primary outcome in CRTs.
METHODS
We systematically searched for CRTs published between August 2013 and July 2014 using PubMed, Web of Science, and PsycINFO. For each trial, two independent reviewers assessed the extent of the missing data and method(s) used for handling missing data in the primary and sensitivity analyses. We evaluated the primary analysis and determined whether it was at the cluster or individual level.
RESULTS
Of the 86 included CRTs, 80 (93%) reported some missing outcome data. Among those, the median percentage of individuals with a missing outcome was 19% (range 0.5% to 90%). The most common way to handle missing data in the primary analysis was complete case analysis (44 trials, 55%), whereas 18 (22%) used mixed models, six (8%) used single imputation, four (5%) used unweighted generalized estimating equations, and two (2%) used multiple imputation. Fourteen (16%) trials reported a sensitivity analysis for missing data, but most assumed the same missing data mechanism as in the primary analysis. Overall, 67 (78%) trials accounted for clustering in the primary analysis.
CONCLUSIONS
High rates of missing outcome data are present in the majority of CRTs, yet the handling of missing data in practice remains suboptimal. Researchers and applied statisticians should use missing data methods that are valid under plausible assumptions, in order to increase statistical power and reduce the possibility of bias. Sensitivity analyses should be performed with weakened assumptions about the missing data mechanism, to explore the robustness of the results reported in the primary analysis.
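One simple way to weaken the missing-at-random assumption, as the conclusions recommend, is a delta-adjustment sensitivity analysis: impute missing outcomes as draws from the observed distribution shifted by a range of offsets δ and see how the estimate moves. The function name, arm sizes, and δ grid below are illustrative assumptions, not a method from the review.

```python
import numpy as np

def delta_sensitivity(observed, n_missing, deltas, seed=0):
    """Toy delta-adjustment sensitivity analysis for one trial arm:
    missing outcomes are imputed as resampled observed values shifted by
    delta, and the arm mean is re-estimated under each shift."""
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, dtype=float)
    base = rng.choice(observed, size=n_missing)  # one shared draw per scenario
    return {d: float(np.mean(np.concatenate([observed, base + d])))
            for d in deltas}

# Hypothetical trial arm: 30 observed outcomes, 10 missing
rng = np.random.default_rng(1)
observed = rng.normal(10.0, 2.0, size=30)
estimates = delta_sensitivity(observed, n_missing=10, deltas=[-2.0, 0.0, 2.0])
```

Because the same imputed draws are reused across scenarios, the estimate is an affine function of δ: each unit of shift moves the arm mean by `n_missing / n_total`, making the sensitivity of the conclusion to the missingness assumption explicit.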
Topics: Cluster Analysis; Data Interpretation, Statistical; Randomized Controlled Trials as Topic; Sample Size
PubMed: 26862034
DOI: 10.1186/s13063-016-1201-z
Nutrients, May 2022
Review
Evidence-based knowledge of the relationship between foods and nutrients is needed to inform dietary-based guidelines and policy. Proper and tailored statistical methods to analyse food composition databases (FCDBs) could assist in this regard. This review aims to collate the existing literature that used any statistical method to analyse FCDBs, to identify key trends and research gaps. The search strategy yielded 4238 references from electronic databases of which 24 fulfilled our inclusion criteria. Information on the objectives, statistical methods, and results was extracted. Statistical methods were mostly applied to group similar food items (37.5%). Other aims and objectives included determining associations between the nutrient content and known food characteristics (25.0%), determining nutrient co-occurrence (20.8%), evaluating nutrient changes over time (16.7%), and addressing the accuracy and completeness of databases (16.7%). Standard statistical tests (33.3%) were the most utilised followed by clustering (29.1%), other methods (16.7%), regression methods (12.5%), and dimension reduction techniques (8.3%). Nutrient data has unique characteristics such as correlated components, natural groupings, and a compositional nature. Statistical methods used for analysis need to account for this data structure. Our summary of the literature provides a reference for researchers looking to expand into this area.
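The abstract notes that nutrient data have a compositional nature that analyses must account for. A standard way to do so (assumed here as an illustration; the review does not prescribe this specific method) is the centred log-ratio (CLR) transform, which removes the unit-sum constraint before clustering or regression.

```python
import numpy as np

def clr(composition, pseudocount=1e-6):
    """Centred log-ratio transform for compositional data: each row
    (a food item's nutrient proportions) is mapped to
    log(x / geometric mean of the row), so transformed rows sum to zero
    and Euclidean-based methods become applicable."""
    x = np.asarray(composition, dtype=float) + pseudocount
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

# Hypothetical macronutrient proportions (carbohydrate, fat, protein) per food
foods = np.array([[0.70, 0.10, 0.20],
                  [0.05, 0.80, 0.15]])
z = clr(foods)
```

After the transform, clustering or dimension reduction on `z` respects the correlated, constrained structure of the original proportions; the small pseudocount guards against zero entries common in food composition databases.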
Topics: Cluster Analysis; Databases, Factual; Food; Food Analysis; Nutrients; Nutrition Policy
PubMed: 35683993
DOI: 10.3390/nu14112193
Briefings in Bioinformatics, Jul 2022
Review
The microbiome is a complex and dynamic community of microorganisms that co-exist interdependently within an ecosystem and interact with their host or environment. Longitudinal studies can capture temporal variation within the microbiome to gain mechanistic insights into microbial systems; however, current statistical methods are limited by the complex and inherent features of the data. We have identified three analytical objectives in longitudinal microbial studies: (1) differential abundance over time and between sample groups, demographic factors, or clinical variables of interest; (2) clustering of microorganisms evolving concomitantly across time; and (3) network modelling to identify temporal relationships between microorganisms. This review explores the strengths and limitations of current methods for fulfilling these objectives, compares different methods in simulation and case studies for objectives (1) and (2), and highlights opportunities for further methodological development. R tutorials are provided to reproduce the analyses conducted in this review.
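Objective (2), clustering microorganisms that evolve concomitantly across time, is often approached via a correlation-based distance between taxa trajectories. The review's tutorials are in R; the sketch below is a Python analogue on hypothetical data (taxon counts, trajectory shapes, and noise level are all assumptions).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
t = np.arange(10, dtype=float)

# Hypothetical abundance trajectories for six taxa over ten time points:
# three rise over time, three decline, each with small noise.
rising = t + rng.normal(0.0, 0.2, size=(3, t.size))
falling = -t + rng.normal(0.0, 0.2, size=(3, t.size))
X = np.vstack([rising, falling])

# Correlation distance (1 - Pearson r) groups taxa whose trajectories
# move together, regardless of absolute abundance.
Z = linkage(X, method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Average-linkage hierarchical clustering on this distance recovers the two trajectory families; in practice, compositional normalisation of the counts would precede this step.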
Topics: Cluster Analysis; Data Analysis; Longitudinal Studies; Microbiota; RNA, Ribosomal, 16S
PubMed: 35830875
DOI: 10.1093/bib/bbac273
Briefings in Bioinformatics, Jan 2022
Missing values are common in high-throughput mass spectrometry data. Two strategies are available to address missing values: (i) eliminate or impute the missing values and apply statistical methods that require complete data and (ii) use statistical methods that specifically account for missing values without imputation (imputation-free methods). This study reviews the effect of sample size and percentage of missing values on statistical inference for multiple methods under these two strategies. With increasing missingness, the ability of imputation and imputation-free methods to identify differentially and non-differentially regulated compounds in a two-group comparison study declined. Random forest and k-nearest neighbor imputation combined with a Wilcoxon test performed well in statistical testing for up to 50% missingness with little bias in estimating the effect size. Quantile regression imputation accompanied with a Wilcoxon test also had good statistical testing outcomes but substantially distorted the difference in means between groups. None of the imputation-free methods performed consistently better for statistical testing than imputation methods.
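The best-performing pipeline reported here, k-nearest neighbour imputation followed by a Wilcoxon rank-sum test, can be sketched as follows. The dataset dimensions, fold change, and 20% missingness are illustrative assumptions, not the study's settings.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Hypothetical intensities: 30 samples x 5 compounds; compound 0 is shifted
# upward in group B (differentially regulated). Then 20% of entries go missing.
n = 30
X = rng.lognormal(mean=2.0, sigma=0.3, size=(n, 5))
group = np.repeat(["A", "B"], n // 2)
X[group == "B", 0] *= 1.8
mask = rng.random(X.shape) < 0.2
X_missing = np.where(mask, np.nan, X)

# Impute with k-nearest neighbours, then test each compound between groups
# with a Wilcoxon rank-sum (Mann-Whitney) test.
X_imp = KNNImputer(n_neighbors=5).fit_transform(X_missing)
pvals = [mannwhitneyu(X_imp[group == "A", j], X_imp[group == "B", j]).pvalue
         for j in range(X_imp.shape[1])]
```

The rank-based test is robust to the skewed intensity scale, which is part of why it pairs well with imputation in this study's comparisons.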
Topics: Bias; Cluster Analysis; Mass Spectrometry; Research Design
PubMed: 34472591
DOI: 10.1093/bib/bbab353
Journal of Comparative Effectiveness..., May 2014
Review
Cluster randomized trials are trials that randomize clusters of people, rather than individuals. They are becoming increasingly common. A number of innovations have been developed recently, particularly in the calculation of the required size of a cluster trial, the handling of missing data, designs that minimize recruitment bias, the ethics of cluster randomized trials, and the stepped wedge design. This article will highlight and illustrate these developments. It will also discuss issues with regard to the reporting of cluster randomized trials.
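The core of cluster-trial sample size calculation is inflating the individually randomized sample size by the design effect DE = 1 + (m - 1)·ICC, where m is the cluster size and ICC the intraclass correlation coefficient. A minimal sketch (the numeric inputs are illustrative assumptions):

```python
import math

def clusters_per_arm(n_individual, cluster_size, icc):
    """Inflate a per-arm sample size from an individually randomized design
    by the design effect DE = 1 + (m - 1) * ICC, then convert the inflated
    total into a whole number of clusters per arm."""
    de = 1.0 + (cluster_size - 1) * icc
    n_inflated = n_individual * de
    return de, math.ceil(n_inflated / cluster_size)

# e.g. 128 participants per arm under individual randomization,
# clusters of 20 participants, ICC = 0.05 (illustrative values)
de, k = clusters_per_arm(128, 20, 0.05)
```

Even a modest ICC of 0.05 nearly doubles the required sample here (DE = 1.95), which is why failing to account for clustering in sample size calculations, a deficiency this literature repeatedly flags, leads to underpowered trials.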
Topics: Algorithms; Bias; Cluster Analysis; Data Interpretation, Statistical; Humans; Patient Selection; Randomized Controlled Trials as Topic; Research Design
PubMed: 24969154
DOI: 10.2217/cer.14.21
International Journal of Environmental..., Jan 2014
Review
Residential clusters of non-communicable diseases are a source of enduring public concern, and at times, controversy. Many clusters reported to public health agencies by concerned citizens are accompanied by expectations that investigations will uncover a cause of disease. While goals, methods and conclusions of cluster studies are debated in the scientific literature and popular press, investigations of reported residential clusters rarely provide definitive answers about disease etiology. Further, it is inherently difficult to study a cluster for diseases with complex etiology and long latency (e.g., most cancers). Regardless, cluster investigations remain an important function of local, state and federal public health agencies. Challenges limiting the ability of cluster investigations to uncover causes for disease include the need to consider long latency, low statistical power of most analyses, uncertain definitions of cluster boundaries and population of interest, and in- and out-migration. A multi-disciplinary Workshop was held to discuss innovative and/or under-explored approaches to investigate cancer clusters. Several potentially fruitful paths forward are described, including modern methods of reconstructing residential history, improved approaches to analyzing spatial data, improved utilization of electronic data sources, advances using biomarkers of carcinogenesis, novel concepts for grouping cases, investigations of infectious etiology of cancer, and "omics" approaches.
Topics: Cluster Analysis; Forecasting; Humans; Neoplasms
PubMed: 24477211
DOI: 10.3390/ijerph110201479
Statistics in Medicine, May 2022
A practical limitation of cluster randomized controlled trials (cRCTs) is that the number of available clusters may be small, resulting in an increased risk of baseline imbalance under simple randomization. Constrained randomization overcomes this issue by restricting the allocation to a subset of randomization schemes where sufficient overall covariate balance across comparison arms is achieved. However, for multi-arm cRCTs, several design and analysis issues pertaining to constrained randomization have not been fully investigated. Motivated by an ongoing multi-arm cRCT, we elaborate the method of constrained randomization and provide a comprehensive evaluation of the statistical properties of model-based and randomization-based tests under both simple and constrained randomization designs in multi-arm cRCTs, with varying combinations of design and analysis-based covariate adjustment strategies. In particular, as randomization-based tests have not been extensively studied in multi-arm cRCTs, we additionally develop most-powerful randomization tests under the linear mixed model framework for our comparisons. Our results indicate that under constrained randomization, both model-based and randomization-based analyses could gain power while preserving nominal type I error rate, given proper analysis-based adjustment for the baseline covariates. Randomization-based analyses, however, are more robust against violations of distributional assumptions. The choice of balance metrics and candidate set sizes and their implications on the testing of the pairwise and global hypotheses are also discussed. Finally, we caution against the design and analysis of multi-arm cRCTs with an extremely small number of clusters, due to insufficient degrees of freedom and the tendency to obtain an overly restricted randomization space.
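The constrained randomization procedure described here (enumerate candidate allocation schemes, score their covariate balance, and randomize within a well-balanced subset) can be sketched as follows. This is a two-arm simplification for brevity; the balance metric, candidate-set fraction, and covariate values are illustrative assumptions, not the trial's design.

```python
import itertools
import random

def constrained_randomization(covariate, top_fraction=0.2, seed=0):
    """Two-arm sketch of constrained randomization: enumerate all equal
    splits of clusters into arms, score each scheme by the absolute
    difference in arm means of a baseline covariate, keep the
    best-balanced fraction, and randomize uniformly within that set."""
    n = len(covariate)
    clusters = list(range(n))
    schemes = []
    for arm1 in itertools.combinations(clusters, n // 2):
        arm2 = tuple(c for c in clusters if c not in arm1)
        m1 = sum(covariate[c] for c in arm1) / len(arm1)
        m2 = sum(covariate[c] for c in arm2) / len(arm2)
        schemes.append((abs(m1 - m2), arm1, arm2))
    schemes.sort(key=lambda s: s[0])           # best balance first
    candidate_set = schemes[: max(1, int(top_fraction * len(schemes)))]
    return random.Random(seed).choice(candidate_set)

# 8 clusters with one baseline covariate (e.g. cluster-level prevalence)
imbalance, arm1, arm2 = constrained_randomization([3, 9, 4, 8, 2, 7, 5, 6])
```

Restricting to the best-balanced 20% of the 70 possible splits guarantees near-zero imbalance here while retaining enough schemes for valid randomization-based inference; as the abstract cautions, with very few clusters the restricted space can become too small for that inference to work.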
Topics: Cluster Analysis; Humans; Random Allocation; Randomized Controlled Trials as Topic; Research Design
PubMed: 35146788
DOI: 10.1002/sim.9333
Contemporary Clinical Trials, Dec 2022
Review
BACKGROUND
In a cluster randomized trial, groups of individuals (e.g., clinics, schools) are randomized to conditions. The design and analysis of cluster randomized trials can require more care than those of individually randomized trials. Past reviews have noted deficiencies in the use of appropriate statistical methods for such trials.
METHODS
We reviewed cluster randomized trials of cancer screening interventions published 1995-2019 to determine whether appropriate statistical methods had been used for sample size calculation and outcome analysis and whether they reported intraclass correlation coefficient (ICC) values. This work expanded a previous review of articles published 1995-2010.
RESULTS
Our search identified 88 articles published 1995-2020 that reported outcomes of cluster randomized trials of breast, cervix, and colorectal cancer screening interventions. There was increased reporting of the trials' sample size calculations over time, with the percentage increasing from 31% in 1995-2004 to 77% in 2014-2019. However, the percentage of calculations failing to account for cluster randomization did not change over time and was 17% of studies in 2014-2019. There was a nonsignificant trend towards increased use of outcome analysis methods that accounted for the cluster randomized design. However, in lower impact journals, use of appropriate analysis methods was only 80% in 2014-2019. Only 33% of studies reported ICC values in 2014-2019.
CONCLUSION
For cluster randomized trials with cancer screening outcomes, there have been improvements in the reporting of sample size calculations but methodological and reporting deficiencies persist. Efforts to disseminate, adopt and report the use of appropriate statistical methodologies are still needed.
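The ICC values whose under-reporting this review highlights are typically estimated from a one-way ANOVA decomposition of the outcome. A minimal sketch, assuming equal cluster sizes for simplicity (the simulated data and its parameters are illustrative):

```python
import numpy as np

def anova_icc(values, cluster_ids):
    """One-way ANOVA estimator of the intraclass correlation coefficient:
    ICC = (MSB - MSW) / (MSB + (m - 1) * MSW), where MSB and MSW are the
    between- and within-cluster mean squares and m is the common cluster
    size (equal cluster sizes assumed in this sketch)."""
    values = np.asarray(values, dtype=float)
    groups = [values[cluster_ids == c] for c in np.unique(cluster_ids)]
    k = len(groups)
    m = len(groups[0])                       # assumes equal cluster sizes
    grand = values.mean()
    msb = m * sum((g.mean() - grand) ** 2 for g in groups) / (k - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)

# Hypothetical data: 40 clusters of 10, cluster-effect variance 0.25,
# within-cluster variance 1, so the true ICC is 0.25 / 1.25 = 0.2.
rng = np.random.default_rng(0)
cluster_ids = np.repeat(np.arange(40), 10)
y = rng.normal(size=400) + np.repeat(rng.normal(0.0, 0.5, 40), 10)
icc = anova_icc(y, cluster_ids)
```

Reporting estimates like this alongside trial results is what lets future investigators plug a realistic ICC into the design-effect sample size calculations the review discusses.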
Topics: Female; Humans; Early Detection of Cancer; Randomized Controlled Trials as Topic; Cluster Analysis; Research Design; Neoplasms
PubMed: 36343881
DOI: 10.1016/j.cct.2022.106974