Molecular Biology and Evolution, Nov 2023
In phylogenomics, incongruences between gene trees, resulting from both artifactual and biological causes, can decrease the signal-to-noise ratio and complicate species tree inference. The amount of data handled today in classical phylogenomic analyses precludes manual error detection and removal. However, a simple and efficient way to automate the identification of outliers from a collection of gene trees is still missing. Here, we present PhylteR, a method that allows rapid and accurate detection of outlier sequences in phylogenomic datasets, i.e., species in individual gene trees that do not follow the general trend. PhylteR relies on DISTATIS, an extension of multidimensional scaling to 3 dimensions, to compare multiple distance matrices at once. In PhylteR, these distance matrices, extracted from individual gene phylogenies, represent evolutionary distances between species according to each gene. On simulated datasets, we show that PhylteR identifies outliers with more sensitivity and precision than a comparable existing method. We also show that PhylteR is not sensitive to ILS-induced incongruences, which is a desirable feature. On a biological dataset of 14,463 genes for 53 species previously assembled for Carnivora phylogenomics, we show (i) that PhylteR identifies as outliers sequences that can be considered as such by other means, and (ii) that removing these sequences improves the concordance between the gene trees and the species tree. Thanks to its numerous graphical outputs, PhylteR also allows rapid and easy visual characterization of the dataset at hand, aiding in the precise identification of errors. PhylteR is distributed as an R package on CRAN and as containerized versions (Docker and Singularity).
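As a rough illustration of the DISTATIS machinery that PhylteR builds on, the sketch below derives per-matrix weights from RV coefficients between cross-product matrices, so that a gene whose distances disagree with the rest receives a low weight. This is a minimal sketch with my own function names and toy matrices, not PhylteR's actual implementation:

```python
import numpy as np

def double_center(d):
    """Convert a squared distance matrix into a cross-product matrix."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return -0.5 * j @ (d ** 2) @ j

def rv_coeff(a, b):
    """RV coefficient: a matrix correlation between two cross-product matrices."""
    return np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b))

def distatis_weights(dists):
    """Weight each gene's matrix by its agreement with the others
    (leading eigenvector of the pairwise RV matrix)."""
    s = [double_center(d) for d in dists]
    k = len(s)
    c = np.array([[rv_coeff(s[i], s[j]) for j in range(k)] for i in range(k)])
    vals, vecs = np.linalg.eigh(c)             # eigenvalues in ascending order
    w = np.abs(vecs[:, -1])                    # leading eigenvector, sign-fixed
    return w / w.sum()

# toy example: three congruent "gene" distance matrices and one discordant one
rng = np.random.default_rng(0)
base = np.abs(rng.normal(size=(5, 5))); base = base + base.T; np.fill_diagonal(base, 0)
odd = np.abs(rng.normal(size=(5, 5))); odd = odd + odd.T; np.fill_diagonal(odd, 0)
w = distatis_weights([base, base.copy(), base.copy(), odd])
# the discordant matrix receives the lowest weight
```

Note that PhylteR itself flags outliers at the finer level of individual species-in-gene observations; the toy above only shows the matrix-weighting principle.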
Topics: Phylogeny; Biological Evolution
PubMed: 37879113
DOI: 10.1093/molbev/msad234

Environmental Science and Pollution..., Jun 2023
Time series from weather stations are a source of information for flood studies. Analyzing series from previous winters reveals the behavior of the variables, and the results feed analysis and simulation models of quantities such as flow and water level in a study area. One of the most common problems is the acquisition and transmission of data from weather stations, where atypical values and lost data create difficulties in the simulation process, so a numerical strategy is needed to address them. The data source for this study is a real database in which these problems occur across several weather variables. The study compares three time series analysis methods for evaluating a multivariable process offline: a method based on the discrete Fourier transform (DFT), contrasted with averaging and with linear regression without uncertainty parameters, for completing missing data. The proposed methodology entails statistical summaries, outlier detection, and the application of the DFT. The DFT enables completion of the time series thanks to its ability to handle various gap sizes and replace missing values. In sum, the DFT yielded low error percentages for all the time series (1% on average). This percentage reflects what the shape or pattern of the time series would likely have been in the absence of misleading outliers and missing data.
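The DFT-based completion can be sketched as an iterative reconstruction: initialize the gaps, keep only the dominant Fourier components, and re-impute the missing positions until they stabilize. This is one plausible reading of the approach, with hypothetical function names, not the authors' exact algorithm:

```python
import numpy as np

def dft_fill(series, n_freq=3, n_iter=50):
    """Fill missing values (NaN) by iteratively reconstructing the series
    from its strongest Fourier components."""
    x = np.asarray(series, dtype=float)
    miss = np.isnan(x)
    x[miss] = np.nanmean(x)                   # crude initial guess
    for _ in range(n_iter):
        spec = np.fft.rfft(x)
        keep = np.argsort(np.abs(spec))[::-1][:n_freq]
        trimmed = np.zeros_like(spec)
        trimmed[keep] = spec[keep]            # keep only dominant frequencies
        recon = np.fft.irfft(trimmed, n=len(x))
        x[miss] = recon[miss]                 # update only the gaps
    return x

# toy example: a daily-cycle signal with a simulated transmission loss
t = np.arange(96)
clean = 10 + 3 * np.sin(2 * np.pi * t / 24)
noisy = clean.copy()
noisy[40:44] = np.nan
filled = dft_fill(noisy)
```

Because the toy signal is exactly band-limited, the iteration converges to the true values in the gap; real station data would retain some residual error, as the 1% average in the abstract suggests.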
Topics: Time Factors; Colombia; Weather; Linear Models; Computer Simulation
PubMed: 37165270
DOI: 10.1007/s11356-023-27176-x

Stat (International Statistical...), Dec 2022
Identifying influential and outlying data is important, as it guides the effective collection of future data and the proper use of existing information. We develop a unified approach to outlier detection and influence analysis. Our proposed method is grounded in intuitive value-of-information concepts and has a distinct advantage in interpretability and flexibility over existing methods: it decomposes data influence into a leverage effect (expected to be influential) and an outlying effect (surprisingly more influential than expected), and it applies to all decision problems, such as estimation, prediction, and hypothesis testing. We study the theoretical properties of three value-of-information quantities, establish the relationship between the proposed measures and classic measures in the linear regression setting, and provide real data analysis examples of how to apply the new value-of-information approach to linear regression, generalized linear mixed models, and hypothesis testing.
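In the linear regression setting the abstract mentions, the leverage/outlying split maps onto classic diagnostics: the hat-matrix diagonal measures leverage, the studentized residual measures outlyingness, and Cook's distance combines the two. A standard sketch of those classic quantities (not the paper's value-of-information measures themselves):

```python
import numpy as np

def leverage_and_outlyingness(X, y):
    """Split each observation's influence into a leverage part (hat value)
    and an outlying part (internally studentized residual); Cook's
    distance combines them."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])       # add intercept
    hat = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
    h = np.diag(hat)                             # leverage
    resid = y - hat @ y
    p = Xd.shape[1]
    s2 = resid @ resid / (n - p)
    r = resid / np.sqrt(s2 * (1 - h))            # studentized residual
    cooks = r ** 2 * h / ((1 - h) * p)           # Cook's distance
    return h, r, cooks

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=50)
y[7] += 5.0                                      # inject one gross outlier
h, r, cooks = leverage_and_outlyingness(X, y)
# observation 7 has by far the largest studentized residual
```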
PubMed: 37908311
DOI: 10.1002/sta4.442

Breast Cancer Research : BCR, May 2021
BACKGROUND
DNA methylation alterations have similar patterns in normal aging tissue and in cancer. In this study, we investigated breast tissue-specific age-related DNA methylation alterations and used those methylation sites to identify individuals with outlier phenotypes. An outlier phenotype, identified by unsupervised anomaly detection algorithms, is defined by normal-tissue age-dependent DNA methylation levels that vary dramatically from the population mean.
METHODS
We generated whole-genome DNA methylation profiles (GSE160233) on purified epithelial cells and used publicly available Infinium HumanMethylation 450K array datasets (TCGA, GSE88883, GSE69914, GSE101961, and GSE74214) for discovery and validation.
RESULTS
We found that hypermethylation in normal breast tissue is the best predictor of hypermethylation in cancer. Using unsupervised anomaly detection approaches, we found that about 10% of the individuals (39/427) across the 6 DNA methylation datasets were outliers for DNA methylation. We also found significantly more outlier samples among tissues normal-adjacent to cancer (24/139, 17.3%) than among normal samples (15/228, 5.2%). In addition, predicted ages based on DNA methylation differed significantly from chronological ages between outliers and non-outliers, and accelerated outliers (older predicted age) were more frequent in tissue normal-adjacent to cancer (14/17, 82%) than in normal samples from individuals without cancer (3/17, 18%). Furthermore, in matched samples, we found that the epigenome of the outliers in pre-malignant tissue was as severely altered as in cancer.
CONCLUSIONS
A subset of patients with breast cancer has a severely altered epigenome characterized by accelerated aging in their normal-appearing tissue. In the future, these DNA methylation sites should be studied further, for example in cell-free DNA, to determine their potential use as biomarkers for early detection of malignant transformation and for preventive intervention in breast cancer.
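A generic stand-in for the unsupervised anomaly detection step: score each sample by its mean methylation over the selected CpGs and flag robust z-scores beyond a cutoff. The function, data, and cutoff are illustrative, not those used in the study:

```python
import numpy as np

def mad_outliers(beta, thresh=3.5):
    """Flag samples whose mean methylation across the selected CpGs
    deviates strongly from the population, using a robust z-score
    built from the median and MAD."""
    score = beta.mean(axis=1)                 # one summary value per sample
    med = np.median(score)
    mad = np.median(np.abs(score - med))
    z = 0.6745 * (score - med) / mad          # consistent with the normal sd
    return np.abs(z) > thresh, z

rng = np.random.default_rng(2)
beta = rng.beta(2, 5, size=(100, 200))        # 100 samples x 200 CpG beta values
beta[3] = rng.beta(8, 2, size=200)            # one hypermethylated outlier sample
flag, z = mad_outliers(beta)
# sample 3 is flagged as the methylation outlier
```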
Topics: Aging; Breast; Breast Neoplasms; CpG Islands; DNA Methylation; Epigenome; Female; Humans; Phenotype
PubMed: 34022936
DOI: 10.1186/s13058-021-01434-7

Computer Methods and Programs in..., Nov 2019
BACKGROUND AND OBJECTIVE
Electronic Health Record (EHR) data often include observation records that are unlikely to represent the "truth" about a patient at a given clinical encounter. Due to their high throughput, examples of such implausible observations are frequent in records of laboratory test results and vital signs. Outlier detection methods can offer low-cost solutions to flagging implausible EHR observations. This article evaluates the utility of a semi-supervised encoding approach (super-encoding) for constructing non-linear exemplar data distributions from EHR observation data and detecting non-conforming observations as outliers.
METHODS
Two hypotheses are tested using experimental design and non-parametric hypothesis testing procedures: (1) adding demographic features (e.g., age, gender, race/ethnicity) can increase precision in outlier detection; (2) sampling small subsets of the large EHR data can improve outlier detection by reducing the noise-to-signal ratio. The experiments involved applying 492 encoder configurations (with different input features, architectures, sampling ratios, and error margins) to a set of 30 datasets of EHR observations, including laboratory test and vital sign records, extracted from the Research Patient Data Registry (RPDR) at Partners HealthCare.
RESULTS
Results were obtained from 14,760 (30 × 492) encoders. The semi-supervised encoding approach (super-encoding) outperformed conventional autoencoders in outlier detection. Adding the patient's age at the observation (encounter) to the baseline encoder, which included only the observation value as an input feature, slightly improved outlier detection. The nine top-performing encoders are presented. The best outlier detection performance came from a semi-supervised encoder with the observation value as the single feature and a single hidden layer, built on one percent of the data with a one percent reconstruction error margin. At least one encoder configuration had a Youden's J index higher than 0.9999 for each of the 30 observation types.
CONCLUSION
Given the multiplicity of distributions for a single observation in EHR data (i.e., the same observation represented under different names or units), as well as the non-linearity of human observations, encoding offers great promise for outlier detection in large-scale data repositories. The code is available at https://github.com/hestiri/superencoder.
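Encoder configurations in the study were ranked by Youden's J index; for reference, the index is simply sensitivity plus specificity minus one:

```python
import numpy as np

def youden_j(y_true, y_pred):
    """Youden's J = sensitivity + specificity - 1, computed from
    binary outlier labels and predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens + spec - 1

# toy check: a detector that catches 9 of 10 true outliers
# and falsely flags 1 of 90 inliers
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.array([1] * 9 + [0] + [1] + [0] * 89)
j = youden_j(y_true, y_pred)   # sensitivity 0.9, specificity 89/90, J ≈ 0.889
```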
Topics: Algorithms; Data Collection; Data Interpretation, Statistical; Electronic Health Records; Female; Humans; Male; Medical Informatics; Models, Statistical; Neural Networks, Computer; ROC Curve; Registries; Reproducibility of Results
PubMed: 30658851
DOI: 10.1016/j.cmpb.2019.01.002

Journal of Clinical Bioinformatics, Dec 2012
BACKGROUND
Cancer outlier profile analysis (COPA) has proven to be an effective approach to analyzing cancer expression data, leading to the discovery of the TMPRSS2 and ETS family gene fusion events in prostate cancer. However, the original COPA algorithm did not identify down-regulated outliers, and the currently available R package implementing the method is similarly restricted to the analysis of over-expressed outliers. Here we present a modified outlier detection method, mCOPA, which contains refinements to the outlier-detection algorithm, identifies both over- and under-expressed outliers, is freely available, and can be applied to any expression dataset.
RESULTS
We compare our method to other feature-selection approaches, and demonstrate that mCOPA frequently selects more-informative features than do differential expression or variance-based feature selection approaches, and is able to recover observed clinical subtypes more consistently. We demonstrate the application of mCOPA to prostate cancer expression data, and explore the use of outliers in clustering, pathway analysis, and the identification of tumour suppressors. We analyse the under-expressed outliers to identify known and novel prostate cancer tumour suppressor genes, validating these against data in Oncomine and the Cancer Gene Index. We also demonstrate how a combination of outlier analysis and pathway analysis can identify molecular mechanisms disrupted in individual tumours.
CONCLUSIONS
We demonstrate that mCOPA offers advantages, compared to differential expression or variance, in selecting outlier features, and that the features so selected are better able to assign samples to clinically annotated subtypes. Further, we show that the biology explored by outlier analysis differs from that uncovered in differential expression or variance analysis. mCOPA is an important new tool for the exploration of cancer datasets and the discovery of new cancer subtypes, and can be combined with pathway and functional analysis approaches to discover mechanisms underpinning heterogeneity in cancers.
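The COPA-style transformation underlying mCOPA can be sketched as follows: median-center and MAD-scale each gene, then score genes by an upper percentile of the transformed values (over-expressed outliers) or a lower percentile (under-expressed outliers, the direction mCOPA adds). A simplified sketch, not the package code:

```python
import numpy as np

def copa_scores(expr, q=95):
    """COPA-style scores per gene: after median/MAD standardization,
    the upper percentile across samples ranks genes over-expressed in a
    small subset of tumours; the lower percentile captures
    under-expression."""
    med = np.median(expr, axis=1, keepdims=True)
    mad = np.median(np.abs(expr - med), axis=1, keepdims=True)
    z = (expr - med) / (1.4826 * mad)
    up = np.percentile(z, q, axis=1)           # over-expression outlier score
    down = np.percentile(z, 100 - q, axis=1)   # under-expression outlier score
    return up, down

rng = np.random.default_rng(3)
expr = rng.normal(size=(100, 60))              # 100 genes x 60 samples
expr[0, :6] += 8.0                             # gene 0: high in 10% of samples
expr[1, :6] -= 8.0                             # gene 1: low in the same subset
up, down = copa_scores(expr)
# gene 0 tops the over-expression ranking; gene 1 bottoms the under-expression one
```

Because median and MAD ignore a small outlier subset, the scores stay high for genes dysregulated in only a fraction of tumours, which is exactly the profile differential expression tends to miss.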
PubMed: 23216803
DOI: 10.1186/2043-9113-2-22

Human Brain Mapping, Apr 2022
Outliers in neuroimaging represent spurious data or the data of unusual phenotypes that deserve special attention such as clinical follow-up. Outliers have usually been detected in a supervised or semi-supervised manner for labeled neuroimaging cohorts. There has been much less work using unsupervised outlier detection on large unlabeled cohorts like the UK Biobank brain imaging dataset. Given its large sample size, rare imaging phenotypes within this unique cohort are of interest, as they are often clinically relevant and could be informative for discovering new processes. Here, we developed a two-level outlier detection and screening methodology to characterize individual outliers from the multimodal MRI dataset of more than 15,000 UK Biobank subjects. In primary screening, using brain ventricles, white matter, cortical thickness, and functional connectivity-based imaging phenotypes, every subject was parameterized with an outlier score per imaging phenotype. Outlier scores of these imaging phenotypes had good-to-excellent test-retest reliability, with the exception of resting-state functional connectivity (RSFC). Due to the low reliability of RSFC outlier scores, RSFC outliers were excluded from further individual-level outlier screening. In secondary screening, the extreme outliers (1,026 subjects) were examined individually, and those arising from data collection/processing errors were eliminated. A representative subgroup of 120 subjects from the remaining non-artifactual outliers were radiologically reviewed, and radiological findings were identified in 97.5% of them. This study establishes an unsupervised framework for investigating rare individual imaging phenotypes within a large neuroimaging cohort.
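The reliability screen that led to excluding RSFC outlier scores can be illustrated with a simple test-retest check: rank-correlate outlier scores from two scan sessions and drop phenotypes whose correlation is low. A toy sketch with made-up data and a made-up threshold:

```python
import numpy as np

def retest_reliability(score_a, score_b):
    """Spearman rank correlation between outlier scores from two sessions,
    as a simple stand-in for a test-retest reliability estimate."""
    ra = np.argsort(np.argsort(score_a)).astype(float)   # ranks of session 1
    rb = np.argsort(np.argsort(score_b)).astype(float)   # ranks of session 2
    ra -= ra.mean()
    rb -= rb.mean()
    return (ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb))

rng = np.random.default_rng(4)
stable = rng.normal(size=200)                 # a reliable phenotype's scores
reliable = retest_reliability(stable, stable + rng.normal(scale=0.2, size=200))
unreliable = retest_reliability(stable, rng.normal(size=200))
# a phenotype whose retest correlation is near zero would be dropped
```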
Topics: Brain; Humans; Magnetic Resonance Imaging; Neuroimaging; Phenotype; Reproducibility of Results
PubMed: 34957633
DOI: 10.1002/hbm.25756

Statistics and Its Interface, 2016
Reduced-rank methods are very popular in high-dimensional multivariate analysis for conducting simultaneous dimension reduction and model estimation. However, the commonly used reduced-rank methods are not robust, as the underlying reduced-rank structure can be easily distorted by only a few data outliers. Anomalies are bound to exist in big data problems, and in some applications they themselves could be of primary interest. While naive residual analysis is often inadequate for outlier detection due to potential masking and swamping, robust reduced-rank estimation approaches can be computationally demanding. Under Stein's unbiased risk estimation framework, we propose a set of tools, including leverage scores and generalized information scores, to perform model diagnostics and outlier detection in large-scale reduced-rank estimation. The leverage scores give an exact decomposition of the so-called model degrees of freedom down to the observation level, which leads to an exact decomposition of many commonly used information criteria; the resulting quantities are thus named information scores of the observations. The proposed information score approach provides a principled way of combining residuals and leverage scores for anomaly detection. Simulation studies confirm that the proposed diagnostic tools work well. A pattern recognition example with handwritten digit images and a time series analysis example with monthly U.S. macroeconomic data further demonstrate the efficacy of the proposed approaches.
PubMed: 28003860
DOI: 10.4310/SII.2016.v9.n4.a7

PLoS ONE, 2014
BACKGROUND
The removal of outliers to acquire a significant result is a questionable research practice that appears to be commonly used in psychology. In this study, we investigated whether the removal of outliers in psychology papers is related to weaker evidence (against the null hypothesis of no effect), a higher prevalence of reporting errors, and smaller sample sizes in these papers compared to papers in the same journals that did not report the exclusion of outliers from the analyses.
METHODS AND FINDINGS
We retrieved a total of 2667 statistical results of null hypothesis significance tests from 153 articles in major psychology journals and compared results from articles in which outliers were removed (N = 92) with results from articles that reported no exclusion of outliers (N = 61). We preregistered our hypotheses and methods and analyzed the data at the level of articles. Results show no significant difference between the two types of articles in median p value, sample size, or the prevalence of reporting errors overall, of large reporting errors, or of reporting errors that affected statistical significance. However, we did find a discrepancy between the reported degrees of freedom of t tests and the reported sample size in 41% of articles that did not report removal of any data values. This suggests a common failure to report data exclusions (or missingness) in psychology articles.
CONCLUSIONS
We failed to find that the removal of outliers from the analysis in psychological articles was related to weaker evidence (against the null hypothesis of no effect), sample size, or the prevalence of errors. However, our control sample might be contaminated due to nondisclosure of excluded values in articles that did not report exclusion of outliers. Results therefore highlight the importance of more transparent reporting of statistical analyses.
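The degrees-of-freedom discrepancy check is easy to reproduce: for an independent-samples t test, df = n - 2, so a reported df below the reported sample size implies unreported exclusions or missing data. A minimal helper (my naming, not the authors' code):

```python
def df_mismatch(reported_n, t_test_df, groups=2):
    """Compare a reported sample size with the size implied by the
    degrees of freedom of a t test (df = n - groups); a positive
    return value is the number of unaccounted-for observations."""
    implied_n = t_test_df + groups
    return reported_n - implied_n

# e.g. an article reports N = 120 but a t test with t(114):
missing = df_mismatch(120, 114)   # 4 observations unaccounted for
```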
Topics: Humans; Psychology; Research
PubMed: 25072606
DOI: 10.1371/journal.pone.0103360

Heliyon, Jul 2021
Identifying outlier scanpaths is a crucial preliminary step in designing visual software, analyzing digital media, radiology training, and clustering participants in eye-tracking experiments. The task is challenging, however, because scanpath shapes are visually irregular and their geometric complexity makes dimensionality reduction difficult. Conventional approaches have used heat maps to exclude scanpaths that lack a similarity pattern, but the typically used packages, such as ScanMatch and MultiMatch, often generate discordant results when outlier identification is done empirically. This paper introduces a novel outlier evaluation approach that integrates the fractal dimension (FD), which captures the geometrical complexity of patterns, as an additional parameter alongside the heat map. This additional parameter is used to evaluate the degree of influence of a scanpath within a dataset. More specifically, the 2D Cartesian coordinates of a scanpath are fitted to a space-filling 1D fractal curve to characterise its temporal FD. The FDs of the scanpaths are then compared to match their geometric complexity to one another. The findings indicate that the FD can be a beneficial additional parameter when evaluating poorly matching scanpaths as outlier candidates, and that it performs better at identifying unusual scanpaths than scanpath matching, Jaccard, or bounding-box methods alone.
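For intuition, a fractal dimension of a scanpath can also be estimated by box counting: count the grid cells the path visits at several resolutions and fit the log-log slope. Note the paper fits scanpaths to a space-filling curve instead; this generic estimator is only a sketch:

```python
import numpy as np

def box_count_fd(points, sizes=(2, 4, 8, 16, 32)):
    """Estimate a box-counting (fractal) dimension of a 2D point path:
    count occupied grid cells at several resolutions, then fit the
    slope of log(count) against log(resolution)."""
    pts = np.asarray(points, dtype=float)
    pts = (pts - pts.min(0)) / (pts.max(0) - pts.min(0) + 1e-12)  # unit square
    counts = []
    for k in sizes:
        cells = np.unique((pts * k).astype(int).clip(0, k - 1), axis=0)
        counts.append(len(cells))
    return np.polyfit(np.log(sizes), np.log(counts), 1)[0]

# a straight scanpath has dimension near 1; a space-filling scatter near 2
line = np.column_stack([np.linspace(0, 1, 2000)] * 2)
fd_line = box_count_fd(line)
rng = np.random.default_rng(5)
fd_scatter = box_count_fd(rng.random((20000, 2)))
```

Comparing such complexity estimates across participants is the spirit of the FD parameter: scanpaths whose geometric complexity sits far from the rest of the dataset become outlier candidates.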
PubMed: 34368482
DOI: 10.1016/j.heliyon.2021.e07616