- BMC Bioinformatics May 2018
BACKGROUND
Learning accurate models from 'omics data poses many challenges because of the data's inherent high dimensionality (e.g. the number of gene expression variables) and comparatively small sample sizes, which lead to ill-posed inverse problems. Furthermore, the presence of outliers, either experimental errors or interesting abnormal clinical cases, may severely hamper a correct classification of patients and the identification of reliable biomarkers for a particular disease. We propose to address this problem through an ensemble classification setting based on distinct feature selection and modeling strategies, including logistic regression with elastic net regularization, Sparse Partial Least Squares - Discriminant Analysis (SPLS-DA) and Sparse Generalized PLS (SGPLS), coupled with an evaluation of the individuals' outlierness based on the Cook's distance. The consensus is achieved with the Rank Product statistic corrected for multiple testing, which gives a final list of observations sorted by their outlierness level.
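The consensus step can be illustrated with a minimal sketch of a rank product across models. This is an illustration under stated assumptions, not the authors' implementation: the function name `rank_product` and the input layout (one list of per-observation outlierness scores per model, such as Cook's distances) are hypothetical, rank 1 is assigned to the most outlying observation, ties are broken by input order, and the multiple-testing correction mentioned in the abstract is not computed.

```python
import math

def rank_product(scores_by_model):
    """Geometric mean of each observation's rank across models.

    scores_by_model: list of lists; scores_by_model[m][i] is a hypothetical
    outlierness score (e.g. a Cook's distance) of observation i under model m.
    Rank 1 is the most outlying observation, so a small rank product means
    the models agree that the observation is an outlier.
    """
    n_models = len(scores_by_model)
    n_obs = len(scores_by_model[0])
    log_sum = [0.0] * n_obs
    for scores in scores_by_model:
        # Sort observation indices from largest to smallest score.
        order = sorted(range(n_obs), key=lambda i: -scores[i])
        for rank, i in enumerate(order, start=1):
            log_sum[i] += math.log(rank)  # sum logs to avoid overflow
    return [math.exp(s / n_models) for s in log_sum]

# Three hypothetical models scoring three observations.
rp = rank_product([[0.1, 0.9, 0.3], [0.2, 0.8, 0.1], [0.05, 0.7, 0.4]])
```

In this toy input, the second observation is ranked first by every model, so it receives the smallest rank product and would head the sorted list.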
RESULTS
We applied this strategy for the classification of Triple-Negative Breast Cancer (TNBC) RNA-Seq and clinical data from the Cancer Genome Atlas (TCGA). The detected 24 outliers were identified as putative mislabeled samples, corresponding to individuals with discrepant clinical labels for the HER2 receptor, but also individuals with abnormal expression values of ER, PR and HER2, contradictory with the corresponding clinical labels, which may invalidate the initial TNBC label. Moreover, the model consensus approach leads to the selection of a set of genes that may be linked to the disease. These results are robust to a resampling approach, either by selecting a subset of patients or a subset of genes, with a significant overlap of the outlier patients identified.
CONCLUSIONS
The proposed ensemble outlier detection approach constitutes a robust procedure to identify abnormal cases and consensus covariates, which may improve biomarker selection for precision medicine applications. The method can also be easily extended to other regression models and datasets.
Topics: Female; Humans; Sample Size; Triple Negative Breast Neoplasms; Whole Genome Sequencing
PubMed: 29728051
DOI: 10.1186/s12859-018-2149-7
- Artificial Intelligence in Medicine Dec 2022
Many genetic syndromes are associated with distinctive facial features. Several computer-assisted methods have been proposed that make use of facial features for syndrome diagnosis. Training supervised classifiers, the most common approach for this purpose, requires large, comprehensive, and difficult to collect databases of syndromic facial images. In this work, we use unsupervised, normalizing flow-based manifold and density estimation models trained entirely on unaffected subjects to detect syndromic 3D faces as statistical outliers. Furthermore, we demonstrate a general, user-friendly, gradient-based interpretability mechanism that enables clinicians and patients to understand model inferences. 3D facial surface scans of 2471 unaffected subjects and 1629 syndromic subjects representing 262 different genetic syndromes were used to train and evaluate the models. The flow-based models outperformed unsupervised comparison methods, with the best model achieving an ROC-AUC of 86.3% on a challenging, age and sex diverse data set. In addition to highlighting the viability of outlier-based syndrome screening tools, our methods generalize and extend previously proposed outlier scores for 3D face-based syndrome detection, resulting in improved performance for unsupervised syndrome detection.
Topics: Humans; Syndrome; Databases, Factual
PubMed: 36462895
DOI: 10.1016/j.artmed.2022.102425
- The Science of the Total Environment May 2022
CO₂ and CH₄ outliers may have a noticeable impact on the trend of both gases. Nine years of measurements since 2010 recorded at a rural site in northern Spain were used to investigate these outliers. Their influence on the trend was presented and two limits were established. No more than 23.5% of outliers should be excluded from the measurement series in order to obtain representative trends, which were 2.349 ± 0.012 ppm year⁻¹ for CO₂ and 0.00879 ± 0.00004 ppm year⁻¹ for CH₄. Two types of outliers were distinguished: those above the trend line and those below it. Outliers were described by skewed distributions, among which the Weibull distribution figures prominently in most cases. A qualitative procedure was presented to exclude the worst fits, and five statistics were considered to select the best fit. Among these, the modified Nash-Sutcliffe efficiency is prominent. Finally, three symmetrical distributions were added to fit the observations when outliers are excluded, with the Gaussian and beta distributions providing the best fits. As a result, certain skewed functions, such as the lognormal distribution, whose use is frequent for air pollutants, could be questioned in certain applications.
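As a rough illustration of the goodness-of-fit statistic named above, the sketch below computes a modified Nash-Sutcliffe efficiency. The abstract does not give the formula, so this assumes the commonly used form with absolute deviations raised to an exponent j (j = 1 by default; j = 2 gives the classical Nash-Sutcliffe efficiency). The function name and example data are hypothetical.

```python
def modified_nash_sutcliffe(observed, predicted, j=1):
    """Modified Nash-Sutcliffe efficiency (assumed form):

        E_j = 1 - sum(|O_i - P_i|**j) / sum(|O_i - mean(O)|**j)

    E_j = 1 is a perfect fit; E_j = 0 means the fit is no better than
    predicting the mean of the observations.
    """
    mean_obs = sum(observed) / len(observed)
    numerator = sum(abs(o - p) ** j for o, p in zip(observed, predicted))
    denominator = sum(abs(o - mean_obs) ** j for o in observed)
    return 1 - numerator / denominator

observed = [1.0, 2.0, 3.0, 4.0]            # hypothetical observed frequencies
perfect = modified_nash_sutcliffe(observed, observed)
mean_only = modified_nash_sutcliffe(observed, [2.5, 2.5, 2.5, 2.5])
```

With j = 1 the statistic is less dominated by a few large deviations than the squared version, which is one reason it is often preferred when comparing fits to skewed data.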
Topics: Air Pollutants; Carbon Dioxide; Humans; Methane; Rural Population; Spain
PubMed: 35041963
DOI: 10.1016/j.scitotenv.2022.153129
- PloS One 2022
We formulate and apply a novel paradigm for characterization of genome data quality, which quantifies the effects of intentional degradation of quality. The rationale is that the higher the initial quality, the more fragile the genome and the greater the effects of degradation. We demonstrate that this phenomenon is ubiquitous, and that quantified measures of degradation can be used for multiple purposes, illustrated by outlier detection. We focus on identifying outliers that may be problematic with respect to data quality, but might also be true anomalies or even attempts to subvert the database.
Topics: Base Sequence; Databases, Factual; Genome
PubMed: 35921272
DOI: 10.1371/journal.pone.0271970
- JAMA Surgery Aug 2013
IMPORTANCE
The circumferential resection margin is the primary determinant of local recurrence and a major factor in survival in rectal cancer. Neither chemotherapy nor chemoradiation compensates for a margin positive for cancer.
OBJECTIVE
To identify treatment-related factors associated with hospital margin-positive resection and to develop a tool that could be used by individual hospitals to assess their outcomes based on their unique mix of patient and tumor characteristics.
DESIGN
Retrospective review of the National Cancer Data Base, 1998-2007.
SETTINGS
Community and academic/research hospitals.
PARTICIPANTS
Individuals with histologically confirmed localized rectal/rectosigmoid adenocarcinoma.
EXPOSURE
All individuals underwent radical resection for rectal cancer with or without neoadjuvant therapy.
MAIN OUTCOMES AND MEASURES
Rate of margin positivity determined and adjusted for patient- and tumor-related factors to calculate expected margin positivity per hospital. An observed to expected ratio was calculated based on patient- and tumor-related factors to identify treatment-associated variation.
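The observed to expected ratio described above can be sketched as follows. This is a minimal illustration, assuming each patient already has a predicted probability of a positive margin from some risk-adjustment model; the abstract does not specify that model's form, and the function name and the numbers below are hypothetical.

```python
def observed_to_expected(outcomes, predicted_probs):
    """Observed/expected ratio for one hospital.

    outcomes: 1 if the patient's margin was positive, else 0.
    predicted_probs: hypothetical per-patient probabilities of a positive
    margin given patient- and tumor-related factors.
    A ratio above 1 means more positive margins than the case mix predicts.
    """
    observed = sum(outcomes)
    expected = sum(predicted_probs)
    return observed / expected

# Hypothetical hospital with five patients.
ratio = observed_to_expected([1, 0, 0, 1, 0], [0.1, 0.05, 0.05, 0.2, 0.1])
```

Deciding whether a hospital is a significant high or low outlier would additionally require an interval or test around this ratio, which is not shown here.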
RESULTS
The overall margin-positive resection rate was 5.2%. Patients with margins positive for cancer were more likely to be older, male, and African American; not have private insurance; and have their cancer diagnosed later in the study period. Associated tumor-related factors include rectal location, higher American Joint Committee on Cancer stage, signet/mucinous histology, and poor/undifferentiated grade. Among hospitals that were significantly low outliers, 47% were comprehensive community hospitals, and 43.9% were academic/research hospitals; of those that were significantly high outliers, 52.3% were comprehensive community hospitals, and 17.8% were academic/research hospitals. High-volume centers made up 80% of significantly low outlier hospitals and 17% of significantly high outlier hospitals. The rates of chemotherapy and radiation were similar, but low outlier hospitals gave more neoadjuvant radiation (26.3% vs 17%).
CONCLUSIONS AND RELEVANCE
After adjustment for patient- and tumor-related factors, we identified both low and high outlier hospitals for margin positivity at resection, as well as potentially modifiable risk factors. The nomogram created in this model allows for the evaluation of observed and expected event rates for individual hospitals, providing a hospital self-assessment tool for identifying targets for improvement.
Topics: Adenocarcinoma; Adolescent; Adult; Aged; Female; Hospitalization; Humans; Logistic Models; Male; Middle Aged; Neoplasm, Residual; Nomograms; Rectal Neoplasms; Retrospective Studies; Risk Assessment; Socioeconomic Factors; Treatment Outcome; Young Adult
PubMed: 23803722
DOI: 10.1001/jamasurg.2013.2136
- Entropy (Basel, Switzerland) Sep 2020
The article presents methods for both clustering and outlier detection in complex data, such as rule-based knowledge bases. What distinguishes this work from others is, first, the application of clustering algorithms to rules in domain knowledge bases and, second, the use of outlier detection algorithms to detect unusual rules in knowledge bases. The aim of the paper is to analyze four algorithms for outlier detection in rule-based knowledge bases: Local Outlier Factor (LOF), Connectivity-based Outlier Factor (COF), K-MEANS, and SMALLCLUSTERS. The subject of outlier mining is very important nowadays. Outliers in rules are unusual rules, which are rare compared with the others and should be explored by the domain expert as soon as possible. In the research, the authors use the outlier detection methods to find a given proportion of outliers in rules (1%, 5%, 10%), while in small groups, the number of outliers covers no more than 5% of the rule cluster. Subsequently, the authors analyze which of seven quality indices, computed for all rules and again after removing selected outliers, show improved quality of rule clusters. In the experimental stage, the authors use six different knowledge bases. The best results (cluster quality improved most often) are achieved by two outlier detection algorithms, LOF and COF.
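To make the first of these algorithms concrete, here is a minimal sketch of the Local Outlier Factor on numeric points. It is an illustration only: the paper applies LOF to rules, which would need a rule similarity measure that the abstract does not describe, so Euclidean distance on tuples is an assumption here. The sketch also assumes all points are distinct and takes exactly k neighbors, ignoring distance ties.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor of each point (simplified: exactly k neighbors)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        nearest = order[:k]
        neighbors.append(nearest)
        k_dist.append(dist[i][nearest[-1]])  # distance to the k-th neighbor
    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(i, j) = max(k-distance(j), d(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Four clustered points and one isolated point.
scores = lof_scores([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], k=2)
```

Scores near 1 indicate a point about as dense as its neighbors, while values well above 1 indicate a point in a much sparser region, as with the isolated point in this example.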
PubMed: 33286864
DOI: 10.3390/e22101096
- Philosophical Transactions of the Royal... Jan 2022
Vocal production learning (VPL) is the experience-driven ability to produce novel vocal signals through imitation or modification of existing vocalizations. A parallel strand of research investigates acoustic allometry, namely how information about body size is conveyed by acoustic signals. Recently, we proposed that deviation from acoustic allometry principles as a result of sexual selection may have been an intermediate step towards the evolution of vocal learning abilities in mammals. Adopting a more hypothesis-neutral stance, here we perform phylogenetic regressions and other analyses further testing a potential link between VPL and being an allometric outlier. We find that multiple species belonging to VPL clades deviate from allometric scaling but in the opposite direction to that expected from size exaggeration mechanisms. In other words, our correlational approach finds an association between VPL and being an allometric outlier. However, the direction of this association, contra our original hypothesis, may indicate that VPL did not necessarily emerge via sexual selection for size exaggeration: VPL clades show higher vocalization frequencies than expected. In addition, our approach allows us to identify species with potential for VPL abilities: we hypothesize that those outliers from acoustic allometry lying above the regression line may be VPL species. Our results may help better understand the cross-species diversity, variability and aetiology of VPL, which among other things is a key underpinning of speech in our species. This article is part of the theme issue 'Voice modulation: from origin and mechanism to social impact (Part II)'.
Topics: Acoustics; Animals; Mammals; Phylogeny; Speech; Vocalization, Animal
PubMed: 34775824
DOI: 10.1098/rstb.2020.0394
- Sensors (Basel, Switzerland) Apr 2020
With the advent of unmanned aerial vehicles (UAVs), a major area of interest in UAV research has been vision-aided inertial navigation systems (V-INS). In the front-end of V-INS, image processing extracts information about the surrounding environment and determines features or points of interest. With the extracted vision data and inertial measurement unit (IMU) dead reckoning, the most widely used algorithm for estimating vehicle and feature states in the back-end of V-INS is an extended Kalman filter (EKF). An important assumption of the EKF is Gaussian white noise. In practice, measurement outliers that arise in various realistic conditions are often non-Gaussian. A lack of compensation for unknown noise parameters often seriously degrades the reliability and robustness of these navigation systems. To compensate for uncertainties of the outliers, we require modified versions of the estimator or the incorporation of other techniques into the filter. The main purpose of this paper is to develop accurate and robust V-INS for UAVs, in particular for situations involving such unknown outliers. Feature correspondence in the image processing front-end rejects vision outliers, and then a statistical test in the filtering back-end detects the remaining outliers of the vision data. When outliers occur frequently, a variational approximation for Bayesian inference provides a way to compute the optimal noise precision matrices of the measurement outliers. The overall process of outlier removal and adaptation is referred to here as "outlier-adaptive filtering". Even though almost all V-INS approaches remove outliers by some method, few researchers have treated outlier adaptation in V-INS in much detail. Here, results from flight datasets validate the improved accuracy of V-INS employing the proposed outlier-adaptive filtering framework.
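The back-end test is not named in the abstract. One common choice in Kalman filtering, shown here purely as an assumed example, is chi-square gating of the innovation: a measurement is flagged when the squared Mahalanobis distance of its innovation under the innovation covariance S exceeds a chi-square quantile. The function names are hypothetical and the sketch is restricted to 2D measurements, such as image-plane feature coordinates.

```python
CHI2_95_2DOF = 5.991  # 95% quantile of the chi-square distribution, 2 degrees of freedom

def mahalanobis_sq_2d(innovation, S):
    """Squared Mahalanobis distance nu^T S^{-1} nu for a 2D innovation."""
    (a, b), (c, d) = S
    det = a * d - b * c
    if det <= 0:
        raise ValueError("innovation covariance must be positive definite")
    x, y = innovation
    # S^{-1} = (1/det) * [[d, -b], [-c, a]]
    return (x * (d * x - b * y) + y * (-c * x + a * y)) / det

def is_measurement_outlier(innovation, S, threshold=CHI2_95_2DOF):
    """Flag the measurement if its innovation falls outside the gate."""
    return mahalanobis_sq_2d(innovation, S) > threshold

identity = [[1.0, 0.0], [0.0, 1.0]]
inside = is_measurement_outlier((1.0, 1.0), identity)   # distance 2.0
outside = is_measurement_outlier((3.0, 0.0), identity)  # distance 9.0
```

A flagged measurement can be dropped before the update step; the adaptation of noise precision matrices described in the abstract is a separate mechanism that this sketch does not cover.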
PubMed: 32260451
DOI: 10.3390/s20072036
- BMC Medical Informatics and Decision... Feb 2024
BACKGROUND
Unsupervised clustering and outlier detection are important in medical research to understand the distributional composition of a collective of patients. A number of clustering methods exist, also for high-dimensional data after dimension reduction. Clustering and outlier detection may, however, become less robust or contradictory if multiple high-dimensional data sets per patient exist. Such a scenario is given when the focus is on 3-D data of multiple organs per patient, and a high-dimensional feature matrix per organ is extracted.
METHODS
We use principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and multiple co-inertia analysis (MCIA) combined with bagplots to study the distribution of multi-organ 3-D data taken by computed tomography scans. After point-set registration of multiple organs from two public data sets, multiple hundred shape features are extracted per organ. While PCA and t-SNE can only be applied to each organ individually, MCIA can project the data of all organs into the same low-dimensional space.
RESULTS
MCIA is the only approach, here, with which data of all organs can be projected into the same low-dimensional space. We studied how frequently (i.e., by how many organs) a patient was classified to belong to the inner or outer 50% of the population, or as an outlier. Outliers could only be detected with MCIA and PCA. MCIA and t-SNE were more robust in judging the distributional location of a patient in contrast to PCA.
CONCLUSIONS
MCIA is more appropriate and robust in judging the distributional location of a patient in the case of multiple high-dimensional data sets per patient. It is still recommendable to apply PCA or t-SNE in parallel to MCIA to study the location of individual organs.
Topics: Humans; Cluster Analysis; Tomography, X-Ray Computed; Principal Component Analysis; Algorithms
PubMed: 38355504
DOI: 10.1186/s12911-024-02457-8
- Proceedings of Machine Learning Research Jul 2021
Continuous-time event sequences represent discrete events occurring in continuous time. Such sequences arise frequently in real-life. Usually we expect the sequences to follow some regular pattern over time. However, sometimes these patterns may be interrupted by unexpected absence or occurrences of events. Identification of these unexpected cases can be very important as they may point to abnormal situations that need human attention. In this work, we study and develop methods for detecting outliers in continuous-time event sequences, including unexpected absence and unexpected occurrences of events. Since the patterns that event sequences tend to follow may change in different contexts, we develop outlier detection methods based on point processes that can take context information into account. Our methods are based on Bayesian decision theory and hypothesis testing with theoretical guarantees. To test the performance of the methods, we conduct experiments on both synthetic data and real-world clinical data and show the effectiveness of the proposed methods.
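The idea of an unexpected absence can be illustrated with a deliberately simplified sketch. It assumes a homogeneous Poisson process with a known rate, which is far simpler than the context-dependent point processes and Bayesian decision rules the abstract describes; the function names and the significance level are hypothetical.

```python
import math

def absence_p_value(rate, elapsed):
    """Probability of seeing no event for `elapsed` time units under a
    homogeneous Poisson process with the given rate (events per time unit)."""
    return math.exp(-rate * elapsed)

def is_unexpected_absence(rate, elapsed, alpha=0.05):
    """Flag the gap if a silence this long is improbable under the model."""
    return absence_p_value(rate, elapsed) < alpha

# An event type expected about once per hour.
short_gap = is_unexpected_absence(rate=1.0, elapsed=1.0)  # p is about 0.37
long_gap = is_unexpected_absence(rate=1.0, elapsed=4.0)   # p is about 0.018
```

In a context-aware version, the constant rate would be replaced by a conditional intensity that depends on the history and covariates, and the gap probability by the exponential of the negative integrated intensity.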
PubMed: 34712956
DOI: No ID Found