-
Genes Feb 2023
Outliers in the training or test set used to fit and evaluate a classifier on transcriptomics data can considerably change the estimated performance of the model. As a result, the reported accuracy is either too low or too optimistic, and the estimated model performance cannot be reproduced on independent data. It is then also doubtful whether a classifier qualifies for clinical usage. We estimate classifier performances in simulated gene expression data with artificial outliers and in two real-world datasets. As a new approach, we use two outlier detection methods within a bootstrap procedure to estimate the outlier probability for each sample and evaluate classifiers before and after outlier removal by means of cross-validation. We found that the removal of outliers changed the classification performance notably. For the most part, removing outliers improved the classification results. Because there are various, sometimes unclear reasons for a sample to be an outlier, we strongly advocate always reporting the performance of a transcriptomics classifier with and without outliers in training and test data. This provides a more complete picture of a classifier's performance and prevents reporting models that later turn out not to be applicable for clinical diagnoses.
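The bootstrap idea in this abstract can be illustrated with a minimal sketch. The abstract does not name the two outlier detection methods, so the sketch assumes a single stand-in detector (a median/MAD robust z-score on one value per sample); the function names, the threshold of 3.5, and the number of bootstrap rounds are all hypothetical choices, not the authors' settings. For each sample, the outlier probability is estimated as the fraction of bootstrap resamples containing that sample in which it was flagged.

```python
import random
import statistics


def mad_outliers(values, threshold=3.5):
    """Return positions flagged by a robust z-score (median/MAD) rule.

    Stand-in detector: an assumption for illustration only.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate resample (too many identical values): flag nothing.
        return set()
    return {i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold}


def bootstrap_outlier_probability(values, n_boot=200, seed=0):
    """Estimate, per sample, how often it is flagged when it is drawn."""
    rng = random.Random(seed)
    n = len(values)
    drawn = [0] * n    # resamples that contained sample i
    flagged = [0] * n  # resamples in which sample i was flagged
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        hits = mad_outliers([values[i] for i in idx])
        for i in set(idx):
            drawn[i] += 1
        for i in {idx[j] for j in hits}:
            flagged[i] += 1
    return [f / d if d else 0.0 for f, d in zip(flagged, drawn)]


# Toy data: eight similar values and one clear outlier at position 8.
values = [10.0, 10.2, 9.8, 10.1, 9.9, 10.3, 9.7, 10.4, 25.0]
probs = bootstrap_outlier_probability(values)
```

Samples whose estimated probability exceeds a chosen cut-off could then be removed before cross-validation, and classifier performance reported both with and without them.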
Topics: Transcriptome; Gene Expression Profiling; Probability; Research Design
PubMed: 36833313
DOI: 10.3390/genes14020387 -
Scientific Reports Sep 2023
Subspace outlier detection has emerged as a practical approach for outlier detection. Classical full-space outlier detection methods become ineffective in high-dimensional data due to the "curse of dimensionality". Subspace outlier detection methods have great potential to overcome the problem. However, the challenge becomes determining which subspaces to use for outlier detection among the huge number of possible subspaces. In this paper, we first propose an intuitive definition of outliers in subspaces. We study the desirable properties of subspaces for outlier detection and investigate the metrics for those properties. Then, a novel subspace outlier detection algorithm with a statistical foundation is proposed. Our method selectively leverages a limited set of the most interesting subspaces for outlier detection. Through experimental validation, we demonstrate that identifying outliers within this reduced set of highly interesting subspaces yields significantly higher accuracy compared to analyzing the entire feature space. We show by experiments that the proposed method outperforms competing subspace outlier detection approaches on real-world data sets.
PubMed: 37714878
DOI: 10.1038/s41598-023-42261-4 -
Data Mining and Knowledge Discovery 2023
It has been shown that unsupervised outlier detection methods can be adapted to the one-class classification problem (Janssens and Postma, in: Proceedings of the 18th annual Belgian-Dutch conference on machine learning, pp 56-64, 2009; Janssens et al. in: Proceedings of the 2009 ICMLA international conference on machine learning and applications, IEEE Computer Society, pp 147-153, 2009. 10.1109/ICMLA.2009.16). In this paper, we focus on the comparison of one-class classification algorithms with such adapted unsupervised outlier detection methods, improving on previous comparison studies in several important aspects. We study a number of one-class classification and unsupervised outlier detection methods in a rigorous experimental setup, comparing them on a large number of datasets with different characteristics, using different performance measures. In contrast to previous comparison studies, where the models (algorithms, parameters) are selected by using examples from both classes (outlier and inlier), here we also study and compare different approaches for model selection in the absence of examples from the outlier class, which is more realistic for practical applications since labeled outliers are rarely available. Our results showed that, overall, SVDD and GMM are top performers, regardless of whether the ground truth is used for parameter selection or not. However, in specific application scenarios, other methods exhibited better performance. Combining one-class classifiers into ensembles showed better performance than individual methods in terms of accuracy, as long as the ensemble members are properly selected.
SUPPLEMENTARY INFORMATION
The online version contains supplementary material available at 10.1007/s10618-023-00931-x.
PubMed: 37424877
DOI: 10.1007/s10618-023-00931-x -
NeuroImage Dec 2023
Diffusion-weighted MRI (dMRI) is a medical imaging method that can be used to investigate the brain microstructure and structural connections between different brain regions. The method, however, requires relatively complex data processing frameworks and analysis pipelines. Many of these approaches are vulnerable to signal dropout artefacts that can originate from subjects moving their head during the scan. To combat these artefacts and eliminate such outliers, researchers have proposed two approaches: to replace outliers or to downweight outliers during modelling and analysis. With the rising interest in dMRI for clinical research, these types of corrections are increasingly important. Therefore, we set out to investigate the differences between outlier replacement and weighting approaches to help the dMRI community select the best tool for their data processing pipelines. We evaluated dMRI motion correction registration and single tensor model fit pipelines using Gaussian Process and Spherical Harmonic based replacement approaches and outlier downweighting using highly realistic whole-brain simulations. As a proof of concept, we applied these approaches to dMRI infant data sets that contained varying numbers of dropout artefacts. Based on our results, we concluded that the Gaussian Process based outlier replacement provided similar tensor fit results to Gaussian Process based outlier detection and downweighting. Therefore, if only the least-squares estimate of the single tensor model is of interest, our recommendation is to use outlier replacement. However, outlier downweighting can potentially provide a more accurate estimate of the model precision, which could be relevant for applications such as probabilistic tractography.
Topics: Humans; Algorithms; Diffusion Magnetic Resonance Imaging; Brain; Artifacts; Least-Squares Analysis
PubMed: 37820862
DOI: 10.1016/j.neuroimage.2023.120397 -
Molecules (Basel, Switzerland) Jun 2021
In this paper, we report comprehensive experimental and chemoinformatics analyses of the solubility of small organic molecules ("fragments") in dimethyl sulfoxide (DMSO) in the context of their ability to be tested in screening experiments. Here, DMSO solubility of 939 fragments has been measured experimentally using an NMR technique. A Support Vector Classification model was built on the obtained data using the ISIDA fragment descriptors. The analysis revealed 34 outliers: experimental issues were retrospectively identified for 28 of them. The updated model performs well in 5-fold cross-validation (balanced accuracy = 0.78). The datasets are available on the Zenodo platform (DOI:10.5281/zenodo.4767511) and the model is available on the website of the Laboratory of Chemoinformatics.
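For readers unfamiliar with the metric quoted in this abstract, balanced accuracy is the mean of the per-class recalls, which makes it less sensitive to class imbalance than plain accuracy. The following is a minimal sketch of that definition only; the function name and the toy labels are illustrative assumptions and are unrelated to the authors' data or model.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Positions whose true label is class c
        positions = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in positions if y_pred[i] == c)
        recalls.append(correct / len(positions))
    return sum(recalls) / len(recalls)


# Toy labels: 1 = soluble, 0 = insoluble (hypothetical encoding).
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
score = balanced_accuracy(y_true, y_pred)  # (2/3 + 1/1) / 2
```

In a 5-fold cross-validation, this score would typically be computed on each held-out fold and then averaged.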
PubMed: 34203441
DOI: 10.3390/molecules26133950 -
JCO Clinical Cancer Informatics Oct 2022
PURPOSE
Artificial intelligence (AI) models for medical image diagnosis are often trained and validated on curated data. However, in a clinical setting, images that are outliers with respect to the training data, such as those representing rare disease conditions or acquired using a slightly different setup, can lead to wrong decisions. It is not practical to expect clinicians to be trained to discount results for such outlier images. Toward clinical deployment, we have designed a method to train cautious AI that can automatically flag outlier cases.
MATERIALS AND METHODS
Our method, ClassClust, forms tight clusters of training images using supervised contrastive learning, which helps it identify outliers during testing. We compared ClassClust's ability to detect outliers with three competing methods on four publicly available data sets covering pathology, dermatoscopy, and radiology. We held out certain diseases, artifacts, and types of images from training data and examined the ability of various models to detect these as outliers during testing. We also compared the decision accuracy of the models on held-out nonoutlier images. We visualized the regions of the images that the models used for their decisions.
RESULTS
Area under receiver operating characteristic curve for outlier detection was consistently higher using ClassClust compared with the previous methods. Average accuracy on held-out nonoutlier images was also higher, and the visualizations of image regions were more informative using ClassClust.
CONCLUSION
The ability to flag outlier test cases need not be at odds with the ability to accurately classify nonoutliers in AI models. Although the latter capability has received research and regulatory attention, AI models for clinical deployment should possess the former as well.
Topics: Artificial Intelligence; Data Collection; Humans; ROC Curve; Trust
PubMed: 36228179
DOI: 10.1200/CCI.22.00067 -
Molecular Oncology Jun 2024
Multiple strategies are continuously being explored to expand the drug target repertoire in solid tumors. We devised a novel computational workflow for transcriptome-wide gene expression outlier analysis that allows the systematic identification of both overexpression and underexpression events in cancer cells. Here, it was applied to expression values obtained through RNA sequencing in 226 colorectal cancer (CRC) cell lines that were also characterized by whole-exome sequencing and microarray-based DNA methylation profiling. We found cell models displaying an abnormally high or low expression level for 3533 and 965 genes, respectively. Gene expression abnormalities that have been previously associated with clinically relevant features of CRC cell lines were confirmed. Moreover, by integrating multi-omics data, we identified both genetic and epigenetic alterations underlying outlier expression values. Importantly, our atlas of CRC gene expression outliers can guide the discovery of novel drug targets and biomarkers. As a proof of concept, we found that CRC cell lines lacking expression of the MTAP gene are sensitive to treatment with a PRMT5-MTA inhibitor (MRTX1719). Finally, other tumor types may also benefit from this approach.
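The abstract does not describe how the workflow decides that an expression value is an outlier. Purely as an illustration of what per-gene over- and underexpression calling could look like, the sketch below assumes a Tukey-fence rule (quartiles plus or minus a multiple of the interquartile range) applied to one gene across cell lines. The rule, the function name, the multiplier `k`, and the toy values are assumptions, not the authors' method.

```python
import statistics


def expression_outliers(values, k=1.5):
    """Return (high, low): positions above Q3 + k*IQR and below Q1 - k*IQR.

    Tukey-fence rule, assumed here for illustration only.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    upper = q3 + k * iqr
    lower = q1 - k * iqr
    high = [i for i, v in enumerate(values) if v > upper]
    low = [i for i, v in enumerate(values) if v < lower]
    return high, low


# Toy expression values for one gene across nine hypothetical cell lines.
gene = [5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 5.1, 12.0, 0.5]
high, low = expression_outliers(gene)
```

Running such a rule gene by gene would yield, for each cell line, a list of candidate overexpression and underexpression events that could then be cross-referenced with mutation or methylation data.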
Topics: Humans; Colorectal Neoplasms; Gene Expression Regulation, Neoplastic; Cell Line, Tumor; Transcriptome; Gene Expression Profiling; DNA Methylation
PubMed: 38468448
DOI: 10.1002/1878-0261.13622 -
Journal of Healthcare Informatics... Sep 2018
Emerging wearable and environmental sensor technologies provide health professionals with unprecedented capacity to continuously collect human behavioral data for health monitoring and management. This enables new solutions to mitigate globally emerging health problems such as obesity. With such an outburst of dynamic sensor data, it is critical that appropriate mathematical models and computational methods are developed to translate the collected data into accurate characterization of the underlying health dynamics, enabling more reliable personalized monitoring and prediction of, and intervention in, health status changes. In addition to addressing common analytic challenges in analyzing sensor behavioral data, such as missing values and outliers, we focus on modeling heterogeneous dynamics to better capture health status changes under different conditions, which may lead to more effective state-dependent intervention strategies. We implement switching-state dynamic system models with different complexity levels on real-world daily behavioral data. Evaluation experiments of these models are conducted to demonstrate the importance of modeling the dynamic heterogeneity, as well as of simultaneously conducting missing value imputation and outlier detection, in achieving interpretable health dynamic models with better prediction of health status changes.
PubMed: 35415411
DOI: 10.1007/s41666-018-0017-x -
Jornal Vascular Brasileiro May 2019
Review
During analysis of scientific research data, it is customary to encounter anomalous values or missing data. Anomalous values can be the result of errors of recording, typing, or measurement by instruments, or may be true outliers. This review discusses concepts, examples and methods for identifying and dealing with such contingencies. In the case of missing data, techniques for imputation of the values are discussed in order to avoid exclusion of the research subject, if it is not possible to retrieve information from registration forms or to re-address the participant.
PubMed: 31320882
DOI: 10.1590/1677-5449.190004