-
Methods in Molecular Biology (Clifton,... 2016Mass spectrometry data are often generated from various biological or chemical experiments. However, due to technical reasons, outlying observations are often obtained,...
Mass spectrometry data are often generated from various biological or chemical experiments. However, due to technical reasons, outlying observations are often obtained, some of which may be extreme. Identifying the causes of outlying observations is important in the analysis of replicated MS data because elaborate pre-processing is essential in order to obtain successful analyses with reliable results, and because manual outlier detection is a time-consuming pre-processing step. It is natural to measure the variability of observations using standard deviation or interquartile range calculations, and in this work, these criteria for identifying outliers are presented. However, the low replicability and the heterogeneity of variability are often obstacles to outlier detection. Therefore, quantile regression methods for identifying outliers with low replication are also presented. The procedures are illustrated with artificial and real examples, while a software program is introduced to demonstrate how to apply these procedures in the R environment system.
Topics: Mass Spectrometry; Regression Analysis; Reproducibility of Results
PubMed: 26519171
DOI: 10.1007/978-1-4939-3106-4_5 -
Genes Feb 2023Outliers in the training or test set used to fit and evaluate a classifier on transcriptomics data can considerably change the estimated performance of the model. Hence,...
Outliers in the training or test set used to fit and evaluate a classifier on transcriptomics data can considerably change the estimated performance of the model. Hence, an either too weak or a too optimistic accuracy is then reported and the estimated model performance cannot be reproduced on independent data. It is then also doubtful whether a classifier qualifies for clinical usage. We estimate classifier performances in simulated gene expression data with artificial outliers and in two real-world datasets. As a new approach, we use two outlier detection methods within a bootstrap procedure to estimate the outlier probability for each sample and evaluate classifiers before and after outlier removal by means of cross-validation. We found that the removal of outliers changed the classification performance notably. For the most part, removing outliers improved the classification results. Taking into account the fact that there are various, sometimes unclear reasons for a sample to be an outlier, we strongly advocate to always report the performance of a transcriptomics classifier with and without outliers in training and test data. This provides a more diverse picture of a classifier's performance and prevents reporting models that later turn out to be not applicable for clinical diagnoses.
Topics: Transcriptome; Gene Expression Profiling; Probability; Research Design
PubMed: 36833313
DOI: 10.3390/genes14020387 -
IEEE Transactions on Pattern Analysis... Mar 2017Although widely used, Multilinear PCA (MPCA), one of the leading multilinear analysis methods, still suffers from four major drawbacks. First, it is very sensitive to...
Although widely used, Multilinear PCA (MPCA), one of the leading multilinear analysis methods, still suffers from four major drawbacks. First, it is very sensitive to outliers and noise. Second, it is unable to cope with missing values. Third, it is computationally expensive since MPCA deals with large multi-dimensional datasets. Finally, it is unable to maintain the local geometrical structures due to the averaging process. This paper proposes a novel approach named Compressed Submanifold Multifactor Analysis (CSMA) to solve the four problems mentioned above. Our approach can deal with the problem of missing values and outliers via SVD-L1. The Random Projection method is used to obtain the fast low-rank approximation of a given multifactor dataset. In addition, it is able to preserve the geometry of the original data. Our CSMA method can be used efficiently for multiple purposes, e.g. noise and outlier removal, estimation of missing values, biometric applications. We show that CSMA method can achieve good results and is very efficient in the inpainting problem as compared to [1], [2]. Our method also achieves higher face recognition rates compared to LRTC, SPMA, MPCA and some other methods, i.e. PCA, LDA and LPP, on three challenging face databases, i.e. CMU-MPIE, CMU-PIE and Extended YALE-B.
PubMed: 27101597
DOI: 10.1109/TPAMI.2016.2554107 -
NeuroImage Dec 2023Diffusion-weighted MRI (dMRI) is a medical imaging method that can be used to investigate the brain microstructure and structural connections between different brain...
Diffusion-weighted MRI (dMRI) is a medical imaging method that can be used to investigate the brain microstructure and structural connections between different brain regions. The method, however, requires relatively complex data processing frameworks and analysis pipelines. Many of these approaches are vulnerable to signal dropout artefacts that can originate from subjects moving their head during the scan. To combat these artefacts and eliminate such outliers, researchers have proposed two approaches: to replace outliers or to downweight outliers during modelling and analysis. With the rising interest in dMRI for clinical research, these types of corrections are increasingly important. Therefore, we set out to investigate the differences between outlier replacement and weighting approaches to help the dMRI community to select the best tool for their data processing pipelines. We evaluated dMRI motion correction registration and single tensor model fit pipelines using Gaussian Process and Spherical Harmonic based replacement approaches and outlier downweighting using highly realistic whole-brain simulations. As a proof of concept, we applied these approaches to dMRI infant data sets that contained varying numbers of dropout artefacts. Based on our results, we concluded that the Gaussian Process based outlier replacement provided similar tensor fit results to Gaussian Process based outlier detection and downweighting. Therefore, if only the least-squares estimate of the single tensor model is of interest, our recommendation is to use outlier replacement. However, outlier downweighting can potentially provide a more accurate estimate of the model precision which could be relevant for applications such as probabilistic tractoraphy.
Topics: Humans; Algorithms; Diffusion Magnetic Resonance Imaging; Brain; Artifacts; Least-Squares Analysis
PubMed: 37820862
DOI: 10.1016/j.neuroimage.2023.120397 -
The Science of the Total Environment Sep 2023We demonstrate the benefits of using Riemannian geometry in the analysis of multi-site, multi-pollutant atmospheric monitoring data. Our approach uses covariance...
We demonstrate the benefits of using Riemannian geometry in the analysis of multi-site, multi-pollutant atmospheric monitoring data. Our approach uses covariance matrices to encode spatio-temporal variability and correlations of multiple pollutants at different sites and times. A key property of covariance matrices is that they lie on a Riemannian manifold and one can exploit this property to facilitate dimensionality reduction, outlier detection, and spatial interpolation. Specifically, the transformation of data using Reimannian geometry provides a better data surface for interpolation and assessment of outliers compared to traditional data analysis tools that assume Euclidean geometry. We demonstrate the utility of using Riemannian geometry by analyzing a full year of atmospheric monitoring data collected from 34 monitoring stations in Beijing, China.
Topics: Algorithms; Environmental Pollutants; Data Analysis; Beijing; China
PubMed: 37230339
DOI: 10.1016/j.scitotenv.2023.164064 -
Jornal Vascular Brasileiro May 2019During analysis of scientific research data, it is customary to encounter anomalous values or missing data. Anomalous values can be the result of errors of recording,... (Review)
Review
During analysis of scientific research data, it is customary to encounter anomalous values or missing data. Anomalous values can be the result of errors of recording, typing, measurement by instruments, or may be true outliers. This review discusses concepts, examples and methods for identifying and dealing with such contingencies. In the case of missing data, techniques for imputation of the values are discussed in, order to avoid exclusion of the research subject, if it is not possible to retrieve information from registration forms or to re-address the participant.
PubMed: 31320882
DOI: 10.1590/1677-5449.190004 -
Journal of Affective Disorders Feb 2022Symptom manifestations in affective disorders can be subtle. Small imprecisions in measurement can lead to incorrect estimation of change. Previously, expert-derived...
Symptom manifestations in affective disorders can be subtle. Small imprecisions in measurement can lead to incorrect estimation of change. Previously, expert-derived scoring inconsistency flags were developed for MADRS. Currently, we derive empirically based outlier-pattern flags, to further detect imprecisions in ratings. NEWMEDS data repository of almost 25,000 MADRS administrations from 11 registration trials of antidepressants was used to identify outlier response patterns reflecting potentially careless responses. Coverage of these flags was compared to previously published expert derived flags. Both sets of flags were also further tested in Monte Carlo simulated data as a proxy to applying flags under conditions of known inconsistency. The outlier flags derived provide cutting points to identify: (1) under and overuse of values (e.g., Scoring "1″ on 6 or more items), (2) disproportionate use of even or odd response choices (e.g., 8 or more odd values), (3) longest consecutive use of value (e.g., more than 5 items in a row scored with same value), (4) high variability within administration (standard deviation greater than 1.8), (5) outlier responses on multiple items (i.e., multivariate outliers), and (6) outlier scoring (e.g., scoring 4,5 or 6 on item 1). Outlier response flags were raised in 26% of the MADRS administration and in 97% of the Monte Carlo data. Of administrations with no expert flag, 21.7% had an outlier flag and of administrations with at least one expert flag, 27.7% also had an outlier flag. Outlier-pattern flags appear to be a useful adjunct to expert derived flags in the quest to improve measurement in clinical trials.
Topics: Antidepressive Agents; Depression; Humans; Mood Disorders; Psychiatric Status Rating Scales; Reproducibility of Results
PubMed: 34952105
DOI: 10.1016/j.jad.2021.12.076 -
IEEE Transactions on Image Processing :... 2021Outlier handling has attracted considerable attention recently but remains challenging for image deblurring. Existing approaches mainly depend on iterative outlier...
Outlier handling has attracted considerable attention recently but remains challenging for image deblurring. Existing approaches mainly depend on iterative outlier detection steps to explicitly or implicitly reduce the influence of outliers on image deblurring. However, these outlier detection steps usually involve heuristic operations and iterative optimization processes, which are complex and time-consuming. In contrast, we propose to learn a deep convolutional neural network to directly estimate the confidence map, which can identify reliable inliers and outliers from the blurred image and thus facilitates the following deblurring process. We analyze that the proposed algorithm incorporated with the learned confidence map is effective in handling outliers and does not require ad-hoc outlier detection steps which are critical to existing outlier handling methods. Compared to existing approaches, the proposed algorithm is more efficient and can be applied to both non-blind and blind image deblurring. Extensive experimental results demonstrate that the proposed algorithm performs favorably against state-of-the-art methods in terms of accuracy and efficiency.
PubMed: 33417555
DOI: 10.1109/TIP.2020.3048679 -
Proceedings of SPIE--the International... 2020Abdominal multi-organ segmentation of computed tomography (CT) images has been the subject of extensive research interest. It presents a substantial challenge in medical...
Abdominal multi-organ segmentation of computed tomography (CT) images has been the subject of extensive research interest. It presents a substantial challenge in medical image processing, as the shape and distribution of abdominal organs can vary greatly among the population and within an individual over time. While continuous integration of novel datasets into the training set provides potential for better segmentation performance, collection of data at scale is not only costly, but also impractical in some contexts. Moreover, it remains unclear what marginal value additional data have to offer. Herein, we propose a single-pass active learning method through human quality assurance (QA). We built on a pre-trained 3D U-Net model for abdominal multi-organ segmentation and augmented the dataset either with outlier data (e.g., exemplars for which the baseline algorithm failed) or inliers (e.g., exemplars for which the baseline algorithm worked). The new models were trained using the augmented datasets with 5-fold cross-validation (for outlier data) and withheld outlier samples (for inlier data). Manual labeling of outliers increased Dice scores with outliers by 0.130, compared to an increase of 0.067 with inliers (p<0.001, two-tailed paired t-test). By adding 5 to 37 inliers or outliers to training, we find that the marginal value of adding outliers is higher than that of adding inliers. In summary, improvement on single-organ performance was obtained without diminishing multi-organ performance or significantly increasing training time. Hence, identification and correction of baseline failure cases present an effective and efficient method of selecting training data to improve algorithm performance.
PubMed: 33907347
DOI: 10.1117/12.2549365