Plants (Basel, Switzerland) Apr 2023
We epigenotyped 211 individuals from 17 populations using methylation-sensitive amplification polymorphism (MSAP) and investigated the associations of methylated (mMSAP) and unmethylated (uMSAP) loci with 16 environmental variables. Data on genetic variation based on amplified fragment length polymorphism (AFLP) were obtained from an earlier study. We found a significant positive correlation between genetic and epigenetic variation. Mean mMSAP and uMSAP diversity per locus (unbiased expected heterozygosity, uHe: 0.223 and 0.131, respectively; p < 0.001) was significantly higher than that estimated from AFLP (uHe = 0.104). Genome scans detected 10 mMSAP and 9 uMSAP outliers associated with various environmental variables. Full-model redundancy analysis (RDA) showed a significant linear fit for 11 and 12 environmental variables with the outlier mMSAP and uMSAP ordinations, respectively. When conditioned on geography, partial RDA revealed that five and six environmental variables, respectively, were the most important variables influencing outlier mMSAP and uMSAP variation. We found higher genetic (average 0.298) than epigenetic (mMSAP and uMSAP averages of 0.044 and 0.106, respectively) differentiation and stronger genetic isolation-by-distance (IBD) than epigenetic IBD. Strong epigenetic isolation-by-environment (IBE) was found, particularly for the outlier data, controlling either for geography (mMSAP and uMSAP: 0.128 and 0.132, respectively; p = 0.001) or for genetic structure (mMSAP and uMSAP: 0.105 and 0.136, respectively; p = 0.001). Our results suggest that epigenetic variants can be substrates for natural selection linked to environmental variables and complement genetic changes in the adaptive evolution of populations.
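IBD and IBE comparisons of this kind are commonly made with Mantel tests on pairwise distance matrices. A minimal permutation-based sketch (illustrative only, not the authors' pipeline; all names are hypothetical):

```python
import numpy as np

def mantel_r(dx, dy, n_perm=999, seed=0):
    """Permutation Mantel test: correlation between two square distance
    matrices, with significance assessed by permuting sample labels."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(dx, k=1)        # upper-triangle entries only
    x, y = dx[iu], dy[iu]
    r_obs = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(dx.shape[0])      # relabel one matrix's samples
        hits += np.corrcoef(dx[np.ix_(p, p)][iu], y)[0, 1] >= r_obs
    return r_obs, (hits + 1) / (n_perm + 1)   # one-sided permutation p-value
```

A partial Mantel test (as used when controlling for geography or genetic structure) would additionally regress out a third distance matrix before correlating the residuals.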
PubMed: 37050184
DOI: 10.3390/plants12071558 -
JAMIA Open Dec 2022
OBJECTIVE
To demonstrate the utility of an anthropometric data cleaning method designed for electronic health records (EHR).
MATERIALS AND METHODS
We used all available pediatric and adult height and weight data from an ongoing observational study that includes EHR data from 15 healthcare systems, applied the method to identify outliers and errors, and compared its performance on pediatric data with two other pediatric data cleaning methods: (1) conditional percentile and (2) the PaEdiatric ANthropometric measurement Outlier Flagging pipeline.
RESULTS
687 226 children (<20 years) and 3 267 293 adults contributed 71 246 369 weight and 51 525 487 height measurements. The method flagged 18% of pediatric and 12% of adult measurements for exclusion, mostly as carried-forward measures in pediatric data and duplicates in adult and pediatric data. After removing the flagged measurements, 0.5% and 0.6% of the pediatric heights and weights and 0.3% and 1.4% of the adult heights and weights, respectively, were biologically implausible according to the CDC and other established cut points. Compared with the other pediatric cleaning methods, it flagged the most measurements for exclusion; however, it did not flag some of the more extreme measurements. The prevalence of severe pediatric obesity was 9.0%, 9.2%, and 8.0% after cleaning by the three methods, respectively.
CONCLUSION
The method is useful for cleaning pediatric and adult height and weight data. It is the only method able to clean adult data and to identify carried-forward measurements and duplicates, which are prevalent in EHR. The findings of this study can be used to improve the algorithm.
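The carried-forward and duplicate flags discussed above can be illustrated with a toy routine (a hypothetical sketch, not the cleaning method evaluated in the paper):

```python
def flag_measurements(records):
    """Flag exact duplicates and carried-forward values per subject.
    records: list of (subject_id, date, value) tuples, dates sortable.
    Returns a parallel list of flags: None, 'duplicate', or 'carried-forward'."""
    order = sorted(range(len(records)), key=lambda i: (records[i][0], records[i][1]))
    flags = [None] * len(records)
    seen = set()   # (subject, date, value) triples already observed
    last = {}      # most recent value per subject
    for i in order:
        sid, date, val = records[i]
        if (sid, date, val) in seen:
            flags[i] = 'duplicate'          # same subject/date/value repeated
            continue
        seen.add((sid, date, val))
        if last.get(sid) == val:
            flags[i] = 'carried-forward'    # identical value at a later visit
        last[sid] = val
    return flags
```

Real EHR cleaning additionally applies biologically implausible value (BIV) cut points and unit-error checks, which are omitted here.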
PubMed: 36339053
DOI: 10.1093/jamiaopen/ooac089 -
Knowledge-based Systems Feb 2022
The presence of outliers can severely degrade learned representations and performance of deep learning methods and hence disproportionately affect the training process, leading to incorrect conclusions about the data. For example, anomaly detection using deep generative models is typically only possible when similar anomalies (or outliers) are not present in the training data. Here we focus on variational autoencoders (VAEs). While the VAE is a popular framework for anomaly detection tasks, we observe that the VAE is unable to detect outliers when the training data contains anomalies that have the same distribution as those in test data. In this paper we focus on robustness to outliers in training data in VAE settings using concepts from robust statistics. We propose a variational lower bound that leads to a robust VAE model that has the same computational complexity as the standard VAE and contains a single automatically adjusted tuning parameter to control the degree of robustness. We present mathematical formulations for robust variational autoencoders (RVAEs) for Bernoulli, Gaussian and categorical variables. The RVAE model is based on beta-divergence rather than the standard Kullback-Leibler (KL) divergence. We demonstrate the performance of our proposed beta-divergence-based autoencoder for a variety of image and categorical datasets, showing improved robustness to outliers both qualitatively and quantitatively. We also illustrate the use of our robust VAE for detection of lesions in brain images, formulated as an anomaly detection task. Finally, we suggest a method to tune the hyperparameter of the RVAE, which makes our model completely unsupervised.
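The beta-divergence objective can be sketched for the Bernoulli case using the density-power-divergence form (a common formulation of beta-divergence losses; the paper's exact variational bound may differ):

```python
import math

def beta_bernoulli_loss(x, q, beta):
    """Per-observation robust reconstruction loss based on the
    density-power (beta-) divergence; reduces to the Bernoulli
    negative log-likelihood as beta -> 0. Down-weights observations
    that the model assigns very low likelihood (i.e. outliers)."""
    fx = (q ** x) * ((1.0 - q) ** (1 - x))       # Bernoulli likelihood of x
    if beta == 0.0:
        return -math.log(fx)                      # standard cross-entropy
    integral = q ** (1 + beta) + (1 - q) ** (1 + beta)
    return (integral - 1.0) - (1.0 + 1.0 / beta) * (fx ** beta - 1.0)
```

For an outlier (low-likelihood observation), this loss grows much more slowly than the cross-entropy, which is the source of the robustness; larger beta means stronger down-weighting.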
PubMed: 36714396
DOI: 10.1016/j.knosys.2021.107886 -
Clinical Breast Cancer Jun 2022
Rates and Outcomes of Breast Lesions of Uncertain Malignant Potential (B3) benchmarked against the National Breast Screening Pathology Audit; Improving Performance in a High Volume Screening Unit.
INTRODUCTION
Our breast screening unit was identified as a high outlier for B3 lesions, with a low positive predictive value (PPV) compared to the England average. This prompted a detailed internal audit and review of B3 lesions and their outcomes to identify causes and address any variation in practice.
PATIENTS AND METHODS
The B3 rate was calculated in 4168 breast core biopsies from 2019, using the subsequent excision to determine the PPV. Atypical intraductal epithelial proliferation (AIDEP) cases were subject to microscopic review to reassess the presence of atypia against published criteria. The B3 rate was re-audited in 2021, and the results compared.
RESULTS
Screening cases had a high B3 rate of 12.4% (30% above the national average), and a PPV of 7.7% (9.7% with atypia). AIDEP was identified as a possible cause of this outlier status. On review and by consensus, AIDEP was confirmed in only 66% of cases reported as such, 17% were downgraded, and 16% did not reach consensus, the latter highlighting the difficulty and subjectivity in diagnosis of these lesions. Repeat audit of B3 rates after this extended review revealed a reduction from 12.4% to 9.11%, which is more in line with national standards.
CONCLUSION
Benchmarking against national reporting standards is critical for service improvement. Through a supportive environment, team working, rigorous internal review, and adherence to guidelines, both interobserver variation and outlier status in breast screening pathology can be addressed. This study can serve as a model for other outlier units to identify and tackle underlying causes.
Topics: Benchmarking; Biopsy, Large-Core Needle; Breast; Breast Neoplasms; Female; Humans; Mammography
PubMed: 35260351
DOI: 10.1016/j.clbc.2022.02.004 -
BMC Health Services Research Jan 2023
BACKGROUND
Institutions or clinicians (units) are often compared according to a performance indicator such as in-hospital mortality. Several approaches have been proposed for the detection of outlying units, whose performance deviates from the overall performance.
METHODS
We provide an overview of three approaches commonly used to monitor institutional performances for outlier detection. These are the common-mean model, the 'Normal-Poisson' random effects model and the 'Logistic' random effects model. For the latter we also propose a visualisation technique. The common-mean model assumes that the underlying true performance of all units is equal and that any observed variation between units is due to chance. Even after applying case-mix adjustment, this assumption is often violated due to overdispersion and a post-hoc correction may need to be applied. The random effects models relax this assumption and explicitly allow the true performance to differ between units, thus offering a more flexible approach. We discuss the strengths and weaknesses of each approach and illustrate their application using audit data from England and Wales on Adult Cardiac Surgery (ACS) and Percutaneous Coronary Intervention (PCI).
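The common-mean screen with a post-hoc overdispersion correction can be sketched as follows (a simplified variant; real implementations typically winsorise the z-scores before estimating the overdispersion factor):

```python
import math

def flag_outlier_units(observed, expected, z_crit=2.576):
    """Common-mean outlier screen for unit-level event counts with a
    multiplicative overdispersion correction. observed/expected are
    per-unit event counts; expected comes from case-mix adjustment."""
    # Poisson-approximation z-score per unit: (O - E) / sqrt(E)
    z = [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]
    # Overdispersion factor phi = mean squared z, floored at 1
    # (phi > 1 indicates more between-unit variation than chance alone).
    phi = max(1.0, sum(v * v for v in z) / len(z))
    # Flag units whose corrected z-score exceeds the critical value.
    return [abs(v) / math.sqrt(phi) > z_crit for v in z]
```

Without the phi correction (the "uncorrected" model in the results below), genuinely overdispersed data produce many spurious outliers, which matches the paper's observation.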
RESULTS
In general, the overdispersion-corrected common-mean model and the random effects approaches produced similar p-values for the detection of outliers. For the ACS dataset (41 hospitals) three outliers were identified in total but only one was identified by all methods above. For the PCI dataset (88 hospitals), seven outliers were identified in total but only two were identified by all methods. The common-mean model uncorrected for overdispersion produced several more outliers. The reason for observing similar p-values for all three approaches could be attributed to the fact that the between-hospital variance was relatively small in both datasets, resulting only in a mild violation of the common-mean assumption; in this situation, the overdispersion correction worked well.
CONCLUSION
If the common-mean assumption is likely to hold, all three methods are appropriate for outlier detection and their results should be similar. Random effects methods may be the preferred approach when the common-mean assumption is likely to be violated.
Topics: Humans; Percutaneous Coronary Intervention; Hospitals; Risk Adjustment; Logistic Models; England
PubMed: 36627627
DOI: 10.1186/s12913-022-08995-z -
Statistical Methods in Medical Research May 2022
The extraction of novel information from omics data is a challenging task, in particular since the number of features (e.g. genes) often far exceeds the number of samples. In such a setting, conventional parameter estimation leads to ill-posed optimization problems, and regularization may be required. In addition, outliers can largely impact classification accuracy. Here we introduce ROSIE, an ensemble classification approach that combines three sparse and robust classification methods for outlier detection and feature selection and further performs a bootstrap-based validity check. Outliers of ROSIE are determined by the rank product test using the outlier rankings of all three methods, and important features are selected as those commonly selected by all methods. We apply ROSIE to RNA-Seq data from The Cancer Genome Atlas (TCGA) to classify observations into Triple-Negative Breast Cancer (TNBC) and non-TNBC tissue samples. The pre-processed dataset consists of genes and more than samples. We demonstrate that ROSIE selects important features and outliers in a robust way. Identified outliers are concordant with the distribution of the commonly selected genes by the three methods, and results are in line with other independent studies. Furthermore, we discuss the association of some of the selected genes with the TNBC subtype in other investigations. In summary, ROSIE constitutes a robust and sparse procedure to identify outliers and important genes through binary classification. Our approach is ad hoc applicable to other datasets, fulfilling the overall goal of simultaneously identifying outliers and candidate disease biomarkers to be targeted in therapy research and personalized medicine frameworks.
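The rank-product combination of per-method outlier rankings can be sketched as follows (a hypothetical helper, not the ROSIE implementation):

```python
import math

def rank_product(rankings):
    """Combine outlier rankings from several methods by the rank product:
    a small geometric-mean rank means the sample was consistently ranked
    as extreme by all methods. rankings[m][i] = rank of sample i by method m
    (rank 1 = most outlying)."""
    k = len(rankings)
    n = len(rankings[0])
    return [math.prod(r[i] for r in rankings) ** (1.0 / k) for i in range(n)]
```

Significance of each rank product is then typically assessed by permuting the rankings, which is omitted here.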
Topics: Humans; Triple Negative Breast Neoplasms
PubMed: 35072570
DOI: 10.1177/09622802211072456 -
BMC Medical Informatics and Decision... Oct 2022
BACKGROUND
Outliers and class imbalance in medical data can affect the accuracy of machine learning models. For physicians who want to apply predictive models, how to use the data at hand to build a model and which model to choose are thorny problems. It is therefore necessary to consider outliers, imbalanced data, model selection, and parameter tuning when modeling.
METHODS
This study used a joint modeling strategy consisting of outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation. We collected medical record data for all intracerebral hemorrhage (ICH) patients admitted in 2017-2019 in Sichuan Province. Clinical and radiological variables were used to construct models to predict mortality 90 days after discharge. We used stacking ensemble learning to combine logistic regression (LR), random forest (RF), artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbors (KNN) models. Accuracy, sensitivity, specificity, AUC, precision, and F1 score were used to evaluate model performance. Finally, we compared all 84 combinations of the joint modeling strategy: training sets with and without the cross-validated committees filter (CVCF); five resampling techniques (random under-sampling (RUS), random over-sampling (ROS), adaptive synthetic sampling (ADASYN), borderline synthetic minority oversampling technique (Borderline-SMOTE), and synthetic minority oversampling technique with edited nearest neighbors (SMOTEENN)) plus no resampling; and seven models (LR, RF, ANN, SVM, KNN, Stacking, AdaBoost).
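Random over-sampling (ROS), the simplest of the resampling techniques compared above, can be sketched as follows (illustrative only; libraries such as imbalanced-learn provide production versions of all five techniques):

```python
import random

def random_oversample(X, y, seed=0):
    """Random over-sampling: duplicate minority-class rows, sampled
    with replacement, until every class matches the majority count."""
    rng = random.Random(seed)
    classes = {c: [i for i, t in enumerate(y) if t == c] for c in set(y)}
    n_max = max(len(ix) for ix in classes.values())   # majority-class size
    Xr, yr = list(X), list(y)
    for c, ix in classes.items():
        for _ in range(n_max - len(ix)):
            j = rng.choice(ix)          # pick a minority row at random
            Xr.append(X[j])
            yr.append(c)
    return Xr, yr
```

Synthetic techniques such as SMOTE and ADASYN instead interpolate new minority points between neighbors rather than duplicating rows.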
RESULTS
Among 4207 patients with ICH, 2909 (69.15%) survived 90 days after discharge and 1298 (30.85%) died within 90 days of discharge. Removing outliers with CVCF improved the performance of all models on every metric except sensitivity. For data balancing, the training set without resampling outperformed the resampled training sets in terms of accuracy, specificity, and precision, while ROS gave the best AUC. Across the seven models, RF had the highest average accuracy, specificity, AUC, and precision, and Stacking performed best on F1 score. Among all 84 combinations of the joint modeling strategy, eight combinations performed best in terms of accuracy (0.816). For sensitivity, the best combination was SMOTEENN + Stacking (0.662); for specificity, CVCF + KNN (0.987). Stacking and AdaBoost had the best AUC (0.756) and F1 score (0.602), respectively. For precision, the best combination was CVCF + SVM (0.938).
CONCLUSION
This study proposed a joint modeling strategy comprising outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation, in order to provide a reference for physicians and researchers who want to build their own models. It illustrated the importance of outlier detection and removal for machine learning and showed that ensemble learning can be a good modeling strategy. Due to the low imbalance ratio (IR, the ratio of the majority class to the minority class) in this study, we did not find any improvement from resampling in terms of accuracy, specificity, or precision, although ROS performed best on AUC.
Topics: Humans; Electronic Health Records; Reactive Oxygen Species; Machine Learning; Support Vector Machine; Cerebral Hemorrhage
PubMed: 36284327
DOI: 10.1186/s12911-022-02018-x -
BMC Bioinformatics Jun 2020
BACKGROUND
High-throughput RNA sequencing is a powerful approach to studying gene expression. Due to the complex multi-step protocols of data acquisition, extreme deviation of a sample from samples of the same treatment group may occur through technical variation or true biological differences. The high dimensionality of the data, with few biological replicates, makes it challenging to accurately detect such samples, and this issue is currently not well studied in the literature. Robust statistics is a family of theories and techniques that aim to detect outliers by first fitting the majority of the data and then flagging data points that deviate from it. Robust statistics have been widely used in multivariate data analysis for outlier detection in chemometrics and engineering. Here we apply robust statistics to RNA-seq data analysis.
RESULTS
We report the use of two robust principal component analysis (rPCA) methods, PcaHubert and PcaGrid, to detect outlier samples in multiple simulated and real biological RNA-seq data sets with positive control outlier samples. PcaGrid achieved 100% sensitivity and 100% specificity in all the tests using positive control outliers with varying degrees of divergence. We applied rPCA methods and classical principal component analysis (cPCA) on an RNA-Seq data set profiling gene expression of the external granule layer in the cerebellum of control and conditional SnoN knockout mice. Both rPCA methods detected the same two outlier samples, but cPCA failed to detect any. We performed differentially expressed gene detection before and after outlier removal as well as with and without batch effect modeling. We validated gene expression changes using quantitative reverse transcription PCR and used the result as reference to compare the performance of eight different data analysis strategies. Removing outliers without batch effect modeling performed the best in terms of detecting biologically relevant differentially expressed genes.
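PcaHubert and PcaGrid are implemented in the R package rrcov; a rough Python stand-in for score-distance-based outlier flagging (not the authors' method, which uses projection-pursuit robust PCA) might look like:

```python
import numpy as np

def pca_score_outliers(X, k=2, cut=3.5):
    """Flag samples with extreme PCA scores using median/MAD
    standardisation, a crude approximation of rPCA score distances.
    X: samples x features matrix. Returns indices of flagged samples."""
    Xc = X - np.median(X, axis=0)                 # robust centring
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = Xc @ Vt[:k].T                             # scores on first k components
    med = np.median(T, axis=0)
    mad = np.median(np.abs(T - med), axis=0) * 1.4826 + 1e-12
    z = np.abs(T - med) / mad                     # robust z-score per component
    return np.where(z.max(axis=1) > cut)[0]
```

Note that, unlike true rPCA, the SVD step here is still influenced by the outliers themselves; projection-pursuit methods such as PcaGrid avoid exactly this masking problem.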
CONCLUSIONS
rPCA implemented in the PcaGrid function is an accurate and objective method to detect outlier samples. It is well suited for high-dimensional data with small sample sizes like RNA-seq data. Outlier removal can significantly improve the performance of differential gene detection and downstream functional analysis.
Topics: Animals; Cerebellum; Female; Male; Mice, Knockout; Principal Component Analysis; Proto-Oncogene Proteins; RNA-Seq; Reverse Transcriptase Polymerase Chain Reaction
PubMed: 32600248
DOI: 10.1186/s12859-020-03608-0 -
Sensors (Basel, Switzerland) Oct 2022
Image novelty detection is a recurring task in computer vision that describes the detection of anomalous images based on a training dataset consisting solely of normal reference data. Neural networks in particular have been found to be well suited for the task. Our approach first transforms the training and test images into ensembles of patches, which enables the assessment of mean shifts between normal data and outliers. As mean shifts are only detectable when the outlier ensemble and the inlier distribution are spatially separate from each other, a rich feature space, such as a pre-trained neural network, needs to be chosen to represent the extracted patches. For mean-shift estimation, the Hotelling T2 test is used. The size of the patches turned out to be a crucial hyperparameter that requires additional domain knowledge about the spatial size of the expected anomalies (local vs. global). This also affects model selection and the chosen feature space, as commonly used Convolutional Neural Networks or Vision Image Transformers have very different receptive field sizes. To showcase the state-of-the-art capabilities of our approach, we compare results with classical and deep learning methods on the popular CIFAR-10 dataset, and demonstrate its real-world applicability in a large-scale industrial inspection scenario using the MVTec dataset. Because of its inexpensive design, our method can be implemented with a single additional 2D-convolution and pooling layer, allows particularly fast prediction times, and is very data-efficient.
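The Hotelling T2 statistic used for mean-shift estimation can be sketched against a known reference mean and covariance (a simplified one-sample form; in practice the reference moments are estimated from the normal training patches):

```python
import numpy as np

def hotelling_t2(sample, ref_mean, ref_cov):
    """Hotelling T^2 statistic of a patch-feature ensemble against a
    reference (inlier) distribution: n * (m - mu)^T Sigma^-1 (m - mu),
    where m is the ensemble mean over the n patches."""
    n = sample.shape[0]
    diff = sample.mean(axis=0) - ref_mean
    return float(n * diff @ np.linalg.solve(ref_cov, diff))
```

A large T2 value indicates that the test image's patch ensemble is mean-shifted away from the inlier distribution, i.e. the image is flagged as novel.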
Topics: Image Processing, Computer-Assisted; Neural Networks, Computer
PubMed: 36236774
DOI: 10.3390/s22197674 -
Frontiers in Psychiatry 2021
Deficient decision-making (DM) in attention deficit/hyperactivity disorder (ADHD) is marked by altered reward sensitivity, higher risk taking, and aberrant reinforcement learning. Previous meta-analyses mostly aggregate findings for the ADHD combined presentation (ADHD-C), while the predominantly inattentive presentation (ADHD-I) and the predominantly hyperactive/impulsive presentation (ADHD-H) were not disentangled. The objective of the current meta-analysis was to aggregate DM findings for each presentation separately. A comprehensive literature search of the PubMed (Medline) and Web of Science databases took place using the keywords "ADHD," "attention-deficit/hyperactivity disorder," "decision-making," "risk-taking," "reinforcement learning," and "risky." Random-effects models based on correlational effect sizes were conducted. Heterogeneity analysis and sensitivity/outlier analysis were performed, and publication bias was assessed with funnel plots and the Egger intercept. Of 1,240 candidate articles, seven fulfilled criteria for analysis of ADHD-C (n = 193), seven for ADHD-I (n = 256), and eight for ADHD-H (n = 231). A moderate effect size was found for ADHD-C (r = 0.34; p = 0.0001; 95% CI [0.19, 0.49]). Small effect sizes were found for ADHD-I (r = 0.09; p = 0.0001; 95% CI [0.008, 0.25]) and for ADHD-H (r = 0.1; p = 0.0001; 95% CI [-0.012, 0.32]). Heterogeneity was moderate for ADHD-H. Sensitivity analyses showed robustness of the analysis, and no outliers were detected. No publication bias was evident. This is the first study that uses a meta-analytic approach to investigate the relationship between the different presentations of ADHD and DM separately. These findings provide first evidence of less pronounced impairment in DM for ADHD-I and ADHD-H compared to ADHD-C. While the exact factors remain elusive, the current study can be considered a starting point to reveal the relationship of ADHD presentations and DM in more detail.
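Random-effects aggregation of correlational effect sizes of this kind is commonly done with DerSimonian-Laird pooling on Fisher-z-transformed correlations (a standard approach; the authors' exact procedure may differ):

```python
import math

def random_effects_pool(rs, ns):
    """DerSimonian-Laird random-effects pooling of per-study correlations
    rs with sample sizes ns, via the Fisher z transform. Returns pooled r."""
    zs = [0.5 * math.log((1 + r) / (1 - r)) for r in rs]   # Fisher z
    vs = [1.0 / (n - 3) for n in ns]                        # within-study variance
    w = [1.0 / v for v in vs]
    zbar = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
    q = sum(wi * (zi - zbar) ** 2 for wi, zi in zip(w, zs)) # Cochran's Q
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(rs) - 1)) / c)                # between-study variance
    w_re = [1.0 / (v + tau2) for v in vs]                   # random-effects weights
    z_re = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
    return math.tanh(z_re)                                  # back-transform to r
```

Cochran's Q from the same computation also feeds the heterogeneity analysis mentioned in the abstract.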
PubMed: 33679462
DOI: 10.3389/fpsyt.2021.519840