BMC Medical Informatics and Decision... Oct 2022
BACKGROUND
Outliers and class imbalance in medical data can reduce the accuracy of machine learning models. For physicians who want to apply predictive models, deciding how to build a model from the data at hand, and which model to choose, are thorny problems. It is therefore necessary to account for outliers, imbalanced data, model selection, and parameter tuning when modeling.
METHODS
This study used a joint modeling strategy consisting of four steps: outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation. We collected medical record data for all patients with intracerebral hemorrhage (ICH) admitted in 2017-2019 in Sichuan Province. Clinical and radiological variables were used to construct models to predict mortality 90 days after discharge. We used stacking ensemble learning to combine logistic regression (LR), random forest (RF), artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbors (KNN) models. Accuracy, sensitivity, specificity, AUC, precision, and F1 score were used to evaluate model performance. Finally, we compared all 84 combinations of the joint modeling strategy: training sets with and without the cross-validated committees filter (CVCF); five resampling techniques (random under-sampling (RUS), random over-sampling (ROS), adaptive synthetic sampling (ADASYN), borderline synthetic minority oversampling technique (Borderline-SMOTE), and synthetic minority oversampling technique with edited nearest neighbors (SMOTEENN)) plus no resampling; and seven models (LR, RF, ANN, SVM, KNN, Stacking, AdaBoost).
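The joint modeling pipeline described above can be sketched as follows. This is a minimal illustration on synthetic data (not the ICH cohort), using random over-sampling (ROS) as the balancing step and a reduced stacking ensemble of LR, RF, and KNN base learners; all names and parameters are illustrative.

```python
# Minimal sketch of the joint modeling pipeline on synthetic data:
# balance the training set with random over-sampling (ROS), then fit a
# stacking ensemble (LR, RF, KNN base learners, LR meta-learner).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Imbalanced stand-in for the cohort: ~70% majority, ~30% minority.
X, y = make_classification(n_samples=600, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# ROS: duplicate random minority rows until the classes are balanced.
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority, size=np.sum(y_tr == 0) - minority.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_bal, y_bal)
accuracy = stack.score(X_te, y_te)
```

Any of the other resamplers (RUS, ADASYN, Borderline-SMOTE, SMOTEENN) could replace the ROS step without changing the rest of the pipeline.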
RESULTS
Among 4207 patients with ICH, 2909 (69.15%) survived 90 days after discharge and 1298 (30.85%) died within 90 days. Removing outliers with CVCF improved all performance metrics except sensitivity, across all models. For data balancing, training without resampling outperformed training with resampling in accuracy, specificity, and precision, while ROS achieved the best AUC. Across the seven models, RF had the highest average accuracy, specificity, AUC, and precision; Stacking performed best in F1 score. Among all 84 combinations of the joint modeling strategy, eight combinations tied for the best accuracy (0.816). For sensitivity, the best combination was SMOTEENN + Stacking (0.662); for specificity, CVCF + KNN (0.987). Stacking and AdaBoost achieved the best AUC (0.756) and F1 score (0.602), respectively. For precision, the best combination was CVCF + SVM (0.938).
CONCLUSION
This study proposed a joint modeling strategy comprising outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation, in order to provide a reference for physicians and researchers who want to build their own models. The study illustrated the importance of outlier detection and removal for machine learning and showed that ensemble learning may be a good modeling strategy. Due to the low imbalance ratio (IR, the ratio of the majority class to the minority class) in this study, resampling did not improve the models in terms of accuracy, specificity, or precision, although ROS performed best on AUC.
Topics: Humans; Electronic Health Records; Machine Learning; Support Vector Machine; Cerebral Hemorrhage
PubMed: 36284327
DOI: 10.1186/s12911-022-02018-x
IEEE Transactions on Visualization and... Feb 2021
Given a scatterplot with tens of thousands of points or more, a natural question is which sampling method should be used to create a small but "good" scatterplot for better abstraction. We present the results of a user study that investigates the influence of different sampling strategies on multi-class scatterplots. The main goal of this study is to understand the capability of sampling methods in preserving the density, outliers, and overall shape of a scatterplot. To this end, we comprehensively review the literature and select seven typical sampling strategies as well as eight representative datasets. We then design four experiments to understand the performance of different strategies in maintaining: 1) region density; 2) class density; 3) outliers; and 4) overall shape in the sampling results. The results show that: 1) random sampling is preferred for preserving region density; 2) blue noise sampling and random sampling have performance comparable to the three multi-class sampling strategies in preserving class density; 3) outlier-biased density-based sampling, recursive subdivision-based sampling, and blue noise sampling perform best in keeping outliers; and 4) blue noise sampling outperforms the others in maintaining the overall shape of a scatterplot.
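As a minimal illustration of one strategy the study compares, uniform random sampling keeps each point with equal probability, so the expected class and region densities of the sample match the original plot. The two-class data below are synthetic stand-ins, not the study's datasets.

```python
# Synthetic two-class scatterplot (40k + 10k points) and a uniform
# random sample of it; each point is kept with equal probability, so the
# sample's expected class proportions match the original plot's.
import numpy as np

rng = np.random.default_rng(42)
cls_a = rng.normal([0.0, 0.0], 1.0, size=(40_000, 2))
cls_b = rng.normal([4.0, 4.0], 1.0, size=(10_000, 2))
points = np.vstack([cls_a, cls_b])
labels = np.concatenate([np.zeros(40_000, int), np.ones(10_000, int)])

sample_idx = rng.choice(points.shape[0], size=1_000, replace=False)
sample = points[sample_idx]
class_b_share = labels[sample_idx].mean()   # original share is 0.2
```

Strategies such as blue noise sampling instead enforce a minimum spacing between kept points, which is what the study finds helps preserve outliers and overall shape.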
PubMed: 33074820
DOI: 10.1109/TVCG.2020.3030432
... IEEE International Conference on... Dec 2022
Outlier detection is a fundamental data analytics technique often used in security applications. Numerous outlier detection techniques exist, and in most cases they are used to identify outliers directly, without any interaction. The underlying data are typically high-dimensional and complex. Even when outliers are identified, it is difficult for a security expert to understand or visualize why a particular event or record was flagged, since humans can only easily grasp low-dimensional spaces. In this paper, we study the extent to which outlier detection techniques work in smaller dimensions and how well dimensionality reduction techniques still enable accurate detection of outliers. This can help us understand the extent to which data can be visualized while still retaining the intrinsic outlyingness of the outliers.
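A hedged sketch of the question the paper studies: reduce high-dimensional data to two dimensions, then check whether a planted outlier is still detectable. PCA and Isolation Forest here stand in for whichever reduction and detection techniques the paper actually evaluates.

```python
# Does a 20-D outlier survive projection to a visualizable 2-D space?
# Project with PCA, then score all points with an Isolation Forest.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(500, 20))
outlier = np.full((1, 20), 8.0)            # one planted extreme record
X = np.vstack([inliers, outlier])

X2 = PCA(n_components=2, random_state=0).fit_transform(X)   # 2-D view
scores = IsolationForest(random_state=0).fit(X2).score_samples(X2)
most_outlying = int(np.argmin(scores))     # lowest score = most anomalous
```

In this toy case the planted record (index 500) remains the most anomalous point after reduction, which is the property the paper measures across real detectors and datasets.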
PubMed: 38094985
DOI: 10.1109/tps-isa56441.2022.00028
Turkish Journal of Orthodontics Dec 2022
OBJECTIVE
To determine whether multiple siblings resemble one another in their craniofacial characteristics as measured on cephalometric radiographs.
METHODS
This study was conducted retrospectively using the Forsyth Moorrees twin sample. A total of 32 families were included, each with ≥4 postpubertal siblings, totaling 142 subjects. Only 1 monozygotic twin was included per family. Headfilms were digitized, skeletal landmarks were located, and 6 parameters indicating sagittal jaw relationships and vertical status were measured. Several statistical approaches were used: Dixon's Q-test detected outliers within a family for a given parameter; the Manhattan distance quantified similarity among siblings per parameter; and scatter plots displayed each subject's measure relative to the mean and standard deviation of each parameter, to assess the clinical relevance of the differences.
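The Dixon Q statistic used here is the gap between the suspect value and its nearest neighbor divided by the sample range. The sibling values and the tabulated critical value below are invented for illustration, not taken from the study.

```python
# Dixon's Q-test for a single within-family parameter (illustrative).
def dixon_q(values):
    """Q for the most extreme value: gap to its nearest neighbour
    divided by the full range of the sorted sample."""
    s = sorted(values)
    q_low = (s[1] - s[0]) / (s[-1] - s[0])     # test the minimum
    q_high = (s[-1] - s[-2]) / (s[-1] - s[0])  # test the maximum
    return max(q_low, q_high)

# Four siblings' SNB angles (degrees); the last one looks extreme.
snb = [78.1, 78.9, 79.4, 86.0]
q = dixon_q(snb)
# Tabulated Q_crit for n=4 at alpha = 0.05 is about 0.829.
is_outlier = q > 0.829
```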
RESULTS
A total of 11 families (34.4%) had no outliers on any parameter, 13 families (40.6%) had outliers on 1 parameter, and 8 families (25%) had outliers on ≥2 parameters. We identified 29 individuals (20.4%) with at least 1 outlying measure. Among these, only 2 individuals (1.4%) differed significantly from their siblings on more than 1 measurement. Although the majority of families did not show any statistical outlier, the ranges of the measurements were clinically relevant, as they might suggest different treatment. For example, the mean range of SNB (Sella-Nasion-B point) angles was 7.23°, and the mean range of MPA (mandibular plane angle) was 9.42°.
CONCLUSION
Although siblings within a family are generally similar in their craniofacial characteristics, one sibling's measurements cannot be used to predict another sibling's in a clinically meaningful way.
PubMed: 36594544
DOI: 10.5152/TurkJOrthod.2022.21237
IEEE Transactions on Cybernetics Aug 2022
Outlier detection is one of the most important research directions in data mining. However, most of the current research focuses on outlier detection for categorical or numerical attribute data. There are few studies on the outlier detection of mixed attribute data. In this article, we introduce fuzzy rough sets (FRSs) to deal with the problem of outlier detection in mixed attribute data. Since the outlier detection model of the classical rough set is only applicable to the categorical attribute data, we use FRS to generalize the outlier detection model and construct a generalized outlier detection model based on fuzzy rough granules. First, the granule outlier degree (GOD) is defined to characterize the outlier degree of fuzzy rough granules by employing the fuzzy approximation accuracy. Then, the outlier factor based on fuzzy rough granules is constructed by integrating the GOD and the corresponding weights to characterize the outlier degree of objects. Furthermore, the corresponding fuzzy rough granules-based outlier detection (FRGOD) algorithm is designed. The effectiveness of the FRGOD algorithm is evaluated through experiments on 16 real-world datasets. The experimental results show that the algorithm is more flexible for detecting outliers and is suitable for numerical, categorical, and mixed attribute data.
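As a toy illustration of the fuzzy-similarity flavor of this approach (not the FRGOD algorithm itself), one can build a fuzzy relation over mixed attributes and score each object by its average dissimilarity to the rest; the attribute values and the aggregation rule below are invented for demonstration.

```python
# Toy fuzzy relation over mixed attributes: numerical attributes use
# 1 - normalized distance, categorical attributes use exact-match
# similarity, and the outlier degree is 1 minus average similarity.
import numpy as np

ages = np.array([23.0, 25.0, 24.0, 22.0, 60.0])   # numerical attribute
city = np.array(["a", "a", "a", "a", "b"])        # categorical attribute

age_range = ages.max() - ages.min()
num_sim = 1.0 - np.abs(ages[:, None] - ages[None, :]) / age_range
cat_sim = (city[:, None] == city[None, :]).astype(float)
sim = (num_sim + cat_sim) / 2.0                   # fuzzy relation R

outlier_degree = 1.0 - sim.mean(axis=1)           # higher = more outlying
most_outlying = int(np.argmax(outlier_degree))    # the mixed-attribute outlier
```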
Topics: Algorithms; Data Mining; Fuzzy Logic
PubMed: 33750721
DOI: 10.1109/TCYB.2021.3058780
Psychophysiology Jun 2023
In response time (RT) research, RTs which largely deviate from the RT distribution are considered "outliers". Outliers are typically excluded from RT analysis building upon the implicit assumption that cognitive processing is distorted in outlier trials. The present study aims to test this assumption by comparing cognitive processing indexed by event-related potentials (ERP) of trials with outliers and valid trials in two different tasks. To this end, we compared stimulus- and response-locked ERPs for outliers identified by nine different methods with valid trials, using cluster-based permutation tests. Consistently across outlier exclusion methods and tasks, the late positive complex (P3) associated with response-related processes was reduced in outliers. Analyses of response-locked ERPs related this P3 attenuation to a slower and temporally more extended increase of the P3, possibly indexing reduced evidence accumulation speed in outliers. P3 peak amplitude in response-locked ERPs was similar between outliers and valid trials, suggesting that the absolute amount of evidence required for a response remained comparable. Furthermore, in addition to these more general ERP correlates of outliers, the contingent negative variation (CNV) ERP component was reduced in outliers as a function of preparatory demands of the task. Hence, electrophysiological correlates, and thus cognitive processing, are altered in outliers compared to valid trials. In order to avoid distortion of observed ERP differences between conditions, the RT outlier distribution should be considered for the analysis of ERPs in combined ERP and RT studies.
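One common family of RT outlier exclusion rules of the kind the study compares flags trials far from the median in median-absolute-deviation (MAD) units; the RT values below are invented for illustration.

```python
# MAD-based RT outlier exclusion: flag trials more than 3 robust SDs
# from the median RT (1.4826 scales the MAD to a normal-theory SD).
import numpy as np

rts = np.array([412., 455., 430., 447., 438., 1520., 421., 444.])  # ms
med = np.median(rts)
mad = np.median(np.abs(rts - med))
z = np.abs(rts - med) / (1.4826 * mad)
is_outlier = z > 3.0
n_excluded = int(is_outlier.sum())    # only the 1520 ms trial is flagged
```

The study's point is that trials flagged by such rules also show altered ERP correlates (reduced P3 and CNV), so the flagged trials differ in cognitive processing, not just in RT.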
Topics: Humans; Reaction Time; Electroencephalography; Evoked Potentials; Contingent Negative Variation; Mental Processes
PubMed: 37042066
DOI: 10.1111/psyp.14305
JCO Clinical Cancer Informatics Oct 2022
PURPOSE
Artificial intelligence (AI) models for medical image diagnosis are often trained and validated on curated data. However, in a clinical setting, images that are outliers with respect to the training data, such as those representing rare disease conditions or acquired using a slightly different setup, can lead to wrong decisions. It is not practical to expect clinicians to be trained to discount results for such outlier images. Toward clinical deployment, we have designed a method to train cautious AI that can automatically flag outlier cases.
MATERIALS AND METHODS
Our method, ClassClust, forms tight clusters of training images using supervised contrastive learning, which helps it identify outliers during testing. We compared ClassClust's ability to detect outliers with three competing methods on four publicly available data sets covering pathology, dermatoscopy, and radiology. We held out certain diseases, artifacts, and types of images from the training data and examined the ability of various models to detect these as outliers during testing. We also compared the decision accuracy of the models on held-out nonoutlier images. We visualized the regions of the images that the models used for their decisions.
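The abstract does not give ClassClust's implementation, but the general idea of flagging a test case that is far from every tight training cluster can be sketched as follows; the embeddings, threshold, and class names are stand-ins, not the method itself.

```python
# Flag-or-classify: if a test embedding is far from every class
# centroid of a tightly clustered training set, flag it as an outlier
# instead of forcing a class decision.
import numpy as np

rng = np.random.default_rng(1)
emb_a = rng.normal([0.0, 0.0], 0.3, size=(100, 2))   # class A embeddings
emb_b = rng.normal([5.0, 5.0], 0.3, size=(100, 2))   # class B embeddings
centroids = np.stack([emb_a.mean(0), emb_b.mean(0)])

def classify_or_flag(x, centroids, threshold=1.5):
    d = np.linalg.norm(centroids - x, axis=1)
    if d.min() > threshold:
        return "outlier"          # far from every known class: flag it
    return ["A", "B"][int(d.argmin())]

pred_in = classify_or_flag(np.array([0.1, -0.2]), centroids)
pred_out = classify_or_flag(np.array([10.0, -8.0]), centroids)
```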
RESULTS
Area under receiver operating characteristic curve for outlier detection was consistently higher using ClassClust compared with the previous methods. Average accuracy on held-out nonoutlier images was also higher, and the visualizations of image regions were more informative using ClassClust.
CONCLUSION
The ability to flag outlier test cases need not be at odds with the ability to accurately classify nonoutliers in AI models. Although the latter capability has received research and regulatory attention, AI models for clinical deployment should possess the former as well.
Topics: Artificial Intelligence; Data Collection; Humans; ROC Curve; Trust
PubMed: 36228179
DOI: 10.1200/CCI.22.00067
Finding the Genomic Basis of Local Adaptation: Pitfalls, Practical Solutions, and Future Directions. The American Naturalist Oct 2016
Review
Uncovering the genetic and evolutionary basis of local adaptation is a major focus of evolutionary biology. The recent development of cost-effective methods for obtaining high-quality genome-scale data makes it possible to identify some of the loci responsible for adaptive differences among populations. Two basic approaches for identifying putatively locally adaptive loci have been developed and are broadly used: one that identifies loci with unusually high genetic differentiation among populations (differentiation outlier methods) and one that searches for correlations between local population allele frequencies and local environments (genetic-environment association methods). Here, we review the promises and challenges of these genome scan methods, including correcting for the confounding influence of a species' demographic history, biases caused by missing aspects of the genome, matching scales of environmental data with population structure, and other statistical considerations. In each case, we make suggestions for best practices for maximizing the accuracy and efficiency of genome scans to detect the underlying genetic basis of local adaptation. With attention to their current limitations, genome scan methods can be an important tool in finding the genetic basis of adaptive evolutionary change.
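A differentiation outlier scan can be sketched in miniature as below, using a simplified per-locus Fst and a naive empirical cutoff. As the review stresses, real analyses must correct for demographic history rather than rely on such a cutoff; all frequencies here are simulated.

```python
# Toy differentiation outlier scan: per-locus Fst from two populations'
# allele frequencies, flagging loci in the top 1% of the distribution.
import numpy as np

rng = np.random.default_rng(3)
n_loci = 1_000
p1 = rng.uniform(0.2, 0.8, n_loci)          # allele freqs, population 1
p2 = p1 + rng.normal(0.0, 0.03, n_loci)     # mild drift in population 2
p2[0] = p1[0] + 0.5                         # one strongly diverged locus
p2 = p2.clip(0.01, 0.99)

p_bar = (p1 + p2) / 2.0
h_t = 2.0 * p_bar * (1.0 - p_bar)                        # total heterozygosity
h_s = (2.0 * p1 * (1 - p1) + 2.0 * p2 * (1 - p2)) / 2.0  # mean within-pop
fst = (h_t - h_s) / h_t

outlier_loci = np.flatnonzero(fst > np.quantile(fst, 0.99))
```

The planted locus ends up in the flagged set, but under a real demographic history many neutral loci would too, which is exactly the confounding the review discusses.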
Topics: Adaptation, Physiological; Animals; Gene Frequency; Genetics, Population; Genome; Genomics; Selection, Genetic
PubMed: 27622873
DOI: 10.1086/688018
NeuroImage Feb 2017
Review
Even after thorough preprocessing and a careful time series analysis of functional magnetic resonance imaging (fMRI) data, artifacts and other issues can lead to violations of the assumption that the variance is constant across subjects in the group-level model. This is especially concerning when modeling a continuous covariate at the group level, as the slope is easily biased by outliers. Various models have been proposed to deal with outliers, including models that use the first-level variance or the group-level residual magnitude to differentially weight subjects. The most commonly used robust regression, implementing a robust estimator of the regression slope, has been studied previously in the context of fMRI and was found to perform well in some scenarios, although a loss of Type I error control can occur in some outlier settings. A second type of robust regression, using a heteroskedasticity and autocorrelation consistent (HAC) estimator that produces robust slope and variance estimates, has been shown to perform well with better Type I error control, but only with large sample sizes (500-1000 subjects). Type I error control with smaller sample sizes has not been studied for this model, nor has it been compared to other outlier-handling approaches such as FSL's Flame 1 and FSL's outlier de-weighting. Focusing on group-level inference with a continuous covariate over a range of sample sizes and degrees of heteroscedasticity, driven either by within- or between-subject variability, both styles of robust regression are compared to ordinary least squares (OLS), FSL's Flame 1, Flame 1 with the outlier de-weighting algorithm, and Kendall's Tau. Additionally, subject omission using Cook's distance with OLS and nonparametric inference with the OLS statistic are studied. Pros and cons of these models, as well as general strategies for detecting outliers in data and taking precautions to avoid inflated Type I error rates, are discussed.
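One of the outlier-handling devices compared above, subject omission via Cook's distance on an OLS fit, can be sketched for a single group-level covariate; the data are simulated and the 4/n cutoff is one common rule of thumb, not a prescription from the review.

```python
# Cook's distance for a one-covariate group-level OLS model:
# D_i = r_i^2 * h_i / (p * s^2 * (1 - h_i)^2), with raw residuals r_i,
# leverages h_i from the hat matrix, p parameters, and residual variance s^2.
import numpy as np

rng = np.random.default_rng(7)
n = 30
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 0.5, n)
y[0] += 10.0                        # one outlying subject

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages
p = X.shape[1]
s2 = resid @ resid / (n - p)
cooks = resid**2 * h / (p * s2 * (1 - h) ** 2)

worst = int(np.argmax(cooks))       # the planted outlier dominates
```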
Topics: Adolescent; Adult; Data Interpretation, Statistical; Decision Making; Female; Functional Neuroimaging; Humans; Magnetic Resonance Imaging; Male; Models, Statistical; Psychomotor Performance; Young Adult
PubMed: 28030782
DOI: 10.1016/j.neuroimage.2016.12.058
IEEE Transactions on Image Processing Jan 2018
Identifying different types of data outliers with abnormal behaviors in a multi-view data setting is challenging due to the complicated data distributions across views. Conventional approaches achieve this by learning a new latent feature representation with a pairwise constraint on the different views. In this paper, we argue that existing methods are expensive to generalize from two-view data to three-view (or more) data, in terms of both the number of introduced variables and detection performance. To address this, we propose a novel multi-view outlier detection method with consensus regularization on the latent representations. Specifically, we explicitly characterize each kind of outlier by intrinsic cluster assignment labels and sample-specific errors. We also provide a thorough discussion of the proposed consensus regularization versus pairwise regularization. An optimization solution based on the augmented Lagrangian multiplier method is proposed and derived in detail. In the experiments, we evaluate our method on five well-known machine learning data sets with different outlier settings. Further, to show its effectiveness in real-world computer vision scenarios, we tailor the proposed model to saliency detection and face reconstruction applications. The extensive results on both the standard multi-view outlier detection task and the extended computer vision tasks demonstrate the effectiveness of the proposed method.
PubMed: 28945594
DOI: 10.1109/TIP.2017.2754942