Big Data, Oct 2023
Anomaly detection is crucial in a variety of domains, such as fraud detection, disease diagnosis, and equipment defect detection. With the development of deep learning, anomaly detection with Bayesian neural networks (BNNs) has become a novel research topic in recent years. This article aims to propose a widely applicable method of outlier detection (a category of anomaly detection) using BNNs based on uncertainty measurement. Three kinds of uncertainty are generated in the predictions of BNNs: epistemic uncertainty, aleatoric uncertainty, and (model) misspecification uncertainty. While approaches from previous studies are adopted to measure epistemic and aleatoric uncertainty, a new method of utilizing loss functions to quantify misspecification uncertainty is proposed in this article. These three uncertainty sources are then merged by specific combination models to construct total prediction uncertainty. The key idea of this study is that observations with high total prediction uncertainty should correspond to outliers in the data. The method is applied in experiments on the Modified National Institute of Standards and Technology (MNIST) dataset and the Taxi dataset, respectively. The results show that if the network is appropriately constructed and well trained, and the model parameters are carefully tuned, most anomalous images in the MNIST dataset and all abnormal traffic periods in the Taxi dataset can be detected. In addition, the performance of this method is compared with previously proposed BNN anomaly detection methods and the classical Local Outlier Factor and Density-Based Spatial Clustering of Applications with Noise methods. This study links the classification of uncertainties with anomaly detection and is the first to consider combining different uncertainty sources to refine detection outcomes, rather than using a single uncertainty source at a time.
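The pipeline described in the abstract can be illustrated with a minimal sketch, assuming hypothetical numbers and a naive additive combination (the paper's actual combination models and loss-based misspecification measure are more elaborate):

```python
import statistics

def total_uncertainty(means, variances, losses):
    """Combine three uncertainty sources for one observation.

    means:     predictive means from repeated stochastic forward passes
               over the BNN (e.g. MC sampling of the weights)
    variances: per-pass predicted noise variances
    losses:    per-pass loss values, used here as a crude stand-in for
               the paper's loss-based misspecification uncertainty
    """
    epistemic = statistics.pvariance(means)  # disagreement between passes
    aleatoric = statistics.mean(variances)   # average predicted data noise
    misspec = statistics.mean(losses)        # loss-based proxy
    return epistemic + aleatoric + misspec   # naive additive combination

def flag_outliers(scores, threshold):
    """Indices of observations whose total uncertainty exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]
```

An observation whose forward passes disagree, or whose losses stay high, accumulates a large total score and gets flagged.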
Topics: Bayes Theorem; Fraud; Neural Networks, Computer; Spatial Analysis; Deep Learning
PubMed: 36706252
DOI: 10.1089/big.2021.0343 -
Journal of Healthcare Quality Research, 2022
OBJECTIVES
The objective is to describe and analyze how outlier admission influences hospital stay and the appearance of complications in patients with a femoral neck fracture treated with arthroplasty.
MATERIAL AND METHOD
A historical cohort study was carried out in which the group of patients with a displaced fracture of the femoral neck who had an outlier admission was defined as an exposed cohort, that is, they were admitted to a hospitalization area not belonging to the Orthopedic Surgery and Traumatology department, unlike the unexposed cohort, that included patients admitted to a hospitalization area assigned to the Orthopedic Surgery and Traumatology department.
RESULTS
Outlier admission was a risk factor for requiring a postoperative transfusion (RR 1.52, 95% CI 1.05-2.21; P=.035), to have a postoperative stay longer than 5 days (RR 1.35, 95% CI 1.04-1.74; P=.038) and to suffer general postoperative complications (RR 1.35, 95% CI 1.02-1.78; P=.048).
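Relative risks of this kind come from a standard 2×2 cohort-table calculation; a minimal sketch with illustrative counts (not the study's data):

```python
import math

def relative_risk(a, b, c, d, z=1.96):
    """Relative risk and 95% CI from a 2x2 cohort table.

    a: exposed with event      b: exposed without event
    c: unexposed with event    d: unexposed without event
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    # standard error of log(RR)
    se = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper
```

A CI that excludes 1 corresponds to the significant associations reported above.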
CONCLUSIONS
Outlier admission is a threat to the quality and safety of health care. In patients over 80 years of age, outlier admission is a risk factor for postoperative transfusion and systemic postoperative complications.
Topics: Humans; Aged, 80 and over; Femoral Neck Fractures; Cohort Studies; Length of Stay; Postoperative Complications; Risk Factors
PubMed: 35654723
DOI: 10.1016/j.jhqr.2022.02.012 -
IEEE Transactions on Image Processing, 2022
When neural networks are employed for high-stakes decision-making, it is desirable that they provide explanations for their predictions so that we can understand the features that contributed to the decision. At the same time, it is important to flag potential outliers for in-depth verification by domain experts. In this work we propose to unify two differing aspects of explainability with outlier detection. We argue for a broader adoption of prototype-based student networks capable of providing an example-based explanation for their prediction and, at the same time, identifying regions of similarity between the predicted sample and the examples. The examples are real prototypical cases sampled from the training set via a novel iterative prototype replacement algorithm. Furthermore, we propose to use the prototype similarity scores for identifying outliers. We compare our proposed network with baselines in terms of classification, explanation quality, and outlier detection. We show that our prototype-based networks extending beyond similarity kernels deliver meaningful explanations and promising outlier detection results without compromising classification accuracy.
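The outlier-flagging idea can be sketched with a toy cosine-similarity version (the paper's networks learn the similarity end-to-end; the threshold `tau` here is a hypothetical choice):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def is_outlier(sample, prototypes, tau=0.5):
    """Flag a sample whose best prototype similarity falls below tau."""
    best = max(cosine(sample, p) for p in prototypes)
    return best < tau
```

A sample that resembles no prototype gets a low best similarity and is routed to a human expert rather than trusted to the classifier.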
PubMed: 34793299
DOI: 10.1109/TIP.2021.3127847 -
Knowledge-Based Systems, Feb 2022
The presence of outliers can severely degrade learned representations and performance of deep learning methods and hence disproportionately affect the training process, leading to incorrect conclusions about the data. For example, anomaly detection using deep generative models is typically only possible when similar anomalies (or outliers) are not present in the training data. Here we focus on variational autoencoders (VAEs). While the VAE is a popular framework for anomaly detection tasks, we observe that the VAE is unable to detect outliers when the training data contains anomalies that have the same distribution as those in test data. In this paper we focus on robustness to outliers in training data in VAE settings using concepts from robust statistics. We propose a variational lower bound that leads to a robust VAE model that has the same computational complexity as the standard VAE and contains a single automatically-adjusted tuning parameter to control the degree of robustness. We present mathematical formulations for robust variational autoencoders (RVAEs) for Bernoulli, Gaussian and categorical variables. The RVAE model is based on beta-divergence rather than the standard Kullback-Leibler (KL) divergence. We demonstrate the performance of our proposed β-divergence-based autoencoder for a variety of image and categorical datasets showing improved robustness to outliers both qualitatively and quantitatively. We also illustrate the use of our robust VAE for detection of lesions in brain images, formulated as an anomaly detection task. Finally, we suggest a method to tune the hyperparameter of RVAE which makes our model completely unsupervised.
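For intuition on why β-divergence buys robustness, one common β-divergence form (the density power divergence of Basu et al.) can be written out for a single Bernoulli observation; this is a minimal numerical sketch, not the paper's exact variational bound:

```python
def beta_bernoulli_loss(x, p, beta):
    """Density-power-divergence loss for one Bernoulli observation.

    x:    observed value in {0, 1}
    p:    predicted Bernoulli mean in (0, 1)
    beta: robustness parameter; the loss approaches the negative
          log-likelihood (up to constants) as beta -> 0
    """
    lik = p if x == 1 else 1 - p  # likelihood of the observation
    # -(1 + 1/beta) * f(x)^beta + sum over outcomes of f(y)^(1+beta)
    return -(1 + 1/beta) * lik**beta + (p**(1 + beta) + (1 - p)**(1 + beta))
```

Because the observation enters only through `lik**beta`, a gross outlier with tiny likelihood contributes a bounded amount, whereas the KL (log-likelihood) loss would diverge.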
PubMed: 36714396
DOI: 10.1016/j.knosys.2021.107886 -
The Canadian Journal of Nursing..., Sep 2021
Review
The presence of statistical outliers is a shared concern in research. If ignored or improperly handled, outliers have the potential to distort parameter estimates and possibly compromise the validity of research findings. The purpose of this paper is to provide a conceptual and practical overview of multivariate outliers with a focus on common techniques used to identify and manage multivariate outliers. Specifically, this paper discusses the use of Mahalanobis distance and residual statistics as common multivariate outlier identification techniques. It also discusses the use of leverage and Cook's distance as two common techniques to determine the influence that multivariate outliers may have on statistical models. Finally, this paper discusses techniques that are commonly used to handle influential multivariate outlier cases.
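As a concrete example of the identification step, the squared Mahalanobis distance can be computed directly; a sketch with a hard-coded chi-square cutoff (in practice the cutoff is read from a chi-square table at the chosen alpha and number of variables):

```python
import numpy as np

def mahalanobis_d2(X_ref, points):
    """Squared Mahalanobis distance of each point to a reference sample.

    X_ref:  rows of clean reference observations
    points: rows to score
    Distances above the chi-square critical value for the number of
    variables (e.g. ~13.816 at the 0.999 quantile for 2 variables)
    are conventionally flagged as multivariate outliers.
    """
    X_ref = np.asarray(X_ref, dtype=float)
    mu = X_ref.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X_ref, rowvar=False))
    diffs = np.asarray(points, dtype=float) - mu
    # (x - mu)^T  Sigma^-1  (x - mu), row by row
    return np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs)
```

Leverage and Cook's distance then quantify how much each flagged case actually moves the fitted model, which is the second step the paper describes.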
Topics: Humans; Models, Statistical; Research Personnel
PubMed: 32522115
DOI: 10.1177/0844562120932054 -
Clinical Breast Cancer, Jun 2022
Rates and Outcomes of Breast Lesions of Uncertain Malignant Potential (B3) benchmarked against the National Breast Screening Pathology Audit; Improving Performance in a High Volume Screening Unit.
INTRODUCTION
Our breast screening unit was identified as a high outlier for B3 lesions, with a low positive predictive value (PPV) compared to the England average. This prompted a detailed internal audit and review of B3 lesions and their outcomes to identify causes and address any variation in practice.
PATIENTS AND METHODS
The B3 rate was calculated in 4168 breast core biopsies from 2019, using the subsequent excision to determine the PPV. Atypical intraductal epithelial proliferation (AIDEP) cases were subject to microscopic review to reassess the presence of atypia against published criteria. The B3 rate was re-audited in 2021, and the results compared.
RESULTS
Screening cases had a high B3 rate of 12.4% (30% above the national average), and a PPV of 7.7% (9.7% with atypia). AIDEP was identified as a possible cause of this outlier status. On review and by consensus, AIDEP was confirmed in only 66% of cases reported as such, 17% were downgraded, and 16% did not reach consensus, the latter highlighting the difficulty and subjectivity in diagnosis of these lesions. Repeat audit of B3 rates after this extended review revealed a reduction from 12.4% to 9.11%, which is more in line with national standards.
CONCLUSION
Benchmarking against national reporting standards is critical for service improvement. Through a supportive environment, team working, rigorous internal review and adherence to guidelines, both interobserver variation and outlier status in breast screening pathology can be addressed. This study can serve as a model for other outlier units to identify and tackle underlying causes.
Topics: Benchmarking; Biopsy, Large-Core Needle; Breast; Breast Neoplasms; Female; Humans; Mammography
PubMed: 35260351
DOI: 10.1016/j.clbc.2022.02.004 -
BMC Health Services Research, Jan 2023
BACKGROUND
Institutions or clinicians (units) are often compared according to a performance indicator such as in-hospital mortality. Several approaches have been proposed for the detection of outlying units, whose performance deviates from the overall performance.
METHODS
We provide an overview of three approaches commonly used to monitor institutional performances for outlier detection. These are the common-mean model, the 'Normal-Poisson' random effects model and the 'Logistic' random effects model. For the latter we also propose a visualisation technique. The common-mean model assumes that the underlying true performance of all units is equal and that any observed variation between units is due to chance. Even after applying case-mix adjustment, this assumption is often violated due to overdispersion and a post-hoc correction may need to be applied. The random effects models relax this assumption and explicitly allow the true performance to differ between units, thus offering a more flexible approach. We discuss the strengths and weaknesses of each approach and illustrate their application using audit data from England and Wales on Adult Cardiac Surgery (ACS) and Percutaneous Coronary Intervention (PCI).
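The common-mean approach with a post-hoc overdispersion correction can be sketched as follows (illustrative only: real audits use case-mix-adjusted expected counts, and the overdispersion factor is usually estimated from winsorized residuals):

```python
import math

def corrected_z_scores(observed, expected):
    """Standardised residuals under a common-mean Poisson model,
    rescaled by an estimated overdispersion factor phi.

    observed, expected: per-unit event counts. phi is estimated here
    as the mean squared residual, a simple moment estimator.
    """
    z = [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]
    phi = sum(v * v for v in z) / len(z)      # overdispersion factor
    return [v / math.sqrt(phi) for v in z]    # corrected residuals

def flag(zs, crit=1.96):
    """Units whose corrected residual exceeds the critical value."""
    return [i for i, v in enumerate(zs) if abs(v) > crit]
```

Without dividing by `sqrt(phi)`, between-unit variation beyond Poisson noise inflates every residual, which is why the uncorrected common-mean model in the paper flags more outliers.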
RESULTS
In general, the overdispersion-corrected common-mean model and the random effects approaches produced similar p-values for the detection of outliers. For the ACS dataset (41 hospitals) three outliers were identified in total but only one was identified by all methods above. For the PCI dataset (88 hospitals), seven outliers were identified in total but only two were identified by all methods. The common-mean model uncorrected for overdispersion produced several more outliers. The reason for observing similar p-values for all three approaches could be attributed to the fact that the between-hospital variance was relatively small in both datasets, resulting only in a mild violation of the common-mean assumption; in this situation, the overdispersion correction worked well.
CONCLUSION
If the common-mean assumption is likely to hold, all three methods are appropriate to use for outlier detection and their results should be similar. Random effect methods may be the preferred approach when the common-mean assumption is likely to be violated.
Topics: Humans; Percutaneous Coronary Intervention; Hospitals; Risk Adjustment; Logistic Models; England
PubMed: 36627627
DOI: 10.1186/s12913-022-08995-z -
Statistical Methods in Medical Research, May 2022
The extraction of novel information from omics data is a challenging task, in particular since the number of features (e.g. genes) often far exceeds the number of samples. In such a setting, conventional parameter estimation leads to ill-posed optimization problems, and regularization may be required. In addition, outliers can largely impact classification accuracy. Here we introduce ROSIE, an ensemble classification approach, which combines three sparse and robust classification methods for outlier detection and feature selection, and further performs a bootstrap-based validity check. Outliers of ROSIE are determined by the rank product test using the outlier rankings of all three methods, and important features are selected as those commonly selected by all methods. We apply ROSIE to RNA-Seq data from The Cancer Genome Atlas (TCGA) to classify observations into Triple-Negative Breast Cancer (TNBC) and non-TNBC tissue samples. The pre-processed dataset consists of genes and more than samples. We demonstrate that ROSIE selects important features and outliers in a robust way. Identified outliers are concordant with the distribution of the genes commonly selected by the three methods, and results are in line with other independent studies. Furthermore, we discuss the association of some of the selected genes with the TNBC subtype in other investigations. In summary, ROSIE constitutes a robust and sparse procedure to identify outliers and important genes through binary classification. Our approach is ad hoc applicable to other datasets, fulfilling the overall goal of simultaneously identifying outliers and candidate disease biomarkers to be targeted in therapy research and personalized medicine frameworks.
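The consensus step can be sketched with the rank product statistic (toy rankings; the paper additionally attaches significance to the rank products via the rank product test):

```python
def rank_product(ranks_per_method):
    """Geometric mean of each sample's outlier ranks across methods.

    ranks_per_method: one rank list per method, where rank 1 means
    'most outlying'. A small rank product indicates a sample that all
    methods agree is an outlier.
    """
    n_methods = len(ranks_per_method)
    n_samples = len(ranks_per_method[0])
    rps = []
    for i in range(n_samples):
        prod = 1.0
        for ranks in ranks_per_method:
            prod *= ranks[i]
        rps.append(prod ** (1.0 / n_methods))  # geometric mean of ranks
    return rps
```

Samples with the smallest rank products are the consensus outliers; features selected by all three methods are kept analogously.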
Topics: Humans; Triple Negative Breast Neoplasms
PubMed: 35072570
DOI: 10.1177/09622802211072456 -
Scientific Reports, Sep 2023
Subspace outlier detection has emerged as a practical approach for outlier detection. Classical full-space outlier detection methods become ineffective in high dimensional data due to the "curse of dimensionality". Subspace outlier detection methods have great potential to overcome this problem. However, the challenge becomes how to determine which subspaces to use for outlier detection among the huge number of possible subspaces. In this paper, firstly, we propose an intuitive definition of outliers in subspaces. We study the desirable properties of subspaces for outlier detection and investigate metrics for those properties. Then, a novel subspace outlier detection algorithm with a statistical foundation is proposed. Our method selectively leverages a limited set of the most interesting subspaces for outlier detection. Through experimental validation, we demonstrate that identifying outliers within this reduced set of highly interesting subspaces yields significantly higher accuracy compared to analyzing the entire feature space. We show by experiments that the proposed method outperforms competing subspace outlier detection approaches on real-world data sets.
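The overall recipe, scoring points only in a few selected low-dimensional subspaces and aggregating, can be sketched as follows (the z-score scoring here is a deliberately simple stand-in; the paper selects subspaces with its own interestingness metrics):

```python
import statistics

def subspace_score(data, point, subspaces):
    """Max per-subspace outlier score for one point.

    data:      list of feature vectors (the reference sample)
    subspaces: tuples of feature indices, i.e. the selected
               low-dimensional subspaces
    Score in a subspace = sum of squared z-scores of the point's
    coordinates; the point's final score is its worst subspace.
    """
    best = 0.0
    for dims in subspaces:
        s = 0.0
        for d in dims:
            col = [row[d] for row in data]
            mu = statistics.mean(col)
            sd = statistics.pstdev(col)
            s += ((point[d] - mu) / sd) ** 2
        best = max(best, s)
    return best
```

A point that is unremarkable in the full space can still score highly in one well-chosen subspace, which is the motivation for subspace methods.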
PubMed: 37714878
DOI: 10.1038/s41598-023-42261-4 -
BMC Medical Informatics and Decision..., Oct 2022
BACKGROUND
Outliers and class imbalance in medical data could affect the accuracy of machine learning models. For physicians who want to apply predictive models, how to use the data at hand to build a model and what model to choose are very thorny problems. Therefore, it is necessary to consider outliers, imbalanced data, model selection, and parameter tuning when modeling.
METHODS
This study used a joint modeling strategy consisting of outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation. We collected medical record data for all intracerebral hemorrhage (ICH) patients admitted in 2017-2019 in Sichuan Province. Clinical and radiological variables were used to construct models to predict mortality outcomes 90 days after discharge. We used stacking ensemble learning to combine logistic regression (LR), random forest (RF), artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbors (KNN) models. Accuracy, sensitivity, specificity, AUC, precision, and F1 score were used to evaluate model performance. Finally, we compared all 84 combinations of the joint modeling strategy, including training sets with and without the cross-validated committees filter (CVCF), five resampling techniques (random under-sampling (RUS), random over-sampling (ROS), adaptive synthetic sampling (ADASYN), Borderline synthetic minority oversampling technique (Borderline SMOTE), synthetic minority oversampling technique and edited nearest neighbor (SMOTEENN)) plus no resampling, and seven models (LR, RF, ANN, SVM, KNN, Stacking, AdaBoost).
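One step of this strategy, random under-sampling (RUS) to balance the classes before fitting, can be sketched as follows (a seeded toy version; the study's full pipeline combines CVCF filtering, five resamplers, and stacked models):

```python
import random

def random_undersample(X, y, seed=0):
    """Drop random majority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        for xi in rng.sample(rows, n_min):  # keep n_min rows per class
            Xb.append(xi)
            yb.append(label)
    return Xb, yb
```

Under-sampling discards information from the majority class, which is one reason the study found unbalanced training competitive on accuracy, specificity, and precision when the imbalance was mild.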
RESULTS
Among 4207 patients with ICH, 2909 (69.15%) survived 90 days after discharge and 1298 (30.85%) died within 90 days after discharge. Removing outliers with CVCF improved the performance of all models on every metric except sensitivity. For data balancing, the training set without resampling performed better than the resampled training sets in terms of accuracy, specificity, and precision, while ROS achieved the best AUC. Among the seven models, RF had the highest average accuracy, specificity, AUC, and precision, and Stacking performed best in F1 score. Among all 84 combinations of the joint modeling strategy, eight combinations performed best in terms of accuracy (0.816). For sensitivity, the best performance was SMOTEENN + Stacking (0.662); for specificity, CVCF + KNN (0.987). Stacking and AdaBoost had the best performances in AUC (0.756) and F1 score (0.602), respectively. For precision, the best performance was CVCF + SVM (0.938).
CONCLUSION
This study proposed a joint modeling strategy, comprising outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation, in order to provide a reference for physicians and researchers who want to build their own models. It illustrated the importance of outlier detection and removal for machine learning and showed that ensemble learning can be a good modeling strategy. Due to the low imbalance ratio (IR, the ratio of the majority class to the minority class) in this study, we did not find any improvement from resampling in terms of accuracy, specificity, or precision, although ROS performed best on AUC.
Topics: Humans; Electronic Health Records; Reactive Oxygen Species; Machine Learning; Support Vector Machine; Cerebral Hemorrhage
PubMed: 36284327
DOI: 10.1186/s12911-022-02018-x