-
The Canadian Journal of Nursing... Sep 2021
Review
The presence of statistical outliers is a shared concern in research. If ignored or improperly handled, outliers have the potential to distort parameter estimates and possibly compromise the validity of research findings. The purpose of this paper is to provide a conceptual and practical overview of multivariate outliers with a focus on common techniques used to identify and manage multivariate outliers. Specifically, this paper discusses the use of Mahalanobis distance and residual statistics as common multivariate outlier identification techniques. It also discusses the use of leverage and Cook's distance as two common techniques to determine the influence that multivariate outliers may have on statistical models. Finally, this paper discusses techniques that are commonly used to handle influential multivariate outlier cases.
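As a minimal sketch of the Mahalanobis distance mentioned above, the following computes the squared distance of each case from the sample centroid for two variables. The data, the function name `mahalanobis_sq_2d`, and the alpha = 0.001 cut-off are illustrative assumptions, not taken from the paper. The cut-off relies on the closed-form chi-square quantile for 2 degrees of freedom, -2 ln(alpha).

```python
from math import log
from statistics import mean

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample centroid."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    # Entries of the sample covariance matrix (n - 1 denominator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^-1 [dx dy]^T, using the closed-form 2x2 inverse.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Hypothetical data: the last case is unremarkable on each variable alone,
# but unusual in combination (a multivariate outlier).
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, 6.0), (3, 6.5)]
d2 = mahalanobis_sq_2d(points)

# Chi-square critical value with 2 degrees of freedom at alpha = 0.001.
# Assumed convention; with very small samples d^2 cannot reach it.
cutoff = -2 * log(0.001)
most_extreme = max(range(len(points)), key=lambda i: d2[i])
```

With so few cases the squared distance is bounded by (n - 1)^2 / n, so the sketch ranks cases rather than applying the cut-off.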
Topics: Humans; Models, Statistical; Research Personnel
PubMed: 32522115
DOI: 10.1177/0844562120932054 -
Plants (Basel, Switzerland) Mar 2023
Review
Globally, food and medicinal plants have been documented, but their use patterns are poorly understood. Useful plants are non-random subsets of flora, prioritizing certain taxa. This study evaluates orders and families prioritized for medicine and food in Kenya, using three statistical models: regression, binomial, and Bayesian approaches. An extensive literature search was conducted to gather information on indigenous flora, medicinal and food plants. Regression residuals, obtained using the LINEST linear regression function, were used to quantify whether taxa had an unexpectedly high number of useful species relative to the overall proportion in the flora. Bayesian analysis, performed using the BETA.INV function, was used to obtain upper and lower bounds of 95% credible intervals for the whole flora and for all taxa. To test for the significance of individual taxa departure from the expected number, binomial analysis using the BINOMDIST function was performed to obtain p-values for all taxa. The three models identified 14 positive outlier medicinal orders, all with significant values (p < 0.05). Fabales had the highest (66.16) regression residuals, while Sapindales had the highest (1.1605) R-value. Thirty-eight positive outlier medicinal families were identified; 34 were significant outliers (p < 0.05). Rutaceae (1.6808) had the highest R-value, while Fabaceae had the highest regression residuals (63.2). Sixteen positive outlier food orders were recovered; 13 were significant outliers (p < 0.05). Gentianales (45.27) had the highest regression residuals, while Sapindales (2.3654) had the highest R-value. Forty-two positive outlier food families were recovered by the three models; 30 were significant outliers (p < 0.05). Anacardiaceae (5.163) had the highest R-value, while Fabaceae had the highest (28.72) regression residuals. This study presents important medicinal and food taxa in Kenya, and adds useful data for global comparisons.
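The regression and binomial steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' spreadsheet workflow: the taxa and counts are hypothetical, the p-value is taken as a one-sided upper-tail binomial probability, and the helper names are invented for the example.

```python
from math import comb

def least_squares_residuals(x, y):
    """Residuals of y from the ordinary least-squares line fitted on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical taxa: (total species in the flora, useful species).
taxa = {
    "Taxon A": (200, 30),
    "Taxon B": (150, 14),
    "Taxon C": (80, 22),
    "Taxon D": (300, 27),
}
totals = [t for t, _ in taxa.values()]
useful = [u for _, u in taxa.values()]

# Overall proportion of useful species across the whole (hypothetical) flora.
overall = sum(useful) / sum(totals)

# Positive residuals mark taxa with more useful species than the line predicts.
residuals = dict(zip(taxa, least_squares_residuals(totals, useful)))

# Small upper-tail probabilities mark taxa over-represented relative to the flora.
p_values = {name: binomial_upper_tail(u, t, overall) for name, (t, u) in taxa.items()}
```

The two measures need not agree: a residual reflects the absolute surplus of useful species, while the binomial probability reflects the surplus relative to the size of the taxon.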
PubMed: 36904005
DOI: 10.3390/plants12051145 -
BMC Health Services Research Jan 2023
BACKGROUND
Institutions or clinicians (units) are often compared according to a performance indicator such as in-hospital mortality. Several approaches have been proposed for the detection of outlying units, whose performance deviates from the overall performance.
METHODS
We provide an overview of three approaches commonly used to monitor institutional performances for outlier detection. These are the common-mean model, the 'Normal-Poisson' random effects model and the 'Logistic' random effects model. For the latter we also propose a visualisation technique. The common-mean model assumes that the underlying true performance of all units is equal and that any observed variation between units is due to chance. Even after applying case-mix adjustment, this assumption is often violated due to overdispersion and a post-hoc correction may need to be applied. The random effects models relax this assumption and explicitly allow the true performance to differ between units, thus offering a more flexible approach. We discuss the strengths and weaknesses of each approach and illustrate their application using audit data from England and Wales on Adult Cardiac Surgery (ACS) and Percutaneous Coronary Intervention (PCI).
RESULTS
In general, the overdispersion-corrected common-mean model and the random effects approaches produced similar p-values for the detection of outliers. For the ACS dataset (41 hospitals) three outliers were identified in total but only one was identified by all methods above. For the PCI dataset (88 hospitals), seven outliers were identified in total but only two were identified by all methods. The common-mean model uncorrected for overdispersion produced several more outliers. The reason for observing similar p-values for all three approaches could be attributed to the fact that the between-hospital variance was relatively small in both datasets, resulting only in a mild violation of the common-mean assumption; in this situation, the overdispersion correction worked well.
CONCLUSION
If the common-mean assumption is likely to hold, all three methods are appropriate to use for outlier detection and their results should be similar. Random effect methods may be the preferred approach when the common-mean assumption is likely to be violated.
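To make the common-mean model and its overdispersion correction concrete, here is a small sketch. It assumes a Poisson approximation for observed versus case-mix-expected event counts and a simple multiplicative correction (dividing z-scores by the square root of the mean squared z-score). The abstract does not specify which correction was used, so both the formulae and the numbers are illustrative assumptions.

```python
from math import sqrt

def common_mean_z(observed, expected):
    """z-score per unit under the common-mean model (Poisson approximation)."""
    return [(o - e) / sqrt(e) for o, e in zip(observed, expected)]

def overdispersion_factor(z):
    """Mean squared z-score; values above 1 suggest overdispersion."""
    return sum(v * v for v in z) / len(z)

def corrected_z(z):
    """Shrink z-scores by sqrt(phi) when phi > 1 (assumed multiplicative correction)."""
    phi = overdispersion_factor(z)
    scale = sqrt(phi) if phi > 1 else 1.0
    return [v / scale for v in z]

# Hypothetical units: observed events and case-mix-adjusted expected events.
observed = [10, 20, 30, 55]
expected = [12, 18, 30, 40]

z = common_mean_z(observed, expected)
z_adj = corrected_z(z)

# Two-sided 5% limits: a unit is flagged when |z| exceeds 1.96.
flagged_uncorrected = [abs(v) > 1.96 for v in z]
flagged_corrected = [abs(v) > 1.96 for v in z_adj]
```

In this toy example the last unit is flagged before correction but not after, which mirrors the abstract's observation that the uncorrected common-mean model produces more outliers.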
Topics: Humans; Percutaneous Coronary Intervention; Hospitals; Risk Adjustment; Logistic Models; England
PubMed: 36627627
DOI: 10.1186/s12913-022-08995-z -
Statistical Methods in Medical Research May 2022
The extraction of novel information from omics data is a challenging task, in particular, since the number of features (e.g. genes) often far exceeds the number of samples. In such a setting, conventional parameter estimation leads to ill-posed optimization problems, and regularization may be required. In addition, outliers can largely impact classification accuracy. Here we introduce ROSIE, an ensemble classification approach, which combines three sparse and robust classification methods for outlier detection and feature selection and further performs a bootstrap-based validity check. Outliers of ROSIE are determined by the rank product test using outlier rankings of all three methods, and important features are selected as features commonly selected by all methods. We apply ROSIE to RNA-Seq data from The Cancer Genome Atlas (TCGA) to classify observations into Triple-Negative Breast Cancer (TNBC) and non-TNBC tissue samples. The pre-processed dataset consists of genes and more than samples. We demonstrate that ROSIE selects important features and outliers in a robust way. Identified outliers are concordant with the distribution of the commonly selected genes by the three methods, and results are in line with other independent studies. Furthermore, we discuss the association of some of the selected genes with the TNBC subtype in other investigations. In summary, ROSIE constitutes a robust and sparse procedure to identify outliers and important genes through binary classification. Our approach is ad hoc applicable to other datasets, fulfilling the overall goal of simultaneously identifying outliers and candidate disease biomarkers to be targeted in therapy research and personalized medicine frameworks.
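The rank product idea used to combine outlier rankings can be sketched as below. This shows only the statistic (the geometric mean of a sample's ranks across methods), not the significance test or any part of ROSIE itself. The method names, sample identifiers, and ranks are hypothetical.

```python
from math import prod

def rank_product(rankings):
    """Geometric mean of each sample's ranks across methods.

    rankings: list of dicts mapping sample id -> rank (1 = most outlying).
    """
    k = len(rankings)
    samples = rankings[0].keys()
    return {s: prod(r[s] for r in rankings) ** (1.0 / k) for s in samples}

# Hypothetical outlier rankings from three classification methods.
method_a = {"s1": 3, "s2": 2, "s3": 1, "s4": 4}
method_b = {"s1": 4, "s2": 3, "s3": 1, "s4": 2}
method_c = {"s1": 3, "s2": 1, "s3": 2, "s4": 4}

rp = rank_product([method_a, method_b, method_c])

# Samples with the smallest rank product are consistently ranked as outlying.
ordered = sorted(rp, key=rp.get)
```

A sample ranked near the top by every method gets a small rank product, while a sample that only one method singles out does not.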
Topics: Humans; Triple Negative Breast Neoplasms
PubMed: 35072570
DOI: 10.1177/09622802211072456 -
Scientific Reports Sep 2023
Subspace outlier detection has emerged as a practical approach for outlier detection. Classical full space outlier detection methods become ineffective in high dimensional data due to the "curse of dimensionality". Subspace outlier detection methods have great potential to overcome the problem. However, the challenge becomes how to determine which subspaces to use for outlier detection among the huge number of possible subspaces. In this paper, we first propose an intuitive definition of outliers in subspaces. We study the desirable properties of subspaces for outlier detection and investigate the metrics for those properties. Then, a novel subspace outlier detection algorithm with a statistical foundation is proposed. Our method selectively leverages a limited set of the most interesting subspaces for outlier detection. Through experimental validation, we demonstrate that identifying outliers within this reduced set of highly interesting subspaces yields significantly higher accuracy compared to analyzing the entire feature space. We show by experiments that the proposed method outperforms competing subspace outlier detection approaches on real world data sets.
PubMed: 37714878
DOI: 10.1038/s41598-023-42261-4 -
IEEE Transactions on Image Processing :... Feb 2016
A critical step in cryogenic electron microscopy (cryo-EM) image analysis is to calculate the average of all images aligned to a projection direction. This average, called the class mean, improves the signal-to-noise ratio in single-particle reconstruction. The averaging step is often compromised because of the outlier images of ice, contaminants, and particle fragments. Outlier detection and rejection in the majority of current cryo-EM methods are done using cross-correlation with a manually determined threshold. Empirical assessment shows that the performance of these methods is very sensitive to the threshold. This paper proposes an alternative: a w-estimator of the average image, which is robust to outliers and which does not use a threshold. Various properties of the estimator, such as consistency and influence function are investigated. An extension of the estimator to images with different contrast transfer functions is also provided. Experiments with simulated and real cryo-EM images show that the proposed estimator performs quite well in the presence of outliers.
Topics: Computer Simulation; Cryoelectron Microscopy; Image Processing, Computer-Assisted; Imaging, Three-Dimensional; Proteins; Signal-To-Noise Ratio
PubMed: 26841397
DOI: 10.1109/TIP.2015.2512384 -
Journal of Healthcare Quality Research 2022
OBJECTIVES
The objective is to describe and analyze how outlier admission influences hospital stay and the appearance of complications in patients with a femoral neck fracture treated with arthroplasty.
MATERIAL AND METHOD
A historical cohort study was carried out in patients with a displaced fracture of the femoral neck. The exposed cohort comprised patients who had an outlier admission, that is, those admitted to a hospitalization area not belonging to the Orthopedic Surgery and Traumatology department. The unexposed cohort comprised patients admitted to a hospitalization area assigned to the Orthopedic Surgery and Traumatology department.
RESULTS
Outlier admission was a risk factor for requiring a postoperative transfusion (RR 1.52, 95% CI 1.05-2.21; P=.035), for having a postoperative stay longer than 5 days (RR 1.35, 95% CI 1.04-1.74; P=.038), and for suffering general postoperative complications (RR 1.35, 95% CI 1.02-1.78; P=.048).
CONCLUSIONS
Outlier admission is a threat to the quality and safety of health care. In patients over 80 years of age, being a medical outlier is a risk factor for postoperative transfusion and systemic postoperative complications.
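For readers unfamiliar with how a relative risk and its confidence interval are obtained from cohort counts, a brief sketch follows. It uses the standard log-scale (Wald) interval. The counts are hypothetical and are not the study's data, and the abstract does not state which interval method the authors used.

```python
from math import exp, log, sqrt

def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed, z=1.96):
    """Relative risk with an approximate 95% CI computed on the log scale."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    rr = risk_exposed / risk_unexposed
    # Standard error of ln(RR) for two independent binomial proportions.
    se = sqrt(1 / events_exposed - 1 / n_exposed
              + 1 / events_unexposed - 1 / n_unexposed)
    lower = exp(log(rr) - z * se)
    upper = exp(log(rr) + z * se)
    return rr, lower, upper

# Hypothetical cohort: 30 of 100 outlier admissions versus 20 of 100
# ward admissions experienced the outcome.
rr, lower, upper = relative_risk(30, 100, 20, 100)
```

An interval that excludes 1, as in the results reported above, indicates an association at the 5% level. In this hypothetical example the interval includes 1.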
Topics: Humans; Aged, 80 and over; Femoral Neck Fractures; Cohort Studies; Length of Stay; Postoperative Complications; Risk Factors
PubMed: 35654723
DOI: 10.1016/j.jhqr.2022.02.012 -
BMC Bioinformatics Jun 2020
BACKGROUND
High throughput RNA sequencing is a powerful approach to study gene expression. Due to the complex multi-step protocols in data acquisition, extreme deviation of a sample from samples of the same treatment group may occur due to technical variation or true biological differences. The high dimensionality of the data, combined with few biological replicates, makes it challenging to accurately detect those samples, and this issue is currently not well studied in the literature. Robust statistics is a family of theories and techniques that aim to detect outliers by first fitting the majority of the data and then flagging data points that deviate from it. Robust statistics have been widely used in multivariate data analysis for outlier detection in chemometrics and engineering. Here we apply robust statistics to RNA-seq data analysis.
RESULTS
We report the use of two robust principal component analysis (rPCA) methods, PcaHubert and PcaGrid, to detect outlier samples in multiple simulated and real biological RNA-seq data sets with positive control outlier samples. PcaGrid achieved 100% sensitivity and 100% specificity in all the tests using positive control outliers with varying degrees of divergence. We applied rPCA methods and classical principal component analysis (cPCA) to an RNA-Seq data set profiling gene expression of the external granule layer in the cerebellum of control and conditional SnoN knockout mice. Both rPCA methods detected the same two outlier samples but cPCA failed to detect any. We performed differentially expressed gene detection before and after outlier removal as well as with and without batch effect modeling. We validated gene expression changes using quantitative reverse transcription PCR and used the result as a reference to compare the performance of eight different data analysis strategies. Removing outliers without batch effect modeling performed the best in terms of detecting biologically relevant differentially expressed genes.
CONCLUSIONS
rPCA implemented in the PcaGrid function is an accurate and objective method to detect outlier samples. It is well suited for high-dimensional data with small sample sizes like RNA-seq data. Outlier removal can significantly improve the performance of differential gene detection and downstream functional analysis.
Topics: Animals; Cerebellum; Female; Male; Mice, Knockout; Principal Component Analysis; Proto-Oncogene Proteins; RNA-Seq; Reverse Transcriptase Polymerase Chain Reaction
PubMed: 32600248
DOI: 10.1186/s12859-020-03608-0 -
NeuroImage. Clinical 2021
OBJECTIVE
Focal cortical dysplasias (FCDs) are a common cause of apparently non-lesional drug-resistant focal epilepsy. Visual detection of subtle FCDs on MRI is clinically important and often challenging. In this study, we implement a set of 3D local image filters adapted from computer vision applications to characterize the appearance of normal cortex surrounding the gray-white junction. We create a normative model to serve as the basis for a novel multivariate constrained outlier approach to automated FCD detection.
METHODS
Standardized MPRAGE, T and FLAIR MR images were obtained in 15 patients with radiologically or histologically diagnosed FCDs and 30 healthy volunteers. Multiscale 3D local image filters were computed for each MR contrast then sampled onto the gray-white junction surface. Using an iterative Gaussianization procedure, we created a normative model of cortical variability in healthy volunteers, allowing for identification of outlier regions and estimates of similarity in normal cortex and FCD lesions. We used a constrained outlier approach following local normalization to automatically detect FCD lesions based on projection onto the mean FCD feature vector.
RESULTS
FCDs as well as some normal cortical regions such as primary sensorimotor and paralimbic regions appear as outliers. Regions such as the paralimbic regions and the anterior insula have similar features to FCDs. Our constrained outlier approach allows for automated FCD detection with 80% sensitivity and 70% specificity.
SIGNIFICANCE
A normative model using multiscale local image filters can be used to describe the normal cortical variability. Although FCDs appear similar to some cortical regions such as the anterior insula and paralimbic cortices, they can be identified using a constrained outlier detection approach. Our method for detecting outliers and estimating similarity is generic and could be extended to identification of other types of lesions or atypical cortical areas.
Topics: Epilepsy; Humans; Imaging, Three-Dimensional; Magnetic Resonance Imaging; Malformations of Cortical Development; Malformations of Cortical Development, Group I
PubMed: 33556791
DOI: 10.1016/j.nicl.2021.102565 -
Medical Physics Nov 2017
PURPOSE
The purpose of this study was to apply statistical metrics to identify outliers and to investigate the impact of outliers on knowledge-based planning in radiation therapy of pelvic cases. We also aimed to develop a systematic workflow for identifying and analyzing geometric and dosimetric outliers.
METHODS
Four groups (G1-G4) of pelvic plans were sampled in this study. These include the following three groups of clinical IMRT cases: G1 (37 prostate cases), G2 (37 prostate plus lymph node cases) and G3 (37 prostate bed cases). Cases in G4 were planned in accordance with dynamic-arc radiation therapy procedure and include 10 prostate cases in addition to those from G1. The workflow was separated into two parts: 1. identifying geometric outliers, assessing outlier impact, and outlier cleaning; 2. identifying dosimetric outliers, assessing outlier impact, and outlier cleaning. G2 and G3 were used to analyze the effects of geometric outliers (first experiment outlined below) while G1 and G4 were used to analyze the effects of dosimetric outliers (second experiment outlined below). A baseline model was trained by regarding all G2 cases as inliers. G3 cases were then individually added to the baseline model as geometric outliers. The impact on the model was assessed by comparing leverages of inliers (G2) and outliers (G3). A receiver-operating-characteristic (ROC) analysis was performed to determine the optimal threshold. The experiment was repeated by training the baseline model with all G3 cases as inliers and perturbing the model with G2 cases as outliers. A separate baseline model was trained with 32 G1 cases. Each G4 case (dosimetric outlier) was subsequently added to perturb the model. Predictions of dose-volume histograms (DVHs) were made using these perturbed models for the remaining 5 G1 cases. A Weighted Sum of Absolute Residuals (WSAR) was used to evaluate the impact of the dosimetric outliers.
RESULTS
The leverage of inliers and outliers was significantly different. The Area-Under-Curve (AUC) for differentiating G2 (outliers) from G3 (inliers) was 0.98 (threshold: 0.27) for the bladder and 0.81 (threshold: 0.11) for the rectum. For differentiating G3 (outliers) from G2 (inliers), the AUC (threshold) was 0.86 (0.11) for the bladder and 0.71 (0.11) for the rectum. A significant increase in WSAR was observed in the model with 3 dosimetric outliers for the bladder (P < 0.005 with Bonferroni correction), and in the model with only 1 dosimetric outlier for the rectum (P < 0.005).
CONCLUSIONS
We established a systematic workflow for identifying and analyzing geometric and dosimetric outliers, and investigated statistical metrics for outlier detection. Results validated the necessity for outlier detection and clean-up to enhance model quality in clinical practice.
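The ROC analysis of leverage values can be illustrated with a short sketch. It computes the AUC as the probability that a randomly chosen outlier has a higher score than a randomly chosen inlier, with ties counted as one half. The leverage values below are hypothetical and are not drawn from the study.

```python
def auc(outlier_scores, inlier_scores):
    """AUC as the fraction of (outlier, inlier) pairs ranked correctly."""
    wins = 0.0
    for o in outlier_scores:
        for i in inlier_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5  # ties contribute half a win
    return wins / (len(outlier_scores) * len(inlier_scores))

# Hypothetical leverage values for geometric outliers and inliers.
outlier_leverage = [0.35, 0.29, 0.12]
inlier_leverage = [0.05, 0.08, 0.10, 0.15]

separation = auc(outlier_leverage, inlier_leverage)
```

An AUC near 1 means that leverage separates the two groups almost perfectly, while a value near 0.5 means it does no better than chance.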
Topics: Algorithms; Humans; Male; Organs at Risk; Pelvis; Prostatic Neoplasms; Radiometry; Radiotherapy Dosage; Radiotherapy Planning, Computer-Assisted
PubMed: 28869649
DOI: 10.1002/mp.12556