-
Scientific Reports Feb 2023
Outlier detection is an important topic in machine learning and has been used in a wide range of applications. Outliers are objects that are few in number and deviate from the majority of objects. As a result of these two properties, we show that outliers are susceptible to a mechanism called fluctuation. This article proposes a method called fluctuation-based outlier detection (FBOD) that achieves a low linear time complexity and detects outliers purely based on the concept of fluctuation without employing any distance, density or isolation measure. Fundamentally different from all existing methods, FBOD first converts the Euclidean structure datasets into graphs by using random links, then propagates the feature value according to the connection of the graph. Finally, by comparing the difference between the fluctuation of an object and its neighbors, FBOD determines the object with a larger difference as an outlier. The results of experiments comparing FBOD with eight state-of-the-art algorithms on eight real-world tabular datasets and three video datasets show that FBOD outperforms its competitors in the majority of cases and that FBOD requires only 5% of the execution time of the fastest competing algorithm. The experiment codes are available at: https://github.com/FluctuationOD/Fluctuation-based-Outlier-Detection .
PubMed: 36765095
DOI: 10.1038/s41598-023-29549-1
-
The VLDB Journal : Very Large Data... 2022
While many techniques for outlier detection have been proposed in the literature, the interpretation of detected outliers is often left to users. As a result, it is difficult for users to promptly take appropriate actions concerning the detected outliers. To lessen this difficulty, when outliers are identified, they should be presented together with their explanations. There are survey papers on outlier detection, but none exists for outlier explanations. To fill this gap, in this paper, we present a survey on outlier explanations in which meaningful knowledge is mined from anomalous data to explain them. We define different types of outlier explanations and discuss the challenges in generating each type. We review the existing outlier explanation techniques and discuss how they address the challenges. We also discuss the applications of outlier explanations and review the existing methods used to evaluate outlier explanations. Furthermore, we discuss possible future research directions.
PubMed: 35095253
DOI: 10.1007/s00778-021-00721-1
-
Genomics Jan 2021
The ΔΔct method estimates fold change in gene expression data from RT-PCR assays. The ΔΔct estimate aggregates replicates using the mean and standard deviation (sd) and is not robust to outliers, which in practice are often removed before the non-outlying replicates are aggregated. The alternative of using robust statistics such as the median and median absolute deviation (MAD) to aggregate the replicates is not used in practice, perhaps because the distribution of a robust ΔΔct estimate based on the median and MAD is not straightforward to deduce. We introduce a robust ΔΔct estimate and deduce an approximate distribution for it. Simulations show that when data have outliers, the robust ΔΔct estimate, compared to the non-robust ΔΔct estimate, leads to significantly reduced confidence interval length and a coverage close to the nominal coverage. The analysis of RT-PCR data from a Novartis clinical trial demonstrates the benefit of a robust ΔΔct estimate.
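The contrast between mean-based and median-based aggregation can be illustrated with a minimal Python sketch. It assumes the standard ΔΔct definition (target minus reference Ct in the treated group, minus the same difference in the control group, with fold change 2^(-ΔΔct)); the function names and the replicate values are hypothetical, and this is not the estimator or the approximate distribution derived in the paper.

```python
from statistics import mean, median

def mad(values):
    """Median absolute deviation, scaled by 1.4826 to be comparable to the sd under normality."""
    med = median(values)
    return 1.4826 * median(abs(v - med) for v in values)

def delta_delta_ct(target_treated, ref_treated, target_control, ref_control, center=mean):
    """Aggregate each group of replicate Ct values with `center`, then form the double difference."""
    d_treated = center(target_treated) - center(ref_treated)
    d_control = center(target_control) - center(ref_control)
    return d_treated - d_control

def fold_change(ddct):
    """Convert a delta-delta-Ct value to a fold change."""
    return 2.0 ** (-ddct)

# Hypothetical replicate Ct values; 27.0 is an injected outlier.
target_treated = [22.0, 22.1, 21.9, 27.0]
ref_treated = [18.0, 18.1, 17.9, 18.0]
target_control = [24.0, 24.1, 23.9, 24.0]
ref_control = [18.0, 18.0, 18.1, 17.9]

groups = (target_treated, ref_treated, target_control, ref_control)
ddct_mean = delta_delta_ct(*groups, center=mean)
ddct_median = delta_delta_ct(*groups, center=median)

print("mean-based fold change:  ", fold_change(ddct_mean))
print("median-based fold change:", fold_change(ddct_median))
print("MAD of treated target replicates:", mad(target_treated))
```

In this toy data the single outlying replicate pulls the mean-based estimate well away from the value suggested by the remaining replicates, while the median-based estimate barely moves.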
Topics: Algorithms; Biomarkers, Tumor; Clinical Trials as Topic; Gene Expression Profiling; Humans; Real-Time Polymerase Chain Reaction; Reference Standards
PubMed: 33309766
DOI: 10.1016/j.ygeno.2020.12.009
-
Briefings in Bioinformatics Mar 2019 (Review)
For the risk, progression, and response to treatment of many complex diseases, it has been increasingly recognized that genetic interactions (including gene-gene and gene-environment interactions) play important roles beyond the main genetic and environmental effects. In practical genetic interaction analyses, model mis-specification and outliers/contaminations in response variables and covariates are not uncommon, and demand robust analysis methods. Compared with their nonrobust counterparts, robust genetic interaction analysis methods are significantly less popular but are gaining attention fast. In this article, we provide a comprehensive review of robust genetic interaction analysis methods, on their methodologies and applications, for both marginal and joint analysis, and for addressing model mis-specification as well as outliers/contaminations in response variables and covariates.
Topics: Epistasis, Genetic; Gene-Environment Interaction; Humans; Models, Genetic
PubMed: 29897421
DOI: 10.1093/bib/bby033
-
Biometrika Sep 2017
In high-dimensional multivariate regression problems, enforcing low rank in the coefficient matrix offers effective dimension reduction, which greatly facilitates parameter estimation and model interpretation. However, commonly used reduced-rank methods are sensitive to data corruption, as the low-rank dependence structure between response variables and predictors is easily distorted by outliers. We propose a robust reduced-rank regression approach for joint modelling and outlier detection. The problem is formulated as a regularized multivariate regression with a sparse mean-shift parameterization, which generalizes and unifies some popular robust multivariate methods. An efficient thresholding-based iterative procedure is developed for optimization. We show that the algorithm is guaranteed to converge and that the coordinatewise minimum point produced is statistically accurate under regularity conditions. Our theoretical investigations focus on non-asymptotic robust analysis, demonstrating that joint rank reduction and outlier detection leads to improved prediction accuracy. In particular, we show that redescending ψ-functions can essentially attain the minimax optimal error rate, and in some less challenging problems convex regularization guarantees the same low error rate. The performance of the proposed method is examined through simulation studies and real-data examples.
PubMed: 29430036
DOI: 10.1093/biomet/asx032
-
Entropy (Basel, Switzerland) Jun 2022
In this article, we evaluate the efficiency and performance of two clustering algorithms: AHC (Agglomerative Hierarchical Clustering) and K-Means. We are aware that there are various linkage options and distance measures that influence the clustering results. We assess the quality of clustering using the Davies-Bouldin and Dunn cluster validity indexes. The main contribution of this research is to verify whether the quality of clusters without outliers is higher than those with outliers in the data. To do this, we compare and analyze outlier detection algorithms depending on the applied clustering algorithm. In our research, we use and compare the LOF (Local Outlier Factor) and COF (Connectivity-based Outlier Factor) algorithms for detecting outliers before and after removing 1%, 5%, and 10% of outliers. Next, we analyze how the quality of clustering has improved. In the experiments, three real data sets were used with a different number of instances.
PubMed: 35885141
DOI: 10.3390/e24070917
-
The Canadian Journal of Nursing... Mar 2019 (Review)
The presence of statistical outliers is a shared concern in research. If ignored or improperly handled, outliers have the potential to distort the estimate of the parameter of interest and thus compromise the generalizability of research findings. A variety of statistical techniques are available to assist researchers with the identification and management of outlier cases. The purpose of this paper is to provide a conceptual overview of univariate outliers with special focus on common techniques used to detect and manage univariate outliers. Specifically, this paper discusses the use of histograms, boxplots, interquartile range, and z-score analysis as common univariate outlier identification techniques. The paper also discusses the outlier management techniques of deletion, substitution, and transformation.
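Two of the identification techniques named above, the interquartile range rule and z-score analysis, can be sketched in a few lines of Python. This is an illustrative sketch only: the 1.5 × IQR fences and the z-score cut-off are common conventions assumed here rather than values taken from the paper, and the function names and sample data are hypothetical.

```python
from statistics import mean, stdev, quantiles

def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score (using the sample sd) exceeds the threshold."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

# Hypothetical sample with one extreme case.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 45]

print("IQR rule flags:", iqr_outliers(sample))
print("z-score rule flags (threshold 2.5):", zscore_outliers(sample, threshold=2.5))
```

Because an extreme value inflates both the mean and the standard deviation it is judged against, the z-score cut-off matters in small samples; the quartile-based fences are less affected by the extreme case itself.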
Topics: Analysis of Variance; Data Interpretation, Statistical; Humans; Nursing Research
PubMed: 29969044
DOI: 10.1177/0844562118786647
-
Journal of the Royal Statistical... Apr 2022
We consider the multi-class classification problem when the training data and the out-of-sample test data may have different distributions and propose a method called BCOPS (balanced and conformal optimized prediction sets). BCOPS constructs a prediction set as a subset of class labels, possibly empty. It tries to optimize the out-of-sample performance, aiming to include the correct class and to detect outliers as often as possible. BCOPS returns no prediction (corresponding to an empty prediction set) if it infers the test observation to be an outlier. The proposed method combines supervised learning algorithms with conformal prediction to minimize a misclassification loss averaged over the out-of-sample distribution. The constructed prediction sets have a finite sample coverage guarantee without distributional assumptions. We also propose a method to estimate the outlier detection rate of a given procedure. We prove asymptotic consistency and optimality of our proposals under suitable assumptions and illustrate our methods on real data examples.
PubMed: 35910400
DOI: 10.1111/rssb.12443
-
BMC Medical Research Methodology Oct 2023
BACKGROUND
Growth studies rely on longitudinal measurements, typically represented as trajectories. However, anthropometry is prone to errors that can generate outliers. While various methods are available for detecting outlier measurements, a gold standard has yet to be identified, and there is no established method for outlying trajectories. Thus, outlier types and their effects on growth pattern detection still need to be investigated. This work aimed to assess the performance of six methods at detecting different types of outliers, propose two novel methods for outlier trajectory detection and evaluate how outliers affect growth pattern detection.
METHODS
We included 393 healthy infants from The Applied Research Group for Kids (TARGet Kids!) cohort and 1651 children with severe malnutrition from the co-trimoxazole prophylaxis clinical trial. We injected outliers of three types and six intensities and applied four outlier detection methods for measurements (model-based and World Health Organization cut-offs-based) and two for trajectories. We also assessed growth pattern detection before and after outlier injection using time series clustering and latent class mixed models. Error type, intensity, and population affected method performance.
RESULTS
Model-based outlier detection methods performed best for measurements, with precision between 5.72% and 99.89%, especially for low and moderate error intensities. The clustering-based outlier trajectory method had high precision of 14.93-99.12%. Combining methods improved the detection rate to 21.82% in outlier measurements. Finally, when comparing growth groups with and without outliers, the outliers were shown to alter group membership by 57.9-79.04%.
CONCLUSIONS
World Health Organization cut-off-based techniques were shown to perform well in a few very particular cases (extreme errors of high intensity), while model-based techniques performed well, especially for moderate errors of low intensity. Clustering-based outlier trajectory detection performed exceptionally well across all types and intensities of errors, indicating a potential strategic change in how outliers in growth data are viewed. Finally, the importance of detecting outliers was shown, given its impact on child growth studies, as demonstrated by comparing results of growth group detection.
Topics: Child; Humans; Cluster Analysis; Research Design; Infant; Child Development
PubMed: 37833647
DOI: 10.1186/s12874-023-02045-w
-
Journal of Applied Statistics 2022
Functional box plots satisfy two needs: visualization of functional data and the calculation of important box plot statistics. Data visualization illuminates key characteristics of functional sets missed by statistical tests and summary statistics. The calculation of box plot statistics for functional sets permits a novel comparison more suited to functional data. The functional box plot uses a depth method to visualize and rank smooth functional curves in terms of a mean, box, whiskers, and outliers. The functional box plot improves upon other classic functional data analysis tools such as functional principal components and discriminant analysis for outlier detection. This research adds wavelet analysis as a generating mechanism along with depth for functional box plots to visualize functional data and calculate relevant statistics. The wavelet analysis of variance (WANOVA) box plot tool gives competitive error rates in Gaussian test cases with magnitude outliers and outperforms the functional box plot for Gaussian test cases with shape outliers. Further, we show that wavelet analysis is well suited to approximating irregular and noisy functional data, and we show the enhanced capability of WANOVA box plots to classify shape outliers, which follow a different pattern than other functional data, for both simulated and real data instances.
PubMed: 36246858
DOI: 10.1080/02664763.2021.1951685