-
Genomics Jan 2021The ΔΔct method estimates fold change in gene expression data from RT-PCR assay. The ΔΔct estimate aggregates replicates using mean and standard deviation (sd) and...
The ΔΔct method estimates fold change in gene expression data from RT-PCR assay. The ΔΔct estimate aggregates replicates using mean and standard deviation (sd) and is not robust to outliers which are in practice often removed before the non-outlying replicates are aggregated. The alternative of using robust statistics such as median and median absolute deviation (MAD) to aggregate the replicates is not done in practice perhaps because the distribution of a robust ΔΔct estimate based on median and MAD is not straightforward to deduce. We introduce a robust ΔΔct estimate and deduce an approximate distribution for it. Simulations show that when data has outliers, the robust ΔΔct estimate compared to the non-robust ΔΔct estimate leads to significantly reduced confidence interval length and a coverage close to the nominal coverage. The analysis of an RT-PCR data from a Novartis clinical trial demonstrates benefit of a robust ΔΔct estimate.
Topics: Algorithms; Biomarkers, Tumor; Clinical Trials as Topic; Gene Expression Profiling; Humans; Real-Time Polymerase Chain Reaction; Reference Standards
PubMed: 33309766
DOI: 10.1016/j.ygeno.2020.12.009 -
Briefings in Bioinformatics Mar 2019For the risk, progression, and response to treatment of many complex diseases, it has been increasingly recognized that genetic interactions (including gene-gene and... (Review)
Review
For the risk, progression, and response to treatment of many complex diseases, it has been increasingly recognized that genetic interactions (including gene-gene and gene-environment interactions) play important roles beyond the main genetic and environmental effects. In practical genetic interaction analyses, model mis-specification and outliers/contaminations in response variables and covariates are not uncommon, and demand robust analysis methods. Compared with their nonrobust counterparts, robust genetic interaction analysis methods are significantly less popular but are gaining attention fast. In this article, we provide a comprehensive review of robust genetic interaction analysis methods, on their methodologies and applications, for both marginal and joint analysis, and for addressing model mis-specification as well as outliers/contaminations in response variables and covariates.
Topics: Epistasis, Genetic; Gene-Environment Interaction; Humans; Models, Genetic
PubMed: 29897421
DOI: 10.1093/bib/bby033 -
The VLDB Journal : Very Large Data... 2022While many techniques for outlier detection have been proposed in the literature, the interpretation of detected outliers is often left to users. As a result, it is...
While many techniques for outlier detection have been proposed in the literature, the interpretation of detected outliers is often left to users. As a result, it is difficult for users to promptly take appropriate actions concerning the detected outliers. To lessen this difficulty, when outliers are identified, they should be presented together with their explanations. There are survey papers on outlier detection, but none exists for outlier explanations. To fill this gap, in this paper, we present a survey on outlier explanations in which meaningful knowledge is mined from anomalous data to explain them. We define different types of outlier explanations and discuss the challenges in generating each type. We review the existing outlier explanation techniques and discuss how they address the challenges. We also discuss the applications of outlier explanations and review the existing methods used to evaluate outlier explanations. Furthermore, we discuss possible future research directions.
PubMed: 35095253
DOI: 10.1007/s00778-021-00721-1 -
Scientific Reports Feb 2023Outlier detection is an important topic in machine learning and has been used in a wide range of applications. Outliers are objects that are few in number and deviate...
Outlier detection is an important topic in machine learning and has been used in a wide range of applications. Outliers are objects that are few in number and deviate from the majority of objects. As a result of these two properties, we show that outliers are susceptible to a mechanism called fluctuation. This article proposes a method called fluctuation-based outlier detection (FBOD) that achieves a low linear time complexity and detects outliers purely based on the concept of fluctuation without employing any distance, density or isolation measure. Fundamentally different from all existing methods. FBOD first converts the Euclidean structure datasets into graphs by using random links, then propagates the feature value according to the connection of the graph. Finally, by comparing the difference between the fluctuation of an object and its neighbors, FBOD determines the object with a larger difference as an outlier. The results of experiments comparing FBOD with eight state-of-the-art algorithms on eight real-worlds tabular datasets and three video datasets show that FBOD outperforms its competitors in the majority of cases and that FBOD has only 5% of the execution time of the fastest algorithm. The experiment codes are available at: https://github.com/FluctuationOD/Fluctuation-based-Outlier-Detection .
PubMed: 36765095
DOI: 10.1038/s41598-023-29549-1 -
Biometrika Sep 2017In high-dimensional multivariate regression problems, enforcing low rank in the coefficient matrix offers effective dimension reduction, which greatly facilitates...
In high-dimensional multivariate regression problems, enforcing low rank in the coefficient matrix offers effective dimension reduction, which greatly facilitates parameter estimation and model interpretation. However, commonly used reduced-rank methods are sensitive to data corruption, as the low-rank dependence structure between response variables and predictors is easily distorted by outliers. We propose a robust reduced-rank regression approach for joint modelling and outlier detection. The problem is formulated as a regularized multivariate regression with a sparse mean-shift parameterization, which generalizes and unifies some popular robust multivariate methods. An efficient thresholding-based iterative procedure is developed for optimization. We show that the algorithm is guaranteed to converge and that the coordinatewise minimum point produced is statistically accurate under regularity conditions. Our theoretical investigations focus on non-asymptotic robust analysis, demonstrating that joint rank reduction and outlier detection leads to improved prediction accuracy. In particular, we show that redescending [Formula: see text]-functions can essentially attain the minimax optimal error rate, and in some less challenging problems convex regularization guarantees the same low error rate. The performance of the proposed method is examined through simulation studies and real-data examples.
PubMed: 29430036
DOI: 10.1093/biomet/asx032 -
Entropy (Basel, Switzerland) Jun 2022In this article, we evaluate the efficiency and performance of two clustering algorithms: AHC (Agglomerative Hierarchical Clustering) and K-Means. We are aware that...
In this article, we evaluate the efficiency and performance of two clustering algorithms: AHC (Agglomerative Hierarchical Clustering) and K-Means. We are aware that there are various linkage options and distance measures that influence the clustering results. We assess the quality of clustering using the Davies-Bouldin and Dunn cluster validity indexes. The main contribution of this research is to verify whether the quality of clusters without outliers is higher than those with outliers in the data. To do this, we compare and analyze outlier detection algorithms depending on the applied clustering algorithm. In our research, we use and compare the LOF (Local Outlier Factor) and COF (Connectivity-based Outlier Factor) algorithms for detecting outliers before and after removing 1%, 5%, and 10% of outliers. Next, we analyze how the quality of clustering has improved. In the experiments, three real data sets were used with a different number of instances.
PubMed: 35885141
DOI: 10.3390/e24070917 -
Journal of the Royal Statistical... Apr 2022We consider the multi-class classification problem when the training data and the out-of-sample test data may have different distributions and propose a method called...
We consider the multi-class classification problem when the training data and the out-of-sample test data may have different distributions and propose a method called BCOPS (balanced and conformal optimized prediction sets). BCOPS constructs a prediction set () as a subset of class labels, possibly empty. It tries to optimize the out-of-sample performance, aiming to include the correct class and to detect outliers as often as possible. BCOPS returns no prediction (corresponding to () equal to the empty set) if it infers to be an outlier. The proposed method combines supervised learning algorithms with conformal prediction to minimize a misclassification loss averaged over the out-of-sample distribution. The constructed prediction sets have a finite sample coverage guarantee without distributional assumptions. We also propose a method to estimate the outlier detection rate of a given procedure. We prove asymptotic consistency and optimality of our proposals under suitable assumptions and illustrate our methods on real data examples.
PubMed: 35910400
DOI: 10.1111/rssb.12443 -
Biometrics Dec 2022Biomedical research is increasingly data rich, with studies comprising ever growing numbers of features. The larger a study, the higher the likelihood that a substantial...
Biomedical research is increasingly data rich, with studies comprising ever growing numbers of features. The larger a study, the higher the likelihood that a substantial portion of the features may be redundant and/or contain contamination (outlying values). This poses serious challenges, which are exacerbated in cases where the sample sizes are relatively small. Effective and efficient approaches to perform sparse estimation in the presence of outliers are critical for these studies, and have received considerable attention in the last decade. We contribute to this area considering high-dimensional regressions contaminated by multiple mean-shift outliers affecting both the response and the design matrix. We develop a general framework and use mixed-integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We prove theoretical properties for our approach, that is, a necessary and sufficient condition for the robustly strong oracle property, where the number of features can increase exponentially with the sample size; the optimal estimation of parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and warm-start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through simulations and use it to study the relationships between childhood obesity and the human microbiome.
Topics: Child; Humans; Pediatric Obesity; Algorithms; Sample Size; Probability
PubMed: 34437713
DOI: 10.1111/biom.13553 -
Nutricion Hospitalaria Feb 2015When performing nutritional epidemiology studies, missing values and outliers inevitably appear. Missing values appear, for example, because of the difficulty in... (Review)
Review
When performing nutritional epidemiology studies, missing values and outliers inevitably appear. Missing values appear, for example, because of the difficulty in collecting data in dietary surveys, leading to a lack of data on the amounts of foods consumed or a poor description of these foods. Inadequate treatment during the data processing stage can create biases and loss of accuracy and, consequently, misinterpretation of the results. The objective of this article is to provide some recommendations about the treatment of missing and outlier data, and orientation regarding existing software for the determination of sample sizes and for performing statistical analysis. Some recommendations about data collection are provided as an important previous step in any nutritional research. We discuss methods used for dealing with missing values, especially the case deletion method, simple imputation and multiple imputation, with indications and examples. Identification, impact on statistical analysis and options available for adequate treatment of outlier values are explained, including some illustrative examples. Finally, the current software that totally or partially addresses the questions treated is mentioned, especially the free software available.
Topics: Data Interpretation, Statistical; Epidemiologic Methods; Epidemiology; Humans; Nutrition Surveys
PubMed: 25719786
DOI: 10.3305/nh.2015.31.sup3.8766 -
BMC Medical Research Methodology Oct 2023Growth studies rely on longitudinal measurements, typically represented as trajectories. However, anthropometry is prone to errors that can generate outliers. While...
BACKGROUND
Growth studies rely on longitudinal measurements, typically represented as trajectories. However, anthropometry is prone to errors that can generate outliers. While various methods are available for detecting outlier measurements, a gold standard has yet to be identified, and there is no established method for outlying trajectories. Thus, outlier types and their effects on growth pattern detection still need to be investigated. This work aimed to assess the performance of six methods at detecting different types of outliers, propose two novel methods for outlier trajectory detection and evaluate how outliers affect growth pattern detection.
METHODS
We included 393 healthy infants from The Applied Research Group for Kids (TARGet Kids!) cohort and 1651 children with severe malnutrition from the co-trimoxazole prophylaxis clinical trial. We injected outliers of three types and six intensities and applied four outlier detection methods for measurements (model-based and World Health Organization cut-offs-based) and two for trajectories. We also assessed growth pattern detection before and after outlier injection using time series clustering and latent class mixed models. Error type, intensity, and population affected method performance.
RESULTS
Model-based outlier detection methods performed best for measurements with precision between 5.72-99.89%, especially for low and moderate error intensities. The clustering-based outlier trajectory method had high precision of 14.93-99.12%. Combining methods improved the detection rate to 21.82% in outlier measurements. Finally, when comparing growth groups with and without outliers, the outliers were shown to alter group membership by 57.9 -79.04%.
CONCLUSIONS
World Health Organization cut-off-based techniques were shown to perform well in few very particular cases (extreme errors of high intensity), while model-based techniques performed well, especially for moderate errors of low intensity. Clustering-based outlier trajectory detection performed exceptionally well across all types and intensities of errors, indicating a potential strategic change in how outliers in growth data are viewed. Finally, the importance of detecting outliers was shown, given its impact on children growth studies, as demonstrated by comparing results of growth group detection.
Topics: Child; Humans; Cluster Analysis; Research Design; Infant; Child Development
PubMed: 37833647
DOI: 10.1186/s12874-023-02045-w