Genomics, Jan 2021
The ΔΔCt method estimates fold change in gene expression from RT-PCR assays. The ΔΔCt estimate aggregates replicates using the mean and standard deviation (sd) and is not robust to outliers, which in practice are often removed before the non-outlying replicates are aggregated. The alternative of aggregating replicates with robust statistics such as the median and median absolute deviation (MAD) is not used in practice, perhaps because the distribution of a robust ΔΔCt estimate based on the median and MAD is not straightforward to deduce. We introduce a robust ΔΔCt estimate and deduce an approximate distribution for it. Simulations show that when the data contain outliers, the robust ΔΔCt estimate, compared to the non-robust one, leads to significantly shorter confidence intervals with coverage close to the nominal level. Analysis of RT-PCR data from a Novartis clinical trial demonstrates the benefit of the robust ΔΔCt estimate.
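The mean/sd versus median/MAD aggregation contrast the abstract describes can be sketched in a few lines; the replicate values and the `fold_change` helper below are illustrative, not the paper's implementation:

```python
from statistics import mean, median

def fold_change(dct_treated, dct_control, robust=False):
    """Fold change 2**(-ddCt) from per-replicate dCt values.

    dCt = Ct(target) - Ct(reference) per replicate; robust=True aggregates
    replicates with the median instead of the mean.
    """
    agg = median if robust else mean
    ddct = agg(dct_treated) - agg(dct_control)
    return 2 ** (-ddct)

control = [5.1, 5.0, 4.9]
treated = [3.0, 3.1, 2.9, 9.0]   # 9.0 is an outlying replicate
print(round(fold_change(treated, control), 2))               # → 1.41 (mean, dragged by the outlier)
print(round(fold_change(treated, control, robust=True), 2))  # → 3.86 (median, unaffected)
```

A single outlying replicate roughly halves the mean-based fold-change estimate here, while the median-based estimate stays near the value computed from the clean replicates.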
Topics: Algorithms; Biomarkers, Tumor; Clinical Trials as Topic; Gene Expression Profiling; Humans; Real-Time Polymerase Chain Reaction; Reference Standards
PubMed: 33309766
DOI: 10.1016/j.ygeno.2020.12.009
Briefings in Bioinformatics, Mar 2019
Review
For the risk, progression, and response to treatment of many complex diseases, it has been increasingly recognized that genetic interactions (including gene-gene and gene-environment interactions) play important roles beyond the main genetic and environmental effects. In practical genetic interaction analyses, model mis-specification and outliers/contamination in response variables and covariates are not uncommon and demand robust analysis methods. Compared with their nonrobust counterparts, robust genetic interaction analysis methods are significantly less popular but are rapidly gaining attention. In this article, we provide a comprehensive review of the methodologies and applications of robust genetic interaction analysis methods, covering both marginal and joint analysis, and addressing model mis-specification as well as outliers/contamination in response variables and covariates.
Topics: Epistasis, Genetic; Gene-Environment Interaction; Humans; Models, Genetic
PubMed: 29897421
DOI: 10.1093/bib/bby033
The VLDB Journal: Very Large Data..., 2022
While many techniques for outlier detection have been proposed in the literature, the interpretation of detected outliers is often left to users. As a result, it is difficult for users to promptly take appropriate actions concerning the detected outliers. To lessen this difficulty, when outliers are identified, they should be presented together with their explanations. There are survey papers on outlier detection, but none exists for outlier explanations. To fill this gap, in this paper, we present a survey on outlier explanations in which meaningful knowledge is mined from anomalous data to explain them. We define different types of outlier explanations and discuss the challenges in generating each type. We review the existing outlier explanation techniques and discuss how they address the challenges. We also discuss the applications of outlier explanations and review the existing methods used to evaluate outlier explanations. Furthermore, we discuss possible future research directions.
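One minimal, hypothetical instance of such an explanation is per-feature attribution: rank the features by how far the flagged point deviates from the bulk on each one. The `explain_outlier` helper and the data below are illustrative only, not any surveyed technique:

```python
from statistics import mean, pstdev

def explain_outlier(data, point, top_k=2):
    """Rank features by |z-score| of `point` relative to `data`.

    Returns the indices of the top_k features on which the point deviates
    most from the bulk, as a crude outlier explanation.
    """
    scores = []
    for j in range(len(point)):
        col = [row[j] for row in data]
        mu, sigma = mean(col), pstdev(col)
        z = (point[j] - mu) / sigma if sigma > 0 else 0.0
        scores.append((j, abs(z)))
    scores.sort(key=lambda t: -t[1])
    return [j for j, _ in scores[:top_k]]

data = [[1.0, 10.0], [1.1, 10.2], [0.9, 9.8], [1.0, 10.1]]
outlier = [1.05, 25.0]   # anomalous only in feature 1
print(explain_outlier(data, outlier, top_k=1))  # → [1]
```

The output tells a user not just that the point is anomalous, but which feature drives the anomaly.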
PubMed: 35095253
DOI: 10.1007/s00778-021-00721-1
Scientific Reports, Feb 2023
Outlier detection is an important topic in machine learning and has been used in a wide range of applications. Outliers are objects that are few in number and deviate from the majority of objects. As a result of these two properties, we show that outliers are susceptible to a mechanism called fluctuation. This article proposes a method called fluctuation-based outlier detection (FBOD) that achieves low linear time complexity and detects outliers purely based on the concept of fluctuation, without employing any distance, density, or isolation measure, making it fundamentally different from all existing methods. FBOD first converts Euclidean datasets into graphs using random links, then propagates feature values along the connections of the graph. Finally, by comparing the difference between the fluctuation of an object and that of its neighbors, FBOD labels objects with larger differences as outliers. Experiments comparing FBOD with eight state-of-the-art algorithms on eight real-world tabular datasets and three video datasets show that FBOD outperforms its competitors in the majority of cases and requires only 5% of the execution time of the fastest competing algorithm. The experiment code is available at: https://github.com/FluctuationOD/Fluctuation-based-Outlier-Detection .
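The three steps the abstract names (random links, value propagation, fluctuation comparison) can be loosely sketched on 1-D data; this simplification is ours and not the authors' FBOD implementation:

```python
import random

def fluctuation_scores(values, k=3, seed=0):
    """Toy sketch of fluctuation-based scoring for 1-D data.

    Each object gets k random links; the propagated value is the mean over
    its linked neighbors, and the fluctuation is the gap between an object's
    own value and the propagated one. The score compares an object's
    fluctuation with the average fluctuation of its neighbors.
    """
    rng = random.Random(seed)
    n = len(values)
    links = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    fluct = [abs(values[i] - sum(values[j] for j in links[i]) / k)
             for i in range(n)]
    return [abs(fluct[i] - sum(fluct[j] for j in links[i]) / k)
            for i in range(n)]

data = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 50.0]  # 50.0 is the planted outlier
scores = fluctuation_scores(data)
print(scores.index(max(scores)))  # → 7
```

Note that no pairwise distance, density, or isolation structure is computed; the score comes entirely from propagated values over random links.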
PubMed: 36765095
DOI: 10.1038/s41598-023-29549-1
Biometrika, Sep 2017
In high-dimensional multivariate regression problems, enforcing low rank in the coefficient matrix offers effective dimension reduction, which greatly facilitates parameter estimation and model interpretation. However, commonly used reduced-rank methods are sensitive to data corruption, as the low-rank dependence structure between response variables and predictors is easily distorted by outliers. We propose a robust reduced-rank regression approach for joint modelling and outlier detection. The problem is formulated as a regularized multivariate regression with a sparse mean-shift parameterization, which generalizes and unifies some popular robust multivariate methods. An efficient thresholding-based iterative procedure is developed for optimization. We show that the algorithm is guaranteed to converge and that the coordinatewise minimum point produced is statistically accurate under regularity conditions. Our theoretical investigations focus on non-asymptotic robust analysis, demonstrating that joint rank reduction and outlier detection leads to improved prediction accuracy. In particular, we show that redescending ψ-functions can essentially attain the minimax optimal error rate, and in some less challenging problems convex regularization guarantees the same low error rate. The performance of the proposed method is examined through simulation studies and real-data examples.
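The sparse mean-shift idea can be illustrated in the simplest univariate setting: model y_i = a + b·x_i + s_i + e_i, then alternate a least-squares fit on the shift-corrected responses with hard-thresholding of the residuals, so that nonzero shifts s_i flag outliers. This toy sketch uses our own notation and is not the paper's reduced-rank algorithm:

```python
def mean_shift_fit(x, y, lam=1.6, iters=20):
    """Univariate sketch of the sparse mean-shift idea: y_i = a + b*x_i + s_i + e_i.

    Alternates a least-squares fit on the shift-corrected responses with
    hard-thresholding of the residuals; observations with nonzero shift s_i
    are flagged as outliers.
    """
    n = len(x)
    s = [0.0] * n
    a = b = 0.0
    for _ in range(iters):
        yy = [y[i] - s[i] for i in range(n)]            # shift-corrected responses
        mx, my = sum(x) / n, sum(yy) / n
        b = sum((x[i] - mx) * (yy[i] - my) for i in range(n)) / \
            sum((xi - mx) ** 2 for xi in x)
        a = my - b * mx
        r = [y[i] - a - b * x[i] for i in range(n)]      # raw residuals
        s = [ri if abs(ri) > lam else 0.0 for ri in r]   # hard threshold
    return a, b, [i for i in range(n) if s[i] != 0.0]

x = [0, 1, 2, 3, 4, 5]
y = [0.1, 1.0, 2.1, 2.9, 4.0, 9.0]   # last response is corrupted
a, b, outliers = mean_shift_fit(x, y)
print(outliers)  # → [5]
```

The fitted slope settles near the slope of the clean points; note that such alternating schemes can stall at poor fixed points when the corruption is gross, which is one motivation for the paper's careful theoretical treatment.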
PubMed: 29430036
DOI: 10.1093/biomet/asx032
Entropy (Basel, Switzerland), Jun 2022
In this article, we evaluate the efficiency and performance of two clustering algorithms: AHC (Agglomerative Hierarchical Clustering) and K-Means. We are aware that various linkage options and distance measures influence the clustering results. We assess clustering quality using the Davies-Bouldin and Dunn cluster validity indexes. The main contribution of this research is to verify whether the quality of clusters computed on data without outliers is higher than on data with outliers. To do this, we compare and analyze outlier detection algorithms depending on the applied clustering algorithm. We use and compare the LOF (Local Outlier Factor) and COF (Connectivity-based Outlier Factor) algorithms for detecting outliers before and after removing 1%, 5%, and 10% of outliers, and then analyze how the quality of clustering improves. The experiments used three real data sets with different numbers of instances.
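The effect the study measures can be reproduced on toy data with the Davies-Bouldin index (lower is better): removing an outlier from a cluster shrinks its scatter and improves the index. The index itself is standard; the clusters below are made up:

```python
from math import dist  # Euclidean distance, Python 3.8+

def davies_bouldin(clusters):
    """Davies-Bouldin index for pre-assigned clusters (lists of points).

    For each cluster, take the worst ratio of summed intra-cluster scatter
    to inter-centroid distance, then average over clusters. Lower is better.
    """
    def centroid(c):
        return tuple(sum(p[d] for p in c) / len(c) for d in range(len(c[0])))
    def scatter(c, mu):
        return sum(dist(p, mu) for p in c) / len(c)
    mus = [centroid(c) for c in clusters]
    ss = [scatter(c, mu) for c, mu in zip(clusters, mus)]
    k = len(clusters)
    return sum(max((ss[i] + ss[j]) / dist(mus[i], mus[j])
                   for j in range(k) if j != i)
               for i in range(k)) / k

a = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2), (5.0, 5.0)]   # (5, 5) is an outlier
b = [(4.0, 0.0), (4.1, 0.2), (3.9, 0.1)]
with_outlier = davies_bouldin([a, b])
without = davies_bouldin([a[:3], b])
print(without < with_outlier)  # → True
```

The single far point inflates cluster a's scatter and drags its centroid toward cluster b, worsening the index on both counts.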
PubMed: 35885141
DOI: 10.3390/e24070917
Iranian Journal of Parasitology, 2016
Review
BACKGROUND
The aim of the study was to assess outlying and influential studies and to conduct a meta-analysis of the efficacy of single-dose oral albendazole against infection.
METHODS
We searched PubMed, ISI Web of Science, Science Direct, the Cochrane Central Register of Controlled Trials, and WHO library databases for the years 1983 to 2014. Data from 13 clinical trial articles were used. Each included article compared the effect of a single oral dose (400 mg) of albendazole with placebo in two groups of patients with infection. For both groups in each article, the sample size, the number infected, and the number who recovered after taking albendazole were recorded. The relative risk and its variance were computed. Funnel plots and Begg's and Egger's tests were used to assess publication bias. The random-effects variance shift outlier model and the likelihood ratio test were applied to detect outliers. To detect influential studies, DFFITS values, Cook's distances, and COVRATIO were used. Data were analyzed using STATA and R software.
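For illustration, the relative-risk pooling step can be sketched as an inverse-variance combination of per-study log relative risks. The study counts below are hypothetical, the variance formula is the standard log-RR approximation, and this fixed-effect sketch simplifies the paper's actual STATA/R random-effects analysis:

```python
from math import log, exp

def pooled_relative_risk(studies):
    """Inverse-variance pooled relative risk (fixed-effect sketch).

    Each study is (events_treated, n_treated, events_control, n_control);
    the log-RR variance uses the standard approximation
    var = 1/a - 1/n1 + 1/c - 1/n2.
    """
    num = den = 0.0
    for a, n1, c, n2 in studies:
        lrr = log((a / n1) / (c / n2))          # log relative risk
        var = 1 / a - 1 / n1 + 1 / c - 1 / n2   # variance of log RR
        w = 1 / var                             # inverse-variance weight
        num += w * lrr
        den += w
    return exp(num / den)

studies = [(30, 50, 15, 50), (42, 60, 20, 55), (25, 40, 14, 45)]
print(round(pooled_relative_risk(studies), 2))  # → 1.97
```

Outlier diagnostics such as the variance shift model then ask whether removing one study materially changes this pooled estimate, as happened with article 13 in the paper.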
RESULTS
Articles 13 and 9 were identified as the outlier and the influential study, respectively. The outlier was diagnosed by the variance shift of the target study in the inferential method and by its RR value in the graphical method. The funnel plot and Begg's test did not show publication bias (P = 0.272); however, Egger's test did (P = 0.034). Meta-analysis after removal of article 13 gave a relative risk of 1.99 (95% CI 1.71-2.31).
CONCLUSION
The estimated RR and our meta-analyses show that treatment of the infection with a single oral dose of albendazole is unsatisfactory. New anthelminthics are urgently needed.
PubMed: 28127355
DOI: No ID Found
Journal of the Royal Statistical..., Apr 2022
We consider the multi-class classification problem when the training data and the out-of-sample test data may have different distributions and propose a method called BCOPS (balanced and conformal optimized prediction sets). BCOPS constructs a prediction set as a subset of class labels, possibly empty. It tries to optimize the out-of-sample performance, aiming to include the correct class and to detect outliers as often as possible. BCOPS returns no prediction (an empty prediction set) if it infers the test point to be an outlier. The proposed method combines supervised learning algorithms with conformal prediction to minimize a misclassification loss averaged over the out-of-sample distribution. The constructed prediction sets have a finite-sample coverage guarantee without distributional assumptions. We also propose a method to estimate the outlier detection rate of a given procedure. We prove asymptotic consistency and optimality of our proposals under suitable assumptions and illustrate our methods on real data examples.
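The conformal core of such a method can be sketched with label-conditional conformal p-values on 1-D toy data: a label enters the prediction set when its p-value exceeds alpha, and an empty set flags an outlier. The nonconformity score and data here are illustrative; BCOPS itself additionally optimizes the sets with supervised learning:

```python
def conformal_pvalue(cal_scores, score):
    """Fraction of calibration scores at least as nonconforming as `score`."""
    return (1 + sum(1 for s in cal_scores if s >= score)) / (1 + len(cal_scores))

def prediction_set(cal_by_class, x, alpha=0.1):
    """Label-conditional conformal prediction set; an empty set flags an outlier.

    cal_by_class maps label -> calibration points (floats); the nonconformity
    score is the distance to the class mean. Labels whose conformal p-value
    exceeds alpha are kept.
    """
    kept = []
    for label, cal in cal_by_class.items():
        mu = sum(cal) / len(cal)
        scores = [abs(v - mu) for v in cal]
        if conformal_pvalue(scores, abs(x - mu)) > alpha:
            kept.append(label)
    return kept

cal = {"a": [0.0, 0.1, -0.1, 0.05, -0.05, 0.2, -0.2, 0.15, -0.15, 0.02],
       "b": [5.0, 5.1, 4.9, 5.05, 4.95, 5.2, 4.8, 5.15, 4.85, 5.02]}
print(prediction_set(cal, 0.03))   # → ['a']
print(prediction_set(cal, 100.0))  # → [] (no plausible label: flagged as outlier)
```

The coverage guarantee comes from the rank-based p-value, which is valid for exchangeable calibration data regardless of its distribution.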
PubMed: 35910400
DOI: 10.1111/rssb.12443
Advances in Experimental Medicine and..., 2016
Review
The statistical analysis of robust biomarker candidates is a complex process involving several key steps in the overall biomarker development pipeline (see Fig. 22.1, Chap. 19). Initially, data visualization (Sect. 22.1, below) is important for identifying outliers and getting a feel for the nature of the data and whether there appear to be any differences among the groups being examined. From there, the data must be pre-processed (Sect. 22.2) so that outliers are handled, missing values are dealt with, and normality is assessed. Once the processed data have been cleaned and are ready for downstream analysis, hypothesis tests (Sect. 22.3) are performed and differentially expressed proteins are identified. Since the number of differentially expressed proteins is usually larger than warrants further investigation (50+ proteins versus the handful that will be considered for a biomarker panel), some form of feature reduction (Sect. 22.4) should be performed to narrow the list of candidate biomarkers down to a more reasonable number. Once the list has been reduced to the proteins most likely to be useful for downstream classification, unsupervised or supervised learning is performed (Sects. 22.5 and 22.6, respectively).
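The hypothesis-testing and feature-reduction steps typically involve a multiple-testing correction such as Benjamini-Hochberg; a minimal sketch (the p-values below are hypothetical, and this is one common choice rather than the chapter's prescribed method):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg FDR step-up: indices of hypotheses kept at level q.

    Sort the p-values, find the largest rank k with p_(k) <= k*q/m, and keep
    every hypothesis whose p-value is at or below that cutoff.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = -1.0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            cutoff = pvals[i]
    return sorted(i for i in range(m) if pvals[i] <= cutoff)

# p-values for 8 hypothetical proteins from a differential-expression test
pvals = [0.001, 0.2, 0.012, 0.9, 0.03, 0.004, 0.5, 0.045]
print(benjamini_hochberg(pvals, q=0.05))  # → [0, 2, 5]
```

Controlling the false discovery rate rather than the family-wise error rate keeps more candidates while bounding the expected fraction of false leads passed to the panel-selection step.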
Topics: Algorithms; Biomarkers; Computational Biology; Data Interpretation, Statistical; Data Mining; Databases, Protein; High-Throughput Screening Assays; Humans; Mass Spectrometry; Models, Statistical; Proteins; Proteome; Proteomics; Software
PubMed: 27975231
DOI: 10.1007/978-3-319-41448-5_22
Frontiers in Digital Health, 2023
Review
This paper compares three finite element-based methods used in a physics-based non-rigid registration approach and reports on the progress made over the last 15 years. Large brain shifts caused by brain tumor removal affect registration accuracy by creating point and element outliers. A combination of approximation- and geometry-based point and element outlier rejection improves on the rigid registration error by 2.5 mm and meets the real-time constraint (4 min). In addition, the paper raises several questions and presents two open problems for the robust estimation and improvement of registration error in the presence of outliers due to sparse, noisy, and incomplete data. It concludes with preliminary results on leveraging quantum computing, a promising new technology for computationally intensive problems such as feature detection and block matching in addition to the finite element solver; these three together account for 75% of the computing time in deformable registration.
PubMed: 38144260
DOI: 10.3389/fdgth.2023.1283726