-
PeerJ. Computer Science 2022
Outliers are data points that significantly deviate from other data points in a data set because of different mechanisms or unusual processes. Outlier detection is one of the intensively studied research topics for identification of novelties, frauds, anomalies, deviations or exceptions in addition to its use for data cleansing in data science. In this study, we propose two novel outlier detection approaches using the typicality degrees which are the partitioning result of unsupervised possibilistic clustering algorithms. The proposed approaches are based on finding the atypical data points below a predefined threshold value, a possibilistic level for evaluating a point as an outlier. The experiments on the synthetic and real data sets showed that the proposed approaches can be successfully used to detect outliers without considering the structure and distribution of the features in multidimensional data sets.
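The thresholding idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' two approaches: it assumes the standard possibilistic c-means typicality formula, cluster centers and scale parameters that are already known, and a hypothetical threshold `alpha`. The function names and the example data are made up for illustration.

```python
def typicality(point, center, eta, m=2.0):
    """Possibilistic c-means style typicality of `point` for one cluster.

    Assumed form: t = 1 / (1 + (d^2 / eta)^(1 / (m - 1))), where d is the
    Euclidean distance to the cluster center and eta is a scale parameter.
    """
    d2 = sum((p - c) ** 2 for p, c in zip(point, center))
    return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))


def flag_outliers(points, centers, etas, alpha=0.1, m=2.0):
    """Flag a point as an outlier when its best typicality is below alpha.

    `alpha` is a hypothetical possibilistic threshold chosen by the user.
    """
    flags = []
    for x in points:
        best = max(typicality(x, c, eta, m) for c, eta in zip(centers, etas))
        flags.append(best < alpha)
    return flags


# Hypothetical toy data: one cluster centered at the origin.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
flags = flag_outliers(points, centers=[(0.0, 0.0)], etas=[1.0], alpha=0.1)
```

In this toy setup only the distant point `(10.0, 10.0)` falls below the threshold; in practice the centers and scale parameters would come from a fitted possibilistic clustering model rather than being fixed by hand.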
PubMed: 36262121
DOI: 10.7717/peerj-cs.1060 -
Knowledge-based Systems Feb 2022
The presence of outliers can severely degrade learned representations and performance of deep learning methods and hence disproportionately affect the training process, leading to incorrect conclusions about the data. For example, anomaly detection using deep generative models is typically only possible when similar anomalies (or outliers) are not present in the training data. Here we focus on variational autoencoders (VAEs). While the VAE is a popular framework for anomaly detection tasks, we observe that the VAE is unable to detect outliers when the training data contains anomalies that have the same distribution as those in test data. In this paper we focus on robustness to outliers in training data in VAE settings using concepts from robust statistics. We propose a variational lower bound that leads to a robust VAE model that has the same computational complexity as the standard VAE and contains a single automatically-adjusted tuning parameter to control the degree of robustness. We present mathematical formulations for robust variational autoencoders (RVAEs) for Bernoulli, Gaussian and categorical variables. The RVAE model is based on beta-divergence rather than the standard Kullback-Leibler (KL) divergence. We demonstrate the performance of our proposed β-divergence-based autoencoder for a variety of image and categorical datasets showing improved robustness to outliers both qualitatively and quantitatively. We also illustrate the use of our robust VAE for detection of lesions in brain images, formulated as an anomaly detection task. Finally, we suggest a method to tune the hyperparameter of RVAE which makes our model completely unsupervised.
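For reference, the density power divergence, commonly called the β-divergence in robust statistics, is usually written as below. This is the standard textbook form rather than a formula quoted from the abstract; the symbols g (data distribution) and f (model distribution) are notation assumed here, and the paper's own parameterization may differ.

```latex
D_{\beta}(g \,\|\, f)
  = \frac{1}{\beta} \int g(x)^{1+\beta}\,dx
  - \frac{1+\beta}{\beta} \int g(x)\, f(x)^{\beta}\,dx
  + \int f(x)^{1+\beta}\,dx ,
  \qquad \beta > 0,

\lim_{\beta \to 0} D_{\beta}(g \,\|\, f)
  = \int g(x) \log \frac{g(x)}{f(x)}\,dx
  = D_{\mathrm{KL}}(g \,\|\, f).
```

The limit shows why β acts as a robustness knob: as β approaches 0 the divergence reduces to the KL divergence, while larger β down-weights samples to which the model assigns low density.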
PubMed: 36714396
DOI: 10.1016/j.knosys.2021.107886 -
Plants (Basel, Switzerland) Mar 2023
Review
Globally, food and medicinal plants have been documented, but their use patterns are poorly understood. Useful plants are non-random subsets of flora, prioritizing certain taxa. This study evaluates orders and families prioritized for medicine and food in Kenya, using three statistical models: Regression, Binomial, and Bayesian approaches. An extensive literature search was conducted to gather information on indigenous flora, medicinal and food plants. Regression residuals, obtained using the LINEST linear regression function, were used to quantify whether taxa had an unexpectedly high number of useful species relative to the overall proportion in the flora. Bayesian analysis, performed using the BETA.INV function, was used to obtain superior and inferior 95% probability credible intervals for the whole flora and for all taxa. To test for the significance of individual taxa departure from the expected number, binomial analysis using the BINOMDIST function was performed to obtain p-values for all taxa. The three models identified 14 positive outlier medicinal orders, all with significant p-values (p < 0.05). Fabales had the highest (66.16) regression residuals, while Sapindales had the highest (1.1605) R-value. Thirty-eight positive outlier medicinal families were identified; 34 were significant outliers (p < 0.05). Rutaceae (1.6808) had the highest R-value, while Fabaceae had the highest regression residuals (63.2). Sixteen positive outlier food orders were recovered; 13 were significant outliers (p < 0.05). Gentianales (45.27) had the highest regression residuals, while Sapindales (2.3654) had the highest R-value. Forty-two positive outlier food families were recovered by the three models; 30 were significant outliers (p < 0.05). Anacardiaceae (5.163) had the highest R-value, while Fabaceae had the highest (28.72) regression residuals. This study presents important medicinal and food taxa in Kenya, and adds useful data for global comparisons.
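Two of the calculations named above, regression residuals and a binomial p-value per taxon, can be sketched in plain Python. This is an illustration under assumptions, not the study's spreadsheet workflow: the abstract does not state which binomial tail was used, so an upper tail (over-representation) is assumed here, and the function names and toy numbers are hypothetical.

```python
from math import comb


def regression_residuals(xs, ys):
    """Residuals of an ordinary least-squares line y = a + b*x.

    xs: species per taxon in the flora; ys: useful species per taxon.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), like a cumulative BINOMDIST call."""
    if k < 0:
        return 0.0
    upper = min(k, n)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(upper + 1))


def upper_tail_p(k, n, p):
    """Assumed test: P(X >= k), the chance of seeing k or more useful
    species in a taxon of n species if each is useful with probability p."""
    return 1.0 - binom_cdf(k - 1, n, p)


# Hypothetical toy inputs, not data from the study.
residuals = regression_residuals([1, 2, 3], [2, 4, 6])
p_value = upper_tail_p(2, 2, 0.5)
```

A taxon with a large positive residual and a small upper-tail p-value would be a candidate positive outlier in the sense the abstract describes.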
PubMed: 36904005
DOI: 10.3390/plants12051145 -
IEEE Transactions on Image Processing :... Feb 2016
A critical step in cryogenic electron microscopy (cryo-EM) image analysis is to calculate the average of all images aligned to a projection direction. This average, called the class mean, improves the signal-to-noise ratio in single-particle reconstruction. The averaging step is often compromised by outlier images of ice, contaminants, and particle fragments. Outlier detection and rejection in the majority of current cryo-EM methods are done using cross-correlation with a manually determined threshold. Empirical assessment shows that the performance of these methods is very sensitive to the threshold. This paper proposes an alternative: a w-estimator of the average image, which is robust to outliers and which does not use a threshold. Various properties of the estimator, such as consistency and the influence function, are investigated. An extension of the estimator to images with different contrast transfer functions is also provided. Experiments with simulated and real cryo-EM images show that the proposed estimator performs quite well in the presence of outliers.
Topics: Computer Simulation; Cryoelectron Microscopy; Image Processing, Computer-Assisted; Imaging, Three-Dimensional; Proteins; Signal-To-Noise Ratio
PubMed: 26841397
DOI: 10.1109/TIP.2015.2512384 -
Algorithms For Molecular Biology : AMB 2018
BACKGROUND
An important task in a metagenomic analysis is the assignment of taxonomic labels to sequences in a sample. Most widely used methods for taxonomy assignment compare a sequence in the sample to a database of known sequences. Many approaches use the best BLAST hit(s) to assign the taxonomic label. However, it is known that the best BLAST hit may not always correspond to the best taxonomic match. An alternative approach involves phylogenetic methods, which take into account alignments and a model of evolution in order to more accurately define the taxonomic origin of sequences. Similarity-search based methods typically run faster than phylogenetic methods and work well when the organisms in the sample are well represented in the database. In contrast, phylogenetic methods have the capability to identify new organisms in a sample but are computationally quite expensive.
RESULTS
We propose a two-step approach for metagenomic taxon identification; i.e., use a rapid method that accurately classifies sequences using a reference database (this is a filtering step) and then use a more complex phylogenetic method for the sequences that were unclassified in the previous step. In this work, we explore whether and when using top BLAST hit(s) yields a correct taxonomic label. We develop a method to detect outliers among BLAST hits in order to separate the phylogenetically most closely related matches from matches to sequences from more distantly related organisms. We used modified BILD (Bayesian Integral Log-Odds) scores, a multiple-alignment scoring function, to define the outliers within a subset of top BLAST hits and assign taxonomic labels. We compared the accuracy of our method to the RDP classifier and show that our method yields fewer misclassifications while properly classifying organisms that are not present in the database. Finally, we evaluated the use of our method as a pre-processing step before more expensive phylogenetic analyses (in our case TIPP) in the context of real 16S rRNA datasets.
CONCLUSION
Our experiments make a good case for using a two-step approach for accurate taxonomic assignment. We show that our method can be used as a filtering step before using phylogenetic methods and provides a way to interpret BLAST results using more information than provided by E-values and bit-scores alone.
PubMed: 29588650
DOI: 10.1186/s13015-018-0126-3 -
BMC Bioinformatics Jun 2020
BACKGROUND
High-throughput RNA sequencing is a powerful approach to study gene expression. Because data acquisition involves complex multi-step protocols, extreme deviation of a sample from samples of the same treatment group may occur due to technical variation or true biological differences. The high dimensionality of the data, combined with few biological replicates, makes it challenging to accurately detect those samples, and this issue is not currently well studied in the literature. Robust statistics is a family of theories and techniques that aim to detect outliers by first fitting the majority of the data and then flagging data points that deviate from it. Robust statistics have been widely used in multivariate data analysis for outlier detection in chemometrics and engineering. Here we apply robust statistics to RNA-seq data analysis.
RESULTS
We report the use of two robust principal component analysis (rPCA) methods, PcaHubert and PcaGrid, to detect outlier samples in multiple simulated and real biological RNA-seq data sets with positive control outlier samples. PcaGrid achieved 100% sensitivity and 100% specificity in all the tests using positive control outliers with varying degrees of divergence. We applied rPCA methods and classical principal component analysis (cPCA) on an RNA-Seq data set profiling gene expression of the external granule layer in the cerebellum of control and conditional SnoN knockout mice. Both rPCA methods detected the same two outlier samples but cPCA failed to detect any. We performed differentially expressed gene detection before and after outlier removal as well as with and without batch effect modeling. We validated gene expression changes using quantitative reverse transcription PCR and used the result as a reference to compare the performance of eight different data analysis strategies. Removing outliers without batch effect modeling performed the best in terms of detecting biologically relevant differentially expressed genes.
CONCLUSIONS
rPCA implemented in the PcaGrid function is an accurate and objective method to detect outlier samples. It is well suited for high-dimensional data with small sample sizes like RNA-seq data. Outlier removal can significantly improve the performance of differential gene detection and downstream functional analysis.
Topics: Animals; Cerebellum; Female; Male; Mice, Knockout; Principal Component Analysis; Proto-Oncogene Proteins; RNA-Seq; Reverse Transcriptase Polymerase Chain Reaction
PubMed: 32600248
DOI: 10.1186/s12859-020-03608-0 -
NeuroImage. Clinical 2021
OBJECTIVE
Focal cortical dysplasias (FCDs) are a common cause of apparently non-lesional drug-resistant focal epilepsy. Visual detection of subtle FCDs on MRI is clinically important and often challenging. In this study, we implement a set of 3D local image filters adapted from computer vision applications to characterize the appearance of normal cortex surrounding the gray-white junction. We create a normative model to serve as the basis for a novel multivariate constrained outlier approach to automated FCD detection.
METHODS
Standardized MPRAGE, T2 and FLAIR MR images were obtained in 15 patients with radiologically or histologically diagnosed FCDs and 30 healthy volunteers. Multiscale 3D local image filters were computed for each MR contrast then sampled onto the gray-white junction surface. Using an iterative Gaussianization procedure, we created a normative model of cortical variability in healthy volunteers, allowing for identification of outlier regions and estimates of similarity in normal cortex and FCD lesions. We used a constrained outlier approach following local normalization to automatically detect FCD lesions based on projection onto the mean FCD feature vector.
RESULTS
FCDs as well as some normal cortical regions such as primary sensorimotor and paralimbic regions appear as outliers. Regions such as the paralimbic regions and the anterior insula have similar features to FCDs. Our constrained outlier approach allows for automated FCD detection with 80% sensitivity and 70% specificity.
SIGNIFICANCE
A normative model using multiscale local image filters can be used to describe the normal cortical variability. Although FCDs appear similar to some cortical regions such as the anterior insula and paralimbic cortices, they can be identified using a constrained outlier detection approach. Our method for detecting outliers and estimating similarity is generic and could be extended to identification of other types of lesions or atypical cortical areas.
Topics: Epilepsy; Humans; Imaging, Three-Dimensional; Magnetic Resonance Imaging; Malformations of Cortical Development; Malformations of Cortical Development, Group I
PubMed: 33556791
DOI: 10.1016/j.nicl.2021.102565 -
Sensors (Basel, Switzerland) May 2020
Geometric model fitting is a fundamental issue in computer vision, and the fitting accuracy is affected by outliers. To eliminate the impact of the outliers, an inlier threshold or scale estimator is usually adopted. However, a single inlier threshold cannot satisfy multiple models in the data, and scale estimators that assume a particular noise distribution model work poorly in geometric model fitting. It can be observed that the residuals of outliers are large for all true models in the data, which gives the outliers a consensus of their own. Based on this observation, we propose a preference analysis method based on residual histograms to study the outlier consensus for outlier detection. We have found that the outlier consensus makes the outliers gather away from the inliers in the designed residual histogram preference space, which makes it convenient to separate outliers from inliers through linkage clustering. After the outliers are detected and removed, a linkage clustering with permutation preference is introduced to segment the inliers. In addition, to make the linkage clustering process stable and robust, an alternating sampling and clustering framework is proposed in both the outlier detection and inlier segmentation processes. The experimental results show that the outlier detection scheme based on residual histogram preference can detect most of the outliers in the data sets, and the fitting results are better than those of most state-of-the-art methods in geometric multi-model fitting.
PubMed: 32471177
DOI: 10.3390/s20113037 -
Clinical Breast Cancer Jun 2022
Rates and Outcomes of Breast Lesions of Uncertain Malignant Potential (B3) benchmarked against the National Breast Screening Pathology Audit; Improving Performance in a High Volume Screening Unit.
INTRODUCTION
Our breast screening unit was identified as a high outlier for B3 lesions with a low positive predictive value (PPV) compared to the England average. This prompted a detailed internal audit and review of B3 lesions and their outcomes to identify causes and address any variation in practice.
PATIENTS AND METHODS
The B3 rate was calculated in 4168 breast core biopsies from 2019, using the subsequent excision to determine the PPV. Atypical intraductal epithelial proliferation (AIDEP) cases were subject to microscopic review to reassess the presence of atypia against published criteria. The B3 rate was re-audited in 2021, and the results compared.
RESULTS
Screening cases had a high B3 rate of 12.4% (30% above the national average), and a PPV of 7.7% (9.7% with atypia). AIDEP was identified as a possible cause of this outlier status. On review and by consensus, AIDEP was confirmed in only 66% of cases reported as such, 17% were downgraded, and 16% did not reach consensus, the latter highlighting the difficulty and subjectivity in diagnosis of these lesions. Repeat audit of B3 rates after this extended review revealed a reduction from 12.4% to 9.11%, which is more in line with national standards.
CONCLUSION
Benchmarking against national reporting standards is critical for service improvement. Through a supportive environment, team working, rigorous internal review and adherence to guidelines, interobserver variation and outlier status in breast pathology screening can both be addressed. This study can serve as a model for other outlier units to identify and tackle underlying causes.
Topics: Benchmarking; Biopsy, Large-Core Needle; Breast; Breast Neoplasms; Female; Humans; Mammography
PubMed: 35260351
DOI: 10.1016/j.clbc.2022.02.004 -
JAMA Facial Plastic Surgery Jan 2018
IMPORTANCE
Despite the large number of studies focused on defining frontal or lateral facial attractiveness, no reports have examined whether a significant association between frontal and lateral facial attractiveness exists.
OBJECTIVES
To examine the association between frontal and lateral facial attractiveness and to identify anatomical features that may influence discordance between frontal and lateral facial beauty.
DESIGN, SETTING, AND PARTICIPANTS
Paired frontal and lateral facial synthetic images of 240 white women (age range, 18-25 years) were evaluated from September 30, 2004, to September 29, 2008, using an internet-based focus group (n = 600) on an attractiveness Likert scale of 1 to 10, with 1 being least attractive and 10 being most attractive. Data analysis was performed from December 6, 2016, to March 30, 2017. The association between frontal and lateral attractiveness scores was determined using linear regression. Outliers were defined as data outside the 95% individual prediction interval. To identify features that contribute to score discordance between frontal and lateral attractiveness scores, each of these image pairs was scrutinized by an evaluator panel for facial features that were present in the frontal or lateral projections and absent in the other respective facial projections.
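For readers unfamiliar with the outlier criterion, the textbook 95% individual prediction interval for simple linear regression is given below. This is the standard formula and is offered as an assumption about what was computed; the abstract does not give the exact expression. Here n denotes the number of (x, y) score pairs in the regression (not the focus group size), x the frontal score, y the lateral score, and s the residual standard error.

```latex
\hat{y}_0 \;\pm\; t_{0.975,\,n-2}\; s\,
  \sqrt{\,1 + \frac{1}{n}
        + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,},
\qquad
s^2 = \frac{1}{n-2} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 .
```

Under this definition, an image pair is an outlier when its observed lateral score lies outside the interval computed at its frontal score.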
MAIN OUTCOMES AND MEASURES
Attractiveness scores obtained from internet-based focus groups.
RESULTS
For the 240 white women studied (mean [SD] age, 21.4 [2.2] years), attractiveness scores ranged from 3.4 to 9.5 for frontal images and 3.3 to 9.4 for lateral images. The mean (SD) frontal attractiveness score was 6.9 (1.4), whereas the mean (SD) lateral attractiveness score was 6.4 (1.3). Simple linear regression of frontal and lateral attractiveness scores resulted in a coefficient of determination of r² = 0.749. Eight outlier pairs were identified and analyzed by panel evaluation. Panel evaluation revealed no clinically applicable association between frontal and lateral images among outliers; however, contributory facial features were suggested. Thin upper lip, convex nose, and blunt cervicomental angle were suggested by evaluators as facial characteristics that contributed to outlier frontal or lateral attractiveness scores.
CONCLUSIONS AND RELEVANCE
This study identified a strong linear association between frontal and lateral facial attractiveness. Furthermore, specific facial landmarks responsible for the discordance between frontal and lateral facial attractiveness scores were suggested. Additional studies are necessary to determine whether correction of these landmarks may increase facial harmony and attractiveness.
LEVEL OF EVIDENCE
NA.
Topics: Adolescent; Adult; Anatomic Landmarks; Beauty; Face; Female; Focus Groups; Humans; Photography; Posture; Social Perception; White People; Young Adult
PubMed: 28772308
DOI: 10.1001/jamafacial.2017.0710