-
Data Mining and Knowledge Discovery 2023
It has been shown that unsupervised outlier detection methods can be adapted to the one-class classification problem (Janssens and Postma, in: Proceedings of the 18th annual Belgian-Dutch conference on machine learning, pp 56-64, 2009; Janssens et al. in: Proceedings of the 2009 ICMLA international conference on machine learning and applications, IEEE Computer Society, pp 147-153, 2009. 10.1109/ICMLA.2009.16). In this paper, we focus on the comparison of one-class classification algorithms with such adapted unsupervised outlier detection methods, improving on previous comparison studies in several important aspects. We study a number of one-class classification and unsupervised outlier detection methods in a rigorous experimental setup, comparing them on a large number of datasets with different characteristics, using different performance measures. In contrast to previous comparison studies, where the models (algorithms, parameters) are selected by using examples from both classes (outlier and inlier), here we also study and compare different approaches for model selection in the absence of examples from the outlier class, which is more realistic for practical applications since labeled outliers are rarely available. Our results showed that, overall, SVDD and GMM are top-performers, regardless of whether the ground truth is used for parameter selection or not. However, in specific application scenarios, other methods exhibited better performance. Combining one-class classifiers into ensembles showed better performance than individual methods in terms of accuracy, as long as the ensemble members are properly selected.
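As an illustration of how an unsupervised outlier score might be adapted to one-class classification, the sketch below scores a point by its distance to the k-th nearest training inlier and picks the decision threshold from inliers only (a quantile of leave-one-out scores). The function names, the choice of a kNN-distance score, and the 0.95 quantile are assumptions for illustration; they are not taken from the paper.

```python
import math

def knn_score(point, train, k=2):
    """Outlier score: distance from `point` to its k-th nearest training inlier."""
    dists = sorted(math.dist(point, t) for t in train)
    return dists[k - 1]

def fit_threshold(train, k=2, quantile=0.95):
    """Choose a threshold without any outlier labels: a quantile of the
    leave-one-out scores of the training inliers (assumed heuristic)."""
    scores = []
    for i, p in enumerate(train):
        rest = train[:i] + train[i + 1:]
        scores.append(knn_score(p, rest, k))
    scores.sort()
    idx = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[idx]

def predict_outlier(point, train, threshold, k=2):
    """One-class decision: True if the point scores above the inlier threshold."""
    return knn_score(point, train, k) > threshold

# Toy inlier-only training set (hypothetical data).
inliers = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (-0.1, 0.1), (0.0, -0.2), (0.15, -0.1)]
threshold = fit_threshold(inliers)
near = predict_outlier((0.05, 0.05), inliers, threshold)
far = predict_outlier((3.0, 3.0), inliers, threshold)
print(near, far)
```

The point of the sketch is that both the model and its threshold are fitted from inliers alone, which mirrors the model-selection setting the abstract describes as more realistic.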
SUPPLEMENTARY INFORMATION
The online version contains supplementary material available at 10.1007/s10618-023-00931-x.
PubMed: 37424877
DOI: 10.1007/s10618-023-00931-x -
Journal of Medical Internet Research May 2021
BACKGROUND
Perioperative quantitative monitoring of neuromuscular function in patients receiving neuromuscular blockers has become internationally recognized as an absolute and core necessity in modern anesthesia care. Because of their kinetic nature, artifactual recordings of acceleromyography-based neuromuscular monitoring devices are not unusual. These generate a great deal of cynicism among anesthesiologists, constituting an obstacle toward their widespread adoption. Through outlier analysis techniques, monitoring devices can learn to detect and flag signal abnormalities. Outlier analysis (or anomaly detection) refers to the problem of finding patterns in data that do not conform to expected behavior.
OBJECTIVE
This study was motivated by the development of a smartphone app intended for neuromuscular monitoring based on combined accelerometric and angular hand movement data. During the paired comparison stage of this app against existing acceleromyography monitoring devices, it was noted that the results from both devices did not always concur. This study aims to engineer a set of features that enable the detection of outliers in the form of erroneous train-of-four (TOF) measurements from an acceleromyographic-based device. These features are tested for their potential in the detection of erroneous TOF measurements by developing an outlier detection algorithm.
METHODS
A data set encompassing 533 high-sensitivity TOF measurements from 35 patients was created based on a multicentric open label trial of a purpose-built accelero- and gyroscopic-based neuromuscular monitoring app. A basic set of features was extracted based on raw data while a second set of features was purpose engineered based on TOF pattern characteristics. Two cost-sensitive logistic regression (CSLR) models were deployed to evaluate the performance of these features. The final output of the developed models was a binary classification, indicating if a TOF measurement was an outlier or not.
RESULTS
A total of 7 basic features were extracted based on raw data, while another 8 features were engineered based on TOF pattern characteristics. The model training and testing were based on separate data sets: one with 319 measurements (18 outliers) and a second with 214 measurements (12 outliers). The F1 score (95% CI) was 0.86 (0.48-0.97) for the CSLR model with engineered features, significantly larger than the CSLR model with the basic features (0.29 [0.17-0.53]; P<.001).
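A minimal sketch of the two ingredients named above, cost-sensitive logistic regression and the F1 score, may help make the evaluation concrete. It assumes a single hypothetical feature, a positive-class cost equal to the inverse class ratio, and plain gradient descent; none of these details come from the study.

```python
import math

def train_weighted_logreg(xs, ys, pos_weight, lr=0.1, epochs=2000):
    """Gradient descent on a log-loss in which errors on the rare
    positive (outlier) class cost `pos_weight` times more."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            cost = pos_weight if y == 1 else 1.0
            gw += cost * (p - y) * x
            gb += cost * (p - y)
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
    return w, b

def predict(x, w, b):
    """Binary decision at probability 0.5 (logit 0)."""
    return 1 if w * x + b >= 0 else 0

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical one-feature data: 8 normal measurements, 2 outliers.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 2.5, 3.0]
ys = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
w, b = train_weighted_logreg(xs, ys, pos_weight=8 / 2)
preds = [predict(x, w, b) for x in xs]
print(f1_score(ys, preds))
```

F1 is a natural choice here because outliers are rare (18 of 319 and 12 of 214 measurements), so plain accuracy would be dominated by the normal class.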
CONCLUSIONS
The set of engineered features and their corresponding incorporation in an outlier detection algorithm have the potential to increase overall neuromuscular monitoring data consistency. Integrating outlier flagging algorithms within neuromuscular monitors could potentially reduce overall acceleromyography-based reliability issues.
TRIAL REGISTRATION
ClinicalTrials.gov NCT03605225; https://clinicaltrials.gov/ct2/show/NCT03605225.
Topics: Accelerometry; Humans; Machine Learning; Neuromuscular Blockade; Neuromuscular Monitoring; Reproducibility of Results
PubMed: 34152273
DOI: 10.2196/25913 -
Entropy (Basel, Switzerland) Dec 2021
People nowadays use the internet to project their assessments, impressions, ideas, and observations about various subjects or products on numerous social networking sites. These sites are a rich source of data for data analytics, sentiment analysis, natural language processing, etc. Conventionally, the true sentiment of a customer review matches its corresponding star rating. However, there are exceptions in which the star rating of a review contradicts its true sentiment; in this work, such reviews are labeled as outliers. The state-of-the-art methods for anomaly detection involve manual searching, predefined rules, or traditional machine learning techniques to detect such instances. This paper conducts a sentiment analysis and outlier detection case study for Amazon customer reviews, and it proposes a statistics-based outlier detection and correction method (SODCM), which helps identify such reviews and rectify their star ratings to enhance the performance of a sentiment analysis algorithm without any data loss. This paper focuses on performing SODCM in datasets containing customer reviews of various products, which are (a) scraped from Amazon.com and (b) publicly available. The paper also studies the dataset and draws conclusions about the effect of SODCM on the performance of a sentiment analysis algorithm. The results show that SODCM achieves higher accuracy and recall than other state-of-the-art anomaly detection algorithms.
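The abstract does not spell out the SODCM procedure, so the following is only a hypothetical stand-in that illustrates the detect-and-correct idea: a review is flagged when a precomputed sentiment score clearly disagrees in sign with its star rating, and the rating is corrected rather than the review being dropped. The sentiment scores, the `margin` parameter, and the mirror rule `6 - stars` are all assumptions.

```python
def rating_polarity(stars):
    """Map a 1-5 star rating to a polarity in [-1, 1]; 3 stars is neutral."""
    return (stars - 3) / 2

def detect_and_correct(reviews, margin=0.5):
    """For each (text, stars, sentiment) triple, flag a mismatch when the
    sentiment is confidently opposite to the rating, and mirror the rating.
    No review is removed, so the dataset keeps its size."""
    corrected = []
    for text, stars, sentiment in reviews:
        mismatch = sentiment * rating_polarity(stars) < 0 and abs(sentiment) >= margin
        new_stars = 6 - stars if mismatch else stars
        corrected.append((text, new_stars, mismatch))
    return corrected

# Hypothetical reviews with sentiment scores in [-1, 1] from some upstream model.
reviews = [
    ("Great product", 5, 0.9),
    ("Terrible, it broke in a day", 5, -0.8),
    ("It is okay", 3, 0.1),
    ("Loved it", 1, 0.7),
]
result = detect_and_correct(reviews)
for row in result:
    print(row)
```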
PubMed: 34945950
DOI: 10.3390/e23121645 -
Evolution; International Journal of... Apr 2023
An evolutionary debate contrasts the importance of genetic convergence versus genetic redundancy. In genetic convergence, the same adaptive trait evolves because of similar genetic changes. In genetic redundancy, the adaptive trait evolves using different genetic combinations, and populations might not share the same genetic changes. Here we address this debate by examining single nucleotide polymorphisms (SNPs) associated with the rapid evolution of character displacement in Anolis carolinensis populations inhabiting replicate islands with and without a competitor species (1Spp and 2Spp islands, respectively). We identify 215 outlier SNPs that have improbably large FST values, low nucleotide variation, greater linkage than expected and that are enriched for genes underlying animal movement. The pattern of SNP divergence between 1Spp and 2Spp populations supports both genetic convergence and genetic redundancy for character displacement. In support of genetic convergence: all 215 outlier SNPs are shared among at least three of the five 2Spp island populations, and 23% of outlier SNPs are shared among all five 2Spp island populations. In contrast, in support of genetic redundancy: many outlier SNPs only have meaningful allele frequency differences between 1Spp and 2Spp islands on a few 2Spp islands. That is, on at least one of the 2Spp islands, 77% of outlier SNPs have allele frequencies more similar to those on 1Spp islands than to those on 2Spp islands. Focusing on genetic convergence is scientifically rigorous because it relies on replication. Yet, this focus distracts from the possibility that there are multiple, redundant genetic solutions that enhance the rate and stability of adaptive change.
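For readers unfamiliar with FST, the sketch below computes it for a single biallelic SNP from the allele frequencies in two populations, using the textbook heterozygosity form FST = (HT - HS) / HT. It assumes equal population sizes and is not the estimator or the outlier test used in the study.

```python
def fst(p1, p2):
    """FST for one biallelic SNP given allele frequencies p1 and p2 in two
    equally sized populations: (H_T - H_S) / H_T."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_t == 0:
        return 0.0                          # SNP is fixed in both populations
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    return (h_t - h_s) / h_t

print(fst(0.5, 0.5))   # identical frequencies, no differentiation
print(fst(0.1, 0.9))   # strongly diverged frequencies
print(fst(0.0, 1.0))   # alternate alleles fixed
```

An outlier scan would then compare each SNP's value against the genome-wide distribution; the criterion for "improbably large" is specific to the study and is not reproduced here.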
Topics: Animals; Gene Frequency; Genomics; Phenotype; Polymorphism, Single Nucleotide; Selection, Genetic
PubMed: 36857409
DOI: 10.1093/evolut/qpad031 -
International Journal of Biometeorology Nov 2020
Citizen science involves public participation in research, usually through volunteer observation and reporting. Data collected by citizen scientists are a valuable resource in many fields of research that require long-term observations at large geographic scales. However, such data may be perceived as less accurate than those collected by trained professionals. Here, we analyze the quality of data from a plant phenology network, which tracks biological response to climate change. We apply five algorithms designed to detect outlier observations or inconsistent observers. These methods rely on different quantitative approaches, including residuals of linear models, correlations among observers, deviations from multivariate clusters, and percentile-based outlier removal. We evaluated these methods by comparing the resulting cleaned datasets in terms of time series means, spatial data coverage, and spatial autocorrelations after outlier removal. Spatial autocorrelations were used to determine the efficacy of outlier removal, as they are expected to increase if outliers and inconsistent observations are successfully removed. All data cleaning methods resulted in better Moran's I autocorrelation statistics, with percentile-based outlier removal and the clustering method showing the greatest improvement. Methods based on residual analysis of linear models had the strongest impact on the final bloom time mean estimates, but were among the weakest based on autocorrelation analysis. Removing entire sets of observations from potentially unreliable observers proved least effective. In conclusion, percentile-based outlier removal emerges as a simple and effective method to improve reliability of citizen science phenology observations.
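Percentile-based outlier removal, the method the abstract singles out as simple and effective, can be sketched in a few lines. The 5th and 95th percentile cut-offs and the day-of-year values below are assumptions for illustration; the abstract does not state which percentiles were used.

```python
import statistics

def percentile_filter(values, lower_pct=5, upper_pct=95):
    """Keep only observations between the given percentiles (inclusive)."""
    cuts = statistics.quantiles(values, n=100)   # 99 cut points: cuts[i-1] is the i-th percentile
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in values if lo <= v <= hi]

# Hypothetical bloom dates (day of year) with two implausible reports.
bloom_days = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99,
              100, 102, 98, 101, 99, 103, 97, 100, 101, 180, 20]
cleaned = percentile_filter(bloom_days)
print(len(bloom_days), len(cleaned))
```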
Topics: Citizen Science; Climate Change; Community Participation; Humans; Reproducibility of Results; Volunteers
PubMed: 32671668
DOI: 10.1007/s00484-020-01968-z -
Journal of Applied Statistics 2020
Outlier detection can be seen as a pre-processing step for locating data points in a data sample that do not conform to the majority of observations. Various techniques and methods for outlier detection can be found in the literature dealing with different types of data. However, many data sets are inflated by true zeros and, in addition, some components/variables might be of compositional nature. Important examples of such data sets are the Structural Earnings Survey, the Structural Business Statistics, the European Statistics on Income and Living Conditions, tax data or - as in this contribution - household expenditure data which are used, for example, to estimate the Purchasing Power Parity of a country. In this work, robust univariate and multivariate outlier detection methods are compared by a complex simulation study that considers various challenges included in data sets, namely structural (true) zeros, missing values, and compositional variables. These circumstances make it difficult or impossible to flag true outliers and influential observations by well-known outlier detection methods. Our aim is to assess the performance of outlier detection methods in terms of their effectiveness to identify outliers when applied to challenging data sets such as the household expenditures data surveyed all over the world. Moreover, different methods are evaluated through a close-to-reality simulation study. Differences in performance of univariate and multivariate robust techniques for outlier detection and their shortcomings are reported. We found that robust multivariate methods outperform robust univariate methods. The best performing methods in finding the outliers and in providing a low false discovery rate were found to be the generalized S estimators (GSE), the BACON-EEM algorithm and a compositional method (CoDa-Cov). In addition, these methods also performed best when the outliers are imputed based on the corresponding outlier detection method and indicators are estimated from the data sets.
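To make the role of structural zeros concrete, here is a sketch of a robust univariate rule (median and MAD) that leaves true zeros out of both the estimation and the flagging. It is only a simple univariate baseline of the kind the study compares against; it is not GSE, BACON-EEM, or CoDa-Cov, and the cutoff of 3 and the data are assumptions.

```python
import statistics

def robust_outlier_flags(values, cutoff=3.0):
    """Flag non-zero values whose robust z-score (median/MAD) exceeds `cutoff`.
    Structural zeros are excluded from the estimates and are never flagged."""
    nonzero = [v for v in values if v != 0]
    med = statistics.median(nonzero)
    mad = statistics.median([abs(v - med) for v in nonzero])
    scale = 1.4826 * mad          # makes MAD comparable to a standard deviation
    if scale == 0:
        return [False] * len(values)
    return [v != 0 and abs(v - med) / scale > cutoff for v in values]

# Hypothetical expenditure on one item; zeros mean the household never buys it.
spending = [0, 120, 130, 0, 125, 118, 900, 122, 0, 127]
flags = robust_outlier_flags(spending)
print(flags)
```

If the zeros were included in the estimates, the median and MAD would be pulled toward zero, which is one way structural zeros can distort standard outlier rules.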
PubMed: 35707025
DOI: 10.1080/02664763.2019.1671961 -
IEEE ... International Conference on... Jul 2022
When observing and measuring human gait data for further analysis, it is very challenging to determine whether the observed behavior is within the normal range of variability or should be considered abnormal. Moreover, gait data are usually multivariate, including motion capture, electromyography, force measurements, etc., with each source having its own unique causes of irregularities and anomalies. This paper introduces a unique algorithm for outlier detection in periodic gait data using multiple sources and multiple procedures to improve the overall accuracy. The proposed algorithm's performance is evaluated using realistic synthetic gait data, so that its accuracy can be gauged against a truly objective known solution. It is shown that the proposed method is able to detect 91.2% of the true outliers in an extensive synthetic dataset, while only producing false positives at a rate of 0.1%, outperforming other procedures usually utilized in gait data outlier detection. The proposed method is a systematic way of removing outliers from gait data, with direct applications to human biomechanics, rehabilitation and robotics, and can be applied to other scientific fields dealing with periodic data.
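The abstract does not describe the algorithm's internals, so the sketch below only illustrates one generic way to flag abnormal cycles in a single periodic signal: compare each time-normalized cycle with the pointwise median cycle and flag cycles whose distance is unusually large. The single-source setup, the RMS distance, the one-sided MAD rule, and the cutoff are all assumptions and do not reproduce the paper's multi-source, multi-procedure method.

```python
import math
import statistics

def flag_outlier_cycles(cycles, cutoff=3.0):
    """Flag cycles that lie unusually far (RMS distance) from the pointwise
    median cycle, using a one-sided median/MAD rule on the distances."""
    n_points = len(cycles[0])
    median_cycle = [statistics.median(c[t] for c in cycles) for t in range(n_points)]
    dists = [
        math.sqrt(sum((c[t] - median_cycle[t]) ** 2 for t in range(n_points)) / n_points)
        for c in cycles
    ]
    med = statistics.median(dists)
    scale = 1.4826 * statistics.median([abs(d - med) for d in dists])
    if scale == 0:
        return [False] * len(cycles)
    return [(d - med) / scale > cutoff for d in dists]

# Hypothetical time-normalized cycles (4 samples each); the last one is abnormal.
cycles = [
    [0.0, 1.0, 0.0, -1.0],
    [0.1, 1.0, 0.0, -1.0],
    [0.0, 0.9, 0.1, -1.0],
    [0.0, 1.0, 0.0, -0.9],
    [0.0, 1.1, 0.0, -1.0],
    [0.0, 3.0, 2.0, -1.0],
]
cycle_flags = flag_outlier_cycles(cycles)
print(cycle_flags)
```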
Topics: Algorithms; Biomechanical Phenomena; Electromyography; Gait; Humans
PubMed: 36176090
DOI: 10.1109/ICORR55369.2022.9896411 -
Entropy (Basel, Switzerland) May 2023
Outliers are often present in data and many algorithms exist to find these outliers. Often we can verify these outliers to determine whether they are data errors or not. Unfortunately, checking such points is time-consuming and the underlying issues leading to the data error can change over time. An outlier detection approach should therefore be able to optimally use the knowledge gained from the verification of the ground truth and adjust accordingly. With advances in machine learning, this can be achieved by applying reinforcement learning on a statistical outlier detection approach. The approach uses an ensemble of proven outlier detection methods in combination with a reinforcement learning approach to tune the coefficients of the ensemble with every additional bit of data. The performance and the applicability of the reinforcement learning outlier detection approach are illustrated using granular data reported by Dutch insurers and pension funds under the Solvency II and FTK frameworks. The application shows that outliers can be identified by the ensemble learner. Moreover, applying the reinforcement learner on top of the ensemble model can further improve the results by optimising the coefficients of the ensemble learner.
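The feedback loop described above (an ensemble whose coefficients are adjusted as verified labels arrive) can be illustrated with a very simple online update. The sketch uses a multiplicative-weights rule as a stand-in; it is not the reinforcement learning procedure of the paper, and the detector scores, learning rate, and function names are hypothetical.

```python
import math

def ensemble_score(scores, weights):
    """Weighted average of the individual detector scores (each in [0, 1])."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def update_weights(weights, scores, is_error, lr=0.5):
    """After a point has been verified, shrink the weight of each detector in
    proportion to how far its score was from the verified label."""
    target = 1.0 if is_error else 0.0
    return [w * math.exp(-lr * abs(s - target)) for w, s in zip(weights, scores)]

# Two hypothetical detectors disagree about one data point.
weights = [1.0, 1.0]
scores = [0.9, 0.2]
before = ensemble_score(scores, weights)

# Verification shows the point really was a data error.
weights = update_weights(weights, scores, is_error=True)
after = ensemble_score(scores, weights)
print(before, after, weights)
```

After the update, the detector that agreed with the verified label carries more weight, so the ensemble would score a similar point higher the next time it appears.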
PubMed: 37372186
DOI: 10.3390/e25060842 -
PeerJ. Computer Science 2022
Outliers are data points that significantly deviate from other data points in a data set because of different mechanisms or unusual processes. Outlier detection is one of the intensively studied research topics for identification of novelties, frauds, anomalies, deviations or exceptions in addition to its use for data cleansing in data science. In this study, we propose two novel outlier detection approaches using the typicality degrees which are the partitioning result of unsupervised possibilistic clustering algorithms. The proposed approaches are based on finding the atypical data points below a predefined threshold value, a possibilistic level for evaluating a point as an outlier. The experiments on the synthetic and real data sets showed that the proposed approaches can be successfully used to detect outliers without considering the structure and distribution of the features in multidimensional data sets.
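The thresholding idea can be sketched with the standard possibilistic c-means typicality formula, u = 1 / (1 + (d^2 / eta)^(1/(m-1))). The sketch assumes the cluster centers and the per-cluster scale parameters eta have already been obtained from a possibilistic clustering run; the values below, the threshold of 0.1, and the use of the maximum typicality over clusters are illustrative assumptions rather than the two approaches proposed in the paper.

```python
def typicality(point, center, eta, m=2.0):
    """Possibilistic c-means typicality of `point` for one cluster."""
    d2 = sum((a - b) ** 2 for a, b in zip(point, center))
    return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))

def is_atypical(point, centers, etas, alpha=0.1, m=2.0):
    """Flag a point as an outlier when it is atypical for every cluster,
    i.e. its largest typicality degree falls below the threshold alpha."""
    best = max(typicality(point, c, e, m) for c, e in zip(centers, etas))
    return best < alpha

# Hypothetical fitted clusters and three test points.
centers = [(0.0, 0.0), (5.0, 5.0)]
etas = [1.0, 1.0]
points = [(0.2, 0.1), (5.1, 4.9), (2.5, 10.0)]
atypical = [is_atypical(p, centers, etas) for p in points]
print(atypical)
```

Unlike fuzzy memberships, typicality degrees are not forced to sum to one across clusters, which is why a point far from every cluster can receive a low degree for all of them.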
PubMed: 36262121
DOI: 10.7717/peerj-cs.1060 -
Frontiers in Bioinformatics 2023
Conventional dimensionality reduction methods like Multidimensional Scaling (MDS) are sensitive to the presence of orthogonal outliers, leading to significant defects in the embedding. We introduce a robust MDS method, called DeCOr-MDS (Detection and Correction of Orthogonal outliers using MDS), based on the geometry and statistics of simplices formed by data points, that detects orthogonal outliers and subsequently reduces dimensionality. We validate our method using synthetic datasets, and further show how it can be applied to a variety of large real biological datasets, including cancer image cell data, human microbiome project data and single cell RNA sequencing data, to address the task of data cleaning and visualization.
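The simplex-based idea can be illustrated in its lowest-dimensional form. Assuming the data lie close to a line (a one-dimensional subspace), the simplices are triangles, and the height of a point over the base formed by two other points can be computed from pairwise distances alone with Heron's formula. A point orthogonal to the line then has a large median height, while points on the line have a median height near zero. This is a sketch under those assumptions, not the published algorithm, which works with higher-dimensional simplices and includes a correction step.

```python
import math
import statistics
from itertools import combinations

def triangle_height(d_ij, d_ik, d_jk):
    """Height of vertex i over the base (j, k), from side lengths only."""
    s = (d_ij + d_ik + d_jk) / 2
    area_sq = max(0.0, s * (s - d_ij) * (s - d_ik) * (s - d_jk))  # Heron's formula
    return 2 * math.sqrt(area_sq) / d_jk

def median_heights(dist):
    """For each point, the median height over all triangles it forms with
    pairs of other points. `dist` is a full pairwise distance matrix."""
    n = len(dist)
    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        heights = [
            triangle_height(dist[i][j], dist[i][k], dist[j][k])
            for j, k in combinations(others, 2)
        ]
        result.append(statistics.median(heights))
    return result

# Hypothetical data: five points on a line and one point displaced orthogonally.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (2, 3)]
dist = [[math.dist(p, q) for q in points] for p in points]
heights = median_heights(dist)
print(heights)
```

Taking the median over many triangles keeps the statistic stable for inliers even though some of their triangles include the outlier as a base vertex.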
PubMed: 37637212
DOI: 10.3389/fbinf.2023.1211819