-
Journal of Chemometrics Jan 2020
Data outliers can carry very valuable information and might be most informative for the interpretation. Nevertheless, they are often neglected. An algorithm called cellwise outlier diagnostics using robust pairwise log ratios (cell-rPLR) is proposed for identifying outliers in single cells of a data matrix. The algorithm is designed for metabolomic data, where, due to the size effect, the measured values are not directly comparable. Pairwise log ratios between the variable values form the elemental information for the algorithm, and the aggregation of appropriate outlyingness values results in outlyingness information. A further feature of cell-rPLR is that it is useful for biomarker identification, particularly in the presence of cellwise outliers. Real data examples and simulation studies underline the good performance of this algorithm in comparison with alternative methods.
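The idea of building cellwise outlyingness from pairwise log ratios can be sketched as follows. This is a minimal illustration and not the published cell-rPLR algorithm: the function names, the median/MAD standardisation, and the median aggregation across ratios are all assumptions chosen for brevity.

```python
import math
from statistics import median


def robust_z(values):
    """Standardise with the median and the MAD (assumed robust scale)."""
    med = median(values)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    mad = 1.4826 * median([abs(v - med) for v in values])
    if mad == 0:
        return [0.0 for _ in values]
    return [(v - med) / mad for v in values]


def cell_outlyingness(X):
    """X: rows are samples, columns are variables, all values positive.

    For each cell (i, j), collect the robust z-scores of log(x_ij / x_ik)
    over all k != j and aggregate them with the median (an assumption).
    Log ratios cancel any per-sample scaling, i.e. the size effect.
    """
    n, p = len(X), len(X[0])
    collected = [[[] for _ in range(p)] for _ in range(n)]
    for j in range(p):
        for k in range(p):
            if j == k:
                continue
            ratios = [math.log(X[i][j] / X[i][k]) for i in range(n)]
            for i, z in enumerate(robust_z(ratios)):
                collected[i][j].append(z)
    return [[median(cell) for cell in row] for row in collected]


# Hypothetical data: similar compositions, different sample sizes,
# and one inflated cell at row 2, column 1.
base = [
    [1.0, 2.0, 4.0, 8.0],
    [1.1, 2.1, 3.9, 8.2],
    [0.9, 38.0, 4.1, 7.9],
    [1.05, 2.05, 4.05, 8.1],
    [0.95, 1.95, 3.95, 7.8],
    [1.0, 2.2, 4.2, 8.3],
]
sizes = [1.0, 3.0, 0.5, 2.0, 10.0, 1.0]
X_demo = [[s * v for v in row] for s, row in zip(sizes, base)]
scores_demo = cell_outlyingness(X_demo)
```

Because a single contaminated cell distorts only the ratios it takes part in, a median over the ratios of each cell keeps the neighbouring cells in the same row from being flagged along with it.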
PubMed: 32189829
DOI: 10.1002/cem.3182 -
PLOS Digital Health May 2024
Clinical discoveries largely depend on dedicated clinicians and scientists to identify and pursue unique and unusual clinical encounters with patients and communicate these through case reports and case series. This process has remained essentially unchanged throughout the history of modern medicine. However, these traditional methods are inefficient, especially considering the modern-day availability of health-related data and the sophistication of computer processing. Outlier analysis has been used in various fields to uncover unique observations, including fraud detection in finance and quality control in manufacturing. We propose that clinical discovery can be formulated as an outlier problem within an augmented intelligence framework to be implemented on any health-related data. Such an augmented intelligence approach would accelerate the identification and pursuit of clinical discoveries, advancing our medical knowledge and uncovering new therapies and management approaches. We define clinical discoveries as contextual outliers measured through an information-based approach and with a novelty-based root cause. Our augmented intelligence framework has five steps: define a patient population with a desired clinical outcome, build a predictive model, identify outliers through appropriate measures, investigate outliers through domain content experts, and generate scientific hypotheses. Recognizing that the field of obstetrics can particularly benefit from this approach, as it is traditionally neglected in commercial research, we conducted a systematic review to explore how outlier analysis is implemented in obstetric research. We identified two obstetrics-related studies that assessed outliers at an aggregate level for purposes outside of clinical discovery. Our findings indicate that the use of outlier analysis requires further development, both in obstetric research and in clinical research in general.
PubMed: 38776276
DOI: 10.1371/journal.pdig.0000515 -
Analytical Methods : Advancing Methods... Feb 2024
Surface-enhanced Raman spectroscopy (SERS) has shown promising potential in cancer screening. In practical applications, Raman spectra are often affected by deviations from the spectrometer, changes in measurement environments, and anomalies in spectrum characteristic peak intensities due to improper sample storage. Previous research has overlooked the presence of outliers in categorical data, leading to significant impacts on model learning outcomes. In this study, we propose a novel method, called Principal Component Analysis and Density Based Spatial Clustering of Applications with Noise (PCA-DBSCAN), to effectively remove outliers. This method employs dimensionality reduction and spectral data clustering to identify and remove outliers. The PCA-DBSCAN method introduces adjustable parameters (Eps and MinPts) to control the clustering effect. The effectiveness of the proposed PCA-DBSCAN method is verified through modeling on outlier-removed datasets. Further refinement of the machine learning model and PCA-DBSCAN parameters resulted in the best cancer screening model, achieving 97.41% macro-average recall and 97.74% macro-average F1-score. This paper introduces a new outlier removal method that significantly improves the performance of the SERS cancer screening model. Moreover, the proposed method serves as inspiration for outlier detection in other fields, such as biomedical research, environmental monitoring, manufacturing, quality control, and hazard prediction.
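The combination of dimensionality reduction and density-based clustering described above can be sketched in plain Python. This is an illustrative toy and not the authors' PCA-DBSCAN implementation: the power-iteration PCA, the function names, the toy spectra, and the values chosen for Eps (`eps`) and MinPts (`min_pts`) are all assumptions.

```python
import math


def pca_project(X, n_components=2, n_iter=200):
    """Project rows of X onto the leading principal components,
    found by power iteration with deflation on the covariance matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    C = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(d)]
         for a in range(d)]
    components = []
    for _ in range(n_components):
        v = [float(j + 1) for j in range(d)]  # fixed, deterministic start
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        eigval = sum(v[a] * sum(C[a][b] * v[b] for b in range(d))
                     for a in range(d))
        components.append(v)
        # Deflate: remove the component just found from the covariance.
        C = [[C[a][b] - eigval * v[a] * v[b] for b in range(d)]
             for a in range(d)]
    return [[sum(r[j] * comp[j] for j in range(d)) for comp in components]
            for r in Xc]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= min_pts:
                queue.extend(nbj)
    return labels


# Hypothetical "spectra": six similar rows and one aberrant row.
spectra = [
    [1.0, 2.0, 3.0, 2.0, 1.0],
    [1.1, 2.1, 3.1, 2.0, 0.9],
    [0.9, 1.9, 2.9, 2.1, 1.0],
    [1.0, 2.0, 3.2, 1.9, 1.1],
    [1.0, 2.2, 3.0, 2.0, 1.0],
    [1.1, 1.9, 3.0, 2.1, 1.0],
    [5.0, 0.5, 9.0, 0.2, 4.0],
]
labels = dbscan(pca_project(spectra, n_components=2), eps=1.0, min_pts=3)
kept = [row for row, lab in zip(spectra, labels) if lab != -1]
```

Points labelled `-1` are the ones DBSCAN treats as noise; dropping them before model training mirrors the outlier-removal step in the abstract. In practice both parameters would need tuning, as the abstract notes.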
Topics: Spectrum Analysis, Raman; Cluster Analysis; Principal Component Analysis; Biomedical Research
PubMed: 38231020
DOI: 10.1039/d3ay02037a -
BMC Medical Research Methodology Oct 2023
BACKGROUND
Growth studies rely on longitudinal measurements, typically represented as trajectories. However, anthropometry is prone to errors that can generate outliers. While various methods are available for detecting outlier measurements, a gold standard has yet to be identified, and there is no established method for outlying trajectories. Thus, outlier types and their effects on growth pattern detection still need to be investigated. This work aimed to assess the performance of six methods at detecting different types of outliers, propose two novel methods for outlier trajectory detection and evaluate how outliers affect growth pattern detection.
METHODS
We included 393 healthy infants from The Applied Research Group for Kids (TARGet Kids!) cohort and 1651 children with severe malnutrition from the co-trimoxazole prophylaxis clinical trial. We injected outliers of three types and six intensities and applied four outlier detection methods for measurements (model-based and World Health Organization cut-offs-based) and two for trajectories. We also assessed growth pattern detection before and after outlier injection using time series clustering and latent class mixed models. Error type, intensity, and population affected method performance.
RESULTS
Model-based outlier detection methods performed best for measurements, with precision between 5.72% and 99.89%, especially for low and moderate error intensities. The clustering-based outlier trajectory method had high precision of 14.93-99.12%. Combining methods improved the detection rate to 21.82% in outlier measurements. Finally, when comparing growth groups with and without outliers, the outliers were shown to alter group membership by 57.9-79.04%.
CONCLUSIONS
World Health Organization cut-off-based techniques were shown to perform well in a few very particular cases (extreme errors of high intensity), while model-based techniques performed well, especially for moderate errors of low intensity. Clustering-based outlier trajectory detection performed exceptionally well across all types and intensities of errors, indicating a potential strategic change in how outliers in growth data are viewed. Finally, the importance of detecting outliers was shown, given its impact on child growth studies, as demonstrated by comparing results of growth group detection.
Topics: Child; Humans; Cluster Analysis; Research Design; Infant; Child Development
PubMed: 37833647
DOI: 10.1186/s12874-023-02045-w -
BMJ Open Jul 2023
OBJECTIVES
Benchmarking is common in clinical registries to support the improvement of health outcomes by identifying underperforming clinicians or health service providers. Despite the rise in clinical registries and interest in publicly reporting benchmarking results, appropriate methods for benchmarking and outlier detection within clinical registries are not well established, and the current application of methods is inconsistent. The aim of this review was to determine the current statistical methods of outlier detection that have been evaluated in the context of clinical registry benchmarking.
DESIGN
A systematic search for studies evaluating the performance of methods to detect outliers when benchmarking in clinical registries was conducted in five databases: EMBASE, ProQuest, Scopus, Web of Science and Google Scholar. A modified healthcare modelling evaluation tool was used to assess quality; data extracted from each study were summarised and presented in a narrative synthesis.
RESULTS
Nineteen studies evaluating a variety of statistical methods in 20 clinical registries were included. The majority of studies conducted application studies comparing outliers without statistical performance assessment (79%), while only a few studies used simulations to conduct more rigorous evaluations (21%). A common comparison was between random effects and fixed effects regression, which provided mixed results. Registry population coverage, provider case volume minimum and missing data handling were all poorly reported.
CONCLUSIONS
The optimal methods for detecting outliers when benchmarking clinical registry data remain unclear, and the use of different models may provide vastly different results. Further research is needed to address the unresolved methodological considerations and evaluate methods across a range of registry conditions.
PROSPERO REGISTRATION NUMBER
CRD42022296520.
Topics: Humans; Benchmarking; Registries
PubMed: 37451708
DOI: 10.1136/bmjopen-2022-069130 -
IEEE Transactions on Image Processing :... Jan 2018
Identifying different types of data outliers with abnormal behaviors in a multi-view data setting is challenging due to the complicated data distributions across different views. Conventional approaches achieve this by learning a new latent feature representation with the pairwise constraint on different view data. In this paper, we argue that the existing methods are expensive in generalizing their models from two-view data to three-view (or more) data, in terms of the number of introduced variables and detection performance. To address this, we propose a novel multi-view outlier detection method with consensus regularization on the latent representations. Specifically, we explicitly characterize each kind of outlier by the intrinsic cluster assignment labels and sample-specific errors. Moreover, we provide a thorough discussion of the proposed consensus-regularization and the pairwise-regularization. Correspondingly, an optimization solution based on the augmented Lagrangian multiplier method is proposed and derived in detail. In the experiments, we evaluate our method on five well-known machine learning data sets with different outlier settings. Further, to show its effectiveness in real-world computer vision scenarios, we tailor our proposed model to saliency detection and face reconstruction applications. The extensive results of both the standard multi-view outlier detection task and the extended computer vision tasks demonstrate the effectiveness of our proposed method.
PubMed: 28945594
DOI: 10.1109/TIP.2017.2754942 -
Journal of Applied Statistics 2021
Neuroscience is a combination of different scientific disciplines which investigate the nervous system to understand its biological basis. Recently, applications to the diagnosis of neurodegenerative diseases like Parkinson's disease have become very promising by considering different statistical regression models. However, well-known statistical regression models may give misleading results for the diagnosis of neurodegenerative diseases when experimental data contain outlier observations that lie an abnormal distance from the other observations. The main achievement of this study is a novel mathematics-supported approach, alongside statistical regression models, to identify and treat outliers without direct elimination, addressing a great and emerging challenge for humankind such as neurodegenerative diseases. With this approach, a new method named CMMSOM is proposed, drawing on the powerful convex and continuous optimization techniques referred to as conic quadratic programming. This method, based on the mean-shift outlier regression model, is developed by combining the robustness of M-estimation and the stability of Tikhonov regularization. We apply our method and other parametric models to the Parkinson telemonitoring dataset, which is a real-world dataset in neuroscience. Then, we compare these methods by using well-known method-free performance measures. The results indicate that the CMMSOM method performs better than current parametric models.
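For orientation, the mean-shift outlier model that the abstract refers to is commonly written as below. The notation is an assumption, and the penalised objective is only a generic Tikhonov-type sketch; the abstract does not give the CMMSOM formulation, and the M-estimation component is not shown here.

```latex
% Mean-shift outlier model: \gamma_i = 0 for a regular observation,
% \gamma_i \neq 0 shifts the mean of an outlying observation.
y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \gamma_i + \varepsilon_i,
\qquad i = 1, \dots, n.

% Generic Tikhonov-type penalised fit (illustrative only), with
% \boldsymbol{\theta} = (\boldsymbol{\beta}^{\top}, \boldsymbol{\gamma}^{\top})^{\top},
% a penalty matrix L and a tuning parameter \lambda > 0:
\min_{\boldsymbol{\theta}} \;
\lVert \mathbf{y} - X\boldsymbol{\beta} - \boldsymbol{\gamma} \rVert_2^2
+ \lambda \lVert L\boldsymbol{\theta} \rVert_2^2 .
```

Estimating the shifts instead of deleting the affected rows is what allows outliers to be treated "without direct elimination" in the sense of the abstract.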
PubMed: 35707096
DOI: 10.1080/02664763.2020.1864815 -
Cureus Mar 2023
OBJECTIVES
Clinical discoveries are heralded by observing unique and unusual clinical cases. The effort of identifying such cases rests on the shoulders of busy clinicians. We assess the feasibility and applicability of an augmented intelligence framework to accelerate the rate of clinical discovery in preeclampsia and hypertensive disorders of pregnancy, an area that has seen little change in its clinical management.
METHODS
We conducted a retrospective exploratory outlier analysis of participants enrolled in the folic acid clinical trial (FACT, N=2,301) and the Ottawa and Kingston birth cohort (OaK, N=8,085). We applied two outlier analysis methods: extreme misclassification contextual outlier and isolation forest point outlier. The extreme misclassification contextual outlier is based on a random forest predictive model for the outcome of preeclampsia in FACT and hypertensive disorder of pregnancy in OaK. We defined outliers in the extreme misclassification approach as mislabelled observations with a confidence level of more than 90%. Within the isolation forest approach, we defined outliers as observations with an average path length z score less than or equal to -3, or greater than or equal to 3. Content experts reviewed the identified outliers and determined if they represented a potential novelty that could conceivably lead to a clinical discovery.
RESULTS
In the FACT study, we identified 19 outliers using the isolation forest algorithm and 13 outliers using the random forest extreme misclassification approach. We determined that three (15.8%) and 10 (76.9%) were potential novelties, respectively. Out of 8,085 participants in the OaK study, we identified 172 outliers using the isolation forest algorithm and 98 outliers using the random forest extreme misclassification approach; four (2.3%) and 32 (32.7%), respectively, were potential novelties. Overall, the outlier analysis part of the augmented intelligence framework identified a total of 302 outliers. These were subsequently reviewed by content experts, representing the human part of the augmented intelligence framework. The clinical review determined that 49 of the 302 outliers represented potential novelties.
CONCLUSIONS
Augmented intelligence using extreme misclassification outlier analysis is a feasible and applicable approach for accelerating the rate of clinical discoveries. The extreme misclassification contextual outlier analysis approach resulted in a higher proportion of potential novelties than the more traditional point outlier isolation forest approach. This finding was consistent in both the clinical trial and real-world cohort study data. Using augmented intelligence through outlier analysis has the potential to speed up the process of identifying potential clinical discoveries. This approach can be replicated across clinical disciplines and could exist within electronic medical records systems to automatically flag outliers within clinical notes for clinical experts.
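The two outlier definitions given in this abstract (a wrong label predicted with more than 90% confidence, and an average path length z score at or beyond ±3) can be expressed as simple threshold rules. The sketch below assumes the model outputs are already available: the function names are hypothetical, the inputs are invented, and the use of the population standard deviation for the z score is an assumption, since the abstract does not say how the score was standardised.

```python
from statistics import mean, pstdev


def extreme_misclassified(y_true, prob_positive, confidence=0.90):
    """Indices where the model puts more than `confidence` probability
    on the class that disagrees with the recorded label (0 or 1)."""
    flagged = []
    for i, (y, p) in enumerate(zip(y_true, prob_positive)):
        p_wrong = 1.0 - p if y == 1 else p
        if p_wrong > confidence:
            flagged.append(i)
    return flagged


def path_length_outliers(avg_path_lengths, z_cut=3.0):
    """Indices whose average path length z score is <= -z_cut or >= z_cut."""
    mu = mean(avg_path_lengths)
    sd = pstdev(avg_path_lengths)  # assumption: population standard deviation
    if sd == 0:
        return []
    return [i for i, v in enumerate(avg_path_lengths)
            if abs((v - mu) / sd) >= z_cut]


# Hypothetical model outputs, not data from either study.
labels_demo = [1, 0, 1, 0, 1]
probs_demo = [0.95, 0.03, 0.05, 0.97, 0.60]
contextual = extreme_misclassified(labels_demo, probs_demo)

paths_demo = [10.0] * 10 + [10.5] * 5 + [9.5] * 5 + [3.0]
point = path_length_outliers(paths_demo)
```

In the framework described above, the indices returned by either rule would be handed to content experts for review; the code does not judge whether an outlier is a potential novelty.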
PubMed: 37009347
DOI: 10.7759/cureus.36909 -
Archives of Pathology & Laboratory... Jul 1999
OBJECTIVES
To determine the causes of excessive test turnaround time (TAT) and to identify methods of improvement by studying reasons for those tests reported in excess of 70 minutes from the time the test was ordered (ie, outliers).
DESIGN
Self-directed data-gathering of stat outlier TAT events from intensive care units and emergency departments, with descriptive parameters associated with each event and additional descriptive parameters associated with the participant.
PARTICIPANTS
Laboratories enrolled in the 1996 College of American Pathologists Q-Probes program.
MAIN OUTCOME MEASURES
Components associated with outlier TAT events and outlier TAT rates.
RESULTS
Four hundred ninety-six hospital laboratories returned data on 218 551 stat tests, of which 10.6% had TATs in excess of 70 minutes. Ten percent of stat emergency department tests and 14.7% of stat intensive care unit tests were outliers. Major areas in which delays occurred were test ordering, 29.9%; within-laboratory (analytic) phase, 28.2%; collection of the specimen, 27.4%; postanalytic phase, 1.9%; and undetermined, 12.5%. The type of test performed was a significant factor and was independent of location: Chemistry-Multiple Test appeared most frequently (approximately 40%), followed by Hematology-Complete Blood Count (approximately 20%) and Chemistry-Single Test (approximately 18%). Factors of outlier TAT components for intensive care unit specimens were identified using statistical modeling and included hour of day, type of health care personnel collecting specimen, performing the test in a stat laboratory, and reason for delay. Outlier rates were not associated with any identified factors. The practice parameters of laboratories with outlier rates in the lowest 10th percentile significantly differed from those with rates in the top 10th percentile in test request computerization, report methods, and ordering methods.
CONCLUSIONS
We observed that outlier analysis yields new information, such as type of test and reason for delay, concerning test delays when compared with TAT determination alone. Laboratories experiencing stat test TAT problems should use this tool as an adjunct to routine TAT monitoring for identifying unique causes of delay.
Topics: Clinical Laboratory Techniques; Humans; Time Factors
PubMed: 10388917
DOI: 10.5858/1999-123-0607-UOETMT -
Journal of Pediatric Gastroenterology... Aug 2022
OBJECTIVE
To create a new methodology that has a single simple rule to identify height outliers in the electronic health records (EHR) of children.
METHODS
We constructed 2 independent cohorts of children 2 to 8 years old to train and validate a model predicting heights from age, gender, race and weight with monotonic Bayesian additive regression trees. The training cohort consisted of 1376 children where outliers were unknown. The testing cohort consisted of 318 patients that were manually reviewed retrospectively to identify height outliers.
RESULTS
The amount of variation in height values explained by our model, R², was 82.2% and 75.3% in the training and testing cohorts, respectively. The discriminatory ability to assess height outliers in the testing cohort, as assessed by the area under the receiver operating characteristic curve, was excellent at 0.841. Based on a relatively aggressive cutoff of 0.075, the outlier sensitivity is 0.713, the specificity is 0.793, the positive predictive value is 0.615, and the negative predictive value is 0.856.
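For readers less familiar with the four summary measures reported here, the sketch below shows how they follow from a confusion matrix of flagged versus manually confirmed outliers. The function names are hypothetical and the counts in the example are invented for illustration; they are not the study's data.

```python
def confusion_counts(is_outlier, flagged):
    """Count true/false positives and negatives from two boolean lists:
    `is_outlier` is the manual review, `flagged` is the model's decision."""
    tp = sum(1 for t, f in zip(is_outlier, flagged) if t and f)
    fp = sum(1 for t, f in zip(is_outlier, flagged) if not t and f)
    fn = sum(1 for t, f in zip(is_outlier, flagged) if t and not f)
    tn = sum(1 for t, f in zip(is_outlier, flagged) if not t and not f)
    return tp, fp, fn, tn


def diagnostic_metrics(tp, fp, fn, tn):
    """Standard definitions of the four measures."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true outliers that are flagged
        "specificity": tn / (tn + fp),  # share of valid values left unflagged
        "ppv": tp / (tp + fp),          # share of flags that are true outliers
        "npv": tn / (tn + fn),          # share of unflagged values that are valid
    }


# Invented counts for illustration only.
metrics_demo = diagnostic_metrics(tp=8, fp=3, fn=2, tn=17)
```

Lowering the cutoff flags more measurements, which typically raises sensitivity at the cost of specificity and positive predictive value.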
CONCLUSIONS
We have developed a new reliable, largely automated, outlier detection method which is applicable to the identification of height outliers in the pediatric EHR. This methodology can be applied to assess the veracity of height measurements ensuring reliable indices of body proportionality such as body mass index.
Topics: Bayes Theorem; Child; Child, Preschool; Electronic Health Records; Humans; Machine Learning; ROC Curve; Retrospective Studies
PubMed: 35641892
DOI: 10.1097/MPG.0000000000003492