-
PloS One 2023
This study proposes a robust outlier detection method based on the circular median for non-parametric linear-circular regression when the response variable includes outlier(s) and the residuals are wrapped-Cauchy distributed. Nadaraya-Watson and local linear regression methods were employed to obtain non-parametric regression fits. The proposed method's performance was investigated using a real dataset and a comprehensive simulation study with different sample sizes, contamination degrees, and heterogeneity degrees. The method performs quite well at medium and higher contamination degrees, and its performance improves as the sample size and the homogeneity of the data increase. In addition, when the response variable of linear-circular regression contains outliers, the local linear estimation method fits the data set better than the Nadaraya-Watson method.
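The ingredients named in this abstract (a Nadaraya-Watson fit of a circular response, circular residuals, and a circular median) can be illustrated with a minimal sketch. It is not the authors' procedure: the Gaussian kernel, the bandwidth `h`, the `cutoff` value, and the helper names are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np

def circ_dist(a, b):
    """Shortest arc between angles a and b in radians; result lies in [0, pi]."""
    return np.pi - np.abs(np.pi - np.abs(a - b) % (2 * np.pi))

def nw_circular(x, theta, x0, h):
    """Nadaraya-Watson fit of a circular response on a linear predictor.

    Gaussian kernel weights are applied to sin(theta) and cos(theta)
    separately, and the fitted direction is recovered with arctan2.
    """
    w = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return np.arctan2(w @ np.sin(theta), w @ np.cos(theta))

def circ_median(angles):
    """Sample angle that minimizes the summed arc distance to all others."""
    total = circ_dist(angles[:, None], angles[None, :]).sum(axis=1)
    return angles[np.argmin(total)]

def flag_outliers(x, theta, h=0.1, cutoff=1.0):
    """Indices whose residual lies far from the circular median residual.

    h and cutoff are hypothetical tuning values, not taken from the paper.
    """
    fit = nw_circular(x, theta, x, h)
    # Signed circular residuals wrapped into (-pi, pi].
    resid = np.arctan2(np.sin(theta - fit), np.cos(theta - fit))
    return np.flatnonzero(circ_dist(resid, circ_median(resid)) > cutoff)

# Synthetic example: a smooth trend with one contaminated response.
x = np.linspace(0.0, 1.0, 50)
theta = 1.0 + 0.5 * x
theta[25] += 3.0
flagged = flag_outliers(x, theta)
```

Here `flagged` holds the indices of the suspected outliers; in practice the cutoff would need to be calibrated to the residual distribution rather than fixed by hand.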
Topics: Humans; Linear Models; Computer Simulation; Drug Contamination; Sample Size; Seizures
PubMed: 37307265
DOI: 10.1371/journal.pone.0286448 -
BioData Mining Sep 2023
There are not currently any univariate outlier detection algorithms that transform and model arbitrarily shaped distributions to remove univariate outliers. Some algorithms model skew, even fewer model kurtosis, and none of them model bimodality and monotonicity. To overcome these challenges, we have implemented an algorithm for Skew and Tail-heaviness Adjusted Removal of Outliers (STAR_outliers) that robustly removes univariate outliers from distributions with many different shape profiles, including extreme skew, extreme kurtosis, bimodality, and monotonicity. We show that STAR_outliers removes simulated outliers with greater recall and precision than several general algorithms, and it also models the outlier bounds of real data distributions with greater accuracy.
BACKGROUND
Reliably removing univariate outliers from arbitrarily shaped distributions is a difficult task. Incorrectly assuming unimodality or overestimating tail heaviness fails to remove outliers, while underestimating tail heaviness incorrectly removes regular data from the tails. Skew often produces one heavy tail and one light tail, and we show that several sophisticated outlier removal algorithms often fail to remove outliers from the light tail. Multivariate outlier detection algorithms have recently become popular, but having tested PyOD's multivariate outlier removal algorithms, we found them to be inadequate for univariate outlier removal. They usually do not allow for univariate input, and they do not fit their distributions of outliership scores with a model on which an outlier threshold can be accurately established. Thus, there is a need for a flexible outlier removal algorithm that can model arbitrarily shaped univariate distributions.
RESULTS
In order to effectively model arbitrarily shaped univariate distributions, we have combined several well-established algorithms into a new algorithm called STAR_outliers. STAR_outliers removes more simulated true outliers and fewer non-outliers than several other univariate algorithms. These include several normality-assuming outlier removal methods, PyOD's isolation forest (IF) outlier removal algorithm (ACM Transactions on Knowledge Discovery from Data (TKDD) 6:3, 2012) with default settings, and an IQR-based algorithm by Verardi and Vermandele that removes outliers while accounting for skew and kurtosis (Verardi and Vermandele, Journal de la Société Française de Statistique 157:90-114, 2016). Since the IF algorithm's default model poorly fit the outliership scores, we also compared the isolation forest algorithm with a model that entails removing as many datapoints as STAR_outliers does in order of decreasing outliership scores. We also compared these algorithms on the publicly available 2018 National Health and Nutrition Examination Survey (NHANES) data by setting the outlier threshold to keep values falling within the main 99.3 percent of the fitted model's domain. We show that our STAR_outliers algorithm removes significantly closer to 0.7 percent of values from these features than other outlier removal methods on average.
CONCLUSIONS
STAR_outliers is an easily implemented Python package for removing outliers that outperforms multiple commonly used methods of univariate outlier removal.
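For orientation, the kind of normality-assuming baseline this abstract compares against can be sketched with Tukey's 1.5 × IQR fences. This is not the STAR_outliers algorithm or its API; `tukey_keep` is a hypothetical helper. The sketch also checks one plausible reading of the 99.3 percent target, namely that it is the share of a normal distribution lying inside those fences (an assumption, since the abstract does not say where the figure comes from).

```python
from statistics import NormalDist

import numpy as np

def tukey_keep(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    return v[(v >= q1 - k * iqr) & (v <= q3 + k * iqr)]

# Share of a standard normal distribution that falls inside the fences.
z = NormalDist()
q3 = z.inv_cdf(0.75)               # upper quartile; the lower one is -q3
upper = q3 + 1.5 * (2 * q3)        # Q3 + 1.5 * IQR, with IQR = 2 * q3
coverage = z.cdf(upper) - z.cdf(-upper)

# Small usage example with one obvious high value.
kept = tukey_keep([1, 2, 3, 4, 5, 100])
```

Because the fences are symmetric around the quartiles, a rule like this is exactly the kind that can miss outliers in the light tail of a skewed distribution, which is the failure mode the abstract describes.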
PubMed: 37667378
DOI: 10.1186/s13040-023-00342-0 -
Journal of Nutrition Education and... Jan 2021
OBJECTIVE
The goal of this study was to explore the impact of 5 decision rules for removing outliers from adolescent food frequency questionnaire (FFQ) data.
DESIGN
This secondary analysis used baseline and 3-month data from a weight loss intervention clinical trial.
PARTICIPANTS
African American adolescents (n = 181) were recruited from outpatient clinics and community health fairs.
VARIABLES MEASURED
Data collected included self-reported FFQ and mediators of weight (food addiction, depressive symptoms, and relative reinforcing value of food), caregiver-reported executive functioning, and objectively measured weight status (percentage overweight).
ANALYSIS
Descriptive statistics examined patterns in study variables at baseline and follow-up. Correlational analyses explored the relationships between FFQ data and key study variables at baseline and follow-up.
RESULTS
Compared with not removing outliers, using decision rules reduced the number of cases and restricted the range of data. The magnitude of baseline FFQ-mediator relationships was attenuated under all decision rules but varied (increasing, decreasing, and reversing direction) at follow-up. Decision rule use increased the magnitude of change in FFQ estimated energy intake and significantly strengthened its relationship with weight change under 2 fixed range decision rules.
CONCLUSIONS AND IMPLICATIONS
Results suggest careful evaluation of outliers and testing and reporting the effects of different outlier decision rules through sensitivity analyses.
Topics: Adolescent; Diet; Diet Records; Diet Surveys; Energy Intake; Female; Humans; Male; Motivation; Reproducibility of Results; Surveys and Questionnaires
PubMed: 33012663
DOI: 10.1016/j.jneb.2020.08.002 -
BMC Public Health Jan 2023
BACKGROUND
One of the seminal events since 2019 has been the outbreak of the SARS-CoV-2 pandemic. Countries have adopted various policies to deal with it, but they also differ in their socio-geographical characteristics and public health care facilities. Our study aimed to investigate differences between epidemiological parameters across countries.
METHOD
The analysed data come from the SARS-CoV-2 repository provided by Johns Hopkins University. Separately for each country, we estimated recovery and mortality rates using the SIRD model applied to the first 30, 60, 150, and 300 days of the pandemic. Moreover, a mixture of normal distributions was fitted to the number of confirmed cases and deaths during the first 300 days. The estimates of the peaks' means and variances were used to identify countries with outlying parameters.
RESULTS
For the 300-day timeframe, Belgium, Cyprus, France, the Netherlands, Serbia, and the UK were classified as outliers by all three outlier detection methods. Yemen was classified as an outlier for each of the four considered timeframes, due to high mortality rates. During the first 300 days of the pandemic, the majority of countries underwent three peaks in the number of confirmed cases, except Australia and Kazakhstan, which had two peaks.
CONCLUSIONS
Considering recovery and mortality rates, we observed heterogeneity between countries. Liechtenstein was the "positive" outlier, with low mortality rates and high recovery rates; conversely, Yemen was a "negative" outlier, with high mortality for all four considered periods and low recovery for 30 and 60 days.
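The SIRD model mentioned in the method splits a population into susceptible, infected, recovered, and deceased compartments, with a recovery rate and a mortality rate governing the flow out of the infected group. A minimal forward-Euler sketch of the standard equations follows; the parameter values, the step size, and the function name are illustrative assumptions and are not estimates from the study.

```python
import numpy as np

def simulate_sird(n, i0, beta, gamma, mu, days, dt=0.1):
    """Forward-Euler integration of the standard SIRD equations.

    dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - (gamma + mu)*I,
    dR/dt = gamma*I,      dD/dt = mu*I
    Returns an array of shape (days + 1, 4) with columns S, I, R, D.
    """
    s, i, r, d = float(n - i0), float(i0), 0.0, 0.0
    steps_per_day = int(round(1 / dt))
    out = [(s, i, r, d)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_inf = beta * s * i / n * dt
            new_rec = gamma * i * dt
            new_dead = mu * i * dt
            s -= new_inf
            i += new_inf - new_rec - new_dead
            r += new_rec
            d += new_dead
        out.append((s, i, r, d))
    return np.array(out)

# Hypothetical parameters, chosen only to produce a plausible epidemic curve.
traj = simulate_sird(n=1_000_000, i0=10, beta=0.3, gamma=0.1, mu=0.01, days=300)
```

Estimating `gamma` and `mu` per country, as the study does, would mean fitting curves like these to the reported recoveries and deaths over each timeframe; the fitting step is not shown here.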
Topics: Humans; SARS-CoV-2; COVID-19; Pandemics; Disease Outbreaks; France
PubMed: 36681790
DOI: 10.1186/s12889-023-15092-1 -
Frontiers in Bioinformatics 2023
Conventional dimensionality reduction methods like Multidimensional Scaling (MDS) are sensitive to the presence of orthogonal outliers, leading to significant defects in the embedding. We introduce a robust MDS method, called (Detection and Correction of Orthogonal outliers using MDS), based on the geometry and statistics of simplices formed by data points, that makes it possible to detect orthogonal outliers and subsequently reduce dimensionality. We validate our method using synthetic datasets, and further show how it can be applied to a variety of large real biological datasets, including cancer image cell data, human microbiome project data and single cell RNA sequencing data, to address the task of data cleaning and visualization.
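The sensitivity described here can be reproduced with plain classical MDS, sketched below under stated assumptions. It shows only the conventional method and the defect an orthogonal outlier causes, not the robust method the abstract introduces; the helper names and the synthetic data are made up for illustration.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS from a matrix of pairwise distances."""
    n = dist.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    b = -0.5 * j @ (dist ** 2) @ j             # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

def pairwise(p):
    """Euclidean distance matrix of the rows of p."""
    return np.linalg.norm(p[:, None, :] - p[None, :, :], axis=2)

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 2))

# Clean planar data: a 2-D embedding reproduces every distance.
emb = classical_mds(pairwise(pts))

# Push one point out of the plane (an orthogonal outlier) and embed again.
pts3 = np.hstack([pts, np.zeros((20, 1))])
pts3[0, 2] = 10.0
emb_out = classical_mds(pairwise(pts3))

# Largest distance error among the 19 untouched points.
distortion = np.abs(pairwise(emb_out[1:]) - pairwise(pts[1:])).max()
```

With the outlier present, one embedding axis is spent on the out-of-plane displacement, so distances among the untouched points are no longer preserved.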
PubMed: 37637212
DOI: 10.3389/fbinf.2023.1211819 -
Gaceta Sanitaria 2019
OBJECTIVE
To analyze the relationship between the type of hospital admission (outlier and non-outlier admissions) and the appearance of clinical complications and the average stay.
METHODS
From a retrospective epidemiological study of a cohort of patients admitted to the Hospital Complejo Asistencial Universitario de Salamanca (Salamanca, Spain) over a six-month period, outlier and non-outlier patients were identified. This project had access to the admissions department database, the hospital's CMBD (in Spanish, Conjunto Mínimo Básico de Datos) for hospitalisation, the AP-DRG (All Patient-Diagnosis Related Groups) and ALCOR (a clinical-statistics analytics tool). It then proceeded to break down the results by DRG, looking at the five most common DRGs in that period.
RESULTS
8.4% of the total 11,842 admissions were medical outliers. In the overall study, the average stay was longer for outlier patients (8.11 days) than for other patients (7.15 days). The mortality rate was, likewise, higher for outlier patients, although there was a reduced incidence of complications (7.6% for outlier patients as opposed to 8.4% for others). The analysis by DRG corroborated these results in three of the five cases investigated, showing longer average stays but fewer clinical complications in the case of outlier patients.
CONCLUSIONS
On admission to hospital, a significant proportion of patients were allocated beds on inappropriate wards (outlier patients). It was more common to find medical patients placed on surgical wards than vice versa. The average stay of outlier patients was longer than that of patients admitted to the correct ward. The study found no significant difference between the two groups in terms of clinical complication rates.
Topics: Cohort Studies; Diagnosis-Related Groups; Epidemiologic Studies; Humans; Length of Stay; Patient Admission; Retrospective Studies
PubMed: 28943019
DOI: 10.1016/j.gaceta.2017.07.012 -
BMC Health Services Research Sep 2021
BACKGROUND
As healthcare systems strive for efficiency, hospital "length of stay outliers" have the potential to significantly impact a hospital's overall utilization. There is a tendency to exclude such "outlier" stays in local quality improvement and data reporting due to their assumed rare occurrence and disproportionate ability to skew mean and other summary data. This study sought to assess the influence of length of stay (LOS) outliers on inpatient length of stay and hospital capacity over a 5-year period at a large urban academic medical center.
METHODS
From January 2014 through December 2019, 169,645 consecutive inpatient cases were analyzed and assigned an expected LOS based on national academic center benchmarks. Cases in the top 1% of national sample LOS by diagnosis were flagged as length of stay outliers.
RESULTS
From 2014 to 2019, mean outlier LOS increased (40.98 to 45.11 days), as did inpatient LOS with outliers excluded (5.63 to 6.19 days). Outlier cases increased both in number (from 297 to 412) and as a percent of total discharges (0.98 to 1.56%), and outlier patient days increased from 6.7 to 9.8% of total inpatient plus observation days over the study period.
CONCLUSIONS
Outlier cases utilize a disproportionate and increasing share of hospital resources and available beds. The current tendency to exclude such outlier stays in data reporting due to assumed rare occurrence may need to be revisited. Outlier stays require distinct and targeted interventions to appropriately reduce length of stay to both improve patient care and maintain hospital capacity.
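The flagging rule in the methods (cases in the top 1% of length of stay for their diagnosis) can be sketched as follows. One important assumption: the study takes its cutoffs from a national benchmark sample, whereas this sketch computes the 99th percentile within the data at hand. The function name and the toy data are hypothetical.

```python
import numpy as np

def flag_los_outliers(los_days, diagnosis, q=99.0):
    """Flag stays above the q-th percentile of LOS within each diagnosis.

    Cutoffs are computed from the supplied data, which is an assumption;
    the study used national benchmark thresholds instead.
    """
    los = np.asarray(los_days, dtype=float)
    dx = np.asarray(diagnosis)
    flags = np.zeros(los.shape, dtype=bool)
    for code in np.unique(dx):
        mask = dx == code
        cutoff = np.percentile(los[mask], q)
        flags[mask] = los[mask] > cutoff
    return flags

# Toy data: two diagnosis groups, each with a long right tail.
los = np.array(list(range(1, 101)) + [2] * 50 + [60], dtype=float)
dx = np.array(["A"] * 100 + ["B"] * 51)
flags = flag_los_outliers(los, dx)

share_of_cases = flags.mean()                   # fraction of stays flagged
share_of_days = los[flags].sum() / los.sum()    # fraction of bed-days used
```

Comparing `share_of_cases` with `share_of_days` mirrors the paper's contrast between outliers as a percent of discharges and as a percent of patient days.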
Topics: Hospitals, Urban; Humans; Length of Stay; Quality Improvement; Retrospective Studies
PubMed: 34503494
DOI: 10.1186/s12913-021-06972-6 -
Environmental Science and Pollution... Jun 2023
Time series from weather stations are a source of information about floods. Studying the previous wintertime series reveals the behavior of the variables, and the results feed analysis and simulation models of variables such as flow and level in a study area. One of the most common problems in acquiring and transmitting data from weather stations is the presence of atypical values and lost data, which creates difficulties in the simulation process. Consequently, a numerical strategy is needed to solve this problem. The data source for this study is a real database in which these problems occur across different weather variables. The study compares three time series analysis methods to evaluate a multivariable process offline. We applied a method based on the discrete Fourier transform (DFT) and contrasted it with the average and linear regression without uncertainty parameters as ways to complete missing data. The proposed methodology comprises computing statistical values, outlier detection, and applying the DFT. The DFT allows time series completion thanks to its ability to handle various gap sizes and replace missing values. In sum, the DFT led to low error percentages for all the time series (1% on average). This percentage reflects what the shape or pattern of the time series would likely have been in the absence of misleading outliers and missing data.
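One common way to complete gaps with the DFT is to alternate between keeping the dominant frequency components and re-imposing the observed samples. The sketch below follows that generic scheme; it is not necessarily the procedure used in the study, and `dft_fill`, `n_keep`, and `n_iter` are hypothetical names and settings.

```python
import numpy as np

def dft_fill(series, n_keep=1, n_iter=50):
    """Fill NaN gaps by iterating a low-order DFT reconstruction.

    Each pass keeps the n_keep largest-magnitude rfft coefficients,
    inverts the transform, and copies the result into the gaps only;
    observed samples are never modified.
    """
    y = np.asarray(series, dtype=float)
    missing = np.isnan(y)
    x = y.copy()
    x[missing] = np.nanmean(y)          # initial guess: mean of observed data
    for _ in range(n_iter):
        spec = np.fft.rfft(x)
        keep = np.argsort(np.abs(spec))[-n_keep:]
        filtered = np.zeros_like(spec)
        filtered[keep] = spec[keep]
        smooth = np.fft.irfft(filtered, n=len(x))
        x[missing] = smooth[missing]
    return x

# Synthetic check: a pure sinusoid with an 8-sample gap.
t = np.arange(128)
truth = np.sin(2 * np.pi * 4 * t / 128)
gappy = truth.copy()
gappy[40:48] = np.nan
filled = dft_fill(gappy)
```

On real weather data the number of retained components would have to be chosen per variable, and outliers would need to be masked as missing before filling so that they do not leak into the spectrum.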
Topics: Time Factors; Colombia; Weather; Linear Models; Computer Simulation
PubMed: 37165270
DOI: 10.1007/s11356-023-27176-x -
PeerJ. Computer Science 2022
Outliers are data points that significantly deviate from other data points in a data set because of different mechanisms or unusual processes. Outlier detection is one of the intensively studied research topics for identification of novelties, frauds, anomalies, deviations or exceptions in addition to its use for data cleansing in data science. In this study, we propose two novel outlier detection approaches using the typicality degrees which are the partitioning result of unsupervised possibilistic clustering algorithms. The proposed approaches are based on finding the atypical data points below a predefined threshold value, a possibilistic level for evaluating a point as an outlier. The experiments on the synthetic and real data sets showed that the proposed approaches can be successfully used to detect outliers without considering the structure and distribution of the features in multidimensional data sets.
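The thresholding idea can be sketched with the standard possibilistic c-means typicality formula, t = 1 / (1 + (d² / η)^(1/(m−1))). The sketch assumes the cluster centers and the scale parameters η have already been obtained from a clustering run; the threshold `alpha`, the helper names, and the toy data are assumptions, and the two approaches proposed in the paper may differ in detail.

```python
import numpy as np

def typicality(points, centers, eta, m=2.0):
    """Possibilistic typicality of each point with respect to each cluster."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return 1.0 / (1.0 + (d2 / eta[None, :]) ** (1.0 / (m - 1.0)))

def flag_atypical(points, centers, eta, alpha=0.1, m=2.0):
    """Indices of points whose best typicality falls below alpha."""
    t = typicality(points, centers, eta, m)
    return np.flatnonzero(t.max(axis=1) < alpha)

# Toy data: two tight clusters and one point far from both.
points = np.array([[0.1, 0.0], [0.0, -0.2], [5.0, 5.1], [4.9, 5.0], [20.0, -20.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])   # assumed to come from clustering
eta = np.array([1.0, 1.0])                     # assumed per-cluster scale
flagged = flag_atypical(points, centers, eta)
```

Unlike fuzzy memberships, typicalities need not sum to one across clusters, which is why a point far from every center can receive a low value for all of them.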
PubMed: 36262121
DOI: 10.7717/peerj-cs.1060 -
Knowledge-based Systems Feb 2022
The presence of outliers can severely degrade learned representations and performance of deep learning methods and hence disproportionately affect the training process, leading to incorrect conclusions about the data. For example, anomaly detection using deep generative models is typically only possible when similar anomalies (or outliers) are not present in the training data. Here we focus on variational autoencoders (VAEs). While the VAE is a popular framework for anomaly detection tasks, we observe that the VAE is unable to detect outliers when the training data contains anomalies that have the same distribution as those in test data. In this paper we focus on robustness to outliers in training data in VAE settings using concepts from robust statistics. We propose a variational lower bound that leads to a robust VAE model that has the same computational complexity as the standard VAE and contains a single automatically-adjusted tuning parameter to control the degree of robustness. We present mathematical formulations for robust variational autoencoders (RVAEs) for Bernoulli, Gaussian and categorical variables. The RVAE model is based on beta-divergence rather than the standard Kullback-Leibler (KL) divergence. We demonstrate the performance of our proposed β-divergence-based autoencoder for a variety of image and categorical datasets showing improved robustness to outliers both qualitatively and quantitatively. We also illustrate the use of our robust VAE for detection of lesions in brain images, formulated as an anomaly detection task. Finally, we suggest a method to tune the hyperparameter of RVAE which makes our model completely unsupervised.
PubMed: 36714396
DOI: 10.1016/j.knosys.2021.107886