outlier - OpenMD.com Journal Search

Statistics-Based Outlier Detection and Correction Method for Amazon Customer Reviews.

Entropy (Basel, Switzerland) Dec 2021

Authors: Ishani Chatterjee, Mengchu Zhou, Abdullah Abusorrah...

People nowadays use the internet to share their assessments, impressions, ideas, and observations about various subjects or products on numerous social networking sites. These sites are a rich source of data for data analytics, sentiment analysis, natural language processing, etc. Conventionally, the true sentiment of a customer review matches its corresponding star rating. There are exceptions in which the star rating of a review is opposite to its true sentiment. These are labeled as the outliers in a dataset in this work. The state-of-the-art methods for anomaly detection involve manual searching, predefined rules, or traditional machine learning techniques to detect such instances. This paper conducts a sentiment analysis and outlier detection case study for Amazon customer reviews, and it proposes a statistics-based outlier detection and correction method (SODCM), which helps identify such reviews and rectify their star ratings to enhance the performance of a sentiment analysis algorithm without any data loss. This paper focuses on applying SODCM to datasets containing customer reviews of various products, which are (a) scraped from Amazon.com and (b) publicly available. The paper also studies the dataset and draws conclusions about the effect of SODCM on the performance of a sentiment analysis algorithm. The results show that SODCM achieves higher accuracy and recall than other state-of-the-art anomaly detection algorithms.

PubMed: 34945950
DOI: 10.3390/e23121645

Outlier Guided Optimization of Abdominal Segmentation.

Proceedings of SPIE--the International... 2020

Authors: Yuchen Xu, Olivia Tang, Yucheng Tang...

Abdominal multi-organ segmentation of computed tomography (CT) images has been the subject of extensive research interest. It presents a substantial challenge in medical image processing, as the shape and distribution of abdominal organs can vary greatly among the population and within an individual over time. While continuous integration of novel datasets into the training set provides potential for better segmentation performance, collection of data at scale is not only costly, but also impractical in some contexts. Moreover, it remains unclear what marginal value additional data have to offer. Herein, we propose a single-pass active learning method through human quality assurance (QA). We built on a pre-trained 3D U-Net model for abdominal multi-organ segmentation and augmented the dataset either with outlier data (e.g., exemplars for which the baseline algorithm failed) or inliers (e.g., exemplars for which the baseline algorithm worked). The new models were trained using the augmented datasets with 5-fold cross-validation (for outlier data) and withheld outlier samples (for inlier data). Manual labeling of outliers increased Dice scores with outliers by 0.130, compared to an increase of 0.067 with inliers (p<0.001, two-tailed paired t-test). By adding 5 to 37 inliers or outliers to training, we find that the marginal value of adding outliers is higher than that of adding inliers. In summary, improvement on single-organ performance was obtained without diminishing multi-organ performance or significantly increasing training time. Hence, identification and correction of baseline failure cases present an effective and efficient method of selecting training data to improve algorithm performance.
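The Dice score reported in the abstract above measures the overlap between a predicted and a reference segmentation mask. The following is a minimal sketch of that metric for flat binary masks. The function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details taken from the paper.

```python
def dice_score(pred, truth):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    pred = [bool(p) for p in pred]
    truth = [bool(t) for t in truth]
    # Voxels labeled as organ in both masks.
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

On this scale, the reported gain of 0.130 is a difference between two such scores, each of which lies between 0 (no overlap) and 1 (identical masks).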

PubMed: 33907347
DOI: 10.1117/12.2549365

A method for detecting outliers in linear-circular non-parametric regression.

PloS One 2023

Authors: Sümeyra Sert, Filiz Kardiyen

This study proposes a robust outlier detection method based on the circular median for non-parametric linear-circular regression when the response variable includes outlier(s) and the residuals are wrapped-Cauchy distributed. Nadaraya-Watson and local linear regression methods were employed to obtain non-parametric regression fits. The proposed method's performance was investigated using a real dataset and a comprehensive simulation study with different sample sizes, contamination degrees, and heterogeneity degrees. The method performs quite well at medium and higher contamination degrees, and its performance increases as the sample size and the homogeneity of the data increase. In addition, when the response variable of linear-circular regression contains outliers, the local linear estimation method fits the data set better than the Nadaraya-Watson method.
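To make the regression setting concrete, here is a minimal sketch of a Nadaraya-Watson-type fit for a circular response (an angle in radians) on a linear predictor. It takes the direction of the kernel-weighted mean of the sines and cosines. The Gaussian kernel, the function names, and the `bandwidth` parameter are assumptions for illustration. The sketch does not implement the circular-median outlier detection step that the study proposes.

```python
import math


def nw_circular_fit(x0, xs, thetas, bandwidth):
    """Kernel-weighted circular mean of the angles `thetas` at the point x0."""
    # Gaussian kernel weights (an assumed kernel choice).
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    s = sum(w * math.sin(t) for w, t in zip(weights, thetas))
    c = sum(w * math.cos(t) for w, t in zip(weights, thetas))
    # atan2 returns the direction of the weighted resultant vector.
    return math.atan2(s, c)


def circular_residual(theta, fitted):
    """Signed angular difference theta - fitted, wrapped onto the circle."""
    d = theta - fitted
    return math.atan2(math.sin(d), math.cos(d))


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    thetas = [0.1, 0.2, 0.3, 0.4]
    fit = nw_circular_fit(1.5, xs, thetas, bandwidth=1.0)
    print(fit, circular_residual(0.3, fit))
```

Wrapping the residual matters because angles such as 0.1 and 2π - 0.1 are only 0.2 radians apart, even though their raw difference is large.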

Topics: Humans; Linear Models; Computer Simulation; Drug Contamination; Sample Size; Seizures

PubMed: 37307265
DOI: 10.1371/journal.pone.0286448

STAR_outliers: a python package that separates univariate outliers from non-normal distributions.

BioData Mining Sep 2023

Authors: John T Gregg, Jason H Moore

There are not currently any univariate outlier detection algorithms that transform and model arbitrarily shaped distributions to remove univariate outliers. Some algorithms model skew, even fewer model kurtosis, and none of them model bimodality and monotonicity. To overcome these challenges, we have implemented an algorithm for Skew and Tail-heaviness Adjusted Removal of Outliers (STAR_outliers) that robustly removes univariate outliers from distributions with many different shape profiles, including extreme skew, extreme kurtosis, bimodality, and monotonicity. We show that STAR_outliers removes simulated outliers with greater recall and precision than several general algorithms, and it also models the outlier bounds of real data distributions with greater accuracy.

BACKGROUND

Reliably removing univariate outliers from arbitrarily shaped distributions is a difficult task. Incorrectly assuming unimodality or overestimating tail heaviness fails to remove outliers, while underestimating tail heaviness incorrectly removes regular data from the tails. Skew often produces one heavy tail and one light tail, and we show that several sophisticated outlier removal algorithms often fail to remove outliers from the light tail. Multivariate outlier detection algorithms have recently become popular, but having tested PyOD's multivariate outlier removal algorithms, we found them to be inadequate for univariate outlier removal. They usually do not allow for univariate input, and they do not fit their distributions of outliership scores with a model on which an outlier threshold can be accurately established. Thus, there is a need for a flexible outlier removal algorithm that can model arbitrarily shaped univariate distributions.

RESULTS

In order to effectively model arbitrarily shaped univariate distributions, we have combined several well-established algorithms into a new algorithm called STAR_outliers. STAR_outliers removes more simulated true outliers and fewer non-outliers than several other univariate algorithms. These include several normality-assuming outlier removal methods, PyOD's isolation forest (IF) outlier removal algorithm (ACM Transactions on Knowledge Discovery from Data (TKDD) 6:3, 2012) with default settings, and an IQR based algorithm by Verardi and Vermandele that removes outliers while accounting for skew and kurtosis (Verardi and Vermandele, Journal de la Société Française de Statistique 157:90-114, 2016). Since the IF algorithm's default model poorly fit the outliership scores, we also compared the isolation forest algorithm with a model that entails removing as many datapoints as STAR_outliers does in order of decreasing outliership scores. We also compared these algorithms on the publicly available 2018 National Health and Nutrition Examination Survey (NHANES) data by setting the outlier threshold to keep values falling within the main 99.3 percent of the fitted model's domain. We show that our STAR_outliers algorithm removes significantly closer to 0.7 percent of values from these features than other outlier removal methods on average.

CONCLUSIONS

STAR_outliers is an easily implemented python package for removing outliers that outperforms multiple commonly used methods of univariate outlier removal.
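For context, the classic normality-based baseline that skew-aware and kurtosis-aware methods refine is Tukey's IQR fence rule. The sketch below is that baseline only and is not the STAR_outliers algorithm or its API. The function names are hypothetical. Under an exactly normal distribution, fences at 1.5 times the IQR keep about 99.3 percent of values, which matches the 99.3 percent figure in the abstract. Whether the authors chose their threshold for that reason is an assumption.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Lower and upper fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5):
    """Return (inliers, outliers) according to the Tukey fences."""
    low, high = tukey_fences(values, k)
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


def normal_coverage(k=1.5):
    """Fraction of a standard normal distribution that lies inside the fences."""
    z = statistics.NormalDist()
    q3 = z.inv_cdf(0.75)
    upper = q3 + k * (2 * q3)  # the IQR of a standard normal is 2 * q3
    return z.cdf(upper) - z.cdf(-upper)


if __name__ == "__main__":
    print(split_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
    print(round(normal_coverage(), 4))
```

Because the fences are symmetric around the quartiles, this rule treats both tails alike, which is the limitation on skewed data that the abstract describes.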

PubMed: 37667378
DOI: 10.1186/s13040-023-00342-0

Managing Outliers in Adolescent Food Frequency Questionnaire Data.

Journal of Nutrition Education and... Jan 2021

Authors: Morgan S Lee, April Idalski Carcone, Linda Ko...

OBJECTIVE

The goal of this study was to explore the impact of 5 decision rules for removing outliers from adolescent food frequency questionnaire (FFQ) data.

DESIGN

This secondary analysis used baseline and 3-month data from a weight loss intervention clinical trial.

PARTICIPANTS

African American adolescents (n = 181) were recruited from outpatient clinics and community health fairs.

VARIABLES MEASURED

Data collected included self-reported FFQ and mediators of weight (food addiction, depressive symptoms, and relative reinforcing value of food), caregiver-reported executive functioning, and objectively measured weight status (percentage overweight).

ANALYSIS

Descriptive statistics examined patterns in study variables at baseline and follow-up. Correlational analyses explored the relationships between FFQ data and key study variables at baseline and follow-up.

RESULTS

Compared with not removing outliers, using decision rules reduced the number of cases and restricted the range of data. The magnitude of baseline FFQ-mediator relationships was attenuated under all decision rules but varied (increasing, decreasing, and reversing direction) at follow-up. Decision rule use increased the magnitude of change in FFQ estimated energy intake and significantly strengthened its relationship with weight change under 2 fixed range decision rules.

CONCLUSIONS AND IMPLICATIONS

Results suggest careful evaluation of outliers and testing and reporting the effects of different outlier decision rules through sensitivity analyses.

Topics: Adolescent; Diet; Diet Records; Diet Surveys; Energy Intake; Female; Humans; Male; Motivation; Reproducibility of Results; Surveys and Questionnaires

PubMed: 33012663
DOI: 10.1016/j.jneb.2020.08.002

Bioinformatic modelling of SARS-CoV-2 pandemic with a focus on country-specific dynamics.

BMC Public Health Jan 2023

Authors: Jakub Liu, Tomasz Suchocki, Joanna Szyda...

BACKGROUND

One of the seminal events since 2019 has been the outbreak of the SARS-CoV-2 pandemic. Countries have adopted various policies to deal with it, but they also differ in their socio-geographical characteristics and public health care facilities. Our study aimed to investigate differences between epidemiological parameters across countries.

METHOD

The analysed data come from the SARS-CoV-2 repository provided by Johns Hopkins University. Separately for each country, we estimated recovery and mortality rates using the SIRD model applied to the first 30, 60, 150, and 300 days of the pandemic. Moreover, a mixture of normal distributions was fitted to the number of confirmed cases and deaths during the first 300 days. The estimates of peaks' means and variances were used to identify countries with outlying parameters.
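The SIRD model named above splits a population into susceptible, infected, recovered, and deceased compartments, with a recovery rate and a mortality rate governing the flows out of the infected group. The following is a minimal forward-simulation sketch using Euler steps. It is not the study's estimation procedure, and the function name, parameter names (`beta`, `gamma`, `mu`), and step size are assumptions for illustration.

```python
def simulate_sird(s0, i0, beta, gamma, mu, days, dt=0.1):
    """Simulate S, I, R, D compartments; returns one (s, i, r, d) tuple per day.

    beta: transmission rate, gamma: recovery rate, mu: mortality rate (per day).
    """
    n = s0 + i0  # total population, constant in this closed model
    s, i, r, d = float(s0), float(i0), 0.0, 0.0
    steps_per_day = round(1 / dt)
    trajectory = [(s, i, r, d)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            new_deaths = mu * i * dt
            s -= new_infections
            i += new_infections - new_recoveries - new_deaths
            r += new_recoveries
            d += new_deaths
        trajectory.append((s, i, r, d))
    return trajectory


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to produce a visible epidemic curve.
    path = simulate_sird(s0=999_990, i0=10, beta=0.3, gamma=0.1, mu=0.01, days=300)
    print(path[30], path[300])
```

In this model, the ratio of cumulative recoveries to cumulative deaths equals `gamma / mu`, which is why recovery and mortality rates can be compared across countries as in the study.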

RESULTS

For the 300-day timeframe, Belgium, Cyprus, France, the Netherlands, Serbia, and the UK were classified as outliers by all three outlier detection methods. Yemen was classified as an outlier for each of the four considered timeframes, due to high mortality rates. During the first 300 days of the pandemic, the majority of countries underwent three peaks in the number of confirmed cases, except Australia and Kazakhstan with two peaks.

CONCLUSIONS

Considering recovery and mortality rates, we observed heterogeneity between countries. Liechtenstein was the "positive" outlier, with low mortality rates and high recovery rates; conversely, Yemen was a "negative" outlier, with high mortality for all four considered periods and low recovery for 30 and 60 days.

Topics: Humans; SARS-CoV-2; COVID-19; Pandemics; Disease Outbreaks; France

PubMed: 36681790
DOI: 10.1186/s12889-023-15092-1

Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets.

Frontiers in Bioinformatics 2023

Authors: Wanxin Li, Jules Mirone, Ashok Prasad...

Conventional dimensionality reduction methods like Multidimensional Scaling (MDS) are sensitive to the presence of orthogonal outliers, leading to significant defects in the embedding. We introduce a robust MDS method, called DeCOr-MDS (Detection and Correction of Orthogonal outliers using MDS), based on the geometry and statistics of simplices formed by data points, that detects orthogonal outliers and subsequently reduces dimensionality. We validate our method using synthetic datasets and further show how it can be applied to a variety of large real biological datasets, including cancer image cell data, Human Microbiome Project data, and single-cell RNA sequencing data, to address the task of data cleaning and visualization.

PubMed: 37637212
DOI: 10.3389/fbinf.2023.1211819

[Outlier patient admissions and their relationship with the emergence of clinical complications and prolonged hospital stays].

Gaceta Sanitaria 2019

Authors: Enrique Cabrera Torres, María Aránzazu García Iglesias, María Teresa Santos Jiménez...

OBJECTIVE

To analyze the relationship between the type of hospital admission (outlier and non-outlier admissions) and the appearance of clinical complications and the average stay.

METHODS

From a retrospective epidemiological study of a cohort of patients admitted to the Hospital Complejo Asistencial Universitario de Salamanca (Salamanca, Spain) over a six-month period, outlier and non-outlier patients were identified. This project had access to the admissions department database, the hospital's CMBD (in Spanish, Conjunto Mínimo Básico de Datos) for hospitalisation, the AP-DRG (All Patient-Diagnosis Related Groups) and ALCOR (a clinical-statistics analytics tool). It then proceeded to break down the results by DRG, looking at the five most common DRGs in that period.

RESULTS

8.4% of the total 11,842 admissions were medical outliers. In the overall study, the average stay was longer for outlier patients (8.11 days) than for other patients (7.15 days). The mortality rate was, likewise, higher for outlier patients, although there was a reduced incidence of complications (7.6% for outlier patients as opposed to 8.4% for others). The analysis by DRG corroborated these results in three of the five cases investigated, showing longer average stays but fewer clinical complications in the case of outlier patients.

CONCLUSIONS

On admission to hospital, a significant proportion of patients were allocated beds on inappropriate wards (outlier patients). It was more common to find medical patients placed on surgical wards than vice versa. The average stay of outlier patients was longer than that of patients admitted to the correct ward. The study found no significant difference between the two groups in terms of clinical complication rates.

Topics: Cohort Studies; Diagnosis-Related Groups; Epidemiologic Studies; Humans; Length of Stay; Patient Admission; Retrospective Studies

PubMed: 28943019
DOI: 10.1016/j.gaceta.2017.07.012

The increasing impact of length of stay "outliers" on length of stay at an urban academic hospital.

BMC Health Services Research Sep 2021

Authors: Andrew H Hughes, David Horrocks, Curtis Leung...

BACKGROUND

As healthcare systems strive for efficiency, hospital "length of stay outliers" have the potential to significantly impact a hospital's overall utilization. There is a tendency to exclude such "outlier" stays in local quality improvement and data reporting due to their assumed rare occurrence and disproportionate ability to skew mean and other summary data. This study sought to assess the influence of length of stay (LOS) outliers on inpatient length of stay and hospital capacity over a 5-year period at a large urban academic medical center.

METHODS

From January 2014 through December 2019, 169,645 consecutive inpatient cases were analyzed and assigned an expected LOS based on national academic center benchmarks. Cases in the top 1% of national sample LOS by diagnosis were flagged as length of stay outliers.
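The flagging rule above can be sketched as a per-diagnosis percentile cutoff: a stay is an outlier when it exceeds the 99th percentile of a reference length-of-stay sample for the same diagnosis. The function names, the data layout, and the use of `statistics.quantiles` to estimate the cutoff are assumptions for illustration. The study itself used national academic center benchmarks, which are not reproduced here.

```python
import statistics


def outlier_thresholds(reference_los):
    """Map each diagnosis to the 99th percentile of its reference LOS sample (days)."""
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    return {
        diagnosis: statistics.quantiles(stays, n=100)[98]
        for diagnosis, stays in reference_los.items()
    }


def flag_outliers(cases, thresholds):
    """Return True for each (diagnosis, los_days) case above its diagnosis cutoff."""
    return [los > thresholds[diagnosis] for diagnosis, los in cases]


if __name__ == "__main__":
    # Hypothetical reference sample: stays of 1 to 100 days for a single diagnosis.
    reference = {"pneumonia": list(range(1, 101))}
    cutoffs = outlier_thresholds(reference)
    print(flag_outliers([("pneumonia", 50), ("pneumonia", 120)], cutoffs))
```

Computing the cutoff from an external reference sample rather than from the hospital's own cases is what allows the local share of outliers to drift above 1%, as the results report.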

RESULTS

From 2014 to 2019, mean outlier LOS increased (40.98 to 45.11 days), as did inpatient LOS with outliers excluded (5.63 to 6.19 days). Outlier cases increased both in number (from 297 to 412) and as a percent of total discharges (0.98 to 1.56%), and outlier patient days increased from 6.7 to 9.8% of total inpatient plus observation days over the study period.

CONCLUSIONS

Outlier cases utilize a disproportionate and increasing share of hospital resources and available beds. The current tendency to exclude such outlier stays in data reporting due to assumed rare occurrence may need to be revisited. Outlier stays require distinct and targeted interventions to appropriately reduce length of stay to both improve patient care and maintain hospital capacity.

Topics: Hospitals, Urban; Humans; Length of Stay; Quality Improvement; Retrospective Studies

PubMed: 34503494
DOI: 10.1186/s12913-021-06972-6

Time series outlier removal and imputing methods based on Colombian weather stations data.

Environmental Science and Pollution... Jun 2023

Authors: Jaime Parra-Plazas, Paulo Gaona-Garcia, Leonardo Plazas-Nossa...

Time series from weather stations are a source of information on floods. Studying the time series of previous winters reveals the behavior of the variables, and the results are applied to analysis and simulation models that feed variables such as flow and level of a study area. One of the most common problems in acquiring and transmitting data from weather stations is the presence of atypical values and lost data, which causes difficulties in the simulation process. Consequently, a numerical strategy is needed to solve this problem. The data source for this study is a real database in which these problems occur for different weather variables. This study compares three methods of time series analysis to evaluate a multivariable process offline. We applied a method based on the discrete Fourier transform (DFT) and contrasted it with methods such as the average and linear regression without uncertainty parameters to complete missing data. The proposed methodology entails statistical values, outlier detection, and the application of the DFT. The DFT allows completion of the time series, given its ability to manage various gap sizes and replace missing values. In sum, the DFT led to low error percentages for all the time series (1% average). This percentage reflects what would likely have been the shape or pattern of the time series in the absence of misleading outliers and missing data.
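One way a DFT can complete a time series is by iterative spectral reconstruction: fill the gaps with a first guess, keep only the strongest frequency components, and overwrite the gap positions with the smoothed signal until the values settle. The sketch below illustrates that general idea on a synthetic sine wave. It is an assumed scheme, not the authors' procedure, and the function names and the `keep` and `iterations` parameters are hypothetical.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform (adequate for short series)."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def inverse_dft(coeffs):
    """Real part of the inverse discrete Fourier transform."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(coeffs)).real / n
        for t in range(n)
    ]


def fill_gaps_dft(series, keep=2, iterations=30):
    """Replace None entries using the `keep` strongest DFT components."""
    known = [v for v in series if v is not None]
    mean = sum(known) / len(known)
    gaps = [t for t, v in enumerate(series) if v is None]
    filled = [mean if v is None else v for v in series]  # first guess: the mean
    for _ in range(iterations):
        coeffs = dft(filled)
        order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
        strongest = set(order[:keep])
        smooth = inverse_dft([c if k in strongest else 0 for k, c in enumerate(coeffs)])
        for t in gaps:
            filled[t] = smooth[t]  # observed values are never modified
    return filled


if __name__ == "__main__":
    truth = [math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]
    observed = [None if t in (10, 11, 30, 45) else v for t, v in enumerate(truth)]
    filled = fill_gaps_dft(observed)
    print(max(abs(a - b) for a, b in zip(filled, truth)))
```

The sketch assumes the signal is dominated by a few periodic components. Real weather series would need outliers removed first, as the abstract notes, because extreme values distort the spectrum.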

Topics: Time Factors; Colombia; Weather; Linear Models; Computer Simulation

PubMed: 37165270
DOI: 10.1007/s11356-023-27176-x