Journal of Medical Internet Research, Jul 2023
Observational Study
BACKGROUND
Reference intervals (RIs) play an important role in clinical decision-making. However, because of the time, labor, and financial costs of establishing RIs by direct means, indirect methods based on big data previously obtained from clinical laboratories are attracting increasing attention. Different indirect techniques, combined with different data transformation and outlier removal methods, may yield different RIs, yet systematic evaluations of this are scarce.
OBJECTIVE
This study used data derived from direct methods as reference standards and evaluated the accuracy of combinations of different data transformation, outlier removal, and indirect techniques in establishing complete blood count (CBC) RIs for large-scale data.
METHODS
The CBC data of populations aged ≥18 years undergoing physical examination from January 2010 to December 2011 were retrieved from the First Affiliated Hospital of China Medical University in northern China. After excluding duplicate individuals, we applied the parametric, nonparametric, Hoffmann, Bhattacharya, and truncation points and Kolmogorov-Smirnov distance (kosmic) indirect methods, combined with log or Box-Cox transformation and with Reed-Dixon, Tukey, or iterative mean (3SD) outlier removal, to derive the RIs of 8 CBC parameters, and compared the results with RIs previously established by the direct method. Furthermore, bias ratios (BRs) were calculated to assess which combination of indirect technique, data transformation, and outlier removal method is preferable.
RESULTS
Raw data showed that the skewness of the white blood cell (WBC) count, platelet (PLT) count, mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), and mean corpuscular volume (MCV) was much more pronounced than that of the other CBC parameters. After log or Box-Cox transformation combined with Tukey or iterative mean (3SD) processing, the distributions of these data were close to Gaussian. Tukey-based outlier removal yielded the maximum number of outliers. The lower-limit bias of WBC (male), PLT (male), hemoglobin (HGB; male), MCH (male/female), and MCV (female) was greater than the corresponding upper-limit bias for more than half of the 30 indirect methods. The optimal computational choices differed between males and females for the same CBC parameters. The RIs of MCHC established by the direct method for females were narrow; for this parameter, the kosmic method was markedly superior, in contrast to the CBC parameters with high |BR| qualification rates for males. Among the top 10 methodologies with high BR qualification rates for the WBC count, PLT count, HGB, MCV, and MCHC among males, the Bhattacharya, Hoffmann, and parametric methods were superior to the other 2 indirect methods.
CONCLUSIONS
Compared to results derived by the direct method, outlier removal methods and indirect techniques markedly influence the final RIs, whereas data transformation has negligible effects, except for obviously skewed data. Specifically, the outlier removal efficiency of Tukey and iterative mean (3SD) methods is almost equivalent. Furthermore, the choice of indirect techniques depends more on the characteristics of the studied analyte itself. This study provides scientific evidence for clinical laboratories to use their previous data sets to establish RIs.
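One of the study's combinations can be made concrete in a few lines: Box-Cox transformation, Tukey fence outlier removal, and nonparametric percentile limits. This is a minimal sketch, not the authors' implementation; the simulated data, sample size, and all parameter choices are illustrative assumptions.

```python
# Sketch of one indirect-RI combination discussed in the study:
# Box-Cox transformation -> Tukey fence outlier removal ->
# nonparametric 2.5th/97.5th percentile limits. Illustrative only;
# real CBC data would replace the simulated values below.
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

def reference_interval(values, alpha=0.05):
    x = np.asarray(values, dtype=float)
    x = x[x > 0]                          # Box-Cox requires positive data
    z, lmbda = stats.boxcox(x)            # transform toward normality
    q1, q3 = np.percentile(z, [25, 75])
    iqr = q3 - q1
    keep = (z >= q1 - 1.5 * iqr) & (z <= q3 + 1.5 * iqr)  # Tukey fences
    lo, hi = np.percentile(z[keep], [100 * alpha / 2, 100 * (1 - alpha / 2)])
    # back-transform the limits to the original measurement scale
    return float(inv_boxcox(lo, lmbda)), float(inv_boxcox(hi, lmbda))

# e.g. a skewed, WBC-like simulated distribution (units arbitrary)
rng = np.random.default_rng(0)
wbc = rng.lognormal(mean=1.8, sigma=0.25, size=5000)
low, high = reference_interval(wbc)
```

Swapping `stats.boxcox` for `np.log`, or the Tukey fences for an iterative mean (3SD) rule, reproduces other cells of the study's combination grid.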
Topics: Adolescent; Adult; Female; Humans; Male; Big Data; Blood Cell Count; China; Leukocyte Count; Reference Values; Clinical Decision-Making
PubMed: 37459170
DOI: 10.2196/45651
Communications in Statistics..., 2023
A two-stage joint survival model is used to analyse time-to-event outcomes that may be associated with biomarkers collected repeatedly over time. Two-stage joint survival models have limited model-checking tools and are usually assessed using standard diagnostics for survival models, which leaves room for improvement. Time-varying covariates in a two-stage joint survival model might contain outlying observations or subjects. In this study we used the variance shift outlier model (VSOM) to detect and down-weight outliers in the first stage of the two-stage joint survival model. This entails fitting a VSOM at the observation level and a VSOM at the subject level, and then fitting a combined VSOM for the identified outliers. The fitted values extracted from the combined VSOM were then used as a time-varying covariate in the extended Cox model. We illustrate this methodology on a dataset from a multi-centre randomised clinical trial, in which the combined VSOM fitted the data better than the extended Cox model alone. We note that the combined VSOM achieves a better fit because outliers are down-weighted.
PubMed: 37981985
DOI: 10.1080/03610918.2021.1995751
PLoS One, 2019
BACKGROUND
Levels exceeding the standard reference interval (RI) for total thyroxine (TT4) concentrations are diagnostic for hyperthyroidism, however some hyperthyroid cats have TT4 values within the RI. Determining outlier TT4 concentrations should aid practitioners in identification of hyperthyroidism. The objective of this study was to determine the expected distribution of TT4 concentration using a large population of cats (531,765) of unknown health status to identify unexpected TT4 concentrations (outlier), and determine whether this concentration changes with age.
METHODOLOGY/PRINCIPAL FINDINGS
This population-based, retrospective study evaluated an electronic database of laboratory results to identify unique TT4 measurements between January 2014 and July 2015. An expected distribution of TT4 concentrations was determined using a large population of cats (531,765) of unknown health status, which in turn was used to identify unexpected (outlier) TT4 concentrations and to determine whether this concentration changes with age. All cats between 1 and 9 years of age (n = 141,294) had the same expected distribution of TT4 concentration (0.5-3.5 μg/dL), and cats with a TT4 value >3.5 μg/dL were determined to be unexpected outliers. There was a steep and progressive rise in both the total number and the percentage of statistical outliers in the feline population as a function of age. The greatest acceleration in the percentage of outliers occurred between 7 and 14 years of age, up to 4.6 times the rate seen between 3 and 7 years of age.
CONCLUSIONS
TT4 concentrations >3.5 μg/dL represent outliers from the expected distribution of TT4 concentration. Furthermore, age has a strong influence on the proportion of cats with outlier values. These findings suggest that patients with TT4 concentrations >3.5 μg/dL should be evaluated more closely for hyperthyroidism, particularly between the ages of 7 and 14 years. This may aid clinicians in earlier identification of hyperthyroidism in at-risk patients.
Topics: Animals; Biomarkers; Cat Diseases; Cats; Female; Hyperthyroidism; Male; Retrospective Studies; Thyroxine; Time Factors; United States
PubMed: 30840691
DOI: 10.1371/journal.pone.0213259
Analytical Chemistry, Feb 2020
Previously, we introduced an approach for calculating the full object distance in the framework of principal component analysis that can be applied to data exploration and classification. Here, a similar approach is developed for regression problems, in which a total distance can be calculated for every sample in projection modeling. Based on the total distance, a threshold for outlier detection is derived by means of a data-driven estimation of the degrees of freedom and scaling parameters for the partial distances in the projection models. A joint threshold serves as the basis for a sequential outlier detection procedure. The iterative nature of the procedure helps overcome the masking effect among outliers, and a backward step eliminates swamping effects. Two real examples are used for illustration. The first dataset represents capsules filled with specially prepared mixtures of an active pharmaceutical ingredient and a number of excipients; it illustrates the behavior of possible outliers in the regression model and their corresponding locations in the X- and XY-distance plots. The second dataset consists of spectra of 135 whole wheat samples used for the prediction of protein, gluten, and moisture content; it demonstrates the step-by-step application of the sequential outlier detection procedure.
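The partial distances such a total distance combines can be sketched as follows. This is only the familiar score-distance/orthogonal-distance decomposition of a PCA model; the paper's data-driven estimation of degrees of freedom and its sequential thresholding are not reproduced, and the toy data are an assumption for illustration.

```python
# Sketch of the score-distance / orthogonal-distance decomposition
# underlying PCA-based outlier screening. The paper's data-driven
# threshold estimation is not reproduced; this only computes the two
# partial distances that a total distance would combine.
import numpy as np

def pca_distances(X, n_components=2):
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]      # scores in the model plane
    resid = Xc - T @ Vt[:n_components]              # part outside the model
    sd = np.sum((T / T.std(axis=0, ddof=1)) ** 2, axis=1)  # score distance
    od = np.sum(resid ** 2, axis=1)                 # orthogonal distance
    return sd, od

# toy data lying near a 2-D plane, with one off-plane outlier
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 5))
X = rng.normal(size=(200, 2)) @ W + 0.05 * rng.normal(size=(200, 5))
v = rng.normal(size=5)
v -= W.T @ np.linalg.solve(W @ W.T, W @ v)          # make v orthogonal to the plane
X[0] += 5 * v / np.linalg.norm(v)
sd, od = pca_distances(X)
```

Samples extreme in `sd` are unusual within the model plane; samples extreme in `od` do not fit the model at all, which is how the X- and XY-distance plots separate outlier types.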
PubMed: 31880430
DOI: 10.1021/acs.analchem.9b04611
Bioinformatics (Oxford, England), Aug 2022
MOTIVATION
It has become routine in neuroscience studies to measure brain networks for different individuals using neuroimaging. These networks are typically expressed as adjacency matrices, with each cell containing a summary of connectivity between a pair of brain regions. There is an emerging statistical literature describing methods for the analysis of such multi-network data in which nodes are common across networks but the edges vary. However, there has been essentially no consideration of the important problem of outlier detection. In particular, for certain subjects, the neuroimaging data are so poor quality that the network cannot be reliably reconstructed. For such subjects, the resulting adjacency matrix may be mostly zero or exhibit a bizarre pattern not consistent with a functioning brain. These outlying networks may serve as influential points, contaminating subsequent statistical analyses. We propose a simple Outlier DetectIon for Networks (ODIN) method relying on an influence measure under a hierarchical generalized linear model for the adjacency matrices. An efficient computational algorithm is described, and ODIN is illustrated through simulations and an application to data from the UK Biobank.
RESULTS
ODIN was successful in identifying moderate to extreme outliers. Removing such outliers can significantly change inferences in downstream applications.
AVAILABILITY AND IMPLEMENTATION
ODIN has been implemented in both Python and R and these implementations along with other code are publicly available at github.com/pritamdey/ODIN-python and github.com/pritamdey/ODIN-r, respectively.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Humans; Algorithms; Neuroimaging; Brain; Software
PubMed: 35762974
DOI: 10.1093/bioinformatics/btac431
Entropy (Basel, Switzerland), Mar 2023
Rate distortion theory was developed for optimizing lossy compression of data, but it also has applications in statistics. In this paper, we illustrate how rate distortion theory can be used to analyze various datasets. The analysis involves testing, identification of outliers, choice of compression rate, calculation of optimal reconstruction points, and assigning "descriptive confidence regions" to the reconstruction points. We study four models or datasets of increasing complexity: clustering, Gaussian models, linear regression, and a dataset describing orientations of early Islamic mosques. These examples illustrate how rate distortion analysis may serve as a common framework for handling different statistical problems.
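The clustering example is the simplest instance of this framework: Lloyd's algorithm computes the reconstruction points, and each sample's distortion against its reconstruction point gives an outlier score. The sketch below shows only that instance; the paper's rate selection, testing, and confidence-region machinery are not reproduced, and the data are a toy assumption.

```python
# Sketch of the clustering instance of rate-distortion analysis:
# Lloyd's algorithm finds reconstruction points, and each sample's
# squared distance to its reconstruction point (its distortion) can
# serve as an outlier score. Everything here is illustrative.
import numpy as np

def lloyd(X, k=2, iters=50):
    centers = X[:k].astype(float).copy()    # naive deterministic initialization
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)           # assign to nearest reconstruction point
        for j in range(k):
            if np.any(labels == j):         # keep empty clusters where they are
                centers[j] = X[labels == j].mean(axis=0)
    distortion = np.linalg.norm(X - centers[labels], axis=1) ** 2
    return centers, labels, distortion

# two tight clusters plus one point far from both
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(4.0, 0.3, (50, 2)),
               [[2.0, 10.0]]])
centers, labels, distortion = lloyd(X, k=2)
```

Under a squared-error distortion measure, the outlying point incurs a distortion far above every clustered sample, which is the sense in which rate-distortion analysis identifies outliers.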
PubMed: 36981344
DOI: 10.3390/e25030456
IEEE Transactions on Neural Networks..., Apr 2021
Few-shot learning (FSL) focuses on distilling transferable knowledge from existing experience to cope with novel concepts for which labeled data are scarce. A typical assumption in FSL is that the training examples of novel classes are all clean, with no outlier interference. In many realistic applications where examples are provided by users, however, data are potentially noisy or unreadable. In this context, we introduce a novel research topic, robust FSL (RFSL), which aims to address two types of outliers within user-provided data: the representation outlier (RO) and the label outlier (LO). Moreover, we introduce a metric for estimating robustness and use it to investigate the performance of several advanced FSL methods when faced with user-provided outliers. In addition, we propose robust attentive profile networks (RapNets) to achieve outlier suppression. A comprehensive evaluation on benchmark data sets demonstrates the shortcomings of current FSL methods and the superiority of the proposed RapNets in dealing with RFSL problems, establishing a benchmark for follow-up studies.
PubMed: 32310797
DOI: 10.1109/TNNLS.2020.2984710
Scientific Reports, Jan 2022
Recognition of anomalous events is a challenging but critical task in many scientific and industrial fields, especially when the properties of anomalies are unknown. In this paper, we introduce a new anomaly concept called "unicorn" or unique event and present a new, model-free, unsupervised detection algorithm to detect unicorns. The key component of the new algorithm is the Temporal Outlier Factor (TOF), which measures the uniqueness of events in continuous data sets from dynamic systems. The concept of unique events differs significantly from traditional outliers in many aspects: while repetitive outliers are no longer unique events, a unique event is not necessarily an outlier; it does not necessarily fall outside the distribution of normal activity. The performance of our algorithm was examined on different types of simulated data sets with anomalies, and it was compared with the Local Outlier Factor (LOF) and discord discovery algorithms. TOF had superior performance compared to LOF and discord detection algorithms even in recognizing traditional outliers, and it also detected unique events that they did not. The benefits of the unicorn concept and the new detection method are illustrated by example data sets from very different scientific fields. Our algorithm successfully retrieved unique events in cases where they were already known, such as the gravitational waves of a binary black hole merger in LIGO detector data and the signs of respiratory failure in ECG data series. Furthermore, unique events were found in the LIBOR data set of the last 30 years.
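A simplified reading of the TOF idea can be sketched as follows: embed the series in state space, find each point's nearest neighbours there, and measure how far away those neighbours are in time. A unique event recurs nowhere else, so its state-space neighbours are its own temporally adjacent points and the factor is small. This follows the paper only loosely; the embedding, normalisation, and all parameters below are illustrative assumptions, not the authors' implementation.

```python
# Hedged, simplified sketch of the Temporal Outlier Factor (TOF) idea:
# low TOF marks events whose state-space neighbours are temporally
# adjacent, i.e. patterns that recur nowhere else in the series.
import numpy as np

def temporal_outlier_factor(x, dim=3, k=5):
    # time-delay embedding with lag 1
    n = len(x) - dim + 1
    emb = np.column_stack([x[i:i + n] for i in range(dim)])
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]      # k nearest in state space
    t = np.arange(n)
    # root-mean-square temporal distance to those neighbours
    return np.sqrt(np.mean((nbrs - t[:, None]) ** 2, axis=1))

# periodic signal with one unique transient inserted
x = np.sin(np.linspace(0, 20 * np.pi, 500))
x[250:256] += 3.0
tof = temporal_outlier_factor(x)
```

Ordinary points of the periodic signal find neighbours one or more periods away (large TOF), while the transient's neighbours lie inside the transient itself (small TOF), which is the sense in which it is unique rather than merely outlying.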
PubMed: 34996940
DOI: 10.1038/s41598-021-03526-y
Environmental Science and Pollution..., Dec 2023
Grid-based approaches provide an efficient framework for data clustering in the presence of incomplete, inexplicit, and uncertain data. This paper proposes an entropy-based grid approach (EGO) for outlier detection in clustered data. Given hard clusters obtained from a hard clustering algorithm, EGO uses the entropy of the dataset as a whole, or of an individual cluster, to detect outliers. EGO works in two steps: explicit outlier detection and implicit outlier detection. Explicit outlier detection is concerned with data points that are isolated in grid cells; they are either far from the dense region or nearby but isolated, and are therefore declared explicit outliers. Implicit outlier detection targets outliers that deviate from the normal pattern in less obvious ways; such outliers are identified from the change in entropy of the dataset, or of a specific cluster, for each deviation. An elbow criterion based on the trade-off between entropy and object geometries optimizes the outlier detection process. Experimental results on CHAMELEON and similar datasets suggest that the proposed approaches detect outliers more precisely and extend outlier detection capability by an additional 4.5% to 8.6%. Moreover, the resulting clusters became more precise and compact when the entropy-based gridding approach was applied on top of hard clustering algorithms. The performance of the proposed algorithms is compared with well-known outlier detection algorithms, including DBSCAN, HDBSCAN, RE3WC, LOF, LoOP, ABOD, CBLOF, and HBOS. Finally, a case study detecting outliers in environmental monitoring data was carried out on our synthetically prepared datasets; the results show that the proposed approach may be an industry-oriented solution for outlier detection in such data.
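The explicit step described above can be sketched directly: grid the data and flag points that occupy a cell alone. This is only a minimal illustration of that first step; the entropy-change test for implicit outliers and the elbow-based optimisation are not reproduced, and the grid resolution is an arbitrary assumption.

```python
# Sketch of EGO's first step, explicit outlier detection: points that
# sit alone in their grid cell are flagged. The entropy-change test
# for implicit outliers is not reproduced; bins=8 is arbitrary.
import numpy as np

def explicit_grid_outliers(points, bins=8):
    hist, edges = np.histogramdd(points, bins=bins)
    # map each point to its cell index along every dimension
    idx = np.stack(
        [np.clip(np.digitize(points[:, j], edges[j][1:-1]), 0, bins - 1)
         for j in range(points.shape[1])], axis=1)
    counts = hist[tuple(idx.T)]             # occupancy of each point's cell
    return counts == 1                      # isolated -> explicit outlier

# a tight cluster plus one isolated point
rng = np.random.default_rng(3)
cluster = rng.normal(0.0, 0.1, size=(200, 2))
data = np.vstack([cluster, [[10.0, 10.0]]])
flags = explicit_grid_outliers(data)
```

The implicit step would then ask, for each remaining candidate, how much the occupancy entropy of the grid changes when that point is removed, with the elbow of that entropy curve setting the cutoff.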
Topics: Entropy; Algorithms; Cluster Analysis
PubMed: 37306879
DOI: 10.1007/s11356-023-26780-1
Knee Surgery, Sports Traumatology,..., Feb 2023
PURPOSE
The primary aim was to evaluate the accuracy of navigation in opening wedge high tibial osteotomy (HTO). The secondary aim was to examine mid-term outcomes after HTO.
METHODS
Inclusion criteria were patients with medial compartment knee osteoarthritis who underwent computer-assisted HTOs. Mechanical axis (MA), percentage MA (%MA), and change in posterior tibial slope (ΔPTS) were displayed on the navigation screen. Radiographic examinations included hip-knee-ankle (HKA) angle, medial proximal tibial angle (MPTA), joint line convergence angle (JLCA), and PTS. Preoperative and 5 weeks postoperative standing radiographs of the whole lower extremity and knee were used. Clinical evaluations were performed using American Knee Society knee score and function score both preoperatively and at last follow-up. Radiographic evaluations were performed by orthopedic surgeons. Intraoperative navigation after osteotomy and postoperative standing radiograph were compared. MA (HKA), %MA, and ΔPTS were compared. Outliers were defined as > 3° in MA, > 10% in %MA, and > 10° in ΔPTS. Outlier and non-outlier groups were compared. The rate of conversion to arthroplasty was examined.
RESULTS
This study involved 38 patients (44 knees), with the last follow-up at a mean of 5 years (range, 1-9 years). Mean American Knee Society knee score and function score improved significantly from 59 and 69 preoperatively to 95 and 85 at last follow-up, respectively. Absolute values of mean errors for MA, %MA, and ΔPTS were 2.1°, 9.3%, and 1.2°, respectively. Outlier rates were 18% for MA, 39% for %MA, and 5% for ΔPTS. No significant factors were found for MA and ΔPTS. For %MA, preoperative JLCA was significantly higher in the outlier group than in the non-outlier group. No knees underwent conversion to total knee arthroplasty. No differences in outcomes were found between the outlier and non-outlier groups.
CONCLUSION
Although rates of outlier values in computer-assisted opening wedge HTO were high, mid-term outcomes were excellent.
LEVEL OF EVIDENCE
IV.
Topics: Humans; Osteoarthritis, Knee; Knee Joint; Tibia; Osteotomy; Arthroplasty, Replacement, Knee; Computers; Retrospective Studies
PubMed: 34738158
DOI: 10.1007/s00167-021-06788-1