Journal of Neurotrauma Jan 2024
Blood biomarkers have been studied to improve the clinical assessment and prognostication of patients with moderate-severe traumatic brain injury (mo/sTBI). To assess their clinical usability, one needs to know the potential factors that might cause outlier values and affect clinical decision making. In a prospective study, we recruited patients with mo/sTBI (n = 85) and measured the blood levels of eight protein brain pathophysiology biomarkers, including glial fibrillary acidic protein (GFAP), S100 calcium-binding protein B (S100B), neurofilament light (Nf-L), heart-type fatty acid-binding protein (H-FABP), interleukin-10 (IL-10), total tau (T-tau), amyloid β40 (Aβ40), and amyloid β42 (Aβ42), within 24 h of admission. Similar analyses were conducted for controls (n = 40) with an acute orthopedic injury without any head trauma. The patients with TBI were divided into subgroups of normal versus abnormal (n = 9/76) head computed tomography (CT) and favorable (Glasgow Outcome Scale Extended [GOSE] 5-8) versus unfavorable (GOSE <5) (n = 38/42, 5 missing) outcome. Outliers were sought individually within each subgroup and in the whole TBI patient population. Biomarker levels outside Q1 - 1.5 interquartile range (IQR) or Q3 + 1.5 IQR were considered outliers. The medical records of each outlier patient were reviewed in a team meeting to determine possible reasons for the outlier values. A total of 29 patients (34%) combined from all subgroups and 12 patients (30%) among the controls showed outlier values for one or more of the eight biomarkers. Nine patients with TBI and five control patients had outlier values in more than one biomarker (up to 4). All outlier values were > Q3 + 1.5 IQR. A logical explanation was found for almost all cases, except for the amyloid proteins. Explanations for outlier values included extremely severe injury, especially for GFAP and S100B.
In the case of H-FABP and IL-10, the explanation was extracranial injuries (thoracic injuries for H-FABP and multi-trauma for IL-10); in some cases these were also associated with abnormally high S100B. Timing of sampling and demographic factors such as age and pre-existing neurological conditions (especially for T-tau) explained some of the abnormally high values, especially for Nf-L. Similar explanations also emerged in the controls, where outlier values were driven especially by pre-existing neurological diseases. To utilize blood-based biomarkers in the clinical assessment of mo/sTBI, very severe or fatal TBIs, various extracranial injuries, timing of sampling, and demographic factors such as age and pre-existing systemic or neurological conditions must be taken into consideration. Very high levels of GFAP and S100B often seem to be associated with poor prognosis and mortality.
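The Tukey-fence rule this study applies (flagging values outside Q1 - 1.5 IQR or Q3 + 1.5 IQR) can be sketched in a few lines. The biomarker concentrations below are hypothetical illustration data, not the study's measurements:

```python
import numpy as np

def iqr_outliers(values):
    """Return values outside Tukey's fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical biomarker concentrations with one extreme value
levels = [1.2, 1.5, 1.8, 2.0, 2.1, 2.3, 2.6, 3.0, 25.0]
print(iqr_outliers(levels))  # -> [25.0]
```

Note that, as in the study, only the upper fence matters in practice for concentration data: extreme values sit far above Q3, never below the (often near-zero) lower fence.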
Topics: Humans; Fatty Acid Binding Protein 3; Interleukin-10; Prospective Studies; Brain Injuries, Traumatic; Biomarkers; S100 Calcium Binding Protein beta Subunit; Glial Fibrillary Acidic Protein
PubMed: 37725575
DOI: 10.1089/neu.2023.0120
Biometrics Dec 2022
Biomedical research is increasingly data rich, with studies comprising ever growing numbers of features. The larger a study, the higher the likelihood that a substantial portion of the features may be redundant and/or contain contamination (outlying values). This poses serious challenges, which are exacerbated in cases where the sample sizes are relatively small. Effective and efficient approaches to perform sparse estimation in the presence of outliers are critical for these studies, and have received considerable attention in the last decade. We contribute to this area by considering high-dimensional regressions contaminated by multiple mean-shift outliers affecting both the response and the design matrix. We develop a general framework and use mixed-integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We prove theoretical properties for our approach, that is, a necessary and sufficient condition for the robustly strong oracle property, where the number of features can increase exponentially with the sample size; the optimal estimation of parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and warm-start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through simulations and use it to study the relationships between childhood obesity and the human microbiome.
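The mean-shift outlier model underlying this line of work augments the regression with a sparse shift term, y = Xβ + γ + ε, with γ nonzero only for contaminated cases. The paper solves feature selection and outlier detection jointly by mixed-integer programming; the sketch below instead uses a simple trimming heuristic (iterated least squares on the n - k smallest residuals), on simulated data, purely to illustrate the contamination model, not the authors' method:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 5, 3  # samples, features, number of outliers to trim

X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])  # sparse coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)
y[:k] += 8.0  # mean-shift contamination of the response

# Iterated trimmed least squares: refit on the n - k smallest residuals
beta = np.zeros(p)
for _ in range(25):
    keep = np.argsort(np.abs(y - X @ beta))[: n - k]
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)

print(np.round(beta, 2))  # close to beta_true; shifted points are trimmed
```

Unlike the MIP formulation, this heuristic carries no optimality guarantee and assumes k is known, which is exactly the gap the paper's provably optimal approach addresses.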
Topics: Child; Humans; Pediatric Obesity; Algorithms; Sample Size; Probability
PubMed: 34437713
DOI: 10.1111/biom.13553
Entropy (Basel, Switzerland) Apr 2022
Outlier detection is an important research direction in the field of data mining. Aiming at the problem of unstable detection results and low efficiency caused by randomly dividing features of the data set in the Isolation Forest algorithm, an algorithm called CIIF (Cluster-based Improved Isolation Forest), which combines clustering and Isolation Forest, is proposed. CIIF first uses the k-means method to cluster the data set, selects a specific cluster to construct a selection matrix based on the clustering results, and implements the selection mechanism of the algorithm through this matrix; it then builds multiple isolation trees. Finally, outlier scores are calculated from the average search length of each sample across the isolation trees, and the top-n objects with the highest scores are regarded as outliers. Comparative experiments against six algorithms on eleven real data sets show that the CIIF algorithm has better performance. Compared to the Isolation Forest algorithm, the average AUC (area under the ROC curve) of the proposed CIIF algorithm is improved by 7%.
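A minimal sketch of the general cluster-then-isolate idea, assuming scikit-learn. It fits one Isolation Forest per k-means cluster and flags the top-n lowest-scoring points; this is a simplification and does not reproduce CIIF's selection-matrix mechanism:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Two dense clusters plus five scattered points (synthetic data)
inliers = np.vstack([rng.normal(0.0, 0.5, (100, 2)),
                     rng.normal(5.0, 0.5, (100, 2))])
scattered = rng.uniform(-4.0, 9.0, (5, 2))
X = np.vstack([inliers, scattered])

# Cluster first, then score anomalies within each cluster
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scores = np.empty(len(X))
for c in np.unique(labels):
    mask = labels == c
    forest = IsolationForest(random_state=0).fit(X[mask])
    scores[mask] = forest.score_samples(X[mask])  # lower = more anomalous

# Report the top-n lowest scores as outliers
top_n = np.argsort(scores)[:5]
print(np.sort(top_n))
```

The intuition matches the abstract: clustering constrains which structure each isolation tree sees, reducing the instability that comes from fully random feature splits.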
PubMed: 35626495
DOI: 10.3390/e24050611
Entropy (Basel, Switzerland) Nov 2021
With the advent of big data and the popularity of black-box deep learning methods, it is imperative to address the robustness of neural networks to noise and outliers. We propose the use of Winsorization to recover model performance when the data may have outliers and other aberrant observations. We provide a comparative analysis of several probabilistic artificial intelligence and machine learning techniques for supervised learning case studies. Broadly, Winsorization is a versatile technique for accounting for outliers in data. However, different probabilistic machine learning techniques have different levels of efficiency when used on outlier-prone data, with or without Winsorization. We observe that Gaussian processes are extremely vulnerable to outliers, while deep learning techniques in general are more robust.
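Winsorization replaces extreme values with nearby cut-off values rather than dropping them. A minimal NumPy sketch of a percentile-clipping variant, on made-up data:

```python
import numpy as np

def winsorize(x, pct=10.0):
    """Clip the lowest/highest pct% of values to the percentile cut-offs."""
    lo, hi = np.percentile(x, [pct, 100.0 - pct])
    return np.clip(x, lo, hi)

data = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])
print(winsorize(data))  # the extreme 100.0 is pulled in toward the bulk
```

Because the outlier is clipped rather than deleted, the sample size is preserved, which is why Winsorization pairs naturally with models (like neural networks) that are sensitive to both outliers and data volume.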
PubMed: 34828244
DOI: 10.3390/e23111546
IEEE Transactions on Pattern Analysis and Machine Intelligence May 2023
An efficient 3D point cloud learning architecture, named EfficientLO-Net, for LiDAR odometry is first proposed in this article. In this architecture, a projection-aware representation of the 3D point cloud is proposed to organize the raw 3D point cloud into an ordered data form to achieve efficiency. The Pyramid, Warping, and Cost volume (PWC) structure for the LiDAR odometry task is built to estimate and refine the pose in a coarse-to-fine approach. A projection-aware attentive cost volume is built to directly associate two discrete point clouds and obtain embedding motion patterns. Then, a trainable embedding mask is proposed to weigh the local motion patterns to regress the overall pose and filter outlier points. The trainable pose warp-refinement module is used iteratively, with the embedding mask optimized hierarchically, to make the pose estimation more robust to outliers. The entire architecture is holistically optimized end-to-end to achieve adaptive learning of the cost volume and mask, and all operations involving point cloud sampling and grouping are accelerated by projection-aware 3D feature learning methods. The superior performance and effectiveness of our LiDAR odometry architecture are demonstrated on the KITTI, M2DGR, and Argoverse datasets. Our method outperforms all recent learning-based methods, and even the geometry-based approach LOAM with mapping optimization, on most sequences of the KITTI odometry dataset. We have open-sourced our code at: https://github.com/IRMVLab/EfficientLO-Net.
PubMed: 36107901
DOI: 10.1109/TPAMI.2022.3207015
Pharmaceutical Statistics May 2020
Potency bioassays are used to measure biological activity. Consequently, potency is considered a critical quality attribute in manufacturing. Relative potency is measured by comparing the concentration-response curves of a manufactured test batch with that of a reference standard. If the curve shapes are deemed similar, the test batch is said to exhibit constant relative potency with the reference standard, a critical requirement for calibrating the potency of the final drug product. Outliers in bioassay potency data may result in the false acceptance/rejection of a bad/good sample and, if accepted, may yield a biased relative potency estimate. To avoid these issues, the USP<1032> recommends the screening of bioassay data for outliers prior to performing a relative potency analysis. In a recently published work, the effects of one or more outliers, outlier size, and outlier type on similarity testing and estimation of relative potency were thoroughly examined, confirming the USP<1032> outlier guidance. As a follow-up, several outlier detection methods, including those proposed by the USP<1010>, are evaluated and compared in this work through computer simulation. Two novel outlier detection methods are also proposed. The effects of outlier removal on similarity testing and estimation of relative potency were evaluated, resulting in recommendations for best practice.
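USP <1010> covers classical single-outlier tests such as the Grubbs test, one of the methods this kind of comparison evaluates. A sketch with hypothetical replicate potencies (this is the standard test, not either of the paper's two novel methods):

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs test statistic and critical value for one outlier."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit

# Hypothetical relative-potency replicates with one suspect value
potencies = [0.98, 1.02, 1.01, 0.99, 1.00, 1.35]
g, g_crit = grubbs_test(potencies)
print(g > g_crit)  # True: 1.35 is flagged as an outlier
```

Removing such a point before similarity testing is exactly the screening step USP <1032> recommends; the abstract's caution is that false removals can bias the relative potency estimate the other way.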
Topics: Biological Assay; Data Interpretation, Statistical; Dose-Response Relationship, Drug; Models, Statistical; Reference Standards; Research Design
PubMed: 31762118
DOI: 10.1002/pst.1984
Cancer Research and Treatment Jan 2021
PURPOSE
To find biomarkers for disease, there have been constant attempts to investigate the genes that differ between disease and normal groups. However, values that lie outside the overall pattern of a distribution, the outliers, are frequently excluded in traditional analytical methods, as they are considered to be 'some sort of problem.' Such outliers may have a biologic role in the disease group. Thus, this study explored new biomarkers using outlier analysis and verified the therapeutic potential of two genes (TM4SF4 and LRRK2).
MATERIALS AND METHODS
A modified Tukey's fences outlier analysis was carried out to identify new biomarkers using public gene expression datasets. We then verified the presence of the selected biomarkers in other clinical samples via customized gene expression panels and tissue microarrays. Moreover, a siRNA-based knockdown test was performed to evaluate the impact of the biomarkers on oncogenic phenotypes.
RESULTS
TM4SF4 in lung cancer and LRRK2 in breast cancer were chosen as candidates among the genes derived from the analysis. TM4SF4 and LRRK2 were overexpressed in a small proportion of samples with lung cancer (4.20%) and breast cancer (2.42%), respectively. Knockdown of TM4SF4 and LRRK2 suppressed the growth of lung and breast cancer cell lines. LRRK2-overexpressing cell lines were more sensitive to LRRK2-IN-1 than LRRK2 under-expressing cell lines.
CONCLUSION
Our modified outlier-based analysis method proved able to rescue biomarkers previously missed or unnoticed by traditional analysis, showing that TM4SF4 and LRRK2 are novel target candidates for lung and breast cancer, respectively.
Topics: Breast Neoplasms; Female; Humans; Leucine-Rich Repeat Serine-Threonine Protein Kinase-2; Lung Neoplasms; Membrane Glycoproteins; Molecular Targeted Therapy
PubMed: 32972043
DOI: 10.4143/crt.2020.434
The Science of the Total Environment Apr 2023
An interlaboratory comparison is typically conducted among laboratories to provide quality assurance and control. To solve the interlaboratory agreement problem, a distinct type of metrological challenge, a new uncertainty-based Bayesian strategy was developed and tested among environmental laboratories. A holistic algorithm with the key phases of sampling, outlier analysis, recognition, and simulation-based structure identification was developed and applied in place of conventional indices and plots. Computer simulations showed that the proposed hybrid approach has no discernible sensitivity to outliers and that the agreement structure is transparent and robust. The analysis, based on relative uncertainty, also generates some metadata. To measure the performance and capability of the Bayesian consensus-building algorithm, uncertainty intervals were established and comparative evaluations were carried out using conventional techniques. As a result, the suggested algorithm can explore both laboratory performance (harmony) and the conformity between two independent samples. The algorithmic procedure features a generalizable framework that may be adapted in other fields to obtain a consensus among laboratories.
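For context, the conventional baseline such Bayesian strategies are compared against can be illustrated with an inverse-variance weighted consensus and zeta-style performance scores. The lab means and uncertainties below are hypothetical, and this is the classical approach, not the paper's algorithm:

```python
import numpy as np

# Hypothetical lab means and standard uncertainties; the last lab is discrepant
means = np.array([10.2, 10.5, 9.9, 10.3, 14.0])
u = np.array([0.2, 0.3, 0.25, 0.2, 0.3])

# Inverse-variance weighted consensus value and its uncertainty
w = 1.0 / u**2
consensus = np.sum(w * means) / np.sum(w)
u_cons = 1.0 / np.sqrt(np.sum(w))

# Score each lab against the consensus; |score| > 2 signals disagreement
zeta = (means - consensus) / np.sqrt(u**2 + u_cons**2)
print(round(consensus, 2), np.round(zeta, 1))
```

Note how the single discrepant lab drags the consensus upward and can make other labs look discrepant in turn; this sensitivity to outliers is precisely what the abstract's outlier-robust approach is designed to avoid.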
PubMed: 36736391
DOI: 10.1016/j.scitotenv.2023.161977
Journal of Medical Internet Research Jul 2023
Observational Study
BACKGROUND
Reference intervals (RIs) play an important role in clinical decision-making. However, due to the time, labor, and financial costs involved in establishing RIs using direct means, the use of indirect methods, based on big data previously obtained from clinical laboratories, is getting increasing attention. Different indirect techniques combined with different data transformation methods and outlier removal might cause differences in the calculation of RIs. However, there are few systematic evaluations of this.
OBJECTIVE
This study used data derived from direct methods as reference standards and evaluated the accuracy of combinations of different data transformation, outlier removal, and indirect techniques in establishing complete blood count (CBC) RIs for large-scale data.
METHODS
The CBC data of populations aged ≥18 years undergoing physical examination from January 2010 to December 2011 were retrieved from the First Affiliated Hospital of China Medical University in northern China. After exclusion of repeated individuals, we performed parametric, nonparametric, Hoffmann, Bhattacharya, and truncation points and Kolmogorov-Smirnov distance (kosmic) indirect methods, combined with log or Box-Cox transformation, and Reed-Dixon, Tukey, and iterative mean (3SD) outlier removal methods, in order to derive the RIs of 8 CBC parameters, and compared the results with those previously established by the direct method. Furthermore, bias ratios (BRs) were calculated to assess which combination of indirect technique, data transformation pattern, and outlier removal method is preferable.
RESULTS
Raw data showed that the degrees of skewness of the white blood cell (WBC) count, platelet (PLT) count, mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), and mean corpuscular volume (MCV) were much more obvious than those of other CBC parameters. After log or BoxCox transformation combined with Tukey or iterative mean (3SD) processing, the distribution types of these data were close to Gaussian distribution. Tukey-based outlier removal yielded the maximum number of outliers. The lower-limit bias of WBC (male), PLT (male), hemoglobin (HGB; male), MCH (male/female), and MCV (female) was greater than that of the corresponding upper limit for more than half of 30 indirect methods. Computational indirect choices of CBC parameters for males and females were inconsistent. The RIs of MCHC established by the direct method for females were narrow. For this, the kosmic method was markedly superior, which contrasted with the RI calculation of CBC parameters with high |BR| qualification rates for males. Among the top 10 methodologies for the WBC count, PLT count, HGB, MCV, and MCHC with a high-BR qualification rate among males, the Bhattacharya, Hoffmann, and parametric methods were superior to the other 2 indirect methods.
CONCLUSIONS
Compared to results derived by the direct method, outlier removal methods and indirect techniques markedly influence the final RIs, whereas data transformation has negligible effects, except for obviously skewed data. Specifically, the outlier removal efficiency of Tukey and iterative mean (3SD) methods is almost equivalent. Furthermore, the choice of indirect techniques depends more on the characteristics of the studied analyte itself. This study provides scientific evidence for clinical laboratories to use their previous data sets to establish RIs.
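One of the evaluated combinations (Box-Cox transformation, Tukey outlier removal, then a nonparametric percentile RI) can be sketched with simulated skewed data, assuming SciPy; the distribution below is a made-up stand-in for a CBC parameter:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(1)
# Simulated right-skewed analyte values (stand-in for, e.g., a WBC count)
values = rng.lognormal(mean=2.0, sigma=0.35, size=5000)

# 1) Box-Cox transformation toward normality
transformed, lam = stats.boxcox(values)

# 2) Tukey outlier removal on the transformed scale
q1, q3 = np.percentile(transformed, [25, 75])
iqr = q3 - q1
kept = transformed[(transformed >= q1 - 1.5 * iqr) &
                   (transformed <= q3 + 1.5 * iqr)]

# 3) Nonparametric RI: back-transform the 2.5th/97.5th percentiles
lo, hi = np.percentile(kept, [2.5, 97.5])
ri = inv_boxcox(np.array([lo, hi]), lam)
print(np.round(ri, 2))
```

Transforming before applying Tukey fences matters for skewed analytes: on the raw scale the upper fence would cut into the legitimate right tail, which matches the abstract's finding that transformation mainly affects obviously skewed data.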
Topics: Adolescent; Adult; Female; Humans; Male; Big Data; Blood Cell Count; China; Leukocyte Count; Reference Values; Clinical Decision-Making
PubMed: 37459170
DOI: 10.2196/45651
Communications in Statistics:... 2023
A two-stage joint survival model is used to analyse time-to-event outcomes that may be associated with biomarkers collected repeatedly over time. The two-stage joint survival model has limited model-checking tools and is usually assessed using standard diagnostic tools for survival models; these diagnostics can be improved and implemented. Time-varying covariates in a two-stage joint survival model might contain outlying observations or subjects. In this study we used the variance shift outlier model (VSOM) to detect and down-weight outliers in the first stage of the two-stage joint survival model. This entails fitting a VSOM at the observation level and a VSOM at the subject level, and then fitting a combined VSOM for the identified outliers. The fitted values extracted from the combined VSOM were then used as a time-varying covariate in the extended Cox model. We illustrate this methodology on a dataset from a multi-centre randomised clinical trial, where the combined VSOM fitted the data better than the extended Cox model alone because outliers were down-weighted.
PubMed: 37981985
DOI: 10.1080/03610918.2021.1995751