-
Sensors (Basel, Switzerland) Aug 2022
Analysing human physiological data gives insight into the health state and the state of mind of an individual. Whenever a person is sick, having a panic attack, happy or scared, their physiological signals will differ. In this manuscript, we focus on monitoring breathing patterns; the scope can be extended to heart rate and other variables. We describe an analysis of breathing rate patterns during activities including resting, walking, running and watching a movie. We model normal breathing behaviours by statistically analysing signals, processed to represent quantities of interest. We consider the moving maximum/minimum, the amplitude and the Fourier transform of the respiration signal, working with different window sizes. We then learn a statistical model for the basal behaviour, per individual, and detect outliers. When outliers are detected, a system that incorporates our approach would send a visible signal through a smart garment or through other means. We describe alert generation performance in two datasets: one literature dataset and one collected as a field study for this work. In particular, when learning personal rest distributions for the breathing signals of 14 subjects, we see alerts generated more often when the same individual is running than when they are tested in rest conditions.
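The pipeline described in this abstract (windowed features, a per-individual rest model, outlier alerts) can be illustrated with a minimal sketch. The abstract does not state which statistical model or threshold is used, so the mean/standard-deviation summary, the `z_threshold` of 3 and the toy traces below are assumptions made purely for illustration.

```python
from statistics import mean, stdev

def moving_amplitude(signal, window):
    """Moving maximum minus moving minimum over a sliding window."""
    return [
        max(signal[i:i + window]) - min(signal[i:i + window])
        for i in range(len(signal) - window + 1)
    ]

def fit_rest_model(rest_features):
    """Summarise the rest-condition feature by its mean and standard deviation.

    Assumption: a simple mean/SD summary stands in for the unspecified
    per-individual statistical model.
    """
    return mean(rest_features), stdev(rest_features)

def alerts(features, model, z_threshold=3.0):
    """Indices of windows deviating from the rest model by more than z_threshold SDs."""
    mu, sigma = model
    return [i for i, f in enumerate(features) if abs(f - mu) > z_threshold * sigma]

# Toy respiration traces in arbitrary units, invented for illustration only.
rest = [0.0, 1.0, 0.1, 1.1, 0.0, 0.9, 0.1, 1.0, 0.0, 1.1, 0.1, 1.0]
run = [0.0, 3.0, 0.2, 3.1, 0.1, 2.9, 0.0, 3.2, 0.1, 3.0, 0.2, 3.1]

model = fit_rest_model(moving_amplitude(rest, window=4))
rest_alerts = alerts(moving_amplitude(rest, window=4), model)
run_alerts = alerts(moving_amplitude(run, window=4), model)
```

With these toy traces, the deeper breaths of the running trace fall well outside the rest distribution, so alerts are raised for the running windows but not for the rest windows.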
Topics: Humans; Models, Statistical; Respiration; Respiratory Rate; Rest
PubMed: 36016067
DOI: 10.3390/s22166306 -
Journal of Clinical Medicine Feb 2023
The aim was to study the genetic correlation and causal relationship between spondyloarthritis (SpA) and blood metabolites based on large-scale genome-wide association study (GWAS) summary data. The GWAS summary data for SpA (3966 SpA and 448,298 control cases) were from the UK Biobank, and the GWAS summary data for human blood metabolites (486 blood metabolites) were from a published study. First, the genetic correlation between SpA and blood metabolites was analyzed by linkage disequilibrium score (LDSC) regression. Next, we used Mendelian randomization (MR) analysis to assess the causal relationship between SpA and blood metabolites. Random effects inverse variance weighted (IVW) was the main analysis method, and the MR Egger, weighted median, simple mode, and weighted mode were supplementary methods. The MR analysis results were dominated by the random effects IVW. The Cochran's Q statistic (MR-IVW) and Rucker's Q statistic (MR Egger) were used to check heterogeneity. MR Egger and MR pleiotropy residual sum and outlier (MR-PRESSO) were used to check horizontal pleiotropy. The MR-PRESSO was also used to check outliers. The "leave-one-out" analysis was used to assess whether the MR analysis results were affected by a single SNP and thus test the robustness of the MR results. Finally, we identified seven blood metabolites that are genetically related to SpA: X-10395 (correlation coefficient = -0.546, p = 0.025), pantothenate (correlation coefficient = -0.565, p = 0.038), caprylate (correlation coefficient = -0.333, p = 0.037), pelargonate (correlation coefficient = -0.339, p = 0.047), X-11317 (correlation coefficient = -0.350, p = 0.038), X-12510 (correlation coefficient = -0.399, p = 0.034), and X-13859 (correlation coefficient = -0.458, p = 0.015). Among them, X-10395 had a positive genetic causal relationship with SpA (p = 0.014, OR = 1.011).
The blood metabolites that have genetic correlation and causal relationship with SpA found in this study provide a new idea for the study of the pathogenesis of SpA and the determination of diagnostic indicators.
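As a rough illustration of the main analysis method named above, the following sketch computes an inverse variance weighted estimate from per-SNP summary statistics, together with Cochran's Q. It uses the common first-order weights and a multiplicative random-effects correction; the abstract does not say which random-effects variant the authors used, so that choice, the function name `ivw` and the toy numbers are all assumptions.

```python
import math

def ivw(beta_exposure, beta_outcome, se_outcome):
    """Inverse variance weighted MR estimate with Cochran's Q.

    Each SNP contributes a Wald ratio (outcome effect / exposure effect),
    weighted by (exposure effect / outcome SE)^2 (first-order weights).
    """
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    weights = [(bx / se) ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    # Cochran's Q: weighted squared deviation of each ratio from the pooled estimate.
    q = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
    dof = len(ratios) - 1
    fixed_se = math.sqrt(1.0 / total)
    # Assumed multiplicative random effects: inflate the SE only when Q exceeds its dof.
    scale = math.sqrt(max(1.0, q / dof))
    return estimate, fixed_se * scale, q

# Hypothetical summary statistics for three SNPs (not taken from the study).
bx = [0.10, 0.20, 0.15]    # SNP-exposure effects
by = [0.02, 0.05, 0.03]    # SNP-outcome effects (log-odds scale)
se_y = [0.01, 0.01, 0.01]  # standard errors of the SNP-outcome effects

estimate, se, q = ivw(bx, by, se_y)
odds_ratio = math.exp(estimate)  # valid when outcome effects are log-odds
```

A large Q relative to its degrees of freedom signals heterogeneity between SNPs, which is the situation in which the random-effects standard error becomes wider than the fixed-effect one.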
PubMed: 36769847
DOI: 10.3390/jcm12031201 -
Bioinformatics (Oxford, England) Aug 2023
MOTIVATION
Mixed molecular data combines continuous and categorical features of the same samples, such as OMICS profiles with genotypes, diagnoses, or patient sex. Like all high-dimensional molecular data, it is prone to incorrect values that can stem from various sources, for example the technical limitations of the measurement devices, errors in the sample preparation, or contamination. Most anomaly detection algorithms identify complete samples as outliers or anomalies. However, in most cases, not all measurements of those samples are erroneous; only a few one-dimensional features within the samples are incorrect. These one-dimensional data errors are continuous measurements that may lie either outside or inside the normal ranges of their features but in both cases show atypical values given all other continuous and categorical features in the sample. Additionally, categorical anomalies can occur, for example when the genotype or diagnosis was submitted wrongly.
RESULTS
We introduce ADMIRE (Anomaly Detection using MIxed gRaphical modEls), a novel approach for the detection and correction of anomalies in mixed high-dimensional data. We focus on the detection of single (one-dimensional) data errors in the categorical and continuous features of a sample. To this end, the joint distribution of continuous and categorical features is learned by mixed graphical models; anomalies are detected from the difference between measured values and model-based estimates, and are corrected using imputation. We evaluated ADMIRE in simulation and by screening for anomalies in one of our own metabolic datasets. In simulation experiments, ADMIRE outperformed the state-of-the-art methods of Local Outlier Factor, stray, and Isolation Forest.
AVAILABILITY AND IMPLEMENTATION
All data and code are available at https://github.com/spang-lab/adadmire. ADMIRE is implemented in a Python package called adadmire, which can be found at https://pypi.org/project/adadmire.
Topics: Humans; Algorithms; Computer Simulation; Genotype; Software
PubMed: 37584673
DOI: 10.1093/bioinformatics/btad501 -
Frontiers in Psychiatry 2023
BACKGROUND
Observational studies have reported the association between fatigue and coronary artery disease (CAD), but the causal association between fatigue and CAD is unclear.
METHOD
We conducted a bidirectional Mendelian randomization (MR) study using publicly available genome-wide association studies (GWAS) data. The inverse-variance weighted (IVW) method was used as the primary analysis. We performed three complementary methods, including weighted median, MR-Egger regression, and MR pleiotropy residual sum and outlier (MR-PRESSO) to evaluate the sensitivity and horizontal pleiotropy of the results.
RESULT
Self-reported fatigue had a causal effect on coronary artery atherosclerosis (CAA) (OR 1.047, 95% CI 1.033-1.062), myocardial infarction (MI) (OR 1.027, 95% CI 1.014-1.039) and coronary heart disease (CHD) (OR 1.037, 95% CI 1.021-1.053). We did not find a significant reverse causality between self-reported fatigue and CAD. Given the heterogeneity revealed by MR-Egger regression, we employed the IVW random effect model. For the examination of fatigue on CHD and the reverse analysis of CAA and MI on fatigue, the MR-PRESSO test found horizontal pleiotropy. No significant outliers were found.
CONCLUSION
The MR analysis reveals a causal relationship between self-reported fatigue and CAD. The results should be interpreted with caution due to horizontal pleiotropy.
PubMed: 37799396
DOI: 10.3389/fpsyt.2023.1166689 -
Biometrics Dec 2022
Biomedical research is increasingly data rich, with studies comprising ever growing numbers of features. The larger a study, the higher the likelihood that a substantial portion of the features may be redundant and/or contain contamination (outlying values). This poses serious challenges, which are exacerbated in cases where the sample sizes are relatively small. Effective and efficient approaches to perform sparse estimation in the presence of outliers are critical for these studies, and have received considerable attention in the last decade. We contribute to this area considering high-dimensional regressions contaminated by multiple mean-shift outliers affecting both the response and the design matrix. We develop a general framework and use mixed-integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We prove theoretical properties for our approach, that is, a necessary and sufficient condition for the robustly strong oracle property, where the number of features can increase exponentially with the sample size; the optimal estimation of parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and warm-start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through simulations and use it to study the relationships between childhood obesity and the human microbiome.
Topics: Child; Humans; Pediatric Obesity; Algorithms; Sample Size; Probability
PubMed: 34437713
DOI: 10.1111/biom.13553 -
Entropy (Basel, Switzerland) Apr 2022
Outlier detection is an important research direction in the field of data mining. To address the unstable detection results and low efficiency caused by randomly dividing the features of the data set in the Isolation Forest algorithm, an algorithm called CIIF (Cluster-based Improved Isolation Forest), which combines clustering and Isolation Forest, is proposed. CIIF first uses the k-means method to cluster the data set, selects a specific cluster to construct a selection matrix based on the results of the clustering, and implements the selection mechanism of the algorithm through the selection matrix; it then builds multiple isolation trees. Finally, the outlier scores are calculated according to the average search length of each sample in different isolation trees, and the Top-n objects with the highest outlier scores are regarded as outliers. Comparative experiments with six algorithms on eleven real data sets show that the CIIF algorithm performs better. Compared to the Isolation Forest algorithm, the average AUC (Area under the Curve of ROC) value of the proposed CIIF algorithm is improved by 7%.
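The final scoring step, in which an average search length is turned into an outlier score and the Top-n objects are reported, can be sketched as follows. The score formula is the one from the original Isolation Forest algorithm; whether CIIF uses exactly this normalisation is not stated in the abstract, and the sample names, path lengths and subsample size of 256 are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Isolation Forest score: close to 1 for short paths, about 0.5 for typical points."""
    return 2.0 ** (-avg_path_length / c(n))

def top_n_outliers(avg_path_lengths, subsample_size, n):
    """Return the n sample ids with the highest anomaly scores."""
    scores = {
        sample: anomaly_score(length, subsample_size)
        for sample, length in avg_path_lengths.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical average search lengths over all isolation trees.
paths = {"a": 8.1, "b": 2.3, "c": 7.6, "d": 3.0}
outliers = top_n_outliers(paths, subsample_size=256, n=2)
```

Samples that are isolated after few splits (short average paths) receive the highest scores, which is why `b` and `d` are selected here.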
PubMed: 35626495
DOI: 10.3390/e24050611 -
The British Journal of Surgery May 2024
BACKGROUND
Clinical auditing is a powerful tool to evaluate and improve healthcare. Deviations from the expected quality of care are identified by benchmarking the results of individual hospitals using national averages. This study aimed to evaluate the use of quality indicators for benchmarking hepato-pancreato-biliary (HPB) surgery and when outlier hospitals could be identified.
METHODS
A population-based study used data from two nationwide Dutch HPB audits (DHBA and DPCA) from 2014 to 2021. Sample size calculations determined the threshold (in percentage points) to identify centres as statistical outliers, based on current volume requirements (annual minimum of 20 resections) on a two-year period (2020-2021), covering mortality rate, failure to rescue (FTR), major morbidity rate and textbook/ideal outcome (TO) for minor liver resection (LR), major LR, pancreaticoduodenectomy (PD) and distal pancreatectomy (DP).
RESULTS
In total, 10 963 and 7365 patients who underwent liver and pancreatic resection respectively were included. Benchmark and corresponding range of mortality rates were 0.6% (0-3.2%) and 3.3% (0-16.7%) for minor and major LR, and 2.7% (0-7.0%) and 0.6% (0-4.2%) for PD and DP respectively. FTR rates were 5.4% (0-33.3%), 14.2% (0-100%), 7.5% (1.6%-28.5%) and 3.1% (0-14.9%). For major morbidity rate, corresponding rates were 9.8% (0-20.5%), 28.1% (0-47.1%), 36% (15.8%-58.3%) and 22.3% (5.2%-46.1%). For TO, corresponding rates were 73.6% (61.3%-94.4%), 54.1% (35.3%-100%), 46.8% (25.3%-59.4%) and 63.3% (30.7%-84.6%). Mortality rate thresholds indicating a significant outlier were 8.6% and 15.4% for minor and major LR and 14.2% and 8.6% for PD and DP. For FTR, these thresholds were 17.9%, 31.6%, 22.9% and 15.0%. For major morbidity rate, these thresholds were 26.1%, 49.7%, 57.9% and 52.9% respectively. For TO, lower thresholds were 52.5%, 32.5%, 25.8% and 41.4% respectively. Higher hospital volumes decrease thresholds to detect outliers.
CONCLUSION
Current event rates and minimum volume requirements per hospital are too low to detect any meaningful between-hospital differences in mortality rate and FTR. Major morbidity rate and TO are better candidates to use for benchmarking.
Topics: Humans; Benchmarking; Quality Indicators, Health Care; Netherlands; Pancreatectomy; Male; Pancreaticoduodenectomy; Hepatectomy; Female; Middle Aged; Aged; Hospital Mortality
PubMed: 38747683
DOI: 10.1093/bjs/znae119 -
Entropy (Basel, Switzerland) Nov 2021
With the advent of big data and the popularity of black-box deep learning methods, it is imperative to address the robustness of neural networks to noise and outliers. We propose the use of Winsorization to recover model performances when the data may have outliers and other aberrant observations. We provide a comparative analysis of several probabilistic artificial intelligence and machine learning techniques for supervised learning case studies. Broadly, Winsorization is a versatile technique for accounting for outliers in data. However, different probabilistic machine learning techniques have different levels of efficiency when used on outlier-prone data, with or without Winsorization. We notice that Gaussian processes are extremely vulnerable to outliers, while deep learning techniques in general are more robust.
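For readers unfamiliar with Winsorization, a minimal sketch follows: values beyond chosen lower and upper quantiles are clipped to those quantiles rather than discarded. The quantile levels and the simple floor-rank quantile rule used here are illustrative assumptions, not the settings used in the paper.

```python
def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clip values to the [lower_q, upper_q] empirical quantile range.

    Quantiles are taken with a simple floor-rank rule on the sorted data,
    which is an assumption chosen for brevity.
    """
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_q * (n - 1))]
    high = ordered[int(upper_q * (n - 1))]
    return [min(max(v, low), high) for v in values]

# Toy data with one gross outlier (invented for illustration).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
clipped = winsorize(data, lower_q=0.1, upper_q=0.9)
```

Unlike trimming, this keeps the sample size unchanged: the extreme value 100 is replaced by the upper cut-off instead of being removed.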
PubMed: 34828244
DOI: 10.3390/e23111546 -
Scientific Reports Jan 2022
Recognition of anomalous events is a challenging but critical task in many scientific and industrial fields, especially when the properties of anomalies are unknown. In this paper, we introduce a new anomaly concept called "unicorn" or unique event and present a new, model-free, unsupervised detection algorithm to detect unicorns. The key component of the new algorithm is the Temporal Outlier Factor (TOF) to measure the uniqueness of events in continuous data sets from dynamic systems. The concept of unique events differs significantly from traditional outliers in many aspects: while repetitive outliers are no longer unique events, a unique event is not necessarily an outlier; it does not necessarily fall out from the distribution of normal activity. The performance of our algorithm was examined in recognizing unique events on different types of simulated data sets with anomalies and it was compared with the Local Outlier Factor (LOF) and discord discovery algorithms. TOF had superior performance compared to LOF and discord detection algorithms even in recognizing traditional outliers and it also detected unique events that those did not. The benefits of the unicorn concept and the new detection method were illustrated by example data sets from very different scientific fields. Our algorithm successfully retrieved unique events in those cases where they were already known such as the gravitational waves of a binary black hole merger on LIGO detector data and the signs of respiratory failure on ECG data series. Furthermore, unique events were found on the LIBOR data set of the last 30 years.
PubMed: 34996940
DOI: 10.1038/s41598-021-03526-y -
PloS One 2022
With the explosive growth of data, how to efficiently cluster large-scale unlabeled data has become an important issue that urgently needs to be solved. Because large-scale real-world data contains large amounts of noise and outliers with complex distributions, research on robust clustering algorithms for such data has become one of the hottest topics. In response to this issue, a robust large-scale clustering algorithm based on correntropy (RLSCC) is proposed in this paper. Specifically, k-means is first applied to generate pseudo-labels, which reduce the input data scale of the subsequent spectral clustering; then anchor graphs, instead of full sample graphs, are introduced into spectral clustering to obtain the final clustering results based on the pseudo-labels, which further improves efficiency. RLSCC therefore inherits the effectiveness of k-means and spectral clustering while greatly reducing the computational complexity. Furthermore, correntropy is used to suppress the influence of noise and outliers in the real-world data on the robustness of clustering. Finally, extensive experiments were carried out on real-world datasets and noise datasets, and the results show that, compared with other state-of-the-art algorithms, RLSCC can greatly improve efficiency and robustness while maintaining comparable or even higher clustering effectiveness.
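To show why correntropy helps against outliers, the sketch below compares the sample correntropy (mean Gaussian kernel of the residuals) with the mean squared error on a toy example. How RLSCC embeds correntropy in its objective is not described in the abstract; the unnormalised Gaussian kernel, the bandwidth `sigma` and the data are assumptions for illustration.

```python
import math

def correntropy(x, y, sigma=1.0):
    """Sample correntropy with an unnormalised Gaussian kernel (values lie in (0, 1])."""
    return sum(
        math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for a, b in zip(x, y)
    ) / len(x)

def mse(x, y):
    """Mean squared error, shown for contrast."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

# Toy data: the second copy of y contains one gross outlier.
x = [0.0, 0.0, 0.0, 0.0]
y_clean = [0.1, -0.1, 0.1, -0.1]
y_outlier = [0.1, -0.1, 0.1, 50.0]

clean_similarity = correntropy(x, y_clean)
outlier_similarity = correntropy(x, y_outlier)
outlier_mse = mse(x, y_outlier)
```

Each sample contributes at most `1 / len(x)` to the correntropy, so a single gross outlier can lower the similarity only by a bounded amount, whereas it dominates the mean squared error.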
Topics: Cluster Analysis; Algorithms
PubMed: 36331916
DOI: 10.1371/journal.pone.0277012