-
Genes Feb 2023
Outliers in the training or test set used to fit and evaluate a classifier on transcriptomics data can considerably change the estimated performance of the model. As a result, the reported accuracy is either too pessimistic or too optimistic, and the estimated model performance cannot be reproduced on independent data. It is then also doubtful whether a classifier qualifies for clinical usage. We estimate classifier performances in simulated gene expression data with artificial outliers and in two real-world datasets. As a new approach, we use two outlier detection methods within a bootstrap procedure to estimate the outlier probability for each sample and evaluate classifiers before and after outlier removal by means of cross-validation. We found that the removal of outliers changed the classification performance notably. For the most part, removing outliers improved the classification results. Because there are various, sometimes unclear reasons for a sample to be an outlier, we strongly advocate always reporting the performance of a transcriptomics classifier with and without outliers in training and test data. This provides a more complete picture of a classifier's performance and prevents reporting models that later turn out not to be applicable for clinical diagnoses.
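The bootstrap idea described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not name the two outlier detection methods, so a median/MAD robust z-score stands in for them, and the one-dimensional `scores`, the function name, and the cutoff of 3.5 are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_outlier_probability(values, n_boot=500, cutoff=3.5, seed=0):
    """Estimate, per sample, how often it is flagged across bootstrap resamples.

    Assumption: the detector is a median/MAD robust z-score, standing in for
    whichever outlier detection methods are actually used.
    """
    rng = random.Random(seed)
    n = len(values)
    flagged = [0] * n
    usable = 0
    for _ in range(n_boot):
        # Draw a resample with replacement and fit the detector on it.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        center = statistics.median(resample)
        mad = statistics.median([abs(v - center) for v in resample])
        if mad == 0:
            continue  # degenerate resample: no usable scale estimate
        usable += 1
        scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
        # Score every original sample against the resample-based detector.
        for i, v in enumerate(values):
            if abs(v - center) / scale > cutoff:
                flagged[i] += 1
    if usable == 0:
        raise ValueError("every bootstrap resample was degenerate")
    return [count / usable for count in flagged]


# Hypothetical per-sample summary scores; the last one is an artificial outlier.
scores = [10.0, 10.2, 9.8, 10.1, 9.9, 10.3, 9.7, 10.0, 25.0]
probs = bootstrap_outlier_probability(scores)
```

Samples whose estimated probability exceeds a chosen threshold could then be removed before re-running cross-validation, so that performance is reported both with and without them.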
Topics: Transcriptome; Gene Expression Profiling; Probability; Research Design
PubMed: 36833313
DOI: 10.3390/genes14020387
-
Frontiers in Endocrinology 2023
OBJECTIVE
The occurrence and development of oesophageal neoplasia (ON) is closely related to hormone changes. The aim of this study was to investigate the causal relationships between age at menarche (AAMA) or age at menopause (AAMO) and benign oesophageal neoplasia (BON) or malignant oesophageal neoplasia (MON) from a genetic perspective.
METHODS
Genome-wide association study (GWAS) summary data of exposures (AAMA and AAMO) and outcomes (BON and MON) were obtained from the IEU OpenGWAS database. We performed a two-sample Mendelian randomization (MR) study between them. The inverse variance weighted (IVW) method was used as the main analysis method, while the MR Egger, weighted median, simple mode, and weighted mode methods were supplementary. The maximum likelihood, penalized weighted median, and IVW (fixed effects) methods were used for validation. We used Cochran's Q statistic and Rucker's Q statistic to detect heterogeneity. The intercept test of the MR Egger and the global test of MR pleiotropy residual sum and outlier (MR-PRESSO) were used to detect horizontal pleiotropy, and the distortion test of the MR-PRESSO analysis was used to detect outliers. The leave-one-out analysis was used to detect whether the MR analysis was driven by individual single nucleotide polymorphisms (SNPs). In addition, the MR robust adjusted profile score (MR-RAPS) method was used to assess the robustness of the MR analysis.
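As a rough illustration of the main analysis method, the fixed-effect IVW estimate, Cochran's Q, and a leave-one-out check can be computed from per-SNP summary statistics as below. The function names and the four SNP effect sizes are hypothetical, and the sketch assumes the outcome effects are log odds ratios; it is not the software pipeline used in the study.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate, its standard error, and Cochran's Q."""
    # Per-SNP Wald ratios and their inverse-variance weights.
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total
    se = math.sqrt(1.0 / total)
    # Cochran's Q: weighted spread of the per-SNP ratios around the pooled estimate.
    q = sum(w * (r - beta) ** 2 for w, r in zip(weights, ratios))
    return beta, se, q


def leave_one_out(beta_exposure, beta_outcome, se_outcome):
    """Re-estimate the IVW effect with each SNP dropped in turn."""
    estimates = []
    for i in range(len(beta_exposure)):
        bx = beta_exposure[:i] + beta_exposure[i + 1:]
        by = beta_outcome[:i] + beta_outcome[i + 1:]
        se = se_outcome[:i] + se_outcome[i + 1:]
        estimates.append(ivw_estimate(bx, by, se)[0])
    return estimates


# Hypothetical summary statistics: every SNP implies the same ratio of -0.5.
bx = [0.10, 0.20, 0.15, 0.05]
by = [-0.05, -0.10, -0.075, -0.025]
se = [0.02, 0.03, 0.025, 0.02]

beta, beta_se, q = ivw_estimate(bx, by, se)
odds_ratio = math.exp(beta)  # valid only if outcome effects are log odds ratios
loo = leave_one_out(bx, by, se)
```

A large Q relative to its degrees of freedom (number of SNPs minus one) would indicate heterogeneity, and a leave-one-out estimate that shifts markedly would point to a single influential SNP.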
RESULTS
The random-effects IVW results showed that AAMA had a negative genetic causal relationship with BON (odds ratio [OR] = 0.285, 95% confidence interval [CI]: 0.130-0.623, P = 0.002). The weighted median, maximum likelihood, penalized weighted median, and IVW (fixed effects) results were consistent with the random-effects IVW (P < 0.05). The MR Egger, simple mode and weighted mode results showed that AAMA had no genetic causal relationship with BON (P > 0.05). However, there were no causal genetic relationships between AAMA and MON (OR = 1.132, 95% CI: 0.621-2.063, P = 0.685), AAMO and BON (OR = 0.989, 95% CI: 0.755-1.296, P = 0.935), or AAMO and MON (OR = 1.129, 95% CI: 0.938-1.359, P = 0.200). The MR Egger, weighted median, simple mode, weighted mode, maximum likelihood, penalized weighted median, and IVW (fixed effects) results were consistent with the random-effects IVW (P > 0.05). The MR analysis results showed no heterogeneity, horizontal pleiotropy, or outliers (P > 0.05). They were not driven by a single SNP, and were normally distributed (P > 0.05).
CONCLUSION
Only AAMA has a negative genetic causal relationship with BON, and no genetic causal relationships exist between AAMA and MON, AAMO and BON, or AAMO and MON. However, it cannot be ruled out that they are related at other levels besides genetics.
Topics: Female; Humans; Adolescent; Genome-Wide Association Study; Menarche; Mendelian Randomization Analysis; Esophageal Neoplasms; Adolescent Development
PubMed: 37025412
DOI: 10.3389/fendo.2023.1113765
-
BMC Public Health Jan 2023
BACKGROUND
One of the seminal events since 2019 has been the outbreak of the SARS-CoV-2 pandemic. Countries have adopted various policies to deal with it, but they also differ in their socio-geographical characteristics and public health care facilities. Our study aimed to investigate differences between epidemiological parameters across countries.
METHOD
The analysed data come from the SARS-CoV-2 repository provided by Johns Hopkins University. Separately for each country, we estimated recovery and mortality rates using the SIRD model applied to the first 30, 60, 150, and 300 days of the pandemic. Moreover, a mixture of normal distributions was fitted to the number of confirmed cases and deaths during the first 300 days. The estimates of the peaks' means and variances were used to identify countries with outlying parameters.
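To make the rate estimation concrete, the sketch below simulates a discrete-time SIRD model and recovers the recovery and mortality rates by least squares over a chosen time window. The estimation method is an assumption (the abstract does not state how the rates were fitted), and all parameter values and function names are hypothetical.

```python
def simulate_sird(beta, gamma, mu, s0, i0, days):
    """Euler-discretised SIRD model with a one-day step."""
    n = s0 + i0
    s, i, r, d = float(s0), float(i0), 0.0, 0.0
    path = [(s, i, r, d)]
    for _ in range(days):
        new_inf = beta * s * i / n   # new infections
        new_rec = gamma * i          # new recoveries
        new_dead = mu * i            # new deaths
        s -= new_inf
        i += new_inf - new_rec - new_dead
        r += new_rec
        d += new_dead
        path.append((s, i, r, d))
    return path


def estimate_rates(path):
    """Least-squares slopes of daily recoveries and deaths on active infections."""
    num_r = num_d = den = 0.0
    for (_, i, r, d), (_, _, r_next, d_next) in zip(path, path[1:]):
        num_r += (r_next - r) * i
        num_d += (d_next - d) * i
        den += i * i
    return num_r / den, num_d / den


# Hypothetical epidemic: 300 simulated days, then rates fitted on two windows.
path = simulate_sird(beta=0.3, gamma=0.1, mu=0.01, s0=999_990, i0=10, days=300)
gamma_30, mu_30 = estimate_rates(path[:31])   # first 30 days
gamma_300, mu_300 = estimate_rates(path)      # first 300 days
```

With real surveillance data the windows would give different estimates, and countries could then be compared on the fitted recovery and mortality rates for each timeframe.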
RESULTS
For the 300-day timeframe, Belgium, Cyprus, France, the Netherlands, Serbia, and the UK were classified as outliers by all three outlier detection methods. Yemen was classified as an outlier for each of the four considered timeframes, due to high mortality rates. During the first 300 days of the pandemic, the majority of countries underwent three peaks in the number of confirmed cases, except Australia and Kazakhstan, which had two peaks.
CONCLUSIONS
Considering recovery and mortality rates, we observed heterogeneity between countries. Liechtenstein was the "positive" outlier with low mortality rates and high recovery rates. In contrast, Yemen represented a "negative" outlier with high mortality for all four considered periods and low recovery for 30 and 60 days.
Topics: Humans; SARS-CoV-2; COVID-19; Pandemics; Disease Outbreaks; France
PubMed: 36681790
DOI: 10.1186/s12889-023-15092-1
-
Scientific Reports Sep 2023
Subspace outlier detection has emerged as a practical approach for outlier detection. Classical full-space outlier detection methods become ineffective in high-dimensional data due to the "curse of dimensionality". Subspace outlier detection methods have great potential to overcome this problem. However, the challenge becomes how to determine which subspaces to use for outlier detection among the huge number of possible subspaces. In this paper, we first propose an intuitive definition of outliers in subspaces. We study the desirable properties of subspaces for outlier detection and investigate the metrics for those properties. Then, a novel subspace outlier detection algorithm with a statistical foundation is proposed. Our method selectively leverages a limited set of the most interesting subspaces for outlier detection. Through experimental validation, we demonstrate that identifying outliers within this reduced set of highly interesting subspaces yields significantly higher accuracy compared to analyzing the entire feature space. We show by experiments that the proposed method outperforms competing subspace outlier detection approaches on real-world data sets.
PubMed: 37714878
DOI: 10.1038/s41598-023-42261-4
-
NeuroImage Dec 2023
Diffusion-weighted MRI (dMRI) is a medical imaging method that can be used to investigate the brain microstructure and structural connections between different brain regions. The method, however, requires relatively complex data processing frameworks and analysis pipelines. Many of these approaches are vulnerable to signal dropout artefacts that can originate from subjects moving their head during the scan. To combat these artefacts and eliminate such outliers, researchers have proposed two approaches: to replace outliers or to downweight outliers during modelling and analysis. With the rising interest in dMRI for clinical research, these types of corrections are increasingly important. Therefore, we set out to investigate the differences between outlier replacement and weighting approaches to help the dMRI community select the best tool for their data processing pipelines. We evaluated dMRI motion correction registration and single tensor model fit pipelines using Gaussian Process and Spherical Harmonic based replacement approaches and outlier downweighting using highly realistic whole-brain simulations. As a proof of concept, we applied these approaches to dMRI infant data sets that contained varying numbers of dropout artefacts. Based on our results, we concluded that the Gaussian Process based outlier replacement provided similar tensor fit results to Gaussian Process based outlier detection and downweighting. Therefore, if only the least-squares estimate of the single tensor model is of interest, our recommendation is to use outlier replacement. However, outlier downweighting can potentially provide a more accurate estimate of the model precision, which could be relevant for applications such as probabilistic tractography.
Topics: Humans; Algorithms; Diffusion Magnetic Resonance Imaging; Brain; Artifacts; Least-Squares Analysis
PubMed: 37820862
DOI: 10.1016/j.neuroimage.2023.120397
-
Molecules (Basel, Switzerland) Jun 2021
In this paper, we report comprehensive experimental and chemoinformatics analyses of the solubility of small organic molecules ("fragments") in dimethyl sulfoxide (DMSO) in the context of their ability to be tested in screening experiments. Here, DMSO solubility of 939 fragments has been measured experimentally using an NMR technique. A Support Vector Classification model was built on the obtained data using the ISIDA fragment descriptors. The analysis revealed 34 outliers: experimental issues were retrospectively identified for 28 of them. The updated model performs well in 5-fold cross-validation (balanced accuracy = 0.78). The datasets are available on the Zenodo platform (DOI:10.5281/zenodo.4767511) and the model is available on the website of the Laboratory of Chemoinformatics.
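For readers unfamiliar with the reported metric, balanced accuracy is the mean of the per-class recalls, which keeps a majority class from dominating the score. The helper below is a generic sketch with hypothetical labels (1 = soluble, 0 = insoluble); it is not the evaluation code used for the published model.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of the samples that truly belong to this class.
        idx = [k for k, y in enumerate(y_true) if y == cls]
        hits = sum(1 for k in idx if y_pred[k] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical predictions on an imbalanced fold: six soluble, two insoluble.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1]

score = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the plain accuracy is 0.75, while the balanced accuracy is lower because half of the minority class is misclassified. In 5-fold cross-validation the metric would be computed per fold or on the pooled out-of-fold predictions.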
PubMed: 34203441
DOI: 10.3390/molecules26133950
-
Medicine Feb 2024
Traditional observational and in vivo studies have suggested an etiological link between gastroesophageal reflux disease (GERD) and the development of extraesophageal diseases (EEDs), such as noncardiac chest pain. However, evidence demonstrating potential causal relationships is lacking. This study evaluated the potential causal relationship between GERD and EEDs, including throat and chest pain, asthma, bronchitis, chronic rhinitis, nasopharyngitis and pharyngitis, gingivitis and periodontal disease, and cough, using multiple Mendelian randomization (MR) methods, and sensitivity analysis was performed. The Mendelian randomization Pleiotropy RESidual Sum and Outlier and PhenoScanner tools were used to further check for heterogeneous results and remove outliers. MR with the inverse-variance weighted (IVW) method showed a significant causal relationship between GERD and EEDs after Bonferroni correction. IVW results indicated that GERD increased the risk of chronic rhinitis, nasopharyngitis and pharyngitis (odds ratio [OR] = 1.482, 95% confidence interval [CI] = 1.267-1.734, P < .001), gingivitis and periodontal disease (OR = 1.166, 95% CI = 1.046-1.190, P = .001), throat and chest pain (OR = 1.585, 95% CI = 1.455-1.726, P < .001), asthma (OR = 1.539, 95% CI = 1.379-1.717, P < .001), and bronchitis (OR = 1.249, 95% CI = 1.168-1.335, P < .001). Sensitivity analysis did not detect pleiotropy. Leave-one-out analysis showed that MR results were not affected by individual single nucleotide polymorphisms. The funnel plot showed the genetic instrumental variables to be almost symmetrically distributed. This MR analysis supports a causal relationship between GERD and EEDs. Precise moderation based on causality and active promotion of collaboration among multidisciplinary physicians ensure high-quality diagnostic and treatment recommendations and maximize patient benefit.
Topics: Humans; Nasopharyngitis; Mendelian Randomization Analysis; Gastroesophageal Reflux; Pharyngitis; Asthma; Bronchitis; Chest Pain; Gingivitis; Periodontal Diseases; Rhinitis; Genome-Wide Association Study
PubMed: 38363933
DOI: 10.1097/MD.0000000000037054
-
Molecular Oncology Jun 2024
Multiple strategies are continuously being explored to expand the drug target repertoire in solid tumors. We devised a novel computational workflow for transcriptome-wide gene expression outlier analysis that allows the systematic identification of both overexpression and underexpression events in cancer cells. Here, it was applied to expression values obtained through RNA sequencing in 226 colorectal cancer (CRC) cell lines that were also characterized by whole-exome sequencing and microarray-based DNA methylation profiling. We found cell models displaying an abnormally high or low expression level for 3533 and 965 genes, respectively. Gene expression abnormalities that have been previously associated with clinically relevant features of CRC cell lines were confirmed. Moreover, by integrating multi-omics data, we identified both genetic and epigenetic alterations underlying outlier expression values. Importantly, our atlas of CRC gene expression outliers can guide the discovery of novel drug targets and biomarkers. As a proof of concept, we found that CRC cell lines lacking expression of the MTAP gene are sensitive to treatment with a PRMT5-MTA inhibitor (MRTX1719). Finally, other tumor types may also benefit from this approach.
Topics: Humans; Colorectal Neoplasms; Gene Expression Regulation, Neoplastic; Cell Line, Tumor; Transcriptome; Gene Expression Profiling; DNA Methylation
PubMed: 38468448
DOI: 10.1002/1878-0261.13622
-
Clinical Epigenetics Mar 2020
BACKGROUND
DNA methylation outlier burden has been suggested as a potential marker of biological age. An outlier is typically defined as DNA methylation levels at any one CpG site that are three times beyond the inter-quartile range from the 25th or 75th percentiles compared to the rest of the population. DNA methylation outlier burden (the number of such outlier sites per individual) increases exponentially with age. However, these findings have been observed in small samples.
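The outlier definition above translates directly into a per-site fence rule. The sketch below counts, for each individual, the CpG sites whose value lies more than three inter-quartile ranges below the 25th or above the 75th percentile of the population. The methylation matrix is made up for illustration, and the choice of the "inclusive" quantile method is an assumption, since the abstract does not say how percentiles were computed.

```python
import statistics


def outlier_burden(methylation, k=3.0):
    """Count, per individual, the sites lying beyond k * IQR from the quartiles.

    `methylation` is a list of rows: one row per individual, one column per site.
    """
    n_sites = len(methylation[0])
    burden = [0] * len(methylation)
    for site in range(n_sites):
        # Population distribution of methylation values at this site.
        column = [row[site] for row in methylation]
        q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        for person, value in enumerate(column):
            if value < low or value > high:
                burden[person] += 1
    return burden


# Hypothetical beta values: 8 individuals x 3 CpG sites.
methylation = [
    [0.50, 0.20, 0.80],
    [0.51, 0.22, 0.81],
    [0.49, 0.21, 0.79],
    [0.52, 0.19, 0.80],
    [0.48, 0.20, 0.82],
    [0.50, 0.21, 0.78],
    [0.51, 0.22, 0.81],
    [0.95, 0.02, 0.80],  # extreme at the first two sites
]

burden = outlier_burden(methylation)
```

Only the last individual accumulates any burden in this toy matrix, with two outlying sites out of three.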
RESULTS
Here, we showed an association between age and log-transformed DNA methylation outlier burden in a large cross-sectional cohort, the Generation Scotland Family Health Study (N = 7010, β = 0.0091, p < 2 × 10), and in two longitudinal cohort studies, the Lothian Birth Cohorts of 1921 (N = 430, β = 0.033, p = 7.9 × 10) and 1936 (N = 898, β = 0.0079, p = 0.074). Significant confounders of both cross-sectional and longitudinal associations between outlier burden and age included white blood cell proportions, body mass index (BMI), smoking, and batch effects. In Generation Scotland, the increase in epigenetic outlier burden with age was not purely an artefact of an increase in DNA methylation level variability with age (epigenetic drift). Log-transformed DNA methylation outlier burden in Generation Scotland was not related to self-reported, or family history of, age-related diseases, and it was not heritable (SNP-based heritability of 4.4%, p = 0.18). Finally, DNA methylation outlier burden was not significantly related to survival in either of the Lothian Birth Cohorts individually or in the meta-analysis after correction for multiple testing (HR = 1.12; 95% CI = [1.02; 1.21]; p = 0.021).
CONCLUSIONS
These findings suggest that, while it does not associate with ageing-related health outcomes, DNA methylation outlier burden does track chronological ageing and may also relate to survival. DNA methylation outlier burden may thus be useful as a marker of biological ageing.
Topics: Adult; Age Factors; Aging; Confounding Factors, Epidemiologic; CpG Islands; Cross-Sectional Studies; DNA Methylation; Epigenesis, Genetic; Female; Humans; Longitudinal Studies; Male; Middle Aged; Risk Factors; Scotland
PubMed: 32216821
DOI: 10.1186/s13148-020-00838-0
-
Proceedings of SPIE--the International... 2020
Abdominal multi-organ segmentation of computed tomography (CT) images has been the subject of extensive research interest. It presents a substantial challenge in medical image processing, as the shape and distribution of abdominal organs can vary greatly among the population and within an individual over time. While continuous integration of novel datasets into the training set provides potential for better segmentation performance, collection of data at scale is not only costly, but also impractical in some contexts. Moreover, it remains unclear what marginal value additional data have to offer. Herein, we propose a single-pass active learning method through human quality assurance (QA). We built on a pre-trained 3D U-Net model for abdominal multi-organ segmentation and augmented the dataset either with outlier data (e.g., exemplars for which the baseline algorithm failed) or inliers (e.g., exemplars for which the baseline algorithm worked). The new models were trained using the augmented datasets with 5-fold cross-validation (for outlier data) and withheld outlier samples (for inlier data). Manual labeling of outliers increased Dice scores with outliers by 0.130, compared to an increase of 0.067 with inliers (p<0.001, two-tailed paired t-test). By adding 5 to 37 inliers or outliers to training, we find that the marginal value of adding outliers is higher than that of adding inliers. In summary, improvement on single-organ performance was obtained without diminishing multi-organ performance or significantly increasing training time. Hence, identification and correction of baseline failure cases present an effective and efficient method of selecting training data to improve algorithm performance.
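The Dice score used to compare segmentations is twice the overlap between predicted and reference voxels divided by the sum of their sizes. The function below is a generic sketch over sets of voxel coordinates; the tiny masks are hypothetical and the code is not taken from the study's pipeline.

```python
def dice_score(pred, truth):
    """Dice coefficient between two collections of voxel coordinates."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(pred & truth) / (len(pred) + len(truth))


# Hypothetical organ masks as (z, y, x) voxel coordinates.
truth = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred = {(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1)}

score = dice_score(pred, truth)
```

Three of the four voxels in each mask overlap, giving a Dice score of 0.75. An improvement such as the reported 0.130 would correspond to the difference between scores computed this way before and after retraining.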
PubMed: 33907347
DOI: 10.1117/12.2549365