-
Journal of Visualized Experiments : JoVE Sep 2021
RNA sequencing (RNA-seq) is one of the most widely used technologies in transcriptomics, as it can reveal the relationship between genetic alterations and complex biological processes and has great value in the diagnosis, prognosis, and treatment of tumors. Differential analysis of RNA-seq data is crucial for identifying aberrant transcription, and limma, edgeR and DESeq2 are efficient tools for differential analysis. However, RNA-seq differential analysis requires some proficiency in the R language and the ability to choose an appropriate method, both of which are lacking in the medical education curriculum. Herein, we provide a detailed protocol to identify differentially expressed genes (DEGs) between cholangiocarcinoma (CHOL) and normal tissues using limma, DESeq2 and edgeR, respectively, and the results are shown in volcano plots and Venn diagrams. The three protocols are similar but differ in some steps of the analysis. For example, a linear model is used for statistics in limma, while the negative binomial distribution is used in edgeR and DESeq2. Additionally, normalized RNA-seq count data are necessary for edgeR and limma but not for DESeq2. The results of the three methods partly overlap. All three methods have their own advantages, and the choice of method depends only on the data.
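The Venn-diagram comparison described in this abstract amounts to set arithmetic over the DEG lists from each tool. The following is a minimal sketch of that step only, not the authors' protocol (which is written in R): the function names, the cutoffs (|log2FC| >= 1, adjusted p < 0.05) and the toy result table are all assumptions made for illustration.

```python
from collections import Counter

def call_degs(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Return genes passing both cutoffs.

    results: dict mapping gene -> (log2 fold change, adjusted p-value).
    The cutoffs are illustrative defaults, not values from the abstract.
    """
    return {gene for gene, (lfc, padj) in results.items()
            if abs(lfc) >= lfc_cutoff and padj < padj_cutoff}

def venn_counts(deg_sets):
    """Count genes in each Venn region, keyed by the set of methods calling them."""
    membership = {}
    for method, genes in deg_sets.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(method)
    return Counter(frozenset(methods) for methods in membership.values())

# Hypothetical per-method results for three made-up genes.
toy_results = {
    "limma":  {"GENE_A": (2.1, 0.001), "GENE_B": (1.4, 0.03), "GENE_C": (0.4, 0.001)},
    "edgeR":  {"GENE_A": (2.3, 0.002), "GENE_B": (1.2, 0.08), "GENE_C": (0.5, 0.002)},
    "DESeq2": {"GENE_A": (2.0, 0.001), "GENE_B": (1.5, 0.04), "GENE_C": (0.3, 0.010)},
}

deg_sets = {method: call_degs(res) for method, res in toy_results.items()}
regions = venn_counts(deg_sets)
for methods, count in regions.items():
    print(sorted(methods), count)
```

Keying each region by the set of methods that call a gene gives every area of the Venn diagram in one pass, including the partial overlaps the abstract mentions.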
Topics: Gene Expression Profiling; RNA; Sequence Analysis, RNA; Software; Transcriptome
PubMed: 34605806
DOI: 10.3791/62528
-
Briefings in Bioinformatics Sep 2022
The rapid development of spatial transcriptomics allows the measurement of RNA abundance at a high spatial resolution, making it possible to simultaneously profile gene expression, spatial locations of cells or spots, and the corresponding hematoxylin and eosin-stained histology images. This makes it promising to predict gene expression from histology images, which are relatively easy and cheap to obtain. Several methods have been devised for this purpose, but they have not fully captured the internal relations of the 2D vision features or the spatial dependency between spots. Here, we developed Hist2ST, a deep learning-based model to predict RNA-seq expression from histology images. Around each sequenced spot, the corresponding histology image is cropped into an image patch and fed into a convolutional module to extract 2D vision features. Meanwhile, the spatial relations with the whole image and with neighboring patches are captured through Transformer and graph neural network modules, respectively. These learned features are then used to predict gene expression under a zero-inflated negative binomial distribution. To alleviate the impact of the small size of spatial transcriptomics datasets, a self-distillation mechanism is employed for efficient learning of the model. In comprehensive tests on cancer and normal datasets, Hist2ST outperformed existing methods in both gene expression prediction and spatial region identification. Further pathway analyses indicated that our model could preserve biological information. Thus, Hist2ST enables the generation of spatial transcriptomics data from histology images for elucidating molecular signatures of tissues.
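To make the zero-inflated negative binomial (ZINB) assumption concrete, the sketch below evaluates a ZINB probability mass function and the negative log-likelihood that a model of this kind could minimize. The mean/dispersion parameterization (`mu`, `theta`) and the zero-inflation weight `pi` are assumptions chosen for illustration; the abstract does not state which parameterization Hist2ST uses.

```python
import math

def nb_log_pmf(y, mu, theta):
    """Log-pmf of a negative binomial with mean mu and dispersion theta (assumed form)."""
    return (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
            + theta * math.log(theta / (theta + mu))
            + y * math.log(mu / (theta + mu)))

def zinb_pmf(y, mu, theta, pi):
    """ZINB pmf: with probability pi the count is a structural zero, else NB."""
    nb = math.exp(nb_log_pmf(y, mu, theta))
    if y == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb

def zinb_nll(counts, mu, theta, pi):
    """Negative log-likelihood of observed counts under one shared ZINB."""
    return -sum(math.log(zinb_pmf(y, mu, theta, pi)) for y in counts)

# Illustrative parameters and made-up counts, not values from the paper.
print(zinb_pmf(0, mu=3.0, theta=2.0, pi=0.3))
print(zinb_nll([0, 0, 1, 4, 7], mu=3.0, theta=2.0, pi=0.3))
```

The extra mass at zero is what lets the distribution absorb the many dropout zeros typical of spot-level expression counts without distorting the fit for nonzero counts.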
Topics: Eosine Yellowish-(YS); Hematoxylin; Image Processing, Computer-Assisted; Neural Networks, Computer; RNA; Transcriptome
PubMed: 35849101
DOI: 10.1093/bib/bbac297
-
Chirality Jan 2020
NaClO₃ is achiral in solution. If crystallization is performed in a static set-up, it is recognized that the stochastic nucleation probability results in a racemic mixture of the conglomerate. In this paper, we report a reexamination of the crystallization of NaClO₃ from static solution in petri dishes that was conducted over a number of years and is based on the count and analysis of several thousand d- vs. l-NaClO₃ crystals. Remarkably, instead of the expected nearly 50/50 coin-tossing situation for the d/l crystal frequency, most of our experiments showed a statistically significant bias in favor of d- over l-NaClO₃ crystals. The experiments also showed that the NaClO₃ system was relatively insensitive to the intentional addition of a variety of optically active agents. Only in some cases did the persistent d-bias observed in the unseeded experiments increase slightly in the presence of such additives. Nevertheless, experiments in plastic petri dishes or in the presence of fungal spores were able to reverse this bias. A literature survey shows that mainly d-directed non-stochastic behavior in the NaClO₃ system has previously been observed in other laboratory settings and with different crystallization techniques. So far, the kind of chiral influence that could be at the origin of the observed bias remains unknown. After examining several possible chiral influences of physical, chemical and biological origin, we tentatively consider the presence of bio-contaminants to be the most likely cause of this effect.
PubMed: 31696979
DOI: 10.1002/chir.23154
-
BMC Cancer Jan 2022
Meta-Analysis Review
BACKGROUND
The incidence of early-onset colorectal cancer (EOCRC) is increasing at an alarming rate and further studies are needed to identify risk factors and to develop prevention strategies.
METHODS
Risk factors significantly associated with EOCRC were identified using meta-analysis. An individual risk appraisal model was constructed using the Rothman-Keller model. Next, a group of random data sets was generated using the binomial distribution function method, to determine nodes of risk assessment levels and to identify low, medium, and high risk populations.
RESULTS
A total of 32,843 EOCRC patients were identified in this study, and nine significant risk factors were identified using meta-analysis, including male sex, Caucasian ethnicity, sedentary lifestyle, inflammatory bowel disease, and high intake of red meat and processed meat. After simulating the risk assessment data of 10,000 subjects, scores of 0 to 0.0018, 0.0018 to 0.0036, and 0.0036 or more were respectively considered as low-, moderate-, and high-risk populations for the EOCRC population based on risk trends from the Rothman-Keller model.
CONCLUSION
This model can be used for screening of young adults to predict high risk of EOCRC and will contribute to the primary prevention strategies and the reduction of risk of developing EOCRC.
Topics: Adult; Clinical Decision Rules; Colorectal Neoplasms; Early Detection of Cancer; Female; Humans; Incidence; Male; Middle Aged; Risk Assessment; Risk Factors; Young Adult
PubMed: 35093005
DOI: 10.1186/s12885-022-09238-4
-
Yi Chuan = Hereditas Dec 2020
Genetic drift is one of the four important factors affecting population genetic balance. Because its mode of action is not as apparent as that of mutation, selection, and migration, which are intuitive and easy to understand, genetic drift can be difficult to understand and master. A particularly prominent problem is that the current treatment of genetic drift in textbooks is systematically insufficient: it is either too cursory or completely neglects the mathematical foundation, such as the binomial theorem, resulting in long-term inadequate learning of genetic drift. In this paper, we summarize the five basic attributes of genetic drift: it is inherent, universal, random, non-directional, and regular. Based on the concept that the genetic basis of genetic drift is the free combination of male and female gametes, we point out that random sampling error is the inherent, essential feature of genetic drift. Then, step by step, starting from an extremely small population consisting of only one individual (N = 1), we deduce that the effect of genetic drift decreases as population size increases. By introducing the mathematical model of the binomial theorem, the characteristics of the binomial distribution, and the results of computer simulations, the effect of genetic drift is displayed visually and intuitively to help teach the concept of genetic drift.
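The claim that drift weakens as population size grows can be checked with a small binomial-sampling simulation. This is a sketch under stated assumptions, not the simulation used in the paper: it assumes a diploid Wright-Fisher population in which each generation draws 2N gametes at random, and the function names and parameter values are illustrative.

```python
import random

def next_generation(freq, pop_size, rng):
    """Sample 2N gametes binomially and return the new allele frequency."""
    n_gametes = 2 * pop_size
    copies = sum(rng.random() < freq for _ in range(n_gametes))
    return copies / n_gametes

def drift_variance(freq, pop_size, replicates, rng):
    """Variance of the allele frequency after one generation, across replicates."""
    samples = [next_generation(freq, pop_size, rng) for _ in range(replicates)]
    mean = sum(samples) / replicates
    return sum((s - mean) ** 2 for s in samples) / replicates

rng = random.Random(0)
start_freq = 0.5
for pop_size in (1, 10, 100):
    observed = drift_variance(start_freq, pop_size, 2000, rng)
    # Binomial sampling predicts a variance of p(1 - p) / (2N).
    expected = start_freq * (1 - start_freq) / (2 * pop_size)
    print(pop_size, round(observed, 5), round(expected, 5))
```

Because the binomial variance scales as 1/(2N), the single-individual case (N = 1) shows the largest frequency swings, and the swings shrink as N increases.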
Topics: Gene Frequency; Genetic Drift; Genetics; Genetics, Population; Models, Genetic; Selection, Genetic
PubMed: 33509785
DOI: 10.16288/j.yczz.20-310
-
Physica A Feb 2021
At the end of 2019, the current novel coronavirus emerged as a severe acute respiratory disease that has now become a worldwide pandemic. Future generations will look back on this difficult period and see how our society as a whole united and rose to this challenge. Many reports have suggested that this new virus is becoming comparable to the Spanish flu pandemic of 1918. We provide a statistical study on the modelling and analysis of the daily incidence of COVID-19 in eighteen countries around the world. In particular, we investigate whether it is possible to fit count regression models to the number of daily new cases of COVID-19 in various countries and make short term predictions of these numbers. The results suggest that the biggest advantage of these methods is that they are simplistic and straightforward allowing us to obtain preliminary results and an overall picture of the trends in the daily confirmed cases of COVID-19 around the world. The best fitting count regression model for modelling the number of new daily COVID-19 cases of all countries analysed was shown to be a negative binomial distribution with log link function. Whilst the results cannot solely be used to determine and influence policy decisions, they provide an alternative to more specialised epidemiological models and can help to support or contradict results obtained from other analysis.
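For reference, a negative binomial count regression with a log link can be written as follows. This is a generic sketch: the abstract does not list the covariates, so the use of a time index t as the only predictor, and the quadratic (NB2) variance form with dispersion α, are assumptions.

```latex
Y_t \sim \mathrm{NB}(\mu_t, \alpha), \qquad
\log \mu_t = \beta_0 + \beta_1 t, \qquad
\operatorname{Var}(Y_t) = \mu_t + \alpha \mu_t^{2}
```

Under this form the model reduces to Poisson regression as α approaches 0, which is why the negative binomial is a natural choice when daily case counts are more variable than a Poisson model allows.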
PubMed: 33162665
DOI: 10.1016/j.physa.2020.125460
-
Journal of Statistical Theory and... 2022
Two families of bivariate discrete Poisson-Lindley distributions are introduced. The first is derived by mixing the common parameter in a bivariate Poisson distribution by different models of univariate continuous Lindley distributions. The second is obtained by generalizing a bivariate binomial distribution with respect to its exponent when it follows any of five different univariate discrete Poisson-Lindley distributions with one or two parameters. Probability-generating functions are mainly employed to derive some general properties for both families and specific characteristics for each of their members. We obtain expressions for probabilities, moments, conditional distributions, regression functions, as well as characterizations for certain bivariate models and their marginals. An attractive property of all individual bivariate models is that they contain only two or three parameters, one of which is readily estimated by simple ratios of their sample means. This feature, together with the over-dispersion of all marginal distributions, strongly suggests their potential use to describe bivariate dependent count data in many different areas.
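The over-dispersion mentioned for the marginals can be illustrated with the simplest univariate case. The sketch below uses the one-parameter Poisson-Lindley pmf in its commonly cited form, P(X = x) = θ²(x + θ + 2) / (θ + 1)^(x + 3); the abstract does not write this formula out, so treat it, the function names, and the truncation point of the sums as assumptions.

```python
def poisson_lindley_pmf(x, theta):
    """One-parameter Poisson-Lindley pmf (commonly cited form, assumed here)."""
    return theta ** 2 * (x + theta + 2) / (theta + 1) ** (x + 3)

def moments(theta, max_x=400):
    """Numerically approximate the mean and variance by truncating the support."""
    probs = [poisson_lindley_pmf(x, theta) for x in range(max_x)]
    mean = sum(x * p for x, p in enumerate(probs))
    second = sum(x * x * p for x, p in enumerate(probs))
    return mean, second - mean ** 2

for theta in (0.5, 1.0, 2.0):
    mean, var = moments(theta)
    # A variance above the mean indicates over-dispersion relative to Poisson.
    print(theta, round(mean, 4), round(var, 4), var > mean)
```

A Poisson variable has variance equal to its mean, so a variance-to-mean ratio above 1 is the property that makes this family a candidate for over-dispersed count data.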
PubMed: 35493334
DOI: 10.1007/s42519-022-00261-z
-
PloS One 2022
RNA-seq is a high-throughput sequencing technology widely used for gene transcript discovery and quantification under different biological or biomedical conditions. A fundamental research question in most RNA-seq experiments is the identification of differentially expressed genes among experimental conditions or sample groups. Numerous statistical methods for RNA-seq differential analysis have been proposed since the emergence of the RNA-seq assay. To evaluate popular differential analysis methods used in the open source R and Bioconductor packages, we conducted multiple simulation studies to compare the performance of eight RNA-seq differential analysis methods used in RNA-seq data analysis (edgeR, DESeq, DESeq2, baySeq, EBSeq, NOISeq, SAMSeq, Voom). The comparisons were across different scenarios with either equal or unequal library sizes, different distribution assumptions and sample sizes. We measured performance using false discovery rate (FDR) control, power, and stability. No significant differences were observed for FDR control, power, or stability across methods, whether with equal or unequal library sizes. For RNA-seq count data with negative binomial distribution, when sample size is 3 in each group, EBSeq performed better than the other methods as indicated by FDR control, power, and stability. When sample sizes increase to 6 or 12 in each group, DESeq2 performed slightly better than other methods. All methods have improved performance when sample size increases to 12 in each group except DESeq. For RNA-seq count data with log-normal distribution, both DESeq and DESeq2 methods performed better than other methods in terms of FDR control, power, and stability across all sample sizes. Real RNA-seq experimental data were also used to compare the total number of discoveries and stability of discoveries for each method. 
For RNA-seq data analysis, the EBSeq method is recommended for studies with sample size as small as 3 in each group, and the DESeq2 method is recommended for sample size of 6 or higher in each group when the data follow the negative binomial distribution. Both DESeq and DESeq2 methods are recommended when the data follow the log-normal distribution.
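In a simulation study of this kind, the truly differential genes are known, so the realized false discovery proportion and power of each method can be computed directly from the gene sets. The helper below is a minimal sketch of that bookkeeping; the function name and the toy gene identifiers are hypothetical, and the paper's exact definitions of its metrics (including stability) are not reproduced here.

```python
def fdr_and_power(called, truly_de):
    """Return (false discovery proportion, power) for one simulated data set.

    called:   set of genes a method reports as differentially expressed
    truly_de: set of genes simulated as truly differentially expressed
    """
    true_pos = len(called & truly_de)
    false_pos = len(called - truly_de)
    fdr = false_pos / len(called) if called else 0.0
    power = true_pos / len(truly_de) if truly_de else 0.0
    return fdr, power

# Toy example with made-up gene identifiers.
called = {"g1", "g2", "g3", "g4"}
truly_de = {"g1", "g2", "g5"}
fdr, power = fdr_and_power(called, truly_de)
print(fdr, power)
```

Averaging these two quantities over many simulated data sets gives per-method estimates that can be compared against the nominal FDR level.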
Topics: Binomial Distribution; High-Throughput Nucleotide Sequencing; RNA-Seq; Sample Size; Sequence Analysis, RNA
PubMed: 36112652
DOI: 10.1371/journal.pone.0264246
-
Scientific Reports Jul 2023
Among diseases, cancer exhibits the fastest global spread, presenting a substantial challenge for patients, their families, and the communities they belong to. This paper is devoted to modeling such a disease as a special case. A newly proposed distribution called the binomial-discrete Erlang-truncated exponential (BDETE) is introduced. The BDETE is a mixture of the binomial distribution in which the number of trials (parameter [Formula: see text]) follows a discrete Erlang-truncated exponential distribution. A comprehensive mathematical treatment of the proposed distribution is provided, including expressions for its density, cumulative distribution function, survival function, failure rate function, quantile function, moment generating function, Shannon entropy, order statistics, and stress-strength reliability. The distribution's parameters are estimated using the maximum likelihood method. Two real-world lifetime count data sets on cancer, both of which are right-skewed and over-dispersed, are fitted using the proposed BDETE distribution to evaluate its efficacy and viability. We expect the findings to become standard works in probability theory and its related fields.
Topics: Humans; Reproducibility of Results; Statistical Distributions; Entropy; Neoplasms
PubMed: 37507433
DOI: 10.1038/s41598-023-38709-2
-
Infectious Disease Modelling Dec 2023
Accurately estimating the effective reproduction number is crucial for characterizing the transmissibility of infectious diseases to optimize interventions and responses during epidemic outbreaks. In this study, we improve the estimation of the effective reproduction number through two main approaches. First, we derive a discrete model to represent a time series of case counts and propose an estimation method based on this framework. We also conduct numerical experiments to demonstrate the effectiveness of the proposed discretization scheme. By doing so, we enhance the accuracy of approximating the underlying epidemic process compared to previous methods, even when the counting period is similar to the mean generation time of an infectious disease. Second, we employ a negative binomial distribution to model the variability of count data to accommodate overdispersion. Specifically, given that observed incidence counts follow a negative binomial distribution, the posterior distribution of secondary infections is obtained as a Dirichlet multinomial distribution. With this formulation, we establish posterior uncertainty bounds for the effective reproduction number. Finally, we demonstrate the effectiveness of the proposed method using incidence data from the COVID-19 pandemic.
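As background for the quantity being estimated, the sketch below computes the basic renewal-equation point estimate, R_t = I_t / Σ_s w_s I_(t-s), from a case-count series and a discretized generation-time distribution. This is the standard textbook ratio, named here as a stand-in: it is not the discretization scheme or the negative binomial / Dirichlet-multinomial posterior proposed in the paper, and the function name, weights and incidence values are made up for illustration.

```python
def renewal_rt(incidence, gen_weights):
    """Point estimates of R_t for every t that has a full generation-time window.

    incidence:   list of case counts per counting period
    gen_weights: generation-time probabilities w_1..w_S (assumed to sum to 1)
    """
    window = len(gen_weights)
    estimates = []
    for t in range(window, len(incidence)):
        # Infection pressure: past cases weighted by the generation-time distribution.
        pressure = sum(w * incidence[t - s]
                       for s, w in enumerate(gen_weights, start=1))
        estimates.append(incidence[t] / pressure if pressure > 0 else float("nan"))
    return estimates

# Made-up incidence series and generation-time weights.
cases = [10, 12, 15, 19, 24, 30, 33, 31]
weights = [0.2, 0.5, 0.3]
print([round(r, 2) for r in renewal_rt(cases, weights)])
```

This ratio gives no uncertainty bounds and is sensitive to noise in small counts, which is the gap that a probabilistic observation model such as the negative binomial is meant to address.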
PubMed: 37701756
DOI: 10.1016/j.idm.2023.08.006