Expert Opinion on Drug Discovery, Feb 2024
INTRODUCTION
Modern drug discovery incorporates various tools and data, heralding the beginning of the data-driven drug design (DD) era. Understanding the distributions of the chemical and physical data used for Artificial Intelligence (AI)/Machine Learning (ML) and to drive DD has thus become highly important for using those data effectively.
AREAS COVERED
The authors perform a comprehensive exploration of the statistical distributions driving the data-intensive era of drug discovery, including Benford's Law in AI/ML-based DD.
EXPERT OPINION
As the relevance of data-driven discovery escalates, we anticipate meticulous scrutiny of datasets using principles such as Benford's Law to enhance data integrity and to guide efficient resource allocation and experimental planning. In this data-driven era of the pharmaceutical and medical industries, addressing critical aspects such as bias mitigation, algorithm effectiveness, data stewardship, and fraud prevention is essential. Harnessing Benford's Law alongside other distributions and statistical tests in DD provides a potent strategy to detect data anomalies, fill data gaps, and enhance dataset quality. Benford's Law offers a fast check of the integrity and quality of datasets, the backbone of AI/ML and other modeling approaches, and proves very useful throughout the design process.
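As an illustration of the kind of fast data-integrity screening described above, the sketch below (ours, not the authors' code) compares the leading-digit frequencies of a dataset against the Benford expectation P(d) = log10(1 + 1/d) using a chi-squared statistic:

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit of a nonzero number."""
    x = abs(x)
    while x >= 10:
        x /= 10.0
    while x < 1:
        x *= 10.0
    return int(x)

def benford_chi2(values):
    """Chi-squared distance between observed leading-digit counts and the
    Benford expectation P(d) = log10(1 + 1/d); large values flag datasets
    that deserve a closer look before modeling."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (counts.get(d, 0) - expected) ** 2 / expected
    return chi2
```

A dataset whose first digits pile up on a single value scores far worse than one dominated by leading 1s, which Benford's Law expects to be the most common digit.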
Topics: Humans; Artificial Intelligence; Drug Design; Drug Discovery; Research Design; Machine Learning
PubMed: 37921672
DOI: 10.1080/17460441.2023.2277342
Behavior Modification, Nov 2023
There has been growing interest in using statistical methods to analyze data and estimate effect size indices from studies that use single-case designs (SCDs), as a complement to traditional visual inspection methods. The validity of a statistical method rests on whether its assumptions are plausible representations of the process by which the data were collected, yet there is evidence that some assumptions, particularly regarding normality of error distributions, may be inappropriate for single-case data. To develop more appropriate modeling assumptions and statistical methods, researchers must attend to the features of real SCD data. In this study, we examine several features of SCDs with behavioral outcome measures in order to inform development of statistical methods. Drawing on a corpus of over 300 studies, including approximately 1,800 cases, from seven systematic reviews that cover a range of interventions and outcome constructs, we report the distribution of study designs, the distribution of outcome measurement procedures, and features of baseline outcome data distributions for the most common types of measurements used in single-case research. We discuss implications for the development of more realistic assumptions regarding outcome distributions in SCD studies, as well as the design of Monte Carlo simulation studies evaluating the performance of statistical analysis techniques for SCD data.
Topics: Humans; Computer Simulation; Research Design; Outcome Assessment, Health Care
PubMed: 31375029
DOI: 10.1177/0145445519864264
Hippocampus, Dec 2023
We present practical solutions to applying Gaussian-process (GP) methods to calculate spatial statistics for grid cells in large environments. GPs are a data-efficient approach to inferring neural tuning as a function of time, space, and other variables. We discuss how to design appropriate kernels for grid cells, and show that a variational Bayesian approach to log-Gaussian Poisson models can be calculated quickly. This class of models has closed-form expressions for the evidence lower bound, and can be estimated rapidly for certain parameterizations of the posterior covariance. We provide an implementation that operates in a low-rank spatial frequency subspace for further acceleration, and demonstrate these methods on experimental data.
Topics: Bayes Theorem; Grid Cells; Normal Distribution
PubMed: 37749821
DOI: 10.1002/hipo.23577
ACS Applied Materials & Interfaces, Nov 2023
Tetracyanonickelate (TCN)-based metal-organic frameworks (MOFs) show great potential in electrochemical applications such as supercapacitors due to their layered morphology and tunable structure. This study reports on improved electrochemical performance of exfoliated manganese tetracyanonickelate (Mn-TCN) nanosheets produced by the heat-assisted liquid-phase exfoliation (LPE) technique. The structural change was confirmed by the Raman frequency shift of the C≡N band from 2177 to 2182 cm⁻¹ and an increased band gap from 3.15 to 4.33 eV in the exfoliated phase. The statistical distribution obtained from atomic force microscopy (AFM) shows that 50% of the nanosheets are single- to four-layered, with an average lateral size of ∼240 nm and thickness of ∼1.2-4.8 nm. High-resolution transmission electron microscopy (HRTEM) and selected area electron diffraction (SAED) patterns suggest that the material maintains its crystallinity after exfoliation. It exhibits an almost 6-fold improvement in specific capacitance (from 13.0 to 72.5 F g⁻¹) measured at a scan rate of 5 mV s⁻¹ in 1 M KOH solution. Galvanostatic charge-discharge (GCD) measurement shows a capacity enhancement from ∼18 F g⁻¹ in the bulk phase to ∼45 F g⁻¹ in the exfoliated phase at a current density of 1 A g⁻¹. Bulk crystals exhibit an increasing trend of capacitance retention, ∼125% over 1000 charge-discharge cycles, attributed to electrochemical exfoliation. Electrochemical impedance spectroscopy (EIS) demonstrates a 5-fold reduction in the total equivalent series resistance (ESR) from 4864 Ω (bulk) to 1089 Ω (exfoliated). The enhanced storage capacity in the exfoliated phase results from the combined effect of the electrochemical double-layer charge storage mechanism at the nanosheet-electrolyte interface and the Faradaic process characteristic of pseudocapacitive charge storage behavior.
PubMed: 37943692
DOI: 10.1021/acsami.3c14059
Biometrics, Dec 2023
Contact tracing is one of the most effective tools in infectious disease outbreak control. A capture-recapture approach based upon ratio regression is suggested to estimate the completeness of case detection. Ratio regression has recently been developed as a flexible tool for count data modeling and has proved successful in the capture-recapture setting. The methodology is applied here to COVID-19 contact-tracing data from Thailand. A simple weighted straight-line approach is used, which includes the Poisson and geometric distributions as special cases. For the contact-tracing case study data from Thailand, a completeness of 83% was found, with a 95% confidence interval of 74%-93%.
Topics: Humans; COVID-19; Contact Tracing; Disease Outbreaks; Statistical Distributions
PubMed: 36795803
DOI: 10.1111/biom.13842
IEEE Transactions on Visualization and..., Dec 2023
Idealized probability distributions, such as normal or other curves, lie at the root of confirmatory statistical tests. But how well do people understand these idealized curves? In practical terms, does the human visual system allow us to match sample data distributions with hypothesized population distributions from which those samples might have been drawn? And how do different visualization techniques impact this capability? This article shares the results of a crowdsourced experiment that tested the ability of respondents to fit normal curves to four different data distribution visualizations: bar histograms, dotplot histograms, strip plots, and boxplots. We find that the crowd can estimate the center (mean) of a distribution with some success and little bias. We also find that people generally overestimate the standard deviation, a tendency we dub the "umbrella effect" because people tend to want to cover the whole distribution using the curve, as if sheltering it from the heavens above, and that strip plots yield the best accuracy.
PubMed: 36173772
DOI: 10.1109/TVCG.2022.3210763
Scientific Reports, Aug 2023
Many species used in behavioral studies are small vertebrates with high metabolic rates and potentially enhanced temporal resolution of perception. Nevertheless, the selection of an appropriate temporal scale for evaluating behavioral dynamics has received little attention. Herein, we studied the temporal organization of behaviors at fine grain (i.e. sampling interval ≤ 1 s) to gain insight into dynamics and to rethink how behavioral events are defined. We statistically explored high-resolution Japanese quail (Coturnix japonica) datasets encompassing 17 defined behaviors. We show that, for the majority of these behaviors, events last predominantly <300 ms and can be shorter than 70 ms. Insufficient sampling resolution, even on the order of 1 s, of behaviors that involve spatial displacement (e.g. walking) yields distorted probability distributions and overestimated event durations. Contrarily, behaviors without spatial displacement (e.g. vigilance) maintain non-Gaussian, power-law-type distributions indicative of long-term memory, independently of the sampling resolution evaluated. Since data probability distributions reflect underlying biological processes, our results highlight the importance of quantifying behavioral dynamics on a temporal scale pertinent to the species and to the data distribution. We propose a hierarchical model that links diverse types of behavioral definitions and distributions, and paves the way towards a statistical framework for defining behaviors.
Topics: Animals; Coturnix; Research; Edible Grain; Memory, Long-Term; Probability
PubMed: 37587164
DOI: 10.1038/s41598-023-39295-z
Biometrics, Mar 2024
Limitations of using the traditional Cox's hazard ratio for summarizing the magnitude of the treatment effect on time-to-event outcomes have been widely discussed, and alternative measures that do not have such limitations are gaining attention. One of the alternative methods recently proposed, in a simple two-sample comparison setting, uses the average hazard with survival weight (AH), which can be interpreted as the general censoring-free person-time incidence rate on a given time window. In this paper, we propose a new regression analysis approach for the AH with a truncation time τ. We investigate three versions of AH regression analysis, assuming (1) independent censoring, (2) group-specific censoring, and (3) covariate-dependent censoring. The proposed AH regression methods are closely related to robust Poisson regression. While the new approach requires an explicit truncation time τ, it can be more robust than Poisson regression in the presence of censoring. With the AH regression approach, one can summarize the between-group treatment difference in both absolute and relative terms, adjusting for covariates that are associated with the outcome. This property will increase the likelihood that the treatment effect magnitude is correctly interpreted. The AH regression approach can be a useful alternative to the traditional Cox's hazard ratio approach for estimating and reporting the magnitude of the treatment effect on time-to-event outcomes.
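In the two-sample setting, the AH on [0, τ] can be written as (1 − S(τ)) / ∫0^τ S(t) dt, i.e. the probability of an event by τ divided by the restricted mean survival time. A Kaplan-Meier-based sketch of that point estimate (ours; the paper's three regression versions are not reproduced here):

```python
import numpy as np

def km_average_hazard(time, event, tau):
    """Estimate AH(tau) = (1 - S(tau)) / integral_0^tau S(t) dt from
    right-censored data via the Kaplan-Meier curve. With no censoring this
    reduces to events per person-time on [0, tau]."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    event_times = event_times[event_times <= tau]
    surv, s = [], 1.0
    for t in event_times:
        d = np.sum((time == t) & event)   # events at t
        r = np.sum(time >= t)             # at risk just before t
        s *= 1.0 - d / r
        surv.append(s)
    # integrate the step function S(t) over [0, tau]
    grid = np.concatenate(([0.0], event_times, [tau]))
    steps = np.concatenate(([1.0], surv))
    rmst = np.sum(steps * np.diff(grid))
    s_tau = surv[-1] if surv else 1.0
    return (1.0 - s_tau) / rmst
```

As a sanity check, four subjects with events at t = 1, 2, 3, 4 and no censoring give 4 events over 10 person-units, so AH(4) = 0.4.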
Topics: Humans; Regression Analysis; Proportional Hazards Models; Survival Analysis; Computer Simulation; Poisson Distribution; Biometry; Models, Statistical
PubMed: 38771658
DOI: 10.1093/biomtc/ujae037
PeerJ, 2023
BACKGROUND
In this research, we propose probabilistic approaches to identify pairwise patterns of species co-occurrence by using presence-absence maps only. In particular, the two-by-two contingency table constructed from a presence-absence map of two species would be sufficient to compute the test statistics and perform the statistical tests proposed in this article. Some previous studies have investigated species co-occurrence through incidence data of different survey sites. We focus on using presence-absence maps for a specific study plot instead. The proposed methods are assessed by a thorough simulation study.
METHODS
A Chi-squared test is used to determine whether the distributions of two species are independent. If the null hypothesis of independence is rejected, the Chi-squared method cannot distinguish between positive and negative association between the two species. We propose six different approaches based on either the binomial or the Poisson distribution to obtain p-values for testing the positive (or negative) association between two species. When we test for a positive (or negative) association, if the p-value is below the predetermined level of significance, then we have enough evidence to support that the two species are positively (or negatively) associated.
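A sketch of the two-stage procedure on a pair of presence-absence maps (ours, using SciPy; the one-sided binomial test shown is just one plausible variant of the six approaches described, not the authors' exact statistic):

```python
import numpy as np
from scipy.stats import chi2_contingency, binomtest

def cooccurrence_tests(map_a, map_b):
    """Stage 1: chi-squared test of independence on the 2x2 table built from
    two presence-absence maps. Stage 2: one-sided binomial test for positive
    association, using P(both present) = pA * pB under independence.

    map_a, map_b: boolean arrays of the same shape, True = present.
    Returns (p_independence, p_positive_association).
    """
    a = np.asarray(map_a, dtype=bool).ravel()
    b = np.asarray(map_b, dtype=bool).ravel()
    n = a.size
    table = np.array([[np.sum(a & b),  np.sum(a & ~b)],
                      [np.sum(~a & b), np.sum(~a & ~b)]])
    chi2, p_indep, _, _ = chi2_contingency(table)
    p_both = a.mean() * b.mean()      # expected co-occurrence rate if independent
    p_pos = binomtest(int(table[0, 0]), n, p_both,
                      alternative="greater").pvalue
    return p_indep, p_pos
```

Two perfectly co-occurring maps reject both independence and the "no positive association" null, while two perfectly disjoint maps give a large p-value for positive association, illustrating the directionality the chi-squared test alone cannot provide.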
RESULTS
A simulation study is conducted to demonstrate the type-I errors and the testing powers of our approaches. The probabilistic approach proposed by Veech (2013) serves as a benchmark for comparison. The results show that the type-I error of the Chi-squared test is close to the significance level when the presence rate is between 40% and 80%. For data with extremely low or high presence rates, one of our approaches outperforms Veech (2013)'s in terms of testing power and type-I error rate. The proposed methods are applied to tree data from Barro Colorado Island in Panama and from Lansing Woods in the USA. Both positive and negative associations are found among some species in these two real datasets.
Topics: Benchmarking; Colorado; Computer Simulation; Interior Design and Furnishings; Panama
PubMed: 37719117
DOI: 10.7717/peerj.15907
Biometrics, Jan 2024
Multiple testing has been a prominent topic in statistical research. Despite extensive work in this area, controlling false discoveries remains a challenging task, especially when the test statistics exhibit dependence. Various methods have been proposed to estimate the false discovery proportion (FDP) under arbitrary dependencies among the test statistics. One key approach is to transform arbitrary dependence into weak dependence and subsequently establish the strong consistency of the FDP and the false discovery rate under weak dependence. As a result, FDPs converge to the same asymptotic limit within the framework of weak dependence. However, we have observed that the asymptotic variance of the FDP can be significantly influenced by the dependence structure of the test statistics, even when they exhibit only weak dependence. Quantifying this variability is of great practical importance, as it serves as an indicator of the quality of FDP estimation from the data. To the best of our knowledge, there is limited research on this aspect in the literature. In this paper, we aim to fill this gap by quantifying the variation of the FDP, assuming that the test statistics exhibit weak dependence and follow normal distributions. We begin by deriving the asymptotic expansion of the FDP and subsequently investigate how the asymptotic variance of the FDP is influenced by different dependence structures. Based on the insights gained from this study, we recommend that multiple testing procedures utilizing the FDP report both the mean and variance estimates of the FDP, providing a more comprehensive assessment of the study's outcomes.
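The phenomenon described, a similar mean FDP but very different variability under different dependence structures, can be illustrated with a small Monte Carlo sketch (ours; the equicorrelated-normal setup and all parameter values below are illustrative assumptions, not the paper's):

```python
import numpy as np
from scipy.stats import norm

def fdp_bh(pvals, is_null, q=0.1):
    """False discovery proportion of the Benjamini-Hochberg procedure at level q."""
    m = len(pvals)
    order = np.argsort(pvals)
    below = np.nonzero(np.sort(pvals) <= q * np.arange(1, m + 1) / m)[0]
    if below.size == 0:
        return 0.0
    rejected = order[: below[-1] + 1]
    return is_null[rejected].mean()

def simulate_fdp(rho, reps=200, m=200, frac_null=0.8, mu=3.0, q=0.1, seed=0):
    """FDPs of BH across replications when the z-statistics are
    equicorrelated normals with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    is_null = np.arange(m) < int(m * frac_null)
    means = np.where(is_null, 0.0, mu)
    fdps = []
    for _ in range(reps):
        shared = rng.standard_normal()          # common factor inducing correlation
        z = means + np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * rng.standard_normal(m)
        pvals = 2.0 * norm.sf(np.abs(z))        # two-sided p-values
        fdps.append(fdp_bh(pvals, is_null, q))
    return np.array(fdps)
```

Comparing `simulate_fdp(0.0)` with `simulate_fdp(0.5)`, the mean FDP stays near the nominal level in both cases while its spread across replications grows markedly under correlation, which is exactly why reporting a variance estimate alongside the mean is informative.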
Topics: Uncertainty; Normal Distribution
PubMed: 38497826
DOI: 10.1093/biomtc/ujae015