-
Journal of the American Medical... Jun 2024
OBJECTIVES
Linking information on Japanese pharmaceutical products to global knowledge bases (KBs) would enhance international collaborative research and yield valuable insights. However, public access to mappings of Japanese pharmaceutical products that use international controlled vocabularies remains limited. This study mapped YJ codes to RxNorm ingredient classes, providing new insights by comparing Japanese and international drug-drug interaction (DDI) information using a case study methodology.
MATERIALS AND METHODS
Tables linking YJ codes to RxNorm concepts were created using the application programming interfaces of the Kyoto Encyclopedia of Genes and Genomes and the National Library of Medicine. A comparative analysis of Japanese and international DDI information was then performed by linking to an international DDI KB.
RESULTS
There was limited agreement between the Japanese and international DDI severity classifications. Cross-tabulation of Japanese and international DDIs by severity showed that 213 combinations classified as serious DDIs by an international KB were missing from the Japanese DDI information.
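The cross-tabulation described above can be illustrated with a minimal sketch. The drug names, severity labels, and the `cross_tabulate` helper are hypothetical placeholders, not the study's data or code; `None` is assumed to mark a pair that is absent from a source.

```python
from collections import Counter

# Hypothetical severity labels for the same ingredient pairs in two sources.
# None means the pair is absent from that source. Toy data, not study results.
japanese = {
    ("drug_a", "drug_b"): "contraindicated",
    ("drug_a", "drug_c"): "caution",
    ("drug_b", "drug_d"): None,
}
international = {
    ("drug_a", "drug_b"): "serious",
    ("drug_a", "drug_c"): "serious",
    ("drug_b", "drug_d"): "serious",
}


def cross_tabulate(left, right):
    """Count (left_label, right_label) combinations over all pairs in either source."""
    table = Counter()
    for pair in set(left) | set(right):
        table[(left.get(pair), right.get(pair))] += 1
    return table


table = cross_tabulate(japanese, international)
# Pairs rated serious internationally but missing from the Japanese source.
missing_serious = table[(None, "serious")]
```

Counting the `(None, "serious")` cell is one way to surface combinations that one source flags as serious and the other omits.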
DISCUSSION
It is desirable that efforts be undertaken to standardize international criteria for DDIs to ensure consistency in the classification of their severity.
CONCLUSION
The classification of DDI severity remains highly variable. It is imperative to augment the repository of critical DDI information, which would revalidate the utility of fostering collaborations with global KBs.
Topics: Japan; Drug Interactions; Knowledge Bases; RxNorm; Humans; Vocabulary, Controlled; East Asian People
PubMed: 38758661
DOI: 10.1093/jamia/ocae094 -
Open Mind : Discoveries in Cognitive... 2024
PURPOSE
How does lexical decision behavior vary in students with the same grade level (all students were in their first year of middle-school), but different levels of reading fluency? Here, we tested a prediction of the dual-route model: as fluency increases, variations in the results may reflect a decreasing reliance on decoding and an increasing reliance on the lexical route.
METHOD
1,501 French 6th graders completed a one-minute speeded reading-aloud task evaluating fluency, and a ten-minute computerized lexical decision task evaluating the impact of lexicality, length, word frequency and pseudoword type.
RESULTS
As predicted, the word length effect varied dramatically with reading fluency, with the least fluent students showing a length effect even for frequent words. The frequency effect also varied, but solely in proportion to overall reading speed, suggesting that frequency affects the decision stage similarly in all readers, while length disproportionately impacts poor readers. Response times and errors were also affected by pseudoword type (e.g., letter substitutions or transpositions), but these effects showed minimal variation with fluency. Overall, lexical decision variables were excellent predictors of reading fluency (r = 0.62).
CONCLUSION
Our results highlight the variability in middle-school reading ability and describe how a simple lexical decision task can be used to assess students' mental lexicon (vocabulary) and the automatization of reading skills.
PubMed: 38746855
DOI: 10.1162/opmi_a_00140 -
BioRxiv : the Preprint Server For... May 2024
The explosion of sequence data has allowed the rapid growth of protein language models (pLMs). pLMs have now been employed in many frameworks including variant-effect and peptide-specificity prediction. Traditionally, for protein-protein or peptide-protein interactions (PPIs), corresponding sequences are either co-embedded followed by post-hoc integration or the sequences are concatenated prior to embedding. Interestingly, no method utilizes a language representation of the interaction itself. We developed an interaction LM (iLM), which uses a novel language to represent interactions between protein/peptide sequences. Sliding Window Interaction Grammar (SWING) leverages differences in amino acid properties to generate an interaction vocabulary. This vocabulary is the input into an LM followed by a supervised prediction step where the LM's representations are used as features. SWING was first applied to predicting peptide:MHC (pMHC) interactions. SWING was not only successful at generating Class I and Class II models that have comparable prediction to state-of-the-art approaches, but the unique Mixed Class model was also successful at jointly predicting both classes. Further, the SWING model trained only on Class I alleles was predictive for Class II, a complex prediction task not attempted by any existing approach. For de novo data, using only Class I or Class II data, SWING also accurately predicted Class II pMHC interactions in murine models of SLE (MRL/lpr model) and T1D (NOD model), that were validated experimentally. To further evaluate SWING's generalizability, we tested its ability to predict the disruption of specific protein-protein interactions by missense mutations. Although modern methods like AlphaMissense and ESM1b can predict interfaces and variant effects/pathogenicity per mutation, they are unable to predict interaction-specific disruptions.
SWING was successful at accurately predicting the impact of both Mendelian mutations and population variants on PPIs. This is the first generalizable approach that can accurately predict interaction-specific disruptions by missense mutations with only sequence information. Overall, SWING is a first-in-class generalizable zero-shot iLM that learns the language of PPIs.
PubMed: 38746274
DOI: 10.1101/2024.05.01.592062 -
Heliyon May 2024
INTRODUCTION
Dementia is marked by a steady decline or worsening in cognitive abilities, affecting memory, logic, and social competencies. While many studies suggest a potential link between the amount of sleep and dementia risk, the outcomes are not yet consistent. This research examined the relationship of sleep duration and bedtime with cognitive abilities using an extensive dataset from the China Family Panel Studies (CFPS) from 2014 to 2020.
METHODS
Data from 175,702 observations were collected, including cognitive function test data from 22,848 participants. Various cognitive tests were used to assess cognitive function. Restricted cubic spline (RCS) models were used for data analysis.
RESULTS
The optimal sleep duration for cognitive function was found to be 6-7 h, and the optimal bedtime was generally between 22:00 and 23:00. Longitudinal analysis revealed that sleep duration four years prior had a significant impact on current cognitive function. After accounting for various factors, those who slept for 7-8 h and over 8 h displayed lower cognitive function scores. Conversely, individuals sleeping less than 6 h had higher scores on the Vocabulary Test. Bedtime before 22:00 was associated with lower scores on the Vocabulary Test and Mathematical Test. Subgroup analyses based on age, gender, and urban residence showed variations in optimal sleep duration for different populations. Propensity Score Matching (PSM) analysis supported the findings.
CONCLUSIONS
Maintaining a sleep duration of 6-7 h and a regular bedtime between 22:00 and 23:00 is important for optimizing cognitive performance.
PubMed: 38737242
DOI: 10.1016/j.heliyon.2024.e30009 -
Journal of Proteome Research Jun 2024
Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), which was surpassed by fine-tuned transformers in F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86), while BiLSTM models had higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Code for automated annotation and model generation is available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
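As a quick check on the pipeline's figures, the reported precision (1.00) and recall (0.76) are consistent with a score of 0.86 when that score is read as the F1-score, the harmonic mean of precision and recall. The sketch below is illustrative only; the `f1_score` helper is defined here and is not taken from the authors' repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the annotation pipeline in the abstract.
pipeline_f1 = f1_score(1.00, 0.76)
print(round(pipeline_f1, 2))  # 0.86
```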
Topics: Deep Learning; Algorithms; Natural Language Processing; Enzymes; Molecular Sequence Annotation; Humans; Data Mining
PubMed: 38733346
DOI: 10.1021/acs.jproteome.3c00367 -
Journal of the American Medical... Jun 2024
Comparative Study
OBJECTIVE
Natural language processing (NLP) algorithms are increasingly being applied to obtain unsupervised representations of electronic health record (EHR) data, but their comparative performance at predicting clinical endpoints remains unclear. Our objective was to compare the performance of unsupervised representations of sequences of disease codes generated by bag-of-words versus sequence-based NLP algorithms at predicting clinically relevant outcomes.
MATERIALS AND METHODS
This cohort study used primary care EHRs from 6 286 233 people with Multiple Long-Term Conditions in England. For each patient, an unsupervised vector representation of their time-ordered sequences of diseases was generated using 2 input strategies (212 disease categories versus 9462 diagnostic codes) and different NLP algorithms (Latent Dirichlet Allocation, doc2vec, and 2 transformer models designed for EHRs). We also developed a transformer architecture, named EHR-BERT, incorporating sociodemographic information. We compared the performance of each of these representations (without fine-tuning) as inputs into a logistic classifier to predict 1-year mortality, healthcare use, and new disease diagnosis.
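The distinction between bag-of-words and sequence-based inputs can be made concrete with a toy sketch. The codes `"A"`, `"B"`, `"C"` are placeholders rather than real diagnostic codes, and the bigram view is only a stand-in for order awareness; it is not one of the algorithms the study evaluated.

```python
from collections import Counter

# Hypothetical disease codes; "A", "B", "C" are placeholders, not real codes.
patient_1 = ["A", "B", "C"]  # A diagnosed first, then B, then C
patient_2 = ["C", "B", "A"]  # same diseases, reverse order


def bag_of_words(codes):
    """Order-free representation: code -> count."""
    return Counter(codes)


def bigrams(codes):
    """A minimal order-aware representation: consecutive code pairs."""
    return list(zip(codes, codes[1:]))


# The two patients are indistinguishable as bags of codes...
same_bag = bag_of_words(patient_1) == bag_of_words(patient_2)
# ...but differ once the order of diagnoses is taken into account.
same_order = bigrams(patient_1) == bigrams(patient_2)
```

Here `same_bag` is true while `same_order` is false, which is the kind of information a sequence-based model can use and a co-occurrence-based one cannot.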
RESULTS
Patient representations generated by sequence-based algorithms performed consistently better than bag-of-words methods in predicting clinical endpoints, with the highest performance for EHR-BERT across all tasks, although the absolute improvement was small. Representations generated using disease categories perform similarly to those using diagnostic codes as inputs, suggesting models can equally manage smaller or larger vocabularies for prediction of these outcomes.
DISCUSSION AND CONCLUSION
Patient representations produced by sequence-based NLP algorithms from sequences of disease codes demonstrate improved predictive content for patient outcomes compared with representations generated by co-occurrence-based algorithms. This suggests transformer models may be useful for generating multi-purpose representations, even without fine-tuning.
Topics: Natural Language Processing; Electronic Health Records; Humans; Algorithms; Cohort Studies; Female; Male; Disease; England
PubMed: 38719204
DOI: 10.1093/jamia/ocae091 -
Scientific Reports May 2024
Every human has a body. Yet, languages differ in how they divide the body into parts to name them. While universal naming strategies exist, there is also variation in the vocabularies of body parts across languages. In this study, we investigate the similarities and differences in naming two separate body parts with one word, i.e., colexifications. We use a computational approach to create networks of body part vocabularies across languages. The analyses focus on body part networks in large language families, on perceptual features that lead to colexifications of body parts, and on a comparison of network structures in different semantic domains. Our results show that adjacent body parts are colexified frequently. However, preferences for perceptual features such as shape and function lead to variations in body part vocabularies. In addition, body part colexification networks are less varied across language families than networks in the semantic domains of emotion and colour. The study presents the first large-scale comparison of body part vocabularies in 1,028 language varieties and provides important insights into the variability of a universal human domain.
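A colexification network of the kind described above can be sketched as a weighted edge list: two concepts are linked whenever a language uses one word form for both, and the weight counts how many languages do so. The word lists and the `colexification_network` helper below are hypothetical toy material, not the study's data or pipeline.

```python
from collections import defaultdict
from itertools import combinations

# Toy word lists: language -> {concept: word form}. Placeholder forms only.
wordlists = {
    "lang_1": {"HAND": "w1", "ARM": "w1", "FOOT": "w2", "LEG": "w3"},
    "lang_2": {"HAND": "x1", "ARM": "x1", "FOOT": "x2", "LEG": "x2"},
    "lang_3": {"HAND": "y1", "ARM": "y2", "FOOT": "y3", "LEG": "y3"},
}


def colexification_network(wordlists):
    """Return {(concept_a, concept_b): number of languages colexifying them}."""
    edges = defaultdict(int)
    for concepts in wordlists.values():
        # Group the concepts of one language by their shared word form.
        by_form = defaultdict(list)
        for concept, form in concepts.items():
            by_form[form].append(concept)
        # Every pair of concepts sharing a form is one colexification.
        for shared in by_form.values():
            for a, b in combinations(sorted(shared), 2):
                edges[(a, b)] += 1
    return dict(edges)


network = colexification_network(wordlists)
```

In this toy example the adjacent pairs HAND/ARM and FOOT/LEG each receive a weight of 2, mirroring the idea that adjacent body parts are colexified frequently.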
Topics: Humans; Language; Semantics; Vocabulary; Human Body; Culture
PubMed: 38714717
DOI: 10.1038/s41598-024-61140-0 -
Frontiers in Plant Science 2024
Carrot (Daucus carota L.) is a high value, nutritious, and colorful crop, but delivering carrots from seed to table can be a struggle for carrot growers. Weed competitive ability is a critical trait for crop success that carrot and its apiaceous relatives often lack owing to their characteristic slow shoot growth and erratic seedling emergence, even among genetically uniform lines. This study is the first field-based, multi-year experiment to evaluate shoot-growth trait variation over a 100-day growing season in a carrot diversity panel (N=695) that includes genetically diverse carrot accessions from the United States Department of Agriculture National Plant Germplasm System. We report phenotypic variability for shoot-growth characteristics, the first broad-sense heritability estimates for seedling emergence (0.68 < H² < 0.80) and early-season canopy coverage (0.61 < H² < 0.65), and consistent broad-sense heritability for late-season canopy height (0.76 < H² < 0.82), indicating quantitative inheritance and potential for improvement through plant breeding. Strong correlation between emergence and canopy coverage (0.62 < r < 0.72) suggests that improvement of seedling emergence has great potential to increase yield and weed competitive ability. Accessions with high emergence and vigorous canopy growth are of immediate use to breeders targeting stand establishment, weed-tolerance, or weed-suppressant carrots, which is of particular advantage to the organic carrot production sector, reducing the costs and labor associated with herbicide application and weeding. We developed a standardized vocabulary and protocol to describe shoot-growth and facilitate collaboration and communication across carrot research groups.
Our study facilitates identification and utilization of carrot genetic resources, conservation of agrobiodiversity, and development of breeding stocks for weed-competitive ability, with the long-term goal of delivering improved carrot cultivars to breeders, growers, and consumers. Accession selection can be further optimized for efficient breeding by combining shoot growth data with phenological data in this study's companion paper to identify ideotypes based on global market needs.
PubMed: 38708395
DOI: 10.3389/fpls.2024.1342512 -
Data in Brief Jun 2024
This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, a resource-constrained language environment. Despite the increase in NLP applications, the Uzbek language is still underrepresented, which underscores the importance of our work. Our dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. In addition, for practical application and experiments, the authors have developed two algorithms that, using this dictionary, identify named entities in Uzbek language texts. The authors also describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future NER tasks in the Uzbek language, but also offers a methodological basis for the use of vocabulary-based NER or machine-learning NER in other low-resource languages (e.g. Karakalpak). The dataset (and algorithms) we have developed can be used to create applications such as improved chatbot systems, text mining applications and other analytical tools for the Uzbek language, contributing to the development of those areas in the region for which these solutions will be developed.
PubMed: 38708296
DOI: 10.1016/j.dib.2024.110413 -
Frontiers in Psychology 2024
This study examines the dimensionality of and relationships between two subscales from the British Ability Scales - Third Edition, measuring verbal (expressive vocabulary) and non-verbal (reasoning) cognitive skills for toddlers (age three) and preschoolers (age five), in a Norwegian context across genders. Descriptive statistics revealed item selection criteria that included specific items within each subscale. Subsequently, Confirmatory Factor Analysis established the subscales' dimensionality (Naming Vocabulary and Picture Similarities; N = 1094) and confirmed measurement invariance across genders. Further, the relationships between the verbal and non-verbal factors were investigated using correlation analysis and Structural Equation Modeling. The findings revealed that the verbal factor at age three strongly predicted the verbal factor at age five and significantly influenced the non-verbal factor at age five. The non-verbal factor at age three exhibited a moderate predictive relationship with the non-verbal factor at age five, and did not significantly predict the verbal factor at age five. In terms of gender differences, girls showed higher scores on the verbal factor at age three, and a stronger correlation between the non-verbal factor at age three and the verbal factor at age five. In summary, this research provides valuable insights into cognitive skill measurement and development in a Norwegian context and highlights possible variations across gender. The study's findings, limitations, and implications are discussed.
PubMed: 38708013
DOI: 10.3389/fpsyg.2024.1330334