-
Journal of the American Medical... Jun 2024
Comparative Study
OBJECTIVE
Natural language processing (NLP) algorithms are increasingly being applied to obtain unsupervised representations of electronic health record (EHR) data, but their comparative performance at predicting clinical endpoints remains unclear. Our objective was to compare the performance of unsupervised representations of sequences of disease codes generated by bag-of-words versus sequence-based NLP algorithms at predicting clinically relevant outcomes.
MATERIALS AND METHODS
This cohort study used primary care EHRs from 6 286 233 people with Multiple Long-Term Conditions in England. For each patient, an unsupervised vector representation of their time-ordered sequences of diseases was generated using 2 input strategies (212 disease categories versus 9462 diagnostic codes) and different NLP algorithms (Latent Dirichlet Allocation, doc2vec, and 2 transformer models designed for EHRs). We also developed a transformer architecture, named EHR-BERT, incorporating sociodemographic information. We compared the performance of each of these representations (without fine-tuning) as inputs into a logistic classifier to predict 1-year mortality, healthcare use, and new disease diagnosis.
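The distinction between bag-of-words and sequence-based inputs can be illustrated with a minimal sketch. The vocabulary, patient histories, and function names below are hypothetical and are not taken from the study; the point is only that a count vector discards the order of diagnoses, while a sequence of token ids preserves it.

```python
from collections import Counter

# Hypothetical vocabulary of disease categories (illustrative only).
VOCAB = ["diabetes", "hypertension", "ckd", "depression"]


def bag_of_words(sequence, vocab=VOCAB):
    """Count vector over the vocabulary; the order of diagnoses is discarded."""
    counts = Counter(sequence)
    return [counts[code] for code in vocab]


def sequence_ids(sequence, vocab=VOCAB):
    """Integer ids in diagnosis order, the kind of input a sequence model consumes."""
    index = {code: i for i, code in enumerate(vocab)}
    return [index[code] for code in sequence]


# Two hypothetical patients with the same diseases in opposite order.
patient_a = ["hypertension", "diabetes", "ckd"]
patient_b = ["ckd", "diabetes", "hypertension"]

print(bag_of_words(patient_a) == bag_of_words(patient_b))  # same counts
print(sequence_ids(patient_a), sequence_ids(patient_b))    # different orderings
```

In a setup like the one described, either kind of representation would then be passed as fixed features to a downstream classifier.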
RESULTS
Patient representations generated by sequence-based algorithms performed consistently better than bag-of-words methods in predicting clinical endpoints, with the highest performance for EHR-BERT across all tasks, although the absolute improvement was small. Representations generated using disease categories performed similarly to those using diagnostic codes as inputs, suggesting models can equally manage smaller or larger vocabularies for prediction of these outcomes.
DISCUSSION AND CONCLUSION
Patient representations produced by sequence-based NLP algorithms from sequences of disease codes demonstrate improved predictive content for patient outcomes compared with representations generated by co-occurrence-based algorithms. This suggests transformer models may be useful for generating multi-purpose representations, even without fine-tuning.
Topics: Natural Language Processing; Electronic Health Records; Humans; Algorithms; Cohort Studies; Female; Male; Disease; England
PubMed: 38719204
DOI: 10.1093/jamia/ocae091 -
Scientific Reports May 2024
Every human has a body. Yet, languages differ in how they divide the body into parts to name them. While universal naming strategies exist, there is also variation in the vocabularies of body parts across languages. In this study, we investigate the similarities and differences in naming two separate body parts with one word, i.e., colexifications. We use a computational approach to create networks of body part vocabularies across languages. The analyses focus on body part networks in large language families, on perceptual features that lead to colexifications of body parts, and on a comparison of network structures in different semantic domains. Our results show that adjacent body parts are colexified frequently. However, preferences for perceptual features such as shape and function lead to variations in body part vocabularies. In addition, body part colexification networks are less varied across language families than networks in the semantic domains of emotion and colour. The study presents the first large-scale comparison of body part vocabularies in 1,028 language varieties and provides important insights into the variability of a universal human domain.
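A colexification network of the kind described above can be sketched as a weighted graph in which two concepts are linked whenever a language uses one word form for both. The word lists, language labels, and function name below are toy assumptions for illustration, not data or code from the study.

```python
from collections import defaultdict
from itertools import combinations

# Toy word lists (hypothetical): language -> {concept: word form}.
WORDLISTS = {
    "lang_a": {"HAND": "ruka", "ARM": "ruka", "FINGER": "palec", "TOE": "palec"},
    "lang_b": {"HAND": "te", "ARM": "ude", "FINGER": "yubi", "TOE": "yubi"},
    "lang_c": {"HAND": "hand", "ARM": "arm", "FINGER": "finger", "TOE": "toe"},
}


def colexification_network(wordlists):
    """Return {(concept1, concept2): number of languages using one word for both}."""
    edges = defaultdict(int)
    for concepts in wordlists.values():
        # Group the concepts of one language by their shared word form.
        by_form = defaultdict(list)
        for concept, form in concepts.items():
            by_form[form].append(concept)
        # Every pair of concepts sharing a form adds one to that edge's weight.
        for group in by_form.values():
            for pair in combinations(sorted(group), 2):
                edges[pair] += 1
    return dict(edges)


network = colexification_network(WORDLISTS)
print(network)
```

Edge weights then indicate how widespread a colexification is across the sampled languages.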
Topics: Humans; Language; Semantics; Vocabulary; Human Body; Culture
PubMed: 38714717
DOI: 10.1038/s41598-024-61140-0 -
Frontiers in Plant Science 2024
Carrot (Daucus carota L.) is a high-value, nutritious, and colorful crop, but delivering carrots from seed to table can be a struggle for carrot growers. Weed competitive ability is a critical trait for crop success that carrot and its apiaceous relatives often lack owing to their characteristic slow shoot growth and erratic seedling emergence, even among genetically uniform lines. This study is the first field-based, multi-year experiment to evaluate shoot-growth trait variation over a 100-day growing season in a carrot diversity panel (N=695) that includes genetically diverse carrot accessions from the United States Department of Agriculture National Plant Germplasm System. We report phenotypic variability for shoot-growth characteristics, the first broad-sense heritability estimates for seedling emergence (0.68 < H² < 0.80) and early-season canopy coverage (0.61 < H² < 0.65), and consistent broad-sense heritability for late-season canopy height (0.76 < H² < 0.82), indicating quantitative inheritance and potential for improvement through plant breeding. Strong correlation between emergence and canopy coverage (0.62 < r < 0.72) suggests that improvement of seedling emergence has great potential to increase yield and weed competitive ability. Accessions with high emergence and vigorous canopy growth are of immediate use to breeders targeting stand establishment, weed-tolerance, or weed-suppressant carrots, which is of particular advantage to the organic carrot production sector, reducing the costs and labor associated with herbicide application and weeding. We developed a standardized vocabulary and protocol to describe shoot-growth and facilitate collaboration and communication across carrot research groups.
Our study facilitates identification and utilization of carrot genetic resources, conservation of agrobiodiversity, and development of breeding stocks for weed-competitive ability, with the long-term goal of delivering improved carrot cultivars to breeders, growers, and consumers. Accession selection can be further optimized for efficient breeding by combining shoot growth data with phenological data in this study's companion paper to identify ideotypes based on global market needs.
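For readers unfamiliar with the heritability estimates quoted above, broad-sense heritability is conventionally defined as the share of phenotypic variance attributable to genetic variance. The form below is the simplest textbook definition, with symbols introduced here for illustration; the abstract does not state the estimator used, and a multi-year field trial would typically include further terms (for example genotype-by-year interaction).

```latex
H^2 = \frac{\sigma^2_G}{\sigma^2_P}, \qquad \sigma^2_P = \sigma^2_G + \sigma^2_E
```

Here \sigma^2_G is the genetic variance, \sigma^2_E the environmental (residual) variance, and \sigma^2_P the total phenotypic variance, so a reported range such as 0.68 to 0.80 would mean that roughly 68 to 80 percent of the observed variation in the trait is attributed to genetic differences among accessions.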
PubMed: 38708395
DOI: 10.3389/fpls.2024.1342512 -
Data in Brief Jun 2024
This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, a resource-constrained language environment. Despite the increase in NLP applications, the Uzbek language is still underrepresented, which underscores the importance of our work. Our dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. In addition, for practical application and experiments, the authors have developed two algorithms that, using this dictionary, identify named entities in Uzbek language texts. The authors also describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future named entity recognition (NER) tasks in the Uzbek language, but also offers a methodological basis for the use of vocabulary-based NER or machine learning NER in other low-resource languages (e.g. Karakalpak). The dataset (and algorithms) we have developed can be used to create applications such as improved chatbot systems, text mining applications and other analytical tools for the Uzbek language, contributing to the development of those areas in the region for which these solutions will be developed.
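A vocabulary-based approach to NER can be sketched as a greedy longest-match lookup against a dictionary of known entity names. This is a generic illustration, not the authors' algorithms: the gazetteer entries, labels, example sentence, and function name are all hypothetical, and a practical system for an agglutinative language such as Uzbek would also need to handle suffixed word forms.

```python
# Hypothetical gazetteer: lowercase surface form -> entity label.
GAZETTEER = {
    "toshkent": "LOC",
    "samarqand": "LOC",
    "alisher navoiy": "PER",
}
MAX_SPAN = 2  # longest dictionary entry, measured in tokens


def tag_entities(tokens, gazetteer=GAZETTEER, max_span=MAX_SPAN):
    """Greedy longest-match lookup; returns (start, end, text, label) tuples."""
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for span in range(min(max_span, len(tokens) - i), 0, -1):
            text = " ".join(tokens[i:i + span])
            if text.lower() in gazetteer:
                entities.append((i, i + span, text, gazetteer[text.lower()]))
                i += span
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return entities


# Illustrative sentence only; not drawn from the dataset.
sentence = "Alisher Navoiy Samarqand shahrida yashagan".split()
print(tag_entities(sentence))
```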
PubMed: 38708296
DOI: 10.1016/j.dib.2024.110413 -
Frontiers in Psychology 2024
This study examines the dimensionality of and relationships between two subscales from the British Ability Scales - Third Edition, measuring verbal (expressive vocabulary) and non-verbal (reasoning) cognitive skills for toddlers (age three) and preschoolers (age five), in a Norwegian context across genders. Descriptive statistics revealed item selection criteria that included specific items within each subscale. Subsequently, Confirmatory Factor Analysis established the subscales' dimensionality (Naming Vocabulary and Picture Similarities; N = 1094) and confirmed measurement invariance across genders. Further, the relationships between the verbal and non-verbal factors were investigated using correlation analysis and Structural Equation Modeling. The findings revealed that the verbal factor at age three strongly predicted the verbal factor at age five and significantly influenced the non-verbal factor at age five. The non-verbal factor at age three exhibited a moderate predictive relationship with the non-verbal factor at age five, and did not significantly predict the verbal factor at age five. In terms of gender differences, girls showed higher scores on the verbal factor at age three, and a stronger correlation between the non-verbal factor at age three and the verbal factor at age five. In summary, this research provides valuable insights into cognitive skill measurement and development in a Norwegian context and highlights possible variations across gender. The study's findings, limitations, and implications are discussed.
PubMed: 38708013
DOI: 10.3389/fpsyg.2024.1330334 -
Cognitive Neurodynamics Apr 2024
Because cognitive competences emerge in evolution and development from the sensory-motor domain, we seek a neural process account for higher cognition in which all representations are necessarily grounded in perception and action. The challenge is to understand how hallmarks of higher cognition, productivity, systematicity, and compositionality, may emerge from such a bottom-up approach. To address this challenge, we present key ideas from Dynamic Field Theory which postulates that neural populations are organized by recurrent connectivity to create stable localist representations. Dynamic instabilities enable the autonomous generation of sequences of mental states. The capacity to apply neural circuitry across broad sets of inputs that emulates the function call postulated in symbolic computation emerges through coordinate transforms implemented in neural gain fields. We show how binding localist neural representations through a shared index dimension enables conceptual structure, in which the interdependence among components of a representation is flexibly expressed. We demonstrate these principles in a neural dynamic architecture that represents and perceptually grounds nested relational and action phrases. Sequences of neural processing steps are generated autonomously to attentionally select the referenced objects and events in a manner that is sensitive to their interdependencies. This solves the problem of 2 and the massive binding problem in expressions such as "the small tree that is to the left of the lake which is to the left of the large tree". We extend earlier work by incorporating new types of grammatical constructions and a larger vocabulary. We discuss the DFT framework relative to other neural process accounts of higher cognition and assess the scope and challenges of such neural theories.
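As background to the recurrently connected neural populations mentioned above, Dynamic Field Theory is usually formulated with a neural field equation of the Amari type. The form below is the standard textbook version, with notation introduced here for illustration; the abstract itself does not give the equations used in the architecture.

```latex
\tau \, \frac{\partial u(x,t)}{\partial t} = -u(x,t) + h + s(x,t) + \int w(x - x') \, \sigma\big(u(x',t)\big) \, dx'
```

Here u(x,t) is the activation of the field over a feature dimension x, h < 0 is the resting level, s(x,t) is external input, w is an interaction kernel (typically local excitation with broader inhibition), and \sigma is a sigmoidal output function. Stable localized peaks of activation in such a field play the role of the localist representations referred to in the abstract.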
PubMed: 38699609
DOI: 10.1007/s11571-023-10007-7 -
Heliyon May 2024
Non-native English-speaking law students and international legal practitioners who speak English as an additional language face significant challenges while pursuing legal studies at English-only institutions, participating in professional training or catering to the legal needs of an increasingly diverse clientele. One of the most difficult challenges is sustaining adequate lexical knowledge to initiate and maintain communication regarding legal subject matter. This study aims to address this issue by presenting two short lists of lexical bundles and keywords (KWs) of the . Through a combination of corpus analysis and linguistics methodology, these lists are designed to provide a pedagogically useful and subject-focused source for learning academic vocabulary. Bundles are functionally classified into referentials, discourse organisers and stance markers, and their structural forms are filtered into distinct nominal, prepositional and verbal categories. KWs are POS-tagged to allow for direct instructional intervention. The study discusses the pedagogical implications of these findings for teaching English for legal purposes.
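Lexical bundles are commonly operationalized as word sequences that recur above a frequency threshold and across several texts. The sketch below shows that general idea only; the thresholds, toy corpus, and function name are assumptions, and the abstract does not state the criteria the study applied.

```python
from collections import Counter


def lexical_bundles(texts, n=3, min_freq=2, min_texts=2):
    """Return n-grams occurring at least min_freq times in at least min_texts texts."""
    freq = Counter()    # total occurrences of each n-gram
    spread = Counter()  # number of texts containing each n-gram
    for text in texts:
        tokens = text.lower().split()
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        freq.update(grams)
        spread.update(set(grams))
    return {g: c for g, c in freq.items() if c >= min_freq and spread[g] >= min_texts}


# Toy corpus, invented for illustration.
corpus = [
    "the court held that the contract was void",
    "on appeal the court held that the claim failed",
    "the claim failed for want of evidence",
]
print(lexical_bundles(corpus))
```

Requiring a minimum spread across texts keeps sequences that are frequent in a single document from being counted as bundles.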
PubMed: 38699014
DOI: 10.1016/j.heliyon.2024.e29944 -
Autism & Developmental Language... 2024
BACKGROUND AND AIMS
Caregiver-delivered programs are a recommended best practice to support young autistic children. While research has extensively explored children's outcomes quantitatively, minimal qualitative research has been conducted to understand caregivers' perspectives of program outcomes for themselves and their children. Hearing directly from caregivers is an important step in ensuring these programs are meeting the needs of those who use them. This study explored caregivers' perceived outcomes following one virtual caregiver-delivered program, The Hanen Centre's More Than Words (MTW) program.
METHODS
This study was a secondary analysis of data from individual interviews conducted with 21 caregivers who had recently participated in a virtual MTW program. A hybrid codebook thematic analysis approach was taken to analyze the interview data. Program outcomes were coded and analyzed within the International Classification of Functioning, Disability and Health (ICF) framework. Additionally, caregivers completed an online survey and rated Likert-scale items about perceived program outcomes, which were analyzed descriptively.
RESULTS
Five themes were identified: (1) caregivers learned new strategies to facilitate their child's development, (2) caregivers developed a new mindset, (3) children gained functional communication skills, (4) caregiver-child relationships improved, and (5) caregivers gained a social and professional support network. These themes fell within four of five ICF framework components (Activities, Participation, Personal Factors, and Environmental Factors). No themes were identified under Body Structures and Functions. Survey results indicated most caregivers reported learning new communication strategies (n = 20, 95%) and identifying new teaching opportunities with their child (n = 21, 100%).
CONCLUSIONS
Some reported outcomes, related to Activities and Participation, were consistent with previous reports in the literature on the MTW program. In line with previous research, caregivers learned strategies to support their child's communication development. Contrary to previous quantitative studies, caregivers in this study rarely commented on gains in vocabulary and instead focused on gains in skills that positively impacted their child's ability to engage in meaningful social interaction. Novel outcomes were identified within the Participation, Personal Factors, and Environmental Factors components of the ICF framework.
IMPLICATIONS
Caregivers in this study identified important outcomes for themselves and their child that have not been the focus of prior research, suggesting it is important to integrate their perspectives in the development and evaluation of caregiver-delivered programs. Clinicians should include goals that address outcomes identified as important by caregivers, including those that address children's Participation, and those that target caregivers' Personal and Environmental Factors. Developers of caregiver-delivered programs could integrate identified goals to ensure they are meeting families' needs.
PubMed: 38694817
DOI: 10.1177/23969415241244767 -
Journal of Biomedical Semantics May 2024
Biomedical terminologies play a vital role in managing biomedical data. Missing IS-A relations in a biomedical terminology could be detrimental to its downstream usages. In this paper, we investigate an approach combining logical definitions and lexical features to discover missing IS-A relations in two biomedical terminologies: SNOMED CT and the National Cancer Institute (NCI) thesaurus. The method is applied to unrelated concept-pairs within non-lattice subgraphs: graph fragments within a terminology likely to contain various inconsistencies. Our approach first compares whether the logical definition of a concept is more general than that of the other concept. Then, we check whether the lexical features of the concept are contained in those of the other concept. If both constraints are satisfied, we suggest a potentially missing IS-A relation between the two concepts. The method identified 982 potential missing IS-A relations for SNOMED CT and 100 for NCI thesaurus. In order to assess the efficacy of our approach, a random sample of results belonging to the "Clinical Findings" and "Procedure" subhierarchies of SNOMED CT and results belonging to the "Drug, Food, Chemical or Biomedical Material" subhierarchy of the NCI thesaurus were evaluated by domain experts. The evaluation results revealed that 118 out of 150 suggestions are valid for SNOMED CT and 17 out of 20 are valid for NCI thesaurus.
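The two-step check described above can be sketched with plain sets. This is a deliberately simplified illustration: it assumes a logical definition can be modelled as a set of attribute-value pairs, so that "more general" means a subset, and that lexical features are the words of a concept's name. The concepts, attribute names, and function name are hypothetical and do not reproduce SNOMED CT or NCI thesaurus content.

```python
def make_concept(name, definition):
    """Bundle a concept's name, its word set, and its attribute-value definition."""
    return {
        "name": name,
        "words": frozenset(name.lower().split()),
        "definition": frozenset(definition),
    }


def suggests_is_a(child, parent):
    """True if both constraints hold, i.e. child IS-A parent may be missing."""
    # Constraint 1: the parent's definition is more general (a subset of the child's).
    more_general = parent["definition"] <= child["definition"]
    # Constraint 2: the parent's lexical features are contained in the child's.
    lexically_contained = parent["words"] <= child["words"]
    return more_general and lexically_contained


# Hypothetical concept pair, invented for illustration.
fracture = make_concept(
    "Fracture of femur",
    [("finding_site", "femur"), ("morphology", "fracture")],
)
open_fracture = make_concept(
    "Open fracture of femur",
    [("finding_site", "femur"), ("morphology", "fracture"), ("wound", "open")],
)

print(suggests_is_a(open_fracture, fracture))  # candidate missing relation
print(suggests_is_a(fracture, open_fracture))  # reverse direction fails
```

Requiring both constraints is what keeps the number of suggestions small enough for review by domain experts.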
Topics: Systematized Nomenclature of Medicine; Terminology as Topic; Vocabulary, Controlled; Logic
PubMed: 38693592
DOI: 10.1186/s13326-024-00309-y -
Frontiers in Genetics 2024
DNA methylation is a critical epigenetic modification involving the addition of a methyl group to the DNA molecule, playing a key role in regulating gene expression without changing the DNA sequence. The main difficulty in identifying DNA methylation sites lies in the subtle and complex nature of methylation patterns, which may vary across different tissues, developmental stages, and environmental conditions. Traditional methods for methylation site identification, such as bisulfite sequencing, are typically labor-intensive, costly, and require large amounts of DNA, hindering high-throughput analysis. Moreover, these methods may not always provide the resolution needed to detect methylation at specific sites, especially in genomic regions that are rich in repetitive sequences or have low levels of methylation. Furthermore, current deep learning approaches generally lack sufficient accuracy. This study introduces the iDNA-OpenPrompt model, leveraging the novel OpenPrompt learning framework. The model combines a prompt template, prompt verbalizer, and Pre-trained Language Model (PLM) to construct the prompt-learning framework for DNA methylation sequences. Moreover, a DNA vocabulary library, BERT tokenizer, and specific label words are also introduced into the model to enable accurate identification of DNA methylation sites. An extensive analysis is conducted to evaluate the predictive capability, reliability, and consistency of the iDNA-OpenPrompt model. The experimental outcomes, covering 17 benchmark datasets that include various species and three DNA methylation modifications (4mC, 5hmC, 6mA), consistently indicate that our model surpasses existing approaches in performance and robustness.
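The abstract does not describe how the DNA vocabulary is constructed. One common choice in BERT-style DNA models is to split a sequence into overlapping k-mers and map each to an id, and the sketch below assumes that scheme purely for illustration. The value of k, the special tokens, and the function names are hypothetical.

```python
from itertools import product

SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def kmer_vocabulary(k=3, alphabet="ACGT"):
    """Map BERT-style special tokens and every possible k-mer to an integer id."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return {token: i for i, token in enumerate(SPECIALS + kmers)}


def tokenize(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def encode(sequence, vocab, k=3):
    """Convert a sequence to ids, wrapping it in [CLS] ... [SEP]."""
    unk = vocab["[UNK]"]
    ids = [vocab.get(token, unk) for token in tokenize(sequence, k)]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]


vocab = kmer_vocabulary(k=3)
print(len(vocab))               # 5 special tokens plus 4**3 k-mers
print(tokenize("ACGTN"))        # the last k-mer contains an ambiguous base
print(encode("ACGTN", vocab))   # the ambiguous k-mer maps to [UNK]
```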
PubMed: 38689652
DOI: 10.3389/fgene.2024.1377285