-
Scientific Data Jun 2024
In low- and middle-income countries, the substantial costs associated with traditional data collection pose an obstacle to facilitating decision-making in the field of public health. Satellite imagery offers a potential solution, but image extraction and analysis can be costly and require specialized expertise. We introduce SatelliteBench, a scalable framework for satellite image extraction and vector embedding generation. We also propose a novel multimodal fusion pipeline that utilizes a series of satellite imagery and metadata. The framework was evaluated by generating a dataset of 12,636 images and embeddings, accompanied by comprehensive metadata, from 81 municipalities in Colombia between 2016 and 2018. The dataset was then evaluated on 3 tasks: dengue case prediction, poverty assessment, and access to education. The performance showcases the versatility and practicality of SatelliteBench, offering a reproducible, accessible, and open tool to enhance decision-making in public health.
Topics: Satellite Imagery; Colombia; Public Health; Humans; Dengue; Metadata
PubMed: 38879585
DOI: 10.1038/s41597-024-03366-1 -
Journal of the American Medical... Jun 2024
OBJECTIVE
Investigate the use of advanced natural language processing models to streamline the time-consuming process of writing and revising scholarly manuscripts.
MATERIALS AND METHODS
For this purpose, we integrate large language models into the Manubot publishing ecosystem to suggest revisions for scholarly texts. Our AI-based revision workflow employs a prompt generator that incorporates manuscript metadata into templates, generating section-specific instructions for the language model. The model then generates revised versions of each paragraph for human authors to review. We evaluated this methodology through 5 case studies of existing manuscripts, including the revision of this manuscript.
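The prompt-generation step described above can be sketched roughly as follows. This is a minimal illustration only: the template wording, the metadata keys (`title`, `keywords`), and the function names are assumptions made for the example, not the prompts or API actually used in the Manubot workflow.

```python
from string import Template

# Hypothetical template; the real prompts are not given in the abstract.
PROMPT_TEMPLATE = Template(
    "You are revising the $section section of a manuscript titled "
    "'$title' with keywords: $keywords. "
    "Revise the following paragraph for clarity while preserving its meaning:\n\n"
    "$paragraph"
)


def build_prompt(metadata: dict, section: str, paragraph: str) -> str:
    """Fill the template with manuscript metadata and a single paragraph."""
    return PROMPT_TEMPLATE.substitute(
        section=section,
        title=metadata["title"],
        keywords=", ".join(metadata["keywords"]),
        paragraph=paragraph.strip(),
    )


def build_prompts(metadata: dict, section: str, text: str) -> list[str]:
    """Split section text on blank lines and build one prompt per paragraph."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [build_prompt(metadata, section, p) for p in paragraphs]
```

Each returned prompt would then be sent to a language model, and the revised paragraph handed back to the authors for review.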
RESULTS
Our results indicate that these models, despite some limitations, can grasp complex academic concepts and enhance text quality. All changes to the manuscript are tracked using a version control system, ensuring transparency in distinguishing between human- and machine-generated text.
CONCLUSIONS
Given the significant time researchers invest in crafting prose, incorporating large language models into the scholarly writing process can significantly improve the type of knowledge work performed by academics. Our approach also enables scholars to concentrate on critical aspects of their work, such as the novelty of their ideas, while automating tedious tasks like adhering to specific writing styles. Although the use of AI-assisted tools in scientific authoring is controversial, our approach, which focuses on revising human-written text and provides change-tracking transparency, can mitigate concerns regarding AI's role in scientific writing.
PubMed: 38879443
DOI: 10.1093/jamia/ocae139 -
JMIR AI Mar 2024
BACKGROUND
Large curated data sets are required to leverage speech-based tools in health care. These are costly to produce, resulting in increased interest in data sharing. As speech can potentially identify speakers (ie, voiceprints), sharing recordings raises privacy concerns. This is especially relevant when working with patient data protected under the Health Insurance Portability and Accountability Act.
OBJECTIVE
We aimed to determine the reidentification risk for speech recordings, without reference to demographics or metadata, in clinical data sets considering both the size of the search space (ie, the number of comparisons that must be considered when reidentifying) and the nature of the speech recording (ie, the type of speech task).
METHODS
Using a state-of-the-art speaker identification model, we modeled an adversarial attack scenario in which an adversary uses a large data set of identified speech (hereafter, the known set) to reidentify as many unknown speakers in a shared data set (hereafter, the unknown set) as possible. We first considered the effect of search space size by attempting reidentification with various sizes of known and unknown sets using VoxCeleb, a data set with recordings of natural, connected speech from >7000 healthy speakers. We then repeated these tests with different types of recordings in each set to examine whether the nature of a speech recording influences reidentification risk. For these tests, we used our clinical data set composed of recordings of elicited speech tasks from 941 speakers.
RESULTS
We found that the risk was inversely related to the number of comparisons an adversary must consider (ie, the search space), with a positive linear correlation between the number of false acceptances (FAs) and the number of comparisons (r=0.69; P<.001). The true acceptances (TAs) stayed relatively stable, and the ratio between FAs and TAs rose from 0.02 at 1 × 10 comparisons to 1.41 at 6 × 10 comparisons, with a near 1:1 ratio at the midpoint of 3 × 10 comparisons. In effect, risk was high for a small search space but dropped as the search space grew. We also found that the nature of a speech recording influenced reidentification risk, with nonconnected speech (eg, vowel prolongation: FA/TA=98.5; alternating motion rate: FA/TA=8) being harder to identify than connected speech (eg, sentence repetition: FA/TA=0.54) in cross-task conditions. The inverse was mostly true in within-task conditions, with the FA/TA ratio for vowel prolongation and alternating motion rate dropping to 0.39 and 1.17, respectively.
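The two quantities used in these results can be illustrated with a short sketch. It assumes that the search space is the product of the known-set and unknown-set sizes (every unknown recording scored against every known speaker), which the abstract does not state explicitly; the function names and the counts used in the usage note are hypothetical.

```python
def search_space(n_known: int, n_unknown: int) -> int:
    """Number of comparisons, assuming every unknown recording is
    scored against every known speaker (an assumption of this sketch)."""
    return n_known * n_unknown


def fa_ta_ratio(false_acceptances: int, true_acceptances: int) -> float:
    """Ratio of false to true acceptances; values above 1 mean an
    accepted match is more likely to be wrong than right."""
    if true_acceptances == 0:
        return float("inf")
    return false_acceptances / true_acceptances
```

For instance, with hypothetical counts of 141 false acceptances and 100 true acceptances, `fa_ta_ratio(141, 100)` gives a ratio above 1, the regime in which an adversary's accepted matches are mostly wrong.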
CONCLUSIONS
Our findings suggest that speaker identification models can be used to reidentify participants in specific circumstances, but in practice, the reidentification risk appears small. The variation in risk due to search space size and type of speech task provides actionable recommendations to further increase participant privacy and considerations for policy regarding public release of speech recordings.
PubMed: 38875581
DOI: 10.2196/52054 -
PLoS One 2024
Data curators play an important role in assessing data quality and take actions that may ultimately lead to better, more valuable data products. This study explores the curation practices of data curators working within US-based data repositories. We performed a survey in January 2021 to benchmark the levels of curation performed by repositories and assess the perceived value and impact of curation on the data sharing process. Our analysis included 95 responses from 59 unique data repositories. Respondents were primarily professionals working within repositories and examined curation performed within a repository setting. A majority (72.6%) of respondents reported that "data-level" curation was performed by their repository, and around half reported their repository took steps to ensure interoperability and reproducibility of their repository's datasets. Curation actions most frequently reported include checking for duplicate files, reviewing documentation, reviewing metadata, minting persistent identifiers, and checking for corrupt/broken files. The most "value-add" curation action across generalist, institutional, and disciplinary repository respondents was related to reviewing and enhancing documentation. Respondents reported high perceived impact of curation by their repositories on specific data sharing outcomes including usability, findability, understandability, and accessibility of deposited datasets; respondents associated with disciplinary repositories tended to perceive higher impact on most outcomes. Most survey participants strongly agreed that data curation by the repository adds value to the data sharing process and that it outweighs the effort and cost. We found some differences between institutional and disciplinary repositories, both in the reported frequency of specific curation actions and in the perceived impact of data curation.
Interestingly, we also found variation in the perceptions of those working within the same repository regarding the level and frequency of curation actions performed, which exemplifies the complexity of repository curation work. Our results suggest data curation may be better understood in terms of specific curation actions and outcomes than broadly defined curation levels and that more research is needed to understand the resource implications of performing these activities. We share these results to provide a more nuanced view of curation, and how curation impacts the broader data lifecycle and data sharing behaviors.
Topics: Humans; Data Curation; Surveys and Questionnaires; United States; Information Dissemination; Data Accuracy; Databases, Factual; Reproducibility of Results
PubMed: 38875230
DOI: 10.1371/journal.pone.0301171 -
Journal of Cosmetic Dermatology Jun 2024
BACKGROUND
Oral finasteride and topical minoxidil formulations are the only FDA-approved drug therapies for androgenetic alopecia (AGA). Research into dutasteride, topical finasteride, and nontopical minoxidil (low-dose oral and sublingual) formulations in the treatment of AGA has spiked within recent years. Early findings show that these alternative drug therapies may have similar to improved efficacy and safety profiles relative to the conventional treatment options.
AIMS
To conduct a bibliometric analysis that compares trends in publications on these alternative drug therapies, identifies key contributors, evaluates major findings from top-cited articles, and elucidates gaps in evidence.
METHODS
A search was conducted on the Web of Science database for publications on the use of alternative drug therapies in the treatment of AGA. A total of 95 publications, published between January 2003 and March 2024, and their citation metadata were included in the analysis.
RESULTS
Dutasteride showed the greatest number (n = 37) and longest history (20+ years) of publications, as well as the highest cumulative citations (n = 914); however, nontopical minoxidil showed a burst in research activity within the last 5 years (n = 33 publications since 2019). A relatively low number of randomized controlled trials (n = 3) for nontopical minoxidil suggests a need for higher-quality evidence.
CONCLUSIONS
Our analysis reveals major trends, contributors, and gaps in evidence for alternative drug therapies for AGA, which can help inform researchers on their future projects in this growing field of study. There is enthusiasm for exploring off-label formulations: nontopical forms of minoxidil (oral and sublingual), topical finasteride, and mesotherapy.
PubMed: 38873787
DOI: 10.1111/jocd.16427 -
Scientific Data Jun 2024
Facilitating data sharing in scientific research, especially in the domain of animal studies, holds immense value, particularly in mitigating distress and enhancing the efficiency of data collection. This study unveils a meticulously curated collection of neural activity data extracted from six electrophysiological datasets recorded from three parietal areas (V6A, PEc, PE) of two Macaca fascicularis during an instructed-delay foveated reaching task. This valuable resource is now accessible to the public, featuring spike timestamps, behavioural event timings and supplementary metadata, all presented alongside a comprehensive description of the encompassing structure. To enhance accessibility, data are stored as HDF5 files, a convenient format due to its flexible structure and the capability to attach diverse information to each hierarchical sub-level. To guarantee ready-to-use datasets, we also provide some MATLAB and Python code examples, enabling users to quickly familiarize themselves with the data structure.
Topics: Animals; Parietal Lobe; Macaca fascicularis
PubMed: 38871737
DOI: 10.1038/s41597-024-03479-7 -
Data in Brief Jun 2024
This paper presents the data (images, observations, metadata) of three different deployments of camera traps in the Amsterdam Water Supply Dunes, a Natura 2000 nature reserve in the coastal dunes of the Netherlands. The pilots were aimed at determining how different types of camera deployment (e.g. regular vs. wide lens, various heights, inside/outside exclosures) might influence species detections, and how to deploy autonomous wildlife monitoring networks. Two pilots were conducted in herbivore exclosures and mainly detected European rabbits (Oryctolagus cuniculus) and red fox (Vulpes vulpes). The third pilot was conducted outside exclosures, with the European fallow deer (Dama dama) being most prevalent. Across all three pilots, a total of 47,597 images were annotated using the Agouti platform. All annotations were verified and quality-checked by a human expert. A total of 2,779 observations of 20 different species (including humans) were recorded using 11 wildlife cameras during 2021-2023. The raw image files (excluding humans), image metadata, deployment metadata and observations from each pilot are shared using the Camtrap DP open standard and the extended data publishing capabilities of GBIF to increase the findability, accessibility, interoperability, and reusability of this data. The data are freely available and can be used for developing artificial intelligence (AI) algorithms that automatically detect and identify species from wildlife camera images.
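As a rough illustration of how observations shared in this way might be summarized, the sketch below tallies observations per species from a small in-memory CSV. The column names (`observationType`, `scientificName`) are assumed to follow the Camtrap DP observations table and should be checked against the published data package; the sample rows are invented for the example and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Invented sample rows; column names are assumed to match Camtrap DP.
SAMPLE_OBSERVATIONS = """observationID,deploymentID,observationType,scientificName,count
o1,d1,animal,Oryctolagus cuniculus,2
o2,d1,animal,Vulpes vulpes,1
o3,d2,animal,Dama dama,3
o4,d2,human,Homo sapiens,1
o5,d2,blank,,
"""


def tally_species(csv_text: str, exclude_humans: bool = True) -> Counter:
    """Count observation rows per scientific name, skipping blank rows."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["observationType"] == "blank" or not row["scientificName"]:
            continue
        if exclude_humans and row["observationType"] == "human":
            continue
        counts[row["scientificName"]] += 1
    return counts
```

Note that this counts observation records rather than individual animals; summing the `count` column instead would give the number of individuals.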
PubMed: 38868386
DOI: 10.1016/j.dib.2024.110544 -
IMeta Nov 2023
The framework of the MicroEXPERT platform. Our platform is composed of five modules. Data management module: Users upload raw data and metadata to the system using a guided workflow. Data processing module: Uploaded data are processed to generate taxonomical distribution and functional composition results. Metagenome-wide association studies (MWAS) module: Various methods, including biomarker analysis, PCA, co-occurrence networks, and sample classification, are employed using metadata. Data search module: Users can query nucleotide sequences to retrieve information in the MicroEXPERT database. Data visualization module: Visualization tools are used to illustrate the metagenome analysis results.
PubMed: 38868224
DOI: 10.1002/imt2.131 -
Rapid Communications in Mass... Jun 2024
RATIONALE
Glutamate carboxypeptidase II (GCPII) catalyzes the hydrolysis of N-acetylaspartylglutamate (NAAG) to yield glutamate (Glu) and N-acetylaspartate (NAA). Inhibition of GCPII has been shown to remediate the neurotoxicity of excess Glu in a variety of cell and animal disease models. A robust high-throughput liquid chromatography-tandem mass spectrometry (LC/MS/MS) method was needed to quantify GCPII enzymatic activity in a biochemical high-throughput screening assay.
METHODS
A dual-stream LC/MS/MS method was developed. Two parallel eluent streams ran identical HILIC gradient methods on BEH-Amide (2 × 30 mm) columns. Each LC channel was run independently, and the cycle time was 2 min per channel. Overall throughput was 1 min per sample for the dual-channel integrated system. Multiply injected acquisition files were split during data review, and batch metadata were automatically paired with raw data during the review process.
RESULTS
Two LC sorbents, BEH-Amide and Penta-HILIC, were tested to separate the NAAG cleavage product Glu from isobaric interference and ion suppressants in the bioassay matrix. Early elution of NAAG and NAA on BEH-Amide allowed interfering species to be diverted to waste. The limit of quantification was 0.1 pmol for Glu. The Z-factor of this assay averaged 0.85. Over 36 000 compounds were screened using this method.
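The throughput figure and the Z-factor quoted in this abstract can be reproduced in form with a short sketch. The Z-factor below follows the commonly used screening definition, 1 - 3(sd_pos + sd_neg) / |mean_pos - mean_neg|; whether the authors used sample or population standard deviations is not stated, so the use of the sample standard deviation here is an assumption, as are the function names.

```python
from statistics import mean, stdev


def minutes_per_sample(cycle_time_min: float, channels: int) -> float:
    """Effective time per sample when channels run staggered in parallel."""
    return cycle_time_min / channels


def z_factor(positive: list[float], negative: list[float]) -> float:
    """Z-factor from positive- and negative-control signals.

    Uses the sample standard deviation (an assumption of this sketch).
    """
    separation = abs(mean(positive) - mean(negative))
    if separation == 0:
        raise ValueError("control means are identical; Z-factor is undefined")
    return 1 - 3 * (stdev(positive) + stdev(negative)) / separation
```

With a 2 min cycle on each of two channels, `minutes_per_sample(2, 2)` gives the 1 min per sample stated in the methods. Values of the Z-factor above 0.5 are conventionally taken to indicate a robust assay, which is consistent with the reported average of 0.85.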
CONCLUSIONS
A fast gradient dual-stream LC/MS/MS method for Glu quantification in GCPII biochemical screening assay samples was developed and validated. HILIC separation chemistry offers robust performance and unique selectivity for targeted positive mode quantification of Glu, NAA, and NAAG.
PubMed: 38867136
DOI: 10.1002/rcm.9772 -
Scientific Data Jun 2024
In the social and behavioral sciences, surveys are frequently used to collect data. During the COVID-19 pandemic, surveys provided political actors and public health professionals with timely insights on the attitudes and behaviors of the general population. These insights were key in guiding actions to fight the pandemic. However, the data quality of these surveys remains unclear because systematic knowledge about how the survey data were collected during the COVID-19 pandemic is lacking. This is unfortunate, since decades of survey research have shown that survey design impacts data. Our Survey Data Collection and the COVID-19 Pandemic (SDCCP) project deals with this research gap. We collected rich metadata on survey design for 717 social and behavioral science surveys carried out in Germany during the first two years of the COVID-19 pandemic. In this data descriptor, we present a unique resource for a systematic assessment of the survey data collection practices and quality of surveys conducted in Germany during the COVID-19 pandemic.
Topics: COVID-19; Humans; Germany; Behavioral Sciences; Social Sciences; Surveys and Questionnaires; Pandemics; Data Collection
PubMed: 38866799
DOI: 10.1038/s41597-024-03475-x