-
Bioinformatics (Oxford, England) Nov 2022
Meta-Analysis
MOTIVATION
The volume of public nucleotide sequence data has blossomed over the past two decades and is ripe for re- and meta-analyses to enable novel discoveries. However, reproducible re-use and management of sequence datasets and associated metadata remain critical challenges. We created the open source Python package q2-fondue to enable user-friendly acquisition, re-use and management of public sequence (meta)data while adhering to open data principles.
RESULTS
q2-fondue allows fully provenance-tracked programmatic access to and management of data from the NCBI Sequence Read Archive (SRA). Unlike other packages allowing download of sequence data from the SRA, q2-fondue enables full data provenance tracking from data download to final visualization, integrates with the QIIME 2 ecosystem, prevents data loss upon space exhaustion and allows download of (meta)data given a publication library. To highlight its manifold capabilities, we present executable demonstrations using publicly available amplicon, whole genome and metagenome datasets.
AVAILABILITY AND IMPLEMENTATION
q2-fondue is available as an open-source BSD-3-licensed Python package at https://github.com/bokulich-lab/q2-fondue. Usage tutorials are available in the same repository. All Jupyter notebooks used in this article are available under https://github.com/bokulich-lab/q2-fondue-examples.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Software; Base Sequence; Ecosystem; Metadata; Metagenome
PubMed: 36130056
DOI: 10.1093/bioinformatics/btac639 -
Molecular Ecology Sep 2022
Although it is becoming widely appreciated that microbes can enhance plant tolerance to environmental stress, the nature of microbial mediation of exposure responses is not well understood. We addressed this deficit by examining whether microbial mediation of plant responses to elevated salinity is contingent on the environment and factors intrinsic to the host. We evaluated the influence of contrasting environmental conditions relative to host genotype, provenance and evolution by conducting a common-garden experiment utilizing ancestral and descendant cohorts of Schoenoplectus americanus genotypes recovered from two 100+ year coastal marsh seed banks. We compared S. americanus productivity and trait variation as well as associated endophytic microbial communities according to plant genotype, provenance, and age cohort under high and low salinity stress with and without native soil inoculation. The magnitude and direction of microbial mediation of S. americanus responses to elevated salinity varied according to individual genotype, provenance, as well as temporal shifts in genotypic variation and G × E (gene by environment) interactions. Relationships differed between plant traits and the structure of endosphere communities. Our findings indicate that plant-microbe associations and microbial mediation of plant stress are not only context-dependent but also dynamic. Our results additionally suggest that evolution can shape the fate of marsh ecosystems by altering how microbes confer plant tolerance to pressures linked to global change.
Topics: Genotype; Humans; Microbiota; Salinity; Salt Stress; Wetlands
PubMed: 35792676
DOI: 10.1111/mec.16603 -
Frontiers in Big Data 2022
Review
Data lakes are a fundamental building block for many industrial data analysis solutions and are becoming increasingly popular in research. Often associated with big data use cases, data lakes are, for example, used as central data management systems of research institutions or as the core entity of machine learning pipelines. The basic underlying idea of retaining data in its native format within a data lake facilitates a large range of use cases and improves data reusability, especially when compared to the schema-on-write approach applied in data warehouses, where data is transformed prior to the actual storage to fit a predefined schema. Storing such massive amounts of raw data, however, has its own challenges, ranging from general data modeling and indexing for concise querying to the integration of suitable and scalable compute capabilities. In this contribution, influential papers of the last decade have been selected to provide a comprehensive overview of developments and obtained results. The papers are analyzed with regard to the applicability of their input to data lakes that serve as central data management systems of research institutions. To achieve this, contributions to data lake architectures, metadata models, data provenance, workflow support, and FAIR principles are investigated. Last, but not least, these capabilities are mapped onto the requirements of two common research personae to identify open challenges. With that, potential research topics are determined that must be tackled before data lakes can serve as central building blocks for research data management.
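The contrast between schema-on-write and retaining data in its native format can be illustrated with a minimal sketch. The records, field names, and helper functions below are hypothetical and are not taken from the reviewed papers; the sketch only shows that transforming data to a fixed schema before storage discards fields that a raw store would still hold.

```python
import json

# Hypothetical raw records, as they might arrive from two different instruments.
raw_records = [
    '{"sample": "A1", "temp_c": "21.5", "operator": "kim"}',
    '{"sample": "A2", "temperature": 22.1, "notes": "recalibrated"}',
]


def to_warehouse_row(raw):
    """Schema-on-write: coerce a record to a fixed schema before storing it."""
    record = json.loads(raw)
    temp = record.get("temp_c", record.get("temperature"))
    return {"sample": record["sample"], "temp_c": float(temp)}


# Warehouse style: only the fields of the predefined schema survive.
warehouse = [to_warehouse_row(r) for r in raw_records]

# Lake style: the raw text is stored untouched and interpreted at query time.
lake = list(raw_records)


def query_notes(stored):
    """Schema-on-read: parse at query time and pull out an optional field."""
    return [json.loads(r).get("notes") for r in stored]


print(warehouse)
print(query_notes(lake))
```

In this sketch the `notes` field can still be queried from the raw store, while the fixed-schema rows no longer contain it.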
PubMed: 36072823
DOI: 10.3389/fdata.2022.945720 -
GigaScience Nov 2019
BACKGROUND
The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms.
RESULTS
Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups.
CONCLUSIONS
The underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.
Topics: Genomics; Humans; Models, Theoretical; Software; Workflow
PubMed: 31675414
DOI: 10.1093/gigascience/giz095 -
PloS One 2022
In 2016, a Stollhof-type copper hoard was found during an excavation in Magyaregres, Hungary. It was placed in a cooking pot and deposited upside down within the boundaries of an Early Copper Age settlement. Similar hoards dating to the end of the 5th millennium BCE are well known from Central Europe; however, this hoard represents the only one so far with thoroughly documented finding circumstances. The hoard contained 681 pieces of copper, 264 pieces of stone and a single Spondylus bead, along with 19 pieces of small tubular spiral copper coils, three spiral copper bracelets, and two large, spectacle spiral copper pendants. Until now, information on the provenance of raw materials and how such copper artefacts were manufactured has not been available. The artefacts were studied under optical microscopes to reveal the manufacturing process. Trace elemental composition (HR-ICP-MS) and lead isotope ratios (MC-ICP-MS) were measured to explore the provenance of raw materials. The ornaments were rolled or folded and coiled from thin sheets of copper using fahlore copper probably originating from the Northwestern Carpathians. A complex archaeological approach was employed to reveal the provenance, distribution and the social roles the ornaments could have played in the life of a Copper Age community. Evidence for local metallurgy was lacking in contemporaneous Transdanubian sites; therefore, it is likely that the items of the hoard were manufactured closer to the raw material source, prior to being transported to Transdanubia as finished products. The method of deposition implies that such items were associated with special social contexts, represented exceptional values, and the context of deposition was also highly prescribed. The Magyaregres hoard serves as the first firm piece of evidence for the existence of a typologically independent Central European metallurgical circle which exploited the raw material sources located within its distribution.
Topics: Hungary; Technology; Archaeology; Artifacts; Metallurgy
PubMed: 36417420
DOI: 10.1371/journal.pone.0278116 -
Neuroinformatics Jul 2022
Sharing various neuroimaging digital resources has received widespread attention in FAIR (Findable, Accessible, Interoperable and Reusable) neuroscience. In order to support a comprehensive understanding of brain cognition, neuroimaging provenance should be constructed to characterize both research processes and results, and to integrate various digital resources for quick replication and open cooperation. This brings new challenges to neuroimaging text mining, including fragmented information, lack of labelled corpora, and vague topics. This paper proposes a text mining pipeline for enabling the FAIR neuroimaging study. In order to avoid fragmented information, the Brain Informatics provenance model is redesigned based on NIDM (Neuroimaging Data Model) and FAIR facets. It can systematically capture the provenance requests from the FAIR neuroimaging study and then transform them into a group of text mining tasks. A neuroimaging text mining pipeline combining deep adversarial learning with interaction-based topic modeling, called the neuroimaging interaction topic model (Neuroimaging-ITM), is proposed to automatically extract neuroimaging provenance and identify research topics in the few-shot scenario. Finally, a group of experiments is completed by using real data from the journal PloS One. The experimental results show that Neuroimaging-ITM can systematically and accurately extract provenance information and obtain high-quality research topics from the full text of neuroimaging articles. Most of the mean F1 values of provenance extraction exceed 0.9. The topic coherence and KL (Kullback-Leibler) divergence reach 9.95 and 0.96, respectively. The results are clearly better than those of the baseline methods.
Topics: Data Mining; Neuroimaging; Neurosciences
PubMed: 35235184
DOI: 10.1007/s12021-022-09571-w -
IEEE Computer Graphics and Applications 2019
Visual analytics tools integrate provenance recording to externalize analytic processes or user insights. Provenance can be captured at varying levels of detail, and in turn activities can be characterized at different granularities. However, current approaches do not support inferring activities that can only be characterized across multiple levels of provenance. We propose a task abstraction framework that consists of a three-stage approach, composed of 1) initializing a provenance task hierarchy, 2) parsing the provenance hierarchy by using an abstraction mapping mechanism, and 3) leveraging the task hierarchy in an analytical tool. Furthermore, we identify implications to accommodate iterative refinement, context, variability, and uncertainty during all stages of the framework. We describe a use case which exemplifies our abstraction framework, demonstrating how context can influence the provenance hierarchy to support analysis. The article concludes with an agenda, raising and discussing challenges that need to be considered for successfully implementing such a framework.
PubMed: 31603814
DOI: 10.1109/MCG.2019.2945720 -
F1000Research 2021
A knowledge graph (KG) publishes a machine-readable representation of knowledge on the Web. Structured data in the knowledge graph is published using the Resource Description Framework (RDF), where knowledge is represented as a triple (subject, predicate, object). Due to the presence of erroneous, outdated or conflicting data in the knowledge graph, the quality of facts cannot be guaranteed. Trustworthiness of facts in a knowledge graph can be enhanced by the addition of metadata such as the source of information and the location and time of the fact occurrence. Since RDF does not support metadata for providing provenance and contextualization, an alternate method, RDF reification, is employed by most knowledge graphs. RDF reification increases the magnitude of data, as several statements are required to represent a single fact. Another limitation for applications that use provenance data, such as those in the medical domain and in cyber security, is that not all facts in these knowledge graphs are annotated with provenance data. In this paper, we have provided an overview of prominent reification approaches together with an analysis of the popular, general knowledge graphs Wikidata and YAGO4 with regard to the representation of provenance and context data. Wikidata employs qualifiers to attach metadata to facts, while YAGO4 collects metadata from Wikidata qualifiers. However, facts in Wikidata and YAGO4 can be fetched without using reification to cater for applications that do not require metadata. To the best of our knowledge, this is the first paper that investigates the method and the extent of metadata covered by two prominent KGs, Wikidata and YAGO4.
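The claim that reification needs several statements per fact can be illustrated with a minimal sketch of standard RDF reification. Triples are modeled here as plain Python tuples, and all `ex:` identifiers, the source URL, and the `reify` helper are hypothetical examples rather than data or code from the paper; this is not how Wikidata or YAGO4 store their metadata.

```python
# Illustrative sketch only: triples are plain tuples, identifiers are made up.

def reify(statement_id, triple, metadata):
    """Describe `triple` using the standard RDF reification vocabulary.

    Four triples (rdf:type, rdf:subject, rdf:predicate, rdf:object) identify
    the statement; each metadata item then adds one more triple about it.
    """
    subject, predicate, obj = triple
    reified = [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]
    reified.extend((statement_id, key, value) for key, value in metadata.items())
    return reified


# One fact, expressed as a single (subject, predicate, object) triple.
fact = ("ex:Berlin", "ex:capitalOf", "ex:Germany")

# Hypothetical provenance metadata to attach to that fact.
provenance = {
    "ex:source": "https://example.org/atlas",
    "ex:retrievedOn": "2021-01-01",
}

statements = reify("ex:stmt1", fact, provenance)
for s in statements:
    print(s)
print(len(statements))  # 4 reification triples + 2 metadata triples
```

Under these assumptions, a single fact with two metadata items grows from one triple to six, which is the overhead the abstract refers to.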
Topics: Empirical Research; Metadata; Pattern Recognition, Automated; Research Design
PubMed: 34900233
DOI: 10.12688/f1000research.72843.2 -
FEMS Microbiology Ecology Nov 2022
Plant-soil interactions can be important drivers of biological invasions. In particular, the symbiotic relationship between legumes and nitrogen-fixing soil bacteria (i.e. rhizobia) may be influential in invasion success. Legumes, including Australian acacias, have been introduced into novel ranges around the world. Our goal was to examine the acacia-rhizobia symbiosis to determine whether cointroduction of non-native mutualists plays a role in invasiveness of introduced legumes. To determine whether acacias were introduced abroad concurrently with native symbionts, we selected four species introduced to California (two invasive and two noninvasive in the region) and identified rhizobial strains associating with each species in their native and novel ranges. We amplified three genes to examine phylogenetic placement (16S rRNA) and provenance (nifD and nodC) of rhizobia associating with acacias in California and Australia. We found that all Acacia species, regardless of invasive status, are associating with rhizobia of Australian origin in their introduced ranges, indicating that concurrent acacia-rhizobia introductions have occurred for all species tested. Our results suggest that cointroduction of rhizobial symbionts may be involved in the establishment of non-native acacias in their introduced ranges, but does not contribute to the differential invasiveness of Acacia species introduced abroad.
Topics: Rhizobium; Acacia; Phylogeny; RNA, Ribosomal, 16S; Australia; Fabaceae; Nitrogen-Fixing Bacteria; California; Soil
PubMed: 36396354
DOI: 10.1093/femsec/fiac138 -
Environmental Pollution (Barking, Essex... Aug 2020
Rare earth elements (REEs) are widely used in optoelectronic industries, and they can be emitted into the environment and may induce biological effects. In this study, we investigated the provenance and bioaccessibility of REEs in atmospheric particles (APs) collected from areas impacted by the optoelectronic industry. The geoaccumulation index (Igeo) values showed that Y, Eu, and Tb were much more enriched in the APs from the optoelectronic recycling sites than in those from the optoelectronic producing sites and were not enriched in the APs from the optoelectronic administrative sites and background sites. The characteristic parameters and the distribution patterns of REEs demonstrated that the AP samples from the recycling sites and producing sites showed remarkably positive Eu and Tb anomalies. According to the positive matrix factorization (PMF) model, the optoelectronic industry was quantitatively determined to contribute 82.8% of Y, 86.5% of Eu, and 83.4% of Tb. Furthermore, an in vitro physiologically based extraction test (PBET) was performed to assess the bioaccessibility of REEs in the APs. The results showed that the bioaccessibility of all the REEs in the APs was below 50.0% in the human gastrointestinal tract, with higher values in the gastric phases than in the intestinal phases. In particular, extremely low gastric bioaccessibilities of Tb and Ce and relatively high gastric bioaccessibilities of Y and Eu were observed in the APs from the recycling sites and producing sites, which may be due to the chemical composition of the compounds containing REEs that are used in the optoelectronic industry. In conclusion, our results provide additional information about the contribution and influence of the optoelectronic industry on the provenance and bioaccessibility of REEs in APs.
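The abstract does not state how the geoaccumulation index was computed. A minimal sketch, assuming the commonly used definition I_geo = log2(C_n / (1.5 × B_n)), where C_n is the measured concentration and B_n the geochemical background value, is given below. The function name and the numeric inputs are hypothetical and are not data from the study.

```python
import math


def geoaccumulation_index(concentration, background):
    """Return I_geo = log2(C_n / (1.5 * B_n)).

    Assumes the commonly used definition of the index; the factor 1.5 is the
    conventional allowance for natural variation in the background value.
    Both inputs must be positive and expressed in the same units.
    """
    if concentration <= 0 or background <= 0:
        raise ValueError("concentration and background must be positive")
    return math.log2(concentration / (1.5 * background))


# Hypothetical values (same units): measured 60.0 against a background of 20.0.
print(geoaccumulation_index(60.0, 20.0))  # 60 / (1.5 * 20) = 2, and log2(2) = 1.0
```

Under this definition, values at or below zero are usually read as no enrichment relative to background, and larger values as stronger enrichment.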
Topics: Humans; Metals, Rare Earth
PubMed: 32244157
DOI: 10.1016/j.envpol.2020.114349