Synthesising Arena- and Hantavirus data from rodents to understand current known host distributions and viral pathogens.
Rodent
Ecology
Zoonosis
Arenaviridae
Hantaviridae
Open Data
Authors
Affiliations
David Simons
The Royal Veterinary College
Steph Seifert
Washington State University
Published
January 27, 2023
Abstract
Current host-pathogen association datasets provide synthesised information on hosts and their pathogens but do not contain temporal or geographic information. These resources often link to the publications reporting an association, but details such as accession numbers of archived sequences, the number of individuals tested and measures of prevalence among sampled populations are not immediately retrievable. Using these resources for inference beyond host-pathogen associations is therefore limited. Here, we aim to produce a database of host-pathogen associations for two viral families found in small mammals that contain several known zoonoses, namely Arenaviridae and Hantaviridae. This database can be used to explore the distribution of small mammal hosts of known and suspected pathogens and the extent to which they have been sampled. Further, linkage to sequence data of known zoonoses will support analysis of the risk of viral reassortment between viral species within geographically co-located host species.
ONGOING WORK
Project aims
Review the literature to produce a synthesised dataset on Arenaviridae and Hantaviridae among small mammals
Implement visualisation tools to explore sampling extent for hosts and pathogens to further understanding of host-pathogen associations
Investigate the relative contributions of ecological, geographic, and genomic factors leading to cross-species transmission and reassortment in rodent-associated arenaviruses and hantaviruses
Method
Search strategy
An initial search was run on NCBI PubMed on 2023-01-06. The search lines and the number of citations returned by each were:
1. rodent* - 165,513
2. arenavir* - 2,367
3. hantavir*.mp. - 4,022
4. 2 OR 3 - 6,308
5. 1 AND 4 - 1,842
Citations were downloaded as a text file and imported into R for processing. Deduplication by PubMed ID resulted in 1,821 distinct citations.
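The deduplication step can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the project, and it assumes the citations were exported in PubMed's MEDLINE-style text format, where each record contains a line beginning `PMID-` and records are separated by blank lines. The function names and the example PubMed IDs are hypothetical.

```python
import re

# Assumption: MEDLINE-style export, one "PMID- <digits>" line per record.
PMID_LINE = re.compile(r"^PMID\s*-\s*(\d+)\s*$")


def split_records(text):
    """Split an export into records separated by one or more blank lines."""
    return [block for block in re.split(r"\n\s*\n", text.strip()) if block.strip()]


def pubmed_id(record):
    """Return the PubMed ID of a record as a string, or None if it has none."""
    for line in record.splitlines():
        match = PMID_LINE.match(line)
        if match:
            return match.group(1)
    return None


def deduplicate(records):
    """Keep the first record for each PubMed ID; records without an ID are kept."""
    seen = set()
    unique = []
    for record in records:
        pmid = pubmed_id(record)
        if pmid is not None:
            if pmid in seen:
                continue
            seen.add(pmid)
        unique.append(record)
    return unique


# Made-up records for illustration only; these are not real citations.
example_export = (
    "PMID- 111\nTI  - First title\n\n"
    "PMID- 222\nTI  - Second title\n\n"
    "PMID- 111\nTI  - First title\n"
)
unique_records = deduplicate(split_records(example_export))
```

Keeping the first occurrence of each ID preserves the order in which citations were returned by the search.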
An initial search of the returned citations was conducted to identify nine studies to trial data extraction.
Data extraction
Included studies
Information about each included study is extracted into a descriptive sheet.
Table 1: Study information extraction sheet
| Column name | Description |
| --- | --- |
| study_id | A unique identifier for the included study |
| pubmed_id | The PubMed ID of the included study (if available) |
| DOI | The digital object identifier of the study (if available). If a DOI is not available, a weblink to a persistent page for the reference will be included |
| first_author_surname | The first author's surname |
| title | The title of the manuscript, report or book section |
| journal | The name of the journal, report or book |
| year | The year of publication |
| study_design | A free-text entry succinctly describing the study design for the rodents or pathogens |
| sampling_effort | A free-text entry capturing sampling effort, ideally in trap nights for rodent studies |
| data_access | Whether study data are available in complete form or only as summarised data |
| linked_manuscripts | The DOI or weblink of other studies that include the same dataset, either in its entirety or as a subset. This will be used to attempt to de-duplicate data |
Rodent sampling
Rodent data are extracted into a rodent sheet. This includes the timing of data collection, the sampling location, and the small mammal species detected, recorded either as detection/non-detection or as the number of individuals detected. Where a study does not report a species at a location but reports it at other locations, the species will be entered as not detected at that location.
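The non-detection rule can be illustrated with a short sketch. It is written in Python for illustration (the project itself processes data in R) and assumes each extracted record is a dictionary with `location`, `genus`, `species` and `number` keys. The function name, village names and counts are hypothetical.

```python
def add_non_detections(records):
    """Add a zero-count record for every species reported somewhere in a
    study but not reported at a given location of that study."""
    species_seen = []   # (genus, species) pairs, in order of first appearance
    locations = []      # locations, in order of first appearance
    observed = {}       # (location, (genus, species)) -> original record
    for record in records:
        key = (record["genus"], record["species"])
        if key not in species_seen:
            species_seen.append(key)
        if record["location"] not in locations:
            locations.append(record["location"])
        observed[(record["location"], key)] = record

    completed = []
    for location in locations:
        for genus, species in species_seen:
            record = observed.get((location, (genus, species)))
            if record is None:
                # Species detected elsewhere in the study: enter as not detected.
                record = {"location": location, "genus": genus,
                          "species": species, "number": 0}
            completed.append(record)
    return completed


# Invented counts from a single hypothetical study, for illustration only.
study_records = [
    {"location": "Village A", "genus": "Mastomys", "species": "natalensis", "number": 12},
    {"location": "Village A", "genus": "Rattus", "species": "rattus", "number": 5},
    {"location": "Village B", "genus": "Mastomys", "species": "natalensis", "number": 3},
]
completed_records = add_non_detections(study_records)
```

Here Rattus rattus is added at Village B with a count of 0. The sketch should be applied to one study at a time, since a non-detection is only implied within the study that reported the species elsewhere.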
Table 2: Rodent information extraction sheet
| Column name | Description |
| --- | --- |
| rodent_record_id | A unique identifier for a rodent species at a specific location or timepoint, depending on the level of aggregation reported in the study |
| study_id | A unique identifier linking the record to the descriptive sheet entry for that study |
| date | The period in which rodent samples were collected, extracted at the highest temporal resolution provided |
| genus | The genus of the small mammal as reported in the study |
| species | The species of the small mammal as reported in the study |
| location | The location of sampling effort. Depending on how data are presented in the study, this will match the coordinates given for trapping effort, e.g. if trapping is aggregated at village level, village names will be used |
| country | The country where trapping occurred. For multinational studies where numbers are not disaggregated by country, all countries will be included |
| habitat_type | High-level habitat type, recorded at the scale at which trapping is reported |
| coordinate_resolution | A description of the level at which coordinates are provided in the study, e.g. aggregated at study site or study region |
| latitude | Latitude, converted from the coordinates presented to EPSG:4326 |
| longitude | Longitude, converted from the coordinates presented to EPSG:4326 |
| number | The number of detected individuals. For capture-mark-recapture studies, the number of distinct individuals will be entered. For studies not explicitly reporting non-detection, a value of 0 will be entered for a species or genus if it is detected elsewhere in the study |
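EPSG:4326 stores positions as WGS 84 latitude and longitude in decimal degrees. Many studies report coordinates in degrees, minutes and seconds, so one common conversion can be sketched as below. This is an illustrative Python sketch, not the project's R code; the function name is hypothetical, and it assumes the reported coordinates already refer to the WGS 84 datum. Coordinates given in a projected system (for example UTM) would need a proper reprojection with a dedicated library instead.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed
    decimal degrees (south and west are negative)."""
    hemisphere = hemisphere.upper()
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    if degrees < 0 or not 0 <= minutes < 60 or not 0 <= seconds < 60:
        raise ValueError("degrees must be non-negative; minutes and seconds in [0, 60)")

    value = degrees + minutes / 60 + seconds / 3600

    # Latitude is bounded by 90 degrees, longitude by 180 degrees.
    limit = 90 if hemisphere in {"N", "S"} else 180
    if value > limit:
        raise ValueError(f"{value} exceeds the valid range for hemisphere {hemisphere}")

    return -value if hemisphere in {"S", "W"} else value


# Hypothetical sampling site, for illustration only.
latitude = dms_to_decimal(8, 30, 0, "N")     # 8.5
longitude = dms_to_decimal(11, 15, 0, "W")   # -11.25
```

Rejecting out-of-range values at this stage helps catch transposed latitude and longitude during extraction.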
Pathogen sampling
Pathogen assays are extracted into the pathogen sheet. This includes the host from which the sample originated, the pathogen family and species being assayed for, and the assay method. For studies conducting multiple assays on the same samples for different pathogens, an additional record will be added for each assay. Similarly, if both antibody and direct detection assays for the same pathogen are performed on the same samples, additional records will be added.
Table 3: Pathogen information extraction sheet
| Column name | Description |
| --- | --- |
| pathogen_record_id | A unique identifier for the group of samples from the same rodent species, at a specific location or timepoint, tested for the same pathogen using the same method |
| study_id | A unique identifier linking the record to the descriptive sheet entry for that study |
| date | The period in which rodent samples were collected, extracted at the highest temporal resolution provided |
| host_genus | The genus of the small mammal from which the sample originated, as reported in the study |
| host_species | The species of the small mammal from which the sample originated, as reported in the study |
| location | The location of sampling effort. Depending on how data are presented in the study, this will match the coordinates given for trapping effort, e.g. if trapping is aggregated at village level, village names will be used |
| country | The country where trapping occurred. For multinational studies where numbers are not disaggregated by country, all countries will be included |
| habitat_type | High-level habitat type, recorded at the scale at which trapping is reported |
| coordinate_resolution | A description of the level at which coordinates are provided in the study, e.g. aggregated at study site or study region |
| latitude | Latitude, converted from the coordinates presented to EPSG:4326 |
| longitude | Longitude, converted from the coordinates presented to EPSG:4326 |
| pathogen_family | The family of virus being assayed for (i.e. Arenaviridae or Hantaviridae) |
| pathogen_species | The species of virus being assayed for, if a specific test is used. For assays unable to differentiate between multiple viral species, "Multiple" will be entered |
| detection_method | Whether the assay detects antibody, detects the virus directly (e.g. PCR), or uses another method |
| number_tested | The number of distinct samples tested |
| number_negative | The number of reported negative samples |
| number_positive | The number of reported positive samples |
| number_inconclusive | The number of samples with inconclusive results |
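The count columns allow a crude prevalence to be derived for each pathogen record. The sketch below is an illustration in Python, not the project's R code, and it rests on two assumptions that the extraction sheet does not state: that positive, negative and inconclusive counts should sum to the number tested, and that prevalence is taken as positives divided by all samples tested. The function name and the example counts are hypothetical.

```python
def summarise_assay(number_tested, number_positive,
                    number_negative=None, number_inconclusive=0):
    """Return the proportion positive and a consistency flag for one record."""
    if number_tested <= 0:
        raise ValueError("number_tested must be positive")
    if not 0 <= number_positive <= number_tested:
        raise ValueError("number_positive must lie between 0 and number_tested")

    # Assumption: the three outcome counts partition the tested samples.
    # If the study did not report negatives, consistency cannot be checked.
    counts_consistent = None
    if number_negative is not None:
        total = number_positive + number_negative + number_inconclusive
        counts_consistent = total == number_tested

    return {
        # Assumption: inconclusive samples stay in the denominator.
        "prevalence": number_positive / number_tested,
        "counts_consistent": counts_consistent,
    }


# Invented counts, for illustration only.
summary = summarise_assay(number_tested=40, number_positive=4,
                          number_negative=35, number_inconclusive=1)
```

Records flagged as inconsistent would be candidates for re-checking against the source publication rather than for automatic correction.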
Pathogen sequences
If studies include linkage to complete or partial sequences of viruses archived in NCBI, they will be linked through the pathogen_sequences sheet.
Table 4: Pathogen sequences extraction sheet
| Column name | Description |
| --- | --- |
| sequence_record_id | A unique identifier for the sequence record |
| study_id | A unique identifier linking the record to the descriptive sheet entry for that study |
| host_genus | The genus of the small mammal from which the sample originated, as reported in the study |
| host_species | The species of the small mammal from which the sample originated, as reported in the study |
| pathogen_species | The species of the pathogen |
| accession_number | The accession number for each record archived by the study |
Zoonosis status
Finally, an additional sheet, known_zoonoses, will be produced containing all of the viral species sampled and whether they are known to cause disease in humans.
Table 5: Zoonosis status extraction sheet
| Column name | Description |
| --- | --- |
| pathogen_id | A unique identifier for the pathogen species |
| pathogen_family | The family of the pathogen |
| pathogen_species | The species of the pathogen |
| known_zoonosis | A logical value indicating whether the pathogen is known to cause disease in humans |
| disease_name | The name of the disease caused by the viral species. Multiple diseases are entered separated by commas |
| icd_10 | The ICD-10 associated name of the disease (if known) |
| disease_reference | A DOI of a publication that supports the known_zoonosis statement |
Data processing
Raw data will be downloaded from Google Sheets using the googledrive API in R, with date-stamped files stored locally. Data will then be imported into R for processing, cleaning and formatting to produce a dataset suitable for further analysis.
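One way to organise the date-stamped local files is sketched below. It is a Python illustration only and does not reproduce the googledrive download itself, which the project performs in R. The file-naming pattern (`YYYY-MM-DD_<sheet>.csv`), the CSV format and both function names are assumptions made for this example.

```python
from datetime import date
from pathlib import Path


def stamped_path(directory, sheet_name, on=None):
    """Build the local path for a sheet snapshot taken on a given date.
    Assumed naming pattern: <directory>/YYYY-MM-DD_<sheet_name>.csv"""
    on = on or date.today()
    return Path(directory) / f"{on.isoformat()}_{sheet_name}.csv"


def latest_snapshot(directory, sheet_name):
    """Return the most recent snapshot of a sheet, or None if there is none.
    ISO 8601 dates sort chronologically when sorted as text."""
    pattern = f"????-??-??_{sheet_name}.csv"
    candidates = sorted(Path(directory).glob(pattern))
    return candidates[-1] if candidates else None


# Example: where a snapshot of the rodent sheet from 2023-01-06 would be stored.
example_path = stamped_path("data_raw", "rodent", date(2023, 1, 6))
```

Keeping every dated snapshot, rather than overwriting a single file, makes it possible to trace how the extracted data changed between analyses.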
Outputs
Data visualisation
An R Shiny web application is being developed to visualise the database and support future analysis. The source code of the app is available from the GitHub repository, and the app is currently hosted on shinyapps.io.
Rodent sampling
An example of a map produced from the rodent sampling data in the initial 9 studies is shown below.
An interactive map displaying the locations of small mammal species detected in included studies. Selecting points will expand the data points at those coordinates. Point colour indicates small mammal genus, and point size varies with the number of individuals detected. Hovering over a point or selecting it will show the species name, the number detected and the time period surveyed at the location. Data are currently shown for 9 studies.
Pathogen detection
A similar map can be produced for pathogen detection, with separate layers for acute infection (i.e. PCR-positive samples) and evidence of prior infection (i.e. antibody-positive samples).
An interactive map showing the locations of pathogen sampling. Selecting clustered points will expand the data. Each circle represents a single host species-pathogen combination at that sampling location. Circle size reflects the number of samples tested. Selecting a point will display information about the host-pathogen association, including the number of samples tested and the number of positive samples. A white border around the circle designates host-pathogen associations with at least one positive result. The colour of the circle indicates the pathogen species (where known). Two layers are available, for viral detection and antibody detection.
An interactive plot displaying observed host-pathogen associations. Facets are produced for antibody-based assays and viral detection assays. Rodent species are listed in alphabetical order on the y-axis; detected pathogens are displayed in alphabetical order on the x-axis. Purple tiles represent host-pathogen associations that were not observed for that detection type; yellow tiles represent observed host-pathogen associations. The strength of the colour (the alpha) is scaled to the number of assays performed for that host-pathogen pair. A black line surrounding a tile indicates that the pathogen is a known zoonosis. Hovering over the relevant tile will show the number of assays and the number of positive samples.
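The data behind each tile of the association plot can be derived from the pathogen sheet and the known_zoonoses sheet. The sketch below is a Python illustration under stated assumptions, not the app's R code: records are dictionaries keyed by the column names in Table 3, `number_tested` is treated as the number of assays, and a pair counts as observed when at least one sample is positive. The function name and the counts are hypothetical.

```python
from collections import defaultdict


def association_tiles(pathogen_records, zoonotic_species):
    """Aggregate pathogen records into one tile per detection method,
    host species and pathogen species."""
    totals = defaultdict(lambda: {"number_tested": 0, "number_positive": 0})
    for record in pathogen_records:
        host = f'{record["host_genus"]} {record["host_species"]}'
        key = (record["detection_method"], host, record["pathogen_species"])
        totals[key]["number_tested"] += record["number_tested"]
        totals[key]["number_positive"] += record["number_positive"]

    tiles = []
    # Sorting gives alphabetical hosts and pathogens within each facet.
    for (method, host, pathogen), counts in sorted(totals.items()):
        tiles.append({
            "detection_method": method,
            "host": host,
            "pathogen": pathogen,
            "number_tested": counts["number_tested"],       # drives tile alpha
            "number_positive": counts["number_positive"],
            "observed": counts["number_positive"] > 0,      # yellow vs purple
            "known_zoonosis": pathogen in zoonotic_species, # black outline
        })
    return tiles


# Invented counts for a well-documented host-virus pair, for illustration only.
example_records = [
    {"detection_method": "antibody", "host_genus": "Mastomys",
     "host_species": "natalensis", "pathogen_species": "Lassa mammarenavirus",
     "number_tested": 30, "number_positive": 3},
    {"detection_method": "antibody", "host_genus": "Mastomys",
     "host_species": "natalensis", "pathogen_species": "Lassa mammarenavirus",
     "number_tested": 20, "number_positive": 0},
    {"detection_method": "pcr", "host_genus": "Rattus",
     "host_species": "rattus", "pathogen_species": "Lassa mammarenavirus",
     "number_tested": 10, "number_positive": 0},
]
tiles = association_tiles(example_records, {"Lassa mammarenavirus"})
```

Aggregating before plotting means a pair tested in several studies appears as one tile whose transparency reflects the combined testing effort.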
Citation
BibTeX citation:
@online{simons2023,
author = {Simons, David and Seifert, Steph},
title = {Synthesising {Arena-} and {Hantavirus} Data from Rodents to
Understand Current Known Host Distributions and Viral Pathogens.},
date = {2023-01-27},
url = {https://www.dsimons.org/others/arena_hanta.html},
langid = {en},
abstract = {Current host-pathogen association datasets provide
synthesised information on hosts and their pathogens but do not
contain temporal or geographic information. These resources often
provide linking information to publications reporting the
association but information including accession numbers of archived
sequences, number of individuals tested and measures of prevalence
among sampled populations are not immediately retrievable. Using
these resources for inference beyond host-pathogen associations is
therefore limited. Here, we aim to produce a database of
host-pathogen associations for two viral families of small mammals
which contain several known zoonoses, namely Arenaviridae and
Hantaviridae. This database can be used to explore the distribution
of small mammal hosts of known and suspected pathogens and the
extent to which they have been sampled. Further, linkage to sequence
data of known zoonoses will support analysis of the risk of viral
reassortment between viral species within geographically co-located
host species.}
}
For attribution, please cite this work as:
Simons, David, and Steph Seifert. 2023. “Synthesising Arena- and
Hantavirus Data from Rodents to Understand Current Known Host
Distributions and Viral Pathogens.” January 27, 2023. https://www.dsimons.org/others/arena_hanta.html.