Introduction
The ohvbd package allows you to retrieve data from many different databases directly.
Currently these databases include the VecTraits and VecDyn projects from VectorByte.
Let’s imagine you wanted to figure out where a particular vector species (Aedes aegypti, for example) was found. Here’s how you’d use ohvbd to do that:
df <- search_vt("Aedes aegypti") %>%
  fetch_vt() %>%
  extract_vt(
    cols = c("Interactor1Genus", "Interactor1Species", "Latitude", "Longitude"),
    returnunique = TRUE
  )
df
Great, lovely. But now… what does that all mean? And how can you build up your own data request? Read on to find out…
Some notes before we begin
If you are using Linux, httr2 sometimes fails to communicate with the VectorByte servers. On these systems it is best to set ohvbd to use slightly less stringent identity checks in its calls to the server.
You do this with the command set_ohvbd_compat(), which only needs to be run once per R session or script.
For users who are familiar with tidyverse pipes (%>%), this approach is generally usable in ohvbd as well.
For users who are not familiar, pipes take the output of one command and feed it forward to the next command as the first argument:
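A minimal sketch of this equivalence, assuming the magrittr package (which provides %>%) is installed; the use of sqrt() here is purely illustrative. Both calls compute the same value:

```r
library(magrittr)  # provides the %>% pipe (assumed to be installed)

# A regular function call
sqrt(4)

# The same call written with a pipe: 4 is passed to sqrt() as its first argument
4 %>% sqrt()
```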
## [1] 2
## [1] 2
For the rest of this vignette we will be using a piped-style approach.
Another note before we begin: the functions used to interface with VecTraits and VecDyn are basically the same, except for a vt or vd in the function name. So search_vt() is the VecTraits equivalent of search_vd(). We will use VecTraits functions in this vignette, but the process is the same for VecDyn.
There are also generic versions of the fetch_x() and extract_x() functions (called fetch() and extract()), which can be used in place of the db-specific functions.
Finding IDs
Datasets in VecTraits and VecDyn are organised by ID. As such it is often useful to see which IDs are currently available. We can do this using the find_x_ids() family of functions. These return a vector of valid IDs on the given database:
library(ohvbd)
find_vt_ids()[1:20]
## Database: vt
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
The functions search_x() and search_x_smart() also return a vector of IDs, in this case those that match the query provided in the arguments.
In this case let’s search VecTraits for Aedes aegypti, the “Yellow Fever mosquito”:
aedes_ids <- search_vt("Aedes aegypti")
aedes_ids
## Database: vt
## [1] 97 98 124 125 126 142 143 144 145 146 147 148 149 169 170 214 285 286 287
## [20] 346 354 355 356 357 358 359 473 474 475 476 552 553 554 555 556 557 558 564
## [39] 565 571 572 573 574 575 576 577 578 579 580
This again returns a vector of IDs, this time corresponding to any dataset with Aedes aegypti in one of the searchable columns.
You can also find IDs interactively online by using the VecTraits Explorer should you so choose.
Getting data
Now that you have a vector of dataset IDs, you need to retrieve the data for these datasets through the API.
To do this we can use the fetch_x() functions. In this case let’s get the first 5 Aedes aegypti datasets:
aedes_responses <- fetch_vt(aedes_ids[1:5])
aedes_responses[[1]]
## <httr2_response>
## GET https://vectorbyte.crc.nd.edu/portal/api/vectraits-dataset/97/?format=json
## Status: 200 OK
## Content-Type: application/json
## Body: In memory (26362 bytes)
The fetch_vt() function returns a list containing the original httr2 responses. These are useful if you want to know specifics about how the server sent data back, but for most use cases it is more useful to extract the data into a dataframe.
Extracting data
Now that we have a list of responses, we can extract the relevant data from them using the extract_x() functions.
aedes_data <- aedes_responses %>% extract_vt()
cat("Data dimensions: ", ncol(aedes_data), " cols x ", nrow(aedes_data), " rows")
## Data dimensions: 157 cols x 21 rows
This dataset is a bit too large to print here, and often you may only want a few columns of data rather than the whole dataset.
Fortunately the cols argument lets us keep only the columns we need. We can also use the returnunique argument to instruct ohvbd to return only unique rows.
So let’s get just the unique combinations of location and trait name from our data using the same command as before:
aedes_data_filtered <- aedes_responses %>%
  extract_vt(
    cols = c("Location", "OriginalTraitName"),
    returnunique = TRUE
  )
aedes_data_filtered
## Database: vt
##                        Location      OriginalTraitName
## 1          University of Malaya              body size
## 2          University of Malaya        blood meal size
## 3 Guadeloupe French West Indies              longevity
## 4 Guadeloupe French West Indies          blood feeding
## 5 Guadeloupe French West Indies transmission potential
Chunked downloading
In some cases you may end up downloading a lot of data. This can cause problems if your computer does not have enough memory to hold all of this data at once, even if you only want one or two columns, as before.
One potential method for mitigating this is to use the fetch_extract_x_chunked() functions. These only download a few records at a time and extract them before moving on to the next chunk of IDs. We can perform the filtered extraction from above like so:
fetch_extract_vt_chunked(
  aedes_ids[1:5],
  chunksize = 2,
  cols = c("Location", "OriginalTraitName"),
  returnunique = TRUE
)
## Database: vt
##                        Location      OriginalTraitName
## 1          University of Malaya              body size
## 2          University of Malaya        blood meal size
## 3 Guadeloupe French West Indies              longevity
## 4 Guadeloupe French West Indies          blood feeding
## 5 Guadeloupe French West Indies transmission potential
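The idea behind chunking can be sketched in base R. This illustrates the general technique only, not ohvbd's actual implementation; process_chunk() is a hypothetical stand-in for fetching and extracting one chunk of IDs:

```r
# First five IDs returned by the Aedes aegypti search above
ids <- c(97, 98, 124, 125, 126)
chunksize <- 2

# Assign each ID to a chunk number (1, 1, 2, 2, 3) and split on it
chunks <- split(ids, ceiling(seq_along(ids) / chunksize))

# Hypothetical stand-in for "fetch and extract one chunk"; a real version
# would download only these IDs and keep only the requested columns
process_chunk <- function(chunk_ids) {
  data.frame(id = chunk_ids)
}

# Process one chunk at a time, then combine the small extracted pieces
pieces <- lapply(chunks, process_chunk)
combined <- unique(do.call(rbind, pieces))
```

With a chunksize of 2, the five IDs are handled as three chunks of sizes 2, 2 and 1, so only a couple of raw responses need to be held at any one time.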
Putting it all together
In day-to-day use, you will mostly find yourself using all these functions together to create small pipelines.
A typical pipeline would likely only contain a few lines of code:
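As a sketch, the search, fetch and extract steps from the previous sections can be chained as shown here. The wrapper function unique_vt_columns() and its max_datasets argument are hypothetical names introduced for illustration, and the code assumes that the ohvbd and magrittr packages are installed and that the VectorByte servers are reachable when the function is called:

```r
library(magrittr)  # provides %>% (assumed to be installed)

# Hypothetical helper, not part of ohvbd: search VecTraits, fetch the first
# few matching datasets, and return the unique rows of the chosen columns
unique_vt_columns <- function(query, cols, max_datasets = 5) {
  ids <- ohvbd::search_vt(query)
  n <- min(max_datasets, length(ids))

  ids[seq_len(n)] %>%
    ohvbd::fetch_vt() %>%
    ohvbd::extract_vt(cols = cols, returnunique = TRUE)
}

# Usage (needs network access, so it is left commented out here):
# unique_vt_columns("Aedes aegypti", c("Location", "OriginalTraitName"))
```

Defining the helper does not contact the server; the download only happens when the function is called.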
Smart searching
One subtlety of the VectorByte search interfaces concerns name collisions.
Let’s imagine that we are looking for traits of whitefly species (Bemisia). We can just construct a query to investigate this as follows:
df <- search_vt("Bemisia")[1:6] %>%
  fetch_vt() %>%
  extract_vt(cols = c("DatasetID", "Interactor1Genus", "Interactor1Species", "Interactor2Genus", "Interactor2Species"))
We would expect these to be traits of Bemisia spp.; however, when we look at the Interactor1Genus column we see something a touch odd:
unique(df$Interactor1Genus)
## [1] "Bemisia" "Axinoscymnus"
Axinoscymnus is a ladybird, but why is it appearing? Let’s look at only rows containing Axinoscymnus:
## Database: vt
##   DatasetID Interactor1Genus Interactor1Species Interactor2Genus
## 1       158     Axinoscymnus         cardilobus          Bemisia
## 2       158     Axinoscymnus         cardilobus          Bemisia
## 3       158     Axinoscymnus         cardilobus          Bemisia
## 4       158     Axinoscymnus         cardilobus          Bemisia
## 5       158     Axinoscymnus         cardilobus          Bemisia
## 6       158     Axinoscymnus         cardilobus          Bemisia
##   Interactor2Species
## 1             tabaci
## 2             tabaci
## 3             tabaci
## 4             tabaci
## 5             tabaci
## 6             tabaci
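A filtered view like this can be produced with ordinary data frame subsetting. The sketch below uses a small hand-made data frame (toy, whose rows are invented and only mimic some of the column names above) so that it runs without a network connection; the same subsetting expression would apply to df:

```r
# Toy stand-in for the extracted data; these rows are made up for illustration
toy <- data.frame(
  DatasetID = c(1, 2, 2),
  Interactor1Genus = c("Bemisia", "Axinoscymnus", "Axinoscymnus"),
  Interactor2Genus = c(NA, "Bemisia", "Bemisia")
)

# Keep only the rows whose Interactor1Genus is Axinoscymnus
axino_rows <- toy[toy$Interactor1Genus == "Axinoscymnus", ]
axino_rows
```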
In this scenario, Bemisia is present in the dataset, but as the “target” rather than as the animal that the trait refers to.
As such we might want to be more specific about precisely which data to retrieve. Enter the search_x_smart() family of functions.
These allow you to construct a more specific search.
So let’s run the search we originally intended, this time using the smart search.
df <- search_vt_smart("Interactor1Genus", "contains", "Bemisia")[1:6] %>%
  fetch_vt() %>%
  extract_vt(cols = c("DatasetID", "Interactor1Genus", "Interactor1Species", "Interactor2Genus", "Interactor2Species"))
unique(df$Interactor1Genus)
## [1] "Bemisia"
Here we have made sure to only search the Interactor1Genus column. As such we have only gotten back Bemisia traits!
This same sort of collision is particularly common in the “Citation” column, where papers may mention multiple trait names.
The search_x_smart() functions have many different operators and columns that they can work upon. For full details, run ?search_vt_smart or ?search_vd_smart in your console.
In general it is always worthwhile inspecting the data you retrieve from VectorByte to make sure that your query returned the data that you thought it did.
Further steps
From here you now have all the data you might need for further analysis, so now it’s down to you!
One final note to end on: it can be advisable to save any output data in a csv or parquet format so that you do not need to re-download it every time you run your script. This is as easy as running write.csv() on your dataframe, then reading it in later with read.csv().
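A minimal sketch of this caching pattern in base R. The data frame here is a small stand-in for the output of extract_vt(), and the file name aedes_data.csv is a hypothetical choice; a temporary directory is used so the example is safe to run anywhere:

```r
# Stand-in data frame; in practice this would be the output of extract_vt()
df <- data.frame(
  Location = c("University of Malaya", "Guadeloupe French West Indies"),
  OriginalTraitName = c("body size", "longevity")
)

# Hypothetical cache location; in a real project this would be a path in your project folder
cache_path <- file.path(tempdir(), "aedes_data.csv")

# Only write the file if it does not exist yet; in a real script the
# download and extraction steps would also live inside this block
if (!file.exists(cache_path)) {
  write.csv(df, cache_path, row.names = FALSE)
}

# Later runs can read the saved copy instead of re-downloading
df_cached <- read.csv(cache_path)
```

Setting row.names = FALSE stops write.csv() from adding an extra column of row numbers, so the file reads back with the same columns that were saved.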