Retrieving data from VectorByte • ohvbd

Introduction

The ohvbd package allows you to retrieve data from many different databases directly.

Currently these databases include the VecTraits and VecDyn projects from VectorByte.

Let’s imagine you wanted to figure out where a particular vector species - Aedes aegypti, for example - was found. Here’s how you’d use ohvbd to do that:

df <- search_vt("Aedes aegypti") %>%
  fetch_vt() %>%
  extract_vt(
    cols = c("Interactor1Genus", "Interactor1Species", "Latitude", "Longitude"),
    returnunique = TRUE
  )
df

Great, lovely. But now… what does that all mean? And how can you build up your own data request? Read on to find out…

Some notes before we begin

If you are using Linux, sometimes httr2 fails to communicate with the VectorByte servers. As such on these systems it is best to set ohvbd to use slightly less stringent identity checks in its calls to the server.

You do this using the command set_ohvbd_compat() however this is only required once per R session/script.

For users who are familiar with tidyverse pipes (%>%), this approach is generally usable in ohvbd as well.

For users who are not familiar, pipes take the output of one command and feed it forward to the next command as the first argument:

# Find mean of a vector normally
x <- c(1, 2, 3)
mean(x)

## [1] 2

# Find mean of a vector using pipes
c(1, 2, 3) %>% mean()

## [1] 2

For the rest of this vignette we will be using a piped-style approach.

Another note before we begin: the functions used to interface with VecTraits and VecDyn are basically the same, except for a vt or vd in the function name. So search_vt() is the VecTraits equivalent of search_vd(). We will use VecTraits functions in this vignette, but the process is the same for VecDyn.

There are also generic versions of the fetch_x() and extract_x() functions (called fetch() and extract()), which can be used in the place of the db-specific functions.

Finding IDs

Datasets in VecTraits and VecDyn are organised by id. As such it is often good to be able to see which IDs are currently available. We can do this using the find_x_ids() family of functions. These return a vector of valid IDs on the given database:

library(ohvbd)
find_vt_ids()[1:20]

## Database: vt
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

The functions search_x() and search_x_smart() also return a list of IDs that match the query provided in the arguments.

In this case let’s search VecTraits for Aedes aegypti, the “Yellow Fever mosquito”:

aedes_ids <- search_vt("Aedes aegypti")
aedes_ids

## Database: vt
##  [1]  97  98 124 125 126 142 143 144 145 146 147 148 149 169 170 214 285 286 287
## [20] 346 354 355 356 357 358 359 473 474 475 476 552 553 554 555 556 557 558 564
## [39] 565 571 572 573 574 575 576 577 578 579 580

This again returns a vector of IDs, this time corresponding to any dataset with Aedes aegypti in one of the searchable columns.

You can also find IDs interactively online by using the VecTraits Explorer should you so choose.

Getting data

Now you have a vector of datasets, we need to actually retrieve the data of these datasets through the API.

To do this we can use the fetch_x() functions. In this case let’s get the first 5 Aedes aegypti datasets:

aedes_responses <- fetch_vt(aedes_ids[1:5])
aedes_responses[[1]]

## <httr2_response>
## GET https://vectorbyte.crc.nd.edu/portal/api/vectraits-dataset/97/?format=json
## Status: 200 OK
## Content-Type: application/json
## Body: In memory (26362 bytes)

The fetch_vt() function returns a list of the data in the form of the original httr2 responses. These are useful if you want to know specifics about how the server sent data back, but for most usecases it is more useful to extract the data into a dataframe.

Extracting data

Now we have a list of responses, we can extract the relevant data from them using the extract_x() functions.

aedes_data <- aedes_responses %>% extract_vt()
cat("Data dimensions: ", ncol(aedes_data), " cols x ", nrow(aedes_data), " rows")

## Data dimensions:  157  cols x  21  rows

This dataset is a bit too large to print here, and often you may only want a few columns of data rather than the whole dataset.

Fortunately the cols argument allows us to filter this data out easily. We can also use the argument returnunique to instruct ohvbd to return only unique rows.

So let’s get just the unique locations and trait name combinations from our data using the same command as before:

aedes_data_filtered <- aedes_responses %>%
  extract_vt(
    cols = c("Location", "OriginalTraitName"),
    returnunique = TRUE
  )
aedes_data_filtered

## Database: vt
##                        Location      OriginalTraitName
## 1          University of Malaya              body size
## 2          University of Malaya        blood meal size
## 3 Guadeloupe French West Indies              longevity
## 4 Guadeloupe French West Indies          blood feeding
## 5 Guadeloupe French West Indies transmission potential

Chunked downloading

In some cases you may end up downloading a lot of data. This can cause problems if you do not have enough memory on your computer to store this data in memory all at once, even if you just want one or two columns, like before,

One potential method for mitigating this is to use the fetch_extract_x_chunked() functions. These only download a few records at a time and extract them before moving on to the next chunk of ids. We can perform the filtered extraction from above like so:

fetch_extract_vt_chunked(
  aedes_ids[1:5],
  chunksize = 2,
  cols = c("Location", "OriginalTraitName"),
  returnunique = TRUE
)

## Database: vt
##                        Location      OriginalTraitName
## 1          University of Malaya              body size
## 2          University of Malaya        blood meal size
## 3 Guadeloupe French West Indies              longevity
## 4 Guadeloupe French West Indies          blood feeding
## 5 Guadeloupe French West Indies transmission potential

Putting it all together

In day-to-day use, you will mostly find yourself using all these functions together to create small pipelines.

A typical pipeline would likely only contain a few lines of code:

df <- search_vt("Aedes aegypti") %>%
  fetch_vt() %>%
  extract_vt(
    cols = c("Interactor1Genus", "Interactor1Species", "Latitude", "Longitude"),
    returnunique = TRUE
  )
head(df)

Smart searching

One subtlety of the VectorByte search interfaces is to do with name collisions.

Let’s imagine that we are looking for traits of whitefly species (Bemisia). We can just construct a query to investigate this as follows:

df <- search_vt("Bemisia")[1:6] %>%
  fetch_vt() %>%
  extract_vt(cols = c("DatasetID", "Interactor1Genus", "Interactor1Species", "Interactor2Genus", "Interactor2Species"))

Now we would expect this to be traits of Bemisia spp, however when we look at the Interactor1Genus column we see something a touch odd:

unique(df$Interactor1Genus)

## [1] "Bemisia"      "Axinoscymnus"

Axinoscymnus is a ladybird, but why is it appearing? Let’s look at only rows containing Axinoscymnus:

df %>% dplyr::filter(Interactor1Genus == "Axinoscymnus") %>% head()

## Database: vt
##   DatasetID Interactor1Genus Interactor1Species Interactor2Genus
## 1       158     Axinoscymnus         cardilobus          Bemisia
## 2       158     Axinoscymnus         cardilobus          Bemisia
## 3       158     Axinoscymnus         cardilobus          Bemisia
## 4       158     Axinoscymnus         cardilobus          Bemisia
## 5       158     Axinoscymnus         cardilobus          Bemisia
## 6       158     Axinoscymnus         cardilobus          Bemisia
##   Interactor2Species
## 1             tabaci
## 2             tabaci
## 3             tabaci
## 4             tabaci
## 5             tabaci
## 6             tabaci

In this scenario, Bemisia is present in the dataset, but it is as the “target” rather than the animal that the trait is referring to.

As such we might want to be more specific about precisely which data to retrieve. Enter the search_x_smart() family of functions.

These allow you to construct a more specific search.

So let’s construct the same search as we were wanting to do before, but with the smart search.

df <- search_vt_smart("Interactor1Genus", "contains", "Bemisia")[1:6] %>%
  fetch_vt() %>%
  extract_vt(cols = c("DatasetID", "Interactor1Genus", "Interactor1Species", "Interactor2Genus", "Interactor2Species"))
unique(df$Interactor1Genus)

## [1] "Bemisia"

Here we have made sure to only search the Interactor1Genus column. As such we have only gotten back Bemisia traits!

This same sort of collision is particularly common in the “Citation” column, where papers may mention multiple trait names.

The search_x_smart() functions have many different operators and columns that they can work upon. For full details, run ?search_vt_smart or ?search_vd_smart in your console.

In general it is always worthwhile inspecting the data you retrieve from VectorByte to make sure that your query returned the data that you thought it did.

Futher steps

From here you now have all the data you might need for further analysis, so now it’s down to you!

One final note to end on: it can be advisable to save any output data in a csv or parquet format so that you do not need to re-download it every time you run your script. This is as easy as running write.csv() on your dataframe, then reading it in later with read.csv()

Built in 20.4294441s