
Introduction

The ohvbd package allows you to retrieve data from many different databases directly.

Currently these databases include the VecTraits and VecDyn projects from VectorByte, and GBIF.

Let’s imagine you wanted to figure out where there is trait data for a particular vector species - Aedes aegypti, for example. Here’s how you’d use ohvbd to do that:

df <- search_hub("Aedes aegypti", "vt") |>
  fetch() |>
  glean(
    cols = c("Interactor1Genus", "Interactor1Species", "Latitude", "Longitude"),
    returnunique = TRUE
  )
df

Great, lovely. But now… what does that all mean? And how can you build up your own data request? Read on to find out…

A note before we begin

For users who are familiar with base R pipes (|>), this approach is generally usable in ohvbd as well.

For users who are not familiar, pipes take the output of one command and feed it forward to the next command as the first argument:

# Find mean of a vector normally
x <- c(1, 2, 3)
mean(x)
## [1] 2
# Find mean of a vector using pipes
c(1, 2, 3) |> mean()
## [1] 2
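Because the piped value only fills the first argument, any further arguments can still be supplied inside the call. The following is a minimal base R sketch (it needs R 4.1 or later for the |> operator); the variable names are illustrative only:

```r
# A vector with a missing value
x <- c(1, 2, NA, 4)

# Nested style: read from the inside out
nested <- round(mean(x, na.rm = TRUE), 1)

# Piped style: the same computation, read left to right.
# x becomes the first argument of mean(); na.rm is passed as usual.
piped <- x |>
  mean(na.rm = TRUE) |>
  round(1)

identical(nested, piped)
## [1] TRUE
```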

For the rest of this vignette we will be using a piped-style approach.

Finding IDs

Datasets in VecTraits, VecDyn, and GBIF are organised by id.

You can search for ids related to a particular query using the vbdhub.org (aka “the hub”) search functionality via search_hub().

In this case let’s search the hub for Aedes aegypti, the “Yellow Fever mosquito”:

library(ohvbd)
aedes_results <- search_hub("Aedes aegypti")
summary(aedes_results)
## Rows: 143, Query: Aedes aegypti
## 
## Split by database:
## gbif   px   vd   vt 
##   20   42   10   71

You can see here that there are 20 GBIF datasets, 10 VecDyn datasets, and 71 VecTraits datasets, along with 42 results listed under px.

However right now we only have the ids of the data, not the data themselves. In order to get that, we must fetch the data from a given database.

Filtering dbs

Before fetching data, we must extract only the ids relevant to our database from our search.

GBIF, VecTraits, and VecDyn do not share ids, so if you attempted to use VT ids to retrieve data from another database you would (at best) get garbage.

Filtering database results from searches can be performed using the filter_db() command:

aedes_vt <- filter_db(aedes_results, "vt")
aedes_vt
## <ohvbd.ids>
## Database: vt
##  [1]  474  475  148  578  144  142  126  556  169  580  577  285  287  357  863
## [16]  865  149  473  476  565  573  576  555  841  842  356  146  864  170  214
## [31]  579  147  143  355  359  354  564  574  575  124  125  346  553  554  853
## [46]  825  826  286  901  145  906  358  892  893  854  855  911  828  860  910
## [61]  572  557  558  571 1506 1510 1511 1512 1507 1508 1509

If you only searched the hub for one database, by default search_hub() will automatically perform the filter_db() operation for you!

search_hub("Aedes aegypti", db = "vt")
## <ohvbd.ids>
## Database: vt
##  [1]  474  475  148  578  144  142  126  556  169  580  577  285  287  357  863
## [16]  865  149  473  476  565  573  576  555  841  842  356  146  864  170  214
## [31]  579  147  143  355  359  354  564  574  575  124  125  346  553  554  853
## [46]  825  826  286  901  145  906  358  892  893  854  855  911  828  860  910
## [61]  572  557  558  571 1506 1510 1511 1512 1507 1508 1509

Getting data

Now that you have a vector of dataset ids from VecTraits, you need to actually retrieve the data for these datasets through the API.

To do this we can use the fetch() function. In this case let’s get the first 5 Aedes aegypti datasets:

aedes_vt <- aedes_vt |> head(5)
aedes_responses <- aedes_vt |> fetch()
aedes_responses[[1]]
## <httr2_response>
## GET https://vectorbyte.crc.nd.edu/portal/api/vectraits-dataset/474/?format=json
## Status: 200 OK
## Content-Type: application/json
## Body: In memory (52292 bytes)

The fetch() function returns a list of the data in the form of the original httr2 responses. These are useful if you want to know specifics about how the server sent data back, but for most use cases it is more useful to extract the data into a dataframe.

Specific fetch functions

fetch() retrieves data from the appropriate database for your data. Under the hood it farms out this work to the fetch_x() series of functions.

You can use these yourself for extra peace of mind, though we do not really recommend it.

So the above code could also be written as:

aedes_responses <- aedes_vt |> fetch_vt()

Extracting data

Now that we have a list of responses, we can extract the relevant data from them using the glean() function.

aedes_data <- aedes_responses |> glean()
cat("Data dimensions: ", ncol(aedes_data), " cols x ", nrow(aedes_data), " rows")
## Data dimensions:  157  cols x  725  rows

This dataset is a bit too large to print here, and often you may only want a few columns of data rather than the whole dataset.

Fortunately the cols argument allows us to filter this data out easily. We can also use the argument returnunique to instruct ohvbd to return only unique rows.

So let’s get just the unique locations and trait name combinations from our data using the same command as before:

aedes_data_filtered <- aedes_responses |>
  glean(
    cols = c("Location", "OriginalTraitName"),
    returnunique = TRUE
  )
aedes_data_filtered
## <ohvbd.data.frame>
## Database: vt
##   DatasetID                 Location OriginalTraitName
## 1       474           Marilia Brazil    fecundity rate
## 2       475           Marilia Brazil         longevity
## 3       148 Unversity of Georgia USA  glycogen content
## 4       578       Fort Myers Florida  development time
## 5       144 Unversity of Georgia USA     lipid content
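Conceptually, choosing columns and keeping only unique rows resembles the following base R operation. This is a sketch on a made-up data frame, not a description of how glean() is implemented; the toy values and the Value column are assumptions for illustration:

```r
# Toy stand-in for extracted data (values are invented for illustration)
toy <- data.frame(
  DatasetID = c(1, 1, 1, 2, 2),
  Location = c("Site A", "Site A", "Site A", "Site B", "Site B"),
  OriginalTraitName = c(
    "longevity", "longevity", "fecundity rate", "longevity", "longevity"
  ),
  Value = c(10.2, 11.5, 3.1, 9.8, 9.9)
)

# Keep only the columns of interest, then drop duplicated rows
cols <- c("DatasetID", "Location", "OriginalTraitName")
toy_filtered <- unique(toy[, cols])

nrow(toy_filtered)
## [1] 3
```

Five measurement rows collapse to three distinct dataset, location, and trait combinations once the Value column is dropped.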

Specific glean functions

Like fetch(), glean() has database-specific variants:

aedes_data <- aedes_responses |> glean_vt()

Putting it all together

In day-to-day use, you will mostly find yourself using all these functions together to create small pipelines.

A typical pipeline would likely only contain a few lines of code:

df <- search_hub("Aedes aegypti") |>
  filter_db("vt") |>
  head(5) |>
  fetch() |>
  glean(
    cols = c("Interactor1Genus", "Interactor1Species", "Latitude", "Longitude"),
    returnunique = TRUE
  )
head(df)
## <ohvbd.data.frame>
## Database: vt
##   DatasetID Interactor1Genus Interactor1Species  Latitude Longitude
## 1       474            Aedes            aegypti -22.21389 -49.94583
## 2       475            Aedes            aegypti -22.21389 -49.94583
## 3       148            Aedes            aegypti        NA        NA
## 4       578            Aedes            aegypti  26.61667 -81.83333
## 5       144            Aedes            aegypti        NA        NA

A similar pipeline taking advantage of the autofiltering in search_hub() might look like this:

df <- search_hub("Aedes aegypti", db = "vt") |>
  head(5) |>
  fetch() |>
  glean(
    cols = c("Interactor1Genus", "Interactor1Species", "Latitude", "Longitude"),
    returnunique = TRUE
  )

Smart searching of VectorByte databases

One subtlety of VectorByte (in particular) concerns field collisions, where a search term matches a field other than the one you had in mind.

Let’s imagine that we are looking for traits of whitefly species (Bemisia). We can just construct a query to investigate this as follows:

df <- search_hub("Bemisia", "vt") |>
  head(6) |>
  fetch() |>
  glean(
    cols = c(
      "DatasetID",
      "Interactor1Genus",
      "Interactor1Species",
      "Interactor2Genus",
      "Interactor2Species"
      )
    )

We would expect this to return traits of Bemisia spp. However, when we look at the Interactor1Genus column we see something a touch odd:

unique(df$Interactor1Genus)
## [1] "Axinoscymnus" "Bemisia"

Axinoscymnus is a ladybird, but why is it appearing? Let’s look at only rows containing Axinoscymnus:

df |> dplyr::filter(Interactor1Genus == "Axinoscymnus") |> head()
## <ohvbd.data.frame>
## Database: vt
##   DatasetID Interactor1Genus Interactor1Species Interactor2Genus
## 1       160     Axinoscymnus         cardilobus          Bemisia
## 2       160     Axinoscymnus         cardilobus          Bemisia
## 3       160     Axinoscymnus         cardilobus          Bemisia
## 4       160     Axinoscymnus         cardilobus          Bemisia
## 5       160     Axinoscymnus         cardilobus          Bemisia
## 6       160     Axinoscymnus         cardilobus          Bemisia
##   Interactor2Species
## 1             tabaci
## 2             tabaci
## 3             tabaci
## 4             tabaci
## 5             tabaci
## 6             tabaci

In this scenario, Bemisia is present in the dataset, but it is as the “target” rather than the animal that the trait is referring to.

As such we might want to be more specific about precisely which data to retrieve. Enter the search_x_smart() family of functions.

These allow you to construct a more specific search.

So let’s construct the search we originally intended, this time using the smart search.

df <- search_vt_smart("Interactor1Genus", "contains", "Bemisia") |>
  head(6) |>
  fetch() |>
  glean(
    cols = c(
      "DatasetID",
      "Interactor1Genus",
      "Interactor1Species",
      "Interactor2Genus",
      "Interactor2Species"
      )
    )
unique(df$Interactor1Genus)
## [1] "Bemisia"

Here we have made sure to only search the Interactor1Genus column. As such we have only gotten back Bemisia traits!

This same sort of collision is particularly common in the “Citation” column, where papers may mention multiple trait names.
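To make the idea of a field collision concrete, here is a small base R sketch on a made-up table. It does not call the hub or any ohvbd function; the data frame and the matching logic are assumptions chosen purely to contrast "match in any field" with "match in one named field":

```r
# Toy records (invented for illustration)
toy <- data.frame(
  DatasetID = c(1, 2, 3),
  Interactor1Genus = c("Bemisia", "Axinoscymnus", "Aedes"),
  Interactor2Genus = c(NA, "Bemisia", NA)
)

# Broad search: a row matches if ANY text field mentions the term
text_cols <- c("Interactor1Genus", "Interactor2Genus")
any_field <- apply(
  toy[, text_cols], 1,
  function(row) any(grepl("Bemisia", row, fixed = TRUE))
)

# Targeted search: only the field we actually care about
one_field <- grepl("Bemisia", toy$Interactor1Genus, fixed = TRUE)

toy$DatasetID[any_field]
## [1] 1 2
toy$DatasetID[one_field]
## [1] 1
```

Dataset 2 is picked up by the broad search only because Bemisia appears as the second interactor, which mirrors the Axinoscymnus result above.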

The search_x_smart() functions have many different operators and columns that they can work upon. For full details, run ?search_vt_smart or ?search_vd_smart in your console.

In general it is always worthwhile inspecting the data you retrieve to make sure that your query returned the data that you thought it did.
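One lightweight way to do such an inspection is to tabulate the column you searched on and flag anything unexpected. The sketch below uses a made-up data frame in place of real ohvbd output, and the names df_check and expected_genus are illustrative:

```r
# Toy stand-in for a retrieved data frame (invented for illustration)
df_check <- data.frame(
  Interactor1Genus = c("Bemisia", "Bemisia", "Axinoscymnus"),
  Interactor1Species = c("tabaci", "tabaci", "cardilobus")
)

# Count rows per genus to spot surprises at a glance
table(df_check$Interactor1Genus)

# Flag any genus other than the one we asked for
expected_genus <- "Bemisia"
unexpected <- setdiff(unique(df_check$Interactor1Genus), expected_genus)
if (length(unexpected) > 0) {
  message("Unexpected genera found: ", paste(unexpected, collapse = ", "))
}
```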

Further steps

From here you now have all the data you might need for further analysis, so now it’s down to you!

One final note to end on: it can be advisable to save any output data in a CSV or parquet format so that you do not need to re-download it every time you run your script. This is as easy as running write.csv() on your dataframe, then reading it back in later with read.csv().
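A simple caching pattern along these lines might look like the sketch below. The helpers download_data() and load_or_download() are hypothetical: download_data() returns a small hard-coded data frame here and stands in for a real search, fetch, and glean pipeline, and the file is written to a temporary directory only so the example cleans up after itself:

```r
# Where to keep the cached copy (a temporary location for this example)
cache_path <- file.path(tempdir(), "ohvbd_cache_example.csv")

# Hypothetical helper: stands in for a real ohvbd pipeline
download_data <- function() {
  data.frame(
    DatasetID = c(474, 578),
    Latitude = c(-22.21389, 26.61667),
    Longitude = c(-49.94583, -81.83333)
  )
}

# Hypothetical helper: read the cache if present, otherwise download and save
load_or_download <- function(path) {
  if (file.exists(path)) {
    return(read.csv(path))
  }
  df <- download_data()
  # row.names = FALSE avoids an extra index column when reading back in
  write.csv(df, path, row.names = FALSE)
  df
}

first_run <- load_or_download(cache_path)   # downloads and writes the file
second_run <- load_or_download(cache_path)  # reads the cached file instead
all.equal(first_run, second_run)
```

In a real script you would point cache_path at a stable location in your project and delete the file whenever you want to refresh the data.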
