Sourcing climatic data from AREAdata

Introduction

Often, when presented with a set of population dynamics data, we may wish to investigate the abiotic conditions (such as temperature or humidity) under which these dynamics occurred.

Ideally this sort of data would have been collected at the same time as the dynamics data. However, when data are collected for one project and then repurposed for another, this is often not possible.

In such scenarios it is possible to link the location of the data with other measurements of abiotic variables in the area.

These can be retrieved from raster files provided by projects such as the ERA5 data store from the Copernicus project. However, this often means downloading a large amount of data for a small number of datapoints.

AREAdata

An alternative method for gathering this data is through the AREAdata project, which provides historical and forecast data aggregated at different spatial scales.

Let’s take a look at the format of data retrieved from AREAdata (we will deconstruct this command properly in a bit):

library(ohvbd)

ad_df <- fetch_ad(metric = "temp", gid = 0, use_cache = TRUE)
print(ad_df)
#> Areadata matrix for temp at gid level 0 .
#> Cached: FALSE 
#> Dates: 2020-01-01 -> 2023-12-31 (1461)
#> Locations: 256

So that is quite a lot of data! At 374016 entries (256 locations × 1461 dates), it is far too much to print here. If you do want to print all the data rather than just a summary, you can add the argument full = TRUE to the print command:

print(ad_df, full = TRUE)

AREAdata outputs are formatted as matrices. Here the rows correspond to locations (identified by GADM codes):

head(rownames(ad_df))
#> [1] "Aruba"       "Afghanistan" "Angola"      "Anguilla"    "Åland"      
#> [6] "Albania"

Meanwhile the column names correspond to the day that the data refers to:

head(colnames(ad_df))
#> [1] "2020-01-01" "2020-01-02" "2020-01-03" "2020-01-04" "2020-01-05"
#> [6] "2020-01-06"

So the value at ad_df["Angola", "2020-01-01"] corresponds to the value of the metric (in this case temperature) in Angola on 2020-01-01:

ad_df["Angola", "2020-01-01"]
#> [1] 29.33611
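
To make this layout concrete, here is a minimal base R sketch of a matrix with the same shape: locations as row names and dates as column names. The object toy_mat and its values are invented for illustration and do not come from AREAdata.

```r
# Toy stand-in for an AREAdata-style matrix (values are invented).
places <- c("Angola", "Latvia", "Albania")
dates <- as.character(seq(as.Date("2020-01-01"), by = "day", length.out = 4))

toy_mat <- matrix(
  c(29.3, 28.8, 28.9, 27.4,
     1.8,  2.1,  0.9,  1.2,
     6.0,  6.4,  5.9,  6.1),
  nrow = length(places),
  byrow = TRUE,
  dimnames = list(places, dates)
)

# Index by location (row) and date (column).
toy_mat["Angola", "2020-01-01"]

# The total number of entries is locations x dates.
nrow(toy_mat) * ncol(toy_mat)

# The same arithmetic for the summary printed earlier:
# 256 locations x 1461 dates.
256 * 1461
```

Because the row and column names are plain character vectors, standard matrix indexing works on the real object in the same way.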

Deconstructing `fetch_ad()`

That call above to fetch_ad() used some slightly odd parameters, so let’s break that down a bit.

The arguments we provided were:

metric = "temp"
gid = 0
use_cache = TRUE
cache_location (optional; not set in the call shown above, so the default location is used)

`metric`

The metric argument tells fetch_ad() what type of data it should retrieve from AREAdata.

There are many valid options for this, including temperature and humidity. (See AREAdata or run ?fetch_ad in the R console for a full list.)

`gid`

The gid argument tells fetch_ad() the spatial scale to download (with higher numbers corresponding to finer scales, at the cost of larger download sizes).

`use_cache` and `cache_location`

As the AREAdata files are large and monolithic, it is often advantageous to download data at a given spatial scale only once.

We can do this by caching the files and retrieving them as needed. Whilst you could do this manually, ohvbd already has the functionality to perform this behind the scenes.

If we set use_cache = TRUE then instead of downloading from the repository immediately, ohvbd will first try to load any cached version that may be present.

Typically, the cache location is a folder in your user directory, obtained from tools::R_user_dir(). However, if you want to override it (for example, to share the cache between different scripts), you can provide an alternative path to save data to.
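
The general "load from cache, otherwise download and store" pattern can be sketched in base R as follows. This is only an illustration of the idea, not how ohvbd implements its cache: the helper cached_fetch(), the package name "mypackage", the key "temp-0", and the stand-in download function are all hypothetical.

```r
# Hypothetical helper: load a cached object if present,
# otherwise build it and cache it for next time.
# (Illustrative only; not ohvbd's actual implementation.)
cached_fetch <- function(key, download_fun,
                         cache_dir = tools::R_user_dir("mypackage", which = "cache")) {
  cache_file <- file.path(cache_dir, paste0(key, ".rds"))

  if (file.exists(cache_file)) {
    # Cache hit: skip the download entirely.
    return(readRDS(cache_file))
  }

  # Cache miss: run the download, then store the result.
  result <- download_fun()
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  saveRDS(result, cache_file)
  result
}

# Usage with a stand-in "download" and a temporary cache folder.
demo_dir <- file.path(tempdir(), "ad_cache_demo")
fake_download <- function() matrix(1:4, nrow = 2)

first <- cached_fetch("temp-0", fake_download, cache_dir = demo_dir)
second <- cached_fetch("temp-0", fake_download, cache_dir = demo_dir) # served from cache
identical(first, second)
```

Passing an explicit cache_dir, as in the usage lines, mirrors the idea of sharing one cache folder between scripts.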

Example

So given the above, let’s construct another function call to download relative humidity data at the province level (gid 1), caching in the same folder.

(We will not execute this, as gid level 1 data is significantly larger than gid level 0.)

fetch_ad(metric = "relhumid", gid = 1, use_cache = TRUE)

Extracting specific data

Extracting data from this matrix can be fiddly, but luckily ohvbd also provides a function to aid in that: extract_ad().

So let’s again try to extract the same data as previously, the temperature in Angola on 2020-01-01:

ad_df %>% extract_ad(targetdate = "2020-01-01", places = "Angola")
#> [1] 29.33611
#> attr(,"db")
#> [1] "ad"

Note: you can ignore the attr part of the output. It is metadata that allows ohvbd to ensure consistency between its functions.

Wonderful! But what if we want a range of dates, or many places? What if the dates fall outside the dates that we have data for, or the places are not actually present in the data?

This is where extract_ad() really shines.

Dates and date ranges

Let’s first look at the flexibility of the date arguments to extract_ad().

If, for example, we want more than one day, we can specify either a vector of dates or a less exact date (such as a month or a year) as targetdate:

ad_df %>% extract_ad(
  targetdate = c("2020-01-01", "2020-01-02", "2020-01-05"),
  places = "Angola"
)
#> 2020-01-01 2020-01-02 2020-01-05 
#>   29.33611   28.77514   28.13407 
#> attr(,"db")
#> [1] "ad"

ad_df %>% extract_ad(targetdate = "2020-01", places = "Angola")
#> 2020-01-01 2020-01-02 2020-01-03 2020-01-04 2020-01-05 2020-01-06 2020-01-07 
#>   29.33611   28.77514   28.88312   27.43250   28.13407   28.64046   28.09983 
#> 2020-01-08 2020-01-09 2020-01-10 2020-01-11 2020-01-12 2020-01-13 2020-01-14 
#>   28.40937   28.71222   27.21757   26.85815   28.92744   29.53750   28.88849 
#> 2020-01-15 2020-01-16 2020-01-17 2020-01-18 2020-01-19 2020-01-20 2020-01-21 
#>   28.81174   29.81402   29.89741   29.59390   29.87222   30.28832   30.18386 
#> 2020-01-22 2020-01-23 2020-01-24 2020-01-25 2020-01-26 2020-01-27 2020-01-28 
#>   29.98904   29.97556   29.63855   29.53917   29.00056   28.81855   29.47186 
#> 2020-01-29 2020-01-30 2020-01-31 
#>   28.44818   28.64745   28.34594 
#> attr(,"db")
#> [1] "ad"

# 366 days of data are returned, as 2020 is a leap year
length(ad_df %>% extract_ad(targetdate = "2020", places = "Angola"))
#> [1] 366
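
One way to picture this partial matching is as a prefix match on the date column names. The base R sketch below shows that idea on a toy matrix. The helper match_dates() is hypothetical, and this is an assumption about the behaviour rather than a description of how extract_ad() works internally.

```r
# Toy matrix spanning the end of January and the start of February 2020.
dates <- as.character(seq(as.Date("2020-01-30"), as.Date("2020-02-02"), by = "day"))
toy_mat <- matrix(
  seq_along(dates),
  nrow = 1,
  dimnames = list("Angola", dates)
)

# Hypothetical helper: keep columns whose date string starts with
# any of the supplied prefixes ("2020", "2020-01", "2020-01-31", ...).
match_dates <- function(mat, prefixes) {
  keep <- Reduce(`|`, lapply(prefixes, function(p) startsWith(colnames(mat), p)))
  mat[, keep, drop = FALSE]
}

colnames(match_dates(toy_mat, "2020-01")) # only the two January dates
ncol(match_dates(toy_mat, "2020"))        # all four dates
```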

If we would instead like to specify a date range, we can add the argument enddate.

Note: the end date is defined exclusively, meaning that any date before it (and on or after the target date) is considered to be “in”.

ad_df %>% extract_ad(
  targetdate = "2020-01-01",
  enddate = "2020-01-04",
  places = "Angola"
)
#> 2020-01-01 2020-01-02 2020-01-03 
#>   29.33611   28.77514   28.88312 
#> attr(,"db")
#> [1] "ad"
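
The exclusive end date corresponds to a half-open interval, [targetdate, enddate). A minimal base R sketch of that rule on a toy matrix is given below. The helper date_range() is hypothetical and is not part of ohvbd.

```r
# Toy matrix with six consecutive days of made-up values.
dates <- as.character(seq(as.Date("2020-01-01"), by = "day", length.out = 6))
toy_mat <- matrix(seq_along(dates), nrow = 1, dimnames = list("Angola", dates))

# Hypothetical helper implementing the half-open interval [start, end):
# the start date is included, the end date is not.
date_range <- function(mat, start, end) {
  d <- as.Date(colnames(mat))
  mat[, d >= as.Date(start) & d < as.Date(end), drop = FALSE]
}

colnames(date_range(toy_mat, "2020-01-01", "2020-01-04"))
```

A practical benefit of this convention is that consecutive ranges, such as 2020-01-01 to 2020-02-01 followed by 2020-02-01 to 2020-03-01, never share a day.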

Places

To extract data for multiple places, provide them as a vector:

ad_df %>% extract_ad(
  targetdate = "2020-01-01",
  places = c("Angola", "Latvia", "United Kingdom")
)
#>         Angola         Latvia United_Kingdom 
#>      29.336107       1.765062       5.113141 
#> attr(,"db")
#> [1] "ad"

This can be combined with the date filters above to create a more restrictive extraction:

ad_df %>% extract_ad(
  targetdate = "2020-01-01",
  enddate = "2020-01-04",
  places = c("Angola", "Latvia", "United Kingdom")
)
#> Areadata matrix for temp at gid level 0 .
#> Cached: FALSE 
#> Dates: 2020-01-01 -> 2020-01-03 (3)
#> Locations: 3

Resilience

The extract_ad() function is also relatively resilient to the use of dates or places that are not in AREAdata:

ad_df %>% extract_ad(
  targetdate = "2023-12-30",
  enddate = "2024-01-02",
  places = c("Atlantis", "Latvia", "United Kingdom")
)
#> Areadata matrix for temp at gid level 0 .
#> Cached: FALSE 
#> Dates: 2023-12-30 -> 2023-12-31 (2)
#> Locations: 2
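
The effect seen here, where unknown places and out-of-range dates are simply dropped, can be reproduced in base R with intersect(). The sketch below uses a toy matrix with invented values. It illustrates the general technique only and makes no claim about how extract_ad() handles missing names internally.

```r
# Toy matrix covering the last two days of 2023 (values are invented).
toy_mat <- matrix(
  1:4,
  nrow = 2,
  dimnames = list(c("Latvia", "United Kingdom"), c("2023-12-30", "2023-12-31"))
)

wanted_places <- c("Atlantis", "Latvia", "United Kingdom")
# Requested days 2023-12-30 up to (but not including) 2024-01-02.
wanted_dates <- as.character(seq(as.Date("2023-12-30"), as.Date("2024-01-01"), by = "day"))

# Keep only the requested names that actually exist in the matrix.
found_places <- intersect(wanted_places, rownames(toy_mat))
found_dates <- intersect(wanted_dates, colnames(toy_mat))

# Record what was dropped so that it can be reported to the user.
missing_places <- setdiff(wanted_places, rownames(toy_mat))

subset_mat <- toy_mat[found_places, found_dates, drop = FALSE]
dim(subset_mat)
missing_places
```

Checking missing_places (or an equivalent for dates) after an extraction is a cheap way to catch typos in place names that would otherwise be dropped silently.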

To be completed
