Introduction
Often when presented with a set of population dynamic data, we may wish to investigate the abiotic conditions (such as temperature or humidity) under which these dynamics occurred.
Ideally this sort of data would have been collected at the same time as the dynamics data. However, when data are collected for one project and then repurposed for another, this is often not possible.
In such scenarios it is possible to link the location of the data with other measurements of abiotic variables in the area.
These can be retrieved from raster files provided by projects such as the ERA5 data store from the Copernicus project. However, this often means downloading a large amount of data for a small number of datapoints.
AREAdata
An alternative method for gathering this data is through the AREAdata project, which provides historical and forecast data aggregated at different spatial scales.
Let’s take a look at the format of data retrieved from AREAdata (we will deconstruct this command properly in a bit):
library(ohvbd)
ad_df <- fetch_ad(metric = "temp", gid = 0, use_cache = TRUE)
print(ad_df)
#> Areadata matrix for temp at gid level 0 .
#> Cached: FALSE
#> Dates: 2020-01-01 -> 2023-12-31 (1461)
#> Locations: 256
So that is quite a lot of data! At 374016 entries it is far too much to print here. However, if you want to print all the data rather than just a summary, you can add the argument full = TRUE to the print command:
print(ad_df, full = TRUE)
AREAdata outputs are formatted as matrices. The rows correspond to locations (identified by GADM codes), while the column names correspond to the day that the data refers to:
head(colnames(ad_df))
#> [1] "2020-01-01" "2020-01-02" "2020-01-03" "2020-01-04" "2020-01-05"
#> [6] "2020-01-06"
So the value at ad_df["Angola", "2020-01-01"] corresponds to the value of the metric (in this case temperature) in Angola on 2020-01-01:
ad_df["Angola", "2020-01-01"]
#> [1] 29.33611
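To make the row and column layout concrete, here is a minimal sketch using a small stand-in matrix built in base R. The object toy_ad and all of its values are invented for illustration; they are not real AREAdata values, and a real fetch_ad() result carries extra attributes that a plain matrix does not.

```r
# Toy stand-in for an AREAdata matrix (values are invented, not real data).
toy_ad <- matrix(
  c(29.3, 28.8, 28.9,
     1.8,  2.1,  0.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Angola", "Latvia"),                        # rows: locations
    c("2020-01-01", "2020-01-02", "2020-01-03")   # columns: dates
  )
)

# Index by location name and date string, as with ad_df above.
toy_ad["Angola", "2020-01-01"]

# All dates for one location, or all locations for one date.
toy_ad["Latvia", ]
toy_ad[, "2020-01-02"]

# The total number of entries is rows x columns.
length(toy_ad)
```

The same arithmetic explains the size of the real matrix: 256 locations by 1461 dates gives 374016 entries.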
Deconstructing fetch_ad()
The call to fetch_ad() above used some slightly odd parameters, so let’s break it down a bit.
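The arguments we provided were:
- metric = "temp"
- gid = 0
- use_cache = TRUE
A further argument, cache_location, was not set in the call above and so kept its default; it is covered alongside use_cache below.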
metric
The metric argument tells fetch_ad() what type of data it should retrieve from AREAdata.
There are many valid options for this, including temperature and humidity (see AREAdata or run ?fetch_ad in the R console for a full list).
gid
The gid argument tells fetch_ad() the spatial scale to download (with higher numbers corresponding to finer scales, at the cost of larger download sizes). For example, gid 0 is the country level and gid 1 is the province level.
use_cache and cache_location
As the AREAdata files are large and monolithic, it is often advantageous to download data at a given spatial scale only once.
We can do this by caching the files and retrieving them as needed.
Whilst you could do this manually, ohvbd already has the functionality to perform this behind the scenes.
If we set use_cache = TRUE then, instead of downloading from the repository immediately, ohvbd will first try to load any cached version that may be present.
By default, the cache location is a folder in your user directory, obtained from tools::R_user_dir(). However, if you want to override it (for example to share the cache between different scripts), you can provide an alternative path through cache_location.
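The cache-first behaviour described above can be sketched in a few lines of base R. This is only an illustration of the general pattern, not the actual ohvbd implementation: load_or_fetch() and slow_download() are hypothetical names, and the cache lives in a temporary folder so the example cleans up after itself.

```r
# Hypothetical helper: NOT part of ohvbd, just a sketch of cache-first loading.
load_or_fetch <- function(name, fetch_fun, cache_dir, use_cache = TRUE) {
  cache_file <- file.path(cache_dir, paste0(name, ".rds"))

  # Serve from the cache when allowed and available.
  if (use_cache && file.exists(cache_file)) {
    return(readRDS(cache_file))
  }

  # Otherwise do the expensive fetch, then store the result for next time.
  result <- fetch_fun()
  if (use_cache) {
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
    saveRDS(result, cache_file)
  }
  result
}

# Stand-in for a slow download; counts how often it actually runs.
n_downloads <- 0
slow_download <- function() {
  n_downloads <<- n_downloads + 1
  matrix(1:4, nrow = 2)
}

# Temporary cache folder for the demo, emptied first so the run is repeatable.
cache_dir <- file.path(tempdir(), "ad_cache_demo")
unlink(cache_dir, recursive = TRUE)

first  <- load_or_fetch("temp_gid0", slow_download, cache_dir)
second <- load_or_fetch("temp_gid0", slow_download, cache_dir)

n_downloads              # the second call is served from the cache
identical(first, second)
```

For a cache that persists between sessions, a path such as tools::R_user_dir("ohvbd", which = "cache") could be passed as cache_dir instead; whether that matches the exact default folder used by ohvbd is an assumption.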
Example
So given the above, let’s construct another function call to download relative humidity data at the province level (gid 1), caching in the same folder.
(We will not execute this, because gid level 1 data is significantly larger than gid level 0.)
fetch_ad(metric = "relhumid", gid = 1, use_cache = TRUE)
Extracting specific data
Extracting data from this matrix can be fiddly, but luckily ohvbd also provides a function to aid in that: extract_ad().
So let’s again try to extract the same data as previously, the temperature in Angola on 2020-01-01:
ad_df %>% extract_ad(targetdate = "2020-01-01", places = "Angola")
#> [1] 29.33611
#> attr(,"db")
#> [1] "ad"
Note: ignore the attr part of the output; this attribute just allows ohvbd to ensure consistency between its functions.
Wonderful! But what if we want a range of dates, or many places? What if the dates fall outside the dates that we have data for, or the places are not actually present in the data?
This is where extract_ad() really shines.
Dates and date ranges
Let’s first look at the flexibility of the date arguments to extract_ad().
If, for example, we want more than one day, we can specify either a vector of dates or a less exact date (such as a month or a year) as targetdate:
ad_df %>% extract_ad(
  targetdate = c("2020-01-01", "2020-01-02", "2020-01-05"),
  places = "Angola"
)
#> 2020-01-01 2020-01-02 2020-01-05
#> 29.33611 28.77514 28.13407
#> attr(,"db")
#> [1] "ad"
ad_df %>% extract_ad(targetdate = "2020-01", places = "Angola")
#> 2020-01-01 2020-01-02 2020-01-03 2020-01-04 2020-01-05 2020-01-06 2020-01-07
#> 29.33611 28.77514 28.88312 27.43250 28.13407 28.64046 28.09983
#> 2020-01-08 2020-01-09 2020-01-10 2020-01-11 2020-01-12 2020-01-13 2020-01-14
#> 28.40937 28.71222 27.21757 26.85815 28.92744 29.53750 28.88849
#> 2020-01-15 2020-01-16 2020-01-17 2020-01-18 2020-01-19 2020-01-20 2020-01-21
#> 28.81174 29.81402 29.89741 29.59390 29.87222 30.28832 30.18386
#> 2020-01-22 2020-01-23 2020-01-24 2020-01-25 2020-01-26 2020-01-27 2020-01-28
#> 29.98904 29.97556 29.63855 29.53917 29.00056 28.81855 29.47186
#> 2020-01-29 2020-01-30 2020-01-31
#> 28.44818 28.64745 28.34594
#> attr(,"db")
#> [1] "ad"
# 366 days of data are returned, as 2020 is a leap year
length(ad_df %>% extract_ad(targetdate = "2020", places = "Angola"))
#> [1] 366
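One way to picture the "less exact date" behaviour is as prefix matching on the date labels. The sketch below uses only base R; match_partial_date() is a hypothetical helper, and it is an assumption that extract_ad() behaves equivalently internally.

```r
# Daily date labels covering the same span as the summary printed earlier.
all_dates <- as.character(
  seq(as.Date("2020-01-01"), as.Date("2023-12-31"), by = "day")
)

# Hypothetical helper: keep the dates whose label starts with the partial date.
match_partial_date <- function(dates, partial) {
  dates[startsWith(dates, partial)]
}

length(all_dates)                                  # every day in the span
length(match_partial_date(all_dates, "2020-01"))   # one month
length(match_partial_date(all_dates, "2020"))      # one (leap) year
head(match_partial_date(all_dates, "2021-02"), 3)  # first days of a month
```

Under this view, "2020-01" selects the 31 days of January 2020 and "2020" selects all 366 days of that year, in line with the outputs shown above.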
If we would instead like to specify a date range, we can add the argument enddate.
Note: the end date is exclusive, meaning that any date on or after targetdate and strictly before enddate is included.
ad_df %>% extract_ad(
  targetdate = "2020-01-01",
  enddate = "2020-01-04",
  places = "Angola"
)
#> 2020-01-01 2020-01-02 2020-01-03
#> 29.33611 28.77514 28.88312
#> attr(,"db")
#> [1] "ad"
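The exclusive end date is a half-open interval: the start is included and the end is not. A minimal base R sketch of that rule follows; dates_in_range() is a hypothetical helper written for illustration, not an ohvbd function.

```r
all_dates <- seq(as.Date("2020-01-01"), as.Date("2020-01-10"), by = "day")

# Hypothetical helper: start is included, end is excluded.
dates_in_range <- function(dates, start, end) {
  dates[dates >= as.Date(start) & dates < as.Date(end)]
}

picked <- dates_in_range(all_dates, "2020-01-01", "2020-01-04")
format(picked)

# Half-open ranges tile without overlap: back-to-back ranges share no dates.
next_block <- dates_in_range(all_dates, "2020-01-04", "2020-01-07")
intersect(format(picked), format(next_block))
```

A practical consequence is that consecutive ranges can reuse one range's enddate as the next range's targetdate without counting any day twice.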
Places
Data for multiple places can be extracted by providing a vector of place names:
ad_df %>% extract_ad(
  targetdate = "2020-01-01",
  places = c("Angola", "Latvia", "United Kingdom")
)
#> Angola Latvia United_Kingdom
#> 29.336107 1.765062 5.113141
#> attr(,"db")
#> [1] "ad"
This can be combined with the date filters above to create a more restrictive extraction:
ad_df %>% extract_ad(
  targetdate = "2020-01-01",
  enddate = "2020-01-04",
  places = c("Angola", "Latvia", "United Kingdom")
)
#> Areadata matrix for temp at gid level 0 .
#> Cached: FALSE
#> Dates: 2020-01-01 -> 2020-01-03 (3)
#> Locations: 3
Resilience
The extract_ad() function is also relatively resilient to the use of dates or places that are not in AREAdata:
ad_df %>% extract_ad(
  targetdate = "2023-12-30",
  enddate = "2024-01-02",
  places = c("Atlantis", "Latvia", "United Kingdom")
)
#> Areadata matrix for temp at gid level 0 .
#> Cached: FALSE
#> Dates: 2023-12-30 -> 2023-12-31 (2)
#> Locations: 2
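The result above keeps only the two places and two dates that exist in the data. A rough base R sketch of this kind of tolerant subsetting is shown below. Everything in it is illustrative: toy_ad holds invented values, safe_subset() is a hypothetical helper, and both the message and the space-to-underscore name handling are assumptions rather than documented extract_ad() behaviour.

```r
# Toy matrix with invented values, ending on the last date in the real data.
toy_ad <- matrix(
  seq_len(6), nrow = 2,
  dimnames = list(
    c("Latvia", "United_Kingdom"),
    c("2023-12-29", "2023-12-30", "2023-12-31")
  )
)

# Hypothetical helper: keep whatever part of the request exists in the matrix.
safe_subset <- function(mat, places, start, end) {
  # Assumption: spaces in requested names map to underscores in row names.
  wanted <- gsub(" ", "_", places)
  found <- intersect(wanted, rownames(mat))
  missing <- setdiff(wanted, rownames(mat))
  if (length(missing) > 0) {
    message("Skipping places not in the data: ", paste(missing, collapse = ", "))
  }

  # Half-open date filter; dates beyond the data simply match nothing.
  col_dates <- as.Date(colnames(mat))
  keep <- col_dates >= as.Date(start) & col_dates < as.Date(end)

  # drop = FALSE keeps the result a matrix even for a single row or column.
  mat[found, keep, drop = FALSE]
}

res <- safe_subset(
  toy_ad,
  places = c("Atlantis", "Latvia", "United Kingdom"),
  start = "2023-12-30",
  end = "2024-01-02"
)
dim(res)
rownames(res)
colnames(res)
```

Using intersect() rather than indexing directly avoids a "subscript out of bounds" error when a requested place is absent from the matrix.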