Creating reference data
reference_data.Rmd
This tutorial shows how to use the hiaR
package to
create reference data that can be used in the DYNAMO-HIA model. We use
the Global Burden of Disease (GBD) data to create reference data for the
mortality of COVID-19 in the Netherlands. The GBD data is available from
the IHME
website. The data can be downloaded from the following link
after registering on the website and entering a user agreement. For
simplicity, we include simulated data with identical format in the
hiaR
package that we will use in this tutorial. Please
install the dplyr
and tidyr
packages to run
this tutorial:
install.packages(c("dplyr", "tidyr"))
#> Installing packages into 'D:/a/_temp/Library'
#> (as 'lib' is unspecified)
#> package 'dplyr' successfully unpacked and MD5 sums checked
#> Warning: cannot remove prior installation of package 'dplyr'
#> Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
#> D:\a\_temp\Library\00LOCK\dplyr\libs\x64\dplyr.dll to
#> D:\a\_temp\Library\dplyr\libs\x64\dplyr.dll: Permission denied
#> Warning: restored 'dplyr'
#> package 'tidyr' successfully unpacked and MD5 sums checked
#>
#> The downloaded binary packages are in
#> C:\Users\runneradmin\AppData\Local\Temp\RtmpcrEwrn\downloaded_packages
We start by loading the hiaR
package and the example
data covid19nl
. When using the real GBD data, the data can
be loaded from CSV with read.csv
or a similar function. For
documentation on the variables, see the IHME
Data and Tools Guide.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(hiaR)
disease <- "COVID-19"
data(covid19nl)
# Take a glimpse at the data
head(covid19nl)
#> Measure Metric Cause Location Age Sex Year Value
#> 1 Deaths Number COVID-19 Netherlands < 5 years Males 2021 0.801720560
#> 2 Deaths Number COVID-19 Netherlands < 5 years Females 2021 2.790728576
#> 3 Deaths Percent COVID-19 Netherlands < 5 years Males 2021 0.001583409
#> 4 Deaths Percent COVID-19 Netherlands < 5 years Females 2021 0.005511728
#> 5 Deaths Rate COVID-19 Netherlands < 5 years Males 2021 0.004581260
#> 6 Deaths Rate COVID-19 Netherlands < 5 years Females 2021 0.015947020
#> Upper Lower
#> 1 0.881892616 0.721548504
#> 2 3.069801433 2.511655718
#> 3 0.001741750 0.001425068
#> 4 0.006062901 0.004960555
#> 5 0.005039386 0.004123134
#> 6 0.017541722 0.014352318
The DYNAMO-HIA model requires reference data to be in a specific format. We therefore need to reshape and preprocess the raw GBD data. We start by creating data for age groups 0 to 95 years that are required by DYNAMO-HIA (under default settings). The GBD data contains 5-year age groups, so we need to interpolate these to 1-year age groups by imputing the 5-year rows for each 1-year group. We first look at the age groups present in the GBD data.
unique_age_groups <- unique(covid19nl$Age)
print(unique_age_groups)
#> [1] "< 5 years" "5-9 years" "10-14 years" "15-19 years" "20-24 years"
#> [6] "25-29 years" "30-34 years" "35-39 years" "40-44 years" "45-49 years"
#> [11] "50-54 years" "55-59 years" "60-64 years" "65-69 years" "70-74 years"
#> [16] "75-79 years" "80-84 years" "85-89 years" "90-94 years" "95+ years"
We see 19 5-year age groups and one “95+ years” age group that we
will treat as the 95-year age group. We extract the range of each group
from the character string and create additional rows for each 1-year age
group. We also recode character values in the Sex
column as
numeric values.
preprocessed_df <- covid19nl |>
mutate(Age = ifelse(Age == "< 5 years",
"0-4 years",
ifelse(Age == "95+ years",
"95 years",
Age)),
sex_num = ifelse(Sex == "Males", 0, 1),
age_num = gregexpr("\\d+", Age) |>
regmatches(Age, m = _) |>
sapply(function(x) list(c(x[1]:x[length(x)])))) |>
unnest(age_num)
Now, we can create a data frame that contains the prevalence of
COVID-19 for the Dutch population. We wrap the data frame in a list with
the name of the target population. NB: We could also include lists with
data frames for multiple populations. The name of the prevalence column
must be percent
.
prevalence_df <- preprocessed_df |>
filter(Measure == "Prevalence" & Metric == "Percent") |>
select(age = age_num, sex = sex_num, percent = Value)
prevalence_list <- list(NL = list(data = prevalence_df))
Similarly, we create a data frame and list for the incidence. Note
that the column name in the data frame must be value
instead of percent
.
incidence_df <- preprocessed_df |>
filter(Measure == "Incidence" & Metric == "Percent") |>
select(age = age_num, sex = sex_num, value = Value)
incidence_list <- list(NL = list(data = incidence_df))
We also create a data frame and for the disease excess mortality.
Since people can be cured from COVID-19, we use the
curedfraction
column to store the faction of the population
that is cured from the disease. We estimate the cured fraction from the
deaths but we could also obtain this data from other GBD data sources.
The column acutelyfatal
is set to 0 for all rows because
DYNAMO-HIA only allow either acutely fatal or cured fractions to be
non-zero. We add two additional values to the list:
unit_type
which indicates that we are using rates, and
parameter_type
which indicates that we supply the cured
fraction.
mortality_df <- preprocessed_df |>
mutate(acutelyfatal = 0,
curedfraction = Value * 100000) |>
filter(Measure == "Deaths" & Metric == "Rate") |>
select(age = age_num, sex = sex_num, unit = Value, acutelyfatal, curedfraction)
mortality_list <- list(NL = list(data = mortality_df,
unit_type = "Rate",
parameter_type = "Cured Fraction"))
Finally, we create a data frame and list for disease-adjusted life years (DALYs).
daly_df <- preprocessed_df |>
filter(Measure == "DALYs (Disability-Adjusted Life Years)" & Metric == "Percent") |>
select(age = age_num, sex = sex_num, percent = Value)
daly_list <- list(NL = list(data = daly_df))
With the prevalence, incidence, mortality, and DALY lists, we can now
write the XML configuration files for the DYNAMO-HIA model that contain
the reference data. We use the write_disease_dir
function
inside a temporary directory:
withr::with_tempdir({
fs::dir_create("Reference_Data")
withr::with_dir("Reference_Data", {
fs::dir_create("Diseases")
withr::with_dir("Diseases", {
write_disease_dir(disease_name = disease,
prevalences_list = prevalence_list,
incidences_list = incidence_list,
excess_mortalities_list = mortality_list,
disability_list = daly_list,
risk_factor_list = list(),
diseases_list = list())
})
})
})
#> [1] TRUE
The function writes the XML files in the
Reference_Data/Diseases/
subdirectory where the DYNAMO-HIA
Java application can find them. It also validates the XML schemata of
the files so that they correspond to the expected format. Note that we
did not include relative risks from risk factors or other diseases (they
go in the risk_factor_list
and diseases_list
arguments, respectively).