Developing R Packages:
How and Why

Malte Lüken

Netherlands eScience Center

11-11-2024

Links

Link to presentation:

Link to GitLab package template:

Why R packaging

Imagine taking over a project with 10000 lines of dense code in a single file with no documentation or examples on how to run it.

Reusability:
- R users know how to use a package
- R developers know how to develop a package

Efficiency:
- Following an established structure saves time
- Code and documentation only live in one place (DRY principle)

Reproducibility:
- Ensures that code runs outside of your environment (“But it works on my machine!!!”)

\(\rightarrow\) Software sustainability

\(\rightarrow\) Trade-off between costs and benefits

Example: Turning a script into a package

Estimating income from age and sex with linear regression:

my_script.R

N = 100

age <- sample(18:99, N, replace = TRUE)
sex <- sample(0:1, N, replace = TRUE)
income <- 2 + 0.1 * age + 0.2 * sex + rnorm(N)

df <- data.frame(age, sex, income)

model <- lm(income ~ age + sex, data = df)

summary(model)

The R packaging workflow

Setup

Two ways to easily create an R package:

RStudio: File \(\rightarrow\) New Project \(\rightarrow\) New Directory \(\rightarrow\) R Package (with name testR)
usethis package: usethis::create_package("testR")

Best practice: The usethis package contains useful functions to automate package development

Package structure

Both approaches create a new folder with a minimal R package skeleton:

DESCRIPTION: Metadata (e.g., package name, version, author, dependencies)
NAMESPACE: Which functions to export and which other packages to import
R/: R functions (with hello.R example file)
.Rbuildignore: Files to ignore when building the package (e.g., old R scripts)

Build package by clicking on Build \(\rightarrow\) Install or devtools::install()
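
To make the skeleton concrete, the following sketch writes the same kind of files by hand into a temporary directory and reads the metadata back. The field values (title, version, license) and the ignored old_scripts folder are placeholder assumptions, not a verbatim copy of what RStudio or usethis generates.

```r
# Build a minimal skeleton by hand in a temporary directory
pkg_dir <- file.path(tempdir(), "testR")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# DESCRIPTION: package metadata in "Field: value" (DCF) format
writeLines(
  c(
    "Package: testR",
    "Title: What the Package Does",
    "Version: 0.0.0.9000",
    "Description: A hypothetical example package.",
    "License: MIT + file LICENSE",
    "Imports: stats"
  ),
  file.path(pkg_dir, "DESCRIPTION")
)

# NAMESPACE: what the package exports and what it imports
writeLines(
  c("export(create_model)", "importFrom(stats, lm)"),
  file.path(pkg_dir, "NAMESPACE")
)

# .Rbuildignore: regular expressions for files to leave out of the build
writeLines("^old_scripts$", file.path(pkg_dir, ".Rbuildignore"))

# Read the metadata back as a character matrix with one column per field
description <- read.dcf(file.path(pkg_dir, "DESCRIPTION"))
description[, c("Package", "Version")]
```

In practice, usethis and roxygen2 maintain these files, so editing NAMESPACE by hand is rarely needed.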

Code

Creating functions

Advantages of functions:

Rerun code with different inputs
Lead to modular code (separation of concerns)
Easier to read and test code

Two ways to create a new function:

Manual: Create a new create_model.R file in the R/ folder
usethis: usethis::use_r("create_model"). This will automatically create R/create_model.R

Best practice: Give functions clear and consistent names (e.g., create_model instead of model or create_mod)

Example:

R/create_model.R

# Ambiguous argument and variable names
create_model <- function(df, dep, preds) {
  f <- formula(
    paste(dep, "~", paste(preds, collapse = " + "))
  )
  
  m <- lm(f, data = df)
  
  return(m)
}

R/create_model.R

# Clear argument and variable names
create_model <- function(df, dependent, predictors) {
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )
  
  model <- lm(model_formula, data = df)
  
  return(model)
}

Best practice: Use clear and consistent argument names (e.g., dependent instead of dep; df is a common abbreviation)

To try out the function, run devtools::load_all() (or Ctrl+Shift+L) and then create_model(df, "income", c("age", "sex"))
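
The same call can also be tried in a plain R script. The sketch below repeats the function definition and the simulated data from my_script.R so that it runs on its own; the fixed seed is an added assumption to make the result reproducible.

```r
set.seed(1) # Assumption: fixed seed for reproducible simulated data

create_model <- function(df, dependent, predictors) {
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )

  model <- lm(model_formula, data = df)

  return(model)
}

N <- 100

age <- sample(18:99, N, replace = TRUE)
sex <- sample(0:1, N, replace = TRUE)
income <- 2 + 0.1 * age + 0.2 * sex + rnorm(N)

df <- data.frame(age, sex, income)

model <- create_model(df, "income", c("age", "sex"))

class(model)       # "lm"
names(coef(model)) # "(Intercept)" "age" "sex"
```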

Build-time vs. load-time

Build-time: Code in R/ is executed when the binary package is built (e.g., devtools::install() or by CRAN) and results are saved
Load-time: Saved results are loaded when the package is attached (e.g., library(testR))

Example:

x <- Sys.time() # Is executed at build-time

# Loads and returns x at load-time
get_current_time <- function() {
  return(x)
}

Important when defining aliases:

# Uses version of stats::lm that is available at build-time
lm_alias <- stats::lm

Or filepaths:

# Uses filepath at build-time
model_dir <- file.path("data", "models")
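
One common way around this is to move the evaluation into a function body, so that it happens when the function is called. The sketch below contrasts the two cases in a plain script; get_model_dir and the inst/extdata/models layout are hypothetical and only illustrate the pattern.

```r
# Evaluated once when this file is run (in a package: at build-time)
build_time <- Sys.time()

# Returns the frozen value
get_build_time <- function() {
  return(build_time)
}

# Evaluated again on every call
get_current_time <- function() {
  return(Sys.time())
}

# Hypothetical: looks up the installed location of inst/extdata/models
# when called; system.file() returns "" if the package or folder is missing
get_model_dir <- function() {
  return(system.file("extdata", "models", package = "testR"))
}

Sys.sleep(0.1)

get_build_time() == build_time  # TRUE: the value never changes
get_current_time() > build_time # TRUE: the value is recomputed
```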

Best practice: Don’t use library, require, or source in a package

Best practice: Don’t use functions that change the global state in a package (e.g., setwd, options, par); use the withr package instead

Example:

# Modifies the global state
read_data <- function(base_dir) {
  old_wd <- getwd()
  setwd(base_dir)
  df <- read.csv(file.path("data", "data.csv"))
  setwd(old_wd)
  
  return(df)
}

# Uses the withr package
read_data <- function(base_dir) {
  withr::with_dir(base_dir, {
    df <- read.csv(file.path("data", "data.csv"))
  })
  
  return(df)
}
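
If adding a dependency is not an option, a base R alternative is on.exit(), which registers cleanup code that runs when the function exits, including after an error. This is a sketch of that alternative, not part of the withr example above; the temporary project folder and its contents are made up for the demonstration.

```r
# Restores the working directory with on.exit() instead of withr
read_data <- function(base_dir) {
  old_wd <- setwd(base_dir)
  # Runs when the function exits, even if read.csv() throws an error
  on.exit(setwd(old_wd), add = TRUE)

  df <- read.csv(file.path("data", "data.csv"))

  return(df)
}

# Usage with a temporary project folder
base_dir <- file.path(tempdir(), "project")
dir.create(file.path(base_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(
  data.frame(age = c(25, 40), income = c(4.5, 6.1)),
  file.path(base_dir, "data", "data.csv"),
  row.names = FALSE
)

wd_before <- getwd()
df <- read_data(base_dir)

# The working directory should be the same as before the call
getwd()
```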

Writing robust code

Code should be robust to avoid silent failures
Workflow for writing robust R functions:
1. What are the assumptions of the function? (e.g., df is a data frame)
2. Check if assumptions are met (asserting)
3. Define what happens when assumptions are not met

Example:

R/create_model.R

create_model <- function(df,
                         # Hints at which type is expected
                         dependent = character(),
                         predictors = character()) {
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )
  
  model <- lm(model_formula, data = df)
  
  return(model)
}

R/create_model.R

create_model <- function(df,
                         dependent = character(),
                         predictors = character()) {
  # Checks if arguments have expected type
  stopifnot(
    is.data.frame(df),
    is.character(dependent),
    is.character(predictors)
  )
  
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )
  
  model <- lm(model_formula, data = df)
  
  return(model)
}

R/create_model.R

create_model <- function(df,
                         dependent = character(),
                         predictors = character()) {
  stopifnot(
    is.data.frame(df),
    is.character(dependent),
    is.character(predictors)
  )
  
  # Checks if another assumption is met and handles exception
  if (nrow(df) == 0) {
    stop("Data frame not valid")
  }
  
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )
  
  model <- lm(model_formula, data = df)
  
  return(model)
}

R/create_model.R

create_model <- function(df,
                         dependent = character(),
                         predictors = character()) {
  stopifnot(
    is.data.frame(df),
    is.character(dependent),
    is.character(predictors)
  )
  
  # Returns an informative error message
  if (nrow(df) == 0) {
    stop("Data frame contains zero rows")
  }
  
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )
  
  model <- lm(model_formula, data = df)
  
  return(model)
}
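
To see the checks in action, the function can be called with inputs that violate its assumptions. In the sketch below, get_error_message is a hypothetical helper that returns the error message instead of stopping the script.

```r
create_model <- function(df,
                         dependent = character(),
                         predictors = character()) {
  stopifnot(
    is.data.frame(df),
    is.character(dependent),
    is.character(predictors)
  )

  if (nrow(df) == 0) {
    stop("Data frame contains zero rows")
  }

  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )

  model <- lm(model_formula, data = df)

  return(model)
}

# Hypothetical helper: returns the error message, or NA if no error occurs
get_error_message <- function(expr) {
  tryCatch(
    {
      expr
      NA_character_
    },
    error = function(error) conditionMessage(error)
  )
}

empty_df <- data.frame(age = numeric(), sex = numeric(), income = numeric())

# Violates the assumption that the data frame has rows
zero_rows_message <- get_error_message(
  create_model(empty_df, "income", c("age", "sex"))
)
zero_rows_message # "Data frame contains zero rows"

# Violates the type assumption; stopifnot() names the failed check
wrong_type_message <- get_error_message(
  create_model("not a data frame", "income", c("age", "sex"))
)
wrong_type_message
```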

Informative error messages

Clearly describe the problem
Suggest a solution
Are honest about what they know and don’t know

Example:

R/create_model.R

create_model <- function(df,
                         dependent = character(),
                         predictors = character()) {
  stopifnot(
    is.data.frame(df),
    is.character(dependent),
    is.character(predictors)
  )
  
  if (nrow(df) == 0) {
    stop("Data frame contains zero rows")
  }
  
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )
  
  tryCatch(
    {
      model <- lm(model_formula, data = df)
    },
    error = function(error) {
      # Does not know whether dependent variable is numeric
      stop("Dependent variable must be numeric")
    }
  )
  
  return(model)
}

R/create_model.R

create_model <- function(df,
                         dependent = character(),
                         predictors = character()) {
  stopifnot(
    is.data.frame(df),
    is.character(dependent),
    is.character(predictors)
  )
  
  if (nrow(df) == 0) {
    stop("Data frame contains zero rows")
  }
  
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )
  
  tryCatch(
    {
      model <- lm(model_formula, data = df)
    },
    error = function(error) {
      # Returns what it knows
      stop(paste("Model could not be created:", error$message))
    }
  )
  
  return(model)
}
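
The type checks can be made more informative as well. Since R 4.0.0, stopifnot() accepts named arguments and uses the name as the error message. The sketch below combines this with a check that suggests a solution; the column check and the message wording are assumptions for illustration.

```r
create_model <- function(df,
                         dependent = character(),
                         predictors = character()) {
  # Named conditions: the name becomes the error message (R >= 4.0.0)
  stopifnot(
    "`df` must be a data frame" = is.data.frame(df),
    "`dependent` must be a single character string" =
      is.character(dependent) && length(dependent) == 1,
    "`predictors` must be a character vector" = is.character(predictors)
  )

  # Describes the problem and suggests a solution
  missing_columns <- setdiff(c(dependent, predictors), names(df))
  if (length(missing_columns) > 0) {
    stop(
      "Column(s) not found in `df`: ",
      paste(missing_columns, collapse = ", "),
      ". Check the spelling against names(df)."
    )
  }

  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )

  model <- lm(model_formula, data = df)

  return(model)
}

df <- data.frame(age = c(20, 35, 50), income = c(4.1, 5.8, 7.0))

# "sex" is not a column of df, so the call fails with a specific message
missing_message <- tryCatch(
  create_model(df, "income", c("age", "sex")),
  error = function(error) conditionMessage(error)
)
missing_message
```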

Software development principles

Do not repeat yourself (DRY): Avoid duplicating code \(\rightarrow\) Rule of three (if you use the same code three times, it should be a function)
- Abstractions can avoid duplication

R/create_model_formula.R

create_model_formula <- function(predictors = character(),
                                 dependent = character()) {
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )
  return(model_formula)
}

Keep it simple, stupid (KISS): Avoid unnecessary complexity (also YAGNI: You ain’t gonna need it)
- Abstractions can create complexity and overhead
Separation of concerns: Functions should have a single responsibility

Best practice: Isolate side-effects (e.g. writing files, plotting) from core functions
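
As a sketch of how these principles fit together: create_model reuses create_model_formula and stays free of side effects, while writing to disk lives in a separate function. write_model_summary is a hypothetical name introduced here for illustration.

```r
create_model_formula <- function(predictors = character(),
                                 dependent = character()) {
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )
  return(model_formula)
}

# Core function: no side effects, only returns a value
create_model <- function(df,
                         dependent = character(),
                         predictors = character()) {
  model_formula <- create_model_formula(predictors, dependent)

  model <- lm(model_formula, data = df)

  return(model)
}

# Side effect (writing a file) isolated in its own function
write_model_summary <- function(model, path) {
  writeLines(utils::capture.output(summary(model)), path)

  return(invisible(path))
}

# Usage
df <- data.frame(
  age = c(20, 35, 50, 65, 80),
  sex = c(0, 1, 0, 1, 1),
  income = c(4.1, 5.8, 7.0, 8.9, 10.3)
)

model <- create_model(df, "income", c("age", "sex"))
summary_path <- write_model_summary(model, tempfile(fileext = ".txt"))
```

Because create_model does not touch the file system, it can be tested without cleaning up any files afterwards.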

Organizing code

Two (bad) extremes:

All functions in one file \(\rightarrow\) Hard to find functions
One function per file \(\rightarrow\) Too many files

Best practice: Large functions with lots of documentation should have their own files. Small functions can be grouped together in one file

Function definitions can be found with Code \(\rightarrow\) Go to File/Function (Ctrl+.) or by moving the cursor into the function name and pressing F2

The styler package is useful for applying a consistent code style (e.g. tidyverse style)
The formatR package is useful for applying consistent line breaks and indentation
The lintr package is useful for static code analysis (e.g. checking for style, syntax, and semantic issues)
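
A minimal sketch of how lintr and styler might be applied to a single file. The messy example file is made up, and both calls are skipped if the respective package is not installed.

```r
# Write a deliberately messy (but valid) R file to a temporary location
messy_file <- tempfile(fileext = ".R")
writeLines(
  c(
    "create_model=function(df,dependent,predictors){",
    "model=lm(formula(paste(dependent,'~',paste(predictors,collapse='+'))),data=df)",
    "return(model)}"
  ),
  messy_file
)

# Static analysis: reports issues without changing the file
if (requireNamespace("lintr", quietly = TRUE)) {
  print(lintr::lint(messy_file))
}

# Restyling: rewrites the file in place (tidyverse style by default)
if (requireNamespace("styler", quietly = TRUE)) {
  styler::style_file(messy_file)
  writeLines(readLines(messy_file))
}
```

For a whole package, lintr::lint_package() and styler::style_pkg() apply the same steps to all files.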

Testing

Ad-hoc testing

Most R users test their code implicitly.

Typical development workflow:

Write code
Run code in console or R script and see if it produces the expected results (e.g., via print statements)
Adjust code if necessary and repeat

Common problems with ad-hoc testing:

Time-consuming
Error-prone
Not systematic (edge cases)
Not reproducible

Automated testing

Advantages of automated testing with testthat:

Fewer (undetected) bugs
Forces better code structure
Easy to apply changes to code
More trustworthy code

\(\rightarrow\) Tests as documentation and starting point for new developers

What to test

External interface instead of internal interface
Each behavior has only one test
Fragile over robust code
Fixed bugs

Set up automated testing with usethis::use_testthat() \(\rightarrow\) creates folder tests/testthat/ for test files

Add a new test file with usethis::use_test("create_model") which creates tests/testthat/test-create_model.R with a dummy passing test
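
The dummy test can then be replaced with tests of the function's external behavior. The sketch below is one possible starting point: it repeats a version of create_model and calls library(testthat) so that it runs as a standalone script. Inside a package, neither is needed in the test file.

```r
library(testthat)

create_model <- function(df,
                         dependent = character(),
                         predictors = character()) {
  stopifnot(
    is.data.frame(df),
    is.character(dependent),
    is.character(predictors)
  )

  if (nrow(df) == 0) {
    stop("Data frame contains zero rows")
  }

  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )

  model <- lm(model_formula, data = df)

  return(model)
}

# Tests the external interface: what goes in and what comes out
test_that("create_model returns a fitted lm object", {
  df <- data.frame(
    age = c(20, 35, 50, 65),
    sex = c(0, 1, 0, 1),
    income = c(4.1, 5.8, 7.0, 8.9)
  )

  model <- create_model(df, "income", c("age", "sex"))

  expect_s3_class(model, "lm")
  expect_named(coef(model), c("(Intercept)", "age", "sex"))
})

# Tests an edge case that is easy to miss with ad-hoc testing
test_that("create_model fails for a data frame with zero rows", {
  empty_df <- data.frame(age = numeric(), sex = numeric(), income = numeric())

  expect_error(
    create_model(empty_df, "income", c("age", "sex")),
    "zero rows"
  )
})
```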

Testing principles

Any tests are better than no tests
Tests should not only be written but also be run
Tests should be run at a frequency that matches their runtime (fast tests often, slow tests less often)
Test cases should be realistic
Tests should be fully self-sufficient and self-contained

tests/testthat/test-create_model.R

N = 100

age <- sample(18:99, N, replace = TRUE)
sex <- sample(0:1, N, replace = TRUE)
income <- 2 + 0.1 * age + 0.2 * sex + rnorm(N)

df <- data.frame(age, sex, income)

test_that("create_model works", {
  mod <- create_model(df, "income", c("age", "sex"))
  
  expect_s3_class(mod, "lm") # expect_is() is deprecated in testthat 3rd edition
})

Test passed 🎊

tests/testthat/test-create_model.R

test_that("create_model works", {
  N = 100

  age <- sample(18:99, N, replace = TRUE)
  sex <- sample(0:1, N, replace = TRUE)
  income <- 2 + 0.1 * age + 0.2 * sex + rnorm(N)
  
  df <- data.frame(age, sex, income)
  
  mod <- create_model(df, "income", c("age", "sex"))
  
  expect_s3_class(mod, "lm") # expect_is() is deprecated in testthat 3rd edition
})

Test passed 🌈

Best practice: Helper functions and the withr package can be used to create self-sufficient and self-contained tests
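
A sketch of what such a helper could look like. testthat sources files in tests/testthat/ whose names start with helper before running the tests; simulate_income_data and the seed value are hypothetical. The function definition and library(testthat) are repeated only so that the sketch runs as a standalone script.

```r
library(testthat)

create_model <- function(df, dependent, predictors) {
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )

  model <- lm(model_formula, data = df)

  return(model)
}

# tests/testthat/helper-data.R (hypothetical helper file)
simulate_income_data <- function(n = 100, seed = 123) {
  # with_seed() sets the seed only for this block and then restores
  # the previous state of the random number generator
  withr::with_seed(seed, {
    age <- sample(18:99, n, replace = TRUE)
    sex <- sample(0:1, n, replace = TRUE)
    income <- 2 + 0.1 * age + 0.2 * sex + rnorm(n)

    data.frame(age, sex, income)
  })
}

# tests/testthat/test-create_model.R
test_that("create_model works", {
  df <- simulate_income_data()

  mod <- create_model(df, "income", c("age", "sex"))

  expect_s3_class(mod, "lm")
})
```

Each test that calls the helper gets the same data, and no objects or random number generator state leak from one test into the next.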

Testing layers

Unit tests: Test individual functions
Integration tests: Test how functions work together
System tests: Test the entire system

\(\rightarrow\) Only proceed to next layer if previous layer succeeds

Regression tests: Check whether the output of a function is still the same (e.g., tables, plots)
\(\rightarrow\) Does not check if output is correct
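
The idea behind a regression test can be sketched in base R: store the output once as a reference and compare later runs against it. The reference file location and the seed are assumptions for this sketch. testthat offers snapshot functions such as expect_snapshot() for the same purpose.

```r
create_model <- function(df, dependent, predictors) {
  model_formula <- formula(
    paste(dependent, "~", paste(predictors, collapse = " + "))
  )

  model <- lm(model_formula, data = df)

  return(model)
}

set.seed(42) # Assumption: fixed seed so that the data are identical each run

N <- 100

age <- sample(18:99, N, replace = TRUE)
sex <- sample(0:1, N, replace = TRUE)
income <- 2 + 0.1 * age + 0.2 * sex + rnorm(N)

df <- data.frame(age, sex, income)

reference_path <- file.path(tempdir(), "reference_coefficients.rds")

# First run: store the current output as the reference
if (!file.exists(reference_path)) {
  saveRDS(coef(create_model(df, "income", c("age", "sex"))), reference_path)
}

# Later runs: compare the new output against the stored reference
current <- coef(create_model(df, "income", c("age", "sex")))
reference <- readRDS(reference_path)

# TRUE as long as the output is unchanged, whether or not it is correct
isTRUE(all.equal(current, reference))
```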

Documentation

Function documentation

R packages can be easily documented with roxygen2:

Add documentation directly as code doc strings instead of in a separate file
Automatically converts doc strings to .Rd files in man/ \(\rightarrow\) text-based and easy to version control
Automatically updates exported functions and imported packages in NAMESPACE
Easy to update documentation
Document functions, datasets, and package itself

Set up documentation with usethis::use_roxygen_md()

Add documentation to R/create_model.R by clicking into the function definition and Code \(\rightarrow\) Insert Roxygen Skeleton

To update documentation, run devtools::document() (or Ctrl+Shift+D)

Function documentation example

Example:

#' Create a linear regression model
#' 
#' Creates a linear regression model from a data frame and dependent and independent variables.
#'
#' @param df A data frame containing the variables included in the model.
#' @param dep A single character string with the name of the dependent variable.
#' @param preds A character vector with the names of the independent variables.
#'
#' @return A linear regression model of class `"lm"`.
#' 
#' @details The function uses the \link{lm} function to estimate a linear regression model.
#' 
#' @export
#'
#' @examples
#' N = 100
#' 
#' age <- sample(18:99, N, replace = TRUE)
#' sex <- sample(0:1, N, replace = TRUE)
#' income <- 2 + 0.1 * age + 0.2 * sex + rnorm(N)
#' 
#' df <- data.frame(age, sex, income)
#' 
#' mod <- create_model(df, "income", c("age", "sex"))
#' 
create_model <- function(df, dep, preds) {
  f <- formula(paste(dep, "~", paste(preds, collapse = " + ")))
  
  m <- lm(f, data = df)
  
  return(m)
}

Vignettes

Complex examples, background information (e.g., theories, model equations, simulation studies), and tutorials should not live in the function documentation but in vignettes.

Create a new vignette with usethis::use_vignette("create_model")

This creates a new vignettes/ folder with a create_model.Rmd file.

Add content to vignettes/create_model.Rmd

README

Documentation for developers/users who see the package on GitHub/GitLab/CRAN

Answers three questions about a package:

Why should I use it?
How do I use it?
How do I install it?

Create a new R markdown README file with usethis::use_readme_rmd() and add content to README.Rmd

Website

Combine README, vignettes, and function documentation in a website with pkgdown

Set up website with usethis::use_pkgdown()

pkgdown automatically collects all function documentation, vignettes, and README files and creates a website in the docs/ folder

Update website with pkgdown::build_site() or devtools::build_site()

Workflow summary

Edit files in R/ and vignettes/
Update documentation with devtools::document()
Load package with devtools::load_all()
Run tests with devtools::test() or devtools::test_active_file()

If tests pass:

Check package with devtools::check()
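
The steps above can be collected in a small wrapper. run_checks is a hypothetical name; the sketch assumes devtools is installed and that the function is called from the package root directory.

```r
# Hypothetical wrapper around the workflow steps
run_checks <- function(pkg = ".") {
  devtools::document(pkg) # Update documentation and NAMESPACE
  devtools::load_all(pkg) # Load the package into the current session

  # Throws an error if any test fails, so check() is only reached
  # when all tests pass
  devtools::test(pkg, stop_on_failure = TRUE)

  devtools::check(pkg)
}
```

Calling run_checks() stops at the first failing step, which mirrors the order of the workflow.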

Version control and continuous integration

Version control: Save a snapshot of your package at a certain point in time
Continuous integration: Connect your version control system to a server (e.g., GitHub/GitLab) that automatically runs tests and builds documentation

Set up version control with usethis::use_git(). Connect to GitHub with usethis::use_github(), or add a GitLab CI configuration with usethis::use_gitlab_ci()

Add automated checks (including tests) on GitHub with usethis::use_github_action("check-standard")

GitLab example

Running usethis::use_gitlab_ci() creates a .gitlab-ci.yml file in the root directory of the package:

.gitlab-ci.yml

image: rocker/tidyverse

stages:
  - build
  - test
  - deploy

building:
  stage: build
  script:
    - R -e "remotes::install_deps(dependencies = TRUE)"
    - R -e 'devtools::check()'

# To have the coverage percentage appear as a gitlab badge follow these
# instructions:
# https://docs.gitlab.com/ee/user/project/pipelines/settings.html#test-coverage-parsing
# The coverage parsing string is
# Coverage: \d+\.\d+

testing:
    stage: test
    allow_failure: true
    when: on_success
    only:
        - master
    script:
        - Rscript -e 'install.packages("DT")'
        - Rscript -e 'covr::gitlab(quiet = FALSE)'
    artifacts:
        paths:
            - public

# To produce a code coverage report as a GitLab page see
# https://about.gitlab.com/2016/11/03/publish-code-coverage-report-with-gitlab-pages/

pages:
    stage: deploy
    dependencies:
        - testing
    script:
        - ls
    artifacts:
        paths:
            - public
        expire_in: 30 days
    only:
        - master

Further references

Course material on R packaging:
Rodriguez-Sanchez, P., Vreede, B., & de Boer, L. (n.d.). R packaging. Carpentries Incubator. https://carpentries-incubator.github.io/lesson-R-packaging/
Reproducible software development:
The Turing Way Community. (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research (Version 1.0.2). Zenodo. https://doi.org/10.5281/ZENODO.3233853
R Shiny guide:
Wickham, H. (2021). Mastering Shiny: Build interactive apps, reports, and dashboards powered by R (1st edition). O’Reilly Media. https://mastering-shiny.org/
R packaging guide:
Wickham, H., & Bryan, J. (2023). R packages: Organize, test, document, and share your code (2nd edition). O’Reilly Media. https://r-pkgs.org/

Questions and discussion