BernR: semi-advanced R programming tips

Maciej (Matthew) Dobrzyński, IZB UniBern

2021-04-27

About

This R notebook accompanies BernR MeetUp: Semi-advanced R programming tricks.

Topics

Working with projects

RStudio projects make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents.

RStudio projects are associated with R working directories. You can create an RStudio project:

In a brand new directory
In an existing directory where you already have R code and data
By cloning a version control (Git or Subversion) repository

To create a new project use the Create Project command (available on the Projects menu and on the global toolbar).

Code formatting

When working in RStudio with regular R scripts, use # to add comments to your code. Additionally, any comment line which includes at least four trailing dashes (-), equal signs (=), or pound signs (#) automatically creates a code section. For example, all of the following lines create code sections.

Note that the line can start with any number of pound signs (#) so long as it ends with four or more -, =, or # characters.

To navigate between code sections you can use the Jump To menu available at the bottom of the editor. You can expand the folded region by either clicking on the arrow in the gutter or on the icon that overlays the folded code.

RStudio supports both automatic and user-defined folding for regions of code. Code folding allows you to easily show and hide blocks of code to make it easier to navigate your source file and focus on the coding task at hand.

To indent or reformat the code use:

Menu > Code > Reindent Lines (⌘I)
Menu > Code > Reformat Code (⇧⌘A)

Syntax convention

It’s a good practice to stick to one naming convention throughout the code. A convenient convention is a so-called camel notation, where names of variables, constants, functions are constructed by capitalizing each comound of the name, e.g.:

calcStatsOfDF - function to calculate stats
nIteration - prefix n to indicate an integer variable
fPatientAge - prefix f to indicate a float variable
sPatientName - prefix s to indicate a string variable
vFileNames - v for vector
lColumnNames - l for a list

R notebooks

This document is written as an R Notebook. It allows to create publish-ready documents with text, graphics, and interactive plots. Can be saved as an html, pdf, or a Word document.

An R Notebook is a document with chunks that can be executed independently and interactively, with output visible immediately beneath the input. The text is formatted using R Markdown.

Quick data.table primer

Data.table package is a faster, more efficient framework for data manipulation compared to R’s standard data.frame. It provides a consistent syntax for subsetting, grouping, updating, merging, etc.

Let’s define a data table:

dtBabies = data.table(name= c("Jackson Smith", "Emma Williams", "Liam Brown", "Ava Wilson"), 
                    gender = c("M", "F", "M", "F"), 
                    year2011= c(74.69, NA, 88.24, 81.77), 
                    year2012=c(84.99, NA, NA, 96.45), 
                    year2013=c(91.73, 75.74, 101.83, NA),
                    year2014=c(95.32, 82.49, 108.23, NA),
                    year2015=c(107.12, 93.73, 119.01, 105.65))
dtBabies

##             name gender year2011 year2012 year2013 year2014 year2015
## 1: Jackson Smith      M    74.69    84.99    91.73    95.32   107.12
## 2: Emma Williams      F       NA       NA    75.74    82.49    93.73
## 3:    Liam Brown      M    88.24       NA   101.83   108.23   119.01
## 4:    Ava Wilson      F    81.77    96.45       NA       NA   105.65

General form

The general form of data.table syntax is as follows:

DT[i, j, by]

##   R:                 i                 j        by
## SQL:  where | order by   select | update  group by

The way to read it (out loud) is:

Take DT, subset/reorder rows using i, then calculate j, grouped by by.

Let’s select specific records from our data table:

dtBabies[gender == 'M']

##             name gender year2011 year2012 year2013 year2014 year2015
## 1: Jackson Smith      M    74.69    84.99    91.73    95.32   107.12
## 2:    Liam Brown      M    88.24       NA   101.83   108.23   119.01

Let’s select specific columns from our data table:

dtBabies[, .(name, gender, year2015)]

##             name gender year2015
## 1: Jackson Smith      M   107.12
## 2: Emma Williams      F    93.73
## 3:    Liam Brown      M   119.01
## 4:    Ava Wilson      F   105.65

Calculate the mean of a column:

dtBabies[, .(meanWeight = mean(year2015))]

##    meanWeight
## 1:   106.3775

Calculate the mean of a column by gender:

dtBabies[, .(meanWeight = mean(year2015)), by = gender]

##    gender meanWeight
## 1:      M    113.065
## 2:      F     99.690

Reference columns by name

In the above example, you have propbably noticed that column names are given explicitly. Hardcoding them this way in the script is potentially dangerous, for example when for some reason column names change. A much handier way would be to store the column names somewhere at the beginning of the script, where it’s easy to change them, and then use variables with those names in the code.

lCol = list()
lCol$meas = 'weight'
lCol$time = 'year'
lCol$group = c('name', 'gender')
lCol$timeLast = 'year2015'

lCol

## $meas
## [1] "weight"
## 
## $time
## [1] "year"
## 
## $group
## [1] "name"   "gender"
## 
## $timeLast
## [1] "year2015"

Now, we can perform the same summary with data.table but we’ll provide column names stored as strings in elements of the list lCol. The j part of the data.table requires us to use a function get to interpret the string as the column name.

dtBabies[, .(meanWeight = mean(get(lCol$timeLast))), by = c(lCol$group[2])]

##    gender meanWeight
## 1:      M    113.065
## 2:      F     99.690

Long vs wide format

Wide to long

The test data frame is in the wide format. Here, we convert it to long format using function melt. The key is to provide the names of identification (parameter id.vars) and measure variables (parameter measure.vars). If none are provided, melt will try to guess them automatically, which sometimes may result in a wrong conversion.

Both variables can be provided as explicit strings with column names, or as column numbers.

The original data frame contained missing values. The function melt has an option na.rm=T to omit them in the long-format table.

dtBabiesLong = data.table::melt(dtBabies, 
     id.vars = c('name', 'gender'), 
     measure.vars = 3:7,
     variable.name = 'year',
     value.name = 'weight',
     na.rm = T)

dtBabiesLong

##              name gender     year weight
##  1: Jackson Smith      M year2011  74.69
##  2:    Liam Brown      M year2011  88.24
##  3:    Ava Wilson      F year2011  81.77
##  4: Jackson Smith      M year2012  84.99
##  5:    Ava Wilson      F year2012  96.45
##  6: Jackson Smith      M year2013  91.73
##  7: Emma Williams      F year2013  75.74
##  8:    Liam Brown      M year2013 101.83
##  9: Jackson Smith      M year2014  95.32
## 10: Emma Williams      F year2014  82.49
## 11:    Liam Brown      M year2014 108.23
## 12: Jackson Smith      M year2015 107.12
## 13: Emma Williams      F year2015  93.73
## 14:    Liam Brown      M year2015 119.01
## 15:    Ava Wilson      F year2015 105.65

Long to wide

The function dcast from reshape2 package converts from wide to long format. The function has a so called formula interface that specifies a combination of variables that uniquely identify a row.

Note that because some combinations of name + gender + year do not exist, the dcast function will introduce NAs.

dtBabiesWide = data.table::dcast(dtBabiesLong, 
                name + gender ~ year, 
                value.var = 'weight')

dtBabiesWide

##             name gender year2011 year2012 year2013 year2014 year2015
## 1:    Ava Wilson      F    81.77    96.45       NA       NA   105.65
## 2: Emma Williams      F       NA       NA    75.74    82.49    93.73
## 3: Jackson Smith      M    74.69    84.99    91.73    95.32   107.12
## 4:    Liam Brown      M    88.24       NA   101.83   108.23   119.01

Reference columns by name

In the above example, you’ve noticed that the formula interface of dcast requires providing column names explicitly. As was the case with data.table, it is a good programming practice to avoid hard-coding column names at multiple points in the code.

We will use the lCol list with column names, we will build a string with the formula, and we will provide it to dcast function as the formula argument.

We build a string from column names stored in lCol list using paste0 function:

sFormula = paste0(lCol$group[1], '+', lCol$group[2], '~', lCol$time)
sFormula

## [1] "name+gender~year"

Finally, we use the string as a formula using as.formula function:

dcast(dtBabiesLong, 
                as.formula(sFormula), 
                value.var = lCol$meas)

##             name gender year2011 year2012 year2013 year2014 year2015
## 1:    Ava Wilson      F    81.77    96.45       NA       NA   105.65
## 2: Emma Williams      F       NA       NA    75.74    82.49    93.73
## 3: Jackson Smith      M    74.69    84.99    91.73    95.32   107.12
## 4:    Liam Brown      M    88.24       NA   101.83   108.23   119.01

Plot with ggplot2

ggPlot is a powerfull plotting package that requires data in the long format. Let’s plot weight over time.

Attempt 1

ggplot2::ggplot(dtBabiesLong, aes(x = year, y = weight)) +
  geom_line()

Ups, doesn’t look good… The reason being that plotting function doesn’t know how to link the points. The logical way to link them is by name column,

Attempt 2

Here we add group option in the ggplot aesthetics (aes) to avoid the mistake from the above.

Also, to avoid hard-coding column names we use aes_string instead of aes.

The data will be plotted as lines, with additional dots to indicate data points, and with the summary mean of the group.

In order to produce facets per gender, we use function facet_wrap. It uses formula interface, same as in the case of dcast. Again, we need to build the formula from a string.

sFormula2 = paste0('~', lCol$group[2])
sFormula2

## [1] "~gender"

p1 = ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas, group = lCol$group[1])) +
  geom_line() +
  geom_point() +
  stat_summary(fun.y = mean, aes(group=1), geom = "line", colour = 'red') +
  facet_wrap(as.formula(sFormula2)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: `fun.y` is deprecated. Use `fun` instead.

p1

The group=1 overrides the by-name grouping so that we get a mean across all names for each gender, rather than the mean of each individual name within each gender (which would the same as the individual observations).

Interactive

Making an interactive plot from a ggplot object is extremely easy. Just use ggplotly function from the amazing plotly package. The interactive plot will remain in the html document knitted from the R notebook.

plotly::ggplotly(p1)

Wrapping into functions

Let’s write a function to calculate statistics of a data frame. In the simplest case the function will calculate the mean of a single column of a data frame.

We will expand the funciton to csalculate the mean by a group, and to calculate robust statistics, i.e. median instead of the mean.

The function will need the following input parameters:

the name of the data frame to use for calculations
the name of the variable to summarise
the name of the column with grouping (optional)
a True or False parameter for robust stats (optional)

calcStats = function(inDt, inMeasVar, inGroupName = NULL, inRobust = F) {

  if (inRobust) {
    outDt = inDt[, .(medianMeas = median(get(inMeasVar))), by = inGroupName]
  } else {
    outDt = inDt[, .(meanMeas = mean(get(inMeasVar))), by = inGroupName]
  }
  
  return(outDt)
}

Since column names will be provided to our function as string parameters, we cannot hard-code them inside of the function. Therefore, we use function get to use the string stored in the variable inMeasVar as the column name.

Calculate the mean of the weight column:

calcStats(dtBabiesLong, 'weight')

##    meanMeas
## 1: 93.79933

Calculate the mean of the weight column by name and gender. Use robust stats:

calcStats(dtBabiesLong, 'weight', inGroupName = c('name', 'gender'), inRobust = T)

##             name gender medianMeas
## 1: Jackson Smith      M      91.73
## 2:    Liam Brown      M     105.03
## 3:    Ava Wilson      F      96.45
## 4: Emma Williams      F      82.49

Documentation

Once inside the function, click Menu > Code > Insert Roxygen Skeleton (Shift-Option-Command R). A pecial type of comment will be added above the function. You can add your text next to parameters.

#' Calculates stats of a data frame
#'
#' @param inDt Input data table in the long format
#' @param inMeasVar Name of the measurement column
#' @param inGroupName Name of the grouping column (default NULL)
#' @param inRobust If true, the function calculates median instead of the mean (default False)
#'
#' @return Data table with summary stats
#' @export
#' @import data.table
#'
#' @examples
#' # example usasge 

calcStats = function(inDt, inMeasVar, inGroupName = NULL, inRobust = F) {
  
  if (inRobust) {
    outDt = inDt[, .(medianMeas = median(get(inMeasVar))), by = inGroupName]
  } else {
    outDt = inDt[, .(meanMeas = mean(get(inMeasVar))), by = inGroupName]
  }
  
  return(outDt)
}