Intro

This notebook accompanies MIC training: Modern data analysis in R/RStudio. The corresponding presentation is here.

To convert this notebook into HTML (to knit it), type make in the command line of this folder or click Knit icon in the top bar of the RStudio editor.

data.table

Define a data.table:

library(data.table)
dtBabies = data.table(name = c("Jackson", "Emma", "Liam", "Ava"), gender = c("M", 
    "F", "M", "F"), year2011 = c(74.69, NA, 88.24, 81.77), year2012 = c(84.99, NA, 
    NA, 96.45), year2013 = c(91.73, 75.74, 101.83, NA), year2014 = c(95.32, 82.49, 
    108.23, NA), year2015 = c(107.12, 93.73, 119.01, 105.65))
dtBabies

      name gender year2011 year2012 year2013 year2014 year2015
1: Jackson      M    74.69    84.99    91.73    95.32   107.12
2:    Emma      F       NA       NA    75.74    82.49    93.73
3:    Liam      M    88.24       NA   101.83   108.23   119.01
4:     Ava      F    81.77    96.45       NA       NA   105.65

Select specific records/rows:

dtBabies[gender == "M"]

      name gender year2011 year2012 year2013 year2014 year2015
1: Jackson      M    74.69    84.99    91.73    95.32   107.12
2:    Liam      M    88.24       NA   101.83   108.23   119.01

Select specific variables/columns:

dtBabies[, .(name, gender, year2015)]

      name gender year2015
1: Jackson      M   107.12
2:    Emma      F    93.73
3:    Liam      M   119.01
4:     Ava      F   105.65

Calculate the mean of a single column:

dtBabies[, .(meanWeight = mean(year2015))]

   meanWeight
1:   106.3775

Calculate the mean of a single column by gender:

dtBabies[, .(meanWeight = mean(year2015)), by = gender]

   gender meanWeight
1:      M    113.065
2:      F     99.690

data.table - reference by name

Define a list of string variables with column names:

lCol = list(meas = "weight", time = "year", group = c("name", "gender"), timeLast = "year2015")
lCol

$meas
[1] "weight"

$time
[1] "year"

$group
[1] "name"   "gender"

$timeLast
[1] "year2015"

Select records with gender column equal to M.

The name of the gender column is stored in one of the list elements.

lCol$group

[1] "name"   "gender"

We need the second entry of that list element.

[1] "gender"

To let data.table know that we want to provide column names via pre-defined string variables, we need to use function get in the i position.

Hence, to select rows:

dtBabies[get(lCol$group[[2]]) == "M"]

      name gender year2011 year2012 year2013 year2014 year2015
1: Jackson      M    74.69    84.99    91.73    95.32   107.12
2:    Liam      M    88.24       NA   101.83   108.23   119.01

To select specific columns we can provide them this way:

myColumns = c(lCol$group[[1]], lCol$timeLast)
dtBabies[, myColumns, with = F]

      name year2015
1: Jackson   107.12
2:    Emma    93.73
3:    Liam   119.01
4:     Ava   105.65

Or simpler (note the double dot):

dtBabies[, ..myColumns]

      name year2015
1: Jackson   107.12
2:    Emma    93.73
3:    Liam   119.01
4:     Ava   105.65

To aggregate by calculating the mean:

dtBabies[, .(meanWeight = mean(get(lCol$timeLast))), by = c(lCol$group[2])]

   gender meanWeight
1:      M    113.065
2:      F     99.690

Wide to long format

Our data table is in the wide format. To convert it to long format, use function melt from the data.table package.

Provide the names of identification (id.vars) and measure variables (measure.vars). If none are provided, melt guesses them automatically, which may result in a wrong conversion. Both variables can be given as strings with column names, or as column numbers.

The original data frame contains missing values; na.rm=T omits them in the long-format table.

dtBabiesLong = data.table::melt(dtBabies, id.vars = c("name", "gender"), measure.vars = 3:7, 
    variable.name = "year", value.name = "weight", na.rm = T)
head(dtBabiesLong, n = 5L)

      name gender     year weight
1: Jackson      M year2011  74.69
2:    Liam      M year2011  88.24
3:     Ava      F year2011  81.77
4: Jackson      M year2012  84.99
5:     Ava      F year2012  96.45

Long to wide

The function dcast converts from long to wide format. The function has a so-called formula interface that specifies a combination of variables that uniquely identify a row.

Note that because some combinations of name + gender + year do not exist, the dcast function will introduce NAs.

dtBabiesWide = data.table::dcast(dtBabiesLong, name + gender ~ year, value.var = "weight")

dtBabiesWide

      name gender year2011 year2012 year2013 year2014 year2015
1:     Ava      F    81.77    96.45       NA       NA   105.65
2:    Emma      F       NA       NA    75.74    82.49    93.73
3: Jackson      M    74.69    84.99    91.73    95.32   107.12
4:    Liam      M    88.24       NA   101.83   108.23   119.01

One way to use the formula interface with string variables is to create the formula string explicitly by concatenating individual string pieces:

as.formula(paste0(lCol$group[[1]], "+", lCol$group[[2]], "~", lCol$time))

name + gender ~ year

dtBabiesWide2 = data.table::dcast(dtBabiesLong, as.formula(paste0(lCol$group[[1]], 
    "+", lCol$group[[2]], "~", lCol$time)), value.var = "weight")

all.equal(dtBabiesWide, dtBabiesWide2)

[1] TRUE

data.table - IO

Fast reading of the files; use nThread option to take advantage of multiple threads and read files even faster!

Compressed files can be read directly without decompressing them first!

myDT = data.table::fread(file = "../../practicals/data/m1_allGF_wExpDescr.csv.gz", 
    nThread = 2)

Write as a CSV file and then compress it:

# Write a subset
myFilePath = "testOutput.csv"
data.table::fwrite(x = myDT[Stim_Treat == "EGF" & Metadata_T < 10], file = myFilePath, 
    row.names = F)

R.utils::gzip(myFilePath, overwrite = T)

Plotting with ggplot2

Let’s plot weight of individual babies over time.

Oops, the plotting function doesn’t know how to link the points. The logical way to link them is by the name column. We need to tell ggplot2 how to group the data.

Also, to avoid hard-coding column names we use aes_string instead of aes.

The data will be plotted as lines, with additional dots to indicate data points.

ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas, group = lCol$group[1])) + 
    geom_line() + geom_point()

In order to produce facets per gender, we use function facet_wrap.

ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas, group = lCol$group[1])) + 
    geom_line() + geom_point() + facet_wrap(lCol$group[2])

## Summaries

ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas, group = lCol$group[1])) + 
    geom_line() + geom_point() + facet_wrap(lCol$group[2]) + stat_summary(fun = mean, 
    aes(group = 1), geom = "line", colour = "red")

Trends

ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas)) + geom_point() + 
    facet_wrap(lCol$group[2]) + geom_smooth(method = "lm", formula = y ~ x, colour = "red")

Oops, regression analysis requires numerical variables! Here, the time in the year column is a string and is treated as a categorical variable.

Extract numerical value from the year column and assign it to a yearNum column. The result of the string substituting function gsub is also a string, hence we need to convert the result to a number using as.numeric function.

dtBabiesLong[, `:=`(yearNum, as.numeric(gsub("year", "", get(lCol$time))))]

ggplot2::ggplot(dtBabiesLong, aes_string(x = "yearNum", y = lCol$meas)) + geom_point() + 
    facet_wrap(lCol$group[2]) + geom_smooth(method = "lm", formula = y ~ x, colour = "red")

Shaded regions correspond to 95% CI.

Customisations

Lot’s of different themes and customisations, e.g. theme_bw(), theme_minimal(), etc.

Also check the ggthemes package for additional themes and colour scales.

p1 = ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas, group = lCol$group[1])) + 
    geom_line() + geom_point() + stat_summary(fun = mean, aes(group = 1), geom = "line", 
    colour = "red") + facet_wrap(lCol$group[2]) + theme_classic() + theme(axis.text.x = element_text(angle = 45, 
    hjust = 1))

Note, that the plot was assigned to a variable p1 and there is no graphics output. This is useful if we want to do something additional with the plot. For example, we can add additional ggplot layers or save the plot to a file.

To display the plot simply invoke the variable:

p1

Interactive plots

Now when we have the plot stored in p1 variable, we can convert it to an interactive plot using plotly package.

library(plotly)
plotly::ggplotly(p1)

Functions

Let’s test the calcStats() function on our dataset.

source("../calcStats.R")

calcStats(inDt = dtBabiesLong, inMeasVar = "weight", inGroupName = "gender", inRobust = T)

   gender medianMeas
1:      M      95.32
2:      F      88.11

Unit testing

library(testthat)

# calculated result
resCalc = data.table(meas = 1:9)
resTrue = data.table(meanMeas = 5)

# Test should pass; NO message will be produced
testthat::expect_equal(calcStats(inDt = resCalc, inMeasVar = "meas"), resTrue)

# Test should not pass; error message will appear
resTrue = data.table(meanMeas = 4)

testthat::expect_equal(calcStats(inDt = resCalc, inMeasVar = "meas"), resTrue)

Unit testing with test folder

Create a test folder and create individual files with tests. The name of these files should start with test_ prefix, e.g. test_calcStats.R.

library(testthat)

testthat::test_dir("../tests")

✔ |  OK F W S | Context

⠏ |   0       | test_calcStats
⠹ |   3       | test_calcStats
✔ |   8       | test_calcStats [0.2 s]

══ Results ═════════════════════════════════════════════════════════════════════
Duration: 0.3 s

OK:       8
Failed:   0
Warnings: 0
Skipped:  0

Way to go!

Profiling

Source code with examples for profvis.

Source code with examples for microbenchmark.

Debugging

Source code with examples.

Vectorization

Suppose we need to add 10M numbers to each other. This would be a classical way of doing it using a for loop:

nNum = 1e+07
vA = rnorm(nNum)
vB = rnorm(nNum)
vCloop = rep(0, nNum)

system.time(for (ii in 1:nNum) {
    vCloop[ii] = vA[ii] + vB[ii]
})

   user  system elapsed 
  1.453   0.074   4.161

Instead, R provides a convenient operator + that also works with vectors!

system.time({
    vCvec = vA + vB
})

   user  system elapsed 
  0.033   0.000   0.034

The result of two operations is exactly the same, but the latter is way faster!

all.equal(vCloop, vCvec)

[1] TRUE

For loops

For loops can be fast if you follow some rules:

Don’t use a loop when a vectorized alternative exists (see the previous slide).
Don’t increase the size of your objects, e.g. using cbind, rbind, during the loop! Instead, pre-allocate memory by predefining an object to hold the result.

Example of an inefficient loop: at every iteration of the loop, myRes increases in size because a new element is added to this vector.

nNum = 10000
vA = rnorm(nNum)
vB = rnorm(nNum)

myRes = c()
system.time(for (ii in 1:nNum) {
    myRes = c(myRes, vA[ii] + vB[ii])
})

   user  system elapsed 
  0.295   0.122   0.420

Example of an efficient loop: the result vector vC is pre-allocated and has the same size throughout the iteration of the loop. The only thing that changes is the content of that vector!

   user  system elapsed 
  0.004   0.000   0.004

[1] TRUE

The result is the same, but pre-allocating the vC vector speeds up the computation by two orders of magnitude!

The difference between the computation time increases with the size of the vector because without pre-allocating the memory, R is copying the existing vector before increasing its size with the c operation.

Apply functions

R notebook with examples.

Parallel computations using foreach

R notebook with example calculations.

Command-line parameters