Intro
This notebook accompanies MIC training: Modern data analysis in R/RStudio. The corresponding presentation is here.
To convert this notebook into HTML (to knit it), type make
in the command line of this folder or click Knit icon in the top bar of the RStudio editor.
data.table
Define a data.table
:
library(data.table)
dtBabies = data.table(name = c("Jackson", "Emma", "Liam", "Ava"), gender = c("M",
"F", "M", "F"), year2011 = c(74.69, NA, 88.24, 81.77), year2012 = c(84.99, NA,
NA, 96.45), year2013 = c(91.73, 75.74, 101.83, NA), year2014 = c(95.32, 82.49,
108.23, NA), year2015 = c(107.12, 93.73, 119.01, 105.65))
dtBabies
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Emma F NA NA 75.74 82.49 93.73
3: Liam M 88.24 NA 101.83 108.23 119.01
4: Ava F 81.77 96.45 NA NA 105.65
Select specific records/rows:
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Liam M 88.24 NA 101.83 108.23 119.01
Select specific variables/columns:
name gender year2015
1: Jackson M 107.12
2: Emma F 93.73
3: Liam M 119.01
4: Ava F 105.65
Calculate the mean of a single column:
meanWeight
1: 106.3775
Calculate the mean of a single column by gender:
gender meanWeight
1: M 113.065
2: F 99.690
data.table - reference by name
Define a list of string variables with column names:
lCol = list(meas = "weight", time = "year", group = c("name", "gender"), timeLast = "year2015")
lCol
$meas
[1] "weight"
$time
[1] "year"
$group
[1] "name" "gender"
$timeLast
[1] "year2015"
Select records with gender
column equal to M
.
The name of the gender column is stored in one of the list elements.
[1] "name" "gender"
We need the second entry of that list element.
[1] "gender"
To let data.table
know that we want to provide column names via pre-defined string variables, we need to use function get
in the i
position.
Hence, to select rows:
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Liam M 88.24 NA 101.83 108.23 119.01
To select specific columns we can provide them this way:
name year2015
1: Jackson 107.12
2: Emma 93.73
3: Liam 119.01
4: Ava 105.65
Or simpler (note the double dot):
name year2015
1: Jackson 107.12
2: Emma 93.73
3: Liam 119.01
4: Ava 105.65
To aggregate by calculating the mean:
gender meanWeight
1: M 113.065
2: F 99.690
Wide to long format
Our data table is in the wide format. To convert it to long format, use function melt
from the data.table
package.
Provide the names of identification (id.vars
) and measure variables (measure.vars
). If none are provided, melt
guesses them automatically, which may result in a wrong conversion. Both variables can be given as strings with column names, or as column numbers.
The original data frame contains missing values; na.rm=T
omits them in the long-format table.
dtBabiesLong = data.table::melt(dtBabies, id.vars = c("name", "gender"), measure.vars = 3:7,
variable.name = "year", value.name = "weight", na.rm = T)
head(dtBabiesLong, n = 5L)
name gender year weight
1: Jackson M year2011 74.69
2: Liam M year2011 88.24
3: Ava F year2011 81.77
4: Jackson M year2012 84.99
5: Ava F year2012 96.45
Long to wide
The function dcast
converts from long to wide format. The function has a so-called formula interface that specifies a combination of variables that uniquely identify a row.
Note that because some combinations of name + gender + year
do not exist, the dcast
function will introduce NAs
.
dtBabiesWide = data.table::dcast(dtBabiesLong, name + gender ~ year, value.var = "weight")
dtBabiesWide
name gender year2011 year2012 year2013 year2014 year2015
1: Ava F 81.77 96.45 NA NA 105.65
2: Emma F NA NA 75.74 82.49 93.73
3: Jackson M 74.69 84.99 91.73 95.32 107.12
4: Liam M 88.24 NA 101.83 108.23 119.01
One way to use the formula interface with string variables is to create the formula string explicitly by concatenating individual string pieces:
name + gender ~ year
dtBabiesWide2 = data.table::dcast(dtBabiesLong, as.formula(paste0(lCol$group[[1]],
"+", lCol$group[[2]], "~", lCol$time)), value.var = "weight")
all.equal(dtBabiesWide, dtBabiesWide2)
[1] TRUE
data.table - IO
Fast reading of the files; use nThread
option to take advantage of multiple threads and read files even faster!
Compressed files can be read directly without decompressing them first!
Write as a CSV file and then compress it:
Plotting with ggplot2
Let’s plot weight of individual babies over time.
Oops, the plotting function doesn’t know how to link the points. The logical way to link them is by the name
column. We need to tell ggplot2
how to group the data.
Also, to avoid hard-coding column names we use aes_string
instead of aes
.
The data will be plotted as lines, with additional dots to indicate data points.
ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas, group = lCol$group[1])) +
geom_line() + geom_point()
In order to produce facets per gender
, we use function facet_wrap
.
ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas, group = lCol$group[1])) +
geom_line() + geom_point() + facet_wrap(lCol$group[2])
## Summaries
ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas, group = lCol$group[1])) +
geom_line() + geom_point() + facet_wrap(lCol$group[2]) + stat_summary(fun = mean,
aes(group = 1), geom = "line", colour = "red")
Trends
ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas)) + geom_point() +
facet_wrap(lCol$group[2]) + geom_smooth(method = "lm", formula = y ~ x, colour = "red")
Oops, regression analysis requires numerical variables! Here, the time in the year
column is a string and is treated as a categorical variable.
Extract numerical value from the year
column and assign it to a yearNum
column. The result of the string substituting function gsub
is also a string, hence we need to convert the result to a number using as.numeric
function.
ggplot2::ggplot(dtBabiesLong, aes_string(x = "yearNum", y = lCol$meas)) + geom_point() +
facet_wrap(lCol$group[2]) + geom_smooth(method = "lm", formula = y ~ x, colour = "red")
Shaded regions correspond to 95% CI.
Customisations
Lot’s of different themes and customisations, e.g. theme_bw()
, theme_minimal()
, etc.
Also check the ggthemes
package for additional themes and colour scales.
p1 = ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas, group = lCol$group[1])) +
geom_line() + geom_point() + stat_summary(fun = mean, aes(group = 1), geom = "line",
colour = "red") + facet_wrap(lCol$group[2]) + theme_classic() + theme(axis.text.x = element_text(angle = 45,
hjust = 1))
Note, that the plot was assigned to a variable p1
and there is no graphics output. This is useful if we want to do something additional with the plot. For example, we can add additional ggplot
layers or save the plot to a file.
To display the plot simply invoke the variable:
Functions
Let’s test the calcStats()
function on our dataset.
source("../calcStats.R")
calcStats(inDt = dtBabiesLong, inMeasVar = "weight", inGroupName = "gender", inRobust = T)
gender medianMeas
1: M 95.32
2: F 88.11
Unit testing
Unit testing with test folder
Create a test
folder and create individual files with tests. The name of these files should start with test_
prefix, e.g. test_calcStats.R
.
✔ | OK F W S | Context
⠏ | 0 | test_calcStats
⠹ | 3 | test_calcStats
✔ | 8 | test_calcStats [0.2 s]
══ Results ═════════════════════════════════════════════════════════════════════
Duration: 0.3 s
OK: 8
Failed: 0
Warnings: 0
Skipped: 0
Way to go!
Debugging
Source code with examples.
Vectorization
Suppose we need to add 10M numbers to each other. This would be a classical way of doing it using a for loop:
nNum = 1e+07
vA = rnorm(nNum)
vB = rnorm(nNum)
vCloop = rep(0, nNum)
system.time(for (ii in 1:nNum) {
vCloop[ii] = vA[ii] + vB[ii]
})
user system elapsed
1.453 0.074 4.161
Instead, R provides a convenient operator +
that also works with vectors!
user system elapsed
0.033 0.000 0.034
The result of two operations is exactly the same, but the latter is way faster!
[1] TRUE
For loops
For loops can be fast if you follow some rules:
- Don’t use a loop when a vectorized alternative exists (see the previous slide).
- Don’t increase the size of your objects, e.g. using
cbind
,rbind
, during the loop! Instead, pre-allocate memory by predefining an object to hold the result.
Example of an inefficient loop: at every iteration of the loop, myRes
increases in size because a new element is added to this vector.
nNum = 10000
vA = rnorm(nNum)
vB = rnorm(nNum)
myRes = c()
system.time(for (ii in 1:nNum) {
myRes = c(myRes, vA[ii] + vB[ii])
})
user system elapsed
0.295 0.122 0.420
Example of an efficient loop: the result vector vC
is pre-allocated and has the same size throughout the iteration of the loop. The only thing that changes is the content of that vector!
user system elapsed
0.004 0.000 0.004
[1] TRUE
The result is the same, but pre-allocating the vC
vector speeds up the computation by two orders of magnitude!
The difference between the computation time increases with the size of the vector because without pre-allocating the memory, R is copying the existing vector before increasing its size with the c
operation.
Apply functions
R notebook with examples.
Parallel computations using foreach
R notebook with example calculations.
Command-line parameters
Source code with examples.