Maciej Dobrzyński (Institute of Cell Biology, University of Bern)
November 3, 2020
The first part will demonstrate:
data.table
& ggplot2
,R notebook with the code.
During the second part we will process time-series data from a time-lapse microscopy experiment. We will:
Intermediate datasets throughout the workshop:
PDF with an introduction to datasets.
Notebook with the practical session.
data.table
Extension of base R's data.frame
structure.
Fast data manipulation with a concise SQL-like syntax.
Check out the vignette for an introduction and Advanced tips and tricks with data.table for expanding your knowledge.
Packages are R's greatest strength but may create confusion.
CRAN = The Comprehensive R Archive Network, is a package repository that currently features >15k packages.
Aside from an obligatory reference manual, many packages include vignettes, i.e. digestible intros into working with a package.
To access functions provided by R packages, a package needs to be loaded:
library(data.table)
Then, functions such as dcast
, melt
, etc. are directly available right in the R interpreter.
However, there can be more packages that provide functions with the same name! For example:
library(plyr)
library(Hmisc)
Both provide a function summarise
. Upon loading the second package, R throws a warning:
> require(Hmisc)
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Attaching package: ‘Hmisc’
The following objects are masked from ‘package:plyr’:
is.discrete, summarize
Therefore, it is a good practice to call functions including the package reference:
plyr::summarise
Hmisc::summarise
An opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Brief, visual sumamries of packages' functionality.
A variable refers to a storage location in computer's memory, e.g.
myVariable = 5.3
The symbolic name myVariable
refers to a memory location that stores the number 5.3
.
A variable can vary!
myVariable = myVariable + 2.8
We changed the value referred by the mnemonic myVariable
. Now it stores 8.1
.
Data stored under variables can have different types. There are 5 of them in R. Use function typeof()
to check.
Illustration from R tutorial on TechVidvan.
A data structure is a way of storing and organising data. For example, in order to store 10 integers, we could define 10 variables, which isn't very efficient. Instead, we can store these numbers in a vector.
Illustration from R tutorial on TechVidvan.
Control structures change the flow of the code. The changes are based on conditions, e.g. if variable a
is greater than a certain value, do this, otherwise, do that.
Illustration from R tutorial on TechVidvan.
Source code with the template.
## Load libraries ----
library(data.table)
library(ggplot2)
## Global variables ----
# Lists with parameters for easy recall
lParRW = list(
fileIn = "experimentalResults.csv",
fileOut = "processedData",
filePlotOut = "boxPlot_activity.pdf"
)
lCol = list(
time = "Time_h",
meas = "sensor_ch0",
group = "Exp_cond"
)
## Custom functions ----
# Define custom functions or
# load from an external file
source("myFunctionLIbrary.R")
locCalcStats = function(...) {
...
}
## Read data ----
dt = fread(lParRW$fileIn)
## Clean data ----
# Remove unnecessary columns
dt[,
c("uselessColumn1",
"uselessColumn2") := NULL]
## Process data ----
...
## Save output data ----
fwrite(x = dt,
file = lParRW$fileOut)
## Save plots ----
p1 = ggplot2(dt,
aes(x = ...,
y = ...)) +
geom_line(aes(color = group))
ggsave(filename = lParRW$filePlotOut,
plot = p1)
Use # to add comments to your code.
Any comment line with at least four trailing characters, i.e. -
, =
, or #
creates a section in the code.
Comment your code by explaining the WHY instead of the WHAT!
Navigate code sections, using Jump To menu at the bottom of the editor.
The outline of code sections is in the upper right corner of the editor window.
RStudio supports folding for code regions (small downward triangles next to line numbers on the left in the editor window).
To indent or reformat the code use:
Stick to a single naming convention throughout the code. A convenient convention is a so-called camel notation, where names of variables, constants, functions are constructed by capitalizing each comound of the name, e.g.:
calcStatsOfDF
- function to calculate statistics; use verbs!nIteration
- prefix n
to indicate an integer variablefPatientAge
- f
to indicate a float variablesPatientName
- s
to indicate a string variablevFileNames
- v
for vectorlColumnNames
- l
for a listIf a variable is available/valid only within a function, prefix the name with loc
, e.g.:
locNiter
locVfileNames
Divide your work into multiple contexts with RStudio projects.
Each project has its own working directory, workspace, history, and source documents.
You can create an RStudio project:
File > New Project… to create a new R project
An R Notebook is a document with chunks that can be executed independently and interactively, with output visible immediately beneath the input.
The text is formatted using R Markdown.
Use R Notebooks to create publish-ready documents with text, graphics, and interactive plots.
Can be exported to html, pdf, or a Word document.
Can also include interactive elements!
Data.table
package is a faster, more efficient framework for data manipulation compared to R's default data.frame
. It provides a consistent syntax for subsetting, grouping, updating, merging, etc.
R notebook with the code.
Let's define a data table:
library(data.table)
dtBabies = data.table(name= c("Jackson", "Emma", "Liam", "Ava"),
gender = c("M", "F", "M", "F"),
year2011= c(74.69, NA, 88.24, 81.77),
year2012=c(84.99, NA, NA, 96.45),
year2013=c(91.73, 75.74, 101.83, NA),
year2014=c(95.32, 82.49, 108.23, NA),
year2015=c(107.12, 93.73, 119.01, 105.65))
dtBabies
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Emma F NA NA 75.74 82.49 93.73
3: Liam M 88.24 NA 101.83 108.23 119.01
4: Ava F 81.77 96.45 NA NA 105.65
The general form of data.table syntax is as follows:
DT[i, j, by]
## R: i j by
## SQL: where | order by select | update group by
The way to read it (out loud) is: take DT, subset/reorder rows using i
, then calculate j
, grouped by by
.
Select specific records:
dtBabies[gender == 'M']
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Liam M 88.24 NA 101.83 108.23 119.01
Select specific columns:
dtBabies[, .(name, gender, year2015)]
name gender year2015
1: Jackson M 107.12
2: Emma F 93.73
3: Liam M 119.01
4: Ava F 105.65
Calculate the mean of a single column:
dtBabies[,
.(meanWeight = mean(year2015))]
meanWeight
1: 106.3775
Calculate the mean of a single column by gender:
dtBabies[,
.(meanWeight = mean(year2015)),
by = gender]
gender meanWeight
1: M 113.065
2: F 99.690
Hardcoding column names in the script is potentially dangerous, for example when for some reason column names change.
Better solution: store column names at the beginning of the script, where it's easy to change them. Then use variables with those names in the code.
lCol = list(meas = 'weight',
time = 'year',
group = c('name', 'gender'),
timeLast = 'year2015')
lCol
$meas
[1] "weight"
$time
[1] "year"
$group
[1] "name" "gender"
$timeLast
[1] "year2015"
Select specific records:
dtBabies[get(lCol$group[[2]]) == 'M']
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Liam M 88.24 NA 101.83 108.23 119.01
Select specific columns:
myColumns = c(lCol$group[[1]], lCol$timeLast)
dtBabies[, ..myColumns]
name year2015
1: Jackson 107.12
2: Emma 93.73
3: Liam 119.01
4: Ava 105.65
The same summary but with column names stored as strings in elements of the list lCol
.
The j
part of the data.table
requires us to use a function get
to interpret the string as the column name.
dtBabies[,
.(meanWeight = mean(get(lCol$timeLast))),
by = c(lCol$group[2])]
gender meanWeight
1: M 113.065
2: F 99.690
Our data table is in the wide format. To convert it to long format, use function melt
from the data.table
package.
Provide the names of identification (id.vars
) and measure variables (measure.vars
). If none are provided, melt
guesses them automatically, which may result in a wrong conversion. Both variables can be given as strings with column names, or as column numbers.
The original data frame contains missing values; na.rm=T
omits them in the long-format table.
dtBabiesLong = data.table::melt(dtBabies,
id.vars = c('name', 'gender'), measure.vars = 3:7,
variable.name = 'year', value.name = 'weight',
na.rm = T)
head(dtBabiesLong, n = 5L)
name gender year weight
1: Jackson M year2011 74.69
2: Liam M year2011 88.24
3: Ava F year2011 81.77
4: Jackson M year2012 84.99
5: Ava F year2012 96.45
The function dcast
converts from long to wide format. The function has a so-called formula interface that specifies a combination of variables that uniquely identify a row.
Note that because some combinations of name + gender + year
do not exist, the dcast
function will introduce NAs
.
dtBabiesWide = data.table::dcast(dtBabiesLong,
name + gender ~ year,
value.var = 'weight')
dtBabiesWide
name gender year2011 year2012 year2013 year2014 year2015
1: Ava F 81.77 96.45 NA NA 105.65
2: Emma F NA NA 75.74 82.49 93.73
3: Jackson M 74.69 84.99 91.73 95.32 107.12
4: Liam M 88.24 NA 101.83 108.23 119.01
One way to use the formula interface with string variables is to create the formula string explicitly by concatenating individual string pieces:
as.formula(
paste0(
lCol$group[[1]], "+",
lCol$group[[2]], "~",
lCol$time))
name + gender ~ year
data.table::fread
Fast reading of the files; use nThread
option to take advantage of multiple threads and read files even faster!
Compressed files can be read directly without decompressing them first!
library(data.table)
library(R.utils) # read gz files directly
mydt = data.table::fread(file = "inFile.csv.gz",
nThread = 4)
data.table::fwrite
for fast writing; also compressed files (gz and bz2).
library(data.table)
library(R.utils) # gzip files
myFilePath = "outFile.csv"
data.table::fwrite(x = mydt,
file = myFilePath,
row.names = F)
# compresses; overwrites any pre-existing files
Rutils::gzip(myFilePath, overwrite = T)
The ggPlot package is a powerfull plotting package that requires data in the long format.
Let's plot weight of individual babies over time.
myCols = c(lCol$time,
lCol$meas)
head(dtBabiesLong[, ..myCols], 5)
year weight
1: year2011 74.69
2: year2011 88.24
3: year2011 81.77
4: year2012 84.99
5: year2012 96.45
library(ggplot2)
ggplot2::ggplot(dtBabiesLong,
aes(x = year,
y = weight)) +
geom_line()
Oops, the plotting function doesn't know how to link the points. The logical way to link them is by the name
column.
Here we add group
option in the ggplot
aesthetics (aes
) to avoid the mistake from the above.
Also, to avoid hard-coding column names we use aes_string
instead of aes
.
The data will be plotted as lines, with additional dots to indicate data points.
myCols = c(lCol$time,
lCol$meas,
lCol$group[1])
head(dtBabiesLong[, ..myCols], 5)
year weight name
1: year2011 74.69 Jackson
2: year2011 88.24 Liam
3: year2011 81.77 Ava
4: year2012 84.99 Jackson
5: year2012 96.45 Ava
ggplot2::ggplot(dtBabiesLong,
aes_string(x = lCol$time,
y = lCol$meas,
group = lCol$group[1])) +
geom_line() +
geom_point()
In order to produce facets per gender
, we use function facet_wrap
.
It uses formula interface, same as in the case of dcast
. Again, we need to build the formula from a string.
sFormula = paste0('~', lCol$group[2])
sFormula
[1] "~gender"
ggplot2::ggplot(dtBabiesLong,
aes_string(x = lCol$time,
y = lCol$meas,
group = lCol$group[1])) +
geom_line() +
geom_point() +
facet_wrap(lCol$group[2])
p1 = ggplot2::ggplot(dtBabiesLong,
aes_string(x = lCol$time,
y = lCol$meas,
group = lCol$group[1])) +
geom_line() +
geom_point() +
facet_wrap(sFormula) +
stat_summary(fun = mean,
aes(group=1),
geom = "line",
colour = 'red')
Note, that the plot was assigned to a variable p1
and there is no graphics output. This is useful if we want to do something additional with the plot. For example, we can add additional ggplot
layers or save the plot to a file.
To display the plot simply invoke the variable:
p1
The group=1
overrides the by-name grouping so that we get a mean across all names for each gender, rather than the mean of each individual name within each gender (which would be the same as individual observations).
Let's add a linear regression line:
ggplot2::ggplot(dtBabiesLong,
aes_string(x = lCol$time,
y = lCol$meas)) +
geom_point() +
facet_wrap(sFormula) +
geom_smooth(method = "lm",
colour = "red")
Oops, regression analysis requires numerical variables! Here, the time in the year
column is a string and is treated as a categorical variable.
Extract numerical value from the year
column and assign it to a yearNum
column. The result of the string substituting function gsub
is also a string, hence we need to convert the result to a number using as.numeric
function.
dtBabiesLong[,
yearNum := as.numeric(gsub("year", "", get(lCol$time)))]
ggplot2::ggplot(dtBabiesLong,
aes_string(x = "yearNum",
y = lCol$meas)) +
geom_point() +
facet_wrap(sFormula) +
geom_smooth(method = "lm",
colour = "red")
Shaded regions correspond to 95% CI.
Lot's of different themes and customisations, e.g. theme_bw()
, theme_minimal()
, etc.
Also check the ggthemes
package for additional themes and colour scales.
p1 = ggplot2::ggplot(dtBabiesLong,
aes_string(x = lCol$time,
y = lCol$meas,
group = lCol$group[1])) +
geom_line() +
geom_point() +
stat_summary(fun.y = mean,
aes(group=1),
geom = "line",
colour = 'red') +
facet_wrap(sFormula) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45,
hjust = 1))
Note, that the plot was assigned to a variable p1
and there is no graphics output. This is useful if we want to do something additional with the plot. For example, we can add additional ggplot
layers or save the plot to a file.
To display the plot simply invoke the variable:
p1
Making an interactive plot from a ggplot
object is extremely easy. Just use ggplotly
function from the amazing plotly
package. The interactive plot will remain in the html document knitted from the R notebook.
library(plotly)
plotly::ggplotly(p1)
Let's write a function to calculate statistics of a data frame. In the simplest case the function will calculate the mean of a single column of a data frame.
We will expand the function to calculate the mean by a group, and to calculate robust statistics, i.e. median instead of the mean.
The function will need the following input parameters:
calcStats = function(inDt,
inMeasVar) {
require(data.table)
outDt = inDt[,
.(meanMeas = mean(get(inMeasVar)))]
return(outDt)
}
Since column names will be provided to our function as string parameters, we cannot hard-code them inside of the function. Therefore, we use function get
to use the string stored in variable inMeasVar
as the column name.
Let's apply the function to our dtBabiesLong
table:
calcStats(inDt = dtBabiesLong,
inMeasVar = "weight")
meanMeas
1: 93.79933
Extension 1: allow for calculating the mean by group via an optional parameter:
calcStats = function(inDt,
inMeasVar,
inGroupName = NULL) {
require(data.table)
outDt = inDt[,
.(meanMeas = mean(get(inMeasVar))),
by = inGroupName]
return(outDt)
}
calcStats(inDt = dtBabiesLong,
inMeasVar = "weight",
inGroupName = "gender")
gender meanMeas
1: M 96.79556
2: F 89.30500
If the parameter inGroupName
is not provided, the mean will be calculated for the entire table without grouping.
Extension 2: add a parameter inRobust
to select between calculating the mean and the median.
calcStats = function(inDt,
inMeasVar,
inGroupName = NULL,
inRobust = F) {
require(data.table)
if (inRobust) {
outDt = inDt[,
.(medianMeas = median(get(inMeasVar))),
by = inGroupName]
} else {
outDt = inDt[,
.(meanMeas = mean(get(inMeasVar))),
by = inGroupName]
}
return(outDt)
}
calcStats(inDt = dtBabiesLong,
inMeasVar = "weight",
inGroupName = "gender",
inRobust = T)
gender medianMeas
1: M 95.32
2: F 88.11
Possible extensions could involve error handling, e.g.:
inMeasVar
and inGroupName
exist in the data table.Once inside the function, click Menu > Code > Insert Roxygen Skeleton (Shift-Option-Command R). A special type of comment will be added above the function. You can add your text next to parameters.
#' Calculates stats of a data frame
#'
#' @param inDt Input data table in the long format
#' @param inMeasVar Name of the measurement column
#' @param inGroupName Name of the grouping column (default NULL)
#' @param inRobust If true, the function calculates median instead of the mean (default False)
#'
#' @return Data table with summary stats
#' @export
#' @import data.table
#'
#' @examples
#' # example usage
calcStats = function(inDt, inMeasVar, inGroupName = NULL, inRobust = F) {
require(data.table)
if (inRobust) {
outDt = inDt[, .(medianMeas = median(get(inMeasVar))), by = inGroupName]
} else {
outDt = inDt[, .(meanMeas = mean(get(inMeasVar))), by = inGroupName]
}
return(outDt)
}
calcStats = function(inDt, inMeasVar, inGroupName = NULL, inRobust = F) {
require(data.table)
if (is.null(inDt))
stop("input data.table is NULL")
if(!is.data.table(inDt))
stop("input is not a data.table")
if(nrow(inDt) == 0)
stop("input data.table has 0 rows")
if(!(inMeasVar %in% names(inDt)))
stop("column name does not exist in the data.table")
if(!is.logical(inRobust))
stop("argument inRobust must be logical, TRUE or FALSE")
if (inRobust) {
outDt = inDt[, .(medianMeas = median(get(inMeasVar))), by = inGroupName]
} else {
outDt = inDt[, .(meanMeas = mean(get(inMeasVar))), by = inGroupName]
}
return(outDt)
}
Source code with this function.
library(testthat)
# calculated result
resCalc = data.table(meas = 1:9)
resTrue = data.table(meanMeas = 5.0)
# Test should pass; NO message will be produced
testthat::expect_equal(calcStats(inDt = resCalc,
inMeasVar = "meas"),
resTrue)
# Test should not pass; error message will appear
resTrue = data.table(meanMeas = 4.0)
testthat::expect_equal(calcStats(inDt = resCalc,
inMeasVar = "meas"),
resTrue)
Error: calcStats(inDt = resCalc, inMeasVar = "meas") not equal to `resTrue`.
Column 'meanMeas': Mean relative difference: 0.2
Create a test
folder and create individual files with tests. The name of these files should start with test_
prefix, e.g. test_calcStats.R
.
library(testthat)
# Source the file with the function to test; adjust the path if required.
source("../calcStats.R")
testthat::test_that("tests", {
expect_error(calcStats(NULL, "x"))
expect_equal(...)
})
Run tests on the entire tests
folder with testthat::test_dir("examples/tests")
or on individual test files with testthat::test_file("examples/tests/test_calcStats.R")
.
The output should like similar to this:
> testthat::test_file("examples/tests/test_calcStats.R")
✓ | OK F W S | Context
✓ | 8 | test_calcStats
══ Results ══
OK: 8
Failed: 0
Warnings: 0
Skipped: 0
The profiler is a tool for helping you to understand how R spends its time. It provides a interactive graphical interface for visualizing data from:
Rprof
, R’s built-in tool for collecting profiling data and, profvis
, a tool for visualizing profiles from R.To make slow code faster, we need accurate information about what is making our code slow.
library(profvis)
# Define data size
nRows = 4e5
nCols = 150
# Create data
# Random Gaussian numbers with mean = 5 , sd = 1
myData = matrix(rnorm(nRows * nCols,
mean = 5),
ncol = nCols)
# Start profiling
profvis({
# Get column means: 2 methods
# 1. Apply the mean function to every column
locColMeans = apply(myData, 2, mean)
# 2. use colMeans function on the entire matrix
locColMeans = colMeans(myData)
# Loop over each column.
# subtract mean from each column.
for (ii in seq_along(locColMeans)) {
myData[, ii] <-
myData[, ii] - locColMeans[ii]
}
})
Use system.time()
function to measure the elapsed time.
nRows = 4e4
nCols = 100
myData = matrix(rnorm(nRows * nCols,
mean = 5),
ncol = nCols)
locColMeans = colMeans(myData)
system.time(
for (ii in seq_along(locColMeans)) {
myData[, ii] <-
myData[, ii] - locColMeans[ii]
}
)
user system elapsed
0.068 0.034 0.146
Use microbenchmark
package to accurately measure and compare the execution time of R expressions.
# Benchmark 3 approaches to center columns:
# 1. using base R, loop over columns and subtract precomputed means,
# 2. using data.table with lapply,
# 3. using data.table, loop over columns and use set() function
mbm = microbenchmark(
base_r = for (ii in seq_along(myColMeans)) {
data1[, ii] <-
data1[, ii] - myColMeans[ii]},
dt_lapply = data2[,
(myColNames) := lapply(.SD,
function(x) x - mean(x)),
.SDcols = myColNames],
dt_for_set = for (ii in myColNames)
set(data3,
j = ii,
value = data3[[ii]] - mean(data3[[ii]])),
times = 100L)
# Display benchmark table
mbm
Unit: milliseconds
expr min lq mean median uq max neval cld
base_r 672.9243 716.4792 878.6226 851.141 921.6990 1529.4567 100 b
dt_lapply 347.4257 367.2748 469.3408 388.986 513.9202 944.1644 100 a
dt_for_set 336.9143 367.0250 445.6910 376.871 499.6137 943.2988 100 a
# Plot results of microbenchmark
ggplot(mbm) +
geom_violin(aes(x = expr,
y = time)) +
coord_flip()
Debugging is designed to help you find bugs by figuring out where the code is not behaving in the way that you expect. To do this, you need to:
Place the browser()
function anywhere in the code to halt the code and start an environment browser.
# Input vector with numbers
vX = c(7,2,8,1,3,17,6,34,37)
# Initialise counter
nCount = 0
# Loop over the vector
for (val in vX) {
if(val %% 2 == 0) {
nCount = nCount+1
cat(val, "is an even number\n")
}
browser()
}
Set breakpoint to stop on a line of code.
In RStudio, you set a breakpoint by clicking to the left of the line number in the editor. A red dot will appear.
Important: the code needs to be sourced first. Click a source
button in the upper right corner of the editor.
A properly set breakpoint
A deferred breakpoint
Requires sourcing the code first.
R is a high-level, interpreted computer language. This means that R takes care of a lot of basic computer tasks for you.
Most of R’s functions are vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time. This makes writing code more concise, easy to read, and less error prone.
Suppose we need to add 10M numbers to each other. This would be a classical way of doing it using a for loop:
nNum = 1e7
vA = rnorm(nNum)
vB = rnorm(nNum)
vCloop = rep(0., nNum)
system.time(
for(ii in 1:nNum) {
vCloop[ii] = vA[ii] + vB[ii]
}
)
user system elapsed
1.509 0.060 4.338
Instead, R provides a convenient operator +
that also works with vectors!
system.time({vCvec = vA + vB})
user system elapsed
0.037 0.001 0.053
The result of two operations is exactly the same, but the latter is way faster!
all.equal(vCloop, vCvec)
[1] TRUE
R notebook with this example.
From R-bloggers:
A FOR loop is the most intuitive way to apply an operation to a series by looping through each item one by one, which makes perfect sense logically but should be avoided by useRs given the low efficiency.
However, for loops can be fast if you follow some rules:
cbind
, rbind
, during the loop! Instead, pre-allocate memory by predefining an object to hold the result. Example of an inefficient loop: at every iteration of the loop, myRes
increases in size because a new element is added to this vector.
nNum = 1e4
vA = rnorm(nNum)
vB = rnorm(nNum)
myRes = c()
system.time(
for(ii in 1:nNum) {
myRes = c(myRes, vA[ii] + vB[ii])
}
)
user system elapsed
0.253 0.216 0.487
# Pre-allocated vector where the result will be written to
vC = rep(0., nNum)
system.time(
for(ii in seq_along(vA)) {
vC[ii] = vA[ii] + vB[ii]
}
)
user system elapsed
0.007 0.000 0.007
all.equal(myRes, vC)
[1] TRUE
The result is the same, but pre-allocating the vC
vector speeds up the computation by two orders of magnitude!
The difference between the computation time increases with the size of the vector because without pre-allocating the memory, R is copying the existing vector before increasing its size with the c
operation.
As an alternative to for loops, try using functions from the apply
family, e.g. lapply
, sapply
, etc. The advantage is that it forces you to encapsulate your code into functions, which makes the code reusable and modular.
R notebook with examples.
For example, we have several CSV files in examples/ex_apply_data
folder that we would like to read and combine into a single data.table
:
library(data.table)
# List all CSV files in the folder; store them in a character vector
vFiles = list.files(path = "examples/ex_apply_data/.",
pattern = "csv",
full.names = T)
# Pre-allocate a list which will hold data tables from individual files
lMyExp = list()
# Loop over file names and read them
for (ii in seq_along(vFiles)) {
lMyExp[[ii]] = fread(vFiles[[ii]])
}
# Combine the list into a data.table
dtMyExp1 = rbindlist(lMyExp)
A simpler way is to use the lapply
function that loops over all elements of the vFiles
vector and applies a function data.table::fread
. As previously, the result is stored in a list.
lMyExp = lapply(vFiles, fread)
dtMyExp2 = rbindlist(lMyExp)
all.equal(dtMyExp1, dtMyExp2)
[1] TRUE
The main reason for using the foreach
package is that it supports parallel execution; it can execute repeated operations on multiple processors/cores on your computer, or on multiple nodes of a cluster.
R notebook with example calculations.
Template code where the result of each iteration is stored in a myResult
list:
library(foreach)
library(doParallel)
numCores = 4
doParallel::registerDoParallel(numCores)
system.time({
myResult = foreach::foreach(ii=1:10,
.packages='...') %dopar%
myFunction(ii)
...
})
doParallel::stopImplicitCluster()
Here, the results of each iteration are combined into a vector:
system.time({
myResult = foreach::foreach(ii=1:10,
.combine = 'c',
.packages='...') %dopar%
myFunction(ii)
...
})
Parse command-line arguments using the optparse
package (code taken from R-bloggers)
Here, we create two command-line parameters -r
and -d
. Both have their respective options in the long format, --rootdir
and --debug
.
The default value for -r
is NULL
which, together with a subsequent if-stop
statement, makes it a required parameter. The -d
has a default value set to FALSE
.
#!/usr/bin/env Rscript
library(optparse)
option_list = list(
optparse::make_option(c("-r", "--rootdir"), type="character", default=NULL,
help="full path to root directory", metavar="character"),
optparse::make_option(c("-d", "--debug"), action="store_true", default = FALSE, dest = 'debug',
help="Print extra output")
);
opt_parser = optparse::OptionParser(option_list=option_list)
opt = optparse::parse_args(opt_parser)
The list opt
contains all the arguments sorted by order of appearance in option_list and which can be called by their names as declared in the object: opt$rootdir
, opt$debug
, etc.
Manage NULL
arguments as follows:
if (is.null(opt$rootdir)){
optparse::print_help(opt_parser)
stop("Please supply a full path to root directory with data to analyse.\n", call.=FALSE)
}
l.par$dir.root = opt$rootdir
cat(sprintf('Working in:\n%s\n\n', l.par$dir.root))
if (opt$debug)
cat('\nEntering debug mode. More output...\n\n')
Thanks to the first line (#!/usr/bin/env Rscript
), a so-called shebang, you can run the script by simply executing ./ex_optparse.R
in the command line. Otherwise, you can type Rscript ex_optparse.R
.
This is the output, if no parameters given:
Loading required package: optparse
Usage: ./ex_optparse.R [options]
Options:
-r CHARACTER, --rootdir=CHARACTER
full path to root directory
-d, --debug
Print extra output
-h, --help
Show this help message and exit
Error: Please supply a full path to root directory with data to analyse.
Execution halted