MIC training: Modern data analysis in R/RStudio

Maciej Dobrzyński (Institute of Cell Biology, University of Bern)
November 3, 2020







Roadmap for this workshop

The first part will demonstrate:

  1. Resources with R courses/tutorials,
  2. Basic programming concepts,
  3. Working with RStudio,
  4. Brief intro to data.table & ggplot2,
  5. Functions, testing, profiling, debugging,
  6. Vectorization,
  7. Parallel computations,
  8. Command-line parameters

R notebook with the code.

During the second part we will process time-series data from a time-lapse microscopy experiment. We will:

  • load data, merge different data sources,
  • clean missing data and outliers,
  • plot different data cuts,
  • perform hierarchical clustering,
  • validate clusters.

Intermediate datasets throughout the workshop:

PDF with an introduction to datasets.

Notebook with the practical session.

Relevant R packages

data.table

Extension of base R's data.frame structure.

Fast data manipulation with a concise SQL-like syntax.

Check out the vignette for an introduction and Advanced tips and tricks with data.table for expanding your knowledge.
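The SQL-like syntax follows the general form dt[i, j, by]: select rows in i, compute on columns in j, grouped by by. Below is a minimal sketch; it assumes the data.table package is installed, and the toy table with its column names is made up for illustration.

```r
library(data.table)

# Hypothetical toy data, not part of the workshop datasets
dt = data.table(
  Exp_cond = c("ctrl", "ctrl", "treat", "treat"),
  Time_h = c(0, 1, 0, 1),
  sensor_ch0 = c(1.0, 1.2, 1.1, 2.3)
)

# dt[i, j, by]:
#   i  - keep rows where Time_h > 0       (SQL: WHERE)
#   j  - compute the mean of sensor_ch0   (SQL: SELECT)
#   by - one result row per Exp_cond      (SQL: GROUP BY)
dtMean = dt[Time_h > 0,
            .(meanMeas = mean(sensor_ch0)),
            by = Exp_cond]

print(dtMean)
```

The result is itself a data.table, so further operations can be chained on it.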

ggplot2

Quickly create publication-ready plots.

Check out the project website for more details.

Online resources - CRAN

Packages are R's greatest strength but may create confusion.

CRAN (The Comprehensive R Archive Network) is a package repository that currently features >15k packages.

https://cran.r-project.org

Aside from an obligatory reference manual, many packages include vignettes, i.e. digestible introductions to working with a package.
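Vignettes can also be listed from within R. The sketch below uses the grid package only because it ships with R; whether vignettes are present depends on the local installation. The commented lines assume that data.table is installed and that its introductory vignette is named datatable-intro.

```r
# List the vignettes that come with an installed package
vigList = vignette(package = "grid")

# The result stores one row per vignette; "Item" holds the vignette names
vigNames = vigList$results[, "Item"]

# Open a single vignette by name (assumes data.table is installed):
# vignette("datatable-intro", package = "data.table")

# Browse vignettes of all installed packages in a web browser:
# browseVignettes()
```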

A note about packages

To access functions provided by R packages, a package needs to be loaded:

library(data.table)

Then, functions such as dcast, melt, etc. are directly available right in the R interpreter.

However, several packages may provide functions with the same name. For example:

library(plyr)
library(Hmisc)

Both provide a function named summarize. Upon loading the second package, R reports which functions it masks:

> require(Hmisc)
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:plyr’:

    is.discrete, summarize

Therefore, it is good practice to call functions with an explicit package reference:

plyr::summarize
Hmisc::summarize
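Masking can be reproduced without installing any package. In the minimal sketch below, a user-defined function named sd (a hypothetical stand-in for a function from a second package) hides the sd function from the stats package, while the :: operator still reaches the original.

```r
# Hypothetical user function with the same name as stats::sd;
# it is found first, so it masks the version from the stats package
sd = function(x) "my own sd"

x = c(2, 4, 4, 4, 5, 5, 7, 9)

resMasked = sd(x)            # calls the user-defined function
resExplicit = stats::sd(x)   # calls the function from the stats package

print(resMasked)
print(resExplicit)
```

The explicit call does not depend on the order in which packages were loaded or on what is defined in the workspace.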

Online resources - Tidyverse

An opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

https://www.tidyverse.org

Online resources - Cheatsheets

Brief, visual summaries of packages' functionality.

https://rstudio.com/resources/cheatsheets/

Courses/tutorials

Programming concepts

Levels of abstraction

Variables

A variable refers to a storage location in the computer's memory, e.g.

myVariable = 5.3

The symbolic name myVariable refers to a memory location that stores the number 5.3.

A variable can vary!

myVariable = myVariable + 2.8

We changed the value referred to by the mnemonic myVariable. Now it stores 8.1.

Data types

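Data stored in variables can have different types. The 5 most commonly used basic types in R are logical, integer, double (numeric), complex, and character; a sixth, raw, holds bytes. Use the function typeof() to check the type of a value.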

Illustration from R tutorial on TechVidvan.
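A short sketch of typeof() applied to values of the common basic types; the variable names are made up for illustration.

```r
# typeof() reports the internal storage type of a value
tLogical = typeof(TRUE)       # "logical"
tInteger = typeof(5L)         # "integer"; the L suffix marks an integer
tDouble = typeof(5.3)         # "double"
tComplex = typeof(2+3i)       # "complex"
tCharacter = typeof("text")   # "character"

# A number typed without the L suffix is stored as a double
tPlainNumber = typeof(5)      # "double"

print(c(tLogical, tInteger, tDouble, tComplex, tCharacter, tPlainNumber))
```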

Data structures

A data structure is a way of storing and organising data. For example, in order to store 10 integers, we could define 10 variables, which isn't very efficient. Instead, we can store these numbers in a vector.

Illustration from R tutorial on TechVidvan.
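As a minimal sketch of the vector example above, ten integers are stored under a single name and accessed by position; the values are arbitrary.

```r
# Ten integers stored in one vector instead of ten separate variables
vNum = c(4L, 8L, 15L, 16L, 23L, 42L, 1L, 2L, 3L, 5L)

nElem = length(vNum)    # number of elements: 10
third = vNum[3]         # single element by position (indexing starts at 1): 15
subVec = vNum[2:4]      # a range of elements: 8 15 16

# Operations apply to the whole vector at once
vTotal = sum(vNum)      # 119
vDoubled = vNum * 2L    # every element multiplied by 2

print(vDoubled)
```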

Control structures

Control structures change the flow of the code. The changes are based on conditions, e.g. if variable a is greater than a certain value, do this, otherwise, do that.

Illustration from R tutorial on TechVidvan.
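The condition described above can be sketched as follows, together with a simple loop as a second kind of control structure; the variable names and values are made up for illustration.

```r
a = 7
threshold = 5

# Conditional: only one of the two branches is executed
if (a > threshold) {
  msg = "a is above the threshold"
} else {
  msg = "a is at or below the threshold"
}
print(msg)

# Loop: the body is repeated once for every element of 1:5
total = 0
for (i in 1:5) {
  total = total + i
}
print(total)
```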

Code structure

Source code with the template.

## Load libraries ----
library(data.table)
library(ggplot2)

## Global variables ----
# Lists with parameters for easy recall
lParRW = list(
  fileIn = "experimentalResults.csv",
  fileOut = "processedData",
  filePlotOut = "boxPlot_activity.pdf"
)

lCol = list(
  time = "Time_h",
  meas = "sensor_ch0",
  group = "Exp_cond"
)

## Custom functions ----
# Define custom functions or 
# load from an external file
source("myFunctionLibrary.R")

locCalcStats = function(...) {
  ...
}

## Read data ----
dt = fread(lParRW$fileIn)

## Clean data ----
# Remove unnecessary columns
dt[,
   c("uselessColumn1",
     "uselessColumn2") := NULL]

## Process data ----

...

## Save output data ----
fwrite(x = dt, 
       file = lParRW$fileOut)

## Save plots ----
p1 = ggplot(dt,
            aes(x = ...,
                y = ...)) +
  geom_line(aes(color = group))

ggsave(filename = lParRW$filePlotOut, 
       plot = p1)
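The template stores column names as strings in lCol, whereas aes(color = group) looks for a column literally named group. One possible way to use the stored names in a plot is the .data pronoun of ggplot2, sketched below. The toy data and the temporary output files are assumptions made for illustration; the sketch needs data.table and ggplot2 to be installed.

```r
library(data.table)
library(ggplot2)

# Column names stored as strings, as in the template
lCol = list(
  time = "Time_h",
  meas = "sensor_ch0",
  group = "Exp_cond"
)

# Hypothetical toy data standing in for the experimental results
dt = data.table(
  Time_h = rep(0:2, times = 2),
  sensor_ch0 = c(1.0, 1.4, 1.9, 1.1, 1.0, 0.9),
  Exp_cond = rep(c("treated", "control"), each = 3)
)

# .data[[name]] looks up a column by a name given as a string
p1 = ggplot(dt,
            aes(x = .data[[lCol$time]],
                y = .data[[lCol$meas]])) +
  geom_line(aes(color = .data[[lCol$group]])) +
  labs(x = lCol$time, y = lCol$meas, color = lCol$group)

# Temporary files, so the working directory stays untouched
fileOut = tempfile(fileext = ".csv")
filePlotOut = tempfile(fileext = ".pdf")

fwrite(x = dt, file = fileOut)
ggsave(filename = filePlotOut, plot = p1, width = 5, height = 3)
```

With this approach, renaming a column in the input data requires a change in lCol only.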

Code formatting