QUANTBIO2019: Processing and clustering of time series from single-cell time-lapse microscopy

Maciej Dobrzynski (Institute of Cell Biology, University of Bern)
September 15, 2019

Some context about our lab

  • Time-lapse microscopy experiments: taking movies of many cells
  • Cells have fluorescent biosensors that measure activities of proteins
  • We look at the dynamics of protein activity
  • Movies are processed to identify cells, measure the fluorescent signal, track cells over time
  • The resulting data: thousands of time series
  • Batch image analysis on a computing cluster generates lots of CSV files
  • Postprocessing: merging files from different experimental conditions, removing outliers, pulling the experimental description from other files

Roadmap for this workshop

The aim of this workshop is to process raw time-series data from a time-lapse microscopy experiment. We will:

  • load data and merge different data sources,
  • clean missing data and outliers,
  • plot different data cuts,
  • perform hierarchical clustering,
  • validate clusters.

Intermediate data sets (milestones) are provided throughout the workshop.

RStudio Projects

Divide your work into multiple contexts with RStudio projects.

Each project has its own working directory, workspace, history, and source documents.

You can create an RStudio project:

  • In a brand new directory
  • In an existing directory where you already have R code and data
  • By cloning a version control (Git or Subversion) repository

File > New Project… to create a new R project

Code formatting

Use # to add comments to your code.

Any comment line with at least four trailing dashes (-), equal signs (=), or pound signs (#) automatically creates a code section.
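
For example, each of these comment lines starts a new, foldable code section in RStudio:

# Load data ----
# Clean data ====
# Plot results ####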

To navigate between code sections, use the Jump To menu available at the bottom of the editor.

The outline of your code sections is available in the upper right corner of the editor window.

RStudio supports both automatic and user-defined folding for regions of code. Code folding allows you to easily show and hide blocks of code to make it easier to navigate your source file and focus on the coding task at hand.

To indent or reformat the code use:

  • Menu > Code > Reindent Lines (⌘I)
  • Menu > Code > Reformat Code (⇧⌘A)

Syntax convention

Stick to a single naming convention throughout your code. A convenient convention is the so-called camel notation (camelCase), where names of variables, constants, and functions are constructed by capitalising each compound of the name, e.g.:

  • calcStatsOfDF - function to calculate stats
  • nIteration - prefix n to indicate an integer variable
  • fPatientAge - f to indicate a float variable
  • sPatientName - s to indicate a string variable
  • vFileNames - v for vector
  • lColumnNames - l for a list

Avoid for loops

From R-bloggers:

A FOR loop is the most intuitive way to apply an operation to a series by looping through each item one by one, which makes perfect sense logically but should be avoided by useRs given the low efficiency.

As an alternative, use the apply family of functions, e.g. lapply.
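
A minimal comparison, squaring the elements of a vector:

# a for loop: preallocate the result, then fill it element by element
vSq = numeric(10)
for (i in 1:10) {
  vSq[i] = i^2
}

# the same with lapply; unlist flattens the returned list into a vector
vSq = unlist(lapply(1:10, function(i) i^2))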

R notebooks

Use R Notebooks to create publication-ready documents with text, graphics, and interactive plots. They can be saved as HTML, PDF, or Word documents.

An R Notebook is a document with chunks that can be executed independently and interactively, with output visible immediately beneath the input. The text is formatted using R Markdown.
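
A minimal chunk of an R Notebook; the R code between the backtick fences is executed and its output appears directly below the chunk:

```{r}
summary(cars)
```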

data.table

The data.table package is a faster, more efficient framework for data manipulation than R's default data.frame. It provides a consistent syntax for subsetting, grouping, updating, merging, etc.

Let's define a data table:

require(data.table)
dtBabies = data.table(name= c("Jackson", "Emma", "Liam", "Ava"), 
                    gender = c("M", "F", "M", "F"), 
                    year2011= c(74.69, NA, 88.24, 81.77), 
                    year2012=c(84.99, NA, NA, 96.45), 
                    year2013=c(91.73, 75.74, 101.83, NA),
                    year2014=c(95.32, 82.49, 108.23, NA),
                    year2015=c(107.12, 93.73, 119.01, 105.65))
dtBabies
      name gender year2011 year2012 year2013 year2014 year2015
1: Jackson      M    74.69    84.99    91.73    95.32   107.12
2:    Emma      F       NA       NA    75.74    82.49    93.73
3:    Liam      M    88.24       NA   101.83   108.23   119.01
4:     Ava      F    81.77    96.45       NA       NA   105.65

data.table - general form

The general form of data.table syntax is as follows:

DT[i, j, by]

##   R:                 i                 j        by
## SQL:  where | order by   select | update  group by

The way to read it (out loud) is: take DT, subset/reorder rows using i, then calculate j, grouped by by.
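
For example, all three parts in one call, using the dtBabies table defined above: i keeps only babies measured in 2011, j computes the mean weight gain between 2011 and 2015, and by groups the result by gender:

dtBabies[!is.na(year2011),
         .(meanGain = mean(year2015 - year2011)),
         by = gender]
   gender meanGain
1:      M    31.60
2:      F    23.88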

data.table - selection

Select specific records:

dtBabies[gender == 'M']
      name gender year2011 year2012 year2013 year2014 year2015
1: Jackson      M    74.69    84.99    91.73    95.32   107.12
2:    Liam      M    88.24       NA   101.83   108.23   119.01

Select specific columns:

dtBabies[, .(name, gender, year2015)]
      name gender year2015
1: Jackson      M   107.12
2:    Emma      F    93.73
3:    Liam      M   119.01
4:     Ava      F   105.65

data.table - aggregation

Calculate the mean of a column:

dtBabies[, .(meanWeight = mean(year2015))]
   meanWeight
1:   106.3775

Calculate the mean of a column by gender:

dtBabies[, .(meanWeight = mean(year2015)), by = gender]
   gender meanWeight
1:      M    113.065
2:      F     99.690

data.table - reference by name

In the examples above, column names are given explicitly. Hardcoding them throughout the script is potentially dangerous, for example when column names change between datasets.

It is cleaner to store the column names at the beginning of the script, where they are easy to change, and then refer to them through variables in the code.

lCol = list(meas = 'weight',
            time = 'year',
            group = c('name', 'gender'),
            timeLast = 'year2015')
lCol
$meas
[1] "weight"

$time
[1] "year"

$group
[1] "name"   "gender"

$timeLast
[1] "year2015"

data.table - selection (2)

Select specific records:

dtBabies[get(lCol$group[[2]]) == 'M']
      name gender year2011 year2012 year2013 year2014 year2015
1: Jackson      M    74.69    84.99    91.73    95.32   107.12
2:    Liam      M    88.24       NA   101.83   108.23   119.01

Select specific columns:

myColumns = c(lCol$group[[1]], lCol$timeLast)
dtBabies[, ..myColumns]
      name year2015
1: Jackson   107.12
2:    Emma    93.73
3:    Liam   119.01
4:     Ava   105.65

data.table - aggregation (2)

The same summary, but with the column names stored as strings in the list lCol. In the j part of data.table we need the function get to interpret a string as a column name.

dtBabies[, 
         .(meanWeight = mean(get(lCol$timeLast))), 
         by = c(lCol$group[2])]
   gender meanWeight
1:      M    113.065
2:      F     99.690

Wide to long format

Our data table is in the wide format. To convert it to long format, use the function melt.

Provide the names of identification (id.vars) and measure variables (measure.vars); both can be given as column names (strings) or as column numbers. If none are provided, melt guesses them automatically, which may result in a wrong conversion.

The original data table contains missing values; na.rm = T omits them from the long-format table.

dtBabiesLong = data.table::melt(dtBabies, 
                                id.vars = c('name', 'gender'), measure.vars = 3:7,
                                variable.name = 'year', value.name = 'weight',
                                na.rm = T)
head(dtBabiesLong, n = 5L)
      name gender     year weight
1: Jackson      M year2011  74.69
2:    Liam      M year2011  88.24
3:     Ava      F year2011  81.77
4: Jackson      M year2012  84.99
5:     Ava      F year2012  96.45

Long to wide

The function dcast converts from long to wide format. It has a so-called formula interface that specifies the combination of variables that uniquely identifies a row.

Note that because some combinations of name + gender + year do not exist, the dcast function will introduce NAs.

dtBabiesWide = data.table::dcast(dtBabiesLong, 
                                 name + gender ~ year, 
                                 value.var = 'weight')

dtBabiesWide
      name gender year2011 year2012 year2013 year2014 year2015
1:     Ava      F    81.77    96.45       NA       NA   105.65
2:    Emma      F       NA       NA    75.74    82.49    93.73
3: Jackson      M    74.69    84.99    91.73    95.32   107.12
4:    Liam      M    88.24       NA   101.83   108.23   119.01

Note on formula interface

There are two ways to use the formula interface with string variables:

Create the formula string explicitly:

as.formula(
   paste0(
      lCol$group[[1]], "+", lCol$group[[2]], 
      "~", lCol$time))
name + gender ~ year

Use the reformulate function. Its limitation: a single response variable only; note below that only the first element of response ends up in the formula!

reformulate(response = c(lCol$group[[1]], lCol$group[[2]]),
            termlabels = lCol$time)
name ~ year

data.table - IO

fread

Fast reading of files; use the nThread option to take advantage of multiple threads and read files even faster!

Documentation.

fwrite

For fast writing; also handles compressed files (gz and bz2).

Documentation.
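
A short sketch (the file names are hypothetical; writing .gz directly requires a recent data.table version):

# read a compressed CSV using 4 threads
dt = fread("./data/some_file.csv.gz", nThread = 4)

# fast write; compression is inferred from the .gz extension
fwrite(dt, "./data/some_file_copy.csv.gz")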

Our dataset

The dataset comes from our recent publication, Temporal perturbation of ERK dynamics reveals network architecture of FGF2-MAPK signaling (2019), available here.

PC-12 cells stimulated with 3 different growth factors (EGF, NGF, FGF2) at different concentrations.

The PC-12 cells express a FRET biosensor to indicate activity of the ERK kinase, an end point of the MAPK signalling pathway.

(Figures: the EKAR FRET biosensor and a scheme of the MAPK pathway)

Our dataset (continued)

The three growth factors activate different parts of the MAPK network and result in different signalling dynamics, i.e. different behaviour of ERK activity over time.

Cells stimulated with growth factors were then imaged at single-cell resolution over 3-4 hours, at a 2-minute time resolution.

A single experiment consists of sustained stimulation with a single growth factor at 4 concentrations, each imaged at 4 fields of view. Thus, we have 16 fields of view per experiment.

Data description

The folder data contains:

  • processed data at various milestones of our analysis,
  • 3 sets of files that correspond to 3 growth factor treatments at 4 different concentrations.

(Figure: listing of the data files)

Data description (continued)

A single set from a single experiment includes:

tCoursesSelected_xGFsust_expXXXX.csv.gz a compressed file with single-cell data. Each row corresponds to the measurement of ERK activity in a single cell, at a single time point. Columns:

  • Metadata_Series an integer, 0-15, that corresponds to the field of view.
  • Metadata_T an integer, 0-100, corresponds to the time frame number in a movie from a single field of view.
  • TrackObjects_Label an integer with the track number from the object-tracking algorithm. The number is unique only within a single field of view.
  • Intensity_MeanIntensity_Ratio a float with the FRET-ratio measurement; this is our proxy of ERK activity.

experimentDescription_EGFsust.xlsx an Excel file with experiment description. Columns:

  • Position an integer, 0-15, with the number of the field of view.
  • Channel an integer, 1-4, with the number of the microfluidic channel.
  • Stim_Conc a string with an indication of the GF concentration.

Exercise 1: read data

Aim: read data from a single experiment. Merge single-cell measurements of the EKAR biosensor fluorescence intensity with experimental description.

Hints: use the following functions from the data.table and readxl packages.

  • data.table::fread
  • readxl::read_xlsx
  • data.table::merge

Install packages with install.packages("name_of_the_package").
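
For example, to install the packages used in this exercise in one call:

install.packages(c("data.table", "readxl", "R.utils"))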

Exercise 1: read data - solution part 1

Read single-cell data:

require(R.utils)
require(data.table)
dtEGF = fread("./data/original/tCoursesSelected_EGFsust_exp20170316.csv.gz")

head(dtEGF)
   Metadata_Series Metadata_T TrackObjects_Label
1:               0          0                  3
2:               0          0                  4
3:               0          0                  8
4:               0          0                 12
5:               0          0                 13
6:               0          0                 15
   Intensity_MeanIntensity_Ratio
1:                      1229.246
2:                      1279.513
3:                      1236.920
4:                      1170.460
5:                      1150.294
6:                      1128.049

Exercise 1: read data - solution part 2

Read experimental description:

require(readxl)
dtEGFexp = as.data.table(read_xlsx("./data/original/experimentDescription_EGFsust.xlsx"))

head(dtEGFexp)
   Position Channel Stim_Duration Stim_Conc Stim_Treat
1:        0       1          sust 250 ng/ml        EGF
2:        1       1          sust 250 ng/ml        EGF
3:        2       1          sust 250 ng/ml        EGF
4:        3       1          sust 250 ng/ml        EGF
5:        4       2          sust  25 ng/ml        EGF
6:        5       2          sust  25 ng/ml        EGF
   Acquisition_frequency_min Equilibration_min   Device_type
1:                         2                43 STD 4-channel
2:                         2                43 STD 4-channel
3:                         2                43 STD 4-channel
4:                         2                43 STD 4-channel
5:                         2                43 STD 4-channel
6:                         2                43 STD 4-channel

Exercise 1: read data - solution part 3

Merge single-cell data with experimental description. The common column is the ID of the field of view, which is named Metadata_Series in single-cell data and Position in experimental description.

# columns to keep from experimental description
vsExpDescrCols = c("Position", "Channel", "Stim_Conc", "Stim_Treat")

# merge data
dtEGF = merge(dtEGF, dtEGFexp[, ..vsExpDescrCols], 
              by.x = c("Metadata_Series"), by.y = "Position")

head(dtEGF, n = 3L)
   Metadata_Series Metadata_T TrackObjects_Label
1:               0          0                  3
2:               0          0                  4
3:               0          0                  8
   Intensity_MeanIntensity_Ratio Channel Stim_Conc Stim_Treat
1:                      1229.246       1 250 ng/ml        EGF
2:                      1279.513       1 250 ng/ml        EGF
3:                      1236.920       1 250 ng/ml        EGF

Complete data

Since we have 3 datasets that correspond to 3 experiments with different growth factors, we can repeat the process above and concatenate the 3 data tables at the end.

For large experiments with many output files, it is best to build a vector of all the filenames to read using list.files, and then loop fread over it, e.g.:

# create a list of files to read by searching a folder
vsInputFiles = list.files(path = "./data/original", 
                          pattern = "*.csv.gz", 
                          full.names = T)

# apply a file-reading function to the list of file names
lData = lapply(vsInputFiles, fread)

# name the list elements; otherwise idcol below yields integer indices
names(lData) = basename(vsInputFiles)

# combine data into a single object
# the option idcol adds a column with the name of the list element
dtAll = rbindlist(lData, use.names = T, idcol = "Stim_Treat")

Caveat: not all files have the same columns!
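
If the column sets differ between files, rbindlist can pad the missing columns with NA's:

dtAll = rbindlist(lData, use.names = T, fill = T, idcol = "Stim_Treat")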

Milestone 1

Fetch the data for all growth factors and experiment description from a Milestone 1 file:

dtAll = fread("./data/m1_allGF_wExpDescr.csv.gz")

head(dtAll)
   Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1:        EGF               0          0                  3
2:        EGF               0          0                  4
3:        EGF               0          0                  8
4:        EGF               0          0                 12
5:        EGF               0          0                 13
6:        EGF               0          0                 15
   Intensity_MeanIntensity_Ratio Channel Stim_Conc
1:                      1229.246       1 250 ng/ml
2:                      1279.513       1 250 ng/ml
3:                      1236.920       1 250 ng/ml
4:                      1170.460       1 250 ng/ml
5:                      1150.294       1 250 ng/ml
6:                      1128.049       1 250 ng/ml

Access column names by strings

To make our lives easier, we will define a list whose elements hold the column names as strings:

# create a list with strings for column names

lCol = list(
  fov = "Metadata_Series",
  frame = "Metadata_T",
  trackid = "TrackObjects_Label",
  meas = "Intensity_MeanIntensity_Ratio",
  measNorm = "Intensity_MeanIntensity_Ratio_Norm",
  ch = "Channel",
  treatGF = "Stim_Treat",
  treatConc = "Stim_Conc",
  trackiduni = "TrackObjects_Label_Uni", # we will add this column later
  clid = "Cluster_Label"                 # we will add this column later
)

Access column names by strings (continued)

Now, we can select rows like this:

dtAll[get(lCol$meas) > 2000]
   Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1:        EGF               0         22                 22
2:        FGF               5         65                 18
   Intensity_MeanIntensity_Ratio Channel Stim_Conc
1:                      4060.864       1 250 ng/ml
2:                      2085.110       2  25 ng/ml

These are measurements of EKAR fluorescence above a certain threshold.

In fact, these measurements are outliers.

Exercise 2: unique time series ID's

Before we proceed, let's make our lives easier and add a column with a unique track identifier. The current TrackObjects_Label column contains track ID's that are unique only within a single field of view. To make them unique, we shall combine the Stim_Treat, Metadata_Series, and TrackObjects_Label columns into a single string.

Aim: create a column with unique time series ID's. It will become quite handy later on…

Hint: use paste, paste0, or sprintf functions to concatenate strings from different columns.

A unique track ID will allow us to easily pick individual time series, plot them, etc.

Exercise 2: unique time series ID's - solution

dtAll[, (lCol$trackiduni) := sprintf("%s_%02d_%04d", 
                                     get(lCol$treatGF), 
                                     get(lCol$fov), 
                                     get(lCol$trackid))]
head(dtAll)
   Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1:        EGF               0          0                  3
2:        EGF               0          0                  4
3:        EGF               0          0                  8
4:        EGF               0          0                 12
5:        EGF               0          0                 13
6:        EGF               0          0                 15
   Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1:                      1229.246       1 250 ng/ml            EGF_00_0003
2:                      1279.513       1 250 ng/ml            EGF_00_0004
3:                      1236.920       1 250 ng/ml            EGF_00_0008
4:                      1170.460       1 250 ng/ml            EGF_00_0012
5:                      1150.294       1 250 ng/ml            EGF_00_0013
6:                      1128.049       1 250 ng/ml            EGF_00_0015

Data exploration

summary(dtAll)
  Stim_Treat        Metadata_Series    Metadata_T     TrackObjects_Label
 Length:120086      Min.   : 0.000   Min.   :  0.00   Min.   : 1.00     
 Class :character   1st Qu.: 3.000   1st Qu.: 25.00   1st Qu.:11.00     
 Mode  :character   Median : 7.000   Median : 50.00   Median :22.00     
                    Mean   : 7.477   Mean   : 49.99   Mean   :23.42     
                    3rd Qu.:12.000   3rd Qu.: 75.00   3rd Qu.:34.00     
                    Max.   :15.000   Max.   :100.00   Max.   :80.00     

 Intensity_MeanIntensity_Ratio    Channel       Stim_Conc        
 Min.   : 677.9                Min.   :1.000   Length:120086     
 1st Qu.:1196.4                1st Qu.:1.000   Class :character  
 Median :1246.3                Median :2.000   Mode  :character  
 Mean   :1257.3                Mean   :2.485                     
 3rd Qu.:1311.9                3rd Qu.:4.000                     
 Max.   :4060.9                Max.   :4.000                     
 NA's   :18                                                      
 TrackObjects_Label_Uni
 Length:120086         
 Class :character      
 Mode  :character      




We have missing data! (NA's in the measurement column).

We may also have point and dynamic outliers.

Missing data

head(dtAll[is.na(get(lCol$meas))])
   Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1:        FGF               8         11                  1
2:        FGF               8         11                  2
3:        FGF               8         11                  5
4:        FGF               8         11                  6
5:        FGF               8         11                  7
6:        FGF               8         11                  8
   Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1:                            NA       3 2.5 ng/ml            FGF_08_0001
2:                            NA       3 2.5 ng/ml            FGF_08_0002
3:                            NA       3 2.5 ng/ml            FGF_08_0005
4:                            NA       3 2.5 ng/ml            FGF_08_0006
5:                            NA       3 2.5 ng/ml            FGF_08_0007
6:                            NA       3 2.5 ng/ml            FGF_08_0008

Missing data (continued)

All NA's come from a single time point in the same FOV; different tracks are affected. One way to deal with NA's is to impute them with interpolated values. The imputeTS package has a handy function, na_interpolation. But do we have enough data around the NA's to interpolate?

Let's print one such time series:

dtAll[get(lCol$trackiduni) == "FGF_08_0001"][[lCol$meas]]
  [1] 1154.882 1153.657 1150.764 1147.915 1140.173 1142.111 1140.136
  [8] 1140.837 1144.007 1135.256 1132.608       NA 1133.615 1134.720
 [15] 1145.947 1130.349 1135.619 1138.593 1139.691 1137.407 1138.597
 [22] 1130.640 1166.818 1184.005 1162.116 1173.394 1196.259 1202.515
 [29] 1235.769 1236.971 1272.345 1266.500 1251.723 1228.215 1226.551
 [36] 1218.169 1214.489 1229.251 1236.079 1231.706 1226.196 1227.705
 [43] 1212.287 1195.712 1198.995 1202.976 1222.753 1215.568 1202.693
 [50] 1217.286 1217.344 1218.442 1221.851 1226.046 1217.408 1192.199
 [57] 1185.705 1181.744 1191.051 1200.445 1198.971 1203.541 1215.357
 [64] 1215.048 1213.177 1226.709 1222.884 1236.786 1216.284 1220.544
 [71] 1207.355 1219.985 1210.953 1221.475 1233.104 1236.179 1239.964
 [78] 1238.919 1235.951 1219.092 1225.653 1225.001 1227.338 1217.194
 [85] 1225.857 1242.707 1241.310 1241.725 1221.622 1219.596 1230.118
 [92] 1236.985 1227.344 1233.488 1248.268 1242.010 1233.652 1228.994
 [99] 1231.844 1243.105 1255.455

Missing data (continued 2)

Let's also check the length of all other time series with NA's.

# make a vector with strings of unique track ids of time series that contain NA's
vsTracksWithNAs = dtAll[is.na(get(lCol$meas))][[lCol$trackiduni]]
vsTracksWithNAs
 [1] "FGF_08_0001" "FGF_08_0002" "FGF_08_0005" "FGF_08_0006" "FGF_08_0007"
 [6] "FGF_08_0008" "FGF_08_0010" "FGF_08_0011" "FGF_08_0014" "FGF_08_0017"
[11] "FGF_08_0019" "FGF_08_0020" "FGF_08_0024" "FGF_08_0021" "FGF_08_0023"
[16] "FGF_08_0022" "FGF_08_0025" "FGF_08_0027"

Missing data (continued 3)

# Key the table according to a unique track ID
setkeyv(dtAll, lCol$trackiduni)

# calculate the number of time points of tracks with NA's
head(
  dtAll[vsTracksWithNAs, .N, by = c(lCol$trackiduni)], n = 12L)
    TrackObjects_Label_Uni   N
 1:            FGF_08_0001 101
 2:            FGF_08_0002 101
 3:            FGF_08_0005 101
 4:            FGF_08_0006 101
 5:            FGF_08_0007 101
 6:            FGF_08_0008 101
 7:            FGF_08_0010 101
 8:            FGF_08_0011 100
 9:            FGF_08_0014 101
10:            FGF_08_0017 101
11:            FGF_08_0019 101
12:            FGF_08_0020 101

Exercise 3: identify missing data

Looks like we have another problem… The track FGF_08_0011 has one time point fewer than the others. The entire row with the measurement is missing!

Aim: calculate the number of time points per time series. Check the min and max.

Hint: the special symbol .N gives the number of rows per group.

Exercise 3: identify missing data - solution

summary(dtAll[, .N, by = c(lCol$trackiduni)][["N"]])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   96.0   101.0   101.0   100.9   101.0   101.0 

Handling missing data

This dataset has two types of missing data:

  • NA's in the measurement for some data points.
  • Missing rows.

Depending on the analysis, missing data might or might not pose a problem.

In our case:

  • We will stretch the data to introduce NA's for time points that lack rows entirely. A neat solution is based on this; see the sketch below.
  • Interpolate the NA's using the na_interpolation function from the imputeTS package.
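
A minimal sketch of the stretching step, assuming the lCol list defined earlier; joining onto the cross join (CJ) of all track IDs and all frame numbers inserts the missing rows with NA's in the measurement column:

# key the table by the unique track ID and the frame number
setkeyv(dtAll, c(lCol$trackiduni, lCol$frame))

# right join onto all combinations of track ID x frames 0..100;
# combinations absent from dtAll become rows filled with NA's
dtAll = dtAll[CJ(unique(dtAll[[lCol$trackiduni]]), 0:100)]

# note: the metadata columns of the newly added rows are NA as well
# and need to be filled, e.g. carried over within each track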

Milestone 2

Fetch the dataset with interpolated NA's from Milestone 2:

dtAll = fread("./data/m2_allGF_wExpDescr_noNAs.csv.gz")

Plot single-cell data

require(ggplot2)
p0 = ggplot(dtAll, 
            aes_string(x = lCol$frame, y = lCol$meas, 
                       group = lCol$trackiduni)) +
  geom_path(alpha = 0.1) # parameter alpha adds transparency
p0


Exercise 4A: remove point outliers

We have outliers… two types of outliers, in fact. Some time series have single time points that are way above the average, so-called point outliers. There are also several time series that behave entirely differently from the rest, so-called dynamic outliers.

Aim: remove single time points above the threshold 2000, and impute them with interpolated data.

Hint: replace outlier measurements with NA's and interpolate them with imputeTS::na_interpolation.

Exercise 4A: remove point outliers - solution

# replace time points with measurements above the threshold with NA's
dtAll[get(lCol$meas) > 2000, 
      (lCol$meas) := NA]

# interpolate NA's
require(imputeTS)
dtAll[, 
      (lCol$meas) := na_interpolation(get(lCol$meas)), 
      by = c(lCol$trackiduni)]

Exercise 4B: remove dynamic outliers

Aim: remove entire trajectories that have any measurement below 1000.

Hint: create a vector with unique track ID's for time series that have measurements below the threshold. Subset dtAll using that vector.

Exercise 4B: remove dynamic outliers - solution

# vector with unique track id's of dynamic outliers
vsTrackOut = unique(dtAll[get(lCol$meas) < 1000][[lCol$trackiduni]])

# leave only tracks that are not in the outlier vector
dtAll = dtAll[!(get(lCol$trackiduni) %in% vsTrackOut)]

# clean
rm(vsTrackOut)

Handling outliers

There's no single recipe for handling outliers; it all depends on the analysis.

A handy interactive tool written by Mauro Gwerder (our Bachelor student) can help with that process. The R/Shiny app can be downloaded from here and executed from RStudio.

Milestone 3

Fetch the dataset without outliers from Milestone 3:

dtAll = fread("./data/m3_allGF_wExpDescr_noNAs_noOut.csv.gz")

Plot per condition

# same as above; repeated for convenience
p0 = ggplot(dtAll, aes_string(x = lCol$frame, 
                              y = lCol$meas, 
                              group = lCol$trackiduni)) +
  geom_path(alpha = 0.1) + # parameter alpha adds transparency 
  facet_grid(reformulate(lCol$treatGF, lCol$treatConc))

p0


Exercise 5: normalisation

The baseline level before stimulation is very heterogeneous (due to biological and technical noise). At this stage we do not care about that variability; we only want to cluster the responses. Thus, we may get better results by normalising every time series to its baseline (i.e. the first 20 time points).

Aim: add a column to dtAll with a normalised measurement where every time series is divided by the mean of its first 20 time points. Plot normalised data.

Hint: a column with the mean of first 20 elements of a group can be added this way:

dt[, newCol := mean(.SD[1:20, meas]), by = uniqueID]

.SD corresponds to the subset of data defined by the by clause. It is a temporary data table and can be used as such.
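
For instance, with the dtBabies table from earlier, .SD can be indexed like an ordinary data table; here we pick the first row of each gender:

dtBabies[, .SD[1], by = gender]
   gender    name year2011 year2012 year2013 year2014 year2015
1:      M Jackson    74.69    84.99    91.73    95.32   107.12
2:      F    Emma       NA       NA    75.74    82.49    93.73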

Exercise 5: normalisation - solution

# add a column with the mean of the baseline for every time series
dtAll[, 
      baseline := mean(.SD[1:20, get(lCol$meas)]), # add a new column with the mean of first 20 rows of the group
      by = c(lCol$trackiduni)] # group by unique trajectory

# add a column with normalized measurement
dtAll[,
      (lCol$measNorm) := get(lCol$meas) / baseline]

# remove baseline column
dtAll[, baseline := NULL]

Milestone 4

Fetch the dataset with normalised measurement column from Milestone 4:

dtAll = fread("./data/m4_allGF_wExpDescr_noNAs_noOut_norm0-20.csv.gz")

Plot normalised data

Plot per condition using normalized data

# same as above; repeated for convenience
p0 = ggplot(dtAll, aes_string(x = lCol$frame, 
                              y = lCol$measNorm, 
                              group = lCol$trackiduni)) +
  geom_path(alpha = 0.1) + # parameter alpha adds transparency
  facet_grid(reformulate(lCol$treatGF, lCol$treatConc))
p0


Add mean to the plot

p1 = p0 +
  stat_summary(
      aes_string(y = lCol$measNorm, group = 1),
      fun.y = mean, geom = "line", group = 1,
      colour = 'red', linetype = 'solid', size = 1)

p1


Beautifying plots

Add themes, e.g. + theme_bw(), or themes available in packages such as ggthemes.

Use + labs() to add the title, subtitle, and caption.

Use + xlab() or + ylab() to control the labels of the x and y axes.

require(ggthemes)
p1 +
  theme_minimal() +
  labs(title = "ERK activity in response to sustained GF treatments",
       caption = paste0("Created on ", Sys.Date())) +
  xlab("Time (min)") +
  ylab("ERK activity")


Exercise 6: interactive plots

Aim: make an interactive plot

Hint: use plotly::ggplotly function

Exercise 6: interactive plots - solution

require(plotly)
ggplotly(p1)

Exercise 7: time point snapshots

Aim: plot ERK activity at selected time points: baseline, peak, relaxation period. Visualise as box-, dot-, violin plots, or their combination.

Hint: use the %in% operator to select rows, e.g.

dt[frame %in% c(10, 20, 30)]

Use ggplot2 functions:

geom_boxplot()
geom_violin()
geom_dotplot(binaxis = "y", 
             stackdir = "center", 
             position = "dodge",
             binwidth = .01, 
             binpositions = "all")

Exercise 7: time point snapshots - solution

ggplot(dtAll[get(lCol$frame) %in% c(10, 25, 50)], 
       aes(x = as.factor(Metadata_T), 
           y = Intensity_MeanIntensity_Ratio_Norm)) +
  geom_violin(fill = NA) +
  geom_boxplot(fill = NA,
               outlier.shape = NA) +
  facet_grid(reformulate(lCol$treatGF, lCol$treatConc))


Exercise 8: long to wide format

In order to cluster the data and plot the result as a heatmap, we will use the handy function heatmap.2 from the gplots package.

To do so, we need to convert dtAll from the long format to a wide-format matrix, where rows are individual time series and columns are time points.

Aim: convert dtAll to a matrix in a wide format

Hint: use dcast function.

Exercise 8: long to wide format - solution

# long to wide format
# every row corresponds to a single time series; columns are time points
dtAllWide = dcast(dtAll, 
                  TrackObjects_Label_Uni ~ Metadata_T, 
                  value.var = lCol$measNorm)

# obtain row names for later
vsRowNames = dtAllWide[[lCol$trackiduni]]

# convert to a matrix; omit the first column
mAllWide = as.matrix(dtAllWide[, -1])

# assign row names to the matrix (useful for later plotting heatmaps from clustering)
rownames(mAllWide) = vsRowNames

# clean
rm(vsRowNames, dtAllWide)

Hierarchical clustering

require(gplots)
heatmap.2(mAllWide)


Prettify the heatmap

We want to cluster rows only. Clustering columns makes no sense here, because columns are time points whose order we need to maintain in the plot.

Tune:

  • remove the dendrogram for columns with the option dendrogram = "row", and switch off column reordering with Colv = F,
  • display the density of measured values on the colour key with density.info = "density",
  • apply a custom palette, col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
  • add labels with xlab, ylab, and key.xlab.

Prettify the heatmap (continued)

require(RColorBrewer)

heatmap.2(mAllWide, 
          dendrogram = "row", Colv = F,
          trace = "none", density.info = "density",
          col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
          xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")


Piping

Before we proceed, let's introduce a very handy concept in R: piping. The pipe operator, %>%, expresses a sequence of operations without creating intermediate variables. It comes with the magrittr package, which is loaded automatically by tidyverse packages.
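
A minimal illustration; both calls below compute the same value:

require(magrittr)

# nested calls, read inside out
round(sqrt(sum(1:10)), 2) # 7.42

# the same as a pipe, read left to right
1:10 %>% sum() %>% sqrt() %>% round(2) # 7.42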

Colour dendrogram branches:

require(proxy)
require(dendextend)

# create a coloured dendrogram using a given distance and a linkage method
myRowDend  <- mAllWide %>% 
  proxy::dist(., "euclidean") %>% # calculate distance
  stats::hclust(., "complete") %>% # create a dendrogram
  stats::as.dendrogram(.) %>%
  dendextend::set("branches_k_color", k = 4) %>% # color k main branches
  dendextend::set("branches_lwd", 2) %>% # set line width of the dendrogram
  dendextend::ladderize(.) # reorganize the dendrogram

Heatmap with coloured dendrogram

heatmap.2(mAllWide, 
          dendrogram = "row", Colv = F,
          Rowv = myRowDend, # use our coloured dendrogram to order rows
          trace = "none", density.info = "density",
          col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
          xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")


Interactive heatmap

require(d3heatmap)
d3heatmap(mAllWide,
          dendrogram = "row", Colv = F,
          Rowv = myRowDend,
          trace = "none", density.info = "density", 
          show_grid = F,
          col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
          xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")

Shape clustering

The approaches above treat individual time points as independent measurements. Instead, shape-based approaches such as Dynamic Time Warping (DTW) are suited to calculating distances between entire time series.

Some resources:

  • Time series clustering with Dynamic Time Warping (link).
  • Comparing Time-Series Clustering Algorithms in R Using the dtwclust Package (link).
  • Calculating a distance matrix by dtw (link).

Shape clustering (continued)

We will use the dtwclust and proxy packages to calculate the distance matrix.

# From: https://stackoverflow.com/a/50776685/1898713
require(dtwclust)

myRowDend  <- mAllWide %>% 
  proxy::dist(., method = "dtw_basic", pattern = symmetric1, norm = "L1", window.size = 10L) %>% 
  stats::as.dist(.) %>% # conversion required because dtw_basic returns a cross-distance matrix; it is symmetric, thus we take the lower triangle
  stats::hclust(., "complete") %>% 
  stats::as.dendrogram(.) %>%
  dendextend::set("branches_k_color", k = 5) %>% 
  dendextend::set("branches_lwd", 2) %>%
  dendextend::ladderize(.)

Shape clustering (continued 2)

heatmap.2(mAllWide, 
          dendrogram = "row", Colv = F,
          Rowv = myRowDend, # use our coloured dendrogram to order rows
          trace = "none", density.info = "density",
          col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
          xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")


Extract cluster information

After performing hierarchical clustering and cutting the dendrogram at the desired level, we extract the assignment of cluster numbers to individual time series.

Step 1: hierarchical clustering

clTree  <- mAllWide %>% 
  stats::dist(., method = "euclidean") %>% 
  stats::hclust(., "complete") 

str(clTree)
List of 7
 $ merge      : int [1:1186, 1:2] -75 -20 -93 -92 -87 -499 -98 -102 -52 -516 ...
 $ height     : num [1:1186] 0.0858 0.0867 0.0917 0.0973 0.1016 ...
 $ order      : int [1:1187] 968 117 392 677 187 729 444 993 991 459 ...
 $ labels     : chr [1:1187] "EGF_00_0003" "EGF_00_0004" "EGF_00_0008" "EGF_00_0012" ...
 $ method     : chr "complete"
 $ call       : language stats::hclust(d = ., method = "complete")
 $ dist.method: chr "euclidean"
 - attr(*, "class")= chr "hclust"

Extract cluster information (continued)

Step 2: cut the dendrogram

clAssign = dendextend::cutree(clTree, k = 6)
head(clAssign)
EGF_00_0003 EGF_00_0004 EGF_00_0008 EGF_00_0012 EGF_00_0013 EGF_00_0015 
          1           1           1           2           1           1 

Extract cluster information (continued 2)

Convert named vector to a data table.

dtClAssign = as.data.table(clAssign, keep.rownames = T)
setnames(dtClAssign, c(lCol$trackiduni, lCol$clid))
head(dtClAssign)
   TrackObjects_Label_Uni Cluster_Label
1:            EGF_00_0003             1
2:            EGF_00_0004             1
3:            EGF_00_0008             1
4:            EGF_00_0012             2
5:            EGF_00_0013             1
6:            EGF_00_0015             1

Extract cluster information (continued 3)

Step 3: merge the original time series with the cluster assignments.

dtAllCl = merge(dtAll, dtClAssign, by = lCol$trackiduni)

# convert cluster label to a factor
dtAllCl[, (lCol$clid) := as.factor(get(lCol$clid))]

head(dtAllCl, n = 3L)
   TrackObjects_Label_Uni Stim_Treat Metadata_T
1:            EGF_00_0003        EGF          0
2:            EGF_00_0003        EGF          1
3:            EGF_00_0003        EGF          2
   Intensity_MeanIntensity_Ratio Stim_Conc
1:                      1229.246 250 ng/ml
2:                      1219.209 250 ng/ml
3:                      1222.809 250 ng/ml
   Intensity_MeanIntensity_Ratio_Norm Cluster_Label
1:                           1.012986             1
2:                           1.004714             1
3:                           1.007681             1

Milestone 5

If you are completely lost, here's a file with the dataset resulting from the previous steps:

dtAllCl = fread("./data/m5_allGF_wExpDescr_noNAs_noOut_norm0-20_cl.csv.gz")

# convert cluster label to a factor
dtAllCl[, (lCol$clid) := as.factor(get(lCol$clid))]

Exercise 9: plot time series per cluster

Aim: plot time series in each cluster, include the population mean

Hint: use facetting per cluster

Exercise 9: plot time series per cluster - solution

ggplot(dtAllCl, aes_string(x = lCol$frame, y = lCol$measNorm, 
                           group = lCol$trackiduni)) +
  geom_path(alpha = 0.1) +
  stat_summary(
      aes_string(y = lCol$measNorm, group = 1),
      fun.y = mean,
      geom = "line", group = 1,
      colour = 'red', linetype = 'solid', size = 1) +
  facet_wrap(lCol$clid) +
  theme_minimal() +
  labs(title = "Clusters of ERK activity dynamic responses",
       caption = paste0("Created on ", Sys.Date())) +
  xlab("Time (min)") +
  ylab("ERK activity (normalised)")


Exercise 10: contribution of clusters per condition

Aim: calculate and plot the composition of experimental conditions with respect to clusters.

Hint: perform data aggregation to calculate the number of time series per group, per cluster. The shortcut to calculate the number of rows in data.table is .N, e.g.

dt[, .(nTimeSer = .N), by = group]

Exercise 10: contribution of clusters per condition - solution

Aggregate and assign the result to a new data table dtAllClN:

dtAllClN = dtAllCl[, 
                   .(ntc = .N), 
                   by = c(lCol$treatGF, lCol$treatConc, lCol$clid)]

dtAllClN[1:5]
   Stim_Treat Stim_Conc Cluster_Label  ntc
1:        EGF 250 ng/ml             1 7474
2:        EGF 250 ng/ml             2 2828
3:        EGF 250 ng/ml             3  303
4:        EGF 250 ng/ml             4  101
5:        EGF 250 ng/ml             5  101

Exercise 10: contribution of clusters per condition - solution (continued)

# for percentages on y-axis in ggplot
require(scales)

# The core plot: concentrations on the x-axis, the number of time series on y-axis
p5 = ggplot(dtAllClN, aes_string(x = lCol$treatConc, y = "ntc"))

# Facetting per growth factor
p5 = p5 +
  facet_wrap(lCol$treatGF)

# Stacked bar plot with bars coloured by cluster number
# Bars are stretched to "fill" the y-axis using the option position = position_fill()
p5 = p5 +
  geom_bar(aes_string(fill = lCol$clid), 
           stat = "identity", 
           position = position_fill())

# Convert y-axis labels to percentages (0-100%) instead of (0-1)
p5 = p5 +
  scale_y_continuous(labels = percent)

Exercise 10: contribution of clusters per condition - solution (continued 2)

# Use a nice colour palette for colour fill
p5 = p5 +
  scale_fill_manual("Cluster:", 
                    values = ggthemes::tableau_color_pal("Color Blind")(6))

# Prettify the plot; add labels, etc
p5 = p5 +
  theme_minimal() +
  labs(title = "Participation of clusters in experimental conditions",
       caption = paste0("Created on ", Sys.Date())) +
  xlab("") + 
  ylab("Percentage") +
  theme(axis.text.x =  element_text(angle = 45, hjust = 1))

Exercise 10: contribution of clusters per condition - solution (continued 3)

p5


Clustering validation

Based on Clustering Validation Statistics.

Clustering is an unsupervised machine-learning method for partitioning a dataset into a set of groups or clusters. A big issue is that clustering methods will return clusters even if the data does not contain any. Therefore, it’s necessary to:

  • assess clustering tendency before the analysis,
  • validate the quality of the result after clustering.

A variety of measures has been proposed in the literature for evaluating clustering results. The term clustering validation denotes the procedure of evaluating the results of a clustering algorithm.

Clustering validation (continued)

Generally, clustering validation statistics can be categorized into 4 classes (Theodoridis and Koutroubas, 2008; G. Brock et al., 2008, Charrad et al., 2014):

Relative clustering validation, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g. varying the number of clusters k). It’s generally used for determining the optimal number of clusters.

External clustering validation, which consists in comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. Since we know the “true” cluster number in advance, this approach is mainly used for selecting the right clustering algorithm for a specific dataset.

Internal clustering validation, which uses the internal information of the clustering process to evaluate the goodness of a clustering structure without reference to external information. It can also be used for estimating the number of clusters and choosing the appropriate clustering algorithm without any external data.

Clustering stability validation, a special version of internal validation. It evaluates the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time.

Optimal number of clusters - Silhouette method

require(factoextra)
# Silhouette method
factoextra::fviz_nbclust(mAllWide, hcut, method = "silhouette") +
  labs(subtitle = "Silhouette method")


Optimal number of clusters - Gap statistics method

# Gap statistic
# nboot = 10 to keep the function speedy. 
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(123)
factoextra::fviz_nbclust(mAllWide, hcut, method = "gap_stat",
                         nstart = 25, nboot = 10)+
  labs(subtitle = "Gap statistic method")


30 indices for choosing the best number of clusters

We’ll use the package NbClust which will compute, with a single function call, 30 indices for deciding the right number of clusters in the dataset:

require(NbClust)
nb <- NbClust(mAllWide, 
              distance = "euclidean", 
              min.nc = 2, max.nc = 10, 
              method = "complete", 
              index ="all")


*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 


*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 

******************************************************************* 
* Among all indices:                                                
* 5 proposed 2 as the best number of clusters 
* 9 proposed 3 as the best number of clusters 
* 3 proposed 4 as the best number of clusters 
* 1 proposed 7 as the best number of clusters 
* 1 proposed 9 as the best number of clusters 
* 2 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            

* According to the majority rule, the best number of clusters is  3 


******************************************************************* 

Visualise NbClust

# Visualize the result
factoextra::fviz_nbclust(nb) + theme_minimal()

Among all indices:

  • 2 proposed 0 as the best number of clusters
  • 5 proposed 2 as the best number of clusters
  • 9 proposed 3 as the best number of clusters
  • 3 proposed 4 as the best number of clusters
  • 1 proposed 7 as the best number of clusters
  • 1 proposed 9 as the best number of clusters
  • 2 proposed 10 as the best number of clusters
  • 3 proposed NA's as the best number of clusters

Conclusion

  • According to the majority rule, the best number of clusters is 3.


Visualise clusters

hc.res <- factoextra::eclust(mAllWide, 
                             "hclust", k = 3, graph = FALSE,
                             hc_method = "complete", 
                             hc_metric = "euclidean")

# Visualize clusters
factoextra::fviz_cluster(hc.res, geom = "point", frame.type = "norm")


Visualise silhouette

factoextra::fviz_silhouette(hc.res)
  cluster size ave.sil.width
1       1  893          0.21
2       2  274          0.48
3       3   20          0.48


Interactive clustering

A free R/Shiny app is available here.