Maciej Dobrzynski (Institute of Cell Biology, University of Bern)
September 15, 2019
The aim of this workshop is to process raw time-series data from a time-lapse microscopy experiment. We will:
Intermediate data sets throughout the workshop:
Divide your work into multiple contexts with RStudio projects.
Each project has its own working directory, workspace, history, and source documents.
You can create an RStudio project:
File > New Project… to create a new R project
Use # to add comments to your code.
Any comment line with at least four trailing dashes (----), equal signs (====), or pound signs (####) automatically creates a code section.
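For example, the following layout (the section names are just an illustration) creates three foldable, navigable sections:

```r
# Load data ----
vMeas = c(1.2, 3.4, 2.1)

# Process ====
vMeasNorm = vMeas / mean(vMeas)

# Report ####
print(round(vMeasNorm, 2))
```

Any of the three trailing styles works; RStudio lists the section names in the Jump To menu and in the outline.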
To navigate between code sections, use the Jump To menu available at the bottom of the editor.
The outline of your code sections is available in the upper right corner of the editor window.
RStudio supports both automatic and user-defined folding for regions of code. Code folding allows you to easily show and hide blocks of code to make it easier to navigate your source file and focus on the coding task at hand.
To indent or reformat the code use:
Stick to a single naming convention throughout the code. A convenient convention is so-called camel case, where names of variables, constants, and functions are constructed by capitalizing each compound of the name, e.g.:
- calcStatsOfDF - a function to calculate stats
- nIteration - prefix n to indicate an integer variable
- fPatientAge - f to indicate a float variable
- sPatientName - s to indicate a string variable
- vFileNames - v for a vector
- lColumnNames - l for a list

From R-bloggers:
A FOR loop is the most intuitive way to apply an operation to a series by looping through each item one by one, which makes perfect sense logically but should be avoided by useRs given the low efficiency.

As an alternative, use the apply function family, e.g. lapply.
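A minimal sketch contrasting the two approaches (squaring is just a stand-in operation):

```r
vNumbers = 1:5

# FOR loop: preallocate, then fill the result element by element
vSquaresLoop = numeric(length(vNumbers))
for (i in seq_along(vNumbers)) {
  vSquaresLoop[i] = vNumbers[i]^2
}

# lapply: apply the function to every element; it returns a list, hence unlist
vSquaresApply = unlist(lapply(vNumbers, function(x) x^2))

identical(vSquaresLoop, vSquaresApply)  # TRUE
```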
Use R Notebooks to create publish-ready documents with text, graphics, and interactive plots. Notebooks can be saved as HTML, PDF, or Word documents.
An R Notebook is a document with chunks that can be executed independently and interactively, with output visible immediately beneath the input. The text is formatted using R Markdown.
The data.table package is a faster, more efficient framework for data manipulation than R's default data.frame. It provides a consistent syntax for subsetting, grouping, updating, merging, etc.
Let's define a data table:
require(data.table)
dtBabies = data.table(name = c("Jackson", "Emma", "Liam", "Ava"),
                      gender = c("M", "F", "M", "F"),
                      year2011 = c(74.69, NA, 88.24, 81.77),
                      year2012 = c(84.99, NA, NA, 96.45),
                      year2013 = c(91.73, 75.74, 101.83, NA),
                      year2014 = c(95.32, 82.49, 108.23, NA),
                      year2015 = c(107.12, 93.73, 119.01, 105.65))
dtBabies
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Emma F NA NA 75.74 82.49 93.73
3: Liam M 88.24 NA 101.83 108.23 119.01
4: Ava F 81.77 96.45 NA NA 105.65
The general form of data.table syntax is as follows:
DT[i, j, by]
## R: i j by
## SQL: where | order by select | update group by
The way to read it (out loud) is: take DT, subset/reorder rows using i, then calculate j, grouped by by.
Select specific records:
dtBabies[gender == 'M']
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Liam M 88.24 NA 101.83 108.23 119.01
Select specific columns:
dtBabies[, .(name, gender, year2015)]
name gender year2015
1: Jackson M 107.12
2: Emma F 93.73
3: Liam M 119.01
4: Ava F 105.65
Calculate the mean of a column:
dtBabies[, .(meanWeight = mean(year2015))]
meanWeight
1: 106.3775
Calculate the mean of a column by gender:
dtBabies[, .(meanWeight = mean(year2015)), by = gender]
gender meanWeight
1: M 113.065
2: F 99.690
In the above example, column names are given explicitly. Hardcoding them in the script is potentially dangerous, for example when the column names change for some reason.
It is cleaner to store the column names at the beginning of the script, where they are easy to change, and then use variables with those names in the code.
lCol = list(meas = 'weight',
time = 'year',
group = c('name', 'gender'),
timeLast = 'year2015')
lCol
$meas
[1] "weight"
$time
[1] "year"
$group
[1] "name" "gender"
$timeLast
[1] "year2015"
Select specific records:
dtBabies[get(lCol$group[[2]]) == 'M']
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Liam M 88.24 NA 101.83 108.23 119.01
Select specific columns:
myColumns = c(lCol$group[[1]], lCol$timeLast)
dtBabies[, ..myColumns]
name year2015
1: Jackson 107.12
2: Emma 93.73
3: Liam 119.01
4: Ava 105.65
The same summary, but with column names stored as strings in elements of the list lCol. The j part of the data.table syntax requires the function get to interpret a string as a column name.
dtBabies[,
.(meanWeight = mean(get(lCol$timeLast))),
by = c(lCol$group[2])]
gender meanWeight
1: M 113.065
2: F 99.690
Our data table is in the wide format. To convert it to the long format, use the function melt.
Provide the names of identification variables (id.vars) and measure variables (measure.vars). If none are provided, melt guesses them automatically, which may result in a wrong conversion. Both can be given as strings with column names, or as column numbers.
The original data frame contains missing values; na.rm = T omits them in the long-format table.
dtBabiesLong = data.table::melt(dtBabies,
id.vars = c('name', 'gender'), measure.vars = 3:7,
variable.name = 'year', value.name = 'weight',
na.rm = T)
head(dtBabiesLong, n = 5L)
name gender year weight
1: Jackson M year2011 74.69
2: Liam M year2011 88.24
3: Ava F year2011 81.77
4: Jackson M year2012 84.99
5: Ava F year2012 96.45
The function dcast converts from the long to the wide format. It has a so-called formula interface that specifies the combination of variables that uniquely identifies a row.
Note that because some combinations of name + gender + year do not exist, dcast will introduce NAs.
dtBabiesWide = data.table::dcast(dtBabiesLong,
name + gender ~ year,
value.var = 'weight')
dtBabiesWide
name gender year2011 year2012 year2013 year2014 year2015
1: Ava F 81.77 96.45 NA NA 105.65
2: Emma F NA NA 75.74 82.49 93.73
3: Jackson M 74.69 84.99 91.73 95.32 107.12
4: Liam M 88.24 NA 101.83 108.23 119.01
There are two ways to use the formula interface with string variables:
Create formula string explicitly
as.formula(
paste0(
lCol$group[[1]], "+", lCol$group[[2]],
"~", lCol$time))
name + gender ~ year
Use the reformulate function. Its limitation: a single response variable only!
reformulate(response = c(lCol$group[[1]], lCol$group[[2]]),
termlabels = lCol$time)
name ~ year
fread provides fast reading of files; use the nThread option to take advantage of multiple threads and read files even faster!
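A minimal, self-contained sketch (the temporary file and toy column names are made up for illustration; assumes the data.table package is installed):

```r
require(data.table)

# write a small table to a temporary csv file
sTmpFile = tempfile(fileext = ".csv")
fwrite(data.table(id = 1:3, val = c(0.1, 0.2, 0.3)), sTmpFile)

# read it back; nThread controls the number of parallel reader threads,
# which pays off for large files
dtIn = fread(sTmpFile, nThread = 2)
dtIn
```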
The data come from our recent publication Temporal perturbation of ERK dynamics reveals network architecture of FGF2-MAPK signaling (2019), available here.
PC-12 cells stimulated with 3 different growth factors (EGF, NGF, FGF2) at different concentrations.
The PC-12 cells express a FRET biosensor to indicate activity of the ERK kinase, an end point of the MAPK signalling pathway.
Three different growth factors tested activate different parts of the MAPK network and result in different signalling dynamics, i.e. the behaviour of ERK activity over time.
Cells stimulated with growth factors were then imaged at single-cell resolution over 3-4 hours with a 2-minute time resolution.
A single experiment consists of a sustained stimulation with a single growth factor at 4 concentrations imaged at 4 fields of view. Thus, we have 16 fields of view per experiment.
The folder data
contains:
A single set from a single experiment includes:
tCoursesSelected_xGFsust_expXXXX.csv.gz - a compressed file with single-cell data. Each row corresponds to the measurement of ERK activity in a single cell at a single time point. Columns:

- Metadata_Series - an integer, 0-15, that corresponds to the field of view.
- Metadata_T - an integer, 0-100, that corresponds to the time frame number in a movie from a single field of view.
- TrackObjects_Label - an integer that describes the track number from the object-tracking algorithm. The number is unique only within a single field of view.
- Intensity_MeanIntensity_Ratio - a float with the measurement of the FRET ratio; this is our proxy of ERK activity.

experimentDescription_EGFsust.xlsx - an Excel file with the experiment description. Columns:

- Position - an integer, 0-15, with the number of the field of view.
- Channel - an integer, 1-4, with the number of the microfluidic channel.
- Stim_Conc - a string with an indication of the GF concentration.

Aim: read data from a single experiment. Merge single-cell measurements of the EKAR biosensor fluorescence intensity with the experiment description.
Hints:
Use the following functions from the data.table and readxl packages:

- data.table::fread
- readxl::read_xlsx
- data.table::merge

Install packages with install.packages("name_of_the_package").
Read single-cell data:
require(R.utils)
require(data.table)
dtEGF = fread("./data/original/tCoursesSelected_EGFsust_exp20170316.csv.gz")
head(dtEGF)
Metadata_Series Metadata_T TrackObjects_Label
1: 0 0 3
2: 0 0 4
3: 0 0 8
4: 0 0 12
5: 0 0 13
6: 0 0 15
Intensity_MeanIntensity_Ratio
1: 1229.246
2: 1279.513
3: 1236.920
4: 1170.460
5: 1150.294
6: 1128.049
Read experimental description:
require(readxl)
dtEGFexp = as.data.table(read_xlsx("./data/original/experimentDescription_EGFsust.xlsx"))
head(dtEGFexp)
Position Channel Stim_Duration Stim_Conc Stim_Treat
1: 0 1 sust 250 ng/ml EGF
2: 1 1 sust 250 ng/ml EGF
3: 2 1 sust 250 ng/ml EGF
4: 3 1 sust 250 ng/ml EGF
5: 4 2 sust 25 ng/ml EGF
6: 5 2 sust 25 ng/ml EGF
Acquisition_frequency_min Equilibration_min Device_type
1: 2 43 STD 4-channel
2: 2 43 STD 4-channel
3: 2 43 STD 4-channel
4: 2 43 STD 4-channel
5: 2 43 STD 4-channel
6: 2 43 STD 4-channel
Merge the single-cell data with the experiment description. The common column is the ID of the field of view, which is named Metadata_Series in the single-cell data and Position in the experiment description.
# columns to keep from experimental description
vsExpDescrCols = c("Position", "Channel", "Stim_Conc", "Stim_Treat")
# merge data
dtEGF = merge(dtEGF, dtEGFexp[, ..vsExpDescrCols],
by.x = c("Metadata_Series"), by.y = "Position")
head(dtEGF, n = 3L)
Metadata_Series Metadata_T TrackObjects_Label
1: 0 0 3
2: 0 0 4
3: 0 0 8
Intensity_MeanIntensity_Ratio Channel Stim_Conc Stim_Treat
1: 1229.246 1 250 ng/ml EGF
2: 1279.513 1 250 ng/ml EGF
3: 1236.920 1 250 ng/ml EGF
Since we have 3 datasets that correspond to 3 experiments with different growth factors, we can repeat the process above and concatenate the 3 data tables at the end.
For large experiments with many output files, it is best to make a list/vector of all filenames to read using list.files, and then loop fread over that list, e.g.:
# create a list of files to read by searching a folder
vsInputFiles = list.files(path = "./data/original",
pattern = "*.csv.gz",
full.names = T)
# apply a file-reading function to the list of file names
lData = lapply(vsInputFiles, fread)
# combine data into a single object
# the option idcol adds a column with the name of the list element
dtAll = rbindlist(lData, use.names = T, idcol = "Stim_Treat")
Caveat: not all files have the same columns!
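If the columns differ between files, rbindlist can still combine the tables by filling in the missing columns with NA's; a minimal sketch with toy tables (assumes the data.table package is installed):

```r
require(data.table)

lData = list(EGF = data.table(Metadata_T = 0:1, Ratio = c(1229, 1219)),
             NGF = data.table(Metadata_T = 0:1))  # column Ratio is missing here

# fill = TRUE inserts NA's for columns absent from some of the tables
dtComb = rbindlist(lData, use.names = TRUE, fill = TRUE, idcol = "Stim_Treat")
dtComb
```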
Fetch the data for all growth factors and experiment description from a Milestone 1 file:
dtAll = fread("./data/m1_allGF_wExpDescr.csv.gz")
head(dtAll)
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: EGF 0 0 3
2: EGF 0 0 4
3: EGF 0 0 8
4: EGF 0 0 12
5: EGF 0 0 13
6: EGF 0 0 15
Intensity_MeanIntensity_Ratio Channel Stim_Conc
1: 1229.246 1 250 ng/ml
2: 1279.513 1 250 ng/ml
3: 1236.920 1 250 ng/ml
4: 1170.460 1 250 ng/ml
5: 1150.294 1 250 ng/ml
6: 1128.049 1 250 ng/ml
To make our lives easier, we will define a list with variables that contain strings with column names:
# create a list with strings for column names
lCol = list(
fov = "Metadata_Series",
frame = "Metadata_T",
trackid = "TrackObjects_Label",
meas = "Intensity_MeanIntensity_Ratio",
measNorm = "Intensity_MeanIntensity_Ratio_Norm",
ch = "Channel",
treatGF = "Stim_Treat",
treatConc = "Stim_Conc",
trackiduni = "TrackObjects_Label_Uni", # we will add this column later
clid = "Cluster_Label" # we will add this column later
)
Now, we can select rows like this:
dtAll[get(lCol$meas) > 2000]
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: EGF 0 22 22
2: FGF 5 65 18
Intensity_MeanIntensity_Ratio Channel Stim_Conc
1: 4060.864 1 250 ng/ml
2: 2085.110 2 25 ng/ml
These are measurements of EKAR fluorescence above a certain threshold.
In fact, these measurements are outliers.
Before we proceed, let's make our lives easier and add a column with a unique track identifier. The current TrackObjects_Label column contains track ID's that are unique only within a single field of view. To make them unique, we shall combine the Stim_Treat, Metadata_Series, and TrackObjects_Label columns into a single string.
Aim: create a column with unique time series ID's. It will become quite handy later on…
Hint: use the paste, paste0, or sprintf functions to concatenate strings from different columns.
A unique track ID will allow us to easily pick individual time series, plot them, etc.
dtAll[, (lCol$trackiduni) := sprintf("%s_%02d_%04d",
get(lCol$treatGF),
get(lCol$fov),
get(lCol$trackid))]
head(dtAll)
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: EGF 0 0 3
2: EGF 0 0 4
3: EGF 0 0 8
4: EGF 0 0 12
5: EGF 0 0 13
6: EGF 0 0 15
Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1: 1229.246 1 250 ng/ml EGF_00_0003
2: 1279.513 1 250 ng/ml EGF_00_0004
3: 1236.920 1 250 ng/ml EGF_00_0008
4: 1170.460 1 250 ng/ml EGF_00_0012
5: 1150.294 1 250 ng/ml EGF_00_0013
6: 1128.049 1 250 ng/ml EGF_00_0015
summary(dtAll)
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
Length:120086 Min. : 0.000 Min. : 0.00 Min. : 1.00
Class :character 1st Qu.: 3.000 1st Qu.: 25.00 1st Qu.:11.00
Mode :character Median : 7.000 Median : 50.00 Median :22.00
Mean : 7.477 Mean : 49.99 Mean :23.42
3rd Qu.:12.000 3rd Qu.: 75.00 3rd Qu.:34.00
Max. :15.000 Max. :100.00 Max. :80.00
Intensity_MeanIntensity_Ratio Channel Stim_Conc
Min. : 677.9 Min. :1.000 Length:120086
1st Qu.:1196.4 1st Qu.:1.000 Class :character
Median :1246.3 Median :2.000 Mode :character
Mean :1257.3 Mean :2.485
3rd Qu.:1311.9 3rd Qu.:4.000
Max. :4060.9 Max. :4.000
NA's :18
TrackObjects_Label_Uni
Length:120086
Class :character
Mode :character
We have missing data (NA's in the measurement column)!
We may also have point and dynamic outliers.
head(dtAll[is.na(get(lCol$meas))])
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: FGF 8 11 1
2: FGF 8 11 2
3: FGF 8 11 5
4: FGF 8 11 6
5: FGF 8 11 7
6: FGF 8 11 8
Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1: NA 3 2.5 ng/ml FGF_08_0001
2: NA 3 2.5 ng/ml FGF_08_0002
3: NA 3 2.5 ng/ml FGF_08_0005
4: NA 3 2.5 ng/ml FGF_08_0006
5: NA 3 2.5 ng/ml FGF_08_0007
6: NA 3 2.5 ng/ml FGF_08_0008
All NA's come from a single time point in the same FOV; different tracks are affected. One way to deal with NA's is to impute them with interpolated values. The imputeTS package has a handy function, na_interpolation. But do we have enough data around the NA's to interpolate? Let's print one such time series:
dtAll[get(lCol$trackiduni) == "FGF_08_0001"][[lCol$meas]]
[1] 1154.882 1153.657 1150.764 1147.915 1140.173 1142.111 1140.136
[8] 1140.837 1144.007 1135.256 1132.608 NA 1133.615 1134.720
[15] 1145.947 1130.349 1135.619 1138.593 1139.691 1137.407 1138.597
[22] 1130.640 1166.818 1184.005 1162.116 1173.394 1196.259 1202.515
[29] 1235.769 1236.971 1272.345 1266.500 1251.723 1228.215 1226.551
[36] 1218.169 1214.489 1229.251 1236.079 1231.706 1226.196 1227.705
[43] 1212.287 1195.712 1198.995 1202.976 1222.753 1215.568 1202.693
[50] 1217.286 1217.344 1218.442 1221.851 1226.046 1217.408 1192.199
[57] 1185.705 1181.744 1191.051 1200.445 1198.971 1203.541 1215.357
[64] 1215.048 1213.177 1226.709 1222.884 1236.786 1216.284 1220.544
[71] 1207.355 1219.985 1210.953 1221.475 1233.104 1236.179 1239.964
[78] 1238.919 1235.951 1219.092 1225.653 1225.001 1227.338 1217.194
[85] 1225.857 1242.707 1241.310 1241.725 1221.622 1219.596 1230.118
[92] 1236.985 1227.344 1233.488 1248.268 1242.010 1233.652 1228.994
[99] 1231.844 1243.105 1255.455
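As a quick sketch of what na_interpolation does to an internal gap (the toy numbers mimic the series above; assumes the imputeTS package is installed):

```r
require(imputeTS)

vMeas = c(1132.6, NA, 1133.6)

# by default, internal NA's are replaced by linear interpolation
na_interpolation(vMeas)  # the NA becomes the midpoint of its neighbours, 1133.1
```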
Let's also check the length of all other time series with NA's.
# make a vector with strings of unique track ids of time series that contain NA's
vsTracksWithNAs = dtAll[is.na(get(lCol$meas))][[lCol$trackiduni]]
vsTracksWithNAs
[1] "FGF_08_0001" "FGF_08_0002" "FGF_08_0005" "FGF_08_0006" "FGF_08_0007"
[6] "FGF_08_0008" "FGF_08_0010" "FGF_08_0011" "FGF_08_0014" "FGF_08_0017"
[11] "FGF_08_0019" "FGF_08_0020" "FGF_08_0024" "FGF_08_0021" "FGF_08_0023"
[16] "FGF_08_0022" "FGF_08_0025" "FGF_08_0027"
# Key the table according to a unique track ID
setkeyv(dtAll, lCol$trackiduni)
# calculate the number of time points of tracks with NA's
head(
dtAll[vsTracksWithNAs, .N, by = c(lCol$trackiduni)], n = 12L)
TrackObjects_Label_Uni N
1: FGF_08_0001 101
2: FGF_08_0002 101
3: FGF_08_0005 101
4: FGF_08_0006 101
5: FGF_08_0007 101
6: FGF_08_0008 101
7: FGF_08_0010 101
8: FGF_08_0011 100
9: FGF_08_0014 101
10: FGF_08_0017 101
11: FGF_08_0019 101
12: FGF_08_0020 101
Looks like we have another problem… The track FGF_08_0011 has one time point fewer than the others. The entire row with the measurement is missing!
Aim: calculate the number of time points per time series. Check the min and max.
Hint: The .N
calculates the number of rows.
summary(dtAll[, .N, by = c(lCol$trackiduni)][["N"]])
Min. 1st Qu. Median Mean 3rd Qu. Max.
96.0 101.0 101.0 100.9 101.0 101.0
This dataset has two types of missing data:

- NA's in the measurement for some data points.
- Entire rows missing for some time points of a time series.

Depending on the analysis, missing data might or might not pose a problem. In our case:

- Add NA's for the time points that miss data entirely. A neat solution is based on this.
- Interpolate the NA's using the na_interpolation function from the imputeTS package.

Fetch the dataset with interpolated NA's from Milestone 2:
dtAll = fread("./data/m2_allGF_wExpDescr_noNAs.csv.gz")
require(ggplot2)
p0 = ggplot(dtAll,
aes_string(x = lCol$frame, y = lCol$meas,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) # parameter alpha adds transparency
p0
We have outliers… actually, two types of outliers. Some time series have single time points that are way above the average. There are also several time series that behave entirely differently from the rest, so-called dynamic outliers.
Aim: remove single time points above the threshold 2000, and impute them with interpolated data.
Hint: replace outlier measurements with NA's and interpolate them with imputeTS::na_interpolation.
# replace time points with measurements above the threshold with NA's
dtAll[get(lCol$meas) > 2000,
(lCol$meas) := NA]
# interpolate NA's
require(imputeTS)
dtAll[,
(lCol$meas) := na_interpolation(get(lCol$meas)),
by = c(lCol$trackiduni)]
Aim: remove entire trajectories if the measurement is below 1000.
Hint: create a vector with unique track ID's of time series that have measurements below the threshold. Subset dtAll using that vector.
# vector with unique track id's of dynamic outliers
vsTrackOut = unique(dtAll[get(lCol$meas) < 1000][[lCol$trackiduni]])
# leave only tracks that are not in the outlier vector
dtAll = dtAll[!(get(lCol$trackiduni) %in% vsTrackOut)]
# clean
rm(vsTrackOut)
There's no single recipe for handling outliers; it all depends on the analysis.
A handy interactive tool written by Mauro Gwerder (our Bachelor student) can help with that process. The R/Shiny app can be downloaded from here and executed from RStudio.
Fetch the dataset without outliers from Milestone 3:
dtAll = fread("./data/m3_allGF_wExpDescr_noNAs_noOut.csv.gz")
# same as above; repeated for convenience
p0 = ggplot(dtAll, aes_string(x = lCol$frame,
y = lCol$meas,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) + # parameter alpha adds transparency
facet_grid(reformulate(lCol$treatGF, lCol$treatConc))
p0
The baseline level before stimulation is very heterogeneous (due to biological and technical noise). At this stage we do not care about that variability; we only want to cluster the responses. Thus, we might get better results by normalising every time series to its baseline (i.e. the first 20 time points).
Aim: add a column to dtAll
with a normalised measurement where every time series is divided by the mean of its first 20 time points. Plot normalised data.
Hint: a column with the mean of first 20 elements of a group can be added this way:
dt[, newCol := mean(.SD[1:20, meas]), by = uniqueID]
.SD corresponds to the subset of data defined by the by clause. It is a temporary data table and can be used as such.
# add a column with the mean of the baseline for every time series
dtAll[,
baseline := mean(.SD[1:20, get(lCol$meas)]), # add a new column with the mean of first 20 rows of the group
by = c(lCol$trackiduni)] # group by unique trajectory
# add a column with normalized measurement
dtAll[,
(lCol$measNorm) := get(lCol$meas) / baseline]
# remove baseline column
dtAll[, baseline := NULL]
Fetch the dataset with normalised measurement column from Milestone 4:
dtAll = fread("./data/m4_allGF_wExpDescr_noNAs_noOut_norm0-20.csv.gz")
Plot per condition using normalized data
# same as above; repeated for convenience
p0 = ggplot(dtAll, aes_string(x = lCol$frame,
y = lCol$measNorm,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) + # parameter alpha adds transparency
facet_grid(reformulate(lCol$treatGF, lCol$treatConc))
p0
p1 = p0 +
stat_summary(
aes_string(y = lCol$measNorm, group = 1),
fun.y = mean, geom = "line", group = 1,
colour = 'red', linetype = 'solid', size = 1)
p1
Add themes, e.g. + theme_bw(), or themes available in packages such as ggthemes.
Use + labs() to add the title, subtitle, and caption.
Use + xlab() or + ylab() to control the labels of the x and y axes.
require(ggthemes)
p1 +
theme_minimal() +
labs(title = "ERK activity in response to sustained GF treatments",
caption = paste0("Created on ", Sys.Date())) +
xlab("Time (min)") +
ylab("ERK activity")
Aim: make an interactive plot.
Hint: use the plotly::ggplotly function.
require(plotly)
ggplotly(p1)
Aim: plot ERK activity at selected time points: baseline, peak, relaxation period. Visualise as box-, dot-, violin plots, or their combination.
Hint: use the %in% operator to select rows, e.g.
dt[frame %in% c(10, 20, 30)]
Use these ggplot2 functions:
geom_boxplot()
geom_violin()
geom_dotplot(binaxis = "y",
             stackdir = "center",
             position = "dodge",
             binwidth = .01,
             binpositions = "all")
ggplot(dtAll[get(lCol$frame) %in% c(10, 25, 50)],
aes(x = as.factor(Metadata_T),
y = Intensity_MeanIntensity_Ratio_Norm)) +
geom_violin(fill = NA) +
geom_boxplot(fill = NA,
outlier.shape = NA) +
facet_grid(reformulate(lCol$treatGF, lCol$treatConc))
In order to cluster the data and plot the results of clustering as a heatmap, we will use the handy function heatmap.2 from the gplots package.
To do so, we need to convert dtAll from the long format to a wide-format matrix. Such a matrix contains individual time series as rows; columns correspond to time points.
Aim: convert dtAll to a matrix in the wide format.
Hint: use the dcast function.
# long to wide format
# every row corresponds to a single time series; column are time points
dtAllWide = dcast(dtAll,
TrackObjects_Label_Uni ~ Metadata_T,
value.var = lCol$measNorm)
# obtain row names for later
vsRowNames = dtAllWide[[lCol$trackiduni]]
# convert to a matrix; omit the first column
mAllWide = as.matrix(dtAllWide[, -1])
# assign row names to the matrix (useful for later plotting heatmaps from clustering)
rownames(mAllWide) = vsRowNames
# clean
rm(vsRowNames, dtAllWide)
require(gplots)
heatmap.2(mAllWide)
We want to cluster according to rows only. Clustering according to columns doesn't make sense here because the columns are time points, whose order we need to maintain in the plot.
Tune:

- dendrogram = "row"; clustering according to columns is removed with Colv = F,
- density.info = "density",
- col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
- axis labels with xlab, ylab, and key.xlab.

require(RColorBrewer)
heatmap.2(mAllWide,
dendrogram = "row", Colv = F,
trace = "none", density.info = "density",
col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")
Before we proceed, let's introduce a very handy concept in R called piping. The pipe is an operator, %>%, that allows you to express a sequence of operations without creating intermediate variables. The pipe operator comes with the magrittr package, which is loaded automatically by tidyverse packages.
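A minimal sketch of the same computation with and without pipes (assuming the magrittr package is installed):

```r
require(magrittr)

vNumbers = c(4, 1, 9, 16)

# nested calls read inside-out
vRes1 = sqrt(sort(vNumbers))

# the piped version reads left to right: sort, then take the square root
vRes2 = vNumbers %>%
  sort() %>%
  sqrt()

identical(vRes1, vRes2)  # TRUE
```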
Colour dendrogram branches:
require(proxy)
require(dendextend)
# create a coloured dendrogram using a given distance and a linkage method
myRowDend <- mAllWide %>%
proxy::dist(., "euclidean") %>% # calculate distance
stats::hclust(., "complete") %>% # create a dendrogram
stats::as.dendrogram(.) %>%
dendextend::set("branches_k_color", k = 4) %>% # color k main branches
dendextend::set("branches_lwd", 2) %>% # set line width of the dendrogram
dendextend::ladderize(.) # reorganize the dendrogram
heatmap.2(mAllWide,
dendrogram = "row", Colv = F,
Rowv = myRowDend, # use our coloured dendrogram to order rows
trace = "none", density.info = "density",
col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")
require(d3heatmap)
d3heatmap(mAllWide,
dendrogram = "row", Colv = F,
Rowv = myRowDend,
trace = "none", density.info = "density",
show_grid = F,
col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")
The above approaches treat individual time points as independent measurements. Instead, shape-based clustering such as Dynamic Time Warping (DTW) is suited to calculating distances between entire time series.
Some resources:
We will use the dtwclust and proxy packages to calculate the distance matrix:
# From: https://stackoverflow.com/a/50776685/1898713
require(dtwclust)
myRowDend <- mAllWide %>%
proxy::dist(., method = "dtw_basic", pattern = symmetric1, norm = "L1", window.size = 10L) %>%
stats::as.dist(.) %>% # conversion required because dtw_basic returns a cross-distance matrix; it is symmetric, thus we take the lower triangle
stats::hclust(., "complete") %>%
stats::as.dendrogram(.) %>%
dendextend::set("branches_k_color", k = 5) %>%
dendextend::set("branches_lwd", 2) %>%
dendextend::ladderize(.)
heatmap.2(mAllWide,
dendrogram = "row", Colv = F,
Rowv = myRowDend, # use our coloured dendrogram to order rows
trace = "none", density.info = "density",
col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")
After performing hierarchical clustering and after cutting the dendrogram at a desired level, we extract assignments of cluster numbers to individual time series.
Step 1: hierarchical clustering
clTree <- mAllWide %>%
stats::dist(., method = "euclidean") %>%
stats::hclust(., "complete")
str(clTree)
List of 7
$ merge : int [1:1186, 1:2] -75 -20 -93 -92 -87 -499 -98 -102 -52 -516 ...
$ height : num [1:1186] 0.0858 0.0867 0.0917 0.0973 0.1016 ...
$ order : int [1:1187] 968 117 392 677 187 729 444 993 991 459 ...
$ labels : chr [1:1187] "EGF_00_0003" "EGF_00_0004" "EGF_00_0008" "EGF_00_0012" ...
$ method : chr "complete"
$ call : language stats::hclust(d = ., method = "complete")
$ dist.method: chr "euclidean"
- attr(*, "class")= chr "hclust"
Step 2: cut the dendrogram
clAssign = dendextend::cutree(clTree, k = 6)
head(clAssign)
EGF_00_0003 EGF_00_0004 EGF_00_0008 EGF_00_0012 EGF_00_0013 EGF_00_0015
1 1 1 2 1 1
Convert named vector to a data table.
dtClAssign = as.data.table(clAssign, keep.rownames = T)
setnames(dtClAssign, c(lCol$trackiduni, lCol$clid))
head(dtClAssign)
TrackObjects_Label_Uni Cluster_Label
1: EGF_00_0003 1
2: EGF_00_0004 1
3: EGF_00_0008 1
4: EGF_00_0012 2
5: EGF_00_0013 1
6: EGF_00_0015 1
Step 3: Merge original time series with cluster assignments for individual time series.
dtAllCl = merge(dtAll, dtClAssign, by = lCol$trackiduni)
# convert cluster label to a factor
dtAllCl[, (lCol$clid) := as.factor(get(lCol$clid))]
head(dtAllCl, n = 3L)
TrackObjects_Label_Uni Stim_Treat Metadata_T
1: EGF_00_0003 EGF 0
2: EGF_00_0003 EGF 1
3: EGF_00_0003 EGF 2
Intensity_MeanIntensity_Ratio Stim_Conc
1: 1229.246 250 ng/ml
2: 1219.209 250 ng/ml
3: 1222.809 250 ng/ml
Intensity_MeanIntensity_Ratio_Norm Cluster_Label
1: 1.012986 1
2: 1.004714 1
3: 1.007681 1
If you are completely lost, here's a file with the dataset resulting from the previous steps:
dtAllCl = fread("./data/m5_allGF_wExpDescr_noNAs_noOut_norm0-20_cl.csv.gz")
# convert cluster label to a factor
dtAllCl[, (lCol$clid) := as.factor(get(lCol$clid))]
Aim: plot time series in each cluster, include the population mean
Hint: use facetting per cluster
ggplot(dtAllCl, aes_string(x = lCol$frame, y = lCol$measNorm,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) +
stat_summary(
aes_string(y = lCol$measNorm, group = 1),
fun.y = mean,
geom = "line", group = 1,
colour = 'red', linetype = 'solid', size = 1) +
facet_wrap(lCol$clid) +
theme_minimal() +
labs(title = "Clusters of ERK activity dynamic responses",
caption = paste0("Created on ", Sys.Date())) +
xlab("Time (min)") +
ylab("ERK activity (normalised)")
Aim: calculate and plot the composition of experimental conditions with respect to clusters.
Hint: perform data aggregation to calculate the number of time series per group, per cluster. The data.table shortcut for the number of rows is .N, e.g.
dt[, .(nTimeSer = .N), by = group]
Aggregate and assign the result to a new data table dtAllClN:
dtAllClN = dtAllCl[,
.(ntc = .N),
by = c(lCol$treatGF, lCol$treatConc, lCol$clid)]
dtAllClN[1:5]
Stim_Treat Stim_Conc Cluster_Label ntc
1: EGF 250 ng/ml 1 7474
2: EGF 250 ng/ml 2 2828
3: EGF 250 ng/ml 3 303
4: EGF 250 ng/ml 4 101
5: EGF 250 ng/ml 5 101
# for percentages on y-axis in ggplot
require(scales)
# The core plot: concentrations on the x-axis, the number of time series on y-axis
p5 = ggplot(dtAllClN, aes_string(x = lCol$treatConc, y = "ntc"))
# Facetting per growth factor
p5 = p5 +
facet_wrap(lCol$treatGF)
# Stacked bar plot with bars coloured by cluster number
# Bars are stretched to "fill" the y-axis using the option position = position_fill()
p5 = p5 +
geom_bar(aes_string(fill = lCol$clid),
stat = "identity",
position = position_fill())
# Convert y-axis labels to percentages (0-100%) instead of (0-1)
p5 = p5 +
scale_y_continuous(labels = percent)
# Use a nice colour palette for colour fill
p5 = p5 +
scale_fill_manual("GF:",
values = ggthemes::tableau_color_pal("Color Blind")(6))
# Prettify the plot; add labels, etc
p5 = p5 +
theme_minimal() +
labs(title = "Participation of clusters in experimental conditions",
caption = paste0("Created on ", Sys.Date())) +
xlab("") +
ylab("Percentage") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p5
Based on Clustering Validation Statistics.
Clustering is an unsupervised machine-learning method for partitioning a dataset into a set of groups, or clusters. A big issue is that clustering methods will return clusters even if the data do not contain any clusters. Therefore, it's necessary to validate the quality of the clustering result.
A variety of measures has been proposed in the literature for evaluating clustering results. The term clustering validation designates the procedure of evaluating the results of a clustering algorithm.
Generally, clustering validation statistics can be categorized into 4 classes (Theodoridis and Koutroubas, 2008; G. Brock et al., 2008, Charrad et al., 2014):
Relative clustering validation, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g. varying the number of clusters k). It’s generally used for determining the optimal number of clusters.
External clustering validation, which consists in comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. Since we know the “true” cluster number in advance, this approach is mainly used for selecting the right clustering algorithm for a specific dataset.
Internal clustering validation, which use the internal information of the clustering process to evaluate the goodness of a clustering structure without reference to external information. It can be also used for estimating the number of clusters and the appropriate clustering algorithm without any external data.
Clustering stability validation, which is a special version of internal validation. It evaluates the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time. Clustering stability measures will be described in a future chapter.
require(factoextra)
# Silhouette method
factoextra::fviz_nbclust(mAllWide, hcut, method = "silhouette") +
labs(subtitle = "Silhouette method")
# Gap statistic
# nboot = 10 to keep the function speedy.
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(123)
factoextra::fviz_nbclust(mAllWide, hcut, method = "gap_stat",
nstart = 25, nboot = 10)+
labs(subtitle = "Gap statistic method")
We'll use the NbClust package, which computes, with a single function call, 30 indices for deciding the right number of clusters in the dataset:
require(NbClust)
nb <- NbClust(mAllWide,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "complete",
index ="all")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 5 proposed 2 as the best number of clusters
* 9 proposed 3 as the best number of clusters
* 3 proposed 4 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 1 proposed 9 as the best number of clusters
* 2 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
# Visualize the result
factoextra::fviz_nbclust(nb) + theme_minimal()
hc.res <- factoextra::eclust(mAllWide,
"hclust", k = 3, graph = FALSE,
hc_method = "complete",
hc_metric = "euclidean")
# Visualize clusters
factoextra::fviz_cluster(hc.res, geom = "point", frame.type = "norm")
factoextra::fviz_silhouette(hc.res)
cluster size ave.sil.width
1 1 893 0.21
2 2 274 0.48
3 3 20 0.48
A free R/Shiny app is available here.