Description
This R notebook accompanies the MIC workshop: Modern data analysis in R/RStudio. It contains placeholders (...)
to fill the code in exercises presented during the workshop.
You can knit this document into an HTML file. Once you fill in the code, change the parameter eval=F
to eval=T
in every code snippet. This way, the output of every code section will appear in the notebook.
R primer
The R for data Science book and the accompanying website is an excellent resource to get acquainted with R and its applications to data science.
Here is an introduction to data.table
package that we use extensively in this analysis. Some advanced tips and tricks with data.table are covered here.
Data description
Data come from a microfluidic experiment where PC-12 cells were stimulated with 3 different growth factors (EGF, NGF, FGF2) at different concentrations. It comes from our recent publication Temporal perturbation of ERK dynamics reveals network architecture of FGF2-MAPK signaling (2019) available here.
The microfluidic device has four separate channels, thus 4 different treatments can be performed in a single experiment.
The PC-12 cells express a FRET biosensor to indicate activity of the ERK kinase.
ERK is an end point of the MAPK signalling pathway.
Three different growth factors tested activate different parts of the MAPK network and result in different signalling dynamics, i.e. the behaviour of ERK activity over time.
Cells stimulated with growth factors were then imaged with a single-cell resolution over 3-4 hours at a 2-minute time resolution. Each channel of a microfluidic device was imaged at 4 different fields of view. Thus, we have 16 fields of view per experiment.
The folder data
contains 3 sets of files that correspond to 3 growth factor treatments at 4 different concentrations. A single set from a single experiment includes:
tCoursesSelected_xGFsust_expXXXX.csv.gz
a compressed file with single-cell data. Each row corresponds to the measurement of ERK activity in a single cell, at a single time point. Columns:
Metadata_Series
an integer, 0-15, that corresponds to the field of view.Metadata_T
an integer, 0-100, corresponds to the time frame number in a movie from a single field of view.TrackObjects_Label
an integer that describes the track number from object tracking algorithm. The number is unique only within a single field of view.Intensity_MeanIntensity_Ratio
a float with the measurement of FRET ratio; this is our proxy of ERK activity.
experimentDescription_EGFsust.xlsx
an Excel file with experiment description. Columns:
Position
an integer, 0-15, with the number of the field of view.Channel
an integer, 1-4, with the number of the microfluidic channel.Stim_Conc
a string with an indication of the GF concentration.
Exercise 1: read data
Read single-cell data
Read data from a single experiment from this file ../data/original/tCoursesSelected_EGFsust_exp20170316.csv.gz
. Use data.table::fread
function.
Read experimental description
The experimental description is located in this Excel file ../data/original/experimentDescription_EGFsust.xlsx
.
Use the readxl::read_xlsx
function.
Merge data with experimental description
Add columns Stim_Treat
and Stim_Conc
with experiment description to single-cell data using merge
function. Identify common columns between the two tables that will be used for merging.
Read file list
The table dtEGFwithExp
created above contains single-cell data with treatment information (i.e. growth factor and its concentration) for a single growth factor only! Now, let’s read all files in the ../data
folder corresponding to three growth factor treatments.
- Create a vector with raw file names in a folder
- Apply an
fread
function to read individual files
- Name list elements according to file names they were read from.
These names will be converted to an ID column when binding the list with rbindlist
.
- Bind the list.
Note that tables read from individual files have different column numbers. Therefore, when binding, we specify to option use.names = T
to match column between different list elements, and fill = T
to fill non-existing columns with NA`.
In addition, we use the option idcol = "fileName"
to create a column that contains names of list elements that we assigned in the previous step.
dtAll = rbindlist(lAll, use.names = T, fill = T, idcol = "fileName")
# Remove unnecessary columns; they exist only in 2 out of 3 of the files
dtAll[, `:=`(c("Location_Center_X", "Location_Center_Y"), NULL)]
- Extract growth factor name from the
fileName
column.
gsub
is a handy function to extract strings using regular expressions.
The dtAll
table is still missing information about the growth factor concentration. To add that, we need to read ../data/original/experimentDescription_*
files.
Milestone 1
Data from all experiments merged into a single file:
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: EGF 0 0 3
2: EGF 0 0 4
3: EGF 0 0 8
4: EGF 0 0 12
5: EGF 0 0 13
6: EGF 0 0 15
Intensity_MeanIntensity_Ratio Channel Stim_Conc
1: 1229.246 1 250 ng/ml
2: 1279.513 1 250 ng/ml
3: 1236.920 1 250 ng/ml
4: 1170.460 1 250 ng/ml
5: 1150.294 1 250 ng/ml
6: 1128.049 1 250 ng/ml
Before proceeding further, execute the cell below. It contains definitions of column names that we will use throughout this workshop.
# create a list with strings for column names
lCol = list(
fov = "Metadata_Series",
frame = "Metadata_T",
trackid = "TrackObjects_Label",
meas = "Intensity_MeanIntensity_Ratio",
measNorm = "Intensity_MeanIntensity_Ratio_Norm",
ch = "Channel",
treatGF = "Stim_Treat",
treatConc = "Stim_Conc",
trackiduni = "TrackObjects_Label_Uni", # we will add this column later
clid = "Cluster_Label" # we will add this column later
)
Exercise 2: unique time series ID’s
The track ID column is unique only within a single field of view, which is given in the Metadata_Series
column.
Create a column with unique time series ID’s that includes the treatment, concentration, and track ID. It will become quite handy later on during subsetting and clustering. Use paste
, paste0
, or sprintf
functions to concatenate strings from different columns.
Data exploration
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
Length:120086 Min. : 0.000 Min. : 0.00 Min. : 1.00
Class :character 1st Qu.: 3.000 1st Qu.: 25.00 1st Qu.:11.00
Mode :character Median : 7.000 Median : 50.00 Median :22.00
Mean : 7.477 Mean : 49.99 Mean :23.42
3rd Qu.:12.000 3rd Qu.: 75.00 3rd Qu.:34.00
Max. :15.000 Max. :100.00 Max. :80.00
Intensity_MeanIntensity_Ratio Channel Stim_Conc
Min. : 677.9 Min. :1.000 Length:120086
1st Qu.:1196.4 1st Qu.:1.000 Class :character
Median :1246.3 Median :2.000 Mode :character
Mean :1257.3 Mean :2.485
3rd Qu.:1311.9 3rd Qu.:4.000
Max. :4060.9 Max. :4.000
NA's :18
Missing data
To obtain rows with missing data in the Intensity_MeanIntensity_Ratio
column, use is.na
function. It returns TRUE
, if the argument is NA
.
Check measurements of a single time series. Note that the two operations on the data table have been chained, i.e. dtAll[][]
. First, we selected all rows of dtAll
which have the TrackObjects_Label_Uni
column equal to FGF_08_0001
, then we select Intensity_MeanIntensity_Ratio
column from that result. The output is a vector.
Check the length of all other time series with NA’s.
First, let’s create a vector with TrackObjects_Label_Uni
that contain NA's
in the Intensity_MeanIntensity_Ratio
column. Again, we’re chaining data table operations.
# make a vector with strings of unique track ids of time series that contain NA's
vsTracksWithNAs = unique(dtAll[is.na(get(lCol$meas))][[lCol$trackiduni]])
vsTracksWithNAs
The unique
function takes only unique elements of its argument. This is to ensure that even if a single track contains multiple NA's
we take its TrackObjects_Label_Uni
only once.
Second, we subset the dtAll
to obtain rows where TrackObjects_Label_Uni
belongs to elements of a vsTracksWithNAs
vector.
One option is to use %in%
operator:
Another, faster option is to set the key in dtAll
to the TrackObjects_Label_Uni
column:
# Key the table according to a unique track ID
setkeyv(dtAll, lCol$trackiduni)
head(dtAll[vsTracksWithNAs])
Finally we can calculate the number of time points in tracks with NA's
.
Exercise 3: identify missing data
Calculate the number of time points per time series. Check the min and max. The .N
calculates the number of rows:
Here, we chained two operations on a data.table
. First, we calculated the number of rows per time series. This results in a new data.table
. We then selected a single column named N
from that table by using [[...]]
syntax.
Use summary
function to calculate the 5-point statistics.
Milestone 2
Dataset with interpolated NA's
from Milestone 2:
require(R.utils)
require(data.table)
dtAll = fread("../data/m2_allGF_wExpDescr_noNAs.csv.gz")
head(dtAll)
Stim_Treat Metadata_T Intensity_MeanIntensity_Ratio Stim_Conc
1: EGF 0 1183.396 0.25 ng/ml
2: EGF 1 1187.431 0.25 ng/ml
3: EGF 2 1172.491 0.25 ng/ml
4: EGF 3 1176.407 0.25 ng/ml
5: EGF 4 1172.421 0.25 ng/ml
6: EGF 5 1167.917 0.25 ng/ml
TrackObjects_Label_Uni
1: EGF_12_0002
2: EGF_12_0002
3: EGF_12_0002
4: EGF_12_0002
5: EGF_12_0002
6: EGF_12_0002
Plot single-cell data
Exercise 4A: remove point outliers
Remove single time points above the threshold 2000, and impute them with interpolated data. Replace outlier measurements with NA's
and interpolate them with imputeTS::na_interpolation
.
# replace time points with measurements above the threshold with NA's
dtAll[..., ...]
# interpolate NA's
require(imputeTS)
dtAll[, `:=`((lCol$meas), ...), by = ...]
Note, that interpolation has to be done independently by track. Doing that on a an entire column without grouping could result in interpolating the last NA
of a track with a previous time point of that track and the first time point of the next track.
Exercise 4B: remove dynamics outliers
Remove entire trajectories if at least a single time point is below 1000. Create a vector with unique track ID’s for time series that have measurements below the threshold. Subset dtAll
using that vector and %in%
operator, or subsetting a keyed dtAll
as above.
Exercise 4B: remove dynamics outliers
Exercise 4C: plot cleaned dataset
Interactive removal of outliers
There’s no single recipe for handling outliers; it all depends on the analysis.
A handy interactive tool written by Mauro Gwerder (a Bachelor student in Olivier Pertz lab) can help with that process. The R/Shiny app can be downloaded from here and executed from RStudio.
Milestone 3
Dataset without outliers from Milestone 3:
Plot per condition
Exercise 5: normalisation
Add a column to dtAll
with a normalised measurement where every time series is divided by the mean of its first 20 time points. Plot normalised data.
A column with the mean of first 20 elements of a group can be added this way:
.SD
corresponds to a subset of data as defined by by
section. It’s a temporary data table and can be used as such.
# add a column with the mean of the beasline for every time series
dtAll[,
baseline := ..., # add a new column with the mean of first 20 rows of the group
by = ...] # group by unique trajectory
# add a column with normalized measurement
dtAll[,
(lCol$measNorm) := ...]
# remove baseline column
dtAll[, baseline := NULL]
head(dtAll)
Milestone 4
Dataset with a column with normalised measurement.
Plot normalised data
Plot per condition using normalized data:
Add mean to the plot
Note, that we are adding a layer with the summary to the existing ggplot!
Beautifying plots
Add themes, e.g. + theme_bw()
, or themes available in packages such as ggthemes
.
Use + labels()
to add the title, subtitle, the caption.
Use + xlab()
or + ylab()
to control labels of x and y axes.
Exercise 6: interactive plots
Make an interactive plot. Use plotly::ggplotly
function
Exercise 7: time point snapshots
Plot ERK activity at selected time points: baseline, peak, relaxation period. Visualise as box-, dot-, violin plots, or their combination.
Use %in%
syntax to select rows from desired time points, e.g.:
Use ggplot2
functions such as:
geom_boxplot()
geom_violin()
geom_dotplot(binaxis = "y",
stackdir = "center",
position = "dodge",
binwidth = .01,
binpositions = "all)
For simplicity’s sake, we will use aes
instead of aes_string
and get
to use column name strings. The argument for the x-axis has to be a categorical variable, which can be converted from the continuous variable by using the as.factor()
function.
Exercise 8: long to wide format
Convert dtAll
to a matrix in a wide format using dcast
function from the data.table
package.
In the long format, every row corresponds to the measurement at a single time point.
In the wide format, every row corresponds to a whole single time series; column are time points.
# Conversion from long to wide format
dtAllWide = data.table::dcast(dtAll,
... ~ ..., # place for the formula; LHS - row variable, RHS - column variable
value.var = '...') # place for the column name that will be cast into columns
# obtain row names for later
vsRowNames = dtAllWide[[lCol$trackiduni]]
# convert to a matrix; omit the first column
mAllWide = as.matrix(dtAllWide[, -1])
# assign row names to the matrix (useful for later plotting heat-maps from clustering)
rownames(mAllWide) = vsRowNames
# clean
rm(vsRowNames, dtAllWide)
# check the result
head(mAllWide)
Hierarchical clustering
Clustering is a data analysis technique that groups data points based on their distance. In the context of time series, a data point is an entire time series with all time points.
There many methods to calculate the distance, e.g. Euclidean or Manhattan distances where where time points are treated independently, or shape-based distances such as Dynamic Time Warping.
A very convenient way to cluster AND to visualise clustering of time series is a heatmap with a dendrogram. Use a base-R heatmap
function to perform hierarchical clustering and to display the result as a heatmap.
Prettify the heatmap
Things we’d like to change:
- cluster only rows,
- change the colour palette.
require(RColorBrewer)
heatmap(mAllWide, Colv = NA, scale = "none", col = rev(colorRampPalette(brewer.pal(11,
"RdYlBu"))(99)), xlab = "Time (min)", ylab = "Cells")
Play with parameters of dist
and hclust
functions.
The former calculates the distance matrix between individual time series. Admissible parameters are: euclidean
, maximum
, manhattan
, canberra
, binary
or minkowski
.
The latter performs hierarchical clustering based on the distance matrix and builds a dendrogram. Admissible parameters for dendrogram linkage in hclust
are: ward.D
, ward.D2
, single
, complete
, average
, mcquitty
, median
or centroid
.
Heatmaps with pheatmap
Another package to display a nicer looking heatmaps is the pheatmap
package package.
require(pheatmap)
require(RColorBrewer)
pheatmap::pheatmap(mAllWide, cluster_cols = F, cutree_rows = 6, treeheight_row = 100,
color = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)), clustering_distance_rows = "euclidean",
clustering_method = "complete", fontsize_row = 2, fontsize_col = 5, angle_col = 45)
Interactive heatmap
Create interactive heatmaps using heatmaply
package.
Extract cluster information
After performing hierarchical clustering and after cutting the dendrogram at a desired level, we extract assignments of cluster numbers to individual time series.
Step 1: hierarchical clustering
Again, you can play with different distance and linkage methods in dist
and hclust
functions.
Step 2: cut the dendrogram
Cut the dendrogram at a desired level. Here, we cut it at 6. Play with that value…
Convert named vector to a data table.
Step 3: Merge
Merge the original time series with cluster assignments for individual time series. This way, we’ll have a cluster number assigned to every time series in th elong format.
Milestone 5
If your are completely lost, here’s a file with the dataset resulting from previous steps:
Exercise 9: plot time series per cluster
Plot single-cell time series in each cluster, include the population mean. Use faceting per cluster.
ggplot(dtAllCl,
aes_string(x = lCol$frame,
y = lCol$measNorm,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) +
stat_summary(
aes_string(y = lCol$measNorm, group = 1),
fun.y = ..., # place for a function to calculate the summary
geom = "line", group = 1,
colour = 'red', linetype = 'solid', size = 1) +
facet_grid(...) + # place for a faceting variable
theme_minimal() +
labs(title = "Clusters of ERK activity dynamic responses",
caption = paste0("Created on ", Sys.Date())) +
xlab("Time (min)") +
ylab("ERK activity (normalised)")
Exercise 10: contribution of clusters per condition
Calculate and plot the composition of experimental conditions with respect to clusters. Perform data aggregation to calculate the number of time series per group, per cluster.
Hint, the number of time series in dtAllCl
is the number of unique TrackObjects_Label_Uni
per group.
Aggregate and assign the result to a new data.table
dtAllClN
:
Make a bar-plot:
# for percentages on y-axis in ggplot
require(scales)
# The core plot: concentrations on the x-axis, the number of time series on y-axis
p5 = ggplot(dtAllClN, aes_string(x = ..., # places for x and y variables
y = ...))
# Facetting per growth factor
p5 = p5 +
facet_wrap(...)
# Stacked bar plot with bars coloured by cluster number
# Bars are stretched to "fill" the y-axis using the option position = position_fill()
p5 = p5 +
geom_bar(aes_string(...), # place for variable to colour the bars
stat = "identity",
position = position_fill())
# Convert y-axis labels to percentages (0-100%) instead of (0-1)
p5 = p5 +
scale_y_continuous(labels = percent)
# Use a nice colour palette for colour fill
p5 = p5 +
scale_fill_manual("GF:",
values = ggthemes::tableau_color_pal("Color Blind")(6))
# Prettify the plot; add labels, etc.
p5 = p5 +
theme_minimal() +
labs(title = "Participation of clusters in experimental conditions",
caption = paste0("Created on ", Sys.Date())) +
xlab("") +
ylab("Percentage") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p5
Clustering validation
Based on Clustering Validation Statistics.
To work on this section, you need to load the data from Milestone 4 and to convert it to wide format.
Optimal number of clusters - Silhouette method
Silhouette is a method of interpretation and validation of consistency within clusters of data.
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
The factoextra::fviz_nbclust
calculates the average silhouette of N Monte Carlo samples for a range of cluster numbers.
Optimal number of clusters - Gap statistics method
# Gap statistic nboot = 10 to keep the function speedy. recommended value:
# nboot= 500 for your analysis. Use verbose = FALSE to hide computing
# progression.
set.seed(123)
factoextra::fviz_nbclust(mAllWide, FUNcluster = hcut, method = "gap_stat", nstart = 25,
nboot = 10) + labs(subtitle = "Gap statistic method")
30 indices
We’ll use the package NbClust
to compute, with a single function call, 30 indices for deciding the right number of clusters in the dataset:
require(NbClust)
nb <- NbClust(mAllWide, distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete",
index = "all")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 5 proposed 2 as the best number of clusters
* 9 proposed 3 as the best number of clusters
* 3 proposed 4 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 1 proposed 9 as the best number of clusters
* 2 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
Visualise results from NbClust
:
Among all indices:
===================
* 2 proposed 0 as the best number of clusters
* 5 proposed 2 as the best number of clusters
* 9 proposed 3 as the best number of clusters
* 3 proposed 4 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 1 proposed 9 as the best number of clusters
* 2 proposed 10 as the best number of clusters
* 3 proposed NA's as the best number of clusters
Conclusion
=========================
* According to the majority rule, the best number of clusters is 3 .