Maciej Dobrzynski (Institute of Cell Biology, University of Bern)
September 15, 2019
The aim of this workshop is to process raw time-series data from a time-lapse microscopy experiment. We will:
Intermediate data sets throughout the workshop:
Divide your work into multiple contexts with RStudio projects.
Each project has its own working directory, workspace, history, and source documents.
You can create an RStudio project:
File > New Project… to create a new R project
Use # to add comments to your code.
Any comment line with at least four trailing dashes (----), equal signs (====), or pound signs (####) automatically creates a code section.
To navigate between code sections, use the Jump To menu available at the bottom of the editor.
The outline of your code sections is available in the upper right corner of the editor window.
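A minimal sketch of such section markers (the section names and the toy code are arbitrary):

```r
# Setup ----
vData = c(74.69, 88.24, 81.77)   # toy measurements

# Analysis ====
dMean = mean(vData)

#### Output ####
print(dMean)
```

Each of the three comment lines above appears as a foldable section in the editor outline and in the Jump To menu.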
RStudio supports both automatic and user-defined folding for regions of code. Code folding allows you to easily show and hide blocks of code to make it easier to navigate your source file and focus on the coding task at hand.
To indent or reformat the code use:
Stick to a single naming convention throughout the code. A convenient convention is the so-called camel notation, where names of variables, constants, and functions are constructed by capitalizing each compound of the name, e.g.:
calcStatsOfDF - a function to calculate stats
nIteration - prefix n to indicate an integer variable
fPatientAge - f to indicate a float variable
sPatientName - s to indicate a string variable
vFileNames - v for a vector
lColumnNames - l for a list
From R-bloggers:
A FOR loop is the most intuitive way to apply an operation to a series by looping through each item one by one, which makes perfect sense logically but should be avoided by useRs given the low efficiency.
As an alternative, use the apply function family, e.g. lapply.
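A minimal sketch of the difference (toy data; both approaches give the same result):

```r
# per-element means of a toy list computed two ways
lCols = list(a = 1:5, b = 6:10)

# for loop: explicit iteration, result assembled by hand
lMeansLoop = list()
for (s in names(lCols)) {
  lMeansLoop[[s]] = mean(lCols[[s]])
}

# lapply: one call, returns a list of the same shape
lMeansApply = lapply(lCols, mean)

identical(lMeansLoop, lMeansApply)  # TRUE
```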
Use R Notebooks to create publish-ready documents with text, graphics, and interactive plots. They can be saved as an html, pdf, or Word document.
An R Notebook is a document with chunks that can be executed independently and interactively, with output visible immediately beneath the input. The text is formatted using R Markdown.
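A minimal R Notebook source might look like this (html_notebook is the default notebook output format; the title and chunk contents are arbitrary):

````
---
title: "My notebook"
output: html_notebook
---

Some explanatory text written in *R Markdown*.

```{r}
summary(cars)
```
````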
The data.table package is a faster, more efficient framework for data manipulation than R's default data.frame. It provides a consistent syntax for subsetting, grouping, updating, merging, etc.
Let's define a data table:
require(data.table)
dtBabies = data.table(name= c("Jackson", "Emma", "Liam", "Ava"),
gender = c("M", "F", "M", "F"),
year2011= c(74.69, NA, 88.24, 81.77),
year2012=c(84.99, NA, NA, 96.45),
year2013=c(91.73, 75.74, 101.83, NA),
year2014=c(95.32, 82.49, 108.23, NA),
year2015=c(107.12, 93.73, 119.01, 105.65))
dtBabies
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Emma F NA NA 75.74 82.49 93.73
3: Liam M 88.24 NA 101.83 108.23 119.01
4: Ava F 81.77 96.45 NA NA 105.65
The general form of data.table syntax is as follows:
DT[i, j, by]
## R: i j by
## SQL: where | order by select | update group by
The way to read it (out loud) is: take DT, subset/reorder rows using i, then calculate j, grouped by by.
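All three parts can be used in a single call. A toy sketch (table and column names are arbitrary):

```r
require(data.table)

# i subsets rows, j computes, by groups
dt = data.table(g = c("A", "A", "B", "B"),
                x = c(1, 2, 3, 4))

# mean of x over rows with x > 1, grouped by g: A -> 2, B -> 3.5
dt[x > 1, .(meanX = mean(x)), by = g]
```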
Select specific records:
dtBabies[gender == 'M']
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Liam M 88.24 NA 101.83 108.23 119.01
Select specific columns:
dtBabies[, .(name, gender, year2015)]
name gender year2015
1: Jackson M 107.12
2: Emma F 93.73
3: Liam M 119.01
4: Ava F 105.65
Calculate the mean of a column:
dtBabies[, .(meanWeight = mean(year2015))]
meanWeight
1: 106.3775
Calculate the mean of a column by gender:
dtBabies[, .(meanWeight = mean(year2015)), by = gender]
gender meanWeight
1: M 113.065
2: F 99.690
In the above example, column names are given explicitly. Hardcoding them this way in the script is potentially dangerous, for example when column names change for some reason.
It is cleaner to store the column names somewhere at the beginning of the script, where it is easy to change them, and then use variables with those names in the code.
lCol = list(meas = 'weight',
time = 'year',
group = c('name', 'gender'),
timeLast = 'year2015')
lCol
$meas
[1] "weight"
$time
[1] "year"
$group
[1] "name" "gender"
$timeLast
[1] "year2015"
Select specific records:
dtBabies[get(lCol$group[[2]]) == 'M']
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Liam M 88.24 NA 101.83 108.23 119.01
Select specific columns:
myColumns = c(lCol$group[[1]], lCol$timeLast)
dtBabies[, ..myColumns]
name year2015
1: Jackson 107.12
2: Emma 93.73
3: Liam 119.01
4: Ava 105.65
The same summary, but with column names stored as strings in elements of the list lCol. In the j part of the data.table syntax, we have to use the function get to interpret a string as a column name.
dtBabies[,
.(meanWeight = mean(get(lCol$timeLast))),
by = c(lCol$group[2])]
gender meanWeight
1: M 113.065
2: F 99.690
Our data table is in the wide format. To convert it to the long format, use the function melt.
Provide the names of the identification variables (id.vars) and measure variables (measure.vars). If none are provided, melt guesses them automatically, which may result in a wrong conversion. Both variables can be given as strings with column names, or as column numbers.
The original data frame contains missing values; na.rm=T
omits them in the long-format table.
dtBabiesLong = data.table::melt(dtBabies,
id.vars = c('name', 'gender'), measure.vars = 3:7,
variable.name = 'year', value.name = 'weight',
na.rm = T)
head(dtBabiesLong, n = 5L)
name gender year weight
1: Jackson M year2011 74.69
2: Liam M year2011 88.24
3: Ava F year2011 81.77
4: Jackson M year2012 84.99
5: Ava F year2012 96.45
The function dcast converts from long to wide format. It has a so-called formula interface that specifies the combination of variables that uniquely identifies a row.
Note that because some combinations of name + gender + year do not exist, dcast will introduce NA's.
dtBabiesWide = data.table::dcast(dtBabiesLong,
name + gender ~ year,
value.var = 'weight')
dtBabiesWide
name gender year2011 year2012 year2013 year2014 year2015
1: Ava F 81.77 96.45 NA NA 105.65
2: Emma F NA NA 75.74 82.49 93.73
3: Jackson M 74.69 84.99 91.73 95.32 107.12
4: Liam M 88.24 NA 101.83 108.23 119.01
There are two ways to use the formula interface with string variables:
Create the formula string explicitly:
as.formula(
paste0(
lCol$group[[1]], "+", lCol$group[[2]],
"~", lCol$time))
name + gender ~ year
Use the reformulate function. The limitation: it supports only a single response variable!
reformulate(response = c(lCol$group[[1]], lCol$group[[2]]),
termlabels = lCol$time)
name ~ year
fread
Fast reading of files; use the nThread option to take advantage of multiple threads and read files even faster!
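A quick sketch (a toy file is written first so the example is self-contained; the path is a temporary file):

```r
require(data.table)

# write a toy csv to a temporary file
sTmp = tempfile(fileext = ".csv")
fwrite(data.table(x = 1:1000, y = rnorm(1000)), sTmp)

# read it back using 2 threads
dtIn = fread(sTmp, nThread = 2)
nrow(dtIn)   # 1000
```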
The data come from our recent publication, Temporal perturbation of ERK dynamics reveals network architecture of FGF2-MAPK signaling (2019), available here.
PC-12 cells stimulated with 3 different growth factors (EGF, NGF, FGF2) at different concentrations.
The PC-12 cells express a FRET biosensor to indicate activity of the ERK kinase, an end point of the MAPK signalling pathway.
Three different growth factors tested activate different parts of the MAPK network and result in different signalling dynamics, i.e. the behaviour of ERK activity over time.
Cells stimulated with growth factors were then imaged with a single-cell resolution over 3-4 hours at a 2-minute time resolution.
A single experiment consists of a sustained stimulation with a single growth factor at 4 concentrations imaged at 4 fields of view. Thus, we have 16 fields of view per experiment.
The folder data
contains:
A single set from a single experiment includes:
tCoursesSelected_xGFsust_expXXXX.csv.gz
a compressed file with single-cell data. Each row corresponds to the measurement of ERK activity in a single cell, at a single time point. Columns:
Metadata_Series - an integer, 0-15, that corresponds to the field of view.
Metadata_T - an integer, 0-100, that corresponds to the time frame number in a movie from a single field of view.
TrackObjects_Label - an integer with the track number from the object-tracking algorithm. The number is unique only within a single field of view.
Intensity_MeanIntensity_Ratio - a float with the measurement of the FRET ratio; this is our proxy of ERK activity.
experimentDescription_EGFsust.xlsx
an Excel file with the experiment description. Columns:
Position - an integer, 0-15, with the number of the field of view.
Channel - an integer, 1-4, with the number of the microfluidic channel.
Stim_Conc - a string with an indication of the GF concentration.
Aim: read data from a single experiment. Merge single-cell measurements of the EKAR biosensor fluorescence intensity with the experimental description.
Hints:
Use the following functions from the data.table and readxl packages:
data.table::fread
readxl::read_xlsx
data.table::merge
Install packages with install.packages("name_of_the_package").
Read single-cell data:
require(R.utils)
require(data.table)
dtEGF = fread("./data/original/tCoursesSelected_EGFsust_exp20170316.csv.gz")
head(dtEGF)
Metadata_Series Metadata_T TrackObjects_Label
1: 0 0 3
2: 0 0 4
3: 0 0 8
4: 0 0 12
5: 0 0 13
6: 0 0 15
Intensity_MeanIntensity_Ratio
1: 1229.246
2: 1279.513
3: 1236.920
4: 1170.460
5: 1150.294
6: 1128.049
Read experimental description:
require(readxl)
dtEGFexp = as.data.table(read_xlsx("./data/original/experimentDescription_EGFsust.xlsx"))
head(dtEGFexp)
Position Channel Stim_Duration Stim_Conc Stim_Treat
1: 0 1 sust 250 ng/ml EGF
2: 1 1 sust 250 ng/ml EGF
3: 2 1 sust 250 ng/ml EGF
4: 3 1 sust 250 ng/ml EGF
5: 4 2 sust 25 ng/ml EGF
6: 5 2 sust 25 ng/ml EGF
Acquisition_frequency_min Equilibration_min Device_type
1: 2 43 STD 4-channel
2: 2 43 STD 4-channel
3: 2 43 STD 4-channel
4: 2 43 STD 4-channel
5: 2 43 STD 4-channel
6: 2 43 STD 4-channel
Merge the single-cell data with the experimental description. The common column is the ID of the field of view, named Metadata_Series in the single-cell data and Position in the experimental description.
# columns to keep from experimental description
vsExpDescrCols = c("Position", "Channel", "Stim_Conc", "Stim_Treat")
# merge data
dtEGF = merge(dtEGF, dtEGFexp[, ..vsExpDescrCols],
by.x = c("Metadata_Series"), by.y = "Position")
head(dtEGF, n = 3L)
Metadata_Series Metadata_T TrackObjects_Label
1: 0 0 3
2: 0 0 4
3: 0 0 8
Intensity_MeanIntensity_Ratio Channel Stim_Conc Stim_Treat
1: 1229.246 1 250 ng/ml EGF
2: 1279.513 1 250 ng/ml EGF
3: 1236.920 1 250 ng/ml EGF
Since we have 3 datasets corresponding to 3 experiments with different growth factors, we can repeat the process above and concatenate the 3 data tables at the end.
For large experiments with many output files, it is best to make a list/vector with all file names to read using list.files. Then, loop fread over that list, e.g.:
# create a list of files to read by searching a folder
vsInputFiles = list.files(path = "./data/original",
pattern = "*.csv.gz",
full.names = T)
# apply a file-reading function to the list of file names
lData = lapply(vsInputFiles, fread)
# name the list elements (e.g. after the files) so that idcol below
# contains those names rather than just integer indices
names(lData) = basename(vsInputFiles)
# combine data into a single object
# the option idcol adds a column with the name of the list element
dtAll = rbindlist(lData, use.names = T, idcol = "Stim_Treat")
Caveat: not all files have the same columns!
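One way around differing columns is the fill = TRUE option of rbindlist, which pads missing columns with NA's; a sketch on toy tables:

```r
require(data.table)

# toy tables mimicking two files with different columns
dtA = data.table(id = 1:2, meas = c(1.1, 2.2))
dtB = data.table(id = 3:4, meas = c(3.3, 4.4), extra = c("a", "b"))

# fill = TRUE pads the missing column 'extra' with NA's instead of failing
dtBoth = rbindlist(list(A = dtA, B = dtB),
                   use.names = TRUE, fill = TRUE, idcol = "src")
dtBoth
```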
Fetch the data for all growth factors and experiment description from a Milestone 1 file:
dtAll = fread("./data/m1_allGF_wExpDescr.csv.gz")
head(dtAll)
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: EGF 0 0 3
2: EGF 0 0 4
3: EGF 0 0 8
4: EGF 0 0 12
5: EGF 0 0 13
6: EGF 0 0 15
Intensity_MeanIntensity_Ratio Channel Stim_Conc
1: 1229.246 1 250 ng/ml
2: 1279.513 1 250 ng/ml
3: 1236.920 1 250 ng/ml
4: 1170.460 1 250 ng/ml
5: 1150.294 1 250 ng/ml
6: 1128.049 1 250 ng/ml
To make our lives easier, we will define a list with variables that contain strings with column names:
# create a list with strings for column names
lCol = list(
fov = "Metadata_Series",
frame = "Metadata_T",
trackid = "TrackObjects_Label",
meas = "Intensity_MeanIntensity_Ratio",
measNorm = "Intensity_MeanIntensity_Ratio_Norm",
ch = "Channel",
treatGF = "Stim_Treat",
treatConc = "Stim_Conc",
trackiduni = "TrackObjects_Label_Uni", # we will add this column later
clid = "Cluster_Label" # we will add this column later
)
Now, we can select rows like this:
dtAll[get(lCol$meas) > 2000]
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: EGF 0 22 22
2: FGF 5 65 18
Intensity_MeanIntensity_Ratio Channel Stim_Conc
1: 4060.864 1 250 ng/ml
2: 2085.110 2 25 ng/ml
These are measurements of EKAR fluorescence above a certain threshold.
In fact, these measurements are outliers.
Before we proceed, let's make our lives easier and add a column with a unique track identifier. The current TrackObjects_Label column contains track IDs that are unique only within a single field of view. To make them unique, we shall combine the Stim_Treat, Metadata_Series, and TrackObjects_Label columns into a single string.
Aim: create a column with unique time series IDs. It will come in quite handy later on…
Hint: use the paste, paste0, or sprintf functions to concatenate strings from different columns.
A unique track ID will allow us to easily pick individual time series, plot them, etc.
dtAll[, (lCol$trackiduni) := sprintf("%s_%02d_%04d",
get(lCol$treatGF),
get(lCol$fov),
get(lCol$trackid))]
head(dtAll)
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: EGF 0 0 3
2: EGF 0 0 4
3: EGF 0 0 8
4: EGF 0 0 12
5: EGF 0 0 13
6: EGF 0 0 15
Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1: 1229.246 1 250 ng/ml EGF_00_0003
2: 1279.513 1 250 ng/ml EGF_00_0004
3: 1236.920 1 250 ng/ml EGF_00_0008
4: 1170.460 1 250 ng/ml EGF_00_0012
5: 1150.294 1 250 ng/ml EGF_00_0013
6: 1128.049 1 250 ng/ml EGF_00_0015
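For comparison, the same kind of ID could be built with paste, though without the zero-padding that keeps the sprintf-generated IDs at a fixed width (toy vectors stand in for the three columns):

```r
# toy vectors standing in for Stim_Treat, Metadata_Series, TrackObjects_Label
vTreat = c("EGF", "FGF")
vFov   = c(0L, 5L)
vTrack = c(3L, 18L)

# paste: simple, but variable width
paste(vTreat, vFov, vTrack, sep = "_")         # "EGF_0_3"  "FGF_5_18"

# sprintf: zero-padded, fixed width, so the IDs sort nicely
sprintf("%s_%02d_%04d", vTreat, vFov, vTrack)  # "EGF_00_0003" "FGF_05_0018"
```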
summary(dtAll)
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
Length:120086 Min. : 0.000 Min. : 0.00 Min. : 1.00
Class :character 1st Qu.: 3.000 1st Qu.: 25.00 1st Qu.:11.00
Mode :character Median : 7.000 Median : 50.00 Median :22.00
Mean : 7.477 Mean : 49.99 Mean :23.42
3rd Qu.:12.000 3rd Qu.: 75.00 3rd Qu.:34.00
Max. :15.000 Max. :100.00 Max. :80.00
Intensity_MeanIntensity_Ratio Channel Stim_Conc
Min. : 677.9 Min. :1.000 Length:120086
1st Qu.:1196.4 1st Qu.:1.000 Class :character
Median :1246.3 Median :2.000 Mode :character
Mean :1257.3 Mean :2.485
3rd Qu.:1311.9 3rd Qu.:4.000
Max. :4060.9 Max. :4.000
NA's :18
TrackObjects_Label_Uni
Length:120086
Class :character
Mode :character
We have missing data! (NA's in the measurement column.)
We may also have point and dynamic outliers.
head(dtAll[is.na(get(lCol$meas))])
Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1: FGF 8 11 1
2: FGF 8 11 2
3: FGF 8 11 5
4: FGF 8 11 6
5: FGF 8 11 7
6: FGF 8 11 8
Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1: NA 3 2.5 ng/ml FGF_08_0001
2: NA 3 2.5 ng/ml FGF_08_0002
3: NA 3 2.5 ng/ml FGF_08_0005
4: NA 3 2.5 ng/ml FGF_08_0006
5: NA 3 2.5 ng/ml FGF_08_0007
6: NA 3 2.5 ng/ml FGF_08_0008
All NA's come from a single time point in the same FOV; different tracks are affected. One way to deal with NA's is to impute them with interpolated values. The imputeTS package has a handy function, na_interpolation. But do we have enough data around the NA's to interpolate? Let's print one such time series:
dtAll[get(lCol$trackiduni) == "FGF_08_0001"][[lCol$meas]]
[1] 1154.882 1153.657 1150.764 1147.915 1140.173 1142.111 1140.136
[8] 1140.837 1144.007 1135.256 1132.608 NA 1133.615 1134.720
[15] 1145.947 1130.349 1135.619 1138.593 1139.691 1137.407 1138.597
[22] 1130.640 1166.818 1184.005 1162.116 1173.394 1196.259 1202.515
[29] 1235.769 1236.971 1272.345 1266.500 1251.723 1228.215 1226.551
[36] 1218.169 1214.489 1229.251 1236.079 1231.706 1226.196 1227.705
[43] 1212.287 1195.712 1198.995 1202.976 1222.753 1215.568 1202.693
[50] 1217.286 1217.344 1218.442 1221.851 1226.046 1217.408 1192.199
[57] 1185.705 1181.744 1191.051 1200.445 1198.971 1203.541 1215.357
[64] 1215.048 1213.177 1226.709 1222.884 1236.786 1216.284 1220.544
[71] 1207.355 1219.985 1210.953 1221.475 1233.104 1236.179 1239.964
[78] 1238.919 1235.951 1219.092 1225.653 1225.001 1227.338 1217.194
[85] 1225.857 1242.707 1241.310 1241.725 1221.622 1219.596 1230.118
[92] 1236.985 1227.344 1233.488 1248.268 1242.010 1233.652 1228.994
[99] 1231.844 1243.105 1255.455
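Before applying it to the real data, na_interpolation can be checked on a toy vector (linear interpolation is the default):

```r
require(imputeTS)

# one missing point between 12 and 16; linear interpolation fills in 14
vSeries = c(10, 12, NA, 16, 18)
na_interpolation(vSeries)   # 10 12 14 16 18
```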
Let's also check the length of all other time series with NA's.
# make a vector with strings of unique track ids of time series that contain NA's
vsTracksWithNAs = dtAll[is.na(get(lCol$meas))][[lCol$trackiduni]]
vsTracksWithNAs
[1] "FGF_08_0001" "FGF_08_0002" "FGF_08_0005" "FGF_08_0006" "FGF_08_0007"
[6] "FGF_08_0008" "FGF_08_0010" "FGF_08_0011" "FGF_08_0014" "FGF_08_0017"
[11] "FGF_08_0019" "FGF_08_0020" "FGF_08_0024" "FGF_08_0021" "FGF_08_0023"
[16] "FGF_08_0022" "FGF_08_0025" "FGF_08_0027"
# Key the table according to a unique track ID
setkeyv(dtAll, lCol$trackiduni)
# calculate the number of time points of tracks with NA's
head(
dtAll[vsTracksWithNAs, .N, by = c(lCol$trackiduni)], n = 12L)
TrackObjects_Label_Uni N
1: FGF_08_0001 101
2: FGF_08_0002 101
3: FGF_08_0005 101
4: FGF_08_0006 101
5: FGF_08_0007 101
6: FGF_08_0008 101
7: FGF_08_0010 101
8: FGF_08_0011 100
9: FGF_08_0014 101
10: FGF_08_0017 101
11: FGF_08_0019 101
12: FGF_08_0020 101
Looks like we have another problem… The track FGF_08_0011 has one time point less than the others. The entire row with the measurement is missing!
Aim: calculate the number of time points per time series. Check the min and max.
Hint: the special symbol .N calculates the number of rows.
summary(dtAll[, .N, by = c(lCol$trackiduni)][["N"]])
Min. 1st Qu. Median Mean 3rd Qu. Max.
96.0 101.0 101.0 100.9 101.0 101.0
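Rows that are missing entirely can be padded with NA's by joining the table against the full grid of track ID × frame built with CJ. A sketch on a toy table (not the workshop's exact solution; names are arbitrary):

```r
require(data.table)

# toy table: track "B" misses frame 1 entirely
dt = data.table(id    = c("A", "A", "A", "B", "B"),
                frame = c(0, 1, 2, 0, 2),
                meas  = c(1.0, 1.1, 1.2, 2.0, 2.2))

# join against the full id x frame grid; the missing row appears with meas = NA
dtFull = dt[CJ(id = unique(id), frame = unique(frame)),
            on = c("id", "frame")]
dtFull[is.na(meas)]   # the padded row: id "B", frame 1
```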
This dataset has two types of missing data:
NA's in the measurement for some data points.
Entire rows missing for some time points.
Depending on the analysis, missing data might or might not pose a problem. In our case, we will:
Add NA's for time points that miss data entirely. A neat solution is based on this.
Interpolate NA's using the na_interpolation function from the imputeTS package.
package.Fetch the dataset with interpolated NA's
from Milestone 2:
dtAll = fread("./data/m2_allGF_wExpDescr_noNAs.csv.gz")
require(ggplot2)
p0 = ggplot(dtAll,
aes_string(x = lCol$frame, y = lCol$meas,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) # parameter alpha adds transparency
p0
We have outliers… Actually, two types of outliers. Some time series have single time points that are way above the average. There are also several time series that behave entirely differently from the rest, so-called dynamic outliers.
Aim: remove single time points above the threshold 2000, and impute them with interpolated data.
Hint: replace the outlier measurements with NA's and interpolate them with imputeTS::na_interpolation.
# replace time points with measurements above the threshold with NA's
dtAll[get(lCol$meas) > 2000,
(lCol$meas) := NA]
# interpolate NA's
require(imputeTS)
dtAll[,
(lCol$meas) := na_interpolation(get(lCol$meas)),
by = c(lCol$trackiduni)]
Aim: remove entire trajectories if the measurement is below 1000.
Hint: create a vector with the unique track IDs of time series that have measurements below the threshold. Subset dtAll using that vector.
# vector with unique track id's of dynamic outliers
vsTrackOut = unique(dtAll[get(lCol$meas) < 1000][[lCol$trackiduni]])
# leave only tracks that are not in the outlier vector
dtAll = dtAll[!(get(lCol$trackiduni) %in% vsTrackOut)]
# clean
rm(vsTrackOut)
There's no single recipe for handling outliers; it all depends on the analysis.
A handy interactive tool written by Mauro Gwerder (our Bachelor student) can help with that process. The R/Shiny app can be downloaded from here and executed from RStudio.