QUANTBIO2019: Processing and clustering of time series from single-cell time-lapse microscopy

Maciej Dobrzynski (Institute of Cell Biology, University of Bern)
September 15, 2019

Some context about our lab

  • Time-lapse microscopy experiments: taking movies of many cells
  • Cells have fluorescent biosensors that measure activities of proteins
  • We look at the dynamics of protein activity
  • Movies are processed to identify cells, measure the fluorescent signal, track cells over time
  • The resulting data: thousands of time series
  • Batch image analysis on a computing cluster generates lots of CSV files
  • Postprocessing: merging files from different experimental conditions, cleaning data from outliers, pulling experimental description from other files

Roadmap for this workshop

The aim of this workshop is to process raw time-series data from a time-lapse microscopy experiment. We will:

  • load data, merge different data sources,
  • clean missing data and outliers,
  • plot different data cuts,
  • perform hierarchical clustering,
  • validate clusters.

Intermediate data sets (milestones) are provided throughout the workshop.

RStudio Projects

Divide your work into multiple contexts with RStudio projects.

Each project has its own working directory, workspace, history, and source documents.

You can create an RStudio project:

  • In a brand new directory
  • In an existing directory where you already have R code and data
  • By cloning a version control (Git or Subversion) repository

File > New Project… to create a new R project

Code formatting

Use # to add comments to your code.

Any comment line with at least four trailing dashes (-), equal signs (=), or pound signs (#) automatically creates a code section.

To navigate between code sections, use the Jump To menu available at the bottom of the editor.

The outline of your code sections is available in the upper right corner of the editor window.

RStudio supports both automatic and user-defined folding for regions of code. Code folding allows you to easily show and hide blocks of code to make it easier to navigate your source file and focus on the coding task at hand.

To indent or reformat the code use:

  • Menu > Code > Reindent Lines (⌘I)
  • Menu > Code > Reformat Code (⇧⌘A)

Syntax convention

Stick to a single naming convention throughout the code. A convenient convention is so-called camelCase, where names of variables, constants, and functions are constructed by capitalizing each compound of the name, e.g.:

  • calcStatsOfDF - function to calculate stats
  • nIteration - prefix n to indicate an integer variable
  • fPatientAge - f to indicate a float variable
  • sPatientName - s to indicate a string variable
  • vFileNames - v for vector
  • lColumnNames - l for a list
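A minimal sketch of this convention in practice; all names and values below are made up for illustration:

```r
# Hypothetical examples following the camelCase + type-prefix convention
nIteration   <- 10L                          # n: integer
fPatientAge  <- 42.5                         # f: float
sPatientName <- "Jane Doe"                   # s: string
vFileNames   <- c("a.csv", "b.csv")          # v: vector
lColumnNames <- list(id = "ID", t = "Time")  # l: list

# calc...: a function that calculates stats of a data frame
calcStatsOfDF <- function(df) {
  sapply(df, mean)
}
```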

Avoid for loops

From R-bloggers:

A FOR loop is the most intuitive way to apply an operation to a series by looping through each item one by one, which makes perfect sense logically but should be avoided by useRs given the low efficiency.

As an alternative, use the apply family of functions, e.g. lapply.
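For instance, squaring a vector element by element can be done with a for loop or, more idiomatically, with lapply (toy example):

```r
vNumbers <- 1:5

# for-loop version: fill a list one element at a time
lSquares <- vector("list", length(vNumbers))
for (i in seq_along(vNumbers)) {
  lSquares[[i]] <- vNumbers[i]^2
}

# lapply version: one expression, no explicit indexing
lSquares2 <- lapply(vNumbers, function(x) x^2)
```

Both produce the same list; the lapply version is shorter and avoids bookkeeping of the loop index.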

R notebooks

Use R Notebooks to create publication-ready documents with text, graphics, and interactive plots. Notebooks can be saved as HTML, PDF, or Word documents.

An R Notebook is a document with chunks that can be executed independently and interactively, with output visible immediately beneath the input. The text is formatted using R Markdown.

data.table

The data.table package is a faster, more memory-efficient framework for data manipulation than R's default data.frame. It provides a consistent syntax for subsetting, grouping, updating, merging, etc.

Let's define a data table:

require(data.table)
dtBabies = data.table(name = c("Jackson", "Emma", "Liam", "Ava"), 
                      gender = c("M", "F", "M", "F"), 
                      year2011 = c(74.69, NA, 88.24, 81.77), 
                      year2012 = c(84.99, NA, NA, 96.45), 
                      year2013 = c(91.73, 75.74, 101.83, NA),
                      year2014 = c(95.32, 82.49, 108.23, NA),
                      year2015 = c(107.12, 93.73, 119.01, 105.65))
dtBabies
      name gender year2011 year2012 year2013 year2014 year2015
1: Jackson      M    74.69    84.99    91.73    95.32   107.12
2:    Emma      F       NA       NA    75.74    82.49    93.73
3:    Liam      M    88.24       NA   101.83   108.23   119.01
4:     Ava      F    81.77    96.45       NA       NA   105.65

data.table - general form

The general form of data.table syntax is as follows:

DT[i, j, by]

##   R:                 i                 j        by
## SQL:  where | order by   select | update  group by

The way to read it (out loud) is: take DT, subset/reorder rows using i, then calculate j, grouped by by.
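To make the syntax concrete, here is a toy example combining all three parts at once (the data is made up for this example):

```r
require(data.table)

dt <- data.table(gender = c("M", "F", "M", "F"),
                 weight = c(95, 82, 108, 97))

# take dt, keep rows with weight > 90 (i),
# compute the mean weight (j), grouped by gender (by)
dt[weight > 90, .(meanWeight = mean(weight)), by = gender]
```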

data.table - selection

Select specific records:

dtBabies[gender == 'M']
      name gender year2011 year2012 year2013 year2014 year2015
1: Jackson      M    74.69    84.99    91.73    95.32   107.12
2:    Liam      M    88.24       NA   101.83   108.23   119.01

Select specific columns:

dtBabies[, .(name, gender, year2015)]
      name gender year2015
1: Jackson      M   107.12
2:    Emma      F    93.73
3:    Liam      M   119.01
4:     Ava      F   105.65

data.table - aggregation

Calculate the mean of a column:

dtBabies[, .(meanWeight = mean(year2015))]
   meanWeight
1:   106.3775

Calculate the mean of a column by gender:

dtBabies[, .(meanWeight = mean(year2015)), by = gender]
   gender meanWeight
1:      M    113.065
2:      F     99.690

data.table - reference by name

In the examples above, column names are given explicitly. Hardcoding them throughout a script is potentially dangerous, for example when column names change.

It is cleaner to store the column names in one place at the beginning of the script, where they are easy to change, and then use variables with those names throughout the code.

lCol = list(meas = 'weight',
            time = 'year',
            group = c('name', 'gender'),
            timeLast = 'year2015')
lCol
$meas
[1] "weight"

$time
[1] "year"

$group
[1] "name"   "gender"

$timeLast
[1] "year2015"

data.table - selection

Select specific records:

dtBabies[get(lCol$group[[2]]) == 'M']
      name gender year2011 year2012 year2013 year2014 year2015
1: Jackson      M    74.69    84.99    91.73    95.32   107.12
2:    Liam      M    88.24       NA   101.83   108.23   119.01

Select specific columns:

myColumns = c(lCol$group[[1]], lCol$timeLast)
dtBabies[, ..myColumns]
      name year2015
1: Jackson   107.12
2:    Emma    93.73
3:    Liam   119.01
4:     Ava   105.65

data.table - aggregation (2)

The same summary, but with column names stored as strings in elements of the list lCol. The j part of the data.table call requires the function get to interpret a string as a column name.

dtBabies[, 
         .(meanWeight = mean(get(lCol$timeLast))), 
         by = c(lCol$group[2])]
   gender meanWeight
1:      M    113.065
2:      F     99.690

Wide to long format

Our data table is in the wide format. To convert it to long format, use the function melt.

Provide the names of the identifier variables (id.vars) and measure variables (measure.vars). If none are provided, melt guesses them automatically, which may result in a wrong conversion. Both arguments can be given as character vectors of column names, or as column numbers.

The original data table contains missing values; na.rm=T omits them from the long-format table.

dtBabiesLong = data.table::melt(dtBabies, 
                                id.vars = c('name', 'gender'), measure.vars = 3:7,
                                variable.name = 'year', value.name = 'weight',
                                na.rm = T)
head(dtBabiesLong, n = 5L)
      name gender     year weight
1: Jackson      M year2011  74.69
2:    Liam      M year2011  88.24
3:     Ava      F year2011  81.77
4: Jackson      M year2012  84.99
5:     Ava      F year2012  96.45

Long to wide

The function dcast converts from long to wide format. It has a so-called formula interface that specifies the combination of variables that uniquely identifies a row.

Note that because some combinations of name + gender + year do not exist, the dcast function will introduce NAs.

dtBabiesWide = data.table::dcast(dtBabiesLong, 
                                 name + gender ~ year, 
                                 value.var = 'weight')

dtBabiesWide
      name gender year2011 year2012 year2013 year2014 year2015
1:     Ava      F    81.77    96.45       NA       NA   105.65
2:    Emma      F       NA       NA    75.74    82.49    93.73
3: Jackson      M    74.69    84.99    91.73    95.32   107.12
4:    Liam      M    88.24       NA   101.83   108.23   119.01

Note on formula interface

There are two ways to use the formula interface with string variables:

Create formula string explicitly

as.formula(
   paste0(
      lCol$group[[1]], "+", lCol$group[[2]], 
      "~", lCol$time))
name + gender ~ year

Use the reformulate function. Its limitation: it supports only a single response variable!

reformulate(response = c(lCol$group[[1]], lCol$group[[2]]),
            termlabels = lCol$time)
name ~ year

data.table - IO

fread

Fast reading of files; use the nThread argument to take advantage of multiple threads and read files even faster!

Documentation.

fwrite

For fast writing; also handles compressed files (gz and bz2).

Documentation.
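A minimal round trip with fwrite and fread on made-up data; the file name and data are hypothetical:

```r
require(data.table)

dt <- data.table(id = 1:3, meas = c(1.1, 2.2, 3.3))

# write to a temporary file; a name ending in .gz would make
# fwrite compress the output with gzip
sTmpFile <- tempfile(fileext = ".csv")
fwrite(dt, sTmpFile)

# fread parses with multiple threads via nThread; it also
# decompresses .gz/.bz2 files transparently (via the R.utils package)
dtBack <- fread(sTmpFile, nThread = 2)
```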

Our dataset

The dataset comes from our recent publication Temporal perturbation of ERK dynamics reveals network architecture of FGF2-MAPK signaling (2019), available here.

PC-12 cells stimulated with 3 different growth factors (EGF, NGF, FGF2) at different concentrations.

The PC-12 cells express a FRET biosensor to indicate activity of the ERK kinase, an end point of the MAPK signalling pathway.

Figures: the EKAR biosensor; a scheme of the MAPK network.

Our dataset (continued)

The three growth factors tested activate different parts of the MAPK network and result in different signalling dynamics, i.e. different behaviour of ERK activity over time.

Cells stimulated with growth factors were then imaged at single-cell resolution over 3-4 hours at a 2-minute time resolution.

A single experiment consists of sustained stimulation with a single growth factor at 4 concentrations, each imaged at 4 fields of view. Thus, we have 16 fields of view per experiment.

Data description

The folder data contains:

  • processed data at various milestones of our analysis,
  • 3 sets of files that correspond to 3 growth factor treatments at 4 different concentrations.

Data files

Data description (continued)

A single set from a single experiment includes:

tCoursesSelected_xGFsust_expXXXX.csv.gz a compressed file with single-cell data. Each row corresponds to the measurement of ERK activity in a single cell, at a single time point. Columns:

  • Metadata_Series an integer, 0-15, that corresponds to the field of view.
  • Metadata_T an integer, 0-100, corresponds to the time frame number in a movie from a single field of view.
  • TrackObjects_Label an integer that describes the track number from object tracking algorithm. The number is unique only within a single field of view.
  • Intensity_MeanIntensity_Ratio a float with the measurement of FRET ratio; this is our proxy of ERK activity.

experimentDescription_EGFsust.xlsx an Excel file with experiment description. Columns:

  • Position an integer, 0-15, with the number of the field of view.
  • Channel an integer, 1-4, with the number of the microfluidic channel.
  • Stim_Conc a string with an indication of the GF concentration.

Exercise 1: read data

Aim: read data from a single experiment. Merge single-cell measurements of the EKAR biosensor fluorescence intensity with experimental description.

Hints: use the following functions from the data.table and readxl packages.

  • data.table::fread
  • readxl::read_xlsx
  • data.table::merge

Install packages with install.packages("name_of_the_package").

Exercise 1: read data - solution part 1

Read single-cell data:

require(R.utils)
require(data.table)
dtEGF = fread("./data/original/tCoursesSelected_EGFsust_exp20170316.csv.gz")

head(dtEGF)
   Metadata_Series Metadata_T TrackObjects_Label
1:               0          0                  3
2:               0          0                  4
3:               0          0                  8
4:               0          0                 12
5:               0          0                 13
6:               0          0                 15
   Intensity_MeanIntensity_Ratio
1:                      1229.246
2:                      1279.513
3:                      1236.920
4:                      1170.460
5:                      1150.294
6:                      1128.049

Exercise 1: read data - solution part 2

Read experimental description:

require(readxl)
dtEGFexp = as.data.table(read_xlsx("./data/original/experimentDescription_EGFsust.xlsx"))

head(dtEGFexp)
   Position Channel Stim_Duration Stim_Conc Stim_Treat
1:        0       1          sust 250 ng/ml        EGF
2:        1       1          sust 250 ng/ml        EGF
3:        2       1          sust 250 ng/ml        EGF
4:        3       1          sust 250 ng/ml        EGF
5:        4       2          sust  25 ng/ml        EGF
6:        5       2          sust  25 ng/ml        EGF
   Acquisition_frequency_min Equilibration_min   Device_type
1:                         2                43 STD 4-channel
2:                         2                43 STD 4-channel
3:                         2                43 STD 4-channel
4:                         2                43 STD 4-channel
5:                         2                43 STD 4-channel
6:                         2                43 STD 4-channel

Exercise 1: read data - solution part 3

Merge single-cell data with experimental description. The common column is the ID of the field of view, which is named Metadata_Series in single-cell data and Position in experimental description.

# columns to keep from experimental description
vsExpDescrCols = c("Position", "Channel", "Stim_Conc", "Stim_Treat")

# merge data
dtEGF = merge(dtEGF, dtEGFexp[, ..vsExpDescrCols], 
              by.x = c("Metadata_Series"), by.y = "Position")

head(dtEGF, n = 3L)
   Metadata_Series Metadata_T TrackObjects_Label
1:               0          0                  3
2:               0          0                  4
3:               0          0                  8
   Intensity_MeanIntensity_Ratio Channel Stim_Conc Stim_Treat
1:                      1229.246       1 250 ng/ml        EGF
2:                      1279.513       1 250 ng/ml        EGF
3:                      1236.920       1 250 ng/ml        EGF

Complete data

Since we have 3 datasets that correspond to 3 experiments with different growth factors, we can repeat the process above and concatenate the 3 data tables at the end.

For large experiments with many output files, it is best to build a vector of all filenames to read using list.files. Then, loop fread over that vector, e.g.:

# create a list of files to read by searching a folder
vsInputFiles = list.files(path = "./data/original", 
                          pattern = "*.csv.gz", 
                          full.names = T)

# apply a file-reading function to the list of file names
lData = lapply(vsInputFiles, fread)

# combine data into a single object
# the option idcol adds a column with the name of the list element
dtAll = rbindlist(lData, use.names = T, idcol = "Stim_Treat")

Caveat: not all files have the same columns!
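When the files' columns differ, rbindlist errors by default; its fill argument pads the missing columns with NA. A toy sketch with two made-up "experiments":

```r
require(data.table)

# the second element lacks the meas column
lData <- list(EGF = data.table(Metadata_T = 0:1, meas = c(1.2, 1.3)),
              NGF = data.table(Metadata_T = 0:1))

# fill = TRUE adds NA for columns absent from some list elements;
# idcol stores the list-element name in a new column
dtAll = rbindlist(lData, use.names = TRUE, fill = TRUE, idcol = "Stim_Treat")
```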

Milestone 1

Fetch the data for all growth factors and experiment description from a Milestone 1 file:

dtAll = fread("./data/m1_allGF_wExpDescr.csv.gz")

head(dtAll)
   Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1:        EGF               0          0                  3
2:        EGF               0          0                  4
3:        EGF               0          0                  8
4:        EGF               0          0                 12
5:        EGF               0          0                 13
6:        EGF               0          0                 15
   Intensity_MeanIntensity_Ratio Channel Stim_Conc
1:                      1229.246       1 250 ng/ml
2:                      1279.513       1 250 ng/ml
3:                      1236.920       1 250 ng/ml
4:                      1170.460       1 250 ng/ml
5:                      1150.294       1 250 ng/ml
6:                      1128.049       1 250 ng/ml

Access column names by strings

To make our lives easier, we will define a list with variables that contain strings with column names:

# create a list with strings for column names

lCol = list(
  fov = "Metadata_Series",
  frame = "Metadata_T",
  trackid = "TrackObjects_Label",
  meas = "Intensity_MeanIntensity_Ratio",
  measNorm = "Intensity_MeanIntensity_Ratio_Norm",
  ch = "Channel",
  treatGF = "Stim_Treat",
  treatConc = "Stim_Conc",
  trackiduni = "TrackObjects_Label_Uni", # we will add this column later
  clid = "Cluster_Label"                 # we will add this column later
)

Access column names by strings (continued)

Now, we can select rows like this:

dtAll[get(lCol$meas) > 2000]
   Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1:        EGF               0         22                 22
2:        FGF               5         65                 18
   Intensity_MeanIntensity_Ratio Channel Stim_Conc
1:                      4060.864       1 250 ng/ml
2:                      2085.110       2  25 ng/ml

These are measurements of EKAR fluorescence above a certain threshold.

In fact, these measurements are outliers.

Exercise 2: unique time series ID's

Before we proceed, let's make our lives easier and add a column with a unique track identifier. The current TrackObjects_Label column contains track ID's that are unique only within a single field of view. To make them unique, we shall combine the Stim_Treat, Metadata_Series, and TrackObjects_Label columns into a single string.

Aim: create a column with unique time series ID's. It will become quite handy later on…

Hint: use paste, paste0, or sprintf functions to concatenate strings from different columns.

A unique track ID will allow us to easily pick individual time series, plot them, etc.
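For example, sprintf with zero-padded integer fields produces fixed-width IDs that sort correctly as strings:

```r
# %s inserts a string; %02d and %04d zero-pad integers to 2 and 4 digits
sprintf("%s_%02d_%04d", "EGF", 3L, 12L)  # "EGF_03_0012"
```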

Exercise 2: unique time series ID's - solution

dtAll[, (lCol$trackiduni) := sprintf("%s_%02d_%04d", 
                                     get(lCol$treatGF), 
                                     get(lCol$fov), 
                                     get(lCol$trackid))]
head(dtAll)
   Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1:        EGF               0          0                  3
2:        EGF               0          0                  4
3:        EGF               0          0                  8
4:        EGF               0          0                 12
5:        EGF               0          0                 13
6:        EGF               0          0                 15
   Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1:                      1229.246       1 250 ng/ml            EGF_00_0003
2:                      1279.513       1 250 ng/ml            EGF_00_0004
3:                      1236.920       1 250 ng/ml            EGF_00_0008
4:                      1170.460       1 250 ng/ml            EGF_00_0012
5:                      1150.294       1 250 ng/ml            EGF_00_0013
6:                      1128.049       1 250 ng/ml            EGF_00_0015

Data exploration

summary(dtAll)
  Stim_Treat        Metadata_Series    Metadata_T     TrackObjects_Label
 Length:120086      Min.   : 0.000   Min.   :  0.00   Min.   : 1.00     
 Class :character   1st Qu.: 3.000   1st Qu.: 25.00   1st Qu.:11.00     
 Mode  :character   Median : 7.000   Median : 50.00   Median :22.00     
                    Mean   : 7.477   Mean   : 49.99   Mean   :23.42     
                    3rd Qu.:12.000   3rd Qu.: 75.00   3rd Qu.:34.00     
                    Max.   :15.000   Max.   :100.00   Max.   :80.00     

 Intensity_MeanIntensity_Ratio    Channel       Stim_Conc        
 Min.   : 677.9                Min.   :1.000   Length:120086     
 1st Qu.:1196.4                1st Qu.:1.000   Class :character  
 Median :1246.3                Median :2.000   Mode  :character  
 Mean   :1257.3                Mean   :2.485                     
 3rd Qu.:1311.9                3rd Qu.:4.000                     
 Max.   :4060.9                Max.   :4.000                     
 NA's   :18                                                      
 TrackObjects_Label_Uni
 Length:120086         
 Class :character      
 Mode  :character      




We have missing data! (NA's in the measurement column).

We also may have point and dynamic outliers.

Missing data

head(dtAll[is.na(get(lCol$meas))])
   Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
1:        FGF               8         11                  1
2:        FGF               8         11                  2
3:        FGF               8         11                  5
4:        FGF               8         11                  6
5:        FGF               8         11                  7
6:        FGF               8         11                  8
   Intensity_MeanIntensity_Ratio Channel Stim_Conc TrackObjects_Label_Uni
1:                            NA       3 2.5 ng/ml            FGF_08_0001
2:                            NA       3 2.5 ng/ml            FGF_08_0002
3:                            NA       3 2.5 ng/ml            FGF_08_0005
4:                            NA       3 2.5 ng/ml            FGF_08_0006
5:                            NA       3 2.5 ng/ml            FGF_08_0007
6:                            NA       3 2.5 ng/ml            FGF_08_0008

Missing data (continued)

All NA's come from a single time point in the same FOV. Different tracks are affected. One way to deal with NA's is to impute them with interpolated values; the imputeTS package has a handy function na_interpolation for that.

But do we have enough data around the NA's to interpolate? Let's print one such time series:

dtAll[get(lCol$trackiduni) == "FGF_08_0001"][[lCol$meas]]
  [1] 1154.882 1153.657 1150.764 1147.915 1140.173 1142.111 1140.136
  [8] 1140.837 1144.007 1135.256 1132.608       NA 1133.615 1134.720
 [15] 1145.947 1130.349 1135.619 1138.593 1139.691 1137.407 1138.597
 [22] 1130.640 1166.818 1184.005 1162.116 1173.394 1196.259 1202.515
 [29] 1235.769 1236.971 1272.345 1266.500 1251.723 1228.215 1226.551
 [36] 1218.169 1214.489 1229.251 1236.079 1231.706 1226.196 1227.705
 [43] 1212.287 1195.712 1198.995 1202.976 1222.753 1215.568 1202.693
 [50] 1217.286 1217.344 1218.442 1221.851 1226.046 1217.408 1192.199
 [57] 1185.705 1181.744 1191.051 1200.445 1198.971 1203.541 1215.357
 [64] 1215.048 1213.177 1226.709 1222.884 1236.786 1216.284 1220.544
 [71] 1207.355 1219.985 1210.953 1221.475 1233.104 1236.179 1239.964
 [78] 1238.919 1235.951 1219.092 1225.653 1225.001 1227.338 1217.194
 [85] 1225.857 1242.707 1241.310 1241.725 1221.622 1219.596 1230.118
 [92] 1236.985 1227.344 1233.488 1248.268 1242.010 1233.652 1228.994
 [99] 1231.844 1243.105 1255.455

Missing data (continued 2)

Let's also check the length of all other time series with NA's.

# make a vector with strings of unique track ids of time series that contain NA's
vsTracksWithNAs = dtAll[is.na(get(lCol$meas))][[lCol$trackiduni]]
vsTracksWithNAs
 [1] "FGF_08_0001" "FGF_08_0002" "FGF_08_0005" "FGF_08_0006" "FGF_08_0007"
 [6] "FGF_08_0008" "FGF_08_0010" "FGF_08_0011" "FGF_08_0014" "FGF_08_0017"
[11] "FGF_08_0019" "FGF_08_0020" "FGF_08_0024" "FGF_08_0021" "FGF_08_0023"
[16] "FGF_08_0022" "FGF_08_0025" "FGF_08_0027"

Missing data (continued 3)

# Key the table according to a unique track ID
setkeyv(dtAll, lCol$trackiduni)

# calculate the number of time points of tracks with NA's
head(
  dtAll[vsTracksWithNAs, .N, by = c(lCol$trackiduni)], n = 12L)
    TrackObjects_Label_Uni   N
 1:            FGF_08_0001 101
 2:            FGF_08_0002 101
 3:            FGF_08_0005 101
 4:            FGF_08_0006 101
 5:            FGF_08_0007 101
 6:            FGF_08_0008 101
 7:            FGF_08_0010 101
 8:            FGF_08_0011 100
 9:            FGF_08_0014 101
10:            FGF_08_0017 101
11:            FGF_08_0019 101
12:            FGF_08_0020 101

Exercise 3: identify missing data

Looks like we have another problem… The track FGF_08_0011 has one time point fewer than the others. The entire row with the measurement is missing!

Aim: calculate the number of time points per time series. Check the min and max.

Hint: The .N calculates the number of rows.

Exercise 3: identify missing data - solution

summary(dtAll[, .N, by = c(lCol$trackiduni)][["N"]])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   96.0   101.0   101.0   100.9   101.0   101.0 

Handling missing data

This dataset has two types of missing data:

  • NA's in the measurement for some data points.
  • Missing rows.

Depending on the analysis, missing data might or might not pose a problem.

In our case:

  • We will stretch the data to introduce NA's for time points where rows are missing entirely, e.g. by cross-joining all track IDs with all time points.
  • Then we interpolate the NA's using the na_interpolation function from the imputeTS package.
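To see what linear interpolation does to an interior NA, here is a base-R sketch using approx on a made-up series (imputeTS::na_interpolation does essentially this by default, plus handling of edge cases):

```r
y <- c(1.0, NA, 3.0, 4.0)  # made-up series with one interior NA
vIdx <- seq_along(y)
vKnown <- !is.na(y)

# linearly interpolate the missing value from its neighbours
yFilled <- approx(x = vIdx[vKnown], y = y[vKnown], xout = vIdx)$y
yFilled  # 1 2 3 4
```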

Milestone 2

Fetch the dataset with interpolated NA's from Milestone 2:

dtAll = fread("./data/m2_allGF_wExpDescr_noNAs.csv.gz")

Plot single-cell data

require(ggplot2)
p0 = ggplot(dtAll, 
            aes_string(x = lCol$frame, y = lCol$meas, 
                       group = lCol$trackiduni)) +
  geom_path(alpha = 0.1) # parameter alpha adds transparency
p0

(plot: all single-cell time series overlaid)

Exercise 4A: remove point outliers

We have outliers… Actually, two types of outliers. Some time series have single time points that are way above the average. There are also several time series that behave entirely differently from the rest, so-called dynamic outliers.

Aim: remove single time points above the threshold 2000, and impute them with interpolated data.

Hint: replace outlier measurements with NA's and interpolate them with imputeTS::na_interpolation.

Exercise 4A: remove point outliers - solution

# replace time points with measurements above the threshold with NA's
dtAll[get(lCol$meas) > 2000, 
      (lCol$meas) := NA]

# interpolate NA's
require(imputeTS)
dtAll[, 
      (lCol$meas) := na_interpolation(get(lCol$meas)), 
      by = c(lCol$trackiduni)]

Exercise 4B: remove dynamic outliers

Aim: remove entire trajectories if any of their measurements is below 1000.

Hint: create a vector with unique track ID's for time series that have measurements below the threshold. Subset dtAll using that vector.

Exercise 4B: remove dynamic outliers - solution

# vector with unique track id's of dynamic outliers
vsTrackOut = unique(dtAll[get(lCol$meas) < 1000][[lCol$trackiduni]])

# leave only tracks that are not in the outlier vector
dtAll = dtAll[!(get(lCol$trackiduni) %in% vsTrackOut)]

# clean
rm(vsTrackOut)

Handling outliers

There's no single recipe for handling outliers; it all depends on the analysis.

A handy interactive tool written by Mauro Gwerder (our Bachelor student) can help with that process. The R/Shiny app can be downloaded from here and executed from RStudio.