1. Loading data

Download a set of qPCR experiments results here. The file is a zip archive which you should uncompress into a subfolder of your project.

For example, if you downloaded the file into your project folder you can run the following command:

unzip("pcr.zip", exdir = "data/pcr") # will create a data/pcr subfolder and extract the files

These qPCR results were obtained from 2 different samples and were replicated 5 times. They are stored in 10 different files. A filename looks like “mlc1_1.csv” where the first number is the sample id and the second the replicate id.

Read in the qPCR results

Identify the file format

  1. Using a text editor or RStudio, try to identify how the flat file has been encoded.
  2. Try to import the file mlc1_1.csv using the read_delim() function.

Solution

read_delim(file.path("data", "pcr", "mlc1_1.csv"), delim = " ")
## # A tibble: 96 × 5
##     well   gene    Ct  type  flag
##    <chr>  <chr> <dbl> <chr> <chr>
## 1     A1  gene1 33.60  data    OK
## 2     A2  gene2 22.42  data    OK
## 3     A3  gene3 31.84  data    OK
## 4     A4  gene4 30.25  data    OK
## 5     A5  gene5 33.14  data    OK
## 6     A6  gene6 33.80  data    OK
## 7     A7  gene7 31.05  data    OK
## 8     A8  gene8 16.29  data    OK
## 9     A9  gene9 32.59  data    OK
## 10   A10 gene10 30.13  data    OK
## # ... with 86 more rows

Import multiple files

  1. Create a vector named pcr_files containing the path to all 10 data files using the list.files() function and adjust the full.names argument accordingly (if required you might get some help using ?list.files).

Solution

pcr_files <- list.files(file.path("data", "pcr"), full.names = TRUE)
  1. Now use map() from the purrr package to import all files.
    • What is the type of the output (you might want to use glimpse())
    • Are you able to identify from which file each element has been imported ?

Solution

options(tibble.print_min = 3)
list.files(file.path("data", "pcr"), full.names = TRUE) %>%
  map(read_delim, delim = " ", col_types = cols()) %>% 
  str(max.level = 1)
## List of 10
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    96 obs. of  5 variables:
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    96 obs. of  5 variables:
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    96 obs. of  5 variables:
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    96 obs. of  5 variables:
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    96 obs. of  5 variables:
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    96 obs. of  5 variables:
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    96 obs. of  5 variables:
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    96 obs. of  5 variables:
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    96 obs. of  5 variables:
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    96 obs. of  5 variables:
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
# map() creates a list
# We are not able to tell which list element comes from which file
  1. map() will name each output element according to the names found in the input vector. Use set_names() to keep this information.

Solution

list.files(file.path("data", "pcr"), full.names = TRUE) %>%
  set_names() %>%
  map(read_delim, delim = " ") %>% 
  names()
##  [1] "data/pcr/mlc1_1.csv" "data/pcr/mlc1_2.csv" "data/pcr/mlc1_3.csv"
##  [4] "data/pcr/mlc1_4.csv" "data/pcr/mlc1_5.csv" "data/pcr/mlc2_1.csv"
##  [7] "data/pcr/mlc2_2.csv" "data/pcr/mlc2_3.csv" "data/pcr/mlc2_4.csv"
## [10] "data/pcr/mlc2_5.csv"
  • (Supplementary question) remove the path and extension from the filename using basename() and tools::file_path_sans_ext().

Solution

list.files(file.path("data", "pcr"), full.names = TRUE) %>%
  set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  map(read_delim, delim = " ") %>% 
  names()
##  [1] "mlc1_1" "mlc1_2" "mlc1_3" "mlc1_4" "mlc1_5" "mlc2_1" "mlc2_2"
##  [8] "mlc2_3" "mlc2_4" "mlc2_5"
  1. Getting a single tibble out of all files would be much handier. Instead of further transforming this output we will use another member of the map() family of functions that will immediately create the desired output. Replace your call to map() in your previous code by the alternative member of the family to get a data frame directly.

Solution

list.files(file.path("data", "pcr"), full.names = TRUE) %>%
  set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  map_df(read_delim, delim = " ")
## # A tibble: 960 × 5
##    well  gene    Ct  type  flag
##   <chr> <chr> <dbl> <chr> <chr>
## 1    A1 gene1 33.60  data    OK
## 2    A2 gene2 22.42  data    OK
## 3    A3 gene3 31.84  data    OK
## # ... with 957 more rows
  1. Are you still able to identify the different samples and replicates? You probably need to adjust the appropriate argument in your mapping function (have a look at the help page).

Solution

# We are missing the sample and replicate variables (i.e. the associated filename)
# We need to adjust the .id argument to get the name of each individual
# tibble in a new column.
list.files(file.path("data", "pcr"), full.names = TRUE) %>%
  set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  map_df(read_delim, delim = " ", .id = "filename")
## # A tibble: 960 × 6
##   filename  well  gene    Ct  type  flag
##      <chr> <chr> <chr> <dbl> <chr> <chr>
## 1   mlc1_1    A1 gene1 33.60  data    OK
## 2   mlc1_1    A2 gene2 22.42  data    OK
## 3   mlc1_1    A3 gene3 31.84  data    OK
## # ... with 957 more rows

Rearrange the data and save multiple files

Now that we were able to read in our data into a single data frame we would like to group together the measures for each individual gene and store them as separate .csv files.

  1. first nest the data to let the measures associated to each gene appear in their own tibble.
  2. create a folder to store the files

Tip

# First we create a new folder inside the data folder to store the output files
# You can create such a folder relative to your project or Rmarkdown folder using the following command:
dir.create(file.path("data", "by_gene"), showWarnings = FALSE) 
# You can set `showWarnings = FALSE` to avoid a warning each time you knit
# or execute the chunk telling that the folder already exists 

Tip

In this tutorial we stored our input files in the data/pcr subfolder relative to the projects path. To build up a platform independent path (windows uses \ while linux and MAC OS are using /) we can use the function file.path() which will choose the appropriate path separator.

  1. create a new column containing the path to the target name
    • first create a filename using paste0
    • then combine the folder path with the filename using file.path().
  2. now we should use a function able to write .csv files and map it to the appropriate vector(s). If read_csv() is able to read .csv files, which would be the readr function able to write such a file?
    • Is writing a file to a disk a side-effect? What would be the appropriate purrr function?

Solution

  • Writing a file to a disk is considered to be a side-effect. We are indead not interested in changing our data and should use walk instead of map.
  • We will use write_csv() with two arguments: 1) the data and 2) the file path.
  • Here we will try two different approaches (and a third one which is actually only working with the developement version of purrr available on github)
    • First method will use walk2() with two input lists (data and filename). We will have to put it inside a mutate() function within our workflow. This might not be the nicest place for this function as we don’t want to mutate anything when calling walk2(). walk2() returns the first argument (data) leaving it unchanged and we will assign this output to the data column (this will overwrite it with itself leaving it unchanged).
    • Second method will use walk() but outside a mutate. We will use the transposed tibble as the input list (will map a function on each row instead of column) and declare an anonymous function using write_csv().
    • Third method not working with the actual purrr found on CRAN will use the new function iwalk(). iwalk() is a shortcut to walk2(.x, names(.x), .f) to use the index.

Solution

# Method 1: using walk2 but inside a mutate call
list.files(file.path("data", "pcr"), full.names = TRUE) %>%
  set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  map_df(read_delim, delim = " ", .id = "filename") %>%
  group_by(gene) %>%
  nest() %>%
  mutate(file_out = paste0(gene, ".csv"),
         file_out_path = file.path("data", "by_gene", file_out),
         data = walk2(data, file_out_path, write_csv))

Solution

# Method 2: using walk on the transposed tibble with an anonymous function
list.files(file.path("data", "pcr"), full.names = TRUE) %>%
  set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  map_df(read_delim, delim = " ", .id = "filename") %>%
  group_by(gene) %>%
  nest() %>%
  mutate(file_out = paste0(gene, ".csv"),
         file_out_path = file.path("data", "by_gene", file_out)) %>%
  transpose() %>%
  walk(~write_csv(.$data, .$file_out_path))

Solution

# Method 3: using `iwalk()` (new purrr function)
list.files(file.path("data", "pcr"), full.names = TRUE) %>%
  set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  map_df(read_delim, delim = " ", .id = "filename", col_types = cols()) %>%
  group_by(gene) %>%
  nest() %>%
  mutate(file_out = paste0(gene, ".csv"),
         file_out_path = file.path("data", "by_gene", file_out)) %>%
  select(file_out_path, data) %>%
  deframe() %>% # generate a named list from a 2 column tibble
  iwalk(write_csv)

2. Loading data and handling untidy data

Download the excel file called sizes.xls. The file contains measures (width, height and depth) of 5 different samples before and after a treatment. We would like to calculate the volume of the object before and after the treatment.

Load the content of the excel file

First load the measures into a data frame using the powerful tools provided by the purrr package.

  • What function would you like to repeat?
  • How do you generate your starting vector?

Solution

library(readxl)

xls_file <- "data/sizes.xls"

excel_sheets(xls_file) %>%
  set_names() %>% 
  map_df(read_excel, path = xls_file, .id = "treatment_onset", range = "B3:C9")
## # A tibble: 12 × 3
##    treatment_onset sample    size
##              <chr>  <chr>   <chr>
## 1           before      A  10x3x2
## 2           before      B   5x2x1
## 3           before      C   9x7x6
## 4           before      D   8x4x1
## 5           before      E   3x3x4
## 6           before      F  9x14x2
## 7            after      A  12x4x2
## 8            after      B   6x5x2
## 9            after      C   8x7x7
## 10           after      D  16x8x2
## 11           after      E   4x5x5
## 12           after      F 10x19x7

Calculate the volume

It turns out that the size (\(width \times height \times depth\)) was not entered in a tidy form. Extract the different values in order to calculate the volume.

Warning

We already saw the tidyr::separate() function which might help you to get what you want. Here we would like to stick to a purrr approach. Think about how many elements you would like to provide and how many elements you would like to get.

Tip

  • You might want to use base::strsplit() or stringr::str_split()

Solution

# The tidyr / dplyr way but it's not a purrr approach
excel_sheets(xls_file) %>%
  set_names() %>% 
  map_df(read_excel, path = xls_file, .id = "treatment_onset", range = "B3:C9") %>% 
  separate(size, c("width", "height", "depth"), sep = "x") %>%
  mutate_at(c("width", "height", "depth"), parse_integer) %>%
  mutate(volume = width * height * depth)
## # A tibble: 12 × 6
##    treatment_onset sample width height depth volume
##              <chr>  <chr> <int>  <int> <int>  <int>
## 1           before      A    10      3     2     60
## 2           before      B     5      2     1     10
## 3           before      C     9      7     6    378
## 4           before      D     8      4     1     32
## 5           before      E     3      3     4     36
## 6           before      F     9     14     2    252
## 7            after      A    12      4     2     96
## 8            after      B     6      5     2     60
## 9            after      C     8      7     7    392
## 10           after      D    16      8     2    256
## 11           after      E     4      5     5    100
## 12           after      F    10     19     7   1330

Solution

excel_sheets(xls_file) %>%
  set_names() %>% 
  map_df(read_excel, path = xls_file, .id = "treatment_onset", range = "B3:C9") %>% 
  mutate(tidy_size = stringr::str_split(size, "x") %>% map(parse_integer),
         volume = map_dbl(tidy_size, reduce, `*`))
## # A tibble: 12 × 5
##    treatment_onset sample    size tidy_size volume
##              <chr>  <chr>   <chr>    <list>  <dbl>
## 1           before      A  10x3x2 <int [3]>     60
## 2           before      B   5x2x1 <int [3]>     10
## 3           before      C   9x7x6 <int [3]>    378
## 4           before      D   8x4x1 <int [3]>     32
## 5           before      E   3x3x4 <int [3]>     36
## 6           before      F  9x14x2 <int [3]>    252
## 7            after      A  12x4x2 <int [3]>     96
## 8            after      B   6x5x2 <int [3]>     60
## 9            after      C   8x7x7 <int [3]>    392
## 10           after      D  16x8x2 <int [3]>    256
## 11           after      E   4x5x5 <int [3]>    100
## 12           after      F 10x19x7 <int [3]>   1330