purrr
practicalDownload a set of qPCR experiments results here. The file is a zip archive which you should uncompress into a subfolder of your project.
For example, if you downloaded the file into your project folder you can run the following command:
unzip("pcr.zip", exdir = "data/pcr") # will create a data/pcr subfolder and extract the files
These qPCR results were obtained from 2 different samples and were replicated 5 times. They are stored in 10 different files. A filename looks like “mlc1_1.csv” where the first number is the sample id and the second the replicate id.
mlc1_1.csv
using the read_delim()
function.read_delim(file.path("data", "pcr", "mlc1_1.csv"), delim = " ")
## # A tibble: 96 × 5
## well gene Ct type flag
## <chr> <chr> <dbl> <chr> <chr>
## 1 A1 gene1 33.60 data OK
## 2 A2 gene2 22.42 data OK
## 3 A3 gene3 31.84 data OK
## 4 A4 gene4 30.25 data OK
## 5 A5 gene5 33.14 data OK
## 6 A6 gene6 33.80 data OK
## 7 A7 gene7 31.05 data OK
## 8 A8 gene8 16.29 data OK
## 9 A9 gene9 32.59 data OK
## 10 A10 gene10 30.13 data OK
## # ... with 86 more rows
pcr_files
containing the path to all 10 data files using the list.files()
function and adjust the full.names
argument accordingly (if required you might get some help using ?list.files
).pcr_files <- list.files(file.path("data", "pcr"), full.names = TRUE)
map()
from the purrr
package to import all files.
glimpse()
)options(tibble.print_min = 3)
list.files(file.path("data", "pcr"), full.names = TRUE) %>%
map(read_delim, delim = " ", col_types = cols()) %>%
str(max.level = 1)
## List of 10
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 96 obs. of 5 variables:
## ..- attr(*, "spec")=List of 2
## .. ..- attr(*, "class")= chr "col_spec"
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 96 obs. of 5 variables:
## ..- attr(*, "spec")=List of 2
## .. ..- attr(*, "class")= chr "col_spec"
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 96 obs. of 5 variables:
## ..- attr(*, "spec")=List of 2
## .. ..- attr(*, "class")= chr "col_spec"
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 96 obs. of 5 variables:
## ..- attr(*, "spec")=List of 2
## .. ..- attr(*, "class")= chr "col_spec"
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 96 obs. of 5 variables:
## ..- attr(*, "spec")=List of 2
## .. ..- attr(*, "class")= chr "col_spec"
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 96 obs. of 5 variables:
## ..- attr(*, "spec")=List of 2
## .. ..- attr(*, "class")= chr "col_spec"
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 96 obs. of 5 variables:
## ..- attr(*, "spec")=List of 2
## .. ..- attr(*, "class")= chr "col_spec"
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 96 obs. of 5 variables:
## ..- attr(*, "spec")=List of 2
## .. ..- attr(*, "class")= chr "col_spec"
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 96 obs. of 5 variables:
## ..- attr(*, "spec")=List of 2
## .. ..- attr(*, "class")= chr "col_spec"
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 96 obs. of 5 variables:
## ..- attr(*, "spec")=List of 2
## .. ..- attr(*, "class")= chr "col_spec"
# map() creates a list
# We are not able to tell which list element comes from which file
map()
will name each output element according to the names found in the input vector. Use set_names()
to keep this information.list.files(file.path("data", "pcr"), full.names = TRUE) %>%
set_names() %>%
map(read_delim, delim = " ") %>%
names()
## [1] "data/pcr/mlc1_1.csv" "data/pcr/mlc1_2.csv" "data/pcr/mlc1_3.csv"
## [4] "data/pcr/mlc1_4.csv" "data/pcr/mlc1_5.csv" "data/pcr/mlc2_1.csv"
## [7] "data/pcr/mlc2_2.csv" "data/pcr/mlc2_3.csv" "data/pcr/mlc2_4.csv"
## [10] "data/pcr/mlc2_5.csv"
basename()
and tools::file_path_sans_ext()
.list.files(file.path("data", "pcr"), full.names = TRUE) %>%
set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
map(read_delim, delim = " ") %>%
names()
## [1] "mlc1_1" "mlc1_2" "mlc1_3" "mlc1_4" "mlc1_5" "mlc2_1" "mlc2_2"
## [8] "mlc2_3" "mlc2_4" "mlc2_5"
tibble
out of all files would be much handier. Instead of further transforming this output we will use another member of the map()
family of functions that will immediately create the desired output. Replace your call to map()
in your previous code by the alternative member of the family to get a data frame directly.list.files(file.path("data", "pcr"), full.names = TRUE) %>%
set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
map_df(read_delim, delim = " ")
## # A tibble: 960 × 5
## well gene Ct type flag
## <chr> <chr> <dbl> <chr> <chr>
## 1 A1 gene1 33.60 data OK
## 2 A2 gene2 22.42 data OK
## 3 A3 gene3 31.84 data OK
## # ... with 957 more rows
# We are missing the sample and replicate variables (i.e. the associated filename)
# We need to adjust the .id argument to get the name of each individual
# tibble in a new column.
list.files(file.path("data", "pcr"), full.names = TRUE) %>%
set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
map_df(read_delim, delim = " ", .id = "filename")
## # A tibble: 960 × 6
## filename well gene Ct type flag
## <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 mlc1_1 A1 gene1 33.60 data OK
## 2 mlc1_1 A2 gene2 22.42 data OK
## 3 mlc1_1 A3 gene3 31.84 data OK
## # ... with 957 more rows
Now that we were able to read in our data into a single data frame we would like to group together the measures for each individual gene and store them as separate .csv
files.
# First we create a new folder inside the data folder to store the output files
# You can create such a folder relative to your project or Rmarkdown folder using the following command:
dir.create(file.path("data", "by_gene"), showWarnings = FALSE)
# You can set `showWarnings = FALSE` to avoid a warning each time you knit
# or execute the chunk telling that the folder already exists
In this tutorial we stored our input files in the data/pcr
subfolder relative to the projects path. To build up a platform independent path (windows uses \
while linux and MAC OS are using /
) we can use the function file.path()
which will choose the appropriate path separator.
paste0
file.path()
..csv
files and map it to the appropriate vector(s). If read_csv()
is able to read .csv
files, which would be the readr
function able to write such a file?
purrr
function?walk
instead of map
.write_csv()
with two arguments: 1) the data and 2) the file path.purrr
available on github)
walk2()
with two input lists (data and filename). We will have to put it inside a mutate()
function within our workflow. This might not be the nicest place for this function as we don’t want to mutate anything when calling walk2()
. walk2()
returns the first argument (data) leaving it unchanged and we will assign this output to the data column (this will overwrite it with itself leaving it unchanged).walk()
but outside a mutate. We will use the transposed tibble as the input list (will map a function on each row instead of column) and declare an anonymous function using write_csv()
.iwalk()
. iwalk()
is a shortcut to walk2(.x, names(.x), .f)
to use the index.# Method 1: using walk2 but inside a mutate call
list.files(file.path("data", "pcr"), full.names = TRUE) %>%
set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
map_df(read_delim, delim = " ", .id = "filename") %>%
group_by(gene) %>%
nest() %>%
mutate(file_out = paste0(gene, ".csv"),
file_out_path = file.path("data", "by_gene", file_out),
data = walk2(data, file_out_path, write_csv))
# Method 2: using walk on the transposed tibble with an anonymous function
list.files(file.path("data", "pcr"), full.names = TRUE) %>%
set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
map_df(read_delim, delim = " ", .id = "filename") %>%
group_by(gene) %>%
nest() %>%
mutate(file_out = paste0(gene, ".csv"),
file_out_path = file.path("data", "by_gene", file_out)) %>%
transpose() %>%
walk(~write_csv(.$data, .$file_out_path))
# Method 3: using `iwalk()` (new purrr function)
list.files(file.path("data", "pcr"), full.names = TRUE) %>%
set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
map_df(read_delim, delim = " ", .id = "filename", col_types = cols()) %>%
group_by(gene) %>%
nest() %>%
mutate(file_out = paste0(gene, ".csv"),
file_out_path = file.path("data", "by_gene", file_out)) %>%
select(file_out_path, data) %>%
deframe() %>% # generate a named list from a 2 column tibble
iwalk(write_csv)
Download the excel file called sizes.xls
. The file contains measures (width, height and depth) of 5 different samples before and after a treatment. We would like to calculate the volume of the object before and after the treatment.
First load the measures into a data frame using the powerful tools provided by the purrr
package.
library(readxl)
xls_file <- "data/sizes.xls"
excel_sheets(xls_file) %>%
set_names() %>%
map_df(read_excel, path = xls_file, .id = "treatment_onset", range = "B3:C9")
## # A tibble: 12 × 3
## treatment_onset sample size
## <chr> <chr> <chr>
## 1 before A 10x3x2
## 2 before B 5x2x1
## 3 before C 9x7x6
## 4 before D 8x4x1
## 5 before E 3x3x4
## 6 before F 9x14x2
## 7 after A 12x4x2
## 8 after B 6x5x2
## 9 after C 8x7x7
## 10 after D 16x8x2
## 11 after E 4x5x5
## 12 after F 10x19x7
It turns out that the size (\(width \times height \times depth\)) was not entered in a tidy form. Extract the different values in order to calculate the volume.
We already saw the tidyr::separate()
function which might help you to get what you want. Here we would like to stick to a purrr
approach. Think about how many elements you would like to provide and how many elements you would like to get.
base::strsplit()
or stringr::str_split()
# The tidyr / dplyr way but it's not a purrr approach
excel_sheets(xls_file) %>%
set_names() %>%
map_df(read_excel, path = xls_file, .id = "treatment_onset", range = "B3:C9") %>%
separate(size, c("width", "height", "depth"), sep = "x") %>%
mutate_at(c("width", "height", "depth"), parse_integer) %>%
mutate(volume = width * height * depth)
## # A tibble: 12 × 6
## treatment_onset sample width height depth volume
## <chr> <chr> <int> <int> <int> <int>
## 1 before A 10 3 2 60
## 2 before B 5 2 1 10
## 3 before C 9 7 6 378
## 4 before D 8 4 1 32
## 5 before E 3 3 4 36
## 6 before F 9 14 2 252
## 7 after A 12 4 2 96
## 8 after B 6 5 2 60
## 9 after C 8 7 7 392
## 10 after D 16 8 2 256
## 11 after E 4 5 5 100
## 12 after F 10 19 7 1330
excel_sheets(xls_file) %>%
set_names() %>%
map_df(read_excel, path = xls_file, .id = "treatment_onset", range = "B3:C9") %>%
mutate(tidy_size = stringr::str_split(size, "x") %>% map(parse_integer),
volume = map_dbl(tidy_size, reduce, `*`))
## # A tibble: 12 × 5
## treatment_onset sample size tidy_size volume
## <chr> <chr> <chr> <list> <dbl>
## 1 before A 10x3x2 <int [3]> 60
## 2 before B 5x2x1 <int [3]> 10
## 3 before C 9x7x6 <int [3]> 378
## 4 before D 8x4x1 <int [3]> 32
## 5 before E 3x3x4 <int [3]> 36
## 6 before F 9x14x2 <int [3]> 252
## 7 after A 12x4x2 <int [3]> 96
## 8 after B 6x5x2 <int [3]> 60
## 9 after C 8x7x7 <int [3]> 392
## 10 after D 16x8x2 <int [3]> 256
## 11 after E 4x5x5 <int [3]> 100
## 12 after F 10x19x7 <int [3]> 1330