This practical work is adapted from the exhaustive example published by David Robinson on his blog.
In 2008, Brauer et al. used microarrays to test the effect of starvation on the growth rate of yeast. For example, they tried limiting the yeast’s supply of glucose (sugar to metabolize into energy), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).
Create a new project in a meaningful folder name (such as R_workshop/day2-advanced
) on your computer using the project manager utility in the upper-right part of the rstudio window.
data
Download Brauer2008_DataSet1.tds
inside the data
folder
Load the [Brauer2008_DataSet1.tds
] file into a dataframe. This is the exact data that was published with the paper (though for some reason the link on the journal’s page is broken). It thus serves as a good example of tidying a biological dataset “found in the wild”.
original_data <- read_tsv("data/Brauer2008_DataSet1.tds")
Have a look at the dataset. Is the data “tidy”?
cat(as.character(original_data$NAME[1:3]), sep = "\n")
## SFB2 || ER to Golgi transport || molecular function unknown || YNL049C || 1082129
## || biological process unknown || molecular function unknown || YNL095C || 1086222
## QRI7 || proteolysis and peptidolysis || metalloendopeptidase activity || YDL104C || 1085955
1082129
. I don’t know what this number means, and it’s not annotated in the paper. Oh, well.tidyr
library to split these values and generate a column for each variable.cleaned_data <- original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|")
||
”, check closer the new values: You will see that they might start and/or end with whitespaces which might be inconvinient during the subsequent use.
trimws()
. Let’s test how the function works:trimws()
) to all columns. In other words, we would like to modify the content of each column with the output of the function trimws()
. How can you achieve this? Save the result in a data frame called cleaned_data
.# Creating test string with whitespaces:
s <- " Removing whitespaces at both ends "
s
## [1] " Removing whitespaces at both ends "
trimws(s)
## [1] "Removing whitespaces at both ends"
cleaned_data <- original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_each(funs(trimws), name:systematic_name)
number
, GID
, YORF
and GWEIGHT
.cleaned_data %>%
select(-number, -GID, -YORF, -GWEIGHT) -> cleaned_data
Look at the column names.
Do you think that our dataset is now “tidy”?
No, our dataframe is still not tidy. We can see that the column names from G0.05 to U0.3 represent a variable.
sample
associated to values in expression
column. Save as cleaned_data_melt
cleaned_data %>%
gather(sample, expression, G0.05:U0.3) -> cleaned_data_melt
Now look at the content of the sample
column. We are again facing the problem that two variables are stored in a single column.
levels(cleaned_data_melt$sample)
## NULL
Use the same function as before to split the sample
column into two variables nutrient
and rate
(use the appropriate delimitation in sep
and consider using the convert
argument).
cleaned_data_melt %>%
separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE) -> cleaned_data_melt
Tidying the data is a crucial step allowing easy handling and representing.
Extract the data corresponding to the gene called “LEU1” and draw a line for each nutrient showing the expression in function of the growth rate.
cleaned_data_melt %>%
filter(name == "LEU1") %>%
ggplot(aes(rate, expression, colour = nutrient)) +
geom_line() +
theme_bw()
For this, we don’t need to filter by single gene names as the raw data provides us some information on the biological process for each gene.
Extract all the genes in the “leucine biosynthesis” process and plot the expression in function of the growth rate for each nutrient.
cleaned_data_melt %>%
filter(BP == "leucine biosynthesis") %>%
ggplot(aes(rate, expression, color = nutrient)) +
geom_line() +
facet_wrap(~ name)
Let’s play with the graph a little more. These trends look vaguely linear.
Add a linear regression with the appropriate ggplot2
function and carrefully adjust the method
argument.
cleaned_data_melt %>%
filter(BP == "leucine biosynthesis") %>%
ggplot(aes(rate, expression, colour = nutrient)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ name)
Once the dataset is tidy, it is very easy to switch to another biological process. Instead of the “leucine biosynthesis”, plot the data corresponding to “sulfur metabolism”.
cleaned_data_melt %>%
filter(BP == "sulfur metabolism") %>%
ggplot(aes(x = rate, y = expression, colour = nutrient)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ name + systematic_name, scales = "free_y") # add 2 headers to facets with '+'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
We are following the gapminder
example written by Hadley.
Before applying a linear regression to all genes, we must further clean the data. A linear model will be build with at least 6 rate points. Filter out the genes that have less than 6 data points. Moreover, some systematic_name and expression data are missing (NA
). Filter them out too.
Then, nest the data and save the result as cleaned_nest
.
cleaned_data_melt %>%
filter(!is.na(expression),!is.na(systematic_name)) %>%
group_by(nutrient, systematic_name) %>%
mutate(n = n()) %>%
filter(n > 5) %>%
nest() -> cleaned_nest
Mutate the cleaned_nest
data frame and, for all genes / nutrient, perform a linear regression of expression
explained by rate
.
A bug in dplyr
(which is already fixed in the dev version), prevents you from using the form map(data, ~ lm(y ~ x, data = .x)
instead of using map(data, function(x) lm(y ~ x, data = x)
The computation takes ~ 40 sec on my macbook pro. Save as cleaned_lm
.
library("purrr")
##
## Attaching package: 'purrr'
## The following object is masked from 'package:dplyr':
##
## order_by
cleaned_nest %>%
mutate(model = map(data, function(x) lm(expression ~ rate, data = x))) -> cleaned_lm
the nrow of cleaned_nest is 32543 models
Similarly as the life expectancy, use the 3 main functions of broom
to - glance
the models - extract the \(r^2\) - tidy
models - augment
to extract residuals
This computation takes ~ 3 min. For testing, you should work with a subset by running sample_frac()
before the mutate
. Here, we used randomly only 5% of the data.
cleaned_lm %>%
sample_frac(0.05) %>%
mutate(...)
library("broom")
cleaned_lm %>%
mutate(glance = map(model, glance),
rsq = glance %>% map_dbl("r.squared"),
tidy = map(model, tidy),
augment = map(model, augment)) -> cleaned_lm
## Warning in stats::summary.lm(x): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be
## unreliable
theme_set(theme_bw(14))
cleaned_lm %>%
ggplot(aes(x = rsq))+
geom_histogram()+
facet_wrap(~ nutrient)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most genes do not respond in a linear fashion to starvation. Especially for S and U.
cleaned_lm %>%
filter(rsq > 0.9) %>%
count(nutrient)
## Source: local data frame [6 x 2]
##
## nutrient n
## (chr) (int)
## 1 G 756
## 2 L 815
## 3 N 603
## 4 P 966
## 5 S 295
## 6 U 272
cleaned_lm %>%
filter(rsq > 0.9) %>%
unnest(tidy) %>%
ggplot(aes(x = estimate, fill = term))+
geom_histogram(alpha = 0.6)+
facet_wrap(~ nutrient)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, we would like to find out how many of the ‘linear-trend’ genes are overlapping to the nutrient starvations.
Venn Diagrams are an option but as we have 5 sets intersections would be difficult to see. The alternative is UpSet for which an R
implementation exists.
The required object is a data.frame
with: - row.names as systematic_name
- columns as nutrient
, so 5 - values as 0/1 for absence/presence
Starting from cleaned_lm
, filter out the models with \(r^2 < 0.9\) and empty the systematic names (i.e = “”).
systematic_name
and nutrient
columnsset
containing 1
in all rows.convert from long to wide format (nutrient
filled up by set
). All absent genes will then be NAs.
cleaned_lm %>%
filter(rsq > 0.9, systematic_name != "") %>%
select(systematic_name, nutrient) %>%
mutate(set = 1) %>%
spread(nutrient, set) -> mat
the systematic
from the column to rownames (tibble::column_to_rownames()
) Save as mat_upset
NA
and a presence as 1
.NA
by 0mat_upset
to data.frame
as upset
does not handle tbl_df
.upset(mat_upset)
library("UpSetR")
mat %>%
tibble::column_to_rownames("systematic_name") -> mat_upset
## Warning: Setting row names on a tibble is deprecated.
mat_upset[is.na(mat_upset)] <- 0
class(mat_upset) <- "data.frame"
upset(mat_upset)