This practical work is adapted from the exhaustive example published by David Robinson on his blog.

In 2008, Brauer et al. used microarrays to test the effect of starvation on the growth rate of yeast. For example, they tried limiting the yeast’s supply of glucose (sugar to metabolize into energy), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).

Project - set-up

Load the [Brauer2008_DataSet1.tds] file into a dataframe. This is the exact data that was published with the paper (though for some reason the link on the journal’s page is broken). It thus serves as a good example of tidying a biological dataset “found in the wild”.

original_data <- read_tsv("data/Brauer2008_DataSet1.tds")

1 Tidying the data

Have a look at the dataset. Is the data “tidy”?

1.1 Many variables are stored in one column

cat(as.character(original_data$NAME[1:3]), sep = "\n")
## SFB2       || ER to Golgi transport || molecular function unknown || YNL049C || 1082129
## || biological process unknown || molecular function unknown || YNL095C || 1086222
## QRI7       || proteolysis and peptidolysis || metalloendopeptidase activity || YDL104C || 1085955
  • Gene name e.g. SFB2. Note that not all genes have a name.
  • Biological process e.g. “proteolysis and peptidolysis”
  • Molecular function e.g. “metalloendopeptidase activity”
  • Systematic ID e.g. YNL049C. Unlike a gene name, every gene in this dataset has a systematic ID.
  • Another ID number e.g. 1082129. I don’t know what this number means, and it’s not annotated in the paper. Oh, well.
  1. Use the appropriate function provided in the tidyr library to split these values and generate a column for each variable.

  2. Once you separated the variables delimited by two “||”, check closer the new values: You will see that they might start and/or end with whitespaces which might be inconvinient during the subsequent use.
    • To remove these whitespaces, R base provides a function called trimws(). Let’s test how the function works:
    • Dplyr allows us to apply a function (in our case trimws()) to all columns. In other words, we would like to modify the content of each column with the output of the function trimws(). How can you achieve this? Save the result in a data frame called cleaned_data.
# Creating test string with whitespaces:
s <- "  Removing whitespaces at both ends "
s
## [1] "  Removing whitespaces at both ends "
trimws(s)
## [1] "Removing whitespaces at both ends"
  1. We are not going to use every column of the dataframe. Remove the unnecessary columns: number, GID, YORF and GWEIGHT.

Look at the column names.
Do you think that our dataset is now “tidy”?

1.2 Column headers are values, not variable names

  • Keep care to build a dataframe with each column representing a variable: At this point we are storing the sample name as a different column sample associated to values in expression column. Save as cleaned_data_melt

Now look at the content of the sample column. We are again facing the problem that two variables are stored in a single column.

Use the same function as before to split the sample column into two variables nutrient and rate (use the appropriate delimitation in sep and consider using the convert argument).

2 Representing the data

Tidying the data is a crucial step allowing easy handling and representing.

2.1 Plot the expression data of the LEU1 gene

Extract the data corresponding to the gene called “LEU1” and draw a line for each nutrient showing the expression in function of the growth rate.

2.2 Plot the expression data of a biological process

For this, we don’t need to filter by single gene names as the raw data provides us some information on the biological process for each gene.
Extract all the genes in the “leucine biosynthesis” process and plot the expression in function of the growth rate for each nutrient.

2.3 Perform a linear regression in top of the plots

Let’s play with the graph a little more. These trends look vaguely linear.
Add a linear regression with the appropriate ggplot2 function and carrefully adjust the method argument.

2.4 Switch to another biological process

Once the dataset is tidy, it is very easy to switch to another biological process. Instead of the “leucine biosynthesis”, plot the data corresponding to “sulfur metabolism”.

3. Linear models

We are following the gapminder example written by Hadley.

3.1 Nest data for systematic_name and nutrient

Before applying a linear regression to all genes, we must further clean the data. A linear model will be build with at least 6 rate points. Filter out the genes that have less than 6 data points. Moreover, some systematic_name and expression data are missing (NA). Filter them out too.

Warning

nest() cannot use column names with spaces, even if between backstics, such as Systematic name. Use systematic_name instead.

Then, nest the data and save the result as cleaned_nest.

3.2 Perform all linear models

Mutate the cleaned_nest data frame and, for all genes / nutrient, perform a linear regression of expression explained by rate.

Warning

A bug in dplyr (which is already fixed in the dev version), prevents you from using the form map(data, ~ lm(y ~ x, data = .x) instead of using map(data, function(x) lm(y ~ x, data = x)

The computation takes ~ 40 sec on my macbook pro. Save as cleaned_lm.

  • How many models did you perform?

3.3 Tidy the linear models

Similarly as the life expectancy, use the 3 main functions of broom to - glance the models - extract the \(r^2\) - tidy models - augment to extract residuals

This computation takes ~ 3 min. For testing, you should work with a subset by running sample_frac() before the mutate. Here, we used randomly only 5% of the data.

cleaned_lm %>%
  sample_frac(0.05) %>%
  mutate(...)
library("broom")
cleaned_lm %>%
  mutate(glance  = map(model, glance),
         rsq     = glance %>% map_dbl("r.squared"),
         tidy    = map(model, tidy),
         augment = map(model, augment)) -> cleaned_lm
## Warning in stats::summary.lm(x): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(x): essentially perfect fit: summary may be
## unreliable

3.4 Explore models

  • plot the histogram of \(r^2\) for each nutrient. What can you say?
  • Count how many models have a \(r^2 > 0.9\) per nutrient

  • For genes with \(r^2 > 0.9\) per nutrient, plot the distribution of the intercept/slope estimates

3.4 Explore models

Now, we would like to find out how many of the ‘linear-trend’ genes are overlapping to the nutrient starvations.

Venn Diagrams are an option but as we have 5 sets intersections would be difficult to see. The alternative is UpSet for which an R implementation exists.

The required object is a data.frame with: - row.names as systematic_name - columns as nutrient, so 5 - values as 0/1 for absence/presence

  • Starting from cleaned_lm, filter out the models with \(r^2 < 0.9\) and empty the systematic names (i.e = “”).

  • select only the systematic_name and nutrient columns
  • add a column set containing 1 in all rows.
  • convert from long to wide format (nutrient filled up by set). All absent genes will then be NAs.

  • Move the systematic from the column to rownames (tibble::column_to_rownames()) Save as mat_upset
  • The pseudo matrix is almost done: we get an absence as NA and a presence as 1.
  • Replace all NA by 0
  • Set the class of mat_upset to data.frame as upset does not handle tbl_df.
  • plot the upset using upset(mat_upset)

unilur Rmarkdown template - E. Koncina