This practical work is adapted from the exhaustive example published by David Robinson on his blog.
In 2008, Brauer et al. used microarrays to test the effect of starvation on the growth rate of yeast. For example, they tried limiting the yeast’s supply of glucose (sugar to metabolize into energy), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).
data
Brauer2008_DataSet1.tds
inside the data
folderLoad the [Brauer2008_DataSet1.tds
] file as a tibble
named original_data
. This is the exact data that was published with the paper (though for some reason the link on the journal’s page is broken). It thus serves as a good example of tidying a biological dataset “found in the wild”.
Have a look at the dataset. Is the data “tidy”?
1082129
. We don’t know what this number means, and it’s not annotated in the paper. Oh, well.tidyr
library to split these values and generate a column for each variable.Special characters such as pipes (|
) must be despecialized as they have a specific meaning. In R, you have to use 2 backslashes like \\|
for one pipe |
||
”, check closer the new values: You will see that they might start and/or end with whitespaces which might be inconvinient during the subsequent use.
trimws()
. Let’s test how the function works:dplyr
allows us to apply a function (in our case trimws()
) to all columns. In other words, we would like to modify the content of each column with the output of the function trimws()
. How can you achieve this? Save the result in a tibble
called cleaned_data
.# Creating test string with whitespaces:
s <- " Removing whitespaces at both ends "
s
## [1] " Removing whitespaces at both ends "
trimws(s)
## [1] "Removing whitespaces at both ends"
number
, GID
, YORF
and GWEIGHT
.Look at the column names.
Do you think that our dataset is now “tidy”?
G0.05
…) as a different column sample
associated to values in expression
column. Save as cleaned_data_melt
Now look at the content of the sample
column. We are again facing the problem that two variables are stored in a single column. The nutrient
in the first character, then the growth rate
.
Use the same function as before to split the sample
column into two variables nutrient
and rate
(use the appropriate delimitation in sep
.
Consider using the convert
argument. It allows to convert strings to number when relevant like here.
Right now, the nutrients are designed by a single letter. It would be nice to have the full word instead. One could use a full mixture of if
and else
such as if_else(nutrient == "G", "Glucose", if_else(nutrient == "L", "Leucine", etc ...))
But, that would be cumbersome.
using the following correspondences and dplyr::recode
, recode all nutrient names.
G = "Glucose", L = "Leucine", P = "Phosphate",
S = "Sulfate", N = "Ammonia", U = "Uracil"
Two variables must be present for the further analysis:
expression
systematic_name
delete observations that are missing any of the two mandatory variables. How many rows did you remove?
Tidying the data is a crucial step allowing easy handling and representing.
Extract the data corresponding to the gene called LEU1 and draw a line for each nutrient showing the expression in function of the growth rate.
For this, we don’t need to filter by single gene names as the raw data provides us some information on the biological process for each gene.
Extract all the genes in the leucine biosynthesis process and plot the expression in function of the growth rate for each nutrient.
Let’s play with the graph a little more. These trends look vaguely linear.
Add a linear regression with the appropriate ggplot2
function and carrefully adjust the method
argument.
Once the dataset is tidy, it is very easy to switch to another biological process. Instead of the “leucine biosynthesis”, plot the data corresponding to sulfur metabolism.
you can combine the facet headers using +
in facet_wrap()
. Adding the systematic name allows to get a name when the gene name is missing.
We can see that most genes follow a linear trend under a starvation stress. Instead of manipulating all data points, we would like to estimate just the 2 parameters of a linear model. What are they? And what do they represent for this experiment?
Perform a linear regression of expression
explained by rate
for all group of name, systematic_name and nutrient. By sure to convert the linear models to a tibble
using broom
as saved as cleaned_lm
The computation takes ~ 60 sec on my macbook pro. For test purposes, you may want to subsample per group a small number of genes
How many models did you perform?
bonus question: you have observed a warning essentially perfect fit: summary may be unreliable
, why? How can we clean up the dataset to avoid it?
For the gene LEU1, let’s look again at the linear models per nutrient and add as a dashed black line for mean of all intercepts.
We can see that when leucine is the limiting factor, the yeast expresses massively this gene LEU1 to compensate. Remember that all experiments are performed at constant growth rates in this chemostat.
Now, compute the difference between the intercept and the mean(intercept) per systematic name, so for all nutrient per gene. And select the top 20 highest intercept values to their means. Save as top_20_intercept
cleaned_data_melt
to retreived the original data pointsplot the data points and linear trends for those top 20 genes. Instead of the systematic name in the header, use the molecular function (MF
)
what can conclude?
We can now have a look at the slope estimates.
first, filter the term for slope estimate rate
and remove p-value that are missing. Save as rate_slopes
display the p-values histograms per nutrient
looking at those histograms, how do you estimate the ratio of null and alternative hypotheses?
how can retrieved the genes that belong most probably to the _alternative hypothesis?
display the histograms of both p and q values per nutrient