Project - set-up

Set up a project using the project manager utility, at the top-right of the RStudio window.

[Screenshots: project menu, Files tab, create menu]

You should now see the 4 panels of the RStudio layout.

Reading data

Download this simple tab-separated file: http://lsru.github.io/r_workshop/data/women.tsv

Save it inside the folder R_workshop/day1-beginner/data.

Remember, your currently active RStudio project should be day1-beginner.

Load it. All paths are relative to the root, which is the project's folder.

library("readr")
df <- read_tsv("data/women.tsv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   height = col_integer(),
##   weight = col_integer()
## )
df
## # A tibble: 15 x 2
##    height weight
##     <int>  <int>
## 1      58    115
## 2      59    117
## 3      60    120
## 4      61    123
## 5      62    126
## 6      63    129
## 7      64    132
## 8      65    135
## 9      66    139
## 10     67    142
## 11     68    146
## 12     69    150
## 13     70    154
## 14     71    159
## 15     72    164

Thanks to readr, the object df is already a tibble. For the differences between a tibble and a plain data frame, see the RStudio blog post on tibble.

Manipulate a data frame

We keep this section short, as we will focus on dplyr to perform tasks on data frames.

Access one column and display only its first elements:

head(df$height)
## [1] 58 59 60 61 62 63

Using a similar syntax, apply:

To compute the BMI of each woman (remember that heights are in inches and weights in US pounds), the formula is:

\[BMI = \frac{weight}{height^2} * 703\]

For the first individual (^2 for square):

(115 / 58^2) * 703
## [1] 24.0324
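The same formula can be applied to whole columns at once, since arithmetic on vectors in R works element by element. The sketch below rebuilds the 15 rows printed earlier as a plain data frame so that it runs without the downloaded file; the name bmi is chosen to match the object used later in this practical.

```r
# Rebuild the 15 rows shown above so the example runs on its own.
df <- data.frame(
  height = 58:72,
  weight = c(115, 117, 120, 123, 126, 129, 132, 135,
             139, 142, 146, 150, 154, 159, 164)
)

# Arithmetic on columns is vectorised: this gives one BMI per row.
bmi <- (df$weight / df$height^2) * 703

# Display only the first elements.
head(bmi)
```

The first value should agree with the calculation done by hand for the first individual.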

plotting

First load dplyr, which enables the use of the %>% pipe operator.

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Using the df dataset:

tidying and plotting

df has 2 columns, and both contain values.

Remember that gather() takes all columns by default.
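As a minimal sketch of that default behaviour, the example below calls gather() on a three-row stand-in for df without selecting any column. The column names measure and value are assumptions made for this illustration.

```r
library("tidyr")

# Small stand-in for df (first three rows of the data shown above).
df <- data.frame(height = 58:60, weight = c(115, 117, 120))

# No columns are selected, so gather() stacks every column.
# "measure" and "value" are names chosen for this sketch.
df_long <- gather(df, key = "measure", value = "value")

df_long
```

Each original column name ends up in measure and each cell in value, giving one row per measurement.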

plot densities

adding a column to a data frame

Let’s add bmi as a third column to df.

df$bmi <- bmi
head(df)
## # A tibble: 6 x 3
##   height weight      bmi
##    <int>  <int>    <dbl>
## 1     58    115 24.03240
## 2     59    117 23.62856
## 3     60    120 23.43333
## 4     61    123 23.23811
## 5     62    126 23.04318
## 6     63    129 22.84883
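Since this practical relies on dplyr for data frame tasks, the same column could also be created with mutate(). This is only an alternative sketch on a three-row stand-in for df; the name df_bmi is hypothetical.

```r
library("dplyr")

# Small stand-in for df (first three rows of the data shown above).
df <- data.frame(height = 58:60, weight = c(115, 117, 120))

# mutate() evaluates the expression inside df, so no df$ prefix is needed.
df_bmi <- df %>%
  mutate(bmi = (weight / height^2) * 703)

df_bmi
```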

plot densities

  • Gather (from tidyr) the 3 columns, then plot all densities using different colours and make them translucent. You will need to make a new df_melt data frame first.

The 3 distributions have very different ranges.
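One possible way to draw the overlaid densities is sketched below. It rebuilds the data inline so that it runs on its own; the column names measure and value, the alpha level of 0.5 and the object name p_density are assumptions, not necessarily the intended solution.

```r
library("tidyr")
library("ggplot2")

# Rebuild df with its three columns so the example runs on its own.
df <- data.frame(
  height = 58:72,
  weight = c(115, 117, 120, 123, 126, 129, 132, 135,
             139, 142, 146, 150, 154, 159, 164)
)
df$bmi <- (df$weight / df$height^2) * 703

# Long format: one row per (measure, value) pair.
df_melt <- gather(df, key = "measure", value = "value")

# fill is mapped to measure inside aes(); alpha is a fixed setting outside it.
p_density <- ggplot(df_melt, aes(x = value, fill = measure)) +
  geom_density(alpha = 0.5)

p_density
```

On a shared x axis the three curves sit far apart, which illustrates the remark about their ranges.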

  • Plot the same data, but facet it by measure (use the appropriate free scale).

When faceting, the 3 distributions are drawn in distinct plots: mapping the colours to measure is useless.

  • Redo the plot using a lightblue colour for all three distributions. Be careful NOT to set the colour inside aes().
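The faceted version with a single fixed colour could look like the sketch below. The data are rebuilt inline; scales = "free" (which frees both axes) is one possible choice among the free-scale options, and the names measure, value and p_facet are assumptions.

```r
library("tidyr")
library("ggplot2")

# Rebuild df with its three columns so the example runs on its own.
df <- data.frame(
  height = 58:72,
  weight = c(115, 117, 120, 123, 126, 129, 132, 135,
             139, 142, 146, 150, 154, 159, 164)
)
df$bmi <- (df$weight / df$height^2) * 703
df_melt <- gather(df, key = "measure", value = "value")

# fill is a constant here, so it goes outside aes().
# Inside aes(), "lightblue" would be treated as data, not as a colour name.
p_facet <- ggplot(df_melt, aes(x = value)) +
  geom_density(fill = "lightblue") +
  facet_wrap(~ measure, scales = "free")

p_facet
```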

Supplementary exercises

reading a more complex file

Microarray data from the GEO dataset GSE35982.

  • Download this compressed file, GSE35982.tsv.gz, into your data folder.
  • Read it using read_tsv() and store it in a data frame named gse. The file will be uncompressed seamlessly.

  • Is the file tidy?

  • Gather the samples. Look at the gather help page to select columns based on characters.

  • Plot the distributions as boxplots.

  • Any obvious issues? Check the file and find out what happened.
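For the gathering and boxplot steps, the sketch below uses a toy stand-in rather than the real file. The column names (probe_id and sample columns starting with "GSM") and the values are hypothetical, so the selection helper has to be adapted to the actual column names in gse.

```r
library("tidyr")
library("ggplot2")

# Toy stand-in for gse: column names and values are made up for this sketch.
toy <- data.frame(
  probe_id = c("p1", "p2", "p3"),
  GSM0001 = c(7.1, 8.4, 6.9),
  GSM0002 = c(7.3, 8.1, 7.0),
  stringsAsFactors = FALSE
)

# starts_with() selects the columns to gather by a prefix of their names.
toy_long <- gather(toy, key = "sample", value = "expression",
                   starts_with("GSM"))

# One boxplot per sample.
p_box <- ggplot(toy_long, aes(x = sample, y = expression)) +
  geom_boxplot()

p_box
```

Columns that are not selected, such as probe_id here, are kept and repeated for each gathered sample.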

Hint

The locale setting in readr lets you specify the decimal mark used for floating-point numbers.
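A minimal sketch of that option, using a small temporary file instead of the GEO data: the file content and the names tmp and x are made up for this illustration.

```r
library("readr")

# Write a tiny tab-separated file that uses a comma as decimal mark.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "a\t1,5", "b\t2,25"), tmp)

# locale(decimal_mark = ",") tells readr how decimals are written in the file.
x <- read_tsv(tmp, locale = locale(decimal_mark = ","))

x$value
```

The locale applies to the whole file, so it is worth checking first which columns actually use the other decimal mark.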

  • Correct the mistake by reading the file again with the relevant option adjusted, and store the data in a new object.
  • Replace the wrong column in gse with the correct one from the data frame you just created.
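Replacing a single column can be done with the same $ syntax used earlier. The sketch below uses two toy data frames; the object name gse_fixed, the column names and all values are hypothetical.

```r
# Toy stand-ins: s1 is the misread column in gse, correct in gse_fixed.
gse <- data.frame(id = c("p1", "p2"), s1 = c(15, 225), s2 = c(1.2, 2.3))
gse_fixed <- data.frame(id = c("p1", "p2"), s1 = c(1.5, 2.25), s2 = c(1.2, 2.3))

# Make sure both objects list the rows in the same order before copying.
stopifnot(identical(gse$id, gse_fixed$id))

# Overwrite only the wrong column; the other columns are left untouched.
gse$s1 <- gse_fixed$s1

gse
```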

  • Tidy the samples again.

  • Plot the distributions as boxplots.

  • Do the data appear normalised?

unilur Rmarkdown template - E. Koncina