3 May 2017
By the end you should be able to:
ggplot2 when you run into problems.
ggplot1
Released in 2005 and developed until 2008 by Hadley Wickham.
If the pipe (%>% in 2014) had been invented before, ggplot2 would have never existed
Hadley Wickham
# devtools::install_github("hadley/ggplot1")
p <- ggplot(mtcars, list(x = mpg, y = wt))
# need temp p object to avoid too many ()'s
scbrewer(ggpoint(p, list(colour = gear)))
# devtools::install_github("hadley/ggplot1")
library(ggplot1)
mtcars %>%
ggplot(list(x = mpg, y = wt)) %>%
ggpoint(list(colour = gear)) %>%
scbrewer()
library(ggplot2)
mtcars %>%
ggplot(aes(x = mpg, y = wt)) +
geom_point(aes(colour = as.factor(gear))) +
scale_color_brewer("gear", type = "qual")
Introduced a break in the workflow from %>% to +
ggplot2
ggplot2 stands for grammar of graphics plot, version 2
source: thinkR
| x | y | shape |
|---|---|---|
| 25 | 11 | circle |
| 0 | 0 | circle |
| 75 | 53 | square |
| 200 | 300 | square |

What if we want to split circles and squares?
Now, dot shapes and facets give the same information. Shapes could be freed for another meaningful variable.

tribble(
~x, ~y, ~shape,
25L, 11L, "circle",
0L, 0L, "circle",
75L, 53L, "square",
200L, 300L, "square"
) %>%
ggplot(aes(x = x, y = y, shape = shape)) +
geom_point(size = 4) +
facet_wrap(~ shape) +
coord_cartesian() +
theme_classic(base_size = 18)
Data visualisation is not meant just to be seen but to be read, like written text
Alberto Cairo
Using the following dataset from the Euro Club Index
library(tidyverse)
allSeasons <- read_rds("data/allseasons.rds")
oneSeason <- allSeasons %>% filter(year == 2016)
allSeasons
# A tibble: 1,556 x 12
club score year country n rank allRank atb atw eb
<chr> <int> <int> <chr> <int> <int> <int> <int> <int> <int>
1 ManUnited 1876 2001 ENG 20 1 6 2031 1831 1927
2 Liverpool 1876 2001 ENG 20 2 5 2020 1826 1918
3 Leeds 1843 2001 ENG 20 3 11 1979 1793 1880
4 Arsenal 1820 2001 ENG 20 4 14 1946 1776 1868
5 Chelsea 1794 2001 ENG 20 5 18 1860 1738 1865
6 Ipswich 1744 2001 ENG 20 6 31 1830 1710 1840
7 Sunderland 1732 2001 ENG 20 7 34 1802 1688 1813
8 AstonVilla 1725 2001 ENG 20 8 42 1775 1667 1796
9 Newcastle 1704 2001 ENG 20 9 46 1765 1663 1792
10 Middlesbrough 1699 2001 ENG 20 10 49 1737 1661 1791
# ... with 1,546 more rows, and 2 more variables: ew <int>, tenth <int>
source: John Burn-Murdoch, working at the Financial Times
points
points on a line
ribbon
shaded range
faceted plots
oneSeason %>%
ggplot(aes(x = year, y = score, colour = country)) +
geom_point(size = 3) +
scale_x_discrete() +
theme_bw(base_size = 18)

size = 3 increases the size of all dots. It is set outside aes()
scale_x_discrete() forces the single value on the x axis to be discrete
theme_bw() is a pre-defined black/white theme; base_size = 18 sets the base font size to 18
We can't see much. Improve the x mapping
oneSeason %>%
ggplot(aes(x = rank, y = score,
colour = country)) +
geom_point(size = 3) +
theme_bw(18)
scale_x_discrete is useless now, we have a continuous variable.
base_size = can be omitted in theme_bw() as it is the first argument.
Now it is obvious that Spain does well, even for low-ranking clubs
oneSeason %>%
ggplot(aes(rank, score,
colour = country)) +
geom_line() +
geom_point(size = 3) +
theme_bw(18)
aes() defined in ggplot() are passed on to all subsequent geoms
x = and y = could be omitted, better to specify them though.
Hard to see differences, ENG seems more coherent
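To illustrate how aesthetics set in ggplot() are inherited, here is a minimal sketch on the built-in mtcars data (used here only because it ships with R; the column choices are an assumption for illustration). Both layers reuse x, y and colour from ggplot(), and geom_point() adds one aesthetic of its own.

```r
library(ggplot2)

# x, y and colour are declared once in ggplot() and inherited by every layer
p_inherit <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_line() +                                   # no aes(): uses the inherited ones
  geom_point(aes(shape = factor(am)), size = 3)   # adds shape on top of them

# data actually used by the first layer (geom_line)
line_data <- layer_data(p_inherit, 1)
```

The line layer ends up with a colour column although no aes() was given to geom_line(), and the colour mapping also defines one line per group.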
oneSeason %>%
group_by(country) %>%
summarise(min = min(score),
max = max(score),
range = max - min) %>%
mutate(country = forcats::fct_reorder(country, range)) %>%
ggplot(aes(x = "2016", y = range, fill = country)) +
geom_col(position = "dodge") +
theme_classic(18)
force the discretization using 2016 as character
use dodging to get all bars on the same x index
reorder levels based on a numeric variable using fct_reorder
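A small sketch of what fct_reorder() does, using made-up values (the three ranges below are hypothetical, not taken from the Euro Club Index data): the factor levels are sorted by the numeric variable, which in turn drives the order of the bars and of the legend.

```r
library(forcats)

# hypothetical score ranges, one per country
country <- factor(c("ENG", "ESP", "ITA"))
range   <- c(300, 450, 200)

# default level order is alphabetical
levels(country)

# levels reordered by increasing range
country_by_range <- fct_reorder(country, range)
levels(country_by_range)
```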
oneSeason %>%
select(score, rank, country) %>%
filter(country %in% c("ENG", "ESP")) %>%
spread(country, score) %>%
rowwise() %>%
mutate(gap = ESP - ENG,
min = min(ESP, ENG),
max = max(ESP, ENG)) %>%
ggplot(aes(x = rank, fill = gap > 0)) +
geom_rect(aes(xmin = rank - 0.5,
xmax = rank + 0.5,
ymin = min, ymax = max), alpha = 0.8) +
theme_classic(18) +
scale_fill_manual(name = "gap", labels = c("ENG", "ESP"),
values = c("royalblue", "red3")) +
labs(title = "quality gap",
subtitle = "between England and Spain",
caption = "by John Burn-Murdoch")
rowwise() is mandatory to get the right min and max
are performing better at every rank except #11
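Why rowwise() matters can be seen on a tiny made-up table (the scores below are hypothetical). Without it, min() collapses both columns into a single overall minimum; with it, the minimum is taken per row. pmin() is shown as a possible vectorised alternative, not as what the slides use.

```r
library(dplyr)

# hypothetical scores for two ranks
scores <- tibble(rank = 1:2,
                 ESP = c(1900, 1700),
                 ENG = c(1850, 1750))

# min() over both whole columns: the same value for every row
without_rowwise <- scores %>%
  mutate(min = min(ESP, ENG))

# one minimum per row
with_rowwise <- scores %>%
  rowwise() %>%
  mutate(min = min(ESP, ENG)) %>%
  ungroup()

# vectorised alternative giving the same per-row result
with_pmin <- scores %>%
  mutate(min = pmin(ESP, ENG))
```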
oneSeason %>%
filter(country == "ENG") %>%
ggplot(aes(x = rank, y = score)) +
geom_ribbon(aes(ymin = atw, ymax = atb),
fill = "royalblue", alpha = 0.5) +
geom_line(size = 1.5, colour = "royalblue") +
geom_point(size = 3, colour = "royalblue") +
theme_bw(18) +
scale_fill_manual(name = "gap", labels = c("ENG", "ESP"),
values = c("royalblue", "red3")) +
labs(title = "Comparison of the nth best \nteam to its predecessors",
subtitle = "in England in 2016",
caption = "by John Burn-Murdoch")
allSeasons %>%
ggplot(aes(rank, score,
colour = country)) +
geom_line() +
#geom_point(size = 1.5) +
theme_classic(18) +
facet_wrap(~ year)
allSeasons %>%
select(score, year, rank, country) %>%
filter(country %in% c("ENG", "ESP")) %>%
spread(country, score) %>%
rowwise() %>%
mutate(gap = ESP - ENG,
min = min(ESP, ENG),
max = max(ESP, ENG)) %>%
ggplot(aes(x = rank, fill = gap > 0)) +
geom_rect(aes(xmin = rank - 0.5,
xmax = rank + 0.5,
ymin = min, ymax = max), alpha = 0.8) +
theme_classic(18) +
scale_fill_manual(name = "gap", labels = c("ENG", "ESP"),
values = c("royalblue", "red3")) +
labs(title = "quality gap",
subtitle = "between England and Spain",
caption = "by John Burn-Murdoch") +
facet_wrap(~ year)
With tidy data, add only the facet layer to get all panels

rstudio cheatsheet
iris <- as_tibble(iris)
iris
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows
as a tibble to avoid printing all 150 rows
For this course, I set the following to avoid the grey background and print bigger text
ggplot2::theme_set(ggplot2::theme_bw(18))
iris %>% ggplot() + geom_point(aes(x = Petal.Width, y = Petal.Length))

geom_point()
geom_line()
geom_bar()
geom_boxplot()
geom_histogram()
geom_density()
Aesthetics map the columns of a data.frame/tibble to the variables each ggplot2 geom is expecting.
For example, geom_point() requires at least the x and y coordinates for each point.
ggplot(iris) + geom_point(aes(x = Petal.Width, y = Petal.Length))
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
Additional arguments, such as the colour, the transparency (alpha) or the size, can be set.
ggplot(iris) +
geom_point(aes(x = Petal.Width,
y = Petal.Length),
colour = "blue", alpha = 0.6,
size = 3)

Note that parameters defined outside the aesthetics aes() are applied to all data
colour, alpha or size can also be mapped to a column in the data frame.
For example: We can attribute a different color to each species:
ggplot(iris) +
geom_point(aes(x = Petal.Width,
y = Petal.Length,
colour = Species),
alpha = 0.6, size = 3)

Note that the colour argument now is inside aes() and must refer to a column in the dataframe.
ggplot(iris) +
geom_point(aes(x = Petal.Width, y = Petal.Length, shape = Species, colour = Species),
alpha = 0.6, size = 3)

It is easy to adjust axis labels and the title
ggplot(iris) +
geom_point(aes(x = Petal.Width,
y = Petal.Length,
colour = Species),
alpha = 0.6, size = 3) +
labs(x = "Width",
y = "Length",
colour = "flower",
title = "Iris",
subtitle = "petal measures",
caption = "Fisher, R. A. (1936)")
ggplot(iris) +
geom_histogram(aes(x = Petal.Length,
fill = Species),
alpha = 0.6)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The density is the count divided by the total number of occurrences (and, for a histogram, by the bin width, so that the total area is 1).
ggplot(iris) +
geom_density(aes(x = Petal.Length,
fill = Species),
alpha = 0.6)
ggplot(iris) +
geom_histogram(aes(x = Petal.Length, y = ..density..), fill = "darkgrey", binwidth = 0.1) +
geom_density(aes(x = Petal.Length, fill = Species, colour = Species), alpha = 0.4) +
theme_classic()
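The density scaling of a histogram can be checked by hand. The sketch below uses base R's hist() rather than ggplot2's stat_bin(), on the assumption that both normalise in the same way; the bin width of 0.5 is an arbitrary choice for this illustration.

```r
# petal lengths from the built-in iris data set
x <- iris$Petal.Length

# assumed bin width for this sketch
binwidth <- 0.5
breaks <- seq(1, 7, by = binwidth)

# bin the data without drawing anything
h <- hist(x, breaks = breaks, plot = FALSE)

# density = count / total number of observations / bin width
manual_density <- h$counts / sum(h$counts) / binwidth

# the bars of a density histogram cover a total area of 1
total_area <- sum(manual_density * binwidth)
```

Because the bars cover an area of 1, a histogram on the density scale can share its y axis with a density curve.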

Variables written between double dots (..variable..) are intermediate values calculated by ggplot2 using stat functions
Each geom uses a stat function to transform the data:
geom_histogram() uses stat_bin() to divide the data into bins and count the number of observations in each bin.
stat_bin() computes for example: ..count.., ..density.., ..ncount.. and ..ndensity.. (see ?stat_bin)
..density..
stat_identity is used for scatter plots or geom_col(), no transformation
stat_count
stat_bin
stat_density_2d
stat_bin_2d
stat_ellipse
geom_bar() counts the number of occurrences for each value of a categorical variable.
geom_bar() uses stat_count() to compute these values (creating a new count column)
ggplot(iris) + geom_bar(aes(x = Species))

# or: geom_bar(aes(x = Species, y = ..count..))
By default, geom_bar() uses stat_count()
stat = "identity" will force geom_bar() to use stat_identity() instead (leaving the original data unchanged)
Use Petal.Length and not ..Petal.Length.., as it is not "new" and is already present in the original data frame
ggplot(iris) + geom_bar(aes(x = Species, y = Petal.Length), stat = "identity")

Since version 2.2.0, thanks to Bob Rudis, there is geom_col(), which requires a y variable
geom_col
ggplot(iris) + geom_col(aes(x = Species, y = Petal.Length))
mtcars %>%
ggplot() +
geom_bar(aes(x = factor(cyl),
fill = factor(gear)))

mtcars %>%
mutate(cyl = factor(cyl),
gear = factor(gear)) %>%
complete(cyl, gear) %>%
ggplot() +
geom_bar(aes(x = cyl,
fill = gear),
position = "dodge")
the combination gear 4 / cyl 8 is missing. Using tidyr::complete() to avoid bars with different widths.
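One caveat with this approach: complete() adds a row for the missing combination, and geom_bar() counts rows, so that added row is itself counted once. A possible variant, sketched below, counts first and then fills the missing combination with an explicit zero before plotting with geom_col(). The object names are only for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# count each cyl / gear combination, then add the missing one with n = 0
cyl_gear <- mtcars %>%
  mutate(cyl = factor(cyl),
         gear = factor(gear)) %>%
  count(cyl, gear) %>%
  complete(cyl, gear, fill = list(n = 0L))

# counts are already computed, so geom_col() is used instead of geom_bar()
p_dodge <- ggplot(cyl_gear) +
  geom_col(aes(x = cyl, y = n, fill = gear),
           position = "dodge")
```

All nine combinations are now present, bars keep the same width, and the gear 4 / cyl 8 bar has a height of zero.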
mtcars %>%
mutate(cyl = factor(cyl),
gear = factor(gear)) %>%
complete(cyl, gear) %>%
ggplot() +
geom_bar(aes(x = cyl,
fill = gear),
position = "fill")
We can easily switch to polar coordinates:
mtcars %>%
mutate(cyl = factor(cyl),
gear = factor(gear)) %>%
complete(cyl, gear) %>%
ggplot() +
geom_bar(aes(x = cyl,
fill = gear),
position = "fill") +
coord_polar()
ggplot(mtcars) +
geom_boxplot(aes(x = factor(cyl),
y = mpg))
ggplot(mtcars) +
geom_boxplot(aes(x = factor(cyl),
y = mpg,
fill = factor(am)))
scale_fill_manual() and scale_color_manual()
ggplot(mtcars) +
geom_boxplot(aes(x = factor(cyl),
y = mpg,
fill = factor(am),
color = factor(am))) +
scale_fill_manual(values = c("red", "lightblue")) +
scale_color_manual(values = c("purple", "blue"))

library(RColorBrewer)
display.brewer.all()

ggplot(mtcars) +
geom_boxplot(aes(x = factor(cyl),
y = mpg,
fill = factor(am),
colour = factor(am))) +
scale_fill_brewer(palette = "Pastel2") +
scale_colour_brewer(palette = "Set1")

mtcars %>%
ggplot(aes(x = wt,
y = mpg,
colour = hp)) +
geom_point(size = 3)
mtcars %>%
ggplot(aes(x = wt,
y = mpg,
colour = hp)) +
geom_point(size = 3) +
viridis::scale_colour_viridis()
viridis is color blind friendly and nice in b&w

Actually, one can use a plain character string inside aes(); it will be used to build the legend. Useful for a few layers when too lazy to create the variable in the data frame.
set.seed(123)
dens <- tibble(x = c(rnorm(500),
rnorm(200, 3, 3)))
ggplot(dens) +
geom_line(aes(x), stat = "density") +
geom_vline(aes(xintercept = mean(x),
colour = "mean"),
size = 1.1) +
geom_vline(aes(xintercept = median(x),
colour = "median"),
size = 1.1) -> p
p
dens_mode <- tibble(mode = density(dens$x)$x[which.max(density(dens$x)$y)])
p + geom_vline(data = dens_mode,
aes(xintercept = mode, colour = "mode"), size = 1.1) +
theme(legend.position = "top") +
scale_colour_hue(name = NULL) # could be: labs(colour = NULL)

The easiest way to create facets is to provide facet_wrap() with a column name
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_wrap(~ cyl)

ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_wrap(~ cyl, ncol = 2)

ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_wrap(~ cyl, scales = "free_x")

ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_wrap(~ cyl, scales = "free")

facet_grid() takes the rows on the left and the columns on the right, separated by a tilde ~ (i.e. by)
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_grid(am ~ cyl)

A dot (.) specifies that no faceting should be performed in that dimension. This mimics facet_wrap()
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_grid(. ~ cyl)

Add the column names with labeller
ggplot(mtcars) +
geom_point(aes(x = wt, y = mpg)) +
facet_grid(am ~ cyl,
labeller = label_both)
fig.height, fig.width
fig.asp…
ggplot object, 2nd argument
ggsave("aes_trick.png", p,
width = 60, height = 30, units = "mm")
ggsave("aes_trick.pdf", p,
width = 50, height = 50, units = "mm")
ggplot2 introduced the possibility for the community to contribute and create extensions.
They are referenced on a dedicated site

Never trust summary statistics alone; always visualize your data
Alberto Cairo
source: Justin Matejka, George Fitzmaurice Same Stats, Different Graphs…
A compilation of some of my gifs created with #rstats #ggplot2 #gganimate #tweenr https://t.co/nCppSOZv4W
Marcus Volz (@mgvolz), 4 April 2017
geom_tile() heatmap
geom_bin2d() 2D binning
geom_abline() slope
stat_ellipse()
stat_summary() easy mean 95CI etc.
geom_smooth() linear/splines/non linear
ggforce::facet_grid_paginate() facets
gridExtra::marrangeGrob() plots
position_jitter() random shift
coord_cartesian() for zooming in
coord_flip() exchanges x & y
scale_x_log10() and y
scale_x_sqrt() and y
aes_string() for plotting inside function
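For plotting inside a function, column names usually arrive as character strings. The sketch below uses the .data pronoun, which is an alternative to the aes_string() named above (aes_string() is deprecated in recent ggplot2 releases); it assumes ggplot2 3.0.0 or later, and the function name scatter_by_name is hypothetical.

```r
library(ggplot2)

# hypothetical helper: x_col and y_col are column names given as strings
scatter_by_name <- function(data, x_col, y_col) {
  ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_point()
}

p_fun <- scatter_by_name(mtcars, "wt", "mpg")

# data as seen by the point layer
point_data <- layer_data(p_fun)
```

The same helper works for any pair of numeric columns, for example scatter_by_name(iris, "Petal.Width", "Petal.Length").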