3 May 2017
By the end you should be able to:
ggplot2
when you run into problems.
ggplot1
Released by Hadley Wickham, from 2005 until 2008.
"If the pipe (%>% in 2014) had been invented before, ggplot2 would have never existed" (Hadley Wickham)
# devtools::install_github("hadley/ggplot1")
library(ggplot1)
p <- ggplot(mtcars, list(x = mpg, y = wt))
# need temp p object to avoid too many ()'s
scbrewer(ggpoint(p, list(colour = gear)))
# devtools::install_github("hadley/ggplot1")
library(ggplot1)
library(magrittr) # provides %>%
mtcars %>%
  ggplot(list(x = mpg, y = wt)) %>%
  ggpoint(list(colour = gear)) %>%
  scbrewer()
library(ggplot2)
library(magrittr) # provides %>%
mtcars %>%
  ggplot(aes(x = mpg, y = wt)) +
  geom_point(aes(colour = as.factor(gear))) +
  scale_color_brewer("gear", type = "qual")
Introduced a break in the workflow, from %>% to +
ggplot2
ggplot2 stands for grammar of graphics plot, version 2
source: thinkR
x | y | shape |
---|---|---|
25 | 11 | circle |
0 | 0 | circle |
75 | 53 | square |
200 | 300 | square |
What if we want to split circles and squares?
Now, dot shapes and facets give the same information. Shapes could be freed for another meaningful variable
tribble(
  ~x,   ~y,   ~shape,
  25L,  11L,  "circle",
  0L,   0L,   "circle",
  75L,  53L,  "square",
  200L, 300L, "square"
) %>%
  ggplot(aes(x = x, y = y, shape = shape)) +
  geom_point(size = 4) +
  facet_wrap(~ shape) +
  coord_cartesian() +
  theme_classic(base_size = 18)
"Data visualisation is not meant just to be seen but to be read, like written text" (Alberto Cairo)
Using the following dataset from the Euro Club Index
library(tidyverse)
allSeasons <- read_rds("data/allseasons.rds")
oneSeason <- allSeasons %>%
  filter(year == 2016)
allSeasons
# A tibble: 1,556 x 12
   club          score  year country     n  rank allRank   atb   atw    eb
   <chr>         <int> <int> <chr>   <int> <int>   <int> <int> <int> <int>
 1 ManUnited      1876  2001 ENG        20     1       6  2031  1831  1927
 2 Liverpool      1876  2001 ENG        20     2       5  2020  1826  1918
 3 Leeds          1843  2001 ENG        20     3      11  1979  1793  1880
 4 Arsenal        1820  2001 ENG        20     4      14  1946  1776  1868
 5 Chelsea        1794  2001 ENG        20     5      18  1860  1738  1865
 6 Ipswich        1744  2001 ENG        20     6      31  1830  1710  1840
 7 Sunderland     1732  2001 ENG        20     7      34  1802  1688  1813
 8 AstonVilla     1725  2001 ENG        20     8      42  1775  1667  1796
 9 Newcastle      1704  2001 ENG        20     9      46  1765  1663  1792
10 Middlesbrough  1699  2001 ENG        20    10      49  1737  1661  1791
# ... with 1,546 more rows, and 2 more variables: ew <int>, tenth <int>
source: John Burn-Murdoch, working at the Financial Times
points
points on a line
ribbon
shaded range
faceted plots
oneSeason %>% ggplot(aes(x = year, y = score, colour = country)) + geom_point(size = 3) + scale_x_discrete() + theme_bw(base_size = 18)
size = 3 increases the size of all dots; it is set outside aes().
scale_x_discrete() forces the single value on the x axis to be discrete.
theme_bw() is a pre-defined black/white theme, here with the base font size set to 18.
We can't see much. Improve the x mapping.
oneSeason %>% ggplot(aes(x = rank, y = score, colour = country)) + geom_point(size = 3) + theme_bw(18)
scale_x_discrete() is useless now, as we have a continuous variable.
base_size = can be omitted in theme_bw() as it is the first argument.
Now it is obvious that Spain does well, even for low-ranking clubs.
oneSeason %>% ggplot(aes(rank, score, colour = country)) + geom_line() + geom_point(size = 3) + theme_bw(18)
aes() defined in ggplot() are passed on to all subsequent geoms.
The names x and y could be omitted; better to specify them though.
Hard to see differences; ENG seems more coherent.
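The inheritance of aesthetics can be checked on the plot object itself. The following is a minimal sketch, assuming the built-in mtcars data rather than the course dataset; the plot and the variable names are illustrative only.

```r
library(ggplot2)

# Mappings declared in ggplot() are shared by every layer
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_line() +                                   # inherits x, y and colour
  geom_point(aes(shape = factor(am)), size = 3)   # adds shape on top of them

# Global mappings are stored on the plot, layer-specific ones on the layer
global_aes <- names(p$mapping)
point_aes  <- names(p$layers[[2]]$mapping)
print(global_aes)
print(point_aes)
```

Here geom_line() receives one line per colour level, because the colour mapping also defines the groups.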
oneSeason %>%
  group_by(country) %>%
  summarise(min = min(score),
            max = max(score),
            range = max - min) %>%
  mutate(country = forcats::fct_reorder(country, range)) %>%
  ggplot(aes(x = "2016", y = range, fill = country)) +
  geom_col(position = "dodge") +
  theme_classic(18)
force the discretization using 2016 as character
use dodging to get all bars on the same x index
reorder levels based on a numeric variable using fct_reorder
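A small sketch of what fct_reorder() does to the factor levels. The country codes and range values below are made up for illustration and are not taken from the Euro Club Index data.

```r
library(forcats)

# Hypothetical ranges, one per country
country <- factor(c("ENG", "ESP", "GER"))
score_range <- c(300, 450, 120)

# Levels are sorted by increasing value of the numeric variable
reordered <- fct_reorder(country, score_range)
print(levels(country))
print(levels(reordered))
```

Only the order of the levels changes, not the values, which is why the bars and the legend follow the new order.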
oneSeason %>%
  select(score, rank, country) %>%
  filter(country %in% c("ENG", "ESP")) %>%
  spread(country, score) %>%
  rowwise() %>%
  mutate(gap = ESP - ENG,
         min = min(ESP, ENG),
         max = max(ESP, ENG)) %>%
  ggplot(aes(x = rank, fill = gap > 0)) +
  geom_rect(aes(xmin = rank - 0.5, xmax = rank + 0.5,
                ymin = min, ymax = max), alpha = 0.8) +
  theme_classic(18) +
  scale_fill_manual(name = "gap",
                    labels = c("ENG", "ESP"),
                    values = c("royalblue", "red3")) +
  labs(title = "quality gap",
       subtitle = "between England and Spain",
       caption = "by John Burn-Murdoch")
rowwise() is mandatory to get the right min and max for each row
are performing better at every rank except #11
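Why rowwise() matters can be seen with base R alone. This is a minimal sketch on toy scores (hypothetical values, not the course data); pmin() and pmax() are shown as a vectorised alternative that gives the per-row result without rowwise().

```r
# Toy scores (hypothetical values, not the Euro Club Index data)
scores <- data.frame(rank = 1:3,
                     ENG = c(1900, 1850, 1700),
                     ESP = c(2000, 1800, 1750))

# Without rowwise(), min() collapses both columns into one single value
global_min <- min(scores$ESP, scores$ENG)

# pmin() / pmax() are vectorised: one result per row
row_min <- pmin(scores$ESP, scores$ENG)
row_max <- pmax(scores$ESP, scores$ENG)

print(global_min)
print(row_min)
print(row_max)
```

With the single global value, every rectangle would start at the same ymin, which is not what the plot intends.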
oneSeason %>%
  filter(country == "ENG") %>%
  ggplot(aes(x = rank, y = score)) +
  geom_ribbon(aes(ymin = atw, ymax = atb), fill = "royalblue", alpha = 0.5) +
  geom_line(size = 1.5, colour = "royalblue") +
  geom_point(size = 3, colour = "royalblue") +
  theme_bw(18) +
  scale_fill_manual(name = "gap",
                    labels = c("ENG", "ESP"),
                    values = c("royalblue", "red3")) +
  labs(title = "Comparison of the nth best \nteam to its predecessors",
       subtitle = "in England in 2016",
       caption = "by John Burn-Murdoch")
allSeasons %>%
  ggplot(aes(rank, score, colour = country)) +
  geom_line() +
  #geom_point(size = 1.5) +
  theme_classic(18) +
  facet_wrap(~ year)
allSeasons %>%
  select(score, year, rank, country) %>%
  filter(country %in% c("ENG", "ESP")) %>%
  spread(country, score) %>%
  rowwise() %>%
  mutate(gap = ESP - ENG,
         min = min(ESP, ENG),
         max = max(ESP, ENG)) %>%
  ggplot(aes(x = rank, fill = gap > 0)) +
  geom_rect(aes(xmin = rank - 0.5, xmax = rank + 0.5,
                ymin = min, ymax = max), alpha = 0.8) +
  theme_classic(18) +
  scale_fill_manual(name = "gap",
                    labels = c("ENG", "ESP"),
                    values = c("royalblue", "red3")) +
  labs(title = "quality gap",
       subtitle = "between England and Spain",
       caption = "by John Burn-Murdoch") +
  facet_wrap(~ year)
With tidy data, add only the facet layer to get all panels
rstudio cheatsheet
iris <- as_tibble(iris)
iris
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
 1          5.1         3.5          1.4         0.2  setosa
 2          4.9         3.0          1.4         0.2  setosa
 3          4.7         3.2          1.3         0.2  setosa
 4          4.6         3.1          1.5         0.2  setosa
 5          5.0         3.6          1.4         0.2  setosa
 6          5.4         3.9          1.7         0.4  setosa
 7          4.6         3.4          1.4         0.3  setosa
 8          5.0         3.4          1.5         0.2  setosa
 9          4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
# ... with 140 more rows
iris is converted to a tibble to avoid printing all 150 rows
For this course, I set the following theme to avoid the grey background and print bigger text:
ggplot2::theme_set(ggplot2::theme_bw(18))
iris %>% ggplot() + geom_point(aes(x = Petal.Width, y = Petal.Length))
geom_point()
geom_line()
geom_bar()
geom_boxplot()
geom_histogram()
geom_density()
Aesthetics map the columns of a data.frame/tibble to the variables each ggplot2 geom is expecting.
For example, geom_point() requires at least the x and y coordinates for each point.
ggplot(iris) + geom_point(aes(x = Petal.Width, y = Petal.Length))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
Additional arguments can be set, such as the colour, the transparency (alpha) or the size.
ggplot(iris) + geom_point(aes(x = Petal.Width, y = Petal.Length), colour = "blue", alpha = 0.6, size = 3)
Note that parameters defined outside the aesthetics aes() are applied to all data.
colour, alpha or size can also be mapped to a column in the data frame.
For example, we can attribute a different colour to each species:
ggplot(iris) + geom_point(aes(x = Petal.Width, y = Petal.Length, colour = Species), alpha = 0.6, size = 3)
Note that the colour argument is now inside aes() and must refer to a column in the data frame.
ggplot(iris) + geom_point(aes(x = Petal.Width, y = Petal.Length, shape = Species, colour = Species), alpha = 0.6, size = 3)
It is easy to adjust axis labels and the title
ggplot(iris) + geom_point(aes(x = Petal.Width, y = Petal.Length, colour = Species), alpha = 0.6, size = 3) + labs(x = "Width", y = "Length", colour = "flower", title = "Iris", subtitle = "petal measures", caption = "Fisher, R. A. (1936)")
ggplot(iris) + geom_histogram(aes(x = Petal.Length, fill = Species), alpha = 0.6)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The density is the count divided by the total number of occurrences (and by the bin width), so that the total area equals 1.
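The definition can be verified by hand. The sketch below uses base R's hist() as a stand-in for the binning step (an assumption made for illustration; ggplot2 does its own binning), with arbitrary breaks of width 0.5.

```r
# Bin the petal lengths without drawing anything
h <- hist(iris$Petal.Length, breaks = seq(1, 7, by = 0.5), plot = FALSE)

# density = count / (total number of observations * bin width)
width <- diff(h$breaks)
manual_density <- h$counts / (sum(h$counts) * width)

# Same values as the ones hist() reports, and the bars cover an area of 1
same_as_hist <- isTRUE(all.equal(manual_density, h$density))
total_area <- sum(manual_density * width)
print(same_as_hist)
print(total_area)
```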
ggplot(iris) + geom_density(aes(x = Petal.Length, fill = Species), alpha = 0.6)
ggplot(iris) +
  geom_histogram(aes(x = Petal.Length, y = ..density..),
                 fill = "darkgrey", binwidth = 0.1) +
  geom_density(aes(x = Petal.Length, fill = Species, colour = Species),
               alpha = 0.4) +
  theme_classic()
Variables surrounded by two dots (..variable..) are intermediate values calculated by ggplot2 using stat functions.
Each geom uses a stat function to transform the data:
geom_histogram() uses stat_bin() to divide the data into bins and count the number of observations in each bin.
stat_bin() computes for example: ..count.., ..density.., ..ncount.. and ..ndensity.. (see ?stat_bin()).
..density.. is the one mapped to y in the histogram above.
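The computed variables can be inspected directly. This is a minimal sketch using layer_data(), which returns the data of a layer after its stat has run; it also shows after_stat(), which recent ggplot2 versions (3.3.0 and later) document as the replacement for the ..name.. notation.

```r
library(ggplot2)

p <- ggplot(iris) +
  geom_histogram(aes(x = Petal.Length), binwidth = 0.1)

# Data frame of the first layer, after stat_bin() has been applied
computed <- layer_data(p, 1)
print(head(computed[, c("x", "count", "density", "ncount", "ndensity")]))

# Equivalent of y = ..density.. in recent ggplot2 versions
p_density <- ggplot(iris) +
  geom_histogram(aes(x = Petal.Length, y = after_stat(density)),
                 binwidth = 0.1)
```

The ..density.. notation used in these slides still describes the same computed column.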
stat_identity is used for scatter plots or geom_col(): no transformation
stat_count
stat_bin
stat_density_2d
stat_bin_2d
stat_ellipse
geom_bar() counts the number of occurrences for each value of a categorical variable.
geom_bar() uses stat_count() to compute these values (creating a new count column)
ggplot(iris) +
  geom_bar(aes(x = Species))
# or: geom_bar(aes(x = Species, y = ..count..))
stat_count()
stat = "identity" will force geom_bar() to use stat_identity() instead (leaving the original data unchanged)
y is mapped to Petal.Length and not ..Petal.Length.. as it is not "new" and is already present in the original data frame
ggplot(iris) +
  geom_bar(aes(x = Species, y = Petal.Length), stat = "identity")
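The two behaviours can be reproduced with base R, as a rough check of what ends up on the y axis. This sketch assumes the default stacking position of geom_bar(), under which the 50 values of each species are piled on top of each other.

```r
# What stat_count() computes: the number of rows per category
counts <- table(iris$Species)
print(counts)

# With stat = "identity" the y values are used as they are; the 50 bars of
# each species are stacked, so the visible height is the sum per species
heights <- tapply(iris$Petal.Length, iris$Species, sum)
print(heights)
```

If the mean per species is wanted instead, the data would need to be summarised before plotting.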
Since version 2.2.0, thanks to Bob Rudis, geom_col() is available; it does require a y variable
geom_col
ggplot(iris) + geom_col(aes(x = Species, y = Petal.Length))
mtcars %>% ggplot() + geom_bar(aes(x = factor(cyl), fill = factor(gear)))
mtcars %>% mutate(cyl = factor(cyl), gear = factor(gear)) %>% complete(cyl, gear) %>% ggplot() + geom_bar(aes(x = cyl, fill = gear), position = "dodge")
The combination gear 4 / cyl 8 is missing. tidyr::complete() is used to avoid bars with different widths.
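What complete() adds can be checked before plotting. The sketch below is an illustration under assumptions: the object names are hypothetical, and position_dodge(preserve = "single") is offered as a possible alternative available in recent ggplot2 versions, not as the approach used in these slides.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

mt <- mtcars %>%
  mutate(cyl = factor(cyl), gear = factor(gear))

# complete() adds one row for the absent cyl 8 / gear 4 combination
completed <- mt %>% complete(cyl, gear)
added_rows <- nrow(completed) - nrow(mt)
print(added_rows)

# Worth checking: geom_bar() counts rows, so the added row is counted once
added_count <- completed %>%
  count(cyl, gear) %>%
  filter(cyl == "8", gear == "4")
print(added_count)

# Possible alternative: keep equal bar widths without touching the data
p <- ggplot(mt) +
  geom_bar(aes(x = cyl, fill = gear),
           position = position_dodge(preserve = "single"))
```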
mtcars %>% mutate(cyl = factor(cyl), gear = factor(gear)) %>% complete(cyl, gear) %>% ggplot() + geom_bar(aes(x = cyl, fill = gear), position = "fill")
We can easily switch to polar coordinates:
mtcars %>% mutate(cyl = factor(cyl), gear = factor(gear)) %>% complete(cyl, gear) %>% ggplot() + geom_bar(aes(x = cyl, fill = gear), position = "fill") + coord_polar()
ggplot(mtcars) + geom_boxplot(aes(x = factor(cyl), y = mpg))
ggplot(mtcars) + geom_boxplot(aes(x = factor(cyl), y = mpg, fill = factor(am)))
scale_fill_manual() and scale_color_manual()
ggplot(mtcars) + geom_boxplot(aes(x = factor(cyl), y = mpg, fill = factor(am), color = factor(am))) + scale_fill_manual(values = c("red", "lightblue")) + scale_color_manual(values = c("purple", "blue"))
library(RColorBrewer)
display.brewer.all()
ggplot(mtcars) + geom_boxplot(aes(x = factor(cyl), y = mpg, fill = factor(am), colour = factor(am))) + scale_fill_brewer(palette = "Pastel2") + scale_colour_brewer(palette = "Set1")
mtcars %>% ggplot(aes(x = wt, y = mpg, colour = hp)) + geom_point(size = 3)
mtcars %>% ggplot(aes(x = wt, y = mpg, colour = hp)) + geom_point(size = 3) + viridis::scale_colour_viridis()
viridis is color blind friendly and nice in b&w
Actually, one can use a plain character string inside aes(); it will be used to build the legend. Useful for a few layers when you are too lazy to create the variable in the data frame.
set.seed(123)
dens <- tibble(x = c(rnorm(500), rnorm(200, 3, 3)))
ggplot(dens) +
  geom_line(aes(x), stat = "density") +
  geom_vline(aes(xintercept = mean(x), colour = "mean"), size = 1.1) +
  geom_vline(aes(xintercept = median(x), colour = "median"), size = 1.1) -> p
p
dens_mode <- tibble(mode = density(dens$x)$x[which.max(density(dens$x)$y)])
p +
  geom_vline(data = dens_mode,
             aes(xintercept = mode, colour = "mode"), size = 1.1) +
  theme(legend.position = "top") +
  scale_colour_hue(name = NULL) # could be: labs(colour = NULL)
The easiest way to create facets is to provide facet_wrap() with a column name
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_wrap(~ cyl)
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_wrap(~ cyl, ncol = 2)
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_wrap(~ cyl, scales = "free_x")
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_wrap(~ cyl, scales = "free")
In facet_grid(), the rows go on the left and the columns on the right, separated by a tilde ~ (i.e. by)
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_grid(am ~ cyl)
A dot (.) specifies that no faceting should be performed. This mimics facet_wrap()
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_grid(. ~ cyl)
Add the column names with labeller
ggplot(mtcars) + geom_point(aes(x = wt, y = mpg)) + facet_grid(am ~ cyl, labeller = label_both)
fig.height, fig.width
fig.asp
…
ggplot object, 2nd argument
ggsave("aes_trick.png", p, width = 60, height = 30, units = "mm")
ggsave("aes_trick.pdf", p, width = 50, height = 50, units = "mm")
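A self-contained variant of the ggsave() calls above, as a sketch: the plot and the file name are hypothetical, and the file is written to a temporary directory so that nothing is left in the project folder.

```r
library(ggplot2)

p <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg))

# Hypothetical output file in a temporary directory
out <- file.path(tempdir(), "example_plot.pdf")

# The file extension selects the device; the plot is the 2nd argument
ggsave(out, plot = p, width = 60, height = 30, units = "mm")

saved <- file.exists(out)
print(saved)
```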
ggplot2 introduced the possibility for the community to contribute and create extensions.
They are referenced on a dedicated site
"never trust summary statistics alone; always visualize your data" (Alberto Cairo)
source: Justin Matejka, George Fitzmaurice Same Stats, Different Graphs…
A compilation of some of my gifs created with #rstats #ggplot2 #gganimate #tweenr https://t.co/nCppSOZv4W
Marcus Volz (@mgvolz), 4 April 2017
geom_tile() heatmap
geom_bin2d() 2D binning
geom_abline() slope
stat_ellipse()
stat_summary() easy mean 95CI etc.
geom_smooth() linear/splines/non linear
ggforce::facet_grid_paginate() facets
gridExtra::marrangeGrob() plots
position_jitter() random shift
coord_cartesian() for zooming in
coord_flip() exchanges x & y
scale_x_log10() and y
scale_x_sqrt() and y
aes_string() for plotting inside function
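For plotting inside a function, recent ggplot2 versions document the .data pronoun as an alternative to aes_string(). The helper below is a hypothetical sketch of that pattern, not part of the original material.

```r
library(ggplot2)

# Hypothetical helper: column names arrive as character strings
scatter_by_name <- function(df, xvar, yvar) {
  ggplot(df) +
    geom_point(aes(x = .data[[xvar]], y = .data[[yvar]])) +
    labs(x = xvar, y = yvar)
}

p <- scatter_by_name(iris, "Petal.Width", "Petal.Length")

# One point per row of the input data
n_points <- nrow(layer_data(p, 1))
print(n_points)
```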