Visualisations with ggplot

July 2018

1. Introduction

You have already made some plots in Sessions 1 and 2

Histograms

Boxplots

Scatterplots

In this session you extend these plotting routines using ggplot2. The ggplot2 package is part of the tidyverse, a collection of packages that support data science. Lex and I are big fans!

2. The `ggplot2` package

a dedicated visualization package
based on the Grammar of Graphics (Wilkinson, 2005), hence the gg in the name of the package

Key Points

conceptualizes graphics (and plots) in terms of their components.
each element of the graphic is handled separately in a series of layers
provides control over each part of the plot
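
The layering idea can be seen in a minimal sketch. It assumes the ggplot2 package is installed and uses its built-in mpg dataset rather than the census data in these slides; the object names are illustrative only.

```r
library(ggplot2)

# Start with the data and the aesthetic mappings only: no layers yet
p_base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Each + adds one component, handled separately from the others
p_points <- p_base + geom_point()                 # a layer of points
p_labelled <- p_points +
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon")            # axis labels
p_final <- p_labelled + theme_bw()                # overall appearance

# Printing a ggplot object draws it
print(p_final)
```

Because each step is stored as its own object, any component can be swapped without touching the others.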

ggplot in action
start with a boxplot drawn using base R graphics

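# base R graphics: widen the left margin to fit the axis labels
par(mar = c(4, 8, 3, 3))
# plot 1: horizontal boxplots of ownership by social class, outliers hidden
boxplot(Owned ~ social.class, data = census, horizontal = TRUE, outline = FALSE,
        lwd = 0.5, las = 2, col = c("#D7191C", "#FFFFBF", "#2B83BA"),
        xlab = "Property Ownership")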

a similar boxplot built with ggplot2
note the use of the + to add each component

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot()

set the fill colours and the legend title with scale_fill_manual()

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot() +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA"))

flip the axes with coord_flip() so the boxes are horizontal

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot() +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA")) +
  coord_flip()

label the axes with ylab() and xlab()

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot() +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA")) +
  coord_flip() + ylab("Property Ownership") + xlab("")

apply a black-and-white theme with theme_bw()

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot() +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA")) +
  coord_flip() + ylab("Property Ownership") + xlab("") +  theme_bw()

make the boxes narrower with the width argument

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot(width = 0.5) +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA")) +
  coord_flip() + ylab("Property Ownership") + xlab("") +  theme_bw()

space the boxes apart with position_dodge()

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot(width = 0.5, position=position_dodge(1)) +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA")) +
  coord_flip() + ylab("Property Ownership") + xlab("") +  theme_bw()

3. Exploratory data analysis (EDA)

Using ggplot2 tools, this practical will show you how to use visualisation (and some data transformations) to explore your data in a systematic way, a task known as exploratory data analysis, or EDA for short. EDA is an iterative cycle that involves:

Generating questions about your data.
Searching for answers by visualising, transforming, and modelling your data.
Using what you learn to refine your questions and/or generate new questions.

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

Your goal during EDA is to develop an understanding of your data.

EDA is fundamentally a creative process.
The key to asking quality questions is to generate a large quantity of questions.
It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset.

There is no rule about which questions you should ask to guide your research.

However, two types of questions will always be useful for making discoveries within your data.

You can loosely word these questions as:

What type of variation occurs within my variables?
What type of covariation occurs between my variables?

Look for variation: the tendency of the values of a variable to change from measurement to measurement

Visualise the distribution of a variable, whether it is categorical or continuous. For a categorical variable, use a bar chart:

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
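
The bar heights can also be computed directly, which is useful when you want the numbers as well as the picture. This is a small sketch assuming the dplyr package is loaded alongside ggplot2; cut_counts is an illustrative name.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Number of diamonds in each cut category: the heights of the bars
cut_counts <- diamonds %>%
  count(cut)

print(cut_counts)
```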

For a continuous variable, use a histogram:

ggplot(data = diamonds) +  
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
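
The binning behind a histogram can be reproduced by hand. The sketch below assumes dplyr is loaded and uses cut_width() from ggplot2 to form bins of the same width as the plot; carat_bins and bin are illustrative names.

```r
library(ggplot2)
library(dplyr)

# Assign each diamond to a 0.5-carat-wide bin, then count per bin
carat_bins <- diamonds %>%
  count(bin = cut_width(carat, 0.5))

print(carat_bins)
```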

Look for Typical Values and Outliers

smaller <- diamonds %>% 
  filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

Look for Variation and Covariation

variation describes the behavior within a variable
covariation describes the behavior between variables: their tendency to vary together in a related way

# after_stat(density) replaces the older ..density.. notation (ggplot2 >= 3.3.0)
ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

a boxplot displays the distribution of a continuous variable broken down by a categorical variable
it can help to order the categories if you have many of them

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, 
    hwy, FUN = median), y = hwy))
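
To see what reorder() does to the categories, the sketch below compares the per-class medians with the resulting factor levels. It uses only base R functions and the mpg dataset; the object names are illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# Median highway mileage for each vehicle class
class_medians <- tapply(mpg$hwy, mpg$class, median)

# reorder() returns a factor whose levels are sorted by that summary
class_ordered <- reorder(mpg$class, mpg$hwy, FUN = median)

print(sort(class_medians))
print(levels(class_ordered))
```

The levels follow increasing median, which is the order the boxes take along the axis.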

geom_count() displays the covariation between two categorical variables
the size of each circle shows how many observations occur at each combination of values

ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
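
The counts behind the circles can be computed with dplyr and drawn another way. This is a sketch of one alternative, a heat map with geom_tile(), assuming dplyr is loaded; cut_color_counts is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# One row per combination of color and cut, with its number of diamonds
cut_color_counts <- diamonds %>%
  count(color, cut)

# The same counts drawn as filled tiles instead of sized circles
ggplot(data = cut_color_counts, mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))
```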

a scatterplot displays the covariation between two continuous variables
add transparency (alpha) to reduce overplotting

ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)

a scatterplot displays the covariation between two continuous variables
transparency can be difficult to use with very large datasets, so bin the points instead (geom_hex() needs the hexbin package)

ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))

boxplot to display the distribution of two continuous variables
bin one continuous variable so it acts like a categorical variable

ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
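
Bins of equal width can hold very different numbers of points. One alternative, sketched here under the assumption that dplyr is loaded, is cut_number() from ggplot2, which makes bins holding roughly the same number of observations; the choice of 20 bins is arbitrary.

```r
library(ggplot2)
library(dplyr)

smaller <- diamonds %>%
  filter(carat < 3)

# cut_number() makes bins with roughly equal numbers of diamonds,
# so each box summarises a similar amount of data
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
```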

4. Piping syntax `%>%`

the pipe operator is %>%
it pipes the output of one function directly into the first argument of the next function

df %>%
  summarise(avg_MedInc = mean(MedInc))

## # A tibble: 1 x 1
##   avg_MedInc
##        <dbl>
## 1       37.1
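
A piped chain is just another way of writing nested function calls. The sketch below shows the two forms side by side using the built-in mtcars data, not the df used above; nested and piped are illustrative names.

```r
library(dplyr)

# Nested call: read from the inside out
nested <- summarise(filter(mtcars, cyl == 4), avg_mpg = mean(mpg))

# Piped call: read from top to bottom
piped <- mtcars %>%
  filter(cyl == 4) %>%
  summarise(avg_mpg = mean(mpg))

print(piped)
```

Both forms return the same one-row summary; the piped version keeps the steps in the order they are carried out.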

pipes work with the dplyr verbs
(dplyr is also part of the tidyverse)
e.g. the group_by() function:

tb %>%
  group_by(IncClass) %>%
  summarise(avg_MedInc = mean(MedInc))

## # A tibble: 3 x 2
##   IncClass avg_MedInc
##   <chr>         <dbl>
## 1 Average       34533
## 2 Poor          27410
## 3 Rich          52047

verbs in dplyr

Verb          Description
select()      select columns
filter()      filter rows
arrange()     re-order or arrange rows
mutate()      create new columns
summarise()   summarise values
group_by()    allows group operations, following the split-apply-combine concept
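
The verbs can be chained in a single pipeline. This is a minimal sketch using the mpg dataset from ggplot2, not the data used elsewhere in this section; the column names avg_mpg and mean_avg_mpg are invented for the illustration.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

class_summary <- mpg %>%
  select(class, year, cty, hwy) %>%            # keep four columns
  filter(year == 2008) %>%                     # keep rows for 2008 models
  mutate(avg_mpg = (cty + hwy) / 2) %>%        # create a new column
  group_by(class) %>%                          # one group per vehicle class
  summarise(mean_avg_mpg = mean(avg_mpg)) %>%  # one row per group
  arrange(desc(mean_avg_mpg))                  # highest average first

print(class_summary)
```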

example

tb %>% ungroup() %>% 
  # flag areas above the median percentage of Black residents
  mutate(black_high = ifelse(PctBlack > median(PctBlack),
    "High % Black Area", "Low % Black Area")) %>% 
  # pass the modified data straight to ggplot
  ggplot(aes(y = PctBach, x = IncClass, fill = IncClass)) + 
  facet_wrap(~black_high) + 
  coord_flip() + geom_boxplot() + 
  labs(y = "Percentage with Bachelors", x = "Income class") +
  scale_fill_manual(name = "Income Class", 
                    values = c("orange", "palegoldenrod", "firebrick3"))
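
The tb data is not shown in these slides, so here is a runnable analogue of the same pattern using the built-in mpg dataset. The engine grouping and all object names are hypothetical, chosen only to mirror the structure above: create a variable on the fly, then pipe the result into ggplot.

```r
library(ggplot2)
library(dplyr)

# Flag cars by engine size relative to the median (hypothetical grouping)
mpg_flagged <- mpg %>%
  mutate(engine = ifelse(displ > median(displ),
                         "Large engine", "Small engine"))

# Pipe the modified data into ggplot; layers are still added with +
p <- mpg_flagged %>%
  ggplot(aes(x = drv, y = hwy, fill = drv)) +
  geom_boxplot() +
  facet_wrap(~engine) +
  coord_flip() +
  labs(x = "Drive type", y = "Highway miles per gallon")

print(p)
```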

only a small introduction to piping
(don’t worry!)
BUT it allows you to dynamically create variables and summaries
these can be passed to ggplot
this allows for fast and informative EDA, a critical aspect of data science

The talk covered lots of ground… any questions?
