July 2018

1. Introduction

You have already made some plots in Sessions 1 and 2:

  • Histograms
  • Boxplots
  • Scatterplots

In this session you extend these plotting routines using ggplot2. The ggplot2 package is part of the tidyverse, a collection of packages that support data science. Lex and I are big fans!

2. The ggplot2 package

  • a dedicated visualization package
  • based on the Grammar of Graphics (Wilkinson, 2005) (hence the gg in the name of the package).

Key Points

  1. conceptualizes graphics (and plots) in terms of their components.
  2. each element of the graphic is handled separately in a series of layers
  3. provides control over each part of the plot
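
The layered idea in the key points can be sketched with a small example. This is only an illustration, assuming the mpg dataset that ships with ggplot2 (it is not the census data used in this session); the object name p is arbitrary.

```r
library(ggplot2)

# Start with the data and the aesthetic mappings; no geometry is drawn yet
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Add one layer at a time with +
p <- p + geom_point(alpha = 0.5)                                  # layer 1: the raw points
p <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # layer 2: a fitted line

# Non-layer components (labels, theme) are added with + in the same way
p <- p +
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon") +
  theme_bw()

print(p)
```

Because each component is added separately, any one of them can be changed or removed without touching the rest of the plot.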

ggplot in action: first, a base R boxplot for comparison

# plot 1: base R graphics
par(mar = c(4, 8, 3, 3))  # widen the left margin for the class labels
boxplot(census$Owned ~ census$social.class, horizontal = TRUE, outline = FALSE,
        lwd = 0.5, las = 2, col = c("#D7191C", "#FFFFBF", "#2B83BA"),
        xlab = "Property Ownership")

  • the same boxplot in ggplot2
  • note the use of + to add each layer
library(ggplot2)  # or library(tidyverse)
ggplot(data = census, aes(x = "", y = Owned, fill = social.class)) +
  geom_boxplot()

Set the legend title and fill colours manually:

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot() +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA"))

Flip the coordinates so that the boxes are horizontal:

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot() +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA")) +
  coord_flip() 

Add axis labels:

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot() +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA")) +
  coord_flip() + ylab("Property Ownership") + xlab("")

Apply the black-and-white theme:

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot() +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA")) +
  coord_flip() + ylab("Property Ownership") + xlab("") +  theme_bw() 

Make the boxes narrower:

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot(width = 0.5) +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA")) +
  coord_flip() + ylab("Property Ownership") + xlab("") +  theme_bw() 

Space the boxes out with position_dodge():

ggplot(data = census, aes(x="", y=Owned, fill = social.class)) +
  geom_boxplot(width = 0.5, position=position_dodge(1)) +
  scale_fill_manual("Social Class", 
    values = c("#D7191C","#FFFFBF", "#2B83BA")) +
  coord_flip() + ylab("Property Ownership") + xlab("") +  theme_bw() 

3. Exploratory data analysis (EDA)

Using ggplot2 tools, this practical will show you how to use visualisation (and some data transformations) to explore your data in a systematic way, a task known as exploratory data analysis, or EDA for short. EDA is an iterative cycle that involves:

  1. Generating questions about your data.

  2. Searching for answers by visualising, transforming, and modelling your data.

  3. Using what you learn to refine your questions and/or generate new questions.

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

Your goal during EDA is to develop an understanding of your data.

  • EDA is fundamentally a creative process.
  • The key to asking quality questions is to generate a large quantity of questions.
  • It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset.

There is no rule about which questions you should ask to guide your research.

However, two types of questions will always be useful for making discoveries within your data.

You can loosely word these questions as:

  1. What type of variation occurs within my variables?

  2. What type of covariation occurs between my variables?

Look for variation: the tendency of the values of a variable to change from measurement to measurement.

Visualise the distribution of a variable to see its variation. For a categorical variable, use a bar chart:

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
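
The heights of the bars are simply the number of observations in each category, so they can also be computed by hand. A minimal sketch, assuming the dplyr package is installed; the name cut_counts is just for illustration.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

# One row per level of cut, with n = number of diamonds in that level
cut_counts <- diamonds %>%
  count(cut)

cut_counts
```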

For a continuous variable, use a histogram:

ggplot(data = diamonds) +  
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
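
A histogram divides the x-axis into equal-width bins and counts the observations in each one. The counts behind a plot like the one above can be approximated by combining dplyr::count() with ggplot2::cut_width(). This is a sketch; the column name carat_bin is an illustrative choice, and the bin boundaries are whatever cut_width() picks by default.

```r
library(ggplot2)
library(dplyr)

# Bin carat into intervals of width 0.5, then count the diamonds in each bin
carat_bins <- diamonds %>%
  count(carat_bin = cut_width(carat, 0.5))

carat_bins
```

Trying several bin widths is worthwhile, because different widths can reveal different patterns.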

Look for Typical Values and Outliers

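library(dplyr)  # for filter() and the pipe %>%
# keep only diamonds under 3 carats, dropping the few very large stones
smaller <- diamonds %>% 
  filter(carat < 3)
# a very narrow binwidth shows which carat values are most typical
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)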
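
The tall spikes in a fine-grained histogram correspond to the most common values, and these can also be listed directly. A minimal sketch, assuming dplyr is installed; common_carats is a name chosen here for illustration.

```r
library(ggplot2)
library(dplyr)

smaller <- diamonds %>%
  filter(carat < 3)

# The five most frequent carat values, most common first
common_carats <- smaller %>%
  count(carat, sort = TRUE) %>%
  head(5)

common_carats
```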

Look for Variation and Covariation

  • variation describes the behavior within a variable
  • covariation describes the behavior between variables: their tendency to vary together in a related way
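# plot density rather than count so that cuts with different numbers of
# diamonds can be compared; after_stat(density) replaces the older ..density..
ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)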

  • boxplot to display the distribution of a continuous variable broken down by a categorical variable
  • can be good to order categories if you have many of them
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, 
    hwy, FUN = median), y = hwy))
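
To see what reorder() does in the plot above, it can help to inspect the reordered factor on its own. A small sketch using the same mpg data; the names class_by_hwy and class_medians are illustrative.

```r
library(ggplot2)  # provides the mpg data

# Reorder the levels of class by the median of hwy within each class
class_by_hwy <- reorder(mpg$class, mpg$hwy, FUN = median)

# The new level order is the order in which the boxes are drawn
levels(class_by_hwy)

# Median hwy per class, listed in the new level order (lowest first)
class_medians <- tapply(mpg$hwy, class_by_hwy, median)
class_medians
```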

  • geom_count to display the distribution of two categorical variables
  • size of each circle in the plot displays how many observations for each combination
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
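
The circle sizes in geom_count() represent the number of observations for each pair of categories. The same numbers can be tabulated with dplyr, as in this sketch (cut_color_counts is an illustrative name).

```r
library(ggplot2)
library(dplyr)

# One row per combination of color and cut, with n = number of diamonds
cut_color_counts <- diamonds %>%
  count(color, cut)

cut_color_counts
```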

  • scatterplot to display the distribution of two continuous variables
  • add transparency (alpha) to reduce overplotting
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)

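  • scatterplot to display the distribution of two continuous variables
  • transparency can be difficult to use with very large datasets, so bin the points into hexagons instead
# geom_hex() requires the hexbin package to be installed
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))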

  • boxplot to display the distribution of two continuous variables
  • bin one continuous variable so it acts like a categorical variable
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
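
With cut_width() every bin has the same width, so boxes for sparse bins look just as substantial as boxes for crowded ones. One alternative, sketched here, is ggplot2's cut_number(), which puts roughly the same number of points in each bin; the choice of 20 bins is arbitrary.

```r
library(ggplot2)
library(dplyr)

smaller <- diamonds %>%
  filter(carat < 3)

# Bin carat into 20 groups of approximately equal size
p <- ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
print(p)

# Check how many diamonds fall in each bin
table(cut_number(smaller$carat, 20))
```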

4. Piping syntax %>%

  • the pipe operator is %>%
  • pipes the output from one function directly to the input of another function
library(dplyr)  # provides %>% and summarise()
df %>%
  summarise(avg_MedInc = mean(MedInc))
## # A tibble: 1 x 1
##   avg_MedInc
##        <dbl>
## 1       37.1
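
Put another way, x %>% f(y) is the same as f(x, y). A minimal sketch of that equivalence, using the built-in mtcars data rather than the session data; the names piped and nested are illustrative.

```r
library(dplyr)

# With the pipe: the data frame is passed in as the first argument
piped <- mtcars %>%
  summarise(avg_mpg = mean(mpg))

# Without the pipe: the same call written as a nested function call
nested <- summarise(mtcars, avg_mpg = mean(mpg))

# Both forms give the same result
identical(piped, nested)
```

The piped form becomes easier to read than the nested form as more steps are chained together.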

  • can use dplyr verbs
  • (dplyr is also part of the tidyverse)
  • e.g. the group_by() function:
tb %>%
  group_by(IncClass) %>%
  summarise(avg_MedInc = mean(MedInc))
## # A tibble: 3 x 2
##   IncClass avg_MedInc
##   <chr>         <dbl>
## 1 Average       34533
## 2 Poor          27410
## 3 Rich          52047

verbs in dplyr

verb         description
select()     select columns
filter()     filter rows
arrange()    re-order or arrange rows
mutate()     create new columns
summarise()  summarise values
group_by()   allows for group operations in the split-apply-combine concept
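
The verbs can be chained into a single pipeline. The sketch below uses each verb from the table once, on the mpg data from ggplot2 rather than the session data; the new column names (avg_mpg, mean_avg_mpg) are made up for illustration.

```r
library(ggplot2)  # provides the mpg data
library(dplyr)

class_summary <- mpg %>%
  select(class, cty, hwy) %>%               # keep three columns
  filter(class != "2seater") %>%            # drop the rows for one class
  mutate(avg_mpg = (cty + hwy) / 2) %>%     # create a new column
  group_by(class) %>%                       # split the rows by class
  summarise(mean_avg_mpg = mean(avg_mpg),   # one summary row per class
            n = n()) %>%
  arrange(desc(mean_avg_mpg))               # order rows, highest first

class_summary
```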

example

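tb %>% 
  ungroup() %>%  # drop any grouping left over from group_by()
  # flag areas above the median percentage of Black residents
  mutate(black_high = ifelse(PctBlack > median(PctBlack),
    "High % Black Area", "Low % Black Area")) %>% 
  # pass the result to ggplot(); plot layers are then added with +, not %>%
  ggplot(aes(y = PctBach, x = IncClass, fill = IncClass)) + 
  facet_wrap(~black_high) + 
  coord_flip() + geom_boxplot() + 
  labs(y = "Percentage with Bachelors", x = "Income class") +
  scale_fill_manual(name = "Income Class", 
                    values = c("orange", "palegoldenrod", "firebrick3"))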
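
The same pattern can be tried without the session data. This sketch follows the example above on the diamonds data from ggplot2: a variable is created on the fly with mutate() and the result is piped straight into ggplot(). The variable name size and its labels are illustrative choices.

```r
library(ggplot2)
library(dplyr)

p <- diamonds %>%
  # create a two-level variable on the fly: above or below the median carat
  mutate(size = ifelse(carat > median(carat),
                       "Larger stones", "Smaller stones")) %>%
  # pipe the data into ggplot(), then switch to + for the plot layers
  ggplot(aes(x = cut, y = price, fill = cut)) +
  geom_boxplot() +
  facet_wrap(~size) +
  coord_flip() +
  labs(x = "Cut", y = "Price (US dollars)")

print(p)
```

Note the switch from %>% to + at the ggplot() call: the pipe passes data between functions, while + adds components to a plot.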

  • this was only a brief introduction to piping
  • (don’t worry!)
  • BUT it allows you to create variables and summaries on the fly
  • these can be passed straight to ggplot
  • this allows for fast and informative EDA, a critical aspect of data science

The talk covered lots of ground... any questions?