In this chapter, you will learn how to graphically summarize numerical data.

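library(readr)
library(dplyr)   # provides %>% and filter(), used below
library(ggplot2) # used for every plot below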
cars <- read_csv('https://assets.datacamp.com/production/course_1796/datasets/cars04.csv')
## Parsed with column specification:
## cols(
##   name = col_character(),
##   sports_car = col_logical(),
##   suv = col_logical(),
##   wagon = col_logical(),
##   minivan = col_logical(),
##   pickup = col_logical(),
##   all_wheel = col_logical(),
##   rear_wheel = col_logical(),
##   msrp = col_double(),
##   dealer_cost = col_double(),
##   eng_size = col_double(),
##   ncyl = col_double(),
##   horsepwr = col_double(),
##   city_mpg = col_double(),
##   hwy_mpg = col_double(),
##   weight = col_double(),
##   wheel_base = col_double(),
##   length = col_double(),
##   width = col_double()
## )
#cars <- cars %>% 
#  mutate(msrp = as.integer(msrp))
# Convert every numeric column except eng_size (column 11) to integer
cars[, c(9:10, 12:19)] <- sapply(cars[, c(9:10, 12:19)], as.integer)

Video: Exploring numerical data

str(cars)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 428 obs. of  19 variables:
##  $ name       : chr  "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   sports_car = col_logical(),
##   ..   suv = col_logical(),
##   ..   wagon = col_logical(),
##   ..   minivan = col_logical(),
##   ..   pickup = col_logical(),
##   ..   all_wheel = col_logical(),
##   ..   rear_wheel = col_logical(),
##   ..   msrp = col_double(),
##   ..   dealer_cost = col_double(),
##   ..   eng_size = col_double(),
##   ..   ncyl = col_double(),
##   ..   horsepwr = col_double(),
##   ..   city_mpg = col_double(),
##   ..   hwy_mpg = col_double(),
##   ..   weight = col_double(),
##   ..   wheel_base = col_double(),
##   ..   length = col_double(),
##   ..   width = col_double()
##   .. )
# The most direct way to represent numerical data is a dotplot
ggplot(cars, aes(x = weight)) +
  geom_dotplot(dotsize = 0.4)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bindot).

# A dotplot loses no data: you could recreate the data set perfectly from the display
# A histogram instead groups the observations into bins and plots the count in each bin
ggplot(cars, aes(x = weight)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).

# Because the histogram bins the data, it's no longer possible to recreate the original data set from it
# A density plot avoids the unnatural stepwise shape of a histogram
ggplot(cars, aes(x = weight)) +
  geom_density()
## Warning: Removed 2 rows containing non-finite values (stat_density).

# Use a density plot only when you have a large number of cases
# A box plot gives a more abstracted summary of the distribution
ggplot(cars, aes(x = 1, y = weight)) +
  geom_boxplot() +
  coord_flip()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

ggplot(cars, aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_wrap(~pickup)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).

Faceted histogram

In this chapter, you’ll be working with the cars dataset, which records characteristics of all the new car models for sale in the US in a certain year. You will investigate the distribution of mileage across a categorical variable, but before you get there, you’ll want to familiarize yourself with the dataset.

# Load package
library(ggplot2)

# Learn data structure
str(cars)
# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~ suv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).

In this exercise, you faceted by the suv variable, but it’s important to note that you can facet a plot by any categorical variable using facet_wrap().
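
As a minimal sketch of faceting on a different categorical variable, the example below uses R's built-in mtcars data as a stand-in (an assumption for illustration; it is not the cars dataset used in this chapter) and facets on the number of cylinders. The binwidth of 2 is an arbitrary choice.

```r
library(ggplot2)

# Stand-in data: mtcars ships with R and is NOT the chapter's cars dataset
# facet_wrap() draws one histogram panel per distinct value of cyl (4, 6, 8)
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ cyl)

print(p)
```

Any column with a small number of distinct values can take the place of cyl in the facet_wrap() formula.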

Boxplots and density plots

The mileage of a car tends to be associated with the size of its engine (as measured by the number of cylinders). To explore the relationship between these two variables, you could stick to using histograms, but in this exercise you’ll try your hand at two alternatives: the box plot and the density plot.

# Create box plots of city mpg by ncyl
ggplot(cars, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

# Check how many possible levels of ncyl there are
unique(cars$ncyl)
## [1]  4  6  3  8  5 12 10 -1
# Which levels are the most common?
table(cars$ncyl)
## 
##  -1   3   4   5   6   8  10  12 
##   2   1 136   7 190  87   2   3
# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)
## Warning: Removed 11 rows containing non-finite values (stat_density).

Compare distribution via plots

Which of the following interpretations of the plot is not valid?

The invalid interpretation is “The variability in mileage of 8 cylinder cars is similar to the variability in mileage of 4 cylinder cars.” In the plot, the mileage of 8 cylinder cars seems to vary much less than that of 4 cylinder cars.

Video: Distribution of one variable

…Specifically a numerical variable.

ggplot(cars, aes(x = hwy_mpg)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).

# Adding a facet_wrap layer (faceting on a categorical value)
ggplot(cars, aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_wrap(~pickup)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).

# Adding a filter (filtering on a numerical variable)
cars2 <- cars %>% 
  filter(eng_size < 2.0)

ggplot(cars2, aes(x = hwy_mpg)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Alternatively
cars %>% 
  filter(eng_size < 2.0) %>% 
  ggplot(aes(x = hwy_mpg)) + # No data argument is needed: the filtered data frame is piped in
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# The same pipeline with an explicit binwidth
cars %>% 
  filter(eng_size < 2.0) %>% 
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram(binwidth = 5)

Marginal and conditional histograms

Now, turn your attention to a new variable: horsepwr. The goal is to get a sense of the marginal distribution of this variable and then compare it to the distribution of horsepower conditional on the price of the car being less than $25,000.

You’ll be making two plots using the “data pipeline” paradigm, where you start with the raw data and end with the plot.

# Create hist of horsepwr
cars %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ylim(c(0, 50)) +
  ggtitle("Distribution of horsepower for all cars")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 3 rows containing missing values (geom_bar).

# Create hist of horsepwr for affordable cars
cars %>% 
  filter(msrp < 25000) %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ylim(c(0, 50)) +
  ggtitle("Distribution of horsepower for affordable cars")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Question: Marginal and conditional histograms interpretation

Observe the two histograms in the plotting window and decide which of the following is a valid interpretation.

The valid interpretation is “The highest horsepower car in the less expensive range has just under 250 horsepower.”
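
The difference between a marginal and a conditional distribution can also be checked numerically. The sketch below uses a small made-up data frame (the values are hypothetical, not taken from the cars dataset) and compares the maximum horsepower over all rows with the maximum among rows priced below $25,000.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
toy <- data.frame(
  msrp     = c(15000, 20000, 24000, 40000, 90000),
  horsepwr = c(110, 140, 200, 300, 500)
)

# Marginal: summarise horsepwr over every row
marginal <- toy %>%
  summarise(max_hp = max(horsepwr))

# Conditional: summarise horsepwr only for rows with msrp below 25000
conditional <- toy %>%
  filter(msrp < 25000) %>%
  summarise(max_hp = max(horsepwr))

marginal$max_hp     # 500
conditional$max_hp  # 200
```

The only difference between the two pipelines is the filter() step, which is what makes the second summary conditional.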

Three binwidths

Before you take these plots for granted, it’s a good idea to see how things change when you alter the binwidth. The binwidth determines how smooth your distribution will appear: the smaller the binwidth, the more jagged your distribution becomes. It’s good practice to consider several binwidths in order to detect different types of structure in your data.
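
To see why narrow bins look jagged and wide bins look smooth, it can help to count the occupied bins directly. The sketch below uses made-up horsepower values and a hypothetical helper, bin_counts(), that places bin edges at multiples of the binwidth. That placement is an assumption for illustration; ggplot2 chooses its bin boundaries differently by default, so its counts may not match exactly.

```r
# Hypothetical horsepower values, made up for illustration
hp <- c(100, 103, 140, 200, 200, 200, 215, 300, 300, 340)

# Count observations per bin for a given binwidth.
# Assumption: bins are [0, w), [w, 2w), ...; each bin is labelled by its left edge.
bin_counts <- function(x, width) {
  table(floor(x / width) * width)
}

bin_counts(hp, 3)   # 7 occupied bins; the repeated 200s and 300s stand out
bin_counts(hp, 30)  # 6 occupied bins
bin_counts(hp, 60)  # 4 occupied bins; neighbouring values merge together
```

With the widest bins, the three cars at exactly 200 horsepower can no longer be told apart from the car at 215.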

# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("binwidth = 3")

# Create hist of horsepwr with binwidth of 30
cars %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("binwidth = 30")

# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("binwidth = 60")

Be sure to toggle back and forth in the plots pane to compare the histograms.

Question: Three binwidths interpretation

What feature is present in Plot A that’s not found in B or C?

The answer is “There is a tendency for cars to have horsepower right at 200 or 300 horsepower.” Plot A is the only histogram with bins narrow enough to reveal the spikes at 200 and 300 horsepower.

Video: Box plots

Box plots for outliers

In addition to indicating the center and spread of a distribution, a box plot provides a graphical means to detect outliers. You can apply this method to the msrp column (manufacturer’s suggested retail price) to detect if there are unusually expensive or cheap cars.
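
The points a box plot draws as outliers follow a simple rule: anything more than 1.5 times the length of the box beyond its edges. As a rough sketch, base R's boxplot.stats() applies that rule to a vector. The prices below are made up for illustration, and boxplot.stats() computes its hinges slightly differently from the quartiles ggplot2 uses, so the two can disagree on borderline points.

```r
# Hypothetical prices, made up for illustration
msrp <- c(12000, 15000, 18000, 22000, 25000, 30000, 35000, 192000)

# boxplot.stats() flags points more than coef times the box length
# beyond the hinges (coef = 1.5 is the default)
stats <- boxplot.stats(msrp, coef = 1.5)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # 192000
```

Here the hinges are 16500 and 32500, so the upper fence sits at 32500 + 1.5 * 16000 = 56500 and only the 192000 car lies beyond it.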

# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

Be sure to toggle back and forth in the plots pane to compare the box plots.

Plot selection

Consider two other columns in the cars dataset: city_mpg and width. Which is the most appropriate plot for displaying the important features of their distributions? Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.

# Create plot of city_mpg
cars %>%
  ggplot(aes(x = city_mpg)) +
  geom_density()
## Warning: Removed 14 rows containing non-finite values (stat_density).

cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

# Create plot of width
cars %>%
  ggplot(aes(x = width)) +
  geom_density()
## Warning: Removed 28 rows containing non-finite values (stat_density).

cars %>%
  ggplot(aes(x = 1, y = width)) +
  geom_boxplot()
## Warning: Removed 28 rows containing non-finite values (stat_boxplot).

Because the city_mpg variable has a much wider range with its outliers, it’s best to display its distribution as a box plot.

Video: Visualization in higher dimensions

ggplot(cars, aes(x = msrp)) +
  geom_density() +
  facet_grid(pickup ~ rear_wheel)

# Add labels to aid understanding
ggplot(cars, aes(x = msrp)) +
  geom_density() +
  facet_grid(pickup ~ rear_wheel, labeller = label_both)

# There are very few pickups, with or without rear-wheel drive (12 of each)
table(cars$rear_wheel, cars$pickup) # table(rows, columns)
##        
##         FALSE TRUE
##   FALSE   306   12
##   TRUE     98   12

3 variable plot

Faceting is a valuable technique for looking at several conditional distributions at the same time. If the faceted distributions are laid out in a grid, you can consider the association between a variable and two others, one on the rows of the grid and the other on the columns.

# Facet hists using hwy mileage and ncyl
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv, labeller = label_both) +
  ggtitle("Mileage by suv and ncyl")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11 rows containing non-finite values (stat_bin).

Question: Interpret 3 var plot

Which of the following interpretations of the plot is valid?

The valid interpretation is “Across both SUVs and non-SUVs, mileage tends to decrease as the number of cylinders increases.”
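
The same grid layout works with any pair of categorical variables. As a minimal sketch, the example below uses R's built-in mtcars data as a stand-in (an assumption for illustration; it is not the cars dataset used in this chapter), with cylinders on the rows and transmission type on the columns. The binwidth of 2 is an arbitrary choice.

```r
library(ggplot2)

# Stand-in data: mtcars ships with R and is NOT the chapter's cars dataset
# Rows of the grid: cyl (4, 6, 8); columns: am (0 = automatic, 1 = manual)
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2) +
  facet_grid(cyl ~ am, labeller = label_both)

print(p)
```

Reading down a column shows how mileage changes with cylinder count for one transmission type, and reading across a row compares the two transmission types at a fixed cylinder count.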