In this course, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. Which variables suggest interesting relationships? Which observations are unusual? By the end of the course, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful.
In this chapter, you will learn how to create graphical and numerical summaries of two categorical variables.
# Install gapminder R package. Once it's installed, you won't have to run this code again
#install.packages("dplyr")
#install.packages("ggplot2")
#Install.packages("openintro")
# Load dplyr package
library(dplyr) #for use of dplyr functions such as glimpse(), mutate(), and filter()
library(ggplot2) #for use of ggplot2 functions such ggplot()
library(gapminder) #for use of gapminder data (ch3.2)
library(openintro) #for use of email data (ch4)
# Import data
comics <- read.csv("~/resources/rstudio/comics.csv") #first_appear var isn't exactly same
str(comics)
## 'data.frame': 23272 obs. of 11 variables:
## $ name : Factor w/ 23272 levels "'Spinner (Earth-616)",..: 19833 3335 22769 9647 20956 2220 17576 9346 18794 10957 ...
## $ id : Factor w/ 4 levels "No Dual","Public",..: 3 2 2 2 1 2 2 2 2 2 ...
## $ align : Factor w/ 4 levels "Bad","Good","Neutral",..: 2 2 3 2 2 2 2 2 3 2 ...
## $ eye : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 11 5 5 5 5 5 6 6 6 5 ...
## $ hair : Factor w/ 28 levels "Auburn Hair",..: 7 27 3 3 4 14 7 7 7 4 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ gsm : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
## $ alive : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ appearances : int 4043 3360 3061 2961 2258 2255 2072 2017 1955 1934 ...
## $ first_appear: Factor w/ 2328 levels "1-Apr","1-Aug",..: 1772 2074 2255 2089 2185 2192 2192 2139 2292 2192 ...
## $ publisher : Factor w/ 2 levels "dc","marvel": 2 2 2 2 2 2 2 2 2 2 ...
# Convert data to tbl_df
comics <- tbl_df(comics)
str(comics)
## Classes 'tbl_df', 'tbl' and 'data.frame': 23272 obs. of 11 variables:
## $ name : Factor w/ 23272 levels "'Spinner (Earth-616)",..: 19833 3335 22769 9647 20956 2220 17576 9346 18794 10957 ...
## $ id : Factor w/ 4 levels "No Dual","Public",..: 3 2 2 2 1 2 2 2 2 2 ...
## $ align : Factor w/ 4 levels "Bad","Good","Neutral",..: 2 2 3 2 2 2 2 2 3 2 ...
## $ eye : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 11 5 5 5 5 5 6 6 6 5 ...
## $ hair : Factor w/ 28 levels "Auburn Hair",..: 7 27 3 3 4 14 7 7 7 4 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ gsm : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
## $ alive : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ appearances : int 4043 3360 3061 2961 2258 2255 2072 2017 1955 1934 ...
## $ first_appear: Factor w/ 2328 levels "1-Apr","1-Aug",..: 1772 2074 2255 2089 2185 2192 2192 2139 2292 2192 ...
## $ publisher : Factor w/ 2 levels "dc","marvel": 2 2 2 2 2 2 2 2 2 2 ...
# Print the first rows of the data
comics
## # A tibble: 23,272 x 11
## name id align eye
## <fctr> <fctr> <fctr> <fctr>
## 1 Spider-Man (Peter Parker) Secret Good Hazel Eyes
## 2 Captain America (Steven Rogers) Public Good Blue Eyes
## 3 Wolverine (James \\"Logan\\" Howlett) Public Neutral Blue Eyes
## 4 Iron Man (Anthony \\"Tony\\" Stark) Public Good Blue Eyes
## 5 Thor (Thor Odinson) No Dual Good Blue Eyes
## 6 Benjamin Grimm (Earth-616) Public Good Blue Eyes
## 7 Reed Richards (Earth-616) Public Good Brown Eyes
## 8 Hulk (Robert Bruce Banner) Public Good Brown Eyes
## 9 Scott Summers (Earth-616) Public Neutral Brown Eyes
## 10 Jonathan Storm (Earth-616) Public Good Blue Eyes
## # ... with 23,262 more rows, and 7 more variables: hair <fctr>,
## # gender <fctr>, gsm <fctr>, alive <fctr>, appearances <int>,
## # first_appear <fctr>, publisher <fctr>
# Check levels of align
levels(comics$align)
## [1] "Bad" "Good" "Neutral"
## [4] "Reformed Criminals"
# Check the levels of gender
levels(comics$gender)
## [1] "Female" "Male" "Other"
# Create a 2-way contingency table
table(comics$align, comics$gender)
##
## Female Male Other
## Bad 1573 7561 32
## Good 2490 4809 17
## Neutral 836 1799 17
## Reformed Criminals 1 2 0
To simplify the analysis, drop levels that have very low counts.
# Create a 2-way contingency table
tab <- table(comics$align, comics$gender)
# Print tab
tab
##
## Female Male Other
## Bad 1573 7561 32
## Good 2490 4809 17
## Neutral 836 1799 17
## Reformed Criminals 1 2 0
# Remove align level
comics <- comics %>%
filter(align != "Reformed Criminals") %>%
droplevels()
Construct two side-by-side barcharts of the comics data, showing how there are two options for presenting the same data.
# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "dodge") #position = "dodge", to have a side-by-side (i.e. not stacked) barchart.
# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 90))
Interpretation
Among characters with Neutral alignment, males are the most common. In general, there is an association between gender and alignment. There are more male characters than female characters in this dataset.
Approximately what proportion of all female characters are good? 51%. Look at how align was distributed within each gender
tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab) # Joint proportions
##
## Female Male Other
## Bad 0.082210 0.395160 0.001672
## Good 0.130135 0.251333 0.000888
## Neutral 0.043692 0.094021 0.000888
prop.table(tab, 2) # Conditional on columns
##
## Female Male Other
## Bad 0.321 0.534 0.485
## Good 0.508 0.339 0.258
## Neutral 0.171 0.127 0.258
Bar charts can tell dramatically different stories depending on whether they represent counts or proportions and, if proportions, what the proportions are conditioned on. To demonstrate this difference, you’ll construct two barcharts in this exercise: one of counts and one of proportions.
# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar()
# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "fill") #position = "fill", to have a stacked barchart
Improve the interpretability of the plot by implementing some sensible ordering. Superheroes that are “Neutral” show an alignment between “Good” and “Bad”, so it makes sense to put that bar in the middle.
# Change the order of the levels in align
comics$align <- factor(comics$align,
levels = c("Bad", "Neutral", "Good"))
# Create plot of align
ggplot(comics, aes(x = align)) +
geom_bar()
Make the distribution of alignment based on gender by creating multiple filtered datasets (one for each gender) or by faceting the plot of alignment based on gender.
# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) +
geom_bar() +
facet_wrap(~ gender)
The piechart is a very common way to represent the distribution of a single categorical variable, but they can be more difficult to interpret than barcharts. Improve the piechart representation of these data by constructing a barchart.
# Import data
pies <- data.frame(read.csv("~/resources/rstudio/pie.csv"))
pies_tb <- table(pies$flavor)
pies_tb
##
## apple blueberry boston creme cherry key lime
## 17 14 15 13 16
## pumpkin strawberry
## 12 11
# Create a pie chart
pie(pies_tb)
# Put levels of flavor in decending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)
# Create barchart of flavor
ggplot(pies, aes(x = flavor)) +
geom_bar(fill = "chartreuse") +
theme(axis.text.x = element_text(angle = 90))
In this chapter, you will learn how to graphically summarize numerical data
The cars dataset records characteristics on all of the new models of cars for sale in the US in a certain year. Investigate the distribution of mileage across a categorial variable.
# Import data
cars <- read.csv("~/resources/rstudio/cars.csv", stringsAsFactors = FALSE)
# Learn data structure
str(cars)
## 'data.frame': 428 obs. of 19 variables:
## $ name : chr "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
## $ sports_car : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ suv : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ wagon : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ minivan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pickup : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ all_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rear_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ msrp : int 11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
## $ dealer_cost: int 10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
## $ eng_size : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ ncyl : int 4 4 4 4 4 4 4 4 4 4 ...
## $ horsepwr : int 103 103 140 140 140 132 132 130 110 130 ...
## $ city_mpg : int 28 28 26 26 26 29 29 26 27 26 ...
## $ hwy_mpg : int 34 34 37 37 37 36 36 33 36 33 ...
## $ weight : int 2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
## $ wheel_base : int 98 98 104 104 104 105 105 103 103 103 ...
## $ length : int 167 153 183 183 183 174 174 168 168 168 ...
## $ width : int 66 66 69 68 69 67 67 67 67 67 ...
# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
geom_histogram() +
facet_grid(~ suv)
Explore the relationship between mpg and the engine size (as measured by number of cylinders).
# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))
# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
geom_boxplot()
# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
geom_density(alpha = .3)
Interpretation
The highest mileage cars have 4 cylinders. The typical 4 cylinder car gets better mileage than the typical 6 cylinder car, which gets better mileage than the typical 8 cylinder car. Most of the 4 cylinder cars get better mileage than even the most efficient 8 cylinder cars. The variability in mileage of 8 cylinder cars seem much smaller than that of 4 cylinder cars.
Get a sense of the marginal distribution of horsepower and then compare it to the distribution of horsepower conditional on the cars being affordable. How much horsepower do you sacrifice by restricting your budget?
# Create hist of horsepwr
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram() +
ggtitle("Distribution of horsepower")
# Create hist of horsepwr for affordable cars
cars %>%
filter(msrp < 25000) %>%
ggplot(aes(horsepwr)) +
geom_histogram() +
xlim(c(90, 550)) +
ggtitle("Distribution of horsepower for cars less than $25,000")
Interpretation
The highest horsepower car in the affordable range has just under 250 horsepower.
Alter the binwidths to see how things change.
# Create hist of horsepwr with binwidth of 3
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 3) +
ggtitle("The distribution of horsepower with a binwidth of 3")
# Create hist of horsepwr with binwidth of 30
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 30) +
ggtitle("The distribution of horsepower with a binwidth of 30")
# Create hist of horsepwr with binwidth of 60
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 60) +
ggtitle("The distribution of horsepower with a binwidth of 60")
Interpretation
Plot A is the only histogram that shows the count for cars with exactly 200 and 300 horsepower. Plot A has the most bins, which means that it shows the number of cars with more specific horsepower values.
Apply a box plot to the msrp column (manufacturers suggested retail price) to detect if there are unusually expensive or cheap cars, outliers.
# Construct box plot of msrp
cars %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
# Exclude outliers from data
cars_no_out <- cars %>%
filter(msrp < 100000)
# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
Plot to display the important features of the distributions of city_mpg and width. Which is most appropriate? Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.
# Create plot of city_mpg
cars %>%
ggplot(aes(x = 1, y = city_mpg)) +
geom_boxplot()
cars %>%
ggplot(aes(city_mpg)) +
geom_density()
# Create plot of width
cars %>%
ggplot(aes(x = 1, y = width)) +
geom_boxplot()
cars %>%
ggplot(aes(width)) +
geom_density()
Interpretation
Because the city_mpg variable has a much wider range with its outliers, it’s best to display its distribution as a box plot.
Faceting is a valuable technique for looking at several conditional distributions at the same time.
# Facet hists using hwy mileage and ncyl
common_cyl %>%
ggplot(aes(x = hwy_mpg)) +
geom_histogram() +
facet_grid(ncyl ~ suv, labeller = label_both) +
ggtitle("Distribution of Highway MPG by ncyl and SUV")
Interpretation
Across both SUVs and non-SUVs, mileage tends to decrease as the number of cylinders increases. There are fewer SUVs than non-SUVs across all cylinder types. There is more variability in 4-cylinder non-SUVs than in any other type of car.
Now that we’ve looked at exploring categorical and numerical data, you’ll learn some useful statistics for describing distributions of data.
Throughout this chapter, you will use data from gapminder, which tracks demographic data in countries of the world over time. For this exercise, focus on how the life expectancy differs from continent to continent. This requires that you conduct your analysis not at the country level, but aggregated up to the continent level. This is made possible by the one-two punch of group_by() and summarize()
# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)
# Compute groupwise mean and median lifeExp
gap2007 %>%
group_by(continent) %>%
summarize(mean(lifeExp),
median(lifeExp))
## # A tibble: 5 x 3
## continent `mean(lifeExp)` `median(lifeExp)`
## <fctr> <dbl> <dbl>
## 1 Africa 54.8 52.9
## 2 Americas 73.6 72.9
## 3 Asia 70.7 72.4
## 4 Europe 77.6 78.6
## 5 Oceania 80.7 80.7
# Generate box plots of lifeExp for each continent
gap2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()
For this exercise, focus on how the life expectancy differs from continent to continent. This requires that you conduct your analysis not at the country level, but aggregated up to the continent level. Use the group_by() and summarize() functions, a very powerful syntax for carrying out the same analysis on different subsets of the full dataset.
# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)
# Compute groupwise mean and median lifeExp
gap2007 %>%
group_by(continent) %>%
summarize(mean(lifeExp),
median(lifeExp))
## # A tibble: 5 x 3
## continent `mean(lifeExp)` `median(lifeExp)`
## <fctr> <dbl> <dbl>
## 1 Africa 54.8 52.9
## 2 Americas 73.6 72.9
## 3 Asia 70.7 72.4
## 4 Europe 77.6 78.6
## 5 Oceania 80.7 80.7
# Generate box plots of lifeExp for each continent
gap2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()
Let’s extend the powerful group_by() and summarize() syntax to measures of spread.
# Compute groupwise measures of spread
gap2007 %>%
group_by(continent) %>%
summarize(sd(lifeExp),
IQR(lifeExp),
n())
## # A tibble: 5 x 4
## continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
## <fctr> <dbl> <dbl> <int>
## 1 Africa 9.631 11.610 52
## 2 Americas 4.441 4.632 25
## 3 Asia 7.964 10.152 33
## 4 Europe 2.980 4.782 30
## 5 Oceania 0.729 0.516 2
# Generate overlaid density plots
gap2007 %>%
ggplot(aes(x = lifeExp, fill = continent)) +
geom_density(alpha = 0.3)
Using the shapes of the density plots, calculate the most appropriate measures of center and spread for the following:
The distribution of life expectancy in the countries of the Americas. Note you’ll need to apply a filter here. The distribution of country populations across the entire gap2007 dataset.
# Compute stats for lifeExp in Americas
gap2007 %>%
filter(continent == "Americas") %>%
summarize(mean(lifeExp),
sd(lifeExp))
## # A tibble: 1 x 2
## `mean(lifeExp)` `sd(lifeExp)`
## <dbl> <dbl>
## 1 73.6 4.44
# Compute stats for population
gap2007 %>%
summarize(median(pop),
IQR(pop))
## # A tibble: 1 x 2
## `median(pop)` `IQR(pop)`
## <dbl> <dbl>
## 1 10517531 26702008
Here you’ll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).
# Create density plot of old variable
gap2007 %>%
ggplot(aes(x = pop)) +
geom_density()
# Transform the skewed pop variable
gap2007 <- gap2007 %>%
mutate(log_pop = log(pop))
# Create density plot of new variable
gap2007 %>%
ggplot(aes(x = log_pop)) +
geom_density()
# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
filter(continent == "Asia") %>%
mutate(is_outlier = lifeExp < 50)
# Remove outliers, create box plot of lifeExp
gap_asia %>%
filter(!is_outlier) %>%
ggplot(aes(x = 1, y = lifeExp)) +
geom_boxplot()
Is there an association between spam and the length of an email? use the email dataset to settle that question. Begin by bringing up the help file and learning about all the variables with ?email.
As you explore the association between spam and the length of an email, use this opportunity to try out linking a dplyr chain with the layers in a ggplot2 object.
# Load packages
library(openintro)
library(ggplot2)
library(dplyr)
# Compute summary statistics
email %>%
group_by(spam) %>%
summarize(median(num_char),
IQR(num_char))
## # A tibble: 2 x 3
## spam `median(num_char)` `IQR(num_char)`
## <dbl> <dbl> <dbl>
## 1 0 6.83 13.58
## 2 1 1.05 2.82
# Create plot
email %>%
mutate(log_num_char = log(num_char)) %>%
ggplot(aes(x = spam, y = log_num_char)) +
geom_boxplot()
exclaim_mess contains the number of exclamation marks in each message. Using summary statistics and visualization, see if there is a relationship between this variable and whether or not a message is spam.
# Compute center and spread for exclaim_mess by spam
email %>%
group_by(spam) %>%
summarize(median(exclaim_mess),
IQR(exclaim_mess))
## # A tibble: 2 x 3
## spam `median(exclaim_mess)` `IQR(exclaim_mess)`
## <dbl> <dbl> <dbl>
## 1 0 1 5
## 2 1 0 1
# Create plot for spam and exclaim_mess
email %>%
mutate(log_exclaim_mess = log(exclaim_mess + .01)) %>%
ggplot(aes(x = log_exclaim_mess)) +
geom_histogram() +
facet_wrap(~ spam)
# Alternative plot: side-by-side box plots
email %>%
mutate(log_exclaim_mess = log(exclaim_mess + .01)) %>%
ggplot(aes(x = 1, y = log_exclaim_mess)) +
geom_boxplot() +
facet_wrap(~ spam)
# Alternative plot: Overlaid density plots
email %>%
mutate(log_exclaim_mess = log(exclaim_mess + .01)) %>%
ggplot(aes(x = log_exclaim_mess, fill = spam)) +
geom_density(alpha = 0.3)
If it was difficult to work with the heavy skew of exclaim_mess, the number of images attached to each email (image) poses even more of a challenge. Run the following code at the console to get a sense of its distribution:
table(email$image)
##
## 0 1 2 3 4 5 9 20
## 3811 76 17 11 2 2 1 1
Recall that this tabulates the number of cases in each category (so there were 3811 emails with 0 images, for example). Given the very low counts at the higher number of images, let’s collapse image into a categorical variable that indicates whether or not the email had at least one image. In this exercise, you’ll create this new variable and explore its association with spam.
# Create plot of proportion of spam by image
email %>%
mutate(has_image = image > 0) %>%
ggplot(aes(x = has_image, fill = spam)) +
geom_bar(position = "fill")
# Test if images count as attachments
sum(email$image > email$attach)
## [1] 0
When you have a specific question about a dataset, you can find your way to an answer by carefully constructing the appropriate chain of R code. For example, consider the following question:
“Within non-spam emails, is the typical length of emails shorter for those that were sent to multiple people?” This can be answered with the following chain:
email %>% filter(spam == “not-spam”) %>% group_by(to_multiple) %>% summarize(median(num_char))
# Question 1
email %>%
filter(dollar > 0) %>%
group_by(spam) %>%
summarize(median(dollar))
## # A tibble: 2 x 2
## spam `median(dollar)`
## <dbl> <dbl>
## 1 0 4
## 2 1 2
# Question 2
email %>%
filter(dollar > 10) %>%
ggplot(aes(x = spam)) +
geom_bar()
practice constructing a faceted barchart.
# Reorder levels
email$number <- factor(email$number, levels = c("none", "small", "big"))
# Construct plot of number
ggplot(email, aes(x = number)) +
geom_bar() +
facet_wrap(~ spam)
Interpretation: Given that an email contains no number, it is more likely to be spam.
# Load data
data(countyComplete) # It comes from the openintro package
# Create a new variable, rural
countyComplete$rural <- ifelse(countyComplete$density < 1000, "rural", "urban")
countyComplete$rural <- factor(countyComplete$rural)