R Markdown

Chapter 4 Tutorial: Exploratory Data Analysis

<March 22>

Step 1: Contingency Tables and Bar Plots (Section 4.1)

This section focuses on summarizing single variables and the basics of two-variable tables. A contingency table summarizes data for two categorical variables, showing the number of times each combination occurred.

library(openintro)
library(tidyverse) # also attaches dplyr and tidyr
library(dplyr)
library(tidyr)

Load data

data("loans_full_schema")
?loans_full_schema

Prepare the data as described in the source

loans <- loans_full_schema %>% mutate(application_type = as.character(application_type)) # [2]

Table 4.1: Contingency table for application type and homeownership

loans %>% count(application_type, homeownership) %>% pivot_wider(names_from = homeownership, values_from = n) # [2], [3]
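As a cross-check on the count() and pivot_wider() pipeline above, base R's table() builds the same kind of contingency table, and addmargins() appends row and column totals. The sketch below uses a small made-up data frame (the name `toy` and its values are hypothetical, not drawn from loans_full_schema) so the counts are easy to verify by hand.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  application_type = c("individual", "individual", "joint",
                       "individual", "joint", "individual"),
  homeownership    = c("rent", "mortgage", "rent",
                       "own", "mortgage", "rent")
)

# Cross-tabulate the two categorical variables
tab <- with(toy, table(application_type, homeownership))

# Append row and column totals (the margin is labelled "Sum")
tab_totals <- addmargins(tab)
print(tab_totals)
```

Each cell counts how often one combination occurs; the "Sum" row and column give the totals for each variable on its own.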

Figure 4.1a: Bar plot of counts

ggplot(loans, aes(x = homeownership)) +
  geom_bar(fill = "#1a4e5c") +
  labs(title = "Counts of Homeownership", x = "Homeownership", y = "Count") # [3]

Figure 4.1b: Bar plot of proportions

ggplot(loans, aes(x = homeownership, y = after_stat(count / sum(count)))) +
  geom_bar(fill = "#1a4e5c") +
  labs(title = "Proportions of Homeownership", x = "Homeownership", y = "Proportion") # [3], [4]
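The after_stat(count / sum(count)) expression above asks ggplot2 to convert the computed bar counts into shares of the total. An alternative that makes the arithmetic visible is to compute the proportions first and plot them with geom_col(). This is a minimal sketch on made-up data; `toy`, `props`, and `p` are hypothetical names introduced for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  homeownership = c("rent", "rent", "mortgage", "mortgage", "mortgage", "own")
)

# Compute each level's share of all rows explicitly
props <- toy %>%
  count(homeownership) %>%
  mutate(proportion = n / sum(n))

# geom_col() draws bars at the precomputed heights
p <- ggplot(props, aes(x = homeownership, y = proportion)) +
  geom_col(fill = "#1a4e5c") +
  labs(x = "Homeownership", y = "Proportion")
```

Printing `props` shows the values behind each bar, which is handy when the exact proportions need to be reported alongside the plot.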

Step 2: Visualizing Two Categorical Variables (Section 4.2)

You can display the distributions of two categorical variables concurrently to visualize their relationship.

Figure 4.2a: Stacked bar plot

Useful when one variable is explanatory and the other is the response [5]

ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar() +
  labs(title = "Stacked Bar Plot") # [5]

Figure 4.2b: Standardized (filled) bar plot

Helpful for understanding fractions when group sizes are imbalanced [4], [6]

ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar(position = "fill") +
  labs(title = "Standardized Bar Plot", y = "Proportion") # [5]

Figure 4.2c: Dodged bar plot

Most clearly displays that individual applications are more common in every category [4]

ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar(position = "dodge") +
  labs(title = "Dodged Bar Plot") # [5]

Figures 4.3 & 4.4: Mosaic Plots (skip this one)

Uses box areas to represent the number of cases in each category [7]

Note: Mosaic plots are often created using the ‘vcd’ package or ‘ggmosaic’

install.packages("vcd") # run once, if the package is not yet installed
library(vcd)
mosaic(application_type ~ homeownership, data = loans) # [8]

Step 3: Row and Column Proportions (Section 4.3)

Row and column proportions are conditional proportions that help show whether two variables are associated.

Table 4.3: Row proportions (Conditional on Application Type)

loans %>% count(application_type, homeownership) %>% group_by(application_type) %>% mutate(proportion = n / sum(n)) # [11]

Table 4.4: Column proportions (Conditional on Homeownership)

loans %>% count(application_type, homeownership) %>% group_by(homeownership) %>% mutate(proportion = n / sum(n)) # [10]
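The same conditional proportions can be obtained in base R with prop.table(), where margin = 1 conditions on rows and margin = 2 conditions on columns. The sketch below uses a small made-up data frame (hypothetical values, not drawn from loans_full_schema) so the results can be checked by hand.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  application_type = c("individual", "individual", "joint",
                       "individual", "joint", "individual"),
  homeownership    = c("rent", "mortgage", "rent",
                       "own", "mortgage", "rent")
)

tab <- with(toy, table(application_type, homeownership))

# Row proportions: each row (application type) sums to 1
row_props <- prop.table(tab, margin = 1)

# Column proportions: each column (homeownership level) sums to 1
col_props <- prop.table(tab, margin = 2)

print(round(row_props, 2))
print(round(col_props, 2))
```

If the row proportions look similar across application types, that suggests little association between the two variables; large differences suggest the opposite.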

Step 4: Pie and Waffle Charts (Sections 4.4 & 4.5)

Pie charts provide a high-level overview but can be difficult to decipher for details compared to bar plots. Waffle charts are an alternative for communicating proportions of data.

Figure 4.5: Pie Chart vs Bar Plot

(Standard ggplot2 pie chart using coord_polar)

ggplot(loans, aes(x = "", fill = homeownership)) +
  geom_bar(width = 1) +
  coord_polar("y") +
  theme_void() # [12]

Figure 4.7: Waffle Chart (skip this one)

Requires the ‘waffle’ package

install.packages("waffle") # run once, if the package is not yet installed
library(waffle)
waffle_data <- table(loans$homeownership) / 100 # scale counts so each square stands for 100 loans
waffle(waffle_data) # [13], [14]

Step 5: Comparing Numerical Data Across Groups (Section 4.6)

This section transitions to the county dataset to compare numerical outcomes (like income) across categorical groups.

Preparation for Section 4.6 [15]

Using the ‘county’ dataset from the ‘usdata’ package [16]

library(usdata)
county_data <- county %>%
  filter(!is.na(pop_change)) %>%
  mutate(change_type = ifelse(pop_change > 0, "gain", "no gain")) # [15]
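Two details of the preparation step are easy to miss: a county with exactly zero population change is labelled "no gain", and rows with a missing pop_change are dropped first because ifelse() would otherwise return NA for them. The sketch below shows both on made-up data, written in base R to stay dependency-free; `toy_county` and `toy_kept` are hypothetical names, and the values are not taken from the county dataset.

```r
# Toy data (hypothetical values, not taken from the county dataset)
toy_county <- data.frame(
  name       = c("A", "B", "C", "D"),
  pop_change = c(1.5, 0, -2.3, NA)
)

# Drop rows with a missing pop_change, then label the rest
toy_kept <- toy_county[!is.na(toy_county$pop_change), ]
toy_kept$change_type <- ifelse(toy_kept$pop_change > 0, "gain", "no gain")

print(toy_kept)
```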

Figure 4.8a: Comparative Histograms

ggplot(county_data, aes(x = median_hh_income, fill = change_type)) +
  geom_histogram(alpha = 0.5, position = "identity") # [17], [18]

Figure 4.8b: Side-by-side box plots

Traditional tool for comparing centers and spreads across groups [17]

ggplot(county_data, aes(x = median_hh_income, y = change_type)) + geom_boxplot() # [18]
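The centers and spreads that side-by-side box plots display can also be computed directly, which helps when exact values are needed. A minimal sketch with made-up incomes follows; `toy_county` and `income_summary` are hypothetical names, and the values are not taken from the county dataset.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the county dataset)
toy_county <- data.frame(
  change_type      = c("gain", "gain", "gain", "no gain", "no gain", "no gain"),
  median_hh_income = c(52000, 60000, 71000, 38000, 45000, 49000)
)

# One row per group: center (median), spread (IQR), and group size
income_summary <- toy_county %>%
  group_by(change_type) %>%
  summarize(
    median_income = median(median_hh_income),
    iqr_income    = IQR(median_hh_income),
    n             = n()
  )

print(income_summary)
```

The median corresponds to the line inside each box and the IQR to the box's length, so this table is a numeric companion to the plot.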

Figure 4.9: Ridge plot

Combines density plots drawn on the same scale [19]

install.packages("ggridges") # run once, if the package is not yet installed
library(ggridges)
ggplot(county_data, aes(x = median_hh_income, y = change_type)) +
  geom_density_ridges() # [19]

Figure 4.10: Faceting

Splitting the display across windows based on groups [20]

ggplot(county_data, aes(x = median_hh_income)) + geom_histogram() + facet_grid(change_type ~ metro) # [20], [21]