Today we’re going to work with the “Wage” dataset to investigate causal relationships. In particular, we will try and see if we can predict Wage based on Education, Race, Health, Age, and other variables.
The library that we’ll be using is called “ISLR2”, and it is from the book “Introduction to Statistical Learning”.
# Load relevant libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ISLR2)
Now, ket’s check what the dataset looks like and perform some preliminary analysis
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
# Check for missing values (NA)
sum(is.na(Wage)) # there are no missing data
## [1] 0
# Create a summary table
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
Now that we have some idea about the dataframe, let’s do a deep dive and create some visualizations.
# Check distribution of wage
Wage %>% ggplot(aes(x = wage)) + geom_density(color = "blue", fill = "lightblue")
# Check qqplot of wage
Wage %>% ggplot(aes(sample = wage)) + stat_qq() + stat_qq_line()
# Check distribution of logWage
Wage %>% ggplot(aes(x = logwage)) + geom_density(color = "blue", fill = "lightblue")
# Create a qqplot of logWage
Wage %>% ggplot(aes(sample = logwage)) + stat_qq() + stat_qq_line()
# looks approximately normal
# Create distributions based on education
Wage %>% ggplot() + geom_density(aes(x = logwage, fill = education), alpha = 0.4)
# This is not really informative; let's try creating boxplots
Wage %>% ggplot(aes(y = logwage, fill = education)) + geom_boxplot()
# much more informative
# Let's create boxplots based on logwage, education, and race
Wage %>% ggplot(aes(y = logwage, fill = education)) + geom_boxplot() + facet_wrap(~race)
# this is actually quite informative; can you express any opinions based on your findings?
Solution:
# Scatter plot of age against logwage
ggplot(data = Wage, aes(x = age, y = logwage)) + geom_point()
Solution:
ggplot(data = Wage, aes(x = age, y = logwage)) + geom_point() + geom_line() + geom_smooth(method = 'gam')
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'
Solution:
glimpse(Wage)
## Rows: 3,000
## Columns: 11
## $ year <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 2004,…
## $ age <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, 54,…
## $ maritl <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Married,…
## $ race <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. White,…
## $ education <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. College …
## $ region <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic,…
## $ jobclass <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Informatio…
## $ health <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1. <=…
## $ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Ye…
## $ logwage <dbl> 4.318063, 4.255273, 4.875061, 5.041393, 4.318063, 4.845098,…
## $ wage <dbl> 75.04315, 70.47602, 130.98218, 154.68529, 75.04315, 127.115…
Wage %>% ggplot(aes(y = logwage, fill = education)) + geom_boxplot() + facet_wrap(~health)
Solution:
Wage %>% ggplot(aes(y = logwage, fill = education)) + geom_boxplot() + facet_wrap(~maritl)
Solution:
Wage %>% ggplot(aes(y = logwage, fill = health)) + geom_boxplot() + facet_wrap(~maritl)