Description

Today we’re going to work with the “Wage” dataset to investigate causal relationships. In particular, we will try and see if we can predict Wage based on Education, Race, Health, Age, and other variables.

The library that we’ll be using is called “ISLR2”, and it is from the book “Introduction to Statistical Learning”.

# Load relevant libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ISLR2)

Initial Data Exploration

Now, ket’s check what the dataset looks like and perform some preliminary analysis

## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...
# Check for missing values (NA)
sum(is.na(Wage)) # there are no missing data
## [1] 0
# Create a summary table
summary(Wage)
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34  
## 

Further Analysis

Now that we have some idea about the dataframe, let’s do a deep dive and create some visualizations.

# Check distribution of wage
Wage %>% ggplot(aes(x = wage)) + geom_density(color = "blue", fill = "lightblue")

# Check qqplot of wage
Wage %>% ggplot(aes(sample = wage)) + stat_qq() + stat_qq_line()

# Check distribution of logWage
Wage %>% ggplot(aes(x = logwage)) + geom_density(color = "blue", fill = "lightblue")

# Create a qqplot of logWage
Wage %>% ggplot(aes(sample = logwage)) + stat_qq() + stat_qq_line()

# looks approximately normal
# Create distributions based on education
Wage %>% ggplot() + geom_density(aes(x = logwage, fill = education), alpha = 0.4)

# This is not really informative; let's try creating boxplots
Wage %>% ggplot(aes(y = logwage, fill = education)) + geom_boxplot()

# much more informative
# Let's create boxplots based on logwage, education, and race
Wage %>% ggplot(aes(y = logwage, fill = education)) + geom_boxplot() + facet_wrap(~race)

# this is actually quite informative; can you express any opinions based on your findings?

Excercises

1. Create a scatter plot of age against logwage

Solution:

# Scatter plot of age against logwage
ggplot(data = Wage, aes(x = age, y = logwage)) +  geom_point()

2. Fit a best fit line of the plot for age against logwage

Solution:

ggplot(data = Wage, aes(x = age, y = logwage)) +  geom_point() + geom_line() + geom_smooth(method = 'gam')
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

3. Create boxplots for logwage, health, education

Solution:

glimpse(Wage)
## Rows: 3,000
## Columns: 11
## $ year       <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 2004,…
## $ age        <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, 54,…
## $ maritl     <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Married,…
## $ race       <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. White,…
## $ education  <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. College …
## $ region     <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic,…
## $ jobclass   <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Informatio…
## $ health     <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1. <=…
## $ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Ye…
## $ logwage    <dbl> 4.318063, 4.255273, 4.875061, 5.041393, 4.318063, 4.845098,…
## $ wage       <dbl> 75.04315, 70.47602, 130.98218, 154.68529, 75.04315, 127.115…
Wage %>% ggplot(aes(y = logwage, fill = education)) + geom_boxplot() + facet_wrap(~health)

4. Create boxplots for logwage, education, marital status

Solution:

Wage %>% ggplot(aes(y = logwage, fill = education)) + geom_boxplot() + facet_wrap(~maritl)

5. Create boxplots for logwage, health, marital status

Solution:

Wage %>% ggplot(aes(y = logwage, fill = health)) + geom_boxplot() + facet_wrap(~maritl)