Description

Lab 1 is based on the “Wage” dataset and investigates relationships between wage and worker characteristics. In particular, we will see whether we can predict wage from education, race, health, age, and other variables.

Load Package

  • The tidyverse “umbrella” package, which houses a suite of R packages for data wrangling and data visualization.

  • The ISLR2 package, which contains the datasets used in the book “An Introduction to Statistical Learning, with Applications in R (Second Edition)” by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani.

library(tidyverse)
library(ISLR2) # Wage data set belongs to ISLR2

Loading Wage dataset

  • Description

Wage data for a group of 3000 male workers in the Mid-Atlantic region. The data were manually assembled by Steve Miller, of Inquidia Consulting (formerly Open BI), from the March 2011 Supplement to Current Population Survey data.

# Load the Wage data set
data("Wage")

Some Exploration

Checking data Structure

str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

This data set contains 3000 rows and 11 columns. Of the eleven variables, two (year and age) are integers and two (logwage and wage) are numeric. The remaining seven are factor (categorical) variables.
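The counts above can also be obtained programmatically instead of being read off the str() output. A minimal sketch, assuming the ISLR2 package is installed; the name col_types is introduced here only for illustration:

```r
library(ISLR2)  # provides the Wage data set
data("Wage")

# Class of each column, then a tally of how many columns share each class
col_types <- sapply(Wage, class)
table(col_types)

# Names of the factor (categorical) columns
names(Wage)[sapply(Wage, is.factor)]
```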

Check for missing values (NA)

sum(is.na(Wage))
## [1] 0

The Wage data set contains no missing values.
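sum(is.na(Wage)) returns a single total, which would not say where missing values sit if there were any. A sketch of a per-column check; na_per_column is an illustrative name, not part of the original lab:

```r
library(ISLR2)
data("Wage")

# Number of missing values in each column
na_per_column <- colSums(is.na(Wage))
na_per_column

# TRUE when every row is complete
all(complete.cases(Wage))
```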

Displaying summary statistics

summary(Wage)
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34  
## 

The summary output shows basic descriptive statistics for each numeric variable (minimum, 1st quartile, median, mean, 3rd quartile and maximum) and the number of observations at each level of every factor variable.
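The summary also shows that every observation falls in the Middle Atlantic region, so the other eight levels of region are empty. If those unused levels get in the way of later tables or plots, they can be dropped. A sketch, where wage_clean is a hypothetical name that the rest of the lab does not use:

```r
library(ISLR2)
data("Wage")

# Count of observations per region level; only one level is populated
table(Wage$region)

# droplevels() removes factor levels that have no observations
wage_clean <- droplevels(Wage)
nlevels(Wage$region)        # levels before dropping
nlevels(wage_clean$region)  # levels after dropping
```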

Further Analysis

Now that we have some idea about the dataframe, let’s do a deep dive and create some visualizations.

Check the distribution of wage

Wage %>%
  ggplot(aes(x=wage))+
  geom_density(color="red", fill="orange")

Judging by eye, the distribution of wage appears to be right-skewed (positively skewed).
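The visual impression can be backed by numbers. For a right-skewed variable the mean usually exceeds the median, and the moment-based sample skewness is positive. The helper sample_skewness below is written for this illustration and does not come from a package:

```r
library(ISLR2)
data("Wage")

# Moment-based skewness: third central moment divided by the cubed
# standard deviation (population form, dividing by n)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

mean(Wage$wage) > median(Wage$wage)  # mean 111.70 vs median 104.92
sample_skewness(Wage$wage)           # a positive value indicates right skew
```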

Q-Q plot of wage

The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a Normal or exponential.

Wage %>%
  ggplot(aes(sample=wage))+
  stat_qq(color="blue")+
  stat_qq_line(color="red")

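A Q-Q plot is a scatter plot of two sets of quantiles against one another, here the sample quantiles of wage against the theoretical quantiles of a normal distribution. If both sets of quantiles came from the same distribution, the points should form a roughly straight line. Where the points bend away from the reference line, as a right-skewed variable does in its upper tail, the data may not come from a normal distribution.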
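To make the construction concrete, the points of a normal Q-Q plot can be computed by hand: the sorted sample values are paired with normal quantiles evaluated at evenly spaced probabilities. A sketch in base R, using ppoints() for those probabilities as base R's qqnorm() does; the names theoretical and sample_q are illustrative:

```r
library(ISLR2)
data("Wage")

x <- Wage$wage
n <- length(x)

# Theoretical standard normal quantiles at n plotting positions
theoretical <- qnorm(ppoints(n))
# Sample quantiles are simply the ordered observations
sample_q <- sort(x)

# The same pairs as computed by qqnorm(), here without drawing
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), sample_q)

# Base R counterpart of the ggplot2 figure
plot(theoretical, sample_q, col = "blue")
qqline(x, col = "red")
```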

Checking distribution of logWage

Wage %>%
  ggplot(aes(x=logwage))+
  geom_density(color="red", fill="orange")

Judging by eye, the distribution of logwage appears to be leptokurtic (more sharply peaked and heavier-tailed than a normal distribution) as well as left-skewed (negatively skewed).
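A leptokurtic distribution has positive excess kurtosis, so the eyeballed impression can be checked numerically. The helper excess_kurtosis below is an illustrative function written for this sketch, not a package function:

```r
library(ISLR2)
data("Wage")

# Excess kurtosis: fourth standardized moment minus 3, so that a
# normal distribution scores 0 and heavier tails score above 0
excess_kurtosis <- function(x) {
  centered <- x - mean(x)
  mean(centered^4) / mean(centered^2)^2 - 3
}

excess_kurtosis(Wage$logwage)
excess_kurtosis(Wage$wage)
```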

Q-Q plot of logwage

Wage %>%
  ggplot(aes(sample=logwage))+
  stat_qq(color="blue")+
  stat_qq_line(color="red")

The interpretation is the same as for wage: points that follow the reference line are consistent with a normal distribution, while systematic departures from it, typically in the tails, suggest that logwage may not be normally distributed either.

Distribution of logwage by education

  • One main driver of wage should be education, so let us start by descriptively investigating the relationship between these two variables. This R code plots the distribution of logwage across the different education levels:
Wage %>%
  ggplot()+
  geom_density(aes(x=logwage, fill=education), alpha=0.4)+
  ggtitle("logwage by education level")

The graph shows that education does have an influence on logwage. A higher education level shifts the logwage distribution to the right, but it also increases the variance and changes the shape of the distribution. One striking feature of the empirical distribution is a bump in its upper tail, at logwage values of roughly 5.5 and above (the sample maximum is 5.763). There is no explanation for this bump in the dataset or its help page, so we have to assume it is an anomaly of the sample.
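One way to look into the bump is to isolate the observations in the upper tail and see how they are spread over the education levels. The cutoff of 250 on the wage scale (about 5.52 on the log scale) is an assumption chosen for illustration, not a value taken from the dataset documentation:

```r
library(ISLR2)
data("Wage")

cutoff <- 250   # assumed threshold on the wage scale
log(cutoff)     # the same threshold on the logwage scale

# Flag the upper-tail observations and tabulate them by education
high_earner <- Wage$wage > cutoff
sum(high_earner)
table(Wage$education, high_earner)
```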

  • The density plot of logwage by education is hard to read because the curves overlap. Let’s try another well-suited statistical graphic, the box plot.
Wage %>% 
  ggplot(aes(y = logwage, fill = education)) + 
  geom_boxplot()

This box plot is more informative than the density plot. The median logwage increases with each higher level of education. Each box shows the first (lower) quartile, the median and the third (upper) quartile; the whiskers extend to the most extreme values within 1.5 times the interquartile range of the box, and points beyond the whiskers are drawn individually, which lets us detect outliers at each level of education.
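The numbers behind each box can be computed directly. In this sketch fivenum() returns the minimum, lower hinge, median, upper hinge and maximum for each education level, and boxplot.stats() lists the points beyond the whiskers. The hinges can differ marginally from the quartiles ggplot2 draws, and stats_by_edu is an illustrative name:

```r
library(ISLR2)
data("Wage")

# Five-number summary of logwage for each education level
stats_by_edu <- sapply(split(Wage$logwage, Wage$education), fivenum)
rownames(stats_by_edu) <- c("min", "lower hinge", "median", "upper hinge", "max")
round(stats_by_edu, 2)

# Number of points beyond the whiskers in each group
sapply(split(Wage$logwage, Wage$education),
       function(x) length(boxplot.stats(x)$out))
```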

Distribution of logwage by education and race

This time let’s check the distribution of logwage after adding another dimension, the race variable.

Wage %>%
  ggplot(aes(y=logwage, fill=education))+
  geom_boxplot()+
  facet_wrap(~race)

This is quite informative. Logwage increases with education for the White, Black and Asian groups. The Other group has only 37 observations, so its boxes are less reliable.
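Faceting splits the data into many small groups, so it helps to check how many observations stand behind each box. A sketch of the cross-tabulation, with counts as an illustrative name:

```r
library(ISLR2)
data("Wage")

# Observations per race and education combination
counts <- table(Wage$race, Wage$education)
counts

# Row totals reproduce the race counts from summary(Wage)
rowSums(counts)
```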

Exercises

Exercises-1

Create a scatter plot of age against logwage (hint: use geom_point())

Answer-1:

Wage %>%
  ggplot(aes(x=age, y=logwage))+
  geom_point()

The scatter plot of age vs logwage is not very informative.
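Part of the reason is likely overplotting: age is recorded in whole years, so the 3000 points stack into vertical stripes. A possible remedy is sketched here with semi-transparent, slightly jittered points. The alpha and width values are arbitrary choices for illustration, and p is an illustrative name:

```r
library(tidyverse)
library(ISLR2)
data("Wage")

p <- Wage %>%
  ggplot(aes(x = age, y = logwage)) +
  # Jitter horizontally only, and make points translucent so that
  # dense regions appear darker
  geom_jitter(width = 0.3, height = 0, alpha = 0.3)
p
```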

Exercises-2

Add a best-fit line to the plot of age against logwage (hint: use geom_smooth())

Answer-2:

Wage %>%
  ggplot(aes(x=age, y=logwage))+
  geom_point(color="gray")+
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red")

Wage %>%
  ggplot(aes(x=age, y=logwage))+
  geom_point(color="gray")+
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), se = TRUE)

The fitted red line from the linear model shows that age is positively associated with logwage, while the GAM smooth in the second plot allows the relationship to be non-linear. However, we cannot tell from the plot alone whether the association is statistically significant. To determine that, we should examine the p-value of the fitted model.
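The p-value can be obtained by fitting the same linear model outside the plot. A sketch using lm(); the names fit and coefs are illustrative:

```r
library(ISLR2)
data("Wage")

# Simple linear regression of logwage on age, the model that
# geom_smooth(method = "lm") draws
fit <- lm(logwage ~ age, data = Wage)
coefs <- summary(fit)$coefficients
coefs

# Slope estimate and its p-value (test of the hypothesis slope = 0)
coefs["age", "Estimate"]
coefs["age", "Pr(>|t|)"]
```

A small p-value would indicate that the linear association is unlikely to be due to chance alone. It says nothing about whether a straight line is the right shape for the relationship.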

Exercises-3

Create boxplots for logwage, health, education

Answer-3:

Wage %>%
  ggplot(aes(y=logwage, fill=health))+
  geom_boxplot()+
  facet_wrap(~education)

Exercises-4

Create boxplots for logwage, education, marital status

Answer-4:

Wage %>%
  ggplot(aes(y=logwage, fill=education))+
  geom_boxplot()+
  facet_wrap(~maritl)

Exercises-5

Create boxplots for logwage, health, marital status

Answer-5:

Wage %>%
  ggplot(aes(y=logwage, fill=health))+
  geom_boxplot()+
  facet_wrap(~maritl)

Summary

The distribution of wage is right-skewed. The Q-Q plots show that wage and logwage may not come from a normal distribution. From the box plots, we found that logwage increases with education for the White, Black and Asian groups.