Lab 1 is based on the “Wage” dataset and investigates causal relationships. In particular, we will see whether we can predict wage from education, race, health, age, and other variables.
tidyverse is an “umbrella” package that houses a suite of R packages for data wrangling and data visualization.
The ISLR2 package accompanies the book “An Introduction to Statistical Learning, with Applications in R (Second Edition)” by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani, and contains the datasets used in the book.
library(tidyverse)
library(ISLR2) # Wage data set belongs to ISLR2
The Wage data set contains wage and other data for a group of 3000 male workers in the Mid-Atlantic region. The data were manually assembled by Steve Miller, of Inquidia Consulting (formerly Open BI), from the March 2011 Supplement to Current Population Survey data.
## load Wage data set
data("Wage")
str(Wage)
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
This data set contains 3000 rows and 11 columns. Of the eleven variables, two (year and age) are integer, two (logwage and wage) are numeric, and the remaining seven are factor (categorical) variables.
sum(is.na(Wage))
## [1] 0
The Wage data set contains no missing values.
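The two checks above (variable types and missing values) can also be done column by column. The following is a minimal sketch using only base R; the object names `col_classes` and `na_per_column` are illustrative and not part of the lab.

```r
library(ISLR2)  # provides the Wage data set

# Class of each column, then a count of columns per class
col_classes <- sapply(Wage, class)
table(col_classes)

# Number of missing values in each column; all zeros means the data are complete
na_per_column <- colSums(is.na(Wage))
na_per_column
```

A per-column count is more useful than a single total when some values are missing, because it shows which variables are affected.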
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
For the numeric variables, the summary reports basic descriptive statistics: minimum, maximum, mean, median, first quartile, and third quartile. For the factor variables, it reports the number of observations in each level.
Now that we have some idea about the dataframe, let’s do a deep dive and create some visualizations.
Wage %>%
ggplot(aes(x=wage))+
geom_density(color="red", fill="orange")
By eyeballing the distribution of wage it seems to be right-skewed/positively skewed.
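The visual impression of right skew can be cross-checked numerically. The sketch below computes a moment-based sample skewness; `sample_skewness` is a hypothetical helper written for this illustration, not a function from any package loaded in the lab.

```r
library(ISLR2)  # provides the Wage data set

# Hypothetical helper: moment-based sample skewness,
# i.e. the mean of the cubed standardized deviations
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))  # standard deviation with divisor n
  mean((x - m)^3) / s^3
}

wage_skew <- sample_skewness(Wage$wage)
wage_skew  # a positive value indicates right skew

# A mean above the median is another common sign of right skew
mean(Wage$wage) > median(Wage$wage)
```

The summary output earlier already hints at this: the mean of wage (111.70) is larger than its median (104.92).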
The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a Normal or exponential.
Wage %>%
ggplot(aes(sample=wage))+
stat_qq(color="blue")+
stat_qq_line(color="red")
A Q-Q plot is a scatter plot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, the points should form a roughly straight line. Here the points depart from the reference line, so wage may not come from a normal distribution.
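To make the idea of "two sets of quantiles" concrete, a normal Q-Q plot can be assembled by hand. This is a sketch in base R that should be roughly equivalent to the `stat_qq()` and `stat_qq_line()` layers used above; the variable names are illustrative.

```r
library(ISLR2)  # provides the Wage data set

x <- Wage$wage
n <- length(x)

# Sample quantiles: the observations in increasing order
sample_q <- sort(x)

# Theoretical quantiles: standard normal quantiles at n evenly spaced probabilities
theoretical_q <- qnorm(ppoints(n))

# Scatter plot of one set of quantiles against the other
plot(theoretical_q, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")

# Reference line through the first and third quartiles
qqline(x, col = "red")
```

Points that bend away from the reference line in either tail are the departures from normality that the plot is meant to reveal.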
Wage %>%
ggplot(aes(x=logwage))+
geom_density(color="red", fill="orange")
By eyeballing the distribution of logwage it seems to be leptokurtic as well as left-skewed/negatively skewed.
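Both visual impressions, skewness and heavy tails, correspond to standardized moments that can be computed directly. The sketch below is one way to do this in base R; `std_moment` is a hypothetical helper, and the divisor-n definition of the moments is an assumption (other packages use slightly different corrections).

```r
library(ISLR2)  # provides the Wage data set

# Hypothetical helper: standardized central moment of order k
std_moment <- function(x, k) {
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))  # standard deviation with divisor n
  mean(((x - m) / s)^k)
}

# Third moment: a negative value indicates left skew
logwage_skew <- std_moment(Wage$logwage, 3)

# Fourth moment minus 3: a positive value indicates a leptokurtic distribution
logwage_excess_kurt <- std_moment(Wage$logwage, 4) - 3

c(skewness = logwage_skew, excess_kurtosis = logwage_excess_kurt)
```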
Wage %>%
ggplot(aes(sample=logwage))+
stat_qq(color="blue")+
stat_qq_line(color="red")
The interpretation is similar to that for the wage variable: the points depart from the straight reference line, so logwage may not come from a normal distribution either.
Wage %>%
ggplot()+
geom_density(aes(x=logwage, fill=education), alpha=0.4)+
ggtitle("logwage by education level")
The graph shows that education does influence logwage. A higher education level shifts the logwage distribution slightly to the right, but it also increases the variance and changes the shape of the distribution. One striking feature of the empirical logwage distribution is the bump in its upper tail, close to the maximum of 5.763 reported in the summary. There is no explanation for this bump in the dataset or its help page, so we have to assume it is an anomaly of the sample.
Wage %>%
ggplot(aes(y = logwage, fill = education)) +
geom_boxplot()
This box plot is more informative than the density plot. Logwage increases with education level. Box plots show the five-number summary of the data (minimum, first quartile, median, third quartile, and maximum), and they also let us detect outliers for each level of education.
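The numbers behind the boxes can be tabulated directly. The following sketch uses base R's `fivenum()` and `boxplot.stats()`; note that these work with hinges, whereas `geom_boxplot()` uses quantiles, so the values may differ slightly from what the plot draws. The object names are illustrative.

```r
library(ISLR2)  # provides the Wage data set

# Split logwage into one vector per education level
by_edu <- split(Wage$logwage, Wage$education)

# Five-number summary per level: one column per education level
five_num <- sapply(by_edu, fivenum)
rownames(five_num) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
five_num

# Number of points lying more than 1.5 * IQR beyond the hinges,
# which is the usual box plot rule for flagging outliers
outlier_counts <- sapply(by_edu, function(v) length(boxplot.stats(v)$out))
outlier_counts
```

When outliers are present, the whiskers of a box plot stop at the most extreme non-outlying point, so the true minimum and maximum are shown as individual points rather than as whisker ends.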
This time, let’s check the distribution of logwage after adding another dimension, the race variable.
Wage %>%
ggplot(aes(y=logwage, fill=education))+
geom_boxplot()+
facet_wrap(~race)
This is actually quite informative. Logwage increases with education for White, Black and Asian workers.
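The pattern in the faceted box plot can be summarized as a table of group medians. This is a sketch using dplyr verbs from the tidyverse; the column names `n` and `median_logwage` are chosen for illustration. Including the group size helps when reading the result, since the summary above shows that some race groups are small (for example, only 37 observations in "4. Other").

```r
library(tidyverse)
library(ISLR2)  # provides the Wage data set

# Group size and median logwage for every race-by-education combination
median_by_group <- Wage %>%
  group_by(race, education) %>%
  summarise(n = n(), median_logwage = median(logwage), .groups = "drop")

median_by_group
```

Medians based on very few observations are unstable, so small cells should be interpreted with caution.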
Create a scatter plot of age against logwage (hint: use geom_point())
Answer-1:
Wage %>%
ggplot(aes(x=age, y=logwage))+
geom_point()
The scatter plot of age vs logwage is not very informative.
Add a best-fit line to the plot of age against logwage (hint: use geom_smooth())
Answer-2:
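# Linear fit: straight regression line with a confidence band
Wage %>%
ggplot(aes(x = age, y = logwage)) +
geom_point(color = "gray") +
geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red")
# Smooth fit: generalized additive model with a cubic regression spline
Wage %>%
ggplot(aes(x = age, y = logwage)) +
geom_point(color = "gray") +
geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), se = TRUE)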
The fitted red line from the linear model suggests that age is positively associated with logwage. However, we cannot tell from the plot alone whether the association is statistically significant. To determine the significance of the association, we should examine the p-value of the slope.
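One way to obtain that p-value is to fit the same simple linear regression with `lm()` and read it from the coefficient table. This is a minimal sketch; the object names `fit`, `coef_table`, `age_slope`, and `age_p_value` are illustrative.

```r
library(ISLR2)  # provides the Wage data set

# Simple linear regression of logwage on age,
# the same model that geom_smooth(method = "lm") draws
fit <- lm(logwage ~ age, data = Wage)

# Coefficient table: estimate, standard error, t value, and p-value
coef_table <- summary(fit)$coefficients
coef_table

# Slope for age and the p-value of the test that the slope is zero
age_slope <- coef_table["age", "Estimate"]
age_p_value <- coef_table["age", "Pr(>|t|)"]
```

A small p-value would indicate that the linear association is unlikely to be due to chance alone, although it says nothing about whether a straight line is the right shape for the relationship.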
Create boxplots for logwage, health, education
Answer-3:
Wage %>%
ggplot(aes(y=logwage, fill=health))+
geom_boxplot()+
facet_wrap(~education)
Create boxplots for logwage, education, marital status
Answer-4:
Wage %>%
ggplot(aes(y=logwage, fill=education))+
geom_boxplot()+
facet_wrap(~maritl)
Create boxplots for logwage, health, marital status
Answer-5:
Wage %>%
ggplot(aes(y=logwage, fill=health))+
geom_boxplot()+
facet_wrap(~maritl)
The distribution of wage was right-skewed. The Q-Q plots show that wage and logwage may not come from a normal distribution. From the box plots, we found that logwage increases with education for White, Black and Asian workers.