From the ‘Wage’ data set, selecting the response variable as ‘wage’ and the predictors ‘age’, ‘education’, ‘year’,and ‘jobclass’.
Predictor ‘age’ shows non-linear relationship in general with the wage variable. Whilst, year variable does not show non-linear relationship with wage variable in general trend (but if we differentiate the movement wage over year by education level, we could see non-linear relation for Advanced degree)
For 2 categorical predictors, it is not appropriate to comment about linear movement. While assuming education in increasing numeric scale, increasing from 1 (<HS) to 5 (advanced degree), the higher education (higher numeric marked) are likely to have higher salary mean. In terms of jobclass, information sector offer higher salary than industrial one.
data<- Wage %>% select(wage,age,year,education,jobclass)%>%arrange(age)
summary(data)
## wage age year education
## Min. : 20.09 Min. :18.00 Min. :2003 1. < HS Grad :268
## 1st Qu.: 85.38 1st Qu.:33.75 1st Qu.:2004 2. HS Grad :971
## Median :104.92 Median :42.00 Median :2006 3. Some College :650
## Mean :111.70 Mean :42.41 Mean :2006 4. College Grad :685
## 3rd Qu.:128.68 3rd Qu.:51.00 3rd Qu.:2008 5. Advanced Degree:426
## Max. :318.34 Max. :80.00 Max. :2009
## jobclass
## 1. Industrial :1544
## 2. Information:1456
##
##
##
##
GGally::ggpairs(data %>% select(wage,age,year,education,jobclass))
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#
data%>% ggplot(aes(x= age, y= wage,color=jobclass))+geom_jitter(pch=3,alpha=0.1)+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#
data%>% ggplot(aes(x= year, y= wage,color=education))+geom_point(pch=3,alpha=0.1) +geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
data%>% ggplot(aes(x= year, y= wage))+geom_point(pch=3,alpha=0.1) +geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#
data%>% ggplot(aes(x= education, y=wage,color=jobclass))+geom_jitter(alpha=0.1)+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#
data%>% ggplot(aes(x= jobclass, y= wage,color=education))+geom_jitter(pch=3,alpha=0.3)+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Average wage of workers (some college education and work in Industrial sector) aged 20 is less than those aged 60 years around 21.9 units of raw wage.
The older the workers are, the higher their salaries are in general. Besides, there is a small group of elite workers in their field, who have distinctive range of salary (above 200), which is higher than their average collegueas from the age of 25 to 65.
Moreover, there is an increase in raw wages over year (might be to adjust inflation).
data1<- Wage %>% select(wage,age,year,education,jobclass)%>%arrange(age)
# Fit model with linear predictor
mod<-gam(data= data1, wage ~ age+year+education+jobclass)
vis.gam(mod,ticktype="detailed",theta=-60)
#
predict(mod,data.frame(age=20,year=2006,education="3. Some College",jobclass="1. Industrial"))-
predict(mod,data.frame(age=60,year=2006,education="3. Some College",jobclass="1. Industrial"))
## 1
## -21.93978
# Plot partial residual for age
visreg(mod, "age",gg=T)
The model ‘mod’ faces the non-constant variance of residuals, which violate the linear regression of Homoscedasticity. It also violates assumption of normality.
Empirical right tail is longer than what it should be (the theory normal right tail)-skewed to the right
# The Homogeneity of variance assumption
plot1<-ggplot()+
geom_point(aes(x=predict(mod),y=residuals(mod),alpha=0.1))+
ggtitle("Original mod")+
geom_hline(yintercept = 0,color="red")
# The normality of residuals assumption (QQ plot)
e<-ggplot() +
geom_qq(aes(sample=residuals(mod)))+ggtitle("QQ plot of orignal mod")
i<-residuals(mod)
ii<-ggplot(data=as.tibble(i),aes(x = value, y = ..density..)) +
geom_histogram(colour = "red",fill = "red", alpha = 0.1) +
geom_density(colour = "green",fill = "green", alpha = 0.5) +
ggtitle("Distribution of residual of mod") +
xlab("residuals") + theme_bw()
Model’s heterokedascity problem is improved when taking log of response variable and residual distribution turns from highly right-skewed to approximately more symmetric with light tail in the left.
# revised model (taking log y to fix the heterokedascity)
mod2<-gam(data= data, log(wage)~age+year+education+jobclass)
plot2<-ggplot()+
geom_point(aes(x=predict(mod2),y=residuals(mod2)))+
ggtitle("revised mod")+
geom_hline(yintercept = 0,color="red")
# the normality of residuals assumption (QQ plot)
u<-residuals(mod2)
uu<-ggplot(data=as.tibble(u),aes(x = value, y = ..density..)) +
geom_histogram(colour = "red",fill = "red", alpha = 0.1) +
geom_density(colour = "blue",fill = "blue", alpha = 0.5) +
ggtitle("Distribution of residual of revised mod") +
xlab("residuals") + theme_bw()
# compare
gridExtra::grid.arrange(plot1,plot2)
gridExtra::grid.arrange(ii,uu)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qqnorm(residuals(mod),main="")
qqline(residuals(mod))
title(main = "QQ plot", col.main= "green")
qqnorm(residuals(mod2),main="")
qqline(residuals(mod2))
title(main = "Revised QQ plot", col.main= "blue")
The visualization of spline with too many knots (20 knots for age, with 4 knots for year) is not a good approach as it blocks us from seeing the main transition. Therefore, specifying the proper number of knots is tricky problem, then we often proceed it by trial and error. Besides, there is a way that has been grown popularly to deal with “not too wiggly” spline is the use of auto penalized splines with REML, which will be applied in later parts of this assignment.
# Regression spline with 20 knots for age, 4 knots for year, and linear terms for the remainders.
mod3<-gam(data= data, log(wage) ~ s(age,k=20,fx=TRUE) + s(year,k=4,fx=TRUE) +education +jobclass, method="REML")
summary(mod3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(wage) ~ s(age, k = 20, fx = TRUE) + s(year, k = 4, fx = TRUE) +
## education + jobclass
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.40476 0.01833 240.296 < 2e-16 ***
## education2. HS Grad 0.11168 0.02033 5.492 4.3e-08 ***
## education3. Some College 0.22472 0.02154 10.433 < 2e-16 ***
## education4. College Grad 0.33600 0.02154 15.602 < 2e-16 ***
## education5. Advanced Degree 0.49667 0.02362 21.029 < 2e-16 ***
## jobclass2. Information 0.03516 0.01129 3.114 0.00186 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 19 19 18.077 < 2e-16 ***
## s(year) 3 3 8.536 1.21e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.309 Deviance explained = 31.5%
## -REML = 628.88 Scale est. = 0.085513 n = 3000
# Visualise the regression splines
ggplot(data) +
geom_point(aes(x=age,y=log(wage),alpha=0.1)) +
ggtitle("log Wage regression splines") +
geom_line(aes(x=age, y=fitted(mod3)), col='red')
ggplot(data) +
geom_point(aes(x=year,y=log(wage),alpha=0.1)) +
ggtitle("log Wage regression splines") +
geom_line(aes(x=year, y=fitted(mod3)), col='red')
Regarding auto knot selection using visreg, in terms of ages, there is a non-linear relation. Wages has increased significantly by the age of 40 to hit the peak, then hovering around that level until the age of 60, followed by a gradual decline.
In terms of year, there is a slight positive linear relation (possibly due to inflation adjustment by time). If taking the inflation factor into consideration, this slight increase might be faded.
In terms of education, the higher education level, the higher average of salary.
visreg(mod3,gg=T)
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
visreg2d(mod3,"age","year",plot.type="persp",theta=-50)
There is a non-linear relation between age and wage over time. From the age of 20 to 40, there is a strong positive relationship with wage and age, while between the age of 40 and 60 there is a constant level of wage. After age of 60, wage witnesses a decline.
# Default mgcv approach: penalized regression spline
mod4<-gam(data= data, log(wage)~s(age)+year+education+jobclass,method="REML")
summary(mod4)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(wage) ~ s(age) + year + education + jobclass
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.990414 5.305136 -3.768 0.000168 ***
## year 0.012161 0.002645 4.598 4.44e-06 ***
## education2. HS Grad 0.113547 0.020204 5.620 2.09e-08 ***
## education3. Some College 0.227946 0.021375 10.664 < 2e-16 ***
## education4. College Grad 0.338383 0.021429 15.791 < 2e-16 ***
## education5. Advanced Degree 0.498599 0.023516 21.202 < 2e-16 ***
## jobclass2. Information 0.035032 0.011258 3.112 0.001877 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 5.622 6.732 49.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.309 Deviance explained = 31.2%
## -REML = 599.61 Scale est. = 0.085461 n = 3000
visreg(mod4, "age",gg=T)
plot(mod4,residuals = F,rug = T,shade = T)
In general, for all working ages, workers in information sector have higher wage per any period during their working life and total accumulated income, generated only from their works, than industrial sector
In detail, the starting wage of information is higher than industrial sector. The speed of wage increase by 40 in information is faster than industrial sector, while the wage downturn after age of 60 is slower.
For those working in the industrial industry, their wages increase to hit the peak, which is lower than the peak of those worked in information sector, at around age 45, then remaining in the short period, followed by significant decline.
For those working in information industry, their wages show a sharp increase since starting working to hit the peak until around 40, then decline very gradually and slightly for the next 20 years, following by a little bit heavier decline, which is still higher in wage and slower in decrease speed than industrial sector.
mod5<-gam(data= data,
log(wage)~year+education+jobclass+s(age,by=jobclass),
method="REML")
summary(mod5)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(wage) ~ year + education + jobclass + s(age, by = jobclass)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.981254 5.305069 -3.766 0.000169 ***
## year 0.012157 0.002645 4.596 4.48e-06 ***
## education2. HS Grad 0.112712 0.020222 5.574 2.71e-08 ***
## education3. Some College 0.227554 0.021391 10.638 < 2e-16 ***
## education4. College Grad 0.338126 0.021456 15.759 < 2e-16 ***
## education5. Advanced Degree 0.498901 0.023533 21.200 < 2e-16 ***
## jobclass2. Information 0.035181 0.011262 3.124 0.001801 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age):jobclass1. Industrial 4.955 6.019 35.70 <2e-16 ***
## s(age):jobclass2. Information 4.804 5.868 20.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.309 Deviance explained = 31.3%
## -REML = 605.71 Scale est. = 0.085444 n = 3000
visreg2d(mod5, "age","jobclass",plot.type="persp",theta=5)
visreg2d(mod5, "age","jobclass")
visreg(mod5,"age",by="jobclass")
Regarding hypothesis test for smooth interaction term, the smooth interaction term is not statistical significant as pval>0.05, then it should be removed as we fail to reject H0(interaction term in model5 has zero effect into model). Therefore, we favor model 4 than model 5.
# Hypothesis test interaction term
anova(mod4,mod5,test="F")
## Analysis of Deviance Table
##
## Model 1: log(wage) ~ s(age) + year + education + jobclass
## Model 2: log(wage) ~ year + education + jobclass + s(age, by = jobclass)
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2985.9 255.31
## 2 2980.4 254.90 5.5343 0.40385 0.854 0.5207
Regarding hypothesis test for smooth term of age, the smooth term effect in model is statistical significant as we have enough evidence to reject H0 (p-value<0.05)
H0: smooth term of age s(age) = 0 H1: smooth term of age s(age) >< 0
Conclusion, mod4 (model with smooth term) is the best model so far.
# Hypothesis test smooth term
anova(mod4,mod2,test = "F")
## Analysis of Deviance Table
##
## Model 1: log(wage) ~ s(age) + year + education + jobclass
## Model 2: log(wage) ~ age + year + education + jobclass
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2985.9 255.31
## 2 2992.0 272.46 -6.0749 -17.157 33.048 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The wage of those aged 20 in an industrial field in 2009 with HS grad education is 66.21 units of wage
The wage of those aged 60 in an industrial field in 2009 with HS grad education is 98.87 units of wage
The difference in wages between them is 32.659 units of wage
p1 <- predict(mod4,data.frame(age=20,year=2009,education="2. HS Grad",jobclass="1. Industrial"), se=TRUE)
p2 <- predict(mod4,data.frame(age=60,year=2009,education="2. HS Grad",jobclass="1. Industrial"), se=TRUE)
#Predict Wage
exp(p1$fit)
## 1
## 66.21151
exp(p2$fit)
## 1
## 98.87051
exp(p1$fit)-exp(p2$fit)
## 1
## -32.659