In this problem, you will explore multilevel modeling using an R package nlme. A dataset (“oxboys.csv”) is provided on Canvas. This dataset contains three features from 26 students in Oxford: student id, (measured) height, and (measurement) year.
A.) Plot the relationship between height and year, and draw a linearly regressed line ignoring the id variable.
library(nlme)
oxboys = read.csv('oxboys.csv')
head(oxboys)
## id height year
## 1 1 140.5 1
## 2 1 143.4 2
## 3 1 144.8 3
## 4 1 147.1 4
## 5 1 147.7 5
## 6 1 150.2 6
attach(oxboys)
plot(year, height, col="blue", main="Oxford Students: Height vs. Year", cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5, xlim = c(0,10))
abline(lm(height~year), col="red", lwd=5)
B.) Plot the relationship between height and year, but this time, fit a different linear regression for each individual. In ggplot, you can do it by specifying the group variable:
ggplot(oxboys, aes(x=year, y=height, group=id)) + stat_smooth(method=“lm”)
Other plotting packages also should have some such facility. Comment on differences between (a) and (b).
library(ggplot2)
ggplot(oxboys, aes(x=year, y=height, group=id)) + stat_smooth(method="lm", color="red") + ggtitle("Oxford Boys: Height vs. Year") + expand_limits(x=c(min(year),max(year))) + scale_x_continuous(breaks=seq(min(year), max(year), 1)) + theme(axis.title = element_text(face="bold", colour="blue", size=20), plot.title = element_text(family = "Trebuchet MS", color="blue", face="bold", size=22))
The first graph plots the relationship between year and height for all students. The abline represents the line of best fit for a global MLR model. In contrast, the ggplot graph shows the relationship between year and height with respect to each student (group=id). This is a graphic representation of a local model (i.e., 26 different linear regressions).
C.) Divide the dataset into training and test sets. The training set contains the first two years of the measurements, and the test set contains the rest of the measurements.
Build three different linear models. Predict the heights for the next 8 years, and calculate the mean squared errors from the three models.
oxboys.train <- subset(oxboys, year < 3)
oxboys.test <- subset(oxboys, year >= 3)
#=====================================================================
#GLOBAL MODEL
global.fit = lm(height~year, data=oxboys.train)
#summary(global.fit)
global_pred = predict(global.fit, newdata=oxboys.test)
global_MSE = mean((oxboys.test$height - global_pred)^2)
global_MSE
## [1] 68.73458
#======================================================================
#LOCAL MODEL
for(key in unique(oxboys.train$id)){
oxboys.train.ind <- subset(oxboys.train, id==key)
oxboys.test.ind <- subset(oxboys.test, id==key)
# build a linear model with oxboys.train.ind
local.fit = lm(height~year, data=oxboys.train.ind)
# predict on oxboys.test.ind, and measure MSE
local_pred = predict(local.fit, newdata=oxboys.test.ind)
}
local_MSE = mean((oxboys.test.ind$height - local_pred)^2)
local_MSE
## [1] 12.35857
#=====================================================================
#MULTI-LEVEL MODEL
mlm.obj = lme(height~year, data=oxboys.train, random=list(id=pdDiag(~year)))
mlm_pred = predict(mlm.obj, newdata=oxboys.test, level=1)
# calculate MSE
mlm_MSE = mean((oxboys.test$height - mlm_pred)^2)
mlm_MSE
## [1] 4.234867
D.) Repeat c), but this time, the training set contains the first 6 years, and the test set has the rest.
oxboys.train.rep <- subset(oxboys, year < 7)
oxboys.test.rep <- subset(oxboys, year >= 7)
#=====================================================================
#GLOBAL MODEL
global.fit.rep = lm(height~year, data=oxboys.train.rep)
global_pred_rep = predict(global.fit.rep, newdata=oxboys.test.rep)
global_MSE_rep = mean((oxboys.test.rep$height - global_pred_rep)^2)
global_MSE_rep
## [1] 80.06475
#======================================================================
#LOCAL MODEL
for(key in unique(oxboys.train.rep$id)){
oxboys.train.rep.ind <- subset(oxboys.train.rep, id==key)
oxboys.test.rep.ind <- subset(oxboys.test.rep, id==key)
# build a linear model with oxboys.train.ind
local.fit.rep = lm(height~year, data=oxboys.train.rep.ind)
# predict on oxboys.test.ind, and measure MSE
local_pred_rep = predict(local.fit.rep, newdata=oxboys.test.rep.ind)
}
local_MSE_rep = mean((oxboys.test.rep.ind$height - local_pred_rep)^2)
local_MSE_rep
## [1] 0.5588227
#=====================================================================
#MULTI-LEVEL MODEL
mlm.obj.rep = lme(height~year, data=oxboys.train.rep, random=list(id=pdDiag(~year)))
mlm_pred_rep = predict(mlm.obj.rep, newdata=oxboys.test.rep, level=1)
# calculate MSE
mlm_MSE_rep = mean((oxboys.test.rep$height - mlm_pred_rep)^2)
mlm_MSE_rep
## [1] 2.720906
E.) Discuss the results from C.) and D.). When does the multilevel perform better, and when does it not?
As discussed in class, the use of multi-level models is most appropriate when there is not enough data for the lowest level models and higher level models are too coarse OR when we want to get similar results for individuals within a group. In this case, we are operating on one level and we want to get similar results for individuals within a group (male Oxford students). The MLM achieves the lowest overall MSE in part D compared to part C. However, in part D the local model yields the best results whereas in part C the MLM yields the best results. The difference between part C and part D is the amount of data included in the training set, as you can see:
#Part C
dim(oxboys.train)#train
## [1] 52 3
dim(oxboys.test)#test
## [1] 182 3
#Part D
dim(oxboys.train.rep)#train
## [1] 156 3
dim(oxboys.test.rep)#test
## [1] 78 3
F.) Briefly state what do β00 and β10 mean in this multilevel model.
B00 represents the variability in the intercept term for each group (i.e., student ID). B10 represents the variability in the slope term for each group (again - student ID).