Multi-level Modeling

In this problem, you will explore multilevel modeling using an R package nlme. A dataset (“oxboys.csv”) is provided on Canvas. This dataset contains three features from 26 students in Oxford: student id, (measured) height, and (measurement) year.

A.) Plot the relationship between height and year, and draw a linearly regressed line ignoring the id variable.

library(nlme)
oxboys = read.csv('oxboys.csv')
head(oxboys)

##   id height year
## 1  1  140.5    1
## 2  1  143.4    2
## 3  1  144.8    3
## 4  1  147.1    4
## 5  1  147.7    5
## 6  1  150.2    6

attach(oxboys)

plot(year, height, col="blue", main="Oxford Students: Height vs. Year", cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5, xlim = c(0,10))


abline(lm(height~year), col="red", lwd=5)

plot of chunk unnamed-chunk-1

B.) Plot the relationship between height and year, but this time, fit a different linear regression for each individual. In ggplot, you can do it by specifying the group variable: ggplot(oxboys, aes(x=year, y=height, group=id)) + stat_smooth(method=“lm”) Other plotting packages also should have some such facility. Comment on differences between (a) and (b).

library(ggplot2)

ggplot(oxboys, aes(x=year, y=height, group=id)) + stat_smooth(method="lm", color="red") + ggtitle("Oxford Boys: Height vs. Year") + expand_limits(x=c(min(year),max(year))) + scale_x_continuous(breaks=seq(min(year), max(year), 1)) + theme(axis.title = element_text(face="bold", colour="blue", size=20), plot.title = element_text(family = "Trebuchet MS", color="blue", face="bold", size=22))

plot of chunk unnamed-chunk-2

The first graph plots the relationship between year and height for all students. The abline represents the line of best fit for a global MLR model. In contrast, the ggplot graph shows the relationship between year and height with respect to each student (group=id). This is a graphic representation of a local model (i.e., 26 different linear regressions).

C.) Divide the dataset into training and test sets. The training set contains the first two years of the measurements, and the test set contains the rest of the measurements.

Build three different linear models. Predict the heights for the next 8 years, and calculate the mean squared errors from the three models.

Global model: a linear model ignoring the id variable

Local model: a different linear model for each individual i.e., 26 different linear regressions

Multilevel model: Use the nlme package to fit a multilevel model

oxboys.train <- subset(oxboys, year < 3)
oxboys.test <- subset(oxboys, year >= 3)

#=====================================================================

#GLOBAL MODEL

global.fit = lm(height~year, data=oxboys.train)
#summary(global.fit)

global_pred = predict(global.fit, newdata=oxboys.test)

global_MSE = mean((oxboys.test$height - global_pred)^2)
global_MSE

## [1] 68.73458

#======================================================================

#LOCAL MODEL
for(key in unique(oxboys.train$id)){
               oxboys.train.ind <- subset(oxboys.train, id==key)
               oxboys.test.ind <- subset(oxboys.test, id==key)
               # build a linear model with oxboys.train.ind
               local.fit = lm(height~year, data=oxboys.train.ind)
               # predict on oxboys.test.ind, and measure MSE
               local_pred = predict(local.fit, newdata=oxboys.test.ind)

}
local_MSE = mean((oxboys.test.ind$height - local_pred)^2)
local_MSE

## [1] 12.35857

#=====================================================================

#MULTI-LEVEL MODEL

mlm.obj = lme(height~year, data=oxboys.train, random=list(id=pdDiag(~year))) 
mlm_pred = predict(mlm.obj, newdata=oxboys.test, level=1)
# calculate MSE
mlm_MSE = mean((oxboys.test$height - mlm_pred)^2)
mlm_MSE

## [1] 4.234867

D.) Repeat c), but this time, the training set contains the first 6 years, and the test set has the rest.

oxboys.train.rep <- subset(oxboys, year < 7)
oxboys.test.rep <- subset(oxboys, year >= 7)

#=====================================================================

#GLOBAL MODEL

global.fit.rep = lm(height~year, data=oxboys.train.rep)

global_pred_rep = predict(global.fit.rep, newdata=oxboys.test.rep)

global_MSE_rep = mean((oxboys.test.rep$height - global_pred_rep)^2)
global_MSE_rep

## [1] 80.06475

#======================================================================

#LOCAL MODEL
for(key in unique(oxboys.train.rep$id)){
               oxboys.train.rep.ind <- subset(oxboys.train.rep, id==key)
               oxboys.test.rep.ind <- subset(oxboys.test.rep, id==key)
               # build a linear model with oxboys.train.ind
               local.fit.rep = lm(height~year, data=oxboys.train.rep.ind)
               # predict on oxboys.test.ind, and measure MSE
               local_pred_rep = predict(local.fit.rep, newdata=oxboys.test.rep.ind)

}
local_MSE_rep = mean((oxboys.test.rep.ind$height - local_pred_rep)^2)
local_MSE_rep

## [1] 0.5588227

#=====================================================================

#MULTI-LEVEL MODEL

mlm.obj.rep = lme(height~year, data=oxboys.train.rep, random=list(id=pdDiag(~year))) 
mlm_pred_rep = predict(mlm.obj.rep, newdata=oxboys.test.rep, level=1)
# calculate MSE
mlm_MSE_rep = mean((oxboys.test.rep$height - mlm_pred_rep)^2)
mlm_MSE_rep

## [1] 2.720906

E.) Discuss the results from C.) and D.). When does the multilevel perform better, and when does it not?

As discussed in class, the use of multi-level models is most appropriate when there is not enough data for the lowest level models and higher level models are too coarse OR when we want to get similar results for individuals within a group. In this case, we are operating on one level and we want to get similar results for individuals within a group (male Oxford students). The MLM achieves the lowest overall MSE in part D compared to part C. However, in part D the local model yields the best results whereas in part C the MLM yields the best results. The difference between part C and part D is the amount of data included in the training set, as you can see:

#Part C
dim(oxboys.train)#train

## [1] 52  3

dim(oxboys.test)#test

## [1] 182   3

#Part D
dim(oxboys.train.rep)#train

## [1] 156   3

dim(oxboys.test.rep)#test

## [1] 78  3

F.) Briefly state what do β00 and β10 mean in this multilevel model.

B00 represents the variability in the intercept term for each group (i.e., student ID). B10 represents the variability in the slope term for each group (again - student ID).