By including this statement, you agree that the work is all your own.
The work contained in this project is that of the authors and where material from other sources has been incorporated full acknowledgement has been given.
Print Name: Duncan Orr
Date: 09/10/2024
load("Boston.RData")
Run code that will give you the result of a leaps and bounds regression on the housing data. You should look at the 5 best models of each size and select the model based on Mallows 𝐶𝑝.
require(leaps)
## Loading required package: leaps
## Warning: package 'leaps' was built under R version 4.3.3
bosthous.model.leaps <- leaps(x=housing[,1:13],
y=housing$MEDV,
nbest=5,
method="Cp",
names=colnames(housing[,1:13]))
bosthous.model.leaps
## $which
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B
## 1 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## 3 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## 3 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## 3 FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
## 4 FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## 4 FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
## 4 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## 4 FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
## 5 FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## 5 TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
## 5 FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
## 5 FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 6 FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 6 TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## 6 TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## 6 FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## 6 TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
## 7 TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 7 FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 7 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## 7 TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## 7 TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## 8 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 8 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 8 TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 8 TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 8 FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 9 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 9 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 9 TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 9 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## 9 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 10 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 10 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 10 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 10 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 10 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 11 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 11 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## 11 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 11 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 11 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 12 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 12 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 12 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## 12 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 12 TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 13 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## LSTAT
## 1 FALSE
## 1 TRUE
## 1 FALSE
## 1 FALSE
## 1 FALSE
## 2 FALSE
## 2 FALSE
## 2 FALSE
## 2 FALSE
## 2 FALSE
## 3 FALSE
## 3 FALSE
## 3 FALSE
## 3 TRUE
## 3 FALSE
## 4 FALSE
## 4 FALSE
## 4 FALSE
## 4 FALSE
## 4 FALSE
## 5 FALSE
## 5 FALSE
## 5 TRUE
## 5 FALSE
## 5 FALSE
## 6 FALSE
## 6 FALSE
## 6 FALSE
## 6 FALSE
## 6 FALSE
## 7 FALSE
## 7 FALSE
## 7 FALSE
## 7 FALSE
## 7 TRUE
## 8 FALSE
## 8 FALSE
## 8 TRUE
## 8 FALSE
## 8 TRUE
## 9 FALSE
## 9 TRUE
## 9 FALSE
## 9 FALSE
## 9 FALSE
## 10 TRUE
## 10 FALSE
## 10 FALSE
## 10 FALSE
## 10 FALSE
## 11 FALSE
## 11 TRUE
## 11 TRUE
## 11 TRUE
## 11 TRUE
## 12 TRUE
## 12 FALSE
## 12 TRUE
## 12 TRUE
## 12 TRUE
## 13 TRUE
##
## $label
## [1] "(Intercept)" "CRIM" "ZN" "INDUS" "CHAS"
## [6] "NOX" "RM" "AGE" "DIS" "RAD"
## [11] "TAX" "PTRATIO" "B" "LSTAT"
##
## $size
## [1] 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6
## [26] 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10 11 11 11 11 11
## [51] 12 12 12 12 12 13 13 13 13 13 14
##
## $Cp
## [1] 228.64491 399.11526 529.99920 535.05838 548.11092 128.78215 151.22696
## [8] 154.21122 155.68302 156.96410 102.56165 103.69949 104.66813 108.09313
## [15] 108.22634 79.71697 80.86871 81.99068 82.37755 83.37585 57.04661
## [22] 58.04088 67.65482 67.83617 68.11933 43.13627 44.40523 45.31468
## [29] 48.27737 49.15914 30.91424 33.43166 35.46675 37.00343 39.52897
## [36] 22.62726 28.17101 29.34314 30.63885 30.73289 19.66987 21.56802
## [43] 21.77557 23.54966 23.94674 18.10040 18.64575 18.78493 19.94087
## [50] 20.06437 13.42030 17.48654 18.00801 18.16999 18.38865 12.18298
## [57] 15.14932 18.65005 18.78109 19.47528 14.00000
There are 13 possible variables in the data set.
Produce a plot the shows how Mallows 𝐶𝑝changes with the model size. Interpret this plot in context.
plot(bosthous.model.leaps$size, bosthous.model.leaps$Cp,
pch=16, cex=1.2)
From this plot we can see that as Mallow’s Cp decreases, the size increases. Since we are only interested in models where the Cp is approximately p+1, we can filter our plot so we only look at points where the Cp is between 0 & 20.
plot(bosthous.model.leaps$size, bosthous.model.leaps$Cp,
pch=16, cex=1.2, ylim = c(0,20))
By filtering my results so i can only see points on the plot where the y is less than 20, we can see that the best models have a size between 10 & 14.
Suggest the best model which would be appropriate for this analysis, and fit this model using the lm function. Justify your response - you may want to consider using the which command in R to help you narrow down this choice.
mallow_cp <- sort(bosthous.model.leaps$Cp)
mallow_cp
## [1] 12.18298 13.42030 14.00000 15.14932 17.48654 18.00801 18.10040
## [8] 18.16999 18.38865 18.64575 18.65005 18.78109 18.78493 19.47528
## [15] 19.66987 19.94087 20.06437 21.56802 21.77557 22.62726 23.54966
## [22] 23.94674 28.17101 29.34314 30.63885 30.73289 30.91424 33.43166
## [29] 35.46675 37.00343 39.52897 43.13627 44.40523 45.31468 48.27737
## [36] 49.15914 57.04661 58.04088 67.65482 67.83617 68.11933 79.71697
## [43] 80.86871 81.99068 82.37755 83.37585 102.56165 103.69949 104.66813
## [50] 108.09313 108.22634 128.78215 151.22696 154.21122 155.68302 156.96410
## [57] 228.64491 399.11526 529.99920 535.05838 548.11092
bosthous.leaps.data <- data.frame(size=bosthous.model.leaps$size,
bosthous.model.leaps$Cp,
bosthous.model.leaps$which,
row.names = NULL)
bosthous.leaps.data
best.model <- which.min(bosthous.model.leaps$Cp)
bosthous.model.leaps$which[best.model,]
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX
## TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## PTRATIO B LSTAT
## TRUE TRUE TRUE
Our best fitting model is one with Mallow’s Cp 12.18298, as it has p+1=13 which is the closest approximation we have. In the model with Mallow’s Cp of 12.18298 we should fit CRIM, ZN, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B and LSTAT.
We could also choose the model with Mallow’s Cp 13.42030. Although this model doesn’t have as close a fit as our first one it has one less independent variable meaning it could be easier to work with. We should fit CRIM, ZN, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO and B.
I will however choose Option 1 as my model, as what we can see from this data is it requires us to use a lot of independent variables in our model to get an accurate fit.
lm.mod.bosthous <- lm(MEDV ~ CRIM + ZN + CHAS + NOX + RM + AGE + DIS + RAD +
TAX + PTRATIO + B + LSTAT, data = housing)
par(mfrow=c(2,2))
plot(lm.mod.bosthous)
Calculate and comment on the Mallows 𝐶𝑝statistics for the four models taking the full model to be the one with four independent variables. (Reference original worksheet for further details of question)
\[\begin{equation}\label{eq:MallowsCp} Mallow's 𝐶𝑝= (((n-p-1) \hat{\sigma_p^2})/(\hat{\sigma_q^2})) + 2(p+1) - n \end{equation}\]
Where n is the number of patients in the study, p is the number of independent variables in each model, \(\hat{\sigma_p^2}\) is the residuals variance from the small model, and \(\hat{\sigma_q^2}\) is the residuals variance from the full model.
For this calculation we had:
n=10
p=4
\(\hat{\sigma_p^2}\) = 2.735
\(\hat{\sigma_q^2}\) = 2.735
This gave us a Mallow’s 𝐶𝑝 of 5.
For this calculation we had:
n=10
p=3
\(\hat{\sigma_p^2}\) = 2.936
\(\hat{\sigma_q^2}\) = 2.735
This gave us a Mallow’s 𝐶𝑝 of 4.9143.
For this calculation we had:
n=10
p=2
\(\hat{\sigma_p^2}\) = 2.719
\(\hat{\sigma_q^2}\) = 2.735
This gave us a Mallow’s 𝐶𝑝 of 2.9183.
For this calculation we had:
n=10
p=1
\(\hat{\sigma_p^2}\) = 5.546
\(\hat{\sigma_q^2}\) = 2.735
This gave us a Mallow’s 𝐶𝑝 of 26.8954.
From these results, the best fit to the full model would either be Incubation 1, 2 or 3, as they all have Mallow’s 𝐶𝑝 less than or equal to 5. Arguably however, the best model would be Incubation 2 as it has the lowest Mallow’s 𝐶𝑝.