Statement of work

By including this statement, you agree that the work is all your own.

The work contained in this project is that of the authors and where material from other sources has been incorporated full acknowledgement has been given.

Print Name: Duncan Orr

Date: 09/10/2024


Inputting the data required for the question

load("Boston.RData")

Question 1

a)

Run code that will give you the result of a leaps and bounds regression on the housing data. You should look at the 5 best models of each size and select the model based on Mallows 𝐶𝑝.

require(leaps)
## Loading required package: leaps
## Warning: package 'leaps' was built under R version 4.3.3
bosthous.model.leaps <- leaps(x=housing[,1:13], 
                           y=housing$MEDV,
                           nbest=5,
                           method="Cp", 
                           names=colnames(housing[,1:13]))

bosthous.model.leaps 
## $which
##     CRIM    ZN INDUS  CHAS   NOX    RM   AGE   DIS   RAD   TAX PTRATIO     B
## 1  FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE   FALSE FALSE
## 1  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE
## 1  FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE
## 1  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE   FALSE FALSE
## 1  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    TRUE FALSE
## 2  FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE   FALSE FALSE
## 2  FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE   FALSE FALSE
## 2   TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE   FALSE FALSE
## 2  FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE    TRUE FALSE
## 2  FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE   FALSE  TRUE
## 3  FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE    TRUE  TRUE
## 3  FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE    TRUE FALSE
## 3  FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE    TRUE FALSE
## 3  FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE   FALSE FALSE
## 3  FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE   FALSE  TRUE
## 4  FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE    TRUE  TRUE
## 4  FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE FALSE
## 4  FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE    TRUE  TRUE
## 4   TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE    TRUE  TRUE
## 4  FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE    TRUE  TRUE
## 5  FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE  TRUE
## 5   TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE FALSE
## 5  FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE FALSE
## 5  FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE FALSE
## 5  FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 6  FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 6   TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE  TRUE
## 6   TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE FALSE
## 6  FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE  TRUE
## 6   TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE FALSE
## 7   TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 7  FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 7   TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE FALSE
## 7   TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE  TRUE
## 7   TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE  TRUE
## 8   TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 8   TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 8   TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 8   TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 8  FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 9   TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 9   TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 9   TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 9   TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE    TRUE  TRUE
## 9   TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE    TRUE  TRUE
## 10  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 10  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    TRUE  TRUE
## 10  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 10  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE    TRUE  TRUE
## 10  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    TRUE  TRUE
## 11  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    TRUE  TRUE
## 11  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE    TRUE  TRUE
## 11  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    TRUE  TRUE
## 11  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE    TRUE  TRUE
## 11  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    TRUE  TRUE
## 12  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    TRUE  TRUE
## 12  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    TRUE  TRUE
## 12  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE    TRUE  TRUE
## 12  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE    TRUE  TRUE
## 12  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    TRUE  TRUE
## 13  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    TRUE  TRUE
##    LSTAT
## 1  FALSE
## 1   TRUE
## 1  FALSE
## 1  FALSE
## 1  FALSE
## 2  FALSE
## 2  FALSE
## 2  FALSE
## 2  FALSE
## 2  FALSE
## 3  FALSE
## 3  FALSE
## 3  FALSE
## 3   TRUE
## 3  FALSE
## 4  FALSE
## 4  FALSE
## 4  FALSE
## 4  FALSE
## 4  FALSE
## 5  FALSE
## 5  FALSE
## 5   TRUE
## 5  FALSE
## 5  FALSE
## 6  FALSE
## 6  FALSE
## 6  FALSE
## 6  FALSE
## 6  FALSE
## 7  FALSE
## 7  FALSE
## 7  FALSE
## 7  FALSE
## 7   TRUE
## 8  FALSE
## 8  FALSE
## 8   TRUE
## 8  FALSE
## 8   TRUE
## 9  FALSE
## 9   TRUE
## 9  FALSE
## 9  FALSE
## 9  FALSE
## 10  TRUE
## 10 FALSE
## 10 FALSE
## 10 FALSE
## 10 FALSE
## 11 FALSE
## 11  TRUE
## 11  TRUE
## 11  TRUE
## 11  TRUE
## 12  TRUE
## 12 FALSE
## 12  TRUE
## 12  TRUE
## 12  TRUE
## 13  TRUE
## 
## $label
##  [1] "(Intercept)" "CRIM"        "ZN"          "INDUS"       "CHAS"       
##  [6] "NOX"         "RM"          "AGE"         "DIS"         "RAD"        
## [11] "TAX"         "PTRATIO"     "B"           "LSTAT"      
## 
## $size
##  [1]  2  2  2  2  2  3  3  3  3  3  4  4  4  4  4  5  5  5  5  5  6  6  6  6  6
## [26]  7  7  7  7  7  8  8  8  8  8  9  9  9  9  9 10 10 10 10 10 11 11 11 11 11
## [51] 12 12 12 12 12 13 13 13 13 13 14
## 
## $Cp
##  [1] 228.64491 399.11526 529.99920 535.05838 548.11092 128.78215 151.22696
##  [8] 154.21122 155.68302 156.96410 102.56165 103.69949 104.66813 108.09313
## [15] 108.22634  79.71697  80.86871  81.99068  82.37755  83.37585  57.04661
## [22]  58.04088  67.65482  67.83617  68.11933  43.13627  44.40523  45.31468
## [29]  48.27737  49.15914  30.91424  33.43166  35.46675  37.00343  39.52897
## [36]  22.62726  28.17101  29.34314  30.63885  30.73289  19.66987  21.56802
## [43]  21.77557  23.54966  23.94674  18.10040  18.64575  18.78493  19.94087
## [50]  20.06437  13.42030  17.48654  18.00801  18.16999  18.38865  12.18298
## [57]  15.14932  18.65005  18.78109  19.47528  14.00000

There are 13 possible variables in the data set.

b)

Produce a plot the shows how Mallows 𝐶𝑝changes with the model size. Interpret this plot in context.

plot(bosthous.model.leaps$size, bosthous.model.leaps$Cp, 
     pch=16, cex=1.2) 

From this plot we can see that as Mallow’s Cp decreases, the size increases. Since we are only interested in models where the Cp is approximately p+1, we can filter our plot so we only look at points where the Cp is between 0 & 20.

Change the limit on the Y axis as we are only interested in values within p+1.

plot(bosthous.model.leaps$size, bosthous.model.leaps$Cp, 
     pch=16, cex=1.2, ylim = c(0,20)) 

By filtering my results so i can only see points on the plot where the y is less than 20, we can see that the best models have a size between 10 & 14.

c)

Suggest the best model which would be appropriate for this analysis, and fit this model using the lm function. Justify your response - you may want to consider using the which command in R to help you narrow down this choice.

mallow_cp <- sort(bosthous.model.leaps$Cp)
mallow_cp 
##  [1]  12.18298  13.42030  14.00000  15.14932  17.48654  18.00801  18.10040
##  [8]  18.16999  18.38865  18.64575  18.65005  18.78109  18.78493  19.47528
## [15]  19.66987  19.94087  20.06437  21.56802  21.77557  22.62726  23.54966
## [22]  23.94674  28.17101  29.34314  30.63885  30.73289  30.91424  33.43166
## [29]  35.46675  37.00343  39.52897  43.13627  44.40523  45.31468  48.27737
## [36]  49.15914  57.04661  58.04088  67.65482  67.83617  68.11933  79.71697
## [43]  80.86871  81.99068  82.37755  83.37585 102.56165 103.69949 104.66813
## [50] 108.09313 108.22634 128.78215 151.22696 154.21122 155.68302 156.96410
## [57] 228.64491 399.11526 529.99920 535.05838 548.11092

Build a data frame

bosthous.leaps.data <- data.frame(size=bosthous.model.leaps$size, 
                         bosthous.model.leaps$Cp, 
                         bosthous.model.leaps$which, 
                         row.names = NULL)
bosthous.leaps.data 

Check using best fit

best.model <- which.min(bosthous.model.leaps$Cp)
bosthous.model.leaps$which[best.model,]
##    CRIM      ZN   INDUS    CHAS     NOX      RM     AGE     DIS     RAD     TAX 
##    TRUE    TRUE   FALSE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE 
## PTRATIO       B   LSTAT 
##    TRUE    TRUE    TRUE

Option 1:

Our best fitting model is one with Mallow’s Cp 12.18298, as it has p+1=13 which is the closest approximation we have. In the model with Mallow’s Cp of 12.18298 we should fit CRIM, ZN, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B and LSTAT.

Option 2:

We could also choose the model with Mallow’s Cp 13.42030. Although this model doesn’t have as close a fit as our first one it has one less independent variable meaning it could be easier to work with. We should fit CRIM, ZN, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO and B.

I will however choose Option 1 as my model, as what we can see from this data is it requires us to use a lot of independent variables in our model to get an accurate fit.

Creating a linear model and plotting the results.

lm.mod.bosthous <- lm(MEDV ~ CRIM + ZN + CHAS + NOX + RM + AGE + DIS + RAD + 
                        TAX + PTRATIO + B + LSTAT, data = housing)

par(mfrow=c(2,2))
plot(lm.mod.bosthous)

Question 2

Calculate and comment on the Mallows 𝐶𝑝statistics for the four models taking the full model to be the one with four independent variables. (Reference original worksheet for further details of question)

Mallow’s 𝐶𝑝 Equation

\[\begin{equation}\label{eq:MallowsCp} Mallow's 𝐶𝑝= (((n-p-1) \hat{\sigma_p^2})/(\hat{\sigma_q^2})) + 2(p+1) - n \end{equation}\]

Where n is the number of patients in the study, p is the number of independent variables in each model, \(\hat{\sigma_p^2}\) is the residuals variance from the small model, and \(\hat{\sigma_q^2}\) is the residuals variance from the full model.

Incubation 1 with 4 independent variables

For this calculation we had:

  • n=10

  • p=4

  • \(\hat{\sigma_p^2}\) = 2.735

  • \(\hat{\sigma_q^2}\) = 2.735

This gave us a Mallow’s 𝐶𝑝 of 5.

Incubation 2 with 3 independent variables

For this calculation we had:

  • n=10

  • p=3

  • \(\hat{\sigma_p^2}\) = 2.936

  • \(\hat{\sigma_q^2}\) = 2.735

This gave us a Mallow’s 𝐶𝑝 of 4.9143.

Incubation 3 with 2 independent variables

For this calculation we had:

  • n=10

  • p=2

  • \(\hat{\sigma_p^2}\) = 2.719

  • \(\hat{\sigma_q^2}\) = 2.735

This gave us a Mallow’s 𝐶𝑝 of 2.9183.

Incubation 4 with 1 independent variable

For this calculation we had:

  • n=10

  • p=1

  • \(\hat{\sigma_p^2}\) = 5.546

  • \(\hat{\sigma_q^2}\) = 2.735

This gave us a Mallow’s 𝐶𝑝 of 26.8954.

From these results, the best fit to the full model would either be Incubation 1, 2 or 3, as they all have Mallow’s 𝐶𝑝 less than or equal to 5. Arguably however, the best model would be Incubation 2 as it has the lowest Mallow’s 𝐶𝑝.