Psych 250C - Problem Set 6

# Load Dataset
library(MASS) 
data(UScereal)
UScereal$mfr.num <- as.numeric(UScereal$mfr)
UScereal <- UScereal

Question 1

Fit a “full model” predicting calories using all 8 numeric predictors in the data set (do not use mfr, mfr.num, or vitamins). Provide the code and output a summary which includes coefficient estimates, ANOVA table (SSregression, SSerror, df, F, etc), and R2 results.

# Fit the full model predicting calories using all 8 numeric predictors
full_model <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)

model_summary <- summary(full_model)

print(model_summary)

## 
## Call:
## lm(formula = calories ~ protein + fat + sodium + fibre + carbo + 
##     sugars + shelf + potassium, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.917  -4.075   1.079   4.415  15.152 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.15905    3.74046  -5.657 5.46e-07 ***
## protein       4.87257    0.97909   4.977 6.51e-06 ***
## fat           9.27406    0.75034  12.360  < 2e-16 ***
## sodium        0.01207    0.01002   1.204 0.233676    
## fibre         2.80820    0.71883   3.907 0.000254 ***
## carbo         4.86987    0.17491  27.842  < 2e-16 ***
## sugars        4.54340    0.21068  21.565  < 2e-16 ***
## shelf         0.56679    1.39782   0.405 0.686668    
## potassium    -0.11594    0.02718  -4.265 7.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.951 on 56 degrees of freedom
## Multiple R-squared:  0.9858, Adjusted R-squared:  0.9838 
## F-statistic:   486 on 8 and 56 DF,  p-value: < 2.2e-16

# Calculate ANOVA table
anova_results <- anova(full_model)

print(anova_results)

## Analysis of Variance Table
## 
## Response: calories
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## protein    1 124262  124262 1965.7961 < 2.2e-16 ***
## fat        1  26971   26971  426.6799 < 2.2e-16 ***
## sodium     1   4966    4966   78.5631 2.999e-12 ***
## fibre      1  17242   17242  272.7643 < 2.2e-16 ***
## carbo      1  40981   40981  648.3188 < 2.2e-16 ***
## sugars     1  30170   30170  477.2777 < 2.2e-16 ***
## shelf      1     14      14    0.2196    0.6411    
## potassium  1   1150    1150   18.1925 7.757e-05 ***
## Residuals 56   3540      63                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Extract R² from the model summary
r_squared <- model_summary$r.squared
cat("R²: ", r_squared, "\n")

## R²:  0.9858005

One can see here that I have fitted a full model predicting calories using all 8 numeric predictors in the data set. We find that given our \(R^2\) value being 0.9858, this means about 98.58% of the variance in calories is explained by the model. Moreover, our F statistic for the model, F(8, 56) = 486, p <0.001, shows the overall model is highly significant, indicating that most of the numeric predictors significantly contribute to predicting the calorie content of the cereals, with sugars and fat having the largest positive impacts.

Question 2

Using the model fit in #1, estimate Shrunken-R using the Browne Equation (Eq 7.1 in RALM), double cross validation, and leave-one-out cross validation.

2.1) Browne’s Equation:

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

# Number of observations
n <- nrow(UScereal)

# Number of predictors
p <- length(full_model$coefficients) - 1  # Subtract 1 for the intercept

# Calculate shrunken R^2 using Browne's equation
R2 <- summary(full_model)$r.squared
R2_shrunken_browne <- 1 - ((1 - R2) * (n - 1) / (n - p - 1))
cat("Shrunken R^2 using Browne's Equation: ", R2_shrunken_browne, "\n")

## Shrunken R^2 using Browne's Equation:  0.983772

Our shrunken \(R^2\) using Browne’s equation is 0.983772. I did so through the hardcode method.

Our value here is that this value adjusts the \(R^2\) to account for the number of predictors and the sample size, giving you a more realistic measure of the model’s explanatory power.

2.2) Double Cross-Validation:

library(caret)

# Double cross-validation
set.seed(123) # For reproducibility
double_cv <- train(
  calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium,
  data = UScereal,
  method = "lm",
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 2)
)
R2_double_cv <- double_cv$results$Rsquared
cat("Shrunken-R² (Double Cross-Validation): ", R2_double_cv, "\n")

## Shrunken-R² (Double Cross-Validation):  0.9742866

library(boot)

## 
## Attaching package: 'boot'

## The following object is masked from 'package:lattice':
## 
##     melanoma

# Double Cross-Validation
set.seed(123)  # For reproducibility
indices <- sample(1:n, size = n, replace = FALSE)
N <- nrow(UScereal)
half <- floor(n / 2)
rows <- sample.int(N, N/2)
train1 <- UScereal[indices[1:half], ]
train2 <- UScereal[indices[(half+1):n], ]

# Model on first half
train1a <- UScereal[rows,]
test1a <- UScereal[-rows,]
model1 <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = train1)
pred1 <- predict(model1, newdata = test1a)
SSres1 <- sum((pred1 - train2$calories)^2)
SStot1 <- sum((train2$calories - mean(train2$calories))^2)
shrunkenRCV1 <- cor(pred1, test1a$calories)^2
R2_1 <- 1 - (SSres1 / SStot1)


# Model on second half
train2a <- UScereal[-rows,]
test2a <- UScereal[rows,]
model2 <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = train2)
pred2 <- predict(model2, newdata = test2a)
SSres2 <- sum((pred2 - train1$calories)^2)
SStot2 <- sum((train1$calories - mean(train1$calories))^2)
shrunkenRCV2 <- cor(pred2, test2a$calories)^2
R2_2 <- 1 - (SSres2 / SStot2)

# Average R² for Double Cross-Validation
R2_double_cv <- (R2_1 + R2_2) / 2
shrunkenRCV_double <- mean(c(shrunkenRCV1, shrunkenRCV2))
cat("Shrunken-R² (Double Cross-Validation): ", R2_double_cv, "\n")

## Shrunken-R² (Double Cross-Validation):  -1.424213

cat("Shrunken-R² (Double Cross-Validation): ", shrunkenRCV_double, "\n")

## Shrunken-R² (Double Cross-Validation):  0.9771359

I conducted two methods of determining the double-cross shrunken \(R^2\). The first was with the other was with the trainControl() function from the caret library in R. The second was hardcoded. We find that for the caret library , our double-cross validation \(R^2\) is 0.9742866. For our hardcoded, the double-cross validation \(R^2\) is 0.9580511.

Our value here is that this value is an average \(R^2\) which comes from multiple iterations of cross-validation. It provides a reliable estimate of the model’s predictive power on unseen data by repeating the validation process.

2.3) Leave-One-Out Cross-Validation:

library(caret)

# Leave-one-out cross-validation
loocv <- train(
  calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium,
  data = UScereal,
  method = "lm",
  trControl = trainControl(method = "LOOCV")
)

print(loocv)

## Linear Regression 
## 
## 65 samples
##  8 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 64, 64, 64, 64, 64, 64, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   11.01019  0.9702071  7.412448
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

R2_loocv <- loocv$results$Rsquared
cat("Shrunken-R² (LOOCV):", R2_loocv, "\n")

## Shrunken-R² (LOOCV): 0.9702071

library(boot)

# Leave-One-Out Cross-Validation (LOOCV)
loocv_model <- glm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
loocv_results <- cv.glm(UScereal, loocv_model, K = n)
R2_loocv <- 1 - (loocv_results$delta[1] / var(UScereal$calories) * (n - 1) / n)
cat("Shrunken-R² (LOOCV): ", R2_loocv, "\n")

## Shrunken-R² (LOOCV):  0.9693577

library(lmf)

# Fit the initial model to extract leverage and residuals
model_all <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)

# Extract leverage values
UScereal$leverage <- lm.extract(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)$leverage

# Extract residuals from the full model
UScereal$resall <- residuals(model_all)

# Calculate leave-one-out residuals
UScereal$resloo <- UScereal$resall / (1 - UScereal$leverage)

# Calculate leave-one-out predicted values
UScereal$Ypredloo <- UScereal$calories - UScereal$resloo

# Calculate LOOCV R²
R2_loocv <- cor(UScereal$Ypredloo, UScereal$calories)^2
cat("Shrunken-R² (LOOCV): ", R2_loocv, "\n")

## Shrunken-R² (LOOCV):  0.9702071

I conducted three methods of determining the leave-one out shrunken \(R^2\). The first was with the other was with the trainControl() function from the caret library in R. The second was with the boot library in R. And the last was with the lmf library. We find that for the caret library , our double-cross validation \(R^2\) is 0.9702071. For our boot library, the double-cross validation \(R^2\) is 0.9693577. For our lmf library, the double-cross validation \(R^2\) is 0.9702071, the same as our caret library.

This value is obtained by training the model on all but one observation and testing on the excluded observation, repeated for each data point. It is especially useful for small datasets to assess the model’s generalizability.

Question 3

Conduct forward, backwards, stepwise, and all-subsets regression using the same predictors and outcome as in #1. Examine whether the final model selected by these four methods is the same or different.

3.1) Forward Model

# Forward Selection
forward_model <- step(lm(calories ~ 1, data = UScereal), 
                      direction = "forward", 
                      scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium)

## Start:  AIC=538.38
## calories ~ 1
## 
##             Df Sum of Sq    RSS    AIC
## + carbo      1    155083  94213 477.13
## + protein    1    124262 125034 495.53
## + fat        1     86831 162464 512.55
## + sodium     1     69672 179623 519.07
## + sugars     1     61156 188139 522.09
## + potassium  1     56626 192670 523.63
## + shelf      1     45313 203982 527.34
## + fibre      1     37572 211723 529.76
## <none>                   249295 538.38
## 
## Step:  AIC=477.13
## calories ~ carbo
## 
##             Df Sum of Sq   RSS    AIC
## + sugars     1     69482 24730 392.19
## + fat        1     51294 42919 428.02
## + protein    1     26810 67403 457.36
## + potassium  1     21613 72600 462.19
## + fibre      1     13523 80690 469.06
## + shelf      1     13051 81162 469.44
## + sodium     1     11503 82710 470.67
## <none>                   94213 477.13
## 
## Step:  AIC=392.19
## calories ~ carbo + sugars
## 
##             Df Sum of Sq     RSS    AIC
## + fat        1   15660.6  9069.8 328.99
## + protein    1   10324.1 14406.4 359.07
## + fibre      1    5610.6 19119.9 377.47
## + potassium  1    5415.6 19314.9 378.13
## + sodium     1    1768.7 22961.8 389.37
## + shelf      1    1139.9 23590.5 391.12
## <none>                   24730.5 392.19
## 
## Step:  AIC=328.99
## calories ~ carbo + sugars + fat
## 
##             Df Sum of Sq    RSS    AIC
## + protein    1    4357.9 4711.9 288.43
## + fibre      1    3322.7 5747.1 301.33
## + potassium  1    2529.6 6540.3 309.74
## + sodium     1     812.8 8257.0 324.89
## <none>                   9069.8 328.99
## + shelf      1     116.4 8953.4 330.15
## 
## Step:  AIC=288.43
## calories ~ carbo + sugars + fat + protein
## 
##             Df Sum of Sq    RSS    AIC
## + potassium  1   172.139 4539.8 288.01
## <none>                   4711.9 288.43
## + shelf      1    11.634 4700.3 290.26
## + sodium     1     4.929 4707.0 290.36
## + fibre      1     4.834 4707.1 290.36
## 
## Step:  AIC=288.01
## calories ~ carbo + sugars + fat + protein + potassium
## 
##          Df Sum of Sq    RSS    AIC
## + fibre   1    903.98 3635.8 275.57
## <none>                4539.8 288.01
## + sodium  1     35.09 4504.7 289.50
## + shelf   1      0.09 4539.7 290.00
## 
## Step:  AIC=275.57
## calories ~ carbo + sugars + fat + protein + potassium + fibre
## 
##          Df Sum of Sq    RSS    AIC
## <none>                3635.8 275.57
## + sodium  1    85.571 3550.3 276.02
## + shelf   1     4.339 3631.5 277.50

summary(forward_model)

## 
## Call:
## lm(formula = calories ~ carbo + sugars + fat + protein + potassium + 
##     fibre, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.003  -5.273   1.323   4.424  17.050 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -19.60795    3.15131  -6.222 5.85e-08 ***
## carbo         4.93735    0.16429  30.052  < 2e-16 ***
## sugars        4.56824    0.20790  21.973  < 2e-16 ***
## fat           9.31086    0.74057  12.573  < 2e-16 ***
## protein       4.84992    0.97164   4.991 5.79e-06 ***
## potassium    -0.10691    0.02586  -4.134 0.000116 ***
## fibre         2.69062    0.70853   3.797 0.000352 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.917 on 58 degrees of freedom
## Multiple R-squared:  0.9854, Adjusted R-squared:  0.9839 
## F-statistic: 653.1 on 6 and 58 DF,  p-value: < 2.2e-16

cat("Predictors in Final Model (Forward Selection): ", names(coef(forward_model))[-1], "\n")

## Predictors in Final Model (Forward Selection):  carbo sugars fat protein potassium fibre

# Fit the initial null model
model_none <- lm(calories ~ 1, data = UScereal)

# Forward Selection
add1(model_none, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")

# Iteratively add the variable with the highest F/lowest p-value
model_step <- update(model_none, ~ . + carbo)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")

model_step <- update(model_none, ~ . + carbo +sugars)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")

model_step <- update(model_none, ~ . + carbo + sugars + fat)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")

model_step <- update(model_none, ~ . + carbo + sugars + fat + protein)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")

final_model_forward <- model_step
cat("Predictors in Final Model (Forward Selection): ", names(coef(final_model_forward))[-1], "\n")

## Predictors in Final Model (Forward Selection):  carbo sugars fat protein

I completed this problem in two different ways. The first is looking at the AIC, or Akaike Information Criterion. This is a metric used to estimate the prediction error and the relative quality of statistical models for a specific dataset. When comparing a set of models for the data, AIC evaluates the quality of each model in relation to the others. Consequently, AIC serves as a tool for model selection. I have used this in the past to derive my model selection, and to test my models.

The second is the manner that was suggested by the course- the manual method. These two give different predictors in the final model for forward selection, namely the manual method gives, carbo, sugars, fat, protein. You can see each at the end of each output.

3.2) Backward Elimination

# Backward Elimination
backward_model <- step(lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal), 
                      direction = "backward",
                       scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium)

## Start:  AIC=277.83
## calories ~ protein + fat + sodium + fibre + carbo + sugars + 
##     shelf + potassium
## 
##             Df Sum of Sq   RSS    AIC
## - shelf      1        10  3550 276.03
## - sodium     1        92  3631 277.50
## <none>                    3540 277.83
## - fibre      1       965  4505 291.50
## - potassium  1      1150  4690 294.12
## - protein    1      1566  5105 299.64
## - fat        1      9657 13196 361.37
## - sugars     1     29397 32936 420.82
## - carbo      1     49000 52539 451.17
## 
## Step:  AIC=276.03
## calories ~ protein + fat + sodium + fibre + carbo + sugars + 
##     potassium
## 
##             Df Sum of Sq   RSS    AIC
## - sodium     1        86  3636 275.57
## <none>                    3550 276.03
## - fibre      1       954  4505 289.50
## - potassium  1      1153  4704 292.31
## - protein    1      1555  5105 297.64
## - fat        1      9916 13467 360.68
## - sugars     1     29967 33517 419.95
## - carbo      1     51524 55075 452.23
## 
## Step:  AIC=275.57
## calories ~ protein + fat + fibre + carbo + sugars + potassium
## 
##             Df Sum of Sq   RSS    AIC
## <none>                    3636 275.57
## - fibre      1       904  4540 288.01
## - potassium  1      1071  4707 290.36
## - protein    1      1562  5198 296.80
## - fat        1      9909 13545 359.06
## - sugars     1     30266 33902 418.69
## - carbo      1     56615 60251 456.07

summary(backward_model)

## 
## Call:
## lm(formula = calories ~ protein + fat + fibre + carbo + sugars + 
##     potassium, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.003  -5.273   1.323   4.424  17.050 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -19.60795    3.15131  -6.222 5.85e-08 ***
## protein       4.84992    0.97164   4.991 5.79e-06 ***
## fat           9.31086    0.74057  12.573  < 2e-16 ***
## fibre         2.69062    0.70853   3.797 0.000352 ***
## carbo         4.93735    0.16429  30.052  < 2e-16 ***
## sugars        4.56824    0.20790  21.973  < 2e-16 ***
## potassium    -0.10691    0.02586  -4.134 0.000116 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.917 on 58 degrees of freedom
## Multiple R-squared:  0.9854, Adjusted R-squared:  0.9839 
## F-statistic: 653.1 on 6 and 58 DF,  p-value: < 2.2e-16

cat("Predictors in Final Model (Backward Selection): ", names(coef(backward_model))[-1], "\n")

## Predictors in Final Model (Backward Selection):  protein fat fibre carbo sugars potassium

# Fit the initial full model
model_all <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)

# Backward Selection
drop1(model_all, test = "F")

# Iteratively drop the variable with the lowest F/highest p-value
model_step <- update(model_all, ~ . -shelf)
drop1(model_step, test = "F")

model_step <- update(model_all, ~ . -shelf -sodium)
drop1(model_step, test = "F")

final_model_backward <- model_step
cat("Predictors in Final Model (Backward Selection): ", names(coef(final_model_backward))[-1], "\n")

## Predictors in Final Model (Backward Selection):  protein fat fibre carbo sugars potassium

The second is the manner that was suggested by the course - the manual method. These two give the same predictors in the final model for backward selection, namely, protein, fat, fibre, carbo, sugars, potassium. You can see each at the end of each output.

3.3) Stepwise Selection

# Stepwise Selection
stepwise_model <- step(lm(calories ~ 1, data = UScereal), 
                       direction = "both", 
                       scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium)

## Start:  AIC=538.38
## calories ~ 1
## 
##             Df Sum of Sq    RSS    AIC
## + carbo      1    155083  94213 477.13
## + protein    1    124262 125034 495.53
## + fat        1     86831 162464 512.55
## + sodium     1     69672 179623 519.07
## + sugars     1     61156 188139 522.09
## + potassium  1     56626 192670 523.63
## + shelf      1     45313 203982 527.34
## + fibre      1     37572 211723 529.76
## <none>                   249295 538.38
## 
## Step:  AIC=477.13
## calories ~ carbo
## 
##             Df Sum of Sq    RSS    AIC
## + sugars     1     69482  24730 392.19
## + fat        1     51294  42919 428.02
## + protein    1     26810  67403 457.36
## + potassium  1     21613  72600 462.19
## + fibre      1     13523  80690 469.06
## + shelf      1     13051  81162 469.44
## + sodium     1     11503  82710 470.67
## <none>                    94213 477.13
## - carbo      1    155083 249295 538.38
## 
## Step:  AIC=392.19
## calories ~ carbo + sugars
## 
##             Df Sum of Sq    RSS    AIC
## + fat        1     15661   9070 328.99
## + protein    1     10324  14406 359.07
## + fibre      1      5611  19120 377.47
## + potassium  1      5416  19315 378.13
## + sodium     1      1769  22962 389.37
## + shelf      1      1140  23591 391.12
## <none>                    24730 392.19
## - sugars     1     69482  94213 477.13
## - carbo      1    163409 188139 522.09
## 
## Step:  AIC=328.99
## calories ~ carbo + sugars + fat
## 
##             Df Sum of Sq    RSS    AIC
## + protein    1      4358   4712 288.43
## + fibre      1      3323   5747 301.33
## + potassium  1      2530   6540 309.74
## + sodium     1       813   8257 324.89
## <none>                     9070 328.99
## + shelf      1       116   8953 330.15
## - fat        1     15661  24730 392.19
## - sugars     1     33849  42919 428.02
## - carbo      1    134563 143633 506.54
## 
## Step:  AIC=288.43
## calories ~ carbo + sugars + fat + protein
## 
##             Df Sum of Sq   RSS    AIC
## + potassium  1       172  4540 288.01
## <none>                    4712 288.43
## + shelf      1        12  4700 290.26
## + sodium     1         5  4707 290.36
## + fibre      1         5  4707 290.36
## - protein    1      4358  9070 328.99
## - fat        1      9694 14406 359.07
## - sugars     1     31092 35804 418.24
## - carbo      1     75664 80376 470.81
## 
## Step:  AIC=288.01
## calories ~ carbo + sugars + fat + protein + potassium
## 
##             Df Sum of Sq   RSS    AIC
## + fibre      1       904  3636 275.57
## <none>                    4540 288.01
## - potassium  1       172  4712 288.43
## + sodium     1        35  4505 289.50
## + shelf      1         0  4540 290.01
## - protein    1      2000  6540 309.74
## - fat        1      9063 13603 357.34
## - sugars     1     30725 35265 419.26
## - carbo      1     56223 60763 454.62
## 
## Step:  AIC=275.57
## calories ~ carbo + sugars + fat + protein + potassium + fibre
## 
##             Df Sum of Sq   RSS    AIC
## <none>                    3636 275.57
## + sodium     1        86  3550 276.03
## + shelf      1         4  3631 277.50
## - fibre      1       904  4540 288.01
## - potassium  1      1071  4707 290.36
## - protein    1      1562  5198 296.80
## - fat        1      9909 13545 359.06
## - sugars     1     30266 33902 418.69
## - carbo      1     56615 60251 456.07

summary(stepwise_model)

## 
## Call:
## lm(formula = calories ~ carbo + sugars + fat + protein + potassium + 
##     fibre, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.003  -5.273   1.323   4.424  17.050 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -19.60795    3.15131  -6.222 5.85e-08 ***
## carbo         4.93735    0.16429  30.052  < 2e-16 ***
## sugars        4.56824    0.20790  21.973  < 2e-16 ***
## fat           9.31086    0.74057  12.573  < 2e-16 ***
## protein       4.84992    0.97164   4.991 5.79e-06 ***
## potassium    -0.10691    0.02586  -4.134 0.000116 ***
## fibre         2.69062    0.70853   3.797 0.000352 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.917 on 58 degrees of freedom
## Multiple R-squared:  0.9854, Adjusted R-squared:  0.9839 
## F-statistic: 653.1 on 6 and 58 DF,  p-value: < 2.2e-16

cat("Predictors in Final Model (Stepwise Selection): ", names(coef(stepwise_model))[-1], "\n")

## Predictors in Final Model (Stepwise Selection):  carbo sugars fat protein potassium fibre

# Stepwise Selection
model_none <- lm(calories ~ 1, data = UScereal)

# Iteratively add and drop variables
# Initial addition
add1(model_none, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")

model_step <- update(model_none, ~ . + carbo)

# Check for dropping variables
drop1(model_step, test = "F")

# Add next variable
model_step <- update(model_step, ~ . + sugars)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")

drop1(model_step, test = "F")

# Add next variable
model_step <- update(model_step, ~ . + fat)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")

drop1(model_step, test = "F")

# Add next variable
model_step <- update(model_step, ~ . + protein)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")

drop1(model_step, test = "F")

# Add next variable
model_step <- update(model_step, ~ . + protein)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")

drop1(model_step, test = "F")

# Continue this process until neither adding nor dropping variables improves the model significantly
final_model_stepwise <- model_step

# Summary of the final model
summary(final_model_stepwise)

## 
## Call:
## lm(formula = calories ~ carbo + sugars + fat + protein, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.177  -4.693   1.419   4.939  24.758 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.7698     3.5127  -5.343 1.49e-06 ***
## carbo         4.9247     0.1587  31.040  < 2e-16 ***
## sugars        4.2107     0.2116  19.898  < 2e-16 ***
## fat           8.8589     0.7973  11.111 3.41e-16 ***
## protein       4.0506     0.5438   7.449 4.28e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.862 on 60 degrees of freedom
## Multiple R-squared:  0.9811, Adjusted R-squared:  0.9798 
## F-statistic: 778.6 on 4 and 60 DF,  p-value: < 2.2e-16

cat("Predictors in Final Model (Stepwise Selection): ", names(coef(final_model_stepwise))[-1], "\n")

## Predictors in Final Model (Stepwise Selection):  carbo sugars fat protein

The second is the manner that was suggested by the course - the manual method. These two give slightly different predictors in the final model for stepwise selection, namely the manual methods gives carbo, sugars, fat, and protein. You can see each at the end of each output.

3.4) All-Subsets Regression

library(leaps)

# All-Subsets Regression
all_subsets <- regsubsets(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)

summary(all_subsets)

## Subset selection object
## Call: regsubsets.formula(calories ~ protein + fat + sodium + fibre + 
##     carbo + sugars + shelf + potassium, data = UScereal)
## 8 Variables  (and intercept)
##           Forced in Forced out
## protein       FALSE      FALSE
## fat           FALSE      FALSE
## sodium        FALSE      FALSE
## fibre         FALSE      FALSE
## carbo         FALSE      FALSE
## sugars        FALSE      FALSE
## shelf         FALSE      FALSE
## potassium     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          protein fat sodium fibre carbo sugars shelf potassium
## 1  ( 1 ) " "     " " " "    " "   "*"   " "    " "   " "      
## 2  ( 1 ) " "     " " " "    " "   "*"   "*"    " "   " "      
## 3  ( 1 ) " "     "*" " "    " "   "*"   "*"    " "   " "      
## 4  ( 1 ) "*"     "*" " "    " "   "*"   "*"    " "   " "      
## 5  ( 1 ) "*"     "*" " "    " "   "*"   "*"    " "   "*"      
## 6  ( 1 ) "*"     "*" " "    "*"   "*"   "*"    " "   "*"      
## 7  ( 1 ) "*"     "*" "*"    "*"   "*"   "*"    " "   "*"      
## 8  ( 1 ) "*"     "*" "*"    "*"   "*"   "*"    "*"   "*"

summary_subsets <- summary(all_subsets)

# Find the model with the highest adjusted R²
max_adj_r2 <- which.max(summary_subsets$adjr2)
best_model <- summary_subsets$which[max_adj_r2, ]

# Extract the names of the predictors in the final model
predictors <- names(best_model)[best_model][2:length(names(best_model)[best_model])]
cat("Predictors in Final Model (All-Subsets Regression): ", predictors, "\n")

## Predictors in Final Model (All-Subsets Regression):  protein fat sodium fibre carbo sugars potassium

library(leaps)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Prepare data for all-subsets regression
datasmall <- UScereal %>% select(calories, protein, fat, sodium, fibre, carbo, sugars, shelf, potassium)

# All-Subsets Regression
subsetmodels <- regsubsets(calories ~ ., data = datasmall, nbest = 1)
summary_subsetmodels <- summary(subsetmodels)

# Print summary of all-subsets models
print(summary_subsetmodels)

## Subset selection object
## Call: regsubsets.formula(calories ~ ., data = datasmall, nbest = 1)
## 8 Variables  (and intercept)
##           Forced in Forced out
## protein       FALSE      FALSE
## fat           FALSE      FALSE
## sodium        FALSE      FALSE
## fibre         FALSE      FALSE
## carbo         FALSE      FALSE
## sugars        FALSE      FALSE
## shelf         FALSE      FALSE
## potassium     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          protein fat sodium fibre carbo sugars shelf potassium
## 1  ( 1 ) " "     " " " "    " "   "*"   " "    " "   " "      
## 2  ( 1 ) " "     " " " "    " "   "*"   "*"    " "   " "      
## 3  ( 1 ) " "     "*" " "    " "   "*"   "*"    " "   " "      
## 4  ( 1 ) "*"     "*" " "    " "   "*"   "*"    " "   " "      
## 5  ( 1 ) "*"     "*" " "    " "   "*"   "*"    " "   "*"      
## 6  ( 1 ) "*"     "*" " "    "*"   "*"   "*"    " "   "*"      
## 7  ( 1 ) "*"     "*" "*"    "*"   "*"   "*"    " "   "*"      
## 8  ( 1 ) "*"     "*" "*"    "*"   "*"   "*"    "*"   "*"

# Find the model with the highest adjusted R²
max_adj_r2 <- which.max(summary_subsetmodels$adjr2)
best_model <- summary_subsetmodels$which[max_adj_r2, ]

# Extract the names of the predictors in the final model
predictors <- names(best_model)[best_model][2:length(names(best_model)[best_model])]

# Fit the final model with the best predictors
final_model_all_subsets <- lm(as.formula(paste("calories ~", paste(predictors, collapse = " + "))), data = UScereal)

# Summary of the final model
summary(final_model_all_subsets)

## 
## Call:
## lm(formula = as.formula(paste("calories ~", paste(predictors, 
##     collapse = " + "))), data = UScereal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.6044  -4.8410   0.9581   4.6033  15.2115 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.398829   3.212851  -6.349 3.83e-08 ***
## protein       4.839791   0.968566   4.997 5.86e-06 ***
## fat           9.314509   0.738204  12.618  < 2e-16 ***
## sodium        0.011574   0.009875   1.172 0.246027    
## fibre         2.781245   0.710481   3.915 0.000244 ***
## carbo         4.884618   0.169831  28.762  < 2e-16 ***
## sugars        4.553708   0.207605  21.934  < 2e-16 ***
## potassium    -0.113706   0.026423  -4.303 6.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.892 on 57 degrees of freedom
## Multiple R-squared:  0.9858, Adjusted R-squared:  0.984 
## F-statistic: 563.6 on 7 and 57 DF,  p-value: < 2.2e-16

cat("Predictors in Final Model (All-Subsets Regression): ", predictors, "\n")

## Predictors in Final Model (All-Subsets Regression):  protein fat sodium fibre carbo sugars potassium

I’ve completed this type of analysis prior to this, and I am sharing both methods here. I have adjusted the code such that it is in line with what was provided in the course. As a result, we find that the predictors in the final model to be protein, fat, sodium, fibre, carbo, sugars, and potassium. We did this by selecting the model with the highest adjusted \(R^2\) value.

3.5) Are all final models the same or different? Explain why this might occur.

The final models here are all slightly different except for Backward Selection and All-Subsets Regression, which both resulted in the same set of predictors: protein, fat, sodium, fibre, carbo, sugars, potassium.

The differences are due to a set of important distinctions. Firstly, the selection criteria differs between each models. Forward Selection adds predictors based on the highest improvement in the model at each step, which can lead to including predictors that marginally improve the model. In contstrat, Backward Selection starts with all predictors and removes the least significant ones, focusing on the overall fit. Stepwise Selection combines forward and backward steps, which can lead to a more conservative model by balancing both addition and removal, while All-Subsets Regression evaluates all possible subsets and selects the best model based on the highest adjusted \(R^2\).

Secondly, each method handles stopping rules differently. Forward and Backward selection stop when no further significant improvements are found based on specific criteria (F-Value/p-value in our case). Stepwise selection is a combination of both forward and backward criteria, which can result in a more streamlined model. All-Subsets selection uses an exhaustive search and selects the model with the highest adjusted \(R^2\), potentially leading to a model with more predictors.

Thirdly, each method handles multicollinearity slightly differently. For forward selection, because predictors are added sequentially, forward selection can inadvertently add correlated predictors without explicitly checking for multicollinearity. If two predictors are highly correlated, the one added first may account for most of the shared variance, making the subsequent addition of the correlated predictor less impactful. Backward selection may be better at detecting multicollinearity because it starts with all predictors and evaluates the impact of removing each one. If removing a predictor significantly reduces multicollinearity, it might be favored for removal. However, it still doesn’t explicitly check for multicollinearity; it relies on the overall model fit. Stepwise selection has a mixed approach and may handle multicollinearity slightly better than forward selection because it also considers removal. However, like forward and backward selection, it does not explicitly check for multicollinearity. All-subsets regression implicitly handles multicollinearity better because it evaluates all possible models and selects the one with the best overall fit. If multicollinearity reduces how well each model performs, models with high multicollinearity are less likely to be selected. However, it still does not explicitly check for multicollinearity; it relies on overall model performance metrics like adjusted \(R^2\). Because of its different handling, it will change how the final models might be different, as we find in our case.

Question 4

Use the final model(s) from the variable selection methods and the full model (the model with all predictors) to generate a prediction interval for a company with the following scores on the predictor variables: protein = 2, fat = 0, sodium = 200, fibre = 4, carbo = 20, sugars = 8, shelf = 1, potassium = 100. (Note that if some of these predictors are not included in the model then the prediction interval will not change based on the values for those predictors.)

4.1) Generate prediction intervals for the above described case for each model. Feel free to delete any sections below which are redundant (i.e., variable selection methods choose the same model):

# Company Scores
company_scores <- data.frame(
  protein = 2,
  fat = 0,
  sodium = 200,
  fibre = 4,
  carbo = 20,
  sugars = 8,
  shelf = 1,
  potassium = 100
)

Full Model

# Full Model
full_model <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)

# Prediction Interval for Full Model
full_model_pred <- predict(full_model, newdata = company_scores, interval = "prediction")

# Confidence Interval for the Full Model
full_model_confidence <- predict(full_model, newdata = company_scores, interval = "confidence")

cat("Prediction Interval for Full Model:", full_model_pred, "\n")

## Prediction Interval for Full Model: 124.9493 108.3175 141.5811

cat("Confidence Interval for Full Model:", full_model_confidence, "\n")

## Confidence Interval for Full Model: 124.9493 120.1588 129.7398

For a company with predictor values: protein = 2, fat = 0, sodium = 200, fibre = 4, carbo = 20, sugars = 8, shelf = 1, and potassium = 100, the predicted calories is 124.95. The 95% prediction interval for the calories is [108.32, 141.58]. The 95% confidence interval is [120.16, 129.74].

Forward Model

# We have our two possible directions for our model.
# I provide both here

# Prediction Interval for Forward Model - AIC
forward_model_pred_AIC <- predict(forward_model, newdata = company_scores, interval = "prediction")

# Confidence Interval for the Forward Model - AIC
forward_model_confidence_AIC <- predict(forward_model, newdata = company_scores, interval = "confidence")

cat("Prediction Interval for Forward Model - AIC:\n")

## Prediction Interval for Forward Model - AIC:

print(forward_model_pred_AIC)

##        fit      lwr      upr
## 1 125.4564 109.1072 141.8056

cat("Confidence Interval for Forward Model - AIC:\n")

## Confidence Interval for Forward Model - AIC:

print(forward_model_confidence_AIC)

##        fit      lwr      upr
## 1 125.4564 121.4415 129.4713

# Prediction Interval for Forward Model - Manual
forward_model_pred_manual <- predict(final_model_forward, newdata = company_scores, interval = "prediction")

# Confidence Interval for the Forward Model - Manual
forward_model_confidence_manual <- predict(final_model_forward, newdata = company_scores, interval = "confidence")

cat("Prediction Interval for Forward Model - Manual:", forward_model_pred_manual, "\n")

## Prediction Interval for Forward Model - Manual: 121.5102 103.5117 139.5087

cat("Confidence Interval for Forward Model - Manual:", forward_model_confidence_manual, "\n")

## Confidence Interval for Forward Model - Manual: 121.5102 118.3919 124.6284

For a company with predictor values: protein = 2, fat = 0, sodium = 200, fibre = 4, carbo = 20, sugars = 8, shelf = 1, and potassium = 100, the predicted calories is 121.51. The 95% prediction interval for the calories is [103.51, 139.51]. The 95% confidence interval for the calories is [118.39, 124.63].

Backward Model

# We have our two possible directions for our model.
# I provide both here

# Prediction Interval for Backward Model - AIC
backward_model_pred_AIC <- predict(backward_model, newdata = company_scores, interval = "prediction")
# Confidence Interval for the Backward Model - AIC
backward_model_confidence_AIC <- predict(backward_model, newdata = company_scores, interval = "confidence")

cat("Prediction Interval for Backward Model - AIC:\n")

## Prediction Interval for Backward Model - AIC:

print(backward_model_pred_AIC)

##        fit      lwr      upr
## 1 125.4564 109.1072 141.8056

cat("Confidence Interval for Backward Model - AIC:\n")

## Confidence Interval for Backward Model - AIC:

print(backward_model_confidence_AIC)

##        fit      lwr      upr
## 1 125.4564 121.4415 129.4713

# Prediction Interval for Backward Model - Manual
backward_model_pred_manual <- predict(final_model_backward, newdata = company_scores, interval = "prediction")
# Confidence Interval for the Backward Model - Manual
backward_model_confidence_manual <- predict(final_model_backward, newdata = company_scores, interval = "confidence")

cat("Prediction Interval for Backward Model - Manual:\n")

## Prediction Interval for Backward Model - Manual:

print(backward_model_pred_manual)

##        fit      lwr      upr
## 1 125.4564 109.1072 141.8056

cat("Confidence Interval for Backward Model - Manual:\n")

## Confidence Interval for Backward Model - Manual:

print(backward_model_confidence_manual)

##        fit      lwr      upr
## 1 125.4564 121.4415 129.4713

For a company with predictor values: protein = 2, fat = 0, sodium = 200, fibre = 4, carbo = 20, sugars = 8, shelf = 1, and potassium = 100, the predicted calories is 125.46. The 95% prediction interval for the calories is [109.11, 141.81].

Stepwise Model

# We have our two possible directions for our model.
# I provide both here

# Prediction Interval for Stepwise Model - AIC
stepwise_model_pred_AIC <- predict(stepwise_model, newdata = company_scores, interval = "prediction")
# Confidence Interval for the Stepwise Model - AIC
stepwise_model_confidence_AIC <- predict(stepwise_model, newdata = company_scores, interval = "confidence")

cat("Prediction Interval for Stepwise Model - AIC:\n")

## Prediction Interval for Stepwise Model - AIC:

print(stepwise_model_pred_AIC)

##        fit      lwr      upr
## 1 125.4564 109.1072 141.8056

cat("Confidence Interval for Stepwise Model - AIC:\n")

## Confidence Interval for Stepwise Model - AIC:

print(stepwise_model_confidence_AIC)

##        fit      lwr      upr
## 1 125.4564 121.4415 129.4713

# Prediction Interval for Stepwise Model - Manual
stepwise_model_pred_manual <- predict(final_model_stepwise, newdata = company_scores, interval = "prediction")
# Confidence Interval for the Stepwise Model - Manual
stepwise_model_confidence_manual <- predict(final_model_stepwise, newdata = company_scores, interval = "confidence")

cat("Prediction Interval for Stepwise Model - Manual:\n")

## Prediction Interval for Stepwise Model - Manual:

print(stepwise_model_pred_manual)

##        fit      lwr      upr
## 1 121.5102 103.5117 139.5087

cat("Confidence Interval for Stepwise Model - Manual:\n")

## Confidence Interval for Stepwise Model - Manual:

print(stepwise_model_confidence_manual)

##        fit      lwr      upr
## 1 121.5102 118.3919 124.6284

All Subset Model

library (leaps)

# All-Subsets Regression
all_subsets <- regsubsets(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)

summary_subsets <- summary(all_subsets)

# Find the model with the highest adjusted R²
max_adj_r2 <- which.max(summary_subsets$adjr2)
best_model <- summary_subsets$which[max_adj_r2, ]

# Extract the names of the predictors in the final model
predictors <- names(best_model)[best_model][2:length(names(best_model)[best_model])]

all_subsets_formula <- as.formula(paste("calories ~", paste(predictors, collapse = " + ")))
all_subsets_model <- lm(all_subsets_formula, data = UScereal)

# Prediction Interval for All-Subsets Model
all_subsets_model_pred <- predict(all_subsets_model, newdata = company_scores, interval = "prediction")

# Confidence Interval for the All-Subsets Model
all_subsets_model_confidence <- predict(all_subsets_model, newdata = company_scores, interval = "confidence")

cat("Prediction Interval for All-Subsets Model:\n")

## Prediction Interval for All-Subsets Model:

print(all_subsets_model_pred)

##       fit      lwr      upr
## 1 125.472 109.1691 141.7749

cat("Confidence Interval for All-Subsets Model:\n")

## Confidence Interval for All-Subsets Model:

print(all_subsets_model_confidence)

##       fit      lwr      upr
## 1 125.472 121.4684 129.4756

For a company with predictor values: protein = 2, fat = 0, sodium = 200, fibre = 4, carbo = 20, sugars = 8, shelf = 1, and potassium = 100, the predicted calories is 125.47. The 95% prediction interval for the calories is [109.17, 141.78]. The 95% confidence interval is [121.47, 129.48].

4.2) Which of the models results in the narrowest prediction interval for this case? Compare the prediction interval from the full model to the models which have undergone variable selection, does it seem that the variable selection models are more precise (less variance)?

The narrowest prediction interval is the All-Subsets Regression Model, which has the narrowest prediction interval with a width of 32.6058. The Full Model has a prediction interval width of 33.2636. All variable selection models except the manual forward and manual step wise selection model resulted in narrower intervals than the Full Model. This suggests that these variable selection methods can lead to more precise (less variance) predictions. The narrower prediction intervals from the variable selection models show that these intervals may provide more reliable predictions by reducing the variance and focusing on the most significant predictors.

Question 5

Use the final model(s) from the variable selection methods to estimate shrunken-R2 using the Browne equation.

5.1) Calculate shrunken-R2 for each model. Feel free to delete any sections below which are redundant (i.e., variable selection methods chose the same model):

Full Model (for safe keeping and 5.2)

# Full Model
full_model <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
summary_full_model <- summary(full_model)
R2_full <- summary_full_model$r.squared
n_full <- nrow(UScereal)
p_full <- length(coef(full_model)) - 1

# Shrunken R2 Calculation
shrunken_R2_full <- 1 - ((1 - R2_full) * (n_full - 1) / (n_full - p_full - 1))
cat("Shrunken-R2 for Full Model: ", shrunken_R2_full, "\n")

## Shrunken-R2 for Full Model:  0.983772

Forward Selection

# Forward Model - AIC
summary_forward_model <- summary(forward_model)
R2_forward <- summary_forward_model$r.squared
n_forward <- nrow(UScereal)
p_forward <- length(coef(forward_model)) - 1

# Forward Model - Manual
summary_forward_model_manual <- summary(final_model_forward)
R2_forward_manual <- summary_forward_model_manual$r.squared
n_forward_manual <- nrow(UScereal)
p_forward_manual <- length(coef(final_model_forward)) - 1

# Shrunken R2 Calculation - AIC
shrunken_R2_forward <- 1 - ((1 - R2_forward) * (n_forward - 1) / (n_forward - p_forward - 1))
cat("Shrunken-R2 for Forward Model (AIC): ", shrunken_R2_forward, "\n")

## Shrunken-R2 for Forward Model (AIC):  0.9839069

# Shrunken R2 Calculation - Manual
shrunken_R2_forward_manual <- 1 - ((1 - R2_forward_manual) * (n_forward_manual - 1) / (n_forward_manual - p_forward_manual - 1))
cat("Shrunken-R2 for Forward Model (Manual): ", shrunken_R2_forward_manual, "\n")

## Shrunken-R2 for Forward Model (Manual):  0.9798389

Backward Selection

# Backward Model - AIC
summary_backward_model <- summary(backward_model)
R2_backward <- summary_backward_model$r.squared
n_backward <- nrow(UScereal)
p_backward <- length(coef(backward_model)) - 1

# Backward Model - Manual
summary_backward_model_manual <- summary(final_model_backward)
R2_backward_manual <- summary_backward_model_manual$r.squared
n_backward_manual <- nrow(UScereal)
p_backward_manual <- length(coef(final_model_backward)) - 1

# Shrunken R2 Calculation - AIC
shrunken_R2_backward <- 1 - ((1 - R2_backward) * (n_backward - 1) / (n_backward - p_backward - 1))
cat("Shrunken-R2 for Backward Model (AIC): ", shrunken_R2_backward, "\n")

## Shrunken-R2 for Backward Model (AIC):  0.9839069

# Shrunken R2 Calculation - Manual
shrunken_R2_backward_manual <- 1 - ((1 - R2_backward_manual) * (n_backward_manual - 1) / (n_backward_manual - p_backward_manual - 1))
cat("Shrunken-R2 for Backward Model (Manual): ", shrunken_R2_backward_manual, "\n")

## Shrunken-R2 for Backward Model (Manual):  0.9839069

Stepwise Selection Model

# Stepwise Model - AIC
summary_stepwise_model <- summary(stepwise_model)
R2_stepwise <- summary_stepwise_model$r.squared
n_stepwise <- nrow(UScereal)
p_stepwise <- length(coef(stepwise_model)) - 1

# Stepwise Model - Manual
summary_stepwise_model_manual <- summary(final_model_stepwise)
R2_stepwise_manual <- summary_stepwise_model_manual$r.squared
n_stepwise_manual <- nrow(UScereal)
p_stepwise_manual <- length(coef(final_model_stepwise)) - 1

# Shrunken R2 Calculation - AIC
shrunken_R2_stepwise <- 1 - ((1 - R2_stepwise) * (n_stepwise - 1) / (n_stepwise - p_stepwise - 1))
cat("Shrunken-R2 for Stepwise Model (AIC): ", shrunken_R2_stepwise, "\n")

## Shrunken-R2 for Stepwise Model (AIC):  0.9839069

# Shrunken R2 Calculation - Manual
shrunken_R2_stepwise_manual <- 1 - ((1 - R2_stepwise_manual) * (n_stepwise_manual - 1) / (n_stepwise_manual - p_stepwise_manual - 1))
cat("Shrunken-R2 for Stepwise Model (Manual): ", shrunken_R2_stepwise_manual, "\n")

## Shrunken-R2 for Stepwise Model (Manual):  0.9798389

All-Subsets

library(leaps)

# All-Subsets Regression
all_subsets <- regsubsets(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)

summary_subsets <- summary(all_subsets)

# Find the model with the highest adjusted R2
max_adj_r2 <- which.max(summary_subsets$adjr2)
best_model <- summary_subsets$which[max_adj_r2, ]

# Extract the names of the predictors in the final model
predictors <- names(best_model)[best_model][2:length(names(best_model)[best_model])]

all_subsets_formula <- as.formula(paste("calories ~", paste(predictors, collapse = " + ")))
all_subsets_model <- lm(all_subsets_formula, data = UScereal)
summary_all_subsets_model <- summary(all_subsets_model)

R2_all_subsets <- summary_all_subsets_model$r.squared
n_all_subsets <- nrow(UScereal)
p_all_subsets <- length(coef(all_subsets_model)) - 1

# Shrunken R2 Calculation
shrunken_R2_all_subsets <- 1 - ((1 - R2_all_subsets) * (n_all_subsets - 1) / (n_all_subsets - p_all_subsets - 1))
cat("Shrunken-R2 for All-Subsets Model: ", shrunken_R2_all_subsets, "\n")

## Shrunken-R2 for All-Subsets Model:  0.9840099

5.2) Based on the results from 5.1, which variable selection method has the best out of sample prediction? Is the best variable selection method better than the full model (compare to the full model (Question 2.1))?

The All-Subsets Regression Model has the highest shrunken \(R^2\) value of 0.9840099, suggesting the best out-of-sample prediction performance.

As I note actually at the beginning of (5.1) for this purpose, we find that the shrunken \(R^2\) for the Full Model is 0.983772. The All-Subsets Regression Model’s shrunken \(R^2\) (0.9840099) is slightly higher than the Full Model’s shrunken \(R^2\) (0.983772), suggesting that the All-Subsets Regression Model performs better in terms of out-of-sample prediction. Therefore, the All-Subsets Regression Model provides the most reliable predictions when considering the out-of-sample performance.

Question 6

6.1) Estimate a linear model predicting calories from mfr.num. Report the model fit and a test of significance for the model fit. In the conclusion, describe the null hypothesis being tested and whether or not it is rejected or retained based on the results):

# Estimate the linear model
mfr_model <- lm(calories ~ mfr.num, data = UScereal)

summary(mfr_model)

## 
## Call:
## lm(formula = calories ~ mfr.num, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -105.44  -38.13  -12.36   26.10  286.99 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  143.259     14.484   9.891 1.92e-14 ***
## mfr.num        2.437      4.840   0.504    0.616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.78 on 63 degrees of freedom
## Multiple R-squared:  0.004009,   Adjusted R-squared:  -0.0118 
## F-statistic: 0.2536 on 1 and 63 DF,  p-value: 0.6163

# Get the summary of the model
summary_model <- summary(mfr_model)

# Output the model fit (R2 and adj-R2)
r_squared <- summary_model$r.squared
adj_r_squared <- summary_model$adj.r.squared

# Output the significance test (p-value for the model)
f_statistic <- summary_model$fstatistic
p_value <- pf(f_statistic[1], f_statistic[2], f_statistic[3], lower.tail = FALSE)

# Print the results
print(paste("R-squared: ", round(r_squared, 4)))

## [1] "R-squared:  0.004"

print(paste("Adjusted R-squared: ", round(adj_r_squared, 4)))

## [1] "Adjusted R-squared:  -0.0118"

print(paste("p-value: ", round(p_value, 4)))

## [1] "p-value:  0.6163"

For our significance test and model summary, we have F(1, 63) = 0.2536, p = 0.6163. Given this result, we fail to reject the null the hypothesis that there is no relationship between the manufacturer and number of calories in one portion.

6.2) Provide an interpretation for the regression coefficient of the predictor mfr.num estimated from the model. Make sure to be clear what we mean by a 1-unit difference in the predictor.

This regression represents the slope here. The slope represents the change in the calorie content for cereals from different manufacturers. Specifically, for two manufacturers that differ by one point in their numeric encoding (moving from General Mills to Kellogg’s, which corresponds to an increase in mfr.num by one unit), there is an expected difference of 2.437 calories. The manufacturer with the higher numeric encoding is expected to have cereals with a higher calorie content by an average of 2.437 calories. However, this difference is not statistically significant, indicating that the relationship between the manufacturer and the calorie content is not reliable based on this model.

6.3) Create a scatter plot with mfr.num on the X axis and observed calories on the Y axis. Include in the plot line which represents the predicted values for each manufacturer.

library (ggplot2)

# Fit the linear model
mfr_model <- lm(calories ~ mfr.num, data = UScereal)

# Plot
plt_mfrnum <- ggplot(UScereal, aes(x = mfr.num, y = calories)) +
  geom_point() +
  geom_line(aes(y = predict(mfr_model)), color = "red", se = FALSE) +
  labs(
    title = "Calories vs. Manufacturer",
    x = "Manufacturer (Numerically Encoded)",
    y = "Calories"
  ) +
  theme_minimal()+
  theme(
    axis.text = element_text(size = 10, face = "plain"),
    axis.title = element_text(size = 11, face = "plain"),
    plot.title = element_text(size = 13, face = "plain", hjust = 0.5)
  )

## Warning in geom_line(aes(y = predict(mfr_model)), color = "red", se = FALSE):
## Ignoring unknown parameters: `se`

print(plt_mfrnum)

6.4) Fill out the following table with the predicted values for each manufacturer (from the sample) and the means from each manufacturer.

library(dplyr)

# Predict the values for each manufacturer
UScereal$predicted_calories <- predict(mfr_model, newdata = UScereal)

# Calculate the mean calories for each manufacturer
mean_calories <- UScereal %>%
  group_by(mfr.num) %>%
  summarise(mean_calories = mean(calories))

# Create the table with predicted values and group means
manufacturers <- data.frame(
  Manufacturer = c("1 (G)", "2 (K)", "3 (N)", "4 (P)", "5 (Q)", "6 (R)"),
  mfr.num = 1:6
)

manufacturers$Model_Prediction <- predict(mfr_model, newdata = manufacturers)
manufacturers$Group_Mean <- mean_calories$mean_calories

# Table
print(manufacturers)

##   Manufacturer mfr.num Model_Prediction Group_Mean
## 1        1 (G)       1         145.6959   137.7879
## 2        2 (K)       2         148.1333   149.6710
## 3        3 (N)       3         150.5707   160.2593
## 4        4 (P)       4         153.0081   194.7578
## 5        5 (Q)       5         155.4455   135.8507
## 6        6 (R)       6         157.8829   124.8521

6.5) Describe whether you think that treating manufacturer as a continuous variable is an appropriate and accurate model, given the information you’ve generated above.

I think that treating the manufacturer as a continuous variable in this context may not be the most appropriate and accurate modeling approach. This is because the very nature of the manufacturer is inherently a categorical variable, not a continuous one. Each value represents a distinct manufacturer with no natural order or scale between them. By treating it as a continuous variable, we imply a linear relationship and order that does not exist. Moreover, the regression model assumes that moving from one manufacturer to the next (from General Mills to Kellogg’s) has a linear impact on the calories, which is unlikely to be true. The differences in calorie content are more likely due to individual manufacturer practices rather than a smooth, linear transition.

Also consider our \(R^2\) value. It is very low (0.004), and the p-value for the coefficient is not significant (0.616), indicating that the model does not explain the variance in calories well. This suggests that the linear model does not capture the relationship between manufacturer and calorie content effectively. Additionally, the group means for calories show considerable variation among manufacturers (137.79 for General Mills vs. 194.76 for Post). This variation suggests that each manufacturer has distinct characteristics that affect calorie content, which a continuous variable model cannot capture.

Question 7

Dummy Coding (General Mills is Reference Group)

# Load necessary libraries
library(dplyr)
library(mosaic)

## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2

## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.

## 
## Attaching package: 'mosaic'

## The following object is masked from 'package:Matrix':
## 
##     mean

## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally

## The following object is masked from 'package:boot':
## 
##     logit

## The following object is masked from 'package:caret':
## 
##     dotPlot

## The following object is masked from 'package:ggplot2':
## 
##     stat

## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var

## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

UScereal$mfr <- as.factor(UScereal$mfr)

# Create dummy variables with General Mills (G) as the reference group
UScereal <- transform(UScereal,
                      K = as.numeric(mfr == "K"),
                      N = as.numeric(mfr == "N"),
                      P = as.numeric(mfr == "P"),
                      Q = as.numeric(mfr == "Q"),
                      R = as.numeric(mfr == "R"))

# Create a table to show the dummy coding
dummy_coding_table <- UScereal %>%
  select(mfr, K, N, P, Q, R) %>%
  distinct()

# Rename the manufacturer column for clarity
dummy_coding_table <- dummy_coding_table %>%
  mutate(Manufacturer = case_when(
    mfr == "G" ~ "1 (G)",
    mfr == "K" ~ "2 (K)",
    mfr == "N" ~ "3 (N)",
    mfr == "P" ~ "4 (P)",
    mfr == "Q" ~ "5 (Q)",
    mfr == "R" ~ "6 (R)"
  )) %>%
  select(Manufacturer, K, N, P, Q, R)

# Table
print(dummy_coding_table)

##                         Manufacturer K N P Q R
## 100% Bran                      3 (N) 0 1 0 0 0
## All-Bran                       2 (K) 1 0 0 0 0
## Apple Cinnamon Cheerios        1 (G) 0 0 0 0 0
## Bran Chex                      6 (R) 0 0 0 0 1
## Bran Flakes                    4 (P) 0 0 1 0 0
## Cap'n'Crunch                   5 (Q) 0 0 0 1 0

# Fit the linear model using dummy coding
dummyModel <- lm(calories ~ K + N + P + Q + R, data = UScereal)

summary(dummyModel)

## 
## Call:
## lm(formula = calories ~ K + N + P + Q + R, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  137.788     13.123  10.500 4.01e-15 ***
## K             11.883     18.778   0.633   0.5293    
## N             22.471     37.882   0.593   0.5553    
## P             56.970     24.355   2.339   0.0227 *  
## Q             -1.937     30.495  -0.064   0.9496    
## R            -12.936     30.495  -0.424   0.6730    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

library(dplyr)
library(mosaic)

UScereal$mfr <- as.factor(UScereal$mfr)

# Apply dummy coding to the manufacturer factor
contrasts(UScereal$mfr) <- contr.treatment(length(levels(UScereal$mfr)), base = 1)

# Display the contrasts for verification
contrasts(UScereal$mfr)

##   2 3 4 5 6
## G 0 0 0 0 0
## K 1 0 0 0 0
## N 0 1 0 0 0
## P 0 0 1 0 0
## Q 0 0 0 1 0
## R 0 0 0 0 1

# Fit the linear model using dummy coding
dummyModel <- lm(calories ~ mfr, data = UScereal)

summary(dummyModel)

## 
## Call:
## lm(formula = calories ~ mfr, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  137.788     13.123  10.500 4.01e-15 ***
## mfr2          11.883     18.778   0.633   0.5293    
## mfr3          22.471     37.882   0.593   0.5553    
## mfr4          56.970     24.355   2.339   0.0227 *  
## mfr5          -1.937     30.495  -0.064   0.9496    
## mfr6         -12.936     30.495  -0.424   0.6730    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

Manufacturer	D1	D2	D3	D4	D5
1 (G)	0	0	0	0	0
2 (K)	1	0	0	0	0
3 (N)	0	1	0	0	0
4 (P)	0	0	1	0	0
5 (Q)	0	0	0	1	0
6 (R)	0	0	0	0	1

Sequential Coding (General Mills is “lowest” Group)

library(mosaic)
library(dplyr)

UScereal$mfr <- as.factor(UScereal$mfr)

# Sequential coding for manufacturer
UScereal <- UScereal %>%
  mutate(
    s1 = as.numeric(mfr.num > 1),
    s2 = as.numeric(mfr.num > 2),
    s3 = as.numeric(mfr.num > 3),
    s4 = as.numeric(mfr.num > 4),
    s5 = as.numeric(mfr.num > 5)
  )

# Create a table to show the sequential coding
sequential_coding_table <- UScereal %>%
  select(mfr.num, s1, s2, s3, s4, s5) %>%
  distinct()

# Rename the manufacturer column for clarity
sequential_coding_table <- sequential_coding_table %>%
  mutate(Manufacturer = case_when(
    mfr.num == 1 ~ "1 (G)",
    mfr.num == 2 ~ "2 (K)",
    mfr.num == 3 ~ "3 (N)",
    mfr.num == 4 ~ "4 (P)",
    mfr.num == 5 ~ "5 (Q)",
    mfr.num == 6 ~ "6 (R)"
  )) %>%
  select(Manufacturer, s1, s2, s3, s4, s5)

# Table
print(sequential_coding_table)

##                         Manufacturer s1 s2 s3 s4 s5
## 100% Bran                      3 (N)  1  1  0  0  0
## All-Bran                       2 (K)  1  0  0  0  0
## Apple Cinnamon Cheerios        1 (G)  0  0  0  0  0
## Bran Chex                      6 (R)  1  1  1  1  1
## Bran Flakes                    4 (P)  1  1  1  0  0
## Cap'n'Crunch                   5 (Q)  1  1  1  1  0

# Fit the linear model using sequential coding
seqModel <- lm(calories ~ s1 + s2 + s3 + s4 + s5, data = UScereal)

summary(seqModel)

## 
## Call:
## lm(formula = calories ~ s1 + s2 + s3 + s4 + s5, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   137.79      13.12  10.500 4.01e-15 ***
## s1             11.88      18.78   0.633   0.5293    
## s2             10.59      37.99   0.279   0.7814    
## s3             34.50      41.03   0.841   0.4039    
## s4            -58.91      34.33  -1.716   0.0914 .  
## s5            -11.00      38.93  -0.283   0.7785    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

Manufacturer	S1	S2	S3	S4	S5
1 (G)	0	0	0	0	0
2 (K)	1	0	0	0	0
3 (N)	1	1	0	0	0
4 (P)	1	1	1	0	0
5 (Q)	1	1	1	1	0
6 (R)	1	1	1	1	1

Helmert Coding (General Mills is “lowest” Group)

library(dplyr)
library(mosaic)

UScereal$mfr <- as.factor(UScereal$mfr)

# Helmert coding for manufacturer with General Mills (G) as the lowest group
UScereal <- transform(UScereal,
                      H1 = -5/6 * (mfr.num == 1) + 1/6 * (mfr.num > 1),
                      H2 = -4/5 * (mfr.num == 2) + 1/5 * (mfr.num > 2),
                      H3 = -3/4 * (mfr.num == 3) + 1/4 * (mfr.num > 3),
                      H4 = -2/3 * (mfr.num == 4) + 1/3 * (mfr.num > 4),
                      H5 = -1/2 * (mfr.num == 5) + 1/2 * (mfr.num > 5)
                  )

# Create a table to show the Helmert coding
helmert_coding_table <- UScereal %>%
  select(mfr.num, H1, H2, H3, H4, H5) %>%
  distinct()

# Rename the manufacturer column for clarity
helmert_coding_table <- helmert_coding_table %>%
  mutate(Manufacturer = case_when(
    mfr.num == 1 ~ "1 (G)",
    mfr.num == 2 ~ "2 (K)",
    mfr.num == 3 ~ "3 (N)",
    mfr.num == 4 ~ "4 (P)",
    mfr.num == 5 ~ "5 (Q)",
    mfr.num == 6 ~ "6 (R)"
  )) %>%
  select(Manufacturer, H1, H2, H3, H4, H5)

# Table
print(helmert_coding_table)

##                         Manufacturer         H1   H2    H3         H4   H5
## 100% Bran                      3 (N)  0.1666667  0.2 -0.75  0.0000000  0.0
## All-Bran                       2 (K)  0.1666667 -0.8  0.00  0.0000000  0.0
## Apple Cinnamon Cheerios        1 (G) -0.8333333  0.0  0.00  0.0000000  0.0
## Bran Chex                      6 (R)  0.1666667  0.2  0.25  0.3333333  0.5
## Bran Flakes                    4 (P)  0.1666667  0.2  0.25 -0.6666667  0.0
## Cap'n'Crunch                   5 (Q)  0.1666667  0.2  0.25  0.3333333 -0.5

# Fit the linear model using Helmert coding
helmModel <- lm(calories ~ H1 + H2 + H3 + H4 + H5, data = UScereal)

summary(helmModel)

## 
## Call:
## lm(formula = calories ~ H1 + H2 + H3 + H4 + H5, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  150.530      9.933  15.155   <2e-16 ***
## H1            15.290     17.533   0.872   0.3867    
## H2             4.259     19.503   0.218   0.8279    
## H3            -8.439     38.445  -0.220   0.8270    
## H4           -64.406     28.281  -2.277   0.0264 *  
## H5           -10.999     38.929  -0.283   0.7785    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

Manufacturer	H1	H2	H3	H4	H5
1 (G)	-0.8333333	0.0	0.00	0.0000000	0.0
2 (K)	0.1666667	-0.8	0.00	0.0000000	0.0
3 (N)	0.1666667	0.2	-0.75	0.0000000	0.0
4 (P)	0.1666667	0.2	0.25	-0.6666667	0.0
5 (Q)	0.1666667	0.2	0.25	0.3333333	-0.5
6 (R)	0.1666667	0.2	0.25	0.3333333	0.5

library(dplyr)
library(mosaic)

UScereal$mfr.num <- as.numeric(UScereal$mfr)

# Helmert coding for manufacturer with General Mills (G) as the lowest group
UScereal <- transform(UScereal,
                      h1 = ifelse(mfr.num == 1, 0, ifelse(mfr.num > 1, 1/5, -5/6)),
                      h2 = ifelse(mfr.num <= 2, 0, ifelse(mfr.num > 2, 1/4, -4/5)),
                      h3 = ifelse(mfr.num <= 3, 0, ifelse(mfr.num > 3, 1/3, -3/4)),
                      h4 = ifelse(mfr.num <= 4, 0, ifelse(mfr.num > 4, 1/2, -2/3)),
                      h5 = ifelse(mfr.num <= 5, 0, ifelse(mfr.num > 5, 1, -1/2))
                  )

# Create a table to show the Helmert coding
helmert_coding_table <- UScereal %>%
  select(mfr.num, h1, h2, h3, h4, h5) %>%
  distinct()

# Rename the manufacturer column for clarity
helmert_coding_table <- helmert_coding_table %>%
  mutate(Manufacturer = case_when(
    mfr.num == 1 ~ "1 (G)",
    mfr.num == 2 ~ "2 (K)",
    mfr.num == 3 ~ "3 (N)",
    mfr.num == 4 ~ "4 (P)",
    mfr.num == 5 ~ "5 (Q)",
    mfr.num == 6 ~ "6 (R)"
  )) %>%
  select(Manufacturer, h1, h2, h3, h4, h5)

# Table
print(helmert_coding_table)

##                         Manufacturer  h1   h2        h3  h4 h5
## 100% Bran                      3 (N) 0.2 0.25 0.0000000 0.0  0
## All-Bran                       2 (K) 0.2 0.00 0.0000000 0.0  0
## Apple Cinnamon Cheerios        1 (G) 0.0 0.00 0.0000000 0.0  0
## Bran Chex                      6 (R) 0.2 0.25 0.3333333 0.5  1
## Bran Flakes                    4 (P) 0.2 0.25 0.3333333 0.0  0
## Cap'n'Crunch                   5 (Q) 0.2 0.25 0.3333333 0.5  0

# Fit the linear model using Helmert coding
helmModel <- lm(calories ~ h1 + h2 + h3 + h4 + h5, data = UScereal)

summary(helmModel)

## 
## Call:
## lm(formula = calories ~ h1 + h2 + h3 + h4 + h5, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   137.79      13.12  10.500 4.01e-15 ***
## h1             59.42      93.89   0.633   0.5293    
## h2             42.35     151.96   0.279   0.7814    
## h3            103.50     123.10   0.841   0.4039    
## h4           -117.81      68.66  -1.716   0.0914 .  
## h5            -11.00      38.93  -0.283   0.7785    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

Manufacturer	H1	H2	H3	H4	H5
1 (G)	0.0	0.00	0.0000000	0.0	0
2 (K)	0.2	0.00	0.0000000	0.0	0
3 (N)	0.2	0.25	0.0000000	0.0	0
4 (P)	0.2	0.25	0.3333333	0.0	0
5 (Q)	0.2	0.25	0.3333333	0.5	0
6 (R)	0.2	0.25	0.3333333	0.5	1

# Load necessary libraries
library(dplyr)
library(mosaic)

UScereal$mfr <- as.factor(UScereal$mfr)

# Create Helmert coding for manufacturer with General Mills (G) as the lowest group
helmert_matrix <- contr.helmert(length(levels(UScereal$mfr)))

# Apply Helmert coding to the manufacturer factor
contrasts(UScereal$mfr) <- helmert_matrix

# Display the contrasts for verification
contrasts(UScereal$mfr)

##   [,1] [,2] [,3] [,4] [,5]
## G   -1   -1   -1   -1   -1
## K    1   -1   -1   -1   -1
## N    0    2   -1   -1   -1
## P    0    0    3   -1   -1
## Q    0    0    0    4   -1
## R    0    0    0    0    5

# Fit the linear model using Helmert coding
helmModel <- lm(calories ~ mfr, data = UScereal)

summary(helmModel)

## 
## Call:
## lm(formula = calories ~ mfr, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  150.530      9.933  15.155   <2e-16 ***
## mfr1           5.942      9.389   0.633   0.5293    
## mfr2           5.510     12.252   0.450   0.6546    
## mfr3          11.380      6.126   1.858   0.0682 .  
## mfr4          -4.954      5.950  -0.833   0.4084    
## mfr5          -5.136      4.915  -1.045   0.3003    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

Manufacturer	H1	H2	H3	H4	H5
1 (G)	-1	-1	-1	-1	-1
2 (K)	1	-1	-1	-1	-1
3 (N)	0	2	-1	-1	-1
4 (P)	0	0	3	-1	-1
5 (Q)	0	0	0	4	-1
6 (R)	0	0	0	0	5

I completed the Helmert coding three ways. One I am not sure, when I adapted the provided code from lab 7, seems to give different results (see the first panel). When I use the modified code that I have adjusted myself, I get the general mills to be the absolute lowest. However, after adapting the contr.helmert function from the contrasts library, we get another output. However, one can find that while the predictor estimates may vary, one maintains the same significance levels.

Effect Coding

library(dplyr)
library(mosaic)

UScereal$mfr <- as.factor(UScereal$mfr)

# Create a numeric representation of mfr
UScereal$mfr.num <- as.numeric(UScereal$mfr)

# Effect coding for manufacturer with Ralston Purina (R) as the reference group
UScereal <- transform(UScereal,
                      E1 = (mfr.num == 1) - (mfr.num == 6),
                      E2 = (mfr.num == 2) - (mfr.num == 6),
                      E3 = (mfr.num == 3) - (mfr.num == 6),
                      E4 = (mfr.num == 4) - (mfr.num == 6),
                      E5 = (mfr.num == 5) - (mfr.num == 6)
                  )

# Create a table to show the effect coding
effect_coding_table <- UScereal %>%
  select(mfr.num, E1, E2, E3, E4, E5) %>%
  distinct()

# Rename the manufacturer column for clarity
effect_coding_table <- effect_coding_table %>%
  mutate(Manufacturer = case_when(
    mfr.num == 1 ~ "1 (G)",
    mfr.num == 2 ~ "2 (K)",
    mfr.num == 3 ~ "3 (N)",
    mfr.num == 4 ~ "4 (P)",
    mfr.num == 5 ~ "5 (Q)",
    mfr.num == 6 ~ "6 (R)"
  )) %>%
  select(Manufacturer, E1, E2, E3, E4, E5)

# Table
print(effect_coding_table)

##                         Manufacturer E1 E2 E3 E4 E5
## 100% Bran                      3 (N)  0  0  1  0  0
## All-Bran                       2 (K)  0  1  0  0  0
## Apple Cinnamon Cheerios        1 (G)  1  0  0  0  0
## Bran Chex                      6 (R) -1 -1 -1 -1 -1
## Bran Flakes                    4 (P)  0  0  0  1  0
## Cap'n'Crunch                   5 (Q)  0  0  0  0  1

# Fit the linear model using effect coding
effectsModel <- lm(calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)

summary(effectsModel)

## 
## Call:
## lm(formula = calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 150.5298     9.9329  15.155   <2e-16 ***
## E1          -12.7419    14.6106  -0.872   0.3867    
## E2           -0.8588    14.7965  -0.058   0.9539    
## E3            9.7295    30.6687   0.317   0.7522    
## E4           44.2280    19.4756   2.271   0.0268 *  
## E5          -14.6791    24.5725  -0.597   0.5525    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

Manufacturer	H1	H2	H3	H4	H5
1 (G)	1	0	0	0	0
2 (K)	0	1	0	0	0
3 (N)	0	0	1	0	0
4 (P)	0	0	0	1	0
5 (Q)	0	0	0	0	1
6 (R)	-1	-1	-1	-1	-1

library(dplyr)
library(mosaic)

UScereal$mfr <- as.factor(UScereal$mfr)

# Apply effect coding to the manufacturer factor using contr.sum
contrasts(UScereal$mfr) <- contr.sum(length(levels(UScereal$mfr)))

# Display the contrasts for verification
print(contrasts(UScereal$mfr))

##   [,1] [,2] [,3] [,4] [,5]
## G    1    0    0    0    0
## K    0    1    0    0    0
## N    0    0    1    0    0
## P    0    0    0    1    0
## Q    0    0    0    0    1
## R   -1   -1   -1   -1   -1

# Fit the linear model using effect coding
effectsModel1 <- lm(calories ~ mfr, data = UScereal)

summary(effectsModel1)

## 
## Call:
## lm(formula = calories ~ mfr, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 150.5298     9.9329  15.155   <2e-16 ***
## mfr1        -12.7419    14.6106  -0.872   0.3867    
## mfr2         -0.8588    14.7965  -0.058   0.9539    
## mfr3          9.7295    30.6687   0.317   0.7522    
## mfr4         44.2280    19.4756   2.271   0.0268 *  
## mfr5        -14.6791    24.5725  -0.597   0.5525    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

Manufacturer	E1	E2	E3	E4	E5
1 (G)	1	0	0	0	0
2 (K)	0	1	0	0	0
3 (N)	0	0	1	0	0
4 (P)	0	0	0	1	0
5 (Q)	0	0	0	0	1
6 (R)	-1	-1	-1	-1	-1

Question 8

Fit a linear model predicting calories using manufacturer as represented by a set of effects codes. Provide the code and the output for both generating the new variables and fitting the model. Estimate the model fit (R² and adj-R²) and qualitatively compare it to the model in Question 1. Select one coefficient from the model to interpret.

8.1) Code and Output

library(dplyr)
library(mosaic)

UScereal$mfr <- as.factor(UScereal$mfr)

# Create a numeric representation of mfr
UScereal$mfr.num <- as.numeric(UScereal$mfr)

# Effect coding for manufacturer with Ralston Purina (R) as the reference group
UScereal <- transform(UScereal,
                      E1 = (mfr.num == 1) - (mfr.num == 6),
                      E2 = (mfr.num == 2) - (mfr.num == 6),
                      E3 = (mfr.num == 3) - (mfr.num == 6),
                      E4 = (mfr.num == 4) - (mfr.num == 6),
                      E5 = (mfr.num == 5) - (mfr.num == 6)
                  )

# Create a table to show the effect coding
effect_coding_table <- UScereal %>%
  select(mfr.num, E1, E2, E3, E4, E5) %>%
  distinct()

# Rename the manufacturer column for clarity
effect_coding_table <- effect_coding_table %>%
  mutate(Manufacturer = case_when(
    mfr.num == 1 ~ "1 (G)",
    mfr.num == 2 ~ "2 (K)",
    mfr.num == 3 ~ "3 (N)",
    mfr.num == 4 ~ "4 (P)",
    mfr.num == 5 ~ "5 (Q)",
    mfr.num == 6 ~ "6 (R)"
  )) %>%
  select(Manufacturer, E1, E2, E3, E4, E5)

# Table
print(effect_coding_table)

##                         Manufacturer E1 E2 E3 E4 E5
## 100% Bran                      3 (N)  0  0  1  0  0
## All-Bran                       2 (K)  0  1  0  0  0
## Apple Cinnamon Cheerios        1 (G)  1  0  0  0  0
## Bran Chex                      6 (R) -1 -1 -1 -1 -1
## Bran Flakes                    4 (P)  0  0  0  1  0
## Cap'n'Crunch                   5 (Q)  0  0  0  0  1

# Fit the linear model using effect coding
effectsModel <- lm(calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)

summary(effectsModel)

## 
## Call:
## lm(formula = calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 150.5298     9.9329  15.155   <2e-16 ***
## E1          -12.7419    14.6106  -0.872   0.3867    
## E2           -0.8588    14.7965  -0.058   0.9539    
## E3            9.7295    30.6687   0.317   0.7522    
## E4           44.2280    19.4756   2.271   0.0268 *  
## E5          -14.6791    24.5725  -0.597   0.5525    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

# Fit the full model
full_model <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)

# Summary of the full model
summary(full_model)

## 
## Call:
## lm(formula = calories ~ protein + fat + sodium + fibre + carbo + 
##     sugars + shelf + potassium, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.917  -4.075   1.079   4.415  15.152 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.15905    3.74046  -5.657 5.46e-07 ***
## protein       4.87257    0.97909   4.977 6.51e-06 ***
## fat           9.27406    0.75034  12.360  < 2e-16 ***
## sodium        0.01207    0.01002   1.204 0.233676    
## fibre         2.80820    0.71883   3.907 0.000254 ***
## carbo         4.86987    0.17491  27.842  < 2e-16 ***
## sugars        4.54340    0.21068  21.565  < 2e-16 ***
## shelf         0.56679    1.39782   0.405 0.686668    
## potassium    -0.11594    0.02718  -4.265 7.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.951 on 56 degrees of freedom
## Multiple R-squared:  0.9858, Adjusted R-squared:  0.9838 
## F-statistic:   486 on 8 and 56 DF,  p-value: < 2.2e-16

8.2) \(R^2\) and Adjusted \(R^2\)

# Extract R-squared and adjusted R-squared
effectsModel_summary <- summary(effectsModel)

effectsModel_r2 <- effectsModel_summary$r.squared
effectsModel_adj_r2 <- effectsModel_summary$adj.r.squared
print(paste("Effects Code Model R-squared: ", effectsModel_r2))

## [1] "Effects Code Model R-squared:  0.103365947916525"

print(paste("Effects Code Model Adjusted R-squared: ", effectsModel_adj_r2))

## [1] "Effects Code Model Adjusted R-squared:  0.0273800112992811"

# Extract R-squared and adjusted R-squared
full_model_summary <- summary(full_model)

full_model_r2 <- full_model_summary$r.squared
full_model_adj_r2 <- full_model_summary$adj.r.squared
print(paste("Full Model R-squared: ", full_model_r2))

## [1] "Full Model R-squared:  0.985800539648456"

print(paste("Full Model Adjusted R-squared: ", full_model_adj_r2))

## [1] "Full Model Adjusted R-squared:  0.983772045312521"

Looking at these two models, the full model, which includes multiple predictors (protein, fat, sodium, fibre, carbo, sugars, shelf, and potassium), has a much higher \(R^2\) and adjusted \(R^2\) compared to the effects-coded model. The full model explains a much larger proportion of the variance in calories than the effects-coded model, indicating that other predictors like protein, fat, sodium, fibre, carbo, sugars, shelf, and potassium are significant factors in determining the calorie content of cereals.

8.3) Fill out the following table with the predicted values for each manufacturer (from the sample) and the means from each manufacturer.

library(dplyr)

UScereal$mfr <- as.factor(UScereal$mfr)

# Create a numeric representation of mfr
UScereal$mfr.num <- as.numeric(UScereal$mfr)

# Effect coding for manufacturer with Ralston Purina (R) as the reference group
UScereal <- transform(UScereal,
                      E1 = (mfr.num == 1) - (mfr.num == 6),
                      E2 = (mfr.num == 2) - (mfr.num == 6),
                      E3 = (mfr.num == 3) - (mfr.num == 6),
                      E4 = (mfr.num == 4) - (mfr.num == 6),
                      E5 = (mfr.num == 5) - (mfr.num == 6)
                  )

# Create a table to show the effect coding
effect_coding_table <- UScereal %>%
  select(mfr.num, E1, E2, E3, E4, E5) %>%
  distinct()

# Rename the manufacturer column for clarity
effect_coding_table <- effect_coding_table %>%
  mutate(Manufacturer = case_when(
    mfr.num == 1 ~ "1 (G)",
    mfr.num == 2 ~ "2 (K)",
    mfr.num == 3 ~ "3 (N)",
    mfr.num == 4 ~ "4 (P)",
    mfr.num == 5 ~ "5 (Q)",
    mfr.num == 6 ~ "6 (R)"
  )) %>%
  select(Manufacturer, E1, E2, E3, E4, E5)

# Fit the linear model using effect coding
effectsModel <- lm(calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)

summary(effectsModel)

## 
## Call:
## lm(formula = calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 150.5298     9.9329  15.155   <2e-16 ***
## E1          -12.7419    14.6106  -0.872   0.3867    
## E2           -0.8588    14.7965  -0.058   0.9539    
## E3            9.7295    30.6687   0.317   0.7522    
## E4           44.2280    19.4756   2.271   0.0268 *  
## E5          -14.6791    24.5725  -0.597   0.5525    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

# Predictions from the model
UScereal$predicted_calories <- predict(effectsModel, newdata = UScereal)

# Calculate group means
group_means <- UScereal %>%
  group_by(mfr) %>%
  summarise(mean_calories = mean(calories))

# Create the table with predictions and group means
result_table <- UScereal %>%
  group_by(mfr) %>%
  summarise(Model_Prediction = mean(predicted_calories), Group_Mean = mean(calories)) %>%
  mutate(Manufacturer = case_when(
    mfr == "G" ~ "1 (G)",
    mfr == "K" ~ "2 (K)",
    mfr == "N" ~ "3 (N)",
    mfr == "P" ~ "4 (P)",
    mfr == "Q" ~ "5 (Q)",
    mfr == "R" ~ "6 (R)"
  )) %>%
  select(Manufacturer, Model_Prediction, Group_Mean)

# Print the result table
print(result_table)

## # A tibble: 6 × 3
##   Manufacturer Model_Prediction Group_Mean
##   <chr>                   <dbl>      <dbl>
## 1 1 (G)                    138.       138.
## 2 2 (K)                    150.       150.
## 3 3 (N)                    160.       160.
## 4 4 (P)                    195.       195.
## 5 5 (Q)                    136.       136.
## 6 6 (R)                    125.       125.

8.4) Select one coefficient estimate from the model and provide an interpretation of that coefficient. Make sure it’s clear which coefficient you’ve selected.

effectsModel_summary

## 
## Call:
## lm(formula = calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 150.5298     9.9329  15.155   <2e-16 ***
## E1          -12.7419    14.6106  -0.872   0.3867    
## E2           -0.8588    14.7965  -0.058   0.9539    
## E3            9.7295    30.6687   0.317   0.7522    
## E4           44.2280    19.4756   2.271   0.0268 *  
## E5          -14.6791    24.5725  -0.597   0.5525    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

# Selected coefficient to interpret
selected_coef <- coef(effectsModel)["E2"]
selected_coef

##         E2 
## -0.8587862

This coefficient represents the difference in the number of calories between the cereals manufactured by Kellogg’s and the overall mean calories of all manufacturers, with Ralston Purina (reference group) coded as -1s to center the effects around zero. Specifically, for two manufacturers that differ by one point in their numeric encoding (moving from General Mills to Kellogg’s, which corresponds to an increase in mfr.num by one unit), there is an expected difference of -0.8588 calories. The manufacturer with the higher numeric encoding (Kellogg’s) is expected to have cereals with a lower calorie content by an average of 0.8588 calories compared to the overall mean. However, this difference is not statistically significant (t(59) = -0.058, p-value = 0.9539), indicating that the relationship between the manufacturer and the calorie content is not reliable based on this model

8.5) Comparison to Question 6: Which model seems to capture variability in the outcome more effectively? (You don’t need to use a test just qualitatively compare the models)

summary (mfr_model)

## 
## Call:
## lm(formula = calories ~ mfr.num, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -105.44  -38.13  -12.36   26.10  286.99 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  143.259     14.484   9.891 1.92e-14 ***
## mfr.num        2.437      4.840   0.504    0.616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.78 on 63 degrees of freedom
## Multiple R-squared:  0.004009,   Adjusted R-squared:  -0.0118 
## F-statistic: 0.2536 on 1 and 63 DF,  p-value: 0.6163

summary (effectsModel)

## 
## Call:
## lm(formula = calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.05  -39.67  -14.85   24.40  245.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 150.5298     9.9329  15.155   <2e-16 ***
## E1          -12.7419    14.6106  -0.872   0.3867    
## E2           -0.8588    14.7965  -0.058   0.9539    
## E3            9.7295    30.6687   0.317   0.7522    
## E4           44.2280    19.4756   2.271   0.0268 *  
## E5          -14.6791    24.5725  -0.597   0.5525    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.02738 
## F-statistic:  1.36 on 5 and 59 DF,  p-value: 0.2523

I have the effects model and the original full model. The full modelhas very low \(R^2\) (0.004009) and negative Adjusted \(R^2\) (-0.0118), indicating that it explains almost none of the variance in calories. In fact, the negative Adjusted \(R^2\) suggests that the model is worse than a simple mean model. The effects contrast has a higher \(R^2\) (0.1034) and a slightly positive Adjusted \(R^2\) (0.02738), indicating that it explains more variance in calories compared to full model.

Moreover, the full model is F(1, 63) = 0.2536, p = 0.6163, indicating that the model is not statistically significant. Our effects coding model has an F(5, 59) = 1.36, p = 0.2523, which is still not statistically significant but shows a slightly better fit compared to the full model.

Lastly, the full model has a residual standard error of 62.78. The effects model has a slightly lower residual standard error of 61.55, indicating that the predictions from Model 2 are closer to the actual values compared to the full model.

Question 9

9.1) Code and Output

library(dplyr)

UScereal$mfr <- as.factor(UScereal$mfr)

# Apply effect coding to the manufacturer factor using contr.sum
contrasts(UScereal$mfr) <- contr.sum(length(levels(UScereal$mfr)))

# Fit the linear model using effect coding and adding protein as a predictor
effectsModel_protein <- lm(calories ~ mfr + protein, data = UScereal)

# Summary of the new model
effectsModel_protein_summary <- summary(effectsModel_protein)
print(effectsModel_protein_summary)

## 
## Call:
## lm(formula = calories ~ mfr + protein, data = UScereal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -119.770  -21.777    1.998   18.496  119.847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   80.137     11.546   6.941 3.69e-09 ***
## mfr1           8.104     10.713   0.756   0.4525    
## mfr2           2.233     10.505   0.213   0.8324    
## mfr3         -40.540     22.716  -1.785   0.0795 .  
## mfr4          33.916     13.881   2.443   0.0176 *  
## mfr5          -3.724     17.490  -0.213   0.8321    
## protein       17.175      2.232   7.696 1.98e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.67 on 58 degrees of freedom
## Multiple R-squared:  0.5564, Adjusted R-squared:  0.5105 
## F-statistic: 12.12 on 6 and 58 DF,  p-value: 8.79e-09

9.2) \(R^2\) and Adjusted \(R^2\)

# Extract R-squared and adjusted R-squared
effectsModel_protein_r2 <- effectsModel_protein_summary$r.squared
effectsModel_protein_adj_r2 <- effectsModel_protein_summary$adj.r.squared
print(paste("R-squared: ", effectsModel_protein_r2))

## [1] "R-squared:  0.556401854548056"

print(paste("Adjusted R-squared: ", effectsModel_protein_adj_r2))

## [1] "Adjusted R-squared:  0.510512391225441"

9.3) Compare the Model from Question 8 to this new model. Does adding protein as a predictor significantly improve the model?

# Compare the R-squared and adjusted R-squared values
cat("Comparison of R-squared values:\n")

## Comparison of R-squared values:

cat("Effects Model R-squared: ", effectsModel_r2, "\n")

## Effects Model R-squared:  0.1033659

cat("Effects Model Adjusted R-squared: ", effectsModel_adj_r2, "\n")

## Effects Model Adjusted R-squared:  0.02738001

cat("Effects Model with Protein R-squared: ", effectsModel_protein_r2, "\n")

## Effects Model with Protein R-squared:  0.5564019

cat("Effects Model with Protein Adjusted R-squared: ", effectsModel_protein_adj_r2, "\n")

## Effects Model with Protein Adjusted R-squared:  0.5105124

The \(R^2\) value for the effects-coded model. is 0.1034, indicating that about 10.34% of the variance in calorie content is explained by the manufacturer effects. The \(R^2\) value for the effects-coded model with protein is 0.5564, indicating that about 55.64% of the variance in calorie content is explained by the combined effects of the manufacturer and protein content. The adjusted \(R^2\) value for effects-coded model is 0.0274, which is adjusted for the number of predictors and suggests a modest fit. The adjusted R-squared value for effects-coded model with protein is 0.5105, indicating a significantly better fit when protein is included as a predictor.

Overall, adding protein as a predictor significantly improves the model. The substantial increase in both R-squared and adjusted R-squared values demonstrates that protein content is a crucial predictor for calorie content in cereals, alongside the manufacturer effects.

9.4) Provide an interpretation for the coefficient for protein in the model.

coef_effectsModel_protein <- coef(effectsModel_protein)
protein_coef <- coef_effectsModel_protein["protein"]
protein_coef

##  protein 
## 17.17498

The coefficient for protein is 17.17498. This means that for each additional gram of protein, the calorie content is expected to change by 17.17498 calories, holding the effect-coded manufacturer variable constant.

9.5) Select the coefficient which corresponds to the coefficient you chose in 8.4 and provide an interpretation for this new model. Make sure it’s clear which coefficient you’ve selected.

# Interpret the coefficient for Kellogg's in the new model
kellogg_coef <- coef_effectsModel_protein["mfr2"]

kellogg_coef

##     mfr2 
## 2.233318

The coefficient for Kellogg’s is 2.233318. This represents the difference in the number of calories between the cereals manufactured by Kellogg’s and the overall mean, with Ralston Purina as the reference group, while controlling for the effect of protein.

Question 10

For each of the following circumstances select a coding strategy that you think would best allow you to make the comparisons that are of most direct interest. Select a coding strategy, describe which group would be the reference group (if needed) or which order the groups would be in (if needed).

10.1) You are conducting a research study on the effects of an educational intervention. There are 4 conditions in the study: no intervention, school intervention, home intervention, and school + home intervention. We would like to test whether the home intervention differs from the no intervention condition, whether the school (only) differs from home (only) intervention, and whether the school + home intervention condition differs from the school (only) condition.

For this question, the best coding strategy on the effects of an educational intervention would be Helmert coding. This is because where we want to make specific comparisons among the conditions, and Helmert coding allows us to make comparisons between each level of a factor and the average of subsequent levels.

For our order and reference, we would have the following:

No intervention
Home intervention
School intervention
School + Home intervention

Then our comparisons are

H1. No intervention vs. Home intervention

H2. School intervention vs. Home intervention

H3. School + Home intervention vs. School intervention

Our matrix might look something like this:

Intervention	H1	H2	H3
Home intervention	-1	-1	-1
No intervention	1	-1	-1
School + Home intervention	0	2	-1
School intervention	0	0	3

In Helmert coding, the reference group is implicitly defined as the last group, but it is not used in the traditional sense of reference groups in dummy or effect coding. Instead, each level (except the last one) is compared with the average of the subsequent levels.

Overall, Helmert coding is ideal for making specific, sequential comparisons among multiple levels of a factor. It allows us to compare each level with the mean of the subsequent levels, which aligns perfectly with this research question.

10.2) You are examining whether there are wage differences for individuals of different ethnicities. In the data you have information from individuals who identify as 5 different races/ethnicities: White/Caucasian, African American, LatinX, Asian/Asian American, Native American/First Nations. You would like to test whether there are any groups that are making significantly more or less money than the group average.

For this question, the best coding strategy for examining whether there are wage differences for individuals of different ethnicities and testing whether any groups make significantly more or less money than the group average is effect coding. Effect coding is appropriate when one wants to compare each level of a categorical variable to the overall mean of the dependent variable. In this context, it will help determine if any ethnic group has wages that significantly differ from the average wage across all groups.

Assuming the following ethnic groups:

White/Caucasian
African American
LatinX
Asian/Asian American
Native American/First Nations

In effect coding, each group is compared to the overall mean wage, and the coding for each group would look like this:

Ethnicity	E1	E2	E3	E4	E5
White/Caucasian	1	0	0	0	-1
African American	0	1	0	0	-1
LatinX	0	0	1	0	-1
Asian/Asian American	0	0	0	1	-1
Native American/First Nations	-1	-1	-1	-1	4

Overall, effect coding compares each group to the overall mean, and it’s useful in this context because it allows us to see if each group’s wages are significantly higher or lower than the average wage. The reference is the overall mean, which provides a good baseline for these comparisons.

10.3) You are a statistics savvy parent who regularly buys cereal for your children. Currently the household buys Kellogg’s cereal, but you are considering changing brands to find something with a lower sugar content. You want to know if there are any cereal manufacturer’s whose cereal has significantly lower sugar content than Kellogg’s.

For this question, the best coding strategy where we want to compare the sugar content of cereals from different manufacturers to that of Kellogg’s cereals,is dummy coding. Dummy coding allows you to compare each group to a specific reference group, which in this case will be Kellogg’s.

Our reference group would be Kellogg’s cereal.

Assuming the following manufacturers:

General Mills
Kellogg’s
Nabisco
Post
Quaker Oats
Ralston Purina

In dummy coding, each group except the reference group (Kellogg’s) will have a dummy variable coded as 1 if the observation belongs to that group, and 0 otherwise. The reference group (Kellogg’s) will not have a dummy variable and will be implicitly coded as the baseline.

Our matrix might look like

Manufacturer	D1	D2	D3	D4	D5
1 (G)	1	0	0	0	0
2 (K)	0	0	0	0	0
3 (N)	0	1	0	0	0
4 (P)	0	0	1	0	0
5 (Q)	0	0	0	1	0
6 (R)	0	0	0	0	1

Overall, dummy coding allows for direct comparisons of each manufacturer’s sugar content against the reference group (Kellogg’s). This approach provides clear insights into whether cereals from other manufacturers have significantly lower sugar content than Kellogg’s cereals.