# Load Dataset
library(MASS)
data(UScereal)
UScereal$mfr.num <- as.numeric(UScereal$mfr)
UScereal <- UScereal
Fit a “full model” predicting calories using all 8 numeric predictors in the data set (do not use mfr, mfr.num, or vitamins). Provide the code and output a summary which includes coefficient estimates, ANOVA table (SSregression, SSerror, df, F, etc), and R2 results.
# Fit the full model predicting calories using all 8 numeric predictors
full_model <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
model_summary <- summary(full_model)
print(model_summary)
##
## Call:
## lm(formula = calories ~ protein + fat + sodium + fibre + carbo +
## sugars + shelf + potassium, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.917 -4.075 1.079 4.415 15.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.15905 3.74046 -5.657 5.46e-07 ***
## protein 4.87257 0.97909 4.977 6.51e-06 ***
## fat 9.27406 0.75034 12.360 < 2e-16 ***
## sodium 0.01207 0.01002 1.204 0.233676
## fibre 2.80820 0.71883 3.907 0.000254 ***
## carbo 4.86987 0.17491 27.842 < 2e-16 ***
## sugars 4.54340 0.21068 21.565 < 2e-16 ***
## shelf 0.56679 1.39782 0.405 0.686668
## potassium -0.11594 0.02718 -4.265 7.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.951 on 56 degrees of freedom
## Multiple R-squared: 0.9858, Adjusted R-squared: 0.9838
## F-statistic: 486 on 8 and 56 DF, p-value: < 2.2e-16
# Calculate ANOVA table
anova_results <- anova(full_model)
print(anova_results)
## Analysis of Variance Table
##
## Response: calories
## Df Sum Sq Mean Sq F value Pr(>F)
## protein 1 124262 124262 1965.7961 < 2.2e-16 ***
## fat 1 26971 26971 426.6799 < 2.2e-16 ***
## sodium 1 4966 4966 78.5631 2.999e-12 ***
## fibre 1 17242 17242 272.7643 < 2.2e-16 ***
## carbo 1 40981 40981 648.3188 < 2.2e-16 ***
## sugars 1 30170 30170 477.2777 < 2.2e-16 ***
## shelf 1 14 14 0.2196 0.6411
## potassium 1 1150 1150 18.1925 7.757e-05 ***
## Residuals 56 3540 63
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Extract R² from the model summary
r_squared <- model_summary$r.squared
cat("R²: ", r_squared, "\n")
## R²: 0.9858005
One can see here that I have fitted a full model predicting calories using all 8 numeric predictors in the data set. We find that given our \(R^2\) value being 0.9858, this means about 98.58% of the variance in calories is explained by the model. Moreover, our F statistic for the model, F(8, 56) = 486, p <0.001, shows the overall model is highly significant, indicating that most of the numeric predictors significantly contribute to predicting the calorie content of the cereals, with sugars and fat having the largest positive impacts.
Using the model fit in #1, estimate Shrunken-R using the Browne Equation (Eq 7.1 in RALM), double cross validation, and leave-one-out cross validation.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# Number of observations
n <- nrow(UScereal)
# Number of predictors
p <- length(full_model$coefficients) - 1 # Subtract 1 for the intercept
# Calculate shrunken R^2 using Browne's equation
R2 <- summary(full_model)$r.squared
R2_shrunken_browne <- 1 - ((1 - R2) * (n - 1) / (n - p - 1))
cat("Shrunken R^2 using Browne's Equation: ", R2_shrunken_browne, "\n")
## Shrunken R^2 using Browne's Equation: 0.983772
Our shrunken \(R^2\) using Browne’s equation is 0.983772. I did so through the hardcode method.
Our value here is that this value adjusts the \(R^2\) to account for the number of predictors and the sample size, giving you a more realistic measure of the model’s explanatory power.
library(caret)
# Double cross-validation
set.seed(123) # For reproducibility
double_cv <- train(
calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium,
data = UScereal,
method = "lm",
trControl = trainControl(method = "repeatedcv", number = 10, repeats = 2)
)
R2_double_cv <- double_cv$results$Rsquared
cat("Shrunken-R² (Double Cross-Validation): ", R2_double_cv, "\n")
## Shrunken-R² (Double Cross-Validation): 0.9742866
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
# Double Cross-Validation
set.seed(123) # For reproducibility
indices <- sample(1:n, size = n, replace = FALSE)
N <- nrow(UScereal)
half <- floor(n / 2)
rows <- sample.int(N, N/2)
train1 <- UScereal[indices[1:half], ]
train2 <- UScereal[indices[(half+1):n], ]
# Model on first half
train1a <- UScereal[rows,]
test1a <- UScereal[-rows,]
model1 <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = train1)
pred1 <- predict(model1, newdata = test1a)
SSres1 <- sum((pred1 - train2$calories)^2)
SStot1 <- sum((train2$calories - mean(train2$calories))^2)
shrunkenRCV1 <- cor(pred1, test1a$calories)^2
R2_1 <- 1 - (SSres1 / SStot1)
# Model on second half
train2a <- UScereal[-rows,]
test2a <- UScereal[rows,]
model2 <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = train2)
pred2 <- predict(model2, newdata = test2a)
SSres2 <- sum((pred2 - train1$calories)^2)
SStot2 <- sum((train1$calories - mean(train1$calories))^2)
shrunkenRCV2 <- cor(pred2, test2a$calories)^2
R2_2 <- 1 - (SSres2 / SStot2)
# Average R² for Double Cross-Validation
R2_double_cv <- (R2_1 + R2_2) / 2
shrunkenRCV_double <- mean(c(shrunkenRCV1, shrunkenRCV2))
cat("Shrunken-R² (Double Cross-Validation): ", R2_double_cv, "\n")
## Shrunken-R² (Double Cross-Validation): -1.424213
cat("Shrunken-R² (Double Cross-Validation): ", shrunkenRCV_double, "\n")
## Shrunken-R² (Double Cross-Validation): 0.9771359
I conducted two methods of determining the double-cross shrunken \(R^2\). The first was with the other was with the trainControl() function from the caret library in R. The second was hardcoded. We find that for the caret library , our double-cross validation \(R^2\) is 0.9742866. For our hardcoded, the double-cross validation \(R^2\) is 0.9580511.
Our value here is that this value is an average \(R^2\) which comes from multiple iterations of cross-validation. It provides a reliable estimate of the model’s predictive power on unseen data by repeating the validation process.
library(caret)
# Leave-one-out cross-validation
loocv <- train(
calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium,
data = UScereal,
method = "lm",
trControl = trainControl(method = "LOOCV")
)
print(loocv)
## Linear Regression
##
## 65 samples
## 8 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 64, 64, 64, 64, 64, 64, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 11.01019 0.9702071 7.412448
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
R2_loocv <- loocv$results$Rsquared
cat("Shrunken-R² (LOOCV):", R2_loocv, "\n")
## Shrunken-R² (LOOCV): 0.9702071
library(boot)
# Leave-One-Out Cross-Validation (LOOCV)
loocv_model <- glm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
loocv_results <- cv.glm(UScereal, loocv_model, K = n)
R2_loocv <- 1 - (loocv_results$delta[1] / var(UScereal$calories) * (n - 1) / n)
cat("Shrunken-R² (LOOCV): ", R2_loocv, "\n")
## Shrunken-R² (LOOCV): 0.9693577
library(lmf)
# Fit the initial model to extract leverage and residuals
model_all <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
# Extract leverage values
UScereal$leverage <- lm.extract(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)$leverage
# Extract residuals from the full model
UScereal$resall <- residuals(model_all)
# Calculate leave-one-out residuals
UScereal$resloo <- UScereal$resall / (1 - UScereal$leverage)
# Calculate leave-one-out predicted values
UScereal$Ypredloo <- UScereal$calories - UScereal$resloo
# Calculate LOOCV R²
R2_loocv <- cor(UScereal$Ypredloo, UScereal$calories)^2
cat("Shrunken-R² (LOOCV): ", R2_loocv, "\n")
## Shrunken-R² (LOOCV): 0.9702071
I conducted three methods of determining the leave-one out shrunken \(R^2\). The first was with the other was with the trainControl() function from the caret library in R. The second was with the boot library in R. And the last was with the lmf library. We find that for the caret library , our double-cross validation \(R^2\) is 0.9702071. For our boot library, the double-cross validation \(R^2\) is 0.9693577. For our lmf library, the double-cross validation \(R^2\) is 0.9702071, the same as our caret library.
This value is obtained by training the model on all but one observation and testing on the excluded observation, repeated for each data point. It is especially useful for small datasets to assess the model’s generalizability.
Conduct forward, backwards, stepwise, and all-subsets regression using the same predictors and outcome as in #1. Examine whether the final model selected by these four methods is the same or different.
# Forward Selection
forward_model <- step(lm(calories ~ 1, data = UScereal),
direction = "forward",
scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium)
## Start: AIC=538.38
## calories ~ 1
##
## Df Sum of Sq RSS AIC
## + carbo 1 155083 94213 477.13
## + protein 1 124262 125034 495.53
## + fat 1 86831 162464 512.55
## + sodium 1 69672 179623 519.07
## + sugars 1 61156 188139 522.09
## + potassium 1 56626 192670 523.63
## + shelf 1 45313 203982 527.34
## + fibre 1 37572 211723 529.76
## <none> 249295 538.38
##
## Step: AIC=477.13
## calories ~ carbo
##
## Df Sum of Sq RSS AIC
## + sugars 1 69482 24730 392.19
## + fat 1 51294 42919 428.02
## + protein 1 26810 67403 457.36
## + potassium 1 21613 72600 462.19
## + fibre 1 13523 80690 469.06
## + shelf 1 13051 81162 469.44
## + sodium 1 11503 82710 470.67
## <none> 94213 477.13
##
## Step: AIC=392.19
## calories ~ carbo + sugars
##
## Df Sum of Sq RSS AIC
## + fat 1 15660.6 9069.8 328.99
## + protein 1 10324.1 14406.4 359.07
## + fibre 1 5610.6 19119.9 377.47
## + potassium 1 5415.6 19314.9 378.13
## + sodium 1 1768.7 22961.8 389.37
## + shelf 1 1139.9 23590.5 391.12
## <none> 24730.5 392.19
##
## Step: AIC=328.99
## calories ~ carbo + sugars + fat
##
## Df Sum of Sq RSS AIC
## + protein 1 4357.9 4711.9 288.43
## + fibre 1 3322.7 5747.1 301.33
## + potassium 1 2529.6 6540.3 309.74
## + sodium 1 812.8 8257.0 324.89
## <none> 9069.8 328.99
## + shelf 1 116.4 8953.4 330.15
##
## Step: AIC=288.43
## calories ~ carbo + sugars + fat + protein
##
## Df Sum of Sq RSS AIC
## + potassium 1 172.139 4539.8 288.01
## <none> 4711.9 288.43
## + shelf 1 11.634 4700.3 290.26
## + sodium 1 4.929 4707.0 290.36
## + fibre 1 4.834 4707.1 290.36
##
## Step: AIC=288.01
## calories ~ carbo + sugars + fat + protein + potassium
##
## Df Sum of Sq RSS AIC
## + fibre 1 903.98 3635.8 275.57
## <none> 4539.8 288.01
## + sodium 1 35.09 4504.7 289.50
## + shelf 1 0.09 4539.7 290.00
##
## Step: AIC=275.57
## calories ~ carbo + sugars + fat + protein + potassium + fibre
##
## Df Sum of Sq RSS AIC
## <none> 3635.8 275.57
## + sodium 1 85.571 3550.3 276.02
## + shelf 1 4.339 3631.5 277.50
summary(forward_model)
##
## Call:
## lm(formula = calories ~ carbo + sugars + fat + protein + potassium +
## fibre, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.003 -5.273 1.323 4.424 17.050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.60795 3.15131 -6.222 5.85e-08 ***
## carbo 4.93735 0.16429 30.052 < 2e-16 ***
## sugars 4.56824 0.20790 21.973 < 2e-16 ***
## fat 9.31086 0.74057 12.573 < 2e-16 ***
## protein 4.84992 0.97164 4.991 5.79e-06 ***
## potassium -0.10691 0.02586 -4.134 0.000116 ***
## fibre 2.69062 0.70853 3.797 0.000352 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.917 on 58 degrees of freedom
## Multiple R-squared: 0.9854, Adjusted R-squared: 0.9839
## F-statistic: 653.1 on 6 and 58 DF, p-value: < 2.2e-16
cat("Predictors in Final Model (Forward Selection): ", names(coef(forward_model))[-1], "\n")
## Predictors in Final Model (Forward Selection): carbo sugars fat protein potassium fibre
# Fit the initial null model
model_none <- lm(calories ~ 1, data = UScereal)
# Forward Selection
add1(model_none, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")
# Iteratively add the variable with the highest F/lowest p-value
model_step <- update(model_none, ~ . + carbo)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")
model_step <- update(model_none, ~ . + carbo +sugars)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")
model_step <- update(model_none, ~ . + carbo + sugars + fat)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")
model_step <- update(model_none, ~ . + carbo + sugars + fat + protein)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")
final_model_forward <- model_step
cat("Predictors in Final Model (Forward Selection): ", names(coef(final_model_forward))[-1], "\n")
## Predictors in Final Model (Forward Selection): carbo sugars fat protein
I completed this problem in two different ways. The first is looking at the AIC, or Akaike Information Criterion. This is a metric used to estimate the prediction error and the relative quality of statistical models for a specific dataset. When comparing a set of models for the data, AIC evaluates the quality of each model in relation to the others. Consequently, AIC serves as a tool for model selection. I have used this in the past to derive my model selection, and to test my models.
The second is the manner that was suggested by the course- the manual method. These two give different predictors in the final model for forward selection, namely the manual method gives, carbo, sugars, fat, protein. You can see each at the end of each output.
# Backward Elimination
backward_model <- step(lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal),
direction = "backward",
scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium)
## Start: AIC=277.83
## calories ~ protein + fat + sodium + fibre + carbo + sugars +
## shelf + potassium
##
## Df Sum of Sq RSS AIC
## - shelf 1 10 3550 276.03
## - sodium 1 92 3631 277.50
## <none> 3540 277.83
## - fibre 1 965 4505 291.50
## - potassium 1 1150 4690 294.12
## - protein 1 1566 5105 299.64
## - fat 1 9657 13196 361.37
## - sugars 1 29397 32936 420.82
## - carbo 1 49000 52539 451.17
##
## Step: AIC=276.03
## calories ~ protein + fat + sodium + fibre + carbo + sugars +
## potassium
##
## Df Sum of Sq RSS AIC
## - sodium 1 86 3636 275.57
## <none> 3550 276.03
## - fibre 1 954 4505 289.50
## - potassium 1 1153 4704 292.31
## - protein 1 1555 5105 297.64
## - fat 1 9916 13467 360.68
## - sugars 1 29967 33517 419.95
## - carbo 1 51524 55075 452.23
##
## Step: AIC=275.57
## calories ~ protein + fat + fibre + carbo + sugars + potassium
##
## Df Sum of Sq RSS AIC
## <none> 3636 275.57
## - fibre 1 904 4540 288.01
## - potassium 1 1071 4707 290.36
## - protein 1 1562 5198 296.80
## - fat 1 9909 13545 359.06
## - sugars 1 30266 33902 418.69
## - carbo 1 56615 60251 456.07
summary(backward_model)
##
## Call:
## lm(formula = calories ~ protein + fat + fibre + carbo + sugars +
## potassium, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.003 -5.273 1.323 4.424 17.050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.60795 3.15131 -6.222 5.85e-08 ***
## protein 4.84992 0.97164 4.991 5.79e-06 ***
## fat 9.31086 0.74057 12.573 < 2e-16 ***
## fibre 2.69062 0.70853 3.797 0.000352 ***
## carbo 4.93735 0.16429 30.052 < 2e-16 ***
## sugars 4.56824 0.20790 21.973 < 2e-16 ***
## potassium -0.10691 0.02586 -4.134 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.917 on 58 degrees of freedom
## Multiple R-squared: 0.9854, Adjusted R-squared: 0.9839
## F-statistic: 653.1 on 6 and 58 DF, p-value: < 2.2e-16
cat("Predictors in Final Model (Backward Selection): ", names(coef(backward_model))[-1], "\n")
## Predictors in Final Model (Backward Selection): protein fat fibre carbo sugars potassium
# Fit the initial full model
model_all <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
# Backward Selection
drop1(model_all, test = "F")
# Iteratively drop the variable with the lowest F/highest p-value
model_step <- update(model_all, ~ . -shelf)
drop1(model_step, test = "F")
model_step <- update(model_all, ~ . -shelf -sodium)
drop1(model_step, test = "F")
final_model_backward <- model_step
cat("Predictors in Final Model (Backward Selection): ", names(coef(final_model_backward))[-1], "\n")
## Predictors in Final Model (Backward Selection): protein fat fibre carbo sugars potassium
I completed this problem in two different ways. The first is looking at the AIC, or Akaike Information Criterion. This is a metric used to estimate the prediction error and the relative quality of statistical models for a specific dataset. When comparing a set of models for the data, AIC evaluates the quality of each model in relation to the others. Consequently, AIC serves as a tool for model selection. I have used this in the past to derive my model selection, and to test my models.
The second is the manner that was suggested by the course - the manual method. These two give the same predictors in the final model for backward selection, namely, protein, fat, fibre, carbo, sugars, potassium. You can see each at the end of each output.
# Stepwise Selection
stepwise_model <- step(lm(calories ~ 1, data = UScereal),
direction = "both",
scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium)
## Start: AIC=538.38
## calories ~ 1
##
## Df Sum of Sq RSS AIC
## + carbo 1 155083 94213 477.13
## + protein 1 124262 125034 495.53
## + fat 1 86831 162464 512.55
## + sodium 1 69672 179623 519.07
## + sugars 1 61156 188139 522.09
## + potassium 1 56626 192670 523.63
## + shelf 1 45313 203982 527.34
## + fibre 1 37572 211723 529.76
## <none> 249295 538.38
##
## Step: AIC=477.13
## calories ~ carbo
##
## Df Sum of Sq RSS AIC
## + sugars 1 69482 24730 392.19
## + fat 1 51294 42919 428.02
## + protein 1 26810 67403 457.36
## + potassium 1 21613 72600 462.19
## + fibre 1 13523 80690 469.06
## + shelf 1 13051 81162 469.44
## + sodium 1 11503 82710 470.67
## <none> 94213 477.13
## - carbo 1 155083 249295 538.38
##
## Step: AIC=392.19
## calories ~ carbo + sugars
##
## Df Sum of Sq RSS AIC
## + fat 1 15661 9070 328.99
## + protein 1 10324 14406 359.07
## + fibre 1 5611 19120 377.47
## + potassium 1 5416 19315 378.13
## + sodium 1 1769 22962 389.37
## + shelf 1 1140 23591 391.12
## <none> 24730 392.19
## - sugars 1 69482 94213 477.13
## - carbo 1 163409 188139 522.09
##
## Step: AIC=328.99
## calories ~ carbo + sugars + fat
##
## Df Sum of Sq RSS AIC
## + protein 1 4358 4712 288.43
## + fibre 1 3323 5747 301.33
## + potassium 1 2530 6540 309.74
## + sodium 1 813 8257 324.89
## <none> 9070 328.99
## + shelf 1 116 8953 330.15
## - fat 1 15661 24730 392.19
## - sugars 1 33849 42919 428.02
## - carbo 1 134563 143633 506.54
##
## Step: AIC=288.43
## calories ~ carbo + sugars + fat + protein
##
## Df Sum of Sq RSS AIC
## + potassium 1 172 4540 288.01
## <none> 4712 288.43
## + shelf 1 12 4700 290.26
## + sodium 1 5 4707 290.36
## + fibre 1 5 4707 290.36
## - protein 1 4358 9070 328.99
## - fat 1 9694 14406 359.07
## - sugars 1 31092 35804 418.24
## - carbo 1 75664 80376 470.81
##
## Step: AIC=288.01
## calories ~ carbo + sugars + fat + protein + potassium
##
## Df Sum of Sq RSS AIC
## + fibre 1 904 3636 275.57
## <none> 4540 288.01
## - potassium 1 172 4712 288.43
## + sodium 1 35 4505 289.50
## + shelf 1 0 4540 290.01
## - protein 1 2000 6540 309.74
## - fat 1 9063 13603 357.34
## - sugars 1 30725 35265 419.26
## - carbo 1 56223 60763 454.62
##
## Step: AIC=275.57
## calories ~ carbo + sugars + fat + protein + potassium + fibre
##
## Df Sum of Sq RSS AIC
## <none> 3636 275.57
## + sodium 1 86 3550 276.03
## + shelf 1 4 3631 277.50
## - fibre 1 904 4540 288.01
## - potassium 1 1071 4707 290.36
## - protein 1 1562 5198 296.80
## - fat 1 9909 13545 359.06
## - sugars 1 30266 33902 418.69
## - carbo 1 56615 60251 456.07
summary(stepwise_model)
##
## Call:
## lm(formula = calories ~ carbo + sugars + fat + protein + potassium +
## fibre, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.003 -5.273 1.323 4.424 17.050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.60795 3.15131 -6.222 5.85e-08 ***
## carbo 4.93735 0.16429 30.052 < 2e-16 ***
## sugars 4.56824 0.20790 21.973 < 2e-16 ***
## fat 9.31086 0.74057 12.573 < 2e-16 ***
## protein 4.84992 0.97164 4.991 5.79e-06 ***
## potassium -0.10691 0.02586 -4.134 0.000116 ***
## fibre 2.69062 0.70853 3.797 0.000352 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.917 on 58 degrees of freedom
## Multiple R-squared: 0.9854, Adjusted R-squared: 0.9839
## F-statistic: 653.1 on 6 and 58 DF, p-value: < 2.2e-16
cat("Predictors in Final Model (Stepwise Selection): ", names(coef(stepwise_model))[-1], "\n")
## Predictors in Final Model (Stepwise Selection): carbo sugars fat protein potassium fibre
# Stepwise Selection
model_none <- lm(calories ~ 1, data = UScereal)
# Iteratively add and drop variables
# Initial addition
add1(model_none, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")
model_step <- update(model_none, ~ . + carbo)
# Check for dropping variables
drop1(model_step, test = "F")
# Add next variable
model_step <- update(model_step, ~ . + sugars)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")
drop1(model_step, test = "F")
# Add next variable
model_step <- update(model_step, ~ . + fat)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")
drop1(model_step, test = "F")
# Add next variable
model_step <- update(model_step, ~ . + protein)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")
drop1(model_step, test = "F")
# Add next variable
model_step <- update(model_step, ~ . + protein)
add1(model_step, scope = ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, test = "F")
drop1(model_step, test = "F")
# Continue this process until neither adding nor dropping variables improves the model significantly
final_model_stepwise <- model_step
# Summary of the final model
summary(final_model_stepwise)
##
## Call:
## lm(formula = calories ~ carbo + sugars + fat + protein, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.177 -4.693 1.419 4.939 24.758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.7698 3.5127 -5.343 1.49e-06 ***
## carbo 4.9247 0.1587 31.040 < 2e-16 ***
## sugars 4.2107 0.2116 19.898 < 2e-16 ***
## fat 8.8589 0.7973 11.111 3.41e-16 ***
## protein 4.0506 0.5438 7.449 4.28e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.862 on 60 degrees of freedom
## Multiple R-squared: 0.9811, Adjusted R-squared: 0.9798
## F-statistic: 778.6 on 4 and 60 DF, p-value: < 2.2e-16
cat("Predictors in Final Model (Stepwise Selection): ", names(coef(final_model_stepwise))[-1], "\n")
## Predictors in Final Model (Stepwise Selection): carbo sugars fat protein
I completed this problem in two different ways. The first is looking at the AIC, or Akaike Information Criterion. This is a metric used to estimate the prediction error and the relative quality of statistical models for a specific dataset. When comparing a set of models for the data, AIC evaluates the quality of each model in relation to the others. Consequently, AIC serves as a tool for model selection. I have used this in the past to derive my model selection, and to test my models.
The second is the manner that was suggested by the course - the manual method. These two give slightly different predictors in the final model for stepwise selection, namely the manual methods gives carbo, sugars, fat, and protein. You can see each at the end of each output.
library(leaps)
# All-Subsets Regression
all_subsets <- regsubsets(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
summary(all_subsets)
## Subset selection object
## Call: regsubsets.formula(calories ~ protein + fat + sodium + fibre +
## carbo + sugars + shelf + potassium, data = UScereal)
## 8 Variables (and intercept)
## Forced in Forced out
## protein FALSE FALSE
## fat FALSE FALSE
## sodium FALSE FALSE
## fibre FALSE FALSE
## carbo FALSE FALSE
## sugars FALSE FALSE
## shelf FALSE FALSE
## potassium FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## protein fat sodium fibre carbo sugars shelf potassium
## 1 ( 1 ) " " " " " " " " "*" " " " " " "
## 2 ( 1 ) " " " " " " " " "*" "*" " " " "
## 3 ( 1 ) " " "*" " " " " "*" "*" " " " "
## 4 ( 1 ) "*" "*" " " " " "*" "*" " " " "
## 5 ( 1 ) "*" "*" " " " " "*" "*" " " "*"
## 6 ( 1 ) "*" "*" " " "*" "*" "*" " " "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
summary_subsets <- summary(all_subsets)
# Find the model with the highest adjusted R²
max_adj_r2 <- which.max(summary_subsets$adjr2)
best_model <- summary_subsets$which[max_adj_r2, ]
# Extract the names of the predictors in the final model
predictors <- names(best_model)[best_model][2:length(names(best_model)[best_model])]
cat("Predictors in Final Model (All-Subsets Regression): ", predictors, "\n")
## Predictors in Final Model (All-Subsets Regression): protein fat sodium fibre carbo sugars potassium
library(leaps)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Prepare data for all-subsets regression
datasmall <- UScereal %>% select(calories, protein, fat, sodium, fibre, carbo, sugars, shelf, potassium)
# All-Subsets Regression
subsetmodels <- regsubsets(calories ~ ., data = datasmall, nbest = 1)
summary_subsetmodels <- summary(subsetmodels)
# Print summary of all-subsets models
print(summary_subsetmodels)
## Subset selection object
## Call: regsubsets.formula(calories ~ ., data = datasmall, nbest = 1)
## 8 Variables (and intercept)
## Forced in Forced out
## protein FALSE FALSE
## fat FALSE FALSE
## sodium FALSE FALSE
## fibre FALSE FALSE
## carbo FALSE FALSE
## sugars FALSE FALSE
## shelf FALSE FALSE
## potassium FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## protein fat sodium fibre carbo sugars shelf potassium
## 1 ( 1 ) " " " " " " " " "*" " " " " " "
## 2 ( 1 ) " " " " " " " " "*" "*" " " " "
## 3 ( 1 ) " " "*" " " " " "*" "*" " " " "
## 4 ( 1 ) "*" "*" " " " " "*" "*" " " " "
## 5 ( 1 ) "*" "*" " " " " "*" "*" " " "*"
## 6 ( 1 ) "*" "*" " " "*" "*" "*" " " "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
# Find the model with the highest adjusted R²
max_adj_r2 <- which.max(summary_subsetmodels$adjr2)
best_model <- summary_subsetmodels$which[max_adj_r2, ]
# Extract the names of the predictors in the final model
predictors <- names(best_model)[best_model][2:length(names(best_model)[best_model])]
# Fit the final model with the best predictors
final_model_all_subsets <- lm(as.formula(paste("calories ~", paste(predictors, collapse = " + "))), data = UScereal)
# Summary of the final model
summary(final_model_all_subsets)
##
## Call:
## lm(formula = as.formula(paste("calories ~", paste(predictors,
## collapse = " + "))), data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.6044 -4.8410 0.9581 4.6033 15.2115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.398829 3.212851 -6.349 3.83e-08 ***
## protein 4.839791 0.968566 4.997 5.86e-06 ***
## fat 9.314509 0.738204 12.618 < 2e-16 ***
## sodium 0.011574 0.009875 1.172 0.246027
## fibre 2.781245 0.710481 3.915 0.000244 ***
## carbo 4.884618 0.169831 28.762 < 2e-16 ***
## sugars 4.553708 0.207605 21.934 < 2e-16 ***
## potassium -0.113706 0.026423 -4.303 6.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.892 on 57 degrees of freedom
## Multiple R-squared: 0.9858, Adjusted R-squared: 0.984
## F-statistic: 563.6 on 7 and 57 DF, p-value: < 2.2e-16
cat("Predictors in Final Model (All-Subsets Regression): ", predictors, "\n")
## Predictors in Final Model (All-Subsets Regression): protein fat sodium fibre carbo sugars potassium
I’ve completed this type of analysis prior to this, and I am sharing both methods here. I have adjusted the code such that it is in line with what was provided in the course. As a result, we find that the predictors in the final model to be protein, fat, sodium, fibre, carbo, sugars, and potassium. We did this by selecting the model with the highest adjusted \(R^2\) value.
The final models here are all slightly different except for Backward Selection and All-Subsets Regression, which both resulted in the same set of predictors: protein, fat, sodium, fibre, carbo, sugars, potassium.
The differences are due to a set of important distinctions. Firstly, the selection criteria differs between each models. Forward Selection adds predictors based on the highest improvement in the model at each step, which can lead to including predictors that marginally improve the model. In contstrat, Backward Selection starts with all predictors and removes the least significant ones, focusing on the overall fit. Stepwise Selection combines forward and backward steps, which can lead to a more conservative model by balancing both addition and removal, while All-Subsets Regression evaluates all possible subsets and selects the best model based on the highest adjusted \(R^2\).
Secondly, each method handles stopping rules differently. Forward and Backward selection stop when no further significant improvements are found based on specific criteria (F-Value/p-value in our case). Stepwise selection is a combination of both forward and backward criteria, which can result in a more streamlined model. All-Subsets selection uses an exhaustive search and selects the model with the highest adjusted \(R^2\), potentially leading to a model with more predictors.
Thirdly, each method handles multicollinearity slightly differently. For forward selection, because predictors are added sequentially, forward selection can inadvertently add correlated predictors without explicitly checking for multicollinearity. If two predictors are highly correlated, the one added first may account for most of the shared variance, making the subsequent addition of the correlated predictor less impactful. Backward selection may be better at detecting multicollinearity because it starts with all predictors and evaluates the impact of removing each one. If removing a predictor significantly reduces multicollinearity, it might be favored for removal. However, it still doesn’t explicitly check for multicollinearity; it relies on the overall model fit. Stepwise selection has a mixed approach and may handle multicollinearity slightly better than forward selection because it also considers removal. However, like forward and backward selection, it does not explicitly check for multicollinearity. All-subsets regression implicitly handles multicollinearity better because it evaluates all possible models and selects the one with the best overall fit. If multicollinearity reduces how well each model performs, models with high multicollinearity are less likely to be selected. However, it still does not explicitly check for multicollinearity; it relies on overall model performance metrics like adjusted \(R^2\). Because of its different handling, it will change how the final models might be different, as we find in our case.
Use the final model(s) from the variable selection methods and the full model (the model with all predictors) to generate a prediction interval for a company with the following scores on the predictor variables: protein = 2, fat = 0, sodium = 200, fibre = 4, carbo = 20, sugars = 8, shelf = 1, potassium = 100. (Note that if some of these predictors are not included in the model then the prediction interval will not change based on the values for those predictors.)
# Company Scores
company_scores <- data.frame(
protein = 2,
fat = 0,
sodium = 200,
fibre = 4,
carbo = 20,
sugars = 8,
shelf = 1,
potassium = 100
)
# Full Model
full_model <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
# Prediction Interval for Full Model
full_model_pred <- predict(full_model, newdata = company_scores, interval = "prediction")
# Confidence Interval for the Full Model
full_model_confidence <- predict(full_model, newdata = company_scores, interval = "confidence")
cat("Prediction Interval for Full Model:", full_model_pred, "\n")
## Prediction Interval for Full Model: 124.9493 108.3175 141.5811
cat("Confidence Interval for Full Model:", full_model_confidence, "\n")
## Confidence Interval for Full Model: 124.9493 120.1588 129.7398
For a company with predictor values: protein = 2, fat = 0, sodium = 200, fibre = 4, carbo = 20, sugars = 8, shelf = 1, and potassium = 100, the predicted calories is 124.95. The 95% prediction interval for the calories is [108.32, 141.58]. The 95% confidence interval is [120.16, 129.74].
# We have our two possible directions for our model.
# I provide both here
# Prediction Interval for Forward Model - AIC
forward_model_pred_AIC <- predict(forward_model, newdata = company_scores, interval = "prediction")
# Confidence Interval for the Forward Model - AIC
forward_model_confidence_AIC <- predict(forward_model, newdata = company_scores, interval = "confidence")
cat("Prediction Interval for Forward Model - AIC:\n")
## Prediction Interval for Forward Model - AIC:
print(forward_model_pred_AIC)
## fit lwr upr
## 1 125.4564 109.1072 141.8056
cat("Confidence Interval for Forward Model - AIC:\n")
## Confidence Interval for Forward Model - AIC:
print(forward_model_confidence_AIC)
## fit lwr upr
## 1 125.4564 121.4415 129.4713
# Prediction Interval for Forward Model - Manual
forward_model_pred_manual <- predict(final_model_forward, newdata = company_scores, interval = "prediction")
# Confidence Interval for the Forward Model - Manual
forward_model_confidence_manual <- predict(final_model_forward, newdata = company_scores, interval = "confidence")
cat("Prediction Interval for Forward Model - Manual:", forward_model_pred_manual, "\n")
## Prediction Interval for Forward Model - Manual: 121.5102 103.5117 139.5087
cat("Confidence Interval for Forward Model - Manual:", forward_model_confidence_manual, "\n")
## Confidence Interval for Forward Model - Manual: 121.5102 118.3919 124.6284
For a company with predictor values: protein = 2, fat = 0, sodium = 200, fibre = 4, carbo = 20, sugars = 8, shelf = 1, and potassium = 100, the predicted calories is 121.51. The 95% prediction interval for the calories is [103.51, 139.51]. The 95% confidence interval for the calories is [118.39, 124.63].
# We have our two possible directions for our model.
# I provide both here
# Prediction Interval for Backward Model - AIC
backward_model_pred_AIC <- predict(backward_model, newdata = company_scores, interval = "prediction")
# Confidence Interval for the Backward Model - AIC
backward_model_confidence_AIC <- predict(backward_model, newdata = company_scores, interval = "confidence")
cat("Prediction Interval for Backward Model - AIC:\n")
## Prediction Interval for Backward Model - AIC:
print(backward_model_pred_AIC)
## fit lwr upr
## 1 125.4564 109.1072 141.8056
cat("Confidence Interval for Backward Model - AIC:\n")
## Confidence Interval for Backward Model - AIC:
print(backward_model_confidence_AIC)
## fit lwr upr
## 1 125.4564 121.4415 129.4713
# Prediction Interval for Backward Model - Manual
backward_model_pred_manual <- predict(final_model_backward, newdata = company_scores, interval = "prediction")
# Confidence Interval for the Backward Model - Manual
backward_model_confidence_manual <- predict(final_model_backward, newdata = company_scores, interval = "confidence")
cat("Prediction Interval for Backward Model - Manual:\n")
## Prediction Interval for Backward Model - Manual:
print(backward_model_pred_manual)
## fit lwr upr
## 1 125.4564 109.1072 141.8056
cat("Confidence Interval for Backward Model - Manual:\n")
## Confidence Interval for Backward Model - Manual:
print(backward_model_confidence_manual)
## fit lwr upr
## 1 125.4564 121.4415 129.4713
For a company with predictor values: protein = 2, fat = 0, sodium = 200, fibre = 4, carbo = 20, sugars = 8, shelf = 1, and potassium = 100, the predicted calories is 125.46. The 95% prediction interval for the calories is [109.11, 141.81].
# We have our two possible directions for our model.
# I provide both here
# Prediction Interval for Stepwise Model - AIC
stepwise_model_pred_AIC <- predict(stepwise_model, newdata = company_scores, interval = "prediction")
# Confidence Interval for the Stepwise Model - AIC
stepwise_model_confidence_AIC <- predict(stepwise_model, newdata = company_scores, interval = "confidence")
cat("Prediction Interval for Stepwise Model - AIC:\n")
## Prediction Interval for Stepwise Model - AIC:
print(stepwise_model_pred_AIC)
## fit lwr upr
## 1 125.4564 109.1072 141.8056
cat("Confidence Interval for Stepwise Model - AIC:\n")
## Confidence Interval for Stepwise Model - AIC:
print(stepwise_model_confidence_AIC)
## fit lwr upr
## 1 125.4564 121.4415 129.4713
# Prediction Interval for Stepwise Model - Manual
stepwise_model_pred_manual <- predict(final_model_stepwise, newdata = company_scores, interval = "prediction")
# Confidence Interval for the Stepwise Model - Manual
stepwise_model_confidence_manual <- predict(final_model_stepwise, newdata = company_scores, interval = "confidence")
cat("Prediction Interval for Stepwise Model - Manual:\n")
## Prediction Interval for Stepwise Model - Manual:
print(stepwise_model_pred_manual)
## fit lwr upr
## 1 121.5102 103.5117 139.5087
cat("Confidence Interval for Stepwise Model - Manual:\n")
## Confidence Interval for Stepwise Model - Manual:
print(stepwise_model_confidence_manual)
## fit lwr upr
## 1 121.5102 118.3919 124.6284
For a company with predictor values: protein = 2, fat = 0, sodium = 200, fibre = 4, carbo = 20, sugars = 8, shelf = 1, and potassium = 100, the predicted calories is 121.51. The 95% prediction interval for the calories is [103.51, 139.51]. The 95% confidence interval is [118.39, 124.63].
library (leaps)
# All-Subsets Regression
all_subsets <- regsubsets(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
summary_subsets <- summary(all_subsets)
# Find the model with the highest adjusted R²
max_adj_r2 <- which.max(summary_subsets$adjr2)
best_model <- summary_subsets$which[max_adj_r2, ]
# Extract the names of the predictors in the final model
predictors <- names(best_model)[best_model][2:length(names(best_model)[best_model])]
all_subsets_formula <- as.formula(paste("calories ~", paste(predictors, collapse = " + ")))
all_subsets_model <- lm(all_subsets_formula, data = UScereal)
# Prediction Interval for All-Subsets Model
all_subsets_model_pred <- predict(all_subsets_model, newdata = company_scores, interval = "prediction")
# Confidence Interval for the All-Subsets Model
all_subsets_model_confidence <- predict(all_subsets_model, newdata = company_scores, interval = "confidence")
cat("Prediction Interval for All-Subsets Model:\n")
## Prediction Interval for All-Subsets Model:
print(all_subsets_model_pred)
## fit lwr upr
## 1 125.472 109.1691 141.7749
cat("Confidence Interval for All-Subsets Model:\n")
## Confidence Interval for All-Subsets Model:
print(all_subsets_model_confidence)
## fit lwr upr
## 1 125.472 121.4684 129.4756
For a company with predictor values: protein = 2, fat = 0, sodium = 200, fibre = 4, carbo = 20, sugars = 8, shelf = 1, and potassium = 100, the predicted calories is 125.47. The 95% prediction interval for the calories is [109.17, 141.78]. The 95% confidence interval is [121.47, 129.48].
The narrowest prediction interval is the All-Subsets Regression Model, which has the narrowest prediction interval with a width of 32.6058. The Full Model has a prediction interval width of 33.2636. All variable selection models except the manual forward and manual step wise selection model resulted in narrower intervals than the Full Model. This suggests that these variable selection methods can lead to more precise (less variance) predictions. The narrower prediction intervals from the variable selection models show that these intervals may provide more reliable predictions by reducing the variance and focusing on the most significant predictors.
Use the final model(s) from the variable selection methods to estimate shrunken-R2 using the Browne equation.
# Full Model
full_model <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
summary_full_model <- summary(full_model)
R2_full <- summary_full_model$r.squared
n_full <- nrow(UScereal)
p_full <- length(coef(full_model)) - 1
# Shrunken R2 Calculation
shrunken_R2_full <- 1 - ((1 - R2_full) * (n_full - 1) / (n_full - p_full - 1))
cat("Shrunken-R2 for Full Model: ", shrunken_R2_full, "\n")
## Shrunken-R2 for Full Model: 0.983772
# Forward Model - AIC
summary_forward_model <- summary(forward_model)
R2_forward <- summary_forward_model$r.squared
n_forward <- nrow(UScereal)
p_forward <- length(coef(forward_model)) - 1
# Forward Model - Manual
summary_forward_model_manual <- summary(final_model_forward)
R2_forward_manual <- summary_forward_model_manual$r.squared
n_forward_manual <- nrow(UScereal)
p_forward_manual <- length(coef(final_model_forward)) - 1
# Shrunken R2 Calculation - AIC
shrunken_R2_forward <- 1 - ((1 - R2_forward) * (n_forward - 1) / (n_forward - p_forward - 1))
cat("Shrunken-R2 for Forward Model (AIC): ", shrunken_R2_forward, "\n")
## Shrunken-R2 for Forward Model (AIC): 0.9839069
# Shrunken R2 Calculation - Manual
shrunken_R2_forward_manual <- 1 - ((1 - R2_forward_manual) * (n_forward_manual - 1) / (n_forward_manual - p_forward_manual - 1))
cat("Shrunken-R2 for Forward Model (Manual): ", shrunken_R2_forward_manual, "\n")
## Shrunken-R2 for Forward Model (Manual): 0.9798389
# Backward Model - AIC
summary_backward_model <- summary(backward_model)
R2_backward <- summary_backward_model$r.squared
n_backward <- nrow(UScereal)
p_backward <- length(coef(backward_model)) - 1
# Backward Model - Manual
summary_backward_model_manual <- summary(final_model_backward)
R2_backward_manual <- summary_backward_model_manual$r.squared
n_backward_manual <- nrow(UScereal)
p_backward_manual <- length(coef(final_model_backward)) - 1
# Shrunken R2 Calculation - AIC
shrunken_R2_backward <- 1 - ((1 - R2_backward) * (n_backward - 1) / (n_backward - p_backward - 1))
cat("Shrunken-R2 for Backward Model (AIC): ", shrunken_R2_backward, "\n")
## Shrunken-R2 for Backward Model (AIC): 0.9839069
# Shrunken R2 Calculation - Manual
shrunken_R2_backward_manual <- 1 - ((1 - R2_backward_manual) * (n_backward_manual - 1) / (n_backward_manual - p_backward_manual - 1))
cat("Shrunken-R2 for Backward Model (Manual): ", shrunken_R2_backward_manual, "\n")
## Shrunken-R2 for Backward Model (Manual): 0.9839069
# Stepwise Model - AIC
summary_stepwise_model <- summary(stepwise_model)
R2_stepwise <- summary_stepwise_model$r.squared
n_stepwise <- nrow(UScereal)
p_stepwise <- length(coef(stepwise_model)) - 1
# Stepwise Model - Manual
summary_stepwise_model_manual <- summary(final_model_stepwise)
R2_stepwise_manual <- summary_stepwise_model_manual$r.squared
n_stepwise_manual <- nrow(UScereal)
p_stepwise_manual <- length(coef(final_model_stepwise)) - 1
# Shrunken R2 Calculation - AIC
shrunken_R2_stepwise <- 1 - ((1 - R2_stepwise) * (n_stepwise - 1) / (n_stepwise - p_stepwise - 1))
cat("Shrunken-R2 for Stepwise Model (AIC): ", shrunken_R2_stepwise, "\n")
## Shrunken-R2 for Stepwise Model (AIC): 0.9839069
# Shrunken R2 Calculation - Manual
shrunken_R2_stepwise_manual <- 1 - ((1 - R2_stepwise_manual) * (n_stepwise_manual - 1) / (n_stepwise_manual - p_stepwise_manual - 1))
cat("Shrunken-R2 for Stepwise Model (Manual): ", shrunken_R2_stepwise_manual, "\n")
## Shrunken-R2 for Stepwise Model (Manual): 0.9798389
library(leaps)
# All-Subsets Regression
all_subsets <- regsubsets(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
summary_subsets <- summary(all_subsets)
# Find the model with the highest adjusted R2
max_adj_r2 <- which.max(summary_subsets$adjr2)
best_model <- summary_subsets$which[max_adj_r2, ]
# Extract the names of the predictors in the final model
predictors <- names(best_model)[best_model][2:length(names(best_model)[best_model])]
all_subsets_formula <- as.formula(paste("calories ~", paste(predictors, collapse = " + ")))
all_subsets_model <- lm(all_subsets_formula, data = UScereal)
summary_all_subsets_model <- summary(all_subsets_model)
R2_all_subsets <- summary_all_subsets_model$r.squared
n_all_subsets <- nrow(UScereal)
p_all_subsets <- length(coef(all_subsets_model)) - 1
# Shrunken R2 Calculation
shrunken_R2_all_subsets <- 1 - ((1 - R2_all_subsets) * (n_all_subsets - 1) / (n_all_subsets - p_all_subsets - 1))
cat("Shrunken-R2 for All-Subsets Model: ", shrunken_R2_all_subsets, "\n")
## Shrunken-R2 for All-Subsets Model: 0.9840099
The All-Subsets Regression Model has the highest shrunken \(R^2\) value of 0.9840099, suggesting the best out-of-sample prediction performance.
As I note actually at the beginning of (5.1) for this purpose, we find that the shrunken \(R^2\) for the Full Model is 0.983772. The All-Subsets Regression Model’s shrunken \(R^2\) (0.9840099) is slightly higher than the Full Model’s shrunken \(R^2\) (0.983772), suggesting that the All-Subsets Regression Model performs better in terms of out-of-sample prediction. Therefore, the All-Subsets Regression Model provides the most reliable predictions when considering the out-of-sample performance.
# Estimate the linear model
mfr_model <- lm(calories ~ mfr.num, data = UScereal)
summary(mfr_model)
##
## Call:
## lm(formula = calories ~ mfr.num, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.44 -38.13 -12.36 26.10 286.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 143.259 14.484 9.891 1.92e-14 ***
## mfr.num 2.437 4.840 0.504 0.616
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.78 on 63 degrees of freedom
## Multiple R-squared: 0.004009, Adjusted R-squared: -0.0118
## F-statistic: 0.2536 on 1 and 63 DF, p-value: 0.6163
# Get the summary of the model
summary_model <- summary(mfr_model)
# Output the model fit (R2 and adj-R2)
r_squared <- summary_model$r.squared
adj_r_squared <- summary_model$adj.r.squared
# Output the significance test (p-value for the model)
f_statistic <- summary_model$fstatistic
p_value <- pf(f_statistic[1], f_statistic[2], f_statistic[3], lower.tail = FALSE)
# Print the results
print(paste("R-squared: ", round(r_squared, 4)))
## [1] "R-squared: 0.004"
print(paste("Adjusted R-squared: ", round(adj_r_squared, 4)))
## [1] "Adjusted R-squared: -0.0118"
print(paste("p-value: ", round(p_value, 4)))
## [1] "p-value: 0.6163"
For our significance test and model summary, we have F(1, 63) = 0.2536, p = 0.6163. Given this result, we fail to reject the null the hypothesis that there is no relationship between the manufacturer and number of calories in one portion.
This regression represents the slope here. The slope represents the change in the calorie content for cereals from different manufacturers. Specifically, for two manufacturers that differ by one point in their numeric encoding (moving from General Mills to Kellogg’s, which corresponds to an increase in mfr.num by one unit), there is an expected difference of 2.437 calories. The manufacturer with the higher numeric encoding is expected to have cereals with a higher calorie content by an average of 2.437 calories. However, this difference is not statistically significant, indicating that the relationship between the manufacturer and the calorie content is not reliable based on this model.
library (ggplot2)
# Fit the linear model
mfr_model <- lm(calories ~ mfr.num, data = UScereal)
# Plot
plt_mfrnum <- ggplot(UScereal, aes(x = mfr.num, y = calories)) +
geom_point() +
geom_line(aes(y = predict(mfr_model)), color = "red", se = FALSE) +
labs(
title = "Calories vs. Manufacturer",
x = "Manufacturer (Numerically Encoded)",
y = "Calories"
) +
theme_minimal()+
theme(
axis.text = element_text(size = 10, face = "plain"),
axis.title = element_text(size = 11, face = "plain"),
plot.title = element_text(size = 13, face = "plain", hjust = 0.5)
)
## Warning in geom_line(aes(y = predict(mfr_model)), color = "red", se = FALSE):
## Ignoring unknown parameters: `se`
print(plt_mfrnum)
library(dplyr)
# Predict the values for each manufacturer
UScereal$predicted_calories <- predict(mfr_model, newdata = UScereal)
# Calculate the mean calories for each manufacturer
mean_calories <- UScereal %>%
group_by(mfr.num) %>%
summarise(mean_calories = mean(calories))
# Create the table with predicted values and group means
manufacturers <- data.frame(
Manufacturer = c("1 (G)", "2 (K)", "3 (N)", "4 (P)", "5 (Q)", "6 (R)"),
mfr.num = 1:6
)
manufacturers$Model_Prediction <- predict(mfr_model, newdata = manufacturers)
manufacturers$Group_Mean <- mean_calories$mean_calories
# Table
print(manufacturers)
## Manufacturer mfr.num Model_Prediction Group_Mean
## 1 1 (G) 1 145.6959 137.7879
## 2 2 (K) 2 148.1333 149.6710
## 3 3 (N) 3 150.5707 160.2593
## 4 4 (P) 4 153.0081 194.7578
## 5 5 (Q) 5 155.4455 135.8507
## 6 6 (R) 6 157.8829 124.8521
I think that treating the manufacturer as a continuous variable in this context may not be the most appropriate and accurate modeling approach. This is because the very nature of the manufacturer is inherently a categorical variable, not a continuous one. Each value represents a distinct manufacturer with no natural order or scale between them. By treating it as a continuous variable, we imply a linear relationship and order that does not exist. Moreover, the regression model assumes that moving from one manufacturer to the next (from General Mills to Kellogg’s) has a linear impact on the calories, which is unlikely to be true. The differences in calorie content are more likely due to individual manufacturer practices rather than a smooth, linear transition.
Also consider our \(R^2\) value. It is very low (0.004), and the p-value for the coefficient is not significant (0.616), indicating that the model does not explain the variance in calories well. This suggests that the linear model does not capture the relationship between manufacturer and calorie content effectively. Additionally, the group means for calories show considerable variation among manufacturers (137.79 for General Mills vs. 194.76 for Post). This variation suggests that each manufacturer has distinct characteristics that affect calorie content, which a continuous variable model cannot capture.
# Load necessary libraries
library(dplyr)
library(mosaic)
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following object is masked from 'package:boot':
##
## logit
## The following object is masked from 'package:caret':
##
## dotPlot
## The following object is masked from 'package:ggplot2':
##
## stat
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
UScereal$mfr <- as.factor(UScereal$mfr)
# Create dummy variables with General Mills (G) as the reference group
UScereal <- transform(UScereal,
K = as.numeric(mfr == "K"),
N = as.numeric(mfr == "N"),
P = as.numeric(mfr == "P"),
Q = as.numeric(mfr == "Q"),
R = as.numeric(mfr == "R"))
# Create a table to show the dummy coding
dummy_coding_table <- UScereal %>%
select(mfr, K, N, P, Q, R) %>%
distinct()
# Rename the manufacturer column for clarity
dummy_coding_table <- dummy_coding_table %>%
mutate(Manufacturer = case_when(
mfr == "G" ~ "1 (G)",
mfr == "K" ~ "2 (K)",
mfr == "N" ~ "3 (N)",
mfr == "P" ~ "4 (P)",
mfr == "Q" ~ "5 (Q)",
mfr == "R" ~ "6 (R)"
)) %>%
select(Manufacturer, K, N, P, Q, R)
# Table
print(dummy_coding_table)
## Manufacturer K N P Q R
## 100% Bran 3 (N) 0 1 0 0 0
## All-Bran 2 (K) 1 0 0 0 0
## Apple Cinnamon Cheerios 1 (G) 0 0 0 0 0
## Bran Chex 6 (R) 0 0 0 0 1
## Bran Flakes 4 (P) 0 0 1 0 0
## Cap'n'Crunch 5 (Q) 0 0 0 1 0
# Fit the linear model using dummy coding
dummyModel <- lm(calories ~ K + N + P + Q + R, data = UScereal)
summary(dummyModel)
##
## Call:
## lm(formula = calories ~ K + N + P + Q + R, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.788 13.123 10.500 4.01e-15 ***
## K 11.883 18.778 0.633 0.5293
## N 22.471 37.882 0.593 0.5553
## P 56.970 24.355 2.339 0.0227 *
## Q -1.937 30.495 -0.064 0.9496
## R -12.936 30.495 -0.424 0.6730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
library(dplyr)
library(mosaic)
UScereal$mfr <- as.factor(UScereal$mfr)
# Apply dummy coding to the manufacturer factor
contrasts(UScereal$mfr) <- contr.treatment(length(levels(UScereal$mfr)), base = 1)
# Display the contrasts for verification
contrasts(UScereal$mfr)
## 2 3 4 5 6
## G 0 0 0 0 0
## K 1 0 0 0 0
## N 0 1 0 0 0
## P 0 0 1 0 0
## Q 0 0 0 1 0
## R 0 0 0 0 1
# Fit the linear model using dummy coding
dummyModel <- lm(calories ~ mfr, data = UScereal)
summary(dummyModel)
##
## Call:
## lm(formula = calories ~ mfr, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.788 13.123 10.500 4.01e-15 ***
## mfr2 11.883 18.778 0.633 0.5293
## mfr3 22.471 37.882 0.593 0.5553
## mfr4 56.970 24.355 2.339 0.0227 *
## mfr5 -1.937 30.495 -0.064 0.9496
## mfr6 -12.936 30.495 -0.424 0.6730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
| Manufacturer | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| 1 (G) | 0 | 0 | 0 | 0 | 0 |
| 2 (K) | 1 | 0 | 0 | 0 | 0 |
| 3 (N) | 0 | 1 | 0 | 0 | 0 |
| 4 (P) | 0 | 0 | 1 | 0 | 0 |
| 5 (Q) | 0 | 0 | 0 | 1 | 0 |
| 6 (R) | 0 | 0 | 0 | 0 | 1 |
library(mosaic)
library(dplyr)
UScereal$mfr <- as.factor(UScereal$mfr)
# Sequential coding for manufacturer
UScereal <- UScereal %>%
mutate(
s1 = as.numeric(mfr.num > 1),
s2 = as.numeric(mfr.num > 2),
s3 = as.numeric(mfr.num > 3),
s4 = as.numeric(mfr.num > 4),
s5 = as.numeric(mfr.num > 5)
)
# Create a table to show the sequential coding
sequential_coding_table <- UScereal %>%
select(mfr.num, s1, s2, s3, s4, s5) %>%
distinct()
# Rename the manufacturer column for clarity
sequential_coding_table <- sequential_coding_table %>%
mutate(Manufacturer = case_when(
mfr.num == 1 ~ "1 (G)",
mfr.num == 2 ~ "2 (K)",
mfr.num == 3 ~ "3 (N)",
mfr.num == 4 ~ "4 (P)",
mfr.num == 5 ~ "5 (Q)",
mfr.num == 6 ~ "6 (R)"
)) %>%
select(Manufacturer, s1, s2, s3, s4, s5)
# Table
print(sequential_coding_table)
## Manufacturer s1 s2 s3 s4 s5
## 100% Bran 3 (N) 1 1 0 0 0
## All-Bran 2 (K) 1 0 0 0 0
## Apple Cinnamon Cheerios 1 (G) 0 0 0 0 0
## Bran Chex 6 (R) 1 1 1 1 1
## Bran Flakes 4 (P) 1 1 1 0 0
## Cap'n'Crunch 5 (Q) 1 1 1 1 0
# Fit the linear model using sequential coding
seqModel <- lm(calories ~ s1 + s2 + s3 + s4 + s5, data = UScereal)
summary(seqModel)
##
## Call:
## lm(formula = calories ~ s1 + s2 + s3 + s4 + s5, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.79 13.12 10.500 4.01e-15 ***
## s1 11.88 18.78 0.633 0.5293
## s2 10.59 37.99 0.279 0.7814
## s3 34.50 41.03 0.841 0.4039
## s4 -58.91 34.33 -1.716 0.0914 .
## s5 -11.00 38.93 -0.283 0.7785
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
| Manufacturer | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|
| 1 (G) | 0 | 0 | 0 | 0 | 0 |
| 2 (K) | 1 | 0 | 0 | 0 | 0 |
| 3 (N) | 1 | 1 | 0 | 0 | 0 |
| 4 (P) | 1 | 1 | 1 | 0 | 0 |
| 5 (Q) | 1 | 1 | 1 | 1 | 0 |
| 6 (R) | 1 | 1 | 1 | 1 | 1 |
library(dplyr)
library(mosaic)
UScereal$mfr <- as.factor(UScereal$mfr)
# Helmert coding for manufacturer with General Mills (G) as the lowest group
UScereal <- transform(UScereal,
H1 = -5/6 * (mfr.num == 1) + 1/6 * (mfr.num > 1),
H2 = -4/5 * (mfr.num == 2) + 1/5 * (mfr.num > 2),
H3 = -3/4 * (mfr.num == 3) + 1/4 * (mfr.num > 3),
H4 = -2/3 * (mfr.num == 4) + 1/3 * (mfr.num > 4),
H5 = -1/2 * (mfr.num == 5) + 1/2 * (mfr.num > 5)
)
# Create a table to show the Helmert coding
helmert_coding_table <- UScereal %>%
select(mfr.num, H1, H2, H3, H4, H5) %>%
distinct()
# Rename the manufacturer column for clarity
helmert_coding_table <- helmert_coding_table %>%
mutate(Manufacturer = case_when(
mfr.num == 1 ~ "1 (G)",
mfr.num == 2 ~ "2 (K)",
mfr.num == 3 ~ "3 (N)",
mfr.num == 4 ~ "4 (P)",
mfr.num == 5 ~ "5 (Q)",
mfr.num == 6 ~ "6 (R)"
)) %>%
select(Manufacturer, H1, H2, H3, H4, H5)
# Table
print(helmert_coding_table)
## Manufacturer H1 H2 H3 H4 H5
## 100% Bran 3 (N) 0.1666667 0.2 -0.75 0.0000000 0.0
## All-Bran 2 (K) 0.1666667 -0.8 0.00 0.0000000 0.0
## Apple Cinnamon Cheerios 1 (G) -0.8333333 0.0 0.00 0.0000000 0.0
## Bran Chex 6 (R) 0.1666667 0.2 0.25 0.3333333 0.5
## Bran Flakes 4 (P) 0.1666667 0.2 0.25 -0.6666667 0.0
## Cap'n'Crunch 5 (Q) 0.1666667 0.2 0.25 0.3333333 -0.5
# Fit the linear model using Helmert coding
helmModel <- lm(calories ~ H1 + H2 + H3 + H4 + H5, data = UScereal)
summary(helmModel)
##
## Call:
## lm(formula = calories ~ H1 + H2 + H3 + H4 + H5, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.530 9.933 15.155 <2e-16 ***
## H1 15.290 17.533 0.872 0.3867
## H2 4.259 19.503 0.218 0.8279
## H3 -8.439 38.445 -0.220 0.8270
## H4 -64.406 28.281 -2.277 0.0264 *
## H5 -10.999 38.929 -0.283 0.7785
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
| Manufacturer | H1 | H2 | H3 | H4 | H5 |
|---|---|---|---|---|---|
| 1 (G) | -0.8333333 | 0.0 | 0.00 | 0.0000000 | 0.0 |
| 2 (K) | 0.1666667 | -0.8 | 0.00 | 0.0000000 | 0.0 |
| 3 (N) | 0.1666667 | 0.2 | -0.75 | 0.0000000 | 0.0 |
| 4 (P) | 0.1666667 | 0.2 | 0.25 | -0.6666667 | 0.0 |
| 5 (Q) | 0.1666667 | 0.2 | 0.25 | 0.3333333 | -0.5 |
| 6 (R) | 0.1666667 | 0.2 | 0.25 | 0.3333333 | 0.5 |
library(dplyr)
library(mosaic)
UScereal$mfr.num <- as.numeric(UScereal$mfr)
# Helmert coding for manufacturer with General Mills (G) as the lowest group
UScereal <- transform(UScereal,
h1 = ifelse(mfr.num == 1, 0, ifelse(mfr.num > 1, 1/5, -5/6)),
h2 = ifelse(mfr.num <= 2, 0, ifelse(mfr.num > 2, 1/4, -4/5)),
h3 = ifelse(mfr.num <= 3, 0, ifelse(mfr.num > 3, 1/3, -3/4)),
h4 = ifelse(mfr.num <= 4, 0, ifelse(mfr.num > 4, 1/2, -2/3)),
h5 = ifelse(mfr.num <= 5, 0, ifelse(mfr.num > 5, 1, -1/2))
)
# Create a table to show the Helmert coding
helmert_coding_table <- UScereal %>%
select(mfr.num, h1, h2, h3, h4, h5) %>%
distinct()
# Rename the manufacturer column for clarity
helmert_coding_table <- helmert_coding_table %>%
mutate(Manufacturer = case_when(
mfr.num == 1 ~ "1 (G)",
mfr.num == 2 ~ "2 (K)",
mfr.num == 3 ~ "3 (N)",
mfr.num == 4 ~ "4 (P)",
mfr.num == 5 ~ "5 (Q)",
mfr.num == 6 ~ "6 (R)"
)) %>%
select(Manufacturer, h1, h2, h3, h4, h5)
# Table
print(helmert_coding_table)
## Manufacturer h1 h2 h3 h4 h5
## 100% Bran 3 (N) 0.2 0.25 0.0000000 0.0 0
## All-Bran 2 (K) 0.2 0.00 0.0000000 0.0 0
## Apple Cinnamon Cheerios 1 (G) 0.0 0.00 0.0000000 0.0 0
## Bran Chex 6 (R) 0.2 0.25 0.3333333 0.5 1
## Bran Flakes 4 (P) 0.2 0.25 0.3333333 0.0 0
## Cap'n'Crunch 5 (Q) 0.2 0.25 0.3333333 0.5 0
# Fit the linear model using Helmert coding
helmModel <- lm(calories ~ h1 + h2 + h3 + h4 + h5, data = UScereal)
summary(helmModel)
##
## Call:
## lm(formula = calories ~ h1 + h2 + h3 + h4 + h5, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.79 13.12 10.500 4.01e-15 ***
## h1 59.42 93.89 0.633 0.5293
## h2 42.35 151.96 0.279 0.7814
## h3 103.50 123.10 0.841 0.4039
## h4 -117.81 68.66 -1.716 0.0914 .
## h5 -11.00 38.93 -0.283 0.7785
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
| Manufacturer | H1 | H2 | H3 | H4 | H5 |
|---|---|---|---|---|---|
| 1 (G) | 0.0 | 0.00 | 0.0000000 | 0.0 | 0 |
| 2 (K) | 0.2 | 0.00 | 0.0000000 | 0.0 | 0 |
| 3 (N) | 0.2 | 0.25 | 0.0000000 | 0.0 | 0 |
| 4 (P) | 0.2 | 0.25 | 0.3333333 | 0.0 | 0 |
| 5 (Q) | 0.2 | 0.25 | 0.3333333 | 0.5 | 0 |
| 6 (R) | 0.2 | 0.25 | 0.3333333 | 0.5 | 1 |
# Load necessary libraries
library(dplyr)
library(mosaic)
UScereal$mfr <- as.factor(UScereal$mfr)
# Create Helmert coding for manufacturer with General Mills (G) as the lowest group
helmert_matrix <- contr.helmert(length(levels(UScereal$mfr)))
# Apply Helmert coding to the manufacturer factor
contrasts(UScereal$mfr) <- helmert_matrix
# Display the contrasts for verification
contrasts(UScereal$mfr)
## [,1] [,2] [,3] [,4] [,5]
## G -1 -1 -1 -1 -1
## K 1 -1 -1 -1 -1
## N 0 2 -1 -1 -1
## P 0 0 3 -1 -1
## Q 0 0 0 4 -1
## R 0 0 0 0 5
# Fit the linear model using Helmert coding
helmModel <- lm(calories ~ mfr, data = UScereal)
summary(helmModel)
##
## Call:
## lm(formula = calories ~ mfr, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.530 9.933 15.155 <2e-16 ***
## mfr1 5.942 9.389 0.633 0.5293
## mfr2 5.510 12.252 0.450 0.6546
## mfr3 11.380 6.126 1.858 0.0682 .
## mfr4 -4.954 5.950 -0.833 0.4084
## mfr5 -5.136 4.915 -1.045 0.3003
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
| Manufacturer | H1 | H2 | H3 | H4 | H5 |
|---|---|---|---|---|---|
| 1 (G) | -1 | -1 | -1 | -1 | -1 |
| 2 (K) | 1 | -1 | -1 | -1 | -1 |
| 3 (N) | 0 | 2 | -1 | -1 | -1 |
| 4 (P) | 0 | 0 | 3 | -1 | -1 |
| 5 (Q) | 0 | 0 | 0 | 4 | -1 |
| 6 (R) | 0 | 0 | 0 | 0 | 5 |
I completed the Helmert coding three ways. One I am not sure, when I adapted the provided code from lab 7, seems to give different results (see the first panel). When I use the modified code that I have adjusted myself, I get the general mills to be the absolute lowest. However, after adapting the contr.helmert function from the contrasts library, we get another output. However, one can find that while the predictor estimates may vary, one maintains the same significance levels.
library(dplyr)
library(mosaic)
UScereal$mfr <- as.factor(UScereal$mfr)
# Create a numeric representation of mfr
UScereal$mfr.num <- as.numeric(UScereal$mfr)
# Effect coding for manufacturer with Ralston Purina (R) as the reference group
UScereal <- transform(UScereal,
E1 = (mfr.num == 1) - (mfr.num == 6),
E2 = (mfr.num == 2) - (mfr.num == 6),
E3 = (mfr.num == 3) - (mfr.num == 6),
E4 = (mfr.num == 4) - (mfr.num == 6),
E5 = (mfr.num == 5) - (mfr.num == 6)
)
# Create a table to show the effect coding
effect_coding_table <- UScereal %>%
select(mfr.num, E1, E2, E3, E4, E5) %>%
distinct()
# Rename the manufacturer column for clarity
effect_coding_table <- effect_coding_table %>%
mutate(Manufacturer = case_when(
mfr.num == 1 ~ "1 (G)",
mfr.num == 2 ~ "2 (K)",
mfr.num == 3 ~ "3 (N)",
mfr.num == 4 ~ "4 (P)",
mfr.num == 5 ~ "5 (Q)",
mfr.num == 6 ~ "6 (R)"
)) %>%
select(Manufacturer, E1, E2, E3, E4, E5)
# Table
print(effect_coding_table)
## Manufacturer E1 E2 E3 E4 E5
## 100% Bran 3 (N) 0 0 1 0 0
## All-Bran 2 (K) 0 1 0 0 0
## Apple Cinnamon Cheerios 1 (G) 1 0 0 0 0
## Bran Chex 6 (R) -1 -1 -1 -1 -1
## Bran Flakes 4 (P) 0 0 0 1 0
## Cap'n'Crunch 5 (Q) 0 0 0 0 1
# Fit the linear model using effect coding
effectsModel <- lm(calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
summary(effectsModel)
##
## Call:
## lm(formula = calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.5298 9.9329 15.155 <2e-16 ***
## E1 -12.7419 14.6106 -0.872 0.3867
## E2 -0.8588 14.7965 -0.058 0.9539
## E3 9.7295 30.6687 0.317 0.7522
## E4 44.2280 19.4756 2.271 0.0268 *
## E5 -14.6791 24.5725 -0.597 0.5525
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
| Manufacturer | H1 | H2 | H3 | H4 | H5 |
|---|---|---|---|---|---|
| 1 (G) | 1 | 0 | 0 | 0 | 0 |
| 2 (K) | 0 | 1 | 0 | 0 | 0 |
| 3 (N) | 0 | 0 | 1 | 0 | 0 |
| 4 (P) | 0 | 0 | 0 | 1 | 0 |
| 5 (Q) | 0 | 0 | 0 | 0 | 1 |
| 6 (R) | -1 | -1 | -1 | -1 | -1 |
library(dplyr)
library(mosaic)
UScereal$mfr <- as.factor(UScereal$mfr)
# Apply effect coding to the manufacturer factor using contr.sum
contrasts(UScereal$mfr) <- contr.sum(length(levels(UScereal$mfr)))
# Display the contrasts for verification
print(contrasts(UScereal$mfr))
## [,1] [,2] [,3] [,4] [,5]
## G 1 0 0 0 0
## K 0 1 0 0 0
## N 0 0 1 0 0
## P 0 0 0 1 0
## Q 0 0 0 0 1
## R -1 -1 -1 -1 -1
# Fit the linear model using effect coding
effectsModel1 <- lm(calories ~ mfr, data = UScereal)
summary(effectsModel1)
##
## Call:
## lm(formula = calories ~ mfr, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.5298 9.9329 15.155 <2e-16 ***
## mfr1 -12.7419 14.6106 -0.872 0.3867
## mfr2 -0.8588 14.7965 -0.058 0.9539
## mfr3 9.7295 30.6687 0.317 0.7522
## mfr4 44.2280 19.4756 2.271 0.0268 *
## mfr5 -14.6791 24.5725 -0.597 0.5525
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
| Manufacturer | E1 | E2 | E3 | E4 | E5 |
|---|---|---|---|---|---|
| 1 (G) | 1 | 0 | 0 | 0 | 0 |
| 2 (K) | 0 | 1 | 0 | 0 | 0 |
| 3 (N) | 0 | 0 | 1 | 0 | 0 |
| 4 (P) | 0 | 0 | 0 | 1 | 0 |
| 5 (Q) | 0 | 0 | 0 | 0 | 1 |
| 6 (R) | -1 | -1 | -1 | -1 | -1 |
Fit a linear model predicting calories using manufacturer as represented by a set of effects codes. Provide the code and the output for both generating the new variables and fitting the model. Estimate the model fit (R2 and adj-R2) and qualitatively compare it to the model in Question 1. Select one coefficient from the model to interpret.
library(dplyr)
library(mosaic)
UScereal$mfr <- as.factor(UScereal$mfr)
# Create a numeric representation of mfr
UScereal$mfr.num <- as.numeric(UScereal$mfr)
# Effect coding for manufacturer with Ralston Purina (R) as the reference group
UScereal <- transform(UScereal,
E1 = (mfr.num == 1) - (mfr.num == 6),
E2 = (mfr.num == 2) - (mfr.num == 6),
E3 = (mfr.num == 3) - (mfr.num == 6),
E4 = (mfr.num == 4) - (mfr.num == 6),
E5 = (mfr.num == 5) - (mfr.num == 6)
)
# Create a table to show the effect coding
effect_coding_table <- UScereal %>%
select(mfr.num, E1, E2, E3, E4, E5) %>%
distinct()
# Rename the manufacturer column for clarity
effect_coding_table <- effect_coding_table %>%
mutate(Manufacturer = case_when(
mfr.num == 1 ~ "1 (G)",
mfr.num == 2 ~ "2 (K)",
mfr.num == 3 ~ "3 (N)",
mfr.num == 4 ~ "4 (P)",
mfr.num == 5 ~ "5 (Q)",
mfr.num == 6 ~ "6 (R)"
)) %>%
select(Manufacturer, E1, E2, E3, E4, E5)
# Table
print(effect_coding_table)
## Manufacturer E1 E2 E3 E4 E5
## 100% Bran 3 (N) 0 0 1 0 0
## All-Bran 2 (K) 0 1 0 0 0
## Apple Cinnamon Cheerios 1 (G) 1 0 0 0 0
## Bran Chex 6 (R) -1 -1 -1 -1 -1
## Bran Flakes 4 (P) 0 0 0 1 0
## Cap'n'Crunch 5 (Q) 0 0 0 0 1
# Fit the linear model using effect coding
effectsModel <- lm(calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
summary(effectsModel)
##
## Call:
## lm(formula = calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.5298 9.9329 15.155 <2e-16 ***
## E1 -12.7419 14.6106 -0.872 0.3867
## E2 -0.8588 14.7965 -0.058 0.9539
## E3 9.7295 30.6687 0.317 0.7522
## E4 44.2280 19.4756 2.271 0.0268 *
## E5 -14.6791 24.5725 -0.597 0.5525
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
# Fit the full model
full_model <- lm(calories ~ protein + fat + sodium + fibre + carbo + sugars + shelf + potassium, data = UScereal)
# Summary of the full model
summary(full_model)
##
## Call:
## lm(formula = calories ~ protein + fat + sodium + fibre + carbo +
## sugars + shelf + potassium, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.917 -4.075 1.079 4.415 15.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.15905 3.74046 -5.657 5.46e-07 ***
## protein 4.87257 0.97909 4.977 6.51e-06 ***
## fat 9.27406 0.75034 12.360 < 2e-16 ***
## sodium 0.01207 0.01002 1.204 0.233676
## fibre 2.80820 0.71883 3.907 0.000254 ***
## carbo 4.86987 0.17491 27.842 < 2e-16 ***
## sugars 4.54340 0.21068 21.565 < 2e-16 ***
## shelf 0.56679 1.39782 0.405 0.686668
## potassium -0.11594 0.02718 -4.265 7.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.951 on 56 degrees of freedom
## Multiple R-squared: 0.9858, Adjusted R-squared: 0.9838
## F-statistic: 486 on 8 and 56 DF, p-value: < 2.2e-16
# Extract R-squared and adjusted R-squared
effectsModel_summary <- summary(effectsModel)
effectsModel_r2 <- effectsModel_summary$r.squared
effectsModel_adj_r2 <- effectsModel_summary$adj.r.squared
print(paste("Effects Code Model R-squared: ", effectsModel_r2))
## [1] "Effects Code Model R-squared: 0.103365947916525"
print(paste("Effects Code Model Adjusted R-squared: ", effectsModel_adj_r2))
## [1] "Effects Code Model Adjusted R-squared: 0.0273800112992811"
# Extract R-squared and adjusted R-squared
full_model_summary <- summary(full_model)
full_model_r2 <- full_model_summary$r.squared
full_model_adj_r2 <- full_model_summary$adj.r.squared
print(paste("Full Model R-squared: ", full_model_r2))
## [1] "Full Model R-squared: 0.985800539648456"
print(paste("Full Model Adjusted R-squared: ", full_model_adj_r2))
## [1] "Full Model Adjusted R-squared: 0.983772045312521"
Looking at these two models, the full model, which includes multiple predictors (protein, fat, sodium, fibre, carbo, sugars, shelf, and potassium), has a much higher \(R^2\) and adjusted \(R^2\) compared to the effects-coded model. The full model explains a much larger proportion of the variance in calories than the effects-coded model, indicating that other predictors like protein, fat, sodium, fibre, carbo, sugars, shelf, and potassium are significant factors in determining the calorie content of cereals.
library(dplyr)
UScereal$mfr <- as.factor(UScereal$mfr)
# Create a numeric representation of mfr
UScereal$mfr.num <- as.numeric(UScereal$mfr)
# Effect coding for manufacturer with Ralston Purina (R) as the reference group
UScereal <- transform(UScereal,
E1 = (mfr.num == 1) - (mfr.num == 6),
E2 = (mfr.num == 2) - (mfr.num == 6),
E3 = (mfr.num == 3) - (mfr.num == 6),
E4 = (mfr.num == 4) - (mfr.num == 6),
E5 = (mfr.num == 5) - (mfr.num == 6)
)
# Create a table to show the effect coding
effect_coding_table <- UScereal %>%
select(mfr.num, E1, E2, E3, E4, E5) %>%
distinct()
# Rename the manufacturer column for clarity
effect_coding_table <- effect_coding_table %>%
mutate(Manufacturer = case_when(
mfr.num == 1 ~ "1 (G)",
mfr.num == 2 ~ "2 (K)",
mfr.num == 3 ~ "3 (N)",
mfr.num == 4 ~ "4 (P)",
mfr.num == 5 ~ "5 (Q)",
mfr.num == 6 ~ "6 (R)"
)) %>%
select(Manufacturer, E1, E2, E3, E4, E5)
# Fit the linear model using effect coding
effectsModel <- lm(calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
summary(effectsModel)
##
## Call:
## lm(formula = calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.5298 9.9329 15.155 <2e-16 ***
## E1 -12.7419 14.6106 -0.872 0.3867
## E2 -0.8588 14.7965 -0.058 0.9539
## E3 9.7295 30.6687 0.317 0.7522
## E4 44.2280 19.4756 2.271 0.0268 *
## E5 -14.6791 24.5725 -0.597 0.5525
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
# Predictions from the model
UScereal$predicted_calories <- predict(effectsModel, newdata = UScereal)
# Calculate group means
group_means <- UScereal %>%
group_by(mfr) %>%
summarise(mean_calories = mean(calories))
# Create the table with predictions and group means
result_table <- UScereal %>%
group_by(mfr) %>%
summarise(Model_Prediction = mean(predicted_calories), Group_Mean = mean(calories)) %>%
mutate(Manufacturer = case_when(
mfr == "G" ~ "1 (G)",
mfr == "K" ~ "2 (K)",
mfr == "N" ~ "3 (N)",
mfr == "P" ~ "4 (P)",
mfr == "Q" ~ "5 (Q)",
mfr == "R" ~ "6 (R)"
)) %>%
select(Manufacturer, Model_Prediction, Group_Mean)
# Print the result table
print(result_table)
## # A tibble: 6 × 3
## Manufacturer Model_Prediction Group_Mean
## <chr> <dbl> <dbl>
## 1 1 (G) 138. 138.
## 2 2 (K) 150. 150.
## 3 3 (N) 160. 160.
## 4 4 (P) 195. 195.
## 5 5 (Q) 136. 136.
## 6 6 (R) 125. 125.
effectsModel_summary
##
## Call:
## lm(formula = calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.5298 9.9329 15.155 <2e-16 ***
## E1 -12.7419 14.6106 -0.872 0.3867
## E2 -0.8588 14.7965 -0.058 0.9539
## E3 9.7295 30.6687 0.317 0.7522
## E4 44.2280 19.4756 2.271 0.0268 *
## E5 -14.6791 24.5725 -0.597 0.5525
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
# Selected coefficient to interpret
selected_coef <- coef(effectsModel)["E2"]
selected_coef
## E2
## -0.8587862
This coefficient represents the difference in the number of calories between the cereals manufactured by Kellogg’s and the overall mean calories of all manufacturers, with Ralston Purina (reference group) coded as -1s to center the effects around zero. Specifically, for two manufacturers that differ by one point in their numeric encoding (moving from General Mills to Kellogg’s, which corresponds to an increase in mfr.num by one unit), there is an expected difference of -0.8588 calories. The manufacturer with the higher numeric encoding (Kellogg’s) is expected to have cereals with a lower calorie content by an average of 0.8588 calories compared to the overall mean. However, this difference is not statistically significant (t(59) = -0.058, p-value = 0.9539), indicating that the relationship between the manufacturer and the calorie content is not reliable based on this model
summary (mfr_model)
##
## Call:
## lm(formula = calories ~ mfr.num, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.44 -38.13 -12.36 26.10 286.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 143.259 14.484 9.891 1.92e-14 ***
## mfr.num 2.437 4.840 0.504 0.616
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.78 on 63 degrees of freedom
## Multiple R-squared: 0.004009, Adjusted R-squared: -0.0118
## F-statistic: 0.2536 on 1 and 63 DF, p-value: 0.6163
summary (effectsModel)
##
## Call:
## lm(formula = calories ~ E1 + E2 + E3 + E4 + E5, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.05 -39.67 -14.85 24.40 245.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.5298 9.9329 15.155 <2e-16 ***
## E1 -12.7419 14.6106 -0.872 0.3867
## E2 -0.8588 14.7965 -0.058 0.9539
## E3 9.7295 30.6687 0.317 0.7522
## E4 44.2280 19.4756 2.271 0.0268 *
## E5 -14.6791 24.5725 -0.597 0.5525
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.55 on 59 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.02738
## F-statistic: 1.36 on 5 and 59 DF, p-value: 0.2523
I have the effects model and the original full model. The full modelhas very low \(R^2\) (0.004009) and negative Adjusted \(R^2\) (-0.0118), indicating that it explains almost none of the variance in calories. In fact, the negative Adjusted \(R^2\) suggests that the model is worse than a simple mean model. The effects contrast has a higher \(R^2\) (0.1034) and a slightly positive Adjusted \(R^2\) (0.02738), indicating that it explains more variance in calories compared to full model.
Moreover, the full model is F(1, 63) = 0.2536, p = 0.6163, indicating that the model is not statistically significant. Our effects coding model has an F(5, 59) = 1.36, p = 0.2523, which is still not statistically significant but shows a slightly better fit compared to the full model.
Lastly, the full model has a residual standard error of 62.78. The effects model has a slightly lower residual standard error of 61.55, indicating that the predictions from Model 2 are closer to the actual values compared to the full model.
library(dplyr)
UScereal$mfr <- as.factor(UScereal$mfr)
# Apply effect coding to the manufacturer factor using contr.sum
contrasts(UScereal$mfr) <- contr.sum(length(levels(UScereal$mfr)))
# Fit the linear model using effect coding and adding protein as a predictor
effectsModel_protein <- lm(calories ~ mfr + protein, data = UScereal)
# Summary of the new model
effectsModel_protein_summary <- summary(effectsModel_protein)
print(effectsModel_protein_summary)
##
## Call:
## lm(formula = calories ~ mfr + protein, data = UScereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.770 -21.777 1.998 18.496 119.847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.137 11.546 6.941 3.69e-09 ***
## mfr1 8.104 10.713 0.756 0.4525
## mfr2 2.233 10.505 0.213 0.8324
## mfr3 -40.540 22.716 -1.785 0.0795 .
## mfr4 33.916 13.881 2.443 0.0176 *
## mfr5 -3.724 17.490 -0.213 0.8321
## protein 17.175 2.232 7.696 1.98e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.67 on 58 degrees of freedom
## Multiple R-squared: 0.5564, Adjusted R-squared: 0.5105
## F-statistic: 12.12 on 6 and 58 DF, p-value: 8.79e-09
# Extract R-squared and adjusted R-squared
effectsModel_protein_r2 <- effectsModel_protein_summary$r.squared
effectsModel_protein_adj_r2 <- effectsModel_protein_summary$adj.r.squared
print(paste("R-squared: ", effectsModel_protein_r2))
## [1] "R-squared: 0.556401854548056"
print(paste("Adjusted R-squared: ", effectsModel_protein_adj_r2))
## [1] "Adjusted R-squared: 0.510512391225441"
# Compare the R-squared and adjusted R-squared values
cat("Comparison of R-squared values:\n")
## Comparison of R-squared values:
cat("Effects Model R-squared: ", effectsModel_r2, "\n")
## Effects Model R-squared: 0.1033659
cat("Effects Model Adjusted R-squared: ", effectsModel_adj_r2, "\n")
## Effects Model Adjusted R-squared: 0.02738001
cat("Effects Model with Protein R-squared: ", effectsModel_protein_r2, "\n")
## Effects Model with Protein R-squared: 0.5564019
cat("Effects Model with Protein Adjusted R-squared: ", effectsModel_protein_adj_r2, "\n")
## Effects Model with Protein Adjusted R-squared: 0.5105124
The \(R^2\) value for the effects-coded model. is 0.1034, indicating that about 10.34% of the variance in calorie content is explained by the manufacturer effects. The \(R^2\) value for the effects-coded model with protein is 0.5564, indicating that about 55.64% of the variance in calorie content is explained by the combined effects of the manufacturer and protein content. The adjusted \(R^2\) value for effects-coded model is 0.0274, which is adjusted for the number of predictors and suggests a modest fit. The adjusted R-squared value for effects-coded model with protein is 0.5105, indicating a significantly better fit when protein is included as a predictor.
Overall, adding protein as a predictor significantly improves the model. The substantial increase in both R-squared and adjusted R-squared values demonstrates that protein content is a crucial predictor for calorie content in cereals, alongside the manufacturer effects.
coef_effectsModel_protein <- coef(effectsModel_protein)
protein_coef <- coef_effectsModel_protein["protein"]
protein_coef
## protein
## 17.17498
The coefficient for protein is 17.17498. This means that for each additional gram of protein, the calorie content is expected to change by 17.17498 calories, holding the effect-coded manufacturer variable constant.
# Interpret the coefficient for Kellogg's in the new model
kellogg_coef <- coef_effectsModel_protein["mfr2"]
kellogg_coef
## mfr2
## 2.233318
The coefficient for Kellogg’s is 2.233318. This represents the difference in the number of calories between the cereals manufactured by Kellogg’s and the overall mean, with Ralston Purina as the reference group, while controlling for the effect of protein.
For each of the following circumstances select a coding strategy that you think would best allow you to make the comparisons that are of most direct interest. Select a coding strategy, describe which group would be the reference group (if needed) or which order the groups would be in (if needed).
For this question, the best coding strategy on the effects of an educational intervention would be Helmert coding. This is because where we want to make specific comparisons among the conditions, and Helmert coding allows us to make comparisons between each level of a factor and the average of subsequent levels.
For our order and reference, we would have the following:
Then our comparisons are
H1. No intervention vs. Home intervention
H2. School intervention vs. Home intervention
H3. School + Home intervention vs. School intervention
Our matrix might look something like this:
| Intervention | H1 | H2 | H3 |
|---|---|---|---|
| Home intervention | -1 | -1 | -1 |
| No intervention | 1 | -1 | -1 |
| School + Home intervention | 0 | 2 | -1 |
| School intervention | 0 | 0 | 3 |
In Helmert coding, the reference group is implicitly defined as the last
group, but it is not used in the traditional sense of reference groups
in dummy or effect coding. Instead, each level (except the last one) is
compared with the average of the subsequent levels.
Overall, Helmert coding is ideal for making specific, sequential comparisons among multiple levels of a factor. It allows us to compare each level with the mean of the subsequent levels, which aligns perfectly with this research question.
For this question, the best coding strategy for examining whether there are wage differences for individuals of different ethnicities and testing whether any groups make significantly more or less money than the group average is effect coding. Effect coding is appropriate when one wants to compare each level of a categorical variable to the overall mean of the dependent variable. In this context, it will help determine if any ethnic group has wages that significantly differ from the average wage across all groups.
Assuming the following ethnic groups:
White/Caucasian
African American
LatinX
Asian/Asian American
Native American/First Nations
In effect coding, each group is compared to the overall mean wage, and the coding for each group would look like this:
| Ethnicity | E1 | E2 | E3 | E4 | E5 |
|---|---|---|---|---|---|
| White/Caucasian | 1 | 0 | 0 | 0 | -1 |
| African American | 0 | 1 | 0 | 0 | -1 |
| LatinX | 0 | 0 | 1 | 0 | -1 |
| Asian/Asian American | 0 | 0 | 0 | 1 | -1 |
| Native American/First Nations | -1 | -1 | -1 | -1 | 4 |
Overall, effect coding compares each group to the overall mean, and it’s useful in this context because it allows us to see if each group’s wages are significantly higher or lower than the average wage. The reference is the overall mean, which provides a good baseline for these comparisons.
For this question, the best coding strategy where we want to compare the sugar content of cereals from different manufacturers to that of Kellogg’s cereals,is dummy coding. Dummy coding allows you to compare each group to a specific reference group, which in this case will be Kellogg’s.
Our reference group would be Kellogg’s cereal.
Assuming the following manufacturers:
General Mills
Kellogg’s
Nabisco
Post
Quaker Oats
Ralston Purina
In dummy coding, each group except the reference group (Kellogg’s) will have a dummy variable coded as 1 if the observation belongs to that group, and 0 otherwise. The reference group (Kellogg’s) will not have a dummy variable and will be implicitly coded as the baseline.
Our matrix might look like
| Manufacturer | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| 1 (G) | 1 | 0 | 0 | 0 | 0 |
| 2 (K) | 0 | 0 | 0 | 0 | 0 |
| 3 (N) | 0 | 1 | 0 | 0 | 0 |
| 4 (P) | 0 | 0 | 1 | 0 | 0 |
| 5 (Q) | 0 | 0 | 0 | 1 | 0 |
| 6 (R) | 0 | 0 | 0 | 0 | 1 |
Overall, dummy coding allows for direct comparisons of each manufacturer’s sugar content against the reference group (Kellogg’s). This approach provides clear insights into whether cereals from other manufacturers have significantly lower sugar content than Kellogg’s cereals.