Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Using the Breaks dataset, we aim to determine if the hours to the deadline, taking a break, the length of a break in minutes and the frustration levels for completing an assignment affect how many minutes are wasted at any given time.
Quadratic term - FrustrationLevel (scale of 1(low frustration) to 10(high frustration))
Dichotomous term - TakeBreak (did the individual take a break (Yes/No))
Dichotomous vs Quantitative term - TakeBreak*MinutesBreak (Break vs Break length)
Looking at our data
library(knitr)
library(MASS)
library(car)
## Warning: package 'car' was built under R version 3.3.3
breaks = read.csv("Breaks.csv", header = T)
breaks$TakeBreak = as.numeric(breaks$TakeBreak)
kable(head(breaks))
| HoursToDeadline | FrustrationLevel | TakeBreak | MinsBreak | WastedMins |
|---|---|---|---|---|
| 48 | 1 | 1 | 0 | 5 |
| 46 | 1 | 1 | 0 | 14 |
| 44 | 1 | 2 | 36 | 12 |
| 42 | 1 | 2 | 48 | 22 |
| 40 | 1 | 2 | 25 | 21 |
| 38 | 2 | 1 | 0 | 4 |
pairs(breaks)
Multiple Regression Model
mult.reg = lm(WastedMins ~ FrustrationLevel + HoursToDeadline + TakeBreak + + TakeBreak*MinsBreak + MinsBreak, data = breaks)
summary(mult.reg)
##
## Call:
## lm(formula = WastedMins ~ FrustrationLevel + HoursToDeadline +
## TakeBreak + +TakeBreak * MinsBreak + MinsBreak, data = breaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0177 -4.0648 0.2672 3.3602 7.0119
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.90617 5.86707 2.370 0.0279 *
## FrustrationLevel -1.16540 0.49311 -2.363 0.0283 *
## HoursToDeadline -0.01478 0.08704 -0.170 0.8669
## TakeBreak -0.97841 4.57303 -0.214 0.8328
## MinsBreak 0.16684 0.14913 1.119 0.2765
## TakeBreak:MinsBreak NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.936 on 20 degrees of freedom
## Multiple R-squared: 0.517, Adjusted R-squared: 0.4204
## F-statistic: 5.351 on 4 and 20 DF, p-value: 0.004266
We have to pick a better model, so we remove variables that are not significant. A backwards elimination process was used to reach a model with only two predictors (FrustrationLevel and MinsBreak).
summary(stepAIC(mult.reg, direction = "backward"))
## Start: AIC=84.25
## WastedMins ~ FrustrationLevel + HoursToDeadline + TakeBreak +
## +TakeBreak * MinsBreak + MinsBreak
##
##
## Step: AIC=84.25
## WastedMins ~ FrustrationLevel + HoursToDeadline + TakeBreak +
## MinsBreak
##
## Df Sum of Sq RSS AIC
## - HoursToDeadline 1 0.703 488.07 82.289
## - TakeBreak 1 1.115 488.48 82.310
## - MinsBreak 1 30.498 517.86 83.771
## <none> 487.36 84.253
## - FrustrationLevel 1 136.109 623.47 88.411
##
## Step: AIC=82.29
## WastedMins ~ FrustrationLevel + TakeBreak + MinsBreak
##
## Df Sum of Sq RSS AIC
## - TakeBreak 1 1.229 489.29 80.352
## - MinsBreak 1 31.803 519.87 81.868
## <none> 488.07 82.289
## - FrustrationLevel 1 193.233 681.30 88.628
##
## Step: AIC=80.35
## WastedMins ~ FrustrationLevel + MinsBreak
##
## Df Sum of Sq RSS AIC
## <none> 489.29 80.352
## - MinsBreak 1 92.77 582.06 82.693
## - FrustrationLevel 1 208.78 698.08 87.236
##
## Call:
## lm(formula = WastedMins ~ FrustrationLevel + MinsBreak, data = breaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.974 -4.108 0.758 3.428 7.026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.37607 2.09687 5.902 6.13e-06 ***
## FrustrationLevel -1.13408 0.37014 -3.064 0.00568 **
## MinsBreak 0.14032 0.06871 2.042 0.05328 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.716 on 22 degrees of freedom
## Multiple R-squared: 0.5151, Adjusted R-squared: 0.471
## F-statistic: 11.68 on 2 and 22 DF, p-value: 0.0003489
Coefficients
Looking at the results of the more appropriate model, the intercept says that when FrustrationLevel and MinsBreak are zero, the minutes wasted are approximately 14. The coefficients for the significant variables tell us that for FrustrationLevel, if we hold all other variables constant, an increase in 1 degree of frustration levels leads to an decrease of -1.13 minutes of wasted time. Similarly, for MinsBreak, if we hold all other variables constant, an increase in 1 degree of length of minutes during break leads to an increase of 0.14 minutes of wasted time. So, the more frustrated, the less likely to waste time but the longer the break, the more likely to waste time.
Something that was very interesting to learn was that the coefficients will appear as NA when R can’t estimate all the parameters due to collinearity. If we look at the correlation table below, we see that TakeBreak and MinsBreak have a high correlation of 0.88. The best solution here would probably be to remove one of the variables and proceed.
kable(round(cor(breaks), 2))
| HoursToDeadline | FrustrationLevel | TakeBreak | MinsBreak | WastedMins | |
|---|---|---|---|---|---|
| HoursToDeadline | 1.00 | -0.61 | 0.13 | 0.19 | 0.35 |
| FrustrationLevel | -0.61 | 1.00 | -0.29 | -0.43 | -0.65 |
| TakeBreak | 0.13 | -0.29 | 1.00 | 0.88 | 0.43 |
| MinsBreak | 0.19 | -0.43 | 0.88 | 1.00 | 0.56 |
| WastedMins | 0.35 | -0.65 | 0.43 | 0.56 | 1.00 |
Residual Analysis
Residuals are not normal. Not only is the histogram not normal, but in the q-q plot the points do not fall along the line. Granted, the plots below are of the regression model before the elimination process.
hist(mult.reg$residuals, main = "Regression Residuals")
par(mfrow = c(1,2))
plot(fitted(mult.reg), resid(mult.reg))
abline(h=0, col = "blue")
qqnorm(mult.reg$residuals)
qqline(mult.reg$residuals, col = "red")
Linear Regression Model Appropriate?
Even if we were to remove variables from the full model and use only the ones deemed significant, we would still see that it would not be apprpriate. After including only the variables FrustationLevel and MinsBreak, the \(R^2\) value has risen to 47%, but that is still fairly low. The residuals also do not fall along the normal line, even though they are evenly spaced.
reduced.reg = stepAIC(mult.reg, direction = "backward")
## Start: AIC=84.25
## WastedMins ~ FrustrationLevel + HoursToDeadline + TakeBreak +
## +TakeBreak * MinsBreak + MinsBreak
##
##
## Step: AIC=84.25
## WastedMins ~ FrustrationLevel + HoursToDeadline + TakeBreak +
## MinsBreak
##
## Df Sum of Sq RSS AIC
## - HoursToDeadline 1 0.703 488.07 82.289
## - TakeBreak 1 1.115 488.48 82.310
## - MinsBreak 1 30.498 517.86 83.771
## <none> 487.36 84.253
## - FrustrationLevel 1 136.109 623.47 88.411
##
## Step: AIC=82.29
## WastedMins ~ FrustrationLevel + TakeBreak + MinsBreak
##
## Df Sum of Sq RSS AIC
## - TakeBreak 1 1.229 489.29 80.352
## - MinsBreak 1 31.803 519.87 81.868
## <none> 488.07 82.289
## - FrustrationLevel 1 193.233 681.30 88.628
##
## Step: AIC=80.35
## WastedMins ~ FrustrationLevel + MinsBreak
##
## Df Sum of Sq RSS AIC
## <none> 489.29 80.352
## - MinsBreak 1 92.77 582.06 82.693
## - FrustrationLevel 1 208.78 698.08 87.236
par(mfrow = c(1,2))
plot(fitted(reduced.reg), resid(reduced.reg))
abline(h=0, col = "blue")
qqnorm(reduced.reg$residuals)
qqline(reduced.reg$residuals, col = "red")
ceresPlots(reduced.reg)
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -0.24
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 9.24
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 81