The file data-SoftDrinkDeliveryTime.csv contains data on the time that it takes a driver to deliver soft drinks to different locations. Regress delivery time on the number of cases and the distance walked, including their second order interaction. Please answer all questions in a commented R file.
What are your estimates of the regression parameters and what is the associated value of \(R^2\)?
The data have been read in and assigned to the variable dat. The first column of observation labels was removed from the data frame dat, leaving the response followed by the two predictors. The columns were then renamed time, cases, and dist, respectively, to make them easier to reference.
An initial look at the data via plotting (Figure 1) suggests that one or more points could be outliers and/or leverage points, so we will pay close attention to the linear model results.
Figure 1 - Scatter Plot of Original Data
The linear model, including the second-order interaction of the predictors, is built using the lm function in R. The following chunk of R code builds the model and displays the summary of the model.
##
## Call:
## lm(formula = time ~ cases + dist + cases:dist, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7316 -1.5387 0.0606 1.4375 4.7841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.1390846 1.3997413 5.100 4.73e-05 ***
## cases 1.0144063 0.1912517 5.304 2.93e-05 ***
## dist 0.0058273 0.0033825 1.723 0.099622 .
## cases:dist 0.0007419 0.0001750 4.240 0.000366 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.449 on 21 degrees of freedom
## Multiple R-squared: 0.9782, Adjusted R-squared: 0.9751
## F-statistic: 314.6 on 3 and 21 DF, p-value: < 2.2e-16
The regression parameters are listed in the Estimate column of the summary above, and the \(R^2\) value is 0.9782.
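Written out with the estimates from the summary (rounded here to four significant figures), the fitted model can be expressed as:

\[
\widehat{\text{time}} = 7.139 + 1.014\,\text{cases} + 0.005827\,\text{dist} + 0.0007419\,(\text{cases} \times \text{dist})
\]

Because of the interaction term, the estimated effect of one additional case depends on the distance walked:

\[
\frac{\partial\,\widehat{\text{time}}}{\partial\,\text{cases}} = 1.014 + 0.0007419\,\text{dist}
\]

As an illustration, at the first observation's distance of 560 this gives approximately \(1.014 + 0.4155 \approx 1.43\) time units per additional case.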
What do you notice in the diagnostic plots (all of them)?
In the Residuals vs Fitted plot (Figure 2), R is indicating that there may be some concerns with points 1, 10, and 11. Looking at the plot, it does appear that point 11 could be an outlier because it is not within the logical bands of constant variance that we would draw on the plot. There is also a fitted value alone out at a value of 80. This point may be something to watch for, but there is no indication in this plot that it is of major concern.
Figure 2 - Residuals vs Fitted of Original Model
The Normal Q-Q plot (Figure 3) does not indicate any deviation from normality. Again, R indicates points 1, 10, and 11 are concerning.
Figure 3 - Normal Q-Q of Original Model
The Scale-Location plot (Figure 4) shows that point 11 is approaching the rule-of-thumb limit of \(\sqrt{3}\), but has not reached it. However, it is the only point that is close to it and is concerning.
Figure 4 - Scale-Location of Original Model
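The Scale-Location check can also be done numerically. The sketch below is a minimal illustration on synthetic data (the data frame toy and its values are made up and are not the delivery data); it computes the quantity plotted on the y-axis and compares it with the \(\sqrt{3}\) rule of thumb used above.

```r
# Synthetic stand-in for the delivery data (hypothetical values, not the CSV).
set.seed(2)
toy <- data.frame(cases = sample(2:30, 25, replace = TRUE),
                  dist  = sample(50:1500, 25, replace = TRUE))
toy$time <- 5 + 1.2 * toy$cases + 0.008 * toy$dist + rnorm(25, sd = 2)

fit <- lm(time ~ cases + dist + cases:dist, data = toy)

std_res <- rstandard(fit)        # standardized residuals
sl      <- sqrt(abs(std_res))    # y-axis of the Scale-Location plot

# Points above sqrt(3), i.e. with |standardized residual| above 3
flagged <- names(sl)[sl > sqrt(3)]

print(round(sort(sl, decreasing = TRUE)[1:3], 3))  # three largest values
print(flagged)
```

With the real data, replacing fit by model1 would list any observations above the limit by their row labels.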
The Residuals vs Leverage plot (Figure 5) shows that point 9 lies clearly beyond the Cook’s distance contour of 1. We interpret this to mean that point 9 is influential, that is, it is both an outlier and a high-leverage point. The Cook’s Distance plot in Figure 6 confirms this. Point 9 is likely the point that sits far away from the rest of the data in Figure 1.
Figure 5 - Residuals vs Leverage of Original Model
Figure 6 - Cook’s Distance Plot of Original Model
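The same diagnosis can be made without reading the plots. The following sketch uses synthetic data with one deliberately extreme row (all values are hypothetical) and applies the Cook’s distance threshold of 1 used in this report; cooks.distance and hatvalues are base R functions.

```r
# Synthetic stand-in for the delivery data (hypothetical values, not the CSV).
set.seed(3)
toy <- data.frame(cases = sample(2:30, 25, replace = TRUE),
                  dist  = sample(50:1500, 25, replace = TRUE))
toy$time <- 5 + 1.2 * toy$cases + 0.008 * toy$dist + rnorm(25, sd = 2)

# Make the last row extreme in both predictors and off the trend.
toy$cases[25] <- 60
toy$dist[25]  <- 4000
toy$time[25]  <- 40

fit <- lm(time ~ cases + dist + cases:dist, data = toy)

cd <- cooks.distance(fit)   # influence of each observation on the fit
h  <- hatvalues(fit)        # leverage of each observation

influential <- names(cd)[cd > 1]   # rule of thumb from the report

print(round(sort(cd, decreasing = TRUE)[1:3], 3))
print(influential)
print(names(which.max(h)))  # row label with the highest leverage
```

Reporting row labels rather than positions keeps the result consistent with the point labels that plot() prints.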
Remove the observation(s) that appears to be the most influential.
From the previous question, we determined that observation 9 is an influential point and we will remove it from our model. The following chunk of R code was used to create a new data frame, dat2, that has the 9th observation removed.
Checking and comparing the structures of dat and dat2 respectively, you can see that dat2 is one row smaller than dat.
## 'data.frame': 25 obs. of 3 variables:
## $ time : num 16.7 11.5 12 14.9 13.8 ...
## $ cases: int 7 3 3 4 6 7 2 7 30 5 ...
## $ dist : int 560 220 340 80 150 330 110 210 1460 605 ...
## 'data.frame': 24 obs. of 3 variables:
## $ time : num 16.7 11.5 12 14.9 13.8 ...
## $ cases: int 7 3 3 4 6 7 2 7 5 16 ...
## $ dist : int 560 220 340 80 150 330 110 210 605 688 ...
What are your estimates of the regression parameters and what is the associated value of \(R^2\)?
The new linear model with the 9th observation removed is again built using the lm function in R. The following chunk of R code builds the model and displays the summary of the model.
##
## Call:
## lm(formula = time ~ cases + dist + cases:dist, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8495 -1.3509 -0.0835 1.6174 4.9098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7984402 1.9709874 2.942 0.008062 **
## cases 1.2660217 0.3229617 3.920 0.000848 ***
## dist 0.0080441 0.0040895 1.967 0.063212 .
## cases:dist 0.0003480 0.0004432 0.785 0.441497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.452 on 20 degrees of freedom
## Multiple R-squared: 0.9502, Adjusted R-squared: 0.9428
## F-statistic: 127.3 on 3 and 20 DF, p-value: 3.368e-13
The regression parameters are listed in the Estimate column of the summary above, and the \(R^2\) value is 0.9502.
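The estimates and \(R^2\) can also be pulled out of a fitted model directly instead of being read off the printed summary. This is a minimal sketch on synthetic data (toy and its values are made up for illustration); coef and summary are base R.

```r
# Synthetic stand-in for the delivery data (hypothetical values, not the CSV).
set.seed(1)
toy <- data.frame(cases = sample(2:30, 25, replace = TRUE),
                  dist  = sample(50:1500, 25, replace = TRUE))
toy$time <- 5 + 1.2 * toy$cases + 0.008 * toy$dist + rnorm(25, sd = 2)

fit <- lm(time ~ cases + dist + cases:dist, data = toy)

betas <- coef(fit)                    # named vector of the four estimates
r2    <- summary(fit)$r.squared       # multiple R-squared
adjr2 <- summary(fit)$adj.r.squared   # adjusted R-squared

print(round(betas, 4))
print(round(c(R2 = r2, adjR2 = adjr2), 4))
```

Applied to model1 or model2, the same calls return the values shown in the corresponding summary output.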
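Table 1 shows the regression parameters side by side for the original model (model1) and the new model (model2). The estimates changed noticeably: for example, the interaction coefficient fell from 0.0007419 to 0.0003480 and is no longer statistically significant (its p-value rose from 0.000366 to 0.441). This indicates that the 9th observation had a large impact on the linear model.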
Table 1 - Comparison of Regression Coefficients
| Coefficient | Model_1 | Model_2 |
|---|---|---|
| \(\beta_0\) (Intercept) | 7.1390846 | 5.7984402 |
| \(\beta_1\) (cases) | 1.0144063 | 1.2660217 |
| \(\beta_2\) (dist) | 0.0058273 | 0.0080441 |
| \(\beta_3\) (cases:dist) | 0.0007419 | 0.0003480 |
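A comparison table of this kind can be assembled from the two fitted models. The sketch below is one possible way to do it, using synthetic data and hypothetical object names (toy, fit_all, fit_drop, comparison); the percent-change column is an addition for illustration and is not part of Table 1.

```r
# Synthetic stand-in for the delivery data (hypothetical values, not the CSV).
set.seed(4)
toy <- data.frame(cases = sample(2:30, 25, replace = TRUE),
                  dist  = sample(50:1500, 25, replace = TRUE))
toy$time <- 5 + 1.2 * toy$cases + 0.008 * toy$dist + rnorm(25, sd = 2)

# Fit with all rows, then with row 9 dropped.
fit_all  <- lm(time ~ cases + dist + cases:dist, data = toy)
fit_drop <- lm(time ~ cases + dist + cases:dist, data = toy[-9, ])

# Side-by-side estimates; row names are the coefficient names.
comparison <- data.frame(Model_1 = coef(fit_all),
                         Model_2 = coef(fit_drop))
comparison$Pct_change <- 100 * (comparison$Model_2 - comparison$Model_1) /
  comparison$Model_1

print(round(comparison, 4))
```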
What do you notice in the diagnostic plots (all of them) after removing this point?
In Figure 7, R flags points 1, 11, and 23 as points of interest. Looking at the plot, we think that point 11 could be an outlier because it falls outside the logical bands of constant variance.
Figure 7 - Residuals vs Fitted for Model Two
The Normal Q-Q plot for Model Two (Figure 8) does not indicate any deviation from normality. R flags points 1, 10, and 11 as potentially concerning for the model.
Figure 8 - Normal Q-Q of Model Two
The Scale-Location plot for Model Two (Figure 9) shows that point 11 is approaching the rule-of-thumb limit of \(\sqrt{3}\) but has not reached it. Even so, point 11 remains a concern for the model.
Figure 9 - Scale-Location for Model Two
The Residuals vs Leverage plot for Model Two (Figure 10) shows no point beyond the Cook’s distance contour of 1, so no point is flagged as influential. However, R labels points 10, 11, and 21, which could be outliers that affect the parameter estimates.
Figure 10 - Residuals vs Leverage for Model Two
Is there now another point that might be influential or that has leverage?
Looking at the Cook’s Distance plot for Model Two (Figure 11), we conclude that no other points need to be removed from the model. Every point falls below the Cook’s distance rule of thumb of 1.
To check this conclusion, we remove point 11 and see whether doing so improves the model.
Figure 11 - Cook’s Distance for Model Two
Response
Do you believe that this point should have been removed? (explain)
After removing point 11, the model’s \(\beta\) estimates changed very little between model 2 and model 3 (for example, the cases coefficient moved from 1.2660 to 1.2612), the same terms remain significant at the 0.05 level, and \(R^2\) stayed at 0.9502. The most visible change was the overall F-test p-value, which went from 3.368e-13 to 1.48e-12.
We conclude that point 11 should not be removed because its influence on the model is negligible.
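One caveat about how the point is removed: the appendix code builds dat3 with dat2[-11,], which is a positional index. Because a data frame keeps its original row labels after a row is dropped, and plot() labels points by those row labels, position 11 of dat2 is the row labeled 12, not the row labeled 11. The model 3 summary below may therefore reflect the removal of a different observation than the one flagged in the plots. The small sketch below, on a made-up data frame, illustrates the difference between dropping by position and dropping by label.

```r
# Toy data frame with default row labels "1".."25" (hypothetical data).
toy  <- data.frame(x = 1:25)
toy2 <- toy[-9, , drop = FALSE]        # drop row 9; labels are kept

print(rownames(toy2)[11])              # label sitting at position 11
print(which(rownames(toy2) == "11"))   # position of the row labeled 11

# Dropping by position versus dropping by label
by_position <- toy2[-11, , drop = FALSE]
by_label    <- toy2[rownames(toy2) != "11", , drop = FALSE]

print("11" %in% rownames(by_position)) # TRUE: the row labeled 11 is still there
print("11" %in% rownames(by_label))    # FALSE: the row labeled 11 is gone
```

Dropping by label, as in the by_label line, removes the observation that the diagnostic plots actually point to.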
##
## Call:
## lm(formula = time ~ cases + dist + cases:dist, data = dat3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8489 -1.4138 -0.1001 1.6349 4.9154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8158810 2.1087088 2.758 0.01252 *
## cases 1.2612102 0.3701404 3.407 0.00295 **
## dist 0.0080327 0.0042139 1.906 0.07186 .
## cases:dist 0.0003536 0.0004938 0.716 0.48263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.516 on 19 degrees of freedom
## Multiple R-squared: 0.9502, Adjusted R-squared: 0.9424
## F-statistic: 120.9 on 3 and 19 DF, p-value: 1.48e-12
##
## Call:
## lm(formula = time ~ cases + dist + cases:dist, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8495 -1.3509 -0.0835 1.6174 4.9098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7984402 1.9709874 2.942 0.008062 **
## cases 1.2660217 0.3229617 3.920 0.000848 ***
## dist 0.0080441 0.0040895 1.967 0.063212 .
## cases:dist 0.0003480 0.0004432 0.785 0.441497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.452 on 20 degrees of freedom
## Multiple R-squared: 0.9502, Adjusted R-squared: 0.9428
## F-statistic: 127.3 on 3 and 20 DF, p-value: 3.368e-13
knitr::opts_chunk$set(echo = TRUE)
#Load libraries
library(MASS)
library(readxl)
library(knitr)
#Read in data file
dat<-read.csv(file.choose())
#Observe head and structure of dat
head(dat)
str(dat)
#Eliminate first column of dat and reobserve header and structure
dat<-dat[,-1]
head(dat)
str(dat)
#Rename the columns to make easier to work with
colnames(dat)<-c('time', 'cases', 'dist')
head(dat)
#Generate the linear model that includes second order interaction
model1<-lm(time~cases+dist+cases:dist,data = dat)
summary(model1)
plot(model1)
#Based on plots, data point 9 is outside of Cooks Distance
#Remove data point 9 from data
dat2<-dat[-9,]
str(dat2)
model2<-lm(time~cases+dist+cases:dist,data = dat2)
summary(model2)
plot(model2)
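# Removing Point 11
# Caution: dat2 keeps the original row labels, so dat2[-11,] drops the 11th row
# by position, which carries the label "12". To drop the row that plot(model2)
# labels 11, use dat2[rownames(dat2) != "11", ] instead.
dat3 <- dat2[-11,]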
model3 <- lm(time~cases+dist+cases:dist,data = dat3)
summary(model3)