data<- read.csv("c:\\Users\\abdal\\OneDrive\\Desktop\\TTU\\Spring 2023\\IE 5344\\FA\\FA10\\data-SoftDrinkDeliveryTime(1).csv")
x1<- data$NumCases
x2<- data$Distance.ft.
y <- data$DeliveryTime.min.
#1st model
model_1 <- lm(y~x1+x2+x1:x2)
summary(model_1)
##
## Call:
## lm(formula = y ~ x1 + x2 + x1:x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7316 -1.5387 0.0606 1.4375 4.7841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.1390846 1.3997413 5.100 4.73e-05 ***
## x1 1.0144063 0.1912517 5.304 2.93e-05 ***
## x2 0.0058273 0.0033825 1.723 0.099622 .
## x1:x2 0.0007419 0.0001750 4.240 0.000366 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.449 on 21 degrees of freedom
## Multiple R-squared: 0.9782, Adjusted R-squared: 0.9751
## F-statistic: 314.6 on 3 and 21 DF, p-value: < 2.2e-16
The R-squared of the first model is 0.9782, and the fitted equation is
ŷ = 7.1390846 + 1.0144063 x1 + 0.0058273 x2 + 0.0007419 x1 x2
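To illustrate how the fitted equation is used, the sketch below plugs one delivery into the coefficients reported by summary(model_1). The inputs (10 cases, 500 ft) are hypothetical values chosen for illustration, not rows from the data set.

```r
# Coefficients copied from the summary(model_1) output above
b <- c(intercept = 7.1390846, x1 = 1.0144063, x2 = 0.0058273, x1x2 = 0.0007419)

# Hypothetical inputs (assumed for illustration only)
cases <- 10      # number of cases
dist_ft <- 500   # distance in feet

# Fitted delivery time in minutes, including the interaction term
y_hat <- b[["intercept"]] + b[["x1"]] * cases + b[["x2"]] * dist_ft +
  b[["x1x2"]] * cases * dist_ft
print(y_hat)
```

With model_1 still in memory, predict(model_1, newdata = data.frame(x1 = 10, x2 = 500)) should agree with this value up to rounding of the coefficients.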
P2
What do you notice in the diagnostic plots (all of them)?
plot(model_1)
In the residuals vs. fitted values plot, the spread of the residuals is not constant, so the constant-variance assumption appears to be violated.
From the normal probability plot, the errors appear to be normally distributed.
From the plot of the square root of the standardized residuals vs. fitted values, observation 11 could be an outlier.
From the standardized residuals vs. leverage plot and from the hat values, one point lies beyond the 2p/n cutoff on the x axis (2p/n = 0.32 with p = 4 estimated coefficients and n = 25). This point is observation 9, so it could be a leverage point.
From the standardized residuals vs. leverage plot, observation 9 also warrants extra diagnosis, since it lies close to the Cook’s distance line.
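The visual checks above have numeric counterparts. The sketch below is a minimal example on simulated data (the data frame `sim` and its coefficients are assumptions, not the delivery-time data). It computes hat values, the 2p/n leverage cutoff, standardized residuals, and Cook's distance for a model of the same form.

```r
# Simulated data (an assumption; not the delivery-time data set)
set.seed(1)
n <- 25
sim <- data.frame(x1 = runif(n, 2, 30), x2 = runif(n, 30, 1500))
sim$y <- 7 + 1 * sim$x1 + 0.006 * sim$x2 + 0.0007 * sim$x1 * sim$x2 +
  rnorm(n, sd = 2.5)

fit <- lm(y ~ x1 + x2 + x1:x2, data = sim)

h <- hatvalues(fit)        # leverage of each observation
p <- length(coef(fit))     # number of estimated coefficients (4 here)
cutoff <- 2 * p / n        # common rule-of-thumb leverage cutoff

r <- rstandard(fit)        # standardized residuals
d <- cooks.distance(fit)   # influence of each observation on the fit

# Observations worth a closer look
print(which(h > cutoff))                          # high leverage
print(which(abs(r) > 2))                          # large standardized residual
print(round(sort(d, decreasing = TRUE)[1:3], 3))  # three largest Cook's distances
```

The same calls can be applied to model_1 directly. The threshold of 2 for standardized residuals is a common convention, not a value taken from the assignment.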
Remove the observation(s) that appear to be the most influential.
data2<- data[-9,]
x11<- data2$NumCases
x22<- data2$Distance.ft.
yy <- data2$DeliveryTime.min.
#model 2
model_2<- lm(yy~x11+x22+x11:x22)
summary(model_2)
##
## Call:
## lm(formula = yy ~ x11 + x22 + x11:x22)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8495 -1.3509 -0.0835 1.6174 4.9098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7984402 1.9709874 2.942 0.008062 **
## x11 1.2660217 0.3229617 3.920 0.000848 ***
## x22 0.0080441 0.0040895 1.967 0.063212 .
## x11:x22 0.0003480 0.0004432 0.785 0.441497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.452 on 20 degrees of freedom
## Multiple R-squared: 0.9502, Adjusted R-squared: 0.9428
## F-statistic: 127.3 on 3 and 20 DF, p-value: 3.368e-13
After removing observation 9, R-squared went down (from 0.9782 to 0.9502), which suggests that it may be an outlier rather than a leverage point.
model_1$coefficients
## (Intercept) x1 x2 x1:x2
## 7.1390845734 1.0144062540 0.0058273479 0.0007419211
model_2$coefficients
## (Intercept) x11 x22 x11:x22
## 5.7984402036 1.2660216932 0.0080440801 0.0003480246
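What are your estimates of the regression parameters? What is the associated value of R-squared?
The estimates are the model_2 coefficients above, and R-squared is 0.9502.
Comparing the coefficients of the two models, before and after removing observation 9, we can see that the estimates change considerably (for example, the interaction coefficient drops from 0.00074 to 0.00035 and is no longer significant), which means observation 9 is an influential point.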
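To put a number on how much the estimates move, the sketch below computes the relative change in each coefficient using the values printed for model_1 and model_2. The vector names `b_full` and `b_drop9` are introduced here for illustration only.

```r
# Coefficients copied from the model_1 and model_2 output above
b_full <- c("(Intercept)" = 7.1390845734, x1 = 1.0144062540,
            x2 = 0.0058273479, "x1:x2" = 0.0007419211)
b_drop9 <- c("(Intercept)" = 5.7984402036, x1 = 1.2660216932,
             x2 = 0.0080440801, "x1:x2" = 0.0003480246)

# Relative change in each estimate after removing observation 9, in percent
rel_change <- 100 * (b_drop9 - b_full) / b_full
print(round(rel_change, 1))
```

The interaction coefficient is roughly halved, which is the largest relative shift of the four. With the fitted object available, dfbeta(model_1) reports the corresponding absolute change for every observation at once.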
plot(model_2)
In the residuals vs. fitted values plot, the spread of the residuals is not constant, so the constant-variance assumption appears to be violated.
From the normal probability plot, the errors appear to be normally distributed.
From the plot of the square root of the standardized residuals vs. fitted values, observation 10 could be an outlier.
From the standardized residuals vs. leverage plot and from the hat values, two points lie beyond the 2p/n cutoff on the x axis (2p/n = 0.33 with p = 4 estimated coefficients and n = 24). These are points 20 and 21 (row positions in the reduced data set), and they could be leverage points.
From the standardized residuals vs. leverage plot, no observation needs extra diagnosis, since none of them is close to the Cook’s distance line.
as.data.frame(hatvalues(model_2))
## hatvalues(model_2)
## 1 0.11986469
## 2 0.11722790
## 3 0.11500184
## 4 0.13466379
## 5 0.08189527
## 6 0.05582552
## 7 0.23755678
## 8 0.08060898
## 9 0.20920377
## 10 0.14909568
## 11 0.24861269
## 12 0.07496202
## 13 0.08471636
## 14 0.08002516
## 15 0.21146407
## 16 0.06415220
## 17 0.11811742
## 18 0.22081417
## 19 0.19759678
## 20 0.33788606
## 21 0.73378743
## 22 0.07996732
## 23 0.14786902
## 24 0.09908505
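The flagged points can also be found programmatically. The sketch below re-enters the hat values printed above and compares them with the 2p/n cutoff, taking p as the number of estimated coefficients (4 here, an assumption about how p is counted). It also checks that the hat values sum to p, a general property of the hat matrix.

```r
# Hat values copied from the hatvalues(model_2) output above
h <- c(0.11986469, 0.11722790, 0.11500184, 0.13466379, 0.08189527, 0.05582552,
       0.23755678, 0.08060898, 0.20920377, 0.14909568, 0.24861269, 0.07496202,
       0.08471636, 0.08002516, 0.21146407, 0.06415220, 0.11811742, 0.22081417,
       0.19759678, 0.33788606, 0.73378743, 0.07996732, 0.14786902, 0.09908505)

n <- length(h)   # 24 observations remain after dropping row 9
p <- 4           # intercept, x11, x22 and x11:x22

cutoff <- 2 * p / n
flagged <- which(h > cutoff)   # row positions with high leverage

print(cutoff)
print(flagged)
print(sum(h))    # should be close to p
```

In the live session the same result comes from which(hatvalues(model_2) > 2 * length(coef(model_2)) / nrow(data2)).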
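# Drop the flagged rows. Indices are positional: after row 20 is removed,
# index 21 of data3 refers to what was row 22 of data2.
data3<- data2[-20,]
data4 <- data3[-21,]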
x111<- data4$NumCases
x222<- data4$Distance.ft.
yyy <- data4$DeliveryTime.min.
model_3<- lm(yyy~x111+x222+x111:x222)
summary(model_3)
##
## Call:
## lm(formula = yyy ~ x111 + x222 + x111:x222)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1097 -1.2872 -0.1149 1.3284 4.3828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.459e+00 2.006e+00 2.222 0.039308 *
## x111 1.614e+00 3.671e-01 4.396 0.000348 ***
## x222 9.574e-03 3.949e-03 2.424 0.026099 *
## x111:x222 -8.349e-05 4.834e-04 -0.173 0.864795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.324 on 18 degrees of freedom
## Multiple R-squared: 0.9597, Adjusted R-squared: 0.953
## F-statistic: 142.8 on 3 and 18 DF, p-value: 9.731e-13
R-squared is 0.9597, up from 0.9502 in model 2, which suggests that points 20 and 21 were leverage points.
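Row positions shift after each removal, so dropping two rows in separate steps can remove a different row than intended. The sketch below uses a toy data frame (hypothetical, standing in for data2) to compare chained removal with single-step removal.

```r
# Toy data frame standing in for data2 (24 rows; 'id' records the original position)
toy <- data.frame(id = 1:24)

# Chained removal: the second index is applied to the already shortened frame
chained <- toy[-20, , drop = FALSE]
chained <- chained[-21, , drop = FALSE]

# Single-step removal: both indices refer to rows of the same frame
single <- toy[-c(20, 21), , drop = FALSE]

print(setdiff(toy$id, chained$id))  # rows actually removed by the chained version
print(setdiff(toy$id, single$id))   # rows removed by the single-step version
```

Applied to this analysis, data2[-c(20, 21), ] would drop both flagged rows in one step. The summary shown for model_3 was produced with the chained version, so it is worth rerunning the fit with the single-step version before relying on these numbers.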
Conclusion:
Removing observation 9 showed that it was influential, since the coefficient estimates changed noticeably. We then examined points 20 and 21 as possible leverage points. R-squared increased after they were removed, so we treated them as leverage points. Because these points have high leverage, the intercept estimate could be biased, so they deserve extra consideration.