P1

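# Load the delivery time data
data <- read.csv("c:\\Users\\abdal\\OneDrive\\Desktop\\TTU\\Spring 2023\\IE 5344\\FA\\FA10\\data-SoftDrinkDeliveryTime(1).csv")
x1 <- data$NumCases           # number of cases
x2 <- data$Distance.ft.       # distance (ft)
y  <- data$DeliveryTime.min.  # delivery time (min)

# Model 1: both predictors plus their interaction
model_1 <- lm(y ~ x1 + x2 + x1:x2)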
summary(model_1)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x1:x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7316 -1.5387  0.0606  1.4375  4.7841 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.1390846  1.3997413   5.100 4.73e-05 ***
## x1          1.0144063  0.1912517   5.304 2.93e-05 ***
## x2          0.0058273  0.0033825   1.723 0.099622 .  
## x1:x2       0.0007419  0.0001750   4.240 0.000366 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.449 on 21 degrees of freedom
## Multiple R-squared:  0.9782, Adjusted R-squared:  0.9751 
## F-statistic: 314.6 on 3 and 21 DF,  p-value: < 2.2e-16
  1. What are your estimates of the regression parameters and what is the associated value of R²?

R² is 0.9782 (adjusted R² is 0.9751). The fitted model is
ŷ = 7.1390846 + 1.0144063 x1 + 0.0058273 x2 + 0.0007419 (x1 × x2)
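
As a quick check on how to read the fitted equation above, the sketch below evaluates it for one stop. The coefficients are copied from the summary output; the function name `predict_time` and the example stop (10 cases, 500 ft) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model_1) output
b <- c(intercept = 7.1390846, x1 = 1.0144063, x2 = 0.0058273, x1x2 = 0.0007419)

# Predicted delivery time (min) for a number of cases and a distance (ft).
# predict_time is a hypothetical helper, not part of the original analysis.
predict_time <- function(cases, distance) {
  unname(b["intercept"] + b["x1"] * cases + b["x2"] * distance +
           b["x1x2"] * cases * distance)
}

# Hypothetical stop: 10 cases, 500 ft
pred <- predict_time(10, 500)
round(pred, 2)  # 23.91
```

Because of the interaction term, the effect of one additional case depends on the distance: at 500 ft it is 1.0144063 + 0.0007419 × 500, or about 1.39 minutes per case.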

P2

What do you notice in the diagnostic plots (all of them)?

plot(model_1)

In the residuals vs. fitted values plot, the spread of the residuals does not look constant, so the constant-variance assumption appears to be violated.

From the normal probability plot, the errors appear to be approximately normally distributed.

In the plot of the square root of the standardized residuals vs. the fitted values, observation 11 stands out and could be an outlier.

From the standardized residuals vs. leverage plot and the hat values, one point exceeds the cutoff 2p/n on the x-axis. With p = 4 estimated coefficients (intercept included) and n = 25, the cutoff is 0.32. This point is observation 9, which could be a leverage point.

The same plot shows observation 9 lying close to the Cook's distance line, so it needs extra diagnosis as a possibly influential point.
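
The readings taken from the plots above can also be tabulated numerically. The sketch below is one possible way to do that, not the analysis used in this report: it runs on simulated data (the data frame `sim`, its column names, and the coefficients used to generate it are all made up), and the cutoffs 2p/n for leverage and |standardized residual| > 2 are common rules of thumb.

```r
# Simulated stand-in data: NOT the soft drink delivery data set
set.seed(1)
n <- 25
sim <- data.frame(
  cases    = round(runif(n, 2, 30)),
  distance = round(runif(n, 50, 1500))
)
sim$time <- 5 + 1.2 * sim$cases + 0.01 * sim$distance + rnorm(n, sd = 2)

# cases * distance expands to cases + distance + cases:distance
fit <- lm(time ~ cases * distance, data = sim)

p <- length(coef(fit))    # number of estimated coefficients (4 here)
h <- hatvalues(fit)       # leverage of each observation
r <- rstandard(fit)       # standardized residuals
d <- cooks.distance(fit)  # Cook's distance

diag_table <- data.frame(
  leverage      = h,
  std_resid     = r,
  cooks_d       = d,
  high_leverage = h > 2 * p / n,  # rule-of-thumb leverage cutoff
  large_resid   = abs(r) > 2      # rule-of-thumb outlier cutoff
)

# Show only the observations flagged by either rule
diag_table[diag_table$high_leverage | diag_table$large_resid, ]
```

Fitting with `data = sim` keeps the data frame's row names on the diagnostics, so a flagged row can be traced back to the same row of the data even after other rows are removed.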

P3

Remove the observation(s) that appear to be the most influential.

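# Drop observation 9 (row 9 of the original data)
data2 <- data[-9, ]
x11 <- data2$NumCases
x22 <- data2$Distance.ft.
yy  <- data2$DeliveryTime.min.

# Model 2: same formula, fitted without observation 9
model_2 <- lm(yy ~ x11 + x22 + x11:x22)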
summary(model_2)
## 
## Call:
## lm(formula = yy ~ x11 + x22 + x11:x22)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8495 -1.3509 -0.0835  1.6174  4.9098 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.7984402  1.9709874   2.942 0.008062 ** 
## x11         1.2660217  0.3229617   3.920 0.000848 ***
## x22         0.0080441  0.0040895   1.967 0.063212 .  
## x11:x22     0.0003480  0.0004432   0.785 0.441497    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.452 on 20 degrees of freedom
## Multiple R-squared:  0.9502, Adjusted R-squared:  0.9428 
## F-statistic: 127.3 on 3 and 20 DF,  p-value: 3.368e-13

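After removing observation 9, R² dropped from 0.9782 to 0.9502 and the interaction term is no longer significant (p = 0.44, compared with p = 0.0004 in the first model). A change in R² by itself does not tell us whether a point is an outlier or a leverage point, but a change this large in the fit suggests that observation 9 is influential.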

model_1$coefficients
##  (Intercept)           x1           x2        x1:x2 
## 7.1390845734 1.0144062540 0.0058273479 0.0007419211
model_2$coefficients
##  (Intercept)          x11          x22      x11:x22 
## 5.7984402036 1.2660216932 0.0080440801 0.0003480246

P4

What are your estimates of the regression parameters, and what is the associated value of R²?

R² is 0.9502. The fitted model is
ŷ = 5.7984402 + 1.2660217 x1 + 0.0080441 x2 + 0.0003480 (x1 × x2)

Comparing the coefficients of the two models, before and after removing observation 9, we can see that they change substantially (the interaction coefficient is roughly halved), which means observation 9 is an influential point.
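
To put a number on how much the coefficients move, the sketch below computes the percentage change in each coefficient from model_1 to model_2. The values are copied from the `$coefficients` output above; the vector names `b1`, `b2`, and `pct_change` are introduced here only for illustration.

```r
# Coefficients copied from model_1$coefficients and model_2$coefficients
b1 <- c(intercept = 7.1390845734, cases = 1.0144062540,
        distance = 0.0058273479, interaction = 0.0007419211)
b2 <- c(intercept = 5.7984402036, cases = 1.2660216932,
        distance = 0.0080440801, interaction = 0.0003480246)

# Percentage change in each coefficient after removing observation 9
pct_change <- 100 * (b2 - b1) / b1
round(pct_change, 1)
# intercept: -18.8, cases: 24.8, distance: 38.0, interaction: -53.1
```

Every coefficient moves by more than 15 percent after dropping a single observation out of 25, which is consistent with calling observation 9 influential.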

P5 and 6

plot(model_2)

In the residuals vs. fitted values plot, the spread of the residuals does not look constant, so the constant-variance assumption appears to be violated.

From the normal probability plot, the errors appear to be approximately normally distributed.

In the plot of the square root of the standardized residuals vs. the fitted values, observation 10 could be an outlier. Because the model was fitted to vectors taken from the reduced data, points are labeled by their position in that data, so this is observation 11 of the original data.

From the standardized residuals vs. leverage plot and the hat values, two points exceed the cutoff 2p/n on the x-axis. With p = 4 estimated coefficients (intercept included) and n = 24, the cutoff is about 0.333. These are points 20 (hat value 0.338) and 21 (hat value 0.734), numbered by position in the reduced data, and they could be leverage points.

The same plot shows that no observation lies close to the Cook's distance line, so none of them needs extra diagnosis for influence.
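
The leverage cutoff can be checked directly against the hat values of model_2. The sketch below copies the 24 values from the `hatvalues(model_2)` output in this section and assumes p = 4, the number of estimated coefficients including the intercept; the hat values sum to p, which supports that choice.

```r
# Hat values copied from the hatvalues(model_2) output (positions 1 to 24)
h <- c(0.11986469, 0.11722790, 0.11500184, 0.13466379, 0.08189527, 0.05582552,
       0.23755678, 0.08060898, 0.20920377, 0.14909568, 0.24861269, 0.07496202,
       0.08471636, 0.08002516, 0.21146407, 0.06415220, 0.11811742, 0.22081417,
       0.19759678, 0.33788606, 0.73378743, 0.07996732, 0.14786902, 0.09908505)

n <- length(h)       # 24 observations
p <- 4               # intercept, two main effects, one interaction
round(sum(h), 3)     # 4: the hat values sum to the number of coefficients

cutoff  <- 2 * p / n # rule-of-thumb leverage cutoff, 8/24
flagged <- which(h > cutoff)
flagged              # 20 21
```

With p = 3 the cutoff would be 0.25 and the same two positions would be flagged, since position 11 (0.2486) falls just below it.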

as.data.frame(hatvalues(model_2))
##    hatvalues(model_2)
## 1          0.11986469
## 2          0.11722790
## 3          0.11500184
## 4          0.13466379
## 5          0.08189527
## 6          0.05582552
## 7          0.23755678
## 8          0.08060898
## 9          0.20920377
## 10         0.14909568
## 11         0.24861269
## 12         0.07496202
## 13         0.08471636
## 14         0.08002516
## 15         0.21146407
## 16         0.06415220
## 17         0.11811742
## 18         0.22081417
## 19         0.19759678
## 20         0.33788606
## 21         0.73378743
## 22         0.07996732
## 23         0.14786902
## 24         0.09908505
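# Remove the flagged points in two steps.
# NOTE: negative indices are positional. After row 20 is dropped, the row that
# was at position 21 of data2 moves to position 20 of data3, so data3[-21, ]
# drops the row that was at position 22 of data2. The output below comes from
# this two-step removal. Dropping positions 20 and 21 in one step would be
# data2[-c(20, 21), ].
data3 <- data2[-20, ]
data4 <- data3[-21, ]
x111 <- data4$NumCases
x222 <- data4$Distance.ft.
yyy  <- data4$DeliveryTime.min.

# Model 3: same formula, fitted to the 22 remaining observations
model_3 <- lm(yyy ~ x111 + x222 + x111:x222)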
summary(model_3)
## 
## Call:
## lm(formula = yyy ~ x111 + x222 + x111:x222)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1097 -1.2872 -0.1149  1.3284  4.3828 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.459e+00  2.006e+00   2.222 0.039308 *  
## x111         1.614e+00  3.671e-01   4.396 0.000348 ***
## x222         9.574e-03  3.949e-03   2.424 0.026099 *  
## x111:x222   -8.349e-05  4.834e-04  -0.173 0.864795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.324 on 18 degrees of freedom
## Multiple R-squared:  0.9597, Adjusted R-squared:  0.953 
## F-statistic: 142.8 on 3 and 18 DF,  p-value: 9.731e-13

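R² is 0.9597, compared with 0.9502 for model_2.

Points 20 and 21 were flagged as leverage points by their hat values. R² went up after the removal step, which suggests that the removed observations were affecting the fit. The interaction term remains non-significant (p = 0.86).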
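
One detail worth checking when rows are removed one at a time is that negative indices in R are positional, so positions shift after the first removal. The sketch below illustrates this on a toy 24-row data frame (`toy` and its `id` column are hypothetical stand-ins, not the delivery data).

```r
# Toy stand-in for a 24-row data set; id records the starting position
toy <- data.frame(id = 1:24)

# Two-step removal: after dropping row 20, old row 21 sits at position 20,
# so dropping position 21 next removes what was originally row 22
step1 <- toy[-20, , drop = FALSE]
step2 <- step1[-21, , drop = FALSE]
removed_two_step <- setdiff(toy$id, step2$id)
removed_two_step   # 20 22

# One-step removal drops exactly positions 20 and 21
one_step <- toy[-c(20, 21), , drop = FALSE]
removed_one_step <- setdiff(toy$id, one_step$id)
removed_one_step   # 20 21
```

Removing all flagged positions in a single call, or selecting rows by name or by a logical condition, avoids this shift.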

P7

Conclusion:

When we removed observation 9, the coefficients changed substantially, so it was an influential point. We then did more diagnosis on points 20 and 21 because their hat values marked them as possible leverage points. R² increased after the removal step, which is consistent with those points affecting the fit. Since high-leverage points can pull the fitted surface toward themselves, the estimated intercept could be biased, so these points deserve extra consideration.