ISyE 6414 Regression Analysis | Spring 2023
Homework 5

Refer to the file homework05Hospital.csv, which can be found in the homework section of Canvas, to answer Questions 1-5. This data set presents data concerning the need for labor in 16 hospitals. The main objective of the regression analysis is to evaluate the performance of the hospitals in terms of how many labor hours are used relative to how many labor hours are needed. Here:

y: monthly labor hours required
x1: monthly X-ray exposures
x2: monthly occupied bed days
x3: average length of patients’ stays (in days)

  1. Construct and solve a multiple linear regression model using this dataset. Identify and report each coefficient (β estimate).
data = read.csv('homework05Hospital.csv', encoding = 'UTF-8')
data
##     Xray  BedDays Length    Hours
## 1   2463   472.92   4.45   566.52
## 2   2048  1339.75   6.92   696.82
## 3   3940   620.25   4.28  1033.15
## 4   6505   568.33   3.90  1603.62
## 5   5723  1497.60   5.50  1611.37
## 6  11520  1365.83   4.60  1613.27
## 7   5779  1687.00   5.62  1854.17
## 8   5969  1639.92   5.15  2160.55
## 9   8461  2872.33   6.18  2305.58
## 10 20106  3655.08   6.15  3503.93
## 11 13313  2912.00   5.88  3571.89
## 12 10771  3921.00   4.88  3741.40
## 13 15543  3865.67   5.50  4026.52
## 14 34703 12446.33  10.78 11732.17
## 15 39204 14098.40   7.05 15414.94
## 16 86533 15524.00   6.35 18854.45
model_lm = lm(Hours~., data = data)
summary(model_lm)
## 
## Call:
## lm(formula = Hours ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -677.23 -270.19   60.93  228.32  517.70 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1946.80204  504.18193   3.861  0.00226 ** 
## Xray           0.03858    0.01304   2.958  0.01197 *  
## BedDays        1.03939    0.06756  15.386 2.91e-09 ***
## Length      -413.75780   98.59828  -4.196  0.00124 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 387.2 on 12 degrees of freedom
## Multiple R-squared:  0.9961, Adjusted R-squared:  0.9952 
## F-statistic:  1028 on 3 and 12 DF,  p-value: 9.919e-15

Beta-0 = 1946.80
Beta-Xray = 0.039
Beta-BedDays = 1.039
Beta-Length = −413.76

Hence, the linear regression equation is: Hours = 1946.80204 + 0.03858 ∗ Xray + 1.03939 ∗ BedDays − 413.75780 ∗ Length + ϵ
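As a small aside on reporting, the estimates can be rounded in R rather than transcribed by hand. The sketch below simply retypes the four estimates from the summary output; the vector names `est` and `rounded` are illustrative only.

```r
# Estimates copied from the summary(model_lm) output above
est <- c(Intercept = 1946.80204, Xray = 0.03858,
         BedDays = 1.03939, Length = -413.75780)

# round() rounds to the nearest value; it does not truncate digits
rounded <- round(est, 2)
print(rounded)
```

With the fitted object in the session, `round(coef(model_lm), 2)` would give the same values without retyping.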

  2. Interpret each coefficient.

Beta-0 is the expected monthly labor hours when Xray, BedDays, and Length are all zero: 1946.80 hours. No hospital in the data is close to that point, so the intercept is an extrapolation rather than a practically meaningful quantity.

Beta-Xray is the expected change in monthly labor hours for one additional monthly X-ray exposure, holding BedDays and Length fixed. For every unit increase in monthly X-ray exposures, the average monthly labor hours increase by about 0.039 hours.

Beta-BedDays is the expected change in monthly labor hours for one additional monthly occupied bed day, holding Xray and Length fixed. For every unit increase in monthly BedDays, the average monthly labor hours increase by about 1.04 hours.

Beta-Length is the expected change in monthly labor hours for a one-day increase in the average length of patients’ stays, holding Xray and BedDays fixed. For every one-day increase in average length of stay, the average monthly labor hours decrease by about 413.76 hours.
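To make the BedDays slope concrete, here is a worked example under a hypothetical increase of 100 occupied bed days per month (an illustrative figure, not taken from the data), with Xray and Length held fixed:

```latex
\Delta \widehat{\text{Hours}}
  = \hat{\beta}_{\text{BedDays}} \times 100
  = 1.03939 \times 100
  \approx 103.9 \ \text{hours}
```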

  3. Conduct a hypothesis test using α = 0.10 to see whether or not the predictor BedDays is statistically significant. State the null and alternative hypotheses, the test statistic, critical t-value, and your conclusion.

H0 -> Beta-BedDays = 0
H1 -> Beta-BedDays ≠ 0

As we can see from the above summary:
t-statistic: 15.386
Critical value: t(α/2, n−k−1) = t(0.05, 12) = 1.782

Since |t-statistic| = 15.386 exceeds the critical value 1.782, we can reject H0 and conclude that the BedDays predictor is statistically significant and has a linear effect on the labor hours requirement.
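The critical value can also be computed instead of read from a table. This is a minimal sketch using base R's `qt()`; the variable names are illustrative, and the t statistic is retyped from the summary output.

```r
alpha <- 0.10
n <- 16           # hospitals in the data set
k <- 3            # predictors: Xray, BedDays, Length
df <- n - k - 1   # residual degrees of freedom (12)

# Upper alpha/2 quantile of the t distribution for a two-sided test
t_crit <- qt(1 - alpha / 2, df)
t_stat <- 15.386  # t value for BedDays, as printed in the summary

reject_h0 <- abs(t_stat) > t_crit
cat("critical t =", round(t_crit, 3), " reject H0:", reject_h0, "\n")
```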

  4. Identify the p-value for testing the predictor Xray. By using the p-value and α = 0.10, state whether or not Xray is statistically significant. Would your answer be different if α were 0.01?

p-value = 0.01197
Hypotheses:
H0 -> Beta-Xray = 0
H1 -> Beta-Xray ≠ 0

-For α = 0.10: p-value < α (0.01197 < 0.10). Hence, we can reject H0 and conclude that Beta-Xray is statistically significant for α = 0.10.
-For α = 0.01: p-value > α (0.01197 > 0.01). Hence, we fail to reject H0 and conclude that Beta-Xray is not statistically significant for α = 0.01, i.e., a value of 0 for Beta-Xray cannot be ruled out at this level.

So, the answer is different for α = 0.01, since the p-value of the predictor Xray lies between the significance levels of 0.01 and 0.10.
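The reported p-value can be reproduced from the t statistic, which makes the comparison with both significance levels explicit. This sketch assumes the rounded t value printed in the summary, so the result agrees with 0.01197 only up to rounding; the names are illustrative.

```r
t_stat <- 2.958   # t value for Xray, as printed in the summary
df <- 12          # residual degrees of freedom

# Two-sided p-value: probability of a |t| at least this large under H0
p_value <- 2 * pt(-abs(t_stat), df)

for (alpha in c(0.10, 0.01)) {
  cat("alpha =", alpha, " reject H0:", p_value < alpha, "\n")
}
```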

  5. Identify the p-value for testing the predictor Length. By using the p-value and α = 0.10, state whether or not Length is statistically significant. Would your answer be different if α were 0.01?

p-value = 0.00124
Hypotheses:
H0 -> Beta-Length = 0
H1 -> Beta-Length ≠ 0

-For α = 0.10: p-value < α (0.00124 < 0.10). Hence, we can reject H0 and conclude that Beta-Length is statistically significant for α = 0.10.
-For α = 0.01: p-value < α (0.00124 < 0.01). Hence, we can reject H0 and conclude that Beta-Length is statistically significant for α = 0.01.

So, the answer is the same for α = 0.01, since the p-value of the predictor Length is lower than both significance levels, 0.10 and 0.01.

A regional express delivery company conducted a study to investigate the relationship between the cost of shipment, y (in dollars), and the variables that control the shipping charge: package weight, x1 (in pounds), and distance shipped, x2 (in miles). By using the attached comma-delimited data file HW5ShipmentData.csv and α = 0.05, answer Questions 6-11. Submit all relevant outputs and your R codes (if you used R).

  6. Produce the matrix plot, and interpret the possible relations among all variables.
data1 = read.csv("HW5ShipmentData.csv")
data1
##    Cost Weight Distance
## 1   2.6   5.90       47
## 2   3.9   3.20      145
## 3   8.0   4.40      202
## 4   9.2   6.60      160
## 5   4.4   0.75      280
## 6   1.5   0.70       80
## 7  14.5   6.50      240
## 8   1.9   4.50       53
## 9   1.0   0.60      100
## 10 14.0   7.50      190
## 11 11.0   5.10      240
## 12  5.0   2.40      209
## 13  2.0   0.30      160
## 14  6.0   6.20      115
## 15  1.1   2.70       45
## 16  8.0   3.50      250
## 17  3.3   4.10       95
## 18 12.1   8.10      160
## 19 15.5   7.00      260
## 20  1.7   1.10       90
pairs(data1)

model_lm_1 = lm(Cost~., data= data1)
summary(model_lm_1)
## 
## Call:
## lm(formula = Cost ~ ., data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.239 -1.101 -0.129  1.283  2.313 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.672757   0.891147  -5.244 6.60e-05 ***
## Weight       1.292414   0.137842   9.376 3.95e-08 ***
## Distance     0.036936   0.004602   8.026 3.49e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.493 on 17 degrees of freedom
## Multiple R-squared:  0.9162, Adjusted R-squared:  0.9063 
## F-statistic: 92.89 on 2 and 17 DF,  p-value: 7.066e-10

Interpretation: We can observe a possible linear or polynomial relationship between Cost and Weight, and a similar relationship between Cost and Distance. In contrast, Weight and Distance show no apparent linear or polynomial relationship with each other.
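A correlation matrix is a numeric companion to the matrix plot. The sketch below retypes the 20 shipments from the listing above into a data frame named `shipments` (an illustrative name, used so the example runs without the CSV file) and calls base R's `cor()`. Note that Pearson correlation only measures linear association, so it cannot confirm or rule out a polynomial relationship.

```r
# Shipment data retyped from the printed listing above
shipments <- data.frame(
  Cost = c(2.6, 3.9, 8.0, 9.2, 4.4, 1.5, 14.5, 1.9, 1.0, 14.0,
           11.0, 5.0, 2.0, 6.0, 1.1, 8.0, 3.3, 12.1, 15.5, 1.7),
  Weight = c(5.90, 3.20, 4.40, 6.60, 0.75, 0.70, 6.50, 4.50, 0.60, 7.50,
             5.10, 2.40, 0.30, 6.20, 2.70, 3.50, 4.10, 8.10, 7.00, 1.10),
  Distance = c(47, 145, 202, 160, 280, 80, 240, 53, 100, 190,
               240, 209, 160, 115, 45, 250, 95, 160, 260, 90)
)

# Pairwise Pearson correlations between all three variables
cor_mat <- cor(shipments)
print(round(cor_mat, 3))
```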

  7. Solve the first-order model, i.e., y = β0 + β1x1 + β2x2 + ϵ. Also, produce the ANOVA table. Is the model useful (significant) as a whole, i.e., apply an F-test. Are the predictors significant? You can perform the hypothesis tests by considering p-values only.
model_anova = anova(model_lm_1)
model_anova
## Analysis of Variance Table
## 
## Response: Cost
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Weight     1 270.553 270.553 121.353 3.682e-09 ***
## Distance   1 143.631 143.631  64.424 3.489e-07 ***
## Residuals 17  37.901   2.229                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both slope estimates are positive, consistent with the positive associations seen in the matrix plot.
Beta-0 = −4.672757
Beta-Weight = 1.292414
Beta-Distance = 0.036936
Hence, the first-order model is: Cost = −4.672757 + 1.292414 ∗ Weight + 0.036936 ∗ Distance + ϵ

Testing the significance of the whole model:
H0 -> Beta-Weight = Beta-Distance = 0
H1 -> H0 is not true (at least one of the two coefficients is nonzero)
The summary reports an F-statistic of 92.89 with a p-value of 7.066e−10, which is lower than α = 0.05. Therefore, we can reject the null hypothesis and conclude that the model is useful as a whole.

Testing the significance of individual predictors:
• Hypothesis (Beta-Weight): H0 -> Beta-Weight = 0, H1 -> Beta-Weight ≠ 0. The summary reports a t-statistic of 9.376 with a p-value of 3.95e−08, which is lower than α = 0.05. Therefore, we can reject the null hypothesis and conclude that Beta-Weight is statistically significant.
• Hypothesis (Beta-Distance): H0 -> Beta-Distance = 0, H1 -> Beta-Distance ≠ 0. The summary reports a t-statistic of 8.026 with a p-value of 3.49e−07, which is well below α = 0.05. Therefore, we can reject the null hypothesis and conclude that Beta-Distance is statistically significant.
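The overall F-statistic can be recovered from the ANOVA table itself. As a worked check, with k = 2 predictors and n − k − 1 = 17 residual degrees of freedom (the symbols MSR and MSE are standard notation, not taken from the output):

```latex
F = \frac{\text{MSR}}{\text{MSE}}
  = \frac{(SS_{\text{Weight}} + SS_{\text{Distance}})/k}{SS_{\text{Residuals}}/(n-k-1)}
  = \frac{(270.553 + 143.631)/2}{37.901/17}
  = \frac{207.092}{2.2295}
  \approx 92.89
```

This agrees with the F-statistic of 92.89 on 2 and 17 DF reported by `summary(model_lm_1)`.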

  8. Calculate R² by using appropriate ANOVA table quantities. Justify your result by referring to the R² quantity found in your output.

SSE = 37.90

SST = 270.55 + 143.63 + 37.90 = 452.08

R² = 1 − (SSE/SST) = 1 − (37.90/452.08) = 0.9162 = 91.62%. This matches the Multiple R-squared of 0.9162 reported in the model summary.
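The same arithmetic can be scripted. This sketch retypes the three Sum Sq entries from the ANOVA table; the names `ss`, `sse`, `sst`, and `r2` are illustrative.

```r
# Sum Sq column of the ANOVA table above
ss <- c(Weight = 270.553, Distance = 143.631, Residuals = 37.901)

sse <- ss[["Residuals"]]   # error sum of squares
sst <- sum(ss)             # total sum of squares = regression SS + SSE
r2 <- 1 - sse / sst

cat("SST =", sst, " R2 =", round(r2, 4), "\n")
```

With the ANOVA object in the session, the same column is available as `model_anova[["Sum Sq"]]`, which avoids retyping.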

  9. Check the random error assumptions for the first-order model, in particular, check whether E(ϵ) = 0 or not, the normality, and the identical distribution (variance) assumptions. What are your conclusions?

The E(ϵ) = 0 assumption holds: the mean of the residuals is 1.93963768657657e−17, which is zero up to floating-point error.
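The residual mean can be computed as sketched below. The data frame `shipments` and the object `fit_check` are illustrative names; the data are retyped from the listing above so the example runs without the CSV file. One caveat: for any least-squares fit that includes an intercept, the residuals average to zero by construction, so this is a numerical confirmation rather than independent evidence about the errors.

```r
# Shipment data retyped from the printed listing above
shipments <- data.frame(
  Cost = c(2.6, 3.9, 8.0, 9.2, 4.4, 1.5, 14.5, 1.9, 1.0, 14.0,
           11.0, 5.0, 2.0, 6.0, 1.1, 8.0, 3.3, 12.1, 15.5, 1.7),
  Weight = c(5.90, 3.20, 4.40, 6.60, 0.75, 0.70, 6.50, 4.50, 0.60, 7.50,
             5.10, 2.40, 0.30, 6.20, 2.70, 3.50, 4.10, 8.10, 7.00, 1.10),
  Distance = c(47, 145, 202, 160, 280, 80, 240, 53, 100, 190,
               240, 209, 160, 115, 45, 250, 95, 160, 260, 90)
)

# Refit the first-order model and average its residuals
fit_check <- lm(Cost ~ Weight + Distance, data = shipments)
res_mean <- mean(residuals(fit_check))
print(res_mean)   # expected to be zero up to floating-point error
```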

Normality and constant variance assumptions:

fit <- model_lm_1
res <- residuals(fit)
plot(fit$fitted.values, res, xlab="Fitted Values", ylab="Residuals", main="")
abline(h=0, col="red")  # zero reference line, drawn after the plot exists

qqnorm(res, ylab="Residuals", main="", col="blue")

hist(res, xlab="Residuals", main="", nclass=10, col="orange")

plot(cooks.distance(fit), type="h", lwd=3, col="red", ylab="Cook's Distance", main="")

The Q-Q plot is approximately linear, so the normality assumption appears to hold. The histogram alone is inconclusive with only 20 observations; more samples would be needed to judge normality from it. In the residuals vs. fitted values plot, the residuals show a clear pattern against the fitted values. Hence, the identical distribution (constant variance) assumption does not hold.

In the Residuals vs Weight plot, the spread of the residuals is larger at the extremes than in the middle, which suggests that Cost may have a non-linear relationship with Weight.

par(mfrow=c(2,1))
library(MASS)  # provides stdres() for standardized residuals
res <- stdres(model_lm_1)
plot(data1[,2],res,xlab="Weight",ylab="Residuals")
abline(0,0,col="red")
plot(data1[,3],res,xlab="Distance",ylab="Residuals")
abline(0,0,col="red")

  10. Find a 95% confidence interval for the mean response when the predictors Weight = 6 and Distance = 160.
conf <- predict.lm(model_lm_1,newdata = data.frame(Weight = 6,Distance=160),interval= "confidence", level=0.95)
conf
##        fit      lwr      upr
## 1 8.991409 8.091995 9.890823
  11. Find a 95% prediction interval for a single future observation when the predictors Weight = 6 and Distance = 160.
pred_int <- predict.lm(model_lm_1,newdata = data.frame(Weight = 6,Distance=160),interval= "prediction", level = 0.95)
pred_int
##        fit      lwr      upr
## 1 8.991409 5.715274 12.26754
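The interval computed with `interval = "confidence"` and the one computed with `interval = "prediction"` share the same center and differ only in their standard error. As a sketch in standard matrix notation (X for the design matrix and x_0 = (1, 6, 160)^T are notation introduced here, not part of the output), with s = 1.493 the residual standard error on 17 degrees of freedom:

```latex
\text{Confidence interval for the mean: } \quad
\hat{y}_0 \pm t_{0.025,\,17}\; s \sqrt{x_0^{\top}(X^{\top}X)^{-1}x_0}

\text{Prediction interval for a new observation: } \quad
\hat{y}_0 \pm t_{0.025,\,17}\; s \sqrt{1 + x_0^{\top}(X^{\top}X)^{-1}x_0}
```

The extra 1 under the square root accounts for the variability of a single new shipment around its mean. This is why the half-width grows from (9.890823 − 8.091995)/2 ≈ 0.899 for the mean response to (12.26754 − 5.715274)/2 ≈ 3.276 for a single future observation.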