ISyE 6414 Regression Analysis | Spring 2023
Homework 5
Refer to the file homework05Hospital.csv, which can be found in the homework section of Canvas, to answer Questions 1-5. This data set describes the need for labor in 16 hospitals. The main objective of the regression analysis is to evaluate the performance of the hospitals in terms of how many labor hours are used relative to how many labor hours are needed. Here:
y: monthly labor hours required (column Hours)
x1: monthly X-ray exposures (column Xray)
x2: monthly occupied bed days (column BedDays)
x3: average length of patients' stays, in days (column Length)
data = read.csv('homework05Hospital.csv', encoding = 'UTF-8')
data
## Xray BedDays Length Hours
## 1 2463 472.92 4.45 566.52
## 2 2048 1339.75 6.92 696.82
## 3 3940 620.25 4.28 1033.15
## 4 6505 568.33 3.90 1603.62
## 5 5723 1497.60 5.50 1611.37
## 6 11520 1365.83 4.60 1613.27
## 7 5779 1687.00 5.62 1854.17
## 8 5969 1639.92 5.15 2160.55
## 9 8461 2872.33 6.18 2305.58
## 10 20106 3655.08 6.15 3503.93
## 11 13313 2912.00 5.88 3571.89
## 12 10771 3921.00 4.88 3741.40
## 13 15543 3865.67 5.50 4026.52
## 14 34703 12446.33 10.78 11732.17
## 15 39204 14098.40 7.05 15414.94
## 16 86533 15524.00 6.35 18854.45
model_lm = lm(Hours~., data = data)
summary(model_lm)
##
## Call:
## lm(formula = Hours ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -677.23 -270.19 60.93 228.32 517.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1946.80204 504.18193 3.861 0.00226 **
## Xray 0.03858 0.01304 2.958 0.01197 *
## BedDays 1.03939 0.06756 15.386 2.91e-09 ***
## Length -413.75780 98.59828 -4.196 0.00124 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 387.2 on 12 degrees of freedom
## Multiple R-squared: 0.9961, Adjusted R-squared: 0.9952
## F-statistic: 1028 on 3 and 12 DF, p-value: 9.919e-15
Estimated coefficients (rounded): Beta-0 = 1946.80, Beta-Xray = 0.039, Beta-BedDays = 1.04, Beta-Length = −413.76
Hence, the fitted linear regression equation is: predicted Hours = 1946.80204 + 0.03858 ∗ Xray + 1.03939 ∗ BedDays − 413.75780 ∗ Length
Interpreting each beta coefficient:
Beta-0 (1946.80) is the estimated mean monthly labor hours when Xray, BedDays, and Length are all zero. No hospital in the data is close to these values, so the intercept is an extrapolation and has little practical meaning.
Beta-Xray (0.03858): holding BedDays and Length fixed, each additional monthly X-ray exposure is associated with an average increase of about 0.039 monthly labor hours.
Beta-BedDays (1.03939): holding Xray and Length fixed, each additional monthly occupied bed day is associated with an average increase of about 1.04 monthly labor hours.
Beta-Length (−413.75780): holding Xray and BedDays fixed, each additional day in the average length of patients' stays is associated with an average decrease of about 413.76 monthly labor hours.
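As a hedged illustration of these interpretations, the sketch below plugs the estimates printed in the summary into the fitted equation. The helper name predict_hours is hypothetical, and the inputs are the values for hospital 1 in the data listing. Because the coefficients are rounded as printed, results may differ slightly from predict(model_lm).

```r
# Coefficients copied from the summary(model_lm) output (rounded as printed)
b <- c(intercept = 1946.80204, xray = 0.03858, beddays = 1.03939, stay = -413.75780)

# Hypothetical helper: fitted monthly labor hours for given predictor values
predict_hours <- function(xray, beddays, stay) {
  unname(b["intercept"] + b["xray"] * xray + b["beddays"] * beddays + b["stay"] * stay)
}

# Hospital 1 from the data listing: Xray = 2463, BedDays = 472.92, Length = 4.45
base <- predict_hours(2463, 472.92, 4.45)

# One extra occupied bed day, with the other predictors held fixed
plus_one_bedday <- predict_hours(2463, 472.92 + 1, 4.45)

# The difference equals the BedDays coefficient
print(plus_one_bedday - base)

# Hours used relative to hours needed: observed minus fitted for hospital 1
print(566.52 - base)
```

A negative value in the last line would mean the hospital used fewer labor hours than the model estimates it needed.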
H0 -> Beta-BedDays = 0
H1 -> Beta-BedDays ≠ 0
As we can see from the above summary: t-statistic = 15.386, and the critical value is tα/2,n−k−1 = t0.05,12 = 1.782 (α = 0.10, n = 16, k = 3).
Since |t-statistic| > t-critical value (15.386 > 1.782), we reject H0 and conclude that the BedDays predictor is statistically significant: with Xray and Length in the model, it has a linear effect on the labor hours required.
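The critical value can be checked in R rather than read from a table. This is a minimal sketch using only numbers from the summary output; the level alpha = 0.10 is an assumption inferred from the critical value 1.782 quoted above.

```r
# Values taken from the summary(model_lm) output
t_stat <- 15.386   # t value for BedDays
n <- 16            # number of hospitals
k <- 3             # number of predictors
alpha <- 0.10      # assumed level, implied by the critical value 1.782

df <- n - k - 1
t_crit <- qt(1 - alpha / 2, df)                         # two-sided critical value
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # two-sided p-value

print(c(df = df, t_crit = t_crit, p_value = p_value))

# Decision rule: reject H0 when |t| exceeds the critical value
reject_h0 <- abs(t_stat) > t_crit
print(reject_h0)
```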
p-value for Xray = 0.01197
Hypotheses: H0 -> Beta-Xray = 0 H1 -> Beta-Xray ≠ 0
-For α = 0.10: p-value < α (0.01197 < 0.10). Hence, we reject H0 and conclude that Beta-Xray is statistically significant at α = 0.10.
-For α = 0.01: p-value > α (0.01197 > 0.01). Hence, we fail to reject H0: at α = 0.01 there is not enough evidence that Beta-Xray differs from 0.
So, the conclusion changes at α = 0.01, because the p-value of the predictor Xray lies between the significance levels 0.01 and 0.10.
p-value for Length = 0.00124
Hypotheses: H0 -> Beta-Length = 0 H1 -> Beta-Length ≠ 0
-For α = 0.10: p-value < α (0.00124 < 0.10). Hence, we reject H0 and conclude that Beta-Length is statistically significant at α = 0.10.
-For α = 0.01: p-value < α (0.00124 < 0.01). Hence, we reject H0 and conclude that Beta-Length is statistically significant at α = 0.01.
So, the conclusion is the same at α = 0.01, because the p-value of the predictor Length is below both significance levels, 0.10 and 0.01.
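The p-value comparisons for Xray and Length can be tabulated in one step. This is a small sketch that uses only the p-values printed in the summary; the object names are illustrative.

```r
# p-values copied from the summary(model_lm) output
p_values <- c(Xray = 0.01197, Length = 0.00124)
alphas <- c(0.10, 0.01)

# Rows: predictors; columns: significance levels; TRUE means "reject H0"
decisions <- outer(p_values, alphas, FUN = "<")
dimnames(decisions) <- list(names(p_values), paste0("alpha=", alphas))
print(decisions)
```

Only the Xray row changes between the two columns, which is the difference discussed above.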
A regional express delivery company conducted a study to investigate the relationship between the cost of shipment, y (in dollars), and the variables that control the shipping charge: package weight, x1 (in pounds), and distance shipped, x2 (in miles). By using the attached comma-delimited data file HW5ShipmentData.csv and α = 0.05, answer Questions 6-11. Submit all relevant outputs and your R codes (if you used R).
data1 = read.csv("HW5ShipmentData.csv")
data1
## Cost Weight Distance
## 1 2.6 5.90 47
## 2 3.9 3.20 145
## 3 8.0 4.40 202
## 4 9.2 6.60 160
## 5 4.4 0.75 280
## 6 1.5 0.70 80
## 7 14.5 6.50 240
## 8 1.9 4.50 53
## 9 1.0 0.60 100
## 10 14.0 7.50 190
## 11 11.0 5.10 240
## 12 5.0 2.40 209
## 13 2.0 0.30 160
## 14 6.0 6.20 115
## 15 1.1 2.70 45
## 16 8.0 3.50 250
## 17 3.3 4.10 95
## 18 12.1 8.10 160
## 19 15.5 7.00 260
## 20 1.7 1.10 90
pairs(data1)
model_lm_1 = lm(Cost~., data= data1)
summary(model_lm_1)
##
## Call:
## lm(formula = Cost ~ ., data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.239 -1.101 -0.129 1.283 2.313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.672757 0.891147 -5.244 6.60e-05 ***
## Weight 1.292414 0.137842 9.376 3.95e-08 ***
## Distance 0.036936 0.004602 8.026 3.49e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.493 on 17 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9063
## F-statistic: 92.89 on 2 and 17 DF, p-value: 7.066e-10
Interpretation of the matrix plot: We can observe a possible linear or polynomial relationship between Cost and Weight, and a similar relationship between Cost and Distance. On the other hand, we do not see any linear or polynomial relationship between Distance and Weight.
model_anova = anova(model_lm_1)
model_anova
## Analysis of Variance Table
##
## Response: Cost
## Df Sum Sq Mean Sq F value Pr(>F)
## Weight 1 270.553 270.553 121.353 3.682e-09 ***
## Distance 1 143.631 143.631 64.424 3.489e-07 ***
## Residuals 17 37.901 2.229
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both slope estimates are positive, which agrees with the positive relationships seen in the matrix plot.
Beta-0 = −4.672757 Beta-Weight = 1.292414 Beta-Distance = 0.036936
Hence, the fitted first-order model is: predicted Cost = −4.672757 + 1.292414 ∗ Weight + 0.036936 ∗ Distance
Testing the significance of the whole model:
Hypothesis: H0 -> Beta-Weight = Beta-Distance = 0 H1 -> at least one of Beta-Weight, Beta-Distance is not 0
The summary reports an F-statistic of 92.89 on 2 and 17 degrees of freedom, with a p-value of 7.066e−10, which is lower than α = 0.05. Therefore, we reject the null hypothesis and conclude that the model as a whole is statistically significant.
Testing the significance of individual predictors:
• Hypothesis (Beta-Weight): H0 -> Beta-Weight = 0 H1 -> Beta-Weight ≠ 0. The summary reports a t-statistic of 9.376 and a p-value of 3.95e−08, which is lower than α = 0.05. Therefore, we reject the null hypothesis and conclude that Beta-Weight is statistically significant.
• Hypothesis (Beta-Distance): H0 -> Beta-Distance = 0 H1 -> Beta-Distance ≠ 0. The summary reports a t-statistic of 8.026 and a p-value of 3.49e−07, which is far lower than α = 0.05. Therefore, we reject the null hypothesis and conclude that Beta-Distance is statistically significant.
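The overall F-statistic can be rebuilt from the ANOVA table as a cross-check. This is a sketch under the assumption that the printed sums of squares are accurate to the digits shown; the variable names are illustrative.

```r
# Sums of squares and degrees of freedom from the anova(model_lm_1) table
ss_weight   <- 270.553
ss_distance <- 143.631
sse         <- 37.901
df_model    <- 2    # two predictors
df_resid    <- 17   # n - k - 1 = 20 - 2 - 1

ssr <- ss_weight + ss_distance   # regression sum of squares
msr <- ssr / df_model            # mean square for regression
mse <- sse / df_resid            # mean squared error

f_stat  <- msr / mse
f_crit  <- qf(0.95, df_model, df_resid)                        # critical value at alpha = 0.05
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

print(c(F = f_stat, F_crit = f_crit, p_value = p_value))
```

The recomputed F should agree with the 92.89 reported by summary(model_lm_1), up to rounding of the table entries.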
SSE = 37.90
SSR = 270.55 + 143.63 = 414.18
SST = SSR + SSE = 414.18 + 37.90 = 452.08
R2 = 1 − (SSE/SST) = 1 − (37.90/452.08) = 0.9162 = 91.62%
This matches the Multiple R-squared value reported in the summary.
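The same arithmetic can be verified in R, together with the adjusted R-squared. This is a minimal sketch using the sums of squares printed in the ANOVA table; n = 20 and k = 2 come from the data listing and the model formula.

```r
# Sums of squares from the anova(model_lm_1) table
sse <- 37.901
ssr <- 270.553 + 143.631   # Weight + Distance
sst <- ssr + sse

n <- 20   # observations
k <- 2    # predictors

r2     <- 1 - sse / sst
adj_r2 <- 1 - (sse / (n - k - 1)) / (sst / (n - 1))

print(round(c(SST = sst, R2 = r2, adj_R2 = adj_r2), 4))
```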
Checking the model assumptions (residual analysis):
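fit <- model_lm_1
res <- residuals(fit)
# Residuals vs fitted values: checks constant variance and linearity
plot(fit$fitted.values, res, xlab="Fitted Values", ylab="Residuals", main="")
abline(0, 0, col="red")
# Q-Q plot and histogram: check normality of the residuals
qqnorm(res, ylab="Residuals", main="", col="blue")
hist(res, xlab="Residuals", main="", nclass=10, col="orange")
# Cook's distance: flags influential observations
plot(cooks.distance(fit), type="h", lwd=3, col="red", ylab="Cook's Distance", main="")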
The Q-Q plot suggests that the residuals are approximately normally distributed, so the normality assumption appears to hold. The histogram is inconclusive with only 20 observations; more data would be needed to judge normality from it. In the Residuals vs Fitted Values plot, the residuals show a clear pattern against the fitted values. Hence, the identical distribution (constant variance) assumption does not hold.
In the Residuals vs Weight plot, the spread of the residuals is larger at the extremes than in the middle range of Weight. This suggests that Cost may have a non-linear relationship with Weight.
library(MASS)  # provides stdres()
par(mfrow=c(2,1))
# Standardized residuals plotted against each predictor
res <- stdres(model_lm_1)
plot(data1[,2],res,xlab="Weight",ylab="Residuals")
abline(0,0,col="red")
plot(data1[,3],res,xlab="Distance",ylab="Residuals")
abline(0,0,col="red")
conf <- predict.lm(model_lm_1,newdata = data.frame(Weight = 6,Distance=160),interval= "confidence", level=0.95)
conf
## fit lwr upr
## 1 8.991409 8.091995 9.890823
pred_int <- predict.lm(model_lm_1,newdata = data.frame(Weight = 6,Distance=160),interval= "prediction", level = 0.95)
pred_int
## fit lwr upr
## 1 8.991409 5.715274 12.26754
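The prediction interval is wider than the confidence interval because it also accounts for the residual variance of a single new shipment. The sketch below illustrates that relationship using only numbers already printed above (the two intervals and the ANOVA residual sum of squares). It is a consistency check under those rounded inputs, not a replacement for predict().

```r
# Intervals copied from the predict() output (Weight = 6, Distance = 160)
fit_value   <- 8.991409
ci_reported <- c(lwr = 8.091995, upr = 9.890823)
pi_reported <- c(lwr = 5.715274, upr = 12.26754)

mse    <- 37.901 / 17           # residual mean square from the ANOVA table
t_crit <- qt(0.975, df = 17)    # two-sided 95% critical value

# Back out the standard error of the estimated mean from the confidence interval
se_fit <- (ci_reported[["upr"]] - ci_reported[["lwr"]]) / (2 * t_crit)

# A new observation adds the residual variance to the variance of the estimated mean
se_pred <- sqrt(se_fit^2 + mse)

pi_rebuilt <- fit_value + c(lwr = -1, upr = 1) * t_crit * se_pred
print(rbind(reported = pi_reported, rebuilt = pi_rebuilt))
```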