Homework 6

For all of the problems in this assignment, please submit the relevant outputs and your R code (if you used R). In addition, unless otherwise indicated, assume that α = 0.05.

Consider the regional express delivery company problem that you studied in Homework 5. To remind you, we had defined: the cost of shipment, y (in dollars), and the variables that control the shipping charge: package weight, x1 (in pounds), and distance shipped, x2 (in miles). We used the data set HW5ShipmentData.csv. Answer Questions 1–5.

1. Consider the following second-order model for this problem: y = β0 + β1x1 + β2x2 + β3x2² + β4x1x2 + ε.
data1 <- read.csv("HW5ShipmentData.csv")
Hw6_model <- lm(Cost ~ Weight + Distance + I(Distance^2)+ Weight * Distance, data = data1)
summary(Hw6_model)
##
## Call:
## lm(formula = Cost ~ Weight + Distance + I(Distance^2) + Weight *
## Distance, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7969 -0.4526 -0.2011 0.4899 1.3993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.951e-01 9.826e-01 -0.300 0.768
## Weight 2.387e-02 1.647e-01 0.145 0.887
## Distance 1.011e-02 1.182e-02 0.855 0.406
## I(Distance^2) -7.042e-06 3.280e-05 -0.215 0.833
## Weight:Distance 7.755e-03 9.449e-04 8.207 6.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.664 on 15 degrees of freedom
## Multiple R-squared: 0.9854, Adjusted R-squared: 0.9815
## F-statistic: 252.6 on 4 and 15 DF, p-value: 1.454e-13
Based on the p-values, the Weight and Distance main effects are not significant (p-values of 0.887 and 0.406, respectively), and the I(Distance^2) term is likewise not significant (p-value = 0.833). However, the Weight:Distance interaction is highly significant (p-value = 6.27e-07), indicating that the interaction between Weight and Distance is an important predictor of Cost in this model.
Hw5_model <- lm(Cost ~ Weight + Distance, data = data1)
anova(Hw5_model, Hw6_model)
## Analysis of Variance Table
##
## Model 1: Cost ~ Weight + Distance
## Model 2: Cost ~ Weight + Distance + I(Distance^2) + Weight * Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17 37.901
## 2 15 6.613 2 31.288 35.485 2.056e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table for the partial F-test reports the residual degrees of freedom, residual sums of squares, change in degrees of freedom, extra sum of squares, F-statistic, and its associated p-value. The p-value (2.056e-06) is far below 0.001, providing strong evidence against the null hypothesis that the reduced Hw5 model is sufficient to explain the variability in the data. Therefore, we conclude that the full Hw6 model fits significantly better than the reduced model.
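The partial F-statistic can also be verified directly from the two residual sums of squares reported above; a quick sketch:

# Recompute the partial F-statistic by hand from the two RSS values
rss_reduced <- sum(resid(Hw5_model)^2)  # 37.901
rss_full <- sum(resid(Hw6_model)^2)     # 6.613
F_stat <- ((rss_reduced - rss_full) / 2) / (rss_full / 15)  # 2 extra terms, 15 residual df
F_stat  # approximately 35.485, matching the anova() output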
coef_full <- coef(Hw6_model)
# Estimated effect of Distance on mean Cost: the partial derivative
# beta2 + 2*beta3*x2 + beta4*x1, evaluated at Distance = 1 and Weight = 5
effect <- coef_full["Distance"] + 2 * coef_full["I(Distance^2)"] * 1 + coef_full["Weight:Distance"] * 5
effect
## Distance
## 0.04886694
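As a sanity check, essentially the same number can be obtained as a finite difference of predicted values, since the quadratic coefficient is tiny; a sketch using the same Weight and Distance columns as above:

# Change in predicted Cost when Distance increases by one mile at Weight = 5
p1 <- predict(Hw6_model, newdata = data.frame(Weight = 5, Distance = 1))
p2 <- predict(Hw6_model, newdata = data.frame(Weight = 5, Distance = 2))
p2 - p1  # close to the derivative above because the Distance^2 term is negligible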
resids <- residuals(Hw6_model)
# Check whether E(epsilon) = 0
mean(resids)
## [1] 2.425902e-18
# Check the normality assumption using a Q-Q plot
qqnorm(resids)
qqline(resids)
# Check the constant-variance (identical-distribution) assumption using a plot of residuals vs. fitted values
plot(fitted(Hw6_model), resids)
The mean of the residuals is essentially zero (2.43e-18), so E(ε) = 0 is satisfied. The plots must then be inspected: if the points in the Q-Q plot fall close to the reference line and the residuals show no systematic pattern or funnel shape against the fitted values, the normality and constant-variance assumptions are reasonably met. If any of these assumptions are violated, the model may not be a good fit for the data and additional analysis may be necessary.
This suggests that the Hw5 model may not be the best fit for the data, and that a more complex model may be needed to account for the non-constant variance and the potential non-linear relationship between Cost and the predictors.
We can also note that obtaining more samples could help clarify whether the normality assumption holds, since the Q-Q plot here is based on only 20 observations.
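Formal tests can complement the visual checks; a minimal sketch, assuming the lmtest package is installed (it is loaded again below for the Durbin-Watson test):

# Shapiro-Wilk test of normality of the residuals
shapiro.test(residuals(Hw6_model))
# Breusch-Pagan test of constant error variance
library(lmtest)
bptest(Hw6_model)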
data2 <- read.csv("HW6StateCostData.csv")
# Create dummy variables for State
dummy_vars <- model.matrix(~ STATE - 1, data = data2)
# Append the dummy variables to the data
data2 <- cbind(data2, dummy_vars)
# Fit the linear regression model
model <- lm(COST ~ ., data = data2)
# Print the model summary
summary(model)
##
## Call:
## lm(formula = COST ~ ., data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -299.80 -95.83 -37.90 153.32 295.20
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 279.60 53.43 5.233 1.63e-05 ***
## STATEKentucky 80.30 75.56 1.063 0.2973
## STATETexas 198.20 75.56 2.623 0.0141 *
## STATEKansas NA NA NA NA
## STATEKentucky NA NA NA NA
## STATETexas NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 168.9 on 27 degrees of freedom
## Multiple R-squared: 0.205, Adjusted R-squared: 0.1462
## F-statistic: 3.482 on 2 and 27 DF, p-value: 0.04515
The model is: COST = β0 + β1D1 + β2D2 + ε
where D1 and D2 are dummy variables for the states Kentucky and Texas, respectively.
β0 corresponds to the expected cost when the state is Kansas. The intercept β0 gives the value of the response when all of the dummy variables are equal to zero; since Kansas was chosen as the base level, the intercept is the expected cost for Kansas.
Note that the coefficients of the dummy variables are interpreted as the differences in expected cost between the corresponding state and the base-level state (Kansas). For example, the coefficient of D1 is the difference in expected cost between Kentucky and Kansas, so the fitted expected cost for Kentucky is 279.60 + 80.30 = 359.90 dollars.
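As a side note, the manually appended dummy columns are redundant: R creates dummy variables automatically when a factor appears in the model formula, which also avoids the singularities flagged in the summary above. A minimal sketch of the cleaner equivalent fit:

# Let R build the dummy variables; Kansas is the base level (first alphabetically)
model_state <- lm(COST ~ factor(STATE), data = data2)
summary(model_state)
# The base level can be set explicitly if a different reference is desired:
# data2$STATE <- relevel(factor(data2$STATE), ref = "Kansas")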
anova(model)
## Analysis of Variance Table
##
## Response: COST
## Df Sum Sq Mean Sq F value Pr(>F)
## STATE 2 198772 99386 3.4819 0.04515 *
## Residuals 27 770671 28543
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The base level for the categorical variable "STATE" is the category without its own row of estimates in the model summary, which is Kansas in this case.
Based on the results from Question 7, we can conclude that there is a significant difference in the expected costs among the three states, since the p-value of the overall F-test (0.04515) is less than the significance level of 0.05. This means that at least one state's mean cost differs from the others.
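If we want to know which pairs of states differ, a multiple-comparison procedure can follow the overall F-test. One possible sketch using Tukey's HSD (assuming the usual equal-variance ANOVA setup):

# Pairwise comparisons of mean COST between states
TukeyHSD(aov(COST ~ factor(STATE), data = data2))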
set.seed(123)
x1 <- rnorm(24)
x2 <- rnorm(24)
x3 <- rnorm(24)
y <- 2*x1 + 3*x2 + 4*x3 + rnorm(24)
# fit the linear regression model
model <- lm(y ~ x1 + x2 + x3)
# perform Durbin-Watson test
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.2.1
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.2.1
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
dwtest(model, alternative = "two.sided")
##
## Durbin-Watson test
##
## data: model
## DW = 2.1296, p-value = 0.7783
## alternative hypothesis: true autocorrelation is not 0
The output indicates that the Durbin-Watson test statistic is 2.1296 and the p-value is 0.7783. Since the p-value is well above the significance level of 0.10, we fail to reject the null hypothesis of no autocorrelation in the errors. Therefore, we conclude that the independence assumption of the errors is met.
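The test statistic itself is easy to reproduce by hand from the standard Durbin-Watson formula, DW = Σ(et − et−1)² / Σet²; a quick sketch:

# Durbin-Watson statistic computed directly from the residuals
e <- resid(model)
sum(diff(e)^2) / sum(e^2)  # approximately 2.1296, matching dwtest()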
Consider the hospital stay problem that you studied in HW5 by using homework05Hospital.csv data. To remind you, we had defined: y = monthly labor hours required; x1 = monthly X-ray exposures; x2 = monthly occupied bed days; and x3 = average length of patients’ stays (in days). Answer Questions 10–12.
data3 <- read.csv("homework05Hospital.csv")
model <- lm(Hours ~ ., data = data3)
summary(model)
##
## Call:
## lm(formula = Hours ~ ., data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -677.23 -270.19 60.93 228.32 517.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1946.80204 504.18193 3.861 0.00226 **
## Xray 0.03858 0.01304 2.958 0.01197 *
## BedDays 1.03939 0.06756 15.386 2.91e-09 ***
## Length -413.75780 98.59828 -4.196 0.00124 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 387.2 on 12 degrees of freedom
## Multiple R-squared: 0.9961, Adjusted R-squared: 0.9952
## F-statistic: 1028 on 3 and 12 DF, p-value: 9.919e-15
# Standardized residuals
std_resid <- rstandard(model)
# Cook's distance
cooksd <- cooks.distance(model)
# Combine results with original data
results <- cbind(data3, std_resid, cooksd)
# Sort by Cook's distance in ascending order (most influential points last)
results <- results[order(results$cooksd),]
# Print results
print(results)
## Xray BedDays Length Hours std_resid cooksd
## 3 3940 620.25 4.28 1033.15 0.1676558 0.001047070
## 1 2463 472.92 4.45 566.52 -0.3460418 0.004111350
## 5 5723 1497.60 5.50 1611.37 0.4402261 0.004609869
## 13 15543 3865.67 5.50 4026.52 -0.7007880 0.008722368
## 7 5779 1687.00 5.62 1854.17 0.6924578 0.011288649
## 14 34703 12446.33 10.78 11732.17 -0.1434535 0.012871368
## 2 2048 1339.75 6.92 696.82 0.4184417 0.013450584
## 6 11520 1365.83 4.60 1613.27 -0.8077234 0.021069992
## 9 8461 2872.33 6.18 2305.58 -1.0710300 0.027541648
## 8 5969 1639.92 5.15 2160.55 1.1057046 0.027859603
## 11 13313 2912.00 5.88 3571.89 1.3966312 0.044335089
## 10 20106 3655.08 6.15 3503.93 -1.3134905 0.067330758
## 4 6505 568.33 3.90 1603.62 1.2075842 0.068803012
## 12 10771 3921.00 4.88 3741.40 -1.9293551 0.201515928
## 16 86533 15524.00 6.35 18854.45 0.6133041 1.316994321
## 15 39204 14098.40 7.05 15414.94 1.2248752 1.383805243
Observations 15 and 16 stand out, with Cook's distance values above 1 (1.384 and 1.317, respectively), far larger than for any other observation. These points have high leverage and/or large residuals, indicating that they might have a strong influence on the regression model.
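To separate the two sources of influence, the leverage (hat) values can be examined directly; a minimal sketch, using the common 2(k+1)/n cutoff as a rule of thumb:

# Hat (leverage) values; flag points above the rule-of-thumb cutoff 2(k+1)/n
lev <- hatvalues(model)
cutoff <- 2 * length(coef(model)) / nrow(data3)  # 2*4/16 = 0.5 here
lev[lev > cutoff]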
library(car)
# Calculate the VIF for each predictor
vif(model)
## Xray BedDays Length
## 1.05762 3.65487 1.05762
We can see that the VIF values for all predictors are well below the common thresholds of 5 and 10, indicating that there is no significant multicollinearity issue in the model. Therefore, the coefficient estimates are likely to be stable and reliable.
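As a cross-check, each VIF can be computed by hand from the R² of regressing one predictor on the remaining predictors, VIFj = 1/(1 − Rj²). A minimal sketch for Xray, using the same data3 as above:

# VIF for Xray by hand: regress Xray on the other predictors
r2_xray <- summary(lm(Xray ~ BedDays + Length, data = data3))$r.squared
1 / (1 - r2_xray)  # should agree with the vif() value for Xray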