Homework 6

For all of the problems in this assignment, please submit the relevant outputs and your R code (if you used R). In addition, unless otherwise indicated, assume that α = 0.05.

Consider the regional express delivery company problem that you studied in Homework 5. To remind you, we had defined: the cost of shipment, y (in dollars), and the variables that control the shipping charge: package weight, x1 (in pounds), and distance shipped, x2 (in miles). We used the data set HW5ShipmentData.csv. Answer Questions 1–5.

1. Consider the following second-order model for this problem: y = β0 + β1x1 + β2x2 + β3x2² + β4x1x2 + ε.
data1 <- read.csv("HW5ShipmentData.csv")
Hw6_model <- lm(Cost ~ Weight + Distance + I(Distance^2)+ Weight * Distance, data = data1)
summary(Hw6_model)
##
## Call:
## lm(formula = Cost ~ Weight + Distance + I(Distance^2) + Weight *
## Distance, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7969 -0.4526 -0.2011 0.4899 1.3993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.951e-01 9.826e-01 -0.300 0.768
## Weight 2.387e-02 1.647e-01 0.145 0.887
## Distance 1.011e-02 1.182e-02 0.855 0.406
## I(Distance^2) -7.042e-06 3.280e-05 -0.215 0.833
## Weight:Distance 7.755e-03 9.449e-04 8.207 6.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.664 on 15 degrees of freedom
## Multiple R-squared: 0.9854, Adjusted R-squared: 0.9815
## F-statistic: 252.6 on 4 and 15 DF, p-value: 1.454e-13
Based on the p-values, the Weight and Distance main effects are not significant (p-values of 0.887 and 0.406, respectively), and the I(Distance^2) term is likewise not significant (p-value = 0.833). However, the Weight:Distance interaction is highly significant (p-value = 6.27e-07), indicating that the interaction between Weight and Distance is an important predictor of Cost in this model.
Hw5_model <- lm(Cost ~ Weight + Distance, data = data1)
anova(Hw5_model, Hw6_model)
## Analysis of Variance Table
##
## Model 1: Cost ~ Weight + Distance
## Model 2: Cost ~ Weight + Distance + I(Distance^2) + Weight * Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17 37.901
## 2 15 6.613 2 31.288 35.485 2.056e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table for the partial F-test reports the residual degrees of freedom, residual sums of squares, change in degrees of freedom, extra sum of squares, F-statistic, and its associated p-value. The p-value (2.056e-06) is far below 0.001, providing strong evidence against the null hypothesis that the reduced Hw5 model is sufficient to explain the variability in the data. Therefore, we conclude that the full Hw6 model fits significantly better than the reduced model.
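The partial F-statistic can also be verified directly from the two residual sums of squares reported above; a quick sketch:

# Recompute the partial F-statistic by hand from the two RSS values
rss_reduced <- sum(resid(Hw5_model)^2)  # 37.901
rss_full <- sum(resid(Hw6_model)^2)     # 6.613
F_stat <- ((rss_reduced - rss_full) / 2) / (rss_full / 15)  # 2 extra terms, 15 residual df
F_stat  # approximately 35.485, matching the anova() output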
coef_full <- coef(Hw6_model)
# Estimated effect of Distance on mean Cost: the partial derivative
# beta2 + 2*beta3*x2 + beta4*x1, evaluated at Distance = 1 and Weight = 5
effect <- coef_full["Distance"] + 2 * coef_full["I(Distance^2)"] * 1 + coef_full["Weight:Distance"] * 5
effect
## Distance
## 0.04886694
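As a sanity check, essentially the same number can be obtained as a finite difference of predicted values, since the quadratic coefficient is tiny; a sketch using the same Weight and Distance columns as above:

# Change in predicted Cost when Distance increases by one mile at Weight = 5
p1 <- predict(Hw6_model, newdata = data.frame(Weight = 5, Distance = 1))
p2 <- predict(Hw6_model, newdata = data.frame(Weight = 5, Distance = 2))
p2 - p1  # close to the derivative above because the Distance^2 term is negligible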
resids <- residuals(Hw6_model)
# Check whether E(epsilon) = 0
mean(resids)
## [1] 2.425902e-18
# Check the normality assumption using a Q-Q plot
qqnorm(resids)
qqline(resids)
# Check the constant-variance (identical-distribution) assumption using a plot of residuals vs. fitted values
plot(fitted(Hw6_model), resids)
The mean of the residuals is essentially zero (2.43e-18), so E(ε) = 0 is satisfied. The plots must then be inspected: if the points in the Q-Q plot fall close to the reference line and the residuals show no systematic pattern or funnel shape against the fitted values, the normality and constant-variance assumptions are reasonably met. If any of these assumptions are violated, the model may not be a good fit for the data and additional analysis may be necessary.
This suggests that the Hw5 model may not be the best fit for the data, and that a more complex model may be needed to account for the non-constant variance and the potential non-linear relationship between Cost and the predictors.
We can also note that obtaining more samples could help clarify whether the normality assumption holds, since the Q-Q plot here is based on only 20 observations.
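Formal tests can complement the visual checks; a minimal sketch, assuming the lmtest package is installed (it is loaded again below for the Durbin-Watson test):

# Shapiro-Wilk test of normality of the residuals
shapiro.test(residuals(Hw6_model))
# Breusch-Pagan test of constant error variance
library(lmtest)
bptest(Hw6_model)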
data2 <- read.csv("HW6StateCostData.csv")
# Create dummy variables for State
dummy_vars <- model.matrix(~ STATE - 1, data = data2)
# Append the dummy variables to the data
data2 <- cbind(data2, dummy_vars)
# Fit the linear regression model
model <- lm(COST ~ ., data = data2)
# Print the model summary
summary(model)
##
## Call:
## lm(formula = COST ~ ., data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -299.80 -95.83 -37.90 153.32 295.20
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 279.60 53.43 5.233 1.63e-05 ***
## STATEKentucky 80.30 75.56 1.063 0.2973
## STATETexas 198.20 75.56 2.623 0.0141 *
## STATEKansas NA NA NA NA
## STATEKentucky NA NA NA NA
## STATETexas NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 168.9 on 27 degrees of freedom
## Multiple R-squared: 0.205, Adjusted R-squared: 0.1462
## F-statistic: 3.482 on 2 and 27 DF, p-value: 0.04515
The model is: COST = β0 + β1D1 + β2D2 + ε
where D1 and D2 are dummy variables for the states Kentucky and Texas, respectively.
β0 corresponds to the expected cost when the state is Kansas. The intercept β0 gives the value of the response when all of the dummy variables are equal to zero; since Kansas was chosen as the base level, the intercept is the expected cost for Kansas.
Note that the coefficients of the dummy variables are interpreted as the differences in expected cost between the corresponding state and the base-level state (Kansas). For example, the coefficient of D1 is the difference in expected cost between Kentucky and Kansas, so the fitted expected cost for Kentucky is 279.60 + 80.30 = 359.90 dollars.
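As a side note, the manually appended dummy columns are redundant: R creates dummy variables automatically when a factor appears in the model formula, which also avoids the singularities flagged in the summary above. A minimal sketch of the cleaner equivalent fit:

# Let R build the dummy variables; Kansas is the base level (first alphabetically)
model_state <- lm(COST ~ factor(STATE), data = data2)
summary(model_state)
# The base level can be set explicitly if a different reference is desired:
# data2$STATE <- relevel(factor(data2$STATE), ref = "Kansas")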
anova(model)
## Analysis of Variance Table
##
## Response: COST
## Df Sum Sq Mean Sq F value Pr(>F)
## STATE 2 198772 99386 3.4819 0.04515 *
## Residuals 27 770671 28543
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The base level for the categorical variable "STATE" is the category without its own row of estimates in the model summary, which is Kansas in this case.
Based on the results from Question 7, we can conclude that there is a significant difference in the expected costs among the three states, since the p-value of the overall F-test (0.04515) is less than the significance level of 0.05. This means that at least one state's mean cost differs from the others.
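If we want to know which pairs of states differ, a multiple-comparison procedure can follow the overall F-test. One possible sketch using Tukey's HSD (assuming the usual equal-variance ANOVA setup):

# Pairwise comparisons of mean COST between states
TukeyHSD(aov(COST ~ factor(STATE), data = data2))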
set.seed(123)
x1 <- rnorm(24)
x2 <- rnorm(24)
x3 <- rnorm(24)
y <- 2*x1 + 3*x2 + 4*x3 + rnorm(24)
# fit the linear regression model
model <- lm(y ~ x1 + x2 + x3)
# perform Durbin-Watson test
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.2.1
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.2.1
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
dwtest(model, alternative = "two.sided")
##
## Durbin-Watson test
##
## data: model
## DW = 2.1296, p-value = 0.7783
## alternative hypothesis: true autocorrelation is not 0
The output indicates that the Durbin-Watson test statistic is 2.1296 and the p-value is 0.7783. Since the p-value is well above the significance level of 0.10, we fail to reject the null hypothesis of no autocorrelation in the errors. Therefore, we conclude that the independence assumption of the errors is met.
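The test statistic itself is easy to reproduce by hand from the standard Durbin-Watson formula, DW = Σ(et − et−1)² / Σet²; a quick sketch:

# Durbin-Watson statistic computed directly from the residuals
e <- resid(model)
sum(diff(e)^2) / sum(e^2)  # approximately 2.1296, matching dwtest()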
Consider the hospital stay problem that you studied in HW5 by using homework05Hospital.csv data. To remind you, we had defined: y = monthly labor hours required; x1 = monthly X-ray exposures; x2 = monthly occupied bed days; and x3 = average length of patients’ stays (in days). Answer Questions 10–12.
data3 <- read.csv("homework05Hospital.csv")
model <- lm(Hours ~ ., data = data3)
summary(model)
##
## Call:
## lm(formula = Hours ~ ., data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -677.23 -270.19 60.93 228.32 517.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1946.80204 504.18193 3.861 0.00226 **
## Xray 0.03858 0.01304 2.958 0.01197 *
## BedDays 1.03939 0.06756 15.386 2.91e-09 ***
## Length -413.75780 98.59828 -4.196 0.00124 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 387.2 on 12 degrees of freedom
## Multiple R-squared: 0.9961, Adjusted R-squared: 0.9952
## F-statistic: 1028 on 3 and 12 DF, p-value: 9.919e-15
# Standardized residuals
std_resid <- rstandard(model)
# Cook's distance
cooksd <- cooks.distance(model)
# Combine results with original data
results <- cbind(data3, std_resid, cooksd)
# Sort by Cook's distance in ascending order (most influential points last)
results <- results[order(results$cooksd),]
# Print results
print(results)
## Xray BedDays Length Hours std_resid cooksd
## 3 3940 620.25 4.28 1033.15 0.1676558 0.001047070
## 1 2463 472.92 4.45 566.52 -0.3460418 0.004111350
## 5 5723 1497.60 5.50 1611.37 0.4402261 0.004609869
## 13 15543 3865.67 5.50 4026.52 -0.7007880 0.008722368
## 7 5779 1687.00 5.62 1854.17 0.6924578 0.011288649
## 14 34703 12446.33 10.78 11732.17 -0.1434535 0.012871368
## 2 2048 1339.75 6.92 696.82 0.4184417 0.013450584
## 6 11520 1365.83 4.60 1613.27 -0.8077234 0.021069992
## 9 8461 2872.33 6.18 2305.58 -1.0710300 0.027541648
## 8 5969 1639.92 5.15 2160.55 1.1057046 0.027859603
## 11 13313 2912.00 5.88 3571.89 1.3966312 0.044335089
## 10 20106 3655.08 6.15 3503.93 -1.3134905 0.067330758
## 4 6505 568.33 3.90 1603.62 1.2075842 0.068803012
## 12 10771 3921.00 4.88 3741.40 -1.9293551 0.201515928
## 16 86533 15524.00 6.35 18854.45 0.6133041 1.316994321
## 15 39204 14098.40 7.05 15414.94 1.2248752 1.383805243
Observations 15 and 16 stand out, with Cook's distance values above 1 (1.384 and 1.317, respectively), far larger than for any other observation. These points have high leverage and/or large residuals, indicating that they might have a strong influence on the regression model.
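To separate the two sources of influence, the leverage (hat) values can be examined directly; a minimal sketch, using the common 2(k+1)/n cutoff as a rule of thumb:

# Hat (leverage) values; flag points above the rule-of-thumb cutoff 2(k+1)/n
lev <- hatvalues(model)
cutoff <- 2 * length(coef(model)) / nrow(data3)  # 2*4/16 = 0.5 here
lev[lev > cutoff]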
library(car)
# Calculate the VIF for each predictor
vif(model)
## Xray BedDays Length
## 1.05762 3.65487 1.05762
We can see that the VIF values for all predictors are well below the common thresholds of 5 and 10, indicating that there is no significant multicollinearity issue in the model. Therefore, the coefficient estimates are likely to be stable and reliable.
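As a cross-check, each VIF can be computed by hand from the R² of regressing one predictor on the remaining predictors, VIFj = 1/(1 − Rj²). A minimal sketch for Xray, using the same data3 as above:

# VIF for Xray by hand: regress Xray on the other predictors
r2_xray <- summary(lm(Xray ~ BedDays + Length, data = data3))$r.squared
1 / (1 - r2_xray)  # should agree with the vif() value for Xray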