Load Libraries

library(haven)
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(broom)
library(nortest)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Q1. Obtain the mean, standard deviation, min and max values for both x and y variables.

hospital <- read_csv("~/FALL 2020/STATISTICS FOR DEM  7273 DATA 1/ASSIGNMENTS/Assignment 4/hospital.csv")
## Parsed with column specification:
## cols(
##   cost = col_double(),
##   los = col_double()
## )
View(hospital)
summary(hospital$cost)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1233    3193    5870    8810   12213   35381
sd(hospital$cost)
## [1] 7875.072
summary(hospital$los)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   12.00   16.39   19.00   85.00
sd(hospital$los)
## [1] 16.89478

2. Fit a simple linear regression model and write down the equation.

fit<-lm(cost~los, data=hospital)
summary(fit)
## 
## Call:
## lm(formula = cost ~ los, data = hospital)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7590  -2742  -1214   2377  16052 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2706.96    1176.67   2.301   0.0283 *  
## los           372.29      50.38   7.390 2.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared:  0.6379, Adjusted R-squared:  0.6262 
## F-statistic: 54.61 on 1 and 31 DF,  p-value: 2.542e-08

yi=β0+β1xi+ϵi

Cost = 2706.96 + 372.29X

3. What is the SST, SSE, and SSR for this model, respectively?

coef(fit)
## (Intercept)         los 
##   2706.9613    372.2852
anova(fit)
## Analysis of Variance Table
## 
## Response: cost
##           Df     Sum Sq    Mean Sq F value    Pr(>F)    
## los        1 1265921342 1265921342   54.61 2.542e-08 ***
## Residuals 31  718614701   23181119                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SST = 1265921342

SSE = 718614701

SSR = 1265921342

4. How do we interpret the coefficient and the intercept, respectively?

For any additional increase in length of stay in the hospital, hospitalization cost is expected to increase by 372.29.

The intercept is the value of y when x is equal to zero. That is hospital cost if there is zero length of stay at the hospital. Thus, hospital cost will be 2706.96 if there is zero length of stay at the hospital.

5. What is the R2? How do we interpret this?

R2 = 0.6379

It means 63.79% of the variation in the hospitalization cost is explained by the length of stay in the hospital.

6. Does a linear function appear to fit the data well? If not, does the plot suggest any other potential problems with the model?

plot(fit, which=1)

No, the linear function does not fit the data well. The plot suggests that the variation around the estimated regression line is not constant suggesting that the assumption of equal error or non-constant variance is violated.

7. Make a residuals vs. fits plot. Interpret the residuals vs. fits plot — which model assumption does it suggest is violated? Elaborate your answer.

plot(x=fit$fitted.values, y=fit$residuals, xlab = "Fitted Values", ylab = "Residuals", col="blue")
abline(h=0)

plot(fit)

The plot suggests that there is non linear relationship between cost and length of stay. It also suggests that there are unusual data points in the data set such as outliers, with large residual values. This suggest that the linearity assumption model is violated as the outliers may cause the linear to tilt or zigzag and also may cause the slope to be misestimated.

8. Test the normality of residuals assumption. Does the test provide evidence that the residuals are not normally distributed?

ad.test(resid(fit))
## 
##  Anderson-Darling normality test
## 
## data:  resid(fit)
## A = 0.91897, p-value = 0.01714

Ho: There is normal distribution of the residuals. Ha: Residuals are not normally distributed.

Yes, from the test above the p-value of 0.01714 is less than the alpha value of 0.05. Therefore, the null hypothesis is rejected and l conclude that the residuals are not normally distributed.

9. To fix the problem identified in (7), what transformation would you use to fix the problem?

I will use “transform the independent variable to linearize the model.” Since the linearity assumption is in violation and is often appropriate to correct the non-linearity before considering the normality and constant variance assumptions, transforming the independent variable will fix the non-linearity problem.

10. To fix the problem identified in (6), what transformation would you use?

To fix the problem identified in (6), I will use “transforming the dependent variable to normalize error and equalize variance of errors” to fix the normality and constant variance assumptions.

11. Does the transformation appear to have helped rectify the original problems with the model? Why? Justify your answer by providing visual evidence and statistical tests results.

attach(hospital)
fit<-lm(cost~los)
boxcox(fit)

logcost1<- log(cost)
plot1<- plot(logcost1, los)
lines(lowess(logcost1, los), col="blue")

fitR<- lm(logcost1~los)
summary(fitR)
## 
## Call:
## lm(formula = logcost1 ~ los)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07050 -0.46336  0.01929  0.29820  1.35821 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.152469   0.157570  51.739  < 2e-16 ***
## los         0.035232   0.006746   5.223 1.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6447 on 31 degrees of freedom
## Multiple R-squared:  0.468,  Adjusted R-squared:  0.4509 
## F-statistic: 27.27 on 1 and 31 DF,  p-value: 1.135e-05
plot(fitR, which=1)

bptest(fitR)
## 
##  studentized Breusch-Pagan test
## 
## data:  fitR
## BP = 0.55063, df = 1, p-value = 0.4581
ad.test(resid(fitR))
## 
##  Anderson-Darling normality test
## 
## data:  resid(fitR)
## A = 0.28101, p-value = 0.6188
detach(hospital)

bptest hypothesis: Ho: We have constant variance Ha: We do not have constant variance of in residuals

Normality test: Ho: We have normal distribution of residuals. Ha: We do not have normal distribution of residuals.

The transformation seems to have fix the problems. The bptest shows that there is constant variance as the p-value of 0.4581 is greater than the alpha value of 0.05. Therefore, we fail to reject the null hypothesis and conclude that we have constant variance in the residuals.

The normality test also shows that there is normal distribution of the residuals as the p-value 0.6188 is greater than the alpha value of 0.05. Therefore, we fail to reject the null hypothesis and conclude that we have normal distribution of residuals

12. Is there an association between hospitalization cost and length of stay? Justify your answer.

Yes, there is a positive association between hospitalization cost and length of stay because as length of stay increases hospitalization cost increases.