library(haven)
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(broom)
library(nortest)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
hospital <- read_csv("~/FALL 2020/STATISTICS FOR DEM 7273 DATA 1/ASSIGNMENTS/Assignment 4/hospital.csv")
## Parsed with column specification:
## cols(
## cost = col_double(),
## los = col_double()
## )
View(hospital)
summary(hospital$cost)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1233 3193 5870 8810 12213 35381
sd(hospital$cost)
## [1] 7875.072
summary(hospital$los)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 12.00 16.39 19.00 85.00
sd(hospital$los)
## [1] 16.89478
fit<-lm(cost~los, data=hospital)
summary(fit)
##
## Call:
## lm(formula = cost ~ los, data = hospital)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7590 -2742 -1214 2377 16052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2706.96 1176.67 2.301 0.0283 *
## los 372.29 50.38 7.390 2.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared: 0.6379, Adjusted R-squared: 0.6262
## F-statistic: 54.61 on 1 and 31 DF, p-value: 2.542e-08
The simple linear regression model is y_i = β0 + β1 x_i + ε_i.
Estimated regression equation: Cost = 2706.96 + 372.29 (LOS)
coef(fit)
## (Intercept) los
## 2706.9613 372.2852
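As a quick illustration of the estimated equation, the predicted cost for a hypothetical 10-day stay (an illustrative value, not part of the assignment) can be computed by hand or with predict():
# Hypothetical example: predicted cost for a 10-day stay.
# By hand: 2706.9613 + 372.2852 * 10 = 6429.81 (approximately)
predict(fit, newdata = data.frame(los = 10))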
anova(fit)
## Analysis of Variance Table
##
## Response: cost
## Df Sum Sq Mean Sq F value Pr(>F)
## los 1 1265921342 1265921342 54.61 2.542e-08 ***
## Residuals 31 718614701 23181119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
SSR = 1265921342
SSE = 718614701
SST = SSR + SSE = 1984536043
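As a quick arithmetic check (a sketch using the sums of squares from the ANOVA table above), R-squared can be recovered from the decomposition:
# R-squared from the ANOVA decomposition: R^2 = SSR / SST = SSR / (SSR + SSE)
ssr <- 1265921342
sse <- 718614701
ssr / (ssr + sse)  # approximately 0.6379, matching summary(fit)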
For each additional day of length of stay in the hospital, hospitalization cost is expected to increase by 372.29.
The intercept is the value of y when x equals zero, that is, the hospitalization cost when length of stay is zero. Thus, the predicted hospitalization cost is 2706.96 when length of stay is zero.
R2 = 0.6379
This means that 63.79% of the variation in hospitalization cost is explained by length of stay in the hospital.
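A minimal sketch to back these interpretations up numerically; the LOS values of 10 and 11 days are arbitrary illustrative choices:
# The difference between predictions one day apart equals the slope, about 372.29.
diff(predict(fit, newdata = data.frame(los = c(10, 11))))
# In simple regression, R-squared is the squared correlation between y and x.
cor(hospital$cost, hospital$los)^2  # approximately 0.6379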
plot(fit, which=1)
No, the linear function does not fit the data well. The plot shows that the spread of the residuals around the estimated regression line is not constant, which suggests that the assumption of constant error variance (homoscedasticity) is violated.
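As a formal complement to this visual impression, a Breusch-Pagan test could also be applied to the untransformed model (a sketch; only the transformed model is tested later in this write-up):
# Breusch-Pagan test on the untransformed fit: a small p-value would support
# the impression of non-constant error variance.
bptest(fit)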
plot(x=fit$fitted.values, y=fit$residuals, xlab = "Fitted Values", ylab = "Residuals", col="blue")
abline(h=0)
plot(fit)
The plot suggests a non-linear relationship between cost and length of stay. It also shows unusual data points with large residuals (outliers). This indicates that the linearity assumption is violated, and the outliers may pull the fitted line and cause the slope to be misestimated.
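To locate the unusual points mentioned above, Cook's distance can be inspected (a sketch; the 4/n cutoff is a common rule of thumb, not a requirement of the assignment):
# Identify potentially influential observations by Cook's distance.
cd <- cooks.distance(fit)
which(cd > 4 / nrow(hospital))  # rule-of-thumb cutoff of 4/n
plot(fit, which = 4)            # Cook's distance plot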
ad.test(resid(fit))
##
## Anderson-Darling normality test
##
## data: resid(fit)
## A = 0.91897, p-value = 0.01714
Ho: The residuals are normally distributed. Ha: The residuals are not normally distributed.
Yes. From the test above, the p-value of 0.01714 is less than the alpha value of 0.05. Therefore, the null hypothesis is rejected and I conclude that the residuals are not normally distributed.
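A normal Q-Q plot of the residuals gives a visual counterpart to the Anderson-Darling result (a sketch):
# Visual check of residual normality alongside the Anderson-Darling test.
qqnorm(resid(fit))
qqline(resid(fit), col = "blue")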
I will use “transform the independent variable to linearize the model.” Since the linearity assumption is violated, and it is usually appropriate to correct non-linearity before addressing the normality and constant-variance assumptions, transforming the independent variable should fix the non-linearity problem.
To fix the problem identified in (6), I will use “transforming the dependent variable to normalize the errors and equalize the variance of the errors” to address the normality and constant-variance assumptions.
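One way the independent-variable transformation proposed above could be tried is to refit the model with a transformed LOS; the log transform and the name fit_logx below are illustrative choices only:
# Sketch: linearize by transforming the predictor (log chosen for illustration).
fit_logx <- lm(cost ~ log(los), data = hospital)
summary(fit_logx)
plot(fit_logx, which = 1)  # re-check the residuals-vs-fitted pattern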
attach(hospital)
fit<-lm(cost~los)
boxcox(fit)
# In the Box-Cox plot above, a lambda near 0 corresponds to a log transformation of the response.
logcost1 <- log(cost)
plot(los, logcost1, xlab = "Length of Stay", ylab = "log(Cost)")
lines(lowess(los, logcost1), col = "blue")
fitR<- lm(logcost1~los)
summary(fitR)
##
## Call:
## lm(formula = logcost1 ~ los)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07050 -0.46336 0.01929 0.29820 1.35821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.152469 0.157570 51.739 < 2e-16 ***
## los 0.035232 0.006746 5.223 1.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6447 on 31 degrees of freedom
## Multiple R-squared: 0.468, Adjusted R-squared: 0.4509
## F-statistic: 27.27 on 1 and 31 DF, p-value: 1.135e-05
plot(fitR, which=1)
bptest(fitR)
##
## studentized Breusch-Pagan test
##
## data: fitR
## BP = 0.55063, df = 1, p-value = 0.4581
ad.test(resid(fitR))
##
## Anderson-Darling normality test
##
## data: resid(fitR)
## A = 0.28101, p-value = 0.6188
detach(hospital)
bptest hypotheses: Ho: The residuals have constant variance. Ha: The residuals do not have constant variance.
Normality test hypotheses: Ho: The residuals are normally distributed. Ha: The residuals are not normally distributed.
The transformation seems to have fixed the problems. The bptest shows constant variance: the p-value of 0.4581 is greater than the alpha value of 0.05, so we fail to reject the null hypothesis and conclude that the residuals have constant variance.
The normality test also shows that the residuals are normally distributed: the p-value of 0.6188 is greater than the alpha value of 0.05, so we fail to reject the null hypothesis and conclude that the residuals are normally distributed.
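A Q-Q plot of the transformed model's residuals can corroborate these test results visually (a sketch):
# Normal Q-Q plot of the log-cost model's residuals.
plot(fitR, which = 2)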
Yes, there is a positive association between hospitalization cost and length of stay: as length of stay increases, hospitalization cost also increases.
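Because the final model is on the log scale, the slope can also be read as a percentage effect (a sketch; the percentage follows from exp(0.035232)):
# Approximate percent change in cost per additional day of stay, from the log-cost model:
# exp(0.035232) - 1 is about 0.036, i.e. roughly a 3.6% increase per day.
(exp(coef(fitR)["los"]) - 1) * 100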