Hospital administrators are interested in determining how hospitalization cost (y = cost) is related to length of stay (x = los) in the hospital. The dataset contains the reimbursed hospital costs and associated lengths of stay for a sample of 33 people. Use hospital.txt data to answer the following questions:
Libraries
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
library(haven)
library(broom)
library(tibble)
library(tidyr)
library(stringr)
library(forcats)
library(nortest)
## Warning: package 'nortest' was built under R version 4.0.3
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
Data
library(readxl)
hospital <- read_excel("hospital.xlsx")
View(hospital)
describe(hospital$cost)
describe(hospital$los)
mod1 <- lm(cost~los, data = hospital)
summary(mod1)
##
## Call:
## lm(formula = cost ~ los, data = hospital)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7590 -2742 -1214 2377 16052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2706.96 1176.67 2.301 0.0283 *
## los 372.29 50.38 7.390 2.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared: 0.6379, Adjusted R-squared: 0.6262
## F-statistic: 54.61 on 1 and 31 DF, p-value: 2.542e-08
coef(mod1)
## (Intercept) los
## 2706.9613 372.2852
anova(mod1)
#y = 2706.96 + 372.29(x) + e
#SST 845206835
#SSE 126592134
#SSR 718614701
#Length of stay (x) increases the cost(y) by 372.29$(coefficient) for each single increase in x. The intercept is 2706.96, meaning that when los(x)is 0 the cost (y) equals 2,706.96$.
#R^2 = 0.63, meaning that approximately 63% of the variance in the cost is explained by the length of stay
attach(hospital)
ggplot(hospital, aes(x=los, y=cost))+geom_point()+geom_smooth(method='lm',se =F)
## `geom_smooth()` using formula 'y ~ x'
# Yes a linear line appears to fit the data well.
summary(mod1)
##
## Call:
## lm(formula = cost ~ los, data = hospital)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7590 -2742 -1214 2377 16052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2706.96 1176.67 2.301 0.0283 *
## los 372.29 50.38 7.390 2.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared: 0.6379, Adjusted R-squared: 0.6262
## F-statistic: 54.61 on 1 and 31 DF, p-value: 2.542e-08
plot(x=mod1$fitted.values,y=mod1$residuals,xlab = "Fitted Values", ylab = "Residuals",pch=10, col="red")
abline(mod1)
bptest(mod1)
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 0.16079, df = 1, p-value = 0.6884
#This residuals vs fits plot suggests that the assumption of constant variance is violated, however the results of the bp test shows that the assumption of constant variance is not violated
ad.test(resid(mod1))
##
## Anderson-Darling normality test
##
## data: resid(mod1)
## A = 0.91897, p-value = 0.01714
# Results of the Anderson-Darling test provide evidence that the normality of residuals assumption is vioiated
bptest(mod1)
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 0.16079, df = 1, p-value = 0.6884
boxcox(mod1)
# According to the results of the bp test, the assumption of constant variance was not violated. But in order to address the assumption of residuals normality, I would use the log transformation on the dependent variable.
logcost<- log(cost)
plot1 <- plot(logcost, los)
lines(lowess(logcost, los), col = "red")
fitr <- lm(logcost~los)
# I used the log transformation on the dependent variable
plot(x=fitr$fitted.values,y=fitr$residuals,xlab = "Fitted Values", ylab = "Residuals",pch=10, col="red")
ad.test(resid(fitr))
##
## Anderson-Darling normality test
##
## data: resid(fitr)
## A = 0.28101, p-value = 0.6188
bptest(fitr)
##
## studentized Breusch-Pagan test
##
## data: fitr
## BP = 0.55063, df = 1, p-value = 0.4581
anova(fitr)
coef(fitr)
## (Intercept) los
## 8.15246890 0.03523236
summary(fitr)
##
## Call:
## lm(formula = logcost ~ los)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07050 -0.46336 0.01929 0.29820 1.35821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.152469 0.157570 51.739 < 2e-16 ***
## los 0.035232 0.006746 5.223 1.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6447 on 31 degrees of freedom
## Multiple R-squared: 0.468, Adjusted R-squared: 0.4509
## F-statistic: 27.27 on 1 and 31 DF, p-value: 1.135e-05
# The log transformation appeared to resolve the initial issue with the model. The assumption of residuals normality is now met and verified by the results of the AD test on the transformed model.
# There is a positive association between length of stay and hospitalization cost. The model suggests that the independent variable(los) is associated with a 3.6% increase in the dependent variable (hospitalization cost) for every one additional day spent in the hospital.
# Estimate = 3.6