Hospital administrators are interested in determining how hospitalization cost (y = cost) is related to length of stay (x = los) in the hospital. The dataset contains the reimbursed hospital costs and associated lengths of stay for a sample of 33 people. Use hospital.txt data to answer the following questions:

Libraries

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
library(haven)
library(broom)
library(tibble)
library(tidyr)
library(stringr)
library(forcats)
library(nortest)
## Warning: package 'nortest' was built under R version 4.0.3
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Data

library(readxl)
hospital <- read_excel("hospital.xlsx")
View(hospital)
  1. Obtain the mean, standard deviation, min and max values for both x and y variables.
describe(hospital$cost)
describe(hospital$los)
  1. Fit a simple linear regression model and write down the equation.
mod1 <- lm(cost~los, data = hospital)
summary(mod1)
## 
## Call:
## lm(formula = cost ~ los, data = hospital)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7590  -2742  -1214   2377  16052 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2706.96    1176.67   2.301   0.0283 *  
## los           372.29      50.38   7.390 2.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared:  0.6379, Adjusted R-squared:  0.6262 
## F-statistic: 54.61 on 1 and 31 DF,  p-value: 2.542e-08
coef(mod1)
## (Intercept)         los 
##   2706.9613    372.2852
anova(mod1)
#y = 2706.96 + 372.29(x) + e
  1. What is the SST, SSE, and SSR for this model, respectively?
#SST 845206835
#SSE 126592134
#SSR 718614701
  1. How do we interpret the coefficient and the intercept, respectively?
#Length of stay (x) increases the cost(y) by 372.29$(coefficient) for each single increase in x. The intercept is 2706.96, meaning that when los(x)is 0 the cost (y) equals 2,706.96$.
  1. What is the R2? How do we interpret this?
#R^2 = 0.63, meaning that approximately 63% of the variance in the cost is explained by the length of stay
  1. Does a linear function appear to fit the data well? If not, does the plot suggest any other potential problems with the model?
attach(hospital)
ggplot(hospital, aes(x=los, y=cost))+geom_point()+geom_smooth(method='lm',se =F)
## `geom_smooth()` using formula 'y ~ x'

# Yes a linear line appears to fit the data well.
  1. Make a residuals vs. fits plot. Interpret the residuals vs. fits plot — which model assumption does it suggest is violated? Elaborate your answer.
summary(mod1)
## 
## Call:
## lm(formula = cost ~ los, data = hospital)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7590  -2742  -1214   2377  16052 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2706.96    1176.67   2.301   0.0283 *  
## los           372.29      50.38   7.390 2.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared:  0.6379, Adjusted R-squared:  0.6262 
## F-statistic: 54.61 on 1 and 31 DF,  p-value: 2.542e-08
plot(x=mod1$fitted.values,y=mod1$residuals,xlab = "Fitted Values", ylab = "Residuals",pch=10, col="red")
abline(mod1)

bptest(mod1)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod1
## BP = 0.16079, df = 1, p-value = 0.6884
#This residuals vs fits plot suggests that the assumption of constant variance is violated, however the results of the bp test shows that the assumption of constant variance is not violated
  1. Test the normality of residuals assumption. Does the test provide evidence that the residuals are not normally distributed?
ad.test(resid(mod1))
## 
##  Anderson-Darling normality test
## 
## data:  resid(mod1)
## A = 0.91897, p-value = 0.01714
# Results of the Anderson-Darling test provide evidence that the normality of residuals assumption is vioiated
  1. To fix the problem identified in (7), what transformation would you use to fix the problem?
bptest(mod1)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod1
## BP = 0.16079, df = 1, p-value = 0.6884
boxcox(mod1)

# According to the results of the bp test, the assumption of constant variance was not violated. But in order to address the assumption of residuals normality, I would use the log transformation on the dependent variable.
  1. To fix the problem identified in (6), what transformation would you use?
logcost<- log(cost)
plot1 <- plot(logcost, los)
lines(lowess(logcost, los), col = "red")

fitr <- lm(logcost~los)

# I used the log transformation on the dependent variable
  1. Does the transformation appear to have helped rectify the original problems with the model? Why? Justify your answer by providing visual evidence and statistical tests results.
plot(x=fitr$fitted.values,y=fitr$residuals,xlab = "Fitted Values", ylab = "Residuals",pch=10, col="red")

ad.test(resid(fitr))
## 
##  Anderson-Darling normality test
## 
## data:  resid(fitr)
## A = 0.28101, p-value = 0.6188
bptest(fitr)
## 
##  studentized Breusch-Pagan test
## 
## data:  fitr
## BP = 0.55063, df = 1, p-value = 0.4581
anova(fitr)
coef(fitr)
## (Intercept)         los 
##  8.15246890  0.03523236
summary(fitr)
## 
## Call:
## lm(formula = logcost ~ los)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07050 -0.46336  0.01929  0.29820  1.35821 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.152469   0.157570  51.739  < 2e-16 ***
## los         0.035232   0.006746   5.223 1.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6447 on 31 degrees of freedom
## Multiple R-squared:  0.468,  Adjusted R-squared:  0.4509 
## F-statistic: 27.27 on 1 and 31 DF,  p-value: 1.135e-05
# The log transformation appeared to resolve the initial issue with the model. The assumption of residuals normality is now met and verified by the results of the AD test on the transformed model.
  1. Is there an association between hospitalization cost and length of stay? Justify your answer.
# There is a positive association between length of stay and hospitalization cost. The model suggests that the independent variable(los) is associated with a 3.6% increase in the dependent variable (hospitalization cost) for every one additional day spent in the hospital.

# Estimate = 3.6