hospital <- read.delim("~/Stats 1/Data/hospital.txt")
View(hospital)
library(psych)   #describe() comes from the psych package
describe(hospital$cost)
## vars n mean sd median trimmed mad min max range skew
## X1 1 33 8810.18 7875.07 5870 7435.67 5252.85 1233 35381 34148 1.58
## kurtosis se
## X1 2.25 1370.87
describe(hospital$los)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 33 16.39 16.89 12 13.26 10.38 1 85 84 2.31 6.13 2.94
simple.fit <- lm(cost~los, data=hospital)
summary(simple.fit)
##
## Call:
## lm(formula = cost ~ los, data = hospital)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7590 -2742 -1214 2377 16052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2706.96 1176.67 2.301 0.0283 *
## los 372.29 50.38 7.390 2.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared: 0.6379, Adjusted R-squared: 0.6262
## F-statistic: 54.61 on 1 and 31 DF, p-value: 2.542e-08
#Ying's approach: attach the data frame and fit the model without the data argument
attach(hospital)
fit1 <- lm(cost ~ los)
summary(fit1)
##
## Call:
## lm(formula = cost ~ los)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7590 -2742 -1214 2377 16052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2706.96 1176.67 2.301 0.0283 *
## los 372.29 50.38 7.390 2.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared: 0.6379, Adjusted R-squared: 0.6262
## F-statistic: 54.61 on 1 and 31 DF, p-value: 2.542e-08
#equation is cost = 2707 + 372*los
anova(fit1)
## Analysis of Variance Table
##
## Response: cost
## Df Sum Sq Mean Sq F value Pr(>F)
## los 1 1265921342 1265921342 54.61 2.542e-08 ***
## Residuals 31 718614701 23181119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#SST = SSR + SSE = 1265921342 + 718614701 = 1984536043
#SSR (the regression sum of squares, from the los row) is 1265921342
#SSE (the residual/error sum of squares) is 718614701
#equation is cost = 2707 + 372*los
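#As a quick check, we can recover these sums of squares directly from the anova table (a small sketch reusing the fit1 object above):
ss <- anova(fit1)[["Sum Sq"]]
ssr <- ss[1]   #regression sum of squares (the los row)
sse <- ss[2]   #residual sum of squares
ssr + sse      #SST = 1984536043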
#The intercept is 2707. It is the predicted cost for a length of stay of zero days. The beta coefficient for length of stay (aka the slope) is 372. This means that for every one-day increase in length of stay, we can expect an increase in cost of about 372 dollars.
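#To see the slope in action, here is a small sketch comparing predicted costs for stays one day apart; the two fitted values differ by exactly the slope, about 372 dollars:
predict(fit1, newdata = data.frame(los = c(10, 11)))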
#Multiple R-squared: 0.6379; Adjusted R-squared: 0.6262. R-squared represents the proportion of variation in the outcome that can be explained by the model, and it ranges between zero and one. Our value is closer to 1 than to 0, meaning a large share of the variability in the outcome is explained by the model. Put another way, about 64% of the variation in the dependent variable (cost) is explained by the independent variable (length of stay).
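#The R-squared can also be computed by hand as the regression's share of the total variation (reusing ssr and sse from the check above):
ssr / (ssr + sse)   #multiple R-squared, about 0.6379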
#Plotting the model
fit1 <- lm(cost ~ los, data=hospital)
summary(fit1)
##
## Call:
## lm(formula = cost ~ los, data = hospital)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7590 -2742 -1214 2377 16052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2706.96 1176.67 2.301 0.0283 *
## los 372.29 50.38 7.390 2.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared: 0.6379, Adjusted R-squared: 0.6262
## F-statistic: 54.61 on 1 and 31 DF, p-value: 2.542e-08
with(hospital, plot(los, cost))   #los on the x-axis so abline(fit1) is drawn on the correct axes
abline(fit1, col="blue")
#Based on appearance, the linear model does not fit well. The relationship appears non-linear, though that impression could be driven by possible outliers; without plotting the residuals it is difficult to say for sure. (P.S. I used the geom_smooth function to help me see the pattern.)
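#For reference, the geom_smooth view mentioned above would look something like the sketch below (assuming ggplot2 is available; at this sample size geom_smooth defaults to a loess curve):
library(ggplot2)
ggplot(hospital, aes(x = los, y = cost)) +
  geom_point() +
  geom_smooth(se = FALSE)   #the default loess curve makes any curvature easier to see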
plot(fit1, which=1)
#Ideally the residuals would appear equally variable across the entire range of fitted values. Visually, there is no obvious pattern in our plot, but the error terms do vary, and without a formal test we cannot be certain from a visual inspection alone.
#We therefore run a formal test for constant variance, the Breusch-Pagan (BP) test. The null hypothesis is that the variance of the error term is constant across values of x (homoscedasticity); the alternative hypothesis is that the variance is not constant.
library(lmtest)
bptest(fit1)
##
## studentized Breusch-Pagan test
##
## data: fit1
## BP = 0.16079, df = 1, p-value = 0.6884
#My conclusion: since the p-value from the BP test is greater than 0.05, we fail to reject the null hypothesis and assume that our error terms have constant variance. HOWEVER, per Professor Ying's email, we are to ignore the BP test results and go only by a visual examination of the residuals vs. fitted plot. On the visual alone, we could say there is a violation of constant variance.
#To test for normality, we can first run a QQ plot.
plot(fit1, which=2)
#The residuals waver quite a bit from the diagonal line: the upper tail pulls away from it and the points wander from the line in the middle. This provides visual evidence that the residuals are not normally distributed.
#We can also conduct a formal test for normality using the Anderson-Darling test (and, as a second check, the Shapiro-Wilk test).
library(nortest)
## Warning: package 'nortest' was built under R version 4.0.3
ad.test(fit1$residuals)
##
## Anderson-Darling normality test
##
## data: fit1$residuals
## A = 0.91897, p-value = 0.01714
shapiro.test(fit1$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit1$residuals
## W = 0.90795, p-value = 0.00862
#The p-values of both the Shapiro-Wilk and Anderson-Darling tests are less than our significance level of 0.05, hence we reject the null hypothesis, which states that the residuals are normally distributed. We conclude from both tests that our residuals are not normally distributed and therefore the normality assumption is NOT met.
#While I know that we are not asked to fix the problem of non-normality, because we eventually transform both variables in the next question, I would use the natural log. I determined this by conducting a Box-Cox assessment: Box-Cox suggests a transformation based on lambda, and since our lambda is close to 0, the natural log transformation is the appropriate choice.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
boxcox(fit1)
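#Rather than reading lambda off the plot, we can pull the numeric optimum out of the boxcox object (a small sketch; plotit = FALSE suppresses the figure):
bc <- boxcox(fit1, plotit = FALSE)
bc$x[which.max(bc$y)]   #optimal lambda; a value near 0 supports the log transformation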
logcost <- log(cost)
plot(los, logcost)   #predictor on the x-axis to match the model below
lines(lowess(los, logcost), col="purple")
fitx <- lm(logcost~los)
summary(fitx)
##
## Call:
## lm(formula = logcost ~ los)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07050 -0.46336 0.01929 0.29820 1.35821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.152469 0.157570 51.739 < 2e-16 ***
## los 0.035232 0.006746 5.223 1.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6447 on 31 degrees of freedom
## Multiple R-squared: 0.468, Adjusted R-squared: 0.4509
## F-statistic: 27.27 on 1 and 31 DF, p-value: 1.135e-05
plot(fitx, which=1)
plot(fitx, which=2)
9. To fix the problem identified in (7), what transformation would you use to fix the problem?
#To address the constant-variance problem, we transform the dependent variable using the natural log.
boxcox(fit1)
#So the transformation would look like this:
logy <- log(cost)
loggedy<- lm(logy~los, data=hospital)
plot(loggedy, which=1)
#Per the Professor's email, we are to ignore the formal BP test results, but if I were not ignoring them, I would use the BP test to check whether the transformation was successful.
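#That check would be a one-liner on the transformed model (a sketch only; output omitted since we are going by the visual inspection):
bptest(loggedy)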
#So, to fix non-linearity, we will transform both the independent and the dependent variable using log.
#transformation of both x & y
logx <- log(los)
logy <- log(cost)
ggplot(hospital, aes(x=logx, y=logy))+geom_point()+geom_smooth(method= "lm",se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
#Transformation of both x and y corrected the non-linearity.
#Yes, the transformation solved the problem of constant variance.
tranlm <- lm(logy ~ logx, data = hospital)   #log(cost) as the outcome, log(los) as the predictor
plot(tranlm, which=1)
#Yes, the transformation also corrected for linearity and normality. See question #10 for linearity correction.
#In terms of correcting for normal distribution of residuals, our transformation was successful.
#I verified that the transformation corrected the normality problem by rerunning the AD test and the QQ plot on the newly transformed model. First, the hypotheses:
#Null hypothesis: the residuals are normally distributed.
#Alternative hypothesis: the residuals are not normally distributed.
#The original AD test returned a p-value of 0.01714. After transforming the model, the new AD p-value is 0.6116, which is greater than 0.05, so we fail to reject the null hypothesis: the transformation appears to have corrected the normality problem.
ad.test(resid(tranlm))
##
## Anderson-Darling normality test
##
## data: resid(tranlm)
## A = 0.28325, p-value = 0.6116
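#As with the original model, the Shapiro-Wilk test could be run alongside the AD test as a second check (a sketch; output not shown):
shapiro.test(resid(tranlm))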
plot(tranlm, which=2)   #QQ plot of the transformed model's residuals
12. Is there an association between hospitalization cost and length of stay? Justify your answer.
summary(fitx)
##
## Call:
## lm(formula = logcost ~ los)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07050 -0.46336 0.01929 0.29820 1.35821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.152469 0.157570 51.739 < 2e-16 ***
## los 0.035232 0.006746 5.223 1.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6447 on 31 degrees of freedom
## Multiple R-squared: 0.468, Adjusted R-squared: 0.4509
## F-statistic: 27.27 on 1 and 31 DF, p-value: 1.135e-05
#First, the fitted transformed model is log(cost) = 8.15 + 0.0352*los. Exponentiating the slope coefficient gives exp(0.035232) = 1.0359, i.e. about a 3.59% increase. We can therefore interpret the model as: for every one-day increase in length of stay, there is an associated increase in cost of about 3.59%.
#Here is the math for the exponentiation:
exp(0.035232)                 #1.03586
(exp(0.035232) - 1) * 100     #3.586, i.e. about a 3.59% increase per additional day
#Because our p-value is less than 0.05, there is a statistically significant positive association between length of stay and cost; the relationship remains significant even at the 0.01 level.
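#A confidence interval for that percent change can be obtained by exponentiating the slope's interval (a small sketch using the fitx model):
(exp(confint(fitx)["los", ]) - 1) * 100   #95% CI for the % increase in cost per additional day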