hospital <- read.delim("~/Stats 1/Data/hospital.txt")
View(hospital)
  1. Obtain the mean, standard deviation, min and max values for both x and y variables
library(psych)  # describe() is from the psych package
describe(hospital$cost)
##    vars  n    mean      sd median trimmed     mad  min   max range skew
## X1    1 33 8810.18 7875.07   5870 7435.67 5252.85 1233 35381 34148 1.58
##    kurtosis      se
## X1     2.25 1370.87
describe(hospital$los)
##    vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 33 16.39 16.89     12   13.26 10.38   1  85    84 2.31     6.13 2.94
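# As a cross-check, the same summary statistics can be pulled with base R alone
# (a minimal sketch; no extra packages needed):
sapply(hospital[c("cost", "los")],
       function(v) c(mean = mean(v), sd = sd(v), min = min(v), max = max(v)))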
2. Fit a simple linear regression model and write down the equation.
simple.fit <- lm(cost ~ los, data = hospital)
summary(simple.fit)
## 
## Call:
## lm(formula = cost ~ los, data = hospital)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7590  -2742  -1214   2377  16052 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2706.96    1176.67   2.301   0.0283 *  
## los           372.29      50.38   7.390 2.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared:  0.6379, Adjusted R-squared:  0.6262 
## F-statistic: 54.61 on 1 and 31 DF,  p-value: 2.542e-08
# Ying's approach: the same model fit via attach()
attach(hospital)
fit1 <- lm(cost ~ los)
summary(fit1)
## 
## Call:
## lm(formula = cost ~ los)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7590  -2742  -1214   2377  16052 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2706.96    1176.67   2.301   0.0283 *  
## los           372.29      50.38   7.390 2.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared:  0.6379, Adjusted R-squared:  0.6262 
## F-statistic: 54.61 on 1 and 31 DF,  p-value: 2.542e-08
#equation is cost = 2707 + 372*los
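# The coefficients can also be pulled programmatically from the fitted object
# rather than read off the summary (a small sketch):
round(coef(fit1))   # (Intercept) 2707, los 372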
3. What is the SST, SSE, and SSR for this model, respectively?
anova(fit1)
## Analysis of Variance Table
## 
## Response: cost
##           Df     Sum Sq    Mean Sq F value    Pr(>F)    
## los        1 1265921342 1265921342   54.61 2.542e-08 ***
## Residuals 31  718614701   23181119                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#SST = SSR + SSE = 1265921342 + 718614701 = 1984536043
#SSR (regression sum of squares, the los row) is 1265921342
#SSE (error/residual sum of squares, the Residuals row) is 718614701
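# These sums of squares can be verified directly from the fitted model
# (a minimal sketch; SST = SSR + SSE by construction):
SSE <- sum(resid(fit1)^2)                                    # 718614701
SST <- sum((hospital$cost - mean(hospital$cost))^2)          # 1984536043
SSR <- SST - SSE                                             # 1265921342
c(SST = SST, SSR = SSR, SSE = SSE)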
4. How do we interpret the coefficient and the intercept, respectively?
#equation is cost = 2707 + 372*los
#The intercept is 2707: the predicted cost for a stay of zero days (an extrapolation, since the shortest stay in the data is one day). The beta coefficient for length of stay (the slope) is 372: for each additional day of stay, we expect an increase in cost of about 372 dollars.
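# A quick illustration of the equation: predicted cost for a hypothetical
# 10-day stay (the value 10 is just an example, not from the assignment):
predict(fit1, newdata = data.frame(los = 10))   # 2707 + 372*10, roughly 6430 dollars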
5. What is the R2? How do we interpret this?
#Multiple R-squared: 0.6379 (Adjusted R-squared: 0.6262). R-squared is the proportion of the variation in the data that is explained by the model; it ranges between zero and one. Our value is closer to 1 than to 0, so a large share of the variability in the outcome is explained by the model, specifically by length of stay: about 64% of the variation in the dependent variable (cost) is explained by the independent variable (length of stay).
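# In simple linear regression, R-squared equals the squared correlation between
# x and y, which gives a quick way to verify the 0.6379 figure:
cor(hospital$cost, hospital$los)^2   # should be about 0.6379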
6. Does a linear function appear to fit the data well? If not, does the plot suggest any other potential problems with the model?
#Plotting the model 
fit1 <- lm(cost ~ los, data=hospital)
summary(fit1)
## 
## Call:
## lm(formula = cost ~ los, data = hospital)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7590  -2742  -1214   2377  16052 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2706.96    1176.67   2.301   0.0283 *  
## los           372.29      50.38   7.390 2.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4815 on 31 degrees of freedom
## Multiple R-squared:  0.6379, Adjusted R-squared:  0.6262 
## F-statistic: 54.61 on 1 and 31 DF,  p-value: 2.542e-08
with(hospital, plot(los, cost))   # los on x, cost on y, so abline() matches the model
abline(fit1, col="blue")

#Based on appearance, the linear model does not fit well: the relationship looks non-linear, though that impression may be driven by a few possible outliers. Without plotting residuals, it is difficult to say for sure. P.S. I used geom_smooth() to help me see the pattern; a sketch follows.
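# A minimal sketch of the geom_smooth() plot mentioned above (assuming ggplot2
# is available; the loess smoother is what suggests the non-linear pattern):
library(ggplot2)
ggplot(hospital, aes(x = los, y = cost)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +               # flexible smoother
  geom_smooth(method = "lm", se = FALSE, colour = "red")    # linear fit for contrast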
7. Make a residuals vs. fits plot. Interpret the residuals vs. fits plot: which model assumption does it suggest is violated? Elaborate your answer.
plot(fit1, which=1)

#Ideally, the residuals would appear equally variable across the entire range of fitted values. In our plot there is no obvious visual pattern, but the spread of the error terms does vary; without a formal test, we cannot be certain from visual inspection alone.

#We then run a formal test for constant variance, the Breusch-Pagan (BP) test. The null hypothesis is that the variance of the error term is constant across values of x (homoscedasticity); the alternative hypothesis is that the variance is not constant.

library(lmtest)
bptest(fit1)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit1
## BP = 0.16079, df = 1, p-value = 0.6884
#Since the p-value from the BP test (0.6884) is greater than 0.05, we fail to reject the null hypothesis and conclude that the error terms have constant variance. HOWEVER, per Professor Ying's email, we are to ignore the BP test results and go only by a visual examination of the residuals vs. fits plot. Judging by the visual alone, we could say there is a violation of constant variance.
8. Test the normality of residuals assumption. Does the test provide evidence that the residuals are not normally distributed?
#To test for normality, we can first run a QQ plot. 
plot(fit1, which=2)

#We find that the residuals waver quite a bit from the diagonal line, deviating in the upper tail and in the middle. This provides visual evidence that the residuals are not normally distributed.

#We can also conduct formal tests for normality using the Anderson-Darling test and the Shapiro-Wilk test.
library(nortest)
## Warning: package 'nortest' was built under R version 4.0.3
ad.test(fit1$residuals)  
## 
##  Anderson-Darling normality test
## 
## data:  fit1$residuals
## A = 0.91897, p-value = 0.01714
shapiro.test(fit1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit1$residuals
## W = 0.90795, p-value = 0.00862
#The p-values of both the Shapiro-Wilk (0.00862) and Anderson-Darling (0.01714) tests are below our significance level of 0.05, hence we reject the null hypothesis that the residuals are normally distributed. We conclude from both tests that the residuals are not normally distributed, and therefore the normality assumption is NOT met.


#While we are not asked to fix the non-normality problem here, since we eventually transform both variables in a later question, I would use the natural log. I determined this with a Box-Cox assessment: the Box-Cox procedure suggests a transformation based on the estimated lambda, and since our lambda is close to 0, the natural log transformation is appropriate.
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
boxcox(fit1)
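# The lambda that maximizes the profile log-likelihood can also be extracted
# numerically rather than read off the plot (a small sketch):
bc <- boxcox(fit1, plotit = FALSE)
bc$x[which.max(bc$y)]   # a lambda near 0 points to the natural log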

logcost <- log(cost)
plot(los, logcost)   # los on x, log(cost) on y, matching the model orientation
lines(lowess(los, logcost), col="purple")

fitx <- lm(logcost~los)
summary(fitx)
## 
## Call:
## lm(formula = logcost ~ los)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07050 -0.46336  0.01929  0.29820  1.35821 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.152469   0.157570  51.739  < 2e-16 ***
## los         0.035232   0.006746   5.223 1.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6447 on 31 degrees of freedom
## Multiple R-squared:  0.468,  Adjusted R-squared:  0.4509 
## F-statistic: 27.27 on 1 and 31 DF,  p-value: 1.135e-05
plot(fitx, which=1)

plot(fitx, which=2)

9. To fix the problem identified in (7), what transformation would you use?

#We chose the log transformation of the dependent variable to address the constant variance problem.

boxcox(fit1)

#SO the transformation would look like this:
logy <- log(cost)
loggedy <- lm(logy ~ los, data=hospital)
plot(loggedy, which=1)

#Per the Professor's email, we are to ignore the formal BP test results, but if we were not ignoring them, I would use the BP test to check whether the transformation was successful (a sketch follows).
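# A sketch of that check: re-run the BP test on the transformed model
# (a p-value above 0.05 would indicate the transformation helped):
bptest(loggedy)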
10. To fix the problem identified in (6), what transformation would you use?
#To fix the non-linearity, we transform both the independent and dependent variables using the log.

#transformation of both x & y 
logx <- log(los)
logy <- log(cost)

ggplot(hospital, aes(x=logx, y=logy))+geom_point()+geom_smooth(method= "lm",se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

#Transforming both x and y corrected the non-linearity; see also the base-R check below.
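# A base-R check of the same thing: on the log-log scale the lowess curve
# should now track the straight-line fit closely (a minimal sketch):
plot(logx, logy, xlab = "log(los)", ylab = "log(cost)")
abline(lm(logy ~ logx), col = "blue")
lines(lowess(logx, logy), col = "purple")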
11. Does the transformation appear to have helped rectify the original problems with the model? Why? Justify your answer by providing visual evidence and statistical test results.
#Yes, the transformation solved the problem of constant variance.
tranlm <- lm(logy ~ logx, data=hospital)   # regress log(cost) on log(los)
plot(tranlm, which=1)

#Yes, the transformation also corrected for linearity and normality. See question 10 for the linearity correction.

#In terms of correcting for normal distribution of residuals, our transformation was successful. 

#I determined that the transformation corrected the normality problem by re-running the AD test and a QQ plot on the model with the newly transformed x and y. First, the hypotheses:
#Null hypothesis: the residuals are normally distributed.
#Alternative hypothesis: the residuals are not normally distributed.

#The original AD test returned a p-value of 0.01714. After transforming the model, the new AD test p-value is 0.6116, which is greater than 0.05, so we fail to reject the null hypothesis: the transformation appears to have corrected the normality problem.
ad.test(resid(tranlm))
## 
##  Anderson-Darling normality test
## 
## data:  resid(tranlm)
## A = 0.28325, p-value = 0.6116
plot(tranlm, which=2)   # QQ plot for the model with both variables transformed
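# As an additional check, the Shapiro-Wilk test on the transformed model's
# residuals should agree with the AD test above (a small sketch):
shapiro.test(resid(tranlm))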

12. Is there an association between hospitalization cost and length of stay? Justify your answer.

summary(fitx)
## 
## Call:
## lm(formula = logcost ~ los)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07050 -0.46336  0.01929  0.29820  1.35821 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.152469   0.157570  51.739  < 2e-16 ***
## los         0.035232   0.006746   5.223 1.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6447 on 31 degrees of freedom
## Multiple R-squared:  0.468,  Adjusted R-squared:  0.4509 
## F-statistic: 27.27 on 1 and 31 DF,  p-value: 1.135e-05
#The fitted transformed model is log(cost) = 8.15 + 0.0352*los. Exponentiating the slope coefficient gives exp(0.035232) = 1.0359, i.e. about a 3.59% increase. So for every one-day increase in length of stay, there is an associated increase in cost of about 3.59%.


#Here is my math for the exponentiation:
#  exp(0.035232) = 1.03586
#  1.03586 - 1   = 0.03586
#  0.03586 * 100 = 3.586, i.e. about a 3.59% increase per day

#Because our p-value (1.135e-05) is less than 0.05, there is a statistically significant positive association between length of stay and cost; the relationship is significant even at the 0.01 level.
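# A sketch of a supporting interval estimate: exponentiating the confidence
# interval for the slope gives a percent-change range for cost per extra day:
exp(confint(fitx)["los", ])   # multiplicative effect per additional day of stay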