The attached .csv file has data pertaining to hospital expenditures (dependent variable). The column RVUs is a representation of standard outpatient workload. Using R, conduct correlation analysis and interpret. Then fit a linear model with Expenditures~RVUs. Interpret the results and linear regression assumptions. Then fit a linear model of ln(Expenditures)~RVUs. How did this transformation affect the assumptions?
Select data
data.wk6 <- read.csv("week 6 data.csv", header=TRUE)
attach(data.wk6)
Conduct correlation analysis (testing using Pearson’s product-moment correlation)
cor(data.wk6[,-6])
##               Expenditures  Enrolled      RVUs      FTEs Quality.Score
## Expenditures     1.0000000 0.7707756 0.9217239 0.9796506     0.2749501
## Enrolled         0.7707756 1.0000000 0.9152024 0.8148491     0.2526991
## RVUs             0.9217239 0.9152024 1.0000000 0.9504093     0.3075742
## FTEs             0.9796506 0.8148491 0.9504093 1.0000000     0.2769058
## Quality.Score    0.2749501 0.2526991 0.3075742 0.2769058     1.0000000
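Most pairs of variables are strongly positively correlated (r between 0.77 and 0.98). The exception is Quality Score, whose correlation with every other variable is weak (r between 0.25 and 0.31). The strongest correlation is between FTEs and Expenditures (r = 0.98), which are almost perfectly correlated, and the weakest is between Quality Score and Enrolled (r = 0.25). Expenditures and RVUs, the pair modeled in the following sections, are also strongly correlated (r = 0.92).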
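The matrix above reports the correlation coefficients but does not test them. A minimal sketch of the corresponding significance test with base R's cor.test() follows. It runs on simulated stand-in data (the sim data frame and its values are assumptions, not the week 6 file), so its numbers will not match the matrix above.

```r
# Simulated stand-in for the course data (assumption: not the real file)
set.seed(1)
sim <- data.frame(RVUs = runif(200, 1e4, 1e6))
sim$Expenditures <- 235 * sim$RVUs + rnorm(200, sd = 3e7)

# Pearson's product-moment correlation with a t-test of H0: rho = 0
ct <- cor.test(sim$RVUs, sim$Expenditures, method = "pearson")

ct$estimate   # sample correlation r
ct$p.value    # p-value for H0: rho = 0
ct$conf.int   # 95% confidence interval for rho
```

On the real data, the same call would take data.wk6$RVUs and data.wk6$Expenditures as its two arguments.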
Fit a linear model with Expenditures ~ RVUs
par(mfrow=c(2,1))
plot(RVUs,Expenditures,
main="Expenditures vs RVUs",
xlab="RVUs", ylab="Expenditures")
m1 <-lm(Expenditures ~ RVUs)
abline(m1)
m1.res <- resid(m1)
# Plot residuals against the predictor (not the observation index)
plot(RVUs, m1.res,
main="Expenditures vs. RVUs Residual Plot",
xlab="RVUs", ylab="Residuals")
abline(h=0)
Based on the plot of Expenditures against RVUs, a linear fit is appropriate, but the variability of the errors increases as RVUs increase. This fanning pattern indicates non-constant error variance (heteroscedasticity): the fitted line is still a reasonable estimate of the mean, but predictions are less precise for larger RVUs.
Plotting the residuals against RVUs makes the same pattern easier to see. Calling plot(m1.res) on its own would put the observation index, not RVUs, on the x axis, so the spread could not be related to workload.
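The widening spread described above can also be checked numerically. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R: it regresses the squared residuals on the predictor and compares n times the auxiliary R-squared with a chi-squared distribution. The data are simulated with an error spread that grows with the predictor (an assumption for illustration), and the names RVUs.sim, Exp.sim, bp.stat and bp.p are hypothetical.

```r
# Simulated stand-in data whose error spread grows with RVUs (assumption)
set.seed(2)
n <- 300
RVUs.sim <- runif(n, 1e4, 1e6)
Exp.sim <- 235 * RVUs.sim + rnorm(n, sd = 60 * RVUs.sim)

fit <- lm(Exp.sim ~ RVUs.sim)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ RVUs.sim)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# compared against a chi-squared distribution with 1 degree of freedom
bp.stat <- n * summary(aux)$r.squared
bp.p <- pchisq(bp.stat, df = 1, lower.tail = FALSE)
bp.p   # a small value is evidence against constant variance
```

Replacing fit with m1 and RVUs.sim with RVUs would apply the same check to the model fitted above.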
Interpret the results and linear regression assumptions.
summary(m1)
##
## Call:
## lm(formula = Expenditures ~ RVUs)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -185723026  -14097620    2813431   11919781  642218316
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.785e+06  4.413e+06  -0.858    0.392
## RVUs         2.351e+02  5.061e+00  46.449   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8492
## F-statistic: 2157 on 1 and 382 DF, p-value: < 2.2e-16
The summary statistics indicate that the model fits reasonably well, with some caveats:
a. The standard error of the RVUs coefficient is small relative to the estimate (5.061e+00 vs. 2.351e+02), so the slope is estimated precisely. Each additional RVU is associated with an increase of about 235 in Expenditures.
b. The t-value, which is the estimate divided by its standard error (2.351e+02 / 5.061e+00 ≈ 46.45), is large.
c. The p-value for RVUs is very small (< 2e-16), so we reject the null hypothesis that the slope is zero. The intercept is not significantly different from zero (p = 0.392).
d. The residual standard error is large (6.735e+07). It should be judged against the typical size of Expenditures rather than against the intercept (-3.785e+06), which is not the mean. The residuals are also strongly right-skewed (minimum -1.86e+08 vs. maximum 6.42e+08). Both observations are consistent with the variance increasing as RVUs increase: the larger the RVUs, the less precisely the model predicts Expenditures. This calls the constant-variance and normality assumptions into question.
e. Multiple R-squared and Adjusted R-squared are nearly identical (0.8496 vs. 0.8492) since there is only one predictor variable. Approximately 85% of the variance in Expenditures can be explained by RVUs.
f. The F-statistic is large (2157 on 1 and 382 DF) and its p-value is small, indicating that a relationship between the two variables exists.
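The quantities in the list above are linked by identities that hold for any linear regression with a single predictor: the t-value is the estimate divided by its standard error, the F-statistic is the square of the slope's t-value, and R-squared is the square of the Pearson correlation. The sketch below checks these on simulated stand-in data (x and y are assumptions, not the week 6 variables). Each all.equal() call is expected to confirm the identity up to floating-point tolerance.

```r
# Simulated stand-in data (assumption: not the week 6 file)
set.seed(3)
x <- runif(100, 1e4, 1e6)
y <- 235 * x + rnorm(100, sd = 3e7)

s <- summary(lm(y ~ x))
est <- s$coefficients["x", "Estimate"]
se <- s$coefficients["x", "Std. Error"]
tval <- s$coefficients["x", "t value"]

# t value is the estimate divided by its standard error
all.equal(tval, est / se)
# with a single predictor, F equals t squared
all.equal(unname(s$fstatistic["value"]), tval^2)
# with a single predictor, R-squared equals the squared Pearson correlation
all.equal(s$r.squared, cor(x, y)^2)
```

The reported output is consistent with these identities: 46.449^2 ≈ 2157 (the F-statistic), and 0.9217^2 ≈ 0.8496 (the Multiple R-squared).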
Fit a linear model of ln(Expenditures)~RVUs
#plot previous model
par(mfrow=c(2,2))
plot(RVUs,Expenditures,
main="Expenditures vs RVUs",
xlab="RVUs", ylab="Expenditures")
lnExp <- log(Expenditures)
# Residuals of the previous model, plotted against RVUs
plot(RVUs, m1.res,
main="Expenditures ~ RVUs Residual Plot",
xlab="RVUs", ylab="Residuals")
abline(h=0)
#plot new model below
plot(RVUs,lnExp,
main="Natural Log of Expenditures vs RVUs",
xlab="RVUs", ylab="LN(Expenditures)")
m2 <-lm(lnExp ~ RVUs)
abline(m2)
#Plot residuals against RVUs
m2.res <- resid(m2)
plot(RVUs, m2.res,
main="Natural Log of Expenditures vs RVUs Residual Plot",
xlab="RVUs", ylab="Residuals")
abline(h=0)
After taking the natural log of Expenditures, the relationship between ln(Expenditures) and RVUs no longer appears linear; a curve would fit the data better than a straight line. However, the spread of the points around the fitted line appears more constant as RVUs increase.
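The visual impression of curvature can be checked with a nested-model F-test that compares the straight-line fit with one that adds a squared term. The sketch below uses simulated data that are deliberately curved on the log scale (an assumption for illustration); the names log.y, lin, quad and p.curve are hypothetical, and the squared term is only one of several ways to allow curvature.

```r
# Simulated stand-in data, deliberately curved on the log scale (assumption)
set.seed(4)
x <- runif(300, 1, 10)
log.y <- 17 + 1.2 * log(x) + rnorm(300, sd = 0.3)

lin <- lm(log.y ~ x)             # straight line in x
quad <- lm(log.y ~ x + I(x^2))   # adds a squared term to allow curvature

# Nested-model F-test: does the squared term reduce residual variance?
cmp <- anova(lin, quad)
p.curve <- cmp[2, "Pr(>F)"]
p.curve   # a small value suggests the straight line is missing curvature
```

On the real data, the equivalent comparison would be anova(m2, lm(lnExp ~ RVUs + I(RVUs^2))).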
summary(m2)
##
## Call:
## lm(formula = lnExp ~ RVUs)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.59439 -0.29504  0.06135  0.35333  1.20871
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.730e+01  3.325e-02  520.11   <2e-16 ***
## RVUs        1.349e-06  3.814e-08   35.38   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared: 0.7661, Adjusted R-squared: 0.7655
## F-statistic: 1251 on 1 and 382 DF, p-value: < 2.2e-16
How did this transformation affect the assumptions?
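a. The standard error of the RVUs coefficient is still small relative to the estimate (3.814e-08 vs. 1.349e-06), and the t-value is still large, although it fell from 46.45 to 35.38. The p-value is still reported as < 2e-16.
b. The residual standard error is now 0.5076, but it is measured on the log scale, so it cannot be compared directly with the 6.735e+07 from the first model; 17.30 is the intercept, not the mean. The better evidence is in the residuals themselves: they are now roughly symmetric around zero (minimum -1.59, median 0.06, maximum 1.21), and their spread is more even across RVUs. The transformation therefore improves the constant-variance assumption, particularly at the higher RVU values.
c. Multiple R-squared and Adjusted R-squared are still relatively large (0.7661 and 0.7655) but lower than in the first model. Because the response variable changed, the two R-squared values are not strictly comparable, but the drop is consistent with the curvature seen in the plot: the linearity assumption is now less well satisfied.
d. The F-statistic also remains large (1251 on 1 and 382 DF) and the associated p-value is small.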
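Because the second model is fitted on the log scale, its slope and residual standard error are easier to read after exponentiating. The short calculation below uses the two values printed in the summary(m2) output; the 10,000-RVU increment is an arbitrary choice for illustration, and the names b1, rse, pct.change and ratio are hypothetical.

```r
# Values copied from the summary(m2) output above
b1 <- 1.349e-06    # slope for RVUs on the log scale
rse <- 0.5076      # residual standard error on the log scale

# Multiplicative effect on Expenditures of 10,000 additional RVUs
# (10,000 is an arbitrary illustrative increment)
pct.change <- (exp(b1 * 10000) - 1) * 100
pct.change   # roughly 1.36 (percent)

# A residual of one standard error on the log scale, expressed as a ratio
ratio <- exp(rse)
ratio        # roughly 1.66
```

Under this reading, 10,000 additional RVUs are associated with about 1.36% higher Expenditures, and a residual of one standard error corresponds to an observed value about 1.66 times the fitted value (or 1/1.66 of it).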