The attached .csv file has data pertaining to hospital expenditures (dependent variable). The column RVUs represents standard outpatient workload (an RVU, or relative value unit, is a measure of the time and intensity {skill, effort, judgement} required of a professional service; RVUs are associated with Current Procedural Terminology (CPT) codes used in medical billing).
mydata <- read.csv("C:/Users/dingz/Downloads/week 6 data.csv")
head(mydata)
## Expenditures Enrolled RVUs FTEs Quality.Score
## 1 114948144 25294 402703.73 954.91 0.67
## 2 116423140 42186 638251.99 949.25 0.58
## 3 119977702 23772 447029.54 952.51 0.52
## 4 19056531 2085 43337.26 199.98 0.93
## 5 246166031 67258 1579789.36 2162.15 0.96
## 6 152125186 23752 673036.55 1359.07 0.56
1. Using R, conduct correlation analysis (between the two variables) and interpret.
cor(mydata)
## Expenditures Enrolled RVUs FTEs Quality.Score
## Expenditures 1.0000000 0.7707756 0.9217239 0.9796506 0.2749501
## Enrolled 0.7707756 1.0000000 0.9152024 0.8148491 0.2526991
## RVUs 0.9217239 0.9152024 1.0000000 0.9504093 0.3075742
## FTEs 0.9796506 0.8148491 0.9504093 1.0000000 0.2769058
## Quality.Score 0.2749501 0.2526991 0.3075742 0.2769058 1.0000000
From the correlation matrix, we can see that most of the variables are strongly correlated with one another. The exception is Quality.Score, which is only weakly correlated with every other variable (all of its coefficients fall between about 0.25 and 0.31). The strongest correlation is between Expenditures and FTEs (0.9796506). The weakest is between Quality.Score and Enrolled (0.2526991).
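As a cross-check on this reading, the strongest and weakest pairs can be pulled out programmatically rather than by eye. The sketch below re-enters the printed correlation matrix by hand; the names cm and pair_table are illustrative and not part of the original analysis.

```r
# Correlation matrix copied from the cor(mydata) output above
vars <- c("Expenditures", "Enrolled", "RVUs", "FTEs", "Quality.Score")
cm <- matrix(c(
  1.0000000, 0.7707756, 0.9217239, 0.9796506, 0.2749501,
  0.7707756, 1.0000000, 0.9152024, 0.8148491, 0.2526991,
  0.9217239, 0.9152024, 1.0000000, 0.9504093, 0.3075742,
  0.9796506, 0.8148491, 0.9504093, 1.0000000, 0.2769058,
  0.2749501, 0.2526991, 0.3075742, 0.2769058, 1.0000000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Look only at the upper triangle so each pair is counted once
# and the diagonal of 1s is excluded
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort from strongest to weakest correlation
pair_table <- pair_table[order(-pair_table$r), ]
strongest <- pair_table[1, ]
weakest   <- pair_table[nrow(pair_table), ]

print(pair_table, row.names = FALSE)
```

With the real data loaded, replacing the hand-typed matrix with cor(mydata) should give the same ranking.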
correlation1 = cor(mydata$Expenditures, mydata$RVUs)
cat("Expenditures vs RVUs =", round(correlation1, 4))
## Expenditures vs RVUs = 0.9217
plot(mydata$Expenditures~mydata$RVUs,
xlab="RVUs",
ylab="Expenditures",
main="Expenditures vs RVUs")
mymodel1=lm(Expenditures~RVUs,data=mydata)
abline(mymodel1, col="blue", lwd=4)
summary(mymodel1)
##
## Call:
## lm(formula = Expenditures ~ RVUs, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -185723026 -14097620 2813431 11919781 642218316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.785e+06 4.413e+06 -0.858 0.392
## RVUs 2.351e+02 5.061e+00 46.449 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8492
## F-statistic: 2157 on 1 and 382 DF, p-value: < 2.2e-16
res1=resid(mymodel1)
# Plot residuals against RVUs (not row index) so the x-axis matches its label
plot(mymodel1$model$RVUs, res1,
main="Expenditures vs. RVUs Residual Plot",
xlab="RVUs", ylab="Residuals")
abline(a=0,b=0, col="purple", lwd=4)
Based on the linear model, Expenditures and RVUs are strongly correlated (r = 0.9217, R-squared = 0.8496). However, the residuals appear to grow as RVUs increase: the points fan out around the fitted line, so the line fits well for small RVUs but much less precisely for large RVUs. This fanning pattern suggests non-constant error variance (heteroscedasticity).
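One way to back up the visual impression of fanning residuals is a Breusch-Pagan style check, sketched here by hand in base R (Koenker's studentized version: n times the R-squared from regressing the squared residuals on the predictor). The data below are simulated, not the hospital file. The sample size of 384 matches the 382 residual degrees of freedom reported above, but the slope and noise level are assumptions chosen only to mimic error variance that grows with RVUs.

```r
set.seed(1)

# Simulated stand-in for the hospital data (assumption, not the real file)
n <- 384
sim <- data.frame(RVUs = runif(n, min = 1e4, max = 2e6))
# Noise standard deviation is proportional to RVUs, so residuals fan out
sim$Expenditures <- 235 * sim$RVUs + rnorm(n, mean = 0, sd = 40 * sim$RVUs)

fit <- lm(Expenditures ~ RVUs, data = sim)

# Auxiliary regression: do the squared residuals depend on RVUs?
sim$res_sq <- resid(fit)^2
aux <- lm(res_sq ~ RVUs, data = sim)

# Studentized Breusch-Pagan statistic, chi-squared with 1 df under H0
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("BP statistic =", round(bp_stat, 2), " p-value =", signif(bp_p, 3), "\n")
```

A small p-value rejects constant error variance. Running the same steps on mymodel1 and mymodel2 would show whether the log transformation actually helped.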
lnexp<-log(mydata$Expenditures)
plot(lnexp~mydata$RVUs,
xlab="RVUs",
ylab="ln(Expenditures)",
main="Natural log of Expenditures vs RVUs")
mymodel2=lm(lnexp~RVUs,data=mydata)
abline(mymodel2, col="skyblue", lwd=4)
summary(mymodel2)
##
## Call:
## lm(formula = lnexp ~ RVUs, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.59439 -0.29504 0.06135 0.35333 1.20871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.730e+01 3.325e-02 520.11 <2e-16 ***
## RVUs 1.349e-06 3.814e-08 35.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared: 0.7661, Adjusted R-squared: 0.7655
## F-statistic: 1251 on 1 and 382 DF, p-value: < 2.2e-16
res2=resid(mymodel2)
# Plot residuals against RVUs (not row index) so the x-axis matches its label
plot(mymodel2$model$RVUs, res2,
main="Natural log of Expenditures vs. RVUs Residual Plot",
xlab="RVUs", ylab="Residuals")
abline(a=0,b=0, col="darkgreen", lwd=4)
Based on the log-linear model, most of the data still sit at smaller RVUs, but the relationship now looks curved rather than linear. The residuals look more evenly spread than before.
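The two slopes are not on the same scale, which matters when reading the summaries above. In the linear model the slope is a constant change in expenditure per RVU. In the log model it is approximately a constant percentage change. The sketch below works through the arithmetic using the printed estimates; the step of 100,000 RVUs is a hypothetical increment chosen only for illustration.

```r
# Slope estimates copied from the two summary() outputs above
slope_linear <- 2.351e+02  # Expenditures ~ RVUs
slope_log    <- 1.349e-06  # ln(Expenditures) ~ RVUs

# Hypothetical workload increase (assumption for illustration)
step <- 1e5

# Linear model: the same absolute change at every workload level
abs_change <- slope_linear * step

# Log model: the same percentage change at every workload level
pct_change <- (exp(slope_log * step) - 1) * 100

cat("Linear model: +", format(abs_change, big.mark = ","),
    "expenditure units per", format(step, big.mark = ","), "RVUs\n")
cat("Log model: +", round(pct_change, 1),
    "% per", format(step, big.mark = ","), "RVUs\n")
```

Because the log model compounds, its predictions grow exponentially in RVUs, which is one reason the fitted line can bend away from the data at the high end.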
3.1 How did this log transformation affect the Gauss Markov Assumptions (sure, the residual analysis diagnostic charts will change but what is your takeaway - which assumptions are better met or are some assumptions not met now ) ?
correlation2 = cor(lnexp, mydata$RVUs)
cat("Natural log of Expenditures vs RVUs =", round(correlation2, 4))
## Natural log of Expenditures vs RVUs = 0.8753
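Linearity: Compared to the previous chart, the log-transformed relationship looks more like a curve, and the correlation coefficient is somewhat lower (0.8753 vs 0.9217), although the drop is not large. Linearity is slightly worse met.
Homoscedasticity: The error variance in the log-transformed model looks more constant than in the previous chart, so this assumption is better met.
No Perfect Multicollinearity: This assumption is unchanged. Both models have a single predictor (RVUs), so there is no other regressor it could be perfectly collinear with.
Zero Conditional Mean: The curved pattern suggests the conditional mean of the errors is still not zero across the range of RVUs after the log transformation. The residuals do have fewer extreme outliers than before: they now range from about -1.59 to 1.21 against a residual standard error of 0.51, whereas the largest residual in the linear model (about 6.4e8) was more than nine times its residual standard error (about 6.7e7).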
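A note on checking the zero conditional mean assumption: the overall mean of OLS residuals is zero by construction whenever the model has an intercept, so it cannot reveal a problem. What matters is whether the residual mean stays near zero within slices of RVUs. The sketch below illustrates this on simulated data (an assumption, not the hospital file) in which the true relationship is curved but a straight line is fitted.

```r
set.seed(2)

# Simulated data with a curved (log-log) truth; values are assumptions
n <- 384
rvus   <- runif(n, min = 1e4, max = 2e6)
ln_exp <- 5 + 0.95 * log(rvus) + rnorm(n, sd = 0.3)

# Fit a straight line in RVUs, as in the ln(Expenditures) ~ RVUs model
fit <- lm(ln_exp ~ rvus)
res <- resid(fit)

# Overall residual mean: zero up to floating-point error
overall_mean <- mean(res)

# Residual means within five equal-count bins of RVUs
breaks <- quantile(rvus, probs = seq(0, 1, by = 0.2))
bins <- cut(rvus, breaks = breaks, include.lowest = TRUE)
bin_means <- unname(tapply(res, bins, mean))

cat("Overall mean of residuals:", signif(overall_mean, 3), "\n")
cat("Mean residual by RVU bin:", round(bin_means, 3), "\n")
```

Bin means that swing systematically from negative to positive and back point to a misspecified functional form, even though the overall mean looks fine.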
3.2 Are you happy with this functional form capturing the relationship between Y and X or would like to keep some different functional form (EG - ln(Expenditures)~ln(RVUs) or ln(Expenditures)~RVUs + RVUs^2 ) ? Why (4 lines maximum) ? In other words, would you put all your money to create hospital expansion plans (hiring more doctors and nurses, opening more rooms,…) based on the specific relationship you find between expenditure and RVUs (or you think it could blow up in your face)? You can try some other transformations for dealing with positively skewed data, like root to the power n or even reciprocal.
I’m happy with this functional form overall, since Expenditures and RVUs have a strong linear relationship. However, once RVUs are large enough, this functional form produces much larger errors. In other words, the higher the workload and expenditures, the less reliable the prediction. Before putting more money into hospital expansion plans, I would recommend a different functional form, such as ln(Expenditures)~RVUs + RVUs^2, to capture the curvature and reduce the influence of outliers.
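If the quadratic form is tried in R, note that inside a model formula `^` is a formula operator, so `RVUs + RVUs^2` collapses to `RVUs` alone; the squared term has to be wrapped in `I()`. The sketch below shows this and compares three candidate forms by AIC, which is comparable here because all three share the same response. The data are simulated from a log-log relationship by construction (an assumption), so the ranking says nothing about the real hospital data. It only shows how the comparison could be set up.

```r
set.seed(3)

# Simulated stand-in data; the log-log truth is an assumption
n <- 384
sim <- data.frame(RVUs = runif(n, min = 1e4, max = 2e6))
sim$lnexp <- 5 + 0.95 * log(sim$RVUs) + rnorm(n, sd = 0.3)

# Candidate functional forms, all with ln(Expenditures) as response
fit_lin    <- lm(lnexp ~ RVUs, data = sim)
fit_quad   <- lm(lnexp ~ RVUs + I(RVUs^2), data = sim)  # I() protects ^
fit_loglog <- lm(lnexp ~ log(RVUs), data = sim)

# Without I(), the formula silently drops the squared term
fit_no_I  <- lm(lnexp ~ RVUs + RVUs^2, data = sim)
n_coef_no_I <- length(coef(fit_no_I))  # intercept and RVUs only

# Lower AIC indicates a better trade-off of fit and complexity
aic_table <- AIC(fit_lin, fit_quad, fit_loglog)
print(aic_table)
cat("Coefficients without I():", n_coef_no_I, "\n")
```

On the real data, the same three calls with data = mydata (and lnexp added as a column) would give a like-for-like comparison before committing to any one form.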