The attached .csv file has data pertaining to hospital expenditures (dependent variable). The column RVUs represents standard outpatient workload (an RVU, or relative value unit, is a measure of the time and intensity {skill, effort, judgement} required of a professional service; RVUs are associated with the Current Procedural Terminology (CPT) codes used in medical billing).

mydata <- read.csv("C:/Users/dingz/Downloads/week 6 data.csv")
head(mydata)
##   Expenditures Enrolled       RVUs    FTEs Quality.Score
## 1    114948144    25294  402703.73  954.91          0.67
## 2    116423140    42186  638251.99  949.25          0.58
## 3    119977702    23772  447029.54  952.51          0.52
## 4     19056531     2085   43337.26  199.98          0.93
## 5    246166031    67258 1579789.36 2162.15          0.96
## 6    152125186    23752  673036.55 1359.07          0.56

1. Using R, conduct correlation analysis (between the two variables) and interpret.

cor(mydata)
##               Expenditures  Enrolled      RVUs      FTEs Quality.Score
## Expenditures     1.0000000 0.7707756 0.9217239 0.9796506     0.2749501
## Enrolled         0.7707756 1.0000000 0.9152024 0.8148491     0.2526991
## RVUs             0.9217239 0.9152024 1.0000000 0.9504093     0.3075742
## FTEs             0.9796506 0.8148491 0.9504093 1.0000000     0.2769058
## Quality.Score    0.2749501 0.2526991 0.3075742 0.2769058     1.0000000

From the correlation matrix, Expenditures and RVUs, the two variables of interest, are strongly positively correlated (r = 0.9217). Most of the other variable pairs are also strongly correlated; the exception is Quality.Score, which is only weakly correlated with every other variable. The strongest correlation is between Expenditures and FTEs (0.9797), and the weakest is between Quality.Score and Enrolled (0.2527).
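A correlation coefficient on its own does not say how precisely it is estimated. The sketch below shows how `cor.test()` adds a confidence interval and a p-value to the point estimate. It runs on simulated stand-in data (the variable names, the slope of 235 and the noise level are assumptions chosen for illustration, not the hospital data), so the numbers it prints are not results for this assignment.

```r
# Simulated stand-in data: an assumption, NOT the hospital data set.
set.seed(42)
n <- 384                                   # same number of rows as the model output reports
rvus <- runif(n, 2e4, 2e6)                 # hypothetical workload values
expenditures <- 235 * rvus + rnorm(n, sd = 0.15 * 235 * rvus)  # noise grows with RVUs

# Pearson correlation test (the default method of cor.test)
ct <- cor.test(expenditures, rvus)

ct$estimate   # sample correlation r
ct$conf.int   # 95% confidence interval for the true correlation
ct$p.value    # p-value for H0: true correlation equals 0
```

With the real data, the same call would be `cor.test(mydata$Expenditures, mydata$RVUs)`.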

2. Then fit a linear model with Expenditures as the dependent variable (Y) and RVUs as the independent (X) variable. Interpret the results (the regression coefficients in particular) and check whether the Gauss-Markov assumptions / linear regression assumptions hold or not (by conducting residual plot analysis and explaining your results in your own words).
correlation1 = cor(mydata$Expenditures, mydata$RVUs)
cat("Expenditures vs RVUs =", round(correlation1, 4))
## Expenditures vs RVUs = 0.9217
plot(mydata$Expenditures~mydata$RVUs,
     xlab="RVUs",
     ylab="Expenditures",
     main="Expenditures vs RVUs")
mymodel1=lm(Expenditures~RVUs,data=mydata)
abline(mymodel1, col="blue", lwd=4)

summary(mymodel1)
## 
## Call:
## lm(formula = Expenditures ~ RVUs, data = mydata)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -185723026  -14097620    2813431   11919781  642218316 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.785e+06  4.413e+06  -0.858    0.392    
## RVUs         2.351e+02  5.061e+00  46.449   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared:  0.8496, Adjusted R-squared:  0.8492 
## F-statistic:  2157 on 1 and 382 DF,  p-value: < 2.2e-16
res1=resid(mymodel1)
# Plot residuals against RVUs so the x-axis matches its label
plot(mydata$RVUs, res1,
     main="Expenditures vs. RVUs Residual Plot",
     xlab="RVUs", ylab="Residuals")
abline(a=0,b=0, col="purple", lwd=4)

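The fitted model is Expenditures = -3,785,000 + 235.1 * RVUs. The slope is highly significant (p < 2e-16): each additional RVU is associated with about 235 more units of expenditure on average. The intercept is not significantly different from zero (p = 0.392). RVUs alone explain about 85% of the variation in Expenditures (R-squared = 0.8496), consistent with the strong correlation of 0.9217. However, the residuals grow as RVUs increase, so the scatter fans out around the fitted line: the model predicts well for small RVUs but much less precisely for large RVUs. This pattern indicates heteroscedasticity, so the constant-variance assumption does not hold for this model.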
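A fan-shaped residual plot can be backed up with a numeric check. The sketch below computes a studentized Breusch-Pagan (Koenker) statistic by hand in base R: regress the squared residuals on the predictor and compare n * R-squared with a chi-square distribution with 1 degree of freedom. The data are simulated under the assumption that the noise grows with RVUs; the data frame `sim` and all numbers in it are hypothetical and are not the hospital data.

```r
# Simulated stand-in data: an assumption, NOT the hospital data set.
set.seed(1)
n <- 384
rvus <- runif(n, 2e4, 2e6)
expenditures <- 235 * rvus + rnorm(n, sd = 0.15 * 235 * rvus)  # noise grows with RVUs
sim <- data.frame(Expenditures = expenditures, RVUs = rvus)

fit <- lm(Expenditures ~ RVUs, data = sim)

# Residuals against the predictor: a funnel shape suggests non-constant variance
plot(sim$RVUs, resid(fit), xlab = "RVUs", ylab = "Residuals")
abline(h = 0, lty = 2)

# Studentized Breusch-Pagan statistic computed by hand:
# auxiliary regression of squared residuals on the predictor
aux <- lm(I(resid(fit)^2) ~ RVUs, data = sim)
lm_stat <- n * summary(aux)$r.squared                     # LM statistic
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)    # small p suggests heteroscedasticity
p_value
```

A small p-value here is evidence against constant variance. On the real data the same steps would apply to `mymodel1` and `mydata`.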

3. Then fit a linear model of ln(Expenditures)~RVUs. Mathematically speaking, the logarithm function tends to squeeze together the larger values in your data set and stretch out the smaller values. (If you are wondering why to do a log transformation, see the first two charts here, which show how log reduces skewness to help meet the normality of X assumption, with the caveat that X is somewhat normally distributed to begin with in order for the transformation to reduce / remove skewness.) This transformation is routine in Economics or Finance forecasting to stabilize the variance of a time series (GDP, stock prices,…). More readings on the why / the when.
lnexp<-log(mydata$Expenditures)
plot(lnexp~mydata$RVUs,
     xlab="RVUs",
     ylab="ln(Expenditures)",
     main="Natural log of Expenditures vs RVUs")
mymodel2=lm(lnexp~RVUs,data=mydata)
abline(mymodel2, col="skyblue", lwd=4)

summary(mymodel2)
## 
## Call:
## lm(formula = lnexp ~ RVUs, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59439 -0.29504  0.06135  0.35333  1.20871 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.730e+01  3.325e-02  520.11   <2e-16 ***
## RVUs        1.349e-06  3.814e-08   35.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared:  0.7661, Adjusted R-squared:  0.7655 
## F-statistic:  1251 on 1 and 382 DF,  p-value: < 2.2e-16
res2=resid(mymodel2)
# Plot residuals against RVUs so the x-axis matches its label
plot(mydata$RVUs, res2,
     main="Natural log of Expenditures vs. RVUs Residual Plot",
     xlab="RVUs", ylab="Residuals")
abline(a=0,b=0, col="darkgreen", lwd=4)

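Based on the log-linear model, most observations still cluster at small RVUs, but the relationship now looks curved rather than linear, and R-squared falls from 0.8496 to 0.7661. The slope of 1.349e-06 means that each additional 10,000 RVUs is associated with roughly a 1.4% increase in Expenditures. The residuals look more evenly spread across the plot than before.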
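In a model of ln(Y) on X, the slope is read as a proportional change in Y rather than an absolute one. The short calculation below converts the RVUs slope reported in the summary output above into a percentage change in Expenditures. The increment of 10,000 RVUs is a hypothetical value chosen only to make the number readable.

```r
b_rvus <- 1.349e-06   # RVUs slope from the ln(Expenditures) ~ RVUs summary above
delta  <- 10000       # hypothetical workload increase, in RVUs

# Exact percentage change in Expenditures implied by the log-level model
pct_change <- 100 * (exp(b_rvus * delta) - 1)
pct_change            # about 1.36 (percent)

# Common approximation: 100 * slope * delta, close when the change is small
approx_pct <- 100 * b_rvus * delta
approx_pct            # about 1.35 (percent)
```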

3.1 How did this log transformation affect the Gauss Markov Assumptions (sure, the residual analysis diagnostic charts will change but what is your takeaway - which assumptions are better met or are some assumptions not met now ) ?

correlation2 = cor(lnexp, mydata$RVUs)
cat("Natural log of Expenditures vs RVUs =", round(correlation2, 4))
## Natural log of Expenditures vs RVUs = 0.8753

Linearity: Compared with the previous plot, the log-transformed relationship looks more like a curve, and the correlation coefficient is somewhat lower (0.8753 vs 0.9217), so linearity is met slightly less well than before.

Homoscedasticity: The residual variance looks more constant after the log transformation than in the previous plot, so this assumption is better met.

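No Perfect Multicollinearity: This assumption is unchanged. Both models have a single predictor, RVUs, which varies across observations, so there is no other regressor for it to be perfectly collinear with, and transforming Y does not affect this.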

Zero Conditional Mean: The curved pattern suggests that the expected error is still not zero at every level of RVUs after the log transformation, but the residuals have fewer outliers than before.
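One rough way to put a number on the homoscedasticity comparison is to compare the residual spread in the upper half of RVUs with the spread in the lower half, for both the raw and the log model. The sketch below does this on simulated data. The helper `spread_ratio()`, the data frame `sim` and the multiplicative noise are assumptions made for illustration, so the ratios it prints say nothing about the hospital data.

```r
# Simulated stand-in data: an assumption, NOT the hospital data set.
set.seed(7)
n <- 384
rvus <- runif(n, 2e4, 2e6)
expenditures <- 235 * rvus * exp(rnorm(n, sd = 0.3))   # multiplicative noise (assumed)
sim <- data.frame(Expenditures = expenditures, RVUs = rvus)

fit_raw <- lm(Expenditures ~ RVUs, data = sim)
fit_log <- lm(log(Expenditures) ~ RVUs, data = sim)

# Hypothetical helper: residual SD above the median of x divided by residual SD below it.
# A ratio far from 1 points to non-constant spread (or to curvature left in the residuals).
spread_ratio <- function(fit, x) {
  upper <- x > median(x)
  sd(resid(fit)[upper]) / sd(resid(fit)[!upper])
}

ratio_raw <- spread_ratio(fit_raw, sim$RVUs)
ratio_log <- spread_ratio(fit_log, sim$RVUs)
c(raw = ratio_raw, log = ratio_log)
```

On the real data the same helper could be called as `spread_ratio(mymodel1, mydata$RVUs)` and `spread_ratio(mymodel2, mydata$RVUs)`.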

3.2 Are you happy with this functional form capturing the relationship between Y and X or would like to keep some different functional form (EG - ln(Expenditures)~ln(RVUs) or ln(Expenditures)~RVUs + RVUs^2 ) ? Why (4 lines maximum) ? In other words, would you put all your money to create hospital expansion plans (hiring more doctors and nurses, opening more rooms,…) based on the specific relationship you find between expenditure and RVUs (or you think it could blow up in your face)? You can try some other transformations for dealing with positively skewed data, like root to the power n or even reciprocal.

I’m reasonably happy with this functional form, since Expenditures and RVUs have a strong relationship. However, once RVUs get large enough, this functional form produces much larger errors. In other words, the larger the workload and the expenditure, the larger the error. Before putting money into hospital expansion plans, I would recommend a different functional form, such as ln(Expenditures)~RVUs + RVUs^2, to capture the curvature and reduce the large residuals.
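The candidate functional forms named in the question can be fitted side by side and compared. The sketch below does this on simulated data; the data frame `sim` and its data-generating process are assumptions, and because the simulation is multiplicative it favors the log-log form by construction, which says nothing about which form fits the hospital data best. Note that inside an R formula the squared term must be written `I(RVUs^2)`, because a bare `RVUs^2` is interpreted as formula syntax and does not add a quadratic term.

```r
# Simulated stand-in data: an assumption, NOT the hospital data set.
set.seed(7)
n <- 384
rvus <- runif(n, 2e4, 2e6)
expenditures <- 235 * rvus * exp(rnorm(n, sd = 0.3))   # multiplicative noise (assumed)
sim <- data.frame(Expenditures = expenditures, RVUs = rvus)

# Three candidate forms, all with the same response so the fit measures are comparable
fits <- list(
  log_level = lm(log(Expenditures) ~ RVUs, data = sim),
  log_log   = lm(log(Expenditures) ~ log(RVUs), data = sim),
  log_quad  = lm(log(Expenditures) ~ RVUs + I(RVUs^2), data = sim)
)

adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)  # higher is better
aic    <- sapply(fits, AIC)                                   # lower is better
round(adj_r2, 4)
round(aic, 1)
```

Adjusted R-squared and AIC are only comparable here because every model has ln(Expenditures) as its response; they should not be compared directly with the untransformed Expenditures~RVUs model.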