setwd("/Users/jiwonban/ADEC7301/Week 6")
df <- read.csv("week 6 data-1.csv")

The attached .csv file has data pertaining to hospital Expenditures (dependent variable). The column RVUs is a representation of standard outpatient workload (an RVU is a measure of the time and intensity {skill, effort, judgement} required of a professional service and is associated with Current Procedural Terminology (CPT) codes used in medical billing).

1. Using R, conduct correlation analysis (between the two variables) and interpret

cor(df$RVUs, df$Expenditures)
## [1] 0.9217239

The correlation coefficient between RVUs and Expenditures is 0.9217, which indicates a strong positive linear relationship between standard outpatient workload and hospital expenditures. In other words, hospitals with higher RVU values (i.e., more standard outpatient workload) tend to have higher expenditures, and vice versa. Correlation is symmetric, so it does not by itself say which variable drives the other.
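As a possible follow-up to the point estimate, `cor.test()` from base R also reports a confidence interval and a p-value for the correlation. The sketch below runs on simulated stand-in data (the vectors `rvus` and `expenditures` are hypothetical, not the hospital file); on the actual data the call would presumably be `cor.test(df$RVUs, df$Expenditures)`.

```r
# Simulated stand-in data (NOT the hospital data): workload and cost with noise.
set.seed(1)
rvus <- runif(200, min = 1e4, max = 1e6)
expenditures <- 200 * rvus + rnorm(200, sd = 3e7)

# Pearson correlation with a t-test of H0: rho = 0 and a 95% confidence interval.
ct <- cor.test(rvus, expenditures)

ct$estimate   # sample correlation, the same value cor(rvus, expenditures) returns
ct$conf.int   # 95% confidence interval for the correlation
ct$p.value    # p-value for the test of zero correlation
```

A narrow interval that stays well above zero would support reading the relationship as strong and positive rather than as a sampling accident.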

2. Then fit a linear model with Expenditure as the dependent variable (Y) and RVUs as the independent (X) variable. Interpret the results (interpreting regression coefficients in particular).

lm(df$Expenditure ~ df$RVUs)
## 
## Call:
## lm(formula = df$Expenditure ~ df$RVUs)
## 
## Coefficients:
## (Intercept)      df$RVUs  
##  -3785072.2        235.1
summary(lm(df$Expenditure ~ df$RVUs))
## 
## Call:
## lm(formula = df$Expenditure ~ df$RVUs)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -185723026  -14097620    2813431   11919781  642218316 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.785e+06  4.413e+06  -0.858    0.392    
## df$RVUs      2.351e+02  5.061e+00  46.449   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared:  0.8496, Adjusted R-squared:  0.8492 
## F-statistic:  2157 on 1 and 382 DF,  p-value: < 2.2e-16

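The linear model gives a slope of 235.10 and an intercept of -3,785,072.20. The slope means that, for every one-unit increase in the standard outpatient workload measure (RVUs), the model estimates an increase of $235.10 in hospital expenditures. The intercept means that when RVUs is zero (i.e., with no standard outpatient workload), hospital expenditures are predicted to be -$3,785,072.20. A negative expenditure is not meaningful, and the intercept is not statistically significant (p = 0.392), so it should not be over-interpreted; the slope, by contrast, is highly significant (p < 2e-16), and the model explains about 85% of the variance in expenditures (R-squared = 0.8496).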
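A minimal sketch of how the same kind of fit can be queried programmatically, using simulated stand-in data (the data frame `sim` and the 500,000-RVU prediction point are hypothetical). It also checks a fact that holds for any simple linear regression: R-squared equals the squared correlation, which is consistent with the reported values here (0.9217^2 is approximately 0.8496).

```r
# Simulated stand-in data (NOT the hospital data).
set.seed(2)
sim <- data.frame(RVUs = runif(200, min = 1e4, max = 1e6))
sim$Expenditures <- 200 * sim$RVUs + rnorm(200, sd = 3e7)

# Passing data = keeps coefficient names clean ("RVUs" rather than "sim$RVUs")
# and lets predict() accept new data.
fit <- lm(Expenditures ~ RVUs, data = sim)

coef(fit)      # intercept and slope
confint(fit)   # 95% confidence intervals for both coefficients

# In simple linear regression, R-squared equals the squared correlation.
r2 <- summary(fit)$r.squared
all.equal(r2, cor(sim$RVUs, sim$Expenditures)^2)

# Predicted expenditures at a hypothetical workload of 500,000 RVUs.
predict(fit, newdata = data.frame(RVUs = 5e5), interval = "confidence")
```

Using the `data =` argument instead of `df$...` inside the formula is a design choice, not a requirement; the fitted coefficients are the same either way.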

2A. Do the Gauss-Markov assumptions / linear regression assumptions hold or not (conduct residual plot analysis and explain results in own words)?

1) Linearity - the relationship should be relatively linear
2) Nearly normal residuals - the standardized residuals should be normally distributed
3) Constant variability - the variance of errors should be relatively constant
4) Independent observations - the observations (and their errors) should be independent of one another
plot(lm(df$Expenditure ~ df$RVUs),
     col = "purple")

The four residual plots show how the residuals are distributed and whether the assumptions for a simple linear regression are met. First, the Residuals vs Fitted plot shows whether the residuals have non-linear patterns; the roughly straight, horizontal red line suggests no strong non-linear trend, though the spread of the residuals around it raises possible heteroscedasticity concerns. Second, the Q-Q plot indicates whether the residuals are normally distributed; the residuals follow the dashed line relatively well (i.e., they are approximately normal), with some exceptions at the tails (e.g., case 256). Third, the Scale-Location (Spread-Location) plot shows how the magnitude of the residuals changes across fitted values; ideally the points are randomly scattered, but in this dataset there is a linear pattern, indicating heteroscedasticity. Lastly, the Residuals vs Leverage plot identifies influential cases (observations that strongly affect the fitted line, which are not necessarily the same as outliers); two cases (256 and 135) fall outside the gray dashed Cook's distance lines and require more examination.
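The visual checks above can be complemented with numeric ones. The sketch below is one possible set, run on simulated heteroscedastic data (the vectors `x` and `y` are hypothetical stand-ins). The constant-variance check is a hand-rolled version of Koenker's studentized Breusch-Pagan statistic; `lmtest::bptest()` is a packaged alternative if that package is available. The 4/n cutoff for Cook's distance is only a common rule of thumb.

```r
# Simulated data whose noise grows with x, i.e. heteroscedastic by construction.
set.seed(3)
x <- runif(400, min = 1, max = 10)
y <- 5 + 2 * x + rnorm(400, sd = x)
fit <- lm(y ~ x)

# Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normality).
sw <- shapiro.test(residuals(fit))

# Constant variance: regress the squared residuals on x; n * R^2 from that
# auxiliary regression is compared with a chi-square distribution (1 df).
e2 <- residuals(fit)^2
aux <- lm(e2 ~ x)
bp_stat <- length(x) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Influence: Cook's distance, flagged with the 4/n rule of thumb.
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / length(x))

c(shapiro_p = sw$p.value, bp_p = bp_p, n_flagged = length(flagged))
```

A small `bp_p` agrees with a trending Scale-Location plot, and the indices in `flagged` are the cases worth inspecting individually.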

Helpful resource: UVA Library (1)

3. Then fit a linear model of ln(Expenditures)~RVUs.  Mathematically speaking, the logarithm function tends to squeeze together the larger values in your data set and stretches out the smaller values.

(If you are wondering why to do a log transformation, see the first two charts here, which show how log reduces skewness to help meet the normality of X assumption, with the caveat being that X is somewhat normally distributed to begin with in order for the transformation to reduce / remove skewness). This transformation is routine in Economics or Finance forecasting to stabilize the variance of a time series (GDP, stock prices, ...). More readings on the why / the when.
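To make the "squeezing" effect concrete, here is a small sketch on simulated right-skewed data (the vector `spend` and the helper `skewness()` are hypothetical; base R has no built-in skewness function, so the third standardized moment is computed by hand).

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

# Simulated log-normal "spending" values (NOT the hospital data).
set.seed(4)
spend <- rlnorm(1000, meanlog = 17, sdlog = 1)

skew_raw <- skewness(spend)        # strongly positive: long right tail
skew_log <- skewness(log(spend))   # close to zero: roughly symmetric

c(raw = skew_raw, logged = skew_log)
```

The simulated values are log-normal by construction, so the log removes the skew almost entirely; real data will usually improve less cleanly.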

lm(log(df$Expenditure) ~ df$RVUs)
## 
## Call:
## lm(formula = log(df$Expenditure) ~ df$RVUs)
## 
## Coefficients:
## (Intercept)      df$RVUs  
##   1.730e+01    1.349e-06
summary(lm(log(df$Expenditure) ~ df$RVUs))
## 
## Call:
## lm(formula = log(df$Expenditure) ~ df$RVUs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59439 -0.29504  0.06135  0.35333  1.20871 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.730e+01  3.325e-02  520.11   <2e-16 ***
## df$RVUs     1.349e-06  3.814e-08   35.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared:  0.7661, Adjusted R-squared:  0.7655 
## F-statistic:  1251 on 1 and 382 DF,  p-value: < 2.2e-16
plot(lm(log(df$Expenditure) ~ df$RVUs),
     col = "navy")
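One way to read the log-linear coefficients reported above: because the response is ln(Expenditures), a change of d RVUs multiplies predicted expenditures by exp(slope * d). The sketch below applies this to the reported estimates; the 100,000-RVU step (`delta_rvus`) is an illustrative assumption, not a value from the assignment.

```r
slope <- 1.349e-06       # RVUs coefficient from the summary output
intercept <- 1.730e+01   # intercept from the summary output
delta_rvus <- 1e5        # hypothetical workload change, chosen for illustration

# Percentage change in predicted expenditures for delta_rvus more RVUs.
pct_change <- (exp(slope * delta_rvus) - 1) * 100

# Predicted expenditures at zero RVUs, back-transformed to dollars.
baseline <- exp(intercept)

c(pct_change = pct_change, baseline = baseline)
```

Unlike the intercept of the untransformed model, the back-transformed baseline here is necessarily positive.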

3A. How did this log transformation affect the Gauss Markov Assumptions (sure, the residual analysis diagnostic charts will change but what is your takeaway - which assumptions are better met or are some assumptions not met now)?

The residuals are now on the log scale, so they are far smaller in magnitude than before (and not directly comparable with those of the first model), and the shapes of the plots have changed. Looking at the Q-Q plot, the residuals now follow a normal distribution much more closely, and while the Scale-Location plot looks more randomly dispersed, there are still issues with heteroscedasticity. The Residuals vs Fitted plot has changed substantially under the log transformation, and the assumption of linearity is now violated (shown by the curve on the left-hand side). Lastly, the Residuals vs Leverage plot shows that all observations now fall within the Cook's distance bounds.

3B. Are you happy with this functional form capturing the relationship between Y and X, or would you like to use a different functional form (e.g., ln(Expenditures)~ln(RVUs) or ln(Expenditures)~RVUs + RVUs^2)? Why (4 lines maximum)? In other words, would you put all your money into hospital expansion plans (hiring more doctors and nurses, opening more rooms, ...) based on the specific relationship you find between expenditure and RVUs (or do you think it could blow up in your face)? You can try some other transformations for dealing with positively skewed data, like root to the power n or even reciprocal.

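# ln(Expenditures)~RVU + RVU^2
# Note: inside a formula, ^ is a formula operator, so (df$RVUs)^2 reduces to
# df$RVUs and this call fits the same model as ln(Expenditures)~RVU;
# a squared term would need I(df$RVUs^2).
plot(lm(log(df$Expenditure) ~ df$RVUs + (df$RVUs)^2),
     col = "tan")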

# ln(Expenditures)~ln(RVU)
plot(lm(log(df$Expenditure) ~ log(df$RVUs)),
     col = "green")
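A note on the quadratic specification: on the right-hand side of an R formula, `^` is a formula operator (term crossing), not arithmetic, so `x + (x)^2` collapses to `x` alone. Wrapping the square in `I()` is the usual way to request an actual squared term. The sketch below shows the difference on simulated data (`x` and `y` are hypothetical stand-ins).

```r
# Simulated data with a genuine quadratic component (NOT the hospital data).
set.seed(5)
x <- runif(100, min = 1, max = 10)
y <- 1 + 0.5 * x + 0.2 * x^2 + rnorm(100)

fit_plain <- lm(y ~ x + (x)^2)   # ^ read as a formula operator: same as y ~ x
fit_quad  <- lm(y ~ x + I(x^2))  # I() protects the arithmetic: adds a squared term

names(coef(fit_plain))   # intercept and x only
names(coef(fit_quad))    # intercept, x, and I(x^2)
```

This would explain diagnostic plots for a formula written with a bare `^2` looking the same as those for the model without the squared term.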

I approached this question in an exploratory manner, trying different functional forms to see how the residual plots would change. The first form (tan), ln(Expenditures)~RVU + RVU^2, resulted in a very similar pattern of plots to ln(Expenditures)~RVU. The second form (green), ln(Expenditures)~ln(RVU), proved much more suitable: the residuals in the Residuals vs Fitted and Scale-Location plots were much more randomly distributed, the Q-Q plot showed normality, and the Residuals vs Leverage plot indicated no influential cases. In other words, the second form is a better representation of the data. Because both variables had to be log-transformed before the regression assumptions were reasonably met, the data suggest that the relationship between RVUs and Expenditures is not linear in the original units, and other factors may also be driving expenditures. Therefore, it may not be wise to simply expand hospital investments based on this relationship alone; rather, multiple regression analyses that include other relevant variables should be conducted to better understand the drivers of hospital expenditures.
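The visual comparison of functional forms can be backed with a numeric one. Models that share the same response (here the log of the outcome) can be compared by AIC or R-squared. The sketch below does this on simulated power-law data (`x` and `y` are hypothetical stand-ins, built so that the log-log form is the true one); it illustrates the comparison and is not a result for the hospital data.

```r
# Simulated data: y follows a power law in x with multiplicative noise.
set.seed(6)
x <- exp(runif(300, min = log(1e3), max = log(1e6)))   # wide, right-skewed range
y <- exp(8 + 0.8 * log(x) + rnorm(300, sd = 0.4))

fit_loglin <- lm(log(y) ~ x)        # log-linear form
fit_loglog <- lm(log(y) ~ log(x))   # log-log form

# Both models have the same response, log(y), so these are comparable.
AIC(fit_loglin, fit_loglog)
c(loglin = summary(fit_loglin)$r.squared,
  loglog = summary(fit_loglog)$r.squared)

# In the log-log form the slope is an elasticity: the approximate percentage
# change in y for a 1% change in x.
coef(fit_loglog)["log(x)"]
```

A lower AIC for one form is evidence in its favor, but it does not replace the residual diagnostics or the case for adding further explanatory variables.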