# Clearing workspace  
rm(list = ls()) # Clear environment
gc()            # Clear unused memory
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 523028 28.0    1164120 62.2   660491 35.3
## Vcells 950904  7.3    8388608 64.0  1769515 13.6
cat("\f")       # Clear console

Downloading the Data. The attached .csv file contains data on hospital expenditures (the dependent variable). The RVUs column represents standard outpatient workload. A relative value unit (RVU) measures the time and intensity {skill, effort, judgement} required of a professional service, and RVUs are associated with the Current Procedural Terminology (CPT) codes used in medical billing.

# Downloading the Data
mydata <- read.csv('./week_6_data-1.csv') # Downloading Week 6 Data
# Heading the Data
head(mydata)
##   Expenditures Enrolled       RVUs    FTEs Quality.Score
## 1    114948144    25294  402703.73  954.91          0.67
## 2    116423140    42186  638251.99  949.25          0.58
## 3    119977702    23772  447029.54  952.51          0.52
## 4     19056531     2085   43337.26  199.98          0.93
## 5    246166031    67258 1579789.36 2162.15          0.96
## 6    152125186    23752  673036.55 1359.07          0.56

Confirmed that the first six rows of the data match the Excel file. Download complete.

1 Using R, conduct correlation analysis (between the two variables) and interpret.

As stated, we are looking at the following variables:

Dependent Variable: Hospital Expenditures

Independent Variable: RVUs, a representation of standard outpatient workload (an RVU is a measure of the time and intensity {skill, effort, judgement} required of a professional service; RVUs are associated with Current Procedural Terminology (CPT) codes used in medical billing)

# Describing Our Data
library(psych)
describe(mydata)
##               vars   n         mean           sd      median     trimmed
## Expenditures     1 384 124765816.51 173435868.06 52796103.84 84600936.46
## Enrolled         2 384     24713.82     22756.42    16466.00    20647.21
## RVUs             3 384    546857.77    680047.21   246701.71   398680.18
## FTEs             4 384      1060.86      1336.29      483.07      750.11
## Quality.Score    5 384         0.71         0.11        0.73        0.72
##                       mad        min          max        range  skew kurtosis
## Expenditures  47444814.57 7839562.52 1.301994e+09 1.294154e+09  3.04    11.27
## Enrolled         13092.10    1218.00 1.198500e+05 1.186320e+05  1.94     4.14
## RVUs            237760.93   23218.01 3.574006e+06 3.550788e+06  2.10     4.27
## FTEs               401.38     116.29 7.518630e+03 7.402340e+03  2.55     6.94
## Quality.Score        0.10       0.31 9.600000e-01 6.500000e-01 -0.55     0.40
##                       se
## Expenditures  8850612.08
## Enrolled         1161.28
## RVUs            34703.51
## FTEs               68.19
## Quality.Score       0.01

I ran describe() to get a better sense of the data. This step is not strictly needed for the correlation analysis.

# Correlation Analysis
ind.rvu <- mydata$RVUs           # Setting our Independent Variable
dep.exp <- mydata$Expenditures   # Setting our Dependent Variable

r1 <- cor(x = ind.rvu,           # Calculating Correlation (R-Value)
          y = dep.exp)
r1
## [1] 0.9217239

As the correlation analysis above shows, our R-value is equal to 0.9217239. This indicates a very strong positive correlation between hospital expenditures and RVUs. The R-value can take any value from -1 to 1: values below 0 indicate a negative relationship, while values above 0 indicate a positive relationship. The closer the value is to 1 or -1, the stronger the relationship. In this case, the more RVUs there are, the higher the hospital expenditures.
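As a rough illustration of what cor() computes, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below uses made-up numbers (the vectors x and y are hypothetical stand-ins, not the hospital data) and compares the hand calculation with cor(). It also shows that reversing the direction of one variable flips the sign of r without changing its strength.

```r
# Hypothetical toy data, NOT the hospital data set
x <- c(10, 20, 30, 40, 50)      # stand-in for RVUs
y <- c(12, 25, 29, 48, 55)      # stand-in for expenditures

# Pearson's r by hand: covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in calculation (method = "pearson" is the default)
r_builtin <- cor(x, y)

# Negating y reverses the direction of the relationship:
# same strength, opposite sign
r_negative <- cor(x, -y)

print(c(manual = r_manual, builtin = r_builtin, negative = r_negative))
```

The same comparison could be run on the real columns by swapping in ind.rvu and dep.exp for x and y.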

# Plotting this relationship
plot(x = ind.rvu/1000,
     y = dep.exp/1000000,
     xlab = "RVUs (thousands)",
     ylab = "Hospital Expenditures (millions)",
     main = "RVUs vs Hospital Expenditures"
     )

The plot confirms our calculations with an “eye test”: the points show a very strong positive relationship. Because the numbers in our data are so large, I plotted RVUs in thousands and Hospital Expenditures in millions.

3.1 How did this log transformation affect the Gauss-Markov assumptions (sure, the residual analysis diagnostic charts will change, but what is your takeaway: which assumptions are better met, or are some assumptions not met now)?

Linearity: The relationship now shows more of a curve, so it appears less linear than before. The correlation calculated below is lower than our original 0.92, but only slightly.

cor(x = ind.rvu, 
    y = log(dep.exp)
    )
## [1] 0.8752895

Homoscedasticity: This changed. The spread of our errors is now much more constant than it was before the log transformation, so this assumption is now better met.

No Perfect Multicollinearity: This did not change. The model has a single independent variable (RVUs), so there is no second predictor for it to be perfectly correlated with.

Zero Conditional Mean (or Expected Value of Residuals): The conditional mean of the residuals is still not exactly zero, but there are now fewer outliers in our residuals.
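To make the homoscedasticity point above more concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the data are generated so that log(y) is linear in x with constant noise, and the helper spread_trend() is a hypothetical name for a rough check (the correlation between the size of the residuals and the fitted values), not a formal test. It says nothing about the hospital data itself.

```r
set.seed(42)                           # reproducible simulated data

# Simulated (hypothetical) data: noise is constant on the log scale,
# so the spread of y grows with x on the original scale
n <- 1000
x <- runif(n, min = 20, max = 3500)
y <- exp(5 + 0.0004 * x + rnorm(n, sd = 0.4))

fit_level <- lm(y ~ x)                 # original scale
fit_log   <- lm(log(y) ~ x)            # log-transformed dependent variable

# Rough check: do the residuals get larger as the fitted values grow?
# A value near 0 suggests roughly constant variance.
spread_trend <- function(fit) cor(abs(resid(fit)), fitted(fit))

trend_level <- spread_trend(fit_level)
trend_log   <- spread_trend(fit_log)
print(c(level = trend_level, log = trend_log))
```

For a visual check, plot(fit_level, which = 1) and plot(fit_log, which = 1) draw the residuals-versus-fitted chart for each model.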

3.2 Are you happy with this functional form capturing relationship between x and y or would you like to keep some different functional form? Why?

I am happy with this functional form. There is a strong linear relationship between expenditures and RVUs: the correlation stayed strong before and after the transformation (0.92 and 0.88), and the transformation started to reduce our outliers. Higher expenditures go along with a higher hospital workload and a greater ability to take on new patients. From a care perspective, this allows the hospital to take care of more people. From a business perspective, the more patients the hospital can care for, the more money it earns, although we would need to look at more variables to assess profitability. One thing to note is that before we transformed the data, our errors grew larger as RVUs increased.
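If an alternative functional form were to be considered, one possible way to compare candidates is sketched below. It is an illustration on simulated data only: the data are generated, by assumption, so that log(y) is linear in log(x), and the two models share the same dependent variable, log(y), so their R-squared values can be compared directly. It was not part of the analysis above and makes no claim about which form fits the hospital data.

```r
set.seed(1)                            # reproducible simulated data

# Simulated (hypothetical) data with a multiplicative relationship
n <- 500
x <- runif(n, min = 20, max = 3500)
y <- 50 * x * exp(rnorm(n, sd = 0.4))

fit_loglin <- lm(log(y) ~ x)           # log-linear: only y is logged
fit_loglog <- lm(log(y) ~ log(x))      # log-log: both variables are logged

# Both models explain log(y), so R-squared is comparable between them
r2 <- c(log_linear = summary(fit_loglin)$r.squared,
        log_log    = summary(fit_loglog)$r.squared)
print(round(r2, 3))
```

R-squared should not be compared this way between a model of y and a model of log(y), because the two dependent variables are on different scales.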