# Clearing workspace
rm(list = ls()) # Clear environment
gc() # Clear unused memory
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 523028 28.0    1164120 62.2   660491 35.3
## Vcells 950904  7.3    8388608 64.0  1769515 13.6
cat("\f") # Clear the console
# Loading the Data
mydata <- read.csv('./week_6_data-1.csv') # Reading in the Week 6 data
# Previewing the Data
head(mydata)
##   Expenditures Enrolled       RVUs    FTEs Quality.Score
## 1    114948144    25294  402703.73  954.91          0.67
## 2    116423140    42186  638251.99  949.25          0.58
## 3    119977702    23772  447029.54  952.51          0.52
## 4     19056531     2085   43337.26  199.98          0.93
## 5    246166031    67258 1579789.36 2162.15          0.96
## 6    152125186    23752  673036.55 1359.07          0.56
The first six rows match the Excel file, so the data loaded correctly.
As stated, we are looking at the following:
Dependent Variable: Hospital Expenditures
Independent Variable: RVUs, a representation of standard outpatient workload. (An RVU is a measure of the time and intensity {skill, effort, judgement} required of a professional service; RVUs are associated with the Current Procedural Terminology (CPT) codes used in medical billing.)
# Describing Our Data
library(psych)
describe(mydata)
##               vars   n         mean           sd      median     trimmed
## Expenditures     1 384 124765816.51 173435868.06 52796103.84 84600936.46
## Enrolled         2 384     24713.82     22756.42    16466.00    20647.21
## RVUs             3 384    546857.77    680047.21   246701.71   398680.18
## FTEs             4 384      1060.86      1336.29      483.07      750.11
## Quality.Score    5 384         0.71         0.11        0.73        0.72
##                       mad        min          max        range  skew kurtosis
## Expenditures  47444814.57 7839562.52 1.301994e+09 1.294154e+09  3.04    11.27
## Enrolled         13092.10    1218.00 1.198500e+05 1.186320e+05  1.94     4.14
## RVUs            237760.93   23218.01 3.574006e+06 3.550788e+06  2.10     4.27
## FTEs               401.38     116.29 7.518630e+03 7.402340e+03  2.55     6.94
## Quality.Score        0.10       0.31 9.600000e-01 6.500000e-01 -0.55     0.40
##                       se
## Expenditures  8850612.08
## Enrolled         1161.28
## RVUs            34703.51
## FTEs               68.19
## Quality.Score       0.01
I ran describe() only to get a better sense of the data; it is not needed for the correlation analysis itself.
# Correlation Analysis
?cor
ind.rvu <- mydata$RVUs # Setting our Independent Variable
dep.exp <- mydata$Expenditures # Setting our Dependent Variable
r1 <- cor(x = ind.rvu, # Calculating Correlation (R-Value)
          y = dep.exp)
r1
## [1] 0.9217239
The correlation analysis above gives an R-value of 0.9217239, which signifies a very strong positive correlation between Hospital Expenditures and RVUs. An R-value can range from -1 to 1. Anything below 0 is a negative relationship, while anything above 0 is a positive relationship. The closer the value is to -1 or 1, the stronger the relationship; values near 0 indicate little or no linear relationship. In this case, the more RVUs there are, the higher the Hospital Expenditures.
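To make the R-value less of a black box, here is a minimal sketch that computes Pearson's correlation by hand and compares it with cor(). The vectors x and y and the helper pearson_r are made up for illustration; they are not the hospital data.

```r
# Hypothetical toy vectors (not the hospital data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Pearson's r: co-movement of x and y around their means,
# scaled by how much each varies on its own
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(x, y)   # hand calculation
r_builtin <- cor(x, y)         # base R (Pearson by default)
all.equal(r_manual, r_builtin) # the two should agree

# Flipping the sign of y flips the sign of r but not its strength
r_neg <- pearson_r(x, -y)
```

The sign of the result gives the direction of the relationship and its absolute value gives the strength, which is how the 0.92 above should be read.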
# Plotting this relationship
plot(x = ind.rvu/1000,
     y = dep.exp/1000000,
     xlab = "RVUs (Thousands)",
     ylab = "Hospital Expenditures (Millions)",
     main = "RVUs vs Hospital Expenditures"
)
The scatterplot passes the "eye test" for our calculation: it shows a very strong positive relationship. Because the raw numbers are so large, I plotted RVUs in thousands and Hospital Expenditures in millions.
First, I will check whether the two variables appear to be normally distributed by plotting a histogram of each.
# RVUs Histogram
hist(x = ind.rvu/1000,
     xlab = "RVUs (Thousands)",
     ylab = "Frequency",
     main = "RVUs Histogram"
)
# Hospital Expenditures Histogram
hist(x = dep.exp/1000000,
     xlab = "Hospital Expenditures (Millions)",
     ylab = "Frequency",
     main = "Hospital Expenditures Histogram"
)
As the histograms show, neither RVUs nor Hospital Expenditures appears to be normally distributed. Both are right-skewed, which is consistent with the positive skew values of 2.10 and 3.04 that describe() reported earlier.
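As a rough illustration of what a positive skew value means, the sketch below uses one common moment-based definition of skewness on two made-up vectors. The helper moment_skew is hypothetical, and its small-sample formula is not necessarily the one psych::describe() uses, so the numbers are not meant to reproduce the table above.

```r
# Moment-based skewness: third central moment scaled by the
# second central moment raised to the power 3/2 (an assumed definition)
moment_skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

symmetric  <- c(-2, -1, 0, 1, 2)   # balanced around its mean
right_tail <- c(1, 1, 2, 2, 3, 20) # one large value pulls the tail right

skew_sym   <- moment_skew(symmetric)  # zero for a symmetric sample
skew_right <- moment_skew(right_tail) # positive for a long right tail
```

A few very large hospitals play the same role in our data as the value 20 does here.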
# Correlation for the whole Data set
cor(mydata)
##               Expenditures  Enrolled      RVUs      FTEs Quality.Score
## Expenditures     1.0000000 0.7707756 0.9217239 0.9796506     0.2749501
## Enrolled         0.7707756 1.0000000 0.9152024 0.8148491     0.2526991
## RVUs             0.9217239 0.9152024 1.0000000 0.9504093     0.3075742
## FTEs             0.9796506 0.8148491 0.9504093 1.0000000     0.2769058
## Quality.Score    0.2749501 0.2526991 0.3075742 0.2769058     1.0000000
# Plotting the Relationships between each of the variables
plot(mydata)
Above, I plotted a scatterplot matrix to visualize the relationship between every pair of variables in our dataset, not just our X and Y variables. We can do this because all of the variables in the data are quantitative. To read the matrix, the X variable of each panel is the variable named in its column and the Y variable is the variable named in its row. The graph we already plotted (RVUs vs Expenditures) is the middle panel of the first row. We can see that the dataset contains many other strong relationships, for example Expenditures and FTEs (0.98) and RVUs and FTEs (0.95) in the correlation matrix.
# Linear Model
?lm
lm2 <- lm(mydata$Expenditures~mydata$RVUs)
lm2
##
## Call:
## lm(formula = mydata$Expenditures ~ mydata$RVUs)
##
## Coefficients:
## (Intercept)  mydata$RVUs
##  -3785072.2        235.1
summary(lm2)
##
## Call:
## lm(formula = mydata$Expenditures ~ mydata$RVUs)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -185723026  -14097620    2813431   11919781  642218316
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.785e+06  4.413e+06  -0.858    0.392
## mydata$RVUs  2.351e+02  5.061e+00  46.449   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8492
## F-statistic: 2157 on 1 and 382 DF, p-value: < 2.2e-16
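# Plotting this relationship with the fitted line
plot(x = ind.rvu/1000,
     y = dep.exp/1000000,
     xlab = "RVUs (Thousands)",
     ylab = "Hospital Expenditures (Millions)",
     main = "RVUs vs Hospital Expenditures"
)
# lm2 was fit on raw units, so rescale its coefficients to match the plot axes
abline(a = coef(lm2)[1]/1000000,      # intercept: dollars -> millions
       b = coef(lm2)[2]*1000/1000000) # slope: per RVU -> millions per thousand RVUs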
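# Plotting Residuals against RVUs
lm2.res <- resid(lm2)
plot(x = ind.rvu/1000, # put RVUs on the x-axis rather than the row index
     y = lm2.res,
     main = "Expenditures vs. RVUs Residual Plot",
     xlab = "RVUs (Thousands)",
     ylab = "Residuals")
abline(h = 0) # reference line at zero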
Both the statistics and the graphs show a strong positive correlation between RVUs and Hospital Expenditures. Both the scatterplot and the residual plot also show that the linear fit is much tighter when there are fewer RVUs: the data points fall further from the fitted line as RVUs increase. The summary reports an adjusted R-squared of 0.8492. R-squared is the share of the variance in Hospital Expenditures that the model explains (the adjusted version applies a small penalty for the number of predictors), so RVUs alone explain about 85% of the variation in expenditures, which is a very good fit.
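As a quick consistency check, in a regression with a single predictor the R-squared equals the squared correlation. The sketch below redoes that arithmetic using only numbers reported above (r = 0.9217239, n = 384); the variable names are just for illustration.

```r
r <- 0.9217239 # correlation between RVUs and Expenditures (from cor() above)
n <- 384       # number of hospitals (from describe() above)
p <- 1         # number of predictors in lm2

r_squared     <- r^2
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(r_squared, 4)     # 0.8496, the Multiple R-squared in summary(lm2)
round(adj_r_squared, 4) # 0.8492, the Adjusted R-squared in summary(lm2)
```

With 384 observations and only one predictor, the adjustment barely changes the value.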
Gauss-Markov Assumptions
Linearity: The relationship between the dependent and independent variable is linear. In our case, as mentioned and seen in the scatterplot, Expenditures and RVUs appear to have a linear relationship.
Independence: The errors are independent across all observations; in other words, one error is not related to another. In our case, I could not tell from the plots whether the errors were related.
Homoscedasticity: The variance of the errors (estimated by the residuals) should be constant across all levels of the independent variable. In our example, the residuals are tightly clustered around zero when RVUs are low and spread out much more as RVUs increase, so this assumption does not appear to be met.
No Perfect Multicollinearity: The independent variables should not be perfectly correlated with one another. Our model has only one independent variable, so this holds; and although other variables in the dataset are strongly correlated with RVUs, none of them is perfectly correlated.
Zero Conditional Mean (or Expected Value of Residuals): The expected value of the errors is zero for every value of the independent variable. Individual residuals do not have to be zero; what matters is that they are centered on zero across the range of RVUs. Ours are centered on zero for most of the range, with some large outliers as RVUs increase, but not a lot of them.
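One simple way to put a number on the homoscedasticity and zero-mean checks is to compare the residual spread for low and high values of the predictor. The sketch below does this on made-up data whose noise grows with x by construction; it is an illustration of the idea, not a formal test and not our hospital data.

```r
# Hypothetical data: the noise term is proportional to x,
# so the error variance is deliberately NOT constant
x     <- 1:100
noise <- 0.5 * x * rep(c(1, -1), length.out = length(x))
y     <- 3 + 2 * x + noise

fit <- lm(y ~ x)
res <- resid(fit)

# Split the observations at the median of x and compare residual spread
low         <- x <= median(x)
spread_low  <- sd(res[low])
spread_high <- sd(res[!low])
c(low = spread_low, high = spread_high) # high is clearly larger here

# With an intercept in the model, OLS residuals average to zero overall
mean_res <- mean(res)
```

Applied to our model, the same comparison would use lm2.res and ind.rvu; a much larger spread in the high-RVU half would back up what the residual plot suggests.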
As noted in the question, the log function stretches out smaller values and squeezes together bigger values, which brings a right-skewed variable closer to a normal distribution. I will start by applying it to the histograms I plotted before, where the difference should be clearly visible.
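A tiny sketch of that stretching and squeezing, using made-up values that each grow by a factor of ten:

```r
values <- c(10, 100, 1000, 10000) # hypothetical values, each 10x the last

raw_gaps <- diff(values)      # 90, 900, 9000: gaps explode on the raw scale
log_gaps <- diff(log(values)) # every gap equals log(10) on the log scale
```

On the log scale, equal ratios become equal distances, which is why a long right tail gets pulled in.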
# RVUs Log Histogram
hist(x = log(ind.rvu/1000),
     xlab = "Log of RVUs (Thousands)",
     ylab = "Frequency",
     main = "Log RVUs Histogram"
)
# Hospital Expenditures Log Histogram
hist(x = log(dep.exp/1000000),
     xlab = "Log of Hospital Expenditures (Millions)",
     ylab = "Frequency",
     main = "Log Hospital Expenditures Histogram"
)
plot(log(mydata))
As in question 2, I plotted the relationships between all of our variables. The difference from the earlier charts is easy to see. The direction of the relationships did not change, but the skewness did: the points are no longer bunched in the lower-left corner of each panel and are spread more evenly across it.
# Summary of Transformed Data
lm.transformed <- lm(log(mydata$Expenditures) ~ mydata$RVUs)
summary(lm.transformed)
##
## Call:
## lm(formula = log(mydata$Expenditures) ~ mydata$RVUs)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.59439 -0.29504  0.06135  0.35333  1.20871
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.730e+01  3.325e-02  520.11   <2e-16 ***
## mydata$RVUs 1.349e-06  3.814e-08   35.38   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared: 0.7661, Adjusted R-squared: 0.7655
## F-statistic: 1251 on 1 and 382 DF, p-value: < 2.2e-16
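The most visible difference is in the residuals: the median residual was 2813431 before and is 0.06135 now. These numbers are not directly comparable, because the residuals are now measured in log dollars rather than dollars. The more meaningful change is that the residuals are far more symmetric (the minimum and maximum are now -1.59 and 1.21, instead of one extreme being more than three times the size of the other). Note also that R-squared fell from 0.8496 to 0.7661, and that the slope now describes a proportional change in Expenditures per RVU rather than a dollar change.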
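Because the dependent variable is logged, the slope is easiest to read as a percentage change. A minimal sketch of that arithmetic, using the slope estimate from the summary above (the variable names are only for illustration):

```r
slope <- 1.349e-06 # coefficient on RVUs from summary(lm.transformed)

# Expected percentage change in Expenditures for 1,000 additional RVUs:
# exp(slope * 1000) - 1, written with expm1() for numerical accuracy
pct_per_1000 <- 100 * expm1(slope * 1000)
round(pct_per_1000, 3) # 0.135
```

In other words, under this model each additional 1,000 RVUs is associated with roughly a 0.135% increase in Hospital Expenditures.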
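# Plotting our Transformed Data
plot(x = ind.rvu/1000,
     y = log(dep.exp/1000000),
     xlab = "RVUs (Thousands)",
     ylab = "Log of Hospital Expenditures (Millions)",
     main = "Transformed RVUs vs Hospital Expenditures"
)
# lm.transformed was fit on raw units, so rescale its coefficients to match the plot axes
abline(a = coef(lm.transformed)[1] - log(1000000), # log(dollars) -> log(millions)
       b = coef(lm.transformed)[2]*1000)           # per RVU -> per thousand RVUs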
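# Plotting Residuals against RVUs
transformed.res <- resid(lm.transformed)
plot(x = ind.rvu/1000, # put RVUs on the x-axis rather than the row index
     y = transformed.res,
     main = "Log Expenditures vs. RVUs Residual Plot",
     xlab = "RVUs (Thousands)",
     ylab = "Transformed Residuals")
abline(h = 0) # reference line at zero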
After plotting the transformed data, the scatterplot still shows most of the data concentrated where there are fewer RVUs, but the relationship now looks curved rather than linear. The big difference, however, is in the residual plot: with the log transformation, the residuals are spread out much more evenly.
Linearity: The relationship now has a visible curve, so it appears less linear than before. The correlation calculated below is lower than our original 0.92, but only by a little.
cor(x = ind.rvu,
    y = log(dep.exp)
)
## [1] 0.8752895
Homoscedasticity: This changed. The spread of the errors is now much more constant than it was before the log transformation, so this assumption is now better met.
No Perfect Multicollinearity: This did not change; the variables are still not perfectly correlated.
Zero Conditional Mean (or Expected Value of Residuals): The residuals are still not all exactly zero, which is expected, but they are centered more evenly around zero and there are fewer outliers than before.
I am happy with this functional form. There is a strong positive relationship between Expenditures and RVUs: the correlation stayed strong before and after the transformation (0.92 and 0.88), while the transformation reduced the influence of outliers. The one thing to note is that, before we transformed the data, far more of the error came from hospitals with more RVUs. In practical terms, hospitals that spend more also carry a larger workload, which suggests more capacity to take on new patients. From a care perspective, this would allow a hospital to take care of more people. From a business perspective, the more patients and care a hospital can handle, the more money it can earn, although we would need to look at additional variables before saying anything about profitability.