# Clearing workspace
rm(list = ls()) # Clear environment
gc()            # Free unused memory

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 523028 28.0    1164120 62.2   660491 35.3
## Vcells 950904  7.3    8388608 64.0  1769515 13.6

# Clear console
cat("\f")

Loading the Data. The attached .csv file has data on hospital expenditures (the dependent variable). The RVUs column represents standard outpatient workload: a relative value unit (RVU) measures the time and intensity (skill, effort, judgement) required of a professional service and is associated with the Current Procedural Terminology (CPT) codes used in medical billing.

# Loading the Data
mydata <- read.csv('./week_6_data-1.csv') # Loading Week 6 Data

# Viewing the first rows of the Data
head(mydata)

##   Expenditures Enrolled       RVUs    FTEs Quality.Score
## 1    114948144    25294  402703.73  954.91          0.67
## 2    116423140    42186  638251.99  949.25          0.58
## 3    119977702    23772  447029.54  952.51          0.52
## 4     19056531     2085   43337.26  199.98          0.93
## 5    246166031    67258 1579789.36 2162.15          0.96
## 6    152125186    23752  673036.55 1359.07          0.56

Confirmed that the first 6 rows of this data match the Excel file. Load complete.

1 Using R, conduct correlation analysis (between the two variables) and interpret.

As stated, we are looking at the following:

Dependent Variable: Hospital Expenditures

Independent Variable: RVUs, a representation of standard outpatient workload (an RVU is a measure of the time and intensity (skill, effort, judgement) required of a professional service and is associated with the Current Procedural Terminology (CPT) codes used in medical billing)

# Describing Our Data
library(psych)
describe(mydata)

##               vars   n         mean           sd      median     trimmed
## Expenditures     1 384 124765816.51 173435868.06 52796103.84 84600936.46
## Enrolled         2 384     24713.82     22756.42    16466.00    20647.21
## RVUs             3 384    546857.77    680047.21   246701.71   398680.18
## FTEs             4 384      1060.86      1336.29      483.07      750.11
## Quality.Score    5 384         0.71         0.11        0.73        0.72
##                       mad        min          max        range  skew kurtosis
## Expenditures  47444814.57 7839562.52 1.301994e+09 1.294154e+09  3.04    11.27
## Enrolled         13092.10    1218.00 1.198500e+05 1.186320e+05  1.94     4.14
## RVUs            237760.93   23218.01 3.574006e+06 3.550788e+06  2.10     4.27
## FTEs               401.38     116.29 7.518630e+03 7.402340e+03  2.55     6.94
## Quality.Score        0.10       0.31 9.600000e-01 6.500000e-01 -0.55     0.40
##                       se
## Expenditures  8850612.08
## Enrolled         1161.28
## RVUs            34703.51
## FTEs               68.19
## Quality.Score       0.01

I ran describe() just to get a better sense of our data; it is not directly needed for the correlation analysis.

# Correlation Analysis
?cor

## starting httpd help server ... done

ind.rvu <- mydata$RVUs           # Setting our Independent Variable
dep.exp <- mydata$Expenditures   # Setting our Dependent Variable

r1<- cor(x = ind.rvu,          # Calculating Correlation (R-Value)
    y = dep.exp)
r1

## [1] 0.9217239

As seen above from our correlation analysis, our R-value is 0.9217239, which signifies a very strong positive correlation between hospital expenditures and RVUs. The R-value can range from -1 to 1: anything below 0 is a negative relationship, while anything above 0 is a positive relationship. Additionally, the closer the number is to -1 or 1, the stronger the relationship. In this case, hospitals with more RVUs tend to have higher Hospital Expenditures.
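A point estimate of r can be backed up with a confidence interval and p-value from base R's `cor.test()`. The sketch below uses simulated stand-in data (the vectors `rvus` and `expenditures` are hypothetical, not the homework file); in the report the same calls would take `ind.rvu` and `dep.exp`. Because both variables are right-skewed, a rank-based Spearman correlation is included as a rough cross-check.

```r
# Hypothetical stand-in data, NOT the Week 6 file: right-skewed workload and spending
set.seed(1)
rvus <- rlnorm(100, meanlog = 12.5, sdlog = 1)
expenditures <- 235 * rvus * exp(rnorm(100, sd = 0.3))

# Pearson correlation with a 95% confidence interval and p-value
pearson <- cor.test(rvus, expenditures)
pearson$estimate   # the r value
pearson$conf.int   # 95% confidence interval for r
pearson$p.value    # test of H0: true correlation is 0

# Spearman rank correlation is less sensitive to skew and outliers
spearman <- cor.test(rvus, expenditures, method = "spearman")
spearman$estimate
```

If the Pearson and Spearman values were far apart, that would suggest a few extreme hospitals are driving the Pearson result.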

# Plotting this relationship
plot(x = ind.rvu/1000,
     y = dep.exp/1000000,
     xlab = "RVU's (Thousand's)",
     ylab = "Hospital Expenditures (millions)",
     main = "RVU's vs Hospital Expenditures"
     )

Plotting this confirms the “eye test” for our calculation: the scatterplot shows a very strong positive relationship. Because the numbers in our data are so big, I plotted RVUs in thousands and Hospital Expenditures in millions.
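Rescaling the axes this way is safe because the correlation does not depend on the units of either variable. A minimal check, using simulated stand-in vectors (`rvus` and `expenditures` are hypothetical names, not the homework data):

```r
# Hypothetical stand-in data, NOT the Week 6 file
set.seed(2)
rvus <- runif(50, min = 2e4, max = 3.5e6)
expenditures <- 235 * rvus + rnorm(50, sd = 5e7)

# Correlation on the raw scale and on the plotted scale
r_raw    <- cor(rvus, expenditures)
r_scaled <- cor(rvus / 1000, expenditures / 1000000)

# Dividing by a positive constant leaves the correlation unchanged
all.equal(r_raw, r_scaled)
```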

2 Then fit a linear model with Expenditure as the dependent variable (Y) and RVUs as the independent (X) variable. Interpret the results (interpreting regression coefficients in particular) and whether the Gauss Markov Assumptions / linear regression assumptions hold or not (by conducting residual plot analysis and explaining your results in your own words).

First, I will check whether the variables appear to be normally distributed by plotting a histogram of each one to get a sense of what they look like.

# RVU's Histogram
hist(x = ind.rvu/1000,
     xlab = "RVU's (Thousands)",
     ylab = "Frequency",
     main = "RVU's Histogram"
     )

# Hospital Expenditures Histogram
hist(x = dep.exp/1000000,
     xlab = "Hospital Expenditures (Millions)",
     ylab = "Frequency",
     main = "Hospital Expenditures Histogram"
     )

As we can see here, neither RVUs nor Hospital Expenditures appears to be normally distributed; both are right-skewed.

# Correlation for the whole Data set
cor(mydata)

##               Expenditures  Enrolled      RVUs      FTEs Quality.Score
## Expenditures     1.0000000 0.7707756 0.9217239 0.9796506     0.2749501
## Enrolled         0.7707756 1.0000000 0.9152024 0.8148491     0.2526991
## RVUs             0.9217239 0.9152024 1.0000000 0.9504093     0.3075742
## FTEs             0.9796506 0.8148491 0.9504093 1.0000000     0.2769058
## Quality.Score    0.2749501 0.2526991 0.3075742 0.2769058     1.0000000

# Plotting the Relationships between each of the variables
plot(mydata)

Above, I plotted a scatterplot matrix to visualize the relationships between every pair of variables in our dataset, not just our X and Y variables. In each panel, the X variable is the one named in that column and the Y variable is the one named in that row. The graph we already plotted is the middle panel of the first row (Expenditures against RVUs). We can see that there are many other strong relationships in this dataset. We can do this because all of the variables in the data are quantitative.

# Linear Model
?lm
lm2 <- lm(mydata$Expenditures~mydata$RVUs)
lm2

## 
## Call:
## lm(formula = mydata$Expenditures ~ mydata$RVUs)
## 
## Coefficients:
## (Intercept)  mydata$RVUs  
##  -3785072.2        235.1

summary(lm2)

## 
## Call:
## lm(formula = mydata$Expenditures ~ mydata$RVUs)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -185723026  -14097620    2813431   11919781  642218316 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.785e+06  4.413e+06  -0.858    0.392    
## mydata$RVUs  2.351e+02  5.061e+00  46.449   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared:  0.8496, Adjusted R-squared:  0.8492 
## F-statistic:  2157 on 1 and 382 DF,  p-value: < 2.2e-16

# Plotting this relationship
plot(x = ind.rvu/1000,
     y = dep.exp/1000000,
     xlab = "RVU's (Thousand's)",
     ylab = "Hospital Expenditures (millions)",
     main = "RVU's vs Hospital Expenditures"
     )
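# Rescale the fitted line to the plot's units (x in thousands, y in millions)
abline(a = coef(lm2)[1] / 1000000,
       b = coef(lm2)[2] * 1000 / 1000000)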

# Plotting Residuals against RVUs
lm2.res <- resid(lm2)
plot(x = ind.rvu,
     y = lm2.res,
     main="Expenditures vs. RVUs Residual Plot",
     xlab="RVUs", 
     ylab="Residuals")
abline(0,0)

As seen from both our statistics and graphs, there is a strong positive relationship between RVUs and Hospital Expenditures. The slope of 235.1 means that each additional RVU is associated with about 235.1 more in expenditures on average, and it is highly significant (p < 2e-16). The intercept of about -3.8 million is not significantly different from zero (p = 0.392) and has no practical meaning, since no hospital in the data has zero RVUs. Another thing we can see from both the scatterplot and the residual plot is that the linear fit is much tighter when there are fewer RVUs: the data points get further from the fitted line as RVUs increase. The summary also shows an adjusted R-squared of 0.8492, which tells us that about 85% of the variance in expenditures is explained by RVUs, so the model fits well.
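As a worked example of reading these coefficients, the sketch below plugs the estimates reported in the summary above into the fitted equation and checks that, for a single-predictor model, R-squared is just the squared correlation. The workload value `rvus_example` is a hypothetical number chosen only for illustration.

```r
# Coefficients and correlation copied from the output above
intercept <- -3785072.2
slope     <- 235.1
r         <- 0.9217239

# Hypothetical workload, chosen only for illustration
rvus_example <- 500000

# Fitted equation: Expenditures = intercept + slope * RVUs
predicted_expenditures <- intercept + slope * rvus_example
predicted_expenditures

# With one predictor, Multiple R-squared equals the squared correlation
r_squared <- r^2
round(r_squared, 4)
```

The squared correlation rounds to the Multiple R-squared of 0.8496 shown in the summary, which is a quick way to confirm the two analyses agree.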

Gauss Markov Assumptions

Linearity: The relationship between the dependent and independent variable is linear. In our case, as seen in the scatterplot, expenditures and RVUs appear to have a linear relationship.

Independence: The errors are independent across all observations, meaning one error is not related to another. In our case, I could not tell from the plots whether the errors were related.

Homoscedasticity: The variance of the errors should be constant across all levels of the independent variable. In our example, the residuals are tightly clustered around zero when RVUs are low and spread out much more as RVUs increase, so this assumption does not appear to hold.

No Perfect Multicollinearity: The independent variables should not be perfectly correlated with one another. Since this model has only one independent variable (RVUs), this assumption is met automatically. The correlation matrix shows strong but not perfect correlations among the other variables, which would matter if we added them to the model.

Zero Conditional Mean (or Expected Value of Residuals): The expected value of the errors is zero for every value of the independent variable. Individual residuals are never all exactly zero, so what matters is whether they are centered on zero across the range of RVUs. Ours mostly are, but there are some large outliers as RVUs increase (the largest residual is about 642 million), though not a ton.
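The homoscedasticity and zero-mean checks can also be done numerically rather than only by eye. The sketch below is one possible approach in base R, run on simulated stand-in data whose error spread grows with workload (all variable names are hypothetical, and this is not the homework file). It regresses the squared residuals on the predictor, which is the idea behind the studentized Breusch-Pagan test.

```r
# Hypothetical stand-in data, NOT the Week 6 file: error SD grows with RVUs
set.seed(42)
n <- 200
rvus <- runif(n, min = 20000, max = 3500000)
expenditures <- -3.8e6 + 235 * rvus + rnorm(n, sd = 30 * rvus)

fit <- lm(expenditures ~ rvus)
res <- resid(fit)

# With an intercept in the model, residuals average to (numerically) zero
mean_res <- mean(res)

# Auxiliary regression: do squared residuals depend on the predictor?
aux <- lm(res^2 ~ rvus)

# Studentized Breusch-Pagan statistic: n * R-squared, chi-squared with 1 df
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
bp_p    # a small p-value points to non-constant variance
```

Because the residual mean is zero by construction, it cannot by itself confirm the zero conditional mean assumption; the residual plot against RVUs is still needed for that.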

3 Then fit a linear model of ln(Expenditures)~RVUs. Mathematically speaking, the logarithm function tends to squeeze together the larger values in your data set and stretches out the smaller values. (If you are wondering why do a log transformation, see the first two charts here that show how log reduces skewness to help meet normality of X assumption, with the caveat being that X is somewhat normally distributed to begin with in order for the transformation to reduce / remove skewness). This transformation is routine in Economics or Finance forecasting to stabilize the variance of a timeseries (GDP, stock prices,…). More readings on the why / the when.

As noted in the question, the log function stretches out smaller values and squeezes together bigger values, which brings a right-skewed variable closer to a normal distribution. I will start by applying this function to the histograms I plotted before, where we should be able to clearly see a difference.
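A tiny numeric illustration of the squeezing effect, using made-up values that each grow by a factor of ten (the vector `spending` is hypothetical):

```r
# Hypothetical values, each ten times the previous one
spending <- c(1e6, 1e7, 1e8, 1e9)

# On the raw scale the gaps between neighbours explode
raw_gaps <- diff(spending)
raw_gaps

# On the log scale every gap is the same size, log(10)
log_gaps <- diff(log(spending))
log_gaps
```

Equal ratios become equal distances after the log, which is why the long right tail gets pulled in.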

# RVU's Log Histogram
hist(x = log(ind.rvu/1000),
     xlab = "Log of RVU's (Thousands)",
     ylab = "Frequency",
     main = "Log RVU's Histogram"
     )

# Hospital Expenditures Log Histogram
hist(x = log(dep.exp/1000000),
     xlab = "Log of Hospital Expenditures (Millions)",
     ylab = "Frequency",
     main = "Log Hospital Expenditures Histogram"
     )

plot(log(mydata))

Similar to what I did in question 2, I plotted the relationships between all of our variables. We can quickly see the difference compared to before: the strong positive relationships are still there, but the skewness is reduced and the points are spread more evenly instead of bunching up at the low end.

# Summary of Transformed Data
lm.transformed <- lm(log(mydata$Expenditures) ~ mydata$RVUs)
summary(lm.transformed)

## 
## Call:
## lm(formula = log(mydata$Expenditures) ~ mydata$RVUs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59439 -0.29504  0.06135  0.35333  1.20871 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.730e+01  3.325e-02  520.11   <2e-16 ***
## mydata$RVUs 1.349e-06  3.814e-08   35.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared:  0.7661, Adjusted R-squared:  0.7655 
## F-statistic:  1251 on 1 and 382 DF,  p-value: < 2.2e-16

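As we can see here, the biggest difference is how much smaller our residuals got: the median residual was 2813431 before and is just 0.06135 now. Keep in mind that the two are on different scales (the original expenditure units before, log units now), so they cannot be compared directly. The same goes for R-squared, which went from 0.8496 to 0.7661 but describes a different dependent variable in each model.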
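Because the dependent variable is now a log, the slope reads as an approximate percentage change rather than a change in expenditure units. The sketch below works through that arithmetic with the estimates from the summary above; `rvus_example` is a hypothetical workload chosen only for illustration.

```r
# Coefficients copied from the log-linear summary above
b0 <- 17.30
b1 <- 1.349e-06

# Percent change in expenditures associated with 1,000 additional RVUs
pct_per_1000 <- (exp(b1 * 1000) - 1) * 100
pct_per_1000

# Hypothetical workload, chosen only for illustration
rvus_example <- 500000

# Prediction on the log scale, then back-transformed to expenditure units.
# Note: exp() of a log-scale prediction estimates a typical (median-like)
# value, not the mean, so it tends to sit below the average expenditure.
pred_log          <- b0 + b1 * rvus_example
pred_expenditures <- exp(pred_log)
pred_expenditures
```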

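# Plotting our Transformed Data
plot(x = ind.rvu/1000,
     y = log(dep.exp/1000000),
     xlab = "RVU's (Thousand's)",
     ylab = "Log of Hospital Expenditures (millions)",
     main = "Transformed RVU's vs Hospital Expenditures"
     )
# Rescale the fitted line to the plot's units (x in thousands, y in log-millions)
abline(a = coef(lm.transformed)[1] - log(1000000),
       b = coef(lm.transformed)[2] * 1000)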

# Plotting Residuals against RVUs
transformed.res <- resid(lm.transformed)
plot(x = ind.rvu,
     y = transformed.res,
     main="Log Expenditures vs. RVUs Residual Plot",
     xlab="RVUs", 
     ylab="Transformed Residuals")
abline(0,0)

After plotting our transformed data, we can still see on the scatterplot that most of the data sits where there are fewer RVUs, but the relationship now looks curved rather than linear. The big difference, however, comes from the residual plot: with the log transformation, the residuals are spread much more evenly.

3.1 How did this log transformation affect the Gauss Markov Assumptions (sure, the residual analysis diagnostic charts will change but what is your takeaway - which assumptions are better met or are some assumptions not met now ) ?

Linearity: There is now a visible curve, so the relationship appears less linear than before. The correlation calculated below is lower than our original 0.92, but only by a little.

cor(x = ind.rvu, 
    y = log(dep.exp)
    )

## [1] 0.8752895

Homoscedasticity: This changed, since the spread of our errors is now much more constant than it was before the log transformation. This assumption is now better met.

No Perfect Multicollinearity: This did not change, since the model still has only one independent variable.

Zero Conditional Mean (or Expected Value of Residuals): There are fewer outliers in our residuals now, although the curve in the scatterplot suggests the residuals may not be centered on zero across the whole range of RVUs.

3.2 Are you happy with this functional form capturing relationship between x and y or would you like to keep some different functional form? Why?

I am happy with this functional form. We can see that there is a strong relationship between expenditures and RVUs. Before and after the transformation, the correlation stayed strong while the outliers were reduced. With this being said, higher expenditures go along with a higher hospital workload and a greater ability to take new patients. From a care perspective, this would allow the hospital to take care of more people. From a business perspective, the more patients the hospital could take care of, the more money it would earn. We would need to look at more variables on profitability, however. The one thing we should note is that before we transformed the data, we saw much larger errors as RVUs increased.

DrewBaker_HW6_SimpleLinearRegression

2024-04-29
