Environment Setup

df <- read.csv(
  file = "/Users/Ryan McNulty/Documents/Data Analysis/DATA/week 6 data-1.csv", # File name
  na.strings = c("", "NA") # Replace blanks with NA
)
head(df)
##   Expenditures Enrolled       RVUs    FTEs Quality.Score
## 1    114948144    25294  402703.73  954.91          0.67
## 2    116423140    42186  638251.99  949.25          0.58
## 3    119977702    23772  447029.54  952.51          0.52
## 4     19056531     2085   43337.26  199.98          0.93
## 5    246166031    67258 1579789.36 2162.15          0.96
## 6    152125186    23752  673036.55 1359.07          0.56
# Define Variables
exp <- df$Expenditures
rvu <- df$RVUs

Q01 - Using R, conduct correlation analysis (between the two variables) and interpret.

library(psych)
describe(df)
##               vars   n         mean           sd      median     trimmed
## Expenditures     1 384 124765816.51 173435868.06 52796103.84 84600936.46
## Enrolled         2 384     24713.82     22756.42    16466.00    20647.21
## RVUs             3 384    546857.77    680047.21   246701.71   398680.18
## FTEs             4 384      1060.86      1336.29      483.07      750.11
## Quality.Score    5 384         0.71         0.11        0.73        0.72
##                       mad        min          max        range  skew kurtosis
## Expenditures  47444814.57 7839562.52 1.301994e+09 1.294154e+09  3.04    11.27
## Enrolled         13092.10    1218.00 1.198500e+05 1.186320e+05  1.94     4.14
## RVUs            237760.93   23218.01 3.574006e+06 3.550788e+06  2.10     4.27
## FTEs               401.38     116.29 7.518630e+03 7.402340e+03  2.55     6.94
## Quality.Score        0.10       0.31 9.600000e-01 6.500000e-01 -0.55     0.40
##                       se
## Expenditures  8850612.08
## Enrolled         1161.28
## RVUs            34703.51
## FTEs               68.19
## Quality.Score       0.01
Hospital_Description <- describe(df)

correlation <- cor(exp,rvu)
cat("The correlation is:",correlation,".\n Meaning there is a strong positive correlation between the two variables. Which can also be seen in the scatter plot below.")
## The correlation is: 0.9217239 .
##  Meaning there is a strong positive correlation between the two variables. Which can also be seen in the scatter plot below.
plot(exp/1000000, rvu/1000, main = "Expenditures vs. RVUs", xlab = "Expenditures ($M)", ylab = "RVUs  (K)")

cat("As Expenditure increases so does RVUs. Meaning the more the hospital spends the more patient workload increases, probably because they have more capacity to take in new patients.")
## As Expenditure increases so does RVUs. Meaning the more the hospital spends the more patient workload increases, probably because they have more capacity to take in new patients.

Q02 - Then fit a linear model with Expenditure as the dependent variable (Y) and RVUs as the independent (X) variable. Interpret the results and whether the Gauss Markov Assumptions/linear regression assumptions hold or not (by conducting residual plot analysis and explaining your results in your own words).

Level of measurement: Both variables of interest from the hospital dataset, Expenditures and RVUs, are continuous numeric fields.

Related pairs: Each cases in the dataset has a pair of values, one value for Expenditures and one for RVUs. There are no missing values for these fields in the dataset.

Absence of outliers: There are 13 observations where the Expenditures values, and 12 observations where the RVUs values, are greater than 3 standard deviations from the mean. This could influence any interpretations made from the outcome of the correlation analysis.

df1 <- df
df1$outlier_Expenditures <- ( df1$Expenditures - Hospital_Description["Expenditures","mean"] ) / Hospital_Description["Expenditures","sd"] 

sum(abs((df1$outlier_Expenditures)>3)) 
## [1] 13
df1$outlier_RVUs <- ( df1$RVUs - Hospital_Description["RVUs","mean"])  / (Hospital_Description["RVUs","sd"] ) 

sum(abs((df1$outlier_RVUs)>3))  
## [1] 12

Normality: The two fields do not appear to be normally distributed. This could also influence any interpretations made from the outcome of the correlation analysis.

hist(x = df$Expenditures,xlab = "",main = "Hospital Expenditure" )

hist(x = df$RVUs, xlab = "", main = "Standard Outpatient Workload (RVU)")

Linearity: The shape of the relationship between Expenditures and RVUs can be approximated by a line. While the scatterplot of the two fields does not create a perfect line, there is some error expected for observed data that is being analyzed here.

cor(df)
##               Expenditures  Enrolled      RVUs      FTEs Quality.Score
## Expenditures     1.0000000 0.7707756 0.9217239 0.9796506     0.2749501
## Enrolled         0.7707756 1.0000000 0.9152024 0.8148491     0.2526991
## RVUs             0.9217239 0.9152024 1.0000000 0.9504093     0.3075742
## FTEs             0.9796506 0.8148491 0.9504093 1.0000000     0.2769058
## Quality.Score    0.2749501 0.2526991 0.3075742 0.2769058     1.0000000
plot(df)

The graph shows that there is strong linear relationship between Expenditures and RVUs, Expenditure and FTEs, Expenditure and Enrolled, but not very high positive correlation between Expenditure and Quality Score.

plot(x = df$RVUs, y = df$Expenditures, xlab = "RVUs", ylab = "Expenditures") 

univariate_reg  =  lm(df$Expenditures ~ df$RVUs)
options(scipen = 999)
summary(univariate_reg)
## 
## Call:
## lm(formula = df$Expenditures ~ df$RVUs)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -185723026  -14097620    2813431   11919781  642218316 
## 
## Coefficients:
##                 Estimate   Std. Error t value            Pr(>|t|)    
## (Intercept) -3785072.158  4412905.480  -0.858               0.392    
## df$RVUs          235.072        5.061  46.449 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared:  0.8496, Adjusted R-squared:  0.8492 
## F-statistic:  2157 on 1 and 382 DF,  p-value: < 0.00000000000000022
abline(reg = univariate_reg, col="blue")

Interpret the linear model - Expenditures~RVUs.

As mentioned above, there is a strong positive correlation coefficient between Expenditures and RVUs, as evidenced by t = 46.45, p-value is less than 2e-16.

Expenditures= − 3785000 + 235.1∗RVUs

The model explained 84.96% variance in Expenditures (Adjusted R2).

plot(x = univariate_reg)

Gauss Markov Assumptions

Linearity - Expenditures and RVUs have a somewhat linear relationship Independent observations - There isn’t a way to tell from the dataset if the observations are independent. There could be a factor in the sampling (such as location or funding) that affect response rate. Normal residuals - There are outliers, but as a whole as shown above, the linear regression is somewhat normal. Constant variability / Homoskedasticity - There is some Homoskedasticity - in the residuals vs the fitted plot, most the observations are clustered around the bottom left.

Because of the above the assumptions are not held for Expenditures ~ RVUs

Q03 - Then fit a linear model of ln(Expenditures)~RVUs

hist(x = log(df$Expenditures),xlab = "", main = "Log Hospital Expenditure" )

hist(x = log(df$RVUs),        xlab = "", main = "Log Standard Outpatient Workload (RVU)")

plot(x = df$RVUs, y = log (df$Expenditures) , xlab = "RVUs", ylab = "Log of Expenditures") 

univariate_reg_transformedY  =  lm( formula = log(df$Expenditures) ~ df$RVUs)
abline(reg = univariate_reg_transformedY, col="blue")

summary(univariate_reg_transformedY)
## 
## Call:
## lm(formula = log(df$Expenditures) ~ df$RVUs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59439 -0.29504  0.06135  0.35333  1.20871 
## 
## Coefficients:
##                   Estimate     Std. Error t value            Pr(>|t|)    
## (Intercept) 17.29584389074  0.03325414655  520.11 <0.0000000000000002 ***
## df$RVUs      0.00000134911  0.00000003814   35.38 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared:  0.7661, Adjusted R-squared:  0.7655 
## F-statistic:  1251 on 1 and 382 DF,  p-value: < 0.00000000000000022
plot(univariate_reg_transformedY)

Linearity - This assumption is violated by the transformations - the observations fit more to a curve. Independent observations - There isn’t a way to tell from the dataset if the observations are independent. There could be a factor in the sampling (such as location or funding) that affect response rate. Normal residuals - There are less outliers. Constant variability / Homoskedasticity - The Homoskedasticity is a bit improved, but still not constant.

There were some improvement with the transformation, but also some assumptions (linearity) are not better met. If I was going to use this analysis on whether on not invest in expansion plans, I would probably use the other transformations to further limit outliers and Homoskedasticity.