df <- read.csv(
file = "/Users/Ryan McNulty/Documents/Data Analysis/DATA/week 6 data-1.csv", # File name
na.strings = c("", "NA") # Replace blanks with NA
)
head(df)
## Expenditures Enrolled RVUs FTEs Quality.Score
## 1 114948144 25294 402703.73 954.91 0.67
## 2 116423140 42186 638251.99 949.25 0.58
## 3 119977702 23772 447029.54 952.51 0.52
## 4 19056531 2085 43337.26 199.98 0.93
## 5 246166031 67258 1579789.36 2162.15 0.96
## 6 152125186 23752 673036.55 1359.07 0.56
# Define Variables
exp <- df$Expenditures
rvu <- df$RVUs
library(psych)
describe(df)
## vars n mean sd median trimmed
## Expenditures 1 384 124765816.51 173435868.06 52796103.84 84600936.46
## Enrolled 2 384 24713.82 22756.42 16466.00 20647.21
## RVUs 3 384 546857.77 680047.21 246701.71 398680.18
## FTEs 4 384 1060.86 1336.29 483.07 750.11
## Quality.Score 5 384 0.71 0.11 0.73 0.72
## mad min max range skew kurtosis
## Expenditures 47444814.57 7839562.52 1.301994e+09 1.294154e+09 3.04 11.27
## Enrolled 13092.10 1218.00 1.198500e+05 1.186320e+05 1.94 4.14
## RVUs 237760.93 23218.01 3.574006e+06 3.550788e+06 2.10 4.27
## FTEs 401.38 116.29 7.518630e+03 7.402340e+03 2.55 6.94
## Quality.Score 0.10 0.31 9.600000e-01 6.500000e-01 -0.55 0.40
## se
## Expenditures 8850612.08
## Enrolled 1161.28
## RVUs 34703.51
## FTEs 68.19
## Quality.Score 0.01
Hospital_Description <- describe(df)
correlation <- cor(exp,rvu)
cat("The correlation is:",correlation,".\n Meaning there is a strong positive correlation between the two variables. Which can also be seen in the scatter plot below.")
## The correlation is: 0.9217239 .
## Meaning there is a strong positive correlation between the two variables. Which can also be seen in the scatter plot below.
plot(exp/1000000, rvu/1000, main = "Expenditures vs. RVUs", xlab = "Expenditures ($M)", ylab = "RVUs (K)")
cat("As Expenditure increases so does RVUs. Meaning the more the hospital spends the more patient workload increases, probably because they have more capacity to take in new patients.")
## As Expenditure increases so does RVUs. Meaning the more the hospital spends the more patient workload increases, probably because they have more capacity to take in new patients.
Level of measurement: Both variables of interest from the hospital dataset, Expenditures and RVUs, are continuous numeric fields.
Related pairs: Each cases in the dataset has a pair of values, one value for Expenditures and one for RVUs. There are no missing values for these fields in the dataset.
Absence of outliers: There are 13 observations where the Expenditures values, and 12 observations where the RVUs values, are greater than 3 standard deviations from the mean. This could influence any interpretations made from the outcome of the correlation analysis.
df1 <- df
df1$outlier_Expenditures <- ( df1$Expenditures - Hospital_Description["Expenditures","mean"] ) / Hospital_Description["Expenditures","sd"]
sum(abs((df1$outlier_Expenditures)>3))
## [1] 13
df1$outlier_RVUs <- ( df1$RVUs - Hospital_Description["RVUs","mean"]) / (Hospital_Description["RVUs","sd"] )
sum(abs((df1$outlier_RVUs)>3))
## [1] 12
Normality: The two fields do not appear to be normally distributed. This could also influence any interpretations made from the outcome of the correlation analysis.
hist(x = df$Expenditures,xlab = "",main = "Hospital Expenditure" )
hist(x = df$RVUs, xlab = "", main = "Standard Outpatient Workload (RVU)")
Linearity: The shape of the relationship between Expenditures and RVUs can be approximated by a line. While the scatterplot of the two fields does not create a perfect line, there is some error expected for observed data that is being analyzed here.
cor(df)
## Expenditures Enrolled RVUs FTEs Quality.Score
## Expenditures 1.0000000 0.7707756 0.9217239 0.9796506 0.2749501
## Enrolled 0.7707756 1.0000000 0.9152024 0.8148491 0.2526991
## RVUs 0.9217239 0.9152024 1.0000000 0.9504093 0.3075742
## FTEs 0.9796506 0.8148491 0.9504093 1.0000000 0.2769058
## Quality.Score 0.2749501 0.2526991 0.3075742 0.2769058 1.0000000
plot(df)
The graph shows that there is strong linear relationship between Expenditures and RVUs, Expenditure and FTEs, Expenditure and Enrolled, but not very high positive correlation between Expenditure and Quality Score.
plot(x = df$RVUs, y = df$Expenditures, xlab = "RVUs", ylab = "Expenditures")
univariate_reg = lm(df$Expenditures ~ df$RVUs)
options(scipen = 999)
summary(univariate_reg)
##
## Call:
## lm(formula = df$Expenditures ~ df$RVUs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -185723026 -14097620 2813431 11919781 642218316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3785072.158 4412905.480 -0.858 0.392
## df$RVUs 235.072 5.061 46.449 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8492
## F-statistic: 2157 on 1 and 382 DF, p-value: < 0.00000000000000022
abline(reg = univariate_reg, col="blue")
Interpret the linear model - Expenditures~RVUs.
As mentioned above, there is a strong positive correlation coefficient between Expenditures and RVUs, as evidenced by t = 46.45, p-value is less than 2e-16.
Expenditures= − 3785000 + 235.1∗RVUs
The model explained 84.96% variance in Expenditures (Adjusted R2).
plot(x = univariate_reg)
Linearity - Expenditures and RVUs have a somewhat linear relationship Independent observations - There isn’t a way to tell from the dataset if the observations are independent. There could be a factor in the sampling (such as location or funding) that affect response rate. Normal residuals - There are outliers, but as a whole as shown above, the linear regression is somewhat normal. Constant variability / Homoskedasticity - There is some Homoskedasticity - in the residuals vs the fitted plot, most the observations are clustered around the bottom left.
Because of the above the assumptions are not held for Expenditures ~ RVUs
hist(x = log(df$Expenditures),xlab = "", main = "Log Hospital Expenditure" )
hist(x = log(df$RVUs), xlab = "", main = "Log Standard Outpatient Workload (RVU)")
plot(x = df$RVUs, y = log (df$Expenditures) , xlab = "RVUs", ylab = "Log of Expenditures")
univariate_reg_transformedY = lm( formula = log(df$Expenditures) ~ df$RVUs)
abline(reg = univariate_reg_transformedY, col="blue")
summary(univariate_reg_transformedY)
##
## Call:
## lm(formula = log(df$Expenditures) ~ df$RVUs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.59439 -0.29504 0.06135 0.35333 1.20871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.29584389074 0.03325414655 520.11 <0.0000000000000002 ***
## df$RVUs 0.00000134911 0.00000003814 35.38 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared: 0.7661, Adjusted R-squared: 0.7655
## F-statistic: 1251 on 1 and 382 DF, p-value: < 0.00000000000000022
plot(univariate_reg_transformedY)
Linearity - This assumption is violated by the transformations - the observations fit more to a curve. Independent observations - There isn’t a way to tell from the dataset if the observations are independent. There could be a factor in the sampling (such as location or funding) that affect response rate. Normal residuals - There are less outliers. Constant variability / Homoskedasticity - The Homoskedasticity is a bit improved, but still not constant.
There were some improvement with the transformation, but also some assumptions (linearity) are not better met. If I was going to use this analysis on whether on not invest in expansion plans, I would probably use the other transformations to further limit outliers and Homoskedasticity.