setwd("/Users/jiwonban/ADEC7301/Week 6")
df <- read.csv("week 6 data-1.csv")
The attached .csv file contains data on hospital
Expenditures (the dependent variable). The column
RVUs represents standard outpatient workload
(an RVU, or relative value unit, measures the time and intensity {skill, effort, judgement}
required of a professional service and is associated with the Current
Procedural Terminology (CPT) codes used in medical billing).
cor(df$RVUs, df$Expenditures)
## [1] 0.9217239
The correlation coefficient between the variables RVUs
and Expenditures is 0.9217, which indicates a strong
positive linear relationship between the standard outpatient workload and
hospital expenditures. In other words, the higher the RVU value (i.e.,
the more standard outpatient workload a hospital bills for), the
higher the hospital expenditure (the more money a hospital needs to pay).
Because correlation is symmetric, the reverse also holds (i.e., the
higher the hospital expenditure, the higher the RVU value).
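The symmetry of the correlation, and the uncertainty around it, can be checked with a short sketch. The data below are simulated stand-ins (an assumption for illustration, not the course data set), and the variable names `rvus` and `expenditures` are hypothetical.

```r
# Simulated stand-in data (hypothetical; not the course data set)
set.seed(42)
rvus <- runif(200, min = 1e4, max = 1e6)
expenditures <- 235 * rvus + rnorm(200, sd = 5e7)

# Correlation does not depend on the order of its arguments
r_xy <- cor(rvus, expenditures)
r_yx <- cor(expenditures, rvus)

# cor.test() adds a test of H0: rho = 0 and a 95% confidence interval
ct <- cor.test(rvus, expenditures)
print(ct$estimate)
print(ct$conf.int)
```

Because the coefficient is the same in both directions, it describes association only and says nothing about which variable drives the other.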
lm(df$Expenditure ~ df$RVUs)
##
## Call:
## lm(formula = df$Expenditure ~ df$RVUs)
##
## Coefficients:
## (Intercept) df$RVUs
## -3785072.2 235.1
summary(lm(df$Expenditure ~ df$RVUs))
##
## Call:
## lm(formula = df$Expenditure ~ df$RVUs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -185723026 -14097620 2813431 11919781 642218316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.785e+06 4.413e+06 -0.858 0.392
## df$RVUs 2.351e+02 5.061e+00 46.449 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8492
## F-statistic: 2157 on 1 and 382 DF, p-value: < 2.2e-16
The linear model gives a slope of 235.10 and an
intercept of -3785072.2. The slope means that, for every one-unit
increase in the standard outpatient workload measure
(RVUs), the model estimates an increase of $235.10 in
hospital expenditures. The intercept means that when RVUs is zero (i.e.,
with no standard outpatient workload), baseline hospital expenditure
is predicted to be -$3,785,072.20 (a loss). This intercept is not
statistically distinguishable from zero (p = 0.392).
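As a worked example of the fitted line, the sketch below plugs the printed coefficients into the regression equation. The coefficients are the rounded values from the output above; the input of 100,000 RVUs and the helper name `predict_expenditure` are hypothetical, chosen only for illustration.

```r
# Coefficients as printed in the lm() output above (rounded by R's print method)
intercept <- -3785072.2
slope     <- 235.1

# Predicted expenditure for a given RVU total under the fitted line
predict_expenditure <- function(rvus) intercept + slope * rvus

# 100,000 RVUs is a hypothetical input chosen only for illustration
print(predict_expenditure(100000))

# In simple linear regression, R-squared equals the squared correlation
r <- 0.9217239
print(round(r^2, 4))
```

With these rounded coefficients the prediction for 100,000 RVUs is $19,724,927.80, and the squared correlation rounds to 0.8496, which matches the Multiple R-squared reported in the summary.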
| 1) Linearity - the relationship should be relatively linear |
| 2) Nearly normal residuals - the standardized residuals should be normally distributed |
| 3) Constant variability - the variance of errors should be relatively constant |
| 4) Independent observations - each observation should be independent of the others |
plot(lm(df$Expenditure ~ df$RVUs),
col = "purple")
The four diagnostic plots give more information about the residuals and whether the assumptions for a simple linear regression are met. First, the Residuals vs Fitted plot shows whether the residuals have non-linear patterns; the red line is roughly straight and horizontal, which is consistent with linearity, but the residuals still show a pattern in their spread (i.e., possible heteroscedasticity). Second, the Q-Q plot indicates whether the residuals are normally distributed; the residuals follow the dashed line relatively well (i.e., they are approximately normal), with some exceptions at the tails of the distribution (e.g., case 256). Third, the Scale-Location (Spread-Location) plot shows how the magnitude of the residuals changes across the fitted values; the points should be randomly scattered, but in this dataset they follow a linear trend, indicating heteroscedasticity. Lastly, the Residuals vs Leverage plot identifies influential cases (i.e., observations that strongly affect the fitted line); two cases fall outside the gray dashed Cook's distance lines (cases 256 and 135) and require more examination.
Helpful resource: UVA Library (1)
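The influential cases read off the Residuals vs Leverage plot can also be found numerically. The following is a minimal sketch on simulated data with one deliberately planted high-leverage point (an assumption for illustration, not the course data set); `cooks.distance()` and `hatvalues()` are base R functions.

```r
# Simulated data with one deliberately planted high-leverage point
# (hypothetical; not the course data set)
set.seed(1)
n <- 50
x <- runif(n, 0, 100)
y <- 2 * x + rnorm(n, sd = 5)
x[n] <- 300   # far outside the range of the other x values
y[n] <- 100   # and far below the line y = 2x

fit <- lm(y ~ x)

# Show the four default diagnostic plots in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Numeric counterparts to the Residuals vs Leverage plot
cd  <- cooks.distance(fit)
lev <- hatvalues(fit)
most_influential <- unname(which.max(cd))
print(most_influential)
print(cd[most_influential])
```

Applied to the hospital model, the same two calls would list the Cook's distance and leverage of every case, which makes it easier to decide whether flagged observations deserve a closer look.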
lm(log(df$Expenditure) ~ df$RVUs)
##
## Call:
## lm(formula = log(df$Expenditure) ~ df$RVUs)
##
## Coefficients:
## (Intercept) df$RVUs
## 1.730e+01 1.349e-06
summary(lm(log(df$Expenditure) ~ df$RVUs))
##
## Call:
## lm(formula = log(df$Expenditure) ~ df$RVUs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.59439 -0.29504 0.06135 0.35333 1.20871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.730e+01 3.325e-02 520.11 <2e-16 ***
## df$RVUs 1.349e-06 3.814e-08 35.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared: 0.7661, Adjusted R-squared: 0.7655
## F-statistic: 1251 on 1 and 382 DF, p-value: < 2.2e-16
plot(lm(log(df$Expenditure) ~ df$RVUs),
col = "navy")
Now the residuals are much smaller in magnitude (they are measured on the log scale), and the shapes of the plots have changed from the initial set. Looking at the Q-Q plot, the residuals now follow a normal distribution closely, and while the Scale-Location plot looks more randomly dispersed, there are still signs of heteroscedasticity. The Residuals vs Fitted plot has changed considerably under the log transformation, and the assumption of linearity is now violated (shown by the curve on the left-hand side). Lastly, the Residuals vs Leverage plot shows that all cases fall within the bounds.
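Because the response is now on the log scale, the slope is no longer a dollar amount per RVU. The sketch below converts the printed coefficients into a percentage change. The increment of 10,000 RVUs and the helper name `pct_change` are hypothetical, chosen only for illustration.

```r
# Coefficients as printed in the summary of the log-linear model above
b0 <- 1.730e+01   # intercept on the log scale
b1 <- 1.349e-06   # change in log(Expenditures) per one RVU

# Percentage change in expenditures from adding `delta` RVUs
pct_change <- function(delta) 100 * (exp(b1 * delta) - 1)

# 10,000 RVUs is a hypothetical increment chosen only for illustration
print(round(pct_change(10000), 2))

# Expenditure implied by the intercept alone (RVUs = 0), back on the dollar scale
print(exp(b0))
```

Under these rounded coefficients, an additional 10,000 RVUs corresponds to an increase of about 1.36% in predicted expenditures, regardless of the starting level.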
# ln(Expenditures)~RVU + RVU^2
# I() makes ^2 square RVUs; a bare (df$RVUs)^2 in a formula reduces to df$RVUs
plot(lm(log(df$Expenditure) ~ df$RVUs + I(df$RVUs^2)),
col = "tan")
# ln(Expenditures)~ln(RVU)
plot(lm(log(df$Expenditure) ~ log(df$RVUs)),
col = "green")
I decided to approach this question in an exploratory manner, by
trying different specifications to see how the residual plots would change.
For the first specification (tan), ln(Expenditures)~RVU + RVU^2,
the plots I obtained were very similar to those for
ln(Expenditures)~RVU. That is what a bare (df$RVUs)^2 term produces,
since R reads ^ inside a formula as a formula operator and reduces the
term to df$RVUs; the squared term needs I() to enter the model as a true
quadratic. The second specification (green),
ln(Expenditures)~ln(RVU), proved much more suitable:
the residuals in the Residuals vs Fitted and Scale-Location plots were much more
randomly distributed, the Q-Q plot showed normality, and the Residuals vs
Leverage plot indicated no influential cases. In other words, this second
specification represents the data better. Because log
transformations are needed to satisfy the regression
assumptions, the data suggest that the relationship between RVUs and
Expenditures is not linear in nature, and other
factors may be contributing to this seemingly direct correlation. Therefore,
it may not be wise to simply expand hospital investments; rather,
multiple regression analyses of other relevant variables should be
conducted to better understand the drivers of hospital expenditures.
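The difference between the two specifications depends on how R parses the model formula. Here is a minimal sketch on simulated data (an assumption for illustration, not the course data set; the data frame `sim` and its generating values are hypothetical). It compares a squared term written with and without `I()`, and fits the log-log model.

```r
# Simulated stand-in data (hypothetical; not the course data set)
set.seed(7)
rvus <- runif(300, min = 1e3, max = 1e6)
expenditures <- exp(10 + 0.6 * log(rvus) + rnorm(300, sd = 0.4))
sim <- data.frame(RVUs = rvus, Expenditures = expenditures)

# Without I(), `^2` is a formula operator and the squared term is dropped
fit_plain <- lm(log(Expenditures) ~ RVUs + (RVUs)^2, data = sim)

# With I(), the square is computed arithmetically and enters as its own term
fit_quad <- lm(log(Expenditures) ~ RVUs + I(RVUs^2), data = sim)

# Log-log model: the slope is the percent change in Expenditures
# associated with a one percent change in RVUs (an elasticity)
fit_loglog <- lm(log(Expenditures) ~ log(RVUs), data = sim)

print(names(coef(fit_plain)))
print(names(coef(fit_quad)))
print(coef(fit_loglog))
```

The first model reports only an intercept and an RVUs coefficient, while the second also reports a coefficient for the squared term. Passing `data = sim` instead of writing `sim$RVUs` in the formula also keeps the coefficient names short and lets `predict()` work on new data.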