#1. Correlation analysis between Expenditures and RVUs
#reading the data
Expenditure_data = read.csv("/Users/timyang/Downloads/week 6 data-1.csv")
head(Expenditure_data)
##   Expenditures Enrolled       RVUs    FTEs Quality.Score
## 1    114948144    25294  402703.73  954.91          0.67
## 2    116423140    42186  638251.99  949.25          0.58
## 3    119977702    23772  447029.54  952.51          0.52
## 4     19056531     2085   43337.26  199.98          0.93
## 5    246166031    67258 1579789.36 2162.15          0.96
## 6    152125186    23752  673036.55 1359.07          0.56
#Correlation between Expenditures and RVUs
correlation = cor(Expenditure_data$Expenditures, Expenditure_data$RVUs)
print(paste("correlation coefficient:", correlation))
## [1] "correlation coefficient: 0.921723872910014"
The correlation coefficient of 0.92172 between Expenditures and RVUs indicates a strong positive linear relationship: observations with higher Expenditures tend to have higher RVUs as well. The coefficient measures association only and does not by itself show that one variable drives the other.
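A point estimate of the correlation says nothing about its uncertainty. The sketch below wraps `cor.test()` from base R to report a confidence interval and p-value alongside the coefficient. The helper name `summarise_correlation` is hypothetical and not part of the original analysis, and the call is guarded so that it only runs when `Expenditure_data` has been loaded.

```r
# Hypothetical helper: Pearson correlation with confidence interval and p-value
summarise_correlation <- function(x, y, level = 0.95) {
  test <- cor.test(x, y, method = "pearson", conf.level = level)
  c(
    r       = unname(test$estimate),  # same value as cor(x, y)
    lower   = test$conf.int[1],       # lower bound of the interval
    upper   = test$conf.int[2],       # upper bound of the interval
    p_value = test$p.value            # test of H0: true correlation is 0
  )
}

# Only runs when the data frame from the read.csv() step is available
if (exists("Expenditure_data")) {
  print(summarise_correlation(Expenditure_data$Expenditures,
                              Expenditure_data$RVUs))
}
```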
#2. A linear model with Expenditures as the dependent variable (Y) and RVUs as the independent variable (X)
# Creating a linear model
linear_model <- lm(Expenditures ~ RVUs, data = Expenditure_data)
summary(linear_model)
##
## Call:
## lm(formula = Expenditures ~ RVUs, data = Expenditure_data)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -185723026  -14097620    2813431   11919781  642218316
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.785e+06  4.413e+06  -0.858    0.392
## RVUs         2.351e+02  5.061e+00  46.449   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8492
## F-statistic: 2157 on 1 and 382 DF, p-value: < 2.2e-16
The linear model suggests that, on average, each additional RVU is associated with an increase of about 235.1 in Expenditures. The intercept is not statistically significant (p = 0.392), but the RVUs coefficient is highly significant (p < 2e-16), indicating a strong linear relationship. The model explains 84.96% of the variance in Expenditures (R-squared = 0.8496).
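The reported numbers can be cross-checked with a little arithmetic. The sketch below uses only values copied from the output above: it builds an approximate 95% confidence interval for the slope from its estimate and standard error, and squares the correlation coefficient, which in a one-predictor regression should reproduce the reported R-squared. The variable names are illustrative.

```r
# Values copied from the cor() and summary(linear_model) output
slope    <- 2.351e+02
slope_se <- 5.061e+00
resid_df <- 382
r        <- 0.921723872910014

# 95% confidence interval for the slope: estimate +/- t * standard error
t_crit   <- qt(0.975, df = resid_df)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

# With a single predictor, R-squared equals the squared correlation
r_squared <- r^2

print(round(slope_ci, 1))
print(round(r_squared, 4))
```

With the fitted object available, `confint(linear_model)` gives the same interval without the rounding in the printed summary.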
# Checking whether the Gauss-Markov / linear regression assumptions hold
# Residual plot analysis
par(mfrow = c(2, 2))
# Plot residuals vs. fitted values
plot(linear_model, which = 1)
# Plot normal Q-Q plot
plot(linear_model, which = 2)
# Plot scale-location plot (square root of standardized residuals)
plot(linear_model, which = 3)
# Plot residuals vs. leverage
plot(linear_model, which = 5)
The Residuals vs Fitted plot shows a non-linear trend, since the residuals are not scattered evenly around zero and many fall below it, which suggests a violation of linearity. The Normal Q-Q plot deviates from a straight line in places, which indicates some non-normality in the residuals. However, the model appears to uphold homoscedasticity, since the Scale-Location plot shows a fairly consistent spread of residuals along the fitted values. Finally, the Residuals vs Leverage plot shows influential outliers, namely the points beyond the 0.5 Cook's distance contour, which reduce the model's reliability.
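Reading diagnostic plots is partly subjective, so it can help to back them with numbers. The sketch below is one possible way to do that with base R only: a Shapiro-Wilk test on the residuals and a count of observations whose Cook's distance exceeds 0.5. The helper name `check_assumptions` and the 0.5 cutoff are assumptions chosen to mirror the plot reading, not part of the original analysis.

```r
# Hypothetical helper: numeric companions to the diagnostic plots
check_assumptions <- function(model, cooks_cutoff = 0.5) {
  res   <- residuals(model)
  cooks <- cooks.distance(model)
  list(
    # Small p-value suggests the residuals are not normally distributed
    shapiro_p   = shapiro.test(res)$p.value,
    # Row indices of observations beyond the chosen Cook's distance cutoff
    influential = which(cooks > cooks_cutoff),
    max_cooks   = max(cooks)
  )
}

# Only runs when the fitted model from the previous step is available
if (exists("linear_model")) {
  print(check_assumptions(linear_model))
}
```

The same helper can be applied to any other `lm` fit to compare diagnostics on equal terms.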
#3. Log transformed Model
# 'ln' is the natural logarithm, which R computes with log(); fitting the model we get:
log_model <- lm(log(Expenditures) ~ RVUs, data = Expenditure_data)
summary(log_model)
##
## Call:
## lm(formula = log(Expenditures) ~ RVUs, data = Expenditure_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.59439 -0.29504  0.06135  0.35333  1.20871
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.730e+01  3.325e-02  520.11   <2e-16 ***
## RVUs        1.349e-06  3.814e-08   35.38   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared: 0.7661, Adjusted R-squared: 0.7655
## F-statistic: 1251 on 1 and 382 DF, p-value: < 2.2e-16
The log-transformed model suggests that, on average, a one-unit increase in RVUs is associated with an increase of 0.000001349 in log(Expenditures), which corresponds to an increase of roughly 0.0001349% in Expenditures. The intercept of 17.3 is the predicted value of log(Expenditures) when RVUs equals 0; exponentiating it gives the starting point on the original scale, exp(17.3), or roughly 33 million. Both coefficients are highly significant (p-value < 0.001), indicating a strong log-linear relationship between Expenditures and RVUs.
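Because the response is on the log scale, the coefficients have to be exponentiated before they can be read in terms of Expenditures. The sketch below does this with the two estimates copied from the output above. The increment of 100,000 RVUs is an arbitrary illustrative choice, and the variable names are hypothetical.

```r
# Estimates copied from the summary(log_model) output
b0 <- 1.730e+01   # intercept, on the log(Expenditures) scale
b1 <- 1.349e-06   # slope per one RVU, on the log(Expenditures) scale

# Percentage change in Expenditures for one additional RVU
pct_per_rvu <- expm1(b1) * 100

# Percentage change for an illustrative increment of 100,000 RVUs
pct_per_100k <- expm1(b1 * 1e5) * 100

# Predicted Expenditures at RVUs = 0, back on the original scale
baseline <- exp(b0)

print(c(pct_per_rvu = pct_per_rvu,
        pct_per_100k = pct_per_100k,
        baseline = baseline))
```

For a small coefficient such as `b1`, the exact percentage change is almost identical to `100 * b1`; for the larger increment the two diverge, which is why the exponentiated form is used.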
#Creating side-by-side histograms to show the distribution of the original vs log-transformed data
par(mfrow = c(1, 2))
# Original Expenditures histogram
hist(Expenditure_data$Expenditures, main = "Original Expenditures", xlab = "Expenditures", col = "lightblue", border = "black")
# Log-transformed Expenditures histogram
hist(log(Expenditure_data$Expenditures), main = "Log-transformed Expenditures", xlab = "log(Expenditures)", col = "lightgreen", border = "black")
The histogram of the original Expenditures data is clearly right-skewed. After the log transformation, the distribution becomes more symmetric, as shown by the log-transformed histogram on the right. Reducing the skewness of the response in this way helps the residuals of the regression come closer to normality, which is the quantity the normality assumption actually concerns.
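The change in shape seen in the histograms can also be expressed as a number. The sketch below defines a moment-based sample skewness, where values near 0 indicate a symmetric distribution and positive values a right skew. The helper name `sample_skewness` is hypothetical, and base R has no built-in skewness function, so the formula is written out.

```r
# Hypothetical helper: moment-based sample skewness (third standardized moment)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Only runs when the data frame from the read.csv() step is available
if (exists("Expenditure_data")) {
  print(c(
    original = sample_skewness(Expenditure_data$Expenditures),
    logged   = sample_skewness(log(Expenditure_data$Expenditures))
  ))
}
```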
#Recordings 1
# 'linear_model' and 'log_model' are the original linear and the log-transformed linear models respectively
# Residual plots for the original linear model
par(mfrow = c(2, 2))
plot(linear_model)
# Residual plots for the log-transformed linear model
par(mfrow = c(2, 2))
plot(log_model)
The Residuals vs Fitted plot shows residuals spread more evenly around zero for the log-transformed model, which indicates improved linearity. The Normal Q-Q plot shows that the standardized residuals follow the straight reference line more closely than in the original linear model, which indicates residuals that are closer to normal. The Scale-Location plot shows a consistent spread of residuals along the fitted values, which suggests that homoscedasticity has improved in the log-transformed model. Finally, in the Residuals vs Leverage plot, the log transformation reduces the influence of outliers by compressing large values toward smaller ones, as shown in the plots above.
#Recordings 2
These diagnostics suggest that the log-linear model, log(Expenditures) ~ RVUs, describes the relationship reasonably well, consistent with the strong association between Expenditures and RVUs. Nevertheless, other functional forms, such as a quadratic term or other transformations (e.g. square root or reciprocal), may be needed to achieve a good fit. These alternatives should be examined before major budgetary commitments are made, since different models may provide additional information or reduce the risk of relying on a single, possibly misspecified model.
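One way to explore the alternative functional forms is to fit them side by side and compare their errors on the original Expenditures scale. The sketch below is a minimal version of that idea, assuming the data frame has `Expenditures` and `RVUs` columns with no missing values and strictly positive Expenditures. The helper name `compare_forms` is hypothetical. The log model is back-transformed with a plain `exp()`, which is a simplification, and in-sample RMSE favours more flexible models, so the result is a starting point and not a final ranking.

```r
# Hypothetical helper: in-sample RMSE of several functional forms,
# all measured on the original Expenditures scale
compare_forms <- function(data) {
  fits <- list(
    linear    = lm(Expenditures ~ RVUs, data = data),
    quadratic = lm(Expenditures ~ RVUs + I(RVUs^2), data = data),
    root      = lm(Expenditures ~ sqrt(RVUs), data = data),
    log_lin   = lm(log(Expenditures) ~ RVUs, data = data)
  )
  rmse <- function(pred) sqrt(mean((data$Expenditures - pred)^2))
  sapply(names(fits), function(nm) {
    pred <- fitted(fits[[nm]])
    # Naive back-transform of the log model to the original scale
    if (nm == "log_lin") pred <- exp(pred)
    rmse(pred)
  })
}

# Only runs when the data frame from the read.csv() step is available
if (exists("Expenditure_data")) {
  print(compare_forms(Expenditure_data))
}
```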