# Clear the workspace
  rm(list = ls())  # Clear environment
  gc()             # Clear unused memory
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 523999 28.0    1166947 62.4   660385 35.3
## Vcells 956771  7.3    8388608 64.0  1769489 13.6
#Import the data
library(readr)
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
df <- read_csv("week 6 data-1.csv")
## Rows: 384 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): Expenditures, Enrolled, RVUs, FTEs, Quality Score
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Question 1

Using R, conduct correlation analysis (between the two variables) and interpret.

#covariance

covariance <- cov(x = df$RVUs, y = df$Expenditures)

stargazer(covariance, type = 'text')
## 
## =======================
## 108,712,332,961,081.000
## -----------------------

The covariance is a measure of how variables move in relation to one another. This covariance is very large and positive. At a minimum, we know that when RVUs go up, Expenditures also go up. In other words, both variables move in the same direction.

It is hard to interpret the strength of a covariance, as it is not a standardized value. It is very positive, but our original x and y values were also very large, so it makes sense that our covariance is large in magnitude as well. To get a better understanding of the strength of the relationship we will need to calculate the correlation coefficient.
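
As a quick sanity check (a sketch using the same df), the correlation coefficient can be recovered by dividing the covariance by the product of the two standard deviations:

# Sketch: correlation is the covariance standardized by the standard deviations
covariance / (sd(df$RVUs) * sd(df$Expenditures))
# this should match cor(df$RVUs, df$Expenditures)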

#correlation coefficient
correlation <- cor(x = df$RVUs, y = df$Expenditures)

stargazer(correlation, type = 'text')
## 
## =====
## 0.922
## -----

The correlation coefficient is a standardized version of the covariance and can range from -1 to 1. The closer the value is to either -1 or 1, the stronger the correlation between the two variables. 0 means no correlation.

Here, we can see there is a very strong correlation, with a coefficient of .922.

r2 <- correlation^2
stargazer(r2, type = 'text')
## 
## =====
## 0.850
## -----

The R-squared value shows the percentage of the variation in the dependent variable that a linear model explains. The closer an R2 value is to 1, the better the model explains the dependent variable. .850 is a fairly good R2 value, implying that 85% of the variation in the dependent variable (Expenditures) is explained by the model.

Question 2

#Run a regression

model <- lm(data = df, formula = Expenditures ~ RVUs)

stargazer(model, type = 'text')
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            Expenditures        
## -----------------------------------------------
## RVUs                        235.072***         
##                               (5.061)          
##                                                
## Constant                  -3,785,072.000       
##                           (4,412,905.000)      
##                                                
## -----------------------------------------------
## Observations                    384            
## R2                             0.850           
## Adjusted R2                    0.849           
## Residual Std. Error  67,354,526.000 (df = 382) 
## F Statistic         2,157.470*** (df = 1; 382) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
plot(df$RVUs, df$Expenditures, main = 'Expenditures vs RVUs',
     xlab = "RVUs", ylab = "Expenditures")
abline(model, col = 'blue')

Interpreting the regression coefficients:

Constant: -3,785,072.

In this model the constant means that when RVUs = 0, the predicted Expenditures value is -$3,785,072. A negative expenditure is not meaningful on its own; the intercept mainly reflects extrapolation below the range of RVUs observed in the data (and note from the table above that it is not statistically significant).

\(\beta_{1}\) = 235.072

This coefficient means that for every additional RVU, Expenditures go up by about $235.07.
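
As an illustration (the RVU value of 1,000,000 is hypothetical, not taken from the data), the fitted line can be used to predict Expenditures at a given RVU level:

# Hypothetical prediction: Expenditures at 1,000,000 RVUs (illustrative value only)
predict(model, newdata = data.frame(RVUs = 1e6))
# roughly -3,785,072 + 235.072 * 1,000,000, i.e. about $231 million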

plot(model)

Gauss Markov Assumptions

  1. Linearity: The data shows a linear trend. This is seen from both the regression equation \(Expenditures = \beta_{0} + \beta_{1}RVUs + \epsilon\) and the regression plot. If we look at the residuals vs fitted plot we can also see that the residuals are spread around a horizontal line. It is important to note that there is much greater residual density on the left half of the plot.

    SATISFIED.

  2. Average Value of the Error Term is 0.

    residuals1 <- residuals(model)
    mean(residuals1)
    ## [1] 2.390607e-09

The mean of the residuals (error) is 0.

SATISFIED.

  3. All independent variables are uncorrelated with the error term (EXOGENEITY).

    markov_cor_test1 <- cor.test(residuals1, df$RVUs)
    markov_cor_test1
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  residuals1 and df$RVUs
    ## t = -2.4131e-16, df = 382, p-value = 1
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.1000759  0.1000759
    ## sample estimates:
    ##          cor 
    ## -1.23463e-17

This shows that the correlation is close to 0 with a p-value of 1. This means that there is no correlation between the independent variable (RVUs) and the error term.

SATISFIED.

  4. Nearly Normal Residuals

    Looking at the Q-Q residuals plot we can see the residuals are not completely normal. At the low and high ends of the plot the residuals deviate from the normal fit line. The histogram below shows the distribution of the residuals. The data is mostly normal, but there is a high peak (kurtosis).

    SATISFIED.

    hist(x = residuals1)
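
    A formal normality check (not part of the original write-up) could supplement the histogram and Q-Q plot; note that with n = 384 the Shapiro-Wilk test flags even small departures from normality:

    shapiro.test(residuals1)   # a small p-value would indicate non-normal residuals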

  5. Homoscedasticity: The variance of the error is constant.

    Looking at the residuals vs fitted plot, we can see that the variance of the error is not constant. Although there is a horizontal line, most of the residuals are concentrated on the left-hand side of the plot. As RVUs get larger, the residuals become more scattered, including points 135, 256, and 384, which are all well above the horizontal line.

    We can also look at the Scale-Location plot. There is an upward-sloping line with unevenly spread points, meaning we do not satisfy homoscedasticity.

    NOT SATISFIED.
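
    As a more formal check (a sketch, assuming the lmtest package is installed; not part of the original analysis), a Breusch-Pagan test can back up the visual evidence. The same call could be repeated for model2 and model3 below.

    library(lmtest)   # assumed to be installed
    bptest(model)     # small p-value supports heteroscedasticity (non-constant error variance)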

  6. Error terms are uncorrelated with each other.

    Looking at the residual plots, there does not seem to be any specific pattern. Looking at the plotted model there seems to be no pattern in the error. To be sure we can run durbinWatsonTest() from the "car" package. The test statistic falls between 0 and 4; a value of 2 would mean no error correlation. Here we get a statistic of 1.95.

    library(car)
    ## Warning: package 'car' was built under R version 4.3.2
    ## Loading required package: carData
    ## Warning: package 'carData' was built under R version 4.3.2
    durbinWatsonTest(model)
    ##  lag Autocorrelation D-W Statistic p-value
    ##    1     -0.02611034      1.951485   0.506
    ##  Alternative hypothesis: rho != 0

SATISFIED.

  7. Collinearity: No independent variable is a perfect linear function of another.

    Since there is only one independent variable we can say this condition is met.

    SATISFIED.

Lastly, looking at the residuals vs leverage plot, we can see there are 3 influential cases: points 135, 256, and 384. These points have a disproportionate influence on the fitted model, and we should consider whether to remove them. One way to flag them programmatically is sketched below.
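
One possible sketch (the 4/n cutoff is a common rule of thumb, not part of the original analysis):

# Flag potentially influential observations using Cook's distance
cooks_d <- cooks.distance(model)
which(cooks_d > 4 / nrow(df))   # rule-of-thumb cutoff; likely includes points 135, 256, 384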

Question 3: Model of ln(Expenditures) ~ RVUs

model2 <- lm(log(df$Expenditures) ~ df$RVUs)

summary(model2)
## 
## Call:
## lm(formula = log(df$Expenditures) ~ df$RVUs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59439 -0.29504  0.06135  0.35333  1.20871 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.730e+01  3.325e-02  520.11   <2e-16 ***
## df$RVUs     1.349e-06  3.814e-08   35.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5076 on 382 degrees of freedom
## Multiple R-squared:  0.7661, Adjusted R-squared:  0.7655 
## F-statistic:  1251 on 1 and 382 DF,  p-value: < 2.2e-16
hist(x= df$RVUs)

This histogram shows the distribution of RVUs in the data. It is very skewed to the right with a long tail.

hist(x = df$Expenditures)

This histogram shows the distribution of Expenditures in the data. The data is very skewed right with a very long tail. This means there are some extreme high values.

hist(x = log(df$Expenditures))

The histogram of the log of Expenditures shows a much more normally distributed representation of the data. In addition, the units are now much more manageable to work with. Taking the log of Expenditures "squeezes" it into a more normal-looking distribution. This strategy works well for heavily skewed data (such as exponential-looking variables) and makes very large values much easier to work with.
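
A quick way to see the effect of the transformation on the scale of the values (a sketch using the same column):

# Compare the scale of the raw and log-transformed Expenditures
summary(df$Expenditures)        # raw dollar values spanning several orders of magnitude
summary(log(df$Expenditures))   # log values fall in a much narrower, more workable range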

plot(x = df$RVUs, y = log(df$Expenditures), main = "Log(Expenditures) vs RVUs", xlab = "RVUs", ylab = "log(Expenditures)")
abline(model2)

This line does not seem to fit the data as well as the first model's line, which is also reflected in the regression output (a lower R-squared). The data points appear to follow a logarithmic curve. This makes sense, as we transformed the y variable but not the x: RVUs is still skewed to the right and looks roughly exponential in shape.

#correlation coefficient
correlation2 <- cor(x = df$RVUs, y = log(df$Expenditures))

stargazer(correlation2, type = 'text')
## 
## =====
## 0.875
## -----

Here, we can see there is still a strong correlation, with a coefficient of .875, though it is weaker than in the first model (.922).

r2_2 <- correlation2^2
stargazer(r2_2, type = 'text')
## 
## =====
## 0.766
## -----

The r-squared value decreased from .850 to .766. This means the model \(\widehat{\log(Expenditures)} = \beta_{0} + \beta_{1}RVUs\) explains 76.6% of the variation in the dependent variable (log(Expenditures)).

plot(model2)

Gauss Markov Assumptions

  1. Linearity: The data shows a linear trend. This is seen from the regression equation \(log(Expenditures) = \beta_{0} + \beta_{1}RVUs + \epsilon\). However, if we look at the residuals vs fitted plot we can see that the residuals do not follow a linear pattern. There is a large upward-sloping region on the left side of the graph that turns into a downward-sloping line.

    NOT SATISFIED.

  2. Average Value of the Error Term is 0.

    residuals2 <- residuals(model2)
    mean(residuals2)
    ## [1] -1.291199e-17

    SATISFIED.

  3. All independent variables are uncorrelated with the error term. (EXOGENEITY).

    markov_cor_test2 <- cor.test(residuals2, df$RVUs)
    markov_cor_test2
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  residuals2 and df$RVUs
    ## t = -2.2875e-15, df = 382, p-value = 1
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.1000759  0.1000759
    ## sample estimates:
    ##          cor 
    ## -1.17041e-16

This shows that the correlation between RVUs and residuals is 0.

SATISFIED.

  4. Nearly Normal Residuals

    Looking at the Q-Q residuals plot we can see the residuals are not completely normal. At the low and high ends of the plot the residuals deviate from the normal fit line, but only very slightly. The histogram below shows the distribution of the residuals. Overall, the residuals are fairly normal, and more normal than in the first model.

    hist(x = residuals2)

SATISFIED (BETTER THAN MODEL 1).

  5. Homoscedasticity: The variance of the error is constant.

    Looking at the residuals vs fitted plot, we can see that the variance in the error is not constant. There is no horizontal line and the error points are not equally spread out.

    We can also look at the Scale-Location plot. There is no horizontal line, with the left side of the graph having a cubic-like relationship and the right hand side sloping upwards.

    NOT SATISFIED.

  6. Error terms are uncorrelated with each other.

    Looking at the residual plots, there does not seem to be any specific pattern. Looking at the plotted model there seems to be no pattern in the error (in terms of positive and negative sign). To be sure we can run durbinWatsonTest() from the "car" package. The test statistic falls between 0 and 4; a value of 2 would mean no error correlation. Here we get a statistic of 2.11 with a p-value of .30.

    durbinWatsonTest(model2)
    ##  lag Autocorrelation D-W Statistic p-value
    ##    1     -0.05848753      2.111351   0.302
    ##  Alternative hypothesis: rho != 0

SATISFIED.

  7. Collinearity: No independent variable is a perfect linear function of another.

    Since there is only one independent variable we can say this condition is met.

    SATISFIED.

Lastly, looking at the Residuals vs Leverage plot, we don't see any influential cases. This is different from the first model.

Overall, this model \(log(Expenditures) = \beta_{0} + \beta_{1}RVUs + \epsilon\) seems to be worse than the first one.

In the first model \(Expenditures = \beta_{0} + \beta_{1}RVUs + \epsilon\) the only condition that was not met was HOMOSCEDASTICITY. Model 1 also better satisfied the 6th requirement, "Error terms are uncorrelated with each other," with a higher Durbin-Watson p-value (0.51 vs 0.30) than model2.

In the second model both LINEARITY and HOMOSCEDASTICITY are not met. It is important to note the second model satisfied the condition of nearly normal residuals better than the first model. This can be attributed to the log function on our dependent variable (Expenditures).

Question 4

I am not happy with either model. I think model 1 is the better fit of the two, due to its higher correlation coefficient and R-squared value and because it met more Gauss Markov conditions than model2 did.

I think the best-fit model would take the log of both Expenditures and RVUs. Both variables follow an exponential-looking, right-skewed pattern, so taking the log will squeeze both and make their distributions more normal looking.

model3 <- lm(log(df$Expenditures) ~ log(df$RVUs))

summary(model3)
## 
## Call:
## lm(formula = log(df$Expenditures) ~ log(df$RVUs))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74657 -0.19864 -0.02431  0.18642  0.93551 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.91487    0.16621   41.60   <2e-16 ***
## log(df$RVUs)  0.88444    0.01317   67.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2932 on 382 degrees of freedom
## Multiple R-squared:  0.9219, Adjusted R-squared:  0.9217 
## F-statistic:  4512 on 1 and 382 DF,  p-value: < 2.2e-16
plot(x = log(df$RVUs), y = log(df$Expenditures), main = "Log(Expenditures) vs log(RVUs)", xlab = "log(RVUs)", ylab = "log(Expenditures)")
abline(model3)

#correlation coefficient
correlation3 <- cor(x = log(df$RVUs), y = log(df$Expenditures))

stargazer(correlation3, type = 'text')
## 
## =====
## 0.960
## -----
r2_3 <- correlation3^2
stargazer(r2_3, type = 'text')
## 
## =====
## 0.922
## -----

This model has the highest correlation coefficient of the three at .960 (compared with .922 for model 1 and .875 for model2). As a result, the R-squared is .922, meaning 92.2% of the variation in log(Expenditures) is explained by log(RVUs).
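
A compact way to compare the three fits side by side (a sketch reusing the stargazer package loaded above; the column labels are illustrative):

# Side-by-side regression table for all three models
stargazer(model, model2, model3, type = 'text',
          column.labels = c("Level-Level", "Log-Level", "Log-Log"))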

Looking at the histograms of the log of both variables in Question 3, we can see that the log transformation squeezed both variables to normal looking data with more manageable units.

plot(model3)

  1. Linearity: The data shows a linear trend. This is seen from both the regression equation \(log(Expenditures) = \beta_{0} + \beta_{1}log(RVUs) + \epsilon\) and the regression plot. If we look at the residuals vs fitted plot we can also see that the residuals are spread around an almost horizontal line. The residuals vs fitted line was flatter in the first model, but here the points are more uniformly spread out.

    SATISFIED.

  2. Average Value of the Error Term is 0.

    residuals3 <- residuals(model3)
    mean(residuals3)
    ## [1] -6.137883e-18

    SATISFIED.

  3. All independent variables are uncorrelated with the error term. (EXOGENEITY).

    markov_cor_test3 <- cor.test(residuals3, log(df$RVUs))
    markov_cor_test3
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  residuals3 and log(df$RVUs)
    ## t = -2.0616e-16, df = 382, p-value = 1
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.1000759  0.1000759
    ## sample estimates:
    ##         cor 
    ## -1.0548e-17

SATISFIED.

  4. Nearly Normal Residuals

    Looking at the Q-Q residuals plot we can see the residuals are not perfectly normal: the tails deviate slightly from the normal fit line, but this Q-Q plot looks the best of the three models. The histogram of the residuals below is mostly normal, with a high peak (kurtosis).

    hist(x = residuals3)

SATISFIED.

  5. Homoscedasticity: The variance of the error is constant.

    Looking at the residuals vs fitted plot, we can see that the variance of the error is constant. The residuals are spread fairly evenly across the plot.

    We can also look at the Scale-Location plot. There is a mostly horizontal line.

    SATISFIED.

  6. Error terms are uncorrelated with each other.

    durbinWatsonTest(model3)
    ##  lag Autocorrelation D-W Statistic p-value
    ##    1       0.1221566      1.734857   0.004
    ##  Alternative hypothesis: rho != 0

The D-W statistic of 1.73 is fairly close to 2, but the p-value of 0.004 indicates statistically significant autocorrelation, so this assumption holds less convincingly here than in the other two models.

SATISFIED (NOT AS GOOD AS THE OTHER MODELS).

  7. Collinearity: No independent variable is a perfect linear function of another.

    Since there is only one independent variable we can say this condition is met.

    SATISFIED.

Looking at the Residuals vs Leverage plot, there are no influential points.

If I were creating a hospital expansion plan I would use the third model: \(log(Expenditures) = \beta_{0} + \beta_{1}log(RVUs) + \epsilon\).

This model seemed to fit the data the best, backed by the highest correlation coefficient and r-squared value. The residual plots were also the cleanest for this model. Taking the log of both variables, which followed exponential-looking patterns, squeezed the data into more normal-looking distributions, which suits linear regression well.
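
One practical note for an expansion plan (a standard reading of log-log coefficients, not stated in the output above): the slope of about 0.88 can be interpreted as an elasticity, so a 1% increase in RVUs is associated with roughly a 0.88% increase in Expenditures, i.e. \(\%\Delta Expenditures \approx 0.884 \times \%\Delta RVUs\).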