project 3

knitr::opts_chunk$set(echo = TRUE)
options(repos = c(CRAN = "https://cran.r-project.org"))
install.packages("pacman")

install.packages("dplyr")

pacman::p_load(ggthemes, viridis, tidyverse, wooldridge)

pacman::p_load(SDAResources)
pacman::p_load(stargazer)
data(college)
college_clean <- na.omit(college)

1: Summarize previous work

In the earlier stages of this project, I set out to answer how college expenditures, represented by the average cost per student, impact graduation rates. Using the college dataset from the SDAResources package, I estimated a simple linear regression with the graduation rate (c150_4) as the dependent variable and log-transformed tuition (logtuition, computed as log(npt4)) as the independent variable. The log transformation was used to address non-linearity in the relationship between tuition and graduation rates. The initial results showed a statistically significant positive relationship. The coefficient for logtuition was 0.213, indicating that a 1% increase in tuition is associated with an increase of about 0.00213 in c150_4, or 0.213 percentage points, since c150_4 is recorded as a proportion. However, the R-squared value (0.208) showed that the model explained only 20.8% of the variation in graduation rates, leaving most of the variation unexplained.

2: Multiple Regression Models

\(\text{c150\_4}_i = \beta_0 + \beta_1 \text{logtuition}_i + u_i\) (original)

\(\text{c150\_4}_i = \beta_0 + \beta_1 \text{logtuition}_i + \beta_2 \text{sat\_avg}_i + u_i\) (new)

The new model is a reasonable specification for the data. The log transformation lets the effect of tuition diminish at higher price levels, so an extra dollar matters less at an expensive school than at a cheap one. Including average SAT scores controls for differences in student ability, so the tuition coefficient is less likely to pick up the effect of better-prepared students. The model is linear in its parameters, which makes it easy to interpret: each coefficient shows the change in the graduation rate associated with one variable while the other is held constant, and the error term \(u_i\) captures the remaining unobserved factors.

3: Correlation (AI used to help with this code)

I had trouble with the following code when trying to knit the document (the error says the execution is halted), so I used AI to try to determine what was wrong with the code.

This is the original code I used to find the correlation; it returned 0.327.

This is the code I got from ChatGPT when I asked how to fix the execution halted problem. It checked that both variables were numeric and ordered, to see whether that was the problem. I got 0.327 again, but the document still would not knit.

A correlation of 0.327 between logtuition and sat_avg suggests a weak to moderate positive linear relationship between the two variables. This means that as logtuition increases, sat_avg tends to increase slightly as well, but the relationship is not very strong. Therefore, both variables can likely be included in the regression model without raising serious concerns about multicollinearity.

At this point I still have not found a way to get rid of the execution halted error using the book and AI.
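One possible cause of the halted knit (an assumption, since the failing chunk is not shown): knitting runs every chunk in a fresh R session from top to bottom, and logtuition is only created later, in section 4. A chunk that uses college$logtuition before that point can work interactively, where the column already exists, yet fail when knitting. The sketch below creates the log variable first and then computes the correlation on complete pairs. The data frame toy and its values are made up so the block runs on its own; in the project the same steps would be applied to college.

```r
# Toy stand-in for the college data; the values are made up for illustration.
toy <- data.frame(
  npt4    = c(12000, 18000, -500, 25000, 9000, NA, 30000),
  sat_avg = c(980, 1050, 1010, 1200, NA, 1100, 1310)
)

# Create the log variable BEFORE any chunk that uses it, so a fresh
# knit session can find it. Non-positive or missing prices stay NA.
toy$logtuition <- NA_real_
pos <- !is.na(toy$npt4) & toy$npt4 > 0
toy$logtuition[pos] <- log(toy$npt4[pos])

# Pearson correlation using only rows where both variables are observed.
r <- cor(toy$logtuition, toy$sat_avg, use = "complete.obs")
print(r)
```

Taking the log only on the positive rows also avoids the NaN warnings that log() gives for negative inputs.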

4: Regression Model

college$logtuition <- ifelse(college$npt4 > 0, log(college$npt4), NA)

OLS2 <- lm(c150_4 ~ logtuition + sat_avg, data = college)
summary(OLS2)

## 
## Call:
## lm(formula = c150_4 ~ logtuition + sat_avg, data = college)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36574 -0.05359 -0.00047  0.06243  0.34616 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.426e+00  6.869e-02  -20.75   <2e-16 ***
## logtuition   7.893e-02  7.375e-03   10.70   <2e-16 ***
## sat_avg      1.071e-03  2.284e-05   46.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09009 on 1111 degrees of freedom
##   (258 observations deleted due to missingness)
## Multiple R-squared:  0.7271, Adjusted R-squared:  0.7266 
## F-statistic:  1480 on 2 and 1111 DF,  p-value: < 2.2e-16

stargazer(OLS2, type = "text", 
          title = "Regression Results", 
          digits = 3, 
          dep.var.labels = c("c150_4"), 
          covariate.labels = c("Log(Tuition)", "SAT Avg."),
          out = "regression_results.txt")

## 
## Regression Results
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                c150            
## -----------------------------------------------
## Log(Tuition)                 0.079***          
##                               (0.007)          
##                                                
## SAT Avg.                     0.001***          
##                              (0.00002)         
##                                                
## Constant                     -1.426***         
##                               (0.069)          
##                                                
## -----------------------------------------------
## Observations                   1,114           
## R2                             0.727           
## Adjusted R2                    0.727           
## Residual Std. Error      0.090 (df = 1111)     
## F Statistic         1,480.073*** (df = 2; 1111)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

5: Interpret Results

Log(Tuition) coefficient (0.079): Because tuition enters the model in logs while c150_4 is in levels, a 1% increase in tuition is associated with an increase of about 0.079/100 = 0.00079 in c150_4, assuming the average SAT score remains constant. Since c150_4 is recorded as a proportion, that is roughly 0.08 percentage points. The relationship is positive, though the effect is relatively small. The p-value (< 0.01) indicates that the coefficient is statistically significant, meaning an estimate this large would be unlikely if the true coefficient were zero.

SAT Avg. coefficient (0.001): The unrounded estimate is 0.001071, so a 1-point increase in the average SAT score is associated with an increase of about 0.0011 in c150_4 (roughly 0.11 percentage points), assuming tuition remains constant. A 100-point increase therefore corresponds to about 10.7 percentage points. This relationship is also positive and statistically significant (p-value < 0.01).

R-squared (0.727): About 72.7% of the variation in c150_4 is explained by log(tuition) and sat_avg together. This is a relatively high value, indicating that the two independent variables explain a substantial proportion of the variation in the outcome.
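The unit conversions behind these coefficients can be checked with a few lines of arithmetic. This is a minimal sketch that copies the two estimates from the summary output and assumes c150_4 is a proportion between 0 and 1 (consistent with the residual range reported above). The variable names are labels chosen here.

```r
# Coefficients copied from the regression output above.
b_logtuition <- 0.07893
b_sat_avg    <- 0.001071

# Level-log term: a p% rise in tuition changes c150_4 by b * log(1 + p/100).
# For small p this is close to b * p / 100.
effect_1pct  <- b_logtuition * log(1.01)
effect_10pct <- b_logtuition * log(1.10)

# Level-level term: a 100-point rise in the average SAT score.
effect_100sat <- b_sat_avg * 100

# Express each effect in percentage points (proportion * 100).
pp <- round(100 * c(tuition_1pct  = effect_1pct,
                    tuition_10pct = effect_10pct,
                    sat_100pts    = effect_100sat), 3)
print(pp)
```

The results are roughly 0.08, 0.75, and 10.7 percentage points, which shows how much larger the SAT association is over a realistic range of values.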

6: 95% Confidence Intervals

confint(OLS2, level = 0.95)

##                    2.5 %       97.5 %
## (Intercept) -1.560364942 -1.290805290
## logtuition   0.064462765  0.093402869
## sat_avg      0.001026638  0.001116268

The regression results indicate that the intercept and both slope coefficients are statistically significant, with p-values less than 0.01, and none of the 95% confidence intervals includes zero. The interval for log(tuition) is [0.064, 0.093], confirming that tuition has a statistically significant positive relationship with c150_4 after controlling for sat_avg. Specifically, a 1% increase in tuition is associated with an increase of between roughly 0.0006 and 0.0009 in c150_4 (about 0.06 to 0.09 percentage points). The interval for sat_avg is [0.00103, 0.00112], which is narrow and well above zero. The model is also statistically significant overall (F = 1480, p-value < 2.2e-16), and the R-squared value of 0.727 indicates that it explains about 72.7% of the variation in c150_4, although other factors not included in the model may still influence the dependent variable.
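The interval that confint() reports can be reproduced by hand as the estimate plus or minus the critical t value times the standard error. A minimal sketch for the logtuition row, with the estimate, standard error, and degrees of freedom copied from the summary output (est, se, and df are labels chosen here):

```r
# Values copied from the logtuition row of summary(OLS2).
est <- 0.07893    # coefficient estimate
se  <- 0.007375   # standard error
df  <- 1111       # residual degrees of freedom

# Two-sided 95% interval: 2.5% in each tail of the t distribution.
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
print(round(ci, 4))
```

With 1111 degrees of freedom the critical value is very close to the normal value of 1.96, so the result agrees with the confint() row for logtuition up to rounding of the inputs.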

7: F Test (AI used to help with some of this code)

complete_data <- na.omit(college[, c("c150_4", "logtuition", "sat_avg")])

restricted_model <- lm(c150_4 ~ 1, data = complete_data)
model <- lm(c150_4 ~ logtuition + sat_avg, data = complete_data)

anova(restricted_model, model)

## Analysis of Variance Table
## 
## Model 1: c150_4 ~ 1
## Model 2: c150_4 ~ logtuition + sat_avg
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1113 33.042                                  
## 2   1111  9.017  2    24.025 1480.1 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
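The anova() comparison tests the joint null hypothesis that the coefficients on logtuition and sat_avg are both zero. The F statistic in the table can be rebuilt from the two residual sums of squares. A minimal sketch using the rounded RSS values printed above (variable names are labels chosen here):

```r
# Values copied from the anova() table above.
rss_restricted <- 33.042   # intercept-only model
rss_full       <- 9.017    # model with logtuition and sat_avg
q              <- 2        # number of restrictions tested
df_full        <- 1111     # residual df of the full model

# F = [(RSS_r - RSS_u) / q] / [RSS_u / df_u]
f_stat <- ((rss_restricted - rss_full) / q) / (rss_full / df_full)

# Upper-tail p-value from the F distribution.
p_value <- pf(f_stat, q, df_full, lower.tail = FALSE)

# With an intercept-only restricted model, R-squared is 1 - RSS_u / RSS_r.
r_squared <- 1 - rss_full / rss_restricted

print(c(F = f_stat, R2 = r_squared))
```

The F statistic comes out at about 1480 and the R-squared at about 0.727, matching the summary output, so the null hypothesis that neither variable matters is rejected.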

8: Residuals and Density Distribution

residuals <- resid(model)

ggplot(data.frame(residuals), aes(x = residuals)) + 
  geom_density(fill = "skyblue", alpha = 0.5) +
  labs(title = "Density Plot of Residuals", x = "Residuals", y = "Density")

The residuals are once again roughly bell-shaped and the overall shape is essentially the same, but it is not as even as the residuals I found in part 2 of my project. In part 2 the graph looked nearly perfect with the peak exactly at 0.0, while here the peak looks to be slightly to the left of zero, closer to -0.1. This shift does not by itself indicate bias: residuals from an OLS model with an intercept average to zero by construction, and the regression summary reports a median residual of -0.00047. A peak below zero would instead suggest mild skewness in the residuals. The distribution is still approximately normal, so the fit is reasonable, although a density plot alone cannot confirm normality.
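Two quick checks can back up a visual reading of the density plot: the mean of the residuals and a normal Q-Q plot. The sketch below uses simulated data so it runs on its own. The variables x1, x2, and y are made up; in the project the same calls would be applied to resid(model).

```r
# Simulated data for illustration only.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.5 + 0.3 * x1 + 0.8 * x2 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + x2)
res <- resid(fit)

# OLS residuals average to zero (up to floating point error)
# whenever the model includes an intercept.
mean_res <- mean(res)
print(mean_res)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(res)
print(sw$p.value)

# Q-Q plot: points close to the line suggest approximate normality.
if (interactive()) {
  qqnorm(res)
  qqline(res)
}
```

Because the mean is zero by construction, a peak that sits off zero in the density plot points to the shape of the distribution rather than to systematic over- or under-prediction.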

9: Inferences from Results

The multiple regression analysis provides a more comprehensive understanding of the factors associated with c150_4 than the earlier simple regression. In the multiple regression, both logtuition and sat_avg are statistically significant predictors of graduation rates, and the model explains 72.7% of the variation in c150_4, compared to only 20.8% in the simple regression. The coefficient for logtuition decreases from 0.2125 in the simple regression to 0.079 in the multiple regression, because including sat_avg reduces omitted variable bias. In the simple regression, the effect of logtuition was overstated because it also captured variation due to sat_avg, which is positively correlated with both logtuition and c150_4. Including sat_avg in the model separates the contributions of each predictor, leading to more accurate estimates. Additionally, the multiple regression reduces residual variability, improving the overall fit of the model. This demonstrates the importance of considering multiple predictors to limit bias and improve explanatory power in regression analyses.
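The drop in the tuition coefficient follows the omitted variable bias identity: the simple-regression slope equals the multiple-regression slope plus the coefficient on the omitted variable times the slope from regressing the omitted variable on the included one. The sketch below illustrates this with simulated data. The variables price, ability, and grad are made-up stand-ins for logtuition, sat_avg, and c150_4, and the numbers are not estimates from the college data.

```r
# Simulated data for illustration only.
set.seed(42)
n       <- 500
ability <- rnorm(n)                                   # stand-in for sat_avg
price   <- 0.5 * ability + rnorm(n)                   # correlated with ability
grad    <- 0.1 * price + 0.6 * ability + rnorm(n, sd = 0.5)

# Simple regression omits ability; the full regression includes it.
b_simple <- coef(lm(grad ~ price))[["price"]]
full     <- coef(lm(grad ~ price + ability))

# Slope from regressing the omitted variable on the included one.
delta <- coef(lm(ability ~ price))[["price"]]

# Identity: simple slope = full slope + (omitted coefficient * delta).
implied <- full[["price"]] + full[["ability"]] * delta
print(c(simple = b_simple, full = full[["price"]], implied = implied))
```

Since the omitted variable raises the outcome and is positively correlated with price, the simple slope is larger than the full-model slope, which is the same direction as the change from 0.2125 to 0.079.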

10: What now?

Building on the current analysis, the project could incorporate student demographics and location-based factors to provide a better understanding of what drives graduation rates. Variables such as socioeconomic background and racial and ethnic composition could shed light on how different groups experience higher education. Additionally, location-based factors like urban versus rural settings might reveal geographic disparities in graduation outcomes. Exploring these variables would not only enhance the explanatory power of the model but also allow for a more in-depth analysis of equity and access in higher education. Insights from such an analysis could inform targeted interventions to support underrepresented groups and address regional challenges, ultimately contributing to more inclusive and effective educational policies.