Abstract

We present a research investigation into the efficacy of the New York City School Quality Snapshot in forecasting the four-year college persistence rate among high schools in New York City. These evaluations play a pivotal role in shaping educational policies nationwide. While insights from educators and families are invaluable, they often fail to capture the fundamental factors that have the greatest impact on academic achievement. Furthermore, we construct a predictive framework based on proxy socioeconomic indicators (such as residence in temporary housing and economic necessity) to forecast the average college persistence rate of high schools in NYC. Our socioeconomic-based model surpasses a model reliant solely on data from the school quality snapshot, which includes input from students, parents, and educators within NYC’s public school system. This enhanced performance is evident across various diagnostic metrics, including root mean squared error, Akaike and Bayesian Information Criteria, and adjusted R-squared values. Additionally, we develop a weighted least squares model using the same set of proxy variables to compare modeling methodologies. While this model outperforms the linear model based on ratings, it introduces added complexity, resulting in reduced explainability compared to the direct use of proxy variables.

Introduction

Our final project aims to analyze the predictive power of Department of Education (DOE) ratings on the academic performance of New York City (NYC) high schools. We leverage data from NYC Open Data’s School Quality Report for NYC high schools for the 2020-2021 school year. The project may also incorporate more up-to-date educational information and geographic data to augment our analysis.

First, let’s read in our source CSV file. This is posted in our GitHub repository in the interest of reproducibility.

There are several predictor variables of interest to us. The NYC School Quality Report from the Department of Education included ratings for schools in the following categories. We would like to see how effective these ratings are at predicting student success:

Quality Review Rating
Achievement Rating
Environment Rating
College and Career Readiness Rating

In addition, it would be interesting to see whether these ratings could be replaced by proxy variables. Doing so could save the Department of Education (DOE) the time spent assigning ratings if a few known values for a school carry similar predictive information.

Our response variable is the 4-year college persistence rate at a given school. We do not model SAT scores here: they are an imperfect metric given their correlation with other socioeconomic factors, and we leave them to future work.

Main research question: Are DOE ratings effective predictors of a school’s ability to cultivate high academic achievement in students, and what alternative proxy variables offer improved accuracy in predicting academic performance, as assessed by 4-year college persistence rates?

The primary measure of success we aim to predict is the 4-year college persistence rate for NYC high schools. This measure is defined as the percentage of students who graduate from a high school and eventually complete a 4-year college program. Identifying the key indicators of a school’s ability to successfully prepare students for college could benefit the NYC Department of Education (DOE) and NYC Public Schools in several ways:

It would provide insights to the NYC DOE and NYC Public Schools which would enable them to tailor instructional approaches and develop targeted curricula that specifically address college preparedness.
It would allow for strategic allocation of resources to address identified areas that significantly impact college readiness, ensuring that resources are utilized efficiently to increase the percentage of college-ready students across NYC Public Schools.

It is well established that attending a 4-year institution significantly enhances potential career earnings. Ensuring that high school students are adequately prepared for college not only benefits their immediate educational success but also contributes to their long-term success in life.

Literature Review

A significant determinant of academic achievement is a student’s socioeconomic status. The National Center for Education Statistics (NCES) reports that students from low-income households are almost four times as likely to discontinue their high school education as their affluent counterparts (NCES).

Several prior studies have used more sophisticated modeling techniques, different data sources, and different predictor variables to predict educational outcomes similar to ours. In one such study, Bernacki, Chavez and Uesbeck predicted educational achievement from student digital behavior, rather than the social factors we intend to explore. The model in this study reached an accuracy of 75% and was able to flag students for early intervention. This approach predicts a slightly different metric of student success than ours, and the training data and predictor variables differ as well.

Similarly, Musso, Cascallar, Bostani, and Crawford trained an artificial neural network (ANN) to identify relationships between variables and educational performance data. They modeled the educational performance of Vietnamese students in grade five and included individual characteristics as well as information related to daily routines in their training data. This method uses a more sophisticated model and resulted in an impressive prediction accuracy of \(95\%\) to \(100\%\). However, as their training data comes from a different country with a different educational system and methods, it may not be prudent to compare the model’s results to those of our model or of any other US-centric study.

In another study (linked here as Final grades), the authors predicted final exam grades for Turkish students through machine learning models, using prior exam scores as their input variables. A logistic regression model was used to assess the actual impact of event context on “Fail” students in a multivariate setup. Four machine learning algorithms (Random Forest, XGBoost, KNN, and SVM) were applied with a 70%/30% split between training and test data to predict the students’ final grades. The findings indicate that the Random Forest algorithm can be used to predict which students will fail after eight weeks. Such data-driven predictions are significant for teachers and decision-makers: they allow them to take measures to reduce the number of failing students and to identify which types of learning resources or student activities are better predictors of academic performance.

Methodology

Our primary interest is finding proxy variables within the data that can better serve as predictors of 4-year college persistence rates at a given NYC high school than the school survey ratings collected by the quality review. Toward this end, we will need to first construct a baseline model that predicts a school’s college persistence rate.

We will attempt to use three variables as a proxy for the school’s survey rating in predicting college persistence:

temp_housing_pct: the percentage of students living in temporary housing
eni_hs_pct_912: Economic Need Index: a measure of the percentage of students facing economic hardship at a school¹
val_chronic_absent_hs_all: the percentage of students who are chronically absent²

We begin by taking a look at a summary of the dataset’s completeness.

Completeness Summary
rows	487
columns	393
all_missing_columns	12
total_missing_values	47359
complete_rows	0

There are 12 columns that are completely devoid of data, so we identify and remove those.

Columns with all missing values
All NA Columns
QR_1_1
QR_1_2
QR_2_2
QR_3_4
QR_4_2
QR_1_4
QR_1_3
QR_3_1
QR_4_1
QR_5_1
Dates_of_Review
principal

We create a 20% holdout set of data to be used later on in order to evaluate the efficacy of our model’s predictive capability. The remaining 80% of the data is to be used for model training and exploratory data analysis (EDA).

For ease of single-node computation, we’ll select the variables of interest from our dataset. Notably, these are the survey ratings, enrollment levels, and our preferred proxy variables for each school.

We take a look at whether the reduced training dataset contains any missing values and what the spread is.

The variable with the most missing data is college_rate. Some schools are also missing some survey ratings, and a very small percentage of schools are missing chronic absenteeism values.

We impute both our training and evaluation datasets. Given that we are dealing with continuous numeric (and not categorical) variables, we use the Predictive Mean Matching imputation method native to the R mice package.
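To make the idea behind predictive mean matching concrete, here is a minimal base-R sketch on synthetic data. It is a simplified illustration only: `pmm_impute_once` is a hypothetical helper, not a function from mice, and it omits the parameter draws that the real implementation performs. The point it shows is that each missing value is filled with an actually observed value borrowed from a case with a similar predicted mean.

```r
# Toy illustration of predictive mean matching (PMM) for a single variable.
# All data are synthetic; `pmm_impute_once` is a hypothetical helper,
# not part of the mice package.
set.seed(42)
n <- 100
x <- runif(n)
y <- 0.8 - 0.5 * x + rnorm(n, sd = 0.05)
y[sample(n, 15)] <- NA  # remove 15 values to impute

pmm_impute_once <- function(y, x, donors = 5) {
  obs <- !is.na(y)
  fit <- lm(y ~ x, subset = obs)                     # fit on observed cases only
  pred <- predict(fit, newdata = data.frame(x = x))  # predicted mean for every case
  y_imp <- y
  for (i in which(!obs)) {
    # Donor pool: observed cases whose predicted means are closest to case i
    gap <- abs(pred[obs] - pred[i])
    pool <- which(obs)[order(gap)[seq_len(donors)]]
    # Borrow the observed value of one randomly chosen donor
    y_imp[i] <- y[pool[sample.int(length(pool), 1)]]
  }
  y_imp
}

y_filled <- pmm_impute_once(y, x)
```

Because every imputed value is copied from a donor, the filled-in values always stay within the range of the observed data.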

To check underlying modeling assumptions, we plot distributions and relationships of different variables. First, we plot the distribution of college persistence rates among NYC high schools to check for normality.

We see a relatively normal distribution of college persistence rates. In the case of NYC high schools, the peak is at around 50%. This is in line with national averages released by the Census Bureau (@CensusBureau_CollegeRates_2023).

The plot below shows the raw correlation between each variable in our pared-down dataset (Collaborative Teaching, Trust, etc.) and the response variable of interest: 4-Year College Persistence Rate.

From our correlation plot above, we can see strong negative relationships between two of our proxy variables of interest (Temporary Housing Rate and Economic Need Index) and our target variable, College Persistence Rate. There is also a negative relationship between Chronic Absenteeism and College Persistence Rate, but to a lesser degree. This suggests that models built on these variables could give good insight into the factors that most influence college persistence.

Enrollment has only a slightly positive relationship with College Persistence Rate. We expected school size might be important when modeling, but that does not appear to be the case.

We also see that the survey ratings are all at least somewhat positively correlated with one another, and the only survey rating that appears to have a relationship with College Persistence Rate is Rigorous Instruction. That relationship is only slightly positive. This signals that constructing a model based on one or more of the survey ratings might not give as much insight into college persistence as the proxy variables could.

Now we can plot the distributions of our proxy variables of interest.

We see that the distribution of Temporary Housing Rate is right-skewed, while the distribution of Economic Need Index is left-skewed. The closer the index is to 1, the more economic hardship students at that school face, so schools with high rates of students facing economic hardship are more prevalent than schools with low rates. Both variables are candidates for transformation due to their skew. Our model is unlikely to feature Enrollment, as observations are heavily concentrated at the low end, and we already noted it is less correlated with our target variable than the other proxies we’re considering. Chronic Absenteeism is closer to a normal distribution than the other variables, but it is still slightly right-skewed, so this dataset has more schools with relatively low rates than schools with relatively high rates.

We check an assumption of linearity between our proxy predictors and our response variable by producing scatter plots of the response variable versus each of the proxy predictors.

We see a generally negative linear relationship between the response variable and rates of students in temporary housing. As that rate increases, college persistence tends to decrease. However, that relationship does not appear to hold for schools with higher rates of students in temporary housing. So the relationship cannot be completely captured by a linear trend.

We also see a non-linear relationship between the response variable and the economic need index.

Schools with lower enrollment levels have a wider range of college persistence rates than schools with higher enrollment levels.

Only one school where chronic absenteeism is greater than or equal to 50 percent achieves college persistence levels above 80 percent. However, college persistence varies widely at most chronic absenteeism levels.

Modeling

For evaluation purposes, we create a linear model based on the survey ratings present per school in our data. We fit this multiple least-squares model to predict the college persistence rate of a given high school. The model summary is printed below:

## 
## Call:
## lm(formula = base_formula, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5405 -0.1119  0.0053  0.1135  0.4303 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.5399     0.1976   2.732  0.00659 ** 
## survey_pp_CT   0.1150     0.2635   0.436  0.66281    
## survey_pp_RI   2.1733     0.1976  11.001  < 2e-16 ***
## survey_pp_SE  -1.5105     0.2664  -5.671  2.8e-08 ***
## survey_pp_ES  -0.3090     0.2802  -1.103  0.27079    
## survey_pp_SF   0.2349     0.2131   1.102  0.27109    
## survey_pp_TR  -0.4708     0.4237  -1.111  0.26724    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1581 on 383 degrees of freedom
## Multiple R-squared:  0.2495, Adjusted R-squared:  0.2377 
## F-statistic: 21.22 on 6 and 383 DF,  p-value: < 2.2e-16

We find that our base model on the school survey ratings produces an adjusted R-squared of \(R^2_{adj} = 0.24\). This is lower than what the predictive model in @roth_meta-analyzing_1996 produces. The two survey ratings that appear to be statistically significant in the model are Rigorous Instruction, which we expected based on our correlation analysis, and Supportive Environment, which we did not expect. We reduce the model via backward selection, dropping Collaborative Teaching, Family-Community Ties, and Trust, and Effective Leadership becomes statistically significant as well. We reprint a summary below:

## 
## Call:
## lm(formula = college_rate ~ survey_pp_RI + survey_pp_SE + survey_pp_ES, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53159 -0.11178  0.00553  0.11225  0.46053 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.5234     0.1369   3.824 0.000153 ***
## survey_pp_RI   2.1816     0.1951  11.182  < 2e-16 ***
## survey_pp_SE  -1.5291     0.2379  -6.426 3.86e-10 ***
## survey_pp_ES  -0.4134     0.1232  -3.355 0.000873 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1579 on 386 degrees of freedom
## Multiple R-squared:  0.2459, Adjusted R-squared:  0.2401 
## F-statistic: 41.97 on 3 and 386 DF,  p-value: < 2.2e-16
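
The adjusted R-squared values in the two summaries above can be checked by hand. As a worked example, we use the standard definition with \(n\) observations and \(p\) predictors, and take \(n = 390\) from the residual degrees of freedom plus the number of estimated coefficients (383 + 7 in the full model, 386 + 4 in the reduced model):

\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \]

\[ \text{Full model } (p = 6): \quad 1 - (1 - 0.2495)\,\frac{389}{383} \approx 0.2377 \]

\[ \text{Reduced model } (p = 3): \quad 1 - (1 - 0.2459)\,\frac{389}{386} \approx 0.2400 \]

The small gap between 0.2400 and the reported 0.2401 comes from starting with the already rounded \(R^2\). Dropping three predictors lowers \(R^2\) slightly, but the smaller penalty term offsets that loss.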

After rounding to two decimal places, the adjusted R-squared is unchanged (0.2377 versus 0.2401). We check for suspected multicollinearity within this model:

Variance Inflation Factors
	VIF Value
survey_pp_RI	2.04
survey_pp_SE	2.12
survey_pp_ES	1.59

Surprisingly, none of the variance inflation factors are greater than five, so there are no multicollinearity issues to address for this model.
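
For readers unfamiliar with the statistic, the variance inflation factor of a predictor is \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. The sketch below computes it from scratch on synthetic data; `manual_vif` and the variables `x1`, `x2`, `x3` are hypothetical names used for illustration, and the report itself relies on `vif()` from the car package.

```r
# Variance inflation factors computed from their definition.
# Synthetic data; `manual_vif` is a hypothetical helper for illustration.
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent of the others
X <- data.frame(x1, x2, x3)

manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    # Regress predictor v on all remaining predictors
    aux <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- manual_vif(X)
print(round(vifs, 2))
```

A value near 1 means a predictor is nearly uncorrelated with the rest, while values above the common rule-of-thumb cutoff of five would signal problematic multicollinearity.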

Let’s look at some diagnostic plots for this model.

The residuals look relatively normal from the QQ-plot but there seems to be some clustering in the residuals vs fitted plot and we can see a slight curving.

We then create a basic multiple least squares linear model between the response and our three socioeconomic proxy variables: Temporary Housing Rate, Economic Need Index, and Chronic Absenteeism. We include Enrollment as well. The summary statistics of the socioeconomic model are shown below.

## 
## Call:
## lm(formula = proxy_formula, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45147 -0.08833  0.00416  0.08316  0.31536 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.045e+00  3.776e-02  27.675  < 2e-16 ***
## temp_housing_pct          -5.449e-01  1.193e-01  -4.565 6.72e-06 ***
## economic_need             -4.528e-01  6.124e-02  -7.394 8.97e-13 ***
## val_chronic_absent_hs_all -2.008e-01  3.805e-02  -5.279 2.18e-07 ***
## enrollment                 6.561e-06  9.261e-06   0.708    0.479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1261 on 385 degrees of freedom
## Multiple R-squared:  0.5198, Adjusted R-squared:  0.5149 
## F-statistic: 104.2 on 4 and 385 DF,  p-value: < 2.2e-16

Enrollment is not statistically significant, so we remove it and reprint a summary.

## 
## Call:
## lm(formula = college_rate ~ temp_housing_pct + economic_need + 
##     val_chronic_absent_hs_all, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45236 -0.08473  0.00471  0.08199  0.31786 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.05498    0.03494  30.192  < 2e-16 ***
## temp_housing_pct          -0.55304    0.11872  -4.659 4.39e-06 ***
## economic_need             -0.45928    0.06053  -7.588 2.46e-13 ***
## val_chronic_absent_hs_all -0.20141    0.03802  -5.298 1.97e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1261 on 386 degrees of freedom
## Multiple R-squared:  0.5192, Adjusted R-squared:  0.5155 
## F-statistic:   139 on 3 and 386 DF,  p-value: < 2.2e-16

We find our proxy socioeconomic model produces an adjusted R-squared of \(R^2_{adj} = 0.52\). We also check for multicollinearity within this model. Unlike with the base model based on the survey ratings, we do not expect any such issues with this model.

Variance Inflation Factors
	VIF Value
temp_housing_pct	2.08
economic_need	2.35
val_chronic_absent_hs_all	1.26

None of the variance inflation factors are greater than five, so there are no multicollinearity issues to address for this model.

We produce diagnostic plots for the model below.

We see no strong trend in the residuals vs. fitted plot, which suggests homoscedasticity (roughly constant error variance). The residuals also look to be normally distributed.

We can also test the assumption of normally-distributed residuals via a Shapiro-Wilk test for normality. Here we operate with the null \(H_0\) and alternative hypotheses \(H_a\):

\(H_0\): the error terms of the socioeconomic proxy model come from a normally-distributed population
\(H_a\): the error terms of the socioeconomic proxy model come from a population that is not normally distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  proxy_model$residuals
## W = 0.99654, p-value = 0.5668

Running the Shapiro-Wilk test for normality at a 5% significance level, we receive a p-value of 0.5668, which is higher than our threshold of 0.05, so we cannot reject our null hypothesis.

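Plotting a histogram of our proxy model’s residuals, we can confirm normality visually as well.

Finally, we fit a weighted least squares (WLS) model using the same three proxy variables. Each observation is weighted by the inverse of the squared fitted value from a regression of the proxy model’s absolute residuals on its fitted values. The summary is printed below: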

## 
## Call:
## lm(formula = proxy_formula, data = train, weights = weights)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4992 -0.8375  0.0549  0.8169  3.1001 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.03959    0.03078  33.776  < 2e-16 ***
## temp_housing_pct          -0.68658    0.12684  -5.413 1.09e-07 ***
## economic_need             -0.41046    0.05692  -7.211 2.96e-12 ***
## val_chronic_absent_hs_all -0.21632    0.03831  -5.646 3.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.244 on 386 degrees of freedom
## Multiple R-squared:  0.5589, Adjusted R-squared:  0.5554 
## F-statistic:   163 on 3 and 386 DF,  p-value: < 2.2e-16
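
The weighting scheme behind a weighted least squares fit of this kind can be sketched on synthetic data. The example below is an illustration under stated assumptions, not the report's data: the variables `x` and `y` are simulated so that the noise grows with `x`, and the weights are the inverse squared fitted values from a regression of the absolute OLS residuals on the OLS fitted values.

```r
# Weighted least squares with weights estimated from OLS residuals.
# Synthetic data with error variance that increases with x.
set.seed(7)
n <- 300
x <- runif(n, 0.1, 1)
y <- 1 - 0.5 * x + rnorm(n, sd = 0.05 + 0.2 * x)

# Step 1: ordinary least squares fit
ols <- lm(y ~ x)

# Step 2: model the size of the residuals as a function of the fitted values
spread_fit <- lm(abs(residuals(ols)) ~ fitted(ols))

# Step 3: weight each observation by the inverse of its estimated variance.
# This assumes the estimated spread is positive for every observation.
stopifnot(all(fitted(spread_fit) > 0))
w <- 1 / fitted(spread_fit)^2

# Step 4: refit with the weights
wls <- lm(y ~ x, weights = w)
print(coef(wls))
```

Observations with larger estimated spread receive smaller weights, so noisy regions of the data influence the coefficients less than they do under ordinary least squares.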

Experimentation and Results

Model Evaluation

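We compute the root mean squared error (RMSE) of each model on the 20% holdout set. In order, the values below correspond to the survey ratings model, the proxy variables model, and the weighted least squares model:

## [1] 0.1645167

## [1] 0.1425847

## [1] 0.1434627

Both models built on the proxy variables have a lower holdout RMSE than the survey ratings model. The unweighted proxy model (0.1426) is marginally ahead of the weighted least squares model (0.1435).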
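
A holdout RMSE of this kind is the square root of the mean squared difference between observed and predicted values on data the model has not seen. The sketch below computes it by hand on synthetic data; `holdout_rmse`, `train_toy`, and `test_toy` are hypothetical names for illustration, whereas the report uses `modelr::rmse()`.

```r
# Manual holdout RMSE on synthetic data with an 80/20 split.
set.seed(3)
dat <- data.frame(x = runif(120))
dat$y <- 0.9 - 0.4 * dat$x + rnorm(120, sd = 0.1)

# Hold out 20% of the rows for evaluation
idx <- sample(nrow(dat), size = round(0.8 * nrow(dat)))
train_toy <- dat[idx, ]
test_toy <- dat[-idx, ]

fit <- lm(y ~ x, data = train_toy)

# `holdout_rmse` is a hypothetical helper, not a library function
holdout_rmse <- function(model, newdata, response) {
  errors <- newdata[[response]] - predict(model, newdata = newdata)
  sqrt(mean(errors^2))
}

rmse_toy <- holdout_rmse(fit, test_toy, "y")
print(rmse_toy)
```

Because RMSE is expressed in the units of the response, a value of 0.14 on a persistence rate scaled from 0 to 1 corresponds to a typical error of about 14 percentage points.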

We can also use the Akaike and Bayesian Information Criteria to compare our models while accounting for their complexity.

## 
## Model selection based on AICc:
## 
##                        K    AICc Delta_AICc AICcWt Cum.Wt     LL
## Weighted Least-Squares 5 -508.84       0.00   0.96   0.96 259.50
## Proxy Variables        5 -502.43       6.41   0.04   1.00 256.29
## Survey Ratings         5 -326.91     181.93   0.00   1.00 168.53

## 
## Model selection based on BIC:
## 
##                        K     BIC Delta_BIC BICWt Cum.Wt     LL
## Weighted Least-Squares 5 -489.16      0.00  0.96   0.96 259.50
## Proxy Variables        5 -482.76      6.41  0.04   1.00 256.29
## Survey Ratings         5 -307.24    181.93  0.00   1.00 168.53

From our tables above, we can see smaller values of corrected AIC (which accounts for smaller sample sizes) and BIC for the proxy and WLS models than for our base survey rating model. This indicates that the WLS and proxy-variable models fit the data better after penalizing for model complexity.
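
As a worked check of the tabulated values, the standard definitions with log-likelihood \(LL\), \(K\) estimated parameters, and \(n\) observations are shown below. We assume \(n = 390\), inferred from the residual degrees of freedom in the model summaries, and \(K = 5\) as listed in the tables:

\[ AICc = -2\,LL + 2K + \frac{2K(K + 1)}{n - K - 1}, \qquad BIC = -2\,LL + K \ln(n) \]

\[ \text{Weighted Least-Squares: } AICc \approx -2(259.50) + 10 + \frac{60}{384} \approx -508.84 \]

\[ \text{Weighted Least-Squares: } BIC \approx -2(259.50) + 5 \ln(390) \approx -489.17 \]

These agree with the tables up to rounding of the reported log-likelihood. Since all three models have the same \(K\), the differences in AICc and BIC are driven entirely by the log-likelihoods.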

Conclusion

Overall, our model predicting a high school’s college persistence rate from socioeconomic proxy variables outperformed the model based on the NYC School Quality survey ratings. This is not to say that school ratings based on teacher, student, and parent responses are not valuable inputs. However, they should not be the sole basis upon which educational policy decisions are made, given the socioeconomic factors that most influence a school’s performance.

Some limitations of our approach come from conflation between socioeconomic factors, as well as the lack of a more robust imputation method. While our proxy variable model predicts college persistence better than one based on survey ratings, there could be error from the omission of unseen variables that are collinear with these inputs. This stems from the limited set of variables available in our source data. We used a predictive mean matching imputation method, native to the mice R package (@Flexible-imputation). While this allows for realistic imputed values (no imputed value will fall outside the range of observed data), the underlying population distribution of those values could be non-normal.

Future work could include joining in other academic performance metrics (average SAT/ACT scores, etc.) to see if our proxy variables also have predictive power. The dataset provided is indexed on a high school’s district borough number (DBN), which is present in several NYC Open Data datasets on education in New York City. As mentioned above, joining in other data sources to augment this data could be a good way to address the omission of any variables that better correlate with college persistence rates.

Overall, identifying the factors that most strongly correlate with academic performance and college persistence can improve educational policy design. In addition, the public availability of educational data through open data platforms supports the kind of analysis that informs this decision-making.

Appendices

Below is the code used to generate this report. It’s also available on GitHub here

knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
library(tidyverse)
library(gridExtra)
library(glue)
library(mice)
library(corrplot)
library(caret)
library(modelr)
library("papaja")
library(DataExplorer)
library(cowplot)
library(car)
library(AICcmodavg)
df <- read.csv("https://raw.githubusercontent.com/waheeb123/Data-621/main/Final%20Project/2020-2021_School_Quality_Reports_-_High_School_20240407.csv")
label_cols <- c("dbn", "school_name", "school_type")
# Convert needed columns to numeric typing
df <- cbind(df[, label_cols], as.data.frame(lapply(df[,!names(df) %in% label_cols], as.numeric)))

df$college_rate <- df$val_persist3_4yr_all
df$economic_need <- df$eni_hs_pct_912
remove <- c("discrete_columns", "continuous_columns",
            "total_observations", "memory_usage")
completeness <- introduce(df) |>
    select(-all_of(remove))
knitr::kable(t(completeness), caption = "Completeness Summary")


find_all_na_cols <- function(dframe){
    col_sums_na <- colSums(is.na(dframe))
    all_na_cols <- names(col_sums_na[col_sums_na == nrow(dframe)])
    all_na_cols
}

all_na_cols <- find_all_na_cols(df)
df <- df |>
    select(-all_of(all_na_cols))
all_na_cols <- as.data.frame(all_na_cols)
colnames(all_na_cols) <- c("All NA Columns")

# Display the dataframe containing columns with all missing values using kable
knitr::kable(all_na_cols, caption = "Columns with all missing values")


set.seed(42)

# Adding a 20% holdout of our input data for model evaluation later
train <- subset(df[sample(1:nrow(df)), ]) %>% sample_frac(0.8)
test  <- dplyr::anti_join(df, train, by = 'dbn')
cols <- c("survey_pp_CT", "survey_pp_RI",
          "survey_pp_ES", "survey_pp_SE",
          "survey_pp_SF", "survey_pp_TR",
          "temp_housing_pct", "economic_need",
          "college_rate", "enrollment",
          "val_chronic_absent_hs_all")
train_data <- train[, cols]
p1 <- plot_missing(train_data, missing_only = FALSE,
                   ggtheme = theme_classic(), title = "Missing Values")

# Plot missing value percentages by cols of interest
p1 <- p1 + 
    scale_fill_brewer(palette = "Paired")
p1
imp <- mice(train_data, method="pmm", seed=42, printFlag = FALSE)
train <- complete(imp)
test_data <- test[, cols]
imp <- mice(test_data, method="pmm", seed=42, printFlag = FALSE)
test <- complete(imp)
# Plot target variable distribution
ggplot(train, aes(x=college_rate)) + 
    geom_density() + 
    labs(x="4-Year College Persistence Rate",
         y="Density of NYC High Schools",
         title="Average 4-Year College Persistence Rates: NYC High Schools 2020-2021",
         caption="The average NYC high school sees ~50% of students go on to have 4-year college persistence.")

theme_set(theme_apa())
# Renaming training dataframe for correlation plot
train_renamed <- train %>%
  rename("Collaborative Teaching"=survey_pp_CT,
         "Rigorous Instruction"=survey_pp_RI,
         "Supportive Env"=survey_pp_SE,
         "Effective Leadership"=survey_pp_ES,
         "Family-Community Ties"=survey_pp_SF,
         "Trust"=survey_pp_TR,
         "Temporary Housing Pct"=temp_housing_pct,
         "Economic Need"=economic_need,
         "College Persistence"=college_rate,
         "Enrollment"=enrollment,
         "Chronic Absenteeism"=val_chronic_absent_hs_all)

# Create correlation plot between vars of interest
corMatrix <- cor(train_renamed)
corrplot(corMatrix, method="color", type="lower", tl.col="black", addCoef.col = "black", tl.cex = 0.7, number.cex = 0.6)

# Plot temp housing rates
pa <- ggplot(train, aes(x=temp_housing_pct)) +
    geom_density() +
    labs(x="% Students Temp Housing", y="Density")
# Plot economic need index
pb <- ggplot(train, aes(x=economic_need)) +
    geom_density() +
    labs(x="Economic Need Index", y="Density")
# Plot enrollment
pc <- ggplot(train, aes(x=enrollment)) +
    geom_density() +
    labs(x="Enrollment", y="Density")
# Plot chronic absenteeism
pd <- ggplot(train, aes(x=val_chronic_absent_hs_all)) +
    geom_density() +
    labs(x="Chronic Absenteeism", y="Density")
p <- plot_grid(pa, pb, pc, pd, nrow = 2, ncol = 2, align = "hv", axis = "t")
p

# Plot temp housing percentage vs college persistence rate
pa <- ggplot(train, aes(x=temp_housing_pct, y=college_rate)) +
  geom_point() +
  labs(x="% Students in Temp Housing",
       y="College Persist")
# Plot ENI vs college persistence rate
pb <- ggplot(train, aes(x=economic_need, y=college_rate)) +
  geom_point() +
  labs(x="Economic Need Index",
       y="College Persist")
pc <- ggplot(train, aes(x=enrollment, y=college_rate)) +
  geom_point() +
  labs(x="Enrollment",
       y="College Persist")
pd <- ggplot(train, aes(x=val_chronic_absent_hs_all, y=college_rate)) +
  geom_point() +
  labs(x="Chronic Absenteeism",
       y="College Persist")
p <- plot_grid(pa, pb, pc, pd, nrow = 2, ncol = 2, align = "hv", axis = "t")
p

base_formula <- college_rate ~ survey_pp_CT + survey_pp_RI + survey_pp_SE + survey_pp_ES + survey_pp_SF + survey_pp_TR
rating_model <- lm(base_formula,
                   train)
summary(rating_model)

rating_model <- update(rating_model, ~ . - survey_pp_CT - survey_pp_SF - survey_pp_TR)
summary(rating_model)

vif_df <- as.data.frame(vif(rating_model))
colnames(vif_df) <- c("VIF Value")
apa_table(vif_df, caption = "Variance Inflation Factors", placement = "H")

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(rating_model)

# Create OLS linear model based on our proxy variables: no transforms
proxy_formula <- college_rate ~ temp_housing_pct + economic_need + val_chronic_absent_hs_all + enrollment
proxy_model <- lm(proxy_formula, train)
summary(proxy_model)

proxy_model <- update(proxy_model, ~ . - enrollment)
summary(proxy_model)

vif_df <- as.data.frame(vif(proxy_model))
colnames(vif_df) <- c("VIF Value")
apa_table(vif_df, caption = "Variance Inflation Factors", placement = "H")

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(proxy_model)

# Test proxy model for normality of residuals
shapiro.test(proxy_model$residuals)
hist(proxy_model$residuals, xlab="Proxy Model Residuals")
# Calculating weights for WLS
weights <- 1 / lm(abs(proxy_model$residuals) ~ proxy_model$fitted.values)$fitted.values^2

#perform weighted least squares regression
proxy_formula <- proxy_model$call$formula
wls_model <- lm(proxy_formula, data = train, weights=weights)

summary(wls_model)
# Compute RMSE for each model on our testing data
modelr::rmse(rating_model, test)
modelr::rmse(proxy_model, test)
modelr::rmse(wls_model, test)
model_list <- list(rating_model, proxy_model, wls_model)
model_names <- c("Survey Ratings", "Proxy Variables", "Weighted Least-Squares")

# Print AIC results
aictab(model_list, modnames=model_names)

# Print BIC for each model
bictab(model_list, modnames=model_names)

¹ Economic hardship in this context is based on three criteria: whether the student 1) is eligible for public assistance from the NYC Human Resources Administration (HRA); 2) has lived in temporary housing in the past four years; 3) is in high school, has a home language other than English, and entered the NYC DOE for the first time within the last four years (@noauthor_student_2021).
² Chronic absenteeism is defined by the NYC DOE as “students who are absent 10 percent or more of the total days.”

DATA 621: Final Project

Waheeb Algabri, Joe Garcia, Lwin Shwe, Mikhail Broomes