Introduction

Dropping out of high school can significantly affect a person’s success and well-being. High school dropouts often face higher unemployment rates, earn lower salaries, and are at higher risk of incarceration. These consequences are not only personal; they extend to the community and the economy.

This analysis examines how the number of high school dropouts within a cohort varies with the cohort’s expected graduation timeline and the year the cohort began high school. Specifically, it compares cohorts that were expected to graduate in 4 years with those that were expected to graduate in 5 or 6 years. It also examines how the number of dropouts changed over time for cohorts that started high school each year between 2001 and 2015.

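# Loading packages: sim(), sim_ame(), and sim_adrf() come from clarify; dispersiontest() comes from AER
library(tidyverse)
library(clarify)
library(AER)
# Importing the data
Graduation <- read_csv("C:/Users/dijan/Documents/DATA 712/graduation_data.csv", show_col_types = FALSE)
# Looking at the data
dplyr::glimpse(Graduation)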
## Rows: 327
## Columns: 27
## $ Borough                                <chr> "Bronx", "Bronx", "Bronx", "Bro…
## $ Category                               <chr> "All Students", "All Students",…
## $ `Cohort Year`                          <dbl> 2015, 2014, 2013, 2012, 2011, 2…
## $ Cohort                                 <chr> "4 year August", "4 year August…
## $ `# Total Cohort`                       <dbl> 13891, 13951, 13730, 13838, 142…
## $ `# Grads`                              <dbl> 9752, 9398, 9102, 8985, 8821, 8…
## $ `% Grads`                              <dbl> 70.2, 67.4, 66.3, 64.9, 61.8, 5…
## $ `# Total Regents`                      <dbl> 8446, 8246, 8105, 8149, 8073, 7…
## $ `% Total Regents of Cohort`            <dbl> 60.8, 59.1, 59.0, 58.9, 56.5, 5…
## $ `% Total Regents of Grads`             <dbl> 86.6, 87.7, 89.0, 90.7, 91.5, 9…
## $ `# Advanced Regents`                   <dbl> 1579, 1584, 1548, 1505, 1494, 1…
## $ `% Advanced Regents of Cohort`         <dbl> 11.4, 11.4, 11.3, 10.9, 10.5, 1…
## $ `% Advanced Regents of Grads`          <dbl> 16.2, 16.9, 17.0, 16.8, 16.9, 1…
## $ `# Regents without Advanced`           <dbl> 6867, 6662, 6557, 6644, 6579, 6…
## $ `% Regents without Advanced of Cohort` <dbl> 49.4, 47.8, 47.8, 48.0, 46.1, 4…
## $ `% Regents without Advanced of Grads`  <dbl> 70.4, 70.9, 72.0, 73.9, 74.6, 7…
## $ `# Local`                              <dbl> 1306, 1152, 997, 836, 748, 710,…
## $ `% Local of Cohort`                    <dbl> 9.4, 8.3, 7.3, 6.0, 5.2, 5.0, 4…
## $ `% Local of Grads`                     <dbl> 13.4, 12.3, 11.0, 9.3, 8.5, 8.4…
## $ `# Still Enrolled`                     <dbl> 2124, 2632, 2742, 2876, 3243, 3…
## $ `% Still Enrolled`                     <dbl> 15.3, 18.9, 20.0, 20.8, 22.7, 2…
## $ `# Dropout`                            <dbl> 1759, 1693, 1606, 1757, 1866, 2…
## $ `% Dropout`                            <dbl> 12.7, 12.1, 11.7, 12.7, 13.1, 1…
## $ `# SACC (IEP Diploma)`                 <dbl> 80, 64, 118, 100, 207, 240, 273…
## $ `% SACC (IEP Diploma) of Cohort`       <dbl> 0.6, 0.5, 0.9, 0.7, 1.4, 1.7, 1…
## $ `# TASC (GED)`                         <dbl> 175, 164, 151, 110, 126, 144, 1…
## $ `% TASC (GED) of Cohort`               <dbl> 1.3, 1.2, 1.1, 0.8, 0.9, 1.0, 1…
# Renaming variables
Graduation <- Graduation %>%
  rename(Dropout = `# Dropout`,
         Cohort_year = `Cohort Year`)

# Removing the rows where the borough is "District 79"
Graduation <- Graduation %>%
  filter(Borough != "District 79")
# Convert 'Cohort' into a binary variable (0 for cohorts who were expected to graduate in 4 years, 1 for cohorts who were expected to graduate in 5 or 6 years)
Graduation <- Graduation %>%
  mutate(Expected_graduation = case_when(
    Cohort %in% c("4 year August", "4 year June") ~ 0,
    TRUE ~ 1
  ))

Analysis

# Model predicting dropout outcome by expected graduation
m1 <- glm(Dropout ~ Expected_graduation, family = poisson, data = Graduation)
summary(m1)
## 
## Call:
## glm(formula = Dropout ~ Expected_graduation, family = poisson, 
##     data = Graduation)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         7.378984   0.002191  3367.5   <2e-16 ***
## Expected_graduation 0.443781   0.002651   167.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 216902  on 309  degrees of freedom
## Residual deviance: 187515  on 308  degrees of freedom
## AIC: 190395
## 
## Number of Fisher Scoring iterations: 4

This Poisson regression model shows that expected graduation time is strongly associated with the number of students who drop out. The coefficient for expected graduation is 0.4438 on the log scale, which corresponds to a rate ratio of exp(0.4438) ≈ 1.56: cohorts expected to graduate in 5 or 6 years had about 56% more dropouts than cohorts expected to graduate in 4 years. The p-value is below 2e-16, indicating strong statistical significance.
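The conversion from a log-scale coefficient to a rate ratio can be sketched in a few lines of R. The estimate and standard error below are copied from the Model 1 summary above; the Wald interval is an illustrative addition and is not part of the original analysis.

```r
# Estimate and standard error for Expected_graduation, copied from summary(m1)
estimate  <- 0.443781
std_error <- 0.002651

# Exponentiating a Poisson coefficient gives a multiplicative rate ratio
rate_ratio <- exp(estimate)

# Approximate 95% Wald interval, built on the log scale and then exponentiated
wald_interval <- exp(estimate + c(-1, 1) * qnorm(0.975) * std_error)

# Percentage difference in expected dropouts between the two groups
percent_more <- (rate_ratio - 1) * 100

print(round(rate_ratio, 3))
print(round(wald_interval, 3))
print(round(percent_more, 1))
```

With a fitted model object in memory, the same ratio would come from exp(coef(m1)).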

# Average Marginal Effects for expected graduation using Model 1
sim_coefs1 <- sim(m1)
sim_est1 <- sim_ame(sim_coefs1, var = "Expected_graduation",
                    contrast = "rd")
summary(sim_est1)
##         Estimate 2.5 % 97.5 %
## E[Y(0)]     1602  1595   1609
## E[Y(1)]     2497  2490   2505
## RD           895   884    905

These average marginal effects from Model 1 give the expected number of dropouts for each expected graduation time. For cohorts expected to graduate in 4 years, the expected number of dropouts is 1602. For cohorts expected to graduate in 5 or 6 years, it is 2497. The difference between the two groups is 895 dropouts, and its 95% interval (884 to 905) excludes zero, so the difference is statistically significant.

# Model predicting dropout outcome by expected graduation and cohort year
m2 <- glm(Dropout ~ Expected_graduation + Cohort_year, family = poisson, data = Graduation)
summary(m2)
## 
## Call:
## glm(formula = Dropout ~ Expected_graduation + Cohort_year, family = poisson, 
##     data = Graduation)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         112.684562   0.634614   177.6   <2e-16 ***
## Expected_graduation   0.398569   0.002664   149.6   <2e-16 ***
## Cohort_year          -0.052432   0.000316  -165.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 216902  on 309  degrees of freedom
## Residual deviance: 160046  on 307  degrees of freedom
## AIC: 162928
## 
## Number of Fisher Scoring iterations: 4

This Poisson regression model shows that both expected graduation time and cohort year are strongly associated with the number of students who drop out. The coefficient for expected graduation is 0.3986, meaning that, holding cohort year constant, cohorts expected to graduate in 5 or 6 years had about 49% more dropouts (exp(0.3986) ≈ 1.49) than those expected to graduate in 4 years. The coefficient for cohort year is -0.0524, which corresponds to a decrease of about 5.1% in the expected number of dropouts with each subsequent cohort year (exp(-0.0524) ≈ 0.949). The p-values for both variables are below 2e-16, indicating strong statistical significance.
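Because the model uses a log link, the cohort-year coefficient compounds multiplicatively across years. The short sketch below uses only the Cohort_year estimate copied from the Model 2 summary and shows the per-year percentage change and the implied change over the 14 years separating the 2001 and 2015 cohorts.

```r
# Cohort_year coefficient, copied from summary(m2)
beta_year <- -0.052432

# Percentage change in expected dropouts per one-year increase in cohort year
per_year_pct <- (exp(beta_year) - 1) * 100

# Multiplicative change implied between the 2001 and 2015 cohorts
years_apart      <- 2015 - 2001
cumulative_ratio <- exp(beta_year * years_apart)

print(round(per_year_pct, 2))
print(round(cumulative_ratio, 3))
```

A cumulative ratio a little under one half means the model implies that the expected number of dropouts roughly halved over the period.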

# Average Marginal Effects for expected graduation using Model 2
sim_coefs2 <- sim(m2)
sim_est2 <- sim_ame(sim_coefs2, var = "Expected_graduation",
                    contrast = "rd")
summary(sim_est2)
##         Estimate 2.5 % 97.5 %
## E[Y(0)]     1645  1638   1652
## E[Y(1)]     2450  2444   2458
## RD           806   796    816

These average marginal effects from Model 2 give the expected number of dropouts by expected graduation time, averaged over the observed cohort years. For cohorts expected to graduate in 4 years, the expected number of dropouts is 1645. For cohorts expected to graduate in 5 or 6 years, it is 2450. The difference between the two groups is 806 dropouts. Unlike Model 1, this estimate accounts for changes in dropout patterns over time. The 95% interval for the difference (796 to 816) excludes zero.
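The point estimates of an average marginal effect for a binary variable can be reproduced by hand: predict every row with the variable set to 0, predict again with it set to 1, and average each set of predictions. The sketch below does this on simulated data. The data frame toy and its columns are hypothetical and are not the graduation data, and this is not clarify's implementation, which additionally simulates coefficients to obtain the intervals.

```r
set.seed(712)

# Hypothetical toy data: a binary group indicator and a cohort year
n   <- 200
toy <- data.frame(
  group = rbinom(n, 1, 0.5),
  year  = sample(2001:2015, n, replace = TRUE)
)
toy$count <- rpois(n, lambda = exp(7.4 + 0.4 * toy$group - 0.05 * (toy$year - 2001)))

fit <- glm(count ~ group + year, family = poisson, data = toy)

# Average prediction with every row set to group = 0, then to group = 1
mean_if_0 <- mean(predict(fit, newdata = transform(toy, group = 0), type = "response"))
mean_if_1 <- mean(predict(fit, newdata = transform(toy, group = 1), type = "response"))

# The difference of the two averages plays the role of RD in the summaries above
risk_difference <- mean_if_1 - mean_if_0
print(round(c(mean_if_0, mean_if_1, risk_difference)))
```

In a log-link model without interactions, the ratio of the two averages equals the exponentiated group coefficient, which is why the ratios of the E[Y(1)] and E[Y(0)] estimates track the rate ratios reported for the models.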

# Effect of cohort year
sim_est2a <- sim_ame(sim_coefs2, var = "Cohort_year",
                    contrast = "rd")
## Warning: `contrast` is ignored when the focal variable is continuous.
summary(sim_est2a)
##                      Estimate 2.5 % 97.5 %
## E[dY/d(Cohort_year)]     -111  -113   -110

This average marginal effect summarizes the association between cohort year and the expected number of dropouts under Model 2. On average, each one-year increase in cohort year is associated with about 111 fewer expected dropouts (95% interval: 110 to 113 fewer). This suggests that later cohorts experienced fewer dropouts.

# Dose-Response relationship prediction for cohort year
sim_est2b <- sim_adrf(sim_coefs2, var = "Cohort_year",
                    contrast = "adrf")

summary(sim_est2b)
##              Estimate 2.5 % 97.5 %
## E[Y(2001)]       3036  3022   3051
## E[Y(2001.7)]     2927  2914   2940
## E[Y(2002.4)]     2822  2810   2833
## E[Y(2003.1)]     2720  2710   2730
## E[Y(2003.8)]     2622  2613   2631
## E[Y(2004.5)]     2527  2520   2536
## E[Y(2005.2)]     2436  2430   2444
## E[Y(2005.9)]     2348  2343   2355
## E[Y(2006.6)]     2264  2259   2270
## E[Y(2007.3)]     2182  2177   2188
## E[Y(2008)]       2104  2099   2109
## E[Y(2008.7)]     2028  2023   2033
## E[Y(2009.4)]     1955  1950   1960
## E[Y(2010.1)]     1884  1879   1890
## E[Y(2010.8)]     1816  1811   1822
## E[Y(2011.5)]     1751  1745   1757
## E[Y(2012.2)]     1688  1682   1694
## E[Y(2012.9)]     1627  1620   1634
## E[Y(2013.6)]     1568  1561   1576
## E[Y(2014.3)]     1512  1505   1519
## E[Y(2015)]       1457  1450   1465
plot(sim_est2b)

These estimates give the average dose-response function for cohort year under Model 2, that is, the expected number of dropouts at each cohort year. The expected number of dropouts decreases gradually over time, from 3036 at 2001 to 1457 at 2015. This suggests that the number of dropouts declined steadily across cohorts. The 95% intervals are narrow, and the intervals for the earliest and latest cohort years do not overlap.

# Dose-Response relationship effect for cohort year
sim_est2b <- sim_adrf(sim_coefs2, var = "Cohort_year",
                    contrast = "amef")

summary(sim_est2b)
##                             Estimate  2.5 % 97.5 %
## E[dY/d(Cohort_year)|2001]     -159.2 -161.8 -156.6
## E[dY/d(Cohort_year)|2001.7]   -153.5 -155.9 -151.0
## E[dY/d(Cohort_year)|2002.4]   -147.9 -150.2 -145.6
## E[dY/d(Cohort_year)|2003.1]   -142.6 -144.7 -140.5
## E[dY/d(Cohort_year)|2003.8]   -137.5 -139.4 -135.5
## E[dY/d(Cohort_year)|2004.5]   -132.5 -134.4 -130.6
## E[dY/d(Cohort_year)|2005.2]   -127.7 -129.5 -126.0
## E[dY/d(Cohort_year)|2005.9]   -123.1 -124.7 -121.5
## E[dY/d(Cohort_year)|2006.6]   -118.7 -120.2 -117.2
## E[dY/d(Cohort_year)|2007.3]   -114.4 -115.8 -113.0
## E[dY/d(Cohort_year)|2008]     -110.3 -111.6 -109.0
## E[dY/d(Cohort_year)|2008.7]   -106.3 -107.5 -105.1
## E[dY/d(Cohort_year)|2009.4]   -102.5 -103.6 -101.3
## E[dY/d(Cohort_year)|2010.1]    -98.8  -99.9  -97.7
## E[dY/d(Cohort_year)|2010.8]    -95.2  -96.2  -94.3
## E[dY/d(Cohort_year)|2011.5]    -91.8  -92.7  -90.9
## E[dY/d(Cohort_year)|2012.2]    -88.5  -89.3  -87.7
## E[dY/d(Cohort_year)|2012.9]    -85.3  -86.1  -84.5
## E[dY/d(Cohort_year)|2013.6]    -82.2  -82.9  -81.5
## E[dY/d(Cohort_year)|2014.3]    -79.3  -79.9  -78.6
## E[dY/d(Cohort_year)|2015]      -76.4  -77.0  -75.8
plot(sim_est2b)

These estimates show how the marginal effect of cohort year on the expected number of dropouts varies across cohort years. The effect is negative throughout but shrinks in magnitude, from a decrease of 159.2 dropouts per year at 2001 to a decrease of 76.4 dropouts per year at 2015. Because the model uses a log link, a constant proportional decline translates into smaller absolute declines as the expected count falls. Every 95% interval excludes zero.
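The shrinking marginal effect follows directly from the log link: the derivative of the expected count with respect to cohort year is the coefficient multiplied by the expected count itself. A quick check using only numbers reported above (the Model 2 coefficient and the expected counts at 2001 and 2015):

```r
# Cohort_year coefficient, copied from summary(m2)
beta_year <- -0.052432

# Expected dropouts at the first and last cohort years, copied from the dose-response output
expected_2001 <- 3036
expected_2015 <- 1457

# Under a log link, dE[Y]/d(year) = beta * E[Y]
slope_2001 <- beta_year * expected_2001
slope_2015 <- beta_year * expected_2015

print(round(c(slope_2001, slope_2015), 1))
```

Both products agree, after rounding, with the first and last rows of the marginal-effect table above.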

# Overdispersion test for Model 2
dispersiontest(m2)
## 
##  Overdispersion test
## 
## data:  m2
## z = 13.39, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   409.1987

The overdispersion test for the Poisson regression model indicates that the dispersion parameter is significantly greater than 1. The test statistic (z = 13.39) and the p-value (below 2.2e-16) strongly suggest that the data exhibit overdispersion. The estimated dispersion is 409.1987, indicating that the variance of the outcome is far higher than expected under the Poisson distribution, where the mean and variance are equal. As a result, the Poisson standard errors and confidence intervals reported above are likely too narrow. A Poisson model is therefore not a good fit for these data, and an alternative such as negative binomial regression is more appropriate.
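A common informal cross-check for overdispersion is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom, which should be close to 1 for a well-specified Poisson model. The sketch below illustrates it on simulated data (the variables x and y are hypothetical, not the graduation data). It is a different estimator from the one dispersiontest() reports, so the two numbers need not match.

```r
set.seed(712)

# Hypothetical overdispersed counts: negative binomial draws with a binary predictor
x <- rbinom(300, 1, 0.5)
y <- rnbinom(300, mu = exp(3 + 0.4 * x), size = 3)

# Fit a Poisson model even though the data are overdispersed
pois_fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: values well above 1 indicate overdispersion
pearson_dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
print(pearson_dispersion)
```

The same two lines applied to m2 would give a comparable diagnostic for the model in this analysis.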

# Negative binomial regression
m3 <- MASS::glm.nb(Dropout ~ Expected_graduation + Cohort_year, data = Graduation)
summary(m3)
## 
## Call:
## MASS::glm.nb(formula = Dropout ~ Expected_graduation + Cohort_year, 
##     data = Graduation, init.theta = 3.011356575, link = log)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         117.304644  16.923254   6.932 4.16e-12 ***
## Expected_graduation   0.397519   0.066815   5.950 2.69e-09 ***
## Cohort_year          -0.054733   0.008424  -6.497 8.20e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.0114) family taken to be 1)
## 
##     Null deviance: 411.35  on 309  degrees of freedom
## Residual deviance: 327.08  on 307  degrees of freedom
## AIC: 5193.2
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.011 
##           Std. Err.:  0.230 
## 
##  2 x log-likelihood:  -5185.157

This negative binomial regression model shows that both expected graduation time and cohort year remain significantly associated with the number of dropouts. The coefficient for expected graduation is 0.3975, meaning that cohorts expected to graduate in 5 or 6 years had about 49% more dropouts (exp(0.3975) ≈ 1.49) than those expected to graduate in 4 years. The coefficient for cohort year is -0.0547, indicating that each subsequent cohort year is associated with a decrease of about 5.3% in the expected number of dropouts. The point estimates are close to those from the Poisson model, but the standard errors are much larger because the model accounts for overdispersion. The AIC falls from 162928 for the Poisson model to 5193.2, indicating a much better fit.
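The way the negative binomial model absorbs overdispersion can be seen from its variance function. In the parameterization used by MASS::glm.nb, the variance is mu + mu^2 / theta, compared with a variance of mu under the Poisson model. The sketch below plugs in the estimated theta from the summary above together with a round, purely illustrative mean of 2000 dropouts.

```r
# Variance of a negative binomial count with mean mu and dispersion parameter theta
nb_variance <- function(mu, theta) {
  mu + mu^2 / theta
}

theta_hat       <- 3.011  # Theta copied from summary(m3)
illustrative_mu <- 2000   # round number chosen for illustration, not an estimate

# Ratio of the negative binomial variance to the Poisson variance (which equals mu)
variance_ratio <- nb_variance(illustrative_mu, theta_hat) / illustrative_mu
print(round(variance_ratio, 1))
```

A ratio in the hundreds at this mean shows why the negative binomial intervals are so much wider than the Poisson intervals.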

# Effect of expected graduation
sim_coefs3 <- sim(m3)
sim_est3 <- sim_ame(sim_coefs3, var = "Expected_graduation",
                    contrast = "rd")
summary(sim_est3)
##         Estimate 2.5 % 97.5 %
## E[Y(0)]     1648  1488   1828
## E[Y(1)]     2452  2256   2665
## RD           804   553   1074

These average marginal effects give the expected number of dropouts by expected graduation time under the negative binomial model. For cohorts expected to graduate in 4 years, the expected number of dropouts is 1648. For those expected to graduate in 5 or 6 years, it is 2452. The difference between the two groups is 804 dropouts. The 95% interval (553 to 1074) is much wider than under the Poisson model but still excludes zero, so the difference is statistically significant.

# Effect of cohort year
sim_est3a <- sim_ame(sim_coefs3, var = "Cohort_year",
                    contrast = "rd")
## Warning: `contrast` is ignored when the focal variable is continuous.
summary(sim_est3a)
##                      Estimate 2.5 % 97.5 %
## E[dY/d(Cohort_year)]     -116  -153    -79

This average marginal effect gives the change in the expected number of dropouts per cohort year under the negative binomial model. On average, each subsequent cohort year is associated with about 116 fewer expected dropouts. The 95% interval (79 to 153 fewer) excludes zero, so the result is statistically significant.

# Dose-response relationship: prediction
sim_est3b <- sim_adrf(sim_coefs3, var = "Cohort_year",
                    contrast = "adrf")

summary(sim_est3b)
##              Estimate 2.5 % 97.5 %
## E[Y(2001)]       3085  2698   3531
## E[Y(2001.7)]     2969  2622   3362
## E[Y(2002.4)]     2857  2548   3204
## E[Y(2003.1)]     2750  2480   3057
## E[Y(2003.8)]     2646  2407   2914
## E[Y(2004.5)]     2547  2334   2780
## E[Y(2005.2)]     2451  2265   2661
## E[Y(2005.9)]     2359  2195   2540
## E[Y(2006.6)]     2270  2119   2426
## E[Y(2007.3)]     2185  2040   2334
## E[Y(2008)]       2103  1971   2239
## E[Y(2008.7)]     2024  1897   2154
## E[Y(2009.4)]     1948  1821   2082
## E[Y(2010.1)]     1875  1743   2018
## E[Y(2010.8)]     1804  1668   1956
## E[Y(2011.5)]     1736  1599   1897
## E[Y(2012.2)]     1671  1528   1842
## E[Y(2012.9)]     1608  1457   1793
## E[Y(2013.6)]     1548  1389   1744
## E[Y(2014.3)]     1490  1328   1693
## E[Y(2015)]       1434  1268   1646
plot(sim_est3b)

These estimates give the expected number of dropouts at each cohort year under the negative binomial model. The expected number of dropouts decreases gradually over time, from 3085 at 2001 to 1434 at 2015. This suggests that the number of dropouts declined steadily across cohorts. The 95% intervals for 2001 (2698 to 3531) and 2015 (1268 to 1646) do not overlap.

# Dose-response relationship: effect
sim_est3b <- sim_adrf(sim_coefs3, var = "Cohort_year",
                    contrast = "amef")

summary(sim_est3b)
##                             Estimate  2.5 % 97.5 %
## E[dY/d(Cohort_year)|2001]     -168.8 -243.4 -103.3
## E[dY/d(Cohort_year)|2001.7]   -162.5 -231.7 -100.6
## E[dY/d(Cohort_year)|2002.4]   -156.4 -220.6  -98.0
## E[dY/d(Cohort_year)|2003.1]   -150.5 -210.1  -95.4
## E[dY/d(Cohort_year)|2003.8]   -144.8 -200.2  -92.8
## E[dY/d(Cohort_year)|2004.5]   -139.4 -190.7  -90.3
## E[dY/d(Cohort_year)|2005.2]   -134.2 -181.6  -87.9
## E[dY/d(Cohort_year)|2005.9]   -129.1 -173.0  -85.5
## E[dY/d(Cohort_year)|2006.6]   -124.3 -165.0  -83.2
## E[dY/d(Cohort_year)|2007.3]   -119.6 -157.1  -81.0
## E[dY/d(Cohort_year)|2008]     -115.1 -149.7  -78.8
## E[dY/d(Cohort_year)|2008.7]   -110.8 -142.6  -76.8
## E[dY/d(Cohort_year)|2009.4]   -106.6 -136.1  -74.7
## E[dY/d(Cohort_year)|2010.1]   -102.6 -129.6  -72.8
## E[dY/d(Cohort_year)|2010.8]    -98.7 -123.3  -70.9
## E[dY/d(Cohort_year)|2011.5]    -95.0 -117.2  -69.0
## E[dY/d(Cohort_year)|2012.2]    -91.5 -111.7  -67.2
## E[dY/d(Cohort_year)|2012.9]    -88.0 -106.4  -65.5
## E[dY/d(Cohort_year)|2013.6]    -84.7 -101.4  -63.8
## E[dY/d(Cohort_year)|2014.3]    -81.5  -96.8  -62.1
## E[dY/d(Cohort_year)|2015]      -78.5  -92.4  -60.5
plot(sim_est3b)

These estimates show how the marginal effect of cohort year varies across cohort years under the negative binomial model. At 2001, one additional cohort year is associated with 168.8 fewer expected dropouts; the effect shrinks in magnitude over time, reaching 78.5 fewer expected dropouts per year at 2015. These results indicate a consistent decline in the number of dropouts whose absolute size diminishes as the expected count falls. Every 95% interval excludes zero, so the effect is statistically significant at all cohort years.

Discussion

The goal of this analysis was to investigate how expected graduation time and cohort year relate to the number of high school dropouts. The results show significant differences in the number of dropouts along both dimensions. Cohorts expected to graduate in 5 or 6 years had more dropouts than those expected to graduate in 4 years, with an average difference of about 804 dropouts under the negative binomial model (95% interval: 553 to 1074). This finding suggests that longer graduation timelines may be associated with more dropouts, possibly due to disengagement over time.

The analysis also revealed a consistent decline in the number of dropouts across cohorts from 2001 to 2015. Under the negative binomial model, the expected number of dropouts decreased from 3085 for the 2001 cohort to 1434 for the 2015 cohort. This decline may reflect broader reforms in education or increased engagement over the years. Because the models use dropout counts rather than percentages, these results describe numbers of dropouts and do not adjust for cohort size.

The statistical significance of these findings suggests that both expected graduation time and cohort year are important for understanding dropout trends. Future research should explore additional factors that may influence these trends, such as changes in educational policies, school resources, and community engagement.