Preschool is a form of early childhood education offered to children between three and five years of age before they enter primary school. It is designed to develop children’s cognitive and behavioral skills, and thereby to prepare them to learn better during their school years and graduate from high school successfully, which may give it a lasting, longitudinal effect. Although a number of past studies suggest that publicly funded early childhood education programs, especially Head Start, the High/Scope Perry Preschool program, and Universal Pre-K, have succeeded in encouraging families nationwide to send their children to preschool, wide gaps in preschool attendance remain across socio-demographic groups.

The purpose of this project is to analyze how the likelihood of a person’s enrollment in a formal preschool program in New York varies by socio-demographic characteristics such as sex, race, disability status, and parents’ employment status.

The sample data to be analyzed are derived from the 2017 Public Use Microdata Sample (PUMS). PUMS is part of the American Community Survey (ACS), a survey that provides yearly socio-demographic and financial information about the states and their residents. The data are collected on an ongoing basis, January through December, to give every state the information it needs to make important decisions. The sample used in this analysis consists of 9071 observations.

The variables to be analyzed are as follows:

  (i) PRESCHL (Preschool Completion):
      0. Never participated in preschool education
      1. Completed preschool education
      
  (ii) RAC1P (Race):
      1. White alone
      2. Black or African American alone
      3. American Indian alone
      4. Alaska Native alone
      5. American Indian and Alaska Native tribes specified
      6. Asian alone
      7. Native Hawaiian and Other Pacific Islander alone
      8. Some other race alone
      9. Two or more races
      
  (iii) SEX:
      1. Male
      2. Female
      
  (iv) ESP (Employment Status of Parents):
      1. Living with two parents: both parents in labor force
      2. Living with two parents: father only in labor force
      3. Living with two parents: mother only in labor force
      4. Living with two parents: neither parent in labor force
      5. Living with father: father in labor force
      6. Living with father: father not in labor force
      7. Living with mother: mother in labor force
      8. Living with mother: mother not in labor force
      
   (v) DIS (Disability Status):
      1. With a disability
      2. Without a disability
      
      

To begin the analysis, let’s first import the dataset:

library(readr)
PUMS_NY <- read_csv("C:/Users/Nusrat/Desktop/MA - 3rd semester, Spring 19/SOC 791 - Independent Research (with Python)/PUMS Dataset/PUMS_NY.csv")
## Warning: Missing column names filled in: 'X8' [8]
## Parsed with column specification:
## cols(
##   PRESCHL = col_double(),
##   SEX = col_double(),
##   ESP = col_double(),
##   RAC1P = col_double(),
##   NOP = col_double(),
##   CIT = col_double(),
##   DIS = col_double(),
##   X8 = col_logical()
## )
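
The import warning above points to an unnamed, empty eighth column (X8), and the later models drop rows with missing values. A quick check of this kind can be sketched as follows. The `toy` data frame is a simulated stand-in with made-up values, not the real PUMS_NY file, and the column subset is chosen only for illustration.

```r
# Simulated stand-in for PUMS_NY (hypothetical values, not the real data)
set.seed(1)
n <- 200
toy <- data.frame(
  PRESCHL = rbinom(n, 1, 0.4),
  SEX     = sample(1:2, n, replace = TRUE),
  ESP     = sample(c(1:8, NA), n, replace = TRUE),  # some values missing
  X8      = NA                                      # empty trailing column
)

# Identify and drop columns that contain nothing but NA
all_na <- vapply(toy, function(col) all(is.na(col)), logical(1))
toy <- toy[, !all_na, drop = FALSE]

# Count missing values in each remaining column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Running the same two steps on the imported data would show which variables account for the observations that the models later drop.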

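Regression Analysis (Linear Probability Models)

The models below are fit with lm(), so they are linear probability models rather than logistic regressions: each coefficient estimates the change in the probability of preschool completion per one-unit change in the predictor’s numeric code.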

Model 1: Relationship Between Preschool Completion and Race of Participant

Let’s create a simple model to predict preschool completion from the race of the participant:

library(Zelig)
## Loading required package: survival
m1 <- lm(PRESCHL ~ RAC1P, data = PUMS_NY)
summary(m1)
## 
## Call:
## lm(formula = PRESCHL ~ RAC1P, data = PUMS_NY)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3394 -0.3394 -0.2436  0.6606  0.8139 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.358590   0.006871   52.19   <2e-16 ***
## RAC1P       -0.019161   0.001735  -11.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4571 on 9068 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01326,    Adjusted R-squared:  0.01316 
## F-statistic: 121.9 on 1 and 9068 DF,  p-value: < 2.2e-16

Result: The test suggests that race is a statistically significant predictor of a participant’s preschool completion at the 0.1% significance level (p < 0.001), although the model explains only about 1.3% of the variance in the outcome.
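
The RAC1P codes are labels for categories rather than quantities, so entering the variable as a number estimates a single slope across the codes. A common alternative is to convert it to a factor so that each category gets its own coefficient. The sketch below illustrates the difference on simulated data; the `toy` data frame, the subset of race codes, and the shortened labels are all assumptions made for illustration.

```r
# Simulated data (hypothetical; not the PUMS sample)
set.seed(2)
n <- 500
toy <- data.frame(
  RAC1P   = sample(c(1, 2, 6, 8, 9), n, replace = TRUE),
  PRESCHL = rbinom(n, 1, 0.4)
)

# Convert the numeric codes to a labelled factor; "White" is the reference level
toy$RACE <- factor(toy$RAC1P,
                   levels = c(1, 2, 6, 8, 9),
                   labels = c("White", "Black", "Asian", "Other", "Two or more"))

fit_num <- lm(PRESCHL ~ RAC1P, data = toy)  # one slope for the numeric code
fit_cat <- lm(PRESCHL ~ RACE,  data = toy)  # one contrast per non-reference level

length(coef(fit_num))  # intercept + 1 slope
length(coef(fit_cat))  # intercept + 4 category contrasts
```

With the factor coding, each coefficient is the difference in completion probability between that category and the reference category.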

Model 2: Relationship Between Preschool Completion and Employment Status of Parents

Now let’s create another simple model that predicts preschool completion from the employment status of the participant’s parents:

m2 <- lm(PRESCHL ~ ESP, data = PUMS_NY)
summary(m2)
## 
## Call:
## lm(formula = PRESCHL ~ ESP, data = PUMS_NY)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5056 -0.4859 -0.3871  0.4944  0.6327 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.525382   0.010249  51.261  < 2e-16 ***
## ESP         -0.019755   0.002618  -7.547 5.18e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4964 on 5625 degrees of freedom
##   (3444 observations deleted due to missingness)
## Multiple R-squared:  0.01002,    Adjusted R-squared:  0.009847 
## F-statistic: 56.95 on 1 and 5625 DF,  p-value: 5.184e-14

Result: The test suggests that the employment status of parents is a statistically significant predictor of a participant’s preschool completion at the 0.1% significance level (p < 0.001).

Model 3: Relationship Between Participant’s Race and Parent’s Employment Status on Preschool Completion

Now let’s create a third model that predicts preschool completion from the race of the participant as well as the parents’ employment status:

m3 <- lm(PRESCHL ~ RAC1P+ESP, data = PUMS_NY)
summary(m3)
## 
## Call:
## lm(formula = PRESCHL ~ RAC1P + ESP, data = PUMS_NY)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5176 -0.4989 -0.3585  0.5011  0.6884 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.545648   0.011548  47.252  < 2e-16 ***
## RAC1P       -0.009425   0.002485  -3.793 0.000151 ***
## ESP         -0.018658   0.002631  -7.092 1.48e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4959 on 5624 degrees of freedom
##   (3444 observations deleted due to missingness)
## Multiple R-squared:  0.01255,    Adjusted R-squared:  0.0122 
## F-statistic: 35.74 on 2 and 5624 DF,  p-value: 3.785e-16

Result: When entered together, race of the participant and employment status of parents are each significantly related to preschool completion at the 0.1% significance level (p < 0.001).

Model 4: Relationship Between Participant’s Race, Sex, Disability Status and Parent’s Employment Status on Preschool Completion

We’ll now create a more complex model that predicts preschool completion from the race, sex, and disability status of the participant as well as the parents’ employment status:

m4 <- lm(PRESCHL ~ RAC1P+ESP+SEX+DIS, data = PUMS_NY)
summary(m4)
## 
## Call:
## lm(formula = PRESCHL ~ RAC1P + ESP + SEX + DIS, data = PUMS_NY)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7452 -0.4933 -0.3496  0.5019  0.6985 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.992288   0.102114   9.717  < 2e-16 ***
## RAC1P       -0.009487   0.002481  -3.824 0.000133 ***
## ESP         -0.019323   0.002630  -7.346 2.34e-13 ***
## SEX          0.004774   0.013213   0.361 0.717900    
## DIS         -0.227796   0.050305  -4.528 6.07e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.495 on 5622 degrees of freedom
##   (3444 observations deleted due to missingness)
## Multiple R-squared:  0.01615,    Adjusted R-squared:  0.01545 
## F-statistic: 23.06 on 4 and 5622 DF,  p-value: < 2.2e-16

Result: Race, parents’ employment status, and disability status are each significantly related to preschool completion at the 0.1% significance level (p < 0.001), whereas sex is not a significant predictor (p = 0.72).
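
Because PRESCHL is coded 0/1, the same specification can also be fit as a logistic regression with glm() and family = binomial, which keeps predicted probabilities between 0 and 1. The following is a minimal sketch on simulated data: the `toy` data frame and the coefficients used to generate the outcome are made up for illustration and are not PUMS results.

```r
# Simulated data (hypothetical; not the PUMS sample)
set.seed(3)
n <- 1000
toy <- data.frame(
  RAC1P = sample(1:9, n, replace = TRUE),
  ESP   = sample(1:8, n, replace = TRUE),
  SEX   = sample(1:2, n, replace = TRUE),
  DIS   = sample(1:2, n, replace = TRUE, prob = c(0.05, 0.95))
)

# Outcome generated from arbitrary, made-up log-odds
eta <- 0.5 - 0.05 * toy$RAC1P - 0.08 * toy$ESP
toy$PRESCHL <- rbinom(n, 1, plogis(eta))

# Logistic regression with the same predictors as Model 4
logit4 <- glm(PRESCHL ~ RAC1P + ESP + SEX + DIS,
              data = toy, family = binomial)
summary(logit4)

# Exponentiated coefficients are odds ratios
odds_ratios <- exp(coef(logit4))

# Fitted probabilities on the response scale
p_hat <- predict(logit4, type = "response")
range(p_hat)
```

In this formulation a coefficient is a change in log-odds rather than in probability, so the estimates are not directly comparable with the lm() coefficients reported above.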

Model 5: Interaction Between Participant’s Race and Parent’s Employment Status

m5 <- lm(PRESCHL ~ RAC1P*ESP, data = PUMS_NY)
summary(m5)
## 
## Call:
## lm(formula = PRESCHL ~ RAC1P * ESP, data = PUMS_NY)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5196 -0.4998 -0.3659  0.5002  0.6697 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.5509911  0.0138843  39.685  < 2e-16 ***
## RAC1P       -0.0115833  0.0039841  -2.907  0.00366 ** 
## ESP         -0.0204357  0.0036741  -5.562 2.79e-08 ***
## RAC1P:ESP    0.0006533  0.0009423   0.693  0.48820    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4959 on 5623 degrees of freedom
##   (3444 observations deleted due to missingness)
## Multiple R-squared:  0.01263,    Adjusted R-squared:  0.01211 
## F-statistic: 23.98 on 3 and 5623 DF,  p-value: 2.042e-15

Result: The interaction between participant’s race and parent’s employment status is not statistically significant (p = 0.49).
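
The interaction can also be tested by comparing the additive and interaction models with anova(). When the larger model adds a single term, the F statistic from that comparison equals the square of the term's t statistic. The sketch below checks this on simulated data; `toy`, `add_fit`, and `int_fit` are hypothetical names, and the values are not from PUMS.

```r
# Simulated data (hypothetical; not the PUMS sample)
set.seed(4)
n <- 800
toy <- data.frame(
  RAC1P = sample(1:9, n, replace = TRUE),
  ESP   = sample(1:8, n, replace = TRUE)
)
toy$PRESCHL <- rbinom(n, 1, 0.4)

add_fit <- lm(PRESCHL ~ RAC1P + ESP, data = toy)  # additive model
int_fit <- lm(PRESCHL ~ RAC1P * ESP, data = toy)  # adds RAC1P:ESP

# F-test for the nested comparison
cmp <- anova(add_fit, int_fit)
print(cmp)

# With one added term, F equals the squared t statistic of that term
f_stat <- cmp$F[2]
t_stat <- summary(int_fit)$coefficients["RAC1P:ESP", "t value"]
all.equal(f_stat, t_stat^2)
```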

Best Model:

Let’s now use information criteria to see which model is the best fit of all.

library(texreg)
## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
screenreg(list(m1, m2, m3, m4,m5), doctype = FALSE)
## 
## ============================================================================
##              Model 1      Model 2      Model 3      Model 4      Model 5    
## ----------------------------------------------------------------------------
## (Intercept)     0.36 ***     0.53 ***     0.55 ***     0.99 ***     0.55 ***
##                (0.01)       (0.01)       (0.01)       (0.10)       (0.01)   
## RAC1P          -0.02 ***                 -0.01 ***    -0.01 ***    -0.01 ** 
##                (0.00)                    (0.00)       (0.00)       (0.00)   
## ESP                         -0.02 ***    -0.02 ***    -0.02 ***    -0.02 ***
##                             (0.00)       (0.00)       (0.00)       (0.00)   
## SEX                                                    0.00                 
##                                                       (0.01)                
## DIS                                                   -0.23 ***             
##                                                       (0.05)                
## RAC1P:ESP                                                           0.00    
##                                                                    (0.00)   
## ----------------------------------------------------------------------------
## R^2             0.01         0.01         0.01         0.02         0.01    
## Adj. R^2        0.01         0.01         0.01         0.02         0.01    
## Num. obs.    9070         5627         5627         5627         5627       
## RMSE            0.46         0.50         0.50         0.50         0.50    
## ============================================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

AIC and BIC:

AIC(m1,m2,m3,m4,m5)
##    df       AIC
## m1  3 11542.552
## m2  3  8091.787
## m3  4  8079.412
## m4  6  8062.881
## m5  5  8080.931
BIC(m1,m2,m3,m4,m5)
##    df       BIC
## m1  3 11563.890
## m2  3  8111.693
## m3  4  8105.954
## m4  6  8102.693
## m5  5  8114.108

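Result: Among Models 2 through 5, which are fit to the same 5627 observations, Model 4 has the lowest AIC and BIC (AIC: 8062.881, BIC: 8102.693) and is therefore the preferred model. Model 1 is fit to 9070 observations, so its AIC and BIC are not directly comparable with the others.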
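
One way to put every model on the same footing is to restrict the data to rows that are complete on all model variables before fitting. The sketch below shows the idea on simulated data; `toy`, `common`, and the model names `f1` to `f3` are hypothetical, and the values are not from PUMS.

```r
# Simulated data with missing ESP values (hypothetical; not the PUMS sample)
set.seed(5)
n <- 600
toy <- data.frame(
  RAC1P   = sample(1:9, n, replace = TRUE),
  ESP     = sample(c(1:8, NA), n, replace = TRUE),
  PRESCHL = rbinom(n, 1, 0.4)
)

# Keep only rows that are complete on every variable used by any model
vars   <- c("PRESCHL", "RAC1P", "ESP")
common <- toy[complete.cases(toy[, vars]), ]

# Fit all candidate models to the same rows
f1 <- lm(PRESCHL ~ RAC1P,       data = common)
f2 <- lm(PRESCHL ~ ESP,         data = common)
f3 <- lm(PRESCHL ~ RAC1P + ESP, data = common)

# Every model now uses the same number of observations
c(nobs(f1), nobs(f2), nobs(f3))

AIC(f1, f2, f3)
BIC(f1, f2, f3)
```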

Visual Plots:

library(visreg)
visreg(m1, "RAC1P", scale = "response")

Result: The plot suggests that participants who are White alone have a higher predicted likelihood of completing preschool than participants of any other race. Because RAC1P enters the model as a numeric code, the fitted line declines steadily as the code increases.

visreg(m4, "ESP", by = "SEX", scale = "response",
       line = list(col = "black"),
       fill = list(col = "limegreen"), xlab = "ESP")

Result: The two panels are very similar. Regardless of the participant’s sex, the predicted likelihood of completing preschool is highest for those with both parents in the labor force.