Introduction

We completed the following analyses for the Policy DataFest 2020 competition at the University of Waterloo. This is a “hackathon” wherein each team of graduate students were given a dataset(s) to work with in a little over 24 hours. The results of the analyses students provide would then be used to inform policies and recommendations.

For our team, the topic was about linking early childhood education & development to socioeconomic status (SES).

Data Cleaning, Analyses & Graphing

Loading libraries:

The early childhood education data were obtained through the Ontario Ministry of Education database, which contains data at the census division level regarding how many children are at risk (as assessed by kindergarten teachers) and may not be ready for first grade. The assessment tool used is called the Early Development Instrument (EDI).

The EDI measures 5 different domains: * Physical Health and Well-Being * Social Competence * Emotional Maturity * Language and Cognitive Development * Communication Skills and General Knowledge

The data set also provided information such as * How many children were at risk in 1 or more of the 5 domains * How many children were at risk in 2 or more of the 5 domains * How many children were at risk in 9 of the 15 subdomains

First, we combined all of the data across the 4 cycles in which data were collected:

For our main measure, we looked at the proportion of children who were vulnerable in 1 or more domains, as this would probably give us the most general picture.

Then we cleaned the data. Note that each of the cycles roughly mapped onto the 4 census cycles (i.e., 2001, 2006, 2011, 2016).

Because the data we obtained were split by gender, we first created a data frame with just an overall number of children at risk, regardless of their gender.

In order to obtain information related to a region’s SES, we looked to the Canadian census for this information.

Our team discovered the cansensus package, and it really helped saved us time down the road. For more information about how to use this package, visit the developer’s website

First, we created lists, as well as searched through the databases for critical variables such as unemployment rate, median household income, etc at the census division level.

Below is an example:

## # A tibble: 3 x 7
##   vector  type   label   units    parent_vector aggregation  details       
##   <chr>   <fct>  <chr>   <fct>    <chr>         <chr>        <chr>         
## 1 v_CA11~ Total  Unempl~ Percent~ <NA>          Average of ~ CA 2011 NHS; ~
## 2 v_CA11~ Male   Unempl~ Percent~ <NA>          Average of ~ CA 2011 NHS; ~
## 3 v_CA11~ Female Unempl~ Percent~ <NA>          Average of ~ CA 2011 NHS; ~

After finding all of the vectors of interest, we compiled 4 different data frames, for all 4 census year cycles.

Year 2001

Year 2006

Year 2011

Year 2016

These data were then combined into one data frame.

And merged with the EDI data frame file.

Note that we adjusted income for inflation when comparing data across all 4 years.

Maps

We then created map representations outlining differences across divisions

Median Household Income

We first focused our main analysis on the most recent 2016 data set, using income (as a proxy of SES) to predict vulnerability (i.e., percent of children vulnerable in 1 or more domain) in school readiness.

  vulnerable perc
Predictors Estimates CI p
(Intercept) 0.40 0.32 – 0.48 <0.001
median_hh_income -0.00 -0.00 – -0.00 0.023
Observations 49
R2 / R2 adjusted 0.106 / 0.087

We then separated the 2016 data into each of the domain to see if SES was implicated in a particular domain or across all domains:

# Load Packages
# Packages <- c("reshape2", "dplyr", "ggplot2", "readr", "data.table", "psych","ez","nlme",
#               "plyr", "lsr", "pastecs", "pbkrtest", "MASS","Publish",
#               "ggthemes", "gridExtra","boot", "magrittr","Publish","car","apa","apaTables",
#               "Hmisc","BayesFactor","tinytex","psycho")
# lapply(Packages, library, character.only = T)

rdat <- df[df$Year=="2016",]


## By domain
# phys
rdat$vuln.phys <- rdat$olowph/rdat$vfa
# soc
rdat$vuln.soc <- rdat$olowsoc/rdat$vfa
# emo
rdat$vuln.emo <- rdat$olowem/rdat$vfa 
# lacog
rdat$vuln.lacog <- rdat$olowlc/rdat$vfa 
# comgen
rdat$vuln.comgen <- rdat$olowcg/rdat$vfa 


## melt data
# only interesting columns
dat <- as.data.frame(rdat[,c("GeoUID","median_hh_income","vfa","vuln.phys","vuln.soc","vuln.emo","vuln.lacog","vuln.comgen")])
dat<-dat[-c(ncol(dat))] # remove geometry file information

datl <- melt(dat, id= c("GeoUID","median_hh_income","vfa"), variable.name = "vuln.area", value.name="percent.vuln")
datl$vuln.area <- factor(datl$vuln.area, levels = c("vuln.phys","vuln.emo","vuln.soc", "vuln.comgen","vuln.lacog")) 


# figure 1 800 x 675
ggplot(data = datl, aes(x= median_hh_income, y = percent.vuln, size = vfa)) +
  scale_y_continuous(labels = function(x) scales::percent(x, accuracy = 1), 
                     limits = c(0,.32), breaks = seq(0,.32,.02), expand=c(0,0), 
                     name = "Vulnerable\n") +
  scale_x_continuous(labels = scales::dollar, limits = c(50000,104000), expand=c(0,0),
                     breaks =seq(50000,104000,5000), name = "\nMedian Household Income") +
  scale_size_continuous(guide = FALSE) + 
  geom_smooth(fullrange = TRUE,method='lm', aes(colour = vuln.area, fill = vuln.area), 
              size = .5, alpha = .1, show.legend = FALSE) +
  geom_point(aes(colour=vuln.area),shape=19, alpha = .4) +
  guides(colour=guide_legend(override.aes = list(size=5, alpha = .4))) +
  scale_colour_manual(labels=c("Physical Health/Wellbeing","Emotional Maturity","Social Competence", "Communication/General Knowledge","Language/Cognitive Development"),
                      values=c("Red","orange","chartreuse3","turquoise3","mediumpurple")) +
  scale_fill_manual(values=c("Red","orange","chartreuse3","turquoise3","mediumpurple")) +
  theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank(), # blank aesthetics
        plot.background = element_rect(fill= "white"),
        panel.background = element_rect(colour = "black", fill = "white"), 
        text = element_text(size=12, family= "Arial Rounded MT Bold"),
        axis.text = element_text(colour = "black",size=12),
        axis.title.y=element_text(colour="black"),
        axis.ticks = element_line(colour="black"),
        axis.line = element_line(size = 0.3, linetype = "solid", colour = "black"),
        legend.background = element_rect(fill= "white"),legend.key = element_blank(),
        legend.box.background = element_rect(colour = "black", size = .5),
        legend.text = element_text(size = 12,colour="black"),legend.spacing.y = unit(0, 'cm'),
        legend.position =c(.71,.85), legend.title=element_blank(),legend.direction = "vertical",
        strip.background = element_blank(), strip.placement = "outside",
        axis.text.x = element_text(angle = 45, hjust = .9),
        strip.text = element_text(size = 12, colour = "black"), plot.margin = unit(c(.5,.5,.5,.5),"cm"))

  vuln phys vuln soc vuln emo vuln lacog vuln comgen
Predictors Estimates CI p Estimates CI p Estimates CI p Estimates CI p Estimates CI p
(Intercept) 0.30 0.21 – 0.39 <0.001 0.13 0.09 – 0.17 <0.001 0.18 0.12 – 0.24 <0.001 0.11 0.08 – 0.14 <0.001 0.10 0.06 – 0.14 <0.001
median_hh_income -0.00 -0.00 – -0.00 0.008 -0.00 -0.00 – 0.00 0.335 -0.00 -0.00 – 0.00 0.157 -0.00 -0.00 – -0.00 0.022 -0.00 -0.00 – 0.00 0.960
Observations 49 49 49 49 49
R2 / R2 adjusted 0.141 / 0.123 0.020 / -0.001 0.042 / 0.022 0.106 / 0.087 0.000 / -0.021

Based on the analysis, it seems that only physical & wellbeing and language/cognitive development are predicted by median household income

Next, we looked again at our main vulnerability measure (i.e., percent of children vulnerable in at least 1 domain) across all 4 cycles which we had data.

We conducted a linear-mixed effects model using census division as the random effect, looking at both a model with only linear terms and one with an interaction term to see whether or not there were moderating effects.

Note that we used inflation-adjusted income in these analyses.

  vulnerable perc vulnerable perc
Predictors Estimates CI p Estimates CI p
(Intercept) -3.29 -4.90 – -1.69 <0.001 6.34 -3.20 – 15.88 0.193
Income.Adj -0.00 -0.00 – -0.00 <0.001 -0.00 -0.00 – -0.00 0.042
Year 0.00 0.00 – 0.00 <0.001 -0.00 -0.01 – 0.00 0.222
Income.Adj * Year 0.00 0.00 – 0.00 0.045
Random Effects
σ2 0.00 0.00
τ00 0.00 GeoUID 0.00 GeoUID
ICC 0.40 0.41
N 49 GeoUID 49 GeoUID
Observations 196 196
Marginal R2 / Conditional R2 0.174 / 0.507 0.175 / 0.515

We see that year significantly moderated the relation between income (SES) and school readiness.

Gender Effects

Since we were given gender data, we next wanted to see whether there were systematic differences between girls and boys

edi.long<-melt(edi, id=c(1:14, 23), variable.name = "Domain", value.name = "Count")
edi.long$Count<-as.numeric(edi.long$Count)
edi.long$vfa<-as.numeric(as.character(edi.long$vfa))
edi.long$Percent<-edi.long$Count/edi.long$vfa

edi.long$Domain<-recode(edi.long$Domain, "'olowph'='Physical Health';
                        'olowsoc'='Social Competence';
                        'olowem'='Emotional Maturity';
                        'olowlc'='Language & Cognition';
                        'olowcg'='General Knowledge';
                        'olow'='1+ Vulnerability';
                        'olow2'='2+ Vulnerability'")
edi.long<-edi.long[complete.cases(edi.long$cduid),]


graph1 <- ddply(edi.long, c("sex", "Domain", "Year"), function(df)
  return(c(dep.avg=mean(df$Percent, na.rm=T),
           dep.sd=sd(df$Percent, na.rm=T),
           dep.count=length(df$Percent))))

graph2<-subset(graph1, !(Domain %in% c("cmci", "2+ Vulnerability")))
graph2$Domain<-factor(graph2$Domain, levels(graph2$Domain)[c(1,2,3,7,4,8,5,6)])
graph2$sex<-recode(graph2$sex, "'F'='Female';'M'='Male'")

ggplot(graph2, aes(x=Year, y=dep.avg, colour=Domain))+
  geom_point(size=3, shape=16, position = position_dodge(0.5), alpha=.6)+
  geom_errorbar(aes(ymax=dep.avg+dep.sd*qt(0.975,df=dep.count-1)/sqrt(dep.count),
                    ymin=dep.avg-dep.sd*qt(0.975,df=dep.count-1)/sqrt(dep.count)),
                width=0, position = position_dodge(0.5))+
  facet_grid(.~sex)+
  theme(
    panel.background= element_rect(fill=NA), # transparent panel
    plot.background = element_rect(fill=NA, colour=NA), #transparent background
    panel.grid=element_blank(), # remove panel grid
        plot.title = element_text(family = "Arial Rounded MT Bold"),
        strip.text.x = element_text(family = "Arial Rounded MT Bold", size = 12),
        legend.text = element_text(family = "Arial Rounded MT Bold", size = 12),
        legend.title = element_text(family = "Arial Rounded MT Bold"),
    axis.ticks.x=element_blank(), # remove tick marks on x-axis
    axis.ticks=element_line(colour="gray70"), # change colour of tick marks
    panel.border = element_rect(fill="transparent", colour="gray70"), # change panel border colour
    legend.background = element_rect(fill = "transparent", colour = "transparent"), # change legend background
    axis.text = element_text(color="#432620", family = "Arial Rounded MT Bold"),
    legend.key = element_rect(fill = "transparent", colour = "transparent"),
    strip.background = element_rect(color="grey70", fill="transparent")
    # ,legend.position="none"
  )+
  scale_y_continuous("Percent Children In Domain\n",
                     labels = function(x) scales::percent(x, accuracy=1),
                     breaks=seq(0, 1, .1))+
  scale_x_discrete("\nYear") +
  scale_colour_manual("Early Development Domain", labels=c("1+ Domain", "Physical Health/Wellbeing",
                               "Emotional Maturity",
                               "Social Competence", 
                               "Communication/General Knowledge",
                               "Language/Cognitive Development"),
                      values=c("darkblue", "Red", "orange", "chartreuse3", "turquoise3", "mediumpurple")) +
  coord_cartesian(ylim = c(0, .45))

Here, we present the full linear mixed-effects model with the following predictors: *

  vulnerable perc vulnerable perc
Predictors Estimates CI p Estimates CI p
(Intercept) -3.37 -4.79 – -1.94 <0.001 6.01 -2.46 – 14.49 0.164
Year 0.00 0.00 – 0.00 <0.001 -0.00 -0.01 – 0.00 0.188
Gender (Male) 0.14 0.14 – 0.15 <0.001 0.14 0.14 – 0.15 <0.001
Income.Adj -0.00 -0.00 – -0.00 <0.001 -0.00 -0.00 – -0.00 0.026
Income.Adj * Year 0.00 0.00 – 0.00 0.028
Random Effects
σ2 0.00 0.00
τ00 0.00 GeoUID 0.00 GeoUID
ICC 0.32 0.32
N 49 GeoUID 49 GeoUID
Observations 392 392
Marginal R2 / Conditional R2 0.725 / 0.813 0.726 / 0.815

Conclusion

  • We found that an area’s SES (as approximated by median household income) was indeed related to vulnerability in early childhood development
  • Those in low SES areas and boys seem especially vulnerable