Socioeconomic Status & Early Childhood Education

Introduction

We completed the following analyses for the Policy DataFest 2020 competition at the University of Waterloo. This is a “hackathon” wherein each team of graduate students were given a dataset(s) to work with in a little over 24 hours. The results of the analyses students provide would then be used to inform policies and recommendations.

For our team, the topic was about linking early childhood education & development to socioeconomic status (SES).

Data Cleaning, Analyses & Graphing

Loading libraries:

library(cancensus)
library(dplyr)
library(plyr)
library(sf)
library(beepr)
library(car)
library(reshape2)
library(ggplot2)
library(sjPlot)
library(lme4)
library(corrplot)
library(reldist)
library(lme4)

The early childhood education data were obtained through the Ontario Ministry of Education database, which contains data at the census division level regarding how many children are at risk (as assessed by kindergarten teachers) and may not be ready for first grade. The assessment tool used is called the Early Development Instrument (EDI).

The EDI measures 5 different domains: * Physical Health and Well-Being * Social Competence * Emotional Maturity * Language and Cognitive Development * Communication Skills and General Knowledge

The data set also provided information such as * How many children were at risk in 1 or more of the 5 domains * How many children were at risk in 2 or more of the 5 domains * How many children were at risk in 9 of the 15 subdomains

First, we combined all of the data across the 4 cycles in which data were collected:

# cycle13 = read.csv("ontario_edi_cycles_1-3_aggregated_to_census_division.csv")
# cycle4 = read.csv("ontario_edi_cycle_4_-_aggregated_to_census_division.csv")
# 
# cycle1234 = rbind.fill(cycle13, cycle4)
# 
# write.csv(cycle1234,'cycle1234.csv')

For our main measure, we looked at the proportion of children who were vulnerable in 1 or more domains, as this would probably give us the most general picture.

Then we cleaned the data. Note that each of the cycles roughly mapped onto the 4 census cycles (i.e., 2001, 2006, 2011, 2016).

edi<-read.csv("cycle1234.csv")

edi <- as.data.frame(lapply(edi, function(x) gsub("-9999", NA, x)))
edi$sex<-factor(recode(edi$sex, "1='Female';2='Male';'-9'=NA"))
edi$Year<-ifelse(edi$Cycle=="1", "2001", 
                 ifelse(edi$Cycle=="2", "2006", 
                        ifelse(edi$Cycle=="3", "2011", 
                               ifelse(edi$Cycle=="4", "2016", NA))))

Because the data we obtained were split by gender, we first created a data frame with just an overall number of children at risk, regardless of their gender.

# merge across gender
edi.m<-melt(edi, id=c("cduid", "cdname", "Cycle", "sex", "Year")) # create long data frame
edi.m$value<-as.numeric(as.character(edi.m$value))
edi.t<-dcast(edi.m, cduid + Cycle + Year ~ variable, sum, na.rm=F)

In order to obtain information related to a region’s SES, we looked to the Canadian census for this information.

Our team discovered the cansensus package, and it really helped saved us time down the road. For more information about how to use this package, visit the developer’s website

options(cancensus.api_key = "CensusMapper_b0eea0b32daea3e497e1aca253454bdb")
options(cancensus.cache_path = "custom cache path")

First, we created lists, as well as searched through the databases for critical variables such as unemployment rate, median household income, etc at the census division level.

Below is an example:

list<-list_census_vectors('CA11')

search_census_vectors("unemployment", dataset = 'CA11')

## # A tibble: 3 x 7
##   vector  type   label   units    parent_vector aggregation  details       
##   <chr>   <fct>  <chr>   <fct>    <chr>         <chr>        <chr>         
## 1 v_CA11~ Total  Unempl~ Percent~ <NA>          Average of ~ CA 2011 NHS; ~
## 2 v_CA11~ Male   Unempl~ Percent~ <NA>          Average of ~ CA 2011 NHS; ~
## 3 v_CA11~ Female Unempl~ Percent~ <NA>          Average of ~ CA 2011 NHS; ~

After finding all of the vectors of interest, we compiled 4 different data frames, for all 4 census year cycles.

Year 2001

can01 <- get_census(dataset='CA01', regions=list(C="01"),
                  vectors=c("median_hh_income"="v_CA01_1634", 
                            "immigrant"="v_CA01_406",
                            "population"="v_CA01_2",
                            "participation"= "v_CA01_740",
                            "unemployment"="v_CA01_742"), level='CD', 
                  quiet = TRUE, 
                  geo_format = 'sf', labels = 'short')
beep(3)
can01$Year<-"2001"
can01<-can01[-c(6,9)]

Year 2006

can06 <- get_census(dataset='CA06', regions=list(C="01"),
                  vectors=c("median_hh_income"="v_CA06_2000", 
                            "immigrant"="v_CA06_545",
                            "population"="v_CA06_1",
                            "participation"= "v_CA06_580",
                            "unemployment"="v_CA06_582"), level='CD', 
                  quiet = TRUE, 
                  geo_format = 'sf', labels = 'short')
beep(3)
can06$Year<-"2006"
can06<-can06[-c(2)]

Year 2011

can11 <- get_census(dataset='CA11', regions=list(C="01"),
                  vectors=c("median_hh_income"="v_CA11N_2562", 
                            "immigrant"="v_CA11N_22",
                            "population"="v_CA11N_1",
                            "participation"= "v_CA11N_2002",
                            "unemployment"="v_CA11N_2008"), level='CD', 
                  quiet = TRUE, 
                  geo_format = 'sf', labels = 'short')
beep(3)
can11$Year<-"2011"
can11<-can11[-c(2, 7, 14)]

Year 2016

can16 <- get_census(dataset='CA16', regions=list(C="01"),
                  vectors=c("median_hh_income"="v_CA16_2397", 
                            "immigrant"="v_CA16_3411",
                            "population"="v_CA16_401",
                            "participation"= "v_CA16_5612",
                            "unemployment"="v_CA16_5618"), level='CD', 
                  quiet = TRUE, 
                  geo_format = 'sf', labels = 'short')
beep(3)
can16$Year<-"2016"
can16<-can16[-c(8)]

These data were then combined into one data frame.

can<-rbind(can01, can06, can11, can16)

And merged with the EDI data frame file.

Note that we adjusted income for inflation when comparing data across all 4 years.

df<-merge(can, edi.t,  by.x=c("GeoUID", "Year"), by.y=c("cduid", "Year"),  all.x=F, all.y=T)

df$vulnerable.perc<-df$olow/df$vfa
df$immigrant.perc<-df$immigrant/df$population

# we adjusted income for inflation

df$Income.Adj<-ifelse(df$Cycle=="3", df$median_hh_income*1.076,
                      ifelse(df$Cycle=="2", df$median_hh_income*1.172, 
                             ifelse(df$Cycle=="1", df$median_hh_income*1.317,
                                    ifelse(df$Cycle=="4", df$median_hh_income, NA))))

df$Year<-as.numeric(as.character(df$Year))

Maps

We then created map representations outlining differences across divisions

EDI

Here, we outlined which census divisions may have proportionally more vulnerable children:

ggplot(df) + geom_sf(aes(fill = vulnerable.perc), colour = "grey") +
  scale_fill_viridis_c("Percent\nVulnerable", 
                       labels = function(x) scales::percent(x, accuracy=1), 
                       option="magma", direction = -1) + theme_minimal() +
  facet_grid(.~Year)+
  theme(panel.grid = element_blank(),
        axis.text = element_blank(),
        plot.title = element_text(family = "Arial Rounded MT Bold"),
        strip.text.x = element_text(family = "Arial Rounded MT Bold", size = 12),
        legend.text = element_text(family = "Arial Rounded MT Bold", size = 12),
        legend.title = element_text(family = "Arial Rounded MT Bold"),
        axis.ticks = element_blank()) + 
  coord_sf(datum=NA) +
  labs(title = "Percentage of Vulnerable Children Across Ontario Census Divisions")

Median Household Income

Not Inflation Adjusted

ggplot(df) + geom_sf(aes(fill = median_hh_income), colour = "grey") +
  scale_fill_viridis_c("Median HH\nIncome", labels = scales::dollar) + theme_minimal() +
  facet_grid(.~Year)+
  theme(panel.grid = element_blank(),
        axis.text = element_blank(),
        plot.title = element_text(family = "Arial Rounded MT Bold"),
        strip.text.x = element_text(family = "Arial Rounded MT Bold", size = 12),
        legend.text = element_text(family = "Arial Rounded MT Bold", size = 12),
        legend.title = element_text(family = "Arial Rounded MT Bold"),
        axis.ticks = element_blank()) +
  coord_sf(datum=NA) +
  labs(title = "Median Household Income in Ontario", 
       subtitle = "(Not Adjusted for Inflation)")

Inflation Adjusted

ggplot(df) + geom_sf(aes(fill = Income.Adj), colour = "gray") +
  scale_fill_viridis_c("Median HH\nIncome",labels = scales::dollar,limits = c(40000,105000),breaks = c(40000,60000,80000,100000)) + theme_minimal() +
  
  facet_grid(.~Year)+
  theme(panel.grid = element_blank(),
        axis.text = element_blank(),
        plot.title = element_text(family = "Arial Rounded MT Bold"),
        strip.text.x = element_text(family = "Arial Rounded MT Bold", size = 12),
        legend.text = element_text(family = "Arial Rounded MT Bold", size = 12),
        legend.title = element_text(family = "Arial Rounded MT Bold"),
        plot.subtitle = element_text(family = "Arial Rounded MT Bold"),
        axis.ticks = element_blank()) +
  coord_sf(datum=NA) +
  labs(title = "Median Household Income in Ontario",
       subtitle = "(Adjusted to 2016 Dollar Amounts)")

We first focused our main analysis on the most recent 2016 data set, using income (as a proxy of SES) to predict vulnerability (i.e., percent of children vulnerable in 1 or more domain) in school readiness.

inc16<-lm(vulnerable.perc~median_hh_income, data = df[df$Year=="2016",])

tab_model(inc16)

	vulnerable perc
Predictors	Estimates	CI	p
(Intercept)	0.40	0.32 – 0.48	<0.001
median_hh_income	-0.00	-0.00 – -0.00	0.023
Observations	49
R² / R² adjusted	0.106 / 0.087

We then separated the 2016 data into each of the domain to see if SES was implicated in a particular domain or across all domains:

# Load Packages
# Packages <- c("reshape2", "dplyr", "ggplot2", "readr", "data.table", "psych","ez","nlme",
#               "plyr", "lsr", "pastecs", "pbkrtest", "MASS","Publish",
#               "ggthemes", "gridExtra","boot", "magrittr","Publish","car","apa","apaTables",
#               "Hmisc","BayesFactor","tinytex","psycho")
# lapply(Packages, library, character.only = T)

rdat <- df[df$Year=="2016",]


## By domain
# phys
rdat$vuln.phys <- rdat$olowph/rdat$vfa
# soc
rdat$vuln.soc <- rdat$olowsoc/rdat$vfa
# emo
rdat$vuln.emo <- rdat$olowem/rdat$vfa 
# lacog
rdat$vuln.lacog <- rdat$olowlc/rdat$vfa 
# comgen
rdat$vuln.comgen <- rdat$olowcg/rdat$vfa 


## melt data
# only interesting columns
dat <- as.data.frame(rdat[,c("GeoUID","median_hh_income","vfa","vuln.phys","vuln.soc","vuln.emo","vuln.lacog","vuln.comgen")])
dat<-dat[-c(ncol(dat))] # remove geometry file information

datl <- melt(dat, id= c("GeoUID","median_hh_income","vfa"), variable.name = "vuln.area", value.name="percent.vuln")
datl$vuln.area <- factor(datl$vuln.area, levels = c("vuln.phys","vuln.emo","vuln.soc", "vuln.comgen","vuln.lacog")) 


# figure 1 800 x 675
ggplot(data = datl, aes(x= median_hh_income, y = percent.vuln, size = vfa)) +
  scale_y_continuous(labels = function(x) scales::percent(x, accuracy = 1), 
                     limits = c(0,.32), breaks = seq(0,.32,.02), expand=c(0,0), 
                     name = "Vulnerable\n") +
  scale_x_continuous(labels = scales::dollar, limits = c(50000,104000), expand=c(0,0),
                     breaks =seq(50000,104000,5000), name = "\nMedian Household Income") +
  scale_size_continuous(guide = FALSE) + 
  geom_smooth(fullrange = TRUE,method='lm', aes(colour = vuln.area, fill = vuln.area), 
              size = .5, alpha = .1, show.legend = FALSE) +
  geom_point(aes(colour=vuln.area),shape=19, alpha = .4) +
  guides(colour=guide_legend(override.aes = list(size=5, alpha = .4))) +
  scale_colour_manual(labels=c("Physical Health/Wellbeing","Emotional Maturity","Social Competence", "Communication/General Knowledge","Language/Cognitive Development"),
                      values=c("Red","orange","chartreuse3","turquoise3","mediumpurple")) +
  scale_fill_manual(values=c("Red","orange","chartreuse3","turquoise3","mediumpurple")) +
  theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank(), # blank aesthetics
        plot.background = element_rect(fill= "white"),
        panel.background = element_rect(colour = "black", fill = "white"), 
        text = element_text(size=12, family= "Arial Rounded MT Bold"),
        axis.text = element_text(colour = "black",size=12),
        axis.title.y=element_text(colour="black"),
        axis.ticks = element_line(colour="black"),
        axis.line = element_line(size = 0.3, linetype = "solid", colour = "black"),
        legend.background = element_rect(fill= "white"),legend.key = element_blank(),
        legend.box.background = element_rect(colour = "black", size = .5),
        legend.text = element_text(size = 12,colour="black"),legend.spacing.y = unit(0, 'cm'),
        legend.position =c(.71,.85), legend.title=element_blank(),legend.direction = "vertical",
        strip.background = element_blank(), strip.placement = "outside",
        axis.text.x = element_text(angle = 45, hjust = .9),
        strip.text = element_text(size = 12, colour = "black"), plot.margin = unit(c(.5,.5,.5,.5),"cm"))

m_ph <- lm(vuln.phys~median_hh_income, data = rdat)
m_soc <- lm(vuln.soc~median_hh_income, data = rdat)
m_em <- lm(vuln.emo~median_hh_income, data = rdat)
m_lc <- lm(vuln.lacog~median_hh_income, data = rdat)
m_cg <- lm(vuln.comgen~median_hh_income, data = rdat)

tab_model(m_ph, m_soc, m_em, m_lc, m_cg)

	vuln phys			vuln soc			vuln emo			vuln lacog			vuln comgen
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
(Intercept)	0.30	0.21 – 0.39	<0.001	0.13	0.09 – 0.17	<0.001	0.18	0.12 – 0.24	<0.001	0.11	0.08 – 0.14	<0.001	0.10	0.06 – 0.14	<0.001
median_hh_income	-0.00	-0.00 – -0.00	0.008	-0.00	-0.00 – 0.00	0.335	-0.00	-0.00 – 0.00	0.157	-0.00	-0.00 – -0.00	0.022	-0.00	-0.00 – 0.00	0.960
Observations	49			49			49			49			49
R² / R² adjusted	0.141 / 0.123			0.020 / -0.001			0.042 / 0.022			0.106 / 0.087			0.000 / -0.021

Based on the analysis, it seems that only physical & wellbeing and language/cognitive development are predicted by median household income

Next, we looked again at our main vulnerability measure (i.e., percent of children vulnerable in at least 1 domain) across all 4 cycles which we had data.

We conducted a linear-mixed effects model using census division as the random effect, looking at both a model with only linear terms and one with an interaction term to see whether or not there were moderating effects.

Note that we used inflation-adjusted income in these analyses.

mod1<-lmer(vulnerable.perc~Income.Adj+Year+(1|GeoUID), data=df, REML=F)
mod2<-lmer(vulnerable.perc~Income.Adj*Year+(1|GeoUID), data=df, REML=F)

tab_model(mod1, mod2)

	vulnerable perc			vulnerable perc
Predictors	Estimates	CI	p	Estimates	CI	p
(Intercept)	-3.29	-4.90 – -1.69	<0.001	6.34	-3.20 – 15.88	0.193
Income.Adj	-0.00	-0.00 – -0.00	<0.001	-0.00	-0.00 – -0.00	0.042
Year	0.00	0.00 – 0.00	<0.001	-0.00	-0.01 – 0.00	0.222
Income.Adj * Year				0.00	0.00 – 0.00	0.045
Random Effects
σ²	0.00			0.00
τ₀₀	0.00 _GeoUID			0.00 _GeoUID
ICC	0.40			0.41
N	49 _GeoUID			49 _GeoUID
Observations	196			196
Marginal R² / Conditional R²	0.174 / 0.507			0.175 / 0.515

We see that year significantly moderated the relation between income (SES) and school readiness.

Gender Effects

Since we were given gender data, we next wanted to see whether there were systematic differences between girls and boys

edi.long<-melt(edi, id=c(1:14, 23), variable.name = "Domain", value.name = "Count")
edi.long$Count<-as.numeric(edi.long$Count)
edi.long$vfa<-as.numeric(as.character(edi.long$vfa))
edi.long$Percent<-edi.long$Count/edi.long$vfa

edi.long$Domain<-recode(edi.long$Domain, "'olowph'='Physical Health';
                        'olowsoc'='Social Competence';
                        'olowem'='Emotional Maturity';
                        'olowlc'='Language & Cognition';
                        'olowcg'='General Knowledge';
                        'olow'='1+ Vulnerability';
                        'olow2'='2+ Vulnerability'")
edi.long<-edi.long[complete.cases(edi.long$cduid),]


graph1 <- ddply(edi.long, c("sex", "Domain", "Year"), function(df)
  return(c(dep.avg=mean(df$Percent, na.rm=T),
           dep.sd=sd(df$Percent, na.rm=T),
           dep.count=length(df$Percent))))

graph2<-subset(graph1, !(Domain %in% c("cmci", "2+ Vulnerability")))
graph2$Domain<-factor(graph2$Domain, levels(graph2$Domain)[c(1,2,3,7,4,8,5,6)])
graph2$sex<-recode(graph2$sex, "'F'='Female';'M'='Male'")

ggplot(graph2, aes(x=Year, y=dep.avg, colour=Domain))+
  geom_point(size=3, shape=16, position = position_dodge(0.5), alpha=.6)+
  geom_errorbar(aes(ymax=dep.avg+dep.sd*qt(0.975,df=dep.count-1)/sqrt(dep.count),
                    ymin=dep.avg-dep.sd*qt(0.975,df=dep.count-1)/sqrt(dep.count)),
                width=0, position = position_dodge(0.5))+
  facet_grid(.~sex)+
  theme(
    panel.background= element_rect(fill=NA), # transparent panel
    plot.background = element_rect(fill=NA, colour=NA), #transparent background
    panel.grid=element_blank(), # remove panel grid
        plot.title = element_text(family = "Arial Rounded MT Bold"),
        strip.text.x = element_text(family = "Arial Rounded MT Bold", size = 12),
        legend.text = element_text(family = "Arial Rounded MT Bold", size = 12),
        legend.title = element_text(family = "Arial Rounded MT Bold"),
    axis.ticks.x=element_blank(), # remove tick marks on x-axis
    axis.ticks=element_line(colour="gray70"), # change colour of tick marks
    panel.border = element_rect(fill="transparent", colour="gray70"), # change panel border colour
    legend.background = element_rect(fill = "transparent", colour = "transparent"), # change legend background
    axis.text = element_text(color="#432620", family = "Arial Rounded MT Bold"),
    legend.key = element_rect(fill = "transparent", colour = "transparent"),
    strip.background = element_rect(color="grey70", fill="transparent")
    # ,legend.position="none"
  )+
  scale_y_continuous("Percent Children In Domain\n",
                     labels = function(x) scales::percent(x, accuracy=1),
                     breaks=seq(0, 1, .1))+
  scale_x_discrete("\nYear") +
  scale_colour_manual("Early Development Domain", labels=c("1+ Domain", "Physical Health/Wellbeing",
                               "Emotional Maturity",
                               "Social Competence", 
                               "Communication/General Knowledge",
                               "Language/Cognitive Development"),
                      values=c("darkblue", "Red", "orange", "chartreuse3", "turquoise3", "mediumpurple")) +
  coord_cartesian(ylim = c(0, .45))

Here, we present the full linear mixed-effects model with the following predictors: *

df2<-merge(can, edi, by.x=c("GeoUID", "Year"), by.y=c("cduid", "Year"),  all.x=F, all.y=T)

df2$olow<-as.numeric(as.character(df2$olow))
df2$vfa<-as.numeric(as.character(df2$vfa))
df2$immigrant<-as.numeric(as.character(df2$vfa))
df2$population<-as.numeric(as.character(df2$population))

df2$vulnerable.perc<-df2$olow/df2$vfa
df2$immigrant.perc<-df2$immigrant/df2$population

df2$Income.Adj<-ifelse(df2$Cycle=="3", df2$median_hh_income*1.076,
                      ifelse(df2$Cycle=="2", df2$median_hh_income*1.172, 
                             ifelse(df2$Cycle=="1", df2$median_hh_income*1.317,
                                    ifelse(df2$Cycle=="4", df2$median_hh_income, NA))))

df2$Year<-as.numeric(as.character(df2$Year))

names(df2)[19]<-"Gender"

# regression model 
mod1<-lmer(vulnerable.perc~Income.Adj+Year+Gender+(1|GeoUID), data=df2, REML=F)
mod2<-lmer(vulnerable.perc~Income.Adj*Year+Gender+(1|GeoUID), data=df2, REML=F)

tab_model(mod1, mod2)

	vulnerable perc			vulnerable perc
Predictors	Estimates	CI	p	Estimates	CI	p
(Intercept)	-3.37	-4.79 – -1.94	<0.001	6.01	-2.46 – 14.49	0.164
Year	0.00	0.00 – 0.00	<0.001	-0.00	-0.01 – 0.00	0.188
Gender (Male)	0.14	0.14 – 0.15	<0.001	0.14	0.14 – 0.15	<0.001
Income.Adj	-0.00	-0.00 – -0.00	<0.001	-0.00	-0.00 – -0.00	0.026
Income.Adj * Year				0.00	0.00 – 0.00	0.028
Random Effects
σ²	0.00			0.00
τ₀₀	0.00 _GeoUID			0.00 _GeoUID
ICC	0.32			0.32
N	49 _GeoUID			49 _GeoUID
Observations	392			392
Marginal R² / Conditional R²	0.725 / 0.813			0.726 / 0.815

Conclusion

We found that an area’s SES (as approximated by median household income) was indeed related to vulnerability in early childhood development
Those in low SES areas and boys seem especially vulnerable