We completed the following analyses for the Policy DataFest 2020 competition at the University of Waterloo. This is a “hackathon” wherein each team of graduate students were given a dataset(s) to work with in a little over 24 hours. The results of the analyses students provide would then be used to inform policies and recommendations.
For our team, the topic was about linking early childhood education & development to socioeconomic status (SES).
Loading libraries:
library(cancensus)
library(dplyr)
library(plyr)
library(sf)
library(beepr)
library(car)
library(reshape2)
library(ggplot2)
library(sjPlot)
library(lme4)
library(corrplot)
library(reldist)
library(lme4)The early childhood education data were obtained through the Ontario Ministry of Education database, which contains data at the census division level regarding how many children are at risk (as assessed by kindergarten teachers) and may not be ready for first grade. The assessment tool used is called the Early Development Instrument (EDI).
The EDI measures 5 different domains: * Physical Health and Well-Being * Social Competence * Emotional Maturity * Language and Cognitive Development * Communication Skills and General Knowledge
The data set also provided information such as * How many children were at risk in 1 or more of the 5 domains * How many children were at risk in 2 or more of the 5 domains * How many children were at risk in 9 of the 15 subdomains
First, we combined all of the data across the 4 cycles in which data were collected:
# cycle13 = read.csv("ontario_edi_cycles_1-3_aggregated_to_census_division.csv")
# cycle4 = read.csv("ontario_edi_cycle_4_-_aggregated_to_census_division.csv")
#
# cycle1234 = rbind.fill(cycle13, cycle4)
#
# write.csv(cycle1234,'cycle1234.csv')For our main measure, we looked at the proportion of children who were vulnerable in 1 or more domains, as this would probably give us the most general picture.
Then we cleaned the data. Note that each of the cycles roughly mapped onto the 4 census cycles (i.e., 2001, 2006, 2011, 2016).
edi<-read.csv("cycle1234.csv")
edi <- as.data.frame(lapply(edi, function(x) gsub("-9999", NA, x)))
edi$sex<-factor(recode(edi$sex, "1='Female';2='Male';'-9'=NA"))
edi$Year<-ifelse(edi$Cycle=="1", "2001",
ifelse(edi$Cycle=="2", "2006",
ifelse(edi$Cycle=="3", "2011",
ifelse(edi$Cycle=="4", "2016", NA))))Because the data we obtained were split by gender, we first created a data frame with just an overall number of children at risk, regardless of their gender.
# merge across gender
edi.m<-melt(edi, id=c("cduid", "cdname", "Cycle", "sex", "Year")) # create long data frame
edi.m$value<-as.numeric(as.character(edi.m$value))
edi.t<-dcast(edi.m, cduid + Cycle + Year ~ variable, sum, na.rm=F) In order to obtain information related to a region’s SES, we looked to the Canadian census for this information.
Our team discovered the cansensus package, and it really helped saved us time down the road. For more information about how to use this package, visit the developer’s website
options(cancensus.api_key = "CensusMapper_b0eea0b32daea3e497e1aca253454bdb")
options(cancensus.cache_path = "custom cache path")First, we created lists, as well as searched through the databases for critical variables such as unemployment rate, median household income, etc at the census division level.
Below is an example:
## # A tibble: 3 x 7
## vector type label units parent_vector aggregation details
## <chr> <fct> <chr> <fct> <chr> <chr> <chr>
## 1 v_CA11~ Total Unempl~ Percent~ <NA> Average of ~ CA 2011 NHS; ~
## 2 v_CA11~ Male Unempl~ Percent~ <NA> Average of ~ CA 2011 NHS; ~
## 3 v_CA11~ Female Unempl~ Percent~ <NA> Average of ~ CA 2011 NHS; ~
After finding all of the vectors of interest, we compiled 4 different data frames, for all 4 census year cycles.
Year 2001
can01 <- get_census(dataset='CA01', regions=list(C="01"),
vectors=c("median_hh_income"="v_CA01_1634",
"immigrant"="v_CA01_406",
"population"="v_CA01_2",
"participation"= "v_CA01_740",
"unemployment"="v_CA01_742"), level='CD',
quiet = TRUE,
geo_format = 'sf', labels = 'short')
beep(3)
can01$Year<-"2001"
can01<-can01[-c(6,9)]Year 2006
can06 <- get_census(dataset='CA06', regions=list(C="01"),
vectors=c("median_hh_income"="v_CA06_2000",
"immigrant"="v_CA06_545",
"population"="v_CA06_1",
"participation"= "v_CA06_580",
"unemployment"="v_CA06_582"), level='CD',
quiet = TRUE,
geo_format = 'sf', labels = 'short')
beep(3)
can06$Year<-"2006"
can06<-can06[-c(2)]Year 2011
can11 <- get_census(dataset='CA11', regions=list(C="01"),
vectors=c("median_hh_income"="v_CA11N_2562",
"immigrant"="v_CA11N_22",
"population"="v_CA11N_1",
"participation"= "v_CA11N_2002",
"unemployment"="v_CA11N_2008"), level='CD',
quiet = TRUE,
geo_format = 'sf', labels = 'short')
beep(3)
can11$Year<-"2011"
can11<-can11[-c(2, 7, 14)]Year 2016
can16 <- get_census(dataset='CA16', regions=list(C="01"),
vectors=c("median_hh_income"="v_CA16_2397",
"immigrant"="v_CA16_3411",
"population"="v_CA16_401",
"participation"= "v_CA16_5612",
"unemployment"="v_CA16_5618"), level='CD',
quiet = TRUE,
geo_format = 'sf', labels = 'short')
beep(3)
can16$Year<-"2016"
can16<-can16[-c(8)]These data were then combined into one data frame.
And merged with the EDI data frame file.
Note that we adjusted income for inflation when comparing data across all 4 years.
df<-merge(can, edi.t, by.x=c("GeoUID", "Year"), by.y=c("cduid", "Year"), all.x=F, all.y=T)
df$vulnerable.perc<-df$olow/df$vfa
df$immigrant.perc<-df$immigrant/df$population
# we adjusted income for inflation
df$Income.Adj<-ifelse(df$Cycle=="3", df$median_hh_income*1.076,
ifelse(df$Cycle=="2", df$median_hh_income*1.172,
ifelse(df$Cycle=="1", df$median_hh_income*1.317,
ifelse(df$Cycle=="4", df$median_hh_income, NA))))
df$Year<-as.numeric(as.character(df$Year))We then created map representations outlining differences across divisions
Here, we outlined which census divisions may have proportionally more vulnerable children:
ggplot(df) + geom_sf(aes(fill = vulnerable.perc), colour = "grey") +
scale_fill_viridis_c("Percent\nVulnerable",
labels = function(x) scales::percent(x, accuracy=1),
option="magma", direction = -1) + theme_minimal() +
facet_grid(.~Year)+
theme(panel.grid = element_blank(),
axis.text = element_blank(),
plot.title = element_text(family = "Arial Rounded MT Bold"),
strip.text.x = element_text(family = "Arial Rounded MT Bold", size = 12),
legend.text = element_text(family = "Arial Rounded MT Bold", size = 12),
legend.title = element_text(family = "Arial Rounded MT Bold"),
axis.ticks = element_blank()) +
coord_sf(datum=NA) +
labs(title = "Percentage of Vulnerable Children Across Ontario Census Divisions") ggplot(df) + geom_sf(aes(fill = median_hh_income), colour = "grey") +
scale_fill_viridis_c("Median HH\nIncome", labels = scales::dollar) + theme_minimal() +
facet_grid(.~Year)+
theme(panel.grid = element_blank(),
axis.text = element_blank(),
plot.title = element_text(family = "Arial Rounded MT Bold"),
strip.text.x = element_text(family = "Arial Rounded MT Bold", size = 12),
legend.text = element_text(family = "Arial Rounded MT Bold", size = 12),
legend.title = element_text(family = "Arial Rounded MT Bold"),
axis.ticks = element_blank()) +
coord_sf(datum=NA) +
labs(title = "Median Household Income in Ontario",
subtitle = "(Not Adjusted for Inflation)")ggplot(df) + geom_sf(aes(fill = Income.Adj), colour = "gray") +
scale_fill_viridis_c("Median HH\nIncome",labels = scales::dollar,limits = c(40000,105000),breaks = c(40000,60000,80000,100000)) + theme_minimal() +
facet_grid(.~Year)+
theme(panel.grid = element_blank(),
axis.text = element_blank(),
plot.title = element_text(family = "Arial Rounded MT Bold"),
strip.text.x = element_text(family = "Arial Rounded MT Bold", size = 12),
legend.text = element_text(family = "Arial Rounded MT Bold", size = 12),
legend.title = element_text(family = "Arial Rounded MT Bold"),
plot.subtitle = element_text(family = "Arial Rounded MT Bold"),
axis.ticks = element_blank()) +
coord_sf(datum=NA) +
labs(title = "Median Household Income in Ontario",
subtitle = "(Adjusted to 2016 Dollar Amounts)")We first focused our main analysis on the most recent 2016 data set, using income (as a proxy of SES) to predict vulnerability (i.e., percent of children vulnerable in 1 or more domain) in school readiness.
| vulnerable perc | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.40 | 0.32 – 0.48 | <0.001 |
| median_hh_income | -0.00 | -0.00 – -0.00 | 0.023 |
| Observations | 49 | ||
| R2 / R2 adjusted | 0.106 / 0.087 | ||
We then separated the 2016 data into each of the domain to see if SES was implicated in a particular domain or across all domains:
# Load Packages
# Packages <- c("reshape2", "dplyr", "ggplot2", "readr", "data.table", "psych","ez","nlme",
# "plyr", "lsr", "pastecs", "pbkrtest", "MASS","Publish",
# "ggthemes", "gridExtra","boot", "magrittr","Publish","car","apa","apaTables",
# "Hmisc","BayesFactor","tinytex","psycho")
# lapply(Packages, library, character.only = T)
rdat <- df[df$Year=="2016",]
## By domain
# phys
rdat$vuln.phys <- rdat$olowph/rdat$vfa
# soc
rdat$vuln.soc <- rdat$olowsoc/rdat$vfa
# emo
rdat$vuln.emo <- rdat$olowem/rdat$vfa
# lacog
rdat$vuln.lacog <- rdat$olowlc/rdat$vfa
# comgen
rdat$vuln.comgen <- rdat$olowcg/rdat$vfa
## melt data
# only interesting columns
dat <- as.data.frame(rdat[,c("GeoUID","median_hh_income","vfa","vuln.phys","vuln.soc","vuln.emo","vuln.lacog","vuln.comgen")])
dat<-dat[-c(ncol(dat))] # remove geometry file information
datl <- melt(dat, id= c("GeoUID","median_hh_income","vfa"), variable.name = "vuln.area", value.name="percent.vuln")
datl$vuln.area <- factor(datl$vuln.area, levels = c("vuln.phys","vuln.emo","vuln.soc", "vuln.comgen","vuln.lacog"))
# figure 1 800 x 675
ggplot(data = datl, aes(x= median_hh_income, y = percent.vuln, size = vfa)) +
scale_y_continuous(labels = function(x) scales::percent(x, accuracy = 1),
limits = c(0,.32), breaks = seq(0,.32,.02), expand=c(0,0),
name = "Vulnerable\n") +
scale_x_continuous(labels = scales::dollar, limits = c(50000,104000), expand=c(0,0),
breaks =seq(50000,104000,5000), name = "\nMedian Household Income") +
scale_size_continuous(guide = FALSE) +
geom_smooth(fullrange = TRUE,method='lm', aes(colour = vuln.area, fill = vuln.area),
size = .5, alpha = .1, show.legend = FALSE) +
geom_point(aes(colour=vuln.area),shape=19, alpha = .4) +
guides(colour=guide_legend(override.aes = list(size=5, alpha = .4))) +
scale_colour_manual(labels=c("Physical Health/Wellbeing","Emotional Maturity","Social Competence", "Communication/General Knowledge","Language/Cognitive Development"),
values=c("Red","orange","chartreuse3","turquoise3","mediumpurple")) +
scale_fill_manual(values=c("Red","orange","chartreuse3","turquoise3","mediumpurple")) +
theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank(), # blank aesthetics
plot.background = element_rect(fill= "white"),
panel.background = element_rect(colour = "black", fill = "white"),
text = element_text(size=12, family= "Arial Rounded MT Bold"),
axis.text = element_text(colour = "black",size=12),
axis.title.y=element_text(colour="black"),
axis.ticks = element_line(colour="black"),
axis.line = element_line(size = 0.3, linetype = "solid", colour = "black"),
legend.background = element_rect(fill= "white"),legend.key = element_blank(),
legend.box.background = element_rect(colour = "black", size = .5),
legend.text = element_text(size = 12,colour="black"),legend.spacing.y = unit(0, 'cm'),
legend.position =c(.71,.85), legend.title=element_blank(),legend.direction = "vertical",
strip.background = element_blank(), strip.placement = "outside",
axis.text.x = element_text(angle = 45, hjust = .9),
strip.text = element_text(size = 12, colour = "black"), plot.margin = unit(c(.5,.5,.5,.5),"cm"))m_ph <- lm(vuln.phys~median_hh_income, data = rdat)
m_soc <- lm(vuln.soc~median_hh_income, data = rdat)
m_em <- lm(vuln.emo~median_hh_income, data = rdat)
m_lc <- lm(vuln.lacog~median_hh_income, data = rdat)
m_cg <- lm(vuln.comgen~median_hh_income, data = rdat)
tab_model(m_ph, m_soc, m_em, m_lc, m_cg) | vuln phys | vuln soc | vuln emo | vuln lacog | vuln comgen | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 0.30 | 0.21 – 0.39 | <0.001 | 0.13 | 0.09 – 0.17 | <0.001 | 0.18 | 0.12 – 0.24 | <0.001 | 0.11 | 0.08 – 0.14 | <0.001 | 0.10 | 0.06 – 0.14 | <0.001 |
| median_hh_income | -0.00 | -0.00 – -0.00 | 0.008 | -0.00 | -0.00 – 0.00 | 0.335 | -0.00 | -0.00 – 0.00 | 0.157 | -0.00 | -0.00 – -0.00 | 0.022 | -0.00 | -0.00 – 0.00 | 0.960 |
| Observations | 49 | 49 | 49 | 49 | 49 | ||||||||||
| R2 / R2 adjusted | 0.141 / 0.123 | 0.020 / -0.001 | 0.042 / 0.022 | 0.106 / 0.087 | 0.000 / -0.021 | ||||||||||
Based on the analysis, it seems that only physical & wellbeing and language/cognitive development are predicted by median household income
Next, we looked again at our main vulnerability measure (i.e., percent of children vulnerable in at least 1 domain) across all 4 cycles which we had data.
We conducted a linear-mixed effects model using census division as the random effect, looking at both a model with only linear terms and one with an interaction term to see whether or not there were moderating effects.
Note that we used inflation-adjusted income in these analyses.
mod1<-lmer(vulnerable.perc~Income.Adj+Year+(1|GeoUID), data=df, REML=F)
mod2<-lmer(vulnerable.perc~Income.Adj*Year+(1|GeoUID), data=df, REML=F)
tab_model(mod1, mod2)| vulnerable perc | vulnerable perc | |||||
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | -3.29 | -4.90 – -1.69 | <0.001 | 6.34 | -3.20 – 15.88 | 0.193 |
| Income.Adj | -0.00 | -0.00 – -0.00 | <0.001 | -0.00 | -0.00 – -0.00 | 0.042 |
| Year | 0.00 | 0.00 – 0.00 | <0.001 | -0.00 | -0.01 – 0.00 | 0.222 |
| Income.Adj * Year | 0.00 | 0.00 – 0.00 | 0.045 | |||
| Random Effects | ||||||
| σ2 | 0.00 | 0.00 | ||||
| τ00 | 0.00 GeoUID | 0.00 GeoUID | ||||
| ICC | 0.40 | 0.41 | ||||
| N | 49 GeoUID | 49 GeoUID | ||||
| Observations | 196 | 196 | ||||
| Marginal R2 / Conditional R2 | 0.174 / 0.507 | 0.175 / 0.515 | ||||
We see that year significantly moderated the relation between income (SES) and school readiness.
Since we were given gender data, we next wanted to see whether there were systematic differences between girls and boys
edi.long<-melt(edi, id=c(1:14, 23), variable.name = "Domain", value.name = "Count")
edi.long$Count<-as.numeric(edi.long$Count)
edi.long$vfa<-as.numeric(as.character(edi.long$vfa))
edi.long$Percent<-edi.long$Count/edi.long$vfa
edi.long$Domain<-recode(edi.long$Domain, "'olowph'='Physical Health';
'olowsoc'='Social Competence';
'olowem'='Emotional Maturity';
'olowlc'='Language & Cognition';
'olowcg'='General Knowledge';
'olow'='1+ Vulnerability';
'olow2'='2+ Vulnerability'")
edi.long<-edi.long[complete.cases(edi.long$cduid),]
graph1 <- ddply(edi.long, c("sex", "Domain", "Year"), function(df)
return(c(dep.avg=mean(df$Percent, na.rm=T),
dep.sd=sd(df$Percent, na.rm=T),
dep.count=length(df$Percent))))
graph2<-subset(graph1, !(Domain %in% c("cmci", "2+ Vulnerability")))
graph2$Domain<-factor(graph2$Domain, levels(graph2$Domain)[c(1,2,3,7,4,8,5,6)])
graph2$sex<-recode(graph2$sex, "'F'='Female';'M'='Male'")
ggplot(graph2, aes(x=Year, y=dep.avg, colour=Domain))+
geom_point(size=3, shape=16, position = position_dodge(0.5), alpha=.6)+
geom_errorbar(aes(ymax=dep.avg+dep.sd*qt(0.975,df=dep.count-1)/sqrt(dep.count),
ymin=dep.avg-dep.sd*qt(0.975,df=dep.count-1)/sqrt(dep.count)),
width=0, position = position_dodge(0.5))+
facet_grid(.~sex)+
theme(
panel.background= element_rect(fill=NA), # transparent panel
plot.background = element_rect(fill=NA, colour=NA), #transparent background
panel.grid=element_blank(), # remove panel grid
plot.title = element_text(family = "Arial Rounded MT Bold"),
strip.text.x = element_text(family = "Arial Rounded MT Bold", size = 12),
legend.text = element_text(family = "Arial Rounded MT Bold", size = 12),
legend.title = element_text(family = "Arial Rounded MT Bold"),
axis.ticks.x=element_blank(), # remove tick marks on x-axis
axis.ticks=element_line(colour="gray70"), # change colour of tick marks
panel.border = element_rect(fill="transparent", colour="gray70"), # change panel border colour
legend.background = element_rect(fill = "transparent", colour = "transparent"), # change legend background
axis.text = element_text(color="#432620", family = "Arial Rounded MT Bold"),
legend.key = element_rect(fill = "transparent", colour = "transparent"),
strip.background = element_rect(color="grey70", fill="transparent")
# ,legend.position="none"
)+
scale_y_continuous("Percent Children In Domain\n",
labels = function(x) scales::percent(x, accuracy=1),
breaks=seq(0, 1, .1))+
scale_x_discrete("\nYear") +
scale_colour_manual("Early Development Domain", labels=c("1+ Domain", "Physical Health/Wellbeing",
"Emotional Maturity",
"Social Competence",
"Communication/General Knowledge",
"Language/Cognitive Development"),
values=c("darkblue", "Red", "orange", "chartreuse3", "turquoise3", "mediumpurple")) +
coord_cartesian(ylim = c(0, .45))Here, we present the full linear mixed-effects model with the following predictors: *
df2<-merge(can, edi, by.x=c("GeoUID", "Year"), by.y=c("cduid", "Year"), all.x=F, all.y=T)
df2$olow<-as.numeric(as.character(df2$olow))
df2$vfa<-as.numeric(as.character(df2$vfa))
df2$immigrant<-as.numeric(as.character(df2$vfa))
df2$population<-as.numeric(as.character(df2$population))
df2$vulnerable.perc<-df2$olow/df2$vfa
df2$immigrant.perc<-df2$immigrant/df2$population
df2$Income.Adj<-ifelse(df2$Cycle=="3", df2$median_hh_income*1.076,
ifelse(df2$Cycle=="2", df2$median_hh_income*1.172,
ifelse(df2$Cycle=="1", df2$median_hh_income*1.317,
ifelse(df2$Cycle=="4", df2$median_hh_income, NA))))
df2$Year<-as.numeric(as.character(df2$Year))
names(df2)[19]<-"Gender"
# regression model
mod1<-lmer(vulnerable.perc~Income.Adj+Year+Gender+(1|GeoUID), data=df2, REML=F)
mod2<-lmer(vulnerable.perc~Income.Adj*Year+Gender+(1|GeoUID), data=df2, REML=F)
tab_model(mod1, mod2)| vulnerable perc | vulnerable perc | |||||
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | -3.37 | -4.79 – -1.94 | <0.001 | 6.01 | -2.46 – 14.49 | 0.164 |
| Year | 0.00 | 0.00 – 0.00 | <0.001 | -0.00 | -0.01 – 0.00 | 0.188 |
| Gender (Male) | 0.14 | 0.14 – 0.15 | <0.001 | 0.14 | 0.14 – 0.15 | <0.001 |
| Income.Adj | -0.00 | -0.00 – -0.00 | <0.001 | -0.00 | -0.00 – -0.00 | 0.026 |
| Income.Adj * Year | 0.00 | 0.00 – 0.00 | 0.028 | |||
| Random Effects | ||||||
| σ2 | 0.00 | 0.00 | ||||
| τ00 | 0.00 GeoUID | 0.00 GeoUID | ||||
| ICC | 0.32 | 0.32 | ||||
| N | 49 GeoUID | 49 GeoUID | ||||
| Observations | 392 | 392 | ||||
| Marginal R2 / Conditional R2 | 0.725 / 0.813 | 0.726 / 0.815 | ||||