This notebook compares ASER 2018 data, NAS 2017 data, and IHDS 2012 data. ASER and NAS differ in both sampling and the assessment tool used. I provide a brief recap of ASER and NAS and then discuss what data is used for this comparison. The ASER Centre has put together a comparison of the two surveys here
NCERT has been conducting NAS since 2001 but in previous years they only collected data for a single grade at a time. (For example, they collected data for class 5 in academic year 2001-02, for class 8 in academic year 2002-03, and for class 3 in academic year 2003-04. See here page 6 for more info.) In addition, the sample sizes in previous years were smaller (they were only intended to be representative at the state level) and the assessment tool used was different. More details of NAS 2017 can be found here. In particular, the following documents are partiularly useful:
Andres copied NAS data from here. I copied the ASER data from table 5 in the 2018 ASER report. To make the data as comparable as possible…
Note that even with these corrections, there are still a lot of ways in which ASER and NAS differ. The biggest difference is that ASER is representative of all children enrolled in govt schools while NAS is representative only of children who showed up on the day of the assessment. In addition, the assessment itself varies quite a bit. (And the ASER exam suggests that the pen and paper exam )
library(tidyverse)
library(haven)
library(survey)
path <- "C:/Users/dougj/Documents/Data/Education"
figures <- "C:/Users/dougj/Dropbox/Education in India/Original research/Learning outcomes data/figures"
library(ggrepel)
### Rank comparison
main_data <- read_csv(file.path(path, "ASER 2018 and NAS 2017 govt school grade 3 reading_corrected.csv"))
Parsed with column specification:
cols(
State = [31mcol_character()[39m,
ASER_2012 = [32mcol_double()[39m,
ASER_2014 = [32mcol_double()[39m,
ASER_2016 = [32mcol_double()[39m,
ASER_2018 = [32mcol_double()[39m,
NAS = [32mcol_double()[39m,
aser_rank = [32mcol_double()[39m,
nas_rank = [32mcol_double()[39m
)
ggplot(data = main_data,aes(x=aser_rank,y=nas_rank,label= State)) + geom_abline(intercept = 0, slope = 1, color="orange") +
geom_abline(intercept = -6, slope = 1, color="gray", lwd=1, lty=2) +
geom_abline(intercept = 6, slope = 1, color="gray", lwd=1, lty=2) +
geom_point(color="darkblue") +
theme_bw() + labs(title="State Rankings Based on Language Results for Std III Students (Rural)",
x = "Rank in ASER (2018)", y = "Rank in NAS (2017)") +
scale_y_continuous(breaks=c(1:28)) + scale_x_continuous(breaks=c(1:28)) + theme(panel.grid.minor = element_blank(),
panel.grid.major = element_line(colour = "gray",size=0.1)) + geom_label_repel()
# ggsave("aser_nas_lang_ranking.png", width = 9, height = 6 , path = figures)
ggsave("aser_nas_lang_ranking.png", width = 9, height = 6)
Given the huge variation in state GDP per capita, one would expect that there would be some correlation in GPD per capita and learning outcomes. Surprisingly, this is not the case.
setwd("C:/Users/dougj/Documents/Data/Education")
The working directory was changed to C:/Users/dougj/Documents/Data/Education inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
gdp <- read_csv("state GDP per capita.csv")
Parsed with column specification:
cols(
State = [31mcol_character()[39m,
`NSDP per capita` = [32mcol_double()[39m,
`Data year` = [31mcol_character()[39m
)
df <- main_data %>% left_join(gdp, by = "State")
print(paste("Number of unmatched states", sum(is.na(df$`NSDP per capita`))))
[1] "Number of unmatched states 0"
df$gdp_rank <- NA
df$gdp_rank[order(df$`NSDP per capita`, decreasing = TRUE)] <- 1:nrow(df)
temp <- df %>% select(ASER_2018, 'NSDP per capita', NAS)
cor(temp, method = "pearson", use = "pairwise.complete.obs")
ASER_2018 NSDP per capita NAS
ASER_2018 1.0000000 0.4154964 0.1872901
NSDP per capita 0.4154964 1.0000000 0.0515147
NAS 0.1872901 0.0515147 1.0000000
cor(temp, method = "spearman", use = "pairwise.complete.obs")
ASER_2018 NSDP per capita NAS
ASER_2018 1.0000000 0.3758205 0.1267176
NSDP per capita 0.3758205 1.0000000 0.1382236
NAS 0.1267176 0.1382236 1.0000000
# formal test for correlation
cor.test(temp$ASER_2018,temp$`NSDP per capita`, use = "pairwise.complete.obs", method = "pearson")
Pearson's product-moment correlation
data: temp$ASER_2018 and temp$`NSDP per capita`
t = 2.284, df = 25, p-value = 0.03113
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.04213542 0.68703188
sample estimates:
cor
0.4154964
cor.test(temp$ASER_2018,temp$`NSDP per capita`, use = "pairwise.complete.obs", method = "spearman")
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: temp$ASER_2018 and temp$`NSDP per capita`
S = 2044.8, p-value = 0.05337
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.3758205
cor.test(temp$NAS,temp$`NSDP per capita`, use = "pairwise.complete.obs", method = "pearson")
Pearson's product-moment correlation
data: temp$NAS and temp$`NSDP per capita`
t = 0.26302, df = 26, p-value = 0.7946
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3278634 0.4165853
sample estimates:
cor
0.0515147
cor.test(temp$NAS,temp$`NSDP per capita`, use = "pairwise.complete.obs", method = "spearman")
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: temp$NAS and temp$`NSDP per capita`
S = 3148.9, p-value = 0.483
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.1382236
ggplot(data = df,aes(x= gdp_rank,y=nas_rank,label= State)) + geom_abline(intercept = 0, slope = 1, color="orange") +
geom_point(color="darkblue") +
theme_bw() + labs(title="NAS vs GDP",
x = "GDP rank", y = "Rank in NAS (2017)") +
scale_y_continuous(breaks=c(1:28)) + scale_x_continuous(breaks=c(1:28)) + theme(panel.grid.minor = element_blank(),
panel.grid.major = element_line(colour = "gray",size=0.1)) + geom_label_repel()
ggplot(data = df,aes(x= gdp_rank,y=aser_rank,label= State)) + geom_abline(intercept = 0, slope = 1, color="orange") +
geom_point(color="darkblue") +
theme_bw() + labs(title="ASER vs GDP",
x = "GDP rank", y = "Rank in ASER (2018)") +
scale_y_continuous(breaks=c(1:28)) + scale_x_continuous(breaks=c(1:28)) + theme(panel.grid.minor = element_blank(),
panel.grid.major = element_line(colour = "gray",size=0.1)) + geom_label_repel()
While ASER and NAS use different assessments, based on their descriptions it seems like the reading components of each assessment seeks to measure about the same thing. Therefore, it is useful to compare the average scores (as opposed to ranks) of each assessment for each state.
Calculate percentage of students with ASER == 4 (i.e. can read std 2 level text) for rural (URBAN2011 == 0) govt school (CS4 ==2) or private aided (CS4 == 3) students in grades 2 to 5 (there are too few students if we restrict only to grade 3) by state. warning: this code takes a long time
ihds_ind_dir <- "C:/Users/dougj/Documents/Data/IHDS/IHDS 2012/DS0001"
ind_file <- file.path(ihds_ind_dir, "36151-0001-Data.dta")
# read in just those variables that i need
# this is much faster than reading in everything and then selecting
ihds <- read_dta(ind_file, col_select = c(STATEID, DISTID, PSUID, HHID, HHSPLITID, PERSONID, IDPSU, WT, RO3, RO7, RO5, URBAN2011, starts_with("CS"), starts_with("TA"), starts_with("ED")))
ihds <- ihds %>% mutate(psu_expanded = paste(STATEID, DISTID, PSUID, sep ="-"), hh_expanded = paste(STATEID, DISTID, PSUID, HHID, HHSPLITID, sep ="-"))
# confirm that TA4 (class) is not NA if TA8B is not NA --> there are only 38 instances when TA8B is not NA but TA4 is NA
ihds %>% filter(!is.na(TA8B)) %>% count(TA4)
# drop the one row with missing values for weights
ihds <- ihds %>% filter(!is.na(WT))
# create variable for ASER at level 4
ihds <- ihds %>% mutate(ASER4 = (TA8B ==4)) %>% mutate(State = as_factor(STATEID))
# use the survey package to set the survey design. I will use the ihds_svy object to calculate CIs
ihds_svy <- svydesign(id =~ psu_expanded + hh_expanded, weights =~ WT, data = ihds)
# use statsby to get the % of selected kids who achieve level 4 on ASER reading by state
# note that I am not sure if subsetting within statsby is kosher, but the standard errors should be ok
# more or less ok regardless
ihds_scores <- svyby(~ASER4, ~State, subset(ihds_svy, !is.na(TA8B) & (CS4 == 2 | CS4 == 3) & (TA4 >= 2 & TA4 <= 5) & (URBAN2011 == 0)), svymean, na.rm=TRUE)
# convert to a tibble
ihds_scores <- as.tibble(ihds_scores) %>% select(State, ASER4TRUE, se.ASER4TRUE) %>% rename(ihds = ASER4TRUE, ihds_se = se.ASER4TRUE)
# Unsure, but I think this gets rid of the space and number at the end of the state name
ihds_scores$State <- sub("...$", "", ihds_scores$State)
# replace the standard errors with NA is SE == 0
ihds_scores$ihds[ihds_scores$ihds_se == 0] <- NA
# replace Orissa with Odisha in state
ihds_scores$State[ihds_scores$State == "Orissa"] <- "Odisha"
ihds_scores$ihds <- round(ihds_scores$ihds, 3)*100
aser_nas <- main_data %>% select(State, ASER_2018, NAS)
scores <- aser_nas %>% full_join(ihds_scores, by = "State")
# drop if IHDS_ASER is NA
# scores <- scores %>% filter(!is.na(IHDS_ASER)) %>% arrange(ASER) %>% select(State, ASER, IHDS_ASER, NAS, obs)
# calculate correlation matrix --> while there are a few outliers, overall the ASER data and IHDS data match Ok
# note that
temp <- scores %>% select(ASER_2018, ihds, NAS)
cor(temp, method = "pearson", use = "pairwise.complete.obs")
ASER_2018 ihds NAS
ASER_2018 1.0000000 0.617984954 0.187290066
ihds 0.6179850 1.000000000 0.003204316
NAS 0.1872901 0.003204316 1.000000000
cor(temp, method = "spearman", use = "pairwise.complete.obs")
ASER_2018 ihds NAS
ASER_2018 1.0000000 0.60152502 0.12671756
ihds 0.6015250 1.00000000 -0.03106467
NAS 0.1267176 -0.03106467 1.00000000
# conduct formal correlation test
cor.test(temp$NAS,temp$ihds, use = "pairwise.complete.obs", method = "pearson")
Pearson's product-moment correlation
data: temp$NAS and temp$ihds
t = 0.01433, df = 20, p-value = 0.9887
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4189701 0.4242396
sample estimates:
cor
0.003204316
cor.test(temp$ASER_2018,temp$ihds, use = "pairwise.complete.obs", method = "pearson")
Pearson's product-moment correlation
data: temp$ASER_2018 and temp$ihds
t = 3.5153, df = 20, p-value = 0.002176
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2655702 0.8247157
sample estimates:
cor
0.617985
cor.test(temp$NAS,temp$ASER_2018, use = "pairwise.complete.obs", method = "pearson")
Pearson's product-moment correlation
data: temp$NAS and temp$ASER_2018
t = 0.95332, df = 25, p-value = 0.3496
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.2074917 0.5296102
sample estimates:
cor
0.1872901
cor.test(temp$NAS,temp$ihds, use = "pairwise.complete.obs", method = "spearman")
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: temp$NAS and temp$ihds
S = 1826, p-value = 0.8908
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.03106467
cor.test(temp$ASER_2018,temp$ihds, use = "pairwise.complete.obs", method = "spearman")
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: temp$ASER_2018 and temp$ihds
S = 705.7, p-value = 0.003063
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.601525
cor.test(temp$NAS,temp$ASER_2018, use = "pairwise.complete.obs", method = "spearman")
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: temp$NAS and temp$ASER_2018
S = 2860.9, p-value = 0.5288
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.1267176
Create graph showing IHDS, ASER, and NAS all on same scale.
scores.long <- scores %>% select(State, ASER_2018, NAS, ihds) %>%
rename(IHDS = ihds) %>%
filter(!is.na(ASER_2018) & !is.na(NAS)) %>%
gather(Assessment, avg_score, -State)
ggplot(data = scores.long, aes(x= reorder(State, avg_score),y=avg_score, fill= Assessment)) +
geom_bar(stat = "identity", position = position_dodge(width=.8)) +
theme(axis.text.x = element_text(angle=90)) +
scale_fill_manual(values = c("lightblue", "blue", "darkblue"))+
labs(y = "Average score", x = "")
Attempt to create the same bar but with standard errors for IHDS.
# treat the SE as just another score
scores.long2 <- scores %>%
filter(!is.na(ASER_2018) & !is.na(NAS)) %>%
mutate(IHDS = paste(ihds,ihds_se, sep ="-")) %>%
select(State, ASER_2018, NAS, IHDS) %>%
gather(Assessment, avg_score, -State) %>%
separate(avg_score, sep = "-", into = c("avg_score","se"))
Expected 2 pieces. Missing pieces filled with `NA` in 54 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
# convert avg_score and se back to numeric
scores.long2 <- scores.long2 %>%
mutate(avg_score = as.numeric(avg_score), se = as.numeric(se))
NAs introduced by coercionNAs introduced by coercion
# multiply se
scores.long2 <- scores.long2 %>%
mutate(se = 100*se)
# create y_min and y_max
m <- qnorm(1-.05/2)
scores.long2 <- scores.long2 %>%
mutate(ymin = avg_score-m*se, ymax = avg_score+m*se) %>%
mutate(ymin = ifelse(ymin <0, 0, ymin))
ggplot(data = scores.long2, aes(x= reorder(State, avg_score),y=avg_score, fill= Assessment)) +
geom_bar(stat = "identity", position = position_dodge(width=.8)) +
geom_errorbar(aes(ymin = ymin, ymax = ymax), width = .1) +
theme(axis.text.x = element_text(angle=90)) +
scale_fill_manual(values = c("lightblue", "blue", "darkblue"))+
labs(y = "Average score", x = "")
ggsave("aser_nas_ihds_values.png", width = 9, height = 6 , path = figures)
Show line graphs of ASER class 3 reading scores for govt school students over time.
# reshape the main data
scores.long3 <- main_data %>%
select(State, starts_with("ASER")) %>%
select(-aser_rank) %>%
gather(key ="Temp", value= "Reading", -State) %>%
separate(Temp, sep = "_", into = c("dummy","Year"))
ggplot(data = scores.long3, aes(x=Year, y=Reading, color=State)) +
geom_line(aes(group=State))
ggsave("aser_over_time.png", path = figures)
Saving 7.29 x 4.5 in image