Summary
This notebook provides a detailed description of the IHDS 2011-12 household survey data and accompanying documentation. It also provides sample code to import the individual-level dataset (DS0001) and use the survey and srvyr packages to perform inference using the dataset.
Overview of IHDS data
IHDS is a panel survey led by Sonalde Desai and others at NCAER and the University of Maryland. The first round of the panel was conducted in 2004-05 and the second in 2011-12. In both rounds, a test similar to the ASER test was administered to children aged 8-11. (The response rate for the test was not great, though. I think Desai and her co-authors explain the reasons for this in their India Policy Forum publication.) In addition, the survey asked basic questions about enrollment and educational achievement for all household members.
Datasets and documentation
When you download the round 2 data, you get 14 datasets. These are split into folders “DS00X” where X corresponds to the following:
1. Individual
2. Household
3. Eligible Women
4. Birth History
5. Medical Staff
6. Medical Facilities
7. Non Resident
8. School Staff
9. School Facilities
10. Wage and Salary
11. Tracking
12. Village
13. Village Panchayat
14. Village Respondent
Note that, unlike the panel dataset, there is only one version of the files. (For the panel dataset, there are different versions of the files corresponding to different ways of merging/combining the two rounds of data.) More information about each of the datasets, including which variables to use for weighting observations when using each dataset and how to merge the datasets, can be found in the IHDS-II data guide (https://www.icpsr.umich.edu/icpsrweb/content/DSDR/idhs-II-data-guide.html). According to the documentation, household-level merges use STATEID, DISTID, PSUID, HHID, and HHSPLITID.
Documentation for the 2012 dataset is saved in the “IHDS/IHDS 2012” folder in my data folder. The user guide contains basic information about the sampling strategy and various constructed variables (e.g. the consumption variables). The data guide seems to just duplicate this information. To figure out which variables to use, go to the codebooks and questionnaires located in each of the
Finding variables in the datasets
The fastest way to find relevant variables in the dataset is to first search for keywords in the questionnaire to find the relevant question number and then search for the question number in the codebook. This can be a bit confusing since the variables in the codebook are organized by subject rather than ordered by question number. (For example, some of the education questions from the household roster are coded as “CS” questions and come later in the roster.)
Note that codebook section “ED5: HQ19 11.5 Education: Enrolled now” refers to variable ED5 in the dataset, which comes from question 11.5 on page 19 of the household questionnaire.
Key education variables in the individual dataset (DS0001)
I have included a list of some of the key education-related variables (and other basic variables) below. Note that this list might not be comprehensive (i.e. there may be some ed related vars which I have missed).
Identifiers and other vars
- STATEID
- DISTID
- PSUID
- HHID
- HHSPLITID
- PERSONID
- IDPSU - this one might be unique, not sure
- WT - weights
- RO3 - sex
- RO7 - primary activity status
- RO5 - Age
Testing vars for children 8-11 (page 38 of questionnaire)
- TA8A, TA8B, TA9A, TA9B, TA10A, TA10B – Reading, math, and writing test results (vars with suffix “B” contain the results)
- TA* - other vars related to ASER test
Education vars for all members (page 23 of questionnaire)
- ED2-ED12 – ed-related info such as highest grade, current enrollment, etc for all household members.
Education vars for all hh members who were enrolled in school in previous 12 months (page 44 of questionnaire)
- CS3 - in school
- CS4 - type of school
- CS5 - distance to school
- CS6 - standard years
- CS8 - medium of instruction
- CS9 - year English taught
- CS10 - school hours per week
- CS11 - homework hours per week
- CS12 - private tuition hours per week
- CS13 - days / months absent
- (and many more)
Education vars for children 8-11 (page 46 of questionnaire)
- CH*
Sample code to read in key variables
Setup chunk
library(tidyverse)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.2.1     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   1.0.0     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(haven)
library(survey)
Loading required package: grid
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Loading required package: survival
Attaching package: 'survey'
The following object is masked from 'package:graphics':
dotchart
library(srvyr)
Attaching package: 'srvyr'
The following object is masked from 'package:stats':
filter
Open the individual dataset (36151-0001) and select key education variables
ihds_ind_dir <- "C:/Users/dougj/Documents/Data/IHDS/IHDS 2012/DS0001"
ind_file <- file.path(ihds_ind_dir, "36151-0001-Data.dta")
# read in just those variables that i need
# this is much faster than reading in everything and then selecting
df <- read_dta(ind_file, col_select = c(STATEID, DISTID, PSUID, HHID, HHSPLITID, PERSONID, IDPSU, WT, RO3, RO7, RO5, starts_with("CS"), starts_with("TA"), starts_with("ED")))
Specify the survey design using the survey package
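While the user guide has a short description of the sample, a clearer and more complete description can be found in IHDS technical paper 1 (https://ihds.umd.edu/sites/default/files/publications/papers/technical%20paper%201.pdf). The dataset includes no variables for strata or fpc, so we only specify a two-stage cluster design with the appropriate weights.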
df <- df %>% mutate(psu_expanded = paste(STATEID, DISTID, PSUID, sep ="-"), hh_expanded = paste(STATEID, DISTID, PSUID, HHID, HHSPLITID, sep ="-"))
# there is one row with a missing value for WT and most other variables; drop it
df <- df %>% filter(!is.na(WT))
df_svy <- svydesign(id =~ psu_expanded + hh_expanded, weights =~ WT, data = df)
Calculate average ASER reading scores by state three ways: directly with weighted.mean() (using weights but not the survey package), with the survey package, and with the srvyr package. All three yield the same point estimates. Note that the survey package and srvyr are both really slow.
# confirm we are looking at the right variable by looking at the variable label from Stata
attributes(df$TA8B)
$label
[1] "HQ34 26.8 Test: Reading level"
$labels
Cannot read 0 Letters 1 Words 2 Paragraph 3 Story 4
0 1 2 3 4
$class
[1] "haven_labelled"
attributes(df$RO5)$label
[1] "HQ4 2.5 Age"
# look at response rates to TA8B by age --> looks like there are some NAs even for 8 to 11.
# unclear why that is.
df %>%
group_by(RO5) %>%
summarise(non_na_count = sum(!is.na(TA8B)), na_count = sum(is.na(TA8B)) )
# Calculate average reading score DIRECTLY
attributes(df$STATEID)$labels
Jammu & Kashmir 01 Himachal Pradesh 02 Punjab 03 Chandigarh 04
1 2 3 4
Uttarakhand 05 Haryana 06 Delhi 07 Rajasthan 08
5 6 7 8
Uttar Pradesh 09 Bihar 10 Sikkim 11 Arunachal Pradesh 12
9 10 11 12
Nagaland 13 Manipur 14 Mizoram 15 Tripura 16
13 14 15 16
Meghalaya 17 Assam 18 West Bengal 19 Jharkhand 20
17 18 19 20
Orissa 21 Chhattisgarh 22 Madhya Pradesh 23 Gujarat 24
21 22 23 24
Daman & Diu 25 Dadra+Nagar Haveli 26 Maharashtra 27 Andhra Pradesh 28
25 26 27 28
Karnataka 29 Goa 30 Lakshadweep 31 Kerala 32
29 30 31 32
Tamil Nadu 33 Pondicherry 34 Anadman/Nicobar 35
33 34 35
df %>%
group_by(STATEID) %>%
summarise(weighted.mean(TA8B, na.rm = TRUE, w = WT))
# Calculate average reading score USING SURVEY PACKAGE
svyby(~TA8B, ~STATEID, df_svy, svymean, na.rm=TRUE)
# Calculate average reading score USING SRVYR PACKAGE
# note that i can't use a formula to specify the ID var
# also
df_srvyr <- df %>%
as_survey_design(c(psu_expanded, hh_expanded), weights = WT)
Temp - calculate ICC for ASER scores
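Argh. This is why R is annoying! Not really sure if this works, but it seems like it should. (Caveat: the expression below divides the PSU standard deviation, not its variance, by the total variance, so the printed value is not a proper variance share.)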
attributes(VarCorr(fit)$PSUID)$stddev / aser_var
(Intercept)
0.06675789
---
title: "Overview of IHDS 2012 Education Data"
output:
  html_notebook: default
  pdf_document: default
---

## Summary
This notebook provides a detailed description of the IHDS 2011-12 household survey data and accompanying documentation. It also provides sample code to import the individual-level dataset (DS0001) and use the survey and srvyr packages to perform inference using the dataset.

## Overview of IHDS data
IHDS is a panel survey led by Sonalde Desai and others at NCAER and the University of Maryland. The first round of the panel was conducted in 2004-05 and the second in 2011-12. In both rounds, a test similar to the ASER test was administered to children aged 8-11. (The response rate for the test was not great, though. I think Desai and her co-authors explain the reasons for this in their India Policy Forum publication.) In addition, the survey asked basic questions about enrollment and educational achievement for all household members.


## Datasets and documentation
When you download the round 2 data, you get 14 datasets. These are split into folders "DS00X" where X corresponds to the following:

1. Individual
2. Household
3. Eligible Women
4. Birth History
5. Medical Staff
6. Medical Facilities
7. Non Resident
8. School Staff
9. School Facilities
10. Wage and Salary
11. Tracking
12. Village
13. Village Panchayat
14. Village Respondent

Note that, unlike the panel dataset, there is only one version of the files. (For the panel dataset, there are different versions of the files corresponding to different ways of merging/combining the two rounds of data.) More information about each of the datasets, including which variables to use for weighting observations when using each dataset and how to merge the datasets, can be found [here](https://www.icpsr.umich.edu/icpsrweb/content/DSDR/idhs-II-data-guide.html). According to the documentation, household-level merges use STATEID, DISTID, PSUID, HHID, and HHSPLITID.
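
As a rough sketch of what a merge on these keys could look like, the block below joins a toy individual table to a toy household table. The rows and the `hh_size` column are made up for illustration and do not come from the IHDS files; only the key names are taken from the variable list in this notebook. The block is shown as plain `r` (not evaluated) so it does not touch the real `df`.

```r
library(dplyr)

# Toy stand-ins for the individual (DS0001) and household (DS0002) files.
# All values and the hh_size column are made up for illustration.
ind <- tibble(
  STATEID = c(1, 1, 1), DISTID = c(2, 2, 2), PSUID = c(3, 3, 4),
  HHID = c(10, 10, 10), HHSPLITID = c(1, 1, 1),
  PERSONID = c(1, 2, 1)
)
hh <- tibble(
  STATEID = c(1, 1), DISTID = c(2, 2), PSUID = c(3, 4),
  HHID = c(10, 10), HHSPLITID = c(1, 1),
  hh_size = c(2, 1)
)

hh_keys <- c("STATEID", "DISTID", "PSUID", "HHID", "HHSPLITID")

# The keys should identify each household exactly once before merging
stopifnot(!any(duplicated(hh[hh_keys])))

# Attach household-level columns to every individual row
ind_hh <- left_join(ind, hh, by = hh_keys)
ind_hh
```

Checking for duplicate keys before the join is a cheap way to catch an accidental many-to-many merge.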

Documentation for the 2012 dataset is saved in the "IHDS/IHDS 2012" folder in my data folder. The user guide contains basic information about the sampling strategy and various constructed variables (e.g. the consumption variables). The data guide seems to just duplicate this information. To figure out which variables to use, go to the codebooks and questionnaires located in each of the 

### Finding variables in the datasets
The fastest way to find relevant variables in the dataset is to first search for keywords in the questionnaire to find the relevant question number and then search for the question number in the codebook.  This can be a bit confusing since the variables in the codebook are organized by subject rather than ordered by question number. (For example, some of the education questions from the household roster are coded as "CS" questions and come later in the roster.)
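
Once a file is loaded with `read_dta()`, the variable labels can also be searched directly, which may save a trip to the codebook. The sketch below assumes the labels sit in each column's `"label"` attribute (as haven stores them) and uses a toy data frame; the three labels are the ones quoted in this notebook, the values are made up, and `find_vars()` is a hypothetical helper, not part of any package.

```r
# Toy data frame carrying Stata-style variable labels in the "label" attribute.
# Labels are the ones quoted in this notebook; the values are made up.
toy <- data.frame(ED5 = c(1, 0), TA8B = c(3, 4), RO5 = c(9, 10))
attr(toy$ED5, "label")  <- "HQ19 11.5 Education: Enrolled now"
attr(toy$TA8B, "label") <- "HQ34 26.8 Test: Reading level"
attr(toy$RO5, "label")  <- "HQ4 2.5 Age"

# Collect every variable label into a named character vector
var_labels <- vapply(toy, function(x) {
  lab <- attr(x, "label")
  if (is.null(lab)) NA_character_ else lab
}, character(1))

# Hypothetical helper: return the variables whose label matches a keyword
find_vars <- function(labels, keyword) {
  labels[grepl(keyword, labels, ignore.case = TRUE)]
}

find_vars(var_labels, "education")
```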

Note that codebook section **"ED5: HQ19 11.5 Education: Enrolled now"** refers to variable **ED5** in the dataset, which comes from **question 11.5** on **page 19** of the **household questionnaire**.
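
The label pattern can be split into its parts with a regular expression. This is only a sketch: `parse_label()` is a hypothetical helper, and it assumes every label follows the "HQ<page> <question> <description>" layout seen in the examples in this notebook, which may not hold for all variables.

```r
# Hypothetical helper: split a label such as "HQ19 11.5 Education: Enrolled now"
# into questionnaire page, question number, and description.
parse_label <- function(label) {
  pattern <- "^HQ([0-9]+) ([0-9]+\\.[0-9]+) (.*)$"
  m <- regmatches(label, regexec(pattern, label))[[1]]
  if (length(m) == 0) return(NULL)  # label does not follow the pattern
  list(page = as.integer(m[2]), question = m[3], description = m[4])
}

parse_label("HQ19 11.5 Education: Enrolled now")
```

The question number is kept as a string so that values like "11.5" and "11.50" are not conflated.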


## Key education variables in the individual dataset (DS0001)
I have included a list of some of the key education-related variables (and other basic variables) below. Note that this list might not be comprehensive (i.e. there may be some ed related vars which I have missed).


### Identifiers and other vars
* STATEID
* DISTID
* PSUID
* HHID
* HHSPLITID
* PERSONID
* IDPSU - this one might be unique, not sure
* WT - weights
* RO3 - sex
* RO7 - primary activity status
* RO5 - Age
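
Whether an identifier such as IDPSU is unique can be checked rather than guessed. The sketch below uses made-up rows (not IHDS data) and a hypothetical helper, `is_unique_key()`, to test whether a set of columns identifies each row exactly once.

```r
# Hypothetical helper: TRUE if the given columns jointly identify each row
is_unique_key <- function(data, cols) {
  !any(duplicated(data[, cols, drop = FALSE]))
}

# Made-up rows, not IHDS data
toy <- data.frame(
  IDPSU    = c(101, 101, 102),
  HHID     = c(1, 1, 1),
  PERSONID = c(1, 2, 1)
)

is_unique_key(toy, "IDPSU")                         # FALSE
is_unique_key(toy, c("IDPSU", "HHID", "PERSONID"))  # TRUE
```

On the real data the same call would be `is_unique_key(df, "IDPSU")` after the import chunk below has run.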


### Testing vars for children 8-11 (page 38 of questionnaire)
* TA8A, TA8B, TA9A, TA9B, TA10A, TA10B -- Reading, math, and writing test results (vars with suffix "B" contain the results)
* TA* - other vars related to ASER test

### Education vars for all members (page 23 of questionnaire)
* ED2-ED12 -- ed-related info such as highest grade, current enrollment, etc for all household members. 

### Education vars for all hh members who were enrolled in school in previous 12 months (page 44 of questionnaire)
* CS3 - in school
* CS4 - type of school
* CS5 - distance to school
* CS6 - standard years
* CS8 - medium of instruction
* CS9 - year English taught
* CS10 - school hours per week
* CS11 - homework hours per week
* CS12 - private tuition hours per week
* CS13 - days / months absent
* (and many more)

### Education vars for children 8-11 (page 46 of questionnaire)
* CH* 

## Sample code to read in key variables
Setup chunk
```{r setup}
library(tidyverse)
library(haven)
library(survey)
library(srvyr)
```

Open the individual dataset (36151-0001) and select key education variables

```{r}
ihds_ind_dir <- "C:/Users/dougj/Documents/Data/IHDS/IHDS 2012/DS0001"
ind_file <- file.path(ihds_ind_dir, "36151-0001-Data.dta")
# read in just those variables that i need
# this is much faster than reading in everything and then selecting
df <- read_dta(ind_file, col_select = c(STATEID, DISTID, PSUID, HHID, HHSPLITID, PERSONID, IDPSU, WT, RO3, RO7, RO5, starts_with("CS"), starts_with("TA"), starts_with("ED")))
```

## Specify the survey design using the survey package
While the user guide has a short description of the sample, a clearer and more complete description can be found [here](https://ihds.umd.edu/sites/default/files/publications/papers/technical%20paper%201.pdf). The dataset includes no variables for strata or fpc, so we only specify a two-stage cluster design with the appropriate weights.

```{r}
df <- df %>% mutate(psu_expanded = paste(STATEID, DISTID, PSUID, sep ="-"), hh_expanded = paste(STATEID, DISTID, PSUID, HHID, HHSPLITID, sep ="-"))
# there is one row with a missing value for WT and most other variables; drop it
df <- df %>% filter(!is.na(WT))
df_svy <- svydesign(id =~ psu_expanded + hh_expanded, weights =~ WT, data = df)
```

Calculate average ASER reading scores by state three ways: directly with `weighted.mean()` (using weights but not the survey package), with the survey package, and with the srvyr package. All three yield the same point estimates. Note that the survey package and srvyr are both really slow.

```{r}
# confirm we are looking at the right variable by looking at the variable label from Stata
attributes(df$TA8B)
attributes(df$RO5)$label

# look at response rates to TA8B by age --> looks like there are some NAs even for 8 to 11. 
# unclear why that is.  
df %>%
  group_by(RO5) %>%
  summarise(non_na_count = sum(!is.na(TA8B)), na_count = sum(is.na(TA8B)) )

# Calculate average reading score DIRECTLY
attributes(df$STATEID)$labels
df %>%
  group_by(STATEID) %>%
  summarise(weighted.mean(TA8B, na.rm = TRUE, w = WT))

# Calculate average reading score USING SURVEY PACKAGE
svyby(~TA8B, ~STATEID, df_svy, svymean, na.rm=TRUE)

# Calculate average reading score USING SRVYR PACKAGE 
# note that i can't use a formula to specify the ID var
# also 
df_srvyr <- df %>%
  as_survey_design(c(psu_expanded, hh_expanded), weights = WT)

df_srvyr %>%
  group_by(STATEID) %>%
  summarize(aser_mean = survey_mean(TA8B, na.rm =TRUE))

```
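
The point estimates agree because the survey mean is itself a weighted mean; what the design adds is a standard error that accounts for clustering. The toy example below (made-up data, a single-stage design, and hypothetical column names `psu`, `wt`, `score`) is a minimal sketch of that equivalence, not a reproduction of the IHDS design.

```r
library(survey)

# Made-up data: 4 clusters (PSUs), 3 children each, unequal weights
set.seed(1)
toy <- data.frame(
  psu   = rep(1:4, each = 3),
  wt    = rep(c(1, 2, 3, 4), each = 3),
  score = sample(0:4, 12, replace = TRUE)
)

# Direct weighted mean
direct <- weighted.mean(toy$score, w = toy$wt)

# Same estimate through a one-stage cluster design
des <- svydesign(ids = ~psu, weights = ~wt, data = toy)
est <- svymean(~score, des)

direct
coef(est)  # point estimate, equal to the direct weighted mean
SE(est)    # standard error that reflects the clustering
```

If only point estimates are needed, the direct weighted mean is much faster; the survey machinery earns its keep when standard errors or confidence intervals are required.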


## Temp - calculate ICC for ASER scores
Argh.  This is why R is annoying! Not really sure if this works, but it seems like it should
```{r}
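library(lme4)

# Keep the test score and the full PSU identifier built above.
# psu_expanded is used instead of PSUID because PSUID on its own may repeat
# across states and districts.
temp <- df %>%
  select(TA8B, psu_expanded) %>%
  mutate(TA8B = as.numeric(TA8B))

# Random-intercept model: between-PSU variance plus residual variance
# (unweighted; rows with missing TA8B are dropped by lmer)
fit <- lmer(TA8B ~ (1 | psu_expanded), data = temp)

# ICC = between-PSU variance / (between-PSU variance + residual variance)
vc <- as.data.frame(VarCorr(fit))
icc <- vc$vcov[vc$grp == "psu_expanded"] / sum(vc$vcov)
icc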
```
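
For reference, the intraclass correlation for a random-intercept model is usually defined as the share of total variance that lies between clusters. The notation below is mine, not from the IHDS documentation: $\sigma^2_{\text{PSU}}$ is the variance of the PSU random intercept and $\sigma^2_{\varepsilon}$ is the residual variance.

$$
\text{ICC} = \frac{\sigma^2_{\text{PSU}}}{\sigma^2_{\text{PSU}} + \sigma^2_{\varepsilon}}
$$

Both terms are variances, so a standard deviation from `VarCorr()` needs to be squared before it goes into this ratio.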

