Summary

This notebook describes the IHDS 2011-12 household survey data and accompanying documentation. It also provides sample code to import the individual dataset (DS0001) and to use the survey and srvyr packages for inference with it.

Overview of IHDS data

IHDS (India Human Development Survey) is a panel survey led by Sonalde Desai and others at NCAER and the University of Maryland. The first round was conducted in 2004-05 and the second in 2011-12. In both rounds, a test similar to the ASER test was administered to all children aged 8-11. (The response rate for the test was not great, though; I think Desai and co-authors explain the reasons in their India Policy Forum publication.) In addition, the survey asked basic questions about enrollment and educational attainment for all household members.

Datasets and documentation

When you download the round 2 data, you get 14 datasets. These are split into folders “DS00X”, where X corresponds to the following:

  1. Individual
  2. Household
  3. Eligible Women
  4. Birth History
  5. Medical Staff
  6. Medical Facilities
  7. Non Resident
  8. School Staff
  9. School Facilities
  10. Wage and Salary
  11. Tracking
  12. Village
  13. Village Panchayat
  14. Village Respondent

Note that, unlike the panel dataset, there is only one version of the files. (For the panel dataset, there are different versions of the files corresponding to different ways of merging/combining the two rounds of data.) More information about each dataset, including which weight variable to use with it and how to merge the datasets, can be found in the ICPSR data guide (https://www.icpsr.umich.edu/icpsrweb/content/DSDR/idhs-II-data-guide.html). According to the documentation, datasets should be merged on stateid, distid, psuid, hhid, and hhsplitid (STATEID, DISTID, PSUID, HHID, and HHSPLITID in the files).

Documentation for the 2012 dataset is saved in the “IHDS/IHDS 2012” folder in my data folder. The user guide contains basic information about the sampling strategy and various constructed variables (e.g. the consumption variables). The data guide seems to just duplicate this information. To figure out which variables to use, go to the codebooks and questionnaires located in each of the

Finding variables in the datasets

The fastest way to find relevant variables in the dataset is to first search for keywords in the questionnaire to find the relevant question number and then search for the question number in the codebook. This can be a bit confusing since the variables in the codebook are organized by subject rather than ordered by question number. (For example, some of the education questions from the household roster are coded as “CS” questions and come later in the roster.)

Note that the codebook section “ED5: HQ19 11.5 Education: Enrolled now” refers to variable ED5 in the dataset, which comes from question 11.5 on page 19 of the household questionnaire.

Key education variables in the individual dataset (DS0001)

I have included a list of some of the key education-related variables (and other basic variables) below. Note that this list might not be comprehensive (i.e. there may be some ed related vars which I have missed).

Identifiers and other vars

  • STATEID
  • DISTID
  • PSUID
  • HHID
  • HHSPLITID
  • PERSONID
  • IDPSU - this one might be unique, not sure
  • WT - weights
  • RO3 - sex
  • RO7 - primary activity status
  • RO5 - Age

Testing vars for children 8-11 (page 38 of questionnaire)

  • TA8A, TA8B, TA9A, TA9B, TA10A, TA10B – Reading, math, and writing test results (vars with suffix “B” contain the results)
  • TA* - other vars related to ASER test

Education vars for all members (page 23 of questionnaire)

  • ED2-ED12 – ed-related info such as highest grade, current enrollment, etc for all household members.

Education vars for all hh members who were enrolled in school in the previous 12 months (page 44 of questionnaire)

  • CS3 - in school
  • CS4 - type of school
  • CS5 - distance to school
  • CS6 - standard years
  • CS8 - medium of instruction
  • CS9 - year English taught
  • CS10 - school hours per week
  • CS11 - homework hours per week
  • CS12 - private tuition hours per week
  • CS13 - days / months absent
  • (and many more)

Education vars for children 8-11 (page 46 of questionnaire)

  • CH*

Sample code to read in key variables

Setup chunk

library(tidyverse)
library(haven)
library(survey)
library(srvyr)

Open the individual dataset (36151-0001) and select key education variables

ihds_ind_dir <- "C:/Users/dougj/Documents/Data/IHDS/IHDS 2012/DS0001"
ind_file <- file.path(ihds_ind_dir, "36151-0001-Data.dta")
# read in just those variables that i need
# this is much faster than reading in everything and then selecting
df <- read_dta(ind_file, col_select = c(STATEID, DISTID, PSUID, HHID, HHSPLITID, PERSONID, IDPSU, WT, RO3, RO7, RO5, starts_with("CS"), starts_with("TA"), starts_with("ED")))

Specify the survey design using the survey package

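While the user guide has a short description of the sample, a clearer and more complete description can be found in IHDS technical paper 1 (https://ihds.umd.edu/sites/default/files/publications/papers/technical%20paper%201.pdf). The dataset includes no stratum or finite population correction (fpc) variables, so we specify only a two-stage cluster design (PSUs, then households) with the appropriate weights.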

df <- df %>% mutate(psu_expanded = paste(STATEID, DISTID, PSUID, sep ="-"), hh_expanded = paste(STATEID, DISTID, PSUID, HHID, HHSPLITID, sep ="-"))
# one row has a missing value for WT and most other variables; drop it
df <- df %>% filter(!is.na(WT))
df_svy <- svydesign(id =~ psu_expanded + hh_expanded, weights =~ WT, data = df)

Calculate average ASER reading scores by state in three ways: directly (using the weights but not the survey package), using the survey package, and using the srvyr package. All three yield the same point estimates. Note that the survey package and srvyr are both really slow.

# confirm we are looking at the right variable by looking at the variable label from Stata
attributes(df$TA8B)
$label
[1] "HQ34 26.8 Test: Reading level"

$labels
Cannot read 0     Letters 1       Words 2   Paragraph 3       Story 4 
            0             1             2             3             4 

$class
[1] "haven_labelled"
attributes(df$RO5)$label
[1] "HQ4 2.5 Age"
# look at response rates to TA8B by age --> looks like there are some NAs even for 8 to 11. 
# unclear why that is.  
df %>%
  group_by(RO5) %>%
  summarise(non_na_count = sum(!is.na(TA8B)), na_count = sum(is.na(TA8B)) )

# Calculate average reading score DIRECTLY
attributes(df$STATEID)$labels
   Jammu & Kashmir 01   Himachal Pradesh 02             Punjab 03         Chandigarh 04 
                    1                     2                     3                     4 
       Uttarakhand 05            Haryana 06              Delhi 07          Rajasthan 08 
                    5                     6                     7                     8 
     Uttar Pradesh 09              Bihar 10             Sikkim 11  Arunachal Pradesh 12 
                    9                    10                    11                    12 
          Nagaland 13            Manipur 14            Mizoram 15            Tripura 16 
                   13                    14                    15                    16 
         Meghalaya 17              Assam 18        West Bengal 19          Jharkhand 20 
                   17                    18                    19                    20 
            Orissa 21       Chhattisgarh 22     Madhya Pradesh 23            Gujarat 24 
                   21                    22                    23                    24 
       Daman & Diu 25 Dadra+Nagar Haveli 26        Maharashtra 27     Andhra Pradesh 28 
                   25                    26                    27                    28 
         Karnataka 29                Goa 30        Lakshadweep 31             Kerala 32 
                   29                    30                    31                    32 
        Tamil Nadu 33        Pondicherry 34    Anadman/Nicobar 35 
                   33                    34                    35 
df %>%
  group_by(STATEID) %>%
  summarise(weighted.mean(TA8B, na.rm = TRUE, w = WT))

# Calculate average reading score USING SURVEY PACKAGE
svyby(~TA8B, ~STATEID, df_svy, svymean, na.rm=TRUE)

# Calculate average reading score USING SRVYR PACKAGE 
# note: srvyr takes bare column names for the ID variables rather than a formula
df_srvyr <- df %>%
  as_survey_design(c(psu_expanded, hh_expanded), weights = WT)

Temp - calculate ICC for ASER scores

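Argh. This is why R is annoying! I am not really sure this works, but it seems like it should. Caveat: the expression below divides the PSU-level standard deviation (not the variance) by the total variance of TA8B, so the value shown is not a true ICC, which would be the between-PSU variance divided by the total variance.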

attributes(VarCorr(fit)$PSUID)$stddev / aser_var
(Intercept) 
 0.06675789 
---
title: "Overview of IHDS 2012 Education Data"
output:
  html_notebook: default
  pdf_document: default
---

## Summary
This notebook describes the IHDS 2011-12 household survey data and accompanying documentation. It also provides sample code to import the individual dataset (DS0001) and to use the survey and srvyr packages for inference with it.

## Overview of IHDS data
IHDS (India Human Development Survey) is a panel survey led by Sonalde Desai and others at NCAER and the University of Maryland. The first round was conducted in 2004-05 and the second in 2011-12. In both rounds, a test similar to the ASER test was administered to all children aged 8-11. (The response rate for the test was not great, though; I think Desai and co-authors explain the reasons in their India Policy Forum publication.) In addition, the survey asked basic questions about enrollment and educational attainment for all household members.


## Datasets and documentation
When you download the round 2 data, you get 14 datasets. These are split into folders "DS00X", where X corresponds to the following:

1. Individual
2. Household
3. Eligible Women
4. Birth History
5. Medical Staff
6. Medical Facilities
7. Non Resident
8. School Staff
9. School Facilities
10. Wage and Salary
11. Tracking
12. Village
13. Village Panchayat
14. Village Respondent

Note that, unlike the panel dataset, there is only one version of the files. (For the panel dataset, there are different versions of the files corresponding to different ways of merging/combining the two rounds of data.) More information about each dataset, including which weight variable to use with it and how to merge the datasets, can be found [here](https://www.icpsr.umich.edu/icpsrweb/content/DSDR/idhs-II-data-guide.html). According to the documentation, datasets should be merged on stateid, distid, psuid, hhid, and hhsplitid (STATEID, DISTID, PSUID, HHID, and HHSPLITID in the files).
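
As a rough illustration of that merge, here is a minimal sketch using two made-up tibbles in place of the individual (DS0001) and household (DS0002) files. The key names follow the identifier variables listed in this notebook, with `hh` taken to be `HHID`; `toy_ind`, `toy_hh`, and the `INCOME` column are hypothetical.

```{r}
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the individual and household files
toy_ind <- tibble(
  STATEID   = c(1, 1, 1, 2),
  DISTID    = c(1, 1, 1, 3),
  PSUID     = c(1, 1, 2, 1),
  HHID      = c(10, 10, 10, 10),
  HHSPLITID = c(1, 1, 1, 1),
  PERSONID  = c(1, 2, 1, 1)
)
toy_hh <- tibble(
  STATEID   = c(1, 1, 2),
  DISTID    = c(1, 1, 3),
  PSUID     = c(1, 2, 1),
  HHID      = c(10, 10, 10),
  HHSPLITID = c(1, 1, 1),
  INCOME    = c(100, 200, 300)  # hypothetical household-level variable
)

hh_keys <- c("STATEID", "DISTID", "PSUID", "HHID", "HHSPLITID")

# Each individual picks up the variables of his or her household
toy_merged <- left_join(toy_ind, toy_hh, by = hh_keys)

# Individuals with no matching household record (none in this toy example)
toy_unmatched <- anti_join(toy_ind, toy_hh, by = hh_keys)
```

Checking `toy_unmatched` after a real merge is a cheap way to confirm that the key set is complete.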

Documentation for the 2012 dataset is saved in the "IHDS/IHDS 2012" folder in my data folder. The user guide contains basic information about the sampling strategy and various constructed variables (e.g. the consumption variables). The data guide seems to just duplicate this information. To figure out which variables to use, go to the codebooks and questionnaires located in each of the 

### Finding variables in the datasets
The fastest way to find relevant variables in the dataset is to first search for keywords in the questionnaire to find the relevant question number and then search for the question number in the codebook.  This can be a bit confusing since the variables in the codebook are organized by subject rather than ordered by question number. (For example, some of the education questions from the household roster are coded as "CS" questions and come later in the roster.)

Note that the codebook section **"ED5: HQ19 11.5 Education: Enrolled now"** refers to variable **ED5** in the dataset, which comes from **question 11.5** on **page 19** of the **household questionnaire**.
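
Because `haven` stores each Stata variable label as a `label` attribute, the same lookup can also be done inside R once a file is loaded. A minimal sketch, with a made-up two-column data frame standing in for the real data; `label_lookup` is a hypothetical helper name.

```{r}
# Hypothetical stand-in for a data frame imported with read_dta()
toy_labelled <- data.frame(ED5 = c(1, 0), RO5 = c(9, 34))
attr(toy_labelled$ED5, "label") <- "HQ19 11.5 Education: Enrolled now"
attr(toy_labelled$RO5, "label") <- "HQ4 2.5 Age"

# Collect variable names and labels into one searchable table
label_lookup <- function(data) {
  labels <- vapply(
    data,
    function(x) {
      lab <- attr(x, "label", exact = TRUE)
      if (is.null(lab)) NA_character_ else as.character(lab)
    },
    character(1)
  )
  data.frame(variable = names(data), label = unname(labels),
             stringsAsFactors = FALSE)
}

toy_lookup <- label_lookup(toy_labelled)

# Search by keyword or by question number
toy_lookup[grepl("Education", toy_lookup$label), ]
toy_lookup[grepl("11.5", toy_lookup$label, fixed = TRUE), ]
```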


## Key education variables in the individual dataset (DS0001)
I have included a list of some of the key education-related variables (and other basic variables) below. Note that this list might not be comprehensive (i.e. there may be some ed related vars which I have missed).


### Identifiers and other vars
* STATEID
* DISTID
* PSUID
* HHID
* HHSPLITID
* PERSONID
* IDPSU - this one might be unique, not sure
* WT - weights
* RO3 - sex
* RO7 - primary activity status
* RO5 - Age
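
One way to settle whether an identifier (or a combination of identifiers) is unique is to compare the number of distinct key values with the number of rows. A minimal sketch on made-up data; `is_unique_key` is a hypothetical helper, and the toy values say nothing about how IDPSU actually behaves in the IHDS files.

```{r}
library(dplyr)
library(tibble)

# TRUE if the given columns jointly identify each row exactly once
is_unique_key <- function(data, ...) {
  n_distinct_keys <- data %>% distinct(...) %>% nrow()
  n_distinct_keys == nrow(data)
}

# Made-up example in which PSUID restarts within each district
toy_ids <- tibble(
  STATEID  = c(1, 1, 2, 2),
  DISTID   = c(1, 2, 1, 1),
  PSUID    = c(1, 1, 1, 2),
  PERSONID = c(1, 1, 1, 1)
)

is_unique_key(toy_ids, PSUID)                   # FALSE for this toy data
is_unique_key(toy_ids, STATEID, DISTID, PSUID)  # TRUE for this toy data
```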


### Testing vars  for children 8-11 (page 38 of questionnaire)
* TA8A, TA8B, TA9A, TA9B, TA10A, TA10B -- Reading, math, and writing test results (vars with suffix "B" contain the results)
* TA* - other vars related to ASER test

### Education vars for all members (page 23 of questionnaire)
* ED2-ED12 -- ed-related info such as highest grade, current enrollment, etc for all household members. 

### Education vars for all hh members who were enrolled in school in the previous 12 months (page 44 of questionnaire)
* CS3 - in school
* CS4 - type of school
* CS5 - distance to school
* CS6 - standard years
* CS8 - medium of instruction
* CS9 - year English taught
* CS10 - school hours per week
* CS11 - homework hours per week
* CS12 - private tuition hours per week
* CS13 - days / months absent
* (and many more)

### Education vars for children 8-11 (page 46 of questionnaire)
* CH* 

## Sample code to read in key variables
Setup chunk
```{r setup}
library(tidyverse)
library(haven)
library(survey)
library(srvyr)
```

Open the individual dataset (36151-0001) and select key education variables

```{r}
ihds_ind_dir <- "C:/Users/dougj/Documents/Data/IHDS/IHDS 2012/DS0001"
ind_file <- file.path(ihds_ind_dir, "36151-0001-Data.dta")
# Read in only the variables needed; this is much faster than reading
# everything and then selecting.
df <- read_dta(
  ind_file,
  col_select = c(
    STATEID, DISTID, PSUID, HHID, HHSPLITID, PERSONID, IDPSU, WT,
    RO3, RO7, RO5,
    starts_with("CS"), starts_with("TA"), starts_with("ED")
  )
)
```
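
The same column-selection pattern can be tried without the IHDS files by writing a small made-up table to a temporary `.dta` file and reading part of it back. This is only a sketch; the toy columns (including `XX1`) are hypothetical.

```{r}
library(dplyr)   # provides starts_with()
library(haven)
library(tibble)

toy_path <- tempfile(fileext = ".dta")
write_dta(
  tibble(STATEID = c(1, 2), WT = c(1.5, 2.5), TA8B = c(3, 4),
         ED5 = c(1, 0), XX1 = c(9, 9)),
  toy_path
)

# Only the requested columns are returned; XX1 is skipped
toy_subset <- read_dta(
  toy_path,
  col_select = c(STATEID, WT, starts_with("TA"), starts_with("ED"))
)
names(toy_subset)
```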

## Specify the survey design using the survey package
While the user guide has a short description of the sample, a clearer and more complete description can be found [here](https://ihds.umd.edu/sites/default/files/publications/papers/technical%20paper%201.pdf). The dataset includes no stratum or finite population correction (fpc) variables, so we specify only a two-stage cluster design (PSUs, then households) with the appropriate weights.

```{r}
df <- df %>%
  mutate(
    psu_expanded = paste(STATEID, DISTID, PSUID, sep = "-"),
    hh_expanded = paste(STATEID, DISTID, PSUID, HHID, HHSPLITID, sep = "-")
  )
# One row has a missing value for WT and most other variables; drop it
df <- df %>% filter(!is.na(WT))
df_svy <- svydesign(id = ~ psu_expanded + hh_expanded, weights = ~WT, data = df)
```

Calculate average ASER reading scores by state in three ways: directly (using the weights but not the survey package), using the survey package, and using the srvyr package. All three yield the same point estimates. Note that the survey package and srvyr are both really slow.

```{r}
# confirm we are looking at the right variable by looking at the variable label from Stata
attributes(df$TA8B)
attributes(df$RO5)$label

# look at response rates to TA8B by age --> looks like there are some NAs even for 8 to 11. 
# unclear why that is.  
df %>%
  group_by(RO5) %>%
  summarise(non_na_count = sum(!is.na(TA8B)), na_count = sum(is.na(TA8B)) )

# Calculate average reading score DIRECTLY
attributes(df$STATEID)$labels
df %>%
  group_by(STATEID) %>%
  summarise(weighted.mean(TA8B, na.rm = TRUE, w = WT))

# Calculate average reading score USING SURVEY PACKAGE
svyby(~TA8B, ~STATEID, df_svy, svymean, na.rm=TRUE)

# Calculate average reading score USING SRVYR PACKAGE 
# note: srvyr takes bare column names for the ID variables rather than a formula
df_srvyr <- df %>%
  as_survey_design(c(psu_expanded, hh_expanded), weights = WT)

df_srvyr %>%
  group_by(STATEID) %>%
  summarize(aser_mean = survey_mean(TA8B, na.rm =TRUE))

```
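
To check that the three approaches agree on point estimates without needing the IHDS files, the comparison can be reproduced on the `apiclus1` cluster sample that ships with the `survey` package (school districts `dnum` as clusters, `pw` as weights, `api00` as the outcome). This is a sketch on example data, not IHDS results.

```{r}
library(dplyr)
library(survey)
library(srvyr)

data(api, package = "survey")

# 1. Weighted mean computed directly
api_direct <- apiclus1 %>%
  group_by(stype) %>%
  summarise(api00_mean = weighted.mean(api00, w = pw))

# 2. survey package
api_svy <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1)
api_by_survey <- svyby(~api00, ~stype, api_svy, svymean)

# 3. srvyr package
api_by_srvyr <- apiclus1 %>%
  as_survey_design(ids = dnum, weights = pw) %>%
  group_by(stype) %>%
  summarise(api00_mean = survey_mean(api00))

# Point estimates agree; only the survey-based versions report standard errors
all.equal(api_direct$api00_mean, as.numeric(api_by_survey$api00))
all.equal(api_direct$api00_mean, api_by_srvyr$api00_mean)
```

The direct calculation is the quickest of the three, but it gives no standard errors, which is the reason to declare the design at all.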


## Temp - calculate ICC for ASER scores
Argh. This is why R is annoying! I am not really sure this works, but it seems like it should. The ICC is the share of the total variance in TA8B that lies between PSUs.
```{r}
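library(lme4)

# Fit a random-intercept model with PSUs as clusters.
# PSUID on its own does not appear to be unique across districts,
# so group on psu_expanded (built above).
temp <- df %>%
  transmute(reading = as.numeric(TA8B), psu_expanded) %>%
  filter(!is.na(reading))
fit <- lmer(reading ~ (1 | psu_expanded), data = temp)

# ICC = between-PSU variance / (between-PSU variance + residual variance)
vc <- as.data.frame(VarCorr(fit))
var_psu <- vc$vcov[vc$grp == "psu_expanded"]
var_resid <- vc$vcov[vc$grp == "Residual"]
print("ICC is")
var_psu / (var_psu + var_resid)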
```
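
For reference, the usual definition of the ICC from a random-intercept model is the between-cluster variance divided by the sum of the between-cluster and residual variances. A minimal sketch on simulated data where the true ICC is 0.25 by construction (cluster variance 1, residual variance 3); all names here are hypothetical, and the estimate should land close to, but not exactly on, 0.25.

```{r}
library(lme4)

set.seed(42)
n_clusters <- 500
n_per_cluster <- 20

# Simulate y = cluster effect + individual noise
sim <- data.frame(cluster = rep(seq_len(n_clusters), each = n_per_cluster))
cluster_effect <- rnorm(n_clusters, mean = 0, sd = 1)   # variance 1
sim$y <- cluster_effect[sim$cluster] +
  rnorm(nrow(sim), mean = 0, sd = sqrt(3))              # variance 3

sim_fit <- lmer(y ~ (1 | cluster), data = sim)

# Variance components: one row for the cluster intercept, one for the residual
sim_vc <- as.data.frame(VarCorr(sim_fit))
sim_var_cluster <- sim_vc$vcov[sim_vc$grp == "cluster"]
sim_var_resid <- sim_vc$vcov[sim_vc$grp == "Residual"]

sim_icc <- sim_var_cluster / (sim_var_cluster + sim_var_resid)
sim_icc
```

Running the same steps on data with a known answer is a quick way to confirm that the variance components are being pulled out of `VarCorr()` correctly.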

