Overall objective

The purpose of the final assessment is for you to use Demographic and Health Survey (DHS) data to create Rmarkdown reports for a chosen country. This will give you the opportunity to apply your skills in data analysis and visualization to real-world data.

Here is a quick summary of the report requirements.

The report should:

Finally, you will record a short video (5 to 15 minutes) in which you walk through your Rmd document. (This video presentation is not graded; it is just a way to verify that you are the actual author of your work, so do not spend time preparing or perfecting your presentation!)

The link to your document on RPubs and the link to your video should both be posted as a comment on the assignment page before Jan 31st, 23:59.

Those are the essential instructions. The rest of this document is a guide to help you excel at this task, but if you feel ready to proceed with just the instructions above, feel free to charge ahead; there is no need to read further!

Prelude: systematic vs just-in-time learning

As you work on this assessment, you will encounter quite a few functions and concepts that you have not previously learned. You will therefore need to pick these up “on the go” as you work through the analysis. This type of learning, where you acquire knowledge at the point when you need it, is known as just-in-time learning. It differs from the systematic, front-loaded approach that we have used throughout the course thus far, in which all relevant information was presented before it was applied or tested.

Just-in-time learning will require you to be proactive in seeking out resources and asking for help when needed. It may be challenging at times, but it is a valuable skill to have as a programmer and will serve you well in your future career.

Now let’s go through the recommended sequence of steps for working on the assessment.

Step 1: Understand the DHS

Before beginning your analysis, it is important to understand the Demographic and Health Surveys (DHS) and the data that you will be using.

The Demographic and Health Surveys (DHS) are a series of standardized surveys conducted in developing countries that collect data on various aspects of population, health, and nutrition.

These surveys, which are funded by the United States Agency for International Development (USAID), provide policymakers with accurate and reliable data to inform decision-making and improve public health outcomes.

The individual (women’s) recode files contain individual-level data on women aged 15 to 49 in the surveyed households. This data will be the source for the final reports.

You can learn more about the DHS program and the various surveys they run by watching the following video:

Pay particular attention to the section from 2:33 to 3:20, as it provides information on the specific survey type whose data you will be using.

Step 2: Choose a theme of interest.

Now that you have a general understanding of the DHS surveys and the data that is collected, it’s time to choose a theme for your report. This will determine the focus of your analysis and the specific statistics you will be presenting.

To help guide your selection, consider looking through past “Key Findings/Summary Reports” that certain DHS-participating countries have published. These reports highlight important patterns and trends from the surveys and can serve as inspiration for your own analysis. Links to a few examples can be found below:

And you can peruse all the available reports here:

Some potential topics to consider include family planning, gender, maternity, alcohol and tobacco, fertility, HIV, tuberculosis, youth, disability, women’s empowerment and domestic violence, and living conditions.

Keep in mind that you are not limited to replicating these reports. You can use them for inspiration as you choose your theme, but feel free to explore other topics as well.

Importantly, we recommend you choose a theme and set of statistics that describes the women surveyed, not their children, and not men.

To calculate any statistics about men, you would need to request a different data file, the men’s recode, where men’s answers were recorded. For example, the first figure on page 4 of the Nigeria 2018 Key Findings report compares age at first sex and marriage for men and women. This is not something you can replicate with the women’s dataset alone.

To calculate statistics about children, you would need to reshape (pivot) the women’s dataset, since each woman can have multiple children. Although you have learned how to pivot, we suspect that this particular instance of pivoting will prove quite challenging, so we recommend against child-focused analyses. For example, for a topic like anaemia, page 13 of the Nigeria 2018 Key Findings report states that 68% of children age 6-59 months are anemic. This cannot be calculated directly from the women’s dataset; you would need to pivot the dataset first. So you should avoid choosing a report theme that focuses on children.

Step 3: Obtain the women’s recode file for a specific country

To complete your analysis, you will need access to the individual (women’s) recode files for a specific country’s DHS survey. All DHS data can be obtained directly from the DHS website by registering for an account and then requesting the datasets.

To save you time, however, we have requested and compiled the latest survey data for countries surveyed since 2003. To access these files, follow this Google Drive link. From this folder, choose the zip file for the country you are interested in and download it to your computer.

You may be a bit puzzled by the DHS file names. These names follow a specific convention, with the Country Code (e.g. CM for Cameroon), Dataset Type (e.g. IR for the women’s dataset), Dataset Version (indicated by the first and second characters, representing the DHS Phase and Release version, respectively), and File Format (e.g. DT-Stata, SD-SAS) all included. For your convenience, we have renamed the zip files using the full country names rather than the country codes.
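As a rough illustration of the naming convention above, the sketch below splits an 8-character DHS file name into its parts. The helper name parse_dhs_name() is made up for this guide (it is not part of any DHS or R package), and it assumes the standard 8-character layout described above.

```r
# Hypothetical helper: split an 8-character DHS file name into the
# components described in the naming convention.
parse_dhs_name <- function(file_name) {
  # Assumption: names always follow the 8-character layout
  stopifnot(nchar(file_name) == 8)
  list(
    country_code = substr(file_name, 1, 2), # e.g. CM for Cameroon
    dataset_type = substr(file_name, 3, 4), # e.g. IR for the women's dataset
    phase        = substr(file_name, 5, 5), # DHS phase
    release      = substr(file_name, 6, 6), # release version
    file_format  = substr(file_name, 7, 8)  # e.g. DT for Stata
  )
}

parts <- parse_dhs_name("CMIR71DT")
str(parts)
```

For the zip files in the shared folder, remember that the country code has been replaced by the full country name, so this helper would apply to the files inside the unzipped folder rather than to the zip names themselves.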

Apart from the country name, the other part of the file name you should pay attention to is the number right after “IR”, which indicates the DHS survey phase. For example, if the file name is “Cameroon_IR71DT”, the number 7 indicates that this is DHS survey phase 7. The rest of the file name nomenclature is not particularly important for this analysis, but you can learn more about it in this video:

Once you have downloaded and unzipped the folder, you will see that it contains several different file types. The two files that you will need for this analysis are the .DTA file (your main data file in STATA format) and the .MAP file (your data dictionary). Leave these files as they are and create a new RStudio project for your analysis. Place the DHS data in the data subfolder of your project.

Step 4: Import your dataset

To import your DHS dataset, you will use the read_dta() function from the {haven} package. Because DHS files can be quite large, it is recommended that you do not import the entire dataset at once (this may take unreasonably long to run). Instead, you can use two arguments to select a subset of the data to import: n_max and col_select.

The n_max argument allows you to specify the number of rows to import. For example:

ir_raw <- haven::read_dta(here::here("data/NGIR7BDT/NGIR7BFL.DTA"),
                          n_max = 300)

This will only import the first 300 rows.

The col_select argument allows you to select specific columns to import by either name or position. For example:

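# select by position
ir_raw <- haven::read_dta(here::here("data/NGIR7BDT/NGIR7BFL.DTA"),
                          col_select = 1:3)

# select by name (col_select must be named; the second positional
# argument of read_dta() is encoding, not col_select)
ir_raw <- haven::read_dta(here::here("data/NGIR7BDT/NGIR7BFL.DTA"),
                          col_select = c(caseid, v000, v001))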

You will probably want to start your analysis by importing all columns but just a few rows (maybe the first 100, with n_max = 100). Then, once you have decided which columns are of interest, you can import all rows but just a few columns (the ones chosen for your analysis).
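The two-pass approach described above can be sketched as follows. To keep the example self-contained, it first writes a small toy .dta file to a temporary location; the values (and the choice of v012 as the column of interest) are made up for illustration and stand in for your real DHS file.

```r
library(haven)

# Toy stand-in for a DHS file (made-up values), saved as a temporary .dta
toy <- data.frame(
  caseid = sprintf("%03d", 1:500),
  v000   = "NG7",
  v001   = rep(1:50, each = 10),
  v012   = rep(15:49, length.out = 500)
)
toy_path <- tempfile(fileext = ".dta")
write_dta(toy, toy_path)

# Pass 1: all columns, but only the first 100 rows, to explore the data
ir_peek <- read_dta(toy_path, n_max = 100)

# Pass 2: all rows, but only the columns chosen for the analysis
ir_small <- read_dta(toy_path, col_select = c(caseid, v012))

dim(ir_peek)
dim(ir_small)
```

With your own data, you would replace toy_path with the path to the .DTA file in your project's data subfolder.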

A final special note about importing data: the read_dta() function will read any factor data columns in as a special data type called labelled data. This type is unfamiliar to you and can be problematic for later analysis, so we recommend converting all such columns to regular R factors using the function haven::as_factor().

For example:

library(dplyr) # provides the %>% pipe

ir_converted <-
  ir_raw %>%
  haven::as_factor()
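To see what this conversion does, here is a minimal sketch on a toy data frame. The column contents and labels are made up for illustration; in a real import, read_dta() creates the labelled columns for you.

```r
library(haven)

# Toy data frame with one plain column and one labelled column
# (made-up values and labels)
toy <- data.frame(caseid = 1:4)
toy$v025 <- labelled(c(1, 2, 2, 1), labels = c(urban = 1, rural = 2))

# Before conversion: the column holds the numeric codes plus labels
class(toy$v025)

# After conversion: labelled columns become regular factors,
# while other columns are left untouched
toy_converted <- as_factor(toy)
levels(toy_converted$v025)
```

By default, as_factor() applied to a data frame only converts the labelled columns, which is why it is safe to call on the whole dataset at once.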

Step 5: Identify and analyze the relevant variables

You now have the data, and from looking at official Summary Reports from DHS you have a sense of the theme, and some potential indicators/statistics you would like to plot and calculate.

Your next step is to figure out which of the DHS variables are relevant for the analyses you are conducting. This is not a simple task, since the DHS variable names are not very descriptive (v000, v001 and so on), and there are several thousand variables.

For this, you can use two key resources.

  1. The first is the .MAP file we talked about earlier. If you open this (with TextEdit or Notepad) you’ll see that it is a data dictionary with a definition for each variable. For example:

    CASEID (id) Case Identification
    V000 Country code and phase
    V001 Cluster number
    V002 Household number
    V003 Respondent’s line number

And so on.

So, if you are working on a theme like HIV, you can simply search for the word “HIV” in this data dictionary, and it will lead you to the HIV-relevant variables. Sometimes the variable description here is not detailed enough. In that case, you can search for the variable name in the DHS recode manuals, which you can find on this page. These are like the data dictionary, but they explain each variable in a bit more detail.

  2. Secondly, the DHS has created a wonderful online resource called “Guide to DHS statistics”, viewable here. If you are studying a topic like HIV, you can simply go to the HIV section, click on a specific subtopic, e.g. “Prior HIV testing”, and you will see a list of variables that are relevant to that statistic.
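If you prefer to search the data dictionary from within R rather than in a text editor, a keyword search can be sketched as below. The example writes the five dictionary lines shown earlier to a temporary file so that it runs on its own; with your real data you would point readLines() at the .MAP file in your data subfolder. This assumes the file can be read as plain text.

```r
# Stand-in for a .MAP file, using the dictionary lines shown earlier
map_lines <- c(
  "CASEID (id) Case Identification",
  "V000 Country code and phase",
  "V001 Cluster number",
  "V002 Household number",
  "V003 Respondent's line number"
)
map_path <- tempfile(fileext = ".MAP")
writeLines(map_lines, map_path)

# Read the dictionary and keep the lines that mention a keyword
dictionary <- readLines(map_path)
hits <- grep("number", dictionary, ignore.case = TRUE, value = TRUE)
hits
```

For a theme like HIV, you would swap "number" for "HIV" and then look up the matching variable names in the recode manual.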

Acceptable Format

You can use any Rmarkdown-based format that you like!

To make this clear, we have created an output for each of the following formats:

Introduction

Requirements

Pick a specific theme covered by the Demographic and Health Surveys. This could be one of those mentioned in the list above, or another theme of your choice.

Your report should contain:

Rubric

Data wrangling skills: The work and accompanying video demonstrate proficiency in using at least some of the dplyr verbs.

Visualization skills: Plots and figures are …

Figures are clean, well-presented and follow.

Deadline

How to submit your work

FAQs

The DHS Survey data is weighted. Should I use weights?

No, you do not have to take weighting into account.

My percentages and numbers differ from those seen in the official publications

This is because of weighting: the official publications apply the survey weights, while your unweighted calculations do not.
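The toy example below illustrates how the same data can give different percentages with and without weights. All numbers and column names are made up; this is only meant to show the effect, and you are not expected to apply weights in your report.

```r
# Made-up data: six respondents, whether each was tested (1 = yes),
# and a made-up survey weight for each respondent
toy <- data.frame(
  tested = c(1, 1, 0, 0, 1, 0),
  weight = c(0.5, 0.5, 2, 2, 0.5, 0.5)
)

# Unweighted: every respondent counts equally
unweighted_pct <- 100 * mean(toy$tested)

# Weighted: respondents with larger weights count for more
weighted_pct <- 100 * weighted.mean(toy$tested, toy$weight)

unweighted_pct # 50
weighted_pct   # 25
```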

Why should I try to do a good job at this?

This will

Must the report be in English?

No. You are allowed to write the report in English, French or Spanish.

Common features of high-quality work:


Common features of poorly performing work.

Code styling

• Code is well-formatted.

• Student used RStudio's code reformatter, or followed consistent style guides

• Lines of code whose function is not intuitive are well commented.


Overall steps

  1. Understand the overall goals of the assignment.

    1. What are the DHS surveys? Watch the following video.


    https://www.youtube.com/watch?v=abP6xeb50Do&ab_channel=TheDHSProgram


Pay particular attention from 2:33, which is when the DHS survey data (of the kind you will be analyzing) is introduced.


  1. Two short Rmarkdown-based DHS summary reports for two countries on a theme of your choosing. Should contain:

    1. At least 200 words of text

    2. At least 4 plots using ggplot.
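As one possible starting point for such a plot, the sketch below summarises a toy column with dplyr and draws a bar chart with ggplot2. The data, the column name residence, and the labels are all made up for illustration; your own plots will depend on the theme and variables you choose.

```r
library(dplyr)
library(ggplot2)

# Made-up data standing in for a converted DHS column
toy <- data.frame(residence = c("urban", "rural", "rural", "rural"))

# Count respondents per category and convert counts to percentages
residence_pct <- toy %>%
  count(residence) %>%
  mutate(pct = 100 * n / sum(n))

# Bar chart of the (unweighted) percentages
p <- ggplot(residence_pct, aes(x = residence, y = pct)) +
  geom_col() +
  labs(
    x = "Place of residence",
    y = "Percentage of respondents",
    title = "Respondents by place of residence (toy data)"
  )

p
```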