(This is a work-in-progress manuscript, updated on 2025-01-19)

Becoming Bilingual in R for Stata Speakers

1 Introduction

1.1 Why do I write this living and open book?

After having used Stata for over 20 years, I began to learn R by taking online open courses in 2019, with a goal to be able to use a free software for data analysis. As I heard about the notorious initial learning curve, I had to endure painful few months first. It was frustrating partially because I knew exactly how to do ‘that’ in Stata. Slowly, however, I saw unique benefits of R that Stata does not offer easily. Specifically, I was thoroughly impressed with R’s ability in two aspects: to reproduce data management and analysis seamlessly; and to share results interactively.

It turned out that it was my great fortune that I learned R in 2019. During 2020 - the year of the COVID-19 pandemic and data science in full spot light, I continued learning and using R. The two features that impressed me earlier were proven to be invaluable for my work with COVID-19 data and beyond. And, I started appreciate and even like features/characteristics of R that I initially thought weird and challenging. I am completely sold on R.

But, don’t get me wrong. Stata is the first statistical analysis software that I learned, and it is my mother tongue (in data). As a bilingual speaker of Korean and English, however, this is exactly why I decided to keep both “languages” and help other Stata speakers learn R. Just like learning a second language, I - and you - need to think differently, be willing to make embarrassing mistakes, and keep practicing.

This “book” has two aims one for me one for you: * to compile and organize R code for my own continued learning and practice; and
* to share my study note and cheat sheets with fellow Stata speakers.
So many people helped me learn R, and I want to contribute to the community - specifically from a Stata speaker’s perspective.

Also, I do not know everything in Stata. If you have suggestions to improve my Stata code here, hey please do let me know!

1.2 Who is this for?

Those who know and use Stata fairly well, and
Those who need and want to use R specifically (1) to facilitate reproducibility in your work and (2) create and share interactive results and data products

This book aims to help Stata speakers kick start learning R in their mother tongue. Consider it a “how to speak in XX” for tourists. It will help you have reasonably understandable conversations in R. But, you will need to learn “grammar” elsewhere to be truly fluent in R - see further general resources below.

Also, this is probably NOT for those who primarily conduct advanced statistical analyses. For such purposes, I have not found substantial advantages in R over Stata yet (and, thus, you may not need to suffer from learning a second language).

1.3 How is it organized?

The main focus is on data management (Chapter 2) and exploratory analysis (Chapter 3), including creation of basic charts. Chapter 4 further dives into making charts, using plot_ly. Finally, Chapter 5 lists resources for R Markdown, Shiny App, and Slides - the tools and features that will help you fully benefit from R.

Every Stata code/command is “translated” into R code/command.

Stata code appears in white box.

#Translated R code appears in light gray box.

Why on earth in R…? is in red boxes.
It covers unique things in R that do not really exist in Stata. This is exactly what I mean by different ways of thinking across different languages. For me these were bottlenecks when I was learning R first.

Tips are in yellow boxes.
They include basic vocabulary for travelers.

Recap are in blue boxes.
They summarize key concepts and lessons in the section.

Throughout this book, Demographic and Health Surveys model data are used for examples. Download Individual Recode in Stata dataset format. I also use DHS indicator data (i.e., estimates calculated for specific indicators from indvidual surveys) from Demographic and Health Surveys API.

1.4 What additional resources can be helpful for Stata speakers?

These are my go-to resources when I’m stuck:

I heavily relied on this site by (http://www.matthieugomez.com/statar/).
R for Stata Users by Muenchen and Hilbe is a book that I bought several years ago and finally started benefiting from. The author’s website (http://r4stats.com/books/r4stata/) has many more interesting and helpful resources.
Princeton’s training site (https://dss.princeton.edu/training/RStata.pdf).
This paper on DHS analyses using R.
IPUMS PMA resources - which often use both Stata and R.

1.5 What additional resources can be helpful in general?

There are many, but these are my favorite resources:

StackOverflow Q&A forum. I almost always can find an answer here.
R Studio Cheat Sheets
JHU coursera courses. These are your “grammar” courses. It can be painful first, but it pays off. I took them twice.
These are “textbooks” for the two above courses.

2 Data management

2.1 Open and assess data

2.1.1 Set your working directory

cd "~/Dropbox/0iSquared/iSquared_Bilingual"

setwd("~/Dropbox/0iSquared/iSquared_Bilingual")

2.1.2 Open your dataset

Open or load the DHS model dataset. Make sure the example DHS model dataset is saved in your working directory.

use ZZIR62FL.DTA, clear

library(haven)
modeldata<-data.frame(read_dta("ZZIR62FL.dta"))

Why on earth in R…?
Did you notice that we gave a name to the dataset, modeldata? In R, we need to specify what data we are working on in each command, except when we “pipe through” a series of commands” (coming up shortly). Yes, for many Stata speakers this is really odd, since Stata deals with only one dataset in use at a time. On the other hand, R can be hyper multitasking. For instance, we can see age distribution of observations in one dataset (e.g., Nigeria DHS) and the same distribution in another dataset (e.g., Bangladesh DHS) back to back.

Why on earth in R…?
For any code outside base functions, we need to “check out and open a book from Library” in each time/session you open R Studio. The specific code will be in the book, and we can use the code until we close the session/window. This is kind of similar with ado files in Stata. But, a big difference: in Stata, we download an ado file once and we are set. In R, we need to first install a package, install.packages("haven"), (i.e., make sure a book exists in your library) and then load the package to your R session, library(haven) (i.e., open the book).

Did you notice that quotation marks are used for install.packages but not for library? Don’t ask me why, and let me know if you know the answer…

Tips
library vs. require
They both achieve the same goal - i.e., loading a package in a R session.
Difference? library allows loading multiple packages (e.g., library(tidyverse, haven), while require requires loading only one package at a time (e.g., require(tidyverse) require(haven)). require does have some advantages, but it seems to come to almost a matter of preference. I like library, and my work has never required me to use require.

Check the number of observations and variables

describe, short

dim(modeldata)

There are 8348 observations for 4275 in the DHS model dataset! For our practice, let’s work with a much smaller dataset, using only the basic characteristics of women (variables starting with “v0” or “v1”), current pregnant status (v213), marital status (v501, v502, v503), and contraceptive use (variables starting with “v3”).

keep caseid v0* v1* v3* v213 v501 v502 v503
describe, short

library(tidyverse)
dta <- select(modeldata, caseid, starts_with(c("v0", "v1", "v3")), v213, v501, v502, v503)

"Or, pipe through"

dta <- modeldata %>% select(caseid, starts_with(c("v0", "v1", "v3")), v213, v501, v502, v503) 

dim(dta)

And, assume there are many contraceptive use related variables that are not relevant for us. We want to drop them.

drop v3a* v305*
describe, short

dta <- dta %>% select( -starts_with(c("v3a", "v305"))) 

dim(dta)

Now, there are still 8348 observations but only 192 variables in the practice dataset. Also, note that we gave a different name for this practice dataset - dta.

Vocabulary
Among R speakers, variables are often called columns, and observations are called rows. Hence, ncol(dta) to count the number of columns or variables, and nrow(dta) to count the number of rows or observatiosn.

2.1.3 Assess variables

codebook v00*

dta %>% select(starts_with("v00")) %>% summary()

Tips
A few commonly used lingo in R
- data frame = dataset
- columns = variables
- rows = observations

Tips: famous package, tidyverse
It is one of the most widely used packages for data management. Also, it’s like an “umbrella-package” that installs many other popular packages (like ggplot2 and dplyr) together. In other word, if you load tidyverse, you do not need to load ggplot2 or dplyr (and severak other widely used ones) separately.

Tips: loading the usual suspect packages
Almost always, my code requires certain popular packages - such as tidyverse. I typically load them in the beginning of the script so that I do not need to think or worry about which packages are needed. The following is my favorite usual suspects. Start build your own!
- tidyverse
- plotly

2.2 Clean data

Big part of data science is data cleaning - right, fellow data janitors? Our example DHS model dataset has been already cleaned (Thank you, the DHS Program!), but let’s assume that we need to clean it even more. Below I list select data cleaning tasks that I had to do with other datasets.

2.2.1 Change variable names to lower case.

Strong personal preference, but I change all variables using only lower case. Whenever I type using both lower and upper cases, there are always errors from mixing cases.

rename *, lower

names(dta)<-tolower(names(dta))

2.2.2 Drop all-missing variables

Keep variables if at least one value exists. In other word, drop variables if all values are missing.

In DHS, some variables are country-specific. These will be still included in the dataset and will have missing in all observations.

missings dropvars * , force

dta <- dta[ ,colSums(is.na(dta))<nrow(dta)]

Now, there are still 8348 observations but only 166 in the practice dataset.

2.2.3 Impute for missing values

In this example, let’s use the national average using non-missing observations.

set seed 123
    
        *#create random numbers ranging 0-1
        gen random_number = runiform()
        
        gen age = v012 
        
        *#create age with missing in 20% of observations 
        gen age_with_missing = age 
        replace age_with_missing = . if random_number<0.2
        
        *#create age with imputed values for missing 
        gen age_with_impute = age_with_missing 
        egen temp=mean(age_with_missing)
        replace age_with_impute = temp if age_with_impute==.                            
    *"Compare distributions of the three age variables"
    sum *age*
    capture drop temp *age*

set.seed(123)

dtatest <- dta %>% 
    mutate(
        #create random numbers ranging 0-1
        random_number = runif(nrow(dta), 0, 1),   
        
        age = v012, 
        
        #create age with missing in 20% of observations 
        age_with_missing = age, 
        age_with_missing = replace(age_with_missing,
                                   random_number<0.2, 
                                   NA), 
        
        #create age with imputed values for missing 
        age_with_impute = age_with_missing, 
        age_with_impute = replace(age_with_impute, 
                                  is.na(age_with_impute)==TRUE, 
                                  mean(age_with_missing, na.rm=TRUE))
    )

"Compare distributions of the three age variables"
dtatest%>%select(starts_with("age"))%>%summary()

2.2.4 Change numeric variables following a specific pattern

Assume “Not Applicable” is coded as a negative value in a dataset. Change all to missing.

    *#create a variable that coded with -99 for not relevant 
    *#E.g., the number of marriages would be missing for those never married 
    gen num_married = v503 
    replace num_married = -99 if num_married==. 
        
    *#create a new variable that will be cleaned
    gen clean_num_married = num_married 
    
    *#replace the negative value with NA
    foreach var of varlist *clean* {
        replace `var' = . if `var'<0
        }
    *"Compare distributions from the two variables"
    sum *married*

dtatest <- dta %>% 
    mutate(
        #create a variable that coded with -99 for not relevant 
        #E.g., the number of marriages would be missing for those never married 
        num_married = v503, 
        num_married = replace(num_married, 
                              is.na(num_married)==TRUE, 
                              -99), 
        #create a new variable that will be cleaned 
        clean_num_married = num_married
    )%>%
    #replace the negative value with NA
    mutate_at(vars(starts_with("clean")), 
              ~replace(., .<0, NA)) 
              
"Compare distributions from the two variables"

[1] “Compare distributions from the two variables”

# dtatest%>%select(contains("married"))%>%summary()
summary(dtatest$num_married)

Min. 1st Qu. Median Mean 3rd Qu. Max. -99.00 -99.00 1.00 -29.15 1.00 2.00

summary(dtatest$clean_num_married)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s 1.00 1.00 1.00 1.21 1.00 2.00 2529

Tips
Unlike in Stata, we have to spell the function/command completely. For example, if we say sum(dtatest$num_married) instead of summary(dtatest$num_married), we get an error.

Tips
starts_with(c(“abc”, “xyz”)) = abc*, xyz*

2.2.5 Change numeric to character (or vise versa)

Change all numeric variables in the dataset to character

dtatest <- dta %>% select(starts_with("v00")) %>%
    mutate_if(is.numeric, as.character)
str(dtatest)

Change select numeric variables in the dataset to character

dtatest <- dta %>% select(starts_with("v00")) %>%
    mutate_at(vars(v008, v009), 
              ~replace(., is.numeric(.)==TRUE, as.character(.)))
str(dtatest)

2.3 Prepare data for your specific question

Presumably with a clear and specific data science question in our mind, now we have to prepare our analysis dataset - creating analysis variables and structuring the dataset at the level of analysis unit.

We will try select examples with simple questions.

2.3.1 Example 1: Does contraceptive use correlate with education at a village level?

Unit of analysis: cluster (v001)
Dependent variable: Proportion of women who use modern method in a cluster
3 independent variables:
– Proportion of women attended secondary school or higher in the cluster
– Median age of women in the cluster
– Urban vs. rural designation of the cluster

dta1 <- dta %>% 
    #Create individual-level variables 
    mutate(
        mcp = v313==3, #modern method users
        eduhigh = v106>=2, #attended secondary or higher  
        age = v012, #age in years
        urban = v102==1 #living in urban (constant within cluster)
    )%>%
    
    #calculate cluster-level values and collapse 
    group_by(v001)%>%
    summarise(
        cluster_mcp = mean(mcp), 
        cluster_eduhigh = mean(eduhigh), 
        cluster_age = median(age), 
        cluster_urban = mean(urban)==1 
    )

There are 217 clusters in the dataset, and now the analysis dataset dta1 has 217 observations. It has 5 variables: cluster ID, the dependent variables, and three independent variables.

Note that for this cluster-level analysis, we do not need to worry about sampling weight.

2.3.2 Example 2: Among women who are not currently pregnant, are those with higher education more likely to use modern contraceptive?

Unit of analysis: woman (caseid)
Dependent variable: using modern method
3 independent variables:
– Woman attended secondary school or higher
– Woman’s age
– Living in urban areas

dta2 <- dta %>% 
    #Keep only those who are not currenetly pregnant or unsure 
    filter(v213==0)%>%
    
    #Create individual-level variables 
    mutate(
        mcp = v313==3, #modern method users
        eduhigh = v106>=2, #attended secondary or higher  
        age = v012, #age in years
        urban = v102==1 #living in urban (constant within cluster)
    )

Now, excluding those who are currently pregnant, we have 7587 in our analysis dataset. FYI, there are a total of 8348 women in our practice modeldata.

3 Data analysis examples

In this section, we will explore data analysis examples, using peer-reviewed papers.

3.1 Case study using DHS API data

Monitoring Progress in Equality for the Sustainable Development Goals: A Case Study of Meeting Demand for Family Planning

Does age-adjusted measurement of contraceptive use better explain the relationship between fertility and contraception (Demographic Research 2018 Dec 39;45:1227-1240)

3.1.1 Access, assess, and tidy data

We will import three indicators/variables from DHS API data.

##### get necessary library #####
suppressWarnings(suppressMessages(library(jsonlite))) 
suppressWarnings(suppressMessages(library(data.table)))
suppressWarnings(suppressMessages(library(dplyr)))

##### call indicators #####
# Call and save individual indicator data here 
# This code uses API key, "DUMMY-123456". Replace it with your own valid key 

url1<-(paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
             # define surveys to include 
             "&surveyid=all",
             # define two indicators for which we need only national estimates   
             "&indicatorIds=FE_FRTR_W_TFR,MA_MSTA_W_UNI", 
             # define subnational dimensions - all and age
             "&breakdown=National",
             # # provide API key
             # "&APIkey=USAAID-113824")) 
             # define return per page (keep this at max...)
             "&perpage=50000"))

jsondata1<-fromJSON(url1)

dtaapi1<-data.table(jsondata1$Data)

#### Assess #####
dim(dtaapi1)
table(dtaapi1$IndicatorId)
table(dtaapi1$CharacteristicCategory, dtaapi1$CharacteristicLabel)

##### call indicators #####
# Call and save individual indicator data here 
# This code uses API key, "DUMMY-123456". Replace it with your own valid key 

url2<-(paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
             # define surveys to include 
             "&surveyid=all",
             # define the third indicator that needs age-specific estimates    
             "&indicatorIds=FP_CUSA_W_ANY", 
             # define subnational dimensions 
             "&breakdown=All",
             # # provide API key
             # "&APIkey=USAAID-113824")) 
             # define return per page (keep this at max...)
             "&perpage=5000&page=1"))

jsondata2<-fromJSON(url2)

dtaapi2<-data.table(jsondata2$Data)

#### Assess #####
dim(dtaapi2)
table(dtaapi2$IndicatorId)
table(dtaapi2$SurveyId)
# table(dtaapi2$CharacteristicCategory, dtaapi2$CharacteristicLabel)

url3<-(paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
              "&surveyid=all",
              "&indicatorIds=FP_CUSA_W_ANY",  
              "&breakdown=All",
              "&perpage=5000&page=2"))

jsondata3<-fromJSON(url3)

dtaapi3<-data.table(jsondata3$Data)

dim(dtaapi3)
table(dtaapi3$IndicatorId)
table(dtaapi3$SurveyId)

url4<-(paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
              "&surveyid=all",
              "&indicatorIds=FP_CUSA_W_ANY",  
              "&breakdown=All",
              "&perpage=5000&page=3"))

jsondata4<-fromJSON(url4)

dtaapi4<-data.table(jsondata4$Data)

dim(dtaapi4)
table(dtaapi4$IndicatorId)
table(dtaapi4$SurveyId)

dim(dtaapi1)
dim(dtaapi2)
dim(dtaapi3)
dim(dtaapi4)

dtaapi<-rbind(dtaapi1, dtaapi2, dtaapi3, dtaapi4)
dim(dtaapi)

Data frame dtaapi has 12400 rows and 12400 columns. We will create another data frame called dtalong to keep only rows and columns that are relevant for the analysis. We call it dtalong because we will reshape it to wide soon.

##### Keep/Select relevant variables #####
dtalong<-dtaapi%>%
    select(CountryName, DHS_CountryCode, SurveyId, 
           IndicatorId, Value, DenominatorWeighted,  
           CharacteristicCategory, CharacteristicLabel)

##### Change variable/column names to lower case #####
colnames(dtalong)<-tolower(names(dtalong))    

##### Check the datasets #####
str(dtalong)

##### Tidy #####
dtalong<-dtalong %>%
    # rename #
    rename(
        country = countryname, 
        group = characteristiccategory, 
        grouplabel = characteristiclabel, 
        denomw = denominatorweighted
        )%>%
    mutate(
        # change format #
        denomw = as.numeric(denomw),
        # replace indicator name #
        indicator = indicatorid, 
        indicator = ifelse(indicator=="FP_CUSA_W_ANY", "cpr", indicator), 
        indicator = ifelse(indicator=="FE_FRTR_W_TFR", "tfr", indicator), 
        indicator = ifelse(indicator=="MA_MSTA_W_UNI", "inunion", indicator)
    )

table(dtalong$group)

##### More tidy #####
dtalong<-dtalong%>% 
    # filter/keep rows/observations that are relevant for analysis #
    filter(group=="Age (5-year groups)" | grepl("Total", group))%>%
    # replace group name #
    mutate(
        group = ifelse(group=="Total 15-49", "Total", group), 
        grouplabel = ifelse(grouplabel=="Total 15-49", "Total", grouplabel))

table(dtalong$group)

Now, data frame dtalong has 2774 rows and 2774 columns. It also has variables names that are more friendly/intuitive for me. column dtalong$indicator has 3 values: tfr (Total fertility rate), tfr (percent of women in-union), and tfr (CPR). CPR has estimates for both national and specific to five-year age groups. TFR and in-union have only national estimates.

Next, we will reshape dtalong to wide so that each of the three indicators is a column/variable.

dtalongtemp<-dtalong%>%
    select(country, dhs_countrycode, surveyid, grouplabel, 
           indicator, value)

dta<-reshape(dtalongtemp, 
             timevar = "indicator", 
             idvar = c("country", "dhs_countrycode", "surveyid", "grouplabel"), 
             direction = "wide")

dim(dta)
colnames(dta)
str(dta)

##### clean column name #####
colnames(dta) <- sub("value\\.", "", colnames(dta))
str(dta)

##### tidy #####
dta<-dta%>%
    mutate(
        # create new variables #
        year=as.numeric(substr(surveyid,3,6)),
        type=substr(surveyid,7,9),
        period2=year>=2001,
        ssa=0, 
        ssa=ifelse(
                    # Eastern Africa        
                    country=="Burundi" |
                    country=="Comoros" |
                    country=="Djibouti" |
                    country=="Eritrea" |
                    country=="Ethiopia" |
                    country=="French Southern Territories" |
                    country=="Kenya" |
                    country=="Madagascar" |
                    country=="Malawi" |
                    country=="Mauritius" |
                    country=="Mayotte" |
                    country=="Mozambique" |
                    country=="Réunion" |
                    country=="Rwanda" |
                    country=="Seychelles" |
                    country=="Somalia" |
                    country=="South Sudan" |
                    country=="Uganda" |
                    country=="United Republic of Tanzania" |
                    country=="Zambia" |
                    country=="Zimbabwe" |
                    # Middle Africa     
                    country=="Angola" |
                    country=="Cameroon" |
                    country=="Central African Republic" |
                    country=="Chad" |
                    country=="Congo" |
                    country=="Democratic Republic of the Congo" |
                    country=="Equatorial Guinea" |
                    country=="Gabon" |
                    country=="Sao Tome and Principe" |
                    # Southern Africa       
                    country=="Botswana" |
                    country=="Eswatini" |
                    country=="Lesotho" |
                    country=="Namibia" |
                    country=="South Africa" |
                    # Western Africa        
                    country=="Benin" |
                    country=="Burkina Faso" |
                    country=="Cabo Verde" |
                    country=="Cote d'Ivoire" | # Côte d’Ivoire
                    country=="Gambia" |
                    country=="Ghana" |
                    country=="Guinea" |
                    country=="Guinea-Bissau" |
                    country=="Liberia" |
                    country=="Mali" |
                    country=="Mauritania" |
                    country=="Niger" |
                    country=="Nigeria" |
                    country=="Saint Helena" |
                    country=="Senegal" |
                    country=="Sierra Leone" |
                    country=="Togo",
                    1, ssa)
    )   

# ##### tidy - label variables #####
# suppressPackageStartupMessages(library(Hmisc))
# 
# label(dta$year)<- "Year of survey" 
# label(dta$type)<- "Type of survey"
# label(dta$cpr)<- "CPR among all women"    
# label(dta$period2)<- "1985-2000 vs. 2001-present"

str(dta)
table(dta$country, dta$ssa)

3.1.2 Data management

3.1.3 Data analysis

3.2 Case study using DHS model data

Estimates of Side Effects Counseling in Family Planning Using Three Data Sources: Implications for Monitoring and Survey Design