(This is a work-in-progress manuscript, updated on 2025-01-19)
Becoming Bilingual in R for Stata Speakers
After having used Stata for over 20 years, I began to learn R by taking online open courses in 2019, with a goal to be able to use a free software for data analysis. As I heard about the notorious initial learning curve, I had to endure painful few months first. It was frustrating partially because I knew exactly how to do ‘that’ in Stata. Slowly, however, I saw unique benefits of R that Stata does not offer easily. Specifically, I was thoroughly impressed with R’s ability in two aspects: to reproduce data management and analysis seamlessly; and to share results interactively.
It turned out that it was my great fortune that I learned R in 2019. During 2020 - the year of the COVID-19 pandemic and data science in full spot light, I continued learning and using R. The two features that impressed me earlier were proven to be invaluable for my work with COVID-19 data and beyond. And, I started appreciate and even like features/characteristics of R that I initially thought weird and challenging. I am completely sold on R.
But, don’t get me wrong. Stata is the first statistical analysis software that I learned, and it is my mother tongue (in data). As a bilingual speaker of Korean and English, however, this is exactly why I decided to keep both “languages” and help other Stata speakers learn R. Just like learning a second language, I - and you - need to think differently, be willing to make embarrassing mistakes, and keep practicing.
This “book” has two aims one for me one for you:
* to compile and organize R code for my own continued learning and practice; and
* to share my study note and cheat sheets with fellow Stata speakers.
So many people helped me learn R, and I want to contribute to the community - specifically from a Stata speaker’s perspective.
Also, I do not know everything in Stata. If you have suggestions to improve my Stata code here, hey please do let me know!
This book aims to help Stata speakers kick start learning R in their mother tongue. Consider it a “how to speak in XX” for tourists. It will help you have reasonably understandable conversations in R. But, you will need to learn “grammar” elsewhere to be truly fluent in R - see further general resources below.
Also, this is probably NOT for those who primarily conduct advanced statistical analyses. For such purposes, I have not found substantial advantages in R over Stata yet (and, thus, you may not need to suffer from learning a second language).
The main focus is on data management (Chapter 2) and exploratory analysis (Chapter 3), including creation of basic charts. Chapter 4 further dives into making charts, using plot_ly. Finally, Chapter 5 lists resources for R Markdown, Shiny App, and Slides - the tools and features that will help you fully benefit from R.
Every Stata code/command is “translated” into R code/command.
Stata code appears in white box.
#Translated R code appears in light gray box.
Why on earth in R…? is in red boxes.
It covers unique things in R that do not really exist in Stata. This is exactly what I mean by different ways of thinking across different languages. For me these were bottlenecks when I was learning R first.
Tips are in yellow boxes.
They include basic vocabulary for travelers.
Recap are in blue boxes.
They summarize key concepts and lessons in the section.
Throughout this book, Demographic and Health Surveys model data are used for examples. Download Individual Recode in Stata dataset format. I also use DHS indicator data (i.e., estimates calculated for specific indicators from indvidual surveys) from Demographic and Health Surveys API.
These are my go-to resources when I’m stuck:
There are many, but these are my favorite resources:
cd "~/Dropbox/0iSquared/iSquared_Bilingual"
setwd("~/Dropbox/0iSquared/iSquared_Bilingual")
Open or load the DHS model dataset. Make sure the example DHS model dataset is saved in your working directory.
use ZZIR62FL.DTA, clear
library(haven)
modeldata<-data.frame(read_dta("ZZIR62FL.dta"))
Why on earth in R…?
Did you notice that we gave a name to the dataset, modeldata
? In R, we need to specify what data we are working on in each command, except when we “pipe through” a series of commands” (coming up shortly). Yes, for many Stata speakers this is really odd, since Stata deals with only one dataset in use at a time. On the other hand, R can be hyper multitasking. For instance, we can see age distribution of observations in one dataset (e.g., Nigeria DHS) and the same distribution in another dataset (e.g., Bangladesh DHS) back to back.
Why on earth in R…?
For any code outside base functions, we need to “check out and open a book from Library” in each time/session you open R Studio. The specific code will be in the book, and we can use the code until we close the session/window. This is kind of similar with ado files in Stata. But, a big difference: in Stata, we download an ado file once and we are set. In R, we need to first install a package, install.packages("haven")
, (i.e., make sure a book exists in your library) and then load the package to your R session, library(haven)
(i.e., open the book).
Did you notice that quotation marks are used for install.packages
but not for library
? Don’t ask me why, and let me know if you know the answer…
Tips
library
vs. require
They both achieve the same goal - i.e., loading a package in a R session.
Difference? library
allows loading multiple packages (e.g., library(tidyverse, haven)
, while require
requires loading only one package at a time (e.g., require(tidyverse)
require(haven)
). require
does have some advantages, but it seems to come to almost a matter of preference. I like library
, and my work has never required me to use require
.
Check the number of observations and variables
describe, short
dim(modeldata)
There are 8348 observations for 4275 in the DHS model dataset! For our practice, let’s work with a much smaller dataset, using only the basic characteristics of women (variables starting with “v0” or “v1”), current pregnant status (v213), marital status (v501, v502, v503), and contraceptive use (variables starting with “v3”).
keep caseid v0* v1* v3* v213 v501 v502 v503
describe, short
library(tidyverse)
dta <- select(modeldata, caseid, starts_with(c("v0", "v1", "v3")), v213, v501, v502, v503)
"Or, pipe through"
dta <- modeldata %>% select(caseid, starts_with(c("v0", "v1", "v3")), v213, v501, v502, v503)
dim(dta)
And, assume there are many contraceptive use related variables that are not relevant for us. We want to drop them.
drop v3a* v305*
describe, short
dta <- dta %>% select( -starts_with(c("v3a", "v305")))
dim(dta)
Now, there are still 8348 observations but only 192 variables in the practice dataset. Also, note that we gave a different name for this practice dataset - dta
.
Vocabulary
Among R speakers, variables are often called columns, and observations are called rows. Hence, ncol(dta)
to count the number of columns or variables, and nrow(dta)
to count the number of rows or observatiosn.
codebook v00*
dta %>% select(starts_with("v00")) %>% summary()
Tips
A few commonly used lingo in R
- data frame = dataset
- columns = variables
- rows = observations
Tips: famous package, tidyverse
It is one of the most widely used packages for data management. Also, it’s like an “umbrella-package” that installs many other popular packages (like ggplot2 and dplyr) together. In other word, if you load tidyverse, you do not need to load ggplot2 or dplyr (and severak other widely used ones) separately.
Tips: loading the usual suspect packages
Almost always, my code requires certain popular packages - such as tidyverse. I typically load them in the beginning of the script so that I do not need to think or worry about which packages are needed. The following is my favorite usual suspects. Start build your own!
- tidyverse
- plotly
Big part of data science is data cleaning - right, fellow data janitors? Our example DHS model dataset has been already cleaned (Thank you, the DHS Program!), but let’s assume that we need to clean it even more. Below I list select data cleaning tasks that I had to do with other datasets.
Strong personal preference, but I change all variables using only lower case. Whenever I type using both lower and upper cases, there are always errors from mixing cases.
rename *, lower
names(dta)<-tolower(names(dta))
Keep variables if at least one value exists. In other word, drop variables if all values are missing.
In DHS, some variables are country-specific. These will be still included in the dataset and will have missing in all observations.
missings dropvars * , force
dta <- dta[ ,colSums(is.na(dta))<nrow(dta)]
Now, there are still 8348 observations but only 166 in the practice dataset.
In this example, let’s use the national average using non-missing observations.
set seed 123
*#create random numbers ranging 0-1
gen random_number = runiform()
gen age = v012
*#create age with missing in 20% of observations
gen age_with_missing = age
replace age_with_missing = . if random_number<0.2
*#create age with imputed values for missing
gen age_with_impute = age_with_missing
egen temp=mean(age_with_missing)
replace age_with_impute = temp if age_with_impute==.
*"Compare distributions of the three age variables"
sum *age*
capture drop temp *age*
set.seed(123)
dtatest <- dta %>%
mutate(
#create random numbers ranging 0-1
random_number = runif(nrow(dta), 0, 1),
age = v012,
#create age with missing in 20% of observations
age_with_missing = age,
age_with_missing = replace(age_with_missing,
random_number<0.2,
NA),
#create age with imputed values for missing
age_with_impute = age_with_missing,
age_with_impute = replace(age_with_impute,
is.na(age_with_impute)==TRUE,
mean(age_with_missing, na.rm=TRUE))
)
"Compare distributions of the three age variables"
dtatest%>%select(starts_with("age"))%>%summary()
Assume “Not Applicable” is coded as a negative value in a dataset. Change all to missing.
*#create a variable that coded with -99 for not relevant
*#E.g., the number of marriages would be missing for those never married
gen num_married = v503
replace num_married = -99 if num_married==.
*#create a new variable that will be cleaned
gen clean_num_married = num_married
*#replace the negative value with NA
foreach var of varlist *clean* {
replace `var' = . if `var'<0
}
*"Compare distributions from the two variables"
sum *married*
dtatest <- dta %>%
mutate(
#create a variable that coded with -99 for not relevant
#E.g., the number of marriages would be missing for those never married
num_married = v503,
num_married = replace(num_married,
is.na(num_married)==TRUE,
-99),
#create a new variable that will be cleaned
clean_num_married = num_married
)%>%
#replace the negative value with NA
mutate_at(vars(starts_with("clean")),
~replace(., .<0, NA))
"Compare distributions from the two variables"
[1] “Compare distributions from the two variables”
# dtatest%>%select(contains("married"))%>%summary()
summary(dtatest$num_married)
Min. 1st Qu. Median Mean 3rd Qu. Max. -99.00 -99.00 1.00 -29.15 1.00 2.00
summary(dtatest$clean_num_married)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s 1.00 1.00 1.00 1.21 1.00 2.00 2529
Tips
Unlike in Stata, we have to spell the function/command completely. For example, if we say sum(dtatest$num_married)
instead of summary(dtatest$num_married)
, we get an error.
Tips
starts_with(c(“abc”, “xyz”)) = abc*, xyz*
Change all numeric variables in the dataset to character
dtatest <- dta %>% select(starts_with("v00")) %>%
mutate_if(is.numeric, as.character)
str(dtatest)
Change select numeric variables in the dataset to character
dtatest <- dta %>% select(starts_with("v00")) %>%
mutate_at(vars(v008, v009),
~replace(., is.numeric(.)==TRUE, as.character(.)))
str(dtatest)
Presumably with a clear and specific data science question in our mind, now we have to prepare our analysis dataset - creating analysis variables and structuring the dataset at the level of analysis unit.
We will try select examples with simple questions.
dta1 <- dta %>%
#Create individual-level variables
mutate(
mcp = v313==3, #modern method users
eduhigh = v106>=2, #attended secondary or higher
age = v012, #age in years
urban = v102==1 #living in urban (constant within cluster)
)%>%
#calculate cluster-level values and collapse
group_by(v001)%>%
summarise(
cluster_mcp = mean(mcp),
cluster_eduhigh = mean(eduhigh),
cluster_age = median(age),
cluster_urban = mean(urban)==1
)
There are 217 clusters in the dataset, and now the analysis dataset dta1
has 217 observations. It has 5 variables: cluster ID, the dependent variables, and three independent variables.
Note that for this cluster-level analysis, we do not need to worry about sampling weight.
dta2 <- dta %>%
#Keep only those who are not currenetly pregnant or unsure
filter(v213==0)%>%
#Create individual-level variables
mutate(
mcp = v313==3, #modern method users
eduhigh = v106>=2, #attended secondary or higher
age = v012, #age in years
urban = v102==1 #living in urban (constant within cluster)
)
Now, excluding those who are currently pregnant, we have 7587 in our analysis dataset. FYI, there are a total of 8348 women in our practice modeldata.
In this section, we will explore data analysis examples, using peer-reviewed papers.
We will import three indicators/variables from DHS API data.
##### get necessary library #####
suppressWarnings(suppressMessages(library(jsonlite)))
suppressWarnings(suppressMessages(library(data.table)))
suppressWarnings(suppressMessages(library(dplyr)))
##### call indicators #####
# Call and save individual indicator data here
# This code uses API key, "DUMMY-123456". Replace it with your own valid key
url1<-(paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
# define surveys to include
"&surveyid=all",
# define two indicators for which we need only national estimates
"&indicatorIds=FE_FRTR_W_TFR,MA_MSTA_W_UNI",
# define subnational dimensions - all and age
"&breakdown=National",
# # provide API key
# "&APIkey=USAAID-113824"))
# define return per page (keep this at max...)
"&perpage=50000"))
jsondata1<-fromJSON(url1)
dtaapi1<-data.table(jsondata1$Data)
#### Assess #####
dim(dtaapi1)
table(dtaapi1$IndicatorId)
table(dtaapi1$CharacteristicCategory, dtaapi1$CharacteristicLabel)
##### call indicators #####
# Call and save individual indicator data here
# This code uses API key, "DUMMY-123456". Replace it with your own valid key
url2<-(paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
# define surveys to include
"&surveyid=all",
# define the third indicator that needs age-specific estimates
"&indicatorIds=FP_CUSA_W_ANY",
# define subnational dimensions
"&breakdown=All",
# # provide API key
# "&APIkey=USAAID-113824"))
# define return per page (keep this at max...)
"&perpage=5000&page=1"))
jsondata2<-fromJSON(url2)
dtaapi2<-data.table(jsondata2$Data)
#### Assess #####
dim(dtaapi2)
table(dtaapi2$IndicatorId)
table(dtaapi2$SurveyId)
# table(dtaapi2$CharacteristicCategory, dtaapi2$CharacteristicLabel)
url3<-(paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
"&surveyid=all",
"&indicatorIds=FP_CUSA_W_ANY",
"&breakdown=All",
"&perpage=5000&page=2"))
jsondata3<-fromJSON(url3)
dtaapi3<-data.table(jsondata3$Data)
dim(dtaapi3)
table(dtaapi3$IndicatorId)
table(dtaapi3$SurveyId)
url4<-(paste0("http://api.dhsprogram.com/rest/dhs/data?f=json",
"&surveyid=all",
"&indicatorIds=FP_CUSA_W_ANY",
"&breakdown=All",
"&perpage=5000&page=3"))
jsondata4<-fromJSON(url4)
dtaapi4<-data.table(jsondata4$Data)
dim(dtaapi4)
table(dtaapi4$IndicatorId)
table(dtaapi4$SurveyId)
dim(dtaapi1)
dim(dtaapi2)
dim(dtaapi3)
dim(dtaapi4)
dtaapi<-rbind(dtaapi1, dtaapi2, dtaapi3, dtaapi4)
dim(dtaapi)
Data frame dtaapi
has 12400 rows and 12400 columns. We will create another data frame called dtalong
to keep only rows and columns that are relevant for the analysis. We call it dtalong
because we will reshape it to wide soon.
##### Keep/Select relevant variables #####
dtalong<-dtaapi%>%
select(CountryName, DHS_CountryCode, SurveyId,
IndicatorId, Value, DenominatorWeighted,
CharacteristicCategory, CharacteristicLabel)
##### Change variable/column names to lower case #####
colnames(dtalong)<-tolower(names(dtalong))
##### Check the datasets #####
str(dtalong)
##### Tidy #####
dtalong<-dtalong %>%
# rename #
rename(
country = countryname,
group = characteristiccategory,
grouplabel = characteristiclabel,
denomw = denominatorweighted
)%>%
mutate(
# change format #
denomw = as.numeric(denomw),
# replace indicator name #
indicator = indicatorid,
indicator = ifelse(indicator=="FP_CUSA_W_ANY", "cpr", indicator),
indicator = ifelse(indicator=="FE_FRTR_W_TFR", "tfr", indicator),
indicator = ifelse(indicator=="MA_MSTA_W_UNI", "inunion", indicator)
)
table(dtalong$group)
##### More tidy #####
dtalong<-dtalong%>%
# filter/keep rows/observations that are relevant for analysis #
filter(group=="Age (5-year groups)" | grepl("Total", group))%>%
# replace group name #
mutate(
group = ifelse(group=="Total 15-49", "Total", group),
grouplabel = ifelse(grouplabel=="Total 15-49", "Total", grouplabel))
table(dtalong$group)
Now, data frame dtalong
has 2774 rows and 2774 columns. It also has variables names that are more friendly/intuitive for me. column dtalong$indicator
has 3 values: tfr (Total fertility rate), tfr (percent of women in-union), and tfr (CPR). CPR has estimates for both national and specific to five-year age groups. TFR and in-union have only national estimates.
Next, we will reshape dtalong
to wide so that each of the three indicators is a column/variable.
dtalongtemp<-dtalong%>%
select(country, dhs_countrycode, surveyid, grouplabel,
indicator, value)
dta<-reshape(dtalongtemp,
timevar = "indicator",
idvar = c("country", "dhs_countrycode", "surveyid", "grouplabel"),
direction = "wide")
dim(dta)
colnames(dta)
str(dta)
##### clean column name #####
colnames(dta) <- sub("value\\.", "", colnames(dta))
str(dta)
##### tidy #####
dta<-dta%>%
mutate(
# create new variables #
year=as.numeric(substr(surveyid,3,6)),
type=substr(surveyid,7,9),
period2=year>=2001,
ssa=0,
ssa=ifelse(
# Eastern Africa
country=="Burundi" |
country=="Comoros" |
country=="Djibouti" |
country=="Eritrea" |
country=="Ethiopia" |
country=="French Southern Territories" |
country=="Kenya" |
country=="Madagascar" |
country=="Malawi" |
country=="Mauritius" |
country=="Mayotte" |
country=="Mozambique" |
country=="Réunion" |
country=="Rwanda" |
country=="Seychelles" |
country=="Somalia" |
country=="South Sudan" |
country=="Uganda" |
country=="United Republic of Tanzania" |
country=="Zambia" |
country=="Zimbabwe" |
# Middle Africa
country=="Angola" |
country=="Cameroon" |
country=="Central African Republic" |
country=="Chad" |
country=="Congo" |
country=="Democratic Republic of the Congo" |
country=="Equatorial Guinea" |
country=="Gabon" |
country=="Sao Tome and Principe" |
# Southern Africa
country=="Botswana" |
country=="Eswatini" |
country=="Lesotho" |
country=="Namibia" |
country=="South Africa" |
# Western Africa
country=="Benin" |
country=="Burkina Faso" |
country=="Cabo Verde" |
country=="Cote d'Ivoire" | # Côte d’Ivoire
country=="Gambia" |
country=="Ghana" |
country=="Guinea" |
country=="Guinea-Bissau" |
country=="Liberia" |
country=="Mali" |
country=="Mauritania" |
country=="Niger" |
country=="Nigeria" |
country=="Saint Helena" |
country=="Senegal" |
country=="Sierra Leone" |
country=="Togo",
1, ssa)
)
# ##### tidy - label variables #####
# suppressPackageStartupMessages(library(Hmisc))
#
# label(dta$year)<- "Year of survey"
# label(dta$type)<- "Type of survey"
# label(dta$cpr)<- "CPR among all women"
# label(dta$period2)<- "1985-2000 vs. 2001-present"
str(dta)
table(dta$country, dta$ssa)