Applied Economics with R

Part 1 out of 3.

Hans H. Sievertsen (h.h.sievertsen@bristol.ac.uk)

Goals

Intended Learning Outcomes

After this lecture you should be able to create an empirical economic project with R. Specifically, you should be able to:

load data
adjust data
describe data
visualize data
run regression analysis
export material to a Word document

Why R?

It is free.

Plan for part 1:

From loading data to regressions
You can try methods simultaneously using the hands-on tool.

Chapter 1: The case

Chapter 2: The data is loaded

Chapter 3: Holmes is cleaning the data

Chapter 4: A first look at the data

Chapter 5: Sherlock creates charts

Chapter 6: Sherlock runs regressions

Chapter 7: Sherlock concludes on part 1

Prerequisites & Reading for part 1

Prerequisites

Introduction to econometrics.
Economic principles (or introduction to microeconomics & macroeconomics).

Reading

Essential

Wickham & Grolemund (2016): "R for data science: import, tidy, transform, visualize, and model data." Chapters 1-3, available here: https://r4ds.had.co.nz/.

Recommended

Angrist & Pischke (2014): "Mastering 'Metrics: The Path from Cause to Effect", Chapters 1-2.
Hanck, Arnold, Gerber and Schmelzer (2019): "Introduction to Econometrics with R" Chapters 4-7, available here: https://www.econometrics-with-r.org/

Chapter 1: The case

Holmes: "Watson my dear, we received a telegram. It seems like we have a new case. Could you please read it out loud?"
Watson: "So I see. This seems straightforward. Here is the question that is looking for an answer:"
Research Question: "What is the effect of an academic summer camp for children between year 5 and year 6?"
Watson: "How can we answer this?"
Holmes: "Data! Data! Data! I can't make bricks without clay." (from The Adventure of the Copper Beeches)

Chapter 1: The case

W: "Ahh, according to the telegram there appears to exist a dataset. Maybe that will be of use. We can estimate a complicated econometric model using the data."
H: "Hold your horses Watson. We first need context. Every good applied economics project provides a good description of the relevant context."
W: "What do you need to know?"
H: "How the school system works, what the summer camp is about, how participation in the summer camp is determined and so on!"
W: "Ah I see. This is all in the telegram:"
- school system: 10 years of compulsory schooling
- summer camp: optional academic summer camp that aims at improving self-confidence and study skills.
- participation: optional and quite costly.
H: "Very good Watson. We are now ready to have a look at the dataset."

Chapter 2: The data is loaded

H: "Tell me my dear Watson, where is this dataset?"
W: "An obscure Internet location, with the address: https://www.hhsievertsen.net/economicdata/src/school_data_1.csv"
H: "Brilliant Watson, R can easily handle obscure locations, as long as they are precise. Let me show you:"

library(readr)  # provides read_csv()
school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")

## Parsed with column specification:
## cols(
##   person_id = col_character(),
##   school_id = col_double(),
##   summercamp = col_double(),
##   female = col_double(),
##   parental_schooling = col_double(),
##   parental_lincome = col_double(),
##   test_year_5 = col_double(),
##   test_year_6 = col_double()
## )

Chapter 2: The data is loaded

W: "Sherlock, your friend, Mr. R, is clearly talking gibberish. I don't see much use in him. "
H: "You see, but you do not observe! In fact, the message given by R is very informative. Let me explain, my dear Watson. "
- Parsed with column specification: the dataset was loaded successfully.
- person_id = col_character() tells us that the first column is named "person_id" and is of a col_character() type. In other words: text.
- school_id = col_double() tells us that the second column is named "school_id". That column contains values that are stored as double precision floating point. In other words: numbers.
- The symbol combination <- is called the "assignment operator", which assigns what is on the right of the symbol to what is on the left of this symbol.
- school_data<- means that we assign the loaded data set to an object called "school_data".
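To make the assignment operator and the column types concrete, here is a minimal sketch using a tiny hand-made data frame. The object name toy and its values are made up for illustration and are not part of the school data.

```r
# Assign a small, made-up data frame to an object called "toy"
toy <- data.frame(
  person_id = c("p1", "p2", "p3"),  # text: stored as character
  school_id = c(5, 14, 7),          # numbers: stored as double
  stringsAsFactors = FALSE
)

# Inspect how each column is stored
id_type     <- class(toy$person_id)
school_type <- typeof(toy$school_id)
print(c(id_type, school_type))
```

Whatever stands to the right of <- is evaluated first and then stored under the name on the left, so toy can be reused in later lines.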

Chapter 2: The data is loaded

H: "From that I infer that we have information about:"
- person_id: a personal identifier. That is always useful.
- school_id: a school identifier. That might come in handy.
- summercamp: information on whether the child participated in the camp. Crucial!
- female: the gender of the child. You never know. Maybe useful!
- parental_schooling: Ahh some information about the child's background. The parents' schooling level.
- parental_lincome: More information about the child's background. The parents' income in logs.
- test_year_5: Some information about school achievement before the camp. Very interesting.
- test_year_6: Some information about school achievement after the camp. Crucial!

Chapter 2: The data is loaded

W: "Impressive Sherlock, but I still don't know the size of the dataset. "
H: "Correct Watson. Well spotted! R has several ways to show that. I like to use the str() function."

school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")
str(school_data)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 491 obs. of  8 variables:
##  $ person_id         : chr  "p1" "p2" "p3" "p4" ...
##  $ school_id         : num  5 14 7 8 9 26 13 11 23 9 ...
##  $ summercamp        : num  1 0 1 1 1 0 0 0 0 1 ...
##  $ female            : num  0 1 0 0 1 0 0 0 0 1 ...
##  $ parental_schooling: num  14 11 13 14 14 12 12 12 13 13 ...
##  $ parental_lincome  : num  15.3 14 15.1 15.3 15.7 ...
##  $ test_year_5       : num  NA 1.25 2.21 2.31 NA ...
##  $ test_year_6       : num  3.1 1.76 2.57 2.96 3.54 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   person_id = col_character(),
##   ..   school_id = col_double(),
##   ..   summercamp = col_double(),
##   ..   female = col_double(),
##   ..   parental_schooling = col_double(),
##   ..   parental_lincome = col_double(),
##   ..   test_year_5 = col_double(),
##   ..   test_year_6 = col_double()
##   .. )

W: "Ahh I see. We have 491 observations and 8 variables. "
H: "Well done Watson. And we even see some of the values. We can for example see that there are missing values shown as NA. "

Chapter 2: The data is loaded

W: "Truly remarkable Sherlock. But how does this help us answer whether the summer camps work?"
H: "Patience my dear Watson. Let us first recap what we have learned about R. "
- We use read_csv() to load a dataset.
- The exact location of the dataset is entered inside the ().
- For example, a location on your computer:
  "C:\\Users\\hans\\Documents\\school_data_1.csv"
- Inside an R string, a backslash must be written as "\\" instead of a single "\". Alternatively, use forward slashes "/" in the path.
- The function read_csv() loads the dataset into a so-called "data frame". Data frames are just one type of R object. The data frame object type is very handy for data manipulation.
- The symbol combination <- is the "assignment operator".
- The function str() gives us detailed information about the data frame.
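The recap above can be tried without an Internet connection. The following is a minimal sketch that writes a small, made-up CSV file to a temporary location and loads it again. It uses base R's read.csv() instead of read_csv() so that no extra package is needed; the file contents and the names csv_path and toy are assumptions for illustration only.

```r
# Write a small, made-up CSV to a temporary file (stand-in for a real path)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("person_id,test_year_6", "p1,3.1", "p2,1.76", "p3,NA"), csv_path)

# Load the file into a data frame and give it a name
toy <- read.csv(csv_path, stringsAsFactors = FALSE)

str(toy)         # column names, types and the first values
print(dim(toy))  # number of rows and number of columns

# "\\" typed in the code is a single backslash inside the string
win_path <- "C:\\Users\\hans\\Documents\\school_data_1.csv"
cat(win_path, "\n")
backslash_length <- nchar("\\")
```

The same path could also be written with forward slashes, which avoids the escaping altogether.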

Chapter 2: The data is loaded

W: "Great. Inspector Lestrade loves Stata. How would he do it?"
- In R we would write read_csv("school_data_1.csv") to load a csv dataset.
- In Stata we would write use "school_data_1.dta", clear to load a dta dataset.
- Note: R can also load dta files and Stata can also load csv files, but the example above is more common.
- Stata typically only has one dataset loaded. We therefore do not need to give it a name.
- R can load many datasets. We therefore have to give each one a name: school_data<-read_csv(..)

Chapter 3: Holmes is cleaning the data

W: "The dataset just contains one missing observation, let's delete that."
H: "Watson, you see, but you don't observe. We've only seen a few observations, but to see all we should use R tools. R, show me the number of missing observations:"

school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")
colSums(is.na(school_data))

##          person_id          school_id         summercamp             female 
##                  0                  0                  0                  0 
## parental_schooling   parental_lincome        test_year_5        test_year_6 
##                  5                  0                  6                  5

W: "Magic, pure magic."
H: "Not magic, just R. "
W: "But tell me Sherlock, what should we do with these observations?"
H: "We can either remove them or replace the missing value with a reasonable guess. Let's remove them for now and return to this issue later."

Chapter 3: Holmes is cleaning the data

H: "We ask R to only include complete.cases."

school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")
school_data<-school_data[complete.cases(school_data), ]
str(school_data)

## Classes 'tbl_df', 'tbl' and 'data.frame':  475 obs. of  8 variables:
##  $ person_id         : chr  "p2" "p3" "p4" "p6" ...
##  $ school_id         : num  14 7 8 26 13 11 23 9 25 15 ...
##  $ summercamp        : num  0 1 1 0 0 0 0 1 1 0 ...
##  $ female            : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ parental_schooling: num  11 13 14 12 12 12 13 13 15 12 ...
##  $ parental_lincome  : num  14 15.1 15.3 14 14.5 ...
##  $ test_year_5       : num  1.247 2.207 2.31 1.628 0.733 ...
##  $ test_year_6       : num  1.76 2.57 2.96 1.94 1.46 ...

W: "Magic, pure magic. We now only have 475 observations. How did that work?"
H: "We can select a subset of a data frame by using [r,c] where r are the rows we want to keep and c the columns."
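As a minimal sketch of the [r,c] notation, the following uses a small made-up data frame (the name toy and all values are hypothetical). Leaving r or c empty keeps all rows or all columns.

```r
# A small, made-up data frame for illustration
toy <- data.frame(
  person_id   = c("p1", "p2", "p3", "p4"),
  summercamp  = c(1, 0, 1, 0),
  test_year_6 = c(3.1, 1.8, 2.6, 2.0),
  stringsAsFactors = FALSE
)

first_two   <- toy[1:2, ]                            # rows 1 and 2, all columns
camp_only   <- toy[toy$summercamp == 1, ]            # rows where the condition is TRUE
scores_only <- toy[, c("person_id", "test_year_6")]  # all rows, two columns

print(camp_only)
```

The second line of selections is the pattern used with complete.cases(): a vector of TRUE and FALSE values in the r position keeps exactly the rows marked TRUE.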

Chapter 3: Holmes is cleaning the data

H: "Let me summarize the R tools we used:"

is.na() evaluates whether the content of () is missing (NA).
If the content inside () is missing, is.na() returns TRUE, which R counts as 1 when summing.
We use colSums(is.na(dataset)) to count the number of TRUE values, that is, missing entries, in each column of "dataset".
The function complete.cases(school_data) evaluates whether an observation is complete and returns TRUE if it is.
school_data[complete.cases(school_data), ] selects all the rows of school_data with complete cases.
school_data<-school_data[complete.cases(school_data), ] overwrites school_data with the new school_data.
In Stata we would write: keep if test_year_6!=. & test_year_5!=.&....
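The tools summarized above can be traced step by step on a small example. This is a sketch with made-up values (the data frame toy is hypothetical), not the school data.

```r
# A small, made-up data frame with some missing values
toy <- data.frame(
  person_id   = c("p1", "p2", "p3", "p4"),
  test_year_5 = c(NA, 1.25, 2.21, NA),
  test_year_6 = c(3.10, 1.76, NA, 2.96),
  stringsAsFactors = FALSE
)

print(is.na(toy$test_year_5))     # TRUE where the value is missing

na_counts <- colSums(is.na(toy))  # TRUE counts as 1 when summed
print(na_counts)

keep      <- complete.cases(toy)  # TRUE only for rows without any NA
toy_clean <- toy[keep, ]          # keep the complete rows, all columns
print(toy_clean)
```

Only the row for p2 has no missing value, so it is the only row that remains in toy_clean.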

Chapter 4: A first look at the data

So what have Holmes and Watson done so far?

Loaded a dataset using read_csv()
Stored the dataset as a data frame with the name school_data.
Investigated the number of variables and observations in the dataset using str().
Removed observations with missing values using complete.cases().
Let us now watch them describe the dataset.

Chapter 4: A first look at the data

W: "So tell me Sherlock, how do we describe the contents of the dataset?"
H: "We simply ask for a summary:"

options(width = 999)
school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")
school_data<-school_data[complete.cases(school_data), ]
summary(school_data)

##   person_id           school_id       summercamp         female       parental_schooling parental_lincome  test_year_5      test_year_6    
##  Length:475         Min.   : 1.00   Min.   :0.0000   Min.   :0.0000   Min.   :11.00      Min.   :13.07    Min.   :0.5484   Min.   :0.6024  
##  Class :character   1st Qu.: 8.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:12.00      1st Qu.:14.72    1st Qu.:1.8279   1st Qu.:2.0544  
##  Mode  :character   Median :16.00   Median :1.0000   Median :0.0000   Median :13.00      Median :15.13    Median :2.2833   Median :2.4825  
##                     Mean   :15.75   Mean   :0.5389   Mean   :0.4947   Mean   :13.04      Mean   :15.15    Mean   :2.2955   Mean   :2.5192  
##                     3rd Qu.:23.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:14.00      3rd Qu.:15.62    3rd Qu.:2.7642   3rd Qu.:2.9919  
##                     Max.   :30.00   Max.   :1.0000   Max.   :1.0000   Max.   :16.00      Max.   :17.39    Max.   :4.4056   Max.   :4.5500

W: "I only see a mess of numbers."

Chapter 4: A first look at the data

H: "Alright. Let me tidy it for you. I use the stargazer function for that:"

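library(stargazer)  # provides stargazer() for formatted tables
school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")
school_data<-school_data[complete.cases(school_data), ]
# convert to a plain data.frame before passing the data to stargazer()
stargazer(data.frame(school_data),type="html")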


Statistic            N     Mean     St. Dev.   Min      Pctl(25)   Pctl(75)   Max
school_id            475   15.747   8.669      1        8          23         30
summercamp           475   0.539    0.499      0        0          1          1
female               475   0.495    0.500      0        0          1          1
parental_schooling   475   13.044   0.948      11       12         14         16
parental_lincome     475   15.150   0.695      13.074   14.722     15.623     17.388
test_year_5          475   2.295    0.684      0.548    1.828      2.764      4.406
test_year_6          475   2.519    0.732      0.602    2.054      2.992      4.550

W: "Nice Sherlock. I now clearly see that test scores improved from year 5 to year 6. The summer camp clearly worked. Case solved!"

Chapter 4: A first look at the data

H: "Careful Watson, no premature conclusions!"
W: "But Holmes, don't you see it?"
H: "I see it. But you don't see the uncertainty and the lack of causal evidence."
W: "Then show me!"

school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")
school_data<-school_data[complete.cases(school_data), ]
t.test(test_year_6~summercamp,data=school_data)

## 
##  Welch Two Sample t-test
## 
## data:  test_year_6 by summercamp
## t = -16.795, df = 471.78, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9962080 -0.7875088
## sample estimates:
## mean in group 0 mean in group 1 
##        2.038514        2.930372

Chapter 4: A first look at the data

H: "Dear Watson. I conducted a t-test comparing the two means. Tell me what you observe?"
W: "Test scores are significantly higher for summer camp attendees after they participated. The standard errors deal with your uncertainty issue. The camp clearly worked!"
H: "I admit that you were right with respect to uncertainty. The difference is significant, but your conclusion is still premature. Look!"

school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")
school_data<-school_data[complete.cases(school_data), ]
t.test(test_year_5~summercamp,data=school_data)

## 
##  Welch Two Sample t-test
## 
## data:  test_year_5 by summercamp
## t = -9.286, df = 470.95, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6483848 -0.4219009
## sample estimates:
## mean in group 0 mean in group 1 
##        2.007073        2.542215

Chapter 4: A first look at the data

H: "Test scores were significantly higher for summer camp attendees already before the summer camp!"
W: "So what?"
H: "Summer camp participants were doing significantly better than non-participants already before the camp. We don't know whether the higher test score afterwards is due to the camp or simply due to something else."
W: "What could that be?"
H: "An omitted variable, for example parental preferences and investments."
W: "Explain!"

Chapter 4: A first look at the data

H: "Consider the following example. Let S be a variable capturing summer camp participation and P a variable capturing parental investments in general. "

True model: \[\text{testscore}=\beta_1+\beta_2 S+\beta_3 P+u\]
Our model: \[\text{testscore}=\beta_1+\beta_2 S+u\]
What we estimate: \[E[\hat{\beta}_2]=\beta_2+\beta_3\frac{\operatorname{cov}(S,P)}{\operatorname{var}(S)}\]
We do not identify the effect of the summer camp \(\beta_2\) if summer camp participation is correlated with parental investments (\(\operatorname{cov}(S,P)\neq 0\)) and parental investments are correlated with test scores (\(\beta_3\neq 0\)).
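The bias formula can be checked with a small simulation. This is only a sketch under assumed values: the coefficients 0.5 and 0.8, the sample size and the way S depends on P are all made up for illustration. In the simulation P is observed, so we can compare the regression that omits P with the one that includes it.

```r
set.seed(1)
n <- 1000

P <- rnorm(n)                      # parental investments (made-up values)
S <- as.numeric(P + rnorm(n) > 0)  # camp participation, more likely when P is high
testscore <- 1 + 0.5 * S + 0.8 * P + rnorm(n)

long  <- lm(testscore ~ S + P)  # "true model": includes P
short <- lm(testscore ~ S)      # "our model": omits P

# Sample analogue of beta_2 + beta_3 * cov(S, P) / var(S)
implied <- coef(long)["S"] + coef(long)["P"] * cov(S, P) / var(S)

print(c(short = unname(coef(short)["S"]), implied = unname(implied)))
```

The two printed numbers coincide: the coefficient on S in the short regression equals the coefficient from the long regression plus the bias term. Because S and P are positively related here and P raises test scores, the short regression overstates the effect of S.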

Chapter 4: A first look at the data

W: "But the dataset also includes information about parental background. Can't we just control for that?"

H: "Well, we should look at that. But I am not very optimistic. Let us first recap what we have done. "
We use summary() to obtain basic summary statistics for the dataset inside the ().
In Stata we would type summarize
We use stargazer() to create nice looking tables.
We use t.test(test_year_5~summercamp,data=school_data) to conduct a t-test.
In Stata we would type ttest test_year_5, by(summercamp)
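The recap above can be reproduced on simulated data. The following sketch uses made-up values (group sizes, means and the name toy are assumptions) and also shows how to pull individual numbers out of the t-test result.

```r
set.seed(42)

# Made-up data: 50 non-participants and 50 participants
toy <- data.frame(summercamp = rep(c(0, 1), each = 50))
toy$test_year_5 <- 2 + 0.5 * toy$summercamp + rnorm(100, sd = 0.6)

print(summary(toy))  # basic summary statistics

# Welch two-sample t-test: compares mean test scores across the two groups
tt <- t.test(test_year_5 ~ summercamp, data = toy)

print(tt$estimate)   # mean in group 0 and mean in group 1
print(tt$conf.int)   # confidence interval for the difference (group 0 minus group 1)
print(tt$p.value)

diff_means <- unname(tt$estimate[1] - tt$estimate[2])
```

Storing the result in an object such as tt is convenient when the group means or the p-value are needed later, for example in a table.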

Chapter 5: Sherlock creates charts

W: "So tell me Sherlock, are test scores correlated with parental background?"
H: "Let's create a chart to investigate this."

library(ggplot2)  # provides ggplot() and the geom_*() layers
school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")
school_data<-school_data[complete.cases(school_data), ]
# create a scatter plot with a linear fit
ggplot(school_data, aes(x=parental_lincome,y=test_year_5))+
  geom_point(alpha=0.5)+
  geom_smooth(method=lm,color="black")+
  theme_classic()

[Figure: scatter plot of test_year_5 against parental_lincome with a linear fit]

Chapter 5: Sherlock creates charts

W: "There is a clear correlation. Let us just control for parental background and the case is solved!"

H: "Hold your horses Watson, let's first investigate whether parental background is linked to summer camp participation."

school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")
school_data<-school_data[complete.cases(school_data), ]
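# Plot distributions of parental income by summer camp participation
ggplot(school_data, aes(x=parental_lincome,fill=as.factor(summercamp), color=as.factor(summercamp))) +
  geom_density(alpha=0.4)+
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y=after_stat(density)),position="dodge", alpha=0.5)+
  theme_classic()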

[Figure: density plots and histograms of parental_lincome by summercamp]

Chapter 5: Sherlock creates charts

W: "I told you. Parental income is correlated with test scores and summer camp participation. We should simply control for it."

H: "I am not that optimistic. But I am willing to try it. The simplest way to control for it is to run a multiple linear regression. But let us first recap what we did."
We use ggplot() to create a chart object.
We use aes() to specify the x and y variables.
We use geom_point() to create a scatter plot.
We use geom_density() to create a density plot.
We use geom_histogram() to create a histogram.
There are many settings in ggplot. You can find them in the ggplot2 documentation.
ggplot(school_data, aes(x=parental_lincome,y=test_year_5))+geom_point() corresponds to tw (scatter test_year_5 parental_lincome) in Stata.

Chapter 6: Sherlock runs regressions

H: "Okay, let's estimate a linear regression with R:"

school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")
school_data<-school_data[complete.cases(school_data), ]
lm(test_year_6~summercamp,data=school_data)

## 
## Call:
## lm(formula = test_year_6 ~ summercamp, data = school_data)
## 
## Coefficients:
## (Intercept)   summercamp  
##      2.0385       0.8919

W: "That is not very informative. What about your uncertainty?"
H: "Simply write summary(lm(test_year_6~summercamp,data=school_data))!"
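With a single 0/1 regressor, the regression repeats the comparison of means from the t-test: in the output above, the intercept 2.0385 is the mean in group 0, and the coefficient 0.8919 is the difference between the two group means (2.930372 minus 2.038514). The following sketch shows this, and the table that summary() adds, on made-up data (the name toy and all values are assumptions).

```r
set.seed(7)

# Made-up data: 40 non-participants and 40 participants
toy <- data.frame(summercamp = rep(c(0, 1), each = 40))
toy$test_year_6 <- 2 + 0.9 * toy$summercamp + rnorm(80, sd = 0.5)

fit <- lm(test_year_6 ~ summercamp, data = toy)

# summary() adds standard errors, t values and p-values to the estimates
coef_table <- coef(summary(fit))
print(coef_table)

# Compare the coefficients with the group means
group_means   <- tapply(toy$test_year_6, toy$summercamp, mean)
diff_in_means <- unname(group_means["1"] - group_means["0"])
print(c(slope = unname(coef(fit)["summercamp"]), diff_in_means = diff_in_means))
```

The column "Std. Error" of coef_table is what captures the uncertainty Watson asks about.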

Chapter 6: Sherlock runs regressions

H: "And here you go. A beautiful version with controls: "

school_data<-read_csv("https://www.hhsievertsen.net/economicdata/src/school_data_1.csv")
school_data<-school_data[complete.cases(school_data), ]
m1<-lm(test_year_6~summercamp,data=school_data)
m2<-lm(test_year_6~summercamp+parental_lincome+female,data=school_data)
m3<-lm(test_year_6~summercamp+parental_schooling+female+parental_lincome,data=school_data)
stargazer(m1,m2,m3,type="html",digits=2,font.size="tiny")


                      Dependent variable: test_year_6
                      (1)                        (2)                        (3)
summercamp            0.89***                    0.54***                    0.51***
                      (0.05)                     (0.04)                     (0.04)
parental_schooling                                                          0.17***
                                                                            (0.03)
parental_lincome                                 0.66***                    0.48***
                                                 (0.03)                     (0.05)
female                                           -0.04                      -0.04
                                                 (0.04)                     (0.04)
Constant              2.04***                    -7.69***                   -7.15***
                      (0.04)                     (0.43)                     (0.43)

Observations          475                        475                        475
R²                    0.37                       0.70                       0.72
Adjusted R²           0.37                       0.70                       0.71
Residual Std. Error   0.58 (df = 473)            0.40 (df = 471)            0.39 (df = 470)
F Statistic           277.46*** (df = 1; 473)    366.32*** (df = 3; 471)    294.79*** (df = 4; 470)

Note: *p<0.1; **p<0.05; ***p<0.01

Chapter 6: Sherlock runs regressions

W: "The summer camp clearly worked. The coefficient on summer camp is positive and significant, even after controlling for parental background."
H: "Once you eliminate the impossible, whatever remains is the truth."
W: "But what other explanation could there be?"
H: "When we added the variable parental_schooling, the coefficient changed. We only added one additional control for parental background. We could add an infinite number of controls, and they might all change the coefficient. This all suggests that including a control for parental income does not necessarily capture parental background sufficiently. There might be unobservable variables that still lead to omitted variable bias."
W: "We are doomed. The case is clearly impossible."
H: "Don't give up Watson. We just need to look for exogenous variation in summer camp participation."

Chapter 6: Sherlock runs regressions

H: "Let me just remind you what we have used so far."

We use lm(y~x1+x2,data=dataframe) to estimate a linear model, where our dependent variable is y and our independent variables are x1 and x2.
In Stata we would run "reg y x1 x2".
We use stargazer() to create nice regression tables.

Chapter 7: Sherlock concludes on part 1

H: "We have made plenty of progress so far:"

We want to know the effect of a summer camp on child development.
We found a dataset containing lots of useful information.
Unfortunately, summer camp participation seems to be endogenous. It is correlated with observable child background characteristics and likely also with unobservable characteristics.
Stay tuned for part 2, where Sherlock:
- discovers a new dataset.
- uses an instrument.
- investigates the difference-in-differences.
- tells stories about Cuba and Voting.