In this markdown file, I provide the code as well as explanations of the coding structure and the corresponding solutions to the first problem set (PS).
Before we start, we load (and, if necessary, install) the required packages, set the appropriate working directory, and load the data. We then take a first look at the data to get an overview.
# Get the data, we are working with the Health_WB dataset:
wb_health = read.csv("Health_WB.csv")
Now, before we do anything with any type of data, it would be good to have a general look at the data.
first_obs = head(wb_health)
So, we have 47 variables covering different aspects of health status, and 14,763 country-year observations. Note that this is a World Bank data set, so we should look at the country names to check whether some entries overlap with aggregates before computing descriptive statistics:
# Count unique elements of countries
length(unique(wb_health$countryname))
## [1] 259
Oddly enough, we have 259 “country” names. That does not make much sense, because the World Bank data cover only 219 countries. So, unless we have some alien places in the data, it is smart to check whether continents, economic groups, income-level aggregates, etc. are included as “countries”.
# Do this by looking if there is Europe in it:
sum(ifelse(wb_health$countryname == "European Union", 1, 0))
## [1] 57
As we can see, the European Union is listed as a “country” here, which is of course incorrect.
Bear this in mind for the subsequent statistical analyses. It is always important to look at the data first so that we do not misinterpret something from the beginning, which would render any later analysis useless.
There is a reason we did the analysis of country names beforehand. If you read the first data question carefully, you might notice that we should get summary stats for all countries. This does NOT mean all country names. Consequently, we should assess which names are indeed countries and which are not.
While you can do the selection in R itself, a bit of diversity in tools does not hurt, so I did the primary screening in Excel. (You could also use the countrycode package in R, but many country names are spelled differently than in the World Bank data set, so I advise sticking with the approach below; a small sketch of the countrycode alternative follows the country check further down.) That is, I built a list of all non-country names and removed them:
# The list with the country names
non_country_names <- c("Arab World", "Central Europe and the Baltics",
"Early-demographic dividend",
"East Asia & Pacific",
"East Asia & Pacific (IDA & IBRD countries)",
"East Asia & Pacific (excluding high income)",
"Euro area", "Europe & Central Asia",
"Europe & Central Asia (IDA & IBRD countries)",
"Europe & Central Asia (excluding high income)",
"European Union",
"Fragile and conflict affected situations",
"Heavily indebted poor countries (HIPC)",
"High income",
"Late-demographic dividend",
"Latin America & Caribbean",
"Latin America & Caribbean (excluding high income)",
"Latin America & the Caribbean (IDA & IBRD countries)",
"Least developed, countries: UN classification",
"Low & middle income", "Low income",
"Lower middle income", "Middle East & North Africa",
"Middle East & North Africa (IDA & IBRD countries)",
"Middle East & North Africa (excluding high income)",
"Middle income", "North America", "Not classified",
"OECD members", "Other small states",
"Pacific island small states",
"Post-demographic dividend", "Pre-demographic dividend",
"Small states", "South Asia", "South Asia (IDA & IBRD)",
"Sub-Saharan Africa",
"Sub-Saharan Africa (IDA & IBRD countries)",
"Sub-Saharan Africa (excluding high income)",
"Upper middle income", "World")
# The updated data set including ONLY countries
wb_health_country <- subset(wb_health, !(countryname %in% non_country_names))
# Let's check:
length(non_country_names)
## [1] 41
length(unique(wb_health_country$countryname))
## [1] 219
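Two quick follow-ups on this check. First, the list contains 41 names but only 259 - 219 = 40 were actually removed, so one entry in non_country_names apparently does not exactly match the spelling in the data (most likely the "Least developed, countries" entry with its stray comma). A one-liner reveals which:
# Which names in the list do not occur verbatim in the data?
setdiff(non_country_names, unique(wb_health$countryname))
Second, the countrycode alternative mentioned above can serve as a cross-check: names that cannot be matched to an ISO code are usually aggregates rather than countries. A sketch, assuming the countrycode package is installed (expect a warning about unmatched values):
library(countrycode)
# Names without an ISO3 match are candidates for the non-country list
iso <- countrycode(unique(wb_health$countryname),
                   origin = "country.name", destination = "iso3c")
unique(wb_health$countryname)[is.na(iso)]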
Let’s start with the immunization variables:
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
# For convenience, I change the names of the variables,
# as their original names are quite long.
names(wb_health_country)[names(wb_health_country) ==
"Improved.water.source....of.population.with.access."] <- "Water access"
names(wb_health_country)[names(wb_health_country) ==
"Improved.sanitation.facilities....of.population.with.access."] <- "Sanitary access"
wb_immun = subset(wb_health_country,
select = c(Immunization..BCG....of.one.year.old.children.,
Immunization..DPT....of.children.ages.12.23.months.,
Immunization..HepB3....of.one.year.old.children.,
Immunization..Hib3....of.children.ages.12.23.months.,
Immunization..measles....of.children.ages.12.23.months.,
Immunization..Pol3....of.one.year.old.children.))
names(wb_immun)[names(wb_immun) ==
"Immunization..BCG....of.one.year.old.children."] <- "BCG"
names(wb_immun)[names(wb_immun) ==
"Immunization..DPT....of.children.ages.12.23.months."] <- "DPT"
names(wb_immun)[names(wb_immun) ==
"Immunization..HepB3....of.one.year.old.children."] <- "HepB3"
names(wb_immun)[names(wb_immun) ==
"Immunization..Hib3....of.children.ages.12.23.months."] <- "Hib3"
names(wb_immun)[names(wb_immun) ==
"Immunization..measles....of.children.ages.12.23.months."] <- "measles"
names(wb_immun)[names(wb_immun) ==
"Immunization..Pol3....of.one.year.old.children."] <- "Pol3"
# The stargazer command is quite handy for descriptive statistics.
# Here, we need the mean, SD, min, max, and count.
stargazer(wb_immun, type = "text",
median = F, digits = 2,
header = FALSE, single.row = FALSE,
no.space = TRUE,
column.sep.width = "3pt",
font.size = "small",
title = "Summary Statistics of Immunization")
##
## Summary Statistics of Immunization
## ===========================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -----------------------------------------------------------
## BCG           5,362 83.35 20.52 1.00 77.00 98.00 99.00
## DPT 6,559 78.43 23.09 1.00 69.00 96.00 99.00
## HepB3 3,180 83.86 19.51 1.00 79.00 96.00 99.00
## Hib3 2,425 86.13 17.41 1.00 83.00 97.00 99.00
## measles 6,434 77.30 22.58 1.00 66.00 95.00 99.00
## Pol3 6,565 79.14 22.98 1.00 71.00 96.00 99.00
## -----------------------------------------------------------
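As an aside, repeating names(x)[names(x) == ...] for every column gets verbose. A minimal sketch of doing all the renames in one pass with a named lookup vector (using the same World Bank column names as above):
# Old (World Bank) name -> short name
rename_map <- c(
  "Immunization..BCG....of.one.year.old.children."          = "BCG",
  "Immunization..DPT....of.children.ages.12.23.months."      = "DPT",
  "Immunization..HepB3....of.one.year.old.children."         = "HepB3",
  "Immunization..Hib3....of.children.ages.12.23.months."     = "Hib3",
  "Immunization..measles....of.children.ages.12.23.months."  = "measles",
  "Immunization..Pol3....of.one.year.old.children."          = "Pol3")
# Replace all matching column names in one go
hit <- names(wb_immun) %in% names(rename_map)
names(wb_immun)[hit] <- rename_map[names(wb_immun)[hit]]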
Now the Life Expectancy at birth:
# I always create subsets with the main variables I am
# interested in before running any analysis.
wb_le = subset(wb_health_country,
select = c(Life.expectancy.at.birth..female..years.,
Life.expectancy.at.birth..male..years.,
Life.expectancy.at.birth..total..years.))
names(wb_le)[names(wb_le) ==
"Life.expectancy.at.birth..female..years."] <- "LE Female"
names(wb_le)[names(wb_le) ==
"Life.expectancy.at.birth..male..years."] <- "LE Male"
names(wb_le)[names(wb_le) ==
"Life.expectancy.at.birth..total..years."] <- "LE Total"
stargazer(wb_le, type = "text",
median = F, digits = 2,
header = FALSE, single.row = FALSE,
no.space = TRUE,
column.sep.width = "3pt",
font.size = "small",
title = "Summary Statistics of Life Expectancy")
##
## Summary Statistics of Life Expectancy
## =============================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -------------------------------------------------------------
## LE Female 11,034 66.04 12.08 22.39 57.20 75.34 87.30
## LE Male 11,034 61.39 10.92 16.29 53.88 69.47 81.60
## LE Total 11,034 63.66 11.44 19.27 55.59 72.29 84.28
## -------------------------------------------------------------
Lastly, the school enrollment:
# I always create subsets with the main variables I am
# interested in before running any analysis.
wb_school = subset(wb_health_country,
select = c(School.enrollment..primary....net.,
School.enrollment..secondary....net.))
names(wb_school)[names(wb_school) ==
"School.enrollment..primary....net."] <- "SE primary"
names(wb_school)[names(wb_school) ==
"School.enrollment..secondary....net."] <- "SE secondary"
stargazer(wb_school,
type = "text", median = F,
digits = 2, header = FALSE,
single.row = FALSE,
no.space = TRUE,
column.sep.width = "3pt",
font.size = "small",
title = "Summary Statistics of School Enrollment")
##
## Summary Statistics of School Enrollment
## ================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## ----------------------------------------------------------------
## SE primary 2,255 87.92 13.96 25.62 85.98 96.65 100.00
## SE secondary 1,707 68.46 25.34 2.68 51.18 88.50 100.00
## ----------------------------------------------------------------
Next, we regress total life expectancy at birth on (1) public and private health expenditure, (2) water and sanitation access, and (3) school enrollment:
reg1 = lm(wb_health_country$Life.expectancy.at.birth..total..years. ~
wb_health_country$Health.expenditure..public....of.GDP. +
wb_health_country$Health.expenditure..private....of.GDP.)
reg2 = lm(wb_health_country$Life.expectancy.at.birth..total..years. ~
wb_health_country$`Water access` +
wb_health_country$`Sanitary access`)
# Log transforms of school enrollment (defined here; note that reg3 below uses the levels)
log_primary_net_school <- log(wb_health_country$School.enrollment..primary....net.)
log_secondary_net_school <- log(wb_health_country$School.enrollment..secondary....net.)
reg3 = lm(wb_health_country$Life.expectancy.at.birth..total..years. ~
wb_health_country$School.enrollment..primary....net. +
wb_health_country$School.enrollment..secondary....net.)
stargazer(reg1, reg2,
reg3, type = "text",
header = FALSE, single.row = FALSE,
no.space = TRUE,
column.sep.width = "3pt",
font.size = "small",
omit.table.layout = "n",
column.labels = c("life exp",
"life exp", "life exp"),
dep.var.labels.include = FALSE,
omit.stat = c("rsq",
"f"))
##
## ============================================================================================
## Dependent variable:
## -----------------------------------------------------
## life exp life exp life exp
## (1) (2) (3)
## --------------------------------------------------------------------------------------------
## Health.expenditure..public....of.GDP. 2.151***
## (0.066)
## Health.expenditure..private....of.GDP. -0.873***
## (0.092)
## `Water access` 0.163***
## (0.007)
## `Sanitary access` 0.183***
## (0.004)
## School.enrollment..primary....net. 0.128***
## (0.013)
## School.enrollment..secondary....net. 0.247***
## (0.007)
## Constant 62.800*** 41.542*** 42.715***
## (0.378) (0.372) (0.896)
## --------------------------------------------------------------------------------------------
## Observations 3,630 4,716 1,491
## Adjusted R2 0.257 0.753 0.746
## Residual Std. Error 8.192 (df = 3627) 4.784 (df = 4713) 4.434 (df = 1488)
## ============================================================================================
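A small side note on syntax: instead of prefixing every variable with wb_health_country$, you can pass the data frame via lm()'s data argument, which is less error-prone and yields shorter labels in the regression table. A sketch for the first regression:
# Equivalent to reg1 above, but using the data argument
reg1_alt <- lm(Life.expectancy.at.birth..total..years. ~
                 Health.expenditure..public....of.GDP. +
                 Health.expenditure..private....of.GDP.,
               data = wb_health_country)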
To get the plots, we first restrict attention to the non-missing (non-NA) observations when calculating the country averages.
# This is the command with which you get the sub-selection
# of the variables of interest.
life_exp = na.omit(subset(wb_health_country,
select = c(countryname,
Life.expectancy.at.birth..total..years.,
School.enrollment..primary....net.)))
# Then, aggregate over countries to get the average value per country.
# This is done with the aggregate command, where we specify by which
# variable we want to aggregate and compute the mean.
lifeexp <- aggregate(life_exp[,
2:3], by = list(life_exp$countryname),
FUN = mean, na.rm = TRUE)
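If you prefer the tidyverse, the same country-level averaging can be done with dplyr; a minimal sketch, assuming the dplyr package is installed:
library(dplyr)
# Group by country and average the remaining (numeric) columns
lifeexp_dplyr <- life_exp %>%
  group_by(countryname) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))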
library(ggplot2)
# This is just a custom way of designing the plot. I personally find it handy to have a nice-looking plot, and this one
# does the trick and combines all the data. It looks difficult, but it really is not.
ggplot(lifeexp, aes(x = School.enrollment..primary....net.,
y = Life.expectancy.at.birth..total..years.)) + # the dataset, the variables of interest (given in an aes - command)
geom_point(colour = "blue", fill = "white") + # the form of your plot (in our case a scatter plot given by geom_point)
geom_smooth(method = "lm", se = T, color = "darkred") +# the linear line included by geom_smooth()
ylab("Life Expectancy at birth (Total)") + # Name of the variables
xlab("Primary School enrolment (Net, Total)") +
ggtitle("Relationship between Primary School Enrolment and Life Expectancy at Birth") + # Title of the plot
theme(plot.title= element_text(size=13, color="grey26", # The theme (so what it should look like)
hjust=0.5,
lineheight=1.2), panel.background = element_rect(fill="#f7f7f7"), # This is the background gray-silver colour
panel.grid.major.y = element_line(size = 0.5, linetype = "solid", color = "grey"), # These are the horizontal grid lines at each major break,
# i.e., at each labelled number (10, 20, ...)
panel.grid.minor = element_blank(), # These are the minor grid lines (removed here, so no lines between the labelled numbers)
panel.grid.major.x = element_blank(), # Here, we say we don't want any vertical lines
plot.background = element_rect(fill="#f7f7f7", color = "#f7f7f7"),
axis.title.x = element_text(color="grey26", size=12),
axis.title.y = element_text(color="grey26", size=12),
axis.line = element_line(color = "black")) # This is to define the axis colour for x and y
## `geom_smooth()` using formula 'y ~ x'
And here we have it: this looks better than the plain default plots R usually produces (or the standard plots from Stata).
# Load the data again
data_FE = read.csv("data_FE.csv")
# We can do that easily with the mean and sd commands.
mean_value = mean(data_FE$Explanatory.Variable)
SD_value = sd(data_FE$Explanatory.Variable)
# This is to print
print(paste0("The mean value is: ",
round(mean_value,
3)))
## [1] "The mean value is: 0.499"
print(paste0("The SD value is: ",
round(SD_value, 3)))
## [1] "The SD value is: 0.288"
library(stargazer)
# Let's do that again:
reg4 = lm(Dependent.Variable ~
Explanatory.Variable,
data = data_FE)
stargazer(reg4, type = "text")
##
## ================================================
## Dependent variable:
## ---------------------------
## Dependent.Variable
## ------------------------------------------------
## Explanatory.Variable 0.360
## (0.318)
##
## Constant 5.374***
## (0.183)
##
## ------------------------------------------------
## Observations 1,000
## R2 0.001
## Adjusted R2 0.0003
## Residual Std. Error 2.901 (df = 998)
## F Statistic 1.276 (df = 1; 998)
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
library(plm)
# Here, we use the plm package with the Household IDs as the panel index.
# The within (fixed-effects) estimator then absorbs everything that is
# specific to a household and constant over time, just as if we had
# included a dummy for each household.
reg5 = plm(Dependent.Variable ~
Explanatory.Variable,
data = data_FE, index = "Household.ID")
stargazer(reg4, reg5,
type = "text")
##
## =================================================================
## Dependent variable:
## --------------------------------------------
## Dependent.Variable
## OLS panel
## linear
## (1) (2)
## -----------------------------------------------------------------
## Explanatory.Variable 0.360 0.401***
## (0.318) (0.020)
##
## Constant 5.374***
## (0.183)
##
## -----------------------------------------------------------------
## Observations 1,000 1,000
## R2 0.001 0.317
## Adjusted R2 0.0003 0.241
## Residual Std. Error 2.901 (df = 998)
## F Statistic 1.276 (df = 1; 998) 417.085*** (df = 1; 899)
## =================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
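To convince yourself that the within estimator really is just "controlling for household dummies", you can run the least-squares dummy variable (LSDV) version explicitly; the slope on the explanatory variable should coincide with the plm coefficient above. A sketch:
# LSDV: include a dummy for each household instead of demeaning
reg5_lsdv <- lm(Dependent.Variable ~ Explanatory.Variable +
                  factor(Household.ID), data = data_FE)
# The slope should match the plm "within" estimate reported above
coef(reg5_lsdv)["Explanatory.Variable"]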
# I use the notation z_1 for the manipulated x variable. We generate it
# with an ifelse statement: for households with ID <= 50 it equals the
# original explanatory variable, for all other households we add 100.
# This builds a group-level (fixed) effect into x.
data_FE$z_1 = ifelse(data_FE$Household.ID <=
50, data_FE$Explanatory.Variable,
data_FE$Explanatory.Variable +
100)
mean_value_z_1 = mean(data_FE$z_1)
SD_value_z_1 = sd(data_FE$z_1)
print(paste0("The mean value is:",
round(mean_value_z_1,
3)))
## [1] "The mean value is:50.499"
print(paste0("The SD value is:",
round(SD_value_z_1,
3)))
## [1] "The SD value is:50.024"
# Let's run this OLS regression.
reg6 = lm(Dependent.Variable ~
z_1, data = data_FE)
stargazer(reg4, reg6,
type = "text", header = FALSE,
single.row = FALSE,
no.space = TRUE,
column.sep.width = "3pt",
font.size = "small")
##
## ===========================================================
## Dependent variable:
## ----------------------------
## Dependent.Variable
## (1) (2)
## -----------------------------------------------------------
## Explanatory.Variable 0.360
## (0.318)
## z_1 0.050***
## (0.001)
## Constant 5.374*** 3.022***
## (0.183) (0.066)
## -----------------------------------------------------------
## Observations 1,000 1,000
## R2 0.001 0.747
## Adjusted R2 0.0003 0.747
## Residual Std. Error (df = 998) 2.901 1.460
## F Statistic (df = 1; 998) 1.276 2,949.883***
## ===========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As is shown, the coefficient on the manipulated variable is now highly significant, and the R-squared jumps from essentially zero to about 0.75 (the slope itself is smaller simply because z_1 now spans a range of roughly 100 instead of 1). This is intuitive: we shifted the values for half of the sample while leaving the rest as is, and that shift creates a clear group structure in the data. To see this, compare the two scatterplots below.
The first shows the actual relationship. At first sight, the data points follow no clear structure at all: without controlling for the household "groups" (i.e., without binning the data points into groups), we only see what looks like a random cloud.
library(ggplot2)
highlight_df = subset(data_FE,
Explanatory.Variable > 0.25 &
Explanatory.Variable <
0.26)
ggplot(data_FE, aes(x = Explanatory.Variable,
y = Dependent.Variable)) +
geom_point(colour = "black",
fill = "white") + geom_point(data = highlight_df,
aes(x = Explanatory.Variable,
y = Dependent.Variable),
colour = "green1", size = 2) +
ylab("Dependent variable") +
xlab("Explanatory variable") +
ggtitle("Relationship between X and Y without manipulation") +
theme(plot.title = element_text(size = 13,
color = "grey26", hjust = 0.5,
lineheight = 1.2), panel.background = element_rect(fill = "#f7f7f7"),
panel.grid.major.y = element_line(size = 0.5,
linetype = "solid",
color = "grey"), panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
plot.background = element_rect(fill = "#f7f7f7",
color = "#f7f7f7"),
axis.title.x = element_text(color = "grey26",
size = 12), axis.title.y = element_text(color = "grey26",
size = 12), axis.line = element_line(color = "black")) +
geom_smooth(method = "lm",
se = FALSE, color = "red")
## `geom_smooth()` using formula 'y ~ x'
Now let’s see what happens once we use the manipulated explanatory variable.
As is apparent, the explanatory variable can now be separated into two bins. Before, the data were distributed solely between 0 and 1 on the x-axis and between 0 and approximately 11 on the y-axis. Since half of the data received a push of 100, those observations are shifted far to the right on the x-axis. What is interesting is that the data receiving the push were apparently also the data associated with the larger y-values in the first place.
To make that clear, look at the green points in the plot above. They have very similar x-values but differ widely in their y-values. Apparently, the observations with larger y-values also belong to a different household group, so there are clear differences between household groups. Plain OLS, however, does not "see" this grouping and finds almost no relationship, because the differences between households largely cancel each other out.
As a consequence, including household fixed effects is necessary to recover the relationship between x and y, as they correct exactly this problem. To mimic what the fixed effects do, we added 100 to the observations that had the larger y-values in the first place, thereby creating visible groups (or bins) of data.
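A quick numerical check of this grouping (the split at Household.ID <= 50 mirrors how z_1 was constructed above): if the pushed households are indeed the ones with the larger outcomes, the group with ID > 50 should show a clearly higher mean of the dependent variable.
# Mean of the dependent variable by household group (TRUE = ID <= 50, FALSE = ID > 50)
tapply(data_FE$Dependent.Variable, data_FE$Household.ID <= 50, mean)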
This is visible in the plot below:
ggplot(data_FE, aes(x = z_1,
y = Dependent.Variable)) +
geom_point(colour = "black",
fill = "white") +
geom_point(data = highlight_df,
aes(x = z_1,
y = Dependent.Variable),
colour = "green1",
size = 2) + ylab("Dependent variable") +
xlab("Explanatory variable") +
ggtitle("Relationship between X and Y without manipulation") +
theme(plot.title = element_text(size = 13,
color = "grey26",
hjust = 0.5,
lineheight = 1.2),
panel.background = element_rect(fill = "#f7f7f7"),
panel.grid.major.y = element_line(size = 0.5,
linetype = "solid",
color = "grey"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
plot.background = element_rect(fill = "#f7f7f7",
color = "#f7f7f7"),
axis.title.x = element_text(color = "grey26",
size = 12),
axis.title.y = element_text(color = "grey26",
size = 12),
axis.line = element_line(color = "black")) +
geom_smooth(method = "lm",
se = FALSE, color = "red")
## `geom_smooth()` using formula 'y ~ x'
Now we can easily see what happened to the green data points. Some of them received the push to the right, and it was exactly the points that were originally associated with the larger y-values. Apparently, half of the households show lower values and half show higher values. By "adding" the effect to the second group, we make the group differences, which were there all along, clearly visible, and we can finally observe the true relationship. This is exactly what the fixed effects do.
All things considered, the transformation of the grouped variable has a similar effect on the model as using FE in the first place. As a consequence, it is logical that the OLS results differ so dramatically between the original and the manipulated variable.
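Before running the FE regression with the manipulated variable, it is worth seeing the "within" logic directly: demean both variables by household by hand and run plain OLS on the demeaned data; the slope should reproduce the plm within estimate. A sketch:
# Manual within (demeaning) transformation
data_FE$y_dm <- data_FE$Dependent.Variable -
  ave(data_FE$Dependent.Variable, data_FE$Household.ID)
data_FE$x_dm <- data_FE$Explanatory.Variable -
  ave(data_FE$Explanatory.Variable, data_FE$Household.ID)
# The slope should match the plm "within" coefficient (about 0.4)
coef(lm(y_dm ~ x_dm, data = data_FE))["x_dm"]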
# Let's run the fixed-effects regression using the manipulated variable.
reg7 = plm(Dependent.Variable ~
z_1, data = data_FE,
index = "Household.ID")
stargazer(reg5, reg7,
type = "text", header = FALSE,
single.row = FALSE,
no.space = TRUE,
column.sep.width = "3pt",
font.size = "small")
##
## ======================================================
## Dependent variable:
## ----------------------------
## Dependent.Variable
## (1) (2)
## ------------------------------------------------------
## Explanatory.Variable 0.401***
## (0.020)
## z_1 0.401***
## (0.020)
## ------------------------------------------------------
## Observations 1,000 1,000
## R2 0.317 0.317
## Adjusted R2 0.241 0.241
## F Statistic (df = 1; 899) 417.085*** 417.085***
## ======================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Here, we can observe that the two coefficients (and their standard errors, R-squared and F statistics) are exactly the same. The intuition lies in the fixed-effects regression model. In the first model, we applied the household FE to the original explanatory variable. In the second model, we applied the same FE to the manipulated variable \(z_1\), which differs from the original only by a constant of 100 within half of the households. Because this shift is constant within each household, the FE (within) estimator removes it completely, so the two regressions are numerically identical. This also confirms that the household fixed effects are indeed the carrying factor behind the difference between the OLS and the FE results: the within transformation eliminates exactly those household-specific effects that are constant over time but differ across households.
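A compact way to see why the two FE estimates must coincide: the within transformation demeans each variable by household. If we add a household-specific constant \(c_i\) (here 0 or 100) to the regressor, the demeaned regressor does not change:
\[
\tilde{z}_{it} = (x_{it} + c_i) - \frac{1}{T_i}\sum_{s=1}^{T_i}(x_{is} + c_i) = x_{it} - \bar{x}_i = \tilde{x}_{it},
\]
so the within estimator based on \(z_1\) uses exactly the same transformed regressor as the one based on the original explanatory variable, and the coefficient estimates are identical.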