L10 (Lab) - Panel Data Analysis

In this lab exercise, you will use command plm from R package plm for panel data analysis.

Income and Democracy

The data set used for this exercise is income_democracy.xlsx from E10.2 of Stock and Watson (2020, e4). income_democracy.xlsx These data file income_democracy.xlsx contains a panel data set for 195 countries for the years 1960, 1965, … 2000. The data were supplied by Professor Daron Acemoglu and are a subset of the data used in his paper with Simon Johnson, James Robinson, and Pierre Yared, “Income and Democracy”, American Economic Review, 2008, 98:3: 808-842. A detailed description of the data is given in here, which is also available in LMS.

As stated in Acemoglu et al. (2008):

“This statistical association between income and democracy is the cornerstone of the influential modernization theory. Lipset (1959) suggested that democracy was both created and consolidated by a broad process of ‘modernization’ which involved changes in ‘the factors of industrialization, urbanization, wealth, and education [which] are so closely interrelated as to form one common factor. And the factors subsumed under economic development carry with it the political correlate of democracy’. The central tenet of the modernization theory, that higher income per capita causes a country to be democratic, is also reproduced in most major works on democracy (e.g., Robert A. Dahl 1971; Samuel P. Huntington 1991; Dietrich Rusechemeyer, John D. Stephens, and Evelyn H. Stephens 1992).”

In this exercise, we revisit the relationship between income per capital and democracy. To investigate the causal effect of income on democracy, our strategy is to control for country-specific factors affecting both income and democracy by including country fixed effects. Acemoglu et al (2008) argued that “While fixed effect regressions are not a panacea for omitted variable biases, they are well suited to the investigation of the relationship between income and democracy, especially in the postwar era. The major source of potential bias in a regression of democracy on income per capita is country-specific, historical factors influencing both political and economic development. If these omitted characteristics are, to a first approximation, time-invariant, the inclusion of fixed effects will remove them and this source of bias. Consider, for example, the comparison of the United States and Colombia. The United States is both richer and more democratic, so a simple cross-country comparison, as well as the existing empirical strategies in the literature, which do not control for fixed country effects, would suggest that higher per capita income causes democracy. The idea of fixed effects is to move beyond this comparison and investigate the ‘within-country variation’, that is, to ask whether Colombia is more likely to become (relatively) democratic as it becomes (relatively) richer.”

Clear the Workspace

rm(list=ls())

Install and Load Needed Packages

Let’s load all the packages needed for this exercise (this assumes you’ve already installed them).

library(openxlsx)               # load package; to read xlsx file

## Warning: package 'openxlsx' was built under R version 4.3.3

library(sandwich)               # load package; to compute heteroskedasticity robust standard errors
library(lmtest)                 # load package; to conduct hypothesis test using robust SE
library(plm)                    # load package; to run panel data regressions

Import Data: income_democracy

id <- "1siMeulDqo4zLWbrObypac5OlrwMqUWRy"
inde <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),
                sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(inde)

## 'data.frame':    1369 obs. of  13 variables:
##  $ country   : chr  "Andorra" "Andorra" "Andorra" "Andorra" ...
##  $ year      : num  1960 1965 1970 1975 1980 ...
##  $ dem_ind   : num  NA NA 0.5 NA NA ...
##  $ log_gdppc : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ log_pop   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ age_1     : num  NA NA NA NA NA ...
##  $ age_2     : num  NA NA NA NA NA ...
##  $ age_3     : num  NA NA NA NA NA ...
##  $ age_4     : num  NA NA NA NA NA ...
##  $ age_5     : num  NA NA NA NA NA ...
##  $ educ      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ age_median: num  NA NA NA NA NA ...
##  $ code      : num  1 1 1 1 1 1 1 1 1 2 ...

attach(inde)

Description of main variables:

country: country name
year: year
dem_ind: index of democracy
log_gdppc: logarithm of real GDP per capita
log_pop: logarithm of population
age_1 - age_5: fraction of the population age 0-14, 15-29, 30-44, 45-59, 60 and older, respectively
educ: average years of education for adults (25 years and older)

The index of democracy used in Acemoglu et al. (2008) is “the Freedom House Political Rights Index. A country receives the highest score if political rights come closest to the ideals suggested by a checklist of questions, beginning with whether there are free and fair elections, whether those who are elected rule, whether there are competitive parties or other political groupings, whether the opposition plays an important role and has actual power, and whether minority groups have reasonable self-government or can participate in the government through informal consensus.” For more details, see Freedom House (2004), https://freedomhouse.org/.

Before panel data analysis, let’s do cross-sectional data analysis using OLS. Select data for a single year, say 2000, and regress dem_ind on log_gdppc:

## Cross-sectional analysis using data for 2000
 # Select cross sectional data for 2000 and plot dem_ind vs log_gdppc
inde2000 <- subset(inde, year=="2000")
fit.cs <- lm(dem_ind~log_gdppc, data=inde2000)
summary(fit.cs)

## 
## Call:
## lm(formula = dem_ind ~ log_gdppc, data = inde2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67354 -0.19299  0.06096  0.19034  0.58530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.9584     0.1780  -5.386 2.82e-07 ***
## log_gdppc     0.1912     0.0211   9.063 7.69e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2808 on 146 degrees of freedom
##   (47 observations deleted due to missingness)
## Multiple R-squared:   0.36,  Adjusted R-squared:  0.3556 
## F-statistic: 82.13 on 1 and 146 DF,  p-value: 7.689e-16

plot(x=inde2000$log_gdppc, y=inde2000$dem_ind, xlab="log GDP per capita", ylab="democracy index")
abline(fit.cs)

Regression using a single year data suggests that the relationship between dem_ind and log_gdppc is significantly positive. However, this result can be seriously misleading because the OLS estimates clearly suffer from omitted variable bias.

Is the data set a balanced panel?

[Ans] The dataset is unbalanced because data are available over different years for different countries.

## Check if data are balanced
is.pbalanced(inde, index=c('country','year'))

## [1] FALSE

b(i) What are the minimum and maximum values of dem_ind in the data set? What are the mean and standard deviation of dem_ind in the data set? What are the 10th, 25th, 50th, 75th, and 90th percentiles of its distribution?

## Distributional info.
hist(dem_ind)

summary(dem_ind)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1667  0.5000  0.4991  0.8333  1.0000     103

sd(dem_ind, na.rm=TRUE)

## [1] 0.3713367

quantile(dem_ind, probs=c(0.1, 0.25, 0.5, 0.75, 0.9), na.rm=TRUE)

##       10%       25%       50%       75%       90% 
## 0.0000000 0.1666667 0.5000000 0.8333333 1.0000000

[Ans] Values of dem_ind range from 0.0 to 1.0. The mean in 0.50 and the standard deviation is 0.37. The 10th, 25th, 50th, 75th, and 90th percentiles are 0.0, 0.17, 0.5, 0.83, and 1.0.

b(ii) What is the value of dem_ind for the United States in 2000? Averaged over all years in the data set?

us2000 <- subset(inde, country=="United States" & year=="2000") # US data 2000
us2000$dem_ind    # dem_ind for US in 2000

## [1] 1

us <- subset(inde, country=="United States")  # US data 1960-2000
us$dem_ind        # dem_ind for US

## [1] 0.95 0.92 1.00 1.00 1.00 1.00 1.00 1.00 1.00

mean(us$dem_ind)  # averaged US dem_ind over 1960-2000

## [1] 0.9855556

[Ans] The value for the U.S. in 2000 is $dem\_ind = 1.0$ . The average for the nine years in the sample is 0.99.

b(iii) What is the value of dem_ind for Libya in 2000? Averaged over all years in the data set?

libya2000 <- subset(inde, country=="Libya" & year=="2000") # Libya data 2000
libya2000$dem_ind    # dem_ind for Libya in 2000

## [1] 0

libya <- subset(inde, country=="Libya")  # Libya data 1960-2000
libya$dem_ind        # dem_ind for Libya

## [1] 0.3100000 0.3400000 0.0000000 0.0000000 0.1666667 0.1666667 0.0000000
## [8] 0.0000000 0.0000000

mean(libya$dem_ind)  # averaged Libya dem_ind over 1960-2000

## [1] 0.1092593

[Ans] The value for Libya in 2000 is $dem_ind = 0.0$ . The average for the nine years in the sample is 0.11.

b(iv) List five countries with an average value of dem_ind greater than 0.95; less than 0.10; and between 0.3 and 0.7.

 # (b.iv) Countries with low, mid, and high dem_ind
    # (b.iv.1) Remove rows with NA in Column "dem_ind"
inde.c <- inde[complete.cases(inde[ , c('dem_ind')]), ]
    
    # (b.iv.2) Compute average dem_ind for each country
inde.c$ave_dem <- ave(inde.c$dem_ind, inde.c$country)

    # (b.iv.3) Countries with high averaged dem_ind: ave_dem>0.95
list.high <- subset(inde.c, ave_dem>0.95)$country
unique(list.high)

##  [1] "Australia"           "Austria"             "Belgium"            
##  [4] "Belize"              "Barbados"            "Canada"             
##  [7] "Switzerland"         "Costa Rica"          "Czech Republic"     
## [10] "Germany"             "Germany, West"       "Denmark"            
## [13] "France"              "United Kingdom"      "Ireland"            
## [16] "Iceland"             "Italy"               "Japan"              
## [19] "Kiribati"            "St. Kitts and Nevis" "St. Lucia"          
## [22] "Lithuania"           "Luxembourg"          "Malta"              
## [25] "Netherlands"         "Norway"              "New Zealand"        
## [28] "Slovakia"            "Slovenia"            "Sweden"             
## [31] "United States"

    # (b.iv.4) Countries with low averaged dem_ind: ave_dem<0.1
list.low <- subset(inde.c, ave_dem<0.1)$country
unique(list.low)

##  [1] "Afghanistan"       "Angola"            "Burundi"          
##  [4] "Brunei"            "China"             "Cuba"             
##  [7] "Germany, East"     "Eritrea"           "Equatorial Guinea"
## [10] "Iraq"              "Myanmar"           "Korea, Dem. Rep." 
## [13] "Rwanda"            "Saudi Arabia"      "Turkmenistan"     
## [16] "Uzbekistan"        "Vietnam"           "Congo, Dem. Rep."

    # (b.iv.5) Countries with mid averaged dem_ind: 0.3<ave_dem<0.7
list.mid <- subset(inde.c, 0.3<ave_dem & ave_dem<0.7)$country
unique(list.mid)

##  [1] "Argentina"              "Armenia"                "Antigua"               
##  [4] "Bangladesh"             "Bulgaria"               "Bosnia and Herzegovina"
##  [7] "Bolivia"                "Brazil"                 "Chile"                 
## [10] "Comoros"                "Cape Verde"             "Dominican Republic"    
## [13] "Ecuador"                "Spain"                  "Ethiopia 1993-"        
## [16] "Fiji"                   "Georgia"                "Ghana"                 
## [19] "Gambia, The"            "Guinea-Bissau"          "Guatemala"             
## [22] "Guyana"                 "Honduras"               "Hungary"               
## [25] "Jordan"                 "Korea, Rep."            "Kuwait"                
## [28] "Lebanon"                "Lesotho"                "Morocco"               
## [31] "Madagascar"             "Maldives"               "Mexico"                
## [34] "Macedonia, FYR"         "Mozambique"             "Malaysia"              
## [37] "Nigeria"                "Nicaragua"              "Nepal"                 
## [40] "Pakistan-post-1972"     "Pakistan-pre-1972"      "Panama"                
## [43] "Peru"                   "Philippines"            "Poland"                
## [46] "Paraguay"               "Russia"                 "Senegal"               
## [49] "Singapore"              "El Salvador"            "Sao Tome and Principe" 
## [52] "Suriname"               "Seychelles"             "Thailand"              
## [55] "Tonga"                  "Turkey"                 "Taiwan"                
## [58] "Ukraine"                "Yemen"                  "Yugoslavia - post 1991"
## [61] "South Africa"           "Zambia"                 "Zimbabwe"

[Ans] See R results.

Regress dem_ind on log_gdppc. Use standard errors that are clustered by country.

To obtain the clustered standard errors, use command vcovCL. Note that the default “type” in “vcovCL” is “HC1” for lm objects and “HC0” otherwise.

  # OLS regression with clustered standard errors
fit.c <- lm(dem_ind~log_gdppc, data=inde)
summary(fit.c)   # results with unclustered standard errors

## 
## Call:
## lm(formula = dem_ind ~ log_gdppc, data = inde)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72854 -0.19534  0.02586  0.19123  0.72698 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.354828   0.070919  -19.10   <2e-16 ***
## log_gdppc    0.235673   0.008626   27.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2719 on 956 degrees of freedom
##   (411 observations deleted due to missingness)
## Multiple R-squared:  0.4385, Adjusted R-squared:  0.4379 
## F-statistic: 746.5 on 1 and 956 DF,  p-value: < 2.2e-16

result.c <- coeftest(fit.c, vcovCL(fit.c, cluster=inde$country, type="HC1")) 
result.c         # results with clustered standard errors

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -1.354828   0.100421 -13.491 < 2.2e-16 ***
## log_gdppc    0.235673   0.011837  19.910 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  # 95% CI for the slope coefficient 
b <- fit.c$coefficients["log_gdppc"]  #estimated slope coefficient
ucl <- b+1.96*result.c[2,2]  # Upper Confidence Limit
lcl <- b-1.96*result.c[2,2]  # Lower Confidence Limit
ucl; lcl

## log_gdppc 
## 0.2588736

## log_gdppc 
## 0.2124726

[Ans] The coefficient is 0.24 with a standard error of 0.01. The 95% confidence interval is 0.21 to 0.26. Clustered standard errors are needed because of country-specific omitted factors in the regressions. The clustered standard error for dem_ind is 0.012; the unclustered standard error is smaller (0.009) because it ignores the positive within-country correlation of the errors.

d(i) Suggest a variable that varies across countries but plausibly varies little—or not at all—over time and that could cause omitted variable bias in the regression in (c).

[Ans] Countries have different histories, institutions, social structures, and religions. All affect their preference for demography and may also be correlated with economic development, and therefore per-capital income.

d(ii) Estimate the regression in (c), allowing for country fixed effects.

## Country fixed effects
fit.d.ii <- plm(dem_ind~log_gdppc, data=inde, index=c("country", "year"), effect="individual", model="within")  
coeftest(fit.d.ii, vcovHC(fit.d.ii, cluster="group", type="HC1"))  # Clustered standard errors

## 
## t test of coefficients:
## 
##           Estimate Std. Error t value Pr(>|t|)   
## log_gdppc 0.083741   0.031421  2.6652 0.007849 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Ans] The estimated coefficient falls to 0.084 with a standard error of 0.031. The estimated effect, while significantly smaller is still statistically significant at the 1% significance level.

d(iii) Exclude the data for Azerbaijan, and rerun the regression. Do the results change? Why or why not?

subset(inde, country=="Azerbaijan")   # data for Azerbaijan

##       country year   dem_ind log_gdppc  log_pop     age_1     age_2     age_3
## 69 Azerbaijan 2000 0.1666667  7.683833 8.947025 0.3391963 0.2604956 0.2186417
##        age_4     age_5 educ age_median code
## 69 0.0997561 0.0819104   NA       24.3   11

inde.no.Azer <- subset(inde, country!="Azerbaijan")  # Exclude data for Azerbaijan
fit.d.iii <- plm(dem_ind~log_gdppc, data=inde.no.Azer, index=c("country", "year"), effect="individual", model="within")
coeftest(fit.d.iii, vcovHC(fit.d.iii, cluster="group", type="HC0"))  # Clustered standard errors

## 
## t test of coefficients:
## 
##           Estimate Std. Error t value Pr(>|t|)   
## log_gdppc 0.083741   0.031404  2.6666 0.007817 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Ans] Data for Azerbaijan is available only for 2000. These data are completely absorbed by the country-specific fixed effect and therefore have no effect on the estimated coefficient on log_gdppc.

d(iv) Suggest a variable that varies over time but plausibly varies little—or not at all—across countries and that could cause omitted variable bias in the regression in (c).

[Ans] The demand for democracy is contagious and sweeps across countries (remember the “Arab Spring” of 2012). To the extent that these events coincide with global changes in economic conditions they are correlated with log_gdppc and will therefore lead to omitted variable bias.

## (Optional) Only time fixed effects
fit.time <- plm(dem_ind~log_gdppc, data=inde, index=c("country", "year"), effect="time", model="within")
coeftest(fit.time, vcovHC(fit.time, cluster="group", type="HC0"))

## 
## t test of coefficients:
## 
##           Estimate Std. Error t value  Pr(>|t|)    
## log_gdppc 0.235107   0.012137   19.37 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

d(v) Estimate the regression in (c), allowing for both time and country fixed effects.

## Both time and country fixed effects
fit.d.v <- plm(dem_ind~log_gdppc, data=inde, index=c("country", "year"), effect="twoways", model="within")
coeftest(fit.d.v, vcovHC(fit.d.v, cluster="group", type="HC0"))

## 
## t test of coefficients:
## 
##           Estimate Std. Error t value Pr(>|t|)
## log_gdppc 0.053588   0.042091  1.2731   0.2033

[Ans] The estimated coefficient falls further to 0.05, approximately 1/5 of the value that omits time and country fixed effects. The estimate is not statistically significant at the 10% level.

d(vi) There are additional demographic controls in the data set. Should these variables be included in the regression? If so, how do the results change when they are included?

Control variables:

log_pop: logarithm of population
age_1 - age_5: fraction of the population age 0-14, 15-29, 30-44, 45-59, 60 and older, respectively
educ: average years of education for adults (25 years and older)

fit.d.vi <- plm(dem_ind~log_gdppc+age_2+age_3+age_4+age_5+educ+log_pop, data=inde, effect="twoways", model="within")
coeftest(fit.d.vi, vcovHC(fit.d.vi, cluster="group", type="HC0"))

## 
## t test of coefficients:
## 
##              Estimate  Std. Error t value Pr(>|t|)   
## log_gdppc  0.02520125  0.05312453  0.4744 0.635410   
## age_2     -0.52551572  0.60034016 -0.8754 0.381745   
## age_3     -2.48123514  0.88025459 -2.8188 0.004988 **
## age_4      0.29781160  1.27765324  0.2331 0.815773   
## age_5      0.60540586  1.27632867  0.4743 0.635444   
## educ      -0.00040127  0.02288646 -0.0175 0.986017   
## log_pop   -0.06922950  0.12261279 -0.5646 0.572555   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Ans] When population, age, and education are included, the estimated coefficient on log_gdppc falls further to 0.03 with a standard error of 0.05. Therefore, after controlling for omitted variables - particularly country-fixed effects - there is little evidence of an income effect on the demand for democracy.

L10 (Lab) - Panel Data Analysis

Rui Fan

Income and Democracy

Clear the Workspace

Install and Load Needed Packages

Import Data: income_democracy