Practice 1

D. Acemoglu, S. Johnson, and J. A. Robinson, in the paper “The Colonial Origins of Comparative Development: An Empirical Investigation” (2001), evaluated the effect of institutions on economic performance. According to their theory, the current economic performance of former colonies depends on the type of institutions that Europeans introduced during colonization. The type of institutions, in turn, depends on the natural conditions in a colony.

If natural conditions in a colony were bad and caused disease and high mortality, Europeans tended to set up ‘extractive states’ that were used only to transfer resources to the metropolitan country. In such cases colonizers did not develop high-quality institutions and thus provided no protection for private property (Congo). In colonies with good conditions Europeans tried to settle more thoroughly and replicated European institutions, with a stronger emphasis on private property and a system of checks and balances against government expropriation (Canada, New Zealand).

To find support for their theory, the researchers used advanced methods (instrumental variables and two-stage least squares), but at some steps they used ordinary least squares (OLS) regressions, which were discussed in this course. In this practical task you are asked to replicate one OLS model from the paper.

You are provided with the dataset that contains the following variables:

As a first step, the authors evaluate the effect of the risk of expropriation of private foreign investment by the government on the logarithm of GDP per capita, taking into account some geographical factors: latitude and the continent where a country is situated. There are four dummies for continents, and the researchers take America as the base category. Then the researchers run a regression.

1.1. What is the dependent variable in the model?

Logarithm of GDP per capita (logpgp95).

1.2. What are the independent variables in the model?

Risk of expropriation, latitude and continent (avexpr, lat_abst, africa, asia, other). We could add america as an independent variable as well, but it would then be dropped from the model: it is the base category and is perfectly collinear with the other three continent dummies and the intercept.
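
The point about the base category can be illustrated with a minimal sketch on simulated data. The data frame `toy` and its values are hypothetical and are not taken from the col_origins dataset; only the dummy names mirror the ones used here.

```r
# Simulated data (hypothetical, not the col_origins data):
# every country belongs to exactly one of four regions.
set.seed(1)
region <- rep(c("africa", "asia", "america", "other"), times = c(30, 25, 25, 20))
toy <- data.frame(
  africa  = as.integer(region == "africa"),
  asia    = as.integer(region == "asia"),
  other   = as.integer(region == "other"),
  america = as.integer(region == "america")
)
toy$y <- 8 - 1 * toy$africa + 0.3 * toy$other + rnorm(nrow(toy), sd = 0.5)

# The four dummies always sum to 1, which duplicates the intercept,
# so lm() cannot estimate the last dummy and reports NA for it.
fit_all <- lm(y ~ africa + asia + other + america, data = toy)
coef(fit_all)

# Leaving america out makes it the base category:
# the intercept is then the mean of y for the america group.
fit_base <- lm(y ~ africa + asia + other, data = toy)
coef(fit_base)
```

In the second fit each dummy coefficient is the difference between that region and the base category, which is how the africa, asia and other coefficients are read in this task.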

1.3. Reproduce the model proposed by Acemoglu et al. Provide the R code you used to perform the model.

ace <- read.csv("http://math-info.hse.ru/f/2016-17/ps-pep-quant/col_origins.csv")
model1 <- lm(data = ace, logpgp95 ~ avexpr + lat_abst + africa + asia + other)
summary(model1)
## 
## Call:
## lm(formula = logpgp95 ~ avexpr + lat_abst + africa + asia + other, 
##     data = ace)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.66865 -0.28680  0.06585  0.34075  1.25274 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.85108    0.33959  17.230  < 2e-16 ***
## avexpr       0.38956    0.05065   7.691 8.26e-12 ***
## lat_abst     0.33256    0.44549   0.747    0.457    
## africa      -0.91639    0.16627  -5.511 2.56e-07 ***
## asia        -0.15306    0.15478  -0.989    0.325    
## other        0.30355    0.37476   0.810    0.420    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6261 on 105 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.7152, Adjusted R-squared:  0.7016 
## F-statistic: 52.74 on 5 and 105 DF,  p-value: < 2.2e-16

1.4. All else equal (ceteris paribus), how does the logarithm of GDP per capita change (on average) when the indicator of risk of expropriation increases by 1 unit?

It increases by 0.390 on average.
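
To show how precise this estimate is, a 95% confidence interval can be sketched from the numbers printed in the output of 1.3. The variable names below are illustrative; with the fitted object available, `confint(model1, "avexpr")` should give the same interval.

```r
# Values copied from the summary(model1) output in 1.3
estimate  <- 0.38956   # coefficient of avexpr
std_error <- 0.05065   # its standard error
df_resid  <- 105       # residual degrees of freedom

# 95% confidence interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
round(ci, 3)
```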

1.5. All else equal (ceteris paribus), how does the logarithm of GDP per capita differ in African and American countries?

In African countries it is lower by 0.916.
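
Because the dependent variable is a logarithm, the coefficients can also be translated into percentage differences in GDP per capita itself. The sketch below assumes that logpgp95 is a natural logarithm and uses the estimates printed in 1.3; the variable names are illustrative.

```r
# Coefficients copied from the summary(model1) output in 1.3
b_avexpr <- 0.38956
b_africa <- -0.91639

# A change of b in log(GDP per capita) multiplies GDP per capita by exp(b)
pct_avexpr <- (exp(b_avexpr) - 1) * 100  # % change per one-unit increase in avexpr
pct_africa <- (exp(b_africa) - 1) * 100  # % difference, Africa vs. America

round(c(avexpr = pct_avexpr, africa = pct_africa), 1)
```

Under that assumption, a one-unit increase in avexpr corresponds to roughly 48% higher GDP per capita, and African countries have roughly 60% lower GDP per capita than American ones, all else equal.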

1.6. Which of the factors significantly affect the GDP per capita? At what level of significance?
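
The answer can be read from the Pr(>|t|) column of the output in 1.3. As a sketch, the p-values can also be recomputed from the printed t values and the residual degrees of freedom; the object names below are illustrative.

```r
# t values copied from the summary(model1) output in 1.3
t_values <- c(avexpr = 7.691, lat_abst = 0.747,
              africa = -5.511, asia = -0.989, other = 0.810)
df_resid <- 105  # residual degrees of freedom

# Two-sided p-value for each coefficient
p_values <- 2 * pt(-abs(t_values), df = df_resid)

# Smallest conventional level at which each coefficient is significant
level <- cut(p_values,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("0.001", "0.01", "0.05", "0.1", "not significant"))
data.frame(p_value = signif(p_values, 3), level = level)
```

This suggests that avexpr and africa are significant at the 0.1% level (0.001), while lat_abst, asia and other are not significant even at the 10% level.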

1.7. Evaluate the model quality.

  1. Check (using at least two methods) whether multicollinearity is present in this model. Report your R code and provide your comments.
# 1. Calculate correlation
ace <- na.omit(ace) # exclude NAs
cor(ace[,2:8]) # correlations
##             lat_abst      avexpr    logpgp95        asia     america
## lat_abst  1.00000000  0.68906798  0.61538711 -0.05125440 -0.15851512
## avexpr    0.68906798  1.00000000  0.78187091  0.01659243 -0.12194294
## logpgp95  0.61538711  0.78187091  1.00000000  0.07627958  0.09097340
## asia     -0.05125440  0.01659243  0.07627958  1.00000000 -0.26578872
## america  -0.15851512 -0.12194294  0.09097340 -0.26578872  1.00000000
## africa   -0.43541439 -0.45408480 -0.63182774 -0.36083087 -0.29837041
## other     0.08838213  0.15407514  0.18030993 -0.09449112 -0.07813454
##              africa       other
## lat_abst -0.4354144  0.08838213
## avexpr   -0.4540848  0.15407514
## logpgp95 -0.6318277  0.18030993
## asia     -0.3608309 -0.09449112
## america  -0.2983704 -0.07813454
## africa    1.0000000 -0.10607430
## other    -0.1060743  1.00000000

Among the predictors, no pair has an extremely high correlation: the largest is 0.69 (lat_abst and avexpr).

# 2. Calculate variance inflation factors (VIF)
library(car)
vif(model1)
##   avexpr lat_abst   africa     asia    other 
## 2.044114 2.047614 1.606022 1.248572 1.045686

No VIF is greater than 10 (or even 5), so we find no evidence of problematic multicollinearity.
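
To make clear what a VIF measures, here is a minimal base-R sketch on simulated predictors. The data and the names `x1`, `x2`, `x3` are hypothetical; this is not the col_origins data. It computes the VIF in two equivalent ways for numeric predictors.

```r
# Simulated predictors (hypothetical, not the col_origins data)
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)   # moderately correlated with x1
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# Method 1: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors
vif_manual <- sapply(names(X), function(v) {
  aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})

# Method 2: the diagonal of the inverse correlation matrix
vif_inverse <- diag(solve(cor(X)))

round(rbind(vif_manual, vif_inverse), 3)
```

A VIF of 2, as for avexpr and lat_abst above, therefore means that about half of the variance of that predictor is explained by the other predictors.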

  2. Check (using graphs with residuals) whether heteroskedasticity is present in this model. Report your R code and provide your comments.
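# take residuals from the model
# (ace was reduced with na.omit() above, so its rows are assumed
# to match the observations used in model1)
ace$res <- model1$residuals

# plot residuals against every continuous independent variable;
# such plots are not informative for the binary ones (africa, asia, other),
# so we skip them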

library(ggplot2)
ggplot(data = ace, aes(x = avexpr, y = res)) + geom_point()

ggplot(data = ace, aes(x = lat_abst, y = res)) + geom_point()

The spread of points around the line \(y=0\) stays approximately the same as the values of avexpr and lat_abst increase. There are no patterns in the residuals; the points are scattered randomly. So there is no evidence of heteroskedasticity (non-constant variance of residuals).
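
The task asks only for graphs, but a formal test can complement them. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R on simulated data. The data are hypothetical and are built to be heteroskedastic on purpose; the packaged equivalent is `bptest()` from the lmtest package.

```r
# Simulated data (hypothetical): the error spread grows with x
set.seed(123)
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 1 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Studentized (Koenker) Breusch-Pagan test computed by hand:
# regress squared residuals on the predictors and use LM = n * R^2,
# which is approximately chi-squared with (number of predictors) df under H0
aux     <- lm(residuals(fit)^2 ~ x)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(statistic = lm_stat, p_value = p_value)
```

A small p-value rejects the hypothesis of constant residual variance. Applied to model1, the auxiliary regression would use all five predictors and `df = 5`.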

  3. Check whether there are influential points in this model. Report your R code and provide your comments. If yes, exclude these points (just delete from the data set or make a subset), re-run the model and compare the results. Provide your comments.
library(ggfortify)
autoplot(model1, which = 4:5)

Observations with a Cook’s distance greater than 1 are usually considered influential. Here we do not see such points (the graph on the left). Observations with both high leverage and large residuals are usually excluded from the analysis. Here there are no such points either, only points with high leverage (the graph on the right). There is nothing to exclude.
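
For the case where influential points are present, the exclude-and-refit step could look like the sketch below. It uses simulated data with one planted outlier; the data frame `toy` is hypothetical and is not the col_origins data.

```r
# Simulated data (hypothetical) with one planted influential point
set.seed(7)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 0.5)
toy <- data.frame(x = c(x, 60), y = c(y, 0))  # row 31: far in x and far off the line

fit_full <- lm(y ~ x, data = toy)
cooks    <- cooks.distance(fit_full)
leverage <- hatvalues(fit_full)

# Flag points by the rule of thumb used above (Cook's distance > 1)
influential <- which(cooks > 1)
influential

# Re-fit on the remaining rows and compare the coefficients;
# logical indexing is safe even when nothing is flagged
fit_clean <- lm(y ~ x, data = toy[cooks <= 1, ])
rbind(full = coef(fit_full), clean = coef(fit_clean))
```

If the coefficients change noticeably after the exclusion, as they do in this constructed example, the flagged observations were driving the original estimates.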

Practice 2

You are asked to conduct a small study of political self-identification before the Chilean plebiscite of 1988. Your question of interest is the following: which factors affect people’s propensity to support the status-quo (Pinochet remaining in power)? You are provided with a dataset with the results of a survey conducted in Chile. It contains the following variables (only those that are needed for the model are listed here):

Build a regression model that would help you decide which of the factors mentioned above affect people’s position towards the status-quo.

2.1. What is the dependent variable in your model?

Position towards status-quo (statusquo).

2.2. What are the independent variables in your model?

Gender, age and the logarithm of income (gender, age, log_income).

2.3. Run the model. Provide the R code you used to perform the model.

chile <- read.csv("http://math-info.hse.ru/f/2016-17/ps-pep-quant/chile.csv")
model2 <- lm(data = chile, statusquo ~ gender + age + log_income)
summary(model2)
## 
## Call:
## lm(formula = statusquo ~ gender + age + log_income, data = chile)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7766 -1.0041 -0.1631  1.0835  2.0508 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.491511   0.262048  -1.876   0.0609 .  
## gender      -0.242791   0.051880  -4.680 3.10e-06 ***
## age          0.010946   0.001733   6.316 3.41e-10 ***
## log_income   0.019661   0.025148   0.782   0.4344    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.067 on 1705 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.03434,    Adjusted R-squared:  0.03265 
## F-statistic: 20.21 on 3 and 1705 DF,  p-value: 7.143e-13

2.4. Which of the factors significantly affect the people’s position on the status-quo spectrum? At what level of significance?

Gender and age, both at the 0.1% significance level (0.001). Income (log_income) is not significant (p = 0.43).

2.5. Interpret the coefficient of the variable age, i.e. explain what happens when the age of a respondent increases by one year.

The position towards status-quo increases (on average, all else equal) by 0.011, so older people favour keeping Pinochet in power more than younger people do.

2.6. Interpret the coefficient of the variable gender, i.e. explain what happens when we move from a female respondent to a male one.

All else equal, the position of male respondents (coded 1) on the status-quo scale is on average 0.24 lower than the position of female respondents (coded 0): the coefficient of gender is negative, and lower values correspond to a lower level of approval of Pinochet.
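
Both interpretations can be checked with a small worked example that plugs the estimates from 2.3 into the regression equation. The helper `predict_sq()` and the respondent profiles (age 40 or 50, log_income of 10) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model2) output in 2.3
b <- c(intercept = -0.491511, gender = -0.242791,
       age = 0.010946, log_income = 0.019661)

# Predicted statusquo for a given respondent profile
predict_sq <- function(gender, age, log_income) {
  unname(b["intercept"] + b["gender"] * gender +
           b["age"] * age + b["log_income"] * log_income)
}

# Hypothetical profiles that differ in one characteristic at a time
woman_40 <- predict_sq(gender = 0, age = 40, log_income = 10)
man_40   <- predict_sq(gender = 1, age = 40, log_income = 10)
man_50   <- predict_sq(gender = 1, age = 50, log_income = 10)

man_40 - woman_40   # equals the gender coefficient
man_50 - man_40     # equals ten times the age coefficient
```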

Practice 3

Modify the model from the previous task (Practice 2) so that it captures the differences between men and women in the effect of age on the position on the status-quo scale. In other words, use the same model as before, but consider including some specific term(s) in your model.

3.1. Write the equation of the new model.

\[ statusquo = \beta_0 + \beta_1\times gender + \beta_2 \times age + \beta_3 \times log\_income + \beta_4 \times age \times gender. \]

3.2. Explain in what way the new model is different from the model from Practice 2.

We added an interaction term \(age \times gender\), which allows the effect of age to differ between men and women.
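
To see what the interaction term does, the equation from 3.1 can be written separately for each gender, using the coding female = 0 and male = 1:

\[ gender = 0: \quad statusquo = \beta_0 + \beta_2 \times age + \beta_3 \times log\_income, \]

\[ gender = 1: \quad statusquo = (\beta_0 + \beta_1) + (\beta_2 + \beta_4) \times age + \beta_3 \times log\_income. \]

So \(\beta_2\) is the effect of age for women, \(\beta_2 + \beta_4\) is the effect of age for men, and \(\beta_4\) is the difference between the two.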

3.3. Run the model. Provide the R code you used to perform the model.

model3 <- lm(data = chile, statusquo ~ gender + age + log_income + age:gender)
summary(model3)
## 
## Call:
## lm(formula = statusquo ~ gender + age + log_income + age:gender, 
##     data = chile)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7272 -1.0038 -0.1325  1.0859  2.1031 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.351342   0.273677  -1.284 0.199392    
## gender      -0.476816   0.142445  -3.347 0.000834 ***
## age          0.007516   0.002604   2.887 0.003942 ** 
## log_income   0.018615   0.025140   0.740 0.459117    
## gender:age   0.006149   0.003486   1.764 0.077926 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.066 on 1704 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.0361, Adjusted R-squared:  0.03384 
## F-statistic: 15.96 on 4 and 1704 DF,  p-value: 7.861e-13
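
As a side note on the formula syntax, `gender * age` is shorthand in R for `gender + age + gender:age`, so the same model can be written more compactly. The sketch below checks this on simulated data; the data frame `toy` is hypothetical and is not the Chilean survey.

```r
# Simulated data (hypothetical, not the Chilean survey)
set.seed(2024)
n <- 300
toy <- data.frame(
  gender     = rbinom(n, size = 1, prob = 0.5),
  age        = sample(18:70, n, replace = TRUE),
  log_income = rnorm(n, mean = 10, sd = 1)
)
toy$statusquo <- 0.2 - 0.4 * toy$gender + 0.01 * toy$age +
  0.005 * toy$gender * toy$age + rnorm(n)

# Explicit main effects plus the interaction term
m_explicit <- lm(statusquo ~ gender + age + log_income + age:gender, data = toy)

# gender * age expands to gender + age + gender:age
m_star <- lm(statusquo ~ gender * age + log_income, data = toy)

# Both formulas describe the same model, so the fitted values coincide
all.equal(fitted(m_explicit), fitted(m_star))
```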

3.4. Does the age affect the identification (position on a status-quo scale) differently for men and women? Explain your answer.

At the 10% significance level (0.1), yes: the coefficient of the interaction term gender:age is significant (p = 0.078), and the estimated effect of age is stronger for men (0.0075 + 0.0061 = 0.0137 per year) than for women (0.0075 per year). However, at the more conventional 5% level the interaction is not significant, so the evidence that age affects identification differently for men and women is weak.
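
The numbers behind this answer can be reproduced from the output in 3.3. This is a sketch with illustrative variable names, using only the printed estimates and the t value of the interaction term.

```r
# Values copied from the summary(model3) output in 3.3
b_age        <- 0.007516   # slope of age when gender = 0
b_gender_age <- 0.006149   # extra slope of age when gender = 1
t_interact   <- 1.764      # t value of gender:age
df_resid     <- 1704       # residual degrees of freedom

slope_women <- b_age                  # gender coded 0
slope_men   <- b_age + b_gender_age   # gender coded 1
c(women = slope_women, men = slope_men)

# Two-sided p-value of the interaction term and the decision at two levels
p_interact <- 2 * pt(-abs(t_interact), df = df_resid)
p_interact
p_interact < 0.10   # significant at the 10% level?
p_interact < 0.05   # significant at the 5% level?
```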