Problem 1

Match each purpose with the method or model that should be used to obtain the desired results. Keep in mind the assumptions behind each method when making your choice.

  1. construct an integral index of economic prosperity based on different social and economic indicators

  2. evaluate the effect of freedom of the press, operationalized as the Freedom of the Press Index (Freedom House), on the level of corruption, represented by the Corruption Perception Index

  3. decide how students can be grouped based on their abilities of different kinds (set of numeric scores for many tests)

  4. evaluate the effect of age, gender and political preferences (place on the left-right scale) on people’s propensity to participate in actions of civil unrest (0 – does not participate, 1 – participates)

  5. determine whether a 42-year-old person with the median income and 2 children would participate in actions of civil disobedience

  6. evaluate the effect of economic development, measured as GDP per capita, on political stability, operationalized as the Political Stability Index, taking into account ethnic heterogeneity and the level of social inequality (suppose that we have panel data and want to somehow capture the individual effects in each country)

  1. cluster analysis

  2. principal component analysis

  3. decision tree

  4. simple linear regression

  5. logistic regression

  6. linear regression with fixed effects

Problem 2

You decided to create your own index of democracy based on other indicators concerning political regime and quality of governance.

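You have three indicators: the Corruption Perception Index (cpi), Freedom of the Press (fp), and Voice and Accountability (va).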

To do this, you perform principal component analysis (PCA).

  1. Open the database you need.
df <- read.csv("http://math-info.hse.ru/f/2016-17/ps-pep-quant/indicators_pca.csv")
  2. Scale the variables you need.
df$cpi_sc <- scale(df$cpi)
df$fp_sc <- scale(df$fp)
df$va_sc <- scale(df$va)
  3. Choose columns you need and perform the PCA.
data <- df[5:7]
comp <- prcomp(data)
  4. Look at the results you got.
print(comp)
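
As a hedged aside, the sketch below runs the same kind of analysis on synthetic data so that the structure of a `prcomp` result is easier to read. The data frame `toy` and its columns `a`, `b`, `c` are made up for illustration and are not the course dataset; the sketch only shows where the weights (loadings) and the share of variance per component can be found.

```r
# Synthetic stand-in for three correlated indicators (NOT the course data).
set.seed(1)
latent <- rnorm(100)
toy <- data.frame(
  a = latent + rnorm(100, sd = 0.3),
  b = latent + rnorm(100, sd = 0.3),
  c = latent + rnorm(100, sd = 0.3)
)

# PCA on the scaled variables
toy_pca <- prcomp(scale(toy))

# Weights (loadings) of each variable in each component: one column per PC
loadings <- toy_pca$rotation
print(round(loadings, 2))

# Share of the total variance captured by each component
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(var_share, 2))
```

Note that the sign of a whole component is arbitrary: multiplying all weights of one component by -1 describes the same axis.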

Consider the first principal component (PC1). It will be your aggregate index of democracy.

2.1. What is the weight of the Corruption Perception Index in your index of democracy?

2.2. Write the expression showing how your index is calculated based on the scaled values of Corruption Perception Index, Freedom of the Press and Voice and Accountability (round the corresponding weights to two decimal places).

2.3. Which of the indicators contributes the most to your index? Explain your answer.

2.4. Why do the Corruption Perception Index and Voice and Accountability have a positive contribution, while Freedom of the Press has a negative one?

2.5. What is the value of your democracy index for Brazil (use the scaled values)?
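
A minimal sketch, on synthetic data, of how a component score relates to the weights: the score of one observation on PC1 is the weighted sum of its scaled values, with the PC1 loadings as weights. The matrix `toy` and the choice of observation 5 are assumptions made only for this illustration.

```r
# Synthetic scaled data with three hypothetical variables (NOT the course data).
set.seed(2)
toy <- scale(matrix(rnorm(60), ncol = 3,
                    dimnames = list(NULL, c("a", "b", "c"))))
toy_pca <- prcomp(toy)

w <- toy_pca$rotation[, 1]           # PC1 weights
manual_score <- sum(w * toy[5, ])    # weighted sum for observation 5
builtin_score <- toy_pca$x[5, 1]     # score stored by prcomp

print(c(manual = manual_score, builtin = builtin_score))
```

The two numbers agree because the input is already centered, so `prcomp` simply projects each row onto the loading vector.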

2.6. Add the column with your democracy index to the original dataset.

df$democracy <- comp$x[, 1]

Get the list of the countries that have the lowest values of the new democracy index:

head(df[order(df$democracy), ], 10)

Get the list of the countries that have the highest values of the new democracy index:

tail(df[order(df$democracy), ], 10)

Does your democracy index seem to be sensible? Comment.

Problem 3

Suppose we want to use a function \(d(x,y)\) to measure the distance between \(x\) and \(y\). Choose the properties that must hold for \(d(x,y)\).

Problem 4

A researcher wants to organize countries into groups based on the indicators in his dataset, but he has no idea how many clusters he wants to obtain. He wants to see the whole picture and then decide how detailed the classification should be. Which approach should the researcher use: K-means or hierarchical clustering? Explain your answer.

Problem 5

You are provided with a dataset of different indicators related to the quality of governance. You have to assign the states in your dataset to different groups based on the values of these indices.

Perform hierarchical cluster analysis. Load and prepare data:

df2 <- read.csv("http://math-info.hse.ru/f/2016-17/ps-pep-quant/indicators_clust.csv")
rownames(df2) <- df2$country 

data2 <- df2[, -1]
scaled_d <- scale(data2)
distances <- dist(scaled_d)

Visualize the results with the dendrogram.

cl <- hclust(distances)
plot(cl)
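
Besides reading the dendrogram by eye, cluster membership at a given number of groups can be extracted with `cutree`. The sketch below is a hedged illustration on synthetic data; the matrix `toy` and the row names `unit1` to `unit10` are hypothetical and unrelated to the countries in the dataset.

```r
# Two artificial, well-separated groups of five points each (NOT the course data).
set.seed(3)
toy <- rbind(
  matrix(rnorm(10, mean = 0), ncol = 2),
  matrix(rnorm(10, mean = 8), ncol = 2)
)
rownames(toy) <- paste0("unit", 1:10)

# Same pipeline as above: scale, compute distances, cluster
toy_hc <- hclust(dist(scale(toy)))

# Cut the tree into two clusters and look at who ends up where
groups <- cutree(toy_hc, k = 2)
print(groups)

# Check whether two particular units share a cluster
print(unname(groups["unit1"] == groups["unit2"]))
```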

5.1. If you decide to divide your data into just two clusters, will Argentina and Chile be in the same cluster? Why?

5.2. Judging by the dendrogram, can you say that Austria and Canada have something in common? Explain your answer.

5.3. Can you suggest any substantive explanation for why the countries fall into two different groups?

5.4. Suppose you chose a different type of clustering and performed K-means clustering. Visualize the relationship between the Political Stability Index (ps) and Voice and Accountability (va), taking into account the information about the two major classes of countries.

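set.seed(42)  # K-means starts from random centers; fix the seed for reproducibility
# kmeans() expects the data matrix itself, so pass the scaled indicators rather than the distance object
kmeans_clusters <- kmeans(scaled_d, 2)$cluster
plot(ps ~ va, col = kmeans_clusters, data = data2)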

Considering the relationship between Political Stability and Voice and Accountability, can you conclude that the countries form two clearly distinct groups? Explain your answer.

5.5. Do the same thing, but choose any other pair of variables from the dataset. Considering the relationship between the indicators of your choice, can you conclude that the countries form two clearly distinct groups? Explain your answer.
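
In addition to scatter plots, one way to see which variables separate the clusters is to compare cluster means variable by variable. The sketch below is only an illustration on synthetic data; the data frame `toy` and its columns `x`, `y`, `z` are hypothetical, and `nstart = 10` is an assumed setting that reruns K-means from several random starts.

```r
# Synthetic data: x and y differ between two groups, z is pure noise (NOT the course data).
set.seed(4)
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 10)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 10)),
  z = rnorm(40)
)

# K-means on the scaled variables, with several random starts
toy_km <- kmeans(scale(toy), centers = 2, nstart = 10)

# Mean of each original variable within each cluster
cluster_means <- aggregate(toy, by = list(cluster = toy_km$cluster), FUN = mean)
print(cluster_means)
```

Variables whose cluster means are far apart relative to their spread are the ones along which the groups look distinct in a scatter plot.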

Problem 6

Suppose you have the results of the following survey: respondents were asked whether they support the current policy in the country. The dataset on this survey includes the variables:

polit – whether a person supports the current policy (0 – no, 1 – yes)

male – 1 if a person is male, 0 if female

income_average – average income (in rubles)

age – age of a person

educ – level of education (factor)

Here is the decision tree based on the indicators from the dataset. The purpose of this tree is to decide whether a person with some characteristics will support the current policy.

6.1. Judging by this tree, will the respondent with the following characteristics support the current policy:

  1. male, 25 years old, with an average income of 4500 rubles and education level 0 (high school)?

  2. female, 30 years old, with an average income of 4800 rubles and education level 1 (professional college)?

6.2. As you can see, this tree does not use all the variables in the dataset. Explain why this might happen, i.e. explain why there is no need to take all the variables into account when deciding whether a person supports the current policy.
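
For readers who want to experiment, here is a minimal sketch of fitting and querying a classification tree in R. It assumes the `rpart` package is available (it is typically bundled with R as a recommended package). The data frame `toy`, the outcome `supports`, and the rule that generates it are all hypothetical and have nothing to do with the survey or the tree shown in this problem.

```r
library(rpart)  # assumed to be installed; usually bundled with R

# Synthetic respondents (NOT the survey data); the outcome follows a made-up rule.
set.seed(5)
toy <- data.frame(
  age = sample(18:80, 200, replace = TRUE),
  income = runif(200, 1000, 10000)
)
toy$supports <- factor(as.integer(toy$age > 40))  # hypothetical rule, illustration only

# Fit a classification tree with both predictors offered to it
toy_tree <- rpart(supports ~ age + income, data = toy, method = "class")

# Predicted class for two hypothetical new respondents
new_people <- data.frame(age = c(25, 60), income = c(4500, 4800))
predicted <- predict(toy_tree, newdata = new_people, type = "class")
print(predicted)

# Variables that actually appear in the splits of the fitted tree
used_vars <- setdiff(unique(as.character(toy_tree$frame$var)), "<leaf>")
print(used_vars)
```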