Data Analysis in the Social Sciences: Exam

Part 1

Load the CSV file containing data from the World Values Survey (a small subset): http://math-info.hse.ru/f/2018-19/pep/wws.csv.

Variables:

country: country of a respondent;
gender: gender of a respondent (Female or Male);
trust: trust in other people (Most people can be trusted or Need to be very careful);
age: age of a respondent;
emancipative_values: importance of emancipative values;
secular_values: importance of secular values.

Task 1.1 (2 points)

Create a box plot for the importance of secular values. Provide your R code. Provide your comments on the distribution of this variable.

Task 1.2 (16 points)

In a small research project, you are interested in the following question: which factors (trust, age) affect the importance of secular values? To answer this question, you are asked to fit a model.

1. State dependent variable(s) in your model.

2. State independent variable(s) in your model.

3. What type of model should you use (simple linear regression, multiple linear regression, logistic regression)?

4. Run a model in R. Provide your R code and R output as well.

5. Interpret the coefficient of trustNeed to be very careful (say what happens when the values of a variable change, not only significance).

6. How does the importance of secular values change when age increases by one year (on average)?

7. Report the adjusted R-squared of this model. Provide your comments.

8. Which coefficients are statistically significant at the 1% significance level?

9. Test whether multicollinearity is present in this model. Provide your R code and R output as well. Provide your comments.

10. Test (using residual plots) whether heteroskedasticity is present in this model. Provide your R code. Provide your comments.

11. Check whether there are influential points in this model. Report your R code and R output, and provide your comments. If there are, exclude these points (delete them from the data set or take a subset), re-run the model, and compare the results. Provide your comments.

12. Can we say that the residuals of this model are normally distributed? Provide your R code to check it. Provide your comments.

Task 1.3 (4 points)

Modify the model from the previous task so that it captures differences between females and males in the effect of age on the importance of secular values.

1. What is the name (type) of the model you should use here?

2. Run the model in R. Report your R code and R output.

3. Can we say that age affects the importance of secular values differently for males and females? Explain your answer.

Part 2

A first-year student at the Higher School of Economics wants to check how the effectiveness of an insect spray depends on its type. The data on sprays are available in R.

Variables:

count: a measure of effectiveness, the number of insects killed;
spray: the type of spray.

Type View(InsectSprays) to look at the data (there is no need to load any external files, just use InsectSprays).
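For orientation, the built-in data set can also be inspected from the console. The lines below are a minimal sketch that uses only base R; they assume nothing beyond the InsectSprays data frame that ships with R.

```r
# InsectSprays is part of the 'datasets' package that ships with R,
# so nothing has to be downloaded or read from disk.
data(InsectSprays)

dim(InsectSprays)    # number of rows and columns
str(InsectSprays)    # variable names and their types
head(InsectSprays)   # first six rows
```

Unlike View(), these calls print to the console, so their output can be pasted directly into your answer.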

Task 2.1 (3 points)

1. Delete rows with missing values (NA).

2. Report descriptive statistics for the measure of effectiveness. Provide your R code and R output. What is the maximum value? What is the average number of insects killed?

Task 2.2 (5 points)

Suppose you want to check whether the type of spray affects its effectiveness.

1. Run a model that will help you to answer this question. Provide your R code and output.

2. What is the base (reference) level of the spray type in your model? How did you find it?

3. Write the equation of this model.

4. Interpret the coefficient of sprayD. Your interpretation should include the exact meaning of this coefficient, not only significance/insignificance.

Part 3

The data set you are asked to work with in this part contains audio features of the top Spotify songs of 2018 (a modified version of the Kaggle data frame). The data can be downloaded here: http://math-info.hse.ru/f/2018-19/pep/spot.csv.

Variables:

name: name of the song.
artists: artist(s) of the song.
danceability: danceability describes how suitable a track is for dancing based on a combination of musical elements.
energy: energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
loudness: the overall loudness of a track in decibels (dB).
mode: the modality of a track (1 for major, 0 for minor).
liveness: detects the presence of an audience in the recording.
tempo: the overall estimated tempo of a track in beats per minute (BPM).
duration_ms: the duration of the track in milliseconds.

Task 3.1 (1 point)

Count how many tracks are major and how many are minor. Provide your R code. Which type prevails?

Task 3.2 (9 points)

A student wants to test the hypothesis that the probability of a track being major increases as the energy of the track increases. He also wants to take the tempo of the track into account.

1. Which type of regression should he choose?

simple linear regression;
simple logistic regression;
multiple logistic regression;
linear regression with dummy variables.

2. Run the chosen model in R. Provide your R code and R output.

3. Apply the necessary transformation and interpret the transformed coefficient of energy (do this even if the coefficient is not significant).

4. Do the results agree with the student’s hypothesis? Can we say that we have proved that the type of track (major or minor) can be explained by its energy?

5. Check the quality of this model using an appropriate method. Provide your R code and R output. Provide your comments.

Part 4 (5 points)

You decided to create your own index of democracy based on other indicators concerning political regime and quality of governance. You have three indicators:

Corruption Perception Index (cpi). Ranges from 0 to 100, higher values correspond to better control of corruption.
Freedom of the press (fp). Ranges from 0 to 100, higher values correspond to less freedom of the press.
Voice and Accountability (va). Ranges from -2.5 to 2.5, higher values correspond to more accountability.

To do this, you perform principal component analysis (PCA).

1. Open the database you need and scale the variables you need.

df <- read.csv("http://math-info.hse.ru/f/2016-17/ps-pep-quant/indicators_pca.csv")

2. Perform the PCA.

data <- df[2:4]
comp <- prcomp(data, center = TRUE, scale. = TRUE)

3. Look at the results you got.

print(comp)

## Standard deviations (1, .., p=3):
## [1] 1.6095025 0.6090172 0.1964677
## 
## Rotation (n x k) = (3 x 3):
##            PC1        PC2        PC3
## cpi  0.5340029 -0.8386209 -0.1074981
## fp  -0.5908099 -0.4610746  0.6620830
## va   0.6048013  0.2900433  0.7416807

Consider the first principal component (PC1). It will be your aggregate index of democracy.

3.1. What is the weight of the Freedom of the press in your index of democracy?

3.2. Write the expression showing how your index is calculated based on the scaled values of Corruption Perception Index, Freedom of the Press and Voice and Accountability (round the corresponding weights to two decimal places).

3.3. Which of the indicators contribute the most to your index? Explain your answer.

3.4. Why do the Corruption Perception Index and Voice and Accountability have a positive contribution, while Freedom of the Press has a negative one?

3.5. Add the column with your democracy index to the original dataset.

df$democracy <- comp$x[, 1]

3.6. What is the value of your democracy index for the USA (use the scaled values)?
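The relation between the rotation matrix printed by prcomp() and the scores stored in comp$x can be checked on toy data. The sketch below is only an illustration: the data frame toy and its values are made up (they are not the exam data), and only the column names are borrowed from the task.

```r
# Synthetic stand-in for the three indicators; the values are invented
# for illustration and have nothing to do with the real data set.
set.seed(1)
toy <- data.frame(
  cpi = runif(20, 0, 100),
  fp  = runif(20, 0, 100),
  va  = runif(20, -2.5, 2.5)
)

toy_comp <- prcomp(toy, center = TRUE, scale. = TRUE)

# A score on a component is the weighted sum of the scaled variables,
# with the weights taken from the matching column of the rotation matrix.
scaled     <- scale(toy)   # centre each column and divide by its sd
manual_pc1 <- scaled %*% toy_comp$rotation[, "PC1"]

# Compare the manual scores with the ones prcomp() stores in $x.
all.equal(as.numeric(manual_pc1), as.numeric(toy_comp$x[, "PC1"]))
```

The same logic applies to any row: take its scaled values, multiply each by the corresponding weight, and add the products.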

Bonus tasks

If you still feel uneasy with R, you can try these non-programming tasks and earn extra points.

1. (2 points) Consider the following situation. A student decides to evaluate the association between economic development and the type of political regime. As an indicator of economic development he uses GDP per capita. The type of political regime takes three values: ‘autocracy’, ‘hybrid regime’, and ‘democracy’. There are 191 countries in the sample.

To find the strength of the association, he wants to estimate the Pearson correlation coefficient between GDP per capita and the type of political regime. Is the approach used by the student correct? Explain your answer.

2. (2 points) A young researcher decided to study the relationship between the income of residents of district A (income) and the number of years they have lived in this district (years). He estimated the Pearson correlation coefficient and got the following results:

\[ \operatorname{Corr}(\text{income}, \text{years}) = 0.3, \quad p\text{-value} = 0.7. \]

Can we regard this result as reliable? In other words, can we draw any conclusions about the relationship between the income of residents and the number of years they have lived in the district? Explain your answer.

3. (1 point) A student tossed a coin 10 times. He got heads 4 times and tails 6 times. Assuming that the probability of heads equals its relative frequency in these data, calculate the odds of obtaining heads.

4. (0.5 point) For a linear model, RSS (residual sum of squares) equals 700 and TSS (total sum of squares) equals 1400. Find R-squared for this model.

5. (0.5 point) Can a p-value be greater than 1? Explain your answer.
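As a reminder of the definitions used in tasks 3 and 4, the sketch below computes odds from a probability and R-squared from sums of squares. All numbers in it are hypothetical and deliberately differ from those in the tasks; the variable names are chosen only for illustration.

```r
# Hypothetical inputs, unrelated to the numbers in the tasks above.
p   <- 0.25   # probability of an event
rss <- 30     # residual sum of squares
tss <- 120    # total sum of squares

# Odds compare the probability of the event with the probability
# of its complement.
odds <- p / (1 - p)

# R-squared is the share of total variation that is not left
# in the residuals.
r_squared <- 1 - rss / tss

odds
r_squared
```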