Load a CSV file containing data on people’s preferred fishing mode (for example, whether a person prefers fishing from a boat, a beach, or a pier). The variables are:

mode: mode of fishing (beach, boat, charter, pier);
price: the price a person is willing to pay for the chosen mode;
catch: catch rate, i.e., the chance of catching a fish;
income: person’s income;
boat: whether a person chooses the boat mode or not;
pboat: the price a person is willing to pay for fishing on a boat;
cboat: catch rate on a boat.

Part 1

Task 1.1 (18 points)

In this small study you are interested in the following question: which factors (income, catch rate) affect the price a person is willing to pay for fishing? To answer it, you are advised to fit a model.

State the dependent variable(s) in your model.
State the independent variable(s) in your model.
What type of model should you use (paired linear regression, multiple linear regression, logistic regression)?
Run a model in R. Provide your R code and R output as well.
Interpret the coefficient of catch (say what happens when the values of a variable change, not only significance).
How does the price change when income increases by one unit (on average)?
Report the adjusted R-squared of this model. Provide your comments.
Which coefficients are statistically significant at the 1% significance level?
Test whether multicollinearity is present in this model. Provide your R code and R output as well. Provide your comments.
Test (using residual plots) whether heteroskedasticity is present in this model. Provide your R code. Provide your comments.
Check whether there are influential points in this model. Report your R code and R output, and provide your comments. If there are, exclude these points (just delete them from the data set or make a subset), re-run the model and compare the results. Provide your comments.
Can we say that the residuals of this model are normally distributed? Provide your R code to check it. Provide your comments.

Task 1.2 (4 points)

Modify the model from the previous task to capture the difference in the effect of catch rate on price between people who choose a boat and people who do not.

What is the name (type) of the model you should use here?
Run the model in R. Report your R code and R output.
Can we say that the catch rate affects the price differently for people choosing a boat and people not choosing a boat? Explain your answer.

Part 2

Task 2.1 (3 points)

Report descriptive statistics for income. Provide your R code and R output. What is the maximum value? What is the median income?

Task 2.2 (5 points)

Suppose you want to check whether the chosen mode of fishing affects the price a person is willing to pay for fishing.

Run a model that will help you to answer this question. Provide your R code and output.
What is the base level of mode in your model? How did you find it?
Write the equation of this model.
Interpret the coefficient of modecharter. Your interpretation should include the exact meaning of this coefficient, not only significance/insignificance.

Part 3

Task 3.1 (1 point)

Count how many people chose fishing on a boat and how many did not. Provide your R code. Which group is larger?

Task 3.2 (9 points)

A student wants to test the hypothesis that the probability of choosing fishing on a boat (boat) increases as the price (pboat) decreases. He also wants to take into account the catch rate for this mode (cboat).

Which type of regression should he choose?

simple linear regression;
simple logistic regression;
multiple logistic regression;
linear regression with dummy variables.

Run the chosen model in R. Provide your R code and R output.
Apply the necessary transformation and interpret the transformed coefficient of cboat (do this even if it is not significant).
Do the results agree with the student’s hypothesis? Can we say we have proved that the mode of fishing (boat or not boat) can be explained by its price?
Check the quality of this model using an appropriate method. Provide your R code and R output. Provide your comments.

Part 4 (5 points)

You decided to create your own index of democracy based on other indicators concerning political regime and quality of governance. You have three indicators:

Corruption Perception Index (cpi). Ranges from 0 to 100, higher values correspond to better control of corruption.
Freedom of the press (fp). Ranges from 0 to 100, higher values correspond to less freedom of the press.
Voice and Accountability (va). Ranges from -2.5 to 2.5, higher values correspond to more accountability.

To do this you perform principal component analysis.

Load the data set you need and scale the variables you need.

df <- read.csv("https://raw.githubusercontent.com/allatambov/PyDat-0919/master/indicators_pca.csv")

Perform the PCA.

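# Keep the three indicator columns (columns 2 to 4 of df).
data <- df[2:4]
# Centre and scale each variable before extracting the components.
comp <- prcomp(data, center = TRUE, scale. = TRUE)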

Look at the results you got.

print(comp)

## Standard deviations (1, .., p=3):
## [1] 1.6095025 0.6090172 0.1964677
## 
## Rotation (n x k) = (3 x 3):
##            PC1        PC2        PC3
## cpi  0.5340029 -0.8386209 -0.1074981
## fp  -0.5908099 -0.4610746  0.6620830
## va   0.6048013  0.2900433  0.7416807
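
As a reminder of how the pieces of `prcomp` output fit together, here is a minimal sketch on R's built-in `USArrests` data. This data set is an assumption chosen purely for illustration and is not the exam data; the names `demo_data`, `demo_pca`, `z`, `manual_scores` and `max_abs_diff` are hypothetical. The sketch checks that the component scores stored in `$x` are the standardised variables multiplied by the loadings in `$rotation`.

```r
# Illustration on the built-in USArrests data (NOT the exam data set).
demo_data <- USArrests[, c("Murder", "Assault", "UrbanPop")]
demo_pca <- prcomp(demo_data, center = TRUE, scale. = TRUE)

# Standardise by hand: subtract column means, divide by column standard deviations.
z <- scale(demo_data, center = TRUE, scale = TRUE)

# Scores are the standardised values multiplied by the loadings (rotation matrix).
manual_scores <- z %*% demo_pca$rotation

# Largest absolute difference between the manual scores and prcomp's own scores.
max_abs_diff <- max(abs(manual_scores - demo_pca$x))
print(max_abs_diff < 1e-10)
```

Each column of `manual_scores` is therefore a weighted sum of the standardised variables, with the weights taken from the matching column of the rotation matrix.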

Consider the first principal component (PC1). It will be your aggregate index of democracy.

3.1. What is the weight of the Freedom of the press in your index of democracy?

3.2. Write the expression showing how your index is calculated based on the scaled values of Corruption Perception Index, Freedom of the Press and Voice and Accountability (round the corresponding weights to two decimal places).

3.3. Which of the indicators contributes the most to your index? Explain your answer.

3.4. Why do the Corruption Perception Index and Voice and Accountability contribute positively, while Freedom of the Press contributes negatively?

3.5. Add the column with your democracy index to the original dataset.

df$democracy <- comp$x[, 1]

3.6. What is the value of your democracy index for Bolivia (use the scaled values)?

Bonus tasks

If you still feel uneasy with R, you can try these non-programming tasks and earn extra points.

(2 points) Consider the following situation. A student decides to evaluate the association between economic development and the type of political regime. As an indicator of economic development he uses GDP per capita. The type of political regime takes three values: ‘autocracy’, ‘hybrid regime’, and ‘democracy’. There are 191 countries in the sample.

To measure the strength of the association he wants to estimate the Pearson correlation coefficient between GDP per capita and the type of political regime. Is the student’s approach correct? Explain your answer.

(2 points) A young researcher decided to study the relationship between the income of residents of district A (income) and the number of years they have lived in this district (years). He estimated the Pearson correlation coefficient and got the following results:

\[\text{Corr}(\text{income}, \text{years}) = 0.3, \quad p\text{-value} = 0.7.\]

Can we regard this result as reliable? In other words, can we draw any conclusions about the relationship between the income of residents and the number of years they have lived in the district? Explain your answer.

(1 point) A student tossed a coin 10 times and got heads 7 times and tails 3 times. Assuming that the probability of heads equals its relative frequency in these data, calculate the odds of obtaining heads.
(0.5 point) For some linear model, the RSS (residual sum of squares) equals 700 and the TSS (total sum of squares) equals 1400. Find the R-squared of this model.
(0.5 point) Can a p-value be greater than 1? Explain your answer.

Exam

Alla Tambovtseva

27-12-2019