Hints for HW2
In homework 2, we use R to simulate the standard (binary) choice model of individuals.
There are 1000 individuals, each individual making a choice between good 1 and good 2. Our task is to estimate the parameter \(β\) in individuals’ utility function.
Specifically, individual \(i\)’s utility from good \(k\) is \(β x_k + \epsilon_{ik}\), where
\(x_k\) is the individual’s choice. If individual 1 chooses good \(1\), then \(x\) is \(0\); otherwise, \(x\) is \(1\).
\(\epsilon_{ik}\) are noises/shocks.
Each individual makes a binary choice, denoted by \(y_{ij}\). We first simulate noises \(\epsilon\) and the (optimal) choices \(y\), and then estimate \(\beta\) based on the choice data \(y\).
Simulation
1 To simulate data, first make a data frame as follows.
This is a soft question. You may use seq(), rep() and sort() functions from base R. Since df is a tibble, you need tibble() from dplyr.
Of course, you can replicate the results using data.frame from base R if you wish. If you are new to R, you should use tibble() from dplyr.
2 Second, draw type-I extreme value random variables. Set the seed at 1. You can use evd package to draw the variables. You should get exactly the same realization if the seed is correctly set.
Use rgumbel() from package evd to generate e.
To add the new variable e into df, use mutate() function from dplyr.
3 Third, compute the latent value of each option to obtain the following data frame:
Use the utility function to get latent.
4 Finally, compute y by comparing the latent values of k=1,2 for each i to obtain the following result:
This one may be tricky. There are many ways to get y. You may use for loop if you are more familiar with the procedural programming. Any method is fine as long as you can replicate the choice data here.
I personally prefer the dplyr-style:
we first find the maximal latent for each individual (named
value), which is the individual’s maximized utility (or value function)if latent equals value, then y is 1 and otherwise 0.
df = df |>
group_by(i) |>
mutate(value = max(latent)) |>
ungroup() |>
mutate(y = as.integer(latent == value)) |>
mutate(value = NULL)For part a, I use the group_by() function to find the max latent by \(i\). If you are using the development version of dplyr, the three lines group_by(), mutate() and ungroup() can be combined into a single line: mutate(value = max(latent), .by = i).
For part b, I use the trick that as.integer(bool) returns 1 if bool is TRUE and otherwise 0. You can also use the (vectorized) ifelse() function.
The last line mutate(value = NULL) simply removes the value variable from df.
Estimate the parameter
- Write a function to compute the likelihood for a given
β and data and name the functioncompute_loglikelihood_a1(b, df).
This exercise asks you to write a function. The syntax for writing an R function is
compute_loglikelihood_a1 = function(b, df) {
<FUNCTION BODY>
}Use the formula for \(l(β)\) to compute the likelihood. You (should) only need x and y in df.
Check your answer by verifying that compute_loglikelihood_a1(1,df) outputs the desired value.
- Compute the value and plot the result using ggplot2 packages. You can use latex2exp package to use LaTeX math symbol in the label:
You do not need to write any code here. Just load the packages ggplot2, latex2exp, foreach, and then copy&run the code.
The foreach package is for parallelized loops in R.
- Find and report \(β\) that maximizes the log likelihood for the simulated data. You can use
optimfunction to achieve this. You will use Brent method and set the lower bound at -1 and upper bound at 1 for the parameter search.
Check the documentation of optim by running ?optim.