Question 1

In multiple linear regression, the conditional mean of one variable (\(Y\)) is modeled as a linear combination of other variables:

\[E(Y \mid X_1 = x_1, X_2 = x_2, \ldots) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots\]

Here we create some data that follows this specification:

set.seed(303029293)
n <- 100000
d <- tibble(x1 = c(rep(0, 1000), rep(1, n - 1000)),
            x2 = rnorm(n),
            x3 = runif(n, -1, 1)) |>
  mutate(x4 = x3^2, 
         x5 = x1 * x2 * 0.5,
         x6 = rexp(n),
         y = 2 + 0.1 * x1 + x3 + x5 + rnorm(n, sd = 2))

To fit the linear regression, we use the lm function:

(m1 <- lm(y ~ x1 + x3 + x5, data = d))
## 
## Call:
## lm(formula = y ~ x1 + x3 + x5, data = d)
## 
## Coefficients:
## (Intercept)           x1           x3           x5  
##     2.05647      0.04067      1.00456      0.99470

We got values that were pretty close to the truth (and on average, we should get the correct values as linear regression parameters are unbiased for the population parameters).

But what happens if we include some other variables that aren’t related to \(Y\)?

(m2 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = d))
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = d)
## 
## Coefficients:
## (Intercept)           x1           x2           x3           x4           x5  
##    2.050490     0.041720    -0.008373     1.004603     0.039055     1.011461  
##          x6  
##   -0.008110

The second model will fit better as measured by the \(R^2\) value (higher is better), but we haven’t really done better, we have just fit this particular sample more closely:

summary(m1)$r.squared < summary(m2)$r.squared
## [1] TRUE

There are various “information criterion” calculations that attempt to balance the fit of the model against a penalty for including more variables. In this case, smaller values are better models. Let’s try the Bayesian Information Criterion (BIC):

BIC(m1, m2)
##    df      BIC
## m1  5 422075.9
## m2  8 422105.4

We can see that m1 has a lower BIC, so adding those extra variables did not help the model.

In this problem we will do some searching over different combinations of variables to find the best possible model.

Part (a) (1 point)

Write a function model_bic1 that takes a vector of integers as inputs and returns the BIC calculated for a linear regression fit on the “x” variables with those integers as suffixes. E.g., the input c(1, 2) would fit a linear regression with the formula y ~ x1 + x2. The paste and as.formula functions will be helpful.

Verify your results match the output above for the input c(1, 3, 5).

model_bic1 <- function(vars) {
  formula_str <- paste("y ~", paste0("x", vars, collapse = " + "))
  model <- lm(as.formula(formula_str), data = d)
  BIC(model)
}
model_bic1(c(1, 3, 5))
## [1] 422075.9

Part (b) (1 point1)

Here are all possible models we could make from 6 variables (not even counting creating interaction terms or other functions of the variables).

model_indexes <- map(1:6, ~ data.frame(combn(6, .x)) |> as.list()) |> unlist(recursive = FALSE)
length(model_indexes)
## [1] 63

Using an appropriate map function, compute the BIC for all possible models. Which model had the lowest BIC? (This will take a while, so you might want to add cache = TRUE to your chunk header).

model_indexes <- map(1:6, ~ combn(6, .x, simplify = FALSE)) %>% flatten()
bic_values <- map_dbl(model_indexes, model_bic1)
best_index <- which.min(bic_values)
best_model_vars <- model_indexes[[best_index]]
best_model_vars
## [1] 3 5
  • The model with the lowest BIC includes variables x3 and x5.

Part (c) (1 point)

Using the profvis function in the profvis library to profile your model_bic1 function. Looking at the Data column and the time graph (the flame graph panel will probably be empty), find where the function spends most of its time.

library(profvis)
profvis({
  map_dbl(model_indexes[1:10], model_bic1)
})

You won’t be able to include the output directly, but write up your findings. * Most of the runtime is spent on model.frame.default and lm.fit at 150ms each.

Part (d) (1 point)

Look up the lm.fit function. Write a newmodel_bic2 function that uses the lm.fit function instead of lm.

model_bic2 <- function(vars) {
  design <- dmm[, c("(Intercept)", paste0("x", vars)), drop = FALSE]
  fit <- lm.fit(x = design, y = dmf$y)
  class(fit) <- "lm"
  BIC(fit)
}

To do so, you will need to create a “model matrix”. Here is a pre-computed model matrix with all the columns you will need for lm.fit (though you won’t need all of them for a given model).

dmf <- model.frame(y ~ ., data = d)
dmm <- model.matrix(terms(dmf), dmf)

If fit <- lm.fit(...) is the output of the lm.fit function, you will need to set the class to: class(fit) <- "lm" before calling BIC on the result. Verify your function by computing the BIC for the best model and compare it to model_bic1.

model_bic1(best_model_vars)
## [1] 422064.8
model_bic2(best_model_vars)
## [1] 422064.8

Part (e) (1 point)

Use the bench::mark function to compare model_bic1 and model_bic2. Which one is faster?

library(bench)
bench::mark(
  model_bic1 = model_bic1(best_model_vars),
  model_bic2 = model_bic2(best_model_vars)
)
## # A tibble: 2 × 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 model_bic1    103ms    152ms      6.60    23.9MB     6.60
## 2 model_bic2    103ms    150ms      6.69    11.5MB     6.69
  • According to the output, model_bic2 is faster than model_bic1

Part (f) (1 point)

Repeat profiling with faster function. Now where does the function spend most of its time? Theorize about some other ways this computation could sped up even more.

library(profvis)
profvis({
  map_dbl(model_indexes[1:10], model_bic2)
})
  • Now the function spends most of its time on lm.fit with 130ms. One way we could speed up the computation even more is by parallelizing the computation over multiple cores.

Question 2

Part (a) (1 point)

A common task in estimation is to find a parameter that maximizes a particular function. In other words, find the parameter that corresponds to where to the function has its largest value.

For example, for independent, identically distributed samples, the likelihood function is given by: \[L(\theta) = \prod_{i=1}^n f(x_i ; \theta)\] where \(f\) is the common density function and \(\theta\) is the parameter of interest.

For example, suppose we have a sample of binary observations that are either 1 or 0, and we model them as independent Bernoulli random values with a parameter \(\theta\): \[f(x; \theta) = \theta^x (1 - \theta)^{1 - x}\]

In this case, the likelihood function becomes: \[L(\theta) = \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}\]

Write a function that takes two inputs: the parameter theta and a vector x and computes the likelihood evaluated at theta. Demonstrate your function on the the following:

bernoulli_likelihood <- function(theta, x) {
  sum_x <- sum(x)
  n <- length(x)
  theta^sum_x * (1 - theta)^(n - sum_x)
}

x <- c(1, 0, 1)
theta <- 0.5

likelihood_value <- bernoulli_likelihood(theta, x)
likelihood_value
## [1] 0.125

Part (b) (2 points)

For a given sample of values, we then want to maximize the likelihood function by finding the theta argument with the highest function value.

x_2b <- c(rep(1, 20), rep(0, 30))

Now write a function that takes theta as an argument and uses the x given in the previous chunk to compute the likelihood function. Graph this function on the following inputs:

likelihood <- function(theta) {
  sum_x <- sum(x_2b)
  n <- length(x_2b)
  theta^sum_x * (1 - theta)^(n - sum_x)
}

candidate_thetas <- seq(0, 1, length.out = 250)
lvals <- sapply(candidate_thetas, likelihood)
plot(candidate_thetas, lvals, type = "l", 
     xlab = expression(theta), 
     ylab = "Likelihood", 
     main = "Likelihood Function for Bernoulli Sample")

Store your results in a vector called lvals (for “likelihood function values”); we’ll use this in a later part.

Based on the graph, where would you say the maximum occurs? * Based on the graph, I’d say the maximum occurs when theta = 0.4

Part (c) (1 point)

Using iteration, find the value in candidate_thetas that has the highest likelihood value. (A better solution would use which.max, but we will pretend that function does not exist.)

best_theta <- NA
max_val <- -Inf
for (i in seq_along(candidate_thetas)) {
  val <- lvals[i]
  if (!is.na(val) && !is.nan(val) && is.finite(val) && val > max_val) {
    max_val <- val
    best_theta <- candidate_thetas[i]
  }
}

best_theta
## [1] 0.4016064

Part (d) (3 points)

A better algorithm for finding the maximum in a set of values is called Golden Section Search. In GSS, we start by evaluating the function at three points: the two end points (\(x_1\) and \(x_3\)) of our set of values and a point in the middle (\(x_2\)).

Then, at each step of the algorithm, pick the larger of two intervals \((x_1, x_2)\) and \((x_2, x_3)\). Evaluate \(f\) at a new point \(x_4\) within that interval. Suppose we picked the left interval \((x_1, x_2)\). If \(f(x_4)\) > \(f(x_2)\), then the highest point must be in \(x_1\) and \(x_2\). If \(f(x_2) > f(x_4)\), then the highest point must be between \(x_4\) and \(x_3\). Update the end points and middle points appropriately, and continue the algorithm until the differences between the points are sufficiently small.

Consider the function \(f(x) = 10 - (x - 3)^2\) that has a maximum at \(x = 3\).

f <- function(x) { 10 - (x - 3)^2 }

curve(f(x), 0, 5)

Below is an implementation of one step of golden section search applied to \(f\).

find_midpoint <- function(a, b) {
  (a + b) / 2
}

gss_one_step <- function(fun, left_point, middle_point, right_point) {
  f_1 <- fun(left_point)
  f_2 <- fun(middle_point)
  f_3 <- fun(right_point)
  
  a <- middle_point - left_point
  b <- right_point - middle_point
  
  looking_left_side <- a > b
  
  new_point <- if (looking_left_side) {
    find_midpoint(left_point, middle_point)
  } else {
    find_midpoint(middle_point, right_point) 
  } 
  
  f_4 <- fun(new_point)
  
  if (looking_left_side) {
    if (f_4 > f_2) {
      return(list(left_point = left_point,
                  middle_point = new_point,
                  right_point = middle_point))
    } else {
      return(list(left_point = new_point,
                  middle_point = middle_point,
                  right_point = right_point))
    } 
  } else {
    ## looking in the right interval
    if (f_4 > f_2) {
      return(list(left_point = middle_point,
                  middle_point = new_point,
                  right_point = right_point))
    } else {
      return(list(left_point = left_point,
                  middle_point = middle_point,
                  right_point = new_point))
    }
  }
}


## first, set the indexes of where we are going to start
start_left  <- 0
start_right <- 5
start_mid   <- find_midpoint(start_left, start_right)

one_step <- gss_one_step(f, start_left, start_mid, start_right)
one_step
## $left_point
## [1] 0
## 
## $middle_point
## [1] 2.5
## 
## $right_point
## [1] 3.75
map_dbl(one_step, f)
##   left_point middle_point  right_point 
##       1.0000       9.7500       9.4375

Implement a while loop that performs gss_one_step until either the end points are both within 0.00001 of the middle point or more than 100 steps have been performed. Report the value of the last middle_point and the function evaluated at this point. How many steps through the algorithm were required?

f <- function(x) { 10 - (x - 3)^2 }

start_left  <- 0
start_right <- 5
start_mid   <- (start_left + start_right) / 2

left_point   <- start_left
right_point  <- start_right
middle_point <- start_mid
steps        <- 0

while (((abs(middle_point - left_point) > 1e-5) || (abs(right_point - middle_point) > 1e-5)) && steps < 100) {
  res <- gss_one_step(f, left_point, middle_point, right_point)
  left_point   <- res$left_point
  middle_point <- res$middle_point
  right_point  <- res$right_point
  steps <- steps + 1
}

result <- list(
  middle_point = middle_point,
  f_at_middle  = f(middle_point),
  steps        = steps
)
result
## $middle_point
## [1] 3.000002
## 
## $f_at_middle
## [1] 10
## 
## $steps
## [1] 31

Part (e) (2 points)

Apply Golden Section Search to your likelihood function from Part (b). Verify your results by checking against your answer for the \(\theta\) with the highest likelihood from (b). For which method was the maximum likelihood higher (it probably won’t differ much).

find_midpoint <- function(a, b) {
  (a + b) / 2
}

gss_one_step <- function(fun, left_point, middle_point, right_point) {
  f_1 <- fun(left_point)
  f_2 <- fun(middle_point)
  f_3 <- fun(right_point)
  
  a <- middle_point - left_point
  b <- right_point - middle_point
  
  looking_left_side <- a > b
  
  new_point <- if (looking_left_side) {
    find_midpoint(left_point, middle_point)
  } else {
    find_midpoint(middle_point, right_point)
  }
  
  f_4 <- fun(new_point)
  
  if (looking_left_side) {
    if (f_4 > f_2) {
      return(list(left_point = left_point,
                  middle_point = new_point,
                  right_point = middle_point))
    } else {
      return(list(left_point = new_point,
                  middle_point = middle_point,
                  right_point = right_point))
    }
  } else {
    if (f_4 > f_2) {
      return(list(left_point = middle_point,
                  middle_point = new_point,
                  right_point = right_point))
    } else {
      return(list(left_point = left_point,
                  middle_point = middle_point,
                  right_point = new_point))
    }
  }
}

start_left  <- 0.001
start_right <- 0.999
start_mid   <- (start_left + start_right) / 2

left_point   <- start_left
right_point  <- start_right
middle_point <- start_mid
steps        <- 0

while (((abs(middle_point - left_point) > 1e-5) ||
        (abs(right_point - middle_point) > 1e-5)) && steps < 100) {
  res <- gss_one_step(likelihood, left_point, middle_point, right_point)
  left_point   <- res$left_point
  middle_point <- res$middle_point
  right_point  <- res$right_point
  steps <- steps + 1
}

result_gss <- list(
  theta = middle_point,
  likelihood = likelihood(middle_point),
  steps = steps
)
result_gss
## $theta
## [1] 0.4000036
## 
## $likelihood
## [1] 2.430733e-15
## 
## $steps
## [1] 26

Question 3

Part (a) (1 point)

Using R’s debugger or the browser() function, step through one step of GSS from the example in Part (d) of the previous problem:

gss_one_step(f, start_left, start_mid, start_right)
## $left_point
## [1] 0.5
## 
## $middle_point
## [1] 0.7495
## 
## $right_point
## [1] 0.999

Answer the following questions:

  • Did the algorithm look in the left or right interval?
  • The algorithm looked in the left interval.
  • What was the value of \(f\) evaluated on the new point? (i.e., f_4)
  • The value of f evaluated on the new point is 6.9375

Part (b) (1 point)

Using the bench::mark, evaluate the speed of your solution in 2(c) and 2(e). Which was faster?

bench_results <- bench::mark(
  grid_method = {
    best_theta <- NA
    max_val <- -Inf
    for (i in seq_along(candidate_thetas)) {
      val <- lvals[i]
      if (!is.na(val) && !is.nan(val) && is.finite(val) && val > max_val) {
        max_val <- val
        best_theta <- candidate_thetas[i]
      }
    }
    best_theta
  },
  gss_method = {
    start_left  <- 0.001
    start_right <- 0.999
    start_mid   <- (start_left + start_right) / 2
    
    left_point   <- start_left
    right_point  <- start_right
    middle_point <- start_mid
    steps        <- 0
    
    while (((abs(middle_point - left_point) > 1e-5) ||
            (abs(right_point - middle_point) > 1e-5)) && steps < 100) {
      res <- gss_one_step(likelihood, left_point, middle_point, right_point)
      left_point   <- res$left_point
      middle_point <- res$middle_point
      right_point  <- res$right_point
      steps <- steps + 1
    }
    middle_point
  },
  check = FALSE
)

bench_results
## # A tibble: 2 × 6
##   expression       min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 grid_method    5.1ms    5.7ms      165.    60.9KB     11.5
## 2 gss_method    6.89ms   7.41ms      122.    80.4KB     11.3
  • According to the bench results, the Grid Search Method was faster.

Part (c) (1 point)

Use the profiler on the version that took longer. Suggest an area for improvement of this algorithm.

Rprof("gss_profile.out")

start_left  <- 0.001
start_right <- 0.999
start_mid   <- (start_left + start_right) / 2

left_point   <- start_left
right_point  <- start_right
middle_point <- start_mid
steps        <- 0

while (((abs(middle_point - left_point) > 1e-5) ||
        (abs(right_point - middle_point) > 1e-5)) && steps < 100) {
  res <- gss_one_step(likelihood, left_point, middle_point, right_point)
  left_point   <- res$left_point
  middle_point <- res$middle_point
  right_point  <- res$right_point
  steps <- steps + 1
}

Rprof(NULL)

summaryRprof("gss_profile.out")
## $by.self
## [1] self.time  self.pct   total.time total.pct 
## <0 rows> (or 0-length row.names)
## 
## $by.total
## [1] total.time total.pct  self.time  self.pct  
## <0 rows> (or 0-length row.names)
## 
## $sample.interval
## [1] 0.02
## 
## $sampling.time
## [1] 0
  • One improvement could be switching to using the log-likelihood to improve numerical stability and potentially reduce computation time.

Part (d) (2 points)

Rewrite your solution to 2(c) using a parallel computation technique. (2 points)

library(parallel)

# Define a function to compute the likelihood at each candidate theta
compute_likelihood <- function(i) {
  theta <- candidate_thetas[i]
  val <- likelihood(theta)
  list(theta = theta, val = val)
}

# Use mclapply to compute in parallel
results <- mclapply(seq_along(candidate_thetas), compute_likelihood, mc.cores = detectCores())

# Extract theta with the highest likelihood
likelihood_vals <- sapply(results, function(x) x$val)
best_index <- which.max(likelihood_vals)
best_theta_parallel <- results[[best_index]]$theta

best_theta_parallel
## [1] 0.4016064