Data Science 306 Homework 9

install.packages("pillar")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

Question 1

In multiple linear regression, the conditional mean of one variable (\(Y\)) is modeled as a linear combination of other variables:

\[E(Y \mid X_1 = x_1, X_2 = x_2, \ldots) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots\]

Here we create some data that follows this specification:

set.seed(303029293)
n <- 100000
d <- tibble(x1 = c(rep(0, 1000), rep(1, n - 1000)),
            x2 = rnorm(n),
            x3 = runif(n, -1, 1)) |>
  mutate(x4 = x3^2, 
         x5 = x1 * x2 * 0.5,
         x6 = rexp(n),
         y = 2 + 0.1 * x1 + x3 + x5 + rnorm(n, sd = 2))

To fit the linear regression, we use the lm function:

(m1 <- lm(y ~ x1 + x3 + x5, data = d))

## 
## Call:
## lm(formula = y ~ x1 + x3 + x5, data = d)
## 
## Coefficients:
## (Intercept)           x1           x3           x5  
##     2.05647      0.04067      1.00456      0.99470

We got values that were pretty close to the truth (and on average, we should get the correct values as linear regression parameters are unbiased for the population parameters).

But what happens if we include some other variables that aren’t related to \(Y\)?

(m2 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = d))

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = d)
## 
## Coefficients:
## (Intercept)           x1           x2           x3           x4           x5  
##    2.050490     0.041720    -0.008373     1.004603     0.039055     1.011461  
##          x6  
##   -0.008110

The second model will fit better as measured by the \(R^2\) value (higher is better), but we haven’t really done better, we have just fit this particular sample more closely:

summary(m1)$r.squared < summary(m2)$r.squared

## [1] TRUE

There are various “information criterion” calculations that attempt to balance the fit of the model against a penalty for including more variables. In this case, smaller values are better models. Let’s try the Bayesian Information Criterion (BIC):

BIC(m1, m2)

##    df      BIC
## m1  5 422075.9
## m2  8 422105.4

We can see that m1 has a lower BIC, so adding those extra variables did not help the model.

In this problem we will do some searching over different combinations of variables to find the best possible model.

Part (a) (1 point)

Write a function model_bic1 that takes a vector of integers as inputs and returns the BIC calculated for a linear regression fit on the “x” variables with those integers as suffixes. E.g., the input c(1, 2) would fit a linear regression with the formula y ~ x1 + x2. The paste and as.formula functions will be helpful.

model_bic1 <- function(vars) { 
  predictors <- paste0("x", vars)
  
  form_str <- paste("y ~", paste(predictors, collapse = " + "))
  form_obj <- as.formula(form_str)
  
  model <- lm(form_obj, data = d)
  BIC(model)
  
  }

Verify your results match the output above for the input c(1, 3, 5).

Part (b) (1 point1)

Here are all possible models we could make from 6 variables (not even counting creating interaction terms or other functions of the variables).

model_indexes <- map(1:6, ~ data.frame(combn(6, .x)) |> as.list()) |> unlist(recursive = FALSE)
length(model_indexes)

## [1] 63

Using an appropriate map function, compute the BIC for all possible models. Which model had the lowest BIC? (This will take a while, so you might want to add cache = TRUE to your chunk header).

bics <- map_dbl(model_indexes, model_bic1)

best_index <- which.min(bics)
top_model_vars <- model_indexes [[best_index]]

top_model_vars

## [1] 3 5

Part (c) (1 point)

Using the profvis function in the profvis library to profile your model_bic1 function. Looking at the Data column and the time graph (the flame graph panel will probably be empty), find where the function spends most of its time.

library(profvis)

profvis({
  model_indexes <- map(1:6, ~ data.frame(combn(6, .x)) |> as.list()) |> unlist(recursive = FALSE)
  bics <- map_dbl(model_indexes, model_bic1)
})

You won’t be able to include the output directly, but write up your findings.

Looking through the Data column, the majority of the time within the function is spent in the lm() function, as it takes a larger dataset and fits it to a linear model.

Part (d) (1 point)

Look up the lm.fit function. Write a newmodel_bic2 function that uses the lm.fit function instead of lm.

To do so, you will need to create a “model matrix”. Here is a pre-computed model matrix with all the columns you will need for lm.fit (though you won’t need all of them for a given model).

dmf <- model.frame(y ~ ., data = d)
dmm <- model.matrix(terms(dmf), dmf)

If fit <- lm.fit(...) is the output of the lm.fit function, you will need to set the class to: class(fit) <- "lm" before calling BIC on the result. Verify your function by computing the BIC for the best model and compare it to model_bic1.

model_bic2 <- function(vars) { 
  predictors <- paste0("x", vars)
  
  cols <- which(colnames(dmm) %in% predictors)
  
  x <- dmm[, c(1, cols)]
  y <- dmf$y
  
  fit <- lm.fit(x, y)
  class(fit) <- "lm"
  
  BIC(fit)

}


model_bic1(c(3, 5))

## [1] 422064.8

model_bic2(c(3, 5))

## [1] 422064.8

Part (e) (1 point)

Use the bench::mark function to compare model_bic1 and model_bic2. Which one is faster?

library(bench)

bench::mark( 
  model_bic1(c(3,5)), 
  model_bic2(c(3,5))
  )

## # A tibble: 2 × 6
##   expression               min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 model_bic1(c(3, 5)) 110.84ms  151.6ms      6.60    23.9MB     6.60
## 2 model_bic2(c(3, 5))   6.47ms   50.1ms     19.9     11.5MB    29.9

model_bic2 was found to be significantly faster at a median of 24.2ms than model_bic1 which had a median speed of around 100ms.

Part (f) (1 point)

Repeat profiling with faster function. Now where does the function spend most of its time? Theorize about some other ways this computation could sped up even more.

library(profvis)

profvis({
  model_bic2(c(3, 5))
})

After profiling with a faster function, it spends most of its time in the lm.fit() function. One way to continue speeding up the function would be to reduce the data set size so less rows would have to be fit in.

Question 2

Part (a) (1 point)

A common task in estimation is to find a parameter that maximizes a particular function. In other words, find the parameter that corresponds to where to the function has its largest value.

For example, for independent, identically distributed samples, the likelihood function is given by: \[L(\theta) = \prod_{i=1}^n f(x_i ; \theta)\] where \(f\) is the common density function and \(\theta\) is the parameter of interest.

For example, suppose we have a sample of binary observations that are either 1 or 0, and we model them as independent Bernoulli random values with a parameter \(\theta\): \[f(x; \theta) = \theta^x (1 - \theta)^{1 - x}\]

In this case, the likelihood function becomes: \[L(\theta) = \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}\]

Write a function that takes two inputs: the parameter theta and a vector x and computes the likelihood evaluated at theta. Demonstrate your function on the the following:

berno_func <- function(theta, x) { 
  sum_x <- sum(x)
  n <- length(x)
  lihood <- theta^sum_x * (1 - theta)^(n - sum_x)
  return(lihood)
  }

x <- c(1, 0, 1)
theta <- 0.5

berno_func(theta, x)

## [1] 0.125

Part (b) (2 points)

For a given sample of values, we then want to maximize the likelihood function by finding the theta argument with the highest function value.

x_2b <- c(rep(1, 20), rep(0, 30))

Now write a function that takes theta as an argument and uses the x given in the previous chunk to compute the likelihood function. Graph this function on the following inputs:

lihood_x2b <- function(theta) { 
  sum_x <- sum(x_2b)
  n <- length(x_2b)
  theta^sum_x * (1 - theta)^(n - sum_x)
  }

candidate_thetas <- seq(0, 1, length.out = 250)

lvals <- sapply(candidate_thetas, lihood_x2b)

plot(candidate_thetas, lvals, type = "l", 
     xlab = expression(theta), 
     ylab = "Frequency", 
     main = "Likelihood of Bernoulli Function")

Store your results in a vector called lvals (for “likelihood function values”); we’ll use this in a later part.

Based on the graph, where would you say the maximum occurs?

Based on the graph I would say the maxium occurs at 0.4.

Part (c) (1 point)

Using iteration, find the value in candidate_thetas that has the highest likelihood value. (A better solution would use which.max, but we will pretend that function does not exist.)

max_value <- -Inf
max_index <- NA

for (i in seq_along(lvals)) {
  if (lvals[i] > max_value) {
    max_value <- lvals[i]
    max_index <- i
  }
}

pick_theta <- candidate_thetas[max_index]
pick_theta

## [1] 0.4016064

Part (d) (3 points)

A better algorithm for finding the maximum in a set of values is called Golden Section Search. In GSS, we start by evaluating the function at three points: the two end points (\(x_1\) and \(x_3\)) of our set of values and a point in the middle (\(x_2\)).

Then, at each step of the algorithm, pick the larger of two intervals \((x_1, x_2)\) and \((x_2, x_3)\). Evaluate \(f\) at a new point \(x_4\) within that interval. Suppose we picked the left interval \((x_1, x_2)\). If \(f(x_4)\) > \(f(x_2)\), then the highest point must be in \(x_1\) and \(x_2\). If \(f(x_2) > f(x_4)\), then the highest point must be between \(x_4\) and \(x_3\). Update the end points and middle points appropriately, and continue the algorithm until the differences between the points are sufficiently small.

Consider the function \(f(x) = 10 - (x - 3)^2\) that has a maximum at \(x = 3\).

f <- function(x) { 10 - (x - 3)^2 }

curve(f(x), 0, 5)

Below is an implementation of one step of golden section search applied to \(f\).

find_midpoint <- function(a, b) {
  (a + b) / 2
}

gss_one_step <- function(fun, left_point, middle_point, right_point) {
  f_1 <- fun(left_point)
  f_2 <- fun(middle_point)
  f_3 <- fun(right_point)
  
  a <- middle_point - left_point
  b <- right_point - middle_point
  
  looking_left_side <- a > b
  
  new_point <- if (looking_left_side) {
    find_midpoint(left_point, middle_point)
  } else {
    find_midpoint(middle_point, right_point) 
  } 
  
  f_4 <- fun(new_point)
  
  if (looking_left_side) {
    if (f_4 > f_2) {
      return(list(left_point = left_point,
                  middle_point = new_point,
                  right_point = middle_point))
    } else {
      return(list(left_point = new_point,
                  middle_point = middle_point,
                  right_point = right_point))
    } 
  } else {
    ## looking in the right interval
    if (f_4 > f_2) {
      return(list(left_point = middle_point,
                  middle_point = new_point,
                  right_point = right_point))
    } else {
      return(list(left_point = left_point,
                  middle_point = middle_point,
                  right_point = new_point))
    }
  }
}


## first, set the indexes of where we are going to start
start_left  <- 0
start_right <- 5
start_mid   <- find_midpoint(start_left, start_right)

one_step <- gss_one_step(f, start_left, start_mid, start_right)
one_step

## $left_point
## [1] 0
## 
## $middle_point
## [1] 2.5
## 
## $right_point
## [1] 3.75

map_dbl(one_step, f)

##   left_point middle_point  right_point 
##       1.0000       9.7500       9.4375

Implement a while loop that performs gss_one_step until either the end points are both within 0.00001 of the middle point or more than 100 steps have been performed. Report the value of the last middle_point and the function evaluated at this point. How many steps through the algorithm were required?

Part (e) (2 points)

Apply Golden Section Search to your likelihood function from Part (b). Verify your results by checking against your answer for the \(\theta\) with the highest likelihood from (b). For which method was the maximum likelihood higher (it probably won’t differ much).

gss_res <- optimize(lihood_x2b, interval = c(0,1), maximum = TRUE)

Question 3

Part (a) (1 point)

Using R’s debugger or the browser() function, step through one step of GSS from the example in Part (d) of the previous problem:

gss_one_step(f, start_left, start_mid, start_right)

## $left_point
## [1] 0
## 
## $middle_point
## [1] 2.5
## 
## $right_point
## [1] 3.75

Answer the following questions:

Did the algorithm look in the left or right interval?
What was the value of \(f\) evaluated on the new point? (i.e., f_4)

Part (b) (1 point)

Using the bench::mark, evaluate the speed of your solution in 2(c) and 2(e). Which was faster?

Part (c) (1 point)

Use the profiler on the version that took longer. Suggest an area for improvement of this algorithm.

Part (d) (2 points)

Rewrite your solution to 2(c) using a parallel computation technique. (2 points)

Data Science 306 Homework 9

Anish Devireddy, anishdey

Question 1

Part (a) (1 point)

Part (b) (1 point1)

Part (c) (1 point)

Part (d) (1 point)

Part (e) (1 point)

Part (f) (1 point)

Question 2

Part (a) (1 point)

Part (b) (2 points)

Part (c) (1 point)

Part (d) (3 points)

Part (e) (2 points)

Question 3

Part (a) (1 point)

Part (b) (1 point)

Part (c) (1 point)

Part (d) (2 points)