In multiple linear regression, the conditional mean of one variable (\(Y\)) is modeled as a linear combination of other variables:
\[E(Y \mid X_1 = x_1, X_2 = x_2, \ldots) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots\]
Here we create some data that follows this specification:
set.seed(303029293)
n <- 100000
d <- tibble(x1 = c(rep(0, 1000), rep(1, n - 1000)),
x2 = rnorm(n),
x3 = runif(n, -1, 1)) |>
mutate(x4 = x3^2,
x5 = x1 * x2 * 0.5,
x6 = rexp(n),
y = 2 + 0.1 * x1 + x3 + x5 + rnorm(n, sd = 2))
To fit the linear regression, we use the lm
function:
(m1 <- lm(y ~ x1 + x3 + x5, data = d))
##
## Call:
## lm(formula = y ~ x1 + x3 + x5, data = d)
##
## Coefficients:
## (Intercept) x1 x3 x5
## 2.05647 0.04067 1.00456 0.99470
We got values that were pretty close to the truth (and on average, we should get the correct values as linear regression parameters are unbiased for the population parameters).
But what happens if we include some other variables that aren’t related to \(Y\)?
(m2 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = d))
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = d)
##
## Coefficients:
## (Intercept) x1 x2 x3 x4 x5
## 2.050490 0.041720 -0.008373 1.004603 0.039055 1.011461
## x6
## -0.008110
The second model will fit better as measured by the \(R^2\) value (higher is better), but we haven’t really done better, we have just fit this particular sample more closely:
summary(m1)$r.squared < summary(m2)$r.squared
## [1] TRUE
There are various “information criterion” calculations that attempt to balance the fit of the model against a penalty for including more variables. In this case, smaller values are better models. Let’s try the Bayesian Information Criterion (BIC):
BIC(m1, m2)
## df BIC
## m1 5 422075.9
## m2 8 422105.4
We can see that m1 has a lower BIC, so adding those extra variables did not help the model.
In this problem we will do some searching over different combinations of variables to find the best possible model.
Write a function model_bic1 that takes a vector of
integers as inputs and returns the BIC calculated for a linear
regression fit on the “x” variables with those integers as suffixes.
E.g., the input c(1, 2) would fit a linear regression with
the formula y ~ x1 + x2. The paste and
as.formula functions will be helpful.
Verify your results match the output above for the input
c(1, 3, 5).
model_bic1 <- function(vars) {
formula_str <- paste("y ~", paste0("x", vars, collapse = " + "))
model <- lm(as.formula(formula_str), data = d)
BIC(model)
}
model_bic1(c(1, 3, 5))
## [1] 422075.9
Here are all possible models we could make from 6 variables (not even counting creating interaction terms or other functions of the variables).
model_indexes <- map(1:6, ~ data.frame(combn(6, .x)) |> as.list()) |> unlist(recursive = FALSE)
length(model_indexes)
## [1] 63
Using an appropriate map function, compute the BIC for
all possible models. Which model had the lowest BIC? (This will take a
while, so you might want to add cache = TRUE to your chunk
header).
model_indexes <- map(1:6, ~ combn(6, .x, simplify = FALSE)) %>% flatten()
bic_values <- map_dbl(model_indexes, model_bic1)
best_index <- which.min(bic_values)
best_model_vars <- model_indexes[[best_index]]
best_model_vars
## [1] 3 5
Using the profvis function in the profvis
library to profile your model_bic1 function. Looking at the
Data column and the time graph (the flame graph panel will
probably be empty), find where the function spends most of its time.
library(profvis)
profvis({
map_dbl(model_indexes[1:10], model_bic1)
})
You won’t be able to include the output directly, but write up your findings. * Most of the runtime is spent on model.frame.default and lm.fit at 150ms each.
Look up the lm.fit function. Write a
newmodel_bic2 function that uses the lm.fit
function instead of lm.
model_bic2 <- function(vars) {
design <- dmm[, c("(Intercept)", paste0("x", vars)), drop = FALSE]
fit <- lm.fit(x = design, y = dmf$y)
class(fit) <- "lm"
BIC(fit)
}
To do so, you will need to create a “model matrix”. Here is a
pre-computed model matrix with all the columns you will need for
lm.fit (though you won’t need all of them for a given
model).
dmf <- model.frame(y ~ ., data = d)
dmm <- model.matrix(terms(dmf), dmf)
If fit <- lm.fit(...) is the output of the
lm.fit function, you will need to set the class to:
class(fit) <- "lm" before calling BIC on the result.
Verify your function by computing the BIC for the best model and compare
it to model_bic1.
model_bic1(best_model_vars)
## [1] 422064.8
model_bic2(best_model_vars)
## [1] 422064.8
Use the bench::mark function to compare
model_bic1 and model_bic2. Which one is
faster?
library(bench)
bench::mark(
model_bic1 = model_bic1(best_model_vars),
model_bic2 = model_bic2(best_model_vars)
)
## # A tibble: 2 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 model_bic1 103ms 152ms 6.60 23.9MB 6.60
## 2 model_bic2 103ms 150ms 6.69 11.5MB 6.69
Repeat profiling with faster function. Now where does the function spend most of its time? Theorize about some other ways this computation could sped up even more.
library(profvis)
profvis({
map_dbl(model_indexes[1:10], model_bic2)
})
A common task in estimation is to find a parameter that maximizes a particular function. In other words, find the parameter that corresponds to where to the function has its largest value.
For example, for independent, identically distributed samples, the likelihood function is given by: \[L(\theta) = \prod_{i=1}^n f(x_i ; \theta)\] where \(f\) is the common density function and \(\theta\) is the parameter of interest.
For example, suppose we have a sample of binary observations that are either 1 or 0, and we model them as independent Bernoulli random values with a parameter \(\theta\): \[f(x; \theta) = \theta^x (1 - \theta)^{1 - x}\]
In this case, the likelihood function becomes: \[L(\theta) = \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}\]
Write a function that takes two inputs: the parameter
theta and a vector x and computes the
likelihood evaluated at theta. Demonstrate your function on
the the following:
bernoulli_likelihood <- function(theta, x) {
sum_x <- sum(x)
n <- length(x)
theta^sum_x * (1 - theta)^(n - sum_x)
}
x <- c(1, 0, 1)
theta <- 0.5
likelihood_value <- bernoulli_likelihood(theta, x)
likelihood_value
## [1] 0.125
For a given sample of values, we then want to maximize the likelihood
function by finding the theta argument with the highest
function value.
x_2b <- c(rep(1, 20), rep(0, 30))
Now write a function that takes theta as an argument and
uses the x given in the previous chunk to compute the
likelihood function. Graph this function on the following inputs:
likelihood <- function(theta) {
sum_x <- sum(x_2b)
n <- length(x_2b)
theta^sum_x * (1 - theta)^(n - sum_x)
}
candidate_thetas <- seq(0, 1, length.out = 250)
lvals <- sapply(candidate_thetas, likelihood)
plot(candidate_thetas, lvals, type = "l",
xlab = expression(theta),
ylab = "Likelihood",
main = "Likelihood Function for Bernoulli Sample")
Store your results in a vector called lvals (for
“likelihood function values”); we’ll use this in a later part.
Based on the graph, where would you say the maximum occurs? * Based on the graph, I’d say the maximum occurs when theta = 0.4
Using iteration, find the value in candidate_thetas that
has the highest likelihood value. (A better solution would use
which.max, but we will pretend that function does not
exist.)
best_theta <- NA
max_val <- -Inf
for (i in seq_along(candidate_thetas)) {
val <- lvals[i]
if (!is.na(val) && !is.nan(val) && is.finite(val) && val > max_val) {
max_val <- val
best_theta <- candidate_thetas[i]
}
}
best_theta
## [1] 0.4016064
A better algorithm for finding the maximum in a set of values is called Golden Section Search. In GSS, we start by evaluating the function at three points: the two end points (\(x_1\) and \(x_3\)) of our set of values and a point in the middle (\(x_2\)).
Then, at each step of the algorithm, pick the larger of two intervals \((x_1, x_2)\) and \((x_2, x_3)\). Evaluate \(f\) at a new point \(x_4\) within that interval. Suppose we picked the left interval \((x_1, x_2)\). If \(f(x_4)\) > \(f(x_2)\), then the highest point must be in \(x_1\) and \(x_2\). If \(f(x_2) > f(x_4)\), then the highest point must be between \(x_4\) and \(x_3\). Update the end points and middle points appropriately, and continue the algorithm until the differences between the points are sufficiently small.
Consider the function \(f(x) = 10 - (x - 3)^2\) that has a maximum at \(x = 3\).
f <- function(x) { 10 - (x - 3)^2 }
curve(f(x), 0, 5)
Below is an implementation of one step of golden section search applied to \(f\).
find_midpoint <- function(a, b) {
(a + b) / 2
}
gss_one_step <- function(fun, left_point, middle_point, right_point) {
f_1 <- fun(left_point)
f_2 <- fun(middle_point)
f_3 <- fun(right_point)
a <- middle_point - left_point
b <- right_point - middle_point
looking_left_side <- a > b
new_point <- if (looking_left_side) {
find_midpoint(left_point, middle_point)
} else {
find_midpoint(middle_point, right_point)
}
f_4 <- fun(new_point)
if (looking_left_side) {
if (f_4 > f_2) {
return(list(left_point = left_point,
middle_point = new_point,
right_point = middle_point))
} else {
return(list(left_point = new_point,
middle_point = middle_point,
right_point = right_point))
}
} else {
## looking in the right interval
if (f_4 > f_2) {
return(list(left_point = middle_point,
middle_point = new_point,
right_point = right_point))
} else {
return(list(left_point = left_point,
middle_point = middle_point,
right_point = new_point))
}
}
}
## first, set the indexes of where we are going to start
start_left <- 0
start_right <- 5
start_mid <- find_midpoint(start_left, start_right)
one_step <- gss_one_step(f, start_left, start_mid, start_right)
one_step
## $left_point
## [1] 0
##
## $middle_point
## [1] 2.5
##
## $right_point
## [1] 3.75
map_dbl(one_step, f)
## left_point middle_point right_point
## 1.0000 9.7500 9.4375
Implement a while loop that performs
gss_one_step until either the end points are both within
0.00001 of the middle point or more than 100 steps have been performed.
Report the value of the last middle_point and the function
evaluated at this point. How many steps through the algorithm were
required?
f <- function(x) { 10 - (x - 3)^2 }
start_left <- 0
start_right <- 5
start_mid <- (start_left + start_right) / 2
left_point <- start_left
right_point <- start_right
middle_point <- start_mid
steps <- 0
while (((abs(middle_point - left_point) > 1e-5) || (abs(right_point - middle_point) > 1e-5)) && steps < 100) {
res <- gss_one_step(f, left_point, middle_point, right_point)
left_point <- res$left_point
middle_point <- res$middle_point
right_point <- res$right_point
steps <- steps + 1
}
result <- list(
middle_point = middle_point,
f_at_middle = f(middle_point),
steps = steps
)
result
## $middle_point
## [1] 3.000002
##
## $f_at_middle
## [1] 10
##
## $steps
## [1] 31
Apply Golden Section Search to your likelihood function from Part (b). Verify your results by checking against your answer for the \(\theta\) with the highest likelihood from (b). For which method was the maximum likelihood higher (it probably won’t differ much).
find_midpoint <- function(a, b) {
(a + b) / 2
}
gss_one_step <- function(fun, left_point, middle_point, right_point) {
f_1 <- fun(left_point)
f_2 <- fun(middle_point)
f_3 <- fun(right_point)
a <- middle_point - left_point
b <- right_point - middle_point
looking_left_side <- a > b
new_point <- if (looking_left_side) {
find_midpoint(left_point, middle_point)
} else {
find_midpoint(middle_point, right_point)
}
f_4 <- fun(new_point)
if (looking_left_side) {
if (f_4 > f_2) {
return(list(left_point = left_point,
middle_point = new_point,
right_point = middle_point))
} else {
return(list(left_point = new_point,
middle_point = middle_point,
right_point = right_point))
}
} else {
if (f_4 > f_2) {
return(list(left_point = middle_point,
middle_point = new_point,
right_point = right_point))
} else {
return(list(left_point = left_point,
middle_point = middle_point,
right_point = new_point))
}
}
}
start_left <- 0.001
start_right <- 0.999
start_mid <- (start_left + start_right) / 2
left_point <- start_left
right_point <- start_right
middle_point <- start_mid
steps <- 0
while (((abs(middle_point - left_point) > 1e-5) ||
(abs(right_point - middle_point) > 1e-5)) && steps < 100) {
res <- gss_one_step(likelihood, left_point, middle_point, right_point)
left_point <- res$left_point
middle_point <- res$middle_point
right_point <- res$right_point
steps <- steps + 1
}
result_gss <- list(
theta = middle_point,
likelihood = likelihood(middle_point),
steps = steps
)
result_gss
## $theta
## [1] 0.4000036
##
## $likelihood
## [1] 2.430733e-15
##
## $steps
## [1] 26
Using R’s debugger or the browser() function, step
through one step of GSS from the example in Part (d) of the previous
problem:
gss_one_step(f, start_left, start_mid, start_right)
## $left_point
## [1] 0.5
##
## $middle_point
## [1] 0.7495
##
## $right_point
## [1] 0.999
Answer the following questions:
f_4)Using the bench::mark, evaluate the speed of your
solution in 2(c) and 2(e). Which was faster?
bench_results <- bench::mark(
grid_method = {
best_theta <- NA
max_val <- -Inf
for (i in seq_along(candidate_thetas)) {
val <- lvals[i]
if (!is.na(val) && !is.nan(val) && is.finite(val) && val > max_val) {
max_val <- val
best_theta <- candidate_thetas[i]
}
}
best_theta
},
gss_method = {
start_left <- 0.001
start_right <- 0.999
start_mid <- (start_left + start_right) / 2
left_point <- start_left
right_point <- start_right
middle_point <- start_mid
steps <- 0
while (((abs(middle_point - left_point) > 1e-5) ||
(abs(right_point - middle_point) > 1e-5)) && steps < 100) {
res <- gss_one_step(likelihood, left_point, middle_point, right_point)
left_point <- res$left_point
middle_point <- res$middle_point
right_point <- res$right_point
steps <- steps + 1
}
middle_point
},
check = FALSE
)
bench_results
## # A tibble: 2 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 grid_method 5.1ms 5.7ms 165. 60.9KB 11.5
## 2 gss_method 6.89ms 7.41ms 122. 80.4KB 11.3
Use the profiler on the version that took longer. Suggest an area for improvement of this algorithm.
Rprof("gss_profile.out")
start_left <- 0.001
start_right <- 0.999
start_mid <- (start_left + start_right) / 2
left_point <- start_left
right_point <- start_right
middle_point <- start_mid
steps <- 0
while (((abs(middle_point - left_point) > 1e-5) ||
(abs(right_point - middle_point) > 1e-5)) && steps < 100) {
res <- gss_one_step(likelihood, left_point, middle_point, right_point)
left_point <- res$left_point
middle_point <- res$middle_point
right_point <- res$right_point
steps <- steps + 1
}
Rprof(NULL)
summaryRprof("gss_profile.out")
## $by.self
## [1] self.time self.pct total.time total.pct
## <0 rows> (or 0-length row.names)
##
## $by.total
## [1] total.time total.pct self.time self.pct
## <0 rows> (or 0-length row.names)
##
## $sample.interval
## [1] 0.02
##
## $sampling.time
## [1] 0
Rewrite your solution to 2(c) using a parallel computation technique. (2 points)
library(parallel)
# Define a function to compute the likelihood at each candidate theta
compute_likelihood <- function(i) {
theta <- candidate_thetas[i]
val <- likelihood(theta)
list(theta = theta, val = val)
}
# Use mclapply to compute in parallel
results <- mclapply(seq_along(candidate_thetas), compute_likelihood, mc.cores = detectCores())
# Extract theta with the highest likelihood
likelihood_vals <- sapply(results, function(x) x$val)
best_index <- which.max(likelihood_vals)
best_theta_parallel <- results[[best_index]]$theta
best_theta_parallel
## [1] 0.4016064