install.packages("pillar")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
In multiple linear regression, the conditional mean of one variable (\(Y\)) is modeled as a linear combination of other variables:
\[E(Y \mid X_1 = x_1, X_2 = x_2, \ldots) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots\]
Here we create some data that follows this specification:
set.seed(303029293)
n <- 100000
d <- tibble(x1 = c(rep(0, 1000), rep(1, n - 1000)),
x2 = rnorm(n),
x3 = runif(n, -1, 1)) |>
mutate(x4 = x3^2,
x5 = x1 * x2 * 0.5,
x6 = rexp(n),
y = 2 + 0.1 * x1 + x3 + x5 + rnorm(n, sd = 2))
To fit the linear regression, we use the lm
function:
(m1 <- lm(y ~ x1 + x3 + x5, data = d))
##
## Call:
## lm(formula = y ~ x1 + x3 + x5, data = d)
##
## Coefficients:
## (Intercept) x1 x3 x5
## 2.05647 0.04067 1.00456 0.99470
We got values that were pretty close to the truth (and on average, we should get the correct values as linear regression parameters are unbiased for the population parameters).
But what happens if we include some other variables that aren’t related to \(Y\)?
(m2 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = d))
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = d)
##
## Coefficients:
## (Intercept) x1 x2 x3 x4 x5
## 2.050490 0.041720 -0.008373 1.004603 0.039055 1.011461
## x6
## -0.008110
The second model will fit better as measured by the \(R^2\) value (higher is better), but we haven’t really done better, we have just fit this particular sample more closely:
summary(m1)$r.squared < summary(m2)$r.squared
## [1] TRUE
There are various “information criterion” calculations that attempt to balance the fit of the model against a penalty for including more variables. In this case, smaller values are better models. Let’s try the Bayesian Information Criterion (BIC):
BIC(m1, m2)
## df BIC
## m1 5 422075.9
## m2 8 422105.4
We can see that m1 has a lower BIC, so adding those extra variables did not help the model.
In this problem we will do some searching over different combinations of variables to find the best possible model.
Write a function model_bic1 that takes a vector of
integers as inputs and returns the BIC calculated for a linear
regression fit on the “x” variables with those integers as suffixes.
E.g., the input c(1, 2) would fit a linear regression with
the formula y ~ x1 + x2. The paste and
as.formula functions will be helpful.
model_bic1 <- function(vars) {
predictors <- paste0("x", vars)
form_str <- paste("y ~", paste(predictors, collapse = " + "))
form_obj <- as.formula(form_str)
model <- lm(form_obj, data = d)
BIC(model)
}
Verify your results match the output above for the input
c(1, 3, 5).
Here are all possible models we could make from 6 variables (not even counting creating interaction terms or other functions of the variables).
model_indexes <- map(1:6, ~ data.frame(combn(6, .x)) |> as.list()) |> unlist(recursive = FALSE)
length(model_indexes)
## [1] 63
Using an appropriate map function, compute the BIC for
all possible models. Which model had the lowest BIC? (This will take a
while, so you might want to add cache = TRUE to your chunk
header).
bics <- map_dbl(model_indexes, model_bic1)
best_index <- which.min(bics)
top_model_vars <- model_indexes [[best_index]]
top_model_vars
## [1] 3 5
Using the profvis function in the profvis
library to profile your model_bic1 function. Looking at the
Data column and the time graph (the flame graph panel will
probably be empty), find where the function spends most of its time.
library(profvis)
profvis({
model_indexes <- map(1:6, ~ data.frame(combn(6, .x)) |> as.list()) |> unlist(recursive = FALSE)
bics <- map_dbl(model_indexes, model_bic1)
})
You won’t be able to include the output directly, but write up your findings.
Looking through the Data column, the majority of the time within the function is spent in the lm() function, as it takes a larger dataset and fits it to a linear model.
Look up the lm.fit function. Write a
newmodel_bic2 function that uses the lm.fit
function instead of lm.
To do so, you will need to create a “model matrix”. Here is a
pre-computed model matrix with all the columns you will need for
lm.fit (though you won’t need all of them for a given
model).
dmf <- model.frame(y ~ ., data = d)
dmm <- model.matrix(terms(dmf), dmf)
If fit <- lm.fit(...) is the output of the
lm.fit function, you will need to set the class to:
class(fit) <- "lm" before calling BIC on the result.
Verify your function by computing the BIC for the best model and compare
it to model_bic1.
model_bic2 <- function(vars) {
predictors <- paste0("x", vars)
cols <- which(colnames(dmm) %in% predictors)
x <- dmm[, c(1, cols)]
y <- dmf$y
fit <- lm.fit(x, y)
class(fit) <- "lm"
BIC(fit)
}
model_bic1(c(3, 5))
## [1] 422064.8
model_bic2(c(3, 5))
## [1] 422064.8
Use the bench::mark function to compare
model_bic1 and model_bic2. Which one is
faster?
library(bench)
bench::mark(
model_bic1(c(3,5)),
model_bic2(c(3,5))
)
## # A tibble: 2 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 model_bic1(c(3, 5)) 110.84ms 151.6ms 6.60 23.9MB 6.60
## 2 model_bic2(c(3, 5)) 6.47ms 50.1ms 19.9 11.5MB 29.9
model_bic2 was found to be significantly faster at a median of 24.2ms than model_bic1 which had a median speed of around 100ms.
Repeat profiling with faster function. Now where does the function spend most of its time? Theorize about some other ways this computation could sped up even more.
library(profvis)
profvis({
model_bic2(c(3, 5))
})
After profiling with a faster function, it spends most of its time in the lm.fit() function. One way to continue speeding up the function would be to reduce the data set size so less rows would have to be fit in.
A common task in estimation is to find a parameter that maximizes a particular function. In other words, find the parameter that corresponds to where to the function has its largest value.
For example, for independent, identically distributed samples, the likelihood function is given by: \[L(\theta) = \prod_{i=1}^n f(x_i ; \theta)\] where \(f\) is the common density function and \(\theta\) is the parameter of interest.
For example, suppose we have a sample of binary observations that are either 1 or 0, and we model them as independent Bernoulli random values with a parameter \(\theta\): \[f(x; \theta) = \theta^x (1 - \theta)^{1 - x}\]
In this case, the likelihood function becomes: \[L(\theta) = \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}\]
Write a function that takes two inputs: the parameter
theta and a vector x and computes the
likelihood evaluated at theta. Demonstrate your function on
the the following:
berno_func <- function(theta, x) {
sum_x <- sum(x)
n <- length(x)
lihood <- theta^sum_x * (1 - theta)^(n - sum_x)
return(lihood)
}
x <- c(1, 0, 1)
theta <- 0.5
berno_func(theta, x)
## [1] 0.125
For a given sample of values, we then want to maximize the likelihood
function by finding the theta argument with the highest
function value.
x_2b <- c(rep(1, 20), rep(0, 30))
Now write a function that takes theta as an argument and
uses the x given in the previous chunk to compute the
likelihood function. Graph this function on the following inputs:
lihood_x2b <- function(theta) {
sum_x <- sum(x_2b)
n <- length(x_2b)
theta^sum_x * (1 - theta)^(n - sum_x)
}
candidate_thetas <- seq(0, 1, length.out = 250)
lvals <- sapply(candidate_thetas, lihood_x2b)
plot(candidate_thetas, lvals, type = "l",
xlab = expression(theta),
ylab = "Frequency",
main = "Likelihood of Bernoulli Function")
Store your results in a vector called lvals (for
“likelihood function values”); we’ll use this in a later part.
Based on the graph, where would you say the maximum occurs?
Based on the graph I would say the maxium occurs at 0.4.
Using iteration, find the value in candidate_thetas that
has the highest likelihood value. (A better solution would use
which.max, but we will pretend that function does not
exist.)
max_value <- -Inf
max_index <- NA
for (i in seq_along(lvals)) {
if (lvals[i] > max_value) {
max_value <- lvals[i]
max_index <- i
}
}
pick_theta <- candidate_thetas[max_index]
pick_theta
## [1] 0.4016064
A better algorithm for finding the maximum in a set of values is called Golden Section Search. In GSS, we start by evaluating the function at three points: the two end points (\(x_1\) and \(x_3\)) of our set of values and a point in the middle (\(x_2\)).
Then, at each step of the algorithm, pick the larger of two intervals \((x_1, x_2)\) and \((x_2, x_3)\). Evaluate \(f\) at a new point \(x_4\) within that interval. Suppose we picked the left interval \((x_1, x_2)\). If \(f(x_4)\) > \(f(x_2)\), then the highest point must be in \(x_1\) and \(x_2\). If \(f(x_2) > f(x_4)\), then the highest point must be between \(x_4\) and \(x_3\). Update the end points and middle points appropriately, and continue the algorithm until the differences between the points are sufficiently small.
Consider the function \(f(x) = 10 - (x - 3)^2\) that has a maximum at \(x = 3\).
f <- function(x) { 10 - (x - 3)^2 }
curve(f(x), 0, 5)
Below is an implementation of one step of golden section search applied to \(f\).
find_midpoint <- function(a, b) {
(a + b) / 2
}
gss_one_step <- function(fun, left_point, middle_point, right_point) {
f_1 <- fun(left_point)
f_2 <- fun(middle_point)
f_3 <- fun(right_point)
a <- middle_point - left_point
b <- right_point - middle_point
looking_left_side <- a > b
new_point <- if (looking_left_side) {
find_midpoint(left_point, middle_point)
} else {
find_midpoint(middle_point, right_point)
}
f_4 <- fun(new_point)
if (looking_left_side) {
if (f_4 > f_2) {
return(list(left_point = left_point,
middle_point = new_point,
right_point = middle_point))
} else {
return(list(left_point = new_point,
middle_point = middle_point,
right_point = right_point))
}
} else {
## looking in the right interval
if (f_4 > f_2) {
return(list(left_point = middle_point,
middle_point = new_point,
right_point = right_point))
} else {
return(list(left_point = left_point,
middle_point = middle_point,
right_point = new_point))
}
}
}
## first, set the indexes of where we are going to start
start_left <- 0
start_right <- 5
start_mid <- find_midpoint(start_left, start_right)
one_step <- gss_one_step(f, start_left, start_mid, start_right)
one_step
## $left_point
## [1] 0
##
## $middle_point
## [1] 2.5
##
## $right_point
## [1] 3.75
map_dbl(one_step, f)
## left_point middle_point right_point
## 1.0000 9.7500 9.4375
Implement a while loop that performs
gss_one_step until either the end points are both within
0.00001 of the middle point or more than 100 steps have been performed.
Report the value of the last middle_point and the function
evaluated at this point. How many steps through the algorithm were
required?
Apply Golden Section Search to your likelihood function from Part (b). Verify your results by checking against your answer for the \(\theta\) with the highest likelihood from (b). For which method was the maximum likelihood higher (it probably won’t differ much).
gss_res <- optimize(lihood_x2b, interval = c(0,1), maximum = TRUE)
Using R’s debugger or the browser() function, step
through one step of GSS from the example in Part (d) of the previous
problem:
gss_one_step(f, start_left, start_mid, start_right)
## $left_point
## [1] 0
##
## $middle_point
## [1] 2.5
##
## $right_point
## [1] 3.75
Answer the following questions:
f_4)Using the bench::mark, evaluate the speed of your
solution in 2(c) and 2(e). Which was faster?
Use the profiler on the version that took longer. Suggest an area for improvement of this algorithm.
Rewrite your solution to 2(c) using a parallel computation technique. (2 points)