head(lineman)
##              name weight dash40
## 1    Oday Aboushi    308   5.45
## 2      Zach Allen    332   5.43
## 3   Ezekial Ansah    271   4.63
## 4 Terron Armstead    306   4.71
## 5       Jeff Baca    302   5.03
## 6    Alvin Bailey    312   4.95
The lineman data comes from the NFL combine, which is essentially a tryout where college players showcase their skills in the hopes that an NFL team will draft them. The players report their height and weight (weight) and can then participate in different events. The most common event is the 40 yard dash (dash40). We want to see if there is an association between weight and dash40.
gg_dash40_weight <-
  ggplot(
    data = lineman,
    mapping = aes(
      x = weight,
      y = dash40
    )
  ) +
  geom_point() +
  labs(
    x = 'Weight (lbs)',
    y = '40 yard dash (seconds)'
  ) +
  theme_bw()
gg_dash40_weight +
  geom_smooth(
    color = 'red',
    method = 'loess',
    se = FALSE,
    formula = y ~ x
  )
There is an association between weight and
dash40, but it doesn’t appear to be a constant, linear
association.
What we can do is create a ‘broken-stick’ style model:
\[\hat{y} = b_0 + b_1x + b_2(x - k)^+\]
where \((x - k)^+\) is the difference between \(x\) and \(k\) if it is positive and 0 otherwise.
For example, if \(k = 5\) and \(x = 7\), then \((7 - 5)^+ = 2\), but if \(x = 3\) then \((3 - 5)^+ = 0\).
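In R, the positive part \((x - k)^+\) can be computed with the built-in pmax() function; a quick check of the two examples above:

# (x - k)^+ for x = 7 and x = 3, with k = 5
pmax(c(7, 3) - 5, 0)

## [1] 2 0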
The code chunk below shows how to fit a broken-stick model with \(k = 300\):
example_lm <-
  lineman |>
  # Adding a column for (x - k)^+
  # rowwise() makes mutate() compute max() row by row instead of over the whole column
  rowwise() |>
  mutate(
    bend = max(0, weight - 300)
  ) |>
  # Undoing rowwise()
  ungroup() |>
  # Fitting the model:
  lm(formula = dash40 ~ weight + bend)
# Displaying the results of the linear model
tidy(example_lm)
## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  2.39     0.0994       24.0  1.55e-107
## 2 weight       0.00925  0.000346     26.7  5.20e-128
## 3 bend        -0.00375  0.000637     -5.89 4.75e-  9
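One way to read this table: below the bend the estimated dash time increases by about \(b_1 \approx 0.00925\) seconds per pound, while above 300 lbs the slope becomes \(b_1 + b_2 \approx 0.0055\) seconds per pound. A quick sketch to compute the slope above the bend:

# Estimated slope above the bend point: b1 + b2
coef(example_lm)['weight'] + coef(example_lm)['bend']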
And we can plot the model that it fits:
gg_dash40_weight +
  geom_line(
    data = lineman |>
      mutate(
        # Adding the predicted time
        time_hat = example_lm$fitted.values
      ),
    mapping = aes(y = time_hat),
    color = 'red',
    linewidth = 1
  ) +
  theme_bw()
The goal of this homework is to find the best ‘bending’ point.
Write a function named bend_lm() with 3 arguments:

- y = the response variable
- x = the explanatory variable
- k = the 'bend' in the explanatory variable

It should return two separate objects:

- model: the fitted linear model from lm()
- sse: the sum of squared residuals from lm(), calculated using $residuals

bend_lm <- function(x, y, k = 0){
# Placing x & y in the same data frame to use in the lm function
func_df <-
data.frame(y = y, x = x) |>
# Calculating (x - k)+
rowwise() |>
mutate(bend = max(0, x - k)) |>
ungroup()
func_lm <-
# Fitting the model:
lm(formula = y ~ x + bend, data = func_df)
# Calculating the SSE
sse <- sum((func_lm$residuals)^2)
return(
list(
'model' = func_lm,
'sse' = sse
)
)
}
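As an aside, the rowwise()/ungroup() pair isn't strictly necessary: pmax() is vectorized, so \((x - k)^+\) can be computed in one shot. A minimal equivalent sketch (bend_lm2 is just an illustrative name, not part of the assignment):

bend_lm2 <- function(x, y, k = 0){
  # pmax() takes the element-wise max of (x - k) and 0
  func_df <- data.frame(y = y, x = x, bend = pmax(x - k, 0))
  func_lm <- lm(y ~ x + bend, data = func_df)
  list('model' = func_lm, 'sse' = sum(func_lm$residuals^2))
}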
The code chunks below will test out your function. If it works correctly, the output should match what is in Brightspace.
# Testing the function. Should return the same results as in Brightspace
tidy(
  bend_lm(
    y = iris$Petal.Length,
    x = iris$Petal.Width,
    k = 2
  )$model
)
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    0.926    0.0654     14.2  2.90e-29
## 2 x              2.46     0.0529     46.4  1.06e-89
## 3 bend          -2.74     0.361      -7.60 3.20e-12
# Testing the function. Should return the same results as in Brightspace
tidy(
  bend_lm(
    y = MASS::cats$Hwt,
    x = MASS::cats$Bwt - mean(MASS::cats$Bwt)
  )$model
)
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    10.3      0.228      45.2  9.89e-86
## 2 x               3.05     0.603       5.05 1.32e- 6
## 3 bend            1.71     0.953       1.80 7.42e- 2
Conduct a grid search for the value of \(k\) in the model:
\[\hat{y} = b_0 + b_1x + b_2(x - k)^+\]
over the values from 275 to 350 in increments of 0.01.
Save the results in a data frame named bend_df with two
columns named k and sse.
Write the grid search in the code chunk below.
I recommend searching from 275 to 350 by increments of 1 until you get the code to run. Then change it from 1 to 0.01.
bend_df <-
  data.frame(
    k = seq(275, 350, 0.01),
    # Placeholder value; it gets overwritten inside the loop
    sse = -1
  )

for (i in 1:nrow(bend_df)){
  bend_df[i, 'sse'] <-
    bend_lm(
      x = lineman$weight,
      y = lineman$dash40,
      k = bend_df$k[i]
    )$sse
}
as_tibble(bend_df)
## # A tibble: 7,501 × 2
##        k   sse
##    <dbl> <dbl>
##  1  275   45.1
##  2  275.  45.1
##  3  275.  45.1
##  4  275.  45.1
##  5  275.  45.1
##  6  275.  45.1
##  7  275.  45.1
##  8  275.  45.1
##  9  275.  45.1
## 10  275.  45.1
## # ℹ 7,491 more rows
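As an aside, the same grid search can be written without an explicit for loop; a sketch using base R's sapply():

# Equivalent grid search: apply bend_lm() to each candidate k
bend_df$sse <-
  sapply(
    bend_df$k,
    function(k) bend_lm(x = lineman$weight, y = lineman$dash40, k = k)$sse
  )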
If done correctly, the code chunk below should create the same graph seen in Brightspace.
ggplot(
  data = bend_df,
  mapping = aes(
    x = k,
    y = sse
  )
) +
  # Adding the line for the function
  geom_line(
    linewidth = 1
  ) +
  # Adding a vertical line at the k that minimizes sse
  geom_vline(
    color = 'red',
    linetype = 'dashed',
    xintercept = 322,
    linewidth = 1
  ) +
  labs(x = 'Bend Point') +
  theme_bw()
What should the chosen value of \(k\) be?
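One way to answer this directly from bend_df is to pull the row with the smallest SSE, e.g. with dplyr's slice_min(); a minimal sketch:

# The grid value of k with the smallest SSE
bend_df |>
  slice_min(sse, n = 1)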
A quicker alternative to a grid search, especially when the model has many parameters, is gradient descent. It uses the value of the derivative to find a better guess of the bend point than the current one. When tuned well, it is much quicker than a grid search.
Gradient descent works by updating the current value (\(k_0\)) into a new value (\(k_1\)). The formula to update the value of the bend point is:
\[k_1 = k_0 - \alpha \times f'(k_0)\]
where \(\alpha\) is some predetermined value (we'll use 0.0001) and \(f'(k_0)\) is the value of the derivative evaluated at \(k_0\) (the derivative with the current value plugged in).
For example, let’s try to find the minimum of \(f(x) = x^2\). Let’s say our current value of \(x\) is \(x_0 = 0.5\) and the derivative is \(f'(x) = 2x\). Using \(\alpha = 0.1\), we can find a better guess by:
\[x_1 = x_0 - \alpha \times 2x_0 = 0.5 - (0.1)(2)(0.5) = 0.4\]
which is closer to the actual minimum of 0. We'd repeat this process until the value of the objective function changes by only a very small amount. Our stopping criterion will be:
\[\left|\frac{f(x_1) - f(x_0)}{f(x_0)} \right| < c\]
where \(c\) is a number chosen beforehand. If we choose \(c = 0.001\), we'd check whether to stop by computing
\[\left|\frac{f(0.4) - f(0.5)}{f(0.5)}\right| = \left|\frac{0.4^2 - 0.5^2}{0.5^2}\right| = 0.36\]
Since \(0.36 > 0.001\), we keep going: we replace \(x_0\) with \(x_1\) and update to get our next guess, \(x_2\):
\[x_2 = 0.4 - 0.1(2)(0.4) = 0.32\]
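As a quick check, the two updates above can be reproduced in a few lines of R:

# Gradient descent on f(x) = x^2, where f'(x) = 2 * x
alpha <- 0.1
x0 <- 0.5
x1 <- x0 - alpha * 2 * x0   # 0.4
x2 <- x1 - alpha * 2 * x1   # 0.32

# Stopping check after the first update: 0.36, which is > c = 0.001
abs((x1^2 - x0^2) / x0^2)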
Create a function of the derivative: it should take a value of \(k\) and return the value of the derivative evaluated at that \(k\).
The derivative of the objective function with respect to \(k\) is:
\[f'(k) = -b_2 n_k\left[ \bar{y}_k - b_0 - (b_1 + 1) \bar{x}_k \right]\]
where:
\[n_k = \text{the number of players with weight above the specified k}\]
\[\bar{y}_k = \text{the average 40 yard dash time of players with weight above k}\]
\[\bar{x}_k = \text{the average weight of players with weight above k} = \frac{1}{n_k}\sum(x-k)^+\]
and \(b_0\), \(b_1\), \(b_2\) are the coefficients from the fitted
model. You can get these values from
bend_lm(...)$model$coef
##  (Intercept)            x         bend 
##     2.386428  0.009247701 -0.003750159
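For example, the three coefficients can be unpacked like so (a small sketch; coefs, b0, b1, and b2 are my own object names):

coefs <- bend_lm(y = lineman$dash40, x = lineman$weight, k = 300)$model$coef
b0 <- coefs[1]  # intercept
b1 <- coefs[2]  # slope on x
b2 <- coefs[3]  # coefficient on the bend term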
The function should:

- Fit the model for the specified value of \(k\)
- Calculate the derivative (specified above)
- Call the function bend_lm_der(y, x, k)

To make this a little easier:

- Calculate a vector of \((x - k)^+\) by using (x - k)*(x > k) (see the quick demo after this list)
- Fit the model and get the coefficients using bend_lm(k)$model$coef
- Calculate \(n_k\), \(\bar{y}_k = \sum(y*(x > k))/n_k\), and \(\bar{x}_k = \sum(x - k)^+ / n_k\)
- Calculate the value of the derivative. This can be easier if you break it up into two or more pieces.
- Return the derivative
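The (x - k)*(x > k) trick works because R coerces TRUE/FALSE to 1/0; a quick demo with throwaway values:

x <- c(3, 7)
k <- 5
(x - k) * (x > k)

## [1] 0 2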
bend_lm_der <- function(y, x, k){
  # Step 1: Calculating (x - k)^+
  x_k_pos <- (x - k) * (x > k)

  # Step 2: Getting the model coefficients for the specified value of k
  bend_coef <- bend_lm(y = y, x = x, k = k)$model$coef

  # Step 3: Getting n_k, x_bar_k, and y_bar_k
  nk <- sum(x > k)
  x_bar_k <- sum(x_k_pos) / nk
  y_bar_k <- sum(y * (x > k)) / nk

  ## Step 4: Calculating the derivative by breaking it into pieces
  # Part 1: -b2 * n_k
  part1 <- -1 * nk * bend_coef[3]

  # Part 2: the bracketed piece, y_bar_k - b0 - (b1 + 1) * x_bar_k
  part2 <- y_bar_k - bend_coef[1] - (bend_coef[2] + 1) * x_bar_k

  # Returning the derivative: part1 * part2
  return(part1 * part2)
}
The code chunks below will check if you created the function correctly.
# Check 1: lineman data
bend_lm_der(
  x = lineman$weight,
  y = lineman$dash40,
  k = 300
)

##      bend 
## -42.68175

# Check 2: iris data
bend_lm_der(
  x = iris$Petal.Length,
  y = iris$Petal.Width,
  k = 2
)

##     bend 
## 26.60027

# Check 3: cats data
bend_lm_der(
  x = MASS::cats$Bwt,
  y = MASS::cats$Hwt,
  k = 3
)

##     bend 
## -563.769
Write the code to perform gradient descent below. For each loop, it should:

- update the number of iterations (number of loops); call it iters. On the 5th loop, iters = 5; on the 10th loop, iters = 10; etc.
- calculate the value of the objective function with the current k, bend_lm(y = ..., x = ..., k = k)$sse (call it of_curr)
- calculate the value of the derivative using the current value of k (call it gradient)
- update the value of k using gradient descent
- calculate the value of the objective function using the new value of k (call it of_new)

You'll be using alpha = 1e-4.
Set up gradient descent in the code chunk below using the following initial values:

- k = 300
- alpha = 0.0001
- c = \(10^{-8}\)
# Objects needed to perform gradient descent
iters <- 0    # Number of iterations for gradient descent
k     <- 300  # Initial value of the bend point

# Finding the value of the objective function at the current k
of_new <- bend_lm(y = lineman$dash40,
                  x = lineman$weight,
                  k = k)$sse

of_curr <- 1    # Needed to get the loop started
alpha   <- 1e-4 # How much the value of k changes based on the derivative
c       <- 1e-8 # Relative change in the objective function needed to stop the algorithm
When writing the code below for gradient descent, use c = 1e-5 until you get it working. Then change it to c = 1e-8.
while (abs(of_new - of_curr) / of_curr > c) {
  # Step 1) Update iters
  iters <- iters + 1

  # Step 2) Calculate the value of the objective function at the current k
  of_curr <-
    bend_lm(
      y = lineman$dash40,
      x = lineman$weight,
      k = k
    )$sse

  # Step 3) Calculate the value of the derivative at the current k
  gradient <-
    bend_lm_der(
      y = lineman$dash40,
      x = lineman$weight,
      k = k
    )

  # Step 4) Update the value of k using gradient descent
  k <- k - alpha * gradient

  # Step 5) Calculate the objective function at the new k
  of_new <-
    bend_lm(
      y = lineman$dash40,
      x = lineman$weight,
      k = k
    )$sse
}
The code chunk below will check that the results match what is in Brightspace.
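(The check chunk itself isn't shown here; a minimal sketch that would produce output in this format, using the object names from above, is below. The k.bend name appears because k inherits the 'bend' name from the coefficient vector during the update step.)

# A sketch of the check (assumed; the original chunk is hidden)
round(c(k = k, sse = of_new, iterations = iters), 3)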
##  k.bend     sse iterations 
## 320.904  43.249   8876.000
While the results aren’t 100% the same between question 2 and question 3, they are very similar!