library(dslabs)
library(HistData)
library(tidyverse)
library(broom)
library(gridExtra)
library(Lahman)
library(ggthemes)
theme_set(theme_bw())
options(digits=2)
C15 Multivariate Regression
Jorge Cornick, January 22, 2025
1 Using statistics to understand baseball performance
We will use baseball as a way to illustrate how linear regression can help us understand data. We start by loading the Lahman baseball database and constructing the variables we will be using in our analysis.
dat <- Teams |> filter(yearID %in% 1962:2002) |>
  mutate(team = teamID, year = yearID, r = R/G,
         singles = (H - X2B - X3B - HR)/G,
         doubles = X2B/G, triples = X3B/G, hr = HR/G,
         sb = SB/G, bb = BB/G) |>
  select(team, year, r, singles, doubles, triples, hr, sb, bb)
Note that variables are expressed on a per-game basis. The following graphs show the joint distribution of several pairs of variables (the graphs are a somewhat improved version of the ones that appear in the book).
p1 <- dat |> ggplot(aes(hr, r, color = r)) + geom_point(alpha = 0.6) + theme(legend.position = "none")
p2 <- dat |> ggplot(aes(sb, r, color = r)) + geom_point(alpha = 0.6) + theme(legend.position = "none")
p3 <- dat |> ggplot(aes(bb, r, color = r)) + geom_point(alpha = 0.6) + theme(legend.position = "none")
p4 <- dat |> ggplot(aes(hr, bb, color = bb)) + geom_point(alpha = 0.6) + theme(legend.position = "none")
grid.arrange(p1, p2, p3, p4, ncol = 2)
Note that hr and bb are associated with r, but also with each other, so while it is obvious that hr causes r, it is not obvious whether bb does. Perhaps the association appears because hr drives both r and bb. Note also that there is no well-defined association between sb and r.
While visual inspection is always useful, we can go ahead and compute correlations:
all_cors <- dat |>
  summarize(hr_r = cor(hr, r),
            sb_r = cor(sb, r),
            bb_r = cor(bb, r),
            hr_bb = cor(hr, bb))
all_cors
hr_r sb_r bb_r hr_bb
1 0.76 0.13 0.55 0.41
2 Regression applied to baseball statistics
We first explore the association between hr and r (simple regression) and then progress to multivariate regression. We start by inspecting the relationship graphically to assess whether these two variables seem to follow a bivariate normal distribution. Note that we work with scaled (standardized) variables.
dat |> mutate(z_hr = round(scale(hr))) |>
  filter(z_hr %in% -2:3) |>
ggplot() +
stat_qq(aes(sample = scale(r))) +
facet_wrap(~z_hr)
Here is an alternative visualization, with both variables scaled and the QQ line (geom_qq_line) added.
dat |> mutate(z_hr = round(scale(hr))) |>
  filter(z_hr %in% -2:3) |>
ggplot() +
geom_qq_line(aes(sample = scale(r)))+
geom_qq(aes(sample = scale(r))) +
facet_wrap(~z_hr)
Since the variables r and hr seem to follow a bivariate normal distribution, we can now run a linear regression and juxtapose the scatter plot and the regression line. Note that we are extracting just the coefficients from lm, which also provides additional information. First we add the line “by hand”, and in the second graph we take advantage of geom_smooth(). Note that we don’t need to pass an (x, y) aesthetic to geom_smooth() because the aesthetic was defined globally in the ggplot call when we created p1.
hr_fit <- lm(r ~ hr, data = dat)$coef
p1 + geom_abline(intercept = hr_fit[[1]], slope = hr_fit[[2]])
We could get the same graph using “tidy”, so that all relevant info is stored for further reference. In the book, the broom package is introduced in the next section, but I show it here for convenience.
hr_fit <- tidy(lm(r ~ hr, data = dat), conf.int = TRUE)
p1 + geom_abline(intercept = hr_fit$estimate[1], slope = hr_fit$estimate[2])
An easier way to get the same graph is
p1 + geom_smooth(method = "lm", se = FALSE)
3 The broom package
In previous chapters we have discussed (and shown) the advantages of working within the tidyverse, but for this we need to work with data frames. Alas, lm takes data frames as arguments, but it does not RETURN a data frame.
The broom package allows us to extract information from lm in data frame format, which lets us summarize, filter, select rows, and apply other familiar tidyverse operations to the results.
Here is an example:
fit <- lm(r ~ bb, data = dat)
tidy(fit, conf.int = TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.93 0.116 16.7 1.91e-55 1.70 2.15
2 bb 0.739 0.0348 21.2 1.90e-83 0.671 0.807
Just a reminder: “statistic” is the estimate divided by its standard error, so it indicates how many standard errors the estimate is away from zero. The “p.value” is the probability of observing an estimate at least as extreme as the one obtained if the true value were zero.
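To make this concrete, here is a small check (my own addition, not from the book) that recomputes both quantities from the fit above; the *_check column names are mine.
tidy(fit) |>
  mutate(statistic_check = estimate/std.error,
         p_check = 2*pt(-abs(statistic_check), df = glance(fit)$df.residual))
# statistic_check reproduces the "statistic" column; p_check reproduces "p.value"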
4 Confounding
If we run simple regressions of runs on bb and of runs on singles, it would appear that bb “cause” more runs than singles. But both singles and bb put a batter on first base, and runners already on base have a better chance of scoring after a single than after a bb, so what gives?
Correlations can give us a hint:
dat |> summarize(cor(bb, hr), cor(singles, hr), cor(bb, singles))
cor(bb, hr) cor(singles, hr) cor(bb, singles)
1 0.41 -0.19 -0.051
It appears that pitchers, afraid of a hr when certain hitters are at bat, will sometimes prefer to give them a bb. This difficulty in discerning which variable is really associated with the outcome is called confounding. In the book, stratification is used to help explain confounding, but here we jump directly into multivariate regression which, unlike simple regression, allows us to identify the impact of each variable separately.
tidy(lm(r ~ bb + hr, data = dat), conf.int = TRUE)
# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.74 0.0820 21.2 3.38e- 83 1.58 1.90
2 bb 0.387 0.0269 14.4 8.41e- 43 0.334 0.440
3 hr 1.57 0.0488 32.1 1.39e-157 1.47 1.66
Note that both coefficients are lower than those obtained using simple regression, but the change (reduction) in the coefficient on bb is much larger.
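For reference, here is a minimal sketch of the stratification approach mentioned above (my own code, not the book’s; the 0.1-wide strata and the 0.4 to 1.2 range are choices I made). Within each hr stratum, the slope of r on bb should land closer to the multivariate estimate than to the simple-regression one.
dat |> mutate(hr_strata = round(hr, 1)) |>
  filter(hr_strata >= 0.4 & hr_strata <= 1.2) |>
  group_by(hr_strata) |>
  summarize(slope = cor(bb, r)*sd(r)/sd(bb))   # slope of r on bb within each stratum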
5 Building a baseball team
We will build some indicators that would be useful for assembling a team on a given budget (obviously a constrained optimization problem), but we will not go all the way: since the book does not deal with optimization, once we have all the data required to solve the problem we are simply told that the right tool is linear programming, which lies outside its scope. That is kind of anticlimactic, but the code for the LP problem is available in the book’s source code.
First, we regress runs on all possible at-bat outcomes
fit <- dat |> filter(year <= 2001) |>
  lm(r ~ bb + singles + doubles + triples + hr, data = _)
tidyfit <- tidy(fit, conf.int = TRUE) |> filter(term != "(Intercept)")
tidyfit
# A tibble: 5 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bb 0.370 0.0119 31.2 1.00e-149 0.347 0.393
2 singles 0.517 0.0128 40.5 5.29e-213 0.492 0.543
3 doubles 0.775 0.0229 33.8 7.09e-168 0.730 0.820
4 triples 1.24 0.0778 15.9 4.62e- 51 1.09 1.39
5 hr 1.44 0.0248 58.1 1.98e-323 1.39 1.49
Note that all confidence intervals are quite narrow, suggesting the model does a very good job estimating the real parameters. We can also see this graphically. Note that we are using, in this case, data up to 2001 to predict 2002 results.
dat |> mutate(r_hat = predict(fit, newdata = dat)) |>
  filter(year == 2002) |>
ggplot(aes(r, r_hat, label = team)) +
geom_point() +
geom_text(nudge_x = 0.1, cex = 2) +
geom_abline() # with no parameters given, this is the identity line
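As a quick numeric companion to the plot (my addition), we can also measure how close the 2002 predictions are to the observed runs per game:
dat |> mutate(r_hat = predict(fit, newdata = dat)) |>
  filter(year == 2002) |>
  summarize(correlation = cor(r, r_hat), rmse = sqrt(mean((r - r_hat)^2)))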
Our calculations so far are based on team-level statistics, but our aim is to get useful player-level statistics, so we will approach the problem in two steps.
First, we will compute the average number of plate appearances per game. While it is not explained in the text, we must assume that each team plays 162 games in the season, and that's where the number in the code below comes from:
pa_per_game <- Batting |> filter(yearID == 2002) |>
  group_by(teamID) |>
summarize(pa_per_game = sum(AB + BB)/162) |>
pull(pa_per_game) |>
mean()
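As a quick sanity check on the 162-game assumption (my addition), we can look at the actual number of games played per team in 2002, available in the Teams table:
Teams |> filter(yearID == 2002) |>
  summarize(teams = n(), min_games = min(G), max_games = max(G))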
And now we calculate, for each player, a games-equivalent g: the player’s total number of plate appearances (pa) divided by the average pa per game computed above. We then normalize the data by dividing bb, singles, and so on by each player’s g, which puts them on a per-game scale. As we will see, there is considerable variability in the runs predicted for each player, given their statistics. Note that the rates are for the whole period 1997 to 2001, not per season.
players <- Batting |>
  filter(yearID %in% 1997:2001) |>
group_by(playerID) |>
mutate(pa = BB + AB) |>
summarize(g = sum(pa)/pa_per_game,
bb = sum(BB)/g,
singles = sum(H - X2B - X3B - HR)/g,
doubles = sum(X2B)/g,
triples = sum(X3B)/g,
hr = sum(HR)/g,
avg = sum(H)/sum(AB),
pa = sum(pa)) |>
filter(pa >= 1000) |>
select(-g)
players$r_hat = predict(fit, newdata = players)
Here is the distribution of predicted runs per game:
hist(players$r_hat, main = "Predicted runs per game")
Since our problem is to build a team on a budget, we need to know not just how a player with given statistics is expected to perform, but also what salaries players with such statistics are likely to command. We can get the salary data from the Salaries data frame and add it to the players data frame we just created:
players <- Salaries |>
  filter(yearID == 2002) |>
select(playerID, salary) |>
right_join(players, by = "playerID")
(This step and the following ones assume that we know the structure and contents of the Lahman data set.)
Next we add each player’s defensive position. The following code deserves study in its own right for the techniques used to construct the data frame we are after. Some of the techniques are either new (and not explained), or perhaps they were introduced in the first volume and I forgot them.
At any rate, a little experimentation, which I carried out in the console, was enough to understand what each command does.
position_names <- paste0("G_", c("p", "c", "1b", "2b", "3b", "ss", "lf", "cf", "rf", "dh"))
# These match the position column names in the Lahman Appearances table (G_p, G_c, and so on)
tmp <- Appearances |>
  filter(yearID == 2002) |>
group_by(playerID) |>
summarize_at(position_names, sum) |>
ungroup()
# Note that Appearances is part of the Lahman data set; we have not created it. The code above counts the number of games each player appeared at each position. In the next chunk of code, we assign to each player the position he played most frequently.
pos <- tmp |>
  select(all_of(position_names)) |>
  apply(X = _, 1, which.max)
# apply works across rows (that is what the "1" stands for): for each row, it returns the index of the column holding the largest value. Each column corresponds to a position (print tmp to see it). The result will be used as an index vector to select the position associated with each player in the next step.
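Before moving on, here is a toy illustration (not part of the analysis) of what apply(..., 1, which.max) does:
m <- matrix(c(1, 5, 2,
              7, 0, 3), nrow = 2, byrow = TRUE)
apply(m, 1, which.max)   # returns c(2, 1): the max of row 1 is in column 2, of row 2 in column 1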
players <- tibble(playerID = tmp$playerID, POS = position_names[pos]) |>
  mutate(POS = str_to_upper(str_remove(POS, "G_"))) |>
filter(POS != "P") |>
right_join(players, by = "playerID") |>
filter(!is.na(POS) & !is.na(salary))
# The G_ prefix was needed only to match the Appearances column names; that is why it can be stripped here at the end to get a cleaner position label.
Finally, we add each player’s name and last name.
players <- People |>
  select(playerID, nameFirst, nameLast, debut) |>
mutate(debut = as.Date(debut)) |>
right_join(players, by = "playerID")
Here is a list of the top ten players by the number of expected runs:
players |> select(nameFirst, nameLast, POS, salary, r_hat) |> arrange(desc(r_hat)) |> head(10)
nameFirst nameLast POS salary r_hat
1 Barry Bonds LF 15000000 8.1
2 Larry Walker RF 12666667 8.0
3 Todd Helton 1B 5000000 7.4
4 Manny Ramirez LF 15462727 7.4
5 Sammy Sosa RF 15000000 7.2
6 Jeff Bagwell 1B 11000000 7.1
7 Mike Piazza C 10571429 7.0
8 Jason Giambi 1B 10428571 6.9
9 Edgar Martinez DH 7086668 6.9
10 Jim Thome 1B 8000000 6.9
We can easily verify that players with higher metrics tend to have higher salaries:
test <- players |>
  group_by(POS) |>
ggplot(aes(salary,r_hat, color = POS)) +
geom_point(size = 2) +
geom_smooth(aes(group = 1)) +
scale_x_log10()
test
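As a numeric companion to the plot (my addition; the use of Spearman’s rank correlation here is my choice), we can confirm the positive association overall and within positions:
players |> summarize(overall = cor(salary, r_hat, method = "spearman"))
players |> group_by(POS) |>
  summarize(cor = cor(salary, r_hat, method = "spearman"), n = n())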
The next task is to build the team. The problem is to maximize expected runs subject to a budget constraint. This can be solved using linear programming. The code is provided in the book’s source code, but the technique is not explained there, so we will omit that section.
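Still, for readers who want to experiment, here is a minimal sketch of the kind of setup involved (my own code, not the book’s: the lpSolve package, the 50 million budget, and the fixed team size of 9 are assumptions, and position requirements are ignored):
library(lpSolve)
budget <- 50e6                                 # assumed budget, not from the book
constraints <- rbind(players$salary,           # total payroll must stay within budget
                     rep(1, nrow(players)))    # exactly 9 players are selected
team_lp <- lp("max", players$r_hat, constraints,
              c("<=", "=="), c(budget, 9), all.bin = TRUE)
players[team_lp$solution == 1,
        c("nameFirst", "nameLast", "POS", "salary", "r_hat")]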
6 Exercises
We start by creating a table with singles and bb for each player for 2002
dat <- Batting |> filter(yearID == 2002) |>
  mutate(pa = AB + BB,
singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) |>
filter(pa >= 100) |>
select(playerID, singles, bb)
Next we create a table with averages for 1999-2001
avg <- Batting |> filter(yearID %in% 1999:2001) |>
  mutate(pa = AB + BB,
singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) |>
filter(pa >= 100) |>
select(playerID, singles, bb) |>
group_by(playerID) |>
summarise(av_singles = mean(singles), av_bb=mean(bb))
We use inner_join to combine both tables:
dat <- inner_join(dat, avg, by = "playerID")
And now we compute correlations between periods
dat |> summarize(cor(singles, av_singles), cor(bb, av_bb))
cor(singles, av_singles) cor(bb, av_bb)
1 0.55 0.72
The point here is the stronger correlation between bb and av_bb than between singles and av_singles. This indicates that past performance is a better predictor of current performance in the case of bb than in the case of singles.
Let’s explore the correlations visually:
c1 <- dat |> ggplot(aes(singles, av_singles)) + geom_point(alpha = 0.5)
c2 <- dat |> ggplot(aes(bb, av_bb)) + geom_point(alpha = 0.5)
grid.arrange(c1, c2, ncol = 2)
Now fit a linear model for each metric and use confint to compare estimates.
fit_sing <- dat |> lm(singles ~ av_singles, data = _)
fit_bb <- dat |> lm(bb ~ av_bb, data = _)
confint(fit_sing)
2.5 % 97.5 %
(Intercept) 0.047 0.077
av_singles 0.499 0.677
confint(fit_bb)
2.5 % 97.5 %
(Intercept) 0.0078 0.023
av_bb 0.7489 0.909
Note that we could use tidy with the option conf.int = TRUE instead. The results are much easier to read.
tidyfsing <- tidy(fit_sing, conf.int = TRUE)
tidyfsing
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0621 0.00742 8.37 1.07e-15 0.0475 0.0766
2 av_singles 0.588 0.0451 13.0 1.70e-32 0.499 0.677
tidyfbb <- tidy(fit_bb, conf.int = TRUE)
tidyfbb
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0155 0.00391 3.96 9.04e- 5 0.00779 0.0232
2 av_bb 0.829 0.0408 20.3 3.22e-63 0.749 0.909
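One compact way (my addition) to put the two slope estimates side by side:
bind_rows(singles = tidyfsing[2, ], bb = tidyfbb[2, ], .id = "metric") |>
  select(metric, estimate, conf.low, conf.high)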
In a previous section (C14) we computed the correlations between parents and children by sex. Here is code that computes these correlations (but I like my code in C14 Regression.qmd better).
library(HistData)
set.seed(1)
galton_heights <- GaltonFamilies |>
  group_by(family, gender) |>
sample_n(1) |>
ungroup()
cors <- galton_heights |>
  pivot_longer(father:mother, names_to = "parent", values_to = "parentHeight") |>
mutate(child = ifelse(gender == "female", "daughter", "son")) |>
unite(pair, c("parent", "child")) |>
group_by(pair) |>
summarize(cor = cor(parentHeight, childHeight))
cors
# A tibble: 4 × 2
pair cor
<chr> <dbl>
1 father_daughter 0.386
2 father_son 0.444
3 mother_daughter 0.388
4 mother_son 0.303
Correlation estimates are random variables. We may ask, therefore, whether the correlations just computed are significantly different from each other. While we could compare the correlations directly, we have not studied the relevant techniques (I got explanations from Gemini, but the code is a bit messy), so we will run regressions and compare slopes instead.
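For completeness, here is a rough sketch of one direct-comparison technique, Fisher’s z transformation (my own addition, not from the book; it treats the two groups as independent samples, which is only approximately true here), applied to the father_daughter and father_son correlations:
n_d <- sum(galton_heights$gender == "female")   # number of daughters sampled
n_s <- sum(galton_heights$gender == "male")     # number of sons sampled
z_fd <- atanh(cors$cor[cors$pair == "father_daughter"])
z_fs <- atanh(cors$cor[cors$pair == "father_son"])
z_stat <- (z_fd - z_fs)/sqrt(1/(n_d - 3) + 1/(n_s - 3))
2*pnorm(-abs(z_stat))   # approximate two-sided p-value for equal correlations
Now back to the regression-based comparison.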
Note that galton_heights is not a tidy data frame. The column father contains information about the parent's sex and height, as does the column mother. We can correct this with the following code:
galton_heights_long <- galton_heights |>
  pivot_longer(father:mother, names_to = "parent", values_to = "parentHeight")
galton_heights_long[1:10, ]
# A tibble: 10 × 8
family midparentHeight children childNum gender childHeight parent
<fct> <dbl> <int> <int> <fct> <dbl> <chr>
1 001 75.4 4 2 female 69.2 father
2 001 75.4 4 2 female 69.2 mother
3 001 75.4 4 1 male 73.2 father
4 001 75.4 4 1 male 73.2 mother
5 002 73.7 4 3 female 65.5 father
6 002 73.7 4 3 female 65.5 mother
7 002 73.7 4 1 male 73.5 father
8 002 73.7 4 1 male 73.5 mother
9 003 72.1 2 2 female 68 father
10 003 72.1 2 2 female 68 mother
# ℹ 1 more variable: parentHeight <dbl>
Note that we had previously transformed the data like this, to calculate correlations, but only temporarily, without saving the transformed data frame.
Now, we have one column with the parent’s sex, and another for height. To run the four linear models with one line of code (as opposed to one separate code chunk for son-father, daughter-father, and so on) we need to do a little more work. The key addition is the creation of the “pair” variable.
galton_heights_long <- galton_heights |>
  pivot_longer(father:mother, names_to = "parent", values_to = "parentHeight") |>
mutate(child = ifelse(gender == "female", "daughter", "son")) |>
unite(pair, c("parent", "child"))
galton_heights_long[1:10, ]
# A tibble: 10 × 8
family midparentHeight children childNum gender childHeight pair
<fct> <dbl> <int> <int> <fct> <dbl> <chr>
1 001 75.4 4 2 female 69.2 father_daughter
2 001 75.4 4 2 female 69.2 mother_daughter
3 001 75.4 4 1 male 73.2 father_son
4 001 75.4 4 1 male 73.2 mother_son
5 002 73.7 4 3 female 65.5 father_daughter
6 002 73.7 4 3 female 65.5 mother_daughter
7 002 73.7 4 1 male 73.5 father_son
8 002 73.7 4 1 male 73.5 mother_son
9 003 72.1 2 2 female 68 father_daughter
10 003 72.1 2 2 female 68 mother_daughter
# ℹ 1 more variable: parentHeight <dbl>
Note we have a number of columns we won’t be using, so we might as well drop them:
gh_compact <- galton_heights_long |>
  select(-midparentHeight, -children, -childNum)
Now we are ready to run our regressions. Note that the result is a data frame with a list column, and we will need to unbundle the information it contains.
allmodels <- gh_compact |>
  group_by(pair) |>
do(model = tidy(lm(childHeight ~ parentHeight, data = .),conf.int = TRUE))
If we just print this object, we get very little information
allmodels
# A tibble: 4 × 2
# Rowwise:
pair model
<chr> <list>
1 father_daughter <tibble [2 × 7]>
2 father_son <tibble [2 × 7]>
3 mother_daughter <tibble [2 × 7]>
4 mother_son <tibble [2 × 7]>
We can extract the results for each model as follows:
fd <- bind_rows(allmodels[1, ]$model)
fd
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 40.2 4.34 9.26 7.76e-17 31.6 48.7
2 parentHeight 0.345 0.0625 5.52 1.21e- 7 0.222 0.468
But here is a way of doing it all at once
coefs <- bind_rows(allmodels$model) |>
  filter(term == "parentHeight") |>
select(estimate, std.error,conf.low,conf.high) |>
mutate(pair =c("fd", "fs", "md", "ms")) |>
relocate(pair,.before = estimate)
coefs
# A tibble: 4 × 5
pair estimate std.error conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl>
1 fd 0.345 0.0625 0.222 0.468
2 fs 0.426 0.0646 0.299 0.553
3 md 0.413 0.0745 0.266 0.560
4 ms 0.312 0.0739 0.166 0.458
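An equivalent route (my addition) is to unnest the list column, which keeps the pair labels without having to type them by hand:
allmodels |> ungroup() |>
  unnest(model) |>
  filter(term == "parentHeight") |>
  select(pair, estimate, std.error, conf.low, conf.high)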
Finally, let’s graph the confidence intervals
coefs |>
  ggplot(aes(pair, estimate, ymin = conf.low, ymax = conf.high)) +
geom_point(size = 3)+
geom_errorbar()
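As a last check (my addition), we can fit a single model with an interaction between parentHeight and pair; each interaction term tests whether that pair’s slope differs from the father_daughter baseline.
lm(childHeight ~ parentHeight*pair, data = gh_compact) |>
  tidy(conf.int = TRUE) |>
  filter(str_detect(term, ":"))   # keep only the interaction (slope-difference) terms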