C15 Multivariate Regression

Jorge Cornick, January 22, 2025

library(dslabs)
library(HistData)
library(tidyverse)
library(broom)
library(gridExtra)
library(Lahman)
library(ggthemes)
theme_set(theme_bw())
options(digits=2)

1 Using statistics to understand baseball performance

We will use baseball as a way to illustrate how linear regression can help us understand data. We start by loading the Lahman baseball database and constructing the variables we will be using in our analysis.

dat <- Teams |> filter(yearID %in% 1962:2002) |>
  mutate(team = teamID, year = yearID, r = R/G, 
         singles = (H - X2B - X3B - HR)/G, 
         doubles = X2B/G, triples = X3B/G, hr = HR/G,
         sb = SB/G, bb = BB/G) |>
  select(team, year, r, singles, doubles, triples, hr, sb, bb)

Note that variables are expressed on a per-game basis. The following graphs show the joint distribution of several variables (the graph is a considerably improved version of the ones that appear in the book).

p1 <- dat |> ggplot(aes(hr,r, color = r)) + geom_point(alpha = 0.6) + theme(legend.position = "none")
p2 <- dat |> ggplot(aes(sb,r, color = r)) + geom_point(alpha = 0.6) + theme(legend.position = "none")
p3 <- dat |> ggplot(aes(bb,r, color = r)) + geom_point(alpha = 0.6) + theme(legend.position = "none")
p4 <- dat |> ggplot(aes(hr,bb, color = bb)) + geom_point(alpha = 0.6) + theme(legend.position = "none")
grid.arrange(p1,p2,p3,p4, ncol = 2)

Note that hr and bb are associated with r, but also with each other, so while it is obvious that hr causes r, it is not obvious whether bb does. Perhaps the association shows up because hr is associated with both r and bb. Note that there is no well-defined association between sb and r.

While visual inspection is always useful, we can go ahead and compute correlations:

all_cors  <- dat |> 
  summarize(hr_r = cor(hr,r),
            sb_r = cor(sb,r),
            bb_r = cor(bb,r),
            hr_bb = cor(hr,bb)) 
all_cors
  hr_r sb_r bb_r hr_bb
1 0.76 0.13 0.55  0.41

2 Regression applied to baseball statistics

We first explore the association between hr and r (simple regression) and then progress to multivariate regression. We start by inspecting the relationship graphically to assess whether these two variables seem to follow a bivariate normal distribution. Note that we work with scaled (standardized) variables.

dat |> mutate(z_hr = round(scale(hr))) |>
  filter(z_hr %in% -2:3) |>
  ggplot() +  
  stat_qq(aes(sample = scale(r))) +
  facet_wrap(~z_hr) 

Here is an alternative visualization, with both variables scaled and the QQ line added.

dat |> mutate(z_hr = round(scale(hr))) |>
  filter(z_hr %in% -2:3) |>
  ggplot() +  
  geom_qq_line(aes(sample = scale(r)))+
  geom_qq(aes(sample = scale(r))) +
  facet_wrap(~z_hr) 

Since the variables r and hr seem to follow a bivariate normal distribution, we can now run a linear regression and juxtapose the scatter plot and the regression line. Note that we are extracting just the coefficients from lm, which also provides additional information. First we add the line “by hand”; in the second graph we take advantage of geom_smooth(). Note that we don’t need to pass an (x, y) aesthetic to geom_smooth() because the aesthetic was defined globally in the ggplot call when we created p1.

hr_fit  <- lm(r ~ hr, data = dat)$coef
p1 +   geom_abline(intercept = hr_fit[[1]], slope = hr_fit[[2]])

We could get the same graph using “tidy”, so that all relevant info is stored for further reference. In the book, the broom package is introduced in the next section, but I show it here for convenience.

hr_fit <- tidy(lm(r ~ hr, data =dat), conf.int = TRUE)
p1 + geom_abline(intercept = hr_fit$estimate[1], slope = hr_fit$estimate[2])

An easier way to get the same graph is

p1 + geom_smooth(method = "lm", se = FALSE)

3 The broom package

In previous chapters we have discussed (and shown) the advantages of working within the tidyverse, but for this we need to work with data frames. Alas, lm takes data frames as arguments but it does not RETURN a data frame.

The broom package allows us to extract information from lm in data frame format, so that we can summarize, filter, select rows and perform the other familiar tidyverse operations on data frames.

Here is an example:

fit <- lm(r ~ bb,data = dat)
tidy(fit,conf.int=TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    1.93     0.116       16.7 1.91e-55    1.70      2.15 
2 bb             0.739    0.0348      21.2 1.90e-83    0.671     0.807

Just a reminder: “statistic” is the estimate divided by its standard error; it indicates how many standard errors the estimate is away from zero. The “p.value” is the probability of observing an estimate at least as extreme as the one obtained if the true value were zero.
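
We can verify these definitions directly from the fit above (this check is mine, not from the book):

co <- tidy(fit)
co$estimate/co$std.error                          # reproduces the statistic column
2*pt(-abs(co$statistic), df = df.residual(fit))   # reproduces the p.value column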

4 Confounding

If we run simple regressions of runs on bb and of runs on singles, it would appear that bb “cause” more runs than singles. But both singles and bb take a runner to first base, and runners on base have a higher chance of scoring after a single than after a bb, so what gives?
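
Here is a quick look at the two simple regressions referred to above (my own check, not code from the book):

tidy(lm(r ~ bb, data = dat)) |> filter(term == "bb")            # slope for bb
tidy(lm(r ~ singles, data = dat)) |> filter(term == "singles")  # slope for singles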

Correlations can give us a hint:

dat |> summarize(cor(bb, hr), cor(singles, hr), cor(bb, singles))
  cor(bb, hr) cor(singles, hr) cor(bb, singles)
1        0.41            -0.19           -0.051

It appears that pitchers, afraid of a hr when a hitter is at bat, will sometimes prefer to give them a bb. This difficulty in discerning what is really associated with what is called confounding. In the book, stratification is used to help explain confounding, but here we jump directly into multivariate regression which, unlike simple regression, allows us to identify the impact of each variable separately.
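
For reference, here is a small sketch of that stratification idea (my own approximation, not the book’s code): hold hr roughly fixed and estimate the bb slope within each stratum. If confounding is at work, the within-stratum slopes should be closer to the multivariate estimate than to the 0.74 obtained from the simple regression.

dat |> mutate(hr_strata = round(hr, 1)) |>
  filter(hr_strata >= 0.4 & hr_strata <= 1.2) |>
  group_by(hr_strata) |>
  do(tidy(lm(r ~ bb, data = .))) |>
  filter(term == "bb") |>
  select(hr_strata, estimate, std.error)

The multivariate regression that follows performs this kind of adjustment in a single step: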

tidy(lm(r ~ bb + hr, data = dat), conf.int = TRUE)
# A tibble: 3 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)    1.74     0.0820      21.2 3.38e- 83    1.58      1.90 
2 bb             0.387    0.0269      14.4 8.41e- 43    0.334     0.440
3 hr             1.57     0.0488      32.1 1.39e-157    1.47      1.66 

Note that both coefficients are lower than those obtained using simple regression, but the change (reduction) in the coefficient on bb is much larger.
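
To see this, we can put the simple and multivariate slopes side by side (my own comparison; the simple hr slope was computed earlier but not discussed):

c(bb_simple = coef(lm(r ~ bb, data = dat))[["bb"]],
  bb_multi  = coef(lm(r ~ bb + hr, data = dat))[["bb"]],
  hr_simple = coef(lm(r ~ hr, data = dat))[["hr"]],
  hr_multi  = coef(lm(r ~ bb + hr, data = dat))[["hr"]])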

5 Building a baseball team

We will build some indicators that would be useful for assembling a team with a given budget (obviously a constrained optimization problem), but we will not go all the way. Since this book doesn’t deal with optimization, once we have all the data required to solve the problem we are simply told that the tool needed is linear programming, which lies outside the scope of the book. This is a bit anticlimactic, but the code for the LP problem is available in the book’s source code.

First, we regress runs on all the plate-appearance outcomes in our data:

fit <- dat |>  filter(year <= 2001) |> 
  lm(r ~ bb + singles + doubles + triples + hr, data = _)
tidyfit <- tidy(fit, conf.int = TRUE) |> filter(term != "(Intercept)")
tidyfit
# A tibble: 5 × 7
  term    estimate std.error statistic   p.value conf.low conf.high
  <chr>      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 bb         0.370    0.0119      31.2 1.00e-149    0.347     0.393
2 singles    0.517    0.0128      40.5 5.29e-213    0.492     0.543
3 doubles    0.775    0.0229      33.8 7.09e-168    0.730     0.820
4 triples    1.24     0.0778      15.9 4.62e- 51    1.09      1.39 
5 hr         1.44     0.0248      58.1 1.98e-323    1.39      1.49 

Note that all confidence intervals are quite narrow, suggesting the model does a very good job of estimating the real parameters. We can also see this graphically; note that the model, fit on data up to 2001, is used here to predict the 2002 results.

dat |> mutate(r_hat = predict(fit, newdata = dat)) |>
  filter(year == 2002) |>
  ggplot(aes(r, r_hat, label = team)) + 
  geom_point() +
  geom_text(nudge_x = 0.1, cex = 2) + 
  geom_abline() # with no parameters given, this is the identity line

Our calculations so far are based on team-level statistics, but our aim is to get useful player-level statistics, so we will approach the problem in two steps.

First, we will compute the average number of plate appearances per game. While it is not explained in the text, we must assume that each team plays 162 games in the season, and that’s where the number in the code below comes from:

pa_per_game <- Batting |> filter(yearID == 2002) |> 
  group_by(teamID) |>
  summarize(pa_per_game = sum(AB + BB)/162) |> 
  pull(pa_per_game) |> 
  mean()
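
As a quick sanity check of the 162-game assumption (my own check, not from the book), we can look at the actual number of games played by each team in 2002:

Teams |> filter(yearID == 2002) |> summarize(min_games = min(G), max_games = max(G))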

And now we will calculate, for each player, a plate appearance (pa) rate: the player’s number of plate appearances divided by the average. Then we will normalize the data by dividing bb, singles and so on by each player’s pa rate. We can see there is considerable variability in the predicted runs per player, given their statistics. Note that the rate is for the whole period 1997 to 2001, not per season.

players <- Batting |> 
  filter(yearID %in% 1997:2001) |> 
  group_by(playerID) |>
  mutate(pa = BB + AB) |>
  summarize(g = sum(pa)/pa_per_game,
    bb = sum(BB)/g,
    singles = sum(H - X2B - X3B - HR)/g,
    doubles = sum(X2B)/g, 
    triples = sum(X3B)/g, 
    hr = sum(HR)/g,
    avg = sum(H)/sum(AB),
    pa = sum(pa)) |>
  filter(pa >= 1000) |>
  select(-g)

players$r_hat = predict(fit, newdata = players)

Here is the distribution of predicted runs per game:

hist(players$r_hat, main = "Predicted runs per game")

Since our problem is to build a team on a budget, we need to know not just how a player with given statistics is expected to perform, but also what salaries players with such statistics are likely to command. We can get the salary data from the Salaries data frame and add it to the players data frame we just created:

players <- Salaries |> 
  filter(yearID == 2002) |>
  select(playerID, salary) |>
  right_join(players, by = "playerID")

(This step and the following ones assume that we know the structure and contents of the Lahman data set.)

Next we add each player’s defensive position. The following code deserves study by itself in terms of the techniques used to construct the data frame we are after. Some of the techniques are either new (and not explained) or perhaps they were introduced in the first volume and I have forgotten them.

At any rate, a little experimentation, which I carried out in the console, was enough to understand what each command does.

position_names <- 
  paste0("G_", c("p","c","1b","2b","3b","ss","lf","cf","rf", "dh"))
# This creates names that are similar to those in the Lahman data set, but shorter

tmp <- Appearances |> 
  filter(yearID == 2002) |> 
  group_by(playerID) |>
  summarize_at(position_names, sum) |>
  ungroup()
# Note that Appearances is part of the Lahman data set; we have not created it.  The code above counts the number of times each player played at each position.  In the next chunk of code, we assign to each player the position he played most frequently.

pos <- tmp |>
  select(all_of(position_names)) |>
  apply(X = _, 1, which.max) 
# apply looks at the values of X across rows (that is what the "1" stands for).  For each row, it returns the number of the column with the largest value.  Each column corresponds to a position (print tmp to see it).  The result will be used as an index vector to select the position that will be associated with each player in the next step.
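
A tiny console experiment (mine, not from the book) makes the behaviour of apply with which.max concrete:

m <- rbind(c(2, 7, 1), c(9, 0, 3))
apply(m, 1, which.max)  # returns 2 1: the column index of each row's maximum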

players <- tibble(playerID = tmp$playerID, POS = position_names[pos]) |>
  mutate(POS = str_to_upper(str_remove(POS, "G_"))) |>
  filter(POS != "P") |>
  right_join(players, by = "playerID") |>
  filter(!is.na(POS)  & !is.na(salary))

# The "G_" prefix in position_names is there so the names match the column names in the Appearances table (G_p, G_c, and so on); once each player's position has been identified, the prefix is removed to leave clean position labels.

Finally, we add each player’s first and last name.

players <- People |>
  select(playerID, nameFirst, nameLast, debut) |>
  mutate(debut = as.Date(debut)) |>
  right_join(players, by = "playerID")

Here is a list of the top ten players by the number of expected runs:

players |> select(nameFirst, nameLast, POS, salary, r_hat) |> arrange(desc(r_hat)) |> head(10) 
   nameFirst nameLast POS   salary r_hat
1      Barry    Bonds  LF 15000000   8.1
2      Larry   Walker  RF 12666667   8.0
3       Todd   Helton  1B  5000000   7.4
4      Manny  Ramirez  LF 15462727   7.4
5      Sammy     Sosa  RF 15000000   7.2
6       Jeff  Bagwell  1B 11000000   7.1
7       Mike   Piazza   C 10571429   7.0
8      Jason   Giambi  1B 10428571   6.9
9      Edgar Martinez  DH  7086668   6.9
10       Jim    Thome  1B  8000000   6.9

We can easily verify that players with higher metrics tend to have higher salaries:

test <- players |>
  group_by(POS) |> 
  ggplot(aes(salary,r_hat, color = POS)) +
  geom_point(size = 2) +
  geom_smooth(aes(group = 1)) +
  scale_x_log10()

test

The next task is to build the team. The problem is to maximize expected runs subject to a budget constraint. This can be solved using linear programming. The code is provided in the book’s source code, but the technique is not explained, so we will omit that section.
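
For the curious, here is a minimal sketch of the linear program, assuming the lpSolve package and ignoring the one-player-per-position constraint, so it is a simplified version and not the book’s code. The budget and roster size are hypothetical.

library(lpSolve)
budget <- 50e6                              # hypothetical budget
roster_size <- 9                            # hypothetical roster size
const_mat <- rbind(players$salary,          # total salary constraint
                   rep(1, nrow(players)))   # roster size constraint
lp_fit <- lp("max",
             objective.in = players$r_hat,
             const.mat = const_mat,
             const.dir = c("<=", "=="),
             const.rhs = c(budget, roster_size),
             all.bin = TRUE)                # each player is either picked or not
players[lp_fit$solution == 1, ] |> select(nameFirst, nameLast, POS, salary, r_hat)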

6 Exercises

We start by creating a table with singles and bb for each player for 2002

dat <- Batting |> filter(yearID == 2002) |>
  mutate(pa = AB + BB, 
         singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) |>
  filter(pa >= 100) |>
  select(playerID, singles, bb)

Next we create a table with averages for 1999-2001

avg <- Batting |> filter(yearID %in% 1999:2001) |>
  mutate(pa = AB + BB, 
         singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) |>
  filter(pa >= 100) |>
  select(playerID, singles, bb) |> 
  group_by(playerID) |> 
  summarise(av_singles = mean(singles), av_bb=mean(bb))

We use inner_join to combine both tables

dat <- inner_join(dat, avg, by = "playerID")

And now we compute correlations between periods

dat |> summarize(cor(singles, av_singles), cor(bb,av_bb))
  cor(singles, av_singles) cor(bb, av_bb)
1                     0.55           0.72

The point here is that the correlation between bb and av_bb is stronger than the correlation between singles and av_singles. This indicates that past performance is a better predictor of current performance for bb than for singles.

Let’s explore the correlations visually:

c1 <- dat |> ggplot(aes(singles,av_singles))+geom_point(alpha=0.5)
c2 <- dat |> ggplot(aes(bb,av_bb)) + geom_point(alpha=0.5)
grid.arrange(c1,c2,ncol=2)

Now we fit a linear model for each metric and use confint to compare estimates.

fit_sing <- dat |> lm(singles ~ av_singles, data = _)
fit_bb <- dat |> lm(bb ~ av_bb, data = _)
confint(fit_sing)
            2.5 % 97.5 %
(Intercept) 0.047  0.077
av_singles  0.499  0.677
confint(fit_bb)
             2.5 % 97.5 %
(Intercept) 0.0078  0.023
av_bb       0.7489  0.909

Note that we could use tidy with the option conf.int = TRUE instead. The results are much easier to read.

tidyfsing <- tidy(fit_sing, conf.int = TRUE)
tidyfsing
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   0.0621   0.00742      8.37 1.07e-15   0.0475    0.0766
2 av_singles    0.588    0.0451      13.0  1.70e-32   0.499     0.677 
tidyfbb <- tidy(fit_bb, conf.int = TRUE)
tidyfbb
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   0.0155   0.00391      3.96 9.04e- 5  0.00779    0.0232
2 av_bb         0.829    0.0408      20.3  3.22e-63  0.749      0.909 

In a previous section (C14) we computed the correlations between parents and children by sex. Here is code that computes these correlations (but I like my code in C14 Regression.qmd better).

library(HistData)
set.seed(1)
galton_heights <- GaltonFamilies |>
  group_by(family, gender) |>
  sample_n(1) |>
  ungroup()

cors <- galton_heights |> 
  pivot_longer(father:mother, names_to = "parent", values_to = "parentHeight") |>
  mutate(child = ifelse(gender == "female", "daughter", "son")) |>
  unite(pair, c("parent", "child")) |> 
  group_by(pair) |>
  summarize(cor = cor(parentHeight, childHeight))
cors
# A tibble: 4 × 2
  pair              cor
  <chr>           <dbl>
1 father_daughter 0.386
2 father_son      0.444
3 mother_daughter 0.388
4 mother_son      0.303

Correlation estimates are random variables. We may ask, therefore, whether the correlations just computed are significantly different from each other. While we could compare the correlations directly, we have not studied the relevant techniques (I got explanations from Gemini, but the code is a bit messy), so we will run regressions and compare slopes instead.
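
Just to show what a direct comparison could look like, here is a rough sketch using Fisher’s z-transformation (my own, not from the book or from Gemini). It assumes the two correlations come from independent samples, which is only approximately true here since the pairs share families, so take it as illustrative only.

# Fisher z test for comparing two correlations (assumes independent samples)
compare_cors <- function(r1, n1, r2, n2) {
  z <- (atanh(r1) - atanh(r2)) / sqrt(1/(n1 - 3) + 1/(n2 - 3))
  2 * pnorm(-abs(z))   # two-sided p-value
}
n_son <- sum(galton_heights$gender == "male")
n_dau <- sum(galton_heights$gender == "female")
compare_cors(cors$cor[cors$pair == "father_son"], n_son,
             cors$cor[cors$pair == "father_daughter"], n_dau)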

Note that galton_heights is not a tidy data frame. The column father contains information about both the parent’s sex and height, as does the column mother. We can correct this with the following code:

galton_heights_long <- galton_heights |>
  pivot_longer(father:mother, names_to = "parent", values_to = "parentHeight")
galton_heights_long[1:10,]
# A tibble: 10 × 8
   family midparentHeight children childNum gender childHeight parent
   <fct>            <dbl>    <int>    <int> <fct>        <dbl> <chr> 
 1 001               75.4        4        2 female        69.2 father
 2 001               75.4        4        2 female        69.2 mother
 3 001               75.4        4        1 male          73.2 father
 4 001               75.4        4        1 male          73.2 mother
 5 002               73.7        4        3 female        65.5 father
 6 002               73.7        4        3 female        65.5 mother
 7 002               73.7        4        1 male          73.5 father
 8 002               73.7        4        1 male          73.5 mother
 9 003               72.1        2        2 female        68   father
10 003               72.1        2        2 female        68   mother
# ℹ 1 more variable: parentHeight <dbl>

Note that we had previously transformed the data like this, to calculate correlations, but only temporarily, without saving the transformed data frame.

Now, we have one column with the parent’s sex, and another for height. To run the four linear models with one line of code (as opposed to one separate code chunk for son-father, daughter-father, and so on) we need to do a little more work. The key addition is the creation of the “pair” variable.

galton_heights_long <- galton_heights |>
  pivot_longer(father:mother, names_to = "parent", values_to = "parentHeight") |>
  mutate(child = ifelse(gender == "female", "daughter", "son")) |>
  unite(pair, c("parent", "child"))
galton_heights_long[1:10,]
# A tibble: 10 × 8
   family midparentHeight children childNum gender childHeight pair           
   <fct>            <dbl>    <int>    <int> <fct>        <dbl> <chr>          
 1 001               75.4        4        2 female        69.2 father_daughter
 2 001               75.4        4        2 female        69.2 mother_daughter
 3 001               75.4        4        1 male          73.2 father_son     
 4 001               75.4        4        1 male          73.2 mother_son     
 5 002               73.7        4        3 female        65.5 father_daughter
 6 002               73.7        4        3 female        65.5 mother_daughter
 7 002               73.7        4        1 male          73.5 father_son     
 8 002               73.7        4        1 male          73.5 mother_son     
 9 003               72.1        2        2 female        68   father_daughter
10 003               72.1        2        2 female        68   mother_daughter
# ℹ 1 more variable: parentHeight <dbl>

Note we have a number of columns we won’t be using, so we might as well drop them:

gh_compact <- galton_heights_long |> 
  select(-midparentHeight,-children, -childNum)

Now we are ready to run our regressions. Note that the result is a list column, and we will need to unbundle the information it contains.

allmodels <- gh_compact |>
  group_by(pair) |>
  do(model = tidy(lm(childHeight ~ parentHeight, data = .),conf.int = TRUE))

If we just print this object, we get very little information

allmodels 
# A tibble: 4 × 2
# Rowwise: 
  pair            model           
  <chr>           <list>          
1 father_daughter <tibble [2 × 7]>
2 father_son      <tibble [2 × 7]>
3 mother_daughter <tibble [2 × 7]>
4 mother_son      <tibble [2 × 7]>

We can extract the results for each model as follows:

fd <- bind_rows(allmodels[1,]$model)
fd
# A tibble: 2 × 7
  term         estimate std.error statistic  p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    40.2      4.34        9.26 7.76e-17   31.6      48.7  
2 parentHeight    0.345    0.0625      5.52 1.21e- 7    0.222     0.468

But here is a way of doing it all at once

coefs <- bind_rows(allmodels$model) |> 
  filter(term == "parentHeight") |> 
  select(estimate, std.error,conf.low,conf.high) |> 
  mutate(pair =c("fd", "fs", "md", "ms")) |> 
  relocate(pair,.before = estimate)
coefs
# A tibble: 4 × 5
  pair  estimate std.error conf.low conf.high
  <chr>    <dbl>     <dbl>    <dbl>     <dbl>
1 fd       0.345    0.0625    0.222     0.468
2 fs       0.426    0.0646    0.299     0.553
3 md       0.413    0.0745    0.266     0.560
4 ms       0.312    0.0739    0.166     0.458

Finally, let’s graph the confidence intervals

coefs |> 
  ggplot(aes(pair,estimate,ymin = conf.low, ymax = conf.high)) +
  geom_point(size = 3)+
  geom_errorbar()