Homework 2

Setup

Replace “Your name” in the YAML header with your name.

This code chunk allows R Markdown to knit your output file even if there are coding errors.

knitr::opts_chunk$set(error = TRUE, fig.width = 6, fig.height = 4)

The packages you will need are tidyverse and broom, plus any others you add.

library(tidyverse)
library(broom)

Dataset

We begin by reading the data set named “NFL combine 1987-2021.csv” into the R object nfl.

# read in data file
nfl <- read_csv("NFL combine 1987-2021.csv")

The dataset covers the “Combine” for the National Football League (NFL) from 1987 to 2021 (American football, that is). The Combine allows prospects for the NFL draft to be evaluated on several aspects of their athletic ability on a level playing field. This way, NFL teams have objective measurements of the prospects’ abilities to inform their draft decisions.

Variable descriptions:

There are 13,215 observations on the following 14 variables:

ID: observation identifier
Year: year of NFL combine
Name: player name
College: college of player
Position: position player is tested at for the combine:
- Offensive positions: C = Center, FB = Fullback, OG = Offensive Guard, OL = Offensive Line, OT = Offensive Tackle, QB = Quarterback, RB = Running Back, TE = Tight End, WR = Wide Receiver
- Defensive positions: CB = Cornerback, DB = Defensive Back, DE = Defensive End, DL = Defensive Line, DT = Defensive Tackle, EDG = Edge Rusher, FS = Free Safety, ILB = Inside Linebacker, LB = Linebacker, NT = Nose Tackle, OLB = Outside Linebacker, S = Safety, SS = Strong Safety
- Special teams: K = Kicker, LS = Long Snapper, P = Punter
Height: height in inches
Weight: weight in pounds
Wonderlic: score on the Wonderlic test of intelligence
Yard40: time in seconds to run 40 yard dash
BenchPress: number of repetitions of 225 pounds
VertLeap: vertical leap in inches from standing
BroadJump: Broad jump in inches from standing
Shuttle: time in seconds to complete the shuttle drill
Cones: time in seconds to complete the 3-cone drill

Data processing and description

Missing data

There are many missing values in the dataset, so we will limit the data to players for whom we have full information on the following variables. These players can be considered “full participants” of the Combine. Be sure to use this dataset, nfl2, for the rest of the homework.

nfl2 <- nfl |> 
  filter(
    !is.na(Yard40),
    !is.na(Height), 
    !is.na(Weight), 
    !is.na(BenchPress),
    !is.na(VertLeap),
    !is.na(BroadJump),
    !is.na(Shuttle), 
    !is.na(Cones), 
    !is.na(Position), 
    !is.na(College)
  )
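As a minimal sketch of what this filter does, the example below applies the same idea to a toy data frame. The values and the shortened column list are hypothetical and are not taken from the NFL file. It uses base R's complete.cases(), which should keep the same rows as chaining !is.na() conditions inside filter().

```r
# Toy data frame with hypothetical values (not from the NFL file).
toy <- data.frame(
  Name   = c("A", "B", "C", "D"),
  Yard40 = c(4.5, NA, 4.8, 5.1),
  Height = c(72, 74, NA, 77),
  Weight = c(200, 230, 250, 310)
)

# Columns that must be observed for a row to be kept.
required <- c("Yard40", "Height", "Weight")

# complete.cases() is TRUE only for rows with no NA in the chosen columns.
toy_full <- toy[complete.cases(toy[, required]), ]

nrow(toy)       # rows before filtering
nrow(toy_full)  # rows with full information on the required columns
```

Players B and C each have one missing value, so only A and D remain in toy_full.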

Positions

There are so many categories for the Position variable that we will combine some into larger categories, stored as the variable Position2.

nfl2 <- nfl2 |>
  mutate(
    Position2 = case_when(
      Position %in% c("C", "OG", "OL", "OT") ~ "OLine", 
      Position %in% c("CB", "DB", "FS", "S", "SS") ~ "Secondary", 
      Position %in% c("DE", "DL", "DT", "EDG", "NT") ~ "DLine", 
      Position %in% c("ILB", "LB", "OLB") ~ "LB", 
      Position %in% c("FB", "RB") ~ "RB", 
      TRUE ~ Position
    )
  )

nfl2 |>
  count(Position2)

## # A tibble: 10 × 2
##    Position2     n
##    <chr>     <int>
##  1 DLine      1072
##  2 LB          796
##  3 LS           10
##  4 OLine      1162
##  5 P             1
##  6 QB           22
##  7 RB          682
##  8 Secondary  1192
##  9 TE          396
## 10 WR          592

We will remove the positions LS (long-snapper), P (punter), and QB (quarterback) due to their low counts. Also, LS and P are special teams positions that may not represent “typical” football players.

nfl2 <- nfl2 |> 
  filter(
    Position2 != "QB", 
    Position2 != "LS",
    Position2 != "P",
  )

count(nfl2, Position2)

## # A tibble: 7 × 2
##   Position2     n
##   <chr>     <int>
## 1 DLine      1072
## 2 LB          796
## 3 OLine      1162
## 4 RB          682
## 5 Secondary  1192
## 6 TE          396
## 7 WR          592

An overview of the seven position groups given above is as follows:

OLine (Offensive Line), RB (Running Back), TE (Tight End), and WR (Wide Receiver) are offensive positions.
DLine (Defensive Line), LB (Linebacker), and Secondary are defensive positions.

A typical play in football goes like this: after the offensive line snaps the ball to the quarterback, the defensive line tries to crash through the offensive line to tackle the quarterback. The offensive line blocks the defensive line. The quarterback then decides to either hand off the ball to the running back or pass it to one of the wide receivers. The linebackers, who start behind the defensive line, try to tackle the running back. The secondary guard the wide receivers and try to tip the passes away. The tight end can be either a blocker or a pass receiver, depending on the play.

With all these different “jobs”, it makes sense that different sizes of players play each position.

ggplot(
  data = nfl2, 
  mapping = aes(x = Height, y = Weight, color = Position2)
) + 
  geom_point(alpha=.5) + 
  labs(x = "Height (in)", y = "Weight (lbs)")

As shown in the graph above, the OLine and DLine are the largest players because they are crashing into each other at the line of scrimmage. The tight ends are less heavy than the OLine because they also go out for passes in addition to blocking. The Wide Receivers and the Secondary weigh the least because they are sprinting around going after passes. For more information, please see this Wikipedia article on American football positions.

Description

Yard40, the time in seconds to run a 40 yard dash, is the most quoted statistic to emerge from the Combine, in my opinion. We will treat it as the response variable and examine how it is related to Height, Weight, and Position2.

Part 1: Practice checking regression assumptions

Height as explanatory variable

[2] Make a scatterplot with Yard40 on the y-axis and Height on the x-axis. Within geom_point(), use a small value of alpha to ease some of the overplotting.

# code chunk
ggplot(nfl2, aes(x = Height, y = Yard40)) +
  geom_point(alpha = 0.2)

[2] Briefly describe the relationship between Height and Yard40.

Answer: There is a weak positive linear relationship between Height and Yard40: taller players tend to have slightly larger (slower) Yard40 times.

[2] Fit the simple linear regression model predicting Yard40 with Height. Then make the residual vs. predicted plot.

# code chunk
fit_h <- lm(Yard40 ~ Height, data = nfl2)

# code chunk
ggplot(augment(fit_h), aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

[2] Describe how the above residual plot indicates whether the correct mean assumption is satisfied (or not). Note: on all questions where you check assumptions, you need not give a definitive answer (yes/no) on whether the assumption is satisfied, but you must demonstrate that you know what to look for.

Answer: The correct mean assumption is assessed by checking whether the residuals are centered vertically around zero across all ranges of predicted values.

[2] Describe how the above residual plot indicates whether the constant variance assumption is satisfied (or not).

Answer: The constant variance assumption is assessed by checking whether the vertical spread of residuals is similar across all ranges of predicted values.
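To make these two checks concrete, here is a small simulated sketch (not the NFL data; every number in it is an arbitrary assumption). The first case fits a straight line to a curved mean, so the average residual drifts away from zero in different ranges of fitted values. The second case has noise that grows with x, so the residual spread differs across ranges. The numeric summaries stand in for what the eye looks for in a residual plot.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 300
x <- seq(0, 10, length.out = n)

# Case 1: the true mean is curved, but a straight line is fitted.
y_curved    <- x^2 + rnorm(n, sd = 1)
fit_curved  <- lm(y_curved ~ x)
bins_curved <- cut(fitted(fit_curved), breaks = 3)  # low / middle / high fitted values
mean_by_bin <- tapply(resid(fit_curved), bins_curved, mean)

# Case 2: the mean is linear, but the noise grows with x.
y_fan     <- 2 + 3 * x + rnorm(n, sd = 0.2 + 0.5 * x)
fit_fan   <- lm(y_fan ~ x)
bins_fan  <- cut(fitted(fit_fan), breaks = 3)
sd_by_bin <- tapply(resid(fit_fan), bins_fan, sd)

mean_by_bin  # average residual per range of fitted values (case 1)
sd_by_bin    # residual spread per range of fitted values (case 2)
```

In case 1 the average residual is positive at both ends and negative in the middle, which is the curved pattern that signals a wrong mean function. In case 2 the spread is clearly larger for high fitted values, which is the fan shape that signals non-constant variance.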

Weight as an explanatory variable

[2] Make a scatterplot with Yard40 on the y-axis and Weight on the x-axis. Within geom_point(), use a small value of alpha to ease some of the overplotting of the points.

# code chunk
ggplot(nfl2, aes(x = Weight, y = Yard40)) +
  geom_point(alpha = 0.2)

[2] Briefly describe the relationship between Weight and Yard40.

Answer: There is a positive linear relationship between Weight and Yard40; as Weight increases, Yard40 tends to increase.

[2] Fit the simple linear regression model predicting Yard40 with Weight. Then make the residual vs. predicted plot.

# code chunk
fit_w <- lm(Yard40 ~ Weight, data = nfl2)

# code chunk
ggplot(augment(fit_w), aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

[2] Describe how the above residual plot indicates whether the correct mean assumption is satisfied (or not).

Answer: The residual plot shows whether the residuals are centered around zero across all predicted values, which indicates whether the correct mean assumption is satisfied.

[2] Describe how the above residual plot indicates whether the constant variance assumption is satisfied (or not).

Answer: The residual plot shows whether the spread of residuals is similar across predicted values, which indicates whether the constant variance assumption is satisfied.

[2] Make a normal quantile plot for the residuals of the Yard40 versus Weight relationship.

# code chunk
ggplot(augment(fit_w), aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()

[2] Describe how the normal quantile plot indicates whether the normality assumption is satisfied (or not).

Answer: The normality assumption is assessed by checking whether the residuals follow the straight line in the normal quantile plot.
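As a rough illustration of "following the straight line", the sketch below compares simulated normal residuals with simulated right-skewed residuals (both hypothetical, not from the NFL models). It uses the correlation between theoretical and sample quantiles as a crude numeric stand-in for the visual check. That summary is an assumption of this sketch, not part of the homework.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible
n <- 500

normal_resid <- rnorm(n)     # residuals that really are normal
skewed_resid <- rexp(n) - 1  # right-skewed residuals centered near zero

# Correlation between theoretical normal quantiles (x) and sample quantiles (y).
# Values very close to 1 mean the points hug a straight line.
qq_cor <- function(r) {
  qq <- qqnorm(r, plot.it = FALSE)
  cor(qq$x, qq$y)
}

qq_cor(normal_resid)
qq_cor(skewed_resid)
```

The skewed residuals give a noticeably lower correlation because their upper tail bends away from the line, which is the kind of systematic curvature to look for in the plot.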

Part 2: Multiple regression

Both Weight and Height as explanatory variables

Because we’ve seen that Yard40 depends on Height and Weight separately, it makes sense to try them both in a multiple regression model.

[2] Fit the model and report the equation that gives the estimated mean Yard40 as a function of Weight and Height.

# code chunk
fit_hw <- lm(Yard40 ~ Weight + Height, data = nfl2)

# code chunk
tidy(fit_hw)

## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  3.93    0.0702        56.0  0       
## 2 Weight       0.00656 0.0000642    102.   0       
## 3 Height      -0.0101  0.00110       -9.17 6.50e-20

Equation: $\widehat{\text{Yard40}} = 3.93 + 0.00656 \cdot \text{Weight} - 0.0101 \cdot \text{Height}$
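As a quick check of how the fitted equation is used, the sketch below plugs the rounded estimates from the tidy(fit_hw) output into a small helper function. The function name predict_yard40 and the example player (250 pounds, 74 inches) are hypothetical. In practice, predict(fit_hw, newdata = ...) would use the unrounded coefficients.

```r
# Rounded estimates copied from the tidy(fit_hw) output above.
b_intercept <- 3.93
b_weight    <- 0.00656
b_height    <- -0.0101

# Hypothetical helper: estimated mean Yard40 for a given Weight (lbs) and Height (in).
predict_yard40 <- function(weight, height) {
  b_intercept + b_weight * weight + b_height * height
}

# Hypothetical player: 250 pounds, 74 inches tall.
predict_yard40(weight = 250, height = 74)
# 3.93 + 0.00656 * 250 - 0.0101 * 74 = 4.8226 seconds
```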

[3] Write a sentence that interprets the value of the estimated coefficient for Height, in the data context.

Answer: For each one-inch increase in Height, the estimated mean Yard40 decreases by about 0.0101 seconds, holding Weight constant.

[2] Explain why the sign of the slope for Height is surprising, given the relationship seen in problems 1 and 2.

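Answer: When Height was the only explanatory variable, taller players tended to have larger (slower) Yard40 times, so the slope for Height was positive. In the multiple regression the Height coefficient is negative (-0.0101): among players of the same Weight, taller players are predicted to be slightly faster. The sign likely flips because taller players also tend to be heavier, so the Height-only slope was partly picking up the effect of Weight.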
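The sign change can be reproduced in a small simulation. This is a sketch under stated assumptions: all parameter values below are made up, chosen only so that height and weight are positively related and that, at a fixed weight, taller players are slightly faster. None of it comes from the NFL data.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible
n <- 2000

# Hypothetical players: taller players tend to be heavier.
height <- rnorm(n, mean = 74, sd = 2.5)
weight <- 250 + 12 * (height - 74) + rnorm(n, sd = 30)

# Hypothetical truth: heavier is slower; at a fixed weight, taller is slightly faster.
yard40 <- 3.9 + 0.0065 * weight - 0.01 * height + rnorm(n, sd = 0.1)

slope_alone    <- coef(lm(yard40 ~ height))[["height"]]
slope_adjusted <- coef(lm(yard40 ~ weight + height))[["height"]]

slope_alone     # positive: height is standing in for weight
slope_adjusted  # negative: height effect with weight held constant
```

The height-only slope absorbs the weight effect through the height-weight association, which is why it can have the opposite sign from the adjusted slope.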

Height effect in multiple regression model

To investigate the effect of Height, we will examine the relationship between Yard40 and Height separately for groups of the dataset clustered by similar Weight values. You should use the variable wt_gp I have created for you below in this analysis.

nfl2 <- nfl2 |> 
  mutate(
    wt_gp = case_when(
      Weight < 200 ~ "200-",
      Weight < 225 ~ "200-224",
      Weight < 250 ~ "225-249",
      Weight < 275 ~ "250-274", 
      Weight < 300 ~ "275-299",
      Weight >= 300 ~ "300+"
    )
  )
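The case_when() call works because its conditions are checked in order, so each weight lands in the first range it satisfies. As a sketch of the same binning logic, the example below uses base R's cut() on a few hypothetical weights. Note that cut() returns a factor rather than a character vector, so it is an alternative illustration, not a drop-in replacement.

```r
# Hypothetical weights chosen to sit on or near the group boundaries.
weights <- c(185, 200, 224, 250, 299, 300)

# right = FALSE makes each interval closed on the left: [200, 225), [225, 250), ...
wt_gp_demo <- cut(
  weights,
  breaks = c(-Inf, 200, 225, 250, 275, 300, Inf),
  labels = c("200-", "200-224", "225-249", "250-274", "275-299", "300+"),
  right  = FALSE
)

as.character(wt_gp_demo)
# "200-" "200-224" "200-224" "250-274" "275-299" "300+"
```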

[2] Make six separate scatterplots that show the relationship between Yard40 and Height for the six weight groups. Each scatterplot should also contain the fitted regression line.

# code chunk
ggplot(nfl2, aes(x = Height, y = Yard40)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~wt_gp)

## `geom_smooth()` using formula = 'y ~ x'

[2] Assess the strength of the relationships in each of the six scatterplots by calculating six correlations between Yard40 and Height.

nfl2 |> 
  group_by(wt_gp) |> 
  summarize(
    r = cor(Height, Yard40)
  )

## # A tibble: 6 × 2
##   wt_gp         r
##   <chr>     <dbl>
## 1 200-     0.0417
## 2 200-224 -0.0573
## 3 225-249  0.0611
## 4 250-274 -0.0278
## 5 275-299 -0.0952
## 6 300+     0.157

[2] Briefly describe the overall strength of the relationships shown in all six scatterplots.

Answer: The relationships within each group are weak: all six correlations fall between about -0.10 and 0.16, and their signs are not consistent. This indicates that Height has very little association with Yard40 when Weight is held approximately constant.

Part 3: Using `Position2`

Recall the seven position groups discussed in the homework introduction, given by the variable Position2.

[2] Make a scatterplot of Yard40 vs. Weight that shows the different categories of Position2 as different colored points. Also, separate regression lines should be plotted for each group.

# code chunk
ggplot(nfl2, aes(x = Weight, y = Yard40, color = Position2)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

Model without interaction

[2] Fit the model with both Position2 and Weight as explanatory variables (with no interaction). Include a table that shows the estimated coefficients of this model.

# code chunk
fit_p <- lm(Yard40 ~ Weight + Position2, data = nfl2)
tidy(fit_p)

## # A tibble: 8 × 5
##   term               estimate std.error statistic   p.value
##   <chr>                 <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)         3.68     0.0330      112.   0        
## 2 Weight              0.00452  0.000115     39.4  4.06e-302
## 3 Position2LB        -0.0504   0.00836      -6.03 1.71e-  9
## 4 Position2OLine      0.186    0.00660      28.2  1.35e-164
## 5 Position2RB        -0.0615   0.0102       -6.06 1.47e-  9
## 6 Position2Secondary -0.0439   0.0116       -3.80 1.47e-  4
## 7 Position2TE        -0.0420   0.00896      -4.68 2.86e-  6
## 8 Position2WR        -0.0775   0.0119       -6.50 8.72e- 11

[3] Interpret the estimated coefficient value corresponding to Position2 = LB.

Answer: Holding Weight constant, linebackers are predicted to run the 40 yard dash about 0.0504 seconds faster than defensive linemen (DLine, the reference category, which is the one position group without its own row in the coefficient table).
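A short sketch of what this coefficient means in terms of predictions, using the rounded estimates from the tidy(fit_p) output. The helper name pred_yard40 and the example weights are hypothetical. Because the model has no interaction, the LB versus DLine gap is the same at every weight.

```r
# Rounded estimates copied from the tidy(fit_p) output above.
b_intercept <- 3.68     # DLine is the reference category
b_weight    <- 0.00452
b_lb        <- -0.0504  # shift for Position2 = LB

# Hypothetical helper: is_lb is 1 for a linebacker and 0 for a defensive lineman.
pred_yard40 <- function(weight, is_lb) {
  b_intercept + b_weight * weight + b_lb * is_lb
}

# Gap between an LB and a DLine player of the same weight, at two example weights.
gap_240 <- pred_yard40(240, 1) - pred_yard40(240, 0)
gap_260 <- pred_yard40(260, 1) - pred_yard40(260, 0)

gap_240  # about -0.0504 seconds
gap_260  # about -0.0504 seconds
```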

Model with interaction

[2] Fit the model with both Position2 and Weight as explanatory variables (but with interaction this time). Include a table that shows the estimated coefficients of this model.

# code chunk
fit_pi <- lm(Yard40 ~ Weight * Position2, data = nfl2)
tidy(fit_pi)

## # A tibble: 14 × 5
##    term                       estimate std.error statistic   p.value
##    <chr>                         <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                3.28      0.0494       66.3  0        
##  2 Weight                     0.00594   0.000173     34.4  1.34e-236
##  3 Position2LB                0.770     0.135         5.68 1.39e-  8
##  4 Position2OLine             0.592     0.103         5.75 9.18e-  9
##  5 Position2RB                0.294     0.0803        3.66 2.57e-  4
##  6 Position2Secondary         0.926     0.0854       10.8  4.14e- 27
##  7 Position2TE                0.186     0.169         1.10 2.73e-  1
##  8 Position2WR                0.976     0.0909       10.7  1.15e- 26
##  9 Weight:Position2LB        -0.00315   0.000554     -5.69 1.30e-  8
## 10 Weight:Position2OLine     -0.00142   0.000337     -4.22 2.51e-  5
## 11 Weight:Position2RB        -0.00119   0.000335     -3.56 3.74e-  4
## 12 Weight:Position2Secondary -0.00427   0.000391    -10.9  1.80e- 27
## 13 Weight:Position2TE        -0.000719  0.000662     -1.09 2.77e-  1
## 14 Weight:Position2WR        -0.00463   0.000414    -11.2  1.00e- 28

Each of the seven positions has a regression line with a different slope under this model, as shown in your plot for #19.

[2] Which two position categories have the smallest slopes?

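Answer: WR and Secondary have the smallest slopes. Each position's slope is the reference (DLine) Weight slope of 0.00594 plus that position's Weight interaction coefficient, and these two groups have the most negative interaction coefficients (-0.00463 and -0.00427).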

[2] For each of these two categories, calculate the slope (in other words, the increase in predicted Yard40 for each one pound increase in Weight).
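One way to carry out this calculation is sketched below, using the rounded estimates from the tidy(fit_pi) output. The vector names are hypothetical. Each position's slope is the reference (DLine) Weight slope plus that position's Weight:Position2 interaction coefficient.

```r
# Rounded estimates copied from the tidy(fit_pi) output above.
base_slope <- 0.00594  # Weight slope for DLine, the reference category

interaction <- c(
  LB        = -0.00315,
  OLine     = -0.00142,
  RB        = -0.00119,
  Secondary = -0.00427,
  TE        = -0.000719,
  WR        = -0.00463
)

# Slope for each position = reference slope + that position's interaction term.
slopes <- c(DLine = base_slope, base_slope + interaction)

sort(slopes)  # smallest slopes first
```

With these rounded values, the WR slope is 0.00594 - 0.00463 = 0.00131 and the Secondary slope is 0.00594 - 0.00427 = 0.00167 seconds per additional pound.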