Replace “Your name” in the YAML header with your name.
This code chunk allows R Markdown to knit your output file even if there are coding errors.
knitr::opts_chunk$set(error = TRUE, fig.width = 6, fig.height = 4)
The packages you will need are tidyverse and broom, plus any others you add.
library(tidyverse)
library(broom)
We begin by reading in the data set named “NFL combine 1987-2021.csv”
into the R object nfl.
# read in data file
nfl <- read_csv("NFL combine 1987-2021.csv")
The dataset covers the National Football League (NFL) “Combine” from 1987 to 2021 (American football, that is). The Combine allows prospects for the NFL draft to be evaluated on several aspects of their athletic ability on a level playing field. This gives NFL teams objective measurements of the prospects' abilities to inform their draft decisions.
There are 13,215 observations on the following 14 variables:
ID: observation identifier
Year: year of NFL combine
Name: player name
College: college of player
Position: position player is tested at for the combine:
Offensive positions: C = Center, FB = Fullback, OG = Offensive Guard, OL = Offensive Line, OT = Offensive Tackle, QB = Quarterback, RB = Running Back, TE = Tight End, WR = Wide Receiver
Defensive positions: CB = Cornerback, DB = Defensive Back, DE = Defensive End, DL = Defensive Line, DT = Defensive Tackle, EDG = Edge Rusher, FS = Free Safety, ILB = Inside Linebacker, LB = Linebacker, NT = Nose Tackle, OLB = Outside Linebacker, S = Safety, SS = Strong Safety
Special teams: K = Kicker, LS = Long Snapper, P = Punter
Height: height in inches
Weight: weight in pounds
Wonderlic: score on the Wonderlic test of intelligence
Yard40: time in seconds to run 40 yard dash
BenchPress: number of repetitions of 225 pounds
VertLeap: vertical leap in inches from standing
BroadJump: Broad jump in inches from standing
Shuttle: time in seconds to complete the shuttle drill
Cones: time in seconds to complete the 3-cone drill
There are many missing values in the dataset, so we will limit the data
to players for whom we have full information on the following variables.
These players can be considered “full participants” of the Combine. Be
sure to use this dataset, nfl2, for the rest of the
homework.
nfl2 <- nfl |>
  filter(
    !is.na(Yard40),
    !is.na(Height),
    !is.na(Weight),
    !is.na(BenchPress),
    !is.na(VertLeap),
    !is.na(BroadJump),
    !is.na(Shuttle),
    !is.na(Cones),
    !is.na(Position),
    !is.na(College)
  )
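The same kind of complete-case filter can be written more compactly with tidyr's drop_na(). The sketch below is only an illustration on a small made-up tibble; the object `toy` and its values are hypothetical and are not taken from the Combine file.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: three players, two of them missing one measurement
toy <- tibble(
  Name   = c("A", "B", "C"),
  Yard40 = c(4.5, NA, 4.9),
  Height = c(72, 74, NA),
  Weight = c(200, 250, 310)
)

# Approach used in this homework: one !is.na() condition per variable
by_filter <- toy |>
  filter(!is.na(Yard40), !is.na(Height), !is.na(Weight))

# Shorthand: drop every row that has an NA in any of the listed columns
by_drop_na <- toy |>
  drop_na(Yard40, Height, Weight)

# Both approaches keep the same players
by_filter$Name
by_drop_na$Name
```

Listing the columns explicitly in drop_na() matters here: calling it with no columns would also drop rows that are missing variables we do not require, such as Wonderlic.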
There are so many categories for the Position variable
that we will combine some into larger categories as the variable
Position2.
nfl2 <- nfl2 |>
  mutate(
    Position2 = case_when(
      Position %in% c("C", "OG", "OL", "OT") ~ "OLine",
      Position %in% c("CB", "DB", "FS", "S", "SS") ~ "Secondary",
      Position %in% c("DE", "DL", "DT", "EDG", "NT") ~ "DLine",
      Position %in% c("ILB", "LB", "OLB") ~ "LB",
      Position %in% c("FB", "RB") ~ "RB",
      TRUE ~ Position
    )
  )
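To see how case_when() handles the fall-through case, here is a minimal sketch on a hypothetical vector of position codes (the tibble `positions` is made up for illustration). It assumes dplyr 1.1.0 or later, where the `.default` argument can be used in place of the `TRUE ~ Position` line; both spellings leave unmatched positions unchanged.

```r
library(dplyr)

# Hypothetical position codes, one per recoding rule plus two unmatched ones
positions <- tibble(Position = c("C", "FS", "EDG", "OLB", "FB", "WR", "QB"))

recoded <- positions |>
  mutate(
    Position2 = case_when(
      Position %in% c("C", "OG", "OL", "OT") ~ "OLine",
      Position %in% c("CB", "DB", "FS", "S", "SS") ~ "Secondary",
      Position %in% c("DE", "DL", "DT", "EDG", "NT") ~ "DLine",
      Position %in% c("ILB", "LB", "OLB") ~ "LB",
      Position %in% c("FB", "RB") ~ "RB",
      # Anything not matched above (here WR and QB) keeps its original code
      .default = Position
    )
  )

recoded
```

Conditions are checked in order and the first match wins, so each position code should appear in only one of the rules.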
nfl2 |>
count(Position2)
## # A tibble: 10 × 2
## Position2 n
## <chr> <int>
## 1 DLine 1072
## 2 LB 796
## 3 LS 10
## 4 OLine 1162
## 5 P 1
## 6 QB 22
## 7 RB 682
## 8 Secondary 1192
## 9 TE 396
## 10 WR 592
We will remove the positions LS (long-snapper), P (punter), and QB (quarterback) due to their low counts. Also, LS and P are special teams positions that may not represent “typical” football players.
nfl2 <- nfl2 |>
  filter(
    Position2 != "QB",
    Position2 != "LS",
    Position2 != "P"
  )
count(nfl2, Position2)
## # A tibble: 7 × 2
## Position2 n
## <chr> <int>
## 1 DLine 1072
## 2 LB 796
## 3 OLine 1162
## 4 RB 682
## 5 Secondary 1192
## 6 TE 396
## 7 WR 592
An overview of the 7 positions given above is as follows:
OLine (Offensive Line), RB (Running Back), TE (Tight End), and WR (Wide Receiver) are offensive positions.
DLine (Defensive Line), LB (Linebacker), and Secondary are defensive positions.
A typical play in football goes like this: after the offensive line snaps the ball to the quarterback, the defensive line tries to crash through the offensive line to tackle the quarterback. The offensive line blocks the defensive line. The quarterback then decides to either hand off the ball to the running back or pass it to one of the wide receivers. The linebackers, who start behind the defensive line, try to tackle the running back. The secondary guard the wide receivers and try to tip the passes away. The tight end can be either a blocker or a pass receiver, depending on the play.
With all these different “jobs”, it makes sense that different sizes of players play each position.
ggplot(
  data = nfl2,
  mapping = aes(x = Height, y = Weight, color = Position2)
) +
  geom_point(alpha = .5) +
  labs(x = "Height (in)", y = "Weight (lbs)")
As shown in the graph above, the OLine and DLine are the largest players because they are crashing into each other at the line of scrimmage. The tight ends are less heavy than the OLine because they also go out for passes in addition to blocking. The Wide Receivers and the Secondary weigh the least because they are sprinting around going after passes. For more information, please see this Wikipedia article on American football positions.
Yard40, the time in seconds to run a 40 yard dash, is the
most quoted statistic to emerge from the Combine, in my opinion. We will
treat it as the response variable and examine how it is related to
Height, Weight, and
Position2.
Make a scatterplot with Yard40 on the y-axis and
Height on the x-axis. Within geom_point(), use
a small value of alpha to ease some of the
overplotting.
# code chunk
ggplot(nfl2, aes(x = Height, y = Yard40)) +
geom_point(alpha = 0.2)
Answer: There is a weak linear relationship between Height and Yard40, with a slight trend that taller players have larger (slower) Yard40 times.
# code chunk
fit_h <- lm(Yard40 ~ Height, data = nfl2)
# code chunk
ggplot(augment(fit_h), aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
Answer: The correct mean assumption is assessed by checking whether the residuals are centered vertically around zero across all ranges of predicted values.
Answer: The constant variance assumption is assessed by checking whether the vertical spread of residuals is similar across all ranges of predicted values.
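One way to make these two visual checks a little less subjective is to split the fitted values into bins and compare the mean and standard deviation of the residuals across bins. The sketch below uses simulated data with known constant variance (the objects `sim`, `fit_sim`, and `resid_summary` are hypothetical and are not part of the homework data), so it shows what the summary looks like when both assumptions hold.

```r
library(dplyr)
library(broom)

set.seed(1)

# Simulated data: a true straight-line mean and constant error spread (sd = 0.2)
sim <- tibble(
  x = runif(500, min = 66, max = 80),
  y = 4 + 0.05 * x + rnorm(500, sd = 0.2)
)

fit_sim <- lm(y ~ x, data = sim)

# Split observations into four equal-sized bins of fitted values, then
# summarize the residuals within each bin
resid_summary <- augment(fit_sim) |>
  mutate(fitted_bin = ntile(.fitted, 4)) |>
  group_by(fitted_bin) |>
  summarize(
    mean_resid = mean(.resid),  # near 0 in every bin if the mean is correct
    sd_resid   = sd(.resid)     # similar in every bin if variance is constant
  )

resid_summary
```

Bin means that drift away from zero would suggest a problem with the mean assumption, and bin standard deviations that grow or shrink steadily would suggest non-constant variance.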
Make a scatterplot with Yard40 on the y-axis and
Weight on the x-axis. Within geom_point(), use
a small value of alpha to ease some of the overplotting of
the points.
# code chunk
ggplot(nfl2, aes(x = Weight, y = Yard40)) +
geom_point(alpha = 0.2)
Answer: There is a positive linear relationship between Weight and Yard40; as Weight increases, Yard40 tends to increase.
# code chunk
fit_w <- lm(Yard40 ~ Weight, data = nfl2)
# code chunk
ggplot(augment(fit_w), aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
Answer: The residual plot shows whether the residuals are centered around zero across all predicted values, which indicates whether the correct mean assumption is satisfied.
Answer: The residual plot shows whether the spread of residuals is similar across predicted values, which indicates whether the constant variance assumption is satisfied.
# code chunk
ggplot(augment(fit_w), aes(sample = .resid)) +
stat_qq() +
stat_qq_line()
Answer: The normality assumption is assessed by checking whether the residuals follow the straight line in the normal quantile plot.
Because we’ve seen that Yard40 depends on Height and
Weight separately, it makes sense to try them both in a multiple
regression model.
Fit a multiple regression model for Yard40 as a function of Weight and Height.
# code chunk
fit_hw <- lm(Yard40 ~ Weight + Height, data = nfl2)
# code chunk
tidy(fit_hw)
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  3.93    0.0702        56.0  0
## 2 Weight       0.00656 0.0000642    102.   0
## 3 Height      -0.0101  0.00110       -9.17 6.50e-20
Equation: $\widehat{\text{Yard40}} = \beta_0 + \beta_1 \cdot \text{Weight} + \beta_2 \cdot \text{Height}$
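Plugging in the rounded estimates from the coefficient table gives the fitted equation. As a worked illustration, the second display evaluates it for a hypothetical player (250 pounds, 74 inches); that player is an assumption chosen for the arithmetic, not a row from the data.

```latex
\[
\widehat{\text{Yard40}} = 3.93 + 0.00656 \cdot \text{Weight} - 0.0101 \cdot \text{Height}
\]

\[
\begin{aligned}
3.93 + 0.00656 \cdot 250 - 0.0101 \cdot 74
  &= 3.93 + 1.64 - 0.7474 \\
  &\approx 4.82 \text{ seconds}
\end{aligned}
\]
```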
Answer: For each additional inch of Height, the predicted Yard40 decreases by about 0.0101 seconds, holding Weight constant.
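Answer: This is surprising because the scatterplot of Yard40 against Height alone showed a slight positive trend (taller players tended to be slower), yet the Height coefficient here is negative. In the multiple regression the Height effect is adjusted for Weight: among players of the same weight, taller players are predicted to be slightly faster.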
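A small simulation can show how a coefficient changes sign once a correlated predictor is added to the model. The sketch below uses made-up data, not the Combine data: the variables `height`, `weight`, and `yard40` are simulated under the assumption that taller players tend to be heavier, and the generating coefficients only loosely echo the estimates in the table.

```r
set.seed(42)
n <- 5000

# Hypothetical simulated players (NOT the Combine data):
# taller players tend to be heavier
height <- rnorm(n, mean = 74, sd = 2.6)
weight <- 240 + 12 * (height - 74) + rnorm(n, sd = 35)

# Weight slows players down; height, at a fixed weight, speeds them up slightly
yard40 <- 3.93 + 0.00656 * weight - 0.0101 * height + rnorm(n, sd = 0.1)

fit_simple   <- lm(yard40 ~ height)           # height alone
fit_multiple <- lm(yard40 ~ weight + height)  # height adjusted for weight

# Height slope when weight is ignored: it absorbs the effect of weight
coef(fit_simple)["height"]

# Height slope when weight is held constant
coef(fit_multiple)["height"]
```

In the simple model, height acts partly as a stand-in for weight, so its slope comes out positive. Once weight is in the model, the height slope reflects only the comparison among players of similar weight.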
To investigate the effect of Height, we will examine the relationship
between Yard40 and Height separately for groups of the dataset clustered
by similar Weight values. You should use the variable wt_gp
I have created for you below in this analysis.
nfl2 <- nfl2 |>
  mutate(
    wt_gp = case_when(
      Weight < 200 ~ "200-",
      Weight < 225 ~ "200-224",
      Weight < 250 ~ "225-249",
      Weight < 275 ~ "250-274",
      Weight < 300 ~ "275-299",
      Weight >= 300 ~ "300+"
    )
  )
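For reference, the same grouping can be sketched with base R's cut(). The weights below are hypothetical values chosen to sit on and around the group boundaries. Note one difference from the case_when() version: cut() returns a factor rather than a character vector.

```r
# Hypothetical weights, including values right at the boundaries
weight <- c(185, 199.9, 200, 224, 225, 299, 300, 340)

wt_gp <- cut(
  weight,
  breaks = c(-Inf, 200, 225, 250, 275, 300, Inf),
  labels = c("200-", "200-224", "225-249", "250-274", "275-299", "300+"),
  right  = FALSE  # intervals are [lower, upper), so 200 falls in "200-224"
)

data.frame(weight, wt_gp)
```

Setting right = FALSE is what makes the boundaries match the `<` comparisons used in case_when(); with the default right = TRUE, a 200-pound player would land in the "200-" group instead.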
# code chunk
ggplot(nfl2, aes(x = Height, y = Yard40)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~wt_gp)
## `geom_smooth()` using formula = 'y ~ x'
nfl2 |>
group_by(wt_gp) |>
summarize(
r = cor(Height, Yard40)
)
## # A tibble: 6 × 2
## wt_gp r
## <chr> <dbl>
## 1 200- 0.0417
## 2 200-224 -0.0573
## 3 225-249 0.0611
## 4 250-274 -0.0278
## 5 275-299 -0.0952
## 6 300+ 0.157
Answer: Within each weight group the relationship between Height and Yard40 is very weak: the correlations range from about -0.10 to 0.16 and differ in sign across groups. Once Weight is held approximately constant, Height has little association with Yard40.
Position2
Recall the seven position groups discussed in the homework
introduction, given by the variable Position2.
Make a scatterplot of Yard40 versus Weight that shows Position2 as different colored points. Also,
separate regression lines should be plotted for each group.
# code chunk
ggplot(nfl2, aes(x = Weight, y = Yard40, color = Position2)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Fit a regression model for Yard40 with Position2 and
Weight as explanatory variables (with no interaction).
Include a table that shows the estimated coefficients of this
model.
# code chunk
fit_p <- lm(Yard40 ~ Weight + Position2, data = nfl2)
tidy(fit_p)
## # A tibble: 8 × 5
##   term               estimate std.error statistic   p.value
##   <chr>                 <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)         3.68     0.0330      112.   0
## 2 Weight              0.00452  0.000115     39.4  4.06e-302
## 3 Position2LB        -0.0504   0.00836      -6.03 1.71e-  9
## 4 Position2OLine      0.186    0.00660      28.2  1.35e-164
## 5 Position2RB        -0.0615   0.0102       -6.06 1.47e-  9
## 6 Position2Secondary -0.0439   0.0116       -3.80 1.47e-  4
## 7 Position2TE        -0.0420   0.00896      -4.68 2.86e-  6
## 8 Position2WR        -0.0775   0.0119       -6.50 8.72e- 11
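Answer: The coefficient for Position2LB (-0.0504) is the difference in predicted Yard40 between linebackers and the reference category (DLine, the level with no row of its own in the table), holding Weight constant. A linebacker is predicted to run about 0.05 seconds faster than a defensive lineman of the same weight.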
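To make the "holding Weight constant" comparison concrete, the sketch below plugs the rounded estimates from the table into the no-interaction model for a hypothetical 240-pound player. It assumes R's default coding for a character predictor, under which the alphabetically first level (DLine here) is the reference category; the weight of 240 is an illustrative choice, not a value from the data.

```r
# Rounded estimates copied from the coefficient table above
b0       <- 3.68      # intercept (DLine is the reference category)
b_weight <- 0.00452   # common Weight slope for every position
b_lb     <- -0.0504   # shift for linebackers relative to DLine

weight <- 240  # hypothetical player weight in pounds

pred_dline <- b0 + b_weight * weight          # defensive lineman
pred_lb    <- b0 + b_weight * weight + b_lb   # linebacker, same weight

# The gap between the two predictions equals the Position2LB coefficient,
# no matter which weight is plugged in
pred_lb - pred_dline
```

Because the model has no interaction, the two prediction lines are parallel, which is why the difference does not depend on the chosen weight.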
Fit a regression model for Yard40 with Position2 and
Weight as explanatory variables (but with interaction this
time). Include a table that shows the estimated coefficients of this
model.
# code chunk
fit_pi <- lm(Yard40 ~ Weight * Position2, data = nfl2)
tidy(fit_pi)
## # A tibble: 14 × 5
##    term                       estimate std.error statistic   p.value
##    <chr>                         <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                3.28      0.0494       66.3  0
##  2 Weight                     0.00594   0.000173     34.4  1.34e-236
##  3 Position2LB                0.770     0.135         5.68 1.39e-  8
##  4 Position2OLine             0.592     0.103         5.75 9.18e-  9
##  5 Position2RB                0.294     0.0803        3.66 2.57e-  4
##  6 Position2Secondary         0.926     0.0854       10.8  4.14e- 27
##  7 Position2TE                0.186     0.169         1.10 2.73e-  1
##  8 Position2WR                0.976     0.0909       10.7  1.15e- 26
##  9 Weight:Position2LB        -0.00315   0.000554     -5.69 1.30e-  8
## 10 Weight:Position2OLine     -0.00142   0.000337     -4.22 2.51e-  5
## 11 Weight:Position2RB        -0.00119   0.000335     -3.56 3.74e-  4
## 12 Weight:Position2Secondary -0.00427   0.000391    -10.9  1.80e- 27
## 13 Weight:Position2TE        -0.000719  0.000662     -1.09 2.77e-  1
## 14 Weight:Position2WR        -0.00463   0.000414    -11.2  1.00e- 28
Each of the seven positions has a regression line with a different slope under this model, as shown in your plot for #19.
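The position-specific slopes are not listed directly in the output; each is the baseline Weight slope plus that position's interaction term. A minimal sketch of the arithmetic, using the rounded estimates from the table and assuming DLine is the reference category (it has no interaction row of its own):

```r
# Rounded estimates copied from the interaction model output above
base_slope <- 0.00594  # Weight slope for the reference category (DLine)

# Weight:Position2 interaction terms (change in slope relative to DLine)
interaction <- c(
  LB        = -0.00315,
  OLine     = -0.00142,
  RB        = -0.00119,
  Secondary = -0.00427,
  TE        = -0.000719,
  WR        = -0.00463
)

# Slope for each position = baseline slope + its interaction term
slopes <- c(DLine = base_slope, base_slope + interaction)

# Sorted from the flattest to the steepest regression line
sort(slopes)
```

Each slope is the predicted change in Yard40 (in seconds) per additional pound for players at that position.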
Answer: WR and Secondary have the smallest slopes. Adding each Weight:Position2 interaction term to the baseline Weight slope of 0.00594 gives about 0.0013 seconds per pound for WR and 0.0017 for Secondary, the lowest of the seven groups.