First I will be loading in my data
data = read.csv('https://raw.githubusercontent.com/jacmantooth/applied_stats-/main/2020%20nfl%20defense%20.csv')
summary(data)
## Tm G Att Cmp Yds
## Length:32 Min. :16 Min. :494.0 Min. :298.0 Min. :3051
## Class :character 1st Qu.:16 1st Qu.:540.0 1st Qu.:342.5 1st Qu.:3573
## Mode :character Median :16 Median :557.0 Median :367.0 Median :3807
## Mean :16 Mean :563.1 Mean :367.4 Mean :3842
## 3rd Qu.:16 3rd Qu.:581.2 3rd Qu.:380.8 3rd Qu.:4113
## Max. :16 Max. :674.0 Max. :450.0 Max. :4697
## TD DADOT Air YAC Bltz
## Min. :17.00 Min. :7.000 Min. :1604 Min. :1376 Min. : 98.0
## 1st Qu.:22.75 1st Qu.:7.700 1st Qu.:2141 1st Qu.:1750 1st Qu.:145.8
## Median :28.00 Median :8.100 Median :2366 Median :1858 Median :180.0
## Mean :27.22 Mean :8.181 Mean :2332 Mean :1870 Mean :187.2
## 3rd Qu.:30.25 3rd Qu.:8.625 3rd Qu.:2502 3rd Qu.:2040 3rd Qu.:225.5
## Max. :38.00 Max. :9.500 Max. :2970 Max. :2282 Max. :290.0
## Bltz. Hrry Hrry. QBKD
## Length:32 Min. :40.00 Length:32 Min. :29.00
## Class :character 1st Qu.:50.00 Class :character 1st Qu.:41.75
## Mode :character Median :59.50 Mode :character Median :47.00
## Mean :59.84 Mean :48.47
## 3rd Qu.:69.25 3rd Qu.:55.25
## Max. :80.00 Max. :71.00
## QBKD. Sk Prss Prss.
## Length:32 Min. :17.00 Min. :101.0 Length:32
## Class :character 1st Qu.:28.50 1st Qu.:132.5 Class :character
## Mode :character Median :36.50 Median :143.0 Mode :character
## Mean :35.47 Mean :143.8
## 3rd Qu.:42.75 3rd Qu.:160.2
## Max. :56.00 Max. :202.0
## MTkl win lose tie
## Min. : 68.00 Min. : 1.000 Min. : 2.000 Min. :0.0000
## 1st Qu.: 96.25 1st Qu.: 5.000 1st Qu.: 5.000 1st Qu.:0.0000
## Median :109.00 Median : 7.500 Median : 8.500 Median :0.0000
## Mean :107.72 Mean : 7.969 Mean : 7.969 Mean :0.0625
## 3rd Qu.:120.50 3rd Qu.:11.000 3rd Qu.:11.000 3rd Qu.:0.0000
## Max. :147.00 Max. :14.000 Max. :15.000 Max. :1.0000
## rank PF Y.P TO
## Min. : 1.00 Min. :296.0 Min. :4.600 Min. : 9.00
## 1st Qu.: 8.75 1st Qu.:356.5 1st Qu.:5.300 1st Qu.:18.75
## Median :16.50 Median :382.5 Median :5.500 Median :22.00
## Mean :16.50 Mean :396.6 Mean :5.572 Mean :20.78
## 3rd Qu.:24.25 3rd Qu.:440.8 3rd Qu.:5.900 3rd Qu.:23.00
## Max. :32.00 Max. :519.0 Max. :6.300 Max. :29.00
## FL X1stD Int X1stDP
## Min. : 4.000 Min. :280.0 Min. : 3.00 Min. :161.0
## 1st Qu.: 6.750 1st Qu.:330.5 1st Qu.:10.00 1st Qu.:186.8
## Median : 8.000 Median :349.0 Median :12.00 Median :205.0
## Mean : 8.438 Mean :347.1 Mean :12.34 Mean :204.6
## 3rd Qu.:10.000 3rd Qu.:366.2 3rd Qu.:15.00 3rd Qu.:215.0
## Max. :15.000 Max. :415.0 Max. :18.00 Max. :253.0
## RushAtt Yds.1 TD.1 Y.A
## Min. :358.0 Min. :1289 Min. :10.00 Min. :3.600
## 1st Qu.:405.2 1st Qu.:1765 1st Qu.:13.00 1st Qu.:4.100
## Median :429.0 Median :1888 Median :16.00 Median :4.500
## Mean :431.0 Mean :1902 Mean :16.62 Mean :4.388
## 3rd Qu.:449.5 3rd Qu.:2030 3rd Qu.:19.25 3rd Qu.:4.600
## Max. :517.0 Max. :2564 Max. :27.00 Max. :5.200
## X1stDR playoff winning.record.
## Min. : 78.00 Length:32 Length:32
## 1st Qu.: 97.75 Class :character Class :character
## Median :113.00 Mode :character Mode :character
## Mean :111.28
## 3rd Qu.:120.50
## Max. :145.00
The quantitative variable that I picked for this project will be Touchdowns. What Touchdowns means in my data set is how many touchdowns did that defense give up. The next part of this project will be the summary statistics. The first part is my five-number summary and the mean of my variable.
summary(data$TD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 22.75 28.00 27.22 30.25 38.00
My mean is 27.22, the minimum is 17, 1st Qu is 22.75, the median is 28.00, while the 3rd Qu is 30.25 and my maximum was 38 touchdowns allowed. I will now find the standard deviation of my variable
sd(data$TD)
## [1] 5.210005
My standard deviation is 5.210005
The next part of my project is the graphical display. I will be first creating a histogram
hist(data$TD)
This graph shows us that not many teams allowed 35+ touchdowns but few allowed only 15 touchdowns. The next graph I will be making is the box plot.
boxplot(data$TD)
This show us the five number summary which i already talked about above. Finally i will be making the qq plot
qqnorm(data$TD)
The Q-Q plot shows us there is no trend with the data. Now I will be exploring data on a categorical variable The categorical variable that i used is Playoff. Which is whether a team made the playoff or not.
table(data$playoff)
##
## no yes
## 18 14
As you can see not many teams made the playoffs. In the nfl they only allow 14 teams in the playoffs The next part will be my relative frequency table
table(data$playoff)/length(data$playoff)
##
## no yes
## 0.5625 0.4375
This next part I made a two way table of how many time did a team Blitz and if a team made the playoffs.
table(data$playoff,data$Bltz)
##
## 98 108 131 133 135 138 142 147 150 153 159 171 174 178 179 181 194 195
## no 1 0 1 1 1 0 1 1 0 1 1 0 1 1 1 1 0 0
## yes 0 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 1
##
## 197 203 212 221 224 230 244 247 249 253 256 268 290
## no 0 1 1 0 1 0 0 1 1 0 1 0 0
## yes 1 0 0 1 0 1 1 0 0 1 0 1 1
You really cant tell anything from this so, this really is noy that usefully
I will be preforming a hypothesis test comparing the mean to a fixed value. i will also be stating my null and alternative hypothesis. \[ H_0: \mu = 3000 \\ H_A: \mu\neq 3000 \] so what my null hypothesis says is that passing yards allowed will be 3000 yards. while my alternative hypothesis say that it will not be 3000
t.test(data$Yds, mu=3000)
##
## One Sample t-test
##
## data: data$Yds
## t = 11.242, df = 31, p-value = 1.822e-12
## alternative hypothesis: true mean is not equal to 3000
## 95 percent confidence interval:
## 3689.576 3995.236
## sample estimates:
## mean of x
## 3842.406
So basically i am going to reject my null hypothesis because it ia was way off.
I will now be preforming a hypothesis test comparing two means. I will also be stating my null and alternative hypothesis. I will be comparing air time and yard after the catch. my null hypothesis is that Air yardage will be equal to YAC while my alternative hypothesis is that Air will not equal YAC. I will have Air as mu1 and Yac as mu2
\[ H_0: \mu_1 = \mu_2 \\ H_A: \mu_1 \neq \mu_2 \]
t.test(data$Air,data$YAC)
##
## Welch Two Sample t-test
##
## data: data$Air and data$YAC
## t = 7.3269, df = 57.461, p-value = 8.693e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 335.7103 588.1647
## sample estimates:
## mean of x mean of y
## 2331.625 1869.688
I can safely accept my alternative hypothesis of the two not being equal
The two quantitative variables that I have picked are Losses and Yards allowed. Losses is how many losses that team had that season and Yard allowed being how many yards did that team defense give up
I will first be doing a linear model and linear regression I want to predict loses with yards allowed by a defense
lm(lose ~ Yds, data = data)
##
## Call:
## lm(formula = lose ~ Yds, data = data)
##
## Coefficients:
## (Intercept) Yds
## -1.40808 0.00244
mod=lm(lose ~ Yds, data = data)
model = mod
I have now created a linear model with Yds allowed being a predictor for losses. my slop is .00244 and my intercept is -1.40808.This is say for every yard allowed we multiple it by .00244 the we add -1.40808.
I will now do a summary function to get more info on the linear regression
summary(model)
##
## Call:
## lm(formula = lose ~ Yds, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8140 -3.0360 0.4666 2.9385 6.0829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.408075 5.446100 -0.259 0.7978
## Yds 0.002440 0.001409 1.732 0.0936 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.326 on 30 degrees of freedom
## Multiple R-squared: 0.09089, Adjusted R-squared: 0.06059
## F-statistic: 2.999 on 1 and 30 DF, p-value: 0.09357
Oh wow so much stuff. We see the residuals and the coefficient in this summary.We see that our significant it is not that significant here, same goes for our intercept.
I will now plot my linear model
plot(model)
The most interesting graph here is the normal Q-Q.This graph is looking at the normality of my data. We see that my data is actually pretty normal here, in the middle then goes off towards the top. i would say that my data is pretty normal then.
I will now use the predict command to predict losses by a team base on yards given up
Name_predict<-predict(model)
names(Name_predict)<-c(data$Tm)
Name_predict
## Arizona Cardinals Atlanta Falcons Baltimore Ravens
## 7.433321 10.054260 7.221011
## Buffalo Bills Carolina Panthers Chicago Bears
## 7.684678 7.926273 7.633430
## Cincinnati Bengals Cleveland Browns Dallas Cowboys
## 8.009245 8.260601 7.479688
## Denver Broncos Detroit Lions Green Bay Packers
## 7.882346 9.715051 7.228332
## Houston Texans Indianapolis Colts Jacksonville Jaguars
## 8.607131 8.026327 8.917056
## Kansas City Chiefs Las Vegas Raiders Los Angeles Chargers
## 7.814016 8.870689 7.323506
## Los Angeles Rams Miami Dolphins Minnesota Vikings
## 6.037440 8.411903 8.697424
## New England Patriots New Orleans Saints New York Giants
## 7.272258 7.064828 7.882346
## New York Jets Philadelphia Eagles Pittsburgh Steelers
## 9.351438 7.860383 6.181421
## San Francisco 49ers Seattle Seahawks Tampa Bay Buccaneers
## 6.710977 9.719932 8.219115
## Tennessee Titans Washington Football Team
## 9.424649 6.078926
This is so cool in my opinion. If you watched football you know that WFT defiantly had more losses then 6 and I find it interesting that it says the falcon are predicted to be the team in the NFL lol.
I will now be doing my residual for my prediction, The residual is the difference between the prediction and actual value.
Name_residuals<-residuals(model)
names(Name_residuals)<-c(data$Tm)
Name_residuals
## Arizona Cardinals Atlanta Falcons Baltimore Ravens
## 0.56667856 1.94574012 -2.22101078
## Buffalo Bills Carolina Panthers Chicago Bears
## -4.68467773 3.07372738 0.36656967
## Cincinnati Bengals Cleveland Browns Dallas Cowboys
## 2.99075540 -3.26060089 2.52031187
## Denver Broncos Detroit Lions Green Bay Packers
## 3.11765373 1.28494910 -4.22833184
## Houston Texans Indianapolis Colts Jacksonville Jaguars
## 3.39286907 -3.02632706 6.08294432
## Kansas City Chiefs Las Vegas Raiders Los Angeles Chargers
## -5.81401641 -0.87068898 1.67649442
## Los Angeles Rams Miami Dolphins Minnesota Vikings
## -0.03743988 -2.41190274 0.30257604
## New England Patriots New Orleans Saints New York Giants
## 1.72774182 -3.06482823 2.11765373
## New York Jets Philadelphia Eagles Pittsburgh Steelers
## 4.64856160 3.13961690 -2.18142067
## San Francisco 49ers Seattle Seahawks Tampa Bay Buccaneers
## 3.28902286 -5.71993161 -3.21911490
## Tennessee Titans Washington Football Team
## -4.42464897 2.92107413
As you can see the predictions were all over the place, with some being over and some being under. we also see that some prediction are pretty spot on. I will now create a histogram to see if our residual is normal
hist(residuals(model),xlab = "residual")
Clearly it is not normal. well maybe a little normal but mostly not normal.
I will now be predicting losses base on Yards given up. let say four random teams gave up 2000,3000,4000,5000 yards. Now I will predict the losses each team will have using my model. I will also be showing the confident interval.
ylist =c (2000,3000,4000,5000)
point <- data.frame(Yds = ylist)
predict(model,point, interval = "confidence")
## fit lwr upr
## 1 3.472630 -1.963572 8.908831
## 2 5.912982 3.207729 8.618235
## 3 8.353334 7.069895 9.636773
## 4 10.793687 7.252676 14.334697
From this prediction model we see that if a team gives up only 2000 yards it is predicted that they will only lose about 3.5 games while if a team gives up 5000 yards, they are predicted to lose 10.80 games. we expect that 95% of my data will fall between these intervals.
Next, I will be making a graph with my two quantitative variables. I know some of yall don’t watch football so I thought labeling the dots would help. I know it looks messy in some places but now you can tell which teams are which dots. I also added my abline line and colored it to help it stand out.
plot(lose ~ Yds, data = data, main = "Loses Against Yards allowed", xlab = "Yards allowed", ylab = "Loses")
abline(mod, col="red", lwd=2)
text(lose~ Yds, labels=Tm,data=data, cex=0.5, font=2)
I thought this graph was really cool honestly. It is really interesting to see where each team falls. For example look at the Seattle Seahawks and Detroit Lions both gave up similar amount of yardages but one had only 4 losses while the lions had 11 losses.
I will now compute the correlation between my two variables and do a hypothesis test on the correlation.
cor(data$Yds,data$lose)
## [1] 0.3014828
As you can see my correlation is .30142828 Now I will be doing an hypothesis test of the correlation test
cor.test(data$Yds,data$lose)
##
## Pearson's product-moment correlation
##
## data: data$Yds and data$lose
## t = 1.7319, df = 30, p-value = 0.09357
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05275726 0.58832861
## sample estimates:
## cor
## 0.3014828
I will reject the null hypothesis in this case and accept my alternative hypothesis of it not being equal to 0
Now I will use a categorical variable and repeat the regression. I will now be predicting losses by Yds given up and winning record
fit <- lm(lose ~ Yds + winning.record., data)
fit
##
## Call:
## lm(formula = lose ~ Yds + winning.record., data = data)
##
## Coefficients:
## (Intercept) Yds winning.record.yes
## 4.781008 0.001449 -5.859531
summary(fit)
##
## Call:
## lm(formula = lose ~ Yds + winning.record., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8848 -0.9870 0.1921 0.6826 4.0877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7810080 2.6550400 1.801 0.0822 .
## Yds 0.0014491 0.0006758 2.144 0.0405 *
## winning.record.yes -5.8595310 0.5740969 -10.207 4.14e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.578 on 29 degrees of freedom
## Multiple R-squared: 0.802, Adjusted R-squared: 0.7884
## F-statistic: 58.74 on 2 and 29 DF, p-value: 6.319e-11
Now I am going to compute a prediction. I will create the ultimate defense, a NFL team that only gave up 2000 yards but somehow had a losing record ooooo interesting.
y <- data.frame(Yds = 2000, winning.record. = "no")
predict(fit, y)
## 1
## 7.679284
It is predicted that this team will lose about 7.7 games sadly.
I will now make a graph for my categorical variable
plot(lose ~ Yds, data = data, main = "Loses Against Yards allowed", xlab = "Yards allowed", ylab = "Loses")
abline(fit, col="black", lwd=2)
## Warning in abline(fit, col = "black", lwd = 2): only using the first two of 3
## regression coefficients
abline(model, col="orange", lwd=2)
text(lose~ Yds, labels=Tm,data=data, cex=0.5, font=2)
So here I did my colors in ECU color, go tigers. The orange line is my original model and the black line the new LM we created above. In my opinion the new LM that we created above is better. We really cant predict anything here! I would say we need more data, maybe data from 2011-2021 season would be good enough.
I thought the most interesting thing about my Project Part 1 was the prediction model. I thought it was really cool to see if you can predict losses by a team by a certain variable which was fun to play around with. I also enjoyed creating the last graph and adding the names to the variables. what I found out after doing Project Part 1 is that it is very hard to predict losses in the NFL, I think you could tell just by looking at the residual for my prediction. I now understand why ESPN always get predictions wrong.