Project Part 0: Hypothesis Testing

Part 1

First I will be loading in my data

data = read.csv('https://raw.githubusercontent.com/jacmantooth/applied_stats-/main/2020%20nfl%20defense%20.csv')
summary(data)
##       Tm                  G           Att             Cmp             Yds      
##  Length:32          Min.   :16   Min.   :494.0   Min.   :298.0   Min.   :3051  
##  Class :character   1st Qu.:16   1st Qu.:540.0   1st Qu.:342.5   1st Qu.:3573  
##  Mode  :character   Median :16   Median :557.0   Median :367.0   Median :3807  
##                     Mean   :16   Mean   :563.1   Mean   :367.4   Mean   :3842  
##                     3rd Qu.:16   3rd Qu.:581.2   3rd Qu.:380.8   3rd Qu.:4113  
##                     Max.   :16   Max.   :674.0   Max.   :450.0   Max.   :4697  
##        TD            DADOT            Air            YAC            Bltz      
##  Min.   :17.00   Min.   :7.000   Min.   :1604   Min.   :1376   Min.   : 98.0  
##  1st Qu.:22.75   1st Qu.:7.700   1st Qu.:2141   1st Qu.:1750   1st Qu.:145.8  
##  Median :28.00   Median :8.100   Median :2366   Median :1858   Median :180.0  
##  Mean   :27.22   Mean   :8.181   Mean   :2332   Mean   :1870   Mean   :187.2  
##  3rd Qu.:30.25   3rd Qu.:8.625   3rd Qu.:2502   3rd Qu.:2040   3rd Qu.:225.5  
##  Max.   :38.00   Max.   :9.500   Max.   :2970   Max.   :2282   Max.   :290.0  
##     Bltz.                Hrry          Hrry.                QBKD      
##  Length:32          Min.   :40.00   Length:32          Min.   :29.00  
##  Class :character   1st Qu.:50.00   Class :character   1st Qu.:41.75  
##  Mode  :character   Median :59.50   Mode  :character   Median :47.00  
##                     Mean   :59.84                      Mean   :48.47  
##                     3rd Qu.:69.25                      3rd Qu.:55.25  
##                     Max.   :80.00                      Max.   :71.00  
##     QBKD.                 Sk             Prss          Prss.          
##  Length:32          Min.   :17.00   Min.   :101.0   Length:32         
##  Class :character   1st Qu.:28.50   1st Qu.:132.5   Class :character  
##  Mode  :character   Median :36.50   Median :143.0   Mode  :character  
##                     Mean   :35.47   Mean   :143.8                     
##                     3rd Qu.:42.75   3rd Qu.:160.2                     
##                     Max.   :56.00   Max.   :202.0                     
##       MTkl             win              lose             tie        
##  Min.   : 68.00   Min.   : 1.000   Min.   : 2.000   Min.   :0.0000  
##  1st Qu.: 96.25   1st Qu.: 5.000   1st Qu.: 5.000   1st Qu.:0.0000  
##  Median :109.00   Median : 7.500   Median : 8.500   Median :0.0000  
##  Mean   :107.72   Mean   : 7.969   Mean   : 7.969   Mean   :0.0625  
##  3rd Qu.:120.50   3rd Qu.:11.000   3rd Qu.:11.000   3rd Qu.:0.0000  
##  Max.   :147.00   Max.   :14.000   Max.   :15.000   Max.   :1.0000  
##       rank             PF             Y.P              TO       
##  Min.   : 1.00   Min.   :296.0   Min.   :4.600   Min.   : 9.00  
##  1st Qu.: 8.75   1st Qu.:356.5   1st Qu.:5.300   1st Qu.:18.75  
##  Median :16.50   Median :382.5   Median :5.500   Median :22.00  
##  Mean   :16.50   Mean   :396.6   Mean   :5.572   Mean   :20.78  
##  3rd Qu.:24.25   3rd Qu.:440.8   3rd Qu.:5.900   3rd Qu.:23.00  
##  Max.   :32.00   Max.   :519.0   Max.   :6.300   Max.   :29.00  
##        FL             X1stD            Int            X1stDP     
##  Min.   : 4.000   Min.   :280.0   Min.   : 3.00   Min.   :161.0  
##  1st Qu.: 6.750   1st Qu.:330.5   1st Qu.:10.00   1st Qu.:186.8  
##  Median : 8.000   Median :349.0   Median :12.00   Median :205.0  
##  Mean   : 8.438   Mean   :347.1   Mean   :12.34   Mean   :204.6  
##  3rd Qu.:10.000   3rd Qu.:366.2   3rd Qu.:15.00   3rd Qu.:215.0  
##  Max.   :15.000   Max.   :415.0   Max.   :18.00   Max.   :253.0  
##     RushAtt          Yds.1           TD.1            Y.A       
##  Min.   :358.0   Min.   :1289   Min.   :10.00   Min.   :3.600  
##  1st Qu.:405.2   1st Qu.:1765   1st Qu.:13.00   1st Qu.:4.100  
##  Median :429.0   Median :1888   Median :16.00   Median :4.500  
##  Mean   :431.0   Mean   :1902   Mean   :16.62   Mean   :4.388  
##  3rd Qu.:449.5   3rd Qu.:2030   3rd Qu.:19.25   3rd Qu.:4.600  
##  Max.   :517.0   Max.   :2564   Max.   :27.00   Max.   :5.200  
##      X1stDR         playoff          winning.record.   
##  Min.   : 78.00   Length:32          Length:32         
##  1st Qu.: 97.75   Class :character   Class :character  
##  Median :113.00   Mode  :character   Mode  :character  
##  Mean   :111.28                                        
##  3rd Qu.:120.50                                        
##  Max.   :145.00

Part 2

The quantitative variable that I picked for this project will be Touchdowns. What Touchdowns means in my data set is how many touchdowns did that defense give up. The next part of this project will be the summary statistics. The first part is my five-number summary and the mean of my variable.

summary(data$TD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   22.75   28.00   27.22   30.25   38.00

My mean is 27.22, the minimum is 17, 1st Qu is 22.75, the median is 28.00, while the 3rd Qu is 30.25 and my maximum was 38 touchdowns allowed. I will now find the standard deviation of my variable

sd(data$TD)
## [1] 5.210005

My standard deviation is 5.210005

The next part of my project is the graphical display. I will be first creating a histogram

hist(data$TD)

This graph shows us that not many teams allowed 35+ touchdowns but few allowed only 15 touchdowns. The next graph I will be making is the box plot.

boxplot(data$TD)

This show us the five number summary which i already talked about above. Finally i will be making the qq plot

qqnorm(data$TD)

The Q-Q plot shows us there is no trend with the data. Now I will be exploring data on a categorical variable The categorical variable that i used is Playoff. Which is whether a team made the playoff or not.

table(data$playoff)
## 
##  no yes 
##  18  14

As you can see not many teams made the playoffs. In the nfl they only allow 14 teams in the playoffs The next part will be my relative frequency table

table(data$playoff)/length(data$playoff)
## 
##     no    yes 
## 0.5625 0.4375

This next part I made a two way table of how many time did a team Blitz and if a team made the playoffs.

table(data$playoff,data$Bltz)
##      
##       98 108 131 133 135 138 142 147 150 153 159 171 174 178 179 181 194 195
##   no   1   0   1   1   1   0   1   1   0   1   1   0   1   1   1   1   0   0
##   yes  0   1   1   0   0   1   0   0   1   0   0   1   0   0   0   0   1   1
##      
##       197 203 212 221 224 230 244 247 249 253 256 268 290
##   no    0   1   1   0   1   0   0   1   1   0   1   0   0
##   yes   1   0   0   1   0   1   1   0   0   1   0   1   1

You really cant tell anything from this so, this really is noy that usefully

Hypothesis Testing Part 3

I will be preforming a hypothesis test comparing the mean to a fixed value. i will also be stating my null and alternative hypothesis. \[ H_0: \mu = 3000 \\ H_A: \mu\neq 3000 \] so what my null hypothesis says is that passing yards allowed will be 3000 yards. while my alternative hypothesis say that it will not be 3000

t.test(data$Yds, mu=3000)
## 
##  One Sample t-test
## 
## data:  data$Yds
## t = 11.242, df = 31, p-value = 1.822e-12
## alternative hypothesis: true mean is not equal to 3000
## 95 percent confidence interval:
##  3689.576 3995.236
## sample estimates:
## mean of x 
##  3842.406

So basically i am going to reject my null hypothesis because it ia was way off.

I will now be preforming a hypothesis test comparing two means. I will also be stating my null and alternative hypothesis. I will be comparing air time and yard after the catch. my null hypothesis is that Air yardage will be equal to YAC while my alternative hypothesis is that Air will not equal YAC. I will have Air as mu1 and Yac as mu2

\[ H_0: \mu_1 = \mu_2 \\ H_A: \mu_1 \neq \mu_2 \]

t.test(data$Air,data$YAC)
## 
##  Welch Two Sample t-test
## 
## data:  data$Air and data$YAC
## t = 7.3269, df = 57.461, p-value = 8.693e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  335.7103 588.1647
## sample estimates:
## mean of x mean of y 
##  2331.625  1869.688

I can safely accept my alternative hypothesis of the two not being equal

Project Part 1: Regresssion

Part 1

The two quantitative variables that I have picked are Losses and Yards allowed. Losses is how many losses that team had that season and Yard allowed being how many yards did that team defense give up

I will first be doing a linear model and linear regression I want to predict loses with yards allowed by a defense

lm(lose ~ Yds, data = data)
## 
## Call:
## lm(formula = lose ~ Yds, data = data)
## 
## Coefficients:
## (Intercept)          Yds  
##    -1.40808      0.00244
mod=lm(lose ~ Yds, data = data)
model = mod

I have now created a linear model with Yds allowed being a predictor for losses. my slop is .00244 and my intercept is -1.40808.This is say for every yard allowed we multiple it by .00244 the we add -1.40808.

I will now do a summary function to get more info on the linear regression

summary(model)
## 
## Call:
## lm(formula = lose ~ Yds, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8140 -3.0360  0.4666  2.9385  6.0829 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.408075   5.446100  -0.259   0.7978  
## Yds          0.002440   0.001409   1.732   0.0936 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.326 on 30 degrees of freedom
## Multiple R-squared:  0.09089,    Adjusted R-squared:  0.06059 
## F-statistic: 2.999 on 1 and 30 DF,  p-value: 0.09357

Oh wow so much stuff. We see the residuals and the coefficient in this summary.We see that our significant it is not that significant here, same goes for our intercept.

I will now plot my linear model

plot(model)

The most interesting graph here is the normal Q-Q.This graph is looking at the normality of my data. We see that my data is actually pretty normal here, in the middle then goes off towards the top. i would say that my data is pretty normal then.

Part 2

I will now use the predict command to predict losses by a team base on yards given up

Name_predict<-predict(model)
names(Name_predict)<-c(data$Tm)
Name_predict
##        Arizona Cardinals          Atlanta Falcons         Baltimore Ravens 
##                 7.433321                10.054260                 7.221011 
##            Buffalo Bills        Carolina Panthers            Chicago Bears 
##                 7.684678                 7.926273                 7.633430 
##       Cincinnati Bengals         Cleveland Browns           Dallas Cowboys 
##                 8.009245                 8.260601                 7.479688 
##           Denver Broncos            Detroit Lions        Green Bay Packers 
##                 7.882346                 9.715051                 7.228332 
##           Houston Texans       Indianapolis Colts     Jacksonville Jaguars 
##                 8.607131                 8.026327                 8.917056 
##       Kansas City Chiefs        Las Vegas Raiders     Los Angeles Chargers 
##                 7.814016                 8.870689                 7.323506 
##         Los Angeles Rams           Miami Dolphins        Minnesota Vikings 
##                 6.037440                 8.411903                 8.697424 
##     New England Patriots       New Orleans Saints          New York Giants 
##                 7.272258                 7.064828                 7.882346 
##            New York Jets      Philadelphia Eagles      Pittsburgh Steelers 
##                 9.351438                 7.860383                 6.181421 
##      San Francisco 49ers         Seattle Seahawks     Tampa Bay Buccaneers 
##                 6.710977                 9.719932                 8.219115 
##         Tennessee Titans Washington Football Team 
##                 9.424649                 6.078926

This is so cool in my opinion. If you watched football you know that WFT defiantly had more losses then 6 and I find it interesting that it says the falcon are predicted to be the team in the NFL lol.

I will now be doing my residual for my prediction, The residual is the difference between the prediction and actual value.

Name_residuals<-residuals(model)
names(Name_residuals)<-c(data$Tm)
Name_residuals
##        Arizona Cardinals          Atlanta Falcons         Baltimore Ravens 
##               0.56667856               1.94574012              -2.22101078 
##            Buffalo Bills        Carolina Panthers            Chicago Bears 
##              -4.68467773               3.07372738               0.36656967 
##       Cincinnati Bengals         Cleveland Browns           Dallas Cowboys 
##               2.99075540              -3.26060089               2.52031187 
##           Denver Broncos            Detroit Lions        Green Bay Packers 
##               3.11765373               1.28494910              -4.22833184 
##           Houston Texans       Indianapolis Colts     Jacksonville Jaguars 
##               3.39286907              -3.02632706               6.08294432 
##       Kansas City Chiefs        Las Vegas Raiders     Los Angeles Chargers 
##              -5.81401641              -0.87068898               1.67649442 
##         Los Angeles Rams           Miami Dolphins        Minnesota Vikings 
##              -0.03743988              -2.41190274               0.30257604 
##     New England Patriots       New Orleans Saints          New York Giants 
##               1.72774182              -3.06482823               2.11765373 
##            New York Jets      Philadelphia Eagles      Pittsburgh Steelers 
##               4.64856160               3.13961690              -2.18142067 
##      San Francisco 49ers         Seattle Seahawks     Tampa Bay Buccaneers 
##               3.28902286              -5.71993161              -3.21911490 
##         Tennessee Titans Washington Football Team 
##              -4.42464897               2.92107413

As you can see the predictions were all over the place, with some being over and some being under. we also see that some prediction are pretty spot on. I will now create a histogram to see if our residual is normal

hist(residuals(model),xlab = "residual")

Clearly it is not normal. well maybe a little normal but mostly not normal.

I will now be predicting losses base on Yards given up. let say four random teams gave up 2000,3000,4000,5000 yards. Now I will predict the losses each team will have using my model. I will also be showing the confident interval.

ylist =c (2000,3000,4000,5000)
point <- data.frame(Yds = ylist)
predict(model,point, interval = "confidence")
##         fit       lwr       upr
## 1  3.472630 -1.963572  8.908831
## 2  5.912982  3.207729  8.618235
## 3  8.353334  7.069895  9.636773
## 4 10.793687  7.252676 14.334697

From this prediction model we see that if a team gives up only 2000 yards it is predicted that they will only lose about 3.5 games while if a team gives up 5000 yards, they are predicted to lose 10.80 games. we expect that 95% of my data will fall between these intervals.

Next, I will be making a graph with my two quantitative variables. I know some of yall don’t watch football so I thought labeling the dots would help. I know it looks messy in some places but now you can tell which teams are which dots. I also added my abline line and colored it to help it stand out.

plot(lose ~ Yds, data = data, main = "Loses Against Yards allowed", xlab = "Yards allowed", ylab = "Loses")
abline(mod, col="red", lwd=2)
text(lose~ Yds, labels=Tm,data=data, cex=0.5, font=2)

I thought this graph was really cool honestly. It is really interesting to see where each team falls. For example look at the Seattle Seahawks and Detroit Lions both gave up similar amount of yardages but one had only 4 losses while the lions had 11 losses.

Part 3

I will now compute the correlation between my two variables and do a hypothesis test on the correlation.

cor(data$Yds,data$lose)
## [1] 0.3014828

As you can see my correlation is .30142828 Now I will be doing an hypothesis test of the correlation test

cor.test(data$Yds,data$lose)
## 
##  Pearson's product-moment correlation
## 
## data:  data$Yds and data$lose
## t = 1.7319, df = 30, p-value = 0.09357
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05275726  0.58832861
## sample estimates:
##       cor 
## 0.3014828

I will reject the null hypothesis in this case and accept my alternative hypothesis of it not being equal to 0

Part 4

Now I will use a categorical variable and repeat the regression. I will now be predicting losses by Yds given up and winning record

fit <- lm(lose ~ Yds + winning.record., data)
fit
## 
## Call:
## lm(formula = lose ~ Yds + winning.record., data = data)
## 
## Coefficients:
##        (Intercept)                 Yds  winning.record.yes  
##           4.781008            0.001449           -5.859531
summary(fit)
## 
## Call:
## lm(formula = lose ~ Yds + winning.record., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8848 -0.9870  0.1921  0.6826  4.0877 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.7810080  2.6550400   1.801   0.0822 .  
## Yds                 0.0014491  0.0006758   2.144   0.0405 *  
## winning.record.yes -5.8595310  0.5740969 -10.207 4.14e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.578 on 29 degrees of freedom
## Multiple R-squared:  0.802,  Adjusted R-squared:  0.7884 
## F-statistic: 58.74 on 2 and 29 DF,  p-value: 6.319e-11

Now I am going to compute a prediction. I will create the ultimate defense, a NFL team that only gave up 2000 yards but somehow had a losing record ooooo interesting.

y <- data.frame(Yds = 2000, winning.record.  = "no")
predict(fit, y)
##        1 
## 7.679284

It is predicted that this team will lose about 7.7 games sadly.

Part 5

I will now make a graph for my categorical variable

plot(lose ~ Yds, data = data, main = "Loses Against Yards allowed", xlab = "Yards allowed", ylab = "Loses")
abline(fit, col="black", lwd=2)
## Warning in abline(fit, col = "black", lwd = 2): only using the first two of 3
## regression coefficients
abline(model, col="orange", lwd=2)
text(lose~ Yds, labels=Tm,data=data, cex=0.5, font=2)

So here I did my colors in ECU color, go tigers. The orange line is my original model and the black line the new LM we created above. In my opinion the new LM that we created above is better. We really cant predict anything here! I would say we need more data, maybe data from 2011-2021 season would be good enough.

Report

I thought the most interesting thing about my Project Part 1 was the prediction model. I thought it was really cool to see if you can predict losses by a team by a certain variable which was fun to play around with. I also enjoyed creating the last graph and adding the names to the variables. what I found out after doing Project Part 1 is that it is very hard to predict losses in the NFL, I think you could tell just by looking at the residual for my prediction. I now understand why ESPN always get predictions wrong.