Project Part 0: Hypothesis Testing

Part 1

First, I will be loading in data

data = read.csv('https://raw.githubusercontent.com/jacmantooth/applied_stats-/main/2020%20nfl%20defense%20.csv')
summary(data)
##       Tm                  G           Att             Cmp             Yds      
##  Length:32          Min.   :16   Min.   :494.0   Min.   :298.0   Min.   :3051  
##  Class :character   1st Qu.:16   1st Qu.:540.0   1st Qu.:342.5   1st Qu.:3573  
##  Mode  :character   Median :16   Median :557.0   Median :367.0   Median :3807  
##                     Mean   :16   Mean   :563.1   Mean   :367.4   Mean   :3842  
##                     3rd Qu.:16   3rd Qu.:581.2   3rd Qu.:380.8   3rd Qu.:4113  
##                     Max.   :16   Max.   :674.0   Max.   :450.0   Max.   :4697  
##        TD            DADOT            Air            YAC            Bltz      
##  Min.   :17.00   Min.   :7.000   Min.   :1604   Min.   :1376   Min.   : 98.0  
##  1st Qu.:22.75   1st Qu.:7.700   1st Qu.:2141   1st Qu.:1750   1st Qu.:145.8  
##  Median :28.00   Median :8.100   Median :2366   Median :1858   Median :180.0  
##  Mean   :27.22   Mean   :8.181   Mean   :2332   Mean   :1870   Mean   :187.2  
##  3rd Qu.:30.25   3rd Qu.:8.625   3rd Qu.:2502   3rd Qu.:2040   3rd Qu.:225.5  
##  Max.   :38.00   Max.   :9.500   Max.   :2970   Max.   :2282   Max.   :290.0  
##     Bltz.                Hrry          Hrry.                QBKD      
##  Length:32          Min.   :40.00   Length:32          Min.   :29.00  
##  Class :character   1st Qu.:50.00   Class :character   1st Qu.:41.75  
##  Mode  :character   Median :59.50   Mode  :character   Median :47.00  
##                     Mean   :59.84                      Mean   :48.47  
##                     3rd Qu.:69.25                      3rd Qu.:55.25  
##                     Max.   :80.00                      Max.   :71.00  
##     QBKD.                 Sk             Prss          Prss.          
##  Length:32          Min.   :17.00   Min.   :101.0   Length:32         
##  Class :character   1st Qu.:28.50   1st Qu.:132.5   Class :character  
##  Mode  :character   Median :36.50   Median :143.0   Mode  :character  
##                     Mean   :35.47   Mean   :143.8                     
##                     3rd Qu.:42.75   3rd Qu.:160.2                     
##                     Max.   :56.00   Max.   :202.0                     
##       MTkl             win              lose             tie        
##  Min.   : 68.00   Min.   : 1.000   Min.   : 2.000   Min.   :0.0000  
##  1st Qu.: 96.25   1st Qu.: 5.000   1st Qu.: 5.000   1st Qu.:0.0000  
##  Median :109.00   Median : 7.500   Median : 8.500   Median :0.0000  
##  Mean   :107.72   Mean   : 7.969   Mean   : 7.969   Mean   :0.0625  
##  3rd Qu.:120.50   3rd Qu.:11.000   3rd Qu.:11.000   3rd Qu.:0.0000  
##  Max.   :147.00   Max.   :14.000   Max.   :15.000   Max.   :1.0000  
##       rank             PF             Y.P              TO       
##  Min.   : 1.00   Min.   :296.0   Min.   :4.600   Min.   : 9.00  
##  1st Qu.: 8.75   1st Qu.:356.5   1st Qu.:5.300   1st Qu.:18.75  
##  Median :16.50   Median :382.5   Median :5.500   Median :22.00  
##  Mean   :16.50   Mean   :396.6   Mean   :5.572   Mean   :20.78  
##  3rd Qu.:24.25   3rd Qu.:440.8   3rd Qu.:5.900   3rd Qu.:23.00  
##  Max.   :32.00   Max.   :519.0   Max.   :6.300   Max.   :29.00  
##        FL             X1stD            Int            X1stDP     
##  Min.   : 4.000   Min.   :280.0   Min.   : 3.00   Min.   :161.0  
##  1st Qu.: 6.750   1st Qu.:330.5   1st Qu.:10.00   1st Qu.:186.8  
##  Median : 8.000   Median :349.0   Median :12.00   Median :205.0  
##  Mean   : 8.438   Mean   :347.1   Mean   :12.34   Mean   :204.6  
##  3rd Qu.:10.000   3rd Qu.:366.2   3rd Qu.:15.00   3rd Qu.:215.0  
##  Max.   :15.000   Max.   :415.0   Max.   :18.00   Max.   :253.0  
##     RushAtt          Yds.1           TD.1            Y.A       
##  Min.   :358.0   Min.   :1289   Min.   :10.00   Min.   :3.600  
##  1st Qu.:405.2   1st Qu.:1765   1st Qu.:13.00   1st Qu.:4.100  
##  Median :429.0   Median :1888   Median :16.00   Median :4.500  
##  Mean   :431.0   Mean   :1902   Mean   :16.62   Mean   :4.388  
##  3rd Qu.:449.5   3rd Qu.:2030   3rd Qu.:19.25   3rd Qu.:4.600  
##  Max.   :517.0   Max.   :2564   Max.   :27.00   Max.   :5.200  
##      X1stDR         playoff          winning.record.   
##  Min.   : 78.00   Length:32          Length:32         
##  1st Qu.: 97.75   Class :character   Class :character  
##  Median :113.00   Mode  :character   Mode  :character  
##  Mean   :111.28                                        
##  3rd Qu.:120.50                                        
##  Max.   :145.00

Part 2

The quantitative variable I picked for this project is Touchdowns (TD), which in my data set is the number of touchdowns a defense gave up. The next part of this project is the summary statistics, starting with the five-number summary and the mean of my variable.

summary(data$TD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   22.75   28.00   27.22   30.25   38.00

My mean is 27.22, the minimum is 17, the 1st quartile is 22.75, the median is 28.00, the 3rd quartile is 30.25, and the maximum is 38 touchdowns allowed. I will now find the standard deviation of my variable.

sd(data$TD)
## [1] 5.210005

My standard deviation is about 5.21 touchdowns.

The next part of my project is the graphical display. I will first create a histogram.

hist(data$TD)

This graph shows that not many teams allowed 35+ touchdowns, and only a few allowed as few as roughly 15-20. The next graph I will be making is the box plot.

boxplot(data$TD)

This shows the five-number summary, which I already talked about above. Finally, I will be making the Q-Q plot.

qqnorm(data$TD)
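A reference line makes the Q-Q plot easier to judge; here is a minimal sketch using base R's qqline():

qqnorm(data$TD)
qqline(data$TD, col = "red")  # points hugging this line suggest roughly normal data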

The Q-Q plot is a normality check; the points do not show any obvious pattern of departure, so the touchdowns-allowed data looks reasonably normal. Now I will explore a categorical variable. The one I used is playoff, which records whether a team made the playoffs or not.

table(data$playoff)
## 
##  no yes 
##  18  14

As you can see, fewer than half of the teams made the playoffs; the NFL only allows 14 teams in. The next part will be my relative frequency table.

table(data$playoff)/length(data$playoff)
## 
##     no    yes 
## 0.5625 0.4375

For this next part I made a two-way table of how many times a team blitzed against whether that team made the playoffs.

table(data$playoff,data$Bltz)
##      
##       98 108 131 133 135 138 142 147 150 153 159 171 174 178 179 181 194 195
##   no   1   0   1   1   1   0   1   1   0   1   1   0   1   1   1   1   0   0
##   yes  0   1   1   0   0   1   0   0   1   0   0   1   0   0   0   0   1   1
##      
##       197 203 212 221 224 230 244 247 249 253 256 268 290
##   no    0   1   1   0   1   0   0   1   1   0   1   0   0
##   yes   1   0   0   1   0   1   1   0   0   1   0   1   1

You really can't tell anything from this, since nearly every blitz count is unique, so the table is not that useful.

Hypothesis Testing Part 3

I will be performing a hypothesis test comparing the mean to a fixed value, and I will state my null and alternative hypotheses. \[ H_0: \mu = 3000 \\ H_A: \mu\neq 3000 \] My null hypothesis says that the mean passing yards allowed is 3000 yards, while my alternative hypothesis says that it is not 3000.

t.test(data$Yds, mu=3000)
## 
##  One Sample t-test
## 
## data:  data$Yds
## t = 11.242, df = 31, p-value = 1.822e-12
## alternative hypothesis: true mean is not equal to 3000
## 95 percent confidence interval:
##  3689.576 3995.236
## sample estimates:
## mean of x 
##  3842.406

So basically, I am going to reject my null hypothesis: the p-value is essentially zero and 3000 yards falls far below the 95% confidence interval.

I will now be performing a hypothesis test comparing two means, again stating my null and alternative hypotheses. I will be comparing air yards and yards after the catch (YAC). My null hypothesis is that mean air yardage equals mean YAC, while my alternative hypothesis is that they are not equal. I will have Air as \(\mu_1\) and YAC as \(\mu_2\).

\[ H_0: \mu_1 = \mu_2 \\ H_A: \mu_1 \neq \mu_2 \]

t.test(data$Air,data$YAC)
## 
##  Welch Two Sample t-test
## 
## data:  data$Air and data$YAC
## t = 7.3269, df = 57.461, p-value = 8.693e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  335.7103 588.1647
## sample estimates:
## mean of x mean of y 
##  2331.625  1869.688

With a p-value of 8.7e-10, I can safely reject the null hypothesis and conclude that the two means are not equal.

Project Part 1: Regression

Part 1

The two quantitative variables that I have picked are losses and yards allowed. Losses is how many losses a team had that season, and yards allowed is how many yards that team's defense gave up.

I will first fit a linear model: I want to predict losses from the yards allowed by a defense.

lm(lose ~ Yds, data = data)
## 
## Call:
## lm(formula = lose ~ Yds, data = data)
## 
## Coefficients:
## (Intercept)          Yds  
##    -1.40808      0.00244
mod=lm(lose ~ Yds, data = data)
model = mod

I have now created a linear model with Yds allowed as the predictor for losses. My slope is 0.00244 and my intercept is -1.40808. This says that to predict losses we multiply the yards allowed by 0.00244 and then add -1.40808; equivalently, every additional yard allowed adds about 0.00244 to the predicted losses.
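As a quick check of that interpretation, the fitted equation can be applied by hand; a small sketch (the 4000-yard figure is just an illustrative value, not a team from the data):

coef(mod)                  # intercept and slope from the fitted model
-1.40808 + 0.00244 * 4000  # a defense allowing 4000 yards is predicted to lose about 8.35 games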

I will now do a summary function to get more info on the linear regression

summary(model)
## 
## Call:
## lm(formula = lose ~ Yds, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8140 -3.0360  0.4666  2.9385  6.0829 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.408075   5.446100  -0.259   0.7978  
## Yds          0.002440   0.001409   1.732   0.0936 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.326 on 30 degrees of freedom
## Multiple R-squared:  0.09089,    Adjusted R-squared:  0.06059 
## F-statistic: 2.999 on 1 and 30 DF,  p-value: 0.09357

Oh, wow, so much stuff. We see the residuals and the coefficients in this summary. The Yds coefficient is not significant at the 5% level (p = 0.0936), and the same goes for the intercept (p = 0.7978).

I will now plot my linear model

plot(model)

The most interesting graph here is the normal Q-Q plot, which checks the normality of the residuals. The points follow the line well in the middle and drift off a bit towards the top, so I would say the residuals are reasonably normal.

Part 2

I will now use the predict command to predict each team's losses based on yards given up.

Name_predict<-predict(model)
names(Name_predict)<-c(data$Tm)
Name_predict
##        Arizona Cardinals          Atlanta Falcons         Baltimore Ravens 
##                 7.433321                10.054260                 7.221011 
##            Buffalo Bills        Carolina Panthers            Chicago Bears 
##                 7.684678                 7.926273                 7.633430 
##       Cincinnati Bengals         Cleveland Browns           Dallas Cowboys 
##                 8.009245                 8.260601                 7.479688 
##           Denver Broncos            Detroit Lions        Green Bay Packers 
##                 7.882346                 9.715051                 7.228332 
##           Houston Texans       Indianapolis Colts     Jacksonville Jaguars 
##                 8.607131                 8.026327                 8.917056 
##       Kansas City Chiefs        Las Vegas Raiders     Los Angeles Chargers 
##                 7.814016                 8.870689                 7.323506 
##         Los Angeles Rams           Miami Dolphins        Minnesota Vikings 
##                 6.037440                 8.411903                 8.697424 
##     New England Patriots       New Orleans Saints          New York Giants 
##                 7.272258                 7.064828                 7.882346 
##            New York Jets      Philadelphia Eagles      Pittsburgh Steelers 
##                 9.351438                 7.860383                 6.181421 
##      San Francisco 49ers         Seattle Seahawks     Tampa Bay Buccaneers 
##                 6.710977                 9.719932                 8.219115 
##         Tennessee Titans Washington Football Team 
##                 9.424649                 6.078926

This is so cool in my opinion. If you watched football you know that the Washington Football Team definitely had more than 6 losses, and I find it interesting that the Falcons are predicted to have the most losses in the NFL, lol.

I will now look at the residuals for my predictions; a residual is the actual value minus the predicted value.

Name_residuals<-residuals(model)
names(Name_residuals)<-c(data$Tm)
Name_residuals
##        Arizona Cardinals          Atlanta Falcons         Baltimore Ravens 
##               0.56667856               1.94574012              -2.22101078 
##            Buffalo Bills        Carolina Panthers            Chicago Bears 
##              -4.68467773               3.07372738               0.36656967 
##       Cincinnati Bengals         Cleveland Browns           Dallas Cowboys 
##               2.99075540              -3.26060089               2.52031187 
##           Denver Broncos            Detroit Lions        Green Bay Packers 
##               3.11765373               1.28494910              -4.22833184 
##           Houston Texans       Indianapolis Colts     Jacksonville Jaguars 
##               3.39286907              -3.02632706               6.08294432 
##       Kansas City Chiefs        Las Vegas Raiders     Los Angeles Chargers 
##              -5.81401641              -0.87068898               1.67649442 
##         Los Angeles Rams           Miami Dolphins        Minnesota Vikings 
##              -0.03743988              -2.41190274               0.30257604 
##     New England Patriots       New Orleans Saints          New York Giants 
##               1.72774182              -3.06482823               2.11765373 
##            New York Jets      Philadelphia Eagles      Pittsburgh Steelers 
##               4.64856160               3.13961690              -2.18142067 
##      San Francisco 49ers         Seattle Seahawks     Tampa Bay Buccaneers 
##               3.28902286              -5.71993161              -3.21911490 
##         Tennessee Titans Washington Football Team 
##              -4.42464897               2.92107413

As you can see, the predictions were all over the place, with some over and some under, though a few are pretty spot on. I will now create a histogram to see if the residuals are normal.

hist(residuals(model),xlab = "residual")

Clearly it is not perfectly normal; maybe a little normal, but mostly not.

I will now predict losses based on yards given up. Let's say four hypothetical teams gave up 2000, 3000, 4000, and 5000 yards. I will predict the losses each team would have using my model and also show the confidence intervals.

ylist =c (2000,3000,4000,5000)
point <- data.frame(Yds = ylist)
predict(model,point, interval = "confidence")
##         fit       lwr       upr
## 1  3.472630 -1.963572  8.908831
## 2  5.912982  3.207729  8.618235
## 3  8.353334  7.069895  9.636773
## 4 10.793687  7.252676 14.334697

From this model we see that a team giving up only 2000 yards is predicted to lose about 3.5 games, while a team giving up 5000 yards is predicted to lose about 10.8 games. These are 95% confidence intervals for the mean number of losses at each yardage, not for individual teams.
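If we instead wanted an interval for an individual team's loss total (rather than the mean), predict() also supports prediction intervals; a minimal sketch reusing the same points:

predict(model, point, interval = "prediction")  # wider intervals covering where a single team's losses should fall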

Next, I will be making a graph of my two quantitative variables. I know some of y'all don't watch football, so I thought labeling the dots would help. It looks messy in some places, but now you can tell which team is which dot. I also added the regression line with abline and colored it to help it stand out.

plot(lose ~ Yds, data = data, main = "Loses Against Yards allowed", xlab = "Yards allowed", ylab = "Loses")
abline(mod, col="red", lwd=2)
text(lose~ Yds, labels=Tm,data=data, cex=0.5, font=2)

I thought this graph was really cool, honestly. It is interesting to see where each team falls. For example, the Seattle Seahawks and Detroit Lions gave up similar amounts of yardage, but the Seahawks had only 4 losses while the Lions had 11.

Part 3

I will now compute the correlation between my two variables and do a hypothesis test on the correlation.

cor(data$Yds,data$lose)
## [1] 0.3014828

As you can see, the correlation is 0.3014828. Now I will do a hypothesis test on the correlation.

cor.test(data$Yds,data$lose)
## 
##  Pearson's product-moment correlation
## 
## data:  data$Yds and data$lose
## t = 1.7319, df = 30, p-value = 0.09357
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05275726  0.58832861
## sample estimates:
##       cor 
## 0.3014828

Since the p-value is 0.09357 (above 0.05) and the 95% confidence interval contains 0, I actually fail to reject the null hypothesis here; there is only weak evidence that the correlation differs from 0.

Part 4

Now I will add a categorical variable and repeat the regression, predicting losses from Yds given up and winning record.

fit <- lm(lose ~ Yds + winning.record., data)
fit
## 
## Call:
## lm(formula = lose ~ Yds + winning.record., data = data)
## 
## Coefficients:
##        (Intercept)                 Yds  winning.record.yes  
##           4.781008            0.001449           -5.859531
summary(fit)
## 
## Call:
## lm(formula = lose ~ Yds + winning.record., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8848 -0.9870  0.1921  0.6826  4.0877 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.7810080  2.6550400   1.801   0.0822 .  
## Yds                 0.0014491  0.0006758   2.144   0.0405 *  
## winning.record.yes -5.8595310  0.5740969 -10.207 4.14e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.578 on 29 degrees of freedom
## Multiple R-squared:  0.802,  Adjusted R-squared:  0.7884 
## F-statistic: 58.74 on 2 and 29 DF,  p-value: 6.319e-11

Now I am going to compute a prediction. I will create the ultimate defense: an NFL team that gave up only 2000 yards but somehow did not have a winning record, ooooo interesting.

y <- data.frame(Yds = 2000, winning.record.  = "no")
predict(fit, y)
##        1 
## 7.679284

Sadly, this team is predicted to lose about 7.7 games.

Part 5

I will now make a graph for my categorical variable

plot(lose ~ Yds, data = data, main = "Loses Against Yards allowed", xlab = "Yards allowed", ylab = "Loses")
abline(fit, col="black", lwd=2)
## Warning in abline(fit, col = "black", lwd = 2): only using the first two of 3
## regression coefficients
abline(model, col="orange", lwd=2)
text(lose~ Yds, labels=Tm,data=data, cex=0.5, font=2)

So here I did my colors in ECU colors, go tigers. The orange line is my original model and the black line is the new model created above (note the warning: abline only uses the first two of the three coefficients, so the black line ignores the winning-record term). In my opinion the new model is better, but a single predictor really can't predict much here. I would say we need more data; maybe the 2011-2021 seasons would be enough.

Report

I thought the most interesting thing about Project Part 1 was the prediction model. It was really cool to see whether you can predict a team's losses from a single variable, and it was fun to play around with. I also enjoyed creating the last graph and adding team names to the points. What I found out after doing Project Part 1 is that it is very hard to predict losses in the NFL; you can tell just by looking at the residuals for my predictions. I now understand why ESPN always gets predictions wrong.

Project Part 2: Bootstrap and Cross Validation

Part 1: Bootstrapping

Part 1A: Repeat the hypothesis tests you performed in Module 0 using bootstrap techniques. Compare the confidence intervals and results. Compute a p-value for the test statistic.

What is bootstrapping, one may ask? Bootstrapping is a statistical procedure that resamples a single dataset with replacement to create many simulated samples.

First things first, I want to set my seed and load the boot library so that I can bootstrap. After that, I create a function that finds the mean of a resample. I can then bootstrap using 10,000 resamples.

library(boot)
set.seed(52)
samp_mean <- function(x,i){
  mean( x[i])
}


results <- boot(data$Yds, samp_mean,10000)
plot(results)

As you can see, the bootstrap distribution is pretty normal, which is a good thing. I will now compare the confidence intervals and results. Let's first see what the bootstrap confidence intervals are.

boot.ci(results)
## Warning in boot.ci(results): bootstrap variances needed for studentized
## intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   (3698, 3986 )   (3700, 3987 )  
## 
## Level     Percentile            BCa          
## 95%   (3697, 3985 )   (3696, 3984 )  
## Calculations and Intervals on Original Scale

So, the population mean is estimated to be between about 3697 and 3985 yards allowed (percentile interval). We can now compare the bootstrap confidence interval to the t-test confidence interval. I will reuse my t-test from earlier, where the null hypothesis says mean passing yards allowed is 3000 yards and the alternative says it is not 3000. \[ H_0: \mu = 3000 \\ H_A: \mu\neq 3000 \]

t.test(data$Yds, mu=3000)
## 
##  One Sample t-test
## 
## data:  data$Yds
## t = 11.242, df = 31, p-value = 1.822e-12
## alternative hypothesis: true mean is not equal to 3000
## 95 percent confidence interval:
##  3689.576 3995.236
## sample estimates:
## mean of x 
##  3842.406

We see that the t-test confidence interval is 3689.576 to 3995.236 yards, while the bootstrap confidence interval is about 3697 to 3985 yards. The two confidence intervals are pretty darn close. In both cases we would reject the null hypothesis, since 3000 falls well below both intervals.

I will now compute a p-value with the bootstrap, using a t statistic: \[ t = \frac{\mu_0-\overline x}{SE} \] First, I will get \(\overline x\) from the bootstrap object (results$t0 is the statistic computed on the original sample).

xbar = results$t0
xbar
## [1] 3842.406

Next I will find the \(SE\) using the \(SD\) of the bootstrap statistics.

se=sd(results$t)

Since my bootstrap distribution above looks pretty normal, we can also get a confidence interval using \(\overline x \pm 2SE\).

c(results$t0-2*se,results$t0+2*se)
## [1] 3695.422 3989.391

My confidence interval is 3695.422 to 3989.391. We will now find the t statistic.

t = (3000-results$t0)/se
1-pt(t,9999)
## [1] 1

The value printed is 1 because t is large and negative, so 1 - pt(t) gives the upper tail, which is nearly the whole distribution. A proper two-sided p-value doubles the smaller (lower) tail, which is essentially zero, so once again we reject the null hypothesis of 3000 yards.
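For reference, a two-sided version of that calculation can be written directly; a small sketch using the same bootstrap standard error (the degrees of freedom here are just an approximation):

t_stat <- (results$t0 - 3000) / se   # how many bootstrap SEs the sample mean sits above the null value
2 * pt(-abs(t_stat), df = 31)        # two-sided p-value; essentially zero, consistent with rejecting H0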

Part 1B: Create a hypothesis test on a non-parametric statistic (median, mode, min, max, etc.). Perform the hypothesis test using the bootstrap method. Be sure to create confidence intervals and properly state your conclusion. Compute a p-value for the test statistic.

Here I am going to use the median as my non-parametric statistic. The reason I picked the median is that its bootstrap distribution was the most normal of the four, and the more normal the statistic's distribution, the easier it is to work with.

The first thing I will do is create a function to use when bootstrapping.

samp_median <- function(x,i){
  median(x[i])
}

boot2 <- boot(data$Yds,samp_median, R = 10000)
plot(boot2)

This isn't the most normal, but it's the best of the four. You should have seen some of the others, lol; they were not normal at all. Let's see what our bootstrap confidence interval is.

boot.ci(boot2,type = "perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot2, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   (3674, 3962 )  
## Calculations and Intervals on Original Scale

We see that the 95% percentile confidence interval is 3674 to 3962. Since the bootstrap distribution is roughly normal, we can also get an interval using \(\overline x \pm 2SE\).

xbar=mean(boot2$t)
s = sd(boot2$t)
c(xbar- 2*s, xbar+2*s)
## [1] 3678.857 3943.784

As you can see our two confidence intervals are pretty darn close. so, either way would have been fine.

I will now find the p-value for the test statistic. My hypotheses, stated for the median, are: \[ H_0: \textrm{median} = 3642 \\ H_A: \textrm{median} \neq 3642 \]

mean(3642==boot2$t)
## [1] 0.0077
t = (3642-boot2$t0)/s
pt(t,9999)
## [1] 0.006372729

So fewer than 1% of the bootstrap medians were exactly 3642. You may ask why I picked 3642: it was the first number that gave me a non-zero proportion, and I wanted something other than 0 because 0 is boring and empty. Using the t-style calculation, the p-value of about 0.006 is less than 5%, so we can reject the null hypothesis.
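A two-sided version of the same t-style calculation simply doubles the tail probability; a quick sketch:

t_med <- (3642 - boot2$t0) / s   # distance from the null median in bootstrap SEs
2 * pt(-abs(t_med), df = 9999)   # roughly twice the value above and still below 0.05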

Part 2: Cross Validation

Part 2A: Randomly divide your data into two pieces using a two-thirds split. Withhold a third of the data and perform the regression. Graph a scatterplot identifying the data used in the testing and training and including the linear regression. Examine the prediction on the testing data using the R2 and RMSE as well as the confidence intervals on the regression.

Now I will be doing cross validation. I first need to load the caret library; after that I will divide my data, taking a two-thirds partition (stored in TS) that will become the training sample.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
## Loading required package: ggplot2
TS <- createDataPartition(data$rank, p =.66, list=FALSE)
TS
##       Resample1
##  [1,]         1
##  [2,]         3
##  [3,]         5
##  [4,]         6
##  [5,]         7
##  [6,]         8
##  [7,]         9
##  [8,]        10
##  [9,]        11
## [10,]        12
## [11,]        16
## [12,]        17
## [13,]        18
## [14,]        21
## [15,]        22
## [16,]        23
## [17,]        24
## [18,]        25
## [19,]        26
## [20,]        27
## [21,]        29
## [22,]        30
## [23,]        31
## [24,]        32

Now I will split my data into two data frames: the rows selected by the partition become my training data, and the rows not selected become my test data.

trainData<- data[TS,]
testData<- data[-TS,]
testData[]
##                      Tm  G Att Cmp  Yds TD DADOT  Air  YAC Bltz  Bltz. Hrry
## 2       Atlanta Falcons 16 625 425 4697 34   8.5 2757 2248  224 32.80%   73
## 4         Buffalo Bills 16 573 369 3726 23   7.0 2192 1758  230 35.80%   51
## 13       Houston Texans 16 541 377 4104 30   7.7 2232 2282  212 35.90%   47
## 14   Indianapolis Colts 16 562 369 3866 24   8.0 2546 1766  108 17.10%   71
## 15 Jacksonville Jaguars 16 533 370 4231 34   8.3 2238 2101  179 31.00%   40
## 19     Los Angeles Rams 16 548 347 3051 17   7.4 1604 1747  171 27.30%   46
## 20       Miami Dolphins 16 545 343 4024 21   9.2 2365 2036  247 40.60%   56
## 28  San Francisco 49ers 16 537 341 3327 25   7.8 1907 1673  203 33.60%   69
##     Hrry. QBKD QBKD. Sk Prss  Prss. MTkl win lose tie rank  PF Y.P TO FL X1stD
## 2  10.70%   59 9.40% 29  161 23.60%  100   4   12   0   19 414 6.2 21  9   367
## 4   7.90%   46 8.00% 38  135 21.00%  127  13    3   0   16 375 5.5 26 11   351
## 13  8.00%   40 7.40% 34  121 20.50%  125   4   12   0   27 464 6.2  9  6   390
## 14 11.20%   34 6.00% 40  145 22.90%   70  11    5   0   10 362 5.4 25 10   329
## 15  6.90%   49 9.20% 18  107 18.50%  115   1   15   0   31 492 6.3 17  5   394
## 19  7.30%   45 8.20% 53  144 23.00%   79  10    6   0    1 296 4.6 22  8   280
## 20  9.20%   46 8.40% 41  143 23.50%  107  10    6   0    6 338 5.9 29 11   336
## 28 11.40%   38 7.10% 30  137 22.70%   68   6   10   0   17 390 5.0 20  8   311
##    Int X1stDP RushAtt Yds.1 TD.1 Y.A X1stDR playoff winning.record.
## 2   12    239     380  1677   15 4.4     97      no              no
## 4   15    199     414  1914   21 4.6    125     yes             yes
## 13   3    222     493  2564   24 5.2    145      no              no
## 14  15    203     390  1448   16 3.7     94     yes             yes
## 15  12    210     517  2452   23 4.7    140      no              no
## 19  14    170     388  1460   12 3.8     91     yes             yes
## 20  18    193     412  1862   17 4.5    113      no             yes
## 28  12    176     430  1703   12 4.0    102      no              no

Oooo, look at my test data, so nice and pretty. Just what we want!

Now I'm going to try to predict wins based on the number of times a team blitzed, using the training data.

model<- lm(win ~ Bltz, data = trainData)
summary(model)
## 
## Call:
## lm(formula = win ~ Bltz, data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6084 -2.0162  0.2017  2.3528  5.5889 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  4.08200    2.38886   1.709   0.1016  
## Bltz         0.02219    0.01250   1.776   0.0896 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.161 on 22 degrees of freedom
## Multiple R-squared:  0.1254, Adjusted R-squared:  0.08562 
## F-statistic: 3.154 on 1 and 22 DF,  p-value: 0.08959

Now I will add a new column to my data recording which set each row belongs to: if a row was in the training set it is labeled Train, otherwise it is labeled Test.

data[TS, 'Test_Train']<- "Train"
data[-TS, 'Test_Train']<- "Test"
data$Test_Train
##  [1] "Train" "Test"  "Train" "Test"  "Train" "Train" "Train" "Train" "Train"
## [10] "Train" "Train" "Train" "Test"  "Test"  "Test"  "Train" "Train" "Train"
## [19] "Test"  "Test"  "Train" "Train" "Train" "Train" "Train" "Train" "Train"
## [28] "Test"  "Train" "Train" "Train" "Train"

So pretty, and just what I wanted.

I will next graph a scatterplot identifying the testing and training data and include the linear regression. I thought ggplot was way better and looked much nicer.

library(ggplot2)
ggplot(data = data, mapping = aes(Bltz,win,color = Test_Train))+
  geom_jitter()+
  geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

What I get from this graph is that my test and training data behave quite differently: the test data's fitted line is nearly flat while the training data's line has a positive slope.

The next thing I will be doing is examining the prediction on the testing data using the R2 and RMSE as well as the confidence intervals on the regression.

pred <-predict(model,data[-TS,])
pred
##        2        4       13       14       15       19       20       28 
## 9.053513 9.186678 8.787182 6.478977 8.054771 7.877216 9.563981 8.587433

These are the predicted win totals for the test data based on blitzes (the numbers on top are row indices, so you can match them back to teams). I will now compute the R-squared value comparing the predictions to the actual wins.

R2(pred,data[-TS,"win"])
## [1] 0.007137425

We see that the R-squared on the test data is only about 0.007, compared to about 0.125 on the training data. The model explains very little on the held-out teams, so the training fit does not carry over well.
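The assignment also asks for RMSE on the testing data; caret provides an RMSE() helper, so a minimal sketch would be:

RMSE(pred, data[-TS, "win"])  # typical size of the prediction error (in wins) on the held-out teams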

Part 2B: Repeat using a 10-fold cross validation. Be certain to compare confidence intervals. You need not create the visualizations but be sure to comment on the results.

What is k-fold cross validation? You divide your data into k pieces, hold back one piece, fit the model on the remaining pieces, and test it on the piece you held back. Then you repeat, holding back a different piece each time, for k rounds in total. This gives k estimates of the test performance, which we can summarize to judge how well the model generalizes.

I will now do the last part of this project using k-fold cross validation, with k = 10.

trainc <- trainControl(method = "cv", number = 10)
model2 <- train(win ~ Bltz, data = data,
                method = "lm",
                trControl = trainc)
print(model2)
## Linear Regression 
## 
## 32 samples
##  1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 29, 29, 29, 28, 30, 28, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE   
##   3.210976  0.6454075  2.8723
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

We used 32 samples; the cross-validated R-squared is 0.645, and we can see the fold sample sizes and the number of folds. The thing that interests me most is the R-squared. It looks surprisingly high compared to the in-sample fit, which is probably an artifact of averaging over very small folds, so I would only conclude that there may be a small connection between blitzing and winning.
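To actually compare confidence intervals, the per-fold results stored in the fitted object can be used; a sketch (assuming caret's default behavior of keeping fold-level metrics in model2$resample):

model2$resample  # one RMSE, R-squared, and MAE value per fold
mean(model2$resample$RMSE) + c(-2, 2) * sd(model2$resample$RMSE) / sqrt(10)  # rough 95% interval for the cross-validated RMSE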

I will now look at a summary of model2.

summary(model2)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9534 -2.2919  0.1584  2.9381  5.6245 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  4.98481    2.39204   2.084   0.0458 *
## Bltz         0.01594    0.01235   1.290   0.2068  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.459 on 30 degrees of freedom
## Multiple R-squared:  0.05258,    Adjusted R-squared:  0.021 
## F-statistic: 1.665 on 1 and 30 DF,  p-value: 0.2068

This summary is the final model refit on the full data set; note that the Bltz coefficient is small and not significant here (p = 0.2068), which again suggests blitz count alone is a weak predictor of wins.

Part 3: Write your report!

I thought some parts of this project were really hard to understand. I had trouble wrapping my head around k-fold cross validation, but after watching a ton of videos I somewhat understand it now; I thought understanding this project was harder than actually doing it. One success of my bootstrapping was that everything went right and was pretty easy to get done. One failure was that I had a hard time finding a non-parametric statistic whose bootstrap distribution was normal. One success of cross validation was that I was able to create the graph and the predictions on the testing data. I had two major failures with cross validation: first, not understanding it that well, and second, the two variables really didn't tell us much, because predicting football takes a lot more than two simple variables. Also, do not leave your cat alone with your laptop. I was almost finished with this project and got up to go to the restroom; when I came back she was on my laptop and had somehow closed out of R, and I lost a lot of my project, LOL. This isn't the first time she has done this either, smh.

Project Part 3: Goodness of Fit

Part 1: Goodness of Fit

Part 1A: Use a categorical variable and perform a goodness of fit test. There should be more than two categories for this test to work properly. It is easiest to assume that all categories should be represented equally but if you have reason to suggest another model, please explain. In either case include the expected values.

Let's look at our data.

head(data)
##                  Tm  G Att Cmp  Yds TD DADOT  Air  YAC Bltz  Bltz. Hrry  Hrry.
## 1 Arizona Cardinals 16 570 365 3623 26   7.7 1920 2057  256 39.30%   66 10.10%
## 2   Atlanta Falcons 16 625 425 4697 34   8.5 2757 2248  224 32.80%   73 10.70%
## 3  Baltimore Ravens 16 596 380 3536 22   7.9 2078 1787  290 44.10%   60  9.10%
## 4     Buffalo Bills 16 573 369 3726 23   7.0 2192 1758  230 35.80%   51  7.90%
## 5 Carolina Panthers 16 585 398 3825 28   7.4 2372 1971  153 24.00%   60  9.40%
## 6     Chicago Bears 16 547 350 3705 28   8.6 2377 1751  131 21.40%   50  8.20%
##   QBKD  QBKD. Sk Prss  Prss. MTkl win lose tie rank  PF Y.P TO FL X1stD Int
## 1   54  9.50% 48  168 25.80%  111   8    8   0   12 367 5.3 21 10   363  11
## 2   59  9.40% 29  161 23.60%  100   4   12   0   19 414 6.2 21  9   367  12
## 3   71 11.90% 39  170 25.90%  135  11    5   0    2 303 5.2 22 12   347  10
## 4   46  8.00% 38  135 21.00%  127  13    3   0   16 375 5.5 26 11   351  15
## 5   45  7.70% 29  134 21.00%  119   5   11   0   18 402 5.6 22 15   360   7
## 6   48  8.80% 35  133 21.80%   89   8    8   0   14 370 5.4 18  8   331  10
##   X1stDP RushAtt Yds.1 TD.1 Y.A X1stDR playoff winning.record. Test_Train
## 1    207     436  2008   13 4.6    118      no              no      Train
## 2    239     380  1677   15 4.4     97      no              no       Test
## 3    213     382  1740   12 4.6     96     yes             yes      Train
## 4    199     414  1914   21 4.6    125     yes             yes       Test
## 5    211     408  1936   17 4.7    114      no              no      Train
## 6    190     438  1814   11 4.1    104     yes              no      Train

As you can see, my data doesn't have many categorical variables in it, so I am going to make my own categorical variables using the data given.

Here I am going to make a few new categorical variables.

data[which(data$rank <= 16),"D_rank"] = "Top"
data[which(data$rank > 16),"D_rank"] = "Bottom"
data[which(data$Prss. > 20 ),"Prss above 20%"] = "yes"
data[which(data$Prss. <= 20 ),"Prss above 20%"] = "no"
data[which(data$D_rank == 'Top' & data$playoff == 'yes'),"top rank d and play off"] = "yes"
data[which(data$D_rank == 'Top' & data$playoff == 'no'| data$D_rank == 'Bottom' & data$playoff == 'yes' | data$D_rank == 'Bottom' & data$playoff == 'no'),"top rank d and play off"] = "no"
data[which(data$D_rank == 'Top' &  data$playoff == 'yes' & data$`Prss above 20%` == 'yes'),"playoff_topD_pabove20"]="yes" 
data$playoff_topD_pabove20[is.na(data$playoff_topD_pabove20)] = 'no'
head(data)
##                  Tm  G Att Cmp  Yds TD DADOT  Air  YAC Bltz  Bltz. Hrry  Hrry.
## 1 Arizona Cardinals 16 570 365 3623 26   7.7 1920 2057  256 39.30%   66 10.10%
## 2   Atlanta Falcons 16 625 425 4697 34   8.5 2757 2248  224 32.80%   73 10.70%
## 3  Baltimore Ravens 16 596 380 3536 22   7.9 2078 1787  290 44.10%   60  9.10%
## 4     Buffalo Bills 16 573 369 3726 23   7.0 2192 1758  230 35.80%   51  7.90%
## 5 Carolina Panthers 16 585 398 3825 28   7.4 2372 1971  153 24.00%   60  9.40%
## 6     Chicago Bears 16 547 350 3705 28   8.6 2377 1751  131 21.40%   50  8.20%
##   QBKD  QBKD. Sk Prss  Prss. MTkl win lose tie rank  PF Y.P TO FL X1stD Int
## 1   54  9.50% 48  168 25.80%  111   8    8   0   12 367 5.3 21 10   363  11
## 2   59  9.40% 29  161 23.60%  100   4   12   0   19 414 6.2 21  9   367  12
## 3   71 11.90% 39  170 25.90%  135  11    5   0    2 303 5.2 22 12   347  10
## 4   46  8.00% 38  135 21.00%  127  13    3   0   16 375 5.5 26 11   351  15
## 5   45  7.70% 29  134 21.00%  119   5   11   0   18 402 5.6 22 15   360   7
## 6   48  8.80% 35  133 21.80%   89   8    8   0   14 370 5.4 18  8   331  10
##   X1stDP RushAtt Yds.1 TD.1 Y.A X1stDR playoff winning.record. Test_Train
## 1    207     436  2008   13 4.6    118      no              no      Train
## 2    239     380  1677   15 4.4     97      no              no       Test
## 3    213     382  1740   12 4.6     96     yes             yes      Train
## 4    199     414  1914   21 4.6    125     yes             yes       Test
## 5    211     408  1936   17 4.7    114      no              no      Train
## 6    190     438  1814   11 4.1    104     yes              no      Train
##   D_rank Prss above 20% top rank d and play off playoff_topD_pabove20
## 1    Top            yes                      no                    no
## 2 Bottom            yes                      no                    no
## 3    Top            yes                     yes                   yes
## 4    Top            yes                     yes                   yes
## 5 Bottom            yes                      no                    no
## 6    Top            yes                     yes                   yes

Ooooo, so nice! I got carried away, lol, but now I have more data to work with. Let's create a table of the playoff_topD_pabove20 variable I created. This variable is yes if a team made the playoffs, had a top 16 defense, and had a pressure rate above 20%, and no otherwise.

table(data$playoff_topD_pabove20)
## 
##  no yes 
##  20  12

Very interesting, I would say: 12 of the 14 playoff teams had a top 16 defense and a pressure rate above 20%. My null hypothesis is that the yes and no categories are equally likely, which gives an expected count of 16 in each; my alternative is that they are not equally likely. \[ H_0: p_{\textrm{yes}} = p_{\textrm{no}} = 0.5 \\ H_A: p_{\textrm{yes}} \neq p_{\textrm{no}} \]

test = chisq.test(table(data$playoff_topD_pabove20), p = c(1,1)/2)
test
## 
##  Chi-squared test for given probabilities
## 
## data:  table(data$playoff_topD_pabove20)
## X-squared = 2, df = 1, p-value = 0.1573

We see that the X-squared value (2) is fairly small and the p-value (0.157) is above 0.05. Even though the observed counts (20 vs 12) look different, the test does not bear that out, so I fail to reject my null hypothesis here.

The expected values under the equal-proportions null hypothesis follow.

test$expected
##  no yes 
##  16  16

Part 1B: Include a bar chart.

Here is my bar chart

barplot(table(data$playoff_topD_pabove20))

Part 2: Test for Independence

Part 2A: Using two categorical variables, perform a test for independence. Include the expected values.

table(data$D_rank,data$winning.record.)
##         
##          no yes
##   Bottom 14   2
##   Top     5  11

So if you were in the bottom half of the league in defense, you most likely did not have a winning record. Let's set up the hypothesis test. \[ H_0: \textrm{Winning record is independent of defense rank} \\ H_A: \textrm{Winning record is dependent on defense rank} \]

test2 = chisq.test(table(data$D_rank,data$winning.record.))
test2
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(data$D_rank, data$winning.record.)
## X-squared = 8.2915, df = 1, p-value = 0.003983

Oh wow, we got a pretty big X-squared value (8.29) and a small p-value (0.004). We have evidence to suggest that winning record is dependent on defense rank, so I am going to reject the null hypothesis. Here are the expected counts.

test2$expected
##         
##           no yes
##   Bottom 9.5 6.5
##   Top    9.5 6.5

Part 2B: mosaic plot

mosaicplot(table(data$D_rank,data$winning.record.))

Project Part 4: Anova

Part 1: One-Way ANOVA

Part 1A: Use a categorical variable (with more than two categories) and a quantitative variable and perform the ANOVA test to see if the means are equal.

What's up, y'all. Today I will be exploring my data some more, this time using ANOVA. To start off, I am going to make a new categorical variable to play around with, and let's look at the data again.

data[which(data$D_rank == 'Top' &  data$`Prss above 20%` == 'yes'),"topD_pabove20"]="yes" 
data$topD_pabove20[is.na(data$topD_pabove20)] = 'no'
head(data)
##                  Tm  G Att Cmp  Yds TD DADOT  Air  YAC Bltz  Bltz. Hrry  Hrry.
## 1 Arizona Cardinals 16 570 365 3623 26   7.7 1920 2057  256 39.30%   66 10.10%
## 2   Atlanta Falcons 16 625 425 4697 34   8.5 2757 2248  224 32.80%   73 10.70%
## 3  Baltimore Ravens 16 596 380 3536 22   7.9 2078 1787  290 44.10%   60  9.10%
## 4     Buffalo Bills 16 573 369 3726 23   7.0 2192 1758  230 35.80%   51  7.90%
## 5 Carolina Panthers 16 585 398 3825 28   7.4 2372 1971  153 24.00%   60  9.40%
## 6     Chicago Bears 16 547 350 3705 28   8.6 2377 1751  131 21.40%   50  8.20%
##   QBKD  QBKD. Sk Prss  Prss. MTkl win lose tie rank  PF Y.P TO FL X1stD Int
## 1   54  9.50% 48  168 25.80%  111   8    8   0   12 367 5.3 21 10   363  11
## 2   59  9.40% 29  161 23.60%  100   4   12   0   19 414 6.2 21  9   367  12
## 3   71 11.90% 39  170 25.90%  135  11    5   0    2 303 5.2 22 12   347  10
## 4   46  8.00% 38  135 21.00%  127  13    3   0   16 375 5.5 26 11   351  15
## 5   45  7.70% 29  134 21.00%  119   5   11   0   18 402 5.6 22 15   360   7
## 6   48  8.80% 35  133 21.80%   89   8    8   0   14 370 5.4 18  8   331  10
##   X1stDP RushAtt Yds.1 TD.1 Y.A X1stDR playoff winning.record. Test_Train
## 1    207     436  2008   13 4.6    118      no              no      Train
## 2    239     380  1677   15 4.4     97      no              no       Test
## 3    213     382  1740   12 4.6     96     yes             yes      Train
## 4    199     414  1914   21 4.6    125     yes             yes       Test
## 5    211     408  1936   17 4.7    114      no              no      Train
## 6    190     438  1814   11 4.1    104     yes              no      Train
##   D_rank Prss above 20% top rank d and play off playoff_topD_pabove20
## 1    Top            yes                      no                    no
## 2 Bottom            yes                      no                    no
## 3    Top            yes                     yes                   yes
## 4    Top            yes                     yes                   yes
## 5 Bottom            yes                      no                    no
## 6    Top            yes                     yes                   yes
##   topD_pabove20
## 1           yes
## 2            no
## 3           yes
## 4           yes
## 5            no
## 6           yes

Here I made a new variable called topD_pabove20: if a team had a top 16 ranked defense and its pressure rate was above 20%, it got a yes; otherwise I filled the NA with no. Probably not the cleanest way to handle the no cases, but it is short and it works (a one-line alternative with ifelse() is sketched below). Also, wow, our data table is getting bigger by the day.
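For reference, the same column can be built in one step with ifelse(), which avoids filling the NAs afterwards; a minimal sketch:

data$topD_pabove20 <- ifelse(data$D_rank == "Top" & data$`Prss above 20%` == "yes", "yes", "no")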

Now I'm going to ask whether having a top 16 ranked defense with a pressure rate above 20% affects a team's wins; we are going to see if the group means are the same. Here are the hypotheses I will be using: \[ H_0: \mu_1=\mu_2=\dots=\mu_n \\ H_A: \mu_i \neq \mu_j \textrm{ for some } i \textrm{ and } j \]

model = aov(win~topD_pabove20, data = data)
summary(model)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## topD_pabove20  1  175.8  175.78   25.95 1.78e-05 ***
## Residuals     30  203.2    6.77                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, the p-value (1.78e-05) is far less than 5%, so we can reject the null hypothesis that the means are equal.

Part 1B: Include a visualization looking at the means.

First we are going to check whether the assumption of equal variance holds.

plot(model,1)

It looks like the equal-variance assumption is reasonable.

Now we are going to check whether the residuals are normal.

plot(model,2)

Okay, the Q-Q plot looks pretty normal, no need to be worried, so I can safely reject my null hypothesis of the means being the same.

Here I'm going to create a visualization to show that the means are different.

library(ggrepel)
y1 <-  mean(data$win, na.rm = TRUE)
plotnfl <-ggplot(data = data, aes(x = topD_pabove20, y = win))+
  geom_jitter(color = 'grey',width = 0.1, height = 0.1) +
  stat_summary(fun.data = 'mean_se', color = "red") +
  geom_hline(yintercept = y1,  color = "blue",linetype = "dashed")
plotnfl + geom_label_repel(aes(label = Tm),
                  data         = subset(data, win< 4),
                  box.padding   = 0.5,
                  point.padding = 0.5,
                  segment.color = 'grey50') +
  geom_label_repel(aes(label = Tm),
                  data         = subset(data, win> 12),
                  box.padding   = 0.5,
                  point.padding = 0.5,
                  segment.color = 'grey50') +
  geom_label_repel(aes(label = Tm),
                  data         = subset(data, topD_pabove20=="no"&win>10  ),
                  box.padding   = 0.5,
                  point.padding = 0.5,
                  segment.size  = 0.2,
                  segment.color = 'grey50') +
  theme_classic()

This graph clearly supports rejecting the null hypothesis: the group means are really far apart. I also labeled some points to spice the graph up a bit, mainly to highlight the difference between the top teams and the bottom teams. I didn't label every team because that made it messy and took away from what I was trying to show. I can safely reject my null hypothesis now and sleep tight tonight.

Part 2

Part 2A: Using two categorical variables and a quantitative variable, perform a two-way ANOVA.

Here I will use two categorical variables and a quantitative variable and perform a two-way ANOVA, so we are going to have three pairs of hypotheses.

\[ H_0: \textrm{There is no difference in average wins between top defenses with a pressure rate above 20\% and the rest.} \\ H_A: \textrm{There is a difference in average wins for top defenses with a pressure rate above 20\%.} \\ H_0: \textrm{There is no difference in average wins between playoff and non-playoff teams.} \\ H_A: \textrm{There is a difference in average wins between playoff and non-playoff teams.} \\ H_0: \textrm{The effect of one factor on average wins does not depend on the other factor.} \\ H_A: \textrm{There is an interaction effect between the two factors on average wins.} \] A bit messy, but it captures what I'm trying to test.

model = aov(win~topD_pabove20*playoff, data = data)
summary(model)
##                       Df Sum Sq Mean Sq F value   Pr(>F)    
## topD_pabove20          1 175.78  175.78  48.192 1.51e-07 ***
## playoff                1  92.84   92.84  25.453 2.45e-05 ***
## topD_pabove20:playoff  1   8.21    8.21   2.252    0.145    
## Residuals             28 102.13    3.65                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Ooo, interesting: we can reject the first two null hypotheses, but not the third, since its p-value (0.145) is not below even 10%. Very, very interesting. I will now take a deeper look with some visualizations to find an answer.

Part 2B: Include a visualization looking at the means.

y1 <-  mean(data$win, na.rm = TRUE)
ggplot(data = data, aes(x = topD_pabove20 , y = win, color = playoff))+
  geom_jitter(width = 0.1, height = 0.1) +
  stat_summary(fun.data = 'mean_se') 

As you can see, the means in the yes group are pretty close together, while the means in the no group are far apart. It should be noted that in the no group there are only two teams that made the playoffs. Now I'm going to create another visualization to help see what's going on.

plotnfl <-ggplot(data = data, aes(x = topD_pabove20 , y = win, color = playoff))+
  geom_jitter(width = 0.1, height = 0.1) +
  geom_boxplot()
plotnfl + geom_label_repel(aes(label = Tm),
                  data         = subset(data, win< 4),
                  box.padding   = 0.5,
                  point.padding = 0.5,
                  segment.size  = 0.2,
                  show.legend=F
                  ) +
  geom_label_repel(aes(label = Tm),
                  data         = subset(data, win > 13),
                  box.padding   = 0.5,
                  point.padding = 0.5,
                  segment.size  = 0.2,
                  show.legend=F
                  ) +
  theme_classic()

table(data$topD_pabove20,data$playoff)
##      
##       no yes
##   no  14   2
##   yes  4  12

Ayeee, this gives a better picture of why we cannot reject the third null hypothesis. Looking at the yes group on the x-axis, you can see the playoff and non-playoff teams share a few overlapping points, which brings their means closer together. That is why we cannot reject the third null hypothesis: the interaction effect is small.

Part 2C: Discuss if the conditions of the test were met by looking at the graphs and looking at the contingency table of the categorical variables.

Let's get the contingency table of the categorical variables.

table(data$topD_pabove20,data$playoff)
##      
##       no yes
##   no  14   2
##   yes  4  12

I believe the conditions of the test were not fully met: the contingency table shows the design is very unbalanced (cells of 2 and 4 versus 14 and 12), so the groups are far from equally sized, and the interaction term was not significant. I'm a little lost on this last part, but I think I have the right idea.

Project Part 5: Non-Parametric

Part 1: Wilcoxon

Part 1A: Use a categorical variable and a quantitative variable to compare two medians using the Wilcoxon rank sum test.

I will be using a categorical variable and a quantitative variable from my data set to compare medians using the Wilcoxon rank sum test.

The quantitative variable I will be using is the number of blitzes a team ran; here are its ranks.

rank(data$Bltz, ties.method = "average")
##  [1] 30.0 24.0 32.0 25.0 11.0  3.5 17.0  7.0  3.5 15.0  8.0 10.0 22.0  2.0 16.0
## [16] 23.0  9.0  1.0 13.0 27.0 12.0  6.0 20.0 14.0 28.0  5.0 26.0 21.0 29.0 31.0
## [31] 18.0 19.0

Now we can do the Wilcoxon tests. Let me do a test against a fixed value first (the one-sample version is the signed rank test): I am going to ask whether the median number of QB hurries is 70.

median(data$Hrry, na.rm = TRUE)
## [1] 59.5
wilcox.test(data$Hrry, mu = 70, na.rm = TRUE)
## Warning in wilcox.test.default(data$Hrry, mu = 70, na.rm = TRUE): cannot compute
## exact p-value with ties
## Warning in wilcox.test.default(data$Hrry, mu = 70, na.rm = TRUE): cannot compute
## exact p-value with zeroes
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$Hrry
## V = 49.5, p-value = 0.0001033
## alternative hypothesis: true location is not equal to 70

With a p-value of 0.0001, I would reject the null hypothesis here; the median number of hurries is not 70.

Now let's compare two medians across groups, oooooo. I will be using one of the categorical variables that I created a while back.

by(data$Bltz,data$`Prss above 20%`, median)
## data$`Prss above 20%`: no
## [1] 179
## ------------------------------------------------------------ 
## data$`Prss above 20%`: yes
## [1] 195
table(data$`Prss above 20%`)
## 
##  no yes 
##   5  27

Let's run the Wilcoxon rank sum test now.

wilcox.test(Bltz ~ `Prss above 20%`, data = data)
## Warning in wilcox.test.default(x = c(181L, 142L, 179L, 159L, 194L), y =
## c(256L, : cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Bltz by Prss above 20%
## W = 56, p-value = 0.568
## alternative hypothesis: true location shift is not equal to 0

With a p-value of 0.568, we fail to reject the null hypothesis; there is no evidence of a difference in median blitzes between the two pressure groups.

Part 2: Spearman Rank Correlation

Part 2A: Using two quantitative variables, perform a Spearman rank correlation test.

cor.test(data$win,data$Bltz, method = "spearman")
## Warning in cor.test.default(data$win, data$Bltz, method = "spearman"): Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  data$win and data$Bltz
## S = 4327.9, p-value = 0.2562
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##      rho 
## 0.206766

Sadly, this is a pretty weak connection (rho = 0.21) :( and with a p-value of 0.256 we fail to reject the null hypothesis that the true rho is 0.