First, I will be loading in data
data = read.csv('https://raw.githubusercontent.com/jacmantooth/applied_stats-/main/2020%20nfl%20defense%20.csv')
summary(data)
## Tm G Att Cmp Yds
## Length:32 Min. :16 Min. :494.0 Min. :298.0 Min. :3051
## Class :character 1st Qu.:16 1st Qu.:540.0 1st Qu.:342.5 1st Qu.:3573
## Mode :character Median :16 Median :557.0 Median :367.0 Median :3807
## Mean :16 Mean :563.1 Mean :367.4 Mean :3842
## 3rd Qu.:16 3rd Qu.:581.2 3rd Qu.:380.8 3rd Qu.:4113
## Max. :16 Max. :674.0 Max. :450.0 Max. :4697
## TD DADOT Air YAC Bltz
## Min. :17.00 Min. :7.000 Min. :1604 Min. :1376 Min. : 98.0
## 1st Qu.:22.75 1st Qu.:7.700 1st Qu.:2141 1st Qu.:1750 1st Qu.:145.8
## Median :28.00 Median :8.100 Median :2366 Median :1858 Median :180.0
## Mean :27.22 Mean :8.181 Mean :2332 Mean :1870 Mean :187.2
## 3rd Qu.:30.25 3rd Qu.:8.625 3rd Qu.:2502 3rd Qu.:2040 3rd Qu.:225.5
## Max. :38.00 Max. :9.500 Max. :2970 Max. :2282 Max. :290.0
## Bltz. Hrry Hrry. QBKD
## Length:32 Min. :40.00 Length:32 Min. :29.00
## Class :character 1st Qu.:50.00 Class :character 1st Qu.:41.75
## Mode :character Median :59.50 Mode :character Median :47.00
## Mean :59.84 Mean :48.47
## 3rd Qu.:69.25 3rd Qu.:55.25
## Max. :80.00 Max. :71.00
## QBKD. Sk Prss Prss.
## Length:32 Min. :17.00 Min. :101.0 Length:32
## Class :character 1st Qu.:28.50 1st Qu.:132.5 Class :character
## Mode :character Median :36.50 Median :143.0 Mode :character
## Mean :35.47 Mean :143.8
## 3rd Qu.:42.75 3rd Qu.:160.2
## Max. :56.00 Max. :202.0
## MTkl win lose tie
## Min. : 68.00 Min. : 1.000 Min. : 2.000 Min. :0.0000
## 1st Qu.: 96.25 1st Qu.: 5.000 1st Qu.: 5.000 1st Qu.:0.0000
## Median :109.00 Median : 7.500 Median : 8.500 Median :0.0000
## Mean :107.72 Mean : 7.969 Mean : 7.969 Mean :0.0625
## 3rd Qu.:120.50 3rd Qu.:11.000 3rd Qu.:11.000 3rd Qu.:0.0000
## Max. :147.00 Max. :14.000 Max. :15.000 Max. :1.0000
## rank PF Y.P TO
## Min. : 1.00 Min. :296.0 Min. :4.600 Min. : 9.00
## 1st Qu.: 8.75 1st Qu.:356.5 1st Qu.:5.300 1st Qu.:18.75
## Median :16.50 Median :382.5 Median :5.500 Median :22.00
## Mean :16.50 Mean :396.6 Mean :5.572 Mean :20.78
## 3rd Qu.:24.25 3rd Qu.:440.8 3rd Qu.:5.900 3rd Qu.:23.00
## Max. :32.00 Max. :519.0 Max. :6.300 Max. :29.00
## FL X1stD Int X1stDP
## Min. : 4.000 Min. :280.0 Min. : 3.00 Min. :161.0
## 1st Qu.: 6.750 1st Qu.:330.5 1st Qu.:10.00 1st Qu.:186.8
## Median : 8.000 Median :349.0 Median :12.00 Median :205.0
## Mean : 8.438 Mean :347.1 Mean :12.34 Mean :204.6
## 3rd Qu.:10.000 3rd Qu.:366.2 3rd Qu.:15.00 3rd Qu.:215.0
## Max. :15.000 Max. :415.0 Max. :18.00 Max. :253.0
## RushAtt Yds.1 TD.1 Y.A
## Min. :358.0 Min. :1289 Min. :10.00 Min. :3.600
## 1st Qu.:405.2 1st Qu.:1765 1st Qu.:13.00 1st Qu.:4.100
## Median :429.0 Median :1888 Median :16.00 Median :4.500
## Mean :431.0 Mean :1902 Mean :16.62 Mean :4.388
## 3rd Qu.:449.5 3rd Qu.:2030 3rd Qu.:19.25 3rd Qu.:4.600
## Max. :517.0 Max. :2564 Max. :27.00 Max. :5.200
## X1stDR playoff winning.record.
## Min. : 78.00 Length:32 Length:32
## 1st Qu.: 97.75 Class :character Class :character
## Median :113.00 Mode :character Mode :character
## Mean :111.28
## 3rd Qu.:120.50
## Max. :145.00
The quantitative variable that I picked for this project will be Touchdowns. What Touchdowns means in my data set is how many touchdowns did that defense give up. The next part of this project will be the summary statistics. The first part is my five-number summary and the mean of my variable.
summary(data$TD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 22.75 28.00 27.22 30.25 38.00
My mean is 27.22, the minimum is 17, 1st Qu is 22.75, the median is 28.00, while the 3rd Qu is 30.25 and my maximum was 38 touchdowns allowed. I will now find the standard deviation of my variable
sd(data$TD)
## [1] 5.210005
My standard deviation is 5.210005
The next part of my project is the graphical display. I will be first creating a histogram
hist(data$TD)
This graph shows us that not many teams allowed 35+ touchdowns but few allowed only 15 touchdowns. The next graph I will be making is the box plot.
boxplot(data$TD)
This shows us the five-number summary which I already talked about above. Finally, I will be making the qq plot
qqnorm(data$TD)
The Q-Q plot shows us there is no trend with the data. Now I will be exploring data on a categorical variable The categorical variable that I used is Playoff. Which is whether a team made the playoff or not.
table(data$playoff)
##
## no yes
## 18 14
As you can see not many teams made the playoffs. In the NFL they only allow 14 teams in the playoffs The next part will be my relative frequency table
table(data$playoff)/length(data$playoff)
##
## no yes
## 0.5625 0.4375
This next part I made a two-way table of how many times did a team Blitz and if a team made the playoffs.
table(data$playoff,data$Bltz)
##
## 98 108 131 133 135 138 142 147 150 153 159 171 174 178 179 181 194 195
## no 1 0 1 1 1 0 1 1 0 1 1 0 1 1 1 1 0 0
## yes 0 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 1
##
## 197 203 212 221 224 230 244 247 249 253 256 268 290
## no 0 1 1 0 1 0 0 1 1 0 1 0 0
## yes 1 0 0 1 0 1 1 0 0 1 0 1 1
You really can’t tell anything from this so, this really is noy that usefully
I will be preforming a hypothesis test comparing the mean to a fixed value. i will also be stating my null and alternative hypothesis. \[ H_0: \mu = 3000 \\ H_A: \mu\neq 3000 \] so, what my null hypothesis says is that passing yards allowed will be 3000 yards. while my alternative hypothesis say that it will not be 3000
t.test(data$Yds, mu=3000)
##
## One Sample t-test
##
## data: data$Yds
## t = 11.242, df = 31, p-value = 1.822e-12
## alternative hypothesis: true mean is not equal to 3000
## 95 percent confidence interval:
## 3689.576 3995.236
## sample estimates:
## mean of x
## 3842.406
So basically, I am going to reject my null hypothesis because it ia was way off.
I will now be preforming a hypothesis test comparing two means. I will also be stating my null and alternative hypothesis. I will be comparing airtime and yard after the catch. my null hypothesis is that Air yardage will be equal to YAC while my alternative hypothesis is that Air will not equal YAC. I will have Air as mu1 and YAC as mu2
\[ H_0: \mu_1 = \mu_2 \\ H_A: \mu_1 \neq \mu_2 \]
t.test(data$Air,data$YAC)
##
## Welch Two Sample t-test
##
## data: data$Air and data$YAC
## t = 7.3269, df = 57.461, p-value = 8.693e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 335.7103 588.1647
## sample estimates:
## mean of x mean of y
## 2331.625 1869.688
I can safely accept my alternative hypothesis of the two not being equal
The two quantitative variables that I have picked are Losses and Yards allowed. Losses is how many losses that team had that season and Yard allowed being how many yards did that team defense give up
I will first be doing a linear model and linear regression I want to predict loses with yards allowed by a defense
lm(lose ~ Yds, data = data)
##
## Call:
## lm(formula = lose ~ Yds, data = data)
##
## Coefficients:
## (Intercept) Yds
## -1.40808 0.00244
mod=lm(lose ~ Yds, data = data)
model = mod
I have now created a linear model with Yds allowed being a predictor for losses. my slop is .00244 and my intercept is -1.40808. This is saying for every yard allowed we multiple it by .00244 the we add -1.40808.
I will now do a summary function to get more info on the linear regression
summary(model)
##
## Call:
## lm(formula = lose ~ Yds, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8140 -3.0360 0.4666 2.9385 6.0829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.408075 5.446100 -0.259 0.7978
## Yds 0.002440 0.001409 1.732 0.0936 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.326 on 30 degrees of freedom
## Multiple R-squared: 0.09089, Adjusted R-squared: 0.06059
## F-statistic: 2.999 on 1 and 30 DF, p-value: 0.09357
Oh, wow so much stuff. We see the residuals and the coefficient in this summary. We see that our significant it is not that significant here, same goes for our intercept.
I will now plot my linear model
plot(model)
The most interesting graph here is the normal Q-Q. This graph is looking at the normality of my data. We see that my data is actually pretty normal here, in the middle then goes off towards the top. i would say that my data is pretty normal then.
I will now use the predict command to predict losses by a team base on yards given up
Name_predict<-predict(model)
names(Name_predict)<-c(data$Tm)
Name_predict
## Arizona Cardinals Atlanta Falcons Baltimore Ravens
## 7.433321 10.054260 7.221011
## Buffalo Bills Carolina Panthers Chicago Bears
## 7.684678 7.926273 7.633430
## Cincinnati Bengals Cleveland Browns Dallas Cowboys
## 8.009245 8.260601 7.479688
## Denver Broncos Detroit Lions Green Bay Packers
## 7.882346 9.715051 7.228332
## Houston Texans Indianapolis Colts Jacksonville Jaguars
## 8.607131 8.026327 8.917056
## Kansas City Chiefs Las Vegas Raiders Los Angeles Chargers
## 7.814016 8.870689 7.323506
## Los Angeles Rams Miami Dolphins Minnesota Vikings
## 6.037440 8.411903 8.697424
## New England Patriots New Orleans Saints New York Giants
## 7.272258 7.064828 7.882346
## New York Jets Philadelphia Eagles Pittsburgh Steelers
## 9.351438 7.860383 6.181421
## San Francisco 49ers Seattle Seahawks Tampa Bay Buccaneers
## 6.710977 9.719932 8.219115
## Tennessee Titans Washington Football Team
## 9.424649 6.078926
This is so cool in my opinion. If you watched football you know that WFT defiantly had more losses, then 6 and I find it interesting that it says the falcon are predicted to be the team in the NFL lol.
I will now be doing my residual for my prediction; The residual is the difference between the prediction and actual value.
Name_residuals<-residuals(model)
names(Name_residuals)<-c(data$Tm)
Name_residuals
## Arizona Cardinals Atlanta Falcons Baltimore Ravens
## 0.56667856 1.94574012 -2.22101078
## Buffalo Bills Carolina Panthers Chicago Bears
## -4.68467773 3.07372738 0.36656967
## Cincinnati Bengals Cleveland Browns Dallas Cowboys
## 2.99075540 -3.26060089 2.52031187
## Denver Broncos Detroit Lions Green Bay Packers
## 3.11765373 1.28494910 -4.22833184
## Houston Texans Indianapolis Colts Jacksonville Jaguars
## 3.39286907 -3.02632706 6.08294432
## Kansas City Chiefs Las Vegas Raiders Los Angeles Chargers
## -5.81401641 -0.87068898 1.67649442
## Los Angeles Rams Miami Dolphins Minnesota Vikings
## -0.03743988 -2.41190274 0.30257604
## New England Patriots New Orleans Saints New York Giants
## 1.72774182 -3.06482823 2.11765373
## New York Jets Philadelphia Eagles Pittsburgh Steelers
## 4.64856160 3.13961690 -2.18142067
## San Francisco 49ers Seattle Seahawks Tampa Bay Buccaneers
## 3.28902286 -5.71993161 -3.21911490
## Tennessee Titans Washington Football Team
## -4.42464897 2.92107413
As you can see the predictions were all over the place, with some being over and some being under. we also see that some predictions are pretty spot on. I will now create a histogram to see if our residual is normal
hist(residuals(model),xlab = "residual")
Clearly it is not normal. well maybe a little normal but mostly not normal.
I will now be predicting losses base on Yards given up. let say four random teams gave up 2000,3000,4000,5000 yards. Now I will predict the losses each team will have using my model. I will also be showing the confident interval.
ylist =c (2000,3000,4000,5000)
point <- data.frame(Yds = ylist)
predict(model,point, interval = "confidence")
## fit lwr upr
## 1 3.472630 -1.963572 8.908831
## 2 5.912982 3.207729 8.618235
## 3 8.353334 7.069895 9.636773
## 4 10.793687 7.252676 14.334697
From this prediction model we see that if a team gives up only 2000 yards it is predicted that they will only lose about 3.5 games while if a team gives up 5000 yards, they are predicted to lose 10.80 games. we expect that 95% of my data will fall between these intervals.
Next, I will be making a graph with my two quantitative variables. I know some of yall don’t watch football so I thought labeling the dots would help. I know it looks messy in some places but now you can tell which teams are which dots. I also added my abline line and colored it to help it stand out.
plot(lose ~ Yds, data = data, main = "Loses Against Yards allowed", xlab = "Yards allowed", ylab = "Loses")
abline(mod, col="red", lwd=2)
text(lose~ Yds, labels=Tm,data=data, cex=0.5, font=2)
I thought this graph was really cool honestly. It is really interesting to see where each team falls. For example, look at the Seattle Seahawks and Detroit Lions both gave up similar amount of yardages but one had only 4 losses while the lions had 11 losses.
I will now compute the correlation between my two variables and do a hypothesis test on the correlation.
cor(data$Yds,data$lose)
## [1] 0.3014828
As you can see my correlation is .30142828 Now I will be doing an hypothesis test of the correlation test
cor.test(data$Yds,data$lose)
##
## Pearson's product-moment correlation
##
## data: data$Yds and data$lose
## t = 1.7319, df = 30, p-value = 0.09357
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05275726 0.58832861
## sample estimates:
## cor
## 0.3014828
I will reject the null hypothesis in this case and accept my alternative hypothesis of it not being equal to 0
Now I will use a categorical variable and repeat the regression. I will now be predicting losses by Yds given up and winning record
fit <- lm(lose ~ Yds + winning.record., data)
fit
##
## Call:
## lm(formula = lose ~ Yds + winning.record., data = data)
##
## Coefficients:
## (Intercept) Yds winning.record.yes
## 4.781008 0.001449 -5.859531
summary(fit)
##
## Call:
## lm(formula = lose ~ Yds + winning.record., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8848 -0.9870 0.1921 0.6826 4.0877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7810080 2.6550400 1.801 0.0822 .
## Yds 0.0014491 0.0006758 2.144 0.0405 *
## winning.record.yes -5.8595310 0.5740969 -10.207 4.14e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.578 on 29 degrees of freedom
## Multiple R-squared: 0.802, Adjusted R-squared: 0.7884
## F-statistic: 58.74 on 2 and 29 DF, p-value: 6.319e-11
Now I am going to compute a prediction. I will create the ultimate defense, a NFL team that only gave up 2000 yards but somehow had a losing record ooooo interesting.
y <- data.frame(Yds = 2000, winning.record. = "no")
predict(fit, y)
## 1
## 7.679284
It is predicted that this team will lose about 7.7 games sadly.
I will now make a graph for my categorical variable
plot(lose ~ Yds, data = data, main = "Loses Against Yards allowed", xlab = "Yards allowed", ylab = "Loses")
abline(fit, col="black", lwd=2)
## Warning in abline(fit, col = "black", lwd = 2): only using the first two of 3
## regression coefficients
abline(model, col="orange", lwd=2)
text(lose~ Yds, labels=Tm,data=data, cex=0.5, font=2)
So here I did my colors in ECU color, go tigers. The orange line is my original model and the black line the new LM we created above. In my opinion the new LM that we created above is better. We really can’t predict anything here! I would say we need more data, maybe data from 2011-2021 season would be good enough.
I thought the most interesting thing about my Project Part 1 was the prediction model. I thought it was really cool to see if you can predict losses by a team by a certain variable which was fun to play around with. I also enjoyed creating the last graph and adding the names to the variables. what I found out after doing Project Part 1 is that it is very hard to predict losses in the NFL, I think you could tell just by looking at the residual for my prediction. I now understand why ESPN always get predictions wrong.
What is Bootstrapping one my ask, Well Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples with replacement.
First thing first I want to set my seed and load in the library so that I can do Bootstrapping. after I do that, I want to create a function that finds the mean. I can now Bootstrap using 10,000 samples
library(boot)
set.seed(52)
samp_mean <- function(x,i){
mean( x[i])
}
results <- boot(data$Yds, samp_mean,10000)
plot(results)
As you can see our data is pretty normal so that’s a good thing. I now Compare the confidence intervals and result. Let’s first see what our bootstrap confidence interval is
boot.ci(results)
## Warning in boot.ci(results): bootstrap variances needed for studentized
## intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results)
##
## Intervals :
## Level Normal Basic
## 95% (3698, 3986 ) (3700, 3987 )
##
## Level Percentile BCa
## 95% (3697, 3985 ) (3696, 3984 )
## Calculations and Intervals on Original Scale
So, the population mean is between 3697 Yards and 3985 Yards allowed. we can now compare our bootstrap confidence interval to our t test confidence interval. I will just use my t test from earlier where my null hypothesis says that passing yards allowed will be 3000 yards. while my alternative hypothesis says that it will not be 3000 \[ H_0: \mu = 3000 \\ H_A: \mu\neq 3000 \]
t.test(data$Yds, mu=3000)
##
## One Sample t-test
##
## data: data$Yds
## t = 11.242, df = 31, p-value = 1.822e-12
## alternative hypothesis: true mean is not equal to 3000
## 95 percent confidence interval:
## 3689.576 3995.236
## sample estimates:
## mean of x
## 3842.406
We see that our confidence interval for t test is 3689.576 yards to 3995.236 yards while our bootstrap confidence interval is 3697 yards and 3985 yards. Our two-confidence interval are pretty darn close. In both cases we would reject our null hypothesis since it is way short of both confidence interval.
I will now Computing p Value With Bootstrap I will Compute the p Value using t test \[ t = \frac{\mu-\overline x}{SE} \] First, I will get \(\overline x\) using the average of every bootstrap
xbar = results$t0
xbar
## [1] 3842.406
I next will find the \(SE\) using the \(SD\) of the bootstrap statistic
se=sd(results$t)
Since my graph above for our bootstrap looks pretty normal, we can actually get the confidence interval by using \(\pm 2SE\)
c(results$t0-2*se,results$t0+2*se)
## [1] 3695.422 3989.391
My confidence interval is 3695.422 and 3989.391 we will now find the t
t = (3000-results$t0)/se
1-pt(t,9999)
## [1] 1
My p value is 1, what this imply that 100% of the time our null hypothesis of 3000 will be false.
Here I am going to use median for my non-parametric statistic. The reason I picked median was because it was the most normal out of the four. I thought the more normal the non-parametric would be the easier it would be to work with.
First thing I will be doing is creating a function so we can use it when bootstrapping.
samp_median <- function(x,i){
median(x[i])
}
boot2 <- boot(data$Yds,samp_median, R = 10000)
plot(boot2)
This isnt the most normal but it’s the best out of the four. you should have seen the mean lol, it was so bad and not normal. let’s see what our bootstrap confidence interval is
boot.ci(boot2,type = "perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot2, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (3674, 3962 )
## Calculations and Intervals on Original Scale
we see that our 95% CONFIDENCE INTERVAL is 3664 - 3962. Since our data is pretty normalishhhh we can do it this way by using \(\overline x \pm 2SE\)
xbar=mean(boot2$t)
s = sd(boot2$t)
c(xbar- 2*s, xbar+2*s)
## [1] 3678.857 3943.784
As you can see our two confidence intervals are pretty darn close. so, either way would have been fine.
I will now find the p value for the test statistic. my null hypothesis will be the following \[ H_0: \mu = 3642 \\ H_A: \mu\neq 3642 \]
mean(3642==boot2$t)
## [1] 0.0077
t = (3642-boot2$t0)/s
pt(t,9999)
## [1] 0.006372729
so less then 1% of my cases were 3642. you may ask why did i pick 3642. well, it was the first number that gave me a value and I wanted a value not just 0 because 0 is boring and empty. we can reject the null hypothesis since it is less then 5%
Now I will be doing Cross validation. I will first need to load the library of caret, after that I will divide my data into two-thirds for the test sample.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: ggplot2
TS <- createDataPartition(data$rank, p =.66, list=FALSE)
TS
## Resample1
## [1,] 1
## [2,] 3
## [3,] 5
## [4,] 6
## [5,] 7
## [6,] 8
## [7,] 9
## [8,] 10
## [9,] 11
## [10,] 12
## [11,] 16
## [12,] 17
## [13,] 18
## [14,] 21
## [15,] 22
## [16,] 23
## [17,] 24
## [18,] 25
## [19,] 26
## [20,] 27
## [21,] 29
## [22,] 30
## [23,] 31
## [24,] 32
So now we will turn my test sample into two different data’s. The test set will become my training data and not test set will be my test data.
trainData<- data[TS,]
testData<- data[-TS,]
testData[]
## Tm G Att Cmp Yds TD DADOT Air YAC Bltz Bltz. Hrry
## 2 Atlanta Falcons 16 625 425 4697 34 8.5 2757 2248 224 32.80% 73
## 4 Buffalo Bills 16 573 369 3726 23 7.0 2192 1758 230 35.80% 51
## 13 Houston Texans 16 541 377 4104 30 7.7 2232 2282 212 35.90% 47
## 14 Indianapolis Colts 16 562 369 3866 24 8.0 2546 1766 108 17.10% 71
## 15 Jacksonville Jaguars 16 533 370 4231 34 8.3 2238 2101 179 31.00% 40
## 19 Los Angeles Rams 16 548 347 3051 17 7.4 1604 1747 171 27.30% 46
## 20 Miami Dolphins 16 545 343 4024 21 9.2 2365 2036 247 40.60% 56
## 28 San Francisco 49ers 16 537 341 3327 25 7.8 1907 1673 203 33.60% 69
## Hrry. QBKD QBKD. Sk Prss Prss. MTkl win lose tie rank PF Y.P TO FL X1stD
## 2 10.70% 59 9.40% 29 161 23.60% 100 4 12 0 19 414 6.2 21 9 367
## 4 7.90% 46 8.00% 38 135 21.00% 127 13 3 0 16 375 5.5 26 11 351
## 13 8.00% 40 7.40% 34 121 20.50% 125 4 12 0 27 464 6.2 9 6 390
## 14 11.20% 34 6.00% 40 145 22.90% 70 11 5 0 10 362 5.4 25 10 329
## 15 6.90% 49 9.20% 18 107 18.50% 115 1 15 0 31 492 6.3 17 5 394
## 19 7.30% 45 8.20% 53 144 23.00% 79 10 6 0 1 296 4.6 22 8 280
## 20 9.20% 46 8.40% 41 143 23.50% 107 10 6 0 6 338 5.9 29 11 336
## 28 11.40% 38 7.10% 30 137 22.70% 68 6 10 0 17 390 5.0 20 8 311
## Int X1stDP RushAtt Yds.1 TD.1 Y.A X1stDR playoff winning.record.
## 2 12 239 380 1677 15 4.4 97 no no
## 4 15 199 414 1914 21 4.6 125 yes yes
## 13 3 222 493 2564 24 5.2 145 no no
## 14 15 203 390 1448 16 3.7 94 yes yes
## 15 12 210 517 2452 23 4.7 140 no no
## 19 14 170 388 1460 12 3.8 91 yes yes
## 20 18 193 412 1862 17 4.5 113 no yes
## 28 12 176 430 1703 12 4.0 102 no no
oooo look at my test data so nice, and pretty. just what we want!
Now I’m going to try and predict wins base off of times a team Blitzed using our training data.
model<- lm(win ~ Bltz, data = trainData)
summary(model)
##
## Call:
## lm(formula = win ~ Bltz, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6084 -2.0162 0.2017 2.3528 5.5889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.08200 2.38886 1.709 0.1016
## Bltz 0.02219 0.01250 1.776 0.0896 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.161 on 22 degrees of freedom
## Multiple R-squared: 0.1254, Adjusted R-squared: 0.08562
## F-statistic: 3.154 on 1 and 22 DF, p-value: 0.08959
Now I will add a new column to my data, saying whether it is Trained or naw. so if it was in the training set we will name it Train and if it wasn’t in the Training set it gets called Test.
data[TS, 'Test_Train']<- "Train"
data[-TS, 'Test_Train']<- "Test"
data$Test_Train
## [1] "Train" "Test" "Train" "Test" "Train" "Train" "Train" "Train" "Train"
## [10] "Train" "Train" "Train" "Test" "Test" "Test" "Train" "Train" "Train"
## [19] "Test" "Test" "Train" "Train" "Train" "Train" "Train" "Train" "Train"
## [28] "Test" "Train" "Train" "Train" "Train"
So pretty and just what I wanted
I will next graph a scatterplot identifying the data used in the testing and training and will include the linear regression. I thought using ggplot was way better and looked way nicer.
library(ggplot2)
ggplot(data = data, mapping = aes(Bltz,win,color = Test_Train))+
geom_jitter()+
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
What i get from this graph is that my Test and Train Data is very different. You can see that the Test is flat while the training is a positive slope
The next thing I will be doing is examining the prediction on the testing data using the R2 and RMSE as well as the confidence intervals on the regression.
pred <-predict(model,data[-TS,])
pred
## 2 4 13 14 15 19 20 28
## 9.053513 9.186678 8.787182 6.478977 8.054771 7.877216 9.563981 8.587433
This is the prediction for number of wins based on blitz’s using the test data. I wish it showed which team it was I will now do the r squared value comparing the predictions to the actual wins
R2(pred,data[-TS,"win"])
## [1] 0.007137425
we see that our R squared is 0.001330617 but compared to our model which was 0.124. uhm so they are nowhere close to each other but they arnt crazy far apart.
what is k fold? well in k folds you divide your data up into k pieces and you hold back one of the pieces. you then fit your model on the remaining pieces.you then test the model on that piece that you held back. you then mix them all together hold back one hold back a different one this time and repeat k times. we do this to run a statistic on the statistic on the test statistic. that is what and how k fold is.
I will now do the last part of this project using k folds. here i will be doing a 10 fold cross validation
trainc <- trainControl(method = "cv", number = 10)
model2 <- train(win ~ Bltz, data = data,
method = "lm",
trControl = trainc)
print(model2)
## Linear Regression
##
## 32 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 29, 29, 29, 28, 30, 28, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.210976 0.6454075 2.8723
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
we used 32 samples, we that my Rsquared values is 0.5961926, we see the different sample sizes and we see how many folds it did. the thing that interest me the most is the R-squared, a R-squared value between 0.3 < r < 0.5 is generally considered a weak or low effect. basically maybe there is an small connection between blitzing and winning?
I will now do a Summary of the model2
summary(model2)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9534 -2.2919 0.1584 2.9381 5.6245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.98481 2.39204 2.084 0.0458 *
## Bltz 0.01594 0.01235 1.290 0.2068
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.459 on 30 degrees of freedom
## Multiple R-squared: 0.05258, Adjusted R-squared: 0.021
## F-statistic: 1.665 on 1 and 30 DF, p-value: 0.2068
This gives me a good estimate based on the k-fold cross validation.
I thought some parts of this project was really hard to understand. I had trouble wrapping my head around k-fold cross validation but after watching a ton of videos I somewhat understand k-fold cross validation now. i thought understanding this project was harder then actually doing it. Some successes of my Bootstrapping were that everything went right and was pretty easy to get done. One of my failure for Bootstrapping was that I had a hard time find non-parametric statistic that was normal. One of my successes of Cross Validation was that I was able to create the graph and prediction on the testing data. I had two major failures with Cross Validation first was not being able to understand it that well. The last failure was that the two variables really didn’t tell us anything at all but predicting football takes a lot more than just two simple variables. Also dont not leave your cat alone with your labtop. I was almost finish with this project and i go up to go to the rest room. When I came back she’s on my laptop and some how closed out of r and i lost alot of my project LOL.This isnt the first time she has done this either smh.
let’s look at our data
head(data)
## Tm G Att Cmp Yds TD DADOT Air YAC Bltz Bltz. Hrry Hrry.
## 1 Arizona Cardinals 16 570 365 3623 26 7.7 1920 2057 256 39.30% 66 10.10%
## 2 Atlanta Falcons 16 625 425 4697 34 8.5 2757 2248 224 32.80% 73 10.70%
## 3 Baltimore Ravens 16 596 380 3536 22 7.9 2078 1787 290 44.10% 60 9.10%
## 4 Buffalo Bills 16 573 369 3726 23 7.0 2192 1758 230 35.80% 51 7.90%
## 5 Carolina Panthers 16 585 398 3825 28 7.4 2372 1971 153 24.00% 60 9.40%
## 6 Chicago Bears 16 547 350 3705 28 8.6 2377 1751 131 21.40% 50 8.20%
## QBKD QBKD. Sk Prss Prss. MTkl win lose tie rank PF Y.P TO FL X1stD Int
## 1 54 9.50% 48 168 25.80% 111 8 8 0 12 367 5.3 21 10 363 11
## 2 59 9.40% 29 161 23.60% 100 4 12 0 19 414 6.2 21 9 367 12
## 3 71 11.90% 39 170 25.90% 135 11 5 0 2 303 5.2 22 12 347 10
## 4 46 8.00% 38 135 21.00% 127 13 3 0 16 375 5.5 26 11 351 15
## 5 45 7.70% 29 134 21.00% 119 5 11 0 18 402 5.6 22 15 360 7
## 6 48 8.80% 35 133 21.80% 89 8 8 0 14 370 5.4 18 8 331 10
## X1stDP RushAtt Yds.1 TD.1 Y.A X1stDR playoff winning.record. Test_Train
## 1 207 436 2008 13 4.6 118 no no Train
## 2 239 380 1677 15 4.4 97 no no Test
## 3 213 382 1740 12 4.6 96 yes yes Train
## 4 199 414 1914 21 4.6 125 yes yes Test
## 5 211 408 1936 17 4.7 114 no no Train
## 6 190 438 1814 11 4.1 104 yes no Train
as you can see my data doesnt have that much categorical variables in it. so what i am going to do is make my own categorical variables using the data given.
so here I am going make three different categorical variables.
data[which(data$rank <= 16),"D_rank"] = "Top"
data[which(data$rank > 16),"D_rank"] = "Bottom"
data[which(data$Prss. > 20 ),"Prss above 20%"] = "yes"
data[which(data$Prss. <= 20 ),"Prss above 20%"] = "no"
data[which(data$D_rank == 'Top' & data$playoff == 'yes'),"top rank d and play off"] = "yes"
data[which(data$D_rank == 'Top' & data$playoff == 'no'| data$D_rank == 'Bottom' & data$playoff == 'yes' | data$D_rank == 'Bottom' & data$playoff == 'no'),"top rank d and play off"] = "no"
data[which(data$D_rank == 'Top' & data$playoff == 'yes' & data$`Prss above 20%` == 'yes'),"playoff_topD_pabove20"]="yes"
data$playoff_topD_pabove20[is.na(data$playoff_topD_pabove20)] = 'no'
head(data)
## Tm G Att Cmp Yds TD DADOT Air YAC Bltz Bltz. Hrry Hrry.
## 1 Arizona Cardinals 16 570 365 3623 26 7.7 1920 2057 256 39.30% 66 10.10%
## 2 Atlanta Falcons 16 625 425 4697 34 8.5 2757 2248 224 32.80% 73 10.70%
## 3 Baltimore Ravens 16 596 380 3536 22 7.9 2078 1787 290 44.10% 60 9.10%
## 4 Buffalo Bills 16 573 369 3726 23 7.0 2192 1758 230 35.80% 51 7.90%
## 5 Carolina Panthers 16 585 398 3825 28 7.4 2372 1971 153 24.00% 60 9.40%
## 6 Chicago Bears 16 547 350 3705 28 8.6 2377 1751 131 21.40% 50 8.20%
## QBKD QBKD. Sk Prss Prss. MTkl win lose tie rank PF Y.P TO FL X1stD Int
## 1 54 9.50% 48 168 25.80% 111 8 8 0 12 367 5.3 21 10 363 11
## 2 59 9.40% 29 161 23.60% 100 4 12 0 19 414 6.2 21 9 367 12
## 3 71 11.90% 39 170 25.90% 135 11 5 0 2 303 5.2 22 12 347 10
## 4 46 8.00% 38 135 21.00% 127 13 3 0 16 375 5.5 26 11 351 15
## 5 45 7.70% 29 134 21.00% 119 5 11 0 18 402 5.6 22 15 360 7
## 6 48 8.80% 35 133 21.80% 89 8 8 0 14 370 5.4 18 8 331 10
## X1stDP RushAtt Yds.1 TD.1 Y.A X1stDR playoff winning.record. Test_Train
## 1 207 436 2008 13 4.6 118 no no Train
## 2 239 380 1677 15 4.4 97 no no Test
## 3 213 382 1740 12 4.6 96 yes yes Train
## 4 199 414 1914 21 4.6 125 yes yes Test
## 5 211 408 1936 17 4.7 114 no no Train
## 6 190 438 1814 11 4.1 104 yes no Train
## D_rank Prss above 20% top rank d and play off playoff_topD_pabove20
## 1 Top yes no no
## 2 Bottom yes no no
## 3 Top yes yes yes
## 4 Top yes yes yes
## 5 Bottom yes no no
## 6 Top yes yes yes
ooooo so nice! got carried away lol but now i have more data to work with. let’s create a table of playoff_topD_pabove20 that i created. what this categorical variable means is that is a team made the playoff was a top 16 team and blitzed above 20% then they got a yes other wise a no.
table(data$playoff_topD_pabove20)
##
## no yes
## 20 12
very intersting i would say. this means over half of NFl teams that made the playoffs had an top 16 defense and blitz over 20% of the time. my null hypothesis will be the following The number of NFl teams that made the playoffs had an top 16 defense and blitz over 20% of the time is equal to 16 while my alternative hypothesis is The number of NFl teams that made the playoffs had an top 16 defense and blitz over 20% of the time is not equal to 16 \[ H_0: \mu = 16 \\ H_A: \mu\neq 16 \]
test = chisq.test(table(data$playoff_topD_pabove20), p = c(1,1)/2)
test
##
## Chi-squared test for given probabilities
##
## data: table(data$playoff_topD_pabove20)
## X-squared = 2, df = 1, p-value = 0.1573
we see that my p-value is pretty small while my X-squared is pretty big. While I have evidence to suggest these numbers might be different, the statistics did not bear that out. I will fail to reject my null hypothesis here.
The expected values for the number of NFl teams that made the playoffs had an top 16 defense and blitz over 20% of the time follows
test$expected
## no yes
## 16 16
Here is my bar chart
barplot(table(data$playoff_topD_pabove20))
table(data$D_rank,data$winning.record.)
##
## no yes
## Bottom 14 2
## Top 5 11
so if you were at the bottom of the league in defense you likely had an losing record. Let’s set up the hypothesis test. \[ H_0: \textrm{Winning record is independent of Defense rank } \\ H_A: \textrm{Winning record is dependent on Defense rank } \]
test2 = chisq.test(table(data$D_rank,data$winning.record.))
test2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(data$D_rank, data$winning.record.)
## X-squared = 8.2915, df = 1, p-value = 0.003983
oh wow we got a pretty big x squared value and a small p-value. We have evidence to suggest that the Winning record is dependent on Defense rank. Here i am going to reject the null hypothesis here is the expected
test2$expected
##
## no yes
## Bottom 9.5 6.5
## Top 9.5 6.5
mosaicplot(table(data$D_rank,data$winning.record.))
What’s up yall today I will be exploring my data some more but this time using Anova. To start off with I am going to make a new categorical variable to play around with, also let’s look at the data again.
data[which(data$D_rank == 'Top' & data$`Prss above 20%` == 'yes'),"topD_pabove20"]="yes"
data$topD_pabove20[is.na(data$topD_pabove20)] = 'no'
head(data)
## Tm G Att Cmp Yds TD DADOT Air YAC Bltz Bltz. Hrry Hrry.
## 1 Arizona Cardinals 16 570 365 3623 26 7.7 1920 2057 256 39.30% 66 10.10%
## 2 Atlanta Falcons 16 625 425 4697 34 8.5 2757 2248 224 32.80% 73 10.70%
## 3 Baltimore Ravens 16 596 380 3536 22 7.9 2078 1787 290 44.10% 60 9.10%
## 4 Buffalo Bills 16 573 369 3726 23 7.0 2192 1758 230 35.80% 51 7.90%
## 5 Carolina Panthers 16 585 398 3825 28 7.4 2372 1971 153 24.00% 60 9.40%
## 6 Chicago Bears 16 547 350 3705 28 8.6 2377 1751 131 21.40% 50 8.20%
## QBKD QBKD. Sk Prss Prss. MTkl win lose tie rank PF Y.P TO FL X1stD Int
## 1 54 9.50% 48 168 25.80% 111 8 8 0 12 367 5.3 21 10 363 11
## 2 59 9.40% 29 161 23.60% 100 4 12 0 19 414 6.2 21 9 367 12
## 3 71 11.90% 39 170 25.90% 135 11 5 0 2 303 5.2 22 12 347 10
## 4 46 8.00% 38 135 21.00% 127 13 3 0 16 375 5.5 26 11 351 15
## 5 45 7.70% 29 134 21.00% 119 5 11 0 18 402 5.6 22 15 360 7
## 6 48 8.80% 35 133 21.80% 89 8 8 0 14 370 5.4 18 8 331 10
## X1stDP RushAtt Yds.1 TD.1 Y.A X1stDR playoff winning.record. Test_Train
## 1 207 436 2008 13 4.6 118 no no Train
## 2 239 380 1677 15 4.4 97 no no Test
## 3 213 382 1740 12 4.6 96 yes yes Train
## 4 199 414 1914 21 4.6 125 yes yes Test
## 5 211 408 1936 17 4.7 114 no no Train
## 6 190 438 1814 11 4.1 104 yes no Train
## D_rank Prss above 20% top rank d and play off playoff_topD_pabove20
## 1 Top yes no no
## 2 Bottom yes no no
## 3 Top yes yes yes
## 4 Top yes yes yes
## 5 Bottom yes no no
## 6 Top yes yes yes
## topD_pabove20
## 1 yes
## 2 no
## 3 yes
## 4 yes
## 5 no
## 6 yes
so here I made a new variable called topD_pabove20. what this variable means is that if a team had a top 16 ranked Defense and pressure was above 20% they got a yes other-wise I filled the Na with no. prolly not the best idea to fill NA with no but im lazy and it works, also i didnt want to write another long code just for no. also oh wow our data table is getting bigger by the day.
Now I’m going to ask if top 16 ranked Defense and if pressure was above 20% affects Wins for a team. we are going see if the means are the same. Here is my hypotheses test that i will be using \[ H_0: u_1=u_2=...=u_n \\ H_A: u_1 \neq u_2 \textrm{ for some } i \textrm{ and }j \]
model = aov(win~topD_pabove20, data = data)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## topD_pabove20 1 175.8 175.78 25.95 1.78e-05 ***
## Residuals 30 203.2 6.77
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
as you can see the means are different and our Pr is less then 5%, so we can reject our null hypothesis
so here we going see is our assumption of equal variance is true.
plot(model,1)
looks like we had the right idea of assumption of equal variance
now we are going see if our model is normal
plot(model,2)
okay so our QQ-plot looks pretty normal, no need to be worried! so i can safely reject my null hypothesis of means being the same.
Here im going create a visualization to show that the means are different.
library(ggrepel)
y1 <- mean(data$win, na.rm = TRUE)
plotnfl <-ggplot(data = data, aes(x = topD_pabove20, y = win))+
geom_jitter(color = 'grey',width = 0.1, height = 0.1) +
stat_summary(fun.data = 'mean_se', color = "red") +
geom_hline(yintercept = y1, color = "blue",linetype = "dashed")
plotnfl + geom_label_repel(aes(label = Tm),
data = subset(data, win< 4),
box.padding = 0.5,
point.padding = 0.5,
segment.color = 'grey50') +
geom_label_repel(aes(label = Tm),
data = subset(data, win> 12),
box.padding = 0.5,
point.padding = 0.5,
segment.color = 'grey50') +
geom_label_repel(aes(label = Tm),
data = subset(data, topD_pabove20=="no"&win>10 ),
box.padding = 0.5,
point.padding = 0.5,
segment.size = 0.2,
segment.color = 'grey50') +
theme_classic()
clearly this graph supports that the null hypotheses was false. This graph really helps see the differences in the means as the means are really far apart. I also label my graph to spicy it up some. I mainly did it because I wanted to show the differences between the top team and the bottom team. I didnt want to add all the team just because it made it messy and took away from what I was trying to show. I can safely reject my null hypotheses now and sleep tightly tonight.
Here i will be using two categorical variables and a quantitative variable, preforming a two-way ANOVA. so here we are going have three hypotheses
\[ H_0: \textrm{ There is no difference in average wins for any top Defences that blizted above 20%. }\ H_A: \textrm{ There is a difference in average wins for any top Defences that blizted above 20%. } \\ H_0: \textrm{ There is no difference in average wins with team in the playoff. }\ H_A: \textrm{ There is a difference in average wins with team in the playoff.} \\ H_0: \textrm{The effect of one independent variable on average Wins does not depend on the effect of the other independent variable . }\ H_A: \textrm{ There is an interaction effect between top Defences that blizted above 20% and team in the playoff on wins. } \] A bit messy but it get’s what im trying to find.
model = aov(win~topD_pabove20*playoff, data = data)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## topD_pabove20 1 175.78 175.78 48.192 1.51e-07 ***
## playoff 1 92.84 92.84 25.453 2.45e-05 ***
## topD_pabove20:playoff 1 8.21 8.21 2.252 0.145
## Residuals 28 102.13 3.65
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ooo interesting we can reject our first two hypotheses but our third hypothesis we can not as it is not less then 10% very very interesting.I will now take a deeper look into it with some visualizations and find an answer.
y1 <- mean(data$win, na.rm = TRUE)
ggplot(data = data, aes(x = topD_pabove20 , y = win, color = playoff))+
geom_jitter(width = 0.1, height = 0.1) +
stat_summary(fun.data = 'mean_se')
as you can see the yes mean’s is pretty close but then the no mean’s is pretty far apart. It should be noted that in the no part their are only two teams that made the playoffs and had no’s. Now im going create a another visualization to help better see what’s going on.
plotnfl <-ggplot(data = data, aes(x = topD_pabove20 , y = win, color = playoff))+
geom_jitter(width = 0.1, height = 0.1) +
geom_boxplot()
plotnfl + geom_label_repel(aes(label = Tm),
data = subset(data, win< 4),
box.padding = 0.5,
point.padding = 0.5,
segment.size = 0.2,
show.legend=F
) +
geom_label_repel(aes(label = Tm),
data = subset(data, win > 13),
box.padding = 0.5,
point.padding = 0.5,
segment.size = 0.2,
show.legend=F
) +
theme_classic()
table(data$topD_pabove20,data$playoff)
##
## no yes
## no 14 2
## yes 4 12
ayeeee this gives a better picture of why we can not reject the third null hypothesis. looking at the yes on the x-axis, you see that they share a couple of point’s and that brings the mean closer together. so that’s the reason we can not reject the third null hypothesis as the means are fairly close.
let’s get the contingency table of the categorical variables.
table(data$topD_pabove20,data$playoff)
##
## no yes
## no 14 2
## yes 4 12
I believe that the conditions of the test were not met as We did not find a statistically-significant difference and interaction between the terms was not significant. im a little lost of this last part but i think i got the right idea.
I will ve usibg a categorical variable and a quantitative variable from my data set to compare the medians by using the Wilcoxon ranked sum test.
my quantitatitve varuable that i will be using is the amount of Blitz a team did.
rank(data$Bltz, ties.method = "average")
## [1] 30.0 24.0 32.0 25.0 11.0 3.5 17.0 7.0 3.5 15.0 8.0 10.0 22.0 2.0 16.0
## [16] 23.0 9.0 1.0 13.0 27.0 12.0 6.0 20.0 14.0 28.0 5.0 26.0 21.0 29.0 31.0
## [31] 18.0 19.0
now we can do the Wilcoxon ranked sum test Let me do a test against a value first. I am going to ask is the median number of QB hurry is 70
median(data$Hrry, na.rm = TRUE)
## [1] 59.5
wilcox.test(data$Hrry, mu = 70, na.rm = TRUE)
## Warning in wilcox.test.default(data$Hrry, mu = 70, na.rm = TRUE): cannot compute
## exact p-value with ties
## Warning in wilcox.test.default(data$Hrry, mu = 70, na.rm = TRUE): cannot compute
## exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: data$Hrry
## V = 49.5, p-value = 0.0001033
## alternative hypothesis: true location is not equal to 70
so i would reject my null hypothesis here
now let’s compare the two means oooooo. i will be using one of my categorical variable that i created a while back.
by(data$Bltz,data$`Prss above 20%`, median)
## data$`Prss above 20%`: no
## [1] 179
## ------------------------------------------------------------
## data$`Prss above 20%`: yes
## [1] 195
table(data$`Prss above 20%`)
##
## no yes
## 5 27
let’s run the wilcox test now
wilcox.test(Bltz ~ `Prss above 20%`, data = data)
## Warning in wilcox.test.default(x = c(181L, 142L, 179L, 159L, 194L), y =
## c(256L, : cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: Bltz by Prss above 20%
## W = 56, p-value = 0.568
## alternative hypothesis: true location shift is not equal to 0
We will fail to reject the null hypothesis!
cor.test(data$win,data$Bltz, method = "spearman")
## Warning in cor.test.default(data$win, data$Bltz, method = "spearman"): Cannot
## compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: data$win and data$Bltz
## S = 4327.9, p-value = 0.2562
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.206766
sadly this is a kinda weak connection :( we would fail to reject our null hypthesis