Introduction

The dataset used was provided by Max Horowitz on Kaggle and each row contains all data for a single play in the NFL.There is a massive amount of data (356,768 rows and 100 columns) available and there is a lot of different ways to analyze it. First thing I’ll do is load the database and the packages I’ll be using.

library(tidyverse)
library(ggplot2)
library(rpart.plot)
library(caTools)
library(ggpubr)
library(scatterplot3d)
football <- read.csv("~/R/NFL Play by Play 2009-2017 (v4).csv")

One of the simpler uses of this data is if you have a question like “if it’s fourth down and less than 3 yards should the team go for it or attempt a field goal in the 40 yard range?” I can filter examples of fourth down and less than 3 yards and calculate the percentage of first downs gained versus field goals between 40 and 50 yards that are good.

FourthConv <- filter(football, down == 4, ydstogo < 3)
table(FourthConv$FirstDown)
## 
##    0    1 
## 1904 5088
5088/(5088+1904)
## [1] 0.7276888
FG <- filter(football, FieldGoalDistance < 50 | FieldGoalDistance > 40)
table(FG$FieldGoalResult)
## 
## Blocked    Good No Good 
##     197    7479    1272
7479/(197+7479+1272)
## [1] 0.8358292

For this example we can see that attempting the field goal is the better option by about 9% (83.58% vs 72.77%)

Grouping and Summarizing Data

What I want to do next is focus on the stats for each game by team. In order to do this, I grouped the data by Game ID and team on offense (since I want to focus on offensive stats). Since there is only a single column with all of the yards gained, I created 2 more columns that contained only passing and rushing yards.In addition, I also counted the total number of fumbles, sacks, rushing and passing attempts and the perceptage of pass attempts. Lastly, I created a column on whether the team won the game or not based on the max score for both teams. I will then use the summary function to see some simple statistics for each column.

TotalStats <- filter(football, PlayType %in% c("Pass", "Run", "Sack")) %>%
  mutate(PassingYards = ifelse(PassAttempt == 1, Yards.Gained, 0), 
         RushingYards = ifelse(RushAttempt == 1, Yards.Gained, 0)) %>%
  group_by(GameID, posteam, Season) %>% 
  summarize(TotalInts = sum(InterceptionThrown), 
            Score=max(PosTeamScore), 
            Fumbles=sum(Fumble), 
            Sacks=sum(Sack), 
            PassYards = sum(PassingYards), 
            RushYards = sum(RushingYards), 
            TotalYards = sum(PassingYards+RushingYards), 
            RushAttempts = sum(RushAttempt), 
            PassAttempts = sum(PassAttempt), 
            PassPerc = PassAttempts/(RushAttempts+PassAttempts), 
            AvgRush = RushYards/RushAttempts) %>%
              filter(posteam != "")
summary(TotalStats[4:14])
##    TotalInts          Score         Fumbles          Sacks       
##  Min.   :0.0000   Min.   : 0.0   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.:14.0   1st Qu.:0.000   1st Qu.: 1.000  
##  Median :1.0000   Median :20.0   Median :1.000   Median : 2.000  
##  Mean   :0.9362   Mean   :20.7   Mean   :1.116   Mean   : 2.311  
##  3rd Qu.:1.0000   3rd Qu.:27.0   3rd Qu.:2.000   3rd Qu.: 3.000  
##  Max.   :6.0000   Max.   :61.0   Max.   :9.000   Max.   :11.000  
##    PassYards     RushYards       TotalYards     RushAttempts  
##  Min.   :  0   Min.   :  1.0   Min.   : 87.0   Min.   : 7.00  
##  1st Qu.:194   1st Qu.: 78.0   1st Qu.:306.0   1st Qu.:21.00  
##  Median :244   Median :108.0   Median :362.0   Median :26.00  
##  Mean   :248   Mean   :114.5   Mean   :362.5   Mean   :26.21  
##  3rd Qu.:299   3rd Qu.:145.0   3rd Qu.:418.0   3rd Qu.:31.00  
##  Max.   :675   Max.   :363.0   Max.   :792.0   Max.   :60.00  
##   PassAttempts      PassPerc         AvgRush      
##  Min.   : 7.00   Min.   :0.1290   Min.   : 0.100  
##  1st Qu.:29.00   1st Qu.:0.4930   1st Qu.: 3.385  
##  Median :34.00   Median :0.5714   Median : 4.173  
##  Mean   :34.61   Mean   :0.5683   Mean   : 4.292  
##  3rd Qu.:40.00   3rd Qu.:0.6464   3rd Qu.: 5.059  
##  Max.   :70.00   Max.   :0.8727   Max.   :13.562

Regression

The first thing I want to look deeper into is what are the major contributors to scoring points? First I’ll look at total yards versus score and then move into a more detailed analysis with multiple variables. Since there are 4608 observations, I’ll also include a graph using geom_hex which will help show if there is a concentration of data points in one area.

ggplot(TotalStats, aes(x=TotalYards, y=Score)) + geom_point()

ggplot(TotalStats,aes(x=TotalYards, y=Score)) + geom_point() + geom_hex(bins = 50)

cor(TotalStats$TotalYards, TotalStats$Score)
## [1] 0.5945849

From the graph we can see that there appears to be a positive linear relationship between the 2 variables, although it is very spread out, but there is a concentration in the middle. The correlation of .5946 shows that there is moderate positive coorlation between the two variables. Next I’ll seperate the data into train and test samples and do linear regression to see if there is statistical significance between the variables.

set.seed(77)
split <- sample.split(TotalStats$Season, SplitRatio = .6)
TotalStatsTrain <- subset(TotalStats, split == TRUE)
TotalStatsTest <- subset(TotalStats, split == FALSE)

n <- lm(Score ~ TotalYards, data=TotalStatsTrain)
summary(n)
## 
## Call:
## lm(formula = Score ~ TotalYards, data = TotalStatsTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.679  -5.481  -0.480   4.884  31.833 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.289026   0.677857  -7.803 8.54e-15 ***
## TotalYards   0.071678   0.001827  39.241  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.876 on 2761 degrees of freedom
## Multiple R-squared:  0.358,  Adjusted R-squared:  0.3578 
## F-statistic:  1540 on 1 and 2761 DF,  p-value: < 2.2e-16

We see that TotalYards is statistically signifiant, containing a p-value of <2e-16. If we had the total number of yards from a game we would be able to estimate the score by the equation y=.071678x-5.289026. However, the R-square value is on the lower side at .3578, meaning that only 35.78% of the score is explained by the totalyards. Next, I’ll test the model on the test data to ensure the model is correctly predicting the data.

TestR2 <- function(y, testdata, model, testY) {
  predictest <- predict(model, testdata)
  SSE <- sum((testY-predictest)^2, na.rm=TRUE)
    SST <- sum((testY - mean(y, na.rm=TRUE))^2, na.rm=TRUE)
    1-SSE/SST
}


TestR2(TotalStats$Score, TotalStatsTest, n, TotalStatsTest$Score)
## [1] 0.3461587
Test <- predict(n, TotalStatsTest)

With a similar R-squared value (.3461587) we can conclude that the model does represent the data accurately.

Next, to try and more accurately estimate the score, I’ll do multiple linear regression by adding interceptions, passing and rushing yards, sacks and fumbles to the model and then compare it to the test data.

m <- lm(Score ~ PassYards + RushYards + TotalInts + Sacks + Fumbles, data=TotalStatsTrain)
summary(m)
## 
## Call:
## lm(formula = Score ~ PassYards + RushYards + TotalInts + Sacks + 
##     Fumbles, data = TotalStatsTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.303  -4.940  -0.269   4.440  30.591 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.792811   0.731984   2.449   0.0144 *  
## PassYards    0.060553   0.001846  32.800  < 2e-16 ***
## RushYards    0.078285   0.002879  27.190  < 2e-16 ***
## TotalInts   -1.979716   0.136162 -14.539  < 2e-16 ***
## Sacks       -0.992654   0.084601 -11.733  < 2e-16 ***
## Fumbles     -0.839323   0.127898  -6.562  6.3e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.2 on 2757 degrees of freedom
## Multiple R-squared:  0.4643, Adjusted R-squared:  0.4633 
## F-statistic: 477.9 on 5 and 2757 DF,  p-value: < 2.2e-16
TestR2(TotalStats$Score, TotalStatsTest, m, TotalStatsTest$Score)
## [1] 0.4767972

Once again, all of the variables show statistical significance. The R-squared value is a bit better at .4633. On the test data, the R-squared is very similar at .4767972. From this data we are able to see that rushing yards result in slightly more points that passing yards (.078285 vs. .060553), sacks prevent slightly more points than fumbles (-.992654 vs -.839323) and interceptions result in 1 less point over sacks. One way to utilize this information is to compare it to a specific team to see how they vary in scoring. I will filter the data for New England to see how they’re formula for scoring compares.

NE <- filter(TotalStats, posteam == "NE")
m <- lm(Score ~ PassYards + RushYards + TotalInts + Sacks + Fumbles, data=NE)
summary(m)
## 
## Call:
## lm(formula = Score ~ PassYards + RushYards + TotalInts + Sacks + 
##     Fumbles, data = NE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7938  -4.7474  -0.7068   3.5058  22.2910 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.397446   3.247465  -0.122   0.9028    
## PassYards    0.073883   0.007745   9.539  < 2e-16 ***
## RushYards    0.093516   0.012893   7.253 2.66e-11 ***
## TotalInts   -3.895933   0.793978  -4.907 2.57e-06 ***
## Sacks       -0.531366   0.443820  -1.197   0.2333    
## Fumbles     -1.103176   0.577586  -1.910   0.0582 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.85 on 138 degrees of freedom
## Multiple R-squared:  0.5244, Adjusted R-squared:  0.5071 
## F-statistic: 30.43 on 5 and 138 DF,  p-value: < 2.2e-16

Comparing the two analyses it can be seen that both passing and rushing yards are higher than the general equation, suggesting more scoring efficiency. Interceptions are actually double of the general equation (-3.895933 vs -1.979716). Interestingly, sacks aren’t statistically significant. Analysis like this could help with game planning when playing certain teams to reduce their scoring potential.

ANOVA

I’ve heard a lot about how the NFL is turning into a passing league, so I want to see if there is a statistical difference between the percentage of passing plays between seasons. I will first visualize the data in a box plot first and then do formal analysis.

TotalStats$Season <- as.character(TotalStats$Season)
ggplot(TotalStats, mapping= aes(x=Season, y=PassPerc, fill=Season)) + geom_boxplot() +
stat_summary(fun.y="mean", geom="point", shape=21, fill="red") +
guides(fill=FALSE)

From the boxplot it can be seen that there is a overall slight increase in pass percentage over the years, except for 2017 which is a bit lower. It’s also worth noting that the earlier years (2009, 2010 and 2011) all have multiple outliers on the lower end and after that only 2017 has multiple outliers. Now I’ll run ANOVA to see if there is a statistical difference between the seasons.

k <- aov(TotalStats$PassPerc ~ as.factor(TotalStats$Season))
summary(k)
##                                Df Sum Sq Mean Sq F value  Pr(>F)    
## as.factor(TotalStats$Season)    8   0.46 0.05777   4.968 3.8e-06 ***
## Residuals                    4599  53.48 0.01163                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA test we get a P value of 3.8e-06, showing that there is a statistical difference between the mean pass percentage among seasons. Next, I’ll run Tukey’s test to see which specific seasons are different from the other and plot it to better visualize the differences.

TukeyHSD(k)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = TotalStats$PassPerc ~ as.factor(TotalStats$Season))
## 
## $`as.factor(TotalStats$Season)`
##                    diff           lwr         upr     p adj
## 2010-2009  0.0049907191 -0.0159236262 0.025905064 0.9982011
## 2011-2009  0.0032076583 -0.0177066869 0.024122004 0.9999323
## 2012-2009  0.0116408261 -0.0092735192 0.032555171 0.7297423
## 2013-2009  0.0189378095 -0.0019765358 0.039852155 0.1126001
## 2014-2009  0.0194152190 -0.0014991263 0.040329564 0.0935778
## 2015-2009  0.0268385491  0.0059242038 0.047752894 0.0022566
## 2016-2009  0.0309001451  0.0099857998 0.051814490 0.0001614
## 2017-2009  0.0132910706 -0.0076232747 0.034205416 0.5630930
## 2011-2010 -0.0017830607 -0.0226974060 0.019131285 0.9999993
## 2012-2010  0.0066501070 -0.0142642382 0.027564452 0.9871467
## 2013-2010  0.0139470904 -0.0069672548 0.034861436 0.4945903
## 2014-2010  0.0144244999 -0.0064898454 0.035338845 0.4457099
## 2015-2010  0.0218478300  0.0009334848 0.042762175 0.0327172
## 2016-2010  0.0259094260  0.0049950808 0.046823771 0.0039002
## 2017-2010  0.0083003515 -0.0126139938 0.029214697 0.9496403
## 2012-2011  0.0084331678 -0.0124811775 0.029347513 0.9448416
## 2013-2011  0.0157301512 -0.0051841941 0.036644496 0.3217148
## 2014-2011  0.0162075606 -0.0047067846 0.037121906 0.2813189
## 2015-2011  0.0236308907  0.0027165455 0.044545236 0.0135851
## 2016-2011  0.0276924867  0.0067781415 0.048606832 0.0013391
## 2017-2011  0.0100834122 -0.0108309330 0.030997758 0.8577323
## 2013-2012  0.0072969834 -0.0136173619 0.028211329 0.9767723
## 2014-2012  0.0077743929 -0.0131399524 0.028688738 0.9657568
## 2015-2012  0.0151977230 -0.0057166223 0.036112068 0.3701440
## 2016-2012  0.0192593190 -0.0016550263 0.040173664 0.0994853
## 2017-2012  0.0016502445 -0.0192641008 0.022564590 0.9999996
## 2014-2013  0.0004774095 -0.0204369358 0.021391755 1.0000000
## 2015-2013  0.0079007396 -0.0130136057 0.028815085 0.9622883
## 2016-2013  0.0119623356 -0.0089520097 0.032876681 0.6989389
## 2017-2013 -0.0056467389 -0.0265610842 0.015267606 0.9957162
## 2015-2014  0.0074233301 -0.0134910151 0.028337675 0.9741623
## 2016-2014  0.0114849261 -0.0094294191 0.032399271 0.7442342
## 2017-2014 -0.0061241484 -0.0270384936 0.014790197 0.9925505
## 2016-2015  0.0040615960 -0.0168527493 0.024975941 0.9995992
## 2017-2015 -0.0135474785 -0.0344618238 0.007366867 0.5362351
## 2017-2016 -0.0176090745 -0.0385234198 0.003305271 0.1813453
plot(TukeyHSD(k), las=1, cex.axis=.5)

These results were a bit surprising. Some of it I expected, we see that 2015 and 2016 are statistically different from 2009 (with 2013 and 2014 close at .11 and .09, respectively) and then also 2015 and 2016 are statistically different from 2010 and 2011. This makes sense as there may not be enough difference from one year to the other, but after a few years a difference is noticeable. However, the one thing that is most surprising is how 2017 is not statistically different from any other years, the lowest P-value was between 2017 and 2016 at .18. So while the general trend seems to be that passing percentage is increasing, 2017 did not follow that trend. 2017 is actually very similar to 2012. It will be very interesting to see how 2018 and future seasons compare to this to see if this is a new trend or if 2017 was just an anomaly.

Decision Tree

I want to figure out what is the probability of winning a game based on a team’s first half stats. In order to do this I’ll create a decision tree that includes the variables Sacks, Rushing Yards, Passing Yards, Fumbles, Rush Attempts, Pass Attempts and Score Differential. I’ll once again split the data into test and train data sets to see how accurately the decision tree predicts data that the model didn’t use.

Firsthalf <- filter(football, PlayType %in% c("Pass", "Run", "Sack"), qtr %in% c(1, 2)) %>%
  mutate(PassingYards = ifelse(PassAttempt == 1, Yards.Gained, 0), 
         RushingYards = ifelse(RushAttempt == 1, Yards.Gained, 0), Team_ID = str_c(GameID, posteam)) %>%
  group_by(Team_ID, Season, GameID) %>% 
  summarize(TotalInts = sum(InterceptionThrown), 
            Fumbles=sum(Fumble), 
            Sacks=sum(Sack), 
            PassYards = sum(PassingYards), 
            RushYards = sum(RushingYards), 
            TotalYards = sum(PassingYards+RushingYards), 
            RushAttempts = sum(RushAttempt), 
            PassAttempts = sum(PassAttempt), 
            AvgRush = (RushYards/RushAttempts)) %>%
  na.omit()


HalfOff <- filter(football, TimeSecs == 1800, posteam != "") %>%
  mutate(Team_ID = str_c(GameID, posteam)) %>%
  group_by(Team_ID) %>% 
  summarize(Diff = max(ScoreDiff))

Halfdef <- filter(football, TimeSecs == 1800,posteam != "") %>%
  mutate(Team_ID = str_c(GameID, DefensiveTeam), Diff = -ScoreDiff) %>%
  group_by(Team_ID) %>% 
  summarize(Diff = max(Diff))

offscore <- filter(football, TimeSecs == 0, qtr == 4 ,posteam != "") %>%
  mutate(Team_ID = str_c(GameID, posteam), Result = ifelse (ScoreDiff > 0, "Win", ifelse(ScoreDiff < 0, "Lost", "Tie"))) %>%
  select(Team_ID, Result)

defscore <- filter(football, TimeSecs == 0, qtr == 4, posteam != "") %>%
  mutate(Team_ID = str_c(GameID, DefensiveTeam), Result = ifelse (ScoreDiff < 0, "Win", ifelse(ScoreDiff > 0, "Lost", "Tie"))) %>%
  select(Team_ID, Result)

FullResult <- offscore %>%
  full_join(defscore, by = "Team_ID")

FullDiff <- HalfOff %>%
  full_join(Halfdef, by = "Team_ID")

whole <- Firsthalf %>%
  inner_join(FullResult, by = "Team_ID") %>%
  inner_join(FullDiff, by = "Team_ID")


Result.Pred <- mutate(whole, Result = ifelse(Result.x %in% c("Win", "Lost", "Tie"), Result.x, Result.y),
                      Diff = ifelse(is.na(Diff.x) == FALSE, Diff.x, Diff.y))
table(Result.Pred$Result)
## 
## Lost  Tie  Win 
## 1900   30 1900
set.seed(50)
split <- sample.split(Result.Pred$Season, SplitRatio = .6)
FirstHalfTrain <- subset(Result.Pred, split == TRUE)
FirstHalfTest <- subset(Result.Pred, split == FALSE)
Wintree <- rpart(Result ~ Sacks + RushYards + PassYards + Fumbles + RushAttempts + PassAttempts + Diff,data = FirstHalfTrain, 
                 control = rpart.control(minpslit=10))
Wintree
## n= 2298 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 2298 1118 Lost (0.513489991 0.006962576 0.479547433)  
##   2) Diff< 2.5 1346  333 Lost (0.752600297 0.007429421 0.239970282) *
##   3) Diff>=2.5 952  173 Win (0.175420168 0.006302521 0.818277311) *
predictree <- predict(Wintree, newdata=FirstHalfTest, type="class")
table(FirstHalfTest$Result, predictree)
##       predictree
##        Lost Tie Win
##   Lost  596   0 124
##   Tie     8   0   6
##   Win   226   0 572
(596+572)/sum(598, 572, 226, 124, 8, 6)
## [1] 0.7614081

Despite all of the variables used in the making of the decision tree, it came down to if the team is up by more than 2.5 points they are more likely to win. Using this model on the test data a 76.14% success rate is achieved.

Clustering

Lastly, I’ll use clustering to try and organize quarterbacks into 3 groups. In order to do this I’ve used the original data to group by Passers and included their Pass Yards, Pass Attempts, Pass Lengths (Short and Deep), Pass Locations (left, right and middle), Pass Outcomes (incomplete and complete), Touchdowns and Interceptions. All of these I’ve converted into percentages so the clusters are based on the quarterbacks’ tendencies and not the amount of seasons they’ve been in the league. I also filtered the Pass attempts over 100 in order to not include trick plays (non-quarterbacks usually). As these numbers aren’t all going to be similar (TDRatio will go over 1, accuracy will be around 60% and pass direction will probably be around 33%) I will need to scale the numbers to make sure certain variables don’t take over in priority when doing clustering. First I’ll go a histogram of each variable to see if it follows a normal distribution. If that’s the case, I can then use scale() to turn them all into Z-score format

PasserStats <- filter(football, PlayType == "Pass") %>% 
  mutate(PassingYards = ifelse(PassAttempt == 1, Yards.Gained, 0)) %>%
  group_by(Passer_ID, Passer) %>% 
  summarize(Interceptions = sum(InterceptionThrown), 
            PassYards = sum(PassingYards),
            PassAttempts = sum(PassAttempt), 
            ShortPasses = sum(PassLength == "Short", na.rm=TRUE),
            DeepPasses = sum(PassLength == "Deep", na.rm=TRUE),
            LeftPasses = sum(PassLocation == "left", na.rm=TRUE),
            RightPasses = sum(PassLocation == "right", na.rm=TRUE),
            MiddlePasses = sum(PassLocation == "middle", na.rm=TRUE),
            InCompletePass = sum(PassOutcome == "Incomplete Pass", na.rm=TRUE),
            CompletePass = sum(PassOutcome == "Complete", na.rm=TRUE),
            Touchdowns = sum(Touchdown),
            ShortRatio = ShortPasses/PassAttempts,
            DeepRatio = DeepPasses/PassAttempts,
            LeftRatio = LeftPasses/PassAttempts,
            MiddleRatio = MiddlePasses/PassAttempts, 
            RightRatio = RightPasses/PassAttempts,
            TDRatio = Touchdowns/Interceptions,
            Accuracy = CompletePass/PassAttempts) %>% 
  filter(PassAttempts > 100) %>%
  select(Passer, Passer_ID, ShortRatio, DeepRatio, LeftRatio, MiddleRatio, RightRatio, TDRatio, Accuracy)

TD <- gghistogram(PasserStats, x="TDRatio", fill = "Blue")
Short <- gghistogram(PasserStats, x="ShortRatio", fill = "Red")
Deep <- gghistogram(PasserStats, x="DeepRatio", fill = "Purple")
Left <- gghistogram(PasserStats, x="LeftRatio", fill = "Orange")
Right <- gghistogram(PasserStats, x="RightRatio", fill = "Green")
Middle <- gghistogram(PasserStats, x="MiddleRatio", fill = "Black")
Accuracy <- gghistogram(PasserStats, x="Accuracy", fill = "Yellow")

ggarrange(TD, Short, Deep, Left, Right, Middle, Accuracy,
          ncol = 3, nrow = 3)

PasserStats.Scale <- as.data.frame(lapply(PasserStats[3:9], scale))

Now I’ll do Kmeans clustering. After that I’ll put the cluster into the original dataset (without the scaling) and use the aggregate fuction to get the averages of each column to see how the clusters compare to each other.

set.seed(1089)
k.means.passer <- kmeans(PasserStats.Scale[1:7], 3)
PasserStats$cluster <- as.factor(k.means.passer$cluster)
aggregate(data = PasserStats, . ~ cluster, FUN = mean)
##   cluster   Passer Passer_ID ShortRatio DeepRatio LeftRatio MiddleRatio
## 1       1 270.3333  213.6667  0.7170714 0.2810664 0.4421082   0.1583164
## 2       2 168.1333  160.7778  0.7835984 0.2129106 0.3567771   0.2404223
## 3       3 178.6282  160.5641  0.8235408 0.1734707 0.3584571   0.2203609
##   RightRatio  TDRatio  Accuracy
## 1  0.3977133 1.875000 0.5370733
## 2  0.3993095 1.081705 0.5725727
## 3  0.4181936 1.691369 0.6151912

There are many takeaways from these clusters. For example, Cluster 1 can be seen as a more aggressive quarterback as they have the highest DeepRatio, but do suffer from the lowest accuracy. To view this visually I will create a 3D graph showing the PassLocation and then 2 other graphs showing the PassLength and then Accuracy vs TDRatio color coded based on the clusters to see how they are grouped.

PasserStats$cluster <- as.factor(k.means.passer$cluster)
colors <- c("Red", "Green", "Blue")
colors <- colors[as.numeric(PasserStats$cluster)]

scatterplot3d(PasserStats$MiddleRatio, PasserStats$LeftRatio,PasserStats$RightRatio, pch = 16, color = colors, main="3D Scatterplot", xlab = "Middle Ratio", ylab = "Left Ratio", zlab = "Right Ratio")

Short.Deep <- ggscatter(PasserStats, x = "DeepRatio", y = "ShortRatio", color = "cluster", shape = "cluster")
Accuracy.TD <- ggscatter(PasserStats, x = "Accuracy", y = "TDRatio", color = "cluster", shape = "cluster")
ggarrange(Short.Deep, Accuracy.TD,
          ncol = 2, nrow = 1)

Source: Horowitz, Max. Detailed NFL Play-by-Play Data 2009-2017. March 2018. https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016.