The dataset used was provided by Max Horowitz on Kaggle, and each row contains all of the data for a single play in the NFL. There is a massive amount of data available (356,768 rows and 100 columns) and many different ways to analyze it. The first thing I'll do is load the dataset and the packages I'll be using.
library(tidyverse)       # dplyr, stringr and friends for data manipulation
library(ggplot2)         # plotting (also attached by tidyverse)
library(rpart.plot)      # decision trees (attaches rpart as well)
library(caTools)         # sample.split for train/test splits
library(ggpubr)          # gghistogram, ggscatter, ggarrange
library(scatterplot3d)   # 3D scatterplots
football <- read.csv("~/R/NFL Play by Play 2009-2017 (v4).csv")
One of the simpler uses of this data is answering a question like: on fourth down with less than 3 yards to go, should the team go for it or attempt a field goal in the 40-yard range? I can filter for fourth-down plays with less than 3 yards to go and calculate the percentage that gained a first down, then compare it with the percentage of field goals between 40 and 50 yards that were good.
FourthConv <- filter(football, down == 4, ydstogo < 3)
table(FourthConv$FirstDown)
##
## 0 1
## 1904 5088
5088/(5088+1904)
## [1] 0.7276888
FG <- filter(football, FieldGoalDistance < 50 & FieldGoalDistance > 40)
table(FG$FieldGoalResult)
##
## Blocked Good No Good
## 197 7479 1272
7479/(197+7479+1272)
## [1] 0.8358292
For this example we can see that attempting the field goal is the better option by about 11 percentage points (83.58% vs. 72.77%).
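The same two rates can also be computed directly with a couple of short dplyr pipelines (a sketch using the same football data frame; it assumes FirstDown is the 0/1 indicator and FieldGoalResult the "Good"/"No Good"/"Blocked" label shown above):
football %>%
  filter(down == 4, ydstogo < 3) %>%
  summarize(ConversionRate = mean(FirstDown == 1, na.rm = TRUE))      # share of 4th-and-short plays converted
football %>%
  filter(FieldGoalDistance > 40, FieldGoalDistance < 50) %>%
  summarize(FGRate = mean(FieldGoalResult == "Good", na.rm = TRUE))   # share of 40-50 yard field goals made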
What I want to do next is focus on each team's stats in each game. To do this, I grouped the data by game ID, the team on offense (since I want to focus on offensive stats) and season. Since there is only a single column for yards gained, I created two more columns containing only passing and rushing yards. I also totaled interceptions, fumbles, sacks, rushing attempts and passing attempts, and computed the percentage of pass plays, the average yards per rush, and each team's final score (the maximum of its running score column). I'll then use the summary function to see some simple statistics for each column.
TotalStats <- filter(football, PlayType %in% c("Pass", "Run", "Sack")) %>%
  mutate(PassingYards = ifelse(PassAttempt == 1, Yards.Gained, 0),      # yards gained on pass plays only
         RushingYards = ifelse(RushAttempt == 1, Yards.Gained, 0)) %>%  # yards gained on run plays only
  group_by(GameID, posteam, Season) %>%                                 # one row per team per game
  summarize(TotalInts = sum(InterceptionThrown),
            Score = max(PosTeamScore),
            Fumbles = sum(Fumble),
            Sacks = sum(Sack),
            PassYards = sum(PassingYards),
            RushYards = sum(RushingYards),
            TotalYards = sum(PassingYards + RushingYards),
            RushAttempts = sum(RushAttempt),
            PassAttempts = sum(PassAttempt),
            PassPerc = PassAttempts/(RushAttempts + PassAttempts),      # share of plays that were passes
            AvgRush = RushYards/RushAttempts) %>%
  filter(posteam != "")                                                 # drop rows with no offensive team recorded
summary(TotalStats[4:14])
## TotalInts Score Fumbles Sacks
## Min. :0.0000 Min. : 0.0 Min. :0.000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:14.0 1st Qu.:0.000 1st Qu.: 1.000
## Median :1.0000 Median :20.0 Median :1.000 Median : 2.000
## Mean :0.9362 Mean :20.7 Mean :1.116 Mean : 2.311
## 3rd Qu.:1.0000 3rd Qu.:27.0 3rd Qu.:2.000 3rd Qu.: 3.000
## Max. :6.0000 Max. :61.0 Max. :9.000 Max. :11.000
## PassYards RushYards TotalYards RushAttempts
## Min. : 0 Min. : 1.0 Min. : 87.0 Min. : 7.00
## 1st Qu.:194 1st Qu.: 78.0 1st Qu.:306.0 1st Qu.:21.00
## Median :244 Median :108.0 Median :362.0 Median :26.00
## Mean :248 Mean :114.5 Mean :362.5 Mean :26.21
## 3rd Qu.:299 3rd Qu.:145.0 3rd Qu.:418.0 3rd Qu.:31.00
## Max. :675 Max. :363.0 Max. :792.0 Max. :60.00
## PassAttempts PassPerc AvgRush
## Min. : 7.00 Min. :0.1290 Min. : 0.100
## 1st Qu.:29.00 1st Qu.:0.4930 1st Qu.: 3.385
## Median :34.00 Median :0.5714 Median : 4.173
## Mean :34.61 Mean :0.5683 Mean : 4.292
## 3rd Qu.:40.00 3rd Qu.:0.6464 3rd Qu.: 5.059
## Max. :70.00 Max. :0.8727 Max. :13.562
The first thing I want to look into more deeply is what the major contributors to scoring points are. First I'll look at total yards versus score, and then move into a more detailed analysis with multiple variables. Since there are 4608 observations, I'll also include a graph using geom_hex, which will help show whether the data points are concentrated in one area.
ggplot(TotalStats, aes(x=TotalYards, y=Score)) + geom_point()
ggplot(TotalStats,aes(x=TotalYards, y=Score)) + geom_point() + geom_hex(bins = 50)
cor(TotalStats$TotalYards, TotalStats$Score)
## [1] 0.5945849
From the graph we can see what appears to be a positive linear relationship between the two variables; the points are quite spread out, but there is a concentration in the middle. The correlation of .5946 indicates a moderate positive correlation. Next I'll separate the data into training and test samples and run a linear regression to see whether the relationship is statistically significant.
set.seed(77)
split <- sample.split(TotalStats$Season, SplitRatio = .6)
TotalStatsTrain <- subset(TotalStats, split == TRUE)
TotalStatsTest <- subset(TotalStats, split == FALSE)
n <- lm(Score ~ TotalYards, data=TotalStatsTrain)
summary(n)
##
## Call:
## lm(formula = Score ~ TotalYards, data = TotalStatsTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.679 -5.481 -0.480 4.884 31.833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.289026 0.677857 -7.803 8.54e-15 ***
## TotalYards 0.071678 0.001827 39.241 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.876 on 2761 degrees of freedom
## Multiple R-squared: 0.358, Adjusted R-squared: 0.3578
## F-statistic: 1540 on 1 and 2761 DF, p-value: < 2.2e-16
We see that TotalYards is statistically significant, with a p-value of < 2e-16. Given the total yards from a game, we could estimate the score with the equation y = 0.071678x − 5.289026. However, the R-squared value is on the lower side at .3578, meaning that only about 36% of the variance in score is explained by total yards.
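As a quick illustration of the fitted equation (400 yards is just an example value, not taken from the data), a 400-yard game works out to about 0.071678 × 400 − 5.289 ≈ 23.4 points:
predict(n, newdata = data.frame(TotalYards = 400))   # roughly 23.4 points
Next, I'll test the model on the test data, using a small helper function that computes the out-of-sample R-squared, to make sure it predicts games it has not seen.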
TestR2 <- function(y, testdata, model, testY) {
  predictest <- predict(model, testdata)                       # predictions on the test set
  SSE <- sum((testY - predictest)^2, na.rm = TRUE)             # sum of squared prediction errors
  SST <- sum((testY - mean(y, na.rm = TRUE))^2, na.rm = TRUE)  # total sum of squares around the overall mean
  1 - SSE/SST                                                  # out-of-sample R-squared
}
TestR2(TotalStats$Score, TotalStatsTest, n, TotalStatsTest$Score)
## [1] 0.3461587
Test <- predict(n, TotalStatsTest)
With a similar R-squared value on the test data (.3461587), we can conclude that the model generalizes to games it was not trained on about as well as it fits the training data.
Next, to estimate the score more accurately, I'll fit a multiple linear regression that adds interceptions, sacks and fumbles and splits yardage into passing and rushing yards, and then compare it against the test data.
m <- lm(Score ~ PassYards + RushYards + TotalInts + Sacks + Fumbles, data=TotalStatsTrain)
summary(m)
##
## Call:
## lm(formula = Score ~ PassYards + RushYards + TotalInts + Sacks +
## Fumbles, data = TotalStatsTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.303 -4.940 -0.269 4.440 30.591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.792811 0.731984 2.449 0.0144 *
## PassYards 0.060553 0.001846 32.800 < 2e-16 ***
## RushYards 0.078285 0.002879 27.190 < 2e-16 ***
## TotalInts -1.979716 0.136162 -14.539 < 2e-16 ***
## Sacks -0.992654 0.084601 -11.733 < 2e-16 ***
## Fumbles -0.839323 0.127898 -6.562 6.3e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.2 on 2757 degrees of freedom
## Multiple R-squared: 0.4643, Adjusted R-squared: 0.4633
## F-statistic: 477.9 on 5 and 2757 DF, p-value: < 2.2e-16
TestR2(TotalStats$Score, TotalStatsTest, m, TotalStatsTest$Score)
## [1] 0.4767972
Once again, all of the variables are statistically significant. The R-squared value is a bit better at .4633, and on the test data it is very similar at .4767972. From the coefficients we can see that a rushing yard is worth slightly more than a passing yard (.078285 vs. .060553), a sack costs slightly more than a fumble (-.992654 vs. -.839323), and an interception costs about one point more than a sack. A quick worked example of the fitted equation is sketched below.
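As an illustration (the stat line here is hypothetical, not taken from the data), a game with 250 passing yards, 100 rushing yards, one interception, two sacks and one fumble comes out to roughly 20 points:
predict(m, newdata = data.frame(PassYards = 250, RushYards = 100,
                                TotalInts = 1, Sacks = 2, Fumbles = 1))
# about 1.79 + 0.0606*250 + 0.0783*100 - 1.98 - 2*0.99 - 0.84, i.e. roughly 20 points
One way to use this information is to compare the league-wide model against a specific team to see how their scoring differs. I'll filter the data for New England to see how their scoring formula compares.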
NE <- filter(TotalStats, posteam == "NE")
m <- lm(Score ~ PassYards + RushYards + TotalInts + Sacks + Fumbles, data=NE)
summary(m)
##
## Call:
## lm(formula = Score ~ PassYards + RushYards + TotalInts + Sacks +
## Fumbles, data = NE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7938 -4.7474 -0.7068 3.5058 22.2910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.397446 3.247465 -0.122 0.9028
## PassYards 0.073883 0.007745 9.539 < 2e-16 ***
## RushYards 0.093516 0.012893 7.253 2.66e-11 ***
## TotalInts -3.895933 0.793978 -4.907 2.57e-06 ***
## Sacks -0.531366 0.443820 -1.197 0.2333
## Fumbles -1.103176 0.577586 -1.910 0.0582 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.85 on 138 degrees of freedom
## Multiple R-squared: 0.5244, Adjusted R-squared: 0.5071
## F-statistic: 30.43 on 5 and 138 DF, p-value: < 2.2e-16
Comparing the two analyses, the coefficients for both passing and rushing yards are higher for New England than in the league-wide model, suggesting greater scoring efficiency. The interception coefficient is roughly double that of the league-wide model (-3.895933 vs. -1.979716). Interestingly, sacks are not statistically significant. Analysis like this could help with game planning against specific teams to reduce their scoring potential.
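To make this comparison easier to read, the two fits could be kept under separate names and their coefficients stacked into a single table (a sketch; m_all and m_ne are hypothetical names, since above both models were assigned to m):
m_all <- lm(Score ~ PassYards + RushYards + TotalInts + Sacks + Fumbles, data = TotalStatsTrain)
m_ne  <- lm(Score ~ PassYards + RushYards + TotalInts + Sacks + Fumbles, data = NE)
round(rbind(League = coef(m_all), NE = coef(m_ne)), 4)   # side-by-side coefficient comparison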
I've heard a lot about how the NFL is turning into a passing league, so I want to see whether there is a statistical difference in the percentage of passing plays between seasons. I'll first visualize the data with a box plot and then do a formal analysis.
TotalStats$Season <- as.character(TotalStats$Season)
ggplot(TotalStats, mapping= aes(x=Season, y=PassPerc, fill=Season)) + geom_boxplot() +
stat_summary(fun.y="mean", geom="point", shape=21, fill="red") +
guides(fill=FALSE)
From the boxplot we can see an overall slight increase in pass percentage over the years, except for 2017, which is a bit lower. It's also worth noting that the earlier years (2009, 2010 and 2011) all have multiple outliers on the lower end, and after that only 2017 has multiple outliers. Now I'll run an ANOVA to see if there is a statistical difference between the seasons.
k <- aov(TotalStats$PassPerc ~ as.factor(TotalStats$Season))
summary(k)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(TotalStats$Season) 8 0.46 0.05777 4.968 3.8e-06 ***
## Residuals 4599 53.48 0.01163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA gives a p-value of 3.8e-06, showing that mean pass percentage does differ across seasons. Next, I'll run Tukey's test to see which specific seasons differ from the others, and plot the results to better visualize the differences.
TukeyHSD(k)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = TotalStats$PassPerc ~ as.factor(TotalStats$Season))
##
## $`as.factor(TotalStats$Season)`
## diff lwr upr p adj
## 2010-2009 0.0049907191 -0.0159236262 0.025905064 0.9982011
## 2011-2009 0.0032076583 -0.0177066869 0.024122004 0.9999323
## 2012-2009 0.0116408261 -0.0092735192 0.032555171 0.7297423
## 2013-2009 0.0189378095 -0.0019765358 0.039852155 0.1126001
## 2014-2009 0.0194152190 -0.0014991263 0.040329564 0.0935778
## 2015-2009 0.0268385491 0.0059242038 0.047752894 0.0022566
## 2016-2009 0.0309001451 0.0099857998 0.051814490 0.0001614
## 2017-2009 0.0132910706 -0.0076232747 0.034205416 0.5630930
## 2011-2010 -0.0017830607 -0.0226974060 0.019131285 0.9999993
## 2012-2010 0.0066501070 -0.0142642382 0.027564452 0.9871467
## 2013-2010 0.0139470904 -0.0069672548 0.034861436 0.4945903
## 2014-2010 0.0144244999 -0.0064898454 0.035338845 0.4457099
## 2015-2010 0.0218478300 0.0009334848 0.042762175 0.0327172
## 2016-2010 0.0259094260 0.0049950808 0.046823771 0.0039002
## 2017-2010 0.0083003515 -0.0126139938 0.029214697 0.9496403
## 2012-2011 0.0084331678 -0.0124811775 0.029347513 0.9448416
## 2013-2011 0.0157301512 -0.0051841941 0.036644496 0.3217148
## 2014-2011 0.0162075606 -0.0047067846 0.037121906 0.2813189
## 2015-2011 0.0236308907 0.0027165455 0.044545236 0.0135851
## 2016-2011 0.0276924867 0.0067781415 0.048606832 0.0013391
## 2017-2011 0.0100834122 -0.0108309330 0.030997758 0.8577323
## 2013-2012 0.0072969834 -0.0136173619 0.028211329 0.9767723
## 2014-2012 0.0077743929 -0.0131399524 0.028688738 0.9657568
## 2015-2012 0.0151977230 -0.0057166223 0.036112068 0.3701440
## 2016-2012 0.0192593190 -0.0016550263 0.040173664 0.0994853
## 2017-2012 0.0016502445 -0.0192641008 0.022564590 0.9999996
## 2014-2013 0.0004774095 -0.0204369358 0.021391755 1.0000000
## 2015-2013 0.0079007396 -0.0130136057 0.028815085 0.9622883
## 2016-2013 0.0119623356 -0.0089520097 0.032876681 0.6989389
## 2017-2013 -0.0056467389 -0.0265610842 0.015267606 0.9957162
## 2015-2014 0.0074233301 -0.0134910151 0.028337675 0.9741623
## 2016-2014 0.0114849261 -0.0094294191 0.032399271 0.7442342
## 2017-2014 -0.0061241484 -0.0270384936 0.014790197 0.9925505
## 2016-2015 0.0040615960 -0.0168527493 0.024975941 0.9995992
## 2017-2015 -0.0135474785 -0.0344618238 0.007366867 0.5362351
## 2017-2016 -0.0176090745 -0.0385234198 0.003305271 0.1813453
plot(TukeyHSD(k), las=1, cex.axis=.5)
These results were a bit surprising. Some of it I expected: 2015 and 2016 are statistically different from 2009 (with 2013 and 2014 close, at .11 and .09 respectively), and 2015 and 2016 are also statistically different from 2010 and 2011. This makes sense, since there may not be enough change from one year to the next, but after a few years the difference becomes noticeable. The most surprising result is that 2017 is not statistically different from any other year; its lowest p-value was against 2016, at .18. So while the general trend is that passing percentage is increasing, 2017 did not follow it and actually looks very similar to 2012. It will be interesting to see how 2018 and future seasons compare, to determine whether this is a new trend or whether 2017 was just an anomaly.
Next I want to figure out the probability of winning a game based on a team's first-half stats. To do this I'll build a decision tree using sacks, rushing yards, passing yards, fumbles, rush attempts, pass attempts and the halftime score differential. I'll once again split the data into training and test sets to see how accurately the decision tree predicts games the model didn't use.
# First-half offensive stats, one row per team per game
Firsthalf <- filter(football, PlayType %in% c("Pass", "Run", "Sack"), qtr %in% c(1, 2)) %>%
  mutate(PassingYards = ifelse(PassAttempt == 1, Yards.Gained, 0),
         RushingYards = ifelse(RushAttempt == 1, Yards.Gained, 0),
         Team_ID = str_c(GameID, posteam)) %>%    # unique game/team key for the joins below
  group_by(Team_ID, Season, GameID) %>%
  summarize(TotalInts = sum(InterceptionThrown),
            Fumbles = sum(Fumble),
            Sacks = sum(Sack),
            PassYards = sum(PassingYards),
            RushYards = sum(RushingYards),
            TotalYards = sum(PassingYards + RushingYards),
            RushAttempts = sum(RushAttempt),
            PassAttempts = sum(PassAttempt),
            AvgRush = (RushYards/RushAttempts)) %>%
  na.omit()
# Halftime score differential from the offense's and the defense's point of view
HalfOff <- filter(football, TimeSecs == 1800, posteam != "") %>%
  mutate(Team_ID = str_c(GameID, posteam)) %>%
  group_by(Team_ID) %>%
  summarize(Diff = max(ScoreDiff))
Halfdef <- filter(football, TimeSecs == 1800, posteam != "") %>%
  mutate(Team_ID = str_c(GameID, DefensiveTeam), Diff = -ScoreDiff) %>%
  group_by(Team_ID) %>%
  summarize(Diff = max(Diff))
# Final result (Win/Lost/Tie) from the offense's and the defense's point of view
offscore <- filter(football, TimeSecs == 0, qtr == 4, posteam != "") %>%
  mutate(Team_ID = str_c(GameID, posteam),
         Result = ifelse(ScoreDiff > 0, "Win", ifelse(ScoreDiff < 0, "Lost", "Tie"))) %>%
  select(Team_ID, Result)
defscore <- filter(football, TimeSecs == 0, qtr == 4, posteam != "") %>%
  mutate(Team_ID = str_c(GameID, DefensiveTeam),
         Result = ifelse(ScoreDiff < 0, "Win", ifelse(ScoreDiff > 0, "Lost", "Tie"))) %>%
  select(Team_ID, Result)
# Combine results and differentials, then attach them to the first-half stats
FullResult <- offscore %>%
  full_join(defscore, by = "Team_ID")
FullDiff <- HalfOff %>%
  full_join(Halfdef, by = "Team_ID")
whole <- Firsthalf %>%
  inner_join(FullResult, by = "Team_ID") %>%
  inner_join(FullDiff, by = "Team_ID")
# Whichever side of the join has a value supplies the final result and the differential
Result.Pred <- mutate(whole,
                      Result = ifelse(Result.x %in% c("Win", "Lost", "Tie"), Result.x, Result.y),
                      Diff = ifelse(is.na(Diff.x) == FALSE, Diff.x, Diff.y))
table(Result.Pred$Result)
##
## Lost Tie Win
## 1900 30 1900
set.seed(50)
split <- sample.split(Result.Pred$Season, SplitRatio = .6)
FirstHalfTrain <- subset(Result.Pred, split == TRUE)
FirstHalfTest <- subset(Result.Pred, split == FALSE)
Wintree <- rpart(Result ~ Sacks + RushYards + PassYards + Fumbles + RushAttempts + PassAttempts + Diff,
                 data = FirstHalfTrain,
                 control = rpart.control(minsplit = 10))   # allow splits on nodes with as few as 10 observations
Wintree
## n= 2298
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 2298 1118 Lost (0.513489991 0.006962576 0.479547433)
## 2) Diff< 2.5 1346 333 Lost (0.752600297 0.007429421 0.239970282) *
## 3) Diff>=2.5 952 173 Win (0.175420168 0.006302521 0.818277311) *
predictree <- predict(Wintree, newdata=FirstHalfTest, type="class")
table(FirstHalfTest$Result, predictree)
## predictree
## Lost Tie Win
## Lost 596 0 124
## Tie 8 0 6
## Win 226 0 572
(596+572)/sum(596, 572, 226, 124, 8, 6)
## [1] 0.7624021
Despite all of the variables supplied to the decision tree, it came down to a single split: a team leading by 3 or more points at halftime (Diff >= 2.5) is predicted to win. Using this model on the test data gives an accuracy of 76.24%.
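Since the rpart.plot package was loaded at the start, the fitted tree can also be drawn directly, which makes the single split easy to see (a minimal sketch):
rpart.plot(Wintree)   # draws the fitted classification tree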
Lastly, I'll use clustering to organize quarterbacks into 3 groups. To do this I went back to the original data, grouped it by passer, and included pass yards, pass attempts, pass lengths (short and deep), pass locations (left, right and middle), pass outcomes (incomplete and complete), touchdowns and interceptions. I converted all of these into ratios so the clusters are based on each quarterback's tendencies rather than on how many seasons they have been in the league. I also kept only passers with more than 100 pass attempts, to exclude trick plays (which are usually thrown by non-quarterbacks). Since these numbers are on very different scales (TDRatio can go over 1, accuracy will be around 60% and pass direction will probably be around 33%), I need to scale them so that no single variable dominates the clustering. First I'll draw a histogram of each variable to see whether it roughly follows a normal distribution; if so, I can use scale() to convert them all into z-scores.
PasserStats <- filter(football, PlayType == "Pass") %>%
mutate(PassingYards = ifelse(PassAttempt == 1, Yards.Gained, 0)) %>%
group_by(Passer_ID, Passer) %>%
summarize(Interceptions = sum(InterceptionThrown),
PassYards = sum(PassingYards),
PassAttempts = sum(PassAttempt),
ShortPasses = sum(PassLength == "Short", na.rm=TRUE),
DeepPasses = sum(PassLength == "Deep", na.rm=TRUE),
LeftPasses = sum(PassLocation == "left", na.rm=TRUE),
RightPasses = sum(PassLocation == "right", na.rm=TRUE),
MiddlePasses = sum(PassLocation == "middle", na.rm=TRUE),
InCompletePass = sum(PassOutcome == "Incomplete Pass", na.rm=TRUE),
CompletePass = sum(PassOutcome == "Complete", na.rm=TRUE),
Touchdowns = sum(Touchdown),
ShortRatio = ShortPasses/PassAttempts,
DeepRatio = DeepPasses/PassAttempts,
LeftRatio = LeftPasses/PassAttempts,
MiddleRatio = MiddlePasses/PassAttempts,
RightRatio = RightPasses/PassAttempts,
TDRatio = Touchdowns/Interceptions,
Accuracy = CompletePass/PassAttempts) %>%
filter(PassAttempts > 100) %>%
select(Passer, Passer_ID, ShortRatio, DeepRatio, LeftRatio, MiddleRatio, RightRatio, TDRatio, Accuracy)
TD <- gghistogram(PasserStats, x="TDRatio", fill = "Blue")
Short <- gghistogram(PasserStats, x="ShortRatio", fill = "Red")
Deep <- gghistogram(PasserStats, x="DeepRatio", fill = "Purple")
Left <- gghistogram(PasserStats, x="LeftRatio", fill = "Orange")
Right <- gghistogram(PasserStats, x="RightRatio", fill = "Green")
Middle <- gghistogram(PasserStats, x="MiddleRatio", fill = "Black")
Accuracy <- gghistogram(PasserStats, x="Accuracy", fill = "Yellow")
ggarrange(TD, Short, Deep, Left, Right, Middle, Accuracy,
ncol = 3, nrow = 3)
PasserStats.Scale <- as.data.frame(lapply(PasserStats[3:9], scale))
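As a quick check that the scaling worked, every column of the scaled data should now have a mean of roughly 0 and a standard deviation of roughly 1:
round(colMeans(PasserStats.Scale), 3)       # all approximately 0
round(apply(PasserStats.Scale, 2, sd), 3)   # all approximately 1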
Now I'll run k-means clustering. After that, I'll attach the cluster assignments to the original (unscaled) dataset and use the aggregate function to get the average of each column, to see how the clusters compare to each other.
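The choice of 3 clusters is somewhat arbitrary, so before committing to it, an elbow plot of the total within-cluster sum of squares over a range of k is one quick sanity check (a sketch, not part of the original analysis):
wss <- sapply(1:8, function(k) kmeans(PasserStats.Scale[1:7], centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")   # look for the 'elbow' where the curve flattens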
set.seed(1089)
k.means.passer <- kmeans(PasserStats.Scale[1:7], 3)
PasserStats$cluster <- as.factor(k.means.passer$cluster)
aggregate(data = PasserStats, . ~ cluster, FUN = mean)
## cluster Passer Passer_ID ShortRatio DeepRatio LeftRatio MiddleRatio
## 1 1 270.3333 213.6667 0.7170714 0.2810664 0.4421082 0.1583164
## 2 2 168.1333 160.7778 0.7835984 0.2129106 0.3567771 0.2404223
## 3 3 178.6282 160.5641 0.8235408 0.1734707 0.3584571 0.2203609
## RightRatio TDRatio Accuracy
## 1 0.3977133 1.875000 0.5370733
## 2 0.3993095 1.081705 0.5725727
## 3 0.4181936 1.691369 0.6151912
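To see which quarterbacks landed in each cluster, the passer names can also be listed per cluster (a sketch; only the first few names per cluster are shown, and the exact assignment depends on the random seed):
lapply(split(as.character(PasserStats$Passer), PasserStats$cluster), head)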
There are many takeaways from these clusters (the Passer and Passer_ID columns are just averages of ID codes, so they can be ignored). For example, cluster 1 looks like the more aggressive quarterbacks: they have the highest DeepRatio but also suffer from the lowest Accuracy. To view this visually, I will create a 3D graph showing the pass location ratios, plus two other graphs showing pass length and Accuracy vs. TDRatio, color-coded by cluster, to see how the quarterbacks are grouped.
PasserStats$cluster <- as.factor(k.means.passer$cluster)
colors <- c("Red", "Green", "Blue")
colors <- colors[as.numeric(PasserStats$cluster)]
scatterplot3d(PasserStats$MiddleRatio, PasserStats$LeftRatio,PasserStats$RightRatio, pch = 16, color = colors, main="3D Scatterplot", xlab = "Middle Ratio", ylab = "Left Ratio", zlab = "Right Ratio")
Short.Deep <- ggscatter(PasserStats, x = "DeepRatio", y = "ShortRatio", color = "cluster", shape = "cluster")
Accuracy.TD <- ggscatter(PasserStats, x = "Accuracy", y = "TDRatio", color = "cluster", shape = "cluster")
ggarrange(Short.Deep, Accuracy.TD,
ncol = 2, nrow = 1)
Source: Horowitz, Max. Detailed NFL Play-by-Play Data 2009-2017. March 2018. https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016.