First, I loaded the .csv file I prepared earlier.
library(ggplot2)
nfl <- read.csv('salary_season.csv', stringsAsFactors=FALSE)
There's still a little bit more wrangling I want to do before I start doing explatory data analysis. It turned out the data from USA Today had a few more problems than I initially realized.
The first problem was somewhat easy to overcome. The salary data sometimes listed a player as linebacker, sometimes as an outside linebacker, and sometimes as a middle linbacker.
This is actually a classification problem that is common to NFL data due to the different roles a linebacker can play depending on the scheme. Some linebackers almost always cover receivers on passing downs, where as others (especially in a 3-4 scheme) do nothing but rush the quarterback.
For simplicity's sake, I chose to ignore those differences and simply combine all these vectors into a single total linebacker vector.
total_lbs <- nfl[,'LB.Cost'] + nfl[,'OLB.Cost'] + nfl[,'MLB.Cost']
nfl <- cbind(nfl, tLB.Cost = total_lbs)
The next problem was quite a bit more serious, and required me to adjust my approach. When writing my sql query to create the CSV file, I noticed something odd. The Jacksonville Jaguars apparently did not have any offensive linemen on payroll in 2009.
As much as that may make sense to a football fan who's familiar with the Jaguars, further inspection revealed that only 2005 for any team included any offensive linemen. My original plan was to measure each team's proportion of total salary allocated to each position group. But with offensive line data most missing, this wouldn't be accurate at all.
To still salvage something of value, I decided to tighten my scope on what I did have, which were salaries for all the defensive players (I double checked to see if safeties were in there because fool me twice…).
So I added up all the vectors for defensive position groups to come up with a column in the data frame that had the total amount a team spent on defense in a given year:
total_d <- nfl[,'tLB.Cost'] + nfl[,'DE.Cost'] + nfl[,'DT.Cost']
+ nfl[,'CB.Cost'] + nfl[,'S.Cost']
## [1] 20987608 14206291 8679487 21157531 23600075 8684938 14391218
## [8] 24410246 11308507 8616419 19596532 15407654 19935396 25779113
## [15] 8796763 15945772 12245098 22618265 15781961 22450314 13313262
## [22] 12954835 14900766 16735721 22487912 21477897 14697516 15229005
## [29] 19912717 20219778 20160020 19737054 27806607 29177786 7896817
## [36] 12742179 22618808 9823661 14665919 18316282 12681959 12871118
## [43] 15090312 22102221 14785242 22788382 9885032 11979860 22240018
## [50] 12896885 14382422 14231055 11965745 11167844 19843002 12692831
## [57] 23975097 17253753 17156840 13999025 17880023 15142973 16843827
## [64] 20191789 23123427 22453745 9595202 12358932 17393155 13901603
## [71] 16556039 16609767 16533088 21826804 6405068 16810645 12362656
## [78] 9499616 11991751 8379200 12236661 12192944 12664917 13352810
## [85] 13165816 19902083 13015073 9448309 20911311 19445232 19911414
## [92] 9929464 15469347 9791338 15069660 13807025 8568090 15200427
## [99] 15744195 16050921 16903193 16426797 16004314 14669164 17636600
## [106] 14849571 6935506 21514791 8508724 16193033 13467389 8821334
## [113] 18562604 9210125 15305234 14739927 12981412 9963318 10353101
## [120] 12385620 19494041 14791984 22616973 10237859 10309964 8473463
## [127] 12567285 16993945 10842796 12400871 11607735 14872762 11754660
## [134] 10646451 12619122 10772309 9832626 15115870 6556901 13441464
## [141] 13943431 11250029 10839310 12667534 13402042 8595301 14657582
## [148] 7556264 12291319 10550992 17701048 7634065 13215667 13166453
## [155] 12708738 8164599 8692539 10383465 7587683 26153211 6739646
## [162] 14912636 9626294 12315075 9158864 12931480 10582632 13940278
## [169] 7755277 12255008 6487615 12309258 19027031 9926182 10531929
## [176] 11201081 11576674 8347605 14515123 15194650 11630364 16580996
## [183] 16042988 8572084 10060676 9878832 11729333 8766521 11744943
## [190] 17489554 16023270 25825279 7050207 13659826 9134057 9237433
## [197] 4508769 9185975 10941964 12573976 6419681 11704060 6218695
## [204] 10549681 19946284 18646424 13178608 7654471 8860631 7902861
## [211] 8057549 10406741 8737210 13574126 11185200 10134474 11108132
## [218] 8176132 10519078 10087032 15311359 12705204 9525908 9287822
## [225] 6915739 10301812 6688331 5864737 7910314 9391170 7674219
## [232] 10847175 12120177 11529080 5246093 6687884 13889167 14503126
## [239] 11790732 7327550 6863460 6747058 6388537 10433753 6117091
## [246] 12213647 11370327 8538749 5622953 9206879 5192823 14378545
## [253] 12436696 19709006 8474167 3768313
nfl <- cbind(nfl, tD.Cost = total_d)
Here I simply took my position group totals from earlier and determined the ratio of total defensive money spent.
lb_pct <- nfl[,'tLB.Cost'] / nfl[,'tD.Cost']
dt_pct <- nfl[,'DT.Cost'] / nfl[,'tD.Cost']
de_pct <- nfl[,'DE.Cost'] / nfl[,'tD.Cost']
cb_pct <- nfl[,'CB.Cost'] / nfl[,'tD.Cost']
s_pct <- nfl[,'S.Cost'] / nfl[,'tD.Cost']
nfl <- cbind(nfl, lb_pct, dt_pct, de_pct, cb_pct, s_pct)
And before I started running any analysis, I cut the data down to a smaller frame that only included the propotion numbers and the performance data that could be impacted by defense peformance.
def <- subset(nfl,select=c("team_name",
"year",
"Wins",
"Points.Against",
"Def.Rank.Pts",
"Def.Rank.Yds",
"lb_pct",
"dt_pct",
"de_pct",
"cb_pct",
"s_pct"))
First, I just wanted to see if NFL tradition of ranking NFL defenses by yards allowed was really a better measure than points against.
Here's the plot by yards ranking:
plot(def$'Def.Rank.Yds', def$'Wins', main ="Defensive Ranking by Yards Versus Wins", xlab="Defensive Ranking by Yards Allowed", ylab="Wins")
And the plot for points ranking:
plot(def$'Def.Rank.Pts', def$'Wins', main ="Defensive Ranking by Points Versus Wins", xlab="Defensive Ranking by Points Allowed", ylab="Wins")
Looked to me like ranking by points is a better indicator. I checked out the correlations to confirm this:
cor(def$'Def.Rank.Yds', def$Wins)
## [1] -0.4504027
cor(def$'Def.Rank.Pts', def$Wins)
## [1] -0.669541
Sure enough, the correlation to wins is considerably stronger for points allowed than it is yards (the correlations are negative because the rankings are inverted, ie 1 is the best, 32 is the worst).
However, in order to totally accept this conclusion, I would need to confirm that the “Points Allowed” data provided by Pro Football Reference only included points actually given up by the defense (as opposed to special teams touchdowns or pick sixes, which increase a team's total points allowed but not that of the defense).
I was also generally curious how teams split up their defense budgets. This was very easy to see using the summary() function:
summary(def[,7:11])
## lb_pct dt_pct de_pct cb_pct
## Min. :0.1675 Min. :0.02181 Min. :0.04404 Min. :0.03735
## 1st Qu.:0.5387 1st Qu.:0.11743 1st Qu.:0.16472 1st Qu.:0.15108
## Median :0.6117 Median :0.15785 Median :0.21490 Median :0.19289
## Mean :0.5976 Mean :0.16963 Mean :0.23281 Mean :0.22932
## 3rd Qu.:0.6796 3rd Qu.:0.20804 3rd Qu.:0.30204 3rd Qu.:0.28582
## Max. :0.9039 Max. :0.50078 Max. :0.66488 Max. :1.16359
## s_pct
## Min. :0.03005
## 1st Qu.:0.09039
## Median :0.12067
## Mean :0.13761
## 3rd Qu.:0.17084
## Max. :0.45880
I also wanted a visualization of what kind of variability existed for each position group:
ggplot(data = def) +geom_histogram(aes(x = lb_pct))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(data = def) +geom_histogram(aes(x = dt_pct))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(data = def) +geom_histogram(aes(x = de_pct))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(data = def) +geom_histogram(aes(x = cb_pct))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(data = def) +geom_histogram(aes(x = s_pct))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
With the graphs, it's a bit easier to see what's going on. Most striking to me is the variance that exists within the defensive end grouping. The interquartile range is the largest of the 5 position groups measured.
When I originally set out with this project, I wanted to use the “K-Means Clustering” algorithm to detect patterns in the way general managers (the person in charge of acquiring players) structure their teams, and to see how those patterns compared in terms of success (wins and playoff appearances).
My understanding (which is very much nascent at this point) is you begin with a multi dimensional matrix, and choose a “K” number of clusters to establish your means. You than pick an aribitrary number of points that will eventually define your patterns. Finally, the algorithm creates a vector for each point that attempts to guess points that will be the closest to grouping of vectors, using euclidian distance. It continues to guess, using past performance to improve it's guesses, until a satisfactorily small distance is achieved for all the vectors.
My strategy to implement this using NFL data would be to use the 5 position group budget allocation percentages and wins as vector elements, and use maybe 3 or 4 nearest neighbor points to look for
My thoughts as to why this would be a good fit for salary data is that the K-Means are already all percentages, thus avoiding the scaling problems that the K-Means can be prone to.
Also, if successful, it would be easy to measure the success of each of each nearest cluster point, as they would all have different win values. Than, you would have a template you could conceivably give back to a general manager and say “This pattern is much more associated with success than the others.”