NFL Box Score - Posted by Andrew Carson

For this project, the objective is to select three untidy data sets posted by students on the class discussion board, and - for each of the three datasets - read the data into R, tidy the data, and complete the analysis suggested by the classmate in the discussion board post.

This R markdown file explains the steps used to complete the analysis for the third dataset I selected, which was posted by Andrew Carson. It is a box score from the San Francisco 49ers vs. Seattle Seahawks game on 9/25/2016. The dataset and source is shown below:

Create Database Containing Box Score Data

The NFL Box Score was loaded to a MySQL database schema called ‘project_two’ into a table called ‘nfl_box_score’. The SQL scripts to build the database and load the data are available on GitHub at the link below. The box score was loaded as-is except all spaces in the stat description were converted to underscores and periods were removed. Please note that the GitHub file includes the code used for all three datasets selected for the project, not just this dataset.

https://github.com/LelandoSupreme/DATA607/blob/master/Randles_Project2_DATA607.sql

Read nfl_box_score MySQL Table into R

To facilitate the loading of data from MySQL to an R data frame, I loaded the “RMySQL” package.

install.packages("RMySQL",repos='http://cran.wustl.edu/')
library(RMySQL)

Once the package was installed, I connected to the database and created a data frame from the ‘nfl_box_score’ table.

# Get the MySQL
drv = dbDriver("MySQL")
# Create a connection to the MySQL database
con <- dbConnect(drv, user = 'root', password = 'temp1002!', dbname = 'project_two')
# Create the nfl_box_score data frame
nfl_box_score <- dbReadTable(con, "nfl_box_score")
# View the nfl_box_score data frame
nfl_box_score
##                  stat           sfo           sea
## 1         First_Downs            12            18
## 2        Rush-Yds-TDs      31-135-2      31-127-2
## 3  Comp-Att-Yd-TD-INT 14-25-119-0-1 22-32-308-2-1
## 4        Sacked-Yards           0-0          2-17
## 5      Net_Pass_Yards           119           291
## 6         Total_Yards           254           418
## 7        Fumbles-Lost           1-0           1-1
## 8           Turnovers             1             2
## 9     Penalties-Yards          4-35          6-50
## 10    Third_Down_Conv          4-15          9-14
## 11   Fourth_Down_Conv           1-1           0-0
## 12 Time_of_Possession      24:03:00      35:57:00

As you can see, the ‘nfl_box_score’ table has 3 columns: stat, sfo and sea.

Tidy and Tranform Data

Hadley Wickham defines “tidy data” (http://vita.had.co.nz/papers/tidy-data.pdf.) as data which is structured such that:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

In the case of the nfl_box_score data, there are 24 variables: Team, First Downs, Rushing Attempts, Rushing Yards, Rushing TDs, Pass Completions, Passing Attempts, Passing Yards, Passing TDs, Passing Interceptions, Net Pass Yards, Sacks, Sacked Yards, Fumbles, Fumbles Lost, Turnovers, Penalties, Penalty Yards, Third Down Conversions, Third Down Conversion Attempts, Fourth Down Conversions, Fourth Down Conversion Attempts, Total Yards and Time of Possession. To make each row an observation, we need one row for every team. In this case, there is only one type of observational unit, so one table will be appropriate.

To tidy and transform the data, I loaded the tidyr package (https://cran.r-project.org/web/packages/tidyr/tidyr.pdf) and the dplyr package (https://cran.r-project.org/web/packages/dplyr/dplyr.pdf).

install.packages("tidyr",repos='http://mirrors.nics.utk.edu/cran/')
library(tidyr)
library(dplyr)

Once the packages were loaded, I pipelined nine functions to create the tidy data frame ‘nfl_box_score’. The steps performed:
1. Used the gather function to turn the sfa and sea columns into values in a ‘team’ column.
2. Turned the values in the stat column into column headers using the spread function.
3. Used the separate function seven times to break the columns holding multiple stats into multiple columns holding each stat.

# Pipelined commands executing steps shown above
nfl_box_score <- nfl_box_score %>% gather("team", "values", 2:3) %>% spread(stat, values) %>% separate(`Comp-Att-Yd-TD-INT`, c("pass_completions", "passing_attempts", "passing_yards", "passing_tds", "passing_ints"), "-", extra = "drop") %>% separate(Fourth_Down_Conv, c("fourth_down_conversions", "fourth_down_conv_attempts"), "-", extra = "drop") %>% separate(`Fumbles-Lost`, c("fumbles", "fumbles_lost"), "-", extra = "drop") %>% separate(`Penalties-Yards`, c("penalties", "penalty_yds"), "-", extra = "drop") %>% separate(`Rush-Yds-TDs`, c("rushing_attempts", "rushing_yds", "rushing_tds"), "-", extra = "drop") %>% separate(`Sacked-Yards`, c("sacks", "sacked_yds"), "-", extra = "drop") %>% separate(Third_Down_Conv, c("third_down_conversions", "third_down_conv_attempts"), "-", extra = "drop")
# View results
head(nfl_box_score)
##   team pass_completions passing_attempts passing_yards passing_tds
## 1  sea               22               32           308           2
## 2  sfo               14               25           119           0
##   passing_ints First_Downs fourth_down_conversions
## 1            1          18                       0
## 2            1          12                       1
##   fourth_down_conv_attempts fumbles fumbles_lost Net_Pass_Yards penalties
## 1                         0       1            1            291         6
## 2                         1       1            0            119         4
##   penalty_yds rushing_attempts rushing_yds rushing_tds sacks sacked_yds
## 1          50               31         127           2     2         17
## 2          35               31         135           2     0          0
##   third_down_conversions third_down_conv_attempts Time_of_Possession
## 1                      9                       14           35:57:00
## 2                      4                       15           24:03:00
##   Total_Yards Turnovers
## 1         418         2
## 2         254         1

Income by Sex by Age Ranges

The suggested analysis is to “compare the yards per touchdown (both rushing and passing) for both teams” and interpret what these stats mean.

# Convert character vector columns to integer data type
nfl_box_score[,c(2:21,23:24)] <- as.integer(unlist(nfl_box_score[,c(2:21,23:24)]))
# Compute yards per touchdown
nfl_box_score %>% group_by(team) %>% summarize(Total_Yards / (passing_tds + rushing_tds))
## # A tibble: 2 × 2
##    team `Total_Yards/(passing_tds + rushing_tds)`
##   <chr>                                     <dbl>
## 1   sea                                     104.5
## 2   sfo                                     127.0

What does this stat tell us? Not a lot. In theory the team who needs less yards per touchdown is more efficient than their opponent, but this could also be the product of shorter fields due to superior special teams or turnovers. As a general statement over a span of multiple games, the teams with the lowest ratio of yards to TDs is going to be one of the best teams. On a one-game basis, it is difficult to draw conclusions. Seatlle did win the game, 37-18, and had the lower yards per TD number.