For this project, the objective is to select three untidy data sets posted by students on the class discussion board, and - for each of the three datasets - read the data into R, tidy the data, and complete the analysis suggested by the classmate in the discussion board post.
This R markdown file explains the steps used to complete the analysis for the third dataset I selected, which was posted by Andrew Carson. It is a box score from the San Francisco 49ers vs. Seattle Seahawks game on 9/25/2016. The dataset and source is shown below:
The NFL Box Score was loaded to a MySQL database schema called ‘project_two’ into a table called ‘nfl_box_score’. The SQL scripts to build the database and load the data are available on GitHub at the link below. The box score was loaded as-is except all spaces in the stat description were converted to underscores and periods were removed. Please note that the GitHub file includes the code used for all three datasets selected for the project, not just this dataset.
https://github.com/LelandoSupreme/DATA607/blob/master/Randles_Project2_DATA607.sql
To facilitate the loading of data from MySQL to an R data frame, I loaded the “RMySQL” package.
install.packages("RMySQL",repos='http://cran.wustl.edu/')
library(RMySQL)
Once the package was installed, I connected to the database and created a data frame from the ‘nfl_box_score’ table.
# Get the MySQL
drv = dbDriver("MySQL")
# Create a connection to the MySQL database
con <- dbConnect(drv, user = 'root', password = 'temp1002!', dbname = 'project_two')
# Create the nfl_box_score data frame
nfl_box_score <- dbReadTable(con, "nfl_box_score")
# View the nfl_box_score data frame
nfl_box_score
## stat sfo sea
## 1 First_Downs 12 18
## 2 Rush-Yds-TDs 31-135-2 31-127-2
## 3 Comp-Att-Yd-TD-INT 14-25-119-0-1 22-32-308-2-1
## 4 Sacked-Yards 0-0 2-17
## 5 Net_Pass_Yards 119 291
## 6 Total_Yards 254 418
## 7 Fumbles-Lost 1-0 1-1
## 8 Turnovers 1 2
## 9 Penalties-Yards 4-35 6-50
## 10 Third_Down_Conv 4-15 9-14
## 11 Fourth_Down_Conv 1-1 0-0
## 12 Time_of_Possession 24:03:00 35:57:00
As you can see, the ‘nfl_box_score’ table has 3 columns: stat, sfo and sea.
Hadley Wickham defines “tidy data” (http://vita.had.co.nz/papers/tidy-data.pdf.) as data which is structured such that:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
In the case of the nfl_box_score data, there are 24 variables: Team, First Downs, Rushing Attempts, Rushing Yards, Rushing TDs, Pass Completions, Passing Attempts, Passing Yards, Passing TDs, Passing Interceptions, Net Pass Yards, Sacks, Sacked Yards, Fumbles, Fumbles Lost, Turnovers, Penalties, Penalty Yards, Third Down Conversions, Third Down Conversion Attempts, Fourth Down Conversions, Fourth Down Conversion Attempts, Total Yards and Time of Possession. To make each row an observation, we need one row for every team. In this case, there is only one type of observational unit, so one table will be appropriate.
To tidy and transform the data, I loaded the tidyr package (https://cran.r-project.org/web/packages/tidyr/tidyr.pdf) and the dplyr package (https://cran.r-project.org/web/packages/dplyr/dplyr.pdf).
install.packages("tidyr",repos='http://mirrors.nics.utk.edu/cran/')
library(tidyr)
library(dplyr)
Once the packages were loaded, I pipelined nine functions to create the tidy data frame ‘nfl_box_score’. The steps performed:
1. Used the gather function to turn the sfa and sea columns into values in a ‘team’ column.
2. Turned the values in the stat column into column headers using the spread function.
3. Used the separate function seven times to break the columns holding multiple stats into multiple columns holding each stat.
# Pipelined commands executing steps shown above
nfl_box_score <- nfl_box_score %>% gather("team", "values", 2:3) %>% spread(stat, values) %>% separate(`Comp-Att-Yd-TD-INT`, c("pass_completions", "passing_attempts", "passing_yards", "passing_tds", "passing_ints"), "-", extra = "drop") %>% separate(Fourth_Down_Conv, c("fourth_down_conversions", "fourth_down_conv_attempts"), "-", extra = "drop") %>% separate(`Fumbles-Lost`, c("fumbles", "fumbles_lost"), "-", extra = "drop") %>% separate(`Penalties-Yards`, c("penalties", "penalty_yds"), "-", extra = "drop") %>% separate(`Rush-Yds-TDs`, c("rushing_attempts", "rushing_yds", "rushing_tds"), "-", extra = "drop") %>% separate(`Sacked-Yards`, c("sacks", "sacked_yds"), "-", extra = "drop") %>% separate(Third_Down_Conv, c("third_down_conversions", "third_down_conv_attempts"), "-", extra = "drop")
# View results
head(nfl_box_score)
## team pass_completions passing_attempts passing_yards passing_tds
## 1 sea 22 32 308 2
## 2 sfo 14 25 119 0
## passing_ints First_Downs fourth_down_conversions
## 1 1 18 0
## 2 1 12 1
## fourth_down_conv_attempts fumbles fumbles_lost Net_Pass_Yards penalties
## 1 0 1 1 291 6
## 2 1 1 0 119 4
## penalty_yds rushing_attempts rushing_yds rushing_tds sacks sacked_yds
## 1 50 31 127 2 2 17
## 2 35 31 135 2 0 0
## third_down_conversions third_down_conv_attempts Time_of_Possession
## 1 9 14 35:57:00
## 2 4 15 24:03:00
## Total_Yards Turnovers
## 1 418 2
## 2 254 1
The suggested analysis is to “compare the yards per touchdown (both rushing and passing) for both teams” and interpret what these stats mean.
# Convert character vector columns to integer data type
nfl_box_score[,c(2:21,23:24)] <- as.integer(unlist(nfl_box_score[,c(2:21,23:24)]))
# Compute yards per touchdown
nfl_box_score %>% group_by(team) %>% summarize(Total_Yards / (passing_tds + rushing_tds))
## # A tibble: 2 × 2
## team `Total_Yards/(passing_tds + rushing_tds)`
## <chr> <dbl>
## 1 sea 104.5
## 2 sfo 127.0
What does this stat tell us? Not a lot. In theory the team who needs less yards per touchdown is more efficient than their opponent, but this could also be the product of shorter fields due to superior special teams or turnovers. As a general statement over a span of multiple games, the teams with the lowest ratio of yards to TDs is going to be one of the best teams. On a one-game basis, it is difficult to draw conclusions. Seatlle did win the game, 37-18, and had the lower yards per TD number.