College football was never an overt interest of mine. I had never watched a game beyond walking by my dad in the living room on a Sunday night, but after coming to UT and experiencing my first in-person game day, it made so much sense why so many students and alumni were purchasing season tickets for hundreds of dollars. As Django Walker once said, “As I grew, so did my pride for the thunder of Ole Smokey and singing Texas Fight.” That being said, after paying more attention to the plays, I realized UT’s track record is not what it used to be. Despite the enthusiasm coming from the bleachers and Twitter, Quinn Ewers is not Vince Young and cannot always be counted on to complete a pass, and our outstanding running backs and defense don’t do so well on the road.
The data sets used here are the 2013 college football statistics, collected from every game and every play of the 130 FBS college football teams and obtained from http://cfbstats.com/. The three data sets used (play, game, and stadium) contain variables for the game code, home team code, visiting team code, offensive team code and points, defensive team code and points, stadium turf type, type of play, distance, down, and period. Offensive and defensive team points, distance, down, and period are all numeric variables. The team and game codes are numeric, but represent categorical data. Stadium turf type and type of play are categorical.
From these data sets, I sought to find any correlation with a team's success. I suspect that as the game goes on, the players will score fewer points. Additionally, I expect plays on turf to cover greater distances than plays on grass. The research questions are as follows: How does period affect points scored, and how does surface relate to the distance traveled?
# load in libraries
library(tidyverse)
library(ggplot2)
# read in dataframes used
play = read.csv("play.csv")
stadium = read.csv("stadium.csv")
game = read.csv("game.csv")
# join game and play data sets into game_play
game_play = game %>%
  left_join(play, by = "Game.Code")
# join stadium with game_play into full_data
full_data = game_play %>%
  left_join(stadium, by = "Stadium.Code")
# check potential mismatches:
game %>%
  anti_join(play, by = "Game.Code") %>%
  dim()
game_play %>%
  anti_join(stadium, by = "Stadium.Code") %>%
  dim()
Initially, game had 848 observations of 6 variables, play had 160697 observations of 17 variables, and stadium had 164 observations of 7 variables. First, game and play were joined by the “Game.Code” ID, which they share. Since play had far more observations than game (multiple plays per game), the resulting data frame had the same number of rows as the play data frame. This new table was then joined with the stadium data set using the “Stadium.Code” ID. The number of observations stayed constant, and as a check, anti_join found zero mismatches. An issue with these data sets is that Offense.Points and Defense.Points are cumulative, so all visualizations and statistics calculated from these numbers reflect the points scored up to that point in the game, not the points scored at that moment.
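The row-count reasoning above can be tried on a small example. This is only a sketch on made-up toy data frames (game_toy and play_toy are hypothetical stand-ins, not the real files), showing how left_join keeps one row per play and how anti_join surfaces keys with no match.

```r
library(dplyr)

# Toy stand-ins for game and play (hypothetical values, not the real data)
game_toy <- data.frame(Game.Code = c(1, 2, 3), Stadium.Code = c(10, 20, 10))
play_toy <- data.frame(Game.Code = c(1, 1, 2, 2, 2), Play.Number = c(1, 2, 1, 2, 3))

# One row per play for games that have plays; game 3 has none and is kept with NA
joined <- left_join(game_toy, play_toy, by = "Game.Code")
nrow(joined)

# Games with no matching rows in play_toy
unmatched <- anti_join(game_toy, play_toy, by = "Game.Code")
nrow(unmatched)
```

In the toy data one game has no plays, so the joined table has one more row than play_toy. When anti_join returns zero rows, as it did for the real data, that extra-row case does not occur.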
# cleans up the data to show only variables and rows we need
clean_data = full_data %>%
  filter(!is.na(Distance), Play.Type != "PENALTY", Play.Type != "TIMEOUT") %>%
  select(!c(Site, Clock, City, State, Spot, Drive.Number, Drive.Play))
# creates new column for the leading score at the end of each period
period_scores = clean_data %>%
  # keeps only the row for the last play of each period
  group_by(Game.Code, Period.Number) %>%
  filter(Play.Number == max(Play.Number)) %>%
  # creates column (pmax compares the two columns row by row) and arranges rows from highest score
  mutate(Winning.Points = pmax(Offense.Points, Defense.Points)) %>%
  select(c(Game.Code, Period.Number, Offense.Points, Defense.Points, Winning.Points, Offense.Team.Code, Defense.Team.Code)) %>%
  arrange(desc(Winning.Points))
head(period_scores)
# creates new column for the final score for each game
final_score = clean_data %>%
  # keeps only the row for the last play of each game
  group_by(Game.Code) %>%
  filter(Play.Number == max(Play.Number)) %>%
  # creates column and arranges rows from highest final score
  mutate(Final.Points = pmax(Offense.Points, Defense.Points)) %>%
  select(c(Game.Code, Period.Number, Offense.Points, Defense.Points, Offense.Team.Code, Defense.Team.Code, Final.Points)) %>%
  arrange(desc(Final.Points))
head(final_score)
# new column for the difference between the two teams' scores at the end of each period
period_scores %>%
  mutate(Score.Diff = Winning.Points - pmin(Offense.Points, Defense.Points)) %>%
  arrange(desc(Score.Diff))
The first step was to select the main variables I wanted to work with and keep only the rows where the distance column had a value. The next step was to create a new column containing the score of the leading team at the end of each period. This step found that the highest scores were generally reached in the fourth quarter, with a few games finalizing their scores by the third quarter. The highest scoring game in the 2013 season had a score of 80 points. I then used a similar process to find the final scores of each game. Finally, I checked the differences between the scores of the winning and losing teams for each period to see the greatest point disparities. I manually looked up the teams with the greatest point disparity and found that Ohio State dominated Florida A&M in 2013.
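Because the point columns are cumulative, the points gained within a period have to be recovered by differencing consecutive end-of-period scores. The following is a minimal sketch on a made-up single game (period_toy and the Leader.* column names are hypothetical), assuming one row per period that already holds the score at the end of that period.

```r
library(dplyr)

# Hypothetical end-of-period cumulative scores for one game
period_toy <- data.frame(
  Game.Code = c(1, 1, 1, 1),
  Period.Number = c(1, 2, 3, 4),
  Offense.Points = c(7, 14, 17, 31),
  Defense.Points = c(3, 10, 20, 20)
)

per_period <- period_toy %>%
  group_by(Game.Code) %>%
  arrange(Period.Number, .by_group = TRUE) %>%
  mutate(
    # row-wise higher and lower of the two cumulative scores
    Leader.Points = pmax(Offense.Points, Defense.Points),
    Score.Diff = Leader.Points - pmin(Offense.Points, Defense.Points),
    # change in the leading score since the end of the previous period
    Leader.Gain = Leader.Points - lag(Leader.Points, default = 0)
  ) %>%
  ungroup()

per_period
```

Note that the leading team can change between periods (as it does in period 3 of the toy game), so Leader.Gain tracks the leading score rather than one team's scoring.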
# most common surface types played on
clean_data %>%
  group_by(Surface) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
# look at variation in distance run by surface types
distance_surface = clean_data %>%
  group_by(Surface) %>%
  summarize(a = mean(Distance), b = mean(Year.Opened)) %>%
  arrange(desc(a))
head(distance_surface)
tail(distance_surface)
# types of plays
clean_data %>%
  group_by(Play.Type) %>%
  summarize(n = n())
# variation in distance by surface types and play types
distance_all_plays = clean_data %>%
  group_by(Surface, Play.Type) %>%
  summarize(a = mean(Distance), b = mean(Year.Opened)) %>%
  arrange(desc(a))
# variation in distance by surface types only for RUSH plays
rush_distance = clean_data %>%
  filter(Play.Type == "RUSH") %>%
  group_by(Surface) %>%
  summarize(a = mean(Distance), b = mean(Year.Opened)) %>%
  arrange(desc(a))
head(rush_distance)
For the surface and distance comparison, I first found the most common surface types. Natural grass is the most common, with Field Turf next. These counts are for the games played, not for the most common surfaces across stadiums. Then, I created a new data set for the distance by surface and found that the SoftTop surface system had the highest mean distance. I also included the year the stadiums were built to see if they had newer turf, but different turfs can be installed regardless of the year the stadium was built. The distance was only 0.07 yards higher than the next surface type, but the difference from the lowest average distance was 0.8 yards. I then checked the different play types and their frequencies. Rush and Pass plays were the most common. The next data set I created was the average distance for each play type and surface type. This showed the greatest yardage being achieved by Pass plays regardless of the turf type. I created another data set containing only rush plays, since I assume these have more to do with the surface type than the others. For rush plays, the best surface type was Sports Turf with an average of 8.71 yards and the next best was ProGrass with an average of 8.45 yards, a noticeably larger gap between the top two surfaces than the 0.07 yards seen for all play types.
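Small gaps between group means are easier to judge when the number of plays and the spread are reported next to each mean. The sketch below uses made-up rush distances (rush_toy and its values are hypothetical) and adds a play count and a standard error to the same group_by and summarize pattern used above.

```r
library(dplyr)

# Hypothetical rush plays on two surfaces
rush_toy <- data.frame(
  Surface = c("Grass", "Grass", "Grass", "Grass", "Turf", "Turf"),
  Distance = c(2, 4, 6, 8, 3, 9)
)

rush_summary <- rush_toy %>%
  group_by(Surface) %>%
  summarize(
    plays = n(),                                # how many plays back each mean
    mean_distance = mean(Distance),
    se_distance = sd(Distance) / sqrt(plays),   # standard error of the mean
    .groups = "drop"
  ) %>%
  arrange(desc(mean_distance))

rush_summary
```

A surface with few plays will tend to have a large standard error, so a top-ranked mean from such a surface is weaker evidence than the ranking alone suggests.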
final_score %>%
  ggplot(aes(x = Final.Points)) +
  geom_histogram(bins = 20) +
  labs(title = "Winning Scores for all 2013 Games", subtitle = "Figure 1", x = "Final Game Score", y = "Number of Games") +
  scale_x_continuous(breaks = seq(0, 80, 10)) +
  theme_bw() # adds in grid lines
This graph shows the distribution of final game scores. As discussed earlier, the highest score in any game this year was 80, and the distribution overall is slightly right-skewed. The most common final score was around 30 points.
period_scores %>%
  ggplot(aes(x = Winning.Points, fill = factor(Period.Number))) +
  geom_boxplot() +
  facet_grid(rows = vars(Period.Number)) +
  labs(title = "Winning Scores by Period", subtitle = "Figure 2", x = "Highest Team Score", y = "Period", fill = "Period Number") +
  scale_x_continuous(breaks = seq(0, 80, 10)) +
  # theme_bw() is a complete theme, so it goes first and the theme() tweaks are applied on top of it
  theme_bw() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
As expected, the scores increase with period number, but the difference in median score from the second quarter to the third is the lowest. The first quarter sees around 10 points gained, then another 11 in the second quarter. In the third quarter, however, the median score is less than 9 points higher than in the second quarter. The fourth quarter has an increase of around 10 points again. Though not all games go into overtime, I chose to include the 5th period here because it is interesting to see that these closer games, which may have better defenses, have a noticeably lower median than the games in the 4th quarter.
distance_all_plays %>%
  ggplot(aes(y = Surface, x = a, colour = Play.Type)) +
  geom_point() +
  labs(title = "Effect of Surface and Play Type on Distance", subtitle = "Figure 3", y = "Surface", x = "Mean Distance (Yards)", colour = "Play Type") +
  scale_x_continuous(breaks = seq(6, 10, .25)) +
  # theme_bw() is a complete theme, so it goes first and the legend tweak is applied on top of it
  theme_bw() +
  theme(legend.justification = "bottom")
The Pass mean distance seems unaffected by surface type overall, as all these points lie around the 8.75 yard mark. The Field Goal plays seem to really benefit from the ProGrass surface. The Punt average distance was highest on A-Turf. The Rush average distance was highest on ProGrass, though, as with the pass plays, the averages were not greatly different from one surface type to another.
As seen in Figure 1, most winning teams have scored around 30 points by the end of the game. When split up by period as in Figure 2, the points earned seem fairly evenly distributed across periods, but in the 3rd period, after halftime, the winning teams seem to have scored fewer points than in the other periods. This could be caused by a number of factors, one of them being that players lose their adrenaline and motivation for the game while waiting for the next half to start. Different surface types seemed to affect different plays to varying degrees. Figure 3 shows the field goals covered the greatest distance on ProGrass. Punts were most effective on A-Turf, passes were most successful on Presteige System, and rushes on SportsTurf. Overall, the SoftTop Turf System had a mean distance of 8.78 even though it scored the lowest for field goals on their own. This could be because the number of field goal attempts was low, so their shorter distance did not pull down the overall mean. SportsTurf was the second highest with an overall mean of 8.71 and was the best for rush plays.
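The point about field goals barely moving a surface's overall mean can be illustrated with a weighted mean: the overall mean over all plays equals the per-play-type means weighted by how many plays of each type there were. The numbers and the PASS and FIELD_GOAL labels below are made up for illustration and are not taken from the data.

```r
# Hypothetical per-play-type mean distances and play counts for one surface
play_means  <- c(PASS = 9.0, RUSH = 8.5, FIELD_GOAL = 5.0)
play_counts <- c(PASS = 500, RUSH = 480, FIELD_GOAL = 20)

# Mean over all plays: each play type counts in proportion to its frequency
overall_mean <- weighted.mean(play_means, w = play_counts)

# Treating the three play types as equally common gives a very different answer
unweighted_mean <- mean(play_means)

c(overall = overall_mean, unweighted = unweighted_mean)
```

With only 20 of 1000 plays being field goals in this toy case, their low mean shifts the overall figure far less than an equal weighting of play types would.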
Some challenges were the organization of the data and the lack of data in some areas. Since the teams switch between offense and defense based on who has possession, it was difficult to determine who the winners of each game were, and I would have liked to further explore which teams were doing well overall. There was also no data on which team was the home team for each game. After seeing there was not that big of a difference in the distance based on type of surface, I wanted to see if there was a difference based on what surface the teams were accustomed to playing on versus what type of turf the current stadium had. It was very fun to keep exploring the data through this project, and I definitely had to make adjustments as I went. When I initially grouped the distance by surface and play type, I saw that passing yards were playing a huge role in how the surfaces were ranked, so I changed the focus to rushing yards. I learned how important it is to fully understand what the data is beforehand. I was very confused by the “5” under period until I checked and found very few games had it, and it likely indicated that a game went into overtime. There was also a distance listed for penalty plays, which I had to remove since those were probably negative numbers and were not dependent on the surface type.
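One possible way around the offense and defense switching is to compare the two point columns on the last recorded play of each game and take the team code from whichever side is ahead. This is only a sketch on made-up rows: last_play_toy, its team codes, and the Winner.* and Loser.* column names are hypothetical, and it assumes the last play's cumulative points match the final score.

```r
library(dplyr)

# Hypothetical last-play rows, one per game (team codes are made up)
last_play_toy <- data.frame(
  Game.Code = c(1, 2),
  Offense.Team.Code = c(703, 518),
  Defense.Team.Code = c(28, 229),
  Offense.Points = c(21, 10),
  Defense.Points = c(14, 76)
)

winners <- last_play_toy %>%
  mutate(
    # pick the team code of whichever side has more points on the last play
    Winner.Team.Code = if_else(Offense.Points > Defense.Points,
                               Offense.Team.Code, Defense.Team.Code),
    Winner.Points = pmax(Offense.Points, Defense.Points),
    Loser.Points = pmin(Offense.Points, Defense.Points)
  )

winners
```

If the last recorded play happens before a late score, or the two columns are tied on that row, this approach would mislabel the game, so the result would still need a spot check against known final scores.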