If you browse the very large Kaggle.com data set titled “Football commentary data set,” you will find some intriguing patterns and similarities among the styles of commentary over the last half century. This data set gave me access to every word spoken on its broadcasts of both college and professional football.
What interested me most in this study were the specific combinations of words spoken most frequently on broadcasts. To find them, I ran a bigram analysis, which counts pairs of words said one after another. As a football fan myself, I had predictions about which words were most likely to be connected. I expected to see phrases like “yard line” and “penalty flag.” I also anticipated position names such as “running back,” “wide receiver,” and “middle linebacker.”
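To make the idea of a bigram concrete, here is a minimal base R sketch on a made-up sentence. The sentence and variable names are illustrative assumptions, and this is not how tidytext tokenizes internally; it only shows what "pairs of consecutive words" means.

```r
# A made-up sentence standing in for a line of commentary (assumption)
sentence <- "The kicker lines up for the field goal attempt"

# Lowercase the text and split it into words on spaces
words <- strsplit(tolower(sentence), " +")[[1]]

# Pair each word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))

print(bigrams)
```

A sentence of nine words yields eight overlapping bigrams, which is why the number of bigrams in a transcript grows almost as fast as the number of words.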
library(jsonlite)
library(tidyverse)
library(tidyjson)
library(tidytext)
library(gridExtra)
library(wordcloud2)
I read in the JSON, removed irrelevant columns, and converted it to a usable data frame.
# Read the raw transcripts from JSON
football <- fromJSON("/Users/jamestait0514/Desktop/dataset/raw_transcripts.json", simplifyDataFrame = TRUE)
# Flatten the JSON and keep only the transcript text and the broadcast year
football_clean <- football %>%
  spread_all() %>%
  select(transcript, year)
football_clean_df <- as.data.frame(football_clean)
# Keep only broadcasts from 2000 or earlier, comparing the year as a number
football_20c <- football_clean_df %>%
  filter(as.integer(year) <= 2000)
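As a small illustration of the read-and-filter step, the sketch below parses a two-record JSON string and keeps the records from 2000 or earlier. The toy records are invented, and the assumption that each record holds a `transcript` and a `year` field stored as text is mine; the real file's structure may differ.

```r
library(jsonlite)

# Two invented records; field names and types are assumptions
toy_json <- '[
  {"transcript": "first down at the forty yard line", "year": "1972"},
  {"transcript": "the field goal attempt is good", "year": "2005"}
]'

# An array of flat objects is simplified into a data frame
toy <- fromJSON(toy_json, simplifyDataFrame = TRUE)

# Compare the year as an integer so the cutoff is a plain number
toy_20c <- toy[as.integer(toy$year) <= 2000, ]

print(toy_20c)
```

Converting the year to an integer before comparing keeps the cutoff in calendar years rather than relying on date arithmetic.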
Then I split each transcript into bigrams, separated every bigram into two word columns, and removed any pair that contained a stop word.
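# Split each transcript into bigrams (pairs of consecutive words)
football_words_df <- football_20c %>%
  unnest_tokens(bigram, transcript, token = "ngrams", n = 2)
# Separate each bigram into its first and second word
football_separated <- football_words_df %>%
  separate(bigram, c("word1", "word2"), sep = " ")
# Drop any bigram in which either word is a stop word
football_filtered <- football_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
# Show the 20 most frequent bigrams
football_filtered %>%
  count(word1, word2, sort = TRUE) %>%
  head(20)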
##         word1      word2    n
## 1        yard       line 2193
## 2      kansas       city  491
## 3       field       goal  451
## 4          gt         gt  450
## 5         los    angeles  378
## 6       notre       dame  332
## 7   minnesota    vikings  303
## 8   baltimore      colts  297
## 9      monday      night  243
## 10      green        bay  233
## 11       york       jets  223
## 12       ball       game  212
## 13        san      diego  207
## 14     middle linebacker  199
## 15    angeles       rams  198
## 16   national   football  195
## 17   football     league  189
## 18   football       game  187
## 19       wide   receiver  162
## 20 pittsburgh   steelers  161
# Words that most often follow "goal"
football_filtered %>%
  filter(word1 == "goal") %>%
  count(word2, sort = TRUE) %>%
  head(20)
##         word2   n
## 1        line 132
## 2     attempt  48
## 3      kicker  16
## 4    position  16
## 5       range  16
## 6        post   9
## 7     kickers   7
## 8         ago   5
## 9       david   5
## 10       unit   4
## 11       send   3
## 12   attempts   2
## 13     called   2
## 14     dawson   2
## 15        jim   2
## 16    kicking   2
## 17      party   2
## 18 pittsburgh   2
## 19         10   1
## 20         17   1
# Words that most often come right before "attempt"
football_filtered %>%
  filter(word2 == "attempt") %>%
  count(word1, sort = TRUE) %>%
  head(20)
##         word1  n
## 1        goal 48
## 2        pass  8
## 3        yard  6
## 4  conversion  3
## 5   fieldgoal  3
## 6        kick  3
## 7      50yard  2
## 8     gabriel  2
## 9      screen  2
## 10     twelve  2
## 11     23yard  1
## 12     40yard  1
## 13     46yard  1
## 14    bacchus  1
## 15    collins  1
## 16       fide  1
## 17     hearth  1
## 18  offensive  1
## 19 onsidekick  1
## 20      sixth  1
The results partly confirmed my hypothesis and partly surprised me. I anticipated that player positions and certain in-game events would be the most common word combinations in the commentary. In a sense, that was correct: “yard line” was by far the most common bigram, with 2,193 occurrences, and “field goal,” “middle linebacker,” and “wide receiver” all made the top 20. However, many of the other top combinations were a surprise to me, because they were team names. I hadn’t thought about how team names have two or three parts, so whenever commentators mention a team they repeat the same series of words. “Kansas city” and “los angeles” both ended up in the top five, and pairs such as “york jets” and “angeles rams” appear because a three-word name like the New York Jets or the Los Angeles Rams is split into overlapping two-word pairs. The other kind of combination that surprised me was bowl games. The Super Bowl is one of the most widely broadcast events in sports and its name is a two-word combination, so I should have anticipated that “super bowl” would rank high. I also observed that a very popular college bowl game, the Rose Bowl, came up frequently as well.
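To see why fragments of team names turn up as bigrams, the sketch below runs the same tokenize, separate, and filter steps on a single invented sentence. The sentence is an assumption made for illustration, and which pairs survive depends on the contents of tidytext's `stop_words` table, so no output is listed here.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# One invented line of commentary containing two three-word team names
toy <- tibble(transcript = "the new york jets face the los angeles rams")

toy_pairs <- toy %>%
  # Every pair of consecutive words becomes one row
  unnest_tokens(bigram, transcript, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Drop pairs in which either word is a stop word
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)

print(toy_pairs)
```

A three-word name contributes two overlapping pairs, so "angeles rams" is counted alongside "los angeles". Running `"new" %in% stop_words$word` would show whether the stop-word filter removes a pair like "new york" while leaving "york jets" in place.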
This study involved a particularly large data set, which was both an advantage and a disadvantage. A large sample gave me more data on which to base my observations. However, I ran into errors when I tried to run the code on the full data set, so I had to reduce the sample size: I filtered out some of the years of data, and the code then ran. Also, the transcripts included the audio from commercials within the broadcasts. Even though this did not seem to affect my results, it was a concern.
If I were to expand on this study, I would narrow the data down to specific categories of word combinations, for example to find out which team cities, player positions, or bowl games were mentioned most often on broadcasts.
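One possible way to set up that kind of category filter is sketched below. The counts are copied from five rows of the top-20 table above, while the `positions` lookup vector is a hypothetical list I made up for illustration; a real version would need a fuller list for each category.

```r
library(dplyr)

# Five rows taken from the top-20 bigram table
bigram_counts <- tibble(
  word1 = c("yard", "kansas", "field", "middle", "wide"),
  word2 = c("line", "city", "goal", "linebacker", "receiver"),
  n     = c(2193, 491, 451, 199, 162)
)

# Hypothetical lookup of position names (assumption, not from the data set)
positions <- c("middle linebacker", "wide receiver", "running back")

position_counts <- bigram_counts %>%
  # Rebuild the two-word phrase so it can be matched against the lookup
  mutate(bigram = paste(word1, word2)) %>%
  filter(bigram %in% positions) %>%
  arrange(desc(n))

print(position_counts)
```

The same pattern would work for team cities or bowl games by swapping in a different lookup vector.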