N-gram Analysis of Football Commentary

Introduction

The “Football commentary data set” on Kaggle.com is almost overwhelmingly large, and it reveals some intriguing patterns and similarities among commentary styles over the last half century. The data set gave me access to every word spoken on broadcasts of both college and professional football.

What interested me most in this study were the specific combinations of words spoken most frequently on broadcasts. To find them, I ran a bigram analysis, which counts pairs of words said one after another. As a football fan myself, I had predictions about which words were most likely to appear together. I expected phrases like “yard line” and “penalty flag”, as well as position names such as “running back”, “wide receiver”, or “middle linebacker”.

library(jsonlite)
library(tidyverse)
library(tidyjson)
library(tidytext)
library(gridExtra)
library(wordcloud2)

Import and edit data

I read in the JSON, removed irrelevant columns, and converted it to a usable data frame.

football <- fromJSON("/Users/jamestait0514/Desktop/dataset/raw_transcripts.json", simplifyDataFrame = TRUE)

football %>% spread_all() %>% 
  select(transcript, year) -> football_clean

as.data.frame(football_clean) -> football_clean_df

# Keep transcripts from 2000 or earlier, comparing the year as a number
football_20c <- football_clean_df %>% 
  filter(as.integer(year) <= 2000)

Clean data

Then I split each transcript into bigrams (pairs of consecutive words), separated every bigram into two columns with one word each, and removed any bigram that contained a stop word.

football_20c %>% 
  unnest_tokens(bigram, transcript, token = "ngrams", n = 2) -> football_words_df

football_separated <- football_words_df %>%
  separate(bigram, c("word1", "word2"), sep = " ")

football_filtered <- football_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
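
To make the cleaning steps concrete, here is a minimal sketch that runs the same pipeline on a single made-up sentence. The sentence and the names toy and toy_bigrams are illustrative assumptions and are not part of the Kaggle data.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-line transcript, used only for illustration
toy <- tibble(transcript = "the field goal attempt is good")

toy_bigrams <- toy %>%
  # Split the text into overlapping two-word sequences
  unnest_tokens(bigram, transcript, token = "ngrams", n = 2) %>%
  # Put the first and second word of each bigram in their own columns
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Drop any bigram in which either word is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

# Two bigrams should remain: "field goal" and "goal attempt"
print(toy_bigrams)
```

The bigrams "the field", "attempt is", and "is good" are dropped because "the" and "is" are stop words, which is why filler pairs do not crowd out football phrases in the counts.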

Show Top 20 Bigrams

football_filtered %>% 
  count(word1, word2, sort = TRUE) %>%
  head(20)

##         word1      word2    n
## 1        yard       line 2193
## 2      kansas       city  491
## 3       field       goal  451
## 4          gt         gt  450
## 5         los    angeles  378
## 6       notre       dame  332
## 7   minnesota    vikings  303
## 8   baltimore      colts  297
## 9      monday      night  243
## 10      green        bay  233
## 11       york       jets  223
## 12       ball       game  212
## 13        san      diego  207
## 14     middle linebacker  199
## 15    angeles       rams  198
## 16   national   football  195
## 17   football     league  189
## 18   football       game  187
## 19       wide   receiver  162
## 20 pittsburgh   steelers  161

football_filtered %>%
  filter(word1 == "goal") %>%
  count(word2, sort = TRUE) %>%
  head(20)

##         word2   n
## 1        line 132
## 2     attempt  48
## 3      kicker  16
## 4    position  16
## 5       range  16
## 6        post   9
## 7     kickers   7
## 8         ago   5
## 9       david   5
## 10       unit   4
## 11       send   3
## 12   attempts   2
## 13     called   2
## 14     dawson   2
## 15        jim   2
## 16    kicking   2
## 17      party   2
## 18 pittsburgh   2
## 19         10   1
## 20         17   1

football_filtered %>%
  filter(word2 == "attempt") %>%
  count(word1, sort = TRUE) %>%
  head(20)

##         word1  n
## 1        goal 48
## 2        pass  8
## 3        yard  6
## 4  conversion  3
## 5   fieldgoal  3
## 6        kick  3
## 7      50yard  2
## 8     gabriel  2
## 9      screen  2
## 10     twelve  2
## 11     23yard  1
## 12     40yard  1
## 13     46yard  1
## 14    bacchus  1
## 15    collins  1
## 16       fide  1
## 17     hearth  1
## 18  offensive  1
## 19 onsidekick  1
## 20      sixth  1

Conclusion

Based on the results of the study, my hypothesis was partly correct, but I was also surprised. I anticipated that player positions and certain events within the games would be the most common combinations of words in the commentary. In a sense, that was correct. “yard line” was by far the most common combination, with 2,193 occurrences, more than four times as many as the runner-up. “field goal” ranked third, and the position names “middle linebacker” and “wide receiver” both made the top 20. “penalty flag” and “running back” did not.

However, many of the other combinations were very much a surprise to me. Team names made up half of the top 20. I hadn’t thought about how team names have multiple parts, so whenever commentators speak about a team they use an identical series of two or three words. “kansas city” and “los angeles” both ended up in the top 5, and fragments of three-word names such as “york jets” and “angeles rams” also appeared, most likely in reference to the New York Jets and the Los Angeles Rams.

The other combination that surprised me was bowl games. The Super Bowl is one of the most widely broadcast events in sports and its name is a combination of two words, so I should have anticipated “super bowl”. I also observed “rose bowl”, the name of a very popular college bowl game, among the frequent combinations, although neither bowl name appears in the top 20 shown above.

When conducting this study, I was dealing with a particularly large data set, which was both an advantage and a disadvantage. The large sample gave me more reliable observations. However, I ran into errors trying to run code on the full data, so I had to reduce the sample size: I filtered out some of the years of data and was then able to run the code. The data also included the audio from commercials within the broadcasts. Even though this did not seem to affect my results, it was a concern.

If I were to expand on this study, I would filter the data down to certain categories of word combinations, for example to find out which team cities, player positions, or bowl games were most commonly mentioned on broadcasts.
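
One way this extension could look is sketched below, assuming a hand-built lookup table that assigns a category to each bigram of interest. The counts are copied from the top-20 table above. The lookup table bigram_categories and its category labels are hypothetical and would need to be filled in by hand for a real analysis.

```r
library(dplyr)

# A few rows copied from the top-20 bigram counts above
top_bigrams <- tibble(
  word1 = c("yard", "kansas", "field", "middle", "wide"),
  word2 = c("line", "city", "goal", "linebacker", "receiver"),
  n     = c(2193, 491, 451, 199, 162)
)

# Hypothetical lookup table: one row per bigram with a hand-assigned category
bigram_categories <- tibble(
  word1    = c("kansas", "middle", "wide"),
  word2    = c("city", "linebacker", "receiver"),
  category = c("team or city", "position", "position")
)

# Keep only the bigrams that have a category, then narrow to one category
position_bigrams <- top_bigrams %>%
  inner_join(bigram_categories, by = c("word1", "word2")) %>%
  filter(category == "position") %>%
  arrange(desc(n))

print(position_bigrams)
```

Joining against a lookup table keeps the category definitions in one place, so adding bowl games or more teams would only mean adding rows to bigram_categories.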