library(jsonlite)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag() masks stats::lag()
library(tidyjson)
##
## Attaching package: 'tidyjson'
##
## The following object is masked from 'package:jsonlite':
##
## read_json
##
## The following object is masked from 'package:stats':
##
## filter
library(tidytext)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(wordcloud2)
I read in the JSON, removed irrelevant columns, and converted it to a usable data frame.
football <- fromJSON("/Users/jamestait0514/Desktop/dataset/raw_transcripts.json", simplifyDataFrame = TRUE)
football %>% spread_all() %>%
select(transcript, year) -> football_clean
as.data.frame(football_clean) -> football_clean_df
Then I organized the dataset to have individual words for each row and removed stop words and other undesired words.
football_clean_df %>%
group_by(year) %>%
unnest_tokens(word, transcript) %>%
anti_join(stop_words) %>%
filter(!word %in% "gt") -> football_words_df
## Joining, by = "word"
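To make the tokenizing step concrete, here is a minimal sketch on a made-up two-row data frame. The `toy` data and its sentences are hypothetical stand-ins for the real transcripts; the calls are the same `unnest_tokens()` and `anti_join()` used above, with `by = "word"` written out so the join key is explicit.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the transcript data (not the real dataset)
toy <- tibble(
  year = c("1970", "2019"),
  transcript = c(
    "The ball is on the 20 yard line",
    "Penalty on the play, gt first down"
  )
)

toy_words <- toy %>%
  # One lowercase word per row; punctuation is dropped by the tokenizer
  unnest_tokens(word, transcript) %>%
  # Drop common English stop words using tidytext's stop_words table
  anti_join(stop_words, by = "word") %>%
  # Drop the leftover "gt" token, as in the real pipeline
  filter(word != "gt")

toy_words
```

Each remaining row keeps its `year`, which is what lets the later steps count words per year.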
The ten most common words are not surprising and line up with my expectations. Each one is closely associated with football.
football_words_df %>%
group_by(word) %>%
count(word, sort = TRUE) %>%
head(10)
## # A tibble: 10 × 2
## # Groups: word [10]
## word n
## <chr> <int>
## 1 game 115820
## 2 play 106344
## 3 line 93601
## 4 ball 86689
## 5 time 74135
## 6 field 64243
## 7 yards 62827
## 8 football 59939
## 9 yard 58844
## 10 run 48689
There is a sharp spike in anger words after the year 2000. This surprised me and was not what I expected to see. Many football fans would argue that the game has gotten too safe since the early 2000s, with many new rules put in place to protect the players. Football before 2000 was more aggressive, which makes a rise in anger words afterward all the more interesting.
football_words_df %>%
inner_join(get_sentiments("nrc")) %>%
group_by(year,sentiment) %>%
count(sentiment) %>%
filter(sentiment %in% "anger") %>%
ggplot(aes(x=as.Date(year, format = "%Y"),y=n, group = sentiment)) + geom_smooth()+xlab("Year")+ylab("Count")+ggtitle("Frequency of Anger Words Over Time")
## Joining, by = "word"
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The first peak in the proportion of anger words comes before 1990, followed by a small dip. Interestingly, the highest proportion overall is in the most recent data (2020). Another interesting takeaway is how wide the confidence band is at the start, which suggests a lot of variance in the early years. There isn’t an obvious explanation for why the proportion changed so much over time. One possibility is that different commentators have different styles of calling games, so the use of anger words shifted along with who was calling them.
football_words_df %>%
inner_join(get_sentiments("nrc")) %>%
group_by(year,sentiment) %>%
summarize(n = n()) %>%
mutate(freq = n / sum(n)) %>%
filter(sentiment %in% "anger") %>%
ggplot(aes(x=as.Date(year, format = "%Y"), y = freq , group = sentiment)) + geom_smooth()+xlab("Year")+ylab("Proportion")+ggtitle("Proportion of Anger Words Over Time")
## Joining, by = "word"
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
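As a small check on what "proportion" means here, the sketch below repeats the calculation on toy data. The tokens and the mini lexicon are invented for illustration and are not the real NRC lexicon. The point is that the denominator is the number of matched word-sentiment pairs in a year, not the total number of words spoken that year.

```r
library(dplyr)

# Hypothetical tokens and a hand-made mini lexicon (NOT the real NRC lexicon)
toy_tokens <- tibble(
  year = c(1990, 1990, 1990, 1990, 2020, 2020, 2020),
  word = c("hit", "win", "fight", "great", "hit", "fight", "attack")
)
toy_lexicon <- tibble(
  word = c("hit", "fight", "attack", "win", "great"),
  sentiment = c("anger", "anger", "anger", "joy", "joy")
)

anger_share <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  # Count matched words per year and sentiment
  count(year, sentiment) %>%
  # Share of each sentiment within its own year
  group_by(year) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup() %>%
  filter(sentiment == "anger")

anger_share
```

Grouping by `year` explicitly before `mutate()` does the same job as relying on the grouping that `summarize()` leaves behind, and it avoids the `.groups` message.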
Every word in this word cloud is closely tied to the game of football, which doesn’t come as a surprise. However, one interesting observation is that Florida, Ohio, and Oklahoma are the only three locations that appear. I believe this speaks to the strong football culture in those states.
football_words_df %>%
group_by(word) %>%
filter(!word %in% c("unk", "person")) %>%
count(word,sort=TRUE) %>%
head(100) %>%
wordcloud2()
All of these words are expected when thinking of common words used in football, and I wasn’t surprised by the results. However, one thing I noticed was the increase in the word “penalty”. This goes along with the narrative that football has gotten much softer in recent years, with new rules and more penalties.
top10_1_senti <- football_words_df %>%
# Compare the year as a number: a Date compared with 1994 is compared on its day count since 1970
filter(as.integer(as.character(year)) < 1994) %>%
group_by(word) %>%
count(word, sort=TRUE) %>%
inner_join(get_sentiments("afinn")) %>%
head(10) %>%
ggplot(aes(x=n, y=reorder(word, n), fill = value)) + geom_col() + xlab("Count") + ylab("Word") + ggtitle("Top 10 Words 1968-1993 (w/ sentiments)")
## Joining, by = "word"
top10_2_senti <- football_words_df %>%
# Same numeric comparison for the later era
filter(as.integer(as.character(year)) >= 1994) %>%
group_by(word) %>%
count(word, sort=TRUE) %>%
inner_join(get_sentiments("afinn")) %>%
head(10) %>%
ggplot(aes(x=n, y=reorder(word, n), fill = value)) + geom_col() + xlab("Count") + ylab("Word") + ggtitle("Top 10 Words 1994-2018 (w/ sentiments)")
## Joining, by = "word"
grid.arrange(top10_1_senti, top10_2_senti, ncol = 2, nrow = 1)
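For splitting the data into two eras, one way to sketch it is to turn the year into a plain integer before comparing, because a `Date` compared with a bare number is compared on its internal day count since 1970 rather than on the calendar year. The data frame below is hypothetical, and it assumes `year` is stored as character.

```r
library(dplyr)

# Hypothetical rows; assumes year is stored as character
toy_counts <- tibble(
  year = c("1972", "1985", "1993", "1994", "2010"),
  word = c("run", "pass", "kick", "penalty", "penalty")
)

toy_eras <- toy_counts %>%
  mutate(
    # Convert to a number so the comparison is on the calendar year
    year_num = as.integer(year),
    era = if_else(year_num < 1994, "1968-1993", "1994 onward")
  )

toy_eras

# For contrast: a 1980 date is NOT "less than 1994" when compared as a Date
as.Date("1980-06-01") < 1994
```

Labelling each row with an `era` column also makes it possible to draw both panels from a single data frame with `facet_wrap(~ era)` instead of building two separate plots.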
The results of this project surprised me in some ways but lined up with my expectations in others. I expected anger words to decline steadily and to be at their lowest in recent years. The results were actually the opposite. According to the data, anger words are at an all-time high today. Even with all the new rules to make the game safer for players, commentary has not followed suit. One area where the data did line up with my expectations was the comparison of the top ten words from before and after 1993. In modern football there are many new rules and thus more penalties. The spike in the use of the word “penalty” matched that expectation.
Some potential sources of error come from the scale of the data. Since the dataset was so massive, it was difficult to be precise and to confirm its validity. One thing that isn’t accounted for is the specific commentators, which could have a large effect on the frequency of certain words. Different commentators have different styles of calling games, which would likely affect the data.
As mentioned above, one thing you could take away from this data is how the styles of different commentators differ across eras.