Introduction

With the MLB season having just gotten underway, I wanted to use this unsupervised machine learning project to analyze the emotion and sentiment surrounding the five teams in the National League’s Eastern Division. I did this by finding a season preview article for each of the division’s five teams, drawn from several different websites. The goal of this project was to determine what the expectations are for each of the five teams and what emotions people have about baseball coming back.

If you have any questions about my work, feel free to reach out. I hope you enjoy this project!

Loading Libraries and Sentiments

remove(list = ls())
setwd("~/Machine Learning Unsupervised")
set.seed(12345)

#Load plyr before dplyr so that plyr does not mask dplyr functions
library(plyr)

pacman::p_load(tokenizers, stopwords, dplyr, ggplot2,
               ggthemes, tidytext, qdap, tm, lda, topicmodels, ggrepel,
               tidyverse, wordcloud, wordcloud2, skmeans, clue, cluster, knitr,
               fpc, ldatuning, wakefield)

library(SentimentAnalysis)
library(sentimentr)
library(widyr)

pacman::p_load(textsampler)

library(FactoMineR)
library(factoextra)
library(scales)
library(magrittr)
library(textreadr)
library(selectr)

pacman::p_load(textclean, markovchain)
pacman::p_load(stm, rvest)
pacman::p_load(gutenbergr)
library(textdata)
library(cowplot)

#Load the NRC (emotion) and Bing (positive/negative) lexicons
nrc_sent = get_sentiments("nrc")
table(nrc_sent$sentiment)
## 
##        anger anticipation      disgust         fear          joy 
##         1247          839         1058         1476          689 
##     negative     positive      sadness     surprise        trust 
##         3324         2312         1191          534         1231
bing_sent = get_sentiments("bing")

News Sentiment Mets

#4/11/2021
#Scrape the web site ==============================================
mets = read_html("https://www.cbssports.com/mlb/news/mets-2021-season-preview-projected-lineup-rotation-as-francisco-lindor-tries-to-lead-club-back-to-playoffs/")

mets0 = mets %>% html_nodes("p") %>% html_text()

mets1 = data.frame(text = mets0)

#Preprocessing
#Keep only rows 7 to 36 of the scraped paragraphs
mets2 = mets1 %>% slice(7:36)
#Make sure the text column is character rather than factor
mets2$text = as.character(mets2$text)

#Into the tidytext world
mets3 = mets2 %>% unnest_tokens(word, text,
                                to_lower = TRUE,
                                strip_punct = TRUE,
                                strip_numeric = TRUE)

#Word counts before removing the stop words
mets3 %>% dplyr::count(word, sort = TRUE)
#Take out the stop words
mets3 = mets3 %>% anti_join(stop_words, by = "word")

#Plot the 10 most frequent remaining words
mets3 %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(10) %>%
  ggplot(aes(fct_reorder(word, n), n, fill = as.factor(n))) +
  geom_col() +
  coord_flip() +
  ggtitle("Top 10 Words in Article About the Mets")
## Selecting by n

Many of the top words in this article make sense, such as Mets, season, and the name of their new superstar, Francisco Lindor. It is less clear why words such as spot and bad are used so much. The emotion and sentiment analysis can help infer how those words were used.
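One quick way to see how a frequent word feeds into the analysis is to look it up in the lexicon. The sketch below uses a small made-up lexicon (toy_lexicon and lookup_sentiment are hypothetical names, and the rows are not copied from the real NRC or Bing data), so it only illustrates the idea: a word can carry several labels, and a word that is missing from the lexicon contributes nothing to the sentiment counts.

```r
# Toy stand-in for a sentiment lexicon; these rows are hypothetical,
# not copied from the real NRC or Bing lexicons.
toy_lexicon <- data.frame(
  word      = c("bad", "bad", "lead", "playoffs"),
  sentiment = c("negative", "sadness", "positive", "anticipation"),
  stringsAsFactors = FALSE
)

# Return every lexicon row whose word appears in `words`.
lookup_sentiment <- function(words, lexicon) {
  hits <- lexicon[lexicon$word %in% words, , drop = FALSE]
  rownames(hits) <- NULL
  hits
}

# "bad" matches two labels; "spot" is absent from the toy lexicon,
# so it would be dropped by an inner join.
top_word_labels <- lookup_sentiment(c("spot", "bad"), toy_lexicon)
print(top_word_labels)
```

With the real lexicons, the same lookup could be run against nrc_sent or bing_sent to check how each top word is labeled.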

#Number the words in order of appearance so they can be grouped into blocks of 50
mets4 = cbind.data.frame(linenumber = seq_len(nrow(mets3)), mets3)

mets4 %>% dplyr::count(word, sort = TRUE)
###############################################
#emotion/sentiment

nrc_positive <- nrc_sent %>%
  filter(sentiment == "positive")

#Count the positive words in each block of 50 words
mets4_positive = mets4 %>%
  inner_join(nrc_positive) %>%
  dplyr::count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0)
## Joining, by = "word"
mets_positive = mets4_positive %>%
  ggplot(aes(index, positive, fill = as.factor(positive))) +
  geom_col() +
  theme(legend.position = "right")
mets_positive + ggtitle("New York Mets")

#sentiment: positive minus negative Bing words in each block of 50 words
mets4_sentiment <- mets4 %>%
  inner_join(bing_sent) %>%
  dplyr::count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(mets4_sentiment)
mets_sentiment = ggplot(mets4_sentiment,
                        aes(index, sentiment, fill = as.factor(sentiment))) +
  geom_col()
print(mets_sentiment + ggtitle("New York Mets"))

plot_grid(mets_positive + ggtitle("New York Mets"), mets_sentiment, nrow = 2)

## All emotion
mets4_emotion = mets4 %>%
  inner_join(nrc_sent) %>%
  dplyr::count(index = linenumber %/% 50, sentiment)
## Joining, by = "word"
mets_emotion = mets4_emotion %>%
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right")
mets_emotion + ggtitle("New York Mets")

plot_grid(mets_emotion + ggtitle("New York Mets"), mets_sentiment, nrow = 2)

News Sentiment Marlins

#4/11/2021
#Scrape the web site ==============================================
marlins = read_html("https://www.sun-sentinel.com/sports/miami-marlins/fl-sp-marlins-season-preview-20210330-dosealvc4rdqzc2w4fwsvjheu4-story.html")

marlins0 = marlins %>% html_nodes("p") %>% html_text()

marlins1 = data.frame(text = marlins0)

#Preprocessing
#Keep only rows 1 to 23 of the scraped paragraphs
marlins2 = marlins1 %>% slice(1:23)
#Make sure the text column is character rather than factor
marlins2$text = as.character(marlins2$text)

#Into the tidytext world
marlins3 = marlins2 %>% unnest_tokens(word, text,
                                      to_lower = TRUE,
                                      strip_punct = TRUE,
                                      strip_numeric = TRUE)

#Word counts before removing the stop words
marlins3 %>% dplyr::count(word, sort = TRUE)
#Take out the stop words
marlins3 = marlins3 %>% anti_join(stop_words, by = "word")

#Plot the 10 most frequent remaining words
marlins3 %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(10) %>%
  ggplot(aes(fct_reorder(word, n), n, fill = as.factor(n))) +
  geom_col() +
  coord_flip() +
  ggtitle("Top 10 Words in Article About the Marlins")
## Selecting by n

All of the top words in this article make sense because they are the names of people on the team, baseball terms, or the name of a team or a city a team plays in.

#Number the words in order of appearance so they can be grouped into blocks of 50
marlins4 = cbind.data.frame(linenumber = seq_len(nrow(marlins3)), marlins3)

marlins4 %>% dplyr::count(word, sort = TRUE)
###############################################
#emotion/sentiment
#Count the positive words in each block of 50 words
marlins4_positive = marlins4 %>%
  inner_join(nrc_positive) %>%
  dplyr::count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0)
## Joining, by = "word"
marlins_positive = marlins4_positive %>%
  ggplot(aes(index, positive, fill = as.factor(positive))) +
  geom_col() +
  theme(legend.position = "right")
marlins_positive + ggtitle("Miami Marlins")

#sentiment: positive minus negative Bing words in each block of 50 words
marlins4_sentiment <- marlins4 %>%
  inner_join(bing_sent) %>%
  dplyr::count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(marlins4_sentiment)
marlins_sentiment = ggplot(marlins4_sentiment,
                           aes(index, sentiment, fill = as.factor(sentiment))) +
  geom_col()
print(marlins_sentiment + ggtitle("Miami Marlins"))

plot_grid(marlins_positive + ggtitle("Miami Marlins"), marlins_sentiment, nrow = 2)

## All emotion
marlins4_emotion = marlins4 %>%
  inner_join(nrc_sent) %>%
  dplyr::count(index = linenumber %/% 50, sentiment)
## Joining, by = "word"
marlins_emotion = marlins4_emotion %>%
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right")
marlins_emotion + ggtitle("Miami Marlins")

plot_grid(marlins_emotion + ggtitle("Miami Marlins"), marlins_sentiment, nrow = 2)

News Sentiment Nationals

#4/11/2021
#Scrape the web site ==============================================
nationals = read_html("https://wtop.com/washington-nationals/2021/04/nationals-preview-21-questions-for-2021/")

nationals0 = nationals %>% html_nodes("p") %>% html_text()

nationals1 = data.frame(text = nationals0)

#Preprocessing
#Keep only rows 4 to 30 of the scraped paragraphs
nationals2 = nationals1 %>% slice(4:30)
#Make sure the text column is character rather than factor
nationals2$text = as.character(nationals2$text)

#Into the tidytext world
nationals3 = nationals2 %>% unnest_tokens(word, text,
                                          to_lower = TRUE,
                                          strip_punct = TRUE,
                                          strip_numeric = TRUE)

#Word counts before removing the stop words
nationals3 %>% dplyr::count(word, sort = TRUE)
#Take out the stop words
nationals3 = nationals3 %>% anti_join(stop_words, by = "word")

#Plot the 10 most frequent remaining words
nationals3 %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(10) %>%
  ggplot(aes(fct_reorder(word, n), n, fill = as.factor(n))) +
  geom_col() +
  coord_flip() +
  ggtitle("Top 10 Words in Article About the Nationals")
## Selecting by n

All of the top words used make sense because they are baseball terms.

#Number the words in order of appearance so they can be grouped into blocks of 50
nationals4 = cbind.data.frame(linenumber = seq_len(nrow(nationals3)), nationals3)

nationals4 %>% dplyr::count(word, sort = TRUE)
###############################################
#emotion/sentiment

nrc_positive <- get_sentiments("nrc") %>%
  filter(sentiment == "positive")

#Count the positive words in each block of 50 words
nationals4_positive = nationals4 %>%
  inner_join(nrc_positive) %>%
  dplyr::count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0)
## Joining, by = "word"
nationals_positive = nationals4_positive %>%
  ggplot(aes(index, positive, fill = as.factor(positive))) +
  geom_col() +
  theme(legend.position = "right")
nationals_positive + ggtitle("Washington Nationals")

#sentiment: positive minus negative Bing words in each block of 50 words
nationals4_sentiment <- nationals4 %>%
  inner_join(bing_sent) %>%
  dplyr::count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(nationals4_sentiment)
nationals_sentiment = ggplot(nationals4_sentiment,
                             aes(index, sentiment, fill = as.factor(sentiment))) +
  geom_col()
print(nationals_sentiment + ggtitle("Washington Nationals"))

plot_grid(nationals_positive + ggtitle("Washington Nationals"), nationals_sentiment, nrow = 2)

## All emotion
nationals4_emotion = nationals4 %>%
  inner_join(nrc_sent) %>%
  dplyr::count(index = linenumber %/% 50, sentiment)
## Joining, by = "word"
nationals_emotion = nationals4_emotion %>%
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right")
nationals_emotion + ggtitle("Washington Nationals")

plot_grid(nationals_emotion + ggtitle("Washington Nationals"), nationals_sentiment, nrow = 2)

News Sentiment Phillies

#4/11/2021
#Scrape the web site ==============================================
phillies = read_html("https://www.nbcsports.com/philadelphia/phillies/mlb-season-preview-2021-phillies-lineup-roster-rotation")

phillies0 = phillies %>% html_nodes("p") %>% html_text()

phillies1 = data.frame(text = phillies0)

#Preprocessing
#Keep only rows 3 to 45 of the scraped paragraphs
phillies2 = phillies1 %>% slice(3:45)
#Make sure the text column is character rather than factor
phillies2$text = as.character(phillies2$text)

#Into the tidytext world
phillies3 = phillies2 %>% unnest_tokens(word, text,
                                        to_lower = TRUE,
                                        strip_punct = TRUE,
                                        strip_numeric = TRUE)

#Word counts before removing the stop words
phillies3 %>% dplyr::count(word, sort = TRUE)
#Take out the stop words
phillies3 = phillies3 %>% anti_join(stop_words, by = "word")

#Plot the 10 most frequent remaining words
phillies3 %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(10) %>%
  ggplot(aes(fct_reorder(word, n), n, fill = as.factor(n))) +
  geom_col() +
  coord_flip() +
  ggtitle("Top 10 Words in Article About the Phillies")
## Selecting by n

The words used in the Phillies season preview article all make sense because they are baseball terms or player names.

#Number the words in order of appearance so they can be grouped into blocks of 50
phillies4 = cbind.data.frame(linenumber = seq_len(nrow(phillies3)), phillies3)

phillies4 %>% dplyr::count(word, sort = TRUE)
###############################################
#emotion/sentiment

nrc_positive <- get_sentiments("nrc") %>%
  filter(sentiment == "positive")

#Count the positive words in each block of 50 words
phillies4_positive = phillies4 %>%
  inner_join(nrc_positive) %>%
  dplyr::count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0)
## Joining, by = "word"
phillies_positive = phillies4_positive %>%
  ggplot(aes(index, positive, fill = as.factor(positive))) +
  geom_col() +
  theme(legend.position = "right")
phillies_positive + ggtitle("Philadelphia Phillies")

#sentiment: positive minus negative Bing words in each block of 50 words
phillies4_sentiment <- phillies4 %>%
  inner_join(bing_sent) %>%
  dplyr::count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(phillies4_sentiment)
phillies_sentiment = ggplot(phillies4_sentiment,
                            aes(index, sentiment, fill = as.factor(sentiment))) +
  geom_col()
print(phillies_sentiment + ggtitle("Philadelphia Phillies"))

plot_grid(phillies_positive + ggtitle("Philadelphia Phillies"), phillies_sentiment, nrow = 2)

## All emotion
phillies4_emotion = phillies4 %>%
  inner_join(nrc_sent) %>%
  dplyr::count(index = linenumber %/% 50, sentiment)
## Joining, by = "word"
phillies_emotion = phillies4_emotion %>%
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right")
phillies_emotion + ggtitle("Philadelphia Phillies")

plot_grid(phillies_emotion + ggtitle("Philadelphia Phillies"), phillies_sentiment, nrow = 2)

News Sentiment Braves

#4/11/2021
#Scrape the web site ==============================================
braves = read_html("https://www.cbssports.com/mlb/news/braves-2021-season-preview-projected-lineup-rotation-and-three-things-to-know-about-atlanta/")

braves0 = braves %>% html_nodes("p") %>% html_text()

braves1 = data.frame(text = braves0)

#Preprocessing
#Keep only rows 7 to 37 of the scraped paragraphs
braves2 = braves1 %>% slice(7:37)
#Make sure the text column is character rather than factor
braves2$text = as.character(braves2$text)

#Into the tidytext world
braves3 = braves2 %>% unnest_tokens(word, text,
                                    to_lower = TRUE,
                                    strip_punct = TRUE,
                                    strip_numeric = TRUE)

#Word counts before removing the stop words
braves3 %>% dplyr::count(word, sort = TRUE)
#Take out the stop words
braves3 = braves3 %>% anti_join(stop_words, by = "word")

#Plot the 10 most frequent remaining words
braves3 %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(10) %>%
  ggplot(aes(fct_reorder(word, n), n, fill = as.factor(n))) +
  geom_col() +
  coord_flip() +
  ggtitle("Top 10 Words in Article About the Braves")
## Selecting by n

The top words used in this article suggest that it focuses on the Braves' previous season. Many of them are baseball statistics terms, which suggests the article spends a lot of time on player and team statistics. The word playoffs also appears often, and the Braves made the playoffs last year. It will be interesting to see whether the emotion and sentiment of the article reflect how the Braves' playoff loss last year is perceived.

#Number the words in order of appearance so they can be grouped into blocks of 50
braves4 = cbind.data.frame(linenumber = seq_len(nrow(braves3)), braves3)

braves4 %>% dplyr::count(word, sort = TRUE)
###############################################
#emotion/sentiment

nrc_positive <- get_sentiments("nrc") %>%
  filter(sentiment == "positive")

#Count the positive words in each block of 50 words
braves4_positive = braves4 %>%
  inner_join(nrc_positive) %>%
  dplyr::count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0)
## Joining, by = "word"
braves_positive = braves4_positive %>%
  ggplot(aes(index, positive, fill = as.factor(positive))) +
  geom_col() +
  theme(legend.position = "right")
braves_positive + ggtitle("Atlanta Braves")

#sentiment: positive minus negative Bing words in each block of 50 words
braves4_sentiment <- braves4 %>%
  inner_join(bing_sent) %>%
  dplyr::count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(braves4_sentiment)
braves_sentiment = ggplot(braves4_sentiment,
                          aes(index, sentiment, fill = as.factor(sentiment))) +
  geom_col()
print(braves_sentiment + ggtitle("Atlanta Braves"))

plot_grid(braves_positive + ggtitle("Atlanta Braves"), braves_sentiment, nrow = 2)

## All emotion
braves4_emotion = braves4 %>%
  inner_join(nrc_sent) %>%
  dplyr::count(index = linenumber %/% 50, sentiment)
## Joining, by = "word"
braves_emotion = braves4_emotion %>%
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right")
braves_emotion + ggtitle("Atlanta Braves")

plot_grid(braves_emotion + ggtitle("Atlanta Braves"), braves_sentiment, nrow = 2)

Positivity Comparison

plot_grid(mets_positive + theme(legend.position = "none") + ggtitle("New York Mets"),
          marlins_positive + theme(legend.position = "none") + ggtitle("Miami Marlins"),
          nationals_positive + theme(legend.position = "none") + ggtitle("Washington Nationals"),
          phillies_positive + theme(legend.position = "none") + ggtitle("Philadelphia Phillies"),
          braves_positive + theme(legend.position = "none") + ggtitle("Atlanta Braves"),
          nrow = 5)

The Nationals, Mets, and Braves articles seem to have the highest positivity counts, which could indicate that these teams have the highest expectations coming into this season. The Marlins and Phillies articles seem to have lower positivity counts, which could indicate that there is less optimism for these teams heading into the season.
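Reading positivity off stacked bar charts is somewhat subjective, so one possible check is to summarize the per-index counts as numbers. The sketch below uses made-up counts for two placeholder teams (toy_positive, "A", and "B" are hypothetical, not results from the articles) and reports both the total and the average per 50-word index, since a longer article has more indexes and can have a higher total for that reason alone.

```r
# Hypothetical positive-word counts per 50-word index for two placeholder teams.
toy_positive <- data.frame(
  team     = c("A", "A", "A", "B", "B"),
  index    = c(0, 1, 2, 0, 1),
  positive = c(6, 4, 5, 2, 3),
  stringsAsFactors = FALSE
)

# Total positive words per team, and the average per index so that
# articles of different lengths can be compared.
total_positive <- tapply(toy_positive$positive, toy_positive$team, sum)
mean_per_index <- tapply(toy_positive$positive, toy_positive$team, mean)

positivity_summary <- data.frame(
  team      = names(total_positive),
  total     = as.vector(total_positive),
  per_index = as.vector(mean_per_index),
  stringsAsFactors = FALSE
)
print(positivity_summary)
```

The same summary could be built from the real mets4_positive, marlins4_positive, and similar data frames by adding a team column and stacking them.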

Sentiment Comparison

plot_grid(mets_sentiment + theme(legend.position = "none") + ggtitle("New York Mets"),
          marlins_sentiment + theme(legend.position = "none") + ggtitle("Miami Marlins"),
          nationals_sentiment + theme(legend.position = "none") + ggtitle("Washington Nationals"),
          phillies_sentiment + theme(legend.position = "none") + ggtitle("Philadelphia Phillies"),
          braves_sentiment + theme(legend.position = "none") + ggtitle("Atlanta Braves"),
          nrow = 3)

Overall, the Marlins seem to have the highest sentiment because they are the only team without a negative value in any index. This could be because they made an unexpected playoff appearance last year. The Nationals seem to have the second highest sentiment, with about 2/3 of their indexes having positive values. This could be because sentiment is still high from their World Series victory a few years ago. The Mets seem to have the third highest sentiment, with about an even split of positive and negative indexes. This could be because they have a new owner that people are excited about, but people are uncertain about the current talent of the team. The Phillies and Braves seem to have the lowest sentiment, with around 2/3 of their indexes each having negative values. This could be because the Phillies have underperformed the past few years, and the Braves were disappointing in the playoffs last year.
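The share of indexes with negative sentiment can also be computed directly instead of being estimated from the plots. The sketch below is a minimal base R version of that calculation on made-up data (toy_matches is hypothetical and stands in for the result of joining an article with the Bing lexicon): each matched word is assigned to an index of 50 words with integer division, net sentiment is positive minus negative per index, and the share of negative indexes is the mean of a logical vector.

```r
# Hypothetical matched words: position in the article and Bing-style label.
toy_matches <- data.frame(
  linenumber = c(3, 10, 48, 55, 60, 99, 101, 120, 149),
  sentiment  = c("positive", "negative", "negative",
                 "positive", "positive", "positive",
                 "negative", "negative", "positive"),
  stringsAsFactors = FALSE
)

# Integer division groups positions 0-49 into index 0, 50-99 into index 1, etc.
toy_matches$index <- toy_matches$linenumber %/% 50

# Count labels per index, then take positive minus negative.
counts <- table(toy_matches$index, toy_matches$sentiment)
net_sentiment <- counts[, "positive"] - counts[, "negative"]

# Share of indexes whose net sentiment is below zero.
share_negative <- mean(net_sentiment < 0)

print(net_sentiment)
print(share_negative)
```

Applied to a real data frame such as phillies4_sentiment, the equivalent one-liner would be mean(phillies4_sentiment$sentiment < 0).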

Emotion Comparison

plot_grid(mets_emotion + theme(legend.position = "left") + ggtitle("New York Mets"),
          marlins_emotion + theme(legend.position = "none") + ggtitle("Miami Marlins"),
          nationals_emotion + theme(legend.position = "none") + ggtitle("Washington Nationals"),
          phillies_emotion + theme(legend.position = "none") + ggtitle("Philadelphia Phillies"),
          braves_emotion + theme(legend.position = "none") + ggtitle("Atlanta Braves"),
          nrow = 5)

The Mets article seems to show a lot of trust, positivity, and anticipation, while also showing some fear. It does not seem to have much surprise or negativity. The Marlins article seems to show a lot of trust, positivity, and anticipation. It does not show much negativity, fear, or sadness. The Nationals article shows a lot of trust, positivity, anticipation, and fear. It does not show much surprise, disgust, or joy. The Phillies article shows a lot of positivity and trust, but overall the emotions in this article are pretty well mixed. The Braves article shows negativity, anticipation, and positivity. It does not show much joy or surprise. Overall, the articles seem to exhibit a large amount of positivity and trust, which may reflect the trust these writers have in their readers and the positive feelings that the start of the MLB season brings.
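Statements like "a lot of trust" or "not much surprise" are easier to compare across articles when each emotion is expressed as a share of all matched words. The sketch below shows one way to do that in base R on a made-up vector of labels (toy_emotions is hypothetical, not output from any of the articles).

```r
# Hypothetical NRC-style labels for the matched words of one article.
toy_emotions <- c("trust", "trust", "positive", "positive", "positive",
                  "anticipation", "fear", "joy", "trust", "positive")

# Convert raw counts to shares of all matched words, largest first.
emotion_share <- sort(prop.table(table(toy_emotions)), decreasing = TRUE)

print(round(emotion_share, 2))
```

For a real article, the labels would come from the sentiment column after the inner join with nrc_sent, for example table(inner_join(mets4, nrc_sent)$sentiment).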

Conclusion

Overall, all of the articles seem to convey a lot of positivity, and their sentiment seems fairly high as well. This could show that people are excited for the MLB season to return, and that people have high expectations for the National League East Division, considered by many to be the most well-rounded in all of baseball. Although I think these are the main reasons for these results, it is important to note that the Nationals and Marlins articles come from their local media, so the positive sentiment and emotions could be somewhat biased.

Thank You!

I want to thank my professor, Dr. Armando Rodriguez, for the inspiration for this project. I would not have thought to do something like this if it were not for his class and the way he talks about how these techniques can be applied in the sports world. I would also like to thank the University of New Haven for providing this program.