For this assignment I wanted to see how two NBA teams compared on Twitter. The first team I chose was the Boston Celtics, my favorite team, and the second was the Philadelphia 76ers. The Celtics have had an up-and-down season so far, while the 76ers have been one of the worst teams in the league. I thought this would be an interesting comparison because Boston (and New England) is known for strongly supporting its sports teams, but the same can’t always be said for Philadelphia basketball fans. I was especially interested in comparing the sentiment of the tweets about each team.
First I loaded all the libraries I’ll need.
library(twitteR)
library(tidytext)
library(stringr)
library(ggplot2)
library(dplyr)
library(knitr)
library(wordcloud2)
Then I logged in to Twitter.
Next I pulled the last 1000 tweets that used #Celtics and created a dataframe.
num_tweets <- 1000
Celtics <- searchTwitter('#Celtics', n = num_tweets)
Celtics_df <- twListToDF(Celtics)
head(Celtics_df)
Next I looked at the tweet count by platform.
Celtics_df$statusSource <- substr(Celtics_df$statusSource,
                                  regexpr('>', Celtics_df$statusSource) + 1,
                                  regexpr('</a>', Celtics_df$statusSource) - 1)
Celtics_platform <- Celtics_df %>% group_by(statusSource) %>%
  summarize(n = n()) %>%
  mutate(percent_of_tweets = n/sum(n)) %>%
  arrange(desc(n))
kable(Celtics_platform %>% top_n(10))
statusSource | n | percent_of_tweets |
---|---|---|
Twitter for iPhone | 347 | 0.347 |
Twitter for Android | 294 | 0.294 |
Twitter Web Client | 112 | 0.112 |
IFTTT | 87 | 0.087 |
SocialOomph | 41 | 0.041 |
TweetDeck | 32 | 0.032 |
Twitter for iPad | 23 | 0.023 |
(blank) | 6 | 0.006 |
Libsyn On-Publish | 6 | 0.006 |
celtics_fanly | 5 | 0.005 |
Hootsuite | 5 | 0.005 |
The two most popular platforms are both phone apps: Twitter for iPhone and Twitter for Android together account for about 64% of the tweets. This might be from fans who are tweeting while watching the game live, or at a bar.
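To make the `substr`/`regexpr` step concrete, here is a small base R sketch on two made-up `statusSource` strings (the sample values are assumptions for illustration, not rows from the data). The source arrives as an HTML anchor, and the code keeps the text between the first `>` and the closing `</a>`. When the field has no anchor tag, both `regexpr` calls return -1 and the result is an empty string, which is probably where the blank row in the table comes from.

```r
# Hypothetical statusSource values, for illustration only
src <- c(
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  'no anchor tag here'
)

# Position just after the first '>' (end of the opening <a ...> tag)
start_pos <- regexpr('>', src) + 1
# Position just before the closing '</a>'
end_pos <- regexpr('</a>', src) - 1

# substr() returns "" when end_pos is smaller than start_pos
platform_name <- substr(src, start_pos, end_pos)
print(platform_name)
```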
After that I wanted to see if there were any superfans that showed up in the most active users.
kable(Celtics_df %>%
        group_by(screenName) %>%
        summarize(n = n()) %>%
        mutate(percent_of_tweets = n/sum(n)) %>%
        arrange(desc(n)) %>%
        top_n(10))
screenName | n | percent_of_tweets |
---|---|---|
celtic_rookie | 67 | 0.067 |
CelticsViews | 42 | 0.042 |
EspiriTruth | 14 | 0.014 |
peskydefender | 9 | 0.009 |
CSNNE | 8 | 0.008 |
celtics_fanly | 5 | 0.005 |
SMHerlin | 5 | 0.005 |
CelticsPregame | 4 | 0.004 |
FaguinhoMV | 4 | 0.004 |
JmCeltics | 4 | 0.004 |
kc1nyk | 4 | 0.004 |
MaNiNhO_t2p2 | 4 | 0.004 |
NBA_Scholar | 4 | 0.004 |
It looks like there is a mix of Celtics fans and Celtics media sources among the top tweeters.
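The table above has 13 rows even though the code asks for `top_n(10)`. That is because `top_n()` keeps every row tied with the cutoff value. The sketch below shows this on made-up counts (the names and numbers are assumptions for illustration). It also shows `slice_max()` with `with_ties = FALSE`, which is available in dplyr 1.0.0 and later, as one way to get exactly the requested number of rows.

```r
library(dplyr)

# Hypothetical tweet counts per user, for illustration only
counts <- data.frame(
  screenName = c("a", "b", "c", "d", "e"),
  n = c(9, 5, 4, 4, 4),
  stringsAsFactors = FALSE
)

# top_n() keeps ties: the third-largest value is 4, and three users share it
with_ties <- counts %>% top_n(3, n)

# slice_max() can drop ties to return exactly three rows
no_ties <- counts %>% slice_max(n, n = 3, with_ties = FALSE)

print(nrow(with_ties))
print(nrow(no_ties))
```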
Once I knew who the most active #Celtics tweeters were, I wanted to take a closer look at what people were tweeting.
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
Celtics_words <- Celtics_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
kable(Celtics_words %>% group_by(word) %>% summarize(n = n()) %>% arrange(desc(n)) %>% top_n(20))
word | n |
---|---|
#celtics | 997 |
rt | 665 |
horford | 603 |
al | 443 |
block | 413 |
@celtics | 365 |
@nba | 358 |
ahead | 294 |
bucket | 294 |
crucial | 292 |
la | 187 |
game | 97 |
#nba | 92 |
win | 92 |
victoire | 84 |
@parlonsnba | 82 |
le | 76 |
celtics | 74 |
pistons | 71 |
qui | 70 |
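Hashtags and mentions such as #celtics and @nba survive as single words because the `reg` pattern splits on any character that is not a letter, digit, `#`, `@`, or an apostrophe inside a word. The base R sketch below approximates that tokenizing step on a made-up tweet (the tweet text is an assumption for illustration). It uses `strsplit()` rather than `unnest_tokens()`, so it is an approximation of the behavior, not the tidytext implementation.

```r
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

# Hypothetical tweet, for illustration only
tweet <- "RT @celtics: Horford's block &amp; bucket! #Celtics https://t.co/abc123"

# Drop shortened links and the HTML-escaped ampersand
cleaned <- gsub("https://t.co/[A-Za-z\\d]+|&amp;", "", tweet, perl = TRUE)

# Lowercase (unnest_tokens does this by default), then split on the pattern
pieces <- strsplit(tolower(cleaned), reg, perl = TRUE)[[1]]

# Consecutive separators leave empty strings behind; remove them
tokens <- pieces[nzchar(pieces)]
print(tokens)
```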
CelticsWC <- Celtics_words %>% group_by(word) %>% summarize(n = n()) %>% arrange(desc(n)) %>% top_n(40)
wordcloud2(CelticsWC, size = 3, gridSize = 1, color = 'green', minSize = 12)
Most of the common words were pretty generic, and could easily come from just about any NBA team or city. However, one player did show up in two places on the top 20 list. Al Horford’s name makes up two of the top four most common words tweeted. His first game back from injury was last night, so it makes sense that fans would be excited to see him. He also had a block at the end of the game to help the Celtics win, which could be why the word block is so high up on the list as well.
Then I looked at the sentiments found in the #Celtics tweets. The Celtics have won the last two games, so I expected that most tweets would be pretty positive.
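# Note: this relies on the older tidytext `sentiments` dataset, which had a `lexicon` column.
# On current tidytext releases, get_sentiments("nrc") is the replacement.
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)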
Celtics_words_sentiments <- Celtics_words %>% inner_join(nrc, by = "word")
kable(Celtics_words_sentiments %>% group_by(sentiment) %>% summarize(n = n()) %>% arrange(desc(n)))
sentiment | n |
---|---|
positive | 770 |
trust | 434 |
negative | 151 |
fear | 136 |
anticipation | 130 |
anger | 129 |
disgust | 114 |
joy | 85 |
sadness | 79 |
surprise | 65 |
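As I expected, positive was by far the most common label, with 770 matches, about five times the 151 negative ones. Keep in mind that these counts are word-level matches against the NRC lexicon rather than whole tweets, and a single word can carry more than one label. The Celtics had an exciting win last night, so I would expect the recent tweets to reflect that.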
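As a quick check on the sentiment table, the sketch below turns the printed counts into shares of all sentiment matches using base R. The numbers are copied from the table above; the variable names are my own.

```r
# Counts copied from the #Celtics sentiment table
celtics_counts <- c(
  positive = 770, trust = 434, negative = 151, fear = 136,
  anticipation = 130, anger = 129, disgust = 114, joy = 85,
  sadness = 79, surprise = 65
)

# Total number of word-to-sentiment matches
total_matches <- sum(celtics_counts)

# Share of all matches carried by each label
shares <- round(celtics_counts / total_matches, 3)

# How many positive matches there are per negative match
pos_to_neg <- round(celtics_counts[["positive"]] / celtics_counts[["negative"]], 1)

print(total_matches)
print(shares)
print(pos_to_neg)
```

Measured this way, positive makes up a little over a third of all matches, because the eight emotion labels share the same total.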
A quick look at the positive tweets and the specific words that were matched shows that four of the tweets are talking about how the Celtics finally have all their players healthy. There are also a couple of tweets describing how good specific players are.
pos_tw_ids <- Celtics_words_sentiments %>% filter(sentiment == "positive") %>% distinct(id, word)
kable(Celtics_df %>% inner_join(pos_tw_ids, by = "id") %>% select(word) %>% slice(1:10))
word |
---|
pick |
don |
passion |
winning |
passion |
winning |
passion |
winning |
lead |
ahead |
I also looked at tweets categorized with the sadness sentiment. There were a couple of tweets from after Friday night’s loss and before Saturday’s win, so sadness makes sense for those tweets. A couple of tweets don’t seem to fit the sadness sentiment, but specific words like ‘killing’ pulled from the tweet are what caused them to be labeled as sadness. This shows how important it is to look at the overall tweet before determining sentiment, rather than just pulling out key words.
sadness_tw_ids <- Celtics_words_sentiments %>% filter(sentiment == "sadness") %>% distinct(id, word)
kable(Celtics_df %>% inner_join(sadness_tw_ids, by = "id") %>% select(word) %>% slice(1:10))
word |
---|
inter |
winning |
winning |
winning |
harry |
ruined |
trickery |
bad |
tough |
killing |
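The sketch below illustrates why word-level labels can mislead. It joins two made-up tweets against a tiny made-up lexicon (both are assumptions for illustration and do not reproduce actual NRC entries). A celebratory tweet that uses "killing" ends up flagged as sadness purely because of that one word.

```r
library(dplyr)

# Hypothetical mini-lexicon, for illustration only (not real NRC rows)
toy_lexicon <- data.frame(
  word = c("killing", "killing", "tough", "win"),
  sentiment = c("sadness", "fear", "sadness", "positive"),
  stringsAsFactors = FALSE
)

# Hypothetical tokenized tweets: tweet 1 is praise, tweet 2 is about a loss
toy_words <- data.frame(
  id = c("1", "1", "1", "2", "2"),
  word = c("celtics", "killing", "tonight", "tough", "loss"),
  stringsAsFactors = FALSE
)

# Each word picks up every label it has in the lexicon
labelled <- toy_words %>% inner_join(toy_lexicon, by = "word")

# Both tweets are flagged as sadness, including the celebratory one
sad_ids <- labelled %>% filter(sentiment == "sadness") %>% distinct(id, word)
print(sad_ids)
```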
For the 76ers, I again started by pulling the 1000 most recent tweets that used #76ers and created a dataframe.
Philly <- searchTwitter('#76ers', n = num_tweets)
Philly_df <- twListToDF(Philly)
Then I looked at the most common platforms used by Philadelphia fans.
Philly_df$statusSource <- substr(Philly_df$statusSource,
                                 regexpr('>', Philly_df$statusSource) + 1,
                                 regexpr('</a>', Philly_df$statusSource) - 1)
Philly_platform <- Philly_df %>% group_by(statusSource) %>%
  summarize(n = n()) %>%
  mutate(percent_of_tweets = n / sum(n)) %>%
  arrange(desc(n))
kable(Philly_platform %>% top_n(10))
statusSource | n | percent_of_tweets |
---|---|---|
SocialOomph | 211 | 0.211 |
dlvr.it | 179 | 0.179 |
Twitter Web Client | 164 | 0.164 |
Twitter for iPhone | 140 | 0.140 |
Twitter for Android | 114 | 0.114 |
ri_76ers | 17 | 0.017 |
Rotoinfo.com NBA | 17 | 0.017 |
TweetDeck | 16 | 0.016 |
SocialNewsDesk | 13 | 0.013 |
NBA Daily Lineups | 12 | 0.012 |
I looked at the most common words found in the tweets.
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
Philly_words <- Philly_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
PhillyWC <- Philly_words %>% group_by(word) %>% summarize(n = n()) %>% arrange(desc(n)) %>% top_n(30)
wordcloud2(PhillyWC, size = 3, gridSize = 1, color = 'red', minSize = 12)
## Warning in if (class(data) == "table") {: the condition has length > 1 and
## only the first element will be used
Like the Celtics tweets, some of the words on this list were related to basketball in general. Only one player, Joel Embiid, broke into the most common words. Two other NBA teams (the Suns and the Timberwolves) also showed up in the top 20 words, which makes sense because those teams and the 76ers played each other recently.
I also looked at the sentiments from Philadelphia tweets.
Philly_words <- Philly_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
Philly_words_sentiments <- Philly_words %>% inner_join(nrc, by = "word")
kable(Philly_words_sentiments %>% group_by(sentiment) %>% summarize(n = n()) %>% arrange(desc(n)))
sentiment | n |
---|---|
positive | 369 |
anticipation | 299 |
negative | 240 |
trust | 224 |
joy | 198 |
fear | 180 |
sadness | 165 |
anger | 152 |
surprise | 85 |
disgust | 63 |
Positive was still the most common label, but it led negative by a much smaller margin than in the Celtics tweets (369 matches to 240), which I did not find very surprising. The team has won only three of 13 games so far this season, so I don’t expect fans to be very positive. However, one of those wins was last night, so the positive sentiment could definitely have been higher.
First I created a city variable that I used to combine the dataframes.
Celtics_platform$city <- "Boston"
Philly_platform$city <- "Philadelphia"
Celtics_words_sentiments$city <- "Boston"
Philly_words_sentiments$city <- "Philadelphia"
platform <- rbind(Celtics_platform, Philly_platform)
words_sentiments <- rbind(Celtics_words_sentiments, Philly_words_sentiments)
Then I started by comparing the most common platforms.
pf <- c("dlvr.it", "Twitter for iPhone", "Twitter for Android", "SocialOomph", "Twitter Web Client")
pf_df <- platform %>% filter(statusSource %in% pf)
ggplot(pf_df, aes(x = statusSource, y = percent_of_tweets, fill = city)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  xlab("Platform") +
  ylab("Percent of tweets") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Boston fans tweeted much more frequently from their cell phones. Philadelphia tweets were spread more evenly across platforms, with SocialOomph and dlvr.it, rather than the phone apps, at the top of the list.
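To put a number on the phone comparison, the sketch below adds up the iPhone and Android rows from the two platform tables using base R. The counts are copied from the tables above; the variable names are my own.

```r
# Tweets pulled per team (num_tweets in the code above)
total_tweets <- 1000

# iPhone and Android counts copied from the two platform tables
boston_phone <- c(iphone = 347, android = 294)
philly_phone <- c(iphone = 140, android = 114)

# Share of each team's tweets sent from the two phone apps
boston_phone_share <- sum(boston_phone) / total_tweets
philly_phone_share <- sum(philly_phone) / total_tweets

print(boston_phone_share)
print(philly_phone_share)
```

By this measure roughly 64% of the #Celtics tweets came from the phone apps, compared with about 25% of the #76ers tweets.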
I finished by comparing the sentiment of the tweets from the two teams.
sent_df <- words_sentiments %>%
  group_by(city, sentiment) %>%
  summarize(n = n()) %>%
  # summarize() drops the sentiment grouping, so sum(n) is the total per city
  mutate(frequency = n/sum(n))
ggplot(sent_df, aes(x = sentiment, y = frequency, fill = city)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  xlab("Sentiment") +
  ylab("Share of sentiment words") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
The Celtics had a much higher share in the positive and trust categories. The team had a great win last night, and the fans are usually very supportive of their team, even when it is struggling. The Celtics tweets also had a slightly higher share in the disgust category, although without looking at the full text of the tweets it’s hard to guess why. The 76ers showed a much higher share in the anticipation, joy, negative, and sadness categories. The negative and sadness results make sense given how the season is going so far for the team. The joy and anticipation could be in response to the recent win, but again it is hard to tell without pulling out the specific tweets in those categories. It would be interesting to check again later in the season to see if this comparison changes as teams go on winning or losing streaks.
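The per-city shares behind this comparison can be rebuilt from the two sentiment tables printed earlier. The sketch below does that with a grouped `mutate()`; the counts are copied from those tables, and the data frame and column names are my own.

```r
library(dplyr)

# Counts copied from the #Celtics and #76ers sentiment tables
counts <- data.frame(
  city = rep(c("Boston", "Philadelphia"), each = 10),
  sentiment = c(
    "positive", "trust", "negative", "fear", "anticipation",
    "anger", "disgust", "joy", "sadness", "surprise",
    "positive", "anticipation", "negative", "trust", "joy",
    "fear", "sadness", "anger", "surprise", "disgust"
  ),
  n = c(770, 434, 151, 136, 130, 129, 114, 85, 79, 65,
        369, 299, 240, 224, 198, 180, 165, 152, 85, 63),
  stringsAsFactors = FALSE
)

# Divide each count by its own city's total, not the overall total
shares <- counts %>%
  group_by(city) %>%
  mutate(frequency = n / sum(n)) %>%
  ungroup()

print(shares %>% filter(sentiment %in% c("positive", "anticipation")))
```

Dividing by the city total matters here because the two teams have different numbers of sentiment matches (2093 for Boston and 1975 for Philadelphia), so raw counts are not directly comparable.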