Text attributes of tweets also contain a great deal of information. To access it, one needs sentiment analysis: the process of classifying emotions (for example, my choice of dictionary has 10 categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust). Implementing sentiment analysis enables me to read the tweets for the emotions behind them. As the first step, I extract the most frequently used words in the tweets, which gives me an overall view of the tweet texts. It should be noted that I cleaned the text beforehand; that code has been hidden to keep the document shorter.
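For reference, the hidden cleaning step looks roughly like the sketch below. This is only a minimal sketch of what such preprocessing typically does; the helper name clean_tweet_text and the raw text column it is applied to are assumptions, not the actual hidden code.
# a minimal sketch of the kind of cleaning applied beforehand (assumed, not the hidden code itself)
library(stringr)
clean_tweet_text <- function(x) {                    # hypothetical helper
  x <- str_remove_all(x, "http\\S+|www\\.\\S+")      # drop URLs
  x <- str_remove_all(x, "@\\w+")                    # drop @mentions
  x <- str_remove_all(x, "^(RT|rt)\\s+")             # drop retweet markers
  str_squish(tolower(x))                             # lower-case and collapse whitespace
}
# e.g. tw_nobot$stripped_text <- clean_tweet_text(tw_nobot$text)  # the 'text' column is assumed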
# packages used below (the earlier hidden cleaning code may already load some of these)
library(dplyr)
library(stringr)
library(tidytext)
library(tm)
library(ggplot2)
# getting rid of non-useful words - stop words
data("stop_words")
# head(stop_words)
tw_nobot_orig <- tw_nobot  # keep an untouched copy of the tweets
# drop everything that is not a letter or whitespace (numbers, punctuation, symbols)
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
tw_nobot$stripped_text <- removeNumPunct(tw_nobot$stripped_text)
# English stop words: tidytext's stop_words plus tm's list
custom_stop_words <- bind_rows(stop_words,
                               tibble(word = tm::stopwords("english"),
                                      lexicon = "custom"))
tw_nobot$stripped_text <- removeWords(tw_nobot$stripped_text, custom_stop_words$word)
tw_nobot$word <- NA  # placeholder column so the select()/unnest_tokens() step below works
tw_bostonblock_clean <- tw_nobot %>%
  select(word, stripped_text) %>%
  unnest_tokens(word, stripped_text) %>%   # one row per word
  filter(!word %in% c("a", "the"))         # stray stop words that survived removeWords
# normalise the token "im" (from "I'm" with punctuation stripped) to "i";
# matching the whole token avoids mangling words such as "time" or "simple"
tw_bostonblock_clean$word <- str_replace_all(tw_bostonblock_clean$word, "^im$", "i")
# plot the 25 most frequent words
tw_bostonblock_clean %>%
  count(word, sort = TRUE) %>%
  top_n(25, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "red3") +
  coord_flip() +
  labs(title = "Count of unique words found in tweets", x = NULL, y = NULL) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.y = element_text(face = "bold", size = 10),
        axis.text.x = element_blank())
As visible in this graph, the word “Boston” is the most frequently used word in the dataset by a significant margin. Other words such as “tie”, “love”, “wind”, “humidity” and “family” are also among the top 25. Now that I have a better idea of what is going on in the texts, I move on to extracting emotions from each tweet.
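Before running it on the full dataset, it helps to see what “extracting emotions” produces. The get_nrc_sentiment() function from the syuzhet package (used below) returns one row per text with a count for each of the 10 categories; here is a small illustration on a made-up sentence that is not part of the pipeline.
# illustration only: category counts for a single made-up sentence
library(syuzhet)
get_nrc_sentiment("I love and trust my neighbors, but the storm last night was scary")
# the returned data frame has one column per category: anger, anticipation, disgust,
# fear, joy, sadness, surprise, trust, negative and positive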
# sentiment analysis
library(syuzhet)
# remove non-ASCII characters (e.g. garbled emoji residue) and hashtag signs
tw_nobot$stripped_text <- gsub("[^\x01-\x7F]", "", tw_nobot$stripped_text)
tw_nobot$stripped_text <- gsub("#", "", tw_nobot$stripped_text)
emotions_total <- get_nrc_sentiment(tw_nobot$stripped_text)
# "Syuzhet" breaks the emotion into 10 different categories: anger, anticipation, disgust,
# fear, joy, sadness, surprise, trust, negative and positive
emo_bar <- colSums(emotions_total)
emo_sum <- data.frame(count = emo_bar, emotions_total = names(emo_bar))
# order the emotion categories by how often they appear
emo_sum$emotions_total <- factor(emo_sum$emotions_total,
                                 levels = emo_sum$emotions_total[order(emo_sum$count, decreasing = TRUE)])
library(plotly)
plot_ly(emo_sum, x = ~emotions_total, y = ~count, type = "bar", color = ~emotions_total) %>%
  layout(xaxis = list(title = ""), showlegend = FALSE,
         title = "Emotion Type for all tweets")
This graph shows the distribution of emotions across the tweets in the dataset. The most expressed feeling is “positive” and the least expressed is “disgust”. I am specifically interested in the “trust” emotion, as it can be a demonstration of social capital in a community: the higher the trust between the citizens of a community, the higher its level of social resilience. Next, I want to see the distribution of “trust” across the different neighborhoods of Boston. To do this, I map the entire city of Boston, coloring each neighborhood by the share of its tweets that express trust. This gives me an idea of the level of social capital in the neighborhoods. Obviously, this needs further investigation to prove the correlation and reliability of this comparison.
# drop any existing emotion columns so cbind() below does not create duplicates
drops <- c("anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise",
           "trust", "negative", "positive")
tw_nobot <- tw_nobot[ , !(names(tw_nobot) %in% drops)]
dt <- cbind(tw_nobot, emotions_total)
dt <- dt %>%
  group_by(neighborhood) %>%
  mutate(tweet_per_neighborhood = n())
# keeping dt unchanged
dt1 <- dt %>%
  group_by(neighborhood) %>%
  mutate(Anger = sum(anger), Anticipation = sum(anticipation), Disgust = sum(disgust),
         Fear = sum(fear), Joy = sum(joy), Sadness = sum(sadness), Surprise = sum(surprise),
         Trust = sum(trust), Negative = sum(negative), Positive = sum(positive))
# adding geo data: geo_join() comes from tigris; nhoods holds the Boston neighborhood polygons loaded earlier
library(tigris)
library(leaflet)
db <- geo_join(nhoods, dt1, "Name", "neighborhood")
df <- db[!is.na(db$Trust), ]
# the pop-up variable: neighborhood name plus percentage of trust tweets
mypopup <- paste0(df$Name, " ", round((df$Trust / df$tweet_per_neighborhood) * 100, 2))
# the palette
mypal <- colorNumeric(
  palette = "RdYlBu",
  domain = (df$Trust / df$tweet_per_neighborhood)
)
myLAT <- 42.3398
myLNG <- -71.0892
mycentername <- "Northeastern University"  # picking one point on the map to permanently pop up
# mapping
mymap <- leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(myLNG, myLAT, zoom = 12) %>%
  addMarkers(lat = myLAT, lng = myLNG, popup = mycentername) %>%
  addPolygons(data = df,
              highlight = highlightOptions(weight = 3,
                                           color = "red",
                                           bringToFront = TRUE),
              fillColor = ~mypal(df$Trust / df$tweet_per_neighborhood),
              color = "#000000",
              fillOpacity = 0.7,
              weight = 1,
              smoothFactor = 0.2,
              popup = mypopup) %>%
  addLegend(pal = mypal,
            values = df$Trust / df$tweet_per_neighborhood,
            position = "bottomright",
            title = "Trust Tweets",
            opacity = 1)
mymap
# saveWidget(mymap, file="D:\\School\\Semester 8\\DI\\Trust.html")
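Before interpreting the map, the same per-neighborhood trust shares can be printed as a plain table for a quick cross-check. This is a small sketch reusing the df object built above; the trust_share name is mine.
# tabulate the trust share that the choropleth above encodes
trust_share <- data.frame(
  neighborhood = df$Name,
  trust_pct    = round(df$Trust / df$tweet_per_neighborhood * 100, 2)
)
trust_share[order(-trust_share$trust_pct), ]  # highest trust share first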
As demonstrated in this map, Beacon Hill, Longwood, South Boston, Charlestown, North End and Dorchester have a high percentage of trust-related tweets. These neighborhoods are mostly racially homogeneous. In theory, communities with this characteristic tend to present higher levels of social resilience; in other words, their residents have stronger social ties due to their similarities.
In order to answer that question, I need to
1) design and verify social capital metrics, and
2) analyze the data under three different scenarios: a normal situation, natural events, and human-driven disasters.