In this second graded major assignment, you will perform a text analysis of tweets collected around the keyword "social distancing". You are provided with a tweet data set, social_distancing_HMI.RData, which contains only English-language tweets with geo-location information. When analyzing the tweets, you must set the seed of the random number generator to your ID, set.seed(####), so that your results are reproducible, and submit the following.
First, create a graph showing the time-series trend of the tweets (i.e., the number of tweets posted on each day).
Second, create a map showing the geo-location of positive and negative tweets in the United States, classified by lexicon-based sentiment analysis using the bing lexicon. Alternatively, you can create a graph tracing the rhythm of overall sentiment about social distancing on Twitter in the U.S.
Third, identify and visualize the top 10 bigrams by frequency in the United States and the United Kingdom, and note any similarities or differences in the bigrams between the two countries.
Fourth, create a semantic network graph of word co-occurrences in the tweets from the U.S. and U.K., and discuss some of the topics (word or hashtag clusters) that the network reveals.
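The counting behind the first and third tasks can be sketched in base R as follows. This is only a minimal illustration under stated assumptions, not the expected solution: the toy data frame and its column names `created_at` and `text` are hypothetical and may not match the objects stored in social_distancing_HMI.RData, and the helper `make_bigrams` is made up for this example. Stop-word removal and the other cleaning steps are left out here for brevity.

```r
# Toy stand-in for the tweet data. The column names `created_at` and `text`
# are hypothetical; check the real names after loading the .RData file.
tweets <- data.frame(
  created_at = as.POSIXct(
    c("2020-04-01 09:15:00", "2020-04-01 18:30:00", "2020-04-02 07:45:00"),
    tz = "UTC"
  ),
  text = c(
    "social distancing saves lives",
    "practice social distancing today",
    "social distancing saves lives again"
  ),
  stringsAsFactors = FALSE
)

# Task 1: number of tweets per calendar day.
daily_counts <- table(as.Date(tweets$created_at, tz = "UTC"))

# Task 3: split a tweet into words and pair every word with its successor.
# `make_bigrams` is a hypothetical helper written for this sketch.
make_bigrams <- function(txt) {
  words <- strsplit(tolower(txt), "[^a-z#]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Count all bigrams across tweets and keep the (at most) ten most frequent.
bigram_counts <- sort(table(unlist(lapply(tweets$text, make_bigrams))),
                      decreasing = TRUE)
top_bigrams <- head(bigram_counts, 10)
```

With the real data you would run the bigram count once for the U.S. tweets and once for the U.K. tweets, then compare the two top-10 tables side by side.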
Requirement 1: Set the seed of the random number generator to your ID so that your results can be reproduced. For instance, if your ID # is 20201234, enter set.seed(20201234) before creating the semantic network of word co-occurrences in the tweet dataset.
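To see why the seed matters, consider the short check below. It is a sketch that uses the example ID from this handout; `runif` simply stands in for any step that draws random numbers, such as a network layout that starts from random node positions.

```r
student_id <- 20201234  # example ID from the handout; use your own

# First run: seed, then draw three pseudo-random numbers.
set.seed(student_id)
first_draw <- runif(3)

# Second run: the same seed restarts the same sequence.
set.seed(student_id)
second_draw <- runif(3)

# TRUE when both runs produced exactly the same numbers.
same_result <- identical(first_draw, second_draw)
```

Calling set.seed immediately before the random step, rather than once at the top of a long script, keeps the result stable even if earlier code changes.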
Requirement 2: The words to be counted should NOT include stop words, punctuation marks, non-ASCII characters, HTML tags, or URLs.
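One possible cleaning routine for this requirement is sketched below in base R. The function name `clean_tweet` and the tiny stop-word vector `stop_words_demo` are hypothetical and exist only for this illustration; a real analysis would use a full stop-word list. The sketch keeps the `#` sign so that hashtags survive as tokens, and it also strips HTML entities such as `&amp;`, which is an extra step not listed in the requirement.

```r
# Tiny illustrative stop-word list (hypothetical; use a full list in practice).
stop_words_demo <- c("a", "an", "and", "the", "to", "of", "in", "is")

# `clean_tweet` is a hypothetical helper: it returns the cleaned tokens
# of a single tweet as a character vector.
clean_tweet <- function(txt) {
  txt <- gsub("https?://\\S+|www\\.\\S+", " ", txt, perl = TRUE)  # URLs
  txt <- gsub("<[^>]+>", " ", txt, perl = TRUE)                   # HTML tags
  txt <- gsub("&[A-Za-z]+;", " ", txt, perl = TRUE)               # HTML entities (extra step)
  txt <- gsub("[^\\x20-\\x7E]", " ", txt, perl = TRUE)            # non-ASCII characters
  txt <- tolower(txt)
  txt <- gsub("[^a-z0-9#\\s]", " ", txt, perl = TRUE)             # punctuation, keeping '#'
  words <- strsplit(txt, "\\s+", perl = TRUE)[[1]]
  words <- words[nzchar(words)]                                   # drop empty strings
  words[!words %in% stop_words_demo]                              # drop stop words
}

raw_tweet <- "Stay home &amp; keep the <b>social distancing</b>! https://t.co/abc123 #StaySafe \U0001F637"
clean_tokens <- clean_tweet(raw_tweet)
```

URLs are removed first on purpose: once punctuation is stripped, a link would break into fragments that look like ordinary words.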
Requirement 3: Word co-occurrence is defined as an instance of any pair of words or hashtags used together in the same tweet.
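The definition of co-occurrence above can be illustrated with a small base R sketch. The token lists and the helper `tweet_pairs` are hypothetical, and the sketch assumes the tweets have already been cleaned and tokenized. Each unordered pair is counted at most once per tweet, which is one reasonable reading of the definition, not the only one.

```r
# Already-cleaned tokens for three toy tweets (hypothetical data).
tokens_by_tweet <- list(
  c("social", "distancing", "#stayhome"),
  c("social", "distancing", "lockdown"),
  c("lockdown", "#stayhome", "social")
)

# `tweet_pairs` is a hypothetical helper: all unordered word pairs in one tweet.
tweet_pairs <- function(words) {
  # unique() counts a pair once per tweet; radix sorting uses C-locale order,
  # so "a -- b" and "b -- a" always collapse to the same label.
  words <- sort(unique(words), method = "radix")
  if (length(words) < 2) return(character(0))
  combn(words, 2, FUN = paste, collapse = " -- ")
}

# Number of tweets in which each pair appears together.
pair_counts <- sort(table(unlist(lapply(tokens_by_tweet, tweet_pairs))),
                    decreasing = TRUE)
```

A table like `pair_counts` is essentially a weighted edge list: each pair is an edge between two words, and the count is the edge weight in the semantic network.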
Requirement 4: By 11:59 PM on June 29th (Monday), upload the following to the Assignments section on our e-class page.
1) Your R code used in the workflow of text processing, tokenization, analysis, and visualization (20 points).
2) The graph showing the time-series trend of tweets posted over time and your written interpretation of the result (15 points).
3) The graph showing either the geo-location map or the time-series trend of tweet sentiments in the U.S. and your written interpretation of the result (15 points).
4) The tables comparing the most prominent bigrams in tweets from the U.S. and U.K. and your discussion of the similarities or differences in the bigrams between the two countries (15 points).
5) The semantic network of word co-occurrences in the tweets from the U.S. and U.K. and your interpretation and discussion of some salient topics emerging from the network (15 points).
6) Your short written responses (fewer than 10 sentences each) to the following questions (10 points each):
6.1) What do you think are the two or three most prominent advantages and disadvantages of lexicon-based sentiment analysis?
6.2) What do you think are the meaningful differences between analyzing bigrams and analyzing word co-occurrences? Which method do you find more helpful for understanding public opinion about social distancing in the U.S. and U.K.?