First Major Assignment

In this first graded major assignment, you will 1) pre-process and tokenize the text of 10,000 randomly sampled tweets about Covid-19 from covid19_tweets_df, 2) create three tables showing the top 20 hashtags among the tweets posted on March 26th, 27th, and 28th, respectively, and 3) generate three word clouds that visualize the frequency of the top 100 hashtags (excluding #covid19, #coronavirus, #covid2019, #coronavirusoutbreak, #covid, #coronavirusoubreak) on each day.

Requirement 1: To make your random sample reproducible, set the seed of the random number generator to your ID number. For instance, if your ID # is 20201234, enter set.seed(20201234) before extracting a random sample of tweets from the dataset.

Requirement 2: The top hashtags to be counted should exclude #covid19, #coronavirus, #covid2019, #coronavirusoutbreak, #covid, and #coronavirusoubreak.

Requirement 3: By 11:59 PM on May 13th (Wednesday), upload the following items to the Assignments section on our e-class page.

  1. Your R code for setting the seed, processing the tweet data, and tokenizing the text (30 points)

  2. The three frequency tables of top 50 hashtags in descending order (10 points)

  3. The resulting word clouds (10 points)

  4. Your short written responses (fewer than 10 sentences) to the following questions (5 points each)

    4.1) What steps did you take in your text pre-processing?

    4.2) What were the two or three most complicated issues in the above process, and how did you address each one?

    4.3) What additional text processing do you think is needed to improve your word clouds?

    4.4) What was the most effective (or helpful) function in your text processing and why?

  5. Is there any recognizable difference among the word clouds that show the viral hashtags on each day? If so, what is the difference, and what do you think caused it?

Hint 1: To extract a random sample of 10,000 tweets from the dataset, covid19_tweets_df, you may use the dplyr function, sample_n(), which selects random rows from a data frame.

Example

library(tidyverse)
load("covid19_tweets_df.RData")
set.seed(20201234) # Setting the seed ensures that the same random numbers, and therefore the same sample, are reproduced on every run.
covid19_tweets_sample <- covid19_tweets_df %>% 
  sample_n(10000) # Randomly select 10,000 rows from covid19_tweets_df and save them to a new data frame, covid19_tweets_sample.
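
To check that the seed really makes your sample reproducible, you can draw the sample twice with the same seed and compare the two results. The sketch below is only an illustration: toy_df is a made-up stand-in for covid19_tweets_df, so the code runs on its own without the course data.

```r
library(dplyr)

# toy_df is a made-up stand-in for covid19_tweets_df (assumption for illustration)
toy_df <- data.frame(id = 1:100, text = paste("tweet", 1:100))

set.seed(20201234)                    # same example ID as above
sample_a <- toy_df %>% sample_n(10)

set.seed(20201234)                    # resetting the seed restarts the random sequence
sample_b <- toy_df %>% sample_n(10)

identical(sample_a, sample_b)         # compares the two draws row by row
```

If you call sample_n() a second time without resetting the seed, you should generally expect a different set of rows, which is why set.seed() has to come immediately before the sampling step.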

Hint 2: You can create a vector of hashtags to be excluded from the hashtag tokens to be counted as follows:

hashtag_outs <- c("#covid19", "#coronavirus", "#covid2019", "#coronavirusoutbreak", "#covid", "#coronavirusoubreak")
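
One possible way to use this vector is sketched below; it is not the only approach, and it is shown on a tiny made-up data frame rather than the real sample. The column names text and created_at, the object names toy_tweets, hashtag_counts, and top_by_day, and the choice of the wordcloud package are all assumptions for illustration, so check names(covid19_tweets_df) before adapting it. The sketch lowercases the text, extracts hashtags with a regular expression, drops the excluded ones, counts the rest per day, and draws a word cloud for one day.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up stand-in for covid19_tweets_sample; the column names
# `text` and `created_at` are assumptions, not confirmed by the dataset.
toy_tweets <- data.frame(
  created_at = as.POSIXct(c("2020-03-26 10:00:00", "2020-03-26 12:30:00",
                            "2020-03-27 09:15:00", "2020-03-28 18:45:00"),
                          tz = "UTC"),
  text = c("Stay home! #COVID19 #StayHome #FlattenTheCurve",
           "Wash your hands #stayhome #coronavirus",
           "Thank you nurses #covid19 #HealthcareHeroes",
           "Day 3 of #lockdown #StayHome #covid"),
  stringsAsFactors = FALSE
)

hashtag_outs <- c("#covid19", "#coronavirus", "#covid2019",
                  "#coronavirusoutbreak", "#covid", "#coronavirusoubreak")

hashtag_counts <- toy_tweets %>%
  mutate(day = as.Date(created_at, tz = "UTC"),
         # lowercase first so "#COVID19" and "#covid19" count as one hashtag
         hashtag = str_extract_all(str_to_lower(text), "#\\w+")) %>%
  unnest(hashtag) %>%                       # one row per hashtag occurrence
  filter(!hashtag %in% hashtag_outs) %>%    # drop the excluded hashtags
  count(day, hashtag, sort = TRUE)          # frequency per day, descending

# Keep at most the 20 most frequent hashtags for each day
top_by_day <- hashtag_counts %>%
  group_by(day) %>%
  slice_max(n, n = 20, with_ties = FALSE) %>%
  ungroup()

# Word cloud for one day, drawn only if the wordcloud package is installed
if (requireNamespace("wordcloud", quietly = TRUE)) {
  day1 <- filter(hashtag_counts, day == as.Date("2020-03-26"))
  wordcloud::wordcloud(words = day1$hashtag, freq = day1$n,
                       min.freq = 1, max.words = 100, random.order = FALSE)
}
```

Lowercasing before matching matters here because every entry in hashtag_outs is lowercase; without it, a tag such as #COVID19 would slip past the filter. On the real data you would repeat the word cloud step for each of the three days.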