Information extraction by counting the words

Last time, we practiced using R to map the frequency of tweets across the U.S. But doing so does not tell us what is being discussed on Twitter in the country. So today we are going to learn how to reveal what is said most prominently among the tweets posted in the U.S.

We want to examine the most prominent issues in tweets about COVID-19 in the U.S. To address this question, the first task is to quantify which words (or phrases) are used the most among the tweets posted in the U.S. One measure of how important a word may be is its term frequency (tf): how frequently a word occurs in a tweet or in a document.
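
As a quick toy illustration (the short sentence below is made up, not from our tweet data), term frequency is simply a word's count divided by the total number of words in the document:

library(tidyverse)
library(tidytext)

toy <- tibble(doc = "example",
              text = "masks protect people and masks protect workers")

toy %>% 
  unnest_tokens(word, text) %>% # Split the text into one word per row
  count(doc, word) %>% # Count how often each word occurs
  mutate(tf = n / sum(n)) # Term frequency: word count / total words in the document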

Let's take a look at which words are the most prominent in the tweets about COVID-19 posted in the U.S. To do so, we need to format our tweet dataset as a tidy dataset: we will treat all the tweets from the U.S. as one document, tokenize the text into words, and count how frequently each word occurs in that document.

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidytext)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(stringr)
library(stopwords)
library(textclean) # A new package introduced to preprocess text easily

load("covid_tweets_423.RData") # We will use the tweets collected on April 23rd.
covid_tweets
## # A tibble: 18,224 x 9
##    user_id status_id created_at          screen_name text  lang  country    lat
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr> <chr>    <dbl>
##  1 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… "@ev… en    United…  36.0 
##  2 169480… 12533658… 2020-04-23 16:51:11 Coachjmorr… "Ple… en    United…  36.9 
##  3 215583… 12533658… 2020-04-23 16:51:09 KOROGLU_BA… "@Ay… tr    Azerba…  40.2 
##  4 744597… 12533657… 2020-04-23 16:51:05 FoodFocusSA "Pre… en    South … -26.1 
##  5 155877… 12533657… 2020-04-23 16:51:01 opcionsecu… "#AT… es    Ecuador  -1.67
##  6 998960… 12533657… 2020-04-23 16:51:01 amystones4  "Tha… en    United…  53.7 
##  7 102768… 12533657… 2020-04-23 16:51:00 COTACYT     "Men… es    Mexico   23.7 
##  8 247382… 12533657… 2020-04-23 16:50:54 bkracing123 "The… en    United…  53.9 
##  9 175662… 12533657… 2020-04-23 16:50:51 AnnStrahm   "Thi… en    United…  37.5 
## 10 226707… 12533656… 2020-04-23 16:50:42 JLeonRojas  "INF… es    Chile   -35.5 
## # … with 18,214 more rows, and 1 more variable: lng <dbl>
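# Before building the tidy dataset, let's preview the textclean replace_*() functions on two sample tweets,
# cleaning them step by step: non-ASCII characters, contractions, HTML entities, and finally URLs.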
class(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text))
## [1] "character"
replace_non_ascii(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text))
## [1] "I'm glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person &amp; organization has a role to play in getting us all through this together. https://t.co/WUr2eWcRgs"
## [2] "Whenever we reopen masks will be worn during the ENTIRE time you're in the salon: before, during, &amp; after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome... https://t.co/H5n8NzUjsg"
sapply(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text), replace_non_ascii)
##  I’m glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person &amp; organization has a role to play in getting us all through this together.  https://t.co/WUr2eWcRgs 
## "I'm glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person &amp; organization has a role to play in getting us all through this together. https://t.co/WUr2eWcRgs" 
##               Whenever we reopen masks will be worn during the ENTIRE time you’re in the salon: before, during, &amp; after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome… https://t.co/H5n8NzUjsg 
##           "Whenever we reopen masks will be worn during the ENTIRE time you're in the salon: before, during, &amp; after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome... https://t.co/H5n8NzUjsg"
replace_contraction(sapply(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text), replace_non_ascii))
##   I’m glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person &amp; organization has a role to play in getting us all through this together.  https://t.co/WUr2eWcRgs 
## "I am glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person &amp; organization has a role to play in getting us all through this together. https://t.co/WUr2eWcRgs" 
##                Whenever we reopen masks will be worn during the ENTIRE time you’re in the salon: before, during, &amp; after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome… https://t.co/H5n8NzUjsg 
##           "Whenever we reopen masks will be worn during the ENTIRE time you are in the salon: before, during, &amp; after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome... https://t.co/H5n8NzUjsg"
replace_html(replace_contraction(sapply(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text), replace_non_ascii)))
## I’m glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person &amp; organization has a role to play in getting us all through this together.  https://t.co/WUr2eWcRgs 
##   "I am glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person & organization has a role to play in getting us all through this together. https://t.co/WUr2eWcRgs" 
##              Whenever we reopen masks will be worn during the ENTIRE time you’re in the salon: before, during, &amp; after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome… https://t.co/H5n8NzUjsg 
##             "Whenever we reopen masks will be worn during the ENTIRE time you are in the salon: before, during, & after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome... https://t.co/H5n8NzUjsg"
replace_url(replace_html(replace_contraction(sapply(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text), replace_non_ascii))))
## [1] "I am glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person & organization has a role to play in getting us all through this together. "
## [2] "Whenever we reopen masks will be worn during the ENTIRE time you are in the salon: before, during, & after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome... "
covid_tweets_tidy <- covid_tweets %>% 
  filter(lang == "en") %>% # Selecting tweets written in English only
  filter(country == "United States") %>% # Selecting tweets posted in the U.S. only
  mutate(text = str_replace_all(text, "RT", " ")) %>% # Removing retweet marker
  mutate(text = sapply(text, replace_non_ascii)) %>% 
  mutate(text = sapply(text, replace_contraction)) %>% 
  mutate(text = sapply(text, replace_html)) %>% 
  mutate(text = sapply(text, replace_url)) %>% 
  unnest_tweets(word, text) %>% # Splitting text into words by unnest_tweets 
  filter(!word %in% stopwords()) %>% # Removing words matched by any element in stopwords() vector
  filter(str_detect(word, "[a-z]")) # Keeping only words that contain at least one alphabetical letter
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid_tweets_tidy # Look at the word column
## # A tibble: 69,448 x 9
##    user_id status_id created_at          screen_name lang  country   lat   lng
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr>   <dbl> <dbl>
##  1 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en    United…  36.0 -115.
##  2 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en    United…  36.0 -115.
##  3 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en    United…  36.0 -115.
##  4 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en    United…  36.0 -115.
##  5 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en    United…  36.0 -115.
##  6 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en    United…  36.0 -115.
##  7 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en    United…  36.0 -115.
##  8 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en    United…  36.0 -115.
##  9 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en    United…  36.0 -115.
## 10 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en    United…  36.0 -115.
## # … with 69,438 more rows, and 1 more variable: word <chr>

Now we are ready to count the word column and find the most prominent words by frequency.

covid_tweets_tidy %>% 
  count(word, sort=T) # Count the elements in the word column and arrange the result in a descending order
## # A tibble: 17,622 x 2
##    word             n
##    <chr>        <int>
##  1 #covid19      2150
##  2 covid19       1560
##  3 #coronavirus   722
##  4 can            411
##  5 people         406
##  6 us             335
##  7 new            317
##  8 get            305
##  9 just           293
## 10 now            290
## # … with 17,612 more rows

As you can see in the counting result, the top 10 words (or tokens) by frequency are #covid19, covid19, #coronavirus, can, people, us, new, get, just, now…

What do you think of this list of the top 10 words? Is it helpful for guessing the important issues being discussed in the tweets about COVID-19 in the U.S.? These could be important terms for COVID-19 in general, but they are not particularly informative about the COVID-19 issue in the U.S., because they are also the most frequent terms in the U.K., India, or Canada. Being important in the U.S. should mean being distinctive to the country.

So the most frequent words occur many times in the tweets but may not be important for understanding what is going on in the U.S. Why? Because they are likely to be just as frequent among tweets posted in the U.K. or Canada, where English is also the official language.

In English, these are usually stop words like "the", "is", "a", "an", "of", and so on. Of course, such stop words were already removed using the predefined list from the stopwords package: filter(!word %in% stopwords()). But we still see words that are used frequently yet are not relevant for finding the important issues about COVID-19 in the U.S. In fact, the most frequent words in tweets from the U.S. are largely the same words that are most frequent in tweets from any other country, as long as the tweets are written in English.
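
For reference, stopwords() from the stopwords package returns a plain character vector of such function words (by default the English Snowball list), which is why the %in% filter above works:

head(stopwords(), 10) # Peek at the first few entries of the default English stop word list
length(stopwords()) # How many stop words are removed in total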

Let’s see which words are being used frequently in the U.K.

covid_tweets %>% 
  filter(lang == "en") %>% # Selecting tweets written in English only
  filter(country == "United Kingdom") %>% # Seleting tweets posted in the UK only
  mutate(text = str_replace_all(text, "RT", " ")) %>% # Removing retweet marker
  mutate(text = sapply(text, replace_non_ascii)) %>% 
  mutate(text = sapply(text, replace_contraction)) %>% 
  mutate(text = sapply(text, replace_html)) %>% 
  mutate(text = sapply(text, replace_url)) %>% 
  unnest_tweets(word, text) %>% # Splitting text into words by unnest_tweets 
  filter(!word %in% stopwords()) %>% # Removing words matched by any element in stopwords() vector
  filter(str_detect(word, "[a-z]")) %>% 
  count(word, sort=T)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 7,240 x 2
##    word             n
##    <chr>        <int>
##  1 #covid19       629
##  2 covid19        328
##  3 #coronavirus   220
##  4 can            134
##  5 covid          108
##  6 people         102
##  7 just            79
##  8 uk              73
##  9 us              73
## 10 now             70
## # … with 7,230 more rows

From the result, we can see that the most frequent words (e.g., "can", "people", "just", "us", "now") are commonly used across countries. The fact that a word is used frequently does not mean it is important in the tweets about COVID-19 from the U.S. When everyone can speak English fluently, my English skill does not make me stand out in the job market.

So we might take the approach of finding the words that are more important in the tweets from the U.S. than in those from other countries. To do so, we need a measure called a term's inverse document frequency (idf), which decreases the weight of commonly used words and increases the weight of words that are not used very much in a collection of tweets or documents. This can be combined with term frequency (tf) to calculate a term's tf-idf (the two quantities multiplied together): the frequency of a term adjusted for how rarely it is used.

The statistic tf-idf is intended to measure how important a word is to a document (or a tweet) in a collection of documents (tweets), or a corpus.

In our case, tf-idf measures how important a word is to the group of tweets from the U.S. within the collection of all the tweets from the U.S., the U.K., India, and Canada.

So in terms of idf, the word "covid19" decreases in importance because it is likely to appear in almost every tweet, whether from the U.S., the U.K., India, or Canada. Remember that our dataset of tweets was collected precisely because they contain a hashtag related to "covid19"!

The inverse document frequency (idf) for any given term is defined as

idf(term) = ln(number of documents / number of documents containing the term)

Here a "document" is the whole set of tweets from one country, so the denominator counts how many of the four countries use the term.

So this metric decreases the weight of words that are likely to be used in tweets from all the countries, such as "covid19", "coronavirus", "people", or "now", and increases the weight of words that are unique to each country. For example, the word "modi" refers to the last name of the current Prime Minister of India, so it is used frequently in tweets from India but is much less likely to appear in tweets from the U.S., the U.K., or Canada. In this case, idf gives much greater weight to the word "modi" than its term frequency alone would: "modi" appears in only one of the four country groups, so its idf is ln(4/1) ≈ 1.39, whereas "covid19" appears in all four and gets an idf of ln(4/4) = 0.

Term frequency (tf) in tweets about COVID-19

Let's start by looking at the COVID-19 tweets posted in the top four countries (with English as an official language) and examine term frequency first, then tf-idf. We can start just by using dplyr functions such as group_by() and left_join(). What are the most commonly used words in tweets about COVID-19? (Let's also calculate the total number of words in each country for later use.)

library(dplyr)
library(tidytext)

covid_tweets %>% 
  count(country, sort=T)
## # A tibble: 163 x 2
##    country            n
##    <chr>          <int>
##  1 United States   4998
##  2 India           1527
##  3 Indonesia       1324
##  4 United Kingdom  1311
##  5 Brazil           942
##  6 Canada           658
##  7 Spain            640
##  8 Mexico           567
##  9 Nigeria          552
## 10 Colombia         381
## # … with 153 more rows
# Focusing on the top four countries with English as an official language

tweet_words <- covid_tweets %>% 
  filter(lang == "en") %>% # Selecting tweets written in English only
  filter(country %in% c("United States","India","United Kingdom","Canada")) %>% # Selecting tweets from U.S., India, U.K., and Canada
  mutate(text = str_replace_all(text, "RT", " ")) %>% # Removing retweet marker
  mutate(text = sapply(text, replace_non_ascii)) %>% 
  mutate(text = sapply(text, replace_contraction)) %>% 
  mutate(text = sapply(text, replace_html)) %>% 
  mutate(text = sapply(text, replace_url)) %>% 
  unnest_tweets(word, text) %>% # Splitting text into words by unnest_tweets 
  filter(!word %in% stopwords()) %>% # Removing words matched by any element in stopwords() vector
  filter(str_detect(word, "[a-z]")) %>% 
  count(country, word, sort=T) 
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
tweet_words
## # A tibble: 35,663 x 3
##    country        word             n
##    <chr>          <chr>        <int>
##  1 United States  #covid19      2150
##  2 United States  covid19       1560
##  3 United States  #coronavirus   722
##  4 United Kingdom #covid19       629
##  5 United States  can            411
##  6 United States  people         406
##  7 India          #covid19       379
##  8 India          covid19        358
##  9 United States  us             335
## 10 United Kingdom covid19        328
## # … with 35,653 more rows

Next, let's add the total number of words used in the tweets from each country.

total_words <- tweet_words %>% 
  group_by(country) %>% 
  summarise(total = sum(n))
total_words
## # A tibble: 4 x 2
##   country        total
##   <chr>          <int>
## 1 Canada          9209
## 2 India          17863
## 3 United Kingdom 19583
## 4 United States  69448

We can measure the proportion of each word to the total words in its country. There is one row in the tweet_words data frame for each word-country combination: n is the number of times that word is used in that country, and (after the join below) total is the total number of words in that country. The usual suspects have the highest n: "covid19", "coronavirus", "can", "us", "people", and so forth. With these columns we can look at the most commonly used words in COVID-19 tweets from each country in terms of term frequency.

tweet_words <- tweet_words %>% 
  left_join(total_words)
## Joining, by = "country"
tweet_words
## # A tibble: 35,663 x 4
##    country        word             n total
##    <chr>          <chr>        <int> <int>
##  1 United States  #covid19      2150 69448
##  2 United States  covid19       1560 69448
##  3 United States  #coronavirus   722 69448
##  4 United Kingdom #covid19       629 19583
##  5 United States  can            411 69448
##  6 United States  people         406 69448
##  7 India          #covid19       379 17863
##  8 India          covid19        358 17863
##  9 United States  us             335 69448
## 10 United Kingdom covid19        328 19583
## # … with 35,653 more rows

Let's sort the words by tf within each country.

tweet_words %>% 
  group_by(country) %>% 
  mutate(tf = n/total) %>% 
  top_n(10, tf) %>% 
  arrange(desc(tf)) %>% 
  ungroup 
## # A tibble: 41 x 5
##    country        word             n total     tf
##    <chr>          <chr>        <int> <int>  <dbl>
##  1 United Kingdom #covid19       629 19583 0.0321
##  2 Canada         #covid19       289  9209 0.0314
##  3 United States  #covid19      2150 69448 0.0310
##  4 Canada         covid19        216  9209 0.0235
##  5 United States  covid19       1560 69448 0.0225
##  6 India          #covid19       379 17863 0.0212
##  7 India          covid19        358 17863 0.0200
##  8 United Kingdom covid19        328 19583 0.0167
##  9 United Kingdom #coronavirus   220 19583 0.0112
## 10 United States  #coronavirus   722 69448 0.0104
## # … with 31 more rows

Sorting the words by tf shows that the top 10 words are quite similar across the countries.

But as we discussed above, the most frequently used words may not reveal what is being said distinctively in each country's tweets, because they are too common everywhere.

tweet_words %>% 
  group_by(country) %>% 
  mutate(tf = n/total) %>% 
  top_n(10, tf) %>% 
  arrange(desc(tf)) %>% 
  ungroup %>% 
  mutate(country = as.factor(country), 
         word = reorder_within(word, tf, country)) %>% # To order the words by tf within each country
  ggplot(aes(word, tf, fill = country)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() + # This line removes the separator and country name as a suffix
  labs(x = NULL, y = "Term Frequency") +
  facet_wrap(~country, ncol=2, scales="free") +
  coord_flip()

The idea of tf-idf is to find the words that are important to the content of each document by decreasing the weight of words that are common across the whole collection and increasing the weight of words that are not used very much in it; in our case, each country's tweets are treated as one document within the collection of tweets from the four countries. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Let's do that now.

The bind_tf_idf() function in the tidytext package takes a tidy text dataset as input, with one row per token (term) per document. One column (word here) contains the terms/tokens, one column contains the documents (country in this case), and the last necessary column contains the counts (n here): how many times each document contains each term. We calculated a total for each country for our explorations above, but it is not necessary for bind_tf_idf(); the table only needs to contain all the words in each country's tweets.

tweet_words
## # A tibble: 35,663 x 4
##    country        word             n total
##    <chr>          <chr>        <int> <int>
##  1 United States  #covid19      2150 69448
##  2 United States  covid19       1560 69448
##  3 United States  #coronavirus   722 69448
##  4 United Kingdom #covid19       629 19583
##  5 United States  can            411 69448
##  6 United States  people         406 69448
##  7 India          #covid19       379 17863
##  8 India          covid19        358 17863
##  9 United States  us             335 69448
## 10 United Kingdom covid19        328 19583
## # … with 35,653 more rows
tweet_words <- tweet_words %>% 
  bind_tf_idf(word, country, n)
tweet_words
## # A tibble: 35,663 x 7
##    country        word             n total      tf   idf tf_idf
##    <chr>          <chr>        <int> <int>   <dbl> <dbl>  <dbl>
##  1 United States  #covid19      2150 69448 0.0310      0      0
##  2 United States  covid19       1560 69448 0.0225      0      0
##  3 United States  #coronavirus   722 69448 0.0104      0      0
##  4 United Kingdom #covid19       629 19583 0.0321      0      0
##  5 United States  can            411 69448 0.00592     0      0
##  6 United States  people         406 69448 0.00585     0      0
##  7 India          #covid19       379 17863 0.0212      0      0
##  8 India          covid19        358 17863 0.0200      0      0
##  9 United States  us             335 69448 0.00482     0      0
## 10 United Kingdom covid19        328 19583 0.0167      0      0
## # … with 35,653 more rows

Notice that idf and thus tf-idf are zero for these extremely common words. These are all words that appear in the tweets of all four countries, so the idf term (which will then be the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight of common words. The inverse document frequency will be higher for words that occur in fewer of the documents in the collection.

Let’s look at terms with high tf-idf in tweets about COVID-19.

tweet_words %>% 
  select(-total) %>% 
  arrange(desc(tf_idf))
## # A tibble: 35,663 x 6
##    country        word                   n      tf   idf  tf_idf
##    <chr>          <chr>              <int>   <dbl> <dbl>   <dbl>
##  1 India          #indiafightscorona    78 0.00437 1.39  0.00605
##  2 India          modi                  53 0.00297 1.39  0.00411
##  3 Canada         ontario               27 0.00293 1.39  0.00406
##  4 India          namo                  45 0.00252 1.39  0.00349
##  5 Canada         #cdnpoli              23 0.00250 1.39  0.00346
##  6 Canada         canada                21 0.00228 1.39  0.00316
##  7 India          @narendramodi         73 0.00409 0.693 0.00283
##  8 India          @pmoindia             62 0.00347 0.693 0.00241
##  9 India          #india                25 0.00140 1.39  0.00194
## 10 United Kingdom #lockdownuk           27 0.00138 1.39  0.00191
## # … with 35,653 more rows

Here we see some hashtags that are specific to each country's context, as well as named entities such as names of people, places, or organizations, which are in fact important in the tweets of each country. These terms do not occur across all the countries, and they are important, characteristic words for each country within the corpus of COVID-19 tweets.

Some of the values for idf are the same for different terms because there are four groups of tweets (documents) in this corpus, and we are seeing the numerical values for ln(4/1) = 1.386294, ln(4/2) = 0.6931472, and so on.
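
We can check this, for example, by counting in how many of the four country groups each word appears and recomputing idf by hand (a quick sanity check on the tweet_words data frame built above):

tweet_words %>% 
  group_by(word) %>% 
  mutate(n_countries = n_distinct(country)) %>% # In how many of the 4 "documents" the word occurs
  ungroup() %>% 
  distinct(word, n_countries, idf) %>% # One row per word
  count(n_countries, idf) # idf should equal log(4 / n_countries)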

And let’s look at a visualization for these high tf-idf words.

tweet_words %>% 
  arrange(desc(tf_idf)) %>% 
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(country) %>% 
  top_n(10) %>% 
  ungroup %>% 
  ggplot(aes(word, tf_idf, fill = country)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "TF-IDF") +
  facet_wrap(~country, ncol=2, scales="free") +
  coord_flip()
## Selecting by tf_idf

Using tf-idf, we can see that the words at the top of the list are more important to each country insofar as they indicate the country-specific situation around COVID-19. What measuring tf-idf has done here is show us that COVID-19 tweets contain similar words across the four countries, and that what distinguishes one country from the rest within the collection are hashtags, Twitter handles, and the names of people, places, or organizations. This is the point of tf-idf: it identifies the words that are important to one country within a collection of tweets.

8th Assignment

In this assignment, you will 1) reduce the tweet dataset from "covid_tweets_423.RData" to the tweets written in English and posted in any four countries of your interest (each with at least 100 tweets) using the "country" variable, 2) visualize four bar graphs showing the top 10 words by term frequency from the COVID-19 tweets in each country, 3) visualize four bar graphs showing the top 10 words by tf-idf in each country, and 4) describe your insights about the results.

Requirement 1: You should select tweets in English using the lang variable.

Requirement 2: You should select tweets posted in four countries using the country variable.

Requirement 3: Your bar graphs should be labeled with the top 10 words, the country, and the quantity by which the top 10 words are ordered.

Requirement 4: By 11:59 PM on June 17th (Wednesday), you will upload the following things to this Assignments section on our e-class page.

  1. Four graphs to visualize the top 10 words in term frequency among the tweets about COVID-19 from four countries you selected and the R code used to generate the graph.

  2. Four graphs to visualize the top 10 words in tf-idf among the tweets about COVID-19 from four countries you selected and the R code used to generate the graph.

  3. What are the differences in the top 10 words between tf and tf-idf? What do you think causes such differences?