Introduction

This report presents an exploratory analysis of the three English-language files (blogs, news, and Twitter) contained within the SwiftKey dataset.

library(tidyverse)   # read_lines(), tibble(), dplyr verbs, ggplot2
library(tidytext)    # unnest_tokens(), stop_words
library(wordcloud)   # wordcloud()
blogs_data <- tibble(text = read_lines('en_US/en_US.blogs.txt'))
news_data <- tibble(text = read_lines('en_US/en_US.news.txt'))
twitter_data <- tibble(text = read_lines('en_US/en_US.twitter.txt'))
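
The three files are large (tens of millions of words each, as the counts below show), so during exploration it can be convenient to work on a random sample of lines. A minimal sketch using dplyr's slice_sample(); the 10% proportion and the seed are arbitrary choices, not part of the analysis that follows.

set.seed(123)                      # arbitrary seed for a reproducible sample
blogs_sample <- blogs_data %>%
  slice_sample(prop = 0.10)        # keep roughly 10% of the lines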

Blogs Summary

First, we use the tidytext package to tokenize all the words in the blogs dataset.

current_token <- blogs_data %>%
  unnest_tokens(word, text)
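
For reference, the line and word counts quoted below can be read straight off the two tibbles, since each row of blogs_data is one line of the raw file and each row of the tokenized result is a single word.

nrow(blogs_data)      # number of lines in the raw file
nrow(current_token)   # number of word tokens after unnest_tokens()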

There are 899,288 lines of text and 37,546,239 words in this dataset. Next, we present a brief table of the most frequently occurring words.

current_token %>%
  count(word, sort = TRUE)
## # A tibble: 320,003 x 2
##    word        n
##    <chr>   <int>
##  1 the   1860156
##  2 and   1094401
##  3 to    1069440
##  4 a      900362
##  5 of     876799
##  6 i      775032
##  7 in     598532
##  8 that   460782
##  9 is     432712
## 10 it     403902
## # … with 319,993 more rows

Clearly, the most frequent words are stop words (e.g. "the", "and", "to", "a"). Plotting a histogram of the word counts confirms how heavily these stop words skew the distribution.

current_token_count <- current_token %>%
  count(word, sort = TRUE)
ggplot(current_token_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
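
The message above is just ggplot2 asking for an explicit bin width. Because the counts span several orders of magnitude, a log-scaled x-axis (a variation on the plot above, not part of the original report) makes the skew easier to see.

ggplot(current_token_count, aes(x = n)) +
  geom_histogram(bins = 50) +     # explicit bin count silences the stat_bin message
  scale_x_log10()                 # counts range from 1 to over a million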

As such, we'll use an anti-join to remove the stop words, then display the top words by count along with a histogram of the results.

current_token_no_stop <- current_token %>%
  anti_join(stop_words)
## Joining, by = "word"
current_token_no_stop %>%
  count(word, sort = TRUE)
## # A tibble: 319,278 x 2
##    word       n
##    <chr>  <int>
##  1 time   90918
##  2 people 59574
##  3 day    52372
##  4 love   45230
##  5 life   41251
##  6 it’s   38657
##  7 1      30907
##  8 2      29561
##  9 world  29305
## 10 i’m    29189
## # … with 319,268 more rows
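
Contractions such as "it's" and "i'm" survive the anti-join, most likely because they are tokenized with curly apostrophes while the stop_words lexicons use straight ones. A sketch (using stringr, loaded with the tidyverse) that normalises the apostrophes before removing stop words:

current_token %>%
  mutate(word = str_replace_all(word, "\u2019", "'")) %>%  # curly to straight apostrophe
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)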
current_token_no_stop_count <- current_token_no_stop %>%
  count(word, sort = TRUE)
ggplot(current_token_no_stop_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

current_token_no_stop_count %>% 
  with(wordcloud(word, n, random.order = FALSE, max.words = 50))
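
wordcloud() lays the words out with a random placement, so the cloud changes from run to run; setting a seed beforehand (a small addition, not in the original code) makes the figure reproducible.

set.seed(42)   # arbitrary seed; fixes the word placement between runs
current_token_no_stop_count %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 50))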

There is still a large concentration of words with low counts, but removing the stop words has made the resulting frequency table more informative.

News Summary

First, we use the tidytext package to tokenize all the words in the news dataset.

current_token <- news_data %>%
  unnest_tokens(word, text)

There are 1,010,242 lines of text and 34,762,395 words in this dataset. Next, we present a brief table of the most frequently occurring words.

current_token %>%
  count(word, sort = TRUE)
## # A tibble: 284,533 x 2
##    word        n
##    <chr>   <int>
##  1 the   1974366
##  2 to     906145
##  3 and    889511
##  4 a      878035
##  5 of     774502
##  6 in     679065
##  7 for    353901
##  8 that   347079
##  9 is     284240
## 10 on     269881
## # … with 284,523 more rows

Clearly, the most frequent words are stop words (e.g. "the", "to", "and", "a"). Plotting a histogram of the word counts confirms how heavily these stop words skew the distribution.

current_token_count <- current_token %>%
  count(word, sort = TRUE)
ggplot(current_token_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As such, we'll use an anti-join to remove the stop words, then display the top words by count along with a histogram of the results.

current_token_no_stop <- current_token %>%
  anti_join(stop_words)
## Joining, by = "word"
current_token_no_stop %>%
  count(word, sort = TRUE)
## # A tibble: 283,812 x 2
##    word        n
##    <chr>   <int>
##  1 time    57062
##  2 people  47666
##  3 city    37953
##  4 1       37292
##  5 school  35498
##  6 game    34949
##  7 percent 34690
##  8 day     31901
##  9 2       31784
## 10 million 30914
## # … with 283,802 more rows
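
Bare numerals such as "1" and "2" are not in the stop-word lexicons, so they rank highly here. If they are unwanted, a regular-expression filter (a sketch, not part of the original analysis) drops tokens that consist only of digits:

current_token_no_stop %>%
  filter(!str_detect(word, "^[0-9]+$")) %>%  # drop digit-only tokens
  count(word, sort = TRUE)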
current_token_no_stop_count <- current_token_no_stop %>%
  count(word, sort = TRUE)
ggplot(current_token_no_stop_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

current_token_no_stop_count %>% 
  with(wordcloud(word, n, random.order = FALSE, max.words = 50))

There is still a large concentration of words with low counts, but removing the stop words has made the resulting frequency table more informative.

Twitter Summary

First, we use the tidytext package to tokenize all the words in the Twitter dataset.

current_token <- twitter_data %>%
  unnest_tokens(word, text)

There are 2,360,148 lines of text and 30,093,372 words in this dataset. Next, we present a brief table of the most frequently occurring words.

current_token %>%
  count(word, sort = TRUE)
## # A tibble: 370,388 x 2
##    word       n
##    <chr>  <int>
##  1 the   937405
##  2 to    788645
##  3 i     723447
##  4 a     611358
##  5 you   548089
##  6 and   438538
##  7 for   385348
##  8 in    380376
##  9 of    359635
## 10 is    358775
## # … with 370,378 more rows

Clearly, the most frequent words are stop words (e.g. "the", "to", "i", "a"). Plotting a histogram of the word counts confirms how heavily these stop words skew the distribution.

current_token_count <- current_token %>%
  count(word, sort = TRUE)
ggplot(current_token_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As such, we'll use an anti-join to remove the stop words, then display the top words by count along with a histogram of the results.

current_token_no_stop <- current_token %>%
  anti_join(stop_words)
## Joining, by = "word"
current_token_no_stop %>%
  count(word, sort = TRUE)
## # A tibble: 369,663 x 2
##    word        n
##    <chr>   <int>
##  1 love   106721
##  2 day     91710
##  3 rt      89537
##  4 time    76794
##  5 lol     70133
##  6 3       54940
##  7 people  52040
##  8 happy   48998
##  9 follow  48104
## 10 2       45515
## # … with 369,653 more rows
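
Several of the remaining top tokens ("rt", "lol", "follow") are Twitter conventions rather than ordinary vocabulary. If desired, they could be removed with a small custom stop-word list; the sketch below is illustrative only, and its token list is not exhaustive.

twitter_stop <- tibble(word = c("rt", "lol", "follow"))  # illustrative custom stop words
current_token_no_stop %>%
  anti_join(twitter_stop, by = "word") %>%
  count(word, sort = TRUE)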
current_token_no_stop_count <- current_token_no_stop %>%
  count(word, sort = TRUE)
ggplot(current_token_no_stop_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

current_token_no_stop_count %>% 
  with(wordcloud(word, n, random.order = FALSE, max.words = 50))

There is still a large concentration of words with low counts, but removing the stop words has made the resulting frequency table more informative.