This report presents an exploratory analysis of the English-language text files in the SwiftKey dataset.
library(tidyverse)   # readr, dplyr, tibble, ggplot2
library(tidytext)    # unnest_tokens(), stop_words
library(wordcloud)   # wordcloud()
blogs_data <- tibble(text = read_lines('en_US/en_US.blogs.txt'))
news_data <- tibble(text = read_lines('en_US/en_US.news.txt'))
twitter_data <- tibble(text = read_lines('en_US/en_US.twitter.txt'))
First, we use the tidytext package to tokenize the blogs dataset into individual words.
current_token <- blogs_data %>%
unnest_tokens(word, text)
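The line and word counts quoted below (and the corresponding figures for the news and twitter datasets later on) can be read directly off these tibbles: the raw tibble has one row per line of the source file, and the tokenized tibble has one row per word. A quick sketch using the objects defined above:
nrow(blogs_data)                # lines of text in the source file
nrow(current_token)             # total word tokens after unnest_tokens()
n_distinct(current_token$word)  # distinct words, i.e. the row count of the frequency table below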
There are 899,288 lines of text and 37,546,239 words in this dataset. Next we present a brief table of the most frequently occurring words.
current_token %>%
count(word, sort = TRUE)
## # A tibble: 320,003 x 2
## word n
## <chr> <int>
## 1 the 1860156
## 2 and 1094401
## 3 to 1069440
## 4 a 900362
## 5 of 876799
## 6 i 775032
## 7 in 598532
## 8 that 460782
## 9 is 432712
## 10 it 403902
## # … with 319,993 more rows
Unsurprisingly, the most frequent words are stop words (e.g. "the", "and", "to", "a"). Plotting a histogram of the word counts underlines the problem: a handful of very frequent stop words sit far out in the right tail, while the vast majority of words occur only a few times.
current_token_count <- current_token %>%
count(word, sort = TRUE)
ggplot(current_token_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
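The stat_bin() message is only a reminder that the default of 30 bins was used. Because the counts span several orders of magnitude, one option (an assumption on our part, not a change to the analysis above) is to set the bin count explicitly and put the x axis on a log scale:
ggplot(current_token_count, aes(x = n)) +
  geom_histogram(bins = 50) +  # explicit bin count silences the stat_bin() message
  scale_x_log10()              # counts span several orders of magnitude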
As such, we use an anti-join against the tidytext stop_words lexicon to remove the stop words, and then display the top words by count along with a histogram of the resulting counts.
current_token_no_stop <- current_token %>%
anti_join(stop_words)
## Joining, by = "word"
current_token_no_stop %>%
count(word, sort = TRUE)
## # A tibble: 319,278 x 2
## word n
## <chr> <int>
## 1 time 90918
## 2 people 59574
## 3 day 52372
## 4 love 45230
## 5 life 41251
## 6 it’s 38657
## 7 1 30907
## 8 2 29561
## 9 world 29305
## 10 i’m 29189
## # … with 319,268 more rows
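Contractions such as "it’s" and "i’m" survive the stop-word removal, most likely because the source text uses curly apostrophes (’) while the stop_words lexicon uses straight ones ('). A possible fix, sketched under that assumption (current_token_no_stop_fixed is a new name introduced here for illustration), is to normalise the apostrophes before the anti-join:
current_token_no_stop_fixed <- current_token %>%
  mutate(word = str_replace_all(word, "\u2019", "'")) %>%  # curly to straight apostrophes
  anti_join(stop_words, by = "word")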
current_token_no_stop_count <- current_token_no_stop %>%
count(word, sort = TRUE)
ggplot(current_token_no_stop_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
current_token_no_stop_count %>%
with(wordcloud(word, n, random.order = FALSE, max.words = 50))  # 50 most frequent words, largest in the centre
There is still a large concentration of words with low counts, but removing the stop words has made the resulting frequency table more informative.
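The concentration of low-count words can be quantified directly from the count table; a quick check of how much of the vocabulary is rare:
mean(current_token_no_stop_count$n == 1)  # share of words that appear exactly once
mean(current_token_no_stop_count$n <= 5)  # share of words that appear five times or fewer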
Next, we repeat the process for the news dataset, again using tidytext to tokenize the text into individual words.
current_token <- news_data %>%
unnest_tokens(word, text)
There are 1,010,242 lines of text and 34,762,395 words in this dataset. Next we present a brief table of the most frequently occurring words.
current_token %>%
count(word, sort = TRUE)
## # A tibble: 284,533 x 2
## word n
## <chr> <int>
## 1 the 1974366
## 2 to 906145
## 3 and 889511
## 4 a 878035
## 5 of 774502
## 6 in 679065
## 7 for 353901
## 8 that 347079
## 9 is 284240
## 10 on 269881
## # … with 284,523 more rows
Unsurprisingly, the most frequent words are stop words (e.g. "the", "and", "to", "a"). Plotting a histogram of the word counts underlines the problem: a handful of very frequent stop words sit far out in the right tail, while the vast majority of words occur only a few times.
current_token_count <- current_token %>%
count(word, sort = TRUE)
ggplot(current_token_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As such, we use an anti-join against the tidytext stop_words lexicon to remove the stop words, and then display the top words by count along with a histogram of the resulting counts.
current_token_no_stop <- current_token %>%
anti_join(stop_words)
## Joining, by = "word"
current_token_no_stop %>%
count(word, sort = TRUE)
## # A tibble: 283,812 x 2
## word n
## <chr> <int>
## 1 time 57062
## 2 people 47666
## 3 city 37953
## 4 1 37292
## 5 school 35498
## 6 game 34949
## 7 percent 34690
## 8 day 31901
## 9 2 31784
## 10 million 30914
## # … with 283,802 more rows
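Bare digits such as "1" and "2" also rank highly once the stop words are removed. If numeric tokens are not wanted in the frequency table, one option (not part of the original analysis) is to filter them out before counting:
current_token_no_stop %>%
  filter(!str_detect(word, "^[0-9]+$")) %>%  # drop tokens made up only of digits
  count(word, sort = TRUE)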
current_token_no_stop_count <- current_token_no_stop %>%
count(word, sort = TRUE)
ggplot(current_token_no_stop_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
current_token_no_stop_count %>%
with(wordcloud(word, n, random.order = FALSE, max.words = 50))
There is still a large concentration of words with low counts, but removing the stop words has made the resulting frequency table more informative.
Finally, we repeat the process for the twitter dataset, tokenizing the text into individual words with tidytext.
current_token <- twitter_data %>%
unnest_tokens(word, text)
There are 2,360,148 lines of text and 30,093,372 words in this dataset. Next we present a brief table of the most frequently occurring words.
current_token %>%
count(word, sort = TRUE)
## # A tibble: 370,388 x 2
## word n
## <chr> <int>
## 1 the 937405
## 2 to 788645
## 3 i 723447
## 4 a 611358
## 5 you 548089
## 6 and 438538
## 7 for 385348
## 8 in 380376
## 9 of 359635
## 10 is 358775
## # … with 370,378 more rows
Unsurprisingly, the most frequent words are stop words (e.g. "the", "and", "to", "a"). Plotting a histogram of the word counts underlines the problem: a handful of very frequent stop words sit far out in the right tail, while the vast majority of words occur only a few times.
current_token_count <- current_token %>%
count(word, sort = TRUE)
ggplot(current_token_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As such, we use an anti-join against the tidytext stop_words lexicon to remove the stop words, and then display the top words by count along with a histogram of the resulting counts.
current_token_no_stop <- current_token %>%
anti_join(stop_words)
## Joining, by = "word"
current_token_no_stop %>%
count(word, sort = TRUE)
## # A tibble: 369,663 x 2
## word n
## <chr> <int>
## 1 love 106721
## 2 day 91710
## 3 rt 89537
## 4 time 76794
## 5 lol 70133
## 6 3 54940
## 7 people 52040
## 8 happy 48998
## 9 follow 48104
## 10 2 45515
## # … with 369,653 more rows
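Twitter-specific tokens such as "rt" (the retweet marker) remain near the top of the list even after standard stop-word removal. If desired, the stop_words lexicon can be extended with a small custom list; the token chosen below and the name current_token_custom are illustrative assumptions:
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = "rt", lexicon = "custom")  # add the retweet marker as a custom stop word
)
current_token_custom <- current_token %>%
  anti_join(custom_stop_words, by = "word")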
current_token_no_stop_count <- current_token_no_stop %>%
count(word, sort = TRUE)
ggplot(current_token_no_stop_count, aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
current_token_no_stop_count %>%
with(wordcloud(word, n, random.order = FALSE, max.words = 50))
There is still a large concentration of words with low counts, but removing the stop words has made the resulting frequency table more informative.