In this section we read in the three files and summarize the number of characters per line in each of them.
# open the connection in binary mode so embedded control characters do not truncate the read
news <- file("en_US.news.txt", open="rb")
news_lines <- readLines(news)
close(news)
summary(nchar(news_lines))
##     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##      1.0   110.0   185.0   201.2   268.0 11384.0
blogs <- file("en_US.blogs.txt", open="rb")
blog_lines <- readLines(blogs)
close(blogs)
summary(nchar(blog_lines))
##     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##        1      47     156     230     329   40833
twitter <- file("en_US.twitter.txt", open="rb")
twitter_lines <- readLines(twitter)
close(twitter)
summary(nchar(twitter_lines))
##     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     2.00   37.00   64.00   68.68  100.00  140.00
We see that tweets are limited to 140 characters, while news lines can exceed 11 thousand characters and blog lines even reach more than 40 thousand characters.
To keep memory usage manageable, we now remove the full datasets from memory and work with a random sample of 2,000 lines from each file.
rm(twitter_lines, blog_lines, news_lines)
sample_twitter <- sample(readLines("en_US.twitter.txt", skipNul = TRUE), 2000)
sample_blogs <- sample(readLines("en_US.blogs.txt", skipNul = TRUE), 2000)
sample_news <- sample(readLines("en_US.news.txt", skipNul = TRUE), 2000)
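The tokenize_n() helper used below is not a base-R or standard package function; a minimal base-R sketch of what such a helper might look like (the word-splitting rule here is an assumption, not necessarily the implementation used for this report) is:
tokenize_n <- function(text, n = 1) {
  # split the text into lowercase word tokens, dropping punctuation and empty strings
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]
  if (n == 1) return(words)
  # paste every run of n consecutive words into a single n-gram
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}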
We tokenize each dataset separately:
twitter_1 <- tokenize_n(sample_twitter, n=1)
twitter_2 <- tokenize_n(sample_twitter, n=2)
twitter_3 <- tokenize_n(sample_twitter, n=3)
news_1 <- tokenize_n(sample_news, n=1)
news_2 <- tokenize_n(sample_news, n=2)
news_3 <- tokenize_n(sample_news, n=3)
blogs_1 <- tokenize_n(sample_blogs, n=1)
blogs_2 <- tokenize_n(sample_blogs, n=2)
blogs_3 <- tokenize_n(sample_blogs, n=3)
sample_all <- c(sample_twitter, sample_news, sample_blogs)
all_1 <- tokenize_n(sample_all, n=1)
all_2 <- tokenize_n(sample_all, n=2)
all_3 <- tokenize_n(sample_all, n=3)
rm(sample_all, sample_twitter, sample_blogs, sample_news)
The following plot shows the 10 most common words for the aggregated data and for each of the individual datasets.
Next, we check the top 10 2-grams for the aggregated data.
Then we check the top 10 3-grams for the aggregated data.
We do not see anything surprising in these top-10 lists.
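For reference, the counts behind these plots can be obtained by tabulating the token vectors; a minimal sketch for the aggregated 1-grams, using ggplot2 (the helper and plot below are illustrative, not necessarily the code that produced the figures):
library(ggplot2)
top_k <- function(tokens, k = 10) {
  # count each distinct n-gram and keep the k most frequent
  counts <- sort(table(tokens), decreasing = TRUE)[1:k]
  data.frame(ngram = names(counts), freq = as.integer(counts))
}
ggplot(top_k(all_1), aes(x = reorder(ngram, freq), y = freq)) +
  geom_col() + coord_flip() +
  labs(x = NULL, y = "frequency", title = "Top 10 1-grams, aggregated sample")
The same call with all_2 or all_3 gives the 2-gram and 3-gram counts.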
The Shiny app will predict the next word following a series of words using the following algorithm: based on the sampled data, I calculate how frequently the given series of words appears and then the frequency of each word that follows it. The app will suggest the candidate with the highest estimated probability.
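A minimal sketch of this lookup, assuming the n-gram vectors built above (the function name and the example prefix are illustrative):
predict_next_word <- function(prefix, ngrams) {
  # ngrams: a character vector of n-grams one word longer than the prefix
  prefix <- tolower(trimws(prefix))
  # keep the n-grams that start with the prefix followed by a space
  matches <- ngrams[startsWith(ngrams, paste0(prefix, " "))]
  if (length(matches) == 0) return(NA_character_)
  # the candidate next word is whatever follows the prefix in each match
  candidates <- substring(matches, nchar(prefix) + 2)
  # the relative frequency of each candidate estimates P(next word | prefix)
  probs <- sort(table(candidates) / length(matches), decreasing = TRUE)
  names(probs)[1]
}
predict_next_word("thanks for", all_3)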