Reading in the datasets and some preliminary statistics

In this section we read in the files and summarize the number of characters per line for each of the three files.

news <- file("en_US.news.txt", open="rb")
news_lines <- readLines(news)
close(news)
summary(nchar(news_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.2   268.0 11384.0
blogs <- file("en_US.blogs.txt", open="rb")
blog_lines <- readLines(blogs)
close(blogs)
summary(nchar(blog_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40833
twitter <- file("en_US.twitter.txt", open="rb")
twitter_lines <- readLines(twitter)
close(twitter)
summary(nchar(twitter_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00

We see that tweets are limited to 140 characters, while news lines can exceed 11,000 characters and blog lines can even exceed 40,000 characters.

Now we remove the original line vectors from memory and work with a random sample of 2,000 lines from each file to keep memory usage manageable.

rm(twitter_lines, blog_lines, news_lines)
sample_twitter <- sample(readLines("en_US.twitter.txt", skipNul = TRUE), 2000)
sample_blogs <- sample(readLines("en_US.blogs.txt", skipNul = TRUE), 2000)
sample_news <- sample(readLines("en_US.news.txt", skipNul = TRUE), 2000)

We tokenize each dataset separately into unigrams, bigrams, and trigrams with a helper function tokenize_n().
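
The tokenize_n() helper is not defined in this report; a minimal sketch, assuming the tokenizers package, could look like this (the argument names and the separate n == 1 branch are assumptions):

tokenize_n <- function(text, n = 1) {
  # Assumed helper: lowercase the text and split it into word n-grams,
  # flattening the per-line results into a single character vector.
  if (n == 1) {
    unlist(tokenizers::tokenize_words(text, lowercase = TRUE))
  } else {
    unlist(tokenizers::tokenize_ngrams(text, n = n, lowercase = TRUE))
  }
}

With such a helper in place, the 1-, 2-, and 3-gram tokens are built for each source and for the combined sample: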

twitter_1 <- tokenize_n(sample_twitter, n=1)
twitter_2 <- tokenize_n(sample_twitter, n=2)
twitter_3 <- tokenize_n(sample_twitter, n=3)

news_1 <- tokenize_n(sample_news, n=1)
news_2 <- tokenize_n(sample_news, n=2)
news_3 <- tokenize_n(sample_news, n=3)

blogs_1 <- tokenize_n(sample_blogs, n=1)
blogs_2 <- tokenize_n(sample_blogs, n=2)
blogs_3 <- tokenize_n(sample_blogs, n=3)

sample_all <- c(sample_twitter, sample_news, sample_blogs)

all_1 <- tokenize_n(sample_all, n=1)
all_2 <- tokenize_n(sample_all, n=2)
all_3 <- tokenize_n(sample_all, n=3)

rm(sample_all, sample_twitter, sample_blogs, sample_news)

Some basic plots

The following plot shows the 10 most common words for each dataset and for the combined data.
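
The plotting code is not shown in this report; one possible sketch, assuming base graphics and the token vectors built above, is:

top_words <- function(tokens, k = 10) {
  # Count token frequencies and keep the k most common ones.
  sort(table(tokens), decreasing = TRUE)[1:k]
}
par(mfrow = c(2, 2), las = 2)
barplot(top_words(all_1),     main = "All sources")
barplot(top_words(twitter_1), main = "Twitter")
barplot(top_words(news_1),    main = "News")
barplot(top_words(blogs_1),   main = "Blogs")
par(mfrow = c(1, 1))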

Next we check the top 10 2-grams for the aggregate data.
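
A quick way to inspect these counts (a sketch; the original code is not shown):

head(sort(table(all_2), decreasing = TRUE), 10)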

Finally we check the top 10 3-grams for the aggregate data.
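
And analogously for the trigrams (again a sketch):

head(sort(table(all_3), decreasing = TRUE), 10)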

Discussion and plan for the Shiny app

We do not see any surprises among the top 10 n-grams.

The Shiny app will predict the next word after a series of words using the following algorithm: based on the sampled data, I will count how frequently the given series of words appears, and then count the frequencies of the words that follow it. The app will suggest the word with the highest estimated probability.
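
A minimal sketch of this lookup on the 3-gram tokens built above (the function name predict_next and its details are assumptions, not the final implementation):

predict_next <- function(prefix, trigrams) {
  # prefix: the two preceding words, e.g. "thanks for";
  # trigrams: a character vector of 3-grams such as all_3.
  prefix  <- tolower(trimws(prefix))
  matches <- trigrams[startsWith(trigrams, paste0(prefix, " "))]
  if (length(matches) == 0) return(NA_character_)
  # Pick the most frequent matching 3-gram and return its last word.
  best <- names(sort(table(matches), decreasing = TRUE))[1]
  tail(strsplit(best, " ")[[1]], 1)
}

predict_next("thanks for", all_3)

The final app will need a fallback (for example, backing off to 2-grams and then to the most frequent single words) when the given series of words does not appear in the data.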