Introduction

This exercise is about working with our data and doing some exploratory analysis of it, in order to prepare for building prediction models in the upcoming exercises and for the shiny app development. First the data is loaded and then stored in an RDS data set in order to clean up the environment and save some memory. The data is then pre-processed and visualised during the exploratory analysis. Since the processes in this project take up much of the memory, which decreases the efficiency of the code, only part of the data will be used. This is done by ‘sampling’ the data and using only part of it; in this exercise 25 % of each of the three data sets was used. The building of the N-gram models will be done using the ‘tidytext’ package, which is convenient. Other packages are also an option, but memory usage should definitely be taken into consideration.

# Loading Packages 
library(tm)
library(stringi)
library(SnowballC)
library(tidyverse)
library(tidytext)
library(wordcloud)
library(RWeka)

# Setting the seed
set.seed(3434)

Data loading

Since one of the limitations of this exercise is memory, only part of the data will be used. This is done by reading the whole of each of the three data sets ‘blogs’, ‘news’ and ‘twitter’, and then sampling 25 % of each for analysis.

# Blog text file
con <- file("final/en_US/en_US.blogs.txt")
linesInFile.blog <- readLines(con,encoding="UTF-8",skipNul = TRUE)
fileSize.blog <- format(object.size(linesInFile.blog),units = "MB")
fileNoOfLines.blog <- length(linesInFile.blog)
fileWords.blog <- sum(stri_count_words(linesInFile.blog))
close(con)
info.blog <- paste0("File size: ", fileSize.blog, " Lines in File: ", fileNoOfLines.blog, " Words in file: ", fileWords.blog)

# News text file
con <- file("final/en_US/en_US.news.txt.")
linesInFile.news <- readLines(con,encoding="UTF-8",skipNul = TRUE)
fileSize.news <- format(object.size(linesInFile.news),units = "MB")
fileNoOfLines.news <- length(linesInFile.news)
fileWords.news <- sum(stri_count_words(linesInFile.news))
close(con)
info.news <- paste0("File size: ", fileSize.news, " Lines in File: ", fileNoOfLines.news, " Words in file: ", fileWords.news)

# Twitter text file
con <- file("final/en_US/en_US.twitter.txt", "r")
linesInFile.twitter <- readLines(con,encoding="UTF-8",skipNul = TRUE)
fileSize.twitter <- format(object.size(linesInFile.twitter),units = "MB")
fileNoOfLines.twitter <- length(linesInFile.twitter)
fileWords.twitter <- sum(stri_count_words(linesInFile.twitter))
close(con)
info.twitter <- paste0("File size: ", fileSize.twitter, " Lines in File: ", fileNoOfLines.twitter, " Words in file: ", fileWords.twitter)

# Summary of data files info
info.blog
## [1] "File size: 255.4 Mb Lines in File: 899288 Words in file: 37546250"
info.news
## [1] "File size: 19.8 Mb Lines in File: 77259 Words in file: 2674536"
info.twitter
## [1] "File size: 319 Mb Lines in File: 2360148 Words in file: 30093413"
# Sampling from data
SampleData1 <- c(sample(linesInFile.blog, floor(length(linesInFile.blog) * 0.25)),
                sample(linesInFile.news, floor(length(linesInFile.news) * 0.25)),
                sample(linesInFile.twitter, floor(length(linesInFile.twitter) * 0.25))
)

# Saving data samples in RDS
saveRDS(SampleData1, "SampleUSData1.rds")

# Cleaning the environment 
rm(list=ls())

Preprocessing

After the environment is cleaned up and the data set sample is stored, the data is loaded again and prepared for the pre-processing, which includes:

- Punctuation removal
- Converting to lower case
- Removing numbers
- Removing stopwords
- Removing extra white spaces
- Word-stemming

Finally, the corpus is converted back to a plain text document.

# Reading data
data <- readRDS("SampleUSData1.rds")

# Creating the corpus
cor <- VCorpus(VectorSource(data))

# Remove punctuation
cor <- tm_map(cor, removePunctuation)

# Transform to lower case (content_transformer keeps the corpus structure intact)
cor <- tm_map(cor, content_transformer(tolower))

# Strip digits
cor <- tm_map(cor, removeNumbers)

# Remove stopwords
cor <- tm_map(cor, removeWords, stopwords("english"))

# remove whitespace
cor <- tm_map(cor, stripWhitespace)

# Stemming the docs
cor <- tm_map(cor, stemDocument)

# converting to plain text doc
cor <- tm_map(cor, PlainTextDocument)
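
To sanity-check the transformations before moving on, it helps to peek at one processed document (a quick check, not part of the pipeline itself):

# Print the content of the first processed document
writeLines(as.character(cor[[1]]))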

N-gram models using tidytext package

After the pre-processing, the following exploratory analysis is done using the tidytext package, which will feel familiar to those who are used to working with the ‘ %>% ’ pipe from the dplyr package. Part of the exploratory analysis is to look at the fancy wordcloud plot.

# Converting the corpus into a data frame. 
dfCor <- tidy(cor) %>% 
        select(text)
rm(cor)
# Creating a word-frequency data frame
dfCor1 <- dfCor %>% 
        unnest_tokens(output = word, input = text, drop = TRUE, format = "text" ) %>% 
        anti_join(stop_words) %>% 
        count(word, sort = TRUE)

# Taking a look at the word cloud
wordcloud(words = dfCor1$word, freq = dfCor1$n, min.freq = 1, max.words = 50,
          random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

# Uni-gram data preparation and looking at the top 20
unigram <- dfCor %>%
        unnest_tokens(output = word, input = text, drop = TRUE, format = "text") %>%
        count(word, sort = TRUE) %>%
        filter(word != "") %>% 
        arrange(desc(n)) %>% 
        slice(1:20)

# Bigram data preparation and looking at the top 20
bigram <- dfCor %>%
        unnest_tokens(word, text, token = "ngrams", n = 2) %>%
        separate(word, c("word1", "word2"), sep = " ") %>%
        filter(!word1 %in% stop_words$word) %>%
        filter(!word2 %in% stop_words$word) %>%
        unite(word, word1, word2, sep = " ", na.rm = TRUE) %>%
        count(word, sort = TRUE) %>%
        filter(word != "") %>% 
        arrange(desc(n)) %>% 
        slice(1:20)

# Trigram data preparation and looking at the top 20
trigram <- dfCor %>%
        unnest_tokens(word, text, token = "ngrams", n = 3,drop = T,format = "text",) %>%
        separate(word, c("word1", "word2","word3"), sep = " ") %>%
        filter(!word1 %in% stop_words$word) %>%
        filter(!word2 %in% stop_words$word) %>%
        filter(!word3 %in% stop_words$word) %>%
        unite(word, word1, word2, word3, sep = " ", na.rm = TRUE) %>%
        count(word, sort = TRUE) %>%
        filter(word != "") %>% 
        arrange(desc(n)) %>% 
        slice(1:20)

Visualisation of N-grams

The following plots show the top 20 unigrams, bigrams and trigrams:

p1 <- ggplot(unigram) + 
        geom_bar(aes(x= reorder(word, n),y=n), stat = "identity", fill = "#de5833") +
        theme_minimal() +
        coord_flip() +
        labs(title = "Top 20 unigrams",
             subtitle = "using Tidytext in R",
             caption = "Data Source:")
p1

p2 <- ggplot(bigram) + 
        geom_bar(aes(x= reorder(word, n),y=n), stat = "identity", fill = "#de5833") +
        theme_minimal() +
        coord_flip() +
        labs(title = "Top 20 Bigrams",
             subtitle = "Using tidytext in R",
             caption = "Data Source: ")
p2

p3 <- ggplot(trigram) + 
        geom_bar(aes(x= reorder(word, n),y=n), stat = "identity", fill = "#de5833") +
        theme_minimal() +
        coord_flip() +
        labs(title = "Top 20 Trigrams",
             subtitle = "using Tidytext in R",
             caption = "Data Source:")
p3

Further analysis

The next step is to build predictive models using Katz' n-gram back-off model. The goal here is to build a model that is both accurate and efficient. Another option would be to make the current tokenization process more efficient by exploring other packages.
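
As a rough illustration of the back-off idea, here is a minimal sketch of a next-word lookup. It implements the simpler ‘stupid back-off’ scheme (fall through to lower-order n-grams on a miss, using raw counts) rather than Katz' discounted probabilities, and it assumes full, untruncated frequency tables tri_freq and bi_freq shaped like the tables built above (a word column of space-separated n-grams and a count column n); both table names are placeholders.

library(dplyr)
library(stringr)

# Sketch of a stupid back-off next-word lookup. tri_freq and bi_freq are
# assumed to be full n-gram count tables (columns: word, n), not the
# top-20 slices used for plotting above.
predict_next <- function(input, tri_freq, bi_freq, top = 3) {
        tokens <- str_split(str_to_lower(input), "\\s+")[[1]]
        k <- length(tokens)
        if (k >= 2) {
                # Try the trigram table first: match on the last two words
                prefix <- paste(tokens[k - 1], tokens[k])
                hits <- tri_freq %>%
                        filter(str_starts(word, fixed(paste0(prefix, " ")))) %>%
                        arrange(desc(n)) %>%
                        slice(1:top)
                if (nrow(hits) > 0) return(word(hits$word, -1))
        }
        # Back off to the bigram table: match on the last word only
        hits <- bi_freq %>%
                filter(str_starts(word, fixed(paste0(tokens[k], " ")))) %>%
                arrange(desc(n)) %>%
                slice(1:top)
        word(hits$word, -1)
}

# Hypothetical usage: predict_next("thanks for the", tri_freq, bi_freq)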

The shiny app is meant to be intuitive and easy to use, with few steps and clear output for the user. The idea is that the user would be able to type a few words and the application should suggest an upcoming word.
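
A minimal sketch of what such an app could look like, assuming a prediction function like the hypothetical predict_next() above and the n-gram tables already loaded into the session:

library(shiny)

# Minimal sketch of the planned app: the user types a phrase and the app
# shows suggested next words. predict_next, tri_freq and bi_freq are the
# hypothetical objects sketched above, assumed to be available here.
ui <- fluidPage(
        titlePanel("Next-word suggestion"),
        textInput("phrase", "Type a few words:"),
        textOutput("suggestion")
)

server <- function(input, output) {
        output$suggestion <- renderText({
                req(input$phrase)
                paste(predict_next(input$phrase, tri_freq, bi_freq),
                      collapse = " | ")
        })
}

shinyApp(ui = ui, server = server)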