This exercise is about working with the data and doing some exploratory analysis in order to prepare for building the prediction models in the upcoming exercises and for the Shiny app development. First the data is loaded and then stored in an RDS file in order to clean up the environment and save some memory. The data is then pre-processed and visualised during the exploratory analysis. Since the processing in this project uses a lot of memory, which decreases the efficiency of the code, only part of the data will be used. This is done by ‘sampling’ the data and using only part of it; in this exercise 25 % of each of the three data sets was used. The n-gram models are built with the ‘tidytext’ package, which is convenient; other packages are also an option, but memory usage should definitely be taken into consideration.
# Loading Packages
library(tm)
library(stringi)
library(SnowballC)
library(tidyverse)
library(tidytext)
library(wordcloud)
library(RWeka)
# Setting the seed
set.seed(3434)
Since one of the limitations of this exercise is memory, only part of the data will be used. This is done by reading the whole of the three data sets ‘blogs’, ‘news’ and ‘Twitter’, and then sampling 25 % of the data for analysis.
# Blog text file
con <- file("final/en_US/en_US.blogs.txt")
linesInFile.blog <- readLines(con,encoding="UTF-8",skipNul = TRUE)
fileSize.blog <- format(object.size(linesInFile.blog),units = "Mb")
fileNoOfLines.blog <- length(linesInFile.blog)
fileWords.blog <- sum(stri_count_words(linesInFile.blog))
close(con)
info.blog <- paste0("File size: ", fileSize.blog, " Lines in File: ", fileNoOfLines.blog, " Words in file: ", fileWords.blog)
# News text file
con <- file("final/en_US/en_US.news.txt.")
linesInFile.news <- readLines(con,encoding="UTF-8",skipNul = TRUE)
fileSize.news <- format(object.size(linesInFile.news),units = "Mb")
fileNoOfLines.news <- length(linesInFile.news)
fileWords.news <- sum(stri_count_words(linesInFile.news))
close(con)
info.news <- paste0("File size: ", fileSize.news, " Lines in File: ", fileNoOfLines.news, " Words in file: ", fileWords.news)
# Twitter text file
con <- file("final/en_US/en_US.twitter.txt", "r")
linesInFile.twitter <- readLines(con,encoding="UTF-8",skipNul = TRUE)
fileSize.twitter <- format(object.size(linesInFile.twitter),units = "Mb")
fileNoOfLines.twitter <- length(linesInFile.twitter)
fileWords.twitter <- sum(stri_count_words(linesInFile.twitter))
close(con)
info.twitter <- paste0("File size: ", fileSize.twitter, " Lines in File: ", fileNoOfLines.twitter, " Words in file: ", fileWords.twitter)
# Summary of data files info
info.blog
## [1] "File size: 255.4 Mb Lines in File: 899288 Words in file: 37546250"
info.news
## [1] "File size: 19.8 Mb Lines in File: 77259 Words in file: 2674536"
info.twitter
## [1] "File size: 319 Mb Lines in File: 2360148 Words in file: 30093413"
# Sampling from data
SampleData1 <- c(sample(linesInFile.blog,length(linesInFile.blog)*0.25),
sample(linesInFile.news,length(linesInFile.news)*0.25),
sample(linesInFile.twitter,length(linesInFile.twitter)*0.25)
)
# Saving data samples in RDS
saveRDS(SampleData1, "SampleUSData1.rds")
# Cleaning the environment
rm(list=ls())
After sampling, the data is read back in and pre-processed with the tm package: punctuation, digits, English stop words and extra whitespace are removed, the text is converted to lower case (wrapped in content_transformer so the documents stay plain-text documents), and the words are stemmed.
# Reading data
data <- readRDS("SampleUSData1.rds")
# Creating the corpus
cor <- VCorpus(VectorSource(data))
# Remove punctuation
cor <- tm_map(cor, removePunctuation)
# Transform to lower case (content_transformer keeps the documents as plain-text documents)
cor <- tm_map(cor, content_transformer(tolower))
# Strip digits
cor <- tm_map(cor, removeNumbers)
# Remove stopwords
cor <- tm_map(cor, removeWords, stopwords("english"))
# remove whitespace
cor <- tm_map(cor, stripWhitespace)
# Stemming the documents
cor <- tm_map(cor, stemDocument)
After pre-processing, the exploratory analysis is carried out with the tidytext package, which will feel familiar to those used to working with the ‘ %>% ’ pipe from the dplyr package. Part of the exploratory analysis is to look at a word cloud of the most frequent words.
# Converting the corpus into a data frame.
dfCor <- tidy(cor) %>%
select(text)
rm(cor)
# Tokenising into single words and counting word frequencies
dfCor1 <- dfCor %>%
unnest_tokens(output = word, input = text, drop = TRUE, format = "text" ) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
# Taking a look at the word cloud
wordcloud(words = dfCor1$word, freq = dfCor1$n, min.freq = 1, max.words = 50,
random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
# Uni-gram data preparation and looking at the top 20
unigram <- dfCor %>%
unnest_tokens(output = word, input = text, drop = TRUE, format = "text") %>%
count(word, sort = TRUE) %>%
filter(word != "") %>%
arrange(desc(n)) %>%
slice(1:20)
# Bigram data preparation and looking at the top 20
bigram <- dfCor %>%
unnest_tokens(word, text, token = "ngrams", n = 2) %>%
separate(word, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
unite(word,word1, word2, sep = " ",na.rm = TRUE) %>%
count(word, sort = TRUE) %>%
filter(word != "") %>%
arrange(desc(n)) %>%
slice(1:20)
# Trigram data preparation and looking at the top 20
trigram <- dfCor %>%
unnest_tokens(word, text, token = "ngrams", n = 3, drop = TRUE, format = "text") %>%
separate(word, c("word1", "word2","word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word) %>%
unite(word,word1, word2,word3, sep = " ", na.rm = TRUE) %>%
count(word, sort = TRUE) %>%
filter(word != "") %>%
arrange(desc(n)) %>%
slice(1:20)
The following plots show the top 20 unigrams, bigrams and trigrams:
p1 <- ggplot(unigram) +
geom_bar(aes(x= reorder(word, n),y=n), stat = "identity", fill = "#de5833") +
theme_minimal() +
coord_flip() +
labs(title = "Top 20 unigrams",
subtitle = "using Tidytext in R",
caption = "Data Source:")
p1
p2 <- ggplot(bigram) +
geom_bar(aes(x= reorder(word, n),y=n), stat = "identity", fill = "#de5833") +
theme_minimal() +
coord_flip() +
labs(title = "Top 20 Bigrams",
subtitle = "Using Tidytext in R",
caption = "Data Source: ")
p2
p3 <- ggplot(trigram) +
geom_bar(aes(x= reorder(word, n),y=n), stat = "identity", fill = "#de5833") +
theme_minimal() +
coord_flip() +
labs(title = "Top 20 Trigrams",
subtitle = "using Tidytext in R",
caption = "Data Source:")
p3
The next step is to build a predictive model using Katz’ n-gram back-off model. The goal is to build a model that is both accurate and memory-efficient. Another option would be to make the current tokenization process more efficient by exploring other packages.
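Below is a minimal sketch of the back-off idea, not the final model. It assumes hypothetical un-sliced frequency tables unigram.freq (columns word, n), bigram.freq (columns word1, word2, n) and trigram.freq (columns word1, word2, word3, n), built along the same lines as the tables above but kept before the slice(1:20) step and with the n-gram left split into separate word columns. It also simplifies Katz’ model to a plain back-off without discounting.
# A simplified back-off sketch (assumed table layouts, no Katz discounting)
library(dplyr)
library(stringr)
predict_next <- function(input, trigram.freq, bigram.freq, unigram.freq, n.suggestions = 3) {
  words <- str_split(str_to_lower(str_squish(input)), " ")[[1]]
  len <- length(words)
  # Try the trigram table first: match the last two typed words
  if (len >= 2) {
    hits <- trigram.freq %>%
      filter(word1 == words[len - 1], word2 == words[len]) %>%
      arrange(desc(n)) %>%
      slice(1:n.suggestions)
    if (nrow(hits) > 0) return(hits$word3)
  }
  # Back off to the bigram table: match only the last typed word
  if (len >= 1) {
    hits <- bigram.freq %>%
      filter(word1 == words[len]) %>%
      arrange(desc(n)) %>%
      slice(1:n.suggestions)
    if (nrow(hits) > 0) return(hits$word2)
  }
  # Final fallback: the most frequent single words overall
  unigram.freq %>% arrange(desc(n)) %>% slice(1:n.suggestions) %>% pull(word)
}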
The Shiny app is intended to be intuitive and easy to use, with few steps and clear output for the user. The idea is that the user writes a few words and the application suggests the next word.
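A minimal sketch of what such an app could look like, assuming the predict_next() function and the frequency tables from the sketch above are available when the app starts (all names are placeholders, not the final app):
# Sketch of a simple Shiny UI and server (placeholder names, assumes predict_next() exists)
library(shiny)
ui <- fluidPage(
  titlePanel("Next-word suggestion"),
  textInput("user_text", "Type a few words:"),
  h4("Suggested next words:"),
  textOutput("suggestions")
)
server <- function(input, output) {
  output$suggestions <- renderText({
    req(input$user_text)
    # The n-gram tables would be loaded once, e.g. from RDS files, when the app starts
    paste(predict_next(input$user_text, trigram.freq, bigram.freq, unigram.freq),
          collapse = ", ")
  })
}
shinyApp(ui = ui, server = server)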