This is the final project in the Coursera Data Science Specialization from Johns Hopkins University (https://www.coursera.org/learn/data-science-project). The objective of week 2 is to perform an exploratory analysis of real data from three different data sets and then outline a text prediction model that predicts the next word from a user's input. The report is separated into two tasks.
First: perform a thorough analysis of the real data and find the most common unigrams, bigrams and trigrams.
Second: based on the information obtained in task one, outline a text prediction model that will serve as the basis for next week's objective.
The data set comes from SwiftKey, which partners with Coursera on this capstone project. The original data can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The first step is to download the data from the provided link and load the libraries needed for the next steps. The data comes from three different data sets: one from Twitter, another from blogs and the last one from news.
suppressPackageStartupMessages({
library(tm)
#library(data.table)
library(tidyr)
library(dplyr)
library(tidytext)
library(stringi)
library(ggplot2)
})
#Read and store data from working directory
leo_twitter_raw <-readLines("en_US.twitter.txt",skipNul = TRUE,warn = FALSE) #Default is encoding = "unknown": strings are assumed to be in the native encoding
leo_blog_raw <-readLines("en_US.blogs.txt",skipNul = TRUE,warn = FALSE)
leo_news_raw <-readLines("en_US.news.txt",skipNul = TRUE, warn = FALSE)
The next step in the exploratory analysis is to find out how much information each data set holds and how it is composed: for example, the number of lines, the file size, the number of words per file and the maximum number of characters per line.
#Count lines in each text
nlines_twitter <- length(leo_twitter_raw)
nlines_blog <- length(leo_blog_raw)
nlines_news <- length(leo_news_raw)
#File size
size_blogs_file <- file.size("en_US.blogs.txt")/ 1024^2 #convert to MB
size_news_file <- file.size("en_US.news.txt")/ 1024^2
size_twitter_file <- file.size("en_US.twitter.txt")/ 1024^2
#Number of words in each file
leo_twitter_wordcount <- sum(stri_count_words(leo_twitter_raw))
leo_blog_wordcount <- sum(stri_count_words(leo_blog_raw))
leo_news_wordcount <- sum(stri_count_words(leo_news_raw))
#Max length of a line
max_char_twitter <- max(nchar(leo_twitter_raw))
max_char_blog <- max(nchar(leo_blog_raw))
max_char_news <- max(nchar(leo_news_raw))
The previous code is summarized in a data frame (code not shown here). Each of the three original files is larger than 150 MB. The Twitter file contains the most lines, while the blogs file is the largest and has the most words and the longest line.
##      Data File.size.Mb Line.Count Word.count Max.char.line
## 1 twitter     159.3641    2360148   30218166           213
## 2   blogs     200.4242     899288   38154238         40835
## 3    news     196.2775      77259    2693898          5760
The following plots show the distribution of the number of words per line in each data set.
Because the data sets are large and text mining is computationally expensive, a sample is taken from each of the original data sets (blogs, news and Twitter). I consider that a useful model can be achieved with a 15% sample. However, as the model will be built in the following weeks, I will build a first model and check whether it performs well with this sample. Before the three data sets are combined, all non-ASCII characters are removed.
leo_twitter_raw2 <- iconv(leo_twitter_raw, "latin1", "ASCII", sub="")
leo_blog_raw2 <- iconv(leo_blog_raw, "latin1", "ASCII", sub="")
leo_news_raw2 <- iconv(leo_news_raw, "latin1", "ASCII", sub="")
set.seed(54321) #Set seed for reproducibility
leo_twitter_sample<- sample(leo_twitter_raw2,length(leo_twitter_raw2)*0.15)
leo_blog_sample<- sample(leo_blog_raw2,length(leo_blog_raw2)*0.15)
leo_news_sample<- sample(leo_news_raw2,length(leo_news_raw2)*0.15)
rm(leo_twitter_raw2, leo_blog_raw2, leo_news_raw2,leo_twitter_raw,
leo_blog_raw,leo_news_raw, summary_plot) #Clean working space from unnecessary data
The next step is cleaning and tokenizing the data. Tokenization is the process of splitting strings into their component words. For example, the string “The house is white” is divided into “the”, “house”, “is” and “white”, and each word becomes one row of a data frame. For tokenization I will use “tidytext” (https://www.tidytextmining.com/), which works well with other common packages such as ggplot2, tibble, dplyr and more.
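As a rough base-R illustration of what tokenization produces, the example sentence can be split as follows. This is only a sketch: `tokenize_simple()` is a made-up helper for this illustration and is not part of tidytext, whose unnest_tokens() handles many more cases.

```r
# Minimal base-R sketch of word tokenization.
# tokenize_simple() is a hypothetical helper, not a tidytext function.
tokenize_simple <- function(text) {
  # lower-case, then split on anything that is not a letter or an apostrophe
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]                       # drop empty strings left by the split
  data.frame(word = words, stringsAsFactors = FALSE)  # one word per row
}

tokens <- tokenize_simple("The house is white")
print(tokens)
```

Each row of `tokens` holds one word, which is the shape that the later counting steps expect.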
Because the objective of the project is to create a prediction model for text input, some words, phrases and numbers are not needed and are left out of the model. In the following steps, numbers, URLs, punctuation and extra white space are removed.
Tokenization comes next; in that step more data is removed from or changed in the data set.
To start, let's create a corpus, which is the combination of the three data sets (Twitter, blogs and news). More information about text corpora is available at https://en.wikipedia.org/wiki/Text_corpus. To create the corpus, the combined sample is converted into a VCorpus object (more information on what a VCorpus is: https://stats.stackexchange.com/questions/164372/what-is-vectorsource-and-vcorpus-in-tm-text-mining-package-in-r)
sample_data <- c(leo_twitter_sample, leo_blog_sample,leo_news_sample)
dir.create("./final_data", showWarnings = FALSE) #Make sure the output folder exists
writeLines(sample_data, "./final_data/sample.txt") #Save sampled data
corpus <- VCorpus(VectorSource(sample_data))
rm(sample_data) #remove sample_data as it is now stored in "corpus"
#Remove punctuation (the characters are deleted, not replaced with a space)
corpus <- tm_map(corpus, removePunctuation)
#Transform to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
#Remove numbers
corpus <- tm_map(corpus, removeNumbers)
#Remove URLs (punctuation is already gone, so each URL is a single token starting with "http")
remove_web_url <- function(x) gsub("http[[:alnum:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(remove_web_url))
#Strip extra whitespace from a text document. Multiple whitespace characters are collapsed to a single blank.
corpus <- tm_map(corpus, stripWhitespace)
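The order of the cleaning steps matters for the URL pattern. The sketch below replays the same sequence on one line of text with base-R stand-ins; the gsub() patterns are my approximations of the tm transformations, and `clean_line()` is a hypothetical helper. Because punctuation is deleted first, a URL collapses into a single alphanumeric token that begins with "http", which the pattern then removes. Addresses without that prefix (for example ones starting with "www") would not be caught.

```r
# Base-R stand-ins for the tm transformations, applied in the same order.
# clean_line() is a hypothetical helper used only for this illustration.
clean_line <- function(x) {
  x <- gsub("[[:punct:]]+", "", x)      # approximates removePunctuation: characters are deleted
  x <- tolower(x)                       # lower case
  x <- gsub("[[:digit:]]+", "", x)      # approximates removeNumbers
  x <- gsub("http[[:alnum:]]*", "", x)  # same pattern as remove_web_url
  x <- gsub("[[:space:]]+", " ", x)     # approximates stripWhitespace
  trimws(x)                             # extra trim, for display only
}

cleaned <- clean_line("Check http://t.co/Ab12 now, 42 times!")
print(cleaned)
```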
Now the tokenization process starts which, as explained before, splits the text strings into individual words. To do this, the unnest_tokens() function from the tidytext package is used. By default, this function converts the tokens to lowercase, which makes them easier to compare or combine with other data sets; here, however, that was already done in the corpus. One important point is that the function works on a data frame or tibble, so the corpus needs to be converted first. Common stop words, such as “the”, “to” and “of”, are removed because they add little useful information to the model. A more advanced model could use these words, but that is not the objective here because the model needs to be memory efficient. The function anti_join() and the stop_words data set are used to remove these words.
tidy_corpus <- corpus %>% tidy() #%>% select(id,text)
tidy_corpus <- tidy_corpus %>% select(id,text)
rm(corpus)#Remove data as it is now stored in "tidy_corpus"
tidy_corpus <- tidy_corpus %>% unnest_tokens(word,text)
#Dataset of naughty words that need to be erased from the set (definition from Coursera capstone)
naughty_words <- tibble(word = readLines('https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en'))
#Remove stop words and then also naughty words
data(stop_words)
tidy_corpus <- tidy_corpus %>% anti_join(stop_words,by= "word") %>% anti_join(naughty_words,by= "word")
Now, create a word cloud of the 50 most frequent words. For this, the “wordcloud” library is used.
After cleaning and tokenizing the data set comes the step of generating unigrams, bigrams, trigrams and quadgrams. These will then be used (not in this week's project) to create a prediction model.
tidy_unigram <- tidy_corpus %>% count(word, sort = TRUE)
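#Create bi-gram (collapse = "id" rejoins the words of each document before building n-grams; recent tidytext versions no longer collapse rows by default)
tidy_bigram <- tidy_corpus %>% unnest_tokens(bigram, word, token = "ngrams", n = 2, collapse = "id") %>% drop_na()
#Create trigram
tidy_trigram <- tidy_corpus %>% unnest_tokens(trigram, word, token = "ngrams", n = 3, collapse = "id") %>% drop_na()
#Create quadgram
tidy_quadgram <- tidy_corpus %>% unnest_tokens(quadgram, word, token = "ngrams", n = 4, collapse = "id") %>% drop_na()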
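To make the idea of an n-gram concrete, here is a small base-R sketch that slides a window of n words over a token vector. `make_ngrams()` is a hypothetical helper written for this illustration; the report itself relies on unnest_tokens() for this step.

```r
# Build n-grams from a vector of tokens with a sliding window.
# make_ngrams() is a hypothetical helper, not a tidytext function.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))  # too few words for one n-gram
  starts <- seq_len(length(words) - n + 1)     # start position of each window
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- c("the", "house", "is", "white")
bigrams <- make_ngrams(words, 2)
trigrams <- make_ngrams(words, 3)
print(bigrams)
print(trigrams)
```

A document with fewer than n words yields no n-grams, which is why very short lines contribute nothing to the higher-order tables.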
Plots of the most common unigrams, bigrams, trigrams and quadgrams. The code can be seen in the Appendix.
The time taken from the creation of the corpus through tokenization to the creation of the uni/bi/tri/quad-grams was 6.853075 minutes (this does not include the plots). The notebook used has an Intel Core i7-5500U at 2.4 GHz and 8 GB of DDR3 RAM (1600 MHz), and the RStudio version was 1.3.1073.
One important question is: how many unique words are needed in a frequency-sorted dictionary to cover 50% of all word instances in the language? And 90%? In the sampled data, 1419 words cover 50% of all word instances and 22,908 words cover 90%.
The objective is to build a basic n-gram model, using the exploratory analysis performed above, that predicts the next word based on the previous 1, 2 or 3 words. The model should handle unseen n-grams: in some cases people will want to type a combination of words that does not appear in the corpora.
The model will suggest the words that have a high probability of appearing next in a text. To do this, the n-gram counts will be used to estimate the probability of each word or n-gram in the data sets.
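One common way to turn counts into probabilities is the maximum-likelihood estimate, where the probability of a word given the previous word is the bigram count divided by the total count of bigrams that start with that previous word. The sketch below assumes this estimate; the counts are invented for illustration and do not come from the report's data.

```r
# Toy bigram counts; the values are invented for illustration.
bigram_counts <- data.frame(
  w1 = c("the", "the", "the", "house"),
  w2 = c("house", "car", "dog", "is"),
  n  = c(6, 3, 1, 4),
  stringsAsFactors = FALSE
)

# Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1 followed by anything)
totals_per_w1 <- ave(bigram_counts$n, bigram_counts$w1, FUN = sum)
bigram_counts$prob <- bigram_counts$n / totals_per_w1
print(bigram_counts)
```

The probabilities of all continuations of the same first word sum to 1, so the most likely next word is simply the row with the highest `prob` for that `w1`.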
There are many ways to create a prediction model, but this one will use Markov chains. A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event (from Wikipedia: https://en.wikipedia.org/wiki/Markov_chain). So, in a Markov model, only the current state is considered in order to predict the next state. More information at https://medium.com/edureka/introduction-to-markov-chains-c6cb4bcd5723
However, Markov chains consider only words that appeared in the data used to create the model. Because of this, I will have to think of a way to predict new words.
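One possible way to cope with word combinations that were never seen is a simple back-off scheme: try the longest context first and fall back to shorter contexts when there is no match. The report has not settled on a method, so the sketch below is only one option under that assumption; the tables hold invented counts and `predict_next()` is a hypothetical helper.

```r
# Toy n-gram tables with invented counts.
unigrams <- data.frame(word = c("the", "house", "white"), n = c(10, 4, 2),
                       stringsAsFactors = FALSE)
bigrams  <- data.frame(w1 = c("house", "house"), w2 = c("is", "was"), n = c(3, 1),
                       stringsAsFactors = FALSE)
trigrams <- data.frame(w1 = "the", w2 = "house", w3 = "is", n = 2,
                       stringsAsFactors = FALSE)

# predict_next() is a hypothetical helper: it backs off from trigrams
# to bigrams to the most frequent single word.
predict_next <- function(prev) {
  k <- length(prev)
  if (k >= 2) {  # try the last two words first
    hit <- trigrams[trigrams$w1 == prev[k - 1] & trigrams$w2 == prev[k], ]
    if (nrow(hit) > 0) return(hit$w3[which.max(hit$n)])
  }
  if (k >= 1) {  # back off to the last word only
    hit <- bigrams[bigrams$w1 == prev[k], ]
    if (nrow(hit) > 0) return(hit$w2[which.max(hit$n)])
  }
  unigrams$word[which.max(unigrams$n)]  # last resort: most frequent word overall
}

print(predict_next(c("the", "house")))   # matched by the trigram table
print(predict_next(c("red", "house")))   # unseen trigram, falls back to bigrams
print(predict_next(c("purple", "cat")))  # nothing matches, falls back to unigrams
```

This always returns a suggestion, even for input that never appeared in the sample, at the cost of less specific predictions in the fallback cases.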
All of the above should be done efficiently, because it will have to run in a Shiny application.
The model will then be tested to measure its prediction accuracy. This will be done using the original data set: as I only took 15% of it to create the model, I can use the remaining 85% to test it.
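A practical caveat is that the sampling code earlier in this report draws the lines directly and then removes the raw vectors, so the remaining 85% is not available afterwards. One way around this, sketched below under that assumption, is to sample line indices and keep them. The data here is a stand-in, and `accuracy()` is a hypothetical helper that measures the share of correct next-word predictions.

```r
set.seed(54321)                    # same seed as the report, for reproducibility
lines_all <- paste("line", 1:100)  # stand-in for one of the raw text files

# Sample indices instead of lines, so the complement can be recovered later.
train_idx <- sample(seq_along(lines_all), floor(length(lines_all) * 0.15))
train <- lines_all[train_idx]      # 15% used to build the n-gram tables
test  <- lines_all[-train_idx]     # 85% never seen by the model

# accuracy() is a hypothetical helper: share of predictions equal to the true next word.
accuracy <- function(predicted, actual) mean(predicted == actual)
acc <- accuracy(c("is", "the", "a"), c("is", "of", "a"))
print(acc)
```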
Possible improvements that could be made in the generation of n-grams:
+ Use synonyms to predict words that are not in the base model
+ Deal with punctuation and misspelled words
+ Use the stop words that were removed in this project
Code for summary of the original three files.
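The original chunk is not reproduced in this text, so the following is only a sketch of how such a summary could be assembled. It runs on tiny invented texts and uses a base-R word counter (`count_words()`, a hypothetical stand-in for stri_count_words). Every column is computed by iterating over the same named list, which keeps all columns in the same data set order; a File.size.Mb column could be added in the same way with file.size().

```r
# Stand-in inputs (tiny invented texts) so the sketch runs on its own.
texts <- list(
  twitter = c("short tweet", "another one here"),
  blogs   = c("a much longer blog line with many words"),
  news    = c("news line one", "news line two", "three")
)

# count_words() is a hypothetical base-R stand-in for stri_count_words.
count_words <- function(x) sum(lengths(strsplit(trimws(x), "[[:space:]]+")))

# Each column iterates over the same list, so rows cannot get out of order.
summary_table <- data.frame(
  Data          = names(texts),
  Line.Count    = vapply(texts, length, integer(1)),
  Word.count    = vapply(texts, count_words, integer(1)),
  Max.char.line = vapply(texts, function(x) max(nchar(x)), integer(1)),
  row.names     = NULL
)
print(summary_table)
```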
##      Data File.size.Mb Line.Count Word.count Max.char.line
## 1 twitter     159.3641    2360148   30218166           213
## 2   blogs     200.4242     899288   38154238         40835
## 3    news     196.2775      77259    2693898          5760
Code for n-gram plots
#plot topwords of unigrams
tidy_corpus %>% count(word, sort = TRUE) %>% head(12) %>% mutate(word = reorder(word,n)) %>%
ggplot(aes(word,n)) + geom_col() + xlab(NULL) + ylab("Word count") + coord_flip() + theme_bw() + ggtitle("Most common uni-grams")
#Plot most common bi-grams
tidy_bigram %>% count(bigram, sort = TRUE) %>% head(12) %>% mutate(bigram = reorder(bigram,n)) %>%
ggplot(aes(bigram,n)) + geom_col() + xlab(NULL)+ ylab("Frequency count") + coord_flip() + theme_bw() + ggtitle("Most common bi-grams")
#Plot most common tri-grams
tidy_trigram %>% count(trigram, sort = TRUE) %>% head(12) %>% mutate(trigram = reorder(trigram,n)) %>%
ggplot(aes(trigram,n)) + geom_col() + xlab(NULL) + ylab("Frequency count") + coord_flip() + theme_bw() + ggtitle("Most common tri-grams")
Word coverage (take into account that the “language” here only covers the 15% sample of the data set and not the full English dictionary)
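participation_word<-function(x,word_coverage){ #x is the unigram output sorted by frequency, word_coverage is the fraction of word instances to cover (e.g. 0.5)
word_count<-0 # running total of word instances
participation <- word_coverage*sum(x$n) # number of word instances needed to hit coverage
for (i in seq_len(nrow(x)))
{word_count<-word_count+x$n[i] # add the i-th most frequent word before checking
if (word_count >= participation) {
return (i) # the i most frequent words reach the coverage
}
}}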
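The same question can also be answered without a loop by using a cumulative sum. The sketch below is a vectorized alternative shown on invented counts; `coverage_words()` is a hypothetical name and is not used elsewhere in this report.

```r
# Vectorized word-coverage calculation.
# coverage_words() is a hypothetical helper; counts is a vector of word frequencies.
coverage_words <- function(counts, word_coverage) {
  counts <- sort(counts, decreasing = TRUE)  # most frequent words first
  cum_share <- cumsum(counts) / sum(counts)  # share of instances covered by the top i words
  which(cum_share >= word_coverage)[1]       # smallest i that reaches the target
}

toy_counts <- c(50, 20, 10, 10, 5, 5)        # invented frequencies
half <- coverage_words(toy_counts, 0.5)
most <- coverage_words(toy_counts, 0.9)
print(c(half, most))
```

With the real data, the call would take the `n` column of the unigram table, for example `coverage_words(tidy_unigram$n, 0.5)`.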