In this report I present a summary of the exploratory analysis I performed on the data. First I describe the characteristics of the files and how the data was imported and preprocessed. Then, based on a sample of the data, I obtain the most common n-grams (1-grams through 4-grams). Finally, I briefly describe how I plan to build the prediction model.
The packages used for the exploratory analysis were:
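The package-loading chunk is not echoed here; based on the functions used later in the report, the libraries were presumably along these lines (tm is named explicitly in the text, while RWeka, ggplot2 and SnowballC are my assumptions):

library(tm)        # Corpus, tm_map, TermDocumentMatrix
library(RWeka)     # NGramTokenizer (assumed source of the 2/3/4-gram tokenizers)
library(ggplot2)   # assumed plotting backend for the top-50 frequency charts
library(SnowballC) # stemming backend used by stemDocument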
First, the working directory was changed to the folder where the files were extracted (in this case I will work only with the English files):
# read each file line by line, skipping embedded NUL characters
blogData <- readLines("en_US.blogs.txt", skipNul = TRUE)
twitterData <- readLines("en_US.twitter.txt", skipNul = TRUE)
newsData <- readLines("en_US.news.txt", skipNul = TRUE)
## Warning in readLines("en_US.news.txt", skipNul = TRUE): incomplete final
## line found on 'en_US.news.txt'
In order to proceed with the analysis, the 3 datasets were merged:
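The merge itself is not echoed; it is presumably just a concatenation of the three character vectors (the name mergedData below is only illustrative):

mergedData <- c(blogData, twitterData, newsData) # combine blog, twitter and news lines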
I sampled 1% of the data with the sample function, as suggested:
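The sampling call is not shown; a minimal sketch, with an illustrative seed and object name, would be:

set.seed(1234) # fix the seed so the 1% sample is reproducible
sampleData <- sample(mergedData, round(length(mergedData) * 0.01)) # keep ~1% of the lines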
Using the sampled data, I defined the corpus with the Corpus function from the tm package:
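The corpus construction is not echoed either; with tm it would typically look like this (assuming the sampled vector from the previous step):

corpus <- Corpus(VectorSource(sampleData)) # one document per sampled line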
As almost every NLP guide on the internet suggests, I performed some standard transformations (removing extra whitespace, converting all characters to lowercase, removing punctuation, and so on):
options(mc.cores = 1) # use a single core (a common workaround for parallel issues with tm/RWeka)
# replace /, @ and | with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
us_files <- tm_map(corpus, toSpace, "/|@|\\|")
# convert to lowercase
us_files <- tm_map(us_files, content_transformer(tolower))
# remove punctuation
us_files <- tm_map(us_files, removePunctuation)
# remove numbers
us_files <- tm_map(us_files, removeNumbers)
# strip whitespace
us_files <- tm_map(us_files, stripWhitespace)
# remove english stop words
us_files <- tm_map(us_files, removeWords, stopwords("english"))
# stem words
corpus <- tm_map(us_files, stemDocument)
From a term-document matrix I then obtained the list of the most frequently used 1-grams:
Now I created the n-grams (the tokenizer and freq_df helper functions used below are sketched after this code block):
# Creating the n-grams
# unigrams
corpus.unigram <- TermDocumentMatrix(corpus)
corpus.unigram <- removeSparseTerms(corpus.unigram, 0.99)
corpus.unigram.freq <- freq_df(corpus.unigram)
# bigrams
corpus.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
corpus.bigram <- removeSparseTerms(corpus.bigram, 0.999)
corpus.bigram.freq <- freq_df(corpus.bigram)
# trigrams
corpus.trigram <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
corpus.trigram <- removeSparseTerms(corpus.trigram, 0.999)
corpus.trigram.freq <- freq_df(corpus.trigram)
# quadgrams
corpus.quadgram <- TermDocumentMatrix(corpus, control = list(tokenize = quadgramTokenizer))
corpus.quadgram <- removeSparseTerms(corpus.quadgram, 0.9999)
corpus.quadgram.freq <- freq_df(corpus.quadgram)
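The n-gram tokenizers and the freq_df helper are not defined in the chunks shown above; a minimal sketch of how they could be written, assuming RWeka's NGramTokenizer, is:

# n-gram tokenizers (sketch, assuming the RWeka package)
bigramTokenizer   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))

# collapse a term-document matrix into a frequency table sorted by count
freq_df <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq, row.names = NULL)
}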
Here is a graphical summary of the most frequent 1-grams:
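The unigram plot call and the top_50_plot helper are not echoed above; a plausible sketch, assuming a ggplot2 bar chart of the 50 most frequent terms (the title string is only illustrative), is:

# horizontal bar chart of the 50 most frequent terms (sketch)
top_50_plot <- function(freq_data, title, bar_colour) {
  top50 <- head(freq_data[order(-freq_data$freq), ], 50)
  ggplot(top50, aes(x = reorder(word, freq), y = freq)) +
    geom_col(fill = bar_colour) +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}
top_50_plot(corpus.unigram.freq, "Top 50 words", "steelblue")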
Similarly, here are the most frequently used 2-grams:
top_50_plot(corpus.bigram.freq, "Top 50 2 word phrases", "steelblue")
# Top 50 3-word phrases (trigrams)
top_50_plot(corpus.trigram.freq, "Top 50 3 word phrases", "steelblue")
# Top 50 4-word phrases (quadgrams)
top_50_plot(corpus.quadgram.freq, "Top 50 4 word phrases", "steelblue")
As a next step, the prediction model will be built from these n-gram frequency tables and presented as a Shiny app hosted on a Shiny server.