Introduction

The purpose of this report is to explore the dataset that will be used to build a SwiftKey-style predictive text model in RStudio. It is part of the requirements of the Capstone Project in Coursera’s Data Science Specialization.

Preparing the Data

The first step in the Natural Language Processing (NLP) capstone project is to import all the data into R so it can be properly analyzed and explored. I have also set the working directory for convenience, and all the libraries I use are loaded at the top for ease of access.

Load the Libraries

library(quanteda)
## Package version: 1.5.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(readtext)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'tm'
## The following objects are masked from 'package:quanteda':
## 
##     as.DocumentTermMatrix, stopwords
library(LaF)
library(stringi)
library(knitr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(RWeka)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
setwd("C:/Users/10012186/Documents/work/en_US")

twitter <- readtext("en_US.twitter.txt")
blog <- readtext("en_US.blogs.txt")
news <- readtext("en_US.news.txt")

Summary of the Data

As the summary below shows, the blog file contains the most tokens, while the news file has by far the least.

corpusTwitter <- corpus(twitter)
corpusBlog <- corpus(blog)
corpusNews <- corpus(news)
masterCorpus <- corpusTwitter + corpusBlog + corpusNews
docnames(masterCorpus) <- c("Twitter", "Blog", "News")
summary(masterCorpus)
## Corpus consisting of 3 documents:
## 
##     Text  Types   Tokens Sentences
##  Twitter 581913 36898285   2574102
##     Blog 530064 44346847   2015464
##     News 118765  3113070    140254
## 
## Source: Combination of corpuses corpusTwitter + corpusBlog and corpusNews
## Created: Wed Aug 14 14:29:01 2019
## Notes:
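
To cross-check that claim on the raw files (my addition, assuming the readtext objects loaded earlier keep the full text in their default text column), approximate word counts per file tell the same story:

# Cross-check: approximate word counts per raw file with stringi
sapply(list(Twitter = twitter, Blog = blog, News = news), function(x) sum(stri_count_words(x$text)))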

Processing the Data

To process the data with the quanteda package, we first need to transform the dataset into a corpus. After transforming the data, we’ll tokenize the texts in the corpus, removing numbers, punctuation, hyphens, and symbols from the list of tokens.

The next step is to preprocess the master list of tokens from the corpus and remove as much of the noise from it as is reasonably possible. This includes making everything lower case, removing profanity, and removing any token that begins or ends with anything other than a letter.

The list of profanities was taken from this repository: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

# Tokenize, dropping numbers, punctuation, symbols, hyphens, and the Twitter characters @ and #
masterTokens <- tokens(masterCorpus, remove_numbers=TRUE, remove_punct=TRUE, remove_symbols=TRUE, remove_hyphens=TRUE, remove_twitter=TRUE)
# Lowercase every token
masterTokens <- tokens_tolower(masterTokens)
# Drop tokens that begin or end with a non-letter; padding=TRUE leaves a gap so n-grams don't span removals
masterTokens <- tokens_remove(masterTokens, pattern="^[^a-zA-Z]|[^a-zA-Z]$", valuetype="regex", padding=TRUE)
# Read the profanity list and filter those tokens out
profanity <- readLines("~/work/en_US/en")
masterTokensClean <- tokens_remove(masterTokens, profanity, padding = TRUE)
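
If the profanity list has not been downloaded yet, it could also be read straight from the repository. This assumes the repository’s layout, where the English list is the raw file named “en”:

# Assumed raw-file URL for the English list in the LDNOOBW repository
profanityUrl <- "https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
profanity <- readLines(profanityUrl, warn=FALSE)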

Quanteda Package

This is the third iteration of my attempt at this milestone report: on the first iteration I forgot to remove the stop words, and on the second I could not produce plots for the biGramMatrix and triGramMatrix. Hence, this time I’m using the quanteda package in R to plot the uniGram, biGram, and triGram frequencies.

Using DFM

Using this package, we’ll create a document-feature matrix (dfm) from the list of clean tokens. The dfm is a summary table of all the unique tokens in the text and a count of how many times each token appears in each text file in the corpus. We’ll also remove all English stop words, which are included in the quanteda package. We’ll assume that individual letters are errors or typos in the text, so we’ll remove these as well.

df <- dfm(masterTokensClean, remove=stopwords("english"))
df <- dfm_remove(df, "\\b[a-zA-Z]\\b", valuetype="regex")
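
As a quick sanity check that the stop words were actually dropped (my addition, not part of the original analysis):

# Count features that are still English stop words; should be 0
sum(featnames(df) %in% stopwords("english"))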

Trimming

We’ll also remove any words that do not appear in at least two of the three documents. This filters out uncommon words (there is a low probability that you would want to predict them) and typos.

dfTrimmed <- dfm_trim(df, min_docfreq=2)
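
To see how aggressive the trim was (my addition), we can compare the number of features before and after:

nfeat(df)         # unique features before trimming
nfeat(dfTrimmed)  # features appearing in at least two documents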

Exploratory Data Analysis

Here are the top 25 most frequently occurring tokens from the final dfm.

topfeatures(dfTrimmed, n=25)
##            just    like     one    will     can     get    time    love 
## 1337939  255568  226386  216180  215806  192319  186663  171336  152123 
##    good     now     day    know     new     see      go  people    back 
##  151847  146213  145886  141683  129716  118641  117793  114650  112154 
##   great   think    make      us   going  really  thanks 
##  108567  103521  101297  100412   97956   96880   96753
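
The unnamed feature at the top of the list is almost certainly the empty padding token ("") left behind by the tokens_remove(..., padding=TRUE) calls. If desired, it could be dropped by name (a hypothetical cleanup step, not part of the original pipeline):

# Hypothetical cleanup: drop the empty padding feature from the dfm
dfNoPad <- dfTrimmed[, featnames(dfTrimmed) != ""]
topfeatures(dfNoPad, n=5)  # now led by real words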

Here’s a word cloud of the most common words within the dataset.

textplot_wordcloud(dfTrimmed, max.words=50, scale=c(5, 2))
## Warning: scale is deprecated; use min_size and max_size instead
## Warning: max.words is deprecated; use max_words instead
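
The warning just means the argument names have changed; the equivalent call with the current argument names would be (with sizes chosen to mirror the original scale argument):

textplot_wordcloud(dfTrimmed, max_words=50, min_size=2, max_size=5)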

Here are the top 10 words in each dataset. What’s interesting is that for Twitter, “rt” shows up as an actual word because it’s the shortcut for retweet, while for the news dataset, the word “said” occurs significantly more often than the rest, which makes sense given how often news articles attribute quotes.

twitterTop <- topfeatures(dfTrimmed[1], n=10)
blogTop <- topfeatures(dfTrimmed[2], n=10)
newsTop <- topfeatures(dfTrimmed[3], n=10)

barplot(twitterTop, main = "Top 10 Words in Twitter Dataset", ylab = "Count")

barplot(blogTop, main = "Top 10 Words in Blogs Dataset", ylab = "Count")

barplot(newsTop, main = "Top 10 Words in News Dataset", ylab = "Count")

Ngrams

Since we already looked at the most frequent single words above, we’ll proceed to look at two- and three-word combinations in the corpus; these are called bigrams and trigrams.

biGram <- tokens_ngrams(masterTokensClean, n=2)
biGramDf <- dfm(biGram)

triGram <- tokens_ngrams(masterTokensClean, n=3)
triGramDf <- dfm(triGram)

Bigrams

topfeatures(biGramDf, n=25)
##   of_the   in_the  for_the   to_the   on_the    to_be   at_the   i_have 
##   258486   246653   137564   136225   129689   118884    89344    79815 
##  and_the    i_was     is_a     in_a    and_i     i_am   it_was    it_is 
##    77901    75737    74571    73001    72709    72149    70550    66793 
##    for_a with_the   if_you   have_a going_to   is_the  will_be   to_get 
##    65994    65922    63895    60664    60465    56237    55336    54032 
## from_the 
##    53026
twitterTop2grams <- topfeatures(biGramDf[1])
blogTop2grams <- topfeatures(biGramDf[2])
newsTop2grams <- topfeatures(biGramDf[3])

par(mfrow=c(3,1))
barplot(twitterTop2grams, main = "Top 10 bigrams in Twitter", ylab = "Count")
barplot(blogTop2grams, main = "Top 10 bigrams in Blogs", ylab = "Count")
barplot(newsTop2grams, main = "Top 10 bigrams in News", ylab = "Count")

Trigrams

topfeatures(triGramDf, n=25)
##     thanks_for_the         one_of_the           a_lot_of 
##              23805              21051              19331 
##            to_be_a          i_want_to        going_to_be 
##              13206              13180              12730 
##           i_have_a looking_forward_to          i_have_to 
##              10896              10572              10332 
##           it_was_a      thank_you_for         the_end_of 
##              10289              10138               9718 
##         out_of_the         be_able_to         i_love_you 
##               9703               9310               9196 
##          i_need_to        some_of_the      can't_wait_to 
##               9147               8584               8290 
##         as_well_as        the_rest_of          one_of_my 
##               8192               8176               8054 
##     for_the_follow        is_going_to        you_want_to 
##               7932               7824               7726 
##        a_couple_of 
##               7466
twitterTop3grams <- topfeatures(triGramDf[1])
blogTop3grams <- topfeatures(triGramDf[2])
newsTop3grams <- topfeatures(triGramDf[3])

par(mfrow=c(3,1))
barplot(twitterTop3grams, main = "Top 10 trigrams in Twitter", ylab = "Count", las=2)
barplot(blogTop3grams, main = "Top 10 trigrams in Blogs", ylab = "Count", las=2)
barplot(newsTop3grams, main = "Top 10 trigrams in News", ylab = "Count", las=2)