The purpose of this report is to explore the dataset that will be used to build a SwiftKey-style text prediction product in RStudio. It is part of the requirements of the Capstone Project of the Data Science Specialization on Coursera.
The first step in the Natural Language Processing (NLP) capstone project is to import all the data into R so it can be properly analyzed and explored. I have also set the working directory for convenience and loaded all the libraries I used at the top for ease of access.
library(quanteda)
## Package version: 1.5.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(readtext)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'tm'
## The following objects are masked from 'package:quanteda':
##
## as.DocumentTermMatrix, stopwords
library(LaF)
library(stringi)
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
setwd("C:/Users/10012186/Documents/work/en_US")
twitter <- readtext("en_US.twitter.txt")
blog <- readtext("en_US.blogs.txt")
news <- readtext("en_US.news.txt")
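Before building the corpus, a quick sanity check on the raw text files can back up the size comparison. This is a minimal sketch using base readLines() and the already-loaded stringi package; it assumes the three files sit in the working directory set above.
# Quick size comparison of the three raw files (sketch)
rawStats <- sapply(list(twitter = "en_US.twitter.txt",
                        blog    = "en_US.blogs.txt",
                        news    = "en_US.news.txt"),
                   function(f) {
                     lines <- readLines(f, skipNul = TRUE, warn = FALSE)
                     c(lines = length(lines),
                       words = sum(stri_count_words(lines)),
                       chars = sum(stri_length(lines)))
                   })
rawStats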
As we can see below, the blog file contains the most tokens while the news file contains the least.
corpusTwitter <- corpus(twitter)
corpusBlog <- corpus(blog)
corpusNews <- corpus(news)
masterCorpus <- corpusTwitter+corpusBlog+corpusNews
docnames(masterCorpus) <- c("Twitter", "Blog", "News")
summary(masterCorpus)
## Corpus consisting of 3 documents:
##
## Text Types Tokens Sentences
## Twitter 581913 36898285 2574102
## Blog 530064 44346847 2015464
## News 118765 3113070 140254
##
## Source: Combination of corpuses corpusTwitter + corpusBlog and corpusNews
## Created: Wed Aug 14 14:29:01 2019
## Notes:
In order to process the data with the quanteda package, we first need to transform the dataset into a corpus, as done above. After transforming the data, we proceed to tokenize the texts in the corpus, removing numbers, punctuation, hyphens and symbols from the list of tokens.
The next step is to preprocess the master list of tokens from the corpus and remove as much of the noise as is reasonably possible. This includes converting everything to lower case, removing profanity, and removing any token that begins or ends with anything other than a letter.
The list of profanities was taken from this repository: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
masterTokens <- tokens(masterCorpus, remove_numbers=TRUE, remove_punct=TRUE, remove_symbols=TRUE, remove_hyphens=TRUE, remove_twitter=TRUE)
masterTokens <- tokens_tolower(masterTokens)
masterTokens <- tokens_remove(masterTokens, pattern="^[^a-zA-Z]|[^a-zA-Z]$", valuetype="regex", padding=TRUE)
profanity <- readLines("~/work/en_US/en")
masterTokensClean <- tokens_remove(masterTokens, profanity, padding = TRUE)
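For reproducibility, the profanity list could also be pulled straight from that repository rather than from a local copy. This is a hedged sketch; the raw-file path below is an assumption about the repository layout.
# Fetch the English profanity list directly from GitHub (path assumed),
# cache it locally, and read it the same way as above
profanityURL <- "https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
if (!file.exists("profanity_en.txt")) download.file(profanityURL, "profanity_en.txt")
profanity <- readLines("profanity_en.txt", warn = FALSE)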
This is the third iteration of my attempt at this milestone report: in the first iteration I forgot to remove the stop words, and in the second I could not produce plots for the bigram and trigram matrices. This time I’m using the quanteda package in R so that I can plot unigrams, bigrams, and trigrams.
Using this package, we’ll create a document-feature matrix (dfm) from the list of clean tokens. The dfm is a summary table of all the unique tokens in the text and a count of how many times each token appears in each text file in the corpus. We’ll also remove all English stop words; these stop words are included in the quanteda package. We’ll assume that individual letters are errors or typos in the text, so we’ll remove those as well.
df <- dfm(masterTokensClean, remove=stopwords("english"))
df <- dfm_remove(df, "\\b[a-zA-Z]\\b", valuetype="regex")
We’ll also remove any words that did not appear in at least two of the documents. This filters out uncommon words (which we are unlikely to want to predict) as well as typos.
dfTrimmed <- dfm_trim(df, min_docfreq=2)
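To see how much the trim actually removed, we can compare the number of features before and after with quanteda's nfeat() (a small optional check):
# How many unique tokens survive the document-frequency trim
nfeat(df)        # features before trimming
nfeat(dfTrimmed) # features appearing in at least two documents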
Here are the top 25 most frequently occurring tokens from the final dfm.
topfeatures(dfTrimmed, n=25)
## just like one will can get time love
## 1337939 255568 226386 216180 215806 192319 186663 171336 152123
## good now day know new see go people back
## 151847 146213 145886 141683 129716 118641 117793 114650 112154
## great think make us going really thanks
## 108567 103521 101297 100412 97956 96880 96753
Here’s a word cloud of the most common words within the dataset.
textplot_wordcloud(dfTrimmed, max.words=50, scale=c(5, 2))
## Warning: scale is deprecated; use min_size and max_size instead
## Warning: max.words is deprecated; use max_words instead
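The warnings just note that the older arguments have been renamed; an equivalent call with the newer arguments would look roughly like this (sizes chosen to approximate scale = c(5, 2), so treat them as an assumption):
# Same word cloud using the non-deprecated argument names (sketch)
textplot_wordcloud(dfTrimmed, max_words = 50, min_size = 2, max_size = 5)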
Here are the top 10 words in each dataset. What’s interesting is that for Twitter, “rt” shows up as an actual word because it’s the shorthand for retweet, while in the news dataset the word “said” occurs far more often than the rest, which makes sense given how often news articles quote people.
twitterTop <- topfeatures(dfTrimmed[1], n=10)
blogTop <- topfeatures(dfTrimmed[2], n=10)
newsTop <- topfeatures(dfTrimmed[3], n=10)
barplot(twitterTop, main = "Top 10 Words in Twitter Dataset", ylab = "Count")
barplot(blogTop, main = "Top 10 Words in Blogs Dataset", ylab = "Count")
barplot(newsTop, main = "Top 10 Words in News Dataset", ylab = "Count")
Since we already looked at the most frequent single words above, we’ll now look at two- and three-word combinations in the corpus; these are called bigrams and trigrams.
biGram <- tokens_ngrams(masterTokensClean, n=2)
biGramDf <- dfm(biGram)
triGram <- tokens_ngrams(masterTokensClean, n=3)
triGramDf <- dfm(triGram)
topfeatures(biGramDf, n=25)
## of_the in_the for_the to_the on_the to_be at_the i_have
## 258486 246653 137564 136225 129689 118884 89344 79815
## and_the i_was is_a in_a and_i i_am it_was it_is
## 77901 75737 74571 73001 72709 72149 70550 66793
## for_a with_the if_you have_a going_to is_the will_be to_get
## 65994 65922 63895 60664 60465 56237 55336 54032
## from_the
## 53026
twitterTop2grams <- topfeatures(biGramDf[1])
blogTop2grams <- topfeatures(biGramDf[2])
newsTop2grams <- topfeatures(biGramDf[3])
par(mfrow=c(3,1))
barplot(twitterTop2grams, main = "Top 10 bigrams in Twitter", ylab = "Count")
barplot(blogTop2grams, main = "Top 10 bigrams in Blogs", ylab = "Count")
barplot(newsTop2grams, main = "Top 10 bigrams in News", ylab = "Count")
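Since ggplot2 is already loaded, an alternative view of the same counts is quanteda’s textstat_frequency(), which returns a tidy data frame that plots cleanly. A sketch for the bigram dfm (the choice of n = 20 is arbitrary):
# Tidy frequency table for the bigram dfm, plotted as a horizontal bar chart
bigramFreq <- textstat_frequency(biGramDf, n = 20)
ggplot(bigramFreq, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Count", title = "Top 20 bigrams across the corpus")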
topfeatures(triGramDf, n=25)
## thanks_for_the one_of_the a_lot_of
## 23805 21051 19331
## to_be_a i_want_to going_to_be
## 13206 13180 12730
## i_have_a looking_forward_to i_have_to
## 10896 10572 10332
## it_was_a thank_you_for the_end_of
## 10289 10138 9718
## out_of_the be_able_to i_love_you
## 9703 9310 9196
## i_need_to some_of_the can't_wait_to
## 9147 8584 8290
## as_well_as the_rest_of one_of_my
## 8192 8176 8054
## for_the_follow is_going_to you_want_to
## 7932 7824 7726
## a_couple_of
## 7466
twitterTop3grams <- topfeatures(triGramDf[1])
blogTop3grams <- topfeatures(triGramDf[2])
newsTop3grams <- topfeatures(triGramDf[3])
par(mfrow=c(3,1))
barplot(twitterTop3grams, main = "Top 10 trigrams in Twitter", ylab = "Count", las=2)
barplot(blogTop3grams, main = "Top 10 trigrams in Blogs", ylab = "Count", las=2)
barplot(newsTop3grams, main = "Top 10 trigrams in News", ylab = "Count", las=2)