Three data files are used, originating from blogs, news and Twitter; they can be downloaded here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
The overall goal of this project is to develop a Shiny app that takes a phrase (multiple words) as input and predicts the next word using a prediction algorithm.

## Let’s load the required packages first
setwd("C:/Users/zxu3/Documents/R/data science")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(stringi)
library(tm)
## Loading required package: NLP
library(slam)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
library(RWeka)
library(RColorBrewer)
library(SnowballC)
library(lattice)
library(quanteda)
## Package version: 1.5.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:utils':
##
## View
library(wordcloud2)
library(stringr)
First, we load the datasets into R. The data is provided in four languages; the English-language files were used as a starting point.
We then take a preliminary look at the files in terms of size, word count and number of lines.
# Read the blogs, news and Twitter data into R
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Get words in files
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
# Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source num.lines num.words mean.num.words
## 1   blogs    899288  37546239       41.75107
## 2    news     77259   2674536       34.61779
## 3 twitter   2360148  30093413       12.75065
As the datasets are very large, we will randomly sample 5,000 records from each file for further processing.
# Sample the data
data.sample <- c(sample(blogs, 5000),
                 sample(news, 5000),
                 sample(twitter, 5000))
length(data.sample)
## [1] 15000
sum(stri_count_words(data.sample))
## [1] 440992
data.sample <- iconv(data.sample, "latin1", "ASCII", sub = "")
data.sample <- str_replace_all(data.sample, "[\r\n]", "")
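Note that sample() draws a different subset on each run, so the counts above will vary between runs. A minimal sketch of a reproducible version of the sampling step, assuming the same blogs, news and twitter vectors are in memory; the seed value is arbitrary:

# Reproducible variant of the sampling step (seed value is arbitrary)
set.seed(1234)
data.sample <- c(sample(blogs, 5000),
                 sample(news, 5000),
                 sample(twitter, 5000))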
To clean our sample data we could use the tm_map function in the tm package, but that turns out to be very tedious, so we will use the quanteda package instead. In natural language processing, an n-gram is a contiguous sequence of n items from a given text; an n-gram model can be used to predict the next item in a sequence from the preceding n-1 items (for example, a trigram model predicts the third word from the two words before it).
# Tokenization using the quanteda package
unigram <- tokens(data.sample, remove_punct = TRUE, ngrams = 1)
bigram <- tokens(data.sample, remove_punct = TRUE, ngrams = 2)
trigram <- tokens(data.sample, remove_punct = TRUE, ngrams = 3)
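The ngrams argument of tokens() belongs to the older quanteda API (the 1.5.1 release used here); in quanteda 2.0 and later it was removed in favour of tokens_ngrams(). A sketch of the equivalent tokenization under the newer API, assuming the same cleaned data.sample:

# Equivalent tokenization for quanteda >= 2.0 (sketch)
toks <- tokens(data.sample, remove_punct = TRUE)
unigram <- tokens_ngrams(toks, n = 1)
bigram  <- tokens_ngrams(toks, n = 2)
trigram <- tokens_ngrams(toks, n = 3)

The remaining quanteda calls in this report follow the 1.5.x API and may also need adjusting on newer releases.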
# Load profanity word list (dfm's remove argument expects a character vector)
profanity <- read.csv("profane_words.txt", header = FALSE, stringsAsFactors = FALSE)$V1
unigram <- dfm(unigram, remove = profanity)
unigramfreq <- textstat_frequency(unigram)
# Note: single-word patterns will not match inside concatenated bigram/trigram
# features; removing profanity at the token stage would be more thorough.
bigram <- dfm(bigram, remove = profanity)
bigramfreq <- textstat_frequency(bigram)
trigram <- dfm(trigram, remove = profanity)
trigramfreq <- textstat_frequency(trigram)
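These frequency tables connect directly to the prediction idea described above: given the last two words of a phrase, the trigrams that start with those words, ranked by frequency, are candidate next words. A rough illustrative sketch, assuming quanteda's default "_" feature concatenator and the feature/frequency columns returned by textstat_frequency() in quanteda 1.5.x; the example prefix is arbitrary:

# Sketch: most frequent observed continuations of a two-word prefix
predict_from_trigrams <- function(prefix, freq_table, top = 5) {
  pattern <- paste0("^", paste(prefix, collapse = "_"), "_")
  hits <- freq_table[grepl(pattern, freq_table$feature), ]
  head(hits[order(-hits$frequency), c("feature", "frequency")], top)
}
predict_from_trigrams(c("thanks", "for"), trigramfreq)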
The following charts depict the most frequent terms: first a word cloud of the sampled text, then bar plots of the top 20 unigrams, bigrams and trigrams.
wordcloud(data.sample, max.words = 300, random.order = FALSE,
          rot.per = 0.1, scale = c(2.5, 0.3), use.r.layout = FALSE,
          colors = brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
# Barplot - unigrams with top 20 frequencies
barplot(unigramfreq[1:20, ]$frequency, names.arg = unigramfreq[1:20, ]$feature,
        main = "Top 20 Most Frequent Unigrams", col = "green2")
# Barplot - bigrams with top 20 frequencies
barplot(bigramfreq[1:20, ]$frequency, names.arg = bigramfreq[1:20, ]$feature,
        main = "Top 20 Most Frequent Bigrams", col = "tomato2")
# Barplot - trigrams with top 20 frequencies
barplot(trigramfreq[1:20, ]$frequency, names.arg = trigramfreq[1:20, ]$feature,
        main = "Top 20 Most Frequent Trigrams", col = "blue")
At this point I have imported the project data set, cleaned the data, and performed the NLP processing, including generating n-grams of one to three terms and analyzing their frequencies.
The results presented in this report are based on a sample of 5,000 lines from each source file. For the final project application, I plan to scale this sample size up.
Open issues:

- Investigate more efficient tokenization and n-gram generation techniques
- Develop the word prediction and backoff algorithm to be used in the final app (a rough sketch of the backoff idea follows below)
- Develop a Shiny app
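As a starting point for the second item, here is a minimal sketch of a backoff-style lookup over the frequency tables built above. It assumes the "_"-concatenated features and the feature/frequency columns from textstat_frequency(), and the scoring is deliberately simplified (raw counts, no discounting); the function and helper names are illustrative only.

# Sketch: back off from trigram to bigram to unigram counts (simplified scoring)
predict_next_word <- function(phrase, top = 3) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  lookup <- function(prefix, freq_table) {
    pattern <- paste0("^", paste(prefix, collapse = "_"), "_[^_]+$")
    hits <- freq_table[grepl(pattern, freq_table$feature), ]
    if (nrow(hits) == 0) return(character(0))
    # keep only the predicted (last) token of each matching n-gram
    sub(".*_", "", head(hits$feature[order(-hits$frequency)], top))
  }
  result <- lookup(tail(words, 2), trigramfreq)                          # try trigrams first
  if (length(result) == 0) result <- lookup(tail(words, 1), bigramfreq)  # back off to bigrams
  if (length(result) == 0) result <- head(unigramfreq$feature, top)      # fall back to top unigrams
  result
}
predict_next_word("thanks for the")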