PURPOSE:
This project is a milestone report that uses Natural Language Processing (NLP) to examine the SwiftKey data provided for the course.
The goals of this project are to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Obtain feedback on your plans for creating a predictive algorithm and Shiny application.

PROCESSING:

# Get the SwiftKey data set and load it.
# Download the data file.
#if(!file.exists("./mydata_DS10")){dir.create("./mydata_DS10")}
#fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#download.file(fileUrl,destfile="./mydata_DS10/Dataset.zip")

# Unzip the data set to the ./mydata_DS10 directory.
#unzip(zipfile="./mydata_DS10/Dataset.zip",exdir="./mydata_DS10")
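# Optional sanity check (a hedged sketch; the path assumes the unzip
# destination above): confirm the extracted English files are present
# and report their sizes in megabytes before reading them in.
#en_files <- list.files("./mydata_DS10/final/en_US", full.names = TRUE)
#round(file.size(en_files) / 1024^2, 1)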

# Import libraries that are needed for processing in this module.
library(NLP)
library(tm)
## 
## Attaching package: 'tm'
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(RWeka)
library(SnowballC)
library(RColorBrewer)
library(wordcloud)

# Read in the news stream.
news <- readLines(connection_news <- file("final/en_US/en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

# Read in the blog stream.
blogs <- readLines(connection_blog <- file("final/en_US/en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

# Read in the twitter stream.
twitters <- readLines(connection_twitters <- file("final/en_US/en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

# Close the connections for the news, blog, and Twitter streams.
close(connection_news)
close(connection_blog)
close(connection_twitters)

# Get the number of lines in the news, blog, and Twitter streams.
length(news)
## [1] 77259
length(blogs)
## [1] 899288
length(twitters)
## [1] 2360148
# Get the word counts of the news, blog, and Twitter streams.
newsWords <- sum(sapply(gregexpr("\\S+", news), length))
newsWords
## [1] 2643969
blogsWords <- sum(sapply(gregexpr("\\S+", blogs), length))
blogsWords
## [1] 37334131
twitterWords <- sum(sapply(gregexpr("\\S+", twitters), length))
twitterWords
## [1] 30373583
# Find the length of the longest tweet (maximum number of characters).
twitcount <- nchar(twitters)
tmax <- which.max(twitcount)
nchar(twitters[tmax])
## [1] 140
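# Optional: collect the counts computed above into one summary table
# (a convenience sketch that simply restates the line, word, and maximum
# character counts for the three streams).
streamSummary <- data.frame(
  stream    = c("news", "blogs", "twitter"),
  lines     = c(length(news), length(blogs), length(twitters)),
  words     = c(newsWords, blogsWords, twitterWords),
  max_chars = c(max(nchar(news)), max(nchar(blogs)), max(nchar(twitters)))
)
streamSummary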
# Cleanse the news, blog, and Twitter streams by converting
# non-ASCII characters to their byte codes.
cleanedNews <- iconv(news, 'UTF-8', 'ASCII', "byte")
cleanedBlog <- iconv(blogs, 'UTF-8', 'ASCII', "byte")
cleanedTwitter <- iconv(twitters, 'UTF-8', 'ASCII', "byte")
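# Illustration on a made-up string: the "byte" argument replaces non-ASCII
# characters with their byte codes rather than dropping them (in a UTF-8
# session the line below returns "caf<c3><a9> anyone?").
iconv("caf\u00e9 anyone?", "UTF-8", "ASCII", "byte")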

# Sample a subset of 15,000 tweets to keep processing manageable.
set.seed(1234)  # fix the random seed so the sample is reproducible
twitterSample <- sample(cleanedTwitter, 15000)
doc.vec <- VectorSource(twitterSample)
doc.corpus <- Corpus(doc.vec)

# Convert the text to lower case (wrapped in content_transformer() so the
# documents keep their tm classes).
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
# Remove numbers from the stream.
doc.corpus <- tm_map(doc.corpus, removeNumbers)
# Remove punctuation marks from the stream.
doc.corpus <- tm_map(doc.corpus, removePunctuation)
# Strip extra whitespace from the stream.
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
## Ensure plain text for the stream. 
doc.corpus <- tm_map(doc.corpus, PlainTextDocument)

# Define unigram, bigram, and trigram tokenizers for the n-gram set-up process.
uniGramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biGramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triGramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
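# Quick check of a tokenizer on a made-up phrase (illustrative only); it
# should split the phrase into overlapping two-word sequences such as
# "thanks for", "for the", "the follow".
biGramTokens("thanks for the follow")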

# Set up term-document matrices for the unigrams, bigrams, and trigrams.
uniGramMatrice <- TermDocumentMatrix(doc.corpus, control = list(tokenize = uniGramTokens))
biGramMatrice <- TermDocumentMatrix(doc.corpus, control = list(tokenize = biGramTokens))
triGramMatrice <- TermDocumentMatrix(doc.corpus, control = list(tokenize = triGramTokens))
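# Optional check (sketch): the term-document matrices are large and sparse;
# their dimensions (terms x documents) give a rough sense of the vocabulary
# size at each n-gram order for this sample.
dim(uniGramMatrice)
dim(biGramMatrice)
dim(triGramMatrice)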

# Find unigrams occurring at least 500 times and build a frequency table.
UnifreqTerms <- findFreqTerms(uniGramMatrice, lowfreq = 500)
UnitermFrequency <- rowSums(as.matrix(uniGramMatrice[UnifreqTerms,]))
UnitermFrequency <- data.frame(unigram=names(UnitermFrequency), frequency=UnitermFrequency)

## Generate a plot for the unigram. 
unigram_plot <- ggplot(UnitermFrequency, aes(x=reorder(unigram, frequency), y=frequency)) +
    geom_bar(stat = "identity", colour = "black", fill="yellow") +  coord_flip() +
    theme(legend.title=element_blank()) + labs(title = "Unigrams by Frequencies") +
    xlab("Unigrams") + ylab("Frequencies") 
print(unigram_plot)

## Generate a plot for the bigram. 
freqTerms <- findFreqTerms(biGramMatrice, lowfreq = 200)
termFrequency <- rowSums(as.matrix(biGramMatrice[freqTerms,]))
termFrequency <- data.frame(bigram=names(termFrequency), frequency=termFrequency)

bigram_plot <- ggplot(termFrequency, aes(x=reorder(bigram, frequency), y=frequency )) +
  geom_bar(stat = "identity", colour = "black", fill="yellow") +  coord_flip() +
  theme(legend.title=element_blank()) + labs(title = "Bigrams by Frequencies ") +
  xlab("Bigrams") + ylab("Frequencies")
print(bigram_plot)

## Generate a plot for the trigram.
freqTerms <- findFreqTerms(triGramMatrice, lowfreq = 25)
termFrequency <- rowSums(as.matrix(triGramMatrice[freqTerms,]))
termFrequency <- data.frame(trigram=names(termFrequency), frequency=termFrequency)

trigram_plot <- ggplot(termFrequency, aes(x=reorder(trigram, frequency), y=frequency)) +
    geom_bar(stat = "identity", colour = "black", fill="yellow") +  coord_flip() +
    theme(legend.title=element_blank()) + labs(title = "Trigrams by Frequencies") +
    xlab("Trigrams") + ylab("Frequencies")
print(trigram_plot)

# Generate a word cloud of the most frequent terms in the sample.
wordcloud(doc.corpus, max.words = 100, random.order = FALSE, rot.per = 0.40,
          use.r.layout = FALSE, colors = brewer.pal(6, "Dark2"))


SUMMARY:
A data set from SwiftKey was downloaded and processed using an R Markdown program. The data was obtained from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The project addresses the field of Natural Language Processing (NLP). The data consists of three major streams: news, blogs, and Twitter. The data was analyzed to produce summary statistics. The line counts of the news, blog, and Twitter streams were 77,259, 899,288, and 2,360,148, respectively. The word counts of the news, blog, and Twitter streams were 2,643,969, 37,334,131, and 30,373,583, respectively. In addition, the maximum number of characters in a single tweet was 140.

A cleansing algorithm was applied to the data streams: punctuation marks, extra whitespace, and numbers were removed, and the text was converted to lower case. Subsequently, the n-gram process was invoked to extract unigrams, bigrams, and trigrams. Term-document matrices were set up to help generate frequency plots for the unigrams, bigrams, and trigrams.
A unigrams-by-frequency plot was generated; the most popular unigrams included “the”, “you”, “and”, “for”, and “that”. A bigrams-by-frequency plot was generated; the most popular bigrams included “in the”, “for the”, “of the”, “to be”, and “on the”. A trigrams-by-frequency plot was generated; the most popular trigrams included “thanks for the”, “I want to”, “cant wait to”, “for the follow”, and “thank you for”. Finally, a word cloud was generated, which showed that popular words include “the”, “you”, “and”, “for”, “that”, and “have”.

The next part of the project will attempt to determine an appropriate algorithm for predictive modeling. A Shiny application will then be created to demonstrate the predictive model, along the lines sketched below.
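As a rough illustration of the intended direction (a minimal sketch only, not the final algorithm; the function name predictNextWord and the reuse of the trigram frequency table built above are assumptions), a simple frequency-based lookup could return the most common completions of a two-word prefix:

# Hypothetical helper: given a two-word prefix and a trigram frequency
# data frame (columns "trigram" and "frequency", as built during processing),
# return the last words of the most frequent matching trigrams.
predictNextWord <- function(prefix, trigramFreq, n = 3) {
  hits <- trigramFreq[grepl(paste0("^", prefix, " "), trigramFreq$trigram), ]
  hits <- hits[order(-hits$frequency), ]
  head(sapply(strsplit(as.character(hits$trigram), " "), tail, 1), n)
}
# Example call (illustrative): predictNextWord("thanks for", termFrequency)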