Synopsis

The objectives of this Milestone Project are to perform exploratory analysis of the data and to build predictive text models. SwiftKey, a corporate partner of Johns Hopkins University, builds a smart keyboard that makes it easier for people to type on mobile devices. The smart keyboard presents three options for what the next word might be.

The proposal for this Milestone Project is to develop a predictive text model that finds the most probable next word given a preliminary word or piece of text.

Natural Language Processing (NLP) concepts are applied in the exploration and analysis for this project. The dataset provided uses the en_US locale and comes from three separate sources: blogs, news articles and Twitter.

Set Seed

The set.seed function is called to make the results reproducible.

set.seed(200316)

Load Appropriate Libraries

# Set the file location and appropriate Libraries.
setwd("c:/Users/user/Desktop/Project/Week1")
library(tm)
## Loading required package: NLP
library(NLP)
library(wordcloud)
## Loading required package: RColorBrewer
library(RWeka)
library(SnowballC)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(slam)

Getting Data

The datasets are downloaded from the fileUrl shown below.

# Specify the file URL and download the files.
fileUrl <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# mode="wb" ensures the zip downloads correctly as a binary file on Windows.
download.file(fileUrl, destfile="Coursera-SwiftKey.zip", mode="wb")

# Unzip the downloaded zip file.
unzip("Coursera-SwiftKey.zip", exdir="SwiftKey")
# List all the files after unzip.
list.files("./SwiftKey/final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
list.files("./SwiftKey/final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
# Read data files and assign a variable for each data files.
blogs <- readLines(conBlog <- file("./SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8"),skipNul = TRUE)
close(conBlog)

# Data Checking for Blogs.
head(blogs,3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
tail(blogs,3)
## [1] "Plus, I have also been allowing myself not to get <U+0091>stressed<U+0092> over things that have not been done! If the ironing is not done right now, it<U+0092>s not the end of the world! If that phone call is made tomorrow rather than today, then that<U+0092>s OK too! Living in the moment and allowing myself the time to get <U+0091>back to feeling great<U+0092>!"
## [2] "(5) What's the barrier to entry and why is the business sustainable?"                                                                                                                                                                                                                                                               
## [3] "In response to an over-whelming number of comments we sat down and created a list of do (s) and don<U+0092>t (s) <U+0096> these recommendations are easy to follow and except for - adding some herbs to your rinse . So let<U+0092>s get begin<U+0085>"
summary(blogs)
##    Length     Class      Mode 
##    899288 character character
file_news <- file("./SwiftKey/final/en_US/en_US.news.txt", open="rb")
news <- readLines(file_news,encoding="UTF-8",skipNul=TRUE)
close(file_news)

# Data Checking for News.
head(news,3)
## [1] "He wasn't home alone, apparently."                                                                                                                                                
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                        
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
tail(news,3)
## [1] "But I'm in the mood. After six or more months of chill and ice crystals in Northeast Ohio, the ground is soft and fragrant. Seemingly overnight, things are growing as if we were in the tropics. We are again producing fruit of the earth: sweet corn, mightily fragrant herbs, deep green and tender broccoli."                                                      
## [2] "That starts this Sunday at Chivas. The Goats aren't a great team, but they just beat one (a 1-0 win over Salt Lake at Rio Tinto). They also have the one player who can rival Roger Espinoza as \"The Best Guy in MLS That No One Talks About Because He Doesn't Play in New York, LA or the Pacific Northwest\" in goalkeeper Dan Kennedy. These will be tough points."
## [3] "The only outwardly religious adornment was a billboard-sized banner with an image of Our Lady of Charity, patron saint of Cuba, hanging on the side of the National Library."
summary(news)
##    Length     Class      Mode 
##   1010242 character character
twitter <- readLines(con <- file("./SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8"),skipNul = TRUE)
close(con)

# Data Checking for Twitter.
head(twitter,3)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
tail(twitter,3)
## [1] "u welcome"                                                                                                     
## [2] "It is #RHONJ time!!"                                                                                           
## [3] "The key to keeping your woman happy= attention, affection, treat her like a queen and sex her like a pornstar!"
summary(twitter)
##    Length     Class      Mode 
##   2360148 character character

Data Files

Basic Data Information

# Create a basic report about the data sets.

# Getting information about each File Size 
blogs_size <- file.info("./SwiftKey/final/en_US/en_US.blogs.txt")$size / 1024^2
news_size <- file.info("./SwiftKey/final/en_US/en_US.news.txt")$size / 1024^2
twitter_size <- file.info("./SwiftKey/final/en_US/en_US.twitter.txt")$size / 1024^2

# Getting information about Word Counts
blogs_words<-sum(sapply(gregexpr("\\S+",blogs), length))
news_words<-sum(sapply(gregexpr("\\S+",news), length))
twitter_words<-sum(sapply(gregexpr("\\S+",twitter), length))

# Getting information about Number of Lines
blogs_length <- length(blogs)
news_length <- length(news)
twitter_length <- length(twitter)

# Getting information about Number of Characters
blogs_characters <- sum(nchar(blogs))
news_characters <- sum(nchar(news))
twitter_characters <- sum(nchar(twitter))

# Getting information about Maximum of Characters
blogs_MaxChar <- max(nchar(blogs))
news_MaxChar <- max(nchar(news))
twitter_MaxChar <-max(nchar(twitter))

# Tabulate the results and the basic report summary
df <- data.frame(c("Blogs","News","Twitter"),
                 c("en_US.blogs","en_US.news","en_US.twitter"),
                 c(as.integer(blogs_size), as.integer(news_size), as.integer(twitter_size)),
                 c(blogs_words, news_words, twitter_words),
                 c(blogs_length, news_length, twitter_length),
                 c(blogs_characters, news_characters, twitter_characters),
                 c(blogs_MaxChar, news_MaxChar, twitter_MaxChar),
                 stringsAsFactors = FALSE)
colnames(df) <- c("Data", "File Name", "File Size (MB)", "Word Counts", "Number Of Lines", "Number Of Characters", "Maximum Of Characters")

df
##      Data     File Name File Size (MB) Word Counts Number Of Lines
## 1   Blogs   en_US.blogs            200    37334131          899288
## 2    News    en_US.news            196    34372530         1010242
## 3 Twitter en_US.twitter            159    30373583         2360148
##   Number Of Characters Maximum Of Characters
## 1            206824505                 40833
## 2            203223159                 11384
## 3            162096241                   140

The table above summarizes the File Size, Word Counts, Number of Lines, Number of Characters and Maximum of Characters for each data file.
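For a more polished rendering in the knitted report, the same data frame could also be displayed with knitr::kable. This is a minimal sketch; it assumes the knitr package is installed.

# Optional: render the summary data frame as a formatted table (requires knitr).
library(knitr)
kable(df, caption = "Basic summary of the three data files")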

Cleaning Data

# Convert text to ASCII; non-convertible characters are substituted with their byte codes (sub="byte")
my_blogs <- iconv(blogs, 'UTF-8', 'ASCII', "byte")
my_news <- iconv(news, 'UTF-8', 'ASCII', "byte")
my_twitter <- iconv(twitter, 'UTF-8', 'ASCII', "byte")

# Remove NAs
my_blogs <- (my_blogs[!is.na(my_blogs)])
my_news <- (my_news[!is.na(my_news)])
my_twitter <- (my_twitter[!is.na(my_twitter)])

Data Sets

Data Treatment

Data treatment is the processing of the raw text to produce useful information. The processes involved are as below:

  1. Combine the three data sets, giving a total of 4,269,678 text entries.

  2. Sample 1% of the combined data to simplify calculations, giving 42,697 text entries.

  3. Convert the sample into a corpus.

  4. Clean and tidy the corpus.

# Combination of three data files and check its length.
combined <- c(my_blogs,my_news,my_twitter)
length(combined)
## [1] 4269678
# Create a sample by extracting 1% of the available data 
combinedSample <- sample(combined,round(0.01*length(combined)))
length(combinedSample)
## [1] 42697
# Form a corpus from the sample.
docs_vector<-VectorSource(combinedSample)
corpus <- Corpus(docs_vector)

# Clean and Tidy corpus
# remove all punctuations
corpus <- tm_map(corpus, removePunctuation)
# remove all numbers
corpus <- tm_map(corpus, removeNumbers)
# convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# remove all stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# remove whitespace
corpus <- tm_map(corpus, stripWhitespace)
# convert to plain text document
corpus <- tm_map(corpus, PlainTextDocument)
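
To make the effect of these transformations concrete, the same pipeline can be applied to a single made-up sentence. This is a minimal sketch; the toy object and its input text are illustrative only.

# Demonstrate the cleaning steps on one made-up sentence.
toy <- Corpus(VectorSource("The quick brown Fox, in 2016, can't wait!!"))
toy <- tm_map(toy, removePunctuation)
toy <- tm_map(toy, removeNumbers)
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removeWords, stopwords("english"))
toy <- tm_map(toy, stripWhitespace)
as.character(toy[[1]])
# e.g. " quick brown fox cant wait"

Note that removing punctuation before stopwords turns contractions such as "can't" into cant, which is why combinations like cant wait appear in the n-gram counts below.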

Generate n-gram Model

# Build an n-gram frequency table with a function.
N_gram <- function(n){
        options(mc.cores=1)
        # Tokenize the corpus into n-grams of length n.
        Ngram_tokenizer <- function(x)
                NGramTokenizer(x, Weka_control(min = n, max = n))
        Ngram_TDM <- TermDocumentMatrix(corpus, control=list(tokenize=Ngram_tokenizer))
        # Sum counts across all documents and keep the 30 most frequent n-grams.
        dat <- as.matrix(rollup(Ngram_TDM,2,na.rm=TRUE, FUN = sum))
        dat <- data.frame(Word=rownames(dat),Frequency=dat[,1])
        dat <- dat[order(-dat$Frequency),][1:30,]
        # Fix the factor levels so plots keep the frequency ordering.
        dat$Word <- factor(dat$Word, as.character(dat$Word))
        return(dat)
}
# Running n-gram with unigram, bigram and trigram
unigram <- N_gram(1)
bigram <- N_gram(2)
trigram <- N_gram(3)
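
As a quick sanity check of what the tokenizer inside N_gram produces, NGramTokenizer can be run directly on a short made-up string (a minimal sketch):

# Example: bigram tokens for one made-up sentence.
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
# e.g. "thanks for" "for the" "the follow"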

Plot Graph

# Plot the 30 most frequent n-grams as a bar chart.
plot_ngram <- function(final){
        final <- final[order(final$Frequency, decreasing = TRUE),]
        ggplot(final, aes(x=Word, y=Frequency)) + ggtitle("Number of Words") +
                xlab("Word(s)")+ geom_bar(stat="identity", fill="pink",
                                          color="black") +
                theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Bar chart for Unigram
plot_ngram(unigram)

From the graph above, we can conclude that the three most common words are will, said and just, each appearing approximately 3,000 times or more.

# Bar chart for Bigram
plot_ngram(bigram)

From the graph above, we can conclude that the most common two-word combinations are right now, new york and cant wait, each with at least 175 occurrences.

# Bar Chart for Trigram
plot_ngram(trigram)

From the graph above, we can conclude that the most common three-word combinations are happy mothers day, new york city and cant wait see (the dropped apostrophes and stopwords are artifacts of the cleaning steps above).

Data Visualization with Word Cloud

wordcloud(corpus, max.words = 200, random.order = FALSE,rot.per=0.25, use.r.layout=FALSE,colors=brewer.pal(8, "Accent"))

Conclusion

In general, the n-grams with the highest frequencies have the greatest probability of occurring after a given word or piece of text. Hence, the highest-frequency n-grams form the core of the prediction model.
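
As an illustration of how such a model could be queried, the sketch below looks up the most frequent trigram that begins with the last two words of the input, and backs off to the bigram and unigram tables when no match is found. This is a minimal sketch, not the project's final model: predict_next is a hypothetical helper, and because only the top 30 n-grams were kept above, its coverage is deliberately tiny.

# Sketch: predict the next word from the n-gram tables built above.
predict_next <- function(phrase){
        words <- unlist(strsplit(tolower(phrase), "\\s+"))
        # Try the trigram table first: match on the last two words.
        if (length(words) >= 2) {
                prefix <- paste(tail(words, 2), collapse = " ")
                hits <- trigram[startsWith(as.character(trigram$Word), paste0(prefix, " ")), ]
                if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$Word[1])))
        }
        # Back off to the bigram table: match on the last word only.
        hits <- bigram[startsWith(as.character(bigram$Word), paste0(tail(words, 1), " ")), ]
        if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$Word[1])))
        # Last resort: the single most frequent unigram.
        as.character(unigram$Word[1])
}

# Example usage (the result depends on the sample drawn above).
predict_next("happy mothers")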

For the next stage of the project, the model will be improved to achieve greater accuracy and to reduce processing time. To that end, n-grams with low frequencies will be eliminated from the prediction model.

Future Plans

  1. Use a larger sample, taking more than the current 42,697 text entries.
  2. Compare more levels of n-grams (for example, 4-grams).
  3. Develop a Shiny app for text prediction.