Overview

The main purpose of the project is to build a predictive text data product that makes it easier for people to type on their mobile devices. In partnership with Coursera, SwiftKey (http://swiftkey.com/) has provided the data used in this project. The data consists of text files compiled from news articles, Twitter and blogs in four languages: English, Finnish, German and Russian. For the purposes of this project only the English text files are used. This report presents an exploratory analysis of each of these text files and briefly introduces the next step, which is building a language model and a predictor for the English language.

Data Acquisition and Cleaning

Downloading the data

library("NLP") #Generics NLP Function set
library("openNLP") #Generics NLP Function set
library("tm") #For Text Mining & Corpus workings
library("RWeka") #For n-gram vector generation
library("ggplot2") #Charting functionality
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
datafile <- "Coursera-SwiftKey.zip"
if(!file.exists(datafile)) {
  fileURL <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(fileURL, destfile = datafile, method = "internal")
  unzip(datafile)
}
# list unzipped files 
list.files("final", recursive = TRUE)
##  [1] "de_DE/de_DE.blogs.txt"   "de_DE/de_DE.news.txt"   
##  [3] "de_DE/de_DE.twitter.txt" "en_US/en_US.blogs.txt"  
##  [5] "en_US/en_US.news.txt"    "en_US/en_US.twitter.txt"
##  [7] "fi_FI/fi_FI.blogs.txt"   "fi_FI/fi_FI.news.txt"   
##  [9] "fi_FI/fi_FI.twitter.txt" "ru_RU/ru_RU.blogs.txt"  
## [11] "ru_RU/ru_RU.news.txt"    "ru_RU/ru_RU.twitter.txt"

The English text files are used for the project as listed below:

Eng_TextFiles <- list.files("final/en_US")
Eng_TextFiles
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Basic data summary

Size of text files
TxtInfo <- paste("final/en_US/",Eng_TextFiles, sep="")
sizes <- paste("Size of ",TxtInfo, " = ", file.info(TxtInfo)$size, "bytes")
sizes
## [1] "Size of  final/en_US/en_US.blogs.txt  =  210160014 bytes"  
## [2] "Size of  final/en_US/en_US.news.txt  =  205811889 bytes"   
## [3] "Size of  final/en_US/en_US.twitter.txt  =  167105338 bytes"
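
Raw byte counts are hard to compare at a glance. As a minimal sketch, assuming the byte counts printed above, the sizes can be converted to mebibytes. The helper name `bytes_to_mb` is made up for this illustration and is not part of the original analysis.

```r
# Hypothetical helper (an assumption for illustration): bytes -> MiB, one decimal
bytes_to_mb <- function(bytes) round(bytes / 1024^2, 1)

# Byte counts copied from the file.info() output above
size_bytes <- c(blogs = 210160014, news = 205811889, twitter = 167105338)

size_mb <- bytes_to_mb(size_bytes)
print(size_mb)
```
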
Number of lines in each text file, with a preview of the first few lines
  1. Twitter text file
# passing the path lets readLines() open and close the connection itself
twitterLines <- readLines("final/en_US/en_US.twitter.txt")
NumLines_twitter <- length(twitterLines)

The twitter text file has 2360148 lines in total.

Preview of the twitter data layout

head(twitterLines, 3)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
  2. Blog text file
# passing the path lets readLines() open and close the connection itself
blogLines <- readLines("final/en_US/en_US.blogs.txt")
NumLines_blog <- length(blogLines)

The blog text file has 899288 lines in total.

Preview of the blogs data layout

head(blogLines, 3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
  3. News text file
# passing the path lets readLines() open and close the connection itself
newsLines <- readLines("final/en_US/en_US.news.txt")
NumLines_news <- length(newsLines)

The news text file has 77259 lines in total.

Preview of the news data layout

head(newsLines, 3)
## [1] "He wasn't home alone, apparently."                                                                                                                                                
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                        
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
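
The news line count reported above is far lower than the other two, even though the file is of similar size. One possible explanation, offered here as an assumption and not verified for this run, is that a control character inside the file ends a text-mode read early on some platforms. The sketch below reads through a binary-mode connection instead. It runs on a small temporary file, not on the real data, and the name `demoLines` is illustrative.

```r
# Build a small temporary file whose second line contains a SUB (0x1A) byte.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\nsecond"), as.raw(0x1A),
           charToRaw(" line\nthird line\n")), tmp)

# Open in binary mode ("rb") so that no byte is interpreted as end-of-file,
# and close the connection explicitly once the read is done.
con <- file(tmp, open = "rb")
demoLines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Number of lines read from the temporary file
print(length(demoLines))
unlink(tmp)
```

If the same pattern were applied to final/en_US/en_US.news.txt, the resulting count could be compared with the one reported above.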

Loading sample data

A subset of each text file is used as sample data so that tokenization and analysis run faster. The first 3000 lines of each text file are taken for analysis.

sampleData_twitter <- head(twitterLines, 3000)
sampleData_blog <- head(blogLines, 3000)
sampleData_news <- head(newsLines, 3000)
# c() keeps every line as its own document; paste() would glue one tweet,
# one blog line and one news line together into a single string
sampleData <- c(sampleData_twitter, sampleData_blog, sampleData_news)
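
Taking the first 3000 lines is fast, but it may not be representative if a file happens to be ordered in some way. A reproducible random sample is one alternative. The sketch below is only an illustration: the helper `sample_lines`, the seed value and the toy input are assumptions, not part of the original workflow.

```r
# Hypothetical helper: draw n lines at random, reproducibly
sample_lines <- function(lines, n, seed = 1234) {
  set.seed(seed)                                   # fixed seed, arbitrary value
  idx <- sample.int(length(lines), size = min(n, length(lines)))
  lines[idx]
}

# Toy stand-in for a vector returned by readLines()
all_lines <- sprintf("line %d", 1:10000)

sampled <- sample_lines(all_lines, 3000)
print(length(sampled))
```

With the real data, `all_lines` would be replaced by a vector such as `twitterLines`.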

Sample data cleaning: Tokenization & Profanity filtering

The goal is to identify appropriate tokens such as words, punctuation and numbers, and to remove profanity and other words we do not want to predict. The data was cleaned using the tm package. A list of profanity words can be downloaded from http://www.cs.cmu.edu/~biglou/resources/bad-words.txt; the downloaded file is saved as ProfanityWords.txt.

The main corpus is built first; numbers, extra whitespace and punctuation are then removed and all content is converted to lowercase.

# Building the main corpus
corpus <- VCorpus(VectorSource(sampleData)) 

# removing numbers
corpus <- tm_map(corpus, removeNumbers) 

# removing whitespaces
corpus <- tm_map(corpus, stripWhitespace) 

# lowercasing all contents
corpus <- tm_map(corpus, content_transformer(tolower)) 

# removing special characters
corpus <- tm_map(corpus, removePunctuation) 

# removing the profanity words (one word per line in the downloaded file)
ProfanityWordsVector <- readLines("~/final/ProfanityWords.txt")
corpus <- tm_map(corpus, removeWords, ProfanityWordsVector)
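
To make the effect of these cleaning steps concrete, the sketch below applies roughly equivalent transformations to a single string using base R only. It is an approximation for illustration, not what tm does internally. The function `clean_text`, the sample sentence and the one-word stand-in for the profanity list are all assumptions, and whitespace is squeezed last here so that gaps left by removed words are collapsed.

```r
# Hypothetical base-R approximation of the cleaning pipeline
clean_text <- function(x, bad_words) {
  x <- tolower(x)                              # lowercase everything
  x <- gsub("[[:digit:]]+", "", x)             # drop numbers
  x <- gsub("[[:punct:]]+", "", x)             # drop punctuation
  pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)       # drop unwanted words
  x <- gsub("\\s+", " ", x)                    # squeeze repeated whitespace
  trimws(x)
}

# "darn" stands in for the real profanity list
cleaned <- clean_text("Hello, World! 123 darn  it", bad_words = c("darn"))
print(cleaned)
```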

Sample data analysis

Converting Corpus to Data Frame for data processing

cleanText <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)

Tokenization into single words, bigrams and trigrams

# The tokenizer expects a character vector, so the text column is passed instead of the whole data frame
NGram1Token <- NGramTokenizer(cleanText$text, Weka_control(min = 1, max = 1))
NGram2Token <- NGramTokenizer(cleanText$text, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
NGram3Token <- NGramTokenizer(cleanText$text, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))

# Understanding words & word pairs, analyze the distributions of word frequencies
oneWord <- data.frame(table(NGram1Token))
twoWord <- data.frame(table(NGram2Token))
threeWord <- data.frame(table(NGram3Token))
oneWordSorted <- oneWord[order(oneWord$Freq,decreasing = TRUE),]
twoWordSorted <- twoWord[order(twoWord$Freq,decreasing = TRUE),]
threeWordSorted <- threeWord[order(threeWord$Freq,decreasing = TRUE),]
Top20OneWord <- oneWordSorted[1:20,]
colnames(Top20OneWord) <- c("Word","Frequency")
Top20TwoWord <- twoWordSorted[1:20,]
colnames(Top20TwoWord) <- c("Word","Frequency")
Top20ThreeWord <- threeWordSorted[1:20,]
colnames(Top20ThreeWord) <- c("Word","Frequency")
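
The n-gram counting above relies on RWeka. As a sketch of what the tokenizer and `table()` produce together, the same idea can be shown in base R on a toy sentence. The helper `make_ngrams` and the example text are assumptions for illustration and do not reproduce RWeka's delimiter handling.

```r
# Hypothetical helper: build all n-grams from a vector of word tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- strsplit("the cat sat on the cat sat", " ", fixed = TRUE)[[1]]

bigrams <- make_ngrams(tokens, 2)
# Frequency table, most frequent n-grams first
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigram_counts)
```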

Plotting the data

Top 20 Single words

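# bars are ordered by decreasing frequency instead of alphabetically
ggplot(Top20OneWord, aes(x = reorder(Word, -Frequency), y = Frequency)) + geom_bar(stat = "identity", fill = "red") + geom_text(aes(label = Frequency), vjust = -0.2) + xlab("Word")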

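Top 20 2-grams

# bars are ordered by decreasing frequency instead of alphabetically
ggplot(Top20TwoWord, aes(x = reorder(Word, -Frequency), y = Frequency)) + geom_bar(stat = "identity", fill = "green") + geom_text(aes(label = Frequency), vjust = -0.2) + xlab("Word") + theme(axis.text.x = element_text(angle = 45, hjust = 1))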

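Top 20 3-grams

# bars are ordered by decreasing frequency instead of alphabetically
ggplot(Top20ThreeWord, aes(x = reorder(Word, -Frequency), y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.2) + xlab("Word") + theme(axis.text.x = element_text(angle = 45, hjust = 1))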

Next steps

For the final analysis, text modelling and text prediction, the following steps remain:

  1. N-Gram modelling of the full text data sets
  2. Optimize model for low memory utilization
  3. Implement model as a Shiny App
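
As a rough idea of what the modelling step could look like, the sketch below predicts the next word from bigram counts and falls back to the most frequent single word when the input word was never seen. It is one possible shape under stated assumptions, not the final model: the toy sentence, the names `bigram_freq`, `unigram_freq` and `predict_next`, and the simple fallback rule are all illustrative.

```r
# Toy training text; a real model would use the cleaned corpus
tokens <- strsplit("i love you i love r i like you", " ", fixed = TRUE)[[1]]

# Count every (first word, second word) pair
bigram_freq <- as.data.frame(
  table(first = head(tokens, -1), second = tail(tokens, -1)),
  stringsAsFactors = FALSE
)
bigram_freq <- bigram_freq[bigram_freq$Freq > 0, ]

# Single-word counts, most frequent first, used as the fallback
unigram_freq <- sort(table(tokens), decreasing = TRUE)

# Hypothetical predictor: most frequent follower, else most frequent word
predict_next <- function(word) {
  candidates <- bigram_freq[bigram_freq$first == word, ]
  if (nrow(candidates) == 0) return(names(unigram_freq)[1])
  candidates$second[which.max(candidates$Freq)]
}

print(predict_next("i"))
print(predict_next("unseenword"))
```

Keeping only the most frequent followers for each word would be one way to address the memory constraint listed above.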