The ultimate goal of this Data Science Capstone is to create a prediction algorithm that runs on mobile devices and suggests the next word as people type. The training data can be downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The dataset used in this capstone consists of the text files in the en_US folder.
Before building the next-word prediction model, this milestone report is created for two purposes: (1) to present an exploratory analysis of the words in the corpus, and (2) to propose a prediction algorithm and a Shiny app.
In this report, the data are downloaded and sampled, the three en_US text files are tokenized, the most frequent words are explored, and a plan for the prediction algorithm and Shiny app is outlined.
library(NLP)
library(tm)
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
### download zipped file to local drive and unzip it
if (!file.exists("./capstone_dataset.zip")){
    url <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    download.file(url, "./capstone_dataset.zip", mode = "wb")
    unzip("capstone_dataset.zip")
}
# read the twitter text file
twittercon <- file("./final/en_US/en_US.twitter.txt")
twitterlines <- readLines(twittercon, encoding = "UTF-8", skipNul = TRUE)
nline_twitter <- length(twitterlines)
# read the news text file
newscon <- file("./final/en_US/en_US.news.txt")
newslines <- readLines(newscon, encoding = "UTF-8", skipNul = TRUE)
nline_news <- length(newslines)
# read the blogs text file
blogscon <- file("./final/en_US/en_US.blogs.txt")
blogslines <- readLines(blogscon, encoding = "UTF-8", skipNul = TRUE)
nline_blogs <- length(blogslines)
close(twittercon)
close(newscon)
close(blogscon)
# set seed and sampling ratio
set.seed(123)
samplingratio <- 0.002
# sample lines from the twitter data for training
twittersampleindex <- sample(1:nline_twitter, floor(samplingratio * nline_twitter), replace = FALSE)
twitter_sample <- twitterlines[twittersampleindex]
# sample lines from the news data for training
newsampleindex <- sample(1:nline_news, floor(samplingratio * nline_news), replace = FALSE)
news_sample <- newslines[newsampleindex]
# sample lines from the blogs data for training
blogsampleindex <- sample(1:nline_blogs, floor(samplingratio * nline_blogs), replace = FALSE)
blogs_sample <- blogslines[blogsampleindex]
The basic summary table for the en_US dataset is shown below.
en_US_dataset
## Number of lines Number of words
## Twitter 2360148 32793399
## News 1010242 36721087
## Blog 899288 39120549
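The code that builds `en_US_dataset` is not shown in this report. A minimal sketch, assuming the full line vectors read above and approximating word counts by splitting each line on whitespace, would be:
# count words in a character vector by splitting each line on whitespace
count_words <- function(lines) sum(lengths(strsplit(lines, "\\s+")))

# summary table of line counts and (approximate) word counts for the three files
en_US_dataset <- data.frame(
  "Number of lines" = c(nline_twitter, nline_news, nline_blogs),
  "Number of words" = c(count_words(twitterlines),
                        count_words(newslines),
                        count_words(blogslines)),
  row.names = c("Twitter", "News", "Blog"),
  check.names = FALSE
)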
To save computational time, I build a function for word tokenization of a text file. The input of this function is the sampled text data obtained above.
# Build a function for word tokenization
Tokenization_function <- function(textlines){
    text_sample <- gsub("[!;:,.?~#$%&()0-9]", " ", textlines)   # remove special symbols and digits
    # expand common contractions to their full forms
    text_sample <- gsub("'m", " am", text_sample)               # "I'm" -> "I am"
    text_sample <- gsub("'re", " are", text_sample)             # "you're" -> "you are"
    text_sample <- gsub("'ve", " have", text_sample)            # "I've" -> "I have"
    text_sample <- gsub("'d like", " would like", text_sample)  # "I'd like" -> "I would like"
    text_sample <- gsub("won't", "will not", text_sample)       # "won't" -> "will not"
    text_sample <- gsub("n't", " not", text_sample)             # "don't" -> "do not"
    text_sample <- gsub("'ll", " will", text_sample)            # "I'll" -> "I will"
    text_sample <- removeNumbers(text_sample)                   # remove numbers
    text_sample <- removePunctuation(text_sample)               # remove punctuation
    text_sample <- tolower(text_sample)                         # convert to lower case
    text_sample <- removeWords(text_sample, stopwords("en"))    # remove English stop words
    # create a corpus and compute the term-document matrix
    myCorpus <- Corpus(VectorSource(text_sample))
    myTextMatrix <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
    myToken <- sort(rowSums(as.matrix(myTextMatrix)), decreasing = TRUE)
    Token <- data.frame(Name = names(myToken), Count = myToken)
    return(Token)
}
For the Twitter text file, the tokens are retrieved below.
Twitter_Token <- Tokenization_function(twitter_sample)
The word cloud of the Twitter tokens is displayed below.
wordcloud(Twitter_Token$Name, Twitter_Token$Count, max.words = 100)
The barplot of the 10 most popular Twitter tokens is displayed below.
# number of bars in the barplot
nbars <- 10
ggplot(Twitter_Token[1:nbars, ], aes(x = factor(Name), y = Count)) +
    geom_bar(stat = "identity") +
    xlab("Token Names") + ylab("Count") +
    ggtitle("Counts of the Most Popular Unigrams from the Twitter Text File") +
    theme(axis.text.x = element_text(colour = "black", angle = 30, size = 16))
For the News text file, the tokens are retrieved below.
News_Token <- Tokenization_function(news_sample)
The word cloud of the News tokens is displayed below.
wordcloud(News_Token$Name, News_Token$Count, max.words = 100)
The barplot of the 10 most popular News tokens is displayed below.
# number of bars in the barplot
nbars <- 10
ggplot(News_Token[1:nbars, ], aes(x = factor(Name), y = Count)) +
    geom_bar(stat = "identity") +
    xlab("Token Names") + ylab("Count") +
    ggtitle("Counts of the Most Popular Unigrams from News") +
    theme(axis.text.x = element_text(colour = "black", angle = 30, size = 16))
For the Blogs text file, the tokens are retrieved below.
Blogs_Token <- Tokenization_function(blogs_sample)
The word cloud of the Blogs tokens is displayed below.
wordcloud(Blogs_Token$Name, Blogs_Token$Count, max.words = 100)
The barplot of the 10 most popular Blogs tokens is displayed below.
# number of bars in the barplot
nbars <- 10
ggplot(Blogs_Token[1:nbars, ], aes(x = factor(Name), y = Count)) +
    geom_bar(stat = "identity") +
    xlab("Token Names") + ylab("Count") +
    ggtitle("Counts of the Most Popular Unigrams from Blogs") +
    theme(axis.text.x = element_text(colour = "black", angle = 30, size = 16))
The words-per-line ratios for the Twitter, News, and Blogs text files are 13.895, 36.349, and 43.502, respectively. Although the Twitter file has more than twice as many lines as the News and Blogs files, the total word counts of the three files are close to each other. The likely reason is that tweets are meant to express personal thoughts and feelings quickly, which keeps sentences very short. Blogs, in contrast, range from article-style posts to personal diaries, so their sentences are more elaborate than tweets. News text is written professionally, producing articles that are both well formed and concise.
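These ratios can be reproduced directly from the summary table; a minimal sketch, assuming the en_US_dataset object shown above, is:
# words-per-line ratio for each source, from the summary table above
round(en_US_dataset[, "Number of words"] / en_US_dataset[, "Number of lines"], 3)
# expected: Twitter ~13.895, News ~36.349, Blog ~43.502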
Four of the ten most popular tokens in the Twitter text ("good", "like", "love", and "thanks") show that tweets lean toward personal feelings. The tokens in the News and Blogs texts are more neutral, and their counts are similar to each other, except that the occurrence of the word "said" in the News text is exceptionally high, which is unsurprising for reported speech.
From the two observations above, we can see that different types of text have their own signatures, reflecting the purpose and attributes of each source.
Everyone has his or her own writing style and habits. The next-word prediction algorithm could perform better if each user had a customized training database. Based on the first few sentences a user types, I can classify the user's writing style and assign the most suitable training database, such as the Twitter or Blogs text.
For a given training database, I will test several algorithms that use bigram, trigram, and four-gram models for next-word prediction, and select the one with the highest accuracy on a test dataset. After choosing the best prediction algorithm, I will add the sentences typed by users to the training database, so that the typing Shiny app becomes truly customized for each user.
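As a concrete illustration of the planned n-gram approach (not the final implementation; build_ngrams and predict_next_word are hypothetical helper names), a minimal sketch of trigram-to-bigram backoff prediction could look like this:
# Sketch of n-gram next-word prediction with simple trigram-to-bigram backoff.
# build_ngrams() and predict_next_word() are illustrative helpers, not the final design.
build_ngrams <- function(lines, n) {
    words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
    words <- words[words != ""]
    if (length(words) < n) return(sort(table(character(0)), decreasing = TRUE))
    grams <- vapply(seq_len(length(words) - n + 1),
                    function(i) paste(words[i:(i + n - 1)], collapse = " "),
                    character(1))
    sort(table(grams), decreasing = TRUE)   # n-gram counts, most frequent first
}

predict_next_word <- function(phrase, bigrams, trigrams) {
    w <- unlist(strsplit(tolower(phrase), "[^a-z']+"))
    w <- w[w != ""]
    if (length(w) == 0) return(NA_character_)
    # try the trigram context (last two words) first
    if (length(w) >= 2) {
        hits <- grep(paste0("^", w[length(w) - 1], " ", w[length(w)], " "),
                     names(trigrams), value = TRUE)
        if (length(hits) > 0) return(sub(".* ", "", hits[1]))
    }
    # back off to the bigram context (last word)
    hits <- grep(paste0("^", w[length(w)], " "), names(bigrams), value = TRUE)
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))
    NA_character_                            # no match in either table
}

# example usage on the sampled twitter lines
bigrams  <- build_ngrams(twitter_sample, 2)
trigrams <- build_ngrams(twitter_sample, 3)
predict_next_word("thanks for the", bigrams, trigrams)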