Executive Summary

This milestone report was created as part of the Coursera Data Science Capstone project to provide an overview of the data we will work with. Here, we summarize the data processing procedure and present the results of an exploratory analysis of three English datasets: Twitter, News, and Blogs. Finally, we outline our plan for building a prediction algorithm and a Shiny app.

Summary Statistics

The three datasets used in this project are loaded:

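The loading chunk is not echoed in this report; a minimal sketch of what it might look like, based on the file paths in the warnings below (the en_US.blogs.txt path and the use of file.info() for the sizes are assumptions; the objects twitter_sz, news_sz, and blogs_sz are reused in the summary table):

library(stringi)   # stri_count_words() is used for the word counts below

# Read the three raw files line by line (this is what triggers the
# embedded-nul and incomplete-final-line warnings shown below)
twitter <- readLines("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
news    <- readLines("./Coursera-Swiftkey/final/en_US/en_US.news.txt", encoding = "UTF-8")
blogs   <- readLines("./Coursera-Swiftkey/final/en_US/en_US.blogs.txt", encoding = "UTF-8")

# File sizes in megabytes, reused in the summary table below
twitter_sz <- file.info("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt")$size / 1024^2
news_sz    <- file.info("./Coursera-Swiftkey/final/en_US/en_US.news.txt")$size / 1024^2
blogs_sz   <- file.info("./Coursera-Swiftkey/final/en_US/en_US.blogs.txt")$size / 1024^2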
## Warning: package 'stringi' was built under R version 3.2.5
## Warning in readLines("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt", :
## line 167155 appears to contain an embedded nul
## Warning in readLines("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt", :
## line 268547 appears to contain an embedded nul
## Warning in readLines("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt", :
## line 1274086 appears to contain an embedded nul
## Warning in readLines("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt", :
## line 1759032 appears to contain an embedded nul
## Warning in readLines("./Coursera-Swiftkey/final/en_US/en_US.news.txt",
## encoding = "UTF-8"): incomplete final line found on './Coursera-Swiftkey/
## final/en_US/en_US.news.txt'

Summary statistics for the three datasets:

TBL_Summary <- data.frame(File = c("en_US.twitter","en_US.news","en_US.blogs"),
                            Size_MB = c(twitter_sz, news_sz, blogs_sz),
                            Lines_Count = c(length(twitter),length(news),length(blogs)),
                            Word_Count = c(sum(stri_count_words(twitter)),sum(stri_count_words(news)),sum(stri_count_words(blogs))),
                            Word_Average = c(mean(stri_count_words(twitter)),mean(stri_count_words(news)),mean(stri_count_words(blogs))))
TBL_Summary
##            File  Size_MB Lines_Count Word_Count Word_Average
## 1 en_US.twitter 159.3641     2360148   30093369     12.75063
## 2    en_US.news 196.2775       77259    2674536     34.61779
## 3   en_US.blogs 200.4242      899288   37546246     41.75108

Plotting the distribution of the number of words per line for each dataset:

library(ggplot2)
qplot(stri_count_words(twitter))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(stri_count_words(news))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(stri_count_words(blogs))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Sampling

We need to use sampling to make the analysis manageable. The following code randomly subsets each of the files to 20,000 lines of text, combines these three samples into one file, and saves it as a new file.

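The sampling chunk itself is not echoed here; a minimal sketch of the idea, assuming 20,000 lines per source and an output file named Sample.txt (both the seed and the file name are assumptions):

set.seed(1234)          # for reproducible sampling
n <- 20000              # lines kept from each source

# Randomly subset each file and combine the three samples into one vector
sample_all <- c(sample(twitter, n), sample(news, n), sample(blogs, n))

# Save the combined sample for the corpus-building step
writeLines(sample_all, "./Coursera-Swiftkey/final/en_US/Sample.txt")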
Creating and Cleaning Corpus

After creating the sample file, we need to build a corpus and then tokenize it. Before creating the corpus, we remove all characters that are not recognized by the libraries used in this section (sketched below, after the package-loading messages).

## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate

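The corpus-creation chunk is not echoed above; a minimal sketch, assuming the combined sample was saved as Sample.txt and that iconv() is used to drop the characters the tm tokenizers cannot handle:

library(tm)

# Read the combined sample and remove characters tm does not recognize
sample_txt <- readLines("./Coursera-Swiftkey/final/en_US/Sample.txt", encoding = "UTF-8")
sample_txt <- iconv(sample_txt, "UTF-8", "ASCII", sub = "")

# Build the corpus from the cleaned character vector
corpus <- VCorpus(VectorSource(sample_txt))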
A number of steps are required to clean up the corpus. We convert all characters to lower case and remove irrelevant strings, numbers, punctuation, redundant white-space, email addresses, and web links. We also remove the standard English stopwords using the stopword set in the tm library; these are the usual suspects among the most frequent words in a corpus. Finally, we remove all profanity from the corpus. The following code takes care of these steps.

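The cleaning chunk is not echoed in this report; a minimal sketch of the transformations described above using standard tm_map() transformers (the gsub() patterns are assumptions; the profanity-list path is taken from the warning below):

# Regex-based transformer for removing web links and e-mail addresses
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- tm_map(corpus, content_transformer(tolower))               # lower-case
corpus <- tm_map(corpus, toSpace, "http[[:alnum:][:punct:]]*")       # web links
corpus <- tm_map(corpus, toSpace, "[[:alnum:].-]+@[[:alnum:].-]+")   # e-mail addresses
corpus <- tm_map(corpus, removeNumbers)                              # numbers
corpus <- tm_map(corpus, removePunctuation)                          # punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))          # standard stopwords

# Profanity filtering, using the word list referenced in the warning below
profanity <- readLines(file("C:/Users/Arash/Desktop/Coursera-SwiftKey/final/Profanity_List.txt"))
corpus <- tm_map(corpus, removeWords, profanity)

corpus <- tm_map(corpus, stripWhitespace)                            # redundant white-space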
## Warning in readLines(file("C:/Users/Arash/Desktop/Coursera-SwiftKey/final/
## Profanity_List.txt"), : incomplete final line found on 'C:/Users/Arash/
## Desktop/Coursera-SwiftKey/final/Profanity_List.txt'

N-grams and Data Analysis

With the preprocessing out of the way, we move to the real analysis. We use unigram (UT), bigram (BT), trigram (TT), and quadgram (QT) analyses to understand the frequency of words and word combinations. The RWeka library is used in this section.

## Warning: package 'RWeka' was built under R version 3.2.5
## Warning: closing unused connection 5 (C:/Users/Arash/Desktop/Coursera-
## SwiftKey/final/Profanity_List.txt)

To do an exploratory analysis on these N-grams, we find the frequency of word combinations in each N-gram:
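The tokenization chunk is not echoed above; a minimal sketch for the bigram table, assuming the other tables (UT_frq_TBL, TT_frq_TBL, QT_frq_TBL) are built the same way with min and max set to 1, 3, and 4 respectively:

library(RWeka)

# Bigram tokenizer; the uni-, tri- and quadgram tokenizers only change min/max
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Term-document matrix of bigrams over the cleaned corpus
BT_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# Frequency table sorted from most to least frequent
BT_frq <- sort(rowSums(as.matrix(BT_tdm)), decreasing = TRUE)
BT_frq_TBL <- data.frame(word = names(BT_frq), freq = BT_frq)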

Below, we present the 10 most frequent words/word combinations in each of the four N-grams.

paste("Unigram Frequencies")
## [1] "Unigram Frequencies"
head(UT_frq_TBL, 10)
##          word freq
## said     said 5932
## will     will 5568
## one       one 5122
## just     just 4535
## like     like 4343
## can       can 4073
## time     time 3571
## get       get 3398
## new       new 3163
## people people 2777
paste("Bigram Frequencies")
## [1] "Bigram Frequencies"
head(BT_frq_TBL, 10)
##                    word freq
## last year     last year  383
## new york       new york  373
## right now     right now  339
## dont know     dont know  325
## years ago     years ago  267
## high school high school  263
## feel like     feel like  230
## first time   first time  216
## last week     last week  216
## im going       im going  209
paste("Trigram Frequencies")
## [1] "Trigram Frequencies"
head(TT_frq_TBL, 10)
##                                  word freq
## new york city           new york city   43
## cant wait see           cant wait see   39
## let us know               let us know   29
## im pretty sure         im pretty sure   27
## new york times         new york times   26
## happy mothers day   happy mothers day   24
## two years ago           two years ago   24
## im looking forward im looking forward   22
## world war ii             world war ii   21
## dont even know         dont even know   20
paste("Quadgram Frequencies")
## [1] "Quadgram Frequencies"
head(QT_frq_TBL, 10)
##                                                                          word
## g fat g saturated                                           g fat g saturated
## advertising fees advertising linking     advertising fees advertising linking
## advertising linking amazoncom amazonca advertising linking amazoncom amazonca
## amazon eu associates programmes               amazon eu associates programmes
## amazon eu content provided                         amazon eu content provided
## amazon services llc amazon                         amazon services llc amazon
## amazon services llc andor                           amazon services llc andor
## amazonca amazoncouk amazonde amazonfr   amazonca amazoncouk amazonde amazonfr
## amazoncom amazonca amazoncouk amazonde amazoncom amazonca amazoncouk amazonde
## amazoncouk amazonde amazonfr amazonit   amazoncouk amazonde amazonfr amazonit
##                                        freq
## g fat g saturated                         9
## advertising fees advertising linking      7
## advertising linking amazoncom amazonca    7
## amazon eu associates programmes           7
## amazon eu content provided                7
## amazon services llc amazon                7
## amazon services llc andor                 7
## amazonca amazoncouk amazonde amazonfr     7
## amazoncom amazonca amazoncouk amazonde    7
## amazoncouk amazonde amazonfr amazonit     7

The following four histograms summarise the 10 most common entries in each N-gram table.

ggplot(UT_frq_TBL[1:10,], aes(x = factor(word, levels = word), y = freq)) +
        geom_bar(stat = "identity", fill = "darkslateblue", colour = "black", width = 0.50) + 
        ggtitle("Unigrams") +
        ylab("Frequency of Words") +
        xlab("Words") + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

ggplot(BT_frq_TBL[1:10,], aes(x = factor(word, levels = word), y = freq)) +
        geom_bar(stat = "identity", fill = "tomato", colour = "black", width = 0.50) + 
        ggtitle("Bigrams") +
        ylab("Frequency of Words") +
        xlab("Words") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

ggplot(TT_frq_TBL[1:10,], aes(x = factor(word, levels = word), y = freq)) +
        geom_bar(stat = "identity", fill = "gold3", colour = "black", width = 0.50) + 
        ggtitle("Trigrams") +
        ylab("Frequency of Words") +
        xlab("Words") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

ggplot(QT_frq_TBL[1:10,], aes(x = factor(word, levels = word), y = freq)) +
        geom_bar(stat = "identity", fill = "mediumpurple", colour = "black", width = 0.50) + 
        ggtitle("Quadgrams") +
        ylab("Frequency of Words") +
        xlab("Words") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

Future Steps - Algorithm and Shiny App

The N-grams described in this report will be used to calculate the probability of the next word when completing a string of text. A trigram/quadgram model will be used for this prediction. As for the Shiny app, we will build an online application that takes a partial text string from the user and, based on the algorithm described above, proposes a list of likely words to complete the string.
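The prediction algorithm has not been implemented yet; a minimal sketch of the planned back-off idea, assuming the frequency tables above are kept in memory (the function name predict_next and the exact back-off order are assumptions):

# Suggest the next word by matching the user's last words against the
# quadgram, trigram and bigram tables in turn, backing off on no match
predict_next <- function(text, n = 3) {
        words  <- unlist(strsplit(tolower(text), "\\s+"))
        tables <- list(QT_frq_TBL, TT_frq_TBL, BT_frq_TBL)
        sizes  <- c(3, 2, 1)            # prefix length for quad-, tri- and bigrams
        for (i in seq_along(tables)) {
                if (length(words) < sizes[i]) next
                prefix <- paste(tail(words, sizes[i]), collapse = " ")
                hits   <- tables[[i]][grepl(paste0("^", prefix, " "), tables[[i]]$word), ]
                if (nrow(hits) > 0) {
                        # last word of the most frequent matching N-grams
                        return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n))
                }
        }
        head(as.character(UT_frq_TBL$word), n)   # fall back to most frequent unigrams
}

# Example: predict_next("happy mothers") should suggest "day" via the trigram table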