Abstract

In this project, I will use the data set provided by Coursera and SwiftKey to build a Shiny website. The work is divided into several parts:

- Understanding the problem

- Data acquisition and cleaning

- Exploratory analysis

- Statistical modeling

- Predictive modeling

- Creative exploration

- Creating a data product

- Creating a short slide deck pitching the product

The files were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. I will mainly be using the files in the final/en_US folder.

Getting and Cleaning Data
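
This report relies on several packages: stringi for stri_stats_general(), ngram for wordcount(), ngram() and get.phrasetable(), tm for removePunctuation() and removeNumbers(), stringr for str_trim(), qdapDictionaries for the GradyAugmented word list, ggplot2 for the plots, and magrittr for the %>% pipe. Presumably they are loaded in a setup chunk that is not shown; a minimal version would be:

# Minimal setup: load the packages used throughout this report
library(stringi)           # stri_stats_general()
library(ngram)             # wordcount(), ngram(), get.phrasetable()
library(tm)                # removePunctuation(), removeNumbers()
library(stringr)           # str_trim()
library(qdapDictionaries)  # GradyAugmented English word list
library(ggplot2)           # frequency plots
library(magrittr)          # the %>% pipe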

Load file
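
The readLines() calls below assume that the zip file linked in the abstract has already been downloaded and unpacked into the working directory. If not, a short setup chunk such as the following would do it (the destination file name is my own choice; the folder check simply mirrors the paths used in this report):

# Fetch and unpack the data set if it is not already on disk
if (!dir.exists("final/en_US")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}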

fileB <- readLines("final/en_US/en_US.blogs.txt")
fileN <- readLines("final/en_US/en_US.news.txt")
## Warning in readLines("final/en_US/en_US.news.txt"): incomplete final line found
## on 'final/en_US/en_US.news.txt'
fileT <- readLines("final/en_US/en_US.twitter.txt")
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 167155 appears to
## contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 268547 appears to
## contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1759032 appears to
## contain an embedded nul
# Combine the three sources into a single character vector of lines
fileA <- c(fileB, fileN, fileT)

Summary

summ <- sapply(list(fileB, fileN, fileT), stri_stats_general)
wdctA <- sapply(list(fileB, fileN, fileT), wordcount)
rbind(c("blogs", "news", "twitter"), summ, wdctA)
##             [,1]        [,2]       [,3]       
##             "blogs"     "news"     "twitter"  
## Lines       "899288"    "77259"    "2360148"  
## LinesNEmpty "899288"    "77259"    "2360148"  
## Chars       "206824382" "15639408" "162096031"
## CharsNWhite "170389539" "13072698" "134082634"
## wdctA       "37334131"  "2643969"  "30373543"

Remove punctuation and numbers, and convert to lower case

samp <- fileA %>% removePunctuation() %>% removeNumbers() %>% tolower()

This removes punctuation and numbers and converts every word to lower case, so that “the,” and “The” count as the same word.
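
As a quick illustration of the effect of this cleaning pipeline on a single made-up line:

# Example of the cleaning step on one made-up sentence;
# “The” and “the,” both end up as the plain token “the”
"The cat, the dog, and 2 birds!" %>% removePunctuation() %>% removeNumbers() %>% tolower()
## should give something like: "the cat the dog and  birds"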

Sample the data (we will only keep 0.1% of the total data)

samp <- samp[sapply(samp, wordcount) > 3]
samp <- sample(samp, as.integer(round(length(samp) * 0.001)))

The sample is drawn by random sampling, with a size of 0.1% of the original data, keeping only lines with more than 3 words. This reduces the file size and the processing time.
Removing all lines with fewer than 4 words is needed for the n-grams, since ngram() requires every line to have at least n words.
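
Because sample() is random, the exact counts reported below will vary from run to run; fixing a seed before sampling would make them reproducible (the seed value here is arbitrary):

# Optional: fix the random seed before the sampling step so that the 0.1%
# sample, and therefore all counts below, are reproducible
set.seed(1234)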

wordcount(samp, count_fun = min)
## [1] 4

Exploratory Analysis

In the exploratory analysis, I will be using n-grams to find the most common phrases.

1-grams (phrases with one word)

ng1 <- ngram(samp, n=1)
pt1 <- get.phrasetable(ng1) %>% as.data.frame()
head(pt1, 20)
##    ngrams  freq        prop
## 1    the  10365 0.050999823
## 2     to   5543 0.027273711
## 3    and   5420 0.026668504
## 4      a   5050 0.024847960
## 5     of   4473 0.022008896
## 6     in   3584 0.017634671
## 7      i   3171 0.015602551
## 8   that   2225 0.010947864
## 9    for   2154 0.010598516
## 10    is   2070 0.010185203
## 11    it   1766 0.008689405
## 12    on   1603 0.007887382
## 13   you   1511 0.007434706
## 14  with   1500 0.007380582
## 15   was   1387 0.006824578
## 16    as   1138 0.005599402
## 17    at   1101 0.005417347
## 18  have   1100 0.005412427
## 19    be   1097 0.005397666
## 20  this   1092 0.005373064

Frequency plot of 1-grams

g1 <- ggplot(pt1[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g1 <- g1 + geom_bar(stat="identity") + labs(x = "word", y = "frequency", title = "Top 15 words with highest frequency in the file text")
g1

“the”, “to”, and “and” have the three highest frequencies among the 1-grams.

2-grams (phrases with two words)

ng2 <- ngram(samp, n = 2)
pt2 <- get.phrasetable(ng2) %>% as.data.frame()
head(pt2, 20)
##       ngrams freq         prop
## 1    of the   990 0.0050425047
## 2    in the   857 0.0043650773
## 3    to the   492 0.0025059721
## 4   for the   398 0.0020271888
## 5    on the   391 0.0019915347
## 6     to be   348 0.0017725168
## 7    at the   295 0.0015025645
## 8   and the   285 0.0014516302
## 9      in a   250 0.0012733598
## 10 with the   225 0.0011460238
## 11     is a   223 0.0011358369
## 12    for a   219 0.0011154632
## 13   it was   213 0.0010849025
## 14    i was   188 0.0009575666
## 15    and i   182 0.0009270059
## 16 from the   178 0.0009066322
## 17     of a   174 0.0008862584
## 18   with a   174 0.0008862584
## 19    it is   174 0.0008862584
## 20   i have   165 0.0008404175

Frequency plot of 2-grams

g2 <- ggplot(pt2[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g2 <- g2 + geom_bar(stat="identity") + labs(x = "phrase", y = "frequency", title = "Top 15 phrase with 2 words with highest frequency in the file text") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
g2

“of the”, “in the”, and “to the” have the three highest frequencies among the 2-grams.
All of the top four contain the word “the”.

3-grams (phrases with three words)

ng3 <- ngram(samp, n=3)
pt3 <- get.phrasetable(ng3) %>% as.data.frame()
head(pt3, 20)
##            ngrams freq         prop
## 1       a lot of    86 0.0004532184
## 2     one of the    74 0.0003899786
## 3     as well as    38 0.0002002593
## 4     out of the    38 0.0002002593
## 5        to be a    38 0.0002002593
## 6    going to be    35 0.0001844493
## 7    some of the    34 0.0001791794
## 8      i have to    34 0.0001791794
## 9       it was a    34 0.0001791794
## 10    be able to    33 0.0001739094
## 11    the end of    32 0.0001686394
## 12   part of the    29 0.0001528295
## 13     this is a    24 0.0001264795
## 14   this is the    24 0.0001264795
## 15   the rest of    24 0.0001264795
## 16   i dont know    24 0.0001264795
## 17    end of the    24 0.0001264795
## 18 the fact that    23 0.0001212096
## 19  in the first    23 0.0001212096
## 20   is going to    22 0.0001159396

Frequency plot of 3-grams

g3 <- ggplot(pt3[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g3 <- g3 + geom_bar(stat="identity") + 
  labs(x = "phrase", y = "frequency", title = "Top 15 phrase with 3 words with highest frequency in the file text") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1),legend.position="bottom")

g3

“a lot of”, “one of the”, “as well as”, and “out of the” have the four highest frequencies among the 3-grams.
Most of the top phrases contain the word “of” or “the”.

Total word count in the sample

wdct <- wordcount(samp)
wdct
## [1] 202698

Average number of characters per word

# Total number of characters (including spaces) divided by the total word count
charPwd <- sum(nchar(samp)) / wdct

How many unique words are needed to cover 50% of the sample text?

# Walk down the frequency-sorted 1-gram table, accumulating frequencies
# until half of the total word count is covered
count = 0
for (i in 1:nrow(pt1)) 
{
  count = count + pt1$freq[i]
  if (count >= 0.5 * wdct)
  {
    break
  }
}
i
## [1] 138
i/nrow(pt1)
## [1] 0.005846219

It requires 138 unique words to cover 50% of the sample text, which is about 0.58% of the total number of unique words.
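
For reference, the same coverage count can be computed without an explicit loop; this relies on pt1 being sorted by decreasing frequency, which is how get.phrasetable() returns it (nCover50 is just an illustrative name):

# Vectorized equivalent of the loop above: the first position at which the
# cumulative 1-gram frequency reaches half of the total word count
nCover50 <- which(cumsum(pt1$freq) >= 0.5 * wdct)[1]
nCover50 / nrow(pt1)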

Find non-English words

# Load an English word list
data("GradyAugmented")

# Flag the 1-gram words that appear in the dictionary
noEngIn <- str_trim(pt1$ngrams) %in% GradyAugmented

# Keep the words that are not in the dictionary
nonEng <- pt1$ngrams[!noEngIn]
head(nonEng)
## [1] "im "   " "     "dont " "— "    "– "    "it’s "

Remove words that contain numbers or start with a capital letter

# Remove words that contain numbers
nonEng <- nonEng[!grepl(".*?[0-9]+.*?", str_trim(nonEng))]

# Remove words that are capitalized,
# as such words are likely proper nouns
nonEng <- nonEng[!grepl("^[A-Z]", str_trim(nonEng))]
head(nonEng)
## [1] "im "   " "     "dont " "— "    "– "    "it’s "

From the above list of words, we can see there are few, if any, words from foreign languages; most of the remaining entries are contractions with the apostrophe stripped (e.g. “im”, “dont”) or stray dash characters.

Exploratory analysis summary

I think 0.1% of the original data already gives an accurate representation of the training set, since the sample already contains 202698 words.
Reducing the sample size allows more rapid exploration of the data while keeping the findings accurate.

Text Prediction

Plan

Prediction algorithm

I will create a prediction algorithm based on the n-gram word frequencies. The frequencies are converted into probabilities.

  1. Find all 3-gram phrases that contain the input word.
  2. Use the frequencies of all the matching phrases to generate a probability distribution that determines the next word.
  3. For words that have not appeared in the n-grams, return a random 3-word phrase generated from the frequencies (the higher the frequency in the training set, the higher the chance that the phrase is output). A rough sketch of this approach is shown after this list.
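
The sketch below is only a rough illustration of this plan, not the final algorithm: predictNext() is a hypothetical helper name, it matches on the first word of each 3-gram only, and the frequency-weighted random fallback stands in for step 3. It reuses the pt3 phrase table built in the exploratory analysis.

# Rough sketch of the planned prediction step (illustrative only)
predictNext <- function(word, phrasetable = pt3) {
  # First word of every 3-gram in the table
  firstWord <- sapply(strsplit(str_trim(phrasetable$ngrams), " "), `[`, 1)

  # Candidate 3-grams whose first word matches the input
  cand <- phrasetable[firstWord == tolower(word), ]

  # Fallback: if the word never appears, sample from all 3-grams,
  # weighted by frequency, as described in step 3 above
  if (nrow(cand) == 0) cand <- phrasetable

  # Draw one phrase with probability proportional to its frequency
  picked <- sample(cand$ngrams, 1, prob = cand$freq)

  # Return the last word of the chosen phrase as the predicted next word
  tail(strsplit(str_trim(picked), " ")[[1]], 1)
}

predictNext("going")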

Shiny app

It will have a side panel that allows the user to input a word. It will also have a main panel that shows the output phrase from the prediction algorithm and the top three most probable phrases based on the probability distribution.
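
A minimal sketch of that layout, reusing the hypothetical predictNext() helper and the pt3 table from above (all names are placeholders, not the final app):

library(shiny)

# Minimal sketch of the planned app layout (placeholder names)
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("word", "Enter a word:")          # user input
    ),
    mainPanel(
      textOutput("prediction"),                   # phrase from the prediction algorithm
      tableOutput("topThree")                     # three most probable phrases
    )
  )
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$word)
    predictNext(input$word)
  })
  output$topThree <- renderTable({
    req(input$word)
    # Three most frequent 3-grams starting with the input word
    head(pt3[grepl(paste0("^", tolower(input$word), " "), pt3$ngrams), ], 3)
  })
}

# shinyApp(ui, server)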