In this project, I will use the data set provided by Coursera and SwiftKey to make a Shiny website. The project is divided into several parts:
-Understanding the problem
-Data acquisition and cleaning
-Exploratory analysis
-Statistical modeling
-Predictive modeling
-Creative exploration
-Creating a data product
-Creating a short slide deck pitching the product
The files were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. I will mainly be using the files in the final/en_US folder.
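Before the analysis, the data can be downloaded and unzipped, and the packages whose functions appear throughout this report loaded. A minimal setup sketch (the destination file name is my choice; the package list is inferred from the functions used below):
# Download and unzip the data set (skipped if the folder is already present)
if (!dir.exists("final")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
# Packages used in this report
library(stringi)            # stri_stats_general()
library(ngram)              # ngram(), get.phrasetable(), wordcount()
library(tm)                 # removePunctuation(), removeNumbers()
library(magrittr)           # the %>% pipe
library(ggplot2)            # frequency bar charts
library(stringr)            # str_trim()
library(qdapDictionaries)   # the GradyAugmented word list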
fileB <- readLines("final/en_US/en_US.blogs.txt")
fileN <- readLines("final/en_US/en_US.news.txt")
## Warning in readLines("final/en_US/en_US.news.txt"): incomplete final line found
## on 'final/en_US/en_US.news.txt'
fileT <- readLines("final/en_US/en_US.twitter.txt")
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 167155 appears to
## contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 268547 appears to
## contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1759032 appears to
## contain an embedded nul
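The embedded-nul warnings can be avoided by dropping the nul characters while reading; a minimal alternative using base R's skipNul argument:
fileT <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)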
# Combine the three sources into one character vector of lines
fileA <- c(fileB, fileN, fileT)
summ <- sapply(list(fileB, fileN, fileT), stri_stats_general)
wdctA <- sapply(list(fileB, fileN, fileT), wordcount)
rbind(c("blogs", "news", "twitter"), summ, wdctA)
## [,1] [,2] [,3]
## "blogs" "news" "twitter"
## Lines "899288" "77259" "2360148"
## LinesNEmpty "899288" "77259" "2360148"
## Chars "206824382" "15639408" "162096031"
## CharsNWhite "170389539" "13072698" "134082634"
## wdctA "37334131" "2643969" "30373543"
samp <- fileA %>% removePunctuation() %>% removeNumbers() %>% tolower()
This removes punctuation and numbers and converts every word to lower case, so that “the,” and “The” count as the same word.
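As a toy illustration of this cleaning step (the input strings are made up), both elements below reduce to “the cat”:
c("The cat,", "the Cat!") %>% removePunctuation() %>% tolower()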
samp <- samp[sapply(samp, wordcount) > 3]
samp <- sample(samp, as.integer(round(length(samp) * 0.001)))
The sample is a random sample of 0.1% of the original data (including only lines with more than 3 words). This reduces the file size and the processing time.
Lines with fewer than 4 words are removed because the n-gram tokenizer requires every line to contain at least n words.
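Note that the sampling above is not seeded; placing a set.seed() call before sample() would make the 0.1% draw reproducible (the seed value here is arbitrary):
set.seed(2023)   # arbitrary seed; call this before sample() so the same 0.1% sample is drawn each run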
wordcount(samp, count_fun = min)
## [1] 4
In the exploratory analysis, I will use n-grams to find the most common words and phrases.
ng1 <- ngram(samp, n=1)
pt1 <- get.phrasetable(ng1) %>% as.data.frame()
head(pt1, 20)
## ngrams freq prop
## 1 the 10365 0.050999823
## 2 to 5543 0.027273711
## 3 and 5420 0.026668504
## 4 a 5050 0.024847960
## 5 of 4473 0.022008896
## 6 in 3584 0.017634671
## 7 i 3171 0.015602551
## 8 that 2225 0.010947864
## 9 for 2154 0.010598516
## 10 is 2070 0.010185203
## 11 it 1766 0.008689405
## 12 on 1603 0.007887382
## 13 you 1511 0.007434706
## 14 with 1500 0.007380582
## 15 was 1387 0.006824578
## 16 as 1138 0.005599402
## 17 at 1101 0.005417347
## 18 have 1100 0.005412427
## 19 be 1097 0.005397666
## 20 this 1092 0.005373064
g1 <- ggplot(pt1[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g1 <- g1 + geom_bar(stat="identity") + labs(x = "word", y = "frequency", title = "Top 15 words with highest frequency in the file text")
g1
“the”, “to”, and “and” have the three highest frequencies among 1-grams.
ng2 <- ngram(samp, n = 2)
pt2 <- get.phrasetable(ng2) %>% as.data.frame()
head(pt2, 20)
## ngrams freq prop
## 1 of the 990 0.0050425047
## 2 in the 857 0.0043650773
## 3 to the 492 0.0025059721
## 4 for the 398 0.0020271888
## 5 on the 391 0.0019915347
## 6 to be 348 0.0017725168
## 7 at the 295 0.0015025645
## 8 and the 285 0.0014516302
## 9 in a 250 0.0012733598
## 10 with the 225 0.0011460238
## 11 is a 223 0.0011358369
## 12 for a 219 0.0011154632
## 13 it was 213 0.0010849025
## 14 i was 188 0.0009575666
## 15 and i 182 0.0009270059
## 16 from the 178 0.0009066322
## 17 of a 174 0.0008862584
## 18 with a 174 0.0008862584
## 19 it is 174 0.0008862584
## 20 i have 165 0.0008404175
g2 <- ggplot(pt2[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g2 <- g2 + geom_bar(stat="identity") + labs(x = "phrase", y = "frequency", title = "Top 15 two-word phrases with highest frequency in the file text") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
g2
“of the”, “in the”, and “to the” have the three highest frequencies among 2-grams.
All of the top four contain the word “the”.
ng3 <- ngram(samp, n=3)
pt3 <- get.phrasetable(ng3) %>% as.data.frame()
head(pt3, 20)
## ngrams freq prop
## 1 a lot of 86 0.0004532184
## 2 one of the 74 0.0003899786
## 3 as well as 38 0.0002002593
## 4 out of the 38 0.0002002593
## 5 to be a 38 0.0002002593
## 6 going to be 35 0.0001844493
## 7 some of the 34 0.0001791794
## 8 i have to 34 0.0001791794
## 9 it was a 34 0.0001791794
## 10 be able to 33 0.0001739094
## 11 the end of 32 0.0001686394
## 12 part of the 29 0.0001528295
## 13 this is a 24 0.0001264795
## 14 this is the 24 0.0001264795
## 15 the rest of 24 0.0001264795
## 16 i dont know 24 0.0001264795
## 17 end of the 24 0.0001264795
## 18 the fact that 23 0.0001212096
## 19 in the first 23 0.0001212096
## 20 is going to 22 0.0001159396
g3 <- ggplot(pt3[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g3 <- g3 + geom_bar(stat="identity") +
labs(x = "phrase", y = "frequency", title = "Top 15 phrase with 3 words with highest frequency in the file text") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1),legend.position="bottom")
g3
“a lot of”, “one of the”, “as well as”, and “out of the” have the four highest frequencies among 3-grams.
Most of the top phrases contain the word “the” or “of”.
wdct <- wordcount(samp)
wdct
## [1] 202698
charPwd <- sum(nchar(samp)) / wdct   # average number of characters per word in the sample
# Accumulate 1-gram frequencies (most frequent first) until 50% of all word instances are covered
count <- 0
for (i in 1:nrow(pt1)) {
  count <- count + pt1$freq[i]
  if (count >= 0.5 * wdct) {
    break
  }
}
i
## [1] 138
i/nrow(pt1)
## [1] 0.005846219
It requires 138 unique words to cover 50% of all word instances in the sampled text, which is about 0.58% of the total number of unique words.
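The same coverage threshold can also be computed without an explicit loop; a sketch using cumulative sums (this relies on get.phrasetable() returning the table sorted by decreasing frequency):
n50 <- which(cumsum(pt1$freq) >= 0.5 * wdct)[1]   # unique words needed to cover 50% of word instances; same result as the loop above
n50 / nrow(pt1)                                   # fraction of the vocabulary this represents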
# Load an English dictionary
data("GradyAugmented")
# Flag the unique words that appear in the English dictionary
noEngIn <- str_trim(pt1$ngrams) %in% GradyAugmented
# Keep only the words that are not in the dictionary
nonEng <- pt1$ngrams[!noEngIn]
head(nonEng)
## [1] "im " " " "dont " "— " "– " "it’s "
Next, remove the words that contain numbers or that start with a capital letter.
# Remove words that contain numbers
nonEng <- nonEng[!grepl(".*?[0-9]+.*?", str_trim(nonEng))]
# Remove words that are capitalized,
# since such a word is likely a proper noun
nonEng <- nonEng[!grepl("^[A-Z]", str_trim(nonEng))]
head(nonEng)
## [1] "im " " " "dont " "— " "– " "it’s "
From the above list of tokens, we can see that there are few to no words from foreign languages; most of the non-dictionary tokens are contractions with the apostrophe stripped (e.g. “im”, “dont”) or stray punctuation.
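To attach a rough number to the foreign-language question, the share of unique tokens not found in the English dictionary can be computed directly (note this overstates foreign words, since contractions and punctuation also land in nonEng):
length(nonEng) / nrow(pt1)   # share of unique tokens not in the English dictionary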
I think a 0.1% sample of the original data already gives an accurate representation of the training set, since the sample contains 202698 words.
Reducing the sample size allows more rapid exploration of the data while keeping the findings accurate.
I will create a prediction algorithm based on the n-gram word frequencies, with the frequencies converted into probabilities.
The Shiny app will have a side panel that allows the user to input a word, and a main panel that shows the phrase produced by the prediction algorithm together with the top three most probable phrases based on the probability distribution.
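As a rough illustration of the planned approach (a sketch, not the final app code), a next-word lookup can be built from the 2-gram table pt2 by treating the relative frequencies of the matching 2-grams as conditional probabilities; predictNext and its handling of unseen words are placeholders of my own:
# Minimal sketch of the planned predictor: given a single input word, find the
# 2-grams that start with it and turn their frequencies into probabilities.
predictNext <- function(word, phrasetable = pt2, top = 3) {
  word <- tolower(str_trim(word))
  # ngrams in the phrase table carry a trailing space, e.g. "of the "
  matches <- phrasetable[grepl(paste0("^", word, " "), phrasetable$ngrams), ]
  if (nrow(matches) == 0) return(NULL)
  matches$prob <- matches$freq / sum(matches$freq)
  # the second token of each matching 2-gram is the candidate next word
  nextWord <- sapply(strsplit(str_trim(matches$ngrams), " "), `[`, 2)
  head(data.frame(nextWord = nextWord, prob = matches$prob), top)
}
predictNext("of")   # should rank "the" first, given the 2-gram table shown earlier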