In this project, I will use the data set provided by Coursera and SwiftKey to build a Shiny application. The work is divided into several parts:
-Understanding the problem
-Data acquisition and cleaning
-Exploratory analysis
-Statistical modeling
-Predictive modeling
-Creative exploration
-Creating a data product
-Creating a short slide deck pitching your product
The files were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. I will mainly be using the files in the final/en_US folder.
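The analysis below relies on a handful of packages. A minimal setup chunk (my assumption of what is needed, based on the functions used later) is:
library(ngram)            # ngram(), get.phrasetable(), wordcount()
library(tm)               # removePunctuation()
library(qdapDictionaries) # GradyAugmented English word list
library(stringr)          # str_trim()
library(dplyr)            # %>% pipe
library(ggplot2)          # bar charts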
randomSelectLine <- function(filePath, destFile, n) {
  con <- file(filePath, "r")
  lin <- readLines(con)
  repeat {
    # Draw n distinct line indices at random
    inn <- sample(length(lin), n, replace = FALSE)
    # Accept the draw only if every sampled line has at least 4 words;
    # this is for the exploratory analysis later
    if (wordcount(lin[inn], count_fun = min) > 3) {
      break
    }
  }
  writeLines(lin[inn], destFile)
  close(con)
}
randomSelectLine("final/en_US/en_US.blogs.txt", "sample.txt", n = 100)
I randomly selected 100 lines from the en_US.blogs.txt file and wrote them to sample.txt, reducing the data size for exploratory analysis. Each line of sample.txt corresponds to one line in the original file.
fileB <- readLines("final/en_US/en_US.blogs.txt")
wc <- wordcount(fileB)    # total number of words in the file
numRow <- length(fileB)   # number of lines (nrow() returns NULL for a character vector)
avgWd <- wc / numRow      # average number of words per line
The total word count in the file is 37,334,131; numRow gives the number of lines, and avgWd the average number of words per line.
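The same summary could be computed for the other two files in the final/en_US folder. This is only a sketch under my own assumptions (file names as distributed in the zip, skipNul = TRUE to avoid warnings about embedded nulls):
enFiles <- c("final/en_US/en_US.blogs.txt",
             "final/en_US/en_US.news.txt",
             "final/en_US/en_US.twitter.txt")
t(sapply(enFiles, function(f) {
  lines <- readLines(f, skipNul = TRUE)  # skip embedded nulls in the raw files
  words <- wordcount(lines)              # total words (ngram package)
  c(lines = length(lines), words = words, avgWords = words / length(lines))
}))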
sample <- readLines("sample.txt")
head(sample)
## [1] "I called Louise Backer, manager of the Springfield Symphony Youth Orchestra and she said she's \"new at this job\" and never heard of Ben Snyder although she had heard of Bob Staffanson. I told her about myself and the Symphony in the old days and about the musical Stusicks. Backer said she will send me some of their promotional material. Susan Bennett is the Public Relations Director. Any relation to the District Attorney?"
## [2] "For several more years I fished all around the Miami area; anywhere my bicycle would carry me – all along the 79th Street Causeway, at the fresh water canals out around NW 27th Avenue and 103rd Street and in Hialeah along the banks of the Miami River – I never carried a camera. It was hard enough just to carry my cane pole or my rod and reel and a tackle box on my bike. My stepfather took my mother and I on some more but truly unforgettable fishing trips during this period of my life, too. Like the time we got caught in a hurricane while trying to cross Card Sound in a 16’ wooden boat powered by a 10hp outboard motor (the weather had turned so bad Truman brought the Presidential Yacht out of the ocean and anchored it in Card Sound to protect it), or the time we got stranded in the middle of Blackwater Sound when our rental outboard failed to start and we couldn’t find the entrance to the fishing camp at Jewfish Creek in the dark of night. The mosquitoes were voracious and we had to jump overboard into the murky water in order to retain our sanity; it was like a scene out of the African Queen! But, again, we have no photographic documentation."
## [3] "I want to dissipate all of my sorrows"
## [4] "And in one image, Mr Killen is photographed standing with another man in front of a wall with graffiti reading “White Power”."
## [5] "6) Clay Buchholz has 3 wins this season, tying him for 9th with a host of other pitchers in the MLB this season. What those other pitchers don’t have, though, is Buchholz’s 8.69 ERA. In fact, only one other pitcher of the top 36 pitchers in wins this year has an ERA over 5.00. That man, Ivan Nova (Yankees, 5.18 ERA), hasn’t lost in 15 regular-season decisions, which hasn’t happened in over 50 years."
## [6] "2. Benefit Ooh La Lift Instant Under Eye Brightener: I have always been a fan of benefit products and found them to be pretty affordable and of good quality. I had high hopes for it but after using it every day for weeks I noticed NO difference. Literally nothing. I might as well have applied nothing under my eyes because that’s what it seemed like. The consistency is nice and it penetrates into the skin fairly quick but like I said, it didn’t brighten or conceal or reduce dark under eye circles or anything. What a bummer!"
Each element of sample is one line of the sampled text.
sampleNoPun <- sample %>%
  removePunctuation(preserve_intra_word_contractions = TRUE) %>%
  tolower()
Removing the punctuation ensures that “the,” and “the” are counted as the same word, while intra-word contractions such as “don’t” are preserved.
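For example, a quick check of what this step does (the comment shows the output I would expect; details may vary with the tm version):
removePunctuation("Cut the cake, and don't drop it.",
                  preserve_intra_word_contractions = TRUE)
# expected: "Cut the cake and don't drop it" (apostrophe kept, other punctuation removed)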
In the exploratory analysis, I will use n-grams to find the most common words and phrases.
ng1 <- ngram(sampleNoPun, n=1)
pt1 <- get.phrasetable(ng1) %>% as.data.frame()
head(pt1, 20)
## ngrams freq prop
## 1 the 227 0.050805730
## 2 to 140 0.031333930
## 3 and 130 0.029095792
## 4 of 110 0.024619517
## 5 a 105 0.023500448
## 6 i 86 0.019247986
## 7 in 78 0.017457475
## 8 is 67 0.014995524
## 9 it 55 0.012309758
## 10 that 54 0.012085944
## 11 for 45 0.010071620
## 12 you 41 0.009176365
## 13 with 40 0.008952551
## 14 my 38 0.008504924
## 15 be 31 0.006938227
## 16 this 30 0.006714414
## 17 on 29 0.006490600
## 18 at 26 0.005819158
## 19 have 25 0.005595345
## 20 so 25 0.005595345
g1 <- ggplot(pt1[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g1 <- g1 + geom_bar(stat="identity") + labs(x = "word", y = "frequency", title = "Top 15 words with highest frequency in the sample text")
g1
“the”, “to”, and “and” have the three highest frequencies among the 1-grams.
ng2 <- ngram(sampleNoPun, n = 2)
pt2 <- get.phrasetable(ng2) %>% as.data.frame()
head(pt2, 20)
## ngrams freq prop
## 1 of the 26 0.005952381
## 2 in the 21 0.004807692
## 3 to the 13 0.002976190
## 4 to be 11 0.002518315
## 5 and the 10 0.002289377
## 6 to do 8 0.001831502
## 7 will be 8 0.001831502
## 8 and i 7 0.001602564
## 9 for the 7 0.001602564
## 10 if you 7 0.001602564
## 11 one of 7 0.001602564
## 12 at the 7 0.001602564
## 13 have to 7 0.001602564
## 14 about the 7 0.001602564
## 15 on my 6 0.001373626
## 16 there are 6 0.001373626
## 17 it is 6 0.001373626
## 18 in a 6 0.001373626
## 19 the first 6 0.001373626
## 20 for a 5 0.001144689
g2 <- ggplot(pt2[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g2 <- g2 + geom_bar(stat="identity") + labs(x = "phrase", y = "frequency", title = "Top 15 two-word phrases with the highest frequency in the sample text") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
g2
“of the”, “in the”, and “to the” have the three highest frequencies among the 2-grams.
Four of the top five 2-grams contain the word “the”.
ng3 <- ngram(sampleNoPun, n=3)
pt3 <- get.phrasetable(ng3) %>% as.data.frame()
head(pt3, 20)
## ngrams freq prop
## 1 of the cake 3 0.0007029053
## 2 the end of 3 0.0007029053
## 3 out of the 3 0.0007029053
## 4 one of the 3 0.0007029053
## 5 we agree that 2 0.0004686036
## 6 bakery called dreamville 2 0.0004686036
## 7 and invert it 2 0.0004686036
## 8 look like a 2 0.0004686036
## 9 i have always 2 0.0004686036
## 10 go to school 2 0.0004686036
## 11 rent assistance payments 2 0.0004686036
## 12 relation to the 2 0.0004686036
## 13 that you can 2 0.0004686036
## 14 i have to 2 0.0004686036
## 15 the cake and 2 0.0004686036
## 16 in real life 2 0.0004686036
## 17 i thought i 2 0.0004686036
## 18 and look like 2 0.0004686036
## 19 sure we agree 2 0.0004686036
## 20 bottom of the 2 0.0004686036
g3 <- ggplot(pt3[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g3 <- g3 + geom_bar(stat="identity") +
labs(x = "phrase", y = "frequency", title = "Top 15 phrase with 3 words with highest frequency in the sample text") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1),legend.position="bottom")
g3
“of the cake”, “the end of”, “out of the”, and “one of the” have the four highest frequencies among the 3-grams.
All of the top four contain both the words “the” and “of”.
wdct <- wordcount(sampleNoPun)  # total number of words in the sample
count <- 0
# Walk down the unigram table (sorted by decreasing frequency) and stop once
# the accumulated frequencies cover at least 50% of the words in the sample
for (i in 1:nrow(pt1)) {
  count <- count + pt1$freq[i]
  if (count >= 0.5 * wdct) {
    break
  }
}
i
## [1] 96
i/nrow(pt1)
## [1] 0.05790109
It takes 96 unique words to cover 50% of the sample text, which is about 5.8% (0.0579) of the total number of unique words.
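The same figure can be obtained without an explicit loop; an equivalent vectorized version (relying on pt1 already being sorted by decreasing frequency) is:
nWords50 <- which(cumsum(pt1$freq) >= 0.5 * wdct)[1]  # unique words covering 50% of the sample
nWords50 / nrow(pt1)                                  # as a fraction of the vocabulary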
# Load an English word list (GradyAugmented, from the qdapDictionaries package)
data("GradyAugmented")
# Flag the unigrams that appear in the word list
noEngIn <- sapply(pt1$ngrams, function(x) tolower(str_trim(x)) %in% GradyAugmented)
# Keep the unigrams that are not in the word list
nonEng <- pt1$ngrams[!noEngIn]
head(nonEng)
## [1] "2 " "he’s " "it’s " "– " "don’t " "blog "
Next, remove words that contain numbers and words that start with a capital letter, since the latter are likely proper nouns.
# Remove words that contain numbers
nonEng <- nonEng[!grepl(".*?[0-9]+.*?", str_trim(nonEng))]
# Remove words that start with a capital letter,
# since such words are likely proper nouns
nonEng <- nonEng[!grepl("^[A-Z]", str_trim(nonEng))]
nonEng
## [1] "he’s " "it’s " "– "
## [4] "don’t " "blog " "dreamville "
## [7] "caramelised " "sgt " "liuhebafa "
## [10] "that’s " "hasn’t " "— "
## [13] "there’s " "taoist " "miami "
## [16] "retro " "passionfruit " "prequel "
## [19] "jonah’s " "ikea " "dremel "
## [22] " " "business” " "elasticized "
## [25] "snyder " "andor " "bestso "
## [28] "facebook " "cheesecake " "didn’t "
## [31] "sgt's " "stusicks " "brooklyn "
## [34] "buchholz’s " "sevenyearold " "now… "
## [37] "endnotes " "hialeah " "craigslist "
## [40] "chestthat " "‘trigger’ " "regularseason "
## [43] "ellison " "freshfromtheoven " "greenpeace "
## [46] "bersih " "honeyd " "wristwatch "
## [49] "pooku " "“pretty " "minty's "
## [52] "cleopatra's " "wellingtonsand " "sourdough "
## [55] "blackwater " "politically " "marydeluxe "
## [58] "espadrilles " "mba " "raksura "
## [61] "won’t " "penetrates " "kinect "
## [64] "joann's " "usa " "uber "
## [67] "selfjustified " "honeyj " "stepfather "
## [70] "favourite " "deluxeville " "gfc "
## [73] "greg's " "‘em” " "independents "
## [76] "‘normal’ " "takes…which " "weeknight "
## [79] "“gaming” " "‘dryness’ " "buchholz "
## [82] "force” " "“things " "jeanclaude "
## [85] "selffulfilling " "springfield " "texting "
## [88] "gon " "“screw " "normalboots "
## [91] "columbia… " "taxpayer's " "structured "
## [94] "sow's " "non " "what’s "
## [97] "speculaas " "department’s " "summer¨ "
## [100] "normal” " "atleast " "threestoogesstyle "
## [103] "you’d " "killen " "“scientific” "
## [106] "“taught " "bait's " "onethe "
## [109] "i’m " "character's " "chinthat "
## [112] "damme " "“i’d " "¨indian "
## [115] "taiwanese " "mustread " "wario "
## [118] "colourq " "can’t " "nations” "
## [121] "nw " "sayshe " "staffanson "
## [124] "wasn’t " "nestabilities " "onscreen "
## [127] "power” " "lohan " "macadamia "
## [130] "wouldn’t " "you’re " "colossians "
## [133] "alona " "shapeabilities " "yankees "
## [136] "soi " "“white " "hopen "
## [139] "requesta " "segal " "oneshe "
## [142] "jontron " "bruschetta " "“and "
## [145] "repurposing " "uninspired " "everybody's "
## [148] "blogs " "fave " "ploughboys "
## [151] "zuma’s " "mlb " "choc "
## [154] "heartfelt " "aren’t " "cornerthere "
## [157] "eotd " "seas…except " "mozzarella "
## [160] "fbar " "father's " "‘secret’ "
## [163] "couldn’t " "alaska…british " "hummus "
## [166] "here's " "krauss " "i’d "
From the list above, we can see that there are few to no words from foreign languages; most entries are contractions, proper nouns, or words joined together by removed punctuation.
I will create a prediction algorithm based on the n-gram frequencies, with the frequencies converted into probabilities.
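As an illustration only (not the final algorithm), a minimal lookup-based predictor could take the last word typed, keep the 2-grams in pt2 that start with that word, and rank the candidate next words by their relative frequency. All names below are my own:
# Sketch: predict the next word from the bigram phrase table pt2
predictNext <- function(lastWord, phraseTable = pt2, k = 3) {
  lastWord <- tolower(str_trim(lastWord))
  # Bigrams whose first word matches the input word
  hits <- phraseTable[grepl(paste0("^", lastWord, " "), phraseTable$ngrams), ]
  if (nrow(hits) == 0) return(character(0))
  # Convert frequencies to conditional probabilities and keep the top k
  hits$prob <- hits$freq / sum(hits$freq)
  hits <- hits[order(-hits$prob), ][seq_len(min(k, nrow(hits))), ]
  # Return the predicted second words
  word(str_trim(hits$ngrams), 2)
}
predictNext("of")  # should rank "the" among the top candidates, given the table above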
The Shiny app will have a side panel that allows the user to input a word or phrase, and a main panel that shows the phrase produced by the prediction algorithm together with the three most probable next words based on the probability distribution.
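A bare-bones layout for such an app (a sketch only; the widget names are my own and the server logic assumes a predictNext() function like the one sketched above):
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Type a word or phrase:")  # user input
    ),
    mainPanel(
      textOutput("prediction"),  # best guess for the next word
      tableOutput("topThree")    # three most probable candidates
    )
  )
)

server <- function(input, output) {
  lastWordOf <- function(txt) tail(strsplit(txt, "\\s+")[[1]], 1)
  output$prediction <- renderText({
    req(input$phrase)
    predictNext(lastWordOf(input$phrase), k = 1)
  })
  output$topThree <- renderTable({
    req(input$phrase)
    data.frame(candidate = predictNext(lastWordOf(input$phrase), k = 3))
  })
}

shinyApp(ui, server)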