Abstract

In this project, I will use the data set provided by Coursera and SwiftKey to make a Shiny website. The work is divided into several parts:

-Understanding the problem

-Data acquisition and cleaning

-Exploratory analysis

-Statistical modeling

-Predictive modeling

-Creative exploration

-Creating a data product

-Creating a short slide deck pitching your product

The files were downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. I will mainly be using the files in the final/en_US folder.

Getting and Cleaning Data
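
The analysis below assumes the following packages are loaded (the setup chunk is not shown in this report, so this list is my reconstruction):

library(ngram)            # ngram(), get.phrasetable(), wordcount()
library(tm)               # removePunctuation()
library(dplyr)            # the %>% pipe
library(ggplot2)          # frequency plots
library(stringr)          # str_trim()
library(qdapDictionaries) # GradyAugmented English word list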

Randomly select 100 lines and save them into “sample.txt” for exploration

randomSelectLine <- function(filePath, destFile, n) {
  con <- file(filePath, "r")
  lin <- readLines(con)
  repeat
  {
    inn <- sample(length(lin), n, replace = FALSE)
    
    #' Check that the minimum number of words in each selected line is at least 4
    #' This is for the exploratory analysis later
    if (wordcount(lin[inn], count.function = min) > 3)
    {
      break
    }
  }
  writeLines(lin[inn], destFile)
  close(con)
}

randomSelectLine("final/en_US/en_US.blogs.txt", "sample.txt", n = 100)

I randomly chose 100 lines from the en_US.blogs.txt file and wrote them into sample.txt to reduce the data size for exploratory analysis. Each line of sample.txt is one line from the original file.

Summary of en_US.blogs.txt

fileB <- readLines("final/en_US/en_US.blogs.txt")
wc <- wordcount(fileB)
numRow <- length(fileB)  # readLines() returns a character vector, so use length(), not nrow()
avgWd <- wc / numRow

The total word count in the file is 37334131; numRow gives the number of lines and avgWd the average number of words per line.

Read the sample

sample <- readLines("sample.txt")
head(sample)
## [1] "I called Louise Backer, manager of the Springfield Symphony Youth Orchestra and she said she's \"new at this job\" and never heard of Ben Snyder although she had heard of Bob Staffanson. I told her about myself and the Symphony in the old days and about the musical Stusicks. Backer said she will send me some of their promotional material. Susan Bennett is the Public Relations Director. Any relation to the District Attorney?"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [2] "For several more years I fished all around the Miami area; anywhere my bicycle would carry me – all along the 79th Street Causeway, at the fresh water canals out around NW 27th Avenue and 103rd Street and in Hialeah along the banks of the Miami River – I never carried a camera. It was hard enough just to carry my cane pole or my rod and reel and a tackle box on my bike. My stepfather took my mother and I on some more but truly unforgettable fishing trips during this period of my life, too. Like the time we got caught in a hurricane while trying to cross Card Sound in a 16’ wooden boat powered by a 10hp outboard motor (the weather had turned so bad Truman brought the Presidential Yacht out of the ocean and anchored it in Card Sound to protect it), or the time we got stranded in the middle of Blackwater Sound when our rental outboard failed to start and we couldn’t find the entrance to the fishing camp at Jewfish Creek in the dark of night. The mosquitoes were voracious and we had to jump overboard into the murky water in order to retain our sanity; it was like a scene out of the African Queen! But, again, we have no photographic documentation."
## [3] "I want to dissipate all of my sorrows"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [4] "And in one image, Mr Killen is photographed standing with another man in front of a wall with graffiti reading “White Power”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [5] "6) Clay Buchholz has 3 wins this season, tying him for 9th with a host of other pitchers in the MLB this season. What those other pitchers don’t have, though, is Buchholz’s 8.69 ERA. In fact, only one other pitcher of the top 36 pitchers in wins this year has an ERA over 5.00. That man, Ivan Nova (Yankees, 5.18 ERA), hasn’t lost in 15 regular-season decisions, which hasn’t happened in over 50 years."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## [6] "2. Benefit Ooh La Lift Instant Under Eye Brightener: I have always been a fan of benefit products and found them to be pretty affordable and of good quality. I had high hopes for it but after using it every day for weeks I noticed NO difference. Literally nothing. I might as well have applied nothing under my eyes because that’s what it seemed like. The consistency is nice and it penetrates into the skin fairly quick but like I said, it didn’t brighten or conceal or reduce dark under eye circles or anything. What a bummer!"

Each element of sample corresponds to one line of the original file.

Create a sample without punctuation

sampleNoPun <- sample %>%
  removePunctuation(preserve_intra_word_contractions = TRUE) %>%
  tolower()

This removes the punctuation so that “the,” and “the” are counted as the same word.
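
As a quick check (the example string below is my own, not from the data set):

# The commas are removed, while preserve_intra_word_contractions = TRUE keeps
# the apostrophe in "don't"
removePunctuation("the, cat and the don't", preserve_intra_word_contractions = TRUE)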

Exploratory Analysis

In the exploratory analysis, I will use n-grams to find the most common phrases.

1-grams (phrases with one word)

ng1 <- ngram(sampleNoPun, n=1)
pt1 <- get.phrasetable(ng1) %>% as.data.frame()
head(pt1, 20)
##    ngrams freq        prop
## 1    the   227 0.050805730
## 2     to   140 0.031333930
## 3    and   130 0.029095792
## 4     of   110 0.024619517
## 5      a   105 0.023500448
## 6      i    86 0.019247986
## 7     in    78 0.017457475
## 8     is    67 0.014995524
## 9     it    55 0.012309758
## 10  that    54 0.012085944
## 11   for    45 0.010071620
## 12   you    41 0.009176365
## 13  with    40 0.008952551
## 14    my    38 0.008504924
## 15    be    31 0.006938227
## 16  this    30 0.006714414
## 17    on    29 0.006490600
## 18    at    26 0.005819158
## 19  have    25 0.005595345
## 20    so    25 0.005595345

Frequency plot of 1-grams

g1 <- ggplot(pt1[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g1 <- g1 + geom_bar(stat="identity") + labs(x = "word", y = "frequency", title = "Top 15 words with highest frequency in the sample text")
g1

“the”, “to”, and “and” have the three highest frequencies among the 1-grams.

2-grams (phrases with two words)

ng2 <- ngram(sampleNoPun, n = 2)
pt2 <- get.phrasetable(ng2) %>% as.data.frame()
head(pt2, 20)
##        ngrams freq        prop
## 1     of the    26 0.005952381
## 2     in the    21 0.004807692
## 3     to the    13 0.002976190
## 4      to be    11 0.002518315
## 5    and the    10 0.002289377
## 6      to do     8 0.001831502
## 7    will be     8 0.001831502
## 8      and i     7 0.001602564
## 9    for the     7 0.001602564
## 10    if you     7 0.001602564
## 11    one of     7 0.001602564
## 12    at the     7 0.001602564
## 13   have to     7 0.001602564
## 14 about the     7 0.001602564
## 15     on my     6 0.001373626
## 16 there are     6 0.001373626
## 17     it is     6 0.001373626
## 18      in a     6 0.001373626
## 19 the first     6 0.001373626
## 20     for a     5 0.001144689

Frequency plot of 2-grams

g2 <- ggplot(pt2[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g2 <- g2 + geom_bar(stat="identity") + labs(x = "phrase", y = "frequency", title = "Top 15 two-word phrases with the highest frequency in the sample text") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
g2

“of the”, “in the”, and “to the” have the three highest frequencies among the 2-grams.
All of the top three contain the word “the”.

3-grams (phrases with three words)

ng3 <- ngram(sampleNoPun, n=3)
pt3 <- get.phrasetable(ng3) %>% as.data.frame()
head(pt3, 20)
##                       ngrams freq         prop
## 1               of the cake     3 0.0007029053
## 2                the end of     3 0.0007029053
## 3                out of the     3 0.0007029053
## 4                one of the     3 0.0007029053
## 5             we agree that     2 0.0004686036
## 6  bakery called dreamville     2 0.0004686036
## 7             and invert it     2 0.0004686036
## 8               look like a     2 0.0004686036
## 9             i have always     2 0.0004686036
## 10             go to school     2 0.0004686036
## 11 rent assistance payments     2 0.0004686036
## 12          relation to the     2 0.0004686036
## 13             that you can     2 0.0004686036
## 14                i have to     2 0.0004686036
## 15             the cake and     2 0.0004686036
## 16             in real life     2 0.0004686036
## 17              i thought i     2 0.0004686036
## 18            and look like     2 0.0004686036
## 19            sure we agree     2 0.0004686036
## 20            bottom of the     2 0.0004686036

Frequency plot of 3-grams

g3 <- ggplot(pt3[1:15,], aes(x = reorder(ngrams, -freq), y=freq, fill=ngrams))
g3 <- g3 + geom_bar(stat="identity") + 
  labs(x = "phrase", y = "frequency", title = "Top 15 phrase with 3 words with highest frequency in the sample text") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1),legend.position="bottom")

g3

“of the cake”, “the end of”, “out of the”, and “one of the” have the four highest frequencies among the 3-grams.
All of the top four contain the words “the” and “of”.

Total word count in the sample

wdct <- wordcount(sampleNoPun)

How many unique words are needed to cover 50% of the sample text?

# pt1 is sorted by decreasing frequency, so accumulate word counts from the
# most frequent word down until 50% of the total word count is covered
count <- 0
for (i in 1:nrow(pt1))
{
  count <- count + pt1$freq[i]
  if (count >= 0.5 * wdct)
  {
    break
  }
}
i
## [1] 96
i/nrow(pt1)
## [1] 0.05790109

It requires 96 unique words to cover 50% of the sample text, which is 0.0579011 (about 5.8%) of the total number of unique words.
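
The same result can also be computed without a loop; a small sketch using cumsum (the variable names here are my own), relying on pt1 being sorted by decreasing frequency as shown above:

# Cumulative fraction of the total word count covered by the most frequent words
coverage <- cumsum(pt1$freq) / wdct
nCover50 <- which(coverage >= 0.5)[1]
nCover50             # number of unique words needed for 50% coverage
nCover50 / nrow(pt1) # fraction of all unique words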

Find non-English words

# Load an English word list (GradyAugmented)
data("GradyAugmented")

# Flag the words that appear in the dictionary (TRUE if the word is English)
noEngIn <- sapply(pt1$ngrams, function(x) (x %>% str_trim() %>% tolower()) %in% GradyAugmented)

nonEng <- pt1$ngrams[!noEngIn]
head(nonEng)
## [1] "2 "     "he’s "  "it’s "  "– "     "don’t " "blog "

Remove words that contain numbers or start with a capital letter

# Remove words that contain numbers
nonEng <- nonEng[!grepl("[0-9]", str_trim(nonEng))]

# Remove words that are capitalized,
# since such a word is likely a proper noun
nonEng <- nonEng[!grepl("^[A-Z]", str_trim(nonEng))]
nonEng
##   [1] "he’s "              "it’s "              "– "                
##   [4] "don’t "             "blog "              "dreamville "       
##   [7] "caramelised "       "sgt "               "liuhebafa "        
##  [10] "that’s "            "hasn’t "            "— "                
##  [13] "there’s "           "taoist "            "miami "            
##  [16] "retro "             "passionfruit "      "prequel "          
##  [19] "jonah’s "           "ikea "              "dremel "           
##  [22] " "                  "business” "         "elasticized "      
##  [25] "snyder "            "andor "             "bestso "           
##  [28] "facebook "          "cheesecake "        "didn’t "           
##  [31] "sgt's "             "stusicks "          "brooklyn "         
##  [34] "buchholz’s "        "sevenyearold "      "now… "             
##  [37] "endnotes "          "hialeah "           "craigslist "       
##  [40] "chestthat "         "‘trigger’ "         "regularseason "    
##  [43] "ellison "           "freshfromtheoven "  "greenpeace "       
##  [46] "bersih "            "honeyd "            "wristwatch "       
##  [49] "pooku "             "“pretty "           "minty's "          
##  [52] "cleopatra's "       "wellingtonsand "    "sourdough "        
##  [55] "blackwater "        "politically "       "marydeluxe "       
##  [58] "espadrilles "       "mba "               "raksura "          
##  [61] "won’t "             "penetrates "        "kinect "           
##  [64] "joann's "           "usa "               "uber "             
##  [67] "selfjustified "     "honeyj "            "stepfather "       
##  [70] "favourite "         "deluxeville "       "gfc "              
##  [73] "greg's "            "‘em” "              "independents "     
##  [76] "‘normal’ "          "takes…which "       "weeknight "        
##  [79] "“gaming” "          "‘dryness’ "         "buchholz "         
##  [82] "force” "            "“things "           "jeanclaude "       
##  [85] "selffulfilling "    "springfield "       "texting "          
##  [88] "gon "               "“screw "            "normalboots "      
##  [91] "columbia… "         "taxpayer's "        "structured "       
##  [94] "sow's "             "non "               "what’s "           
##  [97] "speculaas "         "department’s "      "summer¨ "          
## [100] "normal” "           "atleast "           "threestoogesstyle "
## [103] "you’d "             "killen "            "“scientific” "     
## [106] "“taught "           "bait's "            "onethe "           
## [109] "i’m "               "character's "       "chinthat "         
## [112] "damme "             "“i’d "              "¨indian "          
## [115] "taiwanese "         "mustread "          "wario "            
## [118] "colourq "           "can’t "             "nations” "         
## [121] "nw "                "sayshe "            "staffanson "       
## [124] "wasn’t "            "nestabilities "     "onscreen "         
## [127] "power” "            "lohan "             "macadamia "        
## [130] "wouldn’t "          "you’re "            "colossians "       
## [133] "alona "             "shapeabilities "    "yankees "          
## [136] "soi "               "“white "            "hopen "            
## [139] "requesta "          "segal "             "oneshe "           
## [142] "jontron "           "bruschetta "        "“and "             
## [145] "repurposing "       "uninspired "        "everybody's "      
## [148] "blogs "             "fave "              "ploughboys "       
## [151] "zuma’s "            "mlb "               "choc "             
## [154] "heartfelt "         "aren’t "            "cornerthere "      
## [157] "eotd "              "seas…except "       "mozzarella "       
## [160] "fbar "              "father's "          "‘secret’ "         
## [163] "couldn’t "          "alaska…british "    "hummus "           
## [166] "here's "            "krauss "            "i’d "

From the above list of words, we can see that there are few to no words from foreign languages; most of the remaining entries are contractions, proper nouns, or joined or misspelled words.

Text Prediction

Plan

Prediction algorithm

I will create a prediction algorithm based on the n-gram word frequencies; the frequencies will be converted into probabilities. A rough sketch follows the steps below.

  1. Find all 3-gram phrases that contain the input word.
  2. Use the frequencies of these phrases to generate a probability distribution that determines the next word.
  3. For input words that have not appeared in the n-grams, return a random 3-word phrase sampled according to the frequencies (the higher the frequency in the training set, the higher the chance that the phrase is output).
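
A minimal sketch of this plan, using the 3-gram phrase table pt3 built above; the helper name predictNextWord is a placeholder rather than the final implementation:

# Rough sketch of the planned algorithm (placeholder name, not the final version)
predictNextWord <- function(inputWord, pt3) {
  inputWord <- tolower(trimws(inputWord))
  # Step 1: all 3-grams that contain the input word
  matches <- pt3[grepl(paste0("\\b", inputWord, "\\b"), pt3$ngrams), ]
  if (nrow(matches) > 0) {
    # Step 2: sample a phrase with probability proportional to its frequency
    pick <- sample(matches$ngrams, 1, prob = matches$freq)
  } else {
    # Step 3: unseen word, so sample any 3-gram according to its frequency
    pick <- sample(pt3$ngrams, 1, prob = pt3$freq)
  }
  trimws(pick)
}

predictNextWord("one", pt3)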

Shiny app

It will have a side panel that allows the user to input a word. It will also have a main panel that shows the output phrase from the prediction algorithm and the top three most probable phrases based on the probability distribution.
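
A minimal layout sketch of such an app (the input/output IDs and the predictNextWord helper above are placeholders, not the final product):

library(shiny)

# Minimal sketch of the planned layout; IDs and helper names are placeholders
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    sidebarPanel(
      textInput("word", "Enter a word:")
    ),
    mainPanel(
      textOutput("predicted"),  # phrase produced by the prediction algorithm
      tableOutput("topThree")   # top three most probable phrases
    )
  )
)

server <- function(input, output) {
  output$predicted <- renderText({
    req(input$word)
    predictNextWord(input$word, pt3)
  })
  output$topThree <- renderTable({
    req(input$word)
    w <- tolower(trimws(input$word))
    matches <- pt3[grepl(paste0("\\b", w, "\\b"), pt3$ngrams), ]
    head(matches[order(-matches$freq), c("ngrams", "freq")], 3)
  })
}

# shinyApp(ui, server)  # uncomment to launch the app locally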