Jeff B
31 May 2020
This application was built for the final Capstone project of the Johns Hopkins University Data Science: Statistics and Machine Learning Specialization on Coursera. This presentation provides a brief overview of the application, which is accessible here.
Summary of the Corpus Data Sets
| File | Twitter | News | Blogs |
|---|---|---|---|
| Total Lines (#) | 2360148 | 1010242 | 899288 |
| Total Words (#) | 30093413 | 34762395 | 37546239 |
| Longest Line (chars) | 140 | 11384 | 40833 |
| Avg. Line Length (chars) | 68.68043 | 201.16284 | 229.98668 |
| Unique Words (#) | 1554362 | 1066687 | 1352044 |
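For reference, statistics like these can be computed from the raw corpus files with base R and stringi; the file names and the helper below are assumptions for illustration, not code from the app:

```r
library(stringi)

# Hypothetical helper: compute the summary statistics for one corpus file
summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(stri_extract_all_words(stri_trans_tolower(lines)))
  data.frame(
    file         = basename(path),
    total_lines  = length(lines),
    total_words  = length(words),
    longest_line = max(nchar(lines)),
    avg_length   = mean(nchar(lines)),
    unique_words = length(unique(words))
  )
}

do.call(rbind, lapply(
  c("en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt"),
  summarize_file
))
```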
Example 1: Creating a DFM from the corpus

```r
library(quanteda)

# Build a trigram document-feature matrix from the cleaned tokens,
# keeping only trigrams that appear at least three times
dfm_trigram <- tokensClean %>%
  tokens_ngrams(n = 3) %>%
  dfm(tolower = TRUE) %>%
  dfm_trim(min_termfreq = 3)
```
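The same pattern presumably yields the unigram and bigram tables (with n = 1 and n = 2). Trimming n-grams seen fewer than three times keeps the tables small enough to load quickly in a Shiny app, at the cost of discarding rare continuations.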
Example 2: Creating a data.table frequency table from the DFM

```r
library(data.table)

# Tabulate trigram frequencies and keep only the feature and frequency columns
gramfreq_tri <- data.table(textstat_frequency(dfm_trigram))[, 1:2]
```
An example of the resulting frequency table:

| | feature | frequency |
|---|---|---|
| 1 | one_of_the | 2564 |
| 2 | a_lot_of | 2281 |
| 3 | to_be_a | 1245 |
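Each feature stores the full n-gram joined by underscores, so the predicted "next word" is simply its final token, recoverable with a one-liner like this (illustrative, not code from the app):

```r
# Strip everything up to the last underscore to get the final word
sub(".*_", "", "one_of_the")
[1] "the"
```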
How the Prediction Function Works

First, the function converts the user's input into a regex-ready, ngram-formatted search term:

```r
convertInput("a case of")
[1] "^a_case_of_"
```
If no match is found, the function "backs off" the term by dropping its first word, then searches again:

```r
backoff_ngram("a case of") %>% convertInput
[1] "^case_of_"
```
Finally, it returns the top matches in a simple table with discounted probabilities:

```r
predGram("a case of")
# A tibble: 5 x 2
  nextword probability
  <chr>          <dbl>
1 the            0.214
2 a              0.107
3 beer           0.107
4 what           0.107
5 mistaken       0.107
```
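To make the flow concrete, here is a hedged sketch of how a predGram()-style lookup could tie the pieces together using the trigram table built above. The discounting below is a simple illustrative absolute discount over the matched n-grams, not necessarily the app's exact scheme:

```r
library(data.table)
library(dplyr)
library(tibble)

# Hypothetical end-to-end lookup against the trigram frequency table
predGram_sketch <- function(input, gramfreq = gramfreq_tri, n = 5, discount = 0.5) {
  # find trigrams whose leading words match the (converted) input
  hits <- gramfreq[grepl(convertInput(input), feature)]
  if (nrow(hits) == 0) {
    shorter <- backoff_ngram(input)
    if (nchar(shorter) == 0) {
      return(tibble(nextword = character(), probability = numeric()))
    }
    return(predGram_sketch(shorter, gramfreq, n, discount))  # back off and retry
  }
  hits %>%
    as_tibble() %>%
    mutate(nextword    = sub(".*_", "", feature),            # last token of the n-gram
           probability = (frequency - discount) / sum(frequency)) %>%
    arrange(desc(probability)) %>%
    select(nextword, probability) %>%
    head(n)
}
```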