This week, we move on to the next tasks: exploratory data analysis and modeling.
The first questions to consider are basic ones about the data themselves (what they look like and where they come from).
Once you've considered these basic questions, you can move on to more complex ones about how to clean, tokenize, and model the text.
Task 2 - Exploratory Data Analysis
The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.
Tasks to accomplish
Questions to consider
library(pryr) # package used to get the size of the file in the R environment
library(corpus) # corpus is an R text-processing package with full support for international text (Unicode)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tm) # package used to remove numbers and strip whitespace from a text document
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
##
## Attaching package: 'tm'
## The following object is masked from 'package:pryr':
##
## inspect
library(stringi) # provides stri_enc_isascii(), which checks whether all bytes in a string are in the ASCII range (1-127)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, "final_dataset.zip")
unzip("final_dataset.zip")
us.twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"): line
## 167155 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"): line
## 268547 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"): line
## 1274086 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"): line
## 1759032 appears to contain an embedded nul
us.blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
us.news <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8")
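The embedded-nul warnings above are harmless, but they can be avoided by re-reading the file with readLines()'s skipNul argument; shown here as an optional alternative:
# us.twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)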
To address the questions in Quiz 1, here are basic summaries of the three files: word counts, line counts, and basic data tables.
filelist <- list(us.twitter=us.twitter, us.blogs=us.blogs, us.news=us.news)
file.summary <- data.frame(File = names(filelist), # file names
Size = sapply(filelist, function(x) {object_size(x)/1024^2}), # in-memory size (MB)
Length = sapply(filelist, function(x) {length(x)}), # number of lines
Max.Length = sapply(filelist, function(x) {summary(nchar(x))[6]})) # maximum line length (characters)
knitr::kable(file.summary)
| File | Size (MB) | Length (lines) | Max.Length (chars) |
|---|---|---|---|
| us.twitter | 318.9895 | 2360148 | 140 |
| us.blogs | 255.3545 | 899288 | 40833 |
| us.news | 257.3404 | 1010242 | 11384 |
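The table above omits word counts; a quick way to add them, assuming the stringi package loaded earlier, is to count the words on each line and sum them per file:
word.counts <- sapply(filelist, function(x) sum(stri_count_words(x)))
word.counts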
Subsample the files
us.twit <- us.twitter[1:5000] # keep only the first 5,000 tweets to speed up the analysis
# us.blog <- us.blogs[1:5000]
# us.new <- us.news[1:5000]
Clean the dataset
#1. Remove special (non-ASCII) characters
us.twit <- iconv(us.twit, "UTF-8","ASCII",sub="")
#2. Remove numbers from text document
us.twit <- removeNumbers(us.twit)
#3. Remove white spaces from text document
us.twit <- stripWhitespace(us.twit)
#4. Remove punctuation
us.twit <- removePunctuation(us.twit)
#5. Set the entire dataset to lower case
us.twit <- tolower(us.twit)
#6. Profanity filtering - task 1, remove bad words.
download.file("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", destfile = "bad_words.txt")
badword <- readLines("bad_words.txt", encoding="UTF-8")
us.twit <- removeWords(us.twit, c(badword))
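For reuse on the blog and news samples (commented out above), the same cleaning steps can be wrapped in a small helper; clean_text() is just an illustrative name, not a package function:
clean_text <- function(x, badwords) {
  x <- iconv(x, "UTF-8", "ASCII", sub = "") # drop non-ASCII characters
  x <- removeNumbers(x)
  x <- stripWhitespace(x)
  x <- removePunctuation(x)
  x <- tolower(x)
  removeWords(x, badwords)
}
# us.blog <- clean_text(us.blogs[1:5000], badword)
# us.new  <- clean_text(us.news[1:5000], badword)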
sum.twit.1 <- term_stats(us.twit, drop=stopwords_en, drop_punct = TRUE) # drop punctuation and English stop words (is, this, and, ...)
sum.twit.2 <- term_stats(us.twit, drop=stopwords_en, drop_punct = TRUE, ngrams=2)
sum.twit.3 <- term_stats(us.twit, drop=stopwords_en, drop_punct = TRUE, ngrams=3)
# sum.blog.1 <- term_stats(us.blog,drop=stopwords_en, drop_punct = TRUE)
# sum.blog.2 <- term_stats(us.blog,drop=stopwords_en, drop_punct = TRUE, ngrams=2)
# sum.blog.3 <- term_stats(us.blog,drop=stopwords_en, drop_punct = TRUE, ngrams=3)
#
# sum.new.1 <- term_stats(us.new, drop=stopwords_en, drop_punct = TRUE)
# sum.new.2 <- term_stats(us.new, drop=stopwords_en, drop_punct = TRUE, ngrams=2)
# sum.new.3 <- term_stats(us.new, drop=stopwords_en, drop_punct = TRUE, ngrams=3)
Take Twitter as an example
sum.twit.1.20 <- sum.twit.1 %>%
slice_head(n=20)
ggplot(sum.twit.1.20, aes(x=count,y=reorder(term, count))) +
geom_bar(stat= "identity",
color="black",
fill="lightblue") +
geom_text(aes(label=count, hjust=1.5)) +
ylab("1-Gram") +
xlab("Frequency") +
ggtitle("Top 20 Unigram - Twitter")
sum.twit.2.20 <- sum.twit.2 %>%
slice_head(n=20)
ggplot(sum.twit.2.20, aes(x=count,y=reorder(term, count))) +
geom_bar(stat= "identity",
color="black",
fill="lightblue") +
geom_text(aes(label=count, hjust=1.5)) +
ylab("1-Gram") +
xlab("Frequency") +
ggtitle("Top 20 Bigram - Twitter")
sum.twit.3.20 <- sum.twit.3 %>%
slice_head(n=20)
ggplot(sum.twit.3.20, aes(x=count,y=reorder(term, count))) +
geom_bar(stat= "identity",
color="black",
fill="lightblue") +
geom_text(aes(label=count, hjust=1.5)) +
ylab("1-Gram") +
xlab("Frequency") +
ggtitle("Top 20 Trigram - Twitter")
# Total word instances (tokens) in the cleaned Twitter sample
total.count.twit <- sum(sum.twit.1$count)
# Cumulative share of tokens covered by the frequency-sorted unique words
cum_p <- cumsum(sum.twit.1$count)/sum(sum.twit.1$count)
# 50%
which(cum_p>=0.5)[1]
## [1] 468
# 90%
which(cum_p>=0.9)[1]
## [1] 6105
468 unique words are needed in a frequency-sorted dictionary to cover 50% of all word instances in the language.
6105 unique words are needed in a frequency-sorted dictionary to cover 90% of all word instances in the language.
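A small helper (illustrative, not part of any package) generalizes the calculation above to any coverage level:
words_for_coverage <- function(freq, p) {
  which(cumsum(freq)/sum(freq) >= p)[1]
}
sapply(c(0.5, 0.9, 0.95), words_for_coverage, freq = sum.twit.1$count)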
us.twit_new <- us.twitter[1:5000]
sum.us.twit_new <- term_stats(us.twit_new, drop=stopwords_en, drop_punct = TRUE)
nrow(sum.us.twit_new)-sum(stri_enc_isascii(sum.us.twit_new$term))
## [1] 50
no_eng <- stri_enc_isascii(sum.us.twit_new$term)
(sum.us.twit_new$term)[!no_eng]
## [1] "β₯" "β€" "π" "π"
## [5] "π" "βΊ" "οΏ½" "π"
## [9] "π" "π₯" "π³" "π"
## [13] "π" "π" "β" "β¬"
## [17] "π" "π" "π" "#blaufrΓ€nkisch"
## [21] "minibΓΌk" "soirΓ©e" "β" "βΌ"
## [25] "β‘" "β" "π" "π"
## [29] "π" "πΈ" "π»" "π"
## [33] "πΆ" "π" "π" "π¦"
## [37] "π" "π" "π" "π"
## [41] "π€" "π₯" "π«" "π"
## [45] "π" "π‘" "π’" "π£"
## [49] "π°" "π"
50 terms in the Twitter sample contain non-ASCII characters; most are emoji, with a few foreign-language or accented words.
One way to increase coverage without storing more words is to classify dictionary words into clusters (for example, by synonym or word type) and keep only one representative word per cluster, to save memory.
If the representative word of a cluster appears in my sample, the other words in that cluster can then be used to cover additional phrases in the population.
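As a toy sketch of this idea (the cluster table below is hypothetical and hand-built, purely for illustration):
cluster.map <- c(good = "good", great = "good", awesome = "good",
                 happy = "happy", glad = "happy")
collapse_to_cluster <- function(words, map) {
  ifelse(words %in% names(map), unname(map[words]), words)
}
collapse_to_cluster(c("great", "awesome", "day"), cluster.map) # "good" "good" "day"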
The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.
Tasks to accomplish
Questions to consider
As you develop your prediction model, two key aspects that you will have to keep in mind are the size and runtime of the algorithm. These are defined as the amount of memory (RAM) required to run the model, and the amount of time it takes to produce a prediction from the user's input.
Your goal for this prediction model is to minimize both the size and runtime of the model in order to provide a reasonable experience to the user.
Keep in mind that currently available predictive text models can run on mobile phones, which typically have limited memory and processing power compared to desktop computers. Therefore, you should consider very carefully (1) how much memory is being used by the objects in your workspace; and (2) how much time it is taking to run your model. Ultimately, your model will need to run in a Shiny app that runs on the shinyapps.io server.
Tips, tricks, and hints
Here are a few tools that may be of use to you as you work on your algorithm:
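For example, pryr::object_size() (loaded above) and base R's system.time() cover the two constraints; the objects named here are the ones built earlier in this report:
object_size(sum.twit.2, sum.twit.3)          # memory used by the n-gram tables
system.time(term_stats(us.twit, ngrams = 2)) # time to rebuild the bigram table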
There will likely be a tradeoff to make between size and runtime. For example, an algorithm that requires a lot of memory may run faster, while a slower algorithm may require less memory. You will have to find the right balance between the two in order to provide a good experience to the user.
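One concrete way to make that tradeoff, sketched under the assumption that rare n-grams contribute little to prediction accuracy, is to prune low-count terms before storing the tables:
sum.twit.3.pruned <- subset(as.data.frame(sum.twit.3), count >= 2)
object_size(sum.twit.3)        # full trigram table
object_size(sum.twit.3.pruned) # pruned table; smaller, but rare phrases are lost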
The goal is to extract, from the bigram and trigram tables, the word sequences that can be stored and searched with a Markov-chain (n-gram) approach.
Suppose the words typed so far match the first one or two words of a stored trigram; the trigram table then gives a prediction for the next word.
If there is no match, I will search for the last word typed in the bigram table instead.
If the user does not choose the suggested word and types a different one, a new suggestion is generated from the updated context. This plan is based on the exploratory analysis performed above.
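A minimal back-off sketch of this search, assuming the sum.twit.2 and sum.twit.3 tables built above (split_ngram() and predict_next() are illustrative helpers, not package functions):
split_ngram <- function(tab) {
  tab <- as.data.frame(tab)
  parts <- strsplit(as.character(tab$term), " ", fixed = TRUE)
  data.frame(prefix = vapply(parts, function(p) paste(head(p, -1), collapse = " "), ""),
             next_word = vapply(parts, function(p) tail(p, 1), ""),
             count = tab$count,
             stringsAsFactors = FALSE)
}
tri.tab <- split_ngram(sum.twit.3) # prefix = first two words of each trigram
bi.tab  <- split_ngram(sum.twit.2) # prefix = first word of each bigram
predict_next <- function(phrase) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  hit <- tri.tab[tri.tab$prefix == paste(tail(words, 2), collapse = " "), ] # try trigrams first
  if (nrow(hit) == 0) hit <- bi.tab[bi.tab$prefix == tail(words, 1), ]      # back off to bigrams
  if (nrow(hit) == 0) return(NA_character_)                                 # no match found
  hit$next_word[which.max(hit$count)] # most frequent continuation
}
predict_next("happy mothers day")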