Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, or restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
The first step in analyzing any new data set is figuring out: (a) what data you have and (b) what are the standard tools and models used for that type of data.
In this capstone we will be applying data science in the area of natural language processing. As a first step toward working on this project, you should familiarize yourself with Natural Language Processing, Text Mining, and the associated tools in R.
This training data will be the basis for most of the capstone. To get started, you must download the data from the Coursera site rather than from external websites.
[Capstone Dataset](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)
Your original exploration of the data and modeling steps will be performed on this data set. Later in the capstone, if you find additional data sets that may be useful for building your model you may use them.
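If the files are not already on disk, the following is a minimal sketch for fetching and unpacking the dataset from the link above. Note that the zip extracts into its own folder structure, which may need to be moved or renamed to match the `data/en_US/` paths used in the code below.
## Download and unpack the capstone dataset (skipped if already present)
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("data")) dir.create("data")
if (!file.exists("data/Coursera-SwiftKey.zip")) {
  download.file(zip_url, destfile = "data/Coursera-SwiftKey.zip", mode = "wb")
}
unzip("data/Coursera-SwiftKey.zip", exdir = "data")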
Large databases comprising text in a target language are commonly used when generating language models for various purposes. In this exercise, you will use the English database, but you may also consider the three other databases in German, Russian and Finnish.
The goal of this task is to get familiar with the databases and do the necessary cleaning. After this exercise, you should understand what real data looks like and how much effort is needed to clean it. When starting to work on a new language, the first step is to understand the language and its peculiarities with respect to your target. You can learn to read, speak and write the language. Alternatively, you can study data and learn from existing information about the language through literature and the internet. At the very least, you need to understand how the language is written: writing script, existing input methods, some phonetic knowledge, etc.
Note that the data contain words of offensive and profane meaning. They are left there intentionally to highlight the fact that the developer has to work on them.
library(stringi)     # fast string operations (word counts)
library(dplyr)       # data manipulation
library(tm)          # text mining and cleaning utilities
library(wordcloud)   # word cloud plots
library(ggplot2)     # plotting
library(gridExtra)   # arranging multiple plots
library(RWeka)       # NGramTokenizer
library(knitr)       # kable tables
library(kableExtra)  # table styling
## Read the three English corpora; skipNul avoids warnings from embedded nul characters
blogs <- readLines("data/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("data/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
## Summarise file size (MB), word count and line count for each source
datasummary <- data.frame(
  name = c("twitter", "blogs", "news"),
  size = c(file.info("data/en_US/en_US.twitter.txt")$size / 1024^2,
           file.info("data/en_US/en_US.blogs.txt")$size / 1024^2,
           file.info("data/en_US/en_US.news.txt")$size / 1024^2),
  wordcount = c(sum(stri_count_words(twitter)),
                sum(stri_count_words(blogs)),
                sum(stri_count_words(news))),
  length = c(length(twitter), length(blogs), length(news)))
names(datasummary) <- c("Data Source", "Size (MB)", "Word Count", "Line Count")
kable(datasummary) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Data Source | Size (MB) | Word Count | Line Count |
|---|---|---|---|
| twitter | 200.4242 | 30093372 | 2360148 |
| blogs | 200.4242 | 37546239 | 899288 |
| news | 196.2775 | 34762395 | 1010242 |
set.seed(1234)   # fix the random seed so the sample is reproducible
sample_data <- c(sample(twitter, 500), sample(blogs, 500), sample(news, 500))
## Profanity word list by RobertJGabriel, used to filter offensive words
profanity <- readLines("https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt")
clean <- function(x) {
  text <- paste(x)
  ## Remove punctuation
  text <- removePunctuation(text)
  ## Drop non-ASCII characters
  text <- iconv(text, "UTF-8", "ASCII", sub = "")
  ## Remove numbers
  text <- removeNumbers(text)
  ## Convert to lower case
  text <- tolower(text)
  ## Remove profanity
  text <- removeWords(text, profanity)
  ## Remove English stopwords
  text <- removeWords(text, stopwords("english"))
  ## Collapse extra white space and return the cleaned text
  stripWhitespace(text)
}
clean_data <- clean(sample_data)
## Tokenize the cleaned sample directly into n-grams of order n and return a
## frequency table sorted from most to least frequent.
ngram_freq <- function(x, n) {
  grams <- NGramTokenizer(x, Weka_control(min = n, max = n))
  freq_table <- data.frame(table(grams))
  names(freq_table) <- c("word", "freq")
  arrange(freq_table, desc(freq))
}
unigrams <- ngram_freq(clean_data, 1)
bigrams <- ngram_freq(clean_data, 2)
trigrams <- ngram_freq(clean_data, 3)
The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.
unigrams_plot <- ggplot(unigrams[1:10,], aes(x = reorder(word, freq), y = freq)) +
geom_bar(stat='identity', aes(fill = word)) +
geom_text(label=unigrams$freq[1:10], size=4, hjust = 2) +
guides(fill = "none") +
xlab("Unigram") +
ylab("Frequency") +
ggtitle("Top 10 Unigrams")+
coord_flip() +
theme_bw()
print(unigrams_plot)
bigrams_plot <- ggplot(bigrams[1:10,], aes(x = reorder(word, freq), y = freq)) +
geom_bar(stat='identity', aes(fill = word)) +
geom_text(label=bigrams$freq[1:10], size=4, hjust = 2) +
guides(fill = "none") +
xlab("Bigram") +
ylab("Frequency") +
ggtitle("Top 10 Bigrams") +
coord_flip() +
theme_bw()
print(bigrams_plot)
trigrams_plot <- ggplot(trigrams[1:10,], aes(x = reorder(word, freq), y = freq)) +
geom_bar(stat='identity', aes(fill = word)) +
geom_text(label=trigrams$freq[1:10], size=4, hjust = 2) +
guides(fill = "none") +
xlab("Trigram") +
ylab("Frequency") +
ggtitle("Top 10 Trigrams") +
coord_flip() +
theme_bw()
print(trigrams_plot)
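Beyond the top-10 charts, the unigram table also shows how skewed the word-frequency distribution is. The sketch below (an illustrative helper, not part of the assignment) uses the `unigrams` data frame built above to estimate how many unique words are needed to cover 50% and 90% of all word instances in the sample:
## Illustrative coverage check: how many of the most frequent words account
## for a given share of all word instances in the sample?
coverage <- function(freq_table, threshold) {
  cum_share <- cumsum(freq_table$freq) / sum(freq_table$freq)
  which(cum_share >= threshold)[1]
}
coverage(unigrams, 0.5)   # unique words needed for 50% coverage
coverage(unigrams, 0.9)   # unique words needed for 90% coverage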
The cleaned and sampled data will be used to build a predictive text Shiny app. The app will run a model that predicts the next word in a sentence based on the n-gram frequencies found in this sample data set.
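As a rough illustration of how such a model could work (not the final app code), the sketch below looks up the last words of a phrase in the trigram and bigram tables built above and backs off to the lower-order table when no match is found. The `predict_next` helper and its simple back-off rule are assumptions for illustration only:
## Illustrative next-word lookup using the bigram/trigram frequency tables.
## In the real app the input phrase would need the same cleaning steps as the
## training text (punctuation, number and stop-word removal) before lookup.
predict_next <- function(phrase, n_suggestions = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  last2 <- paste(tail(words, 2), collapse = " ")
  last1 <- tail(words, 1)
  ## Trigrams whose first two words match the end of the phrase
  hits <- trigrams[grepl(paste0("^", last2, " "), trigrams$word), ]
  if (nrow(hits) == 0) {
    ## Back off to bigrams whose first word matches the last word typed
    hits <- bigrams[grepl(paste0("^", last1, " "), bigrams$word), ]
  }
  ## Return the final word of the top-ranked matches
  head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n_suggestions)
}
predict_next("happy new")   # might suggest "year" among the top continuations
A production model would replace this raw frequency lookup with a proper smoothing or back-off scheme and would be trained on far more than the 1,500 sampled lines used here.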