Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, or restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
The first step in analyzing any new data set is figuring out: (a) what data you have and (b) what are the standard tools and models used for that type of data.
In this capstone we will be applying data science in the area of natural language processing. As a first step toward working on this project, you should familiarize yourself with Natural Language Processing, Text Mining, and the associated tools in R.
This training data will be the basis for most of the capstone. To get started, you must download the data from the Coursera site rather than from external websites.
[Capstone Dataset](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)
Your original exploration of the data and modeling steps will be performed on this data set. Later in the capstone, if you find additional data sets that may be useful for building your model you may use them.
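If the files are not already on disk, the following is a minimal sketch for fetching and unpacking the dataset from the link above. Note that the zip extracts into its own folder structure, which may need to be moved or renamed to match the `data/en_US/` paths used in the code below.
## Download and unpack the capstone dataset (skipped if already present)
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("data")) dir.create("data")
if (!file.exists("data/Coursera-SwiftKey.zip")) {
  download.file(zip_url, destfile = "data/Coursera-SwiftKey.zip", mode = "wb")
}
unzip("data/Coursera-SwiftKey.zip", exdir = "data")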
Large databases comprising text in a target language are commonly used when generating language models for various purposes. In this exercise, you will use the English database, but you may also consider the three other databases in German, Russian and Finnish.
The goal of this task is to get familiar with the databases and do the necessary cleaning. After this exercise, you should understand what real data looks like and how much effort is needed to clean it. When starting to work on a new language, the first step is to understand the language and its peculiarities with respect to your target. You can learn to read, speak and write the language. Alternatively, you can study data and learn from existing information about the language through literature and the internet. At the very least, you need to understand how the language is written: writing script, existing input methods, some phonetic knowledge, etc.
Note that the data contain words of offensive and profane meaning. They are left there intentionally to highlight the fact that the developer has to work on them.
library(stringi)     # fast string operations (word counts)
library(dplyr)       # data manipulation
library(tm)          # text mining and cleaning utilities
library(wordcloud)   # word cloud plots
library(ggplot2)     # plotting
library(gridExtra)   # arranging multiple plots
library(RWeka)       # NGramTokenizer
library(knitr)       # kable tables
library(kableExtra)  # table styling
## Read the three English corpora; skipNul avoids warnings from embedded nul characters
blogs <- readLines("data/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("data/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
## Summarise file size (MB), word count and line count for each source
datasummary <- data.frame(
  name = c("twitter", "blogs", "news"),
  size = c(file.info("data/en_US/en_US.twitter.txt")$size / 1024^2,
           file.info("data/en_US/en_US.blogs.txt")$size / 1024^2,
           file.info("data/en_US/en_US.news.txt")$size / 1024^2),
  wordcount = c(sum(stri_count_words(twitter)),
                sum(stri_count_words(blogs)),
                sum(stri_count_words(news))),
  length = c(length(twitter), length(blogs), length(news)))
names(datasummary) <- c("Data Source", "Size (MB)", "Word Count", "Line Count")
kable(datasummary) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Data Source | Size (MB) | Word Count | Line Count |
|---|---|---|---|
| twitter | 200.4242 | 30093372 | 2360148 |
| blogs | 200.4242 | 37546239 | 899288 |
| news | 196.2775 | 34762395 | 1010242 |
set.seed(1234)   # fix the random seed so the sample is reproducible
sample_data <- c(sample(twitter, 500), sample(blogs, 500), sample(news, 500))
## Profanity word list by RobertJGabriel, used to filter offensive words
profanity <- readLines("https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt")
clean <- function(x) {
  text <- paste(x)
  ## Remove punctuation
  text <- removePunctuation(text)
  ## Drop non-ASCII characters
  text <- iconv(text, "UTF-8", "ASCII", sub = "")
  ## Remove numbers
  text <- removeNumbers(text)
  ## Convert to lower case
  text <- tolower(text)
  ## Remove profanity
  text <- removeWords(text, profanity)
  ## Remove English stopwords
  text <- removeWords(text, stopwords("english"))
  ## Collapse extra white space and return the cleaned text
  stripWhitespace(text)
}
clean_data <- clean(sample_data)
## Tokenize the cleaned sample directly into n-grams of order n and return a
## frequency table sorted from most to least frequent.
ngram_freq <- function(x, n) {
  grams <- NGramTokenizer(x, Weka_control(min = n, max = n))
  freq_table <- data.frame(table(grams))
  names(freq_table) <- c("word", "freq")
  arrange(freq_table, desc(freq))
}
unigrams <- ngram_freq(clean_data, 1)
bigrams <- ngram_freq(clean_data, 2)
trigrams <- ngram_freq(clean_data, 3)
The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.
unigrams_plot <- ggplot(unigrams[1:10,], aes(x = reorder(word, freq), y = freq)) +
geom_bar(stat='identity', aes(fill = word)) +
geom_text(label=unigrams$freq[1:10], size=4, hjust = 2) +
guides(fill = "none") +
xlab("Unigram") +
ylab("Frequency") +
ggtitle("Top 10 Unigrams")+
coord_flip() +
theme_bw()
print(unigrams_plot)
bigrams_plot <- ggplot(bigrams[1:10,], aes(x = reorder(word, freq), y = freq)) +
geom_bar(stat='identity', aes(fill = word)) +
geom_text(label=bigrams$freq[1:10], size=4, hjust = 2) +
guides(fill = "none") +
xlab("Bigram") +
ylab("Frequency") +
ggtitle("Top 10 Bigrams") +
coord_flip() +
theme_bw()
print(bigrams_plot)
trigrams_plot <- ggplot(trigrams[1:10,], aes(x = reorder(word, freq), y = freq)) +
geom_bar(stat='identity', aes(fill = word)) +
geom_text(label=trigrams$freq[1:10], size=4, hjust = 2) +
guides(fill = "none") +
xlab("Trigram") +
ylab("Frequency") +
ggtitle("Top 10 Trigrams") +
coord_flip() +
theme_bw()
print(trigrams_plot)
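Beyond the top-10 charts, the unigram table also shows how skewed the word-frequency distribution is. The sketch below (an illustrative helper, not part of the assignment) uses the `unigrams` data frame built above to estimate how many unique words are needed to cover 50% and 90% of all word instances in the sample:
## Illustrative coverage check: how many of the most frequent words account
## for a given share of all word instances in the sample?
coverage <- function(freq_table, threshold) {
  cum_share <- cumsum(freq_table$freq) / sum(freq_table$freq)
  which(cum_share >= threshold)[1]
}
coverage(unigrams, 0.5)   # unique words needed for 50% coverage
coverage(unigrams, 0.9)   # unique words needed for 90% coverage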
The cleaned and sampled data will be used to build a predictive text Shiny app. The app will run a model that predicts the next word in a sentence based on the n-gram frequencies found in this sample data set.
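As a rough illustration of how such a model could work (not the final app code), the sketch below looks up the last words of a phrase in the trigram and bigram tables built above and backs off to the lower-order table when no match is found. The `predict_next` helper and its simple back-off rule are assumptions for illustration only:
## Illustrative next-word lookup using the bigram/trigram frequency tables.
## In the real app the input phrase would need the same cleaning steps as the
## training text (punctuation, number and stop-word removal) before lookup.
predict_next <- function(phrase, n_suggestions = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  last2 <- paste(tail(words, 2), collapse = " ")
  last1 <- tail(words, 1)
  ## Trigrams whose first two words match the end of the phrase
  hits <- trigrams[grepl(paste0("^", last2, " "), trigrams$word), ]
  if (nrow(hits) == 0) {
    ## Back off to bigrams whose first word matches the last word typed
    hits <- bigrams[grepl(paste0("^", last1, " "), bigrams$word), ]
  }
  ## Return the final word of the top-ranked matches
  head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n_suggestions)
}
predict_next("happy new")   # might suggest "year" among the top continuations
A production model would replace this raw frequency lookup with a proper smoothing or back-off scheme and would be trained on far more than the 1,500 sampled lines used here.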