Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
In this report we explore the distribution of the words, tokens, and phrases in the text and the relationships between them. Our goal is to understand the basic relationships we observe in the data and to prepare for building our first linguistic models.
First, we load the three data files and look at some summary numbers for each source: object size (in bytes), number of lines, and number of words.
suppressMessages(library(ggplot2))
suppressMessages(library(quanteda))
suppressMessages(library(stringi))
# Read the raw data files
twitter <- readLines("./en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("./en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
# Count lines and words per source
twitter_len <- length(twitter); blogs_len <- length(blogs); news_len <- length(news)
twitter_wd <- sum(stri_count_words(twitter)); blogs_wd <- sum(stri_count_words(blogs)); news_wd <- sum(stri_count_words(news))
# Summarise object size (bytes), line count and word count in one table
df <- data.frame(file = c("twitter", "blogs", "news"),
                 size = c(object.size(twitter), object.size(blogs), object.size(news)),
                 lines = c(twitter_len, blogs_len, news_len),
                 words = c(twitter_wd, blogs_wd, news_wd))
df
## file size lines words
## 1 twitter 316037600 2360148 30093410
## 2 blogs 260564320 899288 37546246
## 3 news 261759048 1010242 34762395
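For readability, the object sizes above (reported in bytes) can also be expressed in megabytes; this is an optional check using the same object.size values:
format(object.size(twitter), units = "Mb")
format(object.size(blogs), units = "Mb")
format(object.size(news), units = "Mb")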
After this first look at the raw data, we create a sample for the analysis. We draw a proportion of 0.01 of the lines from each source and combine them into one corpus variable. This proportion is chosen to balance sample size against the speed of the analysis.
# Draw a 1% random sample of lines from each source
sample_proportion <- 0.01
set.seed(1122)
twitter_s <- sample(twitter, floor(length(twitter) * sample_proportion))
news_s <- sample(news, floor(length(news) * sample_proportion))
blogs_s <- sample(blogs, floor(length(blogs) * sample_proportion))
# Combine the samples into a single character vector used as the corpus
docs <- c(twitter_s, blogs_s, news_s)
# Optionally free memory once the corpus is built:
# remove(twitter); remove(blogs); remove(news)
# remove(twitter_s); remove(blogs_s); remove(news_s)
Using the quanteda package, we use the dfm function to analyse the unigram, bigram and trigram distributions of our sample data. We apply several preprocessing options: lowercasing, removing numbers, punctuation, separators and Twitter characters, and ignoring English stopwords. Once the dfm objects are created, we look at the top features.
# Build document-feature matrices for unigrams, bigrams and trigrams from the sample corpus
unigram <- dfm(docs, what = "word", toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, removeTwitter = TRUE, ignoredFeatures = stopwords("english"), language = "english", verbose = FALSE)
bigram <- dfm(docs, what = "word", ngram = 2, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, removeTwitter = TRUE, ignoredFeatures = stopwords("english"), language = "english", verbose = FALSE)
trigram <- dfm(docs, what = "word", ngram = 3, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, removeTwitter = TRUE, ignoredFeatures = stopwords("english"), language = "english", verbose = FALSE)
topfeatures(unigram, 50)
## will just said one like can time get new good
## 3181 3018 2996 2807 2693 2466 2250 2214 1953 1841
## now day love people know see back go make first
## 1795 1718 1637 1574 1551 1471 1362 1360 1343 1328
## last going great also much think us really got year
## 1308 1292 1276 1270 1266 1254 1169 1159 1137 1134
## two way well want work today even still thanks right
## 1121 1102 1093 1077 1074 1062 1044 1039 1014 983
## need years many rt life say never little come take
## 981 951 886 873 868 861 856 851 837 820
topfeatures(bigram, 50)
## right_now last_year last_night new_york
## 260 185 173 171
## years_ago last_week high_school looking_forward
## 146 138 132 118
## first_time feel_like looks_like one_day
## 116 113 102 99
## st_louis even_though happy_birthday make_sure
## 96 94 93 93
## just_got new_jersey next_week good_luck
## 90 90 87 86
## can_get good_morning look_like every_day
## 81 81 80 80
## san_francisco two_years every_time next_year
## 76 71 70 68
## los_angeles social_media come_back next_time
## 67 66 66 65
## little_bit long_time just_like can_see
## 65 62 62 60
## pretty_much united_states many_people one_thing
## 59 58 56 55
## will_get pretty_good last_month get_back
## 54 54 53 51
## sounds_like go_back will_take last_time
## 51 50 48 48
## years_old will_come
## 47 47
topfeatures(trigram, 50)
## let_us_know new_york_city
## 20 20
## two_years_ago happy_new_year
## 19 18
## happy_happy_happy happy_mothers_day
## 18 16
## happy_mother's_day gov_chris_christie
## 15 15
## just_got_back cinco_de_mayo
## 14 13
## ha_ha_ha la_la_la
## 10 10
## president_barack_obama world_war_ii
## 10 9
## long_time_ago dreams_come_true
## 8 8
## will_take_place high_school_students
## 8 8
## first_time_since cents_per_share
## 8 8
## come_join_us st_patrick's_day
## 7 7
## social_security_number dream_come_true
## 7 7
## biased_biased_biased big_time_rush
## 7 7
## will_come_back osama_bin_laden
## 7 7
## five_years_ago u.s_census_bureau
## 7 7
## last_two_years past_two_years
## 7 7
## per_serving_calories u.s_district_judge
## 7 7
## just_got_done weave_weave_weave
## 6 6
## time_last_year new_year's_eve
## 6 6
## new_york_times several_years_ago
## 6 6
## past_three_years chief_executive_officer
## 6 6
## next_school_year uninterrupted_open_highest
## 6 6
## open_highest_form four_years_ago
## 6 6
## trenton_gov_chris st_louis_county
## 6 6
## superior_court_judge senior_vice_president
## 6 6
We plot the top 50 features of each n-gram to get a better view of their frequency distributions:
# Plot frequency against rank for the 50 most frequent unigrams, bigrams and trigrams
plot(topfeatures(unigram, 50), xaxt = "n", xlab = "", ylab = "Frequency", main = "Unigram Top Features")
axis(1, at = 1:50, labels = names(topfeatures(unigram, 50)), las = 2, cex.axis = 0.6)
plot(topfeatures(bigram, 50), xaxt = "n", xlab = "", ylab = "Frequency", main = "Bigram Top Features")
axis(1, at = 1:50, labels = names(topfeatures(bigram, 50)), las = 2, cex.axis = 0.6)
plot(topfeatures(trigram, 50), xaxt = "n", xlab = "", ylab = "Frequency", main = "Trigram Top Features")
axis(1, at = 1:50, labels = names(topfeatures(trigram, 50)), las = 2, cex.axis = 0.6)
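Since ggplot2 is already loaded, the same information can also be drawn as a bar chart. The sketch below shows the 20 most frequent bigrams; the bigram_top and bigram_df names are introduced here only for illustration:
# Optional ggplot2 view of the 20 most frequent bigrams
bigram_top <- topfeatures(bigram, 20)
bigram_df <- data.frame(feature = names(bigram_top), frequency = as.numeric(bigram_top))
ggplot(bigram_df, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Bigram", y = "Frequency", title = "Bigram Top Features")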
To build the prediction algorithm we will use the bigram and trigram frequencies analysed above, and optionally extend the approach to quadgrams.
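As a rough illustration of that idea, the sketch below implements a simple frequency-based back-off lookup on the n-gram tables built above: it first searches the trigram features that begin with the last two typed words and, if nothing is found, backs off to the bigram features that begin with the last word. The predict_next_word helper and the trigram_freq/bigram_freq tables are illustrative names introduced here, not the final algorithm.
# Illustrative back-off lookup: try trigrams first, then bigrams (sketch only)
trigram_freq <- sort(colSums(trigram), decreasing = TRUE)
bigram_freq <- sort(colSums(bigram), decreasing = TRUE)
predict_next_word <- function(phrase, n = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  if (length(words) >= 2) {
    # look for trigram features beginning with the last two words
    prefix <- paste(tail(words, 2), collapse = "_")
    hits <- trigram_freq[grepl(paste0("^", prefix, "_"), names(trigram_freq))]
    if (length(hits) > 0) return(head(sub(".*_", "", names(hits)), n))
  }
  # back off to bigram features beginning with the last word
  prefix <- tail(words, 1)
  hits <- bigram_freq[grepl(paste0("^", prefix, "_"), names(bigram_freq))]
  head(sub(".*_", "", names(hits)), n)
}
predict_next_word("looking")   # example call: candidate words after "looking"
predict_next_word("new york")  # example call: candidate words after "new york"
Note that because stopwords were removed when building the dfm objects, this lookup only handles content-word prefixes; the final algorithm will also need to deal with stopwords and with prefixes that were never observed in the sample.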