Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
In this report we explore the distribution of the words, tokens, and phrases in the text and the relationships between them. Our goal is to understand the basic relationships we observe in the data and to prepare for building our first linguistic models.
First, we load the three data files and look at some summary numbers for each source: object size (in bytes), number of lines, and number of words.
suppressMessages(library(ggplot2))
suppressMessages(library(quanteda))
suppressMessages(library(stringi))
# Read the raw data files
twitter <- readLines("./en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("./en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
# Count lines and words per source
twitter_len <- length(twitter); blogs_len <- length(blogs); news_len <- length(news)
twitter_wd <- sum(stri_count_words(twitter)); blogs_wd <- sum(stri_count_words(blogs)); news_wd <- sum(stri_count_words(news))
# Summarise object size (bytes), line count and word count in one table
df <- data.frame(file = c("twitter", "blogs", "news"),
                 size = c(object.size(twitter), object.size(blogs), object.size(news)),
                 lines = c(twitter_len, blogs_len, news_len),
                 words = c(twitter_wd, blogs_wd, news_wd))
df
## file size lines words
## 1 twitter 316037600 2360148 30093410
## 2 blogs 260564320 899288 37546246
## 3 news 261759048 1010242 34762395
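For readability, the object sizes above (reported in bytes) can also be expressed in megabytes; this is an optional check using the same object.size values:
format(object.size(twitter), units = "Mb")
format(object.size(blogs), units = "Mb")
format(object.size(news), units = "Mb")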
After this first look at the raw data, we create a sample for the analysis. We draw a proportion of 0.01 of the lines from each source and combine them into one corpus variable. This proportion is chosen to balance sample size against the speed of the analysis.
# Draw a 1% random sample of lines from each source
sample_proportion <- 0.01
set.seed(1122)
twitter_s <- sample(twitter, floor(length(twitter) * sample_proportion))
news_s <- sample(news, floor(length(news) * sample_proportion))
blogs_s <- sample(blogs, floor(length(blogs) * sample_proportion))
# Combine the samples into a single character vector used as the corpus
docs <- c(twitter_s, blogs_s, news_s)
# Optionally free memory once the corpus is built:
# remove(twitter); remove(blogs); remove(news)
# remove(twitter_s); remove(blogs_s); remove(news_s)
Using the quanteda package, we use the dfm function to analyse the unigram, bigram and trigram distributions of our sample data. We apply several preprocessing options: lowercasing, removing numbers, punctuation, separators and Twitter characters, and ignoring English stopwords. Once the dfm objects are created, we look at the top features.
# Build document-feature matrices for unigrams, bigrams and trigrams from the sample corpus
unigram <- dfm(docs, what = "word", toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, removeTwitter = TRUE, ignoredFeatures = stopwords("english"), language = "english", verbose = FALSE)
bigram <- dfm(docs, what = "word", ngram = 2, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, removeTwitter = TRUE, ignoredFeatures = stopwords("english"), language = "english", verbose = FALSE)
trigram <- dfm(docs, what = "word", ngram = 3, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, removeTwitter = TRUE, ignoredFeatures = stopwords("english"), language = "english", verbose = FALSE)
topfeatures(unigram, 50)
## will just said one like can time get new good
## 3181 3018 2996 2807 2693 2466 2250 2214 1953 1841
## now day love people know see back go make first
## 1795 1718 1637 1574 1551 1471 1362 1360 1343 1328
## last going great also much think us really got year
## 1308 1292 1276 1270 1266 1254 1169 1159 1137 1134
## two way well want work today even still thanks right
## 1121 1102 1093 1077 1074 1062 1044 1039 1014 983
## need years many rt life say never little come take
## 981 951 886 873 868 861 856 851 837 820
topfeatures(bigram, 50)
## right_now last_year last_night new_york
## 260 185 173 171
## years_ago last_week high_school looking_forward
## 146 138 132 118
## first_time feel_like looks_like one_day
## 116 113 102 99
## st_louis even_though happy_birthday make_sure
## 96 94 93 93
## just_got new_jersey next_week good_luck
## 90 90 87 86
## can_get good_morning look_like every_day
## 81 81 80 80
## san_francisco two_years every_time next_year
## 76 71 70 68
## los_angeles social_media come_back next_time
## 67 66 66 65
## little_bit long_time just_like can_see
## 65 62 62 60
## pretty_much united_states many_people one_thing
## 59 58 56 55
## will_get pretty_good last_month get_back
## 54 54 53 51
## sounds_like go_back will_take last_time
## 51 50 48 48
## years_old will_come
## 47 47
topfeatures(trigram, 50)
## let_us_know new_york_city
## 20 20
## two_years_ago happy_new_year
## 19 18
## happy_happy_happy happy_mothers_day
## 18 16
## happy_mother's_day gov_chris_christie
## 15 15
## just_got_back cinco_de_mayo
## 14 13
## ha_ha_ha la_la_la
## 10 10
## president_barack_obama world_war_ii
## 10 9
## long_time_ago dreams_come_true
## 8 8
## will_take_place high_school_students
## 8 8
## first_time_since cents_per_share
## 8 8
## come_join_us st_patrick's_day
## 7 7
## social_security_number dream_come_true
## 7 7
## biased_biased_biased big_time_rush
## 7 7
## will_come_back osama_bin_laden
## 7 7
## five_years_ago u.s_census_bureau
## 7 7
## last_two_years past_two_years
## 7 7
## per_serving_calories u.s_district_judge
## 7 7
## just_got_done weave_weave_weave
## 6 6
## time_last_year new_year's_eve
## 6 6
## new_york_times several_years_ago
## 6 6
## past_three_years chief_executive_officer
## 6 6
## next_school_year uninterrupted_open_highest
## 6 6
## open_highest_form four_years_ago
## 6 6
## trenton_gov_chris st_louis_county
## 6 6
## superior_court_judge senior_vice_president
## 6 6
We plot the top 50 features of each n-gram to get a better view of their frequency distributions:
# Plot frequency against rank for the 50 most frequent unigrams, bigrams and trigrams
plot(topfeatures(unigram, 50), xaxt = "n", xlab = "", ylab = "Frequency", main = "Unigram Top Features")
axis(1, at = 1:50, labels = names(topfeatures(unigram, 50)), las = 2, cex.axis = 0.6)
plot(topfeatures(bigram, 50), xaxt = "n", xlab = "", ylab = "Frequency", main = "Bigram Top Features")
axis(1, at = 1:50, labels = names(topfeatures(bigram, 50)), las = 2, cex.axis = 0.6)
plot(topfeatures(trigram, 50), xaxt = "n", xlab = "", ylab = "Frequency", main = "Trigram Top Features")
axis(1, at = 1:50, labels = names(topfeatures(trigram, 50)), las = 2, cex.axis = 0.6)
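Since ggplot2 is already loaded, the same information can also be drawn as a bar chart. The sketch below shows the 20 most frequent bigrams; the bigram_top and bigram_df names are introduced here only for illustration:
# Optional ggplot2 view of the 20 most frequent bigrams
bigram_top <- topfeatures(bigram, 20)
bigram_df <- data.frame(feature = names(bigram_top), frequency = as.numeric(bigram_top))
ggplot(bigram_df, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Bigram", y = "Frequency", title = "Bigram Top Features")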
To build the prediction algorithm we will use the bigram and trigram frequencies analysed above, and optionally extend the approach to quadgrams.
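As a rough illustration of that idea, the sketch below implements a simple frequency-based back-off lookup on the n-gram tables built above: it first searches the trigram features that begin with the last two typed words and, if nothing is found, backs off to the bigram features that begin with the last word. The predict_next_word helper and the trigram_freq/bigram_freq tables are illustrative names introduced here, not the final algorithm.
# Illustrative back-off lookup: try trigrams first, then bigrams (sketch only)
trigram_freq <- sort(colSums(trigram), decreasing = TRUE)
bigram_freq <- sort(colSums(bigram), decreasing = TRUE)
predict_next_word <- function(phrase, n = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  if (length(words) >= 2) {
    # look for trigram features beginning with the last two words
    prefix <- paste(tail(words, 2), collapse = "_")
    hits <- trigram_freq[grepl(paste0("^", prefix, "_"), names(trigram_freq))]
    if (length(hits) > 0) return(head(sub(".*_", "", names(hits)), n))
  }
  # back off to bigram features beginning with the last word
  prefix <- tail(words, 1)
  hits <- bigram_freq[grepl(paste0("^", prefix, "_"), names(bigram_freq))]
  head(sub(".*_", "", names(hits)), n)
}
predict_next_word("looking")   # example call: candidate words after "looking"
predict_next_word("new york")  # example call: candidate words after "new york"
Note that because stopwords were removed when building the dfm objects, this lookup only handles content-word prefixes; the final algorithm will also need to deal with stopwords and with prefixes that were never observed in the sample.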