This report explains how we will build a predictive text algorithm: the kind of program that lets your phone anticipate the next word you will type and offer suggestions before you type it. The procedure I follow is:
Obtain a large body (corpus) of text to analyze. It is important to document how we retrieve these data so that anyone can reproduce the findings from a fixed starting point. The full corpus is too large to process conveniently, so we will cut it down to about 1% of its original size to start with.
Load the corpus and transform it to remove words and symbols that are unimportant for predicting text.
Do some basic analysis of which words appear most often in the text; single words are called unigrams. Then analyze which words appear together most frequently: two-word combinations are bigrams and three-word combinations are trigrams (see the short illustration below).
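To make the terminology concrete, here is a tiny, hypothetical illustration (not part of the analysis below) using the same dfm() and topfeatures() calls applied later in this report:
s <- "the cat sat on the mat"
topfeatures(dfm(s))              # unigrams: the (x2), cat, sat, on, mat
topfeatures(dfm(s, ngrams = 2))  # bigrams:  the_cat, cat_sat, sat_on, on_the, the_mat
topfeatures(dfm(s, ngrams = 3))  # trigrams: the_cat_sat, cat_sat_on, sat_on_the, on_the_mat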
Data for the final project was obtained from this location. Our first step is to get some summary statistics on what we are dealing with. We will limit our analysis to the US English files.
%%bash
ls -lh data/final/en_US/*
wc -l data/final/en_US/*
-rw-r--r--@ 1 David staff 200M Jul 22 2014 data/final/en_US/en_US.blogs.txt
-rw-r--r--@ 1 David staff 196M Jul 22 2014 data/final/en_US/en_US.news.txt
-rw-r--r--@ 1 David staff 159M Jul 22 2014 data/final/en_US/en_US.twitter.txt
899288 data/final/en_US/en_US.blogs.txt
1010242 data/final/en_US/en_US.news.txt
2360148 data/final/en_US/en_US.twitter.txt
4269678 total
So the files are roughly 150-200 MB each. The course materials strongly hint that these files should be randomly sampled to reduce their size for the initial analysis. I'm going to do this with a few UNIX commands rather than R code, limiting each sample to about 1% of the lines in the original file.
%%bash
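# The sample sizes below are roughly 1% of the line counts reported by wc -l above
# (2,360,148 tweets -> 23,696; 1,010,242 news lines -> 10,102; 899,288 blog lines -> 8,992).
# Note: this assumes the data/sample/en_US/ directory already exists (create it with mkdir -p first if not).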
gshuf -n 23696 -o data/sample/en_US/en_US.twitter.txt data/final/en_US/en_US.twitter.txt
gshuf -n 10102 -o data/sample/en_US/en_US.news.txt data/final/en_US/en_US.news.txt
gshuf -n 8992 -o data/sample/en_US/en_US.blogs.txt data/final/en_US/en_US.blogs.txt
%%bash
ls -lh data/sample/en_US/*
wc -l data/sample/en_US/*
-rw-r--r-- 1 David staff 2.0M Mar 11 10:01 data/sample/en_US/en_US.blogs.txt
-rw-r--r-- 1 David staff 2.0M Mar 11 10:01 data/sample/en_US/en_US.news.txt
-rw-r--r-- 1 David staff 1.6M Mar 11 10:01 data/sample/en_US/en_US.twitter.txt
8992 data/sample/en_US/en_US.blogs.txt
10102 data/sample/en_US/en_US.news.txt
23696 data/sample/en_US/en_US.twitter.txt
42790 total
Now we start the data analysis in R, using the following libraries. Some of the computations benefit from parallelization, so we load the doMC (multicore) library and register eight cores.
library(tm)
library(doMC)
registerDoMC(cores = 8)
library(ggplot2)
library(dplyr)
library(quanteda)
Now we are ready to start working with our downsampled data. We will limit ourselves to the three files in the US English sample directory.
# Load the corpus
c <- corpus(textfile(file='data/sample/en_US/*'))
summary(c)
## Corpus consisting of 3 documents.
##
## Text Types Tokens Sentences
## en_US.blogs.txt 34160 423542 23523
## en_US.news.txt 35118 408764 19959
## en_US.twitter.txt 33912 373192 38041
##
## Source: /Users/David/Hack/Datascience-Coursera/Capstone/* on x86_64 by David
## Created: Sun Mar 20 14:31:58 2016
## Notes:
Filter out words (features) that are not useful and create document-feature matrices for unigrams, bigrams, and trigrams. The resulting matrices are stored in sparse form, since many features occur in only some of the documents; this is what the "sparse dfm" in the output below refers to.
#
# create a list of profanity to filter
conn <- file("data/swearWords.csv", "r")
sw <- readLines(conn)
close(conn)
sw <- strsplit(sw,",")
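# sw is a one-element list (the file is presumably a single comma-separated line);
# sw[[1]] below is the character vector of words to filter out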
# create a sparse document feature matrix for the unigrams
d <- dfm(c, ignoredFeatures = c(sw[[1]], stopwords("english")), removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 3 documents
## ... indexing features: 57,645 feature types
## ... removed 218 features, from 251 supplied (glob) feature types
## ... created a 3 x 57427 sparse dfm
## ... complete.
## Elapsed time: 2.703 seconds.
# create the freq. matrix for bigrams
d2 <- dfm(c, ngrams = 2, ignoredFeatures = c(sw[[1]], stopwords("english")), removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 3 documents
## ... indexing features: 472,624 feature types
## ... removed 251,849 features, from 251 supplied (glob) feature types
## ... created a 3 x 220775 sparse dfm
## ... complete.
## Elapsed time: 12.813 seconds.
# create the freq matrix for trigrams
d3 <- dfm(c, ngrams = 3, ignoredFeatures = c(sw[[1]], stopwords("english")), removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 3 documents
## ... indexing features: 848,078 feature types
## ... removed 717,022 features, from 251 supplied (glob) feature types
## ... created a 3 x 131056 sparse dfm
## ... complete.
## Elapsed time: 21.871 seconds.
Now we can do some analysis of the unigrams, bigrams, and trigrams. We want a sense of how common the top terms are and how frequently they appear in the text as a whole; word clouds visualize this below.
# counts of the top words
topfeatures(d)
## will just said one like can get time new now
## 3173 3048 3002 2927 2748 2498 2284 2157 1912 1862
# top word counts divided by the number of feature types
topfeatures(d) / d@Dim[2]
## will just said one like can
## 0.05525276 0.05307608 0.05227506 0.05096906 0.04785206 0.04349870
## get time new now
## 0.03977223 0.03756073 0.03329444 0.03242377
plot(d, min.freq = 750, random.order = FALSE)
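Note that dividing by d@Dim[2] normalizes the counts by the number of feature types rather than by the total number of tokens. If we wanted each word's share of all retained tokens instead, something like the following should work (a sketch, assuming the sparse dfm can be summed like an ordinary Matrix object):
# top word counts as a share of all tokens retained in the dfm
topfeatures(d) / sum(d)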
topfeatures(d2) # counts for the top bigrams
## right_now new_york last_year last_night
## 242 218 203 162
## high_school years_ago last_week feel_like
## 138 138 135 123
## first_time looking_forward
## 120 111
topfeatures(d2) / d2@Dim[2] # bigram counts divided by the number of feature types
## right_now new_york last_year last_night
## 0.0010961386 0.0009874306 0.0009194882 0.0007337787
## high_school years_ago last_week feel_like
## 0.0006250708 0.0006250708 0.0006114823 0.0005571283
## first_time looking_forward
## 0.0005435398 0.0005027743
plot(d2, min.freq = 50, random.order = FALSE)
topfeatures(d3) # counts for the top trigrams
## new_york_city let_us_know happy_new_year
## 33 27 23
## happy_mothers_day happy_mother's_day cinco_de_mayo
## 22 21 18
## president_barack_obama new_york_times two_years_ago
## 15 14 14
## world_war_ii
## 13
topfeatures(d3) / d3@Dim[2] # trigram counts divided by the number of feature types
## new_york_city let_us_know happy_new_year
## 2.518008e-04 2.060188e-04 1.754975e-04
## happy_mothers_day happy_mother's_day cinco_de_mayo
## 1.678672e-04 1.602368e-04 1.373459e-04
## president_barack_obama new_york_times two_years_ago
## 1.144549e-04 1.068246e-04 1.068246e-04
## world_war_ii
## 9.919424e-05
plot(d3, min.freq = 5, random.order = FALSE)
The most interesting finding is that the count of the most prevalent terms falls off by roughly an order of magnitude at each step from unigrams to bigrams to trigrams:
* n-gram : count
* will : 3173
* right_now : 242
* new_york_city : 33
Intuitively this makes sense: longer word sequences appear less often, but their sparsity will reduce the predictive power of our algorithm. It also points to the inherent limitations of the bag-of-words approach taken in this initial analysis. If we instead built models that understood parts of speech and the most likely next word given the context of the preceding words, our predictive power would likely increase.
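As a rough preview of how these trigram counts could feed a next-word suggester, the sketch below (a hypothetical helper, not part of the analysis above) looks up trigrams whose first two words match the user's last two words and returns the most frequent completions. It assumes the dfm's column sums and feature names behave like an ordinary named sparse matrix:
# Hypothetical sketch: suggest next words from the trigram counts in d3.
# Assumes trigram features are named "w1_w2_w3", as in the output above.
predict_next <- function(prev_two, dfm3, n = 3) {
  counts <- colSums(dfm3)                       # total count of each trigram
  hits <- grep(paste0("^", prev_two, "_"),      # trigrams starting with the given two words
               names(counts), value = TRUE)
  if (length(hits) == 0) return(character(0))
  top <- head(sort(counts[hits], decreasing = TRUE), n)
  sub(".*_", "", names(top))                    # keep only the final word of each trigram
}

predict_next("new_york", d3)  # would be expected to suggest "city", "times", ...
A fuller model would also need to back off to bigram and unigram counts when no matching trigram exists.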