This is an exploratory report for the Data Science Specialization Swiftkey Capstone project, through Coursera and the Johns Hopkins University Bloomberg School of Public Health, and in conjunction with Swiftkey. The instructors are Jeff Leek, Roger Peng, and Brian Caffo.
The task at hand is to build a predictive text application using NLP that will predict the next word based on the previous entry or entries. The data set consists of a corpus of text that will be used to create the predictive model. It can be found here.
The data is loaded into R using the unz and readLines functions and split into two samples: one is archived, while the other is processed into a series of vectors containing the raw single words, along with position and element indexes for each word so that phrases can be rebuilt later. Whitespace is stripped and the text is encoded as UTF-8.
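A minimal sketch of this step for one of the files is shown below; the object names here (rawLines, wordVec, lineIndex, posIndex) are illustrative assumptions, not necessarily the objects used in the rest of this report.
# Read one file from the archive, take a reproducible sample, and break it
# into word / position vectors (illustrative names only).
con <- unz("Coursera-SwiftKey.zip", "final/en_US/en_US.twitter.txt")
rawLines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

set.seed(1234)
sampleLines <- sample(rawLines, length(rawLines) %/% 2)   # half kept for exploration

tokenList <- strsplit(trimws(sampleLines), "\\s+")         # strip whitespace, split on runs of spaces
wordVec   <- unlist(tokenList)                             # raw single words
lineIndex <- rep(seq_along(tokenList), lengths(tokenList)) # which sampled line each word came from
posIndex  <- unlist(lapply(lengths(tokenList), seq_len))   # position of each word within its line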
Basic data summaries are calculated.
# Count the lines and whitespace-separated words in the Twitter file,
# then release the raw text from memory.
tw2 <- unz("Coursera-SwiftKey.zip", "final/en_US/en_US.twitter.txt")
twData <- readLines(tw2)
twLines <- length(twData)
twWords <- length(unlist(strsplit(twData, "\\s")))
close(tw2)
rm(tw2)
rm(twData)
gcv2 <- gc(verbose = FALSE)
Twitter Summary:
# Same line and word counts for the blog file.
bl2 <- unz("Coursera-SwiftKey.zip", "final/en_US/en_US.blogs.txt")
blData <- readLines(bl2)
blLines <- length(blData)
blWords <- length(unlist(strsplit(blData, "\\s")))
close(bl2)
rm(bl2)
rm(blData)
gcv3 <- gc(verbose = FALSE)
Blog Summary:
# Same line and word counts for the news file.
nw2 <- unz("Coursera-SwiftKey.zip", "final/en_US/en_US.news.txt")
nwData <- readLines(nw2)
nwLines <- length(nwData)
nwWords <- length(unlist(strsplit(nwData, "\\s")))
close(nw2)
rm(nw2)
rm(nwData)
gcv4 <- gc(verbose = FALSE)
News Summary:
A quick summary of this largely uncleaned data may help direct the data cleaning. It may be a good idea to take a look at the highest-frequency words.
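A sketch of how such a table can be built, assuming the sampled words sit in the hypothetical character vector wordVec from the earlier sketch:
# Count raw word occurrences and keep the 20 most frequent for display.
wordFreq <- sort(c(table(wordVec)), decreasing = TRUE)   # named vector, largest counts first
topWords <- as.data.frame(head(wordFreq, 20))
colnames(topWords) <- "frequency"
print(topWords)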
There are a number of very high frequency words. These are the 20 words with the highest frequency.
## frequency
## the 2104752
## to 1346798
## and 1129139
## a 1127168
## of 989608
## in 755406
## I 730524
## for 519742
## is 507385
## that 463196
## on 378286
## you 364483
## with 338093
## was 304270
## it 303389
## at 264551
## my 261575
## be 260377
## have 249990
## The 240479
These words appear to be the function words commonly chosen as “stop words” in NLP. Many techniques recommend removing these words because of their overrepresentation; however, it may be worth keeping them in order to investigate alternative ways of handling them.
Another question is how often each word frequency occurs.
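A frequency-of-frequencies plot answers this; a sketch of how it can be built from the hypothetical wordFreq vector above:
# How many distinct words occur exactly 1, 2, 3, ... times, on log-log axes.
freqOfFreq <- table(wordFreq)
plot(as.integer(names(freqOfFreq)), as.integer(freqOfFreq),
     log = "xy", xlab = "word frequency", ylab = "number of words")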
The frequency-of-frequencies plot shows a large number of one-occurrence words. Before deciding how to handle these, more information is needed.
Digging a little deeper reveals that 2.08% of all words have a frequency of one. In fact, removing all words with 66 occurrences or fewer still retains 90% of the overall words in the data set.
By removing all words with 15088 occurrences or fewer, only the highest-frequency words remain, and they still account for 50.01% of the overall words.
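A sketch of how these coverage figures can be computed from the same hypothetical wordFreq vector:
# Cumulative share of all word occurrences covered by the most frequent words.
coverage  <- cumsum(wordFreq) / sum(wordFreq)   # wordFreq is sorted, largest counts first
cutoff90  <- which(coverage >= 0.90)[1]         # number of distinct words needed for 90% coverage
minFreq90 <- wordFreq[cutoff90]                 # words rarer than this could be dropped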
It is clear that removing low-frequency words will help with model efficiency, and it should be an effective way to remove little-used foreign-language words, hard-to-detect misspellings, and other “noise” from the model. What is less clear is at what point these less frequent words gain sufficient value to merit remaining in the model. This will require testing during the model-building phase.
Lastly, it seems likely that some punctuation removal will be necessary to consolidate word occurrences, but a better understanding of what kinds of punctuation are in use is needed first, to see whether any of it might be useful data.
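One way to get that picture, sketched here with the hypothetical wordVec from earlier, is to extract and tabulate the punctuation characters themselves:
# Pull every punctuation character out of the tokens and count each one.
puncChars <- unlist(regmatches(wordVec, gregexpr("[[:punct:]]", wordVec)))
puncFreq  <- sort(table(puncChars), decreasing = TRUE)
head(puncFreq, 10)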
Unsurprisingly, periods, commas, and apostrophes are the most common punctuation. Glancing through the other results, however, suggests a little more investigation is warranted.
One special case to consider is words starting with #. These are most likely hashtags from the Twitter data set. They are likely to be unrelated to the words found before and after them, and may introduce noise into the model.
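A sketch of how these can be pulled out for inspection (again using the hypothetical wordVec):
# Find every token that begins with '#' and show a few for manual review.
hashtags <- grep("^#", wordVec, value = TRUE)
length(hashtags)             # total hashtag occurrences
head(unique(hashtags), 20)   # a small set to eyeball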
## [1] "#MothersDay" "#OccupyMadison" "#cufon"
## [4] "#wordpress." "#SNTCK" "#specialneeds"
## [7] "#Well" "#FridayThe13th" "#Broncos"
## [10] "#fashion," "#internmistakes!" "#kids"
## [13] "#Wisconsin" "#FanNight" "#unfollowfriday"
## [16] "#eating" "#sleep" "#exercising"
## [19] "#TeamSoaringHigh" "#cockroach"
133866 occurrences are found, 0.26% of all words. Most look like low-frequency words that would probably be excluded anyway, but given how these tokens tend to be used, the best course of action seems to be to exclude them entirely.
While more detailed data cleaning may be required later, the cleaning is rounded out for now with these steps:
These actions were considered but not taken at this time, pending a better picture of their impact on the model:
We can begin our summary statistics review by creating 1-gram, 2-gram, and 3-gram vectors from the cleaned data. This can be done with a set of flag variables that indicate whether a word is eligible to start a 1-gram, 2-gram, or 3-gram, based on the exclusion rules established above and the word-position data preserved from the original sample.
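A sketch of the flag logic, assuming the cleaned words and the line they came from are held in the hypothetical wordVec and lineIndex vectors from the earlier sketch (flags for the exclusion rules would be combined with these in the same way):
# A word can start a 2-gram only if the next word exists and comes from the
# same original line; a 3-gram additionally needs the word two positions
# ahead to be on that line. (Illustrative objects only.)
n <- length(wordVec)
startsTwoGram   <- c(lineIndex[-1] == lineIndex[-n], FALSE)
startsThreeGram <- startsTwoGram &
  c(lineIndex[-(1:2)] == lineIndex[-((n - 1):n)], FALSE, FALSE)

twoGrams   <- paste(wordVec[startsTwoGram], wordVec[which(startsTwoGram) + 1])
threeGrams <- paste(wordVec[startsThreeGram],
                    wordVec[which(startsThreeGram) + 1],
                    wordVec[which(startsThreeGram) + 2])

twoGramFreq   <- sort(c(table(twoGrams)), decreasing = TRUE)
threeGramFreq <- sort(c(table(threeGrams)), decreasing = TRUE)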
A two-gram frequency distribution can now be examined.
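Assuming the hypothetical twoGramFreq vector from the sketch above, the distribution can be summarised directly:
# Five-number summary of how often each distinct 2-gram occurs.
summary(as.integer(twoGramFreq))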
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 6 2 2220000
Two-grams range in frequency from 1 to 2221297, with a median of 1.
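The most frequent two-grams can be listed the same way as the three-gram table later in this report (twgHd is an illustrative name, again built from the hypothetical twoGramFreq):
# Top 15 two-grams by raw count.
twgHd <- as.data.frame(head(twoGramFreq, 15))
colnames(twgHd) <- "frequency"
print(twgHd)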
## frequency
## 2221297
## of the 215211
## in the 203072
## to the 105961
## for the 100415
## on the 98176
## to be 80759
## at the 70967
## and the 62301
## in a 59272
## with the 52362
## is a 50066
## it was 47680
## for a 46852
## from the 43495
An examination of the most frequent phrases again shows a large spike in those based solely on “function” words.
A three-gram frequency distribution can be examined in the same way.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 2 1 4410000
Three-grams range in frequency from 1 to 4409362, with a median of 1.
# Top 15 three-grams by raw count.
thgHd <- as.data.frame(head(thgFreq, 15))
colnames(thgHd) <- "frequency"
print(thgHd)
## frequency
## 4409362
## one of the 17142
## a lot of 14851
## thanks for the 11768
## to be a 8978
## going to be 8678
## the end of 7427
## i want to 7375
## out of the 7288
## it was a 6986
## as well as 6926
## some of the 6824
## be able to 6538
## part of the 6177
## i have a 5755
An examination of the most frequent phrases still shows a large spike in those built solely from “function” words. There is also an increasing number of one-occurrence phrases, indicating that phrases become more unique as more words are added.
Based on this cleaning and exploratory analysis, the first stages of modeling will proceed using the cleaned n-gram frequencies as a basis. It is expected that other elements will be included or removed based on the quality of the results, and that the inclusion/exclusion thresholds for low- and high-frequency words will be set during model testing.
The Shiny app based on the model will be constructed to allow free entry of a text string and will provide 3 suggestions for the next word based on the prediction model's analysis of the previously entered text. The predictions will update every time a space is entered, and it would be beneficial to design the app so that a suggested word can be added to the text with a button click.
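A minimal sketch of such an interface, with a dummy predictNext function standing in for the model (which does not exist yet); all names here are illustrative:
library(shiny)

# Hypothetical placeholder: the real prediction model will replace this.
predictNext <- function(text) c("the", "to", "and")

ui <- fluidPage(
  textInput("entry", "Type your text:", value = ""),
  uiOutput("suggestions")
)

server <- function(input, output, session) {
  # Recompute suggestions whenever the entered text changes (e.g. after a space).
  preds <- reactive(predictNext(input$entry))

  # Render one button per suggested word.
  output$suggestions <- renderUI({
    lapply(seq_along(preds()), function(i) {
      actionButton(paste0("pick", i), preds()[i])
    })
  })

  # Clicking a suggestion appends that word to the entered text.
  lapply(1:3, function(i) {
    observeEvent(input[[paste0("pick", i)]], {
      updateTextInput(session, "entry",
                      value = trimws(paste(input$entry, preds()[i])))
    })
  })
}

shinyApp(ui, server)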