This is the Milestone Report for the Coursera Data Science Specialization Capstone project. The course includes a partnership with SwiftKey, an app that presents three predictions for your next word as you type on a mobile device. Our task is to come up with our own method of generating these predictions. We have been given three text corpus files; a corpus is a large body of text used to analyze word associations in a particular language.
In this report we are to accomplish the following tasks, as well as create a plan for the final project.
- Understanding the problem
- Data acquisition and cleaning
- Exploratory analysis (pre and post cleaning)
First, let’s look at the corpora we have been given.
Text Corpus file info
setwd("C:/R_Projects/Coursera_DS_Capstone")
file.info("final/en_US/en_US.blogs.txt")$size/1024/1024
## [1] 200.4242
file.info("final/en_US/en_US.news.txt")$size/1024/1024
## [1] 196.2775
file.info("final/en_US/en_US.twitter.txt")$size/1024/1024
## [1] 159.3641
Loading Text files
twitter_full <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
blogs_full <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
news_full <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8")
Number of lines
length(twitter_full)
## [1] 2360148
length(blogs_full)
## [1] 899288
length(news_full)
## [1] 77259
Split each line into a sublist of words (anything separated by a space), then count those words.
split_t <- strsplit(twitter_full, split = " ", fixed = T)
split_b <- strsplit(blogs_full, split = " ", fixed = T)
split_n <- strsplit(news_full, split = " ", fixed = T)
length(unlist(split_t))
## [1] 30373543
length(unlist(split_b))
## [1] 37334131
length(unlist(split_n))
## [1] 2643969
Example lines
head(twitter_full)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
#head(blogs)
#head(news)
In the interest of saving space I’m not showing blog and news examples. They are quite a bit longer.
Histograms of words per line
hist(sapply(split_t, length))
hist(sapply(split_b, length))
hist(sapply(split_n, length))
I would normally split my test and training sets into around 20% and 80%, respectively, or maybe even 30% and 70% or 40% and 60%. However, given the size of this data set and a time-consuming transformation I’ll have to compute later on, I’ll only be using 3% for training. I might use about 0.5% for testing. While building these data tables I kept hitting size problems at 40%, 20%, 10%, and 5%.
#library(caret)
#inTrain <- createDataPartition(y = twitter_full, p = 0.4, list = FALSE)
#twitter_t <- twitter_full[inTrain]
# This might work, I think, but the dataset is too large.
set.seed(1111)
twitter <- sample(twitter_full, length(twitter_full) * 0.03)
blogs <- sample(blogs_full, length(blogs_full) * 0.03)
news <- sample(news_full, length(news_full) * 0.03)
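The training and test proportions mentioned above (3% and 0.5%) can be drawn in one step so that the two samples never overlap. This is only a sketch: `split_sample` and its arguments are hypothetical names, not part of the report's code, and the toy vector stands in for `twitter_full`.

```r
# Hypothetical helper: draw disjoint training and test samples from a character vector.
split_sample <- function(lines, p_train = 0.03, p_test = 0.005, seed = 1111) {
  set.seed(seed)
  n <- length(lines)
  n_train <- round(n * p_train)
  n_test  <- round(n * p_test)
  # One draw without replacement, then cut it in two, so no line is in both sets
  idx <- sample.int(n, n_train + n_test)
  list(train = lines[idx[seq_len(n_train)]],
       test  = lines[idx[n_train + seq_len(n_test)]])
}

# Toy data standing in for twitter_full
toy <- paste("line", 1:1000)
parts <- split_sample(toy)
length(parts$train)
length(parts$test)
```

Sampling both sets from a single draw keeps the test lines out of the training data, which matters when the test set is later used to judge the n-gram model.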
Now that we have our data partitioned, we need to understand our ultimate goals before we can start cleaning and shaping our data into something that can be used for our prediction assignment.
The final outputs of this project will be a Shiny app that gives 3 words as predictions based on an entered body of text, and an R slide presentation pitching the app. We will not be turning this into any type of mobile application, so our focus should primarily be on prediction; we will adjust for additional factors, such as speed, later. Text corpora are given to us in several languages; I’ll be focusing only on the English versions. We are allowed to include additional data in our analysis and prediction models.
Another important consideration is that, as far as I can tell, we will have no history of the user’s use of language and will focus only on common use. Prediction of the next word would likely be aided by data on a user’s common language choices or other user-related data, possibly even demographic or geographic in nature. This might be a subject to revisit for enhancement at a later time.
While the text corpora are a valuable tool, I believe one of the best starting points is English language structure and the history of similar prediction work. I will not re-present all research findings, only the highlights.
This type of prediction falls under natural language processing (NLP). NLP started with hand-written rules, such as if-then statements, and moved toward statistical approaches as processing power increased. A widely used statistical approach to this type of problem is n-grams, which are calculated from a text corpus. This is likely what the course instructors and partners want us to use in our prediction model.
I also explored tagging words as adjectives, nouns, and verbs. I’ve considered the use of antonyms and synonyms. I’ve also looked at sentence structure, such as simple, complete, and complex sentences.
When designing this prediction model I knew I would be using n-grams. However, my original plans leaned more on the rules of sentence structure, word type (adjective, noun, verb), and the antonyms/synonyms of past words related to linking terms (but, so).
Use of antonyms and synonyms proved to be too complicated. After considering several of the ways in which word structure could be used, I have come to the conclusion that modified n-grams would be a better way to achieve the same results.
The application will first clean the entered words. This cleaning will be similar, but not identical, to what is performed to create the n-grams from the text corpus. n-grams of 2 to 4 words will be used. After the entered words are cleaned, the application will evaluate only the window from the last period forward or the last 4 words, whichever is smaller. Then the application will look to the n-gram tables to find the best match.
We’ll split entered data and our corpus at sentence ends, then remove non-alphanumeric characters and capitalization. We’ll use a word dictionary to replace unrecognized words with a symbol. This accomplishes two key tasks. First, it will mostly apply to nouns, for which there is no easy way to have a full list of every possible one (names of places, people, and so forth). Second, it gives us a way to deal with unrecognized words in general: when creating our n-grams we’ll use the symbol as one of the predictor words and remove it as a possibility for the predicted word. We’ll also turn common acronyms into their respective words. These most often stand for common phrases, and it’s likely our dictionary won’t have them. Because they are short there isn’t much use in trying to predict them, and in sentence structure they have more meaning when considered as the words they represent.
As sentences are unlikely to be statistically related to one another (at least for what we are doing here), we’ll split these lines at punctuation marks.
blogs <- unlist(strsplit(blogs, split = ".", fixed = T))
blogs <- unlist(strsplit(blogs, split = "!", fixed = T))
blogs <- unlist(strsplit(blogs, split = "?", fixed = T))
news <- unlist(strsplit(news, split = ".", fixed = T))
news <- unlist(strsplit(news, split = "!", fixed = T))
news <- unlist(strsplit(news, split = "?", fixed = T))
twitter <- unlist(strsplit(twitter, split = ".", fixed = T))
twitter <- unlist(strsplit(twitter, split = "!", fixed = T))
twitter <- unlist(strsplit(twitter, split = "?", fixed = T))
There may be instances where punctuation was used for other purposes, but short of manual review there is nothing to be done. We’ll address any consequences of this later.
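The three passes per corpus used for sentence splitting could also be written as one pass with a regular expression. The sketch below is an alternative, not the code used for the results in this report; `split_sentences` is a hypothetical name. Treating a run such as "..." or "?!" as a single break also avoids creating empty strings between consecutive marks.

```r
# Hypothetical helper: split each line on runs of sentence-ending punctuation.
split_sentences <- function(lines) {
  # [.!?]+ matches one or more of ".", "!" or "?" in a row
  unlist(strsplit(lines, split = "[.!?]+"))
}

example <- c("Go Cubs Go! This is perfect... Really?", "No punctuation here")
sentences <- split_sentences(example)
sentences
```

Like the fixed-string version, this still splits on periods in abbreviations and decimals, and the leading spaces it leaves behind are removed by the trimming step later on.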
We’ll also remove the characters used in emoticons and other non-words, such as “:” and “)”, among others.
blogs <- gsub("[^a-zA-Z0-9 ]","",blogs)
news <- gsub("[^a-zA-Z0-9 ]","",news)
twitter <- gsub("[^a-zA-Z0-9 ]","",twitter)
We’ll remove capitalization; I can’t see any use for it.
blogs <- tolower(blogs) # tolower() is vectorized, so no lapply() is needed
news <- tolower(news)
twitter <- tolower(twitter)
Cleaning will include the replacement of acronyms. These should be evaluated as the phrases they represent. This same cleaning will occur to the app input.
I reviewed a list of the top 500 acronyms used in today’s communication from http://www.muller-godschalk.com/acronyms.html#S. I found a large number of them to be what I would consider very uncommon.
I found a list of 124 in total from http://abbreviations.yourdictionary.com/articles/common-accronyms.html, broken up into several sections. Since this list is so short, I scanned it and removed problem acronyms, such as KISS (Keep It Simple Stupid), that are also words or have multiple meanings. Though it is a common acronym, “kiss” is also a word and is used much more often as such. Some also referred to nouns that I thought would be better treated as such. I then augmented this list with others from http://www.netlingo.com/top50/popular-text-terms.php and saved the result as a .csv.
acronyms <- read.csv("acronyms.csv")
acronyms$acronym <- tolower(acronyms$acronym)
acronyms$meaning <- tolower(acronyms$meaning)
#for(i in 1:80){twitter <- gsub(acronyms[i,1], acronyms[i,2],twitter)}
#^this works but it's a loop; below is a better approach
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
##
## The following object is masked from 'package:base':
##
## Filter
#mgsub allows for a list of patterns
twitter <- mgsub(acronyms[,1], acronyms[,2], twitter)
blogs <- mgsub(acronyms[,1], acronyms[,2], blogs)
news <- mgsub(acronyms[,1], acronyms[,2], news)
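One thing to watch with the acronym replacement: as far as I can tell, a plain pattern replacement also matches acronyms that happen to sit inside longer words. A minimal sketch of a whole-word version in base R follows. `replace_acronyms` and the toy table are hypothetical, and it assumes the acronyms contain only letters and digits (true after the cleaning above), so they are safe to paste into a regular expression.

```r
# Hypothetical helper: replace acronyms only when they appear as whole words.
replace_acronyms <- function(text, acronym, meaning) {
  for (i in seq_along(acronym)) {
    # \\b on both sides keeps "lol" from matching inside "lollipop"
    pattern <- paste0("\\b", acronym[i], "\\b")
    text <- gsub(pattern, meaning[i], text, perl = TRUE)
  }
  text
}

# Toy table standing in for acronyms.csv
toy_acronyms <- data.frame(acronym = c("btw", "lol"),
                           meaning = c("by the way", "laugh out loud"),
                           stringsAsFactors = FALSE)

replaced <- replace_acronyms(c("btw thanks", "lollipop lol"),
                             toy_acronyms$acronym, toy_acronyms$meaning)
replaced
```

The loop here runs once per acronym, not once per line, and each `gsub` call is vectorized over the whole corpus, so a list of around a hundred acronyms should not be a bottleneck.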
I have decided not to remove or replace contractions, for two reasons. First, if we did, we would lose the ability to predict from them when they are used. For instance, if we split every “don’t” into “do” and “not”, we’d have a large number of “not”s following “do” but no way to see “don’t” itself. The exception would be a separate n-gram table for contractions, which is much too complicated.
The second reason is that contractions might well present themselves differently in terms of word association than their decontracted versions.
Therefore, later on we’ll need to add these to any word list we use.
Next we’ll remove any leading or trailing blank space and then collapse multiple spaces into a single space. Then we’ll split the strings into separate lists of words, which keeps their order for creating n-grams later but still allows us to work with the individual words.
library(stringr)
##
## Attaching package: 'stringr'
##
## The following object is masked from 'package:qdap':
##
## %>%
twitter <- gsub("\\s+", " ", str_trim(twitter)) #remove trail spaces, replace doubles
twitter <- twitter[twitter != ""] #remove empty lines
blogs <- gsub("\\s+", " ", str_trim(blogs))
blogs <- blogs[blogs != ""]
news <- gsub("\\s+", " ", str_trim(news))
news <- news[news != ""]
My next action was going to be to remove single letters that are not words, such as the D left over from :D. However, some single letters may be referenced as objects in their own right, so I’ll only do this for the twitter lines, as they are the ones filled with “D”, “p”, and other remains of emoticons.
#twitter <- strsplit(twitter, split = " ")
#for(x in 1:length(twitter)){ #for loop for the main list
# for(y in 1:length(twitter[[x]])){ #for loop for the sublist
# temp <- twitter[[x]][y] # assign value to save time
# if(nchar(temp) == 1){ #if single character word
# twitter[[x]][y] <- sub("[^ai0-9]", "", temp)}
#removes every single-character word which is not an "a" or "i" or number
# }
#}
#for(x in 1:length(twitter)){twitter[[x]] <- Reduce(paste, twitter[[x]])}
#the next step requires the lists of words to be recreated as sentences.
#twitter <- unlist(twitter)
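#The above lines work, but it's a loop; we can use the below instead
twitter <- gsub("\\b[b-hj-z]\\b", " ", twitter)
#\\b marks a word boundary, so only stand-alone characters are matched
#[b-hj-z] matches one lowercase letter other than "a" or "i"
#(inside [ ], "|" and a non-leading "^" are literal characters, not "or"/"not")
#the text is already lower case, and digits are left alone as in the loop above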
#again remove spaces and empties
twitter <- gsub("\\s+", " ", str_trim(twitter)) #remove trail spaces, replace doubles
twitter <- twitter[twitter != ""] #remove empty lines
#twitter <- strsplit(twitter, split = " ")
I was also thinking about how best to handle uncommon nouns. It occurs to me that I can replace all unrecognized words with a symbol, such as “*”. I’ll need a dictionary to identify the real words, and I’ll use this same dictionary later when scanning input. These unrecognized words won’t always be nouns, but the general principle should work most of the time.
I looked at several word lists and decided to go with version 5 of 12dicts from SCOWL (Spell Checker Oriented Word Lists). I am using the version of this file with American English and inflections only. http://wordlist.aspell.net/12dicts/
Remember we need to add some contractions as words to this dictionary.
words <- readLines("words/2of12inf.txt", encoding="UTF-8")
cont <- read.csv("cont.csv")#list of contractions
cont <- gsub("[^a-zA-Z0-9 ]","",cont$cont)
cont <- tolower(cont)
cont <- unique(cont)
words <- c(words, cont)
words <- gsub("[^a-zA-Z0-9 ]","",words)
#for(x in 1:length(twitter)){ #for loop for the main list
# for(y in 1:length(twitter[[x]])){ #for loop for the sublist
# temp <- twitter[[x]][y] # assign value to save time
# if(!temp %in% words){ #if the word is not in the dictionary
# twitter[[x]][y] <- "*"}
# }
#}
#ptm <- proc.time()
#proc.time() - ptm
#68 seconds for 1000
#^Works, but takes too long.
twitter2 <- strsplit(twitter, split = " ")
for(x in 1:length(twitter2)){
twitter2[[x]][!twitter2[[x]] %in% words] <- "*"
}
#8 seconds for 1000, much faster but still a long time, but I can't find another way
blogs2 <- strsplit(blogs, split = " ")
for(x in 1:length(blogs2)){
blogs2[[x]][!blogs2[[x]] %in% words] <- "*"
}
news2 <- strsplit(news, split = " ")
for(x in 1:length(news2)){
news2[[x]][!news2[[x]] %in% words] <- "*"
}
Unfortunately this step is still a very long process to perform on the larger data set. If anyone knows a quicker way I would be very grateful for any assistance. A quicker way to perform this task would allow me to use a larger portion of the data set and therefore get a more accurate result.
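One candidate for a quicker approach, which I have not timed on the full data, is to flatten the list, check every word against the dictionary in a single `%in%` call, and then rebuild the list. The per-line loop has to prepare the dictionary lookup again for every line, while this version does it once. `mask_unknown` is a hypothetical name, and the toy inputs stand in for `twitter2` and `words`.

```r
# Hypothetical helper: replace every word not found in `dictionary` with `symbol`.
mask_unknown <- function(token_list, dictionary, symbol = "*") {
  flat <- unlist(token_list, use.names = FALSE)
  flat[!(flat %in% dictionary)] <- symbol          # one lookup for all words
  # Remember which line each word came from, keeping empty lines as levels
  groups <- factor(rep(seq_along(token_list), lengths(token_list)),
                   levels = seq_along(token_list))
  unname(split(flat, groups))                      # rebuild one vector per line
}

# Toy stand-ins for twitter2 and words
toy_tokens <- list(c("the", "qwerty", "cat"), character(0), "zzz")
toy_words  <- c("the", "cat")

masked <- mask_unknown(toy_tokens, toy_words)
masked
```

The result has the same shape as the input list, so it could be used in place of the loop for `twitter2`, `blogs2`, and `news2`.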
I did not review the dictionary for profanity. If a profane word is in there it will be evaluated as a word; if not, it will be seen as an unrecognized word. I believe this is the correct way to go about it. If a profane word truly has such a high prevalence, there are two possibilities: either the word is used very frequently, and our ultimate goal is then to predict it, or our sample/text corpus is not representative of the population and we need a larger or different one.
Depending on how we go about creating n-grams and calculating frequency it may help to have a data.frame.
library(plyr)
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:qdapTools':
##
## id
all_3 <- c(twitter2, blogs2, news2)
corpus <- data.frame(rbind.fill.matrix(lapply(all_3, t)))
# creates a matrix the width of the longest lines, fills in missing values
Saving our data object will allow us to save a lot of time when rerunning processes.
saveRDS(corpus,file="corpus_df.Rds")
saveRDS(all_3,file="corpus_list.Rds")
corpus_list <- readRDS("corpus_list.Rds")
corpus_df <- readRDS("corpus_df.Rds")
Number of lines, number of words, number of unique words, example lines, histogram of number of words per line.
length(corpus_list) #number of lines
## [1] 201655
length(unlist(corpus_list)) #number of words
## [1] 2132859
length(unique(unlist(corpus_list))) # number of unique words, note unrecognized = *
## [1] 31820
head(corpus_list) #example lines
## [[1]]
## [1] "my" "*" "tweet" "goes" "to"
## [6] "my" "beautiful" "smart" "ass" "funny"
## [11] "bully" "ass" "tall" "ass" "baby"
## [16] "sis" "*" "stink" "*" "butt"
##
## [[2]]
## [1] "*" "but" "the" "whole" "point"
## [6] "of" "mu" "and" "the" "transition"
## [11] "to" "*" "*" "is" "to"
## [16] "make" "this" "process" "easier" "and"
## [21] "the" "rules" "should" "reflect" "this"
##
## [[3]]
## [1] "very" "happy" "that" "has" "won" "*"
## [7] "*" "between" "*" "national" "and" "*"
##
## [[4]]
## [1] "yea"
##
## [[5]]
## [1] "hope" "hours" "are" "good"
##
## [[6]]
## [1] "may" "the" "road" "rise" "up" "to" "meet" "you" "may" "the"
## [11] "wind" "be" "ever" "at" "your" "back"
hist(sapply(corpus_list, length)) #histogram
We now have a much cleaner and more orderly text corpus that can be used for creating n-grams for our prediction model.
We need a way to create n-grams; some variation of lapply with ngram will likely be the solution.
#library(ngram)
#corpus_2n <- lapply(all_3, ngram, n=2)
#corpus_2n <- ngram (all_3, n = 2)
#corpus_3n <- ngram (corpus, n = 3)
#corpus_4n <- ngram (corpus, n = 4)
#corpus_2n <- corpus_2n[!corpus_2n$ == "*"]
#depends on the form of the n-grams
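Until the form of the n-grams is settled, here is a minimal base R sketch of one way to build them from a list of word vectors like `corpus_list`. It does not use the ngram package; `make_ngrams` is a hypothetical helper and the toy list is made up for illustration. It also drops n-grams whose last word is the unknown-word symbol, since that symbol should never be a predicted word.

```r
# Hypothetical helper: build all n-grams of size n from one vector of words.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))   # line too short for this n
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy stand-in for corpus_list
toy_list <- list(c("may", "the", "road", "rise"), "yea", c("the", "road", "*"))

bigrams <- unlist(lapply(toy_list, make_ngrams, n = 2))
# Drop n-grams whose predicted (last) word is the unknown-word symbol
bigrams <- bigrams[!grepl(" \\*$", bigrams)]
bigrams
```

Because each line is processed separately, no n-gram spans a sentence break, which matches the decision to split the corpus at punctuation.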
We now need to make sure the correct form of contractions, with an apostrophe, shows up in the predictions. We must be careful to replace only the version used for the predictions and not the version used for the predictor words.
cont2 <- read.csv("cont2.csv")#list of contractions with and without '
cont2 <- unique(cont2)
#corpus$ <- mgsub(cont2[,1], cont2[,2], corpus$)
#depends on the form of the n-grams
Next we’ll need to calculate the values for our n-grams. We’ll use the n-gram tables by looking at the predictor word(s) and the predicted word, calculating how frequently the predicted word appears for each occurrence of the predictors.
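As a sketch of that calculation, assume the n-grams are stored as plain strings with words separated by single spaces (an assumption, since the final form is not decided yet). Each n-gram is split into its predictor words and its predicted word, then counted. `count_ngrams` and the column names are hypothetical.

```r
# Hypothetical helper: turn a character vector of n-grams into a frequency table.
count_ngrams <- function(ngrams) {
  counts <- table(ngrams)
  grams  <- names(counts)
  out <- data.frame(
    predictor = sub(" [^ ]+$", "", grams),  # everything before the last word
    predicted = sub("^.* ", "", grams),     # the last word
    freq      = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Share of each predicted word among all occurrences of its predictor
  out$prob <- out$freq / ave(out$freq, out$predictor, FUN = sum)
  out
}

toy_ngrams <- c("the cat", "the cat", "the dog", "a cat")
freq_table <- count_ngrams(toy_ngrams)
freq_table
```

In this toy example "cat" follows "the" in two of three cases, so its `prob` is 2/3. One table like this per n-gram size would be enough for the lookup step.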
The application structure will need an interface built in Shiny.
It’ll process the data first by breaking the lines (or text) at punctuation into phrases. Next we’ll take the last phrase, remove non-alphanumeric characters, make everything lower case, and replace acronyms. Then, if the number of words is greater than the longest n-gram needs, we’ll keep only as many as the longest n-gram needs; if it is shorter, we’ll use the appropriately sized n-gram. The length of the n-grams will depend on the time needed to calculate them and on whether enough exist to be meaningful. Before we use those n-grams we’ll check the words against our dictionary and replace any unmatched word with a symbol.
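The input processing might look roughly like the sketch below. It is an assumption-laden outline, not the final app code: `clean_input` is a hypothetical name, acronym replacement is left out for brevity, and it keeps `max_n - 1` words because the longest n-gram of `max_n` words uses that many predictors.

```r
# Hypothetical helper: reduce raw input text to the predictor words for lookup.
clean_input <- function(text, dictionary, max_n = 4, symbol = "*") {
  phrases <- unlist(strsplit(text, "[.!?]+"))            # break at punctuation
  phrase  <- if (length(phrases) > 0) phrases[length(phrases)] else ""
  phrase  <- tolower(gsub("[^a-zA-Z0-9 ]", "", phrase))  # same cleaning as corpus
  tokens  <- unlist(strsplit(trimws(phrase), " +"))
  tokens  <- tokens[tokens != ""]
  tokens  <- tail(tokens, max_n - 1)                     # keep the last few words
  tokens[!(tokens %in% dictionary)] <- symbol            # mask unknown words
  tokens
}

toy_words <- c("i", "want", "to", "go", "the")
predictors <- clean_input("Hello there. I really want to go to the Zoo", toy_words)
predictors
```

The number of words returned tells the app which n-gram table to try first: three words for the 4-gram table, two for the 3-gram table, and one for the 2-gram table.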
Now that we have our cleaned phrase and know which of our precomputed n-gram tables to use, we need only look up the pattern and return the 3 predicted words with the highest frequency for those predictors.
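A minimal sketch of that lookup, under some assumptions: the tables are kept in a list named by n-gram size ("2", "3", "4"), and each is a data frame with `predictor`, `predicted`, and `freq` columns. `predict_next` is a hypothetical name. If the longest available predictor has no match, the sketch drops the first word and tries the next shorter table.

```r
# Hypothetical helper: return up to k predicted words for the cleaned input words.
predict_next <- function(tokens, tables, k = 3) {
  while (length(tokens) > 0) {
    key <- paste(tokens, collapse = " ")
    tab <- tables[[as.character(length(tokens) + 1)]]   # table for this n
    if (!is.null(tab)) {
      hits <- tab[tab$predictor == key, , drop = FALSE]
      if (nrow(hits) > 0) {
        hits <- hits[order(-hits$freq), , drop = FALSE]  # most frequent first
        return(head(hits$predicted, k))
      }
    }
    tokens <- tokens[-1]   # back off: drop the earliest word and retry
  }
  character(0)             # nothing matched at any length
}

# Toy tables, made up for illustration
toy_tables <- list(
  "2" = data.frame(predictor = c("the", "the", "the", "the"),
                   predicted = c("cat", "dog", "end", "sun"),
                   freq = c(5, 3, 9, 1), stringsAsFactors = FALSE),
  "3" = data.frame(predictor = "of the", predicted = "world",
                   freq = 2, stringsAsFactors = FALSE)
)

predict_next(c("in", "the"), toy_tables)
```

Here "in the" is not in the 3-gram table, so the lookup falls back to "the" in the 2-gram table. When a match has fewer than 3 predicted words, or nothing matches at all, the app would still need a fallback such as the most common words overall; that part is not shown.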