Introduction

This document was made as part of the Coursera Data Science Capstone project. It is the first step in creating a text prediction application in R. In this first step I will look into:
1. loading of data,
2. data exploration,
3. preprocessing data,
4. saving the results for further use.

At the end of this document the basic findings so far are summarised, together with a look ahead at how to develop the model further into a text prediction app.

NB: the loadData.R file contains the functions for data loading and for preprocessing the corpora. See my GitHub page for more details; the RMD file is also available there. In this document only general calculations and results are shown, as this is a management-level document.

source("loadData.R") #this file contains 

Data loading

For this project there are three files provided:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

The files were downloaded from the Coursera website and extracted into the ‘/sources/’ subfolder. I did this manually to save the time of downloading and extracting these large files.

The files contain texts from blogs, news feeds and Twitter respectively. The purpose of these files is to provide sample texts on which to base the next-word predictions.

I will load the three documents for exploratory analysis. Because further analysis shows that the documents are too large to handle, a random subsample is taken. This still gives a large enough dataset to base the prediction algorithms on. Initially 3% of each dataset was taken for the exploratory analysis. Because of differences in the size and structure of the files, this percentage was enlarged (news) or reduced (blogs/twitter) based on results at a later stage of this exploratory analysis. My goal was to obtain a more evenly distributed sample across the three sources.
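
As an illustration, below is a minimal sketch of how such a subsample could be taken. The actual logic lives in loadData.R; the function name, file path and the use of readLines with sample are assumptions for illustration only.

#Sketch: read one source file and keep a random fraction of its lines (illustrative only)
sampleTextFile <- function(path, fraction = 0.03, seed = 1234) {
  set.seed(seed)                                   #reproducible subsample
  allLines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  allLines[sample(length(allLines), round(length(allLines) * fraction))]
}

twitterSample <- sampleTextFile("sources/en_US.twitter.txt", fraction = 0.03)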

Visual inspection also shows that some of the texts contain mostly one sentence per line (twitter, news). The blogs texts, however, often contain whole stories in a single text line. In order to capture the next word it is necessary to split every sentence onto its own line: the last word of one sentence is not really related to the first word of the next sentence. The analysis also showed that many sentences consist of only a few characters or words. I chose to remove sentences shorter than 15 characters, which improves the quality of the sentences considerably.
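
A minimal sketch of this sentence splitting and length filter might look as follows; the function name and regular expression are my own illustration, while the 15-character threshold is taken from the text above.

#Sketch: split text lines into sentences and drop sentences shorter than 15 characters
splitIntoSentences <- function(textLines, minChars = 15) {
  sentences <- unlist(strsplit(textLines, "(?<=[.!?])\\s+", perl = TRUE))
  sentences[nchar(sentences) >= minChars]
}

twitterSample <- splitIntoSentences(twitterSample)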

After loading, the reduced datasets are placed in Corpus objects. Corpora provide a structured way to store text data and enable various NLP (natural language processing) functions.
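
Assuming the tm package is used (the Corpus terminology above suggests this), creating the corpora could look like the sketch below; the variable names are illustrative.

library(tm)
blogsCorpus   <- VCorpus(VectorSource(blogsSample))    #one document per text line
newsCorpus    <- VCorpus(VectorSource(newsSample))
twitterCorpus <- VCorpus(VectorSource(twitterSample))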

Data exploration

Visual inspection of the loaded data also showed that the (grammatical) quality of the texts is often not that good.
A few examples:
- Punctuation is not always according to grammatical rules;
- Profane language is used in the texts;
- The text contains a lot of numbers and special characters.

It is clear that data cleaning is required; this will be done during data preprocessing. This applies most to the blogs and Twitter datasets. The news dataset is grammatically better and contains fewer profane words. The Twitter dataset also has fewer unique words on average per text.

Below I added a table containing the number of text lines per document (original and reduced sample size) and the unique word counts per document. These calculations are based on document-term matrices. Even with this limited sample size, there are still about 31157 unique words in the dataset, which should be sufficient to build a prediction algorithm on.

In the first analysis cycle it was visible that the Twitter dataset contained the most lines. The above amounts are already corrected based on these findings: a (1000%) higher percentage of news lines, a (30%) lower amount of Twitter data and a (30%) lower amount of blog text lines. This also balances the results of the prediction, as we would otherwise get more ‘Twitter’-like predictions. The correction accounts for the news dataset being much smaller but having more unique words per line, and for the blogs dataset already being split into separate sentences per line.
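
For reference, the line and unique word counts mentioned above can be derived from a document-term matrix roughly as follows. This is a sketch, assuming the tm corpora created earlier; the variable names are illustrative.

#Sketch: count text lines and unique words via a document-term matrix
twitterDTM         <- DocumentTermMatrix(twitterCorpus)
lineCountTwitter   <- nrow(twitterDTM)   #number of documents (text lines) in the sample
uniqueWordsTwitter <- ncol(twitterDTM)   #number of distinct terms in the sample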

It is also interesting to see which terms are most frequent. See the overview below for Twitter (the other sources show similar results - see appendix 1). It is clear that ‘the’ and ‘and’ are very common in all datasets. Other stop words such as ‘with’, ‘that’, ‘you’, ‘have’, etc. are also very common. After that, the word frequencies decline more gradually. I am not yet sure whether these should be removed when predicting the next word, as the next word often is ‘the’ or ‘and’. On the other hand, there is almost no predictive value in the words preceding or following ‘and’, ‘the’ and other stop words. For now, however, I keep them in.

## [1] "The Twitter dataset has 8 unique words with more than 1000 occurences."
## [1] "In the Twitter dataset these words occur most frequently:"
##      UniqueWordsTwitter
## the                5783
## you                2970
## and                2723
## for                2348
## that               1357
## your               1090
## with               1079
## have               1075
## are                 973
## this                945
## [1] "Twitter most frequent placed in wordcloud:"

## [1] "Twitter distribution of most common words:"

Looking at the least frequent terms, some data cleaning is clearly necessary. Many entries contain good words, but they are preceded by a quote or apostrophe character, are enclosed in quotes or have a ‘-’ sign, and there are various numbers. These should be removed. Below a sample of the least frequent words (fewer than 5 occurrences) is shown from the three datasets:

##  [1] "''crank"      "''daddys"     "''king"       "'02"         
##  [5] "'04."         "'07"          "'09!"         "'13."        
##  [9] "'30"          "'70"          "'87"          "'99."        
## [13] "'alive'"      "'allstar"     "'apt-get"     "'babe"       
## [17] "'balance.'"   "'bout"        "'british'"    "'bump"       
## [21] "'bye,"        "'can"         "'cat"         "'cause"      
## [25] "'cheers'"     "'citing!!!"   "'community'"  "'daddy"      
## [29] "'definitive'" "'do"
##  [1] "'40s"          "'90's."        "'93"           "'96"          
##  [5] "'a'"           "'and"          "'beer'"        "'black'"      
##  [9] "'blessed"      "'bout"         "'cal'"         "'cause"       
## [13] "'cept"         "'cold"         "'departure'"   "'desperate'"  
## [17] "'desperately"  "'do"           "'don't"        "'doomsday'"   
## [21] "'em"           "'endogenous'." "'euthanasia'"  "'fabulous"    
## [25] "'family':"     "'far"          "'fares'"       "'fiã°la'"     
## [29] "'genteel'"     "'happy.."
##  [1] "''a"             "''ad-rock''"     "''couldn't"     
##  [4] "''if"            "''paradise"      "''pina,\""      
##  [7] "''undefeated.\"" "''we"            "'01"            
## [10] "'02"             "'06"             "'08"            
## [13] "'40s"            "'50s"            "'60s"           
## [16] "'68"             "'68-69"          "'70s"           
## [19] "'71"             "'80s"            "'80s,"          
## [22] "'84,"            "'87"             "'90s"           
## [25] "'90s,"           "'98,"            "'alphabet"      
## [28] "'american"       "'authorized'"    "'born"

Data preparation

In this stage I will do some data cleaning based on the above findings. The first step is the basic cleaning of the texts.
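
The log output below comes from the preprocessing functions in loadData.R. The bespoke cleaning steps are not reproduced here, but the standard part of such a pipeline could be sketched with tm transformations as follows; the order and options are assumptions.

#Sketch: standard tm cleaning steps (whitespace, numbers, punctuation, lower case)
cleanCorpus <- function(corpus) {
  corpus <- tm_map(corpus, stripWhitespace)               #collapse repeated spaces
  corpus <- tm_map(corpus, removeNumbers)                 #drop digits
  corpus <- tm_map(corpus, removePunctuation)             #drop punctuation
  corpus <- tm_map(corpus, content_transformer(tolower))  #all to lower case
  corpus
}

twitterCorpus <- cleanCorpus(twitterCorpus)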

## [1] "Corpus pre-processing 1% (bespoke cleaning starting)"
## [1] "Corpus pre-processing 25% (bespoke cleaning continuing)"
## [1] "Corpus pre-processing 50% (bespoke cleaning finished)"
## [1] "Corpus pre-processing 60% (whitespace cleaning finished)"
## [1] "Corpus pre-processing 70% (number cleaning finished)"
## [1] "Corpus pre-processing 80% (punctuation cleaning finished)"
## [1] "Corpus pre-processing 100% (all to low caps cleaning finished)"
## [1] "Corpus pre-processing finished (e.g. remove strange characters, strip white space and remove caps."
## [1] "749  unique profane Words where found in the corpus text."
## [1] "0% starting -beeping- profane words. Processing can be slow!"
## [1] "5%"
## [1] "25%"
## [1] "50%"
## [1] "75%"
## [1] "100%"
## [1] "Profane words removed from corpus."
## [1] "there are  30648 unique words with less than 4 occurences. Below is a sample:"
##  [1] "zerran"      "zested"      "zesty"       "zetas"       "zettl"      
##  [6] "zeus"        "zharki"      "zicari"      "ziegler"     "zigging"    
## [11] "ziggler"     "ziggy"       "zigzags"     "zilch"       "ziliak"     
## [16] "zimmerli"    "zimmermann"  "zin"         "zina"        "zinc"       
## [21] "zinfandels"  "zinofile"    "zionist"     "zionists"    "zions"      
## [26] "zipline"     "ziploc"      "zipped"      "zipper"      "zips"       
## [31] "zipsters"    "zirmed"      "zit"         "zite's"      "ziti"       
## [36] "zlist"       "zloty"       "zodiac"      "zoeyâ\200\231s"    "zoghbi"     
## [41] "zomg"        "zondrvan"    "zoning"      "zooey"       "zookeeper"  
## [46] "zoological"  "zoom"        "zoran"       "zori"        "zoroastrian"
## [51] "zskuu"       "zuccotti"    "zuckerberg"  "zuma"        "zune"       
## [56] "zunis"       "zurian"      "zusi"        "zygote"      "zynga"
## [1] "there are  11161 unique words with >= 4 occurences. Below is a sample:"
##  [1] "yellow"      "yells"       "yep"         "yes"         "yesterday"  
##  [6] "yesterday's" "yet"         "yherajk"     "yield"       "yoga"       
## [11] "yogurt"      "yolo"        "york"        "yorkshire"   "you"        
## [16] "you'd"       "you'll"      "you're"      "you've"      "youâ\200"      
## [21] "youâ\200\231d"     "youâ\200\231ll"    "youâ\200\231re"    "youâ\200\231ve"    "young"      
## [26] "younger"     "youngest"    "youngsters"  "your"        "youre"      
## [31] "yours"       "yourself"    "yourselves"  "youth"       "youtube"    
## [36] "yrs"         "yum"         "yummy"       "yup"         "zac"        
## [41] "zack"        "zak"         "zayn"        "zealand"     "zero"       
## [46] "zest"        "zimmerman"   "zinfandel"   "zip"         "zoey"       
## [51] "zombie"      "zombies"     "zone"        "zones"       "zoo"        
## [56] "zooming"     "zoos"        "zucchini"    "zumba"       "zumwalt"

Unfortunately I have to remove quite a few words. However, cutting unique words that only occur once or twice is not that bad: it is fairly random that these were in the file in the first place. The good side is that many ‘bad’ or misspelled words are removed, which we do not want to predict anyway.

Get N-Grams

Now I will calculate some N-grams, focusing on 2- and 3-word groups. Below is a sample of the most common N-grams (in both cases combinations with more than 4 occurrences). The terms are quite standard, common combinations, which suggests they are correct.
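
One possible way to build such N-gram matrices is with the RWeka tokenizer in combination with tm, as sketched below. Whether loadData.R uses RWeka or another tokenizer is an assumption; the threshold of more than 4 occurrences matches the output below.

#Sketch: 2-gram document-term matrix and the combinations occurring more than 4 times
library(RWeka)
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigramDTM       <- DocumentTermMatrix(twitterCorpus,
                                      control = list(tokenize = bigramTokenizer))
frequentBigrams <- findFreqTerms(bigramDTM, lowfreq = 5)   #i.e. more than 4 occurrences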

## [1] "There are  19902  unique combinations of more than 4 occurences. A sample of the most frequent 2-Gram words are: "
##  [1] "*beep* *beep*"      "*beep* a"           "*beep* â\200“"        
##  [4] "*beep* and"         "*beep* are"         "*beep* at"         
##  [7] "*beep* but"         "*beep* by"          "*beep* can"        
## [10] "*beep* for"         "*beep* had"         "*beep* her"        
## [13] "*beep* him"         "*beep* i"           "*beep* i'm"        
## [16] "*beep* in"          "*beep* is"          "*beep* it"         
## [19] "*beep* like"        "*beep* me"          "*beep* my"         
## [22] "*beep* of"          "*beep* on"          "*beep* or"         
## [25] "*beep* orientation" "*beep* that"        "*beep* the"        
## [28] "*beep* them"        "*beep* this"        "*beep* to"         
## [31] "*beep* u"           "*beep* up"          "*beep* was"        
## [34] "*beep* when"        "*beep* who"         "*beep* with"       
## [37] "*beep* would"       "*beep* you"         "*beep* your"       
## [40] "*beep*ed a"
## [1] "There are  6722  unique combinations of more than 4 occurences. A sample of the most frequent 3-Gram words are: "
##  [1] "*beep* and the"      "*beep* in the"       "*beep* it up"       
##  [4] "*beep*ed in a"       "*beep*ysis of the"   "a *beep* and"       
##  [7] "a and m"             "a bag of"            "a beautiful day"    
## [10] "a belief in"         "a better place"      "a big deal"         
## [13] "a bill that"         "a billion dollars"   "a bit like"         
## [16] "a bit more"          "a bit of"            "a book on"          
## [19] "a box of"            "a break from"        "a bunch of"         
## [22] "a car accident"      "a career high"       "a chance for"       
## [25] "a chance to"         "a charge of"         "a coffee shop"      
## [28] "a combination of"    "a conversation with" "a copy of"          
## [31] "a couple days"       "a couple of"         "a couple weeks"     
## [34] "a cup of"            "a daily basis"       "a day after"        
## [37] "a day of"            "a day or"            "a era in"           
## [40] "a fan of"

These settings can be modified depending on whether the resulting combinations look good or bad and on how many combinations I want in the final model. It is best to increase the thresholds for bigger datasets, as the number of faulty word combinations also increases.

Now I will build a table with frequency counts for all the words and n-grams. This dataset can later be saved to file and used during model building. This decreases the data load a lot, because there are far fewer unique words or word combinations to load than the complete texts at the beginning of this document.
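
A sketch of how such a frequency table could be built from an N-gram document-term matrix is shown below; the data frame name matches the save calls later in this document, but the exact construction is an assumption.

#Sketch: turn a 2-gram document-term matrix into a sorted frequency table
termCounts <- colSums(as.matrix(bigramDTM))
Ngram2SmallDataFrame <- data.frame(combination = names(termCounts),
                                   count       = as.integer(termCounts),
                                   row.names   = NULL,
                                   stringsAsFactors = FALSE)
Ngram2SmallDataFrame <- Ngram2SmallDataFrame[order(-Ngram2SmallDataFrame$count), ]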

## [1] "The 1-Gram dataset has 6419 unique words"
## [1] "The 2-Gram dataset has 5944 unique words combinations."
## [1] "The 3-Gram dataset has 536 unique words combinations."

There might still be some word combinations that contain profane words (beep). I choose to remove these combinations, which leaves only clean word combinations. The combinations are written to file for later use. After cleaning the profane word combinations there are:
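
Removing these remaining profane combinations can be as simple as filtering out every row that still contains the *beep* placeholder, for example (a sketch, assuming the frequency tables built above):

#Sketch: drop n-gram rows that still contain the *beep* placeholder
Ngram2SmallDataFrame <- Ngram2SmallDataFrame[
  !grepl("*beep*", Ngram2SmallDataFrame$combination, fixed = TRUE), ]
Ngram3SmallDataFrame <- Ngram3SmallDataFrame[
  !grepl("*beep*", Ngram3SmallDataFrame$combination, fixed = TRUE), ]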

## [1] "The 1-Gram dataset has 6382 unique words"
## [1] "The 2-Gram dataset has 5907 unique words combinations."
## [1] "The 3-Gram dataset has 536 unique words combinations."
## [1] "Below are a few examples for the 3-gram category to get an idea: "
##           combination count
## 1          one of the   217
## 2            a lot of   163
## 3      thanks for the   149
## 4         going to be   120
## 5             to be a   114
## 6           i want to    96
## 7            it was a    87
## 8          as well as    83
## 9          out of the    81
## 10        some of the    79
## 11         the end of    77
## 12        is going to    74
## 13      thank you for    73
## 14        a couple of    72
## 15        the rest of    72
## 16           i have a    71
## 17 looking forward to    71
## 18        part of the    71
## 19         be able to    67
## 20          i have to    61

Shown as graphs, this looks as follows for the 1-, 2- and 3-gram combinations:

I will save the results to file for further use.

#save the n-gram frequency tables for later use
saveNGramsTexts(Ngram1SmallDataFrame,"result_ngram1.csv")
saveNGramsTexts(Ngram2SmallDataFrame,"result_ngram2.csv")
saveNGramsTexts(Ngram3SmallDataFrame,"result_ngram3.csv")

Conclusions

It was possible to load the various texts, analyse them and pre-process them into smaller collections that can be used for predicting the next word. The results can be used in the remainder of the capstone project. The big gain is that we no longer have to use the huge text files: they have been condensed into smaller files containing the N-grams and their frequency counts.

I can also make an estimate of the quality of the dataset gathered. The Oxford English Dictionary contains 171,476 words (source: ). By that measure we would only have about 4% of all words. However, an average person uses only about 20,000 words, and a native 8-year-old only about 10,000 (source: ). This would take the completeness of this dataset to about 32% and 64% respectively. For basic use, the dataset looks big enough; for more advanced uses, it is too small. For this application I think it is good enough to continue, especially because this set is about the maximum my PC can handle memory-wise.

What will be next…

There are still some open ends and considerations on how to proceed:
  • The N-grams can, I think, be used to build a prediction model. I want to add new columns for the first word(s) and the next word, derived from the combinations. Combined with the frequencies, a prediction of the next word should then be possible (see the sketch after this list).

  • Stop words are currently included in the text. This looks like a good thing, because the next word in a sentence is often a stop word. However, at a later stage it may turn out that we do not want them in. Stop words can then easily be removed during data preparation using the tm_map functionality.

  • It might also be a good idea to stem the texts and to replace synonyms with a single word. This would make the model more compact. I have not done this, because I am not yet sure whether it would improve or break the prediction algorithm. If stemming is used, it would also be necessary to replace the stemmed words with correct English equivalents, as the stems are mostly not correct English. It might be good to delay this process until a word is entered in the app: you can then match on the stem and look up the next word (and eventually return the correct English equivalent of that stemmed word). Another approach could be to add the stem to the N-gram datasets as well.

  • Foreign words might be present in the dataset. I manually checked a sample of the unique word list, which showed that almost all of it was English. This could be improved by checking every sentence for English and a few non-English stop words; if more foreign stop words are found, the sentence could be discarded.

  • It might be possible that some words are not in the resulting 1-, 2- or 3-gram collections. There are a few options:
  1. Increase the amount of text used (might not be very useful if the texts are already huge - runs into limits of computer memory)
  2. Increase the factor used for reducing sparse items (might lead to memory problems if set too high - it is however effective)
  3. Search for synonyms of the searched word; maybe they are in the set. This could be done in combination with stemming.
  4. Look for word associations of the searched word. This might be difficult or require a huge dataset in the app.
  5. Add the most common 1-gram word (or a random word from the top 10 most common).
  6. Report ‘not found’ back in the app.
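
To illustrate the first point above, splitting each 3-gram into a prefix and a next word could look like the sketch below; the column names are illustrative assumptions.

#Sketch: split each 3-gram into the two leading words and the word to predict
Ngram3SmallDataFrame$firstWords <- sub("\\s+\\S+$", "", Ngram3SmallDataFrame$combination)
Ngram3SmallDataFrame$nextWord   <- sub("^.*\\s",    "", Ngram3SmallDataFrame$combination)
#Prediction: for a given prefix, return the nextWord with the highest count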

Appendix 1 - Overview of most frequent terms

## [1] "The Blogs dataset has 11 unique words over 1000 occurences."
## [1] "Blogs word counts for most frequent words:"
##      UniqueWordsBlogs
## the             11371
## and              6437
## that             2577
## for              2189
## with             1777
## you              1726
## was              1635
## this             1406
## have             1324
## but              1199

## [1] "The News dataset has 8 unique words over 1000 occurences."
## [1] "News feed word counts for most frequent words:"
##      UniqueWordsNews
## the            13695
## and             6023
## for             2331
## that            2250
## with            1848
## was             1536
## from            1066
## his             1047
## but              983
## said             971