The goal of this exercise is to do some initial data exploration of the three datasets provided.
The three files all contain natural-language text and will be used to develop a model that predicts the next word when a user enters a two- or three-word phrase.
Open the three input files and do some basic exploratory work on the full text of each.
News:
Blogs:
Twitter:
Analysis: These files are large both in storage/memory footprint and in their number of lines and words. For initial exploration and testing I will use a sample of 1,000 lines from each file.
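Below is a minimal sketch of that sampling step (the file names come from the outputs later in this report; the seed and the helper function are mine and purely illustrative):

```r
# Read each raw file and keep a random sample of 1,000 lines for exploration.
set.seed(1234)  # arbitrary seed, just to make the sample reproducible

sample_lines <- function(path, n = 1000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, min(n, length(lines)))
}

blogs_sample   <- sample_lines("en_US.blogs.txt")
news_sample    <- sample_lines("en_US.news.txt")
twitter_sample <- sample_lines("en_US.twitter.txt")
```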
There are a number of cleanup actions that need to be taken to make the data easier to work with, such as lower-casing the text and removing punctuation. This pre-processing allows for easier data manipulation later on.
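A sketch of the corpus construction and cleanup with the tm package follows; the exact set of transformations is my assumption, inferred from the lower-cased, punctuation-free terms that show up in the outputs below:

```r
library(tm)

# Write the samples to a scratch directory so DirSource() keeps the file names
# as document IDs (matching the en_US.*.txt labels in the term matrix below).
dir.create("sample", showWarnings = FALSE)
writeLines(blogs_sample,   "sample/en_US.blogs.txt")
writeLines(news_sample,    "sample/en_US.news.txt")
writeLines(twitter_sample, "sample/en_US.twitter.txt")

corpus <- Corpus(DirSource("sample", encoding = "UTF-8"))

# Standard tm cleanup: lower-case, strip punctuation and numbers, squeeze whitespace.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

corpus
```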
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 3
Next, create a document-term matrix. A document-term matrix tabulates every term in the corpus and its frequency in each document (along with some metadata).
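A minimal sketch of building and inspecting it, assuming the cleaned corpus from above:

```r
# Build the document-term matrix and inspect it; inspect() prints the summary
# plus a sample of the most frequent terms, as shown below.
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```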
## <<DocumentTermMatrix (documents: 3, terms: 13069)>>
## Non-/sparse entries: 18362/20845
## Sparsity : 53%
## Maximal term length: 74
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs and but for have that the this was with you
## en_US.blogs.txt 1234 225 399 244 556 2062 267 317 363 312
## en_US.news.txt 801 175 386 139 337 1888 99 204 253 107
## en_US.twitter.txt 182 51 145 58 107 420 71 55 52 218
## [1] "The number of words in each document in the corpus:"
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## 32337 26859 9552
## [1] "The top 25 words in the corpus are:"
## the and that for with you was but have this not are from said his
## 4370 2217 1000 930 668 637 576 451 441 437 371 367 327 304 290
## all they its one will like what just out more
## 279 278 275 273 260 255 255 252 245 242
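These counts can be read straight off the document-term matrix; a short sketch (the variable names are mine):

```r
# Per-document word counts and the overall top 25 terms, computed from the DTM.
m <- as.matrix(dtm)

print("The number of words in each document in the corpus:")
print(rowSums(m))

print("The top 25 words in the corpus are:")
term_freq <- sort(colSums(m), decreasing = TRUE)
print(head(term_freq, 25))
```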
A graphical representation of the top 25 terms across the entire corpus:
## [1] "Below is a graph of the top 25 words used in the corpus:"
## [1] "The wordcloud below is another representation of the words frequency:"
It is interesting to note that “the” and “and” are the two most frequently appearing words. Unfortunately, “the” and “and” are not very helpful in our prediction model, and the tm package provides an easy way to remove such “stopwords”.
Stopwords are common words that do not provide much useful information when creating n-grams. As such, they will be removed from the corpus, a new document-term matrix will be built, and a new top-25 list will be produced for comparison.
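A sketch of the stopword removal and the rebuilt document-term matrix, assuming tm's standard English stopword list:

```r
# Drop the standard English stopwords, tidy the whitespace, and rebuild the DTM.
corpus_nostop <- tm_map(corpus, removeWords, stopwords("english"))
corpus_nostop <- tm_map(corpus_nostop, stripWhitespace)

dtm_nostop <- DocumentTermMatrix(corpus_nostop)
inspect(dtm_nostop)

# New top 25 after stopword removal, for comparison with the earlier list.
term_freq_nostop <- sort(colSums(as.matrix(dtm_nostop)), decreasing = TRUE)
head(term_freq_nostop, 25)
```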
## <<DocumentTermMatrix (documents: 3, terms: 12971)>>
## Non-/sparse entries: 18085/20828
## Sparsity : 54%
## Maximal term length: 74
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can day get just like new one said time will
## en_US.blogs.txt 119 73 88 131 142 72 148 34 114 117
## en_US.news.txt 46 31 46 59 55 80 73 259 55 106
## en_US.twitter.txt 51 44 40 62 58 36 52 11 32 37
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## 21293 19122 6916
## [1] "The top 25 words in the corpus after the standard stopwords are removed are below:"
## said one will like just can time new get day
## 304 273 260 255 252 216 201 188 174 148
## know year now first good people much two make also
## 146 146 140 135 134 125 123 115 112 110
## love dont well see last
## 110 104 103 102 99
## [1] "Below is a histogram of the top 25 words after having removed the stopwords."
## [1] "The wordcloud below is another representation of the words frequency:"
This data is much more informative! It seems to make sense that “said” is the top word in terms of frequency given that the corpus includes news.
Now on to the n-gram work: creating bigrams (with trigrams and quadgrams to follow). I tried 5-grams and 6-grams, but they did not produce anything very useful considering the time it took to create them.
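Here is a sketch of the bigram step using the ngram package, which the summary and phrasetable output below are consistent with; the exact preprocessing chain is my assumption:

```r
library(ngram)
library(tm)

# Collapse the three samples into one string, do light cleanup, and drop
# stopwords before building the n-grams.
full_text <- concatenate(blogs_sample, news_sample, twitter_sample)
full_text <- preprocess(full_text, case = "lower", remove.punct = TRUE)
full_text <- removeWords(full_text, stopwords("english"))
full_text <- stripWhitespace(full_text)

string.summary(full_text)            # character / word / sentence counts

ng2 <- ngram(full_text, n = 2)       # bigrams; use n = 3 or 4 for tri-/quadgrams
print(ng2, output = "truncated")     # truncated listing of the bigrams

phrasetable <- get.phrasetable(ng2)  # data.frame of ngrams, freq, prop
str(phrasetable)
```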
## Chars: 330304
## Letters: 282004
## Whitespace: 48300
## Punctuation: 0
## Digits: 0
## Words: 48301
## Sentences: 0
## Lines: 1
## Wordlens: 0 969 2799 3320 4242 4505 6066 7319 8459 10621
## 1 1 1 1 1 1 1 1 1 1
## Senlens: 0
## 10
## Syllens: 0 2 4 66 549 2766 8049 17437 19013
## 1 2 1 1 1 1 1 1 1
## [1] "Below is a truncated listing of the bigrams for the corpus:"
## come officials | 1
## thought {1} |
##
## mother tausha | 1
## cram {1} |
##
## users need | 1
## stay {1} |
##
## bit song | 1
## called {1} |
##
## wearing business | 1
## wear {1} |
##
## [[ ... results truncated ... ]]
## 'data.frame': 45900 obs. of 3 variables:
## $ ngrams: chr "new york " "year old " "last year " "last night " ...
## $ freq : int 26 25 19 16 15 13 12 12 12 11 ...
## $ prop : num 0.000538 0.000518 0.000393 0.000331 0.000311 ...
## [1] "Below is a plot of the top 25 bigrams for the corpus based upon the phrasetable:"
Analysis
The bigrams listed make sense: New York, year old, last year, last night, etc. These are all phrases that are usually found together, so I must be on the right track!
Future work: The three files provide a substantial amount of data, and using the entire corpus for training is not feasible given the limitations of my laptop. As such, a sample of the data will be created, balancing memory usage against the amount of data in the training set. The long-term plan is to develop the 2-, 3-, and 4-grams and store them in files that the future prediction application will read; this is the only way the application's response time can be kept reasonable for users.
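As a rough sketch of that plan (the file names and the RDS format are my assumptions), the n-gram tables could be precomputed once and then simply loaded by the prediction app:

```r
# Precompute the 2-, 3-, and 4-gram frequency tables and persist them so the
# prediction application only has to read the saved files at startup.
for (n in 2:4) {
  ng <- ngram(full_text, n = n)
  saveRDS(get.phrasetable(ng), file = sprintf("ngram_%d.rds", n))
}

# Later, in the prediction application:
bigrams <- readRDS("ngram_2.rds")
```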
Thanks!
John McConnell