The purpose of this analysis is to explore the text files. The results of this initial exploration are necessary for further development of the word prediction application.

The issues I want to explore are:

1. Read the files.

setwd("~/R files/Natural Language Processing/Coursera-SwiftKey/en_US")
Sys.setlocale(category = "LC_ALL", locale = "English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
blogs<-readLines("en_US.blogs.txt", encoding="UTF-8")
news<-readLines("en_US.news.txt", encoding="UTF-8")
## Warning: incomplete final line found on 'en_US.news.txt'
twitter<-readLines("en_US.twitter.txt")  
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
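
The embedded-nul warnings can be avoided by skipping nul characters while reading. This is an alternative to the call above, not what was actually run for this report:

twitter<-readLines("en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)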

For the initial analysis I take a 5K-line sample from each file.
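
To make the samples reproducible, a random seed can be fixed first; the seed value below is arbitrary:

set.seed(1234)  # arbitrary; any fixed value makes the sampling repeatable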

sampleblog<-sample(blogs,5000)
samplenews<-sample(news, 5000)
sampletwitter<-sample(twitter, 5000)

2. Clean the files. Cleaning includes lower-casing the text, marking sentence endings with "STOP", replacing numbers with "NUM", removing punctuation, and replacing profanity words with a "profanity" placeholder.

#dictionary of profanity words
badwords<-readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
# drop entries that are not really profane in this context
badwords<-badwords[!badwords %in% c("refugee","reject","remains","screw","welfare","sweetness","shoot","sick","shooting","servant","sex","radical","racial","racist","republican","public","molestation","mexican","looser","lesbian","liberal","kill","killing","killer","heroin","fraud","fire","fight","fairy","^die","death","desire","deposit","crash","^crim","crack","^color","cigarette","church","^christ","canadian","cancer","^catholic","cemetery","buried","burn","breast","^bomb","^beast","attack","australian","balls","baptist","^addict","abuse","abortion","amateur","asian","aroused","angry","arab","bible")]
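
A minimal sketch of how these cleaning steps could be implemented; the exact functions and their order in the original analysis are assumptions reconstructed from the before/after example below:

cleantext <- function(x, profanity = badwords) {
  x <- tolower(x)                                  # lower-case everything
  x <- gsub("[.!?]+", " STOP ", x)                 # mark sentence endings with STOP
  x <- gsub("[0-9]+", " NUM ", x)                  # replace numbers with NUM
  x <- gsub("[^a-zA-Z' ]", " ", x)                 # drop remaining punctuation/symbols
  for (w in profanity)                             # replace profanity with a placeholder
    x <- gsub(paste0("\\b", w, "\\b"), "profanity", x)  # slow, assumes plain-word entries
  gsub(" +", " ", x)                               # collapse repeated spaces
}

sampleblog    <- cleantext(sampleblog)
samplenews    <- cleantext(samplenews)
sampletwitter <- cleantext(sampletwitter)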

Let’s take as an example the first 100 words of the twitter text before and after cleaning:

## [1] "never know where news will take you. Today: accused serial stalker's home. Out of jail-wants to clear name. His Interview @ 11 Sentence found in the YMCA men’s room: “He fell out of love with his sideburns but they refused to leave on their own.” Can anyone claim? Actually I forgot not likes to throw mike under the bus! \"If you don't like change you're going to like irrelevance even less\" -US Army Gen Shinseki via Show tomorrow night, 7pm at Clearwater Theater in West Dundee with our friends Indolent, Zero to End, and Dysfunctional Mariachi. $8 for tix \"It"
## [1] "never know where news will take you STOP today accused serial stalker's home STOP out of jailwants to clear name STOP his interview  NUM sentence found in the ymca menвs room вhe fell out of love with his sideburns but they refused to leave on their own STOPв can anyone claim STOP actually i forgot not likes to throw mike under the bus STOP if you don't like change you're going to like irrelevance even less us army gen shinseki via show tomorrow night NUM at clearwater theater in west dundee with our friends indolent zero to end and"

3. Tokenize the texts, create n-grams.

Let’s look at the top 10 words, 2-grams and 3-grams of the three texts.

## Warning: package 'tm' was built under R version 3.1.2
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.1.2
## Warning: package 'RWeka' was built under R version 3.1.2
## Warning: package 'reshape2' was built under R version 3.1.1
##       blog   news   twitter    
##  [1,] "the"  "the"  "the"      
##  [2,] "and"  "and"  "you"      
##  [3,] "that" "NUM"  "NUM"      
##  [4,] "NUM"  "for"  "and"      
##  [5,] "for"  "that" "for"      
##  [6,] "you"  "with" "that"     
##  [7,] "with" "said" "profanity"
##  [8,] "was"  "was"  "this"     
##  [9,] "this" "his"  "with"     
## [10,] "have" "from" "your"
##       blog      news      twitter     
##  [1,] "of the"  "in the"  "i m"       
##  [2,] "in the"  "of the"  "for the"   
##  [3,] "to the"  "to the"  "in the"    
##  [4,] "to be"   "for the" "it s"      
##  [5,] "on the"  "on the"  "don t"     
##  [6,] "and the" "at the"  "of the"    
##  [7,] "i was"   "in NUM"  "can t"     
##  [8,] "for the" "it s"    "on the"    
##  [9,] "and i"   "in a"    "thanks for"
## [10,] "it is"   "the NUM" "you re"
##       blog          news             twitter             
##  [1,] "one of the"  "more than NUM"  "thanks for the"    
##  [2,] "a lot of"    "one of the"     "can t wait"        
##  [3,] "i don t"     "NUM percent of" "i don t"           
##  [4,] "some of the" "a lot of"       "i can t"           
##  [5,] "to be a"     "in the NUM"     "looking forward to"
##  [6,] "i m not"     "NUM and NUM"    "i want to"         
##  [7,] "i didn t"    "i don t"        "t wait to"         
##  [8,] "in the NUM"  "NUM NUM NUM"    "going to be"       
##  [9,] "this is a"   "NUM to NUM"     "i ll be"           
## [10,] "as well as"  "in the first"   "for the follow"

The conclusions I can make from this first review:

- the most used words and phrases are quite similar across the 3 texts, but I’ll explore this deeper in the next points
- the news text uses many more numeric tokens than the other two texts
- the twitter text appears to contain more profanity than the other two texts

4. Comparison of 3 text sources.

Let’s compare the most used words and phrases of the 3 files. The chart below shows the % of common words between the blog, twitter and news texts.

## Warning: package 'ggplot2' was built under R version 3.1.1
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate

[plot: % of common words between the blog, news and twitter samples]

The chart above shows that the similarity is highest between the blog and news texts (~80% of the dictionary in common) and lowest between the news and twitter texts (~60% of the dictionary).
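
One way such an overlap percentage can be computed for a pair of sources (ungramblog is the unigram frequency table used later in this report; ungramnews is an assumed analogous table for the news sample):

# % of blog dictionary words that also occur in the news dictionary
round(100 * mean(ungramblog$Terms %in% ungramnews$Terms), 1)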

Let’s also look at the % of profanity words in the 3 text files.
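
A sketch of this check for the twitter sample, counting the "profanity" placeholder token introduced during cleaning (ungramtwitter is an assumed object name):

paste0(round(100 * sum(ungramtwitter$count[ungramtwitter$Terms == "profanity"]) /
                   sum(ungramtwitter$count), 2), "%")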

## [1] "0.47%"
## [1] "1.04%"
## [1] "0.32%"

5. What % of the dictionary covers a given % of the text in the 3 files

The chart below shows what % of the dictionary is required to cover a given share of the text, for the 3 types of texts.
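
A sketch of the underlying calculation for the blog sample (ungramblog is the unigram frequency table used later in this report):

freq <- sort(ungramblog$count, decreasing = TRUE)
coverage <- cumsum(freq) / sum(freq)          # cumulative share of the text covered
min(which(coverage >= 0.9)) / length(freq)    # share of the dictionary needed for 90%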

[plot: % of the dictionary required to cover a given share of the text, for the 3 sources]

The chart shows that the majority of the blog text can be covered by a smaller share of its dictionary than the news and twitter texts. To cover 90% of the text, I need only about 34% of the unique words for the blog text, and around 50% for twitter. This means that a larger share of the blog dictionary consists of rarely used words than in the twitter and news texts.

The approach I’m going to apply in the application is to label words that appear only once in the corpus as a generic “rare word” token. Let’s see what share of the text would still be covered in this case.
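
A sketch of this check for the blog sample; the other two samples would be treated the same way (ungramblog is the unigram frequency table used later in this report):

# share of blog text tokens covered by words that appear more than once
sum(ungramblog$count[ungramblog$count > 1]) / sum(ungramblog$count)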

## [1] 0.9308
## [1] 0.8757
## [1] 0.9221

This approach would give around 90% coverage for all texts.

6. How well the next word can be predicted from the previous word

Let’s take a sample word “work”

# number of times "work" appears in the text
ungramblog$count[ungramblog$Terms=="work"]
## [1] 197

Here are the top 5 words following the word “work”:
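
One way these rows can be pulled from a 2-gram table; bigramblog and its term2 (first word) / term1 (following word) / count layout are assumptions based on the output below:

follows <- bigramblog[bigramblog$term2 == "work", ]
head(follows[order(follows$count, decreasing = TRUE), ], 5)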

##        term2 term1 count
## 114506  work  STOP    31
## 114525  work  with    14
## 114440  work   and    13
## 114489  work    on    10
## 114515  work    to     7

If we consider several 2-grams ending with “work”, we’ll see that the most frequent next word differs depending on which word comes before “work”:

##        term3 term2 term1 count
## 173571    to  work  STOP     6
## 173577    to  work  with     4
##       term3 term2 term1 count
## 61650  hard  work   and     1
## 61651  hard  work   day     1
##        term3 term2 term1 count
## 162773   the  work    of     2
## 4183       a  work    of     1

Therefore, predicting the next word based on 3-grams instead of 2-grams should improve the results for certain words.

Using longer n-grams might provide even more precise predictions; however, the chance that the same n-gram appears again in the text is lower, and we would have to compromise on calculation time and dictionary size. Therefore, we have to experiment with different n-gram sizes and find the optimal length.
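
A minimal backoff sketch of this idea; trigramblog/bigramblog and their term3/term2/term1 column layout are assumptions based on the outputs above. It tries the 3-gram table first, then falls back to 2-grams, then to the most frequent word overall:

predict_next <- function(w1, w2, trigrams = trigramblog, bigrams = bigramblog) {
  hits <- trigrams[trigrams$term3 == w1 & trigrams$term2 == w2, ]  # 3-gram match
  if (nrow(hits) > 0) return(hits$term1[which.max(hits$count)])
  hits <- bigrams[bigrams$term2 == w2, ]                           # back off to 2-grams
  if (nrow(hits) > 0) return(hits$term1[which.max(hits$count)])
  "the"                                                            # most frequent unigram
}

predict_next("to", "work")   # returns "STOP" given the counts shown above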

7. Next steps