The purpose of this analysis is to explore the text files. The results of this initial exploration are necessary for further development of the word prediction application.

The issues I want to explore are:

1. Read the files.

setwd("~/R files/Natural Language Processing/Coursera-SwiftKey/en_US")
Sys.setlocale(category = "LC_ALL", locale = "English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
blogs<-readLines("en_US.blogs.txt", encoding="UTF-8")
news<-readLines("en_US.news.txt", encoding="UTF-8")
## Warning: incomplete final line found on 'en_US.news.txt'
twitter<-readLines("en_US.twitter.txt")  
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
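
The embedded-nul warnings can be avoided by skipping nul characters while reading. This is an alternative to the call above, not what was actually run for this report:

twitter<-readLines("en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)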

For the initial analysis I take a 5K-line sample from each file.
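
To make the samples reproducible, a random seed can be fixed first; the seed value below is arbitrary:

set.seed(1234)  # arbitrary; any fixed value makes the sampling repeatable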

sampleblog<-sample(blogs,5000)
samplenews<-sample(news, 5000)
sampletwitter<-sample(twitter, 5000)

2. Clean the files. Cleaning includes lower-casing the text, marking sentence endings with "STOP", replacing numbers with "NUM", removing punctuation, and replacing profanity words with a "profanity" placeholder.

#dictionary of profanity words
badwords<-readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
# drop entries that are not really profane in this context
badwords<-badwords[!badwords %in% c("refugee","reject","remains","screw","welfare","sweetness","shoot","sick","shooting","servant","sex","radical","racial","racist","republican","public","molestation","mexican","looser","lesbian","liberal","kill","killing","killer","heroin","fraud","fire","fight","fairy","^die","death","desire","deposit","crash","^crim","crack","^color","cigarette","church","^christ","canadian","cancer","^catholic","cemetery","buried","burn","breast","^bomb","^beast","attack","australian","balls","baptist","^addict","abuse","abortion","amateur","asian","aroused","angry","arab","bible")]
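
A minimal sketch of how these cleaning steps could be implemented; the exact functions and their order in the original analysis are assumptions reconstructed from the before/after example below:

cleantext <- function(x, profanity = badwords) {
  x <- tolower(x)                                  # lower-case everything
  x <- gsub("[.!?]+", " STOP ", x)                 # mark sentence endings with STOP
  x <- gsub("[0-9]+", " NUM ", x)                  # replace numbers with NUM
  x <- gsub("[^a-zA-Z' ]", " ", x)                 # drop remaining punctuation/symbols
  for (w in profanity)                             # replace profanity with a placeholder
    x <- gsub(paste0("\\b", w, "\\b"), "profanity", x)  # slow, assumes plain-word entries
  gsub(" +", " ", x)                               # collapse repeated spaces
}

sampleblog    <- cleantext(sampleblog)
samplenews    <- cleantext(samplenews)
sampletwitter <- cleantext(sampletwitter)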

Let’s take as an example the first 100 words of the twitter text before and after cleaning:

## [1] "never know where news will take you. Today: accused serial stalker's home. Out of jail-wants to clear name. His Interview @ 11 Sentence found in the YMCA men’s room: “He fell out of love with his sideburns but they refused to leave on their own.” Can anyone claim? Actually I forgot not likes to throw mike under the bus! \"If you don't like change you're going to like irrelevance even less\" -US Army Gen Shinseki via Show tomorrow night, 7pm at Clearwater Theater in West Dundee with our friends Indolent, Zero to End, and Dysfunctional Mariachi. $8 for tix \"It"
## [1] "never know where news will take you STOP today accused serial stalker's home STOP out of jailwants to clear name STOP his interview  NUM sentence found in the ymca menвs room вhe fell out of love with his sideburns but they refused to leave on their own STOPв can anyone claim STOP actually i forgot not likes to throw mike under the bus STOP if you don't like change you're going to like irrelevance even less us army gen shinseki via show tomorrow night NUM at clearwater theater in west dundee with our friends indolent zero to end and"

3. Tokenize the texts, create n-grams.

Let’s look at the top 10 words, 2-grams and 3-grams of the three texts.

## Warning: package 'tm' was built under R version 3.1.2
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.1.2
## Warning: package 'RWeka' was built under R version 3.1.2
## Warning: package 'reshape2' was built under R version 3.1.1
##       blog   news   twitter    
##  [1,] "the"  "the"  "the"      
##  [2,] "and"  "and"  "you"      
##  [3,] "that" "NUM"  "NUM"      
##  [4,] "NUM"  "for"  "and"      
##  [5,] "for"  "that" "for"      
##  [6,] "you"  "with" "that"     
##  [7,] "with" "said" "profanity"
##  [8,] "was"  "was"  "this"     
##  [9,] "this" "his"  "with"     
## [10,] "have" "from" "your"
##       blog      news      twitter     
##  [1,] "of the"  "in the"  "i m"       
##  [2,] "in the"  "of the"  "for the"   
##  [3,] "to the"  "to the"  "in the"    
##  [4,] "to be"   "for the" "it s"      
##  [5,] "on the"  "on the"  "don t"     
##  [6,] "and the" "at the"  "of the"    
##  [7,] "i was"   "in NUM"  "can t"     
##  [8,] "for the" "it s"    "on the"    
##  [9,] "and i"   "in a"    "thanks for"
## [10,] "it is"   "the NUM" "you re"
##       blog          news             twitter             
##  [1,] "one of the"  "more than NUM"  "thanks for the"    
##  [2,] "a lot of"    "one of the"     "can t wait"        
##  [3,] "i don t"     "NUM percent of" "i don t"           
##  [4,] "some of the" "a lot of"       "i can t"           
##  [5,] "to be a"     "in the NUM"     "looking forward to"
##  [6,] "i m not"     "NUM and NUM"    "i want to"         
##  [7,] "i didn t"    "i don t"        "t wait to"         
##  [8,] "in the NUM"  "NUM NUM NUM"    "going to be"       
##  [9,] "this is a"   "NUM to NUM"     "i ll be"           
## [10,] "as well as"  "in the first"   "for the follow"

The conclusions I can make from this first review:

- the most used words and phrases are quite similar across the 3 texts, but I’ll explore this deeper in the next points
- the news text uses many more numeric tokens than the other two texts
- the twitter text appears to contain more profanity than the other two texts

4. Comparison of 3 text sources.

Let’s compare the most used words and phrases of the 3 files. The chart below shows the % of common words between the blog, twitter and news texts.

## Warning: package 'ggplot2' was built under R version 3.1.1
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate

[plot: % of common words between the blog, news and twitter samples]

The chart above shows that the similarity is highest between the blog and news texts (~80% of the dictionary in common) and lowest between the news and twitter texts (~60% of the dictionary).
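
One way such an overlap percentage can be computed for a pair of sources (ungramblog is the unigram frequency table used later in this report; ungramnews is an assumed analogous table for the news sample):

# % of blog dictionary words that also occur in the news dictionary
round(100 * mean(ungramblog$Terms %in% ungramnews$Terms), 1)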

Let’s also look at the % of profanity words in the 3 text files.
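
A sketch of this check for the twitter sample, counting the "profanity" placeholder token introduced during cleaning (ungramtwitter is an assumed object name):

paste0(round(100 * sum(ungramtwitter$count[ungramtwitter$Terms == "profanity"]) /
                   sum(ungramtwitter$count), 2), "%")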

## [1] "0.47%"
## [1] "1.04%"
## [1] "0.32%"

5. What % of the dictionary covers a given % of the text in the 3 files

The chart below shows what % of the dictionary is required to cover a given share of the text, for the 3 types of texts.
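
A sketch of the underlying calculation for the blog sample (ungramblog is the unigram frequency table used later in this report):

freq <- sort(ungramblog$count, decreasing = TRUE)
coverage <- cumsum(freq) / sum(freq)          # cumulative share of the text covered
min(which(coverage >= 0.9)) / length(freq)    # share of the dictionary needed for 90%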

[plot: % of the dictionary required to cover a given share of the text, for the 3 sources]

The chart shows that the majority of the blog text can be covered by a smaller share of its dictionary than the news and twitter texts. To cover 90% of the text, I need only about 34% of the unique words for the blog text, and around 50% for twitter. This means that a larger share of the blog dictionary consists of rarely used words than in the twitter and news texts.

The approach I’m going to apply in the application is to label words that appear only once in the corpus as a generic “rare word” token. Let’s see what share of the text would still be covered in this case.
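
A sketch of this check for the blog sample; the other two samples would be treated the same way (ungramblog is the unigram frequency table used later in this report):

# share of blog text tokens covered by words that appear more than once
sum(ungramblog$count[ungramblog$count > 1]) / sum(ungramblog$count)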

## [1] 0.9308
## [1] 0.8757
## [1] 0.9221

This approach would give around 90% coverage for all texts.

6. How well the next word can be predicted from the previous word

Let’s take a sample word “work”

# number of times "work" appears in the text
ungramblog$count[ungramblog$Terms=="work"]
## [1] 197

Here are the top 5 words following the word “work”:
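
One way these rows can be pulled from a 2-gram table; bigramblog and its term2 (first word) / term1 (following word) / count layout are assumptions based on the output below:

follows <- bigramblog[bigramblog$term2 == "work", ]
head(follows[order(follows$count, decreasing = TRUE), ], 5)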

##        term2 term1 count
## 114506  work  STOP    31
## 114525  work  with    14
## 114440  work   and    13
## 114489  work    on    10
## 114515  work    to     7

If we consider several 2-grams ending with “work”, we’ll see that the most frequent next word differs depending on which word comes before “work”:

##        term3 term2 term1 count
## 173571    to  work  STOP     6
## 173577    to  work  with     4
##       term3 term2 term1 count
## 61650  hard  work   and     1
## 61651  hard  work   day     1
##        term3 term2 term1 count
## 162773   the  work    of     2
## 4183       a  work    of     1

Therefore, predicting the next word based on 3-grams instead of 2-grams should improve the results for certain words.

Using longer n-grams might provide even more precise predictions; however, the chance that the same n-gram appears again in the text is lower, and we would have to compromise on calculation time and dictionary size. Therefore, we have to experiment with different n-gram sizes and find the optimal length.
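
A minimal backoff sketch of this idea; trigramblog/bigramblog and their term3/term2/term1 column layout are assumptions based on the outputs above. It tries the 3-gram table first, then falls back to 2-grams, then to the most frequent word overall:

predict_next <- function(w1, w2, trigrams = trigramblog, bigrams = bigramblog) {
  hits <- trigrams[trigrams$term3 == w1 & trigrams$term2 == w2, ]  # 3-gram match
  if (nrow(hits) > 0) return(hits$term1[which.max(hits$count)])
  hits <- bigrams[bigrams$term2 == w2, ]                           # back off to 2-grams
  if (nrow(hits) > 0) return(hits$term1[which.max(hits$count)])
  "the"                                                            # most frequent unigram
}

predict_next("to", "work")   # returns "STOP" given the counts shown above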

7. Next steps