--- title: "Text corpus analysis" author: "Veronika Nuretdinova" date: "Monday, February 2, 2014" output: html_document --- The purpose of this analysis is the text files exploration. The results of the initial exploration are neccesary for further development of word prediction application. The issues I want to explore: - diversity of the library, ie what % of the word covers majority (eg 95%) of the text. This would help reduce the dictionaries and deal with rare word in the application - difference between 3 files to be taken into account when I create the application library - % of profanity word - ngrams: how well the words can be predicted by 1, 2 or 3 previous words #1. Read the file. ```r setwd("~/R files/Natural Language Processing/Coursera-SwiftKey/en_US") Sys.setlocale(category = "LC_ALL", locale = "English") ``` ``` ## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252" ``` ```r blogs<-readLines("en_US.blogs.txt", encoding="UTF-8") news<-readLines("en_US.news.txt", encoding="UTF-8") ``` ``` ## Warning: incomplete final line found on 'en_US.news.txt' ``` ```r twitter<-readLines("en_US.twitter.txt") ``` ``` ## Warning: line 167155 appears to contain an embedded nul ## Warning: line 268547 appears to contain an embedded nul ## Warning: line 1274086 appears to contain an embedded nul ## Warning: line 1759032 appears to contain an embedded nul ``` For the initial analysis I take 5K line samples from each files. ```r sampleblog<-sample(blogs,5000) samplenews<-sample(news, 5000) sampletwitter<-sample(twitter, 5000) ``` #2. Cleaning of the files includes: - remove profanity words - remove special symbols - convert to lower, so that words in the beggining of the sentence would not be read as different words. - change words containing digit to the NUM symbol meaning "numeric word". - change ".","?","!" to STOP sign meaning end of the sentence after given phrase, while other punctuation would be removed. Let's take the example of the first 100 words in blog text before and after cleaning ``` ## [1] "He jumped out of the bed, pulled on this track-suit bottoms and ran to check the other bedrooms. All empty. Sometimes we just need to stop and watch the fishes. The day after he had killed King Arthur, Mordred opened his eyes to flickering candlelight and damp rock. There had been nightmares, screaming, and much pain. Terrible pain such as his pampered body had never felt before. But the worst had passed. His crippled form stirred in the shadows and his remaining hand closed about cold metal. Not his axe – he’d lost that on the battlefield, along with his" ``` ``` ## [1] "he jumped out of the bed pulled on this tracksuit bottoms and ran to check the other bedrooms STOP all empty STOP sometimes we just need to stop and watch the fishes STOP the day after he had profanity king arthur mordred opened his eyes to flickering candlelight and damp rock STOP there had been nightmares screaming and much pain STOP terrible pain such as his pampered body had never felt before STOP but the worst had passed STOP his crippled form stirred in the shadows and his remaining hand closed about cold metal STOP not his axe hed" ``` #3. Tokenize the texts, create n-grams. Let's look at top 10 words and top 2-grams and 3-grams from one of the file, I take the blogs text. 
```
##       1-gram 2-gram    3-gram       
##  [1,] "the"  "of the"  "one of the" 
##  [2,] "and"  "in the"  "a lot of"   
##  [3,] "that" "to the"  "i don t"    
##  [4,] "NUM"  "on the"  "as well as" 
##  [5,] "for"  "to be"   "it is a"    
##  [6,] "you"  "and the" "out of the" 
##  [7,] "with" "and i"   "a couple of"
##  [8,] "was"  "for the" "to be a"    
##  [9,] "this" "is a"    "going to be"
## [10,] "have" "i am"    "be able to" 
```

#4. Comparison of the 3 text sources.

Let's compare the most used words and phrases of the 3 files. The chart below shows the % of common words between the blog, twitter and news texts.

![plot of chunk unnamed-chunk-7](figure/unnamed-chunk-7.png) 

The chart above shows that the similarity is highest between the blog and news texts (~80% of the dictionary in common) and lowest between news and twitter (~60% of the dictionary).

If we look at the % of profanity words in the 3 text files, we see that the news text uses slightly fewer profanity words than twitter and blog. But overall, all texts contain roughly 0.5-1.2% profanity words.

```
## [1] 0.005813
```

```
## [1] 0.01245
```

```
## [1] 0.005438
```

#5. What % of the dictionary covers a given share of the text in the 3 files

The chart below demonstrates what % of the dictionary is required to cover a given share of the text, for the 3 types of texts.

![plot of chunk unnamed-chunk-9](figure/unnamed-chunk-9.png) 

The chart shows that the majority of the blog text can be covered by a smaller share of the words used in the text than for news and twitter. If I want to cover 90% of the text, I need only 34% of the words for the blog text, and around 50% for news and twitter. This means that the blog text has more rarely used words than twitter and news.

The approach I'm going to apply in the application is labeling words which were used only once in the corpus as "rare word". Let's see what share of the text would be covered in this case.

```
## [1] 0.9297
```

```
## [1] 0.8785
```

```
## [1] 0.9215
```

This approach would give around 90% coverage for all texts.

#6. How well the next word can be predicted by the previous words

Let's take the sample word "broken".


```r
# number of times "broken" appears in the text
ungramblog$count[ungramblog$Terms=="broken"]
```

```
## [1] 22
```

Here are the top 5 words following the word "broken":

```
##        term2 term1 count
## 16195 broken  into     3
## 16189 broken   and     2
## 16190 broken    by     2
## 16204 broken    up     2
## 16191 broken china     1
```

If we consider several 2-grams ending with "broken", we'll see that the next word differs depending on what word is in front of "broken": a verb (has/is) + broken would typically be followed by a STOP, an adverb or a preposition, while an article or preposition (with/in/a/the/from) + broken would be followed by a noun.

```
##          term2  term1 count
## 5880       and broken     2
## 36340      get broken     2
## 47168       is broken     2
## 112807    your broken     2
## 132          a broken     1
## 3874     album broken     1
## 9144       are broken     1
## 11916     bars broken     1
## 39375      has broken     1
## 48944    jeans broken     1
## 49478     just broken     1
## 53228   longer broken     1
## 63581       of broken     1
## 80177     seem broken     1
## 80911  severed broken     1
```


```r
threegramblog[which(threegramblog$term2=="broken"),]
```

```
##        term3  term2 term1 count
## 59737    has broken  into     1
## 75396     is broken    by     1
```

```
##        term3  term2 term1 count
## 307        a broken   leg     1
## 101681    of broken walls     1
```

Therefore, predicting the next word based on 3-grams instead of 2-grams should improve the result.
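This suggests a simple back-off scheme for the prediction step. The sketch below is my own illustration rather than code from the report: it looks up the last two words in the 3-gram table, backs off to the 2-gram table, and finally falls back to the most frequent single word. It assumes data frames with the term/count columns printed above; the bigram table name `twogramblog` is an assumption, since only `ungramblog` and `threegramblog` appear in the report.

```r
# Sketch of a back-off next-word lookup over the n-gram tables above (illustrative).
# Assumed layout, based on the printed output: 3-grams have columns term3 (first word),
# term2 (second) and term1 (third); 2-grams have term2 (first) and term1 (second);
# 1-grams have Terms; all tables have a count column.
predict_next <- function(w1, w2, trigrams, bigrams, unigrams) {
  cand <- trigrams[trigrams$term3 == w1 & trigrams$term2 == w2, ]
  if (nrow(cand) > 0) return(cand$term1[which.max(cand$count)])
  cand <- bigrams[bigrams$term2 == w2, ]            # back off to 2-grams
  if (nrow(cand) > 0) return(cand$term1[which.max(cand$count)])
  unigrams$Terms[which.max(unigrams$count)]         # fallback: most frequent word
}

# predict_next("has", "broken", threegramblog, twogramblog, ungramblog)
```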
Using longer n-grams might provide even more precise predictions; however, the chance that the same n-gram appears again in the text is lower, and we would have to compromise on calculation time and dictionary volume. Therefore, we have to experiment with different n-gram sizes and find the optimal length.

#7. Next steps

- create the text corpus based on the 3 texts to be used in the word prediction application. This includes cleaning the text and labeling rarely used words
- create the algorithm for next word prediction. The algorithm would be based on n-grams built from the text corpus, i.e. next word = f(last words, number of last words)
- experiment with different n-gram sizes to find the optimum length in terms of prediction quality/calculation time
- analyze whether other features can be taken into account in the algorithm, e.g. part-of-speech or the endings of the words in the text: do the additional features improve the performance, and what is their cost in terms of memory and calculation time? In that case, next word = f(last words, number of last words, ending of the last word, POS of the last words)