Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a difficult task. SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is its predictive text model.
When someone types:
I went to the
the keyboard presents three options for what the next word might be; for example, gym, store or restaurant. In this capstone project, we will work on understanding and building predictive text models like those used by SwiftKey.
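Conceptually (this is only an illustration, not SwiftKey's actual method), such a model stores how often word sequences occur in a corpus and, given the last word or two typed, suggests the most frequent continuations. A minimal R sketch using a hypothetical toy trigram table:
# toy illustration (hypothetical counts): given the last two words typed,
# look up the most frequent observed next words in a small trigram table
dfToyNgr3 <- data.frame(Search = c("to the", "to the", "to the", "to the"),
                        Next   = c("gym", "store", "restaurant", "beach"),
                        Freq   = c(25, 18, 12, 7),
                        stringsAsFactors = FALSE)
SuggestNext <- function(strTyped, dfNgrams, n = 3) {
    dfMatch <- dfNgrams[dfNgrams$Search == strTyped, ]              # rows matching the typed context
    dfMatch <- dfMatch[order(dfMatch$Freq, decreasing = TRUE), ]    # most frequent first
    head(dfMatch$Next, n)                                           # top n candidate next words
}
SuggestNext("to the", dfToyNgr3)   # "gym" "store" "restaurant"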
Data Source
Large databases of text in a target language are commonly used when generating language models. The data comes from a dataset called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language filtered but may still contain some foreign text. In this project, we will use the English database.
This report covers the exploratory analysis for the said project.
Tasks To Accomplish
1. Exploratory Analysis
Perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the files.
2. Frequency Analysis
Understand the frequencies of words and word pairs; build figures and tables to understand the variation in these frequencies across the data.
Data
As per the requirement of the project, the data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The file Coursera-SwiftKey.zip is available in a zip format.
Corpus
We analyse three files of US English data available:
- blogs
- news
- twitter
Note
We find that all three files (blogs, news & twitter) are structured similarly; each data set appears to contain one blog post, one news item or one tweet per line.
Whereas a Twitter record is limited to 140 characters per message, there appears to be no limit on the length of a line in the blog & news files.
Activities To Be Done
For each of the above data sets, we will get the following information:
1. file size
2. line count & non-empty line count
3. character count
4. nonwhite character count
5. word count per line - summary
6. word count per file
7. word count frequency per file
8. frequency of word count frequency per file
Then we will merge the data and analyse the word count frequency for the Merged Data.
Then, for each data set & also for the Merged Data, there will be visualizations for:
* Word Frequency - Top 30 Words
* Frequency Of Word Frequency
* Word Cloud - Top 100 Words
Subsequently for the MergedData, Bigrams & Trigrams will be generated.
Stats related to the Bigrams & Trigrams will be generated as follows:
1. Bigrams / Trigrams count frequency
2. frequency of Bigrams / Trigrams count frequency
Again, for the Bigrams & Trigrams there will be visualizations for:
* Bigram / Trigram Frequency - Top 30
* Frequency Of Bigram / Trigram Frequency
* Word Cloud - Top 100 Bigrams / Trigrams
Pre-Requisites
Before you start execution of this Rmd file:
1. Please set the working directory to your repository.
2. Please download Coursera-SwiftKey.zip and copy it to your repository (an optional download/unzip script is sketched below).
3. From the said zip file, unzip & copy the folder en_US into your repository.
setwd(<your_assignment_repository>)
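If you prefer to script steps 2 & 3, a minimal sketch (assuming an internet connection and write access to the working directory) would be:
# optional: download the data set if it is not already present
zipFileName <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFileName)) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  destfile = zipFileName, mode = "wb")
}
# extract only the US English files into the en_US folder of the repository
if (!file.exists("en_US")) {
    unzip(zipFileName,
          files = c("final/en_US/en_US.blogs.txt",
                    "final/en_US/en_US.news.txt",
                    "final/en_US/en_US.twitter.txt"),
          junkpaths = TRUE, exdir = "en_US")
}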
knitr Global Options
knitr::opts_chunk$set(tidy=FALSE, fig.path='figures/')
Load Libraries
library(parallel)
library(dplyr)
library(stringi)
library(tau)
library(ggplot2)
library(wordcloud)
Stop Words
Words not to be used in the word frequency count process;
Original found at http://en.wikipedia.org/wiki/Stop_words
This has been modified to suit our requirement
# my stop words
# since we plan to ignore all 1-letter or 2-letter words
# listed here are only words with 3-letter or more
gblStopWords <- c("aaa","aaaa","all","also","and","any","are","but","can","cant","cry","due","etc","few","for","get","had","has","hasnt","have","her","here","hers","herself","him","himself","his","how","inc","into","its","ltd","may","nor","not","now","off","once","one","only","onto","our","ours","out","over","own","part","per","put","see","seem","she","than","that","the","their","them","then","thence","there","these","they","this","those","though","thus","too","top","upon","very","via","was","were","what","when","which","while","who","whoever","whom","whose","why","will","with","within","without","would","yet","you","your","yours","the")
Note
A lot of thought went into whether to include or exclude the Stop Words in the Exploratory Analysis. After much deliberation, it was decided to exclude the Stop Words at the Exploratory Analysis stage, with the understanding that when the Bigrams, Trigrams or the predictive model are developed it will be mandatory to include these words.
Bad Words
Bad Words or Swear Words or Profanity not to be used in the word frequency count process;
Original found at http://en.wiktionary.org/wiki/Category:English_swear_words
This has been modified to suit our requirement
# my bad words
gblBadWords <- c("arse","ass","asshole","bastard","bitch","bloody","bollocks","child-fucker","cunt","damn","fuck","goddamn","godsdamn","hell","motherfucker","shit","shitass","whore")
Note
Again, like the Stop Words, it was decided to exclude the Bad Words at the Exploratory Analysis stage, with the understanding that when the Bigrams, Trigrams or the predictive model are developed it will be mandatory to include these words.
Frequency Category Helper Function
To get a better understanding of the frequency distribution of the words, we categorize the words into "Words With Frequency Less Than n" as given below:
# frequency category helper function
FreqCategory <- function(value) {
strCategory <- ifelse(value <=5, " 5",
ifelse(value <=10, " 10",
ifelse(value <=50, " 50",
ifelse(value <=100, " 100",
ifelse(value <=500, " 500",
ifelse(value <=1000, " 1,000",
ifelse(value <=5000, " 5,000",
ifelse(value <=10000, " 10,000",
ifelse(value <=50000, " 50,000",
ifelse(value <=100000, "100,000",
">100,000"))))))))))
strCategory
}
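For illustration, a quick call on a few hypothetical frequency values shows how raw frequencies map to category labels:
# example (hypothetical values): categorize a few sample frequencies
FreqCategory(c(3, 75, 2000, 250000))
# maps to the "5", "100", "5,000" and ">100,000" buckets respectively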
Good Bigram - Helper Function
GoodBigram … returns TRUE if the bigram is good, FALSE if not.
It will be TRUE if all of the following are true:
* the length of both words is > 2
* neither word is found in the Bad Words list
Note
After some work, it was observed that with Stop Words removed very few bigrams remain; hence it was decided not to remove Stop Words.
GoodBigram <- function(strInptWord) {
vctSpltWrds <- unlist(stri_split_fixed(strInptWord," "))
blnReturnVal <- TRUE
if (length(vctSpltWrds) != 2)
blnReturnVal <- FALSE
if ((blnReturnVal == TRUE) && (stri_length(vctSpltWrds[1])<=2 | stri_length(vctSpltWrds[2])<=2))
blnReturnVal <- FALSE
#if ((blnReturnVal == TRUE) && ((vctSpltWrds[1] %in% gblStopWords) || (vctSpltWrds[2] %in% gblStopWords)))
# blnReturnVal <- FALSE
if ((blnReturnVal == TRUE) && ((vctSpltWrds[1] %in% gblBadWords) || (vctSpltWrds[2] %in% gblBadWords)))
blnReturnVal <- FALSE
return(blnReturnVal)
}
Good Trigram - Helper Function
GoodTrigram … returns TRUE if the trigram is good, FALSE if not.
It will be TRUE if all of the following are true:
* the length of all three words is > 2
* none of the three words is found in the Bad Words list
Note
After some work, it was observed that with Stop Words removed very few trigrams remain; hence it was decided not to remove Stop Words.
GoodTrigram <- function(strInptWord) {
vctSpltWrds <- unlist(stri_split_fixed(strInptWord," "))
blnReturnVal <- TRUE
if (length(vctSpltWrds) != 3)
blnReturnVal <- FALSE
if ((blnReturnVal == TRUE) && (stri_length(vctSpltWrds[1])<=2 | stri_length(vctSpltWrds[2])<=2 | stri_length(vctSpltWrds[3])<=2))
blnReturnVal <- FALSE
#if ((blnReturnVal == TRUE) && ((vctSpltWrds[1] %in% gblStopWords) || (vctSpltWrds[2] %in% gblStopWords) || (vctSpltWrds[3] %in% gblStopWords)))
# blnReturnVal <- FALSE
if ((blnReturnVal == TRUE) && ((vctSpltWrds[1] %in% gblBadWords) || (vctSpltWrds[2] %in% gblBadWords) || (vctSpltWrds[3] %in% gblBadWords)))
blnReturnVal <- FALSE
return(blnReturnVal)
}
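For illustration, a couple of calls on hypothetical inputs show the intended behaviour of these filters:
# example calls (hypothetical inputs)
GoodBigram("for the")         # TRUE  - both words have more than 2 letters
GoodBigram("in the")          # FALSE - first word has only 2 letters
GoodTrigram("thanks for the") # TRUE
GoodTrigram("thanks for it")  # FALSE - last word has only 2 letters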
Search Text - Helper Function
* programmed for ngram2 & ngram3 only
* for ngram2 - returns the first word
* for ngram3 - returns the first two words
* for any other ngram - returns ""
SearchText <- function(strInptWord) {
vctSpltWrds <- unlist(stri_split_fixed(strInptWord," "))
strRtrnWord <- ifelse(length(vctSpltWrds)==2, vctSpltWrds[1],
ifelse(length(vctSpltWrds)==3, paste(vctSpltWrds[1], vctSpltWrds[2], sep=" "),
""))
strRtrnWord
}
Next Text - Helper Function
* programmed for ngram2 & ngram3 only
* for ngram2 - returns word 2
* for ngram3 - returns word 3
* for any other ngram - returns ""
NextText <- function(strInptWord) {
vctSpltWrds <- unlist(stri_split_fixed(strInptWord," "))
strRtrnWord <- ifelse(length(vctSpltWrds)==2, vctSpltWrds[2],
ifelse(length(vctSpltWrds)==3, vctSpltWrds[3],
""))
strRtrnWord
}
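Again, a couple of calls on hypothetical inputs illustrate what these helpers return:
# example calls (hypothetical inputs)
SearchText("for the")         # "for"
NextText("for the")           # "the"
SearchText("thanks for the")  # "thanks for"
NextText("thanks for the")    # "the"
SearchText("hello")           # ""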
Zip File
zipFileName <- "Coursera-SwiftKey.zip"
# list the files contained in the zip archive (without extracting)
unzip(zipFileName, list = TRUE )
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
The corpus of US English data is contained in:
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
These files have been extracted and copied to the en_US folder within the repository.
File Size
File size presented in Megabytes / MB:
lngBlogSize <- file.info("en_US/en_US.blogs.txt")$size / 1024^2
lngNewsSize <- file.info("en_US/en_US.news.txt")$size / 1024^2
lngTwtsSize <- file.info("en_US/en_US.twitter.txt")$size / 1024^2
en_US.blogs.txt : 200.4242 MB
en_US.news.txt : 196.2775 MB
en_US.twitter.txt : 159.3641 MB
Read Files
Read the three files using the readLines function.
We use UTF-8 encoding because non-English characters may be present.
vctBlogLins <- readLines("en_US/en_US.blogs.txt", encoding="UTF-8")
vctNewsLins <- readLines("en_US/en_US.news.txt", encoding="UTF-8")
## Warning: incomplete final line found on 'en_US/en_US.news.txt'
vctTwtsLins <- readLines("en_US/en_US.twitter.txt", encoding="UTF-8")
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
Line & Chars Stats
Line & character statistics for the above files are presented below:
stri_stats_general(vctBlogLins)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(vctNewsLins)
## Lines LinesNEmpty Chars CharsNWhite
## 77259 77259 15639408 13072698
stri_stats_general(vctTwtsLins)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
Chars Per Line Stats
Characters-per-line statistics for the above files are presented below:
lngBlogCharCnts <- nchar(vctBlogLins)
summary(lngBlogCharCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40800
lngNewsCharCnts <- nchar(vctNewsLins)
summary(lngNewsCharCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 111 186 202 270 5760
lngTwtsCharCnts <- nchar(vctTwtsLins)
summary(lngTwtsCharCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 37.0 64.0 68.7 100.0 140.0
Words Per Line Stats
Words-per-line statistics for the above files are presented below:
lngBlogWordCnts <- stri_count_words(vctBlogLins)
summary(lngBlogWordCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 9 28 42 60 6730
lngNewsWordCnts <- stri_count_words(vctNewsLins)
summary(lngNewsWordCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 19.0 32.0 34.6 46.0 1120.0
lngTwtsWordCnts <- stri_count_words(vctTwtsLins)
summary(lngTwtsWordCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 7.0 12.0 12.8 18.0 47.0
Word Count Stats
Word count per file statistics for the above files are presented below:
lngBlogWordTots <- sum(lngBlogWordCnts)
lngNewsWordTots <- sum(lngNewsWordCnts)
lngTwtsWordTots <- sum(lngTwtsWordCnts)
en_US.blogs.txt : 37,541,795
en_US.news.txt : 2,674,536
en_US.twitter.txt : 30,092,866
Before we process the above-mentioned files, we need to ensure that the data is clean. To obtain tidy data, we carry out the following steps (a toy illustration follows the list):
* remove special (non-ASCII) chars, i.e. foreign chars (note: if this step is not done, errors are encountered in subsequent steps)
* convert to lower case
* remove control chars
* remove numbers
* remove punctuation
* remove special chars
* remove extra white space
* remove words with length <= 2
* remove stop words, i.e. insignificant words like [ a the this that here there ] etc.
* remove sparse words, i.e. words with frequency <= 2
* remove custom words, i.e. a custom list of words to be removed (possible but not done)
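Purely as an illustration (the actual, full cleaning code for each file follows in the next sections), a toy string run through a simplified version of these steps looks roughly like this:
# toy illustration of the cleaning steps on a single hypothetical line
strToy <- "I went to the GYM at 10:30 - it was great!!"
strToy <- stri_trans_tolower(strToy)                          # lower case
strToy <- stri_replace_all_regex(strToy, "[[:digit:]]", "")   # drop numbers
strToy <- stri_replace_all_regex(strToy, "[[:punct:]]", "")   # drop punctuation
strToy <- stri_replace_all_regex(strToy, "\\s+", " ")         # collapse white space
vctToyWrds <- unlist(stri_split_fixed(strToy, " "))           # split into words
vctToyWrds <- vctToyWrds[stri_length(vctToyWrds) > 2]         # drop very short words
vctToyWrds <- vctToyWrds[!(vctToyWrds %in% gblStopWords)]     # drop stop words
vctToyWrds                                                    # "went" "gym" "great"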
Clean Blogs
Process to clean the blogs data as given below:
# convert to ascii
vctBlogLins <- stri_enc_toascii(vctBlogLins)
# to lower
vctBlogLins <- stri_trans_tolower(vctBlogLins)
# remove control chars
vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "[[:cntrl:]]", "")
# remove numbers
vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "[[:digit:]]", "")
# remove punctuation
##vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "[[:punct:]]", "")
# remove special chars
vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "[~!@#$%&-_=:;<>,`\"]", "")
# remove remaining special chars (regex metacharacters)
vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "[\\$\\*\\+\\.\\?\\[\\]\\^\\{\\}\\|\\(\\\\]", "")
# remove extra white spaces
vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "\\s+", " ")
# string split to list
vctBlogWrds <- stri_split_fixed(vctBlogLins," ")
# list to words
vctBlogWrds <- unlist(vctBlogWrds)
# trim white spaces
vctBlogWrds <- stri_trim_both(vctBlogWrds, pattern = "\\P{Wspace}")
# frequency table
dfBlogFreq <- as.data.frame(table(vctBlogWrds), stringsAsFactors=F)
names(dfBlogFreq) <- c("Word","Freq")
# remove all words with len <= 2
dfBlogFreq <- filter(dfBlogFreq, stri_length(Word)>2)
# remove all stop words ...
dfBlogFreq <- filter(dfBlogFreq, !(Word %in% gblStopWords))
# remove all bad words ...
dfBlogFreq <- filter(dfBlogFreq, !(Word %in% gblBadWords))
# remove sparse words ... words with Freq <= 2
dfBlogFreq <- filter(dfBlogFreq, Freq>2)
# sort
dfBlogFreq <- arrange(dfBlogFreq, desc(Freq))
# show top 30
head(dfBlogFreq,30)
## Word Freq
## 1 from 148107
## 2 about 115050
## 3 just 100015
## 4 like 98257
## 5 more 92425
## 6 some 88703
## 7 time 88143
## 8 been 77898
## 9 know 59932
## 10 people 59219
## 11 because 57623
## 12 dont 56261
## 13 other 55691
## 14 new 54341
## 15 even 51747
## 16 first 50783
## 17 well 50755
## 18 make 50561
## 19 day 50485
## 20 back 50421
## 21 really 49771
## 22 much 48805
## 23 good 48599
## 24 think 47545
## 25 way 46933
## 26 after 46382
## 27 little 45614
## 28 could 44951
## 29 love 44736
## 30 two 40583
# add FrequencyCategory column
dfBlogFreq <- mutate(dfBlogFreq, Fcat=FreqCategory(dfBlogFreq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfBlogRfrq <- dfBlogFreq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfBlogRfrq <- dfBlogFreq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfBlogRfrq$Fcat <- factor(dfBlogRfrq$Fcat, levels=dfBlogRfrq$Fcat, ordered=T)
# head
head(dfBlogRfrq,11)
## Source: local data frame [11 x 2]
##
## Fcat Rfrq
## 1 5 42825
## 2 10 22046
## 3 50 30676
## 4 100 7822
## 5 500 10275
## 6 1,000 2217
## 7 5,000 2375
## 8 10,000 352
## 9 50,000 246
## 10 100,000 17
## 11 >100,000 3
Clean News
Process to clean the news data as given below:
# convert to ascii
vctNewsLins <- stri_enc_toascii(vctNewsLins)
# to lower
vctNewsLins <- stri_trans_tolower(vctNewsLins)
# remove control chars
vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "[[:cntrl:]]", "")
# remove numbers
vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "[[:digit:]]", "")
# remove punctuation
##vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "[[:punct:]]", "")
# remove special chars
vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "[~!@#$%&-_=:;<>,`\"]", "")
# remove remaining special chars (regex metacharacters)
vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "[\\$\\*\\+\\.\\?\\[\\]\\^\\{\\}\\|\\(\\\\]", "")
# remove extra white spaces
vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "\\s+", " ")
# string split to list
vctNewsWrds <- stri_split_fixed(vctNewsLins," ")
# list to words
vctNewsWrds <- unlist(vctNewsWrds)
# trim white spaces
vctNewsWrds <- stri_trim_both(vctNewsWrds, pattern = "\\P{Wspace}")
# frequency table
dfNewsFreq <- as.data.frame(table(vctNewsWrds), stringsAsFactors=F)
names(dfNewsFreq) <- c("Word","Freq")
# remove all words with len <= 2
dfNewsFreq <- filter(dfNewsFreq, stri_length(Word)>2)
# remove all stop words ...
dfNewsFreq <- filter(dfNewsFreq, !(Word %in% gblStopWords))
# remove all bad words ...
dfNewsFreq <- filter(dfNewsFreq, !(Word %in% gblBadWords))
# remove sparse words ... words with Freq <= 2
dfNewsFreq <- filter(dfNewsFreq, Freq>2)
# sort
dfNewsFreq <- arrange(dfNewsFreq, desc(Freq))
# show top 30
head(dfNewsFreq,30)
## Word Freq
## 1 said 19167
## 2 from 11648
## 3 about 6932
## 4 more 6729
## 5 new 5337
## 6 been 5162
## 7 after 4728
## 8 year 4470
## 9 two 4438
## 10 first 4150
## 11 just 4144
## 12 last 4027
## 13 time 3992
## 14 some 3988
## 15 years 3987
## 16 other 3930
## 17 state 3810
## 18 like 3780
## 19 people 3667
## 20 could 3122
## 21 because 3019
## 22 city 2828
## 23 most 2729
## 24 percent 2629
## 25 three 2623
## 26 school 2611
## 27 before 2571
## 28 back 2538
## 29 make 2499
## 30 says 2492
# add FrequencyCategory column
dfNewsFreq <- mutate(dfNewsFreq, Fcat=FreqCategory(dfNewsFreq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfNewsRfrq <- dfNewsFreq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfNewsRfrq <- dfNewsFreq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfNewsRfrq$Fcat <- factor(dfNewsRfrq$Fcat, levels=dfNewsRfrq$Fcat, ordered=T)
# head
head(dfNewsRfrq,11)
## Source: local data frame [9 x 2]
##
## Fcat Rfrq
## 1 5 12716
## 2 10 6872
## 3 50 9097
## 4 100 1984
## 5 500 2224
## 6 1,000 300
## 7 5,000 159
## 8 10,000 4
## 9 50,000 2
Clean Tweets
Process to clean the tweets data as given below:
# convert to ascii
vctTwtsLins <- stri_enc_toascii(vctTwtsLins)
# to lower
vctTwtsLins <- stri_trans_tolower(vctTwtsLins)
# remove control chars
vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "[[:cntrl:]]", "")
# remove numbers
vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "[[:digit:]]", "")
# remove punctuation
##vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "[[:punct:]]", "")
# remove special chars
vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "[~!@#$%&-_=:;<>,`\"]", "")
# remove remaining special chars (regex metacharacters)
vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "[\\$\\*\\+\\.\\?\\[\\]\\^\\{\\}\\|\\(\\\\]", "")
# remove extra white spaces
vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "\\s+", " ")
# string split to list
vctTwtsWrds <- stri_split_fixed(vctTwtsLins," ")
# list to words
vctTwtsWrds <- unlist(vctTwtsWrds)
# trim white spaces
vctTwtsWrds <- stri_trim_both(vctTwtsWrds, pattern = "\\P{Wspace}")
# frequency table
dfTwtsFreq <- as.data.frame(table(vctTwtsWrds), stringsAsFactors=F)
names(dfTwtsFreq) <- c("Word","Freq")
# remove all words with len <= 2
dfTwtsFreq <- filter(dfTwtsFreq, stri_length(Word)>2)
# remove all stop words ...
dfTwtsFreq <- filter(dfTwtsFreq, !(Word %in% gblStopWords))
# remove all bad words ...
dfTwtsFreq <- filter(dfTwtsFreq, !(Word %in% gblBadWords))
# remove sparse words ... words with Freq <= 2
dfTwtsFreq <- filter(dfTwtsFreq, Freq>2)
# sort
dfTwtsFreq <- arrange(dfTwtsFreq, desc(Freq))
# show top 30
head(dfTwtsFreq,30)
## Word Freq
## 1 just 149619
## 2 like 121325
## 3 love 105589
## 4 good 99672
## 5 about 90952
## 6 dont 90108
## 7 day 90061
## 8 thanks 88664
## 9 from 83691
## 10 know 79269
## 11 great 75382
## 12 time 74697
## 13 today 71233
## 14 new 69381
## 15 lol 66709
## 16 more 62522
## 17 some 61568
## 18 back 57380
## 19 got 55696
## 20 going 55563
## 21 think 53702
## 22 people 51496
## 23 need 50652
## 24 happy 48545
## 25 want 47869
## 26 follow 47326
## 27 make 47176
## 28 well 46182
## 29 right 45533
## 30 really 45254
# add FrequencyCategory column
dfTwtsFreq <- mutate(dfTwtsFreq, Fcat=FreqCategory(dfTwtsFreq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfTwtsRfrq <- dfTwtsFreq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfTwtsRfrq <- dfTwtsFreq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfTwtsRfrq$Fcat <- factor(dfTwtsRfrq$Fcat, levels=dfTwtsRfrq$Fcat, ordered=T)
# head
head(dfTwtsRfrq,11)
## Source: local data frame [11 x 2]
##
## Fcat Rfrq
## 1 5 42459
## 2 10 20882
## 3 50 26292
## 4 100 5879
## 5 500 7670
## 6 1,000 1597
## 7 5,000 1774
## 8 10,000 249
## 9 50,000 234
## 10 100,000 20
## 11 >100,000 3
As the last step of data processing, we merge all three data sets and perform some analysis on the merged data as well.
Merge Data
Process to merge the data as given below:
# merge all three word vectors
vctTotsWrds = c(vctBlogWrds, vctNewsWrds, vctTwtsWrds)
# frequency table
dfTotsFreq <- as.data.frame(table(vctTotsWrds), stringsAsFactors=F)
names(dfTotsFreq) <- c("Word","Freq")
# remove all words with len <= 2
dfTotsFreq <- filter(dfTotsFreq, stri_length(Word)>2)
# remove all stop words ...
# dfTotsFreq <- filter(dfTotsFreq, !(Word %in% gblStopWords))
# remove all bad words ...
dfTotsFreq <- filter(dfTotsFreq, !(Word %in% gblBadWords))
# remove sparse words ... words with Freq <= 2
dfTotsFreq <- filter(dfTotsFreq, Freq>2)
# sort
dfTotsFreq <- arrange(dfTotsFreq, desc(Freq))
# show top 30
head(dfTotsFreq,30)
## Word Freq
## 1 the 2941467
## 2 and 1588012
## 3 you 847975
## 4 for 774514
## 5 that 718765
## 6 with 478926
## 7 this 430159
## 8 was 412655
## 9 have 397589
## 10 are 362470
## 11 but 338994
## 12 not 304657
## 13 your 272653
## 14 all 269013
## 15 just 253778
## 16 from 243446
## 17 its 242710
## 18 out 228426
## 19 what 224450
## 20 like 223362
## 21 they 216587
## 22 will 215326
## 23 about 212934
## 24 one 212078
## 25 can 191628
## 26 when 191544
## 27 get 186067
## 28 time 166832
## 29 more 161676
## 30 there 159332
# add FrequencyCategory column
dfTotsFreq <- mutate(dfTotsFreq, Fcat=FreqCategory(dfTotsFreq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfTotsRfrq <- dfTotsFreq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfTotsRfrq <- dfTotsFreq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfTotsRfrq$Fcat <- factor(dfTotsRfrq$Fcat, levels=dfTotsRfrq$Fcat, ordered=T)
# head
head(dfTotsRfrq,11)
## Source: local data frame [11 x 2]
##
## Fcat Rfrq
## 1 5 71797
## 2 10 34435
## 3 50 43397
## 4 100 10468
## 5 500 13745
## 6 1,000 3154
## 7 5,000 3657
## 8 10,000 633
## 9 50,000 514
## 10 100,000 77
## 11 >100,000 57
Process to save NGrams-1
dfNgr1Csvx <- data.frame(dfTotsFreq$Word, dfTotsFreq$Freq)
colnames(dfNgr1Csvx) <- c("Word","Freq")
write.csv(dfNgr1Csvx, file="N1-SwiftKey.csv", row.names=F)
message("NGrams-1 Frequencies Saved As: N1-SwiftKey.csv \r", appendLF=FALSE)
## NGrams-1 Frequencies Saved As: N1-SwiftKey.csv
flush.console()
head(dfNgr1Csvx)
## Word Freq
## 1 the 2941467
## 2 and 1588012
## 3 you 847975
## 4 for 774514
## 5 that 718765
## 6 with 478926
An NGram is a contiguous sequence of n items from a given sequence of text or speech. In our case we will generate NGrams from words. NGrams are typically collected from a text or speech corpus.
An NGram of size 1 (a single word) is referred to as a "unigram";
a size 2 NGram (two words) is called a "bigram"; [e.g.: good boy; good work; good deed]
a size 3 NGram (three words) is a "trigram"; [e.g.: check this name; what we miss]
larger sizes are sometimes referred to by the value of n, e.g. "four-gram", "five-gram", and so on.
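As a toy illustration (on a hypothetical sentence, not on the corpus), the textcnt function from the tau package, which is used below, counts such word NGrams directly:
# toy illustration: count word bigrams in a single hypothetical sentence
vctToySent <- "thanks for the follow and thanks for the support"
vctToyNgr2 <- textcnt(vctToySent, split=" ", n=2L, method="string")
sort(vctToyNgr2, decreasing=TRUE)
# "thanks for" and "for the" occur twice each; the remaining bigrams occur once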
Note
When an attempt was made to generate the bigrams & trigrams on the entire corpus, the process took a very long time and the R session aborted, possibly due to an out-of-memory error. Since this is an exploratory analysis, we will work with a small subset of 1% of the corpus. This is enough to get a feel for the NGrams-2 & NGrams-3.
Generate NGrams-2
Process to generate NGrams-2 as given below:
# merge all three line vectors
# these are all cleaned up & ready for use ... so clean up again is not required
vctTotsData = c(vctBlogLins, vctNewsLins, vctTwtsLins)
# sample 1% of the merged lines
set.seed (77777)
lngSmplSize <- length(vctTotsData)*0.01
vctTotsData <- sample(vctTotsData, lngSmplSize)
# ngram 2 split
vctNrg2Data = textcnt(vctTotsData, split=" ", n=2L, method="string")
vctNrg2Data = vctNrg2Data[order(vctNrg2Data, decreasing=T)]
# frequency table
dfNgr2Freq <- data.frame(Word=names(vctNrg2Data), Freq=vctNrg2Data, row.names=NULL, stringsAsFactors=F)
# remove sparse words ... words with Freq <= 2
dfNgr2Freq <- filter(dfNgr2Freq, Freq>2)
# filter dfNgr2Freq
vctGoodNgr2 <- vapply(dfNgr2Freq$Word, GoodBigram, USE.NAMES=F, logical(1))
dfNgr2Freq = filter(dfNgr2Freq, vctGoodNgr2)
# head
head(dfNgr2Freq,30)
## Word Freq
## 1 for the 1354
## 2 and the 757
## 3 with the 622
## 4 from the 517
## 5 thanks for 473
## 6 you can 433
## 7 all the 415
## 8 you are 377
## 9 thank you 376
## 10 that the 368
## 11 you have 353
## 12 the first 349
## 13 the same 347
## 14 have been 332
## 15 about the 306
## 16 the best 305
## 17 the world 278
## 18 they are 268
## 19 has been 254
## 20 are you 244
## 21 there are 235
## 22 into the 233
## 23 right now 233
## 24 you know 233
## 25 when you 222
## 26 and then 216
## 27 the most 216
## 28 the way 216
## 29 the day 203
## 30 the new 202
# add FrequencyCategory column
dfNgr2Freq <- mutate(dfNgr2Freq, Fcat=FreqCategory(dfNgr2Freq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfNgr2Rfrq <- dfNgr2Freq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfNgr2Rfrq <- dfNgr2Freq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfNgr2Rfrq$Fcat <- factor(dfNgr2Rfrq$Fcat, levels=dfNgr2Rfrq$Fcat, ordered=T)
# head
head(dfNgr2Rfrq,11)
## Source: local data frame [7 x 2]
##
## Fcat Rfrq
## 1 5 11890
## 2 10 3790
## 3 50 2566
## 4 100 214
## 5 500 103
## 6 1,000 3
## 7 5,000 1
Process to save NGrams-2
dfNgr2Csvx <- data.frame(dfNgr2Freq$Word)
colnames(dfNgr2Csvx) <- c("Term")
dfNgr2Csvx <- mutate(dfNgr2Csvx, Search=mapply(SearchText, dfNgr2Csvx$Term), Next=mapply(NextText, dfNgr2Csvx$Term), Freq=dfNgr2Freq$Freq)
write.csv(dfNgr2Csvx, file="N2-SwiftKey.csv", row.names=F)
message("NGrams-2 Frequencies Saved As: N2-SwiftKey.csv \r", appendLF=FALSE)
## NGrams-2 Frequencies Saved As: N2-SwiftKey.csv
flush.console()
head(dfNgr2Csvx)
## Term Search Next Freq
## 1 for the for the 1354
## 2 and the and the 757
## 3 with the with the 622
## 4 from the from the 517
## 5 thanks for thanks for 473
## 6 you can you can 433
Generate NGrams-3
Process to generate NGrams-3 as given below:
# merge all three line vectors
vctTotsData = c(vctBlogLins, vctNewsLins, vctTwtsLins)
# sample 1% of the merged lines
set.seed (77777)
lngSmplSize <- length(vctTotsData)*0.01
vctTotsData <- sample(vctTotsData, lngSmplSize)
# ngram 3 split
vctNrg3Data = textcnt(vctTotsData, split=" ", n=3L, method="string")
vctNrg3Data = vctNrg3Data[order(vctNrg3Data, decreasing=T)]
# frequency table
dfNgr3Freq <- data.frame(Word=names(vctNrg3Data), Freq=vctNrg3Data, row.names=NULL, stringsAsFactors=F)
# remove sparse words ... words with Freq <= 2
dfNgr3Freq <- filter(dfNgr3Freq, Freq>2)
# filter dfNgr3Freq
LogiGoodTgrm <- vapply(dfNgr3Freq$Word, GoodTrigram, USE.NAMES=F, logical(1))
dfNgr3Freq = filter(dfNgr3Freq, LogiGoodTgrm)
# head
head(dfNgr3Freq,30)
## Word Freq
## 1 thanks for the 256
## 2 thank you for 97
## 3 for the follow 82
## 4 the fact that 80
## 5 the first time 69
## 6 thanks for following 52
## 7 for the first 46
## 8 the same time 43
## 9 what are you 42
## 10 all the time 36
## 11 cant wait for 33
## 12 the only one 33
## 13 you for the 33
## 14 for all the 31
## 15 happy mothers day 30
## 16 would have been 30
## 17 did you know 29
## 18 for the next 28
## 19 you know what 28
## 20 all over the 27
## 21 you can see 27
## 22 check out the 26
## 23 for the rest 26
## 24 and you can 23
## 25 that you can 23
## 26 the united states 23
## 27 all the way 22
## 28 away from the 22
## 29 bmw service center 22
## 30 more and more 22
# add FrequencyCategory column
dfNgr3Freq <- mutate(dfNgr3Freq, Fcat=FreqCategory(dfNgr3Freq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfNgr3Rfrq <- dfNgr3Freq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfNgr3Rfrq <- dfNgr3Freq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfNgr3Rfrq$Fcat <- factor(dfNgr3Rfrq$Fcat, levels=dfNgr3Rfrq$Fcat, ordered=T)
# head
head(dfNgr3Rfrq,11)
## Source: local data frame [5 x 2]
##
## Fcat Rfrq
## 1 5 2664
## 2 10 532
## 3 50 178
## 4 100 5
## 5 500 1
Process to save NGrams-3
dfNgr3Csvx <- data.frame(dfNgr3Freq$Word)
colnames(dfNgr3Csvx) <- c("Term")
dfNgr3Csvx <- mutate(dfNgr3Csvx, Search=mapply(SearchText, dfNgr3Csvx$Term), Next=mapply(NextText, dfNgr3Csvx$Term), Freq=dfNgr3Freq$Freq)
write.csv(dfNgr3Csvx, file="N3-SwiftKey.csv", row.names=F)
message("NGrams-3 Frequencies Saved As: N3-SwiftKey.csv \r", appendLF=FALSE)
## NGrams-3 Frequencies Saved As: N3-SwiftKey.csv
flush.console()
head(dfNgr3Csvx)
## Term Search Next Freq
## 1 thanks for the thanks for the 256
## 2 thank you for thank you for 97
## 3 for the follow for the follow 82
## 4 the fact that the fact that 80
## 5 the first time the first time 69
## 6 thanks for following thanks for following 52
We now show the Frequency Table (Top 30 Words) & Word Cloud (Top 100 Words) for each of the data files.
Blogs - Word Frequency - Top 30 Words
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfBlogFreq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Words") +
ggtitle("Blogs - Word Frequency - Top 30 Words") +
theme(plot.title=element_text(size=rel(1.5), colour="blue")) +
coord_flip()
Blogs - Frequency of Word Frequency
# plot frequency of word frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfBlogRfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfBlogRfrq$Fcat))) +
xlab("Words With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.5), colour="blue")) +
ggtitle("Blogs - Frequency Of Word Frequency")
Blogs - Word Cloud - Top 100 Words
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfBlogFreq$Word[1:100], dfBlogFreq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
News - Word Frequency - Top 30 Words
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfNewsFreq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Words") +
ggtitle("News - Word Frequency - Top 30 Words") +
theme(plot.title=element_text(size=rel(1.5), colour="blue")) +
coord_flip()
News - Frequency of Word Frequency
# plot frequency of word frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfNewsRfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfNewsRfrq$Fcat))) +
xlab("Words With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.5), colour="blue")) +
ggtitle("News - Frequency Of Word Frequency")
News - Word Cloud - Top 100 Words
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfNewsFreq$Word[1:100], dfNewsFreq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
Tweets - Word Frequency - Top 30 Words
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfTwtsFreq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Words") +
ggtitle("Tweets - Word Frequency - Top 30 Words") +
theme(plot.title=element_text(size=rel(1.5), colour="blue")) +
coord_flip()
Tweets - Frequency of Word Frequency
# plot frequency of word frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfTwtsRfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfTwtsRfrq$Fcat))) +
xlab("Words With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.5), colour="blue")) +
ggtitle("Tweets - Frequency Of Word Frequency")
Tweets - Word Cloud - Top 100 Words
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfTwtsFreq$Word[1:100], dfTwtsFreq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
Merged Data - Word Frequency - Top 30 Words
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfTotsFreq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Words") +
ggtitle("Merged Data - Word Frequency - Top 30 Words") +
theme(plot.title=element_text(size=rel(1.5), colour="blue")) +
coord_flip()
Merged Data - Frequency of Word Frequency
# plot frequency of word frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfTotsRfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfTotsRfrq$Fcat))) +
xlab("Words With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.5), colour="blue")) +
ggtitle("Merged Data - Frequency Of Word Frequency")
Merged Data - Word Cloud - Top 100 Words
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfTotsFreq$Word[1:100], dfTotsFreq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
Merged Data 1% Sample - Bigrams Frequency - Top 30 Bigrams
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfNgr2Freq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Bigrams") +
ggtitle("Merged Data 1% Sample - Bigram Frequency - Top 30 Bigrams") +
theme(plot.title=element_text(size=rel(1.25), colour="blue")) +
coord_flip()
Merged Data 1% Sample - Frequency of Bigram Frequency
# plot frequency of bigram frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfNgr2Rfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfNgr2Rfrq$Fcat))) +
xlab("Bigrams With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.25), colour="blue")) +
ggtitle("Merged Data 1% Sample - Frequency Of Bigram Frequency")
Merged Data 1% Sample - Word Cloud - Top 100 Bigrams
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfNgr2Freq$Word[1:100], dfNgr2Freq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
Merged Data 1% Sample - Trigrams Frequency - Top 30 Trigrams
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfNgr3Freq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Trigrams") +
ggtitle("Merged Data 1% Sample - Trigram Frequency - Top 30 Trigrams") +
theme(plot.title=element_text(size=rel(1.25), colour="blue")) +
coord_flip()
Merged Data 1% Sample - Frequency of Trigram Frequency
# plot frequency of trigram frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfNgr3Rfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfNgr3Rfrq$Fcat))) +
xlab("Trigrams With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.25), colour="blue")) +
ggtitle("Merged Data 1% Sample - Frequency Of Trigram Frequency")
Merged Data 1% Sample - Word Cloud - Top 100 Trigrams
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfNgr3Freq$Word[1:100], dfNgr3Freq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
Activities Done Successfully
For each of the above files, we collected & reviewed the following information:
1. file size
2. line count & non-empty line count
3. character count
4. nonwhite character count
5. word count per line - summary
6. word count per file
7. word count frequency per file
8. frequency of word count frequency per file
Then we merged the data of all three files and analysed the word count frequency for the Merged Data.
Then, for each file & also for the Merged Data, there were visualizations for:
* Word Frequency - Top 30 Words
* Frequency Of Word Frequency
* Word Cloud - Top 100 Words
Subsequently, for the Merged Data (1% sample), Bigrams & Trigrams were generated.
Stats related to Bigrams & Trigrams were generated as follows:
1. Bigrams / Trigrams count frequency
2. frequency of Bigrams / Trigrams count frequency
Again, for the Bigrams & Trigrams there were visualizations as follows:
* Bigram / Trigram Frequency - Top 30
* Frequency Of Bigram / Trigram Frequency
* Word Cloud - Top 100 Bigrams / Trigrams
Action Plan
Having gained an understanding of the corpus through this exploratory analysis, the next steps would be as follows:
1. Build the Bigram & Trigram Frequency Matrix for the whole corpus.
2. Create a language model using the whole corpus.
3. Build a predictive algorithm / model (a minimal sketch of the idea is given after the note below).
4. Create a ShinyApp data product.
5. Create a presentation for Non-Data-Scientist users.
Note: The bigram & trigram analysis above was done only on a 1% sample of the corpus.
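As an indication of the direction for step 3 above, a minimal and deliberately naive sketch of a frequency-based backoff lookup over the saved NGram tables could look like the following; the eventual model will need smoothing, a larger sample and proper handling of unseen phrases:
# naive sketch of a backoff next-word lookup using the saved NGram tables
# (assumes N2-SwiftKey.csv and N3-SwiftKey.csv exist in the working directory)
dfNgr2Tbl <- read.csv("N2-SwiftKey.csv", stringsAsFactors = FALSE)
dfNgr3Tbl <- read.csv("N3-SwiftKey.csv", stringsAsFactors = FALSE)
PredictNext <- function(strPhrase, n = 3) {
    vctWrds <- unlist(stri_split_fixed(stri_trans_tolower(strPhrase), " "))
    # try the trigram table first, using the last two words typed
    if (length(vctWrds) >= 2) {
        strKey <- paste(tail(vctWrds, 2), collapse = " ")
        dfHit <- dfNgr3Tbl[dfNgr3Tbl$Search == strKey, ]
        if (nrow(dfHit) > 0)
            return(head(dfHit$Next[order(dfHit$Freq, decreasing = TRUE)], n))
    }
    # back off to the bigram table, using only the last word typed
    strKey <- tail(vctWrds, 1)
    dfHit <- dfNgr2Tbl[dfNgr2Tbl$Search == strKey, ]
    head(dfHit$Next[order(dfHit$Freq, decreasing = TRUE)], n)
}
PredictNext("thanks for")   # e.g. "the" and "following" for the 1% sample above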
< End Of Report >