Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a difficult task. SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is its predictive text model.
When someone types:
I went to the
the keyboard presents three options for what the next word might be; for example, gym, store or restaurant. In this capstone project, we will work on understanding and building predictive text models like those used by SwiftKey.
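Conceptually (this is only an illustration, not SwiftKey's actual method), such a model stores how often word sequences occur in a corpus and, given the last word or two typed, suggests the most frequent continuations. A minimal R sketch using a hypothetical toy trigram table:
# toy illustration (hypothetical counts): given the last two words typed,
# look up the most frequent observed next words in a small trigram table
dfToyNgr3 <- data.frame(Search = c("to the", "to the", "to the", "to the"),
                        Next   = c("gym", "store", "restaurant", "beach"),
                        Freq   = c(25, 18, 12, 7),
                        stringsAsFactors = FALSE)
SuggestNext <- function(strTyped, dfNgrams, n = 3) {
    dfMatch <- dfNgrams[dfNgrams$Search == strTyped, ]              # rows matching the typed context
    dfMatch <- dfMatch[order(dfMatch$Freq, decreasing = TRUE), ]    # most frequent first
    head(dfMatch$Next, n)                                           # top n candidate next words
}
SuggestNext("to the", dfToyNgr3)   # "gym" "store" "restaurant"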
Data Source
Large databases of text in a target language are commonly used when generating language models. The data comes from a dataset called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language filtered but may still contain some foreign text. In this project, we will use the English database.
This report covers the exploratory analysis for the said project.
Tasks To Accomplish
1. Exploratory Analysis
Perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the files.
2. Frequency Analysis
Understand the frequencies of words and word pairs; build figures and tables to understand the variation in these frequencies across the data.
Data
As per the requirement of the project, the data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The file Coursera-SwiftKey.zip is available in a zip format.
Corpus
We analyse three files of US English data available:
- blogs
- news
- twitter
Note
We find that all three files (blogs, news & twitter) are structured similarly; each data set appears to contain one blog post, one news item or one tweet per line.
Whereas a Twitter record is limited to 140 characters per message, there appears to be no limit on the length of a line in the blog & news files.
Activities To Be Done
For each of the above data sets, we will get the following information:
1. file size
2. line count & non-empty line count
3. character count
4. nonwhite character count
5. word count per line - summary
6. word count per file
7. word count frequency per file
8. frequency of word count frequency per file
Then we will merge the data and analyse the word count frequency for the Merged Data.
Then, for each data set & also for the Merged Data, there will be visualizations for:
* Word Frequency - Top 30 Words
* Frequency Of Word Frequency
* Word Cloud - Top 100 Words
Subsequently for the MergedData, Bigrams & Trigrams will be generated.
Stats related to the Bigrams & Trigrams will be generated as follows:
1. Bigrams / Trigrams count frequency
2. frequency of Bigrams / Trigrams count frequency
Again, for the Bigrams & Trigrams there will be visualizations for:
* Bigram / Trigram Frequency - Top 30
* Frequency Of Bigram / Trigram Frequency
* Word Cloud - Top 100 Bigrams / Trigrams
Pre-Requisites
Before you start execution of this Rmd file:
1. Please set the working directory to your repository.
2. Please download Coursera-SwiftKey.zip and copy it to your repository (an optional download/unzip script is sketched below).
3. From the said zip file, unzip & copy the folder en_US into your repository.
setwd(<your_assignment_repository>)
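If you prefer to script steps 2 & 3, a minimal sketch (assuming an internet connection and write access to the working directory) would be:
# optional: download the data set if it is not already present
zipFileName <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFileName)) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  destfile = zipFileName, mode = "wb")
}
# extract only the US English files into the en_US folder of the repository
if (!file.exists("en_US")) {
    unzip(zipFileName,
          files = c("final/en_US/en_US.blogs.txt",
                    "final/en_US/en_US.news.txt",
                    "final/en_US/en_US.twitter.txt"),
          junkpaths = TRUE, exdir = "en_US")
}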
knitr Global Options
knitr::opts_chunk$set(tidy=FALSE, fig.path='figures/')
Load Libraries
library(parallel)
library(dplyr)
library(stringi)
library(tau)
library(ggplot2)
library(wordcloud)
Stop Words
Words not to be used in the word frequency count process;
Original found at http://en.wikipedia.org/wiki/Stop_words
This has been modified to suit our requirement
# my stop words
# since we plan to ignore all 1-letter or 2-letter words
# listed here are only words with 3-letter or more
gblStopWords <- c("aaa","aaaa","all","also","and","any","are","but","can","cant","cry","due","etc","few","for","get","had","has","hasnt","have","her","here","hers","herself","him","himself","his","how","inc","into","its","ltd","may","nor","not","now","off","once","one","only","onto","our","ours","out","over","own","part","per","put","see","seem","she","than","that","the","their","them","then","thence","there","these","they","this","those","though","thus","too","top","upon","very","via","was","were","what","when","which","while","who","whoever","whom","whose","why","will","with","within","without","would","yet","you","your","yours","the")
Note
A lot of thought went into whether to include or exclude the Stop Words in the Exploratory Analysis. After much deliberation, it was decided to exclude the Stop Words at the Exploratory Analysis stage, with the understanding that when the Bigrams, Trigrams or the predictive model are developed it will be mandatory to include these words.
Bad Words
Bad Words or Swear Words or Profanity not to be used in the word frequency count process;
Original found at http://en.wiktionary.org/wiki/Category:English_swear_words
This has been modified to suit our requirement
# my bad words
gblBadWords <- c("arse","ass","asshole","bastard","bitch","bloody","bollocks","child-fucker","cunt","damn","fuck","goddamn","godsdamn","hell","motherfucker","shit","shitass","whore")
Note
Again, like the Stop Words, it was decided to exclude the Bad Words at the Exploratory Analysis stage, with the understanding that when the Bigrams, Trigrams or the predictive model are developed it will be mandatory to include these words.
Frequency Category Helper Function
To get a better understanding of the frequency distribution of the words, we categorize the words into "Words With Frequency Less Than n" as given below:
# frequency category helper function
FreqCategory <- function(value) {
strCategory <- ifelse(value <=5, " 5",
ifelse(value <=10, " 10",
ifelse(value <=50, " 50",
ifelse(value <=100, " 100",
ifelse(value <=500, " 500",
ifelse(value <=1000, " 1,000",
ifelse(value <=5000, " 5,000",
ifelse(value <=10000, " 10,000",
ifelse(value <=50000, " 50,000",
ifelse(value <=100000, "100,000",
">100,000"))))))))))
strCategory
}
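For illustration, a quick call on a few hypothetical frequency values shows how raw frequencies map to category labels:
# example (hypothetical values): categorize a few sample frequencies
FreqCategory(c(3, 75, 2000, 250000))
# maps to the "5", "100", "5,000" and ">100,000" buckets respectively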
Good Bigram - Helper Function
GoodBigram … returns TRUE if the bigram is good, FALSE if not.
It will be TRUE if all of the following are true:
* the length of both words is > 2
* neither word is found in the Bad Words list
Note
After some work, it was observed that with Stop Words removed very few bigrams remain; hence it was decided not to remove Stop Words.
GoodBigram <- function(strInptWord) {
vctSpltWrds <- unlist(stri_split_fixed(strInptWord," "))
blnReturnVal <- TRUE
if (length(vctSpltWrds) != 2)
blnReturnVal <- FALSE
if ((blnReturnVal == TRUE) && (stri_length(vctSpltWrds[1])<=2 | stri_length(vctSpltWrds[2])<=2))
blnReturnVal <- FALSE
#if ((blnReturnVal == TRUE) && ((vctSpltWrds[1] %in% gblStopWords) || (vctSpltWrds[2] %in% gblStopWords)))
# blnReturnVal <- FALSE
if ((blnReturnVal == TRUE) && ((vctSpltWrds[1] %in% gblBadWords) || (vctSpltWrds[2] %in% gblBadWords)))
blnReturnVal <- FALSE
return(blnReturnVal)
}
Good Trigram - Helper Function
GoodTrigram … returns TRUE if the trigram is good, FALSE if not.
It will be TRUE if all of the following are true:
* the length of all three words is > 2
* none of the three words is found in the Bad Words list
Note
After some work, it was observed that with Stop Words removed very few trigrams remain; hence it was decided not to remove Stop Words.
GoodTrigram <- function(strInptWord) {
vctSpltWrds <- unlist(stri_split_fixed(strInptWord," "))
blnReturnVal <- TRUE
if (length(vctSpltWrds) != 3)
blnReturnVal <- FALSE
if ((blnReturnVal == TRUE) && (stri_length(vctSpltWrds[1])<=2 | stri_length(vctSpltWrds[2])<=2 | stri_length(vctSpltWrds[3])<=2))
blnReturnVal <- FALSE
#if ((blnReturnVal == TRUE) && ((vctSpltWrds[1] %in% gblStopWords) || (vctSpltWrds[2] %in% gblStopWords) || (vctSpltWrds[3] %in% gblStopWords)))
# blnReturnVal <- FALSE
if ((blnReturnVal == TRUE) && ((vctSpltWrds[1] %in% gblBadWords) || (vctSpltWrds[2] %in% gblBadWords) || (vctSpltWrds[3] %in% gblBadWords)))
blnReturnVal <- FALSE
return(blnReturnVal)
}
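For illustration, a couple of calls on hypothetical inputs show the intended behaviour of these filters:
# example calls (hypothetical inputs)
GoodBigram("for the")         # TRUE  - both words have more than 2 letters
GoodBigram("in the")          # FALSE - first word has only 2 letters
GoodTrigram("thanks for the") # TRUE
GoodTrigram("thanks for it")  # FALSE - last word has only 2 letters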
Search Text - Helper Function
* programmed for ngram2 & ngram3 only
* for ngram2 - returns the first word
* for ngram3 - returns the first two words
* for any other ngram - returns ""
SearchText <- function(strInptWord) {
vctSpltWrds <- unlist(stri_split_fixed(strInptWord," "))
strRtrnWord <- ifelse(length(vctSpltWrds)==2, vctSpltWrds[1],
ifelse(length(vctSpltWrds)==3, paste(vctSpltWrds[1], vctSpltWrds[2], sep=" "),
""))
strRtrnWord
}
Next Text - Helper Function
* programmed for ngram2 & ngram3 only
* for ngram2 - returns word 2
* for ngram3 - returns word 3
* for any other ngram - returns ""
NextText <- function(strInptWord) {
vctSpltWrds <- unlist(stri_split_fixed(strInptWord," "))
strRtrnWord <- ifelse(length(vctSpltWrds)==2, vctSpltWrds[2],
ifelse(length(vctSpltWrds)==3, vctSpltWrds[3],
""))
strRtrnWord
}
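Again, a couple of calls on hypothetical inputs illustrate what these helpers return:
# example calls (hypothetical inputs)
SearchText("for the")         # "for"
NextText("for the")           # "the"
SearchText("thanks for the")  # "thanks for"
NextText("thanks for the")    # "the"
SearchText("hello")           # ""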
Zip File
zipFileName <- "Coursera-SwiftKey.zip"
# list the files contained in the zip archive (without extracting)
unzip(zipFileName, list = TRUE )
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
The corpus of US English data is contained in:
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
These files have been extracted and copied to the en_US folder within the repository.
File Size
File size presented in Megabytes / MB:
lngBlogSize <- file.info("en_US/en_US.blogs.txt")$size / 1024^2
lngNewsSize <- file.info("en_US/en_US.news.txt")$size / 1024^2
lngTwtsSize <- file.info("en_US/en_US.twitter.txt")$size / 1024^2
en_US.blogs.txt : 200.4242 MB
en_US.news.txt : 196.2775 MB
en_US.twitter.txt : 159.3641 MB
Read Files
Read the three files using the readLines function.
We use UTF-8 encoding because non-English characters may be present.
vctBlogLins <- readLines("en_US/en_US.blogs.txt", encoding="UTF-8")
vctNewsLins <- readLines("en_US/en_US.news.txt", encoding="UTF-8")
## Warning: incomplete final line found on 'en_US/en_US.news.txt'
vctTwtsLins <- readLines("en_US/en_US.twitter.txt", encoding="UTF-8")
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
Line & Chars Stats
Line & character statistics for the above files are presented below:
stri_stats_general(vctBlogLins)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(vctNewsLins)
## Lines LinesNEmpty Chars CharsNWhite
## 77259 77259 15639408 13072698
stri_stats_general(vctTwtsLins)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
Chars Per Line Stats
Characters-per-line statistics for the above files are presented below:
lngBlogCharCnts <- nchar(vctBlogLins)
summary(lngBlogCharCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40800
lngNewsCharCnts <- nchar(vctNewsLins)
summary(lngNewsCharCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 111 186 202 270 5760
lngTwtsCharCnts <- nchar(vctTwtsLins)
summary(lngTwtsCharCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 37.0 64.0 68.7 100.0 140.0
Words Per Line Stats
Words-per-line statistics for the above files are presented below:
lngBlogWordCnts <- stri_count_words(vctBlogLins)
summary(lngBlogWordCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 9 28 42 60 6730
lngNewsWordCnts <- stri_count_words(vctNewsLins)
summary(lngNewsWordCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 19.0 32.0 34.6 46.0 1120.0
lngTwtsWordCnts <- stri_count_words(vctTwtsLins)
summary(lngTwtsWordCnts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 7.0 12.0 12.8 18.0 47.0
Word Count Stats
Word count per file statistics for the above files are presented below:
lngBlogWordTots <- sum(lngBlogWordCnts)
lngNewsWordTots <- sum(lngNewsWordCnts)
lngTwtsWordTots <- sum(lngTwtsWordCnts)
en_US.blogs.txt : 37,541,795
en_US.news.txt : 2,674,536
en_US.twitter.txt : 30,092,866
Before we process the above-mentioned files, we need to ensure that the data is clean. To obtain tidy data, we carry out the following steps (a toy illustration follows the list):
* remove special (non-ASCII) chars, i.e. foreign chars (note: if this step is not done, errors are encountered in subsequent steps)
* convert to lower case
* remove control chars
* remove numbers
* remove punctuation
* remove special chars
* remove extra white space
* remove words with length <= 2
* remove stop words, i.e. insignificant words like [ a the this that here there ] etc.
* remove sparse words, i.e. words with frequency <= 2
* remove custom words, i.e. a custom list of words to be removed (possible but not done)
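Purely as an illustration (the actual, full cleaning code for each file follows in the next sections), a toy string run through a simplified version of these steps looks roughly like this:
# toy illustration of the cleaning steps on a single hypothetical line
strToy <- "I went to the GYM at 10:30 - it was great!!"
strToy <- stri_trans_tolower(strToy)                          # lower case
strToy <- stri_replace_all_regex(strToy, "[[:digit:]]", "")   # drop numbers
strToy <- stri_replace_all_regex(strToy, "[[:punct:]]", "")   # drop punctuation
strToy <- stri_replace_all_regex(strToy, "\\s+", " ")         # collapse white space
vctToyWrds <- unlist(stri_split_fixed(strToy, " "))           # split into words
vctToyWrds <- vctToyWrds[stri_length(vctToyWrds) > 2]         # drop very short words
vctToyWrds <- vctToyWrds[!(vctToyWrds %in% gblStopWords)]     # drop stop words
vctToyWrds                                                    # "went" "gym" "great"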
Clean Blogs
Process to clean the blogs data as given below:
# convert to ascii
vctBlogLins <- stri_enc_toascii(vctBlogLins)
# to lower
vctBlogLins <- stri_trans_tolower(vctBlogLins)
# remove control chars
vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "[[:cntrl:]]", "")
# remove numbers
vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "[[:digit:]]", "")
# remove punctuation
##vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "[[:punct:]]", "")
# remove special chars
vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "[~!@#$%&-_=:;<>,`\"]", "")
# remove remaining special chars (regex metacharacters)
vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "[\\$\\*\\+\\.\\?\\[\\]\\^\\{\\}\\|\\(\\\\]", "")
# remove extra white spaces
vctBlogLins <- stri_replace_all_regex(vctBlogLins, pattern = "\\s+", " ")
# string split to list
vctBlogWrds <- stri_split_fixed(vctBlogLins," ")
# list to words
vctBlogWrds <- unlist(vctBlogWrds)
# trim white spaces
vctBlogWrds <- stri_trim_both(vctBlogWrds, pattern = "\\P{Wspace}")
# frequency table
dfBlogFreq <- as.data.frame(table(vctBlogWrds), stringsAsFactors=F)
names(dfBlogFreq) <- c("Word","Freq")
# remove all words with len <= 2
dfBlogFreq <- filter(dfBlogFreq, stri_length(Word)>2)
# remove all stop words ...
dfBlogFreq <- filter(dfBlogFreq, !(Word %in% gblStopWords))
# remove all bad words ...
dfBlogFreq <- filter(dfBlogFreq, !(Word %in% gblBadWords))
# remove sparse words ... words with Freq <= 2
dfBlogFreq <- filter(dfBlogFreq, Freq>2)
# sort
dfBlogFreq <- arrange(dfBlogFreq, desc(Freq))
# show top 30
head(dfBlogFreq,30)
## Word Freq
## 1 from 148107
## 2 about 115050
## 3 just 100015
## 4 like 98257
## 5 more 92425
## 6 some 88703
## 7 time 88143
## 8 been 77898
## 9 know 59932
## 10 people 59219
## 11 because 57623
## 12 dont 56261
## 13 other 55691
## 14 new 54341
## 15 even 51747
## 16 first 50783
## 17 well 50755
## 18 make 50561
## 19 day 50485
## 20 back 50421
## 21 really 49771
## 22 much 48805
## 23 good 48599
## 24 think 47545
## 25 way 46933
## 26 after 46382
## 27 little 45614
## 28 could 44951
## 29 love 44736
## 30 two 40583
# add FrequencyCategory column
dfBlogFreq <- mutate(dfBlogFreq, Fcat=FreqCategory(dfBlogFreq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfBlogRfrq <- dfBlogFreq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfBlogRfrq <- dfBlogFreq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfBlogRfrq$Fcat <- factor(dfBlogRfrq$Fcat, levels=dfBlogRfrq$Fcat, ordered=T)
# head
head(dfBlogRfrq,11)
## Source: local data frame [11 x 2]
##
## Fcat Rfrq
## 1 5 42825
## 2 10 22046
## 3 50 30676
## 4 100 7822
## 5 500 10275
## 6 1,000 2217
## 7 5,000 2375
## 8 10,000 352
## 9 50,000 246
## 10 100,000 17
## 11 >100,000 3
Clean News
Process to clean the news data as given below:
# convert to ascii
vctNewsLins <- stri_enc_toascii(vctNewsLins)
# to lower
vctNewsLins <- stri_trans_tolower(vctNewsLins)
# remove control chars
vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "[[:cntrl:]]", "")
# remove numbers
vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "[[:digit:]]", "")
# remove punctuation
##vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "[[:punct:]]", "")
# remove special chars
vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "[~!@#$%&-_=:;<>,`\"]", "")
# remove remaining special chars (regex metacharacters)
vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "[\\$\\*\\+\\.\\?\\[\\]\\^\\{\\}\\|\\(\\\\]", "")
# remove extra white spaces
vctNewsLins <- stri_replace_all_regex(vctNewsLins, pattern = "\\s+", " ")
# string split to list
vctNewsWrds <- stri_split_fixed(vctNewsLins," ")
# list to words
vctNewsWrds <- unlist(vctNewsWrds)
# trim white spaces
vctNewsWrds <- stri_trim_both(vctNewsWrds, pattern = "\\P{Wspace}")
# frequency table
dfNewsFreq <- as.data.frame(table(vctNewsWrds), stringsAsFactors=F)
names(dfNewsFreq) <- c("Word","Freq")
# remove all words with len <= 2
dfNewsFreq <- filter(dfNewsFreq, stri_length(Word)>2)
# remove all stop words ...
dfNewsFreq <- filter(dfNewsFreq, !(Word %in% gblStopWords))
# remove all bad words ...
dfNewsFreq <- filter(dfNewsFreq, !(Word %in% gblBadWords))
# remove sparse words ... words with Freq <= 2
dfNewsFreq <- filter(dfNewsFreq, Freq>2)
# sort
dfNewsFreq <- arrange(dfNewsFreq, desc(Freq))
# show top 30
head(dfNewsFreq,30)
## Word Freq
## 1 said 19167
## 2 from 11648
## 3 about 6932
## 4 more 6729
## 5 new 5337
## 6 been 5162
## 7 after 4728
## 8 year 4470
## 9 two 4438
## 10 first 4150
## 11 just 4144
## 12 last 4027
## 13 time 3992
## 14 some 3988
## 15 years 3987
## 16 other 3930
## 17 state 3810
## 18 like 3780
## 19 people 3667
## 20 could 3122
## 21 because 3019
## 22 city 2828
## 23 most 2729
## 24 percent 2629
## 25 three 2623
## 26 school 2611
## 27 before 2571
## 28 back 2538
## 29 make 2499
## 30 says 2492
# add FrequencyCategory column
dfNewsFreq <- mutate(dfNewsFreq, Fcat=FreqCategory(dfNewsFreq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfNewsRfrq <- dfNewsFreq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfNewsRfrq <- dfNewsFreq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfNewsRfrq$Fcat <- factor(dfNewsRfrq$Fcat, levels=dfNewsRfrq$Fcat, ordered=T)
# head
head(dfNewsRfrq,11)
## Source: local data frame [9 x 2]
##
## Fcat Rfrq
## 1 5 12716
## 2 10 6872
## 3 50 9097
## 4 100 1984
## 5 500 2224
## 6 1,000 300
## 7 5,000 159
## 8 10,000 4
## 9 50,000 2
Clean Tweets
Process to clean the tweets data as given below:
# convert to ascii
vctTwtsLins <- stri_enc_toascii(vctTwtsLins)
# to lower
vctTwtsLins <- stri_trans_tolower(vctTwtsLins)
# remove control chars
vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "[[:cntrl:]]", "")
# remove numbers
vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "[[:digit:]]", "")
# remove punctuation
##vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "[[:punct:]]", "")
# remove special chars
vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "[~!@#$%&-_=:;<>,`\"]", "")
# remove remaining special chars (regex metacharacters)
vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "[\\$\\*\\+\\.\\?\\[\\]\\^\\{\\}\\|\\(\\\\]", "")
# remove extra white spaces
vctTwtsLins <- stri_replace_all_regex(vctTwtsLins, pattern = "\\s+", " ")
# string split to list
vctTwtsWrds <- stri_split_fixed(vctTwtsLins," ")
# list to words
vctTwtsWrds <- unlist(vctTwtsWrds)
# trim white spaces
vctTwtsWrds <- stri_trim_both(vctTwtsWrds, pattern = "\\P{Wspace}")
# frequency table
dfTwtsFreq <- as.data.frame(table(vctTwtsWrds), stringsAsFactors=F)
names(dfTwtsFreq) <- c("Word","Freq")
# remove all words with len <= 2
dfTwtsFreq <- filter(dfTwtsFreq, stri_length(Word)>2)
# remove all stop words ...
dfTwtsFreq <- filter(dfTwtsFreq, !(Word %in% gblStopWords))
# remove all bad words ...
dfTwtsFreq <- filter(dfTwtsFreq, !(Word %in% gblBadWords))
# remove sparse words ... words with Freq <= 2
dfTwtsFreq <- filter(dfTwtsFreq, Freq>2)
# sort
dfTwtsFreq <- arrange(dfTwtsFreq, desc(Freq))
# show top 30
head(dfTwtsFreq,30)
## Word Freq
## 1 just 149619
## 2 like 121325
## 3 love 105589
## 4 good 99672
## 5 about 90952
## 6 dont 90108
## 7 day 90061
## 8 thanks 88664
## 9 from 83691
## 10 know 79269
## 11 great 75382
## 12 time 74697
## 13 today 71233
## 14 new 69381
## 15 lol 66709
## 16 more 62522
## 17 some 61568
## 18 back 57380
## 19 got 55696
## 20 going 55563
## 21 think 53702
## 22 people 51496
## 23 need 50652
## 24 happy 48545
## 25 want 47869
## 26 follow 47326
## 27 make 47176
## 28 well 46182
## 29 right 45533
## 30 really 45254
# add FrequencyCategory column
dfTwtsFreq <- mutate(dfTwtsFreq, Fcat=FreqCategory(dfTwtsFreq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfTwtsRfrq <- dfTwtsFreq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfTwtsRfrq <- dfTwtsFreq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfTwtsRfrq$Fcat <- factor(dfTwtsRfrq$Fcat, levels=dfTwtsRfrq$Fcat, ordered=T)
# head
head(dfTwtsRfrq,11)
## Source: local data frame [11 x 2]
##
## Fcat Rfrq
## 1 5 42459
## 2 10 20882
## 3 50 26292
## 4 100 5879
## 5 500 7670
## 6 1,000 1597
## 7 5,000 1774
## 8 10,000 249
## 9 50,000 234
## 10 100,000 20
## 11 >100,000 3
As the last step of data processing, we merge all three data sets and perform some analysis on the merged data as well.
Merge Data
Process to merge the data as given below:
# merge all three word vectors
vctTotsWrds = c(vctBlogWrds, vctNewsWrds, vctTwtsWrds)
# frequency table
dfTotsFreq <- as.data.frame(table(vctTotsWrds), stringsAsFactors=F)
names(dfTotsFreq) <- c("Word","Freq")
# remove all words with len <= 2
dfTotsFreq <- filter(dfTotsFreq, stri_length(Word)>2)
# remove all stop words ...
# dfTotsFreq <- filter(dfTotsFreq, !(Word %in% gblStopWords))
# remove all bad words ...
dfTotsFreq <- filter(dfTotsFreq, !(Word %in% gblBadWords))
# remove sparse words ... words with Freq <= 2
dfTotsFreq <- filter(dfTotsFreq, Freq>2)
# sort
dfTotsFreq <- arrange(dfTotsFreq, desc(Freq))
# show top 30
head(dfTotsFreq,30)
## Word Freq
## 1 the 2941467
## 2 and 1588012
## 3 you 847975
## 4 for 774514
## 5 that 718765
## 6 with 478926
## 7 this 430159
## 8 was 412655
## 9 have 397589
## 10 are 362470
## 11 but 338994
## 12 not 304657
## 13 your 272653
## 14 all 269013
## 15 just 253778
## 16 from 243446
## 17 its 242710
## 18 out 228426
## 19 what 224450
## 20 like 223362
## 21 they 216587
## 22 will 215326
## 23 about 212934
## 24 one 212078
## 25 can 191628
## 26 when 191544
## 27 get 186067
## 28 time 166832
## 29 more 161676
## 30 there 159332
# add FrequencyCategory column
dfTotsFreq <- mutate(dfTotsFreq, Fcat=FreqCategory(dfTotsFreq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfTotsRfrq <- dfTotsFreq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfTotsRfrq <- dfTotsFreq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfTotsRfrq$Fcat <- factor(dfTotsRfrq$Fcat, levels=dfTotsRfrq$Fcat, ordered=T)
# head
head(dfTotsRfrq,11)
## Source: local data frame [11 x 2]
##
## Fcat Rfrq
## 1 5 71797
## 2 10 34435
## 3 50 43397
## 4 100 10468
## 5 500 13745
## 6 1,000 3154
## 7 5,000 3657
## 8 10,000 633
## 9 50,000 514
## 10 100,000 77
## 11 >100,000 57
Process to save NGrams-1
dfNgr1Csvx <- data.frame(dfTotsFreq$Word, dfTotsFreq$Freq)
colnames(dfNgr1Csvx) <- c("Word","Freq")
write.csv(dfNgr1Csvx, file="N1-SwiftKey.csv", row.names=F)
message("NGrams-1 Frequencies Saved As: N1-SwiftKey.csv \r", appendLF=FALSE)
## NGrams-1 Frequencies Saved As: N1-SwiftKey.csv
flush.console()
head(dfNgr1Csvx)
## Word Freq
## 1 the 2941467
## 2 and 1588012
## 3 you 847975
## 4 for 774514
## 5 that 718765
## 6 with 478926
An NGram is a contiguous sequence of n items from a given sequence of text or speech. In our case we will generate NGrams from words. NGrams are typically collected from a text or speech corpus.
An NGram of size 1 (a single word) is referred to as a "unigram";
a size 2 NGram (two words) is called a "bigram"; [e.g.: good boy; good work; good deed]
a size 3 NGram (three words) is a "trigram"; [e.g.: check this name; what we miss]
larger sizes are sometimes referred to by the value of n, e.g. "four-gram", "five-gram", and so on.
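As a toy illustration (on a hypothetical sentence, not on the corpus), the textcnt function from the tau package, which is used below, counts such word NGrams directly:
# toy illustration: count word bigrams in a single hypothetical sentence
vctToySent <- "thanks for the follow and thanks for the support"
vctToyNgr2 <- textcnt(vctToySent, split=" ", n=2L, method="string")
sort(vctToyNgr2, decreasing=TRUE)
# "thanks for" and "for the" occur twice each; the remaining bigrams occur once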
Note
When an attempt was made to generate the bigrams & trigrams on the entire corpus, the process took a very long time and the R session aborted, possibly due to an out-of-memory error. Since this is an exploratory analysis, we will work with a small subset of 1% of the corpus. This is enough to get a feel for the NGrams-2 & NGrams-3.
Generate NGrams-2
Process to generate NGrams-2 as given below:
# merge all three line vectors
# these are all cleaned up & ready for use ... so clean up again is not required
vctTotsData = c(vctBlogLins, vctNewsLins, vctTwtsLins)
# sample 1% of the merged lines
set.seed (77777)
lngSmplSize <- length(vctTotsData)*0.01
vctTotsData <- sample(vctTotsData, lngSmplSize)
# ngram 2 split
vctNrg2Data = textcnt(vctTotsData, split=" ", n=2L, method="string")
vctNrg2Data = vctNrg2Data[order(vctNrg2Data, decreasing=T)]
# frequency table
dfNgr2Freq <- data.frame(Word=names(vctNrg2Data), Freq=vctNrg2Data, row.names=NULL, stringsAsFactors=F)
# remove sparse words ... words with Freq <= 2
dfNgr2Freq <- filter(dfNgr2Freq, Freq>2)
# filter dfNgr2Freq
vctGoodNgr2 <- vapply(dfNgr2Freq$Word, GoodBigram, USE.NAMES=F, logical(1))
dfNgr2Freq = filter(dfNgr2Freq, vctGoodNgr2)
# head
head(dfNgr2Freq,30)
## Word Freq
## 1 for the 1354
## 2 and the 757
## 3 with the 622
## 4 from the 517
## 5 thanks for 473
## 6 you can 433
## 7 all the 415
## 8 you are 377
## 9 thank you 376
## 10 that the 368
## 11 you have 353
## 12 the first 349
## 13 the same 347
## 14 have been 332
## 15 about the 306
## 16 the best 305
## 17 the world 278
## 18 they are 268
## 19 has been 254
## 20 are you 244
## 21 there are 235
## 22 into the 233
## 23 right now 233
## 24 you know 233
## 25 when you 222
## 26 and then 216
## 27 the most 216
## 28 the way 216
## 29 the day 203
## 30 the new 202
# add FrequencyCategory column
dfNgr2Freq <- mutate(dfNgr2Freq, Fcat=FreqCategory(dfNgr2Freq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfNgr2Rfrq <- dfNgr2Freq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfNgr2Rfrq <- dfNgr2Freq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfNgr2Rfrq$Fcat <- factor(dfNgr2Rfrq$Fcat, levels=dfNgr2Rfrq$Fcat, ordered=T)
# head
head(dfNgr2Rfrq,11)
## Source: local data frame [7 x 2]
##
## Fcat Rfrq
## 1 5 11890
## 2 10 3790
## 3 50 2566
## 4 100 214
## 5 500 103
## 6 1,000 3
## 7 5,000 1
Process to save NGrams-2
dfNgr2Csvx <- data.frame(dfNgr2Freq$Word)
colnames(dfNgr2Csvx) <- c("Term")
dfNgr2Csvx <- mutate(dfNgr2Csvx, Search=mapply(SearchText, dfNgr2Csvx$Term), Next=mapply(NextText, dfNgr2Csvx$Term), Freq=dfNgr2Freq$Freq)
write.csv(dfNgr2Csvx, file="N2-SwiftKey.csv", row.names=F)
message("NGrams-2 Frequencies Saved As: N2-SwiftKey.csv \r", appendLF=FALSE)
## NGrams-2 Frequencies Saved As: N2-SwiftKey.csv
flush.console()
head(dfNgr2Csvx)
## Term Search Next Freq
## 1 for the for the 1354
## 2 and the and the 757
## 3 with the with the 622
## 4 from the from the 517
## 5 thanks for thanks for 473
## 6 you can you can 433
Generate NGrams-3
Process to generate NGrams-3 as given below:
# merge all three line vectors
vctTotsData = c(vctBlogLins, vctNewsLins, vctTwtsLins)
# sample 1% of the merged lines
set.seed (77777)
lngSmplSize <- length(vctTotsData)*0.01
vctTotsData <- sample(vctTotsData, lngSmplSize)
# ngram 3 split
vctNrg3Data = textcnt(vctTotsData, split=" ", n=3L, method="string")
vctNrg3Data = vctNrg3Data[order(vctNrg3Data, decreasing=T)]
# frequency table
dfNgr3Freq <- data.frame(Word=names(vctNrg3Data), Freq=vctNrg3Data, row.names=NULL, stringsAsFactors=F)
# remove sparse words ... words with Freq <= 2
dfNgr3Freq <- filter(dfNgr3Freq, Freq>2)
# filter dfNgr3Freq
LogiGoodTgrm <- vapply(dfNgr3Freq$Word, GoodTrigram, USE.NAMES=F, logical(1))
dfNgr3Freq = filter(dfNgr3Freq, LogiGoodTgrm)
# head
head(dfNgr3Freq,30)
## Word Freq
## 1 thanks for the 256
## 2 thank you for 97
## 3 for the follow 82
## 4 the fact that 80
## 5 the first time 69
## 6 thanks for following 52
## 7 for the first 46
## 8 the same time 43
## 9 what are you 42
## 10 all the time 36
## 11 cant wait for 33
## 12 the only one 33
## 13 you for the 33
## 14 for all the 31
## 15 happy mothers day 30
## 16 would have been 30
## 17 did you know 29
## 18 for the next 28
## 19 you know what 28
## 20 all over the 27
## 21 you can see 27
## 22 check out the 26
## 23 for the rest 26
## 24 and you can 23
## 25 that you can 23
## 26 the united states 23
## 27 all the way 22
## 28 away from the 22
## 29 bmw service center 22
## 30 more and more 22
# add FrequencyCategory column
dfNgr3Freq <- mutate(dfNgr3Freq, Fcat=FreqCategory(dfNgr3Freq$Freq))
# new data frame for Relative Frequency, i.e. Frequency Of Categorized Frequencies ...
#dfNgr3Rfrq <- dfNgr3Freq %>% group_by(Fcat) %>% summarise(Rfrq=sum(Freq))
dfNgr3Rfrq <- dfNgr3Freq %>% group_by(Fcat) %>% summarise(Rfrq=n())
dfNgr3Rfrq$Fcat <- factor(dfNgr3Rfrq$Fcat, levels=dfNgr3Rfrq$Fcat, ordered=T)
# head
head(dfNgr3Rfrq,11)
## Source: local data frame [5 x 2]
##
## Fcat Rfrq
## 1 5 2664
## 2 10 532
## 3 50 178
## 4 100 5
## 5 500 1
Process to save NGrams-3
dfNgr3Csvx <- data.frame(dfNgr3Freq$Word)
colnames(dfNgr3Csvx) <- c("Term")
dfNgr3Csvx <- mutate(dfNgr3Csvx, Search=mapply(SearchText, dfNgr3Csvx$Term), Next=mapply(NextText, dfNgr3Csvx$Term), Freq=dfNgr3Freq$Freq)
write.csv(dfNgr3Csvx, file="N3-SwiftKey.csv", row.names=F)
message("NGrams-3 Frequencies Saved As: N3-SwiftKey.csv \r", appendLF=FALSE)
## NGrams-3 Frequencies Saved As: N3-SwiftKey.csv
flush.console()
head(dfNgr3Csvx)
## Term Search Next Freq
## 1 thanks for the thanks for the 256
## 2 thank you for thank you for 97
## 3 for the follow for the follow 82
## 4 the fact that the fact that 80
## 5 the first time the first time 69
## 6 thanks for following thanks for following 52
We now show the Frequency Table (Top 30 Words) & Word Cloud (Top 100 Words) for each of the data files.
Blogs - Word Frequency - Top 30 Words
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfBlogFreq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Words") +
ggtitle("Blogs - Word Frequency - Top 30 Words") +
theme(plot.title=element_text(size=rel(1.5), colour="blue")) +
coord_flip()
Blogs - Frequency of Word Frequency
# plot frequency of word frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfBlogRfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfBlogRfrq$Fcat))) +
xlab("Words With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.5), colour="blue")) +
ggtitle("Blogs - Frequency Of Word Frequency")
Blogs - Word Cloud - Top 100 Words
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfBlogFreq$Word[1:100], dfBlogFreq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
News - Word Frequency - Top 30 Words
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfNewsFreq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Words") +
ggtitle("News - Word Frequency - Top 30 Words") +
theme(plot.title=element_text(size=rel(1.5), colour="blue")) +
coord_flip()
News - Frequency of Word Frequency
# plot frequency of word frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfNewsRfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfNewsRfrq$Fcat))) +
xlab("Words With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.5), colour="blue")) +
ggtitle("News - Frequency Of Word Frequency")
News - Word Cloud - Top 100 Words
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfNewsFreq$Word[1:100], dfNewsFreq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
Tweets - Word Frequency - Top 30 Words
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfTwtsFreq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Words") +
ggtitle("Tweets - Word Frequency - Top 30 Words") +
theme(plot.title=element_text(size=rel(1.5), colour="blue")) +
coord_flip()
Tweets - Frequency of Word Frequency
# plot frequency of word frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfTwtsRfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfTwtsRfrq$Fcat))) +
xlab("Words With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.5), colour="blue")) +
ggtitle("Tweets - Frequency Of Word Frequency")
Tweets - Word Cloud - Top 100 Words
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfTwtsFreq$Word[1:100], dfTwtsFreq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
Merged Data - Word Frequency - Top 30 Words
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfTotsFreq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Words") +
ggtitle("Merged Data - Word Frequency - Top 30 Words") +
theme(plot.title=element_text(size=rel(1.5), colour="blue")) +
coord_flip()
Merged Data - Frequency of Word Frequency
# plot frequency of word frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfTotsRfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfTotsRfrq$Fcat))) +
xlab("Words With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.5), colour="blue")) +
ggtitle("Merged Data - Frequency Of Word Frequency")
Merged Data - Word Cloud - Top 100 Words
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfTotsFreq$Word[1:100], dfTotsFreq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
Merged Data 1% Sample - Bigrams Frequency - Top 30 Bigrams
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfNgr2Freq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Bigrams") +
ggtitle("Merged Data 1% Sample - Bigram Frequency - Top 30 Bigrams") +
theme(plot.title=element_text(size=rel(1.25), colour="blue")) +
coord_flip()
Merged Data 1% Sample - Frequency of Bigram Frequency
# plot frequency of bigram frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfNgr2Rfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfNgr2Rfrq$Fcat))) +
xlab("Bigrams With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.25), colour="blue")) +
ggtitle("Merged Data 1% Sample - Frequency Of Bigram Frequency")
Merged Data 1% Sample - Word Cloud - Top 100 Bigrams
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfNgr2Freq$Word[1:100], dfNgr2Freq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
Merged Data 1% Sample - Trigrams Frequency - Top 30 Trigrams
# plot top 30
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(slice(dfNgr3Freq,1:30), aes(x=Word,y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ylab("Frequency") +
xlab("Trigrams") +
ggtitle("Merged Data 1% Sample - Trigram Frequency - Top 30 Trigrams") +
theme(plot.title=element_text(size=rel(1.25), colour="blue")) +
coord_flip()
Merged Data 1% Sample - Frequency of Trigram Frequency
# plot frequency of trigram frequency
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
ggplot(dfNgr3Rfrq, aes(Fcat,Rfrq/1000))+
geom_bar(stat="identity", width=0.8, fill=rainbow(length(dfNgr3Rfrq$Fcat))) +
xlab("Trigrams With Frequency Less Than") + ylab("Frequency In '1000s") +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),axis.text.y=element_text(angle=60, hjust=1, vjust=1),plot.title=element_text(size=rel(1.25), colour="blue")) +
ggtitle("Merged Data 1% Sample - Frequency Of Trigram Frequency")
Merged Data 1% Sample - Word Cloud - Top 100 Trigrams
# word cloud
par(mfrow=c(1,1), mar=c(0.5, 4, 2, 1), oma = c(0, 0, 0, 0))
wordcloud(dfNgr3Freq$Word[1:100], dfNgr3Freq$Freq[1:100], random.order=F, max.words=100, colors=brewer.pal(8, "Dark2"))
Activities Done Successfully
For each of the above files, we collected & reviewed the following information:
1. file size
2. line count & non-empty line count
3. character count
4. nonwhite character count
5. word count per line - summary
6. word count per file
7. word count frequency per file
8. frequency of word count frequency per file
Then we merged the data of all three files and analysed the word count frequency for the Merged Data.
Then, for each file & also for the Merged Data, there were visualizations for:
* Word Frequency - Top 30 Words
* Frequency Of Word Frequency
* Word Cloud - Top 100 Words
Subsequently, for the Merged Data (1% sample), Bigrams & Trigrams were generated.
Stats related to Bigrams & Trigrams were generated as follows:
1. Bigrams / Trigrams count frequency
2. frequency of Bigrams / Trigrams count frequency
Again, for the Bigrams & Trigrams there were visualizations as follows:
* Bigram / Trigram Frequency - Top 30
* Frequency Of Bigram / Trigram Frequency
* Word Cloud - Top 100 Bigrams / Trigrams
Action Plan
Having gained an understanding of the corpus through this exploratory analysis, the next steps would be as follows:
1. Build the Bigram & Trigram Frequency Matrix for the whole corpus.
2. Create a language model using the whole corpus.
3. Build a predictive algorithm / model (a minimal sketch of the idea is given after the note below).
4. Create a ShinyApp data product.
5. Create a presentation for Non-Data-Scientist users.
Note: The bigram & trigram analysis above was done only on a 1% sample of the corpus.
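As an indication of the direction for step 3 above, a minimal and deliberately naive sketch of a frequency-based backoff lookup over the saved NGram tables could look like the following; the eventual model will need smoothing, a larger sample and proper handling of unseen phrases:
# naive sketch of a backoff next-word lookup using the saved NGram tables
# (assumes N2-SwiftKey.csv and N3-SwiftKey.csv exist in the working directory)
dfNgr2Tbl <- read.csv("N2-SwiftKey.csv", stringsAsFactors = FALSE)
dfNgr3Tbl <- read.csv("N3-SwiftKey.csv", stringsAsFactors = FALSE)
PredictNext <- function(strPhrase, n = 3) {
    vctWrds <- unlist(stri_split_fixed(stri_trans_tolower(strPhrase), " "))
    # try the trigram table first, using the last two words typed
    if (length(vctWrds) >= 2) {
        strKey <- paste(tail(vctWrds, 2), collapse = " ")
        dfHit <- dfNgr3Tbl[dfNgr3Tbl$Search == strKey, ]
        if (nrow(dfHit) > 0)
            return(head(dfHit$Next[order(dfHit$Freq, decreasing = TRUE)], n))
    }
    # back off to the bigram table, using only the last word typed
    strKey <- tail(vctWrds, 1)
    dfHit <- dfNgr2Tbl[dfNgr2Tbl$Search == strKey, ]
    head(dfHit$Next[order(dfHit$Freq, decreasing = TRUE)], n)
}
PredictNext("thanks for")   # e.g. "the" and "following" for the 1% sample above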
< End Of Report >