The main objective of this task is to understand the basic relationships we observe in the data and to prepare to build our first text analytics (linguistic) models.
We plan to achieve this objective using the following steps:
- Exploratory analysis with plots and statistics showing various features of the dataset, including word counts.
- Extending the analysis to the most frequent sequences of n words, also called 1-grams, 2-grams, and 3-grams.
- Concluding with next steps for predictive modelling, leveraging the validation and refinement of our expectations based on the data.
This report has been created to fulfill the requirements of the exploratory analysis milestone of the Data Science Capstone offered by Johns Hopkins University on Coursera.
In this project, we will use the English-language dataset from the data provided by SwiftKey, belonging to a corpus called HC Corpora.
The dataset used in this project can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
There are 3 major subsets of the Data: 1. Blogs 2. News 3. Twitter
Set the working directory in R
## [1] "C:/Users/somannam.AUTH/Documents/Personal files/Personal files/PMP/DATA Science- Journey/R/Data Science-Coursera/Capstone Project"
Load the required Packages in R
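The package list is not shown explicitly; a plausible set, inferred from the functions used later in this report (an assumption, adjust to your setup):
# Packages assumed from the functions used below
library(tm)           # VCorpus(), tm_map(), TermDocumentMatrix()
library(stringi)      # stri_count_words()
library(hunspell)     # hunspell_parse(), hunspell_stem()
library(wordcloud)    # wordcloud()
library(wordcloud2)   # wordcloud2()
library(RColorBrewer) # brewer.pal()
library(ggplot2)      # frequency plots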
Loading the Data set using R
# Download and unzip the dataset if it is not already present locally
download_url = 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
data_file = 'Coursera-SwiftKey.zip'
if (!file.exists(data_file)) {
  cat('Downloading Dataset...\n')
  download.file(download_url, destfile=data_file, method="curl")
  cat('Unzipping Dataset...\n')
  unzip(data_file)
} else {
  cat('Dataset is already downloaded!\n')
}
Loading the English Dataset from the Zip file
blog_data <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter_data <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding="UTF-8")
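# The news file contains an embedded control character, so it is opened in
# binary mode to keep readLines() from stopping early on some platforms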
con <- file("Coursera-SwiftKey/final/en_US/en_US.news.txt", open="rb")
news_data <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
Let’s do a simple exploratory analysis for a high-level understanding of the data.
# Summary of the blog data
summary(blog_data) # length (number of lines), class, and mode
round(file.info("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size/1024/1024, digits = 1) # file size in MB
# Summary of the news data
summary(news_data) # length (number of lines), class, and mode
round(file.info("Coursera-SwiftKey/final/en_US/en_US.news.txt")$size/1024/1024, digits = 1) # file size in MB
# Summary of the Twitter data
summary(twitter_data) # length (number of lines), class, and mode
round(file.info("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size/1024/1024, digits = 1) # file size in MB
# Disk size (in MB)
blogs_dsize <- file.info("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / 1024 / 1024
news_dsize <- file.info("Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / 1024 / 1024
twitter_dsize <- file.info("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / 1024 / 1024
#In-memory size (in MB)
blogs_msize<-object.size(blog_data) / 1024 / 1024
news_msize<-object.size(news_data) / 1024 / 1024
twitter_msize<-object.size(twitter_data) / 1024 / 1024
# Words in lines
blogs_words <- stri_count_words(blog_data)
news_words <- stri_count_words(news_data)
twitter_words <- stri_count_words(twitter_data)
# Summary
data.frame(source = c("blogs", "news", "twitter"),
           files_MB = c(blogs_dsize, news_dsize, twitter_dsize),
           in_memory_MB = c(blogs_msize, news_msize, twitter_msize),
           lines = c(length(blog_data), length(news_data), length(twitter_data)),
           words_num = c(sum(blogs_words), sum(news_words), sum(twitter_words)),
           mean_words_num = c(mean(blogs_words), mean(news_words), mean(twitter_words)))
## source files_MB in_memory_MB lines words_num mean_words_num
## 1 blogs 200.4242 248.4935 899288 37546246 41.75108
## 2 news 196.2775 249.6329 1010242 34762395 34.40997
## 3 twitter 159.3641 301.3967 2360148 30093369 12.75063
As we can see, the blogs subset has the highest word count and the largest file size, at roughly 200 MB on disk.
We can also look at some sample data to determine the next set of actions.
blogs_sample <- sample(blog_data, 5000, replace = FALSE)
news_sample <- sample(news_data, 5000, replace = FALSE)
twitter_sample <- sample(twitter_data, 5000, replace = FALSE)
bnt_data <- c(blogs_sample, news_sample, twitter_sample)
As shown above, we take a sample of about 5,000 observations from each dataset to perform the next set of analyses.
Various data cleaning methods are used to set up the data for further analysis.
A series of data cleaning and transformation steps were performed on the sampled data so that profanity and other words not needed for prediction are removed:
- Remove extra whitespace
- Remove numbers
- Remove unnecessary punctuation
- Change text to all lowercase
- Remove stop words
- Remove swear/profanity words (the word list was sourced from Shutterstock's GitHub page and is usable under the Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/)
if (!file.exists("badwords.txt")){
download.file(url="https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en", destfile="badwords.txt", method = "curl")
}
badWords <- readLines("badwords.txt")
corpus <- VCorpus(VectorSource(bnt_data))
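# Wrap base R functions in content_transformer() so that tm_map() keeps
# returning proper text documents rather than bare character vectors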
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, to='UTF-8', sub='byte')))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, badWords)
corpus <- tm_map(corpus, PlainTextDocument)
wordcloud(corpus, scale=c(5,.5), min.freq=5, max.words=100,
          random.order=TRUE, rot.per=.35, colors=brewer.pal(8, "Dark2"))
We use the 'hunspell' package to parse the data.
allwords <- hunspell_parse(bnt_data)
print(allwords[[3]])
## [1] "Crochet" "Pattern" "Central" "has" "many" "free" "patterns" "you" "can"
## [10] "browse" "through" "to" "get" "patterns" "for" "many" "different" "types"
## [19] "of" "pets" "Crochet" "For" "Pets"
# Summarizing words after stemming
stems <- unlist(hunspell_stem(unlist(allwords)))
words <- sort(table(stems), decreasing = TRUE)
print(head(words, 30))
## stems
## the to and a of i in I it that s is for on you with was he this at
## 22041 12318 11613 10674 9452 7828 7567 7550 5586 5212 4725 4670 4557 3979 3693 3160 2915 2480 2453 2414
## have be my as t are but we not from
## 2413 2350 2274 2257 2240 2196 2148 2109 1842 1733
Since most of the top words are stop words, we filter them out:
df <- as.data.frame(words)
df$stems <- as.character(df$stems)
stopwords <- hunspell_parse(readLines('http://jeroenooms.github.io/files/stopwords.txt'))
stops <- df$stems %in% unlist(stopwords)
wcdata <- head(df[!stops,], 150)
print(wcdata, max = 40)
## stems Freq
## 52 time 1187
## 53 It 1164
## 69 day 924
## 85 He 707
## 87 love 687
## 88 people 687
## 101 aft 585
## 113 bee 532
## 115 In 526
## 119 week 512
## 120 A 508
## 122 ally 496
## 125 play 489
## 130 call 469
## 135 home 443
## 137 start 439
## 138 school 437
## 147 feel 404
## 150 game 401
## 155 life 384
## [ reached getOption("max.print") -- omitted 130 rows ]
names(wcdata) <- c("word", "freq")
wordcloud2(wcdata)
blog_words <- hunspell_parse(blogs_sample)
blog_stems <- unlist(hunspell_stem(unlist(blog_words)))
blog_stem_words <- sort(table(blog_stems), decreasing = TRUE)
print(head(blog_stem_words, 30))
# Remove stop words from the blog stems
df_blogs <- as.data.frame(blog_stem_words)
df_blogs$blog_stems <- as.character(df_blogs$blog_stems)
stopwords <- hunspell_parse(readLines('http://jeroenooms.github.io/files/stopwords.txt'))
blog_stops <- df_blogs$blog_stems %in% unlist(stopwords)
blog_wcdata <- head(df_blogs[!blog_stops,], 150)
print(blog_wcdata, max = 40)
## blog_stems Freq
## 52 there 5372
## 53 had 5253
## 69 would 3968
## 85 think 3112
## 87 want 3031
## 88 now 3022
## 101 good 2564
## 113 these 2345
## 115 He 2292
## 119 need 2266
## 120 most 2156
## 122 two 2070
## 125 life 2022
## 130 feel 1973
## 135 being 1910
## 137 book 1905
## 138 read 1901
## 147 said 1816
## 150 something 1779
## 155 its 1712
## [ reached getOption("max.print") -- omitted 130 rows ]
# Word cloud for Blogs
library(wordcloud2)
names(blog_wcdata) <- c("blog_word", "freq")
wordcloud2(blog_wcdata)
news_words <- hunspell_parse(news_sample)
news_stems <- unlist(hunspell_stem(unlist(news_words)))
news_stem_words <- sort(table(news_stems), decreasing = TRUE)
print(head(news_stem_words, 30))
## news_stems
## the to and a of in s that for it on is with he said was at i I as
## 98453 45326 44395 44340 38775 34040 22891 18489 17688 17208 15373 14171 12795 12743 12623 11440 10728 9748 9725 9382
## his but be have from are t by they year
## 7801 7755 7705 7541 7533 6882 6814 6623 6442 6364
# Remove stop words from the news stems
df_news <- as.data.frame(news_stem_words)
df_news$news_stems <- as.character(df_news$news_stems)
stopwords <- hunspell_parse(readLines('http://jeroenooms.github.io/files/stopwords.txt'))
news_stops <- df_news$news_stems %in% unlist(stopwords)
news_wcdata <- head(df_news[!news_stops,], 150)
print(news_wcdata, max = 40)
## news_stems Freq
## 52 were 3688
## 53 time 3660
## 69 what 3061
## 85 some 2531
## 87 A 2506
## 88 school 2505
## 101 could 2059
## 113 take 1903
## 115 now 1873
## 119 only 1756
## 120 million 1729
## 122 how 1695
## 125 way 1692
## 130 even 1625
## 135 show 1593
## 137 don 1560
## 138 good 1558
## 147 through 1489
## 150 down 1446
## 155 help 1401
## [ reached getOption("max.print") -- omitted 130 rows ]
# Word cloud for News
names(news_wcdata) <- c("news_word", "freq")
wordcloud2(news_wcdata)
twitter_words <- hunspell_parse(twitter_sample)
twitter_stems <- unlist(hunspell_stem(unlist(twitter_words)))
twitter_stem_words <- sort(table(twitter_stems), decreasing = TRUE)
print(head(twitter_stem_words, 30))
## twitter_stems
## the i I to a you and it for in is of on s my that t me be at
## 46768 45892 40040 39161 30890 30154 21902 21026 19378 18732 18118 18111 15984 15907 14664 13949 11497 10261 9439 9330
## have your with this so we are just m can
## 9220 8870 8507 8356 8354 8265 7895 7639 7158 6741
# Remove stop words from the Twitter stems
df_twitter <- as.data.frame(twitter_stem_words)
df_twitter$twitter_stems <- as.character(df_twitter$twitter_stems)
stopwords <- hunspell_parse(readLines('http://jeroenooms.github.io/files/stopwords.txt'))
twitter_stops <- df_twitter$twitter_stems %in% unlist(stopwords)
twitter_wcdata <- head(df_twitter[!twitter_stops,], 150)
print(twitter_wcdata, max = 40)
# Word cloud for Twitter
library(wordcloud2)
names(twitter_wcdata) <- c("twitter_word", "freq")
wordcloud2(twitter_wcdata)
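# ngrams() and words() come from the NLP package (attached with tm);
# the tokenizers below join consecutive words into 2- and 3-word phrases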
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
# Turn a term-document matrix into a frequency table sorted by count
freq_df <- function(tdm){
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_df <- data.frame(word=names(freq), freq=freq)
  return(freq_df)
}
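# removeSparseTerms() drops terms that appear in almost no documents
# (here, sparsity above 99.99%), which keeps the matrices manageable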
unigram <- removeSparseTerms(TermDocumentMatrix(corpus), 0.9999)
unigram_freq <- freq_df(unigram)
bigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)), 0.9999)
bigram_freq <- freq_df(bigram)
trigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer)), 0.9999)
trigram_freq <- freq_df(trigram)
# Frequency plot helper (the bar fill colour is passed as an argument)
freq_plot <- function(data, title, fill) {
  ggplot(data[1:25,], aes(reorder(word, -freq), freq)) +
    labs(x = "Words/Phrases", y = "Frequency") +
    ggtitle(title) +
    theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", color = "black", fill = fill, alpha = .4)
}
freq_plot(unigram_freq, "Top-25 Unigrams", "green")
freq_plot(bigram_freq, "Top-25 Bigrams", "blue")
freq_plot(trigram_freq, "Top-25 Trigrams", "orange")
With this exploratory data analysis we now understand the distribution of, and relationships between, the words, tokens, and phrases in the text.
We observed large differences between the frequency plots of the different n-gram models.
Stop words occur far more often than regular words, so managing stop words plays an important role in text mining.
Now that we have a basic understanding of the data and of how n-grams work, this knowledge will be very handy in further developing the predictive model.
The exploratory data analysis helped to validate and refine our expectations about the data, which is critical to lay the foundation for the predictive model.
The next step is to develop a Shiny app that takes an n-gram as input and predicts the next word with reasonable accuracy.
Performance will be an important consideration for the next stage of model development.
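As a pointer toward that model, here is a minimal sketch (an illustration, not the project's actual implementation) of how the n-gram frequency tables built above could drive a simple back-off next-word lookup; it assumes the bigram_freq and trigram_freq data frames from the previous section, and predict_next is a hypothetical helper name:
# Minimal back-off sketch: try trigrams whose first two words match the last
# two words typed, then back off to bigrams (assumes bigram_freq/trigram_freq
# are sorted by decreasing frequency, as built by freq_df above)
predict_next <- function(phrase, bigram_freq, trigram_freq) {
  tokens <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(tokens) == 2) {
    prefix <- paste0(paste(tokens, collapse = " "), " ")
    hits <- trigram_freq[startsWith(as.character(trigram_freq$word), prefix), ]
    if (nrow(hits) > 0)
      return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  }
  # back off to bigrams whose first word matches the last word typed
  prefix <- paste0(tail(tokens, 1), " ")
  hits <- bigram_freq[startsWith(as.character(bigram_freq$word), prefix), ]
  if (nrow(hits) > 0)
    return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  NA_character_  # no matching n-gram found
}
# Example usage (hypothetical input):
# predict_next("one of the", bigram_freq, trigram_freq)
A real model would also need smoothing and better handling of unseen prefixes, which is where the validated expectations from this analysis come in.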
More information is available from the website here: https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/
The dataset used in this project can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Word cloud and histogram for unigram, bigram, and trigram occurrences using an n-gram tokenizer, source: https://gist.github.com/nonsleepr/0c1d7f1bdd0953dabf2f
http://stackoverflow.com/questions/37817975/error-in-rweka-in-r-package