Executive Summary
This milestone report is created as part of the Coursera Data Science capstone project to provide an overview of the data we will work with. Here, we summarize the data processing procedure and present the results of an exploratory analysis of three English data sets: Twitter, News, and Blogs. At the end, we outline our plan for creating a prediction algorithm and a Shiny app.
Summary Statistics
The three data sets used in this project are loaded:
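The loading code itself is not echoed in the knitted report. A minimal sketch of it, using the file paths that appear in the warnings below, might look like the following (the use of file.info() for the sizes and the object names twitter_sz, news_sz, and blogs_sz mirror the summary code further down and are assumptions about the original script):
library(stringi)

# Read the three text files line by line (paths taken from the warnings below)
twitter <- readLines("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
news    <- readLines("./Coursera-Swiftkey/final/en_US/en_US.news.txt", encoding = "UTF-8")
blogs   <- readLines("./Coursera-Swiftkey/final/en_US/en_US.blogs.txt", encoding = "UTF-8")

# File sizes in MB for the summary table (assumed to come from file.info())
twitter_sz <- file.info("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt")$size / 1024^2
news_sz    <- file.info("./Coursera-Swiftkey/final/en_US/en_US.news.txt")$size / 1024^2
blogs_sz   <- file.info("./Coursera-Swiftkey/final/en_US/en_US.blogs.txt")$size / 1024^2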
## Warning in readLines("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt", :
## line 167155 appears to contain an embedded nul
## Warning in readLines("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt", :
## line 268547 appears to contain an embedded nul
## Warning in readLines("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt", :
## line 1274086 appears to contain an embedded nul
## Warning in readLines("./Coursera-Swiftkey/final/en_US/en_US.twitter.txt", :
## line 1759032 appears to contain an embedded nul
## Warning in readLines("./Coursera-Swiftkey/final/en_US/en_US.news.txt",
## encoding = "UTF-8"): incomplete final line found on './Coursera-Swiftkey/
## final/en_US/en_US.news.txt'
Summary statistics for the three data sets:
TBL_Summary <- data.frame(File = c("en_US.twitter", "en_US.news", "en_US.blogs"),
                          Size_MB = c(twitter_sz, news_sz, blogs_sz),
                          Lines_Count = c(length(twitter), length(news), length(blogs)),
                          Word_Count = c(sum(stri_count_words(twitter)), sum(stri_count_words(news)), sum(stri_count_words(blogs))),
                          Word_Average = c(mean(stri_count_words(twitter)), mean(stri_count_words(news)), mean(stri_count_words(blogs))))
TBL_Summary
## File Size_MB Lines_Count Word_Count Word_Average
## 1 en_US.twitter 159.3641 2360148 30093369 12.75063
## 2 en_US.news 196.2775 77259 2674536 34.61779
## 3 en_US.blogs 200.4242 899288 37546246 41.75108
Plotting the summaries:
library(ggplot2)
qplot(stri_count_words(twitter))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(stri_count_words(news))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(stri_count_words(blogs))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Sampling
We use sampling to make the analysis manageable: each of the three files is randomly subset to 20,000 lines of text, the three samples are combined into a single file, and the result is saved as a new file.
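The sampling code is not echoed in the report; a minimal sketch of the step described above might be (the seed value and the output file name are illustrative assumptions):
set.seed(1234)                      # assumed seed for reproducibility
sample_twitter <- sample(twitter, 20000)
sample_news    <- sample(news, 20000)
sample_blogs   <- sample(blogs, 20000)

# Combine the three samples into one file and save it
sample_all <- c(sample_twitter, sample_news, sample_blogs)
writeLines(sample_all, "./Coursera-Swiftkey/final/en_US/en_US.sample.txt")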
Creating and Cleaning Corpus
After creating the sample file, we build a corpus and then tokenize it. Before creating the corpus, we remove all characters that are not recognized by the libraries used in this section.
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
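The corpus-creation code is not echoed either. A minimal sketch, reusing the sample_all object from the sampling sketch above (the iconv() conversion to ASCII is one common way to drop unrecognized characters, not necessarily the author's exact approach):
# tm is already loaded at this point (its NLP dependency produced the messages shown above)
# Drop characters that the downstream libraries cannot handle
sample_all <- iconv(sample_all, from = "UTF-8", to = "ASCII", sub = "")

# Build the corpus from the combined sample
corpus <- VCorpus(VectorSource(sample_all))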
A number of steps are required to clean up the corpus. We convert all characters to lower case and remove irrelevant strings, numbers, punctuation, redundant white-space, e-mail addresses, and web links. We also remove the standard English stopwords using the stopword set in the tm library, since these are the usual suspects among frequent words in a corpus. Finally, we strip all profanity from the corpus.
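The cleaning code is not shown in the report; a sketch of these steps using standard tm transformations might look like this (the URL and e-mail regular expressions and the helper name removePattern are illustrative assumptions; the profanity-list path is taken from the warning below):
# Custom transformation used to strip arbitrary patterns (helper name is illustrative)
removePattern <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- tm_map(corpus, content_transformer(tolower))                      # lower-case
corpus <- tm_map(corpus, removePattern, "http[[:alnum:][:punct:]]*")        # web links
corpus <- tm_map(corpus, removePattern, "[[:alnum:]._-]+@[[:alnum:].-]+")   # e-mail addresses
corpus <- tm_map(corpus, removeNumbers)                                     # numbers
corpus <- tm_map(corpus, removePunctuation)                                 # punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))                 # English stopwords
profanity <- readLines("C:/Users/Arash/Desktop/Coursera-SwiftKey/final/Profanity_List.txt")
corpus <- tm_map(corpus, removeWords, profanity)                            # profanity
corpus <- tm_map(corpus, stripWhitespace)                                   # redundant white-space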
## Warning in readLines(file("C:/Users/Arash/Desktop/Coursera-SwiftKey/final/
## Profanity_List.txt"), : incomplete final line found on 'C:/Users/Arash/
## Desktop/Coursera-SwiftKey/final/Profanity_List.txt'
N-grams and Data Analysis
After all the boring stuff, we move on to the real analysis. We use unigram (UT), bigram (BT), trigram (TT), and quadgram (QT) analyses to understand the frequency of words and word combinations. The RWeka library is used in this section.
To do an exploratory analysis on these N-grams, we find the frequency of word combinations in each N-gram:
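The tokenization code is not echoed; a sketch of how the frequency tables could be built with RWeka's NGramTokenizer (the helper name makeFreqTable is an assumption):
makeFreqTable <- function(corpus, n) {
  tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq)
}

UT_frq_TBL <- makeFreqTable(corpus, 1)   # unigrams
BT_frq_TBL <- makeFreqTable(corpus, 2)   # bigrams
TT_frq_TBL <- makeFreqTable(corpus, 3)   # trigrams
QT_frq_TBL <- makeFreqTable(corpus, 4)   # quadgrams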
In the following, we present the 10 most frequent words/word combinations in each of the four N-grams.
paste("Unigram Frequencies")
## [1] "Unigram Frequencies"
head(UT_frq_TBL, 10)
## word freq
## said said 5932
## will will 5568
## one one 5122
## just just 4535
## like like 4343
## can can 4073
## time time 3571
## get get 3398
## new new 3163
## people people 2777
paste("Bigram Frequencies")
## [1] "Bigram Frequencies"
head(BT_frq_TBL, 10)
## word freq
## last year last year 383
## new york new york 373
## right now right now 339
## dont know dont know 325
## years ago years ago 267
## high school high school 263
## feel like feel like 230
## first time first time 216
## last week last week 216
## im going im going 209
paste("Trigram Frequencies")
## [1] "Trigram Frequencies"
head(TT_frq_TBL, 10)
## word freq
## new york city new york city 43
## cant wait see cant wait see 39
## let us know let us know 29
## im pretty sure im pretty sure 27
## new york times new york times 26
## happy mothers day happy mothers day 24
## two years ago two years ago 24
## im looking forward im looking forward 22
## world war ii world war ii 21
## dont even know dont even know 20
paste("Quadgram Frequencies")
## [1] "Quadgram Frequencies"
head(QT_frq_TBL, 10)
## word                                     freq
## g fat g saturated                            9
## advertising fees advertising linking         7
## advertising linking amazoncom amazonca       7
## amazon eu associates programmes              7
## amazon eu content provided                   7
## amazon services llc amazon                   7
## amazon services llc andor                    7
## amazonca amazoncouk amazonde amazonfr        7
## amazoncom amazonca amazoncouk amazonde       7
## amazoncouk amazonde amazonfr amazonit        7
The following four histograms summarize the 10 most common N-grams of each order.
ggplot(UT_frq_TBL[1:10, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "darkslateblue", colour = "black", width = 0.50) +
  ggtitle("Unigrams") +
  ylab("Frequency of Words") +
  xlab("Words") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
ggplot(BT_frq_TBL[1:10, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "tomato", colour = "black", width = 0.50) +
  ggtitle("Bigrams") +
  ylab("Frequency of Words") +
  xlab("Words") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
ggplot(TT_frq_TBL[1:10, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "gold3", colour = "black", width = 0.50) +
  ggtitle("Trigrams") +
  ylab("Frequency of Words") +
  xlab("Words") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
ggplot(QT_frq_TBL[1:10, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "mediumpurple", colour = "black", width = 0.50) +
  ggtitle("Quadgrams") +
  ylab("Frequency of Words") +
  xlab("Words") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
Future Steps - Algorithm and Shiny App
The N-grams described in this report will be used to calculate the probability of the next word occurring when completing a string of text; a trigram/quadgram model will be used for this prediction. As for the Shiny app, we will create an online application that takes a partial text string from the user and, based on the algorithm above, proposes a list of possible words to complete the string.
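As an illustration of how such a model could use the frequency tables above, here is a minimal back-off sketch (the function name predictNextWord and its string handling are illustrative, not the final algorithm):
predictNextWord <- function(phrase, n_suggest = 3) {
  tokens <- stri_extract_all_words(stri_trans_tolower(phrase))[[1]]
  tokens <- tail(tokens[!is.na(tokens)], 3)                  # keep at most the last three words
  if (length(tokens) > 0) {
    tables <- list(BT_frq_TBL, TT_frq_TBL, QT_frq_TBL)       # indexed by prefix length
    for (k in length(tokens):1) {
      prefix  <- paste(tail(tokens, k), collapse = " ")
      tbl     <- tables[[k]]
      matches <- tbl[grepl(paste0("^", prefix, " "), tbl$word), ]
      if (nrow(matches) > 0)
        # last word of the most frequent matching (k+1)-grams
        return(head(stri_extract_last_words(as.character(matches$word)), n_suggest))
    }
  }
  head(as.character(UT_frq_TBL$word), n_suggest)             # fall back to the top unigrams
}

predictNextWord("happy mothers")   # should suggest "day", given the trigram table above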