As part of the Data Science Capstone Project, this milestone report presents the work done so far on exploratory data analysis and modeling. To get started, I downloaded the Coursera SwiftKey dataset. After extraction, I chose to work with the en_US folder, which contains the following three files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
# Preload necessary R libraries
library(dplyr)
library(doParallel)
library(stringi)
library(tm)
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.5
library(ggplot2)
library(wordcloud)
library(SnowballC)
# Set up a parallel cluster to accelerate execution time
jobcluster <- makeCluster(detectCores())
registerDoParallel(jobcluster)  # register the cluster as the doParallel backend
invisible(clusterEvalQ(jobcluster, library(tm)))
invisible(clusterEvalQ(jobcluster, library(stringi)))
invisible(clusterEvalQ(jobcluster, library(wordcloud)))
Before starting to work with the files mentioned above, it is important to look at their basic characteristics, such as the number of lines, characters and words.
blogs <- readLines("Coursera-Swiftkey/final/en_US/en_US.blogs.txt", encoding = 'UTF-8', skipNul = TRUE)
news <- readLines("Coursera-Swiftkey/final/en_US/en_US.news.txt", encoding = 'UTF-8', skipNul = TRUE)
## Warning in readLines("Coursera-Swiftkey/final/en_US/en_US.news.txt",
## encoding = "UTF-8", : incomplete final line found on 'Coursera-Swiftkey/
## final/en_US/en_US.news.txt'
twitter <- readLines("Coursera-Swiftkey/final/en_US/en_US.twitter.txt", encoding = 'UTF-8', skipNul = TRUE)
rawstats <- data.frame(
File = c("blogs","news","twitter"),
t(rbind(sapply(list(blogs,news,twitter),stri_stats_general),
TotalWords = sapply(list(blogs,news,twitter),stri_stats_latex)[4,]))
)
print(rawstats)
## File Lines LinesNEmpty Chars CharsNWhite TotalWords
## 1 blogs 899288 899288 206824382 170389539 37570839
## 2 news 77259 77259 15639408 13072698 2651432
## 3 twitter 2360148 2360148 162096241 134082806 30451170
The blogs file has about 900,000 lines, the news file about 77,000 lines, and the twitter file about 2.4 million lines.
Since the raw data set is very large, it is better to take a sample before starting the analysis.
set.seed(39)
sampleTwitter <- twitter[sample(1:length(twitter), 10000)]
sampleNews <- news[sample(1:length(news), 10000)]
sampleBlogs <- blogs[sample(1:length(blogs), 10000)]
sampleData <- c(sampleTwitter, sampleNews, sampleBlogs)
# Make sure the output directory exists before writing the sample
dir.create("./sample", showWarnings = FALSE)
writeLines(sampleData, "./sample/sampleData.txt")
rm(rawstats, blogs, news, twitter, sampleData, sampleBlogs, sampleNews, sampleTwitter)
The tm package in R is a good tool for analyzing this data. Loading the tm library and creating a corpus is the first step of the analysis. The main structure for managing documents in tm is the so-called Corpus, which represents a collection of text documents.
directory <- file.path(".", "sample")
us_files <- Corpus(DirSource(directory))
# define a transformer that replaces slashes, @ signs and pipes with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
us_files <- tm_map(us_files, toSpace, "/|@|\\|")
# convert to lowercase
us_files <- tm_map(us_files, content_transformer(tolower))
# remove punctuation
us_files <- tm_map(us_files, removePunctuation)
# remove numbers
us_files <- tm_map(us_files, removeNumbers)
# strip whitespace
us_files <- tm_map(us_files, stripWhitespace)
# remove english stop words
us_files <- tm_map(us_files, removeWords, stopwords("english"))
# initiate stemming
us_files <- tm_map(us_files, stemDocument)
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
Word frequencies play a vital role in building n-gram models. The us_files corpus will be used to create unigrams, bigrams and trigrams.
unigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 1, max = 1))
}
unigrams <- DocumentTermMatrix(us_files, control = list(tokenize = unigramTokenizer))
BigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
bigrams <- DocumentTermMatrix(us_files, control = list(tokenize = BigramTokenizer))
TrigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
trigrams <- DocumentTermMatrix(us_files, control = list(tokenize = TrigramTokenizer))
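For intuition about what these tokenizers produce, they can also be applied directly to a short piece of text. The sentence below is made up purely for illustration and is not taken from the corpus.
# Illustrative only: the bigram tokenizer applied to a made-up sentence
BigramTokenizer("to be or not to be")
# should return something like: "to be" "be or" "or not" "not to" "to be"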
After creating document-term matrices for unigrams, bigrams and trigrams, their term frequencies can be summed across documents to obtain total frequencies. Let's start by finding the most frequent unigrams.
Unigrams
unigrams_frequency <- sort(colSums(as.matrix(unigrams)),decreasing = TRUE)
unigrams_freq_df <- data.frame(word = names(unigrams_frequency), frequency = unigrams_frequency)
head(unigrams_freq_df, 10)
## word frequency
## said said 2912
## will will 2801
## one one 2613
## like like 2397
## get get 2289
## time time 2215
## just just 2201
## can can 2071
## year year 2052
## make make 1729
Plotting the unigram frequencies as a bar chart
unigrams_freq_df %>%
filter(frequency > 1000) %>%
ggplot(aes(reorder(word,-frequency), frequency)) +
geom_bar(stat = "identity") +
ggtitle("Unigrams with frequencies > 1000") +
xlab("Unigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
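The wordcloud package loaded at the start has not been used so far. As an optional illustration (a sketch added here, not part of the original analysis), the same unigram frequencies could also be visualized as a word cloud.
# Optional sketch: word cloud of the most frequent unigrams
wordcloud(words = unigrams_freq_df$word,
          freq = unigrams_freq_df$frequency,
          max.words = 100,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))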
Bigrams
bigrams_frequency <- sort(colSums(as.matrix(bigrams)),decreasing = TRUE)
bigrams_freq_df <- data.frame(word = names(bigrams_frequency), frequency = bigrams_frequency)
head(bigrams_freq_df, 10)
## word frequency
## last year last year 211
## new york new york 176
## high school high school 167
## right now right now 158
## look like look like 154
## year ago year ago 147
## last week last week 117
## dont know dont know 110
## feel like feel like 106
## st loui st loui 101
Plotting the bigram frequencies as a bar chart
bigrams_freq_df %>%
filter(frequency > 100) %>%
ggplot(aes(reorder(word,-frequency), frequency)) +
geom_bar(stat = "identity") +
ggtitle("Bigrams with frequencies > 100") +
xlab("Bigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Trigrams
trigrams_frequency <- sort(colSums(as.matrix(trigrams)),decreasing = TRUE)
trigrams_freq_df <- data.frame(word = names(trigrams_frequency), frequency = trigrams_frequency)
head(trigrams_freq_df, 10)
## word frequency
## new york citi new york citi 29
## none repeat scroll none repeat scroll 25
## repeat scroll yellow repeat scroll yellow 25
## stylebackground none repeat stylebackground none repeat 25
## cant wait see cant wait see 17
## two year ago two year ago 17
## u u u u u u 17
## presid barack obama presid barack obama 16
## happi mother day happi mother day 15
## st loui counti st loui counti 14
Plotting the trigram frequencies as a bar chart
trigrams_freq_df %>%
filter(frequency > 10) %>%
ggplot(aes(reorder(word,-frequency), frequency)) +
geom_bar(stat = "identity") +
ggtitle("Trigrams with frequencies > 10") +
xlab("Trigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Based on this analysis, I plan to use the n-gram data frames to calculate the probability of the next word occurring given the previous words. For the Shiny app, the plan is to create a simple interface where the user can enter a string of text; the prediction model will then suggest a list of likely next words.
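As a rough sketch of that idea (the names trigram_model and predictNextWord are hypothetical, introduced only for illustration and not part of the analysis above), the trigram frequency table could be split into a two-word prefix and a predicted last word, with candidates ranked by their observed counts. A complete model would also need to back off to bigrams and unigrams when a prefix has not been seen.
library(dplyr)
library(stringr)

# Split each trigram "w1 w2 w3" into a two-word prefix and the word that follows it
trigram_model <- trigrams_freq_df %>%
  mutate(word = as.character(word),
         prefix = str_replace(word, "\\s+\\S+$", ""),
         prediction = str_extract(word, "\\S+$"))

# Hypothetical helper: given the last two words typed, return the most
# frequent continuations observed in the sampled corpus
predictNextWord <- function(input, n = 3) {
  last_two <- str_extract(tolower(str_trim(input)), "\\S+\\s+\\S+$")
  trigram_model %>%
    filter(prefix == last_two) %>%
    arrange(desc(frequency)) %>%
    head(n) %>%
    pull(prediction)
}

predictNextWord("new york")  # should suggest "citi", given the trigram counts above
In practice, the input string would also need the same cleaning steps (lowercasing, punctuation removal, stemming) that were applied to the corpus, so that the typed prefix matches the stemmed n-grams.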