The objective of this project is to demonstrate interaction with the data, create a basic report of summary statistics about the data sets, and report any interesting findings. This document explains only the major features of the data identified so far and briefly summarizes plans for creating the prediction algorithm and Shiny app, in a way that is understandable to a non-data-scientist manager. I use tables and plots to illustrate important summaries of the data set.
library(tm)
library(corpora)
library(quanteda)
library(wordcloud)
library(ggplot2)
library(dplyr)
library(stringi)
library(RWeka)
library(corpus)
First, I download the data and store it in a folder, then I extract the .txt files. Next, I load the data into R using the base function 'readLines', with another base function, 'file', to open a connection to each file.
if (!file.exists("./download")){dir.create("./download")}  # create a download folder if it does not exist
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#download.file(fileUrl, destfile = "./download/Coursera-SwiftKey.zip")
unzip(zipfile = "./download/Coursera-SwiftKey.zip")
## Read each file through a connection, keeping UTF-8 text and skipping embedded nuls
blogconnect <- file("./final/en_US/en_US.blogs.txt", "r")
blogread <- readLines(blogconnect, skipNul = T, encoding = "UTF-8")
close(blogconnect)
newsconnect <- file("./final/en_US/en_US.news.txt", "r")
newsread <- readLines(newsconnect, skipNul = T, encoding = "UTF-8")
close(newsconnect)
twitterconnect <- file("./final/en_US/en_US.twitter.txt", "r")
tweetread <- readLines(twitterconnect, skipNul = T, encoding = "UTF-8")
close(twitterconnect)
Next, I do some pre-processing: I update the stopword list, create N-gram tokenizer functions, compute summary variables for each source, sample the three articles, and combine the samples for analysis. I then clean the combined sample so it can be tokenized, and build unigram and bigram frequency data frames; the trigrams are summarized separately because of the memory they require.
## Extend the default English stopword list with a few extra common words
upStopword <- c(stopwords("en"),"can","will","want","just","like","may","cant","isnt","also","the","dont")
## RWeka tokenizer functions for unigrams, bigrams and trigrams
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
## Summary statistics for each source: line count, total words,
## characters in the longest line, and average words per line
bloglen <- length(blogread)
blogwcount <- sum(stri_count_words(blogread))
maxblog <- max(nchar(blogread))
blogWavg <- mean(stri_count_words(blogread))
newslen <- length(newsread)
newswcount <- sum(stri_count_words(newsread))
maxnews <- max(nchar(newsread))
newsWavg <- mean(stri_count_words(newsread))
twitterlen <- length(tweetread)
twitterwcount <- sum(stri_count_words(tweetread))
maxtwitter <- max(nchar(tweetread))
twitterWavg <- mean(stri_count_words(tweetread))
data_summary <- data.frame(Data_Source=c("Blog","News","Twitter"),
Number_of_Lines=c(bloglen,newslen,twitterlen),
Number_of_Words=c(blogwcount,newswcount,twitterwcount),
Number_of_Characters_in_Longest_Line=c(maxblog,maxnews,maxtwitter),
Average_Words=c(blogWavg,newsWavg,twitterWavg))
memory.limit(size = 1e+9) ## raise the Windows memory limit (the size argument is in megabytes)
## [1] 1e+09
set.seed(123)
## Sample 5% of the lines from each source and combine them into one corpus
blogSample <- sample(blogread, bloglen * 0.05)
newsSample <- sample(newsread, newslen * 0.05)
twitterSample <- sample(tweetread, twitterlen * 0.05)
corpus <- c(blogSample, newsSample, twitterSample)
## Drop non-ASCII characters, then lower-case, remove stopwords, numbers and
## punctuation, and strip extra whitespace
corpus <- VCorpus(VectorSource(iconv(corpus, "UTF-8", "ASCII", sub = "")))
corpus <- tm_map(corpus, content_transformer(tolower)) %>%
  tm_map(removeWords, upStopword) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(PlainTextDocument) %>%
  tm_map(stripWhitespace)
## Unigram Processing
uniterms <- TermDocumentMatrix(corpus, control = list(tokenize = unigram))
unifreq <- findFreqTerms(uniterms, lowfreq = 35)  # keep unigrams appearing at least 35 times
unigram_matrix <- sort(rowSums(as.matrix(uniterms[unifreq,])), decreasing = T)
unigram_dataframe <- data.frame(word = names(unigram_matrix), frequency = unigram_matrix)
## Bigram Processing
biterms <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
bifreq <- findFreqTerms(biterms, lowfreq = 15)  # keep bigrams appearing at least 15 times
bigram_matrix <- sort(rowSums(as.matrix(biterms[bifreq,])), decreasing = T)
bigram_dataframe <- data.frame(word = names(bigram_matrix), frequency = bigram_matrix)
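Memory permitting, a trigram frequency table could be built the same way. I do not run this step here because of the memory it requires, and instead report trigram counts with a term-statistics table further below; the sketch that follows is illustrative only, and the lowfreq cutoff of 10 is an assumption rather than a tuned value.
## Trigram Processing (sketch only, not run here because of the memory cost)
triterms <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
trifreq <- findFreqTerms(triterms, lowfreq = 10)  # illustrative cutoff, not tuned
trigram_matrix <- sort(rowSums(as.matrix(triterms[trifreq,])), decreasing = T)
trigram_dataframe <- data.frame(word = names(trigram_matrix), frequency = trigram_matrix)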
data_summary
## Data_Source Number_of_Lines Number_of_Words
## 1 Blog 899288 37546239
## 2 News 77259 2674536
## 3 Twitter 2360148 30093413
## Number_of_Characters_in_Longest_Line Average_Words
## 1 37546239 41.75107
## 2 2674536 34.61779
## 3 30093413 12.75065
The Blog article contains 899288 lines of text and 37546239 words, with an average of about 42 words per line.

### News Article

The News article contains 77259 lines of text and 2674536 words, with an average of about 35 words per line.

### Twitter Article

The Twitter article contains 2360148 lines of text and 30093413 words, with an average of about 13 words per line.
After combining the 5% samples from each of the three articles above, we obtain the unigram and bigram frequency tables given below:
print("Unigram")
## [1] "Unigram"
head(unigram_dataframe) #First 6 Unigrams rows
## word frequency
## one one 10616
## get get 9251
## time time 8500
## love love 7688
## good good 7617
## now now 7344
print("Bigram")
## [1] "Bigram"
head(bigram_dataframe) #First 6 Bigrams rows
## word frequency
## right now right now 1155
## last night last night 696
## looking forward looking forward 510
## happy birthday happy birthday 448
## new york new york 437
## first time first time 418
The tables show the unigrams and bigrams with their corresponding frequencies in decreasing order.
This section contains visualizations of the data. First, I plot a word cloud of the 150 most frequent unigrams. Then I plot bar charts of the most frequent unigrams and bigrams. The trigram data is presented as a table rather than a plot because of the memory required.
wordcloud(words = unigram_dataframe$word , freq = unigram_dataframe$frequency, max.words = 150, scale = c(5,0.6) , rot.per = 0.5, use.r.layout = F, colors = brewer.pal(8,"Dark2"))
# Unigram Bar Chart
gg_unigram <- ggplot(data = unigram_dataframe[1:25,], aes(y = reorder(word,frequency), x = frequency))
gg_unigram <- gg_unigram + geom_bar(stat = "identity", fill = "green")
gg_unigram <- gg_unigram + ggtitle("Articles Unigram with corresponding Frequency Bar in Decreasing Order") + ylab("Words") + xlab("Frequency")
gg_unigram
# Bigram Bar Chart
gg_bigram <- ggplot(data = bigram_dataframe[1:20,], aes(y = reorder(word,frequency), x = frequency)) + geom_bar(stat = "identity", fill = "orange")
gg_bigram <- gg_bigram + ggtitle("Articles Bigram with corresponding Frequency Bar in Decreasing Order") + ylab("Words") + xlab("Frequency")
gg_bigram
term_stats(corpus,ngrams = 3,types = T)
## term type1 type2 type3 count support
## 1 happy mothers day happy mothers day 176 174
## 2 let us know let us know 123 123
## 3 happy new year happy new year 96 95
## 4 new york city new york city 75 73
## 5 cinco de mayo cinco de mayo 62 61
## 6 looking forward seeing looking forward seeing 56 56
## 7 st patricks day st patricks day 40 35
## 8 new years eve new years eve 36 34
## 9 hope great day hope great day 34 34
## 10 new york times new york times 34 34
## 11 new york ny new york ny 30 30
## 12 two years ago two years ago 30 30
## 13 come see us come see us 29 29
## 14 really looking forward really looking forward 27 27
## 15 world war ii world war ii 26 25
## 16 come join us come join us 25 25
## 17 dreams come true dreams come true 23 23
## 18 happy valentines day happy valentines day 23 23
## 19 look forward seeing look forward seeing 23 23
## 20 love love love love love love 23 23
## ⋮ (1451243 rows total)
The prediction model will be based on the N-gram tables created above, ordered from highest to lowest frequency. Given the first letters of a word, the model will suggest a completion from the unigram tokens; given one word, it will suggest the next word from the bigram tokens; and given the two preceding words, it will suggest a third word from the trigram tokens. These predictions will also chain together: the model completes a unigram from the letters typed, makes a bigram prediction from that unigram, and then a trigram prediction from the bigram, so that it can suggest a whole phrase from a single word of input.
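As an illustration only, not the final model, below is a minimal sketch of such a lookup-based predictor built on the frequency tables above. It assumes a trigram_dataframe built analogously to bigram_dataframe, as sketched earlier, and backs off from trigrams to bigrams to the single most frequent unigram; because each table is already sorted in decreasing frequency, the first matching row is the best candidate.
## Sketch only: suggest the next word from the highest-frequency N-gram that
## starts with the user's input, backing off from trigrams to bigrams to the
## most frequent unigram. Assumes trigram_dataframe exists (see sketch above).
predict_next <- function(input) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  n <- length(words)
  if (n >= 2) {  # try the trigram table with the last two words
    prefix <- paste(words[n - 1], words[n])
    hits <- trigram_dataframe[startsWith(as.character(trigram_dataframe$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  if (n >= 1) {  # back off to the bigram table with the last word
    hits <- bigram_dataframe[startsWith(as.character(bigram_dataframe$word), paste0(words[n], " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  as.character(unigram_dataframe$word[1])  # last resort: the most frequent unigram
}
predict_next("right")  # expected to return "now", given the bigram table shown above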