The goal of this project is just to display that you've gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
The Capstone dataset includes corpora in four languages (Russian, English, French and German), drawn from Twitter, blog and news texts. As English is the only one of the four languages I speak, this analysis covers the English dataset.
The dataset can be obtained from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
# Load the usual libraries and read the files with readLines(), as
# read.csv() and read.table() were failing to load the data properly
library(ggplot2)
library(dplyr)
library(tm)
library(ngram)
library(SnowballC)
library(wordcloud)
library(RWeka)
library(rJava)
library(plyr)
# Download and unzip the file; left commented out so the report
# knits faster
#fileurl = 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
#if (!file.exists('./CapstoneCorpus.zip')){
# download.file(fileurl,'./CapstoneCorpus.zip', mode = 'wb')
# unzip("CapstoneCorpus.zip")
#}
theBlog <- readLines("./final/en_US/en_US.blogs.txt", encoding="UTF-8")
theNews <- readLines("./final/en_US/en_US.news.txt", encoding="UTF-8")
theTweets <- readLines("./final/en_US/en_US.twitter.txt", encoding="UTF-8")
my_stat <- function(file_loc, variable)
{
    fileSize   <- paste0(round((file.info(file_loc)$size/1024)/1024, 2), " MB")
    totalChars <- sum(nchar(variable))          # characters in the text itself, not in the file path
    rowNums    <- length(variable)
    numWords   <- wordcount(variable)
    maxChars   <- which.max(nchar(variable))    # index of the longest line
    cat(paste0("Size: ", fileSize, "\n",
               "Number of Characters: ", totalChars, "\n",
               "Number of Rows: ", rowNums, "\n",
               "Number of Words: ", numWords, "\n",
               "Max Line: ", maxChars, "\n"))
}
blogStats <- my_stat("./final/en_US/en_US.blogs.txt", theBlog)
## Size: NA MB
## Number of Characters: 15
## Number of Rows: 899288
## Number of Words: 37334131
## Max Line: 483415
newsStats <- my_stat("./final/en_US/en_US.news.txt", theNews)
## Size: NA MB
## Number of Characters: 14
## Number of Rows: 77259
## Number of Words: 2643969
## Max Line: 14556
tweetStats <- my_stat("./final/en_US/en_US.twitter.txt", theTweets)
## Size: NA MB
## Number of Characters: 17
## Number of Rows: 2360148
## Number of Words: 30373543
## Max Line: 26
Since we are working with a huge dataset, we sample 4,000 lines from each of the three sources (12,000 lines in total) and merge them into a single corpus.
set.seed(8888)
blogSample <- iconv(sample(theBlog, 4000, replace=FALSE), "latin1", "ASCII", sub="")
newsSample <- iconv(sample(theNews, 4000, replace=FALSE), "latin1", "ASCII", sub="")
tweetSample <- iconv(sample(theTweets, 4000, replace=FALSE), "latin1", "ASCII", sub="")
# Free the memory held by the full datasets now that the samples are drawn
theBlog <- NA
theNews <- NA
theTweets <- NA
theCorpus <- c(blogSample, newsSample, tweetSample)
# Check for NA values, but there were none
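# Illustrative only: a one-line assertion of the NA check described above
stopifnot(!anyNA(theCorpus))   # passes, since no NA values were found in the sample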
# Remove punctuation, digits, quotation marks etc.
theCorpus <- tolower(theCorpus)
theCorpus <- removeWords(theCorpus, stopwords("english"))
convertCorpus <- Corpus(VectorSource(theCorpus))
funcs <- list(removePunctuation, removeNumbers, stripWhitespace)
finalCorpus <- tm_map(convertCorpus, FUN=tm_reduce, tmFuns = funcs)
# For each word-length bucket between 4 and 13 characters, explore the top 10 most frequent words within the samples
DTM4 <- TermDocumentMatrix(finalCorpus, control=list(wordLengths=c(4, 5)))
DTM6 <- TermDocumentMatrix(finalCorpus, control=list(wordLengths=c(6, 7)))
DTM8 <- TermDocumentMatrix(finalCorpus, control=list(wordLengths=c(8, 9)))
DTM10 <- TermDocumentMatrix(finalCorpus, control=list(wordLengths=c(10, 11)))
DTM12 <- TermDocumentMatrix(finalCorpus, control=list(wordLengths=c(12, 13)))
DTM_ALL <- TermDocumentMatrix(finalCorpus, control=list(wordLengths=c(4, 20)))
# Top X most frequent words for a single term-document matrix
topWords <- function(DTM, X)
{
return(sort(rowSums(as.matrix(DTM)), decreasing=TRUE)[1:X])
}
DTMs <- list(DTM4, DTM6, DTM8, DTM10, DTM12)
# Combine the matrices: for each word-length bucket, sort the words by frequency and
# keep the ten most frequent (one column per bucket)
top_words <- lapply(1:length(DTMs), function(x) sort(rowSums(as.matrix(DTMs[[x]])), decreasing=TRUE))
top_words_df <- t(ldply(1:length(top_words), function(i) head(names(top_words[[i]]),10)))
# Plotting of the most frequent words with 12 or 13 characters
wordFreqExp <- topWords(DTM12, 20)
wordFreqExp <- as.data.frame(wordFreqExp)
plotThis <- data.frame(word=rownames(wordFreqExp), freq=wordFreqExp$wordFreqExp)
top_words_df
## [,1] [,2] [,3] [,4] [,5]
## V1 "will" "people" "something" "everything" "international"
## V2 "said" "really" "children" "information" "relationship"
## V3 "just" "little" "business" "government" "particularly"
## V4 "like" "things" "different" "university" "professional"
## V5 "time" "around" "everyone" "especially" "conversation"
## V6 "good" "school" "actually" "definitely" "organization"
## V7 "also" "always" "anything" "experience" "understanding"
## V8 "know" "another" "including" "department" "investigation"
## V9 "first" "thanks" "together" "conference" "unfortunately"
## V10 "back" "better" "probably" "california" "neighborhood"
# Unigram - Most Frequent Single Words
q <- ggplot(data=plotThis, aes(x=reorder(word, -freq), y=freq)) + geom_col(aes(fill=freq)) +
    scale_fill_continuous(high = "#132B43", low = "#56B1F7") +
    theme(axis.text.x = element_text(angle=90)) + xlab("Most frequent Words - Unigrams") +
    ylab("Frequency")
q
# Word-Cloud of most frequent words between 4 and 20 letters
WordCloud <- topWords(DTM_ALL, 250)
WordCloud <- as.data.frame(WordCloud)
finalCloud <- data.frame(word=rownames(WordCloud), freq=WordCloud$WordCloud)
wordcloud(words=finalCloud$word, freq=finalCloud$freq, min.freq=1, max.words=250, random.order=FALSE, rot.per=0.4, colors=brewer.pal(8, "Dark2"))
# Bigrams and Trigrams
# The tokenizer expects plain text, so convert the tm Corpus back to a character vector first
corpusText <- sapply(finalCorpus, as.character)
biGram <- NGramTokenizer(corpusText, Weka_control(min=2, max=2))
threeGram <- NGramTokenizer(corpusText, Weka_control(min=3, max=3))
theBiWords <- data.frame(table(biGram))
theBiWords <- theBiWords[order(theBiWords$Freq, decreasing=TRUE),]
theTriWords <- data.frame(table(threeGram))
theTriWords <- theTriWords[order(theTriWords$Freq, decreasing=TRUE),]
biGrPlot <- ggplot(theBiWords[1:20, ], aes(x=reorder(biGram, -Freq), y=Freq)) +
    geom_bar(stat="identity", col="maroon", fill="maroon") +
    xlab("Most frequent Words - Bi-Grams") +
    theme(axis.text.x=element_text(angle=45, hjust=1))
biGrPlot
threeGrPlot <- ggplot(theTriWords[1:20, ], aes(x=reorder(threeGram, -Freq), y=Freq)) +
    geom_bar(stat="identity", col="maroon", fill="maroon") +
    xlab("Most frequent Words - Three-Grams") +
    theme(axis.text.x=element_text(angle=45, hjust=1))
threeGrPlot
This report covers only a sample of the texts, as a first step with the actual data that will feed a predictive model. That model will be embedded in a Shiny app which suggests the next word(s) to follow the few words a user has entered. The model will also need tuning (hyperparameter optimization) to reach the best achievable accuracy, and the sample size may be increased if the predictions fall short of expectations.

The model will be built from n-grams of different orders (possibly up to 5 or 6 consecutive words). Not much attention will be devoted to the visual design of the application.
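To make the plan above a little more concrete, the sketch below shows the kind of lookup the eventual algorithm could perform, using the bigram and trigram frequency tables built in this report (theBiWords, theTriWords) and the topWords() helper. The function name predictNextWord and its simple longest-match-first backoff are illustrative assumptions only, not the final implementation, which will use larger samples, higher-order n-grams and proper smoothing.

# Illustrative sketch only: a naive "longest match first" lookup over the n-gram
# tables above. predictNextWord() is a hypothetical helper, not the final model.
predictNextWord <- function(phrase, triTable = theTriWords, biTable = theBiWords, n = 3)
{
    tokens <- unlist(strsplit(tolower(trimws(phrase)), "\\s+"))

    # Try the trigram table first, matching on the last two words typed
    if (length(tokens) >= 2) {
        prefix  <- paste(tail(tokens, 2), collapse = " ")
        matches <- triTable[grepl(paste0("^", prefix, " "), triTable[, 1]), ]
        if (nrow(matches) > 0)
            return(head(unique(sub(paste0("^", prefix, " "), "", matches[, 1])), n))
    }

    # Back off to the bigram table, matching on the last word only
    if (length(tokens) >= 1) {
        prefix  <- tail(tokens, 1)
        matches <- biTable[grepl(paste0("^", prefix, " "), biTable[, 1]), ]
        if (nrow(matches) > 0)
            return(head(unique(sub(paste0("^", prefix, " "), "", matches[, 1])), n))
    }

    # Last resort: the overall most frequent single words
    names(topWords(DTM_ALL, n))
}

# Example (results depend on the random sample drawn above):
# predictNextWord("thanks for the")

In the Shiny app, a function along these lines would be called reactively each time the user's input changes, with the returned candidates shown as suggestions.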