Assignment overview

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Introduction

Loading the file and reading it

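# Packages used in this report
library(ngram)    # wordcount() (assumed to come from the ngram package)
library(RWeka)    # NGramTokenizer(), Weka_control()
library(dplyr)    # %>%, arrange(), desc()
library(ggplot2)  # plots

# Download and unzip the data set only if the zip file is not already present
destfile <- "Coursera-SwiftKey.zip"
if (!file.exists(destfile)) {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, destfile, method = "curl")
  unzip(destfile)
}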
news <- readLines("final/en_US/en_US.news.txt", encoding = 'UTF-8',warn = FALSE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = 'UTF-8',warn = FALSE)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = 'UTF-8',warn = FALSE)
#linecount
line_news<-length(news)
line_twitter<-length(twitter)
line_blogs<-length(blogs)

#wordcount

wc_news<-wordcount(news)
wc_twitter<-wordcount(twitter)
wc_blogs<-wordcount(blogs)

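#basic data table: line and word counts per source
line_counts <- rbind(line_news, line_twitter, line_blogs)
word_counts <- rbind(wc_news, wc_twitter, wc_blogs)
corpus_summary <- as.data.frame(cbind(line_counts, word_counts))
names(corpus_summary) <- c("nr of lines", "nr of words")
rownames(corpus_summary) <- c("news", "twitter", "blogs")
corpus_summary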
##         nr of lines nr of words
## news          77259     2643969
## twitter     2360148    30373543
## blogs        899288    37334131
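
As a cross-check on the counts above, line and word totals can also be approximated in base R. The following is only a minimal sketch: `toy_lines` and `count_words` are hypothetical names introduced here, and splitting on whitespace may not match exactly what `wordcount()` does on the real files.

```r
# Toy stand-in for a corpus; the real vectors come from readLines().
toy_lines <- c("The quick brown fox", "jumps over", "the lazy dog today")

# Split each line on runs of whitespace and count the resulting pieces.
# An empty line yields zero tokens.
count_words <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  sum(lengths(tokens))
}

# One-row summary in the same spirit as the table above.
toy_summary <- data.frame(
  lines     = length(toy_lines),
  words     = count_words(toy_lines),
  max_chars = max(nchar(toy_lines))
)
print(toy_summary)
```

Applied to `news`, `twitter` and `blogs`, the same helper would give a quick sanity check on the table, at the cost of a slower run on the full files.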

Sampling

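# Draw a reproducible 1% random sample of lines from each source
set.seed(11000)
c_blogs <- sample(blogs, length(blogs)*0.01)
c_news <- sample(news, length(news)*0.01)
c_twitter <- sample(twitter, length(twitter)*0.01)
# Combine the three samples into one character vector
c_combi <- c(c_blogs, c_news, c_twitter)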
unigram_combi <- NGramTokenizer(c_combi, Weka_control(min = 1, max = 1))
bigram_combi <- NGramTokenizer(c_combi, Weka_control(min = 2, max = 2)) 
trigram_combi <- NGramTokenizer(c_combi, Weka_control(min = 3, max = 3)) 

unigram_combi<-data.frame(table(unigram_combi))%>%arrange(desc(Freq))
bigram_combi<-data.frame(table(bigram_combi))%>%arrange(desc(Freq))
trigram_combi<-data.frame(table(trigram_combi))%>%arrange(desc(Freq))

df_ngram<-as.data.frame(cbind(unigram_combi[1:15,],bigram_combi[1:15,],trigram_combi[1:15,]))
names(df_ngram)[c(2,4,6)]<-c("Freq1","Freq2","Freq3")
df_ngram
##    unigram_combi Freq1 bigram_combi Freq2  trigram_combi Freq3
## 1            the 26535       of the  2526        I don t   358
## 2             to 18620       in the  2364        I can t   211
## 3              I 16353          I m  1547       a lot of   181
## 4              a 14869      for the  1376 Thanks for the   180
## 5            and 14792       to the  1327     one of the   168
## 6             of 12985       on the  1214        I m not   159
## 7             in  9581        to be  1152        to be a   148
## 8            you  7957        don t   872    going to be   124
## 9             is  7906       at the   860      I want to   123
## 10           for  7372      and the   736     be able to   121
## 11          that  7120       I have   725     don t know   107
## 12            it  6910         is a   723       I have a   106
## 13            on  5445         it s   717       I didn t   104
## 14            my  4975        I was   699     the end of   102
## 15             s  4680         in a   691      I ve been   101
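
A frequency table like the one above can also be used to ask how many distinct tokens are needed to cover a given share of all token occurrences. The sketch below uses a hypothetical toy table (`toy_unigrams`) and a hypothetical helper (`tokens_for_coverage`); the numbers are illustrative only and are not taken from the sampled corpus.

```r
# Hypothetical frequency table with the same shape as unigram_combi:
# one token column and one Freq column.
toy_unigrams <- data.frame(
  token = c("the", "to", "a", "and", "of"),
  Freq  = c(50, 20, 15, 10, 5)
)

# Smallest number of top-ranked tokens whose counts cover at least
# `target` (a fraction between 0 and 1) of all token occurrences.
tokens_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  cumulative <- cumsum(freq) / sum(freq)
  which(cumulative >= target)[1]
}

tokens_for_coverage(toy_unigrams$Freq, 0.6)
tokens_for_coverage(toy_unigrams$Freq, 0.9)
```

Running the same helper on `unigram_combi$Freq` would indicate how small a vocabulary the eventual app might get away with, which matters for memory use in a Shiny deployment.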

Plots

# Bar chart of the 15 most frequent unigrams, ordered by frequency
ggplot(df_ngram, aes(x=reorder(unigram_combi,Freq1), y=Freq1)) +
  geom_bar(stat="identity", color="black")+
  xlab("Unigrams") + ylab("Frequency")+
  ggtitle("15 Most Common Unigrams")+
  theme(axis.text.x=element_text(angle=90, hjust=1))

# Bar chart of the 15 most frequent bigrams
ggplot(df_ngram, aes(x=reorder(bigram_combi,Freq2), y=Freq2)) +
  geom_bar(stat="identity", color="black")+
  xlab("Bigrams") + ylab("Frequency")+
  ggtitle("15 Most Common Bigrams")+
  theme(axis.text.x=element_text(angle=90, hjust=1))

# Bar chart of the 15 most frequent trigrams
ggplot(df_ngram, aes(x=reorder(trigram_combi,Freq3), y=Freq3)) +
  geom_bar(stat="identity", color="black")+
  xlab("Trigrams") + ylab("Frequency")+
  ggtitle("15 Most Common Trigrams")+
  theme(axis.text.x=element_text(angle=90, hjust=1))

Conclusion and next steps

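The exploratory analysis above shows that the most frequent unigrams are common function words ("the", "to", "I"), and that many of the top bigrams and trigrams are contractions split at the apostrophe ("I don t", "I can t"), which suggests the tokenizer drops apostrophes and that some text cleaning is needed before modelling. The next steps are to:

a. Build an ML predictive algorithm.
b. Build a Shiny app that suggests the most likely next word after a phrase is typed.
c. Prepare a pitch about the app and publish it on the shinyapps.io server.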
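
One possible shape for the prediction step is a simple back-off lookup: try the last two typed words against a trigram table, fall back to the last word against a bigram table, and otherwise return a default word. This is only a sketch under stated assumptions, not the planned algorithm. The tables `toy_trigrams` and `toy_bigrams`, their `history`/`nxt` columns, and the function `predict_next` are all hypothetical, and the counts are made up for illustration.

```r
# Hypothetical n-gram tables: `history` holds the preceding word(s),
# `nxt` the word that follows, `Freq` an illustrative count.
toy_trigrams <- data.frame(
  history = c("one of", "one of", "a lot"),
  nxt     = c("the", "my", "of"),
  Freq    = c(3, 1, 4),
  stringsAsFactors = FALSE
)
toy_bigrams <- data.frame(
  history = c("of", "of", "lot"),
  nxt     = c("the", "a", "of"),
  Freq    = c(5, 2, 4),
  stringsAsFactors = FALSE
)

# Return the most frequent continuation, backing off from trigrams
# to bigrams to a fixed fallback word.
predict_next <- function(phrase, trigrams, bigrams, fallback = "the") {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  n <- length(words)
  # Try the last two words against the trigram table first.
  if (n >= 2) {
    key <- paste(words[n - 1], words[n])
    hits <- trigrams[trigrams$history == key, ]
    if (nrow(hits) > 0) return(hits$nxt[which.max(hits$Freq)])
  }
  # Back off to the last word against the bigram table.
  if (n >= 1) {
    hits <- bigrams[bigrams$history == words[n], ]
    if (nrow(hits) > 0) return(hits$nxt[which.max(hits$Freq)])
  }
  # Nothing matched: return the fallback word.
  fallback
}

predict_next("I have a lot", toy_trigrams, toy_bigrams)
predict_next("because of", toy_trigrams, toy_bigrams)
```

In the real app the two tables would have to be derived from `bigram_combi` and `trigram_combi` by splitting each n-gram into its leading words and its final word; a smoothed model could later replace the raw-count lookup.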