The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
#setwd("C:/Users/Borja/Documents/Workspace coursera/10.-Data_Science_Capstone")
setwd("C:/Users/F36BPC0/Documents/10.-Data_Science_Capstone/Week2_Modeling")
unzip("Coursera-SwiftKey.zip")
list.files("final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
We will only analyze the en_US files in this milestone:
list.files("final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
con<- file("final/en_US/en_US.twitter.txt")
tw <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
length(tw)
## [1] 2360148
con2<- file("final/en_US/en_US.blogs.txt")
blogs <- readLines(con2, encoding = "UTF-8", skipNul = TRUE)
con3<- file("final/en_US/en_US.news.txt")
news <- readLines(con3, encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con3, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'final/en_US/en_US.news.txt'
close(con)
close(con2)
close(con3)
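Before moving on, a quick numerical overview of the three files helps put them in context. The sketch below computes line counts, approximate word counts and file sizes; the use of the stringi package here is an assumption, and any word-counting approach would work just as well:
library(stringi)
# Approximate word counts (whitespace-delimited words) for each corpus
word_counts <- sapply(list(blogs = blogs, news = news, twitter = tw),
                      function(x) sum(stri_count_words(x)))
# File sizes in megabytes
file_sizes <- file.size(c("final/en_US/en_US.blogs.txt",
                          "final/en_US/en_US.news.txt",
                          "final/en_US/en_US.twitter.txt")) / 1024^2
data.frame(lines = c(length(blogs), length(news), length(tw)),
           words = word_counts,
           size_MB = round(file_sizes, 1))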
We will compute these frequency distributions using the first 10000 tweets.
tw10000 <- tw[1:10000]
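Taking the first 10000 tweets keeps things simple; a random sample would be more representative of the whole file. A minimal sketch of that alternative (the seed value is arbitrary):
set.seed(1234)                       # arbitrary seed for reproducibility
tw10000 <- sample(tw, 10000)         # 10000 randomly chosen tweets instead of the first 10000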
The first step of pre-processing the text is tokenization. Word tokenization is the process of splitting a large sample of text into words. This is a requirement in natural language processing tasks where each word needs to be captured and subjected to further analysis (for example, by a machine learning algorithm that classifies and counts words for a particular sentiment).
Before tokenization, we first remove some items from the text: numbers, punctuation signs and extra whitespace.
library(tm)
## Loading required package: NLP
corpus <- VCorpus(VectorSource(tw10000)) # Build the main corpus
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
#Generate new dataset
tw10000_newdata<-data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)
Then, we will use the textreuse library for tokenization. We will perform three tokenizations: using 1-gram tokenization, we will get the most frequent words in the first 10000 tweets. We will follow the same procedure for 2-grams and 3-grams to obtain the most frequent word combinations.
library(textreuse)
tokenize1 <- sapply(1:dim(tw10000_newdata)[1], function(x){
  if(length(gregexpr(" ", tw10000_newdata[x,])[[1]]) > 1+1){ # number of spaces: require at least 2 words (" aa " would count as 2)
    tokenize_ngrams(tw10000_newdata[x,], n = 1)
  }})
tokenize2 <- sapply(1:dim(tw10000_newdata)[1], function(x){
  if(length(gregexpr(" ", tw10000_newdata[x,])[[1]]) > 2+1){ # require enough words for a 2-gram
    tokenize_ngrams(tw10000_newdata[x,], n = 2)
  }})
tokenize3 <- sapply(1:dim(tw10000_newdata)[1], function(x){
  if(length(gregexpr(" ", tw10000_newdata[x,])[[1]]) > 3+1){ # require enough words for a 3-gram
    tokenize_ngrams(tw10000_newdata[x,], n = 3)
  }})
tokenize1 <- as.factor(unlist(tokenize1))
tokenize2 <- as.factor(unlist(tokenize2))
tokenize3 <- as.factor(unlist(tokenize3))
sort(table(tokenize1),decreasing=TRUE)[1:5]
## tokenize1
## the to i a you
## 3853 3260 3006 2587 2239
sort(table(tokenize2),decreasing=TRUE)[1:5]
## tokenize2
## in the for the of the to be on the
## 322 295 245 222 194
sort(table(tokenize3),decreasing=TRUE)[1:5]
## tokenize3
## thanks for the thank you for going to be i love you cant wait to
## 71 37 35 34 33
Now we will plot the 10 most frequent words across all the tweets (unigrams, i.e. n-grams of size 1):
tab1 <- as.data.frame(sort(table(tokenize1),decreasing=TRUE)[1:10])
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
p <- ggplot(tab1, aes(tokenize1, Freq))
p <- p + geom_bar(stat="identity", fill = "Red")
p <- p + geom_text(aes(label=Freq), vjust=-0.1,cex=3)
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
The 10 most frequent bigrams (n-grams of 2) are:
tab2 <- as.data.frame(sort(table(tokenize2),decreasing=TRUE)[1:10])
p <- ggplot(tab2, aes(tokenize2, Freq))
p <- p + geom_bar(stat="identity", fill = "Green")
p <- p + geom_text(aes(label=Freq), vjust=-0.1,cex=3)
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
The 10 most frequent trigrams (n-grams of 3) are:
tab3 <- as.data.frame(sort(table(tokenize3),decreasing=TRUE)[1:10])
p <- ggplot(tab3, aes(tokenize3, Freq))
p <- p + geom_bar(stat="identity", fill = "Blue")
p <- p + geom_text(aes(label=Freq), vjust=-0.1,cex=3)
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
An interesting finding could be a wordcloud, where the most frequent terms are shown, as below:
library(wordcloud)
## Loading required package: RColorBrewer
tab1 <- as.data.frame(sort(table(tokenize1),decreasing=TRUE)[1:400])
pal2 <- brewer.pal(8,"Dark2")
wordcloud(tab1$tokenize1, tab1$Freq , colors = pal2, max.words = 300, random.order = FALSE)
It would be interesting to find an optimized way to do this procedure, as it could take a large amount of time if we tried to get the most frequent words in the whole dataset (remember that in this example we have only taken into account the first 10000 tweets of en_US.twitter.txt).
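One possible optimization, sketched below under the assumption that the quanteda package is available, is to tokenize the whole character vector in a single call instead of looping over individual tweets; quanteda's tokenizers are implemented in C++ and scale much better to the full corpora:
library(quanteda)
# Tokenize the whole sample at once, removing numbers and punctuation
toks <- tokens(tw10000, remove_numbers = TRUE, remove_punct = TRUE)
toks <- tokens_tolower(toks)
# Build bigrams and count them through a document-feature matrix
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
topfeatures(dfm(bigrams), 10)   # 10 most frequent bigrams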
In the future, a Shiny app could be built that allows the user to modify some parameters.
The most basic example would be a Shiny app where the user is able to modify the n-gram size. The app would automatically calculate the most frequent n-grams for that value of n.
The user would also have the option to choose how many of the top (most frequent) words to display.
The app could also offer different ways of displaying the most frequent words, such as a histogram or a wordcloud.
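A minimal sketch of such an app is shown below; it assumes the frequency tables computed above (tab1, tab2, tab3) are available in the app's environment and only illustrates the general structure:
library(shiny)
library(ggplot2)

ui <- fluidPage(
  sliderInput("n", "n-gram size", min = 1, max = 3, value = 1),
  sliderInput("top", "Number of top n-grams to show", min = 5, max = 50, value = 10),
  radioButtons("type", "Display as", choices = c("Histogram", "Wordcloud")),
  plotOutput("freqPlot")
)

server <- function(input, output) {
  output$freqPlot <- renderPlot({
    # Pick the precomputed frequency table for the chosen n and keep the top rows
    tab <- head(list(tab1, tab2, tab3)[[input$n]], input$top)
    names(tab) <- c("ngram", "Freq")
    if (input$type == "Histogram") {
      ggplot(tab, aes(ngram, Freq)) +
        geom_bar(stat = "identity") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
    } else {
      wordcloud::wordcloud(tab$ngram, tab$Freq, random.order = FALSE)
    }
  })
}

shinyApp(ui, server)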