The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
This report is built up in the following chapters:
- Loading the data (including a basic report)
- Cleaning the data (including a frequency analysis)
- N-grams (2 and 3)
- Further steps
The data was downloaded and unzipped into a local directory. It contains four subdirectories, one per language (English, German, Finnish and Russian), and each subdirectory contains three documents (blogs, news and twitter). In the first part the English directory is selected and loaded into memory. Before loading, the metadata of the files is examined.
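The chunks below rely on a handful of R packages. The setup chunk is not shown in this report; a minimal sketch of the assumed setup is:
library(tm)        ## corpus handling, cleaning and stopwords()
library(tidytext)  ## tidy() for document-term matrices
library(ggplot2)   ## frequency plots
library(RWeka)     ## NGramTokenizer for the 2- and 3-grams
library(ANLP)      ## predict_Backoff used in the modelling section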
In the next chunk the metadata (filename, size, number of lines and the maximum line length) is determined and printed.
fileInformation <- function(searchfile) {
## Collect the file size (in MB), the number of lines and the length of the longest line
size <- file.info(searchfile)$size/1048576
conn <- file(searchfile, "r")
fulltext <- readLines(conn)
close(conn)
numberlines <- length(fulltext)
maxlinelength <- max(nchar(fulltext))
list(searchfile=searchfile, size=size, numberlines=numberlines, maxlinelength=maxlinelength)
}
## afdrukken (Dutch for "print") prints the file metadata collected by fileInformation
afdrukken <- function(lijst){
cat("File : ", lijst[[1]], "\n")
cat("Size of the file : ", sprintf("%.1f", lijst[[2]]), "MB \n")
cat("Number of lines in file: ", lijst[[3]], "lines \n")
cat("Maximum length of line : ", lijst[[4]], "characters \n")
cat("\n")
}
URLfile <- "C:/Users/menno_000/Documents/R/Course data analytics/Capstone/final/en_US/"
blog_info <- fileInformation(paste0(URLfile,"en_US.blogs.txt"))
news_info <- fileInformation(paste0(URLfile,"en_US.news.txt"))
twitter_info <- fileInformation(paste0(URLfile,"en_US.twitter.txt"))
cat("Documentation for each file: \n")
## Documentation for each file:
afdrukken(blog_info)
## File : C:/Users/menno_000/Documents/R/Course data analytics/Capstone/final/en_US/en_US.blogs.txt
## Size of the file : 200.4 MB
## Number of lines in file: 899288 lines
## Maximum length of line : 40835 characters
afdrukken(news_info)
## File : C:/Users/menno_000/Documents/R/Course data analytics/Capstone/final/en_US/en_US.news.txt
## Size of the file : 196.3 MB
## Number of lines in file: 1010242 lines
## Maximum length of line : 11384 characters
afdrukken(twitter_info)
## File : C:/Users/menno_000/Documents/R/Course data analytics/Capstone/final/en_US/en_US.twitter.txt
## Size of the file : 159.4 MB
## Number of lines in file: 2360148 lines
## Maximum length of line : 213 characters
After reading the files, each one is split into two sections: a training set and a test set. The test set will be used to measure how accurate the algorithm is. The files are too big to be handled quickly, so the training set is split further into a small working file and a larger file for building the real thing.
The test set is 30% of each file and the working set is 5% of the training set. The three working sets are combined into one set, which is the basis of the analysis.
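Because the split is random the exact sizes vary slightly, but the expected number of lines per set can be estimated from the metadata above (a hypothetical check, not part of the original analysis):
## Expected lines per set: 70% train, 30% test, working set = 5% of the training set
round(blog_info$numberlines * c(train = 0.70, test = 0.30, work = 0.70 * 0.05))
round(news_info$numberlines * c(train = 0.70, test = 0.30, work = 0.70 * 0.05))
round(twitter_info$numberlines * c(train = 0.70, test = 0.30, work = 0.70 * 0.05))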
## Reading blogs
con <- file(blog_info[[1]], "r")
blogtext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
## Reading news
con <- file(news_info[[1]], "r")
newstext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
## Reading twitter
con <- file(twitter_info[[1]], "r")
twittertext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
rm(con)
## Splitsenbestand (Dutch for "split file") splits a source into a training, test and working set
## and writes each set to disk. Arguments: splitsbestand = character vector with the text,
## lijst = file metadata, bestandslocatie = output directory, soort = source name.
Splitsenbestand <- function (splitsbestand, lijst, bestandslocatie, soort){
## Sampling: each line is independently assigned to the training set with probability 0.7,
## and each training line to the working set with probability 0.05
set.seed(1000) # standard for sampling
lengte <- lijst[[3]]
trainset <- rbinom(lengte, 1, 0.7)
lengtetrainset <- sum(trainset)
workingset <- rbinom(lengtetrainset, 1, .05)
## Splitting the file into train, test and working sets
splitsbestand_train <- splitsbestand[trainset == 1]
splitsbestand_test <- splitsbestand[trainset == 0]
splitsbestand_work <- splitsbestand_train[workingset == 1]
## Writing the test set
con <- file(paste0(bestandslocatie,"en_US.", soort, "test.txt"), "w")
cat(splitsbestand_test, file = con, sep = "\n")
close(con)
## Writing the training set
con <- file(paste0(bestandslocatie,"en_US.", soort, "train.txt"), "w")
cat(splitsbestand_train, file = con, sep = "\n")
close(con)
## Writing the working set
con <- file(paste0(bestandslocatie,"en_US.", soort, "work.txt"), "w")
cat(splitsbestand_work, file = con, sep = "\n")
close(con)
rm(con)
}
## Arguments: splitsbestand, lijst, bestandslocatie, soort
Splitsenbestand(blogtext, blog_info, URLfile, "blog")
Splitsenbestand(newstext, news_info, URLfile, "news")
Splitsenbestand(twittertext, twitter_info, URLfile, "twitter")
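As a quick check (a hypothetical addition that reuses the helper functions defined above), the metadata of the freshly written working files can be printed:
afdrukken(fileInformation(paste0(URLfile, "en_US.blogwork.txt")))
afdrukken(fileInformation(paste0(URLfile, "en_US.newswork.txt")))
afdrukken(fileInformation(paste0(URLfile, "en_US.twitterwork.txt")))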
In this chunk the working sets are read back in, so the rest of the assignment can be done in a speedy way. The three sets are combined into one working set, which is also written to a file for later use.
## Reading blogs
con <- file(paste0(URLfile,"/en_US.blogwork.txt"), "r")
blogtext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
## Reading news
con <- file(paste0(URLfile,"/en_US.newswork.txt"), "r")
newstext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
## Reading twitter
con <- file(paste0(URLfile,"/en_US.twitterwork.txt"), "r")
twittertext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
rm(con)
## Combining the three working files into one vector and writing it to a combined file
Totaltext <- c(blogtext, newstext, twittertext)
con <- file(paste0(URLfile,"/en_US.combined.txt"))
writeLines(Totaltext, con)
close(con)
## Removing the three source vectors to free memory
rm(con, blogtext, newstext, twittertext)
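A quick look at the combined working set (a hypothetical check, not part of the original analysis) shows how many lines it contains and how long those lines are:
length(Totaltext)
summary(nchar(Totaltext))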
The data contains many elements that add no value for text mining. These elements have to be cleaned or removed: emoticons, numbers, punctuation and extra whitespace are stripped, and capitals are replaced by lower-case characters. After analysing the words with the highest frequency I will clean the data set further.
In the next chunk the data is cleaned so that emoticons, punctuation, numbers and extra whitespace are removed. All capitals in the text are also converted to lower case.
cleaningData <- function(inputfile){
## Dropping non-ASCII characters also removes emoticons
inputfile <- iconv(inputfile, "latin1", "ASCII", sub="")
inputfile <- VectorSource(inputfile)
inputfile <- SimpleCorpus(inputfile, control = list(language = "en"))
inputfile <- tm_map(inputfile, removePunctuation)
inputfile <- tm_map(inputfile, removeNumbers)
inputfile <- tm_map(inputfile, stripWhitespace)
inputfile <- tm_map(inputfile, content_transformer(tolower))
inputfile
}
Totaltext <- cleaningData(Totaltext)
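To illustrate what the cleaning does (a hypothetical example: the same tm transformations applied directly to a made-up line, not taken from the corpus):
stripWhitespace(removeNumbers(removePunctuation(tolower("It's 5 o'clock... Time for 2 COFFEES!!!"))))
## roughly: "its oclock time for coffees"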
The first analysis was done on the working file. The words with the highest frequency are shown in the graph below. Looking at these words, they have no real added value for the predictions. I therefore used the standard stop word list from the tm package and the snowball stop word list; both lists are printed in the appendix.
## Building a document term matrix and count the terms
dtm <- DocumentTermMatrix(Totaltext)
dtm <- aggregate(count ~ term, data = tidy(dtm), sum)
## The terms are sorted so the terms with the highest frequency can be extracted
dtm <- dtm[order(as.numeric(dtm$count), decreasing = TRUE), ]
dtm_head <- head(dtm, 20)
## Count the words in the file and in the top 20
wordsblogtext <- sum(as.numeric(dtm$count))
perctop20 <- round((sum(as.numeric(dtm_head$count))/sum(as.numeric(dtm$count)) *100), 1)
## Building the plot
titel <- paste0("Top 20 words in frequency (", perctop20, "% of the total number of words)")
fig <- ggplot(dtm_head, aes(x= reorder(term, - count), y=count)) + geom_bar(stat="identity", fill = "red")
fig <- fig + ggtitle(titel) + xlab("Word in Corpus") + ylab("Count")
fig <- fig + theme(axis.text.x=element_text(angle=90))
print(fig)
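A related question for the prediction model is how many unique words are needed to cover a given share of all word instances (a hypothetical check, not part of the original analysis):
## dtm is already sorted by decreasing frequency
cumcoverage <- cumsum(as.numeric(dtm$count)) / sum(as.numeric(dtm$count))
c(words_for_50pct = which(cumcoverage >= 0.5)[1], words_for_90pct = which(cumcoverage >= 0.9)[1])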
The combined file is read again so that the analysis without stop words starts from a clean state. After that the cleaning is done again and a new plot shows the most frequent remaining words. The list of removed words is in the appendix.
## Reading textfile
con <- file(paste0(URLfile,"/en_US.combined.txt"), "r")
Totaltext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
rm(con)
cleaningDatawithstopwords <- function(inputfile){
## Same cleaning as before, but now the stop words are removed as well
inputfile <- iconv(inputfile, "latin1", "ASCII", sub="")
inputfile <- VectorSource(inputfile)
inputfile <- SimpleCorpus(inputfile, control = list(language = "en"))
inputfile <- tm_map(inputfile, content_transformer(tolower))
inputfile <- tm_map(inputfile, removeWords, c(stopwords("en"), mystopwords))
inputfile <- tm_map(inputfile, removePunctuation)
inputfile <- tm_map(inputfile, removeNumbers)
inputfile <- tm_map(inputfile, stripWhitespace)
inputfile
}
Totaltext <- cleaningDatawithstopwords(Totaltext)
## Making the document term matrix
dtm <- DocumentTermMatrix(Totaltext)
## Counting the words and extracting the ones with the highest frequency
dtm <- aggregate(count ~ term, data = tidy(dtm), sum)
dtm <- dtm[order(as.numeric(dtm$count), decreasing = TRUE), ]
dtm_head <- head(dtm, 20)
## Building the plot
lesswords <- round((sum(as.numeric(dtm$count))/wordsblogtext) *100, 1)
titel <- paste0("Top 20 words (", lesswords, "% of the original number of words)")
fig <- ggplot(dtm_head, aes(x= reorder(term, - count), y=count)) + geom_bar(stat="identity", fill = "red")
fig <- fig + ggtitle(titel) + xlab("Word in Corpus") + ylab("Count")
fig <- fig + theme(axis.text.x=element_text(angle=90))
print(fig)
In this part the analysis is done on the cleaned data set to look at the most frequent combinations of words: two-word clusters and three-word clusters. These lists will be the basis of the prediction model. The 2-gram table contains pairs of successive words and their frequencies. The number of distinct n-grams tends to explode; this is quantified after the plots below.
In this part the two-word clusters are built and the top 20 combinations (by frequency) are plotted.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
## Building a document term matrix for 2 word clusters
## I had to transform the data into a VCorpus because in my setup the RWeka tokenizer was not working with the SimpleCorpus.
dtm2 <- DocumentTermMatrix(VCorpus(VectorSource(Totaltext)), control = list(tokenize = BigramTokenizer))
## The clusters are sorted so the clusters with the highest frequency can be extracted
dtm2 <- aggregate(count ~ term, data = tidy(dtm2), sum)
dtm2 <- dtm2[order(as.numeric(dtm2$count), decreasing = TRUE), ]
dtm2_head <- head(dtm2, 20)
## Building the plot
titel <- paste0("Top 20 2 gram words ")
fig <- ggplot(dtm2_head, aes(x= reorder(term, - count), y=count)) + geom_bar(stat="identity", fill = "red")
fig <- fig + ggtitle(titel) + xlab("Word in Corpus") + ylab("Count")
fig <- fig + theme(axis.text.x=element_text(angle=90))
print(fig)
The same analysis, but now with three successive words from the texts.
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
## Building a document term matrix for 3 word clusters
dtm3 <- DocumentTermMatrix(VCorpus(VectorSource(Totaltext)), control = list(tokenize = TrigramTokenizer))
## The clusters are sorted so the clusters with the highest frequency can be extracted
dtm3 <- aggregate(count ~ term, data = tidy(dtm3), sum)
dtm3 <- dtm3[order(as.numeric(dtm3$count), decreasing = TRUE), ]
dtm3_head <- head(dtm3, 20)
## Building the plot
titel <- paste0("Top 20 3 gram word clusters")
fig <- ggplot(dtm3_head, aes(x= reorder(term, - count), y=count)) + geom_bar(stat="identity", fill = "red")
fig <- fig + ggtitle(titel) + xlab("Word in Corpus") + ylab("Count")
fig <- fig + theme(axis.text.x=element_text(angle=90))
print(fig)
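The "explosion" mentioned above can be quantified (a hypothetical check, not part of the original analysis) by comparing the number of distinct terms per n-gram order:
c(unigrams = nrow(dtm), bigrams = nrow(dtm2), trigrams = nrow(dtm3))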
In the coming period I will build a prediction model. The biggest challenge is to create a model that performs well. Standard R packages are available for this kind of prediction; in the next chunk one of these models (the back-off model from the ANLP package) is presented. This model has not yet been validated.
names(dtm) <- c("word", "freq")
names(dtm2) <- c("word", "freq")
names(dtm3) <- c("word", "freq")
nGramModelsList <- list(dtm3, dtm2, dtm)
testString <- "I am the one who"
predict_Backoff(testString,nGramModelsList)
## [1] "month"
testString <- "what is my"
predict_Backoff(testString,nGramModelsList)
## [1] "novel"
testString <- "the best movie"
predict_Backoff(testString,nGramModelsList)
## [1] "ve"
The next step is to look into different models and compare them based on accuracy. It is also important to look at the performance of the prediction model. After the model is developed and validated, the Shiny app will be built and presented.
Standard stopword list
stopwords("en")
## [1] "a" "an" "and" "are" "as" "at" "be" "but"
## [9] "by" "for" "if" "in" "into" "is" "it" "no"
## [17] "not" "of" "on" "or" "such" "that" "the" "their"
## [25] "then" "there" "these" "they" "this" "to" "was" "will"
## [33] "with"
Snowball stopword list
mystopwords
## [1] "a" "about" "above" "after" "again"
## [6] "against" "all" "am" "an" "and"
## [11] "any" "are" "aren't" "as" "at"
## [16] "be" "because" "been" "before" "being"
## [21] "below" "between" "both" "but" "by"
## [26] "can" "cannot" "could" "couldn't" "did"
## [31] "didn't" "do" "does" "doesn't" "doing"
## [36] "don't" "down" "during" "each" "few"
## [41] "for" "from" "further" "had" "hadn't"
## [46] "has" "hasn't" "have" "having" "he"
## [51] "her" "here" "hers" "herself" "him"
## [56] "himself" "his" "how" "i" "if"
## [61] "im" "i'd" "i'm" "in" "into"
## [66] "is" "it" "its" "itself" "me"
## [71] "more" "most" "my" "myself" "no"
## [76] "nor" "not" "of" "off" "on"
## [81] "once" "only" "or" "other" "ought"
## [86] "our" "ours" "ourselves" "out" "over"
## [91] "own" "same" "she" "should" "so"
## [96] "some" "such" "than" "that" "the"
## [101] "their" "theirs" "them" "themselves" "then"
## [106] "there" "these" "they" "this" "those"
## [111] "through" "to" "too" "under" "until"
## [116] "up" "very" "was" "we" "were"
## [121] "what" "when" "where" "which" "while"
## [126] "who" "whom" "why" "with" "would"
## [131] "you" "your" "yours" "yourself" "yourselves"