The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
This report is built up in the following chapters:
- Loading the data (including a basic report)
- Cleaning the data (including a frequency analysis)
- N-grams (2 and 3)
- Further steps
The data was downloaded and unzipped into a local directory. It contains four subdirectories, one per language (English, German, Finnish and Russian), and each subdirectory contains three documents (blogs, news and twitter). In the first part the English directory is selected and loaded into memory. Before loading, the metadata of the files is examined.
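The chunks below rely on a handful of R packages. The setup chunk is not shown in this report; a minimal sketch of the assumed setup is:
library(tm)        ## corpus handling, cleaning and stopwords()
library(tidytext)  ## tidy() for document-term matrices
library(ggplot2)   ## frequency plots
library(RWeka)     ## NGramTokenizer for the 2- and 3-grams
library(ANLP)      ## predict_Backoff used in the modelling section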
In the next chunk the metadata (filename, size, number of lines and the maximum line length) is determined and printed.
fileInformation <- function(searchfile) {
## Collect the file size (in MB), the number of lines and the length of the longest line
size <- file.info(searchfile)$size/1048576
conn <- file(searchfile, "r")
fulltext <- readLines(conn)
close(conn)
numberlines <- length(fulltext)
maxlinelength <- max(nchar(fulltext))
list(searchfile=searchfile, size=size, numberlines=numberlines, maxlinelength=maxlinelength)
}
## afdrukken (Dutch for "print") prints the file metadata collected by fileInformation
afdrukken <- function(lijst){
cat("File : ", lijst[[1]], "\n")
cat("Size of the file : ", sprintf("%.1f", lijst[[2]]), "MB \n")
cat("Number of lines in file: ", lijst[[3]], "lines \n")
cat("Maximum length of line : ", lijst[[4]], "characters \n")
cat("\n")
}
URLfile <- "C:/Users/menno_000/Documents/R/Course data analytics/Capstone/final/en_US/"
blog_info <- fileInformation(paste0(URLfile,"en_US.blogs.txt"))
news_info <- fileInformation(paste0(URLfile,"en_US.news.txt"))
twitter_info <- fileInformation(paste0(URLfile,"en_US.twitter.txt"))
cat("Documentation for each file: \n")
## Documentation for each file:
afdrukken(blog_info)
## File : C:/Users/menno_000/Documents/R/Course data analytics/Capstone/final/en_US/en_US.blogs.txt
## Size of the file : 200.4 MB
## Number of lines in file: 899288 lines
## Maximum length of line : 40835 characters
afdrukken(news_info)
## File : C:/Users/menno_000/Documents/R/Course data analytics/Capstone/final/en_US/en_US.news.txt
## Size of the file : 196.3 MB
## Number of lines in file: 1010242 lines
## Maximum length of line : 11384 characters
afdrukken(twitter_info)
## File : C:/Users/menno_000/Documents/R/Course data analytics/Capstone/final/en_US/en_US.twitter.txt
## Size of the file : 159.4 MB
## Number of lines in file: 2360148 lines
## Maximum length of line : 213 characters
After reading the files, each one is split into two sections: a training set and a test set. The test set will be used to measure how accurate the algorithm is. The files are too big to be handled quickly, so the training set is split further into a small working file and a larger file for building the real thing.
The test set is 30% of each file and the working set is 5% of the training set. The three working sets are combined into one set, which is the basis of the analysis.
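Because the split is random the exact sizes vary slightly, but the expected number of lines per set can be estimated from the metadata above (a hypothetical check, not part of the original analysis):
## Expected lines per set: 70% train, 30% test, working set = 5% of the training set
round(blog_info$numberlines * c(train = 0.70, test = 0.30, work = 0.70 * 0.05))
round(news_info$numberlines * c(train = 0.70, test = 0.30, work = 0.70 * 0.05))
round(twitter_info$numberlines * c(train = 0.70, test = 0.30, work = 0.70 * 0.05))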
## Reading blogs
con <- file(blog_info[[1]], "r")
blogtext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
## Reading news
con <- file(news_info[[1]], "r")
newstext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
## Reading twitter
con <- file(twitter_info[[1]], "r")
twittertext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
rm(con)
## Splitsenbestand (Dutch for "split file") splits a source into a training, test and working set
## and writes each set to disk. Arguments: splitsbestand = character vector with the text,
## lijst = file metadata, bestandslocatie = output directory, soort = source name.
Splitsenbestand <- function (splitsbestand, lijst, bestandslocatie, soort){
## Sampling: each line is independently assigned to the training set with probability 0.7,
## and each training line to the working set with probability 0.05
set.seed(1000) # standard for sampling
lengte <- lijst[[3]]
trainset <- rbinom(lengte, 1, 0.7)
lengtetrainset <- sum(trainset)
workingset <- rbinom(lengtetrainset, 1, .05)
## Splitting the file into train, test and working sets
splitsbestand_train <- splitsbestand[trainset == 1]
splitsbestand_test <- splitsbestand[trainset == 0]
splitsbestand_work <- splitsbestand_train[workingset == 1]
## Writing the test set
con <- file(paste0(bestandslocatie,"en_US.", soort, "test.txt"), "w")
cat(splitsbestand_test, file = con, sep = "\n")
close(con)
## Writing the training set
con <- file(paste0(bestandslocatie,"en_US.", soort, "train.txt"), "w")
cat(splitsbestand_train, file = con, sep = "\n")
close(con)
## Writing the working set
con <- file(paste0(bestandslocatie,"en_US.", soort, "work.txt"), "w")
cat(splitsbestand_work, file = con, sep = "\n")
close(con)
rm(con)
}
## Arguments: splitsbestand, lijst, bestandslocatie, soort
Splitsenbestand(blogtext, blog_info, URLfile, "blog")
Splitsenbestand(newstext, news_info, URLfile, "news")
Splitsenbestand(twittertext, twitter_info, URLfile, "twitter")
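As a quick check (a hypothetical addition that reuses the helper functions defined above), the metadata of the freshly written working files can be printed:
afdrukken(fileInformation(paste0(URLfile, "en_US.blogwork.txt")))
afdrukken(fileInformation(paste0(URLfile, "en_US.newswork.txt")))
afdrukken(fileInformation(paste0(URLfile, "en_US.twitterwork.txt")))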
In this chunk the working sets are read back in, so the rest of the assignment can be done in a speedy way. The three sets are combined into one working set, which is also written to a file for later use.
## Reading blogs
con <- file(paste0(URLfile,"/en_US.blogwork.txt"), "r")
blogtext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
## Reading news
con <- file(paste0(URLfile,"/en_US.newswork.txt"), "r")
newstext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
## Reading twitter
con <- file(paste0(URLfile,"/en_US.twitterwork.txt"), "r")
twittertext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
rm(con)
## Combining the three working files into one vector and writing it to a combined file
Totaltext <- c(blogtext, newstext, twittertext)
con <- file(paste0(URLfile,"/en_US.combined.txt"))
writeLines(Totaltext, con)
close(con)
## Removing the three source vectors to free memory
rm(con, blogtext, newstext, twittertext)
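A quick look at the combined working set (a hypothetical check, not part of the original analysis) shows how many lines it contains and how long those lines are:
length(Totaltext)
summary(nchar(Totaltext))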
The data contains many elements that add no value for text mining. These elements have to be cleaned or removed: emoticons, numbers, punctuation and extra whitespace are stripped, and capitals are replaced by lower-case characters. After analysing the words with the highest frequency I will clean the data set further.
In the next chunk the data is cleaned so that emoticons, punctuation, numbers and extra whitespace are removed. All capitals in the text are also converted to lower case.
cleaningData <- function(inputfile){
## Dropping non-ASCII characters also removes emoticons
inputfile <- iconv(inputfile, "latin1", "ASCII", sub="")
inputfile <- VectorSource(inputfile)
inputfile <- SimpleCorpus(inputfile, control = list(language = "en"))
inputfile <- tm_map(inputfile, removePunctuation)
inputfile <- tm_map(inputfile, removeNumbers)
inputfile <- tm_map(inputfile, stripWhitespace)
inputfile <- tm_map(inputfile, content_transformer(tolower))
inputfile
}
Totaltext <- cleaningData(Totaltext)
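To illustrate what the cleaning does (a hypothetical example: the same tm transformations applied directly to a made-up line, not taken from the corpus):
stripWhitespace(removeNumbers(removePunctuation(tolower("It's 5 o'clock... Time for 2 COFFEES!!!"))))
## roughly: "its oclock time for coffees"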
The first analysis was done on the working file. The words with the highest frequency are shown in the graph below. Looking at these words, they have no real added value for the predictions. I therefore used the standard stop word list from the tm package and the snowball stop word list; both lists are printed in the appendix.
## Building a document term matrix and count the terms
dtm <- DocumentTermMatrix(Totaltext)
dtm <- aggregate(count ~ term, data = tidy(dtm), sum)
## The terms are sorted so the terms with the highest frequency can be extracted
dtm <- dtm[order(as.numeric(dtm$count), decreasing = TRUE), ]
dtm_head <- head(dtm, 20)
## Count the words in the file and in the top 20
wordsblogtext <- sum(as.numeric(dtm$count))
perctop20 <- round((sum(as.numeric(dtm_head$count))/sum(as.numeric(dtm$count)) *100), 1)
## Building the plot
titel <- paste0("Top 20 words in frequency (", perctop20, "% of the total number of words)")
fig <- ggplot(dtm_head, aes(x= reorder(term, - count), y=count)) + geom_bar(stat="identity", fill = "red")
fig <- fig + ggtitle(titel) + xlab("Word in Corpus") + ylab("Count")
fig <- fig + theme(axis.text.x=element_text(angle=90))
print(fig)
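A related question for the prediction model is how many unique words are needed to cover a given share of all word instances (a hypothetical check, not part of the original analysis):
## dtm is already sorted by decreasing frequency
cumcoverage <- cumsum(as.numeric(dtm$count)) / sum(as.numeric(dtm$count))
c(words_for_50pct = which(cumcoverage >= 0.5)[1], words_for_90pct = which(cumcoverage >= 0.9)[1])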
The combined file is read again so that the analysis without stop words starts from a clean state. After that the cleaning is done again and a new plot shows the most frequent remaining words. The list of removed words is in the appendix.
## Reading textfile
con <- file(paste0(URLfile,"/en_US.combined.txt"), "r")
Totaltext <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
rm(con)
cleaningDatawithstopwords <- function(inputfile){
## Same cleaning as before, but now the stop words are removed as well
inputfile <- iconv(inputfile, "latin1", "ASCII", sub="")
inputfile <- VectorSource(inputfile)
inputfile <- SimpleCorpus(inputfile, control = list(language = "en"))
inputfile <- tm_map(inputfile, content_transformer(tolower))
inputfile <- tm_map(inputfile, removeWords, c(stopwords("en"), mystopwords))
inputfile <- tm_map(inputfile, removePunctuation)
inputfile <- tm_map(inputfile, removeNumbers)
inputfile <- tm_map(inputfile, stripWhitespace)
inputfile
}
Totaltext <- cleaningDatawithstopwords(Totaltext)
## Making the document term matrix
dtm <- DocumentTermMatrix(Totaltext)
## Counting the words and extracting the ones with the highest frequency
dtm <- aggregate(count ~ term, data = tidy(dtm), sum)
dtm <- dtm[order(as.numeric(dtm$count), decreasing = TRUE), ]
dtm_head <- head(dtm, 20)
## Building the plot
lesswords <- round((sum(as.numeric(dtm$count))/wordsblogtext) *100, 1)
titel <- paste0("Top 20 words (", lesswords, "% of the original number of words)")
fig <- ggplot(dtm_head, aes(x= reorder(term, - count), y=count)) + geom_bar(stat="identity", fill = "red")
fig <- fig + ggtitle(titel) + xlab("Word in Corpus") + ylab("Count")
fig <- fig + theme(axis.text.x=element_text(angle=90))
print(fig)
In this part the analysis is done on the cleaned data set to look at the most frequent combinations of words: two-word clusters and three-word clusters. These lists will be the basis of the prediction model. The 2-gram table contains pairs of successive words and their frequencies. The number of distinct n-grams tends to explode; this is quantified after the plots below.
In this part the two-word clusters are built and the top 20 combinations (by frequency) are plotted.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
## Building a document term matrix for 2 word clusters
## I had to transform the data into a VCorpus because in my setup the RWeka tokenizer was not working with the SimpleCorpus.
dtm2 <- DocumentTermMatrix(VCorpus(VectorSource(Totaltext)), control = list(tokenize = BigramTokenizer))
## The clusters are sorted so the clusters with the highest frequency can be extracted
dtm2 <- aggregate(count ~ term, data = tidy(dtm2), sum)
dtm2 <- dtm2[order(as.numeric(dtm2$count), decreasing = TRUE), ]
dtm2_head <- head(dtm2, 20)
## Building the plot
titel <- paste0("Top 20 2 gram words ")
fig <- ggplot(dtm2_head, aes(x= reorder(term, - count), y=count)) + geom_bar(stat="identity", fill = "red")
fig <- fig + ggtitle(titel) + xlab("Word in Corpus") + ylab("Count")
fig <- fig + theme(axis.text.x=element_text(angle=90))
print(fig)
The same analysis, but now with three successive words from the texts.
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
## Building a document term matrix for 3 word clusters
dtm3 <- DocumentTermMatrix(VCorpus(VectorSource(Totaltext)), control = list(tokenize = TrigramTokenizer))
## The clusters are sorted so the clusters with the highest frequency can be extracted
dtm3 <- aggregate(count ~ term, data = tidy(dtm3), sum)
dtm3 <- dtm3[order(as.numeric(dtm3$count), decreasing = TRUE), ]
dtm3_head <- head(dtm3, 20)
## Building the plot
titel <- paste0("Top 20 3 gram word clusters")
fig <- ggplot(dtm3_head, aes(x= reorder(term, - count), y=count)) + geom_bar(stat="identity", fill = "red")
fig <- fig + ggtitle(titel) + xlab("Word in Corpus") + ylab("Count")
fig <- fig + theme(axis.text.x=element_text(angle=90))
print(fig)
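The "explosion" mentioned above can be quantified (a hypothetical check, not part of the original analysis) by comparing the number of distinct terms per n-gram order:
c(unigrams = nrow(dtm), bigrams = nrow(dtm2), trigrams = nrow(dtm3))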
In the coming period I will build a prediction model. The biggest challenge is to create a model that performs well. Standard R packages are available for this kind of prediction; in the next chunk one of these models (the back-off model from the ANLP package) is presented. This model has not yet been validated.
names(dtm) <- c("word", "freq")
names(dtm2) <- c("word", "freq")
names(dtm3) <- c("word", "freq")
nGramModelsList <- list(dtm3, dtm2, dtm)
testString <- "I am the one who"
predict_Backoff(testString,nGramModelsList)
## [1] "month"
testString <- "what is my"
predict_Backoff(testString,nGramModelsList)
## [1] "novel"
testString <- "the best movie"
predict_Backoff(testString,nGramModelsList)
## [1] "ve"
The next step is to look into different models and compare them based on accuracy. It is also important to look at the performance of the prediction model. After the model is developed and validated, the Shiny app will be built and presented.
Standard stopword list
stopwords("en")
## [1] "a" "an" "and" "are" "as" "at" "be" "but"
## [9] "by" "for" "if" "in" "into" "is" "it" "no"
## [17] "not" "of" "on" "or" "such" "that" "the" "their"
## [25] "then" "there" "these" "they" "this" "to" "was" "will"
## [33] "with"
Snowball stopword list
mystopwords
## [1] "a" "about" "above" "after" "again"
## [6] "against" "all" "am" "an" "and"
## [11] "any" "are" "aren't" "as" "at"
## [16] "be" "because" "been" "before" "being"
## [21] "below" "between" "both" "but" "by"
## [26] "can" "cannot" "could" "couldn't" "did"
## [31] "didn't" "do" "does" "doesn't" "doing"
## [36] "don't" "down" "during" "each" "few"
## [41] "for" "from" "further" "had" "hadn't"
## [46] "has" "hasn't" "have" "having" "he"
## [51] "her" "here" "hers" "herself" "him"
## [56] "himself" "his" "how" "i" "if"
## [61] "im" "i'd" "i'm" "in" "into"
## [66] "is" "it" "its" "itself" "me"
## [71] "more" "most" "my" "myself" "no"
## [76] "nor" "not" "of" "off" "on"
## [81] "once" "only" "or" "other" "ought"
## [86] "our" "ours" "ourselves" "out" "over"
## [91] "own" "same" "she" "should" "so"
## [96] "some" "such" "than" "that" "the"
## [101] "their" "theirs" "them" "themselves" "then"
## [106] "there" "these" "they" "this" "those"
## [111] "through" "to" "too" "under" "until"
## [116] "up" "very" "was" "we" "were"
## [121] "what" "when" "where" "which" "while"
## [126] "who" "whom" "why" "with" "would"
## [131] "you" "your" "yours" "yourself" "yourselves"