Summary

This report presents a basic exploratory analysis prepared for the Milestone Report submission in the Data Science Capstone course offered by Johns Hopkins University on Coursera. The course objective is to apply data science to natural language processing. The final deliverable is a Shiny application that accepts text typed by the user and tries to predict the next word. As a starting point, the course provides a set of files containing text extracted from blogs, news/media sites and Twitter, to be used as input for building a prediction algorithm. In the next sections we analyze a subset of the data provided.

Main findings

  1. The files are huge and processing them takes time, which means we have to find ways to process them efficiently in memory using vectorized operations (a short sketch follows this list).
  2. Looking at the statistics of the 3 files, we assume that we can join them without losing the characteristics of any of them.
  3. With a vocabulary of only 1% of the unique words we can cover 91% of the text.
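As an illustration of finding 1, the sketch below contrasts a vectorized call with an element-by-element loop on a toy character vector (the real files have millions of lines, where the difference matters):

library(stringr)
lines <- c("the quick brown fox", "jumps over", "the lazy dog")  # toy example

# vectorized: one call processes the whole character vector
wordsVec <- str_count(lines, "\\w+")

# loop: same result, but much slower on millions of lines
wordsLoop <- integer(length(lines))
for (i in seq_along(lines)) wordsLoop[i] <- str_count(lines[i], "\\w+")

identical(wordsVec, wordsLoop)  # TRUE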

Data

The data provided on the course site comprises four sets of files (de_DE - German, en_US - English, fi_FI - Finnish and ru_RU - Russian), with each set containing 3 text files with texts from blogs, news/media sites and Twitter. In this analysis we focus on the English set of files:

  - en_US.blogs.txt
  - en_US.news.txt
  - en_US.twitter.txt

Obtaining and loading the data

The data was obtained using the code below:

arquivo <- "001-Data/Coursera-SwiftKey.zip"  # local path for the downloaded archive
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",arquivo,method="curl",mode="wb")
unzip(arquivo,exdir="001-Data")  # extract the archive into 001-Data/final/...

To load each data set into R we used the code below:

library(stringr)
# count the number of lines in each file with the shell command "wc -l"
arquivos <- sapply(list.files("./001-Data/final/en_US",full.names=T),
                   function(file) as.numeric(str_extract(system2("wc",c("-l",file),stdout=T),"[0-9]+")))
# build a data frame with the file path, line count, a fixed seed and a 10% sample size
arquivosUS <- data.frame()
for (i in seq_along(arquivos)) {
        df <- data.frame(arquivo=names(arquivos[i]),nlines=arquivos[i],seed=19061963,size=arquivos[i]/10)
        arquivosUS <- rbind(arquivosUS,df)
        df <- NULL
}
arquivosUS$nome <- list.files("./001-Data/final/en_US")  # file name without the path
blogsUS <- readLines(as.character(arquivosUS[1,"arquivo"]),encoding="UTF-8",skipNul=T)
str(blogsUS)
##  chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." ...
newsUS <- readLines(as.character(arquivosUS[2,"arquivo"]),encoding="UTF-8",skipNul=T)
str(newsUS)
##  chr [1:1010242] "He wasn't home alone, apparently." ...
twitterUS <- readLines(as.character(arquivosUS[3,"arquivo"]),encoding="UTF-8",skipNul=T)
str(twitterUS)
##  chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...

In the code above we created the data frame arquivosUS to hold metadata about the files used in this analysis and keep a reference to them. Below is a listing of this data frame:

rownames(arquivosUS) <- NULL
print(arquivosUS[1:4])
##                                    arquivo  nlines     seed     size
## 1   ./001-Data/final/en_US/en_US.blogs.txt  899288 19061963  89928.8
## 2    ./001-Data/final/en_US/en_US.news.txt 1010242 19061963 101024.2
## 3 ./001-Data/final/en_US/en_US.twitter.txt 2360148 19061963 236014.8
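
The seed and size columns are not used again in this report; one plausible use, shown here only as an assumption-labelled sketch, is to draw a reproducible sample of roughly 10% of each file:

# sketch (assumption): reproducible ~10% sample of the blogs file using the stored seed and size
set.seed(arquivosUS[1,"seed"])
idx <- sample(arquivosUS[1,"nlines"],size=round(arquivosUS[1,"size"]))
blogsSampleUS <- blogsUS[idx]
length(blogsSampleUS)  # about 10% of the lines in en_US.blogs.txt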

Analyzing the Data

Data size in memory

In the code below we plot the amount of memory occupied by each one of the data sets.

library(ggplot2)
memory_files <- data.frame(files=c("blogsUS","twitterUS","newsUS"),
                         memory=c(object.size(blogsUS),object.size(twitterUS),object.size(newsUS)))
ggplot(data=memory_files,aes(x=files,y=memory,fill=files)) + 
        geom_bar(stat="identity") +
        xlab("Text files provided") + 
        ylab("Memory size in bytes") +
        ggtitle("Memory in bytes per file")

memory_files <- NULL
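
For a quick textual check of the same information in megabytes, one can format the object sizes directly (optional sketch):

# memory size of each data set in megabytes
sapply(list(blogsUS=blogsUS,newsUS=newsUS,twitterUS=twitterUS),
       function(x) format(object.size(x),units="MB"))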

Lines of text per data set

lines_files <- data.frame(files=c("blogsUS","twitterUS","newsUS"),
                          lines=c(length(blogsUS),length(twitterUS),length(newsUS)))
ggplot(data=lines_files,aes(x=files,y=lines,fill=files)) + 
        geom_bar(stat="identity") +
        xlab("Text files provided") + 
        ylab("Number of lines") +
        ggtitle("Number of lines per file")


Plot and summary information on line length per data set

line_stats <- data.frame(files=c(rep("blogsUS",length(blogsUS)),
                                 rep("twitterUS",length(twitterUS)),
                                 rep("newsUS",length(newsUS))),
                           linelen=c(str_length(blogsUS),    # str_length is already vectorized,
                                     str_length(twitterUS),  # so no sapply loop is needed
                                     str_length(newsUS)))
boxplot(log(linelen)~files,data=line_stats,
        main="Statistics of line length per file",
        xlab="Files",
        ylab="LOG(Line length)")

#       overall summary of line lengths
summary(line_stats$linelen)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      44      88     134     154   40830

On the y axis of the plot we used the log of the line length to compress the scale and make the distributions visible in a single plot.
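
For completeness, the per-file line-length summaries can be computed directly from line_stats (a small sketch complementing the overall summary above):

# summary of line length per file
tapply(line_stats$linelen,line_stats$files,summary)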

Plot and summary information on the number of words per line in each text file

word_stats <- data.frame(files=c(rep("blogsUS",length(blogsUS)),
                                 rep("twitterUS",length(twitterUS)),
                                 rep("newsUS",length(newsUS))),
                         wordcount=c(str_count(blogsUS,"\\w+"),
                                   str_count(twitterUS,"\\w+"),
                                   str_count(newsUS,"\\w+")))
boxplot(log(wordcount)~files,data=word_stats,
        main="Statistics of number of words per text in file",
        xlab="Files",
        ylab="LOG(number of words per text)")
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn

#       overall summary of words per text in file
summary(word_stats$wordcount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.00   17.00   24.58   28.00 6851.00
#       summary of words per text in file blogsUS
summary(word_stats[word_stats$file=="blogsUS","wordcount"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     9.0    29.0    42.6    61.0  6851.0
#       summary of words per text in file twitterUS
summary(word_stats[word_stats$file=="twitterUS","wordcount"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   13.14   19.00   47.00
#       summary of words per text in file newsUS
summary(word_stats[word_stats$file=="newsUS","wordcount"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   35.26   47.00 1928.00
total <- sum(word_stats$wordcount)

Analyzing words in the data sets

The 3 data sets together contain 104936031 words, and they are too large to accommodate comfortably in memory. We therefore analyze the words in detail to help us search for a good solution: a prediction algorithm that performs well and fits in the memory available to the Shiny server. First we apply some transformations and cleansing to the data.

Data Transformations and Cleansing

We adopted an approach of marking selected punctuation (period, comma, exclamation mark, smileys, etc.), because we think it denotes the end of a thought, and the next word usually has no relationship with the one before the punctuation. These marks also signal the beginning of a new phrase, so they are important to the flow of words. For that reason we treat selected punctuation as a special kind of word. This is reflected in the cleansing and transformation function below.

cleanData <- function(data) {
        library(tm)
        data <- tolower(data) # convert to lowercase
        data <- removeNumbers(data) # remove numbers
        pontuacao <- '[.,!:;?]|:-\\)|:-\\(|:\\)|:\\(|:D|=D|8\\)|:\\*|=\\*|:x|:X|:o|:O|:~\\(|T\\.T|Y\\.Y|S2|<3|:B|=B|=3|:3'
        data <- gsub(pontuacao," END ",data) # replace selected punctuation (including smileys) with the word END
        data <- gsub("$"," END",data) # make sure every line ends with an END
        # collapse repeated words in sequence (e.g. "that that"); the substitution is applied
        # three times so that longer runs of repetitions are also reduced to a single word
        data <- gsub("\\b(\\w+)\\s+\\1\\b","\\1",data)
        data <- gsub("\\b(\\w+)\\s+\\1\\b","\\1",data)
        data <- gsub("\\b(\\w+)\\s+\\1\\b","\\1",data)
        data <- removePunctuation(data) # remove all other punctuation
        data <- stripWhitespace(data) # collapse excess white space
        data <- gsub("^[[:space:]]","",data) # remove a leading space, if any
        data <- gsub("[[:space:]]$","",data) # remove a trailing space, if any
        data # return the cleaned character vector
}
blogsUS <- cleanData(blogsUS)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
save(file="blogsUS-clean.rdata",blogsUS)
newsUS <- cleanData(newsUS)
save(file="newsUS-clean.rdata",newsUS)
twitterUS <- cleanData(twitterUS)
save(file="twitterUS-clean.rdata",twitterUS)

Still within the transformation step, we split each data set into a vector of individual words.

blogsUS <- unlist(str_split(blogsUS,"\\W+"))
newsUS <- unlist(str_split(newsUS,"\\W+"))
twitterUS <- unlist(str_split(twitterUS,"\\W+"))
save(file="blogsUS-words.rdata",blogsUS)
save(file="newsUS-words.rdata",newsUS)
save(file="twitterUS-words.rdata",twitterUS)

Statistics on words in each file and on all files concatenated

words.blogsUS <- sort(table(blogsUS),decreasing=TRUE) # table with blogsUS word freq
words.newsUS <- sort(table(newsUS),decreasing=TRUE) # table with newsUS word freq
words.twitterUS <- sort(table(twitterUS),decreasing=TRUE) # table with twitterUS word freq
words.all <- sort(table(c(blogsUS,newsUS,twitterUS)),decreasing=TRUE) # all word freq
q.blogsUS <- quantile(words.blogsUS,probs=c(0,25,50,75,80,95,99,100)/100,type=3)
q.newsUS <- quantile(words.newsUS,probs=c(0,25,50,75,80,95,99,100)/100,type=3)
q.twitterUS <- quantile(words.twitterUS,probs=c(0,25,50,75,80,95,99,100)/100,type=3)
q.all <- quantile(words.all,probs=c(0,25,50,75,80,95,99,100)/100,type=3)
print(q.blogsUS)
##      0%     25%     50%     75%     80%     95%     99%    100% 
##       1       1       1       4       6      78     824 4451634
print(q.newsUS)
##      0%     25%     50%     75%     80%     95%     99%    100% 
##       1       1       2       6       9     114    1179 4499678
print(q.twitterUS)
##      0%     25%     50%     75%     80%     95%     99%    100% 
##       1       1       1       3       4      46     572 5186943
print(q.all)
##       0%      25%      50%      75%      80%      95%      99%     100% 
##        1        1        1        3        4       59      869 14138255
qqnorm(words.blogsUS,main="Normal Q-Q plot of words in blogsUS")
qqline(words.blogsUS)

qqnorm(words.newsUS,main="Normal Q-Q plot of words in newsUS")
qqline(words.newsUS)

qqnorm(words.twitterUS,main="Normal Q-Q plot of words in twitterUS")
qqline(words.twitterUS)

qqnorm(words.all,main="Normal Q-Q plot of words in all files")
qqline(words.all)

hist(words.all)

head(words.all)
## 
##      END      the       to      and        a       of 
## 14138255  4763777  2753690  2409871  2402467  2005693
tail(words.all)
## 
## энергетику        юге     южного      южной          я         як 
##          1          1          1          1          1          1
words99 <- words.all[words.all>=q.all['99%']] # word frequencies at or above the 99% quantile
hist(words99)

head(words99)
## 
##      END      the       to      and        a       of 
## 14138255  4763777  2753690  2409871  2402467  2005693
tail(words99)
## 
##         disc     executed implications      precise      satisfy 
##          869          869          869          869          869 
##      sweeney 
##          869
sum(words99)/sum(words.all)
## [1] 0.914251
total.words99 <- length(words99) # total of unique words in all 3 data sets above 99% quantile
total.words <- length(words.all) # total of unique words in all 3 data sets
stotal.words99 <- format(total.words99,big.mark=",",small.mark=",",small.interval=3)
stotal.words <- format(total.words,big.mark=",",small.mark=",",small.interval=3)
p99 <- total.words99/total.words
stotal.per <- format(p99,digits=3,big.mark=",",small.mark=",",small.interval=3)                                                        

As we can see, the word frequencies are heavily right-skewed, and with a vocabulary of the words above the 99% quantile (7,889 words) we cover 91.4251022% of the words that appear in the text. In other words :-) , using only 1% of the unique words we can cover 91% of the text. This reduced vocabulary could be the difference between success and failure.
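
A related question is how large the vocabulary must be to reach a given coverage target. The sketch below reuses the words.all frequency table; the 90% target and the names coverage/vocab90 are illustrative:

# cumulative coverage of the corpus as the vocabulary grows, most frequent words first
coverage <- cumsum(as.numeric(words.all))/sum(as.numeric(words.all))
vocab90 <- which(coverage>=0.90)[1]  # smallest vocabulary reaching 90% coverage
vocab90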

Next Steps

  1. Build n-grams (2 to 5) using only the words above the 99% quantile.
  2. Put the n-grams in an adjacency matrix, probably using only vector operations with the sparse matrices of the Matrix package (see the sketch after this list); I tried the RWeka package, but I'm having problems with it on my notebook.
  3. Construct a Markov chain using this adjacency matrix.
  4. Build a simple Shiny application with a text input and 5 buttons whose labels change to the predicted words as the user types.
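
As a first sketch toward steps 1 to 3, the code below shows one possible way (an assumption, not the final implementation) to store bigram counts in a sparse matrix with the Matrix package, reusing the words99 table and the blogsUS word vector defined above; nextWord is a hypothetical helper:

library(Matrix)
vocab <- names(words99)                 # vocabulary: words above the 99% quantile
words <- blogsUS[blogsUS %in% vocab]    # keep only in-vocabulary words from one data set
from <- match(head(words,-1),vocab)     # first word of each bigram
to <- match(words[-1],vocab)            # second word of each bigram
# sparseMatrix sums duplicated (i,j) pairs, so the entries are bigram counts
bigrams <- sparseMatrix(i=from,j=to,x=1,
                        dims=c(length(vocab),length(vocab)),
                        dimnames=list(vocab,vocab))
# most frequent follower of a given word; row-normalizing these counts would give
# the transition probabilities of the Markov chain in step 3
nextWord <- function(w) names(which.max(bigrams[w,]))
nextWord("END")  # most likely word to start a new sentence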