Introduction

The following paper is a milestone report written for the Capstone Project of the Data Science Specialisation by Johns Hopkins University on Coursera. As the project overview text states “Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models.”

The aim of this Capstone project is to use skills learnt during the Data Science Specialization to build and pitch a Data Product that addresses the problem stated above. This Milestone report specifically aims to complete an initial data exploration exercise on the data provided; it includes summary statistics and data visualisations.

Executive Summary

A large amount of textual data is to form the basis of a predictive text algorithm, and some data cleansing has had to be applied to the dataset before it can be used. Analysis shows that, unsurprisingly, stop words (e.g. "the", "and") occur very frequently, and this will need to be taken into consideration when putting the predictive model together. On average, however, a word shows up in the corpus of text around 16-17 times.

To produce a dataset that is useful for an algorithm, further data cleaning will be required. In addition, machine learning algorithms generally perform better when there is more data, so my intention is to see how I can make use of the largest amount of data with the shortest processing time.

I have yet to read up on n-grams as the course text suggested; however, I imagine an implementation of a predictive text model will take insight gleaned from how words are associated in the corpus of text provided. For example, if someone has written "The cat sat", the model would need to take account of the words prior to the last word, as well as the last word itself, to determine the most likely next word(s), e.g. "on the mat". This will be an interesting challenge; the tm package has a find-associations function ("findAssocs") which I believe will be a good starting point. In addition, I may want to consider how I tokenise text and think about looking at the sentence that has been written as a whole rather than just a single word: for instance, "I want to" could lead to "I want to be a tree" or "I want to go to the market".
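As a first, rough illustration of the n-gram idea (a toy sketch with made-up input, not the eventual model), adjacent word pairs can be counted in base R; the clean_text vector below is simply a placeholder for cleaned corpus text:

#Toy input standing in for cleaned corpus text
clean_text <- c("the cat sat on the mat", "the cat sat on the sofa")
#Split each line into words and paste adjacent pairs together (bigrams)
bigrams <- unlist(lapply(strsplit(clean_text, "\\s+"), function(words)
        paste(head(words, -1), tail(words, -1))))
#Tabulate the pairs; the most common continuations of "the cat", "cat sat" etc. fall out
sort(table(bigrams), decreasing = TRUE)

In a similar spirit, tm's findAssocs() could be pointed at a term-document matrix such as the one built later in this report, e.g. findAssocs(tdm, "cat", 0.7), to surface correlated terms.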

Packages Used

The following packages are used within this paper:

library(tm);
library(plyr);
library(stringr);
library(wordcloud);
library(ggplot2);

Rather than the R distribution available on CRAN (http://bit.ly/1U43jRZ), I have used the Microsoft R distribution available on MRAN (http://bit.ly/2bUIg5D). The reason for this is that, on a Windows platform, R does not natively make full use of modern multi-core architectures and runs models on a single core of the processor. Since the host environment this paper was executed on is a Windows machine, the author has chosen to address this problem by using the Microsoft R distribution, which does take advantage of a modern multi-core architecture. It seems that tm does not make use of this, so further investigation will be required to see if the full amount of hardware available can be utilised.
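As a quick check of what is available (a minimal sketch only; whether tm can be made to use these cores is the open question noted above), the number of logical cores on the host machine can be reported with the base parallel package:

library(parallel)
detectCores()   #Number of logical cores available on the host machine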

Data Processing

The first step in the process of data exploration is loading the data you have been provided with. The following downloads the data from the Internet and loads it into the environment so it can be processed:

#Define Variables
file.url <- 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
file.dest <- './data/Coursera-SwiftKey.zip'
#Create the data directory if needed, then download the data
if (!dir.exists("./data")) dir.create("./data")
download.file(file.url, file.dest)
#Unzip the file
unzip(file.dest, exdir="./data")

For the purposes of this paper the en_US dataset will be used; the dataset contains the following files:

en_US_path <- file.path(".", "./Data/final/en_US/")  
en_Us_docs_Files <- DirSource("./Data/final/en_US/", recursive=TRUE)$filelist
en_Us_docs_Files
## [1] "./Data/final/en_US//ANSI/en_US.twitter.txt"
## [2] "./Data/final/en_US/en_US.blogs.txt"        
## [3] "./Data/final/en_US/en_US.news.txt"

As shown, the data has been downloaded and is now ready for data exploration.

Initial Data Exploration

This initial data exploration gathers some high level summary statistics for each file.

File Header

Let's start by looking at the first two lines of each file to see what the data looks like. This will give a feeling for the quality of the data and any issues there may be with the files.

con <- file(en_Us_docs_Files[[1]], "r") 
readLines(con, 2)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."

For the first file "./Data/final/en_US//ANSI/en_US.twitter.txt" the text looks quite well formed. It is evident that the file has abbreviations and slang included in it, and the style of language appears to be informal.

con <- file(en_Us_docs_Files[[2]], "r") 
readLines(con, 2)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>."
## [2] "We love you Mr. Brown."

For the second file "./Data/final/en_US/en_US.blogs.txt" the text looks quite well formed, although there appear to be some formatting characters that are not pure text, which will require removal before the text is useful for the task at hand. The likely explanation for this is the file encoding, and this will be factored in later on. The style of language used is more formal than in the first file.

con <- file(en_Us_docs_Files[[3]], "r") 
readLines(con, 2)
## [1] "He wasn't home alone, apparently."                                                                                                                        
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."

For the third file "./Data/final/en_US/en_US.news.txt" the text looks quite well formed, with no formatting problems to mention; the style of language used seems quite formal.

Subsequent investigation revealed that the twitter file is ANSI encoded and the other two files are UTF-8 encoded; these will therefore have to be processed differently, as the UTF-8 files contain extended characters.
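One way this might be handled (a sketch only, and not what is done in this report, where the files are instead split into separate directories below; it also assumes the ANSI file is effectively latin1/Windows-1252) is to convert the ANSI file to UTF-8 on the way in:

#Read the ANSI-encoded twitter file and convert it to UTF-8
ansi_lines <- readLines(en_Us_docs_Files[[1]], encoding = "latin1")
utf8_lines <- iconv(ansi_lines, from = "latin1", to = "UTF-8")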

Line Counts

The first summary statistic computed is the line count for each file. The following code calculates the number of lines in each file and stores the results in a data frame:

#Helper function: count the lines in a file by reading it in chunks of 20,000 lines
countLines <- function(filename, readsizeof = 20000) {
        filecon <- file(filename, open = "r")
        nooflines <- 0
        while ((linesread <- length(readLines(filecon, readsizeof))) > 0)
                nooflines <- nooflines + linesread
        close(filecon)
        nooflines
}

#Apply the helper to each file and store the results in a data frame
FileLineCounts <- data.frame(FileName  = en_Us_docs_Files,
                             NoOfLines = sapply(en_Us_docs_Files, countLines),
                             row.names = NULL)

The number of lines per file is as follows:

FileLineCounts
##                                     FileName NoOfLines
## 1 ./Data/final/en_US//ANSI/en_US.twitter.txt   2360148
## 2         ./Data/final/en_US/en_US.blogs.txt    899288
## 3          ./Data/final/en_US/en_US.news.txt     77259

The table above shows that the twitter dataset has the most lines, followed by the blogs file and then the news file. As there are a very large number of lines, for the sake of expedience I am going to take a subset of each file for the purposes of data exploration (say the first 20,000 lines). Ideally I would like to use all the data that is there, but the time required for my machine to process it would be considerable, so I have opted to work with a subset for now to get a feel for the data; a sketch of this approach is shown below.
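To illustrate the subsetting idea (the actual subsetting is applied inside the corpus cleaning loop further below), only the first 20,000 lines of a file need be read:

#Read just the first 20,000 lines of the twitter file rather than the whole file
con <- file(en_Us_docs_Files[[1]], "r")
sample_lines <- readLines(con, 20000)
close(con)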

Word counts

Due to the different types of file encoding, the two groups of files must be treated differently; as such, the ANSI-encoded file will be moved to another directory, away from the UTF-8 files.

setwd("C:/Users/Marek Kluczynski/Documents/Data Science Spealization/Capstone Project/Data/final/en_US")
dir.create("ANSI")
setwd("C:/Users/Marek Kluczynski/Documents/Data Science Spealization/Capstone Project")

#Move ANSI Files
file.copy("./Data/final/en_US/en_US.twitter.txt", "./Data/final/en_US/ANSI/")
file.remove("./Data/final/en_US/en_US.twitter.txt")
#Set paths for loading later
en_US_path_UTF <- file.path(".", "Data/final/en_US/")  
en_US_path_ANSI <- file.path(".", "Data/final/en_US/ANSI/")  

The next step is to load the data into a corpus of text so it can be mined. This is done as follows:

en_US_docs_UTF <- Corpus(DirSource(en_US_path_UTF, encoding="UTF-8"))
en_US_docs_ANSI <- Corpus(DirSource(en_US_path_ANSI))
en_US_docs <- c(en_US_docs_UTF, en_US_docs_ANSI)
summary(en_US_docs)
##                   Length Class             Mode
## en_US.blogs.txt   2      PlainTextDocument list
## en_US.news.txt    2      PlainTextDocument list
## en_US.twitter.txt 2      PlainTextDocument list

Word counts for each file will now be summarised; however, before this is done it is advisable to do a little pre-processing to clean up the data.

The cleaning will involve removing numbers, punctuation and excess white space from the data. In addition, a subset of the first 20,000 lines of text will be taken from each file, due to the amount of processing time it would take to work with the whole files. The following code applies this to the corpus of text using functions from the tm package:

for(i in seq(en_US_docs))   
{
        en_US_docs[[i]]$content <- en_US_docs[[i]]$content[1:20000]   #Take first 20k lines
        en_US_docs[[i]] <- removeNumbers(en_US_docs[[i]]) #Remove Numbers
        en_US_docs[[i]] <- removePunctuation(en_US_docs[[i]]) #Remove punctuation
        en_US_docs[[i]] <- gsub("[^[a-z|A-Z]]", "", en_US_docs[[i]]$content) #Attempt to remove non-alpha characters (revisited later)
        en_US_docs[[i]] <- tolower(en_US_docs[[i]]) #Convert to lower case
        en_US_docs[[i]] <- stripWhitespace(en_US_docs[[i]]) #Strip the whitespace
}
en_US_docs <- tm_map(en_US_docs, PlainTextDocument)

Now that the cleaning has been done, term-document and document-term matrices can be created, which hold information about word counts.

dtm <- DocumentTermMatrix(en_US_docs)
tdm <- TermDocumentMatrix(en_US_docs)
tdm
## <<TermDocumentMatrix (terms: 81854, documents: 3)>>
## Non-/sparse entries: 118925/126637
## Sparsity           : 52%
## Maximal term length: 95
## Weighting          : term frequency (tf)

As shown above, the number of unique terms identified is 81,854. These matrices are the most useful structures for looking at word counts; the most and least frequently occurring terms across all documents are shown below:

WordFrequency <- colSums(as.matrix(dtm))   
Order <- order(WordFrequency)  
WordFrequency[tail(Order)] #Look at the most frequently occurring terms
##   you  with   for  that   and   the 
## 12926 12930 18141 18754 45559 87706
WordFrequency[head(Order)] #Look at the least frequently occurring terms
## <U+FEFF><U+FEFF><U+FEFF><U+FEFF><U+FEFF> <U+FEFF><U+FEFF><U+FEFF> <U+0096>ashley   <U+0096>bell    <U+0096>but  <U+0096>canby 
##       1       1       1       1       1       1

Looking at the data above, there still appears to be some data cleaning required: for instance, "<U+0096>ashley" should appear as "ashley", and there still seem to be some file encoding issues. Due to time constraints this will be addressed later on in the project's development life cycle.
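A possible follow-up cleaning step (only a sketch of one option, not something applied in this report, and assuming the residue is non-ASCII encoding debris such as stray byte-order marks) would be to strip non-ASCII characters from the corpus content before rebuilding the matrices:

#Drop any characters that cannot be represented in ASCII (e.g. stray BOMs);
#assumes the content is currently marked as UTF-8
for (i in seq(en_US_docs)) {
        en_US_docs[[i]]$content <- iconv(en_US_docs[[i]]$content,
                                         from = "UTF-8", to = "ASCII", sub = "")
}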

The following is a set of basic summary statistics for the corpus

summary(WordFrequency)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.0    16.8     4.0 87710.0

This shows that the mean number of occurrences for a word is 16.8, with the highest number of occurrences in the corpus being 87,710.

Data Visualisation

To further explore the data, word clouds and bar charts will be employed. Word clouds can be very useful for showing the most frequent words; the following shows the top 150 most frequently occurring words in the corpus of text:

wordcloud(names(WordFrequency), WordFrequency, max.words=150, colors=brewer.pal(6, "Set2")) 

From this it seems that words classed as stop words occur frequently within the corpus. The following is a bar chart of the words that occur more than 5,000 times:

mybardata <- data.frame(word=names(WordFrequency), Frequency=WordFrequency)
mybardata <- subset(mybardata, Frequency>5000) #Keep words that occur more than 5,000 times
myplot <- ggplot(mybardata, aes(word, Frequency))
myplot <- myplot + geom_bar(stat="identity")   
myplot

It is clear that the stop words appear very frequently; how this will impact the model I aim to build, I am not sure yet.
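Should the stop words turn out to dominate the model unhelpfully, one option (not a decision taken here) would be to strip them using tm's built-in English stop word list before rebuilding the matrices:

#Remove common English stop words from the corpus (one possible option)
en_US_docs_nostop <- tm_map(en_US_docs, removeWords, stopwords("english"))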

References

The following resources were consulted, in addition to the documentation for the R packages used:

  1. Basic Text Mining in R - https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html
  2. Gsub examples - http://www.endmemo.com/program/R/gsub.php
  3. Text Mining the Complete Works of William Shakespeare - https://www.r-bloggers.com/text-mining-the-complete-works-of-william-shakespeare/
  4. Text Mining Infrastructure in R - https://www.jstatsoft.org/article/view/v025i05
  5. A gentle introduction to text mining using R - https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/
  6. Intro to text analysis - https://www.r-bloggers.com/intro-to-text-analysis-with-r/
  7. Wiki on NLP - https://en.wikipedia.org/wiki/Natural_language_processing