Executive Summary

The data set for the Capstone Project was downloaded, extracted and read into R for preliminary analysis. The data was subset and cleaned, and exploratory analysis was performed on the size and character distribution of the data sets and on the frequency with which some words and terms occurred. This built familiarity with both the data and some of the linguistic tools needed later, such as document-term matrices and word n-grams.

The Data

The data was downloaded from the following web address. At the time of writing this was a 548 MB zip file. https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Only the English portion is considered in this project, so the Russian, Finnish and German folders may be removed after unzipping. The import proved more reliable when a connection was opened in binary mode before each file was read in line by line. The size of each object created approximately matches the file size reported in Windows Explorer, which suggests that these large files were imported in full.

con <- file("Coursera-SwiftKey\\final\\en_US\\en_US.blogs.txt", open = "rb")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

con <- file("Coursera-SwiftKey\\final\\en_US\\en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

con <- file("Coursera-SwiftKey\\final\\en_US\\en_US.twitter.txt", open = "rb")
twits <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
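The three import steps above repeat the same pattern, which could be wrapped in a small helper. The sketch below is only an illustration: `read_lines_binary` is a hypothetical name that does not appear in the original analysis, and the temporary file stands in for the real data files so that the example runs on its own.

```r
# Hypothetical helper (not part of the original analysis): read a text file
# line by line over a connection opened in binary mode.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)  # close the connection even if readLines() fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Usage on a small temporary file that stands in for one of the data files
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

Using `on.exit()` means the connection is released even when the read stops with an error, which the explicit `close(con)` calls would not guarantee.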

Basic Summary

Character counts per line were used as a proxy for some introductory exploration. The ‘Blogs’ and ‘News’ sets have a similar distribution of characters per line, ranging from a single character to more than ten thousand, while ‘Twitter’ is an outlier. This is expected because each line holds at most one tweet and Twitter's limit is 140 characters; the observed range is 2 to 140. The size of each object in memory, in megabytes, is displayed along with this summary.

dataSum<-matrix(c((summary(nchar(blogs))) , summary(nchar(news)),summary(nchar(twits))),ncol=6, byrow = TRUE)
dataSize<-matrix(c(object.size(blogs) ,object.size(news) ,object.size(twits))/1000000 , ncol=1)
dataSum<-cbind(dataSum, dataSize)
dataSum<-as.table(round(dataSum,0))
rownames(dataSum)<-c('Blogs', 'News', 'Twitter')
colnames(dataSum)<-c('Min.','Q1','Q2','Mean','Q3','Max.','Size(Mb)')
dataSum
##          Min.    Q1    Q2  Mean    Q3  Max. Size(Mb)
## Blogs       1    47   156   230   329 40833      261
## News        1   110   185   201   268 11384      262
## Twitter     2    37    64    69   100   140      316
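The same table could also be built from a named list, which avoids repeating the `summary(nchar(...))` call for each data set. This is a minimal sketch under stated assumptions: `summarise_lines` and the toy data are hypothetical, and the median column is labelled "Median" here rather than "Q2".

```r
# Hypothetical helper: one row of nchar() summary statistics plus the
# in-memory object size (in MB) for each character vector in a named list.
summarise_lines <- function(sets) {
  stats <- t(vapply(sets, function(x) as.numeric(summary(nchar(x))), numeric(6)))
  size_mb <- vapply(sets, function(x) as.numeric(object.size(x)) / 1e6, numeric(1))
  out <- cbind(stats, size_mb)
  colnames(out) <- c("Min.", "Q1", "Median", "Mean", "Q3", "Max.", "Size(MB)")
  round(out, 0)
}

# Toy data standing in for the real blogs, news and twits objects
toy <- list(A = c("a", "bbb", "ccccc"), B = c("dd", "dddd"))
toy_summary <- summarise_lines(toy)
toy_summary
```

With the real data the call would presumably be `summarise_lines(list(Blogs = blogs, News = news, Twitter = twits))`.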

Subsetting and Cleaning

The data set is too large to perform exploratory analysis on directly, so a subset must be taken. Because the three files are roughly equal in size, the same number of lines was sampled from each. If the files were of drastically different sizes and the subset was for training data, the sets should be combined first and then subset to the total size desired, to avoid unintentionally biasing smaller sets. Five thousand lines of each set were selected as a compromise between potential accuracy and computational intensity.

set.seed(500)
twitSamp <- sample(twits,5000)
blogSamp <- sample(blogs,5000)
newsSamp <- sample(news,5000)
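For the alternative mentioned above, where the sets are combined before sampling, a sketch might look like the following. It was not used in this analysis; `sample_pooled` and the toy sets are hypothetical, and the source label is kept only so the make-up of the sample can be inspected.

```r
# Hypothetical helper: pool all sources, then draw one sample of total size n,
# so each source is represented roughly in proportion to its number of lines.
sample_pooled <- function(sets, n, seed = 500) {
  pooled <- data.frame(
    source = rep(names(sets), lengths(sets)),
    text   = unlist(sets, use.names = FALSE),
    stringsAsFactors = FALSE
  )
  set.seed(seed)
  pooled[sample(nrow(pooled), n), ]
}

# Toy sets of deliberately different sizes
toy_sets <- list(
  blogs   = paste("blog line", 1:100),
  news    = paste("news line", 1:300),
  twitter = paste("tweet", 1:600)
)
toy_samp <- sample_pooled(toy_sets, 100)
table(toy_samp$source)
```

Because the draw is made over the pooled lines, a source with six times as many lines is expected to contribute about six times as many sampled lines.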

To use the linguistic tools later, the sample must be cleaned. Splitting the strings at commas ensures that poor grammar does not cause separate groups of words to be read together as one. Any fragment containing non-ASCII characters was removed next, as some characters do not have a lower-case form, potentially causing an error later in the cleaning process. The final step in cleaning was to convert all text to lower case and remove numbers and punctuation. After this step the text no longer marks the start or end of a sentence, so some n-grams will contain the last word of one sentence and the first word of the next. However, these are spread thinly through the data and are unlikely to recur often enough to influence the average and frequency data. This may lead to reduced accuracy at the beginning and end of sentences in later prediction modeling, but it was seen as an acceptable compromise given the reduced complexity of the cleaned data.

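library(tm)  # provides removeNumbers(), removePunctuation() and stripWhitespace()

dataSamp<-c(twitSamp,blogSamp,newsSamp)
# Split on ", " so each comma-separated fragment is handled separately
dataSampC<-unlist(strsplit(dataSamp, split=", "))
# Flag fragments with non-ASCII bytes: iconv() replaces each such byte with the marker "dataSampC"
nonAscii<-grepl("dataSampC", iconv(dataSampC, "latin1", "ASCII", sub="dataSampC"))
# Logical indexing keeps every fragment when none is flagged; [-integer(0)] would drop them all
dataSampC<-dataSampC[!nonAscii]
dataSampC <- paste(dataSampC, collapse = ", ")
dataSampC<-stripWhitespace(removePunctuation(removeNumbers(tolower(dataSampC))))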
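To make the effect of each cleaning step visible, the same sequence can be traced on two toy strings using base R only. This is an illustrative sketch rather than the code used for the analysis: the toy input is invented, `utf8ToInt()` is swapped in for the `iconv()` marker approach, and `gsub()` stands in for the tm helpers.

```r
# Toy input: one fragment contains a non-ASCII character (written as an escape).
toy <- c("Hello, World! It is 2015", "A caf\u00e9 visit, then home.")

# 1. Split on ", " so each fragment is handled separately.
frags <- unlist(strsplit(toy, split = ", ", fixed = TRUE))

# 2. Drop every fragment that contains a code point above 127 (non-ASCII).
has_non_ascii <- vapply(frags, function(s) any(utf8ToInt(s) > 127L), logical(1))
frags <- frags[!has_non_ascii]

# 3. Rejoin, lower-case, strip digits and punctuation, then squeeze whitespace.
clean <- tolower(paste(frags, collapse = ", "))
clean <- gsub("[0-9]+", "", clean)
clean <- gsub("[[:punct:]]+", "", clean)
clean <- trimws(gsub("\\s+", " ", clean))
clean
# [1] "hello world it is then home"
```

The whole fragment "A café visit" is dropped rather than just the accented letter, and the result has no sentence boundaries left, which is the trade-off described above.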

The Frequency of Words and Combinations Thereof

Existing linguistic tool packages (RWeka and tm) were used to investigate the relative frequency of words in the subset. Given the size of the sample, these distributions are assumed to be similar to those of the whole data set.

A corpus object was created from the cleaned text and a term-document matrix was calculated from it. This matrix is an important tool in natural language processing and supports the further multivariate or latent semantic analysis needed for prediction models.

The twenty most frequent single words, pairs, triplets and quadruplets were extracted as n-grams, sorted by frequency and displayed as bar charts for visual reference. The results were as expected: common definite and indefinite articles, prepositions and pronouns feature among the most frequently used single words. Combinations of these, as they are used together in natural writing, could logically be expected to be the most frequent pairs of words. While this general trend continued for the triplets, some new words appeared in the terms, such as “according”, which is seldom used alone but is followed by “to the” frequently enough to appear in the top twenty.

Four-grams, or combinations of four words, showed a higher proportion of phrases and clauses than combinations of the previously more popular individual words. It is expected that this trend will continue as the n-gram size increases, eventually resulting in slogans, memes and other commonly shared identical phrases being more frequent than other combinations.
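What an n-gram count involves can be sketched in base R on a single toy sentence. The helper `count_ngrams` and the sentence are hypothetical and only illustrate the idea; the analysis itself relies on the RWeka tokenizer.

```r
# Hypothetical helper: count the n-grams in one cleaned, space-separated string.
count_ngrams <- function(text, n) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]  # drop empty tokens left by repeated spaces
  if (length(words) < n) {
    return(data.frame(ngram = character(0), Freq = integer(0)))
  }
  # Slide a window of n words along the text and paste each window together
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(starts,
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  counts <- as.data.frame(table(ngram = grams), stringsAsFactors = FALSE)
  counts[order(-counts$Freq), ]
}

toy_text <- "according to the report the team went to the game according to the coach"
toy_bigrams  <- count_ngrams(toy_text, 2)
toy_trigrams <- count_ngrams(toy_text, 3)
head(toy_trigrams, 3)
```

In this toy sentence "to the" is the most frequent pair and "according to the" the most frequent triplet, mirroring the pattern described above.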

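library(tm)       # Corpus(), VectorSource(), TermDocumentMatrix()
library(RWeka)    # NGramTokenizer(), Weka_control()
library(ggplot2)  # bar charts of the n-gram frequencies

CorpSamp <- Corpus(VectorSource(dataSampC))
CorpSampTDM<-TermDocumentMatrix(CorpSamp)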

ngram1<-data.frame(table(NGramTokenizer(CorpSamp,Weka_control(min=1, max=1))))
ngram1<-ngram1[order(-ngram1$Freq),]
topN1<-ngram1[1:20,]

ngram2<-data.frame(table(NGramTokenizer(CorpSamp,Weka_control(min=2, max=2))))
ngram2<-ngram2[order(-ngram2$Freq),]
topN2<-ngram2[1:20,]

ngram3<-data.frame(table(NGramTokenizer(CorpSamp,Weka_control(min=3, max=3))))
ngram3<-ngram3[order(-ngram3$Freq),]
topN3<-ngram3[1:20,]

ngram4<-data.frame(table(NGramTokenizer(CorpSamp,Weka_control(min=4, max=4))))
ngram4<-ngram4[order(-ngram4$Freq),]
topN4<-ngram4[1:20,]

fig1<-ggplot(topN1, aes(x=reorder(Var1, Freq),y=Freq))+geom_col() + labs(title = "Most Frequent Single Words", x="Word") +coord_flip()

fig2<-ggplot(topN2, aes(x=reorder(Var1, Freq),y=Freq))+geom_col() + labs(title = "Most Frequent Pairs", x="Word Pairs") +coord_flip()

fig3<-ggplot(topN3, aes(x=reorder(Var1, Freq),y=Freq))+geom_col() + labs(title = "Most Frequent Triplets", x="Word Triplets") +coord_flip()

fig4<-ggplot(topN4, aes(x=reorder(Var1, Freq),y=Freq))+geom_col() + labs(title = "Most Frequent Quadruplets", x="Word Quadruplets") +coord_flip()

fig1

fig2

fig3

fig4