Introduction

In this document I summarize the findings of my exploratory analysis of the data provided for the capstone project of the Coursera Data Science Specialization.

The goal of this document, as stated by the course mentors, is:

“The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to: 1. Demonstrate that you’ve downloaded the data and have successfully loaded it in. 2. Create a basic report of summary statistics about the data sets. 3. Report any interesting findings that you amassed so far. 4. Get feedback on your plans for creating a prediction algorithm and Shiny app.”

Fetching

The data was downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. It was downloaded and unzipped with the following code:

strURL<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
strZIPFile<-"dataCapstone.zip"
strDir<-"final"

downloadTextData<-function(){
  
  # Download the zip file only if it is not already on disk
  if(!file.exists(strZIPFile)){
    download.file(strURL,strZIPFile)
  }
  
  # Unzip only if the target directory has not been extracted yet
  if(!dir.exists(strDir)){
    unzip(strZIPFile)
  }
  
}

downloadTextData()

Preliminary analysis

After the data was downloaded, each of the files was read.

Below I show some code that uses a helper function I wrote. My plan is to modularize this further; the constants used here will be defined in separate code units in the final version. They are only used to construct the paths to the files.
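
The code chunks in this report assume the following libraries are loaded; the setup chunk is not shown in the original chunks, so it is listed here for completeness:

library(data.table)
library(knitr)
library(stringr)
library(tm)
library(text2vec)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)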

## Languages
ENGLISH_ID<-"en"
FINNISH_ID<-"fn"
RUSSIAN_ID<-"ru"
DEUTSCH_ID<-"de"

## Document types
BLOGS_ID<-"blogs"
NEWS_ID<-"news"
TWITTER_ID<-"twitter"

## Language codes used in the file name prefixes
IDIOM_CODE_DEUTSCH<-"de_DE"
IDIOM_CODE_ENGLISH<-"en_US"
IDIOM_CODE_FINNISH<-"fi_FI"
IDIOM_CODE_RUSSIAN<-"ru_RU"

getFileName<-function(language=ENGLISH_ID,type=TWITTER_ID){
  
  # Map the short language id to the locale code used in the file names
  strIdiomCode<-switch (language,
                        "de" = IDIOM_CODE_DEUTSCH,
                        "fn" = IDIOM_CODE_FINNISH,
                        "en" = IDIOM_CODE_ENGLISH,
                        "ru" = IDIOM_CODE_RUSSIAN
  )
  
  # Files are named like final/en_US/en_US.twitter.txt
  strFileName<-paste(strIdiomCode,".",type,".txt",sep="")

  file.path(strDir,strIdiomCode,strFileName)
}

language<-ENGLISH_ID

strFileBlogs<-getFileName(language,BLOGS_ID)
strFileNews<-getFileName(language,NEWS_ID)
strFileTwitter<-getFileName(language,TWITTER_ID)
  
  
# Read each corpus file as UTF-8, skipping embedded NUL characters
fileBlogs<-file(strFileBlogs)
linesBlogs<-readLines(fileBlogs,skipNul = TRUE,encoding = "UTF-8")
close(fileBlogs)
  
fileNews<-file(strFileNews)
linesNews<-readLines(fileNews,skipNul=TRUE,encoding = "UTF-8")
close(fileNews)
  
fileTwitter<-file(strFileTwitter)
linesTwitter<-readLines(fileTwitter,skipNul=TRUE,encoding = "UTF-8")
close(fileTwitter)

With the files loaded I can count the total number of lines in each one. Below I show the number of lines per file, together with the maximum, minimum, and mean line length in characters:

filesLinesSummary<-data.table(files=c(BLOGS_ID,TWITTER_ID,NEWS_ID), 
                              noLines=c(length(linesBlogs),length(linesTwitter),length(linesNews)),
                              maximumLength=c(max(nchar(linesBlogs)),max(nchar(linesTwitter)),max(nchar(linesNews))),
minimumLength=c(min(nchar(linesBlogs)),min(nchar(linesTwitter)),min(nchar(linesNews))),
                              mean=c(mean(nchar(linesBlogs)),mean(nchar(linesTwitter)),mean(nchar(linesNews))))
kable(filesLinesSummary)
files     noLines   maximumLength  minimumLength       mean
blogs      899288           40833              1  229.98695
twitter   2360148             140              2   68.68054
news        77259            5760              2  202.42830

This is a useful overview, but it is not necessary to use all of the lines for a preliminary analysis. I will take samples whose sizes are based on the text lengths: the mean line length for blogs and news is similar and roughly three times the mean for twitter (about 230 and 202 characters versus about 69). This is an ad hoc approach, but it is enough to get an idea of the bigrams and trigrams that could be useful for the algorithm, so I will sample 50,000 lines each from news and blogs and 150,000 lines from twitter.

## num of lines to consider
  linesToReadBlogs<-50000
  linesToReadNews<-50000
  linesToReadTwitter<-150000

  set.seed(33234)
  linesBlogs<-sample(linesBlogs,linesToReadBlogs)
  linesTwitter<-sample(linesTwitter,linesToReadTwitter)
  linesNews<-sample(linesNews,linesToReadNews)
  
  linesToProcess<-c(linesBlogs,linesNews,linesTwitter)

Cleaning

linesToProcess now contains the lines I will use in my analysis. The next problem is that the text is raw and full of punctuation marks and extra spaces that will not help. I will clean the data using str_remove_all from the stringr package and removePunctuation, stripWhitespace, and removeNumbers from tm. This removes some stray characters I found in the text, as well as the numbers and punctuation marks that are useless and would complicate the construction of the n-grams.

cleanText<-function(dataTable){
  # Stop word removal is left commented out for now
  #dataTable$text<-removeWords(dataTable$text,stopwords('en'))
  # Remove stray characters found in the raw text
  dataTable$text<- str_remove_all(dataTable$text,"#")
  dataTable$text<- str_remove_all(dataTable$text,"–")
  dataTable$text<- str_remove_all(dataTable$text,"-")
  dataTable$text<- str_remove_all(dataTable$text,"“")
  dataTable$text<- str_remove_all(dataTable$text,"”")
  # Remove punctuation and numbers, collapse extra whitespace, lower-case
  dataTable$text<- removePunctuation(dataTable$text)
  dataTable$text<- removeNumbers(dataTable$text)
  dataTable$text<- stripWhitespace(dataTable$text)
  dataTable$text<- tolower(dataTable$text)
  dataTable
}

dataTable<-data.table(text=linesToProcess)

dataTable<-cleanText(dataTable)

Now that I have a data table with the cleaned lines, I can build the n-grams and count the words over the whole sample. I will discard the terms that appear fewer than five times using prune_vocabulary.

I will use functions from the text2vec library.

# Tokenize the cleaned text on whitespace
tokens <- space_tokenizer(dataTable$text)

# Build 3-, 2- and 1-gram vocabularies
vocab3gram <- create_vocabulary(itoken(tokens),ngram = c(3,3))

vocab2gram <- create_vocabulary(itoken(tokens),ngram = c(2,2))

vocab1gram <- create_vocabulary(itoken(tokens),ngram = c(1,1))

# Discard terms that appear fewer than five times
vocab3gram <- prune_vocabulary(vocab3gram, term_count_min = 5)

vocab2gram <- prune_vocabulary(vocab2gram, term_count_min = 5)

vocab1gram <- prune_vocabulary(vocab1gram, term_count_min = 5)

Here I pause to show the term counts in the corpus and the number of documents in which each term appears. For brevity, I show only the top 30 terms.

kable(head(vocab1gram[order(vocab1gram$term_count,decreasing=TRUE),],30))
        term  term_count  doc_count
161925  the       260218     113631
161924  to        154159      92477
161923  and       132652      77277
161922  a         131996      81584
161921  of        109547      66497
161920  i          96352      57119
161919  in         90450      62283
161918  for        62356      50235
161917  is         60656      46021
161916  that       57089      41534
161915  you        55539      40584
161914  it         51758      38412
161913  on         45884      38378
161912  with       39540      32363
161911  my         35773      27372
161910  was        34614      24457
161909  at         31868      27206
161908  this       30947      25935
161907  be         30794      26156
161906  have       29661      25072
161905  are        27478      22827
161904  but        26600      23743
161903  as         26546      19214
161902  we         23372      16566
161901  he         22852      15281
161900  not        22763      19552
161899  so         22233      19598
161898  me         21897      18667
161897  from       21201      18542
161896  all        18854      16797

Plotting

Now I can plot the top n-grams for my analysis. First, the 1-gram frequency plot:

gg<- ggplot(head(vocab1gram[order(vocab1gram$term_count,decreasing=TRUE),],30),aes(x=term,y=term_count, colour=doc_count,group=1))
gg+geom_point(shape=19, size=3)+geom_line()

For 2-grams:

gg<- ggplot(head(vocab2gram[order(vocab2gram$term_count,decreasing=TRUE),],20),aes(x=term,y=term_count, colour=doc_count,group=1))
gg+geom_point(shape=19, size=3)+geom_line()

And finally for 3-grams:

gg<- ggplot(head(vocab3gram[order(vocab3gram$term_count,decreasing=TRUE),],10),aes(x=term,y=term_count, colour=doc_count,group=1))
gg+geom_point(shape=19, size=3)+geom_line()

Even with these graphs it is difficult to see the n-grams in context, so I will use the wordcloud library to build word clouds for my analysis; word clouds are a convenient way to display this kind of information.

For 1-grams:

pal <- brewer.pal(9,"BuGn")
pal <- pal[-(1:2)]
wordcloud(vocab1gram$term, vocab1gram$term_count, scale=c(3,.5), min.freq=2,
          max.words=250, random.order=FALSE, rot.per=.15, colors=pal)

For 2-grams:

pal <- brewer.pal(9,"PuBu")
pal <- pal[-(1:2)]
wordcloud(vocab2gram$term, vocab2gram$term_count, scale=c(2.5,0.4), min.freq=2,
          max.words=200, random.order=FALSE, rot.per=.15, colors=pal)

For 3-grams:

pal <- brewer.pal(9,"PuBu")
pal <- pal[-(1:2)]
wordcloud(vocab3gram$term, vocab3gram$term_count, scale=c(1.5,0.25), min.freq=2,
          max.words=150, random.order=FALSE, rot.per=.15, colors=pal)

The n-grams can be seen more clearly in these plots.

Next steps

The next steps are:

  1. Generate a Markov network for the 2-grams.
  2. Generate a Markov network for the 3-grams.
  3. Build a general vocabulary from the 1-grams.
  4. Use string distance algorithms to suggest an approximate match when a misspelling occurs.
  5. Combine all of these pieces in a function that, given a string, returns n candidate words (a minimal sketch of this idea is shown below).
  6. Put this function in a Shiny app.
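
As an illustration of steps 1 and 5, the following is a minimal sketch of how the bigram counts built above could be turned into a simple next-word lookup. It assumes the vocab2gram object from the previous section is available and relies on text2vec joining the tokens of each n-gram with "_" (its default separator); bigramDT and predictNextWord are hypothetical names used only for this sketch, not the final implementation.

# Split each bigram "w1_w2" into the current word and the candidate next word
bigramDT <- data.table(term = vocab2gram$term, count = vocab2gram$term_count)
bigramDT[, c("word1","word2") := tstrsplit(term, "_", fixed = TRUE, keep = 1:2)]

# Given a word, return the n most frequent observed continuations
predictNextWord <- function(word, n = 3){
  candidates <- bigramDT[word1 == tolower(word)][order(-count)]
  head(candidates$word2, n)
}

# Example usage: the three most frequent continuations of "thank"
predictNextWord("thank", 3)

The 3-gram table could be handled in the same way, keyed on the previous two words, falling back to the bigram table when no trigram match is found.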

I have reviewed Markov networks and I think they are a very good starting point for the predictor; you can find a very nice introductory video here.