Introduction

In this document I summarize the findings of my exploratory analysis of the data provided for the capstone project of the Coursera Data Science Specialization.

The goal of this document, as stated by the course mentors, is:

“The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to: 1. Demonstrate that you’ve downloaded the data and have successfully loaded it in. 2. Create a basic report of summary statistics about the data sets. 3. Report any interesting findings that you amassed so far. 4. Get feedback on your plans for creating a prediction algorithm and Shiny app.”

Fetching

The data was downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. It was downloaded and unzipped with the following code:

strURL<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
strZIPFile<-"dataCapstone.zip"
strDir<-"final"

downloadTextData<-function(){
  
  # Download the zip file only if it is not already on disk
  if(!file.exists(strZIPFile)){
    download.file(strURL,strZIPFile)
  }
  
  # Unzip only if the target directory has not been extracted yet
  if(!dir.exists(strDir)){
    unzip(strZIPFile)
  }
  
}

downloadTextData()

Preliminary analysis

After the data was downloaded, each of the files was read.

Below I show some code that uses a helper function I wrote. My plan is to modularize this further; the constants used here will be defined in separate code units in the final version. They are only used to construct the paths to the files.
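
The code chunks in this report assume the following libraries are loaded; the setup chunk is not shown in the original chunks, so it is listed here for completeness:

library(data.table)
library(knitr)
library(stringr)
library(tm)
library(text2vec)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)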

## Languages
ENGLISH_ID<-"en"
FINNISH_ID<-"fn"
RUSSIAN_ID<-"ru"
DEUTSCH_ID<-"de"

## Document types
BLOGS_ID<-"blogs"
NEWS_ID<-"news"
TWITTER_ID<-"twitter"

## Language codes used in the file name prefixes
IDIOM_CODE_DEUTSCH<-"de_DE"
IDIOM_CODE_ENGLISH<-"en_US"
IDIOM_CODE_FINNISH<-"fi_FI"
IDIOM_CODE_RUSSIAN<-"ru_RU"

getFileName<-function(language=ENGLISH_ID,type=TWITTER_ID){
  
  # Map the short language id to the locale code used in the file names
  strIdiomCode<-switch (language,
                        "de" = IDIOM_CODE_DEUTSCH,
                        "fn" = IDIOM_CODE_FINNISH,
                        "en" = IDIOM_CODE_ENGLISH,
                        "ru" = IDIOM_CODE_RUSSIAN
  )
  
  # Files are named like final/en_US/en_US.twitter.txt
  strFileName<-paste(strIdiomCode,".",type,".txt",sep="")

  file.path(strDir,strIdiomCode,strFileName)
}

language<-ENGLISH_ID

strFileBlogs<-getFileName(language,BLOGS_ID)
strFileNews<-getFileName(language,NEWS_ID)
strFileTwitter<-getFileName(language,TWITTER_ID)
  
  
# Read each corpus file as UTF-8, skipping embedded NUL characters
fileBlogs<-file(strFileBlogs)
linesBlogs<-readLines(fileBlogs,skipNul = TRUE,encoding = "UTF-8")
close(fileBlogs)
  
fileNews<-file(strFileNews)
linesNews<-readLines(fileNews,skipNul=TRUE,encoding = "UTF-8")
close(fileNews)
  
fileTwitter<-file(strFileTwitter)
linesTwitter<-readLines(fileTwitter,skipNul=TRUE,encoding = "UTF-8")
close(fileTwitter)

With the files loaded I can count the total number of lines in each one. Below I show the number of lines per file, together with the maximum, minimum, and mean line length in characters:

filesLinesSummary<-data.table(files=c(BLOGS_ID,TWITTER_ID,NEWS_ID), 
                              noLines=c(length(linesBlogs),length(linesTwitter),length(linesNews)),
                              maximumLength=c(max(nchar(linesBlogs)),max(nchar(linesTwitter)),max(nchar(linesNews))),
minimumLength=c(min(nchar(linesBlogs)),min(nchar(linesTwitter)),min(nchar(linesNews))),
                              mean=c(mean(nchar(linesBlogs)),mean(nchar(linesTwitter)),mean(nchar(linesNews))))
kable(filesLinesSummary)
files     noLines   maximumLength  minimumLength       mean
blogs      899288           40833              1  229.98695
twitter   2360148             140              2   68.68054
news        77259            5760              2  202.42830

This is a useful overview, but it is not necessary to use all of the lines for a preliminary analysis. I will take samples whose sizes are based on the text lengths: the mean line length for blogs and news is similar and roughly three times the mean for twitter (about 230 and 202 characters versus about 69). This is an ad hoc approach, but it is enough to get an idea of the bigrams and trigrams that could be useful for the algorithm, so I will sample 50,000 lines each from news and blogs and 150,000 lines from twitter.

## num of lines to consider
  linesToReadBlogs<-50000
  linesToReadNews<-50000
  linesToReadTwitter<-150000

  set.seed(33234)
  linesBlogs<-sample(linesBlogs,linesToReadBlogs)
  linesTwitter<-sample(linesTwitter,linesToReadTwitter)
  linesNews<-sample(linesNews,linesToReadNews)
  
  linesToProcess<-c(linesBlogs,linesNews,linesTwitter)

Cleaning

linesToProcess now contains the lines I will use in my analysis. The next problem is that the text is raw and full of punctuation marks and extra spaces that will not help. I will clean the data using str_remove_all from the stringr package and removePunctuation, stripWhitespace, and removeNumbers from tm. This removes some stray characters I found in the text, as well as the numbers and punctuation marks that are useless and would complicate the construction of the n-grams.

cleanText<-function(dataTable){
  # Stop word removal is left commented out for now
  #dataTable$text<-removeWords(dataTable$text,stopwords('en'))
  # Remove stray characters found in the raw text
  dataTable$text<- str_remove_all(dataTable$text,"#")
  dataTable$text<- str_remove_all(dataTable$text,"–")
  dataTable$text<- str_remove_all(dataTable$text,"-")
  dataTable$text<- str_remove_all(dataTable$text,"“")
  dataTable$text<- str_remove_all(dataTable$text,"”")
  # Remove punctuation and numbers, collapse extra whitespace, lower-case
  dataTable$text<- removePunctuation(dataTable$text)
  dataTable$text<- removeNumbers(dataTable$text)
  dataTable$text<- stripWhitespace(dataTable$text)
  dataTable$text<- tolower(dataTable$text)
  dataTable
}

dataTable<-data.table(text=linesToProcess)

dataTable<-cleanText(dataTable)

Now that I have a data table with the cleaned lines, I can build the n-grams and count the words over the whole sample. I will discard the terms that appear fewer than five times using prune_vocabulary.

I will use functions from the text2vec library.

# Tokenize the cleaned text on whitespace
tokens <- space_tokenizer(dataTable$text)

# Build 3-, 2- and 1-gram vocabularies
vocab3gram <- create_vocabulary(itoken(tokens),ngram = c(3,3))

vocab2gram <- create_vocabulary(itoken(tokens),ngram = c(2,2))

vocab1gram <- create_vocabulary(itoken(tokens),ngram = c(1,1))

# Discard terms that appear fewer than five times
vocab3gram <- prune_vocabulary(vocab3gram, term_count_min = 5)

vocab2gram <- prune_vocabulary(vocab2gram, term_count_min = 5)

vocab1gram <- prune_vocabulary(vocab1gram, term_count_min = 5)

Here I pause to show the term counts in the corpus and the number of documents in which each term appears. For brevity, I show only the top 30 terms.

kable(head(vocab1gram[order(vocab1gram$term_count,decreasing=TRUE),],30))
        term  term_count  doc_count
161925  the       260218     113631
161924  to        154159      92477
161923  and       132652      77277
161922  a         131996      81584
161921  of        109547      66497
161920  i          96352      57119
161919  in         90450      62283
161918  for        62356      50235
161917  is         60656      46021
161916  that       57089      41534
161915  you        55539      40584
161914  it         51758      38412
161913  on         45884      38378
161912  with       39540      32363
161911  my         35773      27372
161910  was        34614      24457
161909  at         31868      27206
161908  this       30947      25935
161907  be         30794      26156
161906  have       29661      25072
161905  are        27478      22827
161904  but        26600      23743
161903  as         26546      19214
161902  we         23372      16566
161901  he         22852      15281
161900  not        22763      19552
161899  so         22233      19598
161898  me         21897      18667
161897  from       21201      18542
161896  all        18854      16797

Plotting

Now I can plot the top n-grams for my analysis. First, the 1-gram frequency plot:

gg<- ggplot(head(vocab1gram[order(vocab1gram$term_count,decreasing=TRUE),],30),aes(x=term,y=term_count, colour=doc_count,group=1))
gg+geom_point(shape=19, size=3)+geom_line()

For 2-grams:

gg<- ggplot(head(vocab2gram[order(vocab2gram$term_count,decreasing=TRUE),],20),aes(x=term,y=term_count, colour=doc_count,group=1))
gg+geom_point(shape=19, size=3)+geom_line()

And finally for 3-grams:

gg<- ggplot(head(vocab3gram[order(vocab3gram$term_count,decreasing=TRUE),],10),aes(x=term,y=term_count, colour=doc_count,group=1))
gg+geom_point(shape=19, size=3)+geom_line()

Even with these graphs it is difficult to see the n-grams in context, so I will use the wordcloud library to build word clouds for my analysis; word clouds are a convenient way to display this kind of information.

For 1-grams:

pal <- brewer.pal(9,"BuGn")
pal <- pal[-(1:2)]
wordcloud(vocab1gram$term, vocab1gram$term_count, scale=c(3,.5), min.freq=2,
          max.words=250, random.order=FALSE, rot.per=.15, colors=pal)

For 2-grams:

pal <- brewer.pal(9,"PuBu")
pal <- pal[-(1:2)]
wordcloud(vocab2gram$term, vocab2gram$term_count, scale=c(2.5,0.4), min.freq=2,
          max.words=200, random.order=FALSE, rot.per=.15, colors=pal)

For 3-grams:

pal <- brewer.pal(9,"PuBu")
pal <- pal[-(1:2)]
wordcloud(vocab3gram$term, vocab3gram$term_count, scale=c(1.5,0.25), min.freq=2,
          max.words=150, random.order=FALSE, rot.per=.15, colors=pal)

The n-grams can be seen more clearly in these plots.

Next steps

The next steps are:

  1. Generate a Markov network for the 2-grams.
  2. Generate a Markov network for the 3-grams.
  3. Build a general vocabulary from the 1-grams.
  4. Use string distance algorithms to suggest an approximate match when a misspelling occurs.
  5. Combine all of these pieces in a function that, given a string, returns n candidate words (a minimal sketch of this idea is shown below).
  6. Put this function in a Shiny app.
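
As an illustration of steps 1 and 5, the following is a minimal sketch of how the bigram counts built above could be turned into a simple next-word lookup. It assumes the vocab2gram object from the previous section is available and relies on text2vec joining the tokens of each n-gram with "_" (its default separator); bigramDT and predictNextWord are hypothetical names used only for this sketch, not the final implementation.

# Split each bigram "w1_w2" into the current word and the candidate next word
bigramDT <- data.table(term = vocab2gram$term, count = vocab2gram$term_count)
bigramDT[, c("word1","word2") := tstrsplit(term, "_", fixed = TRUE, keep = 1:2)]

# Given a word, return the n most frequent observed continuations
predictNextWord <- function(word, n = 3){
  candidates <- bigramDT[word1 == tolower(word)][order(-count)]
  head(candidates$word2, n)
}

# Example usage: the three most frequent continuations of "thank"
predictNextWord("thank", 3)

The 3-gram table could be handled in the same way, keyed on the previous two words, falling back to the bigram table when no trigram match is found.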

I have reviewed Markov networks and I think they are a very good starting point for the predictor; you can find a very nice introductory video here.