Introduction

This report is for the Coursera Capstone Project offered by the Johns Hopkins Bloomberg School of Public Health. It describes a basic exploratory analysis of the Capstone dataset.

The main objective of this course is to apply data science in the area of natural language processing.

The final result of this course will be a Shiny application that accepts text entered by the user and tries to predict what the next word will be.

This milestone report summarises the data processing and descriptive statistics steps for the Capstone project. In addition, strategies for modeling and prediction are also discussed.

Although this report is written for non-data-scientist readers, you can find some code in the annex.

This report will contain the following steps:

1. The training dataset

2. Cleaning the data

3. Tokenization

4. Cluster analysis

5. Next steps: the predictive algorithm

6. Annex

The training dataset

The data, as specified by the project, can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

filepath <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destfile <- file.path(getwd(), "Coursera-SwiftKey.zip")

# Download only once; mode = "wb" keeps the zip file intact on Windows
if (!file.exists(destfile)) download.file(filepath, destfile, mode = "wb")

unzip(destfile)

It consists of three files from different sources: Blogs, News and Tweets.

                 Blogs     News    Tweets
Size (bytes) 260564320 20111392 316037344
Lines           899288    77259   2360148

Cleaning the data

We took a sample of 10000 lines from each file, cleaned the text (cleanText function in the annex), removed emoticons and converted everything to lowercase. Since this project aims to build a predictive text model, we did not remove stopwords at first, in order to see which combinations are the most frequent. The tables are now ready to be tokenized.
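As an illustration, here is a minimal sketch of this step for the blogs file. The file path, the seed and the use of rm_emoticon from qdapRegex are assumptions; cleanText is defined in the annex.

library(qdapRegex)

set.seed(1234)
# Read the raw blogs file and draw a 10000-line sample (path is an assumption)
blogs       <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
blogsSample <- sample(blogs, 10000)
blogsSample <- rm_emoticon(blogsSample)          # strip emoticons (qdapRegex)
blogsSample <- tolower(cleanText(blogsSample))   # cleanText() is defined in the annex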

Tokenization

First we did a frequency analysis including stopwords.

You may notice that the n-gram frequencies are distributed differently depending on the source: Blogs, News or Twitter. One might think that if we knew the platform or the type of writing, the prediction could be more accurate.
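The counts behind the charts below could be obtained with a small helper like the following. This is only a sketch built on the ngrams function of the NLP package (loaded in the annex); the whitespace tokenization is a simplifying assumption.

library(NLP)

# Count n-grams in a character vector of lines (whitespace tokenization)
ngramFreq <- function(lines, n) {
  tokens <- unlist(lapply(strsplit(lines, "\\s+"), function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    vapply(ngrams(w, n), paste, character(1), collapse = " ")
  }))
  sort(table(tokens), decreasing = TRUE)
}

head(ngramFreq(blogsSample, 2), 10)   # ten most frequent bigrams in the blogs sample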

1-gram tokenization:

Here we have the frequency charts for the different sources

2-gram tokenization:

Here we have the 2-gram frequency charts for the different sources

3-gram tokenization:

Here we have the 3-gram frequency charts for the different sources

Now we remove the stopwords and repeat the analysis.
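As a sketch, the stopwords could be stripped with the nsw helper from the annex before recomputing the counts; the exact details here are again an assumption.

library(tm)

# Rebuild each line without stopwords, then recount the n-grams
blogsNoStop <- vapply(strsplit(blogsSample, "\\s+"),
                      function(w) paste(nsw(w), collapse = " "),
                      character(1))

head(ngramFreq(blogsNoStop, 3), 10)   # most frequent trigrams without stopwords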

1-gram tokenization:

Here we have the frequency charts for the different sources

2-gram tokenization:

Here we have the 2-gram frequency charts for the different sources

3-gram tokenization:

Here we have the 3-gram frequency charts for the different sources

Cluster Analysis

We also performed some cluster analysis to identify the different ways of writing. Some groups of words may tend to appear together depending on what is being written.

Each cluster might be a good input for the model we will create later.

We chose to use hierarchical clustering. To set the optimal number of clusters we used the “kl” index.

You can learn more about this index in http://artax.karlin.mff.cuni.cz/r-help/library/NbClust/html/NbClust.html

In this case the optimal number of clusters is 91.
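Here is a minimal sketch of how this step could be computed, assuming a document-term matrix built with tm from the cleaned sample. The distance measure and the NbClust settings are assumptions, and the restriction to the lines containing the most common words (mentioned below) is omitted for brevity; the cos_dist function from the annex could be used instead of the Euclidean distance.

library(tm)
library(NbClust)

# Document-term matrix of the cleaned, stopword-free sample
corpus <- VCorpus(VectorSource(blogsNoStop))
dtm    <- as.matrix(DocumentTermMatrix(corpus))
dtm    <- dtm[rowSums(dtm) > 0, ]                 # drop lines left empty after cleaning

# Hierarchical clustering
hc <- hclust(dist(dtm), method = "ward.D2")

# "kl" index to choose the number of clusters (slow on large samples)
nb     <- NbClust(dtm, distance = "euclidean", min.nc = 2, max.nc = 100,
                  method = "ward.D2", index = "kl")
k      <- nb$Best.nc["Number_clusters"]
groups <- cutree(hc, k = k)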

In the dendrogram below, even though we are not able to see the individual lines, we can see the different clusters coloured.

The more similar the colours are, the closer the lines are to each other.

[Figure: dendrogram with the clusters coloured]
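A coloured dendrogram like this could be drawn with the dendextend package loaded in the annex; this sketch reuses hc and k from the clustering sketch above.

library(dendextend)

dend <- color_branches(as.dendrogram(hc), k = k)   # colour the branches by cluster
plot(dend, leaflab = "none")                       # too many leaves to label them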

In order to see the differences between the clusters, here you can select them and display a chart with the most frequent words (those that appear in at least 10% of the lines).

This cluster model was created using the 7308 lines that contain the 86 most common words.

As an example, here you can see clusters 1, 15 and 35:
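A chart like the ones shown for these clusters could be produced as follows. This sketch reuses dtm and groups from the clustering sketch above; the 10% threshold mirrors the text, while the chart layout is an assumption.

library(ggplot2)

# Bar chart of the words appearing in at least 10% of the lines of one cluster
plotCluster <- function(cl) {
  sub  <- dtm[groups == cl, , drop = FALSE]
  freq <- colSums(sub > 0) / nrow(sub)                  # share of lines containing each word
  freq <- sort(freq[freq >= 0.10], decreasing = TRUE)
  df   <- data.frame(word = names(freq), share = as.numeric(freq))
  ggplot(df, aes(x = reorder(word, share), y = share)) +
    geom_col() +
    coord_flip() +
    labs(title = paste("Cluster", cl), x = NULL, y = "Share of lines")
}

plotCluster(1)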

Next steps: the predictive algorithm

The current plan for the text prediction application is to use the frequencies of trigrams and bigrams, together with the clusters, to estimate the most likely word to follow the entered text. The challenge will be to offer valid predictions for n-grams that are not observed in the data set. In these cases the algorithm will likely fall back on a list of “non-common” words (i.e. factoring out words like the, and, that) and estimate the best possible candidate.
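To make the idea concrete, here is a minimal sketch of such a frequency back-off lookup. The triFreq and biFreq tables are hypothetical (columns context and nextWord, sorted by descending frequency); cleanText comes from the annex.

# triFreq / biFreq: hypothetical data frames with columns `context`
# (the preceding word or word pair) and `nextWord`, sorted by frequency
predictNext <- function(text, triFreq, biFreq, default = "the") {
  words <- strsplit(tolower(cleanText(text)), "\\s+")[[1]]
  words <- words[words != ""]
  n <- length(words)
  if (n >= 2) {                                   # try the trigram table first
    hit <- triFreq$nextWord[triFreq$context == paste(words[n - 1], words[n])]
    if (length(hit) > 0) return(hit[1])
  }
  if (n >= 1) {                                   # back off to the bigram table
    hit <- biFreq$nextWord[biFreq$context == words[n]]
    if (length(hit) > 0) return(hit[1])
  }
  default                                         # last resort: a very common word
}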

The next step for the project will be to validate the sample and to build a model for the final phase.

Annex

R Packages:

library(NLP)
library(tm)
library(qdapRegex)
library(ggplot2)
library(plyr)
library(RColorBrewer)
library(wordcloud)
library(sqldf)
library(akmeans)
library(NbClust)
library(lsa)
library(ape)
library(arules)
library(fpc)
library(gridExtra)
library(dendextend)

Functions:

# Create clean-up functions
cleanText = function(x){
  # Expand common contractions; this simple function does not cover
  # ambiguities such as 's or 'd
  x <- gsub("let's", "let us", x)
  x <- gsub("I'm", "I am", x)
  x <- gsub("'re", " are", x)
  x <- gsub("n't", " not", x)
  x <- gsub("'ll", " will", x)
  x <- gsub("'ve", " have", x)
  # Remove curly quotes and other mis-encoded characters
  x <- gsub("’|“|â€", "", x)
  # Keep only letters and spaces
  x <- gsub("[^a-zA-Z ]", "", x)
  return(x)
}


## Remove stopwords: keep only the tokens that are not in tm's stopword list
nsw = function(x){ x[!(x %in% stopwords())] }


## Cosine metric: 1 minus the cosine similarity between two vectors
cos_dist = function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}