This report summarizes the milestone work for the capstone project of the Johns Hopkins Data Science Specialization.

Here, I'll explain how to get the data for this task, show some graphics to help understand the problem, and outline my future approach to save the world and predict sentences.

Part 0: The problem

In this capstone, according to our bosses, we need to apply data science in the area of natural language processing (NLP). Trying to predict the next word given some previous text is exactly that kind of problem.

To achieve this, we need some data to "learn" about this problem so we can apply what we learn to similar problems in the future. Does this mean we will predict what people say without a doubt? Of course not, but we will have a tool that analyzes a sentence and, according to my research, it will be quite good at it.

Part 1: Data acquisition and cleaning

We'll use data from a corpus called HC Corpora (http://www.corpora.heliohost.org).

The data contains four folders, one for each of four different languages. Our task is to analyze only the English language.

To download the data, use the code below.

dest <- "Coursera-SwiftKey.zip"
sour <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Using download.file to download the url into the .zip
download.file(url = sour, 
              destfile = dest)

# Extracting the files
unzip(zipfile = dest)

# Reading the data (one line per document)
blogs   <- readLines(con = "final/en_US/en_US.blogs.txt", 
                     encoding = "UTF-8")
twitter <- readLines(con = "final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8")
news    <- readLines(con = "final/en_US/en_US.news.txt", 
                     encoding = "UTF-8")

When we talk about "processing the sentences", we first need to clean our data. Among all the options listed in the CRAN Natural Language Processing task view (http://cran.r-project.org/web/views/NaturalLanguageProcessing.html), which summarizes the NLP packages available in R, I've decided to use the tm package for processing the data.

The tm package is thoroughly explained in http://www.jstatsoft.org/v25/i05/paper and has all the preprocessing techniques we need.

Below is the code to clean our data. It does not (at the moment) include a profanity filter, because I still need to investigate a reliable list of bad words.

library(tm)

# Sampling data to generate the wordclouds
sample_blogs   <- sample(x = blogs, size = 1000)
sample_news    <- sample(x = news, size = 1000)
sample_twitter <- sample(x = twitter, size = 1000)

sample_all <- c(sample_blogs, sample_news, sample_twitter)

# Build a corpus and apply the cleaning transformations
cloud.corpus <- Corpus(x = VectorSource(x = sample_all))
cloud.corpus <- tm_map(x = cloud.corpus, 
                       FUN = content_transformer(FUN = tolower))
cloud.corpus <- tm_map(x = cloud.corpus, 
                       FUN = removePunctuation)
cloud.corpus <- tm_map(x = cloud.corpus,
                       FUN = stripWhitespace)
cloud.corpus <- tm_map(x = cloud.corpus,
                       FUN = removeNumbers)
cloud.corpus <- tm_map(x = cloud.corpus, 
                       FUN = stemDocument)

Using these tm functions allows us to:

1. Transform all letters to lower case, so we can match examples such as "House" with "house".
2. Remove punctuation from ALL the data, so we can match, for example, "day." with "day".
3. Remove numbers, because we want to compare words, not numbers.
4. Strip the extra whitespace left over by the previous steps.
5. Stem the documents, so that variations such as "play", "plays" and "playing" are reduced to a common root.

Above, we have an example of the power of the tm package: we plot one wordcloud that keeps the common English words (stop words) and another with them removed. It's quite fascinating how the wordclouds change.
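
For reference, here is a minimal sketch of how those two clouds could be generated with the wordcloud package from the cloud.corpus built above; the stop-word removal uses tm's removeWords and stopwords, and the plotting parameters are my own choices, not necessarily the ones used for the original figures.

library(wordcloud)

# Term frequencies from the cleaned corpus
tdm   <- TermDocumentMatrix(cloud.corpus)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Wordcloud keeping the common English words (stop words)
wordcloud(words = names(freqs), freq = freqs,
          max.words = 100, random.order = FALSE)

# Wordcloud after removing the common English words
nostop.corpus <- tm_map(x = cloud.corpus,
                        FUN = removeWords, stopwords("english"))
tdm2   <- TermDocumentMatrix(nostop.corpus)
freqs2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
wordcloud(words = names(freqs2), freq = freqs2,
          max.words = 100, random.order = FALSE)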

Part 2: Exploratory analysis

##                   Blogs        News      Tweets
## Lines           899,288   1,010,242   2,360,148
## LinesNEmpty     899,288   1,010,242   2,360,148
## Chars       206,824,382 203,223,154 162,384,825
## CharsNWhite 170,389,539 169,860,866 134,370,070
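
The row names in this table (Lines, LinesNEmpty, Chars, CharsNWhite) match the output of stri_stats_general() from the stringi package, so a sketch along the following lines would reproduce it, reusing the objects read in Part 1.

library(stringi)

# General statistics per file: Lines, LinesNEmpty, Chars, CharsNWhite
sapply(X = list(Blogs = blogs, News = news, Tweets = twitter),
       FUN = stri_stats_general)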

As you can see, the files are very large. In order to do an appropriate analysis, we need to sample the data into smaller pieces (as we did above to generate the clouds). It's important to note that sampling the data can cost us some accuracy, because we are not using ALL the examples. However, from a computational point of view, it allows for much faster prediction.

Summary for the en_US.blogs.txt file

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00

Summary for the en_US.news.txt file

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00

Summary for the en_US.twitter.txt file

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.79   18.00   61.00

(Histograms of the three distributions summarized above were plotted here; ggplot2's default binwidth of range/30 was used.)
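
For reference, here is a sketch of how the per-file summaries and histograms above could be produced, assuming the statistics describe words per line (stri_count_words comes from stringi; qplot from ggplot2 is what emits the stat_bin messages).

library(stringi)
library(ggplot2)

# Words per line for each file (assumption: the summaries above are word counts)
blogs_words   <- stri_count_words(blogs)
news_words    <- stri_count_words(news)
twitter_words <- stri_count_words(twitter)

summary(blogs_words)
summary(news_words)
summary(twitter_words)

# Histograms of the three distributions (default binwidth of range/30 applies)
qplot(blogs_words,   main = "Words per line: blogs")
qplot(news_words,    main = "Words per line: news")
qplot(twitter_words, main = "Words per line: twitter")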

Part 3: Future work

Our shiny app (our "data product") should have a text input where the user enters a sentence, and it should predict the next word. According to https://en.wikipedia.org/wiki/Natural_language_processing, one way to approach the problem is using 2-gram, 3-gram or 4-gram datasets. Basically, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application (explained in http://en.wikipedia.org/wiki/N-gram).

##  [1] "can"  "day"  "get"  "just" "know" "like" "make" "new"  "one"  "said"
## [11] "say"  "time" "will" "year"

(2-gram frequencies were displayed here.)

(3-gram frequencies were displayed here.)

Above, we have the frequencies per n-gram for n = 1, 2 and 3. The goal is to predict the (n+1)-th word using the previous n words.
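
For illustration, here is a minimal base-R sketch of how such n-gram frequency tables could be built from the cleaned sample; the helper ngram_freq and the object names are my own and may differ from the code used to produce the results above.

# Count n-gram frequencies in a character vector of lines.
# ngram_freq is a helper written only for this sketch.
ngram_freq <- function(lines, n) {
  # Basic cleaning, mirroring the tm steps above
  lines <- tolower(gsub("[[:punct:][:digit:]]", "", lines))
  words <- unlist(strsplit(lines, "\\s+"))
  words <- words[words != ""]
  if (length(words) < n) return(integer(0))
  # Slide a window of length n over the word sequence
  # (for simplicity, n-grams may span line boundaries)
  grams <- vapply(seq_len(length(words) - n + 1),
                  FUN = function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  FUN.VALUE = character(1))
  sort(table(grams), decreasing = TRUE)
}

freq1 <- ngram_freq(sample_all, 1)
freq2 <- ngram_freq(sample_all, 2)
freq3 <- ngram_freq(sample_all, 3)

head(freq2, 10)  # the ten most frequent 2-grams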

If we have the "most common" (n-1)-grams, we can predict the next word needed to complete an n-gram. I've decided to use this approach because Katz's back-off model (http://en.wikipedia.org/wiki/Katz%27s_back-off_model) is based on conditional probabilities, and since I'm a researcher at UCV (Universidad Central de Venezuela) teaching probability, it's easier for me to use a model based on probabilities.
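
As a first sketch of that idea, the function below implements a simplified back-off, not the full Katz model (there is no discounting; it simply uses the longest matching history). It assumes the freq1, freq2 and freq3 tables built in the n-gram sketch above; everything here is illustrative rather than the final prediction code.

# Simplified back-off predictor (a sketch; no Katz discounting)
predict_next <- function(sentence, freq1, freq2, freq3) {
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  n <- length(words)

  # Try trigrams first: match the last two words of the sentence
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- freq3[startsWith(names(freq3), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(which.max(hits))
      return(tail(unlist(strsplit(best, " ")), 1))
    }
  }

  # Back off to bigrams: match only the last word
  if (n >= 1) {
    hits <- freq2[startsWith(names(freq2), paste0(words[n], " "))]
    if (length(hits) > 0) {
      best <- names(which.max(hits))
      return(tail(unlist(strsplit(best, " ")), 1))
    }
  }

  # Last resort: the most frequent unigram
  names(which.max(freq1))
}

predict_next("i love new", freq1, freq2, freq3)

A real implementation would replace the raw counts with discounted conditional probabilities, as in the Katz model, but the back-off structure would stay the same.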