I present the work I’ve done for each task. First of all, I’m working in a directory on my computer.

setwd("C:/Users/Bruno Gonzalez/Google Drive/JHU_Data_Science/10 Data Science Capstone/")

Task 0: Understanding the problem

  1. Obtaining the data
  2. Familiarizing yourself with NLP and text mining

Task 0 only required downloading the data file. The code for doing this is shown below.

  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "./Coursera-SwiftKey.zip")

Task 1: Getting and Cleaning Data

  1. Tokenization
  2. Profanity filtering

First of all, for the examples presented in this work I’m only using the “en_US.news” data to save some memory. I start by simply reading the file as a character vector.

  nws <- readLines("./final/en_US/en_US.news.txt")
## Warning in readLines("./final/en_US/en_US.news.txt"): incomplete final line
## found on './final/en_US/en_US.news.txt'
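The warning only means that the file does not end with a newline; the lines themselves are read correctly. If desired, it can be silenced by reading through an explicit connection (a sketch of an alternative, not the call used for the results below):

  con <- file("./final/en_US/en_US.news.txt", open = "r")
  nws <- readLines(con, warn = FALSE)   # warn = FALSE suppresses the incomplete-final-line warning
  close(con)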

This function returns the length (in characters) of the longest line.

fun <- function(dat){
  
  # start with the length of the first line
  x <- nchar(dat[1])
  n <- length(dat)
  
  # keep the running maximum over the remaining lines
  for(i in 2:n){
    m <- nchar(dat[i])
    if(x < m){x <- m}
  }
  
  x
}
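As a side note, the same value can be obtained with one line of base R (shown here only as a reference; it is not used for the results below):

  max(nchar(nws))   # length, in characters, of the longest line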

After this I remove all punctuation signs and all digits.

  tok_nws <- gsub("[[:punct:]]", "", nws)
  tok_nws <- gsub("[[:digit:]]+", "", tok_nws)

Then I split each line into words and convert the resulting character vector to a factor. This will make the data easier to analyze.

  tok_nws <- strsplit(tok_nws, "\\s+")
  
  tok2_nws <- unlist(tok_nws)
  tok2_nws <- as.factor(tok2_nws)
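A quick way to inspect the result of this step (the exact values depend on the file):

  head(tok2_nws)      # first few tokens
  nlevels(tok2_nws)   # number of distinct words (factor levels)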

Task 2: Exploratory Data Analysis

  1. Exploratory analysis
  2. Understand frequencies of words and word pairs

I start by looking at the words that repeat the most. The 30 most frequent are presented in the table below.

  sum_nws <- summary(tok2_nws, maxsum = 31)
  sum_nws
##     the      to     and       a      of      in     for    that      is 
##  132178   68718   65667   63707   58822   47980   25819   24950   21727 
##      on    said    with     The     was      at      as      it      he 
##   19622   19112   18966   18891   17576   15432   13432   13024   12858 
##      be       I    from     his    have     are      by     has      an 
##   11489   11480   11359   11336   10915   10600    9820    9304    9043 
##    will     who     not (Other) 
##    8300    8189    8175 1802814

It can be seen from the bar chart that “the” is the most frequent word in the text, followed by “to” and “and”.

  barplot(sum_nws[1:30])

Then, I will analyze which words most frequently follow “the”.

  x <- grep("^[Tt]he$", tok2_nws)   # match only the exact word "the"/"The"
  x <- x + 1                        # positions of the following word
  x <- x[x <= length(tok2_nws)]     # drop an index that would fall past the end
  sig_the <- tok2_nws[x]
  sum_the <- summary(sig_the)

By plotting this information, it can be seen that “first” is the most frequent word after “the”.

  barplot(sum_the[1:23])

Task 3: Modeling

  1. Build basic n-gram model
  2. Build a model to handle unseen n-grams

For the modeling I wrote a function that builds a matrix of frequencies for pairs of consecutive words.

gram_mc <- function(s){
  x <- levels(s)                     # distinct words (factor levels)
  l <- length(x)
  mat <- matrix(0, nrow = l, ncol = l)
  
  for (i in 1:l){
    a <- which(s == x[i])            # positions where the first word occurs
    a <- a[a < length(s)]            # the last token has no following word
    b <- s[a + 1]                    # the words that follow it
    
    for(j in 1:l){
      mat[i,j] <- sum(b == x[j])     # count of the pair (x[i], x[j])
    }
  }
  mat
}

The rows correspond to the first word and the columns to the second. Since this matrix can be seen as a Markov chain transition matrix (each row just needs to be divided by its row total), the frequencies of the word two positions ahead can be calculated by multiplying this matrix by itself.
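A minimal sketch of that idea, run on a small sample of the tokens because gram_mc grows quadratically with the number of distinct words (the sample size and object names here are only illustrative):

  smpl <- droplevels(tok2_nws[1:2000])   # small sample, with unused factor levels dropped
  mat  <- gram_mc(smpl)
  
  # divide each row by its total to get a Markov transition matrix P,
  # where P[i, j] is the probability that word j follows word i
  rs <- rowSums(mat)
  P  <- mat / ifelse(rs == 0, 1, rs)     # guard against words with no observed successor
  
  # multiplying the transition matrix by itself gives the distribution
  # of the word two positions ahead of the current one
  P2 <- P %*% P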

This project has been a real challenge for me, but I think I’ve made the right advances to achieve the goal.