Introduction

This report provides a solution to the task of week 2 of the Data Science Capstone course.

The goal of the task is to build an algorithm that predicts text: given the first word or words typed by a user, the algorithm must predict the next word. If a user types “The”, the algorithm must predict the second word; in the same way, whenever the user writes a word or a set of words, the algorithm must predict the next one. In this work the algorithm is built to predict the next word when the user provides a single word (1-gram); later in the course it will be extended to 2-grams and 3-grams.

The data are three corpora (sets of texts) provided by the course. The corpora are three files with extension .txt, called en_US.blogs, en_US.news and en_US.twitter. The number of lines in each corpus is reported in the next section.

The processing of the corpora was performed through functions created by the author of this document, explained below.

Loading the corpora

The blogs, news and tweets corpora were loaded. The following table shows the number of lines of text per corpus (a sketch of how these counts can be reproduced follows the table). It can be observed, for example, that the news corpus has more than 1 million lines of text related to news.

Corpus  Number of lines
Blogs   899288
News    1010242
Tweets  2360148
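
These counts come from the raw text files. A minimal sketch of how they can be reproduced, assuming the three .txt files provided by the course are in the working directory:

#Count the number of lines in each raw corpus file
blogs_lines  <- length(readLines("en_US.blogs.txt",   skipNul = TRUE))
news_lines   <- length(readLines("en_US.news.txt",    skipNul = TRUE))
tweets_lines <- length(readLines("en_US.twitter.txt", skipNul = TRUE))

data.frame(Corpus = c("Blogs", "News", "Tweets"),
           Number_of_Lines = c(blogs_lines, news_lines, tweets_lines))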

The corpora contain a great deal of information with which to build the algorithm; however, taking into account all the words of the corpora would seriously compromise the functionality of the algorithm. To solve this, a sample was taken from each corpus in such a way that the proportion of occurrence of each word in the sample represents the population proportion of occurrence with 99% confidence, assuming that the proportion of occurrence of each word follows a normal distribution (Central Limit Theorem). Assuming a 99% confidence level, a margin of error of 0.05 and an approximate population proportion of 0.5, it was determined that the sample size for each corpus is 664 lines of text, as calculated below.
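
The sample size follows from the usual formula for estimating a proportion, n = z^2 * p * (1 - p) / e^2. A minimal sketch of the calculation, together with the drawing of one sample (the object blogs_full holding the full blogs corpus is hypothetical):

#Sample size for estimating a proportion with 99% confidence
z <- qnorm(0.995)   #two-sided 99% confidence level
p <- 0.5            #conservative value for the population proportion
e <- 0.05           #margin of error
n <- ceiling(z^2 * p * (1 - p) / e^2)
n                   #664

#Draw a reproducible sample of n lines (blogs_full is a hypothetical character vector)
set.seed(123)
blogs_sample <- sample(blogs_full, n)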

Cleaning and transforming the corpora

In text-mining applications it is common to transform the texts of the corpora into lowercase words, and this was done here. In addition, each corpus has its own problems, the most common being punctuation marks and characters that do not represent words, such as @. The punctuation marks and characters present in each corpus were identified and eliminated. For this task the functions fwpnt(), frp(), feds() and frpds(), created by the author of this document, were used.

#Set working directory where the sampled corpora are stored
setwd("D:/Santiago/2016/Data_Science/Capstone-Project/Coursera-SwiftKey/final/en_US/Samples")

data_sets <- list.files()

#Load the sampled corpora and transform the words into lowercase
blogs <- read.csv(data_sets[1])
blogs <- as.character(blogs[,2])
blogs <- tolower(blogs)

news <- read.csv(data_sets[2])
news <- as.character(news[,2])
news <- tolower(news)

tweets <- read.csv(data_sets[3])
tweets <- as.character(tweets[,2])
tweets <- tolower(tweets)
#Helper functions to clean the corpora and remove profanity

#Find the lines of a corpus that contain given words, punctuation or numbers.
#words: a regular expression (for example a character class); data_set: a corpus vector, one text per row.
#Outcome: the corpus lines that contain the searched pattern
fwpnt <- function(words, data_set){
        vector_sel <- grepl(words, data_set)
        data_set <- data_set[vector_sel]
        data_set
}
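
A short usage example of fwpnt() with a made-up input vector, keeping only the lines that contain a digit:

fwpnt("[0-9]", c("room 101", "no numbers here", "call 911"))
#returns "room 101" and "call 911"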

#Remove profanity (or any other word) that should not be predicted.
#words: the word or pattern to replace; remplacement: the character string used as the replacement.
#data_sets: a corpus vector, one text per row. Outcome: a corpus vector without the profanity
frp <- function(words, remplacement, data_sets){
        data_sets <- gsub(words, remplacement, data_sets)
        data_sets
}

#Eliminate double spaces. data_sets: a corpus vector of texts. Outcome: a corpus vector without double spaces
feds <- function(data_sets){
        while(length(fwpnt("  ", data_sets)) > 0 ){
                data_sets <- frp("  ", " ", data_sets)
        }
        data_sets
}
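
A short usage example combining frp() and feds() with a made-up unwanted word: the replacement leaves a double space behind, which feds() then collapses:

x <- frp("badword", " ", c("this badword sentence", "a clean sentence"))
feds(x)
#returns "this sentence" and "a clean sentence"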

#Function that eliminates punctuation signs, digits and double spaces
frpds <- function(data_sets){
        #Replace punctuation signs, digits and special symbols with a space
        data_sets <- frp("[[:punct:][:digit:]¿¡¬°ºª·´¨çÇ¦¥»©¹²³ã½¸¢±ðÿ]",
                         " ",
                         data_sets)
        data_sets <- feds(data_sets)
        data_sets
}
#Clean the blogs, news and tweets corpora
blogs <- frpds(blogs)
news <- frpds(news)
tweets <- frpds(tweets)

#Pad each line with a leading and a trailing space so that a whole word can later be matched as " word "
blogs <- paste(" ", blogs, " ", sep = "")
news <- paste(" ", news, " ", sep = "")
tweets <- paste(" ", tweets, " ", sep = "")

Dictionaries of the n-gram algorithm

We now describe the steps followed to build the dictionaries, the fundamental elements of the n-gram algorithm.

The first step is to create a dictionary with the unique words that make up the blogs, news and tweets corpora. This dictionary is called dictionary_1.

The second step is to create a dictionary with the unique two-word combinations that appear in the blogs, news and tweets corpora. This dictionary is called dictionary_2.
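
The report does not list the code used to build the dictionaries. A minimal sketch using only base R, assuming the cleaned and space-padded corpora from the previous section (the object names dictionary_1 and dictionary_2 are taken from the text, the rest are illustrative):

#Split the combined sample corpora into words
corpus <- c(blogs, news, tweets)
word_lists <- strsplit(corpus, " +")
words <- unlist(word_lists)
words <- words[words != ""]

#dictionary_1: the unique single words (1-grams)
dictionary_1 <- unique(words)

#dictionary_2: the unique two-word combinations (2-grams) within each line
two_grams <- unlist(lapply(word_lists, function(w){
        w <- w[w != ""]
        if (length(w) < 2) return(character(0))
        paste(w[-length(w)], w[-1])
}))
dictionary_2 <- unique(two_grams)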

The third step is to create a function that calculates the absolute frequency of the words found in dictionary_1 and the absolute frequency of the combinations found in dictionary_2. This function is called ffrec_sub() and its code is shown below.

# Function that counts the minimum number of times a word appears in a corpus.
# Within a single line a word may appear one or more times, but the function
# counts that word only once per line, so the resulting frequency is an undercount.
ffrec_sub <- function(word, data_sets){
        frec_sub <- sum(grepl(word, data_sets))
        frec_sub
}
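
A short usage sketch of ffrec_sub(), applying it over the dictionaries built above to obtain the absolute frequencies used in the plots below. Each entry is searched surrounded by spaces (made possible by the padding applied earlier) so that whole words are matched; the objects corpus, dictionary_1 and dictionary_2 come from the sketch above:

#Absolute frequency of every 1-gram and 2-gram
freq_1 <- sapply(dictionary_1, function(w) ffrec_sub(paste0(" ", w, " "), corpus))
freq_2 <- sapply(dictionary_2, function(w) ffrec_sub(paste0(" ", w, " "), corpus))

#Top 50 most frequent entries of each dictionary
head(sort(freq_1, decreasing = TRUE), 50)
head(sort(freq_2, decreasing = TRUE), 50)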

Below are two graphs built from the results of the ffrec_sub() function: one with the top 50 most repeated single words (1-grams) of dictionary_1 and the other with the top 50 most repeated two-word combinations (2-grams) of dictionary_2. The most repeated word (1-gram) is “the”, followed by “to” and “and”. After that, the absolute frequency distribution within the top 50 decreases rapidly. The most frequent two-word combination (2-gram) is “in the”, followed by “of the”; from there the distribution also decreases rapidly.

The fourth step is to prune the dictionaries: only the entries that are repeated two or more times are kept. In this way the dictionaries contain less irrelevant information, which makes the decision rules of the following steps more efficient.
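
A minimal sketch of this pruning step, assuming the frequency vectors freq_1 and freq_2 from the sketch above:

#Keep only the dictionary entries that appear at least twice in the sample corpora
keep_1 <- freq_1 >= 2
keep_2 <- freq_2 >= 2
dictionary_1 <- dictionary_1[keep_1]
freq_1 <- freq_1[keep_1]
dictionary_2 <- dictionary_2[keep_2]
freq_2 <- freq_2[keep_2]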

Basic n-gram algorithm

The decision rules of the basic n-gram algorithm, based on the dictionaries, are as follows (a sketch of a prediction function that implements them is given after the list):

  1. The dictionaries are loaded.
  2. The last word of the text entered by the user is determined.
  3. The word identified in step 2 is searched for in the dictionaries, starting with dictionary_2. If the word is found in dictionary_2, the algorithm returns the most frequent two-word combination in dictionary_2 that begins with the word from step 2.
  4. If the word from step 2 is not found in dictionary_2, the algorithm returns the word from step 2 followed by the most frequent word of dictionary_1.
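
The rules above are described in prose only. A minimal sketch of a prediction function that implements them, assuming the pruned dictionaries and frequency vectors from the previous sketches (all object names other than dictionary_1 and dictionary_2 are illustrative):

predict_next <- function(text){
        #Step 2: take the last word of the text entered by the user
        words <- unlist(strsplit(tolower(text), " +"))
        words <- words[words != ""]
        last_word <- tail(words, 1)

        #Step 3: look for 2-grams in dictionary_2 that begin with that word
        candidates <- dictionary_2[grepl(paste0("^", last_word, " "), dictionary_2)]
        if (length(candidates) > 0) {
                return(candidates[which.max(freq_2[candidates])])
        }

        #Step 4: otherwise return the last word followed by the most frequent 1-gram
        paste(last_word, dictionary_1[which.max(freq_1)])
}

predict_next("The")   #returns the most frequent 2-gram that begins with "the"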