This is the milestone report for the Data Science Capstone project. The report is divided into three parts. First, we load and clean the data from SwiftKey. Second, we perform an exploratory data analysis. Finally, we describe the future work of this project.
In this report we use the English documents of the SwiftKey dataset: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. To load and analyse these documents with R [1], we use the “tm” [2] and “quanteda” [3] packages. As a first step, we load the three documents and show a short summary.
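The code in this report assumes the following setup. The packages are the ones cited above; the BASE_PATH location and the DO_SAMPLING flag are placeholder values to be adapted to the local directory layout.

library(tm)        # VCorpus, tm_map, TermDocumentMatrix
library(quanteda)  # corpus, tokenize, dfm, topfeatures
library(ggplot2)   # bar plots and histogram

BASE_PATH <- "./"     # placeholder: folder containing en_US/ and en_US_sample/
DO_SAMPLING <- TRUE   # placeholder: set to FALSE to reuse an existing sample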
texts <- VCorpus(DirSource(paste(BASE_PATH,"en_US",sep=""), encoding = "UTF-8"))  # complete (unsampled) documents
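The short summary can be produced, for example, with summary(), which lists the three documents together with their length and class:

summary(texts)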
In this section, we create a function to sample the documents. This function takes an argument that gives the fraction of lines from the original document to include in the sampled document; each line is kept independently with that probability.
sample_document <- function(path_input_doc, path_output_doc, percent_sampling_lines){
  con_in <- file(path_input_doc, open="rb")
  # start from a clean output file
  if (file.exists(path_output_doc)){
    file.remove(path_output_doc)
  }
  con_out <- file(path_output_doc, open="w")
  text_lines <- readLines(con_in, encoding="UTF-8")
  # keep each line independently with probability percent_sampling_lines
  for(i in 1:length(text_lines)){
    if(rbinom(1, 1, percent_sampling_lines) == 1) {
      write(text_lines[i], con_out, append=TRUE)
    }
  }
  close(con_in)
  close(con_out)
}
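# The sampling relies on rbinom, so the selected lines change on every run.
# Setting a seed (the value below is arbitrary) makes the sample reproducible.
set.seed(1234)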
if(DO_SAMPLING){
  percent_sampling_lines <- 0.01
  sample_document("./en_US/en_US.blogs.txt", "./en_US_sample/en_US.blogs.txt", percent_sampling_lines)
  sample_document("./en_US/en_US.news.txt", "./en_US_sample/en_US.news.txt", percent_sampling_lines)
  sample_document("./en_US/en_US.twitter.txt", "./en_US_sample/en_US.twitter.txt", percent_sampling_lines)
}
sample_texts <- VCorpus(DirSource(paste(BASE_PATH,"en_US_sample",sep=""), encoding = "UTF-8"))
In this section, we tokenize the texts and we remove the profane words.
We use the following function to split the text into tokens and to remove punctuation, numbers and extra whitespace, to lowercase the text, and to drop English stop words. We also stem the words of the documents so that we do not distinguish between the singular and plural forms of the same word, between different conjugations of the same verb, and so on.
tokenize_file <- function(text_file){
  text_trans <- tm_map(text_file, content_transformer(tolower))
  text_trans <- tm_map(text_trans, removeWords, stopwords("english"))
  text_trans <- tm_map(text_trans, removePunctuation)
  text_trans <- tm_map(text_trans, removeNumbers)
  text_trans <- tm_map(text_trans, stemDocument)
  text_trans <- tm_map(text_trans, stripWhitespace)
  text_trans
}
This function removes the profane words from the text. The profane-word list was taken from [4].
SWEAR_WORDS <- scan(paste(BASE_PATH,"google_bad_words_utf.txt",sep=""), what="", sep="\n")
filter_profane_words <- function(text_file){
  text_trans <- tm_map(text_file, removeWords, SWEAR_WORDS)
  text_trans <- tm_map(text_trans, stripWhitespace)
  text_trans  # return the filtered corpus explicitly
}
We now execute the actions described in the previous sections: first we tokenize the data, and second we filter the profane words. We use the sampled data in the following sections.
sample_texts <- tokenize_file(sample_texts)
sample_texts <- filter_profane_words(sample_texts)
In the following table and plots, we compute basic metrics for the three files and compare their sizes in terms of lines, words and characters. These metrics are computed over the complete files, not over the sampled files. As the plots show, although the Twitter file has many more lines, the other two files contain more words and more characters.
get_summary_stats <- function(text_file){
  id <- text_file$meta$id
  text_content <- text_file$content
  total_lines <- length(text_content)
  words_per_line <- sapply(strsplit(text_content, "\\s+"), length)
  words_longest_line <- max(words_per_line)
  total_words <- sum(words_per_line)
  chars_per_line <- sapply(text_content, nchar)
  chars_longest_line <- max(chars_per_line)
  total_chars <- sum(chars_per_line)
  c(id, total_lines, total_words, total_chars, words_longest_line, chars_longest_line)
}
summary_tab <- data.frame(rbind(get_summary_stats(texts[[1]]),
get_summary_stats(texts[[2]]),
get_summary_stats(texts[[3]])))
colnames(summary_tab) <- c("file","lines","words","chars","words longest line","chars longest line")
summary_tab$lines <- as.numeric(as.character(summary_tab$lines))
summary_tab$words <- as.numeric(as.character(summary_tab$words))
summary_tab$chars <- as.numeric(as.character(summary_tab$chars))
summary_tab$"words longest line" <- as.numeric(as.character(summary_tab$"words longest line"))
summary_tab$"chars longest line" <- as.numeric(as.character(summary_tab$"chars longest line"))
ggplot(data=summary_tab, aes(x=file, y=lines, fill=file)) +
geom_bar(stat="identity", position=position_dodge(), colour="black")
ggplot(data=summary_tab, aes(x=file, y=words, fill=file)) +
geom_bar(stat="identity", position=position_dodge(), colour="black")
ggplot(data=summary_tab, aes(x=file, y=chars, fill=file)) +
geom_bar(stat="identity", position=position_dodge(), colour="black")
For this project, we want to predict the next word that a user could write in a sentence. Therefore, it is important to give a preliminary analysis of word frequencies. In this section, we use the sampled data, which gives a good picture of the corpus while reducing the computing requirements. For this task we mainly use the TermDocumentMatrix function from the “tm” [2] package, which builds a matrix that stores, for each document, the number of occurrences of each word.
tdm <- TermDocumentMatrix(sample_texts)
m <- as.matrix(tdm)
col_names <- colnames(m)
m <- cbind(m,rowSums(m))
colnames(m) <- c(col_names,"frequency")
freq_df <- data.frame(m)
freq_df <- freq_df[order(freq_df$frequency,decreasing = TRUE),]
In the following table, we show the 10 most frequent words across the three documents.
head(freq_df,10)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt frequency
## one 1375 908 855 3138
## will 1098 1126 882 3106
## just 956 587 1512 3055
## like 1124 606 1285 3015
## get 928 585 1474 2987
## said 378 2370 174 2922
## can 1033 606 969 2608
## time 973 664 860 2497
## day 705 430 1104 2239
## year 622 1117 389 2128
In the following histogram, we count the number of times (the frequency) that each word appears in the whole sample, restricting the plot to frequencies from 1 to 10. We see that the most common situation is a word that appears only once.
ggplot(data=freq_df, aes(x=frequency)) + geom_histogram() +
  scale_x_continuous(breaks=1:10, limits=c(0, 10))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
However, most of the text consists of words that are repeated many times. In the following code, we compute the fraction of the text that is covered by the most frequent words.
freq_df$cum_freq <- 0
num_total_words <- sum(freq_df$frequency)
# walk from the least to the most frequent word so that, for each row i,
# cum_freq holds the number of word occurrences covered by rows 1..i
i <- nrow(freq_df)
while(i > 0){
  freq_df[i,]$cum_freq <- num_total_words
  num_total_words <- num_total_words - freq_df[i,]$frequency
  i <- i - 1
}
num_total_words <- sum(freq_df$frequency)
freq_df$percent_total_words <- freq_df$cum_freq/num_total_words
freq_df_50 <- freq_df[freq_df$percent_total_words<0.5,]
freq_df_90 <- freq_df[freq_df$percent_total_words<0.9,]
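The loop above is simply a cumulative sum over the frequency-ordered words. An equivalent, more direct way to obtain the same columns and the word counts reported below (assuming freq_df is still sorted by decreasing frequency) would be:

freq_df$cum_freq <- cumsum(freq_df$frequency)
freq_df$percent_total_words <- freq_df$cum_freq / sum(freq_df$frequency)
nrow(freq_df[freq_df$percent_total_words < 0.5,])  # words covering 50% of the text
nrow(freq_df[freq_df$percent_total_words < 0.9,])  # words covering 90% of the text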
We see that 580 words cover 50% of the total sampled text and 7176 words cover 90% of it.
In this section, we compute the frequencies of the bigrams and trigrams of the sampled corpus. Bigrams are sequences of two consecutive words in the corpus; similarly, trigrams are sequences of three consecutive words.
corp_blogs <- corpus(sample_texts[[1]]$content)
## Non-UTF-8 encoding (possibly) detected :ISO-8859-9.
corp_news <- corpus(sample_texts[[2]]$content)
corp_twitter <- corpus(sample_texts[[3]]$content)
## Non-UTF-8 encoding (possibly) detected :ISO-8859-2.
corp_all <- corp_blogs + corp_news + corp_twitter
NUM_TOP_EXPRESSIONS <- 10
Frequencies of the 10 most frequent bigrams in the complete sampled corpus:
bigram_all <- tokenize(corp_all, ngrams = 2)
bigram_all <- unlist(bigram_all, use.names = FALSE)
dfm_bigram_all <- dfm(bigram_all)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 506,610 documents
## ... indexing features: 397,651 feature types
## ... created a 506610 x 397651 sparse dfm
## ... complete.
## Elapsed time: 9.648 seconds.
top_bigram_all <- topfeatures(dfm_bigram_all, n=NUM_TOP_EXPRESSIONS)
top_bigram_all
## right_now last_year new_york look_like year_ago
## 271 225 214 205 170
## look_forward feel_like last_night last_week high_school
## 168 165 163 138 130
Frequencies of the 10 most frequent trigrams in the complete sampled corpus:
trigram_all <- tokenize(corp_all, ngrams = 3)
trigram_all <- unlist(trigram_all, use.names = FALSE)
dfm_trigram_all <- dfm(trigram_all)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 465,254 documents
## ... indexing features: 459,277 feature types
## ... created a 465254 x 459277 sparse dfm
## ... complete.
## Elapsed time: 8.627 seconds.
top_trigram_all <- topfeatures(dfm_trigram_all, n=NUM_TOP_EXPRESSIONS)
top_trigram_all
## new_york_citi happi_mother_day presid_barack_obama
## 31 28 25
## happi_new_year let_us_know look_forward_see
## 20 19 17
## two_year_ago world_war_ii cinco_de_mayo
## 16 15 15
## _s
## 15
The main outcome of this project will be a predictive application in which the user types words and the application suggests the next word. The next steps are to use the n-gram frequencies computed above to build a next-word prediction model and to integrate that model into an application for users.
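As a rough illustration of the prediction step (a minimal sketch: the helper predict_next_word is hypothetical and works on a named bigram-frequency vector such as the one returned by topfeatures above, with the two words joined by “_”), the model could look up the most frequent bigrams starting with the user’s last word:

# Hypothetical helper: suggest up to n likely next words after `word`,
# given a named vector of bigram frequencies (names like "right_now").
predict_next_word <- function(word, bigram_freqs, n = 3){
  pattern <- paste0("^", word, "_")
  candidates <- bigram_freqs[grepl(pattern, names(bigram_freqs))]
  if (length(candidates) == 0) return(character(0))
  candidates <- sort(candidates, decreasing = TRUE)
  sub(pattern, "", names(head(candidates, n)))
}

# Example: suggestions after the (stemmed) word "new"
predict_next_word("new", topfeatures(dfm_bigram_all, n = 5000))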
[1]R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2015 [Online]. Available: https://www.R-project.org/
[2]I. Feinerer and K. Hornik, Tm: Text mining package. 2015 [Online]. Available: http://CRAN.R-project.org/package=tm
[3]K. Benoit and P. Nulty, Quanteda: Quantitative analysis of textual data. 2015 [Online]. Available: https://CRAN.R-project.org/package=quanteda
[4]“Full list of bad words banned by Google,” freewebheaders.com. [Online]. Available: http://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/