Introduction

This project builds a predictive text model, similar to those used by SwiftKey, that completes sentences and saves the user typing time, particularly on smartphones.

The input is a large corpus of tweets, news and blog texts obtained from HC Corpora (www.corpora.heliohost.org). The goal is to use this data to learn which words could be used to complete a sentence. The first step is to understand the problem and get to know the data, then clean and process it, and finally apply prediction methods (supervised and unsupervised) to implement a solution.

This report summarizes the main results about the data and outlines a strategy for developing a prediction model.

Data acquisition and cleaning

The data for the project is available from here. My work is based on the English (en_US) locale. The three data files contain tweets, blogs and news, and they are large enough to discourage keeping all the information in memory.

rbind(paste("data/en_US",list.files("data/en_US/"),sep="/"),
      file.size(paste("data/en_US",list.files("data/en_US/"),sep="/")))
     [,1]                         [,2]                       
[1,] "data/en_US/en_US.blogs.txt" "data/en_US/en_US.news.txt"
[2,] "210160014"                  "205811889"                
     [,3]                          
[1,] "data/en_US/en_US.twitter.txt"
[2,] "167105338"                   

Each data file is a collection of rows, each holding one unit of information obtained from the source that gives the file its name: a tweet, a blog entry or a news paragraph. The blogs file has 899288 records, the news file has 1010242 records and the twitter file has 2360148 records.
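
Since the files are too large to keep comfortably in memory, the records can be counted by reading each file in chunks. The following is only a minimal sketch in base R: the helper name count_records, the chunk_size parameter and the temporary toy file are assumptions made for illustration, not the code used for the counts above.

```r
# Count the lines of a text file without loading it whole into memory.
# count_records and chunk_size are hypothetical names used for illustration.
count_records <- function(filename, chunk_size = 10000) {
  con <- file(filename, open = "r")
  on.exit(close(con))                 # always release the connection
  total <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break     # end of file reached
    total <- total + length(chunk)
  }
  total
}

# Small self-contained check on a temporary three-line file
tmp <- tempfile(fileext = ".txt")
writeLines(c("a tweet", "a blog entry", "a news paragraph"), tmp)
n_records <- count_records(tmp, chunk_size = 2)
print(n_records)
```

Because the connection stays open between calls, each readLines call continues where the previous one stopped, so only one chunk is held in memory at a time.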

Since removing profanity is one of the required tasks, we need a list of such words. I obtained one from the following link: http://www.bannedwordlist.com/

As the main tools for analysis, I decided to use the packages tm and RWeka. These packages offer methods and functions to clean and summarize the data. The same functions could be written by hand, but that would take me too much time (I am not an expert). Their advantage is that they give a quick, partial solution; on the other hand, they have some limitations and their structure seems more complex than this project needs.

Other tools used are R.utils, stringr and wordcloud.

Preprocessing

For the purpose of this report, I will work with a sample of lines from the datasets, only to show the functionality and the main ideas. For this I use a sampler function that selects a random subset of records from the complete file:

#Function to sample m random lines from an input file
samplelines <- function(filename,m=50,seed=10){
  set.seed(seed) #fix a seed for reproducibility
  n <- as.numeric(R.utils::countLines(filename)) #total number of lines
  muestra <- sort(sample.int(n,m,replace = FALSE)) #sampled line numbers
  con <- file(filename,"r")
  on.exit(close(con)) #close the connection even if reading fails
  #read the file once up to the last sampled line, then keep the sampled ones
  lineas <- readLines(con,n = max(muestra))[muestra]
  lineas }

Next, we generate sample files with, say, m=20 records each; the corpus is built from those files in the code further below.
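
The code for this step is not shown above. One way it could look is sketched here under stated assumptions: write_sample, the temporary directories and the toy input files are hypothetical stand-ins, and the sketch reads each toy file whole, which is only reasonable for small inputs. For the real files, the samplelines function above would take the place of that readLines call.

```r
# Sketch: write an m-line random sample of each source file into a sample
# directory. write_sample, the temporary paths and the toy input are
# assumptions for illustration, not the report's actual code.
write_sample <- function(infile, outdir, m = 20, seed = 10) {
  set.seed(seed)                                   # reproducible sample
  lines <- readLines(infile, warn = FALSE)         # fine for small files only
  keep <- sort(sample.int(length(lines), min(m, length(lines))))
  outfile <- file.path(outdir, paste0("sample.", basename(infile)))
  writeLines(lines[keep], outfile)
  outfile
}

# Toy stand-ins for the blogs, news and twitter files
indir <- file.path(tempdir(), "en_US_demo")
outdir <- file.path(tempdir(), "Sample_demo")
dir.create(indir, showWarnings = FALSE)
dir.create(outdir, showWarnings = FALSE)
for (src in c("blogs", "news", "twitter")) {
  writeLines(paste(src, "line", 1:100),
             file.path(indir, paste0("en_US.", src, ".txt")))
}

# One sample file per source file, 20 records each
sample_files <- vapply(list.files(indir, full.names = TRUE),
                       write_sample, character(1), outdir = outdir, m = 20)
sample_sizes <- vapply(sample_files, function(f) length(readLines(f)), integer(1))
print(unname(sample_sizes))
```

A directory filled this way can then be passed to DirSource when building the corpus.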

The following steps clean the data. It is important to note that the order in which the filters are applied matters and gives different results. I decided first to put everything in lowercase, then remove numbers, then replace all punctuation symbols with spaces, then remove profanities and the stray letters left over after removing punctuation (for example, “aren’t” becomes “aren t”), and finally remove the extra white space.

One more decision was not to remove the stopwords, as those words can be useful for prediction.

library(tm)
Loading required package: NLP
library(RWeka)
library(stringr)
 #read the profanities database, and put everything in lowercase, and remove punctuation
 #signs 
profanities <- readLines(con=file("data/badwords/badwords.txt"))
Warning in readLines(con = file("data/badwords/badwords.txt")): incomplete
final line found on 'data/badwords/badwords.txt'
profanities <-tolower(gsub("[[:punct:]]","",profanities))

 #create the corpus of files.
texto <- Corpus(DirSource(directory = "data/Sample/"))

  #Data cleansing
texto <- tm_map(texto,content_transformer(tolower)) #change everything to lowercase
texto <- tm_map(texto,removeNumbers) #remove numbers
texto <- tm_map(texto,content_transformer(str_replace_all),pattern = "[[:punct:]]",replacement = " ") #remove punctuation and replace by spaces
texto <- tm_map(texto,content_transformer(removeWords),words=c(profanities,"t","x"," nt ","j","m"))
texto <- tm_map(texto,stripWhitespace) #remove extra spaces
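
To see why the order of the steps matters, the same sequence can be traced on a single string. This is only a sketch in base R that approximates the tm pipeline above: clean_text and the two-word stand-in for the profanity list are hypothetical, and the regular expressions are my assumption of what each tm step does.

```r
# Trace the cleaning order on one string with base R only.
# clean_text and the stand-in banned list are hypothetical.
clean_text <- function(x, banned = character(0)) {
  x <- tolower(x)                               # 1. lowercase
  x <- gsub("[[:digit:]]+", "", x)              # 2. remove numbers
  x <- gsub("[[:punct:]]", " ", x)              # 3. punctuation -> space
  drop <- c(banned, "t", "x", "j", "m")         # 4. banned words and stray letters
  pattern <- paste0("\\b(", paste(drop, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("\\s+", " ", x)                     # 5. collapse white space
  trimws(x)
}

cleaned <- clean_text("They AREN'T here, I'm told: 2 darn cats!",
                      banned = c("darn", "heck"))
print(cleaned)
# "they aren here i told cats"
```

If the stray letters were removed before the punctuation, the "t" of "aren't" and the "m" of "I'm" would still be attached to their apostrophes and would survive the cleaning.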

Now we can use the “clean” corpus for some exploratory analysis and try to answer some of the questions posed for the second task.

Exploratory data analysis

With the corpus we create a term-document matrix and analyze the distribution of words. Since the matrix is sparse, I consider only the 100 most frequent words:

tdm <- as.matrix(TermDocumentMatrix(texto))
dim(tdm)
[1] 2538    3
stat1 <- sort(rowSums(tdm),decreasing = T)[1:100]
dotchart(stat1,main="Word frequency in the sample",xlab="frequency",cex = 0.5)

In my sample, there are 2538 different words. The most frequent words (in my sample, stopwords included) are:

stat1
      the       and      that       for       you       was      with 
      351       235        77        75        63        47        43 
      are      this       but      have      from      said       out 
       41        41        40        35        32        32        30 
      all       her       not      what      they      will     about 
       29        29        29        27        26        26        25 
      his       had       one     would       can       has       who 
       25        23        23        23        22        22        22 
     more       new      time       she      were      when      just 
       21        21        21        20        19        19        18 
      our     their      your       how      know       get      some 
       18        18        18        17        17        16        16 
     last     after      like      also      back       day      even 
       15        14        14        13        13        13        13 
    first      here      make      need      than      them      then 
       13        13        13        13        13        13        13 
    there     think      want      year       don      home      into 
       12        12        12        12        11        11        11 
     been    before      both       got      next       see      take 
       10        10        10        10        10        10        10 
      two      very   because      each      life       now    people 
       10        10         9         9         9         9         9 
    still       use     which     being      city       did      love 
        9         9         9         8         8         8         8 
     most       off     other      over    second    thanks      well 
        8         8         8         8         8         8         8 
     work       yum asparagus      case     could     doing     every 
        8         8         7         7         7         7         7 
     face       lot 
        7         7 

A word cloud can be helpful at this point; it makes the high frequency of the stopwords obvious:

library(wordcloud)
Loading required package: RColorBrewer
wordcloud(names(stat1),stat1)

We now consider the distributions of 2-grams and 3-grams, repeating a process similar to the one used for the distribution of words. However, the term-document matrix is much sparser in these cases:

token_delim <- "\\t\\r\\n.!?,;\"() "
bitoken <- function(x){NGramTokenizer(x, Weka_control(min=2,max=2, delimiters = token_delim))}
tdm2 <- as.matrix(TermDocumentMatrix(texto, control = list(tokenize = bitoken)))
stat3 <- sort(rowSums(tdm2),decreasing = T)[1:100]
dotchart(stat3,main="2-gram frequency in the sample",xlab="frequency",cex = 0.5)

wordcloud(names(stat3),stat3)

token_delim <- "\\t\\r\\n.!?,;\"() "
tritoken <- function(x){NGramTokenizer(x, Weka_control(min=3,max=3, delimiters = token_delim))}
tdm3 <- as.matrix(TermDocumentMatrix(texto, control = list(tokenize = tritoken)))
stat4 <- sort(rowSums(tdm3),decreasing = T)[1:100]
dotchart(stat4,main="3-gram frequency in the sample",xlab="frequency",cex = 0.5)

wordcloud(names(stat4),stat4)
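
For readers without RWeka (which needs Java), the idea behind the n-gram counts can be sketched in base R. This is not the tokenizer used above: count_ngrams is a hypothetical helper, it splits on white space only, and the three sentences are toy data rather than my sample.

```r
# Count n-grams in base R. count_ngrams is a hypothetical helper and the
# input sentences are toy data, not the report's corpus.
count_ngrams <- function(lines, n = 2) {
  grams <- unlist(lapply(strsplit(lines, "\\s+"), function(words) {
    words <- words[nzchar(words)]               # drop empty tokens
    if (length(words) < n) return(character(0)) # line too short for an n-gram
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

toy <- c("i want to go home", "i want to sleep", "you want to go")
bigrams <- count_ngrams(toy, n = 2)
trigrams <- count_ngrams(toy, n = 3)
print(head(bigrams, 3))
```

N-grams are built line by line, so no n-gram spans two different records. In the toy data the bigram "want to" appears three times and the trigram "i want to" twice.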

The next exploration question asks: How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%? To answer it, we need the cumulative proportion of the sorted word frequencies and the index at which it first reaches each threshold:

stat2 <- sort(rowSums(tdm),decreasing = T)
cumfreq <- cumsum(stat2)/sum(stat2)

In my sample, 236 words are required to cover 50% of all word instances, and 1935 words to cover 90%.
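
The lookup of the index is not shown above. A minimal sketch of it follows, assuming the threshold is taken as the first rank whose cumulative proportion reaches the target; words_for_coverage and the toy counts are hypothetical and are not the frequencies of my sample.

```r
# Number of distinct words needed to reach a target coverage of all word
# instances. words_for_coverage and the toy counts are hypothetical.
words_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  cumfreq <- cumsum(freq) / sum(freq)
  which(cumfreq >= target)[1]     # first rank at which coverage is reached
}

toy_freq <- c(the = 40, and = 20, you = 15, city = 10, love = 10, yum = 5)
n50 <- unname(words_for_coverage(toy_freq, 0.5))
n90 <- unname(words_for_coverage(toy_freq, 0.9))
print(c(n50, n90))
# 2 5
```

With the cumfreq vector computed above, the same lookup would be which(cumfreq >= 0.5)[1] and which(cumfreq >= 0.9)[1].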

I still don’t have a definite solution to identify whether a word comes from a foreign language. Since those words are not very frequent, I think it will be possible to include them in the set of non-stopwords.

Work to do

For the rest of the project:

  1. Decide what to do with the foreign-language words, and decide whether keeping the stopwords helps prediction.
  2. Explore supervised and unsupervised techniques for prediction, and decide how to create the build and test groups.
  3. Develop the solution in R.

Thanks for your feedback.