This milestone report describes my work so far in the Data Science Capstone Project on Coursera. The ultimate goal is to create a prediction algorithm and integrate it into an R Shiny app, so that a user can enter one, two, three or more words and a “predicted next word” will be displayed. So far, I have worked only with the English language data.

1 Loading the data

First I downloaded the required data, unpacked it and saved it in a subfolder called “data”. From there, I loaded it into my workspace using readLines(). Saving it in a list saves some memory.

data <- list()
files <- dir("./data/final/en_US", full.names=T)

for(i in seq_along(files)) {
  
  con <- file(files[i], "rb")
  
  data[[i]] <- readLines(con, encoding="UTF-8", skipNul = T)
  print(c(files[i], max(nchar(data[[i]]))))  # file name and longest line (in characters)
  close(con)
  
}

names(data) <- c("blogs", "news", "twitter")

2 Data preparation

In order to prepare the data for analysis, I set all words to lower case, removed punctuation and numbers and fixed the spacing (i.e., removed double spaces etc.). This was accomplished with the preprocess() function from the ngram package. I did this line by line in order to preserve the line boundaries (otherwise they would have been destroyed, which would not make sense for calculating n-grams).

library(ngram)

# lower-case every line, remove punctuation and numbers, fix the spacing
for(i in 1:length(data)) {
  for(j in 1:length(data[[i]])) {
    data[[i]][j] <- preprocess(
      concatenate(data[[i]][j]), case="lower", remove.punct=T,
      remove.numbers=T, fix.spacing=T)
  }
}

From this data set, I created a second data set in which I removed so-called “stop words”, i.e. meaningless words like “it”, “as” or “or”, so that in the end I will be able to predict meaningful words.
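A minimal sketch of how this stopword-free copy could be built, assuming the preprocessed data list from above and the stopword list of the tm package (the name data_sw is only chosen for illustration):

library(tm)
library(stringr)

sw <- stopwords("en")

# drop stopwords from every line while keeping the line structure intact
data_sw <- lapply(data, function(lines) {
  words <- str_split(lines, " ")
  vapply(words, function(w) paste(w[!w %in% sw], collapse = " "), character(1))
})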

3 Basic report about the data set

3.1 Summary statistics

The full data consists of three different sources: “blogs”, “news” and “twitter”. The blogs data has 899288 lines, the news data has 1010242 lines and the twitter data is the largest with 2360148 lines. The basic summaries are as follows.

data <- readRDS("../data/data-prep_20201118.rds")
swfull <- readRDS("../data/sw-prep_20201118.rds")

                                   blogs       news    twitter        all
lines                             899288    1010242    2360148    4269678
words                           36934013   33569489   29586893  100090395
words without stopwords         19583669   19796438   17585610   56965717
longest line (characters)           6327       1370         47       7744
longest line without stopwords      3916       1315         47       5278
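
The numbers above could roughly be reproduced as follows (a sketch only; data and swfull are the prepared corpora loaded above, with swfull being the stopword-free version):

library(ngram)

summarise_corpus <- function(corpus) {
  sapply(corpus, function(lines) c(
    lines        = length(lines),
    words        = wordcount(lines),       # total number of words
    longest_line = max(nchar(lines))       # longest line in characters
  ))
}

summarise_corpus(data)     # summarise_corpus(swfull) for the stopword-free corpus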

3.2 Word counts

I created n-grams from a (randomized) subset of 30 percent of the original data. The n-grams were counted with the ngram() function of the ngram package, e.g. two-word counts with ngram(data, n=2). The n-grams were calculated line by line, so no n-grams crossing line boundaries were counted (because these are meaningless). The frequency tables of the n-grams are obtained with the get.phrasetable() function of the same package. I only saved n-grams with a frequency of at least 2, both to save memory and because n-grams that occurred only once are of little use for predicting words.
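A condensed sketch of this counting step, written for clarity rather than speed (sample_lines stands for the 30 percent subset of preprocessed lines and is only chosen for illustration):

library(ngram)
library(dplyr)

count_ngrams <- function(lines, n = 2) {
  # keep only lines that contain at least n words
  lines <- lines[vapply(strsplit(lines, " "), length, integer(1)) >= n]
  # build one phrasetable per line so that no n-grams cross line boundaries
  tabs <- lapply(lines, function(line) get.phrasetable(ngram(line, n = n)))
  bind_rows(tabs) %>%
    group_by(ngrams) %>%
    summarise(freq = sum(freq), .groups = "drop") %>%
    arrange(desc(freq)) %>%
    filter(freq >= 2)                       # drop n-grams that occur only once
}

bigrams <- count_ngrams(sample_lines, n = 2)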

The most frequent words and word combinations are shown here. Because simple words like these are not very interesting, I calculated the same for the corpus without stopwords.

4 Interesting findings

4.1 Primitive algorithm

So far I have tried to create a simple prediction algorithm (see appendix):

  • When the user enters a string like “hello and good evening”, for example, the algorithm checks whether there are any 5-grams starting with “hello and good evening”.
  • If there are, it returns the next word of the most frequent such 5-gram (in this case: ).
  • If there aren’t, it checks whether there are any 4-grams starting with “and good evening” (i.e., it first removes the first word of the string).
  • If there are, it returns the next word of the most frequent such 4-gram (in this case: ).
  • If there aren’t, … (and so on)
  • If no n-grams are found, it just returns the most frequent word.

The algorithm lets the user decide whether he or she wants to include stopwords or only wants a “meaningful” result.

4.2 Results

The prediction accuracy so far has not been very good. It does not depend much on the size of the training data set (here: 10, 20, 30, 40 or 50 percent of the original data set). It does, however, depend a lot on whether you are looking for a word including stopwords (higher accuracy) or for a meaningful word (lower accuracy).

We can see that the average accuracy of the predictions does not increase with the size of the training data set (x-axis, in percent), neither for words including stopwords (left) nor for words excluding stopwords (right).

But in the lower figures we can see that the average time needed for a result increases with larger training data. Hence, it seems best to base the results on a smaller training data set to improve the user experience.

Also, the size of the n-grams (“steps”) does not influence the prediction, which is rather strange; I have to work on that.
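One way such accuracy numbers could be computed is sketched below; test_lines and evaluate() are illustrative names, and this is not necessarily the exact procedure behind the figures:

library(stringr)

# sample random positions in held-out lines and check whether nextword()
# predicts the word that actually follows
evaluate <- function(test_lines, n_trials = 1000, sw = FALSE, permille = 1) {
  hits <- 0
  for (i in seq_len(n_trials)) {
    words <- str_split(sample(test_lines, 1), " ")[[1]]
    if (length(words) < 2) next
    cut <- sample(seq_len(length(words) - 1), 1)
    prediction <- nextword(paste(words[1:cut], collapse = " "),
                           pick = TRUE, n = 1, sw = sw, permille = permille)
    if (isTRUE(trimws(prediction[1]) == words[cut + 1])) hits <- hits + 1
  }
  hits / n_trials
}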

5 Further plans for creating a prediction algorithm and Shiny app

A lot of things have to be done in order to create a good Shiny app.

For the prediction algorithm, profanity has to be filtered out first. Then a lot of data cleaning has to be done. For example, all words containing “@” could be removed; it would also be helpful to standardise different spellings of the same word (“its”, “it’s” etc.). This would not only improve the accuracy but also reduce the size of the n-gram frequency tables.
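A rough sketch of what such a cleaning step could look like (the regular expressions and the profanity list are placeholders only, not the final rules):

library(stringr)

clean_line <- function(line, profanity = c("badword1", "badword2")) {
  line <- str_remove_all(line, "\\S*@\\S*")       # drop words containing @
  line <- str_replace_all(line, "it's", "its")    # unify variant spellings
  words <- str_split(line, "\\s+")[[1]]
  paste(words[words != "" & !words %in% profanity], collapse = " ")
}

clean_line("contact me @example it's badword1 great")
# "contact me its great"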

One could also include more information in the prediction, such as the length of the string, or let the user choose between Twitter- and blog-generated n-gram frequency tables (even though I think it is best to integrate everything). Also, I have to decide how many n-grams I include; it could be helpful not to always pick the most frequent 3-gram if, for example, there is a much better 2-gram to choose from.

For the app, I intend to let the user decide whether he or she wants to include stopwords, how many predicted words to display (e.g., a selection of 3 words) and whether to increase accuracy at the cost of speed.
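A very rough sketch of what such controls could look like in Shiny (widget names and layout are placeholders, and the accuracy/speed trade-off is left out):

library(shiny)

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  checkboxInput("use_sw", "Include stopwords", value = TRUE),
  sliderInput("n_words", "Number of predicted words", min = 1, max = 5, value = 3),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    # sw = TRUE in nextword() means "remove stopwords", hence the negation
    paste(nextword(input$phrase, pick = TRUE, n = input$n_words,
                   sw = !input$use_sw), collapse = " ")
  })
}

# shinyApp(ui, server)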

6 Appendix

6.1 Prediction algorithm

# Required packages (the function also expects the precomputed frequency
# tables tab_corpus and tab_sw: lists of phrasetables by training-set size
# and n-gram order).
library(ngram)
library(stringr)
library(dplyr)

nextword <- function(x, pick=F, n=1, sw=F, permille=1, steps=2) {
  
  # preprocess the entered string in the same way as the corpus
  word <- preprocess(
    concatenate(x), 
    case="lower",
    remove.punct=T,
    remove.numbers=T,
    fix.spacing=T)
  
  # separate into single words; optionally drop stopwords
  word <- str_split(word, " ", simplify=T)
  if(sw) word <- word[!word %in% tm::stopwords()]
  
  comb <- data.frame()
  
  # keep at most the last (steps-1) words of the input
  if(length(word) >= (steps-1)) word <- word[(length(word)-(steps-2)):length(word)]
  
  # back off to shorter n-grams until a match is found
  while(nrow(comb) <= 1 & length(word) > 0) {
    
    # pick the frequency table matching the current number of words
    if(sw)  tabnow <- tab_sw[[permille]][[length(word)+1]]
    if(!sw) tabnow <- tab_corpus[[permille]][[length(word)+1]]
    word <- paste(word, collapse=" ")
    
    # which n-grams start with the entered string?
    comb <- tabnow %>%
      filter(str_detect(ngrams, paste0("^", word, "\\s")))
    
    # return a table of candidates or only the n most frequent predictions?
    if(pick) {
      comb <- comb %>%
        arrange(desc(freq)) %>%
        filter(row_number() <= n) %>%
        select(ngrams)
      
      comb <- str_split(comb$ngrams, " ", simplify = T)
    }
    
    # drop the first word and try again with a shorter n-gram
    word <- str_split(str_remove(word, "\\s$"), " ", simplify = T)[-1]
  }
  
  # result: if a match was found, return it ...
  if(length(comb) > 1) {
    
    if(!pick) nextword <- comb
    if(pick)  nextword <- comb[,ncol(comb)]
    return(nextword)
    
    # ... if not, return the most frequent single word(s)
  } else {
    
    if(!sw) nextword <- tab_corpus[[permille]][[1]][1:n,1]
    if(sw)  nextword <- tab_sw[[permille]][[1]][1:n,1]
    return(nextword)
  }
}
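
For illustration, the function might be called like this (the parameter meanings are inferred from the code above, and the values are examples only):

# predict the single most likely next word (stopwords included),
# backing off from 5-grams
nextword("hello and good evening", pick = TRUE, n = 1, sw = FALSE,
         permille = 1, steps = 5)

# the same, but restricted to "meaningful" predictions without stopwords
nextword("hello and good evening", pick = TRUE, n = 1, sw = TRUE,
         permille = 1, steps = 5)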