Summary.

This report is part of the Capstone project for the JHU Data Science Specialization (Coursera). The materials for the capstone project were provided by the Johns Hopkins University team in collaboration with SwiftKey. The final goal of the capstone project is to develop a word prediction algorithm and deploy it as a web-based typing app that suggests the most probable next words as text is typed in. The purpose of this report is to process the provided text files, perform exploratory analysis of the obtained data, and briefly summarize the observations.

The data.

The data comes from a corpus called HC Corpora and can be downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The files include collections of Twitter, blog, and news posts in four different languages, including English. I will use the English-language files to build the application.
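
For reproducibility, the archive can be downloaded and unpacked directly from R. A minimal sketch (it assumes the extracted en_US.*.txt files are then placed in the working directory before running the code below):

# download the corpus archive once and unpack it into the working directory
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip")
# locate the extracted English files (the en_US.*.txt files used below)
list.files(".", pattern = "^en_US.*\\.txt$", recursive = TRUE)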

Exploring the data.

# Read each file, then report its size in memory (MB), its number of lines,
# and an approximate word count (the number of non-word-character runs per line).
twitter <- readLines("en_US.twitter.txt", skipNul = TRUE)
object.size(twitter)/1024^2
length(twitter)
sum(sapply(gregexpr("\\W+", twitter), length))

blogs <- readLines("en_US.blogs.txt", skipNul = TRUE)
object.size(blogs)/1024^2
length(blogs)
sum(sapply(gregexpr("\\W+", blogs), length))

# The news file is read via a binary connection so that special characters
# in the file do not prematurely stop readLines().
news <- readLines(file("en_US.news.txt", "rb"), skipNul = TRUE)
object.size(news)/1024^2
length(news)
sum(sapply(gregexpr("\\W+", news), length))
Data source    File size    Number of lines    Number of words
twitter        257.2 Mb     2360148            30513904
blogs          231.3 Mb     899288             38487556
news           230.4 Mb     1010242            35866668

Processing the text files.

There are a number of R packages available for text processing. However, since the files are rather large, I chose to use Python for text preprocessing for performance reasons. I also decided to write my own scripts because I wanted total control over the text processing pipeline, so that changes can be made quickly and precisely when fine-tuning the final model. The Python script used in this report can be viewed in the Appendix section.

The Python script performs the following text preprocessing (tokenization) steps:

  1. Convert all text to lowercase.
  2. Split the text on punctuation to produce semantically coherent blocks of text suitable for building n-grams. N-grams are sequences of words that are commonly used together in writing: bigrams are two-word sequences, trigrams are three-word sequences, and single words are called unigrams (a short illustration in R follows this list). I assumed that, for next-word prediction purposes, two words separated by punctuation are most probably not part of the same n-gram.
  3. Expand contractions. This converts contractions such as “aren’t” and “i’m” into “are not” and “i am”, respectively. I think that “i” and “am” should be treated as separate words for prediction purposes, which will also simplify the prediction algorithm. However, I also plan to handle the case where a user enters a contraction such as “I’m” into the text field of the web application.
  4. Clean the text of all remaining punctuation, numbers, extra spaces, non-letter characters, and emoticons, using a regular expression that keeps only letters and single spaces. I have not yet implemented special handling of hyphenated words and acronyms, because initial analysis showed that they are not very common and probably will not significantly affect prediction accuracy. However, if further analysis shows that they contribute to the accuracy of the model, I may design a decision classifier for distinguishing hyphenated words and acronyms from ordinary punctuation.
  5. Create frequency tables of unigrams, bigrams, and trigrams from the preprocessed text file, and save them as .csv files.
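
To make step 2 concrete, here is a small illustration in R (the helper function and sample phrase are for illustration only; the actual n-gram extraction is done by the Python script in the Appendix):

# toy helper: all n-grams of length n from a vector of words (illustration only)
ngrams <- function(words, n) {
    sapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "))
}
words <- strsplit("thanks for the follow", " ")[[1]]
ngrams(words, 1)   # unigrams: "thanks" "for" "the" "follow"
ngrams(words, 2)   # bigrams:  "thanks for" "for the" "the follow"
ngrams(words, 3)   # trigrams: "thanks for the" "for the follow"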

For the purposes of this report I processed the “en_US.twitter.txt” file using the script described above. The overall process of tokenization and building the n-gram tables for the entire file takes roughly 5 minutes on a regular desktop computer.

The new files containing n-gram tables were loaded into R and analyzed further.

# Load the unigram frequency table (V1 = word, V2 = count) and plot the
# distribution of word frequencies below 100.
unigrams <- read.csv("unigrams.csv", stringsAsFactors=FALSE, header=FALSE)
library(ggplot2)
ggplot(data = unigrams, aes(x=V2)) +
    geom_histogram(binwidth=5) +
    coord_cartesian(xlim = c(0, 100)) +
    ggtitle("Counts of words with frequencies below 100") +
    xlab("Unigram frequencies")

Out of all 285583 identified unigrams, 171584 words occur only once, and 9661 words have frequencies above 100.

subset1 <- subset(unigrams, V2==1)
subset100 <- subset(unigrams, V2 > 100)
dim(unigrams); dim(subset1); dim(subset100)
## [1] 285583      2
## [1] 171584      2
## [1] 9661    2

Visualizing the most frequent words as a word cloud:

library(wordcloud)
wordcloud(subset100$V1, subset100$V2, scale=c(7,0.6), colors=brewer.pal(8, "Dark2"), max.words=50)

Analysing and showing the most common bigrams.

bigrams <- read.csv("bigrams.csv", stringsAsFactors=FALSE, header=FALSE)
top20 <-  bigrams[order(bigrams$V2, decreasing = TRUE),]
top20 <- top20[1:20,]

ggplot(top20, aes(x=reorder(V1, -V2), y = V2)) + 
        geom_bar(stat="identity", fill="skyblue") + xlab("") + ylab("Counts") +
        ggtitle("The most common bigrams") +
        geom_text(aes(label=V2), vjust= -0.3, size=3) +
        theme(axis.text.x = element_text(size=16, angle=45 , hjust=1),
              axis.text.y = element_text(colour="grey20",size=14),
              axis.title.y = element_text(size=14),
              legend.text = element_text(size=14))

Analysing and showing the most common trigrams.

trigrams <- read.csv("trigrams.csv", stringsAsFactors=FALSE, header=FALSE)
top20t <-  trigrams[order(trigrams$V2, decreasing = TRUE),]
top20t <- top20t[1:20,]

ggplot(top20t, aes(x=reorder(V1, -V2), y = V2)) + 
        geom_bar(stat="identity", fill="lightgreen") + xlab("") + ylab("Counts") +
        ggtitle("The most common trigrams") +
        geom_text(aes(label=V2), vjust= -0.3, size=3) +
        theme(axis.text.x = element_text(size=16, angle=45 , hjust=1),
              axis.text.y = element_text(colour="grey20",size=14),
              axis.title.y = element_text(size=14),
              legend.text = element_text(size=14))

Conclusion.

As we can see, the text corpus is quite large. There are many high-frequency n-grams and even more low-frequency ones. The predictive algorithm should react very quickly to the user’s input and, since the app will be deployed on a Shiny server, it should not take up much memory. Hence, I will most probably need to reduce the number of n-grams used in the final language model. The majority of low-frequency n-grams will probably not contribute much to the predictive capabilities of the algorithm and can hopefully be discarded. On the other hand, I might try to include higher-order n-grams such as 4-grams and 5-grams. The cut-off frequencies will be selected to minimize the memory footprint of the model without significantly reducing the accuracy of the predictions. The prediction algorithm will be developed on the basis of a Katz back-off or Kneser-Ney smoothing model. I also plan to implement profanity filtering. The model will be tuned for fast running times, so that the app can respond to the user’s input in real time. The final model will be deployed on a Shiny server.
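
As a rough sketch of the intended lookup logic, here is a simplified count-based back-off in R (not yet the full Katz back-off or Kneser-Ney model); it assumes the bigram and trigram tables loaded above, with V1 holding the n-gram and V2 its count:

# drop n-grams seen only once to reduce the memory footprint (cut-off to be tuned)
bigrams_pruned  <- subset(bigrams,  V2 > 1)
trigrams_pruned <- subset(trigrams, V2 > 1)

# predict the next word from the last two typed words: look in the trigram
# table first, then back off to the bigram table on the last word alone
# (note: the input is used as a regular-expression prefix, so plain words only)
predict_next <- function(last_two) {
    hits <- trigrams_pruned[grepl(paste0("^", last_two, " "), trigrams_pruned$V1), ]
    if (nrow(hits) == 0) {
        last_one <- tail(strsplit(last_two, " ")[[1]], 1)
        hits <- bigrams_pruned[grepl(paste0("^", last_one, " "), bigrams_pruned$V1), ]
    }
    if (nrow(hits) == 0) return(NA)
    best <- hits[which.max(hits$V2), "V1"]
    tail(strsplit(best, " ")[[1]], 1)   # return only the predicted word
}

predict_next("thank you")   # e.g. "for", depending on the corpus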

Appendix.

####################---- Python text-preprocessing script ----#############################

# dictionary of mapping of contractions to their expansions:
contractions = {"aren't": "are not", "can't": "cannot", "can't've": "cannot have",
                "'cause": "because", "could've": "could have", "couldn't": "could not",
                "couldn't've": "could not have", "didn't": "did not", "doesn't": "does not",
                "don't": "do not", "hadn't": "had not", "hadn't've": "had not have",
                "hasn't": "has not", "haven't": "have not", "he'd've": "he would have",
                "he'll": "he will", "how'd": "how did", "how'd'y": "how do you",
                "how'll": "how will", "how's": "how is", "i'd've": "i would have",
                "i'll": "i will", "i'm": "i am", "i've": "i have", "isn't": "is not",
                "it's": "it is", "it'd've": "it would have", "it'll": "it will",
                "let's": "let us", "ma'am": "madam", "mayn't": "may not",
                "might've": "might have", "mightn't": "might not", "mightn't've": "might not have",
                "must've": "must have", "mustn't": "must not", "mustn't've": "must not have",
                "needn't": "need not", "needn't've": "need not have", "oughtn't": "ought not",
                "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not",
                "shan't've": "shall not have", "she'd've": "she would have", "she'll": "she will",
                "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have",
                "so've": "so have", "that'd've": "that would have", "there'd've": "there would have",
                "they'd've": "they would have", "they'll": "they will", "they're": "they are",
                "they've": "they have", "to've": "to have", "wasn't": "was not",
                "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
                "we're": "we are", "we've": "we have", "weren't": "were not", "what're": "what are",
                "what's": "what is", "what've": "what have", "when've": "when have", "where'd": "where did",
                "where's": "where is", "where've": "where have", "who've": "who have", "why've": "why have",
                "will've": "will have", "won't": "will not", "won't've": "will not have",
                "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have",
                "y'all": "you all", "y'all'd": "you all would", "y'all'd've": "you all would have",
                "y'all're": "you all are", "y'all've": "you all have", "you'd've": "you would have",
                "you're": "you are", "you've": "you have", "you'll": "you will"}


import re
from collections import defaultdict
import csv

file_name = "en_US.twitter.txt" # or use a complete path if the file is not in the current directory.


def process_text(txt_file, new_file):
    """Takes a text file, preprocess the text and saves proccesed data into
        new txt file line by line, where each line contains meaningfully
        connected text block.

    Args:
        txt_file (.txt): file name or a full path name for the file.
        new_file (.txt): name for a file where the proccesed text will be saved.

    Returns:
        None
    """
    f = open(new_file, "w") # the file where the processed text will be written.

    with open(txt_file, 'r', encoding='utf-8', errors='ignore') as in_file:

        for line in in_file:
            line = line.lower()  # convert to lowercase
            phrases_list = re.split(r'[!?.,;:&()]+', line)  # split the line on punctuation

            for phrase in phrases_list:
                word_list = []
                for word in phrase.split(): # splits a string into words
                    if word in contractions: # expands contractions
                        new_word = contractions[word]
                        word_list.append(new_word)
                    else:
                        word_list.append(word)

                new_phrase = ' '.join(word_list) # converts list into a string

                # clean the text and save each phrase line by line:
                clean_phrase = re.sub(r"[^a-zA-Z\s]+", '', new_phrase).strip()
                if clean_phrase != '':
                    f.write("%s\n" % clean_phrase)
    f.close()

process_text(file_name, "processed.txt")


def make_ngrams(proc_file, n):
    """Finds n-grams and calculates their frequencies.

    Args:
        proc_file (str): file containing text preprocessed by process_text().
        n (int): the number of words in an n-gram.

    Returns:
        frequency dictionary of n-grams.
    """
    dct = defaultdict(int)

    with open("processed.txt", 'r') as in_file1:
        for line in in_file1:

            line_list = line.split()
            # slide a window of n words across the line, including the last position
            for i in range(len(line_list) - n + 1):
                ngram = ' '.join(line_list[i:i+n])
                dct[ngram] += 1
    return dct


freq_dict1 = make_ngrams("processed.txt", 1) # unigrams
freq_dict2 = make_ngrams("processed.txt", 2) # bigrams
freq_dict3 = make_ngrams("processed.txt", 3) # trigrams

# write the frequency tables into .csv files:
with open('unigrams.csv', 'w', newline='') as f1:
    w1 = csv.writer(f1)
    w1.writerows(freq_dict1.items())

with open('bigrams.csv', 'w', newline='') as f2:
    w2 = csv.writer(f2)
    w2.writerows(freq_dict2.items())

with open('trigrams.csv', 'w', newline='') as f3:
    w3 = csv.writer(f3)
    w3.writerows(freq_dict3.items())