Background

In this project we’ll use three sets of data to create a word prediction application. The sets come from Twitter, various blogs, and news sources, and have been provided by SwiftKey through Coursera. In this part, Part 1, we’ll perform an exploratory analysis of the data: get an idea of how large the sources are, whether there are any correlations among terms, and which words appear most often. In Part 2 we’ll create a predictive model and design an application. First, let’s set the working directory to the project’s directory, then dive in from there.

setwd("~/Desktop/Word_Prediction_Project")
root <- "~/Desktop/Word_Prediction_Project/"

Download and Load Data Sets

We’ll download the zip file from the Coursera website into the project directory, then unzip it. The compressed file contains data sets in several different languages; for this project we’ll use the English data sets, which are located in the “en_US” directory.

url <- c("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
download.file(url, "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip", exdir = ".")

The three files are plain text, so we use the readLines function to load each one into R.

twitter <- readLines(paste0(root, "final/en_US/en_US.twitter.txt"))
blogs <- readLines(paste0(root, "final/en_US/en_US.blogs.txt"))
news <- readLines(paste0(root, "final/en_US/en_US.news.txt"))

Exploratory Analysis

Summarize Data Sets

Let’s take a look at some basic information about these files, such as file size and average word count. Here we create a function that returns a data frame summarizing a data set.

library(dplyr)
library(stringi)

# Summarize a data set: file size on disk plus line, word, and character counts,
# with a few per-line and per-word averages.
summaryStats <- function(x){
        # file size (MB) of the matching en_US.*.txt file
        size <- round(file.info(
                        paste0(root, 
                               "final/en_US/en_US.", 
                                deparse(substitute(x)), 
                                ".txt")
                        )$size / 1024^2)
        stats_1 <- stri_stats_general(x)   # Lines, LinesNEmpty, Chars, CharsNWhite
        words <- sum(stri_count_words(x))
        ave_words_per_line <- words/stats_1[1]
        maxwords <- max(stri_count_words(x))
        chars_per_word <- stats_1[3]/words
        data.frame(
                Source = deparse(substitute(x)),
                Size_MB = size,
                Total_Lines = stats_1[1],
                Total_Words = words,
                Total_Chars = stats_1[3],
                Ave_Words_Per_Line = ave_words_per_line,
                Max_Words_Per_Line = maxwords,
                Chars_Per_Word = chars_per_word
        )
}

twitter_summary <- summaryStats(twitter)
news_summary <- summaryStats(news)
blogs_summary <- summaryStats(blogs)

data_stats <- rbind(twitter_summary, news_summary)
data_stats <- rbind(data_stats, blogs_summary)
data_stats <- arrange(data_stats, desc(Size_MB))

Source   Size_MB  Total_Lines  Total_Words  Total_Chars  Ave_Words_Per_Line  Max_Words_Per_Line  Chars_Per_Word
blogs        200       899288     37546246    206824382            41.75108                6726        5.508524
news         196      1010242     34762395    203223154            34.40997                1796        5.846063
twitter      159      2360148     30093369    162096031            12.75063                  47        5.386437

We can see from the table that the blogs data set is by far the largest file. It also contains the most words, characters, and words per line, yet it has the fewest lines; the twitter data set has the most lines. All three data sets are very large, and running statistical analyses on them would take a lot of computational power and time, so we’re going to take much smaller random samples from each set to speed up the computations. First, though, let’s even out the word counts so that no single source is over-represented when we combine them later. We’ll do this by deleting lines from the blogs and news sources. To figure out how many lines to delete, we subtract the smallest total word count (twitter, with 30093369 words) from the blogs and news totals, then divide by each source’s average number of words per line.

# data_stats is sorted by Size_MB descending: row 1 = blogs, row 2 = news, row 3 = twitter
blogs_delete <- (data_stats$Total_Words[1] - data_stats$Total_Words[3]) / 
        data_stats$Ave_Words_Per_Line[1]
news_delete <- (data_stats$Total_Words[2] - data_stats$Total_Words[3]) / 
        data_stats$Ave_Words_Per_Line[2]
# drop that many lines from the end of each set so the word counts roughly match twitter's
blogs_2 <- blogs[1:(length(blogs) - round(blogs_delete))]
news_2 <- news[1:(length(news) - round(news_delete))]

Source   Size_MB  Total_Lines  Total_Words  Total_Chars  Ave_Words_Per_Line  Max_Words_Per_Line  Chars_Per_Word
blogs_2       NA       720781     30073423    165663249            41.72338                6726        5.508626
news_2        NA       874554     30101252    175973903            34.41897                1796        5.846066
twitter      159      2360148     30093369    162096031            12.75063                  47        5.386437

Sampling

Now that the word counts are much closer, we’ll shuffle each data set into a random order and save the shuffled sets to the project’s directory.

set.seed(227)

# Shuffle the lines of a data set, write the shuffled copy to the project
# directory as "<name>_randomized.txt", and return the shuffled vector.
randomizeSets <- function(data_set){
        rows <- sample(NROW(data_set))
        data_set_new <- data_set[rows]
        connection <- file(description = paste(root, 
                                 deparse(substitute(data_set)),
                                 "_randomized", ".txt", sep=""), 
                           open = "w")
        writeLines(data_set_new, con = connection)
        close(connection)
        return(data_set_new)
}

twitter_random <- randomizeSets(data_set = twitter)
blogs_random <- randomizeSets(data_set = blogs_2)
news_random <- randomizeSets(data_set = news_2)

Next we’ll take the samples. We only need a small fraction of the data; roughly 2% of each set is enough for our calculations while keeping the computations fast.

# Return the first `size` fraction of an (already shuffled) data set.
sampleExtrack <- function(dataSet, size){
        split <- round(length(dataSet) * size)
        samp <- dataSet[1:split]
        return(samp)
}

twitter_sample <- sampleExtrack(twitter_random, size = .02)
blogs_sample <- sampleExtrack(blogs_random, size = .02)
news_sample <- sampleExtrack(news_random, size = .02)

Cleaning and Term Frequencies

We now have samples of each set, but we need to do a little cleaning and reformatting. The cleanUp function defined below uses the tm package to clean the sets: it converts all letters to lowercase for consistency and removes punctuation and extra whitespace, returning a VCorpus object. Once we have clean corpora we turn them into document-term matrices. We can then do some further exploratory analysis to see which words are used most frequently in each data set, compare word frequencies from one data set to another, and establish whether the three sets correlate with each other.

library(tm)
library(slam)
library(stringr)

# Lowercase the text, remove punctuation, and collapse extra whitespace;
# returns a VCorpus object.
cleanUp <- function(x){
        Sample_Source <- VectorSource(x)
        Sample_Corpus <- VCorpus(Sample_Source)
        Sample_Clean <- tm_map(Sample_Corpus, content_transformer(tolower))
        Sample_Clean <- tm_map(Sample_Clean, removePunctuation)
        Sample_Clean <- tm_map(Sample_Clean, stripWhitespace)
        return(Sample_Clean)
}

# Clean a sample and convert it to a document-term matrix, returned as a plain matrix.
matrixDTM <- function(x){
        cleaned <- cleanUp(x)
        dtm <- DocumentTermMatrix(cleaned)
        mtx <- as.matrix(dtm)
        return(mtx)
}

# Total frequency of each term in a sample, sorted in decreasing order and
# returned as a one-column data frame named after the source.
wordFreq <- function(x){
        mtx <- matrixDTM(x)
        sums <- colSums(mtx)
        sorted <- sort(sums, decreasing = TRUE)
        df <-  as.data.frame(sorted)
        names(df) <- str_split(deparse(substitute(x)), "_")[[1]][1]
        return(df)
}

twitter_freq <- wordFreq(twitter_sample)
blogs_freq <- wordFreq(blogs_sample)
news_freq <- wordFreq(news_sample)
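
The table and scatter plots that follow use a combined table, frequency_df, whose construction isn’t shown above. Here is a minimal sketch, assuming it is built by merging the three per-source frequency data frames on their row names (the words):

# Assumed construction of frequency_df (not shown in the original code):
# merge the three frequency tables on their row names so each word gets one
# column of counts per source.
frequency_df <- merge(twitter_freq, blogs_freq, by = "row.names", all = TRUE)
rownames(frequency_df) <- frequency_df$Row.names
frequency_df$Row.names <- NULL
frequency_df <- merge(frequency_df, news_freq, by = "row.names", all = TRUE)
names(frequency_df) <- c("word", "twitter", "blogs", "news")
frequency_df <- arrange(frequency_df, desc(twitter))
head(frequency_df)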

Top Words
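
Three word clouds, one per source, appear here; the code that produced them isn’t included above. A minimal sketch, assuming the wordcloud package, that would draw comparable clouds from the frequency tables:

library(wordcloud)
library(RColorBrewer)

# Sketch only: one cloud per source, built from the top 100 terms of each
# frequency table computed above.
par(mfrow = c(1, 3))
for (freq in list(twitter_freq, blogs_freq, news_freq)) {
        wordcloud(words = rownames(freq), freq = freq[[1]],
                  max.words = 100, random.order = FALSE,
                  colors = brewer.pal(8, "Dark2"))
}
par(mfrow = c(1, 1))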

word twitter blogs  news
the    18553 30110 34170
you    10950  4892  1659
and     8621 17685 15379
for     7761  5910  6040
that    4547  7400  5998
your    3595  1581   545

Above this table are three word clouds for the twitter, blogs, and news data sets respectively. As you can see from the table and the word clouds, the word “the” is used most often in all three sets, and words like “you” and “and” are frequently used as well. These are referred to as stop words: words that appear very often but carry little meaning on their own. In some NLP projects you may want to remove stop words in order to focus on the relationships between more significant words. For a word prediction application, though, we’ll keep them, since they reflect how we normally write and speak.
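
For reference, removing stop words with tm would take only one extra step; a sketch (not applied in this project, and the twitter_no_stop name is just for illustration):

# Illustration only -- stop words are kept for the prediction model.
twitter_no_stop <- tm_map(cleanUp(twitter_sample), removeWords, stopwords("english"))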

If we plot the sets against each other, we can see a strong correlation among the word frequencies in all three sets, meaning that words that appear often in one set also tend to appear often in the others. The correlation coefficients shown on the plots back this up.

library(ggplot2)
library(gridExtra)

plot1 <- ggplot(data = frequency_df, aes(x = twitter, y = blogs)) + 
        geom_point() +
        annotate("text", x = Inf, y = Inf,
                label = paste0("cor: ",
                        round(cor(
                                frequency_df$twitter, 
                                frequency_df$blogs,
                                use = "complete.obs"),
                                2)
                        ),
                vjust = 3, hjust = 1, col = "#F8766D") +
        labs(title = "Twitter vs. Blogs")

plot2 <- ggplot(data = frequency_df, aes(x = twitter, y = news)) +
        geom_point() +
        annotate("text", x = Inf, y = Inf,
                label = paste0("cor: ",
                        round(cor(
                                frequency_df$twitter,
                                frequency_df$news,
                                use = "complete.obs"), 
                                2)),
                vjust = 3, hjust = 1, col = "#F8766D") +
        labs(title = "Twitter vs. News")

plot3 <- ggplot(data = frequency_df, aes(x = news, y = blogs)) +
        geom_point() + 
        annotate("text", x = Inf, y = Inf,
                label = paste0("cor: ",
                        round(cor(
                                frequency_df$news,
                                frequency_df$blogs,
                                use = "complete.obs"),
                                2)),
                vjust = 3, hjust = 1, col = "#F8766D") +
        labs(title = "News vs. Blogs")

grid.arrange(plot1, plot2, plot3, nrow = 2, ncol = 2)

Combine Sources

Let’s combine the three separate sources into a single master set, shuffle it, and keep a 10% slice for the n-gram analysis.

master_set <- c(twitter_sample, blogs_sample, news_sample)

master_random <- randomizeSets(master_set)
master_sample <- sampleExtrack(master_random, size = .1)

master_cleaned <- cleanUp(master_sample)

Bigram and Trigram Tokens

Now that we have an idea of which individual words appear most frequently in the data, we should also look at which combinations of words appear frequently. From here on we’ll use only the combined data set. To examine word combinations we need to create tokens: in NLP, tokens are pieces of a character sequence chopped to a size you define. We’ll make tokens for bi-grams (two-word combinations) and tri-grams (three-word combinations). The code below creates the bi-grams and tri-grams and then plots bar charts of the most frequent combinations.

library(RWeka)
options(mc.cores = 1)

# Split each document into overlapping two-word tokens.
bigram_tokenizer <- function(x) 
        unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), 
               use.names = FALSE)
bigram_tdm <- TermDocumentMatrix(master_cleaned, 
                                control = list(tokenize = bigram_tokenizer))
bigram_matrix <- as.matrix(bigram_tdm)

bigram_frequency <- rowSums(bigram_matrix)
bigram_frequency <- sort(bigram_frequency, decreasing = TRUE)

bigram_top_10 <- bigram_frequency[1:10]
bigram <- data.frame(bigram_top_10)
bigram$terms <- rownames(bigram)
rownames(bigram) <- NULL
colnames(bigram) <- c("frequency", "terms")
bigram$terms <- reorder(bigram$terms, bigram$frequency)

bigram_plot <- ggplot(bigram, aes(x = terms, y = frequency, fill = terms)) + 
        geom_bar(stat = "identity") + 
        coord_flip() +
        theme(legend.position = "none") +
        labs(title = "Top Bi-gram Tokens")
bigram_plot

# Split each document into overlapping three-word tokens.
trigram_tokenizer <- function(x)
        unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), 
               use.names = FALSE)
trigram_tdm <- TermDocumentMatrix(master_cleaned, 
                                control = list(tokenize = trigram_tokenizer))
trigram_matrix <- as.matrix(trigram_tdm)

trigram_frequency <- rowSums(trigram_matrix)
trigram_frequency <- sort(trigram_frequency, decreasing = TRUE)

trigram_top_10 <- trigram_frequency[1:10]
trigram <- data.frame(trigram_top_10)
trigram$terms <- rownames(trigram)
rownames(trigram) <- NULL
colnames(trigram) <- c("frequency", "terms")
trigram$terms <- reorder(trigram$terms, trigram$frequency)

trigram_plot <- ggplot(trigram, aes(x = terms, y = frequency, fill = terms)) + 
        geom_bar(stat="identity") + 
        coord_flip() +
        theme(legend.position = "none") +
        labs(title = "Top Tri-gram Tokens")
trigram_plot

From this you can see that combinations of words like “of the”, “in the”, “one of the”, and “a lot of” are used often.

Conclusion

From this analysis we can see which terms and combinations of terms are used most often in the data sets, which will come in handy when building a text prediction application. From here on we’ll assume that the sample reflects the broader data. If one wanted, they could take more random samples from the data sets and rerun the term frequency analysis to see whether the results are similar.
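
For example, such a check on the twitter source might look like the following sketch (the seed and the twitter_resample name are hypothetical):

# Hypothetical check: draw a fresh 2% sample with a different seed and compare
# its top terms against the earlier results.
set.seed(412)
twitter_resample <- sampleExtrack(sample(twitter), size = .02)
head(wordFreq(twitter_resample), 10)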

In Part 2 we’ll start by using our uni-, bi-, and tri-gram frequencies together with conditional probabilities to create the prediction application.
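
As a rough preview only (the actual Part 2 model may differ), the trigram and bigram counts above already support conditional-probability estimates of the form P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2):

# Rough sketch, not the final model: estimate P(next word | two-word prefix)
# from the n-gram counts computed above.
predictNextWord <- function(w1, w2, top_n = 3){
        prefix <- paste(w1, w2)
        # trigrams that start with the two-word prefix
        matches <- trigram_frequency[startsWith(names(trigram_frequency),
                                                paste0(prefix, " "))]
        if (length(matches) == 0 || is.na(bigram_frequency[prefix]))
                return(numeric(0))
        # conditional probability: count(w1 w2 w3) / count(w1 w2)
        head(sort(matches / bigram_frequency[prefix], decreasing = TRUE), top_n)
}

predictNextWord("one", "of")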