Background

In this project we’ll use three sets of data to create a word prediction application. The sets come from Twitter, various blogs, and news sources, and have been provided by SwiftKey through Coursera. In this part, Part 1, we’ll perform an exploratory analysis of the data: get an idea of how large the sources are, whether there are any correlations among terms, and which words appear most often. In Part 2 we’ll create a predictive model and design an application. First, let’s set the working directory to the project’s directory, then dive in from there.

setwd("~/Desktop/Word_Prediction_Project")
root <- "~/Desktop/Word_Prediction_Project/"

Download and Load Data Sets

We’ll download the zip file from the Coursera website into the project directory, then unzip it. The compressed file contains data sets in several different languages; for this project we’ll use the English data sets, which are located in the “en_US” directory.

url <- c("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
download.file(url, "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip", exdir = ".")

The three files are plain text, so we use the readLines function to load each one into R.

twitter <- readLines(paste0(root, "final/en_US/en_US.twitter.txt"))
blogs <- readLines(paste0(root, "final/en_US/en_US.blogs.txt"))
news <- readLines(paste0(root, "final/en_US/en_US.news.txt"))

Exploratory Analysis

Summarize Data Sets

Let’s take a look at some basic information about these files, such as file size and average word count. Here we create a function that returns a data frame summarizing a data set.

library(dplyr)
library(stringi)

# Summarize a data set: file size on disk plus line, word, and character counts,
# with a few per-line and per-word averages.
summaryStats <- function(x){
        # file size (MB) of the matching en_US.*.txt file
        size <- round(file.info(
                        paste0(root, 
                               "final/en_US/en_US.", 
                                deparse(substitute(x)), 
                                ".txt")
                        )$size / 1024^2)
        stats_1 <- stri_stats_general(x)   # Lines, LinesNEmpty, Chars, CharsNWhite
        words <- sum(stri_count_words(x))
        ave_words_per_line <- words/stats_1[1]
        maxwords <- max(stri_count_words(x))
        chars_per_word <- stats_1[3]/words
        data.frame(
                Source = deparse(substitute(x)),
                Size_MB = size,
                Total_Lines = stats_1[1],
                Total_Words = words,
                Total_Chars = stats_1[3],
                Ave_Words_Per_Line = ave_words_per_line,
                Max_Words_Per_Line = maxwords,
                Chars_Per_Word = chars_per_word
        )
}

twitter_summary <- summaryStats(twitter)
news_summary <- summaryStats(news)
blogs_summary <- summaryStats(blogs)

data_stats <- rbind(twitter_summary, news_summary)
data_stats <- rbind(data_stats, blogs_summary)
data_stats <- arrange(data_stats, desc(Size_MB))

Source   Size_MB  Total_Lines  Total_Words  Total_Chars  Ave_Words_Per_Line  Max_Words_Per_Line  Chars_Per_Word
blogs        200       899288     37546246    206824382            41.75108                6726        5.508524
news         196      1010242     34762395    203223154            34.40997                1796        5.846063
twitter      159      2360148     30093369    162096031            12.75063                  47        5.386437

We can see from the table that the blogs data set is by far the largest file. It also contains the most words, characters, and words per line, yet it has the fewest lines; the twitter data set has the most lines. All three data sets are very large, and running statistical analyses on them would take a lot of computational power and time, so we’re going to take much smaller random samples from each set to speed up the computations. First, though, let’s even out the word counts so that no single source is over-represented when we combine them later. We’ll do this by deleting lines from the blogs and news sources. To figure out how many lines to delete, we subtract the smallest total word count (twitter, with 30093369 words) from the blogs and news totals, then divide by each source’s average number of words per line.

# data_stats is sorted by Size_MB descending: row 1 = blogs, row 2 = news, row 3 = twitter
blogs_delete <- (data_stats$Total_Words[1] - data_stats$Total_Words[3]) / 
        data_stats$Ave_Words_Per_Line[1]
news_delete <- (data_stats$Total_Words[2] - data_stats$Total_Words[3]) / 
        data_stats$Ave_Words_Per_Line[2]
# drop that many lines from the end of each set so the word counts roughly match twitter's
blogs_2 <- blogs[1:(length(blogs) - round(blogs_delete))]
news_2 <- news[1:(length(news) - round(news_delete))]

Source   Size_MB  Total_Lines  Total_Words  Total_Chars  Ave_Words_Per_Line  Max_Words_Per_Line  Chars_Per_Word
blogs_2       NA       720781     30073423    165663249            41.72338                6726        5.508626
news_2        NA       874554     30101252    175973903            34.41897                1796        5.846066
twitter      159      2360148     30093369    162096031            12.75063                  47        5.386437

Sampling

Now that the word counts are much closer, we’ll shuffle each data set into a random order and save the shuffled sets to the project’s directory.

set.seed(227)

# Shuffle the lines of a data set, write the shuffled copy to the project
# directory as "<name>_randomized.txt", and return the shuffled vector.
randomizeSets <- function(data_set){
        rows <- sample(NROW(data_set))
        data_set_new <- data_set[rows]
        connection <- file(description = paste(root, 
                                 deparse(substitute(data_set)),
                                 "_randomized", ".txt", sep=""), 
                           open = "w")
        writeLines(data_set_new, con = connection)
        close(connection)
        return(data_set_new)
}

twitter_random <- randomizeSets(data_set = twitter)
blogs_random <- randomizeSets(data_set = blogs_2)
news_random <- randomizeSets(data_set = news_2)

Next we’ll take the samples. We only need a small fraction of the data; roughly 2% of each set is enough for our calculations while keeping the computations fast.

# Return the first `size` fraction of an (already shuffled) data set.
sampleExtrack <- function(dataSet, size){
        split <- round(length(dataSet) * size)
        samp <- dataSet[1:split]
        return(samp)
}

twitter_sample <- sampleExtrack(twitter_random, size = .02)
blogs_sample <- sampleExtrack(blogs_random, size = .02)
news_sample <- sampleExtrack(news_random, size = .02)

Cleaning and Term Frequencies

We now have samples of each set, but we need to do a little cleaning and reformatting. The cleanUp function defined below uses the tm package to clean the sets: it converts all letters to lowercase for consistency and removes punctuation and extra whitespace, returning a VCorpus object. Once we have clean corpora we turn them into document-term matrices. We can then do some further exploratory analysis to see which words are used most frequently in each data set, compare word frequencies from one data set to another, and establish whether the three sets correlate with each other.

library(tm)
library(slam)
library(stringr)

# Lowercase the text, remove punctuation, and collapse extra whitespace;
# returns a VCorpus object.
cleanUp <- function(x){
        Sample_Source <- VectorSource(x)
        Sample_Corpus <- VCorpus(Sample_Source)
        Sample_Clean <- tm_map(Sample_Corpus, content_transformer(tolower))
        Sample_Clean <- tm_map(Sample_Clean, removePunctuation)
        Sample_Clean <- tm_map(Sample_Clean, stripWhitespace)
        return(Sample_Clean)
}

# Clean a sample and convert it to a document-term matrix, returned as a plain matrix.
matrixDTM <- function(x){
        cleaned <- cleanUp(x)
        dtm <- DocumentTermMatrix(cleaned)
        mtx <- as.matrix(dtm)
        return(mtx)
}

# Total frequency of each term in a sample, sorted in decreasing order and
# returned as a one-column data frame named after the source.
wordFreq <- function(x){
        mtx <- matrixDTM(x)
        sums <- colSums(mtx)
        sorted <- sort(sums, decreasing = TRUE)
        df <-  as.data.frame(sorted)
        names(df) <- str_split(deparse(substitute(x)), "_")[[1]][1]
        return(df)
}

twitter_freq <- wordFreq(twitter_sample)
blogs_freq <- wordFreq(blogs_sample)
news_freq <- wordFreq(news_sample)
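
The table and scatter plots that follow use a combined table, frequency_df, whose construction isn’t shown above. Here is a minimal sketch, assuming it is built by merging the three per-source frequency data frames on their row names (the words):

# Assumed construction of frequency_df (not shown in the original code):
# merge the three frequency tables on their row names so each word gets one
# column of counts per source.
frequency_df <- merge(twitter_freq, blogs_freq, by = "row.names", all = TRUE)
rownames(frequency_df) <- frequency_df$Row.names
frequency_df$Row.names <- NULL
frequency_df <- merge(frequency_df, news_freq, by = "row.names", all = TRUE)
names(frequency_df) <- c("word", "twitter", "blogs", "news")
frequency_df <- arrange(frequency_df, desc(twitter))
head(frequency_df)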

Top Words
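
Three word clouds, one per source, appear here; the code that produced them isn’t included above. A minimal sketch, assuming the wordcloud package, that would draw comparable clouds from the frequency tables:

library(wordcloud)
library(RColorBrewer)

# Sketch only: one cloud per source, built from the top 100 terms of each
# frequency table computed above.
par(mfrow = c(1, 3))
for (freq in list(twitter_freq, blogs_freq, news_freq)) {
        wordcloud(words = rownames(freq), freq = freq[[1]],
                  max.words = 100, random.order = FALSE,
                  colors = brewer.pal(8, "Dark2"))
}
par(mfrow = c(1, 1))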

word twitter blogs  news
the    18553 30110 34170
you    10950  4892  1659
and     8621 17685 15379
for     7761  5910  6040
that    4547  7400  5998
your    3595  1581   545

Above this table are three word clouds for the twitter, blogs, and news data sets respectively. As you can see from the table and the word clouds, the word “the” is used most often in all three sets, and words like “you” and “and” are frequently used as well. These are referred to as stop words: words that appear very often but carry little meaning on their own. In some NLP projects you may want to remove stop words in order to focus on the relationships between more significant words. For a word prediction application, though, we’ll keep them, since they reflect how we normally write and speak.
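
For reference, removing stop words with tm would take only one extra step; a sketch (not applied in this project, and the twitter_no_stop name is just for illustration):

# Illustration only -- stop words are kept for the prediction model.
twitter_no_stop <- tm_map(cleanUp(twitter_sample), removeWords, stopwords("english"))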

If we plot the sets against each other, we can see a strong correlation among the word frequencies in all three sets, meaning that words that appear often in one set also tend to appear often in the others. The correlation coefficients shown on the plots back this up.

library(ggplot2)
library(gridExtra)

plot1 <- ggplot(data = frequency_df, aes(x = twitter, y = blogs)) + 
        geom_point() +
        annotate("text", x = Inf, y = Inf,
                label = paste0("cor: ",
                        round(cor(
                                frequency_df$twitter, 
                                frequency_df$blogs,
                                use = "complete.obs"),
                                2)
                        ),
                vjust = 3, hjust = 1, col = "#F8766D") +
        labs(title = "Twitter vs. Blogs")

plot2 <- ggplot(data = frequency_df, aes(x = twitter, y = news)) +
        geom_point() +
        annotate("text", x = Inf, y = Inf,
                label = paste0("cor: ",
                        round(cor(
                                frequency_df$twitter,
                                frequency_df$news,
                                use = "complete.obs"), 
                                2)),
                vjust = 3, hjust = 1, col = "#F8766D") +
        labs(title = "Twitter vs. News")

plot3 <- ggplot(data = frequency_df, aes(x = news, y = blogs)) +
        geom_point() + 
        annotate("text", x = Inf, y = Inf,
                label = paste0("cor: ",
                        round(cor(
                                frequency_df$news,
                                frequency_df$blogs,
                                use = "complete.obs"),
                                2)),
                vjust = 3, hjust = 1, col = "#F8766D") +
        labs(title = "News vs. Blogs")

grid.arrange(plot1, plot2, plot3, nrow = 2, ncol = 2)

Combine Sources

Let’s combine the three separate sources into a single master set, shuffle it, and keep a 10% slice for the n-gram analysis.

master_set <- c(twitter_sample, blogs_sample, news_sample)

master_random <- randomizeSets(master_set)
master_sample <- sampleExtrack(master_random, size = .1)

master_cleaned <- cleanUp(master_sample)

Bigram and Trigram Tokens

Now that we have an idea of which individual words appear most frequently in the data, we should also look at which combinations of words appear frequently. From here on we’ll use only the combined data set. To examine word combinations we need to create tokens: in NLP, tokens are pieces of a character sequence chopped to a size you define. We’ll make tokens for bi-grams (two-word combinations) and tri-grams (three-word combinations). The code below creates the bi-grams and tri-grams and then plots bar charts of the most frequent combinations.

library(RWeka)
options(mc.cores = 1)

# Split each document into overlapping two-word tokens.
bigram_tokenizer <- function(x) 
        unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), 
               use.names = FALSE)
bigram_tdm <- TermDocumentMatrix(master_cleaned, 
                                control = list(tokenize = bigram_tokenizer))
bigram_matrix <- as.matrix(bigram_tdm)

bigram_frequency <- rowSums(bigram_matrix)
bigram_frequency <- sort(bigram_frequency, decreasing = TRUE)

bigram_top_10 <- bigram_frequency[1:10]
bigram <- data.frame(bigram_top_10)
bigram$terms <- rownames(bigram)
rownames(bigram) <- NULL
colnames(bigram) <- c("frequency", "terms")
bigram$terms <- reorder(bigram$terms, bigram$frequency)

bigram_plot <- ggplot(bigram, aes(x = terms, y = frequency, fill = terms)) + 
        geom_bar(stat = "identity") + 
        coord_flip() +
        theme(legend.position = "none") +
        labs(title = "Top Bi-gram Tokens")
bigram_plot

# Split each document into overlapping three-word tokens.
trigram_tokenizer <- function(x)
        unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), 
               use.names = FALSE)
trigram_tdm <- TermDocumentMatrix(master_cleaned, 
                                control = list(tokenize = trigram_tokenizer))
trigram_matrix <- as.matrix(trigram_tdm)

trigram_frequency <- rowSums(trigram_matrix)
trigram_frequency <- sort(trigram_frequency, decreasing = TRUE)

trigram_top_10 <- trigram_frequency[1:10]
trigram <- data.frame(trigram_top_10)
trigram$terms <- rownames(trigram)
rownames(trigram) <- NULL
colnames(trigram) <- c("frequency", "terms")
trigram$terms <- reorder(trigram$terms, trigram$frequency)

trigram_plot <- ggplot(trigram, aes(x = terms, y = frequency, fill = terms)) + 
        geom_bar(stat="identity") + 
        coord_flip() +
        theme(legend.position = "none") +
        labs(title = "Top Tri-gram Tokens")
trigram_plot

From this you can see that combinations of words like “of the”, “in the”, “one of the”, and “a lot of” are used often.

Conclusion

From this analysis we can see which terms and combinations of terms are used most often in the data sets, which will come in handy when building a text prediction application. From here on we’ll assume that the sample reflects the broader data. If one wanted, they could take more random samples from the data sets and rerun the term frequency analysis to see whether the results are similar.
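
For example, such a check on the twitter source might look like the following sketch (the seed and the twitter_resample name are hypothetical):

# Hypothetical check: draw a fresh 2% sample with a different seed and compare
# its top terms against the earlier results.
set.seed(412)
twitter_resample <- sampleExtrack(sample(twitter), size = .02)
head(wordFreq(twitter_resample), 10)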

In Part 2 we’ll start by using our uni-, bi-, and tri-gram frequencies together with conditional probabilities to create the prediction application.
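
As a rough preview only (the actual Part 2 model may differ), the trigram and bigram counts above already support conditional-probability estimates of the form P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2):

# Rough sketch, not the final model: estimate P(next word | two-word prefix)
# from the n-gram counts computed above.
predictNextWord <- function(w1, w2, top_n = 3){
        prefix <- paste(w1, w2)
        # trigrams that start with the two-word prefix
        matches <- trigram_frequency[startsWith(names(trigram_frequency),
                                                paste0(prefix, " "))]
        if (length(matches) == 0 || is.na(bigram_frequency[prefix]))
                return(numeric(0))
        # conditional probability: count(w1 w2 w3) / count(w1 w2)
        head(sort(matches / bigram_frequency[prefix], decreasing = TRUE), top_n)
}

predictNextWord("one", "of")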