Overview:


The purpose of this project is to create an interactive tool that analyzes and then predicts a user’s next desired word choice. The tool will use real US data from blog posts, Twitter feeds, and news stories to form the basis for these predictions. The data is provided courtesy of SwiftKey, a technology firm whose smartphone software uses NLP (Natural Language Processing) to suggest words as users write text messages.


In this phase of the project, we will take a first look at some features of the data, including the total number of lines and words in each text source and the frequency of certain combinations of words.


Analysis of Line and Word Counts:


First, we’ll want to set up our R directory and read in the three en_US files:


setwd("C:/Users/Zachary/Data Science/Capstone")
require(tm);require(SnowballC);require(data.table);require(ggplot2)

##Read the three US files into R.
main_path <- file.path('.','final/en_US')

con <- file(paste(main_path, dir(main_path)[1], sep='/'),'r')
en_US_blogs <- readLines(con)
close(con)

con <- file(paste(main_path, dir(main_path)[2], sep='/'),'r')
en_US_news <- readLines(con)
close(con)

con <- file(paste(main_path, dir(main_path)[3], sep='/'),'r')
en_US_twitter <- readLines(con)
close(con)


Next, we will create data frames for the line counts:


## Make a data frame of line counts per source
lineslength <- data.frame(
        names = c("blogs", "news", "twitter"),
        lineslength = c(length(en_US_blogs), length(en_US_news), length(en_US_twitter))
)


We’ll do the same for word count:


## Count the words in each line of each source by splitting on spaces
blogs_words   <- sapply(strsplit(en_US_blogs, " "), length)
news_words    <- sapply(strsplit(en_US_news, " "), length)
twitter_words <- sapply(strsplit(en_US_twitter, " "), length)

## Combine the word totals into one data frame
words <- data.frame(
        names = c("blogs", "news", "twitter"),
        words = c(sum(blogs_words), sum(news_words), sum(twitter_words))
)


Now, we’ll use ggplot to construct bar plots of these counts:


## par() has no effect on ggplot objects, so build both plots and
## arrange them side by side with gridExtra instead
require(gridExtra)

p_lines <- ggplot(lineslength, aes(x = names, y = lineslength)) +
        geom_bar(stat = 'identity', fill = 'blue', color = 'grey60') +
        xlab('Source') + ylab('Total Lines') + coord_flip() +
        ggtitle('Total Line Count by Source')

## Construct bar graph of word count for each source
p_words <- ggplot(words, aes(x = names, y = words)) +
        geom_bar(stat = 'identity', fill = 'green', color = 'grey60') +
        xlab('Source') + ylab('Total Words') + coord_flip() +
        ggtitle('Total Word Count by Source')

grid.arrange(p_lines, p_words, ncol = 2)


Cleaning and Tokenization:


The next step is to clean the three data sets and collectively break the data down into one-grams, two-grams, and three-grams. Essentially, we want to know which a) single words, b) combinations of two consecutive words, and c) combinations of three consecutive words are most prevalent across the data sources. This understanding will form the backbone of our prediction algorithm later on.
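
As a quick illustration (a toy phrase, not taken from the data), base R makes it easy to see what the two-grams of a sentence look like:

## Toy example: the two-grams of a short phrase
s <- "to be or not to be"
tokens <- strsplit(s, " ")[[1]]
paste(tokens[-length(tokens)], tokens[-1])
## [1] "to be"  "be or"  "or not" "not to" "to be"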


I created a tokenize function which performed the following steps on the data (a sketch of what such a function might look like appears after this list):

  • Converted everything to lower case
  • Removed non-English characters
  • Removed extra spaces before and after words
  • (I also attempted a profanity filter, but without success)
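
The exact implementation is not reproduced here, but a minimal base-R sketch covering the steps above (the function name and details are illustrative, not the exact code used) might look like this:

## Minimal sketch of a tokenize function (illustrative)
tokenize <- function(lines) {
        lines <- tolower(lines)                            # lower-case everything
        lines <- iconv(lines, "UTF-8", "ASCII", sub = "")  # drop non-English characters
        lines <- gsub("[^a-z' ]", " ", lines)              # keep letters, apostrophes, spaces
        lines <- gsub("\\s+", " ", lines)                  # collapse runs of spaces
        trimws(lines)                                      # trim leading/trailing spaces
}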


Using the tokenize function, I separately created three RData files containing frequency tables of the one-, two-, and three-grams.
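
The n-gram construction code itself is not shown here, but a rough sketch of how one of these frequency tables might be built and saved (make_ngrams is a hypothetical helper, and it assumes the tokenize sketch above) is:

## Sketch: build an n-gram frequency table from cleaned lines (illustrative)
make_ngrams <- function(clean_lines, n = 2) {
        tokens <- unlist(strsplit(clean_lines, " "))
        if (n == 1) {
                grams <- tokens
        } else {
                idx <- seq_len(length(tokens) - n + 1)
                grams <- sapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
        }
        freq <- sort(table(grams), decreasing = TRUE)
        data.frame(gram = names(freq), count = as.integer(freq))
}

twogram <- make_ngrams(tokenize(en_US_twitter))
save(twogram, file = "twogram.RData")

After loading the saved tables back into R, we can graph them: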


load("onegram.RData")
load("twogram.RData")
load("threegram.RData")


We’ll use the hist function to create three histograms of the data. Notice that we apply log10 to the frequencies to condense the axis into a digestible graph.


par(mfcol = c(1,3))
hist(log10(table(onegram[,2])), xlab="", col = "blue", 
     ylab="Number of words", main = "One-gram")
hist(log10(table(twogram[,2])), xlab="Frequency (log10)", col = "orange", 
     ylab="", main = "Two-gram")
hist(log10(table(threegram[,2])), xlab="", col = "purple", 
     ylab="", main = "Three-gram")


This data tells us that roughly 120 words account for 50% of all one-gram occurrences in the dataset. These are common words like “the”, “and”, and “to”. For three-grams, about 30,000 word combinations are needed to account for 50% of the three-gram occurrences. There is far more variability there, and the frequency mass is much less concentrated in a small set of items than it is for one-grams.
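
For reference, a coverage figure like the 120-word estimate can be derived from the frequency table with a cumulative sum; a sketch, assuming the counts sit in the second column of onegram as above:

## Sketch: how many words cover 50% of all one-gram occurrences
counts <- sort(onegram[, 2], decreasing = TRUE)
coverage <- cumsum(counts) / sum(counts)
which(coverage >= 0.5)[1]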