I am in the process of creating a predictive text model. When a user enters a word or a set of words, I want my algorithm to make a reasonable guess at the next word they would like to type, similar to how many smartphone keyboard apps and the Google search bar work today. To develop this model, I am using the Corpora data sets, found at http://www.corpora.heliohost.org/aboutcorpus.html

Steps Taken

  1. Exploratory Analysis
  2. Data Cleansing
  3. Using n-grams
  4. Developing a prediction model

Exploratory Analysis

First, I downloaded the data and looked at it. I found readLines() to be the best way to read the .txt files into R.

Packages Used: data.table, stringr, knitr, lattice

library(data.table)
library(stringr)
library(knitr)
library(lattice)
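For reference, the loading step looked roughly like the sketch below; the file path is an assumption about where the corpus was unzipped, but en_US_t is the object name used throughout.

## Read the Twitter corpus with readLines(), one tweet per element
## (path is assumed; point it at your local copy of the corpus)
en_US_t <- readLines("en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)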

A sample of en_US.twitter.txt

head(en_US_t)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"

Data Cleansing

Upon seeing the text I was working with, I knew that I wanted to remove special characters, change everything to lowercase, and be able to examine how frequently certain words appeared. I used the following functions to clean up my data.

## Split the first 1,000 lines of text everywhere a space occurs
p <- vector("list", 1000)
for (i in 1:1000) {
    p[[i]] <- strsplit(en_US_t[i], " ")[[1]]
}

## Reformat into a more manageable dataset (one word per element)
y <- unlist(p, recursive = FALSE)

## Make everything lowercase
q <- str_to_lower(y)

## Remove all non-alphanumeric characters
r <- str_replace_all(q, "[^[:alnum:]]", "")
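Counting single-word frequencies is then straightforward; a minimal sketch with data.table, where the object names follow the code above:

## Count how often each cleaned word appears, most frequent first
word_freq <- data.table(word = r[r != ""])[, .(freq = .N), by = word]
setorder(word_freq, -freq)
head(word_freq, 10)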

Using N-grams

N-grams are important for predictive analytics. They are chunks of text that are n words long, used to predict trends in language. From the data set, I determined the most common 2-grams and 3-grams. Once my data was clean, it was easy to determine the most frequently used words and sets of words.
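The counting works much like the single-word case; a sketch of the 2-gram step, assuming the cleaned word vector r from above (this simple version ignores sentence boundaries):

## Pair each word with the word that follows it, then count each pair
## (3-grams follow the same pattern, shifting the vector twice)
words <- r[r != ""]
bigrams <- data.table(word = words[-length(words)], next_word = words[-1])
bigram_freq <- bigrams[, .(freq = .N), by = .(word, next_word)]
setorder(bigram_freq, -freq)
head(bigram_freq)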

Developing a Prediction Model

After studying the n-gram frequency tables, I know that certain words are much more likely than others to follow a given word of interest. For example, ‘are’ very commonly follows the English word ‘how’. To start, I am going to build a model that takes a specified word as input and outputs the most common next word, as determined by my frequency table of 2-grams. Below, for example, is the table of next-word frequencies for ‘how’ in my sample:

##           word freq likelihood_next
## 1         much    3      0.11538462
## 2          was    3      0.11538462
## 3          are    2      0.07692308
## 4           is    2      0.07692308
## 5         many    2      0.07692308
## 6           to    2      0.07692308
## 7        about    1      0.03846154
## 8          can    1      0.03846154
## 9    dedicated    1      0.03846154
## 10         did    1      0.03846154
## 11        good    1      0.03846154
## 12           i    1      0.03846154
## 13          it    1      0.03846154
## 14         the    1      0.03846154
## 15 unfortunate    1      0.03846154
## 16        will    1      0.03846154
## 17          ya    1      0.03846154
## 18         you    1      0.03846154

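A minimal sketch of that lookup, assuming a 2-gram frequency table shaped like bigram_freq above (the function name is my own illustration, not the final model):

## Return the most frequent next word for a given input word
predict_next <- function(w, freq_table = bigram_freq) {
    candidates <- freq_table[word == str_to_lower(w)]
    if (nrow(candidates) == 0) return(NA_character_)
    candidates[which.max(freq), next_word]
}

predict_next("how")
## for the sample above, the top candidates are "much" and "was" (freq = 3)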
Next Steps

My model will be much stronger once I start thinking about 3- to 5-gram predictions. My code is very clunky and could be streamlined with more efficient use of loops and functions designed for this purpose. I still have some work to do in determining how to separate foreign languages; my best plan so far is to identify keywords unique to each language, search for those keywords, and, if one is present, label the text entry as that language. A rough sketch of that idea follows.
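The keyword lists below are placeholders chosen purely for illustration, not a vetted set:

## Flag a line as a given language if it contains any keyword unique to it
## (placeholder keyword lists; defaults to English when nothing matches)
lang_keywords <- list(
    german  = c("und", "nicht", "ich"),
    spanish = c("pero", "gracias", "porque")
)

flag_language <- function(line) {
    for (lang in names(lang_keywords)) {
        pattern <- paste0("\\b(", paste(lang_keywords[[lang]], collapse = "|"), ")\\b")
        if (str_detect(str_to_lower(line), pattern)) return(lang)
    }
    "english"
}

flag_language("gracias pero no")
## [1] "spanish"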