Summary

The purpose of this project is to build a text prediction model for keyboard input, similar to SwiftKey. The data source contains text files from blogs, news articles, and Twitter in several languages. This report summarizes the preliminary steps to process and analyze the data: preprocessing, a basic summary of the data, tokenization, n-gram analysis, and a model prototype.

Preprocessing

### Read in corpus
    library(tm)
    mycor <- Corpus(DirSource("/Users/Lilsummer/Desktop/final/mycorpus"))

### Sampling
    r1 <- readLines('en_US.blogs.txt')
    ## draw a random sample of 3000 lines from the blog file
    line.sample1 <- sample(length(r1), 3000)
    r1.sample <- r1[line.sample1]
    ## write the sample out as plain text
    writeLines(r1.sample, "en_US.blogs.sample.txt")

### Remove punctuation
    ## replace these characters with a space first, so words they
    ## separate are not fused together by removePunctuation
    toSpace <- content_transformer(function(x, pattern) {gsub(pattern, " ", x)})
    mycor <- tm_map(mycor, toSpace, "-")
    mycor <- tm_map(mycor, toSpace, "\"")
    mycor <- tm_map(mycor, toSpace, ",")
    mycor <- tm_map(mycor, toSpace, "\n")
    mycor <- tm_map(mycor, toSpace, ":")
    mycor <- tm_map(mycor, removePunctuation)
### Remove digits
    mycor <- tm_map(mycor, removeNumbers)
### Remove extra whitespace
    mycor <- tm_map(mycor, stripWhitespace)
### Remove profanity
    badwords <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
    ## removeWords takes a character vector, not a VectorSource
    mycor <- tm_map(mycor, removeWords, badwords)
### Remove stop words (optional)
    mycor.rm <- tm_map(mycor, removeWords, stopwords('english'))

### To lower case
    mycor <- tm_map(mycor, content_transformer(tolower))

Note that removeWords is case-sensitive, so lower-casing should come before the word-removal steps above; otherwise capitalized occurrences of stop words and profanity are missed.

Basic summary of data
  • line counts
    • Blogs: 899288
    • News: 1010242
    • Twitter: 2360148
    ## count lines in each English source file
    r1 <- readLines('en_US.blogs.txt')
    r2 <- readLines('en_US.news.txt')
    r3 <- readLines('en_US.twitter.txt')
    length(r1)
    length(r2)
    length(r3)

Tokenization
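
The cleaned corpus is split into word tokens before n-grams are built. The original tokenization code is not shown here; below is a minimal sketch assuming simple whitespace splitting (the tokenize helper is ours, not part of the original scripts):

    ## split a character vector of cleaned text into word tokens
    tokenize <- function(text) unlist(strsplit(text, "\\s+"))
    ## tokens from the first document of the preprocessed corpus
    tokens1 <- tokenize(as.character(mycor[[1]]))
    head(tokens1)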

N-grams analysis
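
The prediction tables below pair each n-gram with its frequency. A minimal sketch of how such a bigram table could be built, assuming the RWeka tokenizer (a common companion to tm; the actual script is not shown). Note that as.data.frame(table(mycor.bigrams)) yields the mycor.bigrams / Freq columns seen in the output further down:

    library(RWeka)
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    ## collect bigrams from every document in the sampled corpus
    mycor.bigrams <- unlist(lapply(mycor, function(d) BigramTokenizer(as.character(d))))
    ## tabulate counts and sort in decreasing frequency
    bigram.freq <- as.data.frame(table(mycor.bigrams))
    bigram.freq <- bigram.freq[order(-bigram.freq$Freq), ]
    head(bigram.freq, 10)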

Wordcloud of Top 100 2-grams from the sampled data

Wordcloud of Top 100 3-grams from the sampled data
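
The wordclouds are drawn from these frequency tables. A minimal sketch for the 2-gram cloud, assuming the wordcloud package and the bigram.freq table from the sketch above:

    library(wordcloud)
    ## plot the 100 most frequent bigrams, sized by frequency
    wordcloud(words = bigram.freq$mycor.bigrams,
              freq = bigram.freq$Freq,
              max.words = 100,
              random.order = FALSE)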

Model prototype

We used “want” and “i want” to predict the probability of the next word with the 2-gram and 3-gram tables, respectively.

    ### look up "want" in the 2-gram table and "i want" in the 3-gram table
    source('findcount2.r')
    source('findcount3.r')
    word1 <- findcount2("want")
    word2 <- findcount3("i want")
    
    head(word1, 10)
##           X mycor.bigrams Freq
## 26       26       want to  179
## 4222   4222        want a    6
## 5469   5469       want it    5
## 5470   5470       want my    5
## 5471   5471      want you    5
## 7570   7570     want more    4
## 7571   7571     want them    4
## 11870 11870      want any    3
## 11871 11871      want the    3
## 24221 24221      want all    2
    head(word2, 10)
##           X   mycor.trigrams Freq
## 3     91946        i want to   47
## 1394  91940        i want my    4
## 6378  91932         i want a    2
## 6379  91933       i want all    2
## 6380  91937        i want it    2
## 6381  91944      i want them    2
## 99103 91934       i want for    1
## 99104 91935      i want full    1
## 99105 91936 i want inglenook    1
## 99106 91938   i want marquez    1
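
The findcount2.r and findcount3.r scripts are not reproduced here. As a rough illustration only, a lookup consistent with the output above could work as follows, assuming the bigram.freq table from the n-gram sketch (the .sketch suffix marks this as our assumption, not the author's actual script):

    ## return bigrams whose first word matches w, most frequent first
    findcount2.sketch <- function(w, freq.table = bigram.freq) {
      grams <- as.character(freq.table$mycor.bigrams)
      first <- sapply(strsplit(grams, " "), `[`, 1)
      out <- freq.table[first == w, ]
      out[order(-out$Freq), ]
    }
    head(findcount2.sketch("want"), 10)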

The results from “want” and “i want” are quite similar, apart from minor differences among the lower-frequency entries. As long as the given two words match an entry in the 3-gram dictionary, the 3-gram prediction is more reliable because it conditions on more context. The basic algorithm will be: