The goal of this project is to build a text prediction model of the kind used in keyboard apps such as SwiftKey. The data source contains text files from blogs, news, and Twitter in several languages. This report summarizes the preliminary steps used to process and analyze the data, including:
### Read in corpus
library(tm)  # provides Corpus, DirSource, tm_map and the transformations used below
mycor <- Corpus(DirSource("/Users/Lilsummer/Desktop/final/mycorpus"))
### Sampling
r1 = readLines('en_US.blogs.txt')
line.sample1 = sample(length(r1), 3000)
r1.sample = r1[line.sample1]
# writeLines keeps one raw text line per row; write.csv would add a header, row numbers and quotes
writeLines(r1.sample, "en_US.blogs.sample.txt")
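The sampling step can be made repeatable by fixing the random seed before drawing line indices. The following is a minimal sketch on toy data, not the code used for this report: the `toy_lines` vector, the seed value, and the temporary output file are all assumptions made for illustration.

```r
# Toy stand-in for the lines of a corpus file (hypothetical data)
toy_lines <- sprintf("line number %d", 1:100)

set.seed(1234)                        # arbitrary seed, chosen only for this example
idx <- sample(length(toy_lines), 10)  # draw 10 distinct line indices
sampled <- toy_lines[idx]

# Write the sample as plain text, one line per row, then read it back
out <- tempfile(fileext = ".txt")
writeLines(sampled, out)
reread <- readLines(out)
```

Reading the file back returns exactly the sampled lines, which is what a later `DirSource` call would see.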
### Remove punctuation
toSpace <- content_transformer(function(x, pattern) {gsub(pattern, " ", x)})
mycor <- tm_map(mycor, toSpace, "-")
mycor <- tm_map(mycor, toSpace, "\"")
mycor <- tm_map(mycor, toSpace, ",")
mycor <- tm_map(mycor, toSpace, "\n")
mycor <- tm_map(mycor, toSpace, ":")
mycor <- tm_map(mycor, removePunctuation)
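Replacing selected characters with a space before stripping punctuation matters because deleting a hyphen outright glues the two neighbouring words together. The sketch below shows the effect in base R on a single made-up string; `to_space` is a hypothetical plain-function counterpart of `toSpace`, and it uses literal matching (`fixed = TRUE`), which is an assumption rather than what the pipeline above does.

```r
# Hypothetical base R counterpart of toSpace, using literal (non-regex) matching
to_space <- function(x, pattern) gsub(pattern, " ", x, fixed = TRUE)

# Deleting punctuation directly merges hyphenated words
merged <- gsub("[[:punct:]]", "", "well-known")

# Replacing separators with spaces first keeps the words apart
clean <- "well-known: \"quoted\", text"
for (p in c("-", "\"", ",", ":")) clean <- to_space(clean, p)
clean <- gsub("[[:punct:]]", "", clean)    # strip any remaining punctuation
clean <- trimws(gsub("\\s+", " ", clean))  # collapse runs of whitespace
```

Here `merged` is `"wellknown"`, while `clean` ends up as `"well known quoted text"`.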
### Remove digits
mycor <- tm_map(mycor, removeNumbers)
### Remove whitespace
mycor <- tm_map(mycor, stripWhitespace)
### Remove profanity
badwords <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
badwords <- badwords[nzchar(badwords)]  # drop any empty lines
# removeWords expects a plain character vector of words, not a VectorSource
mycor <- tm_map(mycor, removeWords, badwords)
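Word removal of this kind amounts to a whole-word regular-expression substitution. The sketch below illustrates the idea in base R with a hypothetical two-word list; `remove_words` and `banned` are names invented for this example, and the function is a simplified stand-in rather than the `tm` implementation.

```r
# Simplified whole-word removal (illustrative only, not tm's removeWords)
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

banned <- c("darn", "heck")  # hypothetical word list

filtered   <- remove_words("what the heck is this darn thing", banned)
partial    <- remove_words("heckle", banned)  # longer word containing "heck"
uppercased <- remove_words("Heck", banned)    # same word, different case
```

Only whole words are removed, so `"heckle"` is untouched. Matching is also case-sensitive, so `"Heck"` survives; converting the text to lower case before filtering may therefore catch more matches.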
### Remove stop words (optional)
mycor.rm <- tm_map(mycor, removeWords, stopwords('english'))
### To lower case
mycor <- tm_map(mycor, content_transformer(tolower))
r1 = readLines('en_US.blogs.txt')
r2 = readLines('en_US.news.txt')
r3 = readLines('en_US.twitter.txt')
length(r1)
length(r2)
length(r3)
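Line counts for the three sources are easier to compare side by side in one small table. A possible sketch, using tiny made-up character vectors in place of the real files (the `corpora` list and its contents are assumptions, not project data):

```r
# Toy stand-ins for r1, r2 and r3 (hypothetical data)
corpora <- list(
  blogs   = c("a b c", "d e"),
  news    = c("f g"),
  twitter = c("h", "i j", "k l m")
)

# One row per source: number of lines and number of whitespace-separated words
summary_tbl <- data.frame(
  source = names(corpora),
  lines  = unname(vapply(corpora, length, integer(1))),
  words  = unname(vapply(corpora,
                         function(x) sum(lengths(strsplit(x, "\\s+"))),
                         integer(1))),
  stringsAsFactors = FALSE
)
```

With the real data, the list would hold `r1`, `r2` and `r3` instead of the toy vectors.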
We used “want” and “i want” as test inputs and looked up the most frequent next words in the 2-gram and 3-gram frequency tables, respectively.
### “want” in 2-grams and “i want” in 3-grams
source('findcount2.r')
source('findcount3.r')
word1 = findcount2("want")
word2 = findcount3("i want")
head(word1, 10)
##           X mycor.bigrams Freq
## 26       26       want to  179
## 4222   4222        want a    6
## 5469   5469       want it    5
## 5470   5470       want my    5
## 5471   5471      want you    5
## 7570   7570     want more    4
## 7571   7571     want them    4
## 11870 11870      want any    3
## 11871 11871      want the    3
## 24221 24221      want all    2
head(word2, 10)
##           X   mycor.trigrams Freq
## 3     91946        i want to   47
## 1394  91940        i want my    4
## 6378  91932         i want a    2
## 6379  91933       i want all    2
## 6380  91937        i want it    2
## 6381  91944      i want them    2
## 99103 91934       i want for    1
## 99104 91935      i want full    1
## 99105 91936 i want inglenook    1
## 99106 91938   i want marquez    1
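The files `findcount2.r` and `findcount3.r` are not shown here, so the following is only a guess at how such a lookup might work: build an n-gram frequency table, keep the rows that start with the given words, and turn the counts into relative frequencies. The function names `ngram_counts` and `find_count`, the `prob` column, and the toy sentence are all assumptions made for illustration.

```r
# Count n-grams in a token vector (base R only)
ngram_counts <- function(tokens, n) {
  if (length(tokens) < n) {
    return(data.frame(ngram = character(0), Freq = integer(0)))
  }
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(starts,
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  tab <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(tab), Freq = as.integer(tab),
             stringsAsFactors = FALSE)
}

# Hypothetical lookup: n-grams beginning with `prefix`, most frequent first,
# with counts converted to relative frequencies among the matches
find_count <- function(counts, prefix) {
  hits <- counts[startsWith(counts$ngram, paste0(prefix, " ")), , drop = FALSE]
  hits$prob <- hits$Freq / sum(hits$Freq)
  hits[order(-hits$Freq), , drop = FALSE]
}

tokens <- strsplit("i want to go i want to stay i want a break", " ")[[1]]
bigrams  <- ngram_counts(tokens, 2)
trigrams <- ngram_counts(tokens, 3)

next_after_want   <- find_count(bigrams, "want")     # candidates after "want"
next_after_i_want <- find_count(trigrams, "i want")  # candidates after "i want"
```

In this toy sentence, “to” follows “want” in two of three cases, so it is ranked first in both lookups, mirroring the pattern in the tables above.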
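The results for “want” and “i want” are similar: “to” is by far the most frequent next word in both tables, and the remaining candidates differ only slightly. When the last two words of the input match an entry in the 3-gram dictionary, the 3-gram prediction is more reliable because it conditions on more context. The basic algorithm will be: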