Plans for Creating a Prediction Algorithm

I plan to use an analysis of words and n-grams to generate a prediction algorithm. Prediction will be based on calculating maximum likelihood estimates for the words entered by a user. An n-gram model will be used that looks at the previous words, up to two of them, to generate the words most likely to appear next. Discounting will be used when unknown words or combinations of words are entered by the user; most likely I will use Katz backoff to discount probabilities. The prediction algorithm will run in a Shiny app where users enter text to receive predictions.
The purpose of the project is to build a model that predicts the next word(s) based on user input. The prediction will be based on the calculation of maximum likelihood estimates of words and n-grams generated by a user. The project will work with combinations of up to three words (3-grams).
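To make the idea concrete, here is a minimal sketch of the maximum likelihood part. The discounting/backoff step is omitted, and the frequency table freq with columns words and counts is an assumption at this point (such a table is built later in this report):

library(dplyr)
library(stringr)

# `freq` is assumed to be a frequency table with columns `words` (the n-gram) and `counts`
predict_next <- function(freq, prev_two, top = 3) {
  freq %>%
    filter(str_count(words, " ") == 2,                        # 3-grams only
           str_detect(words, paste0("^", prev_two, " "))) %>% # starting with the two given words
    mutate(next_word = word(words, 3),
           prob = counts / sum(counts)) %>%                   # MLE over the observed continuations
    arrange(desc(prob)) %>%
    head(top)
}

# e.g. predict_next(freq, "thanks for")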
The challenges to overcome are:
* Finding the balance between predictive power and performance
* Dealing with unknown words/combinations of words
* Selecting the best statistical model
The final result will be available as a web interface (built with RShiny).
In order to create the sampled data, I first had a look at the size of the raw files:
## [1] "en_US.blogs.txt|number of lines: 899288"
## [1] "en_US.blogs.txt|number of words: 38370723"
## [1] "en_US.news.txt|number of lines: 1010242"
## [1] "en_US.news.txt|number of words: 35783083"
## [1] "en_US.twitter.txt|number of lines: 2360148"
## [1] "en_US.twitter.txt|number of words: 31149374"
One can easily see that we are dealing with big data here. Working with that much data isn’t efficient, and it also isn’t necessary. Next we are going to pick a sample of each medium. We’ll also manipulate the data in a way that serves our purpose.
Since the files are quite large, we can move on to selecting a sample of each file and cleaning the data by going through the following steps:
* removing URLs, Twitter mentions and punctuation
* drawing a random sample of the lines
* tokenizing the sample into words, 2-grams and 3-grams
* filtering out lines containing profanity
* writing the sample and the n-gram frequencies to disk

That’s what the function below does:
library(tm)          # removePunctuation()
library(tokenizers)  # tokenize_ngrams()
library(dplyr)
library(stringr)

get_sample <- function(filename, prob) {
  path <- paste0('final/', substring(filename, 1, 5), '/', filename)
  # load Google's list of bad words
  en_bad <- read.table("badlanguage.txt")
  # open connection and read the full file
  con <- file(path, "r")
  set.seed(23)
  full <- readLines(con, encoding = "UTF-8")
  # remove URLs
  full <- gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", full)
  # remove twitter @ mentions
  full <- gsub("@\\w+", " ", full)
  # remove punctuation
  full <- removePunctuation(full, preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = TRUE)
  # use a binomial distribution to draw a sample whose expected share of lines equals prob
  sample <- as.data.frame(list(full[rbinom(length(full), 1, prob) == 1]))
  colnames(sample)[1] <- "words"
  # tokenize the text into words, 2-grams and 3-grams; stopwords are kept as they are relevant for this task
  sample <- as.vector(sample[, 1]) %>%
    tokenize_ngrams(n = 3, n_min = 1) %>%
    unlist() %>% list() %>%
    as.data.frame() %>%
    mutate(language = str_sub(filename, 1, 5),
           media = str_extract(filename, 'blogs|news|twitter'))
  colnames(sample)[1] <- "words"
  # delete all rows containing bad words and label each row by n-gram type
  sample <- sample %>%
    subset(!grepl(paste(unlist(en_bad), collapse = "|"), words)) %>%
    mutate(type = ifelse(str_count(words, ' ') == 0, 'words',
                         paste0(str_count(words, ' ') + 1, '-grams')))
  # write sample file
  write.table(sample, paste0(gsub('.txt', '', filename), '_sample', str_sub(filename, -4, -1)))
  # get frequencies and ranks per language/media/type and write them to a file
  freq <- sample %>%
    group_by(words, language, media, type) %>%
    summarise(counts = n()) %>%
    ungroup() %>%
    group_by(language, media, type) %>%
    mutate(rank = rank(-counts)) %>%
    ungroup()
  write.table(freq, paste0(gsub('.txt', '', filename), '_freq', str_sub(filename, -4, -1)))
  close(con)
}
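The function is then called once per source file. The sampling probability below is an assumption on my part, since the exact value isn’t stated here:

for (f in c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")) {
  get_sample(f, prob = 0.05)  # keep roughly 5% of the lines (assumed value)
}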
I’ve pulled the data using the function above. Let’s have a look at total counts per type:
## # A tibble: 12 x 4
## # Groups: language, media [4]
## language media type total_counts
## <chr> <chr> <chr> <int>
## 1 en_US blogs words 1145275
## 2 en_US blogs 2-grams 1117998
## 3 en_US blogs 3-grams 1091045
## 4 en_US news words 1048508
## 5 en_US news 2-grams 1017969
## 6 en_US news 3-grams 987538
## 7 en_US twitter words 902986
## 8 en_US twitter 2-grams 831708
## 9 en_US twitter 3-grams 760459
## 10 en_US all words 3096769
## 11 en_US all 2-grams 2967675
## 12 en_US all 3-grams 2839042
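These totals could be reproduced from the frequency files written by get_sample; a minimal sketch (the file names follow the naming scheme used in the function, and the ‘all’ rows simply aggregate over the three media):

library(dplyr)

files <- c("en_US.blogs_freq.txt", "en_US.news_freq.txt", "en_US.twitter_freq.txt")
freq <- bind_rows(lapply(files, read.table, header = TRUE, stringsAsFactors = FALSE))

per_media <- freq %>%
  group_by(language, media, type) %>%
  summarise(total_counts = sum(counts)) %>%
  ungroup()

all_media <- per_media %>%
  group_by(language, type) %>%
  summarise(total_counts = sum(total_counts)) %>%
  ungroup() %>%
  mutate(media = "all")

bind_rows(per_media, all_media)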
The top of the word list consists solely of so-called stopwords: words which are not specific at all and don’t give any information about the topic of the conversation. That makes total sense.
When it comes to the top 20 2-grams, we also see quite common combinations like ‘going to’, ‘I am’ or ‘I don’t’ ranking up there, which is what I expected.
Finally, looking at the top 20 3-grams, it’s safe to say that the data shaping has been quite a success. Very commonly used phrases rank at the top of the list, which matches the expectations perfectly.
We’d need 154 unique words to cover 50% and around 8k unique words to cover 90% of all words used. That’s pretty helpful information and will be needed to balance performance against precision.
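The coverage numbers come from the cumulative word frequencies; the calculation looks roughly like this (using the combined frequency table from above):

library(dplyr)

coverage <- freq %>%
  filter(type == "words") %>%
  group_by(words) %>%
  summarise(counts = sum(counts)) %>%
  arrange(desc(counts)) %>%
  mutate(cum_share = cumsum(counts) / sum(counts))

min(which(coverage$cum_share >= 0.5))  # unique words needed to cover 50% of all word occurrences
min(which(coverage$cum_share >= 0.9))  # unique words needed to cover 90%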
Next I’m going to pick a statistical model based on the clean data we now have in place. Before I go there, I might take an even closer look at the data and see whether further data cleaning steps are necessary. I did some research on how to deal with foreign words. While it is quite tricky to identify individual foreign words, there are some commonly used methods to identify whole sentences written in a foreign language. So I’m considering adding language detection to the script that creates the samples.
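One possible way to do that (not part of the current script; the cld2 package and its use here are my assumption) would be to detect the language of each line right after reading the file in get_sample and keep only the English ones:

library(cld2)  # compact language detector

# inside get_sample(), right after readLines():
lang <- detect_language(full)           # ISO language code (e.g. "en") or NA per line
full <- full[!is.na(lang) & lang == "en"]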