Milestone report
This milestone report describes my work so far in the Data Science Capstone Project on Coursera. The ultimate goal is to create a prediction algorithm and integrate it into an R Shiny app, so that a user can enter one, two, three or more words and a “predicted next word” is displayed. So far, I have only worked on the English language.
1 Loading the data
First I downloaded the required data, unpacked it and saved it into a subfolder called “data”. From there, I loaded it into my workspace using readLines(). Saving it into a list saves some memory.
data <- list()
files <- dir("./data/final/en_US", full.names = TRUE)
for(i in 1:length(files)) {
  con <- file(files[i], "rb")
  # read with UTF-8 encoding, skipping embedded NUL characters
  data[[i]] <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  # print file name and length of the longest line (in characters)
  print(c(files[i], max(nchar(data[[i]]))))
  close(con)
}
names(data) <- c("blogs", "news", "twitter")
2 Data preparation
In order to prepare the data for analysis, I set all words to lower case, removed punctuation and numbers and fixed the spacing (i.e., removed double spaces etc.). This can be accomplished with the preprocess() function from the ngram package. I did this for every line individually in order to preserve the line boundaries (otherwise they would have been destroyed, which would not make sense for calculating n-grams).
library(ngram)
for(i in 1:length(data)) {
  for(j in 1:length(data[[i]])) {
    # lower case, strip punctuation and numbers, normalise spacing
    data[[i]][j] <- preprocess(
      concatenate(data[[i]][j]), case="lower", remove.punct=T,
      remove.numbers=T, fix.spacing=T)
  }
}
From this data set, I created a second data set in which I removed so-called “stop words”, i.e. meaningless words like “it”, “as”, “or”, so that in the end I will be able to predict meaningful words.
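As a rough sketch of this step (assuming the English stopword list from the tm package, which is also used in the appendix; the object name data_sw is my own placeholder):
library(tm)      # stopwords()
library(stringr) # str_split()

# build a second, stopword-free copy of the corpus, line by line
data_sw <- lapply(data, function(lines) {
  sapply(lines, function(line) {
    words <- str_split(line, " ", simplify = TRUE)
    paste(words[!words %in% stopwords("en")], collapse = " ")
  }, USE.NAMES = FALSE)
})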
3 Basic report about the data set
3.1 Summary statistics
The full data consists of three different sources: “blogs”, “news” and “twitter”. The blogs data has 899288 lines, the news data has 1010242 lines and the twitter data is the longest with 2360148 lines. The basic summaries are as follows.
| | blogs | news | twitter | all |
|---|---|---|---|---|
| lines | 899288 | 1010242 | 2360148 | 4269678 |
| words | 36934013 | 33569489 | 29586893 | 100090395 |
| words without stopwords | 19583669 | 19796438 | 17585610 | 56965717 |
| longest line (characters) | 6327 | 1370 | 47 | 7744 |
| longest line without stopwords (characters) | 3916 | 1315 | 47 | 5278 |
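The table above can be reproduced roughly as follows (a minimal sketch; wordcount() comes from the ngram package, and the stopword-free counts would use the second data set in the same way):
library(ngram)

# line counts, total word counts and longest line (in characters) per source
summary_stats <- sapply(data, function(lines) {
  c(lines        = length(lines),
    words        = wordcount(lines),
    longest_line = max(nchar(lines)))
})
summary_stats  # one column each for blogs, news and twitter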
3.2 Word counts
I created n-grams from a (randomized) subset of 30 percent of the original data. The n-grams were counted with the ngram() function of the ngram package, e.g. two-word counts by ngram(data, n=2). The n-grams were calculated line by line, so no n-grams crossing line boundaries were counted (because these are meaningless). The frequency tables of the n-grams are obtained with the get.phrasetable() function of the same package. I only saved n-grams with a frequency of at least 2, both to save memory and because n-grams which occurred only once are of little use for predicting words.
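A minimal sketch of this step, assuming ngram() accepts a character vector and treats each element separately (which matches the line-by-line counting described above); the 30 percent sample and the object names are illustrative, and only bigrams from the blogs source are shown:
library(ngram)
library(dplyr)

set.seed(123)
# illustrative 30 percent sample of the blogs source
sample_lines <- sample(data[["blogs"]], size = round(0.3 * length(data[["blogs"]])))
# drop lines with fewer than two words, then count bigrams line by line
sample_lines <- sample_lines[sapply(sample_lines, wordcount) >= 2]
ng2   <- ngram(sample_lines, n = 2)
freq2 <- get.phrasetable(ng2) %>% filter(freq >= 2)  # keep n-grams seen at least twice
head(freq2)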
The most frequent words and word combinations are shown here. Because the most frequent simple words are not very interesting, the same was calculated for the corpus without stopwords:
4 Interesting findings
4.1 Primitive algorithm
So far I have tried to create a simple prediction algorithm (see appendix):
- When the user enters a string like “hello and good evening”, for example, it checks whether there are any 5-grams starting with “hello and good evening”.
- If there are, it returns the next word of the most frequent 5-grams (in this case: ).
- If there aren’t, it removes the first word of the string and checks whether there are any 4-grams starting with “and good evening”.
- If there are, it returns the next word of the most frequent 4-grams (in this case: ).
- If there aren’t, … (and so on)
- If no n-grams are found, it just returns the most frequent word.
The algorithm lets the user decide whether he or she wants to include stopwords or only wants a “meaningful” result.
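For illustration, the function from the appendix could be called like this (the results depend on the pre-computed frequency tables tab_corpus and tab_sw):
nextword("hello and good evening", pick = TRUE, n = 1, steps = 5)  # including stopwords
nextword("hello and good evening", pick = TRUE, n = 3, sw = TRUE)  # meaningful words only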
4.2 Results
The prediction accuracy so far has not been very good. It does not depend much on the size of the training data set (here: 10, 20, 30, 40 or 50 percent of the original data set). It does depend a lot, however, on whether you are looking for a word including stopwords (higher accuracy) or for a meaningful word (lower accuracy).
We can see that the average accuracy of the predictions does not increase with the size of the training data set (x-axis, in percent), neither for words including stopwords (left) nor for words excluding stopwords.
But in the lower figures we can see that the average time to produce a result increases with larger training data. Hence, it seems best to base the results on a smaller training data set to improve the user experience.
Also, the size of the n-grams (“steps”) does not influence the prediction, which is rather strange. I have to work on that.
5 Further plans for creating a prediction algorithm and Shiny app
A lot of things have to be done in order to create a good Shiny app.
For the prediction algorithm, profanity words have to be filtered first. Then, a lot of data cleaning has to be done. For example, all words containing @ could be removed; also, it would be helpful to standardise different spellings of the same word (“its”, “it’s” etc.); this would not only improve the accuracy, but also reduce the size of the n-gram frequency tables.
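A rough sketch of how such a cleaning step could look (the replacement list is only an illustration, not a complete solution):
library(stringr)

clean_line <- function(line) {
  line <- str_remove_all(line, "\\S*@\\S*")           # drop words containing @
  line <- str_replace_all(line, "\\bit's\\b", "its")  # unify spelling variants
  str_squish(line)                                    # fix spacing again
}
data_clean <- lapply(data, clean_line)  # stringr functions are vectorised over lines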
One could also include more information in the prediction, like the length of the string, or one could choose between twitter- and blog-generated n-gram frequency tables (even though I think it’s best to integrate everything). Also, I have to decide how many n-grams to include; it could also be helpful not to always pick the most frequent 3-gram, for example, if there is a much better 2-gram to choose from.
For the app I intend to let the user decide whether he or she wants to include stopwords, how many predicted words should be returned (e.g., a selection of 3 words) and whether he or she wants to increase accuracy (at the cost of speed).
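A minimal sketch of the planned controls, assuming the nextword() function from the appendix (all widget names are placeholders):
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  checkboxInput("stopwords", "Include stopwords", value = TRUE),
  sliderInput("n_pred", "Number of predicted words", min = 1, max = 5, value = 3),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    # the accuracy/speed trade-off could later be wired to further arguments
    paste(nextword(input$phrase, pick = TRUE, n = input$n_pred,
                   sw = !input$stopwords),
          collapse = ", ")
  })
}

# shinyApp(ui, server)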
6 Appendix
6.1 Prediction algorithm
library(ngram)   # preprocess(), concatenate()
library(stringr) # str_split(), str_detect(), str_remove()
library(dplyr)   # filter(), arrange(), select()

nextword <- function(x, pick=F, n=1, sw=F, permille=1, steps=2) {
  # preprocess the entered string in the same way as the corpus
  word <- preprocess(
    concatenate(x),
    case="lower",
    remove.punct=T,
    remove.numbers=T,
    fix.spacing=T)
  # separate into single words
  word <- str_split(word, " ", simplify=T)
  # optionally remove stopwords from the input
  if(sw) word <- word[!word %in% tm::stopwords()]
  comb <- data.frame()
  # keep at most the last (steps-1) words of the input
  if(length(word) >= (steps-1)) word <- word[(length(word)-(steps-2)):length(word)]
  # search for n-grams, backing off to shorter prefixes until a match is found
  while(nrow(comb) <= 1 & length(word) > 0) {
    if(sw) tabnow <- tab_sw[[permille]][[length(word)+1]]
    if(!sw) tabnow <- tab_corpus[[permille]][[length(word)+1]]
    word <- paste(word, collapse=" ")
    # which n-grams start with the entered string?
    comb <- tabnow %>%
      filter(str_detect(ngrams, paste0("^", word, "\\s")))
    # return a table or just the predicted word(s)?
    if(pick) {
      comb <- comb %>%
        arrange(desc(freq)) %>%
        filter(row_number() <= n) %>%
        select(ngrams)
      comb <- str_split(comb$ngrams, " ", simplify = T)
    }
    # back off: drop the first word and try again with a shorter prefix
    word <- str_split(str_remove(word, "\\s$"), " ", simplify = T)[-1]
  }
  # result
  # if a match was found
  if(length(comb) > 1) {
    if(!pick) nextword <- comb
    if(pick) nextword <- comb[,ncol(comb)]
    return(nextword)
  # if not: return the most frequent single word(s)
  } else {
    if(!sw) nextword <- tab_corpus[[permille]][[1]][1:n,1]
    if(sw) nextword <- tab_sw[[permille]][[1]][1:n,1]
    return(nextword)
  }
}