This document describes the intermediate results of the capstone project. In this project text data is loaded, analyzed, and used to build a text prediction model.
This document consists of:
1. loading the data
2. exploring the data
3. next steps
The data.table package is used to transform and filter data with high performance. NLP and ngram are used for the word counts and n-grams, tm for loading the documents and converting them to lower case, and ggplot2 and labeling for the graphs.
library(data.table, lib.loc="C:/TFS/Rlib/")
library(ngram, lib.loc="C:/TFS/Rlib/")
library(reshape, lib.loc="C:/TFS/Rlib/")
library(NLP, lib.loc="C:/TFS/Rlib/")
library(tm, lib.loc="C:/TFS/Rlib/")
library(labeling, lib.loc="C:/TFS/Rlib/")
library(textcat, lib.loc="C:/TFS/Rlib/")
library(ggplot2, lib.loc="C:/TFS/Rlib/")
#number of lines we will process
nl = 1000
#Twitter
con <- file("files/final/en_US/en_US.twitter.txt", "r")
txt = readLines(con, n=nl)
close(con)
df_txt_twitter = data.frame(txt, stringsAsFactors = F)
#to lowercase
df_txt_twitter = data.frame(txt = apply(df_txt_twitter,1, tolower), stringsAsFactors = F)
#News
con <- file("files/final/en_US/en_US.news.txt", "r")
txt = readLines(con, n=10)
df_txt_news = data.frame(txt, stringsAsFactors = F)
close(con)
#Blog
con <- file("files/final/en_US/en_US.blogs.txt", "r")
txt = readLines(con, n=10)
df_txt_blog = data.frame(txt, stringsAsFactors = F)
close(con)
The data looks like this.
#Twitter
head(df_txt_twitter, 3)
## txt
## 1 how are you? btw thanks for the rt. you gonna be in dc anytime soon? love to see you. been way, way too long.
## 2 when you meet someone special... you'll know. your heart will beat more rapidly and you'll smile for no reason.
## 3 they've decided its more fun if i don't.
#News
head(df_txt_news, 3)
## txt
## 1 He wasn't home alone, apparently.
## 2 The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
## 3 WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.
#Blog
head(df_txt_blog,3)
## txt
## 1 In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
## 2 We love you Mr. Brown.
## 3 Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
In this chapter we will get some basic statistics on the three files.
We will count the number of words and number of lines with two custom functions.
# Split a line into sentences (on ". " or "? ") and return the number of sentences.
count_lines <- function(x) {
  dfr = data.frame(strsplit(x, "\\. |\\? "), stringsAsFactors = F)
  colnames(dfr) = "zin"
  zinnen = cbind(words = apply(dfr, 1, wordcount), dfr)
  length(zinnen$zin)
}
# Split a line into sentences and return the total word count over all sentences.
count_words <- function(x) {
  dfr = data.frame(strsplit(x, "\\. |\\? "), stringsAsFactors = F)
  colnames(dfr) = "zin"
  zinnen = cbind(words = apply(dfr, 1, wordcount), dfr)
  sum(zinnen$words)
}
l = sum(apply(df_txt_twitter,1,count_lines))
w = sum(apply(df_txt_twitter,1,count_words))
The Twitter file has 1495 lines and 12773 words, so the average number of words per line is about 8.54.
l = sum(apply(df_txt_blog,1,count_lines))
w = sum(apply(df_txt_blog,1,count_words))
The blog file has 28 lines and 565 words, so the average number of words per line is about 20.18.
l = sum(apply(df_txt_news,1,count_lines))
w = sum(apply(df_txt_news,1,count_words))
The news file has 21 lines and 345 words, so the average number of words per line is about 16.43.
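The quoted averages are simply the word total divided by the line total, presumably computed with inline expressions; as a minimal example for the news sample:
# average number of words per line, using the totals w and l computed above
avg_words = w / l
avg_words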
This chapter will look at the frequency of singlegrams (single words) and bigrams.
singlegram_twitter = loop_doc(1,nl,df_txt_twitter)
bigrams_twitter = loop_doc(2,nl,df_txt_twitter)
The supporting functions loop_doc and top50 are used above and below.
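They are not shown in this excerpt; below is a minimal sketch of what they might look like, assuming loop_doc tabulates n-gram frequencies per line with the ngram package and aggregates them with data.table, and top50 returns the 50 most frequent n-grams. The implementation details and the return format are assumptions based on how the results are used in the plots.
# Hypothetical reconstruction, not the author's original code.
# loop_doc: loop over the first nl lines of a document and tabulate the
# frequency of all n-grams of size n, using the ngram package.
loop_doc = function(n, nl, doc) {
  tables = list()
  for (i in 1:min(nl, nrow(doc))) {
    line = doc$txt[i]
    if (wordcount(line) >= n) {                 # ngram() needs at least n words
      ng = ngram(line, n = n)
      tables[[length(tables) + 1]] = get.phrasetable(ng)[, c("ngrams", "freq")]
    }
  }
  res = rbindlist(tables)
  res[, .(V1 = sum(freq)), by = ngrams]         # total count per n-gram
}
# top50: return the 50 most frequent n-grams, with an index column x,
# the count in V1 and the n-gram text in w1 (names as used in the plots below).
top50 = function(freq_table) {
  top = freq_table[order(-V1)][1:50]
  data.frame(x = 1:50, V1 = top$V1, w1 = top$ngrams, stringsAsFactors = F)
}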
The ngram chart below shows the singlegrams (words) and the bigrams (red line) from the Twitter file.
# top 50 single words, plotted as text labels
pt_df = top50(singlegram_twitter)
pt = ggplot(pt_df, aes(x, V1, label = w1)) + geom_text()
# top 50 bigrams, overlaid as a red line
pt_df2 = top50(bigrams_twitter)
pt2 = pt + geom_line(data=pt_df2[1:50,], aes(x=x, y=V1), color='red') +
  ggtitle("ngrams frequency chart") + xlab("ngram") + ylab("Frequency")
pt2
When looking at these charts it becomes apparent that a small proportion of the words contributes to a large proportion of the word counts. Let's verify:
prc = runningt_perc(singlegram_twitter)
# percentage of the distinct words needed to reach 50% of the total frequency count
l = sum(prc$pr < 50.1) / length(prc$pr) * 100
l
## [1] 3.778281
We only need 3.7782805 percent of the words to reach 50% of the frequency count. Let's see how many we need to reach 90%:
# percentage of the distinct words needed to reach 90% of the total frequency count
l = sum(prc$pr < 90.1) / length(prc$pr) * 100
l
## [1] 71.38009
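The supporting function runningt_perc is not shown in this excerpt either. A minimal sketch, assuming it sorts the n-grams by descending frequency and stores the running percentage of the total frequency count in a column pr (an assumption based on the filter on prc$pr above):
# Hypothetical reconstruction, not the author's original code.
# Sort the n-grams by descending frequency and add the running (cumulative)
# percentage of the total frequency count in a column pr.
runningt_perc = function(freq_table) {
  prc = freq_table[order(-V1)]                # freq_table as returned by loop_doc
  prc[, pr := cumsum(V1) / sum(V1) * 100]     # running percentage of the total count
  prc
}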
In the chart below, the x axis shows the percentage of words and the y axis shows the percentage of the word count.
df2 = f_bins(prc)
curve = ggplot(df2, aes(x=pw, y=V1)) + geom_line()
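The supporting function f_bins is also not shown. A minimal sketch, assuming it reduces the running percentages to roughly 100 points of "percentage of words" (pw) against "running percentage of the word count" (V1); the binning and the column names are assumptions based on how df2 is used:
# Hypothetical reconstruction, not the author's original code.
# Pair the rank of each word, expressed as a percentage of the vocabulary (pw),
# with the running percentage of the word count (pr), reduced to ~100 bins.
f_bins = function(prc, nbins = 100) {
  pw  = seq_len(nrow(prc)) / nrow(prc) * 100   # word rank as % of all words
  bin = ceiling(pw / (100 / nbins))            # assign each word to a bin
  data.frame(pw = tapply(pw, bin, max),
             V1 = tapply(prc$pr, bin, max))
}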
Add the max lift point:
m = max(df2$V1 - df2$pw, na.rm=T)
p = match(m, df2$V1 - df2$pw)
curve = curve + geom_point(data=df2[p,], aes(x=pw, y=V1), color='green', size=8)
The lift is greatest until approximately 10% of the top words.
Let's show the same curve for the bigrams.
prc2 = runningt_perc(bigrams_twitter)
df3 = f_bins(prc2)
curve = curve + geom_line(data=df3, aes(x=pw, y=V1), color="red")
m = max(df3$V1 - df3$pw, na.rm=T)
p = match(m, df3$V1 - df3$pw)
curve = curve + geom_point(data=df3[p,], aes(x=pw, y=V1), color='red', size=7)
curve = curve +
ggtitle("ngrams ROC frequency chart, with single and bigram") + xlab("ngram perc.") + ylab("Frequency perc.")
curve
For the bigrams, the lift is greatest until approximately 8% of the top words.
From the charts in this document we have learned that a few words are used very often, so there are probably many synonyms. Relative to the number of words, there are only a few frequent bigrams. The lift in the ROC-style curve for the bigrams is also much smaller than for the single words.
From this we can conclude that it will be difficult to achieve a high prediction accuracy, and that building a prediction model with a very large set of words will add little value compared to a model with fewer words.
The design of the prediction model will use backoff: when there is no match based on three words, a match will be made based on two words. For example, in the three-word model the bigram based on the last two words will also be included and given a weight; the predicted word is then based on both the trigram and the bigram.
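To make this concrete, below is a minimal sketch of such a weighted backoff lookup. The table layout (a trigram table with columns w1, w2, w3 and freq, a bigram table with w1, w2 and freq) and the weights are illustrative assumptions, not the final model.
# Hypothetical sketch of the backoff idea, not the final model.
# tri: data.table with columns w1, w2, w3 (the predicted word) and freq
# bi:  data.table with columns w1, w2 (the predicted word) and freq
predict_next = function(last_two, tri, bi, w_tri = 0.7, w_bi = 0.3) {
  # candidates from the trigram table, matched on the last two words
  c3 = tri[w1 == last_two[1] & w2 == last_two[2],
           .(word = w3, score = w_tri * freq / sum(freq))]
  # backoff: candidates from the bigram table, matched on the last word only
  c2 = bi[w1 == last_two[2],
          .(word = w2, score = w_bi * freq / sum(freq))]
  cand = rbind(c3, c2)
  if (nrow(cand) == 0) return(NA_character_)
  # combine the weighted scores and return the best candidate
  cand = cand[, .(score = sum(score)), by = word]
  cand[order(-score)][1, word]
}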