This is the first milestone report of the Data Science Specialization capstone project on Coursera.
The goals of this report are to:
1. Demonstrate that I’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings I have amassed so far.
4. Get feedback on my plans for creating a prediction algorithm and Shiny app.
This document is kept concise: it explains only the major features of the data and briefly summarizes my plans for creating the prediction algorithm and Shiny app. I use tables and plots to illustrate important summaries of the data sets.
In the following sections I describe how I load and clean the data and how I split each data set into a training and a test set for further model building. I also compute some summary statistics and show the variation between the data sources.
I will show the following statistics for each document:
- Line count of the whole datasets
- Line count of the training datasets
- Word count of the training datasets
- Word count of the training datasets without stopwords
- Frequency of the top unigrams
- Frequency of the top bigrams
- Frequency of the top trigrams
- Count of unique words needed to cover 50% and 90% of the total word occurrences
Besides the ggplot2 and stringr packages, I make use of the quanteda package, which includes many useful functions for NLP.
library(quanteda)
library(stringr)
library(ggplot2)
First of all, I read in the unzipped text files line by line.
# create the file connection
blogs_con <- file("D:/Coursera/Course/C10_Capstone/Project/Data/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "r")
news_con <- file("D:/Coursera/Course/C10_Capstone/Project/Data/Coursera-SwiftKey/final/en_US/en_US.news.txt", "r")
twitter_con <- file("D:/Coursera/Course/C10_Capstone/Project/Data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "r")
# read in the lines
blogs_lines<-readLines(blogs_con, encoding = "UTF-8")
news_lines<-readLines(news_con, encoding = "UTF-8")
twitter_lines<-readLines(twitter_con, encoding = "UTF-8")
# close the file connection
close(blogs_con)
close(news_con)
close(twitter_con)
Next, I split each document into a training and a testing data set. For each line, a weighted coin flip decides which set it goes into.
# divide into training and test data sets
set.seed(12344)
# create a random binomial vector to select the lines for the training dataset
sample_blogs<-rbinom(n=length(blogs_lines), size= 1, prob=0.7)
sample_news<-rbinom(n=length(news_lines), size= 1, prob=0.7)
sample_twitter<-rbinom(n=length(twitter_lines), size= 1, prob=0.7)
# select training and test lines based on the coin flips
blogs_training   <- blogs_lines[sample_blogs == 1]
blogs_testing    <- blogs_lines[sample_blogs == 0]
news_training    <- news_lines[sample_news == 1]
news_testing     <- news_lines[sample_news == 0]
twitter_training <- twitter_lines[sample_twitter == 1]
twitter_testing  <- twitter_lines[sample_twitter == 0]
# collapse the selected training lines into one document per data set
blogs_training_text<-paste(blogs_training, collapse = " ")
news_training_text<-paste(news_training, collapse = " ")
twitter_training_text<-paste(twitter_training, collapse = " ")
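As a quick, optional sanity check, the realized split can be compared against the intended 70% training share (a small sketch using only base R on the objects created above):
# share of lines that ended up in each training set (target is roughly 0.7)
length(blogs_training) / length(blogs_lines)
length(news_training) / length(news_lines)
length(twitter_training) / length(twitter_lines)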
I use the quanteda package to create a corpus for each document. These corpora are the basis for further analysis.
# creating a corpus for each data source type (blogs, news, twitter)
blogs_corpus<-corpus(blogs_training_text)
news_corpus<-corpus(news_training_text)
twitter_corpus<-corpus(twitter_training_text)
Next, I tokenize each corpus by word and also create a version of the tokens without stopwords. During tokenization I remove numbers, punctuation, symbols, hyphens and separators. Since I want to predict the next word, these features should be irrelevant.
# create tokens and remove punctuation, numbers and symbols
blogs_tokens<-tokens(blogs_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE, remove_hyphens = TRUE)
news_tokens<-tokens(news_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE, remove_hyphens = TRUE)
twitter_tokens<-tokens(twitter_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE, remove_hyphens = TRUE)
# remove English stop words
blogs_nostop_tokens <- tokens_select(blogs_tokens, pattern = stopwords('en'), selection = 'remove')
news_nostop_tokens <- tokens_select(news_tokens, pattern = stopwords('en'), selection = 'remove')
twitter_nostop_tokens <- tokens_select(twitter_tokens, pattern = stopwords('en'), selection = 'remove')
After loading the data and doing this basic cleaning, I compute some basic statistics about each document to get an idea of the data as a whole and of possible variation between the sources.
I will calculate the following values:
- Line count of the whole datasets
- Line count of the training datasets
- Word count of the training datasets
- Word count of the training datasets without stopwords
# count lines of the whole documents
blogs_row_count<-length(blogs_lines)
news_row_count<-length(news_lines)
twitter_row_count<-length(twitter_lines)
# count lines of the training data sets
blogs_training_row_count<-length(blogs_training)
news_training_row_count<-length(news_training)
twitter_training_row_count<-length(twitter_training)
# count words
blogs_word_count<-ntoken(blogs_tokens)[[1]]
news_word_count<-ntoken(news_tokens)[[1]]
twitter_word_count<-ntoken(twitter_tokens)[[1]]
# count words without stopwords
blogs_nostop_word_count<-ntoken(blogs_nostop_tokens)[[1]]
news_nostop_word_count<-ntoken(news_nostop_tokens)[[1]]
twitter_nostop_word_count<-ntoken(twitter_nostop_tokens)[[1]]
# create a data frame that includes the dimension of each source
data_dimensions <- data.frame(c("blogs", "news","twitter"),
c(blogs_row_count, news_row_count, twitter_row_count),
c(blogs_training_row_count, news_training_row_count, twitter_training_row_count),
c(blogs_word_count, news_word_count, twitter_word_count),
c(blogs_nostop_word_count, news_nostop_word_count, twitter_nostop_word_count)
)
colnames(data_dimensions) <- c("Source document", "Total lines", "#Lines of trainingset", "#Words of trainingset", "#Words without stopwords")
data_dimensions
## Source document Total lines #Lines of trainingset #Words of trainingset
## 1 blogs 899288 630220 26008963
## 2 news 77259 53990 1826610
## 3 twitter 2360148 1652091 20749284
## #Words without stopwords
## 1 13319808
## 2 1056401
## 3 11835456
The table above shows that the Twitter document has the most lines but fewer words than the blogs document. The news document has the fewest lines and words, but also a lower share of stopwords. This suggests:
- twitter: many lines with few words per line
- blogs: fewer lines with more words per line
- news: more formal language (lower share of stopwords)
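As a rough check of these observations, the average number of words per training line can be derived directly from the data_dimensions table above; the added column name is illustrative only:
# approximate average words per training line, derived from the table above
data_dimensions$`Avg words per line` <-
  round(data_dimensions$`#Words of trainingset` / data_dimensions$`#Lines of trainingset`, 1)
data_dimensions[, c("Source document", "Avg words per line")]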
Next, I create a document-feature matrix (dfm) from the tokens of each document and plot the top 30 unigrams of each.
# build the document-feature matrices for each document
blogs_dfm<-dfm(blogs_tokens)
news_dfm<-dfm(news_tokens)
twitter_dfm<-dfm(twitter_tokens)
# create plots of the top unigrams (with stop words)
g <- ggplot(textstat_frequency(blogs_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top unigram of the blogs corpus", x = "Unigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(news_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top unigram of the news corpus", x = "Unigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(twitter_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top unigram of the twitter corpus", x = "Unigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
Next, I tokenize each corpus again, this time creating bigrams.
# create 2-gram frequencies (with stopwords)
blogs_bigram<-tokens(blogs_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=2, concatenator=" ")
news_bigram<-tokens(news_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=2, concatenator=" ")
twitter_bigram<-tokens(twitter_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=2, concatenator=" ")
I create a document-feature matrix of the bigrams of each document and plot the top 30 bigrams of each.
# build the document-feature matrices for the bigrams of each document
blogs_bigrams_dfm<-dfm(blogs_bigram)
news_bigrams_dfm<-dfm(news_bigram)
twitter_bigrams_dfm<-dfm(twitter_bigram)
# create plots of the top bigrams
g <- ggplot(textstat_frequency(blogs_bigrams_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the blogs bigrams", x = "Bigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(news_bigrams_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the news bigrams", x = "Bigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(twitter_bigrams_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the twitter bigrams", x = "Bigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
Next, I tokenize each corpus again, this time creating trigrams.
# create 3-gram frequencies (with stopwords)
blogs_trigram<-tokens(blogs_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=3, concatenator=" ")
news_trigram<-tokens(news_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=3, concatenator=" ")
twitter_trigram<-tokens(twitter_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=3, concatenator=" ")
I create a document-feature matrix of the trigrams of each document and plot the top 50 trigrams of each.
# build the document-feature matrices for the trigrams of each document
blogs_trigrams_dfm<-dfm(blogs_trigram)
news_trigrams_dfm<-dfm(news_trigram)
twitter_trigrams_dfm<-dfm(twitter_trigram)
# create plots of the top trigrams
g <- ggplot(textstat_frequency(blogs_trigrams_dfm, n=50), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the blogs trigrams", x = "Trigrams", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(news_trigrams_dfm, n=50), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the news trigrams", x = "Trigrams", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(twitter_trigrams_dfm, n=50), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the twitter trigrams", x = "Trigrams", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
Next, I check how many unique words are needed to cover 50% and 90% of each document.
# plot cumulative sums of word frequencies of the blogs data set
blogs_stats<-textstat_frequency(blogs_dfm)
blogs_stats$cumsum<-cumsum(textstat_frequency(blogs_dfm)$frequency)
blogs_v_50<-blogs_stats$rank[blogs_stats$cumsum>blogs_word_count*0.5][[1]]
blogs_v_90<-blogs_stats$rank[blogs_stats$cumsum>blogs_word_count*0.9][[1]]
blogs_h_50<-blogs_word_count*0.5
blogs_h_90<-blogs_word_count*0.9
plot(blogs_stats$cumsum)
abline(h=blogs_h_50, col="blue" )
abline(h=blogs_h_90, col="red" )
abline(v=blogs_v_50, col="blue" )
abline(v=blogs_v_90, col="red" )
As the plot shows, only 107 unique words are needed to represent 50% and only 6476 unique words are needed to represent 90% of the blogs document.
# plot cumulative sums of word frequencies of the news data set
news_stats<-textstat_frequency(news_dfm)
news_stats$cumsum<-cumsum(textstat_frequency(news_dfm)$frequency)
news_v_50<-news_stats$rank[news_stats$cumsum>news_word_count*0.5][[1]]
news_v_90<-news_stats$rank[news_stats$cumsum>news_word_count*0.9][[1]]
news_h_50<-news_word_count*0.5
news_h_90<-news_word_count*0.9
plot(news_stats$cumsum)
abline(h=news_h_50, col="blue" )
abline(h=news_h_90, col="red" )
abline(v=news_v_50, col="blue" )
abline(v=news_v_90, col="red" )
As the plot shows, only 195 unique words are needed to represent 50% and only 7781 unique words are needed to represent 90% of the news document.
# plot cumulative sums of word frequencies of the twitter data set
twitter_stats<-textstat_frequency(twitter_dfm)
twitter_stats$cumsum<-cumsum(textstat_frequency(twitter_dfm)$frequency)
twitter_v_50<-twitter_stats$rank[twitter_stats$cumsum>twitter_word_count*0.5][[1]]
twitter_v_90<-twitter_stats$rank[twitter_stats$cumsum>twitter_word_count*0.9][[1]]
twitter_h_50<-twitter_word_count*0.5
twitter_h_90<-twitter_word_count*0.9
plot(twitter_stats$cumsum)
abline(h=twitter_h_50, col="blue" )
abline(h=twitter_h_90, col="red" )
abline(v=twitter_v_50, col="blue" )
abline(v=twitter_v_90, col="red" )
As the plot shows, only 125 unique words are needed to represent 50% and only 5527 unique words are needed to represent 90% of the twitter document.
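The same coverage computation is repeated for each source above; as a hedged sketch, a small helper function (the name coverage_rank is my own) could wrap this logic for reuse:
# number of top-ranked unique words needed to cover a given share of all
# word occurrences in a dfm (same logic as the per-source blocks above)
coverage_rank <- function(dfm_obj, share) {
  freqs <- textstat_frequency(dfm_obj)
  cum_share <- cumsum(freqs$frequency) / sum(freqs$frequency)
  which(cum_share >= share)[1]
}
coverage_rank(news_dfm, 0.5)  # should match the value read off the news plot
coverage_rank(news_dfm, 0.9)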
For building the next-word prediction model I will use the Markov assumption: in the naive case, the next state (word) depends only on the previous state (word), and no further history is taken into account. I will use a bigram frequency table to look up the most frequent following word. I will also test second- and third-order Markov chains (two and three words of history) with the help of tri- and quadgrams to improve the prediction. I expect that the growing memory needed to store these n-gram frequency tables will limit the achievable model accuracy.
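As a minimal, hedged sketch of this idea (not the final implementation), the bigram frequencies already computed above can be turned into a naive next-word lookup; the helper name predict_next_word and the use of the blogs bigrams alone are my own illustrative assumptions:
# first-order Markov lookup based on the bigram dfm built above,
# where each feature has the form "word1 word2"
bigram_freq <- textstat_frequency(blogs_bigrams_dfm)
predict_next_word <- function(word, freq_table, n = 3) {
  # keep bigrams whose first token matches the given word
  matches <- freq_table[startsWith(freq_table$feature, paste0(tolower(word), " ")), ]
  if (nrow(matches) == 0) return(character(0))
  # the table is sorted by frequency, so the top rows are the most
  # frequent continuations; return their second token
  vapply(strsplit(head(matches$feature, n), " ", fixed = TRUE), `[`, character(1), 2)
}
predict_next_word("in", bigram_freq)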