This is the first milestone report of the Data Science Specialization capstone project on Coursera.
The goals of this report are to:
1. Demonstrate that I’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings I have amassed so far.
4. Get feedback on my plans for creating a prediction algorithm and Shiny app.
This document is kept concise: it explains only the major features of the data and briefly summarizes my plans for creating the prediction algorithm and Shiny app. I use tables and plots to illustrate important summaries of the data sets.
In the following sections I describe how I load and clean the data and how I split each data set into a training and a test set for further model building. I also compute some summary statistics and show the variation between the data sources.
I will show the following statistics for each document:
- Line count of the whole datasets
- Line count of the training datasets
- Word count of the training datasets
- Word count of the training datasets without stopwords
- Frequency of the top unigrams
- Frequency of the top bigrams
- Frequency of the top trigrams
- Count of unique words needed to cover 50% and 90% of the total word occurrences
Besides the ggplot2 and stringr packages, I make use of the quanteda package, which includes many useful functions for NLP.
library(quanteda)
library(stringr)
library(ggplot2)
First of all, I read in the unzipped text files line by line.
# create the file connection
blogs_con <- file("D:/Coursera/Course/C10_Capstone/Project/Data/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "r")
news_con <- file("D:/Coursera/Course/C10_Capstone/Project/Data/Coursera-SwiftKey/final/en_US/en_US.news.txt", "r")
twitter_con <- file("D:/Coursera/Course/C10_Capstone/Project/Data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "r")
# read in the lines
blogs_lines<-readLines(blogs_con, encoding = "UTF-8")
news_lines<-readLines(news_con, encoding = "UTF-8")
twitter_lines<-readLines(twitter_con, encoding = "UTF-8")
# close the file connection
close(blogs_con)
close(news_con)
close(twitter_con)
Next, I split each document into a training and a testing data set. For each line, a weighted coin flip decides which set it goes into.
# divide into training and test data sets
set.seed(12344)
# create a random binomial vector to select the lines for the training dataset
sample_blogs<-rbinom(n=length(blogs_lines), size= 1, prob=0.7)
sample_news<-rbinom(n=length(news_lines), size= 1, prob=0.7)
sample_twitter<-rbinom(n=length(twitter_lines), size= 1, prob=0.7)
# select training and test lines based on the coin flips
blogs_training   <- blogs_lines[sample_blogs == 1]
blogs_testing    <- blogs_lines[sample_blogs == 0]
news_training    <- news_lines[sample_news == 1]
news_testing     <- news_lines[sample_news == 0]
twitter_training <- twitter_lines[sample_twitter == 1]
twitter_testing  <- twitter_lines[sample_twitter == 0]
# collapse the selected training lines into one document per data set
blogs_training_text<-paste(blogs_training, collapse = " ")
news_training_text<-paste(news_training, collapse = " ")
twitter_training_text<-paste(twitter_training, collapse = " ")
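As a quick, optional sanity check, the realized split can be compared against the intended 70% training share (a small sketch using only base R on the objects created above):
# share of lines that ended up in each training set (target is roughly 0.7)
length(blogs_training) / length(blogs_lines)
length(news_training) / length(news_lines)
length(twitter_training) / length(twitter_lines)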
I use the quanteda package to create a corpus for each document. These corpora are the basis for further analysis.
# creating a corpus for each data source type (blogs, news, twitter)
blogs_corpus<-corpus(blogs_training_text)
news_corpus<-corpus(news_training_text)
twitter_corpus<-corpus(twitter_training_text)
Next, I tokenize each corpus by word and also create a version of the tokens without stopwords. During tokenization I remove numbers, punctuation, symbols, hyphens and separators. Since I want to predict the next word, these features should be irrelevant.
# create tokens and remove punctuation, numbers and symbols
blogs_tokens<-tokens(blogs_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE, remove_hyphens = TRUE)
news_tokens<-tokens(news_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE, remove_hyphens = TRUE)
twitter_tokens<-tokens(twitter_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE, remove_hyphens = TRUE)
# remove English stop words
blogs_nostop_tokens <- tokens_select(blogs_tokens, pattern = stopwords('en'), selection = 'remove')
news_nostop_tokens <- tokens_select(news_tokens, pattern = stopwords('en'), selection = 'remove')
twitter_nostop_tokens <- tokens_select(twitter_tokens, pattern = stopwords('en'), selection = 'remove')
After loading the data and doing this basic cleaning, I compute some basic statistics about each document to get an idea of the data as a whole and of possible variation between the sources.
I will calculate the following values:
- Line count of the whole datasets
- Line count of the training datasets
- Word count of the training datasets
- Word count of the training datasets without stopwords
# count lines of the whole documents
blogs_row_count<-length(blogs_lines)
news_row_count<-length(news_lines)
twitter_row_count<-length(twitter_lines)
# count lines of the training data sets
blogs_training_row_count<-length(blogs_training)
news_training_row_count<-length(news_training)
twitter_training_row_count<-length(twitter_training)
# count words
blogs_word_count<-ntoken(blogs_tokens)[[1]]
news_word_count<-ntoken(news_tokens)[[1]]
twitter_word_count<-ntoken(twitter_tokens)[[1]]
# count words without stopwords
blogs_nostop_word_count<-ntoken(blogs_nostop_tokens)[[1]]
news_nostop_word_count<-ntoken(news_nostop_tokens)[[1]]
twitter_nostop_word_count<-ntoken(twitter_nostop_tokens)[[1]]
# create a data frame that includes the dimension of each source
data_dimensions <- data.frame(c("blogs", "news","twitter"),
c(blogs_row_count, news_row_count, twitter_row_count),
c(blogs_training_row_count, news_training_row_count, twitter_training_row_count),
c(blogs_word_count, news_word_count, twitter_word_count),
c(blogs_nostop_word_count, news_nostop_word_count, twitter_nostop_word_count)
)
colnames(data_dimensions) <- c("Source document", "Total lines", "#Lines of trainingset", "#Words of trainingset", "#Words without stopwords")
data_dimensions
## Source document Total lines #Lines of trainingset #Words of trainingset
## 1 blogs 899288 630220 26008963
## 2 news 77259 53990 1826610
## 3 twitter 2360148 1652091 20749284
## #Words without stopwords
## 1 13319808
## 2 1056401
## 3 11835456
The table above shows that the Twitter document has the most lines but fewer words than the blogs document. The news document has the fewest lines and words, but also a lower share of stopwords. This suggests:
- twitter: many lines with few words per line
- blogs: fewer lines with more words per line
- news: more formal language (lower share of stopwords)
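As a rough check of these observations, the average number of words per training line can be derived directly from the data_dimensions table above; the added column name is illustrative only:
# approximate average words per training line, derived from the table above
data_dimensions$`Avg words per line` <-
  round(data_dimensions$`#Words of trainingset` / data_dimensions$`#Lines of trainingset`, 1)
data_dimensions[, c("Source document", "Avg words per line")]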
Next, I create a document-feature matrix (dfm) from the tokens of each document and plot the top 30 unigrams of each.
# build the document-feature matrices for each document
blogs_dfm<-dfm(blogs_tokens)
news_dfm<-dfm(news_tokens)
twitter_dfm<-dfm(twitter_tokens)
# create plots of the top unigrams (with stop words)
g <- ggplot(textstat_frequency(blogs_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top unigram of the blogs corpus", x = "Unigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(news_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top unigram of the news corpus", x = "Unigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(twitter_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top unigram of the twitter corpus", x = "Unigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
Next, I tokenize each corpus again, this time creating bigrams.
# create 2-gram frequencies (with stopwords)
blogs_bigram<-tokens(blogs_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=2, concatenator=" ")
news_bigram<-tokens(news_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=2, concatenator=" ")
twitter_bigram<-tokens(twitter_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=2, concatenator=" ")
I create a document-feature matrix of the bigrams of each document and plot the top 30 bigrams of each.
# build the document-feature matrices for the bigrams of each document
blogs_bigrams_dfm<-dfm(blogs_bigram)
news_bigrams_dfm<-dfm(news_bigram)
twitter_bigrams_dfm<-dfm(twitter_bigram)
# create plots of the top bigrams
g <- ggplot(textstat_frequency(blogs_bigrams_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the blogs bigrams", x = "Bigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(news_bigrams_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the news bigrams", x = "Bigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(twitter_bigrams_dfm, n=30), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the twitter bigrams", x = "Bigram", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
Next, I tokenize each corpus again, this time creating trigrams.
# create 3-gram frequencies (with stopwords)
blogs_trigram<-tokens(blogs_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=3, concatenator=" ")
news_trigram<-tokens(news_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=3, concatenator=" ")
twitter_trigram<-tokens(twitter_corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_hyphens = TRUE, ngrams=3, concatenator=" ")
I create a document-feature matrix of the trigrams of each document and plot the top 50 trigrams of each.
# build the document-feature matrices for the trigrams of each document
blogs_trigrams_dfm<-dfm(blogs_trigram)
news_trigrams_dfm<-dfm(news_trigram)
twitter_trigrams_dfm<-dfm(twitter_trigram)
# create plots of the top trigrams
g <- ggplot(textstat_frequency(blogs_trigrams_dfm, n=50), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the blogs trigrams", x = "Trigrams", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(news_trigrams_dfm, n=50), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the news trigrams", x = "Trigrams", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
g <- ggplot(textstat_frequency(twitter_trigrams_dfm, n=50), aes(x = reorder(feature, frequency), y = frequency))
g <- g + labs(title = "Top counts of the twitter trigrams", x = "Trigrams", y = "Frequency")
g + geom_bar(stat="identity", colour="black") + coord_flip()
Next, I check how many unique words are needed to cover 50% and 90% of each document.
# plot cumulative sums of word frequencies of the blogs data set
blogs_stats<-textstat_frequency(blogs_dfm)
blogs_stats$cumsum<-cumsum(textstat_frequency(blogs_dfm)$frequency)
blogs_v_50<-blogs_stats$rank[blogs_stats$cumsum>blogs_word_count*0.5][[1]]
blogs_v_90<-blogs_stats$rank[blogs_stats$cumsum>blogs_word_count*0.9][[1]]
blogs_h_50<-blogs_word_count*0.5
blogs_h_90<-blogs_word_count*0.9
plot(blogs_stats$cumsum)
abline(h=blogs_h_50, col="blue" )
abline(h=blogs_h_90, col="red" )
abline(v=blogs_v_50, col="blue" )
abline(v=blogs_v_90, col="red" )
As the plot shows, only 107 unique words are needed to represent 50% and only 6476 unique words are needed to represent 90% of the blogs document.
# plot cumulative sums of word frequencies of the news data set
news_stats<-textstat_frequency(news_dfm)
news_stats$cumsum<-cumsum(textstat_frequency(news_dfm)$frequency)
news_v_50<-news_stats$rank[news_stats$cumsum>news_word_count*0.5][[1]]
news_v_90<-news_stats$rank[news_stats$cumsum>news_word_count*0.9][[1]]
news_h_50<-news_word_count*0.5
news_h_90<-news_word_count*0.9
plot(news_stats$cumsum)
abline(h=news_h_50, col="blue" )
abline(h=news_h_90, col="red" )
abline(v=news_v_50, col="blue" )
abline(v=news_v_90, col="red" )
As the plot shows, only 195 unique words are needed to represent 50% and only 7781 unique words are needed to represent 90% of the news document.
# plot cumulative sums of word frequencies of the twitter data set
twitter_stats<-textstat_frequency(twitter_dfm)
twitter_stats$cumsum<-cumsum(textstat_frequency(twitter_dfm)$frequency)
twitter_v_50<-twitter_stats$rank[twitter_stats$cumsum>twitter_word_count*0.5][[1]]
twitter_v_90<-twitter_stats$rank[twitter_stats$cumsum>twitter_word_count*0.9][[1]]
twitter_h_50<-twitter_word_count*0.5
twitter_h_90<-twitter_word_count*0.9
plot(twitter_stats$cumsum)
abline(h=twitter_h_50, col="blue" )
abline(h=twitter_h_90, col="red" )
abline(v=twitter_v_50, col="blue" )
abline(v=twitter_v_90, col="red" )
As the plot shows, only 125 unique words are needed to represent 50% and only 5527 unique words are needed to represent 90% of the twitter document.
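The same coverage computation is repeated for each source above; as a hedged sketch, a small helper function (the name coverage_rank is my own) could wrap this logic for reuse:
# number of top-ranked unique words needed to cover a given share of all
# word occurrences in a dfm (same logic as the per-source blocks above)
coverage_rank <- function(dfm_obj, share) {
  freqs <- textstat_frequency(dfm_obj)
  cum_share <- cumsum(freqs$frequency) / sum(freqs$frequency)
  which(cum_share >= share)[1]
}
coverage_rank(news_dfm, 0.5)  # should match the value read off the news plot
coverage_rank(news_dfm, 0.9)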
For building the next-word prediction model I will use the Markov assumption: in the naive case, the next state (word) depends only on the previous state (word), and no further history is taken into account. I will use a bigram frequency table to look up the most frequent following word. I will also test second- and third-order Markov chains (two and three words of history) with the help of tri- and quadgrams to improve the prediction. I expect that the growing memory needed to store these n-gram frequency tables will limit the achievable model accuracy.
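As a minimal, hedged sketch of this idea (not the final implementation), the bigram frequencies already computed above can be turned into a naive next-word lookup; the helper name predict_next_word and the use of the blogs bigrams alone are my own illustrative assumptions:
# first-order Markov lookup based on the bigram dfm built above,
# where each feature has the form "word1 word2"
bigram_freq <- textstat_frequency(blogs_bigrams_dfm)
predict_next_word <- function(word, freq_table, n = 3) {
  # keep bigrams whose first token matches the given word
  matches <- freq_table[startsWith(freq_table$feature, paste0(tolower(word), " ")), ]
  if (nrow(matches) == 0) return(character(0))
  # the table is sorted by frequency, so the top rows are the most
  # frequent continuations; return their second token
  vapply(strsplit(head(matches$feature, n), " ", fixed = TRUE), `[`, character(1), 2)
}
predict_next_word("in", bigram_freq)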