Data Science Specialization Capstone Project

Analysis

Required packages:

library(caTools)
library(quanteda)
library(ggplot2)

Load the text files into R. The files are expected in a “data” folder inside the working directory.

blog_file <- readLines("data/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news_file <- readLines("data/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter_file <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Exploratory data analysis:

Total number of words in each file (each line is split on single spaces and the resulting pieces are counted):

sum(sapply(strsplit(blog_file, " "), length))

## [1] 37334131

sum(sapply(strsplit(news_file, " "), length))

## [1] 2643969

sum(sapply(strsplit(twitter_file, " "), length))

## [1] 30373583

The word counts show that the blog file contains about 37.3 million words, the news file about 2.6 million, and the Twitter file about 30.4 million.
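
Splitting on a single space is only a rough word count: consecutive spaces produce empty strings that are counted as words, and tabs are not treated as separators. A minimal sketch of a somewhat stricter count, assuming a word is any maximal run of non-whitespace characters; `count_words` and the toy lines are hypothetical and not part of the original analysis:

```r
# Hypothetical helper: count whitespace-delimited words across all lines.
count_words <- function(lines) {
  # Trim leading/trailing whitespace, then split on runs of whitespace
  pieces <- strsplit(trimws(lines), "\\s+")
  # Count only non-empty pieces (an empty line contributes zero words)
  sum(vapply(pieces, function(p) sum(nzchar(p)), integer(1)))
}

# Toy lines with a double space, a leading space, and an empty line
toy <- c("two  spaces here", " leading space", "")

# Count as done in the report: split on single spaces
naive_count <- sum(sapply(strsplit(toy, " "), length))

# Stricter count: split on runs of whitespace
strict_count <- count_words(toy)
```

On real text the two counts usually differ only slightly, so the totals reported here remain a reasonable order-of-magnitude summary.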

Number of lines in each file:

length(blog_file)

## [1] 899288

length(news_file)

## [1] 77259

length(twitter_file)

## [1] 2360148

The blog file has 899,288 lines, the news file 77,259 lines, and the Twitter file 2,360,148 lines (about 2.4 million).

Summary of the distribution of line lengths, in characters, for the three files:

summary(nchar(blog_file))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40833

summary(nchar(news_file))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   111.0   186.0   202.4   270.0  5760.0

summary(nchar(twitter_file))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00

As the summary statistics show, line lengths range from 1 to 40,833 characters in the blog file, from 2 to 5,760 characters in the news file, and from 2 to 140 characters in the Twitter file.

Before cleaning the data, we subsample each file to reduce computation time. From each file we draw 1% of its lines at random, so the three sources keep their relative proportions in the combined sample. We then split the combined sample, using 70% of the lines for training and 30% for testing.

set.seed(1234)

# Sample without replacement so that no line is drawn more than once
sub_blog_file <- sample(blog_file, size = round(length(blog_file) * .01, 0), replace = FALSE)
sub_news_file <- sample(news_file, size = round(length(news_file) * .01, 0), replace = FALSE)
sub_twitter_file <- sample(twitter_file, size = round(length(twitter_file) * .01, 0), replace = FALSE)

raw_data <- c(sub_blog_file, sub_news_file, sub_twitter_file)

inTrain <- sample.split(raw_data, 0.7)
training <- subset(raw_data, inTrain == TRUE)
testing <- subset(raw_data, inTrain == FALSE)
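
The subsample-then-split idea can also be sketched in base R on toy data. This is an illustration under stated assumptions, not the code used for the report: `toy_lines`, `sub_lines`, and `train_idx` are hypothetical names, the 10% rate is chosen only to keep the toy example readable, and lines are drawn without replacement so that no line can end up in both the training and the test set.

```r
set.seed(1234)

# Toy stand-in for a corpus: 1,000 distinct "lines"
toy_lines <- paste("line", seq_len(1000))

# Draw 10% of the lines without replacement (the default of sample())
sub_lines <- sample(toy_lines, size = round(length(toy_lines) * 0.10))

# Randomly assign 70% of the subsample to training, the rest to testing
n_train <- round(length(sub_lines) * 0.7)
train_idx <- sample(seq_along(sub_lines), size = n_train)
toy_training <- sub_lines[train_idx]
toy_testing <- sub_lines[-train_idx]
```

Because the indices in `train_idx` are unique, `toy_training` and `toy_testing` are disjoint by construction.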

With the raw training lines in hand, we run them through a standard text pre-processing pipeline.

Tokenize the text, remove numbers, punctuation, and symbols, and split hyphenated words:

# Note: in quanteda 2.0 and later, remove_hyphens was renamed to split_hyphens
train.tokens <- tokens(training, what = "word",
                       remove_numbers = TRUE,
                       remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_hyphens = TRUE)

Lower case the tokens:

train.tokens <- tokens_tolower(train.tokens)

Remove stopwords. We will use quanteda’s built-in stopwords list for English:

train.tokens <- tokens_select(train.tokens, stopwords(), selection = "remove")

Perform stemming, that is, reduce inflected forms of a word to a common stem so that they are counted as a single term. We use the English stemmer:

train.tokens <- tokens_wordstem(train.tokens, language = "english")
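
To see what the pipeline does to a single line, the same steps can be applied to a short made-up sentence. This is a minimal sketch, assuming quanteda is installed; `toy_text`, `toy_tokens`, and `toy_stems` are hypothetical names, and the hyphen argument is left out because its name depends on the quanteda version.

```r
library(quanteda)

# Made-up example sentence, not taken from the corpus
toy_text <- "The 3 cats are running quickly!"

# Tokenize and drop numbers, punctuation, and symbols
toy_tokens <- tokens(toy_text, what = "word",
                     remove_numbers = TRUE,
                     remove_punct = TRUE,
                     remove_symbols = TRUE)

# Lower-case, remove English stopwords, then stem
toy_tokens <- tokens_tolower(toy_tokens)
toy_tokens <- tokens_select(toy_tokens, stopwords("en"), selection = "remove")
toy_tokens <- tokens_wordstem(toy_tokens, language = "english")

# Inspect the surviving stems as a plain character vector
toy_stems <- as.list(toy_tokens)[[1]]
print(toy_stems)
```

Only content-word stems survive; the number, the punctuation mark, and the stopwords are gone.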

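Next, we build n-grams (sequences of n consecutive tokens) to increase the predictive power of the model. For each n from 1 to 5 we store the n-grams in a document-feature matrix and then in a data frame of n-grams with their frequencies, sorted in decreasing order, for easier processing and visualization. The frequency used here is the document frequency returned by docfreq(), i.e. the number of lines in which the n-gram occurs. Finally, we save each data frame to disk with saveRDS() and remove it from the workspace to free memory. Despite the .RData extension, these files are in R's serialized RDS format and must be read back with readRDS().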
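
What an n-gram table with document frequencies looks like can be sketched in base R on two toy documents. This is an illustration under stated assumptions, not the quanteda implementation: `make_ngrams`, `docs`, and `toy_df` are hypothetical names, and the tokens are assumed to be already lower-cased and stemmed.

```r
# Hypothetical helper: build all n-grams of size n from one token vector.
make_ngrams <- function(tokens, n, sep = " ") {
  # A document shorter than n tokens has no n-grams
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = sep),
         character(1))
}

# Two toy "documents", already tokenized
docs <- list(c("new", "york", "citi", "new", "york"),
             c("new", "york", "time"))

# Document frequency: in how many documents does each bigram occur?
# unique() ensures a bigram is counted at most once per document.
bigrams_per_doc <- lapply(docs, function(d) unique(make_ngrams(d, 2)))
doc_freq <- sort(table(unlist(bigrams_per_doc)), decreasing = TRUE)

# Same two-column layout as the frequency tables in this report
toy_df <- data.frame(token = names(doc_freq),
                     frequency = as.integer(doc_freq),
                     stringsAsFactors = FALSE)
```

Here "new york" occurs twice in the first document but is counted once per document, which is the difference between a document frequency and a raw occurrence count.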

Unigrams:

# Unigrams are the tokens themselves, so the dfm can be built directly
unigram <- dfm(train.tokens)
# docfreq() returns a named vector: number of lines containing each term
unigram_df <- sort(docfreq(unigram), decreasing = TRUE)
unigram_df <- data.frame(token = names(unigram_df), frequency = unigram_df)

saveRDS(unigram_df,"unigram_df.RData")
rm(unigram, unigram_df)

Bigrams:

# Form the bigrams explicitly with tokens_ngrams(), then build the dfm
bigrams <- dfm(tokens_ngrams(train.tokens, n = 2, concatenator = " "))
# docfreq() returns a named vector: number of lines containing each bigram
bigrams_df <- sort(docfreq(bigrams), decreasing = TRUE)
bigrams_df <- data.frame(token = names(bigrams_df), frequency = bigrams_df)

saveRDS(bigrams_df,"bigrams_df.RData")
rm(bigrams, bigrams_df)

Trigrams:

# Form the trigrams explicitly with tokens_ngrams(), then build the dfm
trigrams <- dfm(tokens_ngrams(train.tokens, n = 3, concatenator = " "))
# docfreq() returns a named vector: number of lines containing each trigram
trigrams_df <- sort(docfreq(trigrams), decreasing = TRUE)
trigrams_df <- data.frame(token = names(trigrams_df), frequency = trigrams_df)

saveRDS(trigrams_df,"trigrams_df.RData")
rm(trigrams, trigrams_df)

Fourgrams:

# Form the 4-grams explicitly with tokens_ngrams(), then build the dfm
fourgrams <- dfm(tokens_ngrams(train.tokens, n = 4, concatenator = " "))
# docfreq() returns a named vector: number of lines containing each 4-gram
fourgrams_df <- sort(docfreq(fourgrams), decreasing = TRUE)
fourgrams_df <- data.frame(token = names(fourgrams_df), frequency = fourgrams_df)

saveRDS(fourgrams_df,"fourgrams_df.RData")
rm(fourgrams, fourgrams_df)

Fivegrams:

# Form the 5-grams explicitly with tokens_ngrams(), then build the dfm
fivegrams <- dfm(tokens_ngrams(train.tokens, n = 5, concatenator = " "))
# docfreq() returns a named vector: number of lines containing each 5-gram
fivegrams_df <- sort(docfreq(fivegrams), decreasing = TRUE)
fivegrams_df <- data.frame(token = names(fivegrams_df), frequency = fivegrams_df)

saveRDS(fivegrams_df,"fivegrams_df.RData")
rm(fivegrams, fivegrams_df)
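
Since the tables are written with saveRDS(), they have to be restored with readRDS() rather than load(), which expects files written by save(). A minimal round-trip sketch on a toy table; `toy_df`, `path`, and `restored` are hypothetical names, the frequencies are made up, and a temporary file is used so that no real output is overwritten:

```r
# Toy frequency table with the same two columns as the saved tables
toy_df <- data.frame(token = c("new york", "last year", "right now"),
                     frequency = c(12L, 9L, 7L),
                     stringsAsFactors = FALSE)

# saveRDS() serializes a single object; the file extension is arbitrary
path <- tempfile(fileext = ".RData")
saveRDS(toy_df, path)

# readRDS() returns the object, so the result must be assigned to a name
restored <- readRDS(path)

# Show the two most frequent entries
head(restored[order(-restored$frequency), ], 2)
```

Because readRDS() returns the object instead of restoring it under its original name, each table can be loaded under any name a later script prefers.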