Introduction

Machine learning has become an increasingly popular field of study, with new techniques being developed at an ever-increasing pace. With today’s world more focused on data than ever before, it is essential to take inspiration from multiple areas of the data science community and apply that knowledge to a myriad of questions.

This particular article focuses on an area of machine learning called Natural Language Processing, or NLP for short. As with any research, one needs to apply exploratory data analysis to better understand the data being used and, hopefully, to find interesting patterns that may become useful as the project goes on.

First, the packages and data will be loaded and an initial analysis performed. Thereafter, the data will be explored and presented graphically to highlight any interesting findings.

Data Processing

library(tidytext)
## Warning: package 'tidytext' was built under R version 3.6.2
library(ggplot2)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.6.2
## Loading required package: RColorBrewer
library(tidyverse)
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.1     v purrr   0.3.2
## v tidyr   1.0.0     v dplyr   0.8.3
## v readr   1.3.1     v stringr 1.4.0
## v tibble  2.1.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)

Loading the Data

There are three data files that we want to examine: the blogs, news and Twitter data from the English (en_US) directory.

blogs <- readLines("en_US.blogs.txt",encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt",encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt",encoding = "UTF-8", skipNul = TRUE)

Pre-processing

Now that the data is loaded, it needs to be transformed into the correct class so that it can be manipulated easily. The techniques used in this report are predominantly inspired by the very useful online textbook Text Mining with R.

First, let’s convert each dataset to a tibble:

blogs_df <- tibble(blogs) 
news_df <- tibble(news)
twitter_df <- tibble(twitter)

The data is converted to the tibble class because tibbles do not convert strings to factors and do not use row names.

Tokenization

A very important step is tokenization. This essentially restructures the data so that each row contains one word: if the first sentence in the original text file is eight words long, then the data frame’s first eight rows will contain those first eight words. For example:

tidy_blogs <- blogs_df %>% unnest_tokens(word, blogs)
tidy_news <- news_df %>% unnest_tokens(word, news)
tidy_twitter <- twitter_df %>% unnest_tokens(word, twitter)
head(tidy_blogs, 10)
## # A tibble: 10 x 1
##    word      
##    <chr>     
##  1 in        
##  2 the       
##  3 years     
##  4 thereafter
##  5 most      
##  6 of        
##  7 the       
##  8 oil       
##  9 fields    
## 10 and

Removing Stop Words

Another important step is to remove stop words. Stop words are extremely common words such as ‘the’ or ‘on’. They tell us very little about what the text is trying to say, or about the sentiment or context in which other words are used, so we remove them.

data("stop_words")

tidy_blogs <- tidy_blogs %>% anti_join(stop_words)
## Joining, by = "word"
tidy_news <- tidy_news %>% anti_join(stop_words)
## Joining, by = "word"
tidy_twitter <- tidy_twitter %>% anti_join(stop_words)
## Joining, by = "word"

If the data is further manipulated by ranking the most popular words, we can see that the top words no longer include stop words.

blogs_count <- tidy_blogs %>% count(word, sort = TRUE) 
news_count <- tidy_news %>% count(word, sort = TRUE) 
twitter_count <- tidy_twitter %>% count(word, sort = TRUE) 

head(blogs_count,10)
## # A tibble: 10 x 2
##    word       n
##    <chr>  <int>
##  1 time   90918
##  2 people 59574
##  3 day    52372
##  4 love   45230
##  5 life   41251
##  6 it’s   38657
##  7 1      30907
##  8 2      29561
##  9 world  29305
## 10 i’m    29189

Basic Summary

Before any exploratory analysis is performed, it is worth taking a quick look at a summary of the data to see what each file comprises. In particular, we are interested in the word count and line count of each file.

Each summary will be computed separately and the results then combined into a single summary matrix. This entails creating summaries of word count, line count and file size.

n_blogs <- nchar(blogs_count[,1])
n_news <- nchar(news_count[,1])
n_twitter <- nchar(twitter_count[,1])

Summary_1 <- rbind(n_blogs, n_news, n_twitter)
blogs_lines <- NROW(blogs)
news_lines <- NROW(news)
twitter_lines <- NROW(twitter)

# keep the same blogs/news/twitter order as Summary_1 so the rows line up
Summary_2 <- rbind(blogs_lines, news_lines, twitter_lines)
b <- file.size("en_US.blogs.txt")
n <- file.size("en_US.news.txt")
t <- file.size("en_US.twitter.txt")

FilesSize_blog<- (b/1024^2)
FilesSize_news <- (n/1024^2)
FilesSize_twitter <- (t/1024^2)

File_Sizes <- rbind(FilesSize_blog, FilesSize_news, FilesSize_twitter)

Now we can combine this all into one nice summary matrix:
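
The code that builds this matrix is not shown in the rendered report; a minimal sketch, assuming the Summary_1, Summary_2 and File_Sizes objects created above (the object name Summary_Matrix is arbitrary), could be:

# Bind the three summaries column-wise and label the rows and columns
Summary_Matrix <- cbind(Summary_1, Summary_2, File_Sizes)
rownames(Summary_Matrix) <- c("Blogs", "News", "Twitter")
colnames(Summary_Matrix) <- c("No. of Words", "No. of Lines", "Size (MB)")
Summary_Matrix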

##         No. of Words No. of Lines Size (MB)
## Blogs        3827771       899288  200.4242
## News          975346        77259  196.2775
## Twitter      4648075      2360148  159.3641

Exploratory Analysis

Now that the data are clean, it’s time to see which words are the most popular in the text files.
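
The frequency plots that presumably appear under each of the headings below are rendered as images in the original report; a sketch of how such a plot might be produced for the blogs data (the top-20 cut-off and the word cloud settings are arbitrary choices) is:

# Bar chart of the 20 most frequent words in the blogs data
blogs_count %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Most common words: blogs")

# A word cloud is another option, using the wordcloud package loaded earlier
wordcloud(words = blogs_count$word, freq = blogs_count$n, max.words = 100)

The news and Twitter data can be plotted in the same way using news_count and twitter_count.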

Blogs

News

Twitter

Side-Quest: What About Sentences?

We can also unnest datasets to extract sentences. This can be done using the commands below:

blog_sentences <- tibble(text = blogs) %>% 
  unnest_tokens(sentence, text, token = "sentences")

Let’s look at a sentence:
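
The command producing this output is not shown in the rendered report; it is presumably along the lines of:

# Print the first sentence extracted from the blogs data
blog_sentences$sentence[1]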

## [1] "in the years thereafter, most of the oil fields and platforms were named after pagan “gods”."

This may be a neat piece of analysis that can be used later on in the project.

Tokenizing by n-gram

Now that basic summaries and exploratory graphs have been viewed, we can turn to n-grams. First, we will explore some bi-grams to see how they are derived and which ones are most common.

An important thing to note is that removing stop words from bi-grams takes a little more work, because the stop-word dataset is tokenised with one word per row while each bi-gram row contains two words. An anti-join therefore won’t work unless the data is manipulated first.

Further, we will use the news training dataset to perform this analysis as it is the smallest dataset to work with.
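
The code behind the table below is not shown in the rendered report; a sketch following the standard tidytext workflow (the object name news_bigrams is arbitrary) might look like this. The tri-gram and quad-gram tables further below can be produced the same way, with n = 3 or n = 4 and an extra column name in separate().

# Tokenise the news data into bi-grams (pairs of consecutive words)
news_bigrams <- news_df %>%
  unnest_tokens(bigram, news, token = "ngrams", n = 2)

# The stop-word lexicon has one word per row, so split each bi-gram into
# two columns, drop rows where either word is a stop word, then count
news_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)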

## # A tibble: 417,820 x 3
##    word1  word2         n
##    <chr>  <chr>     <int>
##  1 st     louis       701
##  2 los    angeles     436
##  3 san    francisco   381
##  4 30     p.m         354
##  5 health care        317
##  6 1      2           227
##  7 san    diego       219
##  8 vice   president   219
##  9 white  house       179
## 10 7      p.m         167
## # ... with 417,810 more rows

From this we can see that “st louis” is the most popular bi-gram in the news dataset. The same technique can be applied to the two larger datasets, the blogs and Twitter data.

So what about tri-grams? Let’s have a look at the most common tri-grams in the news training set.

## # A tibble: 247,935 x 4
##    word1     word2     word3        n
##    <chr>     <chr>     <chr>    <int>
##  1 president barack    obama       95
##  2 7         30        p.m         77
##  3 st        louis     county      76
##  4 gov       chris     christie    66
##  5 world     war       ii          53
##  6 11        30        a.m         49
##  7 6         30        p.m         42
##  8 1         1         2           41
##  9 chief     financial officer     40
## 10 1         2         cup         39
## # ... with 247,925 more rows

We can see that the number of distinct n-gram combinations shrinks as n becomes greater. Let’s have a final look to see whether quad-grams are still relevant.

## # A tibble: 130,254 x 5
##    word1    word2     word3      word4        n
##    <chr>    <chr>     <chr>      <chr>    <int>
##  1 cuyahoga county    common     pleas       14
##  2 dow      jones     industrial average     13
##  3 st       louis     public     schools     13
##  4 treasury secretary timothy    geithner    12
##  5 martin   luther    king       jr          11
##  6 vice     president joe        biden       11
##  7 0        0         0          0           10
##  8 10       a.m       5          p.m         10
##  9 american civil     liberties  union       10
## 10 assembly speaker   sheila     oliver      10
## # ... with 130,244 more rows

As we can see, the number of quad-gram combinations is not as large as for the previous n < 4. This will be revisited at a later stage of the project, but the relationship between the order of an n-gram, the number of observed combinations and its usefulness for prediction is nonetheless an important finding to note.