Machine learning has become an increasingly popular field of study, with new techniques appearing at an ever-increasing rate. With today's world more focused on data than ever before, it is essential to draw inspiration from multiple areas of the data science community and apply that knowledge to a wide range of questions.
This article focuses on an area of machine learning called Natural Language Processing (NLP). As with any research, exploratory data analysis is needed to better understand the data being used and, hopefully, to find interesting patterns that may become useful as the project goes on.
First, the packages and data will be loaded and an initial analysis performed. Thereafter, the data are explored and presented graphically to highlight any interesting findings.
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.6.2
library(ggplot2)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.6.2
## Loading required package: RColorBrewer
library(tidyverse)
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.1 v purrr 0.3.2
## v tidyr 1.0.0 v dplyr 0.8.3
## v readr 1.3.1 v stringr 1.4.0
## v tibble 2.1.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
There are three data files that we want to examine: blogs, news and Twitter data from the English (en_US) directory.
blogs <- readLines("en_US.blogs.txt",encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt",encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt",encoding = "UTF-8", skipNul = TRUE)
Now that the data are loaded, they need to be transformed into the correct class so that they can be manipulated easily. The techniques used in this report are predominantly inspired by the very useful online textbook Text Mining with R.
First, let's convert the data into tibbles:
blogs_df <- tibble(blogs)
news_df <- tibble(news)
twitter_df <- tibble(twitter)
The data is converted to the tibble class because tibbles do not convert strings to factors and do not use row names.
A very important step is tokenization. This takes the data and creates a data frame with one word per row. If the first sentence in the original text file is eight words long, then the first eight rows of the data frame will contain those eight words, in order. For example:
tidy_blogs <- blogs_df %>% unnest_tokens(word, blogs)
tidy_news <- news_df %>% unnest_tokens(word, news)
tidy_twitter <- twitter_df %>% unnest_tokens(word, twitter)
head(tidy_blogs, 10)
## # A tibble: 10 x 1
## word
## <chr>
## 1 in
## 2 the
## 3 years
## 4 thereafter
## 5 most
## 6 of
## 7 the
## 8 oil
## 9 fields
## 10 and
Another important step is to remove stop words. Stop words are extremely common words such as "the" or "on". They tell us very little about what the text is trying to say, or about the sentiment or context in which other words appear, so we remove them.
data("stop_words")
tidy_blogs <- tidy_blogs %>% anti_join(stop_words)
## Joining, by = "word"
tidy_news <- tidy_news %>% anti_join(stop_words)
## Joining, by = "word"
tidy_twitter <- tidy_twitter %>% anti_join(stop_words)
## Joining, by = "word"
If we rank the words by frequency, we can see that the most popular words no longer include stop words.
blogs_count <- tidy_blogs %>% count(word, sort = TRUE)
news_count <- tidy_news %>% count(word, sort = TRUE)
twitter_count <- tidy_twitter %>% count(word, sort = TRUE)
head(blogs_count,10)
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 time 90918
## 2 people 59574
## 3 day 52372
## 4 love 45230
## 5 life 41251
## 6 it's 38657
## 7 1 30907
## 8 2 29561
## 9 world 29305
## 10 i'm 29189
Before any exploratory analysis is performed, it is worth taking a quick look at a summary of the data to see what each file comprises. In particular, we are interested in the word count, line count and size of each file.
Each summary is computed separately, and the results are then combined into a single summary matrix. This entails creating summaries for word count, line count and file size.
n_blogs <- nchar(blogs_count[,1])
n_news <- nchar(news_count[,1])
n_twitter <- nchar(twitter_count[,1])
Summary_1 <- rbind(n_blogs, n_news, n_twitter)
blogs_lines <- NROW(blogs)
twitter_lines <- NROW(twitter)
news_lines <- NROW(news)
Summary_2 <- rbind(blogs_lines, news_lines, twitter_lines)
b <- file.size("en_US.blogs.txt")
n <- file.size("en_US.news.txt")
t <- file.size("en_US.twitter.txt")
FilesSize_blog<- (b/1024^2)
FilesSize_news <- (n/1024^2)
FilesSize_twitter <- (t/1024^2)
File_Sizes <- rbind(FilesSize_blog, FilesSize_news, FilesSize_twitter)
Now we can combine this all into one nice summary matrix:
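The combining step itself is not echoed here; one possible sketch, using the objects created in the previous chunks (the name Summary_Matrix is illustrative, not from the original code), is:
# Combine word counts, line counts and file sizes into one summary matrix
Summary_Matrix <- cbind(Summary_1, Summary_2, File_Sizes)
rownames(Summary_Matrix) <- c("Blogs", "News", "Twitter")
colnames(Summary_Matrix) <- c("No. of Words", "No. of Lines", "Size (MB)")
Summary_Matrix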
##         No. of Words No. of Lines Size (MB)
## Blogs        3827771       899288  200.4242
## News          975346        77259  196.2775
## Twitter      4648075      2360148  159.3641
Now that the data are clean, it's time to see what words are the most popular in the text files.
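For example, the most frequent words in the blogs file could be visualised with the packages loaded earlier; the sketch below is not the original plotting code, and the 25,000-occurrence cutoff is an arbitrary choice for illustration:
# Bar chart of the most frequent words in the blogs file
blogs_count %>%
  filter(n > 25000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of occurrences")
# A word cloud gives another quick view of the same information
wordcloud(words = blogs_count$word, freq = blogs_count$n, max.words = 100)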
We can also unnest datasets to extract sentences. This can be done using the commands below:
blog_sentences <- tibble(text = blogs) %>%
  unnest_tokens(sentence, text, token = "sentences")
## [1] "in the years thereafter, most of the oil fields and platforms were named after pagan āgodsā."
This may be a neat piece of analysis that can be used later on in the project.
Now that basic summaries and exploratory graphs have been examined, we can turn to n-grams. First, we will explore some bi-grams to see how they are derived and which ones are most common.
An important thing to note is that removing stop words from bi-grams requires a little more work, since the stop word dataset is tokenised with one word per row. An anti-join therefore won't work unless the bi-grams are first split into their individual words, as sketched below.
Further, we will use the news training dataset to perform this analysis as it is the smallest dataset to work with.
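The bi-gram code is not echoed here; a sketch of how the counts below could be produced with the tidytext/tidyr approach (tokenise into bi-grams, separate each bi-gram into two word columns, filter out stop words, then count) might look like this:
# Tokenise the news data into bi-grams (pairs of consecutive words)
news_bigrams <- news_df %>%
  unnest_tokens(bigram, news, token = "ngrams", n = 2)
# Split each bi-gram into its two words so stop words can be filtered out
news_bigram_counts <- news_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
news_bigram_counts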
## # A tibble: 417,820 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 st louis 701
## 2 los angeles 436
## 3 san francisco 381
## 4 30 p.m 354
## 5 health care 317
## 6 1 2 227
## 7 san diego 219
## 8 vice president 219
## 9 white house 179
## 10 7 p.m 167
## # ... with 417,810 more rows
From this we can see that "st louis" is the most popular bi-gram in the news dataset. The same technique can be applied to the two larger datasets, blogs and Twitter.
So what about tri-grams? Let's have a look at the most common tri-grams in the news training set.
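As before, the tri-gram code is not echoed; under the same assumptions as the bi-gram sketch above, the counts could be produced along these lines (setting n = 4 and adding a word4 column gives the quad-grams shown further below):
# Tri-grams: the same pattern as the bi-grams, with n = 3 and a third word column
news_trigram_counts <- news_df %>%
  unnest_tokens(trigram, news, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)
news_trigram_counts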
## # A tibble: 247,935 x 4
## word1 word2 word3 n
## <chr> <chr> <chr> <int>
## 1 president barack obama 95
## 2 7 30 p.m 77
## 3 st louis county 76
## 4 gov chris christie 66
## 5 world war ii 53
## 6 11 30 a.m 49
## 7 6 30 p.m 42
## 8 1 1 2 41
## 9 chief financial officer 40
## 10 1 2 cup 39
## # ... with 247,925 more rows
We can see that the number of distinct n-gram combinations gets smaller as n increases. Let's have a final look to see whether quad-grams are still relevant.
## # A tibble: 130,254 x 5
## word1 word2 word3 word4 n
## <chr> <chr> <chr> <chr> <int>
## 1 cuyahoga county common pleas 14
## 2 dow jones industrial average 13
## 3 st louis public schools 13
## 4 treasury secretary timothy geithner 12
## 5 martin luther king jr 11
## 6 vice president joe biden 11
## 7 0 0 0 0 10
## 8 10 a.m 5 p.m 10
## 9 american civil liberties union 10
## 10 assembly speaker sheila oliver 10
## # ... with 130,244 more rows
As we can see, the number of distinct quad-grams is not as large as for the n-grams with n < 4. This relationship between n and the number of distinct n-grams, and what it implies for predictability, is an important finding that will be revisited at a later stage in the project.