Loading packages

library(tidyverse)
library(tidytext)
library(stopwords)
library(tm)

theme_set(theme_bw())

Task 0 - Understanding the Problem

In this capstone we will be applying data science in the area of natural language processing. As a first step toward working on this project, you should familiarize yourself with Natural Language Processing, Text Mining, and the associated tools in R. Here are some resources that may be helpful to you.

The corpora are collected from publicly available sources by a web crawler. The crawler checks for language, so as to mainly get texts consisting of the desired language*.

Each entry is tagged with it’s date of publication. Where user comments are included they will be tagged with the date of the main entry.

Each entry is tagged with the type of entry, based on the type of website it is collected from (e.g. newspaper or personal blog) If possible, each entry is tagged with one or more subjects based on the title or keywords of the entry (e.g. if the entry comes from the sports section of a newspaper it will be tagged with “sports” subject).In many cases it’s not feasible to tag the entries (for example, it’s not really practical to tag each individual Twitter entry, though I’ve got some ideas which might be implemented in the future) or no subject is found by the automated process, in which case the entry is tagged with a ‘0’.

To save space, the subject and type is given as a numerical code.

Once the raw corpus has been collected, it is parsed further, to remove duplicate entries and split into individual lines. Approximately 50% of each entry is then deleted. Since you cannot fully recreate any entries, the entries are anonymised and this is a non-profit venture I believe that it would fall under Fair Use.

Task 1 - Getting and cleaning the data and Task 2 - Exploratory Data Analysis

Task 1:

Large databases comprising of text in a target language are commonly used when generating language models for various purposes. In this exercise, you will use the English database but may consider three other databases in German, Russian and Finnish.

The goal of this task is to get familiar with the databases and do the necessary cleaning. After this exercise, you should understand what real data looks like and how much effort you need to put into cleaning the data

Task 2:

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.

Loading the English text data and convert them as tibbles

blogs   <- read_lines("./en_US.blogs.txt")
news    <- read_lines("./en_US.news.txt")
twt <- read_lines("./en_US.twitter.txt") 

#creating tibbles (data frames)
blogs_tbl <- tibble(blogs)
news_tbl <- tibble(news)
twt_tbl <- tibble(twt)

Tuning the tibbles

#Rename the column to 'text'
blogs_tbl <- rename(blogs_tbl, text = blogs)
news_tbl <- rename(news_tbl, text = news)
twt_tbl <- rename(twt_tbl, text = twt)

#Add source column (factor)  
blogs_tbl <- blogs_tbl %>% 
  mutate(source = factor(rep("Blogs", nrow(blogs_tbl))))

news_tbl <- news_tbl %>% 
  mutate(source = factor(rep("News", nrow(news_tbl))))

twt_tbl <- twt_tbl %>% 
  mutate(source = factor(rep("Twitter", nrow(twt_tbl))))

Showing the first 3 rows of each file

set.seed(2021)
#Blogs
sample_n(blogs_tbl, 3)
# A tibble: 3 x 2
  text                                                                    source
  <chr>                                                                   <fct> 
1 then i carry these germs                                                Blogs 
2 It may be a combination of my OCD or the fact that I almost always fee~ Blogs 
3 14.- Case Closed - Produced by Domingo                                  Blogs 
#News
sample_n(news_tbl, 3)
# A tibble: 3 x 2
  text                                                                    source
  <chr>                                                                   <fct> 
1 "“This plant will be at 100 percent capacity — to the max,” Johnson, o~ News  
2 "\"We've all worked hard. He's exhausted. I'm exhausted,\" said Leesbu~ News  
3 "Rose Psara, chief investigator for the St. Louis Medical Examiner's O~ News  
#Twitter
sample_n(twt_tbl, 3)
# A tibble: 3 x 2
  text                                           source 
  <chr>                                          <fct>  
1 funny Ass Hell                                 Twitter
2 She's done! NOW ITS JUSTIN'S TURN              Twitter
3 Now at Winghouse in Orlando, getting ready for Twitter

Initial EDA - Number of characters and lines

# Calculating # of characters
blogs_total_char   <- sum(nchar(blogs))
news_total_char    <- sum(nchar(news))
twt_total_char <- sum(nchar(twt))

# Calculating # of lines
blogs_lines   <- length(blogs)
news_lines    <- length(news)
twt_lines <- length(twt)

Showing a summary of raw data

txt_summary <- tibble(archive = as.factor(c('Blogs', 'News', 'Twitter')),
                          n_lines = c(blogs_lines, news_lines, twt_lines),
                          n_char = c(blogs_total_char, news_total_char, twt_total_char)
                          )
txt_summary
# A tibble: 3 x 3
  archive n_lines    n_char
  <fct>     <int>     <int>
1 Blogs    899288 206824505
2 News    1010242 203223159
3 Twitter 2360148 162096031

Sampling from tibbles (10%)

set.seed(2021)
blogs_sample <- blogs_tbl %>% sample_n(., nrow(blogs_tbl) * .1)
news_sample <- news_tbl %>%   sample_n(., nrow(news_tbl) * .1)
twt_sample <- twt_tbl %>% sample_n(., nrow(twt_tbl) * .1)

Tidy the data

#Join (vertically) the samples
tidy_data <- bind_rows(blogs_sample, news_sample, twt_sample)

#See the structure
glimpse(tidy_data)
Rows: 426,966
Columns: 2
$ text   <chr> "then i carry these germs", "It may be a combination of my OCD ~
$ source <fct> Blogs, Blogs, Blogs, Blogs, Blogs, Blogs, Blogs, Blogs, Blogs, ~
levels(tidy_data$source)
[1] "Blogs"   "News"    "Twitter"
(token_full_data <- tidy_data %>% unnest_tokens(word, text))
# A tibble: 10,237,899 x 2
   source word       
   <fct>  <chr>      
 1 Blogs  then       
 2 Blogs  i          
 3 Blogs  carry      
 4 Blogs  these      
 5 Blogs  germs      
 6 Blogs  it         
 7 Blogs  may        
 8 Blogs  be         
 9 Blogs  a          
10 Blogs  combination
# ... with 10,237,889 more rows

Cleaning data. Removing stopping words

data("stop_words")
stopping_words <- read_delim("./stopWords.csv", delim = "\n", col_names = FALSE)

stopping_words <- unnest_tokens(stopping_words, word, X1)
rep_reg <- "[^[:alpha:][:space:]]*"
rep_url <- "http[^[:space:]]*"
rep_var <- "\\b(?=\\w*(\\w)\\1)\\w+\\b" 

clean_data <-  tidy_data %>%
  mutate(text = str_replace_all(text, rep_reg, "")) %>%
  mutate(text = str_replace_all(text, rep_url, "")) %>%
  mutate(text = str_replace_all(text, rep_var, "")) 

clean_tidy_data <- clean_data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

Words with 50% of coverage of all words in the dataset

the_50s <- clean_tidy_data %>%
  count(word) %>%  
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop)) %>%  
  mutate(coverage = cumsum(prop)) %>%
  filter(coverage <= 0.5) 

sample_n(the_50s, 3)
# A tibble: 3 x 4
  word        n     prop coverage
  <chr>   <int>    <dbl>    <dbl>
1 percent  3703 0.00108    0.0978
2 joy       592 0.000172   0.465 
3 south    1948 0.000566   0.215 
print(paste('Number of words with 50% of coverage of all words in the dataset: ', nrow(the_50s))) 
[1] "Number of words with 50% of coverage of all words in the dataset:  1279"

Words with 90% of coverage of all words in the dataset

the_90s <- clean_tidy_data %>%
  count(word) %>%  
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop)) %>%  
  mutate(coverage = cumsum(prop)) %>%
  filter(coverage <= 0.9) 

sample_n(the_90s, 3)
# A tibble: 3 x 4
  word           n       prop coverage
  <chr>      <int>      <dbl>    <dbl>
1 sweatpants    22 0.00000639    0.887
2 quote        429 0.000125      0.532
3 creative     744 0.000216      0.415
print(paste('Number of words with 90% of coverage of all words in the dataset: ', nrow(the_90s))) 
[1] "Number of words with 90% of coverage of all words in the dataset:  17115"

The top 10 words

the_50s %>%
  top_n(10, prop) %>%
  mutate(word = reorder(word, prop)) %>%
  ggplot(aes(word, prop)) +
  geom_col(fill = 'lightblue', col= 'blue') +
  xlab(NULL) +
  coord_flip()

The most frequent words by media

freq <- clean_tidy_data %>%
  count(source, word) %>%
  group_by(source) %>%
  mutate(prop = n / sum(n)) %>%
  spread(source, prop) %>%
  gather(source, prop, `Blogs`:`Twitter`) %>%
  arrange(desc(prop), desc(n))

freq %>%
  filter(prop > 0.002) %>% 
  mutate(word = reorder(word, prop)) %>% 
  ggplot(aes(word, prop, fill = source)) +
  geom_col() + 
  xlab(NULL) + 
  coord_flip() +
  facet_grid(~source, scales = "free") +
  scale_fill_viridis_d() + 
  theme(legend.position = 'none')

n-grams

bi-grams

#Preparing bi-gram
bigram_data <- clean_data %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigram_90 <- bigram_data %>%
  count(bigram) %>%  
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop)) %>%  
  mutate(coverage = cumsum(prop)) %>%
  filter(coverage <= 0.9)

head(bigram_90)
# A tibble: 6 x 4
  bigram      n    prop coverage
  <chr>   <int>   <dbl>    <dbl>
1 of the  43825 0.00516  0.00516
2 in the  41748 0.00492  0.0101 
3 to the  24894 0.00293  0.0130 
4 for the 20596 0.00243  0.0154 
5 on the  19731 0.00232  0.0178 
6 to be   16201 0.00191  0.0197 
print(paste('Number of bi-grams with 90% of coverage in the dataset: ', nrow(bigram_90))) 
[1] "Number of bi-grams with 90% of coverage in the dataset:  1256942"
# Plotting bi-grams
bigram_90 %>%
  top_n(10, prop) %>%
  mutate(bigram = reorder(bigram, prop)) %>%
  ggplot(aes(bigram, prop)) +
  geom_col(fill = 'lightgreen', col = 'black') +
  xlab(NULL) +
  coord_flip() + 
  ggtitle("Bigrams that represent the 90%") +
  labs(y = "Proportion")

tri-grams

#Preparing tri-grams
trigram_data <- clean_data %>% 
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

trigram_90 <- trigram_data %>%
  count(trigram) %>%  
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop)) %>%  
  mutate(coverage = cumsum(prop)) %>%
  filter(coverage <= 0.9)

head(trigram_90)
# A tibble: 6 x 4
  trigram            n     prop coverage
  <chr>          <int>    <dbl>    <dbl>
1 <NA>           23701 0.00293   0.00293
2 one of the      3338 0.000413  0.00334
3 a lot of        3044 0.000376  0.00372
4 thanks for the  2474 0.000306  0.00403
5 to be a         1784 0.000221  0.00425
6 the of the      1686 0.000209  0.00446
print(paste('Number of tri-grams with 90% of coverage in the dataset: ', nrow(trigram_90)))
[1] "Number of tri-grams with 90% of coverage in the dataset:  4495717"
# Plotting tri-grams
trigram_90 %>%
  top_n(10, prop) %>%
  mutate(trigram = reorder(trigram, prop)) %>%
  ggplot(aes(trigram, prop)) +
  geom_col(fill = 'lightblue', col = 'black') +
  ggtitle("Trigrams that represent the 90%") +
  labs(y = "Proportion") + 
  coord_flip()

– END

References:

Text Mining with R

Supervised Machine Learning for Text Analysis in R

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Mexico.1252  LC_CTYPE=Spanish_Mexico.1252   
[3] LC_MONETARY=Spanish_Mexico.1252 LC_NUMERIC=C                   
[5] LC_TIME=Spanish_Mexico.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tm_0.7-8        NLP_0.2-1       stopwords_2.2   tidytext_0.3.1 
 [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4    
 [9] readr_1.4.0     tidyr_1.1.3     tibble_3.1.2    ggplot2_3.3.5  
[13] tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7        lubridate_1.7.10  lattice_0.20-44   assertthat_0.2.1 
 [5] digest_0.6.27     utf8_1.2.1        slam_0.1-48       R6_2.5.0         
 [9] cellranger_1.1.0  backports_1.2.1   reprex_2.0.0      evaluate_0.14    
[13] highr_0.9         httr_1.4.2        pillar_1.6.1      rlang_0.4.11     
[17] readxl_1.3.1      rstudioapi_0.13   jquerylib_0.1.4   Matrix_1.3-3     
[21] rmarkdown_2.9     labeling_0.4.2    munsell_0.5.0     broom_0.7.8      
[25] compiler_4.1.0    janeaustenr_0.1.5 modelr_0.1.8      xfun_0.24        
[29] pkgconfig_2.0.3   htmltools_0.5.1.1 tidyselect_1.1.1  viridisLite_0.4.0
[33] fansi_0.5.0       crayon_1.4.1      dbplyr_2.1.1      withr_2.4.2      
[37] SnowballC_0.7.0   grid_4.1.0        jsonlite_1.7.2    gtable_0.3.0     
[41] lifecycle_1.0.0   DBI_1.1.1         magrittr_2.0.1    scales_1.1.1     
[45] tokenizers_0.2.1  cli_3.0.0         stringi_1.6.2     farver_2.1.0     
[49] fs_1.5.0          xml2_1.3.2        bslib_0.2.5.1     ellipsis_0.3.2   
[53] generics_0.1.0    vctrs_0.3.8       tools_4.1.0       glue_1.4.2       
[57] hms_1.1.0         parallel_4.1.0    yaml_2.2.1        colorspace_2.0-2 
[61] rvest_1.0.0       knitr_1.33        haven_2.4.1       sass_0.4.0