Goal of analysis

The main goal of this analysis is to explore the corpus and identify key aspects of the data: how many lines and words each file has, and how many unique words are needed to cover a certain percentage of the text.

For this analysis, we will explore US blogs, tweets and news, plot word frequencies, and analyze the most common 2-grams and 3-grams (combinations of two or three words).

Data Loading

First, we download the data from the URL.
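
The download step itself is not echoed in the report; below is a minimal sketch of it, assuming the standard Coursera SwiftKey archive and placeholder values for the data_dir, lang_dir and file_name objects that the rest of the code relies on.

# Placeholder locations; adjust to your environment
data_dir <- "data/"
lang_dir <- "final/en_US/"   # folder created when the archive is unzipped
file_name <- list(blogs   = "en_US.blogs.txt",
                  news    = "en_US.news.txt",
                  twitter = "en_US.twitter.txt")

# Assumed dataset URL (Coursera SwiftKey corpus)
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- file.path(data_dir, "Coursera-SwiftKey.zip")

if (!file.exists(zip_file)) {
  dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)
  download.file(zip_url, destfile = zip_file, mode = "wb")
  unzip(zip_file, exdir = data_dir)
}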

Now we will show basic summaries: lines per file, words per file, and word frequency.

# Count the lines in each corpus file (countLines is from the R.utils package)
file_lines <- sapply(file_name, function(x) {
  as.numeric(countLines(paste0(data_dir, lang_dir, x)))
})

tibble(File = unlist(file_name), Lines = file_lines)
## # A tibble: 3 x 2
##   File                Lines
##   <chr>               <dbl>
## 1 en_US.blogs.txt    899288
## 2 en_US.news.txt    1010242
## 3 en_US.twitter.txt 2360148
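
The table above only reports line counts; a quick sketch of how per-file word counts could be added (assuming the stringi package is installed; note that this reads each full file into memory):

# Approximate total words per corpus file
file_words <- sapply(file_name, function(x) {
  sum(stringi::stri_count_words(read_lines(paste0(data_dir, lang_dir, x))))
})

tibble(File = unlist(file_name), Lines = file_lines, Words = file_words)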

We subset the data to smaller samples before continuing the exploration.

sample_text <- function(file_name, output_name) {
  set.seed(1234)  # make the sample reproducible
  
  lines <- read_lines(file_name)
  
  n_lines <- as.numeric(countLines(file_name))
  
  # Sample roughly 0.1% of the lines, with a minimum of 1000
  size <- ifelse(n_lines / 1000 > 1000, n_lines / 1000, 1000)
  
  selected_lines <- sample(n_lines, size = size, replace = FALSE)
  subset_text <- lines[selected_lines]
  
  write_lines(subset_text, output_name)
}


# Build the sample file names and their full paths as named character vectors
sample_name <- sapply(file_name, function(name) { paste0("sample_", name) })
sample_paths <- sapply(sample_name, function(name) { paste0(data_dir, lang_dir, name) })

# Sample Blogs data
if(!file.exists(paste0(data_dir,lang_dir,sample_name["blogs"]))) {
  sample_text(paste0(data_dir,lang_dir,file_name["blogs"]),
              paste0(data_dir,lang_dir,sample_name["blogs"]))
}

# Sample News data
if(!file.exists(paste0(data_dir,lang_dir,sample_name["news"]))) {
  sample_text(paste0(data_dir,lang_dir,file_name["news"]),
              paste0(data_dir,lang_dir,sample_name["news"]))
}

# Sample Twitter data
if(!file.exists(paste0(data_dir,lang_dir,sample_name["twitter"]))) {
  sample_text(paste0(data_dir,lang_dir,file_name["twitter"]),
              paste0(data_dir,lang_dir,sample_name["twitter"]))
}

Blog data

First 10 lines of the blog data sample.

blog_lines <- read_lines(sample_paths["blogs"])
head(blog_lines, 10)
##  [1] "He looked back at me, his eyes were as dark as coal,"                                                                                                                                                                                                                                                                                                                                                                                     
##  [2] "You've set up a problem without stakes. Why does she care who the voice on the phone is? Why would she even listen to him past \"hello?\""                                                                                                                                                                                                                                                                                                
##  [3] "Yvonne Strahovski … Peg Mooring"                                                                                                                                                                                                                                                                                                                                                                                                          
##  [4] "What are you like with medical procedures when it comes to your kids?"                                                                                                                                                                                                                                                                                                                                                                    
##  [5] "But I believe that the DNA in the rainbow trout was created directly by God at a time in the past. Replacements for it are no longer being made. Therefore, to wipe it out of an existence would be to destroy the work of the greatest artist there is."                                                                                                                                                                                 
##  [6] "We can never go home."                                                                                                                                                                                                                                                                                                                                                                                                                    
##  [7] "Paul Lemberg refers to the Comfort Zone phenomenon as leading business managers to become “fat, dumb, and happy.” In other words, becoming complacent when things are going fine. This can lead to becoming reactive with your strategy, rather than proactive. Do you want to be reconfiguring your department under duress at breakneck speed at the last-minute, or would you rather plan well ahead of time when the pressure is off?"
##  [8] "And then of course we got a cheese plate. It was three European cheeses, which our waiter thoroughly explained, served with baguette slices and plum chutney. My favorite was the goat's cheese Brie-type one, of which the name has completely left me despite all the goodness."                                                                                                                                                        
##  [9] "The Things of Childhood"                                                                                                                                                                                                                                                                                                                                                                                                                  
## [10] "the road hit us. Very bumpy; let me say that again VERY BUMPY! After about two"
# Create a list to store all data transformations
blog_data <- list()

To analyze words, it is necessary to split each line into tokens; this process is called tokenization.

blog_data$text_tibble <- tibble(line = 1:length(blog_lines), text = blog_lines)
blog_data$original <- blog_data$text_tibble %>% 
                 unnest_tokens(token, text, strip_punct = TRUE) %>%
                 filter(!is.na(token)) %>%
                 mutate(word = token)

# Remove stopwords
blog_data$wo_stepwords <- blog_data$original %>%
                     anti_join(stop_words, by=c("token"="word"))
# Using sentimentr
blog_data$sentences <- get_sentences(blog_lines)

profanity_terms <- extract_profanity_terms(blog_data$sentences) # names: neutral, profanity, sentence

profane_words <- unique(unlist(profanity_terms$profanity))

# Filter profanity
blog_data$wo_profanity <- blog_data$wo_stepwords %>%
                     filter(!token %in% profane_words)
# Remove numbers
blog_data$wo_numbers <- blog_data$wo_profanity %>% filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word))

# Remove misspelled words
blog_data$correct_words <- blog_data$wo_numbers %>% filter(hunspell_check(word))

1-gram

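The unigram chunk is not echoed in the report; a minimal sketch of how the top 1-gram frequencies could be counted and plotted from the cleaned blog tokens (assuming ggplot2 is loaded; word_freq is a name introduced here):

# Count unigram frequencies on the cleaned tokens and plot the 20 most common
blog_data$word_freq <- blog_data$correct_words %>%
  count(word, sort = TRUE)

blog_data$word_freq %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Most frequent 1-grams (blog sample)")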

2-gram

blog_data$bigram <- blog_data$text_tibble %>% 
                 unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
                 filter(!is.na(bigram))

2-grams with stop words.

bigrams_separated <- blog_data$bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% profane_words) %>%
  filter(!word2 %in% profane_words) %>%
  filter(!is.na(word1)) %>%
  filter(!is.na(word2)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word1)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word2)) %>%
  filter(hunspell_check(word1)) %>%
  filter(hunspell_check(word2))

blog_data$bigrams_separated <- bigrams_filtered
blog_data$bigrams_filtered <- bigrams_filtered %>%
                         unite(bigram, word1, word2, sep = " ")

2-grams without stop words.

blog_data$bigrams_wo_stepwords <- blog_data$bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")

3-gram

3-gram with stop words

blog_data$trigram <- blog_data$text_tibble %>% 
                 unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
                 filter(!is.na(trigram))
trigrams_separated <- blog_data$trigram %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

trigrams_filtered <- trigrams_separated %>%
  filter(!word1 %in% profane_words) %>%
  filter(!word2 %in% profane_words) %>%
  filter(!word3 %in% profane_words) %>%
  filter(!is.na(word1)) %>%
  filter(!is.na(word2)) %>%
  filter(!is.na(word3)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word1)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word2)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word3)) %>%
  filter(hunspell_check(word1)) %>%
  filter(hunspell_check(word2)) %>%
  filter(hunspell_check(word3))

blog_data$trigrams_separated <- trigrams_filtered
blog_data$trigrams_filtered <- trigrams_filtered %>%
                         unite(trigram, word1, word2, word3, sep = " ")

3-gram without stop words

blog_data$trigrams_wo_stopwords <- blog_data$trigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

News data

First 10 lines of the news data sample.

news_lines <- read_lines(sample_paths["news"])
head(news_lines, 10)
##  [1] "It’s a nice thought, but don’t expect state Democrats Stephen Sweeney or Sheila Oliver to be raising their glasses to Christie anytime soon, even if he does throw his name in the presidential race. The terms \"rotten bastard\" and \"mentally deranged,\" words that those two critics recently uttered about him, aren’t exactly champagne toast material."
##  [2] "In an \"A\" review, Drew McWeeny of HitFix.com writes Lawrence invests Katniss \"with a rich inner life that makes her feel real. It is a pure movie star performance, and Lawrence rises to the occasion.\""                                                                                                                                                   
##  [3] "This is also according to advice I've given others: We're flawed, all of us, and hoping to find an ideal person is not only pointless, it's also dehumanizing to people to expect them to meet your ideals. All you can realistically hope for are people who are self-aware enough and responsible enough to try to keep their frailties in check."            
##  [4] "And it would help if people listened when they were being called over the radio, she added."                                                                                                                                                                                                                                                                    
##  [5] "One excellent chef, Mark Helms, changed spaces and concepts, selling his tiny Ravenous Café in the Pocket area and opening even tinier Juno's Kitchen & Delicatessen on J Street in east Sacramento, a few blocks from his house. Great idea, great location, great food."                                                                                      
##  [6] "CQ Politics cites this evidence from the New York Times/CBS News poll:"                                                                                                                                                                                                                                                                                         
##  [7] "Jones's plan is to have players up and down the lineup pick up bits and pieces of Cooper's stats and responsibilities."                                                                                                                                                                                                                                         
##  [8] "Detroit wasn<U+0092>t interested in bringing him back in 2011, and Bonderman said he <U+0093>blew out<U+0094> his elbow that winter, trying to get ready to sign with the Cleveland Indians."                                                                                                                                                                   
##  [9] "(Sometimes?)"                                                                                                                                                                                                                                                                                                                                                   
## [10] "Efforts to reach Epstein for comment Wednesday night were unsuccessful. The Cubs and Padres were allowed to announce the decision formally because of the rainout of Game 6 of the World Series and the Cubs will have a news conference next week, probably Tuesday. Both teams were asked not to comment until the World Series ends."
# Create a list to store all data transformations
news_data <- list()

Tokenize the news data to later perform a word analysis.

news_data$text_tibble <- tibble(line = 1:length(news_lines), text = news_lines)
news_data$original <- news_data$text_tibble %>% 
                 unnest_tokens(token, text, strip_punct = TRUE) %>%
                 filter(!is.na(token)) %>%
                 mutate(word = token)

# Remove stopwords
news_data$wo_stepwords <- news_data$original %>%
                     anti_join(stop_words, by=c("token"="word"))
# Using sentimentr
news_data$sentences <- get_sentences(news_lines)

profanity_terms <- extract_profanity_terms(news_data$sentences) # names: neutral, profanity, sentence

profane_words <- unique(unlist(profanity_terms$profanity))

# Filter profanity
news_data$wo_profanity <- news_data$wo_stepwords %>%
                     filter(!token %in% profane_words)
# Remove numbers
news_data$wo_numbers <- news_data$wo_profanity %>% filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word))

# Remove misspelled words
news_data$correct_words <- news_data$wo_numbers %>% filter(hunspell_check(word))

1-gram

The unigram frequencies for the news sample are computed in the same way as for the blog sample.

2-gram

news_data$bigram <- news_data$text_tibble %>% 
                 unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
                 filter(!is.na(bigram))

2-grams with stop words.

bigrams_separated <- news_data$bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% profane_words) %>%
  filter(!word2 %in% profane_words) %>%
  filter(!is.na(word1)) %>%
  filter(!is.na(word2)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word1)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word2)) %>%
  filter(hunspell_check(word1)) %>%
  filter(hunspell_check(word2))

news_data$bigrams_separated <- bigrams_filtered
news_data$bigrams_filtered <- bigrams_filtered %>%
                         unite(bigram, word1, word2, sep = " ")

2-grams without stop words.

news_data$bigrams_wo_stepwords <- news_data$bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")

3-gram

3-gram with stop words

news_data$trigram <- news_data$text_tibble %>% 
                 unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
                 filter(!is.na(trigram))
trigrams_separated <- news_data$trigram %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

trigrams_filtered <- trigrams_separated %>%
  filter(!word1 %in% profane_words) %>%
  filter(!word2 %in% profane_words) %>%
  filter(!word3 %in% profane_words) %>%
  filter(!is.na(word1)) %>%
  filter(!is.na(word2)) %>%
  filter(!is.na(word3)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word1)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word2)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word3)) %>%
  filter(hunspell_check(word1)) %>%
  filter(hunspell_check(word2)) %>%
  filter(hunspell_check(word3))

news_data$trigrams_separated <- trigrams_filtered
news_data$trigrams_filtered <- trigrams_filtered %>%
                         unite(trigram, word1, word2, word3, sep = " ")

3-gram without stop words

news_data$trigrams_wo_stopwords <- news_data$trigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

Twitter data

First 10 lines of the Twitter data sample.

twitter_lines <- read_lines(sample_paths["twitter"])
head(twitter_lines, 10)
##  [1] "u not getting rid of that beard are ya?"                                                                                   
##  [2] "Hope you are well lady. Keep me posted on all the goings on. :D I will text you when I find my phone"                      
##  [3] "you guys doing Danny James' \"Pear\" on lp?!"                                                                              
##  [4] "how exciting :) have a wonderful Christmas!"                                                                               
##  [5] "If you have no money and you want a usa visa there areno questions asked. But to more you got the more you have to prove?!"
##  [6] "cleans. I don't like cleaning. :)"                                                                                         
##  [7] "Very big issue down the road. RT : T2: IMO the other issue w/exchange is trusting correctness."                            
##  [8] "God has over 2 Billion followers and He didn't even need Twitter!"                                                         
##  [9] "Ain't no one cute at my school :/"                                                                                         
## [10] "and thats why so much seems messed up in DC. best leaders often dont aim/intend to lead."
# Create a list to store all data transformations
twitter_data <- list()

Tokenize the Twitter data to later perform a word analysis.

twitter_data$text_tibble <- tibble(line = 1:length(twitter_lines), text = twitter_lines)
twitter_data$original <- twitter_data$text_tibble %>% 
                 unnest_tokens(token, text, strip_punct = TRUE) %>%
                 filter(!is.na(token)) %>%
                 mutate(word = token)

# Remove stopwords
twitter_data$wo_stepwords <- twitter_data$original %>%
                     anti_join(stop_words, by=c("token"="word"))
# Using sentimentr
twitter_data$sentences <- get_sentences(twitter_lines)

profanity_terms <- extract_profanity_terms(twitter_data$sentences) # names: neutral, profanity, sentence

profane_words <- unique(unlist(profanity_terms$profanity))

# Filter profanity
twitter_data$wo_profanity <- twitter_data$wo_stepwords %>%
                     filter(!token %in% profane_words)
# Remove numbers
twitter_data$wo_numbers <- twitter_data$wo_profanity %>% filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word))

# Remove misspelled words
twitter_data$correct_words <- twitter_data$wo_numbers %>% filter(hunspell_check(word))

1-gram

The unigram frequencies for the Twitter sample are computed in the same way as for the blog sample.

2-gram

twitter_data$bigram <- twitter_data$text_tibble %>% 
                 unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
                 filter(!is.na(bigram))

2-grams with stop words.

bigrams_separated <- twitter_data$bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% profane_words) %>%
  filter(!word2 %in% profane_words) %>%
  filter(!is.na(word1)) %>%
  filter(!is.na(word2)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word1)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word2)) %>%
  filter(hunspell_check(word1)) %>%
  filter(hunspell_check(word2))

twitter_data$bigrams_separated <- bigrams_filtered
twitter_data$bigrams_filtered <- bigrams_filtered %>%
                         unite(bigram, word1, word2, sep = " ")

2-grams without stop words.

twitter_data$bigrams_wo_stepwords <- twitter_data$bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")

3-gram

3-gram with stop words

twitter_data$trigram <- twitter_data$text_tibble %>% 
                 unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
                 filter(!is.na(trigram))
trigrams_separated <- twitter_data$trigram %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

trigrams_filtered <- trigrams_separated %>%
  filter(!word1 %in% profane_words) %>%
  filter(!word2 %in% profane_words) %>%
  filter(!word3 %in% profane_words) %>%
  filter(!is.na(word1)) %>%
  filter(!is.na(word2)) %>%
  filter(!is.na(word3)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word1)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word2)) %>%
  filter(!grepl("^(\\d+(,\\d+)*(.\\d+))|(\\d+)$", word3)) %>%
  filter(hunspell_check(word1)) %>%
  filter(hunspell_check(word2)) %>%
  filter(hunspell_check(word3))

twitter_data$trigrams_separated <- trigrams_filtered
twitter_data$trigrams_filtered <- trigrams_filtered %>%
                         unite(trigram, word1, word2, word3, sep = " ")

3-gram without stop words

twitter_data$trigrams_wo_stopwords <- twitter_data$trigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

Sample summaries

lines <- sapply(sample_paths, function(name) as.numeric(countLines(name)))
words <- c(length(blog_data$correct_words$word), 
           length(news_data$correct_words$word),
           length(twitter_data$correct_words$word))
unique_words <- c(length(unique(blog_data$correct_words$word)), 
                  length(unique(news_data$correct_words$word)),
                  length(unique(twitter_data$correct_words$word)))
word_instances <- tibble(File = unlist(sample_name),
                       Lines = lines,
                       Words = words,
                       Unique_words = unique_words)

word_instances
## # A tibble: 3 x 4
##   File                     Lines Words Unique_words
##   <chr>                    <dbl> <int>        <int>
## 1 sample_en_US.blogs.txt    1000 13318         5769
## 2 sample_en_US.news.txt     1010 12865         5618
## 3 sample_en_US.twitter.txt  2360  9510         3998

Questions to consider

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

Coverage values for en_US.blogs.txt:

84 unique words (0.43%) are needed to cover 50% of all word instances in the document.

3847 unique words (19.8%) are needed to cover 90% of all word instances in the document.

Coverage values for en_US.news.txt:

129 unique words (0.67%) are needed to cover 50% of all word instances in the document.

4399 unique words (22.87%) are needed to cover 90% of all word instances in the document.

Coverage values for en_US.twitter.txt:

90 unique words (0.63%) are needed to cover 50% of all word instances in the document.

2550 unique words (17.89%) are needed to cover 90% of all word instances in the document.
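
A sketch of how such coverage figures can be computed; the coverage() helper below is hypothetical, and the exact numbers depend on which token set and cleaning steps are used before counting:

# How many of the most frequent unique words cover a given share of all
# word instances?
coverage <- function(tokens, threshold) {
  freq <- tokens %>%
    count(word, sort = TRUE) %>%
    mutate(cum_share = cumsum(n) / sum(n))
  which(freq$cum_share >= threshold)[1]
}

coverage(blog_data$original, 0.5)  # unique words needed for 50% coverage
coverage(blog_data$original, 0.9)  # unique words needed for 90% coverage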

How do you evaluate how many of the words come from foreign languages?

We use hunspell_check to filter out the words that come from foreign languages; hunspell_check treats foreign-language words as misspelled.
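
For illustration, hunspell_check returns FALSE for words that are not in the dictionary in use, which is why foreign-language words are dropped along with misspellings (results depend on the installed en_US dictionary):

# Words outside the en_US dictionary are typically flagged FALSE
hunspell_check(c("house", "bonjour", "gracias"))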

Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

We could use a synonym dictionary to increase coverage: words that are not in the corpus vocabulary can be mapped to known synonyms, so a smaller dictionary covers the same phrases.
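
As a rough illustration of this idea, out-of-vocabulary words could be mapped to a known synonym before counting frequencies; the lookup table below is a small hand-made example, not an actual synonym resource.

# Hypothetical synonym lookup: map variants to a canonical word
synonym_map <- c("automobile" = "car", "physician" = "doctor", "cinema" = "movie")

normalize_word <- function(w) {
  mapped <- synonym_map[w]              # NA where no synonym entry exists
  ifelse(is.na(mapped), w, unname(mapped))
}

blog_data$correct_words %>%
  mutate(word = normalize_word(word)) %>%
  count(word, sort = TRUE)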