The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.
Tasks to accomplish
Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
Questions to consider
Some words are more frequent than others - what are the distributions of word frequencies?
What are the frequencies of 2-grams and 3-grams in the dataset?
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
How do you evaluate how many of the words come from foreign languages?
Can you think of a way to increase the coverage, either by identifying words that may not be in the corpora or by using a smaller number of words in the dictionary to cover the same number of phrases? (One possible approach to this and the previous question is sketched just below this list.)
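The analysis below does not address these last two questions directly, so here is a minimal, illustrative sketch. It assumes the hunspell package is installed for an English dictionary check (hunspell is not used elsewhere in this report), reuses the SnowballC stemmer that is loaded later, and works on a handful of invented sample words rather than the corpora.
# Sketch: flag likely foreign (or misspelled) words and shrink the dictionary via stemming
library(hunspell) # English spell-check dictionary (assumed to be installed)
library(SnowballC) # Porter stemmer, also loaded later in this report
sample_words <- c("the", "running", "runs", "casa", "bonjour", "coverage")
# Words that fail the en_US dictionary check are candidates for foreign (or misspelled) terms
likely_foreign <- sample_words[!hunspell_check(sample_words)]
print(likely_foreign)
# Stemming collapses inflected forms such as "running" and "runs" onto a single entry,
# so a smaller dictionary can cover the same number of word instances
print(wordStem(sample_words, language = "english"))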
Libraries: Load several R libraries: dplyr for data manipulation, stringr for string operations, tm for text mining, SnowballC for stemming, wordcloud for word clouds, and ggplot2 for visualization (the ngram package is loaded later, when the n-grams are generated). Constants: Set the file paths to the three English-language input corpora (Twitter, blogs, and news).
# Load necessary libraries
library(dplyr) # For data manipulation
library(stringr) # For string operations
library(tm) # For text mining and preprocessing
library(SnowballC) # For stemming
library(wordcloud) # For creating word clouds
library(ggplot2) # For plotting
# Define file paths for the data
twitter_file <- "./en_US/en_US.twitter.txt"
blogs_file <- "./en_US/en_US.blogs.txt"
news_file <- "./en_US/en_US.news.txt"
In this step, we load the datasets and perform some initial exploration to understand the structure and content of the data.
# Load the datasets
twitter_data <- readLines(twitter_file, encoding = "UTF-8", warn = FALSE)
blogs_data <- readLines(blogs_file, encoding = "UTF-8", warn = FALSE)
news_data <- readLines(news_file, encoding = "UTF-8", warn = FALSE)
Convert the loaded data into data frames for easier manipulation:
# Load necessary libraries
library(dplyr) # For data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr) # For string operations
# Set a seed for reproducibility
set.seed(123)
# Create data frames
twitter_df <- data.frame(text = twitter_data, stringsAsFactors = FALSE)
blogs_df <- data.frame(text = blogs_data, stringsAsFactors = FALSE)
news_df <- data.frame(text = news_data, stringsAsFactors = FALSE)
# Sample 70,000 random rows if the dataset has more than 70,000 rows
if (nrow(twitter_df) > 70000) {
twitter_df <- twitter_df %>% sample_n(70000)
}
if (nrow(blogs_df) > 70000) {
blogs_df <- blogs_df %>% sample_n(70000)
}
if (nrow(news_df) > 70000) {
news_df <- news_df %>% sample_n(70000)
}
# Display the first few rows of each sampled dataset
cat("Sampled Twitter Data:\n")
## Sampled Twitter Data:
print(head(twitter_df, 5))
## text
## 1 just wanted to thank you & ask what got you started on your mission?
## 2 Right when I thought I was done... I ran of "sugar" for the last dessert
## 3 I tell ion gaf so why test my tolerance?
## 4 mayfly? Wish I was there. :)
## 5 follow me tho, so I can dm
cat("\nSampled Blogs Data:\n")
##
## Sampled Blogs Data:
print(head(blogs_df, 5))
## text
## 1 The present Aids-HIV epidemic -- against which the Mbeki-regime undertakes no action and still is publicly failing to properly acknowledge -- the World Health Organisation estimates that more than 6-million African South Africans will be dead within the forthcoming decade. And the Mbeki-led ANC regime, which could have undertaken a huge prevention campaign such as Uganda's a long time ago, has done nothing to stave off this terrible death rate.
## 2 4) Follow @steph_chows on Twitter. Leave a separate comment saying you did, or already do follow.
## 3 Another favourite of mine is the snowtex. I have two types, one plain and one glitter. The one with glitter is the one I tend to lend my hand too over the festive season.
## 4 So there can be seen quite an evolution and progressive intelligence throughout the Vedas Samhitas and Upanishads that reveal the changes or updates we see from antiquity to present day. The ancient peoples lived by a similar thread that ancient Pagans lived under, meaning, the lore and the divine guidance provided for one lifestyle that we now feel is harsh and barbaric. It was all survival of the fittest. People had their castes, their societal chores, kings and warriors were revered and celebrated with massive offerings and festivals. And then a new wave of human feeling appeared and no longer was it unquestioningly accepted to hear the tortured cries and bellows of the animals whose blood and trauma was meant to bring about goodwill and blessings to those who ordered the knives to their throats.
## 5 I’ve found my way around it,- lower calorie bread, if necessary, with Helmann’s Extra Light Mayo.
cat("\nSampled News Data:\n")
##
## Sampled News Data:
print(head(news_df, 5))
## text
## 1 On Saturday, he complained again. His technical with 2:44 to go against the Pacers was his seventh in 24 games.
## 2 Dingell, attributed to Winston Churchill
## 3 Austin, of the Economic Policy Institute, offers a look at a topic many don't want to broach: racial discrimination in hiring. Because teens are often looking for low-skilled, entry-level jobs, factors such as training or education often don't come into play, the disparity in their employment rates offer a chance to study such bias, he said.
## 4 "It was just a lack of effort. We weren't ready. It was embarrassing," Crawford said.
## 5 Angelina and Rose are a unique and special sibling group that deserve a family. While they experience the typical sibling squabbles, they love and depend on one another. They get excited about the thought of exploring new opportunities as they get older, and would love to experience them with a loving and caring family.
You can perform some basic exploratory analysis to understand the datasets better:
# Initial exploration
# Display the first few rows of each dataset
cat("First 5 tweets:\n")
First 5 tweets:
print(head(twitter_df, 5))
text
1 just wanted to thank you & ask what got you started on your mission?
2 Right when I thought I was done... I ran of "sugar" for the last dessert
3 I tell ion gaf so why test my tolerance?
4 mayfly? Wish I was there. :)
5 follow me tho, so I can dm
cat("\nFirst 5 blog posts:\n")
First 5 blog posts:
print(head(blogs_df, 5))
text
1 The present Aids-HIV epidemic -- against which the Mbeki-regime undertakes no action and still is publicly failing to properly acknowledge -- the World Health Organisation estimates that more than 6-million African South Africans will be dead within the forthcoming decade. And the Mbeki-led ANC regime, which could have undertaken a huge prevention campaign such as Uganda's a long time ago, has done nothing to stave off this terrible death rate.
2 4) Follow @steph_chows on Twitter. Leave a separate comment saying you did, or already do follow.
3 Another favourite of mine is the snowtex. I have two types, one plain and one glitter. The one with glitter is the one I tend to lend my hand too over the festive season.
4 So there can be seen quite an evolution and progressive intelligence throughout the Vedas Samhitas and Upanishads that reveal the changes or updates we see from antiquity to present day. The ancient peoples lived by a similar thread that ancient Pagans lived under, meaning, the lore and the divine guidance provided for one lifestyle that we now feel is harsh and barbaric. It was all survival of the fittest. People had their castes, their societal chores, kings and warriors were revered and celebrated with massive offerings and festivals. And then a new wave of human feeling appeared and no longer was it unquestioningly accepted to hear the tortured cries and bellows of the animals whose blood and trauma was meant to bring about goodwill and blessings to those who ordered the knives to their throats.
5 I’ve found my way around it,- lower calorie bread, if necessary, with Helmann’s Extra Light Mayo.
cat("\nFirst 5 news articles:\n")
First 5 news articles:
print(head(news_df, 5))
text
1 On Saturday, he complained again. His technical with 2:44 to go against the Pacers was his seventh in 24 games.
2 Dingell, attributed to Winston Churchill
3 Austin, of the Economic Policy Institute, offers a look at a topic many don't want to broach: racial discrimination in hiring. Because teens are often looking for low-skilled, entry-level jobs, factors such as training or education often don't come into play, the disparity in their employment rates offer a chance to study such bias, he said.
4 "It was just a lack of effort. We weren't ready. It was embarrassing," Crawford said.
5 Angelina and Rose are a unique and special sibling group that deserve a family. While they experience the typical sibling squabbles, they love and depend on one another. They get excited about the thought of exploring new opportunities as they get older, and would love to experience them with a loving and caring family.
# Display the number of rows for each dataset
cat("\nThere are", nrow(twitter_df), "tweets in the Twitter dataset.\n")
There are 70000 tweets in the Twitter dataset.
cat("There are", nrow(blogs_df), "blog posts in the Blogs dataset.\n")
There are 70000 blog posts in the Blogs dataset.
cat("There are", nrow(news_df), "news articles in the News dataset.\n")
There are 70000 news articles in the News dataset.
In this step, we’ll clean the text data for all three datasets.
# Load necessary libraries
library(dplyr) # For data manipulation
library(stringr) # For string operations
# Function to clean text data
clean_text <- function(data) {
data %>%
mutate(
# Convert all text to lowercase
text = str_to_lower(text),
# Remove punctuation and other non-word characters (keep letters, digits, underscores, and spaces)
text = str_replace_all(text, "[^\\w\\s]", ""),
# Trim leading and trailing whitespace
text = str_trim(text)
)
}
# Clean all datasets
twitter_cleaned <- clean_text(twitter_df)
blogs_cleaned <- clean_text(blogs_df)
news_cleaned <- clean_text(news_df)
# Display the first few rows of cleaned data for each dataset
cat("Cleaned Twitter Data:\n")
Cleaned Twitter Data:
print(head(twitter_cleaned, 5))
text
1 just wanted to thank you ask what got you started on your mission
2 right when i thought i was done i ran of sugar for the last dessert
3 i tell ion gaf so why test my tolerance
4 mayfly wish i was there
5 follow me tho so i can dm
cat("\nCleaned Blogs Data:\n")
Cleaned Blogs Data:
print(head(blogs_cleaned, 5))
text
1 the present aidshiv epidemic against which the mbekiregime undertakes no action and still is publicly failing to properly acknowledge the world health organisation estimates that more than 6million african south africans will be dead within the forthcoming decade and the mbekiled anc regime which could have undertaken a huge prevention campaign such as ugandas a long time ago has done nothing to stave off this terrible death rate
2 4 follow steph_chows on twitter leave a separate comment saying you did or already do follow
3 another favourite of mine is the snowtex i have two types one plain and one glitter the one with glitter is the one i tend to lend my hand too over the festive season
4 so there can be seen quite an evolution and progressive intelligence throughout the vedas samhitas and upanishads that reveal the changes or updates we see from antiquity to present day the ancient peoples lived by a similar thread that ancient pagans lived under meaning the lore and the divine guidance provided for one lifestyle that we now feel is harsh and barbaric it was all survival of the fittest people had their castes their societal chores kings and warriors were revered and celebrated with massive offerings and festivals and then a new wave of human feeling appeared and no longer was it unquestioningly accepted to hear the tortured cries and bellows of the animals whose blood and trauma was meant to bring about goodwill and blessings to those who ordered the knives to their throats
5 ive found my way around it lower calorie bread if necessary with helmanns extra light mayo
cat("\nCleaned News Data:\n")
Cleaned News Data:
print(head(news_cleaned, 5))
text
1 on saturday he complained again his technical with 244 to go against the pacers was his seventh in 24 games
2 dingell attributed to winston churchill
3 austin of the economic policy institute offers a look at a topic many dont want to broach racial discrimination in hiring because teens are often looking for lowskilled entrylevel jobs factors such as training or education often dont come into play the disparity in their employment rates offer a chance to study such bias he said
4 it was just a lack of effort we werent ready it was embarrassing crawford said
5 angelina and rose are a unique and special sibling group that deserve a family while they experience the typical sibling squabbles they love and depend on one another they get excited about the thought of exploring new opportunities as they get older and would love to experience them with a loving and caring family
# Create a directory for cleaned data if it doesn't exist
if (!dir.exists("./cleaned_data")) {
dir.create("./cleaned_data")
}
# Save the cleaned data into RDS files
saveRDS(twitter_cleaned, "./cleaned_data/twitter_cleaned.rds")
saveRDS(blogs_cleaned, "./cleaned_data/blogs_cleaned.rds")
saveRDS(news_cleaned, "./cleaned_data/news_cleaned.rds")
In this step, we will remove common stop words from the cleaned text data of all three datasets.
library(tm)
Loading required package: NLP
# Define the stop words
stop_words <- stopwords("en") # Load English stop words
# Function to remove stop words from text data
remove_stop_words <- function(data) {
data %>%
mutate(
text = sapply(text, function(x) {
# Split the text into words, remove stop words, and reassemble
words <- str_split(x, "\\s+")[[1]]
words <- words[!words %in% stop_words]
paste(words, collapse = " ")
})
)
}
# Remove stop words from all cleaned datasets
twitter_no_stopwords <- remove_stop_words(twitter_cleaned)
blogs_no_stopwords <- remove_stop_words(blogs_cleaned)
news_no_stopwords <- remove_stop_words(news_cleaned)
# Display the first few rows of data without stop words for each dataset
cat("Twitter Data Without Stop Words:\n")
Twitter Data Without Stop Words:
print(head(twitter_no_stopwords, 5))
text
1 just wanted thank ask got started mission
2 right thought done ran sugar last dessert
3 tell ion gaf test tolerance
4 mayfly wish
5 follow tho can dm
cat("\nBlogs Data Without Stop Words:\n")
Blogs Data Without Stop Words:
print(head(blogs_no_stopwords, 5))
text
1 present aidshiv epidemic mbekiregime undertakes action still publicly failing properly acknowledge world health organisation estimates 6million african south africans will dead within forthcoming decade mbekiled anc regime undertaken huge prevention campaign ugandas long time ago done nothing stave terrible death rate
2 4 follow steph_chows twitter leave separate comment saying already follow
3 another favourite mine snowtex two types one plain one glitter one glitter one tend lend hand festive season
4 can seen quite evolution progressive intelligence throughout vedas samhitas upanishads reveal changes updates see antiquity present day ancient peoples lived similar thread ancient pagans lived meaning lore divine guidance provided one lifestyle now feel harsh barbaric survival fittest people castes societal chores kings warriors revered celebrated massive offerings festivals new wave human feeling appeared longer unquestioningly accepted hear tortured cries bellows animals whose blood trauma meant bring goodwill blessings ordered knives throats
5 ive found way around lower calorie bread necessary helmanns extra light mayo
cat("\nNews Data Without Stop Words:\n")
News Data Without Stop Words:
print(head(news_no_stopwords, 5))
text
1 saturday complained technical 244 go pacers seventh 24 games
2 dingell attributed winston churchill
3 austin economic policy institute offers look topic many dont want broach racial discrimination hiring teens often looking lowskilled entrylevel jobs factors training education often dont come play disparity employment rates offer chance study bias said
4 just lack effort werent ready embarrassing crawford said
5 angelina rose unique special sibling group deserve family experience typical sibling squabbles love depend one another get excited thought exploring new opportunities get older love experience loving caring family
# Save the data without stop words into RDS files
saveRDS(twitter_no_stopwords, "./cleaned_data/twitter_no_stopwords.rds")
saveRDS(blogs_no_stopwords, "./cleaned_data/blogs_no_stopwords.rds")
saveRDS(news_no_stopwords, "./cleaned_data/news_no_stopwords.rds")
In this step, we split the stop-word-free text of each dataset into individual words (1-grams) and calculate the frequency of each word.
# Function to calculate word frequency
calculate_word_frequency <- function(data) {
# Split the text into individual words and unlist
words <- unlist(str_split(data$text, "\\s+"))
# Create a table of word frequencies
word_freq <- table(words)
# Convert the table to a DataFrame
word_freq_df <- as.data.frame(word_freq, stringsAsFactors = FALSE)
colnames(word_freq_df) <- c("word", "frequency")
# Sort by frequency in descending order
word_freq_df <- word_freq_df[order(-word_freq_df$frequency), ]
return(word_freq_df)
}
# Calculate word frequency for all datasets without stop words
twitter_word_freq <- calculate_word_frequency(twitter_no_stopwords)
blogs_word_freq <- calculate_word_frequency(blogs_no_stopwords)
news_word_freq <- calculate_word_frequency(news_no_stopwords)
# Display the top 10 words for each dataset
cat("Top 10 words in Twitter dataset:\n")
Top 10 words in Twitter dataset:
print(head(twitter_word_freq, 10))
word frequency
23712 im 4643
25820 just 4425
27670 like 3601
19889 get 3263
28350 love 3225
20394 good 3045
51145 will 2843
8873 can 2684
14745 dont 2648
39953 rt 2633
cat("\nTop 10 words in Blogs dataset:\n")
Top 10 words in Blogs dataset:
print(head(blogs_word_freq, 10))
word frequency
61953 one 9906
94466 will 9017
16290 can 7694
47376 just 7668
51073 like 7666
87015 time 6828
36877 get 5505
43679 im 5156
48767 know 4631
64980 people 4554
cat("\nTop 10 words in News dataset:\n")
Top 10 words in News dataset:
print(head(news_word_freq, 10))
word frequency
72720 said 17437
90303 will 7682
60593 one 5754
58259 new 4826
10204 also 4052
85715 two 3997
18757 can 3983
91693 year 3824
34252 first 3747
46603 just 3739
# Save word frequency data into RDS files
saveRDS(twitter_word_freq, "./cleaned_data/twitter_word_frequency.rds")
saveRDS(blogs_word_freq, "./cleaned_data/blogs_word_frequency.rds")
saveRDS(news_word_freq, "./cleaned_data/news_word_frequency.rds")
In this step, we will create several visualizations to better understand the word frequencies from the datasets.
# Load necessary libraries
library(ggplot2) # For plotting
Attaching package: 'ggplot2'
The following object is masked from 'package:NLP':
annotate
library(dplyr) # For data manipulation
library(tidyverse) # For data manipulation and plotting
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.5
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::annotate() masks NLP::annotate()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext) # For text processing
Attaching package: 'tidytext'
The following object is masked _by_ '.GlobalEnv':
stop_words
# The word frequency tables above were built after stop-word removal, so for this
# "with stop words" plot we recompute frequencies from the cleaned text that still
# contains stop words
twitter_word_freq_all <- calculate_word_frequency(twitter_cleaned)
blogs_word_freq_all <- calculate_word_frequency(blogs_cleaned)
news_word_freq_all <- calculate_word_frequency(news_cleaned)
# Select the top 20 words for each dataset, ordered by frequency
top_twitter_words <- twitter_word_freq_all %>%
arrange(desc(frequency)) %>% # Order by frequency in descending order
head(20) %>% # Take the top 20
mutate(dataset = "Twitter")
top_blogs_words <- blogs_word_freq_all %>%
arrange(desc(frequency)) %>% # Order by frequency in descending order
head(20) %>% # Take the top 20
mutate(dataset = "Blogs")
top_news_words <- news_word_freq_all %>%
arrange(desc(frequency)) %>% # Order by frequency in descending order
head(20) %>% # Take the top 20
mutate(dataset = "News")
# Combine the top words into one data frame
top_combined_words <- bind_rows(top_twitter_words, top_blogs_words, top_news_words)
# Create the plot using facet_wrap and reorder the words for each dataset
ggplot(top_combined_words, aes(x = reorder_within(word, frequency, dataset), y = frequency, fill = dataset)) +
geom_bar(stat = "identity", width = 0.5) +
scale_x_reordered() +
coord_flip() +
facet_wrap(~ dataset, scales = "free") + # Separate plots for each dataset
labs(title = "Top 20 Most Frequent Words in Datasets (With Stop Words)",
x = "Words",
y = "Frequency") +
theme_minimal() +
theme(legend.position = "none") # Optionally remove legend if not needed
# Load necessary libraries
library(ggplot2) # For plotting
library(dplyr) # For data manipulation
library(tidyverse) # For data manipulation and plotting
library(tidytext) # For text processing
# Filter top 20 words without stop words for Twitter dataset
top_twitter_words_no_stop <- twitter_word_freq %>%
filter(!word %in% stopwords("en")) %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(dataset = "Twitter")
# Filter top 20 words without stop words for Blogs dataset
top_blogs_words_no_stop <- blogs_word_freq %>%
filter(!word %in% stopwords("en")) %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(dataset = "Blogs")
# Filter top 20 words without stop words for News dataset
top_news_words_no_stop <- news_word_freq %>%
filter(!word %in% stopwords("en")) %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(dataset = "News")
# Combine the top words into one data frame
top_combined_words_no_stop <- bind_rows(top_twitter_words_no_stop,
top_blogs_words_no_stop,
top_news_words_no_stop)
# Create the plot using facet_wrap and reorder the words for each dataset
ggplot(top_combined_words_no_stop, aes(x = reorder_within(word, frequency, dataset), y = frequency, fill = dataset)) +
geom_bar(stat = "identity", width = 0.5) + # Adjust bar width to make them thinner
scale_x_reordered() + # Correctly reorder the x-axis
coord_flip() +
facet_wrap(~ dataset, scales = "free") + # Separate plots for each dataset
labs(title = "Top 20 Most Frequent Words in Datasets (Without Stop Words)",
x = "Words",
y = "Frequency") +
theme_minimal() +
theme(legend.position = "none") # Optionally remove legend if not needed
6.3 Histogram of Word Frequencies
# Load necessary libraries
library(ggplot2) # For plotting
library(dplyr) # For data manipulation
library(tidyverse) # For data manipulation and plotting
# Select the top 50 words for each dataset
top_50_twitter_words <- head(twitter_word_freq, 50) %>%
mutate(dataset = "Twitter")
top_50_blogs_words <- head(blogs_word_freq, 50) %>%
mutate(dataset = "Blogs")
top_50_news_words <- head(news_word_freq, 50) %>%
mutate(dataset = "News")
# Combine the top words into one data frame
top_combined_50_words <- bind_rows(top_50_twitter_words,
top_50_blogs_words,
top_50_news_words)
# Create the histogram using facet_wrap
ggplot(top_combined_50_words, aes(x = frequency, fill = dataset)) +
geom_histogram(binwidth = 1, color = "steelblue", position = "identity", alpha = 0.7) +
facet_wrap(~ dataset, scales = "free_x") + # Separate plots for each dataset with independent x-axes
labs(title = "Histogram of Word Frequencies in Datasets",
x = "Frequency",
y = "Count") +
theme_minimal() +
theme(legend.position = "none") # Optionally remove legend if not needed
In this step, we filter any remaining stop words out of the previously computed word frequency data frames for the Twitter, Blogs, and News datasets and save the results for later use.
# Remove stop words from the Twitter word frequency data
twitter_word_freq_no_stop <- twitter_word_freq[!twitter_word_freq$word %in% stopwords("en"), ]
# Display the number of words after removing stop words
cat("\nAfter removing stop words, there are", nrow(twitter_word_freq_no_stop), "unique words in the Twitter word frequency data.\n")
After removing stop words, there are 53051 unique words in the Twitter word frequency data.
# Remove stop words from the Blogs word frequency data
blogs_word_freq_no_stop <- blogs_word_freq[!blogs_word_freq$word %in% stopwords("en"), ]
# Display the number of words after removing stop words
cat("\nAfter removing stop words, there are", nrow(blogs_word_freq_no_stop), "unique words in the Blogs word frequency data.\n")
After removing stop words, there are 96903 unique words in the Blogs word frequency data.
# Remove stop words from the News word frequency data
news_word_freq_no_stop <- news_word_freq[!news_word_freq$word %in% stopwords("en"), ]
# Display the number of words after removing stop words
cat("\nAfter removing stop words, there are", nrow(news_word_freq_no_stop), "unique words in the News word frequency data.\n")
After removing stop words, there are 92426 unique words in the News word frequency data.
# Create a directory for cleaned data if it doesn't exist
if (!dir.exists("./results")) {
dir.create("./results")
}
# Save the cleaned word frequency data without stop words
saveRDS(twitter_word_freq_no_stop, "./results/twitter_word_freq_no_stop.rds")
saveRDS(blogs_word_freq_no_stop, "./results/blogs_word_freq_no_stop.rds")
saveRDS(news_word_freq_no_stop, "./results/news_word_freq_no_stop.rds")
cat("The cleaned word frequency data without stop words has been saved.\n")
The cleaned word frequency data without stop words has been saved.
A word cloud is a visual representation of word frequency, where the size of each word indicates its frequency in the dataset.
Twitter Word Cloud
# Load necessary libraries
library(dplyr) # For data manipulation
library(stringr) # For string operations
library(tm) # For text mining and preprocessing
library(SnowballC) # For stemming
library(wordcloud) # For creating word clouds
Loading required package: RColorBrewer
library(ggplot2) # For plotting
# Generate a word cloud for the Twitter word frequency data
set.seed(1234) # For reproducibility
wordcloud(words = twitter_word_freq_no_stop$word,
freq = twitter_word_freq_no_stop$frequency,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
title(main = "Twitter Word Cloud")
Blogs Word Cloud
# Generate a word cloud for the Blogs word frequency data
set.seed(1234) # For reproducibility
wordcloud(words = blogs_word_freq_no_stop$word,
freq = blogs_word_freq_no_stop$frequency,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
title(main = "Blogs Word Cloud")
News Word Cloud
# Generate a word cloud for the News word frequency data
set.seed(1234) # For reproducibility
wordcloud(words = news_word_freq_no_stop$word,
freq = news_word_freq_no_stop$frequency,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
title(main = "News Word Cloud")
We’ll use the ngram package to generate the 2-grams from the cleaned text.
Generate 2-Grams
# Load necessary libraries
library(dplyr)
library(ngram)
library(stringr)
library(ggplot2)
# Function to generate 2-grams
generate_ngrams <- function(data, dataset_name) {
# Filter to keep only entries with at least 2 words
cleaned_text <- data %>%
filter(str_count(text, "\\w+") >= 2) %>%
pull(text) # Extract the 'text' column as a character vector
# Generate 2-grams
ngrams_object <- ngram::ngram(cleaned_text, n = 2)
# Convert to a data frame
ngram_freq <- as.data.frame(ngram::get.phrasetable(ngrams_object))
# Rename columns for clarity
if (ncol(ngram_freq) == 3) {
colnames(ngram_freq) <- c("ngrams", "frequency", "prop")
} else {
stop("Unexpected number of columns in ngram_freq")
}
# Remove NA and empty strings from the 'ngrams' column
ngram_freq <- ngram_freq %>%
filter(!is.na(ngrams) & sapply(ngrams, function(x) x != ""))
# Add dataset name for later use in plotting
ngram_freq$dataset <- dataset_name
return(ngram_freq)
}
# Generate 2-grams for each dataset
twitter_2gram_freq <- generate_ngrams(twitter_cleaned, "Twitter")
blogs_2gram_freq <- generate_ngrams(blogs_cleaned, "Blogs")
news_2gram_freq <- generate_ngrams(news_cleaned, "News")
# Combine all n-gram data frames into one
all_ngrams_freq <- bind_rows(twitter_2gram_freq, blogs_2gram_freq, news_2gram_freq)
# Display the top 20 most frequent 2-grams for each dataset
top_20_ngrams <- all_ngrams_freq %>%
group_by(dataset) %>%
arrange(desc(frequency)) %>%
slice_head(n = 20) # Get top 20 for each dataset
# Display the top 20 2-grams
print(top_20_ngrams)
# A tibble: 60 × 4
# Groups: dataset [3]
ngrams frequency prop dataset
<chr> <int> <dbl> <chr>
1 "of the " 14491 0.00512 Blogs
2 "in the " 11956 0.00423 Blogs
3 "to the " 6694 0.00237 Blogs
4 "on the " 5844 0.00207 Blogs
5 "to be " 5325 0.00188 Blogs
6 "and the " 4548 0.00161 Blogs
7 "for the " 4495 0.00159 Blogs
8 "i was " 3888 0.00137 Blogs
9 "and i " 3818 0.00135 Blogs
10 "it was " 3746 0.00132 Blogs
# ℹ 50 more rows
Next, we will create a bar plot to visualize the top 20 most frequent 2-grams.
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(tidyverse) # For data manipulation
library(tidytext) # For text processing
# Select the top 20 2-grams for the Twitter dataset
top_twitter_2grams <- top_20_ngrams %>%
filter(dataset == "Twitter") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams)) # Ensure ngrams are character type
# Select the top 20 2-grams for the Blogs dataset
top_blogs_2grams <- top_20_ngrams %>%
filter(dataset == "Blogs") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams))
# Select the top 20 2-grams for the News dataset
top_news_2grams <- top_20_ngrams %>%
filter(dataset == "News") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams))
# Combine the top 2-grams into one data frame
top_combined_2grams <- bind_rows(top_twitter_2grams,
top_blogs_2grams,
top_news_2grams)
# Create the plot using facet_wrap and reorder the 2-grams for each dataset
ggplot(top_combined_2grams, aes(x = reorder_within(ngrams, frequency, dataset), y = frequency, fill = dataset)) +
geom_bar(stat = "identity", width = 0.5) + # Adjust bar width to make them thinner
scale_x_reordered() + # Correctly reorder the x-axis
coord_flip() +
facet_wrap(~ dataset, scales = "free") + # Separate plots for each dataset
labs(title = "Top 20 Most Frequent 2-Grams in Datasets",
x = "2-Grams",
y = "Frequency") +
theme_minimal() +
theme(legend.position = "none") # Optionally remove legend if not needed
We’ll use the same ngram package to generate the 3-grams from the cleaned text of all three datasets.
Generate 3-Grams
# Load necessary libraries
library(dplyr)
library(ngram)
library(stringr)
library(ggplot2)
# Function to generate n-grams
generate_ngrams <- function(data, dataset_name, n) {
# Filter to keep only entries with at least n words
cleaned_text <- data %>%
filter(str_count(text, "\\w+") >= n) %>%
pull(text) # Extract the 'text' column as a character vector
# Generate n-grams
ngrams_object <- ngram::ngram(cleaned_text, n = n)
# Convert to a data frame
ngram_freq <- as.data.frame(ngram::get.phrasetable(ngrams_object))
# Rename columns for clarity
if (ncol(ngram_freq) == 3) {
colnames(ngram_freq) <- c("ngrams", "frequency", "prop")
} else {
stop("Unexpected number of columns in ngram_freq")
}
# Remove NA and empty strings from the 'ngrams' column
ngram_freq <- ngram_freq %>%
filter(!is.na(ngrams) & sapply(ngrams, function(x) x != ""))
# Add dataset name for later use in plotting
ngram_freq$dataset <- dataset_name
return(ngram_freq)
}
# Generate 3-grams for each dataset
twitter_3gram_freq <- generate_ngrams(twitter_cleaned, "Twitter", 3)
blogs_3gram_freq <- generate_ngrams(blogs_cleaned, "Blogs", 3)
news_3gram_freq <- generate_ngrams(news_cleaned, "News", 3)
# Combine all n-gram data frames into one
all_ngrams_freq <- bind_rows(twitter_3gram_freq, blogs_3gram_freq, news_3gram_freq)
# Display the top 20 most frequent 3-grams for each dataset
top_20_ngrams <- all_ngrams_freq %>%
group_by(dataset) %>%
arrange(desc(frequency)) %>%
slice_head(n = 20) # Get top 20 for each dataset
# Display the top 20 3-grams
print(top_20_ngrams)
# A tibble: 60 × 4
# Groups: dataset [3]
ngrams frequency prop dataset
<chr> <int> <dbl> <chr>
1 "one of the " 1171 0.000424 Blogs
2 "a lot of " 917 0.000332 Blogs
3 "it was a " 544 0.000197 Blogs
4 "out of the " 520 0.000188 Blogs
5 "to be a " 513 0.000186 Blogs
6 "as well as " 510 0.000185 Blogs
7 "some of the " 502 0.000182 Blogs
8 "a couple of " 484 0.000175 Blogs
9 "the end of " 482 0.000175 Blogs
10 "i want to " 477 0.000173 Blogs
# ℹ 50 more rows
Next, we will create a bar plot to visualize the top 20 most frequent 3-grams.
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(tidyverse) # For data manipulation
library(tidytext) # For text processing
# Select the top 20 3-grams for the Twitter dataset
top_twitter_3grams <- top_20_ngrams %>%
filter(dataset == "Twitter") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams)) # Ensure ngrams are character type
# Select the top 20 3-grams for the Blogs dataset
top_blogs_3grams <- top_20_ngrams %>%
filter(dataset == "Blogs") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams))
# Select the top 20 3-grams for the News dataset
top_news_3grams <- top_20_ngrams %>%
filter(dataset == "News") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams))
# Combine the top 3-grams into one data frame
top_combined_3grams <- bind_rows(top_twitter_3grams,
top_blogs_3grams,
top_news_3grams)
# Create the plot using facet_wrap and reorder the 3-grams for each dataset
ggplot(top_combined_3grams, aes(x = reorder_within(ngrams, frequency, dataset), y = frequency, fill = dataset)) +
geom_bar(stat = "identity", width = 0.5) + # Adjust bar width to make them thinner
scale_x_reordered() + # Correctly reorder the x-axis
coord_flip() +
facet_wrap(~ dataset, scales = "free") + # Separate plots for each dataset
labs(title = "Top 20 Most Frequent 3-Grams in Datasets",
x = "3-Grams",
y = "Frequency") +
theme_minimal() +
theme(legend.position = "none") # Optionally remove legend if not needed
Step 11: Coverage Analysis
11.1 Calculate Word Frequencies
First, we calculate the word frequencies from the cleaned text of all three datasets.
# Load necessary libraries
library(dplyr)
library(stringr)
# Function to calculate word frequencies from cleaned text
calculate_word_frequencies <- function(data, dataset_name) {
# Split the cleaned text into words
words <- unlist(strsplit(data$text, "\\W+")) # Use \\W+ to split by non-word characters
# Create a data frame with word frequencies
word_freq <- as.data.frame(table(words)) %>%
rename(word = words, frequency = Freq) %>%
mutate(dataset = dataset_name) # Add dataset name for identification
# Sort by frequency in descending order
word_freq <- word_freq %>%
arrange(desc(frequency))
return(word_freq)
}
# Calculate word frequencies for each dataset
twitter_word_freq <- calculate_word_frequencies(twitter_cleaned, "Twitter")
blogs_word_freq <- calculate_word_frequencies(blogs_cleaned, "Blogs")
news_word_freq <- calculate_word_frequencies(news_cleaned, "News")
# Combine the word frequency data frames into one
all_word_freq <- bind_rows(twitter_word_freq, blogs_word_freq, news_word_freq)
# Display the first few rows of word frequencies
cat("First few word frequencies:\n")
First few word frequencies:
print(head(all_word_freq, 10))
word frequency dataset
1 the 27913 Twitter
2 to 23241 Twitter
3 i 21269 Twitter
4 a 17982 Twitter
5 you 16157 Twitter
6 and 13014 Twitter
7 for 11439 Twitter
8 in 11234 Twitter
9 is 10647 Twitter
10 of 10645 Twitter
Next, we will calculate the cumulative frequency and the coverage.
# Load necessary libraries
library(dplyr)
# Function to calculate word frequencies and coverage analysis
calculate_word_frequencies_with_coverage <- function(data, dataset_name) {
# Split the cleaned text into words
words <- unlist(strsplit(data$text, "\\W+")) # Use \\W+ to split by non-word characters
# Create a data frame with word frequencies
word_freq <- as.data.frame(table(words)) %>%
rename(word = words, frequency = Freq) %>%
mutate(dataset = dataset_name) # Add dataset name for identification
# Sort by frequency in descending order
word_freq <- word_freq %>%
arrange(desc(frequency))
# Calculate cumulative frequency
word_freq$cumulative_frequency <- cumsum(word_freq$frequency)
# Calculate total words and unique words
total_words <- sum(word_freq$frequency)
unique_words <- nrow(word_freq)
# Calculate coverage percentage
word_freq$coverage_percentage <- (word_freq$cumulative_frequency / total_words) * 100
return(list(word_freq = word_freq, total_words = total_words, unique_words = unique_words))
}
# Calculate word frequencies and coverage for each dataset
twitter_results <- calculate_word_frequencies_with_coverage(twitter_cleaned, "Twitter")
blogs_results <- calculate_word_frequencies_with_coverage(blogs_cleaned, "Blogs")
news_results <- calculate_word_frequencies_with_coverage(news_cleaned, "News")
# Combine the word frequency data frames into one
all_word_freq <- bind_rows(twitter_results$word_freq,
blogs_results$word_freq,
news_results$word_freq)
# Display the first few rows of cumulative frequency and coverage percentage
cat("Cumulative frequency and coverage percentage for Twitter:\n")
## Cumulative frequency and coverage percentage for Twitter:
print(head(twitter_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 27913 Twitter 27913 3.163082
## 2 to 23241 Twitter 51154 5.796737
## 3 i 21269 Twitter 72423 8.206926
## 4 a 17982 Twitter 90405 10.244634
## 5 you 16157 Twitter 106562 12.075534
## 6 and 13014 Twitter 119576 13.550272
## 7 for 11439 Twitter 131015 14.846532
## 8 in 11234 Twitter 142249 16.119561
## 9 is 10647 Twitter 152896 17.326072
## 10 of 10645 Twitter 163541 18.532356
cat("\nCumulative frequency and coverage percentage for Blogs:\n")
##
## Cumulative frequency and coverage percentage for Blogs:
print(head(blogs_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 143947 Blogs 143947 4.966440
## 2 and 84899 Blogs 228846 7.895614
## 3 to 83081 Blogs 311927 10.762063
## 4 a 70321 Blogs 382248 13.188269
## 5 of 68024 Blogs 450272 15.535224
## 6 i 59976 Blogs 510248 17.604508
## 7 in 46169 Blogs 556417 19.197425
## 8 that 35602 Blogs 592019 20.425760
## 9 is 34104 Blogs 626123 21.602412
## 10 it 31584 Blogs 657707 22.692118
cat("\nCumulative frequency and coverage percentage for News:\n")
##
## Cumulative frequency and coverage percentage for News:
print(head(news_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 137105 News 137105 5.755469
## 2 to 62778 News 199883 8.390798
## 3 and 61810 News 261693 10.985493
## 4 a 60811 News 322504 13.538250
## 5 of 53388 News 375892 15.779401
## 6 in 46583 News 422475 17.734888
## 7 for 24335 News 446810 18.756436
## 8 that 23900 News 470710 19.759723
## 9 is 19866 News 490576 20.593669
## 10 on 18658 News 509234 21.376905
# Display total and unique words for each dataset
cat("\nTotal and Unique Words:\n")
##
## Total and Unique Words:
cat("Twitter: Total Words =", twitter_results$total_words, ", Unique Words =", twitter_results$unique_words, "\n")
## Twitter: Total Words = 882462 , Unique Words = 53171
cat("Blogs: Total Words =", blogs_results$total_words, ", Unique Words =", blogs_results$unique_words, "\n")
## Blogs: Total Words = 2898394 , Unique Words = 97030
cat("News: Total Words =", news_results$total_words, ", Unique Words =", news_results$unique_words, "\n")
## News: Total Words = 2382169 , Unique Words = 92549
Now, we will determine how many unique words are required to cover 50% and 90% of the total occurrences.
# Load necessary libraries
library(dplyr)
# Function to calculate word frequencies and coverage analysis
calculate_word_frequencies_with_coverage <- function(data, dataset_name) {
# Split the cleaned text into words
words <- unlist(strsplit(data$text, "\\W+")) # Use \\W+ to split by non-word characters
# Create a data frame with word frequencies
word_freq <- as.data.frame(table(words)) %>%
rename(word = words, frequency = Freq) %>%
mutate(dataset = dataset_name) # Add dataset name for identification
# Sort by frequency in descending order
word_freq <- word_freq %>%
arrange(desc(frequency))
# Calculate cumulative frequency
word_freq$cumulative_frequency <- cumsum(word_freq$frequency)
# Calculate total words and unique words
total_words <- sum(word_freq$frequency)
unique_words <- nrow(word_freq)
# Calculate coverage percentage
word_freq$coverage_percentage <- (word_freq$cumulative_frequency / total_words) * 100
# Identify unique words needed to cover 50% and 90%
words_for_50 <- sum(word_freq$coverage_percentage < 50) + 1
words_for_90 <- sum(word_freq$coverage_percentage < 90) + 1
return(list(word_freq = word_freq, total_words = total_words, unique_words = unique_words,
words_for_50 = words_for_50, words_for_90 = words_for_90))
}
# Calculate word frequencies and coverage for each dataset
twitter_results <- calculate_word_frequencies_with_coverage(twitter_cleaned, "Twitter")
blogs_results <- calculate_word_frequencies_with_coverage(blogs_cleaned, "Blogs")
news_results <- calculate_word_frequencies_with_coverage(news_cleaned, "News")
# Combine the word frequency data frames into one
all_word_freq <- bind_rows(twitter_results$word_freq,
blogs_results$word_freq,
news_results$word_freq)
# Display the first few rows of cumulative frequency and coverage percentage
cat("Cumulative frequency and coverage percentage for Twitter:\n")
## Cumulative frequency and coverage percentage for Twitter:
print(head(twitter_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 27913 Twitter 27913 3.163082
## 2 to 23241 Twitter 51154 5.796737
## 3 i 21269 Twitter 72423 8.206926
## 4 a 17982 Twitter 90405 10.244634
## 5 you 16157 Twitter 106562 12.075534
## 6 and 13014 Twitter 119576 13.550272
## 7 for 11439 Twitter 131015 14.846532
## 8 in 11234 Twitter 142249 16.119561
## 9 is 10647 Twitter 152896 17.326072
## 10 of 10645 Twitter 163541 18.532356
cat("\nCumulative frequency and coverage percentage for Blogs:\n")
##
## Cumulative frequency and coverage percentage for Blogs:
print(head(blogs_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 143947 Blogs 143947 4.966440
## 2 and 84899 Blogs 228846 7.895614
## 3 to 83081 Blogs 311927 10.762063
## 4 a 70321 Blogs 382248 13.188269
## 5 of 68024 Blogs 450272 15.535224
## 6 i 59976 Blogs 510248 17.604508
## 7 in 46169 Blogs 556417 19.197425
## 8 that 35602 Blogs 592019 20.425760
## 9 is 34104 Blogs 626123 21.602412
## 10 it 31584 Blogs 657707 22.692118
cat("\nCumulative frequency and coverage percentage for News:\n")
##
## Cumulative frequency and coverage percentage for News:
print(head(news_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 137105 News 137105 5.755469
## 2 to 62778 News 199883 8.390798
## 3 and 61810 News 261693 10.985493
## 4 a 60811 News 322504 13.538250
## 5 of 53388 News 375892 15.779401
## 6 in 46583 News 422475 17.734888
## 7 for 24335 News 446810 18.756436
## 8 that 23900 News 470710 19.759723
## 9 is 19866 News 490576 20.593669
## 10 on 18658 News 509234 21.376905
# Display total, unique words, and coverage thresholds for each dataset
cat("\nTotal and Unique Words:\n")
##
## Total and Unique Words:
cat("Twitter: Total Words =", twitter_results$total_words,
", Unique Words =", twitter_results$unique_words,
", Words for 50% Coverage =", twitter_results$words_for_50,
", Words for 90% Coverage =", twitter_results$words_for_90, "\n")
## Twitter: Total Words = 882462 , Unique Words = 53171 , Words for 50% Coverage = 126 , Words for 90% Coverage = 5811
cat("Blogs: Total Words =", blogs_results$total_words,
", Unique Words =", blogs_results$unique_words,
", Words for 50% Coverage =", blogs_results$words_for_50,
", Words for 90% Coverage =", blogs_results$words_for_90, "\n")
## Blogs: Total Words = 2898394 , Unique Words = 97030 , Words for 50% Coverage = 109 , Words for 90% Coverage = 7036
cat("News: Total Words =", news_results$total_words,
", Unique Words =", news_results$unique_words,
", Words for 50% Coverage =", news_results$words_for_50,
", Words for 90% Coverage =", news_results$words_for_90, "\n")
## News: Total Words = 2382169 , Unique Words = 92549 , Words for 50% Coverage = 214 , Words for 90% Coverage = 9201
Finally, let’s visualize the coverage analysis using a plot.
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Combine the word frequency data frames for plotting and add a per-dataset word rank,
# so each corpus is plotted against its own number of unique words
combined_coverage_data <- bind_rows(
twitter_results$word_freq %>% mutate(dataset = "Twitter"),
blogs_results$word_freq %>% mutate(dataset = "Blogs"),
news_results$word_freq %>% mutate(dataset = "News")
) %>%
group_by(dataset) %>%
mutate(word_rank = row_number()) %>%
ungroup()
# Plot the coverage analysis for all datasets
ggplot(combined_coverage_data, aes(x = word_rank, y = coverage_percentage, color = dataset)) +
geom_line() + # Line plot for coverage percentage
geom_hline(yintercept = 50, linetype = "dashed", color = "red") + # 50% coverage line
geom_hline(yintercept = 90, linetype = "dashed", color = "green") + # 90% coverage line
labs(title = "Coverage Analysis of Word Frequencies",
x = "Number of Unique Words",
y = "Coverage Percentage") +
theme_minimal() +
scale_color_manual(values = c("blue", "orange", "green")) # Set custom colors for datasets
We will generate a word cloud using the wordcloud package based on the word frequencies calculated previously.
# Load necessary libraries
library(wordcloud)
library(RColorBrewer)
# Set seed for reproducibility
set.seed(123)
# Generate word cloud for Twitter
png("wordcloud_twitter.png", width = 800, height = 600)
wordcloud(words = twitter_results$word_freq$word,
freq = twitter_results$word_freq$frequency,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
dev.off()
## png
## 2
# Generate word cloud for Blogs
png("wordcloud_blogs.png", width = 800, height = 600)
wordcloud(words = blogs_results$word_freq$word,
freq = blogs_results$word_freq$frequency,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
dev.off()
## png
## 2
# Generate word cloud for News
png("wordcloud_news.png", width = 800, height = 600)
wordcloud(words = news_results$word_freq$word,
freq = news_results$word_freq$frequency,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
dev.off()
## png
## 2
1-Gram, 2-Gram, 3-Gram: Sequences of one, two, or three consecutive words in the text (a toy example follows this list).
Stop Words: Common words (e.g., “the”, “is”, “and”) that are often removed in text analysis because they carry little meaningful information.
Word Cloud: A visual representation of word frequency, where the size of each word reflects how often it occurs in the text.
Coverage: The number of unique words needed to account for a given percentage of all word occurrences in the text.
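As a quick illustration of the n-gram terms (using a made-up toy sentence rather than text from the corpora), the same ngram package used above can list the 1-, 2-, and 3-grams directly:
# Toy example (illustration only): list the n-grams of a short sentence
library(ngram)
toy <- "the cat sat on the mat"
unlist(strsplit(toy, " ")) # 1-grams: the individual words
get.ngrams(ngram(toy, n = 2)) # 2-grams, e.g. "the cat", "cat sat", ...
get.ngrams(ngram(toy, n = 3)) # 3-grams, e.g. "the cat sat", ...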
This process prepares the text data from all three sources (Twitter, blogs, and news) for building the predictive text model, as well as for deeper natural language processing (NLP) tasks such as sentiment analysis or topic modeling.