The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.
Tasks to accomplish
Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
Questions to consider
Some words are more frequent than others - what are the distributions of word frequencies?
What are the frequencies of 2-grams and 3-grams in the dataset?
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
How do you evaluate how many of the words come from foreign languages?
Can you think of a way to increase the coverage, either by identifying words that may not be in the corpora or by using a smaller number of words in the dictionary to cover the same number of phrases? (One possible approach to this and the previous question is sketched just below this list.)
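The analysis below does not address these last two questions directly, so here is a minimal, illustrative sketch. It assumes the hunspell package is installed for an English dictionary check (hunspell is not used elsewhere in this report), reuses the SnowballC stemmer that is loaded later, and works on a handful of invented sample words rather than the corpora.
# Sketch: flag likely foreign (or misspelled) words and shrink the dictionary via stemming
library(hunspell) # English spell-check dictionary (assumed to be installed)
library(SnowballC) # Porter stemmer, also loaded later in this report
sample_words <- c("the", "running", "runs", "casa", "bonjour", "coverage")
# Words that fail the en_US dictionary check are candidates for foreign (or misspelled) terms
likely_foreign <- sample_words[!hunspell_check(sample_words)]
print(likely_foreign)
# Stemming collapses inflected forms such as "running" and "runs" onto a single entry,
# so a smaller dictionary can cover the same number of word instances
print(wordStem(sample_words, language = "english"))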
Libraries: Load several R libraries: dplyr for data manipulation, stringr for string operations, tm for text mining, SnowballC for stemming, wordcloud for word clouds, and ggplot2 for visualization (the ngram package is loaded later, when the n-grams are generated). Constants: Set the file paths to the three English-language input corpora (Twitter, blogs, and news).
# Load necessary libraries
library(dplyr) # For data manipulation
library(stringr) # For string operations
library(tm) # For text mining and preprocessing
library(SnowballC) # For stemming
library(wordcloud) # For creating word clouds
library(ggplot2) # For plotting
# Define file paths for the data
twitter_file <- "./en_US/en_US.twitter.txt"
blogs_file <- "./en_US/en_US.blogs.txt"
news_file <- "./en_US/en_US.news.txt"
In this step, we load the datasets and perform some initial exploration to understand the structure and content of the data.
# Load the datasets
twitter_data <- readLines(twitter_file, encoding = "UTF-8", warn = FALSE)
blogs_data <- readLines(blogs_file, encoding = "UTF-8", warn = FALSE)
news_data <- readLines(news_file, encoding = "UTF-8", warn = FALSE)
Convert the loaded data into data frames for easier manipulation:
# Load necessary libraries
library(dplyr) # For data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr) # For string operations
# Set a seed for reproducibility
set.seed(123)
# Create data frames
twitter_df <- data.frame(text = twitter_data, stringsAsFactors = FALSE)
blogs_df <- data.frame(text = blogs_data, stringsAsFactors = FALSE)
news_df <- data.frame(text = news_data, stringsAsFactors = FALSE)
# Sample 70,000 random rows if the dataset has more than 70,000 rows
if (nrow(twitter_df) > 70000) {
twitter_df <- twitter_df %>% sample_n(70000)
}
if (nrow(blogs_df) > 70000) {
blogs_df <- blogs_df %>% sample_n(70000)
}
if (nrow(news_df) > 70000) {
news_df <- news_df %>% sample_n(70000)
}
# Display the first few rows of each sampled dataset
cat("Sampled Twitter Data:\n")
## Sampled Twitter Data:
print(head(twitter_df, 5))
## text
## 1 just wanted to thank you & ask what got you started on your mission?
## 2 Right when I thought I was done... I ran of "sugar" for the last dessert
## 3 I tell ion gaf so why test my tolerance?
## 4 mayfly? Wish I was there. :)
## 5 follow me tho, so I can dm
cat("\nSampled Blogs Data:\n")
##
## Sampled Blogs Data:
print(head(blogs_df, 5))
## text
## 1 The present Aids-HIV epidemic -- against which the Mbeki-regime undertakes no action and still is publicly failing to properly acknowledge -- the World Health Organisation estimates that more than 6-million African South Africans will be dead within the forthcoming decade. And the Mbeki-led ANC regime, which could have undertaken a huge prevention campaign such as Uganda's a long time ago, has done nothing to stave off this terrible death rate.
## 2 4) Follow @steph_chows on Twitter. Leave a separate comment saying you did, or already do follow.
## 3 Another favourite of mine is the snowtex. I have two types, one plain and one glitter. The one with glitter is the one I tend to lend my hand too over the festive season.
## 4 So there can be seen quite an evolution and progressive intelligence throughout the Vedas Samhitas and Upanishads that reveal the changes or updates we see from antiquity to present day. The ancient peoples lived by a similar thread that ancient Pagans lived under, meaning, the lore and the divine guidance provided for one lifestyle that we now feel is harsh and barbaric. It was all survival of the fittest. People had their castes, their societal chores, kings and warriors were revered and celebrated with massive offerings and festivals. And then a new wave of human feeling appeared and no longer was it unquestioningly accepted to hear the tortured cries and bellows of the animals whose blood and trauma was meant to bring about goodwill and blessings to those who ordered the knives to their throats.
## 5 I’ve found my way around it,- lower calorie bread, if necessary, with Helmann’s Extra Light Mayo.
cat("\nSampled News Data:\n")
##
## Sampled News Data:
print(head(news_df, 5))
## text
## 1 On Saturday, he complained again. His technical with 2:44 to go against the Pacers was his seventh in 24 games.
## 2 Dingell, attributed to Winston Churchill
## 3 Austin, of the Economic Policy Institute, offers a look at a topic many don't want to broach: racial discrimination in hiring. Because teens are often looking for low-skilled, entry-level jobs, factors such as training or education often don't come into play, the disparity in their employment rates offer a chance to study such bias, he said.
## 4 "It was just a lack of effort. We weren't ready. It was embarrassing," Crawford said.
## 5 Angelina and Rose are a unique and special sibling group that deserve a family. While they experience the typical sibling squabbles, they love and depend on one another. They get excited about the thought of exploring new opportunities as they get older, and would love to experience them with a loving and caring family.
You can perform some basic exploratory analysis to understand the datasets better:
# Initial exploration
# Display the first few rows of each dataset
cat("First 5 tweets:\n")
First 5 tweets:
print(head(twitter_df, 5))
text
1 just wanted to thank you & ask what got you started on your mission?
2 Right when I thought I was done... I ran of "sugar" for the last dessert
3 I tell ion gaf so why test my tolerance?
4 mayfly? Wish I was there. :)
5 follow me tho, so I can dm
cat("\nFirst 5 blog posts:\n")
First 5 blog posts:
print(head(blogs_df, 5))
text
1 The present Aids-HIV epidemic -- against which the Mbeki-regime undertakes no action and still is publicly failing to properly acknowledge -- the World Health Organisation estimates that more than 6-million African South Africans will be dead within the forthcoming decade. And the Mbeki-led ANC regime, which could have undertaken a huge prevention campaign such as Uganda's a long time ago, has done nothing to stave off this terrible death rate.
2 4) Follow @steph_chows on Twitter. Leave a separate comment saying you did, or already do follow.
3 Another favourite of mine is the snowtex. I have two types, one plain and one glitter. The one with glitter is the one I tend to lend my hand too over the festive season.
4 So there can be seen quite an evolution and progressive intelligence throughout the Vedas Samhitas and Upanishads that reveal the changes or updates we see from antiquity to present day. The ancient peoples lived by a similar thread that ancient Pagans lived under, meaning, the lore and the divine guidance provided for one lifestyle that we now feel is harsh and barbaric. It was all survival of the fittest. People had their castes, their societal chores, kings and warriors were revered and celebrated with massive offerings and festivals. And then a new wave of human feeling appeared and no longer was it unquestioningly accepted to hear the tortured cries and bellows of the animals whose blood and trauma was meant to bring about goodwill and blessings to those who ordered the knives to their throats.
5 I’ve found my way around it,- lower calorie bread, if necessary, with Helmann’s Extra Light Mayo.
cat("\nFirst 5 news articles:\n")
First 5 news articles:
print(head(news_df, 5))
text
1 On Saturday, he complained again. His technical with 2:44 to go against the Pacers was his seventh in 24 games.
2 Dingell, attributed to Winston Churchill
3 Austin, of the Economic Policy Institute, offers a look at a topic many don't want to broach: racial discrimination in hiring. Because teens are often looking for low-skilled, entry-level jobs, factors such as training or education often don't come into play, the disparity in their employment rates offer a chance to study such bias, he said.
4 "It was just a lack of effort. We weren't ready. It was embarrassing," Crawford said.
5 Angelina and Rose are a unique and special sibling group that deserve a family. While they experience the typical sibling squabbles, they love and depend on one another. They get excited about the thought of exploring new opportunities as they get older, and would love to experience them with a loving and caring family.
# Display the number of rows for each dataset
cat("\nThere are", nrow(twitter_df), "tweets in the Twitter dataset.\n")
There are 70000 tweets in the Twitter dataset.
cat("There are", nrow(blogs_df), "blog posts in the Blogs dataset.\n")
There are 70000 blog posts in the Blogs dataset.
cat("There are", nrow(news_df), "news articles in the News dataset.\n")
There are 70000 news articles in the News dataset.
In this step, we’ll clean the text data for all three datasets.
# Load necessary libraries
library(dplyr) # For data manipulation
library(stringr) # For string operations
# Function to clean text data
clean_text <- function(data) {
data %>%
mutate(
# Convert all text to lowercase
text = str_to_lower(text),
# Remove punctuation and other non-word characters (keep letters, digits, underscores, and spaces)
text = str_replace_all(text, "[^\\w\\s]", ""),
# Trim leading and trailing whitespace
text = str_trim(text)
)
}
# Clean all datasets
twitter_cleaned <- clean_text(twitter_df)
blogs_cleaned <- clean_text(blogs_df)
news_cleaned <- clean_text(news_df)
# Display the first few rows of cleaned data for each dataset
cat("Cleaned Twitter Data:\n")
Cleaned Twitter Data:
print(head(twitter_cleaned, 5))
text
1 just wanted to thank you ask what got you started on your mission
2 right when i thought i was done i ran of sugar for the last dessert
3 i tell ion gaf so why test my tolerance
4 mayfly wish i was there
5 follow me tho so i can dm
cat("\nCleaned Blogs Data:\n")
Cleaned Blogs Data:
print(head(blogs_cleaned, 5))
text
1 the present aidshiv epidemic against which the mbekiregime undertakes no action and still is publicly failing to properly acknowledge the world health organisation estimates that more than 6million african south africans will be dead within the forthcoming decade and the mbekiled anc regime which could have undertaken a huge prevention campaign such as ugandas a long time ago has done nothing to stave off this terrible death rate
2 4 follow steph_chows on twitter leave a separate comment saying you did or already do follow
3 another favourite of mine is the snowtex i have two types one plain and one glitter the one with glitter is the one i tend to lend my hand too over the festive season
4 so there can be seen quite an evolution and progressive intelligence throughout the vedas samhitas and upanishads that reveal the changes or updates we see from antiquity to present day the ancient peoples lived by a similar thread that ancient pagans lived under meaning the lore and the divine guidance provided for one lifestyle that we now feel is harsh and barbaric it was all survival of the fittest people had their castes their societal chores kings and warriors were revered and celebrated with massive offerings and festivals and then a new wave of human feeling appeared and no longer was it unquestioningly accepted to hear the tortured cries and bellows of the animals whose blood and trauma was meant to bring about goodwill and blessings to those who ordered the knives to their throats
5 ive found my way around it lower calorie bread if necessary with helmanns extra light mayo
cat("\nCleaned News Data:\n")
Cleaned News Data:
print(head(news_cleaned, 5))
text
1 on saturday he complained again his technical with 244 to go against the pacers was his seventh in 24 games
2 dingell attributed to winston churchill
3 austin of the economic policy institute offers a look at a topic many dont want to broach racial discrimination in hiring because teens are often looking for lowskilled entrylevel jobs factors such as training or education often dont come into play the disparity in their employment rates offer a chance to study such bias he said
4 it was just a lack of effort we werent ready it was embarrassing crawford said
5 angelina and rose are a unique and special sibling group that deserve a family while they experience the typical sibling squabbles they love and depend on one another they get excited about the thought of exploring new opportunities as they get older and would love to experience them with a loving and caring family
# Create a directory for cleaned data if it doesn't exist
if (!dir.exists("./cleaned_data")) {
dir.create("./cleaned_data")
}
# Save the cleaned data into RDS files
saveRDS(twitter_cleaned, "./cleaned_data/twitter_cleaned.rds")
saveRDS(blogs_cleaned, "./cleaned_data/blogs_cleaned.rds")
saveRDS(news_cleaned, "./cleaned_data/news_cleaned.rds")
In this step, we will remove common stop words from the cleaned text data of all three datasets.
library(tm)
Loading required package: NLP
# Define the stop words
stop_words <- stopwords("en") # Load English stop words
# Function to remove stop words from text data
remove_stop_words <- function(data) {
data %>%
mutate(
text = sapply(text, function(x) {
# Split the text into words, remove stop words, and reassemble
words <- str_split(x, "\\s+")[[1]]
words <- words[!words %in% stop_words]
paste(words, collapse = " ")
})
)
}
# Remove stop words from all cleaned datasets
twitter_no_stopwords <- remove_stop_words(twitter_cleaned)
blogs_no_stopwords <- remove_stop_words(blogs_cleaned)
news_no_stopwords <- remove_stop_words(news_cleaned)
# Display the first few rows of data without stop words for each dataset
cat("Twitter Data Without Stop Words:\n")
Twitter Data Without Stop Words:
print(head(twitter_no_stopwords, 5))
text
1 just wanted thank ask got started mission
2 right thought done ran sugar last dessert
3 tell ion gaf test tolerance
4 mayfly wish
5 follow tho can dm
cat("\nBlogs Data Without Stop Words:\n")
Blogs Data Without Stop Words:
print(head(blogs_no_stopwords, 5))
text
1 present aidshiv epidemic mbekiregime undertakes action still publicly failing properly acknowledge world health organisation estimates 6million african south africans will dead within forthcoming decade mbekiled anc regime undertaken huge prevention campaign ugandas long time ago done nothing stave terrible death rate
2 4 follow steph_chows twitter leave separate comment saying already follow
3 another favourite mine snowtex two types one plain one glitter one glitter one tend lend hand festive season
4 can seen quite evolution progressive intelligence throughout vedas samhitas upanishads reveal changes updates see antiquity present day ancient peoples lived similar thread ancient pagans lived meaning lore divine guidance provided one lifestyle now feel harsh barbaric survival fittest people castes societal chores kings warriors revered celebrated massive offerings festivals new wave human feeling appeared longer unquestioningly accepted hear tortured cries bellows animals whose blood trauma meant bring goodwill blessings ordered knives throats
5 ive found way around lower calorie bread necessary helmanns extra light mayo
cat("\nNews Data Without Stop Words:\n")
News Data Without Stop Words:
print(head(news_no_stopwords, 5))
text
1 saturday complained technical 244 go pacers seventh 24 games
2 dingell attributed winston churchill
3 austin economic policy institute offers look topic many dont want broach racial discrimination hiring teens often looking lowskilled entrylevel jobs factors training education often dont come play disparity employment rates offer chance study bias said
4 just lack effort werent ready embarrassing crawford said
5 angelina rose unique special sibling group deserve family experience typical sibling squabbles love depend one another get excited thought exploring new opportunities get older love experience loving caring family
# Save the data without stop words into RDS files
saveRDS(twitter_no_stopwords, "./cleaned_data/twitter_no_stopwords.rds")
saveRDS(blogs_no_stopwords, "./cleaned_data/blogs_no_stopwords.rds")
saveRDS(news_no_stopwords, "./cleaned_data/news_no_stopwords.rds")
In this step, we split the stop-word-free text of each dataset into individual words (1-grams) and calculate the frequency of each word.
# Function to calculate word frequency
calculate_word_frequency <- function(data) {
# Split the text into individual words and unlist
words <- unlist(str_split(data$text, "\\s+"))
# Create a table of word frequencies
word_freq <- table(words)
# Convert the table to a DataFrame
word_freq_df <- as.data.frame(word_freq, stringsAsFactors = FALSE)
colnames(word_freq_df) <- c("word", "frequency")
# Sort by frequency in descending order
word_freq_df <- word_freq_df[order(-word_freq_df$frequency), ]
return(word_freq_df)
}
# Calculate word frequency for all datasets without stop words
twitter_word_freq <- calculate_word_frequency(twitter_no_stopwords)
blogs_word_freq <- calculate_word_frequency(blogs_no_stopwords)
news_word_freq <- calculate_word_frequency(news_no_stopwords)
# Display the top 10 words for each dataset
cat("Top 10 words in Twitter dataset:\n")
Top 10 words in Twitter dataset:
print(head(twitter_word_freq, 10))
word frequency
23712 im 4643
25820 just 4425
27670 like 3601
19889 get 3263
28350 love 3225
20394 good 3045
51145 will 2843
8873 can 2684
14745 dont 2648
39953 rt 2633
cat("\nTop 10 words in Blogs dataset:\n")
Top 10 words in Blogs dataset:
print(head(blogs_word_freq, 10))
word frequency
61953 one 9906
94466 will 9017
16290 can 7694
47376 just 7668
51073 like 7666
87015 time 6828
36877 get 5505
43679 im 5156
48767 know 4631
64980 people 4554
cat("\nTop 10 words in News dataset:\n")
Top 10 words in News dataset:
print(head(news_word_freq, 10))
word frequency
72720 said 17437
90303 will 7682
60593 one 5754
58259 new 4826
10204 also 4052
85715 two 3997
18757 can 3983
91693 year 3824
34252 first 3747
46603 just 3739
# Save word frequency data into RDS files
saveRDS(twitter_word_freq, "./cleaned_data/twitter_word_frequency.rds")
saveRDS(blogs_word_freq, "./cleaned_data/blogs_word_frequency.rds")
saveRDS(news_word_freq, "./cleaned_data/news_word_frequency.rds")
In this step, we will create several visualizations to better understand the word frequencies from the datasets.
# Load necessary libraries
library(ggplot2) # For plotting
Attaching package: 'ggplot2'
The following object is masked from 'package:NLP':
annotate
library(dplyr) # For data manipulation
library(tidyverse) # For data manipulation and plotting
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.5
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::annotate() masks NLP::annotate()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext) # For text processing
Attaching package: 'tidytext'
The following object is masked _by_ '.GlobalEnv':
stop_words
# The word frequency tables above were built after stop-word removal, so for this
# "with stop words" plot we recompute frequencies from the cleaned text that still
# contains stop words
twitter_word_freq_all <- calculate_word_frequency(twitter_cleaned)
blogs_word_freq_all <- calculate_word_frequency(blogs_cleaned)
news_word_freq_all <- calculate_word_frequency(news_cleaned)
# Select the top 20 words for each dataset, ordered by frequency
top_twitter_words <- twitter_word_freq_all %>%
arrange(desc(frequency)) %>% # Order by frequency in descending order
head(20) %>% # Take the top 20
mutate(dataset = "Twitter")
top_blogs_words <- blogs_word_freq_all %>%
arrange(desc(frequency)) %>% # Order by frequency in descending order
head(20) %>% # Take the top 20
mutate(dataset = "Blogs")
top_news_words <- news_word_freq_all %>%
arrange(desc(frequency)) %>% # Order by frequency in descending order
head(20) %>% # Take the top 20
mutate(dataset = "News")
# Combine the top words into one data frame
top_combined_words <- bind_rows(top_twitter_words, top_blogs_words, top_news_words)
# Create the plot using facet_wrap and reorder the words for each dataset
ggplot(top_combined_words, aes(x = reorder_within(word, frequency, dataset), y = frequency, fill = dataset)) +
geom_bar(stat = "identity", width = 0.5) +
scale_x_reordered() +
coord_flip() +
facet_wrap(~ dataset, scales = "free") + # Separate plots for each dataset
labs(title = "Top 20 Most Frequent Words in Datasets (With Stop Words)",
x = "Words",
y = "Frequency") +
theme_minimal() +
theme(legend.position = "none") # Optionally remove legend if not needed
# Load necessary libraries
library(ggplot2) # For plotting
library(dplyr) # For data manipulation
library(tidyverse) # For data manipulation and plotting
library(tidytext) # For text processing
# Filter top 20 words without stop words for Twitter dataset
top_twitter_words_no_stop <- twitter_word_freq %>%
filter(!word %in% stopwords("en")) %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(dataset = "Twitter")
# Filter top 20 words without stop words for Blogs dataset
top_blogs_words_no_stop <- blogs_word_freq %>%
filter(!word %in% stopwords("en")) %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(dataset = "Blogs")
# Filter top 20 words without stop words for News dataset
top_news_words_no_stop <- news_word_freq %>%
filter(!word %in% stopwords("en")) %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(dataset = "News")
# Combine the top words into one data frame
top_combined_words_no_stop <- bind_rows(top_twitter_words_no_stop,
top_blogs_words_no_stop,
top_news_words_no_stop)
# Create the plot using facet_wrap and reorder the words for each dataset
ggplot(top_combined_words_no_stop, aes(x = reorder_within(word, frequency, dataset), y = frequency, fill = dataset)) +
geom_bar(stat = "identity", width = 0.5) + # Adjust bar width to make them thinner
scale_x_reordered() + # Correctly reorder the x-axis
coord_flip() +
facet_wrap(~ dataset, scales = "free") + # Separate plots for each dataset
labs(title = "Top 20 Most Frequent Words in Datasets (Without Stop Words)",
x = "Words",
y = "Frequency") +
theme_minimal() +
theme(legend.position = "none") # Optionally remove legend if not needed
6.3 Histogram of Word Frequencies
# Load necessary libraries
library(ggplot2) # For plotting
library(dplyr) # For data manipulation
library(tidyverse) # For data manipulation and plotting
# Select the top 50 words for each dataset
top_50_twitter_words <- head(twitter_word_freq, 50) %>%
mutate(dataset = "Twitter")
top_50_blogs_words <- head(blogs_word_freq, 50) %>%
mutate(dataset = "Blogs")
top_50_news_words <- head(news_word_freq, 50) %>%
mutate(dataset = "News")
# Combine the top words into one data frame
top_combined_50_words <- bind_rows(top_50_twitter_words,
top_50_blogs_words,
top_50_news_words)
# Create the histogram using facet_wrap
ggplot(top_combined_50_words, aes(x = frequency, fill = dataset)) +
geom_histogram(binwidth = 1, color = "steelblue", position = "identity", alpha = 0.7) +
facet_wrap(~ dataset, scales = "free_x") + # Separate plots for each dataset with independent x-axes
labs(title = "Histogram of Word Frequencies in Datasets",
x = "Frequency",
y = "Count") +
theme_minimal() +
theme(legend.position = "none") # Optionally remove legend if not needed
In this step, we filter any remaining stop words out of the previously computed word frequency data frames for the Twitter, Blogs, and News datasets and save the results for later use.
# Remove stop words from the Twitter word frequency data
twitter_word_freq_no_stop <- twitter_word_freq[!twitter_word_freq$word %in% stopwords("en"), ]
# Display the number of words after removing stop words
cat("\nAfter removing stop words, there are", nrow(twitter_word_freq_no_stop), "unique words in the Twitter word frequency data.\n")
After removing stop words, there are 53051 unique words in the Twitter word frequency data.
# Remove stop words from the Blogs word frequency data
blogs_word_freq_no_stop <- blogs_word_freq[!blogs_word_freq$word %in% stopwords("en"), ]
# Display the number of words after removing stop words
cat("\nAfter removing stop words, there are", nrow(blogs_word_freq_no_stop), "unique words in the Blogs word frequency data.\n")
After removing stop words, there are 96903 unique words in the Blogs word frequency data.
# Remove stop words from the News word frequency data
news_word_freq_no_stop <- news_word_freq[!news_word_freq$word %in% stopwords("en"), ]
# Display the number of words after removing stop words
cat("\nAfter removing stop words, there are", nrow(news_word_freq_no_stop), "unique words in the News word frequency data.\n")
After removing stop words, there are 92426 unique words in the News word frequency data.
# Create a directory for cleaned data if it doesn't exist
if (!dir.exists("./results")) {
dir.create("./results")
}
# Save the cleaned word frequency data without stop words
saveRDS(twitter_word_freq_no_stop, "./results/twitter_word_freq_no_stop.rds")
saveRDS(blogs_word_freq_no_stop, "./results/blogs_word_freq_no_stop.rds")
saveRDS(news_word_freq_no_stop, "./results/news_word_freq_no_stop.rds")
cat("The cleaned word frequency data without stop words has been saved.\n")
The cleaned word frequency data without stop words has been saved.
A word cloud is a visual representation of word frequency, where the size of each word indicates its frequency in the dataset.
Twitter Word Cloud
# Load necessary libraries
library(dplyr) # For data manipulation
library(stringr) # For string operations
library(tm) # For text mining and preprocessing
library(SnowballC) # For stemming
library(wordcloud) # For creating word clouds
Loading required package: RColorBrewer
library(ggplot2) # For plotting
# Generate a word cloud for the Twitter word frequency data
set.seed(1234) # For reproducibility
wordcloud(words = twitter_word_freq_no_stop$word,
freq = twitter_word_freq_no_stop$frequency,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
title(main = "Twitter Word Cloud")
Blogs Word Cloud
# Generate a word cloud for the Blogs word frequency data
set.seed(1234) # For reproducibility
wordcloud(words = blogs_word_freq_no_stop$word,
freq = blogs_word_freq_no_stop$frequency,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
title(main = "Blogs Word Cloud")
News Word Cloud
# Generate a word cloud for the News word frequency data
set.seed(1234) # For reproducibility
wordcloud(words = news_word_freq_no_stop$word,
freq = news_word_freq_no_stop$frequency,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
title(main = "News Word Cloud")
We’ll use the ngram package to generate the 2-grams from the cleaned text.
Generate 2-Grams
# Load necessary libraries
library(dplyr)
library(ngram)
library(stringr)
library(ggplot2)
# Function to generate 2-grams
generate_ngrams <- function(data, dataset_name) {
# Filter to keep only entries with at least 2 words
cleaned_text <- data %>%
filter(str_count(text, "\\w+") >= 2) %>%
pull(text) # Extract the 'text' column as a character vector
# Generate 2-grams
ngrams_object <- ngram::ngram(cleaned_text, n = 2)
# Convert to a data frame
ngram_freq <- as.data.frame(ngram::get.phrasetable(ngrams_object))
# Rename columns for clarity
if (ncol(ngram_freq) == 3) {
colnames(ngram_freq) <- c("ngrams", "frequency", "prop")
} else {
stop("Unexpected number of columns in ngram_freq")
}
# Remove NA and empty strings from the 'ngrams' column
ngram_freq <- ngram_freq %>%
filter(!is.na(ngrams) & sapply(ngrams, function(x) x != ""))
# Add dataset name for later use in plotting
ngram_freq$dataset <- dataset_name
return(ngram_freq)
}
# Generate 2-grams for each dataset
twitter_2gram_freq <- generate_ngrams(twitter_cleaned, "Twitter")
blogs_2gram_freq <- generate_ngrams(blogs_cleaned, "Blogs")
news_2gram_freq <- generate_ngrams(news_cleaned, "News")
# Combine all n-gram data frames into one
all_ngrams_freq <- bind_rows(twitter_2gram_freq, blogs_2gram_freq, news_2gram_freq)
# Display the top 20 most frequent 2-grams for each dataset
top_20_ngrams <- all_ngrams_freq %>%
group_by(dataset) %>%
arrange(desc(frequency)) %>%
slice_head(n = 20) # Get top 20 for each dataset
# Display the top 20 2-grams
print(top_20_ngrams)
# A tibble: 60 × 4
# Groups: dataset [3]
ngrams frequency prop dataset
<chr> <int> <dbl> <chr>
1 "of the " 14491 0.00512 Blogs
2 "in the " 11956 0.00423 Blogs
3 "to the " 6694 0.00237 Blogs
4 "on the " 5844 0.00207 Blogs
5 "to be " 5325 0.00188 Blogs
6 "and the " 4548 0.00161 Blogs
7 "for the " 4495 0.00159 Blogs
8 "i was " 3888 0.00137 Blogs
9 "and i " 3818 0.00135 Blogs
10 "it was " 3746 0.00132 Blogs
# ℹ 50 more rows
Next, we will create a bar plot to visualize the top 20 most frequent 2-grams.
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(tidyverse) # For data manipulation
library(tidytext) # For text processing
# Select the top 20 2-grams for the Twitter dataset
top_twitter_2grams <- top_20_ngrams %>%
filter(dataset == "Twitter") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams)) # Ensure ngrams are character type
# Select the top 20 2-grams for the Blogs dataset
top_blogs_2grams <- top_20_ngrams %>%
filter(dataset == "Blogs") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams))
# Select the top 20 2-grams for the News dataset
top_news_2grams <- top_20_ngrams %>%
filter(dataset == "News") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams))
# Combine the top 2-grams into one data frame
top_combined_2grams <- bind_rows(top_twitter_2grams,
top_blogs_2grams,
top_news_2grams)
# Create the plot using facet_wrap and reorder the 2-grams for each dataset
ggplot(top_combined_2grams, aes(x = reorder_within(ngrams, frequency, dataset), y = frequency, fill = dataset)) +
geom_bar(stat = "identity", width = 0.5) + # Adjust bar width to make them thinner
scale_x_reordered() + # Correctly reorder the x-axis
coord_flip() +
facet_wrap(~ dataset, scales = "free") + # Separate plots for each dataset
labs(title = "Top 20 Most Frequent 2-Grams in Datasets",
x = "2-Grams",
y = "Frequency") +
theme_minimal() +
theme(legend.position = "none") # Optionally remove legend if not needed
We’ll use the same ngram package to generate the 3-grams from the cleaned text of all three datasets.
Generate 3-Grams
# Load necessary libraries
library(dplyr)
library(ngram)
library(stringr)
library(ggplot2)
# Function to generate n-grams
generate_ngrams <- function(data, dataset_name, n) {
# Filter to keep only entries with at least n words
cleaned_text <- data %>%
filter(str_count(text, "\\w+") >= n) %>%
pull(text) # Extract the 'text' column as a character vector
# Generate n-grams
ngrams_object <- ngram::ngram(cleaned_text, n = n)
# Convert to a data frame
ngram_freq <- as.data.frame(ngram::get.phrasetable(ngrams_object))
# Rename columns for clarity
if (ncol(ngram_freq) == 3) {
colnames(ngram_freq) <- c("ngrams", "frequency", "prop")
} else {
stop("Unexpected number of columns in ngram_freq")
}
# Remove NA and empty strings from the 'ngrams' column
ngram_freq <- ngram_freq %>%
filter(!is.na(ngrams) & sapply(ngrams, function(x) x != ""))
# Add dataset name for later use in plotting
ngram_freq$dataset <- dataset_name
return(ngram_freq)
}
# Generate 3-grams for each dataset
twitter_3gram_freq <- generate_ngrams(twitter_cleaned, "Twitter", 3)
blogs_3gram_freq <- generate_ngrams(blogs_cleaned, "Blogs", 3)
news_3gram_freq <- generate_ngrams(news_cleaned, "News", 3)
# Combine all n-gram data frames into one
all_ngrams_freq <- bind_rows(twitter_3gram_freq, blogs_3gram_freq, news_3gram_freq)
# Display the top 20 most frequent 3-grams for each dataset
top_20_ngrams <- all_ngrams_freq %>%
group_by(dataset) %>%
arrange(desc(frequency)) %>%
slice_head(n = 20) # Get top 20 for each dataset
# Display the top 20 3-grams
print(top_20_ngrams)
# A tibble: 60 × 4
# Groups: dataset [3]
ngrams frequency prop dataset
<chr> <int> <dbl> <chr>
1 "one of the " 1171 0.000424 Blogs
2 "a lot of " 917 0.000332 Blogs
3 "it was a " 544 0.000197 Blogs
4 "out of the " 520 0.000188 Blogs
5 "to be a " 513 0.000186 Blogs
6 "as well as " 510 0.000185 Blogs
7 "some of the " 502 0.000182 Blogs
8 "a couple of " 484 0.000175 Blogs
9 "the end of " 482 0.000175 Blogs
10 "i want to " 477 0.000173 Blogs
# ℹ 50 more rows
Next, we will create a bar plot to visualize the top 20 most frequent 3-grams.
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(tidyverse) # For data manipulation
library(tidytext) # For text processing
# Select the top 20 3-grams for the Twitter dataset
top_twitter_3grams <- top_20_ngrams %>%
filter(dataset == "Twitter") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams)) # Ensure ngrams are character type
# Select the top 20 3-grams for the Blogs dataset
top_blogs_3grams <- top_20_ngrams %>%
filter(dataset == "Blogs") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams))
# Select the top 20 3-grams for the News dataset
top_news_3grams <- top_20_ngrams %>%
filter(dataset == "News") %>%
arrange(desc(frequency)) %>%
head(20) %>%
mutate(ngrams = as.character(ngrams))
# Combine the top 3-grams into one data frame
top_combined_3grams <- bind_rows(top_twitter_3grams,
top_blogs_3grams,
top_news_3grams)
# Create the plot using facet_wrap and reorder the 3-grams for each dataset
ggplot(top_combined_3grams, aes(x = reorder_within(ngrams, frequency, dataset), y = frequency, fill = dataset)) +
geom_bar(stat = "identity", width = 0.5) + # Adjust bar width to make them thinner
scale_x_reordered() + # Correctly reorder the x-axis
coord_flip() +
facet_wrap(~ dataset, scales = "free") + # Separate plots for each dataset
labs(title = "Top 20 Most Frequent 3-Grams in Datasets",
x = "3-Grams",
y = "Frequency") +
theme_minimal() +
theme(legend.position = "none") # Optionally remove legend if not needed
Step 11: Coverage Analysis
11.1 Calculate Word Frequencies
First, we calculate the word frequencies from the cleaned text of all three datasets.
# Load necessary libraries
library(dplyr)
library(stringr)
# Function to calculate word frequencies from cleaned text
calculate_word_frequencies <- function(data, dataset_name) {
# Split the cleaned text into words
words <- unlist(strsplit(data$text, "\\W+")) # Use \\W+ to split by non-word characters
# Create a data frame with word frequencies
word_freq <- as.data.frame(table(words)) %>%
rename(word = words, frequency = Freq) %>%
mutate(dataset = dataset_name) # Add dataset name for identification
# Sort by frequency in descending order
word_freq <- word_freq %>%
arrange(desc(frequency))
return(word_freq)
}
# Calculate word frequencies for each dataset
twitter_word_freq <- calculate_word_frequencies(twitter_cleaned, "Twitter")
blogs_word_freq <- calculate_word_frequencies(blogs_cleaned, "Blogs")
news_word_freq <- calculate_word_frequencies(news_cleaned, "News")
# Combine the word frequency data frames into one
all_word_freq <- bind_rows(twitter_word_freq, blogs_word_freq, news_word_freq)
# Display the first few rows of word frequencies
cat("First few word frequencies:\n")
First few word frequencies:
print(head(all_word_freq, 10))
word frequency dataset
1 the 27913 Twitter
2 to 23241 Twitter
3 i 21269 Twitter
4 a 17982 Twitter
5 you 16157 Twitter
6 and 13014 Twitter
7 for 11439 Twitter
8 in 11234 Twitter
9 is 10647 Twitter
10 of 10645 Twitter
Next, we will calculate the cumulative frequency and the coverage.
# Load necessary libraries
library(dplyr)
# Function to calculate word frequencies and coverage analysis
calculate_word_frequencies_with_coverage <- function(data, dataset_name) {
# Split the cleaned text into words
words <- unlist(strsplit(data$text, "\\W+")) # Use \\W+ to split by non-word characters
# Create a data frame with word frequencies
word_freq <- as.data.frame(table(words)) %>%
rename(word = words, frequency = Freq) %>%
mutate(dataset = dataset_name) # Add dataset name for identification
# Sort by frequency in descending order
word_freq <- word_freq %>%
arrange(desc(frequency))
# Calculate cumulative frequency
word_freq$cumulative_frequency <- cumsum(word_freq$frequency)
# Calculate total words and unique words
total_words <- sum(word_freq$frequency)
unique_words <- nrow(word_freq)
# Calculate coverage percentage
word_freq$coverage_percentage <- (word_freq$cumulative_frequency / total_words) * 100
return(list(word_freq = word_freq, total_words = total_words, unique_words = unique_words))
}
# Calculate word frequencies and coverage for each dataset
twitter_results <- calculate_word_frequencies_with_coverage(twitter_cleaned, "Twitter")
blogs_results <- calculate_word_frequencies_with_coverage(blogs_cleaned, "Blogs")
news_results <- calculate_word_frequencies_with_coverage(news_cleaned, "News")
# Combine the word frequency data frames into one
all_word_freq <- bind_rows(twitter_results$word_freq,
blogs_results$word_freq,
news_results$word_freq)
# Display the first few rows of cumulative frequency and coverage percentage
cat("Cumulative frequency and coverage percentage for Twitter:\n")
## Cumulative frequency and coverage percentage for Twitter:
print(head(twitter_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 27913 Twitter 27913 3.163082
## 2 to 23241 Twitter 51154 5.796737
## 3 i 21269 Twitter 72423 8.206926
## 4 a 17982 Twitter 90405 10.244634
## 5 you 16157 Twitter 106562 12.075534
## 6 and 13014 Twitter 119576 13.550272
## 7 for 11439 Twitter 131015 14.846532
## 8 in 11234 Twitter 142249 16.119561
## 9 is 10647 Twitter 152896 17.326072
## 10 of 10645 Twitter 163541 18.532356
cat("\nCumulative frequency and coverage percentage for Blogs:\n")
##
## Cumulative frequency and coverage percentage for Blogs:
print(head(blogs_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 143947 Blogs 143947 4.966440
## 2 and 84899 Blogs 228846 7.895614
## 3 to 83081 Blogs 311927 10.762063
## 4 a 70321 Blogs 382248 13.188269
## 5 of 68024 Blogs 450272 15.535224
## 6 i 59976 Blogs 510248 17.604508
## 7 in 46169 Blogs 556417 19.197425
## 8 that 35602 Blogs 592019 20.425760
## 9 is 34104 Blogs 626123 21.602412
## 10 it 31584 Blogs 657707 22.692118
cat("\nCumulative frequency and coverage percentage for News:\n")
##
## Cumulative frequency and coverage percentage for News:
print(head(news_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 137105 News 137105 5.755469
## 2 to 62778 News 199883 8.390798
## 3 and 61810 News 261693 10.985493
## 4 a 60811 News 322504 13.538250
## 5 of 53388 News 375892 15.779401
## 6 in 46583 News 422475 17.734888
## 7 for 24335 News 446810 18.756436
## 8 that 23900 News 470710 19.759723
## 9 is 19866 News 490576 20.593669
## 10 on 18658 News 509234 21.376905
# Display total and unique words for each dataset
cat("\nTotal and Unique Words:\n")
##
## Total and Unique Words:
cat("Twitter: Total Words =", twitter_results$total_words, ", Unique Words =", twitter_results$unique_words, "\n")
## Twitter: Total Words = 882462 , Unique Words = 53171
cat("Blogs: Total Words =", blogs_results$total_words, ", Unique Words =", blogs_results$unique_words, "\n")
## Blogs: Total Words = 2898394 , Unique Words = 97030
cat("News: Total Words =", news_results$total_words, ", Unique Words =", news_results$unique_words, "\n")
## News: Total Words = 2382169 , Unique Words = 92549
Now, we will determine how many unique words are required to cover 50% and 90% of the total occurrences.
# Load necessary libraries
library(dplyr)
# Function to calculate word frequencies and coverage analysis
calculate_word_frequencies_with_coverage <- function(data, dataset_name) {
# Split the cleaned text into words
words <- unlist(strsplit(data$text, "\\W+")) # Use \\W+ to split by non-word characters
# Create a data frame with word frequencies
word_freq <- as.data.frame(table(words)) %>%
rename(word = words, frequency = Freq) %>%
mutate(dataset = dataset_name) # Add dataset name for identification
# Sort by frequency in descending order
word_freq <- word_freq %>%
arrange(desc(frequency))
# Calculate cumulative frequency
word_freq$cumulative_frequency <- cumsum(word_freq$frequency)
# Calculate total words and unique words
total_words <- sum(word_freq$frequency)
unique_words <- nrow(word_freq)
# Calculate coverage percentage
word_freq$coverage_percentage <- (word_freq$cumulative_frequency / total_words) * 100
# Identify unique words needed to cover 50% and 90%
words_for_50 <- sum(word_freq$coverage_percentage < 50) + 1
words_for_90 <- sum(word_freq$coverage_percentage < 90) + 1
return(list(word_freq = word_freq, total_words = total_words, unique_words = unique_words,
words_for_50 = words_for_50, words_for_90 = words_for_90))
}
# Calculate word frequencies and coverage for each dataset
twitter_results <- calculate_word_frequencies_with_coverage(twitter_cleaned, "Twitter")
blogs_results <- calculate_word_frequencies_with_coverage(blogs_cleaned, "Blogs")
news_results <- calculate_word_frequencies_with_coverage(news_cleaned, "News")
# Combine the word frequency data frames into one
all_word_freq <- bind_rows(twitter_results$word_freq,
blogs_results$word_freq,
news_results$word_freq)
# Display the first few rows of cumulative frequency and coverage percentage
cat("Cumulative frequency and coverage percentage for Twitter:\n")
## Cumulative frequency and coverage percentage for Twitter:
print(head(twitter_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 27913 Twitter 27913 3.163082
## 2 to 23241 Twitter 51154 5.796737
## 3 i 21269 Twitter 72423 8.206926
## 4 a 17982 Twitter 90405 10.244634
## 5 you 16157 Twitter 106562 12.075534
## 6 and 13014 Twitter 119576 13.550272
## 7 for 11439 Twitter 131015 14.846532
## 8 in 11234 Twitter 142249 16.119561
## 9 is 10647 Twitter 152896 17.326072
## 10 of 10645 Twitter 163541 18.532356
cat("\nCumulative frequency and coverage percentage for Blogs:\n")
##
## Cumulative frequency and coverage percentage for Blogs:
print(head(blogs_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 143947 Blogs 143947 4.966440
## 2 and 84899 Blogs 228846 7.895614
## 3 to 83081 Blogs 311927 10.762063
## 4 a 70321 Blogs 382248 13.188269
## 5 of 68024 Blogs 450272 15.535224
## 6 i 59976 Blogs 510248 17.604508
## 7 in 46169 Blogs 556417 19.197425
## 8 that 35602 Blogs 592019 20.425760
## 9 is 34104 Blogs 626123 21.602412
## 10 it 31584 Blogs 657707 22.692118
cat("\nCumulative frequency and coverage percentage for News:\n")
##
## Cumulative frequency and coverage percentage for News:
print(head(news_results$word_freq, 10))
## word frequency dataset cumulative_frequency coverage_percentage
## 1 the 137105 News 137105 5.755469
## 2 to 62778 News 199883 8.390798
## 3 and 61810 News 261693 10.985493
## 4 a 60811 News 322504 13.538250
## 5 of 53388 News 375892 15.779401
## 6 in 46583 News 422475 17.734888
## 7 for 24335 News 446810 18.756436
## 8 that 23900 News 470710 19.759723
## 9 is 19866 News 490576 20.593669
## 10 on 18658 News 509234 21.376905
# Display total, unique words, and coverage thresholds for each dataset
cat("\nTotal and Unique Words:\n")
##
## Total and Unique Words:
cat("Twitter: Total Words =", twitter_results$total_words,
", Unique Words =", twitter_results$unique_words,
", Words for 50% Coverage =", twitter_results$words_for_50,
", Words for 90% Coverage =", twitter_results$words_for_90, "\n")
## Twitter: Total Words = 882462 , Unique Words = 53171 , Words for 50% Coverage = 126 , Words for 90% Coverage = 5811
cat("Blogs: Total Words =", blogs_results$total_words,
", Unique Words =", blogs_results$unique_words,
", Words for 50% Coverage =", blogs_results$words_for_50,
", Words for 90% Coverage =", blogs_results$words_for_90, "\n")
## Blogs: Total Words = 2898394 , Unique Words = 97030 , Words for 50% Coverage = 109 , Words for 90% Coverage = 7036
cat("News: Total Words =", news_results$total_words,
", Unique Words =", news_results$unique_words,
", Words for 50% Coverage =", news_results$words_for_50,
", Words for 90% Coverage =", news_results$words_for_90, "\n")
## News: Total Words = 2382169 , Unique Words = 92549 , Words for 50% Coverage = 214 , Words for 90% Coverage = 9201
Finally, let’s visualize the coverage analysis using a plot.
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Combine the word frequency data frames for plotting and add a per-dataset word rank,
# so each corpus is plotted against its own number of unique words
combined_coverage_data <- bind_rows(
twitter_results$word_freq %>% mutate(dataset = "Twitter"),
blogs_results$word_freq %>% mutate(dataset = "Blogs"),
news_results$word_freq %>% mutate(dataset = "News")
) %>%
group_by(dataset) %>%
mutate(word_rank = row_number()) %>%
ungroup()
# Plot the coverage analysis for all datasets
ggplot(combined_coverage_data, aes(x = word_rank, y = coverage_percentage, color = dataset)) +
geom_line() + # Line plot for coverage percentage
geom_hline(yintercept = 50, linetype = "dashed", color = "red") + # 50% coverage line
geom_hline(yintercept = 90, linetype = "dashed", color = "green") + # 90% coverage line
labs(title = "Coverage Analysis of Word Frequencies",
x = "Number of Unique Words",
y = "Coverage Percentage") +
theme_minimal() +
scale_color_manual(values = c("blue", "orange", "green")) # Set custom colors for datasets
We will generate a word cloud using the wordcloud package based on the word frequencies calculated previously.
# Load necessary libraries
library(wordcloud)
library(RColorBrewer)
# Set seed for reproducibility
set.seed(123)
# Generate word cloud for Twitter
png("wordcloud_twitter.png", width = 800, height = 600)
wordcloud(words = twitter_results$word_freq$word,
freq = twitter_results$word_freq$frequency,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
dev.off()
## png
## 2
# Generate word cloud for Blogs
png("wordcloud_blogs.png", width = 800, height = 600)
wordcloud(words = blogs_results$word_freq$word,
freq = blogs_results$word_freq$frequency,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
dev.off()
## png
## 2
# Generate word cloud for News
png("wordcloud_news.png", width = 800, height = 600)
wordcloud(words = news_results$word_freq$word,
freq = news_results$word_freq$frequency,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
dev.off()
## png
## 2
1-Gram, 2-Gram, 3-Gram: Sequences of one, two, or three consecutive words in the text (a toy example follows this list).
Stop Words: Common words (e.g., “the”, “is”, “and”) that are often removed in text analysis because they carry little meaningful information.
Word Cloud: A visual representation of word frequency, where the size of each word reflects how often it occurs in the text.
Coverage: The number of unique words needed to account for a given percentage of all word occurrences in the text.
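As a quick illustration of the n-gram terms (using a made-up toy sentence rather than text from the corpora), the same ngram package used above can list the 1-, 2-, and 3-grams directly:
# Toy example (illustration only): list the n-grams of a short sentence
library(ngram)
toy <- "the cat sat on the mat"
unlist(strsplit(toy, " ")) # 1-grams: the individual words
get.ngrams(ngram(toy, n = 2)) # 2-grams, e.g. "the cat", "cat sat", ...
get.ngrams(ngram(toy, n = 3)) # 3-grams, e.g. "the cat sat", ...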
This process prepares the text data from all three sources (Twitter, blogs, and news) for building the predictive text model, as well as for deeper natural language processing (NLP) tasks such as sentiment analysis or topic modeling.