This analysis draws on data from two sources: the New York Times API and an additional dataset from Kaggle. After collection, the data were preprocessed and cleaned to make them suitable for analysis. Sentiment analysis was then conducted on the articles, comparing the frequency of positive and negative words between the Kaggle dataset and the New York Times articles. The findings were visualized with bar plots and word clouds to provide a clear comparison. Overall, the analysis sheds light on sentiment trends within the datasets and highlights disparities between the two sources.
In an era where sexual health awareness is increasingly important, understanding the sentiment surrounding related discussions is crucial. By analyzing articles from a reputable source like the New York Times alongside user-generated posts from a Kaggle dataset, I aim to uncover sentiment trends, identify common positive and negative themes, and ultimately contribute to a deeper understanding of public discourse surrounding sexual health. Through this project, I seek to shed light on important societal discussions and promote informed conversations about sexual health.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.3.3
library(textdata)
## Warning: package 'textdata' was built under R version 4.3.3
library(textclean)
## Warning: package 'textclean' was built under R version 4.3.3
library(jsonlite)
library(httr)
##
## Attaching package: 'httr'
## The following object is masked from 'package:textdata':
##
## cache_info
library(tm)
## Warning: package 'tm' was built under R version 4.3.3
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:httr':
##
## content
library(scrapeR)
## Warning: package 'scrapeR' was built under R version 4.3.3
library(wordcloud2)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:httr':
##
## config
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:jsonlite':
##
## flatten
Load the data using the New York Times API
url_times <- paste0('https://api.nytimes.com/svc/mostpopular/v2/emailed/7.json?api-key=',Sys.getenv("TIMES_API_KEY"))
jsonDF1 <- fromJSON(url_times)
# Convert JSON data to a data frame
df <- as.data.frame(jsonDF1$results)
# which() drops rows where section/subsection is NA instead of producing NA rows
health_articles <- df[which(df$section == "Well" | df$subsection == "Move"), ]
# Check if there are any articles in each section
if (nrow(health_articles) > 0) {
print("Health Articles:")
print(health_articles[c("title", "url")])
} else {
print("No articles found in the health section.")
}
## [1] "Health Articles:"
## title
## 6 5 Exercises for a Better Sex Life
## 7 A Peek Inside the Brains of ‘Super-Agers’
## 8 Why Your Big Sister Resents You
## 14 This Common Condition Can Damage Joints Long Before It’s Detected
## 15 How Bad Are Ultraprocessed Foods, Really?
## 19 For a Stable, Strong Core, Forget About Crunches
## url
## 6 https://www.nytimes.com/2024/04/27/well/move/sex-exercises-improve-performance.html
## 7 https://www.nytimes.com/2024/04/29/well/mind/super-agers-study.html
## 8 https://www.nytimes.com/2024/04/15/well/family/birth-order-siblings.html
## 14 https://www.nytimes.com/2024/05/02/well/osteoarthritis-diagnosis-symptoms.html
## 15 https://www.nytimes.com/2024/05/06/well/eat/ultraprocessed-foods-harmful-health.html
## 19 https://www.nytimes.com/2023/02/08/well/move/core-strength-exercises-workouts.html
article_url <- "https://www.nytimes.com/2024/04/27/well/move/sex-exercises-improve-performance.html"
article_content <- scrapeR(article_url)
#head(article_content)
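Scraping a live nytimes.com URL can fail at knit time (network hiccups, paywall or markup changes), so the call above could be wrapped defensively. A minimal sketch, where the empty-string fallback is my assumption:
# Fall back to an empty string on error so downstream tokenization
# still receives character input instead of aborting the knit
article_content <- tryCatch(
  scrapeR(article_url),
  error = function(e) ""
)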
additional_data <- read.csv("https://raw.githubusercontent.com/topkelama/Final-Project/main/cleaned_asha_data.csv")
additional_data$text <- paste(additional_data$PostTitle, additional_data$PostDescription, sep = " ")
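One caveat with paste(): if either column contains NA, the literal string "NA" ends up in the combined text. Assuming the Kaggle file can have missing titles or descriptions (an assumption on my part), dplyr::coalesce() guards against that:
# Treat missing titles/descriptions as empty strings before concatenating
additional_data$text <- paste(
  coalesce(additional_data$PostTitle, ""),
  coalesce(additional_data$PostDescription, "")
)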
Define a function to clean and tokenize text
clean_text <- function(text) {
  # Remove HTML tags before tokenizing, so tags spanning whitespace are caught
  text <- gsub("<.*?>", " ", text)
  # Convert to lowercase
  text <- tolower(text)
  # Tokenize on whitespace
  tokens <- unlist(strsplit(text, "\\s+"))
  # Strip any remaining non-letter characters
  tokens <- gsub("[^a-z]", "", tokens)
  # Drop empty tokens and stop words
  tokens <- tokens[tokens != "" & !tokens %in% stopwords("en")]
  return(tokens)
}
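A quick sanity check on a toy string (the input here is illustrative) shows tags, punctuation, case, and stop words all being stripped:
# Illustrative call only; any short HTML snippet works
clean_text("<p>Regular exercise improves overall health!</p>")
# Expected output, roughly: "regular" "exercise" "improves" "overall" "health"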
Ensure that the text data is in character format
# Extract the text column
text_data <- additional_data$text
text_data <- as.character(text_data)
# Apply cleaning and tokenization calling the function clean_text
cleaned_tokens_kaggle <- lapply(text_data, clean_text)
# Apply cleaning and tokenization
cleaned_tokens_nyt <- lapply(article_content, clean_text)
# Load NRC lexicon
nrc_lexicon <- get_sentiments("nrc")
# Add a 'value' column to the NRC lexicon dataframe with default value 1
nrc_lexicon$value <- 1
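Worth noting: NRC is a multi-label lexicon, so a single word can appear under several categories, for example under both a basic emotion like "joy" and the polarity label "positive". This is what makes the joins below many-to-many. The category breakdown can be checked directly:
# NRC has ten categories: eight basic emotions plus positive/negative polarity
nrc_lexicon %>% count(sentiment)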
df_cleaned_nyt <- data.frame(tokens = unlist(cleaned_tokens_nyt))
df_cleaned_additional <- data.frame(tokens = unlist(cleaned_tokens_kaggle))
# A word can carry several NRC sentiments, so the join is legitimately
# many-to-many; declaring the relationship silences the dplyr warning
joined_nyt_data <- inner_join(df_cleaned_nyt, nrc_lexicon,
                              by = c("tokens" = "word"),
                              relationship = "many-to-many")
joined_sexual_health <- inner_join(df_cleaned_additional, nrc_lexicon,
                                   by = c("tokens" = "word"),
                                   relationship = "many-to-many")
#joined_nyt_data
Top 10 positive words from the New York Times article
positive_words_nyt <- joined_nyt_data %>%
filter(sentiment == "positive") %>%
count(tokens, sort = TRUE) %>%
head(10)
Top 10 positive words from the Kaggle sexual health posts
positive_words_kaggle <- joined_sexual_health %>%
filter(sentiment == "positive") %>%
count(tokens, sort = TRUE) %>%
head(10)
Top 10 negative words from each source
negative_words_kaggle <- joined_sexual_health %>%
filter(sentiment == "negative") %>%
count(tokens, sort = TRUE) %>%
head(10)
negative_words_nyt <- joined_nyt_data %>%
filter(sentiment == "negative") %>%
count(tokens, sort = TRUE) %>%
head(10)
Aggregate the NYT article corpus joined with the NRC lexicon
aggregated_nyt_data <- joined_nyt_data %>%
group_by(tokens) %>%
summarise(total_value = sum(value)) %>%
ungroup()
aggregated_sexual_health <- joined_sexual_health %>%
group_by(tokens) %>%
summarise(total_value = sum(value)) %>%
ungroup()
combined_data <- inner_join(aggregated_sexual_health, aggregated_nyt_data, by = "tokens")
mean_total <- mean(combined_data$total_value.x)
median_total <- median(combined_data$total_value.x)
mode_total <- as.numeric(names(sort(table(combined_data$total_value.x), decreasing = TRUE)[1]))
# Print the central tendencies
cat("Central Tendencies for total_value:\n")
## Central Tendencies for total_value:
cat("Mean:", mean_total, "\n")
## Mean: 64.18824
cat("Median:", median_total, "\n")
## Median: 20
cat("Mode:", mode_total, "\n")
## Mode: 1
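Base R has no built-in statistical mode; the table() idiom above takes the most frequent value. For reuse it can be wrapped in a small helper (a sketch; the name stat_mode is mine):
# Most frequent value in a vector; ties resolved by table() ordering
stat_mode <- function(x) {
  as.numeric(names(sort(table(x), decreasing = TRUE))[1])
}
stat_mode(combined_data$total_value.x)  # should agree with mode_total above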
Combine the two positive-word datasets
combined_positive_words <- merge(positive_words_kaggle, positive_words_nyt, by = "tokens", all = TRUE)
# Filling NA values with 0
combined_positive_words[is.na(combined_positive_words)] <- 0
Combine the two negative-word datasets
combined_negative_words <- merge(negative_words_kaggle, negative_words_nyt, by = "tokens", all = TRUE)
combined_negative_words[is.na(combined_negative_words)] <- 0
combined_positive_words <- combined_positive_words %>%
arrange(desc(n.x)) %>%
mutate(tokens = factor(tokens, levels = tokens))
#combined_positive_words
Positive sentiments
Due to the data density, only the top 10 most frequently occurring words from each source are shown.
# Reshape to long format so both sources dodge within a single layer;
# two separate geom_bar layers would overplot rather than dodge
combined_positive_long <- combined_positive_words %>%
  pivot_longer(cols = c(n.x, n.y), names_to = "source", values_to = "n") %>%
  mutate(source = recode(source, n.x = "Kaggle", n.y = "NYT"))
ggplot(combined_positive_long, aes(x = tokens, y = n, fill = source)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Kaggle" = "blue", "NYT" = "red")) +
  labs(x = "Positive Words", y = "Frequency", title = "Comparison of Positive Words between Kaggle and NYT") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
combined_negative_words <- combined_negative_words %>%
  arrange(desc(n.x)) %>%
  mutate(tokens = factor(tokens, levels = tokens))
# Same long-format reshape as above so the two sources dodge properly
combined_negative_long <- combined_negative_words %>%
  pivot_longer(cols = c(n.x, n.y), names_to = "source", values_to = "n") %>%
  mutate(source = recode(source, n.x = "Kaggle", n.y = "NYT"))
ggplot(combined_negative_long, aes(x = tokens, y = n, fill = source)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Kaggle" = "blue", "NYT" = "red")) +
  labs(x = "Negative Words", y = "Frequency", title = "Comparison of Negative Words between Kaggle and NYT") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
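Since plotly is already loaded, either chart can be made interactive; ggplotly() with no arguments picks up the most recent ggplot, so a minimal sketch is:
# Hovering over a bar reveals the exact token and frequency
ggplotly()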
Join datasets from two different sources
# Aggregate the Kaggle data
aggregated_kaggle <- joined_sexual_health %>%
group_by(tokens, sentiment) %>%
summarise(n = n(), .groups = "drop")
# Aggregate the NY Times data
aggregated_nyt <- joined_nyt_data %>%
group_by(tokens, sentiment) %>%
summarise(n = n(), .groups = "drop")
# Join the aggregated datasets
joined_data_overall <- rbind(aggregated_nyt, aggregated_kaggle)
# Aggregate the joined dataset
aggregated_overall <- joined_data_overall %>%
group_by(tokens, sentiment) %>%
summarise(n = sum(n), .groups = "drop")
positive_tokens <- aggregated_overall %>%
filter(sentiment == "positive") %>%
select(tokens, n)
negative_tokens <- aggregated_overall %>%
filter(sentiment == "negative") %>%
select(tokens, n)
# Ensure the token frequency tables are tibbles before summarising
entire_positive_words <- as_tibble(positive_tokens)
entire_negative_words <- as_tibble(negative_tokens)
positive_word_freq <- entire_positive_words %>%
group_by(tokens) %>%
summarise(n = sum(n))
negative_word_freq <- entire_negative_words %>%
group_by(tokens) %>%
summarise(n = sum(n))
wordcloud2(positive_word_freq, size = 1.5, color = "random-dark", backgroundColor = "white")
wordcloud2(negative_word_freq, size = 2, color = "random-dark", backgroundColor = "white")
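wordcloud2() returns an HTML widget rather than a static image; if standalone copies are wanted outside the knitted document, htmlwidgets::saveWidget() can export them (the file name here is illustrative):
# Save the positive-word cloud as a self-contained HTML file
htmlwidgets::saveWidget(
  wordcloud2(positive_word_freq, size = 1.5, color = "random-dark", backgroundColor = "white"),
  "positive_wordcloud.html", selfcontained = TRUE
)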
In this analysis, I delved into the sentiment of sexual health-related articles sourced from the New York Times API and Kaggle. By scrutinizing the frequency of positive and negative words in each dataset, I uncovered both similarities and discrepancies between the two sources. Utilizing visual aids like bar plots and word clouds facilitated a comprehensive comparison, offering valuable insights into sentiment distribution across the datasets. Through this exploration, I gained a deeper understanding of sentiment trends in sexual health discourse across different platforms.