Introduction

This analysis involved the collection of data from two sources: the New York Times API and an additional dataset from Kaggle. Following data collection, preprocessing and cleaning steps were performed to ensure the data’s suitability for analysis. Subsequently, sentiment analysis was conducted on the articles, comparing the frequency of positive and negative words between the Kaggle dataset and the New York Times articles. The findings were visualized using bar plots and word clouds to provide a clear comparison. Overall, this analysis sheds light on the sentiment trends within the datasets and highlights any disparities between the two sources.

Data Science Workflow OSEMN

  1. Obtain: I loaded an article from the New York Times Most Popular API and an additional dataset from Kaggle.
  2. Scrub: I defined a function to clean and tokenize the text, using regular expressions along with key R functions we learned this semester, namely strsplit(), gsub(), and unlist().
  3. Explore: I performed sentiment analysis on both corpora using the NRC lexicon and compared the 10 most frequently occurring words in each. I compared positive and negative words between the two corpora, aggregated the sentiment values, and computed measures of central tendency. Finally, I joined the corpora and plotted the combined result with the wordcloud2() function.
  4. Model: In this project I focused on data acquisition and data tidying, so I skipped the modeling step.
  5. Interpret: I added a conclusion at the end of the code blocks that interprets the results and draws conclusions from the project.

Motivation

In an era where sexual health awareness is increasingly important, understanding the sentiment surrounding related discussions is crucial. By analyzing articles from reputable sources like the New York Times and user-generated content from platforms like Kaggle, I aim to uncover sentiment trends, identify common positive and negative themes, and ultimately contribute to a deeper understanding of public discourse surrounding sexual health. Through this project, I seek to shed light on important societal discussions and promote informed conversations about sexual health.

Load required libraries

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.3.3
library(textdata)
## Warning: package 'textdata' was built under R version 4.3.3
library(textclean)
## Warning: package 'textclean' was built under R version 4.3.3
library(jsonlite)
library(httr)
## 
## Attaching package: 'httr'
## The following object is masked from 'package:textdata':
## 
##     cache_info
library(tm)
## Warning: package 'tm' was built under R version 4.3.3
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:httr':
## 
##     content
library(scrapeR)
## Warning: package 'scrapeR' was built under R version 4.3.3
library(wordcloud2)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:httr':
## 
##     config
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(purrr)
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:jsonlite':
## 
##     flatten

Data Collection

Load the data using the New York Times API

url_times <- paste0('https://api.nytimes.com/svc/mostpopular/v2/emailed/7.json?api-key=',Sys.getenv("TIMES_API_KEY"))

jsonDF1 <- fromJSON(url_times) 
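Before filtering, it helps to confirm what the API actually returned. A minimal sanity check (the exact fields depend on the endpoint's response; section and subsection are assumed from the filtering step below):

# Peek at the parsed response: "results" holds one entry per article
str(jsonDF1$results, max.level = 1)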

Filter articles based on section

# Convert JSON data to a data frame
df <- as.data.frame(jsonDF1$results)

health_articles <- df[df$section == "Well" | df$subsection == "Move", ]
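For reference, the same filter can be written with dplyr (a hypothetical alternative, not used downstream). One behavioral difference worth knowing: filter() silently drops rows where the condition evaluates to NA, while base-R subsetting keeps them as all-NA rows.

# Equivalent dplyr version of the base-R subset above
health_articles_dplyr <- df %>%
  filter(section == "Well" | subsection == "Move")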

Select an appropriate article

# Check if there are any articles in each section
if (nrow(health_articles) > 0) {
  print("Health Articles:")
  print(health_articles[c("title", "url")])
} else {
  print("No articles found in the health section.")
}
## [1] "Health Articles:"
##                                                                title
## 6                                  5 Exercises for a Better Sex Life
## 7                          A Peek Inside the Brains of ‘Super-Agers’
## 8                                    Why Your Big Sister Resents You
## 14 This Common Condition Can Damage Joints Long Before It’s Detected
## 15                         How Bad Are Ultraprocessed Foods, Really?
## 19                  For a Stable, Strong Core, Forget About Crunches
##                                                                                     url
## 6   https://www.nytimes.com/2024/04/27/well/move/sex-exercises-improve-performance.html
## 7                   https://www.nytimes.com/2024/04/29/well/mind/super-agers-study.html
## 8              https://www.nytimes.com/2024/04/15/well/family/birth-order-siblings.html
## 14       https://www.nytimes.com/2024/05/02/well/osteoarthritis-diagnosis-symptoms.html
## 15 https://www.nytimes.com/2024/05/06/well/eat/ultraprocessed-foods-harmful-health.html
## 19   https://www.nytimes.com/2023/02/08/well/move/core-strength-exercises-workouts.html

Scrape the content

article_url <- "https://www.nytimes.com/2024/04/27/well/move/sex-exercises-improve-performance.html"

article_content <- scrapeR(article_url)
#head(article_content)

Load additional data from a different source

additional_data <- read.csv("https://raw.githubusercontent.com/topkelama/Final-Project/main/cleaned_asha_data.csv")
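Since the next step assumes the PostTitle and PostDescription columns exist, a small defensive check (a sketch; the column names are taken from the step below) can fail fast if the CSV layout ever changes:

# Stop early if the expected Kaggle columns are missing
stopifnot(all(c("PostTitle", "PostDescription") %in% names(additional_data)))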

Combine the title and description into a single column

additional_data$text <- paste(additional_data$PostTitle, additional_data$PostDescription, sep = " ")

Data Transformation and Preprocessing

Define a function to clean and tokenize text

clean_text <- function(text) {
  # Tokenize text on whitespace
  tokens <- unlist(strsplit(text, "\\s+"))
  # Remove HTML tags
  cleaned_tokens <- gsub("<.*?>", "", tokens)
  # Convert to lowercase
  cleaned_tokens <- tolower(cleaned_tokens)
  # Remove anything that is not a letter (punctuation, digits, special characters)
  cleaned_tokens <- gsub("[^a-z]", "", cleaned_tokens)
  # Remove stop words
  cleaned_tokens <- cleaned_tokens[!cleaned_tokens %in% stopwords("en")]
  # Drop tokens that became empty strings after cleaning
  cleaned_tokens <- cleaned_tokens[cleaned_tokens != ""]
  return(cleaned_tokens)
}
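A quick illustration of the function on a made-up snippet (hypothetical input; exactly which tokens survive depends on tm's English stopword list):

# HTML tags, punctuation, case, stop words, and empty tokens are all stripped
clean_text("<p>Regular exercise CAN improve your health!</p>")
# returns tokens along the lines of "regular" "exercise" "improve" "health"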

Call the clean_text function

Ensure that the text data is in character format

# Extract the text column
text_data <- additional_data$text

text_data <- as.character(text_data)

Call the clean_text function and apply it to the Kaggle data

# Apply cleaning and tokenization calling the function clean_text
cleaned_tokens_kaggle <- lapply(text_data, clean_text)

Call the clean_text function and apply it to the NYT article data

# Apply cleaning and tokenization
cleaned_tokens_nyt <- lapply(article_content, clean_text)

Load Lexicon

 # Load NRC lexicon
nrc_lexicon <- get_sentiments("nrc")
# Add a 'value' column to the NRC lexicon dataframe with default value 1
nrc_lexicon$value <- 1
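For orientation: each row of the lexicon pairs a word with one of ten NRC categories (eight emotions plus positive and negative), so a single word can appear on several rows.

# Peek at the word/sentiment pairs and the per-category counts
head(nrc_lexicon)
count(nrc_lexicon, sentiment, sort = TRUE)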

Parse to data frames

df_cleaned_nyt <- data.frame(tokens = unlist(cleaned_tokens_nyt))

df_cleaned_additional <- data.frame(tokens = unlist(cleaned_tokens_kaggle))

Join with NRC lexicon to calculate sentiment scores

# A word can carry several NRC sentiments, so this join is expected to be
# many-to-many; declaring the relationship explicitly silences dplyr's warning
joined_nyt_data <- inner_join(df_cleaned_nyt, nrc_lexicon,
                              by = c("tokens" = "word"),
                              relationship = "many-to-many")
joined_sexual_health <- inner_join(df_cleaned_additional, nrc_lexicon,
                                   by = c("tokens" = "word"),
                                   relationship = "many-to-many")
#joined_nyt_data
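The many-to-many relationship arises because a single word can belong to several NRC categories at once. For example (illustrative; the exact rows depend on your lexicon version):

# One word, multiple sentiment rows, hence the many-to-many join
nrc_lexicon %>% filter(word == "love")
# typically returns rows for both "joy" and "positive"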

Analysis

Top 10 positive words from the New York Times article

positive_words_nyt <- joined_nyt_data %>%
  filter(sentiment == "positive") %>%
  count(tokens, sort = TRUE) %>%
  head(10)

Top 10 positive words from the Kaggle sexual health posts

positive_words_kaggle <- joined_sexual_health %>%
  filter(sentiment == "positive") %>%
  count(tokens, sort = TRUE) %>%
  head(10)

Top 10 negative words from the Kaggle sexual health data

negative_words_kaggle <- joined_sexual_health %>%
  filter(sentiment == "negative") %>%
  count(tokens, sort = TRUE) %>%
  head(10)

Top 10 negative words from the NY Times article

negative_words_nyt <- joined_nyt_data %>%
  filter(sentiment == "negative") %>%
  count(tokens, sort = TRUE) %>%
  head(10)

Aggregation

Aggregate the NYT article corpus using the NRC lexicon values

aggregated_nyt_data <- joined_nyt_data %>%
  group_by(tokens) %>%
  summarise(total_value = sum(value)) %>%
  ungroup()

Aggregate the sexual health post corpus using the NRC lexicon values

aggregated_sexual_health <- joined_sexual_health %>%
  group_by(tokens) %>%
  summarise(total_value = sum(value)) %>%
  ungroup()

Combine two aggregated data sets

combined_data <- inner_join(aggregated_sexual_health, aggregated_nyt_data, by = "tokens")

Calculate central tendencies for total_value (Kaggle counts)

# These summaries describe the Kaggle-side counts (total_value.x) in the combined data
mean_total <- mean(combined_data$total_value.x)
median_total <- median(combined_data$total_value.x)
# Mode: the most frequently occurring count value
mode_total <- as.numeric(names(sort(table(combined_data$total_value.x), decreasing = TRUE)[1]))

# Print the central tendencies
cat("Central Tendencies for total_value:\n")
## Central Tendencies for total_value:
cat("Mean:", mean_total, "\n")
## Mean: 64.18824
cat("Median:", median_total, "\n")
## Median: 20
cat("Mode:", mode_total, "\n")
## Mode: 1
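The table() trick above works, but since base R has no statistical mode function (the built-in mode() reports an object's storage mode), a small reusable helper is another option (a hypothetical helper, not part of the original analysis):

# Hypothetical helper: return the most frequent value in a vector
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
stat_mode(combined_data$total_value.x)  # should agree with mode_total above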

Compare positive and negative words after merging the two datasets into one

Combine the two positive-word datasets

combined_positive_words <- merge(positive_words_kaggle, positive_words_nyt, by = "tokens", all = TRUE)

# Filling NA values with 0
combined_positive_words[is.na(combined_positive_words)] <- 0

Combine two negative datasets

combined_negative_words <- merge(negative_words_kaggle, negative_words_nyt, by = "tokens", all = TRUE)
combined_negative_words[is.na(combined_negative_words)] <- 0
combined_positive_words <- combined_positive_words %>%
  arrange(desc(n.x)) %>%
  mutate(tokens = factor(tokens, levels = tokens))
#combined_positive_words

Visualize with bar plots using ggplot()

Positive sentiments

Due to the data density, only the top 10 most frequently occurring words are shown.

# Reshape to long format so the two sources dodge side by side; layering two
# geom_bar() calls with position = "dodge" would overplot one set of bars on
# the other rather than placing them next to each other
combined_positive_words %>%
  pivot_longer(cols = c(n.x, n.y), names_to = "source", values_to = "n") %>%
  mutate(source = recode(source, n.x = "Kaggle", n.y = "NYT")) %>%
  ggplot(aes(x = tokens, y = n, fill = source)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Kaggle" = "blue", "NYT" = "red")) +
  labs(x = "Positive Words", y = "Frequency", title = "Comparison of Positive Words between Kaggle and NYT") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

combined_negative_words <- combined_negative_words %>%
  arrange(desc(n.x)) %>%
  mutate(tokens = factor(tokens, levels = tokens))

Negative Sentiments

combined_negative_words %>%
  pivot_longer(cols = c(n.x, n.y), names_to = "source", values_to = "n") %>%
  mutate(source = recode(source, n.x = "Kaggle", n.y = "NYT")) %>%
  ggplot(aes(x = tokens, y = n, fill = source)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Kaggle" = "blue", "NYT" = "red")) +
  labs(x = "Negative Words", y = "Frequency", title = "Comparison of Negative Words between Kaggle and NYT") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Now, let's examine the sentiments across the entire combined dataset with word clouds

Combine the datasets from the two sources

# Aggregate the Kaggle data
aggregated_kaggle <- joined_sexual_health %>%
  group_by(tokens, sentiment) %>%
  summarise(n = n(), .groups = "drop")

# Aggregate the NY Times data
aggregated_nyt <- joined_nyt_data %>%
  group_by(tokens, sentiment) %>%
  summarise(n = n(), .groups = "drop")

# Stack the two aggregated datasets row-wise
joined_data_overall <- rbind(aggregated_nyt, aggregated_kaggle)

# Aggregate the joined dataset
aggregated_overall <- joined_data_overall %>%
  group_by(tokens, sentiment) %>%
  summarise(n = sum(n), .groups = "drop")

Filter positive and negative tokens from both datasets

positive_tokens <- aggregated_overall %>%
  filter(sentiment == "positive") %>%
  select(tokens, n)

negative_tokens <- aggregated_overall %>%
  filter(sentiment == "negative") %>%
  select(tokens, n)
# Convert the filtered token counts to tibbles for the tidyverse verbs below
entire_positive_words <- as_tibble(positive_tokens)
entire_negative_words <- as_tibble(negative_tokens)

Plot with wordcloud

positive_word_freq <- entire_positive_words %>%
  group_by(tokens) %>%
  summarise(n = sum(n))
negative_word_freq <- entire_negative_words %>%
  group_by(tokens) %>%
  summarise(n = sum(n))
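wordcloud2() reads the first column of its input as the word and the second as the frequency, so the (tokens, n) layout above should work directly; renaming to the conventional word/freq names is optional (a cosmetic sketch, assuming that positional behavior):

# Optional: rename to wordcloud2's conventional column names
positive_word_freq <- rename(positive_word_freq, word = tokens, freq = n)
negative_word_freq <- rename(negative_word_freq, word = tokens, freq = n)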

Positive Sentiments in WordCloud

wordcloud2(positive_word_freq, size = 1.5, color = "random-dark", backgroundColor = "white")

Negative Sentiments in WordCloud

wordcloud2(negative_word_freq, size = 2, color = "random-dark", backgroundColor = "white")

Conclusion

In this analysis, I delved into the sentiment of sexual health-related articles sourced from the New York Times API and Kaggle. By scrutinizing the frequency of positive and negative words in each dataset, I uncovered both similarities and discrepancies between the two sources. Utilizing visual aids like bar plots and word clouds facilitated a comprehensive comparison, offering valuable insights into sentiment distribution across the datasets. Through this exploration, I gained a deeper understanding of sentiment trends in sexual health discourse across different platforms.