In the digital age, the pervasive impact of artificial intelligence (AI) is undeniable, influencing an ever-expanding range of industries and human activities. This influence, while frequently beneficial, has sparked widespread debate about its implications for society. This final project for ECI 588 investigates public perceptions of AI by analyzing sentiments expressed in nearly 10,000 comments on a YouTube documentary about AI's global impact. The documentary (https://www.youtube.com/watch?v=s0dMTAQM4cw), which has drawn more than 10.5 million viewers, provides fertile ground for examining public discourse surrounding AI developments in Europe, the United States, China, and elsewhere, making its comment section a rich source of public opinion about AI. Comments were collected through April 20, 2023, and new comments were still being posted at that time.
This project is driven by a desire to uncover the depth and nuance of public sentiment on AI. The specific research questions this study aims to answer include:
What specific aspects of AI are most concerning or exciting to the public? [3.1.1]
What are the predominant sentiments (positive, negative, or neutral) expressed by the public concerning AI? [3.2.1]
How do these perceptions align with recent developments and portrayals of AI in media? [3.3.1]
The findings from this study are anticipated to be of high relevance to AI developers, researchers, and policymakers. By understanding the public’s hopes and fears regarding AI, these stakeholders can guide the development of AI technologies in a manner that is more likely to be accepted and supported by the public. For instance, if privacy concerns are prominent among negative sentiments, developers can prioritize enhancing data protection measures in their AI systems.
The primary data source for this project is nearly 10,000 comments extracted from a YouTube documentary video about artificial intelligence. The comments were collected using custom scripts set up in Google Sheets within Google Workspace. Despite initial difficulties, this method enabled efficient and systematic data collection under Dr. Jiang’s supervision.
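For readers who would rather script the collection step directly in R instead of Google Sheets, the sketch below shows one possible approach using the public YouTube Data API v3. The endpoint and field names come from that API; the helper function name and the YT_API_KEY environment variable are illustrative and are not part of this project's actual pipeline.
# Sketch only: pulling comment threads for the documentary via the YouTube Data API v3.
# This is NOT the method used in this project (which relied on Google Sheets scripts);
# the helper name and the YT_API_KEY environment variable are illustrative.
library(httr)
library(jsonlite)

get_comment_page <- function(video_id, api_key, page_token = NULL) {
  resp <- GET(
    "https://www.googleapis.com/youtube/v3/commentThreads",
    query = list(part = "snippet", videoId = video_id, maxResults = 100,
                 textFormat = "plainText", key = api_key, pageToken = page_token)
  )
  fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)
}

page <- get_comment_page("s0dMTAQM4cw", Sys.getenv("YT_API_KEY"))
page_comments <- tibble::tibble(
  Time    = page$items$`snippet.topLevelComment.snippet.publishedAt`,
  Comment = page$items$`snippet.topLevelComment.snippet.textDisplay`
)
# Feeding page$nextPageToken back into get_comment_page() walks through the remaining pages.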
Two main sources guided the data collection:
The initial AI_YouTube_comments data is organized into ten variables:
To ensure quality and relevance for analysis, I removed non-content fields such as Channel URL, Name, the number of likes, and the number of replies, leaving only the textual content of each comment and its timestamp.
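In code, that cleaning step amounts to keeping two columns. A minimal dplyr sketch is shown below; the column names are assumed to match the raw sheet, which is imported as ayc_raw in the Methods section later on.
# Sketch: keep only the comment text and its timestamp; all other exported
# fields (Channel URL, Name, like counts, reply counts, ...) are dropped.
ayc_clean <- ayc_raw %>%
  dplyr::select(Comment, Time)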
The following packages were installed and/or loaded to prepare for this project.
# Load necessary packages
library(tidyverse)
library(tidytext)
library(readxl)
library(tm)
library(stopwords)
library(igraph)
library(ggraph)
library(wordcloud)
library(RColorBrewer)
library(lubridate)
library(SnowballC)
To uncover sentiment trends and contextual shifts over time, the data will be meticulously refined. This involves removing stopwords and utilizing bi-grams to expose the most frequent words in the comments, thereby highlighting each word’s prevalence to better understand the underlying sentiments. Additionally, analyzing comment activity during specific periods will help pinpoint anomalies that may be influenced by external events.
# Import Data
ayc_raw <- read_excel("588/data/AI _YouTube_comments.xlsx")
# Preprocess the data: lowercase the text, then strip punctuation and digits
ayc_raw <- ayc_raw %>%
  mutate(
    Time = tolower(Time),
    Comment = tolower(Comment),
    text = removePunctuation(Comment),
    text = removeNumbers(text)   # removeNumbers() already strips all digits
  )
# Tokenization and cleaning
tds_bigrams <- ayc_raw %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, into = c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word,
word1 != "", !is.na(word1), word2 != "", !is.na(word2)) %>%
mutate(word1 = wordStem(word1), word2 = wordStem(word2)) %>%
unite(bigram, c(word1, word2), sep = " ")
# Analysis of top 25 bigrams
bigram_top_tokens <- tds_bigrams %>%
count(bigram, sort = TRUE) %>%
top_n(25)
# Plotting the top 25 bigrams
ggplot(bigram_top_tokens, aes(x=reorder(bigram, n), y=n)) +
geom_bar(stat="identity", fill="orange") +
labs(x = "Bigram", y = "Frequency", title = "Top 25 Bigrams in Comments") +
coord_flip()
The top 25 bigrams show that commenters are most concerned with AI technology and algorithms and their impact on human society, spanning autonomous driving, religion, race, lifestyle changes, and more. With the help of AI-assisted analysis, a word cloud was also generated from the same data frame.
The word network above visualizes the top 100 bigrams from the comments dataset. Each node represents a word, and each edge connects words that frequently appear together as a bigram. The network illustrates the contextual relationships between words, showing how certain topics and phrases are interconnected within the comments and revealing common phrases that capture viewers' interests, concerns, and reactions.
Through the word cloud, we can observe that the most frequently mentioned topics concern the relationship between AI and humans, which appears to be a major public interest. Additionally, political, religious, and economic factors are also believed to significantly impact the development of AI.
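The word cloud itself was produced outside the chunks shown here. For completeness, the sketch below would generate a comparable unigram cloud with the already-loaded tidytext and wordcloud packages; the specific settings (seed, word count, palette) are assumptions, not those of the original figure.
# Sketch: unigram word cloud from the cleaned comment text (settings assumed)
word_counts <- ayc_raw %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

set.seed(588)
wordcloud(words = word_counts$word, freq = word_counts$n,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))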
# Analyze the sentiment of the public over time
comments <- tds_bigrams %>%
mutate(
DateTime = ymd_hms(Time),
Year = year(DateTime)
)
sentiment_by_year <- comments %>%
  unnest_tokens(word, Comment) %>%
  # NRC assigns some words to multiple sentiment categories, so the join is many-to-many
  inner_join(get_sentiments("nrc"), by = "word", relationship = "many-to-many") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(Year, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0)) %>%
  mutate(sentiment_score = positive - negative)
# Plotting sentiment score over time
ggplot(sentiment_by_year, aes(x = Year, y = sentiment_score, fill = Year)) +
geom_col(show.legend = FALSE) +
labs(title = "Sentiment Score Over Time", x = "Year", y = "Net Sentiment Score") +
theme_minimal()
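The data exploration plan above also calls for examining comment activity over specific periods to pinpoint anomalies. A short sketch of that check is below; it reuses the comments table built in the previous chunk, and the distinct() step is needed because that table has one row per bigram rather than per comment.
# Sketch: daily comment volume, to spot activity spikes around external events
daily_volume <- comments %>%
  distinct(Comment, DateTime) %>%      # collapse back to one row per comment
  mutate(Date = as_date(DateTime)) %>%
  count(Date, name = "comments")

ggplot(daily_volume, aes(x = Date, y = comments)) +
  geom_line(color = "steelblue") +
  labs(title = "Daily Volume of Comments", x = "Date", y = "Number of comments") +
  theme_minimal()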
# Network analysis
bigram_network <- tds_bigrams %>%
count(bigram, sort = TRUE) %>%
top_n(100) %>%
separate(bigram, into = c("source", "target"), sep = " ") %>%
mutate(weight = n)
graph <- graph_from_data_frame(bigram_network, directed = FALSE)
# Visualize word network
ggraph(graph, layout = "fr") +
geom_edge_link(aes(width = weight), alpha = 0.5) +
geom_node_point(color = "pink", size = 5) +
geom_node_text(aes(label = name), vjust = 1.8, repel = TRUE, size = 3) +
theme_minimal() +
ggtitle("Network Visualization of Top 100 Bigrams")
# Sentiment distribution over time, analyzed with ChatGPT; the bar chart below shows the result.
knitr::include_graphics("588/Final Project/images/daily volume of comments/sentiment trend over time.PNG")
# Sentiment trend since November 2022, analyzed with ChatGPT; the line chart below shows the result.
knitr::include_graphics("588/Final Project/images/daily volume of comments/sentiment trend Nov 2022 onwards.PNG")
# Sentiment distribution since November 2022, analyzed with ChatGPT; the bar chart below shows the result.
knitr::include_graphics("588/Final Project/images/daily volume of comments/sentiment distribution Nov 2022 onwards.PNG")
# Word cloud of tri-grams, generated with ChatGPT's Data Analysis tool.
knitr::include_graphics("588/Final Project/images/daily volume of comments/word cloud for tri-grams.PNG")
The Dictionary Sentiment Ratio over Time provides a useful framework for comparing sentiment analysis results across multiple dictionaries in a single visualization. The dictionaries show similar trends in public sentiment toward AI. Since the documentary's release in 2019, the public's attitude toward AI has remained mostly neutral, with positive emotions outnumbering negative ones. However, the plot shows that from the end of 2022 to the beginning of 2023, sentiment toward AI in the comments fluctuated sharply and even peaked. Why?
According to Wikipedia, OpenAI launched ChatGPT on November 30, 2022. I was curious how the public reacted to this event, and whether it had anything to do with the abnormal spikes.
Since ChatGPT's public launch on November 30, 2022, it has sparked widespread interest and discussion in the documentary's comment section, reflecting broader awareness of and concern about artificial intelligence. The line chart tracks public sentiment over this period. In December 2022, commenters expressed strongly negative feelings about AI, which coincided with media attention to how AI systems process personal data and to their broader effects on employment, social prejudice, and related issues. As ChatGPT came into wider use, public sentiment stabilized and returned to the earlier pattern: neutral comments form the majority, and positive emotions outweigh negative ones.
The bar chart above showcases the sentiment distribution (neutral in green, positive in red, and negative in blue) from November 2022 onwards. This comparative view highlights the relative frequencies of each sentiment category, allowing for a clear understanding of the overall sentiment landscape within this period.
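The multi-dictionary comparison discussed above was likewise produced outside the chunks shown here. The sketch below shows how a similar comparison could be computed in R with the already-loaded tidytext tooling: a positive-sentiment ratio per year under two lexicons (bing and the positive/negative subset of NRC). The choice of lexicons and the yearly granularity are assumptions, not the settings behind the original figure.
# Sketch: positive-sentiment ratio per year under two dictionaries (assumed setup)
lexicons <- bind_rows(
  get_sentiments("bing") %>% mutate(lexicon = "bing"),
  get_sentiments("nrc") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    mutate(lexicon = "nrc")
)

ratio_by_year <- comments %>%
  unnest_tokens(word, Comment) %>%
  inner_join(lexicons, by = "word", relationship = "many-to-many") %>%
  count(lexicon, Year, sentiment) %>%
  group_by(lexicon, Year) %>%
  summarise(positive_ratio = sum(n[sentiment == "positive"]) / sum(n), .groups = "drop")

ggplot(ratio_by_year, aes(x = Year, y = positive_ratio, colour = lexicon)) +
  geom_line() +
  labs(title = "Positive Sentiment Ratio over Time by Dictionary",
       x = "Year", y = "Share of positive-sentiment words")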