In this digital age, the pervasive impact of artificial intelligence (AI) is undeniable, influencing an ever-expanding range of industries and human activities. This influence, while frequently beneficial, has sparked widespread debate about its implications for society. This final project for ECI 588 investigates public perceptions of AI by analyzing the sentiments expressed in nearly 10,000 comments on a YouTube documentary about AI's global impact. The documentary (https://www.youtube.com/watch?v=s0dMTAQM4cw), which has drawn over 10.5 million viewers, provides fertile ground for examining the public discourse surrounding AI developments in Europe, the United States, China, and elsewhere, making its comments a rich source of public opinion about AI. New comments were still being submitted as of April 2024, when the data were collected.
This project is driven by a desire to uncover the depth and nuance of public sentiment on AI. The specific research questions this study aims to answer include:
What specific aspects of AI are most concerning or exciting to the public? [3.1.1]
What are the predominant sentiments (positive, negative, or neutral) expressed by the public concerning AI? [3.2.1]
How do these perceptions align with recent developments and portrayals of AI in media? [3.3.1]
The findings from this study are anticipated to be of high relevance to AI developers, researchers, and policymakers. By understanding the public’s hopes and fears regarding AI, these stakeholders can guide the development of AI technologies in a manner that is more likely to be accepted and supported by the public. For instance, if privacy concerns are prominent among negative sentiments, developers can prioritize enhancing data protection measures in their AI systems.
The primary data source for this project is nearly 10,000 comments extracted from a YouTube documentary video about artificial intelligence. The comments were collected using custom scripts set up in Google Sheets within Google Workspace. Despite initial difficulties, this method enabled efficient and systematic data collection under Dr. Jiang’s supervision.
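For readers who want a fully scripted alternative, below is a rough sketch of equivalent collection in R through the YouTube Data API v3. The API key is a placeholder, fetch_comment_page is a hypothetical helper, and the project's actual collection used the Google Sheets scripts described above.
# Hypothetical sketch: pulling comment threads for the documentary via the
# YouTube Data API v3 (the project's actual collection used Google Sheets)
library(httr)
library(jsonlite)

api_key  <- "YOUR_API_KEY"   # placeholder; requires a Google API key
video_id <- "s0dMTAQM4cw"    # the documentary analyzed in this project

fetch_comment_page <- function(page_token = NULL) {
  res <- GET(
    "https://www.googleapis.com/youtube/v3/commentThreads",
    query = list(
      part       = "snippet",
      videoId    = video_id,
      maxResults = 100,        # API maximum per page
      pageToken  = page_token,
      key        = api_key
    )
  )
  fromJSON(content(res, as = "text", encoding = "UTF-8"))
}

page <- fetch_comment_page()
comments_text <- page$items$snippet$topLevelComment$snippet$textOriginal
# Loop over page$nextPageToken to collect all remaining pages of comments.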
Two main sources guided the data collection:
The initial AI_YouTube_comments data set is organized into ten variables.
To ensure its quality and relevance for analysis, I removed non-content elements such as the channel URL, commenter name, and the numbers of likes and replies, leaving only the textual content of each comment and its timestamp.
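As a minimal sketch of that cleaning step, assuming the raw export were loaded into R as a data frame ayc_full with the original ten columns (names illustrative; the actual cleaning was done before export), it would amount to:
# Hypothetical R equivalent of the column-cleaning step; `ayc_full` and the
# dropped column names are illustrative, not the exact raw headers
library(dplyr)
ayc_clean <- ayc_full %>%
  select(Comment, Time)   # drop channel URL, name, likes, replies, etc.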
The following packages were installed and/or loaded to prepare for this project.
# load the packages
library(tidyr)         # data reshaping: separate(), unite(), pivot_wider()
library(tidytext)      # tokenization and sentiment lexicons
library(dplyr)         # data manipulation
library(knitr)         # report rendering and include_graphics()
library(tm)            # corpus tools and text-cleaning helpers
library(stopwords)     # stopword lists
library(ggplot2)       # plotting
library(ggraph)        # word-network visualization
library(stm)           # structural topic modeling
library(cluster)       # clustering utilities
library(latticeExtra)  # additional lattice graphics
library(SnowballC)     # word stemming: wordStem()
library(stringr)       # string helpers: str_detect()
library(igraph)        # graph construction
library(wordcloud)     # word clouds
library(RColorBrewer)  # color palettes
library(scales)        # axis and scale formatting
library(lubridate)     # date-time parsing: ymd_hms(), year()
To uncover sentiment trends and contextual shifts over time, the data are first refined: stopwords are removed and bi-grams are extracted to surface the most frequent word pairs in the comments, highlighting the themes behind the underlying sentiments. In addition, comment activity during specific periods is examined to pinpoint anomalies that may be driven by external events.
library(readxl)
ayc_raw <- read_excel("AI _YouTube_comments.xlsx")
head(ayc_raw)
## # A tibble: 6 × 2
## Comment Time
## <chr> <chr>
## 1 "This AI thing is really fascinating when we talk about global economic… 2024…
## 2 "The fact that THAY were able to Create AI. Proves that privacy has go… 2024…
## 3 "Interesting ❤" 2024…
## 4 "Crazy this was 4 years ago and it’s coming true in front of us" 2024…
## 5 "\U0001f637☕\U0001f1fa\U0001f1f8" 2024…
## 6 "ALL I.S. Information System, one and the same 010. Quantum Exchange." 2024…
# Pre-process the data: lowercase everything, then strip punctuation and digits
ayc_raw$Time <- tolower(ayc_raw$Time)
ayc_raw$Comment <- tolower(ayc_raw$Comment)
ayc_raw$text <- ayc_raw$Comment              # working copy for further cleaning
ayc_raw$text <- removePunctuation(ayc_raw$text)
ayc_raw$text <- removeNumbers(ayc_raw$text)
head(ayc_raw)
## # A tibble: 6 × 3
## Comment Time text
## <chr> <chr> <chr>
## 1 "this ai thing is really fascinating when we talk about global ec… 2024… "thi…
## 2 "the fact that thay were able to create ai. proves that privacy … 2024… "the…
## 3 "interesting ❤" 2024… "int…
## 4 "crazy this was 4 years ago and it’s coming true in front of us" 2024… "cra…
## 5 "\U0001f637☕\U0001f237\U0001f237" 2024… "\U0…
## 6 "all i.s. information system, one and the same 010. quantum excha… 2024… "all…
# Create a Corpus object if needed for further text manipulation
tds_tidy <- Corpus(VectorSource(ayc_raw$text))
# Coerce the tibble to a plain data frame before tokenizing into bi-grams
class(ayc_raw)
## [1] "tbl_df" "tbl" "data.frame"
ayc_raw <- as.data.frame(ayc_raw)
class(ayc_raw)
## [1] "data.frame"
# Generate bi-grams: tokenize, drop stopwords/blanks/digits, stem, and recombine
tds_bigrams <- ayc_raw %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
tds_bigrams <- tds_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(word1 != "", !is.na(word1), word1 != " ") %>%
  filter(word2 != "", !is.na(word2), word2 != " ") %>%
  filter(!str_detect(word1, "\\d")) %>%
  filter(!str_detect(word2, "\\d")) %>%
  mutate(word1 = wordStem(word1)) %>%
  mutate(word2 = wordStem(word2)) %>%
  unite(bigram, c(word1, word2), sep = " ")
# Count bi-grams and keep the top 25 (top_n() keeps ties, hence 26 rows below)
bigram_top_tokens <- tds_bigrams %>%
  count(bigram, sort = TRUE) %>%
  top_n(25)
## Selecting by n
bigram_top_tokens
## bigram n
## 1 artifici intellig 201
## 2 drive car 100
## 3 human race 34
## 4 invent task 34
## 5 social credit 32
## 6 total control 32
## 7 jesu christ 28
## 8 human brain 27
## 9 cell phone 26
## 10 tech giant 26
## 11 credit score 22
## 12 silicon vallei 22
## 13 tech compani 22
## 14 autonom car 21
## 15 chines peopl 21
## 16 elon musk 20
## 17 human interact 20
## 18 vend machin 20
## 19 driverless car 19
## 20 human life 18
## 21 selfdriv car 18
## 22 social media 18
## 23 coffe serv 17
## 24 ai technologi 16
## 25 bill gate 16
## 26 neural network 16
# Plot the top 25 bigrams
ggplot(bigram_top_tokens, aes(x = reorder(bigram, n), y = n)) +
  geom_bar(stat = "identity", fill = "orange") +
  labs(x = "Bigram", y = "Frequency", title = "Top 25 Bigrams in Comments",
       subtitle = "The most frequent word pairs in public comments") +
  coord_flip()
# Create a term-document matrix from the stemmed bi-gram tokens
tdm <- TermDocumentMatrix(tds_bigrams$bigram)
# Convert the matrix to a data frame for easier manipulation
m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing = TRUE)
word_df <- data.frame(word = names(word_freqs), freq = word_freqs)
# Create the word cloud, with frequency thresholds based on the bigram counts above
wordcloud(words = word_df$word, freq = word_df$freq, min.freq = 30,
          max.words = 1000, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
The top 25 bigrams show that the public is most concerned with AI technologies and algorithms and their impact on human society. This impact encompasses autonomous driving, religion, race, lifestyle changes, and more. A word cloud was then generated from the resulting term-frequency data frame.
Through the word cloud, we can observe that the most frequently mentioned topics concern the relationship between AI and humans, which appears to be a major public interest. Additionally, political, religious, and economic factors are also believed to significantly impact the development of AI.
# Convert 'Time' to 'DateTime' using ymd_hms from lubridate, which handles ISO 8601
comments <- tds_bigrams %>%
  mutate(
    DateTime = ymd_hms(Time, tz = "UTC"),
    Year = year(DateTime)  # extract the year from the DateTime
  )
# Check the results to ensure the conversion worked
head(comments)
## Comment
## 1 this ai thing is really fascinating when we talk about global economic advancement but it's a bit scary when we talk about advance military weaponry.
## 2 this ai thing is really fascinating when we talk about global economic advancement but it's a bit scary when we talk about advance military weaponry.
## 3 this ai thing is really fascinating when we talk about global economic advancement but it's a bit scary when we talk about advance military weaponry.
## 4 this ai thing is really fascinating when we talk about global economic advancement but it's a bit scary when we talk about advance military weaponry.
## 5 this ai thing is really fascinating when we talk about global economic advancement but it's a bit scary when we talk about advance military weaponry.
## 6 the fact that thay were able to create ai. proves that privacy has gone out the window. because you had to have gotten the data sets from someone.
## Time bigram DateTime Year
## 1 2024-04-04t15:34:15z global econom 2024-04-04 15:34:15 2024
## 2 2024-04-04t15:34:15z econom advanc 2024-04-04 15:34:15 2024
## 3 2024-04-04t15:34:15z bit scari 2024-04-04 15:34:15 2024
## 4 2024-04-04t15:34:15z advanc militari 2024-04-04 15:34:15 2024
## 5 2024-04-04t15:34:15z militari weaponri 2024-04-04 15:34:15 2024
## 6 2024-04-01t22:10:52z creat ai 2024-04-01 22:10:52 2024
# Show the structure and a summary of the new columns
str(comments)
## 'data.frame': 23185 obs. of 5 variables:
## $ Comment : chr "this ai thing is really fascinating when we talk about global economic advancement but it's a bit scary whe"| __truncated__ "this ai thing is really fascinating when we talk about global economic advancement but it's a bit scary whe"| __truncated__ "this ai thing is really fascinating when we talk about global economic advancement but it's a bit scary whe"| __truncated__ "this ai thing is really fascinating when we talk about global economic advancement but it's a bit scary whe"| __truncated__ ...
## $ Time : chr "2024-04-04t15:34:15z" "2024-04-04t15:34:15z" "2024-04-04t15:34:15z" "2024-04-04t15:34:15z" ...
## $ bigram : chr "global econom" "econom advanc" "bit scari" "advanc militari" ...
## $ DateTime: POSIXct, format: "2024-04-04 15:34:15" "2024-04-04 15:34:15" ...
## $ Year : num 2024 2024 2024 2024 2024 ...
summary(comments$DateTime)
## Min. 1st Qu.
## "2019-09-26 18:16:08.0000" "2020-10-24 19:35:53.0000"
## Median Mean
## "2021-03-31 12:41:33.0000" "2021-04-21 15:22:28.8480"
## 3rd Qu. Max.
## "2021-10-15 16:55:58.0000" "2024-04-04 15:34:15.0000"
summary(comments$Year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2019 2020 2021 2021 2021 2024
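As planned in the data section, comment activity over specific periods can flag anomalies. A short sketch of monthly comment volume is shown below; monthly_volume is a new illustrative object, and deduplicating on Comment and DateTime approximates one row per comment, since comments holds one row per bigram.
# Sketch: monthly comment volume, to spot activity anomalies
# (dedupe to roughly one row per comment, not one row per bigram)
monthly_volume <- comments %>%
  distinct(Comment, DateTime) %>%
  mutate(Month = floor_date(DateTime, "month")) %>%
  count(Month)

ggplot(monthly_volume, aes(x = Month, y = n)) +
  geom_line(color = "steelblue") +
  labs(title = "Monthly Comment Volume", x = "Month", y = "Number of Comments") +
  theme_minimal()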
# Keep rows with a valid Year, retaining only the Year and Comment columns
filtered_data <- comments %>%
  filter(!is.na(Year)) %>%
  select(Year, Comment)
# Load sentiment lexicons (nrc and loughran are fetched via the textdata
# package on first use)
nrc <- get_sentiments("nrc")
bing <- get_sentiments("bing")
loughran <- get_sentiments("loughran")
# Tokenize comments
comments_tokens <- comments %>%
unnest_tokens(word, Comment)
# Join tokens with the NRC lexicon and aggregate sentiment by year; a word can
# map to several NRC categories, so the join is explicitly many-to-many
sentiment_analysis <- comments_tokens %>%
  inner_join(nrc, by = "word", relationship = "many-to-many")
sentiment_by_year <- sentiment_analysis %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(Year, sentiment) %>%
summarise(count = n(), .groups = 'drop')
# Calculate sentiment ratio or simple count comparison
sentiment_summary <- sentiment_by_year %>%
pivot_wider(names_from = sentiment, values_from = count, values_fill = list(count = 0)) %>%
mutate(sentiment_score = positive - negative)
# Convert Year to a categorical factor for better plotting
sentiment_summary$Year <- factor(sentiment_summary$Year)
# Plotting sentiment score over time
ggplot(sentiment_summary, aes(x = Year, y = sentiment_score, fill = Year)) +
geom_col(show.legend = FALSE) +
labs(title = "Sentiment Score Over Time", x = "Year", y = "Net Sentiment Score") +
theme_minimal()
# Join tokens with a lexicon and compute the yearly positive/negative ratio;
# the many-to-many join accommodates lexicons that map a word to several categories
calculate_sentiment <- function(data, lexicon, method) {
  data %>%
    inner_join(lexicon, by = "word", relationship = "many-to-many") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    group_by(Year, sentiment) %>%
    summarise(count = n(), .groups = 'drop') %>%
    pivot_wider(names_from = sentiment, values_from = count, values_fill = list(count = 0)) %>%
    mutate(
      ratio = positive / negative,
      method = method
    )
}
sentiment_nrc <- calculate_sentiment(comments_tokens, nrc, "NRC")
sentiment_bing <- calculate_sentiment(comments_tokens, bing, "Bing")
sentiment_loughran <- calculate_sentiment(comments_tokens, loughran, "Loughran")
# Function to plot sentiment ratios
plot_sentiment_ratios <- function(sentiment_data) {
ggplot(sentiment_data, aes(x = Year, y = ratio, group = method, color = method)) +
geom_line() +
geom_point() +
labs(title = "Sentiment Ratio Over Time by Lexicon", x = "Year", y = "Sentiment Ratio (Positive/Negative)") +
theme_minimal() +
scale_color_brewer(palette = "Set1") # Color set for clarity
}
# Plotting sentiment data
all_sentiments <- bind_rows(sentiment_nrc, sentiment_bing, sentiment_loughran)
plot_sentiment_ratios(all_sentiments)
This dictionary sentiment ratio over time provides a useful framework for sentiment analysis across multiple lexicons, allowing easy visual comparison of their results. The three lexicons agree on the overall trend in public sentiment toward AI. Since the documentary's release in 2019, public attitudes toward AI have remained mostly neutral, with positive emotions slightly outnumbering negative ones. However, the plot shows that from the end of 2022 to the beginning of 2023, sentiment in the comments fluctuated sharply, even reaching a peak.
According to Wikipedia, OpenAI launched ChatGPT on November 30, 2022. How did the public react to this milestone, and could it explain the abnormal spikes?
# Filter comments from November 30, 2022 to the current date
specific_comments <- comments %>%
filter(DateTime >= as.POSIXct("2022-11-30") & DateTime <= Sys.time())
# Tokenize the bi-grams into single words for sentiment analysis
specific_comments_tokens <- specific_comments %>%
  unnest_tokens(word, bigram)
# Use the Bing lexicon and visualize the results with a line chart
sentiment_bing_current <- specific_comments_tokens %>%
inner_join(bing, by = "word") %>%
group_by(DateTime) %>%
summarise(
positive = sum(sentiment == "positive"),
negative = sum(sentiment == "negative"),
.groups = 'drop'
) %>%
mutate(sentiment_score = positive - negative)
ggplot(sentiment_bing_current, aes(x = DateTime, y = sentiment_score)) +
  geom_line(color = "#FC4E07", linewidth = 0.7) +  # linewidth replaces the deprecated size aesthetic
  geom_point(color = "#534EB2", size = 2, shape = 22, fill = "#E5D8B0") +
  labs(title = "Sentiment Trends (Nov 2022 Onwards) Using Bing Lexicon",
       x = "Time",
       y = "Sentiment Score") +
  theme_light() +
  theme(
    plot.background = element_rect(fill = "#FFFFFF"),
    panel.background = element_rect(fill = "#FFFFFF", color = "#FFFFFF"),
    plot.title = element_text(color = "#007ACC", size = 16, face = "bold"),
    axis.title = element_text(color = "#4A4A4A"),
    axis.text = element_text(color = "#4A4A4A"),
    panel.grid.major = element_line(color = "#D3D3D3"),  # light gray grid lines
    panel.grid.minor = element_blank(),                  # no minor grid lines
    legend.position = "bottom",
    legend.background = element_rect(fill = "#FFFFFF")
  )
# Word cloud of tri-grams, generated with ChatGPT Data Analysis
knitr::include_graphics("images/daily volume of comments/word cloud for tri-grams.PNG")
# Sentiment trend since November 2022, analyzed with ChatGPT; line chart of the result
knitr::include_graphics("images/daily volume of comments/sentiment trend Nov 2022 onwards.PNG")
# Sentiment distribution over time, analyzed with ChatGPT; bar chart of the result
knitr::include_graphics("images/daily volume of comments/sentiment trend over time.PNG")
# Sentiment distribution since November 2022, analyzed with ChatGPT; bar chart of the result
knitr::include_graphics("sentiment distribution Nov 2022 onwards.PNG")
Since its public launch on November 30, 2022, ChatGPT has sparked widespread interest and discussion in the documentary's comment section, reflecting broader awareness of and concerns about artificial intelligence (AI). The line chart makes public sentiment easy to trace: in December 2022, commenters expressed strong negative feelings about AI technology, which coincided with media attention to how AI systems process personal data and to their widespread impact on employment, social bias, and related issues. As ChatGPT has come into wider use, public sentiment has stabilized and returned to its earlier orientation: neutral comments form the majority, and positive emotions outweigh negative ones.
The bar chart above shows the sentiment distribution (neutral in green, positive in red, and negative in blue) from November 2022 onwards. This comparative view highlights the relative frequency of each sentiment category, giving a clear picture of the overall sentiment landscape in this period.
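The figure itself was produced with ChatGPT Data Analysis; a rough R equivalent is sketched below, under the assumption that tokens not matched by the Bing lexicon count as neutral (sentiment_distribution is a new illustrative object).
# Sketch: sentiment distribution since Nov 2022 using the Bing lexicon;
# tokens absent from the lexicon are treated as neutral (an assumption)
sentiment_distribution <- specific_comments_tokens %>%
  left_join(bing, by = "word") %>%
  mutate(sentiment = replace_na(sentiment, "neutral")) %>%
  count(sentiment)

ggplot(sentiment_distribution, aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Sentiment Distribution (Nov 2022 Onwards)",
       x = "Sentiment", y = "Token Count") +
  theme_minimal()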
The word network above visualizes frequent n-grams from the comments dataset. Each node represents a word, and each edge connects words that frequently appear together. The network illustrates the contextual relationships between words, showing how certain topics and phrases are interconnected within the comments and revealing common phrases that capture viewer interests, concerns, or reactions.
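The code behind the network is not shown above; a minimal reconstruction using the igraph and ggraph packages loaded earlier might look like the following, where the frequency cutoff of 15 is an assumption rather than the original setting.
# Sketch: rebuild the word network from bi-gram counts (the cutoff of 15
# is an assumption; the original network's parameters are not shown)
bigram_graph <- tds_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n >= 15) %>%
  graph_from_data_frame()

set.seed(2024)
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "steelblue", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()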
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach.
Ferreira-Mello, R., André, M., Pinheiro, A., Costa, E., & Romero, C. (2019). Text mining in education.