0. INTRODUCTION

In this digital age, the pervasive impact of artificial intelligence (AI) is undeniable, influencing an ever-expanding range of industries and human activities. This influence, while frequently beneficial, has sparked widespread debate about its implications for society. This final project for ECI 588 investigates public perceptions of AI by analyzing the sentiments expressed in nearly 10,000 comments on a YouTube documentary about AI’s global impact. The documentary (https://www.youtube.com/watch?v=s0dMTAQM4cw) has drawn more than 10.5 million viewers and was still receiving new comments as of April 20th, 2023, making its comment section a rich record of public discourse about AI developments in Europe, the United States, China, and elsewhere.

This project is driven by a desire to uncover the depth and nuance of public sentiment on AI. The specific research questions this study aims to answer include:

The findings from this study should be highly relevant to AI developers, researchers, and policymakers. By understanding the public’s hopes and fears regarding AI, these stakeholders can guide the development of AI technologies in a manner that is more likely to be accepted and supported by the public. For instance, if privacy concerns are prominent among the negative sentiments, developers can prioritize stronger data-protection measures in their AI systems.

1. PREPARE

1.1 Data Collection & Preparation

The primary data source for this project is nearly 10,000 comments extracted from a YouTube documentary video about artificial intelligence. The comments were collected using custom scripts set up in Google Sheets within Google Workspace. Despite initial difficulties, this method enabled efficient and systematic data collection under Dr. Jiang’s supervision.
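
For readers who prefer a fully scripted route, the sketch below shows one alternative way the same fields could be pulled with the YouTube Data API v3 (commentThreads endpoint). This is not the Google Sheets workflow actually used here; the function name, the YT_API_KEY environment variable, and the paging details are assumptions for illustration only.

# Hedged sketch only: an alternative, programmatic way to pull top-level comments
# with the YouTube Data API v3 (NOT the Google Sheets workflow used in this project).
library(httr)
library(jsonlite)

get_comment_page <- function(video_id, api_key, page_token = NULL) {
  resp <- GET(
    "https://www.googleapis.com/youtube/v3/commentThreads",
    query = list(part = "snippet", videoId = video_id,
                 maxResults = 100, pageToken = page_token, key = api_key)
  )
  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)
  # Keep only the two fields this project ends up using: comment text and timestamp
  data.frame(
    Comment = parsed$items[["snippet.topLevelComment.snippet.textDisplay"]],
    Time    = parsed$items[["snippet.topLevelComment.snippet.publishedAt"]],
    stringsAsFactors = FALSE
  )
}
# Example call (requires a valid API key):
# page1 <- get_comment_page("s0dMTAQM4cw", Sys.getenv("YT_API_KEY"))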

Two main sources guided my data collection:

The initial AI_YouTube_comments data is organized into the following variables:

  • Channel URL: the URL address of a registered YouTube account.
  • Name: the name associated with a registered YouTube account.
  • Comment: the primary text of the comment.
  • Time: the time-stamp when each comment was published.
  • Likes: the number of likes received by the main comment.
  • Reply Count: the total number of replies to the comment.
  • Reply Author: the name of the registered YouTube account that replied to the comment.
  • Reply: the text content of the reply.
  • Published: the published time of each reply.

To ensure quality and relevance for analysis, I removed non-content elements such as Channel URL, Name, Likes, and the reply fields, leaving only the textual content of the comments and their timestamps.
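
A minimal sketch of that pruning step is shown below; the raw file name (AI_YouTube_comments_raw.xlsx) is a placeholder, and only the Comment and Time columns carry over into the cleaned file imported in section 2.1.

# Hedged sketch of the column pruning; the raw file name is a placeholder
library(readxl)
library(dplyr)
raw <- read_excel("AI_YouTube_comments_raw.xlsx")
ayc <- raw %>%
  select(Comment, Time)  # drop Channel URL, Name, Likes, and all reply fields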

1.2 R Package Setup

The following packages were installed and/or loaded to prepare for this project.

# Load the packages
library(tidyr)
library(tidytext)
library(dplyr)
library(knitr)
library(tm)
library(stopwords)
library(ggplot2)
library(ggraph)
library(stm)
library(cluster)
library(latticeExtra)
library(SnowballC)
library(stringr)
library(igraph)
library(wordcloud)
library(RColorBrewer)
library(scales)
library(lubridate)

2. WRANGLE

To uncover sentiment trends and contextual shifts over time, the data are first refined: stopwords are removed and bi-grams are built to expose the most frequent word pairs in the comments, highlighting each phrase’s prevalence so the underlying sentiments are easier to read. In addition, comment activity during specific periods is examined to pinpoint anomalies that may be driven by external events.

2.1 Import Data

library(readxl)
## Warning: package 'readxl' was built under R version 4.3.2
ayc_raw <- read_excel("AI _YouTube_comments.xlsx")
head(ayc_raw)
## # A tibble: 6 × 2
##   Comment                                                                  Time 
##   <chr>                                                                    <chr>
## 1 "This AI thing is really fascinating when we talk about global economic… 2024…
## 2 "The fact that THAY were able to Create  AI. Proves that privacy has go… 2024…
## 3 "Interesting ❤"                                                          2024…
## 4 "Crazy this was 4 years ago and it’s coming true in front of us"         2024…
## 5 "\U0001f637☕\U0001f1fa\U0001f1f8"                                       2024…
## 6 "ALL I.S. Information System, one and the same 010. Quantum Exchange."   2024…

2.2 Tokenization: removing stop words and building bi-grams to clean the data.

# Pre-process the data: lowercase the text and timestamps
ayc_raw$Time <- tolower(ayc_raw$Time)
ayc_raw$Comment <- tolower(ayc_raw$Comment)

# Keep a cleaned copy of the comment text, stripping punctuation and numbers
ayc_raw$text <- ayc_raw$Comment
ayc_raw$text <- removePunctuation(ayc_raw$text)
ayc_raw$text <- removeNumbers(ayc_raw$text)

head(ayc_raw)
## # A tibble: 6 × 3
##   Comment                                                            Time  text 
##   <chr>                                                              <chr> <chr>
## 1 "this ai thing is really fascinating when we talk about global ec… 2024… "thi…
## 2 "the fact that thay were able to create  ai. proves that privacy … 2024… "the…
## 3 "interesting ❤"                                                    2024… "int…
## 4 "crazy this was 4 years ago and it’s coming true in front of us"   2024… "cra…
## 5 "\U0001f637☕\U0001f237\U0001f237"                                 2024… "\U0…
## 6 "all i.s. information system, one and the same 010. quantum excha… 2024… "all…
# Create a Corpus object if needed for further text manipulation
tds_tidy <- Corpus(VectorSource(ayc_raw$text))
# Now, tokenize into bi-grams
class(ayc_raw)
## [1] "tbl_df"     "tbl"        "data.frame"
ayc_raw<- as.data.frame(ayc_raw)
head(class(ayc_raw))
## [1] "data.frame"
# Generate bi-grams
tds_bigrams <- ayc_raw %>%   
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
tds_bigrams <- tds_bigrams %>% 
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1%in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  filter(word1 != "", !is.na(word1), word1 != " ") %>%
  filter(word2 != "", !is.na(word2), word2 != " ") %>%
  filter(!str_detect(word1, "\\d")) %>%
  filter(!str_detect(word2, "\\d")) %>%
  mutate(word1 = wordStem(word1)) %>% 
  mutate(word2 = wordStem(word2)) %>% 
  unite(bigram, c(word1, word2), sep = " ")

3. ANALYSIS

3.1 Analyze the top 25 bigrams to understand the key concerns of the public.

bigram_top_tokens <- tds_bigrams %>% 
  count(bigram, sort = TRUE) %>% 
  top_n(25)
## Selecting by n
bigram_top_tokens
##               bigram   n
## 1  artifici intellig 201
## 2          drive car 100
## 3         human race  34
## 4        invent task  34
## 5      social credit  32
## 6      total control  32
## 7        jesu christ  28
## 8        human brain  27
## 9         cell phone  26
## 10        tech giant  26
## 11      credit score  22
## 12    silicon vallei  22
## 13      tech compani  22
## 14       autonom car  21
## 15      chines peopl  21
## 16         elon musk  20
## 17    human interact  20
## 18       vend machin  20
## 19    driverless car  19
## 20        human life  18
## 21      selfdriv car  18
## 22      social media  18
## 23        coffe serv  17
## 24     ai technologi  16
## 25         bill gate  16
## 26    neural network  16
# Plotting the top 25 bigrams
ggplot(bigram_top_tokens, aes(x=reorder(bigram, n), y=n)) +
  geom_bar(stat="identity", fill="orange") +
  labs(x = "Bigram", y = "Frequency", title = "Top 25 Bigrams in Comments", subtitle = "the most common words from the public") +
  coord_flip() 

# Create term-document matrix
tdm <- TermDocumentMatrix(tds_bigrams$bigram)
# Convert the matrix to a data frame for easier manipulation
m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing = TRUE)
word_df <- data.frame(word = names(word_freqs), freq = word_freqs)
# Create the word cloud: with specified frequency settings for the words, based on previous bigram counts.
wordcloud(words = word_df$word, freq = word_df$freq, min.freq = 30,
          max.words = 1000, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

3.2 Analyze the sentiment of the public over time.

# Convert 'Time' to 'DateTime' using ymd_hms from lubridate which automatically handles ISO 8601 formats
comments <- tds_bigrams %>%
  mutate(
    DateTime = ymd_hms(Time, tz = "UTC"), 
    Year = year(DateTime)  # Extract the year from the DateTime
  )
# Check the results to ensure the conversion
head(comments)
##                                                                                                                                                     Comment
## 1 this ai thing is really fascinating when we talk about global economic advancement but it&#39;s a bit scary when we talk about advance military weaponry.
## 2 this ai thing is really fascinating when we talk about global economic advancement but it&#39;s a bit scary when we talk about advance military weaponry.
## 3 this ai thing is really fascinating when we talk about global economic advancement but it&#39;s a bit scary when we talk about advance military weaponry.
## 4 this ai thing is really fascinating when we talk about global economic advancement but it&#39;s a bit scary when we talk about advance military weaponry.
## 5 this ai thing is really fascinating when we talk about global economic advancement but it&#39;s a bit scary when we talk about advance military weaponry.
## 6       the fact that thay were able to create  ai. proves that privacy has gone out the window. because you had to have gotten the data sets from someone.
##                   Time            bigram            DateTime Year
## 1 2024-04-04t15:34:15z     global econom 2024-04-04 15:34:15 2024
## 2 2024-04-04t15:34:15z     econom advanc 2024-04-04 15:34:15 2024
## 3 2024-04-04t15:34:15z         bit scari 2024-04-04 15:34:15 2024
## 4 2024-04-04t15:34:15z   advanc militari 2024-04-04 15:34:15 2024
## 5 2024-04-04t15:34:15z militari weaponri 2024-04-04 15:34:15 2024
## 6 2024-04-01t22:10:52z          creat ai 2024-04-01 22:10:52 2024
# Show the structure and a summary of the new columns
str(comments)
## 'data.frame':    23185 obs. of  5 variables:
##  $ Comment : chr  "this ai thing is really fascinating when we talk about global economic advancement but it&#39;s a bit scary whe"| __truncated__ "this ai thing is really fascinating when we talk about global economic advancement but it&#39;s a bit scary whe"| __truncated__ "this ai thing is really fascinating when we talk about global economic advancement but it&#39;s a bit scary whe"| __truncated__ "this ai thing is really fascinating when we talk about global economic advancement but it&#39;s a bit scary whe"| __truncated__ ...
##  $ Time    : chr  "2024-04-04t15:34:15z" "2024-04-04t15:34:15z" "2024-04-04t15:34:15z" "2024-04-04t15:34:15z" ...
##  $ bigram  : chr  "global econom" "econom advanc" "bit scari" "advanc militari" ...
##  $ DateTime: POSIXct, format: "2024-04-04 15:34:15" "2024-04-04 15:34:15" ...
##  $ Year    : num  2024 2024 2024 2024 2024 ...
summary(comments$DateTime)
##                       Min.                    1st Qu. 
## "2019-09-26 18:16:08.0000" "2020-10-24 19:35:53.0000" 
##                     Median                       Mean 
## "2021-03-31 12:41:33.0000" "2021-04-21 15:22:28.8480" 
##                    3rd Qu.                       Max. 
## "2021-10-15 16:55:58.0000" "2024-04-04 15:34:15.0000"
summary(comments$Year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2019    2020    2021    2021    2021    2024
# Filter out rows with a missing Year and keep only the Year and Comment columns
filtered_data <- comments %>%
  filter(!is.na(Year)) %>%  
  select(Year, Comment) 
# Load sentiment lexicons
nrc <- get_sentiments("nrc")
bing <- get_sentiments("bing")
loughran <- get_sentiments("loughran")
# Tokenize comments
comments_tokens <- comments %>%
  unnest_tokens(word, Comment)
# Join tokens with NRC lexicon and Aggregate sentiment by year
sentiment_analysis <- comments_tokens %>%
  inner_join(nrc, by = c("word"))
## Warning in inner_join(., nrc, by = c("word")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 25 of `x` matches multiple rows in `y`.
## ℹ Row 12263 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
sentiment_by_year <- sentiment_analysis %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(Year, sentiment) %>%
  summarise(count = n(), .groups = 'drop')
# Calculate sentiment ratio or simple count comparison
sentiment_summary <- sentiment_by_year %>%
  pivot_wider(names_from = sentiment, values_from = count, values_fill = list(count = 0)) %>%
  mutate(sentiment_score = positive - negative)
# Convert Year to a categorical factor for better plotting
sentiment_summary$Year <- factor(sentiment_summary$Year)
# Plotting sentiment score over time
ggplot(sentiment_summary, aes(x = Year, y = sentiment_score, fill = Year)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Sentiment Score Over Time", x = "Year", y = "Net Sentiment Score") +
  theme_minimal()

# Join and calculate sentiment
calculate_sentiment <- function(data, lexicon, method) {
  data %>%
    inner_join(lexicon, by = "word") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    group_by(Year, sentiment) %>%
    summarise(count = n(), .groups = 'drop') %>%
    pivot_wider(names_from = sentiment, values_from = count, values_fill = list(count = 0)) %>%
    mutate(
      ratio = positive / negative,
      method = method
    )
}
sentiment_nrc <- calculate_sentiment(comments_tokens, nrc, "NRC")
## Warning in inner_join(., lexicon, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 25 of `x` matches multiple rows in `y`.
## ℹ Row 12263 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
sentiment_bing <- calculate_sentiment(comments_tokens, bing, "Bing")
sentiment_loughran <- calculate_sentiment(comments_tokens, loughran, "Loughran")
## Warning in inner_join(., lexicon, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 847 of `x` matches multiple rows in `y`.
## ℹ Row 2086 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# Function to plot sentiment ratios
plot_sentiment_ratios <- function(sentiment_data) {
  ggplot(sentiment_data, aes(x = Year, y = ratio, group = method, color = method)) +
    geom_line() +
    geom_point() +
    labs(title = "Sentiment Ratio Over Time by Lexicon", x = "Year", y = "Sentiment Ratio (Positive/Negative)") +
    theme_minimal() +
    scale_color_brewer(palette = "Set1")  # Color set for clarity
}
# Plotting sentiment data
all_sentiments <- bind_rows(sentiment_nrc, sentiment_bing, sentiment_loughran)
plot_sentiment_ratios(all_sentiments)

3.3 Analyze the sentiment of the public from November 30, 2022, onwards.

# Filter comments from November 30, 2022 to the current date
specific_comments <- comments %>%
  filter(DateTime >= as.POSIXct("2022-11-30") & DateTime <= Sys.time())
# Tokenize comments of bigram for sentiment analysis
specific_comments_tokens <- specific_comments %>%
  unnest_tokens(word, bigram)
# Use the Bing lexicon and visualize the results with a line chart
sentiment_bing_current <- specific_comments_tokens %>%
  inner_join(bing, by = "word") %>%
  group_by(DateTime) %>%
  summarise(
    positive = sum(sentiment == "positive"),
    negative = sum(sentiment == "negative"),
    .groups = 'drop'
  ) %>%
  mutate(sentiment_score = positive - negative)
ggplot(sentiment_bing_current, aes(x = DateTime, y = sentiment_score)) +
  geom_line(color = "#FC4E07", linewidth = 0.7) +
  geom_point(color = "#534EB2", size = 2, shape = 22, fill = "#E5D8B0") +
  labs(title = "Sentiment Trends (Nov 2022 Onwards) Using Bing Lexicon",
       x = "Time",
       y = "Sentiment Score") +
  theme_light()+
  theme(
    plot.background = element_rect(fill = "#FFFFFF"),
    panel.background = element_rect(fill = "#FFFFFF", color = "#FFFFFF"),
    plot.title = element_text(color = "#007ACC", size = 16, face = "bold"),
    axis.title = element_text(color = "#4A4A4A"),
    axis.text = element_text(color = "#4A4A4A"),
    panel.grid.major = element_line(color = "#D3D3D3"),  # Light gray grid lines
    panel.grid.minor = element_blank(),  # No minor grid lines
    legend.position = "bottom",
    legend.background = element_rect(fill = "#FFFFFF")
  )

# Word cloud based on tri-grams, generated with ChatGPT's Data Analysis tool
knitr::include_graphics("images/daily volume of comments/word cloud for tri-grams.PNG")

# Line chart of the sentiment trend since November 2022, generated with ChatGPT
knitr::include_graphics("images/daily volume of comments/sentiment trend Nov 2022 onwards.PNG")

# Bar chart of the sentiment distribution over time, generated with ChatGPT
knitr::include_graphics("images/daily volume of comments/sentiment trend over time.PNG")

# Bar chart of the sentiment distribution since November 2022, generated with ChatGPT
knitr::include_graphics("sentiment distribution Nov 2022 onwards.PNG")

3.1.1 Findings

  • An analysis of the top 25 bigrams in the comments shows that people are most concerned with AI technology and algorithms and their impact on human society. This impact encompasses autonomous driving technology, religion, race, lifestyle changes, and more. With AI-assisted analysis, a word cloud was also generated from the same data frame.

  • The word network visualizes the top 100 bi-grams from the comments dataset (a sketch of how such a network can be rebuilt appears after this list). Each node represents a word, and each edge connects words that frequently appear together as a bi-gram. The network illustrates the contextual relationships between words, showing how topics and phrases are interconnected within the comments and revealing the common phrases that capture viewer interests, concerns, and reactions.

  • The word cloud shows that the most frequently mentioned topics concern the relationship between AI and humans, which appears to be the major public interest. Political, religious, and economic factors are also seen as significant influences on the development of AI.
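
The word network itself can be rebuilt from the bi-gram counts; the sketch below (the layout, seed, and 100-bigram cut-off are assumptions) shows one way to do it with igraph and ggraph from the tds_bigrams object created in section 2.2.

# Hedged sketch: bi-gram word network from the 100 most frequent bi-grams
bigram_graph <- tds_bigrams %>%
  count(bigram, sort = TRUE) %>%
  slice_max(n, n = 100) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  graph_from_data_frame()  # nodes = words, edges = bi-gram co-occurrence

set.seed(2023)  # fix the layout for reproducibility
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "steelblue", size = 2) +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  theme_void() +
  labs(title = "Word Network of the Top 100 Bi-grams")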

3.2.1 Findings

  • The Dictionary Sentiment Ratio over Time plot provides a convenient framework for comparing sentiment across multiple lexicons. The findings show that the trends from the different lexicons are similar: since the documentary’s release in 2019, the public’s attitude toward AI has remained mostly neutral, with more positive than negative emotion. Based on the plot, however, sentiment in the comments fluctuated sharply from the end of 2022 to the beginning of 2023, even reaching a peak. Why?

  • According to Wikipedia, OpenAI launched ChatGPT on November 30, 2022. I am curious how the public reacted to this historic event and to AI more broadly, and whether it has anything to do with the abnormal spikes.

3.3.1 Findings

  • Since its public launch on November 30, 2022, ChatGPT has sparked widespread interest and discussion under the AI documentary, reflecting broader awareness of and concerns about artificial intelligence. The line chart makes public sentiment easy to track: in December 2022, commenters expressed strongly negative feelings about AI technology, which coincides with media attention to how AI systems process personal data and to their impact on employment, social bias, and other issues. As the use of ChatGPT spread, public sentiment stabilized and returned to the earlier orientation, with neutral comments in the majority and positive emotions outweighing negative ones.

  • The bar chart above shows the sentiment distribution (neutral in green, positive in red, and negative in blue) from November 2022 onwards. This comparative view highlights the relative frequency of each sentiment category, giving a clear picture of the overall sentiment landscape in this period.
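
For comparison with the ChatGPT-generated chart, the sketch below shows how a similar sentiment-distribution bar chart could be drawn directly in R from the tokens built in section 3.3. Note that the Bing lexicon has no neutral class, so only positive and negative bars would appear; the neutral share in the chart above comes from ChatGPT’s own labeling.

# Hedged sketch: Bing-based sentiment distribution for comments from Nov 30, 2022 onwards
sentiment_distribution <- specific_comments_tokens %>%
  inner_join(bing, by = "word") %>%
  count(sentiment, name = "count")

ggplot(sentiment_distribution, aes(x = sentiment, y = count, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Sentiment Distribution (Nov 2022 Onwards, Bing Lexicon)",
       x = "Sentiment", y = "Word count") +
  theme_minimal()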

4. SUMMARY

Reference