---
title: "TED Talks: AI and its Sentiments"
output: html_notebook
author: Cel McCracken
date: "2020-07-10"
---
![](/Users/Celeste/Desktop/AI_Dystopia.png)

People love to talk about AI, to hear about what's new, what's developing and how it might change our lives. In an earlier post I did some data scraping to download the text from all the TED Talks under the search term "AI". In this post we will analyse the sentiments contained in TED Talks about AI (positive or negative), and how these might have developed across time. We begin as always by loading some packages and functions.
```{r, message= FALSE}

# Some packages
library(tidyverse)
library(data.table)
library(tidytext)
library(qdap)
library(tm)
library(stringr)
library(ggrepel)
library(plotly)
library(colorspace)


# Some colours

myCols8<- c("#003f5c", "#2f4b7c", "#665191", "#a05195", "#d45087", "#f95d6a", "#ff7c43", "#ffa600")
cols10<-  c("#7a5195", "#bc5090", "#ef5675", "#ff764a", "#ffa600","#7a5195", "#bc5090", 
            "#ef5675", "#ff764a", "#ffa600")

# Some functions

quote_plot<- function(plot_data, titleText, sizes=c(4, 4, 4, 5, 5, 6, 6, 7, 8, 10), 
                      spacing= c("squared","linear")){
  # Generic plotting function to display text on a dark grey background
  # Resolve spacing to a single value; defaults to "squared" when not supplied
  spacing<- match.arg(spacing)
  plot_data<- plot_data %>% mutate(polarity= abs(polarity)) %>%
    arrange(polarity) %>%
    mutate(Order= order(polarity), X= 0 ,
           Size= sizes) 
  if (spacing== "squared") {
    plot_data$Order<- plot_data$Order^2
  }
  ggplot(plot_data, aes(x= X, y= Order, color= factor(Order))) +
    geom_text_repel(aes(label= text, size= I(Size)), fontface= "bold",  segment.alpha=0) +
    #geom_text(aes(label= text, size=I(Size)), fontface="bold") +
    scale_color_manual(values= rep(cols10,2)) +
    theme_void() +
    theme(panel.background = element_rect(fill= "grey20", color= "grey20"),
          plot.background = element_rect(fill= "grey20", size= 2),
          plot.margin = margin(0.5, 0.8,0.8, 0.8, "cm"),
          legend.position = "none",
          title= element_text(face= "bold", size= 12, color= "white")) +
    labs(title= titleText)
}

fixNAs<- function(thevec){
  thevec<- ifelse(is.na(thevec),0,thevec)
  return(thevec)
}


```

### Let's grab the data...

Next we load the data that was obtained in a previous session. This data set contains the text from 88 TED Talks given between 2007 and mid-2020. Some talks are annotated with audience reactions such as "(Laughter)". I don't want these to form part of the analysis, so I will remove them.
```{r}
df<- readRDS("TED_data.rds")

# Extract the year
df<- df %>% mutate(Year= year(date))

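# Remove the audience descriptors, e.g. "(Laughter)" or "(Applause)";
# the match is case-insensitive because the transcripts capitalise them
df$full_text<- str_remove_all(df$full_text, regex("\\(laughter\\) ?", ignore_case= TRUE))
df$full_text<- str_remove_all(df$full_text, regex("\\(applause\\) ?", ignore_case= TRUE))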

# Print the first part of the first talk
print(str_sub(df$full_text[1],1,400))

```
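
As a minimal sketch of this clean-up step, the snippet below strips annotations from a made-up sentence. The names `toy_text`, `toy_pattern` and `toy_clean` are illustrative and not part of the analysis, and the pattern assumes annotations look like "(Laughter)" or "(Applause)" in any capitalisation.

```r
library(stringr)

# Made-up example text, not taken from the TED data
toy_text<- "I was a nerd. (Laughter) You did, too. (Applause) Thanks."

# Case-insensitive pattern that also swallows the whitespace before the annotation
toy_pattern<- regex("\\s*\\((laughter|applause)\\)", ignore_case= TRUE)

toy_clean<- str_remove_all(toy_text, toy_pattern)
print(toy_clean)
```

Removing the leading whitespace together with the annotation avoids leaving double spaces behind in the cleaned text.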

### Sentences, words and word stemming
Here I pre-process the text data for analysis. How this is done depends quite a lot on the case at hand. In my case I filter out words fewer than 3 characters long, stopwords, and words containing digits.
There are a few different options for stemming (reducing words down to a common stem), and after evaluating a few I landed on `stemDocument` from the `tm` package. Even after applying a stemming function, I found a few cases where a little manual stemming was needed.  
To have complete words in the analysis (rather than stems), I created a dictionary of the complete word most commonly represented by each stem, and joined those back into the main data set.
```{r , message=FALSE}
# Break the text into sentences
sentObj<- unnest_sentences(df, sent, full_text)

# Break the sentences into words
wordObj<- sentObj %>% unnest_tokens(word, sent, drop= FALSE)

outwords<- c("yeah","no","black","ll","ve","ynh","don", "isn","won","didn","ca","em", "nb","ems","t","applause", "laughter")

wordObj<- wordObj %>% mutate(LEN= nchar(word)) %>% filter(LEN > 2) %>%
                      filter(!word %in% stop_words$word) %>%
                      filter(!word %in% outwords) %>%
                      filter(!word %like% "\\d")

wordObj$stem<- stemDocument(wordObj$word)



# Some additional manual stemming
wordObj<- wordObj %>% mutate(stem= ifelse(word %like% "chinese|china","china",stem),
                             stem= ifelse(word %like% "deepfak", "deepfake", stem),
                             stem= ifelse(word %like% "superintel","superintelligent", stem),
                             stem= ifelse(word %like% "accid","accid", stem))

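# make a dictionary: the most common complete word for each stem
# (with_ties= FALSE keeps exactly one word per stem, so the join below cannot duplicate rows)
t1<-  wordObj %>% count(word,stem) %>% 
                  group_by(stem) %>% 
                  slice_max(n, n= 1, with_ties= FALSE) %>%
                  ungroup() %>%
                  rename(complete= word) %>% select(-n)

# Stem completion
wordObj<- left_join(wordObj, t1, by= "stem")
wordObj$word<- wordObj$complete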

```
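
To make the stem-then-complete idea concrete, here is a small sketch on a handful of made-up words. The `toy_` objects are hypothetical and exist only for illustration. The sketch assumes `stemDocument` is applied to a plain character vector, as in the chunk above.

```r
library(dplyr)
library(tm)

# Made-up tokens: two word families with different surface forms
toy_words<- tibble(word= c("robots", "robot", "robot",
                           "learning", "learns", "learning"))

# Reduce each word to its stem
toy_words$stem<- stemDocument(toy_words$word)

# Dictionary: for each stem, the complete word that occurs most often
toy_dict<- toy_words %>% count(word, stem) %>%
  group_by(stem) %>%
  slice_max(n, n= 1, with_ties= FALSE) %>%
  ungroup() %>%
  select(stem, complete= word)

# Replace every token by the representative word of its stem
toy_completed<- toy_words %>% left_join(toy_dict, by= "stem")
print(toy_completed)
```

Every variant of a word ends up mapped to one readable form, which keeps the word clouds and quotes legible while still pooling counts across variants.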

### Are you positive? Matching words to sentiments
Now we bring in some sentiment lexicons, accessed through the `tidytext` package. They contain lists of words and whether they are considered positive or negative. In the case of afinn, each word has a numerical score between -5 (very negative) and +5 (very positive). Because my data set is quite small and the vocabulary is quite specialised, I will refer to several lexicons to gather sentiments.
```{r, message= FALSE}
# Get some sentiment lexicons
bing_lex<-   get_sentiments("bing")
afinn_lex<-  get_sentiments("afinn")


bing_lex<-       bing_lex %>% mutate(bing= ifelse(sentiment== "negative", -1, 1)) %>% 
                              select(-sentiment)

afinn_lex<-     afinn_lex %>% rename(afinn= value)

# Join each lexicon on the word column explicitly
wordObj<- wordObj %>% left_join(bing_lex, by= "word") %>%
                      left_join(afinn_lex, by= "word")

# Positive and negative words from qdapDictionaries
wordObj<- wordObj %>% mutate(qdap= ifelse(word %in% positive.words, 1, 0),
                             qdap= ifelse(word %in% negative.words, -1, qdap))

# Make a combined polarity score
wordObj<- wordObj %>% mutate(bing= fixNAs(bing), afinn= fixNAs(afinn),
                             combined= bing + afinn + qdap)

```
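
The combined score can be illustrated with two tiny hand-made lexicons. The words and scores below are invented for the example and are not taken from bing or afinn, and `coalesce` stands in for the `fixNAs` helper.

```r
library(dplyr)

# Hand-made stand-ins for the real lexicons (scores are invented)
toy_bing<-  tibble(word= c("good", "risk"),   bing=  c(1, -1))
toy_afinn<- tibble(word= c("good", "threat"), afinn= c(3, -2))

toy_tokens<- tibble(word= c("good", "risk", "threat", "robot"))

toy_scored<- toy_tokens %>%
  left_join(toy_bing,  by= "word") %>%
  left_join(toy_afinn, by= "word") %>%
  # Words missing from a lexicon contribute 0 rather than NA
  mutate(bing= coalesce(bing, 0), afinn= coalesce(afinn, 0),
         combined= bing + afinn)

print(toy_scored)
```

Because afinn scores range from -5 to +5 while the other sources contribute at most 1 in either direction, afinn carries the most weight in a simple sum like this one.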
I apply the `polarity` function from `qdap` to score sentences as positive or negative in sentiment. This potentially gives additional insight since `polarity` will take into account negating or amplifying words contained in the sentence.
```{r, message= FALSE}
# Polarity by sentence

# Store the result under a new name so the polarity() function is not masked
pol_res<-          polarity(sentObj$sent)
sentObj$polarity<- pol_res$all$polarity

```
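
To show why negating words matter, here is a deliberately simplified, hand-rolled scorer. It is not how `qdap` implements `polarity`; the function `toy_polarity`, its two-word lexicon and its list of negators are all assumptions made for illustration.

```r
# Hypothetical helper: sums lexicon scores, flipping the sign of a word
# when the word directly before it is a negator
toy_polarity<- function(sentence, lexicon, negators= c("not", "never", "no")) {
  words<- strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+")[[1]]
  score<- 0
  for (i in seq_along(words)) {
    w<- words[i]
    if (w %in% names(lexicon)) {
      flip<- if (i > 1 && words[i - 1] %in% negators) -1 else 1
      score<- score + flip * lexicon[[w]]
    }
  }
  score
}

# Invented two-word lexicon
toy_lexicon<- c(like= 1, hate= -1)

toy_polarity("I like it.", toy_lexicon)
toy_polarity("I do not like it.", toy_lexicon)
```

A purely word-based score would rate both sentences as equally positive, which is the kind of error a sentence-level score is meant to reduce.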


### It's all in the clouds
Let's explore some of the positive and negative words found in the TED Talks. The size of each word reflects its number of mentions. Note that in the positive plot I have taken out some of the most common words so that more of the remaining words are visible.
```{r, message= FALSE}
# Wordclouds

t0<- wordObj %>% ungroup() %>% filter(combined > 0) %>% count(word) %>% 
     filter(! word %in% c("technologies","intelligence","like","work","right",
                          "kind","well","want","ability"))
multiplier<- ceiling(nrow(t0)/8)
wordcloud2::wordcloud2(t0, color= rep(myCols8, multiplier))
```
```{r, message= FALSE}
t0<- wordObj %>% filter(combined < 0) %>% count(word)  
multiplier<- ceiling(nrow(t0)/8)
wordcloud2::wordcloud2(t0, color= rep(myCols8, multiplier))
```


### Who are you calling "negative"?
Now we sum the combined word scores within each sentence to identify some of the most highly-charged negative and positive sentences.
```{r, fig.align="center", message=FALSE}
# Positive and negative quotes
t0<- wordObj %>% group_by(Year,title,sent) %>%
     summarise(polarity= sum(combined,na.rm=T))

negative<- t0 %>% ungroup() %>% top_n(10, wt= -polarity) %>% mutate(text= str_wrap(sent, 60))
negative<- negative[1:10,]
quote_plot(negative,"Most Negative Quotes", sizes= rep(2.6,nrow(negative)), spacing= "linear")

```
```{r}
positive<- t0 %>% ungroup() %>% top_n(10, wt= polarity) %>% mutate(text= str_wrap(sent, 60))
quote_plot(positive, "Most Positive Quotes", sizes= rep(2.6,nrow(positive)) , spacing= "linear")
```
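
The ranking step can be sketched on made-up data as follows. The `toy_` objects are hypothetical, and `slice_min`/`slice_max` are used here as one possible alternative to `top_n` that returns a fixed number of rows even when scores are tied.

```r
library(dplyr)

# Made-up word scores for three sentences
toy_scores<- tibble(sent=     c("a", "a", "b", "c", "c"),
                    combined= c(-3,  -1,   2,   0,  -1))

# One polarity value per sentence: the sum of its word scores
toy_sent<- toy_scores %>% group_by(sent) %>%
  summarise(polarity= sum(combined), .groups= "drop")

# Most negative and most positive sentence
toy_neg<- toy_sent %>% slice_min(polarity, n= 1, with_ties= FALSE)
toy_pos<- toy_sent %>% slice_max(polarity, n= 1, with_ties= FALSE)

print(toy_neg)
print(toy_pos)
```

Note that a plain sum favours long sentences, since they contain more scored words.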

### Is the AI honeymoon over?
Finally, we map the overall talk sentiment (positive or negative) and how this has changed across time.
```{r, message= FALSE}
# Talk Polarity

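talk_pol<- wordObj %>% group_by(Year, title) %>% 
              summarise(polarity= sum(combined, na.rm = TRUE)) %>%
              ungroup()

# Talk length (number of spaces, roughly the word count), matched by title:
# talk_pol is sorted by Year and title, so its rows need not line up with df
# (assumes each title appears only once in df)
talk_len<- df %>% transmute(title, length= str_count(full_text," "))
talk_pol<- left_join(talk_pol, talk_len, by= "title")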

g<- ggplot(talk_pol, aes(x= Year, y= polarity, color= -polarity, label= title)) +
  geom_jitter(aes(size= length), alpha= 0.5) +
  geom_smooth(size= 0.5,  alpha= 0.3, se= FALSE, color= "grey70") +
  scale_color_continuous_sequential(palette = "Plasma") +
  geom_hline(yintercept = 0, color= "grey50", alpha= 0.5) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(title= "TED Talk Overall Sentiment by Year", y="Polarity",
       subtitle = "Point size indicates length of the talk")

ggplotly(g) %>%
  style(text = paste0(talk_pol$Year,"<br>",
                      talk_pol$title,"<br>Polarity: ",
                      talk_pol$polarity), traces = 1) 
```
We can see that the first properly negative talk didn't appear until 2013; before then, all talks on the topic of AI had been either neutral or positive. Since 2016, however, there has been a growing awareness of the more negative potential implications of AI, while other talks continue to emphasise the positive.




