Introduction

TED, short for Technology, Entertainment and Design, “is a nonprofit devoted to spreading ideas.” One of the main ways TED pursues this mission is through TED Talks: short presentations on almost any topic imaginable. What the talks have in common is that each one seeks to teach listeners something new.
https://www.ted.com/about/our-organization

This project analyzes whether the positive or negative sentiment of a TED Talk is related to the popularity of the talk, measured by its number of views.

Hypothesis

I predict that the sentiment in the five most viewed talks will be more positive than negative and, conversely, that the sentiment in the five least viewed talks will be more negative than positive. My reasoning is that viewers want to listen to TED Talks that are positive.

Setup

Before running any of my analysis functions, I first had to load the necessary packages.

library(ggthemes)
library(ggplot2)
library(wordcloud2)
library(tidyverse)
library(stringr)
library(tidytext)
library(textdata)

After loading the required packages, I imported two datasets from Kaggle. Together, the datasets include every TED Talk up until September 21, 2017.

Main dataset: https://www.kaggle.com/rounakbanik/ted-talks?select=ted_main.csv

Transcripts dataset: https://www.kaggle.com/rounakbanik/ted-talks?select=transcripts.csv

Data Import

tedMain <- read.csv("~/Desktop/ted_main.csv", stringsAsFactors=FALSE)

tedTranscripts <- read.csv("~/Desktop/transcripts.csv", stringsAsFactors=FALSE)

After importing the two datasets, I merged them on their shared url column and then selected the five most viewed and the five least viewed talks.

tedTalks <- merge(tedMain, tedTranscripts, by = "url")

# Five most viewed talks
top_n(tedTalks, 5, views) -> top5views
# Five least viewed talks (a negative n selects the smallest values)
top_n(tedTalks, -5, views) -> bottom5views
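
As a minimal sketch of how the merge and the two selections behave, the toy example below uses made-up URLs and view counts. The names main, transcripts, talks, most_viewed and least_viewed are placeholders for illustration and are not part of the analysis.

```r
library(dplyr)

# Toy stand-ins for ted_main.csv and transcripts.csv (all values are made up)
main <- data.frame(
  url   = c("a", "b", "c", "d"),
  views = c(100, 400, 250, 50),
  stringsAsFactors = FALSE
)
transcripts <- data.frame(
  url        = c("a", "b", "c"),   # talk "d" has no transcript
  transcript = c("text a", "text b", "text c"),
  stringsAsFactors = FALSE
)

# merge() keeps only the URLs that appear in both tables (an inner join)
talks <- merge(main, transcripts, by = "url")

# A positive n keeps the rows with the largest views, a negative n the smallest
most_viewed  <- top_n(talks, 2, views)
least_viewed <- top_n(talks, -2, views)
```

Two details are worth keeping in mind: a talk without a transcript drops out at the merge step, and top_n() can return more than n rows when view counts are tied. Newer versions of dplyr offer slice_max() and slice_min() for the same purpose.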

Top Ten

Next, I extracted the ten most frequent words from the transcripts of the five most viewed and the five least viewed talks. Before applying sentiment analysis and narrowing the focus to five words per group, I wanted to look at the top ten words to gain a broader understanding of each group.

top5views %>%
  unnest_tokens(word, transcript) -> top5words

top5words %>%
  count(word, sort = TRUE) %>%
  anti_join(stop_words) %>%
  arrange(desc(n)) %>%
  head(10) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  theme_calc()

bottom5views %>%
  unnest_tokens(word, transcript) -> bottom5words

bottom5words %>%
  count(word, sort = TRUE) %>%
  anti_join(stop_words) %>%
  arrange(desc(n)) %>%
  head(10) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  theme_calc()
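
The counting step behind these charts can be sketched on a toy input. The two short transcripts and the names toy and word_counts below are invented for illustration; only unnest_tokens(), anti_join(), count() and the stop_words table come from the packages used in this report.

```r
library(dplyr)
library(tidytext)

# Two made-up transcripts standing in for real talks
toy <- tibble::tibble(
  talk       = c(1, 2),
  transcript = c("People love ideas.", "The people share the ideas!")
)

word_counts <- toy %>%
  unnest_tokens(word, transcript) %>%      # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop common words such as "the"
  count(word, sort = TRUE)                 # most frequent words first

head(word_counts, 10)
```

Passing by = "word" to anti_join() is optional. It only silences the "Joining, by" message that the join otherwise prints.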

Filter

To run sentiment analyses with Afinn, Bing, and NRC, I first had to prepare the imported data: unnest the tokens, remove stop words, and filter out unnecessary words.

top5views %>%
  unnest_tokens(word, transcript) %>%
  anti_join(stop_words) %>%
  filter(!word %in% c("laughter", "la", "music", "ha")) -> top5WordsFiltered

## Joining, by = "word"

bottom5views %>%
  unnest_tokens(word, transcript) %>%
  anti_join(stop_words) %>%
  filter(!word %in% c("laughter", "la", "music", "ha")) -> bottom5WordsFiltered

## Joining, by = "word"

Unnesting the tokens splits the ‘transcript’ column of the datasets into one word per row, so that individual words can be counted and scored. Filtering ensures that transcribed noises, such as laughter and music cues, are excluded from the analysis.
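
To illustrate why the noise filter works on plain words, here is a small sketch with an invented transcript. The names toy, noise, tokens and kept are hypothetical. With its default settings, unnest_tokens() lowercases the text and strips punctuation, so a cue like "(Laughter)" becomes the token "laughter" and can be matched with %in%.

```r
library(dplyr)
library(tidytext)

# Invented transcript containing typical audience and sound cues
toy <- tibble::tibble(
  transcript = "Thank you. (Laughter) La la la. (Music) Hope matters."
)
noise <- c("laughter", "la", "music", "ha")

tokens <- unnest_tokens(toy, word, transcript)   # lowercased, punctuation removed
kept   <- filter(tokens, !word %in% noise)       # drop the cue words

kept$word
```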

Sentiment Lexicons

Afinn, Bing, & NRC

To understand which TED Talks are more positive, I ran Afinn sentiment analyses. The Afinn lexicon gives each word a score from -5 (most negative rating) to 5 (most positive rating). Averaging the scores of the words in a group of talks gives the mean sentiment of that group, which provides insight into which group uses more positive language.
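
The mean can be sketched with a tiny stand-in lexicon. The table toy_afinn below only mimics the word and value columns of get_sentiments("afinn"); its scores are illustrative assumptions, and tokens, scored and mean_sentiment are hypothetical names.

```r
library(dplyr)

# Hypothetical mini-lexicon in the same shape as the Afinn table
toy_afinn <- tibble::tibble(
  word  = c("love", "powerful", "shame", "wrong"),
  value = c(3, 2, -2, -2)
)

# Made-up tokens; "idea" has no score and "love" occurs twice
tokens <- tibble::tibble(
  word = c("love", "love", "shame", "idea", "powerful", "wrong")
)

# inner_join() keeps only scored words, one row per occurrence
scored <- inner_join(tokens, toy_afinn, by = "word")
mean_sentiment <- mean(scored$value)
```

Because every occurrence contributes its own row, a word that is repeated often carries more weight in the mean, and words missing from the lexicon do not count at all.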

The tables below list the most frequent scored words from each group of talks. Filtering on the Afinn value produces two tables per group: one for words with a sentiment value over 0 and one for words with a value under 0. Each table shows the five words that occur most often within its filter category.

Afinn for the five most popular TED Talks. Mean = 0.38

top5words %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("afinn")) ->top5words_afinn

## Joining, by = "word"
## Joining, by = "word"

mean(top5words_afinn$value)

## [1] 0.3848921

top5words_afinn %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  head(5) %>%
  knitr::kable()

word	n
love	21
powerful	15
applause	14
feeling	11
god	7

Filtering for values greater than 0 keeps all of the words that have a positive sentiment score. Filtering for values less than 0 keeps all of the words that have a negative sentiment score.

top5words_afinn %>% 
  filter(value < 0) %>% 
  count(word, sort = TRUE) %>% 
  head (5) %>% 
  knitr::kable()

word	n
vulnerability	16
numb	10
shame	10
wrong	10
dead	9
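
As an alternative sketch, the positive and negative counts could also be produced in a single pass by labelling each scored word with its polarity first. The data and the names scored and by_polarity are invented for illustration, and the scores are assumptions rather than values taken from the lexicon.

```r
library(dplyr)

# Made-up scored tokens in the shape of top5words_afinn (word + value)
scored <- tibble::tibble(
  word  = c("love", "love", "shame", "wrong", "wrong", "wrong", "powerful"),
  value = c(3, 3, -2, -2, -2, -2, 2)
)

by_polarity <- scored %>%
  mutate(polarity = if_else(value > 0, "positive", "negative")) %>%
  count(polarity, word, sort = TRUE)   # frequency per word within each polarity

by_polarity
```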

Afinn for the five least popular TED Talks. Mean = 0.51

bottom5words %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("afinn")) ->bottom5words_afinn

## Joining, by = "word"
## Joining, by = "word"

mean(bottom5words_afinn$value)

## [1] 0.505814

bottom5words_afinn %>% 
  filter(value > 0) %>% 
  count(word, sort = TRUE) %>% 
  head (5) %>% 
  knitr::kable()

word	n
god	30
love	15
compassionate	8
advantage	6
rich	6

bottom5words_afinn %>% 
  filter(value < 0) %>% 
  count(word, sort = TRUE) %>% 
  head (5) %>% 
  knitr::kable()

word	n
fail	4
wrong	4
bad	3
blah	3
criminal	3

After looking at the mean sentiment of each group of talks through Afinn, it is valuable to see which kinds of sentiment the talks express. The NRC lexicon tags words with emotion categories as well as with positive and negative sentiment, and the bar charts below show how many tagged words fall into each category.

NRC

top5words_nrc <- top5words %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("nrc"))

## Joining, by = "word"
## Joining, by = "word"

ggplot(top5words_nrc) + geom_bar(aes(sentiment))

bottom5words_nrc <- bottom5words %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("nrc"))

## Joining, by = "word"
## Joining, by = "word"

ggplot(bottom5words_nrc) + geom_bar(aes(sentiment))
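
One property of the NRC join is easy to miss: a single word can carry several tags, so it is counted once per tag. The sketch below uses a hypothetical mini-lexicon, toy_nrc, shaped like get_sentiments("nrc"); its word-to-category pairs are illustrative assumptions.

```r
library(dplyr)

# Hypothetical mini-lexicon: one row per word-category pair
toy_nrc <- tibble::tibble(
  word      = c("love", "love", "shame", "shame", "shame"),
  sentiment = c("joy", "positive", "negative", "sadness", "disgust")
)

# Three made-up tokens; "idea" has no tag
tokens <- tibble::tibble(word = c("love", "shame", "idea"))

tagged <- inner_join(tokens, toy_nrc, by = "word")   # one row per matching tag
category_counts <- count(tagged, sentiment)

nrow(tokens)             # number of tokens
sum(category_counts$n)   # number of tags, which can exceed the token count
```

For this reason, the bar heights in an NRC chart add up to the number of tags rather than the number of words in the transcripts.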

Bing

top5words_bing <- top5words %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("bing"))

## Joining, by = "word"
## Joining, by = "word"

ggplot(top5words_bing) + geom_bar(aes(sentiment))

bottom5words_bing <- bottom5words %>% 
  anti_join(stop_words) %>%
  inner_join(get_sentiments("bing"))

## Joining, by = "word"
## Joining, by = "word"

ggplot(bottom5words_bing) + geom_bar(aes(sentiment))

Word Clouds

Word clouds give a visual illustration of the most frequent words in each category (top 5, bottom 5). In the word clouds below, each word is sized according to how often it was used in the talks, which makes the most used words easy to spot.

library(wordcloud2)

top5words_afinn %>% 
  filter(value > 0) %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()

top5words_afinn %>% 
  filter(value < 0) %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()

bottom5words_afinn %>% 
  filter(value > 0) %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()

bottom5words_afinn %>% 
  filter(value < 0) %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()

Conclusion

In analyzing the top ten words from the five most viewed and the five least viewed talks, it is evident that there are shared words. In both groups, the word ‘people’ is among the most frequently used. This suggests that most TED Talks discuss people. The fact that ‘people’ is common in both the most popular and the least popular talks may indicate that most TED Talks share a common bond of discussing social issues.

The Afinn sentiment analyses show that the five most popular talks have a mean sentiment score of 0.38, while the five least popular talks have a mean score of 0.51. This contradicts my hypothesis, as I had anticipated that the most popular talks would have the higher sentiment score. The word ‘god’ may have a large influence on these scores. It is used 30 times in the five least popular talks, and because Afinn scores it as positive, its frequency may pull the mean upward.
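
One way to check how much a single frequent word moves the mean is to recompute the mean with that word left out. The sketch below uses made-up words and scores, and the names scored, mean_all and mean_without are hypothetical. It shows the idea only and does not reproduce the results above.

```r
library(dplyr)

# Made-up scored tokens; the values are illustrative assumptions
scored <- tibble::tibble(
  word  = c(rep("god", 6), "fail", "wrong", "bad", "love"),
  value = c(rep(1, 6), -2, -2, -3, 3)
)

mean_all <- mean(scored$value)

mean_without <- scored %>%
  filter(word != "god") %>%   # leave out the dominant word
  pull(value) %>%
  mean()

c(all_words = mean_all, without_god = mean_without)
```

A large gap between the two numbers would indicate that the group mean depends heavily on that one word.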

Another intriguing aspect of the sentiment analyses is the gap between the number of positive and negative words in the Bing results. In the five most popular talks, more positive words are used than negative ones, but many negative words are used as well. In the five least viewed talks, the difference between the usage of positive words and negative words is larger.

The five most popular talks may also contain more words overall, which makes it challenging to compare the two groups on a word-for-word basis.
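
A possible way around the difference in length is to compare shares instead of raw counts. The sketch below assumes Bing-style positive and negative tags and uses invented data. The names tagged and shares are hypothetical, and this step is not part of the analysis above.

```r
library(dplyr)

# Made-up tagged words for two groups of different size
tagged <- tibble::tibble(
  group     = c(rep("top5", 10), rep("bottom5", 4)),
  sentiment = c(rep("positive", 6), rep("negative", 4),
                rep("positive", 3), rep("negative", 1))
)

shares <- tagged %>%
  count(group, sentiment) %>%
  group_by(group) %>%
  mutate(share = n / sum(n)) %>%   # proportion within each group
  ungroup()

shares
```

In this toy data the larger group has more positive words in absolute terms, but the smaller group has the higher positive share, which is the kind of effect raw counts can hide.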

In conclusion, the results do not support the hypothesis that the most popular TED Talks are popular because they use more positive words, or that the least popular talks are unpopular because they use more negative ones. Viewers may not be looking for positivity in the talks they watch, but rather for topics that they can relate to.

Sentiment Analysis of TED Talks-Bo Hawkes