TED Talks are concise, inspirational speeches delivered by experts across a vast number of fields. They’re meant to spread powerful and important ideas on just about every topic you could imagine. You’ve probably watched or at least heard about a TED Talk before, as some of the talks amass millions of views from people all around the world. As an avid watcher of TED Talks, I was interested in what makes these renowned talks popular: what contributes to their ability to gain interest and attention from so many different people. This information is interesting by itself, but it also has implications for people trying to communicate their own persuasive ideas.
This report serves as an exploratory analysis of TED Talks, trying to determine what factors might contribute to their popularity. I’m interested in the effect that the duration of the talk, the content of the talk, and the speaker of the talk have on its popularity, as defined by the number of people who view it. While this is an exploratory analysis, I do have some hypotheses. I predict that talks that run over 15 minutes in length will receive fewer views than shorter talks. I also predict that the sentiment of the most popular talks will be more positive than that of the least popular talks, and of TED Talks in general.
This analysis delves into the factors that make a TED Talk successful by comparing the most popular and least popular TED Talks. As mentioned before, the topics for TED Talks are incredibly wide ranging, with each talk given by a unique speaker on a very specific topic. Because of this, any trends in word usage and sentiment that do show up are all the more notable. The datasets used for this analysis are from Kaggle, and contain information about all audio-video recordings of TED Talks (TEDx Talks included) uploaded to the official TED.com website until September 21st, 2017.
I began this analysis by loading the following packages: tidyverse, tidytext, dplyr, ggplot2, and wordcloud2. I then imported the csv files from Kaggle and merged them together into one dataset.
library(tidyverse)
library(tidytext)
library(dplyr)
library(ggplot2)
library(wordcloud2)
tedMain <- read.csv("~/R/TedTalkAnalysis/tedMain.csv", stringsAsFactors=FALSE)
tedTranscripts <- read.csv("~/R/TedTalkAnalysis/tedTranscripts.csv", stringsAsFactors=FALSE)
tedTalks <- merge(tedMain, tedTranscripts, by="url")
I wanted to know whether the length of a video influenced the number of views it had - were people more likely to view a video that is under 10 minutes than one that is over 15? The organizers of TED do stress that presentations should be kept under 18 minutes, but this rule is not always followed. I created a scatter plot to visualize the relationship between duration and views, and then performed a correlation test to quantify it.
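The plotting code that follows uses a `durationMinutes` column, but the code shown never creates it. As a minimal sketch, assuming the raw data stores talk length in seconds in a column called `duration` (an assumption about the Kaggle file, not something confirmed here), the conversion and a quick check of the 15-minute hypothesis could look like this. The `talks` data frame and its values are made up for illustration, and the example sticks to base R so it runs without the packages loaded above.

```r
# Toy stand-in for tedTalks; `duration` in seconds is an assumption about the raw data.
talks <- data.frame(
  duration = c(540, 780, 1020, 1260, 600),
  views    = c(1200000, 950000, 800000, 400000, 2100000)
)

# Convert seconds to minutes so the plot axis reads naturally.
talks$durationMinutes <- talks$duration / 60

# Flag talks that run over 15 minutes, matching the hypothesis stated earlier.
talks$over15 <- talks$durationMinutes > 15

# Median views per group; the median is less sensitive to a few viral talks than the mean.
medianViews <- tapply(talks$views, talks$over15, median)
print(medianViews)
```

On the real data, the same two lines (`durationMinutes` and `over15`) would be applied to `tedTalks` before plotting.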
ggplot(tedTalks, aes(durationMinutes, views)) +
geom_point(color="red") -> scatterplot1
scatterplot1 +
xlab("Duration (minutes)") +
ylab("Views") +
ggtitle("Is Duration of Talk Related to Its Popularity?") +
theme_minimal()
cor.test(tedTalks$durationMinutes, tedTalks$views)
The scatterplot and correlation test both show only a very weak relationship between the duration of a talk and its views: the correlation between the two was just 0.06372947. Even so, I do still think there is some useful information in the scatter plot. The majority of the most popular talks fall between roughly 10 and 20 minutes. Far fewer talks exceeded 20 minutes in length, no doubt due to requests from TED organizers, and none of those talks reached especially high numbers of views.
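A correlation of 0.06372947 corresponds to an r-squared of about 0.004, meaning duration accounts for well under one percent of the variation in views. Because view counts are heavily skewed by a handful of viral talks, it may also be worth checking a rank-based or log-scale correlation alongside the default Pearson test. The sketch below uses made-up numbers (not the TED data) purely to show the calls.

```r
# Made-up durations (minutes) and views, with one viral outlier; not real TED data.
duration <- c(5, 8, 11, 14, 17, 20, 23, 26)
views    <- c(900000, 1500000, 47000000, 2100000, 1800000, 1200000, 600000, 400000)

# Pearson correlation on raw views (what cor.test() reports by default).
pearsonRaw <- cor(duration, views)

# Pearson correlation after log-transforming views to damp the outlier.
pearsonLog <- cor(duration, log10(views))

# Spearman rank correlation, which depends only on the ordering of the values.
spearman <- cor.test(duration, views, method = "spearman")$estimate

# Share of variance explained by the correlation reported in the text.
rSquared <- 0.06372947^2

print(c(pearsonRaw = pearsonRaw, pearsonLog = pearsonLog, spearman = unname(spearman)))
print(round(rSquared, 4))  # 0.0041
```

If the three coefficients disagree noticeably on the real data, that would suggest the Pearson figure is being driven by a few extreme talks rather than by the bulk of the dataset.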
After looking at TED Talks broadly, I wanted to narrow the focus to the most and least popular ones included in the dataset, as defined by number of views.
top_n(tedTalks,15,views) -> top15views
ggplot(top15views, aes(reorder(name, views), views)) +
geom_col(fill="purple") +
ylab("Views") +
xlab("TedTalk") +
ggtitle("Top 15 views") +
theme_minimal() -> TopViewsPlot
top_n(tedTalks,-15,views) -> bottom15views
ggplot(bottom15views, aes(reorder(name, views), views)) +
geom_col(fill="purple") +
ylab("Views") +
xlab("TedTalk") +
ggtitle("Bottom 15 views") +
theme_minimal() -> BottomViewsPlot

The 15 most viewed TED Talks:

| Speaker | TED Talk Title | Views |
|---|---|---|
| Ken Robinson | Do schools kill creativity? | 47,227,110 |
| Amy Cuddy | Your body language may shape who you are | 43,155,405 |
| Simon Sinek | How great leaders inspire action | 34,309,432 |
| Brene Brown | The power of vulnerability | 31,168,150 |
| Mary Roach | 10 things you didn’t know about orgasm | 22,270,883 |
| Julian Treasure | How to speak so that people want to listen | 21,594,632 |
| Jill Bolte Taylor | My stroke of insight | 21,190,883 |
| Tony Robbins | Why we do what we do | 20,685,401 |
| James Veitch | This is what happens when you reply to spam email | 20,475,972 |
| Cameron Russell | Looks aren’t everything. Believe me, I’m a model | 19,787,465 |
| Dan Pink | The puzzle of motivation | 18,830,983 |
| Susan Cain | The power of introverts | 17,629,275 |
| Pamela Meyer | How to spot a liar | 16,861,578 |
| Robert Waldinger | What makes a good life? Lessons from the longest study on happiness | 16,601,927 |
| Shawn Achor | The happy secret to better work | 16,209,727 |

The 15 least viewed TED Talks:

| Speaker | TED Talk Title | Views |
|---|---|---|
| Bilal Bomani | Plant fuels that could power a jet | 155,895 |
| David Birch | A new way to stop identity theft | 174,326 |
| Jackie Tabick | The balancing act of compassion | 176,245 |
| Rick Falkvinge | I am a pirate | 181,010 |
| Jon Boogz and Lil Buck | A dance to honor Mother Earth | 182,975 |
| Keith Bellows | The camel’s hump | 185,275 |
| Roger Doiron | My subversive (garden) plot | 191,555 |
| Paul MacCready | Nature vs. humans | 197,139 |
| Joseph Lekuton | A parable for Kenya | 200,726 |
| James Forbes | Compassion at the dinner table | 204,410 |
| Jamais Cascio | Tools for a better world | 212,202 |
| Susan Shaw | The oil spill’s toxic trade-off | 220,099 |
| Franco Sacchi | A tour of Nollywood, Nigeria’s booming film industry | 223,082 |
| Daniel Pauly | The ocean’s shifting baseline | 224,768 |
| Seyi Oyesola | A hospital tour in Nigeria | 230,569 |
The least viewed TED Talks have garnered far fewer views than the most popular TED Talks, and far fewer than the average TED Talk. The average number of views for these least popular TED Talks was 197,351, compared to an average of 1,738,340 views across all TED Talks.
In looking at the topics of these TED Talks, the most popular ones seem to focus on helping people better their own lives or on answering large questions. These types of talks are timeless and always applicable, possibly contributing to their enduring popularity. The least popular TED Talks didn’t have as clear a theme running through their topics, but did seem to focus on broader issues like healthcare in Nigeria or the environmental impact of an oil spill. These may lose relevance as time passes, which is perhaps a contributing factor to the low number of views they receive.
Aside from noting these general themes, what specific factors influence these huge disparities in views among TED Talks? In other words, what makes some talks so successful and others not? There are undoubtedly many factors at play; the remainder of this report investigates four of them: the most frequently used words, the use of personal pronouns, the sentiment of the talks, and the gender of the speaker.
To find the most frequently used words in the most popular and least popular TED Talks, I began by unnesting the tokens and removing stopwords (words such as “an”, “a”, “any”). I then further filtered the results to exclude “laughter”, “applause” and “music” (and, for the least popular talks, the stray token “la”), as these words appear frequently in the transcripts of TED Talks but refer to audience reactions or accompanying sounds rather than to anything the speaker said.
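The averages can be checked directly from the view counts listed in the two tables. The sketch below re-enters those counts by hand; the overall average of 1,738,340 is simply the figure quoted in the text, since the full dataset isn't reproduced here.

```r
# View counts copied from the table of the 15 most viewed talks.
top15 <- c(47227110, 43155405, 34309432, 31168150, 22270883,
           21594632, 21190883, 20685401, 20475972, 19787465,
           18830983, 17629275, 16861578, 16601927, 16209727)

# View counts copied from the table of the 15 least viewed talks.
bottom15 <- c(155895, 174326, 176245, 181010, 182975,
              185275, 191555, 197139, 200726, 204410,
              212202, 220099, 223082, 224768, 230569)

# Overall average as quoted in the text (not recomputed here).
allTalksMean <- 1738340

meanTop    <- mean(top15)
meanBottom <- mean(bottom15)

print(meanBottom)                 # 197351.7, matching the figure in the text
print(meanTop)
print(allTalksMean / meanBottom)  # how many times larger the overall average is
```

The bottom-15 mean reproduces the 197,351 quoted above, and the overall average comes out between eight and nine times larger.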
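To make the steps concrete before the full pipeline, here is a simplified base R stand-in for what tokenizing, stop-word removal, and cue filtering do to a single transcript. The sample sentence and the tiny `stopList` are hypothetical; the real analysis uses `unnest_tokens()` and the much larger `stop_words` table from tidytext, which handle punctuation and edge cases more carefully than this sketch.

```r
# Hypothetical one-line transcript with audience cues in parentheses.
transcript <- "Thank you. (Laughter) The idea is simple: people learn when people play. (Applause)"

# Lower-case the text and split on anything that is not a letter or apostrophe.
tokens <- unlist(strsplit(tolower(transcript), "[^a-z']+"))
tokens <- tokens[tokens != ""]

# Tiny illustrative stop-word list; tidytext's stop_words is far longer.
stopList <- c("the", "is", "when", "you", "a", "an", "any")

# Transcript cues that describe the room rather than the speech.
cues <- c("laughter", "applause", "music")

# Drop stop words and cues, then count what remains.
kept <- tokens[!tokens %in% c(stopList, cues)]
wordCounts <- sort(table(kept), decreasing = TRUE)
print(wordCounts)
```

Here “people” comes out on top with two occurrences, while “laughter” and “applause” never reach the count table.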
bottom15views %>%
unnest_tokens(word,transcript) %>%
anti_join(stop_words) %>%
count(word,sort=TRUE) -> bottom15words
bottom15words %>%
filter(!word %in% c("laughter", "applause", "la", "music")) -> bottom15wordsFiltered
bottom15wordsFiltered %>%
head(15) -> bottom15wordsSorted
ggplot(bottom15wordsSorted, aes(reorder(word, n), n)) +
geom_col(fill="red") +ylab("Count") +xlab("Word") +
ggtitle("Most Popular Words in Bottom 15 TedTalks") +
coord_flip() +theme_minimal() ->
Bottom15wordsPlot1
Bottom15wordsPlot1 + theme(axis.text.x=element_text(angle=45, hjust=1))
top15views %>%
unnest_tokens(word,transcript) %>%
anti_join(stop_words) %>%
count(word,sort=TRUE) -> top15words
top15words %>%
filter(!word %in% c("laughter", "applause", "music")) -> top15wordsFiltered
top15wordsFiltered %>%
head(15) -> top15wordsSorted
ggplot(top15wordsSorted, aes(reorder(word, n), n)) +
geom_col(fill="red") +ylab("Count") +xlab("Word") +
ggtitle("Most Popular Words in Top 15 TedTalks") +
coord_flip() +theme_minimal() ->
Top15wordsPlot1
Top15wordsPlot1 + theme(axis.text.x=element_text(angle=45, hjust=1))
Five words were frequently used in both the most and least popular TED Talks: people, time, world, life, and power. The frequent use of these words across talks on all different topics sheds light on the nature of TED Talks as a whole, regardless of their popularity.
| Word | Frequency in Most Popular TED Talks | Frequency in Least Popular TED Talks |
|---|---|---|
| people | 230 | 114 |
| time | 90 | 73 |
| world | 66 | 80 |
| life | 62 | 33 |
| power | 44 | 30 |
Even though these five words are found often in both the most and least popular talks, the frequency at which they are used differs. With the exception of the word “world”, these words are used more frequently in the most popular TED Talks. Most notably, the word “people”, which was the most frequent word in all TED Talks, is used about twice as often in the most popular TED Talks as in the least popular. It makes sense that “people” would be the most popular word across all TED Talks - TED Talks are all about people - how they think, how they behave, how they learn, how they are affected by the world around them and in turn how they affect the world we live in.
It was interesting to see that in addition to the word “people”, the most popular TED Talks also featured high frequencies of the words “human” and “person”, while the least popular did not. I think this further supports my earlier statement that the most popular talks are focused on people, and on how they can better themselves and improve their lives.
The only word of these five that is used more frequently in the least popular TED Talks than in the most popular was “world”. This supports the observation made earlier that the least popular TED Talks deal with more generalized issues within the world, and focus less on material that has implications for individuals.
After looking at how overall word use compared between the most and least popular TED Talks, I was interested in seeing how personal pronoun use specifically compared. The use of personal pronouns by speakers is an often overlooked part of speeches, but I think it can reflect a lot about the nature of the speech and how the speaker is relating to the audience. Using a list of personal pronouns from grammar.com, I analyzed the use of first-person singular and plural personal pronouns in the talks (the code also counts second-person forms such as “you” and “your”). To do this, I created a personal pronoun variable, turned it into a dataframe, and then used it to filter the word counts from the most and least popular TED Talks, creating a pronoun count variable for each. I then performed an rbind of the pronoun count variables and used ggplot to visualize the use of personal pronouns in TED Talks.
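One caveat with comparing raw counts is that the two groups of transcripts may not contain the same number of words overall, so a higher count could partly reflect longer talks. A common way to put the groups on equal footing is a rate per 1,000 words. The sketch below uses the counts from the table above; the two totals are hypothetical placeholders, since the real totals (for example, the sum of the `n` column in each count table) aren't reported here.

```r
# Counts taken from the comparison table above.
wordTable <- data.frame(
  word   = c("people", "time", "world", "life", "power"),
  top    = c(230, 90, 66, 62, 44),
  bottom = c(114, 73, 80, 33, 30)
)

# Hypothetical total word counts per group; replace with the real totals.
totalTop    <- 30000
totalBottom <- 20000

# Occurrences per 1,000 words in each group.
wordTable$topPer1000    <- 1000 * wordTable$top / totalTop
wordTable$bottomPer1000 <- 1000 * wordTable$bottom / totalBottom

# Ratio of raw counts, for comparison with the claim in the text.
wordTable$rawRatio <- wordTable$top / wordTable$bottom

print(wordTable)
```

The raw ratio for “people” is just over 2, which is where the “twice as often” figure comes from; whether that gap survives normalization depends on the actual totals.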
pronouns <- c("i", "me", "my", "mine", "myself", "we",
"us", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves")
pronouns <- tibble(word = pronouns)
top15unfiltered <- top15views %>%
unnest_tokens(word,transcript) %>%
count(word,sort=TRUE)
TopTalksPronounCount <- top15unfiltered %>% filter(word %in% pronouns$word)
TopTalksPronounCount["Order"] <- "Top 15"
bottom15unfilteredwords <- bottom15views %>%
unnest_tokens(word,transcript) %>%
count(word,sort=TRUE)
BottomTalksPronounCount <- bottom15unfilteredwords %>% filter(word %in% pronouns$word)
BottomTalksPronounCount["Order"] <- "Bottom 15"
PronounCounts <- rbind(BottomTalksPronounCount, TopTalksPronounCount)
ggplot(PronounCounts, aes(word,n,fill=Order)) + geom_bar(stat="identity") + ylab("Count") +xlab("Pronoun") +ggtitle("Pronoun Use in TedTalks") + coord_flip() + theme_minimal()
Grammar.com defines first-person singular personal pronouns as I, me, my, and mine, and first-person plural personal pronouns as we, us, our, and ours. The most popular TED Talks use singular personal pronouns at a much higher frequency than the least popular TED Talks, while the least popular TED Talks make greater use of plural personal pronouns, like “we” and “us”. This suggests that speaking on a direct, personal level to try to resonate with the audience is a good practice for speakers giving TED Talks. The higher use of first-person personal pronouns overall in the most popular TED Talks reinforces the importance of relating to the audience.
In order to analyze the sentiment of TED Talks, I used the AFINN sentiment scale. This scale assigns a score to each of the words within its lexicon, ranging from -5 to 5. Negative scores indicate the word has a negative sentiment and positive scores indicate the word has a positive sentiment. I began by creating a new variable that included the words from all the TED Talks, as I had previously only created variables for the most and least popular. Once I had done this, I performed an inner join of the AFINN sentiment scale with the words from all TED Talks, and then with the words from the most and least popular. I then found the average sentiment score for TED Talks overall, and then specifically within the 15 most popular talks and 15 least popular talks.
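To compare singular and plural use across groups of different sizes, it can help to express each as a share of all words rather than as a raw count. The following is a rough sketch on a made-up sentence; `pronounShare` is a hypothetical helper, and the tokenizer is a simplified base R stand-in for `unnest_tokens()`.

```r
# First-person pronoun lists, following the pronoun vector used in the code above.
singular <- c("i", "me", "my", "mine", "myself")
plural   <- c("we", "us", "our", "ours", "ourselves")

# Hypothetical helper: share of tokens that are singular vs plural first-person pronouns.
pronounShare <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[tokens != ""]
  c(singular = sum(tokens %in% singular),
    plural   = sum(tokens %in% plural)) / length(tokens)
}

# Made-up sentence with 15 words: 3 singular and 2 plural pronouns.
shares <- pronounShare("I told my team that we could fix our process if I listened to them.")
print(shares)
```

Applied to the concatenated transcripts of each group, the same helper would give directly comparable singular and plural rates.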
tedTalks %>%
unnest_tokens(word,transcript) %>%
anti_join(stop_words) -> allwords
allwords %>%
filter(!word %in% c("laughter", "applause", "la", "music")) ->
allwordsFiltered
allwordsSent <- allwordsFiltered %>%
inner_join(get_sentiments("afinn"))
allwordsSent
mean(allwordsSent$score)
AllTalksSentAfinn = -0.5532017
TopTalksSentAfinn <- top15wordsFiltered %>%
inner_join(get_sentiments("afinn"))
TopTalksSentAfinn
mean(TopTalksSentAfinn$score) -> TopTalksSentAfinn2
TopTalksSentAfinn2
TopTalksAfinnScore= -0.02621723
BottomTalksSentAfinn <- bottom15wordsFiltered %>%
inner_join(get_sentiments("afinn"))
BottomTalksSentAfinn
mean(BottomTalksSentAfinn$score) -> BottomTalksSentAfinn2
BottomTalksSentAfinn2
BottomTalksAfinnScore= 0.0575
Score = c(-0.02621723, 0.0575, -0.5532017)
Talk = c("Most Pop. Talks","Least Pop. Talks", "All Talks")
AllAFinnData <- data.frame(Talk, Score)
AFinnChart <- AllAFinnData %>%
ggplot(aes(Talk, Score)) + geom_bar(stat = "identity",fill="red") +
ggtitle("Sentiment of TED Talks") +
ylim(-.75, .5) +
theme_minimal()+
ylab("Score") +
xlab("Talk")
AFinnChart + theme(axis.text.x=element_text(angle=45, hjust=1))
These results were not at all what I was expecting - the least popular TED Talks had the highest average sentiment, and theirs was the only average that was positive. The most popular TED Talks had a slightly negative sentiment, which was contrary to what I had predicted. To further complicate things, TED Talks overall had a lower sentiment than both the most and least popular TED Talks. I wanted to investigate this further, to try to find some sort of explanation. To do this, I looked at the ten words that contributed the most to this average sentiment score in the most and least popular TED Talks.

Most frequent AFINN words in the least popular TED Talks:

| Word | # of Times Used | Sentiment Score |
|---|---|---|
| god | 36 | 1 |
| love | 20 | 3 |
| care | 19 | 2 |
| happy | 18 | 3 |
| hope | 17 | 2 |
| growing | 13 | 1 |
| share | 13 | 1 |
| wonderful | 12 | 4 |
| pretty | 11 | 1 |
| healthy | 10 | 2 |

Most frequent AFINN words in the most popular TED Talks:

| Word | # of Times Used | Sentiment Score |
|---|---|---|
| love | 45 | 3 |
| happy | 22 | 3 |
| powerful | 21 | 2 |
| hard | 18 | -1 |
| positive | 17 | 2 |
| true | 16 | 2 |
| vulnerability | 16 | 1 |
| wrong | 16 | -2 |
| fake | 15 | -3 |
| happiness | 15 | 3 |
These tables display the words from the AFINN sentiment lexicon that were used most often in each group of talks. You can see that words like hard, wrong, and fake were used frequently in the most popular TED Talks, and these words all received negative sentiment scores (vulnerability also appears often, though the table lists it with a score of 1). However, in the context of the talks, these words don’t necessarily have bad connotations - they could speak to struggles or challenges that were overcome. The lack of these words, or words like them, in the least popular TED Talks suggests that talks that tackle problems (and hopefully offer solutions) make for the most intriguing and successful talks. The fact that the sentiment score for all talks in general was lower suggests that mentioning problems or difficult subjects too often may discourage views.
This sentiment analysis essentially suggests that the use of words with negative sentiment has to be just right - not so high that people get discouraged, and not so low that people aren’t inspired.
The last factor of TED Talks I looked into was the gender of the speaker - were the most popular TED Talks primarily delivered by men or women, and how does this compare to the least popular? The dataset I used in this analysis did not provide the gender of the speaker, so I had to determine this for myself. The dataset included far too many talks to go through and label the gender of each speaker, so I focused only on the speakers of the most and least popular talks. I searched for each speaker on Google and used my best judgment, based on search results and images/video, to record their gender as either male or female. I then created a new dataframe with this information and created pie charts.
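One thing worth checking when comparing the three averages is what each one is an average over. In the code above, the all-talks join appears to run on one row per word occurrence, while the top and bottom joins run on the count tables, which hold one row per distinct word. If that reading is right, the three means aren't measured on quite the same footing. The sketch below uses the ten words from the most popular talks’ table to show how an unweighted mean and an occurrence-weighted mean can differ. It is only an illustration on ten words, not a recalculation of the reported scores. (As far as I know, more recent tidytext releases name the AFINN column `value` rather than `score`, so the column name may need adjusting.)

```r
# Words, counts, and AFINN scores copied from the most popular talks' table.
topWords <- data.frame(
  word  = c("love", "happy", "powerful", "hard", "positive",
            "true", "vulnerability", "wrong", "fake", "happiness"),
  n     = c(45, 22, 21, 18, 17, 16, 16, 16, 15, 15),
  score = c(3, 3, 2, -1, 2, 2, 1, -2, -3, 3)
)

# Unweighted: every distinct word counts once, however often it was said.
unweighted <- mean(topWords$score)

# Weighted: each word counts once per occurrence.
weighted <- weighted.mean(topWords$score, w = topWords$n)

print(unweighted)  # 1
print(weighted)    # 275 / 201, roughly 1.37
```

Using `weighted.mean(score, w = n)` on the joined count tables would put the top and bottom averages on the same per-occurrence basis as the all-talks average.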
GenderTop <- data.frame(group = c("Male","Female"), value = c(8, 7))
GenderTopbp<- ggplot(GenderTop, aes(x="", y=value, fill=group))+
geom_bar(width = 1, stat = "identity")
GenderTopbp
GenderTopPie <- GenderTopbp + coord_polar("y", start=0)
GenderTopPie + scale_fill_manual(values=c("red", "light blue")) + theme_minimal() +
xlab("") + ylab("") + ggtitle("Gender of Speakers in Most Popular TedTalks")
GenderBottom <- data.frame(group = c("Male","Female"), value = c(13, 2))
GenderBottombp<- ggplot(GenderBottom, aes(x="", y=value, fill=group))+
geom_bar(width = 1, stat = "identity")
GenderBottombp
GenderBottomPie <- GenderBottombp + coord_polar("y", start=0)
GenderBottomPie + scale_fill_manual(values=c("red", "light blue")) +theme_minimal() +
xlab("") + ylab("") +ggtitle("Gender of Speakers in Least Popular TedTalks")
In the most popular TED Talks, the distribution of gender among speakers is very even, with 8 of the most popular talks given by men and 7 by women. In the least popular TED Talks there is a much more uneven distribution, with 13 of the talks given by men and only 2 by women. There is no way to compare this to the overall gender distribution among speakers, as this data wasn’t included in the dataset, but it is still interesting to note these trends.
There seem to be lots of commonalities between the most and least popular TED Talks - in the most frequent words used, in the most common pronouns, and in overall sentiment - which leads me to believe that there are other factors that might be more important when determining the popularity of a talk - perhaps the speaker and their reputation. However, there are some general guidelines/best practices for a successful TED Talk that emerge from this analysis:
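With only 15 talks in each group, it is worth asking how much weight the 8/7 versus 13/2 split can bear. One hedged way to look at it is Fisher's exact test on the 2x2 table of counts, which is designed for small samples. The sketch below uses the counts reported above; it is offered as a possible follow-up check, not as part of the original analysis.

```r
# Speaker counts from the text: rows are talk groups, columns are genders.
genderCounts <- matrix(
  c(8, 7,
    13, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("Top 15", "Bottom 15"),
                  gender = c("Male", "Female"))
)

# Share of women in each group.
femaleShare <- prop.table(genderCounts, margin = 1)[, "Female"]
print(femaleShare)

# Fisher's exact test for association between group and gender.
result <- fisher.test(genderCounts)
print(result$p.value)
```

A small p-value would suggest the difference is unlikely to be chance alone, but with samples this small the test has limited power either way, so the result is best read as a prompt for labeling more speakers rather than as a conclusion.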