This is an R Markdown document based on the notebook written by Mikael Huss, available here. The original Python notebook can be found here.
Founded in 1984 by Richard Saul Wurman as a non-profit organisation that aimed to bring together experts from the fields of Technology, Entertainment and Design, TED Conferences have gone on to become the Mecca of ideas from virtually all walks of life. As of 2015, TED and its sister TEDx chapters have published more than 2000 talks for free consumption by the masses, and its speaker list boasts the likes of Al Gore, Jimmy Wales, Shahrukh Khan and Bill Gates.
TED, which operates under the slogan ‘Ideas worth spreading’, has managed to achieve the incredible feat of bringing together world-renowned experts from various walks of life and study and giving them a platform to distill years of their work and research into talks of 18 minutes in length. What’s even more incredible is that their invaluable insights are available on the Internet for free.
Since the time I began watching TED Talks in high school, they have never ceased to amaze me. I have learned an incredible amount, about fields I was completely alien to, in the form of poignant stories, breathtaking visuals and subtle humor. So in this notebook, I wanted to attempt to find insights about the world of TED, its speakers and its viewers and try to answer a few questions that I had always had in the back of my mind.
The data has been obtained by running a custom web scraper on the official TED.com website. The data is shared under the Creative Commons License (just like the TED Talks) and hosted on Kaggle. You can download it here: https://www.kaggle.com/rounakbanik/ted-talks
The main dataset contains metadata about every TED Talk hosted on the TED.com website until September 21, 2017. Let me give you a brief walkthrough of the kind of data available to give you an idea of what is possible with this dataset.
library(tidyverse)
df <- read_csv("data/ted_main.csv")
names(df)
[1] "comments" "description" "duration" "event" "film_date"
[6] "languages" "main_speaker" "name" "num_speaker" "published_date"
[11] "ratings" "related_talks" "speaker_occupation" "tags" "title"
[16] "url" "views"
Features Available
I’m just going to reorder the columns in the order I’ve listed the features for my convenience (and OCD), and convert the Unix timestamps into a human readable format.
library(anytime)
df <- df %>%
  select(name, title, description, main_speaker, speaker_occupation,
         num_speaker, duration, event, film_date, published_date, comments,
         tags, languages, ratings, related_talks, url, views) %>%
  # Convert the Unix timestamp columns to dates (across() replaces the deprecated funs())
  mutate(across(c(film_date, published_date), ~ anydate(.x, tz = "UTC")))
glimpse(df)
Observations: 2,550
Variables: 17
$ name <chr> "Ken Robinson: Do schools kill creativity?", "Al Gore: Averting the climate cr...
$ title <chr> "Do schools kill creativity?", "Averting the climate crisis", "Simplicity sell...
$ description <chr> "Sir Ken Robinson makes an entertaining and profoundly moving case for creatin...
$ main_speaker <chr> "Ken Robinson", "Al Gore", "David Pogue", "Majora Carter", "Hans Rosling", "To...
$ speaker_occupation <chr> "Author/educator", "Climate advocate", "Technology columnist", "Activist for e...
$ num_speaker <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ duration <int> 1164, 977, 1286, 1116, 1190, 1305, 992, 1198, 1485, 1262, 1414, 1538, 1550, 52...
$ event <chr> "TED2006", "TED2006", "TED2006", "TED2006", "TED2006", "TED2006", "TED2006", "...
$ film_date <date> 2006-02-25, 2006-02-25, 2006-02-24, 2006-02-26, 2006-02-22, 2006-02-02, 2006-...
$ published_date <date> 2006-06-27, 2006-06-27, 2006-06-27, 2006-06-27, 2006-06-27, 2006-06-27, 2006-...
$ comments <int> 4553, 265, 124, 200, 593, 672, 919, 46, 852, 900, 79, 55, 71, 242, 99, 325, 30...
$ tags <chr> "['children', 'creativity', 'culture', 'dance', 'education', 'parenting', 'tea...
$ languages <int> 60, 43, 26, 35, 48, 36, 31, 19, 32, 31, 27, 20, 24, 27, 25, 31, 32, 27, 22, 32...
$ ratings <chr> "[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', '...
$ related_talks <chr> "[{'id': 865, 'hero': 'https://pe.tedcdn.com/images/ted/172559_800x600.jpg', '...
$ url <chr> "https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity\n", "http...
$ views <int> 47227110, 3200520, 1636292, 1697550, 12005869, 20685401, 3769987, 967741, 2567...
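The timestamp conversion above can also be sketched in base R, without the anytime package. This is a minimal illustration, not the notebook's own method: the raw value `film_ts` is a hypothetical timestamp, chosen so that it corresponds to the first `film_date` shown in the `glimpse()` output.

```r
# Hypothetical raw value: seconds since 1970-01-01 00:00:00 UTC.
film_ts <- 1140825600

# Interpret the number as a UTC instant, then keep only the calendar date.
film_instant <- as.POSIXct(film_ts, origin = "1970-01-01", tz = "UTC")
film_date <- as.Date(film_instant, tz = "UTC")

print(film_date)  # 2006-02-25
```

Fixing the time zone to UTC in both steps keeps the resulting date independent of the machine the code runs on.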
We also have another dataset which contains the transcript of every talk but we will get to that later. For now, let us begin with the analysis of TED Talks!
We have 2,550 talks at our disposal. These represent all the talks posted on the TED platform until September 21, 2017, and include talks filmed between 1994 and 2017. It has been over two glorious decades of TED.
For starters, let us perform some easy analysis. I want to know what the 15 most viewed TED talks of all time are. The number of views gives us a good idea of the popularity of the TED Talk.
pop_talks <- df %>%
select(title, main_speaker, views, film_date) %>%
arrange(desc(views)) %>%
head(15)
pop_talks
# A tibble: 15 x 4
title main_speaker views film_date
<chr> <chr> <int> <date>
1 Do schools kill creativity? Ken Robinson 47227110 2006-02-25
2 Your body language may shape who you are Amy Cuddy 43155405 2012-06-26
3 How great leaders inspire action Simon Sinek 34309432 2009-09-17
4 The power of vulnerability Brené Brown 31168150 2010-06-06
5 10 things you didn't know about orgasm Mary Roach 22270883 2009-02-06
6 How to speak so that people want to listen Julian Treasure 21594632 2013-06-10
7 My stroke of insight Jill Bolte Taylor 21190883 2008-02-27
8 Why we do what we do Tony Robbins 20685401 2006-02-02
9 This is what happens when you reply to spam email James Veitch 20475972 2015-12-08
10 Looks aren't everything. Believe me, I'm a model. Cameron Russell 19787465 2012-10-27
11 The puzzle of motivation Dan Pink 18830983 2009-07-24
12 The power of introverts Susan Cain 17629275 2012-02-28
13 How to spot a liar Pamela Meyer 16861578 2011-07-13
14 What makes a good life? Lessons from the longest study on happiness Robert Waldinger 16601927 2015-11-14
15 The happy secret to better work Shawn Achor 16209727 2011-05-11
Observations
Let us make a bar chart to visualise these 15 talks in terms of the number of views they garnered.
ggplot(pop_talks, aes(x = reorder(main_speaker, views), y = views/100000,
                      fill = main_speaker)) +
  geom_bar(stat = 'identity') +
  guides(fill = "none") + labs(x = "", y = "Views (x 100,000)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Finally, in this section, let us investigate the summary statistics and the distribution of the views garnered on various TED Talks.
ggplot(df, aes(views)) +
  geom_histogram(aes(y = after_stat(density))) +
  geom_line(stat = "density") + xlim(c(0, 0.4e7))
summary(df$views)
Min. 1st Qu. Median Mean 3rd Qu. Max.
50443 755793 1124524 1698297 1700760 47227110
The average number of views on TED Talks is about 1.7 million, and the median is 1.12 million. This suggests a very high average level of popularity of TED Talks. We also notice that the majority of talks have fewer than 4 million views. We will use this as the cutoff point when constructing box plots in later sections.
Comments
Although the TED website gives us access to all the comments posted publicly, this dataset only gives us the number of comments. We will therefore have to restrict our analysis to this feature only. You could try performing textual analysis by scraping the website for comments.
summary(df$comments)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.0 63.0 118.0 191.6 221.8 6404.0
Observations
ggplot(df, aes(comments)) +
  geom_histogram(aes(y = after_stat(density))) +
  geom_line(stat = "density") + xlim(c(0, 500))
From the plot above, we can see that the bulk of the talks have fewer than 500 comments. The mean (191.6) also sits well above the median (118), which suggests that it has been heavily influenced by outliers. This is possible because the sample contains only 2,550 talks.
Another question that I am interested in is whether the number of views is correlated with the number of comments. We would expect this to be the case, as more popular videos tend to have more comments. Let us find out.
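The effect of outliers on the mean can be illustrated with a toy vector. The numbers below are made up for illustration and are not taken from the dataset.

```r
# Toy comment counts (made-up numbers): five ordinary talks and one outlier.
comments <- c(50, 60, 100, 120, 150, 6000)

mean_all     <- mean(comments)                   # pulled up by the single large value
median_all   <- median(comments)                 # barely affected by it
mean_trimmed <- mean(comments[comments < 500])   # mean after dropping the outlier

c(mean = mean_all, median = median_all, trimmed = mean_trimmed)
```

A single extreme value moves the mean by an order of magnitude while the median stays close to the typical talk, which is why the median is the safer summary for skewed counts like these.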
library(ggExtra)
ggMarginal(
ggplot(df, aes(x = views, y = comments)) + geom_point(),
type = "histogram")
cor(df[, c("views", "comments")])
views comments
views 1.0000000 0.5309387
comments 0.5309387 1.0000000
As the scatterplot and the correlation matrix show, the Pearson coefficient is slightly more than 0.5. This suggests a medium to strong correlation between the two quantities. This result was expected, as mentioned above. Let us now check the number of views and comments on the 10 most commented TED Talks of all time.
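For readers who want to see what `cor()` computes, here is a small sketch of the Pearson coefficient written out by hand and checked against the built-in function. The vectors `x` and `y` are toy values, not data from the talks.

```r
# Toy data (made up): two short numeric vectors.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson r: covariance of x and y divided by the product of their spreads.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)   # Pearson is the default method

c(manual = r_manual, builtin = r_builtin)   # both 0.8
```

Squaring the coefficient gives the share of variance the two variables have in common; for the value of about 0.53 reported above, that is roughly 0.28.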
df %>%
select(title, main_speaker, views, comments) %>%
arrange(desc(comments)) %>%
head(10)
# A tibble: 10 x 4
title main_speaker views comments
<chr> <chr> <int> <int>
1 Militant atheism Richard Dawkins 4374792 6404
2 Do schools kill creativity? Ken Robinson 47227110 4553
3 Science can answer moral questions Sam Harris 3433437 3356
4 My stroke of insight Jill Bolte Taylor 21190883 2877
5 How do you explain consciousness? David Chalmers 2162764 2673
6 Taking imagination seriously Janet Echelman 1832930 2492
7 On reading the Koran Lesley Hazleton 1847256 2374
8 Your body language may shape who you are Amy Cuddy 43155405 2290
9 The danger of science denial Michael Specter 1838628 2272
10 How great leaders inspire action Simon Sinek 34309432 1930
As can be seen above, Richard Dawkins’ talk ‘Militant atheism’ generated the greatest amount of discussion and opinions despite having significantly fewer views than Ken Robinson’s talk, which is second in the list. This raises some interesting questions.
Which talks tend to attract the largest amount of discussion?
To answer this question, we will define a new feature, the discussion quotient, which is simply the ratio of the number of comments to the number of views. We will then check which talks have the largest discussion quotient.
df <- mutate(df, dis_quo = comments/views)
df %>%
select(title, main_speaker, views, comments, dis_quo, film_date) %>%
arrange(desc(dis_quo)) %>%
head(10)
# A tibble: 10 x 6
title main_speaker views comments dis_quo film_date
<chr> <chr> <int> <int> <dbl> <date>
1 The case for same-sex marriage Diane J. Savino 292395 649 0.002219600 2009-12-02
2 E-voting without fraud David Bismark 543551 834 0.001534355 2010-07-14
3 Militant atheism Richard Dawkins 4374792 6404 0.001463841 2002-02-02
4 Inside a school for suicide bombers Sharmeen Obaid-Chinoy 1057238 1502 0.001420683 2010-02-10
5 Taking imagination seriously Janet Echelman 1832930 2492 0.001359572 2011-03-03
6 On reading the Koran Lesley Hazleton 1847256 2374 0.001285149 2010-10-10
7 Curating humanity's heritage Elizabeth Lindsey 439180 555 0.001263719 2010-12-08
8 How do you explain consciousness? David Chalmers 2162764 2673 0.001235918 2014-03-18
9 The danger of science denial Michael Specter 1838628 2272 0.001235704 2010-02-11
10 Dance to change the world Mallika Sarabhai 481834 595 0.001234865 2009-11-04
This analysis has raised some extremely interesting insights. Half of the talks in the top 10 are along the lines of faith and religion. I suspect science and religion are still very hotly debated topics even in the 21st century. We shall come back to this hypothesis in a later section.
The most discussed talk, though, is The case for same-sex marriage (which has religious undertones). This is not that surprising considering the amount of debate the topic caused back in 2009 (the time the talk was filmed).
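To see why normalizing by views reorders the ranking, the sketch below recomputes the discussion quotient for three talks, using the view and comment counts printed in the tables above. The data frame `talks` is a hand-built stand-in for `df`.

```r
# Three rows copied from the tables above.
talks <- data.frame(
  title = c("Militant atheism", "Do schools kill creativity?",
            "The case for same-sex marriage"),
  views = c(4374792, 47227110, 292395),
  comments = c(6404, 4553, 649),
  stringsAsFactors = FALSE
)

# Discussion quotient: comments per view.
talks$dis_quo <- talks$comments / talks$views

by_comments <- talks$title[order(-talks$comments)]  # ranking by raw comment count
by_quotient <- talks$title[order(-talks$dis_quo)]   # ranking by comments per view

by_comments
by_quotient
```

By raw comments, the same-sex marriage talk is last of the three; by quotient it is first, because its 649 comments came from under 300,000 views.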
TED (especially TEDx) Talks tend to occur all throughout the year. Is there a hot month as far as TED is concerned? In other words, how are the talks distributed throughout the months since its inception? Let us find out.
library(lubridate)
df$month <- month(df$film_date, label = TRUE)
ggplot(df, aes(x = month, fill = month)) +
  geom_bar() + guides(fill = "none")
February is clearly the most popular month for TED Conferences whereas August and January are the least popular. February’s popularity is largely due to the fact that the official TED Conferences are held in February. Let us check the distribution for TEDx talks only.
df_x <- df %>%
  filter(grepl("TEDx", event))
df_x %>%
  group_by(month) %>%
  count() %>%
  ggplot(aes(x = month, y = n, fill = month)) +
  geom_bar(stat = 'identity') + guides(fill = "none")
As far as TEDx talks are concerned, November is the most popular month. However, we cannot take this result at face value, as very few of the TEDx talks are actually uploaded to the TED website, and it is therefore entirely possible that the sample in our dataset is not at all representative of all TEDx talks. A slightly more accurate statement would be that the most popular TEDx talks are most often filmed in October and November.
The next question I’m interested in is which days are the most popular for conducting TED and TEDx conferences. The tools applied are very similar to the procedure applied for months.
df %>%
  mutate(day = wday(film_date, label = TRUE, week_start = 1)) %>%
  group_by(day) %>%
  count() %>%
  ggplot(aes(x = day, y = n, fill = day)) +
  geom_bar(stat = 'identity') + guides(fill = "none")
The distribution of days is almost a bell curve, with Wednesday and Thursday being the most popular days and Sunday the least popular. This is pretty interesting because I was of the opinion that most TED Conferences would happen on the weekend.
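The month and weekday extraction used above relies on lubridate. As a rough base-R sketch of the same idea, the snippet below derives both labels from three film dates that appear in the outputs earlier in this notebook. The helper names (`day_names`, `month_lab`, `wday_lab`) are my own and not part of the original code.

```r
# Three film dates taken from the outputs shown earlier.
film_date <- as.Date(c("2006-02-25", "2006-02-22", "2012-06-26"))

# Month as an ordered label: "%m" gives the month number, month.abb is built in.
month_num <- as.integer(format(film_date, "%m"))
month_lab <- factor(month.abb[month_num], levels = month.abb)

# Weekday with Monday first: "%u" gives 1 (Monday) to 7 (Sunday).
day_names <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
wday_num <- as.integer(format(film_date, "%u"))
wday_lab <- factor(day_names[wday_num], levels = day_names)

table(month_lab)   # counts per month, in calendar order
table(wday_lab)    # counts per weekday, Monday first
```

Using numeric format codes and fixed label vectors avoids locale-dependent month and day names, and the factor levels keep the bars in calendar order when plotted.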
Let us now visualize the number of TED talks through the years and check if our hunch that they have grown significantly is indeed true.
df$year <- year(df$film_date)
df %>%
group_by(year) %>%
count() %>%
ggplot(aes(x = year, y = n)) +
geom_line() + geom_point() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Observations
Finally, to put it all together, let us construct a heat map that shows us the number of talks by month and year. This will give us a good summary of the distribution of talks.
df %>%
  group_by(month, year) %>%
  count() %>%
  ggplot(aes(x = year, y = month)) +
  geom_tile(aes(fill = n)) +
  geom_text(aes(label = n), size = 2.5,
            position = position_jitter(height = .25)) +
  scale_fill_gradient(low = "white", high = "red") +
  guides(fill = "none") + labs(x = "Year", y = "Month")
In this section, we will try to gain insight into all the amazing speakers who have managed to inspire millions of people through their talks on the TED platform. The first question we shall ask is who the most popular TED speakers are, that is, which speakers have given the most TED Talks.
group_by(df, main_speaker) %>%
count() %>%
arrange(desc(n))
# A tibble: 2,156 x 2
# Groups: main_speaker [2,156]
main_speaker n
<chr> <int>
1 Hans Rosling 9
2 Juan Enriquez 7
3 Marco Tempest 6
4 Rives 6
5 Bill Gates 5
6 Clay Shirky 5
7 Dan Ariely 5
8 Jacqueline Novogratz 5
9 Julian Treasure 5
10 Nicholas Negroponte 5
# ... with 2,146 more rows
Hans Rosling, the Swedish health professor, is clearly the most popular TED Speaker, with 9 appearances on the TED platform. Juan Enriquez comes second with 7 appearances. Rives and Marco Tempest have each graced the TED platform 6 times.
Which occupation should you choose if you want to become a TED Speaker? Let us have a look at what kind of people TED is most interested in inviting to its events.
occupation_df <- group_by(df, speaker_occupation) %>%
count() %>%
arrange(desc(n))
occupation_df
# A tibble: 1,449 x 2
# Groups: speaker_occupation [1,449]
speaker_occupation n
<chr> <int>
1 Writer 45
2 Artist 34
3 Designer 34
4 Journalist 33
5 Entrepreneur 31
6 Architect 30
7 Inventor 27
8 Psychologist 26
9 Photographer 25
10 Filmmaker 21
# ... with 1,439 more rows
ggplot(head(occupation_df, 10), aes(x = reorder(speaker_occupation, n),
                                    y = n, fill = speaker_occupation)) +
  geom_bar(stat = "identity") + guides(fill = "none") +
  labs(x = "") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Observations
Do some professions tend to attract a larger number of viewers? To answer this question let us visualise the relationship between the top 10 most popular professions and the views they garnered in the form of a box plot.
df %>%
  filter(speaker_occupation %in% head(occupation_df$speaker_occupation, 10)) %>%
  ggplot(aes(x = speaker_occupation, y = views, fill = speaker_occupation)) +
  geom_boxplot() +
  labs(x = "") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none")
On average, out of the top 10 most popular professions, Psychologists tend to garner the most views. Writers have the greatest range of views between the first and the third quartile.
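A box plot draws the median and the quartiles of each group, so the two statements above can also be checked numerically. The following is a small sketch with made-up view counts for two occupations; the real figures would come from `df`.

```r
# Made-up view counts for two occupations (three talks each).
views <- c(1.0e6, 1.5e6, 3.0e6, 0.8e6, 0.9e6, 1.1e6)
occupation <- c("Psychologist", "Psychologist", "Psychologist",
                "Writer", "Writer", "Writer")

# The middle line of each box: the group median.
med <- tapply(views, occupation, median)

# The height of each box: third quartile minus first quartile.
iqr <- tapply(views, occupation, IQR)

med
iqr
```

Comparing `med` across groups corresponds to reading off the box centre lines, and comparing `iqr` corresponds to comparing box heights.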
Finally, let us check the number of talks which have had more than one speaker.
table(df$num_speaker)
1 2 3 4 5
2492 49 5 3 1
Almost every talk has just one speaker. There are close to 50 talks where two people shared the stage. The maximum number of speakers to share a single stage was 5. I suspect this was a dance performance. Let’s have a look.
filter(df, num_speaker == 5) %>%
select(title, description, main_speaker, event)
# A tibble: 1 x 4
title
<chr>
1 A dance to honor Mother Earth
# ... with 3 more variables: description <chr>, main_speaker <chr>, event <chr>
My hunch was correct. It is a talk titled A dance to honor Mother Earth by Jon Boogz and Lil Buck at the TED 2017 Conference.
Which TED events tend to produce the largest number of talks deemed worthy of upload to TED.com? We will try to answer that question in this section.
count(df, event) %>%
arrange(desc(n)) %>% head()
# A tibble: 6 x 2
event n
<chr> <int>
1 TED2014 84
2 TED2009 83
3 TED2013 77
4 TED2016 77
5 TED2015 75
6 TED2011 70
As expected, the official TED events account for the major share of TED Talks published on the TED.com platform. TED2014 had the most talks, followed by TED2009. There isn’t too much insight to be gained from this.
One remarkable aspect of TED Talks is the sheer number of languages in which they are accessible. Let us perform some very basic data visualization and descriptive statistics about languages at TED.
summary(df$languages)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 23.00 28.00 27.33 33.00 72.00
On average, a TED Talk is available in 27 different languages. The maximum number of languages a TED Talk is available in is a staggering 72. Let us check which talk this is.
filter(df, languages == 72)
# A tibble: 1 x 20
name title
<chr> <chr>
1 Matt Cutts: Try something new for 30 days Try something new for 30 days
# ... with 18 more variables: description <chr>, main_speaker <chr>, speaker_occupation <chr>,
# num_speaker <int>, duration <int>, event <chr>, film_date <date>, published_date <date>, comments <int>,
# tags <chr>, languages <int>, ratings <chr>, related_talks <chr>, url <chr>, views <int>, dis_quo <dbl>,
# month <ord>, year <dbl>
The most translated TED Talk of all time is Matt Cutts’ Try something new for 30 days. The talk does have a very universal theme of exploration. The sheer number of languages it’s available in demands a little more inspection, though, as it has just over 8 million views, far fewer than the most popular TED Talks.
Finally, let us check if there is a correlation between the number of views and the number of languages a talk is available in. We would think that this should be the case since the talk is more accessible to a larger number of people but as Matt Cutts’ talk shows, it may not really be the case.
ggMarginal(
ggplot(df, aes(x = languages, y = views)) +
geom_point()
)
cor(df[, c("languages","views")])
languages views
languages 1.0000000 0.3776231
views 0.3776231 1.0000000
The Pearson coefficient is 0.38 suggesting a medium correlation between the aforementioned quantities.
In this section, we will try to find out the most popular themes in the TED conferences. Although TED started out as a conference about technology, entertainment and design, it has since diversified into virtually every field of study and walk of life. It will be interesting to see if this conference with Silicon Valley origins has a bias towards certain topics.
To answer this question, we need to wrangle our data into a form that is suitable for analysis. More specifically, we need to split the tags list into separate rows.
library(qdapRegex)
theme_df <- do.call("rbind", Map(function(title, theme)
  tibble(title = title, theme = theme),
  df$title, rm_between(df$tags, "'", "'", extract = TRUE))) %>%
  merge(df, by = "title")
Compared to the referenced notebook, there is one extra theme, “,” (not relevant for the analysis anyway).
length(table(theme_df$theme))
[1] 417
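As an alternative sketch of the tag-splitting step, the base-R snippet below extracts the single-quoted items from each tags string and stacks them into one row per talk and theme. It is not the method used above: the titles and the second tags string are made up, and the regular expression assumes that tags never contain an apostrophe (a tag containing one would likely be wrapped in double quotes in the raw data and would need extra handling).

```r
# Toy input: two tags strings in the same bracketed, single-quoted style.
titles <- c("Talk A", "Talk B")
tags <- c("['children', 'creativity', 'culture']",
          "['alternative energy', 'cars']")

# Find every '...' item, then strip the surrounding quotes.
quoted <- regmatches(tags, gregexpr("'[^']*'", tags))
tag_list <- lapply(quoted, function(x) gsub("'", "", x, fixed = TRUE))

# One row per (title, theme) pair.
theme_long <- data.frame(
  title = rep(titles, lengths(tag_list)),
  theme = unlist(tag_list),
  stringsAsFactors = FALSE
)

theme_long
```

Because the pattern matches whole quoted items rather than everything between any two quote characters, separators such as the comma between items are never captured as a theme.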
pop_themes <- group_by(theme_df, theme) %>%
count() %>%
arrange(desc(n))
pop_themes
# A tibble: 417 x 2
# Groups: theme [417]
theme n
<chr> <int>
1 technology 726
2 science 564
3 global issues 501
4 culture 486
5 TEDx 449
6 design 418
7 business 347
8 entertainment 299
9 health 230
10 innovation 229
# ... with 407 more rows
ggplot(head(pop_themes, 10), aes(x = reorder(theme, n), y = n, fill = theme)) +
  geom_bar(stat = "identity") +
  guides(fill = "none") + labs(x = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
As may have been expected, Technology is the most popular topic for talks. The other two original factions, Design and Entertainment, also make it to the list of top 10 themes. Science and Global Issues are the second and the third most popular themes respectively.
The next question I want to answer is the trends in the share of topics of TED Talks across the world. Has the demand for Technology talks increased? Do certain years have a disproportionate share of talks related to global issues? Let’s find out!
We will consider only the top 7 themes (excluding TEDx) and only talks filmed from 2009 onward, the year when the number of TED Talks really peaked.
pop_theme_talks <- theme_df %>%
filter(theme %in% head(pop_themes$theme, 8),
theme !="TEDx",
year > 2008)
xtab_df <- pop_theme_talks %>%
group_by(theme, year) %>%
tally %>%
group_by(year) %>%
mutate(prop = n/sum(n))
ggplot(xtab_df, aes(x = year, fill = theme, y = prop)) +
geom_bar(stat = 'identity', position = 'fill')
ggplot(xtab_df, aes(x = year, y= prop, group = theme, col = theme)) +
geom_line()
The proportion of technology talks has steadily increased over the years with a slight dip in 2010. This is understandable considering the boom of technologies such as blockchain, deep learning and augmented reality capturing people’s imagination.
Talks on culture have witnessed a dip, decreasing steadily starting 2013. The share of culture talks has been the least in 2017. Entertainment talks also seem to have witnessed a slight decline in popularity since 2009.
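The yearly shares plotted above are counts of (talk, theme) rows divided by the yearly total of such rows. A compact base-R sketch of the same calculation, on made-up data, looks like this:

```r
# Made-up (year, theme) rows: one row per theme tag attached to a talk.
talks <- data.frame(
  year = c(2009, 2009, 2009, 2009, 2010, 2010),
  theme = c("technology", "technology", "science", "culture",
            "technology", "science"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then divide each row (year) by its total.
counts <- table(talks$year, talks$theme)
share <- prop.table(counts, margin = 1)

share
```

Note that a talk tagged with several of the top themes contributes one row per theme, so these shares describe theme tags within a year rather than distinct talks.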
Like with the speaker occupations, let us investigate if certain topics tend to garner more views than certain other topics. We will be doing this analysis for the top ten categories that we discovered in an earlier cell. As with the speaker occupations, the box plot will be used to deduce this relation.
theme_df %>%
  filter(theme %in% head(pop_themes$theme, 10)) %>%
  ggplot(aes(x = theme, y = views, fill = theme)) +
  geom_boxplot() + labs(x = "") +
  guides(fill = "none") + coord_cartesian(ylim = c(0, 0.4e7)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Although culture has lost its share in the number of TED Talks over the years, culture talks garner the highest median number of views.
In this section, we will perform analysis on the length of TED Talks. TED is famous for imposing a very strict time limit of 18 minutes. Although this is the suggested limit, there have been talks as short as 2 minutes and some have stretched to as long as 24 minutes. Let us get an idea of the distribution of TED Talk durations.
df$duration <- df$duration/60
summary(df$duration)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.250 9.617 14.133 13.775 17.446 87.600
TED Talks are, on average, 13.8 minutes long. I find this statistic surprising because TED Talks are often synonymous with 18 minutes and the average is a good 4 minutes shorter than that.
The shortest TED Talk on record is 2.25 minutes long whereas the longest talk is 87.6 minutes long. I’m pretty sure the longest talk was not actually a TED Talk. Let us look at both the shortest and the longest talk.
filter(df, duration == 2.25 | duration == 87.6)
# A tibble: 2 x 20
name title
<chr> <chr>
1 Murray Gell-Mann: The ancestor of language The ancestor of language
2 Douglas Adams: Parrots, the universe and everything Parrots, the universe and everything
# ... with 18 more variables: description <chr>, main_speaker <chr>, speaker_occupation <chr>,
# num_speaker <int>, duration <dbl>, event <chr>, film_date <date>, published_date <date>, comments <int>,
# tags <chr>, languages <int>, ratings <chr>, related_talks <chr>, url <chr>, views <int>, dis_quo <dbl>,
# month <ord>, year <dbl>
The shortest talk, titled The ancestor of language, was given by Murray Gell-Mann at TED2007. The longest talk on TED.com, as we had guessed, is not a TED Talk at all. Rather, it was a talk titled Parrots, the universe and everything delivered by Douglas Adams at the University of California in 2001.
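The unit conversion and the lookup of the extremes can be sketched as follows. The first duration is taken from the `glimpse()` output earlier; the other two are back-calculated from the 2.25 and 87.6 minute figures in the summary, so treat them as assumptions.

```r
# Durations in seconds: 1164 is from the data shown earlier; 135 and 5256 are
# back-calculated from the summary (2.25 * 60 and 87.6 * 60).
duration_sec <- c(1164, 135, 5256)

# Convert to minutes.
duration_min <- duration_sec / 60

# Locate the extremes by position instead of comparing against typed-in decimals.
shortest <- which.min(duration_min)
longest <- which.max(duration_min)

duration_min[c(shortest, longest)]
```

Looking up the minimum and maximum by position avoids hard-coding the values from the summary and sidesteps exact equality tests on floating-point numbers, which can fail when the printed summary is rounded.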
Let us now check for any correlation between the popularity and the duration of a TED Talk. To make sure we only include TED Talks, we will consider only those talks which have a duration less than 25 minutes.
ggMarginal(
filter(df, duration < 25) %>%
ggplot(aes(x = duration, y = views)) + geom_point()
)
There seems to be almost no correlation between these two quantities, which suggests that the length of a TED Talk has no tangible bearing on its popularity. Content is king at TED.
Next, we look at transcripts to get an idea of word count. For this, we introduce our second dataset, the one which contains all transcripts.
df2 <- read_csv("data/transcripts.csv")
glimpse(df2)
Observations: 2,467
Variables: 2
$ transcript <chr> "Good morning. How are you?(Laughter)It's been great, hasn't it? I've been blown away ...
$ url <chr> "https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity\n", "https://www....
It seems that we have data available for 2467 talks. Let us perform a join of the two dataframes on the url feature to include word counts for every talk.
df3 <- merge(df, df2, by = "url")
glimpse(df3)
Observations: 2,467
Variables: 21
$ url <chr> "https://www.ted.com/talks/9_11_healing_the_mothers_who_found_forgiveness_frie...
$ name <chr> "Aicha el-Wafi + Phyllis Rodriguez: The mothers who found forgiveness, friends...
$ title <chr> "The mothers who found forgiveness, friendship", "My year of living biblically...
$ description <chr> "Phyllis Rodriguez and Aicha el-Wafi have a powerful friendship born of unthin...
$ main_speaker <chr> "Aicha el-Wafi + Phyllis Rodriguez", "AJ Jacobs", "Markus Fischer", "Improv Ev...
$ speaker_occupation <chr> "9/11 mothers", "Author", "Designer", "Social energy entrepreneur", "Whistler"...
$ num_speaker <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, ...
$ duration <dbl> 9.900000, 17.666667, 6.316667, 3.816667, 11.933333, 9.833333, 14.266667, 15.45...
$ event <chr> "TEDWomen 2010", "EG 2007", "TEDGlobal 2011", "TED2012", "TEDxRotterdam 2010",...
$ film_date <date> 2010-12-12, 2007-12-02, 2011-07-15, 2012-03-01, 2010-06-04, 2014-10-21, 2016-...
$ published_date <date> 2011-05-02, 2008-07-17, 2011-07-22, 2012-03-09, 2011-02-11, 2014-12-05, 2017-...
$ comments <int> 149, 583, 440, 324, 93, 48, 36, 850, 79, 333, 104, 194, 95, 124, 231, 56, 29, ...
$ tags <chr> "['culture', 'friendship', 'global issues', 'parenting', 'terrorism']", "['com...
$ languages <int> 32, 39, 45, 51, 31, 39, 20, 36, 27, 30, 26, 26, 28, 28, 33, 28, 19, 20, 23, 24...
$ ratings <chr> "[{'id': 10, 'name': 'Inspiring', 'count': 385}, {'id': 1, 'name': 'Beautiful'...
$ related_talks <chr> "[{'id': 968, 'hero': 'https://pe.tedcdn.com/images/ted/202850_800x600.jpg', '...
$ views <int> 820976, 2291701, 6264902, 2950307, 1917442, 817014, 896491, 1347633, 1474192, ...
$ dis_quo <dbl> 1.814913e-04, 2.543962e-04, 7.023254e-05, 1.098191e-04, 4.850212e-05, 5.875052...
$ month <ord> Dec, Dec, Jul, Mar, Jun, Oct, Feb, Sep, Mar, Mar, Mar, Jun, Jun, Feb, Jul, May...
$ year <dbl> 2010, 2007, 2011, 2012, 2010, 2014, 2016, 2010, 2011, 2011, 2015, 2013, 2016, ...
$ transcript <chr> "Phyllis Rodriguez: We are here today because of the fact that we have what mo...
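The join above keeps only talks whose URL appears in both tables, which is why the merged frame has 2,467 rows against 2,550 talks. The sketch below shows that behaviour on toy data and one way to list the talks left without a transcript. All names and values here are made up.

```r
# Toy stand-ins for the two datasets.
talks <- data.frame(url = c("u1", "u2", "u3"),
                    title = c("A", "B", "C"),
                    stringsAsFactors = FALSE)
transcripts <- data.frame(url = c("u1", "u3"),
                          transcript = c("text one", "text three"),
                          stringsAsFactors = FALSE)

# Default merge(): an inner join, rows without a match are dropped.
inner <- merge(talks, transcripts, by = "url")

# all.x = TRUE keeps every talk and fills missing transcripts with NA.
left <- merge(talks, transcripts, by = "url", all.x = TRUE)
missing_titles <- left$title[is.na(left$transcript)]

nrow(inner)
missing_titles
```

`merge()` also sorts the result by the join key by default, which explains why the first row of the merged data differs from the first row of the original data.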
df3$wc <- sapply(df3$transcript, function(x)
length(strsplit(x, split="\\s+")[[1]]))
summary(df3$wc)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1332 2028 2040 2707 9044
We can see that the average TED Talk has around 2,040 words (median 2,028), and the spread is large: the middle half of the talks ranges from 1,332 to 2,707 words. The longest talk is 9,044 words long.
Like duration, there shouldn’t be any correlation between number of words and views. We will proceed to look at a more interesting statistic: the number of words per minute.
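The word count used here is a simple whitespace split. The sketch below applies the same rule to a phrase from the first transcript and shows one possible refinement: stage directions such as "(Laughter)" are glued to neighbouring words in the raw text, so they merge tokens. The helper names `count_words` and `strip_directions` are my own.

```r
# Count whitespace-separated tokens, as in the notebook.
count_words <- function(x) length(strsplit(x, split = "\\s+")[[1]])

# Optional refinement (an assumption, not in the original): replace
# parenthesised stage directions with a space before counting.
strip_directions <- function(x) gsub("\\([^)]*\\)", " ", x)

plain <- "Good morning. How are you?"
glued <- "How are you?(Laughter)It's been great"

n_plain <- count_words(plain)                    # 5
n_glued <- count_words(glued)                    # 5: "you?(Laughter)It's" is one token
n_clean <- count_words(strip_directions(glued))  # 6
```

The difference is small per talk, so the summary statistics above should change little, but it matters if exact counts are needed.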
df3$wpm = df3$wc/df3$duration
summary(df3$wpm)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.08086 133.27105 149.95733 147.10305 165.27871 247.36486
The average TED speaker enunciates 147 words per minute. The fastest talker spoke a staggering 247 words a minute, which is much higher than the typical 125-150 words per minute in English. Let us see who this is!
filter(df3, wpm > 245)
url
1 https://www.ted.com/talks/mae_jemison_on_teaching_arts_and_sciences_together\n
name title
1 Mae Jemison: Teach arts and sciences together Teach arts and sciences together
description
1 Mae Jemison is an astronaut, a doctor, an art collector, a dancer ... Telling stories from her own education and from her time in space, she calls on educators to teach both the arts and sciences, both intuition and logic, as one -- to create bold thinkers.
main_speaker speaker_occupation num_speaker duration event
1 Mae Jemison Astronaut, engineer, entrepreneur, physician and educator 1 14.8 TED2002
film_date published_date comments
1 2002-02-02 2009-05-05 99
tags languages
1 ['art', 'dance', 'education', 'future', 'science', 'science and art', 'space', 'technology'] 20
ratings
1 [{'id': 24, 'name': 'Persuasive', 'count': 126}, {'id': 10, 'name': 'Inspiring', 'count': 243}, {'id': 22, 'name': 'Fascinating', 'count': 83}, {'id': 25, 'name': 'OK', 'count': 86}, {'id': 26, 'name': 'Obnoxious', 'count': 17}, {'id': 21, 'name': 'Unconvincing', 'count': 67}, {'id': 8, 'name': 'Informative', 'count': 107}, {'id': 11, 'name': 'Longwinded', 'count': 68}, {'id': 3, 'name': 'Courageous', 'count': 42}, {'id': 9, 'name': 'Ingenious', 'count': 34}, {'id': 2, 'name': 'Confusing', 'count': 10}, {'id': 7, 'name': 'Funny', 'count': 14}, {'id': 1, 'name': 'Beautiful', 'count': 55}, {'id': 23, 'name': 'Jaw-dropping', 'count': 18}]
related_talks
1 [{'id': 66, 'hero': 'https://pe.tedcdn.com/images/ted/6b6eb940bceab359ca676a9b486aae475c1df883_2880x1620.jpg', 'speaker': 'Ken Robinson', 'title': 'Do schools kill creativity?', 'duration': 1164, 'slug': 'ken_robinson_says_schools_kill_creativity', 'viewed_count': 47227861}, {'id': 1571, 'hero': 'https://pe.tedcdn.com/images/ted/d538403350630eeaf965325257caf869350a9832_1600x1200.jpg', 'speaker': 'John Maeda', 'title': 'How art, technology and design inform creative leaders', 'duration': 1001, 'slug': 'john_maeda_how_art_technology_and_design_inform_creative_leaders', 'viewed_count': 1045694}, {'id': 1657, 'hero': 'https://pe.tedcdn.com/images/ted/f31f296ffb6902d226e403e6431713cf37629b55_1600x1200.jpg', 'speaker': 'Mitch Resnick', 'title': "Let's teach kids to code", 'duration': 1008, 'slug': 'mitch_resnick_let_s_teach_kids_to_code', 'viewed_count': 1724947}, {'id': 952, 'hero': 'https://pe.tedcdn.com/images/ted/197744_800x600.jpg', 'speaker': 'Ben Cameron', 'title': 'Why the live arts matter', 'duration': 764, 'slug': 'ben_cameron_tedxyyc', 'viewed_count': 497709}, {'id': 1653, 'hero': 'https://pe.tedcdn.com/images/ted/e562bc9bf7daf7d4f06450eff3d5af17a5e873f8_2880x1620.jpg', 'speaker': 'Young-ha Kim', 'title': 'Be an artist, right now!', 'duration': 1017, 'slug': 'young_ha_kim_be_an_artist_right_now', 'viewed_count': 1841666}, {'id': 1747, 'hero': 'https://pe.tedcdn.com/images/ted/910102e6486442bc30fbc5952c254a9f9882942f_1600x1200.jpg', 'speaker': 'Phil Hansen', 'title': 'Embrace the shake', 'duration': 601, 'slug': 'phil_hansen_embrace_the_shake', 'viewed_count': 2155393}]
views dis_quo month year
1 744257 0.0001330186 Feb 2002
transcript
1 What I want to do today is to spend some time talking about some stuff that's sort of giving me a little bit of existential angst, for lack of a better word, over the past couple of years, and basically, these three quotes tell what's going on. "When God made the color purple, God was just showing off," Alice Walker wrote in "The Color Purple," and Zora Neale Hurston wrote in "Dust Tracks On A Road," "Research is a formalized curiosity. It's poking and prying with a purpose." And then finally, when I think about the near future, you know, we have this attitude, well, whatever happens, happens. Right? So that goes along with the Chesire Cat saying, "If you don't care much where you want to get to, it doesn't much matter which way you go." But I think it does matter which way we go, and what road we take, because when I think about design in the near future, what I think are the most important issues, what's really crucial and vital is that we need to revitalize the arts and sciences right now in 2002. (Applause) If we describe the near future as 10, 20, 15 years from now, that means that what we do today is going to be critically important, because in the year 2015, and the year 2020, 2025, the world our society is going to be building on, the basic knowledge and abstract ideas, the discoveries that we came up with today, just as all these wonderful things we're hearing about here at the TED conference that we take for granted in the world right now, were really knowledge and ideas that came up in the '50s, the '60s, and the '70s. 
That's the substrate that we're exploiting today, whether it's the internet, genetic engineering, laser scanners, guided missiles, fiber optics, high-definition television, sensing, remote-sensing from space and the wonderful remote-sensing photos that we see in 3D weaving, TV programs like Tracker, and Enterprise, CD rewrite drives, flatscreen, Alvin Ailey's Suite Otis, or Sarah Jones' "Your Revolution Will Not Be Between These Thighs," which by the way was banned by the FCC, or ska, all of these things without question, almost without exception, are really based on ideas and abstract and creativity from years before, so we have to ask ourselves, what are we contributing to that legacy right now? And when I think about it, I'm really worried. To be quite frank, I'm concerned. I'm skeptical that we're doing very much of anything. We're, in a sense, failing to act in the future. We're purposefully, consciously being laggards. We're lagging behind. Frantz Fanon, who was a psychiatrist from Martinique, said, "Each generation must, out of relative obscurity, discover its mission, and fulfill or betray it." What is our mission? What do we have to do? I think our mission is to reconcile, to reintegrate science and the arts, because right now there's a schism that exists in popular culture. You know, people have this idea that science and the arts are really separate. We think of them as separate and different things, and this idea was probably introduced centuries ago, but it's really becoming critical now, because we're making decisions about our society every day that, if we keep thinking that the arts are separate from the sciences, and we keep thinking it's cute to say, "I don't understand anything about this one, I don't understand anything about the other one," then we're going to have problems. Now I know no one here at TED thinks this. 
All of us, we already know that they're very connected, but I'm going to let you know that some folks in the outside world, believe it or not, they think it's neat when they say, "You know, scientists and science is not creative. Maybe scientists are ingenious, but they're not creative. And then we have this tendency, the career counselors and various people say things like, "Artists are not analytical. They're ingenious, perhaps, but not analytical," and when these concepts underly our teaching and what we think about the world, then we have a problem, because we stymie support for everything. By accepting this dichotomy, whether it's tongue-in-cheek, when we attempt to accommodate it in our world, and we try to build our foundation for the world, we're messing up the future, because, who wants to be uncreative? Who wants to be illogical? Talent would run from either of these fields if you said you had to choose either. Then they're going to go to something where they think, "Well, I can be creative and logical at the same time." Now I grew up in the '60s and I'll admit it, actually, my childhood spanned the '60s, and I was a wannabe hippie and I always resented the fact that I wasn't really old enough to be a hippie. And I know there are people here, the younger generation who want to be hippies, but people talk about the '60s all the time, and they talk about the anarchy that was there, but when I think about the '60s, what I took away from it was that there was hope for the future. We thought everyone could participate. There were wonderful, incredible ideas that were always percolating, and so much of what's cool or hot today is really based on some of those concepts, whether it's, you know, people trying to use the prime directive from Star Trek being involved in things, or again that three-dimensional weaving and fax machines that I read about in my weekly readers that the technology and engineering was just getting started. 
But the '60s left me with a problem. You see, I always assumed I would go into space, because I followed all of this, but I also loved the arts and sciences. You see, when I was growing up as a little girl and as a teenager, I loved designing and making dogs' clothes and wanting to be a fashion designer. I took art and ceramics. I loved dance. Lola Falana. Alvin Ailey. Jerome Robbins. And I also avidly followed the Gemini and the Apollo programs. I had science projects and tons of astronomy books. I took calculus and philosophy. I wondered about the infinity and the Big Bang theory. And when I was at Stanford, I found myself, my senior year, chemical engineering major, half the folks thought I was a political science and performing arts major, which was sort of true because I was Black Student Union President and I did major in some other things, and I found myself the last quarter juggling chemical engineering separation processes, logic classes, nuclear magnetic resonance spectroscopy, and also producing and choreographing a dance production, and I had to do the lighting and the design work, and I was trying to figure out, do I go to New York City to try to become a professional dancer, or do I go to medical school? Now, my mother helped me figure that one out. (Laughter) But when I went into space, when I went into space I carried a number of things up with me. I carried a poster by Alvin Ailey, which you can figure out now, I love the dance company. An Alvin Ailey poster of Judith Jamison performing the dance "Cry," dedicated to all black women everywhere. A Bundu statue, which was from the Women's Society in Sierra Leone, and a certificate for the Chicago Public School students to work to improve their science and math, and folks asked me, "Why did you take up what you took up?" 
And I had to say, "Because it represents human creativity, the creativity that allowed us, that we were required to have to conceive and build and launch the space shuttle, springs from the same source as the imagination and analysis it took to carve a Bundu statue, or the ingenuity it took to design, choreograph, and stage "Cry." Each one of them are different manifestations, incarnations, of creativity, avatars of human creativity, and that's what we have to reconcile in our minds, how these things fit together. The difference between arts and sciences is not analytical versus intuitive, right? E=MC squared required an intuitive leap, and then you had to do the analysis afterwards. Einstein said, in fact, "The most beautiful thing we can experience is the mysterious. It is the source of all true art and science." Dance requires us to express and want to express the jubilation in life, but then you have to figure out, exactly what movement do I do to make sure that it comes across correctly? The difference between arts and sciences is also not constructive versus deconstructive, right? A lot of people think of the sciences as deconstructive. You have to pull things apart. And yeah, sub-atomic physics is deconstructive. You literally try to tear atoms apart to understand what's inside of them. But sculpture, from what I understand from great sculptors, is deconstructive, because you see a piece and you remove what doesn't need to be there. Biotechnology is constructive. Orchestral arranging is constructive. So in fact we use constructive and deconstructive techniques in everything. The difference between science and the arts is not that they are different sides of the same coin, even, or even different parts of the same continuum, but rather they're manifestations of the same thing. Different quantum states of an atom? Or maybe if I want to be more 21st century I could say that they are different harmonic resonances of a superstring. But we'll leave that alone. 
(Laughter) They spring from the same source. The arts and sciences are avatars of human creativity. It's our attempt as humans to build an understanding of the universe, the world around us. It's our attempt to influence things, the universe internal to ourselves and external to us. The sciences, to me, are manifestations of our attempt to express or share our understanding, our experience, to influence the universe external to ourselves. It doesn't rely on us as individuals. It's the universe, as experienced by everyone, and the arts manifest our desire, our attempt to share or influence others through experiences that are peculiar to us as individuals. Let me say it again another way: science provides an understanding of a universal experience, and arts provides a universal understanding of a personal experience. That's what we have to think about, that they're all part of us, they're all part of a continuum. It's not just the tools, it's not just the sciences, you know, the mathematics and the numerical stuff and the statistics, because we heard, very much on this stage, people talked about music being mathematical. Right? Arts don't just use clay, aren't the only ones that use clay, light and sound and movement. They use analysis as well. So people might say, well, I still like that intuitive versus analytical thing, because everybody wants to do the right brain, left brain thing, right? We've all been accused of being right-brained or left-brained at some point in time, depending on who we disagreed with. (Laughter) You know, people say intuitive, you know that's like you're in touch with nature, in touch with yourself and relationships. Analytical: you put your mind to work, and I'm going to tell you a little secret. You all know this though, but sometimes people use this analysis idea, that things are outside of ourselves, to be, say, that this is what we're going to elevate as the true, most important sciences, right? 
And then you have artists, and you all know this is true as well, artists will say things about scientists because they say they're too concrete, they're disconnected with the world. But, we've even had that here on stage, so don't act like you don't know what I'm talking about. (Laughter) We had folks talking about the Flat Earth Society and flower arrangers, so there's this whole dichotomy that we continue to carry along, even when we know better. And folks say we need to choose either or. But it would really be foolish to choose either one, right? Intuitive versus analytical? That's a foolish choice. It's foolish, just like trying to choose between being realistic or idealistic. You need both in life. Why do people do this? I'm just gonna quote a molecular biologist, Sydney Brenner, who's 70 years old so he can say this. He said, "It's always important to distinguish between chastity and impotence." Now... (Laughter) I want to share with you a little equation, okay? How do understanding science and the arts fit into our lives and what's going on and the things that we're talking about here at the design conference, and this is a little thing I came up with, understanding and our resources and our will cause us to have outcomes. Our understanding is our science, our arts, our religion, how we see the universe around us, our resources, our money, our labor, our minerals, those things that are out there in the world we have to work with. But more importantly, there's our will. This is our vision, our aspirations of the future, our hopes, our dreams, our struggles and our fears. Our successes and our failures influence what we do with all of those, and to me, design and engineering, craftsmanship and skilled labor, are all the things that work on this to have our outcome, which is our human quality of life. Where do we want the world to be? And guess what? 
Regardless of how we look at this, whether we look at arts and sciences are separate or different, they're both being influenced now and they're both having problems. I did a project called S.E.E.ing the Future: Science, Engineering and Education, and it was looking at how to shed light on most effective use of government funding. We got a bunch of scientists in all stages of their careers. They came to Dartmouth College, where I was teaching, and they talked about with theologians and financiers, what are some of the issues of public funding for science and engineering research? What's most important about it? There are some ideas that emerged that I think have really powerful parallels to the arts. The first thing they said was that the circumstances that we find ourselves in today in the sciences and engineering that made us world leaders is very different than the '40s, the '50s, and the '60s and the '70s when we emerged as world leaders, because we're no longer in competition with fascism, with Soviet-style communism, and by the way that competition wasn't just military, it included social competition and political competition as well, that allowed us to look at space as one of those platforms to prove that our social system was better. Another thing they talked about was the infrastructure that supports the sciences is becoming obsolete. We look at universities and colleges, small, mid-sized community colleges across the country, their laboratories are becoming obsolete, and this is where we train most of our science workers and our researchers, and our teachers, by the way, and then that there's a media that doesn't support the dissemination of any more than the most mundane and inane of information. There's pseudo-science, crop circles, alien autopsy, haunted houses, or disasters. And that's what we see. 
And this isn't really the information you need to operate in everyday life and figure out how to participate in this democracy and determine what's going on. They also said that there's a change in the corporate mentality. Whereas government money had always been there for basic science and engineering research, we also counted on some companies to do some basic research, but what's happened now is companies put more energy into short-term product development than they do in basic engineering and science research. And education is not keeping up. In K through 12, people are taking out wet labs. They think if we put a computer in the room it's going to take the place of actually, we're mixing the acids, we're growing the potatoes. And government funding is decreasing in spending and then they're saying, let's have corporations take over, and that's not true. Government funding should at least do things like recognize cost-benefits of basic science and engineering research. We have to know that we have a responsibility as global citizens in this world. We have to look at the education of humans. We need to build our resources today to make sure that they're trained so that they understand the importance of these things, and we have to support the vitality of science, and that doesn't mean that everything has to have one thing that's going to go on, or we know exactly what's going to be the outcome of it, but that we support the vitality and the intellectual curiosity that goes along, and if you think about those parallels to the arts, the competition with the Bolshoi Ballet spurred the Joffrey and the New York City Ballet to become better. Infrastructure museums, theaters, movie houses across the country are disappearing. We have more television stations with less to watch, we have more money spent on rewrites to get old television programs in the movies. 
We have corporate funding now that, when it goes to some company, when it goes to support the arts, it almost requires that the product be part of the picture that the artist draws, and we have stadiums that are named over and over again by corporations. In Houston, we're trying to figure out what to do with that Enron Stadium thing. (Laughter) And fine arts and education in the schools is disappearing, and we have a government that seems like it's gutting the NEA and other programs, so we have to really stop and think, what are we trying to do with the sciences and the arts? There's a need to revitalize them. We have to pay attention to it. I just want to tell you really quickly what I'm doing. (Applause) I want to tell you what I've been doing a little bit since... I feel this need to sort of integrate some of the ideas that I've had and run across over time. One of the things that I found out is that there's a need to repair the dichotomy between the mind and body as well. My mother always told me, you have to be observant, know what's going on in your mind and your body, and as a dancer I had this tremendous faith in my ability to know my body, just as I knew how to sense colors. Then I went to medical school, and I was supposed to just go on what the machine said about bodies. You know, you would ask patients questions and some people would tell you, "Don't, don't, don't listen to what the patients said." We know that patients know and understand their bodies better, but these days we're trying to divorce them from that idea. We have to reconcile the patient's knowledge of their body with physician's measurements. We had someone talk about measuring emotions and getting machines to figure out what, to keep us from acting crazy. Right? No, we shouldn't measure, we shouldn't use machines to measure road rage and then do something to keep us from engaging in it. 
Maybe we can have machines help us to recognize that we have road rage and then we need to know how to control that without the machines. We even need to be able to recognize that without the machines. What I'm very concerned about is how do we bolster our self-awareness as humans, as biological organisms? Michael Moschen spoke of having to teach and learn how to feel with my eyes, to see with my hands. We have all kinds of possibilities to use our senses by, and that's what we have to do. That's what I want to do, is to try to use bioinstrumentation, those kind of things to help our senses in what we do, and that's the work I've been doing now as a company called BioSentient Corporation. I figured I'd have to do that ad, because I'm an entrepreneur, because entrepreneur says that that's somebody who does what they want to do because they're not broke enough that they have to get a real job. (Laughter) But that's the work I'm doing with BioSentient Corporation trying to figure out how do we integrate these things? Let me finish by saying that my personal design issue for the future is really about integrating, to think about that intuitive and that analytical. The arts and sciences are not separate. High school physics lesson before you leave. High school physics teacher used to hold up a ball. She would say this ball has potential energy, but nothing will happen to it, it can't do any work until I drop it and it changes states. I like to think of ideas as potential energy. They're really wonderful, but nothing will happen until we risk putting them into action. This conference is filled with wonderful ideas. We're going to share lots of things with people, but nothing's going to happen until we risk putting those ideas into action. We need to revitalize the arts and sciences of today, we need to take responsibility for the future. We can't hide behind saying it's just for company profits, or it's just a business, or I'm an artist or an academician. 
Here's how you judge what you're doing. I talked about that balance between intuitive, analytical. Fran Lebowitz, my favorite cynic, she said the three questions of greatest concern, now I'm going to add on to design, is, "Is it attractive?" That's the intuitive. "Is it amusing?" The analytical. "And does it know its place?" The balance. Thank you very much. (Applause)
wc wpm
1 3661 247.3649
The speaker is Mae Jemison, with the talk "Teach arts and sciences together" at the TED2002 conference. We should take this result with a pinch of salt: I went ahead and had a look at the talk, and she didn’t really seem to speak that fast.
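One cheap sanity check on a words-per-minute figure is to recount the words without the transcript's stage directions, since cues such as (Laughter) and (Applause) are not spoken. The sketch below is not part of the original notebook: the helper names `count_spoken_words` and `words_per_minute` are hypothetical, and it assumes the duration is given in minutes. With only a handful of cues per talk this is unlikely to explain a large discrepancy on its own, so the last line also backs out the duration implied by the row above, which can be compared with the video's actual running time.

```r
# Hypothetical helper: count spoken words, ignoring parenthesised stage
# directions such as "(Laughter)" or "(Applause)".
count_spoken_words <- function(transcript) {
  spoken <- gsub("\\([^)]*\\)", " ", transcript)   # drop parenthesised cues
  words <- strsplit(trimws(spoken), "\\s+")[[1]]   # split on whitespace
  sum(nzchar(words))
}

# Words per minute; assumes duration_min is expressed in minutes.
words_per_minute <- function(transcript, duration_min) {
  count_spoken_words(transcript) / duration_min
}

# Closing words of the transcript printed above
snippet <- "Thank you very much. (Applause)"
count_spoken_words(snippet)

# Duration in minutes implied by wc = 3661 and wpm = 247.3649 (roughly 14.8)
3661 / 247.3649
```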
Finally, in this section, I’d like to see if there is any correlation between words per minute and popularity.
ggMarginal(
ggplot(filter(df3, duration<25), aes(x = wpm, y = views)) + geom_point()
)
cor(df3[, c("wpm","views")])
             wpm      views
wpm   1.00000000 0.01311274
views 0.01311274 1.00000000
Again, there is practically no correlation. If you are going to give a TED Talk, you probably shouldn’t worry if you’re speaking a little faster or a little slower than usual.
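Because view counts are heavily skewed, a Pearson coefficient can be dominated by a few extremely popular talks. A rank-based (Spearman) coefficient, or a Pearson coefficient on log-transformed views, is a common robustness check. The sketch below uses a tiny synthetic data frame whose numbers are invented purely for illustration and are not taken from the dataset. With the real data the same calls would take `df3$wpm` and `df3$views`, assuming neither column has missing values.

```r
# Synthetic illustration only: these numbers are NOT from the TED dataset.
toy <- data.frame(
  wpm   = c(120, 140, 150, 165, 180),
  views = c(1e5, 5e4, 2e6, 3e5, 8e5)
)

# Linear (Pearson) correlation, as used in the notebook
pearson <- cor(toy$wpm, toy$views, method = "pearson")

# Rank-based (Spearman) correlation, less sensitive to extreme values
spearman <- cor(toy$wpm, toy$views, method = "spearman")

# Pearson correlation after compressing the long right tail of views
log_pearson <- cor(toy$wpm, log10(toy$views))

c(pearson = pearson, spearman = spearman, log_pearson = log_pearson)
```

Note also that the scatter plot filters on `duration < 25` while `cor()` is computed on all rows of `df3`, so the two are not describing exactly the same subset.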
TED allows its users to rate a particular talk on a variety of metrics. We therefore have data on how many people found a particular talk funny, inspiring, creative and a myriad of other adjectives. Let us inspect what this ratings dictionary actually looks like.
df[2, 'ratings']
# A tibble: 1 x 1
ratings
<chr>
1 [{'id': 7, 'name': 'Funny', 'count': 544}, {'id': 3, 'name': 'Courageous', 'count': 139}, {'id': 2, 'name': 'C
# Map each rating id to its label
rat_levels <- c(`1` = "Beautiful", `2` = "Confusing", `3` = "Courageous",
                `7` = "Funny", `8` = "Informative", `9` = "Ingenious",
                `10` = "Inspiring", `11` = "Longwinded", `21` = "Unconvincing",
                `22` = "Fascinating", `23` = "Jaw-dropping", `24` = "Persuasive",
                `25` = "OK", `26` = "Obnoxious")
library(RJSONIO)
# Parse each talk's ratings string into a data frame with one row per rating
rating_list <- lapply(df$ratings, function(s)
  do.call("rbind", lapply(fromJSON(s), function(y) do.call("cbind", y))) %>%
    as.data.frame() %>%
    mutate(rating = rat_levels[as.character(id)])
)
# Stack the per-talk ratings, label them with the talk title, and join the talk metadata
rating_df <- do.call("rbind", Map(function(title, rating)
  data.frame(title, rating, stringsAsFactors = FALSE),
  df$title, rating_list)) %>%
  merge(df, by = "title")
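The ratings column stores Python-style dictionaries with single quotes, which strict JSON parsers reject. As an alternative to the RJSONIO approach above, here is a minimal sketch that swaps in the jsonlite package: it converts the quotes and lets `fromJSON` build the data frame directly. The helper name `parse_ratings` is hypothetical, and the sketch assumes that single quotes are the only deviation from valid JSON and that no rating name contains an apostrophe.

```r
library(jsonlite)

# Hypothetical helper: parse one ratings string into a data frame.
# Assumes the only problem is single quotes in place of double quotes.
parse_ratings <- function(s) {
  fromJSON(gsub("'", "\"", s, fixed = TRUE))
}

# First two entries of the ratings string printed above
s <- "[{'id': 7, 'name': 'Funny', 'count': 544}, {'id': 3, 'name': 'Courageous', 'count': 139}]"
ratings <- parse_ratings(s)
ratings
```

Because jsonlite simplifies an array of records into a data frame, the columns keep their types (`count` stays numeric), which avoids a separate conversion step.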
Funniest Talks of all time
rating_df %>%
filter(rating == "Funny") %>%
select(title, main_speaker, views, published_date, count) %>%
group_by(title) %>%
arrange(desc(count))
# A tibble: 2,550 x 5
# Groups:   title [2,550]
   title                                             main_speaker          views published_date count
   <chr>                                             <chr>                 <int> <date>         <dbl>
 1 Do schools kill creativity?                       Ken Robinson       47227110 2006-06-27     19645
 2 This is what happens when you reply to spam email James Veitch       20475972 2016-01-08      7731
 3 Inside the mind of a master procrastinator        Tim Urban          14745406 2016-03-15      7445
 4 The happy secret to better work                   Shawn Achor        16209727 2012-02-01      7315
 5 Lies, damned lies and statistics (about TEDTalks) Sebastian Wernicke  2212944 2010-04-30      5552
 6 The power of vulnerability                        Brené Brown        31168150 2010-12-23      5225
 7 10 things you didn't know about orgasm            Mary Roach         22270883 2009-05-20      4166
 8 "It's time for \"The Talk\""                      Julia Sweeney       3362099 2010-05-14      4025
 9 Did you hear the one about the Iranian-American?  Maz Jobrani         4646183 2010-08-19      4013
10 Bring on the learning revolution!                 Ken Robinson        7266316 2010-05-24      3000
# ... with 2,540 more rows
Most Beautiful Talks of all time
rating_df %>%
filter(rating == "Beautiful") %>%
select(title, main_speaker, views, published_date, count) %>%
group_by(title) %>%
arrange(desc(count))
# A tibble: 2,550 x 5
# Groups:   title [2,550]
   title                                       main_speaker                views published_date count
   <chr>                                       <chr>                       <int> <date>         <dbl>
 1 My stroke of insight                        Jill Bolte Taylor        21190883 2008-03-12      9437
 2 The power of vulnerability                  Brené Brown              31168150 2010-12-23      7942
 3 Building a park in the sky                  Robert Hammond             704205 2011-06-30      6685
 4 The transformative power of classical music Benjamin Zander           9315483 2008-06-25      5967
 5 The danger of a single story                Chimamanda Ngozi Adichie 13298341 2009-10-07      5607
 6 Underwater astonishments                    David Gallo              13926113 2008-01-11      5201
 7 Do schools kill creativity?                 Ken Robinson             47227110 2006-06-27      4573
 8 If I should have a daughter ...             Sarah Kay                10529854 2011-03-18      4430
 9 Nature. Beauty. Gratitude.                  Louie Schwartzberg        3658158 2012-11-22      4399
10 Your elusive creative genius                Elizabeth Gilbert        13155478 2009-02-09      4027
# ... with 2,540 more rows
Most Jaw-dropping Talks of all time
rating_df %>%
filter(rating == "Jaw-dropping") %>%
select(title, main_speaker, views, published_date, count) %>%
group_by(title) %>%
arrange(desc(count))
# A tibble: 2,550 x 5
# Groups:   title [2,550]
   title                                            main_speaker             views published_date count
   <chr>                                            <chr>                    <int> <date>         <dbl>
 1 How PhotoSynth can connect the world's images    Blaise Agüera y Arcas  4772595 2007-05-27     14728
 2 My stroke of insight                             Jill Bolte Taylor     21190883 2008-03-12     10464
 3 The thrilling potential of SixthSense technology Pranav Mistry         16097077 2009-11-16      8416
 4 Underwater astonishments                         David Gallo           13926113 2008-01-11      8328
 5 "A performance of \"Mathemagic\""                Arthur Benjamin        8360707 2007-12-13      7196
 6 New insights on poverty                          Hans Rosling           3243784 2007-06-25      5137
 7 This is Saturn                                   Carolyn Porco          2627709 2007-10-01      4971
 8 The radical promise of the multi-touch interface Jeff Han               4531020 2006-08-01      4643
 9 Do schools kill creativity?                      Ken Robinson          47227110 2006-06-27      4439
10 The best stats you've ever seen                  Hans Rosling          12005869 2006-06-27      3736
# ... with 2,540 more rows
Most Confusing Talks of all time
rating_df %>%
filter(rating == "Confusing") %>%
select(title, main_speaker, views, published_date, count) %>%
group_by(title) %>%
arrange(desc(count))
# A tibble: 2,550 x 5
# Groups:   title [2,550]
   title                                  main_speaker         views published_date count
   <chr>                                  <chr>                <int> <date>         <dbl>
 1 I believe we evolved from aquatic apes Elaine Morgan      1038576 2009-07-31       531
 2 An 8-dimensional model of the universe Garrett Lisi       1491698 2008-10-14       376
 3 Why we do what we do                   Tony Robbins      20685401 2006-06-27       301
 4 My stroke of insight                   Jill Bolte Taylor 21190883 2008-03-12       289
 5 The call to learn                      Clifford Stoll     2283491 2008-03-26       278
 6 Design and destiny                     Philippe Starck    1783740 2007-12-04       276
 7 Brain magic                            Keith Barry       13327101 2008-07-18       273
 8 17 words of architectural inspiration  Daniel Libeskind    784642 2009-07-01       244
 9 Do schools kill creativity?            Ken Robinson      47227110 2006-06-27       242
10 The surprising science of happiness    Dan Gilbert       14689301 2006-09-26       241
# ... with 2,540 more rows
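The rankings above sort by raw vote counts, which favours talks that have simply been watched more often; "Do schools kill creativity?" shows up in all four lists. One possible refinement, not used in the original analysis, is to rank by votes per 1,000 views. The sketch below applies that hypothetical normalisation to three rows copied from the "Funny" table; with the full data the same `mutate()` and `arrange()` steps could be applied to `rating_df`.

```r
library(dplyr)

# Three rows copied from the "Funny" table above (title, views, funny votes)
funny <- data.frame(
  title = c("Do schools kill creativity?",
            "This is what happens when you reply to spam email",
            "Lies, damned lies and statistics (about TEDTalks)"),
  views = c(47227110, 20475972, 2212944),
  count = c(19645, 7731, 5552),
  stringsAsFactors = FALSE
)

# Votes per 1,000 views: a hypothetical normalisation for illustration
funny_rate <- funny %>%
  mutate(per_1000_views = count / views * 1000) %>%
  arrange(desc(per_1000_views))

funny_rate
```

On these three rows the order changes: the talk with the fewest views comes out on top once votes are scaled by audience size.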
I was curious about which words are most often used by TED Speakers. Could we create a Word Cloud out of all TED Speeches?
I tried to create a word cloud following the tutorial on r-bloggers.
library(tm)
library(SnowballC)
library(wordcloud)
texts <- df3$transcript
# Clean the transcripts: strip punctuation, drop English stop words, stem
corpus <- Corpus(VectorSource(texts)) %>%
  tm_map(PlainTextDocument) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords('english')) %>%
  tm_map(stemDocument) %>%
  tm_map(removeWords, c("and", "this", "there"))
# Rebuild the corpus from the processed documents before counting terms
corpus <- Corpus(VectorSource(corpus))
# Term frequencies across all transcripts, most frequent first
m <- TermDocumentMatrix(corpus) %>% as.matrix()
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v) %>%
  filter(!(word %in% c("and","this","that")))
set.seed(1234)  # fix the random layout so the cloud is reproducible
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
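To make the shape of the frequency table `d` concrete, here is a base-R sketch of the same idea on a single sentence from the transcript shown earlier: lower-case the text, strip punctuation, drop stop words, and count what is left. It does not use tm and skips stemming, and the six-word stop list is a hypothetical stand-in for `stopwords('english')`.

```r
# One sentence from the transcript shown earlier
text <- "We need to revitalize the arts and sciences of today, we need to take responsibility for the future."

# Tiny hypothetical stop list, standing in for stopwords('english')
stop_list <- c("we", "to", "the", "and", "of", "for")

# Lower-case, strip punctuation, split on whitespace, drop stop words
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
tokens <- tokens[!(tokens %in% stop_list)]

# Word/frequency table in the same layout as `d`, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)
d_toy <- data.frame(word = names(freq), freq = as.integer(freq),
                    stringsAsFactors = FALSE)
d_toy
```

A data frame of this form, with a `word` and a `freq` column, is what `wordcloud()` consumes through its `words` and `freq` arguments.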
An interactive alternative
library(wordcloud2)
wordcloud2(data = d)
# Why not?
letterCloud(d, word = "TED")