Final Project - Data Wrangling with R

1. Introduction

Problem Statement:

The objective of the project is to explore the TED Talks data and generate some interesting insights from it. This includes understanding how the popularity of TED Talks has changed over the years in terms of views, comments and ratings, along with exploring possible drivers such as the occupation of the speaker, the duration of the talk and the number of speakers. Finally, I want to summarise the key ingredients of the best and the worst TED Talks.

Implementation:

The data was obtained from Kaggle and covers all audio-video recordings of TED Talks uploaded to the official TED.com website until September 21st, 2017. After cleaning the data and bringing it into an easily usable format using text mining and data wrangling techniques, it is analysed graphically to draw inferences.

How does it help the consumer?

The analysis will help the consumer understand how the popularity of TED Talks has changed over the years and identify its drivers. Ultimately, it can help them design better TED Talks and avoid the mistakes of the worst ones.

2. Packages Required

Below is the list of packages needed for the analysis:

jsonlite: Used to read in data that exists as JSON
stringr: Used for character manipulation and cleaning the data
tidytext: Used to tidy text
tidyverse: Used for data manipulation and for visualising data with ggplot2
wordcloud: Used to create word clouds
DT: Used to display the data in HTML in a scrollable format
lubridate: Used to work with dates easily
reshape2: Used to reshape the data to a matrix for the wordcloud comparison
plotly: Used to plot interactive box plots

library(jsonlite)
library(stringr)
library(tidytext)
library(tidyverse)
library(wordcloud)
library(DT)
library(lubridate)
library(reshape2)
library(plotly)

3. Data Preparation

A. Data Source

The data source for the analysis is the Kaggle dataset TED Talk Data.

B. Description of the data

The data contains information on all audio-video recordings of TED Talks uploaded to the official TED.com website until September 21st, 2017. It was posted on Sep 09, 2017 and updated on Sep 25, 2017. It covers data from 2006 to 2017. The data was collected by scraping the official TED website and is available under the Creative Commons License.

The data contains a total of 2550 rows and 17 variables, where each row contains the data of a particular TED Talk.

A brief description of the variables is given below:

Comments: The number of first level comments made on the talk
Description: A blurb of what the talk is about
Duration: The duration of the talk in seconds
Event: The TED/TEDx event where the talk took place
Film_date: The Unix timestamp of the filming
Languages: The number of languages in which the talk is available
Main_speaker: The first named speaker of the talk
Name: The official name of the TED Talk. Includes the title and the speaker
Num_speaker: The number of speakers in the talk
Published_date: The Unix timestamp for the publication of the talk on TED.com
Ratings: A stringified dictionary of the various ratings given to the talk (inspiring, fascinating, jaw dropping, etc.)
Related_talks: A list of dictionaries of recommended talks to watch next
Speaker_occupation: The occupation of the main speaker
Tags: The themes associated with the talk
Title: The title of the talk
Url: The URL of the talk.
Views: The number of views on the talk

11 of these fields, along with a few derived variables, will be used for this analysis.

C. Original Data

ted <- read.csv('https://raw.githubusercontent.com/Duttaay/TED_Talk_Analysis_R-Project/Duttaay-patch-1/ted_main.csv',stringsAsFactors = FALSE)

#Original data looks like this
datatable(ted,extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = I('colvis')))

D. Data Cleaning

Data cleaning had multiple steps, explained below in detail:

1. Only keeping the variables to be used in the analysis

ted1 <- ted[c("comments","duration","languages","main_speaker","num_speaker","published_date","ratings","speaker_occupation","tags","views","title")]
#checking the colnames and dimensions of the new table
colnames(ted1)

##  [1] "comments"           "duration"           "languages"         
##  [4] "main_speaker"       "num_speaker"        "published_date"    
##  [7] "ratings"            "speaker_occupation" "tags"              
## [10] "views"              "title"

dim(ted1)

## [1] 2550   11

2. Checking for missing values in the table (blanks are counted only if they were converted to NAs at the time of importing)

sum(is.na(ted1))

## [1] 0

colSums(is.na(ted1))

##           comments           duration          languages 
##                  0                  0                  0 
##       main_speaker        num_speaker     published_date 
##                  0                  0                  0 
##            ratings speaker_occupation               tags 
##                  0                  0                  0 
##              views              title 
##                  0                  0

Although the NA counts shown above are all zero, the speaker_occupation column has 6 missing values, which appear as blank strings in the data as loaded here. At this point I want to keep these rows; they are taken care of in further steps of data cleaning.
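One way to see how an NA count of zero can coexist with missing occupations is to count blank strings explicitly. The sketch below uses a made-up data frame (`toy_talks` and `csv_text` are hypothetical, not the TED data) and base R only; it also shows the standard `na.strings` argument of `read.csv`, which converts blanks to NA at import.

```r
# Toy data: one blank occupation, no true NA values (hypothetical, not the TED data)
toy_talks <- data.frame(
  main_speaker = c("A", "B", "C"),
  speaker_occupation = c("Author", "", "Designer"),
  stringsAsFactors = FALSE
)

# is.na() does not treat "" as missing
na_count <- sum(is.na(toy_talks$speaker_occupation))

# Counting blanks explicitly (trimws() also catches whitespace-only strings)
blank_count <- sum(trimws(toy_talks$speaker_occupation) == "")

# Converting blanks to NA at import time, shown on an in-memory CSV
csv_text <- "main_speaker,speaker_occupation\nA,Author\nB,\nC,Designer"
toy_from_csv <- read.csv(text = csv_text, na.strings = c("", "NA"),
                         stringsAsFactors = FALSE)
na_after_import <- sum(is.na(toy_from_csv$speaker_occupation))
```

On the real data, the same blank-string count applied to `ted1$speaker_occupation` would be the quantity to compare against the 6 missing values mentioned above.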

3. Checking the numerical variables for outliers

ggplot(aes(x = "",y = comments),data = ted1) +
  geom_boxplot() +
  scale_y_log10(labels = scales::comma)+
  labs(title = "Number of Comments") +
  theme_minimal()

ggplot(aes(x = "",y = views),data = ted1) + 
  geom_boxplot() +
  scale_y_log10(labels = scales::comma) +
  labs(title = "Number of Views") +
  theme_minimal()

par(mfrow = c(1,3))
hist(ted1$num_speaker)
boxplot(ted1$languages, main = "Number of languages")
boxplot(ted1$duration,main = "Duration (in secs)")

Most of the variables have outliers but, fortunately, no negative values. The only variable of concern is the number of languages, which has a value of 0 in 86 records. At this point I do not want to delete these rows, so that the data in the other columns remains available. However, I will take care of them during the analysis.

4. Converting the Published date to a normal date format and creating Month and Year column to be used later in the analysis

ted1$published_date <- as.Date(as.POSIXct(as.numeric(ted1$published_date),origin = '1970-01-01', tz = "GMT"))
ted1$published_month <- factor(month.abb[month(ted1$published_date)])
ted1$published_year <- year(ted1$published_date)
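As a small worked example of this conversion, the sketch below turns two made-up Unix timestamps into a date, a month abbreviation and a year. It uses base R only as a stand-in for the lubridate helpers `month()` and `year()`; the vector `unix_ts` is hypothetical.

```r
# Hypothetical Unix timestamps in seconds
# (1151366400 is midnight UTC on 2006-06-27, 1506297600 is midnight UTC on 2017-09-25)
unix_ts <- c(1151366400, 1506297600)

# Seconds since 1970-01-01 -> date-time in GMT -> calendar date
published_date <- as.Date(as.POSIXct(unix_ts, origin = "1970-01-01", tz = "GMT"))

# Month abbreviation as an ordered-by-calendar factor, and the year as a number
published_month <- factor(month.abb[as.POSIXlt(published_date)$mon + 1],
                          levels = month.abb)
published_year <- as.POSIXlt(published_date)$year + 1900
```

Setting `levels = month.abb` when the factor is created keeps the months in calendar order, so the later re-levelling step before plotting would not be needed with this variant.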

5. Adding a ‘sno’ column in the dataset

The sno column will act as the primary key for the dataset where each sno will be an identifier for each talk

ted1$sno <- seq_len(nrow(ted1))

6. Cleaning the Ratings column and transforming it to number of positive, Negative and Neutral ratings

Currently, a sample rating record looks like:

## [1] "[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}, {'id': 9, 'name': 'Ingenious', 'count': 6073}, {'id': 3, 'name': 'Courageous', 'count': 3253}, {'id': 11, 'name': 'Longwinded', 'count': 387}, {'id': 2, 'name': 'Confusing', 'count': 242}, {'id': 8, 'name': 'Informative', 'count': 7346}, {'id': 22, 'name': 'Fascinating', 'count': 10581}, {'id': 21, 'name': 'Unconvincing', 'count': 300}, {'id': 24, 'name': 'Persuasive', 'count': 10704}, {'id': 23, 'name': 'Jaw-dropping', 'count': 4439}, {'id': 25, 'name': 'OK', 'count': 1174}, {'id': 26, 'name': 'Obnoxious', 'count': 209}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]"

#reading the values as json to get the values in rows
df1 <- c()
for (i in seq_len(nrow(ted1)))
  {
    #single quotes are swapped for double quotes so that the string parses as JSON
    df <- fromJSON(str_replace_all(ted1$ratings[i],"'",'"'))
    df$sno <- i 
    df1 <- rbind(df,df1)
}
#Creating a table with the ratings
ted_ratings <- df1

#Checking the distinct rating types available
ted_ratings %>% distinct(name)

##            name
## 1  Unconvincing
## 2   Informative
## 3     Inspiring
## 4            OK
## 5   Fascinating
## 6     Ingenious
## 7     Confusing
## 8     Obnoxious
## 9     Beautiful
## 10   Longwinded
## 11   Persuasive
## 12 Jaw-dropping
## 13   Courageous
## 14        Funny

#Classified the distinct rating types to positive, negative and neutral ratings
negative_words <- c('Unconvincing','Confusing','Obnoxious','Longwinded')
positive_words <- c('Informative','Inspiring','Fascinating','Ingenious','Beautiful','Persuasive','Jaw-dropping','Courageous','Funny')

df1$ratings_type <- ifelse(df1$name %in% unlist(negative_words),'negative_ratings',ifelse(df1$name %in% unlist(positive_words),'positive_ratings',ifelse(df1$name == 'OK','neutral_ratings',' ')))

ted2 <- df1 %>% group_by(sno,ratings_type) %>% 
  summarise(count_rating_type = sum(count)) %>% spread(ratings_type,count_rating_type) %>% ungroup() %>%
  inner_join(ted1,by = "sno")
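The ratings step above can be sketched end to end on two made-up rating strings. This is an illustration under assumptions, not the exact pipeline: `ratings_raw` is invented, anything that is neither negative nor 'OK' is treated as positive, and base R `xtabs()` stands in for `group_by()`/`spread()` (it fills absent rating types with 0 rather than NA).

```r
library(jsonlite)

# Two made-up rating strings in the same format as the ratings column
ratings_raw <- c(
  "[{'id': 7, 'name': 'Funny', 'count': 10}, {'id': 2, 'name': 'Confusing', 'count': 3}, {'id': 25, 'name': 'OK', 'count': 4}]",
  "[{'id': 10, 'name': 'Inspiring', 'count': 20}, {'id': 26, 'name': 'Obnoxious', 'count': 1}]"
)

# Parse each string into a data frame and tag it with its row number (sno)
ratings_long <- do.call(rbind, lapply(seq_along(ratings_raw), function(i) {
  parsed <- fromJSON(gsub("'", '"', ratings_raw[i], fixed = TRUE))
  parsed$sno <- i
  parsed
}))

# Classify each rating name (simplified: not negative and not OK means positive)
negative_words <- c("Unconvincing", "Confusing", "Obnoxious", "Longwinded")
ratings_long$ratings_type <- ifelse(
  ratings_long$name %in% negative_words, "negative_ratings",
  ifelse(ratings_long$name == "OK", "neutral_ratings", "positive_ratings")
)

# One row per talk, one column per rating type
ratings_wide <- xtabs(count ~ sno + ratings_type, data = ratings_long)
```

Building the list with `lapply()` and binding once avoids growing a data frame inside a loop, which may matter more on larger datasets than on 2550 rows.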

7. Cleaning the Speaker occupation field

Currently, sample occupation records look like:

ted1$speaker_occupation[1:5]

## [1] "Author/educator"                     
## [2] "Climate advocate"                    
## [3] "Technology columnist"                
## [4] "Activist for environmental justice"  
## [5] "Global health expert; data visionary"

#replacing all the ;,/ to blanks
ted2$speaker_occupation <- ted2$speaker_occupation %>% str_replace_all('/',' ') %>% str_replace_all(',',' ')   %>% str_replace_all(';',' ') %>% str_replace_all('\\+',' ') %>% tolower()

#Unnesting each occupation
df2 <- unnest_tokens(ted2,occupation1,speaker_occupation) %>% select(sno,occupation1)

#stop word list to be removed
stop_words <-  c('and','of','in','expert','social','the','for')

#removing stop words and renaming similar words 
df2 <- df2 %>% subset(!occupation1 %in% stop_words) %>% mutate(occupation1 = str_replace_all(occupation1, 
       c("writer" = "author", "scientists" = "scientist", "researcher" = "scientist", "neuroscientist" = "scientist",
         "professor" = "educator", "scholar" = "educator", "education" = "educator", "teacher" = "educator",
         "songauthor" = "author", "editor" = "author",
         "data" = "data related", "analyst" = "data related", "statistician" = "data related",
         "musician" = "artist", "singer" = "artist", "sing" = "artist", "poet" = "artist", "actor" = "artist",
         "comedian" = "artist", "playwright" = "artist", "media" = "artist", "performance" = "artist",
         "guitarist" = "artist", "dancer" = "artist", "humorist" = "artist", "pianist" = "artist",
         "violinist" = "artist", "magician" = "artist", "artists" = "artist", "band" = "artist",
         "director" = "filmmaker", "producer" = "filmmaker",
         "entrepreneur" = "business", "ceo" = "business", "founder" = "business",
         "psychology" = "psychologist",
         "physician" = "health", "medical" = "health", "doctor" = "health",
         "design" = "designer", "designerer" = "designer",
         "reporter" = "journalist"))) 

#creating a list of top 20 words
occupation_by_rank <- df2 %>% group_by(occupation1) %>% summarise(n = n_distinct(sno)) %>% arrange(desc(n))
top_20_occ <- occupation_by_rank[1:20,1]
datatable(head(occupation_by_rank,20))
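Because `str_replace_all()` with a named vector replaces substrings one pattern after another, "designer" first becomes "designerer" and needs a second entry to undo it, and a pattern such as "band" would also match inside longer words. Since each token is a single word at this point, an exact-match lookup is one possible alternative. The sketch below uses base R and made-up tokens; `occ` and `lookup` are hypothetical and cover only a few of the renames.

```r
# Hypothetical single-word occupation tokens
occ <- c("designer", "design", "songwriter", "husband", "writer")

# Exact-match lookup: names are the tokens to rename, values are the new labels
lookup <- c(writer = "author", songwriter = "author",
            design = "designer", band = "artist")

# Only tokens that match a lookup name exactly are renamed
matched <- occ %in% names(lookup)
recoded <- occ
recoded[matched] <- unname(lookup[occ[matched]])
```

With exact matching, "designer" is left alone and "husband" is not turned into an artist, so helper entries such as "designerer" and "songauthor" would not be needed. The trade-off is that every variant (for example plurals) has to be listed explicitly.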

These top 20 occupations cover approximately 75% of the talks. When a speaker has more than one occupation, a top 20 occupation is given priority and the duplicates are removed, so that each talk keeps a single occupation. Talks left without any occupation (which includes the missing values) are labelled ‘others’.

ted3 <- df2 %>%  mutate(rank = ifelse(occupation1 %in% unlist(top_20_occ),1,0)) %>% arrange(sno,desc(rank)) %>%
  subset(!duplicated(sno)) %>% right_join(ted2,by = "sno") %>% 
  mutate(speaker_occupation = ifelse(is.na(occupation1),"others",occupation1)) %>% 
  select(-(occupation1))

Finally, each talk has one occupation assigned to its speaker, with the top 20 occupations taking priority and ‘others’ used where no occupation is available.

8. Cleaning the tags field

Currently, sample tag records look like:

ted3$tags[1:2]

## [1] "['children', 'creativity', 'culture', 'dance', 'education', 'parenting', 'teaching']"                                                  
## [2] "['alternative energy', 'cars', 'climate change', 'culture', 'environment', 'global issues', 'science', 'sustainability', 'technology']"

#unnesting individual tags from the field
ted3$tags <- ted3$tags %>% str_replace_all('\\[','') %>% str_replace_all('\\]','')   %>% str_replace_all("\\'",' ') %>% str_replace_all(',',' ') %>% tolower()

talk_tags <- unnest_tokens(ted3,tags1,tags) %>% select(sno,tags1)
datatable(head(talk_tags,10))
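By default `unnest_tokens()` splits text into single words, so a tag such as 'climate change' ends up as the two tokens "climate" and "change". If whole tags are wanted instead, one option is to extract the quoted strings directly. The sketch below is a base R illustration on a made-up record (`tags_raw` is hypothetical) and assumes tags are wrapped in single quotes and contain no apostrophes.

```r
# A made-up record in the same format as the tags column
tags_raw <- "['alternative energy', 'cars', 'climate change']"

# Pull out everything between a pair of single quotes, then drop the quotes
quoted <- regmatches(tags_raw, gregexpr("'[^']+'", tags_raw))[[1]]
tags_whole <- gsub("'", "", quoted, fixed = TRUE)

# Word-level split for comparison: multi-word tags break apart
tags_words <- unlist(strsplit(tolower(gsub("[^A-Za-z ]", " ", tags_raw)), " +"))
tags_words <- tags_words[tags_words != ""]
```

Here `tags_whole` keeps three tags while `tags_words` has five single words, which is the difference to keep in mind when reading the tag word clouds later.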

9. Creating the final dataset

ted_final <- ted3 %>%
             select(c("sno","main_speaker","title","num_speaker","comments","positive_ratings","negative_ratings","neutral_ratings","duration","languages","speaker_occupation","views","published_month","published_year","published_date")) %>%
             mutate(ratings = positive_ratings + negative_ratings + neutral_ratings)

E. Cleaned Dataset

I would be using two final datasets for the analysis:

ted_final: Contains all the cleaned and formatted TED Talk data
talk_tags: Contains the tags data unnested and matched with the serial number (that identifies the TED Talk)

ted_final

datatable(head(ted_final,100))

talk_tags

datatable(head(talk_tags,100))

F. Summary of cleaned dataset

main_speaker: The column is of type character and has 2156 unique speakers
num_speaker: The column is of type integer and has mean : 1.03, median : 1, range 1 to 5 speakers
comments: The column is of type integer and has mean : 191.56, median : 118, range 2 to 6404 comments
positive_ratings: The column is of type integer and has mean : 2222.76, median : 1276.5, range 45 to 91538 positive ratings
negative_ratings: The column is of type integer and has mean : 132.41, median : 77, range 0 to 3777 negative ratings
neutral_ratings: The column is of type integer and has mean : 81.24, median : 55.5, range 0 to 1341 neutral ratings
duration: The column is of type integer and has mean : 826.51 secs, median : 848 secs, range 135 to 5256 secs duration
languages: The column is of type integer and has mean : 27.33, median : 28, range 0 to 72 languages
speaker_occupation: The column is of type character and has 477 unique occupations, which include the top 20 occupations and ‘others’
views: The column is of type integer and has mean : 1,698,297.5, median : 1,124,523.5, range 50443 to 47227110 views

4. Exploratory Analysis

In this section I have tried to explore the TED Talks dataset and generate insights. The analysis focuses on understanding:

Trend of popularity of TED Talks over the years
Factors that affect a TED Talk’s popularity - what makes them the best and what makes them the worst

Format: Code followed by plot/table followed by Summary/Explanation

A. TED Talks over time

ted_final %>%
  group_by(published_year) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = factor(published_year),y = n,group = 1)) + 
  geom_line(color = "blue") + geom_point(size = 2, color = "blue") + 
  labs(title = "Number of Talks over the years", x = "Published Year", y = "# of Talks") +
  geom_hline(aes(yintercept = mean(n)), linetype = "dashed", alpha = .5) +
  annotate("text", x = '2007', y = 210, label = "Average: 212.5", size = 3) +
  theme_minimal()

Summary/Explanation :

TED Talks started being published in 2006, after which the number of talks increased every year up to 2012; the average is 212.5 talks per year. The number of talks peaked in 2012, with more than 300 talks, which is the highest number to date. From 2012 to 2017 there have only been slight changes in the number of talks, although a small dip of ~15% is observed in 2015. 2017 shows a drop in the number of talks because the data only covers talks published until September.

So, on average, we can expect around 210 or more talks every year.

ted_final %>%
  group_by(published_year) %>%
  summarise(avg_views = mean(views/100000)) %>% 
  ggplot(aes(x = factor(published_year),y = avg_views,group = 1)) +
  geom_line(color = "red") +
  geom_point(size = 2, color = "red") + 
  labs(title = "Views by Published Year", x = "Published Year" , y = "Average # of views(in hundred thousands)") +
  geom_hline(aes(yintercept = mean(avg_views)), linetype = "dashed", alpha = .5) +
  annotate("text", x = '2007', y = 19, label = "Average: 1,838,604", size = 3) +
  annotate("text", x = '2007', y = 42, label = "Max: 4,130,967", size = 3) + 
  theme_minimal()

Summary/Explanation :

The highest average number of views, ~4.1 million, was observed in 2006, which also had the lowest number of talks. This was expected, considering 2006 was the year when TED Talks started offering free viewing online. We also see that, although 2012 had the highest number of talks, its average views were below the overall average.

#Getting a dot chart for number of comments by published year
ted_final %>% 
  mutate(published_year1 = as.factor(published_year)) %>%
  group_by(published_year1) %>%
  summarise(avg_comments = mean(comments)) %>%
  ggplot(aes(x = published_year1, y = avg_comments)) + 
  geom_point(col = "tomato2", size = 3) +   
  geom_segment(aes(x = published_year1,xend = published_year1,y = min(avg_comments),yend = max(avg_comments)),linetype = "dashed",size = 0.05) +
  coord_flip() + 
  labs(title = "Number of Comments by Published Year", x = "Published year", y = "Average # of Comments") +
  theme_minimal()

Summary/Explanation :

As expected, 2006, which had the maximum average views, also had the maximum average number of comments, with 363 comments per talk, followed by 2013 with 289 comments per talk. After 2013 we see a steady decrease in the number of comments, with 2016 having only 81 comments per talk, which is a possible cause for concern. Let's look at ratings to check whether fewer comments also come with fewer ratings.

#Getting a dot chart for number of ratings by published year
ted_final %>% 
  mutate(published_year1 = as.factor(published_year)) %>%
  group_by(published_year1) %>%
  summarise(avg_ratings = mean(ratings)) %>%
  ggplot(aes(x = published_year1, y = avg_ratings)) + 
  geom_point(col = "tomato2", size = 3) +   
  geom_segment(aes(x = published_year1,xend = published_year1,y = min(avg_ratings),yend = max(avg_ratings)),linetype = "dashed",size = 0.05) +
  coord_flip() + 
  labs(title = "Number of Ratings by Published Year", x = "Published year", y = "Average # of Ratings")+
  theme_minimal()

Summary/Explanation :

The average number of ratings is also highest in 2006, but 2013, which had the second-highest views and comments, did not receive many ratings. On the other hand, 2010 showed a peak in the number of ratings.

#creating a stacked bar chart showing percentage of positive, neutral and negative ratings by the published year
ted_final %>%
group_by(published_year) %>%
summarise(Perc_Positive_Ratings= sum(positive_ratings)/sum(ratings), Perc_Negative_Ratings = sum(negative_ratings)/sum(ratings), Perc_Neutral_Ratings = sum(neutral_ratings)/sum(ratings)) %>%
gather(Type, Perc_rating ,-published_year) %>%
ggplot(aes(x = published_year, y = Perc_rating, fill = Type)) + geom_bar(stat = "identity") +
labs(title = "Percentage of Positive, Negative and Neutral Ratings by Published Year", x = "Published year", y = "% of Ratings") +
scale_y_continuous(labels = scales::percent) +
theme_minimal()

Summary/Explanation :

From the graph above we observe that the percentages of negative, positive and neutral ratings have stayed more or less the same over the years. 2009 is the only year with slightly higher ratings, which can be explored further to find the reason behind it.

The good news is that the percentage of negative ratings has been falling since 2010 and is lowest in 2017.

B. Popularity of TED Talks

With an average of about 1.7 million views per talk, TED Talks are clearly very popular.

Top 10 Most Viewed TED Talks of all time

#getting the top 10 talks of all times by number of views
datatable(ted_final %>%
            arrange(desc(views)) %>%
            select( title, main_speaker, views, published_date,comments,ratings) %>%
            head(10))

“Do Schools Kill Creativity?” by Ken Robinson is the most viewed and the most rated talk of all time, with ~47 million views and ~94k ratings.

We want to know what makes the best TED talks. Is there a factor that drives the most and the least viewed TED talk?

To try to answer this question, the TED Talks have been bucketed into five categories based on the quintiles of their number of views:

Worst: 0-20%ile
Bad: 20-40%ile
Ok: 40-60%ile
Good: 60-80%ile
Best: >80%ile

#Defining the view category
ted_final$view_category <- 
  ifelse(between(ted_final$views,quantile(ted_final$views,0),quantile(ted_final$views,0.20)),'Worst',
  ifelse(between(ted_final$views,quantile(ted_final$views,0.20),quantile(ted_final$views,0.40)),'Bad',
  ifelse(between(ted_final$views,quantile(ted_final$views,0.40),quantile(ted_final$views,0.60)),'Ok', 
  ifelse(between(ted_final$views,quantile(ted_final$views,0.60),quantile(ted_final$views,0.80)),'Good',
  ifelse(ted_final$views > quantile(ted_final$views,0.80),'Best','NA')))))

#adding levels to the column
vcat_order <- c('Best','Good','Ok','Bad','Worst')
ted_final$view_category <- factor(ted_final$view_category, levels = vcat_order)

view_cat <- ted_final %>%
  group_by(view_category) %>%
  summarise(Min_Views = min(views),Max_Views = max(views)) %>%
  arrange(desc(Min_Views))

datatable(view_cat)
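The same quintile bucketing can also be written with `quantile()` and `cut()` instead of nested `ifelse()` calls. The sketch below is a base R illustration on ten made-up view counts (the `views` vector is hypothetical); like the `between()` version, a value that sits exactly on a boundary falls into the lower category.

```r
# Ten made-up view counts
views <- c(50, 120, 300, 800, 1500, 2200, 4000, 9000, 15000, 47000)

# Quintile boundaries at 0%, 20%, 40%, 60%, 80% and 100%
breaks <- quantile(views, probs = seq(0, 1, 0.2))

# Intervals are closed on the right; include.lowest keeps the minimum in 'Worst'
view_category <- cut(views, breaks = breaks,
                     labels = c("Worst", "Bad", "Ok", "Good", "Best"),
                     include.lowest = TRUE)

# Reorder the levels from Best to Worst, as in the analysis
view_category <- factor(view_category,
                        levels = c("Best", "Good", "Ok", "Bad", "Worst"))
```

One caveat: `cut()` stops with an error if two quantile boundaries coincide, which could happen in data with many tied values.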

C. What makes the best TED Talks?

i. Do publishing months impact views?

# Adding levels to published month
month_order <- c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
ted_final$published_month <- factor(ted_final$published_month, levels = month_order)

#Plotting Views by published month facetted over the last 8 years (2010-2017)
ted_final %>%
  filter(published_year >= 2010) %>%
  group_by(published_year,published_month) %>%
  summarise(m_views = sum(views)) %>%
  inner_join(ted_final %>%
  filter(published_year >= 2010) %>%
  group_by(published_year) %>%
  summarise(y_views = sum(views)),by = "published_year") %>%
  mutate(perc_views = m_views/y_views) %>%
  ggplot(aes(x = published_month,y = perc_views,group = 1, color = published_year)) +
  geom_point() + geom_line() + facet_wrap(~published_year,ncol = 1) + 
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Published Month", y = "Percent Contribution in Yearly views", title = "Monthly percentage views over years 2010-2017 - Seasonality") +
  theme_minimal()

Summary/Explanation :

We can see from the graph above that there is no monthly seasonality in the TED Talk view data. In other words, the month in which a TED Talk was published does not affect the number of views it receives.
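The quantity plotted above is each month's share of its year's total views. As a minimal illustration of that calculation, the sketch below uses a made-up data frame (`toy_views` is hypothetical) and base R `ave()`, which returns the group total aligned to each row and so avoids the separate yearly summary and join.

```r
# Made-up monthly view totals for two years
toy_views <- data.frame(
  published_year  = c(2010, 2010, 2011, 2011),
  published_month = c("Jan", "Feb", "Jan", "Feb"),
  m_views         = c(30, 70, 50, 50)
)

# Yearly total repeated on every row of that year
toy_views$y_views <- ave(toy_views$m_views, toy_views$published_year, FUN = sum)

# Share of the year's views contributed by each month
toy_views$perc_views <- toy_views$m_views / toy_views$y_views
```

Within each year the shares sum to 1, which is a quick sanity check before plotting.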

ii. Does speaker occupation affect the TED talk views?

The speaker occupation was thoroughly cleaned in the data preparation step, ultimately leading to tidy data with one major occupation per talk speaker.

To analyse the most important speaker occupations across view categories, word clouds and frequency charts have been created. To keep the comparison simple and more insightful, it has only been done across the most and the least viewed TED Talks.

A function has been created to generate a word cloud and a word frequency chart for a given category (based on number of views).

#creating a function to create a wordcloud and the frequency chart by the view category used in the function call
generate_cloud_grph <- function(v_cat){
                  df_wc <- as.data.frame(ted_final %>% 
                           subset(view_category == v_cat,select = c(speaker_occupation,view_category)) %>% 
                           count(speaker_occupation, sort = TRUE))

                  wordcloud(words = df_wc$speaker_occupation, freq = df_wc$n, min.freq = 1,
                           max.words = 100, random.order = FALSE, rot.per = 0.35, 
                           colors = brewer.pal(8, "Dark2"))
                  
                  ted_final %>%
                  filter(view_category == v_cat) %>%
                  group_by(speaker_occupation) %>%
                  summarise(n = n()) %>% 
                  arrange(desc(n)) %>% 
                  head(10) %>%
                  ggplot(aes(x = reorder(speaker_occupation,n), y = n, label = n)) + 
                  geom_point(size = 6) + 
                  geom_segment(aes(x = speaker_occupation, 
                                   xend = speaker_occupation, 
                                   y = 0, 
                                   yend = n)) + 
                  geom_text(color = "white", size = 3) + coord_flip() +
                  labs(x = "Speaker Occupation",y = "Frequency") +
                  theme_classic()
}

Analysis of the Best (Most Viewed) TED Talks

generate_cloud_grph("Best")

Analysis of the Worst (Least Viewed) TED Talks

generate_cloud_grph("Worst")

Summary/Explanation : From the plots above we can see that some occupations are common to both the ‘Best’ and ‘Worst’ TED Talks: artists, designers, educators, authors and business. These are also the most common occupations of TED Talk speakers overall.

There is no major difference in speaker occupations between the most and least viewed TED Talks. However, the occupations that stand out among the Best TED Talks are filmmakers, health and psychologists, and the occupations that stand out among the Worst TED Talks are activists, biologists and inventors.

To get a proper comparison of words in the Best and Worst TED Talks, a comparison cloud was created.

#Generating the comparison clouds for Best and Worst Ted Talks
set.seed(1234)
ted_final %>% select(speaker_occupation,view_category) %>% 
  subset(view_category %in% c('Best','Worst')) %>%
  group_by(speaker_occupation,view_category) %>%
  summarise(n = n()) %>% 
  acast(speaker_occupation ~ view_category, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 100)

Summary/Explanation : Speakers with the occupations psychologist, scientist and author dominate the Best TED Talks, whereas speakers with the occupations activist, designer, biologist and politician dominate the Worst TED Talks. As the two groups are not of the same size, the inference is only indicative.

Below is a brief description of how the comparison cloud works:

Let $p_{i,j}$ be the rate at which word $i$ occurs in document $j$, and $p_j$ be the average across documents ($\sum p_{i,j} / \text{ndocs}$). The size of each word is mapped to its maximum deviation, $\max_i (p_{i,j} - p_j)$, and its angular position is determined by the document where that maximum occurs.
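To make the description above concrete, the sketch below computes the deviations by hand for a small made-up count matrix. It is a literal reading of the description in base R, not the source code of `comparison.cloud()`; the matrix `counts` and all variable names are hypothetical.

```r
# Hypothetical term-by-document count matrix (rows: words, columns: documents)
counts <- matrix(c(8, 2,
                   1, 6,
                   1, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("author", "activist", "designer"),
                                 c("Best", "Worst")))

# Rate of each word within each document (each column sums to 1)
rates <- sweep(counts, 2, colSums(counts), "/")

# Average rate of each word across the documents
avg_rate <- rowMeans(rates)

# Deviation of each word's rate from its average, per document
deviation <- rates - avg_rate

# Word size: the largest deviation; word group: the document where it occurs
word_size  <- apply(deviation, 1, max)
word_group <- colnames(deviation)[apply(deviation, 1, which.max)]
```

In this toy case "author" is drawn on the Best side with the largest size, while "activist" and "designer" fall on the Worst side. Because rates are used rather than raw counts, documents of different sizes are put on a comparable footing.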

iii. Are longer TED Talks less viewed?

#creating the interactive box plot of duration by category of TED talk
ted_final %>%
  plot_ly(y = ~duration, color = ~view_category, type = "box")

cor(ted_final$views,ted_final$duration)

## [1] 0.04874043

Summary/Explanation :

From the interactive box plot above we can see that there is almost no relation between the number of views a TED Talk gets and its duration, as the median duration is almost constant across the view categories. The same can be observed from the correlation of 0.049 between duration and views.

However, we can also see that the Worst category has the most outliers. In other words, the longest TED Talks are more likely to be among the least viewed.

iv. Do viewers prefer co-presenters?

#creating a variable to club  more than one speaker as co-speaker talks
datatable(ted_final %>% 
          mutate(No_of_Speakers = ifelse(num_speaker == 1 , '1','>1')) %>%
          group_by(No_of_Speakers) %>%
          summarise(count = n()))

#creating a boxplot of views by number of speakers category
ted_final %>% 
  mutate(No_of_Speakers = ifelse(num_speaker == 1 , '1','>1')) %>%
  ggplot(aes(x = No_of_Speakers, y = views, fill = No_of_Speakers)) + 
  geom_boxplot() +
  scale_y_log10(labels = scales::comma) +
   theme_minimal()

Summary/Explanation :

The majority of the talks are given by single speakers. Co-presented talks account for only approximately 2.2% of the talks.

From the graph above we can see that the median number of views for single-speaker talks is higher than for co-presented talks. Also, the talks with the highest views, shown in the boxplots as outliers, are mostly single-speaker TED Talks. In other words, the best TED Talks tend to be single-speaker talks.
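The share of co-presented talks and the per-group summary read off the boxplot can also be computed directly. The sketch below uses made-up numbers (`num_speaker` and `views` here are hypothetical vectors, not the TED columns) and base R only.

```r
# Made-up speaker counts and views for eight talks
num_speaker <- c(1, 1, 1, 2, 1, 3, 1, 1)
views       <- c(900, 1200, 5000, 700, 1500, 800, 40000, 1100)

# Same grouping as in the analysis: single speaker vs more than one
speakers_group <- ifelse(num_speaker == 1, "1", ">1")

# Proportion of talks with more than one speaker
share_co_presented <- mean(num_speaker > 1)

# Median and mean views per group; the median is what a boxplot's centre line shows
median_views <- tapply(views, speakers_group, median)
mean_views   <- tapply(views, speakers_group, mean)
```

With skewed data such as views, a single very popular talk pulls the mean up sharply, so comparing medians is usually the safer reading of a boxplot.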

v. Do views increase with more languages?

#Creating a boxplot of number of languages by view category
ted_final %>%
  ggplot(aes(x = view_category, y = languages)) + 
  geom_boxplot(width = 0.3, fill = "plum") + coord_flip() +
  labs( x = "View Category", y = "# of Languages", title = "Languages vs View_category") +
   theme_minimal()

cor(ted_final$views,ted_final$languages)

## [1] 0.3776231

Summary/Explanation :

From the plot above we can see that as the number of languages in which a TED Talk is available increases, the number of views also increases. The correlation between the number of views and the number of languages confirms this: at 0.378, the two are moderately correlated.

In other words, talks available in more languages tend to have more views.
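Two refinements could be considered for this correlation: leaving out the records with 0 languages flagged during data cleaning, and using a rank-based (Spearman) correlation, which is less sensitive to the heavy right skew in views. The sketch below illustrates both on made-up vectors (`languages` and `views` here are hypothetical, not the TED columns); it is a suggestion, not part of the original analysis.

```r
# Made-up data: two records with 0 languages, one of them with very high views
languages <- c(0, 0, 12, 20, 25, 30, 45)
views     <- c(900, 50000, 400, 800, 1500, 2500, 9000)

# Keep only records where the number of languages is recorded as positive
keep <- languages > 0

# Pearson correlation with and without the zero-language records
pearson_all  <- cor(views, languages)
pearson_kept <- cor(views[keep], languages[keep])

# Rank-based correlation on the filtered records
spearman_kept <- cor(views[keep], languages[keep], method = "spearman")
```

Either way, a correlation describes association only: popular talks may simply attract more translations, so the direction of the effect cannot be read from this number.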

vi. Do Tags associated to a TED Talk affect views?

#generating a comparison word cloud for Best and Worst TED talks
set.seed(1234)
talk_tags %>%
  inner_join(ted_final, by = "sno") %>%
  select(view_category, tags1) %>%
  filter(!(tags1 %in% c('global','tedx'))) %>%
  subset(view_category %in% c('Best','Worst')) %>%
  group_by(tags1,view_category) %>%
  summarise(n = n())  %>%
  acast(tags1 ~ view_category, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"), max.words = 100)

Summary/Explanation :

From the wordcloud above we can see that the worst TED Talks have tags like issues, technology and politics, while the best TED Talks have tags like psychology, brain, work, culture, humor, love and happiness.

The comparison cloud works in the same way as described in the speaker occupation section above.

5. Summary

5.1 Problem statement:

This analysis aimed to help readers see the popularity trend of TED Talks over the years and understand the drivers of popularity that separate the Best and Worst TED Talks.

5.2 Implementation:

The data was cleaned and manipulated as required by the analysis. It was then analysed graphically to look at the general trend of TED Talks' popularity. Using text mining and data wrangling methods, tags and speaker occupations were tidied and later visualised through word clouds, frequency plots, etc. to understand whether they have an impact on views. Other factors, like the number of languages and the number of speakers, were examined using other visualisation techniques such as box plots and dot charts.

5.3 Insights from the analysis:

Based on the analysis, we uncovered some insights about TED Talks. Below is the summary of the analysis:

Popularity of TED talks over time:

Over the years the number of talks published per year has grown and the percentage of negative ratings has fallen since 2010, while the average number of comments per talk has declined since 2013
2006, the first year when TED Talks were offered for free viewing online, had the highest average views in spite of the least number of talks
Another successful year was 2013, which had the second-highest average number of views and comments

Factors affecting TED talk’s popularity

Month of publishing does not affect the popularity of the talk
Best talks include both single-speaker and co-presented talks; however, the majority of the most viewed TED Talks (outliers) are single-speaker talks
Talks available in more languages tend to have more views
Least viewed TED Talks have tags like issues, technology and politics, while the most viewed TED Talks have tags like psychology, brain, work, culture, humor, love and happiness
Speaker occupation does not have a major effect on a TED Talk's popularity, but speakers with the occupations psychologist, scientist and author dominate the most viewed TED Talks, whereas speakers with the occupations activist, designer, biologist and politician dominate the least viewed TED Talks
Duration of the talk is not correlated with the number of views; however, the longest talks belong to the least viewed category

Thus, now we know the ingredients to create the BEST TED talk.

5.4 Limitations of the analysis or future scope:

Another very important factor that affects the popularity of TED Talks is the content of the talk, i.e. its transcript. This can be added to the analysis in future. As further scope, a regression model or a neural network could be built to predict the popularity of a talk based on its characteristics.

Final Project - Data Wrangling with R - TED Talk Analysis

Ananya Dutta

03 December, 2017
