TED Conferences LLC (Technology, Entertainment, Design) adalah organisasi media America yang menayangkan gagasan atau kajian-kajian kontemporer serta didistribusikan secara bebas. Slogan mereka adalah “ideas worth spreading”.
Tujuan proyek ini adalah untuk mengeksplorasi data TED Talks dan menghasilkan beberapa insight menarik. Bagaimana memahami tren popularitas pembicaraan ted selama bertahun-tahun dalam hal pandangan, komentar dan rating-nya. Kita juga ingin mengeksplorasi siapa saja pembicara yang banyak menarik perhatian penonton dan komentar, termasuk pekerjaan pembicara, durasi pembicaraan ted, jumlah pembicara dll.
Data diperoleh dari Kaggle yang mencakup semua rekaman audio-video TED Talks yang diunggah ke situs web resmi TED.com hingga tanggal 21 September 2017. Setelah membersihkan data dan membawanya ke format yang mudah digunakan menggunakan teknik penambangan teks dan pengolahan data itu akan dianalisis secara grafis untuk menghasilkan kesimpulan.
rm(list = ls())
setwd("D:/R/ted")
library(jsonlite)
library(stringr)
library(tidytext)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.7
## v tidyr 1.1.3 v forcats 0.5.1
## v readr 1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x purrr::flatten() masks jsonlite::flatten()
## x dplyr::lag() masks stats::lag()
library(wordcloud)
## Loading required package: RColorBrewer
library(DT)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
my_data<-read.csv2("ted2.csv",header=TRUE,sep=";")
ted1 <- my_data[c("comments","duration","languages","main_speaker","num_speaker","published_date","ratings","speaker_occupation","tags","views","title")]
colnames(ted1)
## [1] "comments" "duration" "languages"
## [4] "main_speaker" "num_speaker" "published_date"
## [7] "ratings" "speaker_occupation" "tags"
## [10] "views" "title"
dim(ted1)
## [1] 2550 11
Kesimpulan: ada 2550 data dan 11 variabel
sum(is.na(ted1))
## [1] 0
colSums(is.na(ted1))
## comments duration languages main_speaker
## 0 0 0 0
## num_speaker published_date ratings speaker_occupation
## 0 0 0 0
## tags views title
## 0 0 0
ggplot(aes(x = "",y = comments),data = ted1) +
geom_boxplot() +
scale_y_log10(labels = scales::comma)+
labs(title = "Jumlah komentar (comments)") +
theme_minimal()
ggplot(aes(x = "",y = views),data = ted1) +
geom_boxplot() +
scale_y_log10(labels = scales::comma) +
labs(title = "Jumlah penonton (Views)") +
theme_minimal()
par(mfrow = c(1,3))
hist(ted1$num_speaker)
boxplot(ted1$languages, main = "Jumlah bahasa")
boxplot(ted1$duration,main = "Durasi (dalam detik)")
Kesimpulan: Banyak outliers namun tidak ada yang memiliki nilai negatif.
ted1$published_date <- as.Date(as.character(ymd_hms(as.POSIXct(as.numeric(ted1$published_date),origin = '1970-01-01', tz = "GMT"))),format = "%Y-%m-%d")
ted1$published_month <- factor(month.abb[month(ted1$published_date)])
ted1$published_year <- year(ted1$published_date)
len <- ted1 %>% summarise(sno = n())
ted1$sno <- seq(1,as.numeric(len))
# membaca data-data json untuk mendapatkan nilainya dalam baris
df1 <- c()
for (i in 1:2550)
{
df <- fromJSON(str_replace_all(ted1$rating[i],"'",'"'))
df$sno <- i
df1 <- rbind(df,df1)
}
#Membuat tabel dengan ratings
ted_ratings <- df1
#Cek tipe rating distincts yang ada
df %>% distinct(name)
## name
## 1 Unconvincing
## 2 Informative
## 3 Inspiring
## 4 OK
## 5 Fascinating
## 6 Ingenious
## 7 Confusing
## 8 Obnoxious
## 9 Beautiful
## 10 Longwinded
## 11 Persuasive
## 12 Jaw-dropping
## 13 Courageous
## 14 Funny
#Klasifikasi tipe *distinct rating* dalam bentuk positif, negatif dan netral
negative_words <- c('Unconvincing','Confusing','Obnoxious','Longwinded')
positive_words <- c('Informative','Inspiring','Fascinating','Ingenious','Beautiful','Persuasive','Jaw-dropping','Courageous','Funny')
df1$ratings_type <- ifelse(df1$name %in% unlist(negative_words),'negative_ratings',ifelse(df1$name %in% unlist(positive_words),'positive_ratings',ifelse(df1$name == 'OK','neutral_ratings',' ')))
ted2 <- df1 %>% group_by(sno,ratings_type) %>%
summarise(count_rating_type = sum(count)) %>% spread(ratings_type,count_rating_type) %>% ungroup() %>%
inner_join(ted1,by = "sno")
## `summarise()` has grouped output by 'sno'. You can override using the `.groups` argument.
ted1$speaker_occupation[1:5]
## [1] Author/educator Climate advocate
## [3] Technology columnist Activist for environmental justice
## [5] Global health expert; data visionary
## 1459 Levels: ... Zoologist
#mengganti semua tanda ;,/ menjadi kosong (*blanks*)
ted2$speaker_occupation <- ted2$speaker_occupation %>% str_replace_all('/',' ') %>% str_replace_all(',',' ') %>% str_replace_all(';',' ') %>% str_replace_all('\\+',' ') %>% tolower()
#Memisahkan setiap pekerjaan
df2 <- unnest_tokens(ted2,occupation1,speaker_occupation) %>% select(sno,occupation1)
#kata-kata berhenti (*stop word*) dibuang
stop_words <- c('and','of','in','expert','social','the','for')
#menghilangkan *stop words* dan mengganti kata-kata yang sama
df2 <- df2 %>% subset(!occupation1 %in% stop_words) %>% mutate(occupation1 = str_replace_all(occupation1,
c("writer" = "author","scientists" = "scientist","researcher" = "scientist","neuroscientist" = "scientist", "professor" = "educator", "scholar" = "educator", "education" = "educator", "teacher" = "educator", "songauthor" = "author","editor" = "author","data" = "data related","analyst" = "data related","statistician" = "data related", "musician" = "artist","singer" = "artist","sing" = "artist","poet" = "artist","actor" = "artist", "comedian" = "artist","playwright" = "artist","media" = "artist","performance" = "artist","guitarist" = "artist", "dancer" = " artist","humorist" = "artist","pianist" = "artist", "violinist" = "artist","magician" = "artist","artists" = "artist","band" = "artist", "director" = "filmmaker", "producer" = "filmmaker", "entrepreneur" = "business","ceo" = "business", "founder" = "business", "psychology" = "psychologist", "physician" = "health", "medical" = "health", "doctor" = "health", "design" = "designer", "designerer" = "designer", "reporter" = "journalist")))
#membuat listing 20 kata teratas
occupation_by_rank <- df2 %>% group_by(occupation1) %>% summarise(n = n_distinct(sno)) %>% arrange(desc(n))
top_20_occ <- occupation_by_rank[1:20,1]
datatable(head(occupation_by_rank,20))
ted3 <- df2 %>% mutate(rank = ifelse(occupation1 %in% unlist(top_20_occ),1,0)) %>% arrange(sno,desc(rank)) %>%
subset(!duplicated(sno)) %>% right_join(ted2,by = "sno") %>%
mutate(speaker_occupation = ifelse(is.na(occupation1),"others",occupation1)) %>%
select(-(occupation1))
ted_final <- ted3 %>%
select(c("sno","main_speaker","title","num_speaker","comments","positive_ratings","negative_ratings","neutral_ratings","duration","languages","speaker_occupation","views","published_month","published_year","published_date")) %>%
mutate(ratings = positive_ratings + negative_ratings + neutral_ratings)
ted_final : Berisi semua data yang telah dibersihkan dan diformat dari data ted talk
talk_tags : Berisi seluruh data tags yang telah dipisahkan dan dipasangkan dengan *serial number*
ted_final
datatable(head(ted_final,100))
talk_tags
datatable(head(talk_tags,100))
ted_final %>%
group_by(published_year) %>%
summarise(n = n()) %>%
ggplot(aes(x = factor(published_year),y = n,group = 1)) +
geom_line(color = "Blue") + geom_point(lwd = 2, color = "blue") +
labs(title = "Jumlah Talks per tahun", x = "Tahun", y = "# Talks") +
geom_hline(aes(yintercept = mean(n)), linetype = "dashed", alpha = .5) +
annotate("text", x = '2007', y = 210, label = "Rata-rata: 212.5", size = 3) +
theme_minimal()
Kesimpulan: Ted talk mulai diterbitkan pada tahun 2006 pasca yang jumlah talk meningkat setiap tahun dengan rata-rata jumlah talk 212,5.
Pada tahun 2012, adalah jumlah talk terbanyak dengan lebih dari 300 talk. Dari tahun 2012 ke tahun 2017 hanya ada sedikit perubahan dalam jumlah talk namun, penurunan kecil ~15% diamati pada tahun 2015.
Pada tahun 2017 menunjukkan penurunan jumlah penayangan karena hanya mencakup jumlah total talk hingga bulan September.
Jadi, kita dapat mengatakan rata-rata talk adalah 210 per tahun
ted_final %>%
group_by(published_year) %>%
summarise(avg_views = mean(views/100000)) %>%
ggplot(aes(x = factor(published_year),y = avg_views,group = 1)) +
geom_line(color = "red") +
geom_point(lwd = 2, color = "red") +
labs(title = "Views berdasarkan Tahun", x = "Published Year" , y = "Rata-rata # of views(ratusan ribu)") +
geom_hline(aes(yintercept = mean(avg_views)), linetype = "dashed", alpha = .5) +
annotate("text", x = '2007', y = 19, label = "Rata-rata: 1,838,604", size = 3) +
annotate("text", x = '2007', y = 42, label = "Max: 4,130,967", size = 3) +
theme_minimal()
Kesimpulan: Rata-rata jumlah Views tertinggi ~ 4,1 juta pada tahun 2006. Pada tahun 2013 memiliki jumlah pembicaraan tertinggi, namun jumlah Views pada tahun selanjutnya menurun.
#getting a dot
ted_final %>%
mutate(published_year1 = as.factor(published_year)) %>%
group_by(published_year1) %>%
summarise(avg_comments = mean(comments)) %>%
ggplot(aes(x = published_year1, y = avg_comments)) +
geom_point(col = "tomato2", size = 3) +
geom_segment(aes(x = published_year1,xend = published_year1,y = min(avg_comments),yend = max(avg_comments)),linetype = "dashed",size = 0.05) +
coord_flip() +
labs(title = "Number of Comments by Published Year", x = "Published year", y = "Average # of Comments") +
theme_minimal()
Kesimpulan:Seperti yang diharapkan tahun 2006 yang memiliki rata-rata Views maksimum juga mengamati jumlah rata-rata maksimum comments dengan 363 comments per talk diikuti oleh tahun 2013 dengan 289 comments per talk. Pasca tahun 2013 terlihat penurunan yang stabil dalam jumlah comments, seperti pada tahun 2016 hanya memiliki 81 comments pertalk.
#Getting a dot chart for number of ratings by published year
ted_final %>%
mutate(published_year1 = as.factor(published_year)) %>%
group_by(published_year1) %>%
summarise(avg_ratings = mean(ratings)) %>%
ggplot(aes(x = published_year1, y = avg_ratings)) +
geom_point(col = "tomato2", size = 3) +
geom_segment(aes(x = published_year1,xend = published_year1,y = min(avg_ratings),yend = max(avg_ratings)),linetype = "dashed",size = 0.05) +
coord_flip() +
labs(title = "Number of Comments by Published Year", x = "Published year", y = "Average # of Ratings")+
theme_minimal()
Kesimpulan:Rata-rata jumlah rating juga menunjukkan nilaitertingginya pada tahun 2006, pada tahun 2013 yang memiliki views dan comments tertinggi namun tidak mendapatkan banyak rating.Di sisi lain pada tahun 2010 menunjukkan puncak dalam jumlah peringkat.
#creating a stacked bar chart showing percentag eof positive, neutral and negative ratings by the published year
ted_final %>%
group_by(published_year) %>%
summarise(Perc_Positive_Ratings= sum(positive_ratings)/sum(ratings), Perc_Negative_Ratings = sum(negative_ratings)/sum(ratings), Perc_Neutral_Ratings = sum(neutral_ratings)/sum(ratings)) %>%
gather(Type, Perc_rating ,-published_year) %>%
ggplot(aes(x = published_year, y = Perc_rating, fill = Type)) + geom_bar(stat = "identity") +
labs(title = "Percentage of Positive, Negative and Neutral Ratings by Published Year", x = "Published year", y = "% of Ratings") +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
Kesimpulan:Dari grafik di atas kami mengamati bahwa persentase peringkat negatif, positif dan netral tetap kurang lebih sama selama bertahun-tahun.
Tahun 2009 adalah satu-satunya tahun yang memiliki peringkat negatif sedikit lebih tinggi. Kabar baiknya adalah persentase peringkat negatif telah berkurang sejak 2010 dan paling sedikit pada 2017.
#getting the top 10 talks of all times by number of views
datatable(ted_final %>%
arrange(desc(views)) %>%
select( title, main_speaker, views, published_date,comments,ratings) %>%
head(10))
Kami ingin tahu apa yang membuat TED talks terbaik.Apakah faktor-faktor yang mendorong TED talks paling banyak dan paling sedikit dilihat?
Untuk menjawabnya, kita harus membaginya menjadi 5 kategori Views (berdasarkan kuantile)
#Defining the view category
ted_final$view_category <-
ifelse(between(ted_final$views,quantile(ted_final$views,0),quantile(ted_final$views,0.20)),'Worst',
ifelse(between(ted_final$views,quantile(ted_final$views,0.20),quantile(ted_final$views,0.40)),'Bad',
ifelse(between(ted_final$views,quantile(ted_final$views,0.40),quantile(ted_final$views,0.60)),'Ok',
ifelse(between(ted_final$views,quantile(ted_final$views,0.60),quantile(ted_final$views,0.80)),'Good',
ifelse(ted_final$views > quantile(ted_final$views,0.80),'Best','NA')))))
#adding levels to the column
vcat_order <- c('Best','Good','Ok','Bad','Worst')
ted_final$view_category <- factor(ted_final$view_category, levels = vcat_order)
view_cat <- ted_final %>%
group_by(view_category) %>%
summarise(Min_Views = min(views),Max_Views = max(views)) %>%
arrange(desc(Min_Views))
datatable(view_cat)
# Adding levels to published month
month_order <- c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
ted_final$published_month <- factor(ted_final$published_month, levels = month_order)
#Plotting viewer berdasarkan bulan selama 7 tahun
ted_final %>%
filter(published_year >= 2010) %>%
group_by(published_year,published_month) %>%
summarise(m_views = sum(views)) %>%
inner_join(ted_final %>%
filter(published_year >= 2010) %>%
group_by(published_year) %>%
summarise(y_views = sum(views)),by = "published_year") %>%
mutate(perc_views = m_views/y_views) %>%
ggplot(aes(x = published_month,y = perc_views,group = 1, color = published_year)) +
geom_point() + geom_line() + facet_wrap(~published_year,ncol = 1) +
scale_y_continuous(labels = scales::percent) +
labs(x = "Published Month", y = "Percent COntribution in Yearly views", title = "Monthly percentage views over years 2010-2017 - Seasonality") +
theme_minimal()
## `summarise()` has grouped output by 'published_year'. You can override using the `.groups` argument.
Kesimpulan:Kita dapat melihat dari grafik di atas bahwa Publishing months TED Talk tidak mempengaruhi jumlah viewer.
#creating a function to create a wordcloud and the frequency chart by the view category used in the function call
generate_cloud_grph <- function(v_cat){
df_wc <- as.data.frame(ted_final %>%
subset(view_category == v_cat,select = c(speaker_occupation,view_category)) %>%
count(speaker_occupation, sort = TRUE))
wordcloud(words = df_wc$speaker_occupation, freq = df_wc$n, min.freq = 1,
max.words = 100, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
ted_final %>%
filter(view_category == v_cat) %>%
group_by(speaker_occupation) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
head(10) %>%
ggplot(aes(x = reorder(speaker_occupation,n), y = n, label = n)) +
geom_point(size = 6) +
geom_segment(aes(x = speaker_occupation,
xend = speaker_occupation,
y = 0,
yend = n)) +
geom_text(color = "white", size = 3) + coord_flip() +
labs(x = "Frequency",y = "Speaker Occupation") +
theme_classic()
}
Analisis TED Talks Terbaik (Best) bagi viewer
generate_cloud_grph("Best")
Analisis TED Talks Terburuk (Worst) bagi viewer
generate_cloud_grph("Worst")
Kesimpulan: Dari plot di atas kita dapat melihat bahwa beberapa pekerjaan umum yang termasuk dalam TED Talks terbanyak adalah author, artist, scientist, psychologist, dan business.
Untuk mendapatkan perbandingan Best dan Worst dari para pembicara kita bisa menggunakan Comparison cloud
set.seed(1234)
ted_final %>% select(speaker_occupation,view_category) %>%
subset(view_category %in% c('Best','Worst')) %>%
group_by(speaker_occupation,view_category) %>%
summarise(n = n()) %>%
acast(speaker_occupation ~ view_category, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("#F8766D", "#00BFC4"),
max.words = 100)
## `summarise()` has grouped output by 'speaker_occupation'. You can override using the `.groups` argument.
Kesimpulan:Pembicara yang bidang pekerjaannya psychologist, scientists dan authors mendominasi the Best TED Talks sementara itu pembicara dengan pekerjaan activist, designer, biologist, politician mendominasi the Worst TED Talks.
ted_final %>%
plot_ly(y = ~duration, color = ~view_category, type = "box")
## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
cor(ted_final$views,ted_final$duration)
## [1] 0.04874043
Kesimpulan: Dari Box Plot interaktif kita melihat tidak terdapat hubungan antara jumlah viewer dengan durasi Ted Talks. Nilai korelasinta 0.048
Hal lain, kategori Worst menampilkan lebih banyak outliers, artinya semakin lama durasinya semakin sedikit viewernya.
datatable(ted_final %>%
mutate(No_of_Speakers = ifelse(num_speaker == 1 , '1','>1')) %>%
group_by(No_of_Speakers) %>%
summarise(count = n()))
ted_final %>%
mutate(No_of_Speakers = ifelse(num_speaker == 1 , '1','>1')) %>%
ggplot(aes(x = No_of_Speakers, y = views, fill = No_of_Speakers)) +
geom_boxplot() +
scale_y_log10(labels = scales::comma) +
theme_minimal()
Kesimpulan: Mayoritas lebih menyukai satu pembicara saja,co-presenter hanya memberikan kontribusi 2,2% dari pembicaraan.
ted_final %>%
ggplot(aes(x = view_category, y = languages)) +
geom_boxplot(width = 0.3, fill = "plum") + coord_flip() +
labs( x = "View Category", y = "# of Languages", title = "Languages vs View_category") +
theme_minimal()
cor(ted_final$views,ted_final$languages)
## [1] 0.3776231
Kesimpulan: Dari plot di atas kita dapat melihat bahwa dengan bertambahnya jumlah bahasa yang digunakan, jumlah viewer juga meningkat.