Assignment 2

Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original

Objective 1: Bar chart of Top 5 talks with most description length

Objective 2: Bar chart of Top 20 Events by number of talks

Objective

Through the treemap, word cloud, and string analysis, the author wishes to show the target audience the trends and topics of TED talks in recent years. New TED viewers and researchers are likely to be interested in these findings, so they are the target audience.

Main issues:

Objective 1 has two issues. First, the colours in the legend are too similar to one another and to the bars, which makes it hard for the target audience to match each bar to its talk and judge its exact length.
Second, the scale of the x-axis misleadingly suggests that the description lengths of the top 5 talks are almost the same.
For objective 2, some of the bar colours are too similar to tell apart.

Reference

An open dataset from Kaggle (CC BY-NC-SA 4.0) containing information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 21st, 2017. Website: https://www.kaggle.com/gsdeepakkumar/lets-talk-about-ted-talks/log

Code

The following code was used to fix the issues identified in the original.

library(ggplot2) # Data visualisation
library(dplyr) # data manipulation
library(stringr) # String manipulation
library(colourpicker)

ted=read.csv("/Users/qmoa_liu/downloads/assignment2template1950/ted_main.csv",header=TRUE,stringsAsFactors = FALSE)
transcript=read.csv("/Users/qmoa_liu/downloads/assignment2template1950/transcripts.csv",header=TRUE,stringsAsFactors = FALSE)


# objective1
test = ted %>% select(description,name,duration,views) %>% mutate(deslength=str_length(description))
summary(test$deslength)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    52.0   236.0   296.0   313.7   379.0   769.0

# Plot the five longest descriptions; coord_cartesian() zooms the y-axis to 650-775
test %>%
  arrange(desc(deslength)) %>%
  head(5) %>%
  ggplot(aes(reorder(name, deslength), deslength, fill = name)) +
  geom_bar(stat = "identity") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  labs(x = "Name", y = "Description length",
       title = "Top 5 talks with most description length") +
  coord_cartesian(ylim = c(650, 775))
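
The objective 1 plot narrows the visible range with coord_cartesian(ylim = ...). The sketch below, which uses made-up description lengths rather than the real top 5, illustrates why that function is a reasonable choice for bar charts: coord_cartesian() only zooms the view, whereas setting limits on the y scale treats the bar bases at 0 as out of range, and ggplot2 then typically drops the bars with a warning. The data frame `toy` and the names `base_plot`, `zoomed` and `clipped` are hypothetical.

```r
library(ggplot2)

# Hypothetical description lengths, for illustration only (not the real top 5).
toy <- data.frame(
  name      = c("Talk A", "Talk B", "Talk C", "Talk D", "Talk E"),
  deslength = c(769, 740, 720, 700, 690)
)

base_plot <- ggplot(toy, aes(reorder(name, deslength), deslength, fill = name)) +
  geom_col()  # equivalent to geom_bar(stat = "identity")

# Zooms the view only: all five bars are kept, with their bases below the visible range.
zoomed <- base_plot + coord_cartesian(ylim = c(650, 775))

# Sets scale limits instead: the bar bases at 0 fall outside the limits,
# so the bars are expected to be removed when the plot is drawn.
clipped <- base_plot + scale_y_continuous(limits = c(650, 775))
```

Printing `zoomed` and `clipped` side by side shows the difference. Note that a zoomed axis exaggerates differences between bars, so the axis labels need to make the non-zero baseline clear.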

# objective2
CPCOLS <- c("#1f78b4", "#33a02c", "#FFD700", "#EE82EE", "#6B6B6B", "#8B5A00", "#8B2252", "#7FFF00", "#fb9a99", "#fdbf6f", "#cab2d6", "#ffff99", "#A020F0", "#00F5FF", "#e31a1c", "#ff7f00", "#6a3d9a", "#b15928", "#a6cee3", "#b2df8a")

test = ted %>% group_by(event) %>% tally() %>% arrange(desc(n))
ggplot(head(test, 20), aes(factor(event, levels = event), n, fill = event)) +
  geom_bar(stat = "identity") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  scale_fill_manual(values = CPCOLS) +
  labs(x = "Event", y = "Number of Talks",
       title = "Top 20 Events by number of talks")
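
Since the issue in objective 2 was that some bar colours were too similar, one possible sanity check for a manual palette is to measure how far apart its colours are. The sketch below is not part of the original analysis: `min_colour_distance` is a hypothetical helper that converts the colours to CIE Lab space with base R and returns the smallest pairwise Euclidean distance, so a larger value suggests the two closest colours are easier to tell apart.

```r
# Palette copied from the objective 2 code.
CPCOLS <- c("#1f78b4", "#33a02c", "#FFD700", "#EE82EE", "#6B6B6B",
            "#8B5A00", "#8B2252", "#7FFF00", "#fb9a99", "#fdbf6f",
            "#cab2d6", "#ffff99", "#A020F0", "#00F5FF", "#e31a1c",
            "#ff7f00", "#6a3d9a", "#b15928", "#a6cee3", "#b2df8a")

# Hypothetical helper: smallest pairwise Euclidean distance between colours in CIE Lab space.
min_colour_distance <- function(cols) {
  rgb_01 <- t(grDevices::col2rgb(cols)) / 255                         # one row per colour, scaled to [0, 1]
  lab <- grDevices::convertColor(rgb_01, from = "sRGB", to = "Lab")   # sRGB to CIE Lab
  min(stats::dist(lab))                                               # distance between the closest pair
}

min_colour_distance(CPCOLS)
```

A numeric distance is only a rough guide. It does not account for colour-vision deficiencies or for how the bars are ordered in the chart, so a visual check of the final plot is still worthwhile.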

Data Reference

An open dataset from Kaggle (CC BY-NC-SA 4.0) containing information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 21st, 2017. Website: https://www.kaggle.com/gsdeepakkumar/lets-talk-about-ted-talks/log

Reconstruction

The following plot fixes the main issues in the original.