Libraries and Setup
Abstract
The data we’re working is USvideos.csv, a dataset scraped from Youtube’s US Trending Videos. By analizing this Trending Videos data, we will uncover how these videos were able to reach such massive audience. We’ll also take a look and see if there were any similarities between the videos that did well on the platform.
## [1] "trending_date" "title" "channel_title"
## [4] "category_id" "publish_time" "views"
## [7] "likes" "dislikes" "comment_count"
## [10] "comments_disabled" "ratings_disabled" "video_error_or_removed"
I will use visualization techniques to gain such insights from the trending videos data: 1. Does the video publishing time affect its popularity? If so, when is the best time to publish a video? 2. Does user engagement relate to video’s popularity? If so, does it tend to has a positive or negative engagement? 3. Who were the most prolific producers of trending videos in recent weeks?
We will use two plotting systems for our task:
- Base plot for a quick and simple exploratory visualization - ggplot2 for uncovering more complex pattern & for producing explanatory visualization
## 'data.frame': 13400 obs. of 12 variables:
## $ trending_date : Factor w/ 67 levels "17.01.12","17.02.12",..: 14 14 14 14 14 14 14 14 14 14 ...
## $ title : Factor w/ 2986 levels "'I have dad moves': Barack Obama discusses dancing on David Letterman's new Netflix show",..: 2802 2574 2081 1903 1231 89 2164 143 2482 2920 ...
## $ channel_title : Factor w/ 1408 levels "_¢_Á_\235","“÷\201\220µ_‘⬓_\220 Korean Englishman",..: 195 686 1046 472 902 559 1063 283 6 1358 ...
## $ category_id : int 22 24 23 24 24 28 24 28 1 25 ...
## $ publish_time : Factor w/ 2903 levels "2008-04-05T18:22:40.000Z",..: 302 271 255 275 253 307 240 258 281 279 ...
## $ views : int 748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
## $ likes : int 57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
## $ dislikes : int 2966 6146 5339 666 1989 511 2445 778 119 1363 ...
## $ comment_count : int 15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
## $ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
Data Pre-Processing
Working with date/time
Date-Time Conversion
In order to answer our first question (Does the video publishing time affect its popularity?), we need to explore our datetime data more. Looking back to the structure of our dataset, there are two variables in our dataset which related to datetime data; trending_date and publish_date.
YEAR
%Y = YYYY
%y = YY
MONTH
%B = month name e.g. March
%b = month name(abbreviation) e.g. Mar
%m = 2 digits mo e.g. 03
%M = 1 digit mo e.g. 3
DAY
%A = weekday e.g. Friday
%d = weekday digit.
But now, we’ll use an easier alternative to work with date-time data, and that is through the use of lubridate. lubridate provides tools that make it easier to parse and manipulate dates:
## [1] 18.21.01 18.21.01 18.21.01 18.21.01 18.21.01 18.21.01
## 67 Levels: 17.01.12 17.02.12 17.03.12 17.04.12 17.05.12 17.06.12 ... 18.21.01
vids$trending_date <- ydm(vids$trending_date)
vids$publish_time <- ymd_hms(vids$publish_time,tz = "America/New_York")## Date in ISO8601 format; converting timezone from UTC to "America/New_York".
## trending_date title
## 1 2017-11-14 WE WANT TO TALK ABOUT OUR MARRIAGE
## 2 2017-11-14 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
## 3 2017-11-14 Racist Superman | Rudy Mancuso, King Bach & Lele Pons
## 4 2017-11-14 Nickelback Lyrics: Real or Fake?
## 5 2017-11-14 I Dare You: GOING BALD!?
## 6 2017-11-14 2 Weeks with iPhone X
## channel_title category_id publish_time views likes dislikes
## 1 CaseyNeistat 22 2017-11-13 12:13:01 748374 57527 2966
## 2 LastWeekTonight 24 2017-11-13 02:30:00 2418783 97185 6146
## 3 Rudy Mancuso 23 2017-11-12 14:05:24 3191434 146033 5339
## 4 Good Mythical Morning 24 2017-11-13 06:00:04 343168 10172 666
## 5 nigahiga 24 2017-11-12 13:01:41 2095731 132235 1989
## 6 iJustine 28 2017-11-13 14:07:23 119180 9763 511
## comment_count comments_disabled ratings_disabled video_error_or_removed
## 1 15954 FALSE FALSE FALSE
## 2 12703 FALSE FALSE FALSE
## 3 8181 FALSE FALSE FALSE
## 4 2146 FALSE FALSE FALSE
## 5 17518 FALSE FALSE FALSE
## 6 1434 FALSE FALSE FALSE
Date-Time Extraction
Categorizing hour
## [1] 12 2 14 6 13 0 16 9 8 21 22 15 11 10 17 7 20 18 19 1 4 3 23 5
# create function for categorizing
pw <- function(x){
if(x < 8){
x <- "12AM to 7AM"
}else if(x >= 8 & x < 16){
x <- "8AM to 3PM"
}else{
x <- "4PM to 11PM"
}
}# apply function
vids$publish_when <-as.factor(sapply(vids$publish_hour, pw))
unique(vids$publish_when)## [1] 8AM to 3PM 12AM to 7AM 4PM to 11PM
## Levels: 12AM to 7AM 4PM to 11PM 8AM to 3PM
# reorder category level
vids$publish_when <- ordered(vids$publish_when,
levels = c("12AM to 7AM",
"8AM to 3PM",
"4PM to 11PM"))
head(vids$publish_when)## [1] 8AM to 3PM 12AM to 7AM 8AM to 3PM 12AM to 7AM 8AM to 3PM 8AM to 3PM
## Levels: 12AM to 7AM < 8AM to 3PM < 4PM to 11PM
Working with category
vids$category_id <- sapply(as.character(vids$category_id), switch,
"1" = "Film and Animation",
"2" = "Autos and Vehicles",
"10" = "Music",
"15" = "Pets and Animals",
"17" = "Sports",
"19" = "Travel and Events",
"20" = "Gaming",
"22" = "People and Blogs",
"23" = "Comedy",
"24" = "Entertainment",
"25" = "News and Politics",
"26" = "Howto and Style",
"27" = "Education",
"28" = "Science and Technology",
"29" = "Nonprofit and Activism",
"43" = "Shows")
vids$category_id <- as.factor(vids$category_id)Select unique videos
## [1] 13400
## [1] 2986
Quick Exploration with R’s Base plot
## trending_date title
## 1 2017-11-14 WE WANT TO TALK ABOUT OUR MARRIAGE
## 2 2017-11-14 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
## 3 2017-11-14 Racist Superman | Rudy Mancuso, King Bach & Lele Pons
## 4 2017-11-14 Nickelback Lyrics: Real or Fake?
## 5 2017-11-14 I Dare You: GOING BALD!?
## 6 2017-11-14 2 Weeks with iPhone X
## channel_title category_id publish_time views
## 1 CaseyNeistat People and Blogs 2017-11-13 12:13:01 748374
## 2 LastWeekTonight Entertainment 2017-11-13 02:30:00 2418783
## 3 Rudy Mancuso Comedy 2017-11-12 14:05:24 3191434
## 4 Good Mythical Morning Entertainment 2017-11-13 06:00:04 343168
## 5 nigahiga Entertainment 2017-11-12 13:01:41 2095731
## 6 iJustine Science and Technology 2017-11-13 14:07:23 119180
## likes dislikes comment_count comments_disabled ratings_disabled
## 1 57527 2966 15954 FALSE FALSE
## 2 97185 6146 12703 FALSE FALSE
## 3 146033 5339 8181 FALSE FALSE
## 4 10172 666 2146 FALSE FALSE
## 5 132235 1989 17518 FALSE FALSE
## 6 9763 511 1434 FALSE FALSE
## video_error_or_removed trending_dow trending_month publish_wday publish_hour
## 1 FALSE Tuesday Nov Monday 12
## 2 FALSE Tuesday Nov Monday 2
## 3 FALSE Tuesday Nov Sunday 14
## 4 FALSE Tuesday Nov Monday 6
## 5 FALSE Tuesday Nov Sunday 13
## 6 FALSE Tuesday Nov Monday 14
## publish_when
## 1 8AM to 3PM
## 2 12AM to 7AM
## 3 8AM to 3PM
## 4 12AM to 7AM
## 5 8AM to 3PM
## 6 8AM to 3PM
How Popular is Your Video?
A like on a Youtube video is a lot more than just a simple number: in addition to emphasizing quality, creativity and the hidden idea behind the video, it also means popularity. In this case, the perfect ratio between likes and views is: likes: views = 4%.
In simpler terms, at least 4 like every 100 views. Visitors, but especially your channel’s subscribers, by investing their time leaving a like will indirectly say that the video is qualitatively valid, from the point of view of entertainment and originality, to the point that it deserves a nice thumb up.
# create engagement ratios
vids.u$likesratio <- vids.u$likes/vids.u$views
vids.u$dislikesratio <- vids.u$dislikes/vids.u$views
vids.u$commentsratio <- vids.u$comment_count/vids.u$viewsIn general, how was the distribution of the trending videos likesratio?
- ‘likesratio’ value is distributed from 0 (minimum value) to 30 (maximum value)
- There are about +-1000 videos have ‘likesratio’ value of 0-2%
- about + -700 videos that have
likesratioof 2-4%
- about + -500 videos have
likesratioof 4-6%
- etc
likesratio distribution within each category
- “Music” has the highest
likesratiovalue for variance - “Sports” and “Entertainment” have the most outliers
- Overall, the `likes’ of the videos in the" Animals and Animals "category are spread out normally
How High is the Engagement?
Say, one of the content we want to publish have people who interested in “Autos and Vehicles”, “Gaming”, and “Travel and Events” as our target audience. So let’s subset our data based on these particular categories:
campaign <- c("Autos and Vehicles", "Gaming", "Travel and Events")
vids.camp <- vids.u[vids.u$category_id %in% campaign,]
vids.camp$category_id <- droplevels(vids.camp$category_id)plot(vids.camp$comment_count, vids.camp$views,
col=vids.camp$category_id, pch=19)
abline(lm(vids.camp$views ~vids.camp$comment_count))
legend("topright", legend=levels(vids.camp$category_id), fill = 1:3)- The higher the number of views in a video, the higher the number of comments on the video
- The “Gaming” category has the highest
viewsandcomment_count - From the Gaming category, there are some videos with high
views' but fewcomment_count`, and vice versa
Exploratory Viz
Pre-Campaign Analysis
- Amongst Automotive, Gaming & Travel, which category shows highest likeability (likesperviews)?
- Does the publish time affect likeability?
## trending_date title
## 31 2017-11-14 I TOOK THE $3,000,000 LAMBO TO CARMAX! They offered me......
## 35 2017-11-14 New Emirates First Class Suite | Boeing 777 | Emirates
## 59 2017-11-14 Train Swipes Parked Vehicle
## 132 2017-11-14 L.A. Noire - Nintendo Switch Trailer
## 164 2017-11-14 Caterham Chris Hoy 60 Second Donut Challenge
## 198 2017-11-14 Inside Keanu Reeves' Custom Motorcycle Shop | WIRED
## channel_title category_id publish_time views likes dislikes
## 31 hp_overload Autos and Vehicles 2017-11-12 20:43:12 98378 4035 495
## 35 Emirates Travel and Events 2017-11-12 00:55:42 141148 1661 70
## 59 ViralHog Autos and Vehicles 2017-11-12 19:46:11 7265 89 8
## 132 Nintendo Gaming 2017-11-09 14:59:48 154872 7683 164
## 164 Caterham Cars Autos and Vehicles 2017-11-09 04:59:31 4850 22 1
## 198 WIRED Autos and Vehicles 2017-11-08 10:00:27 704363 16352 224
## comment_count comments_disabled ratings_disabled video_error_or_removed
## 31 486 FALSE FALSE FALSE
## 35 236 FALSE FALSE FALSE
## 59 22 FALSE FALSE FALSE
## 132 1734 FALSE FALSE FALSE
## 164 1 FALSE FALSE FALSE
## 198 841 FALSE FALSE FALSE
## trending_dow trending_month publish_wday publish_hour publish_when
## 31 Tuesday Nov Sunday 20 4PM to 11PM
## 35 Tuesday Nov Sunday 0 12AM to 7AM
## 59 Tuesday Nov Sunday 19 4PM to 11PM
## 132 Tuesday Nov Thursday 14 8AM to 3PM
## 164 Tuesday Nov Thursday 4 12AM to 7AM
## 198 Tuesday Nov Wednesday 10 8AM to 3PM
## likesratio dislikesratio commentsratio
## 31 0.041015268 0.0050316128 0.0049401289
## 35 0.011767790 0.0004959333 0.0016720039
## 59 0.012250516 0.0011011700 0.0030282175
## 132 0.049608709 0.0010589390 0.0111963428
## 164 0.004536082 0.0002061856 0.0002061856
## 198 0.023215302 0.0003180178 0.0011939866
ggplot(vids.camp, aes(publish_when, likesratio))+
geom_boxplot(outlier.shape = NA)+
geom_jitter(aes(color = category_id, size = comment_count), alpha = 0.6)+
labs(title = "Pre-Campaign Analysis: Automotive, Gaming & Travel",
subtitle = "Gaming shows higher user engagement",
x = "Publish Time",
y = "Likes Per Views Ratio",
size = "Comment count",
color = "Category")Visualizing comparison with barchart
Who were the most prolific producers of trending videos in recent weeks?
Since we’re concerned about the quantity of videos (talking about being prolific!) we will create another subset of the full dataframe, but take only the channels that have at least 10 videos being trending.
# take only the channels that have at least 10 videos being trending
top <- as.data.frame(table(Channel = vids.u$channel_title))
top <- top[top$Freq >=10 ,]
top <- top[order(top$Freq, decreasing = T), ]
# get top 15 videos
top <- head(top, 15)
head(top)## Channel Freq
## 1023 Refinery29 31
## 1230 The Tonight Show Starring Jimmy Fallon 30
## 1358 Vox 29
## 1245 TheEllenShow 28
## 885 Netflix 27
## 893 NFL 25
Make your barchart more efficient by ordering the bars based on the value you want to show:
# reorder the bars based on value
top$Channel <- reorder(top$Channel, top$Freq)
# adjust color with `scale_fill_*`/`scale_color_*`
ggplot(top, aes(Channel, Freq))+
geom_col(aes(fill = Freq))+
coord_flip()+
scale_fill_gradient(low = "pink", high = "purple")Multivariate Barchart
How positive is the engagement?
Data transformation
# subset `vids.u` based on channel in `top`
vids.top <- vids.u[vids.u$channel_title %in% top$Channel, ]
head(vids.top)## trending_date
## 10 2017-11-14
## 11 2017-11-14
## 24 2017-11-14
## 53 2017-11-14
## 79 2017-11-14
## 103 2017-11-14
## title
## 10 Why the rise of the robots won‰Ûªt mean the end of work
## 11 Dion Lewis' 103-Yd Kick Return TD vs. Denver! | Can't-Miss Play | NFL Wk 10 Highlights
## 24 What $4,800 Will Get You In NYC | Sweet Digs Home Tour | Refinery29
## 53 Wildest Superstar distractions: WWE Top 10, Nov. 11, 2017
## 79 Jason Momoa & Lisa Bonet: Love at First Sight
## 103 Mark Wahlberg's Kids Use Him for His Celeb Connections
## channel_title category_id publish_time
## 10 Vox News and Politics 2017-11-13 08:45:16
## 11 NFL Sports 2017-11-12 21:05:26
## 24 Refinery29 Howto and Style 2017-11-12 11:00:01
## 53 WWE Sports 2017-11-11 10:00:00
## 79 The Late Late Show with James Corden Entertainment 2017-11-10 04:35:00
## 103 Late Night with Seth Meyers Comedy 2017-11-10 06:00:04
## views likes dislikes comment_count comments_disabled ratings_disabled
## 10 256426 12654 1363 2368 FALSE FALSE
## 11 81377 655 25 177 FALSE FALSE
## 24 145921 1707 578 673 FALSE FALSE
## 53 1044813 15397 838 1100 FALSE FALSE
## 79 1497519 15504 353 1084 FALSE FALSE
## 103 225286 1731 193 206 FALSE FALSE
## video_error_or_removed trending_dow trending_month publish_wday
## 10 FALSE Tuesday Nov Monday
## 11 FALSE Tuesday Nov Sunday
## 24 FALSE Tuesday Nov Sunday
## 53 FALSE Tuesday Nov Saturday
## 79 FALSE Tuesday Nov Friday
## 103 FALSE Tuesday Nov Friday
## publish_hour publish_when likesratio dislikesratio commentsratio
## 10 8 8AM to 3PM 0.049347570 0.0053153736 0.0092346330
## 11 21 4PM to 11PM 0.008048957 0.0003072121 0.0021750617
## 24 11 8AM to 3PM 0.011698111 0.0039610474 0.0046120846
## 53 10 8AM to 3PM 0.014736608 0.0008020574 0.0010528200
## 79 4 12AM to 7AM 0.010353124 0.0002357232 0.0007238639
## 103 6 12AM to 7AM 0.007683567 0.0008566888 0.0009143933
Data Aggregation`
# aggregate data
vids.top.agg <- aggregate.data.frame(x = list(likesratio = vids.top$likesratio,
dislikesratio = vids.top$dislikesratio),
by = list(channel = vids.top$channel_title),
mean)
# arrange channel levels by likes vs dislikes ratio
vids.top.agg$channel <- reorder(vids.top.agg$channel,
(vids.top.agg$likesratio/vids.top.agg$dislikesratio))
head(vids.top.agg)## channel likesratio dislikesratio
## 1 CNN 0.01253842 0.0062069224
## 2 ESPN 0.01016376 0.0007641081
## 3 First We Feast 0.04233054 0.0014632845
## 4 Jimmy Kimmel Live 0.02053525 0.0007489192
## 5 Late Night with Seth Meyers 0.01392779 0.0010047001
## 6 NBA 0.01883015 0.0032860873
## # A tibble: 6 x 3
## channel name value
## <fct> <chr> <dbl>
## 1 CNN likesratio 0.0125
## 2 CNN dislikesratio 0.00621
## 3 ESPN likesratio 0.0102
## 4 ESPN dislikesratio 0.000764
## 5 First We Feast likesratio 0.0423
## 6 First We Feast dislikesratio 0.00146
Faceting
vids.long$value <- round(vids.long$value,3)*100
plot <- ggplot(vids.long, aes(x = channel, y= value))+
geom_col(aes(fill = name), position = "dodge")+
geom_text(aes(label = value), hjust = 1, size = 2.8, color = "white")+ # add enhancement on geom_text()
coord_flip()+
labs(title = "Likeability Analysis on Popular Channels",
x = NULL,
y = NULL,
fill = "Ratios in (%):")+
facet_wrap(~name, scales = "free_x")
library(ggthemes)## Warning: package 'ggthemes' was built under R version 3.6.3
Visualizing trend with line chart
Pre-Campaign Analysis; Entertainment, Music & Gaming
vids.camp2 <- vids[vids$category_id %in% c("Entertainment", "Music", "Gaming"), ]
library(dplyr)
vids.camp2 <- count(vids.camp2, trending_date, category_id)
head(vids.camp2)## # A tibble: 6 x 3
## trending_date category_id n
## <date> <fct> <int>
## 1 2017-11-14 Entertainment 45
## 2 2017-11-14 Gaming 1
## 3 2017-11-14 Music 45
## 4 2017-11-15 Entertainment 45
## 5 2017-11-15 Gaming 1
## 6 2017-11-15 Music 37
ggplot(data = vids.camp2, aes(x = trending_date , y = n, group = category_id))+
geom_line(aes(color = category_id))+
geom_point(aes(color = category_id))+
facet_wrap(~category_id, scales = "free_y", ncol = 1)+
labs(title = "Trend Analysis on Entertainment, Gaming & Music",
x = "Trending Date",
y = "Total Videos",
color = "Category")Visualizing relationships with scatterplot
Pre-Campaign Analysis; Likes vs. Dislikes Ratio in Refinery29
refinery <- vids.u[vids.u$channel_title == "Refinery29",]
ggplot(refinery, aes(likesratio,dislikesratio))+
geom_point(aes(size = commentsratio))+
geom_smooth(method = "lm")ggplot(refinery, aes(likesratio,dislikesratio))+
geom_point(aes(size = commentsratio))+
facet_wrap(~category_id, ncol = 1)+
labs(title = "Likes vs. Dislikes Ratio in Refinery29",
x = "Likes Ratio",
y = "Dislikes Ratio",
size = "Comment Ratio") ### Outlier Identification
Which videos from Refinery29 were amongst the outliers of dislikesratio:
## [1] 0.01007510 0.01274599
library(ggrepel)
ggplot(refinery, aes(likesratio,dislikesratio))+
geom_point(aes(size = commentsratio), alpha = 0.6)+
geom_label_repel(data = refinery.out, aes(label = title),
size = 3, vjust = 1.3,
color = "white",fill = "indianred")+ # adds another aesthetic adjustment
geom_point(data = refinery.out,aes(size = commentsratio), color = "indianred", show.legend = F)+
facet_wrap(~category_id, ncol = 1)+
labs(title = "Likes vs. Dislikes Ratio in Refinery29",
x = "Likes Ratio",
y = "Dislikes Ratio",
size = "Comment Ratio")Conclusion
From this analysis, we can derive several conclusion :
“Music” has the highest likesratio value for variance.
“Sports” and “Entertainment” have the most outliers.
Overall, the `likes’ of the videos in the" Animals and Animals "category are spread out normally.
From the boxplot analysis, we can conclude that publishing time highly affect the user engagement rates, and the best time to publish a video is between 8AM to 3PM.
The higher the number of views in a video, the higher the number of comments on the video The “Gaming” category has the highest views andcomment_count.
By comparing the likeability analysis and the popularity analysis chart, we can see that the more popular a video, the more it tends to have positive user engagement.
The top 3 most prolific producers of trending videos in the week is “Refinery29”, “The Tonight Show Starring Jimmy Fallon”, and “Vox”.