Without a doubt, YouTube has been one of the biggest social media platforms of recent decades: it is the world’s largest video hosting website and the second most visited website globally according to both the Alexa Internet and SimilarWeb rankings. YouTube has had social impact in many fields, with some individual YouTube videos having directly shaped world events.
Constituting one of the world’s most popular search engines, YouTube enables inexpensive distribution of educational content, including course material from educational institutions and “how to” videos from individuals. Worldwide video access has spurred innovation by enabling geographically distributed individuals to build upon each other’s work, to collaborate, or to crowdsource.[1]
In this analysis we’re going to explore YouTube statistics, because it is well known that YouTube streaming data (video views) has been used to gauge consumer opinion for marketing decisions. We hope that this publication helps you gain a better understanding of the dataset presented.
Before running the analysis, let’s set up the R Markdown document by running the chunk below.
# clear-up the environment
rm(list = ls())
# chunk options
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)
# scientific notation
options(scipen = 9999)
We’re also going to load these packages, which will help us transform our dataset into the desired outcome.
library(ggplot2)
library(GGally)
library(ggthemes)
library(ggpubr)
library(lubridate)
library(scales)
library(tidyr)
library(colorspace)
library(tidyverse)
library(hrbrthemes)
Now we’re good to go!
This analysis was performed to visualize YouTube statistics with various plots. It was made to complete the Learn By Building: Data Visualization class.
In this analysis, we’re going to use the US YouTube Trending Videos dataset provided by Mitchell J on Kaggle, which includes records of trending videos between 14 November 2017 and 14 June 2018.[2]
First, let’s import the data into a “master” object. We also need to check the structure and the first 10 rows of the dataset to get a better look.
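A minimal sketch of the import step, assuming the Kaggle export is saved locally as "USvideos.csv" (adjust the path to your own copy):
# read the raw trending-videos export into the "master" object
master <- read.csv("USvideos.csv", stringsAsFactors = FALSE)
# inspect the structure and the first 10 rows
str(master)
head(master, 10)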
#> 'data.frame': 40949 obs. of 16 variables:
#> $ video_id : chr "2kyS6SvSYSE" "1ZAPwfrtAFY" "5qpjK5DgCt4" "puqaWrEC7tY" ...
#> $ trending_date : chr "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
#> $ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
#> $ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
#> $ category_id : int 22 24 23 24 24 28 24 28 1 25 ...
#> $ publish_time : chr "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
#> $ tags : chr "SHANtell martin" "last week tonight trump presidency|last week tonight donald trump|john oliver trump|donald trump" "racist superman|rudy|mancuso|king|bach|racist|superman|love|rudy mancuso poo bear black white official music vi"| __truncated__ "rhett and link|gmm|good mythical morning|rhett and link good mythical morning|good mythical morning rhett and l"| __truncated__ ...
#> $ views : int 748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
#> $ likes : int 57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
#> $ dislikes : int 2966 6146 5339 666 1989 511 2445 778 119 1363 ...
#> $ comment_count : int 15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
#> $ thumbnail_link : chr "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg" "https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg" "https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg" "https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg" ...
#> $ comments_disabled : chr "False" "False" "False" "False" ...
#> $ ratings_disabled : chr "False" "False" "False" "False" ...
#> $ video_error_or_removed: chr "False" "False" "False" "False" ...
#> $ description : chr "SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed t"| __truncated__ "One year after the presidential election, John Oliver discusses what we've learned so far and enlists our cathe"| __truncated__ "WATCH MY PREVIOUS VIDEO â–¶ \\n\\nSUBSCRIBE â–º https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_co"| __truncated__ "Today we find out if Link is a Nickelback amateur or a secret Nickelback devotee. GMM #1218\\nDon't miss an all"| __truncated__ ...
#> [1] FALSE
From the data inspection above, we can draw the following conclusions:
- trending_date and publish_time contain date & time details but are stored as characters.
- category_id is stored as a numeric ID instead of a descriptive category name.
- comments_disabled, ratings_disabled, and video_error_or_removed are stored as characters instead of logicals.
From those findings, data cleansing & coercion need to be done.
Before proceeding to the cleansing & coercion process, we’re going to remove some variables that won’t be needed for further analysis.
In the “master” data frame that we’ve imported, some videos trended for several days, resulting in more than one entry per title. To make each title unique, we’re going to create a data frame that only keeps the row from the day a given title first trended.
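A minimal sketch of these two steps, assuming the rows are already ordered by trending date so that distinct() keeps each title’s first trending entry (the dropped columns are inferred from the structure shown further below):
# drop columns not needed for the analysis, then keep only the first
# trending entry for each title
youtube <- master %>%
  select(-video_id, -tags, -thumbnail_link, -description) %>%
  distinct(title, .keep_all = TRUE)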
In this dataset, some variables contain date & time details. Since they are still stored as characters, let’s convert them into date & time data types using lubridate.
Remember not to run the same conversion twice, as re-parsing an already-converted column would turn its values into NA.
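For example, the second pass below reads the day and month fields of the already-converted, ISO-formatted date in the wrong positions (a small illustration, not part of the original analysis):
# first run parses "year.day.month" correctly; running it again fails,
# because "2017-11-14" re-read as year-day-month implies a 14th month
ydm("17.14.11")       # 2017-11-14
ydm(ydm("17.14.11"))  # NA, with a "failed to parse" warning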
youtube$trending_date <- ydm(youtube$trending_date)
youtube$publish_time <- ymd_hms(youtube$publish_time)
As mentioned above, the category variable is still in numeric format. We’re going to map the IDs to their category names to make them more understandable.
youtube$category_id <- sapply(as.character(youtube$category_id), switch,
"1" = "Film and Animation",
"2" = "Autos and Vehicles",
"10" = "Music",
"15" = "Pets and Animals",
"17" = "Sports",
"18" = "Short Movies",
"19" = "Travel and Events",
"20" = "Gaming",
"22" = "People and Blogs",
"23" = "Comedy",
"24" = "Entertainment",
"25" = "News and Politics",
"26" = "Howto and Style",
"27" = "Education",
"28" = "Science and Technology",
"29" = "Nonprofit and Activism",
"43" = "Shows")For the newly changed and the rest of the variables, we need to transform the data structure to match their respective type.
youtube[,c("views", "likes", "dislikes", "comment_count")] <- lapply(youtube[,c("views", "likes", "dislikes", "comment_count")], as.numeric)
youtube[,c("channel_title", "category_id")] <- lapply(youtube[,c("channel_title", "category_id")], as.factor)
youtube[,c("comments_disabled", "ratings_disabled", "video_error_or_removed")] <- lapply(youtube[,c("comments_disabled", "ratings_disabled", "video_error_or_removed")], as.logical)Let’s check the transformed data structure before proceeding to the next step.
#> 'data.frame': 6455 obs. of 12 variables:
#> $ trending_date : Date, format: "2017-11-14" "2017-11-14" ...
#> $ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
#> $ channel_title : Factor w/ 2198 levels "12 News","1MILLION Dance Studio",..: 330 1103 1644 763 1418 885 1674 461 4 2117 ...
#> $ category_id : Factor w/ 16 levels "Autos and Vehicles",..: 11 4 2 4 4 13 4 13 5 9 ...
#> $ publish_time : POSIXct, format: "2017-11-13 17:13:01" "2017-11-13 07:30:00" ...
#> $ views : num 748374 2418783 3191434 343168 2095731 ...
#> $ likes : num 57527 97185 146033 10172 132235 ...
#> $ dislikes : num 2966 6146 5339 666 1989 ...
#> $ comment_count : num 15954 12703 8181 2146 17518 ...
#> $ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
Since everything now matches its intended type, let’s move on to adding new variables.
In this data frame, we’re going to make four new variables: publish_hour (the hour of publication), publish_when (the period of the day), publish_wday (the day of the week), and timetotrend (the number of days between publication and trending).
We can extract the hour of publication by using lubridate.
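A minimal sketch, assuming the original chunk used lubridate’s hour() on publish_time:
# extract the hour (0-23) at which each video was published
youtube$publish_hour <- hour(youtube$publish_time)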
For the period of the day, we’re going to bin the hours into three categories and store the result as a categorical (factor) variable.
# bin a publish hour (0-23) into one of three periods of the day
pw <- function(x){
if(x < 8){
x <- "12am to 8am"
}else if(x >= 8 & x < 16){
x <- "8am to 3pm"
}else{
x <- "3pm to 12am"
}
}
youtube$publish_when <- as.factor(sapply(youtube$publish_hour, pw))
We can also extract the day of the week from the publish time. In addition, we’re going to order the levels so our plots display them in the desired order.
youtube$publish_wday <- as.factor(weekdays(youtube$publish_time))
youtube$publish_wday <- ordered(youtube$publish_wday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
The next variable shows how many days each trending video in this dataset needed to trend.
youtube$timetotrend <- youtube$trending_date - as.Date(youtube$publish_time)
youtube$timetotrend <- as.factor(ifelse(youtube$timetotrend <= 7, youtube$timetotrend, "8+"))
table(youtube$timetotrend)
#>
#> 0 1 2 3 4 5 6 7 8+
#> 121 2736 1834 710 381 242 136 67 228
In this section we’re going to make some general charts to get a better look at the YouTube Trending dataset we prepared above.
First, let’s see how many videos in each category went trending. In this case, we’re going to use a lollipop chart.
gi1 <- data.frame(table(youtube$category_id))
ggplot(gi1, aes(x=reorder(Var1, -Freq), y=Freq)) +
geom_segment( aes(x=reorder(Var1, Freq), xend=reorder(Var1, Freq), y=0, yend=Freq), color="skyblue") +
geom_point( color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
) +
labs(title = "Frequency of Trending Videos",
subtitle = "Based on Categories",
caption = "Source : Youtube Trending Dataset",
x = "Category",
y = "Number of Videos") From that chart we can conclude that Entertainment Category has the most trending videos while Shows Category has the least video.
Next, we’re going to find out how many videos were published in each time period (that we categorized before). For this section, we’re going to make a pie chart. Nevertheless, before we proceed, please keep in mind that pie charts are strongly discouraged for presenting analysis. Quoting from data-to-viz[3],
Pie charts are highly criticized and must be avoided as much as possible. Human is very bad at translating angles towards values. In the adjacent pie chart, try to figure out which group is the biggest one and try to order them by value. You will probably struggle to do so and this is why pie charts must be avoided.
But today, we’re going to make an exception and use this chart for the sake of learning.
# Subset the data
gi2 <- data.frame(table(youtube$publish_when))
# Compute percentages
gi2$fraction <- gi2$Freq / sum(gi2$Freq)
# Compute the cumulative percentages (top of each rectangle)
gi2$ymax <- cumsum(gi2$fraction)
# Compute the bottom of each rectangle
gi2$ymin <- c(0, head(gi2$ymax, n=-1))
# Compute label position
gi2$labelPosition <- (gi2$ymax + gi2$ymin) / 2
# Compute a good label
gi2$label <- paste0(gi2$Var1, "\n value: ", gi2$Freq)
# Make the plot
ggplot(gi2, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Var1)) +
geom_rect() +
geom_label( x=3.5, aes(y=labelPosition, label=label), size=6) +
scale_fill_brewer(palette=4) +
coord_polar(theta="y") +
xlim(c(2, 4)) +
theme_void() +
theme(legend.position = "none") +
labs(title = "Frequency of Trending Videos",
subtitle = "Based on Period of Uploading",
caption = "Source : Youtube Trending Dataset") Based on the pie chart, we can conclude that most of the trending videos were uploaded at 3pm to 12am, while the least was 12am to 8am.
To get a better understanding of how trending videos are distributed across the days of the week they were uploaded, we’re going to make a bar chart.
gi3 <- data.frame(table(youtube$publish_wday))
ggplot(gi3, aes(x=Var1, y=Freq)) +
geom_col(fill = "firebrick", alpha = 0.9) +
labs(title = "Frequency of Trending Videos",
subtitle = "Based on Days of Uploading",
x = NULL,
y = "Frequency",
caption = "Source : Youtube Trending Dataset") +
geom_text(aes(label = Freq), size = 3, hjust = 0.5, vjust = 3, position = "stack")
This bar chart shows that most of the videos were uploaded on Wednesday, while the fewest were uploaded on Saturday.
In this section, we’re going to subset some of the data to make a more detailed analysis of specific cases.
Let’s find the 10 channels whose trending videos gathered the most views by making a lollipop chart.
sp1 <- youtube %>%
select(channel_title, views) %>%
group_by(channel_title) %>%
summarise(views = sum(views)) %>%
arrange(-views)
sp1_head <- data.frame(head(sp1, 10))
ggplot(sp1_head, aes(x=reorder(channel_title, -views), y=views)) +
geom_segment( aes(x=reorder(channel_title, views), xend=reorder(channel_title, views), y=0, yend=views), color="gold") +
geom_point( color="gold4", size=6, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
) +
labs(title = "10 Trending Youtube Channel",
subtitle = "With Most Views",
caption = "Source : Youtube Trending Dataset",
x = "Category",
y = "Number of Likes") From chart above, we can conclude that Dude Perfect is the channel who got the most views for their trending videos.
Earlier, we found that Entertainment was the category with the most trending videos. Now, let’s look at the distribution and average of likes based on the time those videos were published. In this analysis, we’re going to use a boxplot to visualize the distribution and the average of the data.
sp2 <- youtube %>%
filter(category_id == "Entertainment") %>%
select(publish_when, likes) %>%
group_by(publish_when)
ggplot(data = sp2, aes(x = publish_when, y = log(likes), fill = publish_when)) +
geom_boxplot() +
geom_hline(aes(yintercept = log(mean(likes)))) +
labs(title = "Number of Likes for Entertainment",
subtitle = "Based on Posted Time ",
caption = "Source : Youtube Trending Dataset",
x = NULL,
y = NULL)
From the chart above, we can conclude the following:
- 3pm to 12am was the best period to publish videos, since it had the highest average likes compared to the other periods.
- Most of the data fell below the overall average line.
In Data Preparation, we made two new variables: timetotrend, which shows how many days a video needed to become trending, and publish_when, which categorizes the publish hour into three periods. Now, let’s combine those two variables to get a better look at their joint distribution.
sp3 <- youtube %>%
select(timetotrend, publish_when)
ggplot(data = sp3, aes(x = timetotrend)) +
geom_bar(aes(fill = publish_when), position = "stack" ) +
scale_x_discrete(guide = guide_axis(angle = 0)) +
labs(title = "Distribution of Videos",
subtitle = "Based on Posted Time & Time to Trend",
caption = "Source : Youtube Trending Dataset",
x = 'Days Needed for Video to Trend',
y = 'Frequency')
From this chart we can conclude the following:
- Most videos needed 1 day to become trending, and 8am to 3pm was the most common publishing period among them.
- Videos published between 3pm and 12am dominated most of the bars, except 0 days (12am to 8am) and 1 day (8am to 3pm).
At this point, we’re going to make a facet plot. The facet approach partitions a plot into a matrix of panels, each showing a different subset of the data. In this analysis, we’re going to subset the 5 categories with the most likes to see how their likes are distributed across the days of the week.
sp4 <- youtube %>%
select(category_id, likes) %>%
group_by(category_id) %>%
summarise(likes = sum(likes)) %>%
arrange(-likes)
sp4 <- head(sp4, 5)
plot_sp4 <- youtube %>%
filter(category_id == c("Music","Entertainment", "Comedy","People and Blogs","Howto and Style")) %>%
select(category_id, publish_wday, likes) %>%
group_by(category_id, publish_wday) %>%
summarise(likes = sum(likes))
ggplot(data = plot_sp4, aes(x = publish_wday, y = likes)) +
geom_line(aes(group = category_id, col = category_id), size = 1.1) +
geom_point() +
scale_color_brewer(palette = "Set1") +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
facet_wrap(~category_id, ncol = 1, scales = "free_y") +
theme_minimal() +
theme(legend.position = "none",
# plot.background = element_rect(fill = "grey"),
strip.background = element_rect(fill = "firebrick4"),
strip.text = element_text(color = "white", size = 12, face = "bold"),
) +
labs(title = "Summary of Likes Each Day",
subtitle = "5 Most Liked Categories",
caption = "Source : Youtube Trending Dataset",
x = 'Days',
y = 'Likes')
From the facet plot above, we can conclude the following:
- The day with the most likes varied between the chosen categories.
- Cumulatively, Monday and Saturday are the days with the fewest likes compared to the other days.
For the last analysis, we’re going to subset selected categories (Music, Pets and Animals, and Education) to analyze whether a correlation exists between the number of likes & views in these categories. Since there are many outliers, we’re going to log-transform the numbers to make the data easier to read.
sp5 <- youtube %>%
filter(category_id == c("Music","Pets Animals","Education")) %>%
select(category_id,views,likes)
ggplot(data = sp5, aes(x = log(views), y = log(likes))) +
geom_point(color="black",
fill="#69b3a2",
shape=5,
alpha=0.5,
size=1,
stroke = 1) +
geom_smooth() +
labs(title = "Correlation Between Likes & Views",
subtitle = "Category = Music, Pets Animals & Education",
caption = "Source : Youtube Trending Dataset",
x = NULL,
y = NULL) +
theme_ipsum()
From the scatterplot above, we can conclude that there is a positive correlation between likes & views for the Music, Pets and Animals & Education categories. This suggests that a video with more views will eventually gather more likes.
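As a quick numerical complement to the scatterplot, we could also compute a rank correlation on the same subset (using Spearman here is our own assumption; it is robust to the heavy skew and avoids log(0) issues):
# rank correlation between views and likes in the selected categories
cor(sp5$views, sp5$likes, method = "spearman")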
By using an appropriate plot for each question presented, we can obtain better insight from the resulting charts. Refer to data-to-viz[4] for guidance in choosing the most suitable plot/chart.