Without a doubt, YouTube has been one of the biggest social media platforms of recent decades: it is the world’s largest video hosting website and the second most visited website globally according to both the Alexa Internet and SimilarWeb rankings. YouTube has had social impact in many fields, with some individual YouTube videos having directly shaped world events.
Constituting one of the world’s most popular search engines, YouTube enables inexpensive distribution of educational content, including course material from educational institutions and “how to” videos from individuals. Worldwide video access has spurred innovation by enabling geographically distributed individuals to build upon each other’s work, to collaborate, or to crowdsource.[1]
In this analysis we’re going to explore YouTube statistics, because it is well known that YouTube streaming data (video views) has been used to gauge consumer opinion for marketing decisions. We hope that this publication helps you gain a better understanding of the dataset presented.
Before running the analysis, let’s set up the R Markdown document by running the chunk below.
# clear-up the environment
rm(list = ls())
# chunk options
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)
# scientific notation
options(scipen = 9999)
We’re also going to load these packages, which will help us transform our dataset into the desired outcome.
library(ggplot2)
library(GGally)
library(ggthemes)
library(ggpubr)
library(lubridate)
library(scales)
library(tidyr)
library(colorspace)
library(tidyverse)
library(hrbrthemes)
Now we’re good to go!
This analysis was performed to visualize YouTube statistics with various plots. It was made to complete the Learn By Building: Data Visualization class.
In this analysis, we’re going to use the US YouTube Trending Videos dataset provided by Mitchell J on Kaggle, which includes records of trending videos between 14 November 2017 and 14 June 2018.[2]
First, let’s import the data into a “master” object. We also need to check the structure and the first 10 rows of the dataset to get a better look.
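A minimal sketch of the import step, assuming the Kaggle export is saved locally as "USvideos.csv" (adjust the path to your own copy):
# read the raw trending-videos export into the "master" object
master <- read.csv("USvideos.csv", stringsAsFactors = FALSE)
# inspect the structure and the first 10 rows
str(master)
head(master, 10)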
#> 'data.frame': 40949 obs. of 16 variables:
#> $ video_id : chr "2kyS6SvSYSE" "1ZAPwfrtAFY" "5qpjK5DgCt4" "puqaWrEC7tY" ...
#> $ trending_date : chr "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
#> $ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
#> $ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
#> $ category_id : int 22 24 23 24 24 28 24 28 1 25 ...
#> $ publish_time : chr "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
#> $ tags : chr "SHANtell martin" "last week tonight trump presidency|last week tonight donald trump|john oliver trump|donald trump" "racist superman|rudy|mancuso|king|bach|racist|superman|love|rudy mancuso poo bear black white official music vi"| __truncated__ "rhett and link|gmm|good mythical morning|rhett and link good mythical morning|good mythical morning rhett and l"| __truncated__ ...
#> $ views : int 748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
#> $ likes : int 57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
#> $ dislikes : int 2966 6146 5339 666 1989 511 2445 778 119 1363 ...
#> $ comment_count : int 15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
#> $ thumbnail_link : chr "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg" "https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg" "https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg" "https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg" ...
#> $ comments_disabled : chr "False" "False" "False" "False" ...
#> $ ratings_disabled : chr "False" "False" "False" "False" ...
#> $ video_error_or_removed: chr "False" "False" "False" "False" ...
#> $ description : chr "SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed t"| __truncated__ "One year after the presidential election, John Oliver discusses what we've learned so far and enlists our cathe"| __truncated__ "WATCH MY PREVIOUS VIDEO â–¶ \\n\\nSUBSCRIBE â–º https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_co"| __truncated__ "Today we find out if Link is a Nickelback amateur or a secret Nickelback devotee. GMM #1218\\nDon't miss an all"| __truncated__ ...
#> [1] FALSE
From the data inspection above, we can draw the following conclusions:
- trending_date and publish_time contain date & time details but are stored as characters.
- category_id is stored as a numeric ID instead of a descriptive category name.
- comments_disabled, ratings_disabled, and video_error_or_removed are stored as characters instead of logicals.
From those findings, data cleansing & coercion need to be done.
Before proceeding to the cleansing & coercion process, we’re going to remove some variables that won’t be needed for further analysis.
In the “master” data frame that we’ve imported, some videos trended for several days, resulting in more than one entry per title. To make each title unique, we’re going to create a data frame that only keeps the row from the day a given title first trended.
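A minimal sketch of these two steps, assuming the rows are already ordered by trending date so that distinct() keeps each title’s first trending entry (the dropped columns are inferred from the structure shown further below):
# drop columns not needed for the analysis, then keep only the first
# trending entry for each title
youtube <- master %>%
  select(-video_id, -tags, -thumbnail_link, -description) %>%
  distinct(title, .keep_all = TRUE)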
In this dataset, some variables contain date & time details. Since they are still stored as characters, let’s convert them into date & time data types using lubridate.
Remember not to run the same conversion twice, as re-parsing an already-converted column would turn its values into NA.
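For example, the second pass below reads the day and month fields of the already-converted, ISO-formatted date in the wrong positions (a small illustration, not part of the original analysis):
# first run parses "year.day.month" correctly; running it again fails,
# because "2017-11-14" re-read as year-day-month implies a 14th month
ydm("17.14.11")       # 2017-11-14
ydm(ydm("17.14.11"))  # NA, with a "failed to parse" warning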
youtube$trending_date <- ydm(youtube$trending_date)
youtube$publish_time <- ymd_hms(youtube$publish_time)
As mentioned above, the category variable is still in numeric format. We’re going to map the IDs to their category names to make them more understandable.
youtube$category_id <- sapply(as.character(youtube$category_id), switch,
"1" = "Film and Animation",
"2" = "Autos and Vehicles",
"10" = "Music",
"15" = "Pets and Animals",
"17" = "Sports",
"18" = "Short Movies",
"19" = "Travel and Events",
"20" = "Gaming",
"22" = "People and Blogs",
"23" = "Comedy",
"24" = "Entertainment",
"25" = "News and Politics",
"26" = "Howto and Style",
"27" = "Education",
"28" = "Science and Technology",
"29" = "Nonprofit and Activism",
"43" = "Shows")For the newly changed and the rest of the variables, we need to transform the data structure to match their respective type.
youtube[,c("views", "likes", "dislikes", "comment_count")] <- lapply(youtube[,c("views", "likes", "dislikes", "comment_count")], as.numeric)
youtube[,c("channel_title", "category_id")] <- lapply(youtube[,c("channel_title", "category_id")], as.factor)
youtube[,c("comments_disabled", "ratings_disabled", "video_error_or_removed")] <- lapply(youtube[,c("comments_disabled", "ratings_disabled", "video_error_or_removed")], as.logical)Let’s check the transformed data structure before proceeding to the next step.
#> 'data.frame': 6455 obs. of 12 variables:
#> $ trending_date : Date, format: "2017-11-14" "2017-11-14" ...
#> $ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
#> $ channel_title : Factor w/ 2198 levels "12 News","1MILLION Dance Studio",..: 330 1103 1644 763 1418 885 1674 461 4 2117 ...
#> $ category_id : Factor w/ 16 levels "Autos and Vehicles",..: 11 4 2 4 4 13 4 13 5 9 ...
#> $ publish_time : POSIXct, format: "2017-11-13 17:13:01" "2017-11-13 07:30:00" ...
#> $ views : num 748374 2418783 3191434 343168 2095731 ...
#> $ likes : num 57527 97185 146033 10172 132235 ...
#> $ dislikes : num 2966 6146 5339 666 1989 ...
#> $ comment_count : num 15954 12703 8181 2146 17518 ...
#> $ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
Since everything now matches its intended type, let’s move on to adding new variables.
In this data frame, we’re going to make four new variables: publish_hour (the hour of publication), publish_when (the period of the day), publish_wday (the day of the week), and timetotrend (the number of days between publication and trending).
We can extract the hour of publication by using lubridate.
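A minimal sketch, assuming the original chunk used lubridate’s hour() on publish_time:
# extract the hour (0-23) at which each video was published
youtube$publish_hour <- hour(youtube$publish_time)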
For the period of the day, we’re going to bin the hours into three categories and store the result as a categorical (factor) variable.
# bin a publish hour (0-23) into one of three periods of the day
pw <- function(x){
if(x < 8){
x <- "12am to 8am"
}else if(x >= 8 & x < 16){
x <- "8am to 3pm"
}else{
x <- "3pm to 12am"
}
}
youtube$publish_when <- as.factor(sapply(youtube$publish_hour, pw))
We can also extract the day of the week from the publish time. In addition, we’re going to order the levels so our plots display them in the desired order.
youtube$publish_wday <- as.factor(weekdays(youtube$publish_time))
youtube$publish_wday <- ordered(youtube$publish_wday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
The next variable shows how many days each trending video in this dataset needed to trend.
youtube$timetotrend <- youtube$trending_date - as.Date(youtube$publish_time)
youtube$timetotrend <- as.factor(ifelse(youtube$timetotrend <= 7, youtube$timetotrend, "8+"))
table(youtube$timetotrend)
#>
#> 0 1 2 3 4 5 6 7 8+
#> 121 2736 1834 710 381 242 136 67 228
In this section we’re going to make some general charts to get a better look at the YouTube Trending dataset we prepared above.
First, let’s see how many videos in each category went trending. In this case, we’re going to use a lollipop chart.
gi1 <- data.frame(table(youtube$category_id))
ggplot(gi1, aes(x=reorder(Var1, -Freq), y=Freq)) +
geom_segment( aes(x=reorder(Var1, Freq), xend=reorder(Var1, Freq), y=0, yend=Freq), color="skyblue") +
geom_point( color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
) +
labs(title = "Frequency of Trending Videos",
subtitle = "Based on Categories",
caption = "Source : Youtube Trending Dataset",
x = "Category",
y = "Number of Videos") From that chart we can conclude that Entertainment Category has the most trending videos while Shows Category has the least video.
Next, we’re going to find out how many videos were published in each time period (that we categorized before). For this section, we’re going to make a pie chart. Nevertheless, before we proceed, please keep in mind that pie charts are strongly discouraged for presenting analysis. Quoting from data-to-viz[3],
Pie charts are highly criticized and must be avoided as much as possible. Human is very bad at translating angles towards values. In the adjacent pie chart, try to figure out which group is the biggest one and try to order them by value. You will probably struggle to do so and this is why pie charts must be avoided.
But today, we’re going to make an exception and use this chart for the sake of learning.
# Subset the data
gi2 <- data.frame(table(youtube$publish_when))
# Compute percentages
gi2$fraction <- gi2$Freq / sum(gi2$Freq)
# Compute the cumulative percentages (top of each rectangle)
gi2$ymax <- cumsum(gi2$fraction)
# Compute the bottom of each rectangle
gi2$ymin <- c(0, head(gi2$ymax, n=-1))
# Compute label position
gi2$labelPosition <- (gi2$ymax + gi2$ymin) / 2
# Compute a good label
gi2$label <- paste0(gi2$Var1, "\n value: ", gi2$Freq)
# Make the plot
ggplot(gi2, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Var1)) +
geom_rect() +
geom_label( x=3.5, aes(y=labelPosition, label=label), size=6) +
scale_fill_brewer(palette=4) +
coord_polar(theta="y") +
xlim(c(2, 4)) +
theme_void() +
theme(legend.position = "none") +
labs(title = "Frequency of Trending Videos",
subtitle = "Based on Period of Uploading",
caption = "Source : Youtube Trending Dataset") Based on the pie chart, we can conclude that most of the trending videos were uploaded at 3pm to 12am, while the least was 12am to 8am.
To get a better understanding of how trending videos are distributed across the days of the week they were uploaded, we’re going to make a bar chart.
gi3 <- data.frame(table(youtube$publish_wday))
ggplot(gi3, aes(x=Var1, y=Freq)) +
geom_col(fill = "firebrick", alpha = 0.9) +
labs(title = "Frequency of Trending Videos",
subtitle = "Based on Days of Uploading",
x = NULL,
y = "Frequency",
caption = "Source : Youtube Trending Dataset") +
geom_text(aes(label = Freq), size = 3, hjust = 0.5, vjust = 3, position = "stack")
This bar chart shows that most of the videos were uploaded on Wednesday, while the fewest were uploaded on Saturday.
In this section, we’re going to subset some of the data to make a more detailed analysis of specific cases.
Let’s find the 10 channels whose trending videos gathered the most views by making a lollipop chart.
sp1 <- youtube %>%
select(channel_title, views) %>%
group_by(channel_title) %>%
summarise(views = sum(views)) %>%
arrange(-views)
sp1_head <- data.frame(head(sp1, 10))
ggplot(sp1_head, aes(x=reorder(channel_title, -views), y=views)) +
geom_segment( aes(x=reorder(channel_title, views), xend=reorder(channel_title, views), y=0, yend=views), color="gold") +
geom_point( color="gold4", size=6, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
) +
labs(title = "10 Trending Youtube Channel",
subtitle = "With Most Views",
caption = "Source : Youtube Trending Dataset",
x = "Category",
y = "Number of Likes") From chart above, we can conclude that Dude Perfect is the channel who got the most views for their trending videos.
Earlier, we found that Entertainment was the category with the most trending videos. Now, let’s look at the distribution and average of likes based on the time those videos were published. In this analysis, we’re going to use a boxplot to visualize the distribution and the average of the data.
sp2 <- youtube %>%
filter(category_id == "Entertainment") %>%
select(publish_when, likes) %>%
group_by(publish_when)
ggplot(data = sp2, aes(x = publish_when, y = log(likes), fill = publish_when)) +
geom_boxplot() +
geom_hline(aes(yintercept = log(mean(likes)))) +
labs(title = "Number of Likes for Entertainment",
subtitle = "Based on Posted Time ",
caption = "Source : Youtube Trending Dataset",
x = NULL,
y = NULL)
From the chart above, we can conclude the following:
- 3pm to 12am was the best period to publish videos, since it had the highest average likes compared to the other periods.
- Most of the data fell below the overall average line.
In Data Preparation, we made two new variables: timetotrend, which shows how many days a video needed to become trending, and publish_when, which categorizes the publish hour into three periods. Now, let’s combine those two variables to get a better look at their joint distribution.
sp3 <- youtube %>%
select(timetotrend, publish_when)
ggplot(data = sp3, aes(x = timetotrend)) +
geom_bar(aes(fill = publish_when), position = "stack" ) +
scale_x_discrete(guide = guide_axis(angle = 0)) +
labs(title = "Distribution of Videos",
subtitle = "Based on Posted Time & Time to Trend",
caption = "Source : Youtube Trending Dataset",
x = 'Days Needed for Video to Trend',
y = 'Frequency')
From this chart we can conclude the following:
- Most videos needed 1 day to become trending, and 8am to 3pm was the most common publishing period among them.
- Videos published between 3pm and 12am dominated most of the bars, except 0 days (12am to 8am) and 1 day (8am to 3pm).
At this point, we’re going to make a facet plot. The facet approach partitions a plot into a matrix of panels, each showing a different subset of the data. In this analysis, we’re going to subset the 5 categories with the most likes to see how their likes are distributed across the days of the week.
sp4 <- youtube %>%
select(category_id, likes) %>%
group_by(category_id) %>%
summarise(likes = sum(likes)) %>%
arrange(-likes)
sp4 <- head(sp4, 5)
plot_sp4 <- youtube %>%
filter(category_id == c("Music","Entertainment", "Comedy","People and Blogs","Howto and Style")) %>%
select(category_id, publish_wday, likes) %>%
group_by(category_id, publish_wday) %>%
summarise(likes = sum(likes))
ggplot(data = plot_sp4, aes(x = publish_wday, y = likes)) +
geom_line(aes(group = category_id, col = category_id), size = 1.1) +
geom_point() +
scale_color_brewer(palette = "Set1") +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
facet_wrap(~category_id, ncol = 1, scales = "free_y") +
theme_minimal() +
theme(legend.position = "none",
# plot.background = element_rect(fill = "grey"),
strip.background = element_rect(fill = "firebrick4"),
strip.text = element_text(color = "white", size = 12, face = "bold"),
) +
labs(title = "Summary of Likes Each Day",
subtitle = "5 Most Liked Categories",
caption = "Source : Youtube Trending Dataset",
x = 'Days',
y = 'Likes')
From the facet plot above, we can conclude the following:
- The day with the most likes varied between the chosen categories.
- Cumulatively, Monday and Saturday are the days with the fewest likes compared to the other days.
For the last analysis, we’re going to subset selected categories (Music, Pets and Animals, and Education) to analyze whether a correlation exists between the number of likes & views in these categories. Since there are many outliers, we’re going to log-transform the numbers to make the data easier to read.
sp5 <- youtube %>%
filter(category_id == c("Music","Pets Animals","Education")) %>%
select(category_id,views,likes)
ggplot(data = sp5, aes(x = log(views), y = log(likes))) +
geom_point(color="black",
fill="#69b3a2",
shape=5,
alpha=0.5,
size=1,
stroke = 1) +
geom_smooth() +
labs(title = "Correlation Between Likes & Views",
subtitle = "Category = Music, Pets Animals & Education",
caption = "Source : Youtube Trending Dataset",
x = NULL,
y = NULL) +
theme_ipsum()
From the scatterplot above, we can conclude that there is a positive correlation between likes & views for the Music, Pets and Animals & Education categories. This suggests that a video with more views will eventually gather more likes.
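As a quick numerical complement to the scatterplot, we could also compute a rank correlation on the same subset (using Spearman here is our own assumption; it is robust to the heavy skew and avoids log(0) issues):
# rank correlation between views and likes in the selected categories
cor(sp5$views, sp5$likes, method = "spearman")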
By using an appropriate plot for each question presented, we can obtain better insight from the resulting charts. Refer to data-to-viz[4] for guidance in choosing the most suitable plot/chart.