1. DATA INTRODUCTION

As YouTubers in the US who want to raise the profile of our channel, we plan to create trending video content. We have obtained the “YouTube US Trending Videos” dataset and want to find out which characteristics make a video trend.

“YouTube US Trending Videos” is a collection of 200 trending videos in the US per day from 2017-11-14 to 2018-01-21.

2. DATA PREPARATION

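Load the packages used in this report, then read the data and store it in a variable named Videos.

library(lubridate)  # ydm(), ymd_hms(), wday(), hour()
library(ggplot2)    # ggplot() and related plotting functions

Videos <- read.csv("data_input/USvideos.csv")

Now we can inspect and clean the data.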

2.1. Data Inspection

Check that the data was read correctly.

head(Videos)

Inspect the data.

str(Videos)

## 'data.frame':    13400 obs. of  12 variables:
##  $ trending_date         : chr  "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
##  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
##  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
##  $ category_id           : int  22 24 23 24 24 28 24 28 1 25 ...
##  $ publish_time          : chr  "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
##  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
##  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
##  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
##  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
##  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

dim(Videos)

## [1] 13400    12

names(Videos)

##  [1] "trending_date"          "title"                  "channel_title"         
##  [4] "category_id"            "publish_time"           "views"                 
##  [7] "likes"                  "dislikes"               "comment_count"         
## [10] "comments_disabled"      "ratings_disabled"       "video_error_or_removed"

From our inspection we can conclude:
* The Videos data contains 13,400 rows and 12 columns.
* The column names are:
01. “trending_date”
02. “title”
03. “channel_title”
04. “category_id”
05. “publish_time”
06. “views”
07. “likes”
08. “dislikes”
09. “comment_count”
10. “comments_disabled”
11. “ratings_disabled”
12. “video_error_or_removed”

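2.2. Data Cleansing & Coercion

From the str() result, we can see that some columns do not have the correct data type. We need to convert them to the correct type (data coercion): trending_date is stored as text in yy.dd.mm order, publish_time is stored as ISO 8601 text, and category_id is a numeric code rather than a category name.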

Videos$trending_date <- ydm(Videos$trending_date)
Videos$publish_time <- ymd_hms(Videos$publish_time, tz = "America/New_York")

## Date in ISO8601 format; converting timezone from UTC to "America/New_York".
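
As a cross-check on the two conversions above, here is a minimal base R sketch of the same parsing, applied to the first values shown in the str() output. It is an illustrative alternative and not the method used in this report; the variable names are made up for the example.

```r
# First raw values from the str() output above
trending_raw <- c("17.14.11", "17.14.11")      # yy.dd.mm
publish_raw  <- "2017-11-13T17:13:01.000Z"     # ISO 8601, UTC

# %y = two-digit year, %d = day of month, %m = month
trending <- as.Date(trending_raw, format = "%y.%d.%m")

# Parse as UTC; %OS reads seconds including the fractional part
publish_utc <- as.POSIXct(publish_raw,
                          format = "%Y-%m-%dT%H:%M:%OSZ",
                          tz = "UTC")

# Show the same instant in New York local time
publish_ny <- format(publish_utc, tz = "America/New_York", usetz = TRUE)

print(trending)
print(publish_ny)
```

The day and month are swapped relative to the usual ISO order in trending_date, which is why the report uses ydm() rather than ymd().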

Videos$category_id <- sapply(X = as.character(Videos$category_id), 
                           FUN = switch, 
                           "1" = "Film and Animation",
                           "2" = "Autos and Vehicles", 
                           "10" = "Music", 
                           "15" = "Pets and Animals", 
                           "17" = "Sports",
                           "19" = "Travel and Events", 
                           "20" = "Gaming", 
                           "22" = "People and Blogs", 
                           "23" = "Comedy",
                           "24" = "Entertainment", 
                           "25" = "News and Politics",
                           "26" = "Howto and Style", 
                           "27" = "Education",
                           "28" = "Science and Technology", 
                           "29" = "Nonprofit and Activism",
                           "43" = "Shows")
str(Videos)

## 'data.frame':    13400 obs. of  12 variables:
##  $ trending_date         : Date, format: "2017-11-14" "2017-11-14" ...
##  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
##  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
##  $ category_id           : chr  "People and Blogs" "Entertainment" "Comedy" "Entertainment" ...
##  $ publish_time          : POSIXct, format: "2017-11-13 12:13:01" "2017-11-13 02:30:00" ...
##  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
##  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
##  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
##  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
##  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

head(Videos)

Each column now has the desired data type.
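
The category mapping can also be written as a named lookup vector. The sketch below uses a subset of the IDs listed above and a made-up ID (99) to show what happens when a code has no label; the names category_lookup, ids, and labels are hypothetical.

```r
# Named character vector used as a lookup table (subset of the IDs above)
category_lookup <- c(
  "10" = "Music",
  "22" = "People and Blogs",
  "23" = "Comedy",
  "24" = "Entertainment"
)

ids <- c(22, 24, 23, 24, 99)   # 99 is a made-up ID with no label

# Index the lookup by name; unname() drops the ID names from the result
labels <- unname(category_lookup[as.character(ids)])
print(labels)

# Number of IDs that could not be mapped
print(sum(is.na(labels)))
```

With this approach an unmapped ID becomes NA, so it is picked up by the missing-value check, whereas switch() returns NULL for an unmatched value and sapply() may then return a list instead of a character vector.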

Next, we check the data for missing values.

colSums(is.na(Videos))

##          trending_date                  title          channel_title 
##                      0                      0                      0 
##            category_id           publish_time                  views 
##                      0                      0                      0 
##                  likes               dislikes          comment_count 
##                      0                      0                      0 
##      comments_disabled       ratings_disabled video_error_or_removed 
##                      0                      0                      0

anyNA(Videos)

## [1] FALSE

From the result above, we know that there are no missing values in the Videos data.

We will subset the data to drop columns 10, 11, and 12, because we do not need that information, then save the result in the Videos_new variable.

Videos_new <- Videos[,c(1:9)]
head(Videos_new)

2.3. Data Feature Engineering

Extract the day name information from the trending_date column and create a new column named trending_day.

Videos_new$trending_day <- wday(Videos_new$trending_date,
                                label = TRUE,
                                abbr = TRUE)
head(Videos_new)

Extract the hour information from the publish_time column and create a new column named publish_hour.

Videos_new$publish_hour <- hour(Videos_new$publish_time)
head(Videos_new)

Create a publish_when column by dividing publish_hour into two periods: "Day" for hours 0 to 12 and "Night" for hours 13 to 23.

Videos_new$publish_when <- ifelse(test = Videos_new$publish_hour > 12, yes = "Night", no = "Day")
head(Videos_new)
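
To make the boundary of the Day/Night split explicit, the following small sketch applies the same ifelse() rule to a few sample hours. The hours vector is made up for illustration.

```r
# Sample publish hours, including the boundary values 12 and 13
hours <- c(0, 6, 12, 13, 18, 23)

# Same rule as above: strictly greater than 12 is "Night"
period <- ifelse(hours > 12, "Night", "Day")

print(data.frame(hours, period))
```

Under this rule a video published at 12:59 still counts as "Day", and early-morning hours such as 01:00 also count as "Day".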

Extract the day name information from the publish_time column and create a new column named publish_day.

Videos_new$publish_day <- wday(x=Videos_new$publish_time, label=T, abbr = T)
head(Videos_new)

3. DATA EXPLORATION & VISUALIZATION

The Videos_new data contains redundancy: some videos appear several times because they trended for more than one day.

For further analysis, **we will only use the data from the first day each video was trending** in order to reduce this redundancy.

index.Videos_new <- match(unique(Videos_new$title), Videos_new$title)
Videos_new <- Videos_new[index.Videos_new,]
head(Videos_new)
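
The match(unique(...)) idiom keeps the first row for each title. A minimal sketch on made-up data shows that it gives the same result as the more common !duplicated() form. It assumes, as the analysis above does, that rows are ordered by trending date so that the first row is the first trending day.

```r
# Toy data: videos "A" and "B" trend on more than one day
toy <- data.frame(
  title = c("A", "B", "A", "C", "B"),
  trending_date = as.Date("2017-11-14") + c(0, 0, 1, 1, 2)
)

# Idiom used in this report: position of the first occurrence of each title
first_match <- toy[match(unique(toy$title), toy$title), ]

# Equivalent form: drop every row whose title was already seen
first_dup <- toy[!duplicated(toy$title), ]

print(first_match)
print(isTRUE(all.equal(first_match, first_dup)))
```

Both forms identify videos by title only, so two different videos that share a title would be treated as one.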

3.1. Likes & Comments per Views for Top 4 Category

We will look for the 4 video categories with the most views by summing the views column per category_id.

category_views <- aggregate(views~category_id,Videos_new,sum)
head(category_views[order(category_views$views, decreasing = T),],4)
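
The aggregate-then-sort step can be sketched on made-up data as follows. The sketch also stores the top category names in a vector (top_names, a hypothetical name), which could be used for subsetting instead of typing the names by hand.

```r
# Toy data with made-up view counts
toy <- data.frame(
  category_id = c("Music", "Comedy", "Music", "Sports", "Comedy", "Gaming"),
  views = c(100, 40, 250, 30, 60, 10)
)

# Total views per category
totals <- aggregate(views ~ category_id, data = toy, FUN = sum)

# Sort descending and keep the top 2 rows
top2 <- head(totals[order(totals$views, decreasing = TRUE), ], 2)
top_names <- as.character(top2$category_id)
print(top2)

# Subset the original rows to the top categories
toy_top <- toy[toy$category_id %in% top_names, ]
print(toy_top)
```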

Based on these top 4 video categories, we will do further analysis.

Subset the Videos_new data to the top 4 categories and save it to the Videos_top_4 object.

Videos_top_4 <- Videos_new[Videos_new$category_id %in% c("Entertainment", "Music", "Comedy", "Howto and Style"), ]

Create a likesp column containing likes/views and a commentp column containing comment_count/views.

Videos_top_4$likesp <- Videos_top_4$likes / Videos_top_4$views
Videos_top_4$commentp <- Videos_top_4$comment_count / Videos_top_4$views
head(Videos_top_4)

See the distribution of likes per view and comments per view for each category.

ggplot(data = Videos_top_4, mapping = aes(x = category_id, y = likesp)) +
  geom_boxplot(outlier.shape = NA, fill = "black", col = "blue", alpha = 0.5) +
  geom_jitter(aes(size = commentp), col = "green", alpha = 0.2) +
  labs(title = "Likes and Comments per View of Trending YouTube Videos",
       subtitle = "Entertainment, Music, Comedy, Howto and Style",
       x = NULL,
       y = "Likes per view",
       size = "Comments per view",
       caption = "Source: YouTube") +
  theme_minimal()

3.2. Channels with 10 or More Trending Videos

We are also planning to collaborate with a YouTube channel that often appears among the trending videos.
We will look for YouTube channels that have 10 or more trending videos, so that we can determine which channel would make a good collaboration partner.

Count the number of trending videos for each channel.

Videos_10chan <- as.data.frame(table(Videos_new$channel_title))
colnames(Videos_10chan) <- c("Title", "Freq")

Perform filtering for channels that have a frequency >= 10.

Videos_10chan <- Videos_10chan[Videos_10chan$Freq >= 10 , ]
head(Videos_10chan, 10)

Sort from highest to lowest frequency and take the top 10 rows of Videos_10chan.

Videos_10chan <- head(Videos_10chan[order(Videos_10chan$Freq, decreasing=T), ], 10)
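
The count, filter, and sort steps above can be sketched on a made-up vector of channel names. Here the threshold is 2 instead of 10 so that the toy data produces a result; channels and freq_df are hypothetical names.

```r
# Toy channel names: "a" appears 3 times, "b" twice, "c" once
channels <- c("a", "b", "a", "c", "a", "b")

# Count occurrences and sort from most to least frequent
freq <- sort(table(channels), decreasing = TRUE)

# Convert to a data frame with the same column names used in this report
freq_df <- as.data.frame(freq, stringsAsFactors = FALSE)
colnames(freq_df) <- c("Title", "Freq")

# Keep channels at or above the threshold
frequent <- freq_df[freq_df$Freq >= 2, ]
print(frequent)
```

Sorting before filtering gives the same rows as filtering first; the order of the two steps only matters for readability.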

Visualization.

ggplot(data = Videos_10chan[1:10, ], mapping = aes(x = Freq, y = reorder(Title, Freq))) +
  geom_col(aes(fill = Freq)) +
  labs(
    title = "Top 10 Trending YouTube Channels",
    x = "Video Count",
    y = "Channel Title",
    caption = "Source: YouTube"
  ) +
  scale_fill_gradient(low = "purple", high = "green") +
  geom_label(mapping = aes(label = Freq),
             col = "blue",
             nudge_x = -1) +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_vline(xintercept = mean(Videos_10chan$Freq), col = "white") +
  scale_x_continuous(breaks = seq(0, 35, 5))

3.3. Categories with Highest Trending Videos

We will find out which category has the highest number of trending videos and how those videos are split between the two publish periods (Day/Night).

Videos_DayNight <- as.data.frame(table(Videos_new$category_id, Videos_new$publish_when))         
head(Videos_DayNight)
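
The table above holds counts. If the share of each period within a category is wanted as well, one option is prop.table() with row margins, sketched here on made-up data (this step is not part of the original analysis).

```r
# Toy data: category and publish period for six videos
toy <- data.frame(
  category_id = c("Music", "Music", "Music", "Comedy", "Comedy", "Music"),
  publish_when = c("Day", "Night", "Night", "Day", "Day", "Night")
)

# Two-way count table: rows are categories, columns are periods
counts <- table(toy$category_id, toy$publish_when)

# margin = 1 divides each row by its row total, so each row sums to 1
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```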

Visualization.

ggplot(data = Videos_DayNight, mapping = aes(x = Freq, y = reorder(Var1, Freq))) +
  geom_col(mapping = aes(fill = Var2), position = "stack") +
  labs(x = "Video Count", y = NULL,
       fill = NULL,
       title = "Categories with Highest Trending Videos",
       subtitle = "Colored by Publish Period (Day/Night)") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  theme(legend.position = "top")

3.4. Publish Time & Views

We will visualize the trend of average views per publish_hour for the top 4 categories.

Videos_TimeViews <- aggregate(views ~ category_id + publish_hour,
                       data = Videos_top_4,
                       FUN = mean)

Visualization.

ggplot(data = Videos_TimeViews, mapping = aes(x = publish_hour, y = views)) +
  geom_line(aes(group = category_id,
                col = category_id)) +
  labs(x = "Publish Hour", y = "Views",
       colour = NULL,
       title = "Publish Hour & Views") +
  geom_point(aes(col = category_id)) +
  theme_minimal()

4. DATA ANALYSIS

Based on the exploration of the data above, we can make the following observations:
1. The Music category has the highest engagement. Its likes per view are the highest of the four categories, as the median in the boxplot shows.
Of the four categories, Music also has the highest comments per view, as the size of the jitter points shows.
2. The top 10 trending YouTube channels have a trending-video count greater than the overall average.
3. The Entertainment category has the highest number of trending videos, in both the Day and the Night publish periods.
4. The Music category has the highest average views for videos published at prime time.

Conclusion: Based on the four observations above, Music is the most recommended category for new YouTubers starting a channel. Likewise, active YouTubers can add Music content to their channel to increase its likes and views.

US Youtube Videos - Data Visualization

Tubagus Fathul Arifin

2022-08-01
