As YouTubers in the US who want to raise the profile of our channel, we plan to create trending video content. We have obtained the “YouTube US Trending Videos” data and want to find out which characteristics make a video trend.
“YouTube US Trending Videos” is a collection of 200 trending videos in the US per day from 2017-11-14 to 2018-01-21.
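Load the packages used below, then read the data and store it in a variable named Videos.
library(lubridate) # ydm(), ymd_hms(), wday(), hour()
library(ggplot2) # ggplot() and the geoms used in the plots
Videos <- read.csv("data_input/USvideos.csv")
Now we can inspect and clean the data.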
Check that the data was read correctly.
head(Videos)
Inspect the data.
str(Videos)
## 'data.frame': 13400 obs. of 12 variables:
## $ trending_date : chr "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
## $ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
## $ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
## $ category_id : int 22 24 23 24 24 28 24 28 1 25 ...
## $ publish_time : chr "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
## $ views : int 748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
## $ likes : int 57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
## $ dislikes : int 2966 6146 5339 666 1989 511 2445 778 119 1363 ...
## $ comment_count : int 15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
## $ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
dim(Videos)
## [1] 13400 12
names(Videos)
## [1] "trending_date" "title" "channel_title"
## [4] "category_id" "publish_time" "views"
## [7] "likes" "dislikes" "comment_count"
## [10] "comments_disabled" "ratings_disabled" "video_error_or_removed"
From our inspection we can conclude:
* The Videos data contains 13400 rows and 12 columns.
* The column names are:
01. “trending_date”,
02. “title”,
03. “channel_title”,
04. “category_id”,
05. “publish_time”,
06. “views”,
07. “likes”,
08. “dislikes”,
09. “comment_count”,
10. “comments_disabled”,
11. “ratings_disabled”,
12. “video_error_or_removed”
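From the str() result, some columns do not have the correct data type:
trending_date and publish_time are stored as text, and category_id holds
numeric codes rather than category names. We need to convert them to the
correct types (data coercion).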
Videos$trending_date <- ydm(Videos$trending_date)
Videos$publish_time <- ymd_hms(Videos$publish_time, tz = "America/New_York")
## Date in ISO8601 format; converting timezone from UTC to "America/New_York".
Videos$category_id <- sapply(X = as.character(Videos$category_id),
FUN = switch,
"1" = "Film and Animation",
"2" = "Autos and Vehicles",
"10" = "Music",
"15" = "Pets and Animals",
"17" = "Sports",
"19" = "Travel and Events",
"20" = "Gaming",
"22" = "People and Blogs",
"23" = "Comedy",
"24" = "Entertainment",
"25" = "News and Politics",
"26" = "Howto and Style",
"27" = "Education",
"28" = "Science and Technology",
"29" = "Nonprofit and Activism",
"43" = "Shows")
str(Videos)
## 'data.frame': 13400 obs. of 12 variables:
## $ trending_date : Date, format: "2017-11-14" "2017-11-14" ...
## $ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
## $ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
## $ category_id : chr "People and Blogs" "Entertainment" "Comedy" "Entertainment" ...
## $ publish_time : POSIXct, format: "2017-11-13 12:13:01" "2017-11-13 02:30:00" ...
## $ views : int 748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
## $ likes : int 57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
## $ dislikes : int 2966 6146 5339 666 1989 511 2445 778 119 1363 ...
## $ comment_count : int 15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
## $ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
head(Videos)
Each column now has the desired data type.
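As a cross-check on these conversions, here is a minimal base-R sketch on a few hypothetical values. It assumes the `yy.dd.mm` layout seen in the first `str()` output, and it uses a named lookup vector (`category_lookup`, a name introduced only for this illustration) in place of `switch`. With `sapply` and `switch`, an ID that has no matching entry yields `NULL`, so the result no longer simplifies to a character vector; with the lookup, such an ID simply becomes `NA`.

```r
# Hypothetical sample values in the same layout as the raw columns.
raw_dates <- c("17.14.11", "18.21.01")   # yy.dd.mm
raw_ids   <- c(22L, 10L, 99L)            # 99 is deliberately unmapped

# Base-R counterpart of lubridate::ydm() for this layout.
parsed_dates <- as.Date(raw_dates, format = "%y.%d.%m")

# Named-vector lookup: names are the IDs, values are the labels.
category_lookup <- c("10" = "Music", "22" = "People and Blogs")
labels <- unname(category_lookup[as.character(raw_ids)])

print(parsed_dates)
print(labels)
```

Checking `anyNA(labels)` after such a lookup is a quick way to spot category IDs that are missing from the mapping.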
Next, we check the data for missing values.
colSums(is.na(Videos))
## trending_date title channel_title
## 0 0 0
## category_id publish_time views
## 0 0 0
## likes dislikes comment_count
## 0 0 0
## comments_disabled ratings_disabled video_error_or_removed
## 0 0 0
anyNA(Videos)
## [1] FALSE
From the result above, we know that there are no missing values in
the Videos data.
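For contrast, this is a minimal sketch of what the same two checks report when values are missing. The data frame `toy` is hypothetical and not part of the dataset.

```r
# Hypothetical data frame with two missing values in `likes`.
toy <- data.frame(views = c(100L, 250L, 80L),
                  likes = c(10L, NA, NA))

# Count missing values per column, then ask whether any exist at all.
na_per_column <- colSums(is.na(toy))
print(na_per_column)
print(anyNA(toy))
```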
We subset the data to drop columns 10, 11 and 12 (comments_disabled,
ratings_disabled and video_error_or_removed), which we do not need,
and save the result in the Videos_new variable.
Videos_new <- Videos[,c(1:9)]
head(Videos_new)
Extract the day name from the trending_date column and create a new
column named trending_day.
Videos_new$trending_day <- (wday(Videos_new$trending_date,
label = T,
abbr = T))
head(Videos_new)
Extract the hour from the publish_time column and create a new column
named publish_hour.
Videos_new$publish_hour <- hour(Videos_new$publish_time)
head(Videos_new)
Create a publish_when column by dividing publish_hour into two periods
(Day and Night).
Videos_new$publish_when <- ifelse(test = Videos_new$publish_hour > 12, yes = "Night", no = "Day")
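To make the boundary of this rule explicit, here is a small sketch on hypothetical hours (the vectors `hours` and `when` exist only for this illustration). With the condition `> 12`, hour 12 and the early-morning hours are labelled "Day", and hours 13 to 23 are labelled "Night".

```r
# Hypothetical publish hours on a 0-23 clock.
hours <- c(0L, 9L, 12L, 13L, 23L)

# Same rule as above: strictly greater than 12 is "Night", otherwise "Day".
when <- ifelse(hours > 12, "Night", "Day")

print(data.frame(hours, when))
```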
head(Videos_new)
Extract the day name from the publish_time column and create a new
column named publish_day.
Videos_new$publish_day <- wday(x=Videos_new$publish_time, label=T, abbr = T)
head(Videos_new)
The Videos_new data is redundant: some videos appear several times
because they trended for more than one day.
For further analysis, **we will only use the row for the first time each video trended** in order to reduce this redundancy.
index.Videos_new <- match(unique(Videos_new$title), Videos_new$title)
Videos_new <- Videos_new[index.Videos_new,]
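The first-occurrence rule can be illustrated on a hypothetical data frame (`toy` and its values are made up). The sketch also shows that `!duplicated()` selects the same rows as `match(unique(...))`. It assumes the rows are already ordered by trending date, so that the first occurrence is the earliest one; note too that keying on the title treats different videos with identical titles as one.

```r
# Hypothetical trending log: titles "A" and "B" trend on more than one day.
toy <- data.frame(
  title = c("A", "B", "A", "C", "B"),
  trending_date = as.Date("2017-11-14") + c(0, 0, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Approach used above: position of the first match for each unique title.
first_by_match <- toy[match(unique(toy$title), toy$title), ]

# Alternative: drop every row whose title has already been seen.
first_by_dup <- toy[!duplicated(toy$title), ]

print(first_by_dup)
```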
head(Videos_new)
We will look for the 4 video categories with the most views by
aggregating the category_id and views columns.
category_views <- aggregate(views~category_id,Videos_new,sum)
head(category_views[order(category_views$views, decreasing = T),],4)
We will do further analysis based on these top 4 video categories.
Subset the Videos_new data to the top 4 categories and
save it to the Videos_top_4 object.
Videos_top_4 <- Videos_new[Videos_new$category_id %in% c("Entertainment", "Music", "Comedy", "Howto and Style"), ]
Create a likesp column containing likes/views and a commentp column
containing comment_count/views.
Videos_top_4$likesp <- Videos_top_4$likes/Videos_top_4$views
Videos_top_4$commentp <- Videos_top_4$comment_count /Videos_top_4$views
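The category names above are typed in by hand from the aggregation result. As a sketch of an alternative, the top categories can be taken directly from the ranked table and then used for the subset and the per-view ratios. The data frame `toy`, its values, and the choice of the top 2 instead of the top 4 are all hypothetical.

```r
# Hypothetical videos in three categories.
toy <- data.frame(
  category_id = c("Music", "Comedy", "Music", "Gaming", "Comedy"),
  views = c(1000, 400, 3000, 200, 600),
  likes = c(100, 20, 150, 10, 60),
  comment_count = c(10, 4, 60, 1, 3),
  stringsAsFactors = FALSE
)

# Total views per category, ranked from highest to lowest.
category_views <- aggregate(views ~ category_id, data = toy, FUN = sum)
ranked <- category_views[order(category_views$views, decreasing = TRUE), ]

# Take the top category names from the ranking instead of hardcoding them.
top_categories <- head(as.character(ranked$category_id), 2)

# Subset to those categories and add the per-view ratios.
toy_top <- toy[toy$category_id %in% top_categories, ]
toy_top$likesp <- toy_top$likes / toy_top$views
toy_top$commentp <- toy_top$comment_count / toy_top$views

print(top_categories)
print(toy_top)
```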
head(Videos_top_4)
See the distribution of likes per view and comments per view for each category.
ggplot(data = Videos_top_4 , mapping = aes(x = category_id , y = likesp )) +
geom_boxplot(outlier.shape = NA, fill = "black" , col = "blue", alpha = 0.5 ) +
geom_jitter( (aes(size=commentp)) , col="green",
alpha = 0.2) +
labs(title = "Likes and Comment character trending in youtube",
subtitle = "Entertainment, Music, Comedy, Howto and Style" ,
x =NULL ,
y = "likes per view" ,
size = "Comment per view",
caption = "Source: Youtube" ) +
theme_minimal()
We are also planning to collaborate with a YouTube channel that often
appears in the trending list.
We will look for YouTube channels that have at least 10 trending
videos, so that we can determine which channels would make good
collaboration partners.
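The plan (count videos per channel, keep channels at or above a threshold, sort by count) can be sketched on a hypothetical vector of channel names. Here `channels` and `min_videos` are illustrative; the threshold of 2 stands in for the 10 used in this analysis.

```r
# Hypothetical channel of each trending video (one entry per video).
channels <- c("A", "B", "A", "C", "A", "B")
min_videos <- 2  # stands in for the threshold of 10 used in the text

# Count videos per channel and turn the table into a data frame.
counts <- as.data.frame(table(channels), stringsAsFactors = FALSE)
colnames(counts) <- c("Title", "Freq")

# Keep channels at or above the threshold, highest count first.
frequent <- counts[counts$Freq >= min_videos, ]
frequent <- frequent[order(frequent$Freq, decreasing = TRUE), ]

print(frequent)
```

Because Videos_new keeps one row per title, the count obtained this way is the number of distinct trending titles per channel, not the number of days on the trending list.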
Count the video frequency of each channel
Videos_10chan <- as.data.frame(table(Videos_new$channel_title))
colnames(Videos_10chan) <- c("Title", "Freq")
Filter for channels that have a frequency >= 10.
Videos_10chan <- Videos_10chan[Videos_10chan$Freq >= 10 , ]
head(Videos_10chan, 10)
Sort from highest to lowest frequency and take the top 10 rows of
Videos_10chan.
Videos_10chan <- head(Videos_10chan[order(Videos_10chan$Freq, decreasing=T), ], 10)
Visualization.
ggplot(data = Videos_10chan[1:10,], mapping = aes(x= Freq, y= reorder(Title,Freq))) +
geom_col(aes(fill = Freq)) +
labs(
title = "Top 10 Trending Channel Youtube",
x = "Video Count",
y = "Channel Title",
caption = "Source: Youtube"
) +
scale_fill_gradient(low = "purple", high = "green") +
geom_label(mapping = aes(label = Freq),
col = "blue",
nudge_x = -1) +
theme_minimal() +
theme(legend.position = "none") +
geom_vline(xintercept = mean(Videos_10chan$Freq), col = "white") +
scale_x_continuous(breaks=seq(0,35,5))
We will find out which category has the highest number of trending videos and how those videos are split between the two publish periods (Day/Night).
Videos_DayNight <- as.data.frame(table(Videos_new$category_id, Videos_new$publish_when))
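The columns `Var1`, `Var2` and `Freq` used in the plot come from converting a two-way table to a data frame. A minimal sketch on hypothetical data (`toy` is made up) shows the shape of the result, including the zero-count combination that the conversion keeps:

```r
# Hypothetical videos with a category and a publish period.
toy <- data.frame(
  category = c("Music", "Music", "Comedy", "Music"),
  when = c("Day", "Night", "Night", "Night"),
  stringsAsFactors = FALSE
)

# Two-way frequency table, flattened to one row per combination.
# Passing columns as toy$... leaves the dimensions unnamed,
# so the data frame columns are called Var1 and Var2.
counts <- as.data.frame(table(toy$category, toy$when), stringsAsFactors = FALSE)

print(counts)
```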
head(Videos_DayNight)
Visualization.
ggplot(data = Videos_DayNight, mapping = aes(x = Freq, y = reorder(Var1, Freq))) +
geom_col(mapping = aes(fill = Var2), position = "stack") +
labs(x = "Video Count", y = NULL,
fill = NULL,
title = "Categories with Highest Trending Videos",
subtitle = "Colored per Publish Hour") +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
theme(legend.position = "top")
## 3.4. Publish Time & Views
We will visualize the average number of views per publish_hour for
the top 4 categories.
Videos_TimeViews <- aggregate(views ~ category_id + publish_hour,
data = Videos_top_4,
FUN = mean)
Visualization.
ggplot(data = Videos_TimeViews, mapping = aes(x = publish_hour, y = views)) +
geom_line(aes(group = category_id,
col = category_id)) +
labs(x = "Publish Hour", y = "Views",
fill = NULL,
title = "Publish Hour & Views") +
geom_point(aes(col = category_id)) +
theme_minimal()
Based on the exploration of the data above, we can make the
following observations:
1. The Music category has the highest engagement. Its likes per view
are higher than those of the other categories, as the median in the
boxplot shows. Of the four categories, Music also has the highest
comments per view, as the size of the jitter points shows.
2. The top 10 trending YouTube channels have a trending-video count
greater than the overall trending average.
3. The Entertainment category has the highest number of trending
videos, taking both publish periods (Day and Night) together.
4. The Music category has the highest average views at prime time.
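The first observation is read off the boxplot. A numerical check of the medians could look like the following sketch; the ratios in `toy` are made up for illustration and are not taken from the dataset.

```r
# Hypothetical likes-per-view ratios for two categories.
toy <- data.frame(
  category_id = c("Music", "Music", "Music", "Comedy", "Comedy", "Comedy"),
  likesp = c(0.08, 0.10, 0.30, 0.02, 0.04, 0.05),
  stringsAsFactors = FALSE
)

# Median likes per view for each category, highest first.
median_likesp <- tapply(toy$likesp, toy$category_id, median)
print(sort(median_likesp, decreasing = TRUE))
```

Applied to Videos_top_4, the same `tapply` call on `likesp` and `category_id` would give the medians that the boxplot displays.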
Conclusion: From the four observations above, the Music category is
the most recommended category for new YouTubers to build a channel
around. Likewise, active YouTubers can add Music content to their
channel to increase its likes and views.