The data that we are going to analyze is USvideos.csv, which contains trending YouTube videos from the US.
The USvideos dataset contains 12 columns, which are:
1. trending_date: the date the video was trending (format: YY.DD.MM)
2. title: video title
3. channel_title: channel name
4. category_id: video category (numeric code)
5. publish_time: the date and time the video was uploaded (format: YYYY-MM-DDTHH:MM:SSZ)
6. views: total views
7. likes: total likes
8. dislikes: total dislikes
9. comment_count: total comments
10. comments_disabled: whether comments are disabled
11. ratings_disabled: whether ratings are disabled
12. video_error_or_removed: whether the video was removed or had an error
Make sure that the data file is placed in the same folder as our R project. Use the read.csv() function to read the CSV file into R, then save the result as the video object.
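Before reading the file, an optional check of the working directory can help if read.csv() cannot find it. This is only a sketch and not part of the original steps; the path in the comment is hypothetical.
getwd()                            # confirm where R is looking for files
# setwd("path/to/your/R/project")  # adjust only if the CSV lives elsewhere (hypothetical path)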
video <- read.csv("USvideos.csv")
Instead of looking at the whole data, it’s better for us to “peek” at some rows that can represent the overall shape of the data.
To see the first few rows of the data, we use the head()
function.
head(video)
To see the last few rows of the data, we use the tail() function.
tail(video)
To check the number of rows and columns, we use the dim() function.
dim(video)
## [1] 13400    12
To list the column names, we use the names() function.
names(video)
##  [1] "trending_date"          "title"                  "channel_title"
##  [4] "category_id"            "publish_time"           "views"
##  [7] "likes"                  "dislikes"               "comment_count"
## [10] "comments_disabled"      "ratings_disabled"       "video_error_or_removed"
From the inspection above, we can conclude that the dataset consists of 13400 rows and 12 columns: trending_date, title, channel_title, category_id, publish_time, views, likes, dislikes, comment_count, comments_disabled, ratings_disabled, and video_error_or_removed.
We want to check the data type of each column of our dataset using the str() function.
str(video)
## 'data.frame': 13400 obs. of 12 variables:
## $ trending_date : chr "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
## $ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
## $ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
## $ category_id : int 22 24 23 24 24 28 24 28 1 25 ...
## $ publish_time : chr "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
## $ views : int 748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
## $ likes : int 57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
## $ dislikes : int 2966 6146 5339 666 1989 511 2445 778 119 1363 ...
## $ comment_count : int 15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
## $ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
Some of the columns do not have the correct data type. We need to convert trending_date and publish_time into date types and category_id into a factor.
To convert them into date types, we will use the lubridate package.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
trending_date column
Before:
str(video$trending_date)
## chr [1:13400] "17.14.11" "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
After:
video$trending_date <- ydm(video$trending_date)
head(video$trending_date)
## [1] "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14"
## [6] "2017-11-14"
class(video$trending_date)
## [1] "Date"
The data type has been changed to Date.
publish_time column
Before:
str(video$publish_time)
## chr [1:13400] "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" ...
After:
video$publish_time <- ymd_hms(video$publish_time)
head(video$publish_time)
## [1] "2017-11-13 17:13:01 UTC" "2017-11-13 07:30:00 UTC"
## [3] "2017-11-12 19:05:24 UTC" "2017-11-13 11:00:04 UTC"
## [5] "2017-11-12 18:01:41 UTC" "2017-11-13 19:07:23 UTC"
We are going to analyze US trending YouTube videos, so we need to change the timezone to the New York timezone.
video$publish_time <- ymd_hms(video$publish_time, tz = "America/New_York")
## Warning: 5 failed to parse.
head(video$publish_time)
## [1] "2017-11-13 17:13:01 EST" "2017-11-13 07:30:00 EST"
## [3] "2017-11-12 19:05:24 EST" "2017-11-13 11:00:04 EST"
## [5] "2017-11-12 18:01:41 EST" "2017-11-13 19:07:23 EST"
class(video$publish_time)
## [1] "POSIXct" "POSIXt"
The data type has been changed to POSIXct/POSIXt, which is the data type for dates and times.
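Note that re-parsing an already-parsed date-time column is what produced the "5 failed to parse" warning above. An alternative worth considering is lubridate's with_tz(), which changes the timezone of an existing POSIXct vector without re-parsing it. This is only a sketch (left commented out because publish_time has already been converted above); with_tz() keeps the same instant in time and only changes how it is displayed.
# Alternative sketch: convert the timezone of the already-parsed UTC date-times
# without re-parsing them.
# video$publish_time <- with_tz(video$publish_time, tzone = "America/New_York")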
We are going to change the data type of category_id into a factor.
video$category_id <- as.factor(video$category_id)
str(video$category_id)
## Factor w/ 16 levels "1","2","10","15",..: 8 10 9 10 10 14 10 14 1 11 ...
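As an optional check (not part of the original output), levels() lists the distinct category codes that are now stored as factor levels:
# Optional check: the 16 distinct category codes stored as factor levels
levels(video$category_id)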
category_id Transformation
video[, c("channel_title", "category_id")]
As we can see from the table above, the category_id column consists only of numeric codes, which can be hard for readers to understand.
We want to recode the category_id column into descriptive labels that are more intuitive and informative for readers, using the sapply() and switch() functions.
video$category_id <- sapply(X = as.character(video$category_id),
FUN = switch,
"1" = "Film and Animation",
"2" = "Autos and Vehicles",
"10" = "Music",
"15" = "Pets and Animals",
"17" = "Sports",
"19" = "Travel and Events",
"20" = "Gaming",
"22" = "People and Blogs",
"23" = "Comedy",
"24" = "Entertainment",
"25" = "News and Politics",
"26" = "Howto and Style",
"27" = "Education",
"28" = "Science and Technology",
"29" = "Nonprofit and Activism",
"43" = "Shows")
video[,c("channel_title", "category_id")]We want to see what video has the highest views.
video[video$views == max(video$views), c("title", "views")]
YouTube Rewind: The Shape of 2017 | #YouTubeRewind is the most viewed video, with a total of 149376127 views.
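If several rows happened to tie for the maximum, the filter above would return all of them. As a small aside (a sketch, not part of the original analysis), which.max() returns only the first such row:
# Aside: which.max() gives the row index of the first maximum value
video[which.max(video$views), c("title", "views")]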
In our dataset, several videos are recorded multiple times because they trended for more than one day.
length(video$title)
## [1] 13400
length(unique(video$title))
## [1] 2986
There are a total of 13400 trending video records, covering 2986 unique videos.
We want to create new columns containing the like ratio and dislike ratio (likes and dislikes divided by total views).
video$like_ratio <- video$likes / video$views
video$dislike_ratio <- video$dislikes / video$views
head(video)
To keep only the first occurrence of each title, we will use the match(vector1, vector2) function, which returns the position of the first match of vector1 in vector2. Then, save the result as the unique_video object.
unique_video <- video[match(unique(video$title), video$title),]
unique_video [, c("trending_date", "title")]From the table above, we could see when each videos was trending.
We want to check whether the unique_video object only contains unique video titles.
length(unique_video$title)
## [1] 2986
length(unique(unique_video$title))
## [1] 2986
The unique_video object only contains unique video titles.
We want to know the relationship between the like ratio and the dislike ratio of the Autos and Vehicles category.
First, we need to prepare the data: make a new object named video_ratio containing only the category that we are going to plot.
video_ratio <- unique_video[unique_video$category_id %in% "Autos and Vehicles",]
head(video_ratio)
Lastly, we plot the data and add a fitted regression line.
plot(x = video_ratio$like_ratio, y = video_ratio$dislike_ratio)
abline(lm(video_ratio$dislike_ratio ~ video_ratio$like_ratio))
From the plot above, we can see that the like ratio and dislike ratio of the Autos and Vehicles category have a weak positive correlation.
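To attach a number to this impression (an optional check not shown in the original output), we could compute the Pearson correlation between the two ratios:
# Optional check: Pearson correlation between like ratio and dislike ratio
cor(video_ratio$like_ratio, video_ratio$dislike_ratio)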
We want to see the average views trend per day for the People and Blogs, Entertainment, and Film and Animation categories. To analyze trends, we will draw a line plot.
First, we need to make a new object named video_trend containing the categories that we are going to plot.
video_trend <- unique_video[unique_video$category_id %in% c("People and Blogs", "Entertainment", "Film and Animation"),]
video_trend
Next, we are going to calculate the average views for each category per day.
video_agg <- aggregate(x = views ~ category_id + trending_date,
data = video_trend,
FUN = mean)
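The same per-day averages could also be computed with dplyr. This is a sketch of an equivalent alternative, not an additional step; it assumes dplyr is attached, which only happens later in this analysis, and the .groups argument requires dplyr >= 1.0.
# Sketch of a dplyr equivalent of the aggregate() call above
library(dplyr)
video_agg <- video_trend %>%
  group_by(category_id, trending_date) %>%
  summarise(views = mean(views), .groups = "drop")  # .groups needs dplyr >= 1.0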
video_agg
Lastly, we are going to visualize the data using the
ggplot2 package.
library(ggplot2)
ggplot(data = video_agg, mapping = aes(x = trending_date,
y = views)) +
geom_line(aes(color = category_id)) +
geom_point(aes(color = category_id)) +
scale_y_continuous(labels = scales::number_format())+
labs(title = "Average Views Trend",
subtitle = "of the People and Blogs, Entertainment, and Film and Animation category",
x = "Trending Date",
y = "Average Views") +
theme_minimal()
From the graph above, we can see that the Entertainment and Film and Animation categories have relatively high average views, while the People and Blogs category has more stable average views compared to the other two categories.
We want to see the 10 most trending channels on YouTube, based on how many of their videos appear in the trending list.
trending_channel <- as.data.frame(table(unique_video$channel_title))
names(trending_channel) <- c("channel_title", "Freq")
highest_trending_channel <- trending_channel[order(trending_channel$Freq, decreasing = TRUE),]
highest_trending_channel <- highest_trending_channel[1:10, ]
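For comparison, the same top-10 table could be built with dplyr (a sketch of an equivalent alternative, assuming dplyr is attached):
# Sketch of a dplyr equivalent: count trending videos per channel, keep the top 10
library(dplyr)
highest_trending_channel <- unique_video %>%
  count(channel_title, name = "Freq", sort = TRUE) %>%
  slice(1:10)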
highest_trending_channel
Now, we just need to plot the graph.
library(scales)
ggplot(data = highest_trending_channel, mapping = aes(y = reorder(channel_title,Freq), x = Freq)) +
geom_col(aes(fill = Freq)) +
geom_label(aes(label = Freq), nudge_x = 2) +
scale_fill_gradient(high = "#9c3247", low = "#f7bcc8") +
labs(title = "Top 10 Trending Channels",
subtitle = "across all categories",
x = "Total Trending Videos",
y = NULL,
fill = "Total Trending Videos") +
theme_minimal()
From the graph above, we can see the top 10 trending channels ranked by the number of their videos that appeared in the trending list.
Next, we rank every category by how many trending videos it contains.
trending_category <- as.data.frame(table(unique_video$category_id))
names(trending_category) <- c("category_id", "Freq")
highest_trending_category <- trending_category[order(trending_category$Freq, decreasing = TRUE),]
highest_trending_category
ggplot(data = highest_trending_category, mapping = aes(y = reorder(category_id, Freq), x = Freq)) +
geom_col(aes(fill = Freq)) +
geom_label(aes(label = Freq), nudge_x = 2) +
scale_fill_gradient(high = "#243f80", low = "#a6b8e0") +
labs(title = "Trending Ranking of Each Categories",
x = "Total Trending Videos",
y = NULL,
fill = "Total Trending Videos") +
theme_minimal()
The graph above shows that the Entertainment category has the most trending YouTube videos, while the Shows category has the fewest.
We want to see which category has the highest total views, likes, dislikes, and comments.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
engagement <- unique_video %>%
group_by(category_id) %>%
summarise(total_views = sum(views),
total_likes = sum(likes),
total_dislikes = sum(dislikes),
total_comments = sum(comment_count))
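If we wanted the categories ordered by total views, an arrange() step could be appended (an optional sketch, left commented out because it is not part of the original analysis):
# Optional sketch: order the summary by total views, highest first
# engagement <- engagement %>% arrange(desc(total_views))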
engagement
From our analysis, we can conclude that the Entertainment category has the most trending YouTube videos: it also has the highest total views, total likes, total dislikes, and total comments. In other words, the Entertainment category draws the most engagement, which is consistent with it being the most trending category.
The like ratio and dislike ratio of a video (at least in the Autos and Vehicles category we examined) have a weak positive correlation: an increase in the like ratio is associated with a slight increase in the dislike ratio, and vice versa, although the relationship is not very consistent or pronounced.