The data that we are going to analyze is USvideos.csv, which contains trending YouTube videos from the US.
The USvideos dataset contains 12 columns, which are:
1. trending_date: the date the video was trending (format: YY.DD.MM)
2. title: video title
3. channel_title: channel name
4. category_id: video category (numeric code)
5. publish_time: the date and time the video was uploaded (format: YYYY-MM-DDTHH:MM:SSZ)
6. views: total views
7. likes: total likes
8. dislikes: total dislikes
9. comment_count: total comments
10. comments_disabled: whether comments are disabled
11. ratings_disabled: whether ratings are disabled
12. video_error_or_removed: whether the video was removed or had an error
Make sure that the data file is placed in the same folder as our R project. Use the read.csv() function to read the CSV file into R, then save the result as the video object.
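Before reading the file, an optional check of the working directory can help if read.csv() cannot find it. This is only a sketch and not part of the original steps; the path in the comment is hypothetical.
getwd()                            # confirm where R is looking for files
# setwd("path/to/your/R/project")  # adjust only if the CSV lives elsewhere (hypothetical path)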
video <- read.csv("USvideos.csv")
Instead of looking at the whole data, it’s better for us to “peek” at some rows that can represent the overall shape of the data.
To see the first few rows of the data, we use the head()
function.
head(video)
To see the last few rows of the data, we use the tail() function.
tail(video)
To check the number of rows and columns, we use the dim() function.
dim(video)
## [1] 13400    12
To list the column names, we use the names() function.
names(video)
##  [1] "trending_date"          "title"                  "channel_title"
##  [4] "category_id"            "publish_time"           "views"
##  [7] "likes"                  "dislikes"               "comment_count"
## [10] "comments_disabled"      "ratings_disabled"       "video_error_or_removed"
From the inspection above, we can conclude that the dataset consists of 13400 rows and 12 columns: trending_date, title, channel_title, category_id, publish_time, views, likes, dislikes, comment_count, comments_disabled, ratings_disabled, and video_error_or_removed.
We want to check the data type of each column of our dataset using the str() function.
str(video)
## 'data.frame': 13400 obs. of 12 variables:
## $ trending_date : chr "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
## $ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
## $ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
## $ category_id : int 22 24 23 24 24 28 24 28 1 25 ...
## $ publish_time : chr "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
## $ views : int 748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
## $ likes : int 57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
## $ dislikes : int 2966 6146 5339 666 1989 511 2445 778 119 1363 ...
## $ comment_count : int 15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
## $ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
Some of the columns do not have the correct data type. We need to convert trending_date and publish_time into date types and category_id into a factor.
To convert them into date types, we will use the lubridate package.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
trending_date column
Before:
str(video$trending_date)
## chr [1:13400] "17.14.11" "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
After:
video$trending_date <- ydm(video$trending_date)
head(video$trending_date)
## [1] "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14"
## [6] "2017-11-14"
class(video$trending_date)
## [1] "Date"
The data type has been changed to Date.
publish_time column
Before:
str(video$publish_time)
## chr [1:13400] "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" ...
After:
video$publish_time <- ymd_hms(video$publish_time)
head(video$publish_time)
## [1] "2017-11-13 17:13:01 UTC" "2017-11-13 07:30:00 UTC"
## [3] "2017-11-12 19:05:24 UTC" "2017-11-13 11:00:04 UTC"
## [5] "2017-11-12 18:01:41 UTC" "2017-11-13 19:07:23 UTC"
We are going to analyze US trending YouTube videos, so we need to change the timezone to the New York timezone.
video$publish_time <- ymd_hms(video$publish_time, tz = "America/New_York")
## Warning: 5 failed to parse.
head(video$publish_time)
## [1] "2017-11-13 17:13:01 EST" "2017-11-13 07:30:00 EST"
## [3] "2017-11-12 19:05:24 EST" "2017-11-13 11:00:04 EST"
## [5] "2017-11-12 18:01:41 EST" "2017-11-13 19:07:23 EST"
class(video$publish_time)
## [1] "POSIXct" "POSIXt"
The data type has been changed to POSIXct/POSIXt, which is the data type for dates and times.
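Note that re-parsing an already-parsed date-time column is what produced the "5 failed to parse" warning above. An alternative worth considering is lubridate's with_tz(), which changes the timezone of an existing POSIXct vector without re-parsing it. This is only a sketch (left commented out because publish_time has already been converted above); with_tz() keeps the same instant in time and only changes how it is displayed.
# Alternative sketch: convert the timezone of the already-parsed UTC date-times
# without re-parsing them.
# video$publish_time <- with_tz(video$publish_time, tzone = "America/New_York")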
We are going to change the data type of category_id into a factor.
video$category_id <- as.factor(video$category_id)
str(video$category_id)
## Factor w/ 16 levels "1","2","10","15",..: 8 10 9 10 10 14 10 14 1 11 ...
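As an optional check (not part of the original output), levels() lists the distinct category codes that are now stored as factor levels:
# Optional check: the 16 distinct category codes stored as factor levels
levels(video$category_id)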
category_id Transformation
video[, c("channel_title", "category_id")]
As we can see from the table above, the category_id column consists only of numeric codes, which can be hard for readers to understand.
We want to recode the category_id column into descriptive labels that are more intuitive and informative for readers, using the sapply() and switch() functions.
video$category_id <- sapply(X = as.character(video$category_id),
FUN = switch,
"1" = "Film and Animation",
"2" = "Autos and Vehicles",
"10" = "Music",
"15" = "Pets and Animals",
"17" = "Sports",
"19" = "Travel and Events",
"20" = "Gaming",
"22" = "People and Blogs",
"23" = "Comedy",
"24" = "Entertainment",
"25" = "News and Politics",
"26" = "Howto and Style",
"27" = "Education",
"28" = "Science and Technology",
"29" = "Nonprofit and Activism",
"43" = "Shows")
video[,c("channel_title", "category_id")]We want to see what video has the highest views.
video[video$views == max(video$views), c("title", "views")]
YouTube Rewind: The Shape of 2017 | #YouTubeRewind is the most viewed video, with a total of 149376127 views.
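If several rows happened to tie for the maximum, the filter above would return all of them. As a small aside (a sketch, not part of the original analysis), which.max() returns only the first such row:
# Aside: which.max() gives the row index of the first maximum value
video[which.max(video$views), c("title", "views")]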
In our dataset, several videos are recorded multiple times because they trended for more than one day.
length(video$title)
## [1] 13400
length(unique(video$title))
## [1] 2986
There are a total of 13400 trending video records, covering 2986 unique videos.
We want to create new columns containing the like ratio and dislike ratio (likes and dislikes divided by total views).
video$like_ratio <- video$likes / video$views
video$dislike_ratio <- video$dislikes / video$views
head(video)
To keep only the first occurrence of each title, we will use the match(vector1, vector2) function, which returns the position of the first match of vector1 in vector2. Then, save the result as the unique_video object.
unique_video <- video[match(unique(video$title), video$title),]
unique_video [, c("trending_date", "title")]From the table above, we could see when each videos was trending.
We want to check whether the unique_video object only contains unique video titles.
length(unique_video$title)
## [1] 2986
length(unique(unique_video$title))
## [1] 2986
The unique_video object only contains unique video titles.
We want to know the relationship between the like ratio and the dislike ratio of the Autos and Vehicles category.
First, we need to prepare the data: make a new object named video_ratio containing only the category that we are going to plot.
video_ratio <- unique_video[unique_video$category_id %in% "Autos and Vehicles",]
head(video_ratio)
Lastly, we plot the data and add a fitted regression line.
plot(x = video_ratio$like_ratio, y = video_ratio$dislike_ratio)
abline(lm(video_ratio$dislike_ratio ~ video_ratio$like_ratio))
From the plot above, we can see that the like ratio and dislike ratio of the Autos and Vehicles category have a weak positive correlation.
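To attach a number to this impression (an optional check not shown in the original output), we could compute the Pearson correlation between the two ratios:
# Optional check: Pearson correlation between like ratio and dislike ratio
cor(video_ratio$like_ratio, video_ratio$dislike_ratio)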
We want to see the average views trend per day for the People and Blogs, Entertainment, and Film and Animation categories. To analyze trends, we will draw a line plot.
First, we need to make a new object named video_trend containing the categories that we are going to plot.
video_trend <- unique_video[unique_video$category_id %in% c("People and Blogs", "Entertainment", "Film and Animation"),]
video_trend
Next, we are going to calculate the average views for each category per day.
video_agg <- aggregate(x = views ~ category_id + trending_date,
data = video_trend,
FUN = mean)
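The same per-day averages could also be computed with dplyr. This is a sketch of an equivalent alternative, not an additional step; it assumes dplyr is attached, which only happens later in this analysis, and the .groups argument requires dplyr >= 1.0.
# Sketch of a dplyr equivalent of the aggregate() call above
library(dplyr)
video_agg <- video_trend %>%
  group_by(category_id, trending_date) %>%
  summarise(views = mean(views), .groups = "drop")  # .groups needs dplyr >= 1.0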
video_agg
Lastly, we are going to visualize the data using the
ggplot2 package.
library(ggplot2)
ggplot(data = video_agg, mapping = aes(x = trending_date,
y = views)) +
geom_line(aes(color = category_id)) +
geom_point(aes(color = category_id)) +
scale_y_continuous(labels = scales::number_format())+
labs(title = "Average Views Trend",
subtitle = "of the People and Blogs, Entertainment, and Film and Animation category",
x = "Trending Date",
y = "Average Views") +
theme_minimal()
From the graph above, we can see that the Entertainment and Film and Animation categories have relatively high average views, while the People and Blogs category has more stable average views compared to the other two categories.
We want to see the 10 most trending channels on YouTube, based on how many of their videos appear in the trending list.
trending_channel <- as.data.frame(table(unique_video$channel_title))
names(trending_channel) <- c("channel_title", "Freq")
highest_trending_channel <- trending_channel[order(trending_channel$Freq, decreasing = TRUE),]
highest_trending_channel <- highest_trending_channel[1:10, ]
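For comparison, the same top-10 table could be built with dplyr (a sketch of an equivalent alternative, assuming dplyr is attached):
# Sketch of a dplyr equivalent: count trending videos per channel, keep the top 10
library(dplyr)
highest_trending_channel <- unique_video %>%
  count(channel_title, name = "Freq", sort = TRUE) %>%
  slice(1:10)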
highest_trending_channel
Now, we just need to plot the graph.
library(scales)
ggplot(data = highest_trending_channel, mapping = aes(y = reorder(channel_title,Freq), x = Freq)) +
geom_col(aes(fill = Freq)) +
geom_label(aes(label = Freq), nudge_x = 2) +
scale_fill_gradient(high = "#9c3247", low = "#f7bcc8") +
labs(title = "Top 10 Trending Channels",
subtitle = "across all categories",
x = "Total Trending Videos",
y = NULL,
fill = "Total Trending Videos") +
theme_minimal()
From the graph above, we can see the top 10 trending channels ranked by the number of their videos that appeared in the trending list.
Next, we rank every category by how many trending videos it contains.
trending_category <- as.data.frame(table(unique_video$category_id))
names(trending_category) <- c("category_id", "Freq")
highest_trending_category <- trending_category[order(trending_category$Freq, decreasing = TRUE),]
highest_trending_category
ggplot(data = highest_trending_category, mapping = aes(y = reorder(category_id, Freq), x = Freq)) +
geom_col(aes(fill = Freq)) +
geom_label(aes(label = Freq), nudge_x = 2) +
scale_fill_gradient(high = "#243f80", low = "#a6b8e0") +
labs(title = "Trending Ranking of Each Categories",
x = "Total Trending Videos",
y = NULL,
fill = "Total Trending Videos") +
theme_minimal()
The graph above shows that the Entertainment category has the most trending YouTube videos, while the Shows category has the fewest.
We want to see which category has the highest total views, likes, dislikes, and comments.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
engagement <- unique_video %>%
group_by(category_id) %>%
summarise(total_views = sum(views),
total_likes = sum(likes),
total_dislikes = sum(dislikes),
total_comments = sum(comment_count))
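If we wanted the categories ordered by total views, an arrange() step could be appended (an optional sketch, left commented out because it is not part of the original analysis):
# Optional sketch: order the summary by total views, highest first
# engagement <- engagement %>% arrange(desc(total_views))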
engagement
From our analysis, we can conclude that the Entertainment category has the most trending YouTube videos: it also has the highest total views, total likes, total dislikes, and total comments. In other words, the Entertainment category draws the most engagement, which is consistent with it being the most trending category.
The like ratio and dislike ratio of a video (at least in the Autos and Vehicles category we examined) have a weak positive correlation: an increase in the like ratio is associated with a slight increase in the dislike ratio, and vice versa, although the relationship is not very consistent or pronounced.