1 Explanation

1.1 Brief Explanation about the Data

The data that we are going to analyze is USvideos.csv, a dataset of trending YouTube videos in the US.

1.2 Column Description

The USvideos dataset contains 12 columns:

  1. trending_date: trending date (format: YY.DD.MM)

  2. title: video title

  3. channel_title: channel name

  4. category_id: video category

  5. publish_time: the date and time the video was uploaded (format: YYYY-MM-DDTHH:MM:SS.sssZ, in UTC)

  6. views: total views

  7. likes: total likes

  8. dislikes: total dislikes

  9. comment_count: total comments

  10. comments_disabled: whether comments are disabled

  11. ratings_disabled: whether ratings are disabled

  12. video_error_or_removed: whether the video had an error or was removed

2 Input Data

Make sure that the data is placed in the same folder as our R project. Use the read.csv() function to read the CSV file into R, and store the result in an object named video.

video <- read.csv("USvideos.csv")

2.1 Data Inspection

Instead of looking at the whole data, it’s better for us to “peek” at some rows that can represent the overall shape of the data.

To see the first few rows of the data, we use the head() function.

head(video)

To see the last few rows of the data, we use the tail() function.

tail(video)
dim(video)
## [1] 13400    12
names(video)
##  [1] "trending_date"          "title"                  "channel_title"         
##  [4] "category_id"            "publish_time"           "views"                 
##  [7] "likes"                  "dislikes"               "comment_count"         
## [10] "comments_disabled"      "ratings_disabled"       "video_error_or_removed"

From the inspection above, we can conclude that:

  • Our data has 13400 rows and 12 columns
  • The names of the columns are: trending_date, title, channel_title, category_id, publish_time, views, likes, dislikes, comment_count, comments_disabled, ratings_disabled, video_error_or_removed.

3 Data Wrangling

We want to check the data type of each column of our dataset using the str() function.

str(video)
## 'data.frame':    13400 obs. of  12 variables:
##  $ trending_date         : chr  "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
##  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
##  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
##  $ category_id           : int  22 24 23 24 24 28 24 28 1 25 ...
##  $ publish_time          : chr  "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
##  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
##  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
##  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
##  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
##  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

Some of the columns do not have the correct data type. We need to convert trending_date and publish_time to date types and category_id to a factor.

3.1 Change Date Data Type

To convert to a date data type, we will use the lubridate package.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

  1. trending_date column

Before:

str(video$trending_date)
##  chr [1:13400] "17.14.11" "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...

After:

video$trending_date <- ydm(video$trending_date)
head(video$trending_date)
## [1] "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14"
## [6] "2017-11-14"
class(video$trending_date)
## [1] "Date"

The data type has been changed into Date.
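
As a cross-check on the YY.DD.MM layout, the same conversion can be sketched in base R. This is only an illustration under the assumption that every value follows that layout; `raw_dates` and `parsed` are made-up names, and the two strings are toy values rather than rows taken from the dataset.

```r
# Toy input in the same YY.DD.MM layout as trending_date (illustrative values)
raw_dates <- c("17.14.11", "18.02.01")

# %y = two-digit year, %d = day of month, %m = month
parsed <- as.Date(raw_dates, format = "%y.%d.%m")

parsed         # printed in ISO order: year-month-day
class(parsed)  # "Date"
```

Spelling out the format string makes the unusual day-before-month order explicit, which is the same thing ydm() encodes in its name.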

  2. publish_time column

Before:

str(video$publish_time)
##  chr [1:13400] "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" ...

After:

video$publish_time <- ymd_hms(video$publish_time)
head(video$publish_time)
## [1] "2017-11-13 17:13:01 UTC" "2017-11-13 07:30:00 UTC"
## [3] "2017-11-12 19:05:24 UTC" "2017-11-13 11:00:04 UTC"
## [5] "2017-11-12 18:01:41 UTC" "2017-11-13 19:07:23 UTC"

We are going to analyze US trending YouTube videos, so we convert the timestamps from UTC to the New York time zone. Because publish_time is already a date-time object, we use with_tz(), which keeps the same instant and changes only the time zone it is displayed in.

video$publish_time <- with_tz(video$publish_time, tzone = "America/New_York")
head(video$publish_time)
## [1] "2017-11-13 12:13:01 EST" "2017-11-13 02:30:00 EST"
## [3] "2017-11-12 14:05:24 EST" "2017-11-13 06:00:04 EST"
## [5] "2017-11-12 13:01:41 EST" "2017-11-13 14:07:23 EST"
class(video$publish_time)
## [1] "POSIXct" "POSIXt"

The data type has been changed into POSIXct/POSIXt, which is a data type for date & time.
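
When changing time zones, it helps to distinguish between shifting a timestamp and merely relabelling it. The sketch below uses a single illustrative timestamp in the same layout as publish_time; `utc_time`, `ny_time`, and `relabelled` are names introduced here for illustration only.

```r
library(lubridate)

# One timestamp in the same ISO 8601 layout as publish_time (illustrative value)
utc_time <- ymd_hms("2017-11-13T17:13:01.000Z")

# with_tz() keeps the instant and changes only the zone it is displayed in
ny_time <- with_tz(utc_time, tzone = "America/New_York")

# force_tz() keeps the clock reading and therefore moves the instant
relabelled <- force_tz(utc_time, tzone = "America/New_York")

ny_time                                                       # 12:13:01 EST
ny_time == utc_time                                           # TRUE: same instant
as.numeric(difftime(relabelled, utc_time, units = "hours"))   # 5
```

Since the raw values end in "Z" (UTC), shifting with with_tz() is the variant that preserves the actual upload moment.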

3.2 Change Factor Data Type

We are going to change the data type of category_id into factor.

video$category_id <- as.factor(video$category_id)
str(video$category_id)
##  Factor w/ 16 levels "1","2","10","15",..: 8 10 9 10 10 14 10 14 1 11 ...

3.3 category_id Transformation

video[,c("channel_title", "category_id")]

As we can see from the table above, the category_id column only consists of numbers which can be hard to understand for readers.

We want to replace the numbers in the category_id column with category names that are more intuitive and informative for readers, using the sapply() and switch() functions.

video$category_id <- sapply(X = as.character(video$category_id),
                            FUN = switch,
                            "1" = "Film and Animation",
                            "2" = "Autos and Vehicles",
                            "10" = "Music",
                            "15" = "Pets and Animals",
                            "17" = "Sports",
                            "19" = "Travel and Events",
                            "20" = "Gaming",
                            "22" = "People and Blogs",
                            "23" = "Comedy",
                            "24" = "Entertainment",
                            "25" = "News and Politics",
                            "26" = "Howto and Style",
                            "27" = "Education",
                            "28" = "Science and Technology",
                            "29" = "Nonprofit and Activism",
                            "43" = "Shows")

video[,c("channel_title", "category_id")]
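
An alternative to sapply() with switch() is a named lookup vector, sketched below. The mapping is copied from the call above; `category_labels`, `ids`, and `labels` are hypothetical names, and the ids are toy values.

```r
# Lookup table: names are category ids, values are readable labels
category_labels <- c(
  "1"  = "Film and Animation",     "2"  = "Autos and Vehicles",
  "10" = "Music",                  "15" = "Pets and Animals",
  "17" = "Sports",                 "19" = "Travel and Events",
  "20" = "Gaming",                 "22" = "People and Blogs",
  "23" = "Comedy",                 "24" = "Entertainment",
  "25" = "News and Politics",      "26" = "Howto and Style",
  "27" = "Education",              "28" = "Science and Technology",
  "29" = "Nonprofit and Activism", "43" = "Shows"
)

# Toy ids standing in for the category_id column
ids <- c(22, 24, 23, 1)

# Index by name, so convert the ids to character first
labels <- unname(category_labels[as.character(ids)])
labels

# An id missing from the table yields NA instead of failing
unknown <- unname(category_labels["99"])
```

One reason to consider this form: an unmatched id becomes NA and the result stays a character vector, whereas switch() without a default returns NULL for an unmatched id, which makes sapply() return a list.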

4 Data Exploration

4.1 Videos with the Most Views

We want to see which video has the most views.

video[video$views == max(video$views), c("title", "views")]

YouTube Rewind: The Shape of 2017 | #YouTubeRewind is the most viewed video, with a total of 149376127 views.
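
Filtering with `== max(...)` returns every row tied for the maximum, while which.max() returns only the first one. The sketch below shows the difference on a toy data frame; `video_demo`, `all_max`, and `first_max` are hypothetical names and the values are invented.

```r
# Toy data: two rows share the highest view count
video_demo <- data.frame(
  title = c("Video A", "Video B", "Video C"),
  views = c(100, 250, 250),
  stringsAsFactors = FALSE
)

# Every row whose views equal the maximum
all_max <- video_demo[video_demo$views == max(video_demo$views), c("title", "views")]

# Only the first row that reaches the maximum
first_max <- video_demo[which.max(video_demo$views), c("title", "views")]

all_max
first_max
```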

4.2 Unique Video Title

In our dataset, several videos are recorded multiple times because they trended for more than one day.

length(video$title)
## [1] 13400
length(unique(video$title))
## [1] 2986

There are a total of 13400 trending records, which cover 2986 unique video titles.

4.3 Like Ratio and Dislike Ratio

We want to make new columns that consist of like ratio and dislike ratio.

video$like_ratio <-  video$likes / video$views
video$dislike_ratio <- video$dislikes / video$views
head(video)
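
The following sections use an object named unique_video, whose definition does not appear in this report. A minimal sketch of how such an object could be built is shown below, under the assumption that it keeps one row per title, namely the first trending record. The toy data frame and the `_demo` names are hypothetical.

```r
# Toy data: "Video A" and "Video B" each trend on two days
video_demo <- data.frame(
  title = c("Video A", "Video B", "Video A", "Video C", "Video B"),
  views = c(100, 50, 180, 30, 75),
  likes = c(10, 5, 20, 3, 8),
  stringsAsFactors = FALSE
)

# Ratio columns, computed the same way as in this section
video_demo$like_ratio <- video_demo$likes / video_demo$views

# Assumption: keep only the first record of each title
unique_video_demo <- video_demo[!duplicated(video_demo$title), ]

nrow(unique_video_demo)
unique_video_demo
```

To keep the last record of each title instead, duplicated() accepts `fromLast = TRUE`. Which record is kept matters, because views and likes grow while a video is trending.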

4.5 Correlation of Like Ratio and Dislike Ratio

We want to examine the relationship between the like ratio and the dislike ratio in the Autos and Vehicles category.

So, we need to prepare the data. Make a new object named video_ratio containing the category that we are going to plot.

video_ratio <- unique_video[unique_video$category_id %in% "Autos and Vehicles",]
head(video_ratio)

Lastly, we plot the data.

plot(x = video_ratio$like_ratio, y = video_ratio$dislike_ratio)
abline(lm(video_ratio$dislike_ratio ~ video_ratio$like_ratio))

From the plot above, we can see that the like ratio and dislike ratio of the Autos and Vehicles category have a weak positive correlation.
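
A visual impression of correlation can be backed by a number using cor(). The sketch below runs on invented ratios, not on the dataset, so its result says nothing about the real strength of the relationship; `ratio_demo` and `r` are hypothetical names.

```r
# Toy ratios standing in for video_ratio (invented values)
ratio_demo <- data.frame(
  like_ratio    = c(0.01, 0.02, 0.03, 0.04, 0.05),
  dislike_ratio = c(0.002, 0.001, 0.003, 0.002, 0.004)
)

# Pearson correlation coefficient, between -1 and 1
r <- cor(ratio_demo$like_ratio, ratio_demo$dislike_ratio)
r

# Slope of the fitted line, the same model that abline() draws
fit <- lm(dislike_ratio ~ like_ratio, data = ratio_demo)
coef(fit)["like_ratio"]
```

Applied to the real data, the same call would be cor(video_ratio$like_ratio, video_ratio$dislike_ratio); values close to 0 indicate a weak linear relationship.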

4.9 Total Views, Likes, Dislikes, and Comments

We want to see which category has the highest total views, likes, dislikes, and comments.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
engagement <- unique_video %>% 
  group_by(category_id) %>% 
  summarise(total_views = sum(views),
            total_likes = sum(likes),
            total_dislikes = sum(dislikes),
            total_comments = sum(comment_count))

engagement
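
To read off the leading category, the summary table can be sorted by one of the totals. The sketch below is a base R equivalent of the grouping step followed by a sort, run on invented data; `unique_video_demo` and `engagement_demo` are hypothetical names, and only two of the four totals are included to keep it short.

```r
# Toy data standing in for unique_video (invented values)
unique_video_demo <- data.frame(
  category_id = c("Music", "Entertainment", "Music", "Comedy", "Entertainment"),
  views = c(500, 900, 300, 200, 400),
  likes = c(50, 60, 30, 20, 25),
  stringsAsFactors = FALSE
)

# Sum views and likes within each category
engagement_demo <- aggregate(cbind(views, likes) ~ category_id,
                             data = unique_video_demo, FUN = sum)

# Sort so that the category with the most views comes first
engagement_demo <- engagement_demo[order(engagement_demo$views, decreasing = TRUE), ]
engagement_demo
```

With dplyr, the same ordering could be obtained by adding arrange(desc(total_views)) to the end of the pipeline above.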

5 Conclusion

From our analysis, we can conclude that Entertainment is the leading category among trending YouTube videos: it has the highest total views, total likes, total dislikes, and total comments. This high engagement makes Entertainment the most prominent trending category.

Within the Autos and Vehicles category, the like ratio and dislike ratio of a video have a weak positive correlation. An increase in the like ratio is associated with a slight increase in the dislike ratio, and vice versa, but the relationship is neither consistent nor pronounced.