Libraries

These are the libraries used in this analysis.

library(lubridate)
library(tidyr)
library(ggplot2)
library(ggthemes)
library(scales)

Data Input

First, let's import the data. YouTube (the world-famous video-sharing website) maintains a list of the top trending videos on the platform. This dataset is a daily record of the top trending YouTube videos in the US.

vids <- read.csv("USvideos.csv")
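
A side note: on versions of R older than 4.0, read.csv() converted strings to factors by default; there, passing stringsAsFactors = FALSE is needed to reproduce the character columns shown in the inspection below.

vids <- read.csv("USvideos.csv", stringsAsFactors = FALSE)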

These are the first six rows of the dataset.

head(vids)

Data Inspection

Let's check whether there are any NA values.

colSums(is.na(vids))
##          trending_date                  title          channel_title 
##                      0                      0                      0 
##            category_id           publish_time                  views 
##                      0                      0                      0 
##                  likes               dislikes          comment_count 
##                      0                      0                      0 
##      comments_disabled       ratings_disabled video_error_or_removed 
##                      0                      0                      0

Good: the data contain no NA values, so we can continue without any problems.

Before starting anything, let's explore the data. We need to make sure the data types are correct.

str(vids)
## 'data.frame':    13400 obs. of  12 variables:
##  $ trending_date         : chr  "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
##  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
##  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
##  $ category_id           : int  22 24 23 24 24 28 24 28 1 25 ...
##  $ publish_time          : chr  "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
##  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
##  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
##  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
##  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
##  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

Based on this, I need to change some of the data types to factor and date. Besides that, we can see the data is a data frame with 13400 rows and 12 columns.

Because category_id is still a number that represents a category, I need to replace it with the category's name. After that, I convert the columns to more suitable data types.

# map the numeric category IDs to their category names
vids$category_id <- sapply(as.character(vids$category_id), switch, 
                           "1" = "Film and Animation",
                           "2" = "Autos and Vehicles", 
                           "10" = "Music", 
                           "15" = "Pets and Animals", 
                           "17" = "Sports",
                           "19" = "Travel and Events", 
                           "20" = "Gaming", 
                           "22" = "People and Blogs", 
                           "23" = "Comedy",
                           "24" = "Entertainment", 
                           "25" = "News and Politics",
                           "26" = "Howto and Style", 
                           "27" = "Education",
                           "28" = "Science and Technology", 
                           "29" = "Nonprofit and Activism",
                           "43" = "Shows")

vids$category_id <- as.factor(vids$category_id)
vids$trending_date <- ydm(vids$trending_date)
vids$publish_time <- ymd_hms(vids$publish_time, tz = "America/New_York")
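
Two parsing details are worth noting: trending_date is stored in yy.dd.mm order, which is why we use ydm() rather than ymd(), and publish_time carries a Z (UTC) suffix, so lubridate parses it as UTC and then displays it in America/New_York time (this is why 17:13:01 above becomes 12:13:01 below). As a quick check of the date parsing:

ydm("17.14.11")
## [1] "2017-11-14"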

Let's check the data again.

str(vids)
## 'data.frame':    13400 obs. of  12 variables:
##  $ trending_date         : Date, format: "2017-11-14" "2017-11-14" ...
##  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
##  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
##  $ category_id           : Factor w/ 16 levels "Autos and Vehicles",..: 11 4 2 4 4 13 4 13 5 9 ...
##  $ publish_time          : POSIXct, format: "2017-11-13 12:13:01" "2017-11-13 02:30:00" ...
##  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
##  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
##  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
##  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
##  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

With this, we have successfully converted the data types and recoded category_id.

Data Wrangling & Exploration

Make a column for publish hour

Let's make a new column called publish_hour by extracting the hour from the publish_time column.

vids$publish_hour <- hour(vids$publish_time)

I want to bin publish_hour into 3 categories and store the result in a new column called publish_when. First, define the function.

pw <- function(x){
  if(x < 8){
    "12am to 8am"
  } else if(x < 16){
    "8am to 4pm"
  } else {
    "4pm to 12am"
  }
}

Then apply it to the data.

vids$publish_when <- as.factor(sapply(vids$publish_hour, pw))
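
As a side note, the same binning could be done in a single vectorized call with base R's cut() (not run here, since it orders the factor levels chronologically rather than alphabetically):

vids$publish_when <- cut(vids$publish_hour,
                         breaks = c(-1, 7, 15, 23),
                         labels = c("12am to 8am", "8am to 4pm", "4pm to 12am"))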

Make a column for publish day

I also want to add a new column with the name of the day the video was published.

vids$publish_wday <- wday(vids$publish_time,
                          label = TRUE,
                          abbr = FALSE,
                          week_start = 1)
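
With week_start = 1, Monday becomes the first level of the resulting ordered factor. As a quick sanity check, the first video was published on 2017-11-13, which was a Monday:

# should print Monday (followed by the ordered day levels)
wday(ymd("2017-11-13"), label = TRUE, abbr = FALSE, week_start = 1)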

Make a column for time to trend

Then I want to know how long a video takes to trend. So I make a new column called publish_date by extracting the date from publish_time.

vids$publish_date <- date(vids$publish_time)

After that, I can make a new column called timetotrend and convert it to a factor, capping anything longer than a week at "8+".

vids$timetotrend <- vids$trending_date - vids$publish_date
vids$timetotrend <- as.factor(ifelse(vids$timetotrend <= 7, vids$timetotrend, "8+"))
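
Note that subtracting two Date columns yields a difftime in days, and ifelse() drops that class: mixing the numeric day counts with the string "8+" coerces everything to character, which as.factor() then turns into the levels "0" through "7" plus "8+". A sketch that makes this coercion explicit (equivalent to the code above, not run here):

days <- as.numeric(vids$trending_date - vids$publish_date) # days from publish to trend
vids$timetotrend <- as.factor(ifelse(days <= 7, days, "8+")) # cap everything above a week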

Make columns for like/dislike/comment ratios

Finally, I want to make columns for the like, dislike, and comment ratios, each computed relative to views.

vids$likesratio <- vids$likes/vids$views
vids$dislikesratio <- vids$dislikes/vids$views
vids$commentratio <- vids$comment_count/vids$views
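
These ratios normalize engagement by popularity. For example, the first video in the data has 57527 likes on 748374 views, giving a likesratio of about 0.0769, i.e. roughly 7.7% of its viewers left a like.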

Filter unique data

This is the data structure.

str(vids)
## 'data.frame':    13400 obs. of  20 variables:
##  $ trending_date         : Date, format: "2017-11-14" "2017-11-14" ...
##  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
##  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
##  $ category_id           : Factor w/ 16 levels "Autos and Vehicles",..: 11 4 2 4 4 13 4 13 5 9 ...
##  $ publish_time          : POSIXct, format: "2017-11-13 12:13:01" "2017-11-13 02:30:00" ...
##  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
##  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
##  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
##  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
##  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ publish_hour          : int  12 2 14 6 13 14 0 16 9 8 ...
##  $ publish_when          : Factor w/ 3 levels "12am to 8am",..: 3 1 3 1 3 3 1 2 3 3 ...
##  $ publish_wday          : Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 1 1 7 1 7 1 7 7 1 1 ...
##  $ publish_date          : Date, format: "2017-11-13" "2017-11-13" ...
##  $ timetotrend           : Factor w/ 9 levels "0","1","2","3",..: 2 2 3 2 3 2 3 3 2 2 ...
##  $ likesratio            : num  0.0769 0.0402 0.0458 0.0296 0.0631 ...
##  $ dislikesratio         : num  0.003963 0.002541 0.001673 0.001941 0.000949 ...
##  $ commentratio          : num  0.02132 0.00525 0.00256 0.00625 0.00836 ...

But there is one problem: the data contain duplicates, because a video can trend on more than one day and therefore appears in multiple rows. Let's take only the first occurrence of each video and store it in vids.unique.

vids.unique <- vids[match(unique(vids$title), vids$title), ] # keep the first occurrence of each title
str(vids.unique)
## 'data.frame':    2986 obs. of  20 variables:
##  $ trending_date         : Date, format: "2017-11-14" "2017-11-14" ...
##  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
##  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
##  $ category_id           : Factor w/ 16 levels "Autos and Vehicles",..: 11 4 2 4 4 13 4 13 5 9 ...
##  $ publish_time          : POSIXct, format: "2017-11-13 12:13:01" "2017-11-13 02:30:00" ...
##  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
##  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
##  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
##  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
##  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ publish_hour          : int  12 2 14 6 13 14 0 16 9 8 ...
##  $ publish_when          : Factor w/ 3 levels "12am to 8am",..: 3 1 3 1 3 3 1 2 3 3 ...
##  $ publish_wday          : Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 1 1 7 1 7 1 7 7 1 1 ...
##  $ publish_date          : Date, format: "2017-11-13" "2017-11-13" ...
##  $ timetotrend           : Factor w/ 9 levels "0","1","2","3",..: 2 2 3 2 3 2 3 3 2 2 ...
##  $ likesratio            : num  0.0769 0.0402 0.0458 0.0296 0.0631 ...
##  $ dislikesratio         : num  0.003963 0.002541 0.001673 0.001941 0.000949 ...
##  $ commentratio          : num  0.02132 0.00525 0.00256 0.00625 0.00836 ...
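
As a side note, an equivalent way to keep only the first occurrence of each title is base R's duplicated():

vids.unique <- vids[!duplicated(vids$title), ]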

In the end, we have a de-duplicated data frame that contains all the information we need to answer the business questions I want to explore.