library(lubridate)
library(dplyr)
library(ggplot2)
library(scales)
library(plotly)
library(glue)
library(tidyr)

Data Background

We are a beginner YouTube channel that wants to make videos that can trend in our country. To learn how, we analyze YouTube channels in America using the USvideos.csv data file and look at what US YouTubers do to get their videos trending.

Import Data .csv

We will analyze the USvideos.csv data contained in the data_input folder. We use the read.csv() function to read the CSV file into R.

USvideo <- read.csv(file="data_input/USvideos.csv")
USvideo
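
As a small sketch of what read.csv() returns, the example below parses a two-row CSV from an inline string instead of a file. The rows reuse values that appear in the str() output further down; the names csv_text and sample_videos are made up for illustration and are not part of the analysis.

```r
# Inline CSV text standing in for data_input/USvideos.csv (illustration only).
csv_text <- 'trending_date,channel_title,views,likes
17.14.11,CaseyNeistat,748374,57527
17.14.11,LastWeekTonight,2418783,97185'

# text = lets read.csv() parse a string instead of a file path.
sample_videos <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Dates arrive as plain text and counts as integers, as in the full data set.
str(sample_videos)
```

Because trending_date is read as text, it still has to be converted to a real date type before any date arithmetic.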

Inspect Data

After importing the data, we inspect it to find out what it contains. We could use the View() function to look at the whole data set, but that takes time, so we use head() and tail() to see only the first and last rows.

head(USvideo)

tail(USvideo)

dim(USvideo)

#> [1] 13400    12

The functions anyNA() and is.na() are used to check whether the data contains missing values. These checks sound simple, but skipping them can affect later steps of the analysis, for example calculations.

anyNA(USvideo)

#> [1] FALSE

colSums(is.na(USvideo))

#>          trending_date                  title          channel_title 
#>                      0                      0                      0 
#>            category_id           publish_time                  views 
#>                      0                      0                      0 
#>                  likes               dislikes          comment_count 
#>                      0                      0                      0 
#>      comments_disabled       ratings_disabled video_error_or_removed 
#>                      0                      0                      0
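
To illustrate why this check matters, here is a minimal sketch on a toy data frame with one missing value. The data frame toy and its numbers are invented for the example.

```r
# Toy data frame with one missing value (made-up numbers).
toy <- data.frame(
  views = c(100L, 250L, NA),
  likes = c(10L, 20L, 30L)
)

anyNA(toy)            # TRUE as soon as any cell is missing
colSums(is.na(toy))   # number of missing cells per column

# Missing values propagate through calculations unless handled explicitly.
mean(toy$views)                 # NA
mean(toy$views, na.rm = TRUE)   # 175
```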

str(USvideo)

#> 'data.frame':    13400 obs. of  12 variables:
#>  $ trending_date         : chr  "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
#>  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
#>  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
#>  $ category_id           : int  22 24 23 24 24 28 24 28 1 25 ...
#>  $ publish_time          : chr  "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
#>  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
#>  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
#>  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
#>  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
#>  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#>  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#>  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

From the results of our inspections, we get some information that can support our analysis, as follows:

YouTube’s US Trending Videos is a collection of the 200 trending videos in the US per day from 2017-11-14 to 2018-01-21. The columns are described below:

trending_date: trending date
title: video title
channel_title: YouTube channel name
category_id: video category
publish_time: upload date and time
views: views
likes: likes
dislikes: dislikes
comment_count: number of comments on the video
comments_disabled: are comments disabled?
ratings_disabled: are ratings disabled?
video_error_or_removed: has the video been removed or does it have an error?
The trending_date column is stored as text in year.day.month order (for example "17.14.11"), and publish_time is stored as text in ISO 8601 format.
The data has 12 columns and 13,400 rows.
From the data we inspected, there are no missing values in any column.

Data Cleansing

Are there any unnecessary columns? Does each column already have the right data type?

Columns to discard:

comments_disabled
ratings_disabled
video_error_or_removed

Columns that need to be fixed:

category_id -> factor; later we can replace the ids with their original labels
publish_time -> date_time (POSIXct)
trending_date -> date

To save time, we will use functions from the dplyr library to do all of this in one step.

USvideo <- USvideo %>%
  select(-c(comments_disabled,ratings_disabled,video_error_or_removed)) %>%
  mutate(category_id = as.factor(category_id),
         publish_time = ymd_hms(publish_time),
         trending_date = ydm(trending_date))

str(USvideo)

#> 'data.frame':    13400 obs. of  9 variables:
#>  $ trending_date: Date, format: "2017-11-14" "2017-11-14" ...
#>  $ title        : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
#>  $ channel_title: chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
#>  $ category_id  : Factor w/ 16 levels "1","2","10","15",..: 8 10 9 10 10 14 10 14 1 11 ...
#>  $ publish_time : POSIXct, format: "2017-11-13 17:13:01" "2017-11-13 07:30:00" ...
#>  $ views        : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
#>  $ likes        : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
#>  $ dislikes     : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
#>  $ comment_count: int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...

Now each column has been converted to the desired data type, and the dataset is ready to be processed and analyzed.
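
As a cross-check on the date conversion, the sketch below parses the same text formats with base R only. It is an alternative shown for illustration, not the method used above. The value "18.21.01" is an assumed example for the last trending date, and the variable names are made up.

```r
trending_raw <- c("17.14.11", "18.21.01")
publish_raw  <- "2017-11-13T17:13:01.000Z"

# %y.%d.%m = two-digit year, day, month: the same order ydm() assumes.
trending <- as.Date(trending_raw, format = "%y.%d.%m")

# Parse the timestamp as UTC; characters after the matched format
# (here ".000Z") are ignored by strptime-style parsing.
published <- as.POSIXct(publish_raw, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

format(trending)                         # "2017-11-14" "2018-01-21"
format(published, "%Y-%m-%d %H:%M:%S")   # "2017-11-13 17:13:01"
```

Reading the format string this way makes it explicit that "17.14.11" means 14 November 2017, not a date with month 14.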

Case Study

In this section we use visualizations to see what American YouTubers are doing.

What time do American YouTubers usually publish videos?

USvideo$publish_hour <- hour(USvideo$publish_time)
hist(USvideo$publish_hour, breaks = 10)

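Answer: based on the histogram, videos are most often published between 14:00 and 17:00. Note that publish_time is recorded in UTC (the trailing "Z" in the raw values), so these are UTC hours.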
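
When a single number is passed as breaks, hist() treats it only as a suggestion, so a bin may cover more than one hour. A minimal sketch of an exact per-hour count in base R follows, using four publish_time values from the str() output. The conversion to US Eastern time at the end is an assumption about the audience's time zone, not something stated in the data.

```r
publish_raw <- c("2017-11-13T17:13:01.000Z", "2017-11-13T07:30:00.000Z",
                 "2017-11-12T19:05:24.000Z", "2017-11-13T11:00:04.000Z")
published <- as.POSIXct(publish_raw, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# Hour of day in UTC (0-23), the base R counterpart of lubridate::hour().
publish_hour <- as.integer(format(published, "%H"))

# One count per hour, including hours with no uploads.
hour_counts <- table(factor(publish_hour, levels = 0:23))
hour_counts[c("7", "11", "17", "19")]

# The same instants expressed in US Eastern time (assumed audience time zone).
format(published, "%H", tz = "America/New_York")
```

A bar plot of hour_counts then shows one bar per hour.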

On which days are most videos published?

USvideo$publish_wday <- wday(USvideo$publish_time, label = TRUE, abbr = TRUE)
plot(USvideo$publish_wday)

Answer: most videos are published on Tuesday.
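
For readers without lubridate, the weekday can also be derived in base R. The sketch below is an illustrative alternative using two timestamps from the data; the names wday_index, day_labels and the English labels are assumptions of this example.

```r
published <- as.POSIXct(c("2017-11-13 17:13:01", "2017-11-12 19:05:24"), tz = "UTC")

# POSIXlt stores the weekday as 0 (Sunday) to 6 (Saturday), independent of locale.
wday_index <- as.POSIXlt(published)$wday

day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
publish_wday <- factor(day_labels[wday_index + 1], levels = day_labels)

# Count of videos per weekday, in calendar order.
table(publish_wday)
```

Keeping the labels as an ordered set of factor levels means the bars stay in calendar order when plotted.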

How strong is the correlation between views and likes? Do more likes go together with more views?

plot(USvideo$views, USvideo$likes)

cor(USvideo$views, USvideo$likes)

#> [1] 0.8831559

Answer: the scatter plot shows that videos with more likes also tend to have more views. To put a number on this we use the cor() function and get a value of about 0.88, which indicates a strong positive correlation between likes and views. The correlation by itself does not tell us whether likes drive views or the other way around.
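
To show what cor() computes, the sketch below reproduces the Pearson coefficient by hand on the first ten views and likes values from the str() output. It also computes a rank-based (Spearman) coefficient, which is an addition for comparison and not part of the original analysis. Ten rows will not reproduce the full-data value.

```r
views <- c(748374, 2418783, 3191434, 343168, 2095731,
           119180, 2103417, 817732, 826059, 256426)
likes <- c(57527, 97185, 146033, 10172, 132235,
           9763, 15993, 23663, 3543, 12654)

# Pearson correlation is the covariance scaled by both standard deviations.
r_manual  <- cov(views, likes) / (sd(views) * sd(likes))
r_builtin <- cor(views, likes)
all.equal(r_manual, r_builtin)

# Rank-based (Spearman) correlation is less sensitive to a few very large videos.
r_spearman <- cor(views, likes, method = "spearman")
```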

How many videos were published in each category?

# change values in column category_id
USvideo$category_id <- sapply(as.character(USvideo$category_id), switch, 
                           "1" = "Film and Animation",
                           "2" = "Autos and Vehicles", 
                           "10" = "Music", 
                           "15" = "Pets and Animals", 
                           "17" = "Sports",
                           "19" = "Travel and Events", 
                           "20" = "Gaming", 
                           "22" = "People and Blogs", 
                           "23" = "Comedy",
                           "24" = "Entertainment", 
                           "25" = "News and Politics",
                           "26" = "Howto and Style", 
                           "27" = "Education",
                           "28" = "Science and Technology", 
                           "29" = "Nonprofit and Activism",
                           "43" = "Shows")

# change type to factor
USvideo$category_id <- as.factor(USvideo$category_id)

# count videos per category_id
vids_count <- USvideo %>% 
  group_by(category_id) %>% 
  summarise(count_cat = n()) %>% 
  arrange(desc(count_cat)) %>% 
  ungroup()

# build the tooltip label shown on the chart
vids_count2 <- vids_count %>% 
  mutate(label = glue("Category {category_id}
                      Video Count: {comma(count_cat)}"))

plot_count <- ggplot(vids_count2, aes(x = count_cat, y = reorder(category_id, count_cat), text = label)) +
  geom_col(aes(fill = count_cat)) +
  scale_fill_gradient(low = "blue", high = "navy") +
  labs(title = "Number of Videos by Category",
       x = "Video count",
       y = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

# convert to an interactive plot and show the glue label as the tooltip
ggplotly(plot_count, tooltip = "text")

This bar plot shows that Shows has the lowest video count, and that the top 4 categories by video count are Entertainment, Music, Howto and Style, and Comedy.
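
The recoding and counting steps can also be written with a named lookup vector and table(). This is a minimal base R sketch of the same idea, not the code used for the chart. The ids vector is a made-up sample and only three of the labels are included.

```r
# Subset of the id-to-label mapping used above.
category_labels <- c("10" = "Music", "23" = "Comedy", "24" = "Entertainment")

# A few category_id values (illustrative sample).
ids <- c(24, 23, 24, 10, 24, 23)

# Index the named vector by the id converted to text; unmatched ids become NA.
category <- unname(category_labels[as.character(ids)])

# Count videos per category and sort from most to least frequent.
vids_count_base <- sort(table(category), decreasing = TRUE)
vids_count_base
```

The lookup is vectorized, so it avoids calling switch() once per row.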

Referring to chart number 4, we see that the most published videos are Entertainment videos, but what about the number of views and likes?

Although Film and Animation is not among the 4 most published categories, it is in the top 4 categories when we look at the audience (number of views). We noted above that likes correlate strongly with views, and the top 3 reflect this: the top 3 categories by views hold the same positions as the top 3 by likes. The fourth position is the exception: Film and Animation has fewer likes than its number of views would suggest when compared with the categories ranked below it.
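
The per-category comparison of views and likes can be sketched as follows. The numbers are invented for illustration and are not taken from USvideos.csv. The names toy, by_category and likes_per_view are assumptions of this example.

```r
# Made-up numbers for illustration only; not taken from USvideos.csv.
toy <- data.frame(
  category = c("Music", "Music", "Comedy", "Comedy", "Entertainment"),
  views    = c(1000, 3000, 500, 1500, 5000),
  likes    = c(100, 200, 50, 150, 120)
)

# Sum views and likes within each category.
by_category <- aggregate(cbind(views, likes) ~ category, data = toy, FUN = sum)

# Likes per view makes categories of different sizes comparable.
by_category$likes_per_view <- by_category$likes / by_category$views

# Rank categories by total views, largest first.
by_category[order(by_category$views, decreasing = TRUE), ]
```

A category can rank high on views and still have a low likes_per_view, which is the pattern described for Film and Animation.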

Conclusion

From all the graphs above, we can draw some tentative conclusions:

American YouTubers most often publish their videos in the afternoon (14:00 to 17:00 in the recorded publish_time), presumably so that viewers can watch them after work.
From the scatter plot and the cor() function, the number of likes has a strong correlation with the number of views, which suggests that videos the audience enjoys are more likely to trend.
American YouTubers publish more of their videos in the middle of the week than on weekends.
Based on the data we visualized, the 4 categories with the highest number of videos are Entertainment, Music, Howto and Style, and Comedy, but this does not tell us how many people are interested in these categories. Although Film and Animation is outside the top 4 by published videos, it has a higher viewer count.

Data Visualization

Ruli Erinton

21 May 2022