LBB DATA VISUALIZATION

1. Intro

1.1 Greetings

Hi !! Welcome to my Rmd :) in this LBB i will use previous data which is USvideos.csv

1.2 Data’s Brief

This data is about xxxxxxx for listing in USA. and the first thing i need to do is load all package tht might be needed for this dataset.

2. Data Explanatory

2.1 Data Input & Structure

yt <- read.csv("youtubetrends.csv")

Futhermore we will inspect our data

dim(yt)
## [1] 2986   16
head(yt)
##   trending_date                                                          title
## 1    2017-11-14                             WE WANT TO TALK ABOUT OUR MARRIAGE
## 2    2017-11-14 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
## 3    2017-11-14          Racist Superman | Rudy Mancuso, King Bach & Lele Pons
## 4    2017-11-14                               Nickelback Lyrics: Real or Fake?
## 5    2017-11-14                                       I Dare You: GOING BALD!?
## 6    2017-11-14                                          2 Weeks with iPhone X
##           channel_title            category_id        publish_time   views
## 1          CaseyNeistat       People and Blogs 2017-11-13 12:13:01  748374
## 2       LastWeekTonight          Entertainment 2017-11-13 02:30:00 2418783
## 3          Rudy Mancuso                 Comedy 2017-11-12 14:05:24 3191434
## 4 Good Mythical Morning          Entertainment 2017-11-13 06:00:04  343168
## 5              nigahiga          Entertainment 2017-11-12 13:01:41 2095731
## 6              iJustine Science and Technology 2017-11-13 14:07:23  119180
##    likes dislikes comment_count comments_disabled ratings_disabled
## 1  57527     2966         15954             FALSE            FALSE
## 2  97185     6146         12703             FALSE            FALSE
## 3 146033     5339          8181             FALSE            FALSE
## 4  10172      666          2146             FALSE            FALSE
## 5 132235     1989         17518             FALSE            FALSE
## 6   9763      511          1434             FALSE            FALSE
##   video_error_or_removed publish_hour publish_when publish_wday timetotrend
## 1                  FALSE           12   8am to 3pm       Monday           1
## 2                  FALSE            2  12am to 8am       Monday           1
## 3                  FALSE           14   8am to 3pm       Sunday           2
## 4                  FALSE            6  12am to 8am       Monday           1
## 5                  FALSE           13   8am to 3pm       Sunday           2
## 6                  FALSE           14   8am to 3pm       Monday           1
tail(yt)
##      trending_date
## 2981    2018-01-21
## 2982    2018-01-21
## 2983    2018-01-21
## 2984    2018-01-21
## 2985    2018-01-21
## 2986    2018-01-21
##                                                                                      title
## 2981                                          Short-term spending bill fails in the Senate
## 2982                                                      Jessica Chastain Monologue - SNL
## 2983 Legendary Rocker Tom Petty Died From Accidental Overdose Of Medications, Coroner Says
## 2984                                                                        Mulan Hair Cut
## 2985                                                                    Urban opera singer
## 2986                                      GET READY WITH ME | Chit Chat with Winnie Harlow
##            channel_title       category_id        publish_time views likes
## 2981     Washington Post News and Politics 2018-01-19 23:43:42 13044    68
## 2982 Saturday Night Live     Entertainment 2018-01-21 00:44:30 23484   858
## 2983     CBS Los Angeles News and Politics 2018-01-19 21:10:07  3976    30
## 2984      Aymeric Favard            Sports 2014-03-15 14:21:22 16231   103
## 2985      Marcos Horacio            Sports 2018-01-19 12:26:11 13531   136
## 2986      LaToya Forever     Entertainment 2018-01-18 18:03:24 44754  3246
##      dislikes comment_count comments_disabled ratings_disabled
## 2981       22           220             FALSE            FALSE
## 2982       80           136             FALSE            FALSE
## 2983        5            45             FALSE            FALSE
## 2984        7            15             FALSE            FALSE
## 2985        5            16             FALSE            FALSE
## 2986       28           386             FALSE            FALSE
##      video_error_or_removed publish_hour publish_when publish_wday timetotrend
## 2981                  FALSE           23  3pm to 12am       Friday           1
## 2982                  FALSE            0  12am to 8am       Sunday           0
## 2983                  FALSE           21  3pm to 12am       Friday           1
## 2984                  FALSE           14   8am to 3pm     Saturday          8+
## 2985                  FALSE           12   8am to 3pm       Friday           2
## 2986                  FALSE           18  3pm to 12am     Thursday           3

From our inspection we can conclude :

  • yt data contain 2986 of rows and 16 of coloumns

  • Each of column name : “trending_date”, “title”, “channel_title”, “category_id”, “publish_time”, “views”, “dislikes”, “comment_count”, “comments_disabled”, “ratings_disabled”, “video_error_or_removed”, “publish_hour”, “publish_when”, “publish_wday”, & “timetotrend”

str(yt)
## 'data.frame':    2986 obs. of  16 variables:
##  $ trending_date         : chr  "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14" ...
##  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
##  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
##  $ category_id           : chr  "People and Blogs" "Entertainment" "Comedy" "Entertainment" ...
##  $ publish_time          : chr  "2017-11-13 12:13:01" "2017-11-13 02:30:00" "2017-11-12 14:05:24" "2017-11-13 06:00:04" ...
##  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
##  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
##  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
##  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
##  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ publish_hour          : int  12 2 14 6 13 14 0 16 9 8 ...
##  $ publish_when          : chr  "8am to 3pm" "12am to 8am" "8am to 3pm" "12am to 8am" ...
##  $ publish_wday          : chr  "Monday" "Monday" "Sunday" "Monday" ...
##  $ timetotrend           : chr  "1" "1" "2" "1" ...

as we can see here, we need to adjust some of the data types.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
yt$publish_time <- ymd_hms(yt$publish_time, tz = "America/New_York")
yt$trending_date <- date(yt$trending_date)
yt$publish_when <- as.factor(yt$publish_when)
yt$category_id<- as.factor(yt$category_id)
yt$publish_wday <- as.factor(yt$publish_wday)
yt$title <- as.factor(yt$title)
yt$channel_title <- as.factor(yt$channel_title)
yt$timetotrend <- as.factor(yt$timetotrend)

2.2 Missing Data

colSums(is.na(yt))
##          trending_date                  title          channel_title 
##                      0                      0                      0 
##            category_id           publish_time                  views 
##                      0                      0                      0 
##                  likes               dislikes          comment_count 
##                      0                      0                      0 
##      comments_disabled       ratings_disabled video_error_or_removed 
##                      0                      0                      0 
##           publish_hour           publish_when           publish_wday 
##                      0                      0                      0 
##            timetotrend 
##                      0

alright!! there are no missing values!!

2.3 Subseting and Practical Statistics

First, lets check our data’s summary

summary(yt)
##  trending_date       
##  Min.   :2017-11-14  
##  1st Qu.:2017-11-27  
##  Median :2017-12-14  
##  Mean   :2017-12-15  
##  3rd Qu.:2018-01-04  
##  Max.   :2018-01-21  
##                      
##   title                                                                                                                                                                                                                                        
##  _\xa3_\xbc\x84\xdb_\xa1\x84\xd1__\x84\x81\x84\xce_\xbc_\xa1 _\xa1\x84\xdb__\x84Є\x8f _\xe1__\x84\xd0_Ȅ\xce___\xfc_\xc8_\xa1 ___\xab\x84\xdb_\xa1_\xe1\x84\xc4 _\xab___\xa1 \x84\x81_\xb5_\xc8_\xa1 ___\xa1 _\xc1\x84\xc9___\xab\x84\xd0:   1  
##  _\xd9_\xc4 How to make Pumpkin Pie Mistakes                                                                                                                                                                                            :   1  
##  'I have dad moves': Barack Obama discusses dancing on David Letterman's new Netflix show                                                                                                                                               :   1  
##  'I have taken poison' claims war criminal                                                                                                                                                                                              :   1  
##  'Lightning Strike' B777-300 on departure @ KL743 // PH-BVS                                                                                                                                                                             :   1  
##  'Lord of the Rings' TV Series Gets Multi-Season Order At Amazon | News Flash | Entertainment Weekly                                                                                                                                    :   1  
##  (Other)                                                                                                                                                                                                                                :2980  
##                                 channel_title             category_id 
##  Refinery29                            :  31   Entertainment    :736  
##  The Tonight Show Starring Jimmy Fallon:  30   Music            :391  
##  Vox                                   :  29   Howto and Style  :285  
##  TheEllenShow                          :  28   Comedy           :273  
##  Netflix                               :  27   News and Politics:271  
##  NFL                                   :  25   People and Blogs :228  
##  (Other)                               :2816   (Other)          :802  
##   publish_time                     views              likes          
##  Min.   :2008-04-05 14:22:40   Min.   :     687   Min.   :      0.0  
##  1st Qu.:2017-11-22 10:51:03   1st Qu.:   43014   1st Qu.:    931.2  
##  Median :2017-12-11 12:41:02   Median :  149196   Median :   4785.0  
##  Mean   :2017-11-07 14:44:26   Mean   :  512602   Mean   :  25412.1  
##  3rd Qu.:2018-01-01 10:22:33   3rd Qu.:  483848   3rd Qu.:  17517.0  
##  Max.   :2018-01-21 00:44:30   Max.   :37736281   Max.   :2055137.0  
##                                                                      
##     dislikes        comment_count    comments_disabled ratings_disabled
##  Min.   :     0.0   Min.   :     0   Mode :logical     Mode :logical   
##  1st Qu.:    39.0   1st Qu.:   152   FALSE:2942        FALSE:2976      
##  Median :   148.5   Median :   589   TRUE :44          TRUE :10        
##  Mean   :  1257.7   Mean   :  3545                                     
##  3rd Qu.:   541.0   3rd Qu.:  1964                                     
##  Max.   :629120.0   Max.   :733373                                     
##                                                                        
##  video_error_or_removed  publish_hour        publish_when     publish_wday
##  Mode :logical          Min.   : 0.00   12am to 8am: 552   Friday   :478  
##  FALSE:2985             1st Qu.: 9.00   3pm to 12am: 795   Monday   :415  
##  TRUE :1                Median :12.00   8am to 3pm :1639   Saturday :269  
##                         Mean   :11.95                      Sunday   :247  
##                         3rd Qu.:16.00                      Thursday :490  
##                         Max.   :23.00                      Tuesday  :561  
##                                                            Wednesday:526  
##   timetotrend  
##  1      :1379  
##  2      : 714  
##  3      : 290  
##  4      : 189  
##  5      : 132  
##  8+     : 129  
##  (Other): 153

From summary above, we can take a note for some important information:

  1. The most favourite category is from others category, then Entertainment comes after, the least comes from people and blogs
  2. Max views from this data has 37736281, while minimum has 687
  3. Max Likes from this data has 2055137.0, while minimum has 0, while dislike has more max number with 629120.0, and the minimum dislike has the same number as Likes.
  4. from Publish_when, we can see that from 8am to 3pm has the most number, while from 12 am to 8am has the least.
  5. Tuesday becomes the favorite day to publish while on Sunday is the least facorite.

3. Study Case

These are the libraries that I am going to use

library(ggplot2) 
library(plotly) 
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(glue) 
  1. Which category has the most likes and the least likes?

First we have to find the likes ratio, and dislike ratio

yt$like_ratio <-yt$likes/yt$views
yt$dislike_ratio <- yt$dislikes/yt$views
head(yt)
##   trending_date                                                          title
## 1    2017-11-14                             WE WANT TO TALK ABOUT OUR MARRIAGE
## 2    2017-11-14 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
## 3    2017-11-14          Racist Superman | Rudy Mancuso, King Bach & Lele Pons
## 4    2017-11-14                               Nickelback Lyrics: Real or Fake?
## 5    2017-11-14                                       I Dare You: GOING BALD!?
## 6    2017-11-14                                          2 Weeks with iPhone X
##           channel_title            category_id        publish_time   views
## 1          CaseyNeistat       People and Blogs 2017-11-13 12:13:01  748374
## 2       LastWeekTonight          Entertainment 2017-11-13 02:30:00 2418783
## 3          Rudy Mancuso                 Comedy 2017-11-12 14:05:24 3191434
## 4 Good Mythical Morning          Entertainment 2017-11-13 06:00:04  343168
## 5              nigahiga          Entertainment 2017-11-12 13:01:41 2095731
## 6              iJustine Science and Technology 2017-11-13 14:07:23  119180
##    likes dislikes comment_count comments_disabled ratings_disabled
## 1  57527     2966         15954             FALSE            FALSE
## 2  97185     6146         12703             FALSE            FALSE
## 3 146033     5339          8181             FALSE            FALSE
## 4  10172      666          2146             FALSE            FALSE
## 5 132235     1989         17518             FALSE            FALSE
## 6   9763      511          1434             FALSE            FALSE
##   video_error_or_removed publish_hour publish_when publish_wday timetotrend
## 1                  FALSE           12   8am to 3pm       Monday           1
## 2                  FALSE            2  12am to 8am       Monday           1
## 3                  FALSE           14   8am to 3pm       Sunday           2
## 4                  FALSE            6  12am to 8am       Monday           1
## 5                  FALSE           13   8am to 3pm       Sunday           2
## 6                  FALSE           14   8am to 3pm       Monday           1
##   like_ratio dislike_ratio
## 1 0.07686932  0.0039632590
## 2 0.04017930  0.0025409472
## 3 0.04575780  0.0016729157
## 4 0.02964146  0.0019407404
## 5 0.06309732  0.0009490722
## 6 0.08191811  0.0042876322
case1 <- yt %>% 
  group_by(category_id) %>% 
  summarise(mean_likes = mean(like_ratio)) %>%
  ungroup()
case1 <- case1 %>% 
  mutate(text_tooltip = glue("Mean Likes: {round(mean_likes, 2)}"))
case1
## # A tibble: 16 × 3
##    category_id            mean_likes text_tooltip    
##    <fct>                       <dbl> <glue>          
##  1 Autos and Vehicles         0.0176 Mean Likes: 0.02
##  2 Comedy                     0.0522 Mean Likes: 0.05
##  3 Education                  0.0437 Mean Likes: 0.04
##  4 Entertainment              0.0341 Mean Likes: 0.03
##  5 Film and Animation         0.0338 Mean Likes: 0.03
##  6 Gaming                     0.0480 Mean Likes: 0.05
##  7 Howto and Style            0.0524 Mean Likes: 0.05
##  8 Music                      0.0724 Mean Likes: 0.07
##  9 News and Politics          0.0146 Mean Likes: 0.01
## 10 Nonprofit and Activism     0.0245 Mean Likes: 0.02
## 11 People and Blogs           0.0520 Mean Likes: 0.05
## 12 Pets and Animals           0.0453 Mean Likes: 0.05
## 13 Science and Technology     0.0383 Mean Likes: 0.04
## 14 Shows                      0.0322 Mean Likes: 0.03
## 15 Sports                     0.0173 Mean Likes: 0.02
## 16 Travel and Events          0.0285 Mean Likes: 0.03
# buat ulang ggplot
plot_case1_new <- ggplot(data = case1, aes(x = mean_likes, 
                                       y = reorder(category_id, mean_likes),
                                       text = text_tooltip))+ # untuk menambahkan informasi tooltip yang sudah diatur
  geom_col(aes(fill = mean_likes), show.legend = F)+
    labs(y = "YouTube Categories", x = "Likes Ratio", title = "Average likes ratio", 
       subtitle = "on YouTube")+
  scale_fill_gradient(low = "#fff2cd", high = "firebrick")+
  theme_minimal()
ggplotly(plot_case1_new, tooltip = "text")

Interpretation

  • Music Category has the most likes ratio
  • The least category who has the lowest like_ratio is News and Politics.
  1. We would like to know when is the right time to publish the video.
plot(x = yt$publish_when)

interpretation

  • We can see from the visualization that from “8am to 3pm” has the most number, with 1639
  • “3pm to 12am” comes after with 792
  • And the least number “12am to 8am” with 552.
  1. After we got the information about “Public When”, now we would like to know YouTube US Trending videos, between category and Publish Hour
yt_test <- as.data.frame(table(yt$publish_when,yt$category_id))
colnames(yt_test) <- c("publish_when","category","freq")
case2 <- yt_test

plot_case2_new <- ggplot(case2, aes(x = freq, 
                                y = reorder(category,freq),
                                text = glue("Frekuensi: {freq}"))) +
 geom_col(aes(fill = publish_when), position = "dodge")+
  labs(title = "Proportion of Youtube Trending Videos",
       subtitle = "Category vs Publish Hour",
       caption = "Source: Youtube Us Trending",
       x = "Freq",
       y = NULL,
       fill = NULL)+
  scale_fill_brewer(palette = "Set2")+
  theme_minimal()+
  theme(legend.position = "top")+
  geom_text(aes(label = freq, group = publish_when),
            position=position_dodge(.9))

ggplotly(plot_case2_new, tooltip = "text")

Interpretation

  • We can see from the vizualitation, every category shows that “8am to 3pm” has the most publish hour.
  • Entertainment sits in the most trending, with 418 with the publish hour 8am to 3pm.
  • The best 3 categories to make trending are “Entertainment”, “Music”. and “How to and Style”. Those three are best to post from “8 am to 3pm”
  • The categories to avoid are : “Shows”, “Non-Profit and Activism”, and “Gaming”.
trending_channel <- as.data.frame(table(yt$channel_title))   

head(trending_channel)
##                                                                                                                            Var1
## 1                                                                                                               _\xa2_\xc1_\x9d
## 2 \x8b\xc4\xf8\x8b\xc4_\x8b\xc4_\x8b\xc4_ \x8b\xc4Ћĩ\x8b⦋\xc4_\x8b\xe2_ \x8c\xc9\u008c_\x8f\x8bā\x8bģ\x8b\xc4_\x8bč\x8b\xc4\xc7
## 3                                                                                                        \x8e\xc4_\x8e\xd9܏_\xa1
## 4                                                                             \x93\xf7\x81\x90\xb5_\x91⬓_\x90 Korean Englishman
## 5                                                                                                                       12 News
## 6                                                                                              1theK (\x93ݐ\x91\x8dӓ_ۓ\x9d\xab)
##   Freq
## 1    1
## 2    1
## 3    1
## 4    1
## 5    1
## 6    5
  1. which chanel is often on the Trending, at least minimum 10 times on trending
case3 <- trending_channel[trending_channel$Freq >= 10, ]
head(case3)
##                 Var1 Freq
## 96          BBC News   13
## 173    BuzzFeedVideo   10
## 254              CNN   17
## 263     CollegeHumor   15
## 264   ColliderVideos   13
## 360 E! Entertainment   13
#visualisasi
plot_case3_new <- ggplot(data = case3, mapping = aes(x = Freq, 
                                   y = reorder(Var1, Freq),
                                   text = glue("Frekuensi: {Freq}"))) +
  geom_col(mapping = aes(fill = Freq)) +
  labs(title = "Top Trending Videos",
       caption = "Source: Youtube Us Trending",
       x = "Freq",
       y = NULL,
       fill = NULL)+
  geom_text(mapping = aes(label = Freq), 
                          color = "white", 
                          size = 5,
                          nudge_x = -0.7) +
  scale_fill_gradient(low = "orange", high = "red") 

ggplotly(plot_case3_new, tooltip = "text")

Interpretation

  • Top 3 trending videos coming from “Refinery 29”, “The tonight show starring Jimmy Fallon”, and “Vox”.
  • “Refinery 29” sits in no. 1 Trending Videos with 31 times being in trending videos
  • “BuzzFeedVideo” is not as trending as today, they were in the bottom of top trending videos with only 10 times through that period.
  1. Which category has the most engagement between category and like ratio?
ggplot(data = yt, mapping = aes(x = like_ratio, y = category_id)) +
  geom_jitter(aes(col = yt$like_ratio)) +
  geom_boxplot(alpha = 0.5) +
  scale_x_log10() +
  geom_point() +
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Use of `yt$like_ratio` is discouraged. Use `like_ratio` instead.
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis
## Warning: Removed 12 rows containing non-finite values (stat_boxplot).
## Warning: Removed 12 rows containing missing values (geom_point).

Interpretation

  • as we can see, Music Category has the most like ratio if we compared with another category.
  • The least can be sports or nonprofit and activism.

4. Conclusion

From all graphs above, we may say some assumptions, such as : 1. Music Category holds the most promissing content based on like ratio, while News and Politics Category is not too popular amongst the other categories. 2. The best suggestion for published time is from “8am to 3pm”. 3. Music & Entertainment categories can be solution if you want to be on the top of trending list, because it has bigger chance and to be on the trending lists. 4. Although Entertainment is the number 1 based on top trending videos, but it doesnt mean it has the highest engagement rate with the viewers, the highest category based on engagement rate holds by music category.