Hi !! Welcome to my Rmd :) in this LBB i will use previous data which is USvideos.csv
This data is about xxxxxxx for listing in USA. and the first thing i need to do is load all package tht might be needed for this dataset.
yt <- read.csv("youtubetrends.csv")
Futhermore we will inspect our data
dim(yt)
## [1] 2986 16
head(yt)
## trending_date title
## 1 2017-11-14 WE WANT TO TALK ABOUT OUR MARRIAGE
## 2 2017-11-14 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
## 3 2017-11-14 Racist Superman | Rudy Mancuso, King Bach & Lele Pons
## 4 2017-11-14 Nickelback Lyrics: Real or Fake?
## 5 2017-11-14 I Dare You: GOING BALD!?
## 6 2017-11-14 2 Weeks with iPhone X
## channel_title category_id publish_time views
## 1 CaseyNeistat People and Blogs 2017-11-13 12:13:01 748374
## 2 LastWeekTonight Entertainment 2017-11-13 02:30:00 2418783
## 3 Rudy Mancuso Comedy 2017-11-12 14:05:24 3191434
## 4 Good Mythical Morning Entertainment 2017-11-13 06:00:04 343168
## 5 nigahiga Entertainment 2017-11-12 13:01:41 2095731
## 6 iJustine Science and Technology 2017-11-13 14:07:23 119180
## likes dislikes comment_count comments_disabled ratings_disabled
## 1 57527 2966 15954 FALSE FALSE
## 2 97185 6146 12703 FALSE FALSE
## 3 146033 5339 8181 FALSE FALSE
## 4 10172 666 2146 FALSE FALSE
## 5 132235 1989 17518 FALSE FALSE
## 6 9763 511 1434 FALSE FALSE
## video_error_or_removed publish_hour publish_when publish_wday timetotrend
## 1 FALSE 12 8am to 3pm Monday 1
## 2 FALSE 2 12am to 8am Monday 1
## 3 FALSE 14 8am to 3pm Sunday 2
## 4 FALSE 6 12am to 8am Monday 1
## 5 FALSE 13 8am to 3pm Sunday 2
## 6 FALSE 14 8am to 3pm Monday 1
tail(yt)
## trending_date
## 2981 2018-01-21
## 2982 2018-01-21
## 2983 2018-01-21
## 2984 2018-01-21
## 2985 2018-01-21
## 2986 2018-01-21
## title
## 2981 Short-term spending bill fails in the Senate
## 2982 Jessica Chastain Monologue - SNL
## 2983 Legendary Rocker Tom Petty Died From Accidental Overdose Of Medications, Coroner Says
## 2984 Mulan Hair Cut
## 2985 Urban opera singer
## 2986 GET READY WITH ME | Chit Chat with Winnie Harlow
## channel_title category_id publish_time views likes
## 2981 Washington Post News and Politics 2018-01-19 23:43:42 13044 68
## 2982 Saturday Night Live Entertainment 2018-01-21 00:44:30 23484 858
## 2983 CBS Los Angeles News and Politics 2018-01-19 21:10:07 3976 30
## 2984 Aymeric Favard Sports 2014-03-15 14:21:22 16231 103
## 2985 Marcos Horacio Sports 2018-01-19 12:26:11 13531 136
## 2986 LaToya Forever Entertainment 2018-01-18 18:03:24 44754 3246
## dislikes comment_count comments_disabled ratings_disabled
## 2981 22 220 FALSE FALSE
## 2982 80 136 FALSE FALSE
## 2983 5 45 FALSE FALSE
## 2984 7 15 FALSE FALSE
## 2985 5 16 FALSE FALSE
## 2986 28 386 FALSE FALSE
## video_error_or_removed publish_hour publish_when publish_wday timetotrend
## 2981 FALSE 23 3pm to 12am Friday 1
## 2982 FALSE 0 12am to 8am Sunday 0
## 2983 FALSE 21 3pm to 12am Friday 1
## 2984 FALSE 14 8am to 3pm Saturday 8+
## 2985 FALSE 12 8am to 3pm Friday 2
## 2986 FALSE 18 3pm to 12am Thursday 3
From our inspection we can conclude :
yt data contain 2986 of rows and 16 of coloumns
Each of column name : “trending_date”, “title”, “channel_title”, “category_id”, “publish_time”, “views”, “dislikes”, “comment_count”, “comments_disabled”, “ratings_disabled”, “video_error_or_removed”, “publish_hour”, “publish_when”, “publish_wday”, & “timetotrend”
str(yt)
## 'data.frame': 2986 obs. of 16 variables:
## $ trending_date : chr "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14" ...
## $ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
## $ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
## $ category_id : chr "People and Blogs" "Entertainment" "Comedy" "Entertainment" ...
## $ publish_time : chr "2017-11-13 12:13:01" "2017-11-13 02:30:00" "2017-11-12 14:05:24" "2017-11-13 06:00:04" ...
## $ views : int 748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
## $ likes : int 57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
## $ dislikes : int 2966 6146 5339 666 1989 511 2445 778 119 1363 ...
## $ comment_count : int 15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
## $ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ publish_hour : int 12 2 14 6 13 14 0 16 9 8 ...
## $ publish_when : chr "8am to 3pm" "12am to 8am" "8am to 3pm" "12am to 8am" ...
## $ publish_wday : chr "Monday" "Monday" "Sunday" "Monday" ...
## $ timetotrend : chr "1" "1" "2" "1" ...
as we can see here, we need to adjust some of the data types.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
yt$publish_time <- ymd_hms(yt$publish_time, tz = "America/New_York")
yt$trending_date <- date(yt$trending_date)
yt$publish_when <- as.factor(yt$publish_when)
yt$category_id<- as.factor(yt$category_id)
yt$publish_wday <- as.factor(yt$publish_wday)
yt$title <- as.factor(yt$title)
yt$channel_title <- as.factor(yt$channel_title)
yt$timetotrend <- as.factor(yt$timetotrend)
colSums(is.na(yt))
## trending_date title channel_title
## 0 0 0
## category_id publish_time views
## 0 0 0
## likes dislikes comment_count
## 0 0 0
## comments_disabled ratings_disabled video_error_or_removed
## 0 0 0
## publish_hour publish_when publish_wday
## 0 0 0
## timetotrend
## 0
alright!! there are no missing values!!
First, lets check our data’s summary
summary(yt)
## trending_date
## Min. :2017-11-14
## 1st Qu.:2017-11-27
## Median :2017-12-14
## Mean :2017-12-15
## 3rd Qu.:2018-01-04
## Max. :2018-01-21
##
## title
## _\xa3_\xbc\x84\xdb_\xa1\x84\xd1__\x84\x81\x84\xce_\xbc_\xa1 _\xa1\x84\xdb__\x84Є\x8f _\xe1__\x84\xd0_Ȅ\xce___\xfc_\xc8_\xa1 ___\xab\x84\xdb_\xa1_\xe1\x84\xc4 _\xab___\xa1 \x84\x81_\xb5_\xc8_\xa1 ___\xa1 _\xc1\x84\xc9___\xab\x84\xd0: 1
## _\xd9_\xc4 How to make Pumpkin Pie Mistakes : 1
## 'I have dad moves': Barack Obama discusses dancing on David Letterman's new Netflix show : 1
## 'I have taken poison' claims war criminal : 1
## 'Lightning Strike' B777-300 on departure @ KL743 // PH-BVS : 1
## 'Lord of the Rings' TV Series Gets Multi-Season Order At Amazon | News Flash | Entertainment Weekly : 1
## (Other) :2980
## channel_title category_id
## Refinery29 : 31 Entertainment :736
## The Tonight Show Starring Jimmy Fallon: 30 Music :391
## Vox : 29 Howto and Style :285
## TheEllenShow : 28 Comedy :273
## Netflix : 27 News and Politics:271
## NFL : 25 People and Blogs :228
## (Other) :2816 (Other) :802
## publish_time views likes
## Min. :2008-04-05 14:22:40 Min. : 687 Min. : 0.0
## 1st Qu.:2017-11-22 10:51:03 1st Qu.: 43014 1st Qu.: 931.2
## Median :2017-12-11 12:41:02 Median : 149196 Median : 4785.0
## Mean :2017-11-07 14:44:26 Mean : 512602 Mean : 25412.1
## 3rd Qu.:2018-01-01 10:22:33 3rd Qu.: 483848 3rd Qu.: 17517.0
## Max. :2018-01-21 00:44:30 Max. :37736281 Max. :2055137.0
##
## dislikes comment_count comments_disabled ratings_disabled
## Min. : 0.0 Min. : 0 Mode :logical Mode :logical
## 1st Qu.: 39.0 1st Qu.: 152 FALSE:2942 FALSE:2976
## Median : 148.5 Median : 589 TRUE :44 TRUE :10
## Mean : 1257.7 Mean : 3545
## 3rd Qu.: 541.0 3rd Qu.: 1964
## Max. :629120.0 Max. :733373
##
## video_error_or_removed publish_hour publish_when publish_wday
## Mode :logical Min. : 0.00 12am to 8am: 552 Friday :478
## FALSE:2985 1st Qu.: 9.00 3pm to 12am: 795 Monday :415
## TRUE :1 Median :12.00 8am to 3pm :1639 Saturday :269
## Mean :11.95 Sunday :247
## 3rd Qu.:16.00 Thursday :490
## Max. :23.00 Tuesday :561
## Wednesday:526
## timetotrend
## 1 :1379
## 2 : 714
## 3 : 290
## 4 : 189
## 5 : 132
## 8+ : 129
## (Other): 153
From summary above, we can take a note for some important information:
These are the libraries that I am going to use
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(glue)
First we have to find the likes ratio, and dislike ratio
yt$like_ratio <-yt$likes/yt$views
yt$dislike_ratio <- yt$dislikes/yt$views
head(yt)
## trending_date title
## 1 2017-11-14 WE WANT TO TALK ABOUT OUR MARRIAGE
## 2 2017-11-14 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
## 3 2017-11-14 Racist Superman | Rudy Mancuso, King Bach & Lele Pons
## 4 2017-11-14 Nickelback Lyrics: Real or Fake?
## 5 2017-11-14 I Dare You: GOING BALD!?
## 6 2017-11-14 2 Weeks with iPhone X
## channel_title category_id publish_time views
## 1 CaseyNeistat People and Blogs 2017-11-13 12:13:01 748374
## 2 LastWeekTonight Entertainment 2017-11-13 02:30:00 2418783
## 3 Rudy Mancuso Comedy 2017-11-12 14:05:24 3191434
## 4 Good Mythical Morning Entertainment 2017-11-13 06:00:04 343168
## 5 nigahiga Entertainment 2017-11-12 13:01:41 2095731
## 6 iJustine Science and Technology 2017-11-13 14:07:23 119180
## likes dislikes comment_count comments_disabled ratings_disabled
## 1 57527 2966 15954 FALSE FALSE
## 2 97185 6146 12703 FALSE FALSE
## 3 146033 5339 8181 FALSE FALSE
## 4 10172 666 2146 FALSE FALSE
## 5 132235 1989 17518 FALSE FALSE
## 6 9763 511 1434 FALSE FALSE
## video_error_or_removed publish_hour publish_when publish_wday timetotrend
## 1 FALSE 12 8am to 3pm Monday 1
## 2 FALSE 2 12am to 8am Monday 1
## 3 FALSE 14 8am to 3pm Sunday 2
## 4 FALSE 6 12am to 8am Monday 1
## 5 FALSE 13 8am to 3pm Sunday 2
## 6 FALSE 14 8am to 3pm Monday 1
## like_ratio dislike_ratio
## 1 0.07686932 0.0039632590
## 2 0.04017930 0.0025409472
## 3 0.04575780 0.0016729157
## 4 0.02964146 0.0019407404
## 5 0.06309732 0.0009490722
## 6 0.08191811 0.0042876322
case1 <- yt %>%
group_by(category_id) %>%
summarise(mean_likes = mean(like_ratio)) %>%
ungroup()
case1 <- case1 %>%
mutate(text_tooltip = glue("Mean Likes: {round(mean_likes, 2)}"))
case1
## # A tibble: 16 × 3
## category_id mean_likes text_tooltip
## <fct> <dbl> <glue>
## 1 Autos and Vehicles 0.0176 Mean Likes: 0.02
## 2 Comedy 0.0522 Mean Likes: 0.05
## 3 Education 0.0437 Mean Likes: 0.04
## 4 Entertainment 0.0341 Mean Likes: 0.03
## 5 Film and Animation 0.0338 Mean Likes: 0.03
## 6 Gaming 0.0480 Mean Likes: 0.05
## 7 Howto and Style 0.0524 Mean Likes: 0.05
## 8 Music 0.0724 Mean Likes: 0.07
## 9 News and Politics 0.0146 Mean Likes: 0.01
## 10 Nonprofit and Activism 0.0245 Mean Likes: 0.02
## 11 People and Blogs 0.0520 Mean Likes: 0.05
## 12 Pets and Animals 0.0453 Mean Likes: 0.05
## 13 Science and Technology 0.0383 Mean Likes: 0.04
## 14 Shows 0.0322 Mean Likes: 0.03
## 15 Sports 0.0173 Mean Likes: 0.02
## 16 Travel and Events 0.0285 Mean Likes: 0.03
# buat ulang ggplot
plot_case1_new <- ggplot(data = case1, aes(x = mean_likes,
y = reorder(category_id, mean_likes),
text = text_tooltip))+ # untuk menambahkan informasi tooltip yang sudah diatur
geom_col(aes(fill = mean_likes), show.legend = F)+
labs(y = "YouTube Categories", x = "Likes Ratio", title = "Average likes ratio",
subtitle = "on YouTube")+
scale_fill_gradient(low = "#fff2cd", high = "firebrick")+
theme_minimal()
ggplotly(plot_case1_new, tooltip = "text")
Interpretation
plot(x = yt$publish_when)
interpretation
yt_test <- as.data.frame(table(yt$publish_when,yt$category_id))
colnames(yt_test) <- c("publish_when","category","freq")
case2 <- yt_test
plot_case2_new <- ggplot(case2, aes(x = freq,
y = reorder(category,freq),
text = glue("Frekuensi: {freq}"))) +
geom_col(aes(fill = publish_when), position = "dodge")+
labs(title = "Proportion of Youtube Trending Videos",
subtitle = "Category vs Publish Hour",
caption = "Source: Youtube Us Trending",
x = "Freq",
y = NULL,
fill = NULL)+
scale_fill_brewer(palette = "Set2")+
theme_minimal()+
theme(legend.position = "top")+
geom_text(aes(label = freq, group = publish_when),
position=position_dodge(.9))
ggplotly(plot_case2_new, tooltip = "text")
Interpretation
trending_channel <- as.data.frame(table(yt$channel_title))
head(trending_channel)
## Var1
## 1 _\xa2_\xc1_\x9d
## 2 \x8b\xc4\xf8\x8b\xc4_\x8b\xc4_\x8b\xc4_ \x8b\xc4Ћĩ\x8b⦋\xc4_\x8b\xe2_ \x8c\xc9\u008c_\x8f\x8bā\x8bģ\x8b\xc4_\x8bč\x8b\xc4\xc7
## 3 \x8e\xc4_\x8e\xd9_\xa1
## 4 \x93\xf7\x81\x90\xb5_\x91⬓_\x90 Korean Englishman
## 5 12 News
## 6 1theK (\x93ݐ\x91\x8dӓ_ۓ\x9d\xab)
## Freq
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 5
case3 <- trending_channel[trending_channel$Freq >= 10, ]
head(case3)
## Var1 Freq
## 96 BBC News 13
## 173 BuzzFeedVideo 10
## 254 CNN 17
## 263 CollegeHumor 15
## 264 ColliderVideos 13
## 360 E! Entertainment 13
#visualisasi
plot_case3_new <- ggplot(data = case3, mapping = aes(x = Freq,
y = reorder(Var1, Freq),
text = glue("Frekuensi: {Freq}"))) +
geom_col(mapping = aes(fill = Freq)) +
labs(title = "Top Trending Videos",
caption = "Source: Youtube Us Trending",
x = "Freq",
y = NULL,
fill = NULL)+
geom_text(mapping = aes(label = Freq),
color = "white",
size = 5,
nudge_x = -0.7) +
scale_fill_gradient(low = "orange", high = "red")
ggplotly(plot_case3_new, tooltip = "text")
Interpretation
ggplot(data = yt, mapping = aes(x = like_ratio, y = category_id)) +
geom_jitter(aes(col = yt$like_ratio)) +
geom_boxplot(alpha = 0.5) +
scale_x_log10() +
geom_point() +
theme(plot.title = element_text(hjust = 0.5))
## Warning: Use of `yt$like_ratio` is discouraged. Use `like_ratio` instead.
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis
## Warning: Removed 12 rows containing non-finite values (stat_boxplot).
## Warning: Removed 12 rows containing missing values (geom_point).
Interpretation
From all graphs above, we may say some assumptions, such as : 1. Music Category holds the most promissing content based on like ratio, while News and Politics Category is not too popular amongst the other categories. 2. The best suggestion for published time is from “8am to 3pm”. 3. Music & Entertainment categories can be solution if you want to be on the top of trending list, because it has bigger chance and to be on the trending lists. 4. Although Entertainment is the number 1 based on top trending videos, but it doesnt mean it has the highest engagement rate with the viewers, the highest category based on engagement rate holds by music category.