Introduction

YouTube has become one of the leading social media and video-sharing websites, and a vital source of entertainment for millennial and Gen Z internet users. Given the importance of YouTube as a trend-driving platform in modern popular culture, we decided to analyze the data for trending YouTube videos. The data was retrieved from Kaggle.

Data Cleaning

To start, we need to load the category tags from the JSON files. The files share the same naming pattern and structure, and each file name begins with its country code, so we can loop over the country codes to load the categories.
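
As a minimal sketch of the naming pattern described above, the file paths can be built from the country codes before any file is read. The folder name and file suffix are taken from the loading code in this section; the `category_paths` and `missing_files` names are only illustrative, and the existence check is an optional safeguard rather than part of the original workflow.

```r
# Country codes used in this analysis
countries <- c("CA", "DE", "FR", "GB", "MX", "US")

# Build one path per country: <country code>_category_id.json
category_paths <- sprintf("YouTube_Trending/%s_category_id.json", countries)
names(category_paths) <- countries

# Optional safeguard (illustrative): report files that are not on disk
missing_files <- category_paths[!file.exists(category_paths)]
if (length(missing_files) > 0) {
  message("Missing category files: ", paste(missing_files, collapse = ", "))
}
```

Checking the paths first makes a misplaced data folder show up as a single readable message instead of an error in the middle of the loop.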

Datasets for countries whose languages are not Latin-based were removed from the list, since my system doesn’t support the character set encoding for these languages.

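# Library for working with json files
library(jsonlite)
# Libraries for cleaning and manipulating the data (dplyr, stringr, ...)
library(tidyverse)
# Libraries for interactive plots
library(plotly)
library(htmlwidgets)
# Library for text cleaning (replace_non_ascii)
library(textclean)
library(ggplot2)
# Library for running SQL queries on dataframes
library(sqldf)
# Library for rendering tables (kable)
library(knitr)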

# Define the list of countries
countries <- c('CA', 'DE', 'FR', 'GB', 'MX', 'US')

# Create a blank dataframe with the column names. This dataframe
# will have the union data at the end.
categories <- data.frame(
    kind=character(),
    etag=character(),
    id=integer(),
    channel_id=character(),
    title=character(),
    assignable=logical(),
    country=character(),
    stringsAsFactors=FALSE
)

for (c in countries){
  file_path <- str_interp('YouTube_Trending/${c}_category_id.json')
  # load the json file
  jsonData <- fromJSON(file_path, flatten=TRUE)
  # convert the json into a dataframe and rename the columns
  temp <- as.data.frame(jsonData[3]) %>% rename(
    kind = items.kind,
    etag = items.etag,
    id = items.id,
    channel_id = items.snippet.channelId,
    title = items.snippet.title,
    assignable = items.snippet.assignable
  )
  temp$id <- as.integer(temp$id)
  # add country column
  temp$country <- c
  # union the data into categories dataframe.
  categories <- union(categories, temp)
}

categories <- subset(categories, select=-c(kind, etag, channel_id, assignable))
kable(sample_n(categories, 10), caption = "Categories Samples")
Categories Samples
id  title            country
24  Entertainment    MX
15  Pets & Animals   GB
31  Anime/Animation  US
42  Shorts           DE
42  Shorts           GB
35  Documentary      CA
27  Education        CA
24  Entertainment    DE
24  Entertainment    US
44  Trailers         FR

Next, we need to load the videos data set from the CSV files. The following code prints a sample of the data and a summary of each column.

test <- read.csv('YouTube_Trending/CAvideos.csv')
test$publish_time <- as.POSIXct(test$publish_time, format="%Y-%m-%dT%H:%M:%OS", tz='UTC')
kable(head(test), caption="Sample of videos data")
Sample of videos data
video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description
n1WpP7iowLc 17.14.11 Eminem - Walk On Water (Audio) ft. Beyoncé EminemVEVO 10 2017-11-10 17:00:03 Eminem|Walk|On|Water|Aftermath/Shady/Interscope|Rap 17158579 787425 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg False False False Eminem’s new track Walk on Water ft. Beyoncé is available everywhere: http://shady.sr/WOWEminem Best of Eminem: https://goo.gl/AquNpo\nSubscribe for more: https://goo.gl/DxCrDV\n\nFor more visit: ://eminem.com://facebook.com/eminem://twitter.com/eminem://instagram.com/eminem://eminem.tumblr.com://shadyrecords.com://facebook.com/shadyrecords://twitter.com/shadyrecords://instagram.com/shadyrecords://trustshady.tumblr.comvideo by Eminem performing Walk On Water. (C) 2017 Aftermath Records://vevo.ly/gA7xKt
0dBIkQ4Mz1M 17.14.11 PLUSH - Bad Unboxing Fan Mail iDubbbzTV 23 2017-11-13 17:00:00 plush|bad unboxing|unboxing|fan mail|idubbbztv|idubbbztv2|things|best|packages|plushies|chontent chop 1014651 127794 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg False False False STill got a lot of packages. Probably will last for another year. On a side note, more 2nd channel vids soon. editing with premiere from now on, gon’ be a tedious transition, but i think it’s for the best. __â–º http://www.youtube.com/subscription_center?add_user=iDubbbztv\n\nMain Channel â–º https://www.youtube.com/user/iDubbbzTV\nSecond Channel â–º https://www.youtube.com/channel/UC-tsNNJ3yIW98MtPH6PWFAQ\nGaming Channel â–º https://www.youtube.com/channel/UCVhfFXNY0z3-mbrTh1OYRXA\n\nWebsite â–º http://www.idubbbz.com/\n\nInstagram â–º https://instagram.com/idubbbz/\nTwitter â–º https://twitter.com/Idubbbz\nFacebook â–º http://www.facebook.com/IDubbbz\nTwitch â–º http://www.twitch.tv/idubbbz\n_
5qpjK5DgCt4 17.14.11 Racist Superman | Rudy Mancuso, King Bach & Lele Pons Rudy Mancuso 23 2017-11-12 19:05:24 racist superman|rudy|mancuso|king|bach|racist|superman|love|rudy mancuso poo bear black white official music video|iphone x by pineapple|lelepons|hannahstocking|rudymancuso|inanna|anwar|sarkis|shots|shotsstudios|alesso|anitta|brazil|Getting My Driver’s License | Lele Pons 3191434 146035 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg False False False WATCH MY PREVIOUS VIDEO â–¶ â–º https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_confirmation=1\n\nTHANKS FOR WATCHING! LIKE & SUBSCRIBE FOR MORE VIDEOS!———————————————————–ME ON: | http://instagram.com/rudymancuso\nTwitter | http://twitter.com/rudymancuso\nFacebook | http://facebook.com/rudymancuso\n\nCAST: Mancuso | http://youtube.com/c/rudymancuso\nLele Pons | http://youtube.com/c/lelepons\nKing Bach | https://youtube.com/user/BachelorsPadTv\n\nVideo Effects: Natale | https://instagram.com/calebnatale\n\nPA:\nPaulina GregoryStudios Channels:| https://youtube.com/c/alesso\nAnitta | http://youtube.com/c/anitta\nAnwar Jibawi | http://youtube.com/c/anwar\nAwkward Puppets | http://youtube.com/c/awkwardpuppets\nHannah Stocking | http://youtube.com/c/hannahstocking\nInanna Sarkis | http://youtube.com/c/inanna\nLele Pons | http://youtube.com/c/lelepons\nMaejor | http://youtube.com/c/maejor\nMike Tyson | http://youtube.com/c/miketyson Mancuso | http://youtube.com/c/rudymancuso\nShots Studios | http://youtube.com/c/shots\n\n#Rudy\n#RudyMancuso
d380meD0W0M 17.14.11 I Dare You: GOING BALD!? nigahiga 24 2017-11-12 18:01:41 ryan|higa|higatv|nigahiga|i dare you|idy|rhpc|dares|no truth|comments|comedy|funny|stupid|fail 2095828 132239 1989 17518 https://i.ytimg.com/vi/d380meD0W0M/default.jpg False False False I know it’s been a while since we did this show, but we’re back with what might be the best episode yet!your dares in the comment section! my book how to write good ://higatv.com/ryan-higas-how-to-write-good-pre-order-links/Launched New Official Store://www.gianthugs.com/collections/ryanChannel://www.youtube.com/higatv://www.twitter.com/therealryanhiga://www.facebook.com/higatv://www.higatv.com://www.instagram.com/notryanhigaus mail or whatever you want here!Box 232355Vegas, NV 89105
2Vv-BfVoq4g 17.14.11 Ed Sheeran - Perfect (Official Music Video) Ed Sheeran 10 2017-11-09 11:04:14 edsheeran|ed sheeran|acoustic|live|cover|official|remix|official video|lyrics|session 33523622 1634130 21082 85067 https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg False False False 🎧: https://ad.gt/yt-perfect\n💰: https://atlanti.cr/yt-album\nSubscribe to Ed’s channel: http://bit.ly/SubscribeToEdSheeran\n\nFollow Ed on…: http://www.facebook.com/EdSheeranMusic\nTwitter: http://twitter.com/edsheeran\nInstagram: http://instagram.com/teddysphotos\nOfficial Website: http://edsheeran.com\n\nDirector: Jason Koenig: Honna Kimmerer: Ed Sheeran & Zoey Deutch of Photography: Johnny ValenciaCompany: Anonymous ContentProducer: Nina SorianoManager: Doug Hoff: Dan CurwinDesigner: John LavinCasting: Amy Hubbard by: Jason Koenig, Ed Sheeran, Andrew Kolvet, Jenny Koenig, Murray Cummingsby: Jason Koenig & Johnny Valencia: Ian Hubert: Bo Valencia, Dennis Ranalta, Arthur PauliCinematography: Corey KoniniecCamera op: Ryan Haug1st AC: Ryan Brown1st Assistant Director: Ole ZapatkaDirector: Klaus Hartlfx: Lucien Stephenson: Thomas Berz: Claudia Lajda& Makeup: Christel ThoresenCasting: Ursula KiplingerVFX: ZoicThanks to: The Hintertux Glacier, Austria;Tenne, and Hotel Neuhintertux
0yIWz1XEeyc 17.14.11 Jake Paul Says Alissa Violet CHEATED with LOGAN PAUL! #DramaAlert Team 10 vs Martinez Twins! DramaAlert 25 2017-11-13 07:37:51 #DramaAlert|Drama|Alert|DramaAlert|keemstar|youtube news|jake paul|team 10|alissa violet|cheated|logan paul|logan paul alissa violet|jake paul alissa violet|Martinez Twins|left team 10|faze banks|erika costell 1309699 103755 4613 12143 https://i.ytimg.com/vi/0yIWz1XEeyc/default.jpg False False False â–º Follow for News! - https://twitter.com/KEEMSTAR\n\nâ–º Also follow #DramaAlert on:‹† Instagram: https://instagram.com/DramaAlert\n⋆ Twitter: https://twitter.com/DramaAlert\n⋆ Facebook: https://facebook.com/DramaAlert\n\nâ–º Follow for livestreams! - https://twitch.tv/KEEMSTAR\n\nâ–º KEEM Merch://keem.shirtz.cool–º USE CODE (KEEM)://gfuel.com/pages/keemstarin the Woods! (OUT NOW)–º iTunes://itunes.apple.com/us/album/dollar-in-the-woods-single/id1295414119https://itunes.apple.com/us/album/dollar-in-the-woods-single/id1295414119–º Spotify ://open.spotify.com/track/3uUHoKWqPbJ5qoREGbguC9?si=v4CgSBBR–º YouTube (Music Video)://youtu.be/n38Qxi7TVWo! (My New Game)–º Apple (iOS)://itunes.apple.com/us/app/the-adpocalypse/id1263621591–º Android://play.google.com/store/apps/details?id=com.projectorgames.howtogetahead
kable(summary(test), caption="Summary of videos data")
Summary of videos data
Statistic  category_id  publish_time         views      likes    dislikes  comment_count
Min.       1.0          2008-01-13 01:32:16  733        0        0         0
1st Qu.    20.0         2018-01-02 14:21:05  143902     2191     99        417
Median     24.0         2018-02-24 23:00:01  371204     8780     303       1301
Mean       20.8         2018-02-24 08:05:59  1147036    39583    2009      5043
3rd Qu.    24.0         2018-04-23 01:48:54  963302     28717    950       3713
Max.       43.0         2018-06-14 02:25:38  137843120  5053338  1602383   1114800

The remaining columns (video_id, trending_date, title, channel_title, tags, thumbnail_link, comments_disabled, ratings_disabled, video_error_or_removed, description) are reported as Length: 40881, Class: character, Mode: character.

The sample shows that we need to do the following cleaning steps:

  • Cast the boolean columns (comments_disabled, ratings_disabled, video_error_or_removed) from character to logical.
  • Cast the publish time to a date. Publish time is a timestamp and could be cast to POSIXct, but we only use its date part.
  • Cast the trending date, which is stored as yy.dd.mm (for example 17.14.11), to a date.
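
The three casts listed above can be tried on a toy data frame first. This is a small sketch rather than the actual loading code: the `raw` and `clean` names are illustrative, and the raw `publish_time` strings are assumed to look like ISO timestamps, as the POSIXct format string used earlier suggests.

```r
# Toy rows mimicking the raw CSV values (the publish_time strings are an assumption)
raw <- data.frame(
  comments_disabled = c("False", "True"),
  publish_time = c("2017-11-10T17:00:03.000Z", "2017-11-13T17:00:00.000Z"),
  trending_date = c("17.14.11", "17.15.11"),
  stringsAsFactors = FALSE
)

clean <- raw
# "True"/"False" strings become logical values
clean$comments_disabled <- as.logical(raw$comments_disabled)
# Only the leading date is parsed; the time part after it is ignored
clean$publish_time <- as.Date(raw$publish_time, format = "%Y-%m-%d")
# trending_date is year.day.month with a two-digit year
clean$trending_date <- as.Date(raw$trending_date, format = "%y.%d.%m")

str(clean)
```

Note the day and month order in the trending date: reading 17.14.11 with the more common `%y.%m.%d` format would yield NA, because 14 is not a valid month.
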
countries <- c('CA', 'DE', 'FR', 'GB', 'MX', 'US')
# countries <- c('CA')

videos <- data.frame(
  ratings_disabled=logical(),
  publish_time=as.Date(character()),
  video_error_or_removed=logical(),
  comment_count=integer(),
  description=character(),
  title=character(),
  views=integer(),
  trending_date=as.Date(character()),
  thumbnail_link=character(),
  category_id=integer(),
  likes=integer(),
  channel_title=character(),
  comments_disabled=logical(),
  video_id=character(),
  dislikes=integer(),
  tags=character(),
  country=character(),
  stringsAsFactors=FALSE
)

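for (c in countries){
  temp <- read.csv(str_interp('YouTube_Trending/${c}videos.csv'))
  # cast the "True"/"False" strings to logical
  temp$video_error_or_removed <- as.logical(temp$video_error_or_removed)
  temp$ratings_disabled <- as.logical(temp$ratings_disabled)
  temp$comments_disabled <- as.logical(temp$comments_disabled)
  # keep only the date part of the publish timestamp
  temp$publish_time <- as.Date(temp$publish_time, "%Y-%m-%d")
  # trending_date is stored as yy.dd.mm
  temp$trending_date <- as.Date(temp$trending_date, '%y.%d.%m')
  # add country column
  temp$country <- c
  # union the data into the videos dataframe
  # (dplyr's union also drops exact duplicate rows)
  videos <- union(videos, temp)
}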

kable(sample_n(videos, 14), caption='Sample clean videos data')
Sample clean videos data
ratings_disabled publish_time video_error_or_removed comment_count description title views trending_date thumbnail_link category_id likes channel_title comments_disabled video_id dislikes tags country
FALSE 2018-03-03 FALSE 845 Meet Sparta, the world’s smallest cat. “Aww, small animals are so cute!” you might say but while most tiny animals are cute this one is mean. They don’t call him “The Mean Kitty” for nothing! Subscribe to TheMeanKitty: http://bit.ly/SubTmk\n\nBUY THE MEAN KITTY BOOK:://goo.gl/2QZ2TVMOREPopular: https://www.youtube.com/playlist?list=PLBBAC287EFA11E6AB\nSparta: https://www.youtube.com/playlist?list=PLsYvFuldmOwcsums4NYT4ZFt4p4OTiz5S\nLoki: https://www.youtube.com/playlist?list=PLsYvFuldmOwdbU1ZadPxB_YBFEt_H_nHm\nSongs: https://www.youtube.com/playlist?list=PLsYvFuldmOwdhEOquQifX0_uZ853B1fql\nBroken Cats: https://www.youtube.com/playlist?list=PLsYvFuldmOwcKofAXnRtyHIrYLzu6vkV6\n\n\nFollow TheMeanKitty on Socials:: https://www.facebook.com/TheRealMeanKitty/\nTwitter: https://twitter.com/TheMeanKitty\n\n\nCATS WITH ATTITUDES!- from viral hit The Mean Kitty Song which now has over 80 million views to date. He’s a Bengal mix, born sometime in mid-2007; I celebrate his b-day on May 20th. He rescued me in July of 2007. Sparta love stalking, wrestling, crunchy toys, playing fetch and being held like a baby.- the long white kitty that looks like a cow but I believe he may be part monkey. Loki is about the same age as Sparta, so we celebrate their birthdays together. He found me at a rescue center in July of 2008. He loves toys, making noise, hanging upside down, poking things and more than anything, he loves the love! 
World’s Smallest Cat - Cute, Tiny and Mean 311963 2018-03-10 https://i.ytimg.com/vi/7HIt7pA4WqY/default.jpg 15 5003 TheMeanKitty FALSE 7HIt7pA4WqY 1337 worlds smallest cat|small cat|tiny cat|rusty spotted cat|smallest cat in the world|world’s smallest cat video|world’s smallest cat ever|world’s smallest cat species|smallest cats ever|smallest cat breed|smallest cat species|smallest|smallest animals in the world|smallest animals ever|smallest animals|tiniest|smallest things ever|world’s smallest animals|smallest animal|small animals|smallest animal on earth|smallest animals on earth|themeankitty GB
FALSE 2017-11-11 FALSE 860 ሰላም ነዚ ቪድዮ ካብዚ ዳዉሎድ ምግባር ን ኣርቲስ ብዙሕ ጉዳት ሲለ ዘለዎ ብክብረትኩም ካብዚ ቻነል ነዚ ቪዲዮ ዳዉሎድ ምግባር ምዝርጋሕን ኩልኩል እዩ።„— © Copyright: Shalom Entertainment New Eritrean film Dama (ዳማ) part 13 Shalom Entertainment 2017 260199 2017-11-14 https://i.ytimg.com/vi/1YhtkrC2t0c/default.jpg 22 4833 Shalom Entertainment FALSE 1YhtkrC2t0c 302 [none] CA
FALSE 2018-01-27 FALSE 1775 Get 25% off the Solid Gold Calendar with coupon code LIMCAL25 (through this Sunday only!) â–º http://solidgoldaquatics.com/product/autographed-2018-solid-gold-calendar-monthly-giveaways/\n\nWatch the Aquascape team as they expertly construct a beautiful goldfish pond in my backyard! Pond Guy’s channel â–º https://www.youtube.com/user/ThePondGuyAquascape\n\nLearn more about Aquascape â–º https://www.aquascapeinc.com/\n\nTropical Water Gardens (my local certified Aquascape contractor) â–º https://www.tropicalwatergardens.com/ __________to my blog post â–º http://solidgoldaquatics.com/2018/01/27/my-epic-new-fish-pond/\n__________\n\nNEW VIDEOS FRIDAYS (and sometimes Tuesdays)!â–º http://www.youtube.com/subscription_center?add_user=flashofpink\nWebsite â–º http://www.solidgoldaquatics.com\nFacebook â–º https://www.facebook.com/solidgoldaquatics\nInstagram â–º http://instagram.com/solidgoldaquatics\nSnapchat â–º solidgoldaquaâ–º https://twitter.com/solidgoldaqua\n__________\n\nBecome a Solid Gold Member! â–º http://solidgoldaquatics.com/membership-join/\n__________\n\nITEMS IN THIS VIDEO: Aquascape Pond Kit â–º http://amzn.to/2FoyAx1 (affiliate link)Patio Pond â–º http://amzn.to/2Eg9qRQ (affiliate link) __________: Funky World by Geoxor: https://soundcloud.com/geoxor_official/funky-world and Toy Houses by Joey Pecoraro: https://soundcloud.com/joeypecoraro/toy-houses\n__________\n\nThis video description contains Amazon affiliate links, which means that if you click on the link and purchase the item, Amazon will give me a small percentage of the sale. Using my Amazon links for useful pet care products that I recommend is an easy way you can help support what I’m doing here at Solid Gold. MY EPIC NEW FISH POND! 
298322 2018-02-03 https://i.ytimg.com/vi/nUDX6OCXEyc/default.jpg 15 9817 Solid Gold Aquatics FALSE nUDX6OCXEyc 310 backyard pond|how to build a pond|how to|pond|fish pond|koi pond|pond waterfall|waterfall|water feature|tutorial|step by step|aquascape pond|building a pond|aquascape inc|the pond guy|Greg Wittstock|red eared slider|red-eared slider|turtle|turtle pond|patio pond|pet turtle|cuff and link|Jennifer Lynx|Solid Gold Aquatics|DIY|do it yourself US
FALSE 2018-04-12 FALSE 9439 â–º Listen LIVE: http://power1051fm.com/\nâ–º Facebook: https://www.facebook.com/Power1051NY/\nâ–º Twitter: https://twitter.com/power1051/\nâ–º Instagram: https://www.instagram.com/power1051/ Bow Wow Talks #BowWowChallenge And Addresses Rumors In His Last Radio Interview 1479454 2018-04-14 https://i.ytimg.com/vi/aCAhbjLKkt4/default.jpg 24 18380 Breakfast Club Power 105.1 FM FALSE aCAhbjLKkt4 1764 the breakfast club|power1051|celebrity news|radio|video|interview|angela yee|charlamagne tha god|dj envy|bow wow|lil bow wow|shad moss|bowwowchallenge|bow wow challenge GB
FALSE 2018-04-21 FALSE 1106 ذا فويس العرض المباشر الثالث | شيماء من المغرب | اختارها حماقى | 21-4- The Voice 2018°Ø§ فويس 2018 مرحلة المواجهة | The Voice °Ø§ فويس الصوت Ùˆ بس | The Voice 2018‚ناة تقدم مضمونا متنوعا Ùˆ متميزا من حيث الشكل Ùˆ المضمون لإرضاء جميع الأذواق. ذا فويس العرض المباشر الثالث | شيماء من المغرب | اختارها حماقى | 21-4- The Voice 2018 395924 2018-04-23 https://i.ytimg.com/vi/TvYfboCx-cE/default.jpg 24 4969 New life FALSE TvYfboCx-cE 289 ذافويس“|”كيدز“|”الاهلى“|”الزمالك“|”ليفربول“|”برشلونة“|”ريال“|”مدريد“|”ذا“|”فويس“|”حلقات“|”الصوت“|”Ùˆ“|”بس“|”The“|”Voice“|”2018“|”مصر“|”سوريا“|”الجزائر“|”المغرب“|”تونس“|”ليبيا“|”الكويت“|”اليمن“|”لبنان“|”العراق“|”احلام“|”اليسا“|”عاصى“|”حماقى“|”محمد“|”صلاح“|”بورنموث“|”ذا فويس العرض المباشر الثالث | شيماء من المغرب | اختارها حماقى | 21-4- The Voice 2018 FR
FALSE 2018-02-28 FALSE 3697 Thách thức danh hài tập 15 (gala 2) mùa 4 là sá»± trở lại cá»§a hai thí sinh được xem là tiềm năng nhất Mỹ Ngân và Kim Hoàng. Ở gala 2 này, chá»§ nhân giải thưởng 150 triệu sẽ xuất hiện, cùng theo dõi ngay để biết ai sẽ nhận được tràng cười từ BGK nhé!¡ch thức danh hài mùa 4 phát sóng vào lúc 20h30 thứ 4 hàng tuần trên HTV7.¡ch thức danh hài mùa 4 đã trở lại vá»›i sá»± góp mặt cá»§a bá»™ đôi Giám khảo Trấn Thành, Trường Giang và MC-Danh Hài Tiến Luật cÅ©ng những thí sinh được tuyển chọn trong toàn quốc.¡ch thức danh hài trân trọng cảm Æ¡n Tập Ä‘oàn Ä‘iện tá»­ ASANZO nhà tài trợ chính đã đồng hành cùng chương trình‘‰ Fanpage Thách thức danh hài: https://www.facebook.com/thachthucdan Thách thức danh hài 4 |gala 2: Hết “bỏ túi” 100 triệu, cô sinh viên Nông Lâm lại “ẵm trọn” 150 triệu 3337702 2018-03-01 https://i.ytimg.com/vi/Fi_RxjAKurw/default.jpg 24 21370 DIEN QUAN Comedy / Hài FALSE Fi_RxjAKurw 2504 thách thức danh hài|thach thuc danh hai|trấn thành|trường giang|tran thanh|truong giang|trấn thành thách thức danh hài|trường giang thách thức danh hài|tran thanh thach thuc danh hai|truong giang thach thuc danh hai|thách thức danh hài gala 2|thach thuc danh hai gala 2|thách thức danh hài gala 2 full|thach thuc danh hai gala 2 full|dienquan|htv thach thuc danh hai|thach thuc danh hai tap 15|thách thức danh hài tập 15 full CA
FALSE 2018-04-04 FALSE 208 Aujourd’hui, je mélange 25 sortes de céréales et je fais un gâteau avec ça! C’est le défi que Huby m’a lancé 😜-toi à la chaîne de Huby: http://bit.ly/2kOAMDS\nInstagram d’Isaac: http://bit.ly/2A4lCEX\n\n💫 Like la vidéo et abonne-toi ici: http://bit.ly/1nAJhmi\n\n✔️ Active les notifications en cliquant sur la 🔔 pour voir toutes mes vidéos et partage avec tes amis! ’« Suis-moi sur mes réseaux sociaux:: http://bit.ly/2mK60S3\nINSTAGRAM PRO: http://bit.ly/2zkxyDr\nSNAPCHAT : carlarsenault: http://bit.ly/2gZQDOL\nTWITTER : http://bit.ly/2he9uVf\n\nMa nouvelle collection de t-shirts👕: http://fybr.fr/45-carl-is-cooking\n\nMon livre📕: http://amzn.to/2nf4YM9\n\nTu auras besoin: g de chapelure de céréales g de beurre fondu g de fromage à la crème g de yaourt nature g de purée de framboise g de sucre g d’agar agarpour la décoque j’utilise pour filmer:©ra fixe: http://amzn.to/2mFfAmO\nCaméra mobile: http://amzn.to/2Dj1vVA\nRing flash: http://amzn.to/2DecpwZ\n\nN'OUBLIEZ PAS, VOUS ÊTES LES MEILLEURS! 🍪💙© Carl is cooking 2015-2018 JE FAIS UN GÂTEAU AVEC 25 SORTES DE CÉRÉALES 14640 2018-04-05 https://i.ytimg.com/vi/Hk7WucXqqQY/default.jpg 26 1208 Carl is cooking FALSE Hk7WucXqqQY 18 gateau 25 sortes de céréales“|”gateau xxl“|”recette xxl“|”dessert xxl“|”degustation américaine“|”dégustation américain“|”dégustation américaine en français“|”degustation quebecoise“|”je mélange“|”carl is cooking“|”huby“|”degustation canadienne“|”recette cheesecake“|”recette cheesecake facile“|”cheesecake“|”cheesecake sans cuisson“|”cheesecake sans cuisson sans gelatine“|”cheesecake sans gélatine“|”cheesecake framboise“|”recette cheesecake philadelphia“|”recette cheesecake sans cuisson FR
FALSE 2018-04-05 FALSE 24053 Here’s the original video of Haru the Shiba Inu eating popcorn: https://www.youtube.com/watch?v=Vf_4juIwad0\n\nHere's Julien’s tribute video to #ad: https://www.youtube.com/watch?v=4UO9E3Tb1Qc&t=1s\n\nPlease subscribe to my channel and my vlog channel! I make new videos here every Wednesday and make vlogs during my majestical daily life. ://www.youtube.com/JennaMarbles://www.youtube.com/JennaMarblesVlogour weekly podcast ://www.youtube.com/user/JennaJulienPodcast://www.twitch.tv/jennajulienpast gaming from Twitch to Jenna Julien Games://www.youtube.com/channel/UC_Z0x662N1VUN9J7FYwCwkg:: ://www.facebook.com/pages/Jenna-Mourey/311917224927:://twitter.com/#!/Jenna_Marbles_Marbles://twitter.com/charlesmarbles://twitter.com/kermit_thedog_thedog:://jennamarblesblog.com/shop :://www.jennamarblesblog.com/: ://jennamarbles.tumblr.com/://instagram.com/JennaMarbles My Dogs Eating Popcorn ASMR 2437584 2018-04-17 https://i.ytimg.com/vi/cYxzLGk2NcY/default.jpg 23 167781 JennaMarbles FALSE cYxzLGk2NcY 7006 jenna|marbles|mourey|dog asmr|my dogs|eating|eat|popcorn|asmr|popcorn asmr|cute|funny|kermit|peach|italian greyhound|chihuahua|chewing|munching|whisper|ad|hamster|julien|solomita|boyfriend|vlog|adorable|puppy|best|awesome|funniest|cutest|relaxing|tantrum|microphone|soothing|cermet|paesh|annoy|puppies|shiba|inu|quiet|channel|dog vlog US
FALSE 2018-01-10 FALSE 13529 Lots to talk about, you Beautiful Bastards. Let’s jump into it… save yourself some money w/ TING!: http://phil.ting.com\nI Quit. We’re Starting Something New… : https://youtu.be/OYLqrH6YjLw\nWOW! I Didn’t Even Think Of That… : https://youtu.be/9lAoLe7tr60\nNew TheDeFrancoFam!: https://youtu.be/Cp5G0JFqoX0\n————————————to support the show, AND get cool stuff?!€”———————————up to http://DeFrancoElite.com to get early vlogs, bonus videos, exclusive livestreams, exclusive posters and mugs, and private Discord access.up for Postmates (Awesome Food/Drink Delivery) use code PhillyD and get $100 Free Delivery Credit: http://PostDeFranco.com\nInterested in Bitcoin? Sign up for Coinbase (Awesome way to Buy/Sell/Store Bitcoin/Etherium/Litecoin) and get $10 worth of Bitcoin with your first $100 deposit: https://www.coinbase.com/join/593e99f483ace31d47c4ba5b\nWHERE DID YOU GET THAT DOPE SHIRT?!: http://ShopDeFranco.com\n————————————Two DeFranco Shows!€”———————————! Video Exposes Teacher Being Slammed By Police For Criticizing School Board and More…: https://youtu.be/UWUmHI3GyTU\nWOW! Google Sued For Discriminating Against White Men and Bella Thorne’s Abuse Story Makes Waves..: https://youtu.be/zViKDlOzK0s\n————————————IN AWESOME!€”———————————the Dog to their Owner: https://youtu.be/rYzLh2QuraQ\nBad Lip Reading Trump Anthem: https://youtu.be/Zo_mpwmashg\nDunkey’s Best of 2017: https://youtu.be/P6ODTQKhaXk\nHonest Game Trailer Sonic Forces: https://youtu.be/SkerOgIfpGU\nWanna Work With Us? 
Here’s How!: https://twitter.com/GrownWomanChild/status/951104165161312256\nSecret Link: https://youtu.be/obkLDeO58Wo\n————————————:€”———————————Paul Dead Body Video Controversy: Coverage: https://youtu.be/ZAyvEft9MIs\nhttps://youtu.be/4_FHvf9typs\nUPDATE: ://www.polygon.com/2018/1/10/16873340/youtube-logan-paul-statement-consequence-channel://www.rollingstone.com/culture/news/logan-paul-youtube-looking-into-further-consequences-w515291://abcnews.go.com/Entertainment/youtube-responds-controversial-logan-paul-video/story?id=52254465-Ian’s Video ‘How to Get Views Like Logan Paul’: https://youtu.be/Q-iacolSpi8\n\nNorth Carolina Gerrymandering: ://time.com/5096431/north-carolina-voting-districts-gerrymandering/://www.washingtonpost.com/news/morning-mix/wp/2018/01/10/federal-court-voids-north-carolinas-gop-drawn-congressional-map-for-partisan-gerrymandering/?utm_term=.cdc1967d262e://www.nytimes.com/2018/01/09/us/north-carolina-gerrymander.html?_r=1://www.foxnews.com/politics/2018/01/09/north-carolina-congressional-map-illegally-gerrymandered-judges-rule.html://www.wral.com/court-throws-out-nc-congressional-map-again/17245449/ ://www.documentcloud.org/documents/4345694-North-Carolina-partisan-gerrymandering-opinion.html#search/p17/electing%20republicans%20is%20betterLeaves Breitbart: ://www.breitbart.com/big-government/2018/01/09/stephen-k-bannon-steps-breitbart-news-network/://www.nytimes.com/2018/01/09/us/politics/steve-bannon-breitbart-trump.html://www.cnn.com/2018/01/07/politics/read-bannon-full-statement/index.html://www.politico.com/story/2018/01/09/bannon-steps-down-from-breitbart-news-329603://www.newsweek.com/top-20-revelations-trump-fire-and-fury-book-about-golden-showers-ivanka-bannon-769899://www.foxnews.com/politics/2018/01/09/steve-bannon-steps-down-as-executive-chairman-breitbart-news.html://thehill.com/homenews/media/368264-fox-spokesperson-fox-news-will-not-be-hiring-steve-bannon€”———————————listen on the go? 
-ITUNES: http://PDSPodcast.com\n-SOUNDCLOUD: https://soundcloud.com/thephilipdefrancoshow\n————————————: http://on.fb.me/mqpRW7\nTWITTER: http://Twitter.com/PhillyD\nINSTAGRAM: https://instagram.com/phillydefranco/\nSNAPCHAT: TheDeFrancoFam: https://www.reddit.com/r/DeFranco\n\n————————————by:Girardier: https://twitter.com/jamesgirardier\n\n\nProduced by:Morones - https://twitter.com/MandaOhDang\n\nMotion Graphics Artist:Borst - https://twitter.com/brianjborst\n\nP.O. BOX: Philip DeFranco Ventura BlvdD #542, CA 91436 Youtube’s RIDICULOUS New Response To The Logan Paul Scandal Reveals a Huge Problem and More… 2048788 2018-01-12 https://i.ytimg.com/vi/C-ePy-2WLfY/default.jpg 24 103017 Philip DeFranco FALSE C-ePy-2WLfY 2705 logan paul youtube“|”Logan Paul“|”sxephil“|”philip defranco“|”DeFranco“|”philip defranco show“|”YouTube“|”the philip defranco show“|”demonetization“|”Adpocalypse“|”Logan Paul Suicide“|”logan paul apology“|”pewdiepie“|”pewdiepie nazi“|”Ian Kung“|”North Carolina“|”Gerrymandering“|”David Lewis“|”Republicans“|”Democrats“|”Alabama“|”Roy Moore“|”Doug Jones“|”Steve Bannon“|”Trump“|”Donald Trump“|”Paul Manafort“|”Fire and Fury“|”Michael Wolff“|”Breitbart“|”news“|”us news“|”logan paul vlog“|”logan“|”paul“|”suicide apology FR
FALSE 2018-05-15 FALSE 100 Reto 4 Elementos Episodio 34 Lunes 14 de Mayo 2018 Parte 1 72949 2018-05-15 https://i.ytimg.com/vi/LZ5YHRqRda4/default.jpg 22 252 React Uni3z FALSE LZ5YHRqRda4 142 [none] MX
FALSE 2017-12-24 FALSE 334 HERKEZE MERHABALAR SİZLERE MAÇ ÖZETLERİNİ EN HIZLI BİR ŞEKİLDE SUNMAYA ÇALIŞACAĞIZ EMEĞE KARŞILIK OLARAK 2SANİYENİZİ AYIRARAK ABONE OLUR LİKE ATARSANIZ ÇOK MUTLU OLURUZ İYİ SEYİRLER !göztepe,Galatasaray Göztepe,galatasaray göztepe maç özeti,galatasaray göztepe mac ozeti,galatasaray 3 göztepe 1,gs 3 göztepe 1,galatasaray goztepe hd maç özeti,galatasaray göztepe maç özeti izle,Yasin Öztekin,Galatasaray,galatasaray goztepe izle,Galatasaray Göztepe Maç Özeti izle,galatasaray 3 goztepe 1,galatasaray goztepe mac ozeti izle,Göztepe,Galatasaray Göztepe beınsport,Galatasaray 3 göztepe 1,Göztepe galatasaray izle,galatasaray goztepe mac,HD Galatasaray 3-1 Göztepe -HD Maç Özeti - 24/12/2017 226163 2017-12-25 https://i.ytimg.com/vi/jekzs98AanI/default.jpg 17 1034 Yasin Kayrancı FALSE jekzs98AanI 202 galatasaray göztepe|Galatasaray Göztepe|galatasaray göztepe maç özeti|galatasaray göztepe mac ozeti|galatasaray 3 göztepe 1|gs 3 göztepe 1|galatasaray goztepe hd maç özeti|galatasaray göztepe maç özeti izle|Yasin Öztekin|Galatasaray|galatasaray goztepe izle|Galatasaray Göztepe Maç Özeti izle|galatasaray 3 goztepe 1|galatasaray goztepe mac ozeti izle|Göztepe|Galatasaray Göztepe beınsport|Galatasaray 3 göztepe 1|Göztepe galatasaray izle|galatasaray goztepe mac|HD DE
FALSE 2018-03-06 FALSE 1631 бесплатные тарталетки с малиной к заказу от 1500 до 15.03 , скидка 40% без ограничений по сроку на 3 месяца.€Ð¾Ð¼Ð¾ÐºÐ¾Ð´ oblomoff8://cheese-cake.ru“руппа Ð’Ñ‹ чо мне привезли? - ://vk.com/foodfails“руппа ВК-://vk.com/atpiska§Ð¸ÑÑ‚о рецепты -://vk.com/club103827516¾Ð¹ инстаграмчик-://www.instagram.com/oblomoffood/±Ð·Ð¾Ñ€Ñ‹ техники-://www.youtube.com/user/muhanesidela’идео-бложик-://www.youtube.com/user/oblomoffstuff Ищем ДОСТОЙНЫЕ суши на БАЛИ! #СлавноеБали 598186 2018-03-07 https://i.ytimg.com/vi/du0cM_dBRiQ/default.jpg 24 23211 oblomoff FALSE du0cM_dBRiQ 1718 славный друже|обзор суши|суши|обзор|еда|роллы|обзор доставки|друже|доставка еды|японская кухня|обзоры доставок|японская еда|bali|русские на бали|sashimi|japanese food|путешествия|ролы|рестораны бали|джимбаран|еда на бали|индонезия|bali food|ресторан|где поесть на бали|где покушать на бали|остров бали|отдых на бали|индонезийская кухня|азия|туризм|bali restaurants|цены на бали|сказочное бали|индонезия бали|кухня на бали|jimbaran|отзывы о ресторанах DE
FALSE 2018-04-01 FALSE 408 LUP Mexico Andre Marin Alex Aguinaga Gustavo Mendoza Salim SombraJornada 13 Liga Mx Aguilas del America Derrota a la Maquina de Cruz Azul Conferencia Caixinha. si le vas a Cruz Azul tienes Gana de Cambiar de Equipo? Goles Resumen.Raul Jimenez Rabona La Ultima Palabra - America le Gana a Cruz Azul, Toluca Lider, Tigres Golea a Leon 182783 2018-04-02 https://i.ytimg.com/vi/uYaoNKY9tr0/default.jpg 17 949 Los Amos del Periodismo Deportivo 2 FALSE uYaoNKY9tr0 111 [none] MX
FALSE 2018-06-11 FALSE 1403 DOWNLOAD ONEFOOTBALL APP FOR FREE NOW: https://tinyurl.com/Shpendi10CFC----------------------------------------­--------------------------[LIVE NOW] Belgium vs Costa Rica Live Stream (LIVE NOW) Belgium vs Costa Rica Live Stream Belgium vs Costa Rica 4-1 All Goals and Highlights with English Commentary 2017-18 HD 720pBelgium vs Costa Rica 4 - 1 ● All Goals | 2016/17 [HD] Belgium vs Costa Rica 4-1 Goal De Bruyne 11/06/2018 |HD|Belgium vs Costa Rica 4-1 Goal Lukaku 11/06/2018 |HD| Belgium vs Costa Rica 4-1 All Goals & Highlights 11/06/2018Belgium vs Costa Rica 4:1 2018 - Match Preview 11/06/2018 HD Belgium vs Costa Rica 4-1 All Goals & Highlights 11/06/2018 HDStay with me !https://www.youtube.com/Shpendi10CFChttps://www.twitter.com/ShpendZhubihttps://www.instagram.com/ShpendZhubi Belgium vs Costa Rica 4-1 - All Goals & Extended Highlights - Friendly 11/06/2018 HD 2133596 2018-06-13 https://i.ytimg.com/vi/g4a4Mez2M8o/default.jpg 17 11397 Shpendi10CFC FALSE g4a4Mez2M8o 992 Belgium vs Costa Rica|Belgium vs Costa Rica 2018|Belgium vs Costa Rica 4-1|Belgium vs Costa Rica highlights|Belgium vs Costa Rica all goals|Belgium vs Costa Rica goals highlights|Costa Rica|Belgium|Romelu Lukaku|Batshuayi|Mertens|Hazard DE

Several columns are not used in the analysis, so we can remove them from the dataframe:

videos <- subset(videos, select=-c(description, ratings_disabled, thumbnail_link,
                                   video_error_or_removed, comments_disabled, tags))

Exploring the categories for each country

categories %>% arrange(title) %>% head() %>% kable(caption="Sample of categories data")
Sample of categories data
id title country
32 Action/Adventure CA
32 Action/Adventure DE
32 Action/Adventure FR
32 Action/Adventure GB
32 Action/Adventure MX
32 Action/Adventure US

Since the category ID and title are the same for all countries, we can drop the country column and keep only the unique rows:

categories <- categories %>%
  select(id, title) %>% 
  unique() %>%
  rename(category_id = id)

head(categories) %>% kable(caption = "Sample clean categories data")
Sample clean categories data
category_id title
1 Film & Animation
2 Autos & Vehicles
10 Music
15 Pets & Animals
17 Sports
18 Short Movies

To avoid issues related to character encoding, we will only keep videos whose titles and channel names can be represented with ASCII characters.

# Clean both text columns in a single step so the cleaned vectors always have
# the same number of rows as the dataframe, then drop rows containing NA.
videos <- videos %>%
  mutate(
    title = replace_non_ascii(
      title,
      replacement = NA,
      remove.nonconverted = TRUE),
    channel_title = replace_non_ascii(
      channel_title,
      replacement = NA,
      remove.nonconverted = TRUE)
  ) %>%
  na.omit()

kable(head(videos, 10), caption="Sample ASCII compatible videos data")
Sample ASCII compatible videos data
publish_time comment_count title views trending_date category_id likes channel_title video_id dislikes country
2017-11-10 125882 Eminem - Walk On Water (Audio) ft. BeyoncA(C) 17158579 2017-11-14 10 787425 EminemVEVO n1WpP7iowLc 43420 CA
2017-11-13 13030 PLUSH - Bad Unboxing Fan Mail 1014651 2017-11-14 23 127794 iDubbbzTV 0dBIkQ4Mz1M 1688 CA
2017-11-12 8181 Racist Superman | Rudy Mancuso, King Bach & Lele Pons 3191434 2017-11-14 23 146035 Rudy Mancuso 5qpjK5DgCt4 5339 CA
2017-11-12 17518 I Dare You: GOING BALD!? 2095828 2017-11-14 24 132239 nigahiga d380meD0W0M 1989 CA
2017-11-09 85067 Ed Sheeran - Perfect (Official Music Video) 33523622 2017-11-14 10 1634130 Ed Sheeran 2Vv-BfVoq4g 21082 CA
2017-11-13 12143 Jake Paul Says Alissa Violet CHEATED with LOGAN PAUL! #DramaAlert Team 10 vs Martinez Twins! 1309699 2017-11-14 25 103755 DramaAlert 0yIWz1XEeyc 4613 CA
2017-11-12 26629 Vanoss Superhero School - New Students 2987945 2017-11-14 23 187464 VanossGaming _uM5kFfkhB8 9850 CA
2017-11-13 15959 WE WANT TO TALK ABOUT OUR MARRIAGE 748374 2017-11-14 22 57534 CaseyNeistat 2kyS6SvSYSE 2967 CA
2017-11-12 36391 THE LOGANG MADE HISTORY. LOL. AGAIN. 4477587 2017-11-14 24 292837 Logan Paul Vlogs JzCsM1vtn78 4123 CA
2017-11-10 1484 Finally Sheldon is winning an argument about the existence of God 505161 2017-11-14 22 4135 Sheikh Musa 43sm-QwLcx4 976 CA
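
As a cross-check on the ASCII filtering described above, the following is a minimal base R sketch of a stricter rule that keeps a row only when both text columns are pure ASCII. The helper `is_ascii` and the sample rows are hypothetical and made up for illustration; they are not part of textclean or of the dataset.

```r
# Return TRUE for strings made only of ASCII characters (code points <= 127).
# `is_ascii` is a hypothetical helper written for this sketch.
is_ascii <- function(x) {
  vapply(
    x,
    function(s) all(utf8ToInt(enc2utf8(s)) <= 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Made-up sample rows, for illustration only
sample_videos <- data.frame(
  title = c("Plain title", "Beyonc\u00e9 live", "Caf\u00e9 tour"),
  channel_title = c("Channel A", "Channel B", "Channel C"),
  stringsAsFactors = FALSE
)

# Keep rows where both the title and the channel name are pure ASCII
ascii_only <- sample_videos[
  is_ascii(sample_videos$title) & is_ascii(sample_videos$channel_title),
  ,
  drop = FALSE
]
print(ascii_only)
```

Unlike a replacement-based cleaner, this drops rows rather than rewriting characters, so accented titles would be removed instead of transliterated.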

Exploratory Analysis

To explore the data, we are going to look at the viewing trend of videos per country.

library(ggplot2)

ggplotly(
  videos %>% group_by(country, trending_date) %>% 
    count() %>%
    ggplot(aes(x=trending_date, y=n, group=country)) +
    geom_line(aes(color=country)) +
    ggtitle("Trending videos per country") +
    scale_y_continuous(name="Videos", limits=c(70, 210), breaks=seq(70,210,10)) +
    scale_x_continuous(name="Trending Date", breaks = seq(as.Date('2017-11-01'), as.Date('2018-06-26'), 30)) +
    theme_minimal()
)
ggplotly(
  videos %>% group_by(country, trending_date) %>% summarise(avg_views = mean(views)) %>%
    ggplot(aes(x=trending_date, y=avg_views, group=country)) +
    geom_line(aes(color=country)) +
    ggtitle("Avg views per country") +
    scale_y_continuous(name="Avg Views", limits=c(0, 14000000), breaks=seq(0,14000000,1000000), labels = scales::comma) +
    scale_x_continuous(name="Trending Date", breaks = seq(as.Date('2017-11-01'), as.Date('2018-06-26'), 30)) +
    theme_minimal()
  )
ggplotly(
  videos %>% group_by(country, trending_date) %>% summarise(pct_99 = quantile(views, probs=c(0.99))) %>%
    ggplot(aes(x=trending_date, y=pct_99, group=country)) +
    geom_line(aes(color=country)) +
    ggtitle("99th Percentile of views per country") +
    scale_y_continuous(name="PCT 99th Views", limits=c(0, 260000000), breaks=seq(0,260000000,20000000), labels = scales::comma) +
    scale_x_continuous(name="Trending Date", breaks = seq(as.Date('2017-11-01'), as.Date('2018-06-26'), 30)) +
    theme_minimal()
)
ggplotly(
  videos %>% group_by(country, trending_date) %>% summarise(min_views = min(views)) %>%
    ggplot(aes(x=trending_date, y=min_views, group=country)) +
    geom_line(aes(color=country)) +
    ggtitle("Min views per country") +
    scale_y_continuous(name="Min Views", limits=c(0, 320000), breaks=seq(0,320000,20000), labels = scales::comma) +
    scale_x_continuous(name="Trending Date", breaks = seq(as.Date('2017-11-01'), as.Date('2018-06-26'), 30)) +
    theme_minimal()
)

Moreover, exploring the correlation between the numeric columns helps show what kind of analysis the data can support.

library(corrplot)
cor_matrix <- videos %>% select(views, likes, dislikes, comment_count) %>%
  cor( method = "pearson", use = "complete.obs")

corrplot(cor_matrix, method="number")

videos %>% select(views, likes, dislikes, comment_count) %>%
  pairs()
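
Counts such as views and likes tend to be heavily skewed, so a rank-based correlation or a log transform can be a useful cross-check on the Pearson matrix above. The sketch below uses made-up numbers (not values from the dataset) to show how the three measures can differ when the relationship is monotone but not linear.

```r
# Made-up, heavily skewed counts for illustration only
views <- c(10, 100, 1000, 10000, 1000000)
likes <- sqrt(views)  # perfectly monotone in views, but not linear

pearson     <- cor(views, likes, method = "pearson")
spearman    <- cor(views, likes, method = "spearman")
log_pearson <- cor(log10(views), log10(likes), method = "pearson")

print(c(pearson = pearson, spearman = spearman, log_pearson = log_pearson))
```

Here the Spearman and log-log Pearson coefficients both reach 1 because the ordering is preserved exactly, while the plain Pearson coefficient is pulled below 1 by the single very large value.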

Data Analysis

In this analysis, we have a few questions that we would like to answer:

  1. How many likes and dislikes do the top 10 trending videos have in each country, and overall (top 10 based on final number of views)?
  2. How many videos have more than 10 million views on their final trending day in each country and overall?
  3. What is the ratio of likes & comments to views?
  4. How do the videos cluster together, and what conclusions can we draw from the clusters?

To answer these questions, we need the data for each video on its last trending date. This gives a snapshot of every video at the end of its trending run, when its counts are at their highest, which puts all videos on an even footing for comparison.

videos_last_trending_day <- videos %>% 
  group_by(video_id) %>%
  arrange(video_id, desc(trending_date)) %>%
  mutate(row_num = row_number()) %>%
  filter(row_num == 1) %>%
  subset(select=-c(row_num))
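
The same "latest row per video" selection can be written more compactly with `dplyr::slice_max()` (available in dplyr 1.0.0 and later). The sketch below is an assumed equivalent shown on made-up rows, not the code used to produce the results in this report.

```r
library(dplyr)

# Made-up rows: video "a" trends on three days, video "b" on one
sample_videos <- data.frame(
  video_id = c("a", "a", "a", "b"),
  trending_date = as.Date(c("2018-01-01", "2018-01-02",
                            "2018-01-03", "2018-01-02")),
  views = c(100, 250, 400, 90),
  stringsAsFactors = FALSE
)

# Keep exactly one row per video: the one with the latest trending date
last_day <- sample_videos %>%
  group_by(video_id) %>%
  slice_max(trending_date, n = 1, with_ties = FALSE) %>%
  ungroup()

print(last_day)
```

Grouping by `video_id` alone keeps a single row per video even when the same video trended in several countries. Adding `country` to `group_by()` would keep one row per video and country, if that is what the per-country questions call for.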

Analysis Questions

  1. How many likes and dislikes do the top 10 trending videos have in each country, and overall (top 10 based on final number of views)?

Likes & dislikes per country

sqldf("
with cte as (
  select *,
    row_number() over (partition by country order by views desc) as video_rank 
  from videos_last_trending_day
)
select country,
  sum(likes) as likes,
  sum(dislikes) as dislikes
from cte
where video_rank <= 10
group by country
limit 100") %>% kable()
country likes dislikes
CA 5870253 616783
DE 1033171 423228
FR 1513184 137625
GB 28881235 3231725
MX 2871750 169955
US 7367068 437387

Overall likes & dislikes (summed across each country's top 10 videos)

sqldf("
with cte as (
  select *,
    row_number() over (partition by country order by views desc) as video_rank 
  from videos_last_trending_day
)
select sum(likes) as likes,
  sum(dislikes) as dislikes
from cte
where video_rank <= 10
limit 100") %>% kable()
likes dislikes
47536661 5016703
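
The query above still partitions by country, so its totals cover each country's top 10 videos combined. If a single top 10 across all countries is wanted instead, one possible approach is sketched below with dplyr on made-up rows; the data frame and the cutoff of 3 (instead of 10) are assumptions chosen to keep the example small.

```r
library(dplyr)

# Made-up last-trending-day rows, for illustration only
sample_last_day <- data.frame(
  country  = c("CA", "CA", "US", "US", "GB", "GB"),
  views    = c(500, 450, 60, 30, 900, 20),
  likes    = c(50, 45, 6, 3, 90, 2),
  dislikes = c(5, 4, 1, 0, 9, 0)
)

# Rank across all countries at once, then total likes and dislikes
overall_top <- sample_last_day %>%
  slice_max(views, n = 3, with_ties = FALSE) %>%
  summarise(likes = sum(likes), dislikes = sum(dislikes))

print(overall_top)
```
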

  2. How many videos have more than 10 million views on their final trending day in each country and overall?

Overall videos

sqldf("
select count(1) videos_count
from videos_last_trending_day
where views > 10000000
limit 100")  %>% kable()
videos_count
416

Videos per country

sqldf("
select country,
  count(1) videos_count
from videos_last_trending_day
where views > 10000000
group by 1
limit 100")  %>% kable()
country videos_count
CA 77
DE 3
FR 1
GB 245
MX 10
US 80
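
Question 3 from the list above (the ratio of likes and comments to views) could be approached along the following lines. This is a sketch on made-up numbers, assuming the same column names as `videos_last_trending_day`; it does not reproduce results from the dataset.

```r
library(dplyr)

# Made-up last-trending-day rows, for illustration only
sample_last_day <- data.frame(
  country       = c("CA", "CA", "US"),
  views         = c(1000, 3000, 2000),
  likes         = c(100, 150, 50),
  comment_count = c(10, 30, 40)
)

# Ratio of total likes and total comments to total views, per country
ratios <- sample_last_day %>%
  group_by(country) %>%
  summarise(
    likes_per_view    = sum(likes) / sum(views),
    comments_per_view = sum(comment_count) / sum(views)
  )

print(ratios)
```

Dividing the sums weights every view equally, so a few very large videos drive the result. Averaging per-video ratios instead would weight every video equally; which one is appropriate depends on the question being asked.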

Clustering video data

To find similarities in such a large dataset, we can use an unsupervised machine learning algorithm to group the videos into clusters. For this analysis, we use K-Means clustering with six clusters. The table below shows the center of each cluster.

library(mltools)
library(data.table)
library(factoextra)
library(dummies)
# fastDummies provides dummy_cols(), used below for one-hot encoding
library(fastDummies)

dataset <- base::merge(videos_last_trending_day, categories, by="category_id") %>%
  rename(category_name = title.y) %>%
  subset(select=-c(category_id,
                   publish_time,
                   title.x,
                   trending_date,
                   video_id,
                   channel_title
                   ))

dataset <- dummy_cols(dataset, select_columns = c('country', 'category_name'))

set.seed(1234)
dataset <- dataset %>% subset(select=-c(country))
dataset <- dataset %>% subset(select=-c(category_name))

km.res <- kmeans(dataset, 6, nstart = 100)


km.res$centers %>% kable(row.names = TRUE)
comment_count views likes dislikes country_CA country_DE country_FR country_GB country_MX country_US category_name_Autos & Vehicles category_name_Comedy category_name_Education category_name_Entertainment category_name_Film & Animation category_name_Gaming category_name_Howto & Style category_name_Movies category_name_Music category_name_News & Politics category_name_Nonprofits & Activism category_name_People & Blogs category_name_Pets & Animals category_name_Science & Technology category_name_Shows category_name_Sports category_name_Trailers category_name_Travel & Events
1 35880.925 19231302.9 396207.034 25997.9887 0.2180451 0.0075188 0.0037594 0.5375940 0.0263158 0.2067669 0.0075188 0.0150376 0.0000000 0.1804511 0.0639098 0.0187970 0.0112782 0.000000 0.5827068 0.0037594 0.0037594 0.0300752 0.0000000 0.0112782 0.0000000 0.0676692 0.00e+00 0.0037594
2 211807.400 327910910.2 3257466.400 212472.6000 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00e+00 0.0000000
3 149278.053 121396174.8 1675693.263 163983.3684 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.1052632 0.0000000 0.0000000 0.0000000 0.000000 0.8947368 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00e+00 0.0000000
4 90824.922 57377152.2 920458.875 81142.8750 0.0156250 0.0000000 0.0000000 0.8125000 0.0000000 0.1718750 0.0000000 0.0156250 0.0000000 0.1093750 0.0312500 0.0000000 0.0156250 0.000000 0.7968750 0.0000000 0.0000000 0.0156250 0.0000000 0.0156250 0.0000000 0.0000000 0.00e+00 0.0000000
5 13714.361 4274939.0 126603.557 6288.3621 0.3780488 0.0426829 0.0281426 0.2246717 0.0708255 0.2556285 0.0028143 0.0984991 0.0065666 0.2800188 0.0581614 0.0257974 0.0581614 0.000469 0.2631332 0.0201689 0.0009381 0.0769231 0.0065666 0.0201689 0.0000000 0.0792683 0.00e+00 0.0023452
6 1051.793 220959.1 7700.964 344.3836 0.2199418 0.1864425 0.2071976 0.0293918 0.2998477 0.0571786 0.0191910 0.0625782 0.0221152 0.3098580 0.0398917 0.0340705 0.0630406 0.000000 0.0647816 0.0907867 0.0046107 0.1536505 0.0076574 0.0229721 0.0018225 0.0981040 1.36e-05 0.0048556
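
The centers above place views in the hundreds of millions next to 0/1 dummy columns. Because K-Means relies on Euclidean distance, columns with large magnitudes tend to dominate the assignment. One common option, which this report does not use, is to standardize the columns before clustering. The sketch below shows the idea on made-up data; the column names and values are assumptions.

```r
set.seed(1234)

# Made-up features on very different scales, for illustration only
toy <- data.frame(
  views    = c(1e6, 2e6, 5e7, 6e7, 1.5e6, 5.5e7),
  likes    = c(1e4, 3e4, 9e5, 8e5, 2e4, 7e5),
  is_music = c(0, 1, 1, 1, 0, 1)
)

# Center each column to mean 0 and rescale it to standard deviation 1
toy_scaled <- scale(toy)

# Cluster the raw and the standardized data with the same settings
km_raw    <- kmeans(toy, centers = 2, nstart = 10)
km_scaled <- kmeans(toy_scaled, centers = 2, nstart = 10)

print(km_raw$cluster)
print(km_scaled$cluster)
```

Comparing the two cluster vectors on the real data would show how much the grouping depends on the raw scale of the view counts.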

Below is a sample of the data joined to the clustering output

categ_videos <- base::merge(videos_last_trending_day, categories, by="category_id") %>%
  rename(category_name = title.y) %>%
  subset(select=-c(category_id,
                   publish_time,
                   title.x,
                   video_id,
                   # split_tags,
                   channel_title
                   )) %>%
  drop_na()
video_cluster <- cbind(categ_videos, cluster = km.res$cluster)
video_cluster %>% head() %>% kable()
comment_count views trending_date likes dislikes country category_name cluster
791 198904 2018-06-09 4319 142 CA Film & Animation 6
432 507121 2018-05-06 4129 185 US Film & Animation 6
9 5797 2018-01-04 53 2 MX Film & Animation 6
0 14745 2018-02-28 342 11 MX Film & Animation 6
4 16786 2017-11-20 66 10 FR Film & Animation 6
33 8224 2017-12-16 186 13 DE Film & Animation 6

Below are charts showing how the video categories are clustered per average views, comment count, likes & dislikes.

ggplotly(
  video_cluster %>% group_by(category_name, cluster = as.character(cluster)) %>%
    summarise(avg_views = mean(views)) %>%
    ggplot(aes(x=cluster, y=avg_views, group=category_name)) +
    geom_bar(stat = 'identity', aes( fill=category_name), position=position_dodge()) +
    theme_classic() +
    # theme() tweaks must come after the complete theme, or theme_classic() resets them
    theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, hjust = 1)) +
    ggtitle("Avg views per cluster and category")
)
ggplotly(
  video_cluster %>% group_by(category_name, cluster = as.character(cluster)) %>%
    summarise(avg_comments = mean(comment_count)) %>%
    ggplot(aes(x=cluster, y=avg_comments, group=category_name)) +
    geom_bar(stat = 'identity', aes( fill=category_name), position=position_dodge()) +
    theme_classic() +
    # theme() tweaks must come after the complete theme, or theme_classic() resets them
    theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, hjust = 1)) +
    ggtitle("Avg comments per cluster and category")
)
ggplotly(
  video_cluster %>% group_by(category_name, cluster = as.character(cluster)) %>%
    summarise(avg_likes = mean(likes)) %>%
    ggplot(aes(x=cluster, y=avg_likes, group=category_name)) +
    geom_bar(stat = 'identity', aes( fill=category_name), position=position_dodge()) +
    theme_classic() +
    # theme() tweaks must come after the complete theme, or theme_classic() resets them
    theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, hjust = 1)) +
    ggtitle("Avg likes per cluster and category")
)
ggplotly(
  video_cluster %>% group_by(category_name, cluster = as.character(cluster)) %>%
    summarise(avg_dislikes = mean(dislikes)) %>%
    ggplot(aes(x=cluster, y=avg_dislikes, group=category_name)) +
    geom_bar(stat = 'identity', aes( fill=category_name), position=position_dodge()) +
    theme_classic() +
    # theme() tweaks must come after the complete theme, or theme_classic() resets them
    theme(axis.text.x = element_text(face = "bold", size = 12, angle = 45, hjust = 1)) +
    ggtitle("Avg dislikes per cluster and category")
)

Based on the charts above, we can make the following conclusions:

  • Cluster 2 is made up of Music videos with high views and high interaction.
  • Cluster 3 is made up of Entertainment and Music videos only, which highlights the similarity between these categories, as music is a form of entertainment.
  • Clusters 1, 5 and 6 contain videos from a broad mix of categories. However, among these three, videos in cluster 1 clearly get the most interaction (likes, dislikes, comments and views), followed by cluster 5 and then cluster 6.
  • Cluster 4 is dominated by Music (about 80% of its videos according to the cluster centers), alongside Entertainment, DIY (Howto & Style), vlogs (People & Blogs) and a few other categories, with a medium level of interaction.
  • Although Nonprofits & Activism videos do not have high views, they have a very high interaction rate in terms of comments, likes and dislikes.

Engagement Analysis

Based on the clusters created, we can conclude that the category of a video can affect the way users interact with it. The engagement rate of a video category reflects how much users interact with its videos through comments, likes and dislikes.

ggplotly(
  video_cluster %>% group_by(category_name) %>%
  summarise(
    avg_dislikes = mean(dislikes),
    avg_likes = mean(likes),
    avg_comments = mean(comment_count),
    avg_views = mean(views)
    ) %>%
  ggplot(aes(x=avg_views, y=avg_comments, color=category_name)) +
  ggtitle("Avg views over avg comments per category") +
  geom_point(aes(fill=category_name))
)
ggplotly(
  video_cluster %>% group_by(category_name) %>%
  summarise(
    avg_dislikes = mean(dislikes),
    avg_likes = mean(likes),
    avg_comments = mean(comment_count),
    avg_views = mean(views)
    ) %>%
  ggplot(aes(x=avg_views, y=avg_likes, color=category_name)) +
  ggtitle("Avg views over avg likes per category") +
  geom_point(aes(fill=category_name))
)
ggplotly(
  video_cluster %>% group_by(category_name) %>%
  summarise(
    avg_dislikes = mean(dislikes),
    avg_likes = mean(likes),
    avg_comments = mean(comment_count),
    avg_views = mean(views)
    ) %>%
  ggplot(aes(x=avg_views, y=avg_dislikes, color=category_name)) +
  ggtitle("Avg views over avg dislikes per category") +
  geom_point(aes(fill=category_name))
)

As displayed in the charts above, the clusters created by the K-Means algorithm do reflect the attributes of the data. The charts also confirm the assumptions made earlier that:

  • Entertainment & Music videos get the highest engagement rate
  • Non-profit & Activism videos get the third highest level of engagement even though they do not get the third highest level of average views. This could be due to the polarizing nature of Non-profit & activism issues.