1 Summary

1.1 Goals of this project

To understand the demographics of some of the most popular YouTube videos
Understand the impact of different mertics such as likes, dislikes, tags and comments on number of views on different videos based on different countries
To unearth the patterns in what it takes for a video to go viral and increase its popularity
In the end we will create a YouTube video recommender system which is considered the single most important factor to increase user engagement by top content hosting websites such as YouTube, Netflix and Spotify.

1.2 Introduction to Youtube

Founded in 2005 in sunny California, YouTube has changed how we work, play and consume content.
Beyond the viral videos like Charlie bit my finger and Evolution of dance, the tech giant has completely revolutionised the entertainment industry.
Here are just some of the ways that the world will never be the same again.
Talent discovery
YouTube is an extremely useful tool for actors, dancers, singers, impressionists etc. It’s a completely free way for them to share their material with a large audience. In fact, Justin Bieber was actually scouted after posting his first YouTube clip.
Political discussion
YouTube provides users with a great platform to expose government wrongdoing and encourage discussion, especially in countries with limited free speech. For example, Russian punk band Pussy Riot screened their 2012 protest against Putin on the website.
Accessible education
YouTube has an abundance of how-to videos, tutorials and lectures. If you have a burning question all you have to do is search the term and you’ll probably find lots of videos on the subject. This is a great way to make education accessible for all.
More money donated to charities
Do you remember the ice bucket challenge that took over social media back in 2014? Over 2.4 million people took part including major celebs like Kim Kardashian, Oprah and LeBron James. More than $70 million was raised for the ALS Association and other charities. Without YouTube’s global reach this wouldn’t have been possible.
A move towards video content
Since the invention of YouTube, companies have started to place a greater emphasis on video content such as employer branding clips and employee testimonials. These assets are crucial to attract the best talent in 2019 and beyond.
Advertising
Many people thought that YouTube meant the end of traditional advertising, however, it merely provided companies with another outlet. For example, the John Lewis Christmas ad already has 10 million views (it has only been online for one week). While the latest Lion King trailer has knocked up 44 million views in just four days

Every minute, more than 100 hours of video are uploaded to YouTube. It has over a billion users, almost one-third of all people on the Internet.

1.3 Why is it important to analyse Youtube videos?

The way video is consumed is changing fast. A report by Cisco suggests that by 2021, 80% of all traffic on the internet will be video-generated. And when you talk about videos, you have to mention YouTube.
YouTube is great, but also weird. It can be extremely useful or completely pointless. The type of content on YouTube ranges from life-hacks and tutorials to “Lord Voldemort laughing like a retard for ten hours”.
Why am I mentioning all this? Because it shows how diverse and vast YouTube’s audience is. In case you are still wondering why it is important to include YouTube in your marketing plan, here are a few stats for you: * YouTube is the second largest search engine after Google.
* YouTube users watch more than one billion videos in a day. This is more than Netflix and Facebook video combined.
* It is available in 88 countries and can be accessed in 76 different languages.
* People pay more attention to YouTube ads than Facebook ads.
If you are a marketer deciding whether to use YouTube or not, these numbers should give you enough proof of its reach.
Reach is important for any marketing strategy, but a platform as big as YouTube also presents huge competition. Which is why understanding how YouTube analytics work is very important.

# Installing Dependencies
set.seed(123)
# Data manipulation
library(data.table)
library(dplyr)
library(DT)
library(lubridate)
library(reshape)
library(rjson)
library(tidyr)
library(tibble)
library(reshape)
# Visualization
library(ggplot2)
library(ggpubr)
library(GGally)
library(ggcorrplot)
library(ggrepel)
# Wordcloud
library(wordcloud)
# Text manipulation
library(tidytext)
library(stringr)
library(tm)
library(sentimentr)
library(wordcloud)
library(RSentiment)
# Recommender system
library(recommenderlab)

2 Datasets

The project involves multiple datasets

1. The dataset of top trending videos from "www.kaggle.com"  
   i) The dataset has top trending videos from four different countries USA, Canada, UK and India  
   ii) Every video has related metrics such as video category, Title, Tags, Likes, Dislikes, comments etc  
2. The second dataset was created by pulling data through Youtube API  
   i) It contains a list of top 100 funny videos from Youtube  
   ii) The dataset was then moked up to represent multiple users and their rating on each of those funny videos  
   iii) The three columns in the dataset represents the name of the user, name of the video and the rating against the video  
   iv) The rating can have two values, 1/TRUE - Represents if the video was liked by the user and 0/FALSE - If the video was not watched or disliked by the user  
   v) Even though parts of the dataset is moked up it represents the actual scenario and thus it appropriate for recommendation system

#Reading the dataset of top trending videos from four different countries available on "www.kaggle.com"

gb <- as.data.table(read.csv("GBvideos.csv"))

gb[,"Location":="GB"]

ind <- as.data.table(read.csv("INvideos.csv"))

ind[,"Location":="IN"]

ca <- as.data.table(read.csv("CAvideos.csv"))

ca[,"Location":="CA"]

us <- as.data.table(read.csv("USvideos.csv"))

us[,"Location":="US"]

# Reading the json files which contains the names of the categories to which the videos belong

# LOADING DATA

categories.ca <- fromJSON(file="CA_category_id.json")
categories.gb <- fromJSON(file="GB_category_id.json")
categories.ind <- fromJSON(file="IN_category_id.json")
categories.us <- fromJSON(file="US_category_id.json")

# RECOVERING VIDEO CATEGORIES
ca.dict <- sapply(categories.ca$items, FUN=function(l){
  id <- l$id
  category <- l$snippet$title
  return(c(id,category))
})

gb.dict <- sapply(categories.gb$items, FUN=function(l){
  id <- l$id
  category <- l$snippet$title
  return(c(id,category))
})

ind.dict <- sapply(categories.ind$items, FUN=function(l){
  id <- l$id
  category <- l$snippet$title
  return(c(id,category))
})

us.dict <- sapply(categories.us$items, FUN=function(l){
  id <- l$id
  category <- l$snippet$title
  return(c(id,category))
})

ca.dict <- data.frame(t(ca.dict))
gb.dict <- data.frame(t(gb.dict))
ind.dict <- data.frame(t(ind.dict))
us.dict <- data.frame(t(us.dict))

colnames(ca.dict) <- c("category_id","category_title")
colnames(gb.dict) <- c("category_id","category_title")
colnames(ind.dict) <- c("category_id","category_title")
colnames(us.dict) <- c("category_id","category_title")

ca.dict$category_id <- as.numeric(as.character(ca.dict$category_id))
gb.dict$category_id <- as.numeric(as.character(gb.dict$category_id))
ind.dict$category_id <- as.numeric(as.character(ind.dict$category_id))
us.dict$category_id <- as.numeric(as.character(us.dict$category_id))

rownames(ca.dict) <- ca.dict$category_id
rownames(gb.dict) <- gb.dict$category_id
rownames(ind.dict) <- ind.dict$category_id
rownames(us.dict) <- us.dict$category_id

# ASSIGNING CATEGORIES TO VIDEOS
ca  <- merge(ca,ca.dict,by="category_id")
gb <- merge(gb,gb.dict,by="category_id")
ind <- merge(ind,ind.dict,by="category_id")
us <- merge(us,us.dict,by="category_id")


#combining the four datasets into one

data <- as.data.table(rbind(ca,gb,ind,us))

#Converting trending date to date datatype

data$trending_date <- ydm(data$trending_date)

#Splitting publish time into publish date and time

data[, c("P_day", "P_time") := tstrsplit(publish_time, "T", fixed=TRUE)]

data$publish_time <- NULL

data$P_day <- ymd(data$P_day)

# Converting time to HMS format

data$P_time <- strtrim(data$P_time,5)

data$P_time <- hm(data$P_time)

# Removing variables which are not important for our analysis

#data$thumbnail_link <- NULL
data$comments_disabled <- NULL
data$video_error_or_removed <- NULL
data$ratings_disabled <- NULL

# Converting Location to factors

data$Location <- as.factor(data$Location)

#Creating metric to find avg time for a video to viral

data$diff_days <- data$trending_date-data$P_day

3 Exploratory Data Analysis

3.0.1 Finding the correlation between numeric variables

corr <- cor(data[,c("views","likes","dislikes","comment_count"), with=F])

ggcorrplot(corr, hc.order=T, type="lower",outline.col="white",
             ggtheme = ggplot2::theme_gray,
             lab=T,
             colors = c("#6D9EC1", "white", "#E46726"))

* 1. We see a high correlation between the number of likes on a video and views  
* 2. Whereas the correlation is almost half between dislikes and views

3.0.2 Most…

3.0.2.1 Viewed videos

mvideo <- data[,.("Total_Views"=round(max(views,na.rm = T),digits = 2)),by=.(title,thumbnail_link)][order(-Total_Views)]



mvideo %>% 

  mutate(image = paste0('<img width="80%" height="80%" src="', thumbnail_link , '"></img>')) %>% 

  arrange(-Total_Views) %>% 

  top_n(5,wt = Total_Views) %>% 

  select(image, title, Total_Views) %>% 

  datatable(class = "nowrap hover row-border", escape = FALSE, options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))

3.0.2.2 Liked videos

mvideo <- data[,.("Total_Likes"=round(max(likes,na.rm = T),digits = 2)),by=.(title,thumbnail_link)][order(-Total_Likes)]



mvideo %>% 

  mutate(image = paste0('<img width="80%" height="80%" src="', thumbnail_link , '"></img>')) %>% 

  arrange(-Total_Likes) %>% 

  top_n(5,wt = Total_Likes) %>% 

  select(image, title, Total_Likes) %>% 

  datatable(class = "nowrap hover row-border", escape = FALSE, options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))

3.0.2.3 Disliked videos

mvideo <- data[,.("Total_Dislikes"=round(max(dislikes,na.rm = T),digits = 2)),by=.(title,thumbnail_link)][order(-Total_Dislikes)]



mvideo %>% 

  mutate(image = paste0('<img width="80%" height="80%" src="', thumbnail_link , '"></img>')) %>% 

  arrange(-Total_Dislikes) %>% 

  top_n(5,wt = Total_Dislikes) %>% 

  select(image, title, Total_Dislikes) %>% 

  datatable(class = "nowrap hover row-border", escape = FALSE, options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))

3.0.2.4 Commented videos

mvideo <- data[,.("Total_comments"=round(max(comment_count,na.rm = T),digits = 2)),by=.(title,thumbnail_link)][order(-Total_comments)]



mvideo %>% 

  mutate(image = paste0('<img width="80%" height="80%" src="', thumbnail_link , '"></img>')) %>% 

  arrange(-Total_comments) %>% 

  top_n(5,wt = Total_comments) %>% 

  select(image, title, Total_comments) %>% 

  datatable(class = "nowrap hover row-border", escape = FALSE, options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))

3.0.3 Top Trending Channels

ggplot(data[,.N,by=channel_title][order(-N)][1:10],aes(reorder(channel_title,-N),N,fill=channel_title))+
  geom_bar(stat="identity")+
  geom_label(aes(label=N))+
  guides(fill="none")+
  theme(axis.text.x = element_text(angle = 45,hjust = 1))+  
  labs(caption="Number of videos",title=" Top trending channel titles in all countries")+
  xlab(NULL)+ylab(NULL)+coord_flip()

* We see that the most number of videos are posted by channels which have shows on a regular basis

3.0.4 Top Trending Categories

ggplot(data[,.N,by=category_title][order(-N)][1:10],aes(reorder(category_title,-N),N,fill=as.factor(category_title)))+
  geom_bar(stat="identity")+
  guides(fill="none")+
  theme(axis.text.x = element_text(angle = 45,hjust = 1))+
  labs(title=" Top Category ID")+
  xlab("Category")+ylab("Number of videos")

* 1. Clearly music videos and entertainment related videos are most famous on YouTube  
* 2. This idea probably led to YouTube create an entire new app dedicated to Music

3.0.5 How many days does it take for a video to trend

ggplot(data[diff_days<30],aes(as.factor(diff_days),fill=as.factor(diff_days)))+
  geom_bar()+
  guides(fill="none")+
  labs(title=" Time between published and trending",subtitle="In days")+
  xlab("Number of Days")

* It usually takes atleast one to two days for a video to trend, never starts trending on the same day

3.0.6 Views v/s Likes

ggplot(data[,.("views"=max(views),"likes"=max(likes)),by=title],aes(views,likes,colour=likes,size=likes))+
  geom_jitter()+
  geom_smooth()+
  guides(fill="none")+
  labs(caption="Donyoe",title="Views V/s Likes")+
  theme(legend.position ="none")+
  geom_text_repel(data=subset(data[,.("views"=max(views),"likes"=max(likes)),by=title], views > 1e+08),aes(views,likes,label=title),check_overlap=T)

3.0.7 Likes v/s Comments

ggplot(data[,.("comment_count"=max(comment_count),"likes"=max(likes)),by=title],aes(comment_count,likes,colour=likes,size=likes))+
  geom_jitter()+
  geom_smooth()+
  guides(fill="none")+
  labs(caption="Donyoe",title="Views V/s Comment")+
  theme(legend.position = "none")+geom_text_repel(data=subset(data[,.("comment_count"=max(comment_count),"likes"=max(likes)),by=title], likes > 3e+06),aes(comment_count,likes,label=title),check_overlap=T)

3.0.8 Numbers based on countries

3.0.8.1 Total number of views

ggplot(data[,.("Total_Views"=max(views)),by=Location],aes(reorder(Location,-Total_Views),Total_Views,fill=Location))+
  geom_bar(stat="identity")+
  geom_label(aes(label=Total_Views))+
  guides(fill="none")+
  theme(axis.text.x = element_text(angle = 45,hjust = 1))+  
  labs(title=" Total Views by Countries")+
  xlab(NULL)+
  ylab(NULL)

* GB is the Country with most viewed videos in the trending field with significative difference with the other countries, almost doubled the second country

3.0.8.2 Total number of likes

ggplot(data[,.("Total_Likes"=max(likes)),by=Location],aes(reorder(Location,-Total_Likes),Total_Likes,fill=Location))+
  geom_bar(stat="identity")+
  geom_label(aes(label=Total_Likes))+
  guides(fill="none")+
  theme(axis.text.x = element_text(angle = 45,hjust = 1))+  
  labs(title=" Total number of likes by Countries")+
  xlab(NULL)+
  ylab(NULL)

3.0.8.3 Total number of dislikes

ggplot(data[,.("Total_Dislikes"=max(dislikes)),by=Location],aes(reorder(Location,-Total_Dislikes),Total_Dislikes,fill=Location))+
  geom_bar(stat="identity")+
  geom_label(aes(label=Total_Dislikes))+
  guides(fill="none")+
  theme(axis.text.x = element_text(angle = 45,hjust = 1))+  
  labs(title=" Total Dislikes by Countries")+xlab(NULL)+ylab(NULL)

3.0.8.4 Total number of comments

ggplot(data[,.("Total_Comments"=max(comment_count)),by=Location],aes(reorder(Location,-Total_Comments),Total_Comments,fill=Location))+
  geom_bar(stat="identity")+
  geom_label(aes(label=Total_Comments))+
  guides(fill="none")+theme(axis.text.x = element_text(angle = 45,hjust = 1))+  labs(title=" Total Comments by Countries")+xlab(NULL)+ylab(NULL)

3.0.8.5 Average time for a video to trend by countries

ggplot(data[diff_days<20],aes(as.factor(diff_days),fill=as.factor(diff_days)))+
  geom_bar()+guides(fill="none")+
  labs(title=" Time between published and trending by countries",subtitle="In days")+
  xlab(NULL)+ylab(NULL)+facet_wrap(~Location)

* We can clearly see in Canada and India that a video starts trending in 1 to 2 days, Whereas that's not the case in US and GB

4 Text Analysis

4.0.1 Most repeted pair of words in title

data$title <- as.character(data$title)
bigrams <- data %>% select(title) %>%
  unnest_tokens(bigram, title, token= "ngrams", n=2)

ggplot(bigrams[,.N,by=bigram][order(-N)][1:19],aes(reorder(bigram,-N),N,fill=bigram))+
  geom_bar(stat="identity")+
  geom_label(aes(label=N))+
  guides(fill="none")+
  theme(axis.text.x = element_text(angle = 45,hjust = 1))+
  labs(title="Top Title bigrams")+
  xlab(NULL)+
  ylab(NULL)

* We clearly see bigrams within the titles which are related to music being most popular.
* However, a couple of bigrams related to one topic that stands out..

4.0.2 Most common tags

data$tags <- as.character(data$tags)
bigrams <- data %>% select(tags) %>%
  unnest_tokens(bigram, tags, token= "ngrams", n=1)

ggplot(bigrams[,.N,by=bigram][order(-N)][1:19],aes(reorder(bigram,-N),N,fill=bigram))+
  geom_bar(stat="identity")+
  geom_label(aes(label=N))+
  guides(fill="none")+
  theme(axis.text.x = element_text(angle = 45,hjust = 1))+
  labs(title="Top Title bigrams")+
  xlab(NULL)+
  ylab(NULL)

* News and Comedy seems to be among the top tags which we did not see earlier in our analysis

5 Recommendation system

5.0.1 Problem Statement

To generate a list of top 5 songs by accessing YouTube programmatically that a user will find funny and will increase the likelyhood of user watching the video and liking it

5.0.2 Approach

The idea is to find n number of videos for a user that the user might find funny and like it. We will do this by filtering videos based on the title of the video.
In order to collected the funniest videos we will search videos which pop up when we input the key word “funny” and should be among the most liked videos. We use the YouTube API to pull the metadata related to videos that appear in our search and create a pool of videos that might be hilarious. We then mock up the dataset which is exactly how a user video rating dataset would look like.
We then use that moked up dataset to train our recommendation model.

5.0.3 Recommendation system design

Various approaches for building recommendation systems have been developed and are being used by top companies in the world. The approach includes collaborative filtering, content-based filtering (if you like an item, you will also like a similar item) or hybrid filtering. Collaborative filtering technique is the most mature and the most commonly implemented.
Collaborative filtering recommends items by identifying other users with similar taste; it uses their opinion to recommend items to the active user.

5.0.4 Collaborative filtering technique used-cases

GroupLens is a news company which has implemented collaborative methods in assisting users to locate articles from massive news database.
Ringo, an online social information filtering system uses collaborative filtering to build users profile based on their ratings on music albums. Amazon uses topic diversification algorithms to improve its recommendation.
Netflix uses collaborative filtering on their movie recommendations too. Even YouTube relies on collaborative filtering on their video recommendation but has a more complex architecture than what we’re going to implement.

5.0.5 Collaborative filtering

 There are two types of Collaborative filtering:
  * User-based Collaborative filtering
  * Item-based Collaborative filtering

I will implement User-based Collaborative filtering within the scope of this project

# Reading the data for Recommender System
new_data <- read.csv("new_data")

#Creating rating matrix
ratings <- new_data %>% 
  # now we replace all NAs by 0
  mutate_each(funs(replace(., is.na(.), 0))) %>%
  # now we convert all 0 to false and 1 to true
  # (except the user column, because it contains names and neighter 0 nor 1)
  mutate_each(funs(as.logical), -users) %>% 
  # dplyr uses a own structure called tibble, but we want dataframes
  # .. so we convert it to one
  as.data.frame()

ratings <- ratings %>% column_to_rownames(var ="users") 

ratings$X <- NULL

binaryVideoRatings <- as.matrix(ratings) %>% as("binaryRatingMatrix")

binaryVideoRatings

## 6 x 98 rating matrix of class 'binaryRatingMatrix' with 300 ratings.

We have converted our dataset into a Binary Rating Matrix since we have our ratings in the form of 1 and 0.
We can also convert the dataset into Real Rating Matrix if the rating is on the scale of 1 to 5.

5.0.6 Creating the UBCF model

We create the User Based Collabraive Filtering model (UBCF)

There are three different variations to this model.

We’re using the cosine method which computes internally the cosine similariy all users, which are represented as vectors in the model.

This is equivalent to : crossprod(User(a),User(b))/sqrt(crossprod(User(a))*crossprod(User(b)))

The other two methods are Jaccard and Pearsons.

model_ubcf <- Recommender(data = binaryVideoRatings, method = "UBCF", 
                     parameter = list(method = "cosine"))
recommendations <- predict(model_ubcf, binaryVideoRatings, n = 6)

getList(recommendations)

## $Messi
## [1] "TRY.NOT.TO.LAUGH.or.GRIN..Funny.Animals.Vines.Compilation.2017...Funny.Animals.and.Kids.Fails.Vines"
## [2] "X10.PRANK.SERU.YG.GAMPANG.DILAKUKAN.DI.RUMAH.TAPI.BIKIN.KESEL.BANGET..."                            
## [3] "Break.Up.PRANK.turns.into.PROPOSAL..GONE.WRONG."                                                    
## [4] "FUNNY.VIDEOS.that.will.make.you.LAUGH.INSANELY...Funny.compilation"                                 
## [5] "TRY.NOT.TO.LAUGH.or.GRIN..Funny.Kids.Fails.Compilation.2017...Best.Kids.Fails.that.Make.Us.Laugh"   
## [6] "TRY.NOT.TO.LAUGH.or.GRIN..Funny.Pranks.Vines.Compilation.2017...Best.Pranks.Vines.May.2017"         
## 
## $Saurez
## [1] "CRAZY.PRANK.ON.GIRLFRIEND....10.000.RED.CUPS."             
## [2] "CUTTING.HEADPHONES.AT.GYM.PRANK...2017.Pranks"             
## [3] "ITCHING.BAIT.PHONE.PRANK.."                                
## [4] "X1.MILLION.ORBEEZ.IN.GIRLFRIEND.S.CAR.PRANK."              
## [5] "Best.Prank.Vines.Compilation...Top.Pranks.Vines.April.2016"
## [6] "DJANI.LYRICS.PRANK.NA.MOJOJ.BIVSOJ.DJEVOJCI.."             
## 
## $Vidal
## [1] "IF.YOU.LAUGH..YOU.LOSE..87..FAIL."                                 
## [2] "Headless.man.Prank.part.2..slaughter.version...Julien.magic"       
## [3] "MOR.FORELSKET.I.ALBERT...PRANK."                                   
## [4] "PIZZA.DELIVERY.PRANK.ON.MY.GIRLFRIEND.S.STALKER..GONE.WRONG...."   
## [5] "we.got.KICKED.OUT.of.our.home...PRANK.WARS."                       
## [6] "FUNNY.VIDEOS.that.will.make.you.LAUGH.INSANELY...Funny.compilation"
## 
## $Pogba
## [1] "X1.MILLION.ORBEEZ.IN.GIRLFRIEND.S.CAR.PRANK."              
## [2] "Best.Prank.Vines.Compilation...Top.Pranks.Vines.April.2016"
## [3] "DJANI.LYRICS.PRANK.NA.MOJOJ.BIVSOJ.DJEVOJCI.."             
## [4] "Epic.Prank.Compilation..Parents.Prank.Kids"                
## [5] "FIDGET.SPINNER.IN.HAPPY.MEAL.PRANK"                        
## [6] "I.WANT.TO.MAKE.A.BABY.WITH.YOU.PRANK....CRYING."           
## 
## $Rashford
## [1] "Bought.Fake.Louis.Vuitton.Prank.Gold.Digger.Test."                                                
## [2] "Killer.Clown.9.Scare.Prank...Shadow.Plays"                                                        
## [3] "SLIME.PRANK.ON.BROTHERS.CAR."                                                                     
## [4] "If.you.don.t.laugh..you.have.no.soul..."                                                          
## [5] "TRY.NOT.TO.LAUGH.or.GRIN..Funny.Fails.Compilation.2017....Best.Fails.Vines.May.2017"              
## [6] "Try.Not.To.Laugh.Funny.Animals.Vines...Best.Vine.Compilation.2017...Funny.Kids.at.the.Zoo.Edition"
## 
## $Lukaku
## [1] "X.I.BET..MY.KIDNEY.YOU.WILL.LAUGH.."                                                                 
## [2] "CAUGHT.CHEATING.on.my.Girlfriend.PRANK."                                                             
## [3] "Caught.My.Girlfriend.Cheating.Prank."                                                                
## [4] "GIANT.WUBBLE.BUBBLE.CAR.PRANK."                                                                      
## [5] "HARDEST.VERSION.AFV.Try.Not.to.Laugh.or.Grin.While.Watching.Funniest.Vines.of.best.funny.videos.2017"
## [6] "IMPOSSIBLE.NOT.TO.LAUGH...Funny.school.fail.compilation"

The above list shows the predicted list of personalised recommended videos for each user.

#Saving the recommendtions in data frame
recom <- as(recommendations, "matrix") %>% as.data.frame()

5.0.7 Evaluation of the recommendation model

# create evaluation scheme by cross validating the rating matrix
scheme <- evaluationScheme(binaryVideoRatings, method="cross", k=4, given=3)
results <- evaluate(scheme, method="POPULAR", type = "topNList",n=c(1,3,5,10,15,20))

## POPULAR run fold/sample [model time/prediction time]
##   1  [0.002sec/0.015sec] 
##   2  [0.002sec/0.043sec] 
##   3  [0.002sec/0.016sec] 
##   4  [0.002sec/0.016sec]

getConfusionMatrix(results)[[1]]

##            TP        FP       FN       TN precision     recall        TPR
## 1   0.6666667 0.3333333 39.66667 54.33333 0.6666667 0.01295597 0.01295597
## 3   2.0000000 1.0000000 38.33333 53.66667 0.6666667 0.03886792 0.03886792
## 5   3.3333333 1.6666667 37.00000 53.00000 0.6666667 0.06477987 0.06477987
## 10  6.6666667 3.3333333 33.66667 51.33333 0.6666667 0.12955975 0.12955975
## 15  9.6666667 5.3333333 30.66667 49.33333 0.6444444 0.18805031 0.18805031
## 20 13.0000000 7.0000000 27.33333 47.66667 0.6500000 0.25283019 0.25283019
##            FPR
## 1  0.004329004
## 3  0.012987013
## 5  0.021645022
## 10 0.043290043
## 15 0.072871573
## 20 0.094516595

avg(results)

##            TP        FP       FN       TN precision     recall        TPR
## 1   0.6666667 0.3333333 39.83333 54.16667 0.6666667 0.01309292 0.01309292
## 3   1.8333333 1.1666667 38.66667 53.33333 0.6111111 0.03923964 0.03923964
## 5   3.0833333 1.9166667 37.41667 52.58333 0.6166667 0.06361388 0.06361388
## 10  5.9166667 4.0833333 34.58333 50.41667 0.5916667 0.12548172 0.12548172
## 15  9.1666667 5.8333333 31.33333 48.66667 0.6111111 0.18937398 0.18937398
## 20 12.3333333 7.6666667 28.16667 46.83333 0.6166667 0.25136030 0.25136030
##            FPR
## 1  0.004329004
## 3  0.018790627
## 5  0.029149316
## 10 0.065920085
## 15 0.089549234
## 20 0.114746788

We evaluate the model for six different scenarios i.e Accuracy when we recommend one video to the user, three, five…upto 20 videos to the user.

The average accuracy is determined by two parameters Recall and Precision

1. Recall:
   What ratio of items that a user likes were actually recommended. If a user likes say        5 items and the recommendation decided to show 3 of them, then the recall is 0.6

2. Precision
   Out of all the recommended items, how many the user actually liked? If 5 items were        recommended to the user out of which he liked say 4 of them, then precision is 0.8

5.0.7.1 ROC Curve

plot(results, annotate=TRUE)

The ROC cover does not show great results due to limitations of our dataset. However, we can increase(improve) the area under the curve by having a more diversed dataset.

5.0.7.2 Precision v/s Recall plot

plot(results, "prec/rec", annotate=TRUE)

We observe an average precision of about 0.66 which means that 66% of the times the user will like the video suggested by us. Which is a much better number compared to a naive probabilistic approach where the probability of user liking the video would be 50%

6 Conclusion

Music and Entertainment are the most popular type of videos on YouTube
Though the number of likes and views of a video have some correlation, Just like cannot guarantee you success
If you have the word “Official” in your title it’s more likely for that video to gather higher number of views
Most trending videos are also among the most watched videos in Great Britain which might indicate viewers in GB play an important role for a video to trend
India has high dislike percentage on videos
For YouTube to improve user engagement must put in place an effective content recommend system which takes into consideration the demographics and preferences of like minded users

7 References

https://www.unboxsocial.com/blog/youtube-analytics-guide-business-account/
https://www.kaggle.com/
https://www.youtube.com/
https://github.com/mhahsler/recommenderlab/
https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf/

Youtube Analytics EDA & Overview

Raj Desai

3/06/2019