This project is divided into three major parts, which are explained over the course of the document.
This part is based on engagement-metrics data for the YouTube videos of 12 different Indian entertainment channels. The aim of this project is to:
For the purpose of this assignment, I will be making use of the following packages:
#loading packages
library(ggplot2)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readxl)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
OML_DD<-read_xlsx("OML_DD.xlsx")
In this section, I will:
#initialising 'short_name'
OML_DD<-OML_DD %>% mutate(short_name = "n")
#converting variables to the relevant class
OML_DD$view_count<-as.integer(OML_DD$view_count)
OML_DD$likes_count<-as.integer(OML_DD$likes_count)
OML_DD$dislikes_count<-as.integer(OML_DD$dislikes_count)
OML_DD$comments_count<-as.integer(OML_DD$comments_count)
## Warning: NAs introduced by coercion
#creating a field for 'short name'
for (i in 1:length(OML_DD$channel_name)) {
  if (OML_DD$channel_name[i] == "2 Foreigners In Bollywood") { OML_DD$short_name[i] <- "FIB"
  } else if (OML_DD$channel_name[i] == "All India Bakchod") { OML_DD$short_name[i] <- "AIB"
  } else if (OML_DD$channel_name[i] == "ashish chanchlani vines") { OML_DD$short_name[i] <- "ACV"
  } else if (OML_DD$channel_name[i] == "BB Ki Vines") { OML_DD$short_name[i] <- "BKV"
  } else if (OML_DD$channel_name[i] == "BeYouNick") { OML_DD$short_name[i] <- "BYN"
  } else if (OML_DD$channel_name[i] == "East India Comedy") { OML_DD$short_name[i] <- "EIC"
  } else if (OML_DD$channel_name[i] == "Filter Copy") { OML_DD$short_name[i] <- "FC"
  } else if (OML_DD$channel_name[i] == "Girliyapa") { OML_DD$short_name[i] <- "GY"
  } else if (OML_DD$channel_name[i] == "Mostly Sane") { OML_DD$short_name[i] <- "MS"
  } else if (OML_DD$channel_name[i] == "SnG Comedy") { OML_DD$short_name[i] <- "SNGC"
  } else if (OML_DD$channel_name[i] == "The Screen Patti") { OML_DD$short_name[i] <- "TSP"
  } else if (OML_DD$channel_name[i] == "The Timeliners") { OML_DD$short_name[i] <- "TT" }
}
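The same mapping can also be written more compactly with dplyr; a minimal sketch using dplyr::recode(), which should reproduce the loop's result (including the "n" default for any unmatched channel name):
#sketch: the same short-name mapping with dplyr::recode instead of a loop
OML_DD <- OML_DD %>%
  mutate(short_name = recode(channel_name,
    "2 Foreigners In Bollywood" = "FIB", "All India Bakchod" = "AIB",
    "ashish chanchlani vines" = "ACV", "BB Ki Vines" = "BKV",
    "BeYouNick" = "BYN", "East India Comedy" = "EIC",
    "Filter Copy" = "FC", "Girliyapa" = "GY",
    "Mostly Sane" = "MS", "SnG Comedy" = "SNGC",
    "The Screen Patti" = "TSP", "The Timeliners" = "TT",
    .default = "n"))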
#splitting date & time
OML_DD <- OML_DD %>% separate(date_time, c("date", "time"), "T")
OML_DD$date<-as.Date(OML_DD$date)
#creating channel-wise data frames
FIB<-OML_DD %>% filter(channel_name == "2 Foreigners In Bollywood")
AIB<-OML_DD %>% filter(channel_name == "All India Bakchod")
ACV<-OML_DD %>% filter(channel_name == "ashish chanchlani vines")
BKV<-OML_DD %>% filter(channel_name == "BB Ki Vines")
BYN<-OML_DD %>% filter(channel_name == "BeYouNick")
EIC<-OML_DD %>% filter(channel_name == "East India Comedy")
FC<-OML_DD %>% filter(channel_name == "Filter Copy")
GY<-OML_DD %>% filter(channel_name == "Girliyapa")
MS<-OML_DD %>% filter(channel_name == "Mostly Sane")
SNGC<-OML_DD %>% filter(channel_name == "SnG Comedy")
TSP<-OML_DD %>% filter(channel_name == "The Screen Patti")
TT<-OML_DD %>% filter(channel_name == "The Timeliners")
#creating a summary statistics data frame
sum_stat_vc<-OML_DD %>% group_by(channel_name, short_name) %>% summarise(mean_vc = mean(view_count), sd_vc = sd(view_count), median_vc = median(view_count), coeff_of_var = sd_vc/mean_vc)
For the purpose of this assignment, I will formulate bivariate regression models to examine the relationships between the available variables.
In particular, I will fit the linear least squares regression line (the "best fit" line) to the data.
I will use the view count as my independent variable and plot the other variables against it.
Linear least squares regression line: the line that minimises the sum of the squared vertical distances between the plotted data points and itself.
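For reference, with explanatory variable \(X\) and response \(Y\), the least-squares slope and intercept have the standard closed-form expressions:
\[ \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
These are the estimates that R's lm() returns for a simple model of the form y ~ x.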
#tidying the OML data dump for easier plotting
OML_tidy <- OML_DD %>% gather(measure, value, -view_count, -channel_name, -video_title, -date, -time, -short_name)
OML_tidy$value<-as.integer(OML_tidy$value)
#plotting the tidy OML Data
ggplot(OML_tidy, aes(x=view_count, y = value, col = measure)) + geom_point(alpha = 0.4) + labs(x = "View Count")
## Warning: Removed 2 rows containing missing values (geom_point).
We notice an outlier in the upper part of the graph which can affect the regression model. For our study, let’s ignore that outlier.
#filtering out the outlier
OML_tidy_filtered<- OML_tidy %>% filter(value<900000)
#Plotting the scatterplot and regression line
ggplot(OML_tidy_filtered, aes(x=view_count, y = value, col = measure)) + geom_point(alpha = 0.4) + geom_smooth(method = "lm", se = FALSE) + labs(x = "View Count")
As seen from the graph, the view_count (i.e. the number of views per video) is the explanatory variable, while the likes_count, dislikes_count and comments_count are the response variables.
The above assumption is consistent with logic, since the decision to "like", "dislike", or "comment" on a video generally follows the "viewing" of the video.
Let’s interpret the model and determine its accuracy.
#plotting likes vs. views
OML_DD_filtered <- OML_DD %>% filter(likes_count<900000, comments_count<60000)
ggplot(OML_DD_filtered, aes(x=view_count, y = likes_count)) + geom_point(alpha = 0.4) + geom_smooth(method = "lm", se = FALSE) + labs(x = "View Count", y = "Likes Count")
#calculating the regression coefficient
OML_DD_filtered %>% summarise(r = cor(view_count, likes_count, use = "pairwise.complete.obs"))
## # A tibble: 1 x 1
## r
## <dbl>
## 1 0.864
The coefficient of correlation is 0.864, which suggests a strong positive correlation between the number of views and the number of likes.
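For reference, cor() with its default settings computes Pearson's correlation coefficient:
\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} \]
which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship).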
#calculating the slope (regression coefficient) & intercept of the regression line
mod1 <- lm(likes_count ~ view_count, data = OML_DD_filtered)
summary(mod1)
##
## Call:
## lm(formula = likes_count ~ view_count, data = OML_DD_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -272121 -10021 -3573 4666 410422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.413e+03 1.485e+03 2.298 0.0217 *
## view_count 2.435e-02 3.697e-04 65.866 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47450 on 1472 degrees of freedom
## Multiple R-squared: 0.7467, Adjusted R-squared: 0.7465
## F-statistic: 4338 on 1 and 1472 DF, p-value: < 2.2e-16
From the linear model, we can interpret the following:
Equation of the linear model:
\(Y = \beta_0 + \beta_1 X\)
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope (regression coefficient), and X and Y are the explanatory and response variables respectively.
Plugging in the estimated coefficients: **likes_count = 3413 + 0.02435 × view_count**
The adjusted R-squared value is 74.65%, which means that the model fits the set of observations fairly well.
Since this model is constructed with data from all channels, we cannot determine which channel has a stronger relationship between the two concerned variables, and hence, we will need to explore individual channel data.
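As a quick illustration of how the fitted model can be used, here is a minimal sketch of a point prediction for a hypothetical video with one million views (the figure of 1e6 is purely illustrative):
#sketch: predicted likes for a hypothetical video with 1,000,000 views
predict(mod1, newdata = data.frame(view_count = 1e6))
#roughly 3413 + 0.02435 * 1e6, i.e. about 27,763 likes, per the coefficients above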
Let's have a look at comments vs. views.
#plotting comments vs views
ggplot(OML_DD_filtered, aes(x=view_count, y = comments_count)) + geom_point(alpha = 0.4) + geom_smooth(method = "lm", se = FALSE) +labs(x = "View Count", y = "Comments Count")
#calculating regression coefficient
OML_DD_filtered %>% summarise(r = cor(view_count, comments_count, use = "pairwise.complete.obs"))
## # A tibble: 1 x 1
## r
## <dbl>
## 1 0.710
The coefficient of correlation here is 0.710, which means that the linear relationship between comments and views is weaker than that of likes and views.
#calculating the slope & intercept of the regression line
mod2<-lm(comments_count ~ view_count, data = OML_DD_filtered)
summary(mod2)
##
## Call:
## lm(formula = comments_count ~ view_count, data = OML_DD_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15200 -1000 -93 444 41611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.026e+01 1.308e+02 -0.308 0.758
## view_count 1.260e-03 3.257e-05 38.678 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4180 on 1472 degrees of freedom
## Multiple R-squared: 0.504, Adjusted R-squared: 0.5037
## F-statistic: 1496 on 1 and 1472 DF, p-value: < 2.2e-16
From the linear model, we can interpret the following:
The equation of the model is: **comments_count = -40.26 + 0.00126 × view_count**
The R-squared value for this model is only about 50%, which means that our linear model is only a moderate fit for our set of observations.
We can similarly explore the linear model for dislikes count and number of views per video.
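A minimal sketch of that model, which was not run in the original analysis (output omitted):
#sketch: the analogous model for dislikes vs. views
mod3 <- lm(dislikes_count ~ view_count, data = OML_DD_filtered)
summary(mod3)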
It would be interesting to explore the relationship between number of likes and comments.
ggplot(OML_DD_filtered, aes(likes_count, comments_count)) + geom_point(alpha = 0.4) + geom_smooth(method = lm, se = FALSE) + labs(x = "Likes Count", y = "Comments Count")->p1
ggplot(OML_DD, aes(dislikes_count, comments_count)) + geom_point(alpha = 0.4) + geom_smooth(method = lm, se = FALSE) + labs(x = "Dislikes Count", y = "Comments Count") -> p2
grid.arrange(p1, p2, ncol = 2)
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
Note the difference in the x and y scales between the two plots.
I have assumed the likes count (or dislikes count) to be the explanatory variable and the comments count to be the response variable, because comments are generally considered higher in the hierarchy of engagement metrics.
Audience loyalty is a complex concept. As the name suggests, it expresses whether the audience is consistent, i.e. the same viewers keep coming back for more content, or whether it is erratic.
To determine audience loyalty, I would require a lot more information, for example, no. of subscribers, IP address tracking, and cookie-tracking. Because I do not have access to this information, I will rely on the view count data.
Logic: An erratic view count suggests poor audience loyalty, because it would imply either sudden surges or sudden drops in viewership. Hence, the more consistent the view count, the better is the audience loyalty.
I will plot the summary statistics data frame for this purpose.
#tidying the data frame for intuitive plotting
sum_stat_vc_tidy<- sum_stat_vc %>% gather(measure, value, -channel_name, -short_name, -coeff_of_var)
#plotting the tidy data
ggplot(sum_stat_vc_tidy, aes(x=short_name, y = value, col = measure)) + geom_point() + labs(x = "Channel Name")
#printing the summary statistics data frame
sum_stat_vc
## # A tibble: 12 x 6
## # Groups: channel_name [?]
## channel_name short_name mean_vc sd_vc median_vc coeff_of_var
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2 Foreigners In Boll… FIB 3.94e6 2.97e6 2759534. 0.754
## 2 All India Bakchod AIB 2.54e6 3.02e6 1367270. 1.19
## 3 ashish chanchlani vi… ACV 4.17e6 3.70e6 3030134 0.886
## 4 BB Ki Vines BKV 9.06e6 6.37e6 7243978. 0.704
## 5 BeYouNick BYN 1.50e6 8.97e5 1367758. 0.600
## 6 East India Comedy EIC 7.14e5 5.36e5 630487 0.751
## 7 Filter Copy FC 1.85e6 1.60e6 1608605 0.862
## 8 Girliyapa GY 2.15e6 1.53e6 1844980 0.710
## 9 Mostly Sane MS 7.26e5 1.09e6 259732. 1.50
## 10 SnG Comedy SNGC 2.74e5 5.22e5 126716. 1.90
## 11 The Screen Patti TSP 1.45e6 1.40e6 1235319 0.967
## 12 The Timeliners TT 1.54e6 1.23e6 1384983 0.797
The coefficient of variation (std. dev./mean) is a measure of variability (in our case, of how erratic the view count is). It is particularly useful when comparing multiple data sets.
For the purpose of this assignment, I will assume that audience loyalty is inversely proportional to the coefficient of variation.
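In symbols, the working assumption is:
\[ CV = \frac{\sigma_{\text{views}}}{\mu_{\text{views}}}, \qquad \text{audience loyalty} \propto \frac{1}{CV} \]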
sum_stat_vc[order(sum_stat_vc$coeff_of_var),]
## # A tibble: 12 x 6
## # Groups: channel_name [12]
## channel_name short_name mean_vc sd_vc median_vc coeff_of_var
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BeYouNick BYN 1.50e6 8.97e5 1367758. 0.600
## 2 BB Ki Vines BKV 9.06e6 6.37e6 7243978. 0.704
## 3 Girliyapa GY 2.15e6 1.53e6 1844980 0.710
## 4 East India Comedy EIC 7.14e5 5.36e5 630487 0.751
## 5 2 Foreigners In Boll… FIB 3.94e6 2.97e6 2759534. 0.754
## 6 The Timeliners TT 1.54e6 1.23e6 1384983 0.797
## 7 Filter Copy FC 1.85e6 1.60e6 1608605 0.862
## 8 ashish chanchlani vi… ACV 4.17e6 3.70e6 3030134 0.886
## 9 The Screen Patti TSP 1.45e6 1.40e6 1235319 0.967
## 10 All India Bakchod AIB 2.54e6 3.02e6 1367270. 1.19
## 11 Mostly Sane MS 7.26e5 1.09e6 259732. 1.50
## 12 SnG Comedy SNGC 2.74e5 5.22e5 126716. 1.90
Based on our assumptions, BeYouNick has the most loyal audience, while SnG Comedy, the least.
In order to determine which type of content works best, a classification of the data set based on the type of content is required. Without that information, I’m unable to plot informed scatterplots for the same.
A linear model can be fitted to a plot between the date that a particular video of a channel was released and the content’s view count.
We can thus arrange channels from slowest growth to fastest growth.
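Before going channel by channel, a single faceted plot can give a quick visual overview of all twelve trends at once; a minimal sketch (not part of the original analysis):
#sketch: date vs. view count for every channel, with a fitted line per facet
ggplot(OML_DD, aes(date, view_count)) + geom_point(alpha = 0.3) + geom_smooth(method = "lm", se = FALSE) + facet_wrap(~ short_name, scales = "free_y") + labs(x = "Date of Release", y = "View Count")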
Let’s investigate the performance of SnG Comedy.
ggplot(SNGC, aes(date, view_count)) + geom_point(alpha = 0.6) + geom_smooth(method = lm, se = FALSE) + labs(x = "Date of Release", y = "View Count")
lm(view_count ~ date, data = SNGC)
##
## Call:
## lm(formula = view_count ~ date, data = SNGC)
##
## Coefficients:
## (Intercept) date
## 9727218 -546
We can see that for each incremental day, the expected number of views per video drops by an average of 546.
Investigating for the channel Mostly Sane:
ggplot(MS, aes(date, view_count)) + geom_point(alpha = 0.6) + geom_smooth(method = lm, se = FALSE) + labs(x = "Date of Release", y = "View Count")
lm(view_count ~ date, data = MS)
##
## Call:
## lm(formula = view_count ~ date, data = MS)
##
## Coefficients:
## (Intercept) date
## -25391696 1489
We have better news for Mostly Sane: for each incremental day, the expected number of views per video increases by an average of 1489.
I will extrapolate this method to all channels. It isn’t necessary to plot the data for all the channels. We’re only concerned with the slope of the regression line.
lm(view_count ~ date, data = AIB)
##
## Call:
## lm(formula = view_count ~ date, data = AIB)
##
## Coefficients:
## (Intercept) date
## 15034718.8 -740.9
lm(view_count ~ date, data = TSP)
##
## Call:
## lm(formula = view_count ~ date, data = TSP)
##
## Coefficients:
## (Intercept) date
## -20548549 1276
lm(view_count ~ date, data = ACV)
##
## Call:
## lm(formula = view_count ~ date, data = ACV)
##
## Coefficients:
## (Intercept) date
## -124812844 7518
lm(view_count ~ date, data = FC)
##
## Call:
## lm(formula = view_count ~ date, data = FC)
##
## Coefficients:
## (Intercept) date
## -1.051e+07 7.087e+02
lm(view_count ~ date, data = TT)
##
## Call:
## lm(formula = view_count ~ date, data = TT)
##
## Coefficients:
## (Intercept) date
## -16146521 1017
lm(view_count ~ date, data = FIB)
##
## Call:
## lm(formula = view_count ~ date, data = FIB)
##
## Coefficients:
## (Intercept) date
## -123431841 7387
lm(view_count ~ date, data = EIC)
##
## Call:
## lm(formula = view_count ~ date, data = EIC)
##
## Coefficients:
## (Intercept) date
## 4227273.8 -205.4
lm(view_count ~ date, data = GY)
##
## Call:
## lm(formula = view_count ~ date, data = GY)
##
## Coefficients:
## (Intercept) date
## 51580012 -2843
lm(view_count ~ date, data = BKV)
##
## Call:
## lm(formula = view_count ~ date, data = BKV)
##
## Coefficients:
## (Intercept) date
## -155727095 9727
lm(view_count ~ date, data = BYN)
##
## Call:
## lm(formula = view_count ~ date, data = BYN)
##
## Coefficients:
## (Intercept) date
## -1.548e+07 9.918e+02
I'm going to manually create a data frame to determine the viewership growth rates of the different channels.
viewership_growth<-data.frame(channel = c("SNGC", "MS", "AIB", "TSP", "ACV", "FC", "TT", "FIB", "EIC", "GY", "BKV", "BYN"), view_slope = c(-546, 1489, -740.9, 1276, 7518, 708.7, 1017, 7387, -205.4, -2843, 9727, 991.8))
viewership_growth[order(viewership_growth$view_slope),]
## channel view_slope
## 10 GY -2843.0
## 3 AIB -740.9
## 1 SNGC -546.0
## 9 EIC -205.4
## 6 FC 708.7
## 12 BYN 991.8
## 7 TT 1017.0
## 4 TSP 1276.0
## 2 MS 1489.0
## 8 FIB 7387.0
## 5 ACV 7518.0
## 11 BKV 9727.0
From the table above, we can determine that BB Ki Vines has the highest viewership growth, while Girliyapa has the worst, in that it is experiencing negative viewership growth.
Note: I have ignored the strength of the linear relationship between the date and view-count variables to keep the calculations straightforward.
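For completeness, here is a minimal sketch (base R, using the short_name field created earlier) of how the per-channel slopes could be computed in one pass instead of being transcribed by hand; up to rounding, its output should match the table above:
#sketch: compute every channel's date slope programmatically
slopes <- sapply(split(OML_DD, OML_DD$short_name), function(d) coef(lm(view_count ~ date, data = d))[["date"]])
viewership_growth_auto <- data.frame(channel = names(slopes), view_slope = unname(slopes))
viewership_growth_auto[order(viewership_growth_auto$view_slope), ]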
The aim of Part B is to identify, from the set of available campaigns, the best and worst campaigns based on the following metrics:
#loading data
OML_TS<-read_xlsx("OML_TS.xlsx")
In this section, I will create data frames and order them for future use. Respective data frames will be explained as required.
Note: To calculate the total engagement, I have added only the positive reactions (i.e. like, haha, wow, love) to the number of comments and shares. In certain thought-provoking and political videos, the “sad” react and “angry” react would also be meaningful, but we know that these are simply video promotions, and hence, it does not make sense to include negative reactions in total engagement.
#adding "cost per reach", "total engagement", and "cost per engagement" fields
OML_TS <- OML_TS %>% mutate(cost_per_reach = amount/Reach, tot_eng = shares + comments + pos_reactions, cost_per_eng = amount/tot_eng, cost_per_view = amount/three_sec_views)
#arranging the data frame by the "cost per reach" field
cheap_reach<-OML_TS[order(OML_TS$cost_per_reach),]
cheap_reach <- select(cheap_reach, campaign_name, cost_per_reach)
#arranging the data frame by "cost per engagement" field
cheap_eng<-OML_TS[order(OML_TS$cost_per_eng),]
#calculating the ratio of thirty-second views to ten-second views
OML_TS<-OML_TS %>% mutate(thirty_ten = thirty_sec_views/ten_sec_views)
#calculating shares per ten-second view
OML_TS <- mutate(OML_TS, shares_per_view = shares/ten_sec_views)
shares_p_view<-OML_TS[order(OML_TS$shares_per_view),]
shares_p_view<-select(shares_p_view, campaign_name, shares_per_view)
thirty_ten_eng<-OML_TS[order(OML_TS$thirty_ten),]
thirty_ten_eng<-select(thirty_ten_eng, campaign_name, thirty_ten)
video_watched<-OML_TS[order(OML_TS$video_percent_watched),]
video_watched<-select(video_watched, campaign_name, video_percent_watched)
In this section, I will explore the costs per reach of all the campaigns to identify the cheapest and the most expensive ones.
ggplot(OML_TS, aes(campaign_name, cost_per_reach)) + geom_point(colour = "blue") + labs(x = "Campaign Name", y = "Cost/Reach")
cheap_reach
## # A tibble: 17 x 2
## campaign_name cost_per_reach
## <chr> <dbl>
## 1 I 0.00283
## 2 H 0.00457
## 3 D 0.00738
## 4 J 0.00823
## 5 F 0.00903
## 6 E 0.00957
## 7 A 0.0117
## 8 P 0.0121
## 9 B 0.0128
## 10 G 0.0154
## 11 Q 0.0176
## 12 L 0.0200
## 13 N 0.0202
## 14 K 0.0206
## 15 C 0.0209
## 16 O 0.0276
## 17 M 0.0285
It appears that campaigns I, H and D have the lowest costs per reach, while campaigns M, O and C have considerably higher costs per reach.
Here, I will attempt to address which campaign has the cheapest engagement, and which one, the costliest.
ggplot(OML_TS, aes(campaign_name, cost_per_eng)) + geom_point(colour = "blue") + labs(x ="Campaign Name", y = "Cost/Engagement")
cheap_eng %>% select(campaign_name, cost_per_eng)
## # A tibble: 17 x 2
## campaign_name cost_per_eng
## <chr> <dbl>
## 1 I 0.165
## 2 H 0.489
## 3 J 0.700
## 4 E 0.921
## 5 D 1.01
## 6 F 1.06
## 7 P 1.58
## 8 B 1.60
## 9 K 2.03
## 10 A 2.05
## 11 C 2.21
## 12 N 2.41
## 13 L 2.52
## 14 G 2.85
## 15 M 3.24
## 16 O 3.43
## 17 Q 4.17
I, H, and J are the cheapest campaigns with respect to engagement, and Q, O and M, the costliest.
It is interesting to note that O and M are also among the three highest costs per reach, while H and I are among the three lowest.
I want to investigate how expensive each view is in different campaigns.
ggplot(OML_TS, aes(campaign_name, cost_per_view)) + geom_point(colour="blue") + labs(x = "Campaign Name", y = "Cost/3-sec-view")
cost_p_view<-OML_TS[order(OML_TS$cost_per_view),]
cost_p_view
## # A tibble: 17 x 17
## campaign_name Reach amount avg_watch_time video_percent_watched
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 I 1768376 5000 16 10.9
## 2 H 1093412 5000 12 8.2
## 3 D 677633 5000 18 11.6
## 4 F 553417 5000 27 5.53
## 5 E 595320 5700 34 3.72
## 6 G 324869 5000 26 5.71
## 7 J 607497 5000 15 4.34
## 8 Q 567135 10000 32 1.06
## 9 P 268717 3250 12 6.15
## 10 A 426501 5000 38 3.04
## 11 B 390876 5000 13 4.97
## 12 L 500542 10000 26 4.37
## 13 N 494105 10000 18 4.91
## 14 O 362602 10000 14 3.86
## 15 K 242472 5000 7 2.07
## 16 C 239803 5000 10 4.88
## 17 M 351346 10000 15 3.74
## # ... with 12 more variables: three_sec_views <dbl>, ten_sec_views <dbl>,
## # thirty_sec_views <dbl>, shares <dbl>, comments <dbl>,
## # pos_reactions <dbl>, cost_per_reach <dbl>, tot_eng <dbl>,
## # cost_per_eng <dbl>, cost_per_view <dbl>, thirty_ten <dbl>,
## # shares_per_view <dbl>
Campaigns I, H and D are the cheapest with respect to this metric.
For this analysis, I will measure the engagement factor of video campaigns by the following metrics:
Average percentage of the video watched: Ideally, I would like viewer-level data on what percentage of the video each viewer watched, so that I could determine whether the median or the mean is the better measure of central tendency; since I do not have that data, I will use the mean.
Retention factor: I will define the retention factor as thirty-second views divided by ten-second views, i.e., of the viewers who watched the first ten seconds of the video, what fraction went on to watch thirty seconds of it?
I will ignore the three-second-views data because it includes auto-play views, which are not a good indicator of how many people consciously chose to watch the video. The retention factor defined above gives a good idea of how engaging a particular video is, because it tells us the percentage of viewers who were engaged enough in the first ten seconds to keep watching for the next twenty.
ggplot(OML_TS, aes(campaign_name, video_percent_watched)) + geom_point(colour = "blue") + labs(x = "Campaign Name", y = "Average Percent of Video Watched")
video_watched
## # A tibble: 17 x 2
## campaign_name video_percent_watched
## <chr> <dbl>
## 1 Q 1.06
## 2 K 2.07
## 3 A 3.04
## 4 E 3.72
## 5 M 3.74
## 6 O 3.86
## 7 J 4.34
## 8 L 4.37
## 9 C 4.88
## 10 N 4.91
## 11 B 4.97
## 12 F 5.53
## 13 G 5.71
## 14 P 6.15
## 15 H 8.2
## 16 I 10.9
## 17 D 11.6
I can derive 2 key insights from this graph:
Campaign I has the second-highest percentage of video watched of all the campaigns, and it is among the cheapest campaigns.
Campaign D has the highest percent of video watched.
ggplot(OML_TS, aes(campaign_name, thirty_ten)) + geom_point(colour = "blue") + labs(x = "Campaign Name", y = "Retention Factor")
thirty_ten_eng
## # A tibble: 17 x 2
## campaign_name thirty_ten
## <chr> <dbl>
## 1 K 0.507
## 2 Q 0.512
## 3 E 0.554
## 4 F 0.576
## 5 O 0.585
## 6 P 0.595
## 7 M 0.598
## 8 C 0.622
## 9 G 0.623
## 10 H 0.623
## 11 B 0.634
## 12 A 0.657
## 13 J 0.662
## 14 N 0.680
## 15 L 0.681
## 16 I 0.701
## 17 D 0.795
Campaigns D, I, N and L have the most retentive content.
People only share content on Facebook that they find incredibly appealing. Therefore, I'm interested in finding out the number of shares per view for each campaign. It would provide further insight into the quality of the content.
Again, I will use the ten-second-view metric to weed out auto-play views.
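A minimal sketch of how this metric can be inspected, reusing the shares_per_view field and the shares_p_view frame created earlier (output omitted):
#sketch: shares per ten-second view, by campaign
ggplot(OML_TS, aes(campaign_name, shares_per_view)) + geom_point(colour = "blue") + labs(x = "Campaign Name", y = "Shares per 10-Second View")
shares_p_view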
Based on these analyses, I can draw the following insights:
Campaign I is the best campaign: it has ranked in the top two on all the metrics. In particular, it is the cheapest in terms of reach and engagement, and it has also garnered the most shares per ten-second view.
Campaign D has also performed very well: it has the highest retention factor and the highest average percentage of video watched (which means it is very engaging), and it is quite cheap as well.
Campaign H is the third-best campaign: it is the second cheapest in terms of reach, engagement and views. Additionally, it has the third-highest percentage of video watched and the fifth-highest shares per view.
Campaign O has performed poorly: it is among the costliest campaigns in terms of reach, engagement and views. It also has the sixth-lowest average percentage of video watched and the fifth-lowest retention factor.
Campaign Q has poor engagement: it has the highest cost per engagement, the lowest average percentage of video watched, and the second-lowest retention factor.
Campaign M is another bad apple: it has turned out to be a very costly campaign, with the highest cost per reach and the highest cost per view. It is also the third-worst in terms of shares per view and has the fifth-lowest average percentage of video watched.
Thus, we have our 3 best and 3 worst campaigns based on 6 different metrics.
This project is formulated based on the data provided by OML. Even though I have been able to produce some interesting insights, my analysis would have been more robust had I received data on the following parameters.
For YouTube videos, rankings are important. There are a few metrics that indicate how well a video can be ranked. I’ve discussed the following metrics below*:
Length of each video: The average length of a video on the front page is 14 minutes 50 seconds. Because the YouTube audience visits the site intending to watch longer videos, as a general rule of thumb, the longer the video, the better its ranking. After all, YouTube wants to increase the time viewers spend on the site. (https://youtube-creators.googleblog.com/2012/08/youtube-now-why-we-focus-on-watch-time.html)
The number of shares per video: Again, the greater the number of shares, the better the video's ranking.
Number of channel subscribers per video: This information would allow us to measure the loyalty of the audience. It could also tell us whether a video increases or decreases the number of subscribers.
*These metrics are sourced from https://backlinko.com/youtube-ranking-factors, where 1.3 million YouTube videos were analysed.
The process of identifying the best and the worst campaigns would be aided by the following metrics: