This project is divided into three major parts, which are explained over the course of the document.
This part is based on engagement-metrics data for the YouTube videos of 12 different Indian entertainment channels. The aim of this project is to:
For the purpose of this assignment, I will be making use of the following packages:
#loading packages
library(ggplot2)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readxl)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
OML_DD<-read_xlsx("OML_DD.xlsx")
In this section, I will:
#initialising 'short_name'
OML_DD<-OML_DD %>% mutate(short_name = "n")
#converting variables to the relevant class
OML_DD$view_count<-as.integer(OML_DD$view_count)
OML_DD$likes_count<-as.integer(OML_DD$likes_count)
OML_DD$dislikes_count<-as.integer(OML_DD$dislikes_count)
OML_DD$comments_count<-as.integer(OML_DD$comments_count)
## Warning: NAs introduced by coercion
#creating a field for 'short name'
for (i in 1:length(OML_DD$channel_name)) {
  if (OML_DD$channel_name[i] == "2 Foreigners In Bollywood") { OML_DD$short_name[i] <- "FIB"
  } else if (OML_DD$channel_name[i] == "All India Bakchod") { OML_DD$short_name[i] <- "AIB"
  } else if (OML_DD$channel_name[i] == "ashish chanchlani vines") { OML_DD$short_name[i] <- "ACV"
  } else if (OML_DD$channel_name[i] == "BB Ki Vines") { OML_DD$short_name[i] <- "BKV"
  } else if (OML_DD$channel_name[i] == "BeYouNick") { OML_DD$short_name[i] <- "BYN"
  } else if (OML_DD$channel_name[i] == "East India Comedy") { OML_DD$short_name[i] <- "EIC"
  } else if (OML_DD$channel_name[i] == "Filter Copy") { OML_DD$short_name[i] <- "FC"
  } else if (OML_DD$channel_name[i] == "Girliyapa") { OML_DD$short_name[i] <- "GY"
  } else if (OML_DD$channel_name[i] == "Mostly Sane") { OML_DD$short_name[i] <- "MS"
  } else if (OML_DD$channel_name[i] == "SnG Comedy") { OML_DD$short_name[i] <- "SNGC"
  } else if (OML_DD$channel_name[i] == "The Screen Patti") { OML_DD$short_name[i] <- "TSP"
  } else if (OML_DD$channel_name[i] == "The Timeliners") { OML_DD$short_name[i] <- "TT" }
}
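The same mapping can also be written more compactly with dplyr; a minimal sketch using dplyr::recode(), which should reproduce the loop's result (including the "n" default for any unmatched channel name):
#sketch: the same short-name mapping with dplyr::recode instead of a loop
OML_DD <- OML_DD %>%
  mutate(short_name = recode(channel_name,
    "2 Foreigners In Bollywood" = "FIB", "All India Bakchod" = "AIB",
    "ashish chanchlani vines" = "ACV", "BB Ki Vines" = "BKV",
    "BeYouNick" = "BYN", "East India Comedy" = "EIC",
    "Filter Copy" = "FC", "Girliyapa" = "GY",
    "Mostly Sane" = "MS", "SnG Comedy" = "SNGC",
    "The Screen Patti" = "TSP", "The Timeliners" = "TT",
    .default = "n"))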
#splitting date & time
OML_DD <- OML_DD %>% separate(date_time, c("date", "time"), "T")
OML_DD$date<-as.Date(OML_DD$date)
#creating channel-wise data frames
FIB<-OML_DD %>% filter(channel_name == "2 Foreigners In Bollywood")
AIB<-OML_DD %>% filter(channel_name == "All India Bakchod")
ACV<-OML_DD %>% filter(channel_name == "ashish chanchlani vines")
BKV<-OML_DD %>% filter(channel_name == "BB Ki Vines")
BYN<-OML_DD %>% filter(channel_name == "BeYouNick")
EIC<-OML_DD %>% filter(channel_name == "East India Comedy")
FC<-OML_DD %>% filter(channel_name == "Filter Copy")
GY<-OML_DD %>% filter(channel_name == "Girliyapa")
MS<-OML_DD %>% filter(channel_name == "Mostly Sane")
SNGC<-OML_DD %>% filter(channel_name == "SnG Comedy")
TSP<-OML_DD %>% filter(channel_name == "The Screen Patti")
TT<-OML_DD %>% filter(channel_name == "The Timeliners")
#creating a summary statistics data frame
sum_stat_vc<-OML_DD %>% group_by(channel_name, short_name) %>% summarise(mean_vc = mean(view_count), sd_vc = sd(view_count), median_vc = median(view_count), coeff_of_var = sd_vc/mean_vc)
For the purpose of this assignment, I will formulate bivariate regression models to examine the relationships between the available variables.
In particular, I will fit the linear least squares regression line (the "best fit" line) to the data.
I will use the view count as my independent variable and plot the other variables against it.
Linear least squares regression line: the line that minimises the sum of the squared vertical distances between the plotted data points and itself.
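For reference, with explanatory variable \(X\) and response \(Y\), the least-squares slope and intercept have the standard closed-form expressions:
\[ \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
These are the estimates that R's lm() returns for a simple model of the form y ~ x.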
#tidying the OML data dump for easier plotting
OML_tidy <- OML_DD %>% gather(measure, value, -view_count, -channel_name, -video_title, -date, -time, -short_name)
OML_tidy$value<-as.integer(OML_tidy$value)
#plotting the tidy OML Data
ggplot(OML_tidy, aes(x=view_count, y = value, col = measure)) + geom_point(alpha = 0.4) + labs(x = "View Count")
## Warning: Removed 2 rows containing missing values (geom_point).
We notice an outlier in the upper part of the graph which can affect the regression model. For our study, let’s ignore that outlier.
#filtering out the outlier
OML_tidy_filtered<- OML_tidy %>% filter(value<900000)
#Plotting the scatterplot and regression line
ggplot(OML_tidy_filtered, aes(x=view_count, y = value, col = measure)) + geom_point(alpha = 0.4) + geom_smooth(method = "lm", se = FALSE) + labs(x = "View Count")
As seen from the graph, the view_count (i.e. the number of views per video) is the explanatory variable, while the likes_count, dislikes_count and comments_count are the response variables.
The above assumption is consistent with logic, since the decision to "like", "dislike", or "comment" on a video generally follows the "viewing" of the video.
Let’s interpret the model and determine its accuracy.
#plotting likes vs. views
OML_DD_filtered <- OML_DD %>% filter(likes_count<900000, comments_count<60000)
ggplot(OML_DD_filtered, aes(x=view_count, y = likes_count)) + geom_point(alpha = 0.4) + geom_smooth(method = "lm", se = FALSE) + labs(x = "View Count", y = "Likes Count")
#calculating the regression coefficient
OML_DD_filtered %>% summarise(r = cor(view_count, likes_count, use = "pairwise.complete.obs"))
## # A tibble: 1 x 1
## r
## <dbl>
## 1 0.864
The coefficient of correlation is 0.864, which suggests a strong positive correlation between the number of views and the number of likes.
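For reference, cor() with its default settings computes Pearson's correlation coefficient:
\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} \]
which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship).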
#calculating the slope (regression coefficient) & intercept of the regression line
mod1 <- lm(likes_count ~ view_count, data = OML_DD_filtered)
summary(mod1)
##
## Call:
## lm(formula = likes_count ~ view_count, data = OML_DD_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -272121 -10021 -3573 4666 410422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.413e+03 1.485e+03 2.298 0.0217 *
## view_count 2.435e-02 3.697e-04 65.866 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47450 on 1472 degrees of freedom
## Multiple R-squared: 0.7467, Adjusted R-squared: 0.7465
## F-statistic: 4338 on 1 and 1472 DF, p-value: < 2.2e-16
From the linear model, we can interpret the following:
Equation of the linear model:
\(Y = \beta_0 + \beta_1 X\)
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope (regression coefficient), and X and Y are the explanatory and response variables respectively.
Plugging in the estimated coefficients: **likes_count = 3413 + 0.02435 × view_count**
The adjusted R-squared value is 74.65%, which means that the model fits the set of observations fairly well.
Since this model is constructed with data from all channels, we cannot determine which channel has a stronger relationship between the two concerned variables, and hence, we will need to explore individual channel data.
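As a quick illustration of how the fitted model can be used, here is a minimal sketch of a point prediction for a hypothetical video with one million views (the figure of 1e6 is purely illustrative):
#sketch: predicted likes for a hypothetical video with 1,000,000 views
predict(mod1, newdata = data.frame(view_count = 1e6))
#roughly 3413 + 0.02435 * 1e6, i.e. about 27,763 likes, per the coefficients above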
Let's have a look at comments vs. views.
#plotting comments vs views
ggplot(OML_DD_filtered, aes(x=view_count, y = comments_count)) + geom_point(alpha = 0.4) + geom_smooth(method = "lm", se = FALSE) +labs(x = "View Count", y = "Comments Count")
#calculating regression coefficient
OML_DD_filtered %>% summarise(r = cor(view_count, comments_count, use = "pairwise.complete.obs"))
## # A tibble: 1 x 1
## r
## <dbl>
## 1 0.710
The coefficient of correlation here is 0.710, which means that the linear relationship between comments and views is weaker than that of likes and views.
#calculating the slope & intercept of the regression line
mod2<-lm(comments_count ~ view_count, data = OML_DD_filtered)
summary(mod2)
##
## Call:
## lm(formula = comments_count ~ view_count, data = OML_DD_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15200 -1000 -93 444 41611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.026e+01 1.308e+02 -0.308 0.758
## view_count 1.260e-03 3.257e-05 38.678 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4180 on 1472 degrees of freedom
## Multiple R-squared: 0.504, Adjusted R-squared: 0.5037
## F-statistic: 1496 on 1 and 1472 DF, p-value: < 2.2e-16
From the linear model, we can interpret the following:
The equation of the model is: **comments_count = -40.26 + 0.00126 × view_count**
The R-squared value for this model is only about 50%, which means that our linear model is only a moderate fit for our set of observations.
We can similarly explore the linear model for dislikes count and number of views per video.
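A minimal sketch of that model, which was not run in the original analysis (output omitted):
#sketch: the analogous model for dislikes vs. views
mod3 <- lm(dislikes_count ~ view_count, data = OML_DD_filtered)
summary(mod3)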
It would be interesting to explore the relationship between number of likes and comments.
ggplot(OML_DD_filtered, aes(likes_count, comments_count)) + geom_point(alpha = 0.4) + geom_smooth(method = lm, se = FALSE) + labs(x = "Likes Count", y = "Comments Count")->p1
ggplot(OML_DD, aes(dislikes_count, comments_count)) + geom_point(alpha = 0.4) + geom_smooth(method = lm, se = FALSE) + labs(x = "Dislikes Count", y = "Comments Count") -> p2
grid.arrange(p1, p2, ncol = 2)
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
Note the difference in the x and y scales between the two plots.
I have assumed the likes count (or dislikes count) to be the explanatory variable and the comments count to be the response variable, because comments are generally considered higher in the hierarchy of engagement metrics.
Audience loyalty is a complex concept. As the name suggests, it expresses whether the audience is consistent, i.e. the same viewers keep coming back for more content, or whether it is erratic.
To determine audience loyalty, I would require a lot more information, for example, no. of subscribers, IP address tracking, and cookie-tracking. Because I do not have access to this information, I will rely on the view count data.
Logic: An erratic view count suggests poor audience loyalty, because it would imply either sudden surges or sudden drops in viewership. Hence, the more consistent the view count, the better is the audience loyalty.
I will plot the summary statistics data frame for this purpose.
#tidying the data frame for intuitive plotting
sum_stat_vc_tidy<- sum_stat_vc %>% gather(measure, value, -channel_name, -short_name, -coeff_of_var)
#plotting the tidy data
ggplot(sum_stat_vc_tidy, aes(x=short_name, y = value, col = measure)) + geom_point() + labs(x = "Channel Name")
#printing the summary statistics data frame
sum_stat_vc
## # A tibble: 12 x 6
## # Groups: channel_name [?]
## channel_name short_name mean_vc sd_vc median_vc coeff_of_var
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2 Foreigners In Boll… FIB 3.94e6 2.97e6 2759534. 0.754
## 2 All India Bakchod AIB 2.54e6 3.02e6 1367270. 1.19
## 3 ashish chanchlani vi… ACV 4.17e6 3.70e6 3030134 0.886
## 4 BB Ki Vines BKV 9.06e6 6.37e6 7243978. 0.704
## 5 BeYouNick BYN 1.50e6 8.97e5 1367758. 0.600
## 6 East India Comedy EIC 7.14e5 5.36e5 630487 0.751
## 7 Filter Copy FC 1.85e6 1.60e6 1608605 0.862
## 8 Girliyapa GY 2.15e6 1.53e6 1844980 0.710
## 9 Mostly Sane MS 7.26e5 1.09e6 259732. 1.50
## 10 SnG Comedy SNGC 2.74e5 5.22e5 126716. 1.90
## 11 The Screen Patti TSP 1.45e6 1.40e6 1235319 0.967
## 12 The Timeliners TT 1.54e6 1.23e6 1384983 0.797
The coefficient of variation (std. dev./mean) is a measure of variability (in our case, of how erratic the view count is). It is particularly useful when comparing multiple data sets.
For the purpose of this assignment, I will assume that audience loyalty is inversely proportional to the coefficient of variation.
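In symbols, the working assumption is:
\[ CV = \frac{\sigma_{\text{views}}}{\mu_{\text{views}}}, \qquad \text{audience loyalty} \propto \frac{1}{CV} \]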
sum_stat_vc[order(sum_stat_vc$coeff_of_var),]
## # A tibble: 12 x 6
## # Groups: channel_name [12]
## channel_name short_name mean_vc sd_vc median_vc coeff_of_var
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BeYouNick BYN 1.50e6 8.97e5 1367758. 0.600
## 2 BB Ki Vines BKV 9.06e6 6.37e6 7243978. 0.704
## 3 Girliyapa GY 2.15e6 1.53e6 1844980 0.710
## 4 East India Comedy EIC 7.14e5 5.36e5 630487 0.751
## 5 2 Foreigners In Boll… FIB 3.94e6 2.97e6 2759534. 0.754
## 6 The Timeliners TT 1.54e6 1.23e6 1384983 0.797
## 7 Filter Copy FC 1.85e6 1.60e6 1608605 0.862
## 8 ashish chanchlani vi… ACV 4.17e6 3.70e6 3030134 0.886
## 9 The Screen Patti TSP 1.45e6 1.40e6 1235319 0.967
## 10 All India Bakchod AIB 2.54e6 3.02e6 1367270. 1.19
## 11 Mostly Sane MS 7.26e5 1.09e6 259732. 1.50
## 12 SnG Comedy SNGC 2.74e5 5.22e5 126716. 1.90
Based on our assumptions, BeYouNick has the most loyal audience, while SnG Comedy, the least.
In order to determine which type of content works best, a classification of the data set based on the type of content is required. Without that information, I’m unable to plot informed scatterplots for the same.
A linear model can be fitted to a plot between the date that a particular video of a channel was released and the content’s view count.
We can thus arrange channels from slowest growth to fastest growth.
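Before going channel by channel, a single faceted plot can give a quick visual overview of all twelve trends at once; a minimal sketch (not part of the original analysis):
#sketch: date vs. view count for every channel, with a fitted line per facet
ggplot(OML_DD, aes(date, view_count)) + geom_point(alpha = 0.3) + geom_smooth(method = "lm", se = FALSE) + facet_wrap(~ short_name, scales = "free_y") + labs(x = "Date of Release", y = "View Count")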
Let’s investigate the performance of SnG Comedy.
ggplot(SNGC, aes(date, view_count)) + geom_point(alpha = 0.6) + geom_smooth(method = lm, se = FALSE) + labs(x = "Date of Release", y = "View Count")
lm(view_count ~ date, data = SNGC)
##
## Call:
## lm(formula = view_count ~ date, data = SNGC)
##
## Coefficients:
## (Intercept) date
## 9727218 -546
We can see that for each incremental day, the expected number of views per video drops by an average of 546.
Investigating for the channel Mostly Sane:
ggplot(MS, aes(date, view_count)) + geom_point(alpha = 0.6) + geom_smooth(method = lm, se = FALSE) + labs(x = "Date of Release", y = "View Count")
lm(view_count ~ date, data = MS)
##
## Call:
## lm(formula = view_count ~ date, data = MS)
##
## Coefficients:
## (Intercept) date
## -25391696 1489
We have better news for Mostly Sane: for each incremental day, the expected number of views per video increases by an average of 1489.
I will extrapolate this method to all channels. It isn’t necessary to plot the data for all the channels. We’re only concerned with the slope of the regression line.
lm(view_count ~ date, data = AIB)
##
## Call:
## lm(formula = view_count ~ date, data = AIB)
##
## Coefficients:
## (Intercept) date
## 15034718.8 -740.9
lm(view_count ~ date, data = TSP)
##
## Call:
## lm(formula = view_count ~ date, data = TSP)
##
## Coefficients:
## (Intercept) date
## -20548549 1276
lm(view_count ~ date, data = ACV)
##
## Call:
## lm(formula = view_count ~ date, data = ACV)
##
## Coefficients:
## (Intercept) date
## -124812844 7518
lm(view_count ~ date, data = FC)
##
## Call:
## lm(formula = view_count ~ date, data = FC)
##
## Coefficients:
## (Intercept) date
## -1.051e+07 7.087e+02
lm(view_count ~ date, data = TT)
##
## Call:
## lm(formula = view_count ~ date, data = TT)
##
## Coefficients:
## (Intercept) date
## -16146521 1017
lm(view_count ~ date, data = FIB)
##
## Call:
## lm(formula = view_count ~ date, data = FIB)
##
## Coefficients:
## (Intercept) date
## -123431841 7387
lm(view_count ~ date, data = EIC)
##
## Call:
## lm(formula = view_count ~ date, data = EIC)
##
## Coefficients:
## (Intercept) date
## 4227273.8 -205.4
lm(view_count ~ date, data = GY)
##
## Call:
## lm(formula = view_count ~ date, data = GY)
##
## Coefficients:
## (Intercept) date
## 51580012 -2843
lm(view_count ~ date, data = BKV)
##
## Call:
## lm(formula = view_count ~ date, data = BKV)
##
## Coefficients:
## (Intercept) date
## -155727095 9727
lm(view_count ~ date, data = BYN)
##
## Call:
## lm(formula = view_count ~ date, data = BYN)
##
## Coefficients:
## (Intercept) date
## -1.548e+07 9.918e+02
I'm going to manually create a data frame to determine the viewership growth rates of the different channels.
viewership_growth<-data.frame(channel = c("SNGC", "MS", "AIB", "TSP", "ACV", "FC", "TT", "FIB", "EIC", "GY", "BKV", "BYN"), view_slope = c(-546, 1489, -740.9, 1276, 7518, 708.7, 1017, 7387, -205.4, -2843, 9727, 991.8))
viewership_growth[order(viewership_growth$view_slope),]
## channel view_slope
## 10 GY -2843.0
## 3 AIB -740.9
## 1 SNGC -546.0
## 9 EIC -205.4
## 6 FC 708.7
## 12 BYN 991.8
## 7 TT 1017.0
## 4 TSP 1276.0
## 2 MS 1489.0
## 8 FIB 7387.0
## 5 ACV 7518.0
## 11 BKV 9727.0
From the table above, we can determine that BB Ki Vines has the highest viewership growth, while Girliyapa has the worst, in that it is experiencing negative viewership growth.
Note: I have ignored the strength of the linear relationship between the date and view-count variables to keep the calculations straightforward.
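For completeness, here is a minimal sketch (base R, using the short_name field created earlier) of how the per-channel slopes could be computed in one pass instead of being transcribed by hand; up to rounding, its output should match the table above:
#sketch: compute every channel's date slope programmatically
slopes <- sapply(split(OML_DD, OML_DD$short_name), function(d) coef(lm(view_count ~ date, data = d))[["date"]])
viewership_growth_auto <- data.frame(channel = names(slopes), view_slope = unname(slopes))
viewership_growth_auto[order(viewership_growth_auto$view_slope), ]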
The aim of Part B is to identify, from the set of available campaigns, the best and worst campaigns based on the following metrics:
#loading data
OML_TS<-read_xlsx("OML_TS.xlsx")
In this section, I will create data frames and order them for future use. Respective data frames will be explained as required.
Note: To calculate the total engagement, I have added only the positive reactions (i.e. like, haha, wow, love) to the number of comments and shares. In certain thought-provoking and political videos, the “sad” react and “angry” react would also be meaningful, but we know that these are simply video promotions, and hence, it does not make sense to include negative reactions in total engagement.
#adding "cost per reach", "total engagement", and "cost per engagement" fields
OML_TS <- OML_TS %>% mutate(cost_per_reach = amount/Reach, tot_eng = shares + comments + pos_reactions, cost_per_eng = amount/tot_eng, cost_per_view = amount/three_sec_views)
#arranging the data frame by the "cost per reach" field
cheap_reach<-OML_TS[order(OML_TS$cost_per_reach),]
cheap_reach <- select(cheap_reach, campaign_name, cost_per_reach)
#arranging the data frame by "cost per engagement" field
cheap_eng<-OML_TS[order(OML_TS$cost_per_eng),]
#calculating the ratio of thirty-second views to ten-second views
OML_TS<-OML_TS %>% mutate(thirty_ten = thirty_sec_views/ten_sec_views)
#calculating shares per ten-second view
OML_TS <- mutate(OML_TS, shares_per_view = shares/ten_sec_views)
shares_p_view<-OML_TS[order(OML_TS$shares_per_view),]
shares_p_view<-select(shares_p_view, campaign_name, shares_per_view)
thirty_ten_eng<-OML_TS[order(OML_TS$thirty_ten),]
thirty_ten_eng<-select(thirty_ten_eng, campaign_name, thirty_ten)
video_watched<-OML_TS[order(OML_TS$video_percent_watched),]
video_watched<-select(video_watched, campaign_name, video_percent_watched)
In this section, I will explore the costs per reach of all the campaigns to identify the cheapest and the most expensive ones.
ggplot(OML_TS, aes(campaign_name, cost_per_reach)) + geom_point(colour = "blue") + labs(x = "Campaign Name", y = "Cost/Reach")
cheap_reach
## # A tibble: 17 x 2
## campaign_name cost_per_reach
## <chr> <dbl>
## 1 I 0.00283
## 2 H 0.00457
## 3 D 0.00738
## 4 J 0.00823
## 5 F 0.00903
## 6 E 0.00957
## 7 A 0.0117
## 8 P 0.0121
## 9 B 0.0128
## 10 G 0.0154
## 11 Q 0.0176
## 12 L 0.0200
## 13 N 0.0202
## 14 K 0.0206
## 15 C 0.0209
## 16 O 0.0276
## 17 M 0.0285
It appears that campaigns I, H and D have the lowest costs per reach, while campaigns M, O and C have considerably higher costs per reach.
Here, I will attempt to address which campaign has the cheapest engagement, and which one, the costliest.
ggplot(OML_TS, aes(campaign_name, cost_per_eng)) + geom_point(colour = "blue") + labs(x ="Campaign Name", y = "Cost/Engagement")
cheap_eng %>% select(campaign_name, cost_per_eng)
## # A tibble: 17 x 2
## campaign_name cost_per_eng
## <chr> <dbl>
## 1 I 0.165
## 2 H 0.489
## 3 J 0.700
## 4 E 0.921
## 5 D 1.01
## 6 F 1.06
## 7 P 1.58
## 8 B 1.60
## 9 K 2.03
## 10 A 2.05
## 11 C 2.21
## 12 N 2.41
## 13 L 2.52
## 14 G 2.85
## 15 M 3.24
## 16 O 3.43
## 17 Q 4.17
I, H, and J are the cheapest campaigns with respect to engagement, and Q, O and M, the costliest.
It is interesting to note that O and M are also among the three highest costs per reach, while H and I are among the three lowest.
I want to investigate how expensive each view is in different campaigns.
ggplot(OML_TS, aes(campaign_name, cost_per_view)) + geom_point(colour="blue") + labs(x = "Campaign Name", y = "Cost/3-sec-view")
cost_p_view<-OML_TS[order(OML_TS$cost_per_view),]
cost_p_view
## # A tibble: 17 x 17
## campaign_name Reach amount avg_watch_time video_percent_watched
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 I 1768376 5000 16 10.9
## 2 H 1093412 5000 12 8.2
## 3 D 677633 5000 18 11.6
## 4 F 553417 5000 27 5.53
## 5 E 595320 5700 34 3.72
## 6 G 324869 5000 26 5.71
## 7 J 607497 5000 15 4.34
## 8 Q 567135 10000 32 1.06
## 9 P 268717 3250 12 6.15
## 10 A 426501 5000 38 3.04
## 11 B 390876 5000 13 4.97
## 12 L 500542 10000 26 4.37
## 13 N 494105 10000 18 4.91
## 14 O 362602 10000 14 3.86
## 15 K 242472 5000 7 2.07
## 16 C 239803 5000 10 4.88
## 17 M 351346 10000 15 3.74
## # ... with 12 more variables: three_sec_views <dbl>, ten_sec_views <dbl>,
## # thirty_sec_views <dbl>, shares <dbl>, comments <dbl>,
## # pos_reactions <dbl>, cost_per_reach <dbl>, tot_eng <dbl>,
## # cost_per_eng <dbl>, cost_per_view <dbl>, thirty_ten <dbl>,
## # shares_per_view <dbl>
Campaigns I, H and D are the cheapest with respect to this metric.
For this analysis, I will measure the engagement factor of video campaigns by the following metrics:
Average percentage of the video watched: Ideally, I would like viewer-level data on what percentage of the video each viewer watched, so that I could determine whether the median or the mean is the better measure of central tendency; since I do not have that data, I will use the mean.
Retention factor: I will define the retention factor as thirty-second views divided by ten-second views, i.e., of the viewers who watched the first ten seconds of the video, what fraction went on to watch thirty seconds of it?
I will ignore the three-second-views data because it includes auto-play views, which are not a good indicator of how many people consciously chose to watch the video. The retention factor defined above gives a good idea of how engaging a particular video is, because it tells us the percentage of viewers who were engaged enough in the first ten seconds to keep watching for the next twenty.
ggplot(OML_TS, aes(campaign_name, video_percent_watched)) + geom_point(colour = "blue") + labs(x = "Campaign Name", y = "Average Percent of Video Watched")
video_watched
## # A tibble: 17 x 2
## campaign_name video_percent_watched
## <chr> <dbl>
## 1 Q 1.06
## 2 K 2.07
## 3 A 3.04
## 4 E 3.72
## 5 M 3.74
## 6 O 3.86
## 7 J 4.34
## 8 L 4.37
## 9 C 4.88
## 10 N 4.91
## 11 B 4.97
## 12 F 5.53
## 13 G 5.71
## 14 P 6.15
## 15 H 8.2
## 16 I 10.9
## 17 D 11.6
I can derive 2 key insights from this graph:
Campaign I has the second-highest percentage of video watched of all the campaigns, and it is among the cheapest campaigns.
Campaign D has the highest percent of video watched.
ggplot(OML_TS, aes(campaign_name, thirty_ten)) + geom_point(colour = "blue") + labs(x = "Campaign Name", y = "Retention Factor")
thirty_ten_eng
## # A tibble: 17 x 2
## campaign_name thirty_ten
## <chr> <dbl>
## 1 K 0.507
## 2 Q 0.512
## 3 E 0.554
## 4 F 0.576
## 5 O 0.585
## 6 P 0.595
## 7 M 0.598
## 8 C 0.622
## 9 G 0.623
## 10 H 0.623
## 11 B 0.634
## 12 A 0.657
## 13 J 0.662
## 14 N 0.680
## 15 L 0.681
## 16 I 0.701
## 17 D 0.795
Campaigns D, I, N and L have the most retentive content.
People only share content on Facebook that they find incredibly appealing. Therefore, I'm interested in finding out the number of shares per view for each campaign. It would provide further insight into the quality of the content.
Again, I will use the ten-second-view metric to weed out auto-play views.
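A minimal sketch of how this metric can be inspected, reusing the shares_per_view field and the shares_p_view frame created earlier (output omitted):
#sketch: shares per ten-second view, by campaign
ggplot(OML_TS, aes(campaign_name, shares_per_view)) + geom_point(colour = "blue") + labs(x = "Campaign Name", y = "Shares per 10-Second View")
shares_p_view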
Based on these analyses, I can draw the following insights:
Campaign I is the best campaign: it has ranked in the top two on all the metrics. In particular, it is the cheapest in terms of reach and engagement, and it has also garnered the most shares per ten-second view.
Campaign D has also performed very well: it has the highest retention factor and the highest average percentage of video watched (which means it is very engaging), and it is quite cheap as well.
Campaign H is the third-best campaign: it is the second cheapest in terms of reach, engagement and views. Additionally, it has the third-highest percentage of video watched and the fifth-highest shares per view.
Campaign O has performed poorly: it is among the costliest campaigns in terms of reach, engagement and views. It also has the sixth-lowest average percentage of video watched and the fifth-lowest retention factor.
Campaign Q has poor engagement: it has the highest cost per engagement, the lowest average percentage of video watched, and the second-lowest retention factor.
Campaign M is another bad apple: it has turned out to be a very costly campaign, with the highest cost per reach and the highest cost per view. It is also the third-worst in terms of shares per view and has the fifth-lowest average percentage of video watched.
Thus, we have our 3 best and 3 worst campaigns based on 6 different metrics.
This project is formulated based on the data provided by OML. Even though I have been able to produce some interesting insights, my analysis would have been more robust had I received data on the following parameters.
For YouTube videos, rankings are important. There are a few metrics that indicate how well a video can be ranked. I’ve discussed the following metrics below*:
Length of each video: The average length of a video on the front page is 14 minutes 50 seconds. Because the YouTube audience visits the site intending to watch longer videos, as a general rule of thumb, the longer the video, the better its ranking. After all, YouTube wants to increase the time viewers spend on the site. (https://youtube-creators.googleblog.com/2012/08/youtube-now-why-we-focus-on-watch-time.html)
The number of shares per video: Again, the greater the number of shares, the better the video's ranking.
Number of channel subscribers per video: This information would allow us to measure the loyalty of the audience. It could also tell us whether a video increases or decreases the number of subscribers.
*These metrics are sourced from https://backlinko.com/youtube-ranking-factors, where 1.3 million YouTube videos were analysed.
The process of identifying the best and the worst campaigns would be aided by the following metrics: