The study was performed on a data set from Kaggle that compiled metrics of over 90,000 YouTube videos. The purpose of the study was to determine whether it was possible to build a model to predict the target (likes to views ratio) of a YouTube video. Furthermore, the study aimed to provide insights into questions such as:
What is the main factor influencing target ratio?
What categories of videos tend to perform the best?
What categories of videos tend to perform the worst?
Source: For this study, the data set with 92,275 observations and 20 variables from a Kaggle competition called “Predict Youtube Video Likes” was used.
The study allocated 40% of the data for Exploratory Data Analysis and used the other 60% for the predictive-modeling phase. Out of that data, 80% was allocated for training data set and the rest was used as a testing data set. Different types of models were created: Linear, lasso, ridge, and random forest models For each type of model, two variations were created: those that included interaction terms and those that did not. In the case of the random forest models, the variations were based on the number of trees.
The models were fitted into resamples to minimize the bias of the data by decreasing the chances of an scenario where the model memorizes the training set. The following metrics were obtained:
R-squared:
# A tibble: 8 × 9
wflow_id .config preproc model .metric .estimator mean n std_err
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
1 random_fore… Preprocess… recipe rand_… rsq standard 0.662 15 0.00210
2 random_fore… Preprocess… recipe rand_… rsq standard 0.661 15 0.00209
3 lm_no Preprocess… recipe linea… rsq standard 0.371 15 0.00277
4 lm_yes Preprocess… recipe linea… rsq standard 0.458 15 0.00365
5 lasso_no Preprocess… recipe linea… rsq standard 0.357 15 0.00220
6 lasso_yes Preprocess… recipe linea… rsq standard 0.384 15 0.00993
7 ridge_no Preprocess… recipe linea… rsq standard 0.369 15 0.00256
8 ridge_yes Preprocess… recipe linea… rsq standard 0.441 15 0.00533
Root Mean Squared Error:
# A tibble: 8 × 9
wflow_id .config preproc model .metric .estimator mean n std_err
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
1 random_for… Preprocess… recipe rand_… rmse standard 0.0246 15 9.89e-5
2 random_for… Preprocess… recipe rand_… rmse standard 0.0246 15 9.50e-5
3 lm_no Preprocess… recipe linea… rmse standard 0.0319 15 1.37e-4
4 lm_yes Preprocess… recipe linea… rmse standard 0.0296 15 1.26e-4
5 lasso_no Preprocess… recipe linea… rmse standard 0.0325 15 1.44e-4
6 lasso_yes Preprocess… recipe linea… rmse standard 0.0318 15 2.17e-4
7 ridge_no Preprocess… recipe linea… rmse standard 0.0320 15 1.39e-4
8 ridge_yes Preprocess… recipe linea… rmse standard 0.0301 15 1.49e-4
The results above show that the customized random forest model performed the best with a mean r-squared of 0.662 and a mean rmse of 0.0246. Thus, it was chosen as the model to be fitted in the testing set. The result of such operation is reflected in the visualization below which shows the predicted vs. the actual values in the testing set.
Moreover, the metrics of the fitted model in the testing set were: r-squared = 0.674 and rmse = 0.0240, which is very close to the mean values obtained across folds. This is an indicator that the model was built correctly.
Some of the main takeaways and conclusions were:
It was possible to build a model that could moderately predict the target of a YouTube video.
The model tends to under-predict at higher target values.
The coefficients of the linear model with interaction show that the impact of disabling comments should be further studied as one of the factors that drive target given that it has the biggest coefficient.
When isolating the effect of categories of videos in the regression, comedy and music videos perform the best in terms of target while sport and news/politics videos perform the worst.
The EDA shows that Fridays tend to be the days of release with the best target, while Sundays tend to be worst.
To improve the model, it is recommended to take into account variables such as video tags and titles since the presence/absence of certain words may affect the target. Additionally, collecting new metrics such as the season of the year when the videos were released or the number of subscribers of the channel that released the video would introduce new factors that could potentially improve model. Finally, an extension of this study could consider creating specific models for individual target ratios in response to the study’s weakness when predicting videos with higher targets.