This study developed several models to predict the target (likes-to-views ratio) of a YouTube video using a data set from Kaggle. Aside from prediction, the study's purpose was to provide insight into which factors might affect the target ratio the most and which categories of videos tend to perform better or worse. The study concluded that, out of the different models developed, the random forest model with 1000 trees performed the best, with an r-squared of ~0.67 and an rmse of 0.024. It also identified enabling/disabling comments as an important variable that is worthy of further study. Finally, it showed that, when isolating the category factor, music and comedy videos tend to perform the best while sports and news/politics videos tend to perform the worst.
Note: For step-by-step detail on the model-building process, see the following link: https://github.com/xamanthalc/youtube_modeling_data_science (open link in new tab).
Source: The data was obtained from Kaggle, from a competition called “Predict Youtube Video Likes.” The data was compiled by Kaggle, which lends strong confidence in its validity, and it consists of metrics for over 90,000 YouTube videos.
General Facts: The data set has 92,275 rows and 20 variables, as shown below.
[1] 92275
[1] 20
Out of the 20 variables:
8 were character variables.
1 was a date variable.
3 were logical variables.
7 were numeric variables.
1 was a POSIXct variable.
Additionally, after skimming the data set, it was found that 2 variables have missing observations: description (1476 missing) and duration_seconds (2176 missing). Given that duration_seconds is a numeric variable likely to be important for modeling, and that its missing observations represent a very small part of the data (2.36%), the rows with missing duration_seconds were deleted. Overall, the data was in tidy format. The result is shown below:
| Data summary | |
|---|---|
| Name | youtube_data |
| Number of rows | 90099 |
| Number of columns | 20 |
| Column type frequency: | |
| character | 8 |
| Date | 1 |
| logical | 3 |
| numeric | 7 |
| POSIXct | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| video_id | 0 | 1.00 | 11 | 11 | 0 | 16530 | 0 |
| title | 0 | 1.00 | 1 | 100 | 0 | 16870 | 0 |
| channelId | 0 | 1.00 | 24 | 24 | 0 | 4453 | 0 |
| channelTitle | 0 | 1.00 | 2 | 47 | 0 | 4521 | 0 |
| tags | 0 | 1.00 | 2 | 500 | 0 | 12564 | 0 |
| thumbnail_link | 0 | 1.00 | 46 | 51 | 0 | 16530 | 0 |
| description | 1414 | 0.98 | 2 | 4998 | 0 | 17260 | 0 |
| id | 0 | 1.00 | 22 | 22 | 0 | 90099 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| trending_date | 0 | 1 | 2020-08-12 | 2021-11-30 | 2021-04-08 | 461 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| comments_disabled | 0 | 1 | 0.02 | FAL: 88722, TRU: 1377 |
| ratings_disabled | 0 | 1 | 0.00 | FAL: 89668, TRU: 431 |
| has_thumbnail | 0 | 1 | 0.89 | TRU: 80622, FAL: 9477 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| categoryId | 0 | 1 | 18.72 | 6.86 | 1 | 17.00 | 20.00 | 24.00 | 29.00 |
| view_count | 0 | 1 | 2787115.43 | 7408199.50 | 38510 | 532918.00 | 1102651.00 | 2490732.00 | 264407389.00 |
| likes | 0 | 1 | 152989.52 | 440810.11 | 0 | 21679.00 | 52148.00 | 131998.50 | 16021534.00 |
| dislikes | 0 | 1 | 3097.00 | 13401.05 | 0 | 369.00 | 851.00 | 2233.00 | 879354.00 |
| comment_count | 0 | 1 | 13856.33 | 97838.48 | 0 | 1719.00 | 3891.00 | 9353.00 | 6738537.00 |
| duration_seconds | 0 | 1 | 760.61 | 5816.82 | 3 | 184.00 | 446.00 | 854.00 | 485620.00 |
| target | 0 | 1 | 0.06 | 0.04 | 0 | 0.03 | 0.05 | 0.08 | 0.43 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| publishedAt | 0 | 1 | 2020-08-03 21:51:14 | 2021-11-29 18:50:25 | 2021-04-03 22:39:01 | 16257 |
Considerations - Derived Variables: From the skim above, it was noticed that the variable publishedAt is a POSIXct variable that records the date and time when a video was released. New variables called hour_released, released_date, time_release_trending, and day_week_released were created from publishedAt. The variable hour_released stores the hour at which a YouTube video was posted and is numeric. The variable released_date stores the date when the video was released and is a date variable. The variable time_release_trending stores the number of days between the release of a video and the date when it became trending. The variable day_week_released stores the day of the week when the video was released and is a character variable. The justification was that isolating date and hour as individual variables could surface interesting insights during the EDA and model-building process later on.
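A minimal sketch of how these variables could be derived with lubridate and dplyr (the exact code lives in the linked repository; object names here are assumptions):

```r
library(dplyr)
library(lubridate)

youtube_data <- youtube_data %>%
  mutate(
    hour_released         = hour(publishedAt),       # numeric hour of day (0-23)
    released_date         = as_date(publishedAt),    # calendar date of release
    time_release_trending = as.numeric(trending_date - released_date),  # days from release to trending
    day_week_released     = weekdays(released_date)  # character day of the week
  )
```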
Thus, after the modifications, the data set had 90,099 rows and 24 variables, out of which:
9 were character variables.
2 were date variables.
3 were logical variables.
9 were numeric variables.
1 was a POSIXct variable.
Codebook: A codebook for the data set is included in the project under the name “codebook_youtube_data_set” and can also be found in the following link: https://drive.google.com/file/d/1fePKrgvBqZPlPeJk1C_k9rQt5ggomMHV/view?usp=sharing.
The main research question for this study was: Can the target (ratio of likes to views) be successfully predicted based on a video’s metrics (duration, comments, etc.)?
Other questions:
What is the main factor affecting a video’s target?
What category of video should someone produce to optimize their target?
Reasoning: The target ratio was chosen as the outcome variable over the raw number of likes for standardization and to add complexity to the model. An outcome variable such as “number of likes” is heavily affected by the number of views (e.g., a video with 1 million views has a higher chance of accumulating likes than a video with 100 thousand views). Thus, a better way to measure viewers’ responsiveness to a video is the target ratio (target = likes / view_count), since it adjusts the number of likes for the total views. Moreover, having target as the outcome variable makes the predictors less obvious when modeling, thereby adding complexity to the process.
In the section below, an Exploratory Data Analysis (EDA) was conducted to explore the distributions and relationships of the variables in the data set and to use the insights as a starting point for the model-building process. Due to the large amount of data, this study could afford to use 40% of the entries for EDA (36,040 observations) and set aside 60% of the data for the predictive-modeling phase (54,059 observations) to avoid introducing bias.
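A sketch of how this EDA/modeling split could be drawn with rsample; the seed and object names are assumptions:

```r
library(rsample)

set.seed(123)  # assumed seed; the original value is not stated
eda_split  <- initial_split(youtube_data, prop = 0.4)
eda_data   <- training(eda_split)  # 40% for exploration (~36,040 rows)
model_data <- testing(eda_split)   # 60% held out for modeling (~54,059 rows)
```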
The following plots show the distributions of the numeric variables.
As shown above, the variables likes, view_count, comment_count, dislikes, and duration_seconds had strongly right-skewed distributions, driven by outliers far above the levels where most observations sit. This behavior could cause trouble during modeling, so these variables were log-transformed (base 10) to reduce skewness. The distributions after the transformation are shown below:
Additionally, the distribution of the number of days from video release to becoming trending is shown below. Despite its right-skewness, the variable was not transformed due to its small range (0-37).
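A minimal sketch of the log transformation applied above; the +1 offset is an assumption to guard against zeros (likes, dislikes, and comment_count all have minimums of 0):

```r
library(dplyr)

eda_data <- eda_data %>%
  mutate(across(
    c(likes, view_count, comment_count, dislikes, duration_seconds),
    ~ log10(.x + 1)  # base-10 log; +1 avoids log10(0) for zero-valued rows
  ))
```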
Some takeaways:
The most popular categories of YouTube videos in the data set are categoryId = 24 (entertainment), followed by categoryId = 10 (music).
The most popular hour to release a YouTube video in the data set is 16:00, followed by 17:00.
The median number of: views is ~1 million, likes is ~50 thousand, dislikes is 852, comments is ~4 thousand, duration is ~450 seconds (7.5 minutes), and target (likes/views) is 0.05.
Most videos become trending within 10 days of release.
Some takeaways:
The most popular date for YouTube video releases in the data set was October 30, 2020.
Friday is the most popular day to release videos (although releases are spread fairly evenly across all days of the week).
Most videos did not disable comments or ratings, and most had thumbnails.
Given that the outcome variable of the predictive model was target, the bivariate analysis focused on finding insights into the behavior of target relative to the other variables.
Note: The variable categoryId was tested as a qualitative variable for bivariate analysis.
Some takeaways:
Overall, the sooner a video becomes trending after its release date, the higher its target ratio.
Comment count and target ratio have a positive relationship.
Neither likes nor view_count appears in the plots because target = likes/view_count:
All else equal, the higher the number of likes, the higher the target ratio.
All else equal, the higher the view_count, the lower the target ratio.
Some takeaways:
Videos that allow ratings and comments tend to have a higher target.
The categories with the highest median target were 23 (comedy) and 10 (music).
Videos released on Sunday performed the worst according to target, and videos released on Friday performed the best.
From the EDA section, it was possible to determine that variables such as categoryId, comments_disabled, ratings_disabled, comment_count, time_release_trending, day_week_released, and duration_seconds seemed relevant for predicting the target of a video. To continue with the modeling phase, some variables were left out due to insufficient importance and with the purpose of having a more focused data set. Those variables were:
title: too specific to each entry and difficult to make into a metric. For further research, it could be interesting to investigate whether the length of a title, or its sentiment may impact target.
publishedAt: Irrelevant given that it is redundant with the columns released_date and hour_released.
channelId and channelTitle: too specific to each entry and difficult to make into a metric. For further research, it could be interesting to collect the number of subscribers and use that as a metric for modeling target.
tags: too specific to each entry and difficult to make into a metric. However, with NLP (Natural Language Processing), a study could examine whether the presence/absence of certain tags is likely to impact target.
thumbnail_link: not relevant.
description: too specific to each entry and difficult to make into a metric. For further research, it could be interesting to investigate whether the sentiment of the description may impact target.
id: not relevant given that video_id was already being used as primary key.
released_date and trending_date: to guarantee not dealing with a time series.
Thus, the data set for modeling ended up with 14 variables and 54,059 observations. Finally, the variables day_week_released, has_thumbnail, ratings_disabled, comments_disabled, and categoryId were turned into factors, as sketched below.
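A sketch of the column selection and factor conversion described above, assuming the modeling portion of the data is stored in a hypothetical `model_data`:

```r
library(dplyr)

model_data <- model_data %>%
  select(-title, -publishedAt, -channelId, -channelTitle, -tags,
         -thumbnail_link, -description, -id, -released_date, -trending_date) %>%
  mutate(across(
    c(day_week_released, has_thumbnail, ratings_disabled,
      comments_disabled, categoryId),
    as.factor
  ))
```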
After the above modifications, the first ten observations were:
# A tibble: 10 × 14
video_id categoryId view_count likes dislikes comment_count comments_disabl…
<chr> <fct> <int> <int> <int> <int> <fct>
1 AqqMA2m… 17 381755 3534 95 877 FALSE
2 CZJvBfo… 24 20224542 657687 6147 43081 FALSE
3 SKQ5r3A… 10 1842144 203285 1127 17019 FALSE
4 O3V4UBZ… 20 2319270 255979 30349 48695 FALSE
5 Cew-Bnj… 25 1546608 32579 1795 15246 FALSE
6 311ZcJo… 24 7482473 484530 11628 23978 FALSE
7 TPbQCrk… 24 522929 65095 536 2010 FALSE
8 9xnTz6s… 20 530910 14657 288 793 FALSE
9 EhAemz1… 27 2829884 173762 2239 17019 FALSE
10 A7JdZaC… 24 1851793 88507 2809 3020 FALSE
# … with 7 more variables: ratings_disabled <fct>, duration_seconds <dbl>,
# has_thumbnail <fct>, target <dbl>, hour_released <int>,
# day_week_released <fct>, time_release_trending <dbl>
The training data set used 80% of the data for predictive modeling (43,247 observations), and the remaining 20% was set aside for the testing phase (10,812 observations). Additionally, repeated cross-validation with 5 folds and 3 repeats was applied to the training data set for later operations. The resamples are shown below:
# 5-fold cross-validation repeated 3 times using stratification
# A tibble: 15 × 3
splits id id2
<list> <chr> <chr>
1 <split [34595/8652]> Repeat1 Fold1
2 <split [34596/8651]> Repeat1 Fold2
3 <split [34599/8648]> Repeat1 Fold3
4 <split [34599/8648]> Repeat1 Fold4
5 <split [34599/8648]> Repeat1 Fold5
6 <split [34595/8652]> Repeat2 Fold1
7 <split [34596/8651]> Repeat2 Fold2
8 <split [34599/8648]> Repeat2 Fold3
9 <split [34599/8648]> Repeat2 Fold4
10 <split [34599/8648]> Repeat2 Fold5
11 <split [34595/8652]> Repeat3 Fold1
12 <split [34596/8651]> Repeat3 Fold2
13 <split [34599/8648]> Repeat3 Fold3
14 <split [34599/8648]> Repeat3 Fold4
15 <split [34599/8648]> Repeat3 Fold5
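A sketch of how the training/testing split and the repeated cross-validation above could be produced with rsample; the seed is an assumption, and stratification on the outcome follows the header printed above:

```r
library(rsample)

set.seed(123)  # assumed seed
data_split <- initial_split(model_data, prop = 0.8, strata = target)
train_data <- training(data_split)  # 43,247 observations
test_data  <- testing(data_split)   # 10,812 observations

# 5-fold cross-validation repeated 3 times, stratified on the outcome
folds <- vfold_cv(train_data, v = 5, repeats = 3, strata = target)
```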
Four main types of models were used to fit the resamples and assess which performed best: linear models, lasso models, ridge models, and random forest models. For the first three types, each had two versions: without and with pairwise interactions. For the random forest, two versions were also created: one using the default values and one customizing the number of trees to 1000.
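A sketch of the model specifications with parsnip; the engines (lm, glmnet, ranger) and the placeholder penalty are assumptions, since only the tree count is stated in the text:

```r
library(parsnip)

lm_spec    <- linear_reg() %>% set_engine("lm")
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) %>% set_engine("glmnet")  # mixture = 1: lasso
ridge_spec <- linear_reg(penalty = 0.01, mixture = 0) %>% set_engine("glmnet")  # mixture = 0: ridge
rf_default <- rand_forest() %>% set_engine("ranger") %>% set_mode("regression")
rf_custom  <- rand_forest(trees = 1000) %>% set_engine("ranger") %>% set_mode("regression")
```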
The plots below show how the models ranked in terms of r-squared and Root Mean Squared Error (rmse). The random forests performed better across both metrics:
The tables below show that the random forest with customized values had a small edge over the random forest with default values:
R-squared:
# A tibble: 8 × 9
wflow_id .config preproc model .metric .estimator mean n std_err
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
1 random_fore… Preprocess… recipe rand_… rsq standard 0.662 15 0.00210
2 random_fore… Preprocess… recipe rand_… rsq standard 0.661 15 0.00209
3 lm_no Preprocess… recipe linea… rsq standard 0.371 15 0.00277
4 lm_yes Preprocess… recipe linea… rsq standard 0.458 15 0.00365
5 lasso_no Preprocess… recipe linea… rsq standard 0.357 15 0.00220
6 lasso_yes Preprocess… recipe linea… rsq standard 0.384 15 0.00993
7 ridge_no Preprocess… recipe linea… rsq standard 0.369 15 0.00256
8 ridge_yes Preprocess… recipe linea… rsq standard 0.441 15 0.00533
Root Mean Squared Error:
# A tibble: 8 × 9
wflow_id .config preproc model .metric .estimator mean n std_err
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
1 random_for… Preprocess… recipe rand_… rmse standard 0.0246 15 9.89e-5
2 random_for… Preprocess… recipe rand_… rmse standard 0.0246 15 9.50e-5
3 lm_no Preprocess… recipe linea… rmse standard 0.0319 15 1.37e-4
4 lm_yes Preprocess… recipe linea… rmse standard 0.0296 15 1.26e-4
5 lasso_no Preprocess… recipe linea… rmse standard 0.0325 15 1.44e-4
6 lasso_yes Preprocess… recipe linea… rmse standard 0.0318 15 2.17e-4
7 ridge_no Preprocess… recipe linea… rmse standard 0.0320 15 1.39e-4
8 ridge_yes Preprocess… recipe linea… rmse standard 0.0301 15 1.49e-4
As shown above, the random forest with customized values performed best, with an r-squared of 0.662, which is considered moderate. It also had the smallest rmse (0.0246), meaning its predictions deviated least, on average, from the observed targets. Even though it was the model that performed best, it was necessary to put the results in perspective:
As shown by the plot, at targets over 0.15 the model tends to under-predict. For example, some observations with target values of 0.4 are predicted at slightly over 0.1.
Aside from the 8 models created above, another 32 models resulting from tuning the linear models were developed. The linear model without penalty was tuned on the spline degrees of freedom, while the lasso and ridge models were tuned on the penalty. The results are shown below:
The grid of spline degrees of freedom used for tuning was:
# A tibble: 16 × 2
comment_count_df hour_release_df
<int> <int>
1 1 1
2 5 1
3 10 1
4 15 1
5 1 5
6 5 5
7 10 5
8 15 5
9 1 10
10 5 10
11 10 10
12 15 10
13 1 15
14 5 15
15 10 15
16 15 15
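A sketch of how this grid could be built and tuned; the interaction recipe is assumed to live in a hypothetical `base_rec`, while `lm_spec` and `folds` carry over from the earlier sketches:

```r
library(recipes)
library(workflows)
library(tune)
library(tidyr)

spline_rec <- base_rec %>%
  step_ns(comment_count, deg_free = tune("comment_count_df")) %>%
  step_ns(hour_released, deg_free = tune("hour_release_df"))

# Cross the candidate degrees of freedom into the 16-row grid shown above
spline_grid <- crossing(comment_count_df = c(1, 5, 10, 15),
                        hour_release_df  = c(1, 5, 10, 15))

spline_res <- tune_grid(
  workflow() %>% add_recipe(spline_rec) %>% add_model(lm_spec),
  resamples = folds,
  grid      = spline_grid
)
select_best(spline_res, metric = "rmse")
```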
The plot below shows that both rmse and r-squared improve when the degrees of freedom for hour_released and comment_count are higher.
This is further shown with the output of the select_best function:
# A tibble: 1 × 3
`comment_count df` `hour_release df` .config
<int> <int> <chr>
1 15 15 Preprocessor16_Model1
Thus, the final model was tuned using 15 degrees of freedom for both variables, and the metrics across folds are shown below:
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.0283 15 0.0000727 Preprocessor1_Model1
2 rsq standard 0.507 15 0.00213 Preprocessor1_Model1
The penalty values used for tuning the lasso model were:
# A tibble: 8 × 1
penalty
<dbl>
1 0.0000000001
2 0.00000000268
3 0.0000000720
4 0.00000193
5 0.0000518
6 0.00139
7 0.0373
8 1
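These eight values correspond to a regular grid over dials’ default penalty range (10^-10 to 1 on a log10 scale); a sketch of the lasso tuning, reusing the hypothetical `base_rec` and `folds` from before:

```r
library(dials)

penalty_grid <- grid_regular(penalty(), levels = 8)  # 8 log-spaced values from 1e-10 to 1

lasso_tune <- linear_reg(penalty = tune(), mixture = 1) %>%  # set mixture = 0 for the ridge version
  set_engine("glmnet")

lasso_res <- tune_grid(
  workflow() %>% add_recipe(base_rec) %>% add_model(lasso_tune),
  resamples = folds,
  grid      = penalty_grid
)
select_best(lasso_res, metric = "rmse")
```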
The plot below shows that rmse and r-squared both perform best at the same penalty value.
The select_best function identifies that value:
# A tibble: 1 × 2
penalty .config
<dbl> <chr>
1 0.0000000001 Preprocessor1_Model1
Thus, the final lasso model was tuned using 0.0000000001 as the penalty, and the metrics across folds are shown below:
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.0301 15 0.000334 Preprocessor1_Model1
2 rsq standard 0.444 15 0.0102 Preprocessor1_Model1
The penalty values used for tuning the ridge model were:
# A tibble: 8 × 1
penalty
<dbl>
1 0.0000000001
2 0.00000000268
3 0.0000000720
4 0.00000193
5 0.0000518
6 0.00139
7 0.0373
8 1
Again, the plot below shows that rmse and r-squared both perform best at the same penalty value.
The select_best function identifies that value:
# A tibble: 1 × 2
penalty .config
<dbl> <chr>
1 0.0000000001 Preprocessor1_Model1
Thus, the final ridge model was tuned using 0.0000000001 as the penalty, and the metrics across folds are shown below:
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.0302 15 0.000168 Preprocessor1_Model1
2 rsq standard 0.440 15 0.00565 Preprocessor1_Model1
The table below summarizes the results of the tuned models:
| | linear_splines | lasso | ridge |
|---|---|---|---|
| mean rsq | 0.5070 | 0.4440 | 0.4400 |
| mean rmse | 0.0283 | 0.0301 | 0.0302 |
The linear model with splines tuned at 15 degrees of freedom outperforms the tuned lasso and ridge models, with a mean r-squared across folds of 0.507 and a mean rmse of 0.0283. It is also better than the untuned linear model from the previous section. Yet the customized random forest remains the best model (mean r-squared = 0.662, mean rmse = 0.0246).
Note: This section did not tune the random forest models because the non-tuned section showed that further increasing the number of trees or the minimum node size did not have a sufficiently large impact on the model. Additionally, all the tuned models were created with a recipe that includes interactions, since the “Simple Model Selection” section showed that including interactions improved all models.
Having chosen the random forest with customized values as the best model, it was applied to the testing set. The table and plot below show a snapshot of the actual values vs. the predictions:
# A tibble: 10,812 × 2
target .pred
<dbl> <dbl>
1 0.0325 0.0537
2 0.110 0.0974
3 0.110 0.0744
4 0.0993 0.0810
5 0.0797 0.0797
6 0.0151 0.0254
7 0.0386 0.0537
8 0.0189 0.0742
9 0.101 0.0827
10 0.0314 0.0491
# … with 10,802 more rows
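A sketch of how these predictions and the metrics that follow could be generated with yardstick, assuming the customized random forest workflow is stored in a hypothetical `rf_workflow`:

```r
library(workflows)
library(yardstick)
library(dplyr)

rf_fit <- fit(rf_workflow, data = train_data)  # train on the 80% training split

test_preds <- predict(rf_fit, new_data = test_data) %>%
  bind_cols(test_data %>% select(target))       # pair predictions with actual targets

rsq(test_preds, truth = target, estimate = .pred)
rmse(test_preds, truth = target, estimate = .pred)
```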
The plot for the testing set very closely resembled the trend shown by the resamples. The metrics below confirm that this was indeed the case:
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.681
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.0240
The r-squared of the model on the testing set was 0.681, slightly better than its value in the resamples (0.662). The rmse on the testing set was 0.0240, also slightly better than its value in the resamples (0.0246). Because both metrics are similar in the resampling and testing phases, it was reasonable to conclude that the model was built properly.
Given that determining the main factor affecting the likes-to-views ratio is an inferential question, it was difficult to provide an accurate answer. However, by looking at the model coefficients, it was possible to obtain an approximation.
Ideally, one would look at the coefficients of the best-fitting model. However, since that model was a random forest, which does not produce coefficients, this was not possible. Instead, the linear model with interactions and tuned splines was used, since it performed best after the random forests. Its coefficients are shown below:
# A tibble: 435 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0692 0.00323 21.4 1.76e-101
2 dislikes -0.0198 0.000241 -82.2 0
3 duration_seconds -0.00564 0.000182 -31.0 4.23e-208
4 time_release_trending -0.00725 0.000156 -46.5 0
5 categoryId_X2 -0.000655 0.000230 -2.85 4.35e- 3
6 categoryId_X10 0.00383 0.000375 10.2 2.01e- 24
7 categoryId_X15 -0.000198 0.000226 -0.879 3.80e- 1
8 categoryId_X17 -0.00985 0.000309 -31.9 4.54e-220
9 categoryId_X19 0.000917 0.000302 3.03 2.42e- 3
10 categoryId_X20 -0.000804 0.000331 -2.43 1.51e- 2
# … with 425 more rows
The estimate is the coefficient assigned to a specific factor. Therefore, the estimates with the greatest magnitude represent the factors that carry the most weight when the target is predicted. It is important to highlight that the weight can be either negative or positive (e.g., a coefficient of -2 would be more influential than a coefficient of 1). Thus, the absolute values of the estimates were calculated and ordered; the result is shown below:
# A tibble: 321 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 comments_disabled_TRUE. 0.840 0.0308 -27.3 2.70e-162
2 comment_count_x_comments_disabled_TRU… 0.177 0.00654 -27.1 3.63e-160
3 comment_count_ns_14 0.150 0.00752 -19.9 1.18e- 87
4 comment_count_ns_01 0.133 0.00389 -34.3 9.33e-254
5 comment_count_ns_03 0.133 0.00429 -31.1 3.04e-210
6 comment_count_ns_05 0.129 0.00448 -28.9 1.57e-181
7 comment_count_ns_04 0.128 0.00442 -28.9 2.24e-181
8 comment_count_ns_02 0.127 0.00424 -29.9 6.76e-195
9 comment_count_ns_07 0.125 0.00464 -27.1 5.24e-160
10 comment_count_ns_06 0.125 0.00455 -27.5 6.48e-165
# … with 311 more rows
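A sketch of how this ranking could be computed with broom and dplyr, assuming the fitted linear workflow is stored in a hypothetical `lm_splines_fit`:

```r
library(broom)
library(dplyr)
library(workflows)

lm_splines_fit %>%
  extract_fit_parsnip() %>%             # pull the underlying parsnip/lm fit from the workflow
  tidy() %>%                            # one row per coefficient
  mutate(estimate = abs(estimate)) %>%  # rank coefficients by magnitude, ignoring sign
  arrange(desc(estimate))
```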
The variable comments_disabled_TRUE. was the factor with the most influence in the regression. The table below shows comments_disabled_TRUE. in the original regression (no absolute values):
# A tibble: 1 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 comments_disabled_TRUE. -0.840 0.0308 -27.3 2.70e-162
Since the value was negative, it means that comments_disabled_TRUE. was the factor with the largest effect on the prediction and that videos with disabled comments are more likely to have a lower target ratio. Again, it was not possible to determine whether comments_disabled_TRUE. is, in effect, the most important factor, but based on the regression coefficients it is fair to say that, at the bare minimum, comments_disabled_TRUE. is a good candidate and is worth studying more closely.
Note: comment_count is correlated with view_count (0.54 across all data). Given that target is calculated from view_count, it is logical that comment_count (and, by extension, comments_disabled_TRUE.) would be a main driving factor.
Similarly, to gain insight into which categories might be the best/worst for optimizing the target ratio, the coefficients of the linear regression were inspected. However, given that the goal is to maximize target, the sign of the coefficient was taken into account this time.
The top 3 results are shown below:
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 categoryId_X23 0.00410 0.000250 16.4 4.59e-60
2 categoryId_X10 0.00383 0.000375 10.2 2.01e-24
3 categoryId_X22 0.00320 0.000267 12.0 4.97e-33
The last 3 results are shown below:
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 categoryId_X24 -0.000936 0.000348 -2.69 7.13e- 3
2 categoryId_X25 -0.00770 0.000216 -35.6 3.64e-274
3 categoryId_X17 -0.00985 0.000309 -31.9 4.54e-220
The results above hint that producing videos in categories 23 (comedy), 10 (music), and 22 (people and blogs) is likely to improve the target ratio. On the other hand, producing videos in categories 24 (entertainment), 25 (news and politics), and 17 (sports) is likely to have the opposite effect.
It was possible to predict the target ratio (likes over view count) at a moderate level by developing a random forest model that explained ~67% of the variance in target with a Root Mean Squared Error (rmse) of 0.024. A highly noticeable weakness of the model occurs at higher target values, where it tends to under-predict.
Additionally, the regression model with interactions and tuned splines suggested that comments_disabled_TRUE. might be one of the main factors (if not the main factor) driving the target ratio. Similarly, the model’s coefficients revealed that someone aiming to maximize the target ratio of their videos should probably produce music, comedy, or people-and-blogs videos. Neither of these statements can be asserted with certainty, but they provide a good starting point for further study.
The EDA showed that highly successful videos can reach target ratios of up to ~0.4. It also showed that the sooner a video becomes trending, the higher its target ratio tends to be. Regarding release day of the week, Fridays generally performed best (likely because people have more time to watch videos over the weekend), and Sundays performed worst (probably for the opposite reason).
To further improve the model, it is recommended to revisit the variables excluded from the prediction phase, with an emphasis on tags and channelId, since one might find that the presence of a specific tag tends to increase/decrease target, or that videos from specific channels tend to perform better.