MSBA 580 Midterm - Phase 1, 2 & 3: EDA and Data Validation

Introduction

As social media consultants, our team’s objective was to see what type of YouTube content drives the highest engagement. This R markdown will show and covers Phase 1 & 2 (Exploratory Data Analysis) and Phase 3 (Data Validation) using the datasnaek YouTube trending dataset sourced from Kaggle, which contains real trending video statistics from the United States.

Data Source

The dataset (USvideos.csv) is sourced from Mitchell JY’s Trending YouTube Scraper via Kaggle (datasnaek/youtube-new). It contains 40,949 observations across 16 variables including views, likes, dislikes, comment count, publish time, and trending date.

Phase 1 & 2: Loading Data and Understanding the Shape

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

youtube <- read_csv("USvideos.csv")

## Rows: 40949 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): video_id, trending_date, title, channel_title, tags, thumbnail_lin...
## dbl  (5): category_id, views, likes, dislikes, comment_count
## lgl  (3): comments_disabled, ratings_disabled, video_error_or_removed
## dttm (1): publish_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(youtube)

## Rows: 40,949
## Columns: 16
## $ video_id               <chr> "2kyS6SvSYSE", "1ZAPwfrtAFY", "5qpjK5DgCt4", "p…
## $ trending_date          <chr> "17.14.11", "17.14.11", "17.14.11", "17.14.11",…
## $ title                  <chr> "WE WANT TO TALK ABOUT OUR MARRIAGE", "The Trum…
## $ channel_title          <chr> "CaseyNeistat", "LastWeekTonight", "Rudy Mancus…
## $ category_id            <dbl> 22, 24, 23, 24, 24, 28, 24, 28, 1, 25, 17, 24, …
## $ publish_time           <dttm> 2017-11-13 17:13:01, 2017-11-13 07:30:00, 2017…
## $ tags                   <chr> "SHANtell martin", "last week tonight trump pre…
## $ views                  <dbl> 748374, 2418783, 3191434, 343168, 2095731, 1191…
## $ likes                  <dbl> 57527, 97185, 146033, 10172, 132235, 9763, 1599…
## $ dislikes               <dbl> 2966, 6146, 5339, 666, 1989, 511, 2445, 778, 11…
## $ comment_count          <dbl> 15954, 12703, 8181, 2146, 17518, 1434, 1970, 34…
## $ thumbnail_link         <chr> "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg…
## $ comments_disabled      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ ratings_disabled       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ video_error_or_removed <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ description            <chr> "SHANTELL'S CHANNEL - https://www.youtube.com/s…

dim(youtube)

## [1] 40949    16

The dataset contains 40,949 rows and 16 columns. Key variables include views, likes, dislikes, comment_count, publish_time, and trending_date. Notably, the derived variables required for modeling (engagement_rate, like_per_view, video_length_category, days_since_published) are not present in the raw data and does have to be engineered.

Phase 3: Data Validation

Missing Value Check

colSums(is.na(youtube))

##               video_id          trending_date                  title 
##                      0                      0                      0 
##          channel_title            category_id           publish_time 
##                      0                      0                      0 
##                   tags                  views                  likes 
##                      0                      0                      0 
##               dislikes          comment_count         thumbnail_link 
##                      0                      0                      0 
##      comments_disabled       ratings_disabled video_error_or_removed 
##                      0                      0                      0 
##            description 
##                    578

Only the description column contains missing values (578 NAs). All key numeric columns like views, likes, dislikes, and comment_count are complete with zero missing values.

Variance Check

youtube %>%
  select(views, likes, dislikes, comment_count) %>%
  summarise(across(everything(), list(mean = mean, sd = sd, var = var)))

## # A tibble: 1 × 12
##   views_mean views_sd views_var likes_mean likes_sd    likes_var dislikes_mean
##        <dbl>    <dbl>     <dbl>      <dbl>    <dbl>        <dbl>         <dbl>
## 1   2360785. 7394114.   5.47e13     74267.  228885. 52388498047.         3711.
## # ℹ 5 more variables: dislikes_sd <dbl>, dislikes_var <dbl>,
## #   comment_count_mean <dbl>, comment_count_sd <dbl>, comment_count_var <dbl>

All four numeric columns show substantial rea world variance. Views range from 549 to over 225 million with a standard deviation of 7.39 million, confirming this isa real data which is a stark contrast to the near zero variance found in the original synthetic dataset we were using.

Derived Variable Engineering

youtube_clean <- youtube %>%
  filter(video_error_or_removed == FALSE) %>%
  mutate(
    engagement_rate = (likes + dislikes + comment_count) / views,
    like_per_view = likes / views,
    days_since_published = as.numeric(
      as.Date(trending_date, format = "%y.%d.%m") - as.Date(publish_time)
    ),
    video_length_category = NA  # not available in raw data
  )

glimpse(youtube_clean)

## Rows: 40,926
## Columns: 20
## $ video_id               <chr> "2kyS6SvSYSE", "1ZAPwfrtAFY", "5qpjK5DgCt4", "p…
## $ trending_date          <chr> "17.14.11", "17.14.11", "17.14.11", "17.14.11",…
## $ title                  <chr> "WE WANT TO TALK ABOUT OUR MARRIAGE", "The Trum…
## $ channel_title          <chr> "CaseyNeistat", "LastWeekTonight", "Rudy Mancus…
## $ category_id            <dbl> 22, 24, 23, 24, 24, 28, 24, 28, 1, 25, 17, 24, …
## $ publish_time           <dttm> 2017-11-13 17:13:01, 2017-11-13 07:30:00, 2017…
## $ tags                   <chr> "SHANtell martin", "last week tonight trump pre…
## $ views                  <dbl> 748374, 2418783, 3191434, 343168, 2095731, 1191…
## $ likes                  <dbl> 57527, 97185, 146033, 10172, 132235, 9763, 1599…
## $ dislikes               <dbl> 2966, 6146, 5339, 666, 1989, 511, 2445, 778, 11…
## $ comment_count          <dbl> 15954, 12703, 8181, 2146, 17518, 1434, 1970, 34…
## $ thumbnail_link         <chr> "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg…
## $ comments_disabled      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ ratings_disabled       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ video_error_or_removed <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ description            <chr> "SHANTELL'S CHANNEL - https://www.youtube.com/s…
## $ engagement_rate        <dbl> 0.102150796, 0.047972059, 0.049994141, 0.037835…
## $ like_per_view          <dbl> 0.076869319, 0.040179297, 0.045757800, 0.029641…
## $ days_since_published   <dbl> 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1,…
## $ video_length_category  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

After filtering out 23 error rows, the cleaned dataset contains 40,926 observations and 20 variables. Three of the four required derived variables were successfully engineered. video_length_category does remains NA as video duration is not included in this dataset and would need to have separate YouTube Data API lookup.

Distribution Visualizations

ggplot(youtube_clean, aes(x = views)) +
  geom_histogram(bins = 50, fill = "steelblue") +
  scale_x_log10(labels = scales::comma) +
  labs(title = "Distribution of Views (log scale)",
       x = "Views", y = "Count")

ggplot(youtube_clean, aes(x = engagement_rate)) +
  geom_histogram(bins = 50, fill = "darkgreen") +
  labs(title = "Distribution of Engagement Rate",
       x = "Engagement Rate", y = "Count")

ggplot(youtube_clean, aes(x = days_since_published)) +
  geom_histogram(bins = 50, fill = "coral") +
  labs(title = "Days Since Published When Trending",
       x = "Days", y = "Count")

### Key Validation Findings

The views distributiondoes follow a near normal curve on a log scale, being around around 1 million views, ranging from ~10,000 to over 100 million. This is very consistent with real YouTube trending behavior.

The engagement rate distribution is righ -skewed, with most videos falling between 0.01 and 0.05, and a long tail of high engagement outliers. This is showing real genuine user behavior.

The days since published distribution shows the vast majority of videos trend within days of publishing, with a small number of outliers trending years later. This is consistent with the way viral patterns are

Conclusion

The datasnaek YouTube trending dataset passes all data validation checks. Variance is real, missing values are minimal and derived variables have been successfully engineered for three of the four required modeling inputs. The dataset is ready to support Phase 4 model development (Random Forest and Logistic Regression) by Lennin and Nico respectively.

video_length_category remains a limitation — the team should discuss whether to drop this variable or supplement with a YouTube Data API lookup for video duration.

References

Mitchell JY. (2019). Trending YouTube Video Statistics. Kaggle. https://www.kaggle.com/datasets/datasnaek/youtube-new