LBB1

Introduction

In this report, we analyze a dataset of trending YouTube videos. The copy used here covers trending dates from 2020-08-12 to 2023-06-04, and the source dataset is updated daily. Data of this kind supports tasks such as sentiment analysis, video categorization, and analysis of the factors that influence a video's popularity.

Data Import

The data was sourced from YouTube's API and is loaded into R from a CSV file.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

data <- read_csv("inputs/JP_youtube_trending_data.csv")

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 206180 Columns: 16

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): video_id, title, channelId, channelTitle, tags, thumbnail_link, de...
## dbl  (5): categoryId, view_count, likes, dislikes, comment_count
## lgl  (2): comments_disabled, ratings_disabled
## dttm (2): publishedAt, trending_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

spec(data)

## cols(
##   video_id = col_character(),
##   title = col_character(),
##   publishedAt = col_datetime(format = ""),
##   channelId = col_character(),
##   channelTitle = col_character(),
##   categoryId = col_double(),
##   trending_date = col_datetime(format = ""),
##   tags = col_character(),
##   view_count = col_double(),
##   likes = col_double(),
##   dislikes = col_double(),
##   comment_count = col_double(),
##   thumbnail_link = col_character(),
##   comments_disabled = col_logical(),
##   ratings_disabled = col_logical(),
##   description = col_character()
## )
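
The parsing warning reported by read_csv can be inspected with problems(), and column types can be declared up front instead of being guessed. The following is a minimal sketch on a hypothetical three-row inline CSV, not the real file; the column subset and the deliberately bad value are assumptions made for illustration.

```r
library(readr)

# Hypothetical three-row sample standing in for the real CSV file.
csv_text <- I("video_id,view_count,comments_disabled
a1,100,FALSE
b2,not_a_number,TRUE
c3,300,FALSE
")

# Declare the expected type of every column explicitly.
sample_data <- read_csv(
  csv_text,
  col_types = cols(
    video_id = col_character(),
    view_count = col_double(),
    comments_disabled = col_logical()
  )
)

# Each row of the result describes one value readr could not parse;
# the unparseable value itself is stored as NA in the data frame.
parse_issues <- problems(sample_data)
print(parse_issues)
```

Running problems(data) on the full dataset in the same way would show which rows and columns triggered the warning, which helps decide whether those rows can be kept.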

Data Description

The dataset comprises various attributes of YouTube videos, such as title, channel title, publication date, trending date, views, likes, dislikes, and comment counts. The goal of our analysis is to understand patterns and trends in the data and to identify factors that may influence a video's popularity or view count.

library(DT)
library(shiny)

## 
## Attaching package: 'shiny'

## The following objects are masked from 'package:DT':
## 
##     dataTableOutput, renderDataTable

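# Define the UI (the DT:: prefix avoids the shiny functions that mask DT's)
ui <- fluidPage(
  DT::dataTableOutput("myTable")
)

# Define the server
server <- function(input, output) {
  # server = TRUE is an argument of DT::renderDataTable, not of shiny's version
  output$myTable <- DT::renderDataTable(
    data,
    server = TRUE
  )
}

# Run the Shiny app
shinyApp(ui, server)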

Shiny applications not supported in static R Markdown documents

glimpse(data)

## Rows: 206,180
## Columns: 16
## $ video_id          <chr> "UYXa8R9vvzA", "02MaoZ5n-uM", "ucDDYszgj5c", "M9Pmf9…
## $ title             <chr> "皆からの色々な質問に何も隠さず答える！びっくりさせ…
## $ publishedAt       <dttm> 2020-08-11 10:00:06, 2020-08-11 13:36:28, 2020-08-1…
## $ channelId         <chr> "UCZCzstgLGQdK8GSztJHh0-w", "UC0v-pxTo1XamIDE-f__Ad0…
## $ channelTitle      <chr> "タナカガ", "(パーソル パ・リーグTV公式)PacificLeagu…
## $ categoryId        <dbl> 22, 17, 23, 20, 1, 26, 10, 22, 10, 20, 24, 20, 10, 2…
## $ trending_date     <dttm> 2020-08-12, 2020-08-12, 2020-08-12, 2020-08-12, 202…
## $ tags              <chr> "[None]", "パーソルパリーグTV|パリーグTV|パシフィッ…
## $ view_count        <dbl> 778499, 1161952, 1980557, 2381688, 442524, 431031, 6…
## $ likes             <dbl> 34811, 18514, 63961, 146742, 14388, 6096, 714306, 86…
## $ dislikes          <dbl> 667, 259, 692, 2794, 73, 123, 15176, 134, 572, 163, …
## $ comment_count     <dbl> 3939, 4115, 6216, 16557, 1420, 607, 31040, 1781, 826…
## $ thumbnail_link    <chr> "https://i.ytimg.com/vi/UYXa8R9vvzA/default.jpg", "h…
## $ comments_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ ratings_disabled  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ description       <chr> "登録者数30万人ありがとうございます！！！ご機嫌よう…

summary(data)

##    video_id            title            publishedAt                    
##  Length:206180      Length:206180      Min.   :2020-08-04 12:45:01.00  
##  Class :character   Class :character   1st Qu.:2021-04-25 15:00:16.00  
##  Mode  :character   Mode  :character   Median :2021-12-31 08:20:51.00  
##                                        Mean   :2022-01-01 22:34:53.60  
##                                        3rd Qu.:2022-09-15 09:00:07.00  
##                                        Max.   :2023-06-03 23:15:00.00  
##   channelId         channelTitle         categoryId   
##  Length:206180      Length:206180      Min.   : 1.00  
##  Class :character   Class :character   1st Qu.:17.00  
##  Mode  :character   Mode  :character   Median :22.00  
##                                        Mean   :19.22  
##                                        3rd Qu.:24.00  
##                                        Max.   :29.00  
##  trending_date                        tags             view_count       
##  Min.   :2020-08-12 00:00:00.00   Length:206180      Min.   :        0  
##  1st Qu.:2021-04-30 00:00:00.00   Class :character   1st Qu.:   294203  
##  Median :2022-01-05 00:00:00.00   Mode  :character   Median :   507657  
##  Mean   :2022-01-06 15:17:02.69                      Mean   :  1295259  
##  3rd Qu.:2022-09-20 00:00:00.00                      3rd Qu.:  1002946  
##  Max.   :2023-06-04 00:00:00.00                      Max.   :289350312  
##      likes             dislikes      comment_count     thumbnail_link    
##  Min.   :       0   Min.   :     0   Min.   :      0   Length:206180     
##  1st Qu.:    4997   1st Qu.:     0   1st Qu.:    386   Class :character  
##  Median :   10955   Median :     0   Median :    845   Mode  :character  
##  Mean   :   56577   Mean   :   526   Mean   :   6002                     
##  3rd Qu.:   26244   3rd Qu.:   200   3rd Qu.:   1951                     
##  Max.   :16369715   Max.   :879359   Max.   :6889393                     
##  comments_disabled ratings_disabled description       
##  Mode :logical     Mode :logical    Length:206180     
##  FALSE:198287      FALSE:190917     Class :character  
##  TRUE :7893        TRUE :15263      Mode  :character  
##                                                       
##                                                       
##

Data Cleansing

Before analysis, we clean the JP_youtube_trending dataset so that it is complete and consistently typed. Typical cleansing tasks include handling missing values, removing duplicates, standardizing formats, and validating values against predefined rules. Here we replace missing numeric values with 0 and missing text values with an empty string, confirm that no missing values remain, convert the date columns, and derive a like-to-dislike ratio.

replace_na <- function(x) {
  if(is.numeric(x)) {
    return(ifelse(is.na(x), 0, x))
  } else if(is.character(x)) {
    return(ifelse(is.na(x), "", x))
  } else {
    return(x)
  }
}

data <- lapply(data, replace_na)

data <- as.data.frame(data)
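
The same replacement can be sketched with dplyr's across() and coalesce(), which keeps the result a tibble and avoids the lapply/as.data.frame round trip. This is an alternative formulation on a hypothetical toy table, not the code used in this report.

```r
library(dplyr)

# Hypothetical toy data with a missing value in each column type.
toy <- tibble(
  title = c("a", NA, "c"),
  likes = c(10, NA, 30),
  comments_disabled = c(FALSE, NA, TRUE)
)

toy_clean <- toy %>%
  mutate(
    # Numeric columns: NA becomes 0.
    across(where(is.numeric), ~ coalesce(.x, 0)),
    # Character columns: NA becomes an empty string.
    across(where(is.character), ~ coalesce(.x, ""))
  )

# Logical columns are left untouched, as in the original helper.
print(toy_clean)
```

Note that filling missing counts with 0 treats "unknown" as "none", which can pull averages down; whether that is acceptable depends on how many values were missing in the first place.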

glimpse(data)

## Rows: 206,180
## Columns: 16
## $ video_id          <chr> "UYXa8R9vvzA", "02MaoZ5n-uM", "ucDDYszgj5c", "M9Pmf9…
## $ title             <chr> "皆からの色々な質問に何も隠さず答える！びっくりさせ…
## $ publishedAt       <dttm> 2020-08-11 10:00:06, 2020-08-11 13:36:28, 2020-08-1…
## $ channelId         <chr> "UCZCzstgLGQdK8GSztJHh0-w", "UC0v-pxTo1XamIDE-f__Ad0…
## $ channelTitle      <chr> "タナカガ", "(パーソル パ・リーグTV公式)PacificLeagu…
## $ categoryId        <dbl> 22, 17, 23, 20, 1, 26, 10, 22, 10, 20, 24, 20, 10, 2…
## $ trending_date     <dttm> 2020-08-12, 2020-08-12, 2020-08-12, 2020-08-12, 202…
## $ tags              <chr> "[None]", "パーソルパリーグTV|パリーグTV|パシフィッ…
## $ view_count        <dbl> 778499, 1161952, 1980557, 2381688, 442524, 431031, 6…
## $ likes             <dbl> 34811, 18514, 63961, 146742, 14388, 6096, 714306, 86…
## $ dislikes          <dbl> 667, 259, 692, 2794, 73, 123, 15176, 134, 572, 163, …
## $ comment_count     <dbl> 3939, 4115, 6216, 16557, 1420, 607, 31040, 1781, 826…
## $ thumbnail_link    <chr> "https://i.ytimg.com/vi/UYXa8R9vvzA/default.jpg", "h…
## $ comments_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ ratings_disabled  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ description       <chr> "登録者数30万人ありがとうございます！！！ご機嫌よう…

na_values <- is.na(data)

num_na_values <- sum(na_values)

num_na_values_by_column <- colSums(na_values)

print(num_na_values)

## [1] 0

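# publishedAt was already parsed as a datetime by read_csv; re-parsing keeps it a datetime.
# trending_date carries no time of day, so it is converted to a Date.
data <- data %>%
  mutate(publishedAt = parse_date_time(publishedAt, orders = c("ymd_HMS", "ymd_HM", "ymd")),
         trending_date = ymd(trending_date))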

data <- data %>%
  mutate(like_dislike_ratio = ifelse(dislikes == 0, NA, likes/dislikes))
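
Because the ratio is undefined whenever dislikes is 0, it is missing for a large share of rows. As a hedged alternative, the share of likes among all ratings stays defined when only dislikes is 0. The sketch below uses a hypothetical toy table; the column name like_share is an assumption introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy data: one ordinary row, one with zero dislikes,
# and one with no ratings at all.
toy <- tibble(
  likes    = c(100, 50, 0),
  dislikes = c(4, 0, 0)
)

toy <- toy %>%
  mutate(
    # Same definition as in the report: undefined when dislikes is 0.
    like_dislike_ratio = if_else(dislikes == 0, NA_real_, likes / dislikes),
    # Hypothetical alternative: undefined only when there are no ratings.
    like_share = if_else(likes + dislikes == 0, NA_real_,
                         likes / (likes + dislikes))
  )

print(toy)
```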

EDA

This is where we want to get to know our data better. We can start by looking at basic metrics, distribution of key features and relationships between variables.

glimpse(data)

## Rows: 206,180
## Columns: 17
## $ video_id           <chr> "UYXa8R9vvzA", "02MaoZ5n-uM", "ucDDYszgj5c", "M9Pmf…
## $ title              <chr> "皆からの色々な質問に何も隠さず答える！びっくりさせ…
## $ publishedAt        <dttm> 2020-08-11 10:00:06, 2020-08-11 13:36:28, 2020-08-…
## $ channelId          <chr> "UCZCzstgLGQdK8GSztJHh0-w", "UC0v-pxTo1XamIDE-f__Ad…
## $ channelTitle       <chr> "タナカガ", "(パーソル パ・リーグTV公式)PacificLeag…
## $ categoryId         <dbl> 22, 17, 23, 20, 1, 26, 10, 22, 10, 20, 24, 20, 10, …
## $ trending_date      <date> 2020-08-12, 2020-08-12, 2020-08-12, 2020-08-12, 20…
## $ tags               <chr> "[None]", "パーソルパリーグTV|パリーグTV|パシフィッ…
## $ view_count         <dbl> 778499, 1161952, 1980557, 2381688, 442524, 431031, …
## $ likes              <dbl> 34811, 18514, 63961, 146742, 14388, 6096, 714306, 8…
## $ dislikes           <dbl> 667, 259, 692, 2794, 73, 123, 15176, 134, 572, 163,…
## $ comment_count      <dbl> 3939, 4115, 6216, 16557, 1420, 607, 31040, 1781, 82…
## $ thumbnail_link     <chr> "https://i.ytimg.com/vi/UYXa8R9vvzA/default.jpg", "…
## $ comments_disabled  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ ratings_disabled   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ description        <chr> "登録者数30万人ありがとうございます！！！ご機嫌よう…
## $ like_dislike_ratio <dbl> 52.190405, 71.482625, 92.429191, 52.520401, 197.095…

summary(select_if(data, is.numeric))

##    categoryId      view_count            likes             dislikes     
##  Min.   : 1.00   Min.   :        0   Min.   :       0   Min.   :     0  
##  1st Qu.:17.00   1st Qu.:   294203   1st Qu.:    4997   1st Qu.:     0  
##  Median :22.00   Median :   507657   Median :   10955   Median :     0  
##  Mean   :19.22   Mean   :  1295259   Mean   :   56577   Mean   :   526  
##  3rd Qu.:24.00   3rd Qu.:  1002946   3rd Qu.:   26244   3rd Qu.:   200  
##  Max.   :29.00   Max.   :289350312   Max.   :16369715   Max.   :879359  
##                                                                         
##  comment_count     like_dislike_ratio
##  Min.   :      0   Min.   :   0.04   
##  1st Qu.:    386   1st Qu.:  25.22   
##  Median :    845   Median :  46.49   
##  Mean   :   6002   Mean   :  82.32   
##  3rd Qu.:   1951   3rd Qu.:  95.79   
##  Max.   :6889393   Max.   :2625.00   
##                    NA's   :114319

library(ggplot2)
top_channels <- data %>% group_by(channelTitle) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head(10)
ggplot(top_channels, aes(x=reorder(channelTitle, n), y=n)) + geom_bar(stat="identity") + coord_flip() + labs(title="Top 10 channels with most trending videos", x="Channel", y="Number of trending videos")
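
The count above is a count of rows. If a video appears on several trending dates, each appearance is counted separately, so it can be useful to compare row counts with the number of distinct videos. A minimal sketch on hypothetical toy data, assuming video_id identifies a video across days:

```r
library(dplyr)

# Hypothetical toy data: video v1 trends on two different days.
toy <- tibble(
  channelTitle = c("A", "A", "A", "B", "B"),
  video_id     = c("v1", "v1", "v2", "v3", "v4")
)

channel_counts <- toy %>%
  group_by(channelTitle) %>%
  summarise(
    trending_rows   = n(),                  # one per video per trending day
    distinct_videos = n_distinct(video_id)  # one per video
  ) %>%
  arrange(desc(distinct_videos))

print(channel_counts)
```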

top_categories <- data %>%
  group_by(categoryId) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(10)
ggplot(top_categories, aes(x = reorder(as.character(categoryId), n), y = n)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 10 categories with most trending videos",
       x = "Category", y = "Number of trending videos")
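
The category axis shows numeric IDs, which are hard to read. The sketch below joins a lookup table of category names onto a small hypothetical count table. The ID-to-name mapping is an assumption based on the commonly published YouTube video category list and should be checked against the category file or API response for the JP region before use.

```r
library(dplyr)

# Assumed mapping of YouTube category IDs to names (verify for region JP).
category_names <- tibble(
  categoryId = c(1, 2, 10, 15, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29),
  category = c(
    "Film & Animation", "Autos & Vehicles", "Music", "Pets & Animals",
    "Sports", "Travel & Events", "Gaming", "People & Blogs", "Comedy",
    "Entertainment", "News & Politics", "Howto & Style", "Education",
    "Science & Technology", "Nonprofits & Activism"
  )
)

# Hypothetical counts; ID 99 is deliberately absent from the lookup.
toy_counts <- tibble(categoryId = c(24, 10, 99), n = c(5, 3, 1))

labelled <- toy_counts %>%
  left_join(category_names, by = "categoryId") %>%
  # Fall back to a readable placeholder for IDs missing from the lookup.
  mutate(category = coalesce(category, paste("Unknown", categoryId)))

print(labelled)
```

The category column can then replace as.character(categoryId) on the plot axis.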

library(corrplot)

## corrplot 0.92 loaded

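numeric_data <- select_if(data, is.numeric)
# Keep only columns with no NA/Inf values and non-zero variance;
# like_dislike_ratio contains NAs, so it is dropped at this step
numeric_data <- numeric_data[ , sapply(numeric_data, function(x) all(is.finite(x)) && sd(x) != 0)]
correlation_matrix <- cor(numeric_data, use="complete.obs")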

corrplot(correlation_matrix, type="upper", order="hclust", tl.col="black", tl.srt=45)
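
A correlation plot encodes the coefficients as colors, so printing the matrix makes the actual values available for reporting. Since the counts are heavily right-skewed, a rank-based (Spearman) coefficient is a reasonable companion to Pearson. The sketch below uses simulated data, not the real dataset; the distributions are arbitrary assumptions chosen only to produce skewed, related columns.

```r
# Simulated, right-skewed engagement data (assumed distributions).
set.seed(1)
views    <- rlnorm(200, meanlog = 13, sdlog = 1)
likes    <- views * runif(200, 0.01, 0.05)
comments <- likes * runif(200, 0.02, 0.2)
toy <- data.frame(view_count = views, likes = likes, comment_count = comments)

# Pearson measures linear association and is sensitive to extreme values;
# Spearman works on ranks and is less affected by outliers.
pearson  <- cor(toy, method = "pearson")
spearman <- cor(toy, method = "spearman")

print(round(pearson, 2))
print(round(spearman, 2))
```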

Data Analysis

In this section we examine user engagement on YouTube in Japan. We compute average views, likes, dislikes, and comments per category, and plot likes, dislikes, and comments against view count with a fitted linear trend. The aim is to identify relationships between engagement metrics that content creators, marketers, and analysts can use to guide content and marketing strategy.

avg_by_category <- data %>%
  group_by(categoryId) %>%
  summarise(
    avg_views = mean(view_count),
    avg_likes = mean(likes),
    avg_dislikes = mean(dislikes),
    avg_comment = mean(comment_count)
  )
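
A per-category summary like this is easier to read when it is sorted and printed. Because view counts are strongly skewed, the median can tell a different story from the mean. The following sketch uses hypothetical toy values to show both side by side.

```r
library(dplyr)

# Hypothetical toy data: category 10 contains one very large outlier.
toy <- tibble(
  categoryId = c(10, 10, 10, 24, 24),
  view_count = c(100, 200, 9000, 300, 500)
)

views_by_category <- toy %>%
  group_by(categoryId) %>%
  summarise(
    rows         = n(),
    avg_views    = mean(view_count),
    median_views = median(view_count)
  ) %>%
  arrange(desc(median_views))

print(views_by_category)
```

In this toy case category 10 has the higher mean but the lower median, which is the kind of disagreement a single outlier can produce.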

ggplot(data, aes(x=view_count, y=likes)) + geom_point() + geom_smooth(method=lm, col="red") + labs(x="View Count", y="Likes", title="Likes vs View Count")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(data, aes(x=view_count, y=dislikes)) + geom_point() + geom_smooth(method=lm, col="red") + labs(x="View Count", y="Dislikes", title="Dislikes vs View Count")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(data, aes(x=view_count, y=comment_count)) + geom_point() + geom_smooth(method=lm, col="red") + labs(x="View Count", y="Comments", title="Comments vs View Count")

## `geom_smooth()` using formula = 'y ~ x'
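
The summary statistics show a median view count of about 0.5 million against a maximum of about 289 million, so on a linear scale most points crowd near the origin and a few extreme videos dominate the fitted line. One common remedy is to plot and fit on a log scale. The sketch below uses simulated data, not the real dataset, and uses log1p because the real counts include zeros.

```r
library(ggplot2)

# Simulated, right-skewed counts (assumed distributions, for illustration only).
set.seed(42)
toy <- data.frame(view_count = round(rlnorm(500, meanlog = 13, sdlog = 1.2)))
toy$likes <- round(toy$view_count * runif(500, 0.005, 0.08))

# log1p(x) = log(1 + x), which is defined for zero counts.
p <- ggplot(toy, aes(x = log1p(view_count), y = log1p(likes))) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", formula = y ~ x, colour = "red") +
  labs(x = "log(1 + view count)", y = "log(1 + likes)",
       title = "Likes vs View Count (log scale)")

# The slope describes how likes scale with views in relative terms.
fit <- lm(log1p(likes) ~ log1p(view_count), data = toy)
print(coef(fit))
```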

Conclusion

Based on the produced plots, here are some conclusions we can make about the trends in the YouTube JP Trends dataset:

Strong Correlation between Views and Likes: The scatter plot of likes vs view count, together with the correlation plot, indicates a strong positive correlation between a video's views and its likes. Videos that attract many views also tend to collect many likes, which suggests that popular videos are generally well received.
Moderate Correlation between Views and Dislikes: The scatter plot of dislikes vs view count, together with the correlation plot, shows a weaker but still noticeable positive correlation, so videos with more views also tend to receive more dislikes. The weaker relationship may indicate that the response to popular videos is mostly positive. Note that dislikes are recorded as 0 in more than half of the rows (the median is 0), so this relationship should be read with caution.
Comments and Views Relationship: The scatter plot of comments vs view count shows a noticeable positive correlation: videos with more views tend to have more comments, likely because a larger audience produces more discussion.
Most Trending Channels: The bar plot of the top ten channels by number of trending entries shows which channels appear on the trending list most often. These channels may understand their audience well, have a robust marketing strategy, or benefit from other factors that help their videos trend.
Most Trending Categories: The bar plot of the top ten categories by number of trending entries shows which types of content trend most often on YouTube. This may reflect the popularity of the category, the number of channels producing that type of content, or current trends and events related to the category.

These insights can be useful for content creators, marketers, and analysts in understanding the dynamics of user engagement on YouTube and guiding their content creation and marketing strategies. However, it’s important to remember that correlation does not imply causation – while these trends are evident in the data, they do not necessarily mean that one variable directly influences the other. Other underlying factors may be influencing these trends. For a comprehensive understanding, a more detailed analysis could be performed considering additional variables and factors.