In this report, we analyze the trending YouTube videos dataset for Japan. The dataset is updated daily and, in the version used here, covers trending dates from August 2020 to June 2023. It supports tasks such as sentiment analysis, video categorization, and analysis of the factors that influence a video’s popularity.
The data was sourced from YouTube’s API and is loaded into R in a structured format.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
data <- read_csv("inputs/JP_youtube_trending_data.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 206180 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): video_id, title, channelId, channelTitle, tags, thumbnail_link, de...
## dbl (5): categoryId, view_count, likes, dislikes, comment_count
## lgl (2): comments_disabled, ratings_disabled
## dttm (2): publishedAt, trending_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(data)
## cols(
## video_id = col_character(),
## title = col_character(),
## publishedAt = col_datetime(format = ""),
## channelId = col_character(),
## channelTitle = col_character(),
## categoryId = col_double(),
## trending_date = col_datetime(format = ""),
## tags = col_character(),
## view_count = col_double(),
## likes = col_double(),
## dislikes = col_double(),
## comment_count = col_double(),
## thumbnail_link = col_character(),
## comments_disabled = col_logical(),
## ratings_disabled = col_logical(),
## description = col_character()
## )
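The parsing warning above is worth following up before any analysis. A minimal sketch of how that could look, using a small inline CSV in place of the real file (the toy rows and the malformed value are assumptions made for illustration): declare the column types explicitly, then inspect `problems()` to see which cells failed to parse.

```r
library(readr)

# Toy CSV standing in for the real file (hypothetical rows);
# the second data row has a malformed view_count.
csv_text <- I("video_id,view_count,comments_disabled\nabc,100,FALSE\ndef,not_a_number,TRUE\n")

# Declaring column types up front makes parsing failures explicit
# instead of relying on type guessing.
toy <- read_csv(
  csv_text,
  col_types = cols(
    video_id = col_character(),
    view_count = col_double(),
    comments_disabled = col_logical()
  )
)

# problems() lists the cells that could not be parsed as the declared type;
# those cells are stored as NA in the result.
parse_issues <- problems(toy)
print(parse_issues)
print(toy)
```

Applied to the real data, `problems(data)` would show whether the affected cells matter for the columns used later in the report.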
The dataset comprises various attributes of YouTube videos, such as title, channel title, publication date, trending date, views, likes, dislikes, and comment counts. The goal of our analysis is to understand patterns and trends in the data and to identify factors associated with a video’s popularity, measured by views.
library(DT)
library(shiny)
##
## Attaching package: 'shiny'
## The following objects are masked from 'package:DT':
##
## dataTableOutput, renderDataTable
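# Define the UI. DT's functions are called explicitly because shiny masks
# DT's dataTableOutput and renderDataTable (see the message above).
ui <- fluidPage(
  DT::DTOutput("myTable")
)
# Define the server. server = TRUE asks DT to process the table server-side,
# which suits a table of this size.
server <- function(input, output) {
  output$myTable <- DT::renderDT(
    data,
    server = TRUE
  )
}
# Run the Shiny app
shinyApp(ui, server)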
glimpse(data)
## Rows: 206,180
## Columns: 16
## $ video_id <chr> "UYXa8R9vvzA", "02MaoZ5n-uM", "ucDDYszgj5c", "M9Pmf9…
## $ title <chr> "皆からの色々な質問に何も隠さず答える!びっくりさせ…
## $ publishedAt <dttm> 2020-08-11 10:00:06, 2020-08-11 13:36:28, 2020-08-1…
## $ channelId <chr> "UCZCzstgLGQdK8GSztJHh0-w", "UC0v-pxTo1XamIDE-f__Ad0…
## $ channelTitle <chr> "タナカガ", "(パーソル パ・リーグTV公式)PacificLeagu…
## $ categoryId <dbl> 22, 17, 23, 20, 1, 26, 10, 22, 10, 20, 24, 20, 10, 2…
## $ trending_date <dttm> 2020-08-12, 2020-08-12, 2020-08-12, 2020-08-12, 202…
## $ tags <chr> "[None]", "パーソルパリーグTV|パリーグTV|パシフィッ…
## $ view_count <dbl> 778499, 1161952, 1980557, 2381688, 442524, 431031, 6…
## $ likes <dbl> 34811, 18514, 63961, 146742, 14388, 6096, 714306, 86…
## $ dislikes <dbl> 667, 259, 692, 2794, 73, 123, 15176, 134, 572, 163, …
## $ comment_count <dbl> 3939, 4115, 6216, 16557, 1420, 607, 31040, 1781, 826…
## $ thumbnail_link <chr> "https://i.ytimg.com/vi/UYXa8R9vvzA/default.jpg", "h…
## $ comments_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ ratings_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ description <chr> "登録者数30万人ありがとうございます!!!ご機嫌よう…
summary(data)
## video_id title publishedAt
## Length:206180 Length:206180 Min. :2020-08-04 12:45:01.00
## Class :character Class :character 1st Qu.:2021-04-25 15:00:16.00
## Mode :character Mode :character Median :2021-12-31 08:20:51.00
## Mean :2022-01-01 22:34:53.60
## 3rd Qu.:2022-09-15 09:00:07.00
## Max. :2023-06-03 23:15:00.00
## channelId channelTitle categoryId
## Length:206180 Length:206180 Min. : 1.00
## Class :character Class :character 1st Qu.:17.00
## Mode :character Mode :character Median :22.00
## Mean :19.22
## 3rd Qu.:24.00
## Max. :29.00
## trending_date tags view_count
## Min. :2020-08-12 00:00:00.00 Length:206180 Min. : 0
## 1st Qu.:2021-04-30 00:00:00.00 Class :character 1st Qu.: 294203
## Median :2022-01-05 00:00:00.00 Mode :character Median : 507657
## Mean :2022-01-06 15:17:02.69 Mean : 1295259
## 3rd Qu.:2022-09-20 00:00:00.00 3rd Qu.: 1002946
## Max. :2023-06-04 00:00:00.00 Max. :289350312
## likes dislikes comment_count thumbnail_link
## Min. : 0 Min. : 0 Min. : 0 Length:206180
## 1st Qu.: 4997 1st Qu.: 0 1st Qu.: 386 Class :character
## Median : 10955 Median : 0 Median : 845 Mode :character
## Mean : 56577 Mean : 526 Mean : 6002
## 3rd Qu.: 26244 3rd Qu.: 200 3rd Qu.: 1951
## Max. :16369715 Max. :879359 Max. :6889393
## comments_disabled ratings_disabled description
## Mode :logical Mode :logical Length:206180
## FALSE:198287 FALSE:190917 Class :character
## TRUE :7893 TRUE :15263 Mode :character
##
##
##
Data cleansing identifies and corrects errors, inconsistencies, and inaccuracies so that later analysis rests on accurate, complete data. Typical tasks include handling missing values, removing duplicates, standardizing data formats, and validating data against predefined rules. For the JP_youtube_trending dataset, the steps below replace missing numeric values with 0 and missing text values with empty strings, confirm that no NA values remain, parse the date columns, and derive a like-to-dislike ratio.
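# Replace missing values column by column: 0 for numeric columns,
# "" for character columns; other types (logical, date-time) are returned unchanged.
replace_na <- function(x) {
  if(is.numeric(x)) {
    return(ifelse(is.na(x), 0, x))
  } else if(is.character(x)) {
    return(ifelse(is.na(x), "", x))
  } else {
    return(x)
  }
}
# lapply() returns a list of columns, so convert back to a data frame
data <- lapply(data, replace_na)
data <- as.data.frame(data)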
glimpse(data)
## Rows: 206,180
## Columns: 16
## $ video_id <chr> "UYXa8R9vvzA", "02MaoZ5n-uM", "ucDDYszgj5c", "M9Pmf9…
## $ title <chr> "皆からの色々な質問に何も隠さず答える!びっくりさせ…
## $ publishedAt <dttm> 2020-08-11 10:00:06, 2020-08-11 13:36:28, 2020-08-1…
## $ channelId <chr> "UCZCzstgLGQdK8GSztJHh0-w", "UC0v-pxTo1XamIDE-f__Ad0…
## $ channelTitle <chr> "タナカガ", "(パーソル パ・リーグTV公式)PacificLeagu…
## $ categoryId <dbl> 22, 17, 23, 20, 1, 26, 10, 22, 10, 20, 24, 20, 10, 2…
## $ trending_date <dttm> 2020-08-12, 2020-08-12, 2020-08-12, 2020-08-12, 202…
## $ tags <chr> "[None]", "パーソルパリーグTV|パリーグTV|パシフィッ…
## $ view_count <dbl> 778499, 1161952, 1980557, 2381688, 442524, 431031, 6…
## $ likes <dbl> 34811, 18514, 63961, 146742, 14388, 6096, 714306, 86…
## $ dislikes <dbl> 667, 259, 692, 2794, 73, 123, 15176, 134, 572, 163, …
## $ comment_count <dbl> 3939, 4115, 6216, 16557, 1420, 607, 31040, 1781, 826…
## $ thumbnail_link <chr> "https://i.ytimg.com/vi/UYXa8R9vvzA/default.jpg", "h…
## $ comments_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ ratings_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ description <chr> "登録者数30万人ありがとうございます!!!ご機嫌よう…
na_values <- is.na(data)
num_na_values <- sum(na_values)
num_na_values_by_column <- colSums(na_values)
print(num_na_values)
## [1] 0
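The same replacement can be sketched with `dplyr::across()`, which keeps the result a tibble and avoids the round trip through a list. The toy data below is an assumption made for illustration. Printing the per-column NA counts, rather than only the total, also shows which column types the replacement leaves untouched.

```r
library(dplyr)

# Toy data with one missing value per column (hypothetical values)
toy <- tibble(
  view_count = c(10, NA, 30),
  title = c("a", NA, "c"),
  comments_disabled = c(FALSE, NA, TRUE)
)

# Replace NA with 0 in numeric columns and "" in character columns;
# logical columns are deliberately left as they are.
cleaned <- toy %>%
  mutate(
    across(where(is.numeric), ~ ifelse(is.na(.x), 0, .x)),
    across(where(is.character), ~ ifelse(is.na(.x), "", .x))
  )

# Per-column NA counts show what is still missing after cleaning
na_by_column <- colSums(is.na(cleaned))
print(na_by_column)
```

In the toy data the logical column still holds one NA, whereas the report's total of 0 shows that the real logical and date-time columns had no missing values to begin with.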
# Parse the timestamps; trending_date carries no time of day, so it is stored as a Date
data <- data %>%
  mutate(publishedAt = parse_date_time(publishedAt, orders = c("ymd_HMS", "ymd_HM", "ymd")),
         trending_date = ymd(trending_date))
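To illustrate what the `orders` argument does, here is a small sketch on hypothetical timestamps in the three layouts named in the call above. It also derives the number of days between publication and trending. The `days_to_trend` column is an assumption added for illustration and is not part of the original analysis.

```r
library(dplyr)
library(lubridate)

# Hypothetical timestamps, one per layout listed in `orders`
toy <- tibble(
  publishedAt = c("2020-08-11 10:00:06", "2020-08-11 13:36", "2020-08-10"),
  trending_date = c("2020-08-12", "2020-08-12", "2020-08-13")
)

parsed <- toy %>%
  mutate(
    # Each value is parsed with whichever of the listed orders matches it
    publishedAt = parse_date_time(publishedAt,
                                  orders = c("ymd_HMS", "ymd_HM", "ymd"),
                                  tz = "UTC"),
    trending_date = ymd(trending_date),
    # Whole days between the publication date and the trending date
    days_to_trend = as.numeric(trending_date - as_date(publishedAt))
  )

print(parsed)
```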
# Likes per dislike; NA where dislikes is 0 to avoid dividing by zero
data <- data %>%
  mutate(like_dislike_ratio = ifelse(dislikes == 0, NA, likes/dislikes))
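Because the ratio is undefined whenever `dislikes` is 0, many rows end up as NA. As a hedged alternative, the sketch below computes the share of likes among all ratings, which stays defined as long as a video has at least one rating. The toy values and the `like_share` column are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)

# Toy ratings (hypothetical values): a normal video, one with no dislikes,
# and one with no ratings at all
toy <- tibble(likes = c(100, 50, 0), dislikes = c(4, 0, 0))

toy <- toy %>%
  mutate(
    # Same definition as in the report: NA when there are no dislikes
    like_dislike_ratio = if_else(dislikes == 0, NA_real_, likes / dislikes),
    # Alternative: share of likes among all ratings, NA only when there are none
    like_share = if_else(likes + dislikes == 0, NA_real_, likes / (likes + dislikes))
  )

print(toy)
```

Whether a share or a ratio is the better measure depends on how the zero dislike counts in the data should be interpreted.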
Next, we explore the data: basic summary metrics, the distributions of key features, and the relationships between variables.
glimpse(data)
## Rows: 206,180
## Columns: 17
## $ video_id <chr> "UYXa8R9vvzA", "02MaoZ5n-uM", "ucDDYszgj5c", "M9Pmf…
## $ title <chr> "皆からの色々な質問に何も隠さず答える!びっくりさせ…
## $ publishedAt <dttm> 2020-08-11 10:00:06, 2020-08-11 13:36:28, 2020-08-…
## $ channelId <chr> "UCZCzstgLGQdK8GSztJHh0-w", "UC0v-pxTo1XamIDE-f__Ad…
## $ channelTitle <chr> "タナカガ", "(パーソル パ・リーグTV公式)PacificLeag…
## $ categoryId <dbl> 22, 17, 23, 20, 1, 26, 10, 22, 10, 20, 24, 20, 10, …
## $ trending_date <date> 2020-08-12, 2020-08-12, 2020-08-12, 2020-08-12, 20…
## $ tags <chr> "[None]", "パーソルパリーグTV|パリーグTV|パシフィッ…
## $ view_count <dbl> 778499, 1161952, 1980557, 2381688, 442524, 431031, …
## $ likes <dbl> 34811, 18514, 63961, 146742, 14388, 6096, 714306, 8…
## $ dislikes <dbl> 667, 259, 692, 2794, 73, 123, 15176, 134, 572, 163,…
## $ comment_count <dbl> 3939, 4115, 6216, 16557, 1420, 607, 31040, 1781, 82…
## $ thumbnail_link <chr> "https://i.ytimg.com/vi/UYXa8R9vvzA/default.jpg", "…
## $ comments_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ ratings_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ description <chr> "登録者数30万人ありがとうございます!!!ご機嫌よう…
## $ like_dislike_ratio <dbl> 52.190405, 71.482625, 92.429191, 52.520401, 197.095…
summary(select_if(data, is.numeric))
## categoryId view_count likes dislikes
## Min. : 1.00 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:17.00 1st Qu.: 294203 1st Qu.: 4997 1st Qu.: 0
## Median :22.00 Median : 507657 Median : 10955 Median : 0
## Mean :19.22 Mean : 1295259 Mean : 56577 Mean : 526
## 3rd Qu.:24.00 3rd Qu.: 1002946 3rd Qu.: 26244 3rd Qu.: 200
## Max. :29.00 Max. :289350312 Max. :16369715 Max. :879359
##
## comment_count like_dislike_ratio
## Min. : 0 Min. : 0.04
## 1st Qu.: 386 1st Qu.: 25.22
## Median : 845 Median : 46.49
## Mean : 6002 Mean : 82.32
## 3rd Qu.: 1951 3rd Qu.: 95.79
## Max. :6889393 Max. :2625.00
## NA's :114319
library(ggplot2)
top_channels <- data %>%
  group_by(channelTitle) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(10)
ggplot(top_channels, aes(x=reorder(channelTitle, n), y=n)) +
  geom_bar(stat="identity") +
  coord_flip() +
  labs(title="Top 10 channels with most trending videos", x="Channel", y="Number of trending videos")
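The count above is a count of rows. If the same video can appear on several trending dates, which is an assumption to verify against `video_id` in the real data, a channel's row count mixes the number of videos with the number of days they stayed trending. A minimal sketch on toy data that separates the two:

```r
library(dplyr)

# Toy trending log (hypothetical): v1 trends on two days, v3 on three
toy <- tibble(
  video_id = c("v1", "v1", "v2", "v3", "v3", "v3"),
  channelTitle = c("A", "A", "A", "B", "B", "B")
)

channel_counts <- toy %>%
  group_by(channelTitle) %>%
  summarise(
    trending_rows = n(),                    # rows, i.e. video-days
    distinct_videos = n_distinct(video_id)  # unique videos
  ) %>%
  arrange(desc(distinct_videos))

print(channel_counts)
```

Here both channels have three rows, but channel A has two distinct videos and channel B only one.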
top_categories <- data %>%
  group_by(categoryId) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(10)
ggplot(top_categories, aes(x=reorder(as.character(categoryId), n), y=n)) +
  geom_bar(stat="identity") +
  labs(title="Top 10 categories with most trending videos", x="Category", y="Number of trending videos")
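The category plot labels its bars with numeric IDs, which are hard to read. The sketch below attaches names through a lookup vector. The ID-to-name mapping shown is an assumption based on YouTube's commonly documented category IDs and should be checked against the category file that accompanies the dataset. Unmapped IDs fall back to "Unknown".

```r
library(dplyr)

# Assumed mapping for a subset of category IDs; verify before use
category_names <- c(
  "1" = "Film & Animation",
  "10" = "Music",
  "17" = "Sports",
  "20" = "Gaming",
  "22" = "People & Blogs",
  "23" = "Comedy",
  "24" = "Entertainment",
  "26" = "Howto & Style"
)

# Toy category IDs (hypothetical), including one that is not in the mapping
toy <- tibble(categoryId = c(10, 24, 24, 17, 99))

labelled <- toy %>%
  mutate(
    # Look up by name; IDs missing from the mapping yield NA, replaced by "Unknown"
    category = coalesce(unname(category_names[as.character(categoryId)]), "Unknown")
  )

print(count(labelled, category, sort = TRUE))
```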
library(corrplot)
## corrplot 0.92 loaded
numeric_data <- select_if(data, is.numeric)
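# Keep only columns that are fully finite and non-constant; this drops like_dislike_ratio, which contains NAs
numeric_data <- numeric_data[ , sapply(numeric_data, function(x) all(is.finite(x)) && sd(x) != 0)]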
correlation_matrix <- cor(numeric_data, use="complete.obs")
corrplot(correlation_matrix, type="upper", order="hclust", tl.col="black", tl.srt=45)
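`cor()` computes Pearson correlations by default, and these can be dominated by a handful of extreme values in heavy-tailed counts such as views and likes. As a hedged cross-check, the sketch below compares Pearson, Spearman (rank-based), and Pearson on `log1p`-transformed values. The data is simulated purely for illustration and does not reproduce the report's numbers.

```r
# Simulated heavy-tailed counts (illustrative only, not the real data)
set.seed(1)
views <- rlnorm(200, meanlog = 12, sdlog = 1.5)
likes <- views * runif(200, min = 0.01, max = 0.1)

# Pearson on raw counts: sensitive to the few largest videos
pearson <- cor(views, likes)
# Spearman: uses ranks, so extreme values carry no extra weight
spearman <- cor(views, likes, method = "spearman")
# Pearson after log1p: measures linear association on a log scale
log_pearson <- cor(log1p(views), log1p(likes))

print(c(pearson = pearson, spearman = spearman, log_pearson = log_pearson))
```

If the three measures disagree strongly on the real data, the raw Pearson value is probably being driven by a small number of very popular videos.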
Data analysis examines the dataset to uncover patterns and extract insights that support decision-making. Here the focus is user engagement on YouTube in Japan: how views, likes, dislikes, and comments relate to one another and how they vary by category. Content creators, marketers, and analysts can use such findings to guide content and marketing strategies. This section relies on grouped summary statistics and scatter plots with fitted linear trends; predictive modeling and machine learning are possible extensions but are not applied here.
avg_by_category <- data %>%
  group_by(categoryId) %>%
  summarise(
    avg_views = mean(view_count),
    avg_likes = mean(likes),
    avg_dislikes = mean(dislikes),
    avg_comment = mean(comment_count)
  )
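The per-category averages are computed above but not displayed. Since the overall mean view count (1,295,259) is far above the median (507,657), means are likely pulled up by a few very large videos. A minimal sketch on toy data that reports the median alongside the mean and sorts the result (the toy values and the `median_views` and `n_rows` columns are assumptions for illustration):

```r
library(dplyr)

# Toy data (hypothetical): category 10 contains one extreme video
toy <- tibble(
  categoryId = c(10, 10, 10, 24, 24),
  view_count = c(100, 200, 9000, 300, 500)
)

by_category <- toy %>%
  group_by(categoryId) %>%
  summarise(
    avg_views = mean(view_count),       # sensitive to the extreme video
    median_views = median(view_count),  # robust to it
    n_rows = n()
  ) %>%
  arrange(desc(median_views))

print(by_category)
```

Category 10 has the higher mean (3100 against 400) but the lower median (200 against 400), so the ranking depends on which statistic is used.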
ggplot(data, aes(x=view_count, y=likes)) + geom_point() + geom_smooth(method=lm, col="red") + labs(x="View Count", y="Likes", title="Likes vs View Count")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(data, aes(x=view_count, y=dislikes)) + geom_point() + geom_smooth(method=lm, col="red") + labs(x="View Count", y="Dislikes", title="Dislikes vs View Count")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(data, aes(x=view_count, y=comment_count)) + geom_point() + geom_smooth(method=lm, col="red") + labs(x="View Count", y="Comments", title="Comments vs View Count")
## `geom_smooth()` using formula = 'y ~ x'
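With a maximum of about 289 million views against a median of about 508 thousand, most points in the raw-scale scatter plots are squeezed into one corner. One way to make the bulk of the data visible, sketched here on simulated counts (illustrative only), is to plot `log1p`-transformed values with semi-transparent points. `log1p` is used rather than a plain logarithm so that zero counts stay finite.

```r
library(ggplot2)

# Simulated heavy-tailed counts (illustrative only, not the real data)
set.seed(42)
toy <- data.frame(view_count = round(rlnorm(500, meanlog = 12, sdlog = 1.5)))
toy$likes <- round(toy$view_count * runif(500, min = 0.01, max = 0.1))

# log1p keeps zeros finite; alpha reveals dense regions instead of overplotting
p <- ggplot(toy, aes(x = log1p(view_count), y = log1p(likes))) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", formula = y ~ x, colour = "red") +
  labs(x = "log(1 + View Count)", y = "log(1 + Likes)",
       title = "Likes vs View Count (log scale)")

print(p)
```

The fitted line then describes a relationship on the log scale, so its slope is not directly comparable with the slopes in the raw-scale plots.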
Based on the plots above, we can draw the following conclusions about the YouTube JP trending dataset:
Strong Correlation between Views and Likes: The scatter plot of likes against view count, together with the correlation plot, indicates a strong positive correlation between views and likes: videos with more views tend to have more likes. This suggests that popular videos are also generally well received by their audience.
Moderate Correlation between Views and Dislikes: The scatter plot of dislikes against view count, together with the correlation plot, shows a weaker but still noticeable positive correlation, so videos with more views also tend to get more dislikes. The weaker relationship might indicate that the response to popular videos is mostly positive. This reading needs caution, however: the median dislike count in the data is 0, and like_dislike_ratio is NA for 114,319 of the 206,180 rows because dislikes is 0 there.
Comments and Views Relationship: The scatter plot of comments vs view count shows a noticeable positive correlation, indicating that videos with higher views tend to have more comments. This is likely because videos that attract a larger audience naturally lead to more discussion and interaction in the form of comments.
Most Trending Channels: The bar plot of the top ten channels shows which channels appear most often in the trending data. These channels may have a good understanding of their audience, a robust marketing strategy, or other advantages that help their videos trend. Note that the counts are rows in the dataset, and a video that trends on several days may appear more than once, so the ranking may reflect how long videos stay on the trending list as well as how many distinct videos trend.
Most Trending Categories: The bar plot of the top ten categories gives insight into which types of content are most likely to trend on YouTube. This could be influenced by a range of factors, including the popularity of the category, the number of channels producing that type of content, or current trends and events related to the category. The plot labels categories by numeric categoryId, so the IDs need to be mapped to category names before the result can be interpreted.
These insights can be useful for content creators, marketers, and analysts in understanding the dynamics of user engagement on YouTube and guiding their content creation and marketing strategies. However, it’s important to remember that correlation does not imply causation – while these trends are evident in the data, they do not necessarily mean that one variable directly influences the other. Other underlying factors may be influencing these trends. For a comprehensive understanding, a more detailed analysis could be performed considering additional variables and factors.