Importing libraries to run

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.2.3
library(broom)
## Warning: package 'broom' was built under R version 4.2.3
library(lindia)
## Warning: package 'lindia' was built under R version 4.2.3
library(car)
## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.3
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
library(dplyr)
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.2.3
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

Project Goal:

Empower aspiring YouTubers with a holistic and actionable roadmap to success by providing comprehensive guidance, insights, and strategies. The primary objective is to equip content creators with the knowledge and tools necessary for building and sustaining a thriving YouTube channel. This includes not only technical aspects like content creation and optimization but also the development of a strategic mindset that fosters long-term success in the dynamic landscape of online content creation.

Purpose of the Project:

This project is driven by a commitment to demystify the journey to YouTube success, recognizing the multifaceted nature of achieving and maintaining a prosperous channel. Serving as a valuable and ongoing resource for emerging content creators, the project aims to go beyond superficial advice by offering in-depth knowledge. The purpose is not just to educate creators on the basics but to inspire them with real-world success stories, guide them through practical tips, and keep them constantly updated on the latest trends, algorithm changes, and industry best practices.

Dataset Used:

Dataset Overview: The “Global YouTube Statistics” dataset provides comprehensive information about various YouTube channels worldwide, encompassing key metrics such as subscriber counts, video views, upload frequency, country of origin, earnings, and more.

First, we set the working directory and load the data.

knitr::opts_knit$set(root.dir = 'C:/Users/Prana/OneDrive/Documents/Topics in Info FA23(Grad)')
youtube <- read_delim("./Global Youtube Statistics.csv", delim = ",")
## Rows: 995 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Youtuber, category, Title, Country, Abbreviation, channel_type, cr...
## dbl (21): rank, subscribers, video views, uploads, video_views_rank, country...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Link: https://www.kaggle.com/datasets/nelgiriyewithana/global-youtube-statistics-2023

Based on the project goal and purpose, we can come up with some novel questions to investigate. Some of them include:

1. Do YouTubers from certain countries tend to have more subscribers or higher video views on average?

This can be analyzed with box plots of video views by country and of subscribers by country, since each compares a continuous variable across the levels of a categorical one. Because the dataset contains too many countries to plot individually, we keep ten of them and consolidate the rest into “Other”.

# Create a new variable 'GroupedCountry' to consolidate countries
youtube$GroupedCountry <- ifelse(youtube$Country %in% c("India", "United States", "Pakistan", "South Korea", "Argentina", "Thailand", "Russia", "United Kingdom", "Brazil", "Japan"), youtube$Country, "Other")

# Video Views
ggplot(data = youtube) + 
  geom_boxplot(mapping = aes(x = GroupedCountry, y = `video views`)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Boxplot of Video Views by Grouped Country")

# Subscribers
ggplot(data = youtube) + 
  geom_boxplot(mapping = aes(x = GroupedCountry, y = subscribers)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Boxplot of Subscribers by Grouped Country")

The box plots highlight potential success patterns for YouTube channels in different countries. Channels in India and the United States show a higher likelihood of significant success, with numerous outliers indicating exceptional subscribers and video views. Meanwhile, Pakistan has the highest median, suggesting a solid chance for decent success.
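
The medians read off the box plots can also be tabulated directly. The following is a minimal sketch, assuming a small made-up data frame `toy` in place of the real dataset; the values are illustrative and are not taken from the Kaggle file.

```r
library(dplyr)

# Hypothetical toy data standing in for the real `youtube` tibble
toy <- tibble::tibble(
  GroupedCountry = c("India", "India", "India", "Other", "Other", "Pakistan"),
  subscribers    = c(30e6, 20e6, 100e6, 15e6, 25e6, 40e6),
  `video views`  = c(10e9, 8e9, 50e9, 5e9, 7e9, 20e9)
)

# Median and channel count per group, highest median subscribers first
country_summary <- toy |>
  group_by(GroupedCountry) |>
  summarize(
    n_channels         = n(),
    median_subscribers = median(subscribers),
    median_views       = median(`video views`)
  ) |>
  arrange(desc(median_subscribers))

country_summary
```

Reporting `n_channels` next to each median helps judge how much weight a group deserves: a high median based on only a handful of channels is less reliable than one based on hundreds.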

2. Do you need subscribers to get more video views or vice-versa?

Since a few YouTube channels have 0 video views (these channels belong to YouTube itself and don’t post anything), we remove them so that they don’t distort our observations.

youtube <- youtube |>
  filter(`video views` != 0)

To answer this question, we create a scatter plot of the number of subscribers (x axis) against the number of video views (y axis) for each YouTube channel, with a fitted linear regression line. We then build a linear regression model (model) with the number of subscribers as the response variable and the number of video views as the predictor variable.

youtube |>
  ggplot(mapping = aes(x = subscribers, y = `video views`)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Build the linear regression model
model <- lm(subscribers ~ `video views`, data = youtube)
# Summarize the model
summary(model)
## 
## Call:
## lm(formula = subscribers ~ `video views`, data = youtube)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -41617692  -4370437  -1239051   2698367 126872192 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.193e+07  3.773e+05   31.62   <2e-16 ***
## `video views` 9.586e-04  2.098e-05   45.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9311000 on 985 degrees of freedom
## Multiple R-squared:  0.6794, Adjusted R-squared:  0.6791 
## F-statistic:  2087 on 1 and 985 DF,  p-value: < 2.2e-16

The multiple R-squared value (0.6794) is the proportion of variability in subscribers explained by the linear relationship with video views: approximately 67.94%. Subscribers and video views are therefore strongly associated, and channels with more of one tend to have more of the other. The regression alone cannot tell us which drives which, so the “vice versa” in the question remains open.
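
One way to see why a single regression cannot settle the direction: in simple linear regression, R-squared equals the squared Pearson correlation, so it is the same whichever variable is treated as the response. The sketch below uses synthetic data (an assumption, not the YouTube dataset) to show this.

```r
# Synthetic data (assumption): views loosely proportional to subscribers
set.seed(1)
subs  <- runif(200, 1e7, 1e8)
views <- 500 * subs + rnorm(200, sd = 1e10)

# Fit the regression in both directions
r2_subs_on_views <- summary(lm(subs ~ views))$r.squared
r2_views_on_subs <- summary(lm(views ~ subs))$r.squared

# Squared Pearson correlation of the same two variables
r_squared_from_cor <- cor(subs, views)^2

# All three values agree up to floating-point error
c(r2_subs_on_views, r2_views_on_subs, r_squared_from_cor)
```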

3. Do YouTubers who upload more earn more?

We first compute the average yearly earnings as the midpoint of the columns ‘highest_yearly_earnings’ and ‘lowest_yearly_earnings’ and add it to the dataset with mutate(). We then create a line plot of average yearly earnings against the number of uploads.

youtube |>
  mutate(avg_yr_earn=(highest_yearly_earnings+lowest_yearly_earnings)/2) |>
  ggplot() +
  geom_line(mapping = aes(x = uploads,y = avg_yr_earn)) 

The line plot shows that there’s no clear link between how many videos a YouTuber uploads and how much money they make on average. So, it seems like making more videos doesn’t necessarily mean you’ll earn more.
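
A rank correlation can put a number on the visual impression from the plot. The sketch below is an assumption-laden illustration on a made-up `toy` data frame whose column names mirror the dataset's; the value it produces says nothing about the real data.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the dataset's
toy <- tibble::tibble(
  uploads                 = c(10, 200, 1500, 30000, 500, 80),
  lowest_yearly_earnings  = c(1e5, 5e4, 2e5, 1e5, 3e5, 4e4),
  highest_yearly_earnings = c(2e6, 8e5, 3e6, 1.5e6, 5e6, 6e5)
) |>
  mutate(avg_yr_earn = (highest_yearly_earnings + lowest_yearly_earnings) / 2)

# Spearman's rho is rank-based, so a few extreme channels do not dominate it
rho <- cor(toy$uploads, toy$avg_yr_earn, method = "spearman")
rho
```

A value of `rho` near 0 would support the "no clear link" reading, while a value near 1 or -1 would contradict it. Because the channels have no natural ordering by uploads, a scatter plot (`geom_point()`) is usually easier to read than `geom_line()` for this comparison.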

4. Does the age of the YouTube channel contribute to the success of the channel?

We explore the relationship between the age of YouTube channels and their success, measured by the number of subscribers. First, the created_year variable is converted to a Date format. Rows that share a creation year are then detected as duplicates and collapsed by averaging their subscriber counts, so that each creation year corresponds to a single data point. After converting the data to a time series with the tsibble package, a line plot shows average subscribers by channel creation year, which lets us examine whether channel age contributes to subscriber counts.

#Since there is a Youtube channel with created_year of 1970, we need to remove that to prevent inaccurate readings for our data dive.
youtube <- youtube |>
 filter(created_year != 1970)

youtube <- youtube |>
  mutate(created_year = as.Date(as.character(created_year), format = "%Y"))
# Filter out rows with missing values in created_year
youtube_ <- youtube %>%
  dplyr::select(created_year, subscribers) %>%
  filter(!is.na(created_year)) %>%
  distinct()

# Check for duplicate rows
duplicates <- youtube_ %>% duplicates()
## Using `created_year` as index variable.
# Print duplicates if any
if (nrow(duplicates) > 0) {
  print("Duplicate rows:")
  print(duplicates)
  
  # Remove duplicate rows using group_by() and summarize()
  youtube_ <- youtube_ %>%
    group_by(created_year) %>%
    summarize(subscribers = mean(subscribers, na.rm = TRUE))

  # Check for duplicate rows again
  duplicates <- youtube_ %>% duplicates()
  
  # Print a message if duplicates are still present
  if (nrow(duplicates) > 0) {
    print("Duplicate rows still exist after removal:")
    print(duplicates)
    stop("Duplicate rows still exist after removal.")
  }
}
## [1] "Duplicate rows:"
## # A tibble: 809 × 2
##    created_year subscribers
##    <date>             <dbl>
##  1 2006-12-04     245000000
##  2 2012-12-04     166000000
##  3 2006-12-04     162000000
##  4 2006-12-04     159000000
##  5 2015-12-04     112000000
##  6 2010-12-04     111000000
##  7 2016-12-04     106000000
##  8 2018-12-04      98900000
##  9 2014-12-04      96700000
## 10 2007-12-04      96000000
## # ℹ 799 more rows
## Using `created_year` as index variable.
# Create the tsibble
youtube_tsibble <- as_tsibble(youtube_, index = created_year)
# Plot the entire time series
ggplot(youtube_tsibble, aes(x = created_year, y = subscribers)) +
  geom_line() +
  labs(title = "YouTube Subscribers Over Time",
       x = "Year",
       y = "Subscribers")

The graph shows that older YouTube channels generally have more subscribers. This makes sense because channels that have been around longer have had more time to make videos and attract viewers who subscribe. The slight decrease in average subscribers over the years might be due to newer channels joining, spreading subscribers across a larger number of channels. Overall, it suggests that sticking around on YouTube and making content consistently can lead to more subscribers over time. This is good news for new YouTubers, as it shows that building a channel takes time, and patience and persistence can pay off in the long run.
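
The dates in the printed output all fall on the same month and day (for example 2006-12-04), which appears to be because `as.Date(..., format = "%Y")` fills in the unspecified month and day from the current date. A minimal sketch of an alternative, assuming a made-up `toy` data frame, pins each year to January 1 and averages subscribers per creation year in one step:

```r
library(dplyr)

# Hypothetical toy data: one row per channel
toy <- tibble::tibble(
  created_year = c(2006, 2006, 2012, 2015, 2015, 2015),
  subscribers  = c(245e6, 162e6, 166e6, 112e6, 50e6, 30e6)
)

yearly <- toy |>
  # Build an explicit January 1 date instead of relying on format = "%Y"
  mutate(created_year = as.Date(paste0(created_year, "-01-01"))) |>
  group_by(created_year) |>
  summarize(
    n_channels       = n(),
    mean_subscribers = mean(subscribers)
  )

yearly
```

Keeping `n_channels` alongside the mean shows how many channels each yearly average rests on, which matters when comparing early years with few channels to later years with many.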

5. Which channel types engage urban audiences?

For each channel type, we calculate the probability that a channel has video views greater than the mean video views of that channel type, restricted to channels whose country has an urban population ratio greater than 0.85. Note that the mean is computed inside summarize(), after grouping and filtering, so it is the mean of each filtered group rather than of the whole dataset. We then create a horizontal bar graph to visualize the probability for each channel type.

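# Share of channels per type whose views exceed that type's mean,
# restricted to countries with an urban population ratio above 0.85
gp5 <- youtube |>
  group_by(channel_type) |>
  mutate(urban_ratio = Urban_population / Population) |>
  filter(urban_ratio > 0.85) |>
  # mean(`video views`) is evaluated within each channel_type group
  summarize(probability = sum(`video views` > mean(`video views`)) / n())
q5 <- data.frame(gp5)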

ggplot(q5, aes(x = probability, y = channel_type)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "Probability", y = "Channel Type") +
  theme_minimal()

Channel types like News and Nonprofit have probabilities of 0, meaning that within the selected subset of channels with high urban population ratios (urban_ratio > 0.85), none of their channels has video views above the mean. Meanwhile, channel types such as Comedy and Games have higher probabilities, suggesting that they engage viewers better where the urban population ratio is high.
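
In a grouped `summarize()`, `mean(`video views`)` is evaluated within each channel type. If the comparison is meant to be against the mean across all channels, one option is to compute that mean first and reuse it. The sketch below assumes a made-up `toy` data frame; `overall_mean` and `prob_by_type` are illustrative names, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the dataset's
toy <- tibble::tibble(
  channel_type     = c("Comedy", "Comedy", "Games", "News", "Music"),
  `video views`    = c(9e9, 1e9, 8e9, 2e9, 5e9),
  Urban_population = c(90, 90, 88, 95, 50),
  Population       = c(100, 100, 100, 100, 100)
)

# Mean over every channel, computed before any grouping or filtering
overall_mean <- mean(toy$`video views`)

prob_by_type <- toy |>
  mutate(urban_ratio = Urban_population / Population) |>
  filter(urban_ratio > 0.85) |>
  group_by(channel_type) |>
  summarize(
    n_channels  = n(),
    # Proportion of channels in the group above the overall mean
    probability = mean(`video views` > overall_mean)
  )

prob_by_type
```

The `n_channels` column is worth inspecting in either variant. With a within-group mean, a channel type represented by a single channel always gets a probability of 0, because a value cannot exceed its own mean.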

6. Which category gives a better probability of success?

To answer this question, we calculate, for each category, the proportion of channels with more than 100 million subscribers, which we treat as the probability of success. We then plot these probabilities against category in a scatter plot. We take 100 million subscribers as the threshold for the success of a YouTube channel.

gp6<- youtube |>
  group_by(category)|>
  summarize(probability = sum(subscribers > 100000000) / n())
q6<-data.frame(gp6)
q6
##                 category probability
## 1       Autos & Vehicles 0.000000000
## 2                 Comedy 0.000000000
## 3              Education 0.022222222
## 4          Entertainment 0.004166667
## 5       Film & Animation 0.000000000
## 6                 Gaming 0.010752688
## 7          Howto & Style 0.000000000
## 8                 Movies 0.000000000
## 9                  Music 0.005000000
## 10       News & Politics 0.000000000
## 11 Nonprofits & Activism 0.000000000
## 12        People & Blogs 0.015267176
## 13        Pets & Animals 0.000000000
## 14  Science & Technology 0.000000000
## 15                 Shows 0.076923077
## 16                Sports 0.000000000
## 17              Trailers 0.000000000
## 18       Travel & Events 0.000000000
## 19                   nan 0.000000000
q6|>
  ggplot()+
  geom_point(mapping=aes(x=category,y=probability))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

From the probability values, we conclude that among the 19 categories present, 13 have a probability of 0 of exceeding 100 million subscribers. The category ‘Shows’ has the highest probability, 0.076923077, and all other categories have a probability below 0.03. This shows that channels in most categories do not pass 100 million subscribers, and that ‘Shows’ has the largest share of channels that do.
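
Naming the threshold once and reporting counts next to the proportions makes the calculation easier to check. This is a minimal sketch on a made-up `toy` data frame; `success_threshold` and `success_by_category` are hypothetical names introduced for illustration.

```r
library(dplyr)

# Threshold matching the code above: 100 million subscribers
success_threshold <- 1e8

# Hypothetical toy data standing in for the real dataset
toy <- tibble::tibble(
  category    = c("Shows", "Shows", "Music", "Music", "Music", "Gaming"),
  subscribers = c(150e6, 20e6, 120e6, 30e6, 15e6, 40e6)
)

success_by_category <- toy |>
  group_by(category) |>
  summarize(
    n_channels   = n(),
    n_successful = sum(subscribers > success_threshold),
    probability  = n_successful / n_channels
  ) |>
  arrange(desc(probability))

success_by_category
```

Showing `n_channels` and `n_successful` guards against over-reading a high proportion that comes from a category with very few channels.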

Final Recommendations for Aspiring Youtubers:

By integrating these recommendations, aspiring YouTubers can enhance their chances of building and sustaining a successful YouTube channel in the competitive landscape of online content creation.