Research Question: Among the most-subscribed YouTube channels, is the number of subscribers related to whether the channel broadcasts in English?
Unit of observation: YouTube channel
Variables:
1. Subscribers - ratio (in millions)
2. Language - binary: English or non-English
Source of the dataset: https://www.kaggle.com/datasets/rajkumarpandey02/list-of-most-subscribed-youtube-channels-in-world
mydata <- read.csv("/Users/nadiatwinkle/Bootcamp/List of most-subscribed YouTube channels.csv")
head(mydata)
## X Rank Name Link Brand.channel Subscribers Language
## 1 0 1 T-Series Link Yes 238 Hindi[7][8]
## 2 1 2 Cocomelon Link Yes 155 English
## 3 2 3 Sony Entertainment Television India Link Yes 153 Hindi[9]
## 4 3 4 MrBeast Link No 137 English
## 5 4 5 PewDiePie Link No 111 English
## 6 5 6 Kids Diana Show Link Yes 109 English[10][11][12]
## Category Country
## 1 Music India
## 2 Education United States
## 3 Entertainment India
## 4 Entertainment United States
## 5 Games Sweden
## 6 Entertainment Ukraine
To simplify, I would like to use only the columns Rank, Subscribers, and Language.
selected_columns <- mydata[, c("Rank", "Subscribers", "Language")]
# View the new data frame
print(selected_columns)
## Rank Subscribers Language
## 1 1 238.0 Hindi[7][8]
## 2 2 155.0 English
## 3 3 153.0 Hindi[9]
## 4 4 137.0 English
## 5 5 111.0 English
## 6 6 109.0 English[10][11][12]
## 7 7 105.0 English
## 8 8 94.9 English
## 9 9 93.8 English
## 10 10 93.4 Hindi[13][14]
## 11 11 84.8 Korean
## 12 12 83.3 Hindi
## 13 13 79.2 English
## 14 14 78.2 Hindi
## 15 15 73.9 Korean
## 16 16 71.1 English
## 17 17 69.5 Korean
## 18 18 66.4 Portuguese
## 19 19 66.3 Hindi
## 20 20 66.1 English
## 21 21 64.2 Hindi
## 22 22 63.2 Hindi[16]
## 23 23 60.6 Hindi
## 24 24 59.0 English
## 25 25 58.8 English
## 26 26 58.4 Hindi
## 27 27 57.3 Hindi
## 28 28 56.6 Bhojpuri
## 29 29 56.2 English
## 30 30 56.1 Hindi
## 31 31 56.0 Spanish
## 32 32 55.9 Hindi
## 33 33 55.6 English
## 34 34 53.1 English
## 35 35 53.0 English
## 36 36 52.7 Hindi
## 37 37 52.4 English
## 38 38 51.1 English
## 39 39 50.6 English
## 40 40 49.8 Hindi
## 41 41 47.6 Spanish
## 42 42 47.4 English
## 43 43 46.5 Spanish
## 44 44 45.7 Spanish
## 45 45 45.4 Spanish
## 46 46 45.4 Hindi
## 47 47 45.2 Hindi
## 48 48 45.0 Russian
## 49 49 44.9 Portuguese
## 50 50 44.6 Russian
In the Language column, I recode the values into a binary indicator: 0 if the channel's language is English and 1 otherwise, because the research question asks whether broadcasting in English is related to the number of subscribers.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Replace values in the 'Language' column
selected_columns <- selected_columns %>%
mutate(Language = ifelse(Language == "English", 0, 1))
# View the modified data frame
print(selected_columns)
## Rank Subscribers Language
## 1 1 238.0 1
## 2 2 155.0 0
## 3 3 153.0 1
## 4 4 137.0 0
## 5 5 111.0 0
## 6 6 109.0 1
## 7 7 105.0 0
## 8 8 94.9 0
## 9 9 93.8 0
## 10 10 93.4 1
## 11 11 84.8 1
## 12 12 83.3 1
## 13 13 79.2 0
## 14 14 78.2 1
## 15 15 73.9 1
## 16 16 71.1 0
## 17 17 69.5 1
## 18 18 66.4 1
## 19 19 66.3 1
## 20 20 66.1 0
## 21 21 64.2 1
## 22 22 63.2 1
## 23 23 60.6 1
## 24 24 59.0 0
## 25 25 58.8 0
## 26 26 58.4 1
## 27 27 57.3 1
## 28 28 56.6 1
## 29 29 56.2 0
## 30 30 56.1 1
## 31 31 56.0 1
## 32 32 55.9 1
## 33 33 55.6 0
## 34 34 53.1 0
## 35 35 53.0 0
## 36 36 52.7 1
## 37 37 52.4 0
## 38 38 51.1 0
## 39 39 50.6 0
## 40 40 49.8 1
## 41 41 47.6 1
## 42 42 47.4 0
## 43 43 46.5 1
## 44 44 45.7 1
## 45 45 45.4 1
## 46 46 45.4 1
## 47 47 45.2 1
## 48 48 45.0 1
## 49 49 44.9 1
## 50 50 44.6 1
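One detail of the recoding above is worth checking: the comparison `Language == "English"` is an exact match, so an entry that carries footnote markers, such as `English[10][11][12]` in row 6, is coded 1 (non-English) in the printed result. The following is a minimal sketch of how the markers could be stripped before recoding. The helper name `strip_footnotes` and the small example vector are assumptions made for illustration, not part of the original analysis, and applying this to the full data would change the group sizes and therefore the results reported later.

```r
# Hypothetical helper: drop footnote markers such as "[7]" or "[10][11]"
strip_footnotes <- function(x) {
  trimws(gsub("\\[[0-9]+\\]", "", x))
}

# A few values copied from the Language column printed earlier
language_raw <- c("Hindi[7][8]", "English", "English[10][11][12]", "Korean")
language_clean <- strip_footnotes(language_raw)

# Same coding as in the report: 0 = English, 1 = non-English
language_code <- ifelse(language_clean == "English", 0, 1)

print(data.frame(language_raw, language_clean, language_code))
```

With the markers removed, `English[10][11][12]` is treated the same as `English`.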
Description:
library(ggplot2)
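# Histogram of subscribers for English-language channels (Language == 0)
english_plot <- ggplot(selected_columns[selected_columns$Language == 0, ], aes(x = Subscribers)) +
theme_linedraw() +
geom_histogram(binwidth = 5, fill = "darkblue") +
ylab("Frequency") +
ggtitle("English channels")
# Histogram of subscribers for non-English channels (Language == 1)
non_english_plot <- ggplot(selected_columns[selected_columns$Language == 1, ], aes(x = Subscribers)) +
theme_linedraw() +
geom_histogram(binwidth = 5, fill = "darkred") +
ylab("Frequency") +
ggtitle("Non-English channels")
library(ggpubr)
# Show the two histograms side by side
ggarrange(english_plot, non_english_plot, ncol = 2, nrow = 1)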
library(dplyr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
selected_columns %>%
group_by(Language) %>%
shapiro_test(Subscribers)
## # A tibble: 2 × 4
## Language variable statistic p
## <dbl> <chr> <dbl> <dbl>
## 1 0 Subscribers 0.820 0.00226
## 2 1 Subscribers 0.613 0.0000000765
Based on the Shapiro-Wilk test, we reject the normality assumption for both groups (English: p = 0.0023; non-English: p < 0.001). Therefore, we will conduct the Wilcoxon rank sum test instead of a t-test.
Null hypothesis: The locations of the two subscriber distributions are the same.
Hypothesis one: The distribution of subscribers for channels that broadcast in English is shifted to the right (more subscribers) relative to non-English channels.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following object is masked from 'package:car':
##
## logit
describeBy(selected_columns$Subscribers, selected_columns$Language)
##
## Descriptive statistics by group
## group: 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 19 76.33 31.55 59 73.41 12.45 47.4 155 107.6 1.11 0.04 7.24
## ---------------------------------------------------------------------------
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 31 69.58 38.73 57.3 60.9 17.2 44.6 238 193.4 2.94 9.33 6.96
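In the Wilcoxon test we set paired = FALSE because the two samples are independent. Note that with the formula interface the alternative refers to the first group (0 = English) relative to the second (1 = non-English), so alternative = "less" tests whether English channels are shifted to the left. Testing the alternative hypothesis stated above (English shifted to the right) would require alternative = "greater".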
wilcox.test(selected_columns$Subscribers ~ selected_columns$Language,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "less")
##
## Wilcoxon rank sum test
##
## data: selected_columns$Subscribers by selected_columns$Language
## W = 359, p-value = 0.9013
## alternative hypothesis: true location shift is less than 0
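To make the statistic W easier to read, here is a minimal sketch of how it is obtained from ranks. The two small vectors are made-up illustration data, not the report's data. W is the sum of the ranks of the first group in the pooled sample minus n1(n1 + 1)/2, which (without ties) equals the number of pairs in which the first-group value exceeds the second-group value.

```r
# Illustrative (hypothetical) data: x plays the role of group 0, y of group 1
x <- c(155, 137, 111)
y <- c(238, 153, 93.4, 84.8)

# Rank all observations together, then sum the ranks belonging to x
pooled_ranks <- rank(c(x, y))
n1 <- length(x)
W_manual <- sum(pooled_ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2

# The same number as a count of pairs with x > y
W_pairs <- sum(outer(x, y, ">"))

# Compare with the statistic reported by wilcox.test()
W_builtin <- unname(wilcox.test(x, y)$statistic)

print(c(manual = W_manual, pairs = W_pairs, builtin = W_builtin))
```

Under the null hypothesis the expected value of W is n1 * n2 / 2. For the group sizes reported by describeBy (19 and 31) that is 294.5, so the observed W = 359 lies above the expected value.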
library(ggplot2)
ggplot(selected_columns, aes(x = Subscribers, fill = as.factor(Language))) +
geom_histogram(position = position_dodge(width = 2.5), binwidth = 5, colour = "Black") +
scale_x_continuous(breaks = seq(0, 250, 25)) +
ylab("Frequency") +
labs(fill = "Language")
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
effectsize(wilcox.test(selected_columns$Subscribers ~ selected_columns$Language,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "less"))
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.22 | [-1.00, 0.46]
##
## - One-sided CIs: lower bound fixed at [-1.00].
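With p = 0.9013, the Wilcoxon rank sum test as run does not reject the null hypothesis at the 5% level. The effect size (rank-biserial correlation) is r = 0.22, which is small-to-moderate, and its one-sided 95% confidence interval [-1.00, 0.46] includes 0.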
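As a cross-check, the rank-biserial correlation can be reproduced by hand from the reported W and the group sizes. This sketch assumes the common formula r = 2W / (n1 * n2) - 1, where W counts the pairs in which the first group (English) has the larger value. The variable names are chosen here for illustration.

```r
W  <- 359  # test statistic reported by wilcox.test()
n1 <- 19   # English channels (group 0), from describeBy
n2 <- 31   # non-English channels (group 1), from describeBy

# Share of (English, non-English) pairs in which the English channel is larger
prop_english_larger <- W / (n1 * n2)

# Rank-biserial correlation: rescales that share from [0, 1] to [-1, 1]
r_rank_biserial <- 2 * prop_english_larger - 1

print(round(r_rank_biserial, 2))
```

The result rounds to 0.22, in agreement with the value printed by effectsize().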