Erdene Enkh, 13. January 2024

IMB; Multivariate Analysis;

Homework 1

Most-subscribed YouTube channels and their broadcasting language

Research question: Among the most-subscribed YouTube channels, is the number of subscribers related to whether the channel broadcasts in English?

Unit of observation: YouTube channel

Variables:
  1. Subscribers - ratio
  2. Language - binary: English or non-English

Source of the dataset: https://www.kaggle.com/datasets/rajkumarpandey02/list-of-most-subscribed-youtube-channels-in-world

mydata <- read.csv("/Users/nadiatwinkle/Bootcamp/List of most-subscribed YouTube channels.csv")
head(mydata)
##   X Rank                                Name Link Brand.channel Subscribers            Language
## 1 0    1                            T-Series Link           Yes         238         Hindi[7][8]
## 2 1    2                           Cocomelon Link           Yes         155             English
## 3 2    3 Sony Entertainment Television India Link           Yes         153            Hindi[9]
## 4 3    4                             MrBeast Link            No         137             English
## 5 4    5                           PewDiePie Link            No         111             English
## 6 5    6                     Kids Diana Show Link           Yes         109 English[10][11][12]
##        Category       Country
## 1         Music         India
## 2     Education United States
## 3 Entertainment         India
## 4 Entertainment United States
## 5         Games        Sweden
## 6 Entertainment       Ukraine

To simplify, I would like to use only the columns Rank, Subscribers, and Language.

selected_columns <- mydata[, c("Rank", "Subscribers", "Language")]

# View the new data frame
print(selected_columns)
##    Rank Subscribers            Language
## 1     1       238.0         Hindi[7][8]
## 2     2       155.0             English
## 3     3       153.0            Hindi[9]
## 4     4       137.0             English
## 5     5       111.0             English
## 6     6       109.0 English[10][11][12]
## 7     7       105.0             English
## 8     8        94.9             English
## 9     9        93.8             English
## 10   10        93.4       Hindi[13][14]
## 11   11        84.8              Korean
## 12   12        83.3               Hindi
## 13   13        79.2             English
## 14   14        78.2               Hindi
## 15   15        73.9              Korean
## 16   16        71.1             English
## 17   17        69.5              Korean
## 18   18        66.4          Portuguese
## 19   19        66.3               Hindi
## 20   20        66.1             English
## 21   21        64.2               Hindi
## 22   22        63.2           Hindi[16]
## 23   23        60.6               Hindi
## 24   24        59.0             English
## 25   25        58.8             English
## 26   26        58.4               Hindi
## 27   27        57.3               Hindi
## 28   28        56.6            Bhojpuri
## 29   29        56.2             English
## 30   30        56.1               Hindi
## 31   31        56.0             Spanish
## 32   32        55.9               Hindi
## 33   33        55.6             English
## 34   34        53.1             English
## 35   35        53.0             English
## 36   36        52.7               Hindi
## 37   37        52.4             English
## 38   38        51.1             English
## 39   39        50.6             English
## 40   40        49.8               Hindi
## 41   41        47.6             Spanish
## 42   42        47.4             English
## 43   43        46.5             Spanish
## 44   44        45.7             Spanish
## 45   45        45.4             Spanish
## 46   46        45.4               Hindi
## 47   47        45.2               Hindi
## 48   48        45.0             Russian
## 49   49        44.9          Portuguese
## 50   50        44.6             Russian

In the Language column, I recode the values as 0 if the language is English and 1 otherwise, because the research question asks whether broadcasting in English is related to the number of subscribers.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Replace values in the 'Language' column
selected_columns <- selected_columns %>%
  mutate(Language = ifelse(Language == "English", 0, 1))

# View the modified data frame
print(selected_columns)
##    Rank Subscribers Language
## 1     1       238.0        1
## 2     2       155.0        0
## 3     3       153.0        1
## 4     4       137.0        0
## 5     5       111.0        0
## 6     6       109.0        1
## 7     7       105.0        0
## 8     8        94.9        0
## 9     9        93.8        0
## 10   10        93.4        1
## 11   11        84.8        1
## 12   12        83.3        1
## 13   13        79.2        0
## 14   14        78.2        1
## 15   15        73.9        1
## 16   16        71.1        0
## 17   17        69.5        1
## 18   18        66.4        1
## 19   19        66.3        1
## 20   20        66.1        0
## 21   21        64.2        1
## 22   22        63.2        1
## 23   23        60.6        1
## 24   24        59.0        0
## 25   25        58.8        0
## 26   26        58.4        1
## 27   27        57.3        1
## 28   28        56.6        1
## 29   29        56.2        0
## 30   30        56.1        1
## 31   31        56.0        1
## 32   32        55.9        1
## 33   33        55.6        0
## 34   34        53.1        0
## 35   35        53.0        0
## 36   36        52.7        1
## 37   37        52.4        0
## 38   38        51.1        0
## 39   39        50.6        0
## 40   40        49.8        1
## 41   41        47.6        1
## 42   42        47.4        0
## 43   43        46.5        1
## 44   44        45.7        1
## 45   45        45.4        1
## 46   46        45.4        1
## 47   47        45.2        1
## 48   48        45.0        1
## 49   49        44.9        1
## 50   50        44.6        1
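
Note that the exact comparison Language == "English" does not match values that carry footnote markers: row 6 ("English[10][11][12]") is coded 1 in the output above. A minimal sketch of one way to strip such markers before recoding is shown here on a small made-up vector. The names lang_raw, lang_clean, and lang_code are illustrative and not part of the analysis above.

```r
# Hypothetical toy vector mimicking the raw Language column (not the real data)
lang_raw <- c("Hindi[7][8]", "English", "English[10][11][12]", "Korean")

# Remove every bracketed footnote marker such as "[10]"
lang_clean <- gsub("\\[[0-9]+\\]", "", lang_raw)

# Recode: 0 = English, 1 = not English
lang_code <- ifelse(lang_clean == "English", 0, 1)

print(data.frame(lang_raw, lang_clean, lang_code))
```

Applying the same cleaning to the full data set would move that channel into the English group, so the group sizes and the test results reported in this homework would have to be recomputed.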

Description:

  • ID: Rank
  • Subscribers: Number of subscribers in millions
  • Language: 0 - English, 1 - Not English.

library(ggplot2)

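# Histogram of subscribers for English-language channels (Language == 0)
english_plot <- ggplot(selected_columns[selected_columns$Language == 0, ], aes(x = Subscribers)) +
  theme_linedraw() +
  geom_histogram(binwidth = 5, fill = "darkblue") +
  xlab("Subscribers (millions)") +
  ylab("Frequency") +
  ggtitle("English channels")

# Histogram of subscribers for non-English channels (Language == 1)
non_english_plot <- ggplot(selected_columns[selected_columns$Language == 1, ], aes(x = Subscribers)) +
  theme_linedraw() +
  geom_histogram(binwidth = 5, fill = "darkred") +
  xlab("Subscribers (millions)") +
  ylab("Frequency") +
  ggtitle("Non-English channels")

# Show both histograms side by side
library(ggpubr)
ggarrange(english_plot, non_english_plot, ncol = 2, nrow = 1)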

library(dplyr)
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
selected_columns %>%
  group_by(Language) %>%
  shapiro_test(Subscribers)
## # A tibble: 2 × 4
##   Language variable    statistic            p
##      <dbl> <chr>           <dbl>        <dbl>
## 1        0 Subscribers     0.820 0.00226     
## 2        1 Subscribers     0.613 0.0000000765

Based on the Shapiro-Wilk test, we reject the normality assumption for both groups (English: p = 0.0023; non-English: p < 0.001). Therefore, we will conduct the Wilcoxon rank-sum test.

Null hypothesis (H0): The locations of the two distributions are the same.

Alternative hypothesis (H1): The distribution of subscribers for channels that broadcast in English is shifted to the right.

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## The following object is masked from 'package:car':
## 
##     logit
describeBy(selected_columns$Subscribers, selected_columns$Language)
## 
##  Descriptive statistics by group 
## group: 0
##    vars  n  mean    sd median trimmed   mad  min max range skew kurtosis   se
## X1    1 19 76.33 31.55     59   73.41 12.45 47.4 155 107.6 1.11     0.04 7.24
## --------------------------------------------------------------------------- 
## group: 1
##    vars  n  mean    sd median trimmed  mad  min max range skew kurtosis   se
## X1    1 31 69.58 38.73   57.3    60.9 17.2 44.6 238 193.4 2.94     9.33 6.96

In the Wilcoxon test we set paired = FALSE because the two samples are independent.

wilcox.test(selected_columns$Subscribers ~ selected_columns$Language,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "less")
## 
##  Wilcoxon rank sum test
## 
## data:  selected_columns$Subscribers by selected_columns$Language
## W = 359, p-value = 0.9013
## alternative hypothesis: true location shift is less than 0
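
As a rough check on the output above, the p-value can be approximately reproduced by hand from W and the group sizes using the normal approximation without continuity correction. This is a sketch under stated assumptions: the group sizes n1 = 19 and n2 = 31 are taken from the descriptive statistics, and the small tie correction that wilcox.test applies to the variance is ignored, so the result matches only approximately.

```r
# Values taken from the outputs in this document
W  <- 359   # Wilcoxon rank sum statistic for group 0 (English)
n1 <- 19    # number of English channels
n2 <- 31    # number of non-English channels

# Mean and standard deviation of W under the null hypothesis
# (assumption: no tie correction, unlike wilcox.test)
mu    <- n1 * n2 / 2
sigma <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

# Standardised statistic and the two one-sided p-values
z         <- (W - mu) / sigma
p_less    <- pnorm(z)                      # alternative = "less"
p_greater <- pnorm(z, lower.tail = FALSE)  # alternative = "greater"

print(round(c(z = z, p_less = p_less, p_greater = p_greater), 4))
```

Because no continuity correction is used, the two one-sided p-values sum to 1, which is why a large p-value for "less" implies a smaller one for "greater".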

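With alternative = "less", the test asks whether channels in English (group 0, the first level) tend to have fewer subscribers than non-English channels (group 1). With a p-value of 0.9013 we cannot reject the null hypothesis: there is no evidence that the subscriber distribution of English channels is shifted to the left of the non-English one. Note that the alternative hypothesis stated above (English shifted to the right) corresponds to alternative = "greater". Because the normal approximation is used without continuity correction, that one-sided p-value is 1 - 0.9013, or approximately 0.099, so the null hypothesis is not rejected at the 5% level in that direction either.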

library(ggplot2)
ggplot(selected_columns, aes(x = Subscribers, fill = as.factor(Language))) +
  geom_histogram(position = position_dodge(width = 2.5), binwidth = 5, colour = "Black") +
  scale_x_continuous(breaks = seq(0, 250, 25)) +
  ylab("Frequency") +
  labs(fill = "Language")

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
effectsize(wilcox.test(selected_columns$Subscribers ~ selected_columns$Language,
           paired = FALSE,
           correct = FALSE,
           exact = FALSE,
           alternative = "less"))
## r (rank biserial) |        95% CI
## ---------------------------------
## 0.22              | [-1.00, 0.46]
## 
## - One-sided CIs: lower bound fixed at [-1.00].
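
The reported effect size can be cross-checked with the standard identity between the rank sum statistic and the rank-biserial correlation, r = 2W / (n1 * n2) - 1. The sketch below is not a description of how the effectsize package computes it internally. The inputs W, n1, and n2 are copied from the outputs in this document.

```r
# Values taken from the outputs in this document
W  <- 359   # Wilcoxon rank sum statistic for group 0 (English)
n1 <- 19    # number of English channels
n2 <- 31    # number of non-English channels

# W / (n1 * n2) is the share of (English, non-English) pairs in which
# the English channel has more subscribers (ties count as one half)
prob_superiority <- W / (n1 * n2)

# Rank-biserial correlation: rescales that share from [0, 1] to [-1, 1]
r_rank_biserial <- 2 * prob_superiority - 1

print(round(c(prob_superiority = prob_superiority, r = r_rank_biserial), 2))
```

A value of 0 would mean neither group tends to rank higher; positive values mean the first group (English) tends to rank higher.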

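The effect size (rank-biserial correlation) is r = 0.22, which is small to moderate. Its positive sign indicates that, in this sample, English channels tend to rank higher in subscribers, but the one-sided 95% confidence interval [-1.00, 0.46] includes 0.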