The following work was used in Tech Army’s submission for Test 4. Click here to have a look.

Introduction

We begin by examining the characteristics of the data set, then pick one variable and check its distribution both visually and mathematically. Next, we check the normality of that variable visually. Finally, we consider other key elements of the data set that influence the variable’s normality and draw conclusions mathematically.

Characteristics of the data set:

The observations made from the R code are listed below:

  1. The data set holds details of YouTube videos: title, channel name, publication date, and counts of views, likes and comments, along with the derived field viewsPerLike.
  2. It has 1502 rows.
  3. viewsPerLike, the selected column, has values between 14.13 and 325.33, with a median of 80.73 and a mean of 105.01.
  4. Although viewsPerLike ranges from about 14 to 326, more than half of the values fall between 14 and 104.
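library(dplyr)    # provides %>%, arrange() and count()
library(ggplot2)  # used for the plots further down
data <- read.csv(file = "~/MBAA531/test4data.csv", header = TRUE)
col <- data$viewsPerLike
# Bin the values into 30-unit-wide intervals, starting at the minimum
vplRange <- cut(col,(seq(min(col),max(col),30)))
freqTab <- as.data.frame(table(vplRange)) %>% arrange(desc(Freq))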

names(data)
## [1] "title"            "channel_title"    "publication_date" "viewCount"       
## [5] "likeCount"        "commentCount"     "viewsPerLike"
nrow(data)
## [1] 1502
head(data)
##                                                                   title
## 1                        Welcome to the heart of Africa | Qatar Airways
## 2                          15 years of flying to Berlin | Qatar Airways
## 3                                Welcome to the Nordics | Qatar Airways
## 4 Live Passion with Boca Juniors' Lisandro "Lica" Lopez | Qatar Airways
## 5                            The hardship behind the Silver Spoon dream
## 6                      Our beloved #Qatar in the eyes of #MotoGP Racers
##       channel_title     publication_date viewCount likeCount commentCount
## 1     Qatar Airways 2020-08-11T15:30:08Z      4577       324           27
## 2     Qatar Airways 2020-11-05T14:01:05Z      4040       271           58
## 3     Qatar Airways 2020-10-13T15:30:06Z      4438       296           31
## 4     Qatar Airways 2020-11-12T15:00:01Z      1052        69            6
## 5 American Airlines 2022-03-28T22:28:03Z       218        14            2
## 6     Qatar Airways 2021-04-17T17:53:00Z      1205        76            9
##   viewsPerLike
## 1     14.12654
## 2     14.90775
## 3     14.99324
## 4     15.24638
## 5     15.57143
## 6     15.85526
summary(data)
##     title           channel_title      publication_date     viewCount      
##  Length:1502        Length:1502        Length:1502        Min.   :    218  
##  Class :character   Class :character   Class :character   1st Qu.:   3027  
##  Mode  :character   Mode  :character   Mode  :character   Median :   6670  
##                                                           Mean   :  37349  
##                                                           3rd Qu.:  17025  
##                                                           Max.   :4424126  
##    likeCount         commentCount      viewsPerLike   
##  Min.   :    2.00   Min.   :   0.00   Min.   : 14.13  
##  1st Qu.:   40.25   1st Qu.:   2.00   1st Qu.: 39.62  
##  Median :  109.00   Median :   9.00   Median : 80.73  
##  Mean   :  367.92   Mean   :  29.59   Mean   :105.01  
##  3rd Qu.:  262.75   3rd Qu.:  23.00   3rd Qu.:156.19  
##  Max.   :34293.00   Max.   :4498.00   Max.   :325.33
freqTab
##       vplRange Freq
## 1  (14.1,44.1]  435
## 2  (44.1,74.1]  276
## 3   (74.1,104]  187
## 4    (104,134]  149
## 5    (134,164]  109
## 6    (164,194]   94
## 7    (194,224]   94
## 8    (224,254]   65
## 9    (254,284]   48
## 10   (284,314]   33
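
The counts in freqTab add up to 1490 rather than 1502. A likely reason is that cut() builds left-open intervals by default and the break sequence stops before the maximum, so the minimum value and any value above the last break are turned into NA and dropped by table(). The sketch below illustrates this on synthetic values (not the real data) and shows one possible fix; the variable names are illustrative.

```r
# Synthetic stand-in for viewsPerLike (NOT the real data).
set.seed(1)
x <- c(14, runif(98, 15, 320), 326)

# Breaks as in the original code: left-open bins that stop short of max(x).
b_orig <- seq(min(x), max(x), 30)
n_orig <- sum(table(cut(x, b_orig)))

# Possible fix: extend the breaks past max(x) and close the first bin on the left.
b_full <- seq(min(x), max(x) + 30, 30)
n_full <- sum(table(cut(x, b_full, include.lowest = TRUE)))

# Compare how many values each binning keeps against the total.
c(original = n_orig, full = n_full, total = length(x))
```

With the extended breaks and include.lowest = TRUE, every value lands in a bin, so the bin counts add up to the number of rows.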

Base Histogram

The histogram confirms the above observation: frequencies are highest at the low end of the range, with most values below roughly 100 and a long tail to the right.

hist(data$viewsPerLike, freq=TRUE, main="Frequency of Values", xlab="Views Per Like")

Normality Check

qqnorm(col, main="Normality of Views to Likes Ratio")
qqline(col)
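
A Q-Q plot is a visual check, and a numerical check can complement it. The sketch below is one possible approach, not part of the original submission: it runs the Shapiro-Wilk test from base R and computes a simple moment-based skewness on a synthetic right-skewed sample. The sample and its parameters are assumptions chosen only for illustration.

```r
# Synthetic right-skewed sample standing in for viewsPerLike (NOT the real data).
set.seed(42)
x <- rlnorm(1502, meanlog = 4.4, sdlog = 0.7)

# Shapiro-Wilk test: a small p-value is evidence against normality.
# shapiro.test() accepts between 3 and 5000 observations.
sw <- shapiro.test(x)

# Moment-based sample skewness: positive values indicate a longer right tail.
skew <- mean((x - mean(x))^3) / sd(x)^3

sw$p.value
skew
```

To apply this to the real data, replace x with col.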

Histogram on ggplot2

The histogram built with ggplot2 visualizes the output of summary(). It also shows the channel-wise distribution of the variable and compares it against the overall mean and quartiles of the variable across the whole data set.

ggplot(data, aes(viewsPerLike, fill = channel_title)) + geom_histogram(bins = 50, colour = "white") +
  geom_vline(aes(xintercept = quantile(col, 0.25), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.5), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.75), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = mean(col), color = "Mean"), linetype = "dashed", size = 1.5, show.legend = TRUE) +
  scale_color_manual(name = "Statistic", values = c(Quartiles = "blue", Mean = "red"))

summary(col)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.13   39.62   80.73  105.01  156.19  325.33

Distribution and Weight of data based on channels:

Qatar Airways accounts for 45.34% of the data set, followed by American Airlines with 26.17%, Singapore Airlines with 18.11%, and British Airways with 10.39%. The faceted plot shows the distribution of each channel’s values against the mean of the whole data set.

channels <- as.data.frame(data %>% count(channel_title) %>% mutate(freq = (n / sum(n))*100))
channels[order(-channels$freq),]
##        channel_title   n     freq
## 3      Qatar Airways 681 45.33955
## 1  American Airlines 393 26.16511
## 4 Singapore Airlines 272 18.10919
## 2    British Airways 156 10.38615
ggplot(data, aes(x = viewsPerLike, group = channel_title, fill = channel_title)) + stat_bin(bins=30, aes(y = ..count..)) + geom_density() + facet_wrap(~channel_title, scales = "free") + geom_vline(xintercept=mean(data$viewsPerLike), size=1.5, color="red")
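
As a cross-check on the shares, the same percentages can be computed in base R with prop.table(). The sketch below is an illustrative alternative to the dplyr pipeline, using the per-channel row counts from the output above; the variable names are assumptions.

```r
# Row counts per channel, copied from the count() output above.
counts <- c("Qatar Airways" = 681,
            "American Airlines" = 393,
            "Singapore Airlines" = 272,
            "British Airways" = 156)

# prop.table() divides each count by the total; scale to percent and round.
shares <- round(prop.table(counts) * 100, 2)

shares
sum(counts)  # should match nrow(data)
```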

Density

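The boxplot shows the range covered by each quartile of the variable. The first and second quartiles span much narrower ranges (14.13 to 39.62 and 39.62 to 80.73) than the third and fourth, so the values are packed more densely at the low end.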

ggplot(data, mapping = aes(x = viewsPerLike, y = 0)) + geom_jitter() + geom_boxplot(alpha = .5, show.legend = T ) + geom_vline(aes(xintercept=quantile(col,0.25)), linetype="solid", size=1, show.legend=T) + geom_vline(aes(xintercept=quantile(col,0.5)), linetype="solid", size=1, show.legend=T) + geom_vline(aes(xintercept=quantile(col,0.75)), linetype="solid", size=1, show.legend=T)
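
Each quartile holds a quarter of the rows, so a narrower quartile means more densely packed values. A minimal sketch of that arithmetic, using the rounded five-number summary reported by summary() earlier (the variable names are illustrative):

```r
# Five-number summary of viewsPerLike, copied from the summary() output.
five_num <- c(min = 14.13, q1 = 39.62, median = 80.73, q3 = 156.19, max = 325.33)

# Width of each quartile: smaller width = same share of rows in less space.
widths <- diff(five_num)

# Interquartile range and the upper boxplot fence (Q3 + 1.5 * IQR).
iqr <- five_num[["q3"]] - five_num[["q1"]]
upper_fence <- five_num[["q3"]] + 1.5 * iqr

widths
upper_fence
```

The widths grow from about 25 in the first quartile to about 169 in the fourth. The upper fence lies above the maximum, so the boxplot flags no high-end outliers under the usual 1.5 × IQR convention.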

Conclusion

The distribution of values in viewsPerLike is given below:

stdev <- sd(col)
stdev2 <- 2*stdev
stdev3 <- 3*stdev

oneSD <- round(sum(col > mean(col) - stdev & col < mean(col) + stdev)/nrow(data) * 100,2)
twoSD <- round(sum(col > mean(col) - stdev2 & col < mean(col) + stdev2)/nrow(data) * 100,2)
threeSD <- round(sum(col > mean(col) - stdev3 & col < mean(col) + stdev3)/nrow(data) * 100,2)

To conclude, we check whether the variable conforms to the “68-95-99.7” rule of the normal distribution: 69.84 percent of values fall within one standard deviation of the mean, 94.94 percent within two standard deviations, and 100 percent within three. These shares are close to the 68, 95 and 99.7 percent that the rule predicts.
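
The three coverage calculations can be folded into one helper and set beside the exact shares a normal distribution implies. The sketch below is a hypothetical refactoring, not the original code: sd_coverage() is an illustrative name, and the sample is synthetic normal data rather than the real column.

```r
# Percentage of values lying strictly within k standard deviations of the mean.
sd_coverage <- function(x, k = 1:3) {
  z <- abs(x - mean(x)) / sd(x)
  sapply(k, function(ki) round(mean(z < ki) * 100, 2))
}

# Exact coverage under a normal distribution: P(|Z| < k) = 2 * pnorm(k) - 1.
expected <- round((2 * pnorm(1:3) - 1) * 100, 2)

# Synthetic normal sample (NOT the real data), same size as the data set.
set.seed(7)
x <- rnorm(1502, mean = 105, sd = 80)
observed <- sd_coverage(x)

rbind(expected, observed)
```

Calling sd_coverage(col) on the real column should reproduce oneSD, twoSD and threeSD from the code above.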

Velu