Introduction

We begin by examining the characteristics of the selected data set. We then choose a single variable and examine its distribution both visually and numerically, followed by a visual check of that variable's normality. Finally, we look at the factors in the data set that influence the normality of the variable and draw overall conclusions numerically.




Dataset Characteristics

The observations made from the R code are listed below:

  1. The data set holds details of YouTube videos: title, channel name, publication date, and counts of views, likes and comments, along with the derived field viewsPerLike.
  2. It has 1502 rows.
  3. viewsPerLike, the selected column, ranges from 14.13 to 325.33, with a median of 80.73 and a mean of 105.01.
  4. Although viewsPerLike spans 14.13 to 325.33, more than half of the values fall between 14 and 104.
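library(dplyr)  # provides %>% and arrange()

data <- read.csv(file = "~/MBAA531/test 4/test4data.csv", header = TRUE)
col <- data$viewsPerLike
# Bin viewsPerLike into intervals of width 30; include.lowest keeps the minimum in the first bin
vplRange <- cut(col, seq(min(col), max(col), 30), include.lowest = TRUE)
freqTab <- as.data.frame(table(vplRange)) %>% arrange(desc(Freq))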

names(data)
## [1] "title"            "channel_title"    "publication_date" "viewCount"       
## [5] "likeCount"        "commentCount"     "viewsPerLike"
nrow(data)
## [1] 1502
DT::datatable(head(data))
DT::datatable(freqTab)
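
The fourth observation above can be checked directly rather than read off the frequency table. The following is a minimal sketch only: `vpl` is a synthetic right-skewed sample standing in for `data$viewsPerLike` (the original CSV is not assumed to be available), and `share_between` is a hypothetical helper name.

```r
# Hypothetical stand-in for data$viewsPerLike: a right-skewed sample,
# generated only so that the sketch runs without the original CSV.
set.seed(1)
vpl <- 14 + rgamma(1502, shape = 1.5, scale = 60)

# Percentage of observations that fall inside the closed interval [lower, upper].
share_between <- function(x, lower, upper) {
  mean(x >= lower & x <= upper) * 100
}

# Share of values between 14 and 104 in the synthetic sample
share_between(vpl, 14, 104)
```

With the real data, `share_between(data$viewsPerLike, 14, 104)` would give the percentage behind the "more than half" statement.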




Base Histogram

The histogram supports the observation above: frequencies peak from roughly the 40s up to 100.

hist(data$viewsPerLike, freq=TRUE, main="Frequency of Values", xlab="Views Per Like")
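
The location of the peak can also be read from the histogram object instead of the picture. This is a sketch under assumptions: `vpl` is synthetic data, not the report's column, and the bin width of 20 is an arbitrary illustrative choice.

```r
# Hypothetical stand-in for data$viewsPerLike (synthetic, right-skewed).
set.seed(3)
vpl <- 14 + rgamma(1502, shape = 1.5, scale = 60)

# Build the histogram without drawing it; breaks of width 20 cover the full range.
upper <- ceiling(max(vpl) / 20) * 20
h <- hist(vpl, breaks = seq(0, upper, by = 20), plot = FALSE)

# The modal bin is the one with the highest count.
modal_bin <- which.max(h$counts)
c(from = h$breaks[modal_bin], to = h$breaks[modal_bin + 1])
```

Applied to the real column, the returned interval shows which bin carries the peak frequency.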




Histogram with ggplot2

The ggplot2 histogram presents the summary() figures visually. It also breaks the distribution of the variable down by channel and compares it against the mean and quartiles of the variable over the whole data set.

library(ggplot2)

ggplot(data, aes(viewsPerLike, fill = channel_title)) +
  geom_histogram(bins = 50, colour = "white") +
  # Quartiles (solid blue) and mean (dashed red) of the whole data set
  geom_vline(aes(xintercept = quantile(col, 0.25), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.5), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.75), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = mean(col), color = "Mean"), linetype = "dashed", size = 1.5, show.legend = TRUE) +
  # The colour legend describes the reference lines, not the channels
  scale_color_manual(name = "Statistics", values = c(Quartiles = "blue", Mean = "red"))

summary(col)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.13   39.62   80.73  105.01  156.19  325.33
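
In the summary output the mean (105.01) sits well above the median (80.73), which points to a right-skewed distribution. One way to put a number on that is the moment-based sample skewness. The sketch below is illustrative only: `sample_skewness` is a hypothetical helper and `vpl` is synthetic data rather than the report's column.

```r
# Moment-based sample skewness: third central moment divided by the
# 1.5th power of the second central moment. Zero for symmetric data,
# positive when the right tail is longer.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical stand-in for data$viewsPerLike (synthetic, right-skewed).
set.seed(4)
vpl <- 14 + rgamma(1502, shape = 1.5, scale = 60)

c(mean = mean(vpl), median = median(vpl), skewness = sample_skewness(vpl))
```

A clearly positive value on the real column would be consistent with the gap between mean and median reported above.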




Distribution and Weight

The data set is a collection of details about videos from four YouTube channels run by four commercial airlines from across the world. In the following steps, we will show:

  • The weight of each channel in the entire data set.
  • The distribution pattern of the variable within each channel.

The aim is to understand how differing viewership behaviour across channels shapes the distribution of the variable in the entire data set. Qatar Airways accounts for 45.34% of the data, followed by American Airlines with 26.17%, Singapore Airlines with 18.11% and British Airways with 10.39%. The faceted plot shows the distribution for each channel against the mean of the whole data set.

channels <- as.data.frame(data %>% count(channel_title) %>%mutate(freq = (n / sum(n))*100))
channels[order(-channels$freq),]
##        channel_title   n     freq
## 3      Qatar Airways 681 45.33955
## 1  American Airlines 393 26.16511
## 4 Singapore Airlines 272 18.10919
## 2    British Airways 156 10.38615
ggplot(data, aes(x = viewsPerLike, group = channel_title, fill = channel_title)) + 
  stat_bin(bins=30, aes(y = ..count..)) + 
  geom_density() + 
  facet_wrap(~channel_title, scales = "free") + 
  geom_vline(xintercept=mean(data$viewsPerLike), size=1.5, color="red")
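
The per-channel comparison can be backed with numbers as well as plots. The following base R sketch assumes a data frame with the same `channel_title` and `viewsPerLike` columns as the report; `demo` and its two channel names are made up purely for illustration.

```r
# Hypothetical two-channel data frame mirroring the report's column names.
set.seed(2)
demo <- data.frame(
  channel_title = rep(c("Channel A", "Channel B"), times = c(60, 40)),
  viewsPerLike  = c(rgamma(60, shape = 2, scale = 30),
                    rgamma(40, shape = 2, scale = 70))
)

# Split the variable by channel and summarise each group.
by_channel <- split(demo$viewsPerLike, demo$channel_title)
channel_stats <- data.frame(
  channel = names(by_channel),
  n       = sapply(by_channel, length),
  mean    = sapply(by_channel, mean),
  median  = sapply(by_channel, median),
  row.names = NULL
)

# Weight of each channel in the whole data set, in percent.
channel_stats$share_pct <- channel_stats$n / sum(channel_stats$n) * 100
channel_stats
```

Comparing each channel's mean and median with the overall mean shows which channels pull the combined distribution to the right.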




Density

The boxplot shows the spread of the variable's quartiles. The values are denser in the first and second quartiles, which cover much narrower ranges than the third and fourth.

ggplot(data, mapping = aes(x = viewsPerLike, y = 0)) +
  geom_jitter() +
  geom_boxplot(alpha = .5, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.25)), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.5)), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.75)), linetype = "solid", size = 1, show.legend = TRUE)
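
The density statement can be worked out from the five summary() values reported earlier. Each quartile holds about a quarter of the rows, so a narrower span means more tightly packed values. The sketch below uses only those reported figures; the variable names are illustrative.

```r
# Five-number summary of viewsPerLike as reported by summary(col) above.
five_num <- c(min = 14.13, q1 = 39.62, median = 80.73, q3 = 156.19, max = 325.33)

# Width of the range covered by each quartile.
quartile_width <- diff(five_num)
names(quartile_width) <- c("Q1", "Q2", "Q3", "Q4")
quartile_width

# Approximate percentage of rows per unit of viewsPerLike in each quartile
# (each quartile holds roughly 25% of the rows).
pct_per_unit <- 25 / quartile_width
round(pct_per_unit, 2)
```

The widths grow from the first quartile to the fourth, which is the numeric counterpart of the crowding seen on the left side of the boxplot.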




Conclusion

The share of viewsPerLike values within one, two and three standard deviations of the mean is computed below:

stdev <- sd(col)
stdev2 <- 2*stdev
stdev3 <- 3*stdev

oneSD <- round(sum(col > mean(col) - stdev & col < mean(col) + stdev)/nrow(data) * 100,2)
twoSD <- round(sum(col > mean(col) - stdev2 & col < mean(col) + stdev2)/nrow(data) * 100,2)
threeSD <- round(sum(col > mean(col) - stdev3 & col < mean(col) + stdev3)/nrow(data) * 100,2)

To conclude, we check whether the variable conforms to the “68-95-99.7” rule of the normal distribution. 69.84 percent of values fall within one standard deviation of the mean, 94.94 percent within two standard deviations and 100 percent within three. These proportions are close to the 68-95-99.7 pattern, although the mean (105.01) lying well above the median (80.73) indicates a right skew, so the match with a normal distribution is approximate at best.
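
The three coverage figures can be produced by one helper and set beside the values a normal distribution would give, which are (2 * pnorm(k) - 1) * 100 for k standard deviations. This is a sketch under assumptions: `coverage_within` is a hypothetical helper, and `x` is a synthetic normal sample used only to make the example runnable, not the report's data.

```r
# Percentage of values lying strictly within k standard deviations of the mean.
coverage_within <- function(x, k) {
  mean(abs(x - mean(x)) < k * sd(x)) * 100
}

k <- 1:3

# Theoretical coverage under a normal distribution: about 68.27, 95.45, 99.73.
expected <- (2 * pnorm(k) - 1) * 100

# Synthetic normal sample for illustration (assumed mean and sd).
set.seed(5)
x <- rnorm(1502, mean = 105, sd = 75)
observed <- sapply(k, function(kk) coverage_within(x, kk))

data.frame(k = k, expected = round(expected, 2), observed = round(observed, 2))
```

Replacing `x` with `data$viewsPerLike` would reproduce the oneSD, twoSD and threeSD figures alongside their theoretical counterparts.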



Submitted By: Tech Army
