We begin by examining the characteristics of the data set we have selected. From it we choose a single variable, viewsPerLike, and examine its distribution visually and mathematically. We then assess the normality of that variable visually. Finally, we look at the factors in the data set that influence the normality of the variable and draw overall conclusions mathematically.
The observations made from the R code are listed below:
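library(dplyr)    # %>%, arrange(), count(), mutate()
library(ggplot2)  # plots used later in the report
data <- read.csv(file = "~/MBAA531/test 4/test4data.csv", header = TRUE)
col <- data$viewsPerLike
# Bins of width 30; extend past the maximum and keep the minimum so no value becomes NA
vplRange <- cut(col, seq(min(col), max(col) + 30, 30), include.lowest = TRUE)
freqTab <- as.data.frame(table(vplRange)) %>% arrange(desc(Freq))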
names(data)
## [1] "title" "channel_title" "publication_date" "viewCount"
## [5] "likeCount" "commentCount" "viewsPerLike"
nrow(data)
## [1] 1502
DT::datatable(head(data))
DT::datatable(freqTab)
The histogram confirms the above observation: frequencies peak for values from the 40s to about 100.
hist(data$viewsPerLike, freq=TRUE, main="Frequency of Values", xlab="Views Per Like")
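The frequency table earlier in this section is built with cut(), which by default uses intervals that are open on the left and assigns NA to anything outside the breaks. The following is a minimal sketch on made-up values (the vector `x` is hypothetical, not taken from the data set) showing how the choice of breaks affects which values are counted.

```r
# Hypothetical values, not taken from the data set
x <- c(10, 25, 40, 55, 70, 95)

# Breaks running from min to max in steps of 30: 10, 40, 70
narrow <- cut(x, seq(min(x), max(x), 30))
# Intervals are (10,40] and (40,70]; the minimum (10) and 95 fall outside
table(narrow, useNA = "ifany")

# Breaks extended one step past the maximum, with the lowest value included
wide <- cut(x, seq(min(x), max(x) + 30, 30), include.lowest = TRUE)
table(wide, useNA = "ifany")
```

In the first call two of the six values end up as NA and would be silently dropped by table(); in the second call every value lands in a bin.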
The values reported by summary() are shown on a ggplot2 histogram. The histogram also breaks the variable down by channel and compares each channel's distribution against the mean and quartiles of the variable for the whole data set.
ggplot(data, aes(viewsPerLike, fill = channel_title)) +
  geom_histogram(bins = 50, colour = "white") +
  geom_vline(aes(xintercept = quantile(col, 0.25), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.5), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.75), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = mean(col), color = "Mean"), linetype = "dashed", size = 1.5, show.legend = TRUE) +
  # Colour legend for the reference lines (channels are shown by the fill legend)
  scale_color_manual(name = "Statistics", values = c(Quartiles = "blue", Mean = "red"))
summary(col)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.13 39.62 80.73 105.01 156.19 325.33
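The summary already hints at the shape of the distribution: the mean sits above the median, and the upper quartile is further from the median than the lower one. As a rough sketch, the quartile (Bowley) skewness can be computed from the printed values alone. This measure is not part of the original analysis; it is used here only as an illustration.

```r
# Values copied from the summary() output above
q1  <- 39.62
med <- 80.73
q3  <- 156.19
avg <- 105.01

# Gap between mean and median: a positive gap points to a longer right tail
mean_minus_median <- avg - med

# Quartile (Bowley) skewness: 0 for a symmetric distribution, bounded by -1 and 1
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)

round(c(mean_minus_median = mean_minus_median, bowley = bowley), 3)
```

Both quantities come out positive (about 24.28 and 0.295), which is consistent with a right-skewed variable.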
The data set is a collection of details about the videos from four different YouTube channels run by four different commercial airlines from across the world.
The aim is to understand how differing viewership behaviour across channels shapes the distribution of the variable in the whole data set. Qatar Airways accounts for 45.34% of the data, followed by American Airlines with 26.17%, Singapore Airlines with 18.11% and British Airways with 10.39%. The plots give a clear picture of each channel's distribution compared with the mean of the whole data set.
channels <- as.data.frame(data %>% count(channel_title) %>%mutate(freq = (n / sum(n))*100))
channels[order(-channels$freq),]
## channel_title n freq
## 3 Qatar Airways 681 45.33955
## 1 American Airlines 393 26.16511
## 4 Singapore Airlines 272 18.10919
## 2 British Airways 156 10.38615
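The percentages can be cross-checked from the printed counts without the original file. The sketch below uses base R's prop.table() instead of the dplyr pipeline; the counts are copied from the output above.

```r
# Video counts per channel, copied from the output above
n <- c("Qatar Airways" = 681, "American Airlines" = 393,
       "Singapore Airlines" = 272, "British Airways" = 156)

# Share of each channel in percent, rounded to two decimals
share <- round(100 * prop.table(n), 2)
share

# Total number of videos; should equal nrow(data)
sum(n)
```

The counts add up to 1502, matching nrow(data), and the rounded shares are 45.34, 26.17, 18.11 and 10.39 percent.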
ggplot(data, aes(x = viewsPerLike, fill = channel_title)) +
  # Histogram on the density scale so it shares a y-axis with geom_density()
  geom_histogram(bins = 30, aes(y = after_stat(density))) +
  geom_density(alpha = 0.4) +
  facet_wrap(~channel_title, scales = "free") +
  geom_vline(xintercept = mean(data$viewsPerLike), size = 1.5, color = "red")
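The faceted plot compares each channel against the overall mean by eye. A numeric counterpart could look like the sketch below. It runs on a small simulated data frame (`toy`, with hypothetical channels "Channel A" and "Channel B"), not on the real data; with the report's data frame, `toy` would be replaced by `data`.

```r
# Hypothetical stand-in for the report's data frame
set.seed(1)
toy <- data.frame(
  channel_title = rep(c("Channel A", "Channel B"), each = 50),
  viewsPerLike  = c(rexp(50, rate = 1 / 40), rexp(50, rate = 1 / 150))
)

overall_mean <- mean(toy$viewsPerLike)

# Mean and median of the variable within each channel
channel_mean   <- tapply(toy$viewsPerLike, toy$channel_title, mean)
channel_median <- tapply(toy$viewsPerLike, toy$channel_title, median)

# One row per channel, with the signed distance from the overall mean
by_channel <- data.frame(
  channel           = names(channel_mean),
  mean              = as.vector(channel_mean),
  median            = as.vector(channel_median),
  diff_from_overall = as.vector(channel_mean) - overall_mean
)
by_channel
```

A channel whose mean lies far from the overall mean pulls the pooled distribution away from symmetry, which is the effect the facets are meant to reveal.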
The boxplot shows the range covered by each quartile of the variable. The first two quartiles span much narrower ranges (14.13 to 39.62 and 39.62 to 80.73) than the upper two, so the values are packed more densely below the median.
ggplot(data, mapping = aes(x = viewsPerLike, y = 0)) +
  geom_jitter() +
  geom_boxplot(alpha = .5, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.25)), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.5)), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.75)), linetype = "solid", size = 1, show.legend = TRUE)
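The spread within each quarter of the data can be read off the five-number summary printed earlier. The sketch below computes the width of each quarter and the usual 1.5 × IQR outlier fences, using only values copied from the summary() output in this report.

```r
# Five-number summary copied from the summary() output in this report
five <- c(min = 14.13, q1 = 39.62, median = 80.73, q3 = 156.19, max = 325.33)

# Width of each quarter of the data; each quarter holds about 25% of the videos
widths <- diff(five)
names(widths) <- c("min-Q1", "Q1-median", "median-Q3", "Q3-max")
widths

# Outlier fences used by default boxplots: 1.5 * IQR beyond the quartiles
iqr <- five[["q3"]] - five[["q1"]]
fences <- c(lower = five[["q1"]] - 1.5 * iqr,
            upper = five[["q3"]] + 1.5 * iqr)
fences
```

The widths grow from 25.49 to 169.14, so the same number of videos is spread over an ever wider range as viewsPerLike increases. Both fences lie beyond the observed minimum and maximum, so a boxplot with the default 1.5 × IQR rule would flag no outliers.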
The share of viewsPerLike values that fall within one, two and three standard deviations of the mean is computed below:
stdev <- sd(col)
stdev2 <- 2*stdev
stdev3 <- 3*stdev
oneSD <- round(sum(col > mean(col) - stdev & col < mean(col) + stdev)/nrow(data) * 100,2)
twoSD <- round(sum(col > mean(col) - stdev2 & col < mean(col) + stdev2)/nrow(data) * 100,2)
threeSD <- round(sum(col > mean(col) - stdev3 & col < mean(col) + stdev3)/nrow(data) * 100,2)
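To conclude, we check whether the variable conforms to the “68-95-99.7” rule for a normal distribution. 69.84 percent of values fall within one standard deviation of the mean, 94.94 percent within two standard deviations and 100 percent within three standard deviations. These shares are close to the rule's 68, 95 and 99.7 percent. However, the mean (105.01) lies well above the median (80.73), which indicates right skew, so agreement with the rule alone does not establish normality.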
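For reference, the exact shares implied by a normal distribution can be obtained from pnorm() and set next to the observed ones. The sketch below uses the three observed percentages quoted above; the data frame name `comparison` is only illustrative.

```r
# Observed shares (percent) copied from the paragraph above
observed <- c(69.84, 94.94, 100)

# Theoretical shares within k standard deviations of the mean under normality
k <- 1:3
theoretical <- round(100 * (pnorm(k) - pnorm(-k)), 2)

comparison <- data.frame(
  k           = k,
  observed    = observed,
  theoretical = theoretical,
  difference  = round(observed - theoretical, 2)
)
comparison

# Visual check on the real variable (assumes `col` from the report is in scope):
# qqnorm(col); qqline(col)
```

The theoretical values are 68.27, 95.45 and 99.73 percent, so the observed shares differ by 1.57, -0.51 and 0.27 percentage points. A Q-Q plot, as in the commented lines, would show whether the tails depart from the normal line even though these shares are close.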
Submitted By: Tech Army