The following work was used in Tech Army’s submission for Test 4.
We begin by examining the characteristics of the data set, then pick one variable, viewsPerLike, and check its distribution both visually and numerically. Next, we check the normality of the variable visually. Finally, we consider other elements of the data set that influence the variable's normality and draw conclusions numerically.
The observations made from the R code are listed below:
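library(dplyr)    # provides %>%, arrange(), count() and mutate()
library(ggplot2)  # used for the plots further down
data <- read.csv(file = "~/MBAA531/test4data.csv", header = TRUE)
col <- data$viewsPerLike
# Bin viewsPerLike into intervals of width 30, starting at the minimum
vplRange <- cut(col, seq(min(col), max(col), 30))
# Frequency table, most populated bin first
freqTab <- as.data.frame(table(vplRange)) %>% arrange(desc(Freq))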
names(data)
## [1] "title" "channel_title" "publication_date" "viewCount"
## [5] "likeCount" "commentCount" "viewsPerLike"
nrow(data)
## [1] 1502
head(data)
## title
## 1 Welcome to the heart of Africa | Qatar Airways
## 2 15 years of flying to Berlin | Qatar Airways
## 3 Welcome to the Nordics | Qatar Airways
## 4 Live Passion with Boca Juniors' Lisandro "Lica" Lopez | Qatar Airways
## 5 The hardship behind the Silver Spoon dream
## 6 Our beloved #Qatar in the eyes of #MotoGP Racers
## channel_title publication_date viewCount likeCount commentCount
## 1 Qatar Airways 2020-08-11T15:30:08Z 4577 324 27
## 2 Qatar Airways 2020-11-05T14:01:05Z 4040 271 58
## 3 Qatar Airways 2020-10-13T15:30:06Z 4438 296 31
## 4 Qatar Airways 2020-11-12T15:00:01Z 1052 69 6
## 5 American Airlines 2022-03-28T22:28:03Z 218 14 2
## 6 Qatar Airways 2021-04-17T17:53:00Z 1205 76 9
## viewsPerLike
## 1 14.12654
## 2 14.90775
## 3 14.99324
## 4 15.24638
## 5 15.57143
## 6 15.85526
summary(data)
## title channel_title publication_date viewCount
## Length:1502 Length:1502 Length:1502 Min. : 218
## Class :character Class :character Class :character 1st Qu.: 3027
## Mode :character Mode :character Mode :character Median : 6670
## Mean : 37349
## 3rd Qu.: 17025
## Max. :4424126
## likeCount commentCount viewsPerLike
## Min. : 2.00 Min. : 0.00 Min. : 14.13
## 1st Qu.: 40.25 1st Qu.: 2.00 1st Qu.: 39.62
## Median : 109.00 Median : 9.00 Median : 80.73
## Mean : 367.92 Mean : 29.59 Mean :105.01
## 3rd Qu.: 262.75 3rd Qu.: 23.00 3rd Qu.:156.19
## Max. :34293.00 Max. :4498.00 Max. :325.33
freqTab
## vplRange Freq
## 1 (14.1,44.1] 435
## 2 (44.1,74.1] 276
## 3 (74.1,104] 187
## 4 (104,134] 149
## 5 (134,164] 109
## 6 (164,194] 94
## 7 (194,224] 94
## 8 (224,254] 65
## 9 (254,284] 48
## 10 (284,314] 33
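The counts in the table above add up to 1490, twelve short of the 1502 rows. A likely reason is how `cut()` treats the breaks: intervals are right-closed by default, so the minimum value falls outside the first bin, and any value above the last break is left unbinned as well. The sketch below illustrates this on a small made-up vector (the values and the names `x`, `breaks_orig` and `breaks_full` are assumptions for illustration, not the airline data) and shows one possible way to cover the full range.

```r
# Hypothetical toy vector standing in for data$viewsPerLike
x <- c(10, 15, 25, 40, 55, 70, 95, 99)

# Same pattern as in the post: breaks from min to max in steps of 30
breaks_orig <- seq(min(x), max(x), by = 30)   # 10, 40, 70
bins_orig <- cut(x, breaks_orig)
sum(is.na(bins_orig))   # the minimum (10) and the values above 70 are unbinned

# Variant: extend the breaks past the maximum and close the first interval on the left
breaks_full <- seq(min(x), max(x) + 30, by = 30)   # 10, 40, 70, 100
bins_full <- cut(x, breaks_full, include.lowest = TRUE)
sum(is.na(bins_full))   # every value now lands in a bin
table(bins_full)
```

Whether to extend the last bin or to choose breaks that divide the range evenly is a design choice; either way the bin counts should sum to the number of rows.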
The histogram below confirms this observation: frequencies are highest at the low end of the range, below roughly 100, and taper off toward larger values.
hist(data$viewsPerLike, freq=TRUE, main="Frequency of Values", xlab="Views Per Like")
qqnorm(col, main="Normality of Views to Likes Ratio")
qqline(col)
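As a rough aid for reading a normal Q-Q plot, the sketch below compares a simulated normal sample with a simulated right-skewed one. Both samples, and the helper name `qq_cor`, are assumptions made for illustration; none of this uses the airline data. The correlation between theoretical and sample quantiles is close to 1 when the points hug the reference line and drops when they bend away from it.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical samples: one normal, one right-skewed (exponential)
normal_sample <- rnorm(1000, mean = 100, sd = 20)
skewed_sample <- rexp(1000, rate = 1 / 100)

# qqnorm(plot.it = FALSE) returns the theoretical (x) and sample (y) quantiles
qq_cor <- function(v) {
  qq <- qqnorm(v, plot.it = FALSE)
  cor(qq$x, qq$y)
}

cor_normal <- qq_cor(normal_sample)   # close to 1: points follow the line
cor_skewed <- qq_cor(skewed_sample)   # lower: points curve away from the line
```

This correlation is only an informal summary of the plot, not a formal normality test.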
The values reported by summary() are shown visually in the ggplot2 histogram below. The histogram is also broken down by channel and compared against the mean and quartiles of the variable over the whole data set.
ggplot(data, aes(viewsPerLike, fill = channel_title)) +
  geom_histogram(bins = 50, colour = "white") +
  geom_vline(aes(xintercept = quantile(col, 0.25), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.5), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.75), color = "Quartiles"), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = mean(col), color = "Mean"), linetype = "dashed", size = 1.5, show.legend = TRUE) +
  scale_color_manual(name = "Channels", values = c(Quartiles = "blue", Mean = "red"))
summary(col)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.13 39.62 80.73 105.01 156.19 325.33
Qatar Airways contributes 45.34% of the data set, followed by American Airlines with 26.17%, Singapore Airlines with 18.11%, and British Airways with 10.39%. The per-channel plots below show how each channel's values are distributed relative to the mean of the whole data set.
channels <- as.data.frame(data %>% count(channel_title) %>% mutate(freq = (n / sum(n))*100))
channels[order(-channels$freq),]
## channel_title n freq
## 3 Qatar Airways 681 45.33955
## 1 American Airlines 393 26.16511
## 4 Singapore Airlines 272 18.10919
## 2 British Airways 156 10.38615
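The shares can be rechecked directly from the counts printed in the table above. The sketch below does this in base R; the vector name `n` and the two-decimal rounding are choices made here for illustration.

```r
# Counts per channel as printed in the table above
n <- c("Qatar Airways" = 681, "American Airlines" = 393,
       "Singapore Airlines" = 272, "British Airways" = 156)

# Percentage share of each channel, rounded to two decimals
share <- round(100 * n / sum(n), 2)
share   # 45.34 26.17 18.11 10.39

# With the full data frame loaded, prop.table(table(data$channel_title)) * 100
# would give the same shares without dplyr.
```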
ggplot(data, aes(x = viewsPerLike, group = channel_title, fill = channel_title)) +
  stat_bin(bins = 30, aes(y = ..count..)) +
  geom_density() +
  facet_wrap(~channel_title, scales = "free") +
  geom_vline(xintercept = mean(data$viewsPerLike), size = 1.5, color = "red")
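The box plot shows the range covered by each quartile of the variable. The lower half of the values is packed into a much narrower range (14.13 to 80.73) than the upper half (80.73 to 325.33), so the points are densest below the median.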
ggplot(data, mapping = aes(x = viewsPerLike, y = 0)) +
  geom_jitter() +
  geom_boxplot(alpha = .5, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.25)), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.5)), linetype = "solid", size = 1, show.legend = TRUE) +
  geom_vline(aes(xintercept = quantile(col, 0.75)), linetype = "solid", size = 1, show.legend = TRUE)
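How unevenly the values are spread can also be read off the five-number summary printed by summary(col) earlier. The sketch below uses those rounded figures (so the results are approximate) to compute the width of each quarter of the data and the upper fence of the usual 1.5 × IQR box plot rule. The names `five`, `widths` and `upper_fence` are introduced here for illustration.

```r
# Five-number summary of viewsPerLike as printed by summary(col) (rounded values)
five <- c(min = 14.13, q1 = 39.62, median = 80.73, q3 = 156.19, max = 325.33)

# Width of the range covered by each quarter of the data
widths <- diff(five)   # roughly 25.49, 41.11, 75.46, 169.14

# Interquartile range and the upper fence of the 1.5 * IQR rule
iqr <- five[["q3"]] - five[["q1"]]
upper_fence <- five[["q3"]] + 1.5 * iqr

# TRUE means the maximum lies inside the fence
five[["max"]] < upper_fence
```

Each successive quarter spans a wider range than the one before it, which is the numerical counterpart of the points thinning out toward the right of the plot.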
The share of viewsPerLike values that fall within one, two, and three standard deviations of the mean is computed below:
stdev <- sd(col)
stdev2 <- 2*stdev
stdev3 <- 3*stdev
oneSD <- round(sum(col > mean(col) - stdev & col < mean(col) + stdev)/nrow(data) * 100,2)
twoSD <- round(sum(col > mean(col) - stdev2 & col < mean(col) + stdev2)/nrow(data) * 100,2)
threeSD <- round(sum(col > mean(col) - stdev3 & col < mean(col) + stdev3)/nrow(data) * 100,2)
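In conclusion, we check whether the variable conforms to the “68-95-99.7” rule for a normal distribution. 69.84 percent of values fall within one standard deviation of the mean, 94.94 percent fall within two standard deviations, and 100 percent fall within three. These figures are close to the 68-95-99.7 benchmarks, although the mean (105.01) sits well above the median (80.73), which indicates right skew, so the match should be read alongside the Q-Q plot rather than taken as proof of normality.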
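For reference, the benchmark percentages themselves follow from the normal distribution function: the share within k standard deviations is 2 × pnorm(k) − 1. The sketch below computes these values and compares them with a simulated normal sample. The helper `within_sd` and the simulated vector `z` are assumptions for illustration and do not use the airline data.

```r
# Share (in percent) of values strictly within k standard deviations of the mean
within_sd <- function(x, k) {
  mean(abs(x - mean(x)) < k * sd(x)) * 100
}

# Theoretical shares for a normal distribution: P(|Z| < k) = 2 * pnorm(k) - 1
theory <- (2 * pnorm(1:3) - 1) * 100
round(theory, 2)   # 68.27 95.45 99.73

# Hypothetical normal sample for comparison (not the airline data)
set.seed(42)
z <- rnorm(10000)
observed <- sapply(1:3, function(k) within_sd(z, k))
```

Calling `within_sd(col, 1)` on the loaded data would be expected to reproduce the oneSD figure above before rounding, since it applies the same strict inequalities around the mean.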
Velu