author: “Alexander Levakov”
date: “February 26, 2015”
“Logic will get you from A to Z; imagination will get you everywhere.” - Albert Einstein
We use YANDEX data (https://stat.yandex.ru/stats.xml) to compare the average number of unique visitors per day in 2013 and 2014, using both Student's t-test and the Wilcoxon test.
stat<-read.csv(file="13-14-yan.CSV",sep="\t")
summary(stat)
## X2013 X2014
## Min. :25652290 Min. :26927780
## 1st Qu.:28040452 1st Qu.:28647410
## Median :29214535 Median :29750325
## Mean :28893823 Mean :29364558
## 3rd Qu.:29703218 3rd Qu.:30290600
## Max. :31258080 Max. :31571110
plot(density(stat$X2013),col="blue",xlab="Number of unique IP average per day",ylab="Density",main="Yandex - 2013, 2014")
polygon(density(stat$X2013),col="lightgrey",border="blue")
lines(density(stat$X2014),col="red")
legend("topleft", c("X2013"), col = "blue",text.col = "blue", bg = "white")
legend("topright", c("X2014"), col = "red",text.col = "red", bg = "white")
grid()
As we see, the distributions of the two variables (X2013 and X2014) largely overlap. This is only a visual impression, so we need formal tests to support this preliminary conclusion.
See https://en.wikipedia.org/wiki/Student%27s_t-test
The data for this case have equal sample sizes and equal variance!
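For reference, the textbook form of the two-sample t statistic for equal sample sizes and equal variance can be written as below. The notation is an assumption made here, following the linked article: $\bar{X}_1$ and $\bar{X}_2$ are the sample means of X2013 and X2014, $s_{X_1}^2$ and $s_{X_2}^2$ their sample variances, and $n$ the common sample size.

$$
t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{2/n}}, \qquad
s_p = \sqrt{\frac{s_{X_1}^2 + s_{X_2}^2}{2}}
$$

The statistic has $2n - 2$ degrees of freedom, so with $n = 12$ values per year we expect $df = 22$.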
First, we test the normality and homogeneity of variance for two variables (see below).
Second, we state the null hypothesis H0: mean(X2013) = mean(X2014) and the alternative hypothesis H1: mean(X2013) ≠ mean(X2014). We use the t.test function to test H0.
t.test(stat$X2013,stat$X2014,paired=FALSE,var.equal=TRUE)
##
## Two Sample t-test
##
## data: stat$X2013 and stat$X2014
## t = -0.7553, df = 22, p-value = 0.4581
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1763177.8 821709.5
## sample estimates:
## mean of x mean of y
## 28893823 29364558
As we see, mean(X2013) and mean(X2014) do not differ significantly: the p-value (0.4581) is well above 0.05 and the 95 percent confidence interval for the difference contains 0, so there is no reason to reject H0.
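The p-value and confidence interval in the output above can be cross-checked from the printed t statistic, degrees of freedom and means alone. The sketch below is only a sanity check, not part of the original analysis; the variable names are made up for illustration, and because the inputs are rounded as printed, the results should be close to the reported values but not identical.

```r
# Values copied from the t.test output above (rounded as printed)
t_stat <- -0.7553          # reported t statistic
df     <- 22               # degrees of freedom: 12 + 12 - 2
mean_x <- 28893823         # mean of X2013
mean_y <- 29364558         # mean of X2014

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# Standard error implied by the difference in means and the t statistic
diff_means <- mean_x - mean_y
std_error  <- diff_means / t_stat

# 95 percent confidence interval for the difference in means
ci <- diff_means + c(-1, 1) * qt(0.975, df) * std_error

print(round(p_value, 4))
print(round(ci))
```

The interval straddles 0, which is the same information as a p-value above 0.05 for the two-sided test.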
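See http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test. Note that the call below treats the two samples as unpaired, so R runs the Wilcoxon rank sum test (also known as the Mann-Whitney U test) rather than the signed-rank test.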
wilcox.test(stat$X2013,stat$X2014,paired=FALSE)
##
## Wilcoxon rank sum test
##
## data: stat$X2013 and stat$X2014
## W = 59, p-value = 0.4776
## alternative hypothesis: true location shift is not equal to 0
The non-parametric method gives the same result: with a p-value of 0.4776, H0 is not rejected.
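To make the W statistic less of a black box, here is a minimal sketch of how it can be computed by hand for unpaired samples. The two small vectors are made-up numbers without ties, not the Yandex data, and the variable names are illustrative.

```r
# Small made-up samples (NOT the Yandex data), chosen without ties
x <- c(1.1, 2.3, 3.8, 5.0)
y <- c(2.0, 4.1, 6.2, 7.7)

# Rank the pooled values, then sum the ranks that belong to x
pooled_ranks <- rank(c(x, y))
rank_sum_x   <- sum(pooled_ranks[seq_along(x)])

# W as reported by wilcox.test: rank sum of x minus its smallest possible value
w_manual <- rank_sum_x - length(x) * (length(x) + 1) / 2

# Same statistic from the built-in test
w_builtin <- unname(wilcox.test(x, y, paired = FALSE)$statistic)

print(c(manual = w_manual, builtin = w_builtin))
```

Under H0 the expected value of W is n1 * n2 / 2. For the Yandex data, with 12 values per year, that is 72, and the observed W = 59 is not far from it.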
See http://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test
shapiro.test(stat$X2013)
##
## Shapiro-Wilk normality test
##
## data: stat$X2013
## W = 0.95, p-value = 0.6373
shapiro.test(stat$X2014)
##
## Shapiro-Wilk normality test
##
## data: stat$X2014
## W = 0.9395, p-value = 0.4917
Both p-values are well above 0.05, so there is no evidence against normality for either variable.
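When a data frame has several numeric columns, the same check can be run on all of them in one call. The sketch below assumes a data frame shaped like `stat`; since the CSV file is not reproduced here, it uses simulated values (the seed, mean and standard deviation are arbitrary choices, not estimates from the Yandex data).

```r
# Simulated stand-in for the data frame (NOT the Yandex figures)
set.seed(42)
sim <- data.frame(
  X2013 = rnorm(12, mean = 29e6, sd = 1.5e6),
  X2014 = rnorm(12, mean = 29e6, sd = 1.5e6)
)

# Run Shapiro-Wilk on every column and keep only the p-values
normality_p <- sapply(sim, function(col) shapiro.test(col)$p.value)

print(normality_p)
```

With the real data, replacing `sim` by `stat` should return the two p-values shown in the outputs above.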
See http://en.wikipedia.org/wiki/F-test_of_equality_of_variances
var.test(stat$X2013,stat$X2014)
##
## F test to compare two variances
##
## data: stat$X2013 and stat$X2014
## F = 1.2388, num df = 11, denom df = 11, p-value = 0.7287
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3566266 4.3032650
## sample estimates:
## ratio of variances
## 1.238813
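The p-value (0.7287) is well above 0.05 and the confidence interval for the ratio of variances contains 1, so there is no evidence against homogeneity of variance for the two variables.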
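As with the t-test, the F-test output can be cross-checked from the printed ratio of variances and the degrees of freedom. This is a sketch for verification only; the variable names are illustrative, and the results should agree with the output above up to rounding.

```r
# Values copied from the var.test output above
f_stat <- 1.238813   # ratio of sample variances, var(X2013) / var(X2014)
df_num <- 11         # numerator degrees of freedom (12 - 1)
df_den <- 11         # denominator degrees of freedom (12 - 1)

# Two-sided p-value: twice the smaller of the two tail probabilities
upper_tail <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
p_value_f  <- 2 * min(upper_tail, 1 - upper_tail)

# 95 percent confidence interval for the true ratio of variances
ci_ratio <- f_stat / qf(c(0.975, 0.025), df_num, df_den)

print(round(p_value_f, 4))
print(round(ci_ratio, 4))
```

The interval contains 1, which agrees with the large p-value.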
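First, neither the parametric nor the non-parametric test gives significant evidence against the null hypothesis: the average number of unique visitors of YANDEX per day did not differ significantly between 2013 and 2014. Failing to reject H0 means that no difference was detected in these samples; it does not prove that the means are equal.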
Second, the absence of significant growth may point not only to problems with YANDEX as a company but also to problems in the economy, i.e. a recession in business activity.
Third, web statistics could be used to predict business activity.