Student and Wilcoxon tests for YANDEX statistics

author: “Alexander Levakov”

date: “February 26, 2015”

“Logic will get you from A to Z; imagination will get you everywhere.” - Albert Einstein

Research goal

We use YANDEX data (https://stat.yandex.ru/stats.xml) to compare the average number of unique visitors per day in 2013 and 2014, using both Student's t-test and the Wilcoxon test.

Data

stat<-read.csv(file="13-14-yan.CSV",sep="\t")

Preliminary analysis

summary(stat)

##      X2013              X2014         
##  Min.   :25652290   Min.   :26927780  
##  1st Qu.:28040452   1st Qu.:28647410  
##  Median :29214535   Median :29750325  
##  Mean   :28893823   Mean   :29364558  
##  3rd Qu.:29703218   3rd Qu.:30290600  
##  Max.   :31258080   Max.   :31571110

plot(density(stat$X2013),col="blue",main="Yandex - 2013, 2014",xlab="Number of unique IP average per day",ylab="Density")
polygon(density(stat$X2013),col="lightgrey",border="blue")
lines(density(stat$X2014),col="red")
legend("topleft", c("X2013"), col = "blue",text.col = "blue", bg = "white")
legend("topright", c("X2014"), col = "red",text.col = "red", bg = "white")
grid()
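
When two density curves are overlaid, the axes are fixed by the first curve drawn, so the second one can be clipped. A minimal sketch of one way around this is to compute common axis limits first. The data frame `demo`, the seed, and the distribution parameters are hypothetical stand-ins, not the Yandex figures.

```r
# Hypothetical data: 12 values per year (not the Yandex file).
set.seed(4)
demo <- data.frame(X2013 = rnorm(12, mean = 29e6,   sd = 1.5e6),
                   X2014 = rnorm(12, mean = 29.4e6, sd = 1.5e6))

d13 <- density(demo$X2013)
d14 <- density(demo$X2014)

# Axis limits wide enough to hold both curves
xlim <- range(d13$x, d14$x)
ylim <- range(d13$y, d14$y)

plot(d13, col = "blue", xlim = xlim, ylim = ylim,
     main = "Hypothetical data - 2013, 2014",
     xlab = "Average number of unique visitors per day", ylab = "Density")
lines(d14, col = "red")
# A single legend with one entry per curve
legend("topleft", legend = c("X2013", "X2014"),
       col = c("blue", "red"), lty = 1, bg = "white")
grid()
```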

Note

The two distributions (X2013 and X2014) look broadly similar. This is only a visual impression, so we need formal tests to support it.

Student test

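$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{2/n}}, \quad \text{where} \quad s_p = \sqrt{\frac{s_{X_1}^2 + s_{X_2}^2}{2}}$$

Here n is the common sample size and s_p is the pooled standard deviation.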

See https://en.wikipedia.org/wiki/Student%27s_t-test

In this case both samples have the same size (12 observations each) and are assumed to have equal variance.

First, we test the normality and the homogeneity of variance of the two variables (see the Shapiro-Wilk and variance tests below).

Second, we state the null hypothesis H0: mean(X2013) = mean(X2014) and the alternative hypothesis H1: mean(X2013) ≠ mean(X2014). We use the t.test function to test H0.

t.test(stat$X2013,stat$X2014,paired=FALSE,var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  stat$X2013 and stat$X2014
## t = -0.7553, df = 22, p-value = 0.4581
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1763177.8   821709.5
## sample estimates:
## mean of x mean of y 
##  28893823  29364558

Note

The means of X2013 and X2014 do not differ significantly. With t = -0.7553 and a p-value of 0.4581, there is no reason to reject H0, and the 95 percent confidence interval for the difference in means includes 0.
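
The pooled two-sample statistic can also be reproduced by hand, which makes it clearer what the equal-variance t-test computes: the difference in means divided by the pooled standard deviation times sqrt(2/n). The sketch below is only an illustration on simulated data; the vectors `x` and `y`, the seed, and the distribution parameters are hypothetical, not the Yandex figures.

```r
# Hypothetical data: two samples of 12 values each (not the Yandex file).
set.seed(1)
x <- rnorm(12, mean = 29e6,   sd = 1.5e6)
y <- rnorm(12, mean = 29.4e6, sd = 1.5e6)

n  <- length(x)                       # equal sample sizes are assumed
sp <- sqrt((var(x) + var(y)) / 2)     # pooled standard deviation for equal n
t_manual <- (mean(x) - mean(y)) / (sp * sqrt(2 / n))
df <- 2 * n - 2                       # degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df)  # two-sided p-value

# Compare with the built-in test
res <- t.test(x, y, paired = FALSE, var.equal = TRUE)
print(c(manual = t_manual, builtin = unname(res$statistic)))
print(c(manual = p_manual, builtin = res$p.value))
```

The manual and built-in values should agree up to floating-point error. With 12 observations per sample the test has 22 degrees of freedom, as in the output above.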

Wilcoxon test

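See http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test for the paired (signed-rank) variant. Because the two samples are treated as unpaired here, wilcox.test runs the Wilcoxon rank sum test (also known as the Mann-Whitney U test), as the output header confirms.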

wilcox.test(stat$X2013,stat$X2014,paired=FALSE)

## 
##  Wilcoxon rank sum test
## 
## data:  stat$X2013 and stat$X2014
## W = 59, p-value = 0.4776
## alternative hypothesis: true location shift is not equal to 0

Note

The non-parametric test leads to the same conclusion: with W = 59 and a p-value of 0.4776, H0 is not rejected.
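
The W statistic of the rank sum test has a simple interpretation: it is the sum of the ranks of the first sample in the pooled data, minus its smallest possible value, which (without ties) equals the number of pairs in which a value from the first sample exceeds a value from the second. A minimal sketch on simulated data follows; `x`, `y`, and the seed are hypothetical, not the Yandex figures.

```r
# Hypothetical data: two samples of 12 values each (not the Yandex file).
set.seed(2)
x <- rnorm(12, mean = 29e6,   sd = 1.5e6)
y <- rnorm(12, mean = 29.4e6, sd = 1.5e6)

n1 <- length(x)
r  <- rank(c(x, y))                            # joint ranks of the pooled sample
w_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Equivalent view: count the pairs (x_i, y_j) with x_i > y_j
w_pairs <- sum(outer(x, y, ">"))

res <- wilcox.test(x, y, paired = FALSE)
print(c(manual = w_manual, pairs = w_pairs, builtin = unname(res$statistic)))
```

With 12 observations per sample, W can range from 0 to 144, so a value near 72 indicates samples that overlap heavily.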

Normality Shapiro-Wilk test

See http://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test

shapiro.test(stat$X2013)

## 
##  Shapiro-Wilk normality test
## 
## data:  stat$X2013
## W = 0.95, p-value = 0.6373

shapiro.test(stat$X2014)

## 
##  Shapiro-Wilk normality test
## 
## data:  stat$X2014
## W = 0.9395, p-value = 0.4917

Note

The Shapiro-Wilk test does not reject normality for either variable (p-values 0.6373 and 0.4917).
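
When several columns need the same check, the test can be applied to all of them in one call. The sketch below assumes a data frame with one numeric column per year; `demo` and its simulated values are hypothetical, not the Yandex file.

```r
# Hypothetical data frame with one column per year (not the Yandex file).
set.seed(3)
demo <- data.frame(X2013 = rnorm(12, mean = 29e6,   sd = 1.5e6),
                   X2014 = rnorm(12, mean = 29.4e6, sd = 1.5e6))

# Run shapiro.test on every column and keep only the p-values
p_values <- sapply(demo, function(v) shapiro.test(v)$p.value)
print(round(p_values, 4))
```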

Variance test

See http://en.wikipedia.org/wiki/F-test_of_equality_of_variances

var.test(stat$X2013,stat$X2014)

## 
##  F test to compare two variances
## 
## data:  stat$X2013 and stat$X2014
## F = 1.2388, num df = 11, denom df = 11, p-value = 0.7287
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3566266 4.3032650
## sample estimates:
## ratio of variances 
##           1.238813

Note

The F test does not reject equality of variances (p-value = 0.7287), so the equal-variance assumption of the t-test is reasonable.
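
The p-value and confidence interval in the output above follow directly from the F distribution. As a rough check, the sketch below recomputes them from the reported ratio of variances and degrees of freedom; the variable names are illustrative, and the numbers are copied from the output rather than recomputed from the data.

```r
# Values copied from the var.test output above
f_ratio <- 1.238813
df1 <- 11
df2 <- 11

# Two-sided p-value: twice the smaller tail probability
p_upper <- pf(f_ratio, df1, df2, lower.tail = FALSE)
p_two_sided <- 2 * min(p_upper, 1 - p_upper)

# 95 percent confidence interval for the ratio of variances
ci <- c(f_ratio / qf(0.975, df1, df2),
        f_ratio / qf(0.025, df1, df2))

print(round(p_two_sided, 4))
print(round(ci, 4))
```

The results should agree with the reported p-value and interval up to rounding.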

Conclusions

First, neither the parametric nor the non-parametric test finds a statistically significant difference, so we cannot reject the null hypothesis: the data give no evidence that the average number of unique visitors of YANDEX per day differed between 2013 and 2014.

Second, the absence of growth may point not only to problems at YANDEX as a company but also to problems in the economy, i.e. a recession in business activity.

Third, web statistics could be used for prediction of business activity.

http://en.wikipedia.org/wiki/Economics