Comparing Batches

Part I - Comparing Weekend and Weekday Web Visits

Using the Google Analytics tool, let us suppose that your instructor collects the number of daily visits to the website that promotes his Bayesian R book.

He collects the following visit counts for 20 weekend days (Saturday or Sunday): 11 10 12 15 8 24 14 11 6 11 6 12 10 14 7 10 17 5 3 4

He collects the following visit counts for 20 weekdays (Tuesday or Wednesday): 18 21 27 23 27 28 25 19 26 32 51 45 32 19 15 37 34 32 32 30

He wishes to compare the weekend visit counts with the weekday visit counts.

We will create Weekday and Weekend vectors based on the above info.

knitr::opts_chunk$set(error = TRUE)

We set this knitr option so that the document still knits when the slider.compare calls fail; those calls only run inside RStudio.

library(LearnEDA)

## Loading required package: vcd

## Loading required package: grid

## Loading required package: manipulate

Weekend=c(11,10,12,15,8,24,14,11,6,11,6,12,10,14,7,10,17,5,3,4)
fivenum(Weekend)

## [1]  3.0  6.5 10.5 13.0 24.0

Weekday=c(18,21,27,23,27,28,25,19,26,32,51,45,32,19,15,37,34,32,32,30)
fivenum(Weekday)

## [1] 15.0 22.0 27.5 32.0 51.0

We will combine the two vectors with the data.frame function and then use the stack function to reshape the data into long format.

The resulting data frame WebVisit has two variables: “values” contains the response (the visit counts) and “ind” contains the group (Weekend or Weekday).

Visit=data.frame(Weekend,Weekday)
Visit

##    Weekend Weekday
## 1       11      18
## 2       10      21
## 3       12      27
## 4       15      23
## 5        8      27
## 6       24      28
## 7       14      25
## 8       11      19
## 9        6      26
## 10      11      32
## 11       6      51
## 12      12      45
## 13      10      32
## 14      14      19
## 15       7      15
## 16      10      37
## 17      17      34
## 18       5      32
## 19       3      32
## 20       4      30

WebVisit=stack(data.frame(Weekend,Weekday))
WebVisit

##    values     ind
## 1      11 Weekend
## 2      10 Weekend
## 3      12 Weekend
## 4      15 Weekend
## 5       8 Weekend
## 6      24 Weekend
## 7      14 Weekend
## 8      11 Weekend
## 9       6 Weekend
## 10     11 Weekend
## 11      6 Weekend
## 12     12 Weekend
## 13     10 Weekend
## 14     14 Weekend
## 15      7 Weekend
## 16     10 Weekend
## 17     17 Weekend
## 18      5 Weekend
## 19      3 Weekend
## 20      4 Weekend
## 21     18 Weekday
## 22     21 Weekday
## 23     27 Weekday
## 24     23 Weekday
## 25     27 Weekday
## 26     28 Weekday
## 27     25 Weekday
## 28     19 Weekday
## 29     26 Weekday
## 30     32 Weekday
## 31     51 Weekday
## 32     45 Weekday
## 33     32 Weekday
## 34     19 Weekday
## 35     15 Weekday
## 36     37 Weekday
## 37     34 Weekday
## 38     32 Weekday
## 39     32 Weekday
## 40     30 Weekday

The data appear to be stacked correctly: 20 Weekend rows followed by 20 Weekday rows. We can compare the weekend and weekday visit counts using the “slider.compare” function.
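A quick way to confirm that the stacking worked is sketched below. It rebuilds the two batches from the counts listed earlier, counts the rows per group, and reverses the stack with unstack. The names group_sizes and round_trip are illustrative and do not appear in the original analysis.

```r
# Rebuild the two batches from the counts listed earlier
Weekend <- c(11, 10, 12, 15, 8, 24, 14, 11, 6, 11, 6, 12, 10, 14, 7, 10, 17, 5, 3, 4)
Weekday <- c(18, 21, 27, 23, 27, 28, 25, 19, 26, 32, 51, 45, 32, 19, 15, 37, 34, 32, 32, 30)
WebVisit <- stack(data.frame(Weekend, Weekday))

# Each group should contribute 20 rows
group_sizes <- table(WebVisit$ind)
print(group_sizes)

# unstack() reverses stack(); its columns should match the original vectors
round_trip <- unstack(WebVisit)
print(all(round_trip$Weekend == Weekend, round_trip$Weekday == Weekday))
```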

slider.compare(WebVisit$values,WebVisit$ind)

## Error in manipulate(power.plot(power, y, group), power = manipulate::slider(-2, : The manipulate package must be run from within RStudio
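Because the interactive display only runs inside RStudio, a static stand-in can be sketched with base R graphics. The helper reexpress is hypothetical: it applies a plain power and treats power 0 as the log. This may differ in scaling from what slider.compare does internally, so the axis values need not match the interactive plot.

```r
Weekend <- c(11, 10, 12, 15, 8, 24, 14, 11, 6, 11, 6, 12, 10, 14, 7, 10, 17, 5, 3, 4)
Weekday <- c(18, 21, 27, 23, 27, 28, 25, 19, 26, 32, 51, 45, 32, 19, 15, 37, 34, 32, 32, 30)
WebVisit <- stack(data.frame(Weekend, Weekday))

# Hypothetical helper: power reexpression, with power 0 treated as the log
reexpress <- function(x, p) if (p == 0) log(x) else x^p

# One horizontal parallel boxplot per power, stacked in a single figure
powers <- c(1, 0.5, 0)
old_par <- par(mfrow = c(length(powers), 1))
for (p in powers) {
  boxplot(reexpress(values, p) ~ ind, data = WebVisit, horizontal = TRUE,
          main = paste("power =", p), xlab = "reexpressed visits", ylab = "")
}
par(old_par)  # restore the previous layout
```

The three panels correspond to the slider positions 1, 0.5, and 0 discussed in this part.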

What do we see? Weekday has a larger median (27.5) than Weekend (10.5). Within each box the median sits closer to the upper fourth, so the middle half of each batch looks skewed to the left, more noticeably for Weekend. The spread is much bigger for Weekday than for Weekend, and the Weekday whiskers are longer than the Weekend whiskers. The two Weekend whiskers are about the same length. The Weekday box is longer than the Weekend box, which indicates that Weekday has the larger fourth spread (IQR): 10 versus 6.5 from the five-number summaries. Because the batch with the higher level also has the larger spread, the display suggests some dependence between spread and level.
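The level and spread read off the boxplot can be checked numerically. The sketch below assumes the fourth spread is the distance between the lower and upper fourths returned by fivenum; the helper name fourth_spread is illustrative.

```r
Weekend <- c(11, 10, 12, 15, 8, 24, 14, 11, 6, 11, 6, 12, 10, 14, 7, 10, 17, 5, 3, 4)
Weekday <- c(18, 21, 27, 23, 27, 28, 25, 19, 26, 32, 51, 45, 32, 19, 15, 37, 34, 32, 32, 30)

# Fourth spread: upper fourth minus lower fourth, taken from fivenum()
fourth_spread <- function(x) diff(fivenum(x)[c(2, 4)])

batches <- list(Weekend = Weekend, Weekday = Weekday)
level   <- sapply(batches, median)         # Weekend 10.5, Weekday 27.5
spread  <- sapply(batches, fourth_spread)  # Weekend 6.5,  Weekday 10
print(rbind(level, spread))
```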

What about the outliers? How many? Which group?
Each batch has one outlier, and in both cases it lies beyond the upper whisker. The nice thing about the box plot is that it shows the outliers clearly.
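The outliers can also be listed without reading them off the plot. This sketch uses boxplot.stats from base R, which flags points more than 1.5 fourth spreads beyond the fourths. It assumes the display uses that same default rule.

```r
Weekend <- c(11, 10, 12, 15, 8, 24, 14, 11, 6, 11, 6, 12, 10, 14, 7, 10, 17, 5, 3, 4)
Weekday <- c(18, 21, 27, 23, 27, 28, 25, 19, 26, 32, 51, 45, 32, 19, 15, 37, 34, 32, 32, 30)

# $out holds the values lying beyond the whiskers
weekend_out <- boxplot.stats(Weekend)$out
weekday_out <- boxplot.stats(Weekday)$out
print(weekend_out)  # 24
print(weekday_out)  # 51
```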

Using the slider.compare function, we can also change the power of the reexpression (transformation) to 0.5 and compare the spreads of the two groups.

For the data reexpressed with power 0.5, the spreads of the two batches look very similar, so spread no longer appears to depend on level. The reexpression has succeeded in equalizing the spreads. The Weekday data still have an outlier, but the Weekend data no longer have one. The whiskers are about the same length for both batches.

Now, let us change the power of the reexpression to 0 (which corresponds to the log reexpression) and see what happens.

For the data reexpressed with power 0 (logs), the spread of Weekend seems to be bigger than that of Weekday. The Weekend whiskers are also longer than the Weekday whiskers.

Of the three reexpressions (powers 1, 0.5, and 0), power 0.5 is the most successful: it gives the two batches nearly identical spreads and removes the dependence of spread on level.
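The comparison of the three powers can be put in numbers. The sketch below computes the fourth spread of each batch under each reexpression and the ratio of the two spreads; a ratio near 1 means the boxes have similar lengths. The helpers fourth_spread and reexpress are illustrative, and reexpress uses a plain power (log at power 0), which may be scaled differently from slider.compare. Any common rescaling cancels in the ratio.

```r
Weekend <- c(11, 10, 12, 15, 8, 24, 14, 11, 6, 11, 6, 12, 10, 14, 7, 10, 17, 5, 3, 4)
Weekday <- c(18, 21, 27, 23, 27, 28, 25, 19, 26, 32, 51, 45, 32, 19, 15, 37, 34, 32, 32, 30)

# Illustrative helpers (not part of LearnEDA)
fourth_spread <- function(x) diff(fivenum(x)[c(2, 4)])
reexpress     <- function(x, p) if (p == 0) log(x) else x^p

# One row per power: the fourth spread of each reexpressed batch
powers <- c(1, 0.5, 0)
spread_table <- t(sapply(powers, function(p) {
  c(power   = p,
    Weekend = fourth_spread(reexpress(Weekend, p)),
    Weekday = fourth_spread(reexpress(Weekday, p)))
}))

# Ratio of the Weekday spread to the Weekend spread at each power
spread_table <- cbind(spread_table,
                      ratio = spread_table[, "Weekday"] / spread_table[, "Weekend"])
print(round(spread_table, 3))
```

On the raw scale the Weekday spread is larger, on the log scale the Weekend spread is larger, and at power 0.5 the ratio is closest to 1.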

Using the best reexpression, power = 0.5, we can see how the weekend counts differ from the weekday counts. The boxes are nearly the same length for both batches, which indicates that they have about the same fourth spread (IQR). Since the spreads are about the same, we can express one batch as a function of the other using the difference in medians.

This is a good reexpression because spread no longer depends on level, so comparing the medians is the best approach. We can say

        Weekday Visits = Weekend Visits + 17

which means that a typical weekday has about 17 more visits than a typical weekend day (the difference between the raw medians, 27.5 and 10.5).
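The shift of 17 can be reproduced directly from the two batches. This short sketch computes the difference between the raw medians; the variable name median_shift is illustrative.

```r
Weekend <- c(11, 10, 12, 15, 8, 24, 14, 11, 6, 11, 6, 12, 10, 14, 7, 10, 17, 5, 3, 4)
Weekday <- c(18, 21, 27, 23, 27, 28, 25, 19, 26, 32, 51, 45, 32, 19, 15, 37, 34, 32, 32, 30)

# Difference between the raw medians: 27.5 - 10.5
median_shift <- median(Weekday) - median(Weekend)
print(median_shift)  # 17
```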

Part II - Snowfall

Now let us work with the “snowfall” dataset in R. The data give the seasonal snowfall in inches in Buffalo, NY and Cairo, IL from 1918-19 through 1937-38.

We will load the data and use the slider.compare function to construct parallel box plots of the Buffalo and Cairo snowfall amounts. We will also check whether there is a dependence between spread and level.

head(snowfall)

##   Season    City Snowfall
## 1   1918 Buffalo     25.0
## 2   1919 Buffalo     69.4
## 3   1920 Buffalo     53.5
## 4   1921 Buffalo     39.8
## 5   1922 Buffalo     63.6
## 6   1923 Buffalo     46.7

attach(snowfall)
snowfall

##    Season    City Snowfall
## 1    1918 Buffalo     25.0
## 2    1919 Buffalo     69.4
## 3    1920 Buffalo     53.5
## 4    1921 Buffalo     39.8
## 5    1922 Buffalo     63.6
## 6    1923 Buffalo     46.7
## 7    1924 Buffalo     72.9
## 8    1925 Buffalo     79.6
## 9    1926 Buffalo     83.6
## 10   1927 Buffalo     80.7
## 11   1928 Buffalo     60.3
## 12   1929 Buffalo     79.0
## 13   1930 Buffalo     64.8
## 14   1931 Buffalo     49.6
## 15   1932 Buffalo     54.7
## 16   1933 Buffalo     71.8
## 17   1934 Buffalo     49.1
## 18   1935 Buffalo    103.9
## 19   1936 Buffalo     51.6
## 20   1937 Buffalo     81.6
## 21   1918   Cairo      1.8
## 22   1919   Cairo      4.5
## 23   1920   Cairo     13.9
## 24   1921   Cairo      4.0
## 25   1922   Cairo      1.2
## 26   1923   Cairo      6.8
## 27   1924   Cairo      7.2
## 28   1925   Cairo     11.5
## 29   1926   Cairo      6.2
## 30   1927   Cairo      0.4
## 31   1928   Cairo     11.5
## 32   1929   Cairo     12.4
## 33   1930   Cairo     11.3
## 34   1931   Cairo      2.9
## 35   1932   Cairo      7.4
## 36   1933   Cairo      2.7
## 37   1934   Cairo      1.6
## 38   1935   Cairo     14.1
## 39   1936   Cairo      5.4
## 40   1937   Cairo      3.0

tapply(snowfall$Snowfall,snowfall$City,median)

## Buffalo   Cairo 
##    64.2     5.8

boxplot(Snowfall~City,horizontal=TRUE,data=snowfall,xlab="Snowfall",ylab="City")

From the box plot we can see that the spread is bigger for Buffalo than for Cairo. The median is also very small for Cairo compared with Buffalo. If the boxes were the same length, the two batches would have the same spread, which is not the case here. Instead, the spread increases with the level, so the two are related. The goal is to remove this dependence so that the batches have similar spreads; we can then compare the medians and describe one batch as a function of the other.
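One standard EDA heuristic for choosing the power, not used in the original analysis, is the spread-versus-level rule: plot log(fourth spread) against log(median) across batches, take the slope b, and try power 1 - b. With only two batches the slope is just the ratio of the two differences. The sketch below retypes the snowfall values from the listing above so that it runs without the package; fourth_spread is an illustrative helper.

```r
# Snowfall in inches, retyped from the listing above (seasons 1918 to 1937)
Buffalo <- c(25.0, 69.4, 53.5, 39.8, 63.6, 46.7, 72.9, 79.6, 83.6, 80.7,
             60.3, 79.0, 64.8, 49.6, 54.7, 71.8, 49.1, 103.9, 51.6, 81.6)
Cairo   <- c(1.8, 4.5, 13.9, 4.0, 1.2, 6.8, 7.2, 11.5, 6.2, 0.4,
             11.5, 12.4, 11.3, 2.9, 7.4, 2.7, 1.6, 14.1, 5.4, 3.0)

# Illustrative helper: upper fourth minus lower fourth from fivenum()
fourth_spread <- function(x) diff(fivenum(x)[c(2, 4)])

cities <- list(Cairo = Cairo, Buffalo = Buffalo)
level  <- sapply(cities, median)
spread <- sapply(cities, fourth_spread)

# Slope of log(spread) on log(level); the rule suggests power = 1 - slope
slope <- unname(diff(log(spread)) / diff(log(level)))
suggested_power <- 1 - slope
print(round(suggested_power, 2))
```

For these data the suggested power comes out close to 0.5, which agrees with the power chosen with the slider.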

slider.compare(snowfall$Snowfall,snowfall$City)

## Error in manipulate(power.plot(power, y, group), power = manipulate::slider(-2, : The manipulate package must be run from within RStudio

At power = 1, we can see some dependence between spread and level: the Cairo box is much shorter than the Buffalo box.

At power = 0.5, the boxes appear to be the same length; that is, the fourth spread (IQR) is about the same for both batches, which indicates that spread no longer depends on level.

At power = 0, the spread of Cairo appears larger than that of Buffalo, and there is also an outlier for Buffalo. This is probably not the right reexpression, since the spreads are not the same.

At power = -0.7, the Buffalo box shrinks almost to a line: the spread of Buffalo becomes negligible while the spread of Cairo is much bigger.
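The four slider settings can be compared numerically through the ratio of the Buffalo fourth spread to the Cairo fourth spread after reexpression. This is a sketch under stated assumptions: the data are retyped from the listing above into a data frame named snow (a hypothetical name), and reexpress applies a plain power with the log at power 0, which may be scaled differently from slider.compare. The ratio is unaffected by any common rescaling.

```r
Buffalo <- c(25.0, 69.4, 53.5, 39.8, 63.6, 46.7, 72.9, 79.6, 83.6, 80.7,
             60.3, 79.0, 64.8, 49.6, 54.7, 71.8, 49.1, 103.9, 51.6, 81.6)
Cairo   <- c(1.8, 4.5, 13.9, 4.0, 1.2, 6.8, 7.2, 11.5, 6.2, 0.4,
             11.5, 12.4, 11.3, 2.9, 7.4, 2.7, 1.6, 14.1, 5.4, 3.0)
snow <- data.frame(City = rep(c("Buffalo", "Cairo"), each = 20),
                   Snowfall = c(Buffalo, Cairo))

# Illustrative helpers (not part of LearnEDA)
fourth_spread <- function(x) diff(fivenum(x)[c(2, 4)])
reexpress     <- function(x, p) if (p == 0) log(x) else x^p

# Ratio of Buffalo's fourth spread to Cairo's at a given power
ratio_at <- function(p) {
  s <- tapply(reexpress(snow$Snowfall, p), snow$City, fourth_spread)
  unname(s["Buffalo"] / s["Cairo"])
}

powers <- c(1, 0.5, 0, -0.7)
ratios <- sapply(powers, ratio_at)
print(round(data.frame(power = powers, ratio = ratios), 3))
```

A ratio near 1 means boxes of similar length. The ratio is well above 1 at power 1, closest to 1 at power 0.5, below 1 at power 0, and very small at power -0.7, where the Buffalo box collapses.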

Using the data reexpressed at power = 0.5, we can compare the Buffalo and Cairo snowfall amounts. We can express the relationship as

      Buffalo Snowfall = Cairo Snowfall + 58.4

With the spreads being similar, it is fair to compare the medians and express one batch as a function of the other; 58.4 is the difference between the medians computed above (64.2 and 5.8). Overall, we were able to compare the two batches of data.

Suresh Gajapathy

September 14, 2016