Part I - Comparing Weekend and Weekday Web Visits
Using the Google Analytics tool, let us suppose that your instructor collects the number of daily visits to the website that promotes his Bayesian R book.
He collects the following visit counts for 20 weekend days (Saturday or Sunday): 11 10 12 15 8 24 14 11 6 11 6 12 10 14 7 10 17 5 3 4
He collects the following visit counts for 20 weekdays (Tuesday or Wednesday): 18 21 27 23 27 28 25 19 26 32 51 45 32 19 15 37 34 32 32 30
He wishes to compare the weekend visit counts with the weekday visit counts.
We will create Weekday and Weekend vectors based on the above info.
knitr::opts_chunk$set(error = TRUE)
We set error = TRUE above so that knitting continues even when the slider.compare calls cannot run outside RStudio.
library(LearnEDA)
## Loading required package: vcd
## Loading required package: grid
## Loading required package: manipulate
Weekend=c(11,10,12,15,8,24,14,11,6,11,6,12,10,14,7,10,17,5,3,4)
fivenum(Weekend)
## [1] 3.0 6.5 10.5 13.0 24.0
Weekday=c(18,21,27,23,27,28,25,19,26,32,51,45,32,19,15,37,34,32,32,30)
fivenum(Weekday)
## [1] 15.0 22.0 27.5 32.0 51.0
We will combine the two vectors with the data.frame function and then use the stack function to reshape the data into a long format, with one row per observation.
The data frame WebVisit has two variables - “values” contains the response (the visit counts) and “ind” contains the group (Weekend or Weekday).
Visit=data.frame(Weekend,Weekday)
Visit
## Weekend Weekday
## 1 11 18
## 2 10 21
## 3 12 27
## 4 15 23
## 5 8 27
## 6 24 28
## 7 14 25
## 8 11 19
## 9 6 26
## 10 11 32
## 11 6 51
## 12 12 45
## 13 10 32
## 14 14 19
## 15 7 15
## 16 10 37
## 17 17 34
## 18 5 32
## 19 3 32
## 20 4 30
WebVisit=stack(data.frame(Weekend,Weekday))
WebVisit
## values ind
## 1 11 Weekend
## 2 10 Weekend
## 3 12 Weekend
## 4 15 Weekend
## 5 8 Weekend
## 6 24 Weekend
## 7 14 Weekend
## 8 11 Weekend
## 9 6 Weekend
## 10 11 Weekend
## 11 6 Weekend
## 12 12 Weekend
## 13 10 Weekend
## 14 14 Weekend
## 15 7 Weekend
## 16 10 Weekend
## 17 17 Weekend
## 18 5 Weekend
## 19 3 Weekend
## 20 4 Weekend
## 21 18 Weekday
## 22 21 Weekday
## 23 27 Weekday
## 24 23 Weekday
## 25 27 Weekday
## 26 28 Weekday
## 27 25 Weekday
## 28 19 Weekday
## 29 26 Weekday
## 30 32 Weekday
## 31 51 Weekday
## 32 45 Weekday
## 33 32 Weekday
## 34 19 Weekday
## 35 15 Weekday
## 36 37 Weekday
## 37 34 Weekday
## 38 32 Weekday
## 39 32 Weekday
## 40 30 Weekday
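One quick way to confirm the stacking is to count the rows per group and recompute the group medians from the long-format frame. The sketch below rebuilds the vectors from the counts listed above; the names group_sizes and group_medians are illustrative only.

```r
# Rebuild the long-format data frame from the two vectors given above.
Weekend <- c(11, 10, 12, 15, 8, 24, 14, 11, 6, 11, 6, 12, 10, 14, 7, 10, 17, 5, 3, 4)
Weekday <- c(18, 21, 27, 23, 27, 28, 25, 19, 26, 32, 51, 45, 32, 19, 15, 37, 34, 32, 32, 30)
WebVisit <- stack(data.frame(Weekend, Weekday))

# Each group should contribute 20 rows.
group_sizes <- table(WebVisit$ind)
print(group_sizes)

# Group medians computed from the stacked layout; they should agree
# with the middle values reported by fivenum() earlier.
group_medians <- tapply(WebVisit$values, WebVisit$ind, median)
print(group_medians)
```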
The data appear to be stacked correctly. We can now compare the weekend and weekday visit counts using the “slider.compare” function.
slider.compare(WebVisit$values,WebVisit$ind)
## Error in manipulate(power.plot(power, y, group), power = manipulate::slider(-2, : The manipulate package must be run from within RStudio
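Because slider.compare depends on manipulate and only runs inside RStudio, a static stand-in can be sketched in base R for the knitted document. The helpers reexpress and static_compare below are hypothetical (they are not part of LearnEDA), and the matched power form (y^p - 1)/p, with the log at p = 0, is an assumption about the kind of reexpression the slider applies.

```r
Weekend <- c(11, 10, 12, 15, 8, 24, 14, 11, 6, 11, 6, 12, 10, 14, 7, 10, 17, 5, 3, 4)
Weekday <- c(18, 21, 27, 23, 27, 28, 25, 19, 26, 32, 51, 45, 32, 19, 15, 37, 34, 32, 32, 30)
WebVisit <- stack(data.frame(Weekend, Weekday))

# Hypothetical helper (assumption, not from LearnEDA): matched power
# reexpression, using the log when p = 0.
reexpress <- function(y, p) {
  if (p == 0) log(y) else (y^p - 1) / p
}

# Hypothetical helper: parallel boxplots of the reexpressed values
# for one fixed power p, drawn without any interactive slider.
static_compare <- function(y, group, p = 1) {
  boxplot(split(reexpress(y, p), group), horizontal = TRUE,
          xlab = paste("Reexpressed data, power =", p), ylab = "Group")
}

static_compare(WebVisit$values, WebVisit$ind, p = 1)  # raw scale (shifted by 1)
```

Calling static_compare again with p = 0.5 or p = 0 gives the displays that the slider would otherwise show at those settings.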
What do we see? Weekday has a larger median (27.5) than Weekend (10.5). In both boxes the median sits closer to the upper fourth, so the middle half of each batch looks skewed to the left, more prominently for Weekend. The spread is much bigger for Weekday than for Weekend: the Weekday box is longer, indicating a larger fourth spread (IQR), and the Weekday tails are longer as well. The two whiskers of the Weekend data are about the same length. Because the batch with the higher level (Weekday) also has the larger spread, the display suggests that spread increases with level.
What about the outliers? How many? Which group?
There is one outlier beyond the upper whisker in each batch (24 for Weekend and 51 for Weekday under the usual 1.5 × fourth-spread rule). The nice thing about the box plot is that it shows the outliers clearly.
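The outliers can also be listed without the plot. This is a small sketch using base R's boxplot.stats, which flags points lying more than 1.5 fourth spreads beyond the hinges; it assumes the display uses that same rule.

```r
Weekend <- c(11, 10, 12, 15, 8, 24, 14, 11, 6, 11, 6, 12, 10, 14, 7, 10, 17, 5, 3, 4)
Weekday <- c(18, 21, 27, 23, 27, 28, 25, 19, 26, 32, 51, 45, 32, 19, 15, 37, 34, 32, 32, 30)

# $out holds the observations beyond the whiskers (default coef = 1.5).
weekend_out <- boxplot.stats(Weekend)$out
weekday_out <- boxplot.stats(Weekday)$out

print(weekend_out)
print(weekday_out)
```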
Using the slider.compare function, we can also change the power of the reexpression (transformation) to 0.5 and compare the spreads of the two groups.
With the power set to 0.5, the spreads of the two batches look very similar, so the spread no longer depends on the level; in that sense the transformation has been successful. The Weekday data still has an outlier, but the Weekend data no longer has one. The whiskers are about the same length in both batches.
Now, let us change the power of the reexpression to 0 (which corresponds to the log reexpression) and see what happens.
With the power set to 0 (log values), the spread of Weekend seems to be bigger than that of Weekday. The whiskers are also longer for Weekend than for Weekday.
Of the three reexpressions (powers 1, 0.5 and 0), the power 0.5 is the most successful: it gives the two batches nearly identical spreads and removes the dependence between spread and level.
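The visual comparison of the three powers can be backed up numerically by tabulating the fourth spread of each batch at each power. This is a sketch only: reexpress and fourth_spread are hypothetical helpers, and the matched power form (y^p - 1)/p with the log at p = 0 is an assumption about the reexpression behind the slider.

```r
Weekend <- c(11, 10, 12, 15, 8, 24, 14, 11, 6, 11, 6, 12, 10, 14, 7, 10, 17, 5, 3, 4)
Weekday <- c(18, 21, 27, 23, 27, 28, 25, 19, 26, 32, 51, 45, 32, 19, 15, 37, 34, 32, 32, 30)

# Hypothetical helper (assumption): matched power reexpression.
reexpress <- function(y, p) {
  if (p == 0) log(y) else (y^p - 1) / p
}

# Fourth spread = upper hinge minus lower hinge, taken from fivenum().
fourth_spread <- function(y) {
  h <- fivenum(y)
  h[4] - h[2]
}

powers <- c(1, 0.5, 0)
spreads <- t(sapply(powers, function(p) {
  c(Weekend = fourth_spread(reexpress(Weekend, p)),
    Weekday = fourth_spread(reexpress(Weekday, p)))
}))
rownames(spreads) <- paste("power", powers)
print(round(spreads, 2))

# A ratio near 1 means the two batches have similar spreads at that power.
spread_ratio <- spreads[, "Weekday"] / spreads[, "Weekend"]
print(round(spread_ratio, 2))
```

The power whose ratio is closest to 1 is the one that best equalizes the spreads.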
Using the best reexpression, power = 0.5, we can see how the weekend counts differ from the weekday counts. The boxes are pretty much the same length for both batches, which indicates that they have about the same fourth spread (IQR). Since the spreads are similar, we can express one batch as a function of the other through the difference in medians.
Because spread no longer depends on level after this transformation, comparing medians is the best approach. Using the medians of the original counts (27.5 and 10.5), we can say
Weekday Visits = Weekend Visits + 17
which means that a typical weekday has about 17 more visits than a typical weekend day.
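The shift of 17 is a difference of medians on the original count scale. As a short sketch, the same difference can be computed on the square-root scale, where the spreads were matched; the names raw_shift and sqrt_shift are illustrative, and the plain square root is used here rather than any rescaled version of it.

```r
Weekend <- c(11, 10, 12, 15, 8, 24, 14, 11, 6, 11, 6, 12, 10, 14, 7, 10, 17, 5, 3, 4)
Weekday <- c(18, 21, 27, 23, 27, 28, 25, 19, 26, 32, 51, 45, 32, 19, 15, 37, 34, 32, 32, 30)

# Difference in medians on the original count scale.
raw_shift <- median(Weekday) - median(Weekend)

# Difference in medians after the square-root reexpression.
sqrt_shift <- median(sqrt(Weekday)) - median(sqrt(Weekend))

print(raw_shift)
print(round(sqrt_shift, 2))
```

On the square-root scale the shift comes out at roughly 2, so the relation could equally be stated as sqrt(Weekday) being about sqrt(Weekend) + 2.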
Part II - Snowfall
Now let us work with the “snowfall” dataset in R. The data give the seasonal snowfall in inches in Buffalo, NY and Cairo, IL from 1918-19 through 1937-38.
We will load the data and use the slider.compare function to construct parallel box plots of Buffalo & Cairo snowfall amounts. We will also check if there is a dependence between spread & level.
head(snowfall)
## Season City Snowfall
## 1 1918 Buffalo 25.0
## 2 1919 Buffalo 69.4
## 3 1920 Buffalo 53.5
## 4 1921 Buffalo 39.8
## 5 1922 Buffalo 63.6
## 6 1923 Buffalo 46.7
attach(snowfall)
snowfall
## Season City Snowfall
## 1 1918 Buffalo 25.0
## 2 1919 Buffalo 69.4
## 3 1920 Buffalo 53.5
## 4 1921 Buffalo 39.8
## 5 1922 Buffalo 63.6
## 6 1923 Buffalo 46.7
## 7 1924 Buffalo 72.9
## 8 1925 Buffalo 79.6
## 9 1926 Buffalo 83.6
## 10 1927 Buffalo 80.7
## 11 1928 Buffalo 60.3
## 12 1929 Buffalo 79.0
## 13 1930 Buffalo 64.8
## 14 1931 Buffalo 49.6
## 15 1932 Buffalo 54.7
## 16 1933 Buffalo 71.8
## 17 1934 Buffalo 49.1
## 18 1935 Buffalo 103.9
## 19 1936 Buffalo 51.6
## 20 1937 Buffalo 81.6
## 21 1918 Cairo 1.8
## 22 1919 Cairo 4.5
## 23 1920 Cairo 13.9
## 24 1921 Cairo 4.0
## 25 1922 Cairo 1.2
## 26 1923 Cairo 6.8
## 27 1924 Cairo 7.2
## 28 1925 Cairo 11.5
## 29 1926 Cairo 6.2
## 30 1927 Cairo 0.4
## 31 1928 Cairo 11.5
## 32 1929 Cairo 12.4
## 33 1930 Cairo 11.3
## 34 1931 Cairo 2.9
## 35 1932 Cairo 7.4
## 36 1933 Cairo 2.7
## 37 1934 Cairo 1.6
## 38 1935 Cairo 14.1
## 39 1936 Cairo 5.4
## 40 1937 Cairo 3.0
tapply(snowfall$Snowfall,snowfall$City,median)
## Buffalo Cairo
## 64.2 5.8
boxplot(Snowfall~City,horizontal=TRUE,data=snowfall,xlab="Snowfall",ylab="City")
From the box plot we can see that the spread is bigger for Buffalo than for Cairo. The median is also very small for Cairo compared to that of Buffalo. If the boxes had the same length, the two batches would have the same spread, which is not the case here. Instead, the spread increases with the level, and the goal is to remove this dependence so that the batches have similar spreads; then we can compare the medians and describe one batch as a function of the other.
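The level and spread read off the box plot can be put side by side numerically. The sketch below retypes the snowfall amounts from the listing above so that it runs without the package; summarise_batch is a hypothetical helper name.

```r
# Snowfall amounts (inches), copied from the printed data frame above.
Buffalo <- c(25.0, 69.4, 53.5, 39.8, 63.6, 46.7, 72.9, 79.6, 83.6, 80.7,
             60.3, 79.0, 64.8, 49.6, 54.7, 71.8, 49.1, 103.9, 51.6, 81.6)
Cairo <- c(1.8, 4.5, 13.9, 4.0, 1.2, 6.8, 7.2, 11.5, 6.2, 0.4,
           11.5, 12.4, 11.3, 2.9, 7.4, 2.7, 1.6, 14.1, 5.4, 3.0)

# Hypothetical helper: level (median) and spread (fourth spread) of a batch.
summarise_batch <- function(y) {
  h <- fivenum(y)
  c(median = h[3], fourth_spread = h[4] - h[2])
}

snow_summary <- rbind(Buffalo = summarise_batch(Buffalo),
                      Cairo = summarise_batch(Cairo))
print(snow_summary)
```

The batch with the higher median also shows the larger fourth spread, which is the dependence between spread and level described above.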
slider.compare(snowfall$Snowfall,snowfall$City)
## Error in manipulate(power.plot(power, y, group), power = manipulate::slider(-2, : The manipulate package must be run from within RStudio
At power = 1, we can see some dependence between the spread and the level, as the Cairo box is much shorter than the Buffalo box.
At power = 0.5, the boxes seem to be the same length; that is, the fourth spread (IQR) is about the same for both batches, which indicates that the spread no longer depends on the level.
At power = 0, the spread of Cairo seems to be larger than that of Buffalo, and there is also an outlier for Buffalo. This is not the right power, as the spreads are not the same.
When we move the power to -0.7, the Buffalo box shrinks almost to a line: the spread of Buffalo becomes negligible while that of Cairo is much bigger.
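The four slider settings discussed above can be compared in one table of fourth spreads. As before, this is a sketch under assumptions: reexpress and fourth_spread are hypothetical helpers, the matched power form (y^p - 1)/p with the log at p = 0 is assumed, and the data are retyped from the listing.

```r
Buffalo <- c(25.0, 69.4, 53.5, 39.8, 63.6, 46.7, 72.9, 79.6, 83.6, 80.7,
             60.3, 79.0, 64.8, 49.6, 54.7, 71.8, 49.1, 103.9, 51.6, 81.6)
Cairo <- c(1.8, 4.5, 13.9, 4.0, 1.2, 6.8, 7.2, 11.5, 6.2, 0.4,
           11.5, 12.4, 11.3, 2.9, 7.4, 2.7, 1.6, 14.1, 5.4, 3.0)

# Hypothetical helper (assumption): matched power reexpression.
reexpress <- function(y, p) {
  if (p == 0) log(y) else (y^p - 1) / p
}

# Fourth spread = upper hinge minus lower hinge.
fourth_spread <- function(y) {
  h <- fivenum(y)
  h[4] - h[2]
}

powers <- c(1, 0.5, 0, -0.7)
snow_spreads <- t(sapply(powers, function(p) {
  c(Buffalo = fourth_spread(reexpress(Buffalo, p)),
    Cairo = fourth_spread(reexpress(Cairo, p)))
}))
rownames(snow_spreads) <- paste("power", powers)
print(round(snow_spreads, 3))

# Ratio of Buffalo spread to Cairo spread; values near 1 mean matched spreads.
snow_ratio <- snow_spreads[, "Buffalo"] / snow_spreads[, "Cairo"]
print(round(snow_ratio, 2))
```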
Using the reexpressed data at power = 0.5, we can compare the Buffalo and Cairo snowfall amounts. Using the medians of the original data (64.2 and 5.8 inches), we can express the relationship as
Buffalo Snowfall = Cairo Snowfall + 58.4
With the spreads being similar, it is fair to compare the medians and describe one batch as a function of the other. Overall, we were able to compare the two batches of data.
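For completeness, the 58.4-inch shift can be reproduced from the data, along with the corresponding shift on the square-root scale where the spreads were matched. This is a sketch with illustrative variable names, using the plain square root and data retyped from the listing.

```r
Buffalo <- c(25.0, 69.4, 53.5, 39.8, 63.6, 46.7, 72.9, 79.6, 83.6, 80.7,
             60.3, 79.0, 64.8, 49.6, 54.7, 71.8, 49.1, 103.9, 51.6, 81.6)
Cairo <- c(1.8, 4.5, 13.9, 4.0, 1.2, 6.8, 7.2, 11.5, 6.2, 0.4,
           11.5, 12.4, 11.3, 2.9, 7.4, 2.7, 1.6, 14.1, 5.4, 3.0)

# Difference in medians on the original scale (inches).
snow_raw_shift <- median(Buffalo) - median(Cairo)

# Difference in medians after the square-root reexpression.
snow_sqrt_shift <- median(sqrt(Buffalo)) - median(sqrt(Cairo))

print(round(snow_raw_shift, 1))
print(round(snow_sqrt_shift, 2))
```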