See the notes for this section here.

1 Import and explore data

The data file contains session times for two web pages, A and B:

21 observations for page A
15 observations for page B

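library(ggplot2)
library(dplyr)       # provides the %>% pipe
library(knitr)       # provides kable()
library(kableExtra)  # provides kable_styling()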

session_times <- read.csv('./data_files/web_page_data.csv')

session_times %>%
  head() %>%
  kable() %>%
  kable_styling()

Page	Time
Page A	0.21
Page B	2.53
Page A	0.35
Page B	0.71
Page A	0.67
Page B	0.85

Visualize the distribution of session times for each page:

ggplot(session_times, aes(x=Page, y=Time)) + geom_boxplot()
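
A numeric summary per page can complement the boxplot. The sketch below is only illustrative: it rebuilds the six rows shown in the preview table above (not the full 36-row file), and the name `preview` is an assumption made for this example.

```r
# First six rows exactly as displayed above (NOT the full data set)
preview <- data.frame(
  Page = c("Page A", "Page B", "Page A", "Page B", "Page A", "Page B"),
  Time = c(0.21, 2.53, 0.35, 0.71, 0.67, 0.85)
)

# Mean and median session time for each page
group_means <- tapply(preview$Time, preview$Page, mean)
group_medians <- tapply(preview$Time, preview$Page, median)

group_means
group_medians
```

On the full data, the same calls would take `session_times$Time` and `session_times$Page` in place of the preview columns.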

2 Define permutation function

Steps:

Combine all session times into a single pool
Shuffle the pooled data and split it, without replacement, into groups of 21 and 15 (matching the sizes of the original two groups)
Calculate the difference between the means of the two resampled groups
Repeat the process 1000 times to build a distribution of differences

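perm_fun <- function(x, n1, n2) {
   n <- n1 + n2                                  # total number of observations
   idx_b <- sample(1:n, n2)                      # n2 random indices for the second group (page B)
   idx_a <- setdiff(1:n, idx_b)                  # remaining n1 indices for the first group (page A)
   mean_diff <- mean(x[idx_b]) - mean(x[idx_a])  # difference in means, B minus A
   return(mean_diff)
}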
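
A quick sanity check on toy data can help confirm that a permutation helper behaves as expected. The sketch below restates the helper under the hypothetical name `perm_fun_sketch` (drawing the second group with size `n2`) so that it runs on its own; the toy vectors and the seed are assumptions made for illustration, not part of the data set.

```r
# Standalone restatement of the permutation helper (hypothetical name)
perm_fun_sketch <- function(x, n1, n2) {
  n <- n1 + n2
  idx_b <- sample(1:n, n2)       # indices for the group of size n2
  idx_a <- setdiff(1:n, idx_b)   # remaining n1 indices
  mean(x[idx_b]) - mean(x[idx_a])
}

set.seed(42)  # arbitrary seed, used only to make the run reproducible

# Check 1: if every value is identical, every permuted difference must be 0
constant_x <- rep(2, 36)
diffs_const <- replicate(100, perm_fun_sketch(constant_x, 21, 15))
all(diffs_const == 0)

# Check 2: with values in [0, 1], no permuted difference can exceed 1 in size
toy_x <- c(rep(0, 21), rep(1, 15))
diffs_toy <- replicate(100, perm_fun_sketch(toy_x, 21, 15))
range(diffs_toy)
```

Using `replicate()` here is just a compact alternative to a preallocated loop; either form yields a vector of permuted differences.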

3 Permutation test

## Resample data 1000 times
perm_diffs <- rep(0, 1000) 

for (i in 1:1000) {
  perm_diffs[i] <- perm_fun(session_times[,'Time'], 21, 15)
}

## Plot distribution of differences
hist(perm_diffs, xlab='Session time differences (in seconds)') 

## Mark where the observed difference between pages A and B falls within the distribution
mean_a <- mean(session_times[session_times['Page']=='Page A', 'Time']) 

mean_b <- mean(session_times[session_times['Page']=='Page B', 'Time'])
 
abline(v = mean_b - mean_a)

The observed difference between the mean session times for pages A and B falls within the range of chance variation seen in the permutation distribution, so it is not statistically significant.
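
The visual judgment above can be backed by a number: the proportion of permuted differences at least as large as the observed one, which serves as a one-sided permutation p-value. The following is a minimal sketch, assuming a hypothetical helper named `perm_p_value`; the toy inputs are made up purely to show the arithmetic and are not results from the session-time data.

```r
# One-sided permutation p-value: share of permuted differences that are
# at least as large as the observed difference
perm_p_value <- function(perm_diffs, obs_diff) {
  mean(perm_diffs >= obs_diff)
}

# Toy values for illustration only (NOT results from the session-time data)
toy_perm_diffs <- c(-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.5)
toy_obs_diff <- 0.25

perm_p_value(toy_perm_diffs, toy_obs_diff)  # 2 of the 8 toy values are >= 0.25
```

In the analysis above, the call would take `perm_diffs` and `mean_b - mean_a` as its arguments. For a two-sided test, one option is to compare absolute values on both sides instead.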

Resampling - Practical Statistics for Data Scientists

Nancy Chelaru
