This homework is due by Thursday, February 2nd, 8:00pm. Upload a pdf file to Canvas called 3_probability.pdf

Tip 1: Take a close look at the R Markdown files simulation2.Rmd (class 8) and modeling_data.Rmd (Class 9)!

Tip 2: Other than the functions defined for you already, you won’t need to create any additional functions.

Distributions (1 point)

When we have empirical data, we can compute cumulative probabilities and create probability density functions using quantile() and geom_density(), respectively. Take a look at the help files for both of these functions to better understand what they’re doing.

Consider the following data set:

df.p1 = tibble(observation = 1:20,
               rating = c(0.3775909, 0.5908214, 0.07285336, 0.06989763, 0.2180343,
                          1.447484, 0.614781, 0.2698414, 0.4782837, 0.073523, 
                          0.6953676, 0.3810149, 0.6188018, 2.211967, 0.5272716, 
                          0.517622, 0.9380176, 0.3273733, 0.1684667, 0.2942399))

Quantile (0.5 points)

What’s the 60% percentile of the rating variable? (In other words, for which value of rating is it true, that 60% of the values are lower than that value?)

### YOUR CODE HERE ###
qnorm(0.6, mean=0.5446627, sd=0.5103532)
## [1] 0.6739592
######################

Density (0.5 points)

Plot the density of the rating variable. Use bandwidth of 0.15.

### YOUR CODE HERE ###
df.p1 %>% 
  ggplot(aes(x=rating)) +
  geom_density(bw=0.15)

######################

Sampling distribution (7 points)

The sample standard deviation \(s = \sqrt\frac{\sum_{i=1}^n(Y_i-\overline Y)^2}{n-1}\) is an unbiased estimator of the population standard deviation.In this exercise, we will run a simulation to compare \(s\) with another estimator \(s' = \sqrt\frac{\sum_{i=1}^n(Y_i-\overline Y)^2}{n}\) and show that \(s'\) is biased. Note that the only difference between \(s\) and \(s'\) is in the denominator.

Generate the sampling distributions (2 points)

Let’s run 10,000 simulations, each consisting of a sample of 40 draws from a population distribution. To start off, create a new data frame df.samples with 400,000 rows and the following 3 columns:

  • sim: index for the simulation (from 1 to 10,000)
  • sample: index for the sample in each simulation (from 1 to 40)
  • x: sample values drawn from the population distribution

For the population distribution assume a normal distribution with mean = 0 and sd = 1.

Tip: You can use the expand_grid() function to make a data frame that contains all possible combinations for the elements of two (or more) vectors. See code from Lecture 06: Probability.

Note the eval = F argument in the code chunk below. Once you’ve filled in the code, make sure to change this to eval = T so that the code chunk is evaluated whenever the file is knitted. The same is true for the other code chunks below.

set.seed(1)

n_simulations = 10000 # number of simulations
n_samples = 40 # number of samples in each simulation
population_mean = 0 # ground truth mean
population_sd = 1 # ground truth standard deviation

### YOUR CODE HERE ###
df.samples = tibble(expand_grid(sim=1:n_simulations,
                                samples=1:n_samples),
                    x=rnorm(400000,mean=population_mean,sd=population_sd))

######################

df.samples %>% 
  head(5)
df.samples %>% 
  summary()
##       sim           samples            x            
##  Min.   :    1   Min.   : 1.00   Min.   :-4.882127  
##  1st Qu.: 2501   1st Qu.:10.75   1st Qu.:-0.676890  
##  Median : 5000   Median :20.50   Median : 0.001682  
##  Mean   : 5000   Mean   :20.50   Mean   :-0.000621  
##  3rd Qu.: 7500   3rd Qu.:30.25   3rd Qu.: 0.673835  
##  Max.   :10000   Max.   :40.00   Max.   : 4.650944

Calculate sample statistics (3 points)

For each simulation, calculate the estimators \(s\) and \(s'\), and save the results in a new data frame called df.sample_stats with 10,000 rows and the following 3 columns:

  • sim: index for the simulation
  • unbiased: the unbiased estimated standard deviation from simulated samples
  • biased: the biased estimated standard deviation from simulated samples

Tip 1: group_by() and summarize() are your friends here!

Tip 2: If you choose to use sd(), note that it uses the n-1 correction (take a look at its help file).

### YOUR CODE HERE ###

## Getting the estimates by working backwards from the SD "n-1" equation
df.sample_stats = df.samples %>% 
  group_by(sim) %>% 
  summarize(unbiased = sd(x), 
            biased = sqrt(((((sd(x))^2)*(n()-1)))/n()))

## Getting the estimates by using the formula for SD
df.sample_stats = df.samples %>% 
  group_by(sim) %>% 
  summarize(deviations = sum((x-mean(x))^2),
            unbiased = sqrt(deviations/(n()-1)),
            biased = sqrt(deviations/n())) %>% 
  select(sim, unbiased, biased)
             
######################

df.sample_stats %>% 
  head(5)
df.sample_stats %>% 
  summary()
##       sim           unbiased          biased      
##  Min.   :    1   Min.   :0.5747   Min.   :0.5674  
##  1st Qu.: 2501   1st Qu.:0.9168   1st Qu.:0.9053  
##  Median : 5000   Median :0.9911   Median :0.9787  
##  Mean   : 5000   Mean   :0.9940   Mean   :0.9815  
##  3rd Qu.: 7500   3rd Qu.:1.0695   3rd Qu.:1.0561  
##  Max.   :10000   Max.   :1.4452   Max.   :1.4270

Visualization (1 point)

Plot the two types of estimates and determine which is better. Your plot should include the distribution of your estimates and two vertical lines, one each for the mean of your two distributions. Everything should be drawn on the same plot.

Tip: For making the plot, the geom_density() and geom_vline() functions will be helpful.

### YOUR CODE HERE ###
mean_unbiased = mean(df.sample_stats$unbiased)
mean_biased = mean(df.sample_stats$biased)

df.sample_stats %>% 
  pivot_longer(cols = -sim,
               names_to = "biastest",
               values_to = "value") %>% 
  ggplot(aes(x=value, color=biastest)) +
  geom_density(lwd=1) +
  geom_vline(xintercept = c(mean_biased,mean_unbiased), color=c("red","blue"),
             lwd=1, lty="dashed", alpha = .2) +
  scale_color_manual(labels=c("Biased SD Formula","Unbiased SD Formula"), values=c("red","blue")) +
  labs(x = "Distribution of SD Values", color = "Formula")

######################

Interpretation (1 point)

How do the two estimates compare to the ground truth?

Considering the ground truth is an SD value of 1, the unbiased SD formula is more accurate than the biased SD formula.

Permutation test (7 points)

Imagine that you collected data about people’s heights from three different places and you are interested whether there are any differences in people’s height between the three places.

By visualizing the data, we can see that the variances between the three groups differ considerably, which is troublesome for parametric tests (e.g. a t-test). However, we can perform a permutation test, which is non-parametric. In this case, we are interested in whether the maximal difference between each of the pairs of group means, is greater than we would expect to see by chance.

set.seed(1)

df.heights = read_csv(file = "data/df_heights.csv",
                      show_col_types = FALSE)

ggplot(data = df.heights,
       mapping = aes(x = group,
                     y = height)) +
  geom_point(position = position_jitter(height = 0,
                                        width = 0.1),
             alpha = 0.5) + 
  stat_summary(fun.data = "mean_cl_boot",
               shape = 21, 
               fill = "lightblue",
               size = 1)
## Warning: Computation failed in `stat_summary()`
## Caused by error in `fun.data()`:
## ! The package `Hmisc` is required.

Compute maximum difference (3 points)

Create a function that takes a data frame similar to df.heights and returns the maximum difference between the three group means. Use that function to compute what the maximum difference is for the df.heights data set.

max_group_diff = function(df) {
  ### YOUR CODE HERE ###
  max_diff = df %>% 
    group_by(group) %>%  
    summarize(mean_height = mean(height)) %>%
    pull(mean_height) %>% 
    dist() %>% 
    max()
  ######################
  return (max_diff)
}

actual_max_diff = max_group_diff(df.heights)
actual_max_diff
## [1] 0.273778

The maximum difference between the three group means is 0.274.

Run permutation (3 points)

Create a function that takes a data frame similar to df.heights, shuffles the groups, and returns the maximum difference between the group means (using max_group_diff()) of the shuffled groups. Then create a data frame df.permutations with 100 rows and the following 2 columns:

  • permutation: index for the permutation (from 1 to 100)
  • max_diff: the maximum difference between group means in the permuted sample

Tip: The sample() function is helpful here for shuffling the group labels, and the replicate() function for running the same function many times.

set.seed(1)
n_permutations = 100

permute_max_group_diff = function(df) {
  ### YOUR CODE HERE ###
  max_diff = df %>% 
    mutate(group = sample(group)) %>% 
    group_by(group) %>%  
    summarize(mean_height = mean(height)) %>%
    pull(mean_height) %>% 
    dist() %>% 
    max()
  ######################
  return (max_diff)
}

### YOUR CODE HERE ###
df.permutations = tibble(permutation = 1:n_permutations, 
  max_diff = replicate(n = n_permutations, permute_max_group_diff(df.heights)))
######################

df.permutations %>% 
  head(5)
df.permutations %>% 
  summary()
##   permutation        max_diff      
##  Min.   :  1.00   Min.   :0.01047  
##  1st Qu.: 25.75   1st Qu.:0.10095  
##  Median : 50.50   Median :0.14406  
##  Mean   : 50.50   Mean   :0.15331  
##  3rd Qu.: 75.25   3rd Qu.:0.20550  
##  Max.   :100.00   Max.   :0.36960
#plot the distribution of the differences 
ggplot(data = df.permutations, aes(x = max_diff)) +
  geom_histogram(aes(y = stat(density)),
                 color = "black",
                 fill = "lightblue",
                 binwidth = 0.05) + 
  stat_density(geom = "line",
               size = 1.5,
               bw = 0.04) +
  geom_vline(xintercept = actual_max_diff, color = "red", size = 2) +
  labs(x = "Maximum Difference Between Groups")
## Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.

Perform permutation test (1 point)

Calculate the p-value of getting the observed result (or one that’s more extreme) if the null hypothesis of no difference between the groups was true.

### YOUR CODE HERE ###
df.permutations %>% 
  summarize(p_value = sum(max_diff >= actual_max_diff)/n())
######################

What is the probability for the actually observed maximum difference between two groups to occur if the null hypothesis was true?

The probability for the actually observed maximum difference between two groups to occur if the null hypothesis was true is 6%.

Grading Rubric

  1. Distributions (1 point)
    1. Quantile (0.5 points)
    2. Density (0.5 points)
  2. Sampling distribution (7 points)
    1. Generate the sample distributions (2 points)
    2. Calculate sample statistics (3 points)
    3. Visualization (1 point)
    4. Interpretation (1 point)
  3. Permutation test (7 points)
    1. Compute maximum difference (3 points)
    2. Run permutation (3 points)
    3. Perform permutation test (1 point)

Session info

Information about this R session including which version of R was used, and what packages were loaded.

sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Singapore.utf8  LC_CTYPE=English_Singapore.utf8   
## [3] LC_MONETARY=English_Singapore.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Singapore.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_1.0.0   stringr_1.5.0   dplyr_1.0.10    purrr_1.0.1    
##  [5] readr_2.1.3     tidyr_1.3.0     tibble_3.1.8    ggplot2_3.4.0  
##  [9] tidyverse_1.3.2 knitr_1.42     
## 
## loaded via a namespace (and not attached):
##  [1] lubridate_1.9.1     assertthat_0.2.1    digest_0.6.31      
##  [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0   
##  [7] backports_1.4.1     reprex_2.0.2        evaluate_0.20      
## [10] httr_1.4.4          highr_0.10          pillar_1.8.1       
## [13] rlang_1.0.6         googlesheets4_1.0.1 readxl_1.4.1       
## [16] rstudioapi_0.14     jquerylib_0.1.4     rmarkdown_2.20     
## [19] labeling_0.4.2      googledrive_2.0.0   bit_4.0.5          
## [22] munsell_0.5.0       broom_1.0.3         compiler_4.2.2     
## [25] modelr_0.1.10       xfun_0.36           pkgconfig_2.0.3    
## [28] htmltools_0.5.4     tidyselect_1.2.0    fansi_1.0.4        
## [31] crayon_1.5.2        tzdb_0.3.0          dbplyr_2.3.0       
## [34] withr_2.5.0         grid_4.2.2          jsonlite_1.8.4     
## [37] gtable_0.3.1        lifecycle_1.0.3     DBI_1.1.3          
## [40] magrittr_2.0.3      scales_1.2.1        cli_3.6.0          
## [43] stringi_1.7.12      vroom_1.6.1         cachem_1.0.6       
## [46] farver_2.1.1        fs_1.6.0            xml2_1.3.3         
## [49] bslib_0.4.2         ellipsis_0.3.2      generics_0.1.3     
## [52] vctrs_0.5.2         tools_4.2.2         bit64_4.0.5        
## [55] glue_1.6.2          hms_1.1.2           parallel_4.2.2     
## [58] fastmap_1.1.0       yaml_2.3.7          timechange_0.2.0   
## [61] colorspace_2.1-0    gargle_1.2.1        rvest_1.0.3        
## [64] haven_2.5.1         sass_0.4.5