This homework is due by Thursday, February 2nd, 8:00pm. Upload a pdf file to Canvas called
3_probability.pdf
Tip 1: Take a close look at the R Markdown files
simulation2.Rmd(class 8) andmodeling_data.Rmd(Class 9)!
Tip 2: Other than the functions defined for you already, you won’t need to create any additional functions.
When we have empirical data, we can compute cumulative probabilities
and create probability density functions using quantile()
and geom_density(), respectively. Take a look at the help
files for both of these functions to better understand what they’re
doing.
Consider the following data set:
df.p1 = tibble(observation = 1:20,
rating = c(0.3775909, 0.5908214, 0.07285336, 0.06989763, 0.2180343,
1.447484, 0.614781, 0.2698414, 0.4782837, 0.073523,
0.6953676, 0.3810149, 0.6188018, 2.211967, 0.5272716,
0.517622, 0.9380176, 0.3273733, 0.1684667, 0.2942399))
What’s the 60% percentile of the rating variable? (In
other words, for which value of rating is it true, that 60% of the
values are lower than that value?)
### YOUR CODE HERE ###
qnorm(0.6, mean=0.5446627, sd=0.5103532)
## [1] 0.6739592
######################
Plot the density of the rating variable. Use bandwidth
of 0.15.
### YOUR CODE HERE ###
df.p1 %>%
ggplot(aes(x=rating)) +
geom_density(bw=0.15)
######################
The sample standard deviation \(s = \sqrt\frac{\sum_{i=1}^n(Y_i-\overline Y)^2}{n-1}\) is an unbiased estimator of the population standard deviation.In this exercise, we will run a simulation to compare \(s\) with another estimator \(s' = \sqrt\frac{\sum_{i=1}^n(Y_i-\overline Y)^2}{n}\) and show that \(s'\) is biased. Note that the only difference between \(s\) and \(s'\) is in the denominator.
Let’s run 10,000 simulations, each consisting of a sample of 40 draws
from a population distribution. To start off, create a new data frame
df.samples with 400,000 rows and the following 3
columns:
For the population distribution assume a normal distribution with
mean = 0 and sd = 1.
Tip: You can use the
expand_grid()function to make a data frame that contains all possible combinations for the elements of two (or more) vectors. See code from Lecture 06: Probability.
Note the
eval = Fargument in the code chunk below. Once you’ve filled in the code, make sure to change this toeval = Tso that the code chunk is evaluated whenever the file is knitted. The same is true for the other code chunks below.
set.seed(1)
n_simulations = 10000 # number of simulations
n_samples = 40 # number of samples in each simulation
population_mean = 0 # ground truth mean
population_sd = 1 # ground truth standard deviation
### YOUR CODE HERE ###
df.samples = tibble(expand_grid(sim=1:n_simulations,
samples=1:n_samples),
x=rnorm(400000,mean=population_mean,sd=population_sd))
######################
df.samples %>%
head(5)
df.samples %>%
summary()
## sim samples x
## Min. : 1 Min. : 1.00 Min. :-4.882127
## 1st Qu.: 2501 1st Qu.:10.75 1st Qu.:-0.676890
## Median : 5000 Median :20.50 Median : 0.001682
## Mean : 5000 Mean :20.50 Mean :-0.000621
## 3rd Qu.: 7500 3rd Qu.:30.25 3rd Qu.: 0.673835
## Max. :10000 Max. :40.00 Max. : 4.650944
For each simulation, calculate the estimators \(s\) and \(s'\), and save the results in a new
data frame called df.sample_stats with 10,000 rows and the
following 3 columns:
Tip 1:
group_by()andsummarize()are your friends here!
Tip 2: If you choose to use
sd(), note that it uses the n-1 correction (take a look at its help file).
### YOUR CODE HERE ###
## Getting the estimates by working backwards from the SD "n-1" equation
df.sample_stats = df.samples %>%
group_by(sim) %>%
summarize(unbiased = sd(x),
biased = sqrt(((((sd(x))^2)*(n()-1)))/n()))
## Getting the estimates by using the formula for SD
df.sample_stats = df.samples %>%
group_by(sim) %>%
summarize(deviations = sum((x-mean(x))^2),
unbiased = sqrt(deviations/(n()-1)),
biased = sqrt(deviations/n())) %>%
select(sim, unbiased, biased)
######################
df.sample_stats %>%
head(5)
df.sample_stats %>%
summary()
## sim unbiased biased
## Min. : 1 Min. :0.5747 Min. :0.5674
## 1st Qu.: 2501 1st Qu.:0.9168 1st Qu.:0.9053
## Median : 5000 Median :0.9911 Median :0.9787
## Mean : 5000 Mean :0.9940 Mean :0.9815
## 3rd Qu.: 7500 3rd Qu.:1.0695 3rd Qu.:1.0561
## Max. :10000 Max. :1.4452 Max. :1.4270
Plot the two types of estimates and determine which is better. Your plot should include the distribution of your estimates and two vertical lines, one each for the mean of your two distributions. Everything should be drawn on the same plot.
Tip: For making the plot, the
geom_density()andgeom_vline()functions will be helpful.
### YOUR CODE HERE ###
mean_unbiased = mean(df.sample_stats$unbiased)
mean_biased = mean(df.sample_stats$biased)
df.sample_stats %>%
pivot_longer(cols = -sim,
names_to = "biastest",
values_to = "value") %>%
ggplot(aes(x=value, color=biastest)) +
geom_density(lwd=1) +
geom_vline(xintercept = c(mean_biased,mean_unbiased), color=c("red","blue"),
lwd=1, lty="dashed", alpha = .2) +
scale_color_manual(labels=c("Biased SD Formula","Unbiased SD Formula"), values=c("red","blue")) +
labs(x = "Distribution of SD Values", color = "Formula")
######################
How do the two estimates compare to the ground truth?
Considering the ground truth is an SD value of 1, the unbiased SD formula is more accurate than the biased SD formula.
Imagine that you collected data about people’s heights from three different places and you are interested whether there are any differences in people’s height between the three places.
By visualizing the data, we can see that the variances between the three groups differ considerably, which is troublesome for parametric tests (e.g. a t-test). However, we can perform a permutation test, which is non-parametric. In this case, we are interested in whether the maximal difference between each of the pairs of group means, is greater than we would expect to see by chance.
set.seed(1)
df.heights = read_csv(file = "data/df_heights.csv",
show_col_types = FALSE)
ggplot(data = df.heights,
mapping = aes(x = group,
y = height)) +
geom_point(position = position_jitter(height = 0,
width = 0.1),
alpha = 0.5) +
stat_summary(fun.data = "mean_cl_boot",
shape = 21,
fill = "lightblue",
size = 1)
## Warning: Computation failed in `stat_summary()`
## Caused by error in `fun.data()`:
## ! The package `Hmisc` is required.
Create a function that takes a data frame similar to
df.heights and returns the maximum difference between the
three group means. Use that function to compute what the maximum
difference is for the df.heights data set.
max_group_diff = function(df) {
### YOUR CODE HERE ###
max_diff = df %>%
group_by(group) %>%
summarize(mean_height = mean(height)) %>%
pull(mean_height) %>%
dist() %>%
max()
######################
return (max_diff)
}
actual_max_diff = max_group_diff(df.heights)
actual_max_diff
## [1] 0.273778
The maximum difference between the three group means is 0.274.
Create a function that takes a data frame similar to
df.heights, shuffles the groups, and returns the maximum
difference between the group means (using max_group_diff())
of the shuffled groups. Then create a data frame
df.permutations with 100 rows and the following 2
columns:
Tip: The
sample()function is helpful here for shuffling the group labels, and thereplicate()function for running the same function many times.
set.seed(1)
n_permutations = 100
permute_max_group_diff = function(df) {
### YOUR CODE HERE ###
max_diff = df %>%
mutate(group = sample(group)) %>%
group_by(group) %>%
summarize(mean_height = mean(height)) %>%
pull(mean_height) %>%
dist() %>%
max()
######################
return (max_diff)
}
### YOUR CODE HERE ###
df.permutations = tibble(permutation = 1:n_permutations,
max_diff = replicate(n = n_permutations, permute_max_group_diff(df.heights)))
######################
df.permutations %>%
head(5)
df.permutations %>%
summary()
## permutation max_diff
## Min. : 1.00 Min. :0.01047
## 1st Qu.: 25.75 1st Qu.:0.10095
## Median : 50.50 Median :0.14406
## Mean : 50.50 Mean :0.15331
## 3rd Qu.: 75.25 3rd Qu.:0.20550
## Max. :100.00 Max. :0.36960
#plot the distribution of the differences
ggplot(data = df.permutations, aes(x = max_diff)) +
geom_histogram(aes(y = stat(density)),
color = "black",
fill = "lightblue",
binwidth = 0.05) +
stat_density(geom = "line",
size = 1.5,
bw = 0.04) +
geom_vline(xintercept = actual_max_diff, color = "red", size = 2) +
labs(x = "Maximum Difference Between Groups")
## Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
Calculate the p-value of getting the observed result (or one that’s more extreme) if the null hypothesis of no difference between the groups was true.
### YOUR CODE HERE ###
df.permutations %>%
summarize(p_value = sum(max_diff >= actual_max_diff)/n())
######################
What is the probability for the actually observed maximum difference between two groups to occur if the null hypothesis was true?
The probability for the actually observed maximum difference between two groups to occur if the null hypothesis was true is 6%.
Information about this R session including which version of R was used, and what packages were loaded.
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Singapore.utf8 LC_CTYPE=English_Singapore.utf8
## [3] LC_MONETARY=English_Singapore.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Singapore.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_1.0.0 stringr_1.5.0 dplyr_1.0.10 purrr_1.0.1
## [5] readr_2.1.3 tidyr_1.3.0 tibble_3.1.8 ggplot2_3.4.0
## [9] tidyverse_1.3.2 knitr_1.42
##
## loaded via a namespace (and not attached):
## [1] lubridate_1.9.1 assertthat_0.2.1 digest_0.6.31
## [4] utf8_1.2.2 R6_2.5.1 cellranger_1.1.0
## [7] backports_1.4.1 reprex_2.0.2 evaluate_0.20
## [10] httr_1.4.4 highr_0.10 pillar_1.8.1
## [13] rlang_1.0.6 googlesheets4_1.0.1 readxl_1.4.1
## [16] rstudioapi_0.14 jquerylib_0.1.4 rmarkdown_2.20
## [19] labeling_0.4.2 googledrive_2.0.0 bit_4.0.5
## [22] munsell_0.5.0 broom_1.0.3 compiler_4.2.2
## [25] modelr_0.1.10 xfun_0.36 pkgconfig_2.0.3
## [28] htmltools_0.5.4 tidyselect_1.2.0 fansi_1.0.4
## [31] crayon_1.5.2 tzdb_0.3.0 dbplyr_2.3.0
## [34] withr_2.5.0 grid_4.2.2 jsonlite_1.8.4
## [37] gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3
## [40] magrittr_2.0.3 scales_1.2.1 cli_3.6.0
## [43] stringi_1.7.12 vroom_1.6.1 cachem_1.0.6
## [46] farver_2.1.1 fs_1.6.0 xml2_1.3.3
## [49] bslib_0.4.2 ellipsis_0.3.2 generics_0.1.3
## [52] vctrs_0.5.2 tools_4.2.2 bit64_4.0.5
## [55] glue_1.6.2 hms_1.1.2 parallel_4.2.2
## [58] fastmap_1.1.0 yaml_2.3.7 timechange_0.2.0
## [61] colorspace_2.1-0 gargle_1.2.1 rvest_1.0.3
## [64] haven_2.5.1 sass_0.4.5