make sure to load the necessary libraries
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Confidence Interval Simulation of the 95% confidence interval from 1000 samples of 16 observations each This is similar to what we did in week 6, but we will use a normal distribution this time. We are going to create a matrix (similar to a data frame, but easier to manipulate in this context) with random values from a normal distribution. The matrix mymatrix will have 1000 columns (variables) and 16 rows (observations); this is the example from lecture slide 8. Here the “mystery mean” is chosen to be 250.
M <- 250
mymatrix <- matrix(rnorm(16000, mean = M, sd = 20),ncol=1000,nrow=16)
#(rnorm(1600, mean = M, sd = 20) draws 16000 values from a normal distribution with mean M=250 and standard deviation 20. Why do we need 16000 values? Because 1000 columns * 16 rows = 16000, so this is how many values we need to fill our matrix of 1000 columns by 16 rows.
Now we can easily create the mean of each of our 1000 variables (samples):
means <- colMeans(mymatrix)
#The line above creates a vector with the means of the variables (=columns) in mymatrix. This is why it???s called colMeans: it takes the means of the columns in a matrix.
Now plot the estimated means from our 1000 samples as a histogram:
ggplot (data=NULL) +
geom_histogram(mapping = aes(x = means))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#data=NULL tells R that means is not in a data frame (since means is in the global environment).
Let’s call mean1 the estimated mean from the first sample. If we did not know the mean of the population M and we just had this sample and knew that sd=20, how confident can we be about the mean of the population? Let’s store the first value of the vector means in mean1:
mean1 <- means[1]
mean1
## [1] 246.1085
From slide 9, we know that the sample mean is within 10 points of the mean of the population.
#The lower bound is:
lowerbound1 <- mean1-10
lowerbound1
## [1] 236.1085
#The upper bound is:
upperbound1 <- mean1+10
upperbound1
## [1] 256.1085
Now let’s repeat this operation for all our 1000 samples and create a matrix (dataset) of lower bounds and a matrix (dataset) of upper bounds.
lowerbound <- means-10
upperbound <- means+10
You can look at the list of values for lowerbound and upperbound in the Environment panel to verify that the calculation was done correctly given the values in means. Now let’s see how many times the lower bound is above the mean of the population M=250, and therefore the population mean is not included in the confidence interval. Let’s call the vector that tags this cases “overestimate” since the confidence interval is too high above the population mean:
overestimate <- as.numeric(lowerbound>M)
The tag overestimate takes the value of 1 if the lower bound is smaller than 250, and 0 otherwise. How many times is the lower bound smaller than 250?
sumoverestimate <- sum(overestimate)
sumoverestimate
## [1] 23
Similarly, we can check how many times the upper bound is lower than the mean of the population M=250. In this case, we missed the population mean by underestimating.
underestimate <- as.numeric(upperbound<M)
sumunderestimate <- sum(underestimate)
sumunderestimate
## [1] 19
#What is the total number of samples where our confidence interval missed the population mean by either over or under estimating?
sumoverestimate + sumunderestimate
## [1] 42
#What is the percent of the 1000 samples in which the 95% confidence interval missed the population mean?
(sumoverestimate + sumunderestimate)/1000
## [1] 0.042
#This proportion should be close to 0.05=5%, meaning that we missed the sample mean 5% of the time. My experience with running this code a few times shows that values could be quite different from 5%: try it. Repeating the code is easier if you use an R script by clicking File, New File, R script. This allows you to repeat this whole code as many times as you like (i.e. drawing 1000 samples of 16 observations) by selecting it and clicking Run at the top of the window. You will see a series of proportions of confidence intervals that missed the population mean M=250 that are close to 5%, but not exactly 5% every time.
#The reasoning of tests of significance We are going to implement the example from slide 26. We will be using the binomial distribution (The binomial distribution describes the number of successes in n trials if the probability of success in each trial is p and the trials are independent. explained on p.313 and following of the textbook chapter 5.)
The free thrower throws n=50 times, and he claims that his success rate is p=0.8. So, out of 50 trials he should succeed 50*0.8=40 times on average.
Let’s make him throw 50 ball-sets 400 times.
throwsresult <- rbinom(400,50,0.8)
#The above creates a vector with the number of successful throws in 400 50 balls-sets, with a success rate of 0.8 every time.
We can plot the distribution of the number of successes like this:
ggplot (data=NULL) +
geom_histogram(mapping = aes(x = throwsresult),binwidth=1)
The player in reality scored 32 times. What is the share of the throwsresults that are 32 or less if the actual success rate is 80%?
lowthrowsresult <- as.numeric(throwsresult<=32)
#to create a tag that takes the value of 1 if the number of successful throws is 32 or less (out of our 400 series of trials), and 0 otherwise.
sum(lowthrowsresult)/400
## [1] 0.005
#to count up how many times the successful throws are 32 or less and then divides by 400 to get the proportion out of the 400 series of trials.
This proportion is very small, which provides strong evidence against the claim that the person is an 80% free-throw shooter. That’s because if he were really an 80% shooter, scoring just 32 times out of 50 is highly unlikely (though it is not impossible).
For this case, we don’t really need to sample because the probability of having 32 or less successes out of 50 in a binomial case is mathematically defined and can be computed by R in a single line:
pbinom(32, 50, 0.8, lower.tail = TRUE, log.p = FALSE)
## [1] 0.006260775