This document was composed from Dr. Snopkowski’s ANTH 504 Week 5 lecture and Chapters 10 and 11 of Danielle Navarro’s Learning Statistics with R (2021).
Audio: 230209_001 Time: 55:08
Identify Samples vs. Populations
The Law of Large Numbers and Central Limit Theorem
Estimating population parameters
Estimating confidence intervals
What is the difference between a sample and a population?
Sampling Techniques
If you are using a sample, it is important that it is representative of your population.
One way to do this is to have a simple random sample, which means every member of the population has an equal chance of being selected.
One way to achieve a random sample is to give each individual in your population a number and then use a random number generator to select individuals (a short sketch in R follows the list of sampling techniques below).
Random: Everyone has an equal chance of being selected for the study.
Stratified: population divided into several sub-populations
oversampling: over-represent rare groups
Snowball: when sampling from “hidden” or hard-to-access populations, ask participants to refer more participants. A type of convenience sampling.
Convenience: participants are chosen because they are convenient (easy to access).
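As a minimal sketch of a simple random sample (assuming a hypothetical data frame called roster with one row per member of the population), R’s sample() function can draw ids so that each has an equal chance of selection:

roster <- data.frame(id = 1:500)             # hypothetical population of 500 members
sampled_ids <- sample(roster$id, size = 25)  # simple random sample: equal chance for every id
sampled_ids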
What is the problem if you don’t have a random sample?
The only equation: outcome = model + error
If we have less error, the model is a better fit than a model with more error.
What is the most basic statistical model?
The model of the mean: the most basic statistical model.
In statistics we fit models to our data
We use a statistical model to represent what is happening in the real world.
The mean is a hypothetical value
it doesn’t have to be a value that actually exists in the data set
the mean is a simple statistical model
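For example (hypothetical numbers), the mean of 2, 3, 5, and 6 is 4, a value that never appears in the data:

x <- c(2, 3, 5, 6)   # hypothetical data
mean(x)              # 4, which is not one of the observed values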
Population parameters are parameters that represent the entire population. They are typically unknown (since it’s difficult to measure an entire population of anything)
Sample statistics are the values we get from a sample
X-bar (x̄) is a sample statistic
Mu (µ) is a population parameter
We sometimes shift between the two. The best guess of the population mean is the sample mean, but µ represents the population, which is typically unknowable.
The mean is a model of what happens in the real world: the typical score (this is the parameter)
It is not a perfect representation of the data
How can we assess how well the mean represents reality?
If the mean represents every measure, it is a perfect fit, but you have probably done something wrong.
A deviation is the difference between the mean and an actual data point.
Deviations can be calculated by taking each score and subtracting the mean from it.
If you sum the deviations, you get zero.
Therefore, we square each deviation.
If we add these squared deviations we get the Sum of Squared Errors (SS)
The SS depends on how many observations are in the sample, so we divide by n − 1.
This gives an unbiased estimate.
This value is called the variance (s2)
The variance has one problem: it is measured in units squared.
This isn’t a very meaningful metric so we take the square root value.
This is the standard deviation (s)
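As a small sketch (hypothetical numbers), the deviations, SS, variance, and standard deviation can be computed by hand and checked against R’s built-in var() and sd():

x <- c(2, 3, 5, 6)           # hypothetical sample
dev <- x - mean(x)           # deviations from the mean
sum(dev)                     # sums to zero
SS <- sum(dev^2)             # Sum of Squared Errors
SS / (length(x) - 1)         # variance; same result as var(x)
sqrt(SS / (length(x) - 1))   # standard deviation; same result as sd(x)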
Sample statistics (x̄, s, s²) estimate the corresponding population parameters (µ, σ, σ²).
Sample to Population:
Audio: 230214__002 Time: 03:00
A mathematical law that applies to many different sample statistics, but is easiest to think about in terms of the law of averages.
As the sample gets larger, the sample mean tends to get closer to the true population mean. We looked at picking blue balls in the last chapter: the more balls we pull, the closer the sample mean gets to the population mean. Bigger samples are always better.
Unfortunately, it’s not very helpful to know that we need a bigger and bigger sample…so we can also use Central Limit Theorem to help us
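As a quick sketch of the law of large numbers (using the same normal population of heights introduced below), the running mean of random draws drifts toward the population mean of 64.5 as the sample grows:

draws <- rnorm(n = 5000, mean = 64.5, sd = 2.5)   # one big random sample
running_mean <- cumsum(draws) / seq_along(draws)  # mean of the first 1, 2, 3, ... draws
plot(running_mean, type = "l")                    # wobbles early, settles near 64.5
abline(h = 64.5, lty = 2)                         # the true population mean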
Let’s represent a population of female heights (based on a google search, it looks like the mean is 64.5 inches with a sd of 2.5 inches).
Let’s look at what the histogram of this distribution looks like
Create a normal density plot with a mean of 64.5 and sd of 2.5. We make a data frame with the points 57 and 72 so that we are plotting over this range, and args=list(mean=64.5, sd=2.5) gives the normal density its mean of 64.5 and sd of 2.5. We used this code in the hw last week.
library(ggplot2)   # for ggplot() and stat_function()
ggplot(data.frame(x=c(57, 72)), aes(x=x)) +
  stat_function(fun=dnorm, args=list(mean=64.5, sd=2.5))
stat_function() gives us a normal curve. This represents what we think the population looks like. We can then randomly draw samples from this distribution.
This density plot is the proposed population curve. We want to take samples from a population with this characteristic.
Now let’s create samples from this distribution with rnorm(), where “r” means “random”. We randomly select 10 values from a normal distribution with a mean of 64.5 and sd of 2.5. It is another function in the norm family (d, p, q, and r).
height.1 <- rnorm(n=10, mean=64.5, sd=2.5)
height.1
## [1] 64.02387 63.28834 63.49557 64.38659 63.28811 66.92035 65.48174 65.98774
## [9] 62.27811 68.38742
We draw a sample of 10 values from the population that has a mean of 64.5 and sd of 2.5. This is a random process; each time you run this, it will give different values.
What is the mean of this sample? What are the means of additional samples?
mean(height.1)
## [1] 64.75378
height.2 <- rnorm(n=10, mean=64.5, sd=2.5)
mean(height.2)
## [1] 65.47619
height.3 <- rnorm(n=10, mean=64.5, sd=2.5)
mean(height.3)
## [1] 63.24093
The way the central limit theorem works is that if you go out and take lots of little samples, calculate the mean of each of these little samples, and then plot the means, you get a normal distribution.
So we’ll want to make sampling the means more automated, and then look at the distribution of these sample means. We are going from the population to the means.
sample_means <- function(n) {
  # draw one sample of size n from the population, then return its mean
  x <- mean(rnorm(n=n, mean=64.5, sd=2.5))
  x
}
We create a function using function(). This tells R that we are creating a function. When we call a function, we send in an argument; n is what you put in between the parentheses when you call the function.
The goal of this function is to obtain a sample from the population that has a mean of 64.5 and sd of 2.5 and then get the mean of that sample. We use rnorm() to randomly choose values from a normal distribution and then mean() to get the mean of that one sample. Here we nest rnorm() inside mean() so that x is the mean of the randomly selected numbers; x is the result the function returns. Once you create the function, you have to run it to store it in the environment.
Now, let’s practice “calling” the function we just created:
m <- sample_means(10)
m
## [1] 64.42085
We enter 10 for n in the function and then save/store the result as m (it is called x inside the function, but x itself is not stored in the environment).
What we might like to do is replicate our function over and over again so that we can get many sample means. Remember, this is the goal, so we can demonstrate the central limit theorem.
First, we will set the sample size. We usually use n to
represent sample size and it is what we used in our function to
represent the sample size. Setting a value to n makes it
easy to adjust in the future. B is how many samples. The
larger B is, the longer it takes to run the code.
n <- 10 #how big is our sample
B <- 10000 #how many samples we will select
We use replicate() to run our function over and over again B times. We already set B = 10000 and n = 10 above. We store the 10,000 sample means in a vector called s.
s <- replicate(B, sample_means(n))
str(s)
## num [1:10000] 63.9 65.9 63.6 64.1 63.2 ...
The vector s is numeric and contains 10,000 values. How often do you think a value will be below 60? Or above 70? This will be rare. The distribution of the sample means clusters around the mean, so we expect the standard deviation of s to be smaller, and its density plot should be a lot taller than the population’s normal curve.
library(dplyr)   # for the %>% pipe
data.frame(s) %>%
  ggplot(aes(s)) +
  geom_density(fill="pink") +
  xlim(57,72) +
  stat_function(fun=dnorm, args=list(mean=64.5, sd=2.5))
How does this plot compare with the normal curve we made on the last slide? What does this tell us? The pink area is the density of the sample means. As we expected, the standard deviation is smaller and the plot is narrow and tall: the standard deviation of the sample means is smaller than that of the population distribution, and more of the data are close to the population mean. But the curve isn’t perfectly smooth.
What happens when you change the sample size from 10 to 30? Change n from 10 to 30.
n <- 30 #how big is our sample
s <- replicate(B, sample_means(n)) #rerun the replicate function
Now our plot is smoother in shape but still has the high peak.
data.frame(s) %>%
  ggplot(aes(s)) +
  geom_density(fill="pink") +
  xlim(57, 72) +
  stat_function(fun=dnorm, args=list(mean=64.5, sd=2.5))
How does having a larger sample size n relate to the central limit theorem?

### What is the central limit theorem?

The idea is that if you have a population, take samples of a given size n, and calculate the distribution of the sample means, then as n increases (to a sample size of at least 30), the sampling distribution will look normal, with a mean equal to the population mean and a standard deviation equal to the standard error (SE).
As samples get large (greater than 30) the sampling distribution has a normal distribution with a mean equal to the population mean, and a standard deviation equal to the Standard Error (SE)
The standard error is the sample standard deviation s divided by the square root of the sample size n: SE = s / sqrt(n). It is the standard deviation of the distribution of sample means.
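We can check this against the simulation above (assuming s still holds the 10,000 sample means from the n = 30 run): the standard deviation of the sample means should be close to the theoretical standard error.

sd(s)            # empirical standard deviation of the sample means
2.5 / sqrt(30)   # theoretical standard error, about 0.456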
It doesn’t matter what your underlying distribution looks like. Let’s
see how the central limit theorem works when we take samples from a
population that does not have a normal distribution.
rlnorm() randomly draws values from a log-normal distribution (a skewed distribution whose logarithm is normal).
t <- rlnorm(10000)
hist(t)
Note that the histogram is not normal. We draw samples from this distribution, calculate the mean inside a function, and replicate that function B times. Then we plot the sample means just as we did before.
log_smpl_means <- function(n) {
  # draw one sample of size n from the log-normal distribution, then return its mean
  x <- mean(rlnorm(n=n))
  x
}
S <- replicate(B, log_smpl_means(n))
data.frame(S) %>% ggplot(aes(S)) +
geom_density(fill="pink")
Notice how the distribution of the sample means is more normally distributed, even though we took these samples from a population that was not normal. This is the central limit theorem.
z-score - Expresses a score in terms of how many standard deviations it is away from the mean. How far a value is from the mean in standard deviations.
1.96 cuts off the top 2.5% of the distribution: the area to the right of that point is 2.5% of the population.
−1.96 cuts off the bottom 2.5% of the distribution: the area to the left of that point is 2.5% of the population. Together, the areas beyond both points make up 5%, OR:
95% of z-scores lie between −1.96 and 1.96.
99% of z-scores lie between −2.58 and 2.58.
pnorm(-2.58)*100
## [1] 0.4940016
0.5% of the population is to the left of that point.
pnorm(2.58)*100
## [1] 99.506
0.5% of the population is to the right of that point.
0.5 + 0.5
## [1] 1
100-1
## [1] 99
99% of z-scores lie between −2.58 and 2.58
Use pnorm() to determine the area to the left of a z-score of 0.
pnorm(0)
## [1] 0.5
50% of the area is to the left of z = 0 (the center of the curve).
We care about this to calculate our CI.

### Confidence Intervals
qnorm(p = c(0.025, 0.975))
qnorm() returns the z-scores corresponding to the 2.5% and 97.5% areas under the curve. How do we calculate a 95% confidence interval? Confidence Interval = (mean - 1.96 * SE, mean + 1.96 * SE)
A confidence interval tells us our uncertainty: CI = Xbar +/- z((1-p)/2) * SE, where p is the confidence level and z((1-p)/2) is the z-score whose lower-tail area is (1-p)/2.
Xbar is the sample mean. We used 64.5 earlier, so let’s do that example.
Xbar <- 64.5
The z cutoff corresponds to the tail area (1 - p)/2, where the confidence level p is frequently 95%:
(1-0.95)/2
## [1] 0.025
Ask: “What is the z-score that corresponds with 0.025?” That is, the z-score with that area to its left.
z_score <- qnorm(0.025)
z_score
## [1] -1.959964
Notice that the z-score has both a positive and a negative value, so you will use both. Remember, this is a confidence interval; the result is an interval, a range between two values.
SE is the standard error. What is SE? SE = sd/sqrt(n). We used sd = 2.5 in an earlier example, so let’s continue with that here.
SE <- 2.5/sqrt(30)
SE
## [1] 0.4564355
CI = Xbar +/- z((1-p)/2) * SE
CI <-Xbar + c(1, -1) * z_score * SE
CI
## [1] 63.6054 65.3946
The c(1, -1) makes sure that we both add and subtract: because z_score is −1.96, multiplying by c(1, -1) uses both −1.96 and +1.96.
For a 99% confidence interval, change the z part.
(1-0.99)/2
## [1] 0.005
z_score <- qnorm(0.005)
z_score
## [1] -2.575829
Confidence Interval = (mean - 2.58 * SE, mean + 2.58 * SE)
CI <-Xbar + c(1, -1) * z_score * SE
CI
## [1] 63.3243 65.6757
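As a convenience, these steps could be wrapped in a small hypothetical helper function (not from the lecture) that returns the normal-based confidence interval for any mean, sd, sample size, and confidence level p:

ci_norm <- function(xbar, sd, n, p = 0.95) {
  z_score <- qnorm((1 - p) / 2)    # negative cutoff, e.g. -1.96 for p = 0.95
  SE <- sd / sqrt(n)               # standard error
  xbar + c(1, -1) * z_score * SE   # lower and upper limits
}
ci_norm(64.5, 2.5, 30)             # should match the 95% CI above
ci_norm(64.5, 2.5, 30, p = 0.99)   # should match the 99% CI above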
For the Central Limit Theorem to apply, the sample size should be larger than 30.
If the sample size is less than 30, then we use a t-distribution, via the function qt(). The t-distribution has the same family of functions as the normal distribution: dt, pt, qt, and rt. The degrees of freedom are df = n - 1; if the sample size is 10, df = 9.
qt(0.025, df=9)
## [1] -2.262157
This gives a slightly bigger CI because we have less certainty. Note that you need to supply the degrees of freedom. Use the t-distribution when the sample size is less than 30.
Use qt(). For the first argument, enter the tail area (1 - p)/2, where p is the percent of confidence. For the second argument, enter the degrees of freedom (n - 1). A quick comparison with the normal cutoff appears below.
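The comparison shows why this only matters for small samples: as the degrees of freedom grow, the t cutoff shrinks toward the normal cutoff of about −1.96.

qt(0.025, df = c(9, 29, 99, 999))   # t cutoffs for increasing sample sizes
qnorm(0.025)                        # the normal cutoff, about -1.96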
As a reader, we might ask: how much do the confidence intervals overlap? Which groups are statistically different? No overlap suggests that the groups are statistically different from each other.
Bars represent 95% confidence intervals.
All of these are significant. The error bars can overlap by about half and still be statistically significant.
Practice: Yellowstone Question
data(faithful)
colnames(faithful)
## [1] "eruptions" "waiting"
Xbar +/- Z * s/sqrt(n)
#Find the means
mean_erupt <- mean(faithful$eruptions)
mean_erupt
## [1] 3.487783
mean_wait <- mean(faithful$waiting)
mean_wait
## [1] 70.89706
#Find the standard deviation of the sample
sd_e <- sd(faithful$eruptions)
sd_e
## [1] 1.141371
sd_w <- sd(faithful$waiting)
sd_w
## [1] 13.59497
# calculate the zscore with eruption at a 95% CI and waiting at a 99% CI
zscore_e<- qnorm(0.025)
zscore_e
## [1] -1.959964
zscore_w <- qnorm(0.005)
zscore_w
## [1] -2.575829
#save the sample size
sample_size <- 272
# standard error
se_e <- sd_e / sqrt(sample_size)
se_e
## [1] 0.0692058
se_w <- sd_w / sqrt(sample_size)
se_w
## [1] 0.8243164
#mean eruption time with a 95% CI
mean_erupt + c(1, -1)* zscore_e * se_e
## [1] 3.352142 3.623424
#mean waiting time with a 99% CI
mean_wait + c(1, -1)* zscore_w * se_w
## [1] 68.77376 73.02036
What is the t-distribution used for? It is for smaller sample sizes.
Let’s pretend that you only have data on 10 eruptions. Calculate the 95% confidence interval.
#select the first ten values from the eruptions column
erupt <- faithful[1:10,1]
#calculate the mean of the sample of 10 eruptions from the faithful data frame
mean_erupt_10 <- mean(erupt)
mean_erupt_10
## [1] 3.3032
#tscore for 95% CI
tscore <- qt(0.025, 9)
tscore
## [1] -2.262157
#standard deviation of sample of 10 from eruption
sd_er <- sd(erupt)
sd_er
## [1] 1.056433
# standard error
se_t <- sd_er / sqrt(10)
se_t
## [1] 0.3340734
# mean with 95% CI using the t-distribution
mean_erupt_10 +c(1, -1)*tscore*se_t
## [1] 2.547473 4.058927
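As a check (not part of the lecture code), R’s built-in t.test() uses the same t-based formula, so its 95% confidence interval should reproduce the interval above:

t.test(erupt)$conf.int   # t-based 95% CI for the 10 eruption times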
Hypothesis
Test statistics
Making decisions
P-values
Types of errors
Power & Effect Sizes
Null hypothesis, H0
There is no effect.
E.g. Big Brother contestants and members of the public will not differ in their scores on personality disorder questionnaires
The alternative hypothesis, H1 or Ha
AKA the experimental hypothesis
E.g. Big Brother contestants will score higher on personality disorder questionnaires than members of the public
Ho: µ1 = µ2    Ha: µ1 ≠ µ2
Calculate the probability of the test statistic.
A. If the p-value is greater than 0.05, we say: “We fail to reject the null hypothesis.”
This does not mean the null hypothesis is true! This language is important. We don’t say that ‘we accept the null’ because we don’t have evidence for it; we say ‘fail to reject’ because we don’t have evidence to reject it.
The null is never really true unless you are comparing “A to itself” using the same measurement, and why would anyone do this? We don’t. Two different things will always differ by some amount, and even the same thing measured at different times will show some difference, if only because the times differ. This is almost by definition, so the null is essentially never exactly true. The real question is whether what you are comparing is statistically different enough to reject the null.
B. If it is less than 0.05, reject the null hypothesis.
If you reject the null you say: “We reject the null hypothesis in favor of the alternative.” or something to this effect. This does not mean the alternative hypothesis is necessarily true. This is all probabilistic. “We have evidence to suggest that the alternative is true.”
If there is only a 5% chance of the event occurring, many scientists consider that a useful threshold for confidence. This is the role of the p-value, the value we get from the test statistic. It gives you a level of probability for judging whether the two samples could plausibly come from the same population. If you did this experiment over and over again (a frequentist idea), how likely is it that you would find a difference as large as the one you found, or larger, by chance alone? The p-value gives you the probability of getting the difference you found, or a greater one, under that assumption. A p-value of 0.05 means that, if the two groups were truly equal, a difference this large would occur only about 5% of the time; if it is less than 0.05, it is unlikely these two samples are the same. If we did this 100 times when the null were true, we would be wrong about 5% of the time. An alpha level of 5% (0.05) is considered a useful level of confidence, and alpha = 0.05 corresponds to the 95% confidence level.
Edit to the diagram: not “accept the null” but “fail to reject the null.”
A test statistic is a statistic for which the frequency of particular values is known.
Two-sided test: Ho: mu1 = mu2 Ha: mu1 ≠ mu2
One-sided test: Ho: mu1 ≥ mu2 Ha: mu1 < mu2
The equal sign has to be in the null.
This changes the probability: it moves the whole rejection area over to one side, which gives you more power, and because of this it could be considered a form of p-hacking.
You are focused on one side, one direction: you only care whether one is better than the other.
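A small sketch of why this gives more power, assuming a hypothetical z test statistic of 1.8 in the predicted direction:

z <- 1.8
2 * pnorm(-abs(z))   # two-sided p-value, about 0.072 (not significant at 0.05)
pnorm(-z)            # one-sided p-value, about 0.036 (significant at 0.05)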
significance testing
We assume that the null hypothesis is true (that there is no effect)
We fit a statistical model that represents the alternative hypothesis and see how well it fits
P-value: We calculate the probability (over many, many identical tests) of getting a test statistic at least as big as the one we have if there were no effect and all other assumptions of the model were met.
If that probability is very small (usually less than 0.05), we conclude that the model (alternative hypothesis) fits the data well and have data to support the alternative hypothesis.
In statistics, we begin with the assumption that the null hypothesis is true: we assume that there is no effect, that the means are equal to each other. Then we fit a model representing the alternative hypothesis and see how well it fits. We look to see whether the groups differ; if they truly differ, the alternative model will fit well. We get a p-value: the probability, over many, many identical tests, of getting a test statistic at least as big as the one we have if there were in fact no effect and all other assumptions of the model were met.
If the p-value is very small, “we conclude that the statistical model of the alternative hypothesis fits the data well. We have data to support the alternative hypothesis.”
Type I error
occurs when we believe that there is a genuine effect in our population, when in fact there isn’t.
The probability is the α-level (usually .05)
Type II error
occurs when we believe that there is no effect in the population when, in reality, there is.
The probability is the β-level (often .2)
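A small simulation (not from the lecture) illustrates the α-level: if both groups come from the same population, the null is true, yet about 5% of t-tests still come out “significant” at α = 0.05.

false_pos <- replicate(10000, t.test(rnorm(20), rnorm(20))$p.value < 0.05)
mean(false_pos)   # proportion of Type I errors; should be close to 0.05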
Example: If you do 3 tests, each with a 95% probability of no Type I error, you get 0.95 * 0.95 * 0.95 = 0.857, so your probability of at least one Type I error = 1 − 0.857 = 14.3%. To deal with this, many people apply a Bonferroni correction: Pcrit = α / k, where k is the number of comparisons.
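Reproducing that arithmetic in R, and noting that p.adjust() can apply the Bonferroni correction to observed p-values directly (the three p-values below are hypothetical):

1 - 0.95^3   # familywise Type I error for 3 tests, about 0.143
0.05 / 3     # Bonferroni critical p-value (alpha / k)
p.adjust(c(0.01, 0.04, 0.20), method = "bonferroni")   # adjusted p-values (multiplied by k)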
About the p-value: it doesn’t tell us about the importance of an effect. Some journals are attempting to move away from a reliance on p-values.
Incentive structures and publication bias
Researcher degrees of freedom
p-hacking and HARKing
Misconception 1: A significant result means that the effect is important.
No, because significance depends on sample size.
Misconception 2: A non-significant result means that the null hypothesis is true.
No, a non-significant result tells us only that the effect is not big enough to be detected (given our sample size); it doesn’t tell us that the effect size is zero.
Misconception 3: A significant result means that the null hypothesis is false.
No, a significant result only means the observed data would be unlikely if the null were true; it is a probabilistic statement, not proof that the null is false.
A scientist has many decisions to make when designing and analyzing a study. There are a lot of different ways you can look at things and analyze things.
The alpha level, the level of power, how many participants should be collected, which statistical model to fit, how to deal with extreme scores, which control variables to consider, which measures to use, and so on
Researchers might use these researcher degrees of freedom to present their results in the most favourable light (Simmons, Nelson, & Simonsohn, 2011)
Fanelli (2009) assimilated data from studies in which scientists reported on other scientists’ behaviour.
On average, 14.12% had observed others fabricating or falsifying data, or altering results to improve the outcome
A disturbingly high 28.53% reported other questionable practices
p-hacking
Researcher degrees of freedoms that lead to the selective reporting of significant p-values
HARKing
The practice in research articles of presenting a hypothesis that was made after data collection as though it were made before data collection
The ASA statement on p-values (Wasserstein & American Statistical Association, 2016).
The ASA points out that p-values can indicate how incompatible the data are with a specified statistical model (e.g., how incompatible the data are with the null hypothesis). You are at liberty to use the degree of incompatibility to inform your own beliefs about the relative plausibility of the null and alternative hypotheses, as long as you don’t interpret p-values as a measure of the probability that the hypothesis in question is true. They are also not the probability that the data were produced by random chance alone.
Scientific conclusions and policy decisions should not be based only on whether a p-value passes a specific threshold.
Don’t p-hack. Be fully transparent about the number of hypotheses explored during the study, and all data collection decisions and statistical analyses.
Don’t confuse statistical significance with practical importance. A p-value does not measure the size of an effect and is influenced by the sample size, so you should never interpret a p-value in any way that implies that it quantifies the size or importance of an effect.
‘By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.’
Open science
A movement to make the process, data and outcomes of research freely available to everyone.
Pre-registration of research
The practice of making all aspects of your research process (rationale, hypotheses, design, data processing strategy, data analysis strategy) publicly available before data collection begins.
Registered reports in an academic journal
If the protocol is deemed to be rigorous enough and the research question novel enough, the protocol is accepted by the journal typically with a guarantee to publish the findings no matter what they are
Public websites (e.g., the Open Science Framework).
An effect size is a standardized measure of the size of an effect:
Standardized = comparable across studies
Not (as) reliant on the sample size
Allows people to objectively evaluate the size of observed effect
They encourage interpreting effects on a continuum and not applying a categorical decision rule such as ‘significant’ or ‘not significant’.
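One common standardized effect size (not named explicitly in the lecture) is Cohen’s d: the difference between two group means in units of their pooled standard deviation. A sketch with hypothetical groups:

g1 <- rnorm(50, mean = 66, sd = 2.5)    # hypothetical group 1
g2 <- rnorm(50, mean = 64.5, sd = 2.5)  # hypothetical group 2
pooled_sd <- sqrt(((length(g1) - 1) * sd(g1)^2 + (length(g2) - 1) * sd(g2)^2) /
                    (length(g1) + length(g2) - 2))
(mean(g1) - mean(g2)) / pooled_sd       # Cohen's d, roughly 0.6 for these settings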
Effect sizes and sample size
Effect sizes are affected by sample size (larger samples yield better estimates of the population effect size), but, unlike p-values, there is no decision rule attached to effect sizes so the interpretation of effect sizes is not confounded by sample size.
Effect sizes and researcher degrees of freedom
Although there are researcher degrees of freedom (not related to sample size) that researchers could use to maximize (or minimize) effect sizes, there is less incentive to do so because effect sizes are not tied to a decision rule in which effects either side of a certain threshold have qualitatively opposite interpretations.