Chapter 10

This document was composed from Dr. Snopkowski’s ANTH 504 Week 5 lecture and Danielle Navarro’s 2021 Learning statistics with R, Chapters 10 and 11.

Objectives for Ch. 10

  1. Identify Samples vs. Populations

  2. The Law of Large Numbers and Central Limit Theorem

  3. Estimating population parameters

  4. Estimating confidence intervals

Building Statistical Models

Populations and Samples

What is the difference between:

  • Population
    • the defined group that you want the sample to tell you about
    • we never know what the population actually looks like
  • Sample
    • a subset of the population
Remember from Lecture 1:

Sampling Techniques

  • It is important that if you are using a sample it is representative of your population.

  • One way to do this is to have a simple random sample, which means every member of the population has an equal chance of being selected.

  • One way to achieve random samples is to give each individual in your population a number and then use a random number generator to select individuals.

  • Random: Everyone has an equal chance of being selected for the study.

  • Stratified: population divided into several sub-populations, and individuals are sampled from each

  • Oversampling: deliberately over-represent rare groups

  • Snowball: used when sampling from “hidden” or hard-to-access populations; ask participants to refer more participants. A type of convenience sample

  • Convenience: sampling whoever is easiest to reach

What is the problem if you don’t have a random sample?

  • It might give biased answers.

The Only Equation You Will Ever Need

outcome = model + error

Every statistical model takes this form: the data we observe equal the model plus some error. If a model has less error, it is a better fit than a model with more error.

What is the most basic statistical model?

The model of the mean.

A Simple Statistical Model
  • In statistics we fit models to our data

  • We use a statistical model to represent what is happening in the real world.

  • The mean is a hypothetical value

  • it doesn’t have to be a value that actually exists in the data set

  • the mean is a simple statistical model

Population parameters vs. sample statistics
  • Population parameters are parameters that represent the entire population. They are typically unknown (since it’s difficult to measure an entire population of anything)

  • Sample statistics are the values we get from a sample

  • X bar (X̄) is a sample statistic

  • mu (μ) is a population parameter

We sometimes shift between the two. Our best information, the best guess of the population mean, is the sample mean, but μ represents the population, which is unknowable.

Measuring the ‘Fit’ of the Model
  • The mean is a model of what happens in the real world: the typical score (this is the parameter)

  • It is not a perfect representation of the data

  • How can we assess how well the mean represents reality?

The Mean as a Model

A Perfect Fit

If the mean represented every measure, it would be a perfect fit, but you have probably done something wrong.

Calculating ‘Error’

  • A deviation is the difference between the mean and an actual data point.

  • Deviations can be calculated by taking each data point and subtracting the mean: deviation = score − mean

Why not use the Total Error?

If you sum the deviations, the positive and negative deviations cancel and you get zero. So we square them.
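A quick sketch with made-up scores shows the cancellation (the numbers here are hypothetical):

scores <- c(2, 4, 6, 8, 10)  # hypothetical data
deviations <- scores - mean(scores)
deviations  # -4 -2  0  2  4
sum(deviations)  # 0: the deviations always cancel out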

Sum of Squared Errors

  • Therefore, we square each deviation.

  • If we add these squared deviations we get the Sum of Squared Errors (SS)

What is the problem with the sum of squares?

It depends on the number of scores in the sample. So we calculate the variance instead.

Variance

  • We calculate the average variability by dividing the sum of squares by the number of scores minus one (n − 1)
  • Dividing by n − 1 rather than n gives an unbiased estimate of the population variance

  • This value is called the variance (s²)

Standard Deviation

  • The variance has one problem: it is measured in units squared.

  • This isn’t a very meaningful metric, so we take the square root of the variance.

  • This is the standard deviation (s)
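Putting the whole chain together with the same hypothetical scores (R’s built-in var() and sd() divide by n − 1, so they should match):

scores <- c(2, 4, 6, 8, 10)  # hypothetical data
SS <- sum((scores - mean(scores))^2)  # sum of squared errors
SS / (length(scores) - 1)  # variance s^2 = 10, same as var(scores)
sqrt(SS / (length(scores) - 1))  # standard deviation s, same as sd(scores)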

Samples Vs. Populations

  • Sample

    • Mean and SD describe only the sample from which they were calculated.
  • Population

    • Mean and SD are intended to describe the entire population (very rare to actually know).
  • Sample to Population:

    • Mean and SD are obtained from a sample, but are used to estimate the mean and SD of the population (very common in social sciences)

Law of Large Numbers


  • A mathematical law that applies to many different sample statistics, but is easiest to think about in terms of the law of averages.

  • As the sample gets larger, the sample mean tends to get closer to the true population mean. We looked at picking blue balls in the last chapter: the more balls we pull, the closer the sample mean gets to the population mean. Bigger samples are always better.

  • Unfortunately, it’s not very helpful to know that we need a bigger and bigger sample…so we can also use Central Limit Theorem to help us
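Here is a minimal sketch of the law in action, drawing from the female-height population used in the next section (mean 64.5, sd 2.5):

set.seed(1)  # for reproducibility
mean(rnorm(n=10, mean=64.5, sd=2.5))  # a small sample can miss by a lot
mean(rnorm(n=1000, mean=64.5, sd=2.5))  # closer to 64.5
mean(rnorm(n=100000, mean=64.5, sd=2.5))  # closer still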

Sampling distributions of the mean

  • Let’s represent a population of female heights (based on a google search, it looks like the mean is 64.5 inches with a sd of 2.5 inches).

  • Let’s look at what the histogram of this distribution looks like

Create a normal plot with a mean of 64.5 and sd of 2.5. We make a data frame with the points 57 and 72, which restricts the plot to that range. The argument args=list(mean=64.5, sd=2.5) gives a normal density plot with a mean of 64.5 and sd of 2.5. We used this code in the homework last week.

ggplot(data.frame(x=c(57, 72)), aes(x=x)) + 
  stat_function(fun=dnorm, args=list(mean=64.5, sd=2.5))

stat_function() gives us a normal curve. This represents what we think the population looks like. We can create samples from this distribution by choosing values at random.

This density plot is the proposed population curve. We want to take samples from a population with these characteristics.

Now let’s create samples from this distribution with rnorm(), where “r” means “random”. We randomly select 10 values from a normal distribution with a mean of 64.5 and sd of 2.5. rnorm() is another member of the norm function family (d, p, q, and r).

height.1 <- rnorm(n=10, mean=64.5, sd=2.5)
height.1
##  [1] 64.02387 63.28834 63.49557 64.38659 63.28811 66.92035 65.48174 65.98774
##  [9] 62.27811 68.38742

We draw one sample of 10 values from the population with a mean of 64.5 and sd of 2.5. This is a random process: each time you run this, it will give different values.

What is the mean of the sample? What are the means of additional samples?

mean(height.1)
## [1] 64.75378
height.2 <- rnorm(n=10, mean=64.5, sd=2.5)
mean(height.2)
## [1] 65.47619
height.3 <- rnorm(n=10, mean=64.5, sd=2.5)
mean(height.3)
## [1] 63.24093

The way the central limit theorem works is this: if you go out and take lots of little samples, calculate the mean of each little sample, and then plot the means, you get a normal distribution.

So, we’ll want to make sampling the means more automated, then take the distribution of these sample means. We are going from the population to a distribution of means.

Let’s create a function to see what the distribution of sample means looks like
sample_means <- function(n) {
  # draw a sample of size n from the population, then take its mean
  x <- mean(rnorm(n=n, mean=64.5, sd=2.5))
}

We create a function using function(), which tells R that we are creating a function. When we call a function, we pass in an argument: n is what you put between the parentheses when you call the function.

The goal of this function is to draw a sample from the population with a mean of 64.5 and sd of 2.5, and then take the mean of that sample. We just used rnorm() to randomly choose values from a normal distribution and then mean() to get the mean of that one sample. Here we nest mean() and rnorm() so that x is the mean of the randomly selected numbers; x, the sample mean, is the result of the function. Once you create the function, you have to run it to store it in the environment.

Now, let’s practice using (“calling”) the function we just created:

m <- sample_means(10)
m
## [1] 64.42085

We enter 10 in for n in the function and then save/store the result as m. (Inside the function the result is called x, but x itself is not stored in our environment.)

What we might like to do is replicate our function over and over again so that we can get many sample means. Remember, the goal is to demonstrate the central limit theorem.

First, we will set the sample size. We usually use n to represent sample size, and it is what we used in our function. Assigning the value to n makes it easy to adjust in the future. B is the number of samples; the larger B is, the longer the code takes to run.

n <- 10 #how big is our sample
B <- 10000 #how many samples we will select

We use replicate() to run our function B times. We already set B = 10000 and n = 10 above. We store the 10,000 sample means in a vector called s.

s <- replicate(B, sample_means(n))
str(s)
##  num [1:10000] 63.9 65.9 63.6 64.1 63.2 ...

The vector s is numeric and contains 10,000 values. How often do you think a value will be below 60, or above 70? Rarely. The distribution of the sample means clusters around the mean, so we expect the standard deviation of s to be smaller. Its density plot should be a lot taller than the population’s normal curve.

data.frame(s) %>%
  ggplot(aes(s)) +
  geom_density(fill="pink") +
  xlim(57,72) +
  stat_function(fun=dnorm, args=list(mean=64.5, sd=2.5))

How does this plot compare with the normal curve we made on the last slide? What does this tell us? The pink area is the density of the sample means. As we expected, the standard deviation is smaller and the plot is narrow and tall: the standard deviation of the sample means is smaller than that of the population distribution, and more of the data sits close to the population mean. But the curve isn’t perfectly smooth.

What happens when you change the sample size from 10 to 30? Change n from 10 to 30.

n <- 30 #how big is our sample
s <- replicate(B, sample_means(n)) #rerun the replicate function

Now our plot will be smoother in shape while still having the high peak.

data.frame(s) %>%
  ggplot(aes(s)) +
  geom_density(fill="pink") +
  xlim(57, 72) +
  stat_function(fun=dnorm, args=list(mean=64.5, sd=2.5))

How does having a larger sample size n relate to the central limit theorem?

What is the central limit theorem?

The idea that if you have a population, take samples of a given size n, and build the distribution of the sample means, then as n increases (to at least about 30), that sampling distribution will look normal, with a mean equal to the population mean and a standard deviation equal to the standard error (SE).

What is the standard error?

As samples get large (greater than 30), the sampling distribution is approximately normal, with a mean equal to the population mean and a standard deviation equal to the Standard Error (SE)

How to calculate the Standard Error

The standard error is the standard deviation s divided by the square root of the sample size n: SE = s/sqrt(n). (Equivalently, the SE is the standard deviation of the distribution of sample means.)
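We can check this against the simulation. A sketch (rebuilding s so the chunk runs on its own): the empirical standard deviation of the sample means should sit close to the theoretical SE.

n <- 30
B <- 10000
s <- replicate(B, mean(rnorm(n=n, mean=64.5, sd=2.5)))
sd(s)  # empirical sd of the sample means
2.5 / sqrt(n)  # theoretical SE, about 0.456; the two should be very close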

Key to CLT:

It doesn’t matter what your underlying distribution looks like. Let’s see how the central limit theorem works when we take samples from a population that does not have a normal distribution. rlnorm() draws random values from a log-normal distribution.

t <- rlnorm(10000)
hist(t)

Note that the histogram is not normal. We draw samples from this distribution, calculate the mean inside a function, and replicate that function B times. Then we plot the sample means just as we did before.

log_smpl_means <- function(n) {
  # draw a sample of size n from the log-normal population, then take its mean
  x <- mean(rlnorm(n=n))
}
S <- replicate(B, log_smpl_means(n))
data.frame(S) %>%
  ggplot(aes(S)) +
  geom_density(fill="pink")

Notice how the distribution of the sample means is approximately normal, even though we took these samples from a population that was not normal. This is the central limit theorem.

Summary of symbols

Sample statistic: mean X̄ (X bar), standard deviation s. Population parameter: mean μ (mu), standard deviation σ (sigma).

Review: z-scores

z-score: expresses a score in terms of how many standard deviations it is away from the mean, i.e., how far a value is from the mean in standard deviation units.
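For example, using the height population from earlier (mean 64.5, sd 2.5), a hypothetical height of 67 inches sits one standard deviation above the mean:

x <- 67  # hypothetical height in inches
(x - 64.5) / 2.5  # z-score = 1: one sd above the mean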

Properties of z-scores

  • 1.96 cuts off the top 2.5% of the distribution: the area to the right of that point is 2.5% of the population

  • −1.96 cuts off the bottom 2.5% of the distribution: the area to the left of that point is 2.5% of the population. Together the two tails account for 5% of the area, OR:

  • 95% of z-scores lie between −1.96 and 1.96.

  • 99% of z-scores lie between −2.58 and 2.58.

pnorm(-2.58)*100
## [1] 0.4940016

0.5% of the population is to the left of that point.

pnorm(2.58)*100
## [1] 99.506

0.5% of the population is to the right of that point.

0.5 + 0.5
## [1] 1
100-1
## [1] 99

99% of z-scores lie between −2.58 and 2.58

  • 99.9% of them lie between −3.29 and 3.29.
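We can verify all three coverage claims directly with pnorm(), subtracting the lower-tail area from the upper:

pnorm(1.96) - pnorm(-1.96)  # about 0.95
pnorm(2.58) - pnorm(-2.58)  # about 0.99
pnorm(3.29) - pnorm(-3.29)  # about 0.999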

Use pnorm() to determine the area to the left of a z-score of 0:

pnorm(0)
## [1] 0.5

50% of the points are to the left of the center of the curve (z = 0).

We care about this so we can calculate our CI.

Confidence Intervals

  • Now that we understand how the sample means vary, we can say something about the variability in our estimate. We do this using confidence intervals. Recall: qnorm(c(0.025, 0.975)) returns the z-scores corresponding to the 2.5% and 97.5% areas under the curve.

Confidence Intervals calculation

How do we calculate a 95% confidence interval? Confidence Interval = (mean - 1.96 * SE, mean + 1.96 * SE)

This tells us our uncertainty. CI = Xbar +/- z * SE, where z is the z-score at the (1 - p)/2 quantile and Xbar is the sample mean. We used 64.5 earlier, so let’s do that example.

Xbar <- 64.5

The quantile for z is (1 - p)/2, where the confidence level p is frequently 95%:

(1-0.95)/2
## [1] 0.025

Ask: “What is the z-score that corresponds with 0.025?” That is, the value with that area to its left.

z_score <- qnorm(0.025)
z_score
## [1] -1.959964

Notice that the z-score is used as both a plus and a minus, so you will use both values. Remember, this is a confidence interval. The result is an interval: a range between two values.

SE is the standard error. What is SE? SE = sd/sqrt(n). We used sd = 2.5 in an earlier example, so let’s continue that here, with n = 30.

SE <- 2.5/sqrt(30)
SE
## [1] 0.4564355

CI = Xbar +/- z * SE:

CI <-Xbar + c(1, -1) * z_score * SE
CI
## [1] 63.6054 65.3946

The c(1, -1) makes sure that we both add and subtract: since z_score is -1.96, multiplying by c(1, -1) gives both -1.96 and 1.96.
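If you do this often, the steps can be bundled into a small helper. This ci_normal() function is our own convenience sketch, not something from the lecture or a package:

# hypothetical helper: normal-theory CI from summary statistics
ci_normal <- function(xbar, sd, n, conf=0.95) {
  z <- qnorm((1 - conf) / 2)  # negative z-score, e.g. -1.96 for 95%
  se <- sd / sqrt(n)
  xbar + c(1, -1) * z * se  # lower bound, then upper bound
}
ci_normal(64.5, 2.5, 30)  # reproduces the interval above: 63.6 to 65.4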

Confidence intervals interpretation

  • True mean: 15 million
  • Sample mean: 17 million
  • Interval estimates:
    • 12 to 22 million (contains the true value)
    • 16 to 18 million (misses the true value)
  • CIs are constructed such that 95% of them contain the true value. If we replicated the experiment over and over and calculated a 95% CI each time, then 95% of those intervals would contain the true mean; 5% of the time the CI would not contain the true mean. The simulation sketch below checks this.
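A sketch using the height population from earlier: draw many samples, build a 95% CI from each, and count how often the interval captures the true mean of 64.5.

set.seed(42)
covered <- replicate(10000, {
  smpl <- rnorm(30, mean=64.5, sd=2.5)  # one sample
  ci <- mean(smpl) + c(-1, 1) * 1.96 * sd(smpl) / sqrt(30)
  ci[1] <= 64.5 & 64.5 <= ci[2]  # did this CI capture the true mean?
})
mean(covered)  # close to 0.95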

How would we calculate a 99% confidence interval?

Change the z-score:

(1-0.99)/2
## [1] 0.005
z_score <- qnorm(0.005)
z_score
## [1] -2.575829

Confidence Interval = (mean - 2.58 * SE, mean + 2.58 * SE)

CI <-Xbar + c(1, -1) * z_score * SE
CI
## [1] 63.3243 65.6757

For the central limit theorem to apply, the sample size should be at least 30.

t distribution

If the sample size is less than 30, we use a t distribution via the function qt(). The t distribution has the same family of functions as the normal: dt, pt, qt, and rt. The degrees of freedom are df = n - 1; if the sample size is 10, then df = 9.

qt(0.025, df=9)
## [1] -2.262157

This gives a slightly wider CI because we have less certainty. Note that you need to supply the degrees of freedom. Use t when the sample size is less than 30.

How would we calculate confidence intervals for small samples?

Use qt(). For the first argument, enter the quantile that corresponds to the confidence level: (1 - p)/2, where p is the confidence level. For the second argument, enter the degrees of freedom (n - 1).
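Comparing qt() with qnorm() shows why the t-based interval is wider for small samples and approaches the normal as n grows:

qt(0.025, df=9)  # -2.26: multiplier for a sample of 10
qnorm(0.025)  # -1.96: the large-sample (normal) value
qt(0.025, df=100)  # -1.98: with more data, t approaches the normal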

Showing Confidence Intervals Visually

As a reader, we might ask: how much do the intervals overlap? Which groups are statistically different? No overlap suggests that the groups are statistically different from each other.

Confidence Intervals and Statistical Significance

Bars represent 95% confidence intervals.

All of these are significant. The error bars can overlap by half and still be statistically significant.

Practice: Yellowstone Question

data(faithful)
colnames(faithful)
## [1] "eruptions" "waiting"

Xbar +/- Z * s/sqrt(n)

#Find the means
mean_erupt <- mean(faithful$eruptions)
mean_erupt
## [1] 3.487783
mean_wait <- mean(faithful$waiting)
mean_wait 
## [1] 70.89706
#Find the standard deviation of the sample
sd_e <- sd(faithful$eruptions)
sd_e
## [1] 1.141371
sd_w <- sd(faithful$waiting)
sd_w 
## [1] 13.59497
# calculate the zscore with eruption at a 95% CI and waiting at a 99% CI
zscore_e<- qnorm(0.025)
zscore_e
## [1] -1.959964
zscore_w <- qnorm(0.005)
zscore_w
## [1] -2.575829
#save the sample size
sample_size <- 272
# standard error
se_e <- sd_e / sqrt(sample_size)
se_e
## [1] 0.0692058
se_w <- sd_w / sqrt(sample_size)
se_w
## [1] 0.8243164
#mean eruption time with a 95% CI
mean_erupt + c(1, -1)* zscore_e * se_e
## [1] 3.352142 3.623424
#mean waiting time with a 99% CI
mean_wait + c(1, -1)* zscore_w * se_w
## [1] 68.77376 73.02036

What is the t-distribution used for? It is for smaller sample sizes.

Let’s pretend that you only have data on 10 eruptions. Calculate the 95% CI.

#select the first ten values from the eruptions column
erupt <- faithful[1:10,1]
#calculate the mean of this sample of 10 eruptions from the faithful data frame
mean_erupt_10 <- mean(erupt)
mean_erupt_10 
## [1] 3.3032
#tscore for 95% CI
tscore <- qt(0.025, 9)
tscore
## [1] -2.262157
#standard deviation of sample of 10 from eruption
sd_er <- sd(erupt)
sd_er
## [1] 1.056433
# standard error
se_t <- sd_er / sqrt(10)
se_t
## [1] 0.3340734
# mean with 95% CI using the t distribution
mean_erupt_10 +c(1, -1)*tscore*se_t
## [1] 2.547473 4.058927

Hypothesis Testing (Ch. 11)

Types of Hypotheses

Null hypothesis, H0

  • There is no effect.

  • E.g. Big Brother contestants and members of the public will not differ in their scores on personality disorder questionnaires

Alternative hypothesis, H1 or Ha

  • AKA the experimental hypothesis

  • E.g. Big Brother contestants will score higher on personality disorder questionnaires than members of the public

Talking about significance or lack thereof

H0: µ1 = µ2. Ha: µ1 ≠ µ2. Calculate the probability of the test statistic.

A. If the p-value is greater than 0.05, we say: “We fail to reject the null hypothesis.”

This does not mean the null hypothesis is true! This language is important. We don’t say ‘we accept the null’ because we don’t have evidence for it. We say ‘fail to reject’ because we don’t have evidence to reject it.

The null is never really true unless you are comparing A to itself with the same measurement. Why would anyone do this? We don’t. Two different things will always differ in some way, and even the same thing measured at different times will differ, if only in the time of measurement. By definition, then, the null is never exactly true. The real question is whether the things you are comparing are statistically different enough to reject the null.

B. If the p-value is less than 0.05, we reject the null hypothesis.

If you reject the null you say: “We reject the null hypothesis in favor of the alternative,” or something to that effect. This does not mean the alternative hypothesis is necessarily true; this is all probabilistic. “We have evidence to suggest that the alternative is true.”

P-value

If there is only a 5% chance of an event occurring, many scientists consider that a useful threshold for confidence. This is the role of the p-value, the value we get from the test statistic. It gives you a level of probability about whether the two samples are truly equal to each other: if you did this experiment over and over again (a frequentist idea), how likely is it that you would find a difference as large as the one you found, or larger, by chance? A p-value of 0.05 means the likelihood of such a difference arising when the groups are truly equal is relatively low, a 5% chance. If the p-value is less than 0.05, it is unlikely these two samples are the same; if we ran the experiment 100 times, we would be wrong about 5 times. An alpha level of 5% (0.05) is considered a useful level of confidence, and it corresponds to the 95% confidence level.

Edit to diagram: we do not “accept” the null; we “fail to reject” it.

Test Statistics

A statistic for which the frequency of particular values is known.

  • Observed values can be used to test hypotheses.

One- and Two-Tailed Tests

Two-sided test: H0: µ1 = µ2; Ha: µ1 ≠ µ2

One-sided test: H0: µ1 ≤ µ2; Ha: µ1 > µ2

  • The equal sign has to be in the null.

  • This changes the probability. It moves the whole rejection area to one side, which gives you more power, and because of this it could be considered a form of p-hacking.

  • You are focused on one side, one direction: you only care whether one is better than the other.
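The shift is easy to see from the critical values: the one-tailed cutoff is smaller, so the same effect is easier to declare significant.

qnorm(0.975)  # 1.96: two-tailed critical value at alpha = 0.05
qnorm(0.95)  # 1.64: one-tailed critical value at alpha = 0.05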

Quick overview of null hypothesis significance testing

  1. We assume that the null hypothesis is true (that there is no effect)

  2. We fit a statistical model that represents the alternative hypothesis and see how well it fits

  3. P-value: We calculate the probability (over many, many identical tests) of getting a test statistic at least as big as the one we have if there were no effect and all other assumptions of the model were met.

  4. If that probability is very small (usually less than 0.05), we conclude that the model (alternative hypothesis) fits the data well and have data to support the alternative hypothesis.

In statistics, we begin with the assumption that the null hypothesis is true: we assume that there is no effect, that the means are equal to each other. Then we fit a model representing the alternative hypothesis and see how well it fits; we look to see whether the groups are different from each other. If they are truly different, the alternative model will fit well. We get a p-value: the probability, over many, many identical tests, of getting a test statistic at least as big as the one we have if there was in fact no effect and all other assumptions of the model were met.

If the p-value is very small, “we conclude that the statistical model of the alternative hypothesis fits the data well. We have data to support the alternative hypothesis.”

Type I and Type II Errors

  • Type I error

    • occurs when we believe that there is a genuine effect in our population, when in fact there isn’t

    • the probability is the α-level (usually .05)

  • Type II error

    • occurs when we believe that there is no effect in the population when, in reality, there is

    • the probability is the β-level (often .2)
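A small simulation sketch makes the α-level concrete: if we compare two samples drawn from the same population many times (so the null is true), about 5% of tests come out “significant” purely by chance. t.test() here is base R’s two-sample t test.

set.seed(7)
p_vals <- replicate(10000, {
  a <- rnorm(30)  # two samples from the SAME population,
  b <- rnorm(30)  # so the null hypothesis is true
  t.test(a, b)$p.value
})
mean(p_vals < 0.05)  # about 0.05: the Type I error rate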

Type I & Type II Errors Graph

Multiple Tests

  • If you do multiple tests, then your probability of a Type 1 error increases.

Example: If you do 3 tests, each with a 95% probability of no Type I error, you get: 0.95 * 0.95 * 0.95 = 0.857. Now your probability of at least one Type I error is 1 - 0.857 = 14.3%. To deal with this, many people apply a Bonferroni correction: Pcrit = α / k, where k is the number of comparisons. A sketch of both calculations follows.
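In R (the three p-values below are hypothetical):

1 - 0.95^3  # family-wise Type I error rate for 3 tests: 0.143
0.05 / 3  # Bonferroni-corrected threshold: about 0.0167
p <- c(0.01, 0.04, 0.20)  # hypothetical p-values from 3 tests
p.adjust(p, method="bonferroni")  # or inflate the p-values instead: 0.03 0.12 0.60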

Pitfalls of being overly concerned about p-values

The p-value doesn’t tell us about the importance of an effect. Some journals are attempting to move away from a reliance on p-values.

NHST and wider problems in science

  • Incentive structures and publication bias

  • Researcher degrees of freedom

  • p-hacking and HARKing

Misconceptions around p-values

  • Misconception 1: A significant result means that the effect is important

    • No, because significance depends on sample size.

  • Misconception 2: A non-significant result means that the null hypothesis is true

    • No, a non-significant result tells us only that the effect is not big enough to be found (given our sample size); it doesn’t tell us that the effect size is zero.

  • Misconception 3: A significant result means that the null hypothesis is false

    • No, it is logically not possible to conclude this. The p-value is the probability that the chosen test statistic would have been at least as large as its observed value if every model assumption were correct, including the test (typically null) hypothesis. Finding a small p-value doesn’t tell us which assumption is incorrect: it could be that the null hypothesis is false, or that the study protocols were violated, or that by chance we got an unusual sample.

Researcher degrees of freedom

  • A scientist has many decisions to make when designing and analyzing a study. There are a lot of different ways you can look at things and analyze things.

  • The alpha level, the level of power, how many participants should be collected, which statistical model to fit, how to deal with extreme scores, which control variables to consider, which measures to use, and so on

  • Researchers might use these researcher degrees of freedom to present their results in the most favourable light (Simmons, Nelson, & Simonsohn, 2011)

  • Fanelli (2009) assimilated data from studies in which scientists reported on other scientists’ behaviour.

  • On average, 14.12% reported having observed others fabricating or falsifying data, or altering results to improve the outcome

  • A disturbingly high 28.53% reported other questionable practices

p-hacking and HARKing

  • p-hacking

  • Researcher degrees of freedoms that lead to the selective reporting of significant p-values

  • HARKing

  • The practice in research articles of presenting a hypothesis that was made after data collection as though it were made before data collection

Evidence for p-hacking?

Making sense of p-values

  • The ASA statement on p-values (Wasserstein & American Statistical Association, 2016).

  • The ASA points out that p-values can indicate how incompatible the data are with a specified statistical model (e.g., how incompatible the data are with the null hypothesis). You are at liberty to use the degree of incompatibility to inform your own beliefs about the relative plausibility of the null and alternative hypotheses, as long as you don’t interpret p-values as a measure of the probability that the hypothesis in question is true. They are also not the probability that the data were produced by random chance alone.

  • Scientific conclusions and policy decisions should not be based only on whether a p-value passes a specific threshold.

  • Don’t p-hack. Be fully transparent about the number of hypotheses explored during the study, and all data collection decisions and statistical analyses.

  • Don’t confuse statistical significance with practical importance. A p-value does not measure the size of an effect and is influenced by the sample size, so you should never interpret a p-value in any way that implies that it quantifies the size or importance of an effect.

  • ‘By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.’

Preregistration

  • Open science

  • A movement to make the process, data and outcomes of research freely available to everyone.

  • Pre-registration of research

  • The practice of making all aspects of your research process (rationale, hypotheses, design, data processing strategy, data analysis strategy) publicly available before data collection begins.

  • Registered reports in an academic journal

  • If the protocol is deemed to be rigorous enough and the research question novel enough, the protocol is accepted by the journal, typically with a guarantee to publish the findings no matter what they are

  • Public websites (e.g., the Open Science Framework).

Effect sizes

  • An effect size is a standardized measure of the size of an effect:

  • Standardized = comparable across studies

  • Not (as) reliant on the sample size

  • Allows people to objectively evaluate the size of observed effect
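One common effect size is Cohen’s d: the difference between two group means expressed in standard deviation units. A minimal sketch, assuming two groups of similar size and using a simple pooled sd:

# minimal Cohen's d sketch (simple pooled sd; assumes similar group sizes)
cohens_d <- function(x, y) {
  pooled_sd <- sqrt((var(x) + var(y)) / 2)
  (mean(x) - mean(y)) / pooled_sd
}
set.seed(3)
cohens_d(rnorm(50, mean=0.5), rnorm(50))  # difference in sd units; the true value here is 0.5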

Advantages of effect sizes

  • They encourage interpreting effects on a continuum and not applying a categorical decision rule such as ‘significant’ or ‘not significant’.

  • Effect sizes and sample size

  • Effect sizes are affected by sample size (larger samples yield better estimates of the population effect size), but, unlike p-values, there is no decision rule attached to effect sizes so the interpretation of effect sizes is not confounded by sample size.

  • Effect sizes and researcher degrees of freedom

  • Although there are researcher degrees of freedom (not related to sample size) that researchers could use to maximize (or minimize) effect sizes, there is less incentive to do so because effect sizes are not tied to a decision rule in which effects either side of a certain threshold have qualitatively opposite interpretations.