Normal distribution: https://mathlets.org/mathlets/probability-distributions/
Confidence Interval: https://mathlets.org/mathlets/confidence-intervals/

Example

Suppose you give a test to a history class. After grading the tests, you compute the mean and standard deviation of the grades to be \(\mu = 80\) and \(\sigma = 12\). What is the standard score for a student who gets a 72 on the exam? At what percentile is this score?
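
One way to work this example in R, as a minimal sketch: the standard score is \(z = (x - \mu)/\sigma\), and the percentile is the area to the left of \(z\) under the standard normal curve. This assumes the grades are approximately normally distributed; the variable names are illustrative.

```r
mu <- 80      # mean of the grades
sigma <- 12   # standard deviation of the grades
x <- 72       # the student's score

# Standard score: how many standard deviations x lies from the mean
z <- (x - mu) / sigma
z

# Percentile: proportion of a normal distribution that falls below z
percentile <- pnorm(z)
percentile
```

Here z is about -0.67 and pnorm(z) is about 0.25, so under the normality assumption the score sits at roughly the 25th percentile.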

For loops

A for loop iterates over the elements of a vector and runs the same code once for each element.

Loop through the sequence 1 to 5, printing the square of each number:

for (j in 1:5) {
  print(j^2) 
} 
  • We can capture the results of our loop in a vector.
n <- 5
x <- rep(0, n)   # pre-allocate a vector of zeros to hold the results
for (j in 1:n) {
  x[j] <- j^2    # store the square of j in position j
}
x

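Let’s use a for loop to estimate the average value of the square of a die roll by simulating 1000 rolls.

ntrials <- 1000
trials <- rep(0, ntrials)        # pre-allocate storage for the rolls
for (j in 1:ntrials) {
  trials[j] <- sample(1:6, 1)    # roll one fair six-sided die
}
mean(trials^2)                   # average of the squared rolls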
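
For comparison, the exact expected value can be computed directly, since each face of a fair die has probability 1/6. The sketch below assumes a fair six-sided die; the names faces, exact, and rolls are illustrative, and the seed is arbitrary. It also shows that the same simulation can be written without a loop by drawing all rolls in one call to sample().

```r
# Exact expected value of the squared roll: each face has probability 1/6
faces <- 1:6
exact <- sum(faces^2) / 6
exact

# Vectorized version of the simulation: draw all 1000 rolls at once
set.seed(1)  # arbitrary seed, only to make the run reproducible
rolls <- sample(faces, 1000, replace = TRUE)
mean(rolls^2)
```

The exact value is 91/6, about 15.17, and the simulated average should land close to it.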

Sample vs Population and Estimation

Introduction

In many fields, we want to understand the characteristics of a large group, but it is too difficult to collect data from everyone. Instead, we collect data from a smaller group, called a sample, and use this information to estimate things about the entire population.

Population vs. Sample

  • Population: The entire group we’re interested in. For example, if we want to know the average height of all students in a university, all the students are the population.
  • Sample: A smaller group selected from the population. If we only measure the height of 100 students, those 100 students are the sample.

Since we usually don’t have data from the entire population, we collect a sample and estimate the population characteristics based on it.

Estimating the Population Mean

The most common thing we want to estimate is the mean (average) of some characteristic in the population (e.g., the average height of all students). To estimate this:

  • We take the mean of the sample.
  • We use that sample mean as an estimate for the population mean.

# Simulated sample data: heights of 50 students (in cm), drawn uniformly between 150 and 180
sample_heights <- runif(50, min = 150, max = 180)

# Calculate the sample mean (average)
sample_mean <- mean(sample_heights)
sample_mean

In this case, sample_mean is the average height of the 50 students. We use this as an estimate for the average height of all students at the university (the population mean).

Estimating the Population Variance

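  • Sample variance: the unbiased estimator for the population variance is \(s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2\), where \(n\) is the sample size and \(\bar{X}\) is the sample mean.
  • In R, var() computes this estimator, so the two lines below give the same value.
sum((sample_heights - sample_mean)^2) / (length(sample_heights) - 1)
var(sample_heights)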
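
To see why the divisor is n - 1 rather than n, a small simulation can compare the two. The sketch below assumes heights drawn from the same uniform distribution as above, whose true variance is \((180-150)^2/12 = 75\). The names divide_by_n and divide_by_n1, the sample size of 5, and the seed are illustrative choices.

```r
set.seed(42)          # arbitrary seed for reproducibility
n <- 5                # a small sample makes the bias easy to see
nreps <- 10000
divide_by_n  <- numeric(nreps)
divide_by_n1 <- numeric(nreps)
for (j in 1:nreps) {
  s <- runif(n, min = 150, max = 180)
  ss <- sum((s - mean(s))^2)       # sum of squared deviations
  divide_by_n[j]  <- ss / n        # biased estimator
  divide_by_n1[j] <- ss / (n - 1)  # unbiased estimator
}
true_var <- (180 - 150)^2 / 12
c(true = true_var, n = mean(divide_by_n), n_minus_1 = mean(divide_by_n1))
```

Averaged over many samples, dividing by n - 1 should come out close to 75, while dividing by n underestimates it by a factor of about (n - 1)/n.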

Estimating Population Proportion

Let’s say we surveyed 100 students, and 60 of them passed an exam. We can estimate the proportion of all students who pass the exam.

# Sample data: 1 = pass, 0 = fail (60 passes and 40 fails, as in the survey)
sample_pass <- c(rep(1, 60), rep(0, 40))

# Calculate the sample proportion (fraction of students who passed)
sample_proportion <- mean(sample_pass)
sample_proportion

Here, sample_proportion is 0.6, which is our estimate of the proportion of all students who pass the exam based on the sample.

Confidence Intervals (Basic Idea)

A confidence interval gives a range of plausible values for the population mean or proportion. It is built by adding and subtracting a margin of error from our estimate. For example, instead of just saying “the average height is 170 cm,” we might say, “we’re 95% confident the true average height is between 167 and 173 cm.”

In R, we can calculate confidence intervals easily.

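To calculate a 95% confidence interval for the mean of our sample heights, we use \(\bar{X} \pm 1.96 \cdot s/\sqrt{n}\), where 1.96 is the 97.5th percentile of the standard normal distribution:

se <- sd(sample_heights) / sqrt(length(sample_heights))  # standard error of the mean
ci <- mean(sample_heights) + c(-1, 1) * 1.96 * se
ci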

The output will give us a range (lower bound, upper bound) for where we expect the true average height to be.
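
The multiplier 1.96 comes from the normal distribution. Because the population standard deviation is itself estimated from the sample, a common alternative is to take the multiplier from the t distribution with n - 1 degrees of freedom. The sketch below is one way to do that; it regenerates the simulated heights so that it runs on its own, and the seed and the names z_ci and t_ci are illustrative.

```r
set.seed(2024)  # arbitrary seed for reproducibility
sample_heights <- runif(50, min = 150, max = 180)
n <- length(sample_heights)
se <- sd(sample_heights) / sqrt(n)  # standard error of the mean

# Normal-based interval (qnorm(0.975) is approximately 1.96)
z_ci <- mean(sample_heights) + c(-1, 1) * qnorm(0.975) * se
# t-based interval with n - 1 degrees of freedom
t_ci <- mean(sample_heights) + c(-1, 1) * qt(0.975, df = n - 1) * se
z_ci
t_ci

# t.test() reports the same t-based interval
t.test(sample_heights, conf.level = 0.95)$conf.int
```

The t-based interval is slightly wider; with 50 observations the difference is small, but it grows as the sample gets smaller.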

Sample Size Matters

The size of the sample affects how accurate our estimates are. Larger samples give more precise estimates: the standard error of the sample mean is \(\sigma/\sqrt{n}\), so quadrupling the sample size halves it. Smaller samples have more variability, so their estimates can be less accurate.
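
A short simulation can illustrate this. The sketch below assumes heights drawn uniformly between 150 and 180 cm (variance 75, as in the earlier simulated data) and measures how much the sample mean varies across repeated samples of each size. The sample sizes, number of repetitions, and seed are illustrative choices.

```r
set.seed(7)  # arbitrary seed for reproducibility
sample_sizes <- c(10, 100, 1000)
nreps <- 2000
spread <- numeric(length(sample_sizes))
for (k in seq_along(sample_sizes)) {
  means <- numeric(nreps)
  for (j in 1:nreps) {
    means[j] <- mean(runif(sample_sizes[k], min = 150, max = 180))
  }
  spread[k] <- sd(means)  # variability of the sample mean at this sample size
}
# Compare the simulated spread with the theoretical value sigma / sqrt(n)
data.frame(n = sample_sizes,
           sd_of_sample_means = spread,
           theory = sqrt(75 / sample_sizes))
```

The spread of the sample means should shrink by roughly a factor of \(\sqrt{10}\) each time the sample size grows tenfold.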

Exercise

A researcher is interested in estimating the mean and standard deviation of the household income for a certain region. The researcher collects a random sample of 10 households and records the household income (in thousands of dollars) as follows:

Household   Household Income (in $1000s)
1           55
2           62
3           58
4           65
5           60
6           68
7           72
8           59
9           64
10          70

Task: Using the data provided:

  • Estimate the population mean for the household income based on the sample.
  • Estimate the population standard deviation based on the sample.
income <- c(55, 62, 58, 65, 60, 68, 72, 59, 64, 70)
mean_income <- mean(income)  # sample mean estimates the population mean
sd_income <- sd(income)      # sd() uses the n - 1 divisor
mean_income
sd_income

The Central Limit Theorem

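The Central Limit Theorem (CLT) states that, as the sample size \(n\) grows, the distribution of sample means approaches a normal distribution with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\), whatever the shape of the population distribution, provided the population has finite variance. The simulation below draws repeated samples from a skewed (exponential) population and plots the sample means for several sample sizes.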

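set.seed(123)
# Population: 10,000 draws from a right-skewed exponential distribution
population <- rexp(10000, rate = 1)
# Draw num_samples samples of size sample_size and return their means
simulate_clt <- function(sample_size, num_samples) {
  sample_means <- numeric(num_samples)
  for (i in 1:num_samples) {
    # Named 'draws' to avoid masking the sample() function
    draws <- sample(population, sample_size, replace = TRUE)
    sample_means[i] <- mean(draws)
  }
  return(sample_means)
}
sample_sizes <- c(5, 30, 100, 10000)
num_samples <- 1000
par(mfrow = c(1, 4))  # one histogram per sample size, side by side
for (size in sample_sizes) {
  sample_means <- simulate_clt(size, num_samples)
  sem <- sd(sample_means)  # spread of the sample means (standard error)
  hist(sample_means, probability = TRUE, breaks = 50, main = paste("n =", size))
  # Overlay the normal density with matching mean and standard deviation
  curve(dnorm(x, mean = mean(sample_means), sd = sem), col = "red", lwd = 2, add = TRUE)
}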
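
The CLT also predicts the spread of the sample means: their standard deviation should be close to \(\sigma/\sqrt{n}\), where \(\sigma\) is the population standard deviation. As a rough numerical check under the same setup (an exponential population with rate 1), one could compare the two. The sample size of 30 and the names observed and predicted are illustrative choices.

```r
set.seed(123)
population <- rexp(10000, rate = 1)
n <- 30
num_samples <- 1000

# Means of num_samples samples of size n, drawn with replacement
sample_means <- replicate(num_samples,
                          mean(sample(population, n, replace = TRUE)))

observed <- sd(sample_means)            # simulated standard error
predicted <- sd(population) / sqrt(n)   # sigma / sqrt(n)
c(observed = observed, predicted = predicted)
```

The two numbers should agree to within a few percent; the remaining gap is simulation noise.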

Finding Critical Values in the Z-Distribution

It’s important to understand how to find critical values for a Z-test, which is based on the standard normal distribution. Critical values help us determine the cutoff points for deciding whether a sample statistic is significantly different from the population parameter.

  • Z-distribution: the standard normal distribution, with mean 0 and standard deviation 1.
  • A Z-score measures how many standard deviations a data point is from the mean.
  • A positive Z-score indicates the value is above the mean; a negative Z-score indicates it is below.
  • A critical value is the location on the Z-distribution beyond which we consider results to be ‘extreme’.
  • These critical values are determined by the chosen significance level.
  • Rejection region: the area under the curve beyond the critical value.

Common Critical Values and Their Corresponding Significance Levels:

1. Left-tailed Z-test
  • For \(\alpha=0.05\) (5% significance level), the critical value is qnorm(0.05), approximately -1.645
  • For \(\alpha=0.01\) (1% significance level), the critical value is qnorm(0.01), approximately -2.326
2. Right-tailed Z-test
  • For \(\alpha=0.05\) (5% significance level), the critical value is qnorm(0.95), approximately 1.645
  • For \(\alpha=0.01\) (1% significance level), the critical value is qnorm(0.99), approximately 2.326
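
These values can be computed together in R. The sketch below tabulates the left-tailed and right-tailed critical values for both significance levels and then applies the right-tailed rule to a test statistic; the value 2.1 for z_observed is hypothetical, chosen only for illustration.

```r
alpha <- c(0.05, 0.01)
left_tailed <- qnorm(alpha)       # reject when Z falls below this value
right_tailed <- qnorm(1 - alpha)  # reject when Z falls above this value
data.frame(alpha, left_tailed, right_tailed)

# Example decision for a right-tailed test at alpha = 0.05
z_observed <- 2.1                 # hypothetical test statistic
z_observed > qnorm(0.95)          # TRUE means Z is in the rejection region
```

Because the standard normal distribution is symmetric about 0, the left-tailed and right-tailed critical values differ only in sign.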