Normal distribution: https://mathlets.org/mathlets/probability-distributions/
Confidence interval: https://mathlets.org/mathlets/confidence-intervals/
Suppose you give a test to a history class. After grading the tests, you compute the mean and standard deviation of the distribution of grades to be \(\mu\) = 80 and \(\sigma\) = 12. What is the standard score (z-score) of a student who gets a 72 on the exam? At what percentile does this score fall?
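The two questions above can be answered with a short calculation. The following is a minimal sketch, assuming the grades are approximately normally distributed so that `pnorm()` can be used for the percentile; the variable names are illustrative only.

```r
# Values from the problem statement
mu <- 80      # class mean
sigma <- 12   # class standard deviation
score <- 72   # the student's grade

# Standard score: number of standard deviations from the mean
z <- (score - mu) / sigma
z

# Percentile: proportion of grades at or below this score,
# assuming (for illustration) that grades are normally distributed
percentile <- pnorm(z)
percentile
```

Under that assumption, the score lies two-thirds of a standard deviation below the mean, at roughly the 25th percentile.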
Loop through the sequence 1 to 5 printing the square of each number
for (j in 1:5) {
  print(j^2)
}
# Store the squares in a vector instead of printing them
n <- 5
x <- rep(0, n)
for (j in 1:n) {
  x[j] <- j^2
}
x
Let’s use a for loop to estimate the expected value of the square of a die roll.
ntrials <- 1000
trials <- rep(0, ntrials)
for (j in 1:ntrials) {
  trials[j] <- sample(1:6, 1)  # one roll of a fair die
}
mean(trials^2)
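As a check on the simulation, the exact expected value can be computed directly, and the same estimate can be obtained without a loop. This is only a sketch: the seed and the names `exact`, `rolls`, and `estimate` are arbitrary choices, not part of the original example.

```r
# Exact expected value of the squared roll of a fair six-sided die
exact <- mean((1:6)^2)   # (1 + 4 + 9 + 16 + 25 + 36) / 6
exact

# Vectorized simulation: draw all rolls at once instead of looping
set.seed(1)              # arbitrary seed, for reproducibility
rolls <- sample(1:6, 1000, replace = TRUE)
estimate <- mean(rolls^2)
estimate
```

The exact value is 91/6, about 15.17, and the simulated estimate should land close to it.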
In many fields, we want to understand the characteristics of a large group (the population), but collecting data from everyone is too difficult. Instead, we collect data from a smaller group, a sample, and use it to estimate characteristics of the entire population.
The most common thing we want to estimate is the mean (average) of some characteristic in the population (e.g., the average height of all students). To estimate this:
- We take the mean of the sample.
- We use that sample mean as an estimate for the population mean.
# Simulated sample data: heights of 50 students (in cm), drawn uniformly between 150 and 180
sample_heights <- runif(50, min = 150, max = 180)
# Calculate the sample mean (average)
sample_mean <- mean(sample_heights)
sample_mean
In this case, sample_mean is the average height of the 50 students. We use this as an estimate for the average height of all students at the university (the population mean).
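# Sample variance by hand: squared deviations from the mean, divided by n - 1 = 49
sum((sample_heights - sample_mean)^2) / 49
# The built-in var() uses the same n - 1 denominator
var(sample_heights)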
Let’s say we surveyed 100 students, and 60 of them passed an exam. We can estimate the proportion of all students who pass the exam.
# Sample data: 1 = pass, 0 = fail (60 passes and 40 fails, as in the survey)
sample_pass <- rep(c(1, 0), times = c(60, 40))
# Calculate the sample proportion (fraction of students who passed)
sample_proportion <- mean(sample_pass)
sample_proportion
Here, sample_proportion gives us an estimate of the proportion of all students who pass the exam, based on the sample; multiply by 100 to express it as a percentage.
A confidence interval gives us a range where we expect the population mean or proportion to fall. It adds a margin of error to our estimate. For example, instead of just saying “the average height is 170 cm,” we might say, “we’re 95% confident the true average height is between 167 and 173 cm.”
In R, we can calculate confidence intervals easily.
To calculate a 95% confidence interval for the mean of our sample heights:
# 95% CI: sample mean +/- 1.96 standard errors (normal approximation)
se <- sd(sample_heights) / sqrt(length(sample_heights))
ci <- mean(sample_heights) + c(-1, 1) * 1.96 * se
ci
The output will give us a range (lower bound, upper bound) for where we expect the true average height to be.
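The same idea carries over to a proportion. The sketch below uses the normal-approximation interval, which is one common choice among several and is an assumption here, not something the notes prescribe. It is applied to the survey figures of 60 passes out of 100 students; the names `p_hat`, `se_p`, and `ci_p` are illustrative.

```r
# Illustrative numbers: 60 passes out of 100 surveyed students
n <- 100
p_hat <- 60 / n

# Standard error of a sample proportion
se_p <- sqrt(p_hat * (1 - p_hat) / n)

# 95% confidence interval using the normal critical value 1.96
ci_p <- p_hat + c(-1, 1) * 1.96 * se_p
ci_p
```

With these numbers the interval runs from about 0.504 to about 0.696.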
The size of the sample affects how accurate our estimates are. Larger samples give us more precise estimates of the population. Smaller samples have more variability, so the estimates can be less accurate.
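To make the effect of sample size concrete, the sketch below computes the 95% margin of error for the mean at several sample sizes. The population standard deviation of 10 and the sample sizes are assumed values chosen only for illustration.

```r
sigma <- 10             # assumed population standard deviation (illustrative)
n <- c(25, 100, 400)    # sample sizes to compare

# Margin of error of a 95% confidence interval for the mean
margin <- 1.96 * sigma / sqrt(n)
data.frame(n = n, margin = margin)
```

Because the margin shrinks with the square root of the sample size, quadrupling the sample size halves the margin of error.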
A researcher is interested in estimating the mean and standard deviation of the household income for a certain region. The researcher collects a random sample of 10 households and records the household income (in thousands of dollars) as follows:
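| Household | Household Income (in $1000s) |
|-----------|------------------------------|
| 1         | 55                           |
| 2         | 62                           |
| 3         | 58                           |
| 4         | 65                           |
| 5         | 60                           |
| 6         | 68                           |
| 7         | 72                           |
| 8         | 59                           |
| 9         | 64                           |
| 10        | 70                           |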
Task: Using the data provided, estimate the mean and standard deviation of household income.
income <- c(55, 62, 58, 65, 60, 68, 72, 59, 64, 70)
mean_income <- mean(income)
sd_income <- sd(income)
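To see what mean() and sd() compute, the estimates can also be worked out by hand from their definitions. This is a minimal sketch; the names `mean_by_hand` and `sd_by_hand` are illustrative.

```r
income <- c(55, 62, 58, 65, 60, 68, 72, 59, 64, 70)
n <- length(income)

# Sample mean: sum of the observations divided by n
mean_by_hand <- sum(income) / n

# Sample standard deviation: note the n - 1 denominator
sd_by_hand <- sqrt(sum((income - mean_by_hand)^2) / (n - 1))

c(mean = mean_by_hand, sd = sd_by_hand)

# Both should agree with the built-in functions
all.equal(mean_by_hand, mean(income))
all.equal(sd_by_hand, sd(income))
```

For this sample the mean is 63.3 (that is, $63,300) and the standard deviation is about 5.52.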
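The Central Limit Theorem (CLT) states that, for a population with finite variance, the distribution of sample means approaches a normal distribution as the sample size grows, regardless of the shape of the population distribution. The simulation below draws repeated samples from a strongly skewed (exponential) population and overlays a normal curve on the histogram of the sample means for each sample size.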
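set.seed(123)
# Skewed population: 10,000 draws from an exponential distribution
population <- rexp(10000, rate = 1)
# Draw num_samples samples of size sample_size and return their means
simulate_clt <- function(sample_size, num_samples) {
  sample_means <- numeric(num_samples)
  for (i in seq_len(num_samples)) {
    # Named draws so the variable does not mask the sample() function
    draws <- sample(population, sample_size, replace = TRUE)
    sample_means[i] <- mean(draws)
  }
  return(sample_means)
}
sample_sizes <- c(5, 30, 100, 10000)
num_samples <- 1000
par(mfrow = c(1, 4))  # four histograms side by side
for (size in sample_sizes) {
  sample_means <- simulate_clt(size, num_samples)
  sem <- sd(sample_means)  # empirical standard error of the mean
  hist(sample_means, probability = TRUE, breaks = 50, main = paste("n =", size))
  # Overlay a normal density with matching mean and standard deviation
  curve(dnorm(x, mean = mean(sample_means), sd = sem), col = "red", lwd = 2, add = TRUE)
}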
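The CLT also predicts how wide each histogram should be: the standard deviation of the sample means is roughly \(\sigma/\sqrt{n}\). The sketch below checks this numerically. As a simplification it draws directly from the exponential distribution with rexp() rather than resampling the finite population vector, and the seed, sample sizes, and number of replications are arbitrary choices.

```r
set.seed(123)   # arbitrary seed, for reproducibility

# For an exponential distribution with rate 1, the mean and the
# standard deviation are both 1
sigma <- 1
sample_sizes <- c(5, 30, 100)

# Empirical standard deviation of 2000 sample means for each sample size
empirical_se <- sapply(sample_sizes, function(n) {
  sd(replicate(2000, mean(rexp(n, rate = 1))))
})

# Theoretical standard error predicted by the CLT
theoretical_se <- sigma / sqrt(sample_sizes)

data.frame(n = sample_sizes, empirical_se, theoretical_se)
```

The two columns should agree closely, and both shrink as the sample size increases.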
It’s important to understand how to find critical values for a Z-test, which is based on the standard normal distribution. Critical values help us determine the cutoff points for deciding whether a sample statistic is significantly different from the population parameter.
Common Critical Values and Their Corresponding Significance Levels:
1. Left-tailed Z-test
   - For \(\alpha=0.05\) (5% significance level), the critical value is qnorm(0.05), approximately -1.645.
   - For \(\alpha=0.01\) (1% significance level), the critical value is qnorm(0.01), approximately -2.326.
2. Right-tailed Z-test
   - For \(\alpha=0.05\) (5% significance level), the critical value is qnorm(0.95), approximately 1.645.
   - For \(\alpha=0.01\) (1% significance level), the critical value is qnorm(0.99), approximately 2.326.
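The critical values listed above can be computed in one go with qnorm(). The sketch below also includes the two-tailed case, which is an addition for comparison and is not part of the list above; it shows where the 1.96 used in a 95% confidence interval comes from. The variable names are illustrative.

```r
alpha <- c(0.05, 0.01)

# Left-tailed test: reject when z falls below this cutoff
left_crit <- qnorm(alpha)

# Right-tailed test: reject when z falls above this cutoff
right_crit <- qnorm(1 - alpha)

# Two-tailed test: split alpha evenly between the two tails
two_tailed_crit <- qnorm(1 - alpha / 2)

round(data.frame(alpha, left_crit, right_crit, two_tailed_crit), 3)
```

By symmetry of the standard normal distribution, the left-tailed and right-tailed cutoffs differ only in sign.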