Lecture 6.1 : Survey Sampling

Discrete Variable

Discrete Variable takes on countable values.
Example: Number of pets in a household.
(Lecture 6.1)


Continuous Variable

Continuous Variable can take on any value within a range.
Example: A person’s height or weight.
(Lecture 6.1)


Probabilities for a Discrete Variable

Probabilities for a Discrete Variable assign specific probabilities to individual outcomes.
Example: \(P(X = 3) = 0.2\)
(Lecture 6.1)


Probabilities for a Continuous Variable

Probabilities for a Continuous Variable are computed as areas under a curve.
Examples: \(P(X < a)\), \(P(X > a)\), \(P(a < X < b)\)
(Lecture 6.1)


Standard Normal Distribution

Standard Normal Distribution is the distribution that results from converting any normal distribution into standard units.
It has a mean of 0 and standard deviation of 1. The shape of the distribution remains unchanged.
(Lecture 6.1)


Z-score

Z-score expresses a value in terms of standard deviations from the mean.
Formula: \(z = \frac{x - \mu}{\sigma}\)
Example: A height of 70 inches when \(\mu = 64\) and \(\sigma = 3\) gives \(z = 2\)
(Lecture 6.1)
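
A quick Python sketch of the example above (the numbers 70, 64, and 3 come from the example; scipy's norm.cdf is used only to show the percentile that a z-score corresponds to):

```python
from scipy.stats import norm

# Z-score for the height example: x = 70, mu = 64, sigma = 3.
x, mu, sigma = 70, 64, 3
z = (x - mu) / sigma       # (70 - 64) / 3 = 2.0

# The standard Normal CDF converts a z-score into a percentile.
percentile = norm.cdf(z)   # ~0.977: a height of 70 is near the 98th percentile
print(z, percentile)
```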


Percentiles (Quantile)

Percentile is the value below which a given percentage of observations fall.
Example: The 90th percentile means 90% of values are below that number.
(Lecture 6.1)


Survey Sampling

Survey Sampling involves selecting a subset from a population to estimate characteristics of the whole.
(Lecture 6.1)


Census

Census collects data from every member of the population.
(Lecture 6.1)


Sample Survey

Sample Survey collects data from a subset of the population to estimate proportions.
We estimate population proportion \(p\) using the sample proportion \(\hat{p}\).
(Lecture 6.1)


Population

Population is the entire group of people or objects we want to study.
(Lecture 6.1)


Population Parameter

Population Parameter is a numerical characteristic of a population, such as the mean or proportion.
(Lecture 6.1)


Sample

Sample is a subset of the population, ideally representative of the whole.
(Lecture 6.1)


Statistic

Statistic is a numerical summary of a sample used to estimate a population parameter.
(Lecture 6.1)


Margin of Error

Margin of Error quantifies the uncertainty due to sampling.
Example: A margin of error of ±2% means the true proportion is likely within 2 percentage points of the estimate.
(Lecture 6.1)


Random Sample

Random Sample means each member of the population has an equal chance of being selected.
(Lecture 6.1)


Sampling Frame

Sampling Frame is the list from which a sample is drawn.
(Lecture 6.1)


Binomial Model

Binomial Model applies when:
- There are two possible outcomes (success/failure)
- Probability of success \(p\) is constant
- Number of trials \(n\) is fixed
- Trials are independent
- We count number of successes
If valid, the binomial model allows probability prediction for outcomes.
(Lecture 6.1)
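
Once the checklist holds, the binomial formula gives these probabilities. A minimal sketch using scipy, with hypothetical values n = 10 trials and p = 0.2 (not lecture data):

```python
from scipy.stats import binom

n, p = 10, 0.2                # hypothetical: 10 trials, P(success) = 0.2

# P(exactly 3 successes) and P(at most 3 successes) under the binomial model.
print(binom.pmf(3, n, p))     # ~0.201
print(binom.cdf(3, n, p))     # ~0.879
```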


Bias

Bias is a systematic deviation from the true value due to flaws in the sampling method.
Example: Using a non-representative sampling frame.
(Lecture 6.1)


Good Survey

Good Survey uses random and independent selection:
1. Random sampling
2. Independence between observations
(Lecture 6.1)


With Replacement

With Replacement means each unit can be selected more than once, maintaining independence.
(Lecture 6.1)


Without Replacement

Without Replacement means each selected unit is removed from the pool, slightly affecting independence.
(Lecture 6.1)


Valid Sample

Valid Sample can be selected:
1. Without replacement, if the population is at least 10× the sample size (Simple Random Sample, or SRS)
2. With replacement, always
A valid sample ensures that \(\hat{p}\) is unbiased, that precision increases with sample size, and that the sampling distribution approaches normality.
(Lecture 6.1)


Accuracy & Precision

Accuracy: How close the estimate is to the true value (low bias).
Precision: How consistent the estimates are across samples (low variability).
(Lecture 6.1)


Precision

Precision improves as sample size increases.
Precise estimators have less variability and are measured using standard error.
(Lecture 6.1)


Lecture 6.2 : Central Limit Theorem

Accuracy and Precision

Accuracy: Does it “hit” the target on average? Measured by bias.
Precision: How spread out are the estimates? Measured by standard error.
(Lecture 6.2)


Random Sample

Random Sample ensures each individual in the population has an equal chance of being selected, enabling unbiased estimation.
(Lecture 6.2)


Sample Without Replacement (SRS)

Sample Without Replacement is when selected individuals are not returned to the population before the next selection.
If the population is at least 10× larger than the sample, this method behaves like sampling with independence.
(Lecture 6.2)


Sample With Replacement

Sample With Replacement means each individual can be selected more than once.
This ensures independence across selections.
(Lecture 6.2)


Population Proportion (\(p\))

Population Proportion (\(p\)) is the true proportion of individuals in the population with a certain characteristic.
Example: The proportion of UCLA students who support a tuition increase for CAPS.
(Lecture 6.2)


Sample Proportion (\(\hat{p}\))

Sample Proportion (\(\hat{p}\)) is the proportion in a sample with the characteristic of interest.
Used to estimate the population proportion \(p\).
(Lecture 6.2)


Standard Error

Standard Error (SE) measures the variability in an estimator, similar to standard deviation for variables.
Formula: \(SE_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\)
Interpretation: Larger \(n\) leads to smaller SE → more precise estimates.
(Lecture 6.2)


Sampling Distribution

Sampling Distribution is the probability distribution of a statistic over repeated samples.
It helps us understand the behavior of \(\hat{p}\) across many samples.
(Lecture 6.2)


Simulation

Simulation is a method for estimating the behavior of sampling distributions using repeated random sampling.
Used to visualize how \(\hat{p}\) varies from sample to sample.
(Lecture 6.2)
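
A minimal simulation sketch, assuming a hypothetical population proportion p = 0.3 and samples of size n = 100; it draws many samples and records \(\hat{p}\) each time to approximate the sampling distribution:

```python
import random

random.seed(1)
p, n, reps = 0.3, 100, 5000   # hypothetical population proportion, sample size

# Each p_hat is the fraction of "successes" in one simulated sample.
p_hats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

mean_p_hat = sum(p_hats) / reps
sd_p_hat = (sum((ph - mean_p_hat) ** 2 for ph in p_hats) / reps) ** 0.5
print(mean_p_hat)  # close to p = 0.3 (p_hat is unbiased)
print(sd_p_hat)    # close to sqrt(0.3 * 0.7 / 100) ~ 0.046
```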


Shape of the Sampling Distribution

Shape of the Sampling Distribution depends on sample size.
- Small \(n\): shape may be skewed or irregular
- Large \(n\): shape becomes approximately Normal
(Lecture 6.2)


Central Limit Theorem

Central Limit Theorem (CLT) for proportions states:
If \(n\) is large enough, then \(\hat{p} \sim N\left(p, \sqrt{\frac{p(1 - p)}{n}}\right)\)
Conditions: \(np \geq 10\) and \(n(1 - p) \geq 10\)
(Lecture 6.2)


Center, Spread, and Shape

Center of the sampling distribution is \(p\) (mean).
Spread is measured by \(SE_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\)
Shape becomes approximately Normal when the sample is large enough.
(Lecture 6.2)


Law of Large Numbers

Law of Large Numbers says that as the sample size increases, \(\hat{p}\) tends to get closer to \(p\).
This supports the reliability of estimators as \(n\) grows.
(Lecture 6.2)


Z-score for Proportions

Z-score measures how many standard errors a sample proportion is from the population proportion.
Formula: \(z = \frac{\hat{p} - p}{\sqrt{\frac{p(1 - p)}{n}}}\)
Used to assess how surprising a result is under the assumption that \(p\) is known.
(Lecture 6.2)


Lecture 7.1 : Confidence Intervals

Central Limit Theorem

Central Limit Theorem (CLT) states that if the sample size is large enough, the sampling distribution of sample proportions is approximately Normal:
\(\hat{p} \sim N\left(p, SE_{\hat{p}}\right)\)
Conditions: \(np \geq 10\) and \(n(1 - p) \geq 10\)
(Lecture 7.1)


Effect of Sample Size

Effect of Sample Size: Increasing \(n\) reduces the standard error, leading to more precise estimates.
Example: \(SE = \sqrt{0.3 \cdot 0.7 / 500} \approx 0.02\) vs. \(SE = \sqrt{0.3 \cdot 0.7 / 1000} \approx 0.0145\)
(Lecture 7.1)
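
The two SEs in the example can be reproduced directly; a small sketch:

```python
import math

def se_phat(p, n):
    # Standard error of a sample proportion: sqrt(p(1 - p) / n).
    return math.sqrt(p * (1 - p) / n)

print(se_phat(0.3, 500))    # ~0.0205
print(se_phat(0.3, 1000))   # ~0.0145 (doubling n shrinks SE by a factor of sqrt(2))
```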


Standard Error

Standard Error (SE) estimates the standard deviation of the sampling distribution.
Formula: \(SE_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\) (if \(p\) is known)
or \(SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\) (estimated)
(Lecture 7.1)


Z-score

Z-score tells how many standard errors a sample proportion is from the population proportion.
Formula: \(z = \frac{\hat{p} - p}{SE_{\hat{p}}}\)
(Lecture 7.1)


CLT Checklist

  1. Random Sample: Was the sample selected randomly?
  2. Independence: Are the responses independent of each other?
  3. Sample Size: Check if \(np \geq 10\) and \(n(1 - p) \geq 10\)
    (Lecture 7.1)

Random Sample

Random Sample is required for valid inference and ensures unbiased estimation of parameters.
(Lecture 7.1)


Independence

Independence means the outcome of one individual does not influence another.
Usually satisfied if the sample is less than 10% of the population.
(Lecture 7.1)


Sample Size Condition

Sample Size Condition ensures the CLT applies:
\(np \geq 10\) and \(n(1 - p) \geq 10\)
(Lecture 7.1)


Confidence Statistics

Confidence Statistics quantify uncertainty in an estimate and are used to construct intervals with known confidence levels.
(Lecture 7.1)


Confidence Interval

Confidence Interval (CI) provides an estimate of a population parameter along with a margin of error.
Formula: \(\hat{p} \pm \text{critical value} \cdot SE_{\hat{p}}\)
This reflects uncertainty in the estimate. CIs are about population parameters, not sample statistics.
(Lecture 7.1)


Critical Value

Critical Value (\(z^*\)) is the number of standard errors to span for a given confidence level.
Example: For 95% confidence, \(z^* = 1.96\)
(Lecture 7.1)
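
A sketch combining this entry with the CI formula above. The inputs \(\hat{p} = 0.62\) and \(n = 1500\) are assumed values (chosen to land near the interpretation example below), not lecture data:

```python
import math
from scipy.stats import norm

p_hat, n = 0.62, 1500        # hypothetical sample proportion and sample size

z_star = norm.ppf(0.975)     # critical value for 95% confidence: ~1.96
se = math.sqrt(p_hat * (1 - p_hat) / n)

lower = p_hat - z_star * se
upper = p_hat + z_star * se
print(round(lower, 3), round(upper, 3))   # roughly (0.595, 0.645)
```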


Interpreting CIs

Interpreting Confidence Intervals means understanding what the interval says about the population.
Example: “We are 95% confident that the proportion of Americans who believe the government helps the middle class too little is between 59.6% and 64.5%.”
This does not mean there is a 95% probability the true proportion is in the interval—we either captured it or we didn’t. The method captures the truth 95% of the time.
(Lecture 7.1)


Confidence Level

Confidence Level is the long-run proportion of constructed intervals that contain the true population parameter.
Example: 95% confidence means that if we repeat the procedure many times, 95% of the resulting intervals will contain the true \(p\).
(Lecture 7.1)


Lecture 7.2 : Confidence intervals for two proportions

confidence interval

It’s an estimate of a population parameter that includes an allowance for our uncertainty. It is based on a single sample of \(n\) randomly selected members of the population and is usually expressed as an estimate plus or minus a margin of error. Interpret it as a range of plausible values for the true value (that is, the value we would get if we could measure the entire population).
Example: A 95% CI for \(p_1 - p_2\) of (0.03, 0.08) means we are 95% confident the true difference in proportions lies between 3% and 8%.

(Lecture 7.2)


margin of error

The quantity added and subtracted to the sample statistic in a confidence interval.
\(\text{Margin of Error} = z^* \cdot SE\)

(Lecture 7.2)


Central Limit Theorem for proportions

\(\hat{p}_1 - \hat{p}_2 \sim N(p_1 - p_2, \sqrt{SE^2_{\hat{p}_1} + SE^2_{\hat{p}_2}})\)

(Lecture 7.2)


wider intervals

Higher confidence level → wider interval.

(Lecture 7.2)


Smaller interval

Larger sample size or lower confidence level → smaller interval.

(Lecture 7.2)


Confidence intervals for two proportions

\(\hat{p}_1 - \hat{p}_2 \pm z^* \cdot \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}\)

(Lecture 7.2)
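
A direct sketch of this formula; the counts below are hypothetical placeholders, not lecture data:

```python
import math
from scipy.stats import norm

# Hypothetical samples: 260/500 successes in group 1, 220/500 in group 2.
x1, n1, x2, n2 = 260, 500, 220, 500
p1, p2 = x1 / n1, x2 / n2

z_star = norm.ppf(0.975)     # 95% confidence
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

diff = p1 - p2
print(diff - z_star * se, diff + z_star * se)   # CI for p1 - p2
```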


Confidence intervals for two proportions Interpretation

Example: “We are 95% confident that the proportion of support in Group 1 exceeds that in Group 2 by between 3% and 8%.”
- If 0 is in the CI → no evidence of a difference
- If 0 is not in the CI → evidence of a difference

(Lecture 7.2)


CLT conditions for two proportions

CLT must hold in both samples:
- Both samples must be random, independent
- Both samples must be large, so that the expected number of yes's and no's is at least 10 in both samples
- (Only if sampling without replacement) Each population must be at least 10 times larger than its sample size
- The samples must be independent of each other!

(Lecture 7.2)


Samples must be independent of each other

It means that if I tell you who answered ‘yes’ in one sample, you know nothing about who answered ‘yes’ in the other.

(Lecture 7.2)


Relative Risk

A ratio of two probabilities.
\(RR = \frac{p_1}{p_2}\)
Example: If Group 1 has a 19% infection rate and Group 2 has 1%, then \(RR = 19\) → Group 1 is 19 times as likely to get infected.

(Lecture 7.2)


Lecture 8.1 : Hypothesis Testing for One Proportion

Confidence Intervals for One Proportion

Confidence Intervals answer questions like “how much?” or “what percent?” by estimating a population proportion with a range of plausible values.
(Lecture 8.1)


Hypothesis Tests for One Proportion

Hypothesis Tests address questions like “is the percentage different from what everyone thinks it is?” or “did my intervention change the population proportion?”
(Lecture 8.1)


Null Hypothesis

Null Hypothesis (\(H_0\)) is the default, skeptical position assumed to be true. Any observed differences are attributed to chance.
Example: \(H_0 : p = 0.54\) (polio rate without vaccine)
(Lecture 8.1)


Alternative Hypothesis

Alternative Hypothesis (\(H_A\)) is what we hope to demonstrate—typically that the true proportion differs from the null.
Example: \(H_A : p < 0.54\) (vaccine lowers polio rate)
(Lecture 8.1)


P-value

P-value is the probability of observing an outcome as extreme as (or more extreme than) the one obtained, under the assumption that the null hypothesis is true.
- Right-tailed: \(P(Z > z_{\text{obs}})\)
- Left-tailed: \(P(Z < z_{\text{obs}})\)
- Two-tailed: \(2 \cdot P(Z > |z_{\text{obs}}|)\)
Example: If \(z_{\text{obs}} = 1.25\), then \(P(Z > 1.25) = 0.106\)
(Lecture 8.1)


Significance Level

Significance Level (\(\alpha\)) is the threshold for rejecting the null hypothesis.
Common values: \(\alpha = 0.05\), \(0.01\).
Relates to confidence level: 95% CI → \(\alpha = 0.05\)
(Lecture 8.1)


Test Statistic

Test Statistic compares observed outcomes with what is expected under the null hypothesis.
Formula: \(z_{\text{obs}} = \frac{\hat{p} - p_0}{SE_{p_0}}\)
where \(p_0\) is from \(H_0\), and standard error uses \(p_0\):
\(SE_{p_0} = \sqrt{\frac{p_0(1 - p_0)}{n}}\)
(Lecture 8.1)


Z-score

Z-score represents how many standard errors the observed \(\hat{p}\) is from \(p_0\).
Interpretation: A \(z\) of 1.5 means \(\hat{p}\) is 1.5 standard errors above \(p_0\).
(Lecture 8.1)


Type I Error

Type I Error occurs when we reject the null hypothesis when it is actually true.
Probability of Type I error = \(\alpha\)
(Lecture 8.1)


Type II Error

Type II Error occurs when we fail to reject the null hypothesis when it is actually false.
Probability of Type II error = \(\beta\)
(Lecture 8.1)


Steps of Hypothesis Testing

  1. Hypothesize: Formulate null and alternative hypotheses about a population parameter.
  2. Prepare: Understand data and check if CLT applies. Identify the test statistic.
  3. Compare: Compute test statistic and p-value.
  4. Interpret: Decide whether to reject \(H_0\) based on p-value and significance level.
    (Lecture 8.1)
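
A sketch of the four steps for a one-proportion test; the numbers (\(p_0 = 0.20\), \(\hat{p} = 0.26\), \(n = 400\)) are hypothetical:

```python
import math
from scipy.stats import norm

# Step 1 (Hypothesize): H0: p = 0.20 vs HA: p != 0.20, alpha = 0.05.
p0, alpha = 0.20, 0.05

# Step 2 (Prepare): check the CLT sample-size condition using p0.
n, p_hat = 400, 0.26
assert n * p0 >= 10 and n * (1 - p0) >= 10

# Step 3 (Compare): the SE in the test statistic uses p0, not p_hat.
se0 = math.sqrt(p0 * (1 - p0) / n)
z_obs = (p_hat - p0) / se0
p_value = 2 * norm.sf(abs(z_obs))   # two-tailed p-value

# Step 4 (Interpret): reject H0 if the p-value is below alpha.
print(z_obs, p_value, p_value < alpha)
```
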

Hypotheses

Hypotheses are always about parameters, not observed values.
Example: \(H_0 : p = 0.20\), \(H_A : p \ne 0.20\)
(Lecture 8.1)


Lecture 8.2: Hypothesis Testing for Proportions


null and alternative hypotheses

Null Hypothesis is a conservative or skeptical claim about a population parameter, assumed true until evidence suggests otherwise.
Alternative Hypothesis is a competing claim we hope to support with evidence.
(Lecture 8.2)


test statistic

Test Statistic compares the observed outcome to what we would expect under the null.
Example: \(Z_{obs} = \frac{\hat{p} - p_0}{SE_{p_0}}\)
(Lecture 8.2)


population parameter

Population Parameter is a fixed but unknown value that describes a population characteristic (e.g., \(p\)).
(Lecture 8.2)


Statistic

Statistic is a value computed from sample data (e.g., \(\hat{p}\)) used to estimate the population parameter.
(Lecture 8.2)


Z-score

Z-score is the number of standard errors the observed statistic is from the null.
Interpretation: A Z-score of 2 means the result is 2 SEs above what is expected under \(H_0\).
(Lecture 8.2)


P-value

P-value is the probability of getting an outcome as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true.
If \(p\)-value < \(\alpha\), reject the null.
(Lecture 8.2)


Independent random sample

Independent Random Sample ensures each individual is selected independently and randomly; necessary for valid inference.
(Lecture 8.2)


Central Limit Theorem

Central Limit Theorem (CLT): when conditions are met (large sample, independence), the sampling distribution of a proportion is approximately normal.
(Lecture 8.2)


standard Normal distribution

Standard Normal Distribution has a mean of 0 and SD of 1; used to approximate sampling distributions under the null.
(Lecture 8.2)


Type I error

Type I Error occurs when we reject the null hypothesis even though it is true (false positive).
(Lecture 8.2)


Type II error

Type II Error occurs when we fail to reject the null hypothesis even though it is false (false negative).
(Lecture 8.2)


Confusion Matrix

Rows are the decision; columns are the truth:
- Reject H₀ when null is TRUE → ❌ Type I Error (False Positive)
- Reject H₀ when null is FALSE → ✅ Correct Decision (True Positive)
- Fail to Reject H₀ when null is TRUE → ✅ Correct Decision (True Negative)
- Fail to Reject H₀ when null is FALSE → ❌ Type II Error (False Negative)

(Lecture 8.2)


Significance level

Significance Level (\(\alpha\)) is the probability of making a Type I error.
Interpretation: \(\alpha = 0.05\) means we’re willing to wrongly reject \(H_0\) 5% of the time.
(Lecture 8.2)


Conditional Probabilities and Types of Error

\(P(\text{Type I}) = P(\text{Reject } H_0 \mid H_0 \text{ is true})\)
\(P(\text{Type II}) = P(\text{Fail to Reject } H_0 \mid H_0 \text{ is false})\)
(Lecture 8.2)


Why do we reject if p-value is less than significance level?

To control the probability of a Type I error. Ensures that the rate of wrongly rejecting \(H_0\) is no more than \(\alpha\).
(Lecture 8.2)


CI and Hypothesis Test

  • If CI does not include the null value → reject \(H_0\)
  • If CI includes the null value → fail to reject \(H_0\)
    (Lecture 8.2)

Hypothesis testing for difference between two proportions

Hypothesis Test for Difference in Proportions compares \(p_1\) and \(p_2\) to see if a real difference exists.
(Lecture 8.2)


Necessary assumptions of Hypothesis testing for difference between two proportions

  • Each sample randomly drawn from its population
  • Each sample large enough: at least 10 successes and 10 failures
  • Observations within each sample are independent
  • The two samples are independent of each other
    (Lecture 8.2)

CLT for testing proportions

If the assumptions hold, then under \(H_0\) the sampling distribution of \(\hat{p}_1 - \hat{p}_2\) is approximately \(N(0, SE_0)\), so the standardized test statistic \(Z_{obs}\) is approximately \(N(0, 1)\).
(Lecture 8.2)


Test Statistic for difference between two proportions

\(Z_{obs} = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE_0}\)
\(SE_0 = \sqrt{\hat{p}(1 - \hat{p}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}\)
Where \(\hat{p}\) is the pooled sample proportion (successes from both samples combined, divided by \(n_1 + n_2\)) under \(H_0\).
(Lecture 8.2)
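
A sketch of the pooled test with hypothetical counts:

```python
import math
from scipy.stats import norm

# Hypothetical counts: 45/200 successes in sample 1, 30/200 in sample 2.
x1, n1, x2, n2 = 45, 200, 30, 200
p1, p2 = x1 / n1, x2 / n2

# Pooled proportion under H0: p1 = p2.
p_pool = (x1 + x2) / (n1 + n2)
se0 = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z_obs = (p1 - p2) / se0
p_value = 2 * norm.sf(abs(z_obs))   # two-tailed
print(z_obs, p_value)
```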


Lecture 9.1 : CLT, CI & HT For Numerical Data


Summary of Hypothesis Tests:
1. State Null and Alternative hypotheses and choose significance level.
2. Find value of test statistic after checking conditions for CLT.
3. Depending on Alternative Hypothesis, find p-value.
4. If p-value < significance level, reject Null.
(Lecture 9.1)


Numerical Variables:
Quantitative values such as income or body temperature.
(Lecture 9.1)


Categorical Variables:
Used for proportions (e.g., win/loss, yes/no).
(Lecture 9.1)


Confidence Intervals:
Provide a range of plausible values for a population parameter.
(Lecture 9.1)


Random Samples:
Required to ensure validity of statistical inference.
(Lecture 9.1)


Hypothesis Test:
Tests whether the population parameter differs from a hypothesized value.
(Lecture 9.1)


Population:
- Mean: μ
- Proportion: p
(Lecture 9.1)


Statistic:
- Sample Mean: \(\bar{x}\)
- Sample Proportion: \(\hat{p}\)
(Lecture 9.1)


Central Limit Theorem for Means:
If the sample size is large enough, the sampling distribution of the average is approximately Normal.
(Lecture 9.1)


Sampling Distribution:
The probability distribution of a statistic over many samples.
(Lecture 9.1)


Probability Distribution:
Describes how probabilities are distributed over values of a random variable.
(Lecture 9.1)


Standard Error:
\(\frac{\sigma}{\sqrt{n}}\)
SE of the sample mean is smaller than σ; it’s σ divided by the square root of n.
(Lecture 9.1)


Inference for Mean:
Sample mean is unbiased. As n increases, the approximation to Normal improves.
(Lecture 9.1)


Normal Distribution:
Used for approximating sampling distributions due to CLT.
(Lecture 9.1)


Law of Large Numbers:
Sample means converge to the population mean as sample size increases.
(Lecture 9.1)


Distribution of the Sample Mean:
\(\bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}})\)
(Lecture 9.1)


Null and Alternative Hypothesis for Means:
- H₀: μ = μ₀
- Hₐ: μ ≠ μ₀ or μ < μ₀ or μ > μ₀
(Lecture 9.1)


Test Statistic for Means:
- Z-test: \(z = \frac{\bar{x} - \mu_0}{SE_{\bar{x}}}\) when σ known
- T-test: \(t = \frac{\bar{x} - \mu_0}{\hat{SE}_{\bar{x}}}\) when σ unknown
where \(\hat{SE}_{\bar{x}} = \frac{s}{\sqrt{n}}\)
(Lecture 9.1)


Sampling Distribution of t:
As degrees of freedom increase, t approximates Normal distribution.
(Lecture 9.1)


Degrees of Freedom:
The number of values in a calculation that are free to vary. For a single sample mean, the degrees of freedom are \(df = n - 1\) because one value is constrained by the sample mean.

Example: If you know the mean of 5 numbers is 10, then knowing 4 of them determines the 5th.

(Lecture 9.1)


T-distribution:
Used when σ is unknown; has thicker tails than Normal.
(Lecture 9.1)


Confidence Intervals for the Mean:
\(\bar{x} \pm \text{critical value} \times \hat{SE}_{\bar{x}}\)
Use t-distribution with \(n - 1\) degrees of freedom.
(Lecture 9.1)
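
A sketch with a made-up data set; scipy's t.ppf supplies the critical value \(t^*\) with \(n - 1\) degrees of freedom:

```python
import math
import statistics
from scipy.stats import t

data = [98.2, 98.6, 97.9, 98.4, 98.8, 98.1, 98.5, 98.3]  # hypothetical sample

n = len(data)
x_bar = statistics.mean(data)
s = statistics.stdev(data)        # sample SD (n - 1 divisor)
se = s / math.sqrt(n)

t_star = t.ppf(0.975, df=n - 1)   # 95% critical value
print(x_bar - t_star * se, x_bar + t_star * se)
```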


Lecture 9.2 : Two-Sample CI & HT For Numerical Data

Summary of Hypothesis Tests

  1. State hypotheses: Null and alternative; choose significance level \(\alpha\).
  2. Check conditions: Confirm CLT assumptions.
  3. Compute test statistic: Based on sample data.
  4. Make decision: If \(p\)-value \(< \alpha\), reject the null hypothesis.
    (Lecture 9.2)

Population

Population refers to the entire group under study. Represented by parameters like \(\mu\).
(Lecture 9.2)


Sample

Sample is a subset of the population from which we draw data. Represented by statistics like \(\bar{x}\).
(Lecture 9.2)


CLT for One Mean

Central Limit Theorem (CLT): For large enough \(n\), the sampling distribution of \(\bar{x}\) is approximately Normal:
\(\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\)
(Lecture 9.2)


Standard Error

Standard Error for sample mean:
\(SE_{\bar{x}} = \frac{s}{\sqrt{n}}\)
Smaller \(SE\) implies more precise estimates.
(Lecture 9.2)


Test Statistic

Test Statistic for comparing means:
\(z = \frac{\bar{x} - \mu_0}{SE_{\bar{x}}}\)
For unknown \(\sigma\), use \(t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}\)
(Lecture 9.2)


Confidence Intervals for the Mean

Confidence Interval:
\(\bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}}\)
Example: 95% CI means we are 95% confident the interval captures the true mean \(\mu\).
(Lecture 9.2)


Comparing Two Means

Comparing Two Means involves checking if observed differences could have occurred by chance.
Hypothesis: \(H_0: \mu_1 = \mu_2\) vs. \(H_A: \mu_1 \ne \mu_2\)
(Lecture 9.2)


CLT for Independent Sample Means

Conditions:
1. Random samples & independent observations
2. Independent samples (not paired)
3. Large samples or approximately Normal populations (each \(n \geq 25\))
(Lecture 9.2)


Distribution of Difference Between Sample Means

\(\bar{x}_1 - \bar{x}_2 \sim N\left(\mu_1 - \mu_2,\ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right)\)
(Lecture 9.2)


CI for Difference in Means

\(\bar{x}_1 - \bar{x}_2 \pm t^* \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)
Estimates the range of plausible differences in population means.
(Lecture 9.2)


Hypothesis Testing for Difference in Means

Test Statistic:
\(t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\)
(Lecture 9.2)
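
A sketch with hypothetical summary statistics; under \(H_0: \mu_1 = \mu_2\) the hypothesized difference is 0. The conservative choice df = min(n1, n2) - 1 is an assumption here (software typically uses the Welch approximation):

```python
import math
from scipy.stats import t

# Hypothetical summary statistics for two independent samples.
x_bar1, s1, n1 = 5.2, 1.1, 40
x_bar2, s2, n2 = 4.7, 1.3, 45

se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
t_obs = (x_bar1 - x_bar2 - 0) / se       # hypothesized difference is 0 under H0

df = min(n1, n2) - 1                     # conservative degrees of freedom
p_value = 2 * t.sf(abs(t_obs), df=df)    # two-tailed
print(t_obs, p_value)
```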


Null Hypothesis

\(H_0: \mu_1 = \mu_2\) — assumes no difference in population means.
(Lecture 9.2)


Alternative Hypothesis

\(H_A: \mu_1 \ne \mu_2\), \(H_A: \mu_1 > \mu_2\), or \(H_A: \mu_1 < \mu_2\) — based on research question.
(Lecture 9.2)


Significance Level

Significance Level (\(\alpha\)) is the probability of rejecting \(H_0\) when it is true. Common levels: 0.05 or 0.01.
(Lecture 9.2)


Box Plot

Box Plot visualizes five-number summary: min, \(Q_1\), median, \(Q_3\), max.
Helpful for comparing distributions between groups.
(Lecture 9.2)


Histogram

Histogram displays frequency of data intervals.
Use to assess skewness and modality before applying CLT.
(Lecture 9.2)


Unpaired T-Test

Unpaired T-Test compares means from two independent samples.
Assumes Normal populations or large \(n\); variances may be treated as equal or unequal.
(Lecture 9.2)


Paired T-Test

Paired T-Test compares matched pairs (e.g., before/after).
CLT applies to the differences:
\(\bar{x}_{\text{diff}} \sim N\left(\mu_{\text{diff}},\ \frac{\sigma_{\text{diff}}}{\sqrt{n}}\right)\), where \(\mu_{\text{diff}} = \mu_1 - \mu_2\)
Example: Measure student scores before and after tutoring.
(Lecture 9.2)
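
A sketch of the paired analysis with made-up scores: compute the within-pair differences first, then run a one-sample t-test on them:

```python
import math
import statistics
from scipy.stats import t

before = [72, 65, 80, 75, 68, 77, 70, 74]   # hypothetical scores
after  = [78, 70, 83, 74, 75, 80, 76, 79]

diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
d_bar = statistics.mean(diffs)
se = statistics.stdev(diffs) / math.sqrt(n)

t_obs = d_bar / se                        # H0: mean difference = 0
p_value = 2 * t.sf(abs(t_obs), df=n - 1)  # two-tailed
print(d_bar, t_obs, p_value)
```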


CI for Paired Samples Difference in Means

\(\bar{x}_{\text{diff}} \pm t^* \cdot \sqrt{\frac{s^2_{\text{diff}}}{n}}\)
Estimates change from before to after within individuals.
(Lecture 9.2)


Paired vs. Unpaired

  • Paired: same individuals under both conditions.
  • Unpaired: different individuals in each group.
    Example: Before vs. after weight loss (paired) vs. comparing two diet groups (unpaired).
    (Lecture 9.2)

Lecture 10.1 : Final Review


Observational Study vs Experiment

Experiment: treatment is randomly assigned; causal conclusions can be drawn.
Observational Study: no random assignment; only associations, not causality.
(Lecture 10.1)


treatment variable(s) and response variable(s)

Treatment variable: the variable manipulated (e.g., smoking vs. not).
Response variable: the outcome measured (e.g., baby weight).
(Lecture 10.1)


Observational Study

A study where the researcher does not assign treatments but observes naturally occurring differences.
(Lecture 10.1)


Experiment

A study where treatments are randomly assigned to subjects to test for causal effects.
(Lecture 10.1)


Regression

Modeling the relationship between a response and explanatory variable.
Intercept: expected \(y\) when \(x = 0\) (interpret only if \(x = 0\) is in range).
Slope: expected change in \(y\) when \(x\) increases by 1.
(Lecture 10.1)


Coefficient of determination

\(R^2\) = proportion of variability in \(y\) explained by \(x\).
\(R^2 = r^2\) where \(r\) is the correlation coefficient.
(Lecture 10.1)


correlation coefficient \(r\)

Measures linear association between two variables. \(r \in [-1, 1]\).
(Lecture 10.1)


Probability

Deals with random sampling, conditional probabilities, and independence.
\(P(A \mid B) = P(A)\) if independent; \(P(A \text{ and } B) = 0\) if mutually exclusive.
(Lecture 10.1)


Random Sampling

Each member of the population has equal chance of being selected.
(Lecture 10.1)


Independence

Two events are independent if knowing one does not affect the probability of the other.
\(P(A \mid B) = P(A)\)
(Lecture 10.1)


Mutually exclusive

Two events cannot both occur. \(P(A \text{ and } B) = 0\)
(Lecture 10.1)


Law of large numbers

As sample size increases, the sample statistic gets closer to the population parameter.
(Lecture 10.1)


Discrete variables

Numerical values that are countable (e.g., number of pets).
(Lecture 10.1)


Continuous variables

Numerical values that can take on any value in a range (e.g., height).
(Lecture 10.1)


bar graph

Used for categorical variables; height of bars represents frequency.
(Lecture 10.1)


Histogram

Used for numerical data; shows distribution via bins.
(Lecture 10.1)


Normal distribution

Bell-shaped, symmetric; used to model sample means or proportions under CLT.
(Lecture 10.1)


Sampling distribution

Distribution of a sample statistic (e.g., \(\bar{x}\) or \(\hat{p}\)) across many samples.
(Lecture 10.1)


Inference

Using sample data to draw conclusions about a population.
(Lecture 10.1)


Sample vs. Population

Sample: subset of the population.
Population: entire group of interest.
(Lecture 10.1)


statistic vs. parameter

Statistic: calculated from sample data (e.g., \(\bar{x}, \hat{p}\)).
Parameter: value that describes the population (e.g., \(\mu, p\)).
(Lecture 10.1)


accuracy (bias) and precision (standard error)

Bias (accuracy): how close estimates are to true value.
Standard error (precision): how much estimates vary.
(Lecture 10.1)


central limit theorem

With large enough samples, the sampling distribution of the mean/proportion is approximately Normal.
(Lecture 10.1)


confidence intervals

Estimate of a population parameter plus/minus a margin of error.
(Lecture 10.1)


hypothesis tests

A formal procedure for testing a claim about a population.
Steps: Hypotheses → Check assumptions → Test statistic → p-value → Decision
(Lecture 10.1)


Null Hypothesis

A statement of no effect or difference. Typically the hypothesis we try to reject.
(Lecture 10.1)


Alternative Hypothesis

What we seek evidence for; a statement that contradicts the null.
(Lecture 10.1)


Test Statistic

Compares observed data to what we expect under the null.
Often \(z = \frac{\bar{x} - \mu_0}{SE}\) or \(t = \frac{\bar{x} - \mu_0}{\hat{SE}}\)
(Lecture 10.1)


P-value

Probability of observing a test statistic as extreme as or more extreme than the one observed, assuming \(H_0\) is true.
(Lecture 10.1)


type I and II errors

Type I Error: Reject \(H_0\) when it is true.
Type II Error: Fail to reject \(H_0\) when it is false.
(Lecture 10.1)


Proportion vs Mean

Use proportion if the variable is categorical (1 = success, 0 = failure).
Use mean if the variable is numerical.
(Lecture 10.1)


One sample vs. two sample

One-sample: compare sample to a known value.
Two-sample: compare two independent samples.
(Lecture 10.1)


Paired vs. unpaired

Paired: samples are linked; test differences within pairs.
Unpaired: samples are independent; compare group averages.
(Lecture 10.1)


normal vs. t-distribution

Use \(t\)-distribution when sample size is small (\(n < 25\)) for numerical data.
Otherwise, use standard Normal (\(z\)).
(Lecture 10.1)