Context:

Lectures 1.1 through 5.1 cover the content for Midterm 1.

Lecture 1.1: Welcome to Stats 10

Prediction

Prediction is the process of estimating future outcomes based on patterns or models found in data.
Example: Predicting the time of the next solar eclipse is easier than predicting an earthquake due to lower variability.
(Lecture 1.1)


Variability

Variability refers to the natural fluctuations in data that create uncertainty and necessitate statistical reasoning.
Example: People’s test scores vary due to different study habits and conditions.
(Lecture 1.1)


Data

Data are recorded information produced by people, sensors, or machines that must be organized to extract meaning.
Example: Data trails from credit card transactions and Google searches.
(Lecture 1.1)


Data Handling

Data Handling is the process of managing, organizing, and structuring data to prepare it for analysis.
Example: Identifying variable types and cleaning data from surveys.
(Lecture 1.1)


Basic Variable Types

Numerical: Quantitative values like price or time since review.
Categorical: Labels such as brand or payment type.
Note: Type depends more on use than format—e.g., ZIP codes are numeric but used categorically.
(Lecture 1.1)


Organizing Data

Organizing Data means structuring it in a tidy format where rows represent cases and columns represent variables.
Example: A gas station dataset where each row is a gas station and columns include brand, price, etc.
(Lecture 1.1)
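
A minimal sketch of this tidy layout in Python with pandas; the stations, prices, and column names below are made up for illustration:

```python
import pandas as pd

# Tidy format: each row is one case (a gas station),
# each column is one variable.
stations = pd.DataFrame({
    "brand": ["Shell", "Chevron", "Arco"],         # categorical
    "price_per_gallon": [5.09, 5.29, 4.89],        # numerical
    "payment_type": ["credit", "credit", "cash"],  # categorical
})
print(stations)
```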


Collecting Data

Collecting Data involves methods such as:
- Observational studies
- Controlled experiments
- Surveys
- Census
(Lecture 1.1)


Observational Studies

Observational Studies are those where subjects select their own treatment or exposure.
Example: Choosing to attend bootcamp rather than being assigned.
(Lecture 1.1)


Controlled Experiments

Controlled Experiments assign subjects to treatment or control groups by the researcher.
Example: Comparing nasal spray to injection flu vaccines via researcher-assigned groups.
(Lecture 1.1)


Surveys

Surveys are tools for collecting structured responses directly from individuals.
Example: Asking individuals about their data usage or opinions.
(Lecture 1.1)


Census

A Census is a study that attempts to collect data from every individual in a population.
(Lecture 1.1)


Confounding Factor

A Confounding Factor is an external variable that affects both the treatment and response, providing an alternative explanation for an observed association.
Example: Motivation affecting both bootcamp participation and recidivism.
(Lecture 1.1)


Causality

Causality refers to the determination of whether one variable directly affects another.
Example: Asking whether drinking coffee causes heart disease or just correlates with it.
(Lecture 1.1)


Treatment Group

The Treatment Group receives the intervention or condition being studied.
Example: Individuals required to drink whole milk in a study.
(Lecture 1.1)


Control Group

The Control Group does not receive the treatment and serves as a baseline.
Example: Individuals who do not drink whole milk in the same study.
(Lecture 1.1)


Response Variable

The Response Variable is the outcome being measured to assess the effect of the treatment.
Example: Whether someone dies from falling out of bed after a treatment.
(Lecture 1.1)




Lecture 1.2: Welcome to Stats 10

Confounding Variables

Confounding Variables are alternative explanations for observed associations. They influence both the treatment and the response variable.
Example: Self-discipline can lead someone to volunteer for bootcamp and also reduce recidivism.
(Lecture 1.2)


Variability

Variability refers to natural differences in data across individuals or groups. It complicates analysis and conclusions, making control and replication essential.
(Lecture 1.2)


Visualizing Variability

Visualizing Variability helps us understand how values differ by using tools such as histograms or bar plots to display the distribution of data.
(Lecture 1.2)


Distribution

Distribution tells us: (1) what values a variable takes, and (2) how frequently those values occur.
It reflects the spread of data through counts or relative frequencies.
(Lecture 1.2)


Center, Spread, and Shape

Center, Spread, and Shape describe a distribution:
- Center: The typical or average value.
- Spread: The variability in the data.
- Shape: Symmetry, skewness, or modality of the distribution.
(Lecture 1.2)


Controlled Experiment

Controlled Experiments are those in which researchers assign subjects to treatment or control groups.
This allows for causal conclusions.
(Lecture 1.2)


Control Group

The Control Group is the group that does not receive the treatment and provides a baseline for comparison.
(Lecture 1.2)


Replication

Replication is repeating a study across many individuals to ensure observed effects are not due to chance.
(Lecture 1.2)


Observational Study

Observational Studies occur when subjects assign themselves to treatment or control groups.
Because potential confounding variables are always present, they cannot support cause-and-effect conclusions.
(Lecture 1.2)


Causal Claims

Causal Claims assert that doing X will cause a change in Y.
They require controlled experiments to be valid.
(Lecture 1.2)


Associational Claim

Associational Claims suggest that X and Y tend to vary together, without inferring causality.
Example: More whole grain consumption is associated with longer life.
(Lecture 1.2)


Categorical Variable

Categorical Variables represent groups or labels.
Visualized with bar charts, ribbon plots, etc. (e.g., gender categories).
(Lecture 1.2)


Numerical Variable

Numerical Variables are quantities that take on numeric values.
They are best visualized with histograms, dot plots, or boxplots.
(Lecture 1.2)


Histograms

Histograms show numerical distributions by displaying ranges of values and the counts or relative frequencies within those ranges.
(Lecture 1.2)


Bin Length

Bin Length is the width of each interval in a histogram.
Wider bins obscure details; narrower bins may add noise.
(Lecture 1.2)
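
A short matplotlib sketch of this trade-off, using simulated scores (the data and the two bin counts are arbitrary choices for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=70, scale=10, size=200)  # simulated exam scores

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(data, bins=5)     # wide bins: smooth, but obscures detail
axes[0].set_title("5 bins")
axes[1].hist(data, bins=40)    # narrow bins: detailed, but noisy
axes[1].set_title("40 bins")
plt.show()
```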


Treatment Variable

Treatment Variable is the condition manipulated by the researcher.
Example: Sleep duration, caffeine amount, or laptop use.
(Lecture 1.2)


Response Variable

Response Variable is the outcome measured to assess treatment impact.
Example: Pain tolerance, test scores, or concussions.
(Lecture 1.2)


Potential Confounders

Potential Confounders are variables that affect both the treatment and the outcome.
Examples: Income, health status, education.
(Lecture 1.2)


Confounding Variables & Randomization

Randomization helps eliminate confounding by balancing known and unknown variables across groups, on average.
This is why randomized controlled trials are the gold standard for causal inference.
(Lecture 1.2)


Random Assignment

Random Assignment is the process by which subjects are placed into treatment or control groups using randomization.
It ensures groups are comparable and helps eliminate confounding variables, enabling valid causal conclusions.
(Lecture 1.2)


Lecture 2.1: Analyzing Distributions

Distribution

Distribution tells us what values a variable takes and how frequently each value occurs.
It reflects both the range and the frequency of observed values.
(Lecture 2.1)


Histograms

Histograms are graphical tools that group data into bins and display how many observations fall into each bin.
They help us visualize the distribution’s shape, center, and spread.
(Lecture 2.1)


Shape

Shape describes the appearance of a distribution:
- Symmetric: Both sides are roughly the same.
- Right-skewed: Tail extends to the right.
- Left-skewed: Tail extends to the left.
(Lecture 2.1)


Center

Center is the “typical” value used to summarize a distribution.
It may be measured using the mean or median depending on the shape.
(Lecture 2.1)


Variability (Spread)

Variability describes how much the data values differ from each other.
Measured using standard deviation or IQR, depending on symmetry.
(Lecture 2.1)


Statistical Questions

Statistical Questions are questions that anticipate variability in the data and require data collection and analysis.
Example: “How long does it take to get to campus?”
(Lecture 2.1)


Typical Value

Typical Value is a summary of what’s most representative in a distribution.
Examples: Mean, median, or mode depending on context.
(Lecture 2.1)


Center of a Distribution

Center of a Distribution refers to the point of balance or midpoint in the data.
The mean is the balance point; the median splits the data in half.
(Lecture 2.1)


Sample Mean

Sample Mean is the average value of the data.
It’s the center of gravity (balancing point) of the distribution.
(Lecture 2.1)


Standard Deviation

Standard Deviation quantifies the average distance of observations from the mean.
Best used when the distribution is symmetric and unimodal.
(Lecture 2.1)


Median

Median is the middle value of a sorted dataset—half the data lies above and half below it.
It’s less affected by outliers or skewness than the mean.
(Lecture 2.1)


IQR (Interquartile Range)

IQR is the distance between the first quartile (Q1) and the third quartile (Q3).
It measures the spread of the middle 50% of the data and is a distance, not a location.
(Lecture 2.1)


Median and IQR

Median and IQR are used to summarize center and spread when data are skewed or have outliers.
Median gives the midpoint; IQR shows the middle 50% spread.
(Lecture 2.1)
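
A numpy sketch computing these summaries on a small made-up, right-skewed dataset:

```python
import numpy as np

x = np.array([2, 3, 3, 4, 5, 6, 7, 9, 15])  # made-up, right-skewed data

print("mean:  ", x.mean())          # pulled upward by the large value 15
print("SD:    ", x.std(ddof=1))     # sample standard deviation
print("median:", np.median(x))      # resistant to the outlier
q1, q3 = np.percentile(x, [25, 75])
print("IQR:   ", q3 - q1)           # spread of the middle 50%
```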


Boxplots

Boxplots graphically display the median, IQR, and potential outliers.
They summarize the five-number summary but hide the number of modes.
(Lecture 2.1)


Center and Spread

Center and Spread should always be reported together:
- Use mean and SD for symmetric data
- Use median and IQR for skewed data
(Lecture 2.1)


Mode of a Distribution

Mode is the most frequently occurring value in a dataset.
(Lecture 2.1)


Multi-modal Distributions

Multi-modal Distributions have multiple peaks or humps, indicating the presence of subgroups.
In such cases, a single “typical” value may be misleading.
(Lecture 2.1)


“Typical” with Categorical Variables

Typical with Categorical Variables can be defined as:
- Majority: More than 50%
- Mode: Most frequent category
(Lecture 2.1)


Lecture 2.2: Associations between Variables

Center & Spread (Variability)

Mean & Standard Deviation: The mean is the sum of all observations divided by the number of observations. The standard deviation measures how much values typically deviate from the mean.
(Lecture 2.2)

Median and Interquartile Range (IQR): The median splits the dataset in half; the IQR measures the spread of the middle 50% of the data (Q3 - Q1).
(Lecture 2.2)


Boxplots

Boxplots display the 5-number summary (min, Q1, median, Q3, max). They provide a compact view of a distribution but may omit important details, such as the number of modes. They are useful for comparing distributions across groups.
(Lecture 2.2)


Outliers

Outliers are points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. They tend to affect the mean and standard deviation more than the median and IQR. Boxplots can help identify potential outliers.
(Lecture 2.2)


Comparing Groups

Comparing Groups involves precise comparisons of center, spread, and shape between distributions, using contextual language (e.g., “Coffee drinkers had a higher median and greater variability than non-coffee drinkers”).
(Lecture 2.2)


Z-scores

Z-scores standardize data by subtracting the mean and dividing by the standard deviation:
\[ z = \frac{x - \bar{x}}{s} \]
They allow for comparison across different units and scales. Z-scores have mean 0 and SD 1.
(Lecture 2.2)
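
A quick Python sketch of standardization; the scores are hypothetical:

```python
import numpy as np

scores = np.array([55, 63, 70, 78, 84])            # hypothetical exam scores
z = (scores - scores.mean()) / scores.std(ddof=1)  # subtract mean, divide by SD

print(z.round(2))            # standardized (unitless) values
print(round(z.mean(), 10))   # 0, up to floating-point error
print(z.std(ddof=1))         # 1
```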


Empirical Rule

Empirical Rule: For approximately normal distributions:
- ~68% of data falls within ±1 SD of the mean
- ~95% within ±2 SD
- ~99.7% within ±3 SD
(Lecture 2.2)


Associations Between Numerical Variables

Two variables are associated if the mean of one depends on the value of the other.
(Lecture 2.2)


Scatterplots

Scatterplots graph paired data. Each point represents an observation’s x- and y-values.
(Lecture 2.2)


Pattern & Prediction

Pattern & Prediction: If a pattern exists, it may allow us to predict future values and assess variable relationships in complex systems.
(Lecture 2.2)


Predictor and Response Variables

Predictor variable (independent/explanatory): Plotted on the x-axis.
Response variable (dependent/target): Plotted on the y-axis.
(Lecture 2.2)


Describing Associations

Trend: Increasing or decreasing
Strength: How tightly clustered the data are
Shape: Linear or non-linear
Interpretation: Must be in context (e.g., “Bears with wider heads tend to weigh more.”)
(Lecture 2.2)


Correlation Coefficient

Correlation coefficient (r) measures the strength and direction of a linear association between two variables.
- Ranges from -1 to 1
- Positive r: increasing trend
- Negative r: decreasing trend
- r = 0: no linear association
- Calculated using z-scores:
r = average of (z_x * z_y)
Note: r does not imply linearity; it only quantifies strength if the relationship is linear.
(Lecture 2.2)
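
A sketch of this z-score computation in Python, on made-up data. One nuance: when sample SDs (divisor n - 1) are used to form the z-scores, the "average" of the products also divides by n - 1, which makes the hand computation agree with numpy's built-in:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # made-up response values

zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r = (zx * zy).sum() / (len(x) - 1)  # average product of paired z-scores

print(r)
print(np.corrcoef(x, y)[0, 1])      # numpy's built-in gives the same value
```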


Lecture 3.1: Regression

Correlation

Correlation describes the direction and strength of a linear relationship between two variables.
(Lecture 3.1)


Correlation Coefficient

Correlation coefficient (r) is a numerical measure of linear association.
Always between -1 and 1.
- r = +1 or -1: perfect linear association
- r = 0: no linear association
(Lecture 3.1)


Calculating r

Calculating r involves standardizing both variables (z-scores), multiplying the z-scores, and averaging the products.
(Lecture 3.1)


Summarizing Linear Associations

Summarizing a linear association involves the intercept, the slope, predictions, the correlation coefficient, SD_x and SD_y, and interpretation in context.
(Lecture 3.1)


Predictions

Predictions are made using the regression line to estimate the mean value of y for a given x.
(Lecture 3.1)


Regression Line

The regression line is the line of best fit. It passes through the point \((\bar{x}, \bar{y})\) and is computed as:
\[ \hat{y} = \text{intercept} + \text{slope} \cdot x \]
Slope = \(r \cdot \frac{\text{SD}_y}{\text{SD}_x}\)
(Lecture 3.1)
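
Because the line passes through \((\bar{x}, \bar{y})\), the intercept is \(\bar{y} - \text{slope} \cdot \bar{x}\). A Python sketch on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
slope = r * y.std(ddof=1) / x.std(ddof=1)  # slope = r * SD_y / SD_x
intercept = y.mean() - slope * x.mean()    # line passes through (x-bar, y-bar)

y_hat = intercept + slope * 3.5            # predicted mean of y at x = 3.5
print(f"slope={slope:.3f}, intercept={intercept:.3f}, y_hat={y_hat:.3f}")
```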


Interpret the Slope

Slope tells us how the predicted value of y changes for a one-unit increase in x.
Example: Each additional SAT point might increase predicted GPA by 0.0017.
(Lecture 3.1)


Intercept

Intercept is the predicted value of y when x = 0.
Note: Interpret with caution — x = 0 may be outside the observed range.
(Lecture 3.1)


Extrapolating

Extrapolation is using the regression line to predict y-values beyond the range of observed x-values.
This can lead to invalid conclusions.
(Lecture 3.1)


Evaluating the Regression

Evaluating regression involves checking linearity, goodness of fit, and explanatory power.
(Lecture 3.1)


r-squared

r-squared (r²) is the coefficient of determination.
- Represents the proportion of variability in y explained by x.
- r² = 1: perfect linear fit
- r² = 0: regression line explains nothing
(Lecture 3.1)


Lecture 3.2: Probability


Randomness and Chance

Randomness and Chance refer to unpredictability in individual outcomes, though patterns may emerge in the long run.
Example: Whether a coin lands heads or tails on a single toss is random, but over many tosses, about half will be heads.
(Lecture 3.2)


Randomness and Statistics

Randomness and Statistics interact through two core ideas: random assignment in experiments and random sampling in surveys.
This allows for valid generalizations and causal inferences.
(Lecture 3.2)


Random Assignment

Random Assignment is assigning subjects to groups by chance to ensure differences between groups are minimal.
(Lecture 3.2)


Estimating Probability

Estimating Probability can be done through theoretical reasoning, simulations, or observations.
(Lecture 3.2)


Theoretical Probability

Theoretical Probability is determined through mathematical logic and assumptions.
Example: The probability of rolling a 3 on a fair die is 1/6.
(Lecture 3.2)


Simulated/Experimental Probability

Simulated Probability estimates chance via repeated simulation of a process.
Example: Flipping a virtual coin 1,000 times to estimate the chance of heads.
(Lecture 3.2)


Observational Probability

Observational Probability is derived from actual real-world data outcomes.
Example: 21% of surveyed individuals reported having fired a gun.
(Lecture 3.2)


Basic Probability Theory Terminology

  • Random: An outcome not predictable in the short term.
  • Event: A subset of outcomes.
  • Outcome: A single result from an experiment.
  • Sample Space: All possible outcomes.
  • Probability: The relative frequency of an event over infinitely many trials.
    (Lecture 3.2)

Probability

Probability is a measure of the likelihood of an event, ranging from 0 (impossible) to 1 (certain).
(Lecture 3.2)


Methods to Determine Probabilities

Probabilities can be estimated through simulation or reasoned out via math and logic.
(Lecture 3.2)


Simulations

Simulations use repeated random events to estimate probabilities.
(Lecture 3.2)


Law of Large Numbers

Law of Large Numbers states that as repetitions of an experiment increase, the observed frequency approaches the true probability.
(Lecture 3.2)
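
A simulation sketch: the running proportion of heads drifts early on but settles near 0.5 as flips accumulate:

```python
import numpy as np

rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=10_000)                # 1 = heads, 0 = tails
running = flips.cumsum() / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>6} flips: proportion of heads = {running[n - 1]:.4f}")
```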


Sample Space

Sample Space is the set of all possible outcomes of a random process.
Example: For flipping two coins, the sample space is {HH, HT, TH, TT}.
(Lecture 3.2)


Event

Event is a specific outcome or set of outcomes from the sample space.
Example: Getting two tails when flipping two coins.
(Lecture 3.2)


Probability of an Event

Probability of an Event is the long-run relative frequency over infinitely many repetitions.
(Lecture 3.2)


Logical Operators: Probability Rules

  • Not: P(not A) = probability that A does not occur.
  • And: P(A and B) = probability both A and B occur.
  • Or: P(A or B) = probability A, B, or both occur.
    (Lecture 3.2)

Equally Likely Outcomes

Equally Likely Outcomes: If all outcomes have the same chance, P(event) = # favorable outcomes / # total outcomes.
(Lecture 3.2)


Probability Distribution

Probability Distribution shows all possible outcomes and the probabilities for each.
(Lecture 3.2)


Mutually Exclusive (Disjoint)

Mutually Exclusive events cannot happen at the same time.
Example: A coin cannot land heads and tails simultaneously.
(Lecture 3.2)


Conditional Probabilities

Conditional Probability is the probability of an event given that another event has occurred.
Notation: P(A | B)
Example: P(shoots gun | person is conservative) = 0.40
(Lecture 3.2)


Lecture 4.1: Probability Part 2


Conditional Probabilities

Conditional Probabilities refer to the chance of an event occurring given that another event has already occurred.
(Lecture 4.1)


Mutually Exclusive

Mutually Exclusive events cannot happen at the same time.
If A and B are mutually exclusive, then:
\[ P(A \text{ and } B) = 0 \]
(Lecture 4.1)


Conditional Probability

Conditional Probability is the probability of event A given that B has occurred:
\[ P(A | B) = \frac{P(A \text{ and } B)}{P(B)} \]
(Lecture 4.1)
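
A worked sketch using hypothetical two-way-table counts, chosen so the answer matches the earlier 0.40 example:

```python
# Hypothetical counts from a two-way table of 200 survey respondents.
n = 200
conservative = 75                 # event B
conservative_and_shoots = 30      # event A and B

p_b = conservative / n
p_a_and_b = conservative_and_shoots / n
p_a_given_b = p_a_and_b / p_b     # P(A | B) = P(A and B) / P(B)

print(p_a_given_b)                # 0.4, i.e., P(shoots gun | conservative)
```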


Complement of an Event

The Complement of an Event A is the event that A does not occur. Its probability is
\[ P(\text{not A}) = 1 - P(A) \]
(Lecture 4.1)


Associations Between Two Categorical Variables

Associations exist if the probability of A is different depending on whether B occurs:
If \(P(A) \neq P(A | B)\), then A and B are associated.
Conversely, if \(P(B) = P(B | A)\), then knowing A tells us nothing about B.
(Lecture 4.1)


Independent Events

Independent Events occur when one event does not affect the probability of another.
A and B are independent if:
\[ P(A | B) = P(A) \quad \text{and} \quad P(B | A) = P(B) \]
(Lecture 4.1)


Statistical Approaches to Determine Independence

Statistical Approaches to determine independence include:
- Theoretical: Based on logic and probability rules.
- Empirical (Data-Based): Based on observed differences in conditional probabilities.
(Lecture 4.1)


Building a Simulation

Building a Simulation involves six key steps (see the sketch after this list):
1. Identify the random action and the probability of a successful outcome.
2. Determine how to simulate the random action.
3. Define the event of interest.
4. Explain the procedure for one trial.
5. Run a trial and record the result.
6. Repeat many times (≥100) and compute the empirical probability as:
\[ \text{Empirical Probability} = \frac{\text{\# times event occurred}}{\text{\# of trials}} \]
(Lecture 4.1)
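
A Python sketch that follows the six steps for a hypothetical question: what is the chance of rolling at least one six in four rolls of a fair die?

```python
import numpy as np

# Steps 1-3: the random action is rolling a fair die (P(six) = 1/6);
# the event of interest is "at least one six" across 4 rolls.
rng = np.random.default_rng(1)

def one_trial():
    # Steps 4-5: one trial = four die rolls; record whether a six appeared.
    rolls = rng.integers(1, 7, size=4)
    return (rolls == 6).any()

n_trials = 10_000  # Step 6: repeat many times (well over 100)
hits = sum(one_trial() for _ in range(n_trials))
print(hits / n_trials)  # empirical probability; theory: 1 - (5/6)**4 ≈ 0.518
```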


Lecture 4.2: Probability Models


Law of Large Numbers

Law of Large Numbers states that in the long run, relative frequencies approach the true probability. In the short run, these frequencies can vary dramatically.
(Lecture 4.2)


Probability Models

Probability Models summarize the likelihood of different outcomes for phenomena that behave predictably over time.
(Lecture 4.2)


Probability Distribution Displays

Probability Distributions show:
1. The values of outcomes (sample space)
2. Their long-run relative frequencies (probabilities)
(Lecture 4.2)


Two Types of Numerical Outcomes

Numerical Outcomes can be:
- Discrete: Countable outcomes (e.g., number of heads)
- Continuous: Measured outcomes that can take on any value in an interval
(Lecture 4.2)


Discrete and Continuous Random Variables

  • Discrete Random Variables: Take specific, countable values
  • Continuous Random Variables: Can take any value within a range
    (Lecture 4.2)

Discrete – The Binomial Model

The Binomial Model is used when:
- There are two possible outcomes: success/failure
- Each trial is independent
- The number of trials is fixed (n)
- The probability of success (p) is constant
(Lecture 4.2)


Binomial Distribution

Binomial Distribution describes the probability of observing a certain number of successes in n independent trials, each with probability p of success.
(Lecture 4.2)
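
A sketch using scipy.stats.binom, with hypothetical values n = 10 and p = 0.5 (ten fair-coin flips):

```python
from scipy.stats import binom

n, p = 10, 0.5              # 10 independent fair-coin flips

print(binom.pmf(4, n, p))   # P(exactly 4 heads) ≈ 0.205
print(binom.cdf(4, n, p))   # P(at most 4 heads) ≈ 0.377
print(binom.mean(n, p))     # expected number of successes: n * p = 5
```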


Continuous – Normal Distribution

Normal Distribution is a continuous distribution shaped like a bell (also called Gaussian).
It represents probabilities as areas under a curve.
(Lecture 4.2)


Continuous Probability

Continuous Probability Distributions cannot list every possible value, since the outcomes form a continuum.
They are represented:
- Graphically: as area under a curve
- Mathematically: using integrals
(Lecture 4.2)


Discrete Probability

Discrete Probability Distributions are used when outcomes are countable.
They can be displayed:
- In a table
- With a graph
- With a formula
(Lecture 4.2)


Normal Distribution

Normal Distribution is symmetric and bell-shaped.
It models many natural processes and is used extensively in statistical inference.
(Lecture 4.2)


Empirical Rule

Empirical Rule for normal distributions:
- ~68% within 1 SD of the mean
- ~95% within 2 SDs
- ~99.7% within 3 SDs
(Lecture 4.2)


Mean and SD of Normal Distribution

  • Mean (μ): Determines the center of the distribution
  • Standard Deviation (σ): Determines the spread
    Larger σ → wider distribution
    (Lecture 4.2)
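
A scipy sketch confirming the Empirical Rule for an arbitrary choice of μ and σ:

```python
from scipy.stats import norm

mu, sigma = 100, 15  # hypothetical mean and SD

# P(mu - k*sigma < X < mu + k*sigma) for k = 1, 2, 3
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(f"within {k} SD: {p:.4f}")  # ≈ 0.6827, 0.9545, 0.9973
```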

Lecture 5.1: Midterm Review


Numerical vs. Categorical Variables

Numerical Variables represent quantities (e.g., height, income).
Categorical Variables represent groups or categories (e.g., gender, brand).
(Lecture 5.1)


Discrete vs. Continuous Numerical Variables

  • Discrete: Countable values (e.g., number of siblings)
  • Continuous: Any value in a range (e.g., height)
    (Lecture 5.1)

Binomial Distribution

Binomial Distribution models the number of successes in a fixed number of independent trials, each with the same probability of success.
(Lecture 5.1)


Normal Distribution

Normal Distribution is a continuous, bell-shaped distribution symmetric about its mean.
(Lecture 5.1)


Experiment vs. Observational Study

  • Experiment: Researcher assigns treatments
  • Observational Study: Subjects select or are observed naturally without intervention
    (Lecture 5.1)

Confounding Factor

A Confounding Factor is an alternative explanation for an observed association.
(Lecture 5.1)


Mean and Standard Deviation

  • Mean: Average; center of gravity
  • Standard Deviation: Typical distance from the mean; measures spread
    (Lecture 5.1)

Median and IQR

  • Median: Middle value
  • IQR: Range of the middle 50% of data
    (Lecture 5.1)

Five Number Summary & Boxplot

  • Five Number Summary: Minimum, Q1, Median, Q3, Maximum
  • Boxplot: Visual display of the five-number summary
    (Lecture 5.1)

Z Scores

Z Score standardizes a value by expressing it in terms of standard deviations from the mean.
\[ z = \frac{x - \bar{x}}{s} \]
(Lecture 5.1)


Empirical Rule

Empirical Rule (for normal distributions):
- 68% within 1 SD of the mean
- 95% within 2 SDs
- 99.7% within 3 SDs
(Lecture 5.1)


Correlation Coefficient (r)

r measures the strength and direction of a linear association.
- r ∈ [-1, 1]
- r > 0: positive trend
- r < 0: negative trend
- r = 0: no linear trend
(Lecture 5.1)


Coefficient of Determination (r²)

r² measures the proportion of variability in y explained by x in a linear model.
(Lecture 5.1)


Linear Regression

Linear Regression fits a line to predict y from x.
- Interpret the line: estimates the average y for a given x
- Slope: Change in predicted y for a one-unit change in x
- Intercept: Predicted y when x = 0
- Evaluate strength: use r and r²
(Lecture 5.1)


Predicted Values

Predicted Values are the estimated average y-values given by the regression line for any x.
The line passes through the point \((\bar{x}, \bar{y})\) and has slope \(\text{Slope} = r \cdot \frac{\text{SD}_y}{\text{SD}_x}\).
(Lecture 5.1)


Interpreting Slope

Slope is the expected change in y for each one-unit increase in x. \(\text{Slope} = r \cdot \frac{\text{SD}_y}{\text{SD}_x}\)
(Lecture 5.1)


Probability Rules

Basic Probability Rules include:
- P(not A) = 1 - P(A)
- P(A or B) = P(A) + P(B) - P(A and B)
- P(A and B) = P(A) × P(B) if A and B are independent
(Lecture 5.1)


Empirical Probability

Empirical Probability is based on observed data or simulations:
\[ P(A) = \frac{\text{\# of times A occurred}}{\text{total trials}} \]
(Lecture 5.1)