Lectures 1.1 through 5.1 cover the content for Midterm 1.
Prediction is the process of estimating future
outcomes based on patterns or models found in data.
Example: Predicting the time of the next solar eclipse is easier
than predicting an earthquake due to lower variability.
(Lecture 1.1)
Variability refers to the natural fluctuations in
data that create uncertainty and necessitate statistical
reasoning.
Example: People’s test scores vary due to different study habits and
conditions.
(Lecture 1.1)
Data are recorded information produced by people,
sensors, or machines that must be organized to extract meaning.
Example: Data trails from credit card transactions and Google
searches.
(Lecture 1.1)
Data Handling is the process of managing,
organizing, and structuring data to prepare it for analysis.
Example: Identifying variable types and cleaning data from
surveys.
(Lecture 1.1)
Types of Variables:
- Numerical: Quantitative values like price or time since review.
- Categorical: Labels such as brand or payment type.
Note: Type depends more on use than format; e.g., ZIP codes are numeric but used categorically.
(Lecture 1.1)
Organizing Data means structuring it in a tidy
format where rows represent cases and columns represent
variables.
Example: A gas station dataset where each row is a gas station and
columns include brand, price, etc.
(Lecture 1.1)
Collecting Data involves methods such as:
- Observational studies
- Controlled experiments
- Surveys
- Census
(Lecture 1.1)
Observational Studies are those where subjects
select their own treatment or exposure.
Example: Choosing to attend bootcamp rather than being
assigned.
(Lecture 1.1)
Controlled Experiments assign subjects to treatment
or control groups by the researcher.
Example: Comparing nasal spray to injection flu vaccines via
researcher-assigned groups.
(Lecture 1.1)
Surveys are tools for collecting structured
responses directly from individuals.
Example: Asking individuals about their data usage or
opinions.
(Lecture 1.1)
A Census is a study that attempts to collect data
from every individual in a population.
(Lecture 1.1)
A Confounding Factor is an external variable that
affects both the treatment and response, providing an alternative
explanation for an observed association.
Example: Motivation affecting both bootcamp participation and
recidivism.
(Lecture 1.1)
Causality refers to the determination of whether one
variable directly affects another.
Example: Asking whether drinking coffee causes heart disease or just
correlates with it.
(Lecture 1.1)
The Treatment Group receives the intervention or
condition being studied.
Example: Individuals required to drink whole milk in a
study.
(Lecture 1.1)
The Control Group does not receive the treatment and
serves as a baseline.
Example: Individuals who do not drink whole milk in the same
study.
(Lecture 1.1)
The Response Variable is the outcome being measured
to assess the effect of the treatment.
Example: Whether someone dies from falling out of bed after a
treatment.
(Lecture 1.1)
Confounding Variables are alternative explanations
for observed associations. They influence both the treatment and the
response variable.
Example: Self-discipline can lead someone to volunteer for bootcamp
and also reduce recidivism.
(Lecture 1.2)
Variability refers to natural differences in data
across individuals or groups. It complicates analysis and conclusions,
making control and replication essential.
(Lecture 1.2)
Visualizing Variability helps us understand how
values differ by using tools such as histograms or bar plots to display
the distribution of data.
(Lecture 1.2)
Distribution tells us: (1) what values a variable
takes, and (2) how frequently those values occur.
It reflects the spread of data through counts or relative
frequencies.
(Lecture 1.2)
Center, Spread, and Shape describe a distribution:
- Center: The typical or average value.
- Spread: The variability in the data.
- Shape: Symmetry, skewness, or modality of the distribution.
(Lecture 1.2)
Controlled Experiments are those in which
researchers assign subjects to treatment or control groups.
This allows for causal conclusions.
(Lecture 1.2)
The Control Group is the group that does not receive
the treatment and provides a baseline for comparison.
(Lecture 1.2)
Replication is repeating a study across many
individuals to ensure observed effects are not due to chance.
(Lecture 1.2)
Observational Studies occur when subjects assign
themselves to treatment or control groups.
They are always vulnerable to confounding variables and cannot support cause-and-effect conclusions.
(Lecture 1.2)
Causal Claims assert that doing X will cause a
change in Y.
They require controlled experiments to be valid.
(Lecture 1.2)
Associational Claims suggest that X and Y tend to
vary together, without inferring causality.
Example: More whole grain consumption is associated with longer
life.
(Lecture 1.2)
Categorical Variables represent groups or
labels.
Visualized with bar charts, ribbon plots, etc. (e.g., gender
categories).
(Lecture 1.2)
Numerical Variables are quantities that take on
numeric values.
They are best visualized with histograms, dot plots, or
boxplots.
(Lecture 1.2)
Histograms show numerical distributions by
displaying ranges of values and the counts or relative frequencies
within those ranges.
(Lecture 1.2)
Bin Length is the width of each interval in a
histogram.
Wider bins obscure details; narrower bins may add noise.
(Lecture 1.2)
Treatment Variable is the condition manipulated by
the researcher.
Example: Sleep duration, caffeine amount, or laptop use.
(Lecture 1.2)
Response Variable is the outcome measured to assess
treatment impact.
Example: Pain tolerance, test scores, or concussions.
(Lecture 1.2)
Potential Confounders are variables that affect both
the treatment and the outcome.
Examples: Income, health status, education.
(Lecture 1.2)
Randomization helps eliminate confounding by evenly
distributing unknown variables across groups.
This is why randomized controlled trials are the gold standard for
causal inference.
(Lecture 1.2)
Random Assignment is the process by which subjects
are placed into treatment or control groups using randomization.
It ensures groups are comparable and helps eliminate confounding
variables, enabling valid causal conclusions.
(Lecture 1.2)
Distribution tells us what values a variable takes
and how frequently each value occurs.
It reflects both the range and the frequency of observed
values.
(Lecture 2.1)
Histograms are graphical tools that group data into
bins and display how many observations fall into each bin.
They help us visualize the distribution’s shape, center, and
spread.
(Lecture 2.1)
Shape describes the appearance of a
distribution:
- Symmetric: Both sides are roughly the same.
- Right-skewed: Tail extends to the right.
- Left-skewed: Tail extends to the left.
(Lecture 2.1)
Center is the “typical” value used to summarize a
distribution.
It may be measured using the mean or median depending on the
shape.
(Lecture 2.1)
Variability describes how much the data values
differ from each other.
Measured using standard deviation or IQR, depending on
symmetry.
(Lecture 2.1)
Statistical Questions are questions that anticipate
variability in the data and require data collection and analysis.
Example: “How long does it take to get to campus?”
(Lecture 2.1)
Typical Value is a summary of what’s most
representative in a distribution.
Examples: Mean, median, or mode depending on context.
(Lecture 2.1)
Center of a Distribution refers to the point of
balance or midpoint in the data.
The mean is the balance point; the median splits the data in
half.
(Lecture 2.1)
Sample Mean is the average value of the data.
It’s the center of gravity (balancing point) of the
distribution.
(Lecture 2.1)
Standard Deviation quantifies the average distance
of observations from the mean.
Best used when the distribution is symmetric and
unimodal.
(Lecture 2.1)
Median is the middle value of a sorted dataset—half
the data lies above and half below it.
It’s less affected by outliers or skewness than the mean.
(Lecture 2.1)
IQR is the distance between the first quartile (Q1) and the third quartile (Q3).
It measures the spread of the middle 50% of the data and is a
distance, not a location.
(Lecture 2.1)
Median and IQR are used to summarize center and
spread when data are skewed or have outliers.
Median gives the midpoint; IQR shows the middle 50%
spread.
(Lecture 2.1)
Boxplots graphically display the median, IQR, and
potential outliers.
They summarize the five-number summary but hide the number of
modes.
(Lecture 2.1)
Center and Spread should always be reported together:
- Use mean and SD for symmetric data
- Use median and IQR for skewed data
(Lecture 2.1)
Mode is the most frequently occurring value in a
dataset.
(Lecture 2.1)
Multi-modal Distributions have multiple peaks or
humps, indicating the presence of subgroups.
In such cases, a single “typical” value may be
misleading.
(Lecture 2.1)
A "typical" value for Categorical Variables can be defined as:
- Majority: More than 50%
- Mode: Most frequent category
(Lecture 2.1)
Mean & Standard Deviation: The
mean is the sum of all observations divided by the
number of observations. The standard deviation measures
how much values typically deviate from the mean.
(Lecture 2.2)
Median and Interquartile Range (IQR): The
median splits the dataset in half; IQR
measures the spread of the middle 50% of data (Q3 - Q1).
(Lecture 2.2)
Boxplots display the 5-number summary (min, Q1,
median, Q3, max). They provide a compact view of distribution but may
omit important details. Useful for comparing distributions across
groups.
(Lecture 2.2)
Outliers are points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. They may affect the mean and standard deviation more than the median and IQR. Boxplots can help identify potential outliers.
(Lecture 2.2)
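A minimal Python sketch of the fence rule, using NumPy on a hypothetical set of commute times; note that software quartile conventions can differ slightly from hand calculations:

```python
import numpy as np

# Hypothetical sample of commute times (minutes)
data = np.array([12, 15, 18, 20, 21, 22, 24, 25, 27, 55])

q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as potential outliers
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: ({lower_fence}, {upper_fence}); potential outliers: {outliers}")
```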
Comparing Groups involves precise comparisons of
center, spread, and shape between distributions, using contextual
language (e.g., “Coffee drinkers had a higher median and greater
variability than non-coffee drinkers”).
(Lecture 2.2)
Z-scores standardize data by subtracting the mean
and dividing by the standard deviation:
z = (x - mean)/SD.
They allow for comparison across different units and scales. Z-scores
have mean 0 and SD 1.
(Lecture 2.2)
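A quick Python sketch of how z-scores put values from different scales on a common footing; the exam scores and summary statistics below are made up for illustration:

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

# Hypothetical scores on two exams with different scales
math_z = z_score(x=88, mean=75, sd=10)        # 1.3 SDs above the mean
reading_z = z_score(x=640, mean=500, sd=100)  # 1.4 SDs above the mean

# Despite the different units, the z-scores are directly comparable
print(math_z, reading_z)
```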
Empirical Rule: For approximately normal distributions:
- ~68% of data falls within ±1 SD
- ~95% within ±2 SD
- ~99.7% within ±3 SD
(Lecture 2.2)
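One way to see the Empirical Rule in action is to simulate normally distributed data and count how many values land within 1, 2, and 3 SDs of the mean; a minimal NumPy sketch (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=10)
x = rng.normal(loc=0, scale=1, size=100_000)  # standard normal draws

for k in (1, 2, 3):
    frac = np.mean(np.abs(x) < k)             # proportion within k SDs of the mean
    print(f"within ±{k} SD: {frac:.3f}")
# Expect roughly 0.683, 0.954, and 0.997
```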
Two variables are associated if the mean of one
depends on the value of the other.
(Lecture 2.2)
Scatterplots graph paired data. Each point
represents an observation’s x- and y-values.
(Lecture 2.2)
Pattern & Prediction: If a pattern exists, it
may allow us to predict future values and assess variable relationships
in complex systems.
(Lecture 2.2)
Predictor variable (independent/explanatory):
Plotted on the x-axis.
Response variable (dependent/target): Plotted on the
y-axis.
(Lecture 2.2)
Describing Scatterplots:
- Trend: Increasing or decreasing
- Strength: How tightly clustered the data are
- Shape: Linear or non-linear
- Interpretation: Must be in context (e.g., “Bears with wider heads tend to weigh more.”)
(Lecture 2.2)
Correlation coefficient (r) measures the strength
and direction of a linear association between two
variables.
- Ranges from -1 to 1
- Positive r: increasing trend
- Negative r: decreasing trend
- r = 0: no linear association
- Calculated using z-scores:
r = average of (z_x * z_y)
Note: r does not imply linearity; it only quantifies strength if the
relationship is linear.
(Lecture 2.2)
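A small Python sketch of the z-score recipe for r described above, using hypothetical bear head-width and weight data. This version standardizes with the population SD and averages over n; standardizing with the sample SD and dividing by n - 1 gives the same r.

```python
import numpy as np

# Hypothetical paired data: head width (cm) and weight (kg) for six bears
x = np.array([10.0, 12.5, 13.0, 14.2, 15.5, 16.1])
y = np.array([55.0, 68.0, 74.0, 80.0, 95.0, 102.0])

z_x = (x - x.mean()) / x.std()   # z-scores (population SD, i.e., divide by n)
z_y = (y - y.mean()) / y.std()

r = np.mean(z_x * z_y)           # average of the products of z-scores
print(round(r, 3))               # matches np.corrcoef(x, y)[0, 1]
```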
Correlation describes the direction and strength of
a linear relationship between two variables.
(Lecture 3.1)
Correlation coefficient (r) is a numerical measure
of linear association.
Always between -1 and 1.
- r = +1 or -1: perfect linear association
- r = 0: no linear association
(Lecture 3.1)
Calculating r involves standardizing both variables
(z-scores), multiplying the z-scores, and averaging the products.
(Lecture 3.1)
Summarizing linear associations involves reporting the intercept, slope, predictions, the correlation coefficient, SD_x and SD_y, and an interpretation in context.
(Lecture 3.1)
Predictions are made using the regression line to
estimate the mean value of y for a given x.
(Lecture 3.1)
The regression line is the line of best fit. It
passes through the point \((\bar{x},
\bar{y})\) and is computed as:
\[
\hat{y} = \text{intercept} + \text{slope} \cdot x
\]
Slope = \(r \cdot
\frac{\text{SD}_y}{\text{SD}_x}\)
(Lecture 3.1)
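A minimal Python sketch that builds the line from summary statistics, using the slope formula above and the fact that the line passes through \((\bar{x}, \bar{y})\); the SAT/GPA numbers are invented and will not reproduce the exact slope quoted in lecture:

```python
import numpy as np

# Hypothetical data: SAT score (x) and first-year GPA (y)
x = np.array([1050, 1130, 1200, 1260, 1340, 1420])
y = np.array([2.8, 3.0, 3.1, 3.3, 3.5, 3.7])

r = np.corrcoef(x, y)[0, 1]
slope = r * y.std(ddof=1) / x.std(ddof=1)   # slope = r * SD_y / SD_x
intercept = y.mean() - slope * x.mean()     # forces the line through (x-bar, y-bar)

def predict(x_new):
    """Predicted (average) y for a given x."""
    return intercept + slope * x_new

print(f"y-hat = {intercept:.3f} + {slope:.5f} * x")
print(f"predicted GPA at SAT 1300: {predict(1300):.2f}")
print(f"r^2 = {r**2:.3f}   # share of GPA variability explained by SAT")
```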
Slope tells us how the predicted value of y changes
for a one-unit increase in x.
Example: Each additional SAT point might increase predicted GPA by
0.0017.
(Lecture 3.1)
Intercept is the predicted value of y when x =
0.
Note: Interpret with caution — x = 0 may be outside the observed
range.
(Lecture 3.1)
Extrapolation is using the regression line to
predict y-values beyond the range of observed x-values.
This can lead to invalid conclusions.
(Lecture 3.1)
Evaluating regression involves checking linearity,
goodness of fit, and explanatory power.
(Lecture 3.1)
r-squared (r²) is the coefficient of
determination.
- Represents the proportion of variability in y explained by x.
- r² = 1: perfect linear fit
- r² = 0: regression line explains nothing
(Lecture 3.1)
Randomness and Chance refer to unpredictability in
individual outcomes, though patterns may emerge in the long run.
Example: Whether a coin lands heads or tails on a single toss is
random, but over many tosses, about half will be heads.
(Lecture 3.2)
Randomness and Statistics interact through two core
ideas: random assignment in experiments and random sampling in
surveys.
This allows for valid generalizations and causal
inferences.
(Lecture 3.2)
Random Assignment is assigning subjects to groups by
chance to ensure differences between groups are minimal.
(Lecture 3.2)
Estimating Probability can be done through
theoretical reasoning, simulations, or observations.
(Lecture 3.2)
Theoretical Probability is determined through
mathematical logic and assumptions.
Example: The probability of rolling a 3 on a fair die is
1/6.
(Lecture 3.2)
Simulated Probability estimates chance via repeated
simulation of a process.
Example: Flipping a virtual coin 1,000 times to estimate the chance
of heads.
(Lecture 3.2)
Observational Probability is derived from actual
real-world data outcomes.
Example: 21% of surveyed individuals reported having fired a
gun.
(Lecture 3.2)
Probability is a measure of the likelihood of an
event, ranging from 0 (impossible) to 1 (certain).
(Lecture 3.2)
To find a probability, either estimate it through simulation or reason it out via math and logic.
(Lecture 3.2)
Simulations use repeated random events to estimate
probabilities.
(Lecture 3.2)
Law of Large Numbers states that as repetitions of
an experiment increase, the observed frequency approaches the true
probability.
(Lecture 3.2)
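A short Python simulation illustrating the Law of Large Numbers: the running proportion of heads in repeated fair-coin flips wanders early on but settles near 0.5 (the number of flips and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=3)
flips = rng.integers(0, 2, size=10_000)   # 1 = heads, 0 = tails
running_prop = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>6} flips: proportion of heads = {running_prop[n - 1]:.3f}")
```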
Sample Space is the set of all possible outcomes of
a random process.
Example: For flipping two coins, the sample space is {HH, HT, TH,
TT}.
(Lecture 3.2)
Event is a specific outcome or set of outcomes from
the sample space.
Example: Getting two tails when flipping two coins.
(Lecture 3.2)
Probability of an Event is the long-run relative
frequency over infinitely many repetitions.
(Lecture 3.2)
Equally Likely Outcomes: If all outcomes have the
same chance, P(event) = # favorable outcomes / # total outcomes.
(Lecture 3.2)
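A tiny Python sketch that enumerates the two-coin sample space from the earlier example and applies the favorable-over-total formula:

```python
from itertools import product

sample_space = list(product("HT", repeat=2))   # [('H','H'), ('H','T'), ('T','H'), ('T','T')]
event = [outcome for outcome in sample_space if outcome == ("T", "T")]  # two tails

p = len(event) / len(sample_space)             # favorable outcomes / total outcomes
print(p)                                       # 1/4 = 0.25
```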
Probability Distribution shows all possible outcomes
and the probabilities for each.
(Lecture 3.2)
Mutually Exclusive events cannot happen at the same
time.
Example: A coin cannot land heads and tails
simultaneously.
(Lecture 3.2)
Conditional Probability is the probability of an
event given that another event has occurred.
Notation: P(A | B)
Example: P(shoots gun | person is conservative) = 0.40
(Lecture 3.2)
Conditional Probabilities refer to the chance of an
event occurring given that another event has already occurred.
(Lecture 4.1)
Mutually Exclusive events cannot happen at the same
time.
If A and B are mutually exclusive, then:
\[
P(A \text{ and } B) = 0
\]
(Lecture 4.1)
Conditional Probability is the probability of event
A given that B has occurred:
\[
P(A | B) = \frac{P(A \text{ and } B)}{P(B)}
\]
(Lecture 4.1)
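A small worked Python sketch of this formula; the counts below are made up, chosen only so the answer matches the 0.40 figure quoted in Lecture 3.2:

```python
# Hypothetical survey counts
n_total = 200
n_conservative = 50              # event B: person is conservative
n_fired_and_conservative = 20    # event A and B: has fired a gun AND is conservative

p_B = n_conservative / n_total                  # P(B) = 0.25
p_A_and_B = n_fired_and_conservative / n_total  # P(A and B) = 0.10

p_A_given_B = p_A_and_B / p_B                   # P(A | B) = 0.10 / 0.25
print(p_A_given_B)                              # 0.40
```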
Complement of an Event is the probability that the
event does not occur.
\[
P(\text{not A}) = 1 - P(A)
\]
(Lecture 4.1)
Associations exist if the probability of A is
different depending on whether B occurs:
If \(P(A) \neq P(A | B)\), then A and B
are associated.
Conversely, if \(P(B) = P(B | A)\),
then knowing A tells us nothing about B.
(Lecture 4.1)
Independent Events occur when one event does not
affect the probability of another.
A and B are independent if:
\[
P(A | B) = P(A) \quad \text{and} \quad P(B | A) = P(B)
\]
(Lecture 4.1)
Statistical Approaches to determine independence include:
- Theoretical: Based on logic and probability rules.
- Empirical (Data-Based): Based on observed differences in conditional probabilities.
(Lecture 4.1)
Building a Simulation involves six key steps:
1. Identify the random action and the probability of a successful outcome.
2. Determine how to simulate the random action.
3. Define the event of interest.
4. Explain the procedure for one trial.
5. Run a trial and record the result.
6. Repeat many times (≥100) and compute the empirical probability as:
\[
\text{Empirical Probability} = \frac{\text{# times event occurred}}{\text{# of trials}}
\]
(Lecture 4.1)
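A minimal end-to-end Python sketch of the six steps for a hypothetical question, "What is the chance of getting at least two heads in three fair coin flips?" (the event, trial count, and seed are illustrative):

```python
import random

random.seed(42)

def one_trial():
    """Steps 1-4: flip a fair coin three times; the event is 'at least two heads'."""
    flips = [random.choice(["H", "T"]) for _ in range(3)]
    return flips.count("H") >= 2

n_trials = 1_000                                        # step 6: repeat many times (>= 100)
successes = sum(one_trial() for _ in range(n_trials))   # step 5, repeated

empirical_probability = successes / n_trials            # (# times event occurred) / (# of trials)
print(empirical_probability)                            # should be near the exact answer, 0.5
```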
Law of Large Numbers states that in the long run,
relative frequencies approach the true probability. In the short run,
these frequencies can vary dramatically.
(Lecture 4.2)
Probability Models summarize the likelihood of
different outcomes for phenomena that behave predictably over
time.
(Lecture 4.2)
Probability Distributions show:
1. The values of outcomes (sample space)
2. Their long-run relative frequencies (probabilities)
(Lecture 4.2)
Numerical Outcomes can be:
- Discrete: Countable outcomes (e.g., number of heads)
- Continuous: Measured outcomes that can take on any value in an interval
(Lecture 4.2)
The Binomial Model is used when:
- There are two possible outcomes: success/failure
- Each trial is independent
- The number of trials is fixed (n)
- The probability of success (p) is constant
(Lecture 4.2)
Binomial Distribution describes the probability of
observing a certain number of successes in n independent trials, each
with probability p of success.
(Lecture 4.2)
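A short Python sketch of the standard binomial probability formula, P(X = k) = C(n, k) · p^k · (1 - p)^(n - k), implemented with math.comb; the values of n, p, and k are arbitrary:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials, each with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# e.g., the chance of exactly 3 heads in 10 flips of a fair coin
print(binom_pmf(3, n=10, p=0.5))                         # about 0.117

# The probabilities over all possible k sum to 1, as any distribution must
print(sum(binom_pmf(k, n=10, p=0.5) for k in range(11)))
```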
Normal Distribution is a continuous distribution
shaped like a bell (also called Gaussian).
It represents probabilities as areas under a curve.
(Lecture 4.2)
Continuous Probability Distributions cannot list all values due to their infinite nature.
They are represented:
- Graphically: as area under a curve
- Mathematically: using integrals
(Lecture 4.2)
Discrete Probability Distributions are used when outcomes are countable.
They can be displayed:
- In a table
- With a graph
- With a formula
(Lecture 4.2)
Normal Distribution is symmetric and
bell-shaped.
It models many natural processes and is used extensively in statistical
inference.
(Lecture 4.2)
Empirical Rule for normal distributions:
- ~68% within 1 SD of the mean
- ~95% within 2 SDs
- ~99.7% within 3 SDs
(Lecture 4.2)
Numerical Variables represent quantities (e.g.,
height, income).
Categorical Variables represent groups or categories
(e.g., gender, brand).
(Lecture 5.1)
Binomial Distribution models the number of successes
in a fixed number of independent trials, each with the same probability
of success.
(Lecture 5.1)
Normal Distribution is a continuous, bell-shaped
distribution symmetric about its mean.
(Lecture 5.1)
A Confounding Factor is an alternative explanation
for an observed association.
(Lecture 5.1)
Z Score standardizes a value by expressing it in
terms of standard deviations from the mean.
\[
z = \frac{x - \bar{x}}{s}
\]
(Lecture 5.1)
Empirical Rule (for normal distributions):
- 68% within 1 SD
- 95% within 2 SDs
- 99.7% within 3 SDs
(Lecture 5.1)
r measures strength and direction of linear
association.
- r ∈ [-1, 1]
- r > 0: positive trend
- r < 0: negative trend
- r = 0: no linear trend
(Lecture 5.1)
r² measures the proportion of variability in y
explained by x in a linear model.
(Lecture 5.1)
Linear Regression fits a line to predict y from
x.
- Interpret line: estimates average y for a given x
- Slope: Change in predicted y for a one-unit change in x
- Intercept: Predicted y when x = 0
- Evaluate strength: use r and r²
(Lecture 5.1)
Predicted Values are the average y-values given by the regression line for any x.
The line passes through the point \((\bar{x}, \bar{y})\) and has slope \(\text{Slope} = r \cdot \frac{\text{SD}_y}{\text{SD}_x}\).
(Lecture 5.1)
Slope is the expected change in y for each one-unit
increase in x. \(\text{Slope} = r \cdot
\frac{\text{SD}_y}{\text{SD}_x}\)
(Lecture 5.1)
Basic Probability Rules include:
- P(not A) = 1 - P(A)
- P(A or B) = P(A) + P(B) - P(A and B)
- P(A and B) = P(A) × P(B) if A and B are independent
(Lecture 5.1)
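A quick numerical check of these rules in Python for a hypothetical fair die, with A = "roll is even" and B = "roll is at least 4":

```python
from fractions import Fraction

outcomes = range(1, 7)                        # fair six-sided die, equally likely outcomes
A = {x for x in outcomes if x % 2 == 0}       # {2, 4, 6}
B = {x for x in outcomes if x >= 4}           # {4, 5, 6}

def prob(event):
    return Fraction(len(event), 6)

# Complement rule: P(not A) = 1 - P(A)
print(1 - prob(A), prob(set(outcomes) - A))           # 1/2 and 1/2

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
print(prob(A | B), prob(A) + prob(B) - prob(A & B))   # 2/3 and 2/3

# Multiplication rule applies only if independent; here P(A and B) = 1/3 but P(A)*P(B) = 1/4,
# so A and B are NOT independent and the shortcut would be wrong
print(prob(A & B), prob(A) * prob(B))
```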
Empirical Probability is based on observed data or
simulations:
\[
P(A) = \frac{\text{# of times A occurred}}{\text{total trials}}
\]
(Lecture 5.1)