PROBABILITY DISTRIBUTIONS

WEEK 11 ASSIGNMENT

Frenkhy Tonga Retang

7 Introduction to Probability Distributions

Welcome to the Probability Distributions assignment for Week 11. This presentation covers fundamental concepts in probability distributions, including continuous random variables, sampling distributions, the Central Limit Theorem, and sample proportions.

In this assignment, we will explore how probability theory forms the foundation of statistical inference and decision-making in data science. Understanding these concepts is crucial for analyzing data, making predictions, and drawing meaningful conclusions from statistical analyses.

The topics covered in this module include:

  • Continuous random variables, probability density functions (PDF), and cumulative distribution functions (CDF)
  • Sampling distributions and the standard error
  • The Central Limit Theorem
  • Sample proportions and the normal approximation

7.1 Continuous Random Variables

Introduction to Continuous Random Variables

A continuous random variable is a variable that can take on any value within a specified range or interval. Unlike discrete random variables that take on countable values, continuous random variables can assume infinitely many values within their range. Examples include height, weight, temperature, time, and distance measurements.

The key characteristic of continuous random variables is that the probability of the variable taking on any single exact value is essentially zero. Instead, we calculate probabilities over intervals or ranges of values.

Key Differences: Discrete vs Continuous

  • Discrete: Countable outcomes (e.g., number of students: 10, 11, 12)
  • Continuous: Uncountable outcomes (e.g., height: 170.5 cm, 170.51 cm, 170.512 cm…)
  • Discrete: P(X = x) > 0 for specific values
  • Continuous: P(X = x) = 0 for any specific value
  • Discrete: Probability Mass Function (PMF)
  • Continuous: Probability Density Function (PDF)

7.1.1 Random Variable

Formal Definition

A random variable is a function that assigns a numerical value to each outcome in the sample space of a random experiment. For continuous random variables, these values form a continuous range.

Notation: Random variables are typically denoted by capital letters (X, Y, Z), while their specific values are denoted by lowercase letters (x, y, z).

Examples of Continuous Random Variables

Example 1: Temperature

Let X = temperature in Celsius in Cikarang. X can take any value in the range, say [20°C, 35°C]. Possible values: 25.3°C, 28.75°C, 31.256°C, etc.

Example 2: Student Heights

Let Y = height of university students in centimeters. Y might range from 150 cm to 195 cm, with values like 165.5 cm, 172.83 cm, etc.

Example 3: Service Time

Let T = time (in minutes) to serve a customer at a bank. T ≥ 0, with values like 2.5 minutes, 5.75 minutes, 10.123 minutes, etc.

7.1.2 Probability Density Function (PDF)

Definition of PDF

The Probability Density Function (PDF), denoted as f(x), is a function that describes the relative likelihood of a continuous random variable taking on a given value. The PDF must satisfy two conditions:

  1. Non-negativity: f(x) ≥ 0 for all x
  2. Total Area = 1: ∫_{-∞}^{∞} f(x) dx = 1

Important: The value f(x) itself is NOT a probability. It’s a density. Probabilities are calculated as areas under the PDF curve.

Properties of PDF:

1. f(x) ≥ 0 for all x
2. ∫_{-∞}^{∞} f(x) dx = 1
3. P(X = a) = 0 for any specific value a
4. P(a ≤ X ≤ b) = ∫_a^b f(x) dx

Example: Uniform Distribution

Consider a uniform distribution on the interval [0, 10]. The PDF is:

f(x) = 1/10 for 0 ≤ x ≤ 10

f(x) = 0 otherwise

Verification:

∫_0^{10} (1/10) dx = (1/10) × 10 = 1 ✓

The total area under the curve equals 1, satisfying the PDF condition.
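
As a quick numerical check, the following minimal Python sketch (assuming scipy is available, as listed in the software resources at the end) integrates this PDF over its support:

    # Verify that the Uniform[0, 10] PDF integrates to 1 (illustrative sketch).
    from scipy.integrate import quad

    def f(x):
        # Density is the constant 1/10 on [0, 10] and 0 elsewhere.
        return 1/10 if 0 <= x <= 10 else 0.0

    total_area, _ = quad(f, 0, 10)  # integrate over the support
    print(total_area)               # ≈ 1.0, satisfying the PDF condition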

Example: Exponential Distribution

The exponential distribution with rate parameter λ = 0.5 has PDF:

f(x) = 0.5e^(-0.5x) for x ≥ 0

f(x) = 0 for x < 0

This is commonly used to model waiting times, such as the time until the next customer arrives.
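
A minimal sketch of this distribution using scipy.stats (note that scipy parameterizes the exponential by scale = 1/λ, so λ = 0.5 corresponds to scale = 2):

    # Exponential PDF with rate λ = 0.5 via scipy.stats.
    from scipy import stats

    X = stats.expon(scale=2)  # scale = 1/λ = 2
    print(X.pdf(0))           # 0.5: the density at x = 0 equals λ
    print(X.pdf(2))           # 0.5·e^(-1) ≈ 0.184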

7.1.3 Probability on an Interval

Calculating Probabilities Over Intervals

For continuous random variables, we calculate probabilities over intervals by finding the area under the PDF curve between two points. This is computed using integration:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

Important properties:

  • P(a < X < b) = P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b)
  • The probability is the same whether endpoints are included or excluded
  • P(X = a) = 0 for any specific value a

Key Probability Formulas:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

P(X ≥ a) = ∫_a^{∞} f(x) dx = 1 - P(X < a)

P(X ≤ b) = ∫_{-∞}^{b} f(x) dx

P(X = a) = 0 (for any specific value)

Example: Uniform Distribution Probabilities

Given X ~ Uniform[0, 10], find various probabilities:

1. P(3 ≤ X ≤ 7):

P(3 ≤ X ≤ 7) = ∫_3^7 (1/10) dx = (1/10) × (7-3) = 4/10 = 0.4

2. P(X > 6):

P(X > 6) = ∫_6^{10} (1/10) dx = (1/10) × (10-6) = 4/10 = 0.4

3. P(X < 2.5):

P(X < 2.5) = ∫_0^{2.5} (1/10) dx = (1/10) × 2.5 = 0.25

4. P(X = 5):

P(X = 5) = 0 (probability of any exact value is zero)
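
All four results can be reproduced from the CDF of scipy.stats.uniform; a minimal sketch, assuming the library is available:

    # Interval probabilities for X ~ Uniform[0, 10].
    from scipy import stats

    X = stats.uniform(loc=0, scale=10)  # support is [loc, loc + scale]
    print(X.cdf(7) - X.cdf(3))          # P(3 ≤ X ≤ 7) = 0.4
    print(1 - X.cdf(6))                 # P(X > 6)     = 0.4
    print(X.cdf(2.5))                   # P(X < 2.5)   = 0.25
    # P(X = 5) is exactly 0: only intervals carry probability for continuous X.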

Example: Normal Distribution

Suppose X ~ Normal(μ = 100, σ = 15), representing IQ scores.

Find P(85 ≤ X ≤ 115):

This interval is within one standard deviation of the mean (100 ± 15).

Using the empirical rule or standard normal table:

P(85 ≤ X ≤ 115) ≈ 0.683 or 68.3%

This means approximately 68.3% of the population has IQ scores between 85 and 115.
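
A one-line check of this value with scipy.stats, as a sketch:

    # P(85 ≤ X ≤ 115) for X ~ Normal(μ = 100, σ = 15).
    from scipy import stats

    X = stats.norm(loc=100, scale=15)
    print(X.cdf(115) - X.cdf(85))  # ≈ 0.6827, matching the empirical rule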

7.1.4 Cumulative Distribution Function (CDF)

Definition of CDF

The Cumulative Distribution Function (CDF), denoted as F(x), gives the probability that the random variable X takes on a value less than or equal to x:

F(x) = P(X ≤ x) = ∫_{-∞}^{x} f(t) dt

The CDF is the integral (area) of the PDF from negative infinity up to x.

Properties of CDF:

1. F(x) is non-decreasing: if x₁ < x₂, then F(x₁) ≤ F(x₂)
2. 0 ≤ F(x) ≤ 1 for all x
3. lim_{x→-∞} F(x) = 0
4. lim_{x→∞} F(x) = 1
5. P(a < X ≤ b) = F(b) - F(a)
6. f(x) = dF(x)/dx (PDF is derivative of CDF)

Example: Uniform Distribution CDF

For X ~ Uniform[0, 10], the CDF is:

F(x) = 0 for x < 0

F(x) = x/10 for 0 ≤ x ≤ 10

F(x) = 1 for x > 10

Calculations:

F(5) = 5/10 = 0.5 → P(X ≤ 5) = 0.5

F(7.5) = 7.5/10 = 0.75 → P(X ≤ 7.5) = 0.75

P(3 < X ≤ 8) = F(8) - F(3) = 0.8 - 0.3 = 0.5

Example: Exponential Distribution CDF

For exponential distribution with λ = 0.5:

F(x) = 1 - e^(-0.5x) for x ≥ 0

F(x) = 0 for x < 0

Interpretation:

F(2) = 1 - e^(-1) ≈ 0.632 → 63.2% probability that waiting time is ≤ 2 units

P(X > 3) = 1 - F(3) = e^(-1.5) ≈ 0.223 → 22.3% probability of waiting more than 3 units
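
Both values can be verified with scipy.stats; a minimal sketch (again with scale = 1/λ = 2):

    # CDF and survival function of the Exponential(λ = 0.5) distribution.
    from scipy import stats

    X = stats.expon(scale=2)
    print(X.cdf(2))  # F(2) = 1 - e^(-1) ≈ 0.632
    print(X.sf(3))   # P(X > 3) = e^(-1.5) ≈ 0.223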

Relationship Between PDF and CDF

  • CDF is the integral of PDF: F(x) = ∫_{-∞}^{x} f(t) dt
  • PDF is the derivative of CDF: f(x) = dF(x)/dx
  • CDF is cumulative: It accumulates probability from left to right
  • PDF shows density: It shows where probability is concentrated
  • CDF ranges [0,1]: Always between 0 and 1
  • PDF can exceed 1: f(x) can be greater than 1 (it’s a density, not probability)
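
The derivative relationship can be checked numerically; a minimal sketch that applies a central difference to the exponential CDF from the earlier example:

    # Numerically verify f(x) = dF(x)/dx for the Exponential(λ = 0.5) example.
    from scipy import stats

    X = stats.expon(scale=2)  # scale = 1/λ
    x, h = 1.0, 1e-6
    deriv = (X.cdf(x + h) - X.cdf(x - h)) / (2 * h)  # central difference of the CDF
    print(deriv, X.pdf(x))                           # the two values agree closely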

7.2 Sampling Distributions

Introduction to Sampling Distributions

A sampling distribution is the probability distribution of a statistic (such as the sample mean, sample proportion, or sample variance) obtained through repeated sampling from a population. It describes how the statistic varies from sample to sample.

Understanding sampling distributions is crucial because:

  • We rarely have access to entire populations
  • We use sample statistics to estimate population parameters
  • We need to quantify uncertainty in our estimates
  • It forms the foundation for statistical inference

Key Terminology

Population Parameter: A numerical characteristic of the population (denoted by Greek letters: μ, σ, π)

Sample Statistic: A numerical characteristic of a sample (denoted by Roman letters: x̄, s, p̂)

Sampling Distribution: The distribution of a sample statistic over all possible samples of size n

Standard Error: The standard deviation of a sampling distribution

Example: Sampling Distribution of Sample Mean

Suppose we have a population of student heights with μ = 170 cm and σ = 10 cm.

We repeatedly take samples of n = 25 students and calculate x̄ for each sample.

Results after many samples:

  • Sample 1: x̄₁ = 168.5 cm
  • Sample 2: x̄₂ = 171.2 cm
  • Sample 3: x̄₃ = 169.8 cm
  • Sample 4: x̄₄ = 170.5 cm
  • … (continue for many samples)

The distribution of all these x̄ values is the sampling distribution of the mean.
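
A short simulation makes this concrete; the sketch below assumes numpy and, for simplicity, a normally distributed population of heights:

    # Simulate the sampling distribution of x̄ for μ = 170 cm, σ = 10 cm, n = 25.
    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.normal(170, 10, size=(10_000, 25))  # 10,000 samples of n = 25
    means = samples.mean(axis=1)                      # one x̄ per sample

    print(means.mean())  # ≈ 170, centered at μ
    print(means.std())   # ≈ 2.0, close to σ/√n = 10/√25 = 2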

Properties of Sampling Distribution of the Mean

For Sample Mean (x̄):

Mean of sampling distribution: μ_x̄ = μ
(unbiased estimator)

Standard Error: SE = σ_x̄ = σ/√n
(decreases as n increases)

Shape: Approaches normal as n increases
(Central Limit Theorem)

Important Observations

  • Center: The sampling distribution is centered at the population mean μ
  • Spread: The spread decreases as sample size increases (σ/√n)
  • Shape: Becomes more normal as n increases, regardless of population shape
  • Variability: Sample means vary less than individual observations

Example: Effect of Sample Size

Population: μ = 100, σ = 15

For n = 2:

SE = 15/√2 ≈ 10.6

Sample means vary widely

For n = 5:

SE = 15/√5 ≈ 6.7

Sample means less variable

For n = 10:

SE = 15/√10 ≈ 4.7

Sample means more concentrated

For n = 30:

SE = 15/√30 ≈ 2.7

Sample means tightly clustered around μ

For n = 100:

SE = 15/√100 = 1.5

Sample means very close to μ
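
A minimal sketch that reproduces these standard errors:

    # SE = σ/√n shrinks like 1/√n (population: μ = 100, σ = 15).
    import math

    for n in (2, 5, 10, 30, 100):
        print(n, 15 / math.sqrt(n))  # 10.6, 6.7, 4.7, 2.7, 1.5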

7.3 Central Limit Theorem

The Central Limit Theorem (CLT)

The Central Limit Theorem is one of the most important theorems in statistics. It states that:

For a random sample of size n drawn from any population with mean μ and standard deviation σ, as n becomes large, the sampling distribution of the sample mean x̄ approaches a normal distribution with mean μ and standard deviation σ/√n, regardless of the shape of the original population distribution.

Central Limit Theorem:

As n → ∞:
x̄ ~ N(μ, σ²/n)

Or equivalently:
Z = (x̄ - μ)/(σ/√n) ~ N(0, 1)

Rule of Thumb:
n ≥ 30 is generally sufficient for CLT to apply
For symmetric populations, smaller n may suffice
For highly skewed populations, larger n may be needed

Why CLT is Powerful

  • Universality: Works for ANY population distribution (normal, skewed, uniform, etc.)
  • Predictability: We can predict behavior of sample means even without knowing population distribution
  • Foundation for Inference: Enables hypothesis testing and confidence intervals
  • Normal Approximation: Allows use of normal distribution tables and methods
  • Quality Control: Basis for control charts and process monitoring

Visual Demonstration of CLT

Scenario: Population is highly skewed (exponential distribution)

n = 2: Sampling distribution still strongly right-skewed

n = 5: Beginning to smooth out

n = 10: Approaching normal shape

n = 30: Very close to normal

n = 100: Essentially indistinguishable from normal

This demonstrates CLT: as n increases, sampling distribution becomes increasingly normal regardless of population shape.
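
A simulation sketch of this demonstration, assuming numpy and scipy (sample skewness near 0 indicates an approximately normal shape):

    # CLT demo: sample means from a skewed exponential population.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    for n in (2, 5, 10, 30, 100):
        means = rng.exponential(scale=2, size=(10_000, n)).mean(axis=1)
        # Skewness of the sample means decays toward 0 as n grows.
        print(n, stats.skew(means))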

Example: Application of CLT

A factory produces bolts with mean length μ = 5 cm and standard deviation σ = 0.2 cm. A quality inspector takes a random sample of n = 36 bolts.

Question: What is the probability that the sample mean length is between 4.95 and 5.05 cm?

Solution:

By CLT, x̄ ~ N(5, 0.2²/36)

SE = σ/√n = 0.2/√36 = 0.2/6 ≈ 0.0333

Standardize:

Z₁ = (4.95 - 5)/0.0333 ≈ -1.50

Z₂ = (5.05 - 5)/0.0333 ≈ 1.50

P(-1.50 < Z < 1.50) ≈ 0.8664 or 86.64%

Interpretation: There’s an 86.64% chance the sample mean will fall between 4.95 and 5.05 cm.
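
A minimal sketch of the same calculation with scipy.stats:

    # P(4.95 ≤ x̄ ≤ 5.05) where, by the CLT, x̄ ~ N(5, (0.2/√36)²).
    import math
    from scipy import stats

    se = 0.2 / math.sqrt(36)                # ≈ 0.0333
    xbar = stats.norm(loc=5, scale=se)
    print(xbar.cdf(5.05) - xbar.cdf(4.95))  # ≈ 0.8664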

7.4 Sample Proportion

Definition of Sample Proportion

The sample proportion p̂ (read as “p-hat”) is the fraction of individuals in a sample that possess a certain characteristic of interest. If we have a sample of size n and X individuals have the characteristic, then:

p̂ = X/n

The sample proportion is used to estimate the population proportion π.

Sampling Distribution of p̂

Properties of Sample Proportion Distribution

When sampling from a population with proportion π:

  • Mean: E[p̂] = π (unbiased estimator)
  • Standard Deviation (Standard Error): σ_p̂ = √[π(1-π)/n]

Sampling Distribution of p̂:

Mean: μ_p̂ = π

Standard Error: SE = √[π(1-π)/n]

Normal Approximation (when conditions met):
p̂ ~ N(π, π(1-π)/n)

Z = (p̂ - π)/√[π(1-π)/n] ~ N(0, 1)

Conditions for Normal Approximation:
nπ ≥ 10 AND n(1-π) ≥ 10

Example: Political Survey

In a population, 60% support a certain policy (π = 0.6). A random sample of n = 100 people is taken.

Question 1: What is the probability that between 55% and 65% of the sample supports the policy?

Solution:

Check conditions: nπ = 100(0.6) = 60 ≥ 10 ✓

n(1-π) = 100(0.4) = 40 ≥ 10 ✓

SE = √[0.6(0.4)/100] = √0.0024 ≈ 0.049

Z₁ = (0.55 - 0.60)/0.049 ≈ -1.02

Z₂ = (0.65 - 0.60)/0.049 ≈ 1.02

P(-1.02 < Z < 1.02) ≈ 0.6922 or 69.22%
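
A minimal sketch of this normal-approximation calculation:

    # P(0.55 ≤ p̂ ≤ 0.65) for π = 0.6, n = 100.
    import math
    from scipy import stats

    pi, n = 0.6, 100
    se = math.sqrt(pi * (1 - pi) / n)         # ≈ 0.049
    p_hat = stats.norm(loc=pi, scale=se)
    print(p_hat.cdf(0.65) - p_hat.cdf(0.55))  # ≈ 0.69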

Example: Quality Control

A manufacturer claims defect rate is 5% (π = 0.05). Inspector samples 200 items.

Question: What’s the probability of finding more than 8% defects in the sample?

Solution:

Check conditions: nπ = 200(0.05) = 10 ≥ 10 ✓

n(1-π) = 200(0.95) = 190 ≥ 10 ✓

SE = √[0.05(0.95)/200] = √0.0002375 ≈ 0.0154

P(p̂ > 0.08) = P(Z > (0.08-0.05)/0.0154)

= P(Z > 1.95)

≈ 0.0256 or 2.56%

Interpretation: If true defect rate is 5%, there’s only 2.56% chance of finding 8% or more defects in a sample of 200. This would be unusual and might suggest the true defect rate is higher.
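
As a cross-check, the approximate tail probability can be compared with the exact binomial tail; a minimal sketch (sf gives the upper-tail probability in scipy.stats):

    # Normal approximation vs. exact binomial for P(p̂ > 0.08), n = 200, π = 0.05.
    import math
    from scipy import stats

    se = math.sqrt(0.05 * 0.95 / 200)         # ≈ 0.0154
    print(stats.norm.sf((0.08 - 0.05) / se))  # normal approximation ≈ 0.026
    print(stats.binom.sf(16, 200, 0.05))      # exact P(X ≥ 17), i.e. p̂ > 0.08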

7.5 Review of Sampling Distributions

Key Concepts Summary

  • Continuous Random Variables: Can take any value in an interval; probabilities calculated over ranges, not at specific points
  • PDF (Probability Density Function): Describes relative likelihood; area under curve = probability
  • CDF (Cumulative Distribution Function): P(X ≤ x); integral of PDF; always non-decreasing
  • Sampling Distribution: Distribution of a statistic over all possible samples
  • Central Limit Theorem: Sample means approach normal distribution as n increases, regardless of population shape
  • Standard Error: Standard deviation of sampling distribution; measures variability of statistic
  • Sample Proportion: Used to estimate population proportion; follows approximately normal distribution when conditions met
  • Essential Formulas:

    For Sample Mean:
    μ_x̄ = μ
    SE = σ/√n
    Z = (x̄ - μ)/(σ/√n)

    For Sample Proportion:
    μ_p̂ = π
    SE = √[π(1-π)/n]
    Z = (p̂ - π)/√[π(1-π)/n]

    Conditions:
    CLT: n ≥ 30 (general rule)
    Normal approx for p̂: nπ ≥ 10 AND n(1-π) ≥ 10

Common Mistakes to Avoid

  • Confusing PDF value with probability: f(x) is NOT a probability; it’s a density
  • Thinking P(X = a) > 0 for continuous variables: Always P(X = a) = 0
  • Using wrong standard error: For means use σ/√n; for proportions use √[π(1-π)/n]
  • Ignoring conditions: Check conditions before using normal approximation
  • Confusing σ and SE: σ is population SD; SE is SD of sampling distribution
  • Misapplying CLT: CLT applies to sample means, not individual observations

Comprehensive Example

A university wants to estimate average study hours per week. From past data: μ = 20 hours, σ = 5 hours.

Scenario 1: Random sample of n = 25 students

SE = 5/√25 = 1 hour

P(19 ≤ x̄ ≤ 21) = P(-1 ≤ Z ≤ 1) ≈ 0.68 or 68%

Scenario 2: Increase sample to n = 100

SE = 5/√100 = 0.5 hour

P(19 ≤ x̄ ≤ 21) = P(-2 ≤ Z ≤ 2) ≈ 0.95 or 95%

Observation: Larger sample size → smaller SE → more precise estimate → higher confidence in narrower range
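
A minimal sketch reproducing both scenarios:

    # Precision gain from a larger sample: P(19 ≤ x̄ ≤ 21) for μ = 20, σ = 5.
    import math
    from scipy import stats

    for n in (25, 100):
        se = 5 / math.sqrt(n)
        xbar = stats.norm(loc=20, scale=se)
        print(n, se, xbar.cdf(21) - xbar.cdf(19))  # n = 25 → ≈ 0.68; n = 100 → ≈ 0.95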

References

Primary Sources

1. Walpole, R. E., Myers, R. H., Myers, S. L., & Ye, K. (2016). Probability & Statistics for Engineers & Scientists (9th ed.). Pearson Education.
2. Ross, S. M. (2014). Introduction to Probability and Statistics for Engineers and Scientists (5th ed.). Academic Press.
3. Montgomery, D. C., & Runger, G. C. (2018). Applied Statistics and Probability for Engineers (7th ed.). John Wiley & Sons.
4. Devore, J. L. (2015). Probability and Statistics for Engineering and the Sciences (9th ed.). Cengage Learning.

Additional Resources

1. Course Lecture Notes: Essential of Probability - Week 11, Universitas Pelita Harapan, 2024
2. Online Resources:
  • Khan Academy - Probability and Statistics
  • StatQuest with Josh Starmer - YouTube Channel
  • MIT OpenCourseWare - Probability and Statistics
3. Statistical Software Documentation:
  • R Documentation (stats package)
  • Python SciPy.stats documentation

Note on Data and Examples: All numerical examples and scenarios presented in this document are for educational purposes. While based on realistic situations, specific values may be hypothetical to illustrate statistical concepts clearly.

Learning Outcomes

Upon completing this module, students should be able to:

  • Distinguish between discrete and continuous random variables
  • Understand and work with probability density functions (PDF) and cumulative distribution functions (CDF)
  • Calculate probabilities for continuous distributions
  • Explain the concept of sampling distributions
  • Apply the Central Limit Theorem to real-world problems
  • Work with sampling distributions of proportions
  • Understand the relationship between sample size and precision of estimates
  • Recognize when to apply normal approximations

Final Remarks

  • Understanding sampling distributions is fundamental to statistical inference
  • The Central Limit Theorem bridges probability theory and statistical practice
  • Larger sample sizes generally lead to more precise estimates
  • Always verify conditions before applying approximations
  • Visual representations (graphs, charts) aid in understanding distributions
  • Practice with diverse examples solidifies conceptual understanding
  • These concepts form the foundation for hypothesis testing and confidence intervals (covered in future modules)