Week 11 Assignment ~ Probability Distributions

Paskalis Farelnata Zamasi

Student ID (NIM): 52250043

Student Major in Data Science at
Institut Teknologi Sains Bandung

R Programming
Data Science
Statistics
Supervising Lecturer
Bakti Siregar, M.Sc., CDS.

Chapter 7 ~ Probability Distributions

Introduction

A probability distribution is a core concept in statistics that not only describes the likelihood of an event occurring but also forms the foundation for many data analysis methods and decision-making processes. Imagine a process or experiment that produces varying outcomes such as rolling a die, measuring people’s heights, or predicting product sales. These outcomes aren’t fixed; they are random and can be represented through random variables. A probability distribution then describes how probabilities are allocated across the possible values of those variables. Understanding this is essential because it determines how data behaves, how probabilities are computed, and how accurate predictions are made.

To examine it more deeply, consider that probability distributions allow us to model uncertainty mathematically. In real-world scenarios, we rarely have complete data for an entire population (e.g., the height of every person in Indonesia). Instead, we take a small sample and use distributions to infer characteristics of the whole population. The shape of the distribution, whether symmetric like a bell (normal distribution) or skewed, affects how the data is interpreted. Ignoring this structure risks flawed analysis, such as underestimating extreme events like product failures or market fluctuations.

This material focuses on several interconnected key concepts, namely:

  1. Continuous Random Variables: These are variables that can take any value within a continuous range, such as waiting time in a queue or air temperature. Unlike discrete variables (e.g., the number of heads in a series of coin flips), continuous variables use a probability density function (PDF) to compute probabilities. For instance, the probability of the temperature being exactly 25°C is zero, but the probability of it being between 24°C and 26°C can be computed by integrating the PDF over that interval. This distinction matters because many natural measurements are continuous, and treating them as discrete (or vice versa) can lead to inaccurate predictive models.

  2. Sampling Distributions: When you take a sample from a population and compute statistics such as the sample mean or proportion, the sampling distribution describes how those statistics vary if the sampling is repeated many times. This isn’t just theoretical; it’s a tool for evaluating the reliability of your estimates. For example, if you survey 100 people about a product preference, the sampling distribution helps estimate how close your sample mean is to the true population mean. Without this, you risk placing excessive confidence in results from small samples, which can lead to poor business decisions.

  3. Central Limit Theorem (CLT): One of the most powerful results in statistics, the CLT states that the distribution of sample means from a population with finite variance (even a non-normal one) approaches a normal distribution as the sample size grows (a common rule of thumb is n ≥ 30). This matters because it enables the use of simple statistical tools such as hypothesis tests and confidence intervals, largely regardless of the population’s original shape. However, the approximation is not universal: it can be poor for very small samples or for data with extreme outliers, which can distort the analysis if ignored.

  4. Sample Proportion Distributions: Similar to sampling distributions of the mean but specific to proportions (e.g., the percentage of voters supporting a candidate). These are widely used in surveys and quantitative research. This distribution approaches normality for large samples, but its variance depends on the population proportion. A common mistake is neglecting the assumption of independence between observations, which can inflate the estimation error.

Video and Summary of Chapter 7 Material

7.1 Continuous Random Variables

The video begins with a review of discrete variables to establish contrast, then moves on to continuous variables.

  • Discrete Variable: A variable that can only take on countable values, obtained through counting, not measurement. These values are finite or countable, although they can involve decimals (e.g., money in a bank account, such as $420.69). This concept makes sense for categorical or integer data, but the video underemphasizes limitations such as context-dependence (e.g., money cannot be negative).

  • Continuous Variable: A variable that can take on any numeric value within a specified range, obtained through measurement, so its possible values are infinite and uncountable. This reflects the reality of natural data such as age or weight, where precision can be refined indefinitely (e.g., 23 years old could be 23 years, 6 months, 2 days, 3 seconds, and so on).

  • Differences in Probability Representation:

  • Discrete: Depicted with a bar chart (with gaps between bars to indicate discontinuity), using basic probability formulas such as P(X = k) for specific values.

  • Continuous: Depicted with a histogram (without gaps to show continuity) or a density curve. Probability is calculated as the area under the curve, not the value of a single point (since P(X = c) = 0 for an exact value of c).

Formula

The key formulas for the explanation above are as follows.

For discrete variables: Probability mass function (PMF):

\[ P(X = k) = f(k), \quad \text{where} \quad \sum_{k} f(k) = 1 \]

with the sum taken over all countable values of k.

For continuous variables: Probability density function (PDF):

\[ f(x) \geq 0 \quad \text{and} \quad \int_{-\infty}^{\infty} f(x) \, dx = 1 \]

Probability over an interval:

\[ P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx \]

Cumulative distribution function (CDF) for continuous variables:

\[ F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) \, dt \]

Example

The video provides concrete examples to illustrate:

  • Discrete: Number of heads when tossing a coin 4 times; number of children in a family (it’s unreasonable to say 0.73 children); number of blue marbles from a box; test score (5/10); bank balance ($420.69).

These examples are relevant because they demonstrate countability, although the bank balance example is debatable, since money may be treated as continuous in other contexts.

  • Continuous: Age (23 could be 23.5 years, or more precisely to the second/nanosecond); weight (150 lb could be 150.305482 lb, etc.); temperature; distance.

These examples are relevant because they illustrate values that can be refined without limit: an age survey, for instance, is never exact because time keeps passing.
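
The first discrete example (number of heads when tossing a coin 4 times) can be made concrete with a short Python sketch. It assumes a fair coin and uses the binomial formula for P(X = k); the function name `heads_pmf` is only illustrative.

```python
from math import comb


def heads_pmf(n_tosses: int, p_heads: float = 0.5) -> dict[int, float]:
    """Return P(X = k) for k = 0..n_tosses, where X counts heads."""
    return {
        k: comb(n_tosses, k) * p_heads**k * (1 - p_heads) ** (n_tosses - k)
        for k in range(n_tosses + 1)
    }


pmf = heads_pmf(4)  # assumption: fair coin, 4 independent tosses
for k, prob in pmf.items():
    print(f"P(X = {k}) = {prob:.4f}")

# A valid PMF sums to 1 over all countable values of k.
total = sum(pmf.values())
print(f"Sum of probabilities = {total:.4f}")
```

Each value of k has its own positive probability here (for example, P(X = 2) = 0.375), which is exactly what a continuous variable cannot offer for a single point.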

7.1.1 Continuous Random Variables

Continuous random variables take on any value within an interval on the real number line, not countable discrete values. Examples: height, time, temperature, age, pressure, speed. Key characteristics:

  • Values within a finite interval (a, b) or infinite (-∞, +∞).

  • The probability of a single point is always zero:

\[ P(X = x) = 0 \]

because there are infinitely many possibilities.

  • Probability is only meaningful for intervals:

\[ P(a \leq X \leq b) = \int_a^b f(x) \, dx \]

where f(x) is the probability density function (PDF).

7.1.2 Probability Density Function (PDF)

The PDF f(x) is a function that describes the probability distribution of a continuous variable. It is valid if it satisfies the following conditions:

  1. Non-negative:

\[ f(x) \geq 0 \quad \forall x \]

  2. Total integral equals 1:

\[ \int_{-\infty}^{\infty} f(x) \, dx = 1 \]

Interpretation:

  • A larger value of f(x) indicates a higher probability density around x (values near x are more likely to occur).

  • f(x) is not a probability itself; probabilities are calculated as areas under the PDF curve.

Example:

\[ f(x) = 3x^2 \quad \text{on } [0, 1] \]

  • Validation:

\[ \int_0^1 3x^2 \, dx = 1 \]
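
The validation can also be checked numerically. The sketch below approximates the integral with a simple midpoint rule; the helper `integrate` and the step count are illustrative choices, not part of the video material.

```python
def pdf(x: float) -> float:
    """Density f(x) = 3x^2 on [0, 1], and 0 elsewhere."""
    return 3 * x**2 if 0.0 <= x <= 1.0 else 0.0


def integrate(f, a: float, b: float, steps: int = 100_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


total_area = integrate(pdf, 0.0, 1.0)
print(f"Total area under f: {total_area:.6f}")
```

The same helper can be reused for any interval, for example `integrate(pdf, 0.5, 1.0)`.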

7.1.3 Probability on an Interval

To calculate the probability on an interval for the example density f(x) = 3x^2, with 0 ≤ a ≤ b ≤ 1:

\[ P(a \leq X \leq b) = \int_a^b 3x^2 \, dx \]

Example:

\[ P(0.5 \leq X \leq 1) = 0.875 \]

(calculated from the integral).
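
Written out step by step, the integral behind this value is:

\[ P(0.5 \leq X \leq 1) = \int_{0.5}^{1} 3x^2 \, dx = \left[ x^3 \right]_{0.5}^{1} = 1^3 - 0.5^3 = 1 - 0.125 = 0.875 \]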

7.1.4 Cumulative Distribution Function

The CDF is defined as (for the example density, with 0 ≤ x ≤ 1):

\[ F(x) = P(X \leq x) = \int_0^x 3t^2 \, dt = x^3 \]

Relationship between PDF and CDF:

\[ f(x) = F'(x) \]
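
As a small sketch of how the CDF is used in practice, the Python code below computes an interval probability as a difference of CDF values and checks the relationship f(x) = F'(x) with a finite difference. The test point x = 0.7 and the step h are arbitrary illustrative choices.

```python
def cdf(x: float) -> float:
    """F(x) = x^3 on [0, 1], clamped to 0 below and 1 above the support."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return x**3


def pdf(x: float) -> float:
    """Density f(x) = 3x^2 on [0, 1], and 0 elsewhere."""
    return 3 * x**2 if 0.0 <= x <= 1.0 else 0.0


# Interval probability as a difference of CDF values.
prob = cdf(1.0) - cdf(0.5)
print(f"P(0.5 <= X <= 1) = {prob:.3f}")

# A central finite difference approximates F'(x), which should match f(x).
x, h = 0.7, 1e-6
slope = (cdf(x + h) - cdf(x - h)) / (2 * h)
print(f"F'(0.7) is about {slope:.4f}; f(0.7) = {pdf(x):.4f}")
```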

7.2 Sampling Distributions

This video discusses the concept of sampling distributions, focusing on the distribution of sample means from a population.

A sampling distribution is the probability distribution of a sample statistic (e.g., the sample mean, denoted \(\bar{X}\)) computed from all possible samples of size n drawn from the population. It differs from the population distribution (all values, with mean μ and standard deviation σ) and from the sample distribution (the values in a single sample). Key interpretation: A sampling distribution allows the estimation of population parameters without a full census, which saves resources. For example, the average height of a global population can be estimated from a sample.

Sampling variability causes \(\bar{X}\) to differ across samples, but the mean of all \(\bar{X}\) equals \(\mu\).

For large n (n ≥ 30), the shape approaches normal via the Central Limit Theorem (CLT).

Formula

  • Population Distribution Formula

\[ X \sim N(\mu, \sigma) \]

\[ Z = \frac{X - \mu}{\sigma} \]

Here the second parameter of N denotes the standard deviation.

  • Sampling Distribution Formula for Sample Mean

\[ \bar{X} \sim N\left( \mu, \frac{\sigma}{\sqrt{n}} \right) \]

where:

  • Mean :

\[ \mu_{\bar{X}} = \mu \]

  • Standard error :

\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]

  • Z-score :

\[ Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \]
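
The mean and standard error formulas above can be checked with a quick simulation. This is a minimal sketch that assumes a normal population with μ = 160 and σ = 7 (the values of the height example in this section); the number of repetitions and the seed are arbitrary.

```python
import random
import statistics
from math import sqrt

random.seed(42)  # fixed seed so the run is repeatable

MU, SIGMA = 160.0, 7.0  # assumed population mean and standard deviation
N = 10                  # size of each sample
REPS = 5_000            # number of repeated samples (arbitrary choice)

# Draw REPS samples of size N and record each sample mean.
sample_means = [
    statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(REPS)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = SIGMA / sqrt(N)

print(f"Mean of sample means: {mean_of_means:.2f} (population mean {MU})")
print(f"SD of sample means: {sd_of_means:.2f} (sigma / sqrt(n) = {theoretical_se:.2f})")
```

The simulated values should land close to μ and σ/√n, although they will not match exactly because the simulation is itself random.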

Example

  • Problem 1: The height of a Canadian is distributed as N(160 cm, 7 cm). Calculate \(P(\bar{X} < 157)\) for n = 10.

  • Standard Error :

\[ \frac{7}{\sqrt{10}} \approx 2.21 \]

  • Z :

\[ \frac{157 - 160}{2.21} \approx -1.36 \]

\[ P(Z < -1.36) \approx 0.0869 \]

This is relevant because it contrasts the probability for a sample mean with the probability for an individual.

  • Problem 2: Proportion of people taller than 170 cm (population distribution).

  • Z :

\[ \frac{170 - 160}{7} \approx 1.43 \]

  • Proportion :

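\[ 1 - 0.9236 = 0.0764 \]

This contrast highlights the difference between the two distributions: the sample mean is standardized with the standard error, while an individual value is standardized with σ.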
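
Both problems can be reproduced without a normal table. The sketch below uses `statistics.NormalDist` from the Python standard library; because the z-scores are not rounded to two decimals, the results differ slightly from the table-based values above.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA = 160.0, 7.0  # population mean and standard deviation (cm)

# Problem 1: probability that the mean of n = 10 heights is below 157 cm.
n = 10
standard_error = SIGMA / sqrt(n)
p_mean_below_157 = NormalDist(MU, standard_error).cdf(157)

# Problem 2: proportion of individuals taller than 170 cm.
p_individual_above_170 = 1 - NormalDist(MU, SIGMA).cdf(170)

print(f"Standard error: {standard_error:.2f}")
print(f"P(sample mean < 157) = {p_mean_below_157:.4f}")
print(f"P(X > 170) = {p_individual_above_170:.4f}")
```

The two probabilities come out at roughly 0.088 and 0.077, in line with the hand calculations.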

7.3 Central Limit Theorem

The Central Limit Theorem (CLT) states that the sampling distribution of a sample mean (\(\bar{x}\)) will approximate a normal distribution when the sample size (n) is large enough, regardless of the shape of the original population distribution. This allows statistical inference using normal formulas, such as confidence intervals or hypothesis tests.

  • Sampling Distribution: Formed by taking repeated random samples from the population, calculating a statistic (e.g., \(\bar{x}\)), and collecting them into a distribution.

  • CLT Condition: For n ≥ 30, the distribution of \(\bar{x}\) is approximately normal with mean μ (the population mean) and variance \(\sigma^2/n\). If the population itself is normal, \(\bar{x}\) is exactly normal even for small n.

  • Implication: Small samples (n < 30) from a non-normal population can produce non-normal sampling distributions, and their larger standard error reduces the reliability of estimates.

Formula

Main Formula:

  • Sample mean:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

  • According to the CLT, if n is large:

\[ \bar{x} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \]

where:

  • \(\mu\) = Population mean

  • \(\sigma\) = Population standard deviation

  • n = Sample size

Because this holds only approximately, the result is better written as:

\[\bar{x} \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{(for } n \geq 30\text{)}\]

Example

  • Example from video: A population with a skewed distribution (e.g., exponential). Take repeated samples with n = 30, calculating \(\bar{x}\) each time. The distribution of \(\bar{x}\) is approximately normal, centered at μ, with lower variability than the population.

  • Additional example: In a household income survey (right-skewed distribution), take a sample of size n = 50. The CLT implies that \(\bar{x}\) is approximately normal, so we can calculate a 95% confidence interval:

\[ \bar{x} \pm 1.96 \sqrt{\frac{s^2}{n}} \]
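
The interval formula can be sketched in Python as follows. The income values are simulated from an exponential distribution purely for illustration (they are not survey data), and the function name `mean_confidence_interval` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev

random.seed(1)  # fixed seed so the run is repeatable


def mean_confidence_interval(data, confidence=0.95):
    """Normal-approximation interval: x_bar +/- z * sqrt(s^2 / n)."""
    n = len(data)
    x_bar = fmean(data)
    s = stdev(data)  # sample standard deviation
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sqrt(s**2 / n)
    return x_bar - margin, x_bar + margin


# Simulated right-skewed "incomes" with n = 50 (illustrative values only).
incomes = [random.expovariate(1 / 4.0) for _ in range(50)]

low, high = mean_confidence_interval(incomes)
print(f"Sample mean: {fmean(incomes):.2f}")
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

The interval is symmetric around the sample mean, and its width shrinks in proportion to 1/√n as the sample grows.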

Example histograms showing the CLT in action:

For n = 5, the histogram of sample means is skewed (showing non-normality in small samples).

For n = 10, it approximates a bell curve, with a red normal-curve overlay (showing better convergence).
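
The behaviour described by these histograms can be reproduced numerically. The sketch below assumes an Exponential(1) population and measures the skewness of the sample means instead of drawing a plot; the sample sizes 5 and 30, the repetition count, and the seed are illustrative choices.

```python
import random
from statistics import fmean, pstdev

random.seed(7)  # fixed seed for a repeatable run


def sample_mean_skewness(n: int, reps: int = 10_000) -> float:
    """Skewness of the means of `reps` samples of size n from Exponential(1)."""
    means = [
        fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]
    center, spread = fmean(means), pstdev(means)
    # Average cubed z-score: 0 for a symmetric shape, positive for right skew.
    return fmean(((m - center) / spread) ** 3 for m in means)


for n in (5, 30):
    print(f"n = {n:2d}: skewness of sample means = {sample_mean_skewness(n):.2f}")
```

For an exponential population the skewness of the sample mean is 2/√n in theory, so it should fall from about 0.89 at n = 5 to about 0.37 at n = 30, moving toward the value 0 of a normal distribution.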

7.4 Sample Proportion

The video explains the sampling distribution of the sample proportion (\(\hat{p}\)) as the probability distribution of all possible values of \(\hat{p}\) computed from samples of size \(n\) drawn from a population with success proportion \(p\). It is an extension of the binomial distribution, where

\[ \hat{p} = \frac{x}{n} \]

and x is the number of successes in the sample.

  • Review Sampling Distribution : A sampling distribution is the distribution of sample statistics (e.g., the mean or proportion) from a population.

  • Population vs. Sample Proportion : The population proportion \(p\) is fixed; the sample proportion \(\hat{p}\) varies due to random sampling.

  • Mean:

\[ \mu_{\hat{p}} = p \]

which means that \(\hat{p}\) is an unbiased estimator of \(p\).

  • Standard Deviation:

\[ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \]

which measures variability and decreases as \(n\) increases.

  • Distribution Shape: Approaches normal if conditions are met (see below), allowing a normal approximation for the binomial.

  • Application: Used for inference purposes such as confidence intervals or hypothesis testing on proportions.

Formula

  • Sample Proportion:

\[ \hat{p} = \frac{x}{n} \]

  • Mean of Sampling Distribution:

\[ \mu_{\hat{p}} = p \]

  • Standard Deviation:

\[ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \]

  • Normal Approximation:

\[ \hat{p} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right) \]

Example

  • Example 1: Mean, SD, and Shape:

Assume a population with p = 0.6 and a sample of n = 100.

Then

\[ \mu_{\hat{p}} = 0.6 \]

\[ \sigma_{\hat{p}} = \sqrt{\frac{0.6 \times 0.4}{100}} \approx 0.049 \]

Shape: Approximately normal because \(np = 60 > 10\) and \(n(1-p) = 40 > 10\).

  • Example 2: Probability Calculation:

Calculate \(P(\hat{p} > 0.65)\) using the z-score:

\[ z = \frac{0.65 - 0.6}{0.049} \approx 1.02 \]

so the probability is \(P(Z > 1.02) \approx 0.154\) (from the normal table). This relates to the normal approximation for the binomial.
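
Both examples can be checked in a few lines of Python. This sketch uses `statistics.NormalDist` in place of the normal table, so the final digits may differ slightly from the rounded hand calculation.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.6, 100

# Rule-of-thumb check used in Example 1: both expected counts exceed 10.
assert n * p > 10 and n * (1 - p) > 10

std_error = sqrt(p * (1 - p) / n)     # standard deviation of p-hat
z = (0.65 - p) / std_error            # z-score for p-hat = 0.65
prob_above = 1 - NormalDist().cdf(z)  # upper-tail probability

print(f"Mean of p-hat: {p}")
print(f"SD of p-hat: {std_error:.3f}")
print(f"z = {z:.2f}, P(p-hat > 0.65) = {prob_above:.3f}")
```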

Visualization: Sampling Distribution of the Sample Proportion

\(\hat{p}\) approaches a normal distribution when n is large.
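
In place of a plot, the simulation below sketches the same idea numerically: it draws many samples, computes \(\hat{p}\) for each, and compares the simulated mean and standard deviation with the formulas of this section. The values p = 0.6 and n = 100 are taken from Example 1; the repetition count and seed are arbitrary.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(11)  # fixed seed for a repeatable run

P, N, REPS = 0.6, 100, 5_000


def sample_proportion(p: float, n: int) -> float:
    """Simulate n independent success/failure trials and return p-hat."""
    successes = sum(random.random() < p for _ in range(n))
    return successes / n


p_hats = [sample_proportion(P, N) for _ in range(REPS)]

print(f"Mean of p-hat: {fmean(p_hats):.3f} (theory: {P})")
print(f"SD of p-hat: {pstdev(p_hats):.3f} (theory: {sqrt(P * (1 - P) / N):.3f})")
```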

7.5 Review Sampling Distribution

The video explains the sampling distribution of the sample proportion as an inferential tool for estimating population proportions from a sample. The binomial distribution is the basis: a binary event (success/failure) with a fixed probability p, and the number of trials n. The sample proportion \(\hat{p} = X/n\) (X: number of successes) has a sampling distribution that is approximately normal for large samples, based on the Central Limit Theorem (CLT).

Key interpretations:

  • Mean of the distribution:

\[ \mu_{\hat{p}} = p \]

  • Standard deviation:

\[ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \]

  • Normality conditions:

\[ np \geq 5 \quad \text{and} \quad n(1-p) \geq 5 \]

  • Probability is calculated using continuity correction for the normal approximation to the binomial.

Formula

  • Binomial Distribution:

\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]

  • Binomial Mean:

\[ \mu = np \]

  • Binomial Variance:

\[ \sigma^2 = np(1-p) \]

  • Sample Proportion:

\[ \hat{p} \approx N\left(p, \sqrt{\frac{p(1-p)}{n}}\right) \]

  • Z-Score for Probability:

\[ Z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \]

Example

  • Example 1: Population with success rate p = 0.4, sample n = 100. Calculate the probability that \(\hat{p} > 0.45\)

  • Check the following conditions:

\[ np = 40 \geq 5 \]

\[ n(1-p) = 60 \geq 5 \]

  • Z =

\[ \frac{0.45 - 0.4}{\sqrt{\frac{0.4 \times 0.6}{100}}} = \frac{0.05}{0.049} \approx 1.02 \]

  • P(Z > 1.02) ≈ 0.1539 (from the normal table)

  • Relevance: This illustrates how the binomial is approximated by the normal distribution. With continuity correction, use 0.455 instead of 0.45 as the limit.
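
As a worked sketch of that correction, reusing the rounded standard deviation 0.049 from the calculation above and a standard normal table:

\[ Z = \frac{0.455 - 0.4}{0.049} \approx 1.12, \qquad P(Z > 1.12) \approx 1 - 0.8686 = 0.1314 \]

The corrected probability is somewhat smaller than the uncorrected 0.1539, because the limit moves half a unit (0.5/100) further into the tail.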

The binomial is illustrated as follows:

  • Example 2: Compare the exact binomial with the normal approximation for n = 20, p = 0.5, P(X ≥ 12)

  • Exact: 1 - pbinom(11, 20, 0.5) in R ≈ 0.2517

  • Approx:

\[ Z = \frac{(12/20) - 0.5}{\sqrt{0.5^2/20}} = \frac{0.1}{0.1118} \approx 0.89, \qquad P(Z \geq 0.89) \approx 0.1867 \]
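
The gap between the exact value and the approximation comes largely from the missing continuity correction. The Python sketch below computes all three quantities for comparison; it uses `math.comb` for the exact binomial sum and `statistics.NormalDist` for the normal tail, and the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 20, 0.5, 12
mu, sigma = n * p, sqrt(n * p * (1 - p))  # binomial mean and SD

# Exact P(X >= k): sum the binomial PMF from k to n.
exact = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Normal approximation without continuity correction.
approx_plain = 1 - NormalDist(mu, sigma).cdf(k)

# With continuity correction: P(X >= 12) is approximated by P(Y >= 11.5).
approx_corrected = 1 - NormalDist(mu, sigma).cdf(k - 0.5)

print(f"Exact binomial: {exact:.4f}")
print(f"Normal, no correction: {approx_plain:.4f}")
print(f"Normal, with correction: {approx_corrected:.4f}")
```

The corrected approximation (about 0.251) is much closer to the exact 0.2517 than the uncorrected one (about 0.186), which is why the correction matters for small n.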
