PROBABILITY DISTRIBUTIONS

Basic Statistics – Data Science – Assignment Week 11

INSTITUT TEKNOLOGI SAINS BANDUNG

IDENTITY CARD

Name : Hirose Kawarin Sirait

Student ID : 52250012

Major : Data Science

Lecturer : Mr. Bakti Siregar, M.Sc., CDS.

Definition of Probability Distribution

A probability distribution describes how likely the possible outcomes of an uncertain phenomenon are. Starting from an experiment, we determine the sample space and then the probability of each event in it. More formally, a probability distribution is a statistical function that describes all the possible values a random variable can take within a certain range, together with their probabilities. That range is bounded by the minimum and maximum possible values, and where each probability falls within it depends on a number of factors. In short, a probability distribution describes the likelihood of each possible outcome, with probability taking the place of observed frequency.

Here are some more in-depth materials for learning about probability distributions.

1 Continuous Random Variables

In statistics and probability theory, a continuous random variable is a type of variable that can take any value within a given range. Unlike discrete random variables, which can only assume specific, separate values (like the number of students in a class), continuous random variables can assume any value within an interval, making them ideal for modelling quantities that vary smoothly without jumps.

Typical real-world examples are the height of individuals, the time taken to complete a task, and the amount of rainfall in a particular period. This section covers the concept of a continuous random variable, its examples and properties, and how it differs from a discrete random variable.


To understand continuous random variables, it is essential to know how probability is represented using a Probability Density Function (PDF). Unlike discrete random variables, a continuous random variable does not assign probability to individual points. Instead, probability is obtained from the area under the PDF curve.

1.1 Random Variable

A random variable is continuous if it can take any value within an interval on the real number line.

Examples include: height, time, temperature, age, pressure, and velocity.

Key characteristics:

  • The variable takes values in an interval such as \((a, b)\) or even \((-\infty, +\infty)\).

  • The probability of any single point is always zero:

\[ P(X = x) = 0 \]

  • Probabilities are meaningful only over intervals:

\[ P(a \le X \le b) = \int_{a}^{b} f(x)\, dx \]

1.2 Probability Density Function

A function \(f(x)\) is a valid Probability Density Function (PDF) if it satisfies:

1. Non-negativity

\[ f(x) \ge 0 \quad \forall x \]

2. Total Area Equals 1

\[ \int_{-\infty}^{\infty} f(x)\,dx = 1 \]

Interpretation:

  • Larger values of \(f(x)\) indicate higher probability density around that value.

  • However, \(f(x)\) is not a probability; probabilities come from the area under the curve.

Example PDF: \(f(x) = 3x^2 \text{ on } [0,1]\)

Consider the probability density function:

\[ f(x) = 3x^2,\quad 0 \le x \le 1 \]

Validation:

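\[ \int_{0}^{1} 3x^2 \, dx = \left[ x^3 \right]_{0}^{1} = 1^3 - 0^3 = 1 \]

Since \(3x^2 \ge 0\) on \([0,1]\) and the total area is 1, both conditions are satisfied.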

1.3 Probability on an Interval

To compute probability within an interval:

\[ P(a \le X \le b) = \int_{a}^{b} 3x^2 \, dx \]

Example: \[ P(0.5 \le X \le 1) = \int_{0.5}^{1} 3x^2 \, dx = 1^3 - 0.5^3 = 0.875 \]

1.4 Cumulative Distribution Function

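The Cumulative Distribution Function (CDF) is defined as \(F(x) = P(X \le x)\). For the example density \(f(t) = 3t^2\) on \([0,1]\), and for \(0 \le x \le 1\):

\[ F(x) = P(X \le x) = \int_{0}^{x} 3t^{2}\, dt = x^{3} \]

with \(F(x) = 0\) for \(x < 0\) and \(F(x) = 1\) for \(x > 1\).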

Relationship between PDF and CDF: \[ f(x) = F'(x) \]
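The properties of the example density can also be checked numerically. The following is a minimal Python sketch, assuming a simple midpoint rule for the integral; the helper names `pdf`, `cdf` and `integrate` are illustrative and are not part of the original material.

```python
def pdf(x: float) -> float:
    """Density f(x) = 3x^2 on [0, 1], zero elsewhere."""
    return 3.0 * x * x if 0.0 <= x <= 1.0 else 0.0


def cdf(x: float) -> float:
    """Closed-form CDF F(x) = x^3 on [0, 1], clamped outside that range."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return x ** 3


def integrate(f, a: float, b: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


total_area = integrate(pdf, 0.0, 1.0)      # should be close to 1
prob_numeric = integrate(pdf, 0.5, 1.0)    # area under the curve on [0.5, 1]
prob_from_cdf = cdf(1.0) - cdf(0.5)        # same probability via F(b) - F(a)

print(f"total area       = {total_area:.6f}")
print(f"P(0.5 <= X <= 1) = {prob_numeric:.6f} (numeric)")
print(f"P(0.5 <= X <= 1) = {prob_from_cdf:.6f} (from CDF)")
```

Both routes give the same probability because the CDF is the integral of the PDF, which is the relationship \(f(x) = F'(x)\) read in the other direction.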

2 Sampling Distributions

Before exploring the concept of sampling distributions in detail, this video provides a clear visual explanation of how statistics such as sample means behave when repeatedly drawn from the same population. It offers an intuitive foundation for understanding variability, uncertainty, and why sampling distributions are essential in statistical inference.

To understand, watch the following video

In statistics, a sampling distribution is used to calculate the probability that a sample statistic falls close to the value of a population parameter. It likewise allows us to estimate the sampling error for a particular sample size.

2.1 Difference between a Sample Distribution and a Sampling Distribution

A distribution of a sample and a sampling distribution of sample means can often be told apart by their shape. The distribution of a sample is the distribution of the values in a single sample taken from a population, so its shape follows the pattern of the original data: it can be right-skewed, left-skewed, or otherwise non-normal. In contrast, the sampling distribution of sample means is the distribution formed from the means of many samples of the same size, and according to the Central Limit Theorem, the distribution of these means will tend to be normal if the sample size is large enough (generally n ≥ 30), even if the original data is not normal. As a rule of thumb, therefore, a distribution described as normal in shape is usually the sampling distribution of sample means, whereas a skewed distribution is the distribution of a sample.

Example:

A recent study with a sample size of 200 found that the mean height of an adult in a particular city is 164 cm with a standard error of 3.7 cm and that the heights are normally distributed. Is this a description of a distribution of the sample or is it a description of the sampling distribution of sample means?

  • Step 1: Analyze the information given regarding the shape of the distribution - is it normally distributed or skewed to the right or left? The data is described as being normally distributed - not skewed in either direction.

  • Step 2: Use the Central Limit Theorem to conclude whether the described distribution is a distribution of a sample or a sampling distribution of sample means. If the shape is skewed right or left, the distribution is a distribution of a sample. If the shape is normally distributed, the distribution is a sampling distribution of sample means. Since we have a large sample (size 200), the distribution is normal, and the spread is reported as a standard error, we can conclude that the description is of a sampling distribution of sample means.

2.2 Sampling Distribution of the Sample Mean

A side-by-side comparison of a histogram for the original population and a histogram for the sampling distribution of the mean shows the following: whereas the distribution of the population is uniform, the sampling distribution of the mean has a shape approaching the familiar bell curve. This phenomenon, in which the sampling distribution of the mean takes on a bell shape even though the population distribution is not bell-shaped, happens in general. Here is a somewhat more realistic example.

Suppose we take samples of size 1, 5, 10, or 20 from a population that consists entirely of the numbers 0 and 1, half the population 0, half 1, so that the population mean is 0.5. The sampling distributions are:

n = 1:

\[ \begin{array}{c|cc} \hline \bar{x} & 0 & 1 \\ \hline P(\bar{x}) & 0.5 & 0.5 \\ \hline \end{array} \]

n = 5:

\[ \begin{array}{c|cccccc} \hline \bar{x} & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\ \hline P(\bar{x}) & 0.03 & 0.16 & 0.31 & 0.31 & 0.16 & 0.03 \\ \hline \end{array} \]

n = 10:

\[ \begin{array}{c|ccccccccccc} \hline \bar{x} & 0 & 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 & 1 \\ \hline P(\bar{x}) & 0.00 & 0.01 & 0.04 & 0.12 & 0.21 & 0.25 & 0.21 & 0.12 & 0.04 & 0.01 & 0.00 \\ \hline \end{array} \]

n = 20:

\[ \begin{array}{c|ccccccccccc} \hline \bar{x} & 0 & 0.05 & 0.10 & 0.15 & 0.20 & 0.25 & 0.30 & 0.35 & 0.40 & 0.45 & 0.50 \\ \hline P(\bar{x}) & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.01 & 0.04 & 0.07 & 0.12 & 0.16 & 0.18 \\ \hline \end{array} \]

\[ \begin{array}{c|cccccccccc} \hline \bar{x} & 0.55 & 0.60 & 0.65 & 0.70 & 0.75 & 0.80 & 0.85 & 0.90 & 0.95 & 1 \\ \hline P(\bar{x}) & 0.16 & 0.12 & 0.07 & 0.04 & 0.01 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ \hline \end{array} \]

Histograms illustrating these distributions appear in the figure “Distributions of the Sample Mean”.
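The tabulated probabilities can be reproduced exactly rather than by simulation. The sketch below assumes draws are independent (sampling with replacement from the half-0, half-1 population), so the number of ones in a sample of size n is binomial; the function name `sample_mean_distribution` is illustrative.

```python
from math import comb


def sample_mean_distribution(n: int, p: float = 0.5) -> dict[float, float]:
    """Exact distribution of the sample mean of n independent 0/1 draws.

    The number of ones in the sample is Binomial(n, p), so the sample mean
    k / n has probability C(n, k) * p**k * (1 - p)**(n - k).
    """
    return {k / n: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


for n in (1, 5, 10, 20):
    dist = sample_mean_distribution(n)
    # Round to two decimals, as in the tables above.
    row = ", ".join(f"{mean:.2f}: {prob:.2f}" for mean, prob in dist.items())
    print(f"n = {n:2d} -> {row}")
```

As n grows, the probability mass concentrates around the population mean 0.5 and the outline of the bars approaches a bell shape.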

2.3 Population Distribution

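In statistics, the population is the complete set of elements under study, and the population distribution is the probability distribution of a variable measured over every element of that population. It is the distribution from which samples are drawn, and its parameters (such as the mean \(\mu\) and the standard deviation \(\sigma\)) are what sample statistics estimate.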

The population distribution for a discrete variable is:

\[ P(X = x_i) = \frac{\text{number of elements equal to } x_i}{N} \]

The population distribution for a continuous variable is described by a density \(f(x)\) satisfying:

\[ f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\, dx = 1 \]

Example for a population {0, 1}:

\[ P(X = 0) = 0.5, \quad P(X = 1) = 0.5 \]
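For a finite population, the discrete formula is simply a relative-frequency count. A minimal Python sketch, assuming the population is available as a list; the list used here is a hypothetical population of 100 elements, half 0 and half 1.

```python
from collections import Counter


def population_distribution(population):
    """Return {value: relative frequency} for a finite population."""
    counts = Counter(population)
    size = len(population)
    return {value: count / size for value, count in sorted(counts.items())}


# Hypothetical population: half zeros, half ones, as in the {0, 1} example.
population = [0, 1] * 50
print(population_distribution(population))
```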

3 Central Limit Theorem

The Central Limit Theorem (CLT) describes how the distribution of sample means approaches a normal distribution as larger samples are taken from a population. This is important for understanding how statistics can make inferences about a population even when the population distribution is unknown. The CLT underpins hypothesis testing and other applications in statistics and data science.


The Central Limit Theorem (CLT) states that if we repeatedly take sufficiently large random samples from a population with finite variance, the distribution of the sample means will approach a normal distribution, whatever the shape of the population. In other words, the CLT allows us to make inferences about a population based on a sample taken from that population.

3.1 Key Principles

  1. Sample Size: The normal approximation given by the CLT is adequate when the sample size is large enough; a common rule of thumb is n ≥ 30. The larger the sample size, the closer the sampling distribution of the sample mean will be to a normal distribution.

  2. Independence: The samples must be independent of each other.

  3. Finite Variance: The population from which samples are drawn must have a finite variance.

3.2 Formula

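For large \(n\), the CLT can be expressed as follows, where the second argument of \(N\) is the variance, so that the standard deviation of \(\overline{X}\) is \(\sigma/\sqrt{n}\):

\[ \overline{X} \;\overset{\text{approx.}}{\sim}\; N\left(\mu,\; \frac{\sigma^{2}}{n}\right) \]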

where:

  • \(\overline{X}\) is the sample mean

  • \(\mu\) is the population mean

  • \(\sigma\) is the population standard deviation

  • \(n\) is the sample size

3.3 Examples

Example 1: Turtle Shell Widths

Suppose the width of a turtle’s shell follows a uniform distribution with a minimum width of 2 inches and a maximum width of 6 inches. If we take random samples of 30 turtles and calculate the mean shell width each time, the distribution of these sample means will approximate a normal distribution.

Example 2: Number of Pets per Family

Suppose the number of pets per family in a city follows a chi-square distribution with three degrees of freedom. If we take random samples of 30 families and calculate the mean number of pets each time, the distribution of these sample means will also approximate a normal distribution.
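Both examples can be explored with a small simulation. The sketch below uses only Python's standard library; the number of repetitions (5,000) and the seed are arbitrary choices, and the chi-square draw relies on the fact that a chi-square distribution with 3 degrees of freedom is a gamma distribution with shape 1.5 and scale 2. The theoretical values follow from \(\mu\) and \(\sigma/\sqrt{n}\): mean 4 and variance 4/3 for the uniform case, mean 3 and variance 6 for the chi-square case.

```python
import random
import statistics


def simulate_sample_means(draw, n, repetitions, rng):
    """Draw `repetitions` samples of size n and return their sample means."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(repetitions)]


def turtle_width(rng):
    # Uniform on [2, 6] inches: mean 4, variance (6 - 2)**2 / 12 = 4/3.
    return rng.uniform(2.0, 6.0)


def pets_per_family(rng):
    # Chi-square with 3 degrees of freedom equals Gamma(shape=1.5, scale=2):
    # mean 3, variance 6.
    return rng.gammavariate(1.5, 2.0)


rng = random.Random(42)  # fixed seed so the run is reproducible
n, repetitions = 30, 5_000

for name, draw, mu, var in [
    ("turtle widths", turtle_width, 4.0, 4.0 / 3.0),
    ("pets per family", pets_per_family, 3.0, 6.0),
]:
    means = simulate_sample_means(draw, n, repetitions, rng)
    print(
        f"{name}: mean of sample means = {statistics.fmean(means):.3f} "
        f"(theory {mu:.3f}), sd = {statistics.stdev(means):.3f} "
        f"(theory {(var / n) ** 0.5:.3f})"
    )
```

Plotting a histogram of `means` for either case would show the bell shape; the printed summary only confirms that the centre and spread agree with the CLT.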

4 Sample Proportion

The sample proportion is the fraction of observations in a sample that have a particular characteristic. It should not be confused with proportional sampling, a method used in research and surveys in which the sample is drawn so that its proportions match the distribution of the population being studied, in order to obtain representative data.


The sample proportion is important because it serves as an estimate of the true population proportion. Since collecting data from an entire population is often impossible, the sample proportion allows researchers to draw valid conclusions about a large population using only a smaller sample. It is also an unbiased estimator: averaged over repeated samples, the sample proportion equals the population proportion.

\[ \text{Proportion} = \frac{\text{number of favourable outcomes}}{\text{total number of outcomes}} \]

4.1 Explanation

In statistics, when conducting a survey, usually not all population data is known, so a study is typically conducted on a representative sample, and then the conclusions drawn are extrapolated to the entire population. Therefore, the sample proportion is used to estimate the proportion of the entire population. Below we will see how this is done.

The sample proportion is equal to the number of successful cases in a sample divided by the total number of observations in the sample. Therefore, the formula to calculate the sample proportion is:

\[ \widehat{p} = \frac{e}{n} \]

where:

  • \(\widehat{p}\) : the sample proportion

  • \(e\) : the number of successful cases in the sample

  • \(n\) : the total number of data items in the sample

4.2 Example

  • A company manufactures toys and buys some of its parts from an external company. Defective parts appeared in a purchased batch, so the company decided to conduct a statistical study to determine the proportion of parts in good condition and the proportion of defective parts. It inspects a sample of 1,000 units and finds 138 defective parts. What is the proportion of parts in good condition in the sample? And what is the proportion of defective parts in the sample?

The number of non-defective parts in the sample is 1000 minus the number of defective parts: \[ e = 1000 - 138 = 862 \]

Answer: \[ \widehat{p} = \frac{e}{n} = \frac{862}{1000} = 0.862 \]

Therefore, the proportion of parts in good condition is 86.2%.

Conversely, the proportion of defective parts is equal to one minus the proportion of good parts:

\[ \widehat{q} = 1 - \widehat{p} = 1 - 0.862 = 0.138 \]

Therefore, the proportion of defective parts in the sample is 13.8%.
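The same calculation can be written as a short Python sketch. The function name `sample_proportion` and its input checks are illustrative additions, not part of the original exercise.

```python
def sample_proportion(successes: int, n: int) -> float:
    """Return the sample proportion p_hat = successes / n."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    return successes / n


n = 1000          # parts inspected
defective = 138   # defective parts found
good = n - defective

p_hat = sample_proportion(good, n)        # proportion in good condition
q_hat = sample_proportion(defective, n)   # proportion defective

print(f"p_hat = {p_hat:.3f}, q_hat = {q_hat:.3f}")
```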

5 Review Sampling Distribution

A sampling distribution is the probability distribution of a statistic (e.g., mean, proportion, or standard deviation) derived from multiple random samples of a population. It provides insights into how a statistic varies across different samples and is a cornerstone of inferential statistics.

When analyzing a population, it is often impractical to study every individual. Instead, smaller samples are taken, and their statistics (e.g., sample mean) are used to estimate population parameters. The sampling distribution represents the variability of these sample statistics.



Sampling distribution is important to study because it provides an overview of how a sample statistic, such as the mean or proportion, will behave if sampling is repeated multiple times. By understanding this concept, we can measure uncertainty, construct confidence intervals, and conduct hypothesis tests correctly. In essence, the sampling distribution is the foundation that allows us to draw accurate conclusions about a population solely from sample data.

5.1 Review of Probability Theory

The review of probability theory covers essential concepts such as the sample space, event space, probability measure, and axioms of probability. It also includes the calculus of probability, random variables, probability distributions, expectation, moments, and extreme-value distributions. The review is suitable for those who have completed a course on probability and statistics and provides a foundational understanding for further study in related fields.

5.2 Example

  • Suppose we have a jar that contains 200 green marbles and 300 blue marbles. If a marble is drawn three times with replacement, what is the probability of drawing at least two green marbles?

Probability of Green

\[ P(\text{Green}) = \frac{\text{number of successful outcomes}}{\text{total number of outcomes}} \]

\[ P(\text{Green}) = \frac{200}{500} = 0.4 \]


Probability of Blue

\[ P(\text{Blue}) = \frac{\text{number of unsuccessful outcomes}}{\text{total number of outcomes}} \]

\[ P(\text{Blue}) = \frac{300}{500} = 0.6 \]
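With these single-draw probabilities, the question can be completed. Because the draws are made with replacement, the number of green marbles in three draws is binomial with n = 3 and p = 0.4, so P(at least two green) = 3(0.4)²(0.6) + (0.4)³ = 0.288 + 0.064 = 0.352. A minimal Python sketch of that step, with the helper name `prob_at_least` chosen for illustration:

```python
from math import comb


def prob_at_least(k_min: int, n: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


p_green = 200 / 500  # probability of green on a single draw
answer = prob_at_least(2, 3, p_green)
print(f"P(at least two green in three draws) = {answer:.3f}")
```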

  • Sampling Distribution of the Sample Mean

Given:

  • Population mean: \(\mu = 13{,}525\)

  • Population standard deviation: \(\sigma = 4{,}180\)

  • Sample size: \(n = 100\)

Mean of the Sampling Distribution

\[ \mu_{\bar{X}} = \mu = 13{,}525 \]

Standard Error of the Mean

\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]

Substitute values:

\[ \sigma_{\bar{X}} = \frac{4{,}180}{\sqrt{100}} = \frac{4{,}180}{10} \]

\[ \sigma_{\bar{X}} = 418 \]

Interpretation

This means the sample means will cluster around \(13{,}525\) with a variability (standard error) of \(418\).
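The standard error calculation is easy to script. The sketch below reproduces the value for n = 100 and, as a hypothetical extension that is not part of the original exercise, shows how the standard error changes for other sample sizes.

```python
from math import sqrt


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / sqrt(n)


mu, sigma, n = 13_525, 4_180, 100
se = standard_error(sigma, n)
print(f"mean of sample means = {mu}, standard error = {se:.1f}")

# Hypothetical extension: quadrupling n halves the standard error.
for size in (25, 100, 400):
    print(size, standard_error(sigma, size))
```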

5.3 Key Concepts

  1. Definition: The sampling distribution is the distribution of a statistic (e.g., sample mean) calculated from all possible samples of a fixed size from a population. For example, if we repeatedly sample 50 dolphins from a population and calculate their mean weights, the distribution of these means forms the sampling distribution.

  2. Properties: The mean of the sampling distribution of the sample mean equals the population mean (\(\mu_{\bar{X}} = \mu\)). The standard deviation of the sampling distribution (known as the standard error) is given by \(\sigma_{\bar{X}} = \sigma / \sqrt{n}\), where \(\sigma\) is the population standard deviation and \(n\) is the sample size.

  3. Central Limit Theorem (CLT): Regardless of the population’s shape, provided its variance is finite, the sampling distribution of the sample mean approaches a normal distribution as the sample size increases (a common rule of thumb is n ≥ 30). This is crucial for making statistical inferences.

  4. Types of Sampling Distributions:

      • Sampling Distribution of the Mean: Focuses on the means of samples. For example, the average weight of dolphins in repeated samples.

      • Sampling Distribution of the Proportion: Focuses on proportions, such as the percentage of dolphins that are black in repeated samples.

      • T-Distribution: Used when the population standard deviation is unknown or the sample size is small.

5.4 Applications

Sampling distributions are essential for hypothesis testing, constructing confidence intervals, and making predictions about population parameters. For instance, they help determine the likelihood of observing a sample mean under specific conditions, enabling statisticians to draw conclusions about the population.