This primer provides an overview of 25 probability distributions, including 10 discrete probability distributions and 15 continuous probability distributions. For each distribution, we discuss its definition, assumptions, application, real-life examples, and R code to generate random variables.
The 10 discrete probability distributions covered in this primer are Bernoulli, Binomial, Poisson, Geometric, Negative Binomial, Hypergeometric, Multinomial, Poisson Binomial, Zipf, and Discrete Uniform. These distributions are used to model binary outcomes, counts of successes or rare events, number of trials needed to obtain a success, frequency of occurrence of items in a population, and equally likely outcomes.
The 15 continuous probability distributions covered in this primer are Uniform, Normal, Exponential, Log-Normal, Gamma, Beta, Chi-Square, Student’s t, F, Weibull, Cauchy, Pareto, Gumbel, Logistic, and Rayleigh. These distributions are used to model continuous random variables, such as measurements of time, length, or weight, or ratios of two independent random variables.
Understanding these probability distributions is essential for statistical analysis and data modeling in various fields, including finance, engineering, healthcare, and social sciences. The R code provided in this primer can be used to simulate random variables from each distribution and explore their properties and characteristics. This primer serves as a useful reference for students, researchers, and practitioners who seek to apply probability theory and statistical analysis in their work.
Discrete probability distributions are a key concept in probability theory and statistics. They model random variables that can take only specific, separated values, such as counts or whole numbers, whereas continuous probability distributions model random variables that can take any value within a range.
Some of the most commonly used discrete probability distributions include the Bernoulli, Binomial, Poisson, Geometric, Negative Binomial, Hypergeometric, Multinomial, Poisson Binomial, Zipf, and Discrete Uniform distributions. Each of these distributions has a unique set of parameters that determine its properties, such as mean, variance, skewness, and kurtosis.
Discrete probability distributions are used to estimate probabilities, calculate expected values, and make predictions based on observed data. They are commonly used in various fields, such as finance, engineering, healthcare, and social sciences, to model binary outcomes, counts of successes or rare events, frequency of occurrence of items in a population, and equally likely outcomes.
In this context, it is important to note that discrete probability distributions are all based on the fundamental concept of probability mass functions (PMFs), which describe the probability distribution of a discrete random variable. PMFs are used to calculate the probability of a random variable taking on a specific value.
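As a minimal sketch of how a PMF is evaluated, the base R snippet below computes the PMF of a binomial variable at each of its possible values and checks two things: the probabilities sum to 1, and dbinom() agrees with the textbook formula. The parameter values (5 trials, success probability 0.3) are chosen only for illustration.

```r
# PMF of a Binomial(size = 5, prob = 0.3) variable at each possible value 0..5
# (size and prob are illustrative values, not taken from a data set)
x <- 0:5
pmf <- dbinom(x, size = 5, prob = 0.3)

# The probabilities of all possible values must sum to 1
total <- sum(pmf)

# P(X = 2) from the formula choose(n, k) * p^k * (1 - p)^(n - k)
p2_formula <- choose(5, 2) * 0.3^2 * 0.7^3
p2_dbinom <- pmf[x == 2]

total
all.equal(p2_formula, p2_dbinom)
```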
Understanding discrete probability distributions is therefore a prerequisite for statistical analysis and data modeling: they underpin estimating probabilities, interpreting count and categorical data, and making predictions from observed data in a wide range of fields.
Definition: The Bernoulli distribution is a probability distribution for a binary random variable that takes only two possible values, usually labeled as success (1) and failure (0).
Assumptions: The binary random variable has a fixed probability of success and failure, and the outcomes are independent.
Application: The Bernoulli distribution is commonly used to model the outcome of a single trial of an experiment where the outcome can only be success or failure.
Real-life example: A coin toss, where the outcome can only be heads (success) or tails (failure).
# Generate 10 Bernoulli random variables with probability of success 0.3
library(Rlab)
rbern(n = 10, prob = 0.3)
## [1] 0 1 0 0 0 1 0 0 0 0
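If installing an extra package is not desirable, the same kind of draw can be sketched in base R, because a Bernoulli variable is a binomial variable with a single trial. The seed below is an arbitrary choice made only so the run is reproducible.

```r
# Bernoulli draws using only base R: a binomial with size = 1 (one trial)
set.seed(1)  # arbitrary seed, for reproducibility only
x <- rbinom(n = 10, size = 1, prob = 0.3)
x            # a vector of ten 0/1 values

# Proportion of successes in this small sample
mean(x)
```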
Definition: The binomial distribution is a probability distribution for a discrete random variable that counts the number of successes in a fixed number of independent trials, each of which has only two possible outcomes, usually labeled as success (1) and failure (0).
Assumptions: The number of trials is fixed, every trial has the same probability of success, and the outcomes of the trials are independent.
Application: The binomial distribution is commonly used to model the number of successes in a fixed number of independent trials where the outcome can only be success or failure.
Real-life example: The number of heads in 10 coin tosses.
# Generate 10 binomial random variables with 5 trials and probability of success 0.3
rbinom(n = 10, size = 5, prob = 0.3)
## [1] 2 2 2 1 4 2 3 2 1 4
Definition: The Poisson distribution is a probability distribution for a discrete random variable that represents the number of events occurring in a fixed interval of time or space.
Assumptions: The events occur independently, and the average rate at which they occur is constant over time or space.
Application: The Poisson distribution is commonly used to model the number of occurrences of rare events such as accidents or defects in a manufacturing process.
Real-life example: The number of car accidents that occur in a city in a given day.
# Generate 10 Poisson random variables with lambda = 2
rpois(n = 10, lambda = 2)
## [1] 2 0 4 4 0 5 2 0 3 1
Definition: The geometric distribution is a probability distribution for a discrete random variable that represents the number of independent trials, each with two possible outcomes labeled success (1) and failure (0), needed until the first success occurs. An equivalent convention, which R's rgeom() follows, counts the number of failures before the first success.
Assumptions: Every trial has the same fixed probability of success, and the outcomes are independent.
Application: The geometric distribution is commonly used to model the number of trials needed until the first success occurs in a sequence of independent trials.
Real-life example: The number of shots a basketball player needs to take until they make their first basket.
# Generate 10 geometric random variables with probability of success 0.3
# (rgeom() returns the number of failures before the first success, so 0 is a possible value)
rgeom(n = 10, prob = 0.3)
## [1] 4 6 3 1 5 0 1 2 4 15
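Because rgeom() counts failures rather than trials, a value of 0 means the first trial was already a success. A minimal sketch of converting the draws to the "number of trials" convention follows; the sample size and seed are arbitrary choices for illustration.

```r
set.seed(1)                              # arbitrary seed
p <- 0.3
failures <- rgeom(n = 10000, prob = p)   # failures before the first success
trials <- failures + 1                   # trials up to and including the first success

min(trials)    # at least 1 under the "number of trials" convention
mean(trials)   # should be close to the theoretical mean 1 / p
1 / p
```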
Definition: The negative binomial distribution is a probability distribution for a discrete random variable that represents the number of independent trials, each with two possible outcomes labeled success (1) and failure (0), needed until a fixed number of successes occur. An equivalent convention, which R's rnbinom() follows, counts the number of failures before that fixed number of successes is reached.
Assumptions: Every trial has the same fixed probability of success, and the outcomes are independent.
Application: The negative binomial distribution is commonly used to model the number of trials needed until a fixed number of successes occur in a sequence of independent trials.
Real-life example: The number of times a person needs to roll a die until they get three sixes.
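# Generate 10 negative binomial random variables with 3 successes and probability of success 0.3
# (rnbinom() returns the number of failures before the 3rd success; add 3 to get the number of trials)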
rnbinom(n = 10, size = 3, prob = 0.3)
## [1] 8 7 7 5 6 9 11 5 22 2
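To make the "failures before the r-th success" convention concrete, the sketch below compares dnbinom() with the corresponding closed-form PMF, choose(k + r - 1, k) * p^r * (1 - p)^k, for k failures. The range of k checked (0 to 20) is an arbitrary illustrative choice.

```r
r <- 3      # required number of successes (size)
p <- 0.3    # probability of success on each trial
k <- 0:20   # number of failures before the r-th success

# Closed-form PMF for the number of failures
pmf_formula <- choose(k + r - 1, k) * p^r * (1 - p)^k
pmf_builtin <- dnbinom(k, size = r, prob = p)

all.equal(pmf_formula, pmf_builtin)
```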
Definition: The hypergeometric distribution is a probability distribution for a discrete random variable that represents the number of successes in a fixed number of draws without replacement from a finite population with two classes, success and failure.
Assumptions: The draws are made without replacement from a finite population in which every item is either a success or a failure, and the population size, the number of successes in the population, and the number of draws are known.
Application: The hypergeometric distribution is commonly used to model the probability of selecting a given number of successes in a sample without replacement, such as the number of defective items in a sample of products.
Real-life example: The number of red balls in a sample of 10 balls drawn without replacement from a bag with 20 balls, 5 of which are red.
# Generate 10 hypergeometric random variables with population size 20, number of successes 5, and number of draws 10
# (m = successes in the population, n = failures in the population = 20 - 5, k = number of draws)
rhyper(nn = 10, m = 5, n = 15, k = 10)
## [1] 1 2 3 3 3 2 3 2 3 3
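For the bag example above, the probability of each possible number of red balls can be worked out by counting combinations. The sketch below checks that count-based formula against dhyper(), using the same parameters as the simulation.

```r
m <- 5     # red balls (successes) in the bag
n <- 15    # non-red balls (failures) in the bag
k <- 10    # balls drawn without replacement
x <- 0:5   # possible numbers of red balls in the sample

# Ways to pick x red and k - x non-red, divided by all ways to pick k balls
pmf_formula <- choose(m, x) * choose(n, k - x) / choose(m + n, k)
pmf_builtin <- dhyper(x, m = m, n = n, k = k)

all.equal(pmf_formula, pmf_builtin)
sum(pmf_builtin)   # the probabilities over all possible values sum to 1
```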
Definition: The multinomial distribution is a probability distribution for a vector of counts that records how many times each of several categories occurs in a fixed number of independent trials, where each trial has more than two possible outcomes.
Assumptions: The number of trials is fixed, the trials are independent, and the probability of each category is the same on every trial, with the category probabilities summing to 1.
Application: The multinomial distribution is commonly used to model the frequency of occurrence of multiple categories in a series of independent trials, such as the number of times each face of a die is rolled in a fixed number of rolls.
Real-life example: The number of times each candidate is selected in a survey of 100 voters.
# Generate 10 multinomial random vectors (one per column), each from 10 independent trials with three outcomes and probabilities 0.2, 0.3, and 0.5
rmultinom(n = 10, size = 10, prob = c(0.2, 0.3, 0.5))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    4    3    2    2    1    4    2    2    1     1
## [2,]    1    3    5    5    2    3    6    4    7     2
## [3,]    5    4    3    3    7    3    2    4    2     7
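The layout of the result can be easy to misread: each column is one random vector and each row is one category. A small sketch that confirms this by checking the dimensions and column totals (the seed is arbitrary):

```r
set.seed(1)   # arbitrary seed
draws <- rmultinom(n = 10, size = 10, prob = c(0.2, 0.3, 0.5))

dim(draws)       # 3 rows (categories) by 10 columns (random vectors)
colSums(draws)   # every column sums to size = 10, the number of trials
rowMeans(draws)  # average count per category; expected values are 2, 3, and 5
```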
Definition: The Poisson binomial distribution is a probability distribution for a discrete random variable that represents the number of successes in a series of independent trials with different probabilities of success.
Assumptions: The trials are independent, each trial has only two possible outcomes, and the probability of success may differ from trial to trial.
Application: The Poisson binomial distribution is commonly used to model the number of successes in a series of independent trials with different probabilities of success, such as the number of customers who buy a product in a series of marketing campaigns.
Real-life example: The number of customers who buy a product in 10 marketing campaigns with success probabilities 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1.
# Generate 2 Poisson binomial random variables, each the number of successes in 10 independent trials
# with success probabilities 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1
library(poibin)
rpoibin(m = 2, pp = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1), wts = NULL)
## [1] 4 7
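The same kind of variable can be sketched in base R without the poibin package: run one Bernoulli trial per success probability and add up the successes. The number of replications and the seed below are arbitrary illustrative choices.

```r
set.seed(1)   # arbitrary seed
pp <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)

# One Poisson binomial draw = sum of independent Bernoulli trials,
# where trial i succeeds with probability pp[i]
draws <- replicate(10000, sum(rbinom(n = length(pp), size = 1, prob = pp)))

mean(draws)   # should be close to sum(pp), the expected number of successes
sum(pp)
min(draws)    # at least 1, because the last trial succeeds with probability 1
```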
Definition: The Zipf distribution is a probability distribution for a discrete random variable that represents the rank of an item when items are ordered by frequency of occurrence, with a long tail of rare items. The probability of rank k is proportional to k^(-s) for an exponent s greater than 0.
Assumptions: The items can be ranked by frequency, the number of items N is finite, and the frequency of an item decays as a power of its rank.
Application: The Zipf distribution is commonly used to model the frequency of occurrence of words in a natural language, the popularity of websites, or the size of cities.
Real-life example: The frequency of occurrence of words in the English language.
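# Generate 10 Zipf random variables over ranks 1 to N = 100 with exponent s = 2
# (base R sketch: draw ranks with probability proportional to rank^(-s);
#  sample() rescales the weights so they sum to 1)
N <- 100
s <- 2
sample(x = 1:N, size = 10, replace = TRUE, prob = (1:N)^(-s))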
Definition: The discrete uniform distribution is a probability distribution for a discrete random variable that takes each value in a finite set with equal probability.
Assumptions: The set of possible outcomes is finite, and every outcome is equally likely.
Application: The discrete uniform distribution is commonly used to model situations where each outcome is equally likely, such as rolling a fair die or drawing a card from a deck.
Real-life example: The number rolled on a fair die.
# Generate 10 discrete uniform random variables with outcomes 1 to 6
sample(x = 1:6, size = 10, replace = TRUE)
## [1] 1 6 5 3 6 2 1 6 4 5
Bernoulli Distribution: Nominal scale (binary outcome)
Binomial Distribution: Ratio scale (count of successes)
Poisson Distribution: Ratio scale (count of rare events)
Geometric Distribution: Ratio scale (number of trials needed to get first success)
Negative Binomial Distribution: Ratio scale (number of trials needed to get fixed number of successes)
Hypergeometric Distribution: Ratio scale (count of successes in finite population)
Multinomial Distribution: Nominal scale for the categories, ratio scale for the count in each category (count of multiple categories in independent trials)
Poisson Binomial Distribution: Ratio scale (count of successes in independent trials with different probabilities)
Zipf Distribution: Ordinal scale (rank of item by frequency of occurrence)
Discrete Uniform Distribution: Nominal scale (equally likely outcomes)
Continuous probability distributions are a fundamental concept in probability theory and statistics. They are used to model random variables that can take on any value within a range of values, as opposed to discrete probability distributions that model random variables that can only take on certain values. Continuous probability distributions are commonly used in various fields, such as finance, engineering, healthcare, and social sciences, to model measurements of time, length, weight, or other continuous variables.
Some of the most commonly used continuous probability distributions include the Uniform, Normal, Exponential, Log-Normal, Gamma, Beta, Chi-Square, Student’s t, F, Weibull, Cauchy, Pareto, Gumbel, Logistic, and Rayleigh distributions. Each of these distributions has a unique shape and set of parameters that determine its properties, such as mean, variance, skewness, and kurtosis.
Understanding continuous probability distributions is essential for statistical analysis and data modeling. They are used to estimate probabilities, calculate expected values, and make predictions based on observed data. In addition, the properties of continuous probability distributions, such as their moments and tail behavior, have important implications for statistical inference and hypothesis testing.
In this context, it is important to note that continuous probability distributions are all based on the fundamental concept of probability density functions (PDFs), which describe the probability distribution of a continuous random variable. Because a continuous random variable takes any single specific value with probability zero, a PDF is used to calculate the probability of the variable falling within a range of values: the area under the PDF curve over that range is the probability.
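A minimal sketch of the "area under the curve" idea, using the standard normal density as an illustrative choice: the probability of falling between -1 and 1 is obtained once by numerically integrating the PDF and once from the cumulative distribution function, and the two results agree.

```r
# Probability that a standard normal variable falls between -1 and 1,
# computed as the area under its PDF over that range
area <- integrate(dnorm, lower = -1, upper = 1)$value

# The same probability from the cumulative distribution function
via_cdf <- pnorm(1) - pnorm(-1)

area
via_cdf
```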
Overall, continuous probability distributions play a critical role in modern statistics and are essential for analyzing and interpreting data in a wide range of fields. Understanding their properties, assumptions, and applications is essential for anyone seeking to apply statistical methods to real-world problems.
Definition: The uniform distribution is a probability distribution for a continuous random variable that takes values within a fixed interval, and all values within that interval are equally likely.
Assumptions: The endpoints of the interval are fixed, and the probability density is constant across the interval.
Application: The uniform distribution is commonly used to model outcomes where all values within an interval are equally likely, such as the arrival time of a customer at a store within a given hour.
Real-life example: The arrival time of a customer at a store between 9:00 am and 10:00 am.
# Generate 10 uniform random variables between 0 and 1
runif(n = 10)
## [1] 0.62836781 0.36129915 0.39124268 0.65525076 0.95001533 0.94654090
## [7] 0.05209806 0.16042054 0.13249802 0.66642079
Definition: The normal distribution is a probability distribution for a continuous random variable that has a bell-shaped curve, with most of the values around the mean and fewer values further away from the mean.
Assumptions: The distribution is symmetric about its mean and is completely determined by its mean and standard deviation.
Application: The normal distribution is commonly used to model outcomes that cluster symmetrically around an average, such as the height of a population or the scores on a standardized test.
Real-life example: The height of a population of people.
# Generate 10 normal random variables with mean 0 and standard deviation 1
rnorm(n = 10, mean = 0, sd = 1)
## [1] 0.47255794 -1.00171215 -1.25129215 0.20049873 -0.06913215 1.07735018
## [7] 2.65391772 0.14550943 -1.23665386 -1.28576401
Definition: The exponential distribution is a probability distribution for a continuous random variable that represents the time between events in a Poisson process.
Assumptions: The events occur independently and at a constant average rate over time, that is, they follow a Poisson process.
Application: The exponential distribution is commonly used to model the time between occurrences of rare events such as accidents or defects in a manufacturing process.
Real-life example: The time between car accidents on a particular road.
# Generate 10 exponential random variables with rate parameter 0.5
rexp(n = 10, rate = 0.5)
## [1] 1.0738940 3.5091776 2.5955264 2.1647797 1.0531659 5.5677601 2.6514645
## [8] 1.1702604 0.4941048 3.4160287
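The link to the Poisson process can be sketched by simulation: if the gaps between events are exponential with a given rate, then the number of events in each unit-length interval should behave like a Poisson count whose mean and variance both equal that rate. The number of simulated gaps and the seed are arbitrary, and rexp() is written with the stats:: prefix so that base R's generator is used even if another attached package defines its own rexp().

```r
set.seed(1)   # arbitrary seed
rate <- 0.5

# Exponential gaps between consecutive events, accumulated into event times
gaps <- stats::rexp(n = 20000, rate = rate)
event_times <- cumsum(gaps)

# Count the events that fall in each unit interval [0, 1), [1, 2), ...
horizon <- floor(max(event_times))
in_window <- event_times[event_times < horizon]
counts <- tabulate(floor(in_window) + 1, nbins = horizon)

mean(counts)   # close to rate
var(counts)    # also close to rate, as expected for Poisson counts
```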
Definition: The gamma distribution is a probability distribution for a positive continuous random variable. When its shape parameter is a whole number n, it represents the sum of n independent exponential random variables, each with the same rate parameter.
Assumptions: The random variable is positive. Under the sum interpretation, the underlying events occur independently at a constant rate, so the time between events follows an exponential distribution.
Application: The gamma distribution is commonly used to model the time to complete a task that is composed of multiple independent sub-tasks, each with an exponentially distributed time to completion.
Real-life example: The time to complete a software development project that is composed of multiple independent sub-tasks.
# Generate 10 gamma random variables with shape parameter 2 and rate parameter 0.5
rgamma(n = 10, shape = 2, rate = 0.5)
## [1] 2.912141 3.129058 3.182755 4.076055 5.117186 2.461633 3.484854 4.145739
## [9] 6.911272 4.896910
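The sum-of-exponentials interpretation can be sketched directly: adding two independent exponential variables with rate 0.5 should give values whose average is close to shape / rate = 4, the mean of the gamma distribution simulated above. The number of replications and the seed are arbitrary, and the stats:: prefix guards against packages that mask rexp().

```r
set.seed(1)   # arbitrary seed
shape <- 2    # whole-number shape = number of exponential terms in each sum
rate <- 0.5

# Each row holds `shape` exponential draws; the row sums are gamma-like variables
exp_draws <- matrix(stats::rexp(n = 10000 * shape, rate = rate), ncol = shape)
sums <- rowSums(exp_draws)

mean(sums)     # close to the gamma mean shape / rate
shape / rate
```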
Definition: The beta distribution is a probability distribution for a continuous random variable that takes values between 0 and 1, and is used to model the distribution of probabilities.
Assumptions: The continuous random variable takes values between 0 and 1.
Application: The beta distribution is commonly used to model the distribution of probabilities, such as the probability of a candidate winning an election or the probability of a stock price increasing.
Real-life example: The probability of a particular candidate winning an election.
# Generate 10 beta random variables with shape parameters 2 and 5
rbeta(n = 10, shape1 = 2, shape2 = 5)
## [1] 0.29816243 0.32665454 0.25349514 0.34523862 0.06897022 0.65593664
## [7] 0.45961624 0.19567540 0.49516424 0.20079907
Definition: The Student’s t distribution is a probability distribution for a continuous random variable that is used to model the standardized sample mean (the difference between the sample mean and the population mean, divided by the estimated standard error) when the sample size is small and the population standard deviation is unknown.
Assumptions: The observations are independent draws from a normal population, the population standard deviation is unknown and is estimated from the sample, and the degrees of freedom equal the sample size minus 1.
Application: The Student’s t distribution is commonly used in hypothesis testing and confidence interval estimation when the sample size is small and the population standard deviation is unknown.
Real-life example: The standardized mean IQ score in a sample of 10 people.
# Generate 10 Student's t random variables with 5 degrees of freedom
rt(n = 10, df = 5)
## [1] -0.178503569 -0.003473483 0.118672906 2.183996714 -0.469441101
## [6] 1.801459652 0.813760575 -0.130475715 0.803109138 -0.311119488
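A sketch of where the t distribution arises in practice: repeatedly draw a small normal sample, standardize its mean with the sample standard deviation, and compare the resulting statistics with the t distribution's 97.5% critical value. The population mean and standard deviation (100 and 15), the number of replications, and the seed are illustrative assumptions.

```r
set.seed(1)   # arbitrary seed
n <- 6        # small sample size, giving n - 1 = 5 degrees of freedom
mu <- 100     # illustrative population mean
sigma <- 15   # illustrative population standard deviation

# Standardized sample mean, using the sample standard deviation
t_stats <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})

# Two-sided 5% critical value from the t distribution with n - 1 degrees of freedom
crit <- qt(0.975, df = n - 1)
mean(abs(t_stats) > crit)   # share of statistics beyond the cutoff, close to 0.05
```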
Definition: The chi-squared distribution is a probability distribution for a continuous random variable that is used to model the distribution of the sum of squared independent standard normal random variables; the number of terms in the sum is the degrees of freedom.
Assumptions: The continuous random variable is the sum of squared independent standard normal random variables.
Application: The chi-squared distribution is commonly used in hypothesis testing and confidence interval estimation, for example in goodness-of-fit tests, tests of independence in contingency tables, and inference about the variance of a normal population.
Real-life example: The sum of squared deviations from the mean, divided by the population variance, in a sample of 100 normally distributed measurements.
# Generate 10 chi-squared random variables with 5 degrees of freedom
rchisq(n = 10, df = 5)
## [1] 3.115157 5.033969 2.633151 3.293294 3.404943 2.656027 7.795141 6.067275
## [9] 1.843143 4.915851
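The definition can be sketched by construction: squaring and summing 5 independent standard normal draws should give values whose average is close to 5, the mean of a chi-squared distribution with 5 degrees of freedom. The number of replications and the seed are arbitrary.

```r
set.seed(1)   # arbitrary seed
df <- 5

# Each row holds df standard normal draws; sum their squares row by row
z <- matrix(rnorm(10000 * df), ncol = df)
chisq_sim <- rowSums(z^2)

mean(chisq_sim)   # close to df, the mean of a chi-squared variable
median(chisq_sim) # compare with the theoretical median below
qchisq(0.5, df = df)
```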
Definition: The F distribution is a probability distribution for a continuous random variable that is used to model the distribution of the ratio of two independent chi-squared random variables, each divided by its degrees of freedom.
Assumptions: The continuous random variable is the ratio of two independent chi-squared random variables, each divided by its degrees of freedom.
Application: The F distribution is commonly used in hypothesis testing and confidence interval estimation when comparing the variances of two populations.
Real-life example: The ratio of the sample variances of two sets of normally distributed data.
# Generate 10 F random variables with numerator degrees of freedom 5 and denominator degrees of freedom 10
rf(n = 10, df1 = 5, df2 = 10)
## [1] 0.6190342 3.2136285 0.1807652 2.0638532 0.8202575 5.3778171 3.5081302
## [8] 9.3296230 0.5191228 0.7340216
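As a sketch of the ratio construction, the snippet below builds F-like variables from two independent sets of chi-squared draws, each scaled by its degrees of freedom, and compares their median with the theoretical median from qf(). The number of replications and the seed are arbitrary.

```r
set.seed(1)   # arbitrary seed
df1 <- 5      # numerator degrees of freedom
df2 <- 10     # denominator degrees of freedom

# Independent chi-squared variables, each divided by its degrees of freedom
num <- rchisq(10000, df = df1) / df1
den <- rchisq(10000, df = df2) / df2
f_sim <- num / den

median(f_sim)        # simulated median
qf(0.5, df1, df2)    # theoretical median of the F(5, 10) distribution
```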
Definition: The log-normal distribution is a probability distribution for a positive continuous random variable whose natural logarithm is normally distributed; equivalently, it is the result of exponentiating a normal random variable.
Assumptions: The continuous random variable is positive, and its natural logarithm follows a normal distribution.
Application: The log-normal distribution is commonly used to model outcomes that are the result of exponential growth or compounding, such as stock prices or population sizes.
Real-life example: The stock price of a company over time.
# Generate 10 log-normal random variables with meanlog = 0 and sdlog = 1
rlnorm(n = 10, meanlog = 0, sdlog = 1)
## [1] 0.54504420 0.61758477 5.75944796 0.02652327 0.35382796 0.66278599
## [7] 0.52167088 0.73497979 0.93642903 1.83519286
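The relationship between the two distributions can be sketched in two ways: log-normal draws can be produced by exponentiating normal draws, and the log-normal cumulative probability at q equals the normal cumulative probability at log(q). The evaluation points and the seed are arbitrary.

```r
set.seed(1)   # arbitrary seed

# Exponentiating normal draws gives positive, log-normally distributed values
x <- exp(rnorm(n = 10, mean = 0, sd = 1))
x

# P(X <= q) for the log-normal equals P(Z <= log(q)) for the underlying normal
q <- c(0.5, 1, 2, 5)
p_lognormal <- plnorm(q, meanlog = 0, sdlog = 1)
p_normal <- pnorm(log(q), mean = 0, sd = 1)
all.equal(p_lognormal, p_normal)
```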
Definition: The Weibull distribution is a probability distribution for a continuous random variable that is used to model the time to failure of a system or component.
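Assumptions: The time to failure is positive, and the shape parameter determines whether the failure rate decreases (shape less than 1), stays constant (shape equal to 1), or increases (shape greater than 1) over time.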
Application: The Weibull distribution is commonly used in reliability analysis to model the time to failure of a system or component.
Real-life example: The time to failure of a component in a manufacturing process.
# Generate 10 Weibull random variables with shape parameter 2 and scale parameter 1
rweibull(n = 10, shape = 2, scale = 1)
## [1] 1.3850509 0.7519880 1.6545574 0.5641020 0.3919465 0.6716577 1.3067403
## [8] 1.0198459 0.3825102 0.4438237
Definition: The Pareto distribution is a power-law probability distribution for a continuous random variable, used to model quantities where rare, very large values have a high impact.
Assumptions: The continuous random variable has a minimum value and a shape parameter that determines the tail behavior of the distribution.
Application: The Pareto distribution is commonly used in economics, finance, and insurance to model the distribution of income, wealth, or losses.
Real-life example: The distribution of income in a country.
# Generate 100 Pareto random variables with minimum value (threshold) 1000 and shape parameter 2
library(Pareto)
rPareto(100, 1000, 2)
## [1] 1466.605 1914.817 1769.961 1561.345 1148.556 1779.569 1333.430
## [8] 1904.837 1680.520 2116.007 1665.096 1076.880 2143.071 1261.498
## [15] 2824.095 1242.484 1473.468 2016.340 1099.695 5370.836 8609.321
## [22] 1098.771 1512.804 1007.976 1232.830 1588.022 5770.498 34657.522
## [29] 3063.972 1391.621 3129.599 1234.547 5424.664 1226.699 1057.790
## [36] 3392.030 2217.004 2337.939 1476.799 3556.240 1085.005 11833.602
## [43] 1258.977 1036.782 6928.958 1170.436 1069.240 1343.889 1059.520
## [50] 4193.379 1082.622 1466.361 1178.867 1079.573 1278.834 1077.485
## [57] 1593.909 1545.025 1648.691 1156.436 1041.308 2054.964 3721.628
## [64] 1245.934 1017.495 1807.051 2012.533 1899.870 1026.249 1669.937
## [71] 3905.537 1173.786 2248.853 21380.913 1139.604 2856.304 1119.070
## [78] 2456.413 1025.695 3236.773 1105.560 1067.889 7678.731 1157.468
## [85] 4735.346 1845.190 1337.029 2410.503 1070.576 2605.588 1801.082
## [92] 3036.357 1084.707 1414.863 1241.266 1099.801 2206.823 2356.777
## [99] 1396.095 2113.138
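The same kind of draw can be sketched in base R with inverse transform sampling, a swapped-in technique that is not necessarily how the Pareto package generates its values: if U is uniform on (0, 1), then t_min * U^(-1 / alpha) is Pareto distributed with minimum t_min and shape alpha. The variable names, sample size, and seed are illustrative choices.

```r
set.seed(1)     # arbitrary seed
t_min <- 1000   # minimum value (threshold)
alpha <- 2      # shape parameter

# Inverse transform sampling: invert the Pareto CDF 1 - (t_min / x)^alpha
u <- runif(10000)
x <- t_min * u^(-1 / alpha)

min(x)                   # never below the minimum value t_min
median(x)                # simulated median
t_min * 2^(1 / alpha)    # theoretical median
```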
Definition: The Cauchy distribution is a probability distribution for a continuous random variable that has a bell-shaped curve, but with much heavier tails than the normal distribution, so extreme values are far more likely. Its tails are so heavy that its mean and variance are undefined.
Assumptions: The continuous random variable has a location parameter and a scale parameter.
Application: The Cauchy distribution is commonly used in statistics to model outliers or extreme values that cannot be explained by a normal distribution.
Real-life example: The distribution of stock returns during a financial crisis.
# Generate 10 Cauchy random variables with location parameter 0 and scale parameter 1
rcauchy(n = 10, location = 0, scale = 1)
## [1] -3.79702986 0.15283060 3.98029831 0.29326665 -0.49257338 2.14437027
## [7] -0.55608065 -1.16569444 -0.69526402 0.01012299
Definition: The logistic distribution is a probability distribution for a continuous random variable that has a symmetric bell-shaped density, similar to the normal distribution but with heavier tails. Its cumulative distribution function is the S-shaped logistic function.
Assumptions: The continuous random variable has a location parameter and a scale parameter.
Application: The logistic distribution is commonly used in regression analysis, machine learning, and Bayesian statistics to model the probability of binary outcomes or the likelihood of events occurring.
Real-life example: The probability of a customer buying a product based on their age and income.
# Generate 10 logistic random variables with location parameter 0 and scale parameter 1
rlogis(n = 10, location = 0, scale = 1)
## [1] -0.08480989 -1.83529588 1.96376120 0.73841331 1.48413879 -1.44616237
## [7] -2.88356118 -0.06238278 0.54539099 -0.91198168
Definition: The Gumbel distribution is a probability distribution for a continuous random variable that is used to model extreme values or tail events, such as the maximum or minimum of a set of random variables.
Assumptions: The continuous random variable has a location parameter and a scale parameter.
Application: The Gumbel distribution is commonly used in hydrology, finance, and engineering to model extreme events such as floods, stock market crashes, or earthquakes.
Real-life example: The maximum wind speed during a hurricane.
# Generate 10 Gumbel random variables with location parameter 0 and scale parameter 1
library(VGAM)
rgumbel(n = 10, location = 0, scale = 1)
## [1] 0.74291884 -0.41742555 0.02820533 1.89829290 -1.01948569 -0.08245474
## [7] -0.83061458 0.32728228 1.51593979 -1.06100949
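Without the VGAM package, Gumbel draws can be sketched in base R by inverse transform sampling, a swapped-in technique rather than a description of VGAM's internals: if U is uniform on (0, 1), then location - scale * log(-log(U)) follows the Gumbel distribution for maxima. The variable names, sample size, and seed are illustrative.

```r
set.seed(1)     # arbitrary seed
location <- 0
gscale <- 1     # scale parameter (named gscale to avoid shadowing base::scale)

# Inverse transform sampling: invert the Gumbel CDF exp(-exp(-(x - location) / gscale))
u <- runif(10000)
x <- location - gscale * log(-log(u))

median(x)                              # simulated median
location - gscale * log(log(2))        # theoretical median
```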
Definition: The Rayleigh distribution is a probability distribution for a non-negative continuous random variable that is used to model the magnitude of a two-dimensional vector whose components are independent, zero-mean normal random variables with equal variance, such as the wind speed or the amplitude of a radio signal.
Assumptions: The continuous random variable has a scale parameter.
Application: The Rayleigh distribution is commonly used in signal processing, telecommunications, and atmospheric sciences to model the distribution of magnitudes of random variables.
Real-life example: The amplitude of a radio signal received by a satellite.
# Generate 10 Rayleigh random variables with scale parameter 1
# (rrayleigh() is provided by the VGAM package loaded above)
rrayleigh(n = 10, scale = 1)
## [1] 0.3895749 1.6752714 0.9781631 0.3853101 0.8557723 0.1988550 0.5767059
## [8] 1.6113468 1.7625553 1.1167735
Thanks for your attention