Basic Probability Review

Hello everyone! I am currently reviewing some basic probability to prepare for an upcoming master's program in Statistics. In the process I have also been teaching myself how to code in R. In the R Markdown document below, I introduce 3 basic distributions and then visualize their probability density and cumulative distribution functions. I primarily used the ggplot2 and tibble packages to generate the plots. I cover the uniform, exponential, and normal distributions below.

One of my goals with R is to fully incorporate the tidyverse into my workflow. In the code below I use functions from ggplot2 and tibble, but loading the tidyverse attaches several other packages as well. See this link for a list and brief description of the core packages: https://www.tidyverse.org/packages/

library(tidyverse)

## -- Attaching packages --------

## v ggplot2 2.2.1     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.5
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts -----------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Uniform Distribution: Models any situation where outcomes are equally probable.

Let us suppose that “X” is a random variable whose possible values are the outcomes of an experiment. In the discrete case, we say that “X” has a uniform distribution iff each outcome has probability 1/N, where N is the number of possible outcomes. In the continuous case, we define parameters “a” and “b” as the endpoints of the distribution.

That is: Range = [a,b]

In the literature, “X ~ U(a,b)” is shorthand for “X follows a uniform distribution on the interval from a to b”. If X is a continuous random variable then there exist two functions: “f(x)” and “F(x)”, known as the probability density function (pdf) and the cumulative distribution function (cdf), respectively.

PDF:

We can define the PDF as a function “f(x)” such that the integral of f(x) over any interval [c, d] is the probability that X falls between c and d.

When X ~ U(a, b) the PDF: f(x) = 1/(b-a) for a <= x <= b, and f(x) = 0 everywhere else.

Integrating f(x) over the full range from a to b gives 1. Note that this is consistent with the notion that the probabilities of all outcomes of a given random variable sum to 1. In Fig. 1a, the region integrated has been shaded blue.

CDF: We can define the CDF as a function “F(x)” equal to the integral of “f(t)” from -inf to x, which is the probability that X is less than or equal to x. When X ~ U(a, b) the CDF: F(x) = (x-a)/(b-a) for a <= x <= b, with F(x) = 0 for x < a and F(x) = 1 for x > b.
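
As a quick numerical sanity check of the uniform formulas, the sketch below compares hand-written versions with base R's dunif and punif and integrates the density over [a, b]. The helper names uni_f and uni_F and the check points are illustrative only, and the endpoints are arbitrary.

```r
# Endpoints chosen arbitrarily for illustration
a <- -0.75
b <- 1.25

# Hand-written uniform pdf and cdf (hypothetical helper names)
uni_f <- function(x) ifelse(x >= a & x <= b, 1 / (b - a), 0)
uni_F <- function(x) pmin(pmax((x - a) / (b - a), 0), 1)

# A few points inside, outside, and on the endpoints of [a, b]
x_check <- c(-1, -0.75, 0, 0.5, 1.25, 2)

# Both comparisons should report TRUE
pdf_matches <- all.equal(uni_f(x_check), dunif(x_check, min = a, max = b))
cdf_matches <- all.equal(uni_F(x_check), punif(x_check, min = a, max = b))
pdf_matches
cdf_matches

# The density integrates to 1 over [a, b]
total_area <- integrate(uni_f, lower = a, upper = b)$value
total_area
```

Using dunif and punif directly is usually preferable in practice; the hand-written versions are only there to show that the formulas and the built-in functions agree.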

I’ll leave the generalities of PDF and CDF to another post. For now, the code below visualizes the distributions of a continuous random variable X, where X ~ U(a, b).

a <- -.75
b <- 1.25 #Values for a and b arbitrarily chosen
r <- seq(a, b, by = 1e-3) #1e-3 is arbitrary.
uni_pdf <- 1/(b-a)
uni_tbl <- tibble("Range" = r, "y" = rep(uni_pdf, length(r)), "CDF" = ((r-a)/(b-a))) #y is the constant density 1/(b-a); CDF is F(x) = (x-a)/(b-a)
uni_pdf_vis <- ggplot(uni_tbl) + geom_area(aes(Range, y), fill = "#2f57c2") + coord_cartesian(xlim = c(-1, 1.5), ylim =  c(0, .75)) +
   labs(title = "Figure 1a. Uniform Probability Density Function Visual")
uni_pdf_vis

#Notice that f(x) = 0 for all x < a and x > b.

uni_cdf_vis <- ggplot(uni_tbl) + geom_line(aes(Range, CDF)) +  labs(title = "Figure 1b. Uniform Cumulative Distribution Function Visual")
uni_cdf_vis

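Below, I visualize the empirical distributions of discrete uniform random variables (a discrete variable has a probability mass function rather than a pdf). To model a discrete variable, I draw 1e6 samples from a finite sample space, with each outcome having the same probability.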

i <- 1:6
j <- sample(i, 1e6, replace = TRUE)
dice_roll <- tibble("Roll" = j, "Outcome" = j) %>%
  ggplot() +
  geom_bar(aes(x = Outcome), color = "blue", fill = "white") + labs(title = "Figure 1c. Dice Rolls")
dice_roll

#If X is the outcome of a fair six-sided die, then we have 6 possible outcomes, 1, 2, 3, 4, 5, & 6, each with a probability of 1/6. The code above models 1,000,000 rolls of the die, the outcomes having a uniform distribution.
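
To put a number on how even the simulated rolls are, the following sketch tabulates the relative frequency of each face and compares it with the theoretical 1/6. The seed and the variable names (rolls, roll_props, max_dev) are arbitrary choices for illustration.

```r
set.seed(2018)  # arbitrary seed so the run is reproducible

# One million rolls of a fair six-sided die
rolls <- sample(1:6, 1e6, replace = TRUE)

# Relative frequency of each face
roll_props <- prop.table(table(rolls))
roll_props

# Largest deviation of any face from the theoretical probability 1/6
max_dev <- max(abs(roll_props - 1/6))
max_dev
```

With this many rolls the deviation should be tiny; rerunning with a smaller sample size makes the bars visibly less even.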


#Suppose we survey one million people on their favorite flavor to pair with chocolate from the list below, and that the outcome has a uniform distribution:
flavor <- c("Strawberry", "Cherry", "Caramel", "Sea Salt")
f <- sample(flavor, 1e6, replace=TRUE, prob = rep(1/length(flavor), length(flavor))) #prob is written out for clarity; by default sample() already draws each element with equal probability
flavor_flave <- tibble("Flavor" = f, "Outcome" = f) %>%
  ggplot() +
  geom_bar(aes(x=Outcome), color = "#fff200", fill = "#8f3703") + labs(x = "Flavor", y = "Response", title = "Figure 1d. Favorite flavor to Pair with Chocolate")
flavor_flave #visualizing uniform distribution

#From the visualization we see that there is a pretty even selection of all the flavors among one million people.

Exponential Distribution: Considered to be the continuous analogue to the geometric distribution. Models the time it takes for a continuous process to change state.

For example, suppose we want to model how long it takes for the next customer to come into a store. The change of state occurs when a new customer walks in, and the waiting time can be modeled exponentially.

We define the rate parameter, lambda, as the average number of state changes per unit of time, so the average time until the next change is 1/lambda. The range of the distribution is from 0 to inf, and very long waits become increasingly unlikely. This can be seen in Fig 2a, where the density tends to 0 as x approaches inf.

PDF: f(x) = lambda * exp(-lambda * x) for x >= 0

CDF: F(x) = 1 - exp(-lambda * x) for x >= 0
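
The sketch below checks these two formulas against base R's dexp and pexp, and uses simulated draws from rexp to illustrate that, under the rate parameterization these functions use, the mean waiting time is 1/lambda. The rate value, the seed, and the variable names are arbitrary and only for illustration.

```r
lambda <- 1.5                   # rate parameter, chosen arbitrarily
x_grid <- seq(0, 5, by = 0.1)

# Hand-written pdf and cdf from the formulas
pdf_manual <- lambda * exp(-lambda * x_grid)
cdf_manual <- 1 - exp(-lambda * x_grid)

# Both comparisons should report TRUE
pdf_matches <- all.equal(pdf_manual, dexp(x_grid, rate = lambda))
cdf_matches <- all.equal(cdf_manual, pexp(x_grid, rate = lambda))
pdf_matches
cdf_matches

# Simulated waiting times: the sample mean should be close to 1/lambda
set.seed(2018)                  # arbitrary seed for reproducibility
waits <- rexp(1e6, rate = lambda)
mean_wait <- mean(waits)
mean_wait
1 / lambda
```

A larger lambda means events arrive faster, so the density starts higher at x = 0 and drops off more quickly.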

There is much more to say about exponentially distributed variables. I find the applications in queueing theory, staffing a business to serve customers, and radioactive decay fascinating, and will explore these topics in a future post.

Below is code for visualizing the exponential pdf for 4 different values of lambda (a fifth, lambda = 10, is computed but left out of the plot), and the cdf for lambda = 1.5.

x <- seq(0, 5, by = 0.1)


lambda_5 <- .5
lambda_1 <- 1
lambda_15 <- 1.5
lambda_2 <- 2.0
lambda_10 <- 10

prob_density_fun_5 <- lambda_5 * (exp(-lambda_5*x))
prob_density_fun_1 <- lambda_1 * (exp(-lambda_1*x))
prob_density_fun_15 <- lambda_15  * (exp(-lambda_15 *x))
prob_density_fun_2 <- lambda_2  * (exp(-lambda_2 *x))
prob_density_fun_10 <- lambda_10 * (exp(-lambda_10*x))


prob_df <- tibble(x, L05 = prob_density_fun_5, L10 = prob_density_fun_1, L15 = prob_density_fun_15, L20 = prob_density_fun_2)


prob_df %>%
  ggplot() +
  geom_line(aes(x = x, y = L05), color = "green", size = .4) +
  geom_line(aes(x = x, y = L10), color = "blue", size = .4) +
  geom_line(aes(x = x, y = L15), color = "red", size = .4) +
  geom_line(aes(x = x, y = L20), size = .4) +
  labs(x = "x", y = "f(x)", title = "Figure 2a. Exponential PDFs of Varying Rates") +
  theme_light()

#How do I add a legend indicating what each color represents? i.e. : Green: Lambda = 0.5...
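
One common way to get a legend is to reshape the table to long format and map color to the column that identifies lambda, so that ggplot2 builds the legend itself. The sketch below assumes tidyr 1.0.0 or later for pivot_longer (older versions, such as the one attached above, offer gather for the same job); the names prob_long, lambda_label, and exp_pdf_legend are hypothetical.

```r
library(tibble)
library(tidyr)
library(ggplot2)

x <- seq(0, 5, by = 0.1)

# Wide table: one density column per rate, as in the plot above
prob_df <- tibble(
  x,
  L05 = dexp(x, rate = 0.5),
  L10 = dexp(x, rate = 1),
  L15 = dexp(x, rate = 1.5),
  L20 = dexp(x, rate = 2)
)

# Long table: one row per (x, lambda) pair
prob_long <- pivot_longer(
  prob_df,
  cols = -x,
  names_to = "lambda_label",
  values_to = "density"
)

# Mapping color inside aes() is what makes ggplot2 draw a legend
exp_pdf_legend <- ggplot(prob_long) +
  geom_line(aes(x = x, y = density, color = lambda_label)) +
  scale_color_manual(
    name = "Lambda",
    breaks = c("L05", "L10", "L15", "L20"),
    values = c(L05 = "green", L10 = "blue", L15 = "red", L20 = "black"),
    labels = c("0.5", "1.0", "1.5", "2.0")
  ) +
  labs(x = "x", y = "f(x)", title = "Exponential PDFs of Varying Rates") +
  theme_light()

exp_pdf_legend
```

The key difference from the version above is that color sits inside aes(): colors set outside aes() are fixed styling and never produce a legend.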


exp_dist <- 1 - (exp(-lambda_15*x)) 
exp_dist_df <- tibble(x, exp_dist) %>%
  ggplot() +
  geom_line(aes(x = x, y = exp_dist)) +
  labs(x = "x", y = "F(x)", title = "Figure 2b. Exponential CDF")  +
  theme_light()

exp_dist_df

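Normal Distribution: A very common continuous probability distribution. Made especially useful by the Central Limit Theorem, under which suitably scaled sums and averages of many independent random variables are approximately normal. Also known as the Gaussian, Gauss, or Laplace-Gauss distribution, or more informally the bell curve. The standard normal is the special case with mu = 0 and sigma = 1.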

Parameters: mu and sigma
mu = mean
sigma = standard deviation
Range: (-inf, inf)
Density: f(x) = 1/(sigma * sqrt(2*pi)) * exp(-((x-mu)^2)/(2*sigma^2)), given by the dnorm function
Distribution: F(x), given by the pnorm function
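
As a small check of the density formula, the sketch below evaluates it by hand and compares the result with dnorm, then confirms numerically that pnorm is the integral of the density. The values mu = 2.5 and sigma = 1 match the ones used for the figures in this section; the grid and variable names are illustrative.

```r
mu <- 2.5      # mean
sigma <- 1     # standard deviation
x_grid <- seq(0, 5, by = 0.1)

# Density written out from the formula
dens_manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-((x_grid - mu)^2) / (2 * sigma^2))

# Should report TRUE
dens_matches <- all.equal(dens_manual, dnorm(x_grid, mean = mu, sd = sigma))
dens_matches

# pnorm(q) is the area under the density to the left of q
area_num <- integrate(dnorm, lower = -Inf, upper = 3, mean = mu, sd = sigma)$value
area_num
pnorm(3, mean = mu, sd = sigma)

# Probability of landing within one standard deviation of the mean (roughly 0.68)
within_one_sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)
within_one_sd
```

The last value does not depend on the particular mu and sigma chosen, which is one reason the normal distribution is usually described in units of standard deviations.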

The normal distribution deserves its own post, as it is the most widely applied. Below are the most recognizable curves in stats. Figures 3a and 3b use the base R functions pnorm and dnorm to compute the values, with mean 2.5 and standard deviation 1.

norm_dist_den <- tibble(x, distribution = pnorm(x, mean = 2.5, sd = 1), density = dnorm(x, mean = 2.5, sd = 1))

norm_df_dist <- norm_dist_den %>%
  ggplot() +
  geom_line(aes(x, distribution)) + labs(title = "Figure 3a. Normal CDF") + theme_void()


norm_df_den <- norm_dist_den %>%
  ggplot() +
  geom_line(aes(x, density)) + labs(title = "Figure 3b. Normal PDF") + theme_void()


norm_df_den

norm_df_dist

Joseph Oliveira

July 12, 2018