Discussion 3

A: Please explain each of the 3 distributions in less than 4 sentences.

Normal Distribution: A normal distribution is symmetric and centered on its mean, so the mean, median, and mode are all equal. It describes where most values of a variable fall: values near the mean are most likely, and values far from it are increasingly rare. For example, adult height is approximately normally distributed. The mean sets the center of the curve, whereas the standard deviation sets how wide the curve is.
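As a rough illustration of how the standard deviation controls the width, the sketch below checks that about 68% of values fall within one standard deviation of the mean, whatever the mean and standard deviation are. The helper name within_one_sd and the example values are assumptions chosen here for illustration.

```r
# Probability that a normal variable lands within one SD of its mean
within_one_sd <- function(mu, sigma) {
  pnorm(q = mu + sigma, mean = mu, sd = sigma) -
    pnorm(q = mu - sigma, mean = mu, sd = sigma)
}

p_standard <- within_one_sd(mu = 0, sigma = 1)     # standard normal
p_heights  <- within_one_sd(mu = 170, sigma = 5)   # illustrative height scale in cm

print(c(p_standard, p_heights))   # both are roughly 0.68
```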

Binomial Distribution: A binomial distribution gives the probability of a given number of successes over a fixed number of trials, where each trial has only two possible outcomes: success or failure. The outcomes must be mutually exclusive, the random variable is a count of successes, the probability of success must be the same on every trial, and each trial must be independent (one has no effect on another).
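A small sketch of the binomial setup, assuming a hypothetical experiment of 10 independent trials with a 0.5 chance of success on each (these numbers are illustrative and not from the discussion above):

```r
n_trials  <- 10    # hypothetical number of trials
p_success <- 0.5   # hypothetical probability of success on each trial

# Probability of exactly 3 successes in 10 trials
p_three <- dbinom(x = 3, size = n_trials, prob = p_success)

# The probabilities of every possible count (0 through 10) sum to 1
p_all <- sum(dbinom(x = 0:n_trials, size = n_trials, prob = p_success))

print(p_three)
print(p_all)
```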

Poisson Distribution: Poisson distributions give the probability of an event happening a specific number of times within a given interval of time or space. Poisson primarily uses the mean number of events to calculate this probability. For example, what is the probability that exactly 2 prospects become customers of a tech company in a year? The event count is 2 prospects becoming customers, the interval is one year, and the mean, lambda, is the average number of prospects who become customers in a given year, which is, say, 5.
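The prospect example can be checked directly in R. This is a minimal sketch that assumes the stated mean of 5 conversions per year; the variable names are chosen here for illustration.

```r
lambda_per_year <- 5   # average number of prospects who convert in a year

# Probability that exactly 2 prospects convert in a year
p_two <- dpois(x = 2, lambda = lambda_per_year)

# The same value from the Poisson formula: exp(-lambda) * lambda^x / x!
p_two_manual <- exp(-lambda_per_year) * lambda_per_year^2 / factorial(2)

print(c(p_two, p_two_manual))   # both are roughly 0.084
```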

B: Explain what the pdf and cdf of a distribution measures. Pick any of the three distributions (or a distribution from the list above that we have not covered in class), and provide some intuition as to if the pdf formula makes sense or not.

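The PDF measures the relative likelihood (density) of X taking on a particular value within its range of values. For a continuous variable, the probability of an event is the area under the PDF curve over a specified range, not the height of the curve at a single point. The PDF makes sense to use with the normal distribution. For example, say we wanted to understand the likelihood of dating a male with a height of about 6’0”, and say we know the average male height is 5’7” and the standard deviation is 2”. The normal PDF is highest at the mean and falls off symmetrically on either side, which matches the intuition that heights near 5’7” are common and heights near 6’0” (2.5 standard deviations above the mean) are rare. We can calculate the probability of selecting a male in a range around 6’0” by taking the area under the PDF over that range.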
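To make the difference between density and probability concrete, here is a hedged sketch using the heights in inches (5’7” = 67, 6’0” = 72). The half-inch window around 6’0” is an assumption chosen only for illustration.

```r
mu_in    <- 67   # mean height in inches (5'7")
sigma_in <- 2    # standard deviation in inches

# Height of the PDF curve at exactly 72 inches: a density, not a probability
density_at_72 <- dnorm(x = 72, mean = mu_in, sd = sigma_in)

# Probability of a height within half an inch of 6'0" (assumed window):
# the area under the PDF between 71.5 and 72.5 inches
area_numeric <- integrate(dnorm, lower = 71.5, upper = 72.5,
                          mean = mu_in, sd = sigma_in)$value

# The same area from the CDF
area_cdf <- pnorm(q = 72.5, mean = mu_in, sd = sigma_in) -
  pnorm(q = 71.5, mean = mu_in, sd = sigma_in)

print(c(density = density_at_72, area_numeric = area_numeric, area_cdf = area_cdf))
```

The two area values agree, which shows that the probability comes from integrating the PDF over a range.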

The CDF describes the probability that a random variable X takes on a value less than or equal to a specified value x. The CDF fits with both the normal and Poisson distributions. For example, with height we can use the CDF to determine the probability that a normally distributed height X is less than or equal to x, say 5’9”. Because the normal distribution is symmetric about its mean, the CDF at the mean of 5’7” is 0.5: there is a 50% chance that a randomly selected height falls at or below 5’7”. The CDF also works with Poisson. Say we count visits to a website in a given hour. Given the average number of visits per hour, we can use the CDF to determine the probability of 200 visitors or fewer in an hour.
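Both CDF examples can be sketched in R. The heights are in inches (5’7” = 67, 5’9” = 69); the average of 180 visits per hour is a hypothetical value, since no average is given above.

```r
# Normal CDF at the mean: half of the distribution lies at or below it
p_at_mean <- pnorm(q = 67, mean = 67, sd = 2)

# Probability that a height is less than or equal to 5'9"
p_below_69 <- pnorm(q = 69, mean = 67, sd = 2)

# Poisson CDF: probability of 200 visits or fewer in an hour
lambda_visits <- 180   # hypothetical average visits per hour
p_200_or_fewer <- ppois(q = 200, lambda = lambda_visits)

# The CDF is the running total of the individual probabilities for 0, 1, ..., 200
p_200_summed <- sum(dpois(x = 0:200, lambda = lambda_visits))

print(c(p_at_mean, p_below_69, p_200_or_fewer, p_200_summed))
```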

C. What are the key parameters that define the 3 distributions above (or a distribution from the list above)? Does R require these key parameters to be declared?

For the normal distribution, the key components (shown here as the arguments to pnorm) are:

  • q, the quantile, i.e. the value of the variable in question

  • mean, which defaults to 0, so R does not require it to be declared

  • sd, the standard deviation, which defaults to 1, so R does not require it either

  • lower.tail, which lets us choose between probabilities less than or equal to q (TRUE, the default) and greater than q (FALSE)

  • log.p, which lets us choose whether probabilities are returned as logs (FALSE by default)

For the Poisson distribution, the key components are:

  • q, the quantile, i.e. the count in question

  • lambda, which is the mean; it has no default, so R requires it to be declared

  • lower.tail and log.p, which behave the same as in the normal distribution function

For the binomial distribution, the key components are:

  • q, the quantile, i.e. the number of successes in question

  • size, which is the number of trials; it has no default, so R requires it to be declared

  • prob, which is the probability of success on each trial; it also has no default

  • lower.tail and log.p, which behave the same as described for the normal distribution

?Distributions
?dpois
?dbinom
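Whether R requires a parameter can be tested by leaving it out. The sketch below is illustrative only: it relies on mean and sd having defaults in pnorm, and on lambda and size having no defaults in dpois and dbinom.

```r
# mean and sd have defaults (0 and 1), so these two calls are equivalent
same_normal <- identical(pnorm(q = 1.96),
                         pnorm(q = 1.96, mean = 0, sd = 1))

# lambda has no default, so omitting it raises an error
pois_needs_lambda <- inherits(
  tryCatch(dpois(x = 2), error = function(e) e),
  "error"
)

# size has no default either
binom_needs_size <- inherits(
  tryCatch(dbinom(x = 2, prob = 0.5), error = function(e) e),
  "error"
)

print(c(same_normal, pois_needs_lambda, binom_needs_size))
```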

D. Give a few examples of situations that can be modeled with each of the 3 distributions above.

Normal distribution examples: heights of people, ocean temps at a particular location, SAT scores.

Poisson distribution examples: tree seedlings emerging from the forest floor, bugs occurring in computer code, defects occurring along a strand of yarn.

Binomial distribution examples: medical trials, toxicity tests, quality control (item is defective or it’s not).

E. Plot the distribution in part B

Let’s say the average male height is 5’7” and the standard deviation is 2”. What is the probability of dating a male whose height is between 5’9” and 6’2”? For calculation’s sake, I will use centimeters (5’7” = 170.18 cm, 2” = 5.08 cm, 5’9” = 175.26 cm, 6’2” = 187.96 cm).

#set the mean and standard deviation
mu <- 170.18
sigma <- 5.08 

#Generate a range of values around the mean
x <- seq(from       = mu - 3*sigma,
         to         = mu + 3*sigma,
         length.out = 200
         )

#Calculate the PDF
pdf <- dnorm(x    = x,
             mean = mu,
             sd   = sigma
             )

#Plot the normal distribution
plot(x    = x,
     y    = pdf,
     type = 'l',
     col  = 'red',
     lwd  = 2,
     xlab = 'Height',
     ylab = 'Density',
     main = 'Normal Distribution with Mean 170cm and SD 5cm' 
     )

Let’s plot the area to understand probability of dating a male between 5’9” and 6’2”.

plot(x    = x,
     y    = pdf,
     type = 'l',
     col  = 'blue',
     lwd  = 2,
     xlab = 'Height',
     ylab = 'Density',
     main = 'Normal Distribution with Mean 170cm and SD 5cm'
     )

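# Shade the area under the curve between 175.26 cm (5'9") and 187.96 cm (6'2")
# Note: the curve above was drawn only out to mu + 3*sigma (185.42 cm)
x_shade <- seq(from       = 175.26,
               to         = 187.96,
               length.out = 1000
               )

pdf_shade <- dnorm(x    = x_shade,
                   mean = mu,
                   sd   = sigma
                   )

# Trace the curve from left to right, then return along the baseline (see ?rev)
polygon(x      = c(x_shade, rev(x_shade)),
        y      = c(pdf_shade, rep(x = 0, times = length(pdf_shade))),
        col    = 'red',
        border = NA
        )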

Let’s calculate the probability

pnorm(q = 187.96, mean = mu, sd = sigma) - pnorm(q = 175.26, mean = mu, sd = sigma)
## [1] 0.1584226
#explicitly state the lower.tail argument
pnorm(q = 187.96, mean = mu, sd = sigma, lower.tail = TRUE) - pnorm(q = 175.26, mean = mu, sd = sigma, lower.tail = TRUE)
## [1] 0.1584226
#same result from the upper tails, using P(X <= b) = 1 - P(X > b)
1 - pnorm(q = 187.96, mean = mu, sd = sigma, lower.tail = FALSE) - (1 - pnorm(q = 175.26, mean = mu, sd = sigma, lower.tail = FALSE))
## [1] 0.1584226

The probability of dating a male between 5’9” and 6’2” is 0.1584, or about 15.8%.
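As a cross-check, the same probability can be computed in inches or from standardized z-scores, since converting units rescales the mean, the standard deviation, and the cutoffs together. This is a sketch that assumes 2.54 cm per inch, which is the conversion behind the centimeter values used above.

```r
cm_per_inch <- 2.54

# In inches: mean 5'7" = 67, SD = 2, range 5'9" = 69 to 6'2" = 74
p_inches <- pnorm(q = 74, mean = 67, sd = 2) - pnorm(q = 69, mean = 67, sd = 2)

# In centimeters: every quantity is scaled by the same factor
p_cm <- pnorm(q = 74 * cm_per_inch, mean = 67 * cm_per_inch, sd = 2 * cm_per_inch) -
  pnorm(q = 69 * cm_per_inch, mean = 67 * cm_per_inch, sd = 2 * cm_per_inch)

# Standardized: the cutoffs are 1 and 3.5 standard deviations above the mean
z_lower <- (69 - 67) / 2
z_upper <- (74 - 67) / 2
p_z <- pnorm(q = z_upper) - pnorm(q = z_lower)

print(c(p_inches, p_cm, p_z))   # all three are roughly 0.158
```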

PART II. Convergence of Distributions

Pick your own values of N, x, and pi. x is necessarily less than or equal to N, and pi is a fixed probability of success. The probability should be greater than or equal to x.

Then model both as a binomial and a Poisson, and provide your R code solutions. Do you get similar answers or not under the two different distributional assumptions, and can you guess why?

#modeled as Binomial

# What is N?  50
# What is $\pi$?  .12 or 6/50
# x = 4


?dbinom  

dbinom(x    = 4,
       size = 50, 
       prob = .12
       )                  
## [1] 0.1334203

The answer is 0.1334.

#modeled as Poisson 

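N      <- 50
p      <- .12     # named p to avoid overwriting R's built-in constant pi
lambda <- N * p   # expected number of successes: 50 * .12 = 6

dpois(x      = 4,
      lambda = lambda
      )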
## [1] 0.1338526

The answer is 0.1339.

The results are very similar between the Poisson and the binomial (0.1339 vs. 0.1334). This is likely because the Poisson distribution approximates the binomial distribution when the number of trials is large and the probability of success is small, with lambda set to N times pi (here 50 × .12 = 6).
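One way to see the convergence is to hold the expected count at 6 and let N grow, so that the probability of success shrinks. This is a sketch only; the values of N are chosen here for illustration.

```r
lambda   <- 6                      # expected count, held fixed
x        <- 4                      # number of successes of interest
N_values <- c(10, 50, 500, 5000)   # illustrative numbers of trials

# Binomial probability with prob = lambda / N, so that N * prob stays at 6
binom_probs <- dbinom(x = x, size = N_values, prob = lambda / N_values)

# Poisson probability with the same mean
pois_prob <- dpois(x = x, lambda = lambda)

comparison <- data.frame(N        = N_values,
                         binomial = binom_probs,
                         poisson  = pois_prob,
                         abs_diff = abs(binom_probs - pois_prob)
                         )
print(comparison)
```

The abs_diff column shrinks as N increases, and the N = 50 row reproduces the two answers above.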