1. Define discrete distribution and continuous distribution

A discrete distribution describes the probability of each value of a discrete random variable, that is, a random variable whose possible values are countable, such as the non-negative integers. In a discrete probability distribution, each possible value of the random variable can be assigned a non-zero probability.

A continuous distribution describes the probabilities of the possible values of a continuous random variable, which is a random variable with a set of possible values (known as the range) that is infinite and uncountable.

For a continuous random variable X, probabilities are defined as areas under the curve of its PDF. Thus, only ranges of values can have a nonzero probability. The probability that a continuous random variable equals any single value is always zero.
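
The area-under-the-curve idea can be written out as follows. Here f(x) stands for the PDF of X, and a, b, and c are arbitrary constants; all four symbols are notation introduced for illustration, not taken from the assignment.

$$
P(a \le X \le b) = \int_a^b f(x)\,dx, \qquad P(X = c) = \int_c^c f(x)\,dx = 0
$$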

2. Define the probability mass function (PMF) and the probability density function (PDF)

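The probability density function is a non-negative function that describes the distribution of a continuous random variable. Its height at a point is a density, not a probability; the probability that the variable falls in an interval is the area under the curve over that interval.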

A probability mass function is a function that gives the probability that a discrete random variable is exactly equal to some value.
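
In symbols, the two definitions can be summarized as below. The names p(x) for a PMF and f(x) for a PDF are assumed here for illustration.

$$
p(x) = P(X = x), \quad 0 \le p(x) \le 1, \quad \sum_x p(x) = 1; \qquad f(x) \ge 0, \quad \int_{-\infty}^{\infty} f(x)\,dx = 1
$$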

3. Define binomial distribution and normal distribution (give the formula and conditions for its use)

A binomial distribution is the probability distribution of the number of successes in a fixed number of trials, each of which has the same probability of success. It applies when the number of trials is fixed in advance, each trial has only two possible outcomes (success or failure), and the trials are independent, so that the outcome of one trial does not affect the probability of success on the next.

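The binomial formula for the probability of k successes out of n trials, where p is the probability of success on a single trial, is

$$
P(k \text{ out of } n) = \frac{n!}{k!\,(n-k)!}\; p^k (1-p)^{n-k}
$$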

A normal distribution is a continuous probability distribution whose density is a symmetric, bell-shaped curve centered at its mean. Many random variables are approximately normally distributed.
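
Since the question asks for the formula, the normal density with mean μ and standard deviation σ is usually written as below. The conditions that follow are the standard textbook ones, offered as a sketch rather than as the assignment's official answer.

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad -\infty < x < \infty
$$

It is typically used for a continuous variable whose distribution is roughly symmetric and unimodal. It is also used as an approximation to the binomial when n is large; a common rule of thumb requires both np and n(1-p) to be at least 5.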

4. One goal of a study conducted on a large group of eighth-graders in Minneapolis was to calculate the sensitivity and specificity of a diagnostic test for cigarette use. Levels of saliva thiocyanate (SCN) were measured in an attempt to predict smoking status; SCN >= 100 defined a positive diagnostic test result. The students were asked to report the number of cigarettes they had smoked in the previous week, and the saliva samples were taken from each individual to conduct the test. For the following questions, assume the self reports of the number of cigarettes smoked are accurate. The table below displays the breakdown of SCN levels and self reported cigarettes smoked per week. WRITE ALL FORMULAS USED AND SHOW ALL WORK!

  1. What is the sensitivity of this test for
     1. light smokers?

        Sensitivity = P(T+|D+) = 0.050
     2. moderate smokers?

        Sensitivity = P(T+|D+) = 0.326
     3. heavy smokers?

        Sensitivity = P(T+|D+) = 0.652
  1. Find the specificity of the test

    False positive rate = P(T+|D-) = 0.033 = 1 - specificity
    
    Specificity = P(T-|D-) = 1 - false positive rate = 1 - 0.033 = 0.967
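
The formulas behind these values can be written in terms of cell counts. The symbols TP, FN, TN, and FP (true positives, false negatives, true negatives, false positives) are notation assumed here; the counts themselves come from the table referenced in the question.

$$
\text{Sensitivity} = P(T^+ \mid D^+) = \frac{TP}{TP + FN}, \qquad \text{Specificity} = P(T^- \mid D^-) = \frac{TN}{TN + FP} = 1 - \frac{FP}{TN + FP}
$$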

5. Among the individuals 65 years of age or older who undergo open heart surgery, 70% are still alive 90 days after the operation (i.e., they survive past 90 days). Of those who survive 90 days, 75% are still alive 5 years after surgery.

  1. What is the probability that exactly 4 out of 10 individuals 65 years of age or older will die within 90 days of the surgery (by hand and show all work)?
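
A by-hand sketch for this part, assuming X denotes the number of the 10 patients who die within 90 days, so that X is binomial with n = 10 and p = 1 - 0.70 = 0.30:

$$
P(X = 4) = \binom{10}{4}(0.3)^4(0.7)^6 = 210 \times 0.0081 \times 0.117649 \approx 0.2001
$$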

  1. What is the probability that 4 or fewer people die within 90 days of the operation (by hand and show all work)?
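
A by-hand sketch for this part, again assuming X is the number of deaths within 90 days among 10 patients, with n = 10 and p = 0.30. The five terms are P(X = 0) through P(X = 4), each rounded to four decimals:

$$
\begin{aligned}
P(X \le 4) &= \sum_{k=0}^{4} \binom{10}{k}(0.3)^k(0.7)^{10-k} \\
&\approx 0.0282 + 0.1211 + 0.2335 + 0.2668 + 0.2001 \\
&\approx 0.8497
\end{aligned}
$$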

  1. Calculate 5(a) and 5(b) using STATA to check your answers. (Show your code used)
#5a
dbinom(4, size=10, prob=0.3)
## [1] 0.2001209
#5b
pbinom(4, size=10, prob=0.3)
## [1] 0.8497317
sum(dbinom(0:4, size=10, prob=0.3))
## [1] 0.8497317
  1. Among samples of 10 individuals, find the expected value, variance, and standard deviation for the number of people who SURVIVE 90 days.
# n = number of trials, p = probability of surviving 90 days post-op, q = probability of dying

n=10
p=0.7
q=0.3
c(n,p,q)
## [1] 10.0  0.7  0.3
#expected value

ev= n*p
ev
## [1] 7
#variance

variance=n*p*q
variance
## [1] 2.1
#std

std=sqrt(n*p*q)
std
## [1] 1.449138
  1. What is the probability that a patient survives 5 years after having open heart surgery? Hint: Use properties of conditional probability
# P(survive 90 days) = 0.7. P(alive at 5 years | survived 90 days) = 0.75.
# Surviving 5 years requires surviving 90 days, so the two probabilities are multiplied.

0.75*0.7
## [1] 0.525
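
The multiplication above follows from the definition of conditional probability. Writing S90 for the event "survives 90 days" and S5 for "survives 5 years" (event names assumed here for illustration), a patient who survives 5 years has necessarily survived 90 days, so:

$$
P(S_5) = P(S_5 \cap S_{90}) = P(S_5 \mid S_{90})\,P(S_{90}) = 0.75 \times 0.70 = 0.525
$$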
  1. In a sample of 10 patients, what is the probability that exactly 2 survive 5 years after their operations?
dbinom(2, size=10, prob=0.525)
## [1] 0.03214253
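
The same result can be sketched by hand, assuming Y denotes the number of the 10 patients who survive 5 years, so that Y is binomial with n = 10 and p = 0.525:

$$
P(Y = 2) = \binom{10}{2}(0.525)^2(0.475)^8 \approx 45 \times 0.275625 \times 0.0025915 \approx 0.0321
$$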

6. Suppose we have a sample of five phlebotomists (similar with respect to many personal characteristics) who were exposed to Hepatitis B via a needle stick accident. Suppose that it is reasonable to assume a health worker exposed to Hep-B via needle stick has a 30% chance of developing the disease. Let X represent how many of the five phlebotomists develop Hepatitis B.

  1. Why would the binomial distribution provide an appropriate model? Hint: Remember the acronym B.I.N.S.

    Binary: the outcome of each trial is binary (e.g. the phlebotomist either develops the disease or does not)

    Independent: the outcome of each trial is independent (e.g. one phlebotomist developing the disease does not affect whether another does)

    Number: there is a fixed number of trials (e.g. five phlebotomists)

    Same: the probability of developing the disease is the same for each of the five phlebotomists.
  2. What are the parameters of the distribution of the values of X?

    n = 5 (number of trials) and p = 0.3 (probability of developing the disease); q = 1 - p = 0.7 is the probability of not developing the disease
  3. List the possible values for X

    Possible values for X = 0, 1, 2, 3, 4, 5
  4. What is the mean number of phlebotomists who will develop Hep-B via needle stick accident?

# n = number of trials, p = probability of developing disease, q = probability of not developing disease

n=5
p=0.3
q=0.7
c(n,p,q)
## [1] 5.0 0.3 0.7
#expected value

ev= n*p
ev
## [1] 1.5
  1. What is the standard deviation from this mean number?
# n = number of trials, p = probability of developing disease, q = probability of not developing disease

n=5
p=0.3
q=0.7
c(n,p,q)
## [1] 5.0 0.3 0.7
#standard deviation

std=sqrt(n*p*q)
std
## [1] 1.024695
  1. In how many ways can the five phlebotomists be ordered?
factorial(5)
## [1] 120
  1. Without regard to order, in how many unique ways can you select one phlebotomist from this group of five?
choose(5,1)
## [1] 5
  1. What is the probability that exactly 1 phlebotomist develops the disease? (Use the formula by hand then check in STATA)

dbinom(1, size=5, prob=0.3)
## [1] 0.36015
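
A by-hand version of this calculation, with X binomial with n = 5 and p = 0.3 as defined in the question:

$$
P(X = 1) = \binom{5}{1}(0.3)^1(0.7)^4 = 5 \times 0.3 \times 0.2401 = 0.36015
$$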
  1. What is the probability that none of the phlebotomists develop the disease? (Use the formula by hand then check in STATA)

dbinom(0, size=5, prob=0.3)
## [1] 0.16807
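
A by-hand version of this calculation, with X binomial with n = 5 and p = 0.3 as defined in the question:

$$
P(X = 0) = \binom{5}{0}(0.3)^0(0.7)^5 = (0.7)^5 = 0.16807
$$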
  1. What is the probability that at least 3 of the phlebotomists develop the disease? (Use the formula by hand then check in STATA)

dbinom(3, size=5, prob=0.3)+dbinom(4, size=5, prob=0.3)+dbinom(5, size=5, prob=0.3)
## [1] 0.16308
1 - pbinom(2, size=5, prob=0.3)
## [1] 0.16308
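
A by-hand version of this calculation, with X binomial with n = 5 and p = 0.3 as defined in the question:

$$
\begin{aligned}
P(X \ge 3) &= \binom{5}{3}(0.3)^3(0.7)^2 + \binom{5}{4}(0.3)^4(0.7)^1 + \binom{5}{5}(0.3)^5 \\
&= 0.1323 + 0.02835 + 0.00243 \\
&= 0.16308
\end{aligned}
$$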
  1. What is the probability that no more than 1 of the phlebotomists develops the disease? (Use the formula by hand then check in STATA)

dbinom(0, size=5, prob=0.3)+dbinom(1, size=5, prob=0.3)
## [1] 0.52822
pbinom(1, size=5, prob=0.3)
## [1] 0.52822
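
A by-hand version of this calculation, with X binomial with n = 5 and p = 0.3 as defined in the question:

$$
P(X \le 1) = (0.7)^5 + \binom{5}{1}(0.3)(0.7)^4 = 0.16807 + 0.36015 = 0.52822
$$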

7. According to data from the CDC in 2010, 19.3% of adults age eighteen and older smoke cigarettes. In the year 2008, the incidence rate of lung cancer was 65.1 cases per 100,000 people per year. Suppose you are conducting a lung cancer study in the United States, and you obtain a random sample of 2,000 adults (over 18 years of age) who do not have lung cancer. You plan to follow this study cohort over a period of 5 years and observe incident cases of lung cancer. Smoking status is an important predictor of lung cancer incidence. Therefore, as the study designer, it is important to think about baseline smoking rates in your study cohort. We first model the number of smokers in the study cohort using the binomial distribution, and assume that this cohort is a representative sample from the US population. Use the binomial distribution to answer the parts below:

  1. How many smokers would you expect to see in the study cohort, on average?

    E(X) = n * p = 2000 adults * 0.193 = 386 smokers

  2. What is the standard deviation of the number of smokers in the study cohort?

# n = number of people, p = probability of being a smoker, q = probability of not being a smoker

n=2000
p=0.193
q=1-p
c(n,p,q)
## [1] 2000.000    0.193    0.807
#standard deviation

std=sqrt(n*p*q)
std
## [1] 17.64942
  1. What is the probability that you observe exactly 386 smokers? (Write the formula but you can calculate in STATA)

dbinom(386, size=2000, prob=0.193)
## [1] 0.0225986
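
The formula being evaluated, assuming X denotes the number of smokers in the cohort, so that X is binomial with n = 2000 and p = 0.193 (the numerical value is the one returned by the software above):

$$
P(X = 386) = \binom{2000}{386}(0.193)^{386}(0.807)^{1614} \approx 0.0226
$$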
  1. What is the probability that greater than or equal to 25% of the study population are smokers? (Write the formula but you can calculate in STATA)

# 25% of study population is .25*2000 = 500

1- pbinom(499, size=2000, prob=0.193)
## [1] 2.402751e-10
pbinom(499, 2000, 0.193, lower.tail = FALSE, log.p = FALSE)
## [1] 2.402751e-10
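
The formula being evaluated, assuming X denotes the number of smokers in the cohort, so that X is binomial with n = 2000 and p = 0.193. Because X takes integer values, "at least 500" is the complement of "at most 499":

$$
P(X \ge 500) = 1 - P(X \le 499) = 1 - \sum_{k=0}^{499} \binom{2000}{k}(0.193)^k(0.807)^{2000-k} \approx 2.4 \times 10^{-10}
$$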
  1. What is the probability that less than or equal to 20% of the study population are smokers? (Write the formula but you can calculate in STATA)

# 20% of study population is .20*2000 = 400

pbinom(400, size=2000, prob=0.193)
## [1] 0.7948741
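
The formula being evaluated, assuming X denotes the number of smokers in the cohort, so that X is binomial with n = 2000 and p = 0.193:

$$
P(X \le 400) = \sum_{k=0}^{400} \binom{2000}{k}(0.193)^k(0.807)^{2000-k} \approx 0.7949
$$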