Definitions

  1. Define discrete distribution and continuous distribution

A discrete distribution describes the probability of occurrence of each value of a discrete random variable. A discrete random variable can assume only a finite or countably infinite number of values.

A continuous distribution describes the probabilities of the possible values of a continuous random variable. A continuous random variable can take on any value in a specified interval.

  1. Define the probability mass function (PMF) and the probability density function (PDF)

A probability mass function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to some value.

A probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.

  1. Define binomial distribution and normal distribution (give the formula and conditions for its use)

A binomial distribution gives the probability of observing a given number of SUCCESS outcomes in an experiment or survey that is repeated a fixed number of times. Each trial has only two possible outcomes, SUCCESS or FAILURE (the prefix “bi” means two, or twice), the trials are independent, and the probability of SUCCESS is the same on every trial.

The formula: b(x; n, P) = nCx * P^x * (1 – P)^(n – x)

Where: b = binomial probability; x = total number of “successes” (pass or fail, heads or tails, etc.); P = probability of a success on an individual trial; n = number of trials; and nCx = n! / [x!(n – x)!] is the number of ways to choose x successes out of n trials.
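
The binomial formula can be sketched in a few lines of Python as a cross-check on hand calculations. This is only an illustration: the function name `binom_pmf` and the parameters n = 10, P = 0.3 are chosen here for the example and are not part of the definition.

```python
from math import comb


def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Binomial(n, p): nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Illustrative parameters only (assumed for this sketch).
n, p = 10, 0.3
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]
total = sum(pmf)  # a valid PMF sums to 1, up to floating-point rounding

for x, prob in enumerate(pmf):
    print(f"P(X = {x}) = {prob:.4f}")
print(f"sum of all probabilities = {total:.6f}")
```

Summing the probabilities over every possible x is a quick way to confirm that the formula was typed correctly.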

The normal distribution (also known as the Gaussian distribution or bell-shaped curve) is the most widely used probability distribution in statistics. It is heavily used in estimation and hypothesis testing and can be used to approximate many other distributions, such as the binomial. The area under the curve adds up to 1, and about 68% of the area lies within 1 SD of the mean.

Formula: Y = { 1/[ σ * sqrt(2π) ] } * e^( –(x – μ)^2 / (2σ^2) ), where μ is the mean and σ is the standard deviation
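
As a rough numerical check of the normal density and the 68% rule, the following Python sketch evaluates the formula directly and obtains areas from the error function. The helper names `normal_pdf` and `normal_cdf` are assumptions made for this example.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of N(mu, sigma^2) at x, written exactly as in the formula."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))


def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(X <= x) for N(mu, sigma^2), expressed through the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


peak = normal_pdf(0.0)  # height of the standard normal curve at its mean
within_one_sd = normal_cdf(1.0) - normal_cdf(-1.0)  # area within 1 SD

print(f"density at the mean: {peak:.4f}")
print(f"area within 1 SD of the mean: {within_one_sd:.4f}")
```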

PROBABILITY AND DIAGNOSTICS TESTING

  1. One goal of a study conducted on a large group of eighth-graders in Minneapolis was to calculate the sensitivity and specificity of a diagnostic test for cigarette use. Levels of saliva thiocyanate (SCN) were measured in an attempt to predict smoking status; SCN ≥ 100 defined a positive diagnostic test result. The students were asked to report the number of cigarettes they had smoked in the previous week, and the saliva samples were taken from each individual to conduct the test. For the following questions, assume the self reports of the number of cigarettes smoked are accurate. The table below displays the breakdown of SCN levels and self reported cigarettes smoked per week. WRITE ALL FORMULAS USED AND SHOW ALL WORK!!
  1. What is the sensitivity of this test for:

For smokers, Sensitivity = P(T+ | D+) = TP/(TP + FN), where, within a given smoking category, TP = P(SCN >= 100 | smoking status) and FN = 1 – P(SCN >= 100 | smoking status). Therefore, Sensitivity = P(SCN >= 100 | smoking status).

  1. light smokers? Sensitivity=0.050

  2. moderate smokers? Sensitivity=0.326

  3. heavy smokers? Sensitivity=0.652

  1. Find the specificity of the test

Specificity = true negative / (true negative + false positive) = 1 – P(SCN >= 100 | non-smokers) = 1 – 0.033 = 0.967
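
The two definitions can be sketched as small Python helpers. The counts below are HYPOTHETICAL and serve only to exercise the formulas; they are not the study data, and the function names are assumptions made for this example.

```python
def sensitivity(tp: int, fn: int) -> float:
    """P(T+ | D+) = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """P(T- | D-) = TN / (TN + FP)."""
    return tn / (tn + fp)


# HYPOTHETICAL counts, not taken from the SCN study table.
tp, fn = 30, 70    # diseased group: test positive / test negative
tn, fp = 900, 100  # non-diseased group: test negative / test positive

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Note that specificity equals 1 minus the false-positive proportion, which is the shortcut used in the answer above.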

Binomial

  1. Among the individuals 65 years of age or older who undergo open heart surgery, 70% are still alive 90 days after the operation (i.e., they survive past 90 days). Of those who survive 90 days, 75% are still alive 5 years after surgery.
  1. What is the probability that exactly 4 out of 10 individuals 65 years of age or older will die within 90 days of the surgery (by hand and show all work)?

p=1-0.7=0.3; n=10; k=4

P(Y=k) = nCk p^k (1-p)^(n-k) = 10C4 0.3^4 0.7^6 = 210 (0.0081)(0.117649) = 0.20

  1. What is the probability that 4 or fewer people die within 90 days of the operation (by hand and show all work)?

P(Y=4) = 10C4 0.3^4 0.7^6 = 0.200

P(Y=3) = 10C3 0.3^3 0.7^7 = 0.267

P(Y=2) = 10C2 0.3^2 0.7^8 = 0.233

P(Y=1) = 10C1 0.3^1 0.7^9 = 0.121

P(Y=0) = 10C0 0.3^0 0.7^10 = 0.028

P(Y<=4) = P(Y=4) + P(Y=3) + P(Y=2) + P(Y=1) + P(Y=0) = 0.200 + 0.267 + 0.233 + 0.121 + 0.028 = 0.849 ≈ 0.85

  1. Calculate 6(a) and 6(b) using STATA to check your answers. (Show your code used)
dbinom(4, size=10, prob=0.3)
## [1] 0.2001209
pbinom(4, size=10, prob=0.3)
## [1] 0.8497317
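
The same two quantities can be cross-checked independently in Python using only the standard library. This is a minimal sketch: `binom_pmf` and `binom_cdf` are names chosen for this example, standing in for the point-probability and cumulative-probability calls used above.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(Y <= k), obtained by summing the PMF from 0 through k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


# Deaths within 90 days: n = 10 patients, p = 1 - 0.7 = 0.3.
p_exactly_4 = binom_pmf(4, 10, 0.3)
p_at_most_4 = binom_cdf(4, 10, 0.3)

print(f"P(Y = 4)  = {p_exactly_4:.7f}")
print(f"P(Y <= 4) = {p_at_most_4:.7f}")
```
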
  1. Among samples of 10 individuals, find the expected value, variance, and standard deviation for the number of people who SURVIVE 90 days.

Expected value = 10(0.7) = 7

Variance = np(1-p) = 10(0.7)(0.3) = 2.1

SD = √(2.1) = 1.45
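
The closed-form results np and np(1-p) can be checked against the definitions of expected value and variance by summing over the whole distribution. The Python sketch below assumes nothing beyond n = 10 and p = 0.7 from the question; the variable names are illustrative.

```python
from math import comb, sqrt

n, p = 10, 0.7  # survivors at 90 days among 10 patients

# Full PMF of the number of survivors.
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# Definitions: E[Y] = sum k * P(Y = k); Var(Y) = sum (k - E[Y])^2 * P(Y = k).
mean = sum(k * prob for k, prob in enumerate(pmf))
variance = sum((k - mean) ** 2 * prob for k, prob in enumerate(pmf))
sd = sqrt(variance)

print(f"mean = {mean:.3f} (np = {n * p:.3f})")
print(f"variance = {variance:.3f} (np(1-p) = {n * p * (1 - p):.3f})")
print(f"SD = {sd:.3f}")
```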

  1. What is the probability that a patient survives 5 years after having open heart surgery? Hint: Use properties of conditional probability

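P(A) = P(surviving 90 days) = 0.7; P(B | A) = P(surviving 5 years, given survival to 90 days) = 0.75

Because a patient who survives 5 years must also have survived 90 days, P(B) = P(A and B) = P(B | A) x P(A) = 0.75 x 0.7 = 0.525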

  1. In a sample of 10 patients, what is the probability that exactly 2 survive 5 years after their operations?

n=10, k=2, p=0.525

P(Y=2) = 10C2 0.525^2 (1-0.525)^8 = 45 (0.2756)(0.00259) = 0.032
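
The two-step reasoning (multiplication rule first, then the binomial formula with the resulting probability) can be sketched in Python as follows. Variable names are illustrative assumptions.

```python
from math import comb

p_survive_90d = 0.70           # P(A): alive 90 days after surgery
p_survive_5y_given_90d = 0.75  # P(B | A): alive at 5 years, given alive at 90 days

# Surviving 5 years implies surviving 90 days, so P(B) = P(B | A) * P(A).
p_survive_5y = p_survive_5y_given_90d * p_survive_90d

# Probability that exactly 2 of 10 patients survive 5 years.
n, k = 10, 2
p_exactly_2 = comb(n, k) * p_survive_5y**k * (1 - p_survive_5y) ** (n - k)

print(f"P(survive 5 years) = {p_survive_5y:.3f}")
print(f"P(exactly 2 of 10 survive 5 years) = {p_exactly_2:.4f}")
```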

  1. Suppose we have a sample of five phlebotomists (similar with respect to many personal characteristics) who were exposed to Hepatitis B via a needle stick accident. Suppose that it is reasonable to assume a health worker exposed to Hep-B via needle stick has a 30% chance of developing the disease. Let X represent how many of the five phlebotomists develop Hepatitis B.
  1. Why would the binomial distribution provide an appropriate model? Hint: Remember the acronym B.I.N.S.

B- The outcome is binary

I- The trials are independent: whether one phlebotomist develops the disease does not affect whether another does

N- There is a fixed number of trials

S- The probability of developing the disease is the same for each individual.

  1. What are the parameters of the distribution of the values of X?

p = 0.3 n=5

  1. List the possible values for X

X= 5, 4, 3, 2, 1, 0

  1. What is the mean number of phlebotomists who will develop Hep-B via needle stick accident?

Expected value = 5(0.3) = 1.5

  1. What is the standard deviation from this mean number?

SD = √ [np(1 – p)] = √ [5(0.3)(1 – 0.3)] = 1.025

  1. In how many ways can the five phlebotomists be ordered?

5! = 5(4)(3)(2)(1) = 120

  1. Without regard to order, in how many unique ways can you select one phlebotomist from this group of five?

5C1 = 5! / (1!4!) = 5
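
Both counts can be confirmed in Python, either from the formulas or by brute-force enumeration. The labels A through E are placeholders invented for this sketch.

```python
from itertools import combinations, permutations
from math import comb, factorial

people = ["A", "B", "C", "D", "E"]  # placeholder labels for the five phlebotomists

orderings_formula = factorial(len(people))  # 5! orderings
choose_one_formula = comb(len(people), 1)   # 5C1 selections of one person

# Brute-force enumeration as an independent check on the formulas.
orderings_listed = len(list(permutations(people)))
choose_one_listed = len(list(combinations(people, 1)))

print(orderings_formula, orderings_listed)    # both 120
print(choose_one_formula, choose_one_listed)  # both 5
```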

  1. What is the probability that exactly 1 phlebotomist develops the disease? (Use the formula by hand then check in STATA)

P(Y=1) = 5C1 0.3^1 0.7^4 = 5 (0.3)(0.2401) = 0.36

dbinom(1, size=5, prob=0.3)
## [1] 0.36015
  1. What is the probability that none of the phlebotomists develops the disease? (Use the formula by hand then check in STATA)

P(Y=0) = 5C0 0.3^0 0.7^5 = 0.17

dbinom(0, size=5, prob=0.3)
## [1] 0.16807
  1. What is the probability that at least 3 of the phlebotomists develop the disease? (Use the formula by hand then check in STATA)

P(Y=3) = 5C3 0.3^3 0.7^2 = 10 (0.027)(0.49) = 0.132

P(Y=4) = 5C4 0.3^4 0.7^1 = 5 (0.0081)(0.7) = 0.028

P(Y=5) = 5C5 0.3^5 0.7^0 = 0.002

P(Y>=3) = P(Y=3) + P(Y=4) + P(Y=5) = 0.132 + 0.028 + 0.002 = 0.162 ≈ 0.16

1- pbinom(2, size=5, prob=0.3)
## [1] 0.16308
  1. What is the probability that no more than 1 of the phlebotomists develops the disease? (Use the formula by hand then check in STATA)

P(Y<=1) = P(Y=1) + P(Y=0) = 0.36 + 0.17 = 0.53

pbinom(1, size=5, prob=0.3)
## [1] 0.52822
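
All four phlebotomist probabilities can be reproduced in one short Python sketch, which also shows the complement shortcut P(X >= 3) = 1 - P(X <= 2). The constant and function names are assumptions made for this example.

```python
from math import comb

N_EXPOSED = 5    # phlebotomists exposed
P_DISEASE = 0.3  # chance each one develops Hepatitis B


def binom_pmf(k: int, n: int = N_EXPOSED, p: float = P_DISEASE) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_exactly_1 = binom_pmf(1)
p_none = binom_pmf(0)
p_at_least_3 = sum(binom_pmf(k) for k in range(3, N_EXPOSED + 1))
p_at_most_1 = binom_pmf(0) + binom_pmf(1)

# Complement rule: P(X >= 3) = 1 - P(X <= 2).
p_at_least_3_alt = 1 - sum(binom_pmf(k) for k in range(3))

print(f"P(X = 1)  = {p_exactly_1:.5f}")
print(f"P(X = 0)  = {p_none:.5f}")
print(f"P(X >= 3) = {p_at_least_3:.5f}")
print(f"P(X <= 1) = {p_at_most_1:.5f}")
```
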
  1. According to data from the CDC in 2010, 19.3% of adults age eighteen and older smoke cigarettes. In the year 2008, the incidence rate of lung cancer was 65.1 cases per 100,000 people per year. Suppose you are conducting a lung cancer study in the United States, and you obtain a random sample of 2,000 adults (over 18 years of age) who do not have lung cancer. You plan to follow this study cohort over a period of 5 years and observe incident cases of lung cancer. Smoking status is an important predictor of lung cancer incidence. Therefore, as the study designer, it is important to think about baseline smoking rates in your study cohort. We first model the number of smokers in the study cohort using the binomial distribution, and assume that this cohort is a representative sample from the US population. Use the binomial distribution to answer the parts below:
  1. How many smokers would you expect to see in the study cohort, on average?

Expected value = 2000(0.193) = 386

  1. What is the standard deviation of the number of smokers in the study cohort?

SD = √ [np(1 – p)] = √ [2000(0.193)(1 – 0.193)] = 17.65

  1. What is the probability that you observe exactly 386 smokers? (Write the formula but you can calculate in STATA)

P(Y=386) = 2000C386 0.193^386 0.807^(2000-386) = 0.0225986

  1. What is the probability that greater than or equal to 25% of the study population are smokers? (Write the formula but you can calculate in STATA)

k = 2000 x 0.25 = 500

P(Y>=500) = P(Y=500) + P(Y=501) + … + P(Y=2000) = 2000C500 0.193^500 0.807^(2000-500) + 2000C501 0.193^501 0.807^(2000-501) + … + 2000C2000 0.193^2000 0.807^(2000-2000) = 2.402751e-10

1- pbinom(499, size=2000, prob=0.193)
## [1] 2.402751e-10
  1. What is the probability that less than or equal to 20% of the study population are smokers? (Write the formula but you can calculate in STATA)

k = 2000 x 0.20 = 400

P(Y<=400) = P(Y=0) + P(Y=1) + P(Y=2) + … + P(Y=400) = 2000C0 0.193^0 0.807^(2000-0) + 2000C1 0.193^1 0.807^(2000-1) + 2000C2 0.193^2 0.807^(2000-2) + … + 2000C400 0.193^400 0.807^(2000-400) = 0.7948741

pbinom(400, size=2000, prob=0.193)
## [1] 0.7948741
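
With n = 2000 the individual terms are too small to multiply naively in floating point (0.193^386 times 0.807^1614 underflows to zero), so a hand-rolled check needs a different approach. One option, sketched below in Python, is to write 0.193 as 193/1000, keep every term as an exact integer numerator over the common denominator 1000^2000, and convert to a decimal only at the end. The helper names are assumptions made for this example.

```python
from math import comb

N = 2000           # cohort size
DENOM = 1000**N    # common denominator when p = 193/1000 and 1 - p = 807/1000


def numerator(k: int) -> int:
    """Exact integer numerator of P(Y = k): NCk * 193^k * 807^(N - k)."""
    return comb(N, k) * 193**k * 807 ** (N - k)


def prob(ks) -> float:
    """Sum P(Y = k) over the given k values exactly, then convert to float."""
    return sum(numerator(k) for k in ks) / DENOM


p_exactly_386 = prob([386])
p_at_most_400 = prob(range(401))
p_at_least_500 = prob(range(500, N + 1))

print(f"P(Y = 386)  = {p_exactly_386:.7f}")
print(f"P(Y <= 400) = {p_at_most_400:.7f}")
print(f"P(Y >= 500) = {p_at_least_500:.6e}")
```

Summing the upper tail directly also avoids the small loss of precision that comes from subtracting a number very close to 1 from 1.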