Lesson 1: Random Variables and Discrete Distributions

Basic Concepts

m=mean(zagat$Food)
s=sd(zagat$Food)

Empirical Distribution Function

% of data	within # of $\sigma$ from $\mu$
68%	1
95%	2
99.7%	3


#20% of all food scores are 15 or below (approx)
p=ecdf(zagat$Food)
plot(p)

Works for data sets with histograms that are mound-shaped and symmetric

variable=as.numeric(as.character(normal$norm))
hist(variable, breaks=seq(-4.5,4.5, by =.1))

hist(zagat$Food,breaks=seq(8.5,28.5,by=1),main="Food scores")
axis(1,at=seq(9,28,1))

Proportion of Food scores within 1 sd from the mean

# 72% of data is within 1 sd from mean
u1=p(m+s) # % of data that is below mean+sd
d1=p(m-s) # % of data below mean - sd
u1-d1     # % of data between (mean + sd) and (mean - sd)

## [1] 0.72

# 94% of data is within 2 sd from mean
u2=p(m+2*s) # % of data that is below mean + 2 sd
d2=p(m-2*s) # % of data below mean - 2 sd
u2-d2     # % of data between (mean + 2 sd) and (mean - 2 sd)

## [1] 0.94

# 100% of data is within 3 sd from mean
u3=p(m+3*s) # % of data that is below mean + 3 sd
d3=p(m-3*s) # % of data below mean - 3 sd
u3-d3     # % of data between (mean + 3 sd) and (mean - 3 sd)

## [1] 1

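The proportions within 2 sd (0.94) and within 3 sd (1) are close to the empirical rule's 95% and 99.7%.
The proportion within 1 sd (0.72) is not as close to the empirical rule's 68%. The rule is only an approximation for mound-shaped, symmetric data, so some deviation is expected for real scores.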
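The three checks above repeat one calculation with a different multiple of the standard deviation. A minimal sketch of that calculation as a function follows. The name `within_k_sd` is a hypothetical helper, and the data are simulated standard normal values (an assumption for illustration, not the zagat data), so the proportions should come out only roughly at 68%, 95%, and 99.7%.

```r
# Hypothetical helper (not part of the original notes):
# proportion of x lying within k standard deviations of its mean.
within_k_sd <- function(x, k) {
  m <- mean(x)
  s <- sd(x)
  p <- ecdf(x)                 # empirical distribution function of x
  p(m + k * s) - p(m - k * s)  # share of data between m - k*s and m + k*s
}

set.seed(1)                    # assumption: simulated data, for illustration only
x <- rnorm(10000)
props <- sapply(1:3, function(k) within_k_sd(x, k))
round(props, 3)
```

With a real data set, replacing `x` by the column of interest (for example `zagat$Food`) would give the three proportions in one call.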

Probability

Notations

Notation	Description	Code	Probability
$A\cup B$	A or B	`A|B`	$P(A\cup B) = P(A) + P(B)-P(A\cap B)$
$A\cap B$	A and B	`A&B`	$P(A\cap B)=P(A|B)\times P(B)$
$A^c$	A complement (not A)	`!A`	$P(A^c)=1-P(A)$
$\emptyset$	empty set, impossible event	`NULL`
$A\cap B=\emptyset$	A and B are mutually exclusive
$A|B$	A given B		$P(A|B)=\frac{P(A\cap B)}{P(B)}$
$P(A|B)=P(A)$	A is independent of B		$P(A\cap B)=P(A)\times P(B)$

Examples

Independent

Quality Pass

John, Kathy, Len, and Martha are the final quality inspectors for computer monitors. If a computer monitor functions properly, then

John will say “OK” with probability 0.92
Kathy will say “OK” with probability 0.90
Len will say “OK” with probability 0.94
Martha will say “OK” with probability 0.97

We assume that these four inspectors make independent judgments.

If a monitor is really OK, then what is the probability that all four inspectors will say “OK”? \[ \begin{align} P(\text{John "OK"}&\cap\text{Kathy "OK"}\cap\text{Len "OK"}\cap\text{Martha "OK"})\\ &=0.92\times 0.90\times 0.94\times 0.97\\ &=0.75497040 \end{align} \]
If a monitor is really OK, then what is the probability that none of the four inspectors will say “OK”? \[ (1-0.92)\times (1-0.90)\times (1-0.94)\times (1-0.97)=0.0000144 \]
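
Both answers are products over the four inspectors, which is easy to check in R. This is a small sketch using only the probabilities stated in the example; the variable names are illustrative.

```r
# Probability that each inspector says "OK" for a good monitor (from the example)
p_ok <- c(John = 0.92, Kathy = 0.90, Len = 0.94, Martha = 0.97)

# Independence lets us multiply the individual probabilities
p_all_ok  <- prod(p_ok)       # all four say "OK"
p_none_ok <- prod(1 - p_ok)   # none of the four says "OK"

p_all_ok
p_none_ok
```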

Conditional

Marketing study

Concerning the effectiveness of an advertisement for a certain product

A: a randomly selected customer bought the product
B: the customer saw the commercial

$P(A|B)$ is the probability of A given B (the probability that the product was bought given that the customer saw the commercial)

We observe 1,000 shoppers in an experiment. Some of these shoppers saw a certain commercial and some did not. Also, some bought a particular product and some did not.

	(A) Bought product	Did not buy product	TOTAL
(B) Saw commercial	600	50	650
Did not see commercial	250	100	350
TOTAL	850	150	1000

Probability of bought product AND saw commercial: $P(A\cap B)=\frac{600}{1000}$

Probability of saw commercial: $P(B)=\frac{650}{1000}$

Probability of bought product GIVEN saw commercial:

\[ P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{P(\text{bought product and saw comm})}{P(\text{saw comm})}=\frac{600}{650}=0.923 \]

Recreated with probabilities
	(A) Bought product	Did not buy product	TOTAL
(B) Saw commercial	0.6	0.05	0.65
Did not see commercial	0.25	0.1	0.35
TOTAL	0.85	0.15	1
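
The probability table and the conditional probability can be reproduced from the counts. The sketch below assumes the counts are entered by hand as a matrix; the row and column labels are illustrative names, not from the original notes.

```r
# Counts from the 1,000 shoppers in the table
counts <- matrix(c(600,  50,
                   250, 100),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(commercial = c("saw", "did_not_see"),
                                 product    = c("bought", "did_not_buy")))

probs <- prop.table(counts)        # joint probabilities (cell / grand total)
p_B   <- sum(probs["saw", ])       # P(B): saw the commercial
p_AB  <- probs["saw", "bought"]    # P(A and B): bought and saw
p_A_given_B <- p_AB / p_B          # P(A|B)

addmargins(probs)                  # table with totals for rows and columns
p_A_given_B
```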

Medical test

The probability that a medical test will correctly detect the presence of a certain disease is 98% ($P(PT|D)$). The probability that this test will correctly detect the ABSENCE of the disease is 95% ($P(NT|ND)$). The disease is fairly rare, found in only 0.5% ($P(D)$) of the population. If a patient has a positive test (meaning that the test says “yes, the disease is present”) what is the probability that the patient really has the disease?

PT = positive test; $P(PT|D)=0.98$ (if you have the disease, the test is correctly positive)
NT = negative test; $P(NT|ND)=0.95$ (if you don’t have the disease, the test is correctly negative)
D = has the disease; $P(D) = 0.005$
ND = does not have the disease; $P(ND)=P(D^c)=0.995$

\[ \begin{align} P(PT\cap D)=P(PT|D)\times P(D)&=0.98 \times 0.005 = 0.0049\\ P(NT\cap ND)=P(NT|ND)\times P(ND)&=0.95\times0.995=0.94525 \end{align} \]

	PT	NT	TOTAL
D	0.0049	0.0001	0.005
ND	0.04975	0.94525	0.995
TOTAL	0.05465	0.94535	1
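
The question asks for $P(D|PT)$, which by the definition of conditional probability is $P(PT\cap D)/P(PT)$, that is, the D cell of the PT column divided by the PT column total. A minimal sketch of that calculation, using only the numbers given in the problem (the variable names are illustrative):

```r
p_D     <- 0.005   # P(D): prevalence of the disease
p_PT_D  <- 0.98    # P(PT|D): correct positive
p_NT_ND <- 0.95    # P(NT|ND): correct negative

p_PT_and_D  <- p_PT_D * p_D                 # 0.0049
p_PT_and_ND <- (1 - p_NT_ND) * (1 - p_D)    # 0.04975 (false positives)
p_PT        <- p_PT_and_D + p_PT_and_ND     # 0.05465 (TOTAL of the PT column)

p_D_given_PT <- p_PT_and_D / p_PT           # P(D|PT)
round(p_D_given_PT, 4)
```

The result is roughly 0.09: because the disease is rare, most positive tests come from the much larger group of people without the disease.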

Salesperson error

An industrial supply firm sometimes gets calls related to improperly filled orders. This situation is related to the salesperson’s error in writing up the bill of sale. It happens that

Hank will make an error on the bill of sale with probability 0.07,
Jerry will make an error on the bill of sale with probability 0.04, and
Carl will make an error on the bill of sale with probability 0.11.

It also should be noted that

Hank writes 30% of all sales,
Jerry writes 30% of all sales, and
Carl writes 40% of all sales.

If the firm receives a call about an improperly filled order,

What is the probability that the bill of sale was written by Hank? by Jerry? by Carl?

Help: You can write down the probability table for this problem. However, you can use any other correct method. For example, you can use algebraic calculations only, based on the definition of the conditional probability.

Salesperson	ERROR	TOTAL
Hank	$0.07\times 0.3=0.021$	$0.3$
Jerry	$0.04\times 0.3=0.012$	$0.3$
Carl	$0.11\times 0.4=0.044$	$0.4$
TOTAL	$0.077$	$1$

Notations:

$E$=“error” ,
$E^c$=“written correctly”,
$H$ =“Hank”,
$J$ =“Jerry”,
$C$=“Carl”.

Hank with error: \[ \begin{align} P[E\cap H]&=P[E|H]\times P[H]\\ &=0.07\times 0.3\\ &=0.021 \end{align} \]

Hank without error: \[ \begin{align} P[H\cap E^c] &=P[H]-P[H\cap E]\\ &= 0.3-0.021\\ &=0.279 \end{align} \]

Note: The other figures are computed similarly.

The probability that a bill contains error is \[ \begin{align} P(E) &= .021 + .012 + .044 \\ &= .077 \end{align} \]

If we learn about an order with an error, the probability that it can be traced back to each salesperson is

\[ \begin{align} P[H|E] &= \frac{P[H\cap E]}{P[E]}\\ &= \frac{0.021}{0.077}\\ &= 0.2727\\ \\ P[J|E]&=\frac{P[J\cap E]}{P[E]}\\ &= \frac{0.012}{0.077}\\ &= 0.1558\\ \\ P[C|E]&=\frac{P[C\cap E]}{P[E]}\\ &= \frac{0.044}{0.077}\\ &= 0.5714\\ \end{align} \]
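
The same table can be computed for all three salespeople at once with vectors. This is a sketch using the probabilities from the problem statement; the variable names are illustrative.

```r
p_error_given <- c(Hank = 0.07, Jerry = 0.04, Carl = 0.11)  # P(E | salesperson)
p_writes      <- c(Hank = 0.30, Jerry = 0.30, Carl = 0.40)  # P(salesperson)

joint <- p_error_given * p_writes   # P(E and salesperson): the ERROR column
p_E   <- sum(joint)                 # P(E): total probability of an error
post  <- joint / p_E                # P(salesperson | E)

round(post, 4)
```

The three conditional probabilities sum to 1, which is a quick sanity check on the arithmetic.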

Discrete Distributions

Mean (expected value, EX) of discrete random variable

$X$ = discrete random variable
{$x_1, x_2, \ldots, x_n$} = set of possible values
{$p_1, p_2, \ldots, p_n$} = corresponding probabilities
\[ \mu =\sum_{i=1}^{n} x_i p_i \]

Variance

\[ \sigma ^2 = \sum_{i=1}^{n}(x_i-\mu)^2p_i \]

Examples

Doubling in roulette (without zero and double zero)

Suppose that a player puts $1 on red.

If she wins,
- she quits
- takes cumulative winnings
If she loses, she doubles the bet

Suppose that the maximum bet amount is $64. Let W be the overall winning of the player. Write down the distribution of W. Calculate the mean of W. Based on the distribution, is doubling a good idea?

In order to lose in this game, the player must lose all 7 bets (the bets double from 1 up to the maximum of 64).
Probability of losing all 7 bets: \[ \left( \frac{1}{2}\right )^7 = \frac{1}{128} \]

Amount of total loss: \[ 1+2+4+\ldots +64=127 \]

Therefore, probability of winning: \[ 1-\left (\frac{1}{2}\right) ^7 = \frac{127}{128} \]

Distribution in table form
Possible values	Probabilities
Lose: -127	\[\frac{1}{128}\]
Win : 1	\[\frac{127}{128}\]

Mean: \[ \mu = -127\times \frac{1}{128} + 1 \times \frac{127}{128} = 0. \]

Variance: \[ \begin{align} \sigma ^2 &= \sum_{i=1}^{n}(x_i-\mu)^2p_i\\ &=\sum_{i=1}^{2}(x_i-0)^2p_i\\ &=\left [\frac{127}{128}(1-0)^2\right]+\left [\frac{1}{128}(-127-0)^2\right]\\ &=127 \end{align} \]

Standard deviation: \[ \begin{align} \sigma&=\sqrt{\sigma^2}\\ &=\sqrt{127}\\ &=11.27 \end{align} \]
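
The mean and variance formulas translate directly into R. In this sketch, `discrete_mean` and `discrete_var` are hypothetical helper names (not from the notes), applied to the two-point distribution of W.

```r
# Hypothetical helpers for a discrete distribution with values x and probabilities p
discrete_mean <- function(x, p) sum(x * p)
discrete_var  <- function(x, p) sum((x - discrete_mean(x, p))^2 * p)

# Distribution of W from the doubling example
w <- c(-127, 1)
p <- c(1 / 128, 127 / 128)

mu    <- discrete_mean(w, p)
sigma <- sqrt(discrete_var(w, p))
mu
sigma
```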

Binomial Distribution

Criteria for binomial distribution

Definition: the binomial distribution gives the likelihood of observing a certain number of successes in a series of trials, each of which has only two possible outcomes

Two possible outcomes for each trial
- $P(\text{success})=p$
- $P(\text{fail})=1-p$
Each trial is independent
Two parameters given:
- $n$ = number of trials
- $p$ = $P(\text{success})$
  - Example: if the defective rate is 10% and $n=5$ items are selected, then $X$, the number of defective items among the 5 selected, has a binomial distribution with parameters $(n,p)=(5,0.1)$.

Mean

\[\mu=n\times p\]

Standard deviation

\[\sigma=\sqrt{n\times p \times (1-p)}\]

Probability formula

\[P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}\]

Binomial coefficient: \[\binom{n}{k}=\frac{n!}{k!(n-k)!}\]

Probability R functions

$X$ = desired # of success
$n$ = sample size
$p$ = $P(\text{success})$

R function	Description
dbinom(X,n,p)	probability that the number of successes EQUALS X
pbinom(X,n,p)	probability that the number of successes is LESS than or equal to X
1-pbinom((X-1),n,p)	probability that the number of successes is GREATER than or equal to X
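
To connect the probability formula with these functions, the sketch below evaluates the formula by hand and compares it with `dbinom()`, then shows two equivalent ways to get an upper tail. The values of `n`, `p`, and `k` are taken from the defective-items illustration; `lower.tail = FALSE` is a standard argument of `pbinom()` that returns $P(X > q)$.

```r
n <- 5; p <- 0.1; k <- 1   # values from the defective-items illustration

# Probability formula written out with the binomial coefficient
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)
by_hand
dbinom(k, n, p)                           # same value

# A cumulative probability is a sum of point probabilities
sum(dbinom(0:k, n, p))                    # equals pbinom(k, n, p)

# Upper tail P(X >= k), two equivalent ways
1 - pbinom(k - 1, n, p)
pbinom(k - 1, n, p, lower.tail = FALSE)   # P(X > k - 1)
```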

To visualize:

$y$ = possible values
$n$ = sample size
$p$ = $P(\text{success})$

y <- 0:n # for each y value...
plot(y, dbinom(y,n,p),type="h")

Examples

Basic QC

Suppose you need the probability that there will be exactly 1 defective item in a sample of 5 when 10% of all manufactured items are defective.

dbinom(1, 5, 0.1)

## [1] 0.32805

Same setup, but with $n=22$ and $p=0.64$, and we need

$P[X=15]$ or
$P[X\leq 15]$ or
$P[X\geq 17]$

dbinom(15, 22, 0.64) # exact value X

## [1] 0.165445

# CUMULATIVE
pbinom(15, 22, 0.64) # <= a value X

## [1] 0.7310786

1-pbinom(16, 22, 0.64) # >= a value X: P(X >= 17) = 1 - P(X <= 16)

## [1] 0.140242

To visualize:

y <- 0:22 # for each y value...
plot(y, dbinom(y,22,0.64),type="h")

QC w/ empirical rule

$n=100$,
$p=0.4$,
$\mu =40$, and
$\sigma=4.9$

According to the empirical rule, X should be in the interval \[ \begin{align} \left((\mu-2\times \sigma),(\mu+2\times \sigma)\right) &=\left((40-2\times 4.9),(40+2\times 4.9)\right)\\ &=(30.2, 49.8) \end{align} \]

with $\approx$ 95% probability.

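To calculate the exact probability: \[ P(X\leq 49)-P(X\leq 30) \]

Notice that the decimals were truncated: $X$ takes only integer values, so the interval $(30.2, 49.8)$ contains exactly the values 31 through 49.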

$P(X\leq 49)$ = pbinom(49,100,0.4)
$P(X\leq 30)$ = pbinom(30,100,0.4)

pbinom(49,100,0.4) - pbinom(30,100,0.4)

## [1] 0.948118

$0.948118\approx 0.95$, which is very close to the empirical rule’s 95%.
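
The whole check can be derived from $n$ and $p$ alone. The sketch below recomputes $\mu$ and $\sigma$, finds the integers strictly inside the 2-sd interval, and sums their probabilities; the variable names are illustrative.

```r
n <- 100; p <- 0.4
mu    <- n * p                    # 40
sigma <- sqrt(n * p * (1 - p))    # about 4.9

lower <- mu - 2 * sigma           # about 30.2
upper <- mu + 2 * sigma           # about 49.8

# Integers strictly inside the interval: 31, ..., 49
k <- (floor(lower) + 1):(ceiling(upper) - 1)
exact <- sum(dbinom(k, n, p))     # same as pbinom(49, n, p) - pbinom(30, n, p)
exact
```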

Insurance claims

Historically, 10% of insurance claims submitted to a certain insurance company are fraudulent.

$p$ = $P(\text{claim is fraudulent})$

The historical data suggests that $p = 0.10$. However, the manager claims that $p > 0.10$. The management decides to investigate $n = 15$ claims.

$X$ = the number of fraudulent claims among the 15.

At what result should the management decide that p > .10?

Professor explanation:

It seems quite intuitive to believe that $p > 0.1$ if $X$ is “large”, and to believe that $p = 0.1$ otherwise. But how large is large?

We shall believe that

$p > 0.1$ when $X > w$, and
$p = 0.1$ when $X \leq w$.

We need to determine the cutoff value $w$.

Since the hypothesis is $p = 0.1$, we would reject the claim that $p = 0.1$ if $X > w$ for the $w$ we select. To make a false rejection unlikely, $P(X > w)$ should be small when $p = 0.1$. (We may take $0.05=5\%$ as a sufficiently small probability.) We shall look at the cumulative binomial probabilities with $n = 15$ and $p = 0.1$.

Goal: select the smallest possible $w$ such that $P(X > w)$ is “small” (at most $0.05=5\%$) given that $p = 0.1$.

y <- 0:15
prob <- pbinom(y,15,.1)
cbind(y,prob) # creates matrix

##        y      prob
##  [1,]  0 0.2058911
##  [2,]  1 0.5490430
##  [3,]  2 0.8159389
##  [4,]  3 0.9444444
##  [5,]  4 0.9872795
##  [6,]  5 0.9977503
##  [7,]  6 0.9996894
##  [8,]  7 0.9999664
##  [9,]  8 0.9999972
## [10,]  9 0.9999998
## [11,] 10 1.0000000
## [12,] 11 1.0000000
## [13,] 12 1.0000000
## [14,] 13 1.0000000
## [15,] 14 1.0000000
## [16,] 15 1.0000000

$P(X > w) \leq .05$ is the same as $P(X \leq w) \geq 0.95$ (probability is high when $X\leq w$). We see in the matrix that

$P(X \leq 3) = 0.944 < .95$ but $P(X \leq 4) = 0.987 > .95$, hence
$P(X > 4) < .05$, so
$w = 4$.

We reject the claim that $p = 0.1$ if the number of fraudulent claims is more than 4, i.e., at least 5.

Thoughts: How would knowing how many defective claims prove that $p>0.1$ only applies after 4 defective claims? Is it that we apply it in testing? See #5 in Miguel’s explanation. Also, re-read the question.

Miguel’s explanation

What is the probability we’re dealing with? We are working with a binomial distribution because:

We have a fixed number of claims ($n = 15$).
Each claim is either fraudulent or not (two possible outcomes).
The probability of a claim being fraudulent is fixed at 0.10.

The binomial distribution is used to calculate the probability of having exactly a certain number of fraudulent claims, given the 15 trials and a 10% fraud rate for each claim.

The goal: We want to find a number $w$ such that the probability of observing more than $w$ fraudulent claims is less than 5% if the fraud rate is really 10%. In other words, we want:

\[P(X > w) < 0.05\]

This means the probability of getting more than $w$ fraudulent claims should be very small under the assumption that p = 0.1.

How do we find $w$? To find $w$, we need to look at the cumulative probability of observing up to $w$ fraudulent claims (i.e., $P(X \leq w)$) and ensure that it covers at least 95% of the possible outcomes (so the probability of getting more than $w$ fraudulent claims is less than 5%).

We can calculate this using R with the pbinom() function, which gives cumulative probabilities for binomial distributions.

y  <-  0:15
prob = pbinom(y, 15, 0.1)
cbind(y, prob)

##        y      prob
##  [1,]  0 0.2058911
##  [2,]  1 0.5490430
##  [3,]  2 0.8159389
##  [4,]  3 0.9444444
##  [5,]  4 0.9872795
##  [6,]  5 0.9977503
##  [7,]  6 0.9996894
##  [8,]  7 0.9999664
##  [9,]  8 0.9999972
## [10,]  9 0.9999998
## [11,] 10 1.0000000
## [12,] 11 1.0000000
## [13,] 12 1.0000000
## [14,] 13 1.0000000
## [15,] 14 1.0000000
## [16,] 15 1.0000000

Cumulative probabilities: From the results of pbinom(), you get a table that shows the cumulative probability for each possible number of fraudulent claims from 0 to 15. For example, $P(X \leq 4)$ (the probability of getting 4 or fewer fraudulent claims) is greater than 95%, meaning:

\[P(X \leq 4) > 0.95\]

This implies that $P(X > 4)$ , or the probability of getting more than 4 fraudulent claims, is less than 5%:

\[P(X > 4) < 0.05\]

Conclusion: Since $P(X > 4)$ is less than 5%, the cutoff value $w$ is 4. If the company finds 5 or more fraudulent claims, they will decide that the fraud rate is likely higher than 10%, because it’s very unlikely to get that many fraudulent claims if the fraud rate were really 10%.

Summary:

$w = 4$ is the threshold.
If the company finds more than 4 fraudulent claims out of 15, they’ll reject the idea that the fraud rate is just 10%, because that outcome would be very unlikely under the assumption of a 10% fraud rate.
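
Instead of reading the cutoff off the printed matrix, it can be selected programmatically. The sketch below is one possible way to do this under the same assumptions ($n = 15$, $p = 0.1$, 5% threshold); `alpha` and `p0` are illustrative names. It also uses `qbinom()`, the binomial quantile function, which returns the smallest value whose cumulative probability reaches the requested level.

```r
n <- 15; p0 <- 0.10; alpha <- 0.05

y          <- 0:n
upper_tail <- 1 - pbinom(y, n, p0)   # P(X > y) when p = 0.1

# Smallest w with P(X > w) < alpha
w <- min(y[upper_tail < alpha])
w

# Same cutoff from the quantile function:
# smallest w with P(X <= w) >= 1 - alpha
qbinom(1 - alpha, n, p0)
```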

More QC

In a production line, 5% of the produced items are defective (typically this proportion is unknown; we assume it to be known for the sake of this exercise). A quality control inspector selects a random sample of $n = 20$ items.

What is the probability that there will be no defective item in the sample?

dbinom(0,20,0.05)

## [1] 0.3584859

What is the probability that there will be at least 1 defective item in the sample?

This is asking $P(X\geq1)=?$, meaning we need to find:

\[ P(X\geq1)=1-P(X=0) \]

1-dbinom(0,20,0.05)

## [1] 0.6415141

What is the mean number of defective items in the sample?

\[ \begin{align} \mu &= n\times p\\ &=20\times 0.05\\ &= 1 \end{align} \]

What is the standard deviation of the number of defective items in the sample?

\[ \begin{align} \sigma&=\sqrt{n\times p \times (1-p)}\\ &=\sqrt{20\times 0.05 \times (1-0.05)}\\ &=\sqrt{0.95}\\ &=0.974678434 \end{align} \]
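
As a cross-check, the shortcut formulas $n\times p$ and $\sqrt{n\times p\times(1-p)}$ should agree with the general mean and variance definitions for a discrete random variable applied to the binomial probabilities. A small sketch (variable names are illustrative):

```r
n <- 20; p <- 0.05
y   <- 0:n
pmf <- dbinom(y, n, p)                 # P(X = y) for every possible y

mu    <- sum(y * pmf)                  # should match n * p
sigma <- sqrt(sum((y - mu)^2 * pmf))   # should match sqrt(n * p * (1 - p))

c(mean = mu, sd = sigma)
```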

Spark plugs

You are considering a quality inspection scheme to use on the spark plugs which are sent from your supplier. These spark plugs come in shipments of 50,000. Denote the unknown proportion of defective spark plugs in the shipment by $p$. Ideally you would like to

reject the shipment if $p > .05$ and
accept it if $p ≤ .05$.

In practice you can’t follow this plan since you don’t know $p$. Instead you decide to apply a scheme that consists of the following steps:

1) A random sample of 20 of the spark plugs will be selected from each shipment. Each of the selected plugs will be tested to see whether it is defective or not. (The test involves measuring the plug gap and determining the electrical resistance.) You will note $X$ as the (random) number of defective plugs in the sample.
2) If $X < 2$ then the shipment passes your quality standard. If $X \geq 2$ then the shipment fails the quality test and will be returned to the supplier.

Find the probability that the shipment is rejected when $p = .05$ (this corresponds to an “error” since at $p = .05$ we would want to accept the shipment).

Remember, in practice we don’t know what $p$ is; here we assume $p = .05$ and look for the probability of rejection, $P(X \geq 2)$

\[ \begin{align} P(X\geq2)&=1-P(X\leq 1)\end{align} \]

1-pbinom(1,20,0.05)

## [1] 0.2641605

Find the probability that the shipment is accepted when $p = .1$ (this corresponds to an “error” again since at $p = 0.10$ we would want to reject the shipment).

We’re looking for $P(X<2)$ or $P(X\leq 1)$

pbinom(1,20,0.1)

## [1] 0.391747

Find the probability that the shipment is accepted when $p = 0.20$.

We’re (still) looking for $P(X<2)$ or $P(X\leq 1)$

pbinom(1,20,0.2)

## [1] 0.06917529
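
The three calculations above differ only in the assumed value of $p$, so they can be done in one call, since `pbinom()` is vectorized over its probability argument. This sketch tabulates the acceptance and rejection probabilities of the "reject if $X \geq 2$" scheme; the column names are illustrative.

```r
# Probability of accepting the shipment (X < 2, i.e. X <= 1)
# for several true defect proportions
p_true   <- c(0.05, 0.10, 0.20)
p_accept <- pbinom(1, 20, p_true)

cbind(p = p_true, accept = p_accept, reject = 1 - p_accept)
```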

We would like to modify the quality control test in the following way. We want to pass the shipment if $X < w$ and reject the shipment when $X ≥ w$ where $w$ is a number to be determined.

Determine the smallest possible value for $w$ such that the probability of rejecting the shipment when $p = .05$ is no more than 0.01, i.e., 1%.

Professor’s explanation

We want to find the smallest $w$ such that \[ P[X \geq w] \leq .01 \]

Notice that $\geq$ was used because the shipment is rejected when $X \geq w$, and we’re looking for the smallest such $w$.

Remember that at $p=0.05$ we would want to accept the shipment, so rejecting it is an error, and we want the probability of that error to be at most 1%. We anticipate this probability to be small because we’re most likely accepting the shipment. Then we flip the inner $\geq$ to $<$ to find the threshold at which the probability of acceptance is high (in this case at least 99%).

This happens if \[ P[X < w] \geq .99 \iff P[X \leq (w-1)] \geq .99 \]

Note that we’re looking for the first value of $(w-1)$ where the cumulative probability is 99% or higher.
We use R to print the cumulative binomial distribution with $n = 20$ and $p = .05$. Here are the commands:

y=0:20
prob=pbinom(y,20,.05)
cbind(y,prob)

##        y      prob
##  [1,]  0 0.3584859
##  [2,]  1 0.7358395
##  [3,]  2 0.9245163
##  [4,]  3 0.9840985
##  [5,]  4 0.9974261
##  [6,]  5 0.9996707
##  [7,]  6 0.9999661
##  [8,]  7 0.9999971
##  [9,]  8 0.9999998
## [10,]  9 1.0000000
## [11,] 10 1.0000000
## [12,] 11 1.0000000
## [13,] 12 1.0000000
## [14,] 13 1.0000000
## [15,] 14 1.0000000
## [16,] 15 1.0000000
## [17,] 16 1.0000000
## [18,] 17 1.0000000
## [19,] 18 1.0000000
## [20,] 19 1.0000000
## [21,] 20 1.0000000

In the matrix, $P(X \leq 3) = 0.984 < .99$ and $P(X \leq 4) = 0.997 \geq .99$, so $w-1=4$. Therefore, $w=5$
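
The cutoff can also be found without scanning the printed matrix. The sketch below, under the same assumptions ($n = 20$, $p = 0.05$, 1% limit), computes $P(X \geq w)$ for every candidate $w$ and picks the smallest one that meets the limit; `alpha` and `w_candidates` are illustrative names.

```r
n <- 20; p <- 0.05; alpha <- 0.01

w_candidates <- 1:n
p_reject <- 1 - pbinom(w_candidates - 1, n, p)   # P(X >= w) for each candidate w

# Smallest w with P(X >= w) <= alpha
w <- min(w_candidates[p_reject <= alpha])
w

# Same cutoff via the quantile function: qbinom() gives w - 1
qbinom(1 - alpha, n, p) + 1
```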
