Distributions of random variables

Statistics and Probability for Data Analytics

CUNY MSDS DATA 606

Rose Koh

2018/02/24

Z = (x – μ) / σ

# finding value for 'x':  x <- Z * sd + mu
mu <- 0
sd <- 1

#a
Z <- -1.13
x <- Z * sd + mu
# prob of x > -1.13
1 - pnorm(x, mean = 0, sd = 1)
## [1] 0.8707619
#b
Z <- 0.18
x <- Z * sd + mu
# prob of x < 0.18
pnorm(x, mean = 0, sd = 1)
## [1] 0.5714237
#c
Z <- 8
x <- Z * sd + mu
# prob of x > 8
1 - pnorm(x, mean = 0, sd = 1)
## [1] 6.661338e-16
#d
Z <- 0.5
x <- Z * sd + mu
# prob of |x| < 0.5, i.e. -0.5 < x < 0.5
x1 <- pnorm(-x, mean = 0, sd = 1)
x2 <- pnorm(x, mean = 0, sd = 1)
x2 - x1
## [1] 0.3829249
# shadeDist plots a probability density function, shades the area under the curve, and prints the corresponding probability.
# shadeDist() is not in base R; it is assumed here to come from the PASWR package, which must be loaded first.
library(PASWR)
mean <- 0 
SD <- 1
x <- seq(-4, 4, length = 10000)
y <- dnorm(x, mean, SD) # density grid (not actually used by shadeDist below)

par(mfrow=c(2,2))

shadeDist(-1.13, lower.tail = FALSE, col = c("blue", "light blue"))
shadeDist(0.18, col = c("blue", "light blue"))
shadeDist(8, lower.tail = FALSE, col = c("blue", "light blue"))
shadeDist(c(-0.5,0.5), lower.tail = FALSE, col = c("blue", "light blue"))

Percent of the standard normal distribution found in (a): 87.08%
Percent of the standard normal distribution found in (b): 57.14%
Percent of the standard normal distribution found in (c): ≈ 0% (6.66e-16)
Percent of the standard normal distribution found in (d): 38.29%
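Since the mean is 0 and the standard deviation is 1 here, the same four areas can be read directly from pnorm() without converting Z to x first; a minimal cross-check:

# (a) P(Z > -1.13), (b) P(Z < 0.18), (c) P(Z > 8), (d) P(|Z| < 0.5)
c(a = pnorm(-1.13, lower.tail = FALSE),
  b = pnorm(0.18),
  c = pnorm(8, lower.tail = FALSE),
  d = pnorm(0.5) - pnorm(-0.5))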


Z = (x – μ) / σ
m.mean <- 4313
m.sd <- 583
leo <- 4948

w.mean <- 5261
w.sd <- 807
mary <- 5513

#(b)
leo.z <- (leo - m.mean) / m.sd
mary.z <- (mary - w.mean) / w.sd

#(c) upper-tail area: proportion of each group with a slower (larger) finishing time
leo.rank <- 1 - pnorm(leo, m.mean, m.sd)
mary.rank <- 1 - pnorm(mary, w.mean, w.sd)

#(d), (e) same upper-tail areas: proportion of each group that each athlete finished faster than
leo.top <- 1 - pnorm(leo, m.mean, m.sd)
mary.top <- 1 - pnorm(mary, w.mean, w.sd)
  1. Men (30-34): N(μ = 4313, σ = 583)
     Women (25-29): N(μ = 5261, σ = 807)

  2. Mary showed the better performance: her z-score (0.31) is lower than Leo's (1.09), and for finishing times a lower z-score is better.
     A z-score measures how many standard deviations an observation lies from its group's mean.
     Mary's z-score means her finishing time was 0.31 SD above her group's mean finishing time.
     Leo's z-score means his finishing time was 1.09 SD above his group's mean finishing time.

  3. Mary performed better and ranked higher within her group: about 37.7% of the women in her group finished behind her, while only about 13.8% of the men in Leo's group finished behind him. (These values are upper-tail probabilities, not standard deviations; see the check after this list.)

  4. P(Z > 1.09) = 0.1380, so Leo finished faster than about 13.8% of the runners in his group.

  5. P(Z > 0.31) = 0.3774, so Mary finished faster than about 37.7% of the runners in her group.

  6. The z-scores themselves would not change, since they use only the mean and standard deviation. The answers based on probabilities (the percentile ranks in items 3-5) would change, because those calculations rely on the normal model.
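A quick printout of the quantities referenced above, using the objects already computed; a minimal check:

# z-scores and the share of each group finishing behind each athlete (in percent)
round(c(leo.z = leo.z, mary.z = mary.z), 2)
round(100 * c(leo.beats = leo.top, mary.beats = mary.top), 1)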

# fheights is assumed to be a one-column data frame of the recorded female heights, loaded earlier in the analysis.
colnames(fheights) <- "heights"
heights <- fheights$heights
h.mean <- 61.52 # mean(fheights$heights)
h.sd <- 4.58 # sd(fheights$heights)
  1. Despite the small sample size, the heights approximately follow the 68-95-99.7% rule: about 66.7% of the observations fall within one standard deviation of the mean, about 95.8% within two, and 100% within three.
# (a)
# Within 1 standard deviation of the mean.
length(which(heights > (h.mean - h.sd) & heights < (h.mean + h.sd)))/length(heights)
## [1] 0.6666667
# Within 2 standard deviations of the mean.
length(which(heights > (h.mean - 2 * h.sd) & heights < (h.mean + 2 * h.sd)))/length(heights)
## [1] 0.9583333
# Within 3 standard deviations of the mean.
length(which(heights > (h.mean - 3 * h.sd) & heights < (h.mean + 3 * h.sd)))/length(heights)
## [1] 1
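For reference, the theoretical normal coverage can be computed the same way with pnorm(); a minimal comparison:

# theoretical coverage of 1, 2, and 3 standard deviations under a normal model
k <- 1:3
round(pnorm(k) - pnorm(-k), 4) # 0.6827 0.9545 0.9973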
  2. The original and simulated data look similar. In both sets of plots the points follow the normal probability line closely, with a few mild deviations in the lower and upper tails that are not extreme.
# (b)
par(mfrow = c(1,2))
hist(fheights$heights)
qqnorm(fheights$heights)
qqline(fheights$heights)

source("./qqnormsim.R")
qqnormsim(fheights$heights)
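qqnormsim() is the helper sourced from qqnormsim.R (a DATA 606 lab script). For readers without that file, a minimal sketch of the same idea simulates normal data with the sample's mean and SD and compares its normal probability plot to the real data's (this is an assumption about what the helper does, not its exact code):

# compare the observed Q-Q plot with one simulated normal sample
set.seed(606)
sim <- rnorm(length(heights), mean = h.mean, sd = h.sd)
par(mfrow = c(1, 2))
qqnorm(heights, main = "Observed heights"); qqline(heights)
qqnorm(sim, main = "Simulated normal data"); qqline(sim)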



(a)

# "success" = finding a defective part; geometric: P(first defect on the 10th transistor) = (1 - p)^9 * p
a <- (1 - 0.02)^9 * 0.02

There is about a 1.67% chance (0.016675) that the 10th transistor is the first defective one: nine non-defective transistors followed by one defective, at a 2% defect rate.
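The same value comes from base R's geometric density, which counts the number of failures before the first success; a quick cross-check:

# P(9 non-defective transistors before the first defective one)
dgeom(9, prob = 0.02) # equals (1 - 0.02)^9 * 0.02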

(b)

b <- 1 - 0.98^100
b
## [1] 0.8673804

There is a 0.8673804 probability that at least one of the 100 transistors is defective; equivalently, the probability that all 100 are produced without a defect is 0.98^100 ≈ 0.1326.
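As a check, the no-defect probability is a binomial count of zero defects among 100 independent transistors:

dbinom(0, size = 100, prob = 0.02)     # P(no defective transistors) = 0.98^100
1 - dbinom(0, size = 100, prob = 0.02) # P(at least one defect), matches b above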

(c)

# geometric mean and standard deviation: E(X) = 1/p, SD(X) = sqrt((1 - p)/p^2)
c.mean <- 1/.02
c.sd <- ((1 - .02)/.02^2)^(1/2)

On average, the first defective transistor is expected on the 50th transistor produced, with a standard deviation of about 49.5 transistors.

(d)

d.mean <- 1/.05
d.sd <- ((1 - .05)/.05^2)^(1/2)

With a 5% defect rate, the first defective transistor is expected on the 20th transistor produced, with a standard deviation of about 19.5 transistors.

(e) Increasing the probability of success decreases both the mean and the standard deviation of the wait time until the event: compare 50 ± 49.5 transistors at a 2% defect rate with 20 ± 19.5 at a 5% rate (see the simulation sketch below). Conversely, a lower chance of success means more trials, on average, until the first success. Because the trials are independent, even if no defect appears in the first 100 transistors, the probability that the next one is defective is still 2% (or 5%).
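A small simulation makes the comparison concrete; rgeom() returns the number of failures before the first success, so 1 is added to count the trial on which the first defect occurs (a minimal sketch):

set.seed(606)
waits_02 <- rgeom(1e5, prob = 0.02) + 1 # transistors until first defect, p = 0.02
waits_05 <- rgeom(1e5, prob = 0.05) + 1 # transistors until first defect, p = 0.05
c(mean(waits_02), sd(waits_02))         # roughly 50 and 49.5
c(mean(waits_05), sd(waits_05))         # roughly 20 and 19.5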


# binomial: P(exactly 2 boys among 3 children), with P(boy) = 0.51
n <- 3
k <- 2 
p <- 0.51
double_boy <- choose(n, k) * (1 - p)^(n - k) * (p)^k
double_boy
## [1] 0.382347


(b)

#(B,B,G), (B,G,B), (G,B,B)
b <- 0.51
g <- 0.49
(b*b*g) + (b*g*b) + (g*b*b)
## [1] 0.382347


(c) Parts (a) and (b) carry out the same calculation; the binomial formula is simply a faster way to organize it. For a couple planning 8 kids with 3 boys, the enumeration approach from (b) would require listing all choose(8, 3) = 56 orderings, which is far more tedious than applying the formula once (see the sketch below).
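A quick check of both figures with base R's binomial density and choose():

dbinom(2, size = 3, prob = 0.51) # part (a)/(b): matches 0.382347 above
choose(8, 3)                     # part (c): 56 orderings to enumerate by hand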


# Negative binomial: P(the 3rd successful serve occurs on the 10th attempt), p = 0.15
p <- 0.15
n <- 10
k <- 3
choose(n - 1, k - 1) * (1 - p)^(n - k) * p^k
## [1] 0.03895012
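Base R's dnbinom() parameterizes the same distribution by the number of failures before the k-th success, so the identical probability can be computed directly; a minimal cross-check:

# 7 unsuccessful serves before the 3rd success (i.e., 3rd success on attempt 10)
dnbinom(10 - 3, size = 3, prob = 0.15) # matches 0.03895012 above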


(b) 15%. The serves are independent, so the probability that her 10th serve is successful is still 15%, regardless of the earlier outcomes.
(c) In part (b) the first nine serves are already given, and because the serves are independent, the information "two successful serves in nine attempts" does not change the probability of success on the 10th serve: it stays at 15%. Part (a), by contrast, asks for the probability of a whole pattern across all ten serves (exactly two successes in the first nine and a success on the tenth), so the two parts answer different questions and their probabilities are not directly comparable (see the sketch below).
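The contrast can be written out directly: the part (a) probability factors into a binomial term for the first nine serves times the probability of success on the tenth, while part (b) conditions on the first nine serves having already happened (a minimal sketch):

p <- 0.15
joint <- dbinom(2, size = 9, prob = p) * p # part (a): 2 successes in 9, then a success on the 10th
conditional <- p                           # part (b): only the 10th serve matters
c(joint = joint, conditional = conditional)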