Note: I used the “code_folding: hide” option in my HTML document to keep it cleaner. Clicking the “Code” buttons will display the code I used.

In addition, I loaded the DATA606 package, which provides the course functions used below.

library(DATA606)

Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
This package is designed to support this course. The text book used 
is OpenIntro Statistics, 3rd Edition. You can read this by typing 
vignette('os3') or visit www.OpenIntro.org. 
 
The getLabs() function will return a list of the labs available. 
 
The demo(package='DATA606') will list the demos that are available. 

Exercise 3.2

  (a) \(Z > -1.13\)

The plot below shows the area under the curve representing a Z-score greater than -1.13.

library(ggplot2)
# Shade the area under the standard normal curve for Z > -1.13
zscore <- -1.13
gginit.3.2 <- ggplot(data=data.frame(x=c(-4,4)),aes(x))
stattype.3.2.1 <- stat_function(fun=dnorm,n=100,args=list(mean=0,sd=1))
stattype.3.2.2 <- stat_function(fun=dnorm,xlim=c(zscore,4),geom="area",fill="blue",alpha=0.5)
theme.3.2 <- theme_bw() +
             theme(axis.line.x = element_line(color="black"),
                   axis.line.y = element_blank(),
                   axis.text.y = element_blank(),
                   axis.title.y = element_blank(),
                   axis.ticks.y = element_blank(),
                   panel.grid.major = element_blank(),
                   panel.grid.minor = element_blank(),
                   panel.border = element_blank(),
                   panel.background = element_blank())
annotate.3.2.1 <- annotate(geom="segment", x=zscore,xend=zscore,y=0,yend=dnorm(zscore,mean=0,sd=1),color="black")
annotate.3.2.2 <- annotate(geom="text",x=zscore,y=-0.02,label="Z = -1.13")
gginit.3.2 + stattype.3.2.1 + stattype.3.2.2 + theme.3.2 + ylab("") + xlab("Z") + annotate.3.2.1 + annotate.3.2.2 

The pnorm function gives us the probability of falling below a given z-score, so we need to take the complement of \(P(Z < -1.13)\) in order to get \(P(Z > -1.13)\):

\(P(Z > -1.13) = 1 - P(Z < -1.13) = \boxed{0.871}\)

1-pnorm(-1.13)
[1] 0.8707619
  (b) \(Z < 0.18\)

The plot below shows the area under the curve representing a Z-score less than 0.18.

library(ggplot2)
zscore <- 0.18
gginit.3.2 <- ggplot(data=data.frame(x=c(-4,4)),aes(x))
stattype.3.2.1 <- stat_function(fun=dnorm,n=100,args=list(mean=0,sd=1))
stattype.3.2.2 <- stat_function(fun=dnorm,xlim=c(-4,zscore),geom="area",fill="blue",alpha=0.5)
theme.3.2 <- theme_bw() +
             theme(axis.line.x = element_line(color="black"),
                   axis.line.y = element_blank(),
                   axis.text.y = element_blank(),
                   axis.title.y = element_blank(),
                   axis.ticks.y = element_blank(),
                   panel.grid.major = element_blank(),
                   panel.grid.minor = element_blank(),
                   panel.border = element_blank(),
                   panel.background = element_blank())
annotate.3.2.1 <- annotate(geom="segment", x=zscore,xend=zscore,y=0,yend=dnorm(zscore,mean=0,sd=1),color="black")
annotate.3.2.2 <- annotate(geom="text",x=zscore,y=-0.02,label="Z = 0.18")
gginit.3.2 + stattype.3.2.1 + stattype.3.2.2 + theme.3.2 + ylab("") + xlab("Z") + annotate.3.2.1 + annotate.3.2.2 

\(P(Z < 0.18) = \boxed{0.571}\)

pnorm(0.18)
[1] 0.5714237
  (c) \(Z > 8\)

As you can see from the plot below, a z-score of 8 does not visibly fall within the plot. A z-score of 8 represents 8 standard deviations away from the mean. Since 99.7% of the data falls within 3 standard deviations of the mean, the probability of Z > 8 is approximately 0.

library(ggplot2)
zscore <- 8
gginit.3.2 <- ggplot(data=data.frame(x=c(-4,4)),aes(x))
stattype.3.2.1 <- stat_function(fun=dnorm,n=100,args=list(mean=0,sd=1))
stattype.3.2.2 <- stat_function(fun=dnorm,xlim=c(zscore,8),geom="area",fill="blue",alpha=0.5)
theme.3.2 <- theme_bw() +
             theme(axis.line.x = element_line(color="black"),
                   axis.line.y = element_blank(),
                   axis.text.y = element_blank(),
                   axis.title.y = element_blank(),
                   axis.ticks.y = element_blank(),
                   panel.grid.major = element_blank(),
                   panel.grid.minor = element_blank(),
                   panel.border = element_blank(),
                   panel.background = element_blank())
annotate.3.2.1 <- annotate(geom="segment", x=zscore,xend=zscore,y=0,yend=dnorm(zscore,mean=0,sd=1),color="black")
annotate.3.2.2 <- annotate(geom="text",x=zscore,y=-0.02,label="Z = 8")
gginit.3.2 + stattype.3.2.1 + stattype.3.2.2 + theme.3.2 + ylab("") + xlab("Z") + annotate.3.2.1 + annotate.3.2.2 + xlim(-4,4)

\(P(Z > 8) = 1 - P(Z < 8) \approx \boxed{0}\)

1-pnorm(8)
[1] 6.661338e-16
  (d) \(|Z| < 0.5\)

This problem can be rewritten as:

\(-0.5 < Z < 0.5\)

Therefore, we need to find the probability that the z-score falls between -0.5 and 0.5. Below is a plot showing the area we are looking for:

library(ggplot2)
zscore1 <- -0.5
zscore2 <- 0.5
gginit.3.2 <- ggplot(data=data.frame(x=c(-4,4)),aes(x))
stattype.3.2.1 <- stat_function(fun=dnorm,n=100,args=list(mean=0,sd=1))
stattype.3.2.2 <- stat_function(fun=dnorm,xlim=c(zscore1,zscore2),geom="area",fill="blue",alpha=0.5)
theme.3.2 <- theme_bw() +
             theme(axis.line.x = element_line(color="black"),
                   axis.line.y = element_blank(),
                   axis.text.y = element_blank(),
                   axis.title.y = element_blank(),
                   axis.ticks.y = element_blank(),
                   panel.grid.major = element_blank(),
                   panel.grid.minor = element_blank(),
                   panel.border = element_blank(),
                   panel.background = element_blank())
annotate.3.2.1 <- annotate(geom="segment", x=zscore1,xend=zscore1,y=0,yend=dnorm(zscore1,mean=0,sd=1),color="black")
annotate.3.2.2 <- annotate(geom="text",x=zscore1,y=-0.02,label="Z = -0.5")
annotate.3.2.3 <- annotate(geom="segment", x=zscore2,xend=zscore2,y=0,yend=dnorm(zscore2,mean=0,sd=1),color="black")
annotate.3.2.4 <- annotate(geom="text",x=zscore2,y=-0.02,label="Z = 0.5")
gginit.3.2 + stattype.3.2.1 + stattype.3.2.2 + theme.3.2 + ylab("") + xlab("Z") + annotate.3.2.1 + annotate.3.2.2 + annotate.3.2.3 + annotate.3.2.4

\(P(|Z| < 0.5) = P(Z < 0.5) - P(Z < -0.5) = \boxed{0.383}\)

pnorm(0.5) - pnorm(-0.5)
[1] 0.3829249

Exercise 3.4

  (a) The short-hand notation for the women's and men's groups can be seen below:

\(\text{Women} \rightarrow N(\mu,\sigma) = \boxed{N(5261,807)}\)

\(\text{Men} \rightarrow N(\mu,\sigma) = \boxed{N(4313,583)}\)

  (b) The calculation of the Z-scores can be seen below:

\(\text{Mary} \rightarrow Z = \frac{x - \mu}{\sigma} = \frac{5513 - 5261}{807} = \boxed{0.312}\)

\(\text{Leo} \rightarrow Z = \frac{x - \mu}{\sigma} = \frac{4948 - 4313}{583} = \boxed{1.089}\)

The Z-scores tell us that Mary and Leo’s times were 0.312 and 1.089 standard deviations above the mean, respectively.
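
As a quick sanity check on these Z-scores, here is a short R sketch using the group parameters from part (a) (the variable names are just for illustration):

# Z-scores computed from the group parameters in part (a)
mary.z <- (5513 - 5261) / 807  # Mary, women's group: ~0.312
leo.z  <- (4948 - 4313) / 583  # Leo, men's group: ~1.089
c(mary.z, leo.z)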

  (c) From the Z-scores, we can see that Mary actually finished better than Leo relative to their respective groups. In a race, a lower time is better, so a lower Z-score indicates that the runner finished faster with respect to their group.

  (d) We need to find the probability that Z is greater than 1.089:

\(P(Z > 1.089) = 1 - P(Z < 1.089) = \boxed{0.138}\)

1-pnorm(1.089)
[1] 0.1380769

Therefore, Leo ran faster than 13.8% of the people in his group.

  (e) We need to find the probability that Z is greater than 0.312:

\(P(Z > 0.312) = 1 - P(Z < 0.312) = \boxed{0.378}\)

1-pnorm(0.312)
[1] 0.3775203

Therefore, Mary ran faster than 37.8% of the people in her group.

  (f) For the most part, yes. The probability calculations in parts (d) and (e) rely on the normal model, so those answers would no longer be valid. The Z-scores in part (b) could still be computed, however, since they only require a mean and a standard deviation.

Exercise 3.18

  (a) The 68-95-99.7% rule says that 68% of the data falls within 1 standard deviation of the mean, 95% falls within 2 standard deviations, and 99.7% falls within 3 standard deviations. To test this, we need to check the percentage of the data that falls within each of these three intervals.

heights <- c(54,55,56,56,57,58,58,59,60,60,60,61,61,62,62,63,63,63,64,65,65,67,67,69,73)
heights.df <- as.data.frame(heights)
heights.mean <- mean(heights.df$heights)
heights.sd <- sd(heights.df$heights)
percent.vec <- vector()
# Compute the percentage of heights within i standard deviations of the mean
for (i in 1:3) {
  percent.vec[i] <- nrow(subset(heights.df, heights > heights.mean - i*heights.sd & heights < heights.mean + i*heights.sd))/nrow(heights.df)*100
  cat("Percent of data falling within",i,"standard deviations =",percent.vec[i],"%\n")
}
Percent of data falling within 1 standard deviations = 68 %
Percent of data falling within 2 standard deviations = 96 %
Percent of data falling within 3 standard deviations = 100 %

Using a short R script, we can see that 68%, 96%, and 100% of the data fall within 1, 2, and 3 standard deviations of the mean, respectively. Therefore, we can say that this data approximately follows the 68-95-99.7% rule.

  (b) From the plots given in the problem statement, it appears that the data follow a normal distribution. The histograms appear to be symmetric, and the normal probability plot shows the points not deviating from the line too much. We can also use qqnormsim to compare the normal probability plot against plots of simulated normal data.

qqnormsim(heights.df$heights)

We can see that the plot from the actual data appears to be normal and follows the line even more closely than some of the simulations.

Exercise 3.22

  (a) This problem follows a geometric distribution, since we are waiting for the first defective transistor.

\(P(X = n) = (1-p)^{n-1}p \rightarrow P(X = 10) = (1-0.02)^{10-1} \times 0.02 = 0.0167 = \boxed{1.67\%}\)

Using the ‘dgeom’ function in R (which counts the number of failures before the first success, hence the argument 9 rather than 10), we can see the same solution:

dgeom(9,0.02)
[1] 0.01667496
  (b) We simply raise the probability of drawing a non-defective transistor to the 100th power, since the transistors are independent.

\(p = \text{Probability of defective transistor} \rightarrow P(\text{No defects in 100 transistors}) = (1-p)^{100} = 0.98^{100} = 0.1326 = \boxed{13.26\%}\)
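
As a quick check of this value in R, a direct calculation and the equivalent binomial probability of zero defects give the same result:

# Probability of no defective transistors among 100, two equivalent ways
0.98^100              # direct calculation, ~0.1326
dbinom(0, 100, 0.02)  # binomial probability of zero defects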

  (c) The mean and standard deviation of geometric distributions can be seen below:

\(\mu = \frac{1}{p} = \frac{1}{0.02} = \boxed{50}\) \(\sigma = \sqrt{\frac{1-p}{p^{2}}} = \sqrt{\frac{1-0.02}{0.02^2}} = \boxed{49.4975}\)

  (d) This is calculated the same way as in part (c):

\(\mu = \frac{1}{p} = \frac{1}{0.05} = \boxed{20}\) \(\sigma = \sqrt{\frac{1-p}{p^{2}}} = \sqrt{\frac{1-0.05}{0.05^2}} = \boxed{19.4936}\)
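
As a quick check of the values in parts (c) and (d), here is a small R sketch of the geometric mean and standard deviation formulas (the helper function names are just for illustration):

# Mean and standard deviation of a geometric distribution with success probability p
geom.mean <- function(p) 1 / p
geom.sd   <- function(p) sqrt((1 - p) / p^2)
c(geom.mean(0.02), geom.sd(0.02))  # ~50, ~49.4975
c(geom.mean(0.05), geom.sd(0.05))  # ~20, ~19.4936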

  (e) From the answers in (c) and (d), we can see that increasing the defective rate decreases both the mean and the standard deviation. This makes sense: if a defective transistor is more likely, we would expect to go through fewer transistors before seeing the first defective one.

Exercise 3.38

  (a) We can use the binomial model to calculate this with \(p = 0.51\), \(n = 3\) and \(k = 2\).

\(P(X = k) = \binom{n}{k}p^{k}(1-p)^{n-k} = \frac{n!}{k!(n-k)!}p^{k}(1-p)^{n-k} = \frac{3!}{2!(3-2)!}0.51^{2}(0.49)^{1} = \boxed{0.3823}\)

Using the ‘dbinom’ function in R, we can get to the same answer:

dbinom(2,3,0.51)
[1] 0.382347
  (b) There are three possible orderings for two boys and one girl: BBG, BGB, GBB. The calculation looks like this:

\(P(\text{2 boys, 1 girl}) = (0.51 \times 0.51 \times 0.49) + (0.51 \times 0.49 \times 0.51) + (0.49 \times 0.51 \times 0.51) = \boxed{0.3823}\)
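
Since each ordering has the same probability, this sum can also be checked in R by multiplying one ordering's probability by the number of orderings:

# Three equally likely orderings of two boys and one girl
3 * 0.51^2 * 0.49
choose(3, 2) * 0.51^2 * 0.49  # same result using the binomial coefficient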

  (c) The approach from part (b) would be much more tedious than the approach from part (a) because there are \(\binom{8}{3} = 56\) possible orderings of 3 boys among 8 kids. Writing them all out would take a while, whereas the binomial formula from part (a) handles it in a single equation (a quick count of the orderings is shown after the result below).

\(P(X = k) = \binom{n}{k}p^{k}(1-p)^{n-k} = \frac{n!}{k!(n-k)!}p^{k}(1-p)^{n-k} = \frac{8!}{3!(8-3)!}0.51^{3}(0.49)^{5}\)

Using the ‘dbinom’ function in R, we can get to the answer:

dbinom(3,8,0.51)
[1] 0.2098355
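
For reference, choose(8, 3) counts how many orderings the approach from part (b) would have to enumerate:

choose(8, 3)  # 56 orderings of 3 boys among 8 kids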

Exercise 3.42

  (a) For this problem, we need to use a negative binomial distribution because it describes the probability of observing the \(k^{th}\) success on the \(n^{th}\) trial. In other words, the final success has to be on the \(10^{th}\) trial. The variables for this problem are \(p = 0.15\), \(n = 10\) and \(k = 3\). The equation and solution are shown below:

\(P(X = k) = \binom{n-1}{k-1}p^{k}(1-p)^{n-k} = \frac{(n-1)!}{(k-1)!(n-k)!}p^{k}(1-p)^{n-k} = \frac{(10-1)!}{(3-1)!(10-3)!}0.15^{3}(1-0.15)^{10-3} = \boxed{0.039}\)

We can also use the ‘dnbinom’ function in R to get to the same answer (dnbinom counts the number of failures before the \(k^{th}\) success, hence the first argument of 10 - 3 = 7):

dnbinom(10-3,3,0.15)
[1] 0.03895012
  (b) Since each serve is independent of the others, the probability that her 10th serve is successful is simply 15%.

  (c) In part (a), we are asking about a specific pattern across all 10 serves: exactly 2 successes somewhere in the first 9, followed by a success on the 10th. In part (b), independence means the earlier serves tell us nothing about the 10th serve, so only that single serve matters.