Note: I set the “code_folding: hide” option in this HTML document to keep it uncluttered. Clicking a “Code” button displays the code I used.
In addition, I loaded the DATA606 package to use the functions it provides.
library(DATA606)
Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
This package is designed to support this course. The text book used
is OpenIntro Statistics, 3rd Edition. You can read this by typing
vignette('os3') or visit www.OpenIntro.org.
The getLabs() function will return a list of the labs available.
The demo(package='DATA606') will list the demos that are available.
The plot below shows the area under the curve representing a Z-score greater than -1.13.
library(ggplot2)
zscore <- -1.13
gginit.3.2 <- ggplot(data=data.frame(x=c(-4,4)),aes(x))
stattype.3.2.1 <- stat_function(fun=dnorm,n=100,args=list(mean=0,sd=1))
stattype.3.2.2 <- stat_function(fun=dnorm,xlim=c(zscore,4),geom="area",fill="blue",alpha=0.5)
theme.3.2 <- theme_bw() +
theme(axis.line.x = element_line(color="black"),
axis.line.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
annotate.3.2.1 <- annotate(geom="segment", x=zscore,xend=zscore,y=0,yend=dnorm(zscore,mean=0,sd=1),color="black")
annotate.3.2.2 <- annotate(geom="text",x=zscore,y=-0.02,label="Z = -1.13")
gginit.3.2 + stattype.3.2.1 + stattype.3.2.2 + theme.3.2 + ylab("") + xlab("Z") + annotate.3.2.1 + annotate.3.2.2
The pnorm function returns the probability that Z falls below a given z-score, so we need to take the complement of P(Z < -1.13) to get P(Z > -1.13):
\(P(Z > -1.13) = 1 - P(Z < -1.13) = \boxed{0.871}\)
1-pnorm(-1.13)
[1] 0.8707619
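As a cross-check, the same upper-tail probability can be obtained without subtracting from 1. The sketch below uses the lower.tail argument of pnorm; the variable names are illustrative and not part of the original solution.

```r
# Upper-tail probability P(Z > -1.13), computed two ways
z <- -1.13
p_complement <- 1 - pnorm(z)                  # complement of the lower tail
p_upper      <- pnorm(z, lower.tail = FALSE)  # upper tail requested directly
print(round(c(complement = p_complement, upper = p_upper), 3))
```

Both approaches should agree to within floating-point tolerance.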
The plot below shows the area under the curve representing a Z-score less than 0.18.
library(ggplot2)
zscore <- 0.18
gginit.3.2 <- ggplot(data=data.frame(x=c(-4,4)),aes(x))
stattype.3.2.1 <- stat_function(fun=dnorm,n=100,args=list(mean=0,sd=1))
stattype.3.2.2 <- stat_function(fun=dnorm,xlim=c(-4,zscore),geom="area",fill="blue",alpha=0.5)
theme.3.2 <- theme_bw() +
theme(axis.line.x = element_line(color="black"),
axis.line.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
annotate.3.2.1 <- annotate(geom="segment", x=zscore,xend=zscore,y=0,yend=dnorm(zscore,mean=0,sd=1),color="black")
annotate.3.2.2 <- annotate(geom="text",x=zscore,y=-0.02,label="Z = 0.18")
gginit.3.2 + stattype.3.2.1 + stattype.3.2.2 + theme.3.2 + ylab("") + xlab("Z") + annotate.3.2.1 + annotate.3.2.2
\(P(Z < 0.18) = \boxed{0.571}\)
pnorm(0.18)
[1] 0.5714237
As you can see from the plot below, a z-score of 8 falls outside the plotted range. A z-score of 8 lies 8 standard deviations above the mean. About 99.7% of the data falls within 3 standard deviations of the mean, so the probability of Z > 8 is approximately 0.
library(ggplot2)
zscore <- 8
gginit.3.2 <- ggplot(data=data.frame(x=c(-4,4)),aes(x))
stattype.3.2.1 <- stat_function(fun=dnorm,n=100,args=list(mean=0,sd=1))
stattype.3.2.2 <- stat_function(fun=dnorm,xlim=c(zscore,8),geom="area",fill="blue",alpha=0.5)
theme.3.2 <- theme_bw() +
theme(axis.line.x = element_line(color="black"),
axis.line.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
annotate.3.2.1 <- annotate(geom="segment", x=zscore,xend=zscore,y=0,yend=dnorm(zscore,mean=0,sd=1),color="black")
annotate.3.2.2 <- annotate(geom="text",x=zscore,y=-0.02,label="Z = 8")
gginit.3.2 + stattype.3.2.1 + stattype.3.2.2 + theme.3.2 + ylab("") + xlab("Z") + annotate.3.2.1 + annotate.3.2.2 + xlim(-4,4)
\(P(Z > 8) = 1 - P(Z < 8) \approx \boxed{0}\)
1-pnorm(8)
[1] 6.661338e-16
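This far into the tail, subtracting from 1 is limited by double-precision rounding, because pnorm(8) is almost indistinguishable from 1. The sketch below compares the subtraction with asking pnorm for the upper tail directly; the variable names are illustrative. Either way the probability is effectively 0, so the conclusion above is unchanged.

```r
# P(Z > 8) computed by subtraction and by requesting the upper tail directly
z <- 8
p_subtract <- 1 - pnorm(z)                  # affected by rounding near 1
p_direct   <- pnorm(z, lower.tail = FALSE)  # evaluated in the tail itself
print(c(subtract = p_subtract, direct = p_direct))
```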
This problem can be rewritten as:
\(-0.5 < Z < 0.5\)
Therefore, we need to find the probability that the z-score falls between -0.5 and 0.5. Below is a plot showing the area we are looking for:
library(ggplot2)
zscore1 <- -0.5
zscore2 <- 0.5
gginit.3.2 <- ggplot(data=data.frame(x=c(-4,4)),aes(x))
stattype.3.2.1 <- stat_function(fun=dnorm,n=100,args=list(mean=0,sd=1))
stattype.3.2.2 <- stat_function(fun=dnorm,xlim=c(zscore1,zscore2),geom="area",fill="blue",alpha=0.5)
theme.3.2 <- theme_bw() +
theme(axis.line.x = element_line(color="black"),
axis.line.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
annotate.3.2.1 <- annotate(geom="segment", x=zscore1,xend=zscore1,y=0,yend=dnorm(zscore1,mean=0,sd=1),color="black")
annotate.3.2.2 <- annotate(geom="text",x=zscore1,y=-0.02,label="Z = -0.5")
annotate.3.2.3 <- annotate(geom="segment", x=zscore2,xend=zscore2,y=0,yend=dnorm(zscore2,mean=0,sd=1),color="black")
annotate.3.2.4 <- annotate(geom="text",x=zscore2,y=-0.02,label="Z = 0.5")
gginit.3.2 + stattype.3.2.1 + stattype.3.2.2 + theme.3.2 + ylab("") + xlab("Z") + annotate.3.2.1 + annotate.3.2.2 + annotate.3.2.3 + annotate.3.2.4
\(P(|Z| < 0.5) = P(Z < 0.5) - P(Z < -0.5) = \boxed{0.383}\)
pnorm(0.5) - pnorm(-0.5)
[1] 0.3829249
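Because the standard normal distribution is symmetric about 0, the same interval probability can also be written as 2P(Z < 0.5) - 1. The sketch below checks that identity; the variable names are illustrative.

```r
# P(|Z| < 0.5) as a difference of lower tails and via symmetry
z <- 0.5
p_difference <- pnorm(z) - pnorm(-z)  # P(Z < 0.5) - P(Z < -0.5)
p_symmetry   <- 2 * pnorm(z) - 1      # uses P(Z < -z) = 1 - P(Z < z)
print(round(c(difference = p_difference, symmetry = p_symmetry), 3))
```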
\(\text{Women} \rightarrow N(\mu,\sigma) = \boxed{N(5261,807)}\)
\(\text{Men} \rightarrow N(\mu,\sigma) = \boxed{N(4313,583)}\)
\(\text{Leo} \rightarrow Z = \frac{x - \mu}{\sigma} = \frac{4948 - 4313}{583} = \boxed{1.089}\)
\(\text{Mary} \rightarrow Z = \frac{x - \mu}{\sigma} = \frac{5513 - 5261}{807} = \boxed{0.312}\)
The Z-scores tell us that Mary and Leo’s times were 0.312 and 1.089 standard deviations above the mean, respectively.
From the z-scores, we can see that Mary finished better than Leo relative to their respective groups. In a race, a lower time is better, so a lower Z-score indicates that the runner was faster relative to their group.
We need to find the probability that Z is greater than 1.089:
\(P(Z > 1.089) = 1 - P(Z < 1.089) = \boxed{0.138}\)
1-pnorm(1.089)
[1] 0.1380769
Therefore, Leo ran faster than 13.8% of the people in his group.
\(P(Z > 0.312) = 1 - P(Z < 0.312) = \boxed{0.378}\)
1-pnorm(0.312)
[1] 0.3775203
Therefore, Mary ran faster than 37.8% of the people in her group.
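The standardization and tail probabilities above can be reproduced in one vectorized step. This is a minimal sketch that takes the group means, standard deviations, and finishing times from the formulas above; the vector names and the "men"/"women" labels are illustrative. Small differences from the values above can arise because here the z-scores are not rounded before calling pnorm.

```r
# Group parameters and finishing times taken from the formulas above
mu    <- c(men = 4313, women = 5261)
sigma <- c(men = 583,  women = 807)
time  <- c(men = 4948, women = 5513)

# Standardize each finishing time within its own group
z <- (time - mu) / sigma

# Share of each group with a larger (slower) time than the runner
faster_than <- pnorm(z, lower.tail = FALSE)

print(round(z, 3))
print(round(faster_than, 3))
```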
heights <- c(54,55,56,56,57,58,58,59,60,60,60,61,61,62,62,63,63,63,64,65,65,67,67,69,73)
heights.df <- as.data.frame(heights)
heights.mean <- mean(heights.df$heights)
heights.sd <- sd(heights.df$heights)
percent.vec <- vector()
for (i in 1:3) {
percent.vec[i] <- nrow(subset(heights.df, heights > heights.mean - i*heights.sd & heights < heights.mean + i*heights.sd))/nrow(heights.df)*100
cat("Percent of data falling within",i,"standard deviations =",percent.vec[i],"%\n")
}
Percent of data falling within 1 standard deviations = 68 %
Percent of data falling within 2 standard deviations = 96 %
Percent of data falling within 3 standard deviations = 100 %
Using a short piece of R code, we can see that 68%, 96%, and 100% of the data fall within 1, 2, and 3 standard deviations of the mean, respectively. These are close to the expected 68%, 95%, and 99.7%, so we can say that the data approximately follow the 68-95-99.7% rule.
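The same percentages can be computed without a loop by standardizing the heights first. This is an alternative sketch, not the code used above; the names z and pct_within are illustrative.

```r
# Heights as listed in the exercise
heights <- c(54,55,56,56,57,58,58,59,60,60,60,61,61,62,62,63,63,63,64,65,65,67,67,69,73)

# Standardize, then count how many observations lie within k standard deviations
z <- (heights - mean(heights)) / sd(heights)
pct_within <- sapply(1:3, function(k) mean(abs(z) < k) * 100)
names(pct_within) <- paste(1:3, "SD")
print(pct_within)
```

Taking the mean of a logical vector gives the proportion of TRUE values, which replaces the subset-and-count step.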
qqnormsim(heights.df$heights)
We can see that the plot of the actual data appears to be normal and follows the line even more closely than some of the simulations.
\(P(X = n) = (1-p)^{n-1}p \rightarrow P(X = 10) = (1-0.02)^{10-1} \times 0.02 = 0.0167 = \boxed{1.67\%}\)
Using the ‘dgeom’ function in R, we can see the same solution:
dgeom(9,0.02)
[1] 0.01667496
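Note that dgeom is parameterized by the number of failures before the first success, which is why its first argument is 9 rather than 10. The sketch below compares the hand formula with dgeom for the same values; the variable names are illustrative.

```r
# Probability that the first defect appears on the n-th transistor
p <- 0.02  # probability of a defect
n <- 10    # trial on which the first defect occurs

by_formula <- (1 - p)^(n - 1) * p    # (1 - p)^(n - 1) * p
by_dgeom   <- dgeom(n - 1, prob = p) # n - 1 failures before the first success
print(c(formula = by_formula, dgeom = by_dgeom))
```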
\(p = \text{Probability of a defective transistor} \rightarrow P(\text{No defects in 100 transistors}) = (1-p)^{100} = 0.98^{100} = 0.1326 = \boxed{13.26\%}\)
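This result can be cross-checked with built-in distribution functions. The sketch below treats "no defects in 100" both as a binomial count of zero and as a geometric waiting time longer than 100 trials; the variable names are illustrative.

```r
# P(no defects in a batch of 100), computed three ways
p <- 0.02
by_formula  <- (1 - p)^100
by_binomial <- dbinom(0, size = 100, prob = p)            # zero defects in 100 trials
by_geometric <- pgeom(99, prob = p, lower.tail = FALSE)   # more than 99 failures before the first defect
print(round(c(by_formula, by_binomial, by_geometric), 4))
```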
\(\mu = \frac{1}{p} = \frac{1}{0.02} = \boxed{50}\) \(\sigma = \sqrt{\frac{1-p}{p^{2}}} = \sqrt{\frac{1-0.02}{0.02^2}} = \boxed{49.4975}\)
\(\mu = \frac{1}{p} = \frac{1}{0.05} = \boxed{20}\) \(\sigma = \sqrt{\frac{1-p}{p^{2}}} = \sqrt{\frac{1-0.05}{0.05^2}} = \boxed{19.4936}\)
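The two calculations above differ only in p, so they can be wrapped in a small helper. geom_summary is a hypothetical function written for this illustration, not part of any package.

```r
# Mean and standard deviation of the number of trials until the first success
geom_summary <- function(p) {
  c(mean = 1 / p, sd = sqrt((1 - p) / p^2))
}

print(round(geom_summary(0.02), 4))
print(round(geom_summary(0.05), 4))
```

Raising the defect probability shortens the expected wait and shrinks the spread.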
\(P(X = k) = \binom{n}{k}p^{k}(1-p)^{n-k} = \frac{n!}{k!(n-k)!}p^{k}(1-p)^{n-k} = \frac{3!}{2!(3-2)!}0.51^{2}(0.49)^{1} = \boxed{0.3823}\)
Using the ‘dbinom’ function in R, we can get to the same answer:
dbinom(2,3,0.51)
[1] 0.382347
\(P(\text{2 boys, 1 girl}) = (0.51 \times 0.51 \times 0.49) + (0.51 \times 0.49 \times 0.51) + (0.49 \times 0.51 \times 0.51) = \boxed{0.3823}\)
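The three orderings can also be enumerated programmatically. This sketch lists every boy/girl sequence for three children with expand.grid and sums the probabilities of those with exactly two boys; the column and variable names are illustrative.

```r
p_boy <- 0.51

# All 2^3 possible birth orders
kids <- expand.grid(first = c("B", "G"), second = c("B", "G"), third = c("B", "G"),
                    stringsAsFactors = FALSE)

# Probability of each ordering, assuming independent births
n_boys <- rowSums(kids == "B")
prob   <- p_boy^n_boys * (1 - p_boy)^(3 - n_boys)

two_boys <- kids[n_boys == 2, ]
total    <- sum(prob[n_boys == 2])
print(two_boys)
print(round(total, 4))
```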
\(P(X = k) = \binom{n}{k}p^{k}(1-p)^{n-k} = \frac{n!}{k!(n-k)!}p^{k}(1-p)^{n-k} = \frac{8!}{3!(8-3)!}0.51^{3}(0.49)^{5}\)
Using an R function, we can get to the answer:
dbinom(3,8,0.51)
[1] 0.2098355
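Enumerating orderings by hand becomes tedious here because there are 56 ways to place 3 boys among 8 children. The sketch below evaluates the binomial formula directly with choose and compares it with dbinom; the variable names are illustrative.

```r
n <- 8     # children
k <- 3     # boys
p <- 0.51  # probability of a boy

n_orderings <- choose(n, k)                      # number of distinct orderings
by_formula  <- n_orderings * p^k * (1 - p)^(n - k)
by_dbinom   <- dbinom(k, size = n, prob = p)
print(n_orderings)
print(c(formula = by_formula, dbinom = by_dbinom))
```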
\(P(X = k) = \binom{n-1}{k-1}p^{k}(1-p)^{n-k} = \frac{(n-1)!}{(k-1)!(n-k)!}p^{k}(1-p)^{n-k} = \frac{(10-1)!}{(3-1)!(10-3)!}0.15^{3}(1-0.15)^{10-3} = \boxed{0.039}\)
We can also use the ‘dnbinom’ function in R to get to the same answer:
dnbinom(10-3,3,0.15)
[1] 0.03895012
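The negative binomial probability can be read as "exactly 2 successes in the first 9 serves, then a success on the 10th". The sketch below checks that decomposition against the formula and against dnbinom, whose first argument is the number of failures (here 10 - 3 = 7); the variable names are illustrative.

```r
n <- 10    # serve on which the k-th success occurs
k <- 3     # number of successful serves
p <- 0.15  # probability of a successful serve

by_formula   <- choose(n - 1, k - 1) * p^k * (1 - p)^(n - k)
by_two_steps <- dbinom(k - 1, size = n - 1, prob = p) * p  # 2 of the first 9, then a success
by_dnbinom   <- dnbinom(n - k, size = k, prob = p)         # 7 failures before the 3rd success
print(round(c(by_formula, by_two_steps, by_dnbinom), 4))
```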
Since each serve is independent, the probability is 15%.
In part (a), we are looking for a specific pattern in the 10 serves. For part (b), we only care about one independent serve.