library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
#library(DATA606)
library(ggplot2)
Area under the curve, Part I. (4.1, p. 142) What percent of a standard normal distribution \(N(\mu=0, \sigma=1)\) is found in each region? Be sure to draw a graph. The helper function below shades the requested region under the normal curve.
normal_area <- function(mean = 0, sd = 1, lb, ub, acolor = "lightgray", ...) {
x <- seq(mean - 3 * sd, mean + 3 * sd, length = 100)
if (missing(lb)) {
lb <- min(x)
}
if (missing(ub)) {
ub <- max(x)
}
x2 <- seq(lb, ub, length = 100)
plot(x, dnorm(x, mean, sd), type = "n", ylab = "")
y <- dnorm(x2, mean, sd)
polygon(c(lb, x2, ub), c(0, y, 0), col = acolor)
lines(x, dnorm(x, mean, sd), type = "l", ...)
}
pnorm(-1.35)
## [1] 0.08850799
normal_area(mean = 0, sd = 1, ub = -1.35, lwd = 2)
1-pnorm(1.48)
## [1] 0.06943662
normal_area(mean = 0, sd = 1, lb=1.48, lwd = 2)
pnorm(1.5)-pnorm(-.4)
## [1] 0.5886145
normal_area(mean = 0, sd = 1, lb = -.4, ub = 1.5, lwd = 2)
(1-pnorm(2))+(pnorm(-2))
## [1] 0.04550026
# Draw each tail with its own call; joining the calls with `+` only adds their empty return values
normal_area(mean = 0, sd = 1, ub = -2, lwd = 2)
normal_area(mean = 0, sd = 1, lb = 2, lwd = 2)
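As a cross-check on the `pnorm()` results above, the same areas can be obtained by numerically integrating the standard normal density. This is a minimal sketch using base R's `integrate()`; the helper name `area_between` is hypothetical and not part of the original solution.

```r
# Hypothetical helper: area under N(0, 1) between lb and ub,
# computed by numerical integration of the density
area_between <- function(lb = -Inf, ub = Inf) {
  integrate(dnorm, lower = lb, upper = ub)$value
}

area_between(ub = -1.35)                       # Z < -1.35
area_between(lb = 1.48)                        # Z > 1.48
area_between(lb = -0.4, ub = 1.5)              # -0.4 < Z < 1.5
area_between(ub = -2) + area_between(lb = 2)   # |Z| > 2
```

Each value should agree with the corresponding `pnorm()` expression up to the numerical tolerance of `integrate()`.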
Triathlon times, Part I (4.4, p. 142) In triathlons, it is common for racers to be placed into age and gender groups. Friends Leo and Mary both completed the Hermosa Beach Triathlon, where Leo competed in the Men, Ages 30 - 34 group while Mary competed in the Women, Ages 25 - 29 group. Leo completed the race in 1:22:28 (4948 seconds), while Mary completed the race in 1:31:53 (5513 seconds). Obviously Leo finished faster, but they are curious about how they did within their respective groups. Can you help them? Here is some information on the performance of their groups:
Remember: a better performance corresponds to a faster finish.
DISCUSSION: The men's finishing times follow \(N(\mu=4313, \sigma=583)\) and the women's follow \(N(\mu=5261, \sigma=807)\), in seconds.
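#men (Leo): Z = (x - mean) / sd
(4948-4313)/583
## [1] 1.089194
#women (Mary)
(5513-5261)/807
## [1] 0.3122677
Both finished slower than their group mean (positive Z scores, since a larger time is a slower finish). Mary is closer to her group's mean (smaller Z score).
x<-seq(-5,5,.01)
dens<-dnorm(x,0,1)
plot(x,dens,type='l')
abline(v=.31)
abline(v=1.09)
DISCUSSION:
Because a lower time is better, the lower Z score indicates the better relative performance. Mary's Z score (0.31) is lower than Leo's (1.09), so Mary ranked better within her group.
#Leo: fraction of his group with a slower (larger) time
1-pnorm(1.089194)
## [1] 0.1380342
Leo finished faster than 13.8 percent of the participants in his group.
#Mary
1-pnorm(0.3122677)
## [1] 0.3774185
Mary finished faster than 37.7 percent of the participants in her group.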
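Yes, the percentages (and any ranking based on them) would change if the finishing times were not nearly normal, because they rely on areas under the normal curve. The Z scores themselves would not change, since they depend only on each group's mean and standard deviation. In the problem, we are instructed to base our calculations on a normal distribution.
For example, if the distribution were highly right skewed, the area under the curve to the left of a given Z score would differ substantially from the corresponding area under the bell-shaped normal curve.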
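The Z-score and percentile steps for this exercise can be wrapped in a small helper. The sketch below uses the convention \(Z = (x - \mu)/\sigma\), so a positive Z score means a slower-than-average time, and it assumes the group times are normal. The names `z_score`, `leo_faster_than`, and `mary_faster_than` are hypothetical.

```r
# Hypothetical helper: Z score of a finishing time within its group
z_score <- function(time, mean, sd) (time - mean) / sd

leo_z  <- z_score(4948, mean = 4313, sd = 583)
mary_z <- z_score(5513, mean = 5261, sd = 807)

# Fraction of each group with a slower (larger) time: the upper tail,
# assuming the finishing times are normally distributed
leo_faster_than  <- pnorm(leo_z,  lower.tail = FALSE)
mary_faster_than <- pnorm(mary_z, lower.tail = FALSE)

round(c(leo_z = leo_z, mary_z = mary_z), 2)
round(c(leo = leo_faster_than, mary = mary_faster_than), 3)
```

Using `lower.tail = FALSE` keeps the Z scores in their natural sign while still returning the share of slower finishers.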
Heights of female college students Below are heights of 25 female college students.
\[ \stackrel{1}{54}, \stackrel{2}{55}, \stackrel{3}{56}, \stackrel{4}{56}, \stackrel{5}{57}, \stackrel{6}{58}, \stackrel{7}{58}, \stackrel{8}{59}, \stackrel{9}{60}, \stackrel{10}{60}, \stackrel{11}{60}, \stackrel{12}{61}, \stackrel{13}{61}, \stackrel{14}{62}, \stackrel{15}{62}, \stackrel{16}{63}, \stackrel{17}{63}, \stackrel{18}{63}, \stackrel{19}{64}, \stackrel{20}{65}, \stackrel{21}{65}, \stackrel{22}{67}, \stackrel{23}{67}, \stackrel{24}{69}, \stackrel{25}{73} \]
x<-c(54,55,56,56,57,58,58,59,60,60,60,61,61,62,62,63,63,63,64,65,65,67,67,69,73)
hist(x)
m.x<-mean(x)
s.x<-sd(x)
#Check if follows 68-95-99.7 rule, 68 below....
within1sd <- x[x >= m.x - s.x & x <= m.x + s.x]
length(within1sd) / length(x)
## [1] 0.68
#95 below.....
within2sd <- x[x >= m.x - 2*s.x & x <= m.x + 2*s.x]
length(within2sd) / length(x)
## [1] 0.96
#99.7 below.....
within3sd <- x[x >= m.x - 3*s.x & x <= m.x + 3*s.x]
length(within3sd) / length(x)
## [1] 1
The heights approximately follow the 68-95-99.7 rule: 68%, 96%, and 100% of the observations fall within 1, 2, and 3 standard deviations of the mean, respectively.
The histogram is roughly bell-shaped and symmetric, which is consistent with an approximately normal distribution.
In the Q-Q plot the points fall close to a straight line, which also indicates an approximately normal distribution.
# Use qqnormsim() from openintro rather than DATA606 (detach DATA606 first if it is loaded)
df<-as.data.frame(x)
qqnormsim(sample=x,data=df)
Comparing the 8 simulated Q-Q plots with the plot of the actual data, it is reasonable to assume the heights are approximately normally distributed.
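The three 68-95-99.7 checks in this exercise repeat the same calculation, so they can be folded into one function. The following is a minimal sketch in base R; `prop_within` is a hypothetical helper name, and `qqnorm()`/`qqline()` are shown as a base-R alternative to `qqnormsim()`.

```r
x <- c(54, 55, 56, 56, 57, 58, 58, 59, 60, 60, 60, 61, 61,
       62, 62, 63, 63, 63, 64, 65, 65, 67, 67, 69, 73)

# Hypothetical helper: proportion of observations within k sample
# standard deviations of the sample mean
prop_within <- function(x, k) mean(abs(x - mean(x)) <= k * sd(x))

# Proportions within 1, 2, and 3 standard deviations
props <- sapply(1:3, function(k) prop_within(x, k))
props

# Base-R normal Q-Q plot with a reference line
qqnorm(x)
qqline(x)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which replaces the subset-then-divide pattern used above.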
Defective rate. (4.14, p. 148) A machine that produces a special type of transistor (a component of computers) has a 2% defective rate. The production is considered a random process where each transistor is independent of the others.
#GGGGGGGGGD
((.98)^9)*(.02)
## [1] 0.01667496
.98^100
## [1] 0.1326196
#Geometric distribution mean=1/p
1/.02
## [1] 50
# so on average the 50th transistor is the first defective one (49 good ones before it)
#sd sqrt((1-p)/p^2)
((1-.02)/((.02)^2))^.5
## [1] 49.49747
#mean
1/.05
## [1] 20
#sd sqrt((1-p)/p^2)
((1-.05)/((.05)^2))^.5
## [1] 19.49359
DISCUSSION: Increasing the probability of a defect decreases both the mean and the standard deviation of the wait until the first defect in the geometric distribution.
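The hand calculations for this exercise can also be expressed with R's built-in distribution functions. This is a sketch, not the original author's method; `geom_summary` is a hypothetical helper. Note that `dgeom()` counts the failures before the first success, so "the 10th transistor is the first defective one" corresponds to 9 failures.

```r
# Hypothetical helper: mean and sd of the number of trials up to and
# including the first success, for success probability p
geom_summary <- function(p) {
  c(mean = 1 / p, sd = sqrt((1 - p) / p^2))
}

geom_summary(0.02)   # 2% defective rate
geom_summary(0.05)   # 5% defective rate

# P(first defect is the 10th transistor): 9 good ones, then a defect
dgeom(9, prob = 0.02)

# P(no defects in a batch of 100)
dbinom(0, size = 100, prob = 0.02)
```

Comparing the two `geom_summary()` results shows directly how a larger defect probability shortens the expected wait and reduces its spread.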
Male children. While it is often assumed that the probabilities of having a boy or a girl are the same, the actual probability of having a boy is slightly higher at 0.51. Suppose a couple plans to have 3 kids.
choose(3,2)*.49*.51^2
## [1] 0.382347
dbinom(2,3,.51)
## [1] 0.382347
DISCUSSION: BBG
.51*.51*.49
## [1] 0.127449
BGB
.51*.49*.51
## [1] 0.127449
GBB
.49*.51*.51
## [1] 0.127449
so P(BBG)+P(BGB)+P(GBB)
(.51*.51*.49)+(.51*.49*.51)+(.49*.51*.51)
## [1] 0.382347
DISCUSSION:
The answers from parts (a) and (b) match: both give 0.382347.
choose(8,3)
## [1] 56
DISCUSSION: There are 56 different orderings in which 3 of 8 children are boys.
Writing out all 56 orderings by hand would be extremely tedious and could easily lead to mistakes or omissions.
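The enumeration argument can be checked by letting R list the birth orders instead of writing them out. The sketch below is illustrative only; the column names and the variables `orders` and `two_boys` are hypothetical.

```r
p_boy <- 0.51

# All 2^3 possible birth orders for three children
orders <- expand.grid(first  = c("B", "G"),
                      second = c("B", "G"),
                      third  = c("B", "G"),
                      stringsAsFactors = FALSE)

# Number of boys in each ordering and the probability of that ordering
orders$boys <- rowSums(orders[, c("first", "second", "third")] == "B")
orders$prob <- p_boy^orders$boys * (1 - p_boy)^(3 - orders$boys)

# Orderings with exactly two boys, and their total probability
two_boys <- orders[orders$boys == 2, ]
two_boys
sum(two_boys$prob)

# For 3 boys among 8 children, combn() lists every choice of positions
ncol(combn(8, 3))
```

The total over the two-boy orderings should equal `dbinom(2, 3, .51)`, and the number of columns returned by `combn(8, 3)` should equal `choose(8, 3)`.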
Serving in volleyball. (4.30, p. 162) A not-so-skilled volleyball player has a 15% chance of making the serve, which involves hitting the ball so it passes over the net on a trajectory such that it will land in the opposing team’s court. Suppose that her serves are independent of each other.
DISCUSSION: This is a negative binomial distribution with p=.15, n=10, k=3.
choose(9,2)*((.15)^3)*((.85)^7)
## [1] 0.03895012
dnbinom(7,3,.15)
## [1] 0.03895012
DISCUSSION:
p=.15. The serves are independent, so the probability of success on any particular serve is .15.
Let A be a success on the 10th serve and B be 2 successes in the first 9 serves. Then
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{\binom{9}{2}(0.15)^3(0.85)^7}{\binom{9}{2}(0.15)^2(0.85)^7} = 0.15 \]
The answers differ because the two scenarios are not the same. In the first case, we calculate the probability of the entire sequence of 10 serves.
In the second case, we calculate the probability of a single serve, given what has already occurred.
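The relationship between the two volleyball answers can be verified numerically. In this sketch (variable names are hypothetical), the joint probability is built from the binomial probability of 2 successes in 9 serves times the probability of a success on the 10th, which relies on the stated independence of serves.

```r
p <- 0.15

# P(B): exactly 2 successful serves in the first 9
p_two_in_nine <- dbinom(2, size = 9, prob = p)

# P(A and B): 2 successes in the first 9 AND a success on the 10th,
# using independence of serves
p_joint <- p_two_in_nine * p

p_joint                          # 3rd success on the 10th serve
dnbinom(7, size = 3, prob = p)   # same event: 7 failures before the 3rd success

# P(A | B) = P(A and B) / P(B), which reduces to p
p_joint / p_two_in_nine
```

Dividing the joint probability by the probability of the conditioning event removes the first nine serves from the calculation, which is why only the single-serve probability remains.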