What percent of a standard normal distribution N(\(\mu\) = 0, \(\sigma\) = 1) is found in each region? Be sure to draw a graph.
x <- seq(-5, 5, by = 0.01)  # grid over the support for plotting
y <- dnorm(x, 0, 1)         # standard normal density
plot(x, y, type = 'l')
p <- 1 - pnorm(-1.13, 0, 1)  # P(Z > -1.13)
p
## [1] 0.8707619
v <- seq(-1.13, 5, 0.1)  # grid over the shaded region
w <- dnorm(v, 0, 1)
plot(x, y, type = 'l')
polygon(c(-1.13, v, 5), c(0, w, 0), col = 'red')  # shade the area under the curve
text(x = -1.13, y = 0, "-1.13")
p <- pnorm(0.18, 0, 1)  # P(Z < 0.18)
p
## [1] 0.5714237
v<-seq(-5,0.18,0.01)
w<-dnorm(v,0,1)
plot(x,y,type = 'l')
polygon(c(-5,v,0.18),c(0,w,0),col='red')
text(x=0.18,y=0,"0.18")
p <- 1 - pnorm(8, 0, 1)  # P(Z > 8)
p
## [1] 6.661338e-16
x <- seq(7.75, 9, by = 0.01)
y <- dnorm(x,0,1)
v<-seq(8,9,0.01)
w<-dnorm(v,0,1)
plot(x,y,type = 'l')
polygon(c(8,v,9),c(0,w,0),col='red')
text(x=8,y=0,"8")
p <- pnorm(0.5, 0, 1) - pnorm(-0.5, 0, 1)  # P(-0.5 < Z < 0.5)
p
## [1] 0.3829249
x <- seq(-5, 5, by = 0.01)
y <- dnorm(x,0,1)
v<-seq(-0.5,0.5,0.01)
w<-dnorm(v,0,1)
plot(x,y,type = 'l')
polygon(c(-0.5,v,0.5),c(0,w,0),col='red')
text(x=-0.5,y=0,"-0.5")
text(x=0.5,y=0,"0.5")
In triathlons, it is common for racers to be placed into age and gender groups. Friends Leo and Mary both completed the Hermosa Beach Triathlon, where Leo competed in the Men, Ages 30 - 34 group while Mary competed in the Women, Ages 25 - 29 group. Leo completed the race in 1:22:28 (4948 seconds), while Mary completed the race in 1:31:53 (5513 seconds). Obviously Leo finished faster, but they are curious about how they did within their respective groups. Can you help them? Here is some information on the performance of their groups:

- The finishing times of the Men, Ages 30 - 34 group have a mean of 4313 seconds with a standard deviation of 583 seconds.
- The finishing times of the Women, Ages 25 - 29 group have a mean of 5261 seconds with a standard deviation of 807 seconds.
- The distributions of finishing times for both groups are approximately Normal.

Remember: a better performance corresponds to a faster finish.
Men: N(\(\mu\) = 4313, \(\sigma\) = 583)
Women: N(\(\mu\) = 5261, \(\sigma\) = 807)
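The Z-scores below are computed with the standard formula:

\(Z = \dfrac{x - \mu}{\sigma}\)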
The Z-scores are positive for both Leo and Mary, showing that both were slower than the means of their groups, i.e. slower than 50% of the people in their groups. Leo was a little over one standard deviation (1.089194) slower than the mean of his group, while Mary was only about a third of a standard deviation (0.3122677) above the mean of hers.
#Leo
(4948-4313)/583
## [1] 1.089194
#Mary
(5513-5261)/807
## [1] 0.3122677
Mary did better than Leo within her group, as shown by her lower Z-score: 0.3122677 vs. 1.089194. Mary's Z-score means that fewer people completed the triathlon faster than her, relative to Leo's standing in his group.
p <- 1 - pnorm(4948, 4313, 583)  # proportion of Leo's group with slower (larger) times
p
## [1] 0.1380342
x <- seq(2000, 7000, by = 0.01)
y <- dnorm(x,4313,583)
v<-seq(4948, 7000, by = 0.01)
w<-dnorm(v,4313,583)
plot(x,y,type = 'l')
polygon(c(4948,v,7000),c(0,w,0),col='red')
p <- 1 - pnorm(5513, 5261, 807)  # proportion of Mary's group with slower (larger) times
p
## [1] 0.3774186
x <- seq(2000, 9000, by = 0.01)
y <- dnorm(x,5261,807)
v<-seq(5513, 9000, by = 0.01)
w<-dnorm(v,5261,807)
plot(x,y,type = 'l')
polygon(c(5513,v,9000),c(0,w,0),col='red')
Yes, the answers would be different if the distributions were something other than Normal, except for (b). Z-scores can be calculated for any distribution, so they would not change under a different distribution type. However, a left- or right-skewed distribution, for example, means a different proportion of athletes falls to the left of Leo's and Mary's results, so there would be a different number of athletes with better times.
library(fGarch)
#Results for Leo using a skew distribution
p<-1-psnorm(4948,4313,583,0.5)
p
## [1] 0.1173959
x <- seq(2000, 7000, by = 0.1)
y <- dsnorm(x,4313,583,0.5) # 0.5 is the skewness parameter (xi)
v<-seq(4948, 7000, by = 0.1)
w<-dsnorm(v,4313,583,0.5)
plot(x,y,type = 'l')
polygon(c(4948,v,7000),c(0,w,0),col='red')
#Results for Mary using a skew distribution
p<-1-psnorm(5513,5261,807,1.5)
p
## [1] 0.345385
x <- seq(2000, 9000, by = 0.1)
y <- dsnorm(x,5261,807,1.5)
v<-seq(5513, 9000, by = 0.1)
w<-dsnorm(v,5261,807,1.5)
plot(x,y,type = 'l')
polygon(c(5513,v,9000),c(0,w,0),col='red')
Below are heights of 25 female college students.
68% of the readings should be between 61.52 - 4.58 and 61.52 + 4.58, i.e. between 56.94 and 66.10. Since we have 25 readings, we should have about 17 of them (68% of 25) within this range. Counting the readings within this range, we find that there are in fact 17.
height<-c(54,55,56,56,57,58,58,59,60,60,60,61,61,62,62,63,63,63,64,65,65,67,67,69,73)
heightMean<-mean(height)
heightSD<-sd(height)
sum(height > heightMean-heightSD & height < heightMean+heightSD)
## [1] 17
95%: the range is now 61.52 - 2(4.58) to 61.52 + 2(4.58), i.e. 52.36 to 70.68, and we should have 95% of the readings in this range, or 23.75, roughly 23 or 24 measurements. Doing the count, we find 24 measurements in this range.
sum(height > heightMean-heightSD*2 & height < heightMean+heightSD*2)
## [1] 24
99.7%: the range is now 61.52 - 3(4.58) to 61.52 + 3(4.58), i.e. 47.78 to 75.26, and we should have 99.7% of the readings in this range, or 24.93, roughly 24 or 25 measurements. Doing the count, we find 25 measurements in this range.
sum(height > heightMean-heightSD*3 & height < heightMean+heightSD*3)
## [1] 25
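As a compact cross-check, all three counts can be reproduced in a single pass by looping over k = 1, 2, 3 standard deviations (same data and strict inequalities as above):

sapply(1:3, function(k) sum(abs(height - heightMean) < k * heightSD))  # should return 17 24 25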
Yes, they do. The distribution plot shows the normal curve following the histogram bars fairly well. On the Q-Q (probability) plot, the points also fall on the line fairly well, with some exceptions towards the tails but a good match in the center. Both graphs indicate an approximately normal distribution.
h<-hist(height,freq = FALSE,ylim = c(0,0.1))
x<-40:80
y <- dnorm(x = x, mean = heightMean, sd = heightSD)
lines(x = x, y = y, col = "blue")
qqnorm(height)
qqline(height)
We can also run a few simulations and compare their probability plots; on these, too, most points fall on the line.
qqnormsim <- function(dat) {
  par(mfrow = c(3, 3))
  qqnorm(dat, main = "Normal QQ Plot (Data)")
  qqline(dat)
  # compare against 8 simulated samples of the same size, mean, and sd
  for (i in 1:8) {
    simnorm <- rnorm(n = length(dat), mean = mean(dat), sd = sd(dat))
    qqnorm(simnorm, main = "Normal QQ Plot (Sim)")
    qqline(simnorm)
  }
  par(mfrow = c(1, 1))
}
qqnormsim(height)
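As a supplementary check not part of the original analysis, a formal test such as the Shapiro-Wilk test (shapiro.test in base R's stats package) can complement the visual diagnostics; a large p-value is consistent with normality:

shapiro.test(height)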
A machine that produces a special type of transistor (a component of computers) has a 2% defective rate. The production is considered a random process where each transistor is independent of the others.
p(defect) = 0.02 => p(\(n^{th}\) transistor being the first defective) = \((1-p)^{n-1}p\)
p<-0.02
n<-10
((1-p)^(n-1))*p
## [1] 0.01667496
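The same value is available from R's built-in geometric distribution; note that dgeom counts the failures before the first success, so the first defect on the 10th transistor corresponds to 9 non-defective ones:

dgeom(n - 1, p)  # failures-before-first-success parameterization
## [1] 0.01667496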
p(100 non-defective) = \((1-p)^{100}\)
(1-p)^100
## [1] 0.1326196
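Equivalently, observing zero defects in 100 independent trials is a binomial probability:

dbinom(0, 100, p)  # zero successes (defects) in 100 Bernoulli trials
## [1] 0.1326196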
\(\mu = 1/p = 1/0.02\)
\(\sigma = \sqrt{(1-p)/p^2} = \sqrt{(1-0.02)/0.02^2}\)
#Expected number of transistors until the first defective one
1/p
## [1] 50
#Standard deviation
sqrt((1-p)/p^2)
## [1] 49.49747
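A quick simulation is one way to sanity-check these moments. This is a sketch with an arbitrary seed; rgeom counts failures before the first success, so we add 1 to get the trial number of the first defect:

set.seed(1)  # arbitrary seed for reproducibility
sims <- rgeom(100000, p) + 1  # +1: count trials including the first defective one
mean(sims)  # should be close to 50
sd(sims)    # should be close to 49.5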
\(\mu = 1/p = 1/0.05\)
\(\sigma = \sqrt{(1-p)/p^2} = \sqrt{(1-0.05)/0.05^2}\)
p<-0.05
#Expected number of transistors until the first defective one
1/p
## [1] 20
#Standard deviation
sqrt((1-p)/p^2)
## [1] 19.49359
The mean, or expected value, decreases as the probability of each event increases. That is, the expected number of transistors until a defective one decreases as each transistor becomes more likely to be defective: fewer transistors are produced before a defective one is observed.
The standard deviation also decreases. The waiting time until a defective transistor is less spread out (a narrower distribution) as the probability of a defect increases, consistent with there being fewer good transistors between defective ones.
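The trend is easy to tabulate. As an illustration (using the two rates above plus two higher, hypothetical ones):

ps <- c(0.02, 0.05, 0.1, 0.2)  # 0.1 and 0.2 are hypothetical defect rates
data.frame(p = ps, mean = 1/ps, sd = sqrt((1 - ps)/ps^2))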
While it is often assumed that the probabilities of having a boy or a girl are the same, the actual probability of having a boy is slightly higher at 0.51. Suppose a couple plans to have 3 kids.
#using r dbinom function
dbinom(2,3,0.51)
## [1] 0.382347
#using r choose function to calculate combination
choose(3,2)*((0.51)^2)*(1-0.51)
## [1] 0.382347
Yes, we obtain the same result from both calculations. Writing out all three possible orderings of 2 boys among 3 kids and adding them up confirms it once more:
0.51*0.51*(1-0.51)+0.51*(1-0.51)*0.51+(1-0.51)*0.51*0.51
## [1] 0.382347
In part (b) we would have to identify every possible arrangement of 3 boys among 8 kids. With 8 kids there are actually 56 possible arrangements; doing that many multiplications and then adding them all up would be a major feat.
ncol(combn(8, 3))  # number of ways to choose which 3 of the 8 kids are boys
## [1] 56
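For comparison, the binomial formula collapses all 56 of those terms into a single call:

dbinom(3, 8, 0.51)  # P(exactly 3 boys among 8 kids)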
A not-so-skilled volleyball player has a 15% chance of making the serve, which involves hitting the ball so it passes over the net on a trajectory such that it will land in the opposing team’s court. Suppose that her serves are independent of each other.
p = 15%
n = 10
k = 3
Since we are looking for the probability that the \(10^{th}\) trial is exactly her third successful serve, we first calculate the probability of making 2 of the first 9 serves, and then multiply by the probability of making the \(10^{th}\) and last serve.
p<-0.15
n<-10
k<-3
# P(exactly 2 successful serves in the first 9)
firstNine<-dbinom(k-1,n-1,p)
firstNine
## [1] 0.2596674
lastServe<-p
#Probability of the 10th serve being the 3rd successful one
firstNine*lastServe
## [1] 0.03895012
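The same answer comes directly from R's negative binomial distribution, which counts failures before the k-th success; the 3rd success on the 10th serve means 7 misses beforehand:

dnbinom(7, size = 3, prob = p)  # 7 failures before the 3rd success = 3rd success on trial 10
## [1] 0.03895012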
Since all serves are independent, the probability of a successful serve on the \(10^{th}\) try is independent of the outcomes of the previous serves. The probability of making the \(10^{th}\) serve is the same as for any other serve: p = 15%.
In (a) we calculate the probability of making 2 serves in the first 9 tries and then the last one also being successful. In (b) we only want the probability of making the \(10^{th}\) serve after making 2 serves; since the serves are independent, the probability of the \(10^{th}\) serve does not depend on previous results. In (a) we do take previous results into account, because we want the probability of the \(10^{th}\) serve being the \(3^{rd}\) successful one, not of it being successful on its own.