library(DATA606)
## Loading required package: shiny
## Loading required package: openintro
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
##
## cars, trees
## Loading required package: OIdata
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: maps
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:openintro':
##
## diamonds
## Loading required package: markdown
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
##
## demo
Area under the curve, Part I. (4.1, p. 142) What percent of a standard normal distribution \(N(\mu=0, \sigma=1)\) is found in each region? Be sure to draw a graph.
round(((pnorm(-1.35))*100),2)
## [1] 8.85
normalPlot(mean = 0, sd = 1, bounds = c(-1.35, Inf), tails = TRUE)
8.85% of a standard normal distribution \(N(\mu=0, \sigma=1)\) lies below Z = -1.35.
round((1-pnorm(1.48))*100,2)
## [1] 6.94
normalPlot(mean = 0, sd = 1, bounds = c(1.48,Inf), tails = FALSE)
6.94% of a standard normal distribution \(N(\mu=0, \sigma=1)\) lies above Z = 1.48.
z<- c(-.4,1.5)
pnorm(z)
## [1] 0.3445783 0.9331928
round(diff(pnorm(z))*100, 2)
## [1] 58.86
normalPlot(mean = 0, sd = 1, bounds = c(-0.4, 1.5), tails = FALSE)
58.86% of a standard normal distribution \(N(\mu=0, \sigma=1)\) lies between Z = -0.4 and Z = 1.5.
z<- c(-2,2)
pnorm(z)
## [1] 0.02275013 0.97724987
round(diff(pnorm(z))*100, 2)
## [1] 95.45
normalPlot(mean = 0, sd = 1, bounds = c(-2,2), tails = FALSE)
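95.45% of a standard normal distribution \(N(\mu=0, \sigma=1)\) lies between Z = -2 and Z = 2. If the region of interest is the two tails, |Z| > 2, the area is the complement: 100% - 95.45% = 4.55%.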
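The four areas can also be computed in one place with `pnorm` alone, without copying intermediate values by hand. The sketch below is a base-R illustration only; the name `areas` and the region labels are illustrative, and the last entry assumes the region of interest is the two tails, |Z| > 2.

```r
# Areas under the standard normal curve, expressed as percentages.
areas <- c(
  "Z < -1.35"      = pnorm(-1.35),                     # lower tail
  "Z > 1.48"       = pnorm(1.48, lower.tail = FALSE),  # upper tail
  "-0.4 < Z < 1.5" = pnorm(1.5) - pnorm(-0.4),         # middle band
  "|Z| > 2"        = 2 * pnorm(-2)                     # both tails, by symmetry
)
round(areas * 100, 2)
```

Using `lower.tail = FALSE` avoids the `1 - pnorm(...)` subtraction, and the symmetry of the normal curve lets the two-tailed area be written as twice the lower tail.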
Triathlon times, Part I (4.4, p. 142) In triathlons, it is common for racers to be placed into age and gender groups. Friends Leo and Mary both completed the Hermosa Beach Triathlon, where Leo competed in the Men, Ages 30 - 34 group while Mary competed in the Women, Ages 25 - 29 group. Leo completed the race in 1:22:28 (4948 seconds), while Mary completed the race in 1:31:53 (5513 seconds). Obviously Leo finished faster, but they are curious about how they did within their respective groups. Can you help them? Here is some information on the performance of their groups:
Remember: a better performance corresponds to a faster finish.
Men, Ages 30 - 34 N(μ = 4313, σ = 583)
Women, Ages 25 - 29 N(μ = 5261, σ = 807)
#Z-scores for Leo's finishing time
#(Leo = 4948, mean = 4313, Standard deviation = 583)
Leo_x <- 4948
Leo_mu <- 4313
Leo_sd <- 583
Leo_z <- (Leo_x - Leo_mu)/(Leo_sd)
Leo_z
## [1] 1.089194
Leo’s finishing time is 1.089194 standard deviations above the mean. We know it must be above the mean since Z is positive.
#Z-scores for Mary's finishing times
#(Mary = 5513, mean = 5261, Standard deviation = 807)
Mary_x <- 5513
Mary_mu <- 5261
Mary_sd <- 807
Mary_z <- (Mary_x - Mary_mu)/(Mary_sd)
Mary_z
## [1] 0.3122677
Mary’s finishing time is 0.3122677 standard deviations above the mean. We know it must be above the mean since Z is positive.
Did Leo or Mary rank better in their respective groups? Explain your reasoning.
Based on their Z-scores, Mary ranked better in her group. Her finishing time is only 0.3122677 standard deviations above her group's mean, while Leo's is 1.089194 standard deviations above his. In a race a faster (lower) time is better, so the lower Z-score indicates the better relative performance. Therefore Mary ranked better in her respective group.
What percent of the triathletes did Leo finish faster than in his group?
round( (1-pnorm(Leo_z))*100,2)
## [1] 13.8
Answer: Leo finished faster than 13.8% of the triathletes in his group.
round((1-pnorm(Mary_z))*100,2)
## [1] 37.74
Answer: Mary finished faster than 37.74% of the triathletes in her group.
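The same upper-tail percentages can be obtained without computing Z by hand, by passing each group's mean and standard deviation straight to `pnorm`. This is a minimal sketch; the helper name `pct_slower` is hypothetical, and it assumes finishing times in each group are nearly normal.

```r
# Percent of a group whose finishing time is slower (larger) than x,
# under a normal model with mean mu and standard deviation sigma.
pct_slower <- function(x, mu, sigma) {
  round(pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE) * 100, 2)
}

pct_slower(4948, 4313, 583)  # Leo, Men, Ages 30 - 34
pct_slower(5513, 5261, 807)  # Mary, Women, Ages 25 - 29
```

The upper tail is the right one here because finishing faster than someone means that person has a larger time.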
Z-scores can still be computed when the finishing times are not normally distributed, but they are less reliable for comparing two people from different groups. Without a normal model we cannot use the normal probability table to calculate probabilities and percentiles, so the answers to parts (d) and (e) would change.
Heights of female college students Below are heights of 25 female college students.
\[ \stackrel{1}{54}, \stackrel{2}{55}, \stackrel{3}{56}, \stackrel{4}{56}, \stackrel{5}{57}, \stackrel{6}{58}, \stackrel{7}{58}, \stackrel{8}{59}, \stackrel{9}{60}, \stackrel{10}{60}, \stackrel{11}{60}, \stackrel{12}{61}, \stackrel{13}{61}, \stackrel{14}{62}, \stackrel{15}{62}, \stackrel{16}{63}, \stackrel{17}{63}, \stackrel{18}{63}, \stackrel{19}{64}, \stackrel{20}{65}, \stackrel{21}{65}, \stackrel{22}{67}, \stackrel{23}{67}, \stackrel{24}{69}, \stackrel{25}{73} \]
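The code that follows refers to a `heights` vector that is not defined anywhere in this write-up. A minimal sketch, assuming it simply holds the 25 values listed above:

```r
# Heights of the 25 students, typed in from the list above.
heights <- c(54, 55, 56, 56, 57, 58, 58, 59, 60, 60, 60, 61, 61,
             62, 62, 63, 63, 63, 64, 65, 65, 67, 67, 69, 73)

length(heights)  # number of observations
mean(heights)    # sample mean
sd(heights)      # sample standard deviation
```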
# Use the DATA606::qqnormsim function
qqnormsim(heights)
Check whether about 68% of the heights lie within 1 standard deviation of the mean
# height mean
heights_mean <- mean(heights)
# height standard deviation
heights_sd <- sd(heights)
# proportion of observed heights within 1 standard deviation of the mean
mean(abs(heights - heights_mean) <= heights_sd)
## [1] 0.68
Check whether about 95% lie within 2 standard deviations
mean(abs(heights - heights_mean) <= 2 * heights_sd)
## [1] 0.96
Check whether about 99.7% lie within 3 standard deviations
mean(abs(heights - heights_mean) <= 3 * heights_sd)
## [1] 1
Yes. 68%, 96%, and 100% of the observed heights fall within 1, 2, and 3 standard deviations of the mean, so the heights approximately follow the 68-95-99.7% Rule. Note that the check has to use the observed proportions: evaluating pnorm at the mean plus or minus k standard deviations returns 68.27%, 95.45%, and 99.73% for any data set.
summary(heights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54.00 58.00 61.00 61.52 64.00 73.00
The histogram shows that the data are not perfectly symmetric or bell-shaped; they are slightly right-skewed, with the mean (61.52) a little above the median (61). The distribution is unimodal, however, and the points in the normal probability plot closely follow a straight line. We can conclude that the distribution is nearly normal.
Defective rate. (4.14, p. 148) A machine that produces a special type of transistor (a component of computers) has a 2% defective rate. The production is considered a random process where each transistor is independent of the others.
# using actual formula (1-p)^(n-1) * p
defect <- 0.02
success <- 1-defect
n <- 10
Pof_defect <- (success)^(n-1) * defect
Pof_defect
## [1] 0.01667496
# using dgeom function
dgeom((n-1), defect)
## [1] 0.01667496
Answer: 0.01667496
n <- 100
(1-defect)^n
## [1] 0.1326196
Answer: 0.1326196
(c) On average, how many transistors would you expect to be produced before the first with a defect? What is the standard deviation?
#first defect
1/defect
## [1] 50
#standard deviation
sqrt((1-defect)/(defect^2))
## [1] 49.49747
Answer: On average, 50 transistors are expected to be produced before the first with a defect. The standard deviation is 49.49747.
defect2 <- 0.05
#first defect
1/defect2
## [1] 20
#standard deviation
sqrt((1-defect2)/(defect2^2))
## [1] 19.49359
Answer: At a 5% defective rate, 20 transistors are expected to be produced on average before the first with a defect. The standard deviation is 19.49359.
Increasing the probability of the event decreases both the mean and the standard deviation of the wait time until the first success.
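The effect of the defect rate on the wait time can be seen side by side by wrapping the two formulas used above, \(1/p\) and \(\sqrt{(1-p)/p^2}\), in a small function. This is a sketch only; the helper name `geom_summary` is hypothetical.

```r
# Mean and standard deviation of the number of trials up to and
# including the first success, for success probability p.
geom_summary <- function(p) {
  c(mean = 1 / p, sd = sqrt((1 - p) / p^2))
}

# One column per defect rate: the 2% machine and the 5% machine.
sapply(c("2%" = 0.02, "5%" = 0.05), geom_summary)
```

Both rows shrink as `p` grows, which is the pattern described in the sentence above.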
Male children. While it is often assumed that the probabilities of having a boy or a girl are the same, the actual probability of having a boy is slightly higher at 0.51. Suppose a couple plans to have 3 kids.
#using dbinom function
dbinom(2, 3, prob = 0.51)
## [1] 0.382347
#using the mathematical formula for the binomial distribution
n <- 3
k <- 2
p <- 0.51
binomial_d <- factorial(n)/(factorial(k)*factorial(n-k))
prob_boy <- binomial_d*(p^k)*((1-p)^(n-k))
prob_boy
## [1] 0.382347
Answer: 0.382347
# boy, boy, girl
scenario1 <- .51 * .51 * .49
# boy, girl, boy
scenario2 <- .51 * .49 * .51
# girl, boy, boy
scenario3 <- .49 * .51 * .51
scenario1 + scenario2 + scenario3
## [1] 0.382347
Answer from part (a) matches with answer from part (b)
choose(8,3)
## [1] 56
Above we are just counting the ways to get 3 successes in 8 independent trials. Done by hand, the factorial arithmetic is tedious. Using the addition rule for disjoint outcomes is possible for this scenario but not easy, because we would need to list all 56 scenarios that give 3 boys in 8 births.
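To make the count concrete, the scenarios can be enumerated rather than listed by hand. The sketch below uses base R's `combn`; the name `boy_positions` is illustrative, not from the original.

```r
# Each column of combn(8, 3) is one way to choose which 3 of the 8
# children are boys, so the number of columns is the number of scenarios.
boy_positions <- combn(8, 3)
ncol(boy_positions)  # same count as choose(8, 3)

# Every scenario has the same probability, 0.51^3 * 0.49^5, so the
# addition rule reduces to (number of scenarios) * (per-scenario probability).
p <- 0.51
ncol(boy_positions) * p^3 * (1 - p)^5
dbinom(3, 8, prob = p)  # binomial shortcut, for comparison
```

The last two lines should agree, which is why the binomial formula is the practical choice once the number of scenarios grows.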
Serving in volleyball. (4.30, p. 162) A not-so-skilled volleyball player has a 15% chance of making the serve, which involves hitting the ball so it passes over the net on a trajectory such that it will land in the opposing team’s court. Suppose that her serves are independent of each other.
#using dbinom function
dbinom(3, 10, prob = 0.15)
## [1] 0.1298337
#using the mathematical formula for the binomial distribution
n <- 10
k <- 3
p <- 0.15
binomial_d <- factorial(n)/(factorial(k)*factorial(n-k))
prob_serve <- binomial_d*(p^k)*((1-p)^(n-k))
prob_serve
## [1] 0.1298337
Answer: 0.1298337
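The binomial calculation above gives the probability of exactly 3 successful serves somewhere among 10 attempts. If the question instead asks for the probability that the 3rd successful serve happens on exactly the 10th attempt (an assumption here, since the question text is not reproduced in this write-up), the last serve must be a success and the first 9 must contain exactly 2 successes, which is a negative binomial probability. A hedged sketch of that alternative reading:

```r
p <- 0.15   # chance of making a single serve
k <- 3      # number of successful serves wanted
n <- 10     # attempt on which the k-th success must fall

# By hand: choose which of the first n - 1 serves hold the other k - 1 successes.
by_hand <- choose(n - 1, k - 1) * p^k * (1 - p)^(n - k)

# Built-in: dnbinom() counts the failures (n - k) before the k-th success.
built_in <- dnbinom(n - k, size = k, prob = p)

c(by_hand = by_hand, built_in = built_in)
```

Under that reading the probability is about 0.039, smaller than the 0.1298 above because the third success is pinned to the final attempt.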
Answer: The probability that her 10th serve is successful is still 15%, because each serve is independent of the others.
Answer: Part B refers to a single serve, whose probability does not depend on the earlier serves. Part A combines the outcomes of multiple serves, so it is a joint probability over all 10 attempts.