setwd("/Users/jiwonban/ADEC7301/Week 4")
library(readxl)
my.data <- read.csv("challenger-2.csv")
In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file challenger.csv.
The variables in the data set are defined as follows:
launch : this numbers the temperature-sorted observations from 1 to 23.
temp : temperature in degrees Fahrenheit at the time of launch.
incident : if there was an incident with an O-Ring, then it is coded “Yes”.
o_ring_probs : counts the number of O-ring partial failures experienced on the flight.
Load the data into R or Python and answer the following questions. Include all R code.
HINT: You can use the describe function in “psych” package for this.
library("psych")
describe(my.data)
## vars n mean sd median trimmed mad min max range skew
## launch 1 23 12.00 6.78 12.0 12.00 8.90 1.0 23.0 22 0.00
## temp 2 23 69.02 6.97 69.8 69.33 5.34 53.6 80.6 27 -0.40
## incident* 3 23 1.30 0.47 1.0 1.26 0.00 1.0 2.0 1 0.80
## o_ring_probs 4 23 0.43 0.79 0.0 0.26 0.00 0.0 3.0 3 1.81
## kurtosis se
## launch -1.36 1.41
## temp -0.44 1.45
## incident* -1.42 0.10
## o_ring_probs 2.69 0.16
summary(my.data)
## launch temp incident o_ring_probs
## Min. : 1.0 Min. :53.60 Length:23 Min. :0.0000
## 1st Qu.: 6.5 1st Qu.:66.20 Class :character 1st Qu.:0.0000
## Median :12.0 Median :69.80 Mode :character Median :0.0000
## Mean :12.0 Mean :69.02 Mean :0.4348
## 3rd Qu.:17.5 3rd Qu.:74.30 3rd Qu.:1.0000
## Max. :23.0 Max. :80.60 Max. :3.0000
str(my.data)
## 'data.frame': 23 obs. of 4 variables:
## $ launch : int 1 2 3 4 5 6 7 8 9 10 ...
## $ temp : num 53.6 57.2 57.2 62.6 66.2 66.2 66.2 66.2 66.2 68 ...
## $ incident : chr "Yes" "Yes" "Yes" "Yes" ...
## $ o_ring_probs: int 3 1 1 1 0 0 0 0 0 0 ...
launch is a discrete integer identifier: it numbers the
observations from 1 to 23, so its values label launches rather
than measure anything. Because the IDs were assigned after
sorting by temperature, they also carry a rank order, so launch
is best read as an ordinal label, not a quantity.
temp is a continuous interval variable: differences in degrees
Fahrenheit are meaningful, but 0 degrees Fahrenheit is an
arbitrary zero, so ratios of temperatures are not meaningful.
incident is a categorical (nominal) variable with two levels
(Yes or No). o_ring_probs is a discrete ratio variable: it is a
count, and a 0 is a true zero (i.e., no failed o-ring on that
flight). The str output agrees with these readings: launch and
o_ring_probs are stored as integers, temp as numeric, and
incident as character.
hist(my.data$o_ring_probs,
main = "Histogram of recorded o-ring failures prior to the incident",
xlab = "Number of O-ring partial failures experienced on the flight",
col = 'blue')
The histogram is right (positively) skewed, consistent with the skew of 1.81 reported by describe. Most of the recorded flights (16 of 23) had no o-ring partial failures, and five flights experienced one. One flight had two partial failures and another had three.
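Because o_ring_probs is a count with only four distinct values, a frequency table can make the shape explicit. The sketch below is a minimal check, assuming 16 zeros, five ones, one two, and one three; the vector `o_ring` is a stand-in rebuilt from those counts (not read from the CSV), chosen because it reproduces the mean of 0.4348 reported by summary.

```r
# Stand-in for my.data$o_ring_probs (assumed counts, not read from the CSV):
# 16 zeros, 5 ones, 1 two, 1 three
o_ring <- c(rep(0, 16), rep(1, 5), 2, 3)

# Frequency table: one entry per distinct count
counts <- table(o_ring)
print(counts)

# Moment-based skewness; a positive value means a long right tail
m <- mean(o_ring)
skew <- mean((o_ring - m)^3) / mean((o_ring - m)^2)^1.5
print(round(m, 4))  # 0.4348, the mean reported by summary()
print(skew > 0)     # TRUE

# barplot(counts) would draw the same information with one bar per value
```

For count data like this, a bar chart of the table avoids the bin-edge ambiguity that hist can introduce.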
boxplot(my.data$temp ~ my.data$incident,
col = "pink")
The side-by-side boxplots indicate that, in this sample of 23 launches, incidents occurred more often at lower temperatures. The typical (median) temperature for launches with an o-ring incident was about 62.5 degrees, compared with about 70 degrees for launches without one. The boxes and whiskers also show a wider spread of temperatures for launches with an incident than for those without. Taken together, this suggests that the low temperature of 36 degrees may have been a factor in the Challenger’s catastrophic failure.
order(my.data$temp)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
which(my.data$incident == "No")[1]
## [1] 5
Because the observations are sorted by temperature, the fifth observation (66.2 degrees Fahrenheit) is the coldest launch without an incident.
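X > 65, in which X = launch temperature in degrees Fahrenheit
sum(my.data$temp > 65) # number of launches above 65 degrees
## [1] 19
Nineteen launches (observations 5 through 23) took place above 65 degrees Fahrenheit. This count covers launches, not incidents. The incident mean of 1.30 in the describe output corresponds to 7 incidents in 23 launches, and the first four observations (all at 62.6 degrees or below) were incidents, which leaves 3 documented incidents above 65 degrees Fahrenheit.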
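Counting launches and counting incidents above a threshold are two different conditions. The sketch below shows both on a small made-up data frame; `toy` and its values are hypothetical and are not the Challenger file.

```r
# Hypothetical toy data (NOT the Challenger file)
toy <- data.frame(
  temp     = c(53.6, 57.2, 62.6, 66.2, 68.0, 69.8, 75.2),
  incident = c("Yes", "Yes", "Yes", "No", "No", "Yes", "No")
)

# Logical vectors sum as 0/1, so sum() counts rows that meet a condition
launches_above  <- sum(toy$temp > 65)
incidents_above <- sum(toy$temp > 65 & toy$incident == "Yes")

print(launches_above)   # 4
print(incidents_above)  # 1

# table() gives the full cross-tabulation in one call
print(table(above_65 = toy$temp > 65, incident = toy$incident))
```

Unlike an index-based count, this form does not depend on the rows being sorted by temperature.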
The sensitivity and specificity of the polygraph have been a subject of study and debate for years. A 2001 study of the use of polygraph for screening purposes suggested that the probability of detecting an actual liar was .59 (sensitivity) and that the probability of detecting an actual “truth teller” was .90 (specificity). We estimate that about 20% of individuals selected for the screening polygraph will lie.
\[P(A \mid B)\ = \frac{ P(B \mid A)\ * P(A)}{P(B)}\]
\[ P(individual\ is\ a\ liar \mid polygraph\ detected\ lies) \]
#Parameters
CorrectLiarDetect <- .59 # sensitivity: P(detected as liar | liar)
CorrectTrutherDetect <- .9 # specificity: P(cleared | truth teller)
Prob_Individuals_Lying <- .2 # P(liar)
Prob_Individuals_NotLying <- 1 - Prob_Individuals_Lying # P(truth teller)
IncorrectLiarDetect <- 1 - CorrectLiarDetect # P(cleared | liar)
IncorrectTrutherDetect <- 1 - CorrectTrutherDetect # P(detected as liar | truth teller)
#Joint probabilities of being detected as a liar
TruePositive <- CorrectLiarDetect * Prob_Individuals_Lying # liar and detected
FalsePositive <- IncorrectTrutherDetect * Prob_Individuals_NotLying # truth teller and detected
Prob_DetectedLiar <- TruePositive + FalsePositive # P(detected as liar)
#Bayes' rule: probability that a detected individual actually was lying
Prob_Liar_Caught <- round(TruePositive / Prob_DetectedLiar, digits = 4)
print(Prob_Liar_Caught)
## [1] 0.596
The probability that an individual flagged as a liar by the polygraph was actually lying is about 59.6%.
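As a worked check of Bayes' rule with the stated figures (sensitivity 0.59, specificity 0.90, 20% liars), the denominator can be expanded with the law of total probability over liars and truth tellers. The labels `Liar` and `Detected` are shorthand introduced here for the two events.

\[ P(Detected) = 0.59 \times 0.20 + (1 - 0.90) \times 0.80 = 0.118 + 0.080 = 0.198 \]
\[ P(Liar \mid Detected) = \frac{0.59 \times 0.20}{0.198} = \frac{0.118}{0.198} \approx 0.596 \]

Read as natural frequencies: of 1000 screened individuals, 118 liars and 80 truth tellers are flagged, so 118 of the 198 flagged individuals are actual liars.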
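\[ P(Liar \cup Detected\ Liar) = P(Liar) + P(Detected\ Liar \mid Truth\ Teller)\ P(Truth\ Teller) \]
#Liars plus truth tellers who were wrongly flagged (the two groups do not overlap)
Prob_Liar_or_DetectedLiar <- round(Prob_Individuals_Lying + IncorrectTrutherDetect * Prob_Individuals_NotLying, digits = 4)
print(Prob_Liar_or_DetectedLiar)
## [1] 0.28
The probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph is 28%.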
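One way to sanity-check a union probability is to lay out all four joint outcomes and confirm they sum to 1. The sketch below does this with the stated sensitivity, specificity, and share of liars; the names `sens`, `spec`, `prev`, and `joint` are illustrative and do not appear in the analysis above.

```r
sens <- 0.59  # P(flagged | liar)
spec <- 0.90  # P(cleared | truth teller)
prev <- 0.20  # P(liar)

# 2 x 2 table of joint probabilities: rows = truth, columns = polygraph result
joint <- matrix(
  c(prev * sens,             prev * (1 - sens),
    (1 - prev) * (1 - spec), (1 - prev) * spec),
  nrow = 2, byrow = TRUE,
  dimnames = list(truth  = c("liar", "truth_teller"),
                  result = c("flagged", "cleared"))
)
print(joint)

# "Liar or flagged" is every cell except truth_teller & cleared
p_union <- 1 - joint["truth_teller", "cleared"]
print(round(p_union, 4))  # 0.28
```

The same table also yields the conditional probability of lying given a flag: the liar/flagged cell divided by the flagged column total.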
Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years i.e., the machine fails once in 10 years, or the probability of the machine failing in any given year is \(\frac{1}{10}\) .
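\[ P(X = 0 \mid \lambda t = 0.8) = e^{-0.8} \]
#parameters
lambda <- 0.10 # expected failures per year
t <- 8 # years
x <- 0 # failing after 8 years means zero failures in the first 8 years
probability_nofail_8yrs <- exp(-1*lambda*t)*(lambda*t)^x/factorial(x) # Poisson probability of x failures in t years
round(probability_nofail_8yrs, 4)
## [1] 0.4493
round(sqrt(lambda),4) #sd of the yearly failure count
## [1] 0.3162
Based on a Poisson distribution with an expected 0.8 failures in 8 years, the probability that the MRI machine fails after 8 years (that is, has no failure in the first 8 years) is 44.93%. The yearly failure count has a standard deviation of 0.3162 failures.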
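Reading "fails after 8 years" as "no failure during the first 8 years" (an interpretation, not something the prompt spells out), the Poisson answer can be cross-checked three ways: the Poisson mass at zero, the exponential waiting-time tail, and the closed form. The variable names below are illustrative.

```r
lambda <- 0.10  # assumed failure rate per year
t <- 8          # years

p_pois    <- dpois(0, lambda * t)                       # zero failures in 8 years
p_exp     <- pexp(t, rate = lambda, lower.tail = FALSE) # first failure comes after year 8
p_formula <- exp(-lambda * t)                           # closed form

print(round(c(p_pois, p_exp, p_formula), 4))  # 0.4493 0.4493 0.4493
```

All three agree because a Poisson process with rate lambda has exponentially distributed waiting times between events.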
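p <- .1
n <- 8
x <- 0 # failing after 8 years means zero failures in the first 8 yearly trials
prob_nofail_8yrs_BINOMIAL <- dbinom(x,size=n,prob=p) # probability of no failure in 8 years
round(prob_nofail_8yrs_BINOMIAL,4)
## [1] 0.4305
round(sqrt(n*p*(1-p)),4) #sd of the number of failures in 8 years
## [1] 0.8485
Based on a binomial distribution with 8 yearly trials, the probability that the MRI machine fails after 8 years (that is, has no failure in the first 8 years) is 43.05%. The number of failures in 8 years has a standard deviation of 0.8485.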
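Under the same reading of "fails after 8 years" as "no failure in the first 8 yearly trials" (an assumption about the question's intent), the binomial answer matches the geometric tail and the direct product. A minimal sketch with illustrative variable names:

```r
p <- 0.10  # probability of failure in any given year
n <- 8     # years

p_binom  <- dbinom(0, size = n, prob = p)               # zero failures in 8 trials
p_geom   <- pgeom(n - 1, prob = p, lower.tail = FALSE)  # more than 7 failure-free years before the first failure
p_direct <- (1 - p)^n                                   # eight failure-free years in a row

print(round(c(p_binom, p_geom, p_direct), 4))  # 0.4305 0.4305 0.4305
```

Note that pgeom counts the number of failure-free trials before the first failure, which is why the cutoff is n - 1 rather than n.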
In a multiple choice quiz there are 5 questions and 4 choices for each question (a, b, c, d). Robin has not studied for the quiz at all, and decides to randomly guess the answers.
#parameters
prob_correct_trial3 <- 0.25 ##Probability that the answer is right
prob_incorrect_trial3 <- (1-prob_correct_trial3) ##Probability that the answer is wrong
round((prob_incorrect_trial3^2)*prob_correct_trial3,4)
## [1] 0.1406
There is a 14.06% chance that Robin gets the first two questions wrong and the third question right.
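The product of two wrong answers followed by one right answer is the geometric probability mass at two failures, so base R's dgeom gives the same number. A short sketch, with `by_hand` and `via_dgeom` as illustrative names:

```r
p <- 0.25  # probability of guessing a question correctly

by_hand   <- (1 - p)^2 * p       # wrong, wrong, right
via_dgeom <- dgeom(2, prob = p)  # 2 wrong answers before the first right one

print(round(c(by_hand, via_dgeom), 4))  # 0.1406 0.1406
```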
P(X = 3) + P(X=4), in which X = number of questions correct
#binomial
round(dbinom(3,5,0.25) + dbinom(4,5,0.25),4)
## [1] 0.1025
The number of correct answers follows a binomial distribution with 5 independent trials and a success probability of 0.25. Robin has a 10.25% probability of getting exactly 3 or 4 questions correct.
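Written out with the binomial formula, using the same n = 5 and p = 0.25 as the dbinom calls, the two terms are:

\[ P(X = 3) + P(X = 4) = \binom{5}{3}(0.25)^3(0.75)^2 + \binom{5}{4}(0.25)^4(0.75)^1 \approx 0.0879 + 0.0146 = 0.1025 \]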
#binomial dist
round(1-pbinom(2,5,0.25),4)
## [1] 0.1035
#same probability, taking the upper tail directly with lower.tail = FALSE
round(pbinom(q = 2,
size = 5,
prob = .25,
lower.tail = FALSE),
digits=4)
## [1] 0.1035
The probability of Robin getting more than 2 questions right (out of 5) is 10.35%.
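Listing the whole probability mass function makes it easy to see how the "3 or 4 correct" and "more than 2 correct" answers relate: they differ only by the probability of getting all 5 right. A minimal sketch, with `pmf` as an illustrative name:

```r
# Full binomial pmf for 5 questions with a 0.25 chance each
pmf <- dbinom(0:5, size = 5, prob = 0.25)
names(pmf) <- 0:5
print(round(pmf, 4))

p_3_or_4 <- sum(pmf[c("3", "4")])       # exactly 3 or 4 correct
p_3_plus <- sum(pmf[c("3", "4", "5")])  # more than 2 correct

print(round(c(p_3_or_4, p_3_plus), 4))  # 0.1025 0.1035
```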
The distribution of passenger vehicle speeds traveling on the Interstate 5 Freeway (I-5) in California is nearly normal with a mean of 72.6 miles/hour and a standard deviation of 4.78 miles/hour.
P(X<80), in which X = speed of a passenger vehicle in mph.
#parameters
x <- 80
mu <- 72.6
sigma <- 4.78
round(pnorm(x, mu, sigma),4)
## [1] 0.9392
93.92% of vehicles travel slower than 80 mph on I-5 California.
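The same probability can be reached by standardizing first: convert 80 mph to a z-score and look it up on the standard normal curve. A short sketch, with `z` as an illustrative name:

```r
mu <- 72.6     # mean speed in mph
sigma <- 4.78  # standard deviation in mph

z <- (80 - mu) / sigma     # how many standard deviations 80 mph lies above the mean
print(round(z, 3))         # 1.548
print(round(pnorm(z), 4))  # 0.9392
```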
P(68 < X < 78)
#probability between 68 and 78
round(pnorm(78, 72.6, 4.78) - pnorm(68, 72.6, 4.78),4)
## [1] 0.7028
70.28% of vehicles travel between 68 and 78 mph. This makes sense: one standard deviation on either side of the mean runs from 67.82 to 77.38 mph and covers about 68% of a normal distribution. The interval from 68 to 78 mph is slightly wider than that range, so its probability is slightly above 68%.
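The comparison with the one-standard-deviation range can be checked directly. The sketch below computes both probabilities from the stated mean and standard deviation; the variable names are illustrative.

```r
mu <- 72.6
sigma <- 4.78

# Probability within one standard deviation of the mean
p_one_sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)

# Probability between 68 and 78 mph
p_68_78 <- pnorm(78, mu, sigma) - pnorm(68, mu, sigma)

print(c(mu - sigma, mu + sigma))       # 67.82 77.38
print(round(c(p_one_sd, p_68_78), 4))  # 0.6827 0.7028
```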
P(X>70)
#look at the right side of the normal distribution
1-pnorm(70, 72.6,4.78)
## [1] 0.7067562
#fastest 5% = lowest 5% of finishing times (left tail)
round(qnorm(.05, 4313,583))
## [1] 3354
To be among the fastest 5% of male athletes, a finishing time must be under 3354 seconds (about 56 minutes).
#slowest 10% = highest 10% of finishing times (right tail)
round(qnorm(.90, 5261, 807))
## [1] 6295
Female athletes with finishing times above 6295 seconds (about 105 minutes) are in the slowest 10%.
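Each cutoff is the mean plus a standard normal quantile times the standard deviation, and feeding a cutoff back into pnorm should return the tail probability it was built from. A minimal sketch of that round trip, with `fast_cut` and `slow_cut` as illustrative names:

```r
# Cutoffs built from standard normal quantiles
fast_cut <- 4313 + qnorm(0.05) * 583  # 5th percentile of the men's times
slow_cut <- 5261 + qnorm(0.90) * 807  # 90th percentile of the women's times
print(round(c(fast_cut, slow_cut)))   # 3354 6295

# Round trip: each cutoff recovers its tail probability
print(round(pnorm(fast_cut, 4313, 583), 2))                      # 0.05
print(round(pnorm(slow_cut, 5261, 807, lower.tail = FALSE), 2))  # 0.1
```

Because a faster athlete has a smaller time, the "top" performers sit in the left tail and the slowest in the right tail.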