setwd("/Users/jiwonban/ADEC7301/Week 4")
library(readxl)
my.data <- read.csv("challenger-2.csv")

Q1. Basic Data Analysis in R

In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file challenger.csv.

The variables in the data set are defined as follows:

launch : this numbers the temperature-sorted observations from 1 to 23.

temp : temperature in degrees Fahrenheit at the time of launch.

incident : if there was an incident with an O-Ring, then it is coded “Yes”.

o_ring_probs : counts the number of O-ring partial failures experienced on the flight.

Load the data into R or Python and answer the following questions. Include all R code.

a. Print the measures of center (like mean, median, mode, …), spread (like sd, min, max, …) and shape (skewness, kurtosis, …) for the variables in the data.

HINT: You can use the describe function in “psych” package for this.

1a. Solution:

library("psych")
describe(my.data)

##              vars  n  mean   sd median trimmed  mad  min  max range  skew
## launch          1 23 12.00 6.78   12.0   12.00 8.90  1.0 23.0    22  0.00
## temp            2 23 69.02 6.97   69.8   69.33 5.34 53.6 80.6    27 -0.40
## incident*       3 23  1.30 0.47    1.0    1.26 0.00  1.0  2.0     1  0.80
## o_ring_probs    4 23  0.43 0.79    0.0    0.26 0.00  0.0  3.0     3  1.81
##              kurtosis   se
## launch          -1.36 1.41
## temp            -0.44 1.45
## incident*       -1.42 0.10
## o_ring_probs     2.69 0.16

summary(my.data)

##      launch          temp         incident          o_ring_probs   
##  Min.   : 1.0   Min.   :53.60   Length:23          Min.   :0.0000  
##  1st Qu.: 6.5   1st Qu.:66.20   Class :character   1st Qu.:0.0000  
##  Median :12.0   Median :69.80   Mode  :character   Median :0.0000  
##  Mean   :12.0   Mean   :69.02                      Mean   :0.4348  
##  3rd Qu.:17.5   3rd Qu.:74.30                      3rd Qu.:1.0000  
##  Max.   :23.0   Max.   :80.60                      Max.   :3.0000

b. Second, what are the levels of measurement of these 4 variables? Discuss/Justify.

1b. Solution:

str(my.data)

## 'data.frame':    23 obs. of  4 variables:
##  $ launch      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ temp        : num  53.6 57.2 57.2 62.6 66.2 66.2 66.2 66.2 66.2 68 ...
##  $ incident    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ o_ring_probs: int  3 1 1 1 0 0 0 0 0 0 ...

launch is a discrete integer variable that only numbers the observations from 1 to 23, so it acts as a trial ID and is treated as nominal (because the observations are sorted by temperature, it also carries rank information). temp is a continuous interval variable: differences in degrees Fahrenheit are meaningful, but 0 degrees Fahrenheit is an arbitrary point rather than a true absence of temperature, so ratios of temperatures are not meaningful. incident is a categorical (nominal) variable with two levels (Yes or No). o_ring_probs is a discrete count measured on a ratio scale, because a 0 is a true zero (i.e., no failed o-ring for the respective flight). The str() output above confirms the storage types (integer, numeric, character, integer), but the level of measurement comes from what each variable means rather than from how R stores it.

c. Third, provide an appropriate graph for the variable o_ring_probs. Interpret. Boxplot is acceptable, though histogram would be better.

1c. Solution:

hist(my.data$o_ring_probs,
     main = "Histogram of recorded o-ring failures prior to the incident",
     xlab = "Number of O-ring partial failures experienced on the flight",
     col  = 'blue')

The histogram is positively (right) skewed, which agrees with the skew of 1.81 reported by describe(). The distribution tells us that the majority of the recorded flights (16 of 23) had no o-ring partial failures, and five flights experienced one o-ring partial failure. One flight saw two partial failures and another saw three.
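As a rough consistency check on the counts read off the histogram, the sketch below rebuilds a vector with those tallies and compares its mean and standard deviation with the describe() output. The vector `o_ring_counts` is a reconstruction from the reported summary statistics (an assumption), not the original file.

```r
# Hypothetical reconstruction of o_ring_probs from the summary statistics:
# 16 flights with 0, five with 1, one with 2 and one with 3 partial failures.
o_ring_counts <- c(rep(0, 16), rep(1, 5), 2, 3)

table(o_ring_counts)            # frequency of each count
round(mean(o_ring_counts), 2)   # compare with mean = 0.43 in describe()
round(sd(o_ring_counts), 2)     # compare with sd = 0.79 in describe()

# For a discrete count, a bar plot of the table is an alternative to hist()
barplot(table(o_ring_counts),
        xlab = "Number of O-ring partial failures",
        ylab = "Number of flights")
```

Because the variable only takes the values 0 to 3, the bar plot shows one bar per value and avoids any ambiguity about bin edges.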

d. The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident in formula). Why might this have been a concern?

1d. Solution:

boxplot(my.data$temp ~ my.data$incident,
        col = "pink")

The side-by-side boxplots indicate that, based on a sample of 23 launched flights, incidents occurred more often when the recorded temperature was lower. The center line of each box (the median) sits at roughly 62.5 degrees for flights with an o-ring incident, versus about 70 degrees for flights without one. The boxes and whiskers also show a wider range of temperatures for flights with an incident than for flights without. Because incidents were concentrated at the lower temperatures, and 36 degrees is far below the coldest previous launch (53.6 degrees), the low temperature on launch day may have been a factor in the Challenger’s catastrophic failure.
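To put the 36 degree launch temperature in context, the short sketch below compares it with the temp summary statistics printed by describe() in part a. The numbers are copied from that output; the variable names are illustrative only.

```r
# Values copied from the describe() output for temp
temp_mean <- 69.02
temp_sd   <- 6.97
temp_min  <- 53.6

launch_temp <- 36   # Challenger launch-day temperature (degrees F)

# How many standard deviations below the mean of prior launches?
z <- (launch_temp - temp_mean) / temp_sd
round(z, 2)

# How far below the coldest prior launch?
gap <- temp_min - launch_temp
gap
```

A launch this far outside the observed range means any conclusion about 36 degrees is an extrapolation from the 23 prior flights.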

e. In the already temperature-sorted dataset ( order(mydata$temp) ), find on which observation the first successful launch occurred (one with no incident).

1e. Solution:

order(my.data$temp)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

which(my.data$incident == "No", arr.ind = TRUE)[1]

## [1] 5

The fifth observation was the first successful launch without incident.
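An equivalent lookup can be sketched with match(), which returns the position of the first matching element. The vector below holds only the first five incident values, taken from the str() output and the result above; it stands in for my.data$incident.

```r
# First five incident values (observations 1 to 5), used here for illustration
incident_head <- c("Yes", "Yes", "Yes", "Yes", "No")

first_no_match <- match("No", incident_head)         # position of first "No"
first_no_which <- which(incident_head == "No")[1]    # same idea as the code above

first_no_match
first_no_which
```

Both calls return the same index; match() stops at the first hit, so no subsetting with [1] is needed.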

f. How many incidents occurred above 65 degrees F?

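Count of launches with temp > 65 and incident = “Yes”

1f. Solution:

sum(my.data$temp > 65) # number of launches above 65 degrees F

## [1] 19

sum(my.data$temp > 65 & my.data$incident == "Yes") # launches above 65 degrees F with an O-ring incident

## [1] 3

Of the 19 launches that took place above 65 degrees Fahrenheit, 3 had a documented incident. The other incidents occurred on the four coldest launches (observations 1 to 4, at 62.6 degrees or below).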

Q2. Probability and Bayes Theorem

The sensitivity and specificity of the polygraph has been a subject of study and debate for years. A 2001 study of the use of polygraph for screening purposes suggested that the probability of detecting an actual liar was .59 (sensitivity) and that the probability of detecting an actual “truth teller” was .90 (specificity). We estimate that about 20% of individuals selected for the screening polygraph will lie.

a. What is the probability that an individual is actually a liar given that the polygraph detected him/her as such? Solve using a Bayesian equation. If you are not sure, you can try to solve as with the tree or table method for partial credit.

2a. Solution:

\[P(A \mid B)\ = \frac{ P(B \mid A)\ * P(A)}{P(B)}\]

\[ P(Liar \mid Detected)\ = \frac{ P(Detected \mid Liar)\ * P(Liar)}{P(Detected \mid Liar)\ * P(Liar) + P(Detected \mid Truthful)\ * P(Truthful)} \]

#Parameters
CorrectLiarDetect = .59 # sensitivity: P(detected as liar | liar)
CorrectTrutherDetect = .9 # specificity: P(detected as truthful | truth teller)
Prob_Individuals_Lying = .2
Prob_Individuals_NotLying = 1 - Prob_Individuals_Lying
IncorrectLiarDetect = 1- CorrectLiarDetect # P(detected as truthful | liar)
IncorrectTrutherDetect = 1- CorrectTrutherDetect # P(detected as liar | truth teller)

#Joint probabilities
DetectLie <- CorrectLiarDetect * Prob_Individuals_Lying # liar and detected as liar
FalseAlarm <- IncorrectTrutherDetect * Prob_Individuals_NotLying # truth teller but detected as liar
Prob_Detected <- DetectLie + FalseAlarm # P(Detected): anyone detected as a liar

#Probability that the individual actually was lying, given that the test detected a liar
Prob_Liar_Caught <- round(DetectLie/Prob_Detected, digits = 4)
print(Prob_Liar_Caught)

## [1] 0.596

Given that the polygraph detected an individual as a liar, the probability that the individual actually was lying is 59.6%.
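The table method mentioned in the question can be sketched with natural frequencies. The cohort of 1,000 screened individuals is hypothetical and chosen only to make the counts easy to read; the three rates are the ones given in the problem.

```r
n_people    <- 1000   # hypothetical cohort size (assumption)
sensitivity <- 0.59   # P(flagged | liar)
specificity <- 0.90   # P(cleared | truthful)
prevalence  <- 0.20   # P(liar)

liars    <- n_people * prevalence
truthful <- n_people - liars

true_pos  <- liars * sensitivity            # liars flagged as liars
false_neg <- liars - true_pos               # liars cleared
false_pos <- truthful * (1 - specificity)   # truth tellers flagged as liars
true_neg  <- truthful - false_pos           # truth tellers cleared

counts <- matrix(c(true_pos, false_neg, false_pos, true_neg),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(actual    = c("liar", "truthful"),
                                 polygraph = c("flagged", "cleared")))
counts

# P(liar | flagged): share of flagged individuals who are actual liars
ppv <- true_pos / (true_pos + false_pos)
round(ppv, 4)
```

Reading down the "flagged" column gives the denominator of Bayes' theorem directly: every flagged individual, whether a liar or not.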

b. What is the probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph? Be sure to write the probability statement.

2b. Solution:

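\[ P(Liar \cup Detected)\ = P(Liar) + P(Detected) - P(Liar \cap Detected) \]

Prob_Detected <- CorrectLiarDetect * Prob_Individuals_Lying + IncorrectTrutherDetect * Prob_Individuals_NotLying # P(Detected)
Prob_Liar_and_Detected <- CorrectLiarDetect * Prob_Individuals_Lying # P(Liar and Detected)
Prob_Liar_or_DetectedLiar <- Prob_Individuals_Lying + Prob_Detected - Prob_Liar_and_Detected
print(round(Prob_Liar_or_DetectedLiar, 4))

## [1] 0.28

The probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph is 28%. The overlap (liars who were also detected) is subtracted so that it is not counted twice.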
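A union probability like this one can be cross-checked through its complement: the only individuals who are neither liars nor flagged are truth tellers whom the polygraph clears. The sketch below computes the union both ways from the rates given in the problem; the variable names are illustrative.

```r
p_liar      <- 0.20   # P(liar)
sensitivity <- 0.59   # P(flagged | liar)
specificity <- 0.90   # P(cleared | truthful)

# Route 1: inclusion-exclusion, P(A or B) = P(A) + P(B) - P(A and B)
p_both    <- sensitivity * p_liar
p_flagged <- p_both + (1 - specificity) * (1 - p_liar)
union_incl_excl <- p_liar + p_flagged - p_both

# Route 2: complement, 1 - P(truthful and cleared)
union_complement <- 1 - (1 - p_liar) * specificity

round(c(union_incl_excl, union_complement), 4)
```

Agreement between the two routes is a quick way to catch a forgotten overlap term.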

Q3. Poisson and Binomial

Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years i.e., the machine fails once in 10 years, or the probability of the machine failing in any given year is $\frac{1}{10}$ .

a. What is the probability that the machine will fail after 8 years? Model as a Poisson. (Hint: Don’t forget to use lambda*t rather just lambda. Provide also the expected value and standard deviation of the distribution.)

3a. Solution:

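\[ P(X = 0\ \mid \lambda t = 0.8) \]

in which X = number of failures in the first 8 years. The machine fails after 8 years only if it has no failure during those 8 years.

#parameters
lambda <- 0.10 # expected failures per year
t <- 8 # years
x <- 0 # no failures in the first 8 years

probability_nofail_8yrs <- exp(-1*lambda*t)*(lambda*t)^x/factorial(x) # Poisson pmf at x = 0
round(probability_nofail_8yrs, 4)

## [1] 0.4493

lambda*t #expected value

## [1] 0.8

round(sqrt(lambda*t),4) #sd

## [1] 0.8944

Based on a Poisson distribution, the probability of the MRI machine failing after 8 years is 44.93%. The expected number of failures in 8 years is 0.8, with a standard deviation of 0.8944 failures.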
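Reading "fail after 8 years" as "no failure during the first 8 years" (the reading suggested by the hint in part b), the Poisson calculation can be checked against R's built-in dpois() and ppois(). This is a minimal sketch; the variable names are illustrative.

```r
lambda <- 0.10        # expected failures per year
t      <- 8           # years
rate   <- lambda * t  # expected failures over 8 years

p_manual <- exp(-rate) * rate^0 / factorial(0)  # pmf formula at x = 0
p_dpois  <- dpois(0, lambda = rate)             # built-in pmf
p_ppois  <- ppois(0, lambda = rate)             # P(X <= 0), same event

round(c(p_manual, p_dpois, p_ppois), 4)

# For a Poisson distribution the mean and variance both equal the rate
expected_value <- rate
std_dev        <- sqrt(rate)
round(c(expected_value, std_dev), 4)
```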

b. What is the probability that the machine will fail after 8 years? Model as a binomial. (Hint: If X is a random variable measuring counts of failure, then we want to find the probability of 0 success in 8 years.) Provide also the expected value and standard deviation of the distribution.

3b. Solution:

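p <- .1 # probability of failure in any given year
n <- 8 # years (trials)
x <- 0 # no failures in the first 8 years
prob_nofail_8yrs_BINOMIAL <- dbinom(x,size=n,prob=p) # P(X = 0): machine survives the first 8 years
round(prob_nofail_8yrs_BINOMIAL,4)

## [1] 0.4305

n*p #expected value

## [1] 0.8

round(sqrt(n*p*(1-p)),4) #sd

## [1] 0.8485

Based on a binomial distribution, the probability of the MRI machine failing after 8 years, P(X = 0) with n = 8 and p = 0.10, is 43.05%. The expected number of failures in 8 years is 0.8, with a standard deviation of 0.8485 failures.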
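The same event (no failure in the first 8 years) can be written three ways, which makes the binomial and Poisson models easy to compare. This is a sketch under the assumption that each year is an independent trial with failure probability 0.1; the geometric version treats a machine failure as the "success" that ends the count.

```r
p <- 0.1   # probability of failure in any given year
n <- 8     # years

binom_p0  <- dbinom(0, size = n, prob = p)              # zero failures in 8 trials
direct_p0 <- (1 - p)^n                                  # survive 8 years in a row
geom_tail <- pgeom(n - 1, prob = p, lower.tail = FALSE) # first failure comes after year 8
pois_p0   <- dpois(0, lambda = n * p)                   # Poisson approximation

round(c(binomial = binom_p0, direct = direct_p0,
        geometric = geom_tail, poisson = pois_p0), 4)
```

The binomial and geometric values coincide exactly, while the Poisson value is slightly higher because it approximates the binomial for small p.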

Q4. Probabilities

In a multiple choice quiz there are 5 questions and 4 choices for each question (a, b, c, d). Robin has not studied for the quiz at all, and decides to randomly guess the answers.

a. What is the probability that the first question Robin gets right is the 3rd question?

4a. Solution:

#parameters
prob_correct_trial3 <- 0.25 ##Probability that the answer is right
prob_incorrect_trial3 <- (1-prob_correct_trial3) ##Probability that the answer is wrong
round((prob_incorrect_trial3^2)*prob_correct_trial3,4)

## [1] 0.1406

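There is a 14.06% chance that Robin’s first correct answer comes on the 3rd question (two wrong answers followed by one right answer).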
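Waiting for the first success is a geometric setting, so the hand calculation can be checked with dgeom(). In R, dgeom(x, prob) gives the probability of x failures before the first success, so the first correct answer on question 3 corresponds to x = 2.

```r
p_correct <- 0.25   # chance of guessing one question right (1 of 4 choices)

by_hand  <- (1 - p_correct)^2 * p_correct   # wrong, wrong, right
by_dgeom <- dgeom(2, prob = p_correct)      # 2 failures before the first success

round(c(by_hand, by_dgeom), 4)
```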

b. What is the probability that Robin gets exactly 3 or exactly 4 questions right? Define the random variable X, tell us what is its likely distribution (normal, poisson, binomial, hypergeometric,..) and provide the probability statement.

4b. Solution:

P(X = 3) + P(X=4), in which X = number of questions correct

#binomial 

round(dbinom(3,5,0.25) + dbinom(4,5,0.25),4)

## [1] 0.1025

X follows a binomial distribution with n = 5 independent questions and p = 0.25 probability of a correct guess on each. Robin has a 10.25% probability of getting exactly 3 or exactly 4 questions correct.

c. What is the probability that Robin gets the majority of the questions right? Provide the probability statement, and show two different ways to get to the same answer?

4c. Solution:

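#complement of the CDF: 1 - P(X <= 2)
round(1-pbinom(2,5,0.25),4)

## [1] 0.1035

#upper tail directly: P(X > 2)
round(pbinom(q    = 2,
       size = 5, 
       prob = .25,
       lower.tail = FALSE),
      digits=4)

## [1] 0.1035

P(X >= 3) = 1 - P(X <= 2), in which X = number of questions correct. The probability of Robin getting the majority of the questions right (at least 3 out of 5) is 10.35%.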
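A route that does not rely on the CDF is to add the individual probabilities for 3, 4 and 5 correct answers. The sketch below compares that sum with the upper-tail call; the variable names are illustrative.

```r
n_questions <- 5
p_correct   <- 0.25

# Sum the pmf over the majority outcomes: P(X = 3) + P(X = 4) + P(X = 5)
by_pmf_sum <- sum(dbinom(3:5, size = n_questions, prob = p_correct))

# Upper tail of the CDF: P(X > 2)
by_cdf <- pbinom(2, size = n_questions, prob = p_correct, lower.tail = FALSE)

round(c(by_pmf_sum, by_cdf), 4)
```

Compared with part b, the only extra term is P(X = 5), which is why the answer rises only slightly from 10.25%.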

Q5. Normal Distribution

The distribution of passenger vehicle speeds traveling on the Interstate 5 Freeway (I-5) in California is nearly normal with a mean of 72.6 miles/hour and a standard deviation of 4.78 miles/hour.

a1. What percent of passenger vehicles travel slower than 80 miles/hour? Define the random variable X, and write the probability statement.

Q5a1. Solution:

P(X < 80), in which X = speed (in miles/hour) of a randomly selected passenger vehicle on I-5.

#parameters
x = 80
mu = 72.6
sigma = 4.78 #named sigma to avoid reusing the name of the built-in sd() function

round(pnorm(x, mu, sigma),4)

## [1] 0.9392

93.92% of vehicles travel slower than 80 mph on I-5 California.
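The same probability can be obtained by standardizing first, which shows how far 80 mph sits from the mean in standard-deviation units. This is a minimal sketch; `z` is the usual z-score, (x - mean) / sd.

```r
speed    <- 80
mu_speed <- 72.6
sd_speed <- 4.78

z <- (speed - mu_speed) / sd_speed   # standard deviations above the mean
round(z, 2)

p_from_z   <- pnorm(z)                           # standard normal CDF
p_directly <- pnorm(speed, mu_speed, sd_speed)   # same value without standardizing

round(c(p_from_z, p_directly), 4)
```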

a2. What percent of passenger vehicles travel between 68 and 78 miles/hour? Does this make sense? Justify.

Q5a2. Solution:

P(68 < X < 78)

#probability between 68 and 78
round(pnorm(78, 72.6, 4.78) - pnorm(68, 72.6, 4.78),4)

## [1] 0.7028

70.28% of vehicles travel between 68 and 78 mph. This makes sense: the mean is about 73 mph and the standard deviation is about 5 mph, so the interval from 68 to 78 mph is approximately the mean plus or minus one standard deviation. About 68% of a normal distribution falls within one standard deviation of the mean, and 70.28% is close to that value.
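The comparison with the 68% rule can be made concrete by expressing both bounds in standard-deviation units. The sketch below is illustrative; the variable names are not from the original code.

```r
mu_speed <- 72.6
sd_speed <- 4.78

# Bounds of the interval expressed as z-scores
z_lower <- (68 - mu_speed) / sd_speed
z_upper <- (78 - mu_speed) / sd_speed
round(c(z_lower, z_upper), 2)

# Probability for the actual interval versus exactly +/- 1 standard deviation
p_interval <- pnorm(z_upper) - pnorm(z_lower)
p_one_sd   <- pnorm(1) - pnorm(-1)
round(c(p_interval, p_one_sd), 4)
```

The upper bound lies a little more than one standard deviation above the mean, which is why the interval captures slightly more than 68%.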

a3. The speed limit on this stretch of the I-5 is 70 miles/hour. Approximate what percentage of the passenger vehicles travel above the speed limit on this stretch of the I-5.

Q5a3. Solution:

P(X>70)

#look at the right side of the normal distribution
1-pnorm(70, 72.6,4.78)

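## [1] 0.7067562

Approximately 70.68% of passenger vehicles travel above the 70 miles/hour speed limit on this stretch of the I-5.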

b1. The cutoff time for the fastest 5% of athletes in the men’s group, i.e. those who took the shortest 5% of time to finish.

Q5b1. Solution:

#top 5%, left side of dist.

round(qnorm(.05, 4313,583))

## [1] 3354

The cutoff time for the fastest 5% of male athletes is 3354 minutes: athletes who finish in less than this time are in the fastest 5%.

b2. The cutoff time for the slowest 10% of athletes in the women’s group.

Q5b2. Solution:

#bottom 10%, right side of distribution

round(qnorm(.90, 5261, 807))

## [1] 6295

The cutoff time for the slowest 10% of female athletes is 6295 minutes: athletes who take longer than this time are in the slowest 10%.
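Both cutoffs can be reproduced from the standard normal quantile using cutoff = mean + z * sd, and then verified by plugging the cutoff back into pnorm(). The means and standard deviations below are the ones passed to qnorm() above; the variable names are illustrative.

```r
men_mean   <- 4313; men_sd   <- 583
women_mean <- 5261; women_sd <- 807

# Fastest 5% of men: 5th percentile of finishing time (left tail)
men_cutoff <- men_mean + qnorm(0.05) * men_sd

# Slowest 10% of women: 90th percentile of finishing time (right tail)
women_cutoff <- women_mean + qnorm(0.90) * women_sd

round(c(men_cutoff, women_cutoff))

# Round trip: share of athletes beyond each cutoff
p_men_faster   <- pnorm(men_cutoff, men_mean, men_sd)
p_women_slower <- pnorm(women_cutoff, women_mean, women_sd, lower.tail = FALSE)
round(c(p_men_faster, p_women_slower), 2)
```

Because shorter times are faster, the "fastest" group sits in the left tail and the "slowest" group in the right tail, which is why the two quantile levels are 0.05 and 0.90.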

Midterm

Jiwon Ban

2024-05-14