1.1 In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file challenger.csv.

#The data folder contains the same data in 3 different formats - import any as it is the same file.
#The variables in the data set are defined as follows:
#• launch : this numbers the temperature-sorted observations from 1 to 23.
#• temp : temperature in degrees Fahrenheit at the time of launch.
#• o_ring_probs : counts the number of O-ring partial failures experienced on the flight.

#1 Print the measures of center (like mean, median, mode, …), spread (like sd, min, max, …) and shape (skewness, kurtosis, …) for the variables in the data. HINT: You can use the describe function in the “psych” package for this.

data <- read.csv('./challenger-2.csv') # Import the data
library(psych)
summary(data) # center and range only; psych::describe() would add sd, skewness, and kurtosis

##      launch          temp         incident          o_ring_probs   
##  Min.   : 1.0   Min.   :53.60   Length:23          Min.   :0.0000  
##  1st Qu.: 6.5   1st Qu.:66.20   Class :character   1st Qu.:0.0000  
##  Median :12.0   Median :69.80   Mode  :character   Median :0.0000  
##  Mean   :12.0   Mean   :69.02                      Mean   :0.4348  
##  3rd Qu.:17.5   3rd Qu.:74.30                      3rd Qu.:1.0000  
##  Max.   :23.0   Max.   :80.60                      Max.   :3.0000
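summary() reports center and range but not the standard deviation or the shape measures the question asks for. A minimal base-R sketch of those extra measures follows. The helper names skewness_m and kurtosis_m are hypothetical, the moment-based formulas are one common convention (psych::describe may use a slightly different default, so values can differ a little), and x is a made-up toy vector, not the Challenger data.

```r
# Hypothetical helpers: moment-based sample skewness and excess kurtosis.
skewness_m <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis_m <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Toy counts for illustration only (NOT the Challenger data).
x <- c(0, 0, 0, 0, 1, 1, 3)

# Center, spread, and shape collected in one named vector.
stats <- c(mean = mean(x), median = median(x), sd = sd(x),
           min = min(x), max = max(x),
           skew = skewness_m(x), kurtosis = kurtosis_m(x))
round(stats, 4)
```

With the real data, the same helpers could be applied column by column, for example to data$temp and data$o_ring_probs.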

1.2 Second, what are the levels of measurement of these 4 variables? Discuss/Justify.

#There are four levels of measurement we use: Nominal, Ordinal, Interval, and Ratio. 

#Nominal variables refer to data that can be classified into distinct categories without any order or ranking. Examples of nominal variables include gender, nationality, or hair color.

#Ordinal variables arrange data in order, but the intervals between positions are not necessarily equal. Examples include ranking of preferences, class grades like A, B, C, etc., or satisfaction levels.

#Interval variables have equal intervals between values but no true zero, allowing for meaningful addition and subtraction. Typical examples of interval variables include temperature in Celsius and dates.

#Ratio variables have a true zero point, making operations like multiplication and division meaningful. Examples of ratio variables include height, weight, and age.

# The "launch" variable is a categorical variable that signifies the order in which observations were recorded. It is not a continuous or numeric variable but rather serves as a sequence identifier for the observations. While there is a numerical aspect to it, it does not imply a specific measurement scale or equal intervals. As a result, it is considered an ordinal variable.

# The "temp" variable represents the temperature in degrees Fahrenheit at the time of the launch. It is a continuous numeric variable, but it lacks a true zero point (absolute zero in Fahrenheit is -459.67°F), making it an interval variable. It is possible to perform mathematical operations on temperature values (e.g., addition and subtraction) and calculate differences, but it doesn't make sense to say that one temperature is "twice as hot" as another.

# The "incident" variable is a categorical variable that represents the presence or absence of an incident with O-rings. It has two categories, "Yes" and "No," with no inherent order or meaningful numerical values. Therefore, it is a nominal variable.

# The "o_ring_probs" variable represents the count of O-ring partial failures on a flight. It is a discrete numeric variable with a true zero point (i.e., a flight with zero partial failures indicates the absence of partial failures). You can perform mathematical operations, such as addition, subtraction, multiplication, and division, on this variable, which makes it a ratio variable.

#1.3 Third, provide an appropriate graph for the variable o_ring_probs. Interpret. Boxplot is acceptable, though histogram would be better.

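# Plotting Histogram
hist(data$o_ring_probs)
# Interpretation: the distribution is right-skewed. Most flights had zero partial failures (the median is 0 and the mean is 0.43), and only a few flights had 1 to 3.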

1.4 The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident in formula). Why might this have been a concern?

boxplot(temp~incident, 
        data = data, 
        horizontal = TRUE, 
        main = "Boxplot of Temp vs Incident", 
        xlab = "Temp",
        col = "blue")

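# Flights with an incident tended to launch at lower temperatures than flights without one, which suggests that lower temperatures increase the probability of an incident. The Challenger launch temperature of 36 degrees F was far below the coldest previous launch (53.6 degrees F), so it fell well outside the range of the data, in the direction where incidents were more common.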

1.5 In the already temperature-sorted dataset ( order(mydata$temp) ), find on which observation the first successful launch occurred (one with no incident).

which(data$incident == "No")

##  [1]  5  6  7  8  9 10 12 14 15 16 17 19 20 21 22 23

#We can see that the first launch with no incident is the 5th observation in the temperature-sorted data.

1.6 How many incidents occurred above 65 degrees F?

sum(data$temp > 65 & data$incident == "Yes")

## [1] 3

# There are 3 incidents above 65 degrees F

2 The sensitivity and specificity of the polygraph have been a subject of study and debate for years. A 2001 study of the use of polygraph for screening purposes suggested that the probability of detecting an actual liar was .59 (sensitivity) and that the probability of detecting an actual “truth teller” was .90 (specificity). We estimate that about 20% of individuals selected for the screening polygraph will lie.

#2.1 What is the probability that an individual is actually a liar given that the polygraph detected him/her as such? Solve using a Bayesian equation. If you are not sure, you can try to solve as with the tree or table method for partial credit.

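# Parameters
liar <- .59     # Sensitivity: P(detected as liar | liar)
TT <- 0.90  # Specificity: P(detected as truth teller | truth teller)
poly <- 0.20  # Prior: P(liar) among individuals selected for screening
# Bayes: P(liar | detected) = P(detected | liar)P(liar) / [P(detected | liar)P(liar) + P(detected | truth teller)P(truth teller)]
Prob_liar <- (liar*poly)/((liar*poly)+((1-TT)*(1-poly)))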
round(Prob_liar, digits = 4)

## [1] 0.596

# Rounded to 4 digits, the probability that an individual flagged by the polygraph is actually a liar is 0.596 (59.6%).
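The question also mentions the table method. A minimal sketch of that approach, assuming a hypothetical cohort of 1000 screened people (the cohort size and all variable names are illustrative choices, not part of the problem):

```r
# Table method on a hypothetical cohort of 1000 screened people.
n_total     <- 1000
prevalence  <- 0.20   # P(liar), from the problem statement
sensitivity <- 0.59   # P(detected as liar | liar)
specificity <- 0.90   # P(detected as truth teller | truth teller)

liars         <- n_total * prevalence                # 200 liars
truth_tellers <- n_total - liars                     # 800 truth tellers
true_pos      <- liars * sensitivity                 # 118 liars flagged
false_pos     <- truth_tellers * (1 - specificity)   # 80 truth tellers flagged

# Share of flagged people who are actually liars: 118 / (118 + 80)
ppv <- true_pos / (true_pos + false_pos)
round(ppv, 4)
```

The ratio 118/198 matches the Bayes calculation, since the cohort size cancels out.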

2.2 What is the probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph? Be sure to write the probability statement.

#The event that the individual is a liar (L).
#The event that the individual is identified as a liar by the polygraph (P).
#The probability statement we are trying to solve is: P(L∪P)
#This can be calculated using the formula for the union of two events: P(L∪P)=P(L)+P(P)−P(L∩P)

#P(L) is the prior probability of being a liar (20% or 0.20).
#P(P) is the total probability of testing positive, regardless of whether the individual is a liar or not.
#P(L∩P) is the probability of an individual being a liar and testing positive, which we've previously calculated using sensitivity.

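#P(L)=0.20, P(L∩P)=P(Positive∣Liar)×P(Liar)=0.59×0.20=0.118
#P(P)=P(L∩P)+P(Positive∣Truth teller)×P(Truth teller)=0.118+0.10×0.80=0.198
#Substituting, P(L∪P)=P(L)+P(P)−P(L∩P) reduces to P(L)+P(Positive∣Truth teller)×P(Truth teller), which the code computes:
liar <- .59     # Sensitivity
TT <- 0.90  # Specificity
poly <- 0.20  # Prior probability of lying
Prob_Lie <- ((1-TT)*(1-poly))+poly
round(Prob_Lie, digits = 4)

## [1] 0.28

#The probability that a randomly selected individual is a liar or is identified as a liar by the polygraph is 0.28.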
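As a cross-check, the union can also be computed term by term from the inclusion-exclusion formula P(L∪P)=P(L)+P(P)−P(L∩P). The variable names in this sketch are illustrative:

```r
p_L            <- 0.20       # P(L): prior probability of lying
p_P_given_L    <- 0.59       # sensitivity, P(P | L)
p_P_given_notL <- 1 - 0.90   # false-positive rate, 1 - specificity

p_L_and_P <- p_P_given_L * p_L                        # P(L and P) = 0.118
p_P       <- p_L_and_P + p_P_given_notL * (1 - p_L)   # P(P) = 0.198 (total probability)
p_union   <- p_L + p_P - p_L_and_P                    # inclusion-exclusion
round(p_union, 4)
```

The result agrees with the shortcut P(L) + P(P and not L), because the overlap term cancels.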

3 Poisson and Binomial (Discussion+Assignment 3) Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years i.e. the machine fails once in 10 years, or the probability of the machine failing in any given year is 1/10.

3.1 What is the probability that the machine will fail after 8 years? Model as a Poisson. (Hint: Don’t forget to use λt rather than just λ. Provide also the expected value and standard deviation of the distribution.)

#To find the probability that the MRI machine will fail after 8 years, we first find the probability that it does not fail during the 8 years, which is the probability of zero events in a Poisson distribution:
#P(X=0)=(e^(−λt)*(λt)^0)/0!
#This simplifies to:P(X=0)=e^(−0.8)

lambda_t <- 1/10 *(8)  # rate per year times number of years
k <- 0
# P(X = 0): no failure during the first 8 years, so the failure comes after year 8
prob_0 <- ppois(k,lambda_t)
round(prob_0, digits = 4)

## [1] 0.4493

#The probability that the MRI machine will not fail within the first 8 years is approximately 44.93%. 

Exp <- lambda_t
Exp

## [1] 0.8

SD <- sqrt(Exp)
round(SD, digits = 4)

## [1] 0.8944

# The expected value (mean number of failures) over this 8-year period is 0.8, and the standard deviation is approximately 0.8944.
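The Poisson answer can be cross-checked three ways: the closed form e^(−λt), the Poisson mass at zero, and the exponential waiting time that goes with a Poisson process (the time to first failure T satisfies P(T > 8) = e^(−λt)). A short sketch, with variable names chosen for illustration:

```r
rate     <- 1 / 10        # failures per year
years    <- 8
lambda_t <- rate * years  # expected failures in 8 years

p_closed <- exp(-lambda_t)                         # closed form for P(X = 0)
p_dpois  <- dpois(0, lambda_t)                     # Poisson mass at 0
p_ppois  <- ppois(0, lambda_t)                     # P(X <= 0), the same event for counts
p_wait   <- pexp(years, rate, lower.tail = FALSE)  # P(T > 8) for exponential waiting time

round(c(p_closed, p_dpois, p_ppois, p_wait), 4)
```

All four values should agree, which confirms that ppois(0, ...) and dpois(0, ...) are interchangeable here.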

3.2 What is the probability that the machine will fail after 8 years? Model as a binomial. (Hint: If X is a random variable measuring counts of failure, then we want to find the probability of 0 success in 8 years.) Provide also the expected value and standard deviation of the distribution.

# P(X=0)=(1−p)^n = 0.9^8

Prob_F <- 1/10
Prob_NF <- 1-Prob_F
n <- 8  # Years

Prob <- dbinom(0,n,Prob_F)  # P(X = 0): zero failures in 8 yearly trials
round(Prob, digits = 4)

## [1] 0.4305

#The probability that the MRI machine will not fail within the first 8 years, modeled as a Binomial distribution, is approximately 43.05% 
Exp <- n*Prob_F
Exp

## [1] 0.8

SD <- sqrt(Exp*Prob_NF)  # sqrt(n*p*(1-p))
round(SD, digits = 4)

## [1] 0.8485

# The expected value, which represents the average number of failures, is 0.8. The standard deviation, which measures the variability around this expected value, is approximately 0.8485
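To see how the two models relate, the binomial probability of zero failures can be compared with its closed form and with the Poisson value at the same mean. This is a small sketch with illustrative variable names:

```r
p <- 1 / 10   # yearly failure probability
n <- 8        # number of years (trials)

p_binom  <- dbinom(0, size = n, prob = p)   # binomial P(X = 0)
p_closed <- (1 - p)^n                       # closed form 0.9^8
p_pois   <- dpois(0, lambda = n * p)        # Poisson P(X = 0) with the same mean 0.8

round(c(binomial = p_binom, closed_form = p_closed, poisson = p_pois), 4)
```

The Poisson value is slightly larger because the Poisson variance (0.8) exceeds the binomial variance (0.72) at the same mean.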

4 In a multiple choice quiz there are 5 questions and 4 choices for each question (a, b, c, d). Robin has not studied for the quiz at all, and decides to randomly guess the answers.

4.1 What is the probability that the first question Robin gets right is the 3rd question?

#The probability of getting any one question correct by random guessing is 1/4.

#The probability of getting any one question wrong is 3/4.
Prob_C <- 0.25 # Probability she answers correctly
Prob_W <- 0.75   # Probability she answers incorrectly
Prob<- (Prob_W^2)*Prob_C
round(Prob,digits = 4)

## [1] 0.1406

# There is a 14.06% probability that the first question Robin gets right is the third question.
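This is a geometric probability: two failures followed by the first success. As a sketch, R's dgeom gives the same value, since it counts the number of failures before the first success (variable names are illustrative):

```r
p_correct <- 0.25

p_formula <- (1 - p_correct)^2 * p_correct   # wrong, wrong, right
p_dgeom   <- dgeom(2, prob = p_correct)      # 2 failures before the first success

c(p_formula, p_dgeom)
```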

4.2 What is the probability that Robin gets exactly 3 or exactly 4 questions right? Define the random variable X, tell us what is its likely distribution (normal, poisson, binomial, hypergeometric,..) and provide the probability statement.

#X ~ Binomial(n = 5, p = 1/4)

#Here, n is the number of trials (5 questions), and p is the probability of success (getting one question correct, which is 1/4).

# In order to find the probability that Robin gets exactly 3 or exactly 4 questions right, then we can calculate the probabilities for these two scenarios and add them together

prob<- dbinom(4,5,0.25)+dbinom(3,5,0.25)
prob

## [1] 0.1025391

# The probability is 0.1025391 that Robin gets exactly 3 or exactly 4 questions right
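The probability statement here is P(X = 3) + P(X = 4), which equals P(X ≤ 4) − P(X ≤ 2). A brief sketch showing that the vectorized mass function and the difference of cumulative probabilities agree (variable names are illustrative):

```r
n_q <- 5      # number of questions
p   <- 0.25   # probability of guessing one question correctly

p_sum  <- sum(dbinom(3:4, size = n_q, prob = p))                    # P(X = 3) + P(X = 4)
p_diff <- pbinom(4, size = n_q, prob = p) - pbinom(2, size = n_q, prob = p)  # P(2 < X <= 4)

c(p_sum, p_diff)
```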

4.3 What is the probability that Robin gets the majority of the questions right? Provide the probability statement, and show two different ways to get to the same answer.

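N  <- 5
K  <- 3:5        # a majority means at least 3 of the 5 questions
p  <- 0.25       # named p to avoid masking R's built-in constant pi
# Probability statement: P(X >= 3) = P(X = 3) + P(X = 4) + P(X = 5)
# Way 1: sum the binomial formula over k = 3, 4, 5
sum(choose(n = N, k = K) * p^K * (1-p)^(N-K))

## [1] 0.1035156

# Way 2: complement of the cumulative probability, 1 - P(X <= 2)
1 - pbinom(q = 2, size = 5, prob = 0.25)

## [1] 0.1035156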
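A majority of 5 questions means at least 3 correct, so the target is P(X ≥ 3). As an independent check that does not rely on the binomial functions, the sketch below enumerates all 2^5 right/wrong patterns and adds up the probability of those with 3 or more correct answers (variable names are illustrative):

```r
p_correct <- 0.25

# All 2^5 right/wrong patterns for the 5 questions (1 = right, 0 = wrong).
patterns <- expand.grid(rep(list(0:1), 5))
n_right  <- rowSums(patterns)

# Probability of each pattern under independent guessing.
p_pattern <- p_correct^n_right * (1 - p_correct)^(5 - n_right)

# Add up the patterns with a majority (3, 4, or 5) of correct answers.
p_majority <- sum(p_pattern[n_right >= 3])
p_majority
```

The enumeration should match 1 - pbinom(2, 5, 0.25), since both describe the same event.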

#5 5a Speeding on the Interstate 5 Freeway (I-5) in California. The distribution of passenger vehicle speeds traveling on the Interstate 5 Freeway (I-5) in California is nearly normal with a mean of 72.6 miles/hour and a standard deviation of 4.78 miles/hour.

#5a1 What percent of passenger vehicles travel slower than 80 miles/hour? Define the random variable X, and write the probability statement.

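# X = speed of a passenger vehicle (miles/hour), X ~ N(mean = 72.6, sd = 4.78)
# Probability statement: P(X < 80)
x <- 80
mean_ <- 72.6
sd_ <- 4.78      # trailing underscore avoids masking the sd() function
round(pnorm(x, mean_, sd_),digits = 4)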

## [1] 0.9392
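The same probability can be cross-checked by standardizing: with z = (x − mean)/sd, P(X < 80) equals P(Z < z) for a standard normal Z. A short sketch, with illustrative variable names:

```r
mu    <- 72.6   # mean speed (miles/hour)
sigma <- 4.78   # standard deviation (miles/hour)

z        <- (80 - mu) / sigma                 # z-score of 80 miles/hour, about 1.55
p_std    <- pnorm(z)                          # standard normal P(Z < z)
p_direct <- pnorm(80, mean = mu, sd = sigma)  # direct call on the original scale

round(c(z = z, p_std = p_std, p_direct = p_direct), 4)
```

About 93.9% of passenger vehicles travel slower than 80 miles/hour under this model.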

#5.2 What percent of passenger vehicles travel between 68 and 78 miles/hour? Does this make sense? Justify.

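# Probability statement: P(68 < X < 78) = P(X < 78) - P(X < 68)
round(pnorm(78,72.6,4.78)-pnorm(68,72.6,4.78),digits = 4)

## [1] 0.7028

# This makes sense: 68 and 78 lie about one standard deviation below and above the mean (z ≈ -0.96 and z ≈ 1.13), and roughly 68% of a normal distribution falls within one standard deviation of the mean.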

5.3 The speed limit on this stretch of the I-5 is 70 miles/hour. Approximate what percentage of the passenger vehicles travel above the speed limit on this stretch of the I-5.

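# Probability statement: P(X > 70) = 1 - P(X <= 70)
1-pnorm(70,72.6,4.78)

## [1] 0.7067562

# About 70.7% of passenger vehicles travel above the speed limit.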
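An upper-tail probability can also be requested directly with pnorm's lower.tail argument instead of subtracting from 1. A small sketch comparing the two forms (variable names are illustrative):

```r
mu    <- 72.6   # mean speed (miles/hour)
sigma <- 4.78   # standard deviation (miles/hour)
limit <- 70     # speed limit (miles/hour)

p_above_complement <- 1 - pnorm(limit, mean = mu, sd = sigma)
p_above_upper      <- pnorm(limit, mean = mu, sd = sigma, lower.tail = FALSE)

round(c(p_above_complement, p_above_upper), 4)
```

The lower.tail = FALSE form tends to be more accurate for very small tail probabilities, although here the two agree.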

#5b N(𝜇 = 4313, 𝜎 = 583) for Men, Ages 30 - 34 group.

#5b1 The cutoff time for the fastest 5% of athletes in the men’s group, i.e. those who took the shortest 5% of time to finish.

p <- 0.05          # fastest 5% = lowest 5% of finishing times
mu <- 4313
sigma <- 583
cutoff <- qnorm(p, mean = mu, sd = sigma)  # 5th percentile of finishing time
round(cutoff, digits = 4)

## [1] 3354.05

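#5b2 The cutoff time for the slowest 10% of athletes in the women’s group.

# The slowest 10% have the longest finishing times, so the cutoff is the 90th percentile of the women's distribution, modeled here as N(mean = 5261, sd = 807).
qnorm(0.90,5261,807)

## [1] 6295.212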
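Because slower athletes have longer times, the slowest 10% sit in the upper tail, so their cutoff is the 90th percentile. A quantile can be sanity-checked by feeding it back into pnorm. The sketch below does this for the women's group, assuming the N(mean = 5261, sd = 807) parameters used in the call above; variable names are illustrative.

```r
mu_w    <- 5261   # women's mean finishing time (assumed from the call above)
sigma_w <- 807    # women's standard deviation (assumed from the call above)

cut_slowest10 <- qnorm(0.90, mean = mu_w, sd = sigma_w)   # 90th percentile
cut_fastest10 <- qnorm(0.10, mean = mu_w, sd = sigma_w)   # 10th percentile, for contrast

# Share of athletes slower than the slowest-10% cutoff (should be 0.10).
share_slower <- pnorm(cut_slowest10, mean = mu_w, sd = sigma_w, lower.tail = FALSE)

round(c(slowest10 = cut_slowest10, fastest10 = cut_fastest10, share_slower = share_slower), 3)
```

The two cutoffs are symmetric about the mean, so mixing up 0.10 and 0.90 returns the cutoff for the wrong tail.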

Midterm

ANDI XU

2024-04-14