Midterm

Question 1: Basic Data Analysis in R

In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of O-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file challenger.csv. The data folder contains the same data in 3 different formats; import any of them, as the contents are identical. The variables in the data set are defined as follows:

• launch: this numbers the temperature-sorted observations from 1 to 23.
• temp: temperature in degrees Fahrenheit at the time of launch.
• incident: If there was an incident with an O-ring, then it is coded “Yes”.
• o_ring_probs: counts the number of O-ring partial failures experienced on the flight.

Load the data into R or Python and answer the following questions. Include all R code.

A. Print the measures of center (like mean, median, mode, ...), spread (like sd, min, max, ...) and shape (skewness, kurtosis, ...) for the variables in the data. HINT: You can use the describe function in the “psych” package for this.

setwd("C:/Users/LENOVO/Downloads/Data Analytics/HW 1 Titanic/titanic")

library(psych)
library(readxl)
my_data <- read_excel("challenger.xlsx")

summary(my_data)

##      launch          temp         incident          o_ring_probs   
##  Min.   : 1.0   Min.   :53.60   Length:23          Min.   :0.0000  
##  1st Qu.: 6.5   1st Qu.:66.20   Class :character   1st Qu.:0.0000  
##  Median :12.0   Median :69.80   Mode  :character   Median :0.0000  
##  Mean   :12.0   Mean   :69.02                      Mean   :0.4348  
##  3rd Qu.:17.5   3rd Qu.:74.30                      3rd Qu.:1.0000  
##  Max.   :23.0   Max.   :80.60                      Max.   :3.0000

#Finding center, spread and shape using psych package
psych::describe(my_data$launch)

psych::describe(my_data$temp)

psych::describe(my_data$o_ring_probs)
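If the psych package is not available, similar summaries can be approximated in base R. The sketch below uses a made-up vector `x` (not the Challenger data) and moment-based formulas for skewness and excess kurtosis. `psych::describe` may use a slightly different estimator by default, so small differences from its output are possible.

```r
# Hypothetical counts, for illustration only (not the Challenger data)
x <- c(0, 0, 0, 1, 0, 2, 0, 1, 3, 0)

m <- mean(x)

# Measures of center and spread
center <- c(mean = m, median = median(x))
spread <- c(sd = sd(x), min = min(x), max = max(x))

# Moment-based shape measures (using population moments)
m2   <- mean((x - m)^2)
skew <- mean((x - m)^3) / m2^1.5
kurt <- mean((x - m)^4) / m2^2 - 3   # excess kurtosis

print(center)
print(spread)
print(c(skewness = skew, kurtosis = kurt))
```

A positive skewness value indicates a longer right tail, which is typical of count data with many zeros.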

B. Second, what are the levels of measurement of these 4 variables? Discuss/Justify

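The variable launch is best treated as nominal: its values are numbers, but they act as identifiers for each launch, so arithmetic on them is not meaningful. Because the numbering follows the temperature-sorted order, an ordinal reading is also defensible.

The variable temp is interval: it is numeric and differences between values are meaningful, but the Fahrenheit scale has no true zero (0 degrees F does not mean an absence of temperature), so ratios are not meaningful.

The variable incident is nominal: it takes the categories “Yes” and “No”, which are labels with no numeric meaning.

The variable o_ring_probs is ratio: it is a count with equal intervals and a true zero (0 means no O-ring problems), so ratios are meaningful.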

C. Third, provide an appropriate graph for the variable o_ring_probs. Interpret. Boxplot is acceptable, though histogram would be better.

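hist(my_data$o_ring_probs)

The summary above already indicates the shape: the median is 0, the third quartile is 1 and the maximum is 3, so the distribution is right-skewed, with at least half of the flights having no O-ring problems and only a few having more than one.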

D. The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident in formula). Why might this have been a concern?

boxplot(temp~incident, data = my_data, horizontal = TRUE, main = "Temp vs Incident", xlab = "Temperature")

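Based on this boxplot, launches with an incident tended to occur at lower temperatures than launches without one. The Challenger launch temperature of 36 degrees F was far below the coldest prior launch in the data (53.6 degrees F), so it lay outside the range of previous experience and in the direction where incidents were more common. That suggests a cold-weather launch carried elevated risk.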

E. The dataset is already sorted by temperature ( order(mydata$temp) ). Find the observation on which the first successful launch (one with no incident) occurred. Answer using a command (instead of eyeballing the data).

mintemp <- min(my_data$temp[my_data$incident == "No"])
#Lowest temperature among launches with no incident (this is a temperature, not the observation number)
mintemp

## [1] 66.2
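The question asks for the observation, whereas `min()` on the temperatures returns a temperature. One way to get the row position is `which()`. The sketch below uses a small hypothetical data frame `toy` with the same column names (not the real file), and assumes successful launches are coded “No” as in the code above.

```r
# Hypothetical, temperature-sorted toy data (not the real challenger file)
toy <- data.frame(
  launch   = 1:6,
  temp     = c(53, 57, 58, 63, 66, 67),
  incident = c("Yes", "Yes", "Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Row position of the first launch without an incident
first_ok <- which(toy$incident == "No")[1]

first_ok               # row index in the sorted data
toy$launch[first_ok]   # launch number of that observation
toy$temp[first_ok]     # temperature of that observation
```

Applied to the real data, the same pattern would be `which(my_data$incident == "No")[1]`.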

F. How many incidents occurred above 65 degrees F? Answer using a command (instead of eyeballing the data).

Three or fewer incidents occurred above 65 degrees F.
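A count like this can be produced by summing a logical condition, since `TRUE` counts as 1. The sketch below uses a hypothetical data frame `toy` (not the real data), so the number it produces is illustrative only.

```r
# Hypothetical toy data (not the real challenger file)
toy <- data.frame(
  temp     = c(53, 57, 63, 66, 70, 70, 75, 79),
  incident = c("Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Number of launches that had an incident AND were above 65 degrees F
n_incidents_above_65 <- sum(toy$incident == "Yes" & toy$temp > 65)
n_incidents_above_65
```

On the real data the equivalent command would be `sum(my_data$incident == "Yes" & my_data$temp > 65)`.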

Question 2: Probability and Bayes Rule

The sensitivity and specificity of the polygraph has been a subject of study and debate for years. A 2001 study of the use of polygraph for screening purposes suggested that the probability of detecting an actual liar was .59 (sensitivity) and that the probability of detecting an actual “truth teller” was .90 (specificity). We estimate that about 20% of individuals selected for the screening polygraph will lie.

A. What is the probability that an individual is actually a liar given that the polygraph detected him/her as such? Solve using the Bayesian formula. Be clear with your notation. If you are not sure, you can try to solve as with the tree or table method for partial credit.

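# Notation: L = individual is a liar, D = polygraph flags the individual as a liar
# Bayes' rule: P(L | D) = P(D | L) * P(L) / [P(D | L) * P(L) + P(D | not L) * P(not L)]
pdLiar = .59                  # sensitivity, P(D | L)
pdTruther = .90               # specificity, P(not D | not L)
pliar = .20                   # prior, P(L)
pTruther = 1-pliar            # P(not L)
pnotTruther = 1- pdTruther    # false positive rate, P(D | not L)
pnotLiar = 1-pdLiar           # false negative rate, P(not D | L)

p_a = pliar                                   # P(L)
p_b = pTruther*pnotTruther + pliar*pdLiar     # P(D), total probability of being flagged
p_ba = pdLiar                                 # P(D | L)

p_ab = p_ba*p_a/p_b
#Probability that an individual is actually a liar given that the polygraph detected him/her:
p_ab

## [1] 0.5959596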
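The table method mentioned in the prompt gives a useful cross-check on the Bayes calculation. The sketch below builds the 2x2 joint probability table from the stated sensitivity, specificity and prior, then divides the "liar and flagged" cell by the total probability of being flagged. The object names (`joint`, `p_flagged`, `posterior`) are illustrative.

```r
prior_liar  <- 0.20  # P(L)
sensitivity <- 0.59  # P(D | L)
specificity <- 0.90  # P(not D | not L)

# Joint probabilities: rows are the truth, columns are the polygraph result
joint <- matrix(
  c(prior_liar * sensitivity,             prior_liar * (1 - sensitivity),
    (1 - prior_liar) * (1 - specificity), (1 - prior_liar) * specificity),
  nrow = 2, byrow = TRUE,
  dimnames = list(truth = c("liar", "truthful"),
                  test  = c("flagged", "not_flagged"))
)

p_flagged <- sum(joint[, "flagged"])               # P(D) = 0.118 + 0.08
posterior <- joint["liar", "flagged"] / p_flagged  # P(L | D) = 0.118 / 0.198

print(joint)
print(posterior)
```

The posterior equals 59/99, roughly 0.596: a flagged individual is more likely than not to be a liar, but about 40% of flagged individuals are truthful.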

B. What is the probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph? Be sure to write the probability statement.

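# Probability statement: P(L or D) = P(L) + P(D) - P(L and D)
p_a = pliar                                    # P(L)
p_b = pliar * pdLiar + pTruther * pnotTruther  # P(D)
p_AandB = pliar * pdLiar                       # P(L and D)
p_AorB = p_a + p_b - p_AandB
p_AorB

## [1] 0.28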
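The union P(L or D) can be reached by more than one route, which makes a handy consistency check. The sketch below computes it by inclusion-exclusion, by adding the disjoint pieces (all liars plus truthful people who were flagged), and by the complement (1 minus truthful people who were not flagged). Variable names are illustrative.

```r
p_L         <- 0.20       # P(L): individual is a liar
p_D_given_L <- 0.59       # sensitivity
p_D_given_T <- 1 - 0.90   # false positive rate = 1 - specificity

p_L_and_D <- p_L * p_D_given_L
p_D       <- p_L_and_D + (1 - p_L) * p_D_given_T

# Route 1: inclusion-exclusion
union_incl_excl <- p_L + p_D - p_L_and_D

# Route 2: liars plus truthful-but-flagged (disjoint events)
union_disjoint <- p_L + (1 - p_L) * p_D_given_T

# Route 3: complement of "truthful and not flagged"
union_complement <- 1 - (1 - p_L) * 0.90

c(union_incl_excl, union_disjoint, union_complement)
```

All three routes give 0.28.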

Question 3: Poisson and Binomial

Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years i.e. the machine fails once in 10 years, or the probability of the machine failing in any given year is 1/10.

A. What is the probability that the machine will fail after 8 years? Model as a Poisson. (Hint: Don’t forget to use λt rather than just λ. Provide also the expected value and standard deviation of the distribution.)

#PMF P(X=k) = e^(-λt) * (λt)^k / k!
#Expected value E(X) = λt
#Sd σ(X) = sqrt(λt)
#Poisson
# λ = 1/10 failures per year and t = 8 years, so λt = 8/10
lambda <- 8 / 10
# Number of failures of interest (0 failures in 8 years)
k <- 0

# Probability of 0 failures in 8 years using the Poisson distribution
prob_0 <- ppois(k, lambda)

# Expected Value
expected_value <- lambda

# Standard Deviation
standard_deviation <- sqrt(lambda)

# Print the results
#Probability that the machine will fail after 8 years (0 failures in 8 years):
print(prob_0)

## [1] 0.449329

#standard deviation of failures in 8 years :
print(standard_deviation)

## [1] 0.8944272

#expected value of failures in 8 years :
print(expected_value)

## [1] 0.8
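As a cross-check on the Poisson result, the probability of zero failures in 8 years has the closed form e^(-λt). It also equals the probability that an exponential waiting time with rate λ exceeds 8 years. The exponential view is an added assumption of this sketch (it follows from treating failures as a Poisson process), and the variable names are illustrative.

```r
rate  <- 1 / 10   # failures per year (lambda)
years <- 8        # time window (t)

# Poisson: P(X = 0) with mean lambda * t
p_pois <- dpois(0, lambda = rate * years)

# Closed form: e^(-lambda * t)
p_closed <- exp(-rate * years)

# Exponential waiting time: P(T > 8)
p_exp <- pexp(years, rate = rate, lower.tail = FALSE)

c(poisson = p_pois, closed_form = p_closed, exponential = p_exp)
```

All three values agree at about 0.4493.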

B. What is the probability that the machine will fail after 8 years? Model as a binomial. (Hint: If X is a random variable measuring counts of failure, then we want to find the probability of 0 success in 8 years.) Provide also the expected value and standard deviation of the distribution.

#Binomial

p_f <- 1/10
p_s <- 1 - p_f
n <- 8

#Binomial probability of 0 failures in 8 years: P(X = 0) with n = 8 trials and p = 1/10
prob_0_failures_in_8_years <- dbinom(0, n, p_f)

# Expected value
expected_value <- n * p_f

# Standard deviation
standard_deviation <- sqrt(n * p_s * p_f)

# Print the results
print(prob_0_failures_in_8_years)

## [1] 0.4304672

#Expected value
print(expected_value)

## [1] 0.8

#sd
print(standard_deviation)

## [1] 0.8485281
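The binomial answer can be verified by hand, because zero failures in 8 independent years is simply (1 - p)^8. The sketch below also places the Poisson value alongside it to show how close the two models are. Variable names are illustrative.

```r
n_years <- 8
p_fail  <- 1 / 10

# Binomial: P(X = 0) out of 8 yearly trials
p_binom  <- dbinom(0, size = n_years, prob = p_fail)

# Same quantity written out directly
p_manual <- (1 - p_fail)^n_years

# Poisson with the same mean, n * p = 0.8
p_pois   <- dpois(0, lambda = n_years * p_fail)

c(binomial = p_binom, manual = p_manual, poisson = p_pois)
```

Both models share the expected value 0.8. The binomial has the smaller variance, n * p * (1 - p) = 0.72 compared with 0.8 for the Poisson, which is why its standard deviation is slightly lower.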

Question 4: Multiple choice quiz

In a multiple choice quiz there are 5 questions and 4 choices for each question (a, b, c, d). Robin has not studied for the quiz at all, and decides to randomly guess the answers.

A. What is the probability that the first question Robin gets right is the 3rd question?

pr <- 0.25 ##Probability that the answer is right
pw <- (1-pr) ##Probability that the answer is wrong
(pw^2)*pr

## [1] 0.140625
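The event "first correct answer on question 3" follows a geometric distribution, so the manual product can be checked against R's `dgeom`. Note that `dgeom(x, prob)` counts the number of failures before the first success, so the third question corresponds to x = 2.

```r
p_right <- 0.25

# Manual: two wrong answers, then one right answer
p_manual <- (1 - p_right)^2 * p_right

# Geometric pmf: 2 failures before the first success
p_geom <- dgeom(2, prob = p_right)

c(manual = p_manual, dgeom = p_geom)
```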

B. What is the probability that Robin gets exactly 3 or exactly 4 questions right? Define the random variable X, tell us what is its likely distribution (normal, poisson, binomial, hypergeometric,..) and provide the probability statement.

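#X = number of questions Robin answers correctly, with possible values 0:5
#X follows a binomial distribution: X ~ Binomial(n = 5, p = 0.25)
#Probability statement: P(X = 3) + P(X = 4)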
dbinom(4,5,0.25)

## [1] 0.01464844

dbinom(3,5,0.25)

## [1] 0.08789063

dbinom(3,5,0.25) + dbinom(4,5,0.25)

## [1] 0.1025391

C. What is the probability that Robin gets the majority of the questions right? Provide the probability statement, and show two different ways to get to the same answer?

# X = number of correct answers, X ~ Binomial(n = 5, p = 0.25)
# A majority of 5 questions means at least 3 right: P(X >= 3) = 1 - P(X <= 2)

# Way 1: complement of the cumulative probability
1-pbinom(2,5,0.25)

## [1] 0.1035156

# Way 2: sum the individual probabilities P(X = 3) + P(X = 4) + P(X = 5)
sum(dbinom(3:5, 5, 0.25))

## [1] 0.1035156

Question 5: Normal Distribution (Week 3)

Speeding on the Interstate 5 Freeway (I-5) in California.

The distribution of passenger vehicle speeds traveling on the Interstate 5 Freeway (I-5) in California is nearly normal with a mean of 72.6 miles/hour and a standard deviation of 4.78 miles/hour.

a1. What percent of passenger vehicles travel slower than 80 miles/hour? Define the random variable X, and write the probability statement.

# X = speed of a randomly chosen passenger vehicle (miles/hour), X ~ N(mean = 72.6, sd = 4.78)
# Probability statement: P(X < 80)
pnorm(80, mean = 72.6, sd = 4.78)

## [1] 0.939203

Approximately 93.92% of passenger vehicles on I-5 travel slower than 80 mph.

a2. What percent of passenger vehicles travel between 68 and 78 miles/hour? Does this make sense? Justify.

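# Probability statement: P(68 < X < 78) = P(X < 78) - P(X < 68)
r = pnorm(78, mean = 72.6, sd = 4.78) - 
pnorm(68, mean = 72.6, sd = 4.78)
round(r, 4)

## [1] 0.7028

Therefore, approximately 70.28% of passenger vehicles on I-5 travel between 68 mph and 78 mph. This makes sense: the interval is close to one standard deviation either side of the mean (72.6 - 4.78 = 67.82 and 72.6 + 4.78 = 77.38), and about 68% of a normal distribution lies within one standard deviation of the mean. The interval here is slightly wider than that, so a value a little above 68% is expected.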
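For an interval probability P(a < X < b) under a normal model, a small helper makes the subtraction of two cumulative probabilities explicit. The sketch below applies it to the 68 to 78 mph interval from the question, repeats the calculation with z-scores, and compares against the one-standard-deviation interval. The helper name `prob_between` is hypothetical.

```r
# P(a < X < b) for X ~ N(mean, sd)
prob_between <- function(a, b, mean, sd) {
  pnorm(b, mean = mean, sd = sd) - pnorm(a, mean = mean, sd = sd)
}

mu    <- 72.6
sigma <- 4.78

# Direct calculation on the original scale
p_direct <- prob_between(68, 78, mu, sigma)

# Same calculation after standardizing to z-scores
p_z <- pnorm((78 - mu) / sigma) - pnorm((68 - mu) / sigma)

# Reference point: probability within one standard deviation of the mean
p_one_sd <- prob_between(mu - sigma, mu + sigma, mu, sigma)

round(c(direct = p_direct, z_scores = p_z, one_sd = p_one_sd), 4)
```

The direct and z-score versions agree, and the result sits just above the one-standard-deviation benchmark of about 68%.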

a3. The speed limit on this stretch of the I-5 is 70 miles/hour. Approximate what percentage of the passenger vehicles travel above the speed limit on this stretch of the I-5.

1 - pnorm(70, mean = 72.6, sd = 4.78)

## [1] 0.7067562

Therefore, approximately 70.68% of passenger vehicles on I-5 travel above the speed limit.

b. Distributions for triathlon times:

N(µ = 4313, σ = 583) for Men, Ages 30 - 34 group.

N(µ = 5261, σ = 807) for Women, Ages 25-29 group.

Times are listed in seconds.

Use this information to compute each of the following:

b1. The cutoff time for the fastest 5% of athletes in the men’s group, i.e. those who took the shortest 5% of time to finish.

# X = finishing time in seconds, X ~ N(mean = 4313, sd = 583)
# The fastest 5% have the shortest times: find x such that P(X < x) = 0.05
qnorm(0.05, mean = 4313, sd = 583)

## [1] 3354.05

b2. The cutoff time for the slowest 10% of athletes in the women’s group.

# X = finishing time in seconds, X ~ N(mean = 5261, sd = 807)
# The slowest 10% have the longest times: find x such that P(X > x) = 0.10, i.e. P(X < x) = 0.90
qnorm(0.9, mean = 5261, sd = 807)

## [1] 6295.212
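Percentile cutoffs for a normal distribution can also be written as mean + z * sd, where z is the standard normal quantile. The sketch below uses the group parameters stated in the problem (men: 4313 and 583, women: 5261 and 807) and checks each cutoff by converting it back to a probability. Variable names are illustrative.

```r
mu_men   <- 4313; sigma_men   <- 583
mu_women <- 5261; sigma_women <- 807

# Fastest 5% of men: 5th percentile of finishing time
cut_men <- mu_men + qnorm(0.05) * sigma_men

# Slowest 10% of women: 90th percentile of finishing time
cut_women <- mu_women + qnorm(0.90) * sigma_women

# Round-trip check: the cutoffs should map back to 0.05 and 0.90
check_men   <- pnorm(cut_men,   mean = mu_men,   sd = sigma_men)
check_women <- pnorm(cut_women, mean = mu_women, sd = sigma_women)

c(cut_men = cut_men, cut_women = cut_women)
c(check_men = check_men, check_women = check_women)
```

Because shorter times are faster, the fastest group uses a lower-tail quantile and the slowest group uses an upper-tail one.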

Midterm_1

Ganesh Kumar

2023-10-25