Midterm Data Analysis

1 In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file challenger.csv. The data folder contains the same data in 3 different formats - import any as it is the same file.

The variables in the data set are defined as follows

1.a. Print the measures of center (like mean, median, mode, …), spread (like sd, min, max, …) and shape (skewness, kurtosis, …) for the variables in the data. HINT: You can use the describe function in “psych” package for this.

data1 <- read.csv('./challenger-2.csv') # Read the data from the local CSV file
library(psych)
summary(data1) # Summary of our Data
##      launch          temp         incident          o_ring_probs   
##  Min.   : 1.0   Min.   :53.60   Length:23          Min.   :0.0000  
##  1st Qu.: 6.5   1st Qu.:66.20   Class :character   1st Qu.:0.0000  
##  Median :12.0   Median :69.80   Mode  :character   Median :0.0000  
##  Mean   :12.0   Mean   :69.02                      Mean   :0.4348  
##  3rd Qu.:17.5   3rd Qu.:74.30                      3rd Qu.:1.0000  
##  Max.   :23.0   Max.   :80.60                      Max.   :3.0000

Using the summary function, we are able to see things like Min, Max, Median, Mean…

describe(data1$launch) # Launch 
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 23   12 6.78     12      12 8.9   1  23    22    0    -1.36 1.41
describe(data1$temp) # Temp
##    vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 23 69.02 6.97   69.8   69.33 5.34 53.6 80.6    27 -0.4    -0.44 1.45
describe(data1$o_ring_probs) #O Ring Probs
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 23 0.43 0.79      0    0.26   0   0   3     3 1.81     2.69 0.16

Above, we ran describe on each numeric variable to summarize it in more detail. The describe function reports statistics such as the number of observations (n), mean, standard deviation, min, and max. Further to the right, the output also lists skewness and kurtosis. Note that I did not run describe on the incident column. Its entries are just Yes or No (character values), so numeric summaries such as the mean or skewness are not meaningful for it, and passing it to describe in R does not produce useful statistics.
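For a Yes/No column, counts, proportions, and the mode are a more natural summary than describe. The lines below are a minimal sketch of that idea; the vector `incident_example` is a hypothetical stand-in with made-up values, not the actual incident column.

```r
# Hypothetical stand-in for a Yes/No column; the values are illustrative only
# and are NOT the Challenger records.
incident_example <- c("Yes", "No", "No", "Yes", "No")

counts <- table(incident_example)                # frequency of each category
props  <- prop.table(counts)                     # relative frequency of each category
mode_value <- names(counts)[which.max(counts)]   # most frequent category (the mode)

print(counts)
print(round(props, 2))
print(mode_value)
```

The same three calls should work on `data1$incident`, assuming that column holds only the strings Yes and No.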

1.b Second, what are the levels of measurement of these 4 variables? Discuss/Justify.

To start, we know that our levels of measurement are Nominal, Ordinal, Interval, and Ratio. I have listed these in more detail below.

Nominal: Nominal is described as categories, states, or “names of things”. An example of these can be hair color, marital status, occupation, etc… This is a qualitative variable.

Ordinal: Ordinal is described as values that have a meaningful order, also described as a ranking. The magnitude between successive values is not known. Examples of this can be size of a coffee, grades, rankings in the military. This is a qualitative variable.

Interval: Interval is measured on a scale of equal-sized units, and these values have an order, but there is no true zero point. Examples of this are temperature in Fahrenheit or Celsius and calendar dates. This is a quantitative variable.

Ratio: Ratio has an inherent zero point, so a value can be expressed as a multiple (ratio) of another value. Examples of this are temperature in kelvin, length, and monetary quantities. This is a quantitative variable.

Our example for the Challenger Data

Our launch variable will be described as nominal. Even though these are numeric, we know that this is a qualitative variable since it is just an identifier for our other variables.

Temperature should be interval. It is a quantitative variable measured in equal-sized units, but degrees Fahrenheit have no true zero: the temperature can be less than zero, and 80 degrees is not "twice as hot" as 40 degrees.

Our incident variable is nominal as well since it is a categorical variable.

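O_Ring_Probs is a ratio variable. It appears to be a count of O-ring problems per launch, so it has a true zero (0 means no problems) and ratios are meaningful (2 problems is twice as many as 1).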

1.c Third, provide an appropriate graph for the variable o_ring_probs. Interpret. Boxplot is acceptable, though histogram would be better.

# Plotting Histogram
hist(data1$o_ring_probs)

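Above, we have plotted the histogram for o_ring_probs using bins of width 0.5. The distribution is strongly right-skewed: most launches had 0 O-ring problems, and only a few had 1 or more, up to a maximum of 3. This agrees with the summary statistics above (median 0, mean 0.43, skew 1.81).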

1.d The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident in formula). Why might this have been a concern?

# Boxplot
boxplot(temp~incident, 
        data = data1, 
        horizontal = TRUE, 
        main = "Boxplot of Temp vs Incident", 
        xlab = "Temperature")

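We can see from our boxplot that launches with incidents tended to occur at lower temperatures than launches without incidents, so lower temperatures appear to be associated with a higher chance of an incident. The Challenger launch temperature of 36 degrees F was far below the coldest prior launch in the data (53.6 degrees F), so it fell well outside the range of past experience, in the direction where incidents were more common.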

1.E In the already temperature-sorted dataset ( order(mydata$temp) ), find on which observation the first successful launch occurred (one with no incident).

which(data1$incident == "No")[1]
## [1] 5

We can see from our code that the first launch with no incident is the 5th observation (row 5) of the temperature-sorted data. We can confirm this by looking at the data set directly, but that becomes difficult in larger datasets.
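The result above relies on the rows already being sorted by temperature. If they were not, the data could be sorted first with order() and then searched. A minimal sketch, using a hypothetical toy data frame `toy` with made-up values (not the Challenger records):

```r
# Hypothetical toy data, deliberately unsorted; NOT the Challenger records.
toy <- data.frame(
  temp     = c(70, 55, 63, 58),
  incident = c("No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

toy_sorted <- toy[order(toy$temp), ]                 # sort rows by ascending temperature
first_no   <- which(toy_sorted$incident == "No")[1]  # position of the first "No" after sorting

first_no                     # row position within the sorted data
toy_sorted$temp[first_no]    # temperature of that observation
```

Note that which() returns a position within the sorted data frame, not the original row label.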

1.F How many incidents occurred above 65 degrees F?

sum(data1$temp > 65 & data1$incident == "Yes")
## [1] 3

We can see here that there were 3 incidents above 65 degrees F, found by counting the rows where temp > 65 and an incident occurred.

2 Probability and Bayes Rule (Discussion+Assignment 2). The sensitivity and specificity of the polygraph has been a subject of study and debate for years. A 2001 study of the use of polygraph for screening purposes suggested that the probability of detecting an actual liar was .59 (sensitivity) and that the probability of detecting an actual “truth teller” was .90 (specificity). We estimate that about 20% of individuals selected for the screening polygraph will lie.

2.a What is the probability that an individual is actually a liar given that the polygraph detected him/her as such? Solve using a Bayesian equation. If you are not sure, you can try to solve as with the tree or table method for partial credit.

First, we can write Bayes' theorem as \(\displaystyle P(A \mid B) = \frac{P(B\mid A) \cdot P(A)}{P(B)}\).

Bayes' Theorem: Bayes' theorem updates a probability based on new information or evidence. Using the updated information gives a more accurate probability for the hypothesis. The theorem is an extension of conditional probability, which gives the probability of A given that B happened, written P(A|B). With Bayes' theorem, we can calculate P(A|B) when we know the reverse conditional probability P(B|A) together with P(A) and P(B). Here A is the event that the individual is actually a liar and B is the event that the polygraph identifies the individual as a liar.

# Parameters
Liar <- .59     # Sensitivity: P(detected as liar | liar)
TruthT <- 0.90  # Specificity: P(detected as truthful | truth teller)
LieEst <- 0.20  # Prior: P(individual lies)
Prob_Liar <- (Liar*LieEst)/((Liar*LieEst)+((1-TruthT)*(1-LieEst)))
round(Prob_Liar, digits = 4)
## [1] 0.596

After plugging our parameters into the Bayesian formula, there is a 59.6% probability that an individual is actually a liar given that the polygraph detected him/her as one.
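The same answer can be cross-checked with the table (natural frequency) method mentioned in the question. The sketch below assumes a hypothetical group of 1000 screened individuals; the group size and variable names are illustrative choices, and only the three probabilities come from the problem.

```r
# Table method with an assumed (hypothetical) group of 1000 people.
population  <- 1000
sensitivity <- 0.59   # P(detected | liar), from the problem
specificity <- 0.90   # P(not detected | truth teller), from the problem
prior_liar  <- 0.20   # P(liar), from the problem

liars           <- population * prior_liar              # 200 liars
truth_tellers   <- population - liars                   # 800 truth tellers
true_positives  <- liars * sensitivity                  # 118 liars detected
false_positives <- truth_tellers * (1 - specificity)    # 80 truth tellers flagged

# Among everyone flagged as a liar, the share who actually lied
posterior <- true_positives / (true_positives + false_positives)
round(posterior, 4)
```

This gives 118 / 198, which agrees with the Bayes formula.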

2.b What is the probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph? Be sure to write the probability statement.

Prob_Lie2 <- ((1-TruthT)*(1-LieEst))+LieEst
round(Prob_Lie2, digits = 4)
## [1] 0.28

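Let L be the event that the individual is a liar and D the event that the polygraph identifies the individual as a liar. The probability statement is \(P(L \cup D) = P(L) + P(D \cap L^c) = 0.20 + (0.10)(0.80) = 0.28\), so the probability is 28%.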
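As a check, the union can also be computed with the general addition rule, P(L or D) = P(L) + P(D) - P(L and D). This is a sketch with illustrative variable names; only the three input probabilities come from the problem.

```r
p_liar      <- 0.20   # P(L)
sensitivity <- 0.59   # P(D | L)
specificity <- 0.90   # P(not D | not L)

p_both   <- sensitivity * p_liar                        # P(L and D)
p_detect <- p_both + (1 - specificity) * (1 - p_liar)   # P(D), by total probability
p_union  <- p_liar + p_detect - p_both                  # addition rule

round(c(p_both = p_both, p_detect = p_detect, p_union = p_union), 4)
```

Both routes give 0.28, because subtracting P(L and D) from P(D) leaves exactly the truth tellers who were flagged.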

3 Poisson and Binomial (Discussion+Assignment 3) Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years i.e. the machine fails once in 10 years, or the probability of the machine failing in any given year is 1/10.

3.a What is the probability that the machine will fail after 8 years? Model as a Poisson. (Hint: Don’t forget to use λt rather just λ. Provide also the expected value and standard deviation of the distribution.)

# Parameters
lambda <- 8/10
k <- 0
# Probability after 8 years
prob_0 <- ppois(k,lambda)
round(prob_0, digits = 4)
## [1] 0.4493

There is a 44.93% probability that the machine has no failures in the first 8 years, i.e. that it fails after 8 years.

# Expected Value
Exp_Value <- lambda
Exp_Value
## [1] 0.8
# Standard Deviation 
Standard_Dev <- sqrt(Exp_Value)
round(Standard_Dev, digits = 4)
## [1] 0.8944

For a Poisson distribution, the expected value and the variance are both equal to the rate parameter, which here is λt = (1/10)(8) = 0.8. The expected value is therefore 0.8, and the standard deviation is the square root of 0.8.
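These Poisson results can be verified numerically. The sketch below compares the closed form exp(-λt) with dpois and recovers the mean and variance by summing over the probability mass function; the variable names and the truncation at 100 failures are illustrative assumptions.

```r
rate     <- 1 / 10        # failures per year, from the problem
years    <- 8
lambda_t <- rate * years  # Poisson parameter for the 8-year window

p_zero_closed <- exp(-lambda_t)       # P(X = 0) from the Poisson formula
p_zero_dpois  <- dpois(0, lambda_t)   # same quantity from R

# Mean and variance from the pmf, truncated at 100 (the tail beyond is negligible)
counts   <- 0:100
mean_num <- sum(counts * dpois(counts, lambda_t))
var_num  <- sum((counts - mean_num)^2 * dpois(counts, lambda_t))

round(c(p_zero = p_zero_dpois, mean = mean_num, sd = sqrt(var_num)), 4)
```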

3.b What is the probability that the machine will fail after 8 years? Model as a binomial. (Hint: If X is a random variable measuring counts of failure, then we want to find the probability of 0 success in 8 years.) Provide also the expected value and standard deviation of the distribution.

# Parameters
Prob_Fail <- 1/10
Prob_NotFail <- 1-Prob_Fail
n <- 8  # Years
# Probability after 8 Years
P3b <- pbinom(0, n, Prob_Fail)  # P(X <= 0) = P(no failures in 8 years)
round(P3b, digits = 4)
## [1] 0.4305

There is a 43.05% chance the machine will fail after 8 years.
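With zero failures, the binomial probability reduces to (1 - p)^n, which makes it easy to check by hand and to compare against the Poisson model. A small sketch with illustrative variable names:

```r
p_fail  <- 1 / 10   # yearly failure probability, from the problem
n_years <- 8

p_none_closed <- (1 - p_fail)^n_years                       # 0.9^8
p_none_dbinom <- dbinom(0, size = n_years, prob = p_fail)   # same value from R
p_none_pois   <- dpois(0, lambda = n_years * p_fail)        # Poisson counterpart

round(c(binomial = p_none_dbinom, poisson = p_none_pois), 4)
```

The two models give similar but not identical answers (about 0.4305 versus 0.4493); the Poisson is an approximation to the binomial that improves as n grows and p shrinks.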

# Expected Value
Exp_ValueB <- n*Prob_Fail
Exp_ValueB
## [1] 0.8
# Standard Deviation: for a binomial, sd = sqrt(n * p * (1 - p))
Standard_DevB <- sqrt(n*Prob_Fail*Prob_NotFail)
round(Standard_DevB, digits = 4)
## [1] 0.8485

4 In a multiple choice quiz there are 5 questions and 4 choices for each question (a, b, c, d). Robin has not studied for the quiz at all, and decides to randomly guess the answers.

4.a What is the probability that the first question Robin gets right is the 3rd question?

# Parameters
Prob_Correct <- 0.25 # Probability Robin answers correctly
Prob_Wrong <- 0.75   # Probability Robin answers incorrectly
P4a<- (Prob_Wrong^2)*Prob_Correct
round(P4a,digits = 4)
## [1] 0.1406

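There is a 14.06% probability that the first question Robin gets right is the third question. This is a geometric probability: two wrong answers followed by one correct answer.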
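R's built-in geometric functions give the same result. As a sketch: dgeom(x, prob) returns the probability of x failures before the first success, so "first correct answer on question 3" corresponds to x = 2. The variable names are illustrative.

```r
p_correct <- 0.25

p_manual <- (1 - p_correct)^2 * p_correct   # wrong, wrong, right
p_dgeom  <- dgeom(2, prob = p_correct)      # 2 failures before the first success

round(c(manual = p_manual, dgeom = p_dgeom), 4)
```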

4b What is the probability that Robin gets exactly 3 or exactly 4 questions right? Define the random variable X, tell us what is its likely distribution (normal, poisson, binomial, hypergeometric,..) and provide the probability statement.

P4b <- dbinom(4,5,0.25)
round(P4b, digits = 4)
## [1] 0.0146
p4b3 <- dbinom(3,5,0.25)
round(p4b3,digits = 4)
## [1] 0.0879
#Adding the two probabilties together
P4b+p4b3
## [1] 0.1025391

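Let X be the number of questions Robin answers correctly, so X ~ Binomial(n = 5, p = 0.25). The probability statement is \(P(X = 3) + P(X = 4)\), and there is a 10.25% probability that Robin gets exactly 3 or exactly 4 questions correct. We use the binomial distribution because it describes the number of successes in a fixed number of independent trials. In this case, we are looking at exactly 3 or 4 successes in 5 trials.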
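Because dbinom is vectorized, the two terms can be computed in one call, and the same value falls out of a difference of cumulative probabilities. A minimal sketch with illustrative variable names:

```r
n_questions <- 5
p_correct   <- 0.25

# P(X = 3) + P(X = 4) in one vectorized call
p_sum <- sum(dbinom(3:4, size = n_questions, prob = p_correct))

# Equivalent: P(X <= 4) - P(X <= 2)
p_cdf <- pbinom(4, n_questions, p_correct) - pbinom(2, n_questions, p_correct)

round(c(sum_of_pmf = p_sum, cdf_difference = p_cdf), 4)
```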

4.c What is the probability that Robin gets the majority of the questions right? Provide the probability statement, and show two different ways to get to the same answer.

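There are 5 questions, so Robin needs at least 3 correct answers to get the majority of the questions right. With X as the number of correct answers, the probability statement is \(P(X \ge 3) = P(X = 3) + P(X = 4) + P(X = 5)\).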

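N <- 5      # number of questions
K <- 3:5    # a majority means 3, 4, or 5 correct
p <- 0.25   # probability of a correct guess (named p so R's built-in pi is not masked)
# Way 1: sum the binomial formula over K
sum(choose(n = N, k = K) * p^K * (1-p)^(N-K))
## [1] 0.1035156
# Way 2: complement of the binomial CDF, P(X >= 3) = 1 - P(X <= 2)
1 - pbinom(q    = 2,
           size = 5,
           prob = 0.25
           )
## [1] 0.1035156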

5 Normal Distribution

5a Speeding on the Interstate 5 Freeway (I-5) in California. The distribution of passenger vehicle speeds traveling on the Interstate 5 Freeway (I-5) in California is nearly normal with a mean of 72.6 miles/hour and a standard deviation of 4.78 miles/hour.

5a1 What percent of passenger vehicles travel slower than 80 miles/hour? Define the random variable X, and write the probability statement

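# X = speed of a passenger vehicle in miles/hour, X ~ N(72.6, 4.78); we want P(X < 80)
x <- 80         # speed threshold
mu <- 72.6      # mean speed
sigma <- 4.78   # standard deviation (not named sd, to avoid masking the sd function)
P5a1 <- pnorm(x, mean = mu, sd = sigma)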
round(P5a1, digits = 4)
## [1] 0.9392
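The same probability can be obtained by standardizing first, which makes the link to the z-table explicit. A minimal sketch with illustrative variable names:

```r
speed    <- 80
mu_speed <- 72.6
sd_speed <- 4.78

z <- (speed - mu_speed) / sd_speed   # how many standard deviations 80 is above the mean

p_direct <- pnorm(speed, mean = mu_speed, sd = sd_speed)   # P(X < 80) on the original scale
p_z      <- pnorm(z)                                       # P(Z < z) on the standard normal

round(c(z = z, p_direct = p_direct, p_z = p_z), 4)
```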

5a2 What percent of passenger vehicles travel between 68 and 78 miles/hour? Does this make sense? Justify.

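P5a2 <- pnorm(78, mean = 72.6, sd = 4.78) - pnorm(68, mean = 72.6, sd = 4.78)
round(P5a2, digits = 4)
## [1] 0.7028

About 70.28% of passenger vehicles travel between 68 and 78 miles/hour. This makes sense: the interval runs from roughly 1 standard deviation below the mean to a little more than 1 standard deviation above it, and about 68% of a normal distribution lies within 1 standard deviation of the mean.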

5a3 The speed limit on this stretch of the I-5 is 70 miles/hour. Approximate what percentage of the passenger vehicles travel above the speed limit on this stretch of the I-5.

p5a3 <- 1 - pnorm(70, mean = 72.6, sd = 4.78)
round(p5a3,digits = 4)
## [1] 0.7068
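An upper-tail probability can also be requested directly with the lower.tail argument of pnorm instead of subtracting from 1. A short sketch with illustrative variable names:

```r
limit    <- 70
mu_speed <- 72.6
sd_speed <- 4.78

p_complement <- 1 - pnorm(limit, mean = mu_speed, sd = sd_speed)                  # 1 - P(X <= 70)
p_upper      <- pnorm(limit, mean = mu_speed, sd = sd_speed, lower.tail = FALSE)  # P(X > 70)

round(c(complement = p_complement, upper_tail = p_upper), 4)
```

Both forms agree here; lower.tail = FALSE tends to be more accurate numerically when the tail probability is extremely small.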

5 b N(𝜇 = 4313,𝜎 = 583) for Men, Ages 30 - 34 group.

N(𝜇 = 5261,𝜎 = 807) for Women, Ages 25-29 group. Times are listed in seconds.

5b1 The cutoff time for the fastest 5% of athletes in the men’s group, i.e. those who took the shortest 5% of time to finish.

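p <- 0.05      # lower-tail proportion (fastest 5%)
mu <- 4313     # men's mean finishing time in seconds
sigma <- 583   # men's standard deviation in seconds
P5b1 <- qnorm(p, mean = mu, sd = sigma)
round(P5b1, digits = 4)
## [1] 3354.05

Men: the fastest 5% of athletes in this group finished in about 3354 seconds or less.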
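The cutoff can also be built from the standard normal quantile, since a percentile of N(μ, σ) equals μ + zσ. A minimal sketch for the men's group, with illustrative variable names:

```r
mu_men <- 4313   # men's mean finishing time in seconds, from the problem
sd_men <- 583    # men's standard deviation in seconds, from the problem

z_05 <- qnorm(0.05)   # standard normal 5th percentile (negative: below the mean)

cutoff_from_z <- mu_men + z_05 * sd_men
cutoff_direct <- qnorm(0.05, mean = mu_men, sd = sd_men)

round(c(z = z_05, from_z = cutoff_from_z, direct = cutoff_direct), 2)
```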

5b2

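p <- 0.9       # cumulative proportion
mu <- 5261     # women's mean finishing time in seconds, as given in the problem statement
sigma <- 807   # women's standard deviation in seconds
P5b2 <- qnorm(p, mean = mu, sd = sigma)
round(P5b2, digits = 4)
## [1] 6295.212

Women: 90% of athletes in this group finished in about 6295 seconds or less.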

For question 5, we know the normal distribution is the most widely known distribution, and it lets us apply the empirical rule, which describes how much of the data falls within each standard deviation of the mean. This is a continuous distribution, so we used the functions from our homework and discussions (pnorm and qnorm) to calculate these answers.
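The empirical rule mentioned above can be reproduced directly with pnorm on the standard normal. This is a small sketch; the variable names are illustrative.

```r
k <- 1:3   # number of standard deviations from the mean

# P(-k < Z < k) for a standard normal variable Z
within_k_sd <- pnorm(k) - pnorm(-k)

round(within_k_sd, 4)
```

The results are roughly 68%, 95%, and 99.7%, matching the rule.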