Midterm Data Analysis
The variables in the data set are defined as follows:
launch : numbers the temperature-sorted observations from 1 to 23.
temp : temperature in degrees Fahrenheit at the time of launch.
incident : coded “Yes” if there was an incident with an O-ring, and “No” otherwise.
o_ring_probs : counts the number of O-ring partial failures experienced on the flight.
data1 <- read.csv('./challenger-2.csv') # Read the data from the CSV file
library(psych)
summary(data1) # Summary of our Data
##      launch          temp         incident          o_ring_probs
##  Min.   : 1.0   Min.   :53.60   Length:23          Min.   :0.0000
##  1st Qu.: 6.5   1st Qu.:66.20   Class :character   1st Qu.:0.0000
##  Median :12.0   Median :69.80   Mode  :character   Median :0.0000
##  Mean   :12.0   Mean   :69.02                      Mean   :0.4348
##  3rd Qu.:17.5   3rd Qu.:74.30                      3rd Qu.:1.0000
##  Max.   :23.0   Max.   :80.60                      Max.   :3.0000
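Using the summary function, we can see the minimum, maximum, median, mean, and quartiles of each numeric variable. For the character column incident, it reports only the length and class.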
describe(data1$launch) # Launch
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 23   12 6.78     12      12 8.9   1  23    22    0    -1.36 1.41
describe(data1$temp) # Temp
##    vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 23 69.02 6.97   69.8   69.33 5.34 53.6 80.6    27 -0.4    -0.44 1.45
describe(data1$o_ring_probs) #O Ring Probs
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 23 0.43 0.79      0    0.26   0   0   3     3 1.81     2.69 0.16
Above, we ran describe on each of our numeric variables to look at their data in more detail. The describe function reports statistics for each variable, such as the number of entries (n), mean, standard deviation, min, and max. Further to the right, the output also lists skewness and kurtosis. It is important to note that I did not run it on the incident column. This is because its entries are just Yes or No, and describe expects numeric input, so R does not return meaningful statistics for that column.
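For a Yes/No column like incident, a frequency table is usually a better summary than describe. The sketch below uses a small hypothetical vector (not the Challenger values) to show the idea with base R's table and prop.table.

```r
# Hypothetical toy column standing in for data1$incident (values are made up)
incident <- c("Yes", "Yes", "No", "No", "No", "Yes", "No")

counts <- table(incident)      # frequency of each category
props  <- prop.table(counts)   # relative frequency of each category

print(counts)
print(round(props, 4))
```

On the real data the same two calls would be applied to data1$incident.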
To start, we know that our levels of measurement are Nominal, Ordinal, Interval, and Ratio. I have listed these in more detail below.
Nominal: Nominal values are categories, states, or “names of things”. Examples are hair color, marital status, and occupation. This is a qualitative variable.
Ordinal: Ordinal values have a meaningful order, also described as a ranking, but the magnitude between successive values is not known. Examples are the size of a coffee, grades, and ranks in the military. This is a qualitative variable.
Interval: Interval values are measured on a scale of equal-sized units and have an order, but the scale has no true zero. Examples are temperature in degrees Fahrenheit or Celsius and calendar dates. This is a quantitative variable.
Ratio: Ratio values have an inherent zero point, so we can describe one value as a multiple of another. Examples are temperature in kelvin, length, and monetary quantities. This is a quantitative variable.
Our example for the Challenger Data
Our launch variable is nominal. Even though its values are numeric, it is a qualitative variable, since it is just an identifier for each observation.
Temperature is interval. It is a quantitative variable measured in degrees Fahrenheit, a scale with no true zero: temperatures can fall below zero, so differences are meaningful but ratios are not.
Our incident variable is nominal as well, since it is a categorical variable (Yes or No).
O_Ring_Probs is a ratio variable. It is a count with a true zero (no partial failures), so both differences and ratios are meaningful.
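As a rough illustration of how these levels of measurement can be represented in R, the sketch below uses made-up values: an unordered factor for a nominal variable, an ordered factor for an ordinal one, and plain numeric vectors for interval and ratio variables. All names and values here are hypothetical.

```r
# Hypothetical values, for illustration only
launch_id <- factor(c(1, 2, 3))                      # nominal: labels, no arithmetic
size      <- factor(c("small", "large", "medium"),
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)                  # ordinal: ordered categories
temp_f    <- c(53.6, 66.2, 80.6)                     # interval: differences are meaningful
problems  <- c(0L, 1L, 3L)                           # ratio: counts with a true zero

size[1] < size[2]            # TRUE: "small" ranks below "large"
diff(temp_f)                 # differences in degrees are meaningful
problems[3] / problems[2]    # ratios are meaningful for counts
```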
# Plotting Histogram
hist(data1$o_ring_probs)
Above, we have plotted a histogram of O_Ring_Probs, which groups our
data into bins of width 0.5.
# Boxplot
boxplot(temp~incident,
data = data1,
horizontal = TRUE,
main = "Boxplot of Temp vs Incident",
xlab = "Temperature")
We can see from our boxplot that launches with incidents tended to occur
at lower temperatures than launches without incidents. In this data set,
lower temperatures are associated with a higher chance of an incident.
which(data1$incident == "No")[1]
## [1] 5
We can see from our code that the first launch with no incident is the 5th observation in the data set. We could confirm this by looking at the data set directly, but that becomes more difficult in larger data sets.
sum(data1$temp > 65 & data1$incident == "Yes")
## [1] 3
We can see here that there were 3 incidents above 65 degrees, found by counting the observations where temp > 65 and there was an incident.
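Both of the calculations above rely on logical vectors: which returns the positions where a condition is TRUE, and sum counts the TRUE values. A minimal sketch on a hypothetical toy data frame (not the Challenger values) shows the mechanics, along with a cross-tabulation that gives all four combinations at once.

```r
# Hypothetical toy data, not the Challenger values
toy <- data.frame(
  temp     = c(55, 60, 66, 70, 72, 75, 80),
  incident = c("Yes", "Yes", "No", "Yes", "No", "No", "Yes")
)

above_65 <- toy$temp > 65          # logical vector, one entry per row
is_yes   <- toy$incident == "Yes"

first_no <- which(toy$incident == "No")[1]   # position of the first "No"
n_both   <- sum(above_65 & is_yes)           # each TRUE counts as 1

# Cross-tabulate the temperature condition against the incident flag
tab <- table(above_65 = above_65, incident = toy$incident)
print(tab)
```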
Bayes' Theorem: First, we can write out the formula as \(P(A \mid B) = \displaystyle \frac{P(B \mid A)\, P(A)}{P(B)}\). Bayes' Theorem lets us update a probability based on new information or evidence, which helps us arrive at a more accurate probability than the prior alone. The theorem is an extension of conditional probability, where \(P(A \mid B)\) is the probability of A happening given that B happened. With Bayes' Theorem, we can calculate \(P(A \mid B)\) when we know the reverse conditional probability \(P(B \mid A)\) together with \(P(A)\) and \(P(B)\).
# Parameters
Liar <- .59 # P(polygraph detects a liar | liar)
TruthT <- 0.90 # P(polygraph detects a truth teller | truth teller)
LieEst <- 0.20 # P(individual lies on the polygraph), the prior
Prob_Liar <- (Liar*LieEst)/((Liar*LieEst)+((1-TruthT)*(1-LieEst)))
round(Prob_Liar, digits = 4)
## [1] 0.596
After plugging our parameters into the Bayesian formula, we find a 59.6% probability that an individual is actually a liar, given that the polygraph flagged them as one.
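The same calculation can be wrapped in a small reusable function. This is only a sketch: bayes_posterior and its argument names are hypothetical, and the denominator P(B) is expanded with the law of total probability.

```r
# Hypothetical helper: posterior P(A | B) from a prior and two conditional rates
bayes_posterior <- function(prior, p_b_given_a, p_b_given_not_a) {
  # P(B) by the law of total probability
  evidence <- p_b_given_a * prior + p_b_given_not_a * (1 - prior)
  p_b_given_a * prior / evidence
}

# Polygraph numbers from above: P(liar) = 0.20, P(flag | liar) = 0.59,
# P(flag | truth teller) = 1 - 0.90
posterior <- bayes_posterior(prior = 0.20,
                             p_b_given_a = 0.59,
                             p_b_given_not_a = 1 - 0.90)
round(posterior, digits = 4)
```

Keeping the prior as a separate argument makes it easy to see how the answer changes if a different share of individuals is assumed to lie.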
Prob_Lie2 <- ((1-TruthT)*(1-LieEst))+LieEst
round(Prob_Lie2, digits = 4)
## [1] 0.28
The probability that a randomly selected individual is either a liar or is flagged as a liar by the polygraph is 28%.
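The expression above adds P(truth teller and flagged as a liar) to P(liar), which is the probability of "liar or flagged as a liar". As a cross-check, the same value should come out of the inclusion-exclusion rule, P(A or B) = P(A) + P(B) - P(A and B). The variable names below are hypothetical; the numbers are the parameters used above.

```r
p_liar             <- 0.20       # P(liar)
p_flag_given_liar  <- 0.59       # P(flagged | liar)
p_flag_given_truth <- 1 - 0.90   # P(flagged | truth teller)

# P(flagged), by the law of total probability
p_flag <- p_flag_given_liar * p_liar + p_flag_given_truth * (1 - p_liar)
# P(liar and flagged)
p_both <- p_flag_given_liar * p_liar

p_union <- p_liar + p_flag - p_both   # P(liar or flagged)
round(p_union, digits = 4)
```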
# Parameters
lambda <- 8/10
k <- 0
# Probability of zero failures in the first 8 years
prob_0 <- ppois(k,lambda)
round(prob_0, digits = 4)
## [1] 0.4493
There is a 44.93% probability that the machine has no failure in the first 8 years, i.e. that it fails after 8 years.
# Expected Value
Exp_Value <- lambda
Exp_Value
## [1] 0.8
# Standard Deviation
Standard_Dev <- sqrt(Exp_Value)
round(Standard_Dev, digits = 4)
## [1] 0.8944
For a Poisson distribution, the expected value (mean) is lambda, the average number of events, and the variance is also lambda. Therefore the expected value is our lambda and the standard deviation is the square root of lambda.
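These two facts can be checked numerically from the Poisson probability mass function. The sketch below sums over counts 0 to 100, which is an assumption of the example; for lambda = 0.8 the probability beyond that range is negligible.

```r
lambda <- 8 / 10
k <- 0:100                      # truncated support; tail mass is negligible here
pmf <- dpois(k, lambda)

mean_k <- sum(k * pmf)                  # E[X]
var_k  <- sum((k - mean_k)^2 * pmf)     # Var[X]

round(c(mean = mean_k, sd = sqrt(var_k)), digits = 4)
```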
# Parameters
Prob_Fail <- 1/10
Prob_NotFail <- 1-Prob_Fail
n <- 8 # Years
# Probability of zero failures in 8 years, P(X <= 0)
P3b <- pbinom(0,n,Prob_Fail)
round(P3b, digits = 4)
## [1] 0.4305
There is a 43.05% chance that the machine has no failure in the first 8 years, i.e. that it fails after 8 years.
# Expected Value
Exp_ValueB <- n*Prob_Fail
Exp_ValueB
## [1] 0.8
# Standard Deviation: sqrt(n*p*(1-p)) for a binomial
Standard_DevB <- sqrt(n*Prob_Fail*Prob_NotFail)
round(Standard_DevB, digits = 4)
## [1] 0.8485
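For a binomial distribution the mean is n*p and the variance is n*p*(1-p), so the standard deviation is a little smaller than the square root of the mean. As a cross-check, both can be computed directly from the probability mass function. The sketch assumes n = 8 and p = 0.1, as in the parameters above; the other names are hypothetical.

```r
n <- 8
p <- 1 / 10
k <- 0:n                                  # every possible number of failures
pmf <- dbinom(k, size = n, prob = p)

mean_b <- sum(k * pmf)                    # equals n * p
sd_b   <- sqrt(sum((k - mean_b)^2 * pmf)) # equals sqrt(n * p * (1 - p))

round(c(mean = mean_b, sd = sd_b), digits = 4)
```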
# Parameters
Prob_Correct <- 0.25 # Probability she answers correctly
Prob_Wrong <- 0.75 # Probability she answers incorrect
P4a<- (Prob_Wrong^2)*Prob_Correct
round(P4a,digits = 4)
## [1] 0.1406
There is a 14.06% chance that the first question Robin answers correctly is the third question.
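This is a geometric probability: two wrong answers followed by the first correct one. R's dgeom counts the number of failures before the first success, so the same value should come from dgeom(2, 0.25). The variable names below are hypothetical.

```r
p_correct <- 0.25

by_hand  <- (1 - p_correct)^2 * p_correct   # wrong, wrong, then correct
built_in <- dgeom(2, prob = p_correct)      # 2 failures before the first success

round(c(by_hand = by_hand, built_in = built_in), digits = 4)
```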
P4b <- dbinom(4,5,0.25)
round(P4b, digits = 4)
## [1] 0.0146
p4b3 <- dbinom(3,5,0.25)
round(p4b3,digits = 4)
## [1] 0.0879
# Adding the two probabilities together
P4b+p4b3
## [1] 0.1025391
There is a 10.25% chance that Robin gets exactly 3 or exactly 4 questions correct. We use the Binomial distribution because it describes the number of successes in a fixed number of trials. In this case, we are looking at exactly 3 or 4 successes in 5 trials.
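The two terms can also be computed in one step, either by passing a vector of counts to dbinom or by taking a difference of cumulative probabilities with pbinom. A short sketch with the same parameters (5 questions, success probability 0.25); the variable names are hypothetical.

```r
# Sum of the PMF at 3 and 4 successes
p_3_or_4 <- sum(dbinom(3:4, size = 5, prob = 0.25))

# Same quantity as P(X <= 4) - P(X <= 2)
via_cdf <- pbinom(4, size = 5, prob = 0.25) - pbinom(2, size = 5, prob = 0.25)

round(c(p_3_or_4 = p_3_or_4, via_cdf = via_cdf), digits = 4)
```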
Since there are 5 questions, Robin needs to answer at least 3 correctly to get a majority of the questions right.
N <- 5
K <- 2
p <- 0.25 # named p rather than pi to avoid masking R's built-in constant
choose(n = N, k = K) * p^K * (1-p)^(N-K)
## [1] 0.2636719
# Using Binomial
dbinom(x = 2,
size = 5,
prob = 0.25
)
## [1] 0.2636719
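The calculation above evaluates the probability of exactly 2 correct answers. If the quantity of interest is a majority, meaning at least 3 of 5 correct (an assumption about the question, which is not restated here), the upper tail of the binomial distribution gives it directly. The variable names are hypothetical.

```r
# P(X >= 3) = P(X > 2), using the upper tail of the binomial CDF
p_majority <- pbinom(2, size = 5, prob = 0.25, lower.tail = FALSE)

# Equivalent: add up the PMF at 3, 4, and 5 successes
p_majority_sum <- sum(dbinom(3:5, size = 5, prob = 0.25))

round(c(p_majority = p_majority, p_majority_sum = p_majority_sum), digits = 4)
```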
X <- 80
mean_ <- 72.6
sd_ <- 4.78 # named sd_ to avoid masking R's sd() function
P5a1 <- pnorm(X, mean = mean_, sd = sd_)
round(P5a1, digits = 4)
## [1] 0.9392
P5a2 <- pnorm(80, mean = 72.6, sd = 4.78) - pnorm(60, mean = 72.6, sd = 4.78)
round(P5a2, digits = 4)
## [1] 0.935
p5a3 <- 1 - pnorm(70, mean = 72.6, sd = 4.78)
round(p5a3,digits = 4)
## [1] 0.7068
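The three results above are P(X < 80), P(60 < X < 80), and P(X > 70) for a normal distribution with mean 72.6 and standard deviation 4.78. The sketch below reproduces them through z-scores, z = (x - mean) / sd, and the standard normal distribution; the variable names are hypothetical.

```r
mu    <- 72.6
sigma <- 4.78

z80 <- (80 - mu) / sigma     # standardize each cutoff
z60 <- (60 - mu) / sigma

p_below_80 <- pnorm(z80)                 # P(X < 80)
p_60_80    <- pnorm(z80) - pnorm(z60)    # P(60 < X < 80)
p_above_70 <- pnorm(70, mean = mu, sd = sigma, lower.tail = FALSE)  # P(X > 70)

round(c(p_below_80, p_60_80, p_above_70), digits = 4)
```

Using lower.tail = FALSE is equivalent to subtracting the lower-tail probability from 1.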
N(𝜇 = 5261,𝜎 = 807) for Women, Ages 25-29 group. Times are listed in seconds.
x <- 0.05
mu_m <- 4313 # names chosen to avoid masking R's mean() and sd()
sigma_m <- 583
P5b1 <- qnorm(x, mean = mu_m, sd = sigma_m)
round(P5b1, digits = 4)
## [1] 3354.05
Men
x <- 0.9
mu_w <- 5261 # matches the stated N(𝜇 = 5261, 𝜎 = 807) for the women's group
sigma_w <- 807
P5b2 <- qnorm(x, mean = mu_w, sd = sigma_w)
round(P5b2, digits = 4)
## [1] 6295.212
Women
For question 5, we used the normal distribution, which is the most widely known distribution and lets us apply the empirical rule for the share of data within each standard deviation of the mean. It is a continuous distribution, so we used the functions from our homework and discussions (pnorm and qnorm) to calculate these.
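The empirical rule mentioned above can be checked with pnorm on the standard normal distribution, and qnorm can be confirmed as the inverse of pnorm. This is a small sketch with hypothetical variable names; the mean and standard deviation in the round trip are the ones used in the first percentile calculation above.

```r
# Share of a normal distribution within 1, 2, and 3 standard deviations of the mean
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, digits = 4)

# qnorm inverts pnorm: the 5th percentile maps back to a probability of 0.05
cutoff <- qnorm(0.05, mean = 4313, sd = 583)
round_trip <- pnorm(cutoff, mean = 4313, sd = 583)
round_trip
```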