# #Q1Question 1.
# The sensitivity and specificity of the polygraph has been a subject of study and debate
# for years. A 2001 study of the use of polygraph for screening purposes suggested that
# the probability of detecting an actual liar was .59 (sensitivity) and that the probability of
# detecting an actual "truth teller" was .90 (specificity). We estimate that about 20% of
# individuals selected for the screening polygraph will lie.
a_truth<-c(0.90*0.8,0,0)
a_lie<-c(0,0.59*0.2,0)
a_tot<-c(0.8,0.2,1)
polygraph_data<-data.frame(cbind(a_truth,a_lie,a_tot))
row.names(polygraph_data)=c('d_truth','d_lie','d_tot')
polygraph_data$a_truth[row.names(polygraph_data)=='d_lie']=(1-0.59)*0.2
polygraph_data$a_lie[row.names(polygraph_data)=='d_truth']=(1-0.9)*0.8
polygraph_data
## a_truth a_lie a_tot
## d_truth 0.720 0.080 0.8
## d_lie 0.082 0.118 0.2
## d_tot 0.000 0.000 1.0
# a. What is the probability that an individual is actually a liar given that the polygraph
# detected him/her as such? Solve using a Bayesian equation.
pdectLiar = .59
pdectTruther = .90
pliar = .20
pTruther = 1-pliar
pmissTruther = 1- pdectTruther
pmissLiar = 1-pdectLiar
p_a = pliar
p_b = pTruther*pmissTruther + pliar*pdectLiar
p_ba = pliar*pdectLiar
p_ab = p_ba*p_a/p_b
#Probability that an individual is actually a liar given that the polygraph detcted him/her:
p_ab
## [1] 0.1191919
# b. What is the probability that a randomly selected individual is either a liar or was
# identified as a liar by the polygraph? Be sure to write the probability statement.
#P(truth teller | detected as a truth-teller)
p_b = pTruther * pmissTruther
p_a = pliar * pdectLiar
p_aorb = p_a + p_b
p_aorb
## [1] 0.198
# Question 2.
# Your organization owns an expensive Magnetic Resonance Imaging machine (MRI).
# This machine has a manufacturer's expected lifetime of 10 years.
#(Include the probability statements and R / Python Code for each part.).
# a. What is the probability that the machine will fail after 8 years? Model as a geometric.
# (Hint: there are at least 7 failures before the first success.) Provide also the expected
# value and standard deviation.
#Probability that the machine failure occurs before 7 years
dgeom(7,.5)
## [1] 0.00390625
#Expected Value:
E_geom = 1/.5
E_geom
## [1] 2
#Standard deviation:
SD_geom = sqrt(1-.5)/.5
SD_geom
## [1] 1.414214
# b. What is the probability that the machine will fail after 8 years? Model as an
# exponential assuming continuous state space. Provide also the expected value and
# standard deviation of the distribution.
n = 8
p = 0.5
lambda = n*p
dexp(n,rate = 1/lambda)
## [1] 0.03383382
#Expected Value
E_exp = 1/lambda
E_exp
## [1] 0.25
#Standard Deviation
SD_exp = 1/lambda
SD_exp
## [1] 0.25
# c. What is the probability that the machine will fail after 8 years? Model as a binomial.
# (Hint: 0 success in 8 years) Provide also the expected value and standard deviation of
# the distribution.
#Probability
dbinom(0,n,p=p)
## [1] 0.00390625
#EV
n*p
## [1] 4
#SD
n*p*(1-p)
## [1] 2
# d. What is the probability that the machine will fail after 8 years? Model as a Poisson.
# Re-define the discrete state space if necessary. (Hint: Don't forget to use ???t rather just
# ????????? Provide also the expected value and standard deviation of the distribution.
n=9
p=.5
lambda = n*p
#Probability
dpois(n, lambda)
## [1] 0.02316458
#EV
lambda
## [1] 4.5
#SD
sqrt(lambda)
## [1] 2.12132
# e. What is the probability that the machine will have its 2nd failure exactly on the 9th
# year? Model as a Negative Binomial. Provide also the expected value and standard
# deviation of the distribution.
k = 9
n = 9
#Prob_2nd_fail
choose(n,k)*(1-p)^(n-k)*p^k
## [1] 0.001953125
#Expected value
k/p
## [1] 18
#Standard Deviation
sqrt(k*(1-p))/p
## [1] 4.242641
# Question 3.
# In 1986, the Challenger space shuttle exploded during "throttle up" due to catastrophic
# failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle
# launches prior to the Challenger disaster are in the file challenger.csv. Load the data
# into R or Python and answer the following questions. Include all R code.
# The variables in the data set are defined as follows:
# launch: this numbers the temperature-sorted observations from 1 to 23.
# temp: temperature in degrees Fahrenheit at the time of launch
# incident: If there was an incident with an O-Ring, then it is coded "Yes."
# o_ring_probs: counts the number of O-ring partial failures experienced on the flight.
library(readr)
library(psych)
challenger <- read_csv("challenger.csv")
## Parsed with column specification:
## cols(
## launch = col_double(),
## temp = col_double(),
## incident = col_character(),
## o_ring_probs = col_double()
## )
# a. What are the levels of measurement of these variables? Provide appropriate
# descriptive statistics and graphs for the variable o_ring_probs. Interpret. Provide
# measures of center, spread, shape, position, and two appropriate plots that are
# appropriate for the level of measurement. Discuss
summary(challenger)
## launch temp incident o_ring_probs
## Min. : 1.0 Min. :53.60 Length:23 Min. :0.0000
## 1st Qu.: 6.5 1st Qu.:66.20 Class :character 1st Qu.:0.0000
## Median :12.0 Median :69.80 Mode :character Median :0.0000
## Mean :12.0 Mean :69.02 Mean :0.4348
## 3rd Qu.:17.5 3rd Qu.:74.30 3rd Qu.:1.0000
## Max. :23.0 Max. :80.60 Max. :3.0000
describe(challenger$o_ring_probs)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 23 0.43 0.79 0 0.26 0 0 3 3 1.81 2.69 0.16
hist(challenger$o_ring_probs)

#This shows us that the data has a Right skew, this is likely because the majority of O_rings will not fail
pairs.panels(challenger[1:4],gap=0, main="Challenger Dataset",pch=21,bg=c("red","green","blue"))

#This describes all of the fields in challenger
# b. The temperature on the day of the Challenger launch was 36 degrees Fahrenheit.
# Provide side-by-side boxplots for temperature by incident (temp~incident). Why might
# this have been a concern?
boxplot(temp~incident, data = challenger, horizontal = TRUE, main = "Temperature vs. Incident", xlab = "Temperature")

#Based on this boxplot it is clear that there are more incidents when the temperatures are colder out, so launches should only be made on warm days.
# c. In the already temperature-sorted dataset, find on which observation the first
# successful launch occurred (one with no incident). Test the hypothesis that the first
# failure would come on or after this observation. Use alpha = .10.
minTemp <- min(challenger$temp[challenger$incident == "No"])
#First observation without an incident
minTemp
## [1] 66.2
meanTemp <- mean(challenger$temp)
failure.dt <- challenger[challenger$incident == "Yes",]
n= nrow(challenger)
stdDev = sd(challenger$temp)
p = minTemp
z = (p-meanTemp)/sqrt(p*(1-p)/n)
## Warning in sqrt(p * (1 - p)/n): NaNs produced
Temp.dt <- failure.dt$temp
Incident.dt <- challenger$incident
t.test(Temp.dt,mu = 66.2, conf.level = .90, alternative = "less")
##
## One Sample t-test
##
## data: Temp.dt
## t = -0.83918, df = 6, p-value = 0.2168
## alternative hypothesis: true mean is less than 66.2
## 90 percent confidence interval:
## -Inf 68.04029
## sample estimates:
## mean of x
## 63.62857
#p-value > a therefore we accept the hypothesis
# d. How many incidents occurred above 65 degrees F? _____ Test the hypothesis that
# you would see this many or fewer failures given a fixed population of 23 launches. Use
# alpha = .10.
nrow(failure.dt[failure.dt$temp > 65,])
## [1] 3
t.test(Temp.dt, mu = 65, conf.level = .90, alternative = "less")
##
## One Sample t-test
##
## data: Temp.dt
## t = -0.44756, df = 6, p-value = 0.3351
## alternative hypothesis: true mean is less than 65
## 90 percent confidence interval:
## -Inf 68.04029
## sample estimates:
## mean of x
## 63.62857
#P-value > a therefore we accept the hypothesis.
# e. Provide a 90% confidence interval for incidents.
t.test(Temp.dt,mu = meanTemp, conf.level = .90)
##
## One Sample t-test
##
## data: Temp.dt
## t = -1.7586, df = 6, p-value = 0.1291
## alternative hypothesis: true mean is not equal to 69.01739
## 90 percent confidence interval:
## 57.67426 69.58289
## sample estimates:
## mean of x
## 63.62857
#Question 4
#a)
survived1 <- 7
survived2 <- 5
died1 <- 0
died2 <- 2
totSurv <- survived1+survived2
totDied <- died1+died2
totDrug1 <- survived1+died1
totDrug2 <- survived2+died2
tot <- totSurv+totDied
p_died1 <- died1/tot #Drug 1
p_died2 <- died2/tot #Drug 2
p_Surv1 <- survived1/tot #Drug 1
p_Surv2 <- survived2/tot #Drug 2
p_died1 #P(A1B2)
## [1] 0
p_died2 #P(A2B2)
## [1] 0.1428571
p_Surv1 #P(A1B1)
## [1] 0.5
p_Surv2 #P(A2B1)
## [1] 0.3571429
#B)
p_drug1 <- totDrug1/tot #Drug 1
p_drug2 <- totDrug2/tot #Drug 2
p_Surv <- totSurv/tot #Survived
p_Died <- totDied/tot #Died
p_drug1 #P(A1)
## [1] 0.5
p_drug2 #P(A2)
## [1] 0.5
p_Surv #P(B1)
## [1] 0.8571429
p_Died #P(B2)
## [1] 0.1428571
#C)
p_died1 #P(A1B2)
## [1] 0
p_drug1*p_Died
## [1] 0.07142857
p_died2 #P(A2B2)
## [1] 0.1428571
p_drug2*p_Died
## [1] 0.07142857
p_Surv1 #P(A1B1)
## [1] 0.5
p_drug1*p_Surv
## [1] 0.4285714
p_Surv2 #P(A2B1)
## [1] 0.3571429
p_drug2*p_Surv
## [1] 0.4285714
#Based on the rule of independence the variables are not indenpendent in any of the relationships
#D)
phyper(0,2,12,14)
## [1] 0
#There is not sufficient evidence.
#Q5
#A) There are everal issues with this plot as it has the potential to overlap and coverup some data and make it unclear
#for example in 2002 it looks like USA has more GDP growth than Euro Area, if it weren't for the data label it would be indiscernable.
#B)
us<-c(3.7,4.5,4.2,4.5,3.7,0.8,1.6,2.7,4.2,3.5,2.9)
euro<-c(1.5,2.6,2.9,2.8,3.7,1.8,0.9,0.7,1.9,1.0,1.8)
yr<-c(1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006)
gdp<-data.frame(cbind(yr,us,euro))
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ay <- list(
tickfont = list(color = "red"),
overlaying = "y",
side = "right",
title = "Euro"
)
plot_ly(data = gdp, y = ~us, x =~yr,name = "US", type = "scatter", mode= "lines+markers") %>%
add_trace(y = ~euro, name = "Euro", type = "scatter", mode= "lines+markers", yaxis = "y2") %>%
layout(
title = "Euro vs. US GDP", yaxis2 = ay,
xaxis = list(title="Year")
)
# #Question 6.
# The distribution of the average IQ score is known to closely follow a Gaussian distribution with a mean centered directly at 100 and a population standard deviation of 16 (Stanford-Benet). A single person is randomly selected for jury duty.
# a. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.
# P(IQ>=110) given that mean=100 sd=16
# lower.tail = FALSE indicates we are looking at the right side tail.
pnorm(110, mean = 100, sd = 16, lower.tail = FALSE)
## [1] 0.2659855
# Now, a jury is seated that has 12 individuals on it. The mean IQ of the jury is 110.
# b. What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with ???=100 and ???=16? Be sure to write the probability statement and show your R code.
# P(IQ>=110) given that new mean=110 and SD=16
pnorm(110, mean = 110, sd = 16, lower.tail = FALSE)
## [1] 0.5
#The change in mean value indicates that the probability curve with higher IQ's has changed