Midterm

# #Q1Question 1.
# The sensitivity and specificity of the polygraph has been a subject of study and debate
# for years. A 2001 study of the use of polygraph for screening purposes suggested that
# the probability of detecting an actual liar was .59 (sensitivity) and that the probability of
# detecting an actual "truth teller" was .90 (specificity). We estimate that about 20% of
# individuals selected for the screening polygraph will lie.
a_truth<-c(0.90*0.8,0,0)
a_lie<-c(0,0.59*0.2,0)
a_tot<-c(0.8,0.2,1)
polygraph_data<-data.frame(cbind(a_truth,a_lie,a_tot))
row.names(polygraph_data)=c('d_truth','d_lie','d_tot')

polygraph_data$a_truth[row.names(polygraph_data)=='d_lie']=(1-0.59)*0.2
polygraph_data$a_lie[row.names(polygraph_data)=='d_truth']=(1-0.9)*0.8
polygraph_data

##         a_truth a_lie a_tot
## d_truth   0.720 0.080   0.8
## d_lie     0.082 0.118   0.2
## d_tot     0.000 0.000   1.0

# a. What is the probability that an individual is actually a liar given that the polygraph
# detected him/her as such? Solve using a Bayesian equation.

pdectLiar = .59
pdectTruther = .90
pliar = .20
pTruther = 1-pliar
pmissTruther = 1- pdectTruther
pmissLiar = 1-pdectLiar

p_a = pliar
p_b = pTruther*pmissTruther + pliar*pdectLiar
p_ba = pliar*pdectLiar

p_ab = p_ba*p_a/p_b
#Probability that an individual is actually a liar given that the polygraph detcted him/her:
p_ab

## [1] 0.1191919

# b. What is the probability that a randomly selected individual is either a liar or was
# identified as a liar by the polygraph? Be sure to write the probability statement.
#P(truth teller | detected as a truth-teller)

p_b = pTruther * pmissTruther 
p_a = pliar * pdectLiar
p_aorb = p_a + p_b
p_aorb

## [1] 0.198

# Question 2.
# Your organization owns an expensive Magnetic Resonance Imaging machine (MRI).
# This machine has a manufacturer's expected lifetime of 10 years. 
#(Include the probability statements and R / Python Code for each part.).
# a. What is the probability that the machine will fail after 8 years? Model as a geometric.
# (Hint: there are at least 7 failures before the first success.) Provide also the expected
# value and standard deviation.

#Probability that the machine failure occurs before 7 years
dgeom(7,.5)

## [1] 0.00390625

#Expected Value:
E_geom = 1/.5
E_geom

## [1] 2

#Standard deviation:
SD_geom = sqrt(1-.5)/.5
SD_geom

## [1] 1.414214

# b. What is the probability that the machine will fail after 8 years? Model as an
# exponential assuming continuous state space. Provide also the expected value and
# standard deviation of the distribution.
n = 8
p = 0.5
lambda = n*p

dexp(n,rate = 1/lambda)

## [1] 0.03383382

#Expected Value
E_exp = 1/lambda
E_exp

## [1] 0.25

#Standard Deviation
SD_exp = 1/lambda
SD_exp

## [1] 0.25

# c. What is the probability that the machine will fail after 8 years? Model as a binomial.
# (Hint: 0 success in 8 years) Provide also the expected value and standard deviation of
# the distribution.
#Probability
dbinom(0,n,p=p)

## [1] 0.00390625

#EV
n*p

## [1] 4

#SD
n*p*(1-p)

## [1] 2

# d. What is the probability that the machine will fail after 8 years? Model as a Poisson.
# Re-define the discrete state space if necessary. (Hint: Don't forget to use ???t rather just
# ????????? Provide also the expected value and standard deviation of the distribution.
n=9
p=.5
lambda = n*p

#Probability
dpois(n, lambda)

## [1] 0.02316458

#EV
lambda

## [1] 4.5

#SD
sqrt(lambda)

## [1] 2.12132

# e. What is the probability that the machine will have its 2nd failure exactly on the 9th
# year? Model as a Negative Binomial. Provide also the expected value and standard
# deviation of the distribution. 

k = 9
n = 9
#Prob_2nd_fail 
choose(n,k)*(1-p)^(n-k)*p^k

## [1] 0.001953125

#Expected value
k/p

## [1] 18

#Standard Deviation
sqrt(k*(1-p))/p

## [1] 4.242641

# Question 3.
# In 1986, the Challenger space shuttle exploded during "throttle up" due to catastrophic
# failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle
# launches prior to the Challenger disaster are in the file challenger.csv. Load the data
# into R or Python and answer the following questions. Include all R code.
# The variables in the data set are defined as follows:

#  launch: this numbers the temperature-sorted observations from 1 to 23.
# temp: temperature in degrees Fahrenheit at the time of launch
# incident: If there was an incident with an O-Ring, then it is coded "Yes."
# o_ring_probs: counts the number of O-ring partial failures experienced on the flight.
library(readr)
library(psych)
challenger <- read_csv("challenger.csv")

## Parsed with column specification:
## cols(
##   launch = col_double(),
##   temp = col_double(),
##   incident = col_character(),
##   o_ring_probs = col_double()
## )

# a. What are the levels of measurement of these variables? Provide appropriate
# descriptive statistics and graphs for the variable o_ring_probs. Interpret. Provide
# measures of center, spread, shape, position, and two appropriate plots that are
# appropriate for the level of measurement. Discuss
summary(challenger)

##      launch          temp         incident          o_ring_probs   
##  Min.   : 1.0   Min.   :53.60   Length:23          Min.   :0.0000  
##  1st Qu.: 6.5   1st Qu.:66.20   Class :character   1st Qu.:0.0000  
##  Median :12.0   Median :69.80   Mode  :character   Median :0.0000  
##  Mean   :12.0   Mean   :69.02                      Mean   :0.4348  
##  3rd Qu.:17.5   3rd Qu.:74.30                      3rd Qu.:1.0000  
##  Max.   :23.0   Max.   :80.60                      Max.   :3.0000

describe(challenger$o_ring_probs)

##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 23 0.43 0.79      0    0.26   0   0   3     3 1.81     2.69 0.16

hist(challenger$o_ring_probs)

#This shows us that the data has a Right skew, this is likely because the majority of O_rings will not fail

pairs.panels(challenger[1:4],gap=0, main="Challenger Dataset",pch=21,bg=c("red","green","blue"))

#This describes all of the fields in challenger

# b. The temperature on the day of the Challenger launch was 36 degrees Fahrenheit.
# Provide side-by-side boxplots for temperature by incident (temp~incident). Why might
# this have been a concern?
boxplot(temp~incident, data = challenger, horizontal = TRUE, main = "Temperature vs. Incident", xlab = "Temperature")

#Based on this boxplot it is clear that there are more incidents when the temperatures are colder out, so launches should only be made on warm days.

#  c. In the already temperature-sorted dataset, find on which observation the first
# successful launch occurred (one with no incident). Test the hypothesis that the first
# failure would come on or after this observation. Use alpha = .10.
minTemp <- min(challenger$temp[challenger$incident == "No"])
#First observation without an incident
minTemp

## [1] 66.2

meanTemp <- mean(challenger$temp)
failure.dt <- challenger[challenger$incident == "Yes",]

n= nrow(challenger)
stdDev = sd(challenger$temp)
p = minTemp
z = (p-meanTemp)/sqrt(p*(1-p)/n)

## Warning in sqrt(p * (1 - p)/n): NaNs produced

Temp.dt <- failure.dt$temp
Incident.dt <- challenger$incident
t.test(Temp.dt,mu = 66.2, conf.level = .90, alternative = "less")

## 
##  One Sample t-test
## 
## data:  Temp.dt
## t = -0.83918, df = 6, p-value = 0.2168
## alternative hypothesis: true mean is less than 66.2
## 90 percent confidence interval:
##      -Inf 68.04029
## sample estimates:
## mean of x 
##  63.62857

#p-value > a therefore we accept the hypothesis

# d. How many incidents occurred above 65 degrees F? _____ Test the hypothesis that
# you would see this many or fewer failures given a fixed population of 23 launches. Use
# alpha = .10.
nrow(failure.dt[failure.dt$temp > 65,])

## [1] 3

t.test(Temp.dt, mu = 65, conf.level = .90, alternative = "less")

## 
##  One Sample t-test
## 
## data:  Temp.dt
## t = -0.44756, df = 6, p-value = 0.3351
## alternative hypothesis: true mean is less than 65
## 90 percent confidence interval:
##      -Inf 68.04029
## sample estimates:
## mean of x 
##  63.62857

#P-value > a therefore we accept the hypothesis.

# e. Provide a 90% confidence interval for incidents.
t.test(Temp.dt,mu = meanTemp, conf.level = .90)

## 
##  One Sample t-test
## 
## data:  Temp.dt
## t = -1.7586, df = 6, p-value = 0.1291
## alternative hypothesis: true mean is not equal to 69.01739
## 90 percent confidence interval:
##  57.67426 69.58289
## sample estimates:
## mean of x 
##  63.62857

#Question 4
#a)
survived1 <- 7
survived2 <- 5
died1 <- 0
died2 <- 2

totSurv <- survived1+survived2
totDied <- died1+died2
totDrug1 <- survived1+died1
totDrug2 <- survived2+died2
tot <- totSurv+totDied

p_died1 <- died1/tot #Drug 1
p_died2 <- died2/tot #Drug 2
p_Surv1 <- survived1/tot #Drug 1
p_Surv2 <- survived2/tot #Drug 2

p_died1 #P(A1B2)

## [1] 0

p_died2 #P(A2B2)

## [1] 0.1428571

p_Surv1 #P(A1B1)

## [1] 0.5

p_Surv2 #P(A2B1)

## [1] 0.3571429

#B)
p_drug1 <- totDrug1/tot #Drug 1
p_drug2 <- totDrug2/tot #Drug 2
p_Surv <- totSurv/tot #Survived
p_Died <- totDied/tot #Died

p_drug1 #P(A1)

## [1] 0.5

p_drug2 #P(A2)

## [1] 0.5

p_Surv #P(B1)

## [1] 0.8571429

p_Died #P(B2)

## [1] 0.1428571

#C)
p_died1 #P(A1B2)

## [1] 0

p_drug1*p_Died

## [1] 0.07142857

p_died2 #P(A2B2)

## [1] 0.1428571

p_drug2*p_Died

## [1] 0.07142857

p_Surv1 #P(A1B1)

## [1] 0.5

p_drug1*p_Surv

## [1] 0.4285714

p_Surv2 #P(A2B1)

## [1] 0.3571429

p_drug2*p_Surv

## [1] 0.4285714

#Based on the rule of independence the variables are not indenpendent in any of the relationships

#D)
phyper(0,2,12,14)

## [1] 0

#There is not sufficient evidence.

#Q5
#A) There are everal issues with this plot as it has the potential to overlap and coverup some data and make it unclear
#for example in 2002 it looks like USA has more GDP growth than Euro Area, if it weren't for the data label it would be indiscernable.
#B)
us<-c(3.7,4.5,4.2,4.5,3.7,0.8,1.6,2.7,4.2,3.5,2.9)
euro<-c(1.5,2.6,2.9,2.8,3.7,1.8,0.9,0.7,1.9,1.0,1.8)
yr<-c(1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006)
gdp<-data.frame(cbind(yr,us,euro))
library(plotly)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

ay <- list(
 tickfont = list(color = "red"),
 overlaying = "y",
 side = "right",
 title = "Euro"
)
plot_ly(data = gdp, y = ~us, x =~yr,name = "US", type = "scatter", mode= "lines+markers") %>%
 add_trace(y = ~euro, name = "Euro", type = "scatter", mode= "lines+markers", yaxis = "y2") %>%
 layout(
  title = "Euro vs. US GDP", yaxis2 = ay,
  xaxis = list(title="Year")
 )

# #Question 6.
# The distribution of the average IQ score is known to closely follow a Gaussian distribution with a mean centered directly at 100 and a population standard deviation of 16 (Stanford-Benet). A single person is randomly selected for jury duty.
# a. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.
# P(IQ>=110) given that mean=100 sd=16

# lower.tail = FALSE indicates we are looking at the right side tail.
pnorm(110, mean = 100, sd = 16, lower.tail = FALSE)

## [1] 0.2659855

# Now, a jury is seated that has 12 individuals on it. The mean IQ of the jury is 110.
# b. What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with ???=100 and ???=16? Be sure to write the probability statement and show your R code.
# P(IQ>=110) given that new mean=110 and SD=16
pnorm(110, mean = 110, sd = 16, lower.tail = FALSE)

## [1] 0.5

#The change in mean value indicates that the probability curve with higher IQ's has changed

Midterm

Joseph Sansevero

February 10, 2019