Midterm_Data Analysis

Question 1.

The sensitivity and specificity of the polygraph have been a subject of study and debate for years. A 2001 study of the use of the polygraph for screening purposes suggested that the probability of detecting an actual liar was .59 (sensitivity) and that the probability of detecting an actual “truth teller” was .90 (specificity). We estimate that about 20% of individuals selected for the screening polygraph will lie.

# Joint probabilities: rows = polygraph result (d_), columns = actual status (a_)
p_liar<-0.2   # P(actual liar)
sens<-0.59    # P(detected liar | actual liar)
spec<-0.90    # P(detected truth | actual truth teller)
a_truth<-c(spec*(1-p_liar),(1-spec)*(1-p_liar))
a_lie<-c((1-sens)*p_liar,sens*p_liar)
polygraph_data<-data.frame(a_truth,a_lie)
polygraph_data$d_tot<-rowSums(polygraph_data)
polygraph_data<-rbind(polygraph_data,colSums(polygraph_data))
row.names(polygraph_data)=c('d_truth','d_lie','a_tot')
polygraph_data

##         a_truth a_lie d_tot
## d_truth    0.72 0.082 0.802
## d_lie      0.08 0.118 0.198
## a_tot      0.80 0.200 1.000

a. What is the probability that an individual is actually a liar given that the polygraph detected him/her as such? Solve using a Bayesian equation.

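# Probability Statement  
# P (Is actual liar | Is detected liar) 
# = P(Is actual liar)*P(Is detected liar | Is actual liar)/ P(Is detected liar)
# where P(Is detected liar) = P(Is detected liar | Is actual liar)*P(Is actual liar)
#                           + P(Is detected liar | Is truth teller)*P(Is truth teller)

p_al=0.2                              # P(Is actual liar)
p_dl_al=0.59                          # P(Is detected liar | Is actual liar), the sensitivity
p_dl=p_dl_al*p_al+(1-0.9)*(1-p_al)    # P(Is detected liar) = 0.118 + 0.08 = 0.198
# Using Bayes' theorem as defined above.
P_al_dl<-p_al*p_dl_al/p_dl
P_al_dl

## [1] 0.5959596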
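
As a cross-check on the Bayes calculation, the same theorem can be wrapped in a small helper. This is only a sketch: the function name `posterior_liar` and its argument names are illustrative and are not part of the exam.

```r
# Hypothetical helper (name and arguments are illustrative):
# P(actual liar | detected liar) from prevalence, sensitivity and specificity.
posterior_liar <- function(prior, sensitivity, specificity) {
  true_pos  <- sensitivity * prior              # P(detected liar AND actual liar)
  false_pos <- (1 - specificity) * (1 - prior)  # P(detected liar AND truth teller)
  true_pos / (true_pos + false_pos)             # Bayes' theorem
}

ppv <- posterior_liar(prior = 0.20, sensitivity = 0.59, specificity = 0.90)
round(ppv, 4)
```

With the figures from the question this comes to roughly 0.596, so a little under 60% of the people the polygraph flags would actually be lying.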

b. What is the probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph? Be sure to write the probability statement.

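# Probability Statement  
# P(A or B) = P(A=Is liar) + P(B=Is detected liar) - P(A and B)
# The overlap P(A and B) must be subtracted so liars who are also detected are not counted twice.
p_liar=0.2                          # P(A = Is liar)
p_detected=0.59*0.2+(1-0.9)*0.8     # P(B = Is detected liar) = 0.198
p_both=0.59*0.2                     # P(A and B) = 0.118
p_either_al_or_dl=p_liar+p_detected-p_both
p_either_al_or_dl

## [1] 0.28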
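
The union probability can also be checked by simulation. The sketch below assumes the figures from the question (20% liars, sensitivity 0.59, specificity 0.90); the seed and the simulation size are arbitrary choices made for illustration.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
n_sim <- 200000                              # illustrative simulation size
liar  <- runif(n_sim) < 0.20                 # actual status: TRUE = liar

# The polygraph flags a liar with probability 0.59 (sensitivity)
# and a truth teller with probability 1 - 0.90 (false positive rate).
flagged <- ifelse(liar, runif(n_sim) < 0.59, runif(n_sim) < 0.10)

p_union_sim   <- mean(liar | flagged)        # simulated P(liar OR flagged)
p_union_exact <- 0.20 + 0.198 - 0.118        # inclusion-exclusion
c(simulated = p_union_sim, exact = p_union_exact)
```

The simulated share should land close to the exact value of 0.28, which is a quick way to confirm that the overlap term has been handled correctly.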

Question 2.

Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years. (Include the probability statements and R / Python code for each part.)

a. What is the probability that the machine will fail after 8 years? Model as a geometric. (Hint: there are at least 7 failures before the first success.) Provide also the expected value and standard deviation.

# Expected lifetime = 10 yrs, so for a geometric model E(X) = 1/p = 10
# Assuming: p=1/10=0.1 (probability that the machine fails in any given year)
# Probability Statement: P(X > 8) = P(no failure in the first 8 yrs) = (1-p)^8
p=0.1
E_geom=1/p
SD_geom=sqrt(1-p)/p

# pgeom counts the failure-free years before the first failure, so P(X > 8) is the upper tail beyond 7
paste("P(X > 8) =",round(pgeom(7, p, lower.tail = FALSE),4))

## [1] "P(X > 8) = 0.4305"

paste("Expected Value,E(X) =",E_geom)

## [1] "Expected Value,E(X) = 10"

paste("Std Dev,SD(X) =",round(SD_geom,4))

## [1] "Std Dev,SD(X) = 9.4868"
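
R parameterises the geometric distribution by the number of non-events before the first event, which makes off-by-one errors easy. The sketch below assumes a yearly failure probability of 1/10 (from the 10-year expected lifetime) and compares `pgeom` with the closed form; the variable names are illustrative.

```r
p_fail <- 1 / 10   # assumed yearly failure probability (expected lifetime of 10 years)
years  <- 8

# pgeom(q, prob) is P(at most q failure-free years before the first breakdown),
# so "no breakdown in the first 8 years" is the upper tail beyond q = 7.
p_r      <- pgeom(years - 1, prob = p_fail, lower.tail = FALSE)
p_closed <- (1 - p_fail)^years   # closed form: survive 8 independent years

c(pgeom = p_r, closed_form = p_closed)
```

Both entries should agree, which confirms that the upper tail starts at `years - 1` rather than at `years`.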

b. What is the probability that the machine will fail after 8 years? Model as an exponential assuming continuous state space. Provide also the expected value and standard deviation of the distribution.

# Probability Statement: P(X > 8), X = time to failure in yrs, X ~ Exponential(rate = lambda)
# Expected lifetime = 1/lambda = 10 yrs, so lambda = 0.1
lambda=1/10
E_exp=1/lambda
SD_exp=1/lambda

paste("P(X > 8) =",round(pexp(8, rate=lambda, lower.tail = FALSE),4))

## [1] "P(X > 8) = 0.4493"

paste("Expected Value,E(X) =",E_exp)

## [1] "Expected Value,E(X) = 10"

paste("Std Dev,SD(X) =",SD_exp)

## [1] "Std Dev,SD(X) = 10"
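
A simulation gives an independent check on the exponential answer and on the claim that the mean and standard deviation are both 1/rate. The sketch assumes a rate of 1/10 per year; the seed and sample size are arbitrary.

```r
set.seed(42)                             # arbitrary seed, for reproducibility only
rate <- 1 / 10                           # assumed rate: 1 / (expected lifetime of 10 years)
lifetimes <- rexp(100000, rate = rate)   # simulated lifetimes in years

sim_summary <- c(
  p_survive_8 = mean(lifetimes > 8),     # compare with exp(-rate * 8)
  mean_life   = mean(lifetimes),         # compare with 1 / rate
  sd_life     = sd(lifetimes)            # compare with 1 / rate
)
round(sim_summary, 3)
```

The simulated survival share should sit near `exp(-0.8)`, and the simulated mean and standard deviation should both sit near 10.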

c. What is the probability that the machine will fail after 8 years? Model as a binomial. (Hint: 0 success in 8 years) Provide also the expected value and standard deviation of the distribution.

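# Probability Statement:
# P(X=0 successes in 8 yrs), X ~ Binomial(n=8, p=0.1), where a "success" is a machine failure in a given year
n=8
p=0.1

E_bin=n*p
SD_bin=sqrt(n*p*(1-p))   # standard deviation is the square root of the variance n*p*(1-p)

paste("P(X=0 success in 8 yrs) =",round(dbinom(0,n,p),4))

## [1] "P(X=0 success in 8 yrs) = 0.4305"

paste("Expected Value,E(X) =",E_bin)

## [1] "Expected Value,E(X) = 0.8"

paste("Std Dev,SD(X) =",round(SD_bin,4))

## [1] "Std Dev,SD(X) = 0.8485"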

d. What is the probability that the machine will fail after 8 years? Model as a Poisson. Re-define the discrete state space if necessary. (Hint: Don’t forget to use lambda(t) rather than just lambda.) Provide also the expected value and standard deviation of the distribution.

# Expected lifetime = 10 yrs, so the failure rate is lambda = 1/10 per yr
# Over 8 yrs the expected number of failures is lambda*t = 0.8
# Probability Statement: P(X=0 failures in 8 yrs), X ~ Poisson(lambda*t)
lambda=1/10
yrs=8
lambda_t=lambda*yrs
E_pois=lambda_t
SD_pois=sqrt(lambda_t)

paste("P(X=0 failures in 8 yrs) =",round(dpois(0, lambda_t),4))

## [1] "P(X=0 failures in 8 yrs) = 0.4493"

paste("Expected Value,E(X) =",E_pois)

## [1] "Expected Value,E(X) = 0.8"

paste("Std Dev,SD(X) =",round(SD_pois,4))

## [1] "Std Dev,SD(X) = 0.8944"
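
The hint about lambda(t) reflects the link between the Poisson and exponential models: zero events in a window of length t is the same event as the first waiting time exceeding t. A minimal sketch, assuming a rate of 1/10 failures per year (the variable names are illustrative):

```r
rate_per_year <- 1 / 10               # assumed failure rate (expected lifetime of 10 years)
t_years       <- 8
lambda_t      <- rate_per_year * t_years   # expected failures over the 8-year window

p_pois <- dpois(0, lambda = lambda_t)                             # P(no failures in 8 years)
p_exp  <- pexp(t_years, rate = rate_per_year, lower.tail = FALSE) # P(first failure after 8 years)

c(poisson = p_pois, exponential = p_exp)
```

The two numbers should be identical, both equal to `exp(-0.8)`.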

e. What is the probability that the machine will have its 2nd failure exactly on the 9th year? Model as a Negative Binomial. Provide also the expected value and standard deviation of the distribution.

It would be given as : P(machine will have its 2nd failure exactly in the 9th year) = [Number of possible sequences] * P(Single sequence) = choose(n-1,k-1) * (1-p)^(n-k) * p^k. The 9th year must itself be a failure, so only the first of the two failures is free to fall in any of the first 8 years.

# Expected lifetime = 10 yrs; p=1/10=0.1 (probability that the machine fails in any given year)
p=0.1
k=2
n=9
Prob_2nd_fail=choose(n-1,k-1)*(1-p)^(n-k)*p^k
E_nbin=k/p
SD_nbin=sqrt(k*(1-p))/p

paste("P(X=9) =",round(Prob_2nd_fail,4))

## [1] "P(X=9) = 0.0383"

paste("Expected Value,E(X) =",E_nbin)

## [1] "Expected Value,E(X) = 20"

paste("Std Dev,SD(X) =",round(SD_nbin,4))

## [1] "Std Dev,SD(X) = 13.4164"
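
The counting formula for the negative binomial can be compared with R's built-in `dnbinom`, which counts the failure-free years before the k-th failure rather than the total number of years. A sketch assuming a yearly failure probability of 1/10:

```r
p_fail <- 1 / 10   # assumed yearly failure probability (expected lifetime of 10 years)
k <- 2             # which failure we are waiting for
n <- 9             # year in which that failure occurs

# Manual count: the k-th failure is fixed in year n, and the other k - 1
# failures can fall anywhere in the first n - 1 years.
p_manual <- choose(n - 1, k - 1) * p_fail^k * (1 - p_fail)^(n - k)

# dnbinom(x, size, prob): x = number of failure-free years before the size-th failure
p_builtin <- dnbinom(n - k, size = k, prob = p_fail)

c(manual = p_manual, dnbinom = p_builtin)
```

Agreement between the two values confirms the `choose(n - 1, k - 1)` coefficient; using `choose(n, k)` instead would give the binomial probability of 2 failures anywhere in 9 years, which is a different event.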

Question 3.

In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file challenger.csv. Load the data into R or Python and answer the following questions. Include all R code.

The variables in the data set are defined as follows:

launch: this numbers the temperature-sorted observations from 1 to 23.
temp: temperature in degrees Fahrenheit at the time of launch
incident: If there was an incident with an O-Ring, then it is coded “Yes.”
o_ring_probs: counts the number of O-ring partial failures experienced on the flight.

a. What are the levels of measurement of these variables? Provide appropriate descriptive statistics and graphs for the variable o_ring_probs. Interpret. Provide measures of center, spread, shape, position, and two plots that are appropriate for the level of measurement. Discuss.

## reading the file

setwd("D:/Boston College/MS AE Courses/Data Analysis")
challenger<-read.csv("midterm_dataset_challenger.csv",header=TRUE)

summary(challenger)

##      launch          temp       incident  o_ring_probs   
##  Min.   : 1.0   Min.   :53.60   No :16   Min.   :0.0000  
##  1st Qu.: 6.5   1st Qu.:66.20   Yes: 7   1st Qu.:0.0000  
##  Median :12.0   Median :69.80            Median :0.0000  
##  Mean   :12.0   Mean   :69.02            Mean   :0.4348  
##  3rd Qu.:17.5   3rd Qu.:74.30            3rd Qu.:1.0000  
##  Max.   :23.0   Max.   :80.60            Max.   :3.0000

str(challenger)

## 'data.frame':    23 obs. of  4 variables:
##  $ launch      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ temp        : num  53.6 57.2 57.2 62.6 66.2 66.2 66.2 66.2 66.2 68 ...
##  $ incident    : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 1 1 1 1 1 ...
##  $ o_ring_probs: int  3 1 1 1 0 0 0 0 0 0 ...

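INTERPRETATION: launch is ordinal (a rank by temperature), temp is interval (degrees Fahrenheit have no true zero), incident is nominal, and o_ring_probs is a ratio-level count. For o_ring_probs the first quartile and the median are both 0, while the mean is 0.4348 and the maximum is 3, so the distribution is right-skewed: most flights had no O-ring problems (16 of the 23 launches had no incident) and a small number of flights account for all of them.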
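
The measures of center, spread, shape and position requested in part a can be collected with a small helper. This is a sketch: `describe_counts` is a hypothetical function name, the `toy` vector is made-up data used only to show the call, and the skewness is the simple moment-based version.

```r
# Hypothetical helper: basic descriptive statistics for a count variable.
describe_counts <- function(x) {
  m <- mean(x)
  c(mean     = m,                                   # center
    median   = median(x),                           # center
    sd       = sd(x),                               # spread
    iqr      = IQR(x),                              # spread
    min      = min(x),                              # position
    max      = max(x),                              # position
    skewness = mean((x - m)^3) / mean((x - m)^2)^1.5)  # shape (moment-based)
}

toy <- c(0, 0, 0, 1, 3)   # made-up values, NOT the Challenger data
stats <- describe_counts(toy)
round(stats, 3)

# With the real data loaded as `challenger`, one would call for example:
# describe_counts(challenger$o_ring_probs)
# barplot(table(challenger$o_ring_probs), xlab = "O-ring problems", ylab = "Launches")
# boxplot(challenger$o_ring_probs, horizontal = TRUE, xlab = "O-ring problems")
```

A bar chart of the counts and a boxplot are reasonable choices for a small discrete count variable; a positive skewness value corresponds to the long right tail.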

b. The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident). Why might this have been a concern?

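# Side-by-side boxplots of launch temperature by incident status
boxplot(temp~incident,data=challenger,xlab='O-ring incident',ylab='Launch temperature (degrees F)')

# 36 degrees F is far below the coldest earlier launch (53.6 degrees F), and the four
# coldest launches in the data (53.6 to 62.6 degrees F) all had an O-ring incident,
# so the Challenger launch was well outside the range of temperatures with any safe experience.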

c. In the already temperature-sorted dataset, find on which observation the first successful launch occurred (one with no incident). Test the hypothesis that the first failure would come on or after this observation. Use alpha = .10.

c<-challenger[order(challenger$temp),]
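# Define Ho: the first failure would come on or after the i-th observation,
# where i is the first launch with no incident (i = 5 in the str() output above)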
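
One way to carry out the test in part c is with a geometric model. The sketch below rests on several assumptions that should be checked against the course notes: the event being waited for is a launch with no incident, its probability is estimated as 16/23 from the summary output, and the first such launch is observation 5 as shown by `str()`.

```r
n_launch      <- 23    # launches in the data (from summary output)
n_no_incident <- 16    # launches with incident == "No" (from summary output)
first_ok      <- 5     # first no-incident observation (from the str() output)
alpha         <- 0.10

p_ok     <- n_no_incident / n_launch   # assumed per-launch probability of no incident
n_before <- first_ok - 1               # incident launches that had to come first

# P(first no-incident launch occurs on or after observation 5)
# = P(at least 4 incident launches before it) = (1 - p_ok)^4
p_value <- pgeom(n_before - 1, prob = p_ok, lower.tail = FALSE)

c(p_value = p_value, reject_H0 = p_value < alpha)
```

Under these assumptions the probability is below 0.10, so the hypothesis would be rejected at alpha = .10: four incident launches in a row at the cold end is unlikely if incidents were unrelated to the temperature ordering.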

d. How many incidents occurred above 65 degrees F? _____ Test the hypothesis that you would see this many or fewer failures given a fixed population of 23 launches. Use alpha = .10.

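# Define Ho: 3 or fewer incidents among the launches above 65 degrees F,
# drawn from a fixed population of 23 launches
# Count the launches above 65 degrees F that had an incident
sum(challenger$temp>65 & challenger$incident=='Yes')

## [1] 3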
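
With a fixed population of 23 launches, the test in part d can be sketched as a hypergeometric probability. The counts below are read off the output shown earlier: 7 incidents in total, of which the 4 coldest launches (53.6 to 62.6 degrees F) account for 4, leaving 19 launches and 3 incidents above 65 degrees F. Treating the warm launches as a random draw from the 23 is the modelling assumption.

```r
n_launch   <- 23   # population size (from summary output)
n_incident <- 7    # incidents in the population (from summary output)
n_warm     <- 19   # launches above 65 degrees F: 23 minus the 4 coldest
x_obs      <- 3    # incidents above 65 degrees F: 7 minus the 4 cold ones
alpha      <- 0.10

# P(X <= 3 incidents among 19 launches drawn from 7 incident and 16 no-incident launches)
p_value <- phyper(x_obs, m = n_incident, n = n_launch - n_incident, k = n_warm)

c(p_value = p_value, reject_H0 = p_value < alpha)
```

Under this model the probability of seeing 3 or fewer incidents among the warm launches is well below 0.10, which points to incidents being concentrated at low temperatures rather than spread at random.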

e. Provide a 90% confidence interval for incidents.

# Incidents occurred on 7 of the 23 launches (see the summary above),
# so the interval is for the proportion of launches with an incident, 7/23
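
A 90% interval for the incident proportion can be sketched in two ways: a normal-approximation (Wald) interval and the exact Clopper-Pearson interval from `binom.test`. Reading "incidents" as the proportion of launches with an incident (7 of 23) is an assumption; an interval for the mean of o_ring_probs would be computed differently.

```r
x <- 7     # launches with an incident (from summary output)
n <- 23    # total launches
conf <- 0.90

p_hat <- x / n
z     <- qnorm(1 - (1 - conf) / 2)          # two-sided critical value, about 1.645

# Wald interval: p_hat +/- z * standard error
wald  <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

# Exact (Clopper-Pearson) interval
exact <- binom.test(x, n, conf.level = conf)$conf.int

round(rbind(wald = wald, exact = as.numeric(exact)), 3)
```

With only 23 launches the normal approximation is rough, so the exact interval is probably the safer one to report.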

Question 4.

I recently conducted some animal research where I was investigating survival of swine based on what drug was given to them. Data are shown below.

survived<-c(7,5,12)
died<-c(0,2,2)
totals<-c(7,7,14)
swine_data<-data.frame(cbind(survived,died,totals))
row.names(swine_data)=c('drug1','drug2','total')

swine_data

##       survived died totals
## drug1        7    0      7
## drug2        5    2      7
## total       12    2     14

a. Let A represent the drug provided {A1=drug 1, A2=drug 2}. Let B represent the pig’s survival. {B1=survived, B2=died}. For each cell, calculate the joint probability. In other words, calculate P(A1B1), P(A1B2), P(A2B1), P(A2B2). Place these probabilities in the following table. (Don’t panic here. This is as easy as you think it is.)

Defining the cases:

P(A1B1) = P(pig was administered drug1 and survived)
P(A2B1) = P(pig was administered drug2 and survived)
P(A1B2) = P(pig was administered drug1 and died)
P(A2B2) = P(pig was administered drug2 and died)

Joint probabilities using R.

swine_data_j_prob=swine_data/swine_data$totals[3]
swine_data_j_prob

##        survived      died totals
## drug1 0.5000000 0.0000000    0.5
## drug2 0.3571429 0.1428571    0.5
## total 0.8571429 0.1428571    1.0

b. For each row and column, calculate the marginal probability. In other words, calculate the four marginal probabilities,{ P(A1), P(A2), P(B1), P(B2) }. Place them in the next table with the results from part a. (Just remember how we calculated marginal probability in class.)

# Row totals give P(A1), P(A2); column totals give P(B1), P(B2)
swine_data_m_prob=data.frame('marginal_prob'=rbind(swine_data_j_prob[1,3],swine_data_j_prob[2,3],swine_data_j_prob[3,1],swine_data_j_prob[3,2]))
row.names(swine_data_m_prob)=c('drug1','drug2','survived','died')
swine_data_m_prob

##          marginal_prob
## drug1        0.5000000
## drug2        0.5000000
## survived     0.8571429
## died         0.1428571

c. Independence of events means that P(AiBj) = P(Ai) x P(Bj) for all values of i and j. For true independence of events, the joint (cell) probabilities should equal the appropriate marginal probabilities multiplied by each other. In other words, you should be able to multiply the row and column marginal probabilities to obtain the cell probability. If this is not the case, then the events are not (by definition) independent from a non-inferential point of view. Demonstrate that survival and drug choice are not independent solely based on the definition of independence. In other words, investigate if P(AiBj) = P(Ai) x P(Bj) for all values of i and j.

# Expected joint probabilities under independence: P(Ai) x P(Bj)
p_drug=swine_data_j_prob[1:2,'totals']                              # P(A1), P(A2)
p_outcome=unlist(swine_data_j_prob['total',c('survived','died')])   # P(B1), P(B2)
swine_data_test_indep=data.frame(outer(p_drug,p_outcome))
colnames(swine_data_test_indep)=c('survived','died')
row.names(swine_data_test_indep)=c('drug1','drug2')

# Compare each joint probability with the product of its marginals
# (all.equal allows for floating-point rounding, unlike ==)
res=matrix(nrow=2,ncol=2)
for (i in c(1,2)) {
    for (j in c(1,2)) {
        res[i,j]=isTRUE(all.equal(swine_data_j_prob[i,j],swine_data_test_indep[i,j]))
    }
}
res

##       [,1]  [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE

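Thus the products of the marginal probabilities do not equal the joint probabilities in any cell; for example, P(A1B1) = 0.5 while P(A1) x P(B1) = 0.5 x 0.857 = 0.429. Hence, drug choice and survival are not independent by definition.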
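
The same independence check can be written without a loop by taking the outer product of the marginals. This sketch rebuilds the 2 x 2 table from the counts given in the question; the object names are illustrative.

```r
# Observed counts from the question: rows = drug, columns = outcome
counts <- matrix(c(7, 5, 0, 2), nrow = 2,
                 dimnames = list(c("drug1", "drug2"), c("survived", "died")))

joint    <- counts / sum(counts)                       # P(Ai and Bj)
expected <- outer(rowSums(joint), colSums(joint))      # P(Ai) x P(Bj) under independence

round(joint - expected, 4)                             # non-zero cells show the departure
independent <- isTRUE(all.equal(joint, expected))
independent
```

Any non-zero difference means the definition of independence fails for that cell. Whether the departure is more than chance variation in a sample of 14 pigs is a separate, inferential question.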

d. Assume that there are 2 deaths and 12 survivors in the given population. What is the probability that Drug 1 would result in 0 deaths if the results of the study were truly random? Model as a hypergeometric. (FYI-this is often called a Fisher’s Exact test, which is just a hypergeometric.) Do you think there sufficient evidence to suggest different outcomes based on drug selection?

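# P(X=0 deaths among the 7 pigs on drug 1), X ~ Hypergeometric(2 deaths, 12 survivors, 7 drawn)
phyper(0,2,12,7)

## [1] 0.2307692

With a probability of about 0.23, zero deaths on drug 1 would not be unusual if the deaths were assigned at random, so there is not sufficient evidence to suggest different outcomes based on drug selection.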
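
Since the question notes that this calculation is a Fisher's Exact test, the hypergeometric probability can be compared with R's `fisher.test`. A sketch using the counts from the table; the one-sided alternative is chosen here so that it matches the single hypergeometric tail.

```r
# Observed counts from the question: rows = drug, columns = outcome
counts <- matrix(c(7, 5, 0, 2), nrow = 2,
                 dimnames = list(c("drug1", "drug2"), c("survived", "died")))

# Hypergeometric: 0 deaths among the 7 drug-1 pigs, given 2 deaths and 12 survivors overall
p_hyper <- dhyper(0, m = 2, n = 12, k = 7)

# Fisher's exact test: one-sided (drug 1 does better) and the default two-sided version
p_one_sided <- fisher.test(counts, alternative = "greater")$p.value
p_two_sided <- fisher.test(counts)$p.value

round(c(hypergeometric = p_hyper, one_sided = p_one_sided, two_sided = p_two_sided), 4)
```

The one-sided Fisher p-value should match the hypergeometric probability, and the two-sided value is larger still, so neither version suggests a drug effect at conventional significance levels.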

Question 5.

The following graph represents GDP growth for the US and the Euro area.

a. Identify the problems associated with this graph.

This graph does not allow visual comparison, especially in years when the US numbers are lower than those of Europe.

b. Generate your own graph that portrays the data in an improved way.

us<-c(3.7,4.5,4.2,4.5,3.7,0.8,1.6,2.7,4.2,3.5,2.9)
euro<-c(1.5,2.6,2.9,2.8,3.7,1.8,0.9,0.7,1.9,1.0,1.8)
yr<-c(1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006)
gdp<-data.frame(cbind(yr,us,euro))
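# Line chart: a shared y-range keeps both series fully visible, and a legend identifies them
plot(gdp$yr,gdp$us,type='o',xlab='Year',ylab='GDP growth',main='The Transatlantic Gulf',col='red',ylim=range(gdp$us,gdp$euro))
lines(gdp$yr,gdp$euro,type='o',col='blue')
legend('topright',legend=c('US','Euro'),col=c('red','blue'),lty=1,pch=1)

# Grouped bar chart: rbind() gives one row per region so that beside=TRUE pairs the bars by year
barplot(rbind(gdp$us,gdp$euro),names.arg=gdp$yr,cex.names=0.8,axisnames = TRUE, main='GDP growth',ylab='GDP growth',xlab='Year',legend.text=c('US','Euro'),beside=TRUE,col=c('lightblue','lightgreen'))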

Question 6.

The distribution of the average IQ score is known to closely follow a Gaussian distribution with a mean centered directly at 100 and a population standard deviation of 16 (Stanford-Binet). A single person is randomly selected for jury duty.

a. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.

# Probability statement: P(IQ>=110) given that mean=100 sd=16

# lower.tail = FALSE indicates we are looking at the right side tail.
pnorm(110, mean = 100, sd = 16, lower.tail = FALSE)

## [1] 0.2659855

Now, a jury is seated that has 12 individuals on it. The mean IQ of the jury is 110.

b. What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with mean=100 and sd=16? Be sure to write the probability statement and show your R code.

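# Probability statement: P(Xbar>=110) given that mean=100 and standard error=16/sqrt(12)
# The mean of n=12 independent draws has standard deviation sd/sqrt(n), not sd.
round(pnorm(110, mean = 100, sd = 16/sqrt(12), lower.tail = FALSE),4)

## [1] 0.0152

The sampling distribution of the mean of 12 IQ scores is much narrower than the distribution of a single score (standard error 16/sqrt(12), about 4.62, versus 16), so a jury mean of 110 or more is far less likely than a single person scoring 110 or more (about 0.266 in part a).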
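
The role of the standard error in part b can be checked by simulating many 12-person juries. The sketch assumes IQ scores are independent draws from a normal distribution with mean 100 and standard deviation 16; the seed and the number of simulated juries are arbitrary.

```r
set.seed(2024)                     # arbitrary seed, for reproducibility only
mu <- 100; sigma <- 16; n_jury <- 12
n_sim <- 50000                     # illustrative number of simulated juries

# Exact answer: the jury mean is normal with standard error sigma / sqrt(n)
se      <- sigma / sqrt(n_jury)
p_exact <- pnorm(110, mean = mu, sd = se, lower.tail = FALSE)

# Simulation: one row per jury, one column per juror
juries     <- matrix(rnorm(n_sim * n_jury, mean = mu, sd = sigma), ncol = n_jury)
jury_means <- rowMeans(juries)
p_sim      <- mean(jury_means >= 110)

round(c(std_error = se, exact = p_exact, simulated = p_sim, sd_of_means = sd(jury_means)), 4)
```

The standard deviation of the simulated jury means should be close to 16/sqrt(12), and the simulated tail probability close to the exact one, illustrating why the answer for a jury mean is so much smaller than for a single person.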