Question 1

The sensitivity and specificity of the polygraph has been a subject of study and debate for years. A 2001 study of the use of polygraph for screening purposes suggested that the probability of detecting an actual liar was .59 (sensitivity) and that the probability of detecting an actual “truth teller” was .90 (specificity). We estimate that about 20% of individuals selected for the screening polygraph will lie.

a. What is the probability that an individual is actually a liar given that the polygraph detected him/her as such? Solve using a Bayesian equation.

Let L be a liar. Let + be detection of a liar. We want $P(L|+)=\frac{P(+|L)P(L)}{P(+|L)P(L)+P(+|L')P(L')}$

p_ld=.59*.2/(.59*.2+.1*.8)
p_ld  #.5959596

## [1] 0.5959596

b. What is the probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph? Be sure to write the probability statement.

$P(L \cup +)=P(L)+P(+)-P(L \cap +)$

require(formattable)

## Loading required package: formattable

## Warning: package 'formattable' was built under R version 3.5.2

N=1000
liars=1000*.2
nonliars=1000*.8
posliars=liars*.59
negliars=liars*.41
negtruth=nonliars*.9
postruth=nonliars*.1

mym=matrix(c(posliars,negliars,liars,postruth,negtruth,nonliars,posliars+postruth,negliars+negtruth,N), 3)
colnames(mym)=c("+","-","Total")
rownames(mym)=c("Liars", "Truthers", "Total")
formattable(as.data.frame(mym))

			Total
Liars	118	80	198
Truthers	82	720	802
Total	200	800	1000

#so P(L or +)=P(L)+P(+)-P(L+) =

 (mym[1,3]+mym[3,1]-mym[1,1])/N  #.28

## [1] 0.28

Question 2

Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years. (Include the probability statements and R / Python Code for each part.)

a. What is the probability that the machine will fail after 8 years? Model as a geometric.

(Hint: there are at least 7 failures before the first success.) Provide also the expected value and standard deviation.

$P(X \ge7 | \pi)$

We convert the time to minutes for all components.

#Convert to minutes for N, pi, t
N=10*365.25*24*60
pi=1/N
t=8*365.25*24*60

#Geometric. 

1-pgeom(t,pi)  #P(X>t | pi)=1-P(X<=t | pi)

## [1] 0.4493288

#0.4493288
(1-pi)/pi #EX

## [1] 5259599

(1-pi)/pi^2  #SD

## [1] 2.766339e+13

b. What is the probability that the machine will fail after 8 years? Model as an exponential assuming continuous state space.

Provide also the expected value and standard deviation of the distribution.

$P(X \ge7 | \lambda=\pi)$

#Exponential.  P(X>=t | lambda = pi)=1-P(X<=t | pi)
1-pexp(t,pi)

## [1] 0.449329

#0.449329
1/pi #EX

## [1] 5259600

1/pi #SD

## [1] 5259600

c. What is the probability that the machine will fail after 8 years? Model as a binomial.

(Hint: 0 success in 8 years) Provide also the expected value and standard deviation of the distribution.

$P(X=0 | N=t, \pi= \pi)$

#Binomial. 
dbinom(0,t,pi)  #P(X=0 | N=t, pi = pi)

## [1] 0.4493289

# 0.4493289
t*pi #EX

## [1] 0.8

sqrt(t*pi*(1-pi))#SD

## [1] 0.8944271

d. What is the probability that the machine will fail after 8 years? Model as a Poisson. Re-define the discrete state space if necessary.

(Hint: Don’t forget to use $\lambda x t$ ???t rather than just $\lambda$. ???????????????Provide also the expected value and standard deviation of the distribution.

$P(X =0 | \lambda t)$

#Poisson.  P(X=0 | lambda=pi, t)
dpois(0,pi*t)

## [1] 0.449329

#0.449329
pi*t #EX

## [1] 0.8

sqrt(pi*t) #SD

## [1] 0.8944272

e. What is the probability that the machine will have its 2d failure exactly on the 9th year?

Model as a Negative Binomial. Provide also the expected value and standard deviation of the distribution.

$P(t=2 | N=9)$

#P(t=2 | N =9)=(N-1)CHOOSE(t-1) x 1/10^2 * 9/10^7
t=2
N=9
pi=1/10
choose(N-1,t-1)*pi^2*(1-pi)^7

## [1] 0.03826375

(t-1)/pi #EX

## [1] 10

(t-1)*(1-pi)/pi^2 #SD

## [1] 90

Question 3.

In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of O-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file challenger.csv. Load the data into R or Python and answer the following questions. Include all R code.

a. What are the levels of measurement of these variables?

Launch=ordinal (positional only, non-standard interval) Temp=interval(no zero) Incident=nominal O-ring-problems=ratio (known zero)

Provide appropriate descriptive statistics and graphs for the variable o_ring_probs. Interpret. Provide measures of center, spread, shape, position, and two appropriate plots that are appropriate for the level of measurement. Discuss.

mydata=read.csv("C:/Users/lfult/OneDrive - Texas State University/Courses & Syllabi/Boston College/Data Analysis/challenger.csv")

require(psych)

## Loading required package: psych

## Warning: package 'psych' was built under R version 3.5.1

formattable(describe(mydata$o_ring_probs))

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
X1	1	23	0.4347826	0.7877521	0	0.2631579	0	0	3	3	1.805727	2.688725	0.1642577

par(mfrow=c(2,1))

require(ggplot2)

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.5.1

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes(x=o_ring_probs))+
  geom_bar(stat="count",color="black", fill="white")+
  ggtitle("Bar Plot of # of O-Ring Problems")+
  xlab("") #because the variable is integer, a bar plot works rather than a histogram.

ggplot(mydata, aes(y=o_ring_probs, x=""))+
  geom_boxplot()+
   ggtitle("Boxplot of # of O-Ring Problems")+
  xlab("")+
  ylab("Count of O-Ring Failures")

stem(mydata$o_ring_probs)

## 
##   The decimal point is at the |
## 
##   0 | 0000000000000000
##   1 | 00000
##   2 | 0
##   3 | 0

You should have described the data (e.g., o-ring data are skewed to the right, positively). They appear to be from a Poisson distribution (discrete), with the minimum = median and the mean pulled to .435 by the outliers.

b. The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident). Why might this have been a concern?

par(mfrow=c(1,1))

ggplot(mydata, aes(y=temp, x=incident, fill=incident))+geom_boxplot()+ggtitle("Boxplots of Temperature vs. Incident")+xlab("")+ylab("Temp")

Without statistical testing, you can still see the median temperature of those launches with incidents was below that of those with no incidents. In fact, the 75% of the “Yes” group is the median of the “No” group. We might look at this again.

c. In the already temperature-sorted dataset, find on which observation the first successful launch occurred (one with no incident).

This occurred on launch 5.

which(mydata$incident=="No")[1]

## [1] 5

Test the hypothesis that the first failure would come on or after this observation. Use alpha = .10.

firstsuccess=5
numfailuresinrow=4
N=length(mydata$incident) #number of trials
noincidents=length(mydata$incident[mydata$incident=="No"])
pi=noincidents/N

#Ho: t<=4,  Ha:  t>4, alpha = .1
#Geometric (alternate form)
(1-16/23)^4   #0.008579872

## [1] 0.008579872

#Ho: X=0 | N, pi, Ha X>0 | N, pi
#Binomial:  P(X>=t) = 1-P(X<=t-1) = P(Y=0 | 4, pi=.695)
dbinom(0,4,pi) #0.008579872

## [1] 0.008579872

#Reject null in both cases. It looks like the ordered observations give us evidence of the effect of temperature.

d. How many incidents occurred above 65 degrees F? _____

Three incidents by inspection and R

length(mydata$incident[mydata$incident=="Yes" & mydata$temp>65])

## [1] 3

Test the hypothesis that you would see this many or fewer failures given a fixed population of 23 launches. Use alpha = .10.

#Ho:  X<=3, Ha: X>3, alpha = .10, hypergeometric, phyper(3,7,16,19)
phyper(3,7,16,19)  #[1] 0.003952569

## [1] 0.003952569

#p<alpha, reject.  It is highly unlikely that you would three or fewer incidences.  More evidence..

e. Provide a 90% confidence interval for incidents.

#Using qbinom
lower=qbinom(.05,23,7/23)
upper=qbinom(.95,23,7/23)
lower

## [1] 4

upper

## [1] 11

#approximation
se=sqrt(23*7/23*(1-7/23))

lower2=7+qnorm(.05)*se
upper2=7+qnorm(.95)*se
lower2

## [1] 3.370286

upper2

## [1] 10.62971

Question 4.

I recently conducted some animal research where I was investigating survival of swine based on what drug was given to them. Data are shown below.

mymatrix=matrix(c(7,5,12,0,2,2, 7, 7, 14), nrow=3)
colnames(mymatrix)=c("Survived", "Died", "Total")
rownames(mymatrix)=c("Drug1", "Drug2", "Total")
mymatrix

##       Survived Died Total
## Drug1        7    0     7
## Drug2        5    2     7
## Total       12    2    14

a. Let A represent the drug provided {A1=drug 1, A2=drug 2}. Let B represent the pig’s survival. {B1=survived, B2=died}. For each cell, calculate the joint probability. In other words, calculate P(A1B1), P(A1B2), P(A2B1), P(A2B2). Place these probabilities in the following table. (Don’t panic here. This is as easy as you think it is.)

mat=mymatrix/14
formattable(as.data.frame(mat))

	Survived	Died	Total
Drug1	0.5000000	0.0000000	0.5
Drug2	0.3571429	0.1428571	0.5
Total	0.8571429	0.1428571	1.0

b. For each row and column, calculate the marginal probability. In other words, calculate the four marginal probabilities,{ P(A1), P(A2), P(B1), P(B2) }. Place them in the next table with the results from part a. (Just remember how we calculated marginal probability in class.)

P_S= mymatrix[3,1]/14
P_D=mymatrix[3,2]/14
P_D1=mymatrix[1,3]/14
P_D2=mymatrix[2,3]/14
P_S #.8571429

## [1] 0.8571429

P_D #.1428571

## [1] 0.1428571

P_D1 #.5

## [1] 0.5

P_D2 #.5

## [1] 0.5

c. Independence of events means that P(Ai , Bj) = P(Ai) x P(Bj) for all values of i and j. For true independence of events, the joint (cell) probabilities should equal the appropriate marginal probabilities multiplied by each other. In other words, you should be able to multiply the row and column marginal probabilities to obtain the cell probability. If this is not the case, then the events are not (by definition) independent from a non-inferential point of view. Demonstrate that survival and drug choice are not independent solely based on the definition of independence. In other words, investigate if P(Ai,Bj) = P(Ai) x P(Bj) for all values of i and j.

a11=7*12/14^2
a12=7*2/14^2
a21=7*12/14^2
a22=7*2/14^2
mat2=matrix(c(a11,a21,a12,a22),nrow=2)
colnames(mat2)=c("Survived", "Died")
rownames(mat2)=c("Drug1", "Drug2")
formattable(as.data.frame(mat2))

	Survived	Died
Drug1	0.4285714	0.07142857
Drug2	0.4285714	0.07142857

#mat and mat2 do not match.  No independence

d. Assume that there are 2 deaths and 12 survivors in the given population. What is the probability that Drug 1 would result in 0 deaths if the results of the study were truly random? Model as a hypergeometric. (FYI-this is often called a Fisher’s Exact test, which is just a hypergeometric.) Do you think there sufficient evidence to suggest different outcomes based on drug selection?

$P(X=0 | S = 2, F = 12, n = 7)$

#P(X=0 | S=2, F=12, n=7) = 
phyper(0,2,12,7)

## [1] 0.2307692

#.231
#Insufficient evidence.

Question 5.

The following graph represents GDP growth for the US and the Euro area.

a. Identify the problems associated with this graph.

Wow. 3D bar when it should be 2D line (time series), chart junk (flags on bars..? really?), meaningless colors, warped perspective, made to look like US has a dominating GDP growth regardless of year…Look at the simple graph below. You can see this is not true. In 2000, there was a tie.in 2001, the EU Euro area performed better.

b. Generate your own graph that portrays the data in an improved way.

Euro=c(1.5,2.6,2.9,2.8,3.7,1.8,0.9,0.7,1.9,1.0,1.8)
US=c(3.7,4.5,4.2,4.5,3.7,0.8,1.6,2.7,4.2,3.5,2.9)
Year=seq(1996,2006) 
plot(US~Year,type="l", xlab="Year", ylab="%GDP Growth", main="% GDP Growth, US vs. EU Euro Area",col="red", ylim=c(0,5),bty="n")
lines(Euro~Year, type="l", col="blue")
legend("topright",legend=c("US","EU Euro Area"), text.col=c("red","blue"),bty="n")

Question 6.

The distribution of the average IQ score is known to closely follow a Gaussian distribution with a mean centered directly at 100 and a population standard deviation of 16 (Stanford-Benet). A single person is randomly selected for jury duty.

a. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.

$P(X 110 | = 100, =16) $

#P(X>=110 | Mu = 100, s=16)
1-pnorm(110,100,16)

## [1] 0.2659855

#0.2659855

b. What is the probability that the mean IQ of the 12-person jury would be 110 or above if..

drawn from a normal population with $\bar{x} =100$ and $\sigma=16$? Be sure to write the probability statement and show your R code.

$P({X}) 110 | = 100, =) $

#P(Xbar>=110 | mu = 100, se=16/sqrt(12))
1-pnorm(110,100,16/sqrt(12))

## [1] 0.01519141

#0.01519141

Midterm, Data Analysis