The sensitivity and specificity of the polygraph have been a subject of study and debate for years. A 2001 study of the use of the polygraph for screening purposes suggested that the probability of detecting an actual liar was .59 (sensitivity) and that the probability of correctly identifying an actual “truth teller” was .90 (specificity). We estimate that about 20% of individuals selected for the screening polygraph will lie.
Let L be the event that the subject is a liar and + the event that the polygraph flags a liar. We want \(P(L|+)=\frac{P(+|L)P(L)}{P(+|L)P(L)+P(+|L')P(L')}\)
p_ld=.59*.2/(.59*.2+.1*.8) #P(L|+) by Bayes' theorem
p_ld #.5959596
## [1] 0.5959596
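As a sanity check, a short simulation (a sketch; the sample size and seed are arbitrary choices) should land near the Bayes result of about .596:
set.seed(1) #for reproducibility
n=1e6
liar=rbinom(n, 1, .2) #1 = liar, with P(L) = .2
pos=rbinom(n, 1, ifelse(liar==1, .59, .10)) #flagged with sensitivity .59 or false-positive rate .10
mean(liar[pos==1]) #empirical P(L|+), should be close to .596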
\(P(L \cup +)=P(L)+P(+)-P(L \cap +)\)
require(formattable)
## Loading required package: formattable
N=1000
liars=1000*.2
nonliars=1000*.8
posliars=liars*.59
negliars=liars*.41
negtruth=nonliars*.9
postruth=nonliars*.1
mym=matrix(c(posliars,postruth,posliars+postruth,negliars,negtruth,negliars+negtruth,liars,nonliars,N), 3) #filled by column: +, -, Total
colnames(mym)=c("+","-","Total")
rownames(mym)=c("Liars", "Truthers", "Total")
formattable(as.data.frame(mym))
| | + | - | Total |
|---|---|---|---|
| Liars | 118 | 82 | 200 |
| Truthers | 80 | 720 | 800 |
| Total | 198 | 802 | 1000 |
#so P(L or +) = P(L) + P(+) - P(L and +) =
(mym[1,3]+mym[3,1]-mym[1,1])/N #.28
## [1] 0.28
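Equivalently, working directly from the table: \(P(L \cup +)=\frac{200}{1000}+\frac{198}{1000}-\frac{118}{1000}=.28\)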
Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years. (Include the probability statements and R / Python Code for each part.)
Model as a geometric. (Hint: there are at least 7 failures before the first success.) Provide also the expected value and standard deviation of the distribution.
\(P(X > t | \pi)\), with \(t\) equal to 8 years expressed in minutes.
We convert the time to minutes for all components.
#Convert to minutes for N, pi, t
N=10*365.25*24*60
pi=1/N
t=8*365.25*24*60
#Geometric.
1-pgeom(t,pi) #P(X>t | pi)=1-P(X<=t | pi)
## [1] 0.4493288
#0.4493288
(1-pi)/pi #EX
## [1] 5259599
sqrt(1-pi)/pi #SD
## [1] 5259600
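Since pgeom() counts failures before the first success, this survival probability also has the closed form \((1-\pi)^{t+1}\), a handy check on the call above (reusing the t and pi already defined):
(1-pi)^(t+1) #closed-form P(X>t), should match 1-pgeom(t,pi) = 0.4493288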
Model as an exponential. Provide also the expected value and standard deviation of the distribution.
\(P(X \ge t | \lambda=\pi)\)
#Exponential. P(X>=t | lambda = pi)=1-P(X<=t | pi)
1-pexp(t,pi)
## [1] 0.449329
#0.449329
1/pi #EX
## [1] 5259600
1/pi #SD
## [1] 5259600
Model as a binomial. (Hint: 0 successes in 8 years.) Provide also the expected value and standard deviation of the distribution.
\(P(X=0 | N=t, \pi)\)
#Binomial.
dbinom(0,t,pi) #P(X=0 | N=t, pi = pi)
## [1] 0.4493289
# 0.4493289
t*pi #EX
## [1] 0.8
sqrt(t*pi*(1-pi))#SD
## [1] 0.8944271
Model as a Poisson. (Hint: Don’t forget to use \(\lambda t\) rather than just \(\lambda\).) Provide also the expected value and standard deviation of the distribution.
\(P(X =0 | \lambda t)\)
#Poisson. P(X=0 | lambda=pi, t)
dpois(0,pi*t)
## [1] 0.449329
#0.449329
pi*t #EX
## [1] 0.8
sqrt(pi*t) #SD
## [1] 0.8944272
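All four models are asking essentially the same question, and because \(\pi\) is tiny they all converge to \(e^{-\lambda t}=e^{-0.8}\). A compact side-by-side check (a sketch reusing N, pi, and t as defined above):
#Compare the four survival probabilities; all should be approximately exp(-0.8) = 0.4493
c(geometric=1-pgeom(t,pi), exponential=1-pexp(t,pi), binomial=dbinom(0,t,pi), poisson=dpois(0,pi*t))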
Model as a Negative Binomial. Provide also the expected value and standard deviation of the distribution.
\(P(N=9 | t=2, \pi=.1)\), the probability that the second failure occurs on the ninth trial.
#P(N=9 | t=2) = choose(N-1, t-1) * (1/10)^2 * (9/10)^7
t=2
N=9
pi=1/10
choose(N-1,t-1)*pi^2*(1-pi)^7
## [1] 0.03826375
t/pi #EX, expected trial of the second failure
## [1] 20
sqrt(t*(1-pi))/pi #SD
## [1] 13.41641
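The same value comes from R's built-in dnbinom(), which counts failures before the t-th event, so the second failure on the ninth trial corresponds to N - t = 7 non-failure trials (a quick check under that parameterization):
dnbinom(N-t, t, pi) #should match the closed form above, 0.03826375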
In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of O-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file challenger.csv. Load the data into R or Python and answer the following questions. Include all R code.
- Launch = ordinal (positional only, non-standard interval)
- Temp = interval (no true zero)
- Incident = nominal
- o_ring_probs = ratio (known zero)
Provide appropriate descriptive statistics and graphs for the variable o_ring_probs. Interpret. Provide measures of center, spread, shape, position, and two appropriate plots that are appropriate for the level of measurement. Discuss.
mydata=read.csv("C:/Users/lfult/OneDrive - Texas State University/Courses & Syllabi/Boston College/Data Analysis/challenger.csv")
require(psych)
## Loading required package: psych
formattable(describe(mydata$o_ring_probs))
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 23 | 0.4347826 | 0.7877521 | 0 | 0.2631579 | 0 | 0 | 3 | 3 | 1.805727 | 2.688725 | 0.1642577 |
par(mfrow=c(2,1)) #note: par() does not affect ggplot2 plots, so the two plots below print separately
require(ggplot2)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x=o_ring_probs))+
geom_bar(stat="count",color="black", fill="white")+
ggtitle("Bar Plot of # of O-Ring Problems")+
xlab("") #because the variable is integer, a bar plot works rather than a histogram.
ggplot(mydata, aes(y=o_ring_probs, x=""))+
geom_boxplot()+
ggtitle("Boxplot of # of O-Ring Problems")+
xlab("")+
ylab("Count of O-Ring Failures")
stem(mydata$o_ring_probs)
##
## The decimal point is at the |
##
## 0 | 0000000000000000
## 1 | 00000
## 2 | 0
## 3 | 0
You should have described the data: the o-ring counts are skewed to the right (positively). They appear to come from a Poisson-like (discrete) distribution, with the minimum equal to the median and the mean pulled up to .435 by the outliers.
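One informal way to back the Poisson impression (a sketch, not a formal goodness-of-fit test) is to compare the observed counts against a Poisson with \(\lambda=\bar{x}\approx.435\) over the 23 launches:
lambda=mean(mydata$o_ring_probs) #about .435
observed=table(factor(mydata$o_ring_probs, levels=0:3)) #16, 5, 1, 1 from the stem plot
expected=round(23*dpois(0:3, lambda), 1) #Poisson expectation for counts 0-3
rbind(observed, expected)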
par(mfrow=c(1,1))
ggplot(mydata, aes(y=temp, x=incident, fill=incident))+geom_boxplot()+ggtitle("Boxplots of Temperature vs. Incident")+xlab("")+ylab("Temp")
Without statistical testing, you can still see that the median temperature of launches with incidents was below that of launches without incidents. In fact, the 75th percentile of the “Yes” group sits at the median of the “No” group. We might look at this again.
The first incident-free launch occurred on launch 5.
which(mydata$incident=="No")[1]
## [1] 5
Test the hypothesis that the first failure would come on or after this observation. Use alpha = .10.
firstsuccess=5
numfailuresinrow=4
N=length(mydata$incident) #number of trials
noincidents=length(mydata$incident[mydata$incident=="No"])
pi=noincidents/N
#Ho: t<=4, Ha: t>4, alpha = .1
#Geometric (alternate form)
(1-pi)^numfailuresinrow #(7/23)^4 = 0.008579872
## [1] 0.008579872
#Ho: X=0 | N, pi, Ha X>0 | N, pi
#Binomial: P(X>=t) = 1-P(X<=t-1) = P(Y=0 | 4, pi=.695)
dbinom(0,4,pi) #0.008579872
## [1] 0.008579872
#Reject null in both cases. It looks like the ordered observations give us evidence of the effect of temperature.
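The geometric result can also be pulled straight from pgeom() (a sketch; a “success” here is a no-incident launch, so four incidents to start means at least four failures before the first success):
1-pgeom(3, pi) #P(X >= 4 failures before first success) = (7/23)^4 = 0.008579872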
There were three incidents at temperatures above 65 degrees, by inspection and by R:
length(mydata$incident[mydata$incident=="Yes" & mydata$temp>65])
## [1] 3
Test the hypothesis that you would see this many or fewer failures given a fixed population of 23 launches. Use alpha = .10.
#Ho: X<=3, Ha: X>3, alpha = .10, hypergeometric, phyper(3,7,16,19)
phyper(3,7,16,19) #[1] 0.003952569
## [1] 0.003952569
#p < alpha, reject. It is highly unlikely that you would see three or fewer incidents among the warm launches by chance. More evidence of a temperature effect.
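As a check on phyper(), note that with only 16 incident-free launches available, at least 3 of the 7 incident launches must land among the 19 warm launches, so the lower tail reduces to the single k = 3 term:
choose(7,3)*choose(16,16)/choose(23,19) #the only possible term: 0.003952569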
#Using qbinom
lower=qbinom(.05,23,7/23)
upper=qbinom(.95,23,7/23)
lower
## [1] 4
upper
## [1] 11
#normal approximation to the binomial
se=sqrt(23*7/23*(1-7/23))
lower2=7+qnorm(.05)*se
upper2=7+qnorm(.95)*se
lower2
## [1] 3.370286
upper2
## [1] 10.62971
I recently conducted some animal research investigating the survival of swine as a function of the drug administered. Data are shown below.
mymatrix=matrix(c(7,5,12,0,2,2, 7, 7, 14), nrow=3)
colnames(mymatrix)=c("Survived", "Died", "Total")
rownames(mymatrix)=c("Drug1", "Drug2", "Total")
mymatrix
## Survived Died Total
## Drug1 7 0 7
## Drug2 5 2 7
## Total 12 2 14
mat=mymatrix/14
formattable(as.data.frame(mat))
| Survived | Died | Total | |
|---|---|---|---|
| Drug1 | 0.5000000 | 0.0000000 | 0.5 |
| Drug2 | 0.3571429 | 0.1428571 | 0.5 |
| Total | 0.8571429 | 0.1428571 | 1.0 |
P_S= mymatrix[3,1]/14
P_D=mymatrix[3,2]/14
P_D1=mymatrix[1,3]/14
P_D2=mymatrix[2,3]/14
P_S #.8571429
## [1] 0.8571429
P_D #.1428571
## [1] 0.1428571
P_D1 #.5
## [1] 0.5
P_D2 #.5
## [1] 0.5
a11=7*12/14^2
a12=7*2/14^2
a21=7*12/14^2
a22=7*2/14^2
mat2=matrix(c(a11,a21,a12,a22),nrow=2)
colnames(mat2)=c("Survived", "Died")
rownames(mat2)=c("Drug1", "Drug2")
formattable(as.data.frame(mat2))
| Survived | Died | |
|---|---|---|
| Drug1 | 0.4285714 | 0.07142857 |
| Drug2 | 0.4285714 | 0.07142857 |
#mat and mat2 do not match, so the joint probabilities are not the product of the marginals: no independence
\(P(X=0 | S = 2, F = 12, n = 7)\)
#P(X=0 | S=2, F=12, n=7) =
phyper(0,2,12,7)
## [1] 0.2307692
#.231
#Insufficient evidence of a drug effect (p = .231).
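This matches the one-sided p-value of Fisher's exact test, which R can confirm directly (a sketch; alternative="greater" tests whether Drug1's survival odds exceed Drug2's):
drugtab=matrix(c(7,0,5,2), nrow=2, byrow=TRUE, dimnames=list(c("Drug1","Drug2"), c("Survived","Died")))
fisher.test(drugtab, alternative="greater")$p.value #0.2307692, matching phyper(0,2,12,7)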
The following graph represents GDP growth for the US and the Euro area.
Wow: a 3D bar chart where a 2D line (time series) is called for, chart junk (flags on the bars…really?), meaningless colors, and a warped perspective, all made to look like the US has dominating GDP growth regardless of year. Look at the simple graph below: you can see this is not true. In 2000 there was a tie; in 2001 the Euro area performed better.
Euro=c(1.5,2.6,2.9,2.8,3.7,1.8,0.9,0.7,1.9,1.0,1.8)
US=c(3.7,4.5,4.2,4.5,3.7,0.8,1.6,2.7,4.2,3.5,2.9)
Year=seq(1996,2006)
plot(US~Year,type="l", xlab="Year", ylab="%GDP Growth", main="% GDP Growth, US vs. EU Euro Area",col="red", ylim=c(0,5),bty="n")
lines(Euro~Year, type="l", col="blue")
legend("topright",legend=c("US","EU Euro Area"), text.col=c("red","blue"),bty="n")
The distribution of IQ scores is known to closely follow a Gaussian distribution with a mean of 100 and a population standard deviation of 16 (Stanford-Binet). A single person is randomly selected for jury duty.
\(P(X \ge 110 | \mu = 100, \sigma=16)\)
#P(X>=110 | Mu = 100, s=16)
1-pnorm(110,100,16)
## [1] 0.2659855
#0.2659855
What is the probability that a random sample of 12 jurors has an average IQ of 110 or higher, drawn from a normal population with \(\mu =100\) and \(\sigma=16\)? Be sure to write the probability statement and show your R code.
\(P(\bar{X} \ge 110 | \mu = 100, \sigma_{\bar{X}}=16/\sqrt{12})\)
#P(Xbar>=110 | mu = 100, se=16/sqrt(12))
1-pnorm(110,100,16/sqrt(12))
## [1] 0.01519141
#0.01519141
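Equivalently, standardizing with the standard error \(16/\sqrt{12}\) gives the same tail probability:
z=(110-100)/(16/sqrt(12)) #z is about 2.165
pnorm(z, lower.tail=FALSE) #0.01519141, matching the direct computation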