Akula-DATA605-Week06-HW6

#libraries used
if (!require('knitr')) install.packages('knitr')
if (!require('tidyverse')) install.packages('tidyverse')
if (!require('VennDiagram')) install.packages('VennDiagram')
if (!require('gridExtra')) install.packages('gridExtra')

Solution:

Number of red marbles $R$ = 54

Number of white marbles $W$ = 9

Number of blue marbles $B$ = 75

Total marbles = $R$ + $W$ + $B$ = $54 + 9 + 75$ = $138$

Probability of selecting red marble $P(R)$ = $\frac{54}{138}$ = $0.3913$

Probability of selecting white marble $P(W)$ = $\frac{9}{138}$ = $0.0652$

Probability of selecting blue marble $P(B)$ = $\frac{75}{138}$ = $0.5435$

Probability of selecting red or blue $P(R or B)$ = $P(R) + P(B)$ = $\frac{54}{138}$ + $\frac{75}{138}$ = $0.9348$

Solution:

Number of green golf balls $G$ = 19

Number of red golf balls $R$ = 20

Number of blue golf balls $B$ = 24

Number of yellow golf balls $Y$ = 17

Total golf balls = $G$ + $R$ + $B$ + $Y$ = $19 + 20 + 24 + 17$ = $80$

Probability of getting red golf ball $P(R)$ = $\frac{20}{80}$ = $0.25$

Solution:

Calculate percentages,

#create matrix
customer.data<- matrix(c(81,228, 116,79, 215,252, 130,97, 129,72),ncol=2,byrow=TRUE)
colnames(customer.data)<- c("Males","Females")
rownames(customer.data)<- c('Apartment', 'Dorm', 'With Parent(s)', 'Sorority/Fraternity House', 'Other')

#bind column and row totals
customer.data<- cbind(customer.data, Total = rowSums(customer.data))
customer.data<- rbind(customer.data, Total = colSums(customer.data))

#calculate percentages
customer.data.percentages<- round((customer.data/1399),4)
kable(customer.data.percentages, format="pandoc", align="l", caption = "Gender and Residence of Customers by Percentages")

Gender and Residence of Customers by Percentages
	Males	Females	Total
Apartment	0.0579	0.1630	0.2209
Dorm	0.0829	0.0565	0.1394
With Parent(s)	0.1537	0.1801	0.3338
Sorority/Fraternity House	0.0929	0.0693	0.1623
Other	0.0922	0.0515	0.1437
Total	0.4796	0.5204	1.0000

Probability that a customer is not male can be calculated both ways,

Probability customer is female $P(C_{gender = female})$, percentage of female customers

$P(C_{gender = female}) = 0.5204$

Probability customer is not male $P(C_{gender \ne male}) = 1 - P(C_{gender = male})$

$P(C_{gender \ne male}) = 1 - 0.4796 = 0.5204$

Since gender is complement event, not male equals female, $P(C_{gender \ne male}) = P(C_{gender = female}) = 0.5204$

Probability customer does not live with parents $P(C_{residence \ne parents})$ = $P(C_{residence=apartment}) + P(C_{residence=dorm}) + P(C_{residence=sorority/fraternity}) + P(C_{residence=other})$ = $1 - P(C_{residence = parents})$ = $1 - 0.3338$ = $0.6662$

Probability customer is not male or does not live with parents = $P(C_{gender\ =\ female}) + P(C_{residence \ne parents}) - (P(C_{female,apartment}) + P(C_{female,dorm}) + P(C_{female,sorority/fraternity}) + P(C_{female,other}))$

Reason we have substract $P(C_{female,apartment}) + P(C_{female,dorm}) + P(C_{female,sorority/fraternity}) + P(C_{female,other})$ is because values are counted twice. Once in $P(C_{gender=female})$ and $P(C_{Residence \ne Parents})$, contains both males and females.

$P(C_{gender = female}) + P(C_{Residence \ne Parents}) - (P(C_{female,apartment}) + P(C_{female,dorm}) + P(C_{female,sorority/fraternity}) + P(C_{female,other}))$

= $0.5204 + 0.6662 - (0.163 + 0.0565 + 0.0693 + 0.0515)$

= $0.8463$

Therefore, Probability customer is not a male or does not live with parents = $0.8463$

Solution:

By definition, two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other.

Going to gym and exercising may result in losing weight. On the other hand, a person can lose weight without going to a gym and just controlling calorie intake. NY Times article explains it, https://www.nytimes.com/2015/06/16/upshot/to-lose-weight-eating-less-is-far-more-important-than-exercising-more.html

In other words, there is some correlation between going to the gym and losing weight. However, it does not demonstrate a casual relation between weight loss and going to the gym.

Hence both events are independent.

Solution:

Number of vegetables $V$ = 8

Number of condiments $C$ = 7

Variety of tortillas available $T$ = 3

To make a veggie wrap we need $W = T + 3V + 3C$

Combination of 3 vegetables from 8 available choices = $C(8,3)$ = $8_{C_3}$ = $\frac{8!}{3!(8-3)!}$ = $\frac{8*7*6*5!}{(3*2*1)(5!)}$ = $56$

Combination of 3 condiments from 7 available choices = $C(7,3)$ = $7_{C_3}$ = $\frac{7!}{3!(7-3)!}$ = $\frac{7*6*5*4!}{(3*2*1)(4!)}$ = $35$

Veggie wrap $W$ = (Combination of 3 vegetables from 8 available choices) X (Combination of 3 condiments from 7 possible options) X (Choice of tortilla)

$W$ = $56$ * $35$ * $3$

$W$ = $5880$

Therefore, we can make a total of $5880$ veggie wraps.

Using R functions,

#vegetables
#available choices 8, can be used to make wrap 3
#combination
v<- choose(8,3)

#condiments
#available choices 7, can be used to make wrap 3
#combination
c<- choose(7,3)

#choice of tortilla = 3
t<- 3
#number of veggie wraps
w<- v*c*t

#vegetables
veg<- c('V1','V2','V3','V4','V5','V6','V7','V8')
veg.combination<- apply(t(combn(veg, 3)), 1, paste, collapse="+")
veg.combination<- data.frame(V=veg.combination,stringsAsFactors = F)

#condiments
con<- c('C1','C2','C3','C4','C5','C6','C7')
con.combination<- apply(t(combn(con, 3)), 1, paste, collapse="+")
con.combination<- data.frame(C=con.combination,stringsAsFactors = F)

#tortilla
tor.combination<- data.frame(T = c('T1','T2','T3'))

wrap<- tor.combination %>% 
  expand(tor.combination, V = veg.combination$V, C = con.combination$C) %>%
  select(`T`,V,C)

randomRows<- sample(1:length(wrap[,1]), 20, replace=T)
wrap %>% slice(randomRows) %>%
  kable (col.names = c("Tortilla Variety", "Vegetables Combination", "Condiments Combination"), caption = 'City Subs Menu')

City Subs Menu
Tortilla Variety	Vegetables Combination	Condiments Combination
T1	V2+V3+V4	C4+C5+C7
T1	V1+V4+V5	C4+C6+C7
T2	V3+V5+V8	C2+C3+C7
T3	V3+V5+V6	C2+C3+C5
T3	V1+V5+V6	C2+C5+C7
T1	V3+V4+V6	C5+C6+C7
T3	V4+V7+V8	C3+C4+C6
T3	V2+V5+V6	C3+C4+C7
T2	V1+V7+V8	C3+C4+C6
T3	V2+V6+V8	C2+C3+C5
T1	V3+V5+V8	C1+C2+C6
T3	V3+V5+V8	C2+C3+C6
T3	V1+V4+V8	C2+C5+C6
T3	V1+V6+V7	C2+C4+C6
T3	V4+V7+V8	C3+C4+C5
T1	V3+V4+V8	C1+C2+C3
T2	V1+V5+V6	C1+C3+C5
T3	V1+V2+V5	C1+C2+C3
T1	V1+V3+V6	C2+C4+C5
T3	V1+V3+V5	C1+C4+C6

Total veggie wraps = $5880$

Solution:

Jeff and Liz are two different people. It is not possible to analyze

Liz watches the evening news because Jeff runs out of gas on the way to work.
Jeff runs out of gas on the way to work because Liz watches the evening news.

As there is no causation or correlation between two events, they are independent.

Solution:

Cabinet positions to be filled $CP$ = 8

Available eligible candidates $A$ = 14

Combination of 8 candidates from 14 available choices = $C(14,8)$ = $14_{C_8}$ = $\frac{14!}{8!(14-8)!}$ = $\frac{14*13*12*11*10*9*8!}{(8!)(6!)}$ = $3003$

Total number of different ways members of the cabinet can be appointed = $3003$

Solution:

Number of red jellybeans $R$ = 9

Number of orange jellybeans $O$ = 4

Number of green jellybeans $G$ = 9

Total jellybeans = $R$ + $O$ + $G$ = $9 + 4 + 9$ = $22$

The event of withdrawing four jellybeans is dependent event. In other words, selecting orange or green jellybean first effect second outcome. The event is also referred to as choosing without replacement. Which means once orange jellybean is selected, it is not put back into the bag while selecting the second jellybean.

Probability of selecting green jellybeans on first draw $P(G_1)$ = $\frac{9}{22}$ = $0.4091$

After the first draw, we are left with 21 jellybeans and eight green jellybeans, if the first draw was green jellybean

Probability of selecting green jellybeans on second draw $P(G_2)$ = $\frac{8}{21}$ = $0.381$

On the third draw, we are left with 20 jellybeans and seven green jellybeans, if first and second draws resulted in green jellybean

Probability of selecting green jellybeans on third draw $P(G_3)$ = $\frac{7}{20}$ = $0.35$

On fourth draw, we are left with 19 jellybeans,

Probability of selecting orange jellybeans $P(O)$ = $\frac{4}{19}$ = $0.2105$

Since, this probability can happen four different ways $\{OGGG, GOGG, GGOG, GGGO\}$, multiply the resulting probability with 4.

Therefore, probability of selecting one orange and three green jellybeans = $4 * P(G_1) * P(G_2) * P(G_3) * P(O)$ = $4 * \frac{9}{22} * \frac{8}{21} * \frac{7}{20} * \frac{4}{19}$ = $0.0459$

Using combinations method,

Probability that event occurs, $P(E) = \frac{size\ of\ event\ space}{size\ of\ sample\ space}$

There are 3 ways to select green jellybeans = $9_{C_3}$ = $\frac{9!}{3!(9-3)!}$ = $84$

One way to select orange jellybean = $4_{C_1}$ = $\frac{4!}{1!(4-1)!}$ = $4$

Event space = $9_{C_3} * 4_{C_1}$ = $336$

Different ways to select 4 jellybeans, Sample space = $22_{C_4}$ = $7315$

Therefore, probability of selecting one orange and three green jellybeans = $\frac{\bigg(9_{C_3}\bigg) * \bigg(4_{C_1}\bigg)}{\bigg(22_{C_4}\bigg)}$ = $0.0459$

Using simulation in R

jellybeans.bag<- c(rep('R',9),rep('O',4),rep('G',9))

#sample size
n<- 100000

#vector holds samples
output<- rep(0, n)

# select four jellybeans without replacement
for(i in 1:n){
  sampleout = sample(jellybeans.bag, 4, replace=F)
  if (length(sampleout[sampleout=='O'])==1 & length(sampleout[sampleout=='G'])==3)
  {
    output[i]<- 1
  }
}

#check probability
sum(output)/n

## [1] 0.0447

Probability of selecting one orange and three green jellybeans: $0.0447$. Probability is very close.

Solution:

$\frac{11!}{7!}$ = $\frac{11*10*9*8*7!}{(7!)}$ = $7920$

Using R factorial function

factorial(11)/factorial(7)

## [1] 7920

Solution:

Percentage of subscribers over age 34 $S_{age > 34}$= 67% = $\frac{67}{100}$ = 0.67

Complement would be group of subscribers age 34 or under = $S_{age \le 34}$

Also, sum of non-complement and complement should be equal to 1

Hence, $S_{age > 34}$ + $S_{age \le 34}$ = 1

= $0.67 + S_{age \le 34} = 1$

= $S_{age \le 34} = 0.33$

Therefore, complement of subscribers over age 34 = $S_{age \le 34} = 0.33$

Solution:

Outcomes of each coin toss = 2, if we assume that each coin toss is equally likely to come up heads or tails, then for four tosses there would be $16(2 * 2 * 2 * 2)$ outcomes.

Out of 4 coin tosses, we can get 3 heads in any combination. Total combinations we get heads = $4_{C_3}$ = $\frac{4!}{3!(4-3)!}$ = $4$

Probability of getting three heads in four tosses = $\frac {\bigg(4_{C_3}\bigg)}{outcomes\ of\ four\ tosses}$

= $\frac {4}{16}$ = $0.25$

There is $25 \%$ or one in four chances of winning $\$97$ and $75 \%$ or three in four chances of losing $\$30$.

Step 2. If 559 games are played, each win makes $\$97$ and for each loss, lose $\$30$.

Using R functions

options(scipen = 10, digits = 4)
coin.toss<- c(rep('H',4),rep('T',4))

#sample size
n<- 559

#vector holds samples
output<- rep(-1, n)

# select four jellybeans without replacement
for(i in 1:n){
  sampleout = sample(coin.toss, 4, replace=F)
  if (length(sampleout[sampleout=='T'])==1 & length(sampleout[sampleout=='H'])==3)
  {
    output[i]<- 1
  }
}

#check wins and loss
prob.wins<- output[output==1]
prob.wins<- length(prob.wins)

prob.loss<- output[output==-1]
prob.loss<- length(prob.loss)

#for success, if you get 3 heads from 4 tosses win 
success<- prob.wins * 97
loss<- prob.loss * 30

Probability of getting exactly 3 heads from 4 tosses in 559 plays: $19.5 \%$.

Amount won: $\$10573$. Amount lost: $-\$13500$

Solution:

Outcomes of each coin toss = 2, if we assume that each coin toss is equally likely to come up heads or tails, then for nine tosses there would be $512(2^9)$ outcomes.

Out of 9 coin tosses, we can get four or less tails in any combination. Total combinations we get tails = $9_{C_4}$ + $9_{C_3}$ + $9_{C_2}$ + $9_{C_1}$ = $\frac{9!}{4!(9-4)!} + \frac{9!}{3!(9-3)!} + \frac{9!}{2!(9-2)!} + \frac{9!}{1!(9-1)!}$ = $126 + 84 + 36 + 9$ = $255$

Probability of getting four or less tails in 9 tosses = $\frac {\bigg(9_{C_4}\bigg) + \bigg(9_{C_3}\bigg) + \bigg(9_{C_2}\bigg) + \bigg(9_{C_1}\bigg)}{outcomes\ of\ nine\ tosses}$

= $\frac {255}{512}$ = $0.498$

There is $49.8 \%$ or little less than one in 2 chances of winning $\$23$ and $50.2 \%$ or little more than one in two chances of losing $\$26$.

Step 2. If 994 games are played, each win makes $\$23$ and for each loss, lose $\$26$.

Using R functions

options(scipen = 10, digits = 4)
coin.toss<- c(rep('H',9),rep('T',9))

#sample size
n<- 994

#vector holds samples
output<- rep(-1, n)

# select four jellybeans without replacement
for(i in 1:n){
  sampleout = sample(coin.toss, 9, replace=F)
  if (length(sampleout[sampleout=='T']) < 5)
  {
    output[i]<- 1
  }
}

#check wins and loss
prob.wins<- output[output==1]
prob.wins<- length(prob.wins)

prob.loss<- output[output==-1]
prob.loss<- length(prob.loss)

#for success, if you get 3 heads from 4 tosses win 
success<- prob.wins * 23
loss<- prob.loss * 26

Probability of getting exactly four or less tails from 9 tosses in 994 plays: $49.09 \%$.

Amount won: $\$11224$. Amount lost: $-\$13156$

Solution:

Based on wikipedia confusion matrix is defined as,

cm<- matrix(NA, nrow=2, ncol=2)

cm<- matrix(c('True Positive(TP)','False Negative(FN)(Type II Error)', 'False Positive(FP)(Type I Error)','True Negative(TN)'), nrow=2, ncol=2)
rownames(cm) <- c('Predict Liar', 'Predict Truth Teller')
kable(cm, col.names = c("Actual Liar", "Actual Truth Teller"), align = "l")

	Actual Liar	Actual Truth Teller
Predict Liar	True Positive(TP)	False Positive(FP)(Type I Error)
Predict Truth Teller	False Negative(FN)(Type II Error)	True Negative(TN)

Given sensitivity = 0.59, specificity = 0.90.

True Positive Rate($TPR$) = $\frac{TP}{TP+FN}$ = $1 - FNR$ = $sensitivity$ = $0.59$, probability that the polygraph will identify actual liar as a liar. In other words polygraph will predict a individual as liar correctly 59% of the time.

False Negative Rate($FNR$) = $\frac{FN}{TP+FN}$ = Type II Error Rate = $1 - sensitivity$ = $1 - 0.59$ = $0.41$, probability that polygraph will identify actual liar as a truth-teller.

True Negative Rate($TNR$) = $\frac{TN}{FP+TN}$ = $1 - FPR$ = $specificity$= $0.90$, probability that the polygraph will identify actual truth-teller as a truth-teller. In other words polygraph will predict a individual as truth-teller correctly 90% of the time.

False Positive Rate($FPR$) = $\frac{FP}{FP+TN}$ = Type I Error Rate = $1 - specificity$ = $1 - 0.90$ = $0.1$, probability that polygraph will identify actual truth-teller as a liar

options(scipen = 10, digits = 8)
#TPR is also known are sensitivity
#probability that the polygraph will identify actual liar as a liar
TPR<- 0.59
#TNR is also known are specificity
#probability that the polygraph will identify actual truth-teller as a truth-teller
TNR<- 0.90
#FNR is also known as Type II Error rate
#probability that polygraph will identify actual liar as a truth-teller 
FNR<- 1 - TPR
#FPR is also known as Type I Error rate
#probability that polygraph will identify actual truth-teller as a liar
FPR<- 1 - TNR

cm<- matrix(c(TPR,FNR, FPR,TNR), nrow=2, ncol=2)
rownames(cm)<- c('Predict Liar', 'Predict Truth Teller')
kable(cm, digits = 2, col.names = c("Actual Liar", "Actual Truth Teller"), align = "l")

	Actual Liar	Actual Truth Teller
Predict Liar	0.59	0.1
Predict Truth Teller	0.41	0.9

For a given sample population 20% of individuals selected for the screening polygraph will lie. Liars = 0.20 and truth-tellers = 0.80.

If we use polygraph on the sample population, it will correctly predict 59% as liars and 41% as truth-tellers of 20% of liars from the sample population.

Liars from sample population = 20%

Polygraph perdiction for liars = $20\% * 59\%$ = $11.8\%$

Polygraph perdiction for truth-tellers(even though they are actual liars), Type II Error = $20\% * (100 - 59)\%$ = $8.2\%$

For truth-tellers, polygraph will predict 90% as truth-tellers and 10% as liars of 80% of truth-tellers from sample population

Truth-tellers from sample population = 80%

Polygraph perdiction for truth-tellers = $80\% * 90\%$ = $72\%$

Polygraph perdiction for liars(even though they are actual truth-tellers), Type I Error = $80\% * (100 - 90)\%$ = $8\%$

options(scipen = 10, digits = 8)
#sample population size
n<-1.00
#population split
actual.liars<- n * 0.20
actual.truth.tellers<- n * 0.80

#polygraph prediction from actuals
#true positives
predict.liars<- actual.liars * 0.59
#type II error
predict.truth.tellers.II<- actual.liars * (1 - 0.59)

#true negatives
predict.truth.tellers<- actual.truth.tellers * 0.90
#type I error
predict.liars.I<- actual.truth.tellers * (1 - 0.90)

#generate matrix
poly.data<- matrix(c(predict.liars,predict.truth.tellers.II, predict.liars.I,predict.truth.tellers), nrow=2, ncol=2)

#add column and row totals
poly.data<- cbind(poly.data, Total = rowSums(poly.data))
poly.data<- rbind(poly.data, Total = round(colSums(poly.data),4))
rownames(poly.data)<- c('Predict Liar', 'Predict Truth Teller','Total')
kable(poly.data, digits = 4, col.names = c("Actual Liar", "Actual Truth Teller","Total"), caption = "Sample Population", align="l")

Sample Population
	Actual Liar	Actual Truth Teller	Total
Predict Liar	0.118	0.08	0.198
Predict Truth Teller	0.082	0.72	0.802
Total	0.200	0.80	1.000

a. What is the probability that an individual is actually a liar given that the polygraph detected him/her as such? (Show me the table or the formulaic solution or both.)

options(scipen = 10, digits = 8)
#venn diagram for Liars
venndiag<- draw.pairwise.venn(area1 = poly.data[1,3], area2 = poly.data[3,1], cross.area = poly.data[1,1], category = c(paste0("Predicted Liars (",poly.data[1,3],")"),paste0("Acutal Liars (",poly.data[3,1],")")), fill = c("red", "blue"),cat.pos = c(0, 180), rotation.degree = 180, scaled = FALSE, ind=FALSE)

#add heading to venn diagram
grid.arrange(gTree(children=venndiag), top="Venn Diagram - Liars")

Using Bayes theorem,

$P(E_1|E) = \frac{P(E_1 \cap E)}{P(E_1 \cap E) + P(E_2 \cap E)}$,

$P(E_1 \cap E)$ represents sample population that actually lied and polygraph predicted as liars.

$P(E_1 \cap E) = 0.118$

$P(E_2 \cap E)$ represents entire sample population that polygraph predicted as liars.

$P(E_2 \cap E) = 0.08$

$P(E_1|E)$ = $P$(individual actually lied given polygraph predicted as liar)

$P(E_1|E)$ = $\frac{0.118}{0.118 + 0.08}$ = $0.596$

Probability that an individual is actually a liar given that the polygraph predicted as liar = $59.6 \%$

b. What is the probability that an individual is actually a truth-teller given that the polygraph detected him/her as such? (Show me the table or the formulaic solution or both.)

options(scipen = 10, digits = 8)

#venn diagram for Truth-tellers
venndiag<- draw.pairwise.venn(area1 = poly.data[2,3], area2 = poly.data[3,2], cross.area = poly.data[2,2], category = c(paste0("Predicted Truth Tellers (",poly.data[2,3],")"),paste0("Acutal Truth Tellers (",poly.data[3,2],")")), fill = c("yellow", "blue"),cat.pos = c(0, 180), rotation.degree = 180, scaled = FALSE, ind=FALSE)

#add heading to venn diagram
grid.arrange(gTree(children=venndiag), top="Venn Diagram - Truth Tellers")

Using Bayes theorem,

$P(E_1|E) = \frac{P(E_1 \cap E)}{P(E_1 \cap E) + P(E_2 \cap E)}$,

$P(E_1 \cap E)$ represents sample population that actually is truth-teller and polygraph predicted as a truth-tellers.

$P(E_1 \cap E) = 0.72$

$P(E_2 \cap E)$ represents entire sample population that polygraph predicted as truth-tellers.

$P(E_2 \cap E) = 0.082$

$P(E_1|E)$ = $P$(individual actually is a truth-teller given polygraph predicted as truth-teller)

$P(E_1|E)$ = $\frac{0.72}{0.082 + 0.72}$ = $0.8978$

Probability that an individual is actually a truth-teller given that the polygraph detected as truth-teller = $89.78 \%$

c. What is the probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph? Be sure to write the probability statement.

Total actual liars $P(A)= 20\% = 0.20$

Total polygraph identified liars $P(I) = 19.8\% = 0.198$

Total actual liars and polygraph identified liars $P(A) \cap P(I) = 11.8\% = 0.118$

The probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph is $P(A\ or\ I) = P(A) + P(I) - P(A) \cap P(I)$. We have subtracted $P(A) \cap P(I)$ as it is counted twice.

$P(A\ or\ I) = 0.20 + 0.198 - 0.118 = 0.28$

Probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph = $28 \%$

References:

Akula-DATA605-Week06-HW6

Pavan Akula

October 4, 2017