Let’s clean the global environment before moving further
rm(list=ls())
cat("\014")
Copies of the ClassXYZ Final Exam have been stolen from Prof. M’s office. She wants to find the culprit. Luckily, she asked the students to fill a survey which turns out to be pretty useful, as she can use it to narrow down the suspects.
Let’s load the the survey data that we will be working on You can learn more about the dataset here: http://www.amstat.org/publications/jse/v20n3/mclaren/shoesize.xls
#Load the dataset
data<-read.csv("~/Downloads/Datasets/classXYZstudents.csv")
head(data)
## ID Gender Size Height Name
## 1 1 F 5.5 60 Nguyen-Huynh, Bianca
## 2 2 F 6.0 60 Baumgartner, Freddie
## 3 3 F 7.0 60 Gomez-Gonzales, Kendra
## 4 4 F 8.0 60 Naranjo, Alexis
## 5 5 F 8.0 60 al-Farooq, Mahfoodha
## 6 6 F 9.0 60 Martinez, Sable
Like any other mystery, here too we are given a set of clues.
There is a footprint on the floor. It is size 11.
The TA heard a voice and is 80% sure that the thief is female
The exams were on a high shelf. Someone shorter than 6 feet would have had a tough time reaching that shelf.
Assume that means that someone taller than 6 feet can reach the shelf. Someone between 5’8 and 6 feet has a 75% probability of finding a way to reach it (heels?), someone shorter than 5’8 cannot.
What is the probability that if you were to randomly select a student from class, the gender of that student would be female?
priors<-table(data$Gender)
priors
##
## F M
## 187 221
prior = priors['F']/(priors['F']+priors['M'])
prior
## F
## 0.4583333
Looking at this we can say that: Without any other information, there is a 45.83% chance that the thief is female P(H) = 0.4583333
Here we have our first evidence, E Let us calculate P(E|H)
p_e1<-table(data[data$Size==11,]$Gender)
p_e1
##
## F M
## 6 58
6 Female and 58 Male in the class have size 11
posterior1 = p_e1['F']/(p_e1['F']+p_e1['M'])
posterior1
## F
## 0.09375
Posterior, given the TA testimony The TA is 80% sure P(E|H) is 0.8 The TA will say female for (P(E)):
p_e2<-(priors['F']*0.8 + priors['M']*0.2)/(priors['M']+priors['F'])
p_e2
## F
## 0.475
In this case, (P(E)=0.475 is the probability that the TA will say that the thief is female. It happens in two ways: The thief is female and the TA was correct (0.8P(female)) The thief is males and the TA made a mistake ((1-0.8)P(male))
So the probability that the thief is female given the TA’s info is? P(H|E) = P(H)*P(E|H)/P(E)
posterior2<-prior*0.8/p_e2
posterior2
## F
## 0.7719298
The probability that a female can reach the shelf P(E|H) is the probability the a female is taller than 6‘ (72 inches), plus 75% times the probability that a female is between 5‘8 and 6‘ (68 and 72 inches)
We need to calculate 2 things here, first what are the chances for any student to reach the shelf and what are the chances for a female student to reach the shelf given the constraints in the clue
data$height_distribution<-as.factor(cut(data$Height,breaks = c(0,67,71,max(data$Height)), labels = c("lt_68","lte_72",'gte_72')))
head(data)
## ID Gender Size Height Name height_distribution
## 1 1 F 5.5 60 Nguyen-Huynh, Bianca lt_68
## 2 2 F 6.0 60 Baumgartner, Freddie lt_68
## 3 3 F 7.0 60 Gomez-Gonzales, Kendra lt_68
## 4 4 F 8.0 60 Naranjo, Alexis lt_68
## 5 5 F 8.0 60 al-Farooq, Mahfoodha lt_68
## 6 6 F 9.0 60 Martinez, Sable lt_68
update2<-table(data$Gender,data$height_distribution)
update2
##
## lt_68 lte_72 gte_72
## F 149 32 6
## M 20 103 98
Things are easier after getting this table
p_f3<-(update2['F','lte_72']*0.75+ update2['F','gte_72']*1+update2['F','lt_68']*0)/priors['F']
p_f3
## F
## 0.1604278
We have calculated, the probability that a female can reach the shelf P(E|H) is %p_f%
Now let’s look at the probability of any student can reach the shelf
p_e3<-nrow(data[(data$Height>=72),]) /nrow(data) + nrow(data[ (data$Height>=68)&(data$Height<72),]) /nrow(data)*0.75
p_e3
## [1] 0.5012255
So the probability that the thief is female given the shelf reachability constrint info is? P(H|E) = P(H)*P(E|H)/P(E)
posterior3<-(prior*p_f3)/(p_e3)
posterior3
## F
## 0.1466993
Therefore, our final posterior probability turns out to be 0.1466993
After Clue 1, P(female|footprint=11) = 0.09375
So if we use this as our prior for clue 2, the new posterior is:
updated_posterior2<-posterior1*0.8/p_e2
updated_posterior2
## F
## 0.1578947
Therefore, we have P(female|TA testimony&footprint=11) = 0.1578947
Using this as our prior for clue 3, the new posterior is:
updated_posterior3<-updated_posterior2*p_f3/p_e3
updated_posterior3
## F
## 0.05053755
Therefore, we have P(female|TA testimony&footprint=11& high shelf) = 0.05053755 = 5%