Discussion 2

Q1

Please explain Bayes Theorem in your own words, and give an example. Less than 10 sentences. Also, write out the formula. Pick up on how to to type equations in R Markdown using Latex terminology here

Bayes Theorem or Bayes Rule is a way for us to update probabilities given new evidence. It has real world applications such as in medicine, or spam filtering. Bayes Theorem helps us calculate conditional probabilities, or the probability of an event given another event has already happened. For example in sales and marketing, the probability that we open sales opportunities, given an event like a marketing form fill has occurred. Bayes helps us relate the conditional probability of these two events - open opportunity given a form fill - to their individual probabilities - probability of a form fill and probability of an open opportunity. Bayes’ rule updates our initial beliefs about the probability of an opportunity occurring by taking into account new evidence, a form fill . The more likely a form fill is to occur if an open opportunity is true, the more the probability of an open opportunity increases after observing a form fill.

The formula is:

\[ P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)} \]

Q2

Bayes’ Theorem can yield surprising results. Take a look at Open Stats textbook in Dropbox folder, Guided Practice 3.43 and attempt to solve this using Bayes’ formula. Interpret the results. Then use the attached code and solve via a tree diagram in R. You will need to change the initial parameters to the appropriate values. Attach your final graph to your submission, and your code.

First, we need to install the necessary packages.

# install.packages("BiocManager")
library(BiocManager)
# BiocManager::install("Rgraphviz")

require("Rgraphviz")
## Loading required package: Rgraphviz
## Loading required package: graph
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: grid
BiocManager::valid()  
## Warning: 22 packages out-of-date; 0 packages too new
## 
## * sessionInfo()
## 
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] Rgraphviz_2.46.0    graph_1.80.0        BiocGenerics_0.48.1
## [4] BiocManager_1.30.22
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.34     R6_2.5.1          fastmap_1.1.1     xfun_0.41        
##  [5] cachem_1.0.8      knitr_1.45        htmltools_0.5.7   rmarkdown_2.25   
##  [9] stats4_4.3.2      lifecycle_1.0.4   cli_3.6.2         sass_0.4.8       
## [13] jquerylib_0.1.4   compiler_4.3.2    rstudioapi_0.15.0 tools_4.3.2      
## [17] evaluate_0.23     bslib_0.6.1       yaml_2.3.8        rlang_1.1.3      
## [21] jsonlite_1.8.8   
## 
## Bioconductor version '3.18'
## 
##   * 22 packages out-of-date
##   * 0 packages too new
## 
## create a valid installation with
## 
##   BiocManager::install(c(
##     "bslib", "callr", "curl", "data.table", "DBI", "dbplyr", "digest",
##     "ggplot2", "pkgbuild", "processx", "psych", "R.oo", "ragg", "remotes",
##     "rmarkdown", "rvest", "sass", "systemfonts", "tidyr", "tidyselect",
##     "writexl", "xfun"
##   ), update = TRUE, ask = FALSE, force = TRUE)
## 
## more details: BiocManager::valid()$too_new, BiocManager::valid()$out_of_date

Then we need to change the three variables to match actual values from the problem, for our probability tree. From these three values, other probabilities will be calculated.

a1 <- .20
a2 <- .35
a3 <- .45

b1Givena1 <- .70
b2Givena2 <- .25
b3Givena3 <- .05

b1GivenNota1 <- .30
b2GivenNota2 <- .75
b3GivenNota3 <- .95

We continue to build out the rest of the probabilities.

notA1          <-    1 - a1
notA1
## [1] 0.8
notb1Givena1    <-    1 - b1Givena1
notb1Givena1
## [1] 0.3
notb1GivenNota1 <-    1 - b1GivenNota1
notb1GivenNota1
## [1] 0.7
notA2           <-    1 - a2
notA2
## [1] 0.65
notb2Givena2    <-    1 - b2Givena2
notb2Givena2
## [1] 0.75
notb2GivenNota2 <-    1 - b2GivenNota2
notb2GivenNota2
## [1] 0.25
notA3           <-    1 - a3    
notA3
## [1] 0.55
notb3Givena3    <-    1 - b3Givena3
notb3Givena3
## [1] 0.95
notb3GivenNota3 <-    1 - b3GivenNota3
notb3GivenNota3
## [1] 0.05

We continue to find our joint probabilities.

#sporting event

a1ANDb1         <-  a1   *   b1Givena1
a1ANDb1
## [1] 0.14
a1ANDnotb1      <-  a1   *   notb1Givena1
a1ANDnotb1
## [1] 0.06
nota1ANDb1      <- notA1 *   b1GivenNota1
nota1ANDb1
## [1] 0.24
nota1ANDnotb1   <- notA1 *   notb1GivenNota1
nota1ANDnotb1
## [1] 0.56
#academic event

a2ANDb2         <-  a2   *   b2Givena2
a2ANDb2
## [1] 0.0875
a2ANDnotb2      <-  a2   *   notb2Givena2
a2ANDnotb2
## [1] 0.2625
nota2ANDb2      <- notA2 *   b2GivenNota2
nota2ANDb2
## [1] 0.4875
nota2ANDnotb2   <- notA2 *   notb2GivenNota2
nota2ANDnotb2
## [1] 0.1625
#no event

a3ANDb3         <-  a3   *   b3Givena3
a3ANDb3
## [1] 0.0225
a3ANDnotb3      <-  a3   *   notb3Givena3
a3ANDnotb3
## [1] 0.4275
nota3ANDb3      <- notA3 *   b3GivenNota3
nota3ANDb3
## [1] 0.5225
nota3ANDnotb3   <- notA3 *   notb3GivenNota3
nota3ANDnotb3
## [1] 0.0275
# Probability of B
b1             <-  a1ANDb1   +   nota1ANDb1
b1
## [1] 0.38
notB1          <-  1 - b1
notB1
## [1] 0.62
b2             <-  a2ANDb2   +   nota2ANDb2
b2
## [1] 0.575
notB2          <-  1 - b2
notB2
## [1] 0.425
b3             <-  a3ANDb3   +   nota3ANDb3
b3
## [1] 0.545
notB3          <-  1 - b3
notB3
## [1] 0.455
# Bayes theorem - probability of A | B
# Prob (a | b) = Prob (a AND b) / prob (b)
a1Givenb1       <-  a1ANDb1 / b1
a2Givenb2       <-  a2ANDb2 / b2
a3Givenb3       <-  a3ANDb3 / b3

From here, we set up our graph or tree diagram. The tree diagram helps us better visualize conditional probabilities, where we start with the probabilities of two distinct given events, that branch out into conditional events (for example, probability of A, and probability of A, given B). From there we can better understand the probability of events either occurring or not, depending on other events happening.

We want to name our nodes in our tree diagram.

node1   <-  "P"
node2   <-  "A1"
node3   <-  "A2"
node4   <-  "A3"
node5   <-  "B1&A1"
node6   <-  "B1&A1'"
node7   <-  "B2&A2"
node8   <-  "B2&A2'"
node9   <-  "B3&A3"
node10  <-  "B3&A3'"

Then we set up a vector, node names, containing each of the 10 nodes and make the connections between nodes.

nodeNames   <-  c(node1, node2, node3, node4, node5, node6, node7, node8, node9, node10)
rEG <- new("graphNEL", nodes=nodeNames, edgemode="directed")
rEG
## A graphNEL graph with directed edges
## Number of Nodes = 10 
## Number of Edges = 0
rEG <- addEdge(nodeNames[1], nodeNames[2], rEG, 1)
rEG <- addEdge(nodeNames[1], nodeNames[3], rEG, 1)
rEG <- addEdge(nodeNames[1], nodeNames[4], rEG, 1)
rEG <- addEdge(nodeNames[2], nodeNames[5], rEG, 1)
rEG <- addEdge(nodeNames[2], nodeNames[6], rEG, 1)
rEG <- addEdge(nodeNames[3], nodeNames[7], rEG, 1)
rEG <- addEdge(nodeNames[3], nodeNames[8], rEG, 1)
rEG <- addEdge(nodeNames[4], nodeNames[9], rEG, 1)
rEG <- addEdge(nodeNames[4], nodeNames[10], rEG, 10)

First we assign values to the different nodes in our tree diagram.

eAttrs <- list()

q<-edgeNames(rEG)

# Add the probability values to the the branch lines

eAttrs$label <- c(toString(a1),toString(a2), toString(a3), toString(b1Givena1),
                toString(b1GivenNota1),toString(b2Givena2), toString(b2GivenNota2), toString(b3Givena3), toString(b3GivenNota3))

names(eAttrs$label) <- c(q[1],q[2], q[3], q[4], q[5], q[6], q[7], q[8], q[9])
edgeAttrs<-eAttrs

Now we will add some color to our tree diagram.

# Set the color, etc, of the tree
attributes  <-  list(node=list(label="foo", fillcolor="lightgreen", fontsize="15"),
                 edge=list(color="red"),graph=list(rankdir="LR"))

# Plot the probability tree using Rgraphvis
plot(rEG, edgeAttrs=eAttrs, attrs=attributes)
nodes(rEG)
##  [1] "P"      "A1"     "A2"     "A3"     "B1&A1"  "B1&A1'" "B2&A2"  "B2&A2'"
##  [9] "B3&A3"  "B3&A3'"
edges(rEG)
## $P
## [1] "A1" "A2" "A3"
## 
## $A1
## [1] "B1&A1"  "B1&A1'"
## 
## $A2
## [1] "B2&A2"  "B2&A2'"
## 
## $A3
## [1] "B3&A3"  "B3&A3'"
## 
## $`B1&A1`
## character(0)
## 
## $`B1&A1'`
## character(0)
## 
## $`B2&A2`
## character(0)
## 
## $`B2&A2'`
## character(0)
## 
## $`B3&A3`
## character(0)
## 
## $`B3&A3'`
## character(0)
#Add the probability values to the leaves
text(500,420,a1ANDb1, cex=.8, adj = 1)
text(500,450,a1ANDnotb1,cex=.8, adj = 1)
text(500,400,nota1ANDb1,cex=.8, adj = 1)
text(500,380,nota1ANDnotb1,cex=.8, adj = 1)

text(500,310,a2ANDb2, cex=.8, adj = 1) 
text(500,280,a2ANDnotb2,cex=.8, adj = 1)
text(500,250,nota2ANDb2,cex=.8, adj = 1)
text(500,220,nota2ANDnotb2,cex=.8, adj = 1)

text(500,170,a3ANDb3, cex=.8, adj = .5)
text(500,130,a3ANDnotb3,cex=.8, adj = .5)
text(500,80,nota3ANDb3,cex=.8, adj = .5)
text(500,50,nota3ANDnotb3,cex=.8, adj = .5)

#Write a table in the lower left of the probablites of A and B
#A1

text(60,50,paste("P(A1):",a1),cex=.9, col="darkgreen")
text(60,20,paste("P(A1'):",notA1),cex=.9, col="darkgreen")

text(120,50,paste("P(B1):",round(b1,digits=2)),cex=.9)
text(120,20,paste("P(B1'):",round(notB1, 2)),cex=.9)

text(90,420,paste("P(A1|B1): ",round(a1Givenb1,digits=2)),cex=.9,col="blue")

#A2
text(180,50,paste("P(A2):",a2),cex=.9, col="darkgreen")
text(180,20,paste("P(A2'):",notA2),cex=.9, col="darkgreen")

text(250,50,paste("P(B2):",round(b2,digits=2)),cex=.9)
text(250,20,paste("P(B2'):",round(notB2, 2)),cex=.9)

text(90,400,paste("P(A1|B1): ",round(a1Givenb1,digits=2)),cex=.9,col="blue")

#A3
text(100,100,paste("P(A3):",a3),cex=.9, col="darkgreen")
text(100,80,paste("P(A3'):",notA3),cex=.9, col="darkgreen")

text(125,120,paste("P(B3):",round(b3,digits=2)),cex=.9)
text(135,140,paste("P(B3'):",round(notB3, 2)),cex=.9)

text(90,380,paste("P(A1|B1): ",round(a1Givenb1,digits=2)),cex=.9,col="blue")