Project 2

knitr::opts_chunk$set(echo = TRUE)

1. Reading in CSV file

The CAGE is a four-item questionnaire for screening for alcohol use disorder. A patient can answer “yes” to zero, one, two, three, or four questions. Mayfield conducted a CAGE study in 1974 with 366 patients, recorded the number of CAGE questions that were answered “Yes” (range was 0 to 4). A psychiatric social worker independently conducted in-depth interviews and each patient was classified by whether he or she had alcohol use disorder or not (AUD: “Y”, “N”). The AUD status was considered the gold standard. The study results are available here: https://raw.githubusercontent.com/taragonmd/data/master/cage.csv

Read in the data set and display a two-way continency table using the xtabs function. The column headings should be the AUD status.

data<- read.table("https://raw.githubusercontent.com/taragonmd/data/master/cage.csv", header = TRUE, sep = ",")
table<-xtabs(~num_yes+AUD, data=data); table

##        AUD
## num_yes   N   Y
##       0 177  14
##       1  22  12
##       2  20  20
##       3   5  43
##       4   0  53

2. Recode integer variable into a factor with two levels

Recode the num_yes variable into a new factor variable called CAGE:

Factor	Level	`num_yes` values
`CAGE`	`pos`	>= 3
	`neg`	<= 2

Use the xtabs function to display 2 by 2 table of CAGE by AUD.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data <- data %>%
  mutate(CAGE = case_when(num_yes>=3 ~ "pos", num_yes<=2~"neg"))
table2<-xtabs(~ CAGE+AUD, data=data); table2

##      AUD
## CAGE    N   Y
##   neg 219  46
##   pos   5  96

3. Calculate the sensitivity and specificity of the CAGE questionnaire

The sensitivity is P(CAGE = pos | AUD = Y), and the specificity is P(CAGE = neg | AUD = N).

table3<-addmargins(table2); table3

##      AUD
## CAGE    N   Y Sum
##   neg 219  46 265
##   pos   5  96 101
##   Sum 224 142 366

sens<-table3[5]/table3[6]
spec<-table3[1]/table3[3]
sens; spec

## [1] 0.6760563

## [1] 0.9776786

4. Posterior probabilities as a function of prior probabilities

You have learned that the posterior (post-test) probability is a function of the prior (pre-test) probability, and sensitivity and specificity of the diagnostic test (in this case, the CAGE questionnaire). For alcohol use disorder (AUD), the positive predictive value (PPV) is the probability that a patient has AUD given that they answered “yes” to 3 or 4 items on the CAGE questionnaire.

Using the sensivity and specificity you calculated above, plot a graph displaying on how PPV changes as a function of the prior probability. Some of the R code is provided, including Bayes’ theorem for calculating PPV.

## plot of PPV vs prior probability (prior)
## use `sens` and `spec` from previous results
prior <- seq(from= 0, to = 1, by=0.05)
ppv <- (sens * prior) / (sens * prior + (1 - spec) *(1 - prior)); ppv
plot(ppv, prior)

5. Bayesian network for probabilistic reasoning

Install the bnlearn package as described here: http://www.bnlearn.com/. The causal model is AUD -> CAGE. Create a Bayesian network (BN) model by fitting the causal model to data (I have done this for you).

Assuming the prior probability of AUD is 5%, use the cpquery function from bnlearn package to calculate the PPV; that is P(AUD = “Y” | CAGE = “pos”).

library(bnlearn)

## 
## Attaching package: 'bnlearn'

## The following object is masked from 'package:stats':
## 
##     sigma

library(Rgraphviz)

## Loading required package: graph

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:bnlearn':
## 
##     path, score

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter,
##     Find, get, grep, grepl, intersect, is.unsorted, lapply, Map,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unsplit, which,
##     which.max, which.min

## 
## Attaching package: 'graph'

## The following objects are masked from 'package:bnlearn':
## 
##     degree, nodes, nodes<-

## Loading required package: grid

curl <- 'https://raw.githubusercontent.com/taragonmd/data/master/cage2.csv'
aud <- read.csv(curl, header = TRUE)
summary(aud)

##  AUD      CAGE    
##  N:224   neg:265  
##  Y:142   pos:101

dag <- empty.graph(nodes=c("AUD", "CAGE")) # Build the Bayesian network
dag <- set.arc(dag, from = "AUD", to = "CAGE")
graphviz.plot(dag, layout = "circo") # Plot the BN

bn <- bn.fit(dag, data = aud)        # Fit the BN to the data

#Monte Carlo simulation

(ppv1 <- cpquery(bn, event = (AUD == "Y"), evidence = (CAGE == "pos"),
n = 10^6))

## [1] 0.9499401

#New prior

bn$AUD$prob

##         N         Y 
## 0.6120219 0.3879781

bn_newprior <- bn 
bn_newprior$AUD <- array(c(.95,.05), dim = dim(bn$AUD$prob),
dimnames = dimnames(bn$AUD$prob)) # assign new probs
bn_newprior$AUD$prob

##    N    Y 
## 0.95 0.05

(ppv2 <- cpquery(bn_newprior, event = (AUD == "Y"), evidence = (CAGE == "pos"), n = 10^6))

## [1] 0.612345