Stat 290: Assignment 3

Due: 11:59pm Feb 26, 2016

Instructions

Follow these instructions paying full attention to details to get proper credit. Failure to do so will cost you points!

R version 3.2.3 should be used for all exercises.
Install packages R6 and jsonlite and download the data for this problem set before running this markdown.
Fill in your answers into the markdown chunks named q_1a, q_2, q_3a, etc. Your answer code chunks will be evaluated using R as an expression, the value of which is your answer. Where you are asked to name the resulting variable, do so.
If you are unable to answer a question, leave the corresponding chunk blank, as in the original file.
Until you answer the questions and name variables as expected, the messages stating x not found for problem y will appear. This is not a perfect system but meant as a quick check for you in case you have typos etc.
Do not edit or shorten this markdown file, or rename code chunks, named or unnamed, or modify eval options. It might be easier for you to split this markdown into pieces for each of the questions Q1, Q2 and Q3 separately while working on the problems. You are free to do so, but it is your responsibility to put things together as one file in the end.
Check that you can knit this markdown in Rstudio without error before you enter anything, and, after you enter your answers.
Submit the markdown file (not the html file ass3.html) file by uploading it to stat290.stanford.edu. Put it under your private directory and ensure that its name is ass3.Rmd. NOTE: No other name, case exactly as noted, no other format!
Your submission should be in the right place by the due date and time.

Preamble

###opts_knit$set(stop_on_error = 2L)
library(XML)
library(R6)
library(jsonlite)

XML parsing using `R6` and JSON parsing

1a. Rewrite the example sax.html of lecture 14 using a portable R6 class named EuroCurrencyTracker.

EuroCurrencyTracker <-  R6Class("EuroCurrencyTracker", portable = TRUE,
        private = list(currencies = c(NA,NA)),
#rateData = list(times=NA, rates = vector("list",2)),          
        public =list(times = numeric(0), rates = vector("list",2),
          initialize = function(currencies){
            private$currencies = currencies
            names(self$rates) = currencies},
          setCurrencies = function(currencies){
            private$currencies = currencies
          },
          rateData = function(){
            rateData = list(times = self$times, rates = self$rates)
          },
          xmlHandlers = function(currencies = private$currencies) {
            day <- 0
            #
            startElement <- function(name, attrs){
              #browser()
              if (name =="Cube"){
                if("time" %in% names(attrs)){
                  day <<- day + 1
                  self$times[day] <<- attrs["time"]
                }
                if ("currency" %in% names(attrs) && attrs["currency"] %in%
                    currencies){
                  currency = attrs["currency"]
                  self$rates[[currency]][day] = attrs["rate"]
                }
              }   # end of if (name =="Cube"
              #list(startElement = startElement, 
                   #rateData = function() list(times=times,rates = rates))
            } # end of startElement
            #browser()
            list(startElement = startElement, 
                 rateData1 = function() list(times=times,rates = rates))
          })  # end of xmlHandlers, end of public
)  #end of R6 Class Statement --> 1A

Here is how we will use it.

if (!exists("EuroCurrencyTracker")) {
    cat("EuroCurrencyTracker for problem 1 not found!")
} else {
    usdAndInr <- EuroCurrencyTracker$new(currencies = c("USD", "INR"))
    exchangeRate <- xmlEventParse(file = "eurofxref-hist-90d.xml",
                                  handlers = usdAndInr$xmlHandlers())
    rateData <- usdAndInr$rateData()
    rates <- cbind(as.data.frame(lapply(rateData$rates, as.numeric)),
                   date = as.POSIXct(strptime(rateData$times, "%Y-%m-%d")))
    str(rates)
}

## 'data.frame':    62 obs. of  3 variables:
##  $ USD : num  1.13 1.13 1.12 1.11 1.12 ...
##  $ INR : num  77.7 76.4 76.5 75.5 75.8 ...
##  $ date: POSIXct, format: "2016-02-11" "2016-02-10" ...

This should yield the similar results as in sax.html of lecture 15 (but not the same since the data is the one included here.)

1b. Examine the JSON file provided for you named top100.json which is a list of top 100 TV shows from iTunes. You might find it helpful to view the file using a JSON viewer.

Produce a table of frequency of the various kinds of the top 100 TV terms such as “Reality TV”, “Drama” etc. Your result should be named categoryTable.

z = fromJSON("top100.json")
categoryTable <- table(z[["feed"]][["entry"]][["category"]][["attributes"]][["term"]]) #1B

if (!exists("categoryTable")) {
    cat("categoryTable for problem 1b not found!")
} else {
    print(categoryTable)
}

## 
##           Comedy            Drama             Kids       Reality TV 
##               11               64                1               12 
## Sci-Fi & Fantasy 
##               12

1c. Produce a table of frequency of the prices.

priceTable <- table(z[["feed"]][["entry"]][["im:price"]][["label"]]) #1C

if (!exists("priceTable")) {
    cat("priceTable for problem 1c not found!")
} else {
    print(priceTable)
}

## 
## $1.99 $2.99 
##     2    98

SAX Processing and Termination

2. The entire set of Wikipedia pages is downloadable and distributed as one large XML file among other formats. It is too large for an exercise, so we downloaded a subset of pages via the Wikipedia API Wiki-subset.xml. The structure of this XML file is simple: And indeed, you can locate the schema for the XML at the URLs specified for example. Pages start with the page tag and have one or more revision nodes, each of which has an id, a timestamp, contributor etc. In this exercise, we will use an R6 class to be efficient in sax parsing and stop parsing as soon as we get what we want.

Wikipedia XML Structure

Define a portable R6 class named WikipediaPageHits that will take as parameters a regular expression pattern (pattern) and a maximum specified number (hitLimit default value of 4), It should return a vector of the encountered Wikipedia page titles that match the regular expression pattern. Hint: Use a public method for the class named saxHandler as follows:

saxHandler = function() {
    page <- function(context, node, attrs, ...) {
        pageTitle <- xmlValue(node[["title"]])
        ##if (pagetitle matches pattern) {
        ##   increment title count
        ##   store pageTitle
        ##   If you wish to debug, you can print(pageTitle)
        ##   If titleCount is greater than hitLimit, call
        ##   xmlStopParser(context)
        ## }
    }
    c(page = xmlParserContextFunction(page))
 }

## WikipediaPageHits <- ...
WikipediaPageHits <-  R6Class("hits", portable = TRUE,
                              private = list(pattern = NA, hitLimit = NA, cnt = NA),
                              #
                              public = list(
                                initialize = function(pattern,hitLimit){
                                  private$pattern = pattern
                                  private$hitLimit = hitLimit
                                  private$cnt = 0
                                  #private$titelist = list()
                                }, titlelist = NA,
                                saxHandler = function(){
                                  page = function(context,node,attrs,...){
                                    #browser()
                                    pageTitle = xmlValue(node[["title"]])
                                    if (grepl(private$pattern,pageTitle)){
                                      private$cnt = private$cnt + 1
                                      self$titlelist[[private$cnt]] = pageTitle
                                      #print(pageTitle)
                                      #browser()
                                      if (private$cnt > (private$hitLimit-1)){
                                        xmlStopParser(context)
                                      }#- Counter
                                    }#- Check pattern
                                    #browser()
                                  }#- page fn()
                                  c(page = xmlParserContextFunction(page))
                                }, getTitles = function() self$titlelist)# Public
)    #end of R6 Class

For details see the relevant help in library(XML) particularly on stopping the parser after you hit the hitLimit number of titles. Your class will be used as follows:

## Print all titles that start with an F or an S followed by any of the
## letters e, r, or a.
if (!exists("WikipediaPageHits")) {
    cat("WikipediaPageHits for problem 2 not found!")
} else {
    hits <- WikipediaPageHits$new(pattern = "^[FS][era]", hitLimit = 7L)
    d <- xmlEventParse(file = "Wiki-subset.xml", branches = hits$saxHandler())
    hits$getTitles()
    ## Returns:
    ## [1] "Srinivasa Ramanujan"       "Frank Adams"
    ## [3] "Selman Akbulut"            "Frederick J. Almgren, Jr."
    ## [5] "Felix Bernstein"           "Salomon Bochner"
    ## [7] "Fedor Bogomolov"
}

## [1] "Srinivasa Ramanujan"       "Frank Adams"              
## [3] "Selman Akbulut"            "Frederick J. Almgren, Jr."
## [5] "Felix Bernstein"           "Salomon Bochner"          
## [7] "Fedor Bogomolov"

Computing on the language.

3. Drug companies who sponsor clinical trials costing millions of dollars often mine the finalized data from the trial to determine if their treatment works in specific subgroups of patients with certain characteristics. For example, if a treatment seems to show greater effect in a subset of patients who are female, non-smokers and hypothyroid, then they would like to study it further or or even sponsor further trials after perhaps doing due adjustments (false discovery rate or multiple testing corrections). It is not unusual for them to use a set of say 5 to 10 variables to generate these subsets. These subsets are generated using boolean logic with the and/or conjunctions. Examples of selection clauses:

Sex == "male"
Sex == "female"
Sex == "male" and SmokingHx == "current"
Sex == "male" and SmokingHx != "current" and CancerStage == "II"

In order to keep things simple, assume the conjunction is always and/or and the two are not mixed together. Also assume that a data frame containing these variables is provided in trial.RDS. Then the whole problem becomes a subsetting exercise except that the clauses for subsetting have to be programmatically generated. The only variables you need to deal with are: Sex, SmokingHx, CancerStage, remission and trt.

trial <- readRDS("trial.RDS")

3a. Write a function with the following signature:

#makeClause <- function(Sex = c(NA, "male", "female"),
#            SmokingHx = c(NA, "current", "former", "never"),
#            CancerStage = c(NA, "I", "II", "III", "IV"),
#            conjunction = c("&", "|")) {
#...
#}

that will return an appropriate subset clause as a character string.

Examples of use are:

> makeClause()
[1] "TRUE"
> makeClause(Sex = "female")
[1] "(Sex == 'female')"
> makeClause(Sex = "female", SmokingHx = "NA", CancerStage = "NA")
[1] "(Sex == 'female')"
> makeClause(Sex = "male", CancerStage = "IV")
[1] "(Sex == 'male') & (CancerStage == 'IV')"
> makeClause(Sex = "female", SmokingHx = "NA")
[1] "(Sex == 'female')"
> makeClause(Sex = "male", CancerStage = "IV", SmokingHx = "NA",  conjunction = "|")
[1] "(Sex == 'male') | (CancerStage == 'IV')"
> makeClause(Sex = "female", CancerStage = "IV", SmokingHx = "never", conjunction = '|')
[1] "(Sex == 'female') | (SmokingHx == 'never') | (CancerStage == 'IV')"

The order of the variables in the output clause string does matter for grading. So does the use of single and double quotes. The extra space around the == or | or & doesn’t matter but is worth striving for since it is the sort of detail that programmers notice.

Ensure that Sex always precedes SmokingHx which always precedes CancerStage (see last example above). Note also how variables with NA values do not enter into the clause. Hint: see ?match.arg.

## makeClause <- ...
makeClause <- function(Sex = c("NA", "male", "female"),
            SmokingHx = c("NA", "current", "former", "never"),
            CancerStage = c("NA", "I", "II", "III", "IV"),
            conjunction = c("&", "|")){
  #
  #browser()
  flg = 0
  conj.p = conjunction[1]
  if (length(conjunction)==1 && (conjunction != "NA")){
    conjp = match.arg(conjunction)
  }
  #
  #browser()
  if (length(Sex)==1 && (Sex != "NA")){
    sex = match.arg(Sex)
    sex.p = paste0("(Sex == '",sex,"')")
    clause = sex.p #;print(clause)
  }
  #browser()
  if (length(SmokingHx)==1 && (SmokingHx != "NA")){
    smoke = match.arg(SmokingHx) 
    smk.p = paste0("(SmokingHx == '",smoke,"')")
    if (!exists("clause")){clause = smk.p}
    else{clause = paste(clause, conj.p,smk.p) }
  }
  #browser()
  if (length(CancerStage)==1 && (CancerStage != "NA")){
    cancer = match.arg(CancerStage)
    cancer.p = paste0("(CancerStage == '",cancer,"')")
    if (!exists("clause")){clause = cancer.p}
    else{clause = paste(clause, conj.p,cancer.p) }
  }
  if (!exists("clause")){clause = "TRUE"}
  #browser()
  return(clause)
} #  3a.--

if (!exists("makeClause")) {
    cat("makeClause not found for problem 3a!")
} else {
    print(makeClause())
    print(makeClause(Sex = "female"))
    print(makeClause(Sex = "female", SmokingHx = "NA", CancerStage = "NA"))
    print(makeClause(Sex = "male", CancerStage = "IV"))
    print(makeClause(Sex = "female", SmokingHx = "NA"))
    print(makeClause(Sex = "male", CancerStage = "IV", SmokingHx = "NA",  conjunction = "|"))
    print(makeClause(Sex = "female", CancerStage = "IV", SmokingHx = "never", conjunction = '|'))
}

## [1] "TRUE"
## [1] "(Sex == 'female')"
## [1] "(Sex == 'female')"
## [1] "(Sex == 'male') & (CancerStage == 'IV')"
## [1] "(Sex == 'female')"
## [1] "(Sex == 'male') | (CancerStage == 'IV')"
## [1] "(Sex == 'female') | (SmokingHx == 'never') | (CancerStage == 'IV')"

3b. Generate all possible tuples for the four variables Sex, SmokingHx, CancerStage and conjunction as a matrix with 4 columns. The result should be a matrix named mTuples. Oblique hint: See ?combn

## mTuples <- ...
permute <- function(Sex = c(NA, "male", "female"),
                       SmokingHx = c(NA, "current", "former", "never"),
                       CancerStage = c(NA, "I", "II", "III", "IV"),
                       conjunction = c("&", "|")){
  
  val = list(Sex=c(NA, "male", "female"),SmokingHx = c(NA, "current", "former", "never"),CancerStage = c(NA, "I", "II", "III", "IV"), conjunction = c("&", "|"))
  tmp1 = data.frame(a = val[[1]], b=NA, c=NA, d=NA)
  tmp2 = data.frame(a = NA,b = val[[2]],  c=NA, d=NA)
  tmp3 = data.frame(a = NA,b=NA,c = val[[3]], d=NA)
  tmp4 = data.frame(a = NA,b=NA, c=NA,d = val[[4]])
  tmp = rbind(tmp1,tmp2,tmp3,tmp4)
  #
  nms = c("Sex","Smoking","Cancer","Conj")
  names(tmp) = nms
  perm12 = expand.grid(val[[1]],val[[2]],val[[4]])#stringsAsFactors = FALSE)
  perm12[,4] = perm12[,3] ; perm12[,3] = NA
  names(perm12) = nms
  #
  perm13 = expand.grid(val[[1]],val[[3]],val[[4]])#stringsAsFactors = FALSE)
  perm13[,3:4] = perm13[,2:3]; perm13[,2] = NA
  names(perm13) = nms
  #
  perm23 = expand.grid(val[[2]],val[[3]],val[[4]])#stringsAsFactors = FALSE)
  perm23[,2:4] = perm23[,1:3]; perm23[,1] = NA
  names(perm23) = nms
  #
  perm123 = expand.grid(val[[1]],val[[2]],val[[3]],val[[4]])#stringsAsFactors = FALSE)
  names(perm123) = nms
  #
  tmp = (rbind(tmp,perm12, perm13,perm23, perm123))
  master = as.matrix(rbind(tmp,perm12, perm13,perm23, perm123))
  return(master)
}

mTuples = permute() #--- 3B

if (!exists("mTuples") || !is.matrix(mTuples)) {
    cat("mTuples not found for problem 3b!")
} else {
    cat("Dimension of mTuples is", dim(mTuples), "\n")
}

## Dimension of mTuples is 442 4

3c. Produce all the filter clauses using the above matrix for use with dplyr. Remove the trivial and duplicate ones. Hint: do.call. The final result after the removal should be named clauses.

## clauses <- ...
produce_clause = function(dat){
  #data = as.data.frame(data, stringsAsFactors = FALSE)
  dat1 = dat
  holder = list()
  cnt = 0
  for (l in 1:nrow(dat)){
    for (k in 1:ncol(dat)){
      if (is.na(dat[l,k]))dat1[l,k] = "NA"
    }
  }
  #browser()
  for (i in 1:nrow(dat)){
    #browser()
    if ((sum(is.na(dat[i,1:3]))<3)){
      cnt = cnt + 1
      #browser()
     # 
      if (sum(!is.na(dat[i,1:3]))==1){
        holder[[cnt]] = makeClause(Sex=dat1[i,1],SmokingHx = dat1[i,2],
                                   CancerStage = dat1[i,3])
      }# sinlgle clause (disregard conjunction...)
      else{holder[[cnt]] = makeClause(Sex=dat1[i,1],SmokingHx = dat1[i,2],
                        CancerStage = dat1[i,3],  conjunction = dat1[i,4])
      }#multiple clauses
      #
      #print(holder[[cnt]])
    }#check trival cases
  }# end for(i)
  #browser()
  holder1 = unique(holder); print(holder1)
  return(holder1)
}

clauses = produce_clause(mTuples) #== 3c

## [[1]]
## [1] "(Sex == 'male')"
## 
## [[2]]
## [1] "(Sex == 'female')"
## 
## [[3]]
## [1] "(SmokingHx == 'current')"
## 
## [[4]]
## [1] "(SmokingHx == 'former')"
## 
## [[5]]
## [1] "(SmokingHx == 'never')"
## 
## [[6]]
## [1] "(CancerStage == 'I')"
## 
## [[7]]
## [1] "(CancerStage == 'II')"
## 
## [[8]]
## [1] "(CancerStage == 'III')"
## 
## [[9]]
## [1] "(CancerStage == 'IV')"
## 
## [[10]]
## [1] "(Sex == 'male') & (SmokingHx == 'current')"
## 
## [[11]]
## [1] "(Sex == 'female') & (SmokingHx == 'current')"
## 
## [[12]]
## [1] "(Sex == 'male') & (SmokingHx == 'former')"
## 
## [[13]]
## [1] "(Sex == 'female') & (SmokingHx == 'former')"
## 
## [[14]]
## [1] "(Sex == 'male') & (SmokingHx == 'never')"
## 
## [[15]]
## [1] "(Sex == 'female') & (SmokingHx == 'never')"
## 
## [[16]]
## [1] "(Sex == 'male') | (SmokingHx == 'current')"
## 
## [[17]]
## [1] "(Sex == 'female') | (SmokingHx == 'current')"
## 
## [[18]]
## [1] "(Sex == 'male') | (SmokingHx == 'former')"
## 
## [[19]]
## [1] "(Sex == 'female') | (SmokingHx == 'former')"
## 
## [[20]]
## [1] "(Sex == 'male') | (SmokingHx == 'never')"
## 
## [[21]]
## [1] "(Sex == 'female') | (SmokingHx == 'never')"
## 
## [[22]]
## [1] "(Sex == 'male') & (CancerStage == 'I')"
## 
## [[23]]
## [1] "(Sex == 'female') & (CancerStage == 'I')"
## 
## [[24]]
## [1] "(Sex == 'male') & (CancerStage == 'II')"
## 
## [[25]]
## [1] "(Sex == 'female') & (CancerStage == 'II')"
## 
## [[26]]
## [1] "(Sex == 'male') & (CancerStage == 'III')"
## 
## [[27]]
## [1] "(Sex == 'female') & (CancerStage == 'III')"
## 
## [[28]]
## [1] "(Sex == 'male') & (CancerStage == 'IV')"
## 
## [[29]]
## [1] "(Sex == 'female') & (CancerStage == 'IV')"
## 
## [[30]]
## [1] "(Sex == 'male') | (CancerStage == 'I')"
## 
## [[31]]
## [1] "(Sex == 'female') | (CancerStage == 'I')"
## 
## [[32]]
## [1] "(Sex == 'male') | (CancerStage == 'II')"
## 
## [[33]]
## [1] "(Sex == 'female') | (CancerStage == 'II')"
## 
## [[34]]
## [1] "(Sex == 'male') | (CancerStage == 'III')"
## 
## [[35]]
## [1] "(Sex == 'female') | (CancerStage == 'III')"
## 
## [[36]]
## [1] "(Sex == 'male') | (CancerStage == 'IV')"
## 
## [[37]]
## [1] "(Sex == 'female') | (CancerStage == 'IV')"
## 
## [[38]]
## [1] "(SmokingHx == 'current') & (CancerStage == 'I')"
## 
## [[39]]
## [1] "(SmokingHx == 'former') & (CancerStage == 'I')"
## 
## [[40]]
## [1] "(SmokingHx == 'never') & (CancerStage == 'I')"
## 
## [[41]]
## [1] "(SmokingHx == 'current') & (CancerStage == 'II')"
## 
## [[42]]
## [1] "(SmokingHx == 'former') & (CancerStage == 'II')"
## 
## [[43]]
## [1] "(SmokingHx == 'never') & (CancerStage == 'II')"
## 
## [[44]]
## [1] "(SmokingHx == 'current') & (CancerStage == 'III')"
## 
## [[45]]
## [1] "(SmokingHx == 'former') & (CancerStage == 'III')"
## 
## [[46]]
## [1] "(SmokingHx == 'never') & (CancerStage == 'III')"
## 
## [[47]]
## [1] "(SmokingHx == 'current') & (CancerStage == 'IV')"
## 
## [[48]]
## [1] "(SmokingHx == 'former') & (CancerStage == 'IV')"
## 
## [[49]]
## [1] "(SmokingHx == 'never') & (CancerStage == 'IV')"
## 
## [[50]]
## [1] "(SmokingHx == 'current') | (CancerStage == 'I')"
## 
## [[51]]
## [1] "(SmokingHx == 'former') | (CancerStage == 'I')"
## 
## [[52]]
## [1] "(SmokingHx == 'never') | (CancerStage == 'I')"
## 
## [[53]]
## [1] "(SmokingHx == 'current') | (CancerStage == 'II')"
## 
## [[54]]
## [1] "(SmokingHx == 'former') | (CancerStage == 'II')"
## 
## [[55]]
## [1] "(SmokingHx == 'never') | (CancerStage == 'II')"
## 
## [[56]]
## [1] "(SmokingHx == 'current') | (CancerStage == 'III')"
## 
## [[57]]
## [1] "(SmokingHx == 'former') | (CancerStage == 'III')"
## 
## [[58]]
## [1] "(SmokingHx == 'never') | (CancerStage == 'III')"
## 
## [[59]]
## [1] "(SmokingHx == 'current') | (CancerStage == 'IV')"
## 
## [[60]]
## [1] "(SmokingHx == 'former') | (CancerStage == 'IV')"
## 
## [[61]]
## [1] "(SmokingHx == 'never') | (CancerStage == 'IV')"
## 
## [[62]]
## [1] "(Sex == 'male') & (SmokingHx == 'current') & (CancerStage == 'I')"
## 
## [[63]]
## [1] "(Sex == 'female') & (SmokingHx == 'current') & (CancerStage == 'I')"
## 
## [[64]]
## [1] "(Sex == 'male') & (SmokingHx == 'former') & (CancerStage == 'I')"
## 
## [[65]]
## [1] "(Sex == 'female') & (SmokingHx == 'former') & (CancerStage == 'I')"
## 
## [[66]]
## [1] "(Sex == 'male') & (SmokingHx == 'never') & (CancerStage == 'I')"
## 
## [[67]]
## [1] "(Sex == 'female') & (SmokingHx == 'never') & (CancerStage == 'I')"
## 
## [[68]]
## [1] "(Sex == 'male') & (SmokingHx == 'current') & (CancerStage == 'II')"
## 
## [[69]]
## [1] "(Sex == 'female') & (SmokingHx == 'current') & (CancerStage == 'II')"
## 
## [[70]]
## [1] "(Sex == 'male') & (SmokingHx == 'former') & (CancerStage == 'II')"
## 
## [[71]]
## [1] "(Sex == 'female') & (SmokingHx == 'former') & (CancerStage == 'II')"
## 
## [[72]]
## [1] "(Sex == 'male') & (SmokingHx == 'never') & (CancerStage == 'II')"
## 
## [[73]]
## [1] "(Sex == 'female') & (SmokingHx == 'never') & (CancerStage == 'II')"
## 
## [[74]]
## [1] "(Sex == 'male') & (SmokingHx == 'current') & (CancerStage == 'III')"
## 
## [[75]]
## [1] "(Sex == 'female') & (SmokingHx == 'current') & (CancerStage == 'III')"
## 
## [[76]]
## [1] "(Sex == 'male') & (SmokingHx == 'former') & (CancerStage == 'III')"
## 
## [[77]]
## [1] "(Sex == 'female') & (SmokingHx == 'former') & (CancerStage == 'III')"
## 
## [[78]]
## [1] "(Sex == 'male') & (SmokingHx == 'never') & (CancerStage == 'III')"
## 
## [[79]]
## [1] "(Sex == 'female') & (SmokingHx == 'never') & (CancerStage == 'III')"
## 
## [[80]]
## [1] "(Sex == 'male') & (SmokingHx == 'current') & (CancerStage == 'IV')"
## 
## [[81]]
## [1] "(Sex == 'female') & (SmokingHx == 'current') & (CancerStage == 'IV')"
## 
## [[82]]
## [1] "(Sex == 'male') & (SmokingHx == 'former') & (CancerStage == 'IV')"
## 
## [[83]]
## [1] "(Sex == 'female') & (SmokingHx == 'former') & (CancerStage == 'IV')"
## 
## [[84]]
## [1] "(Sex == 'male') & (SmokingHx == 'never') & (CancerStage == 'IV')"
## 
## [[85]]
## [1] "(Sex == 'female') & (SmokingHx == 'never') & (CancerStage == 'IV')"
## 
## [[86]]
## [1] "(Sex == 'male') | (SmokingHx == 'current') | (CancerStage == 'I')"
## 
## [[87]]
## [1] "(Sex == 'female') | (SmokingHx == 'current') | (CancerStage == 'I')"
## 
## [[88]]
## [1] "(Sex == 'male') | (SmokingHx == 'former') | (CancerStage == 'I')"
## 
## [[89]]
## [1] "(Sex == 'female') | (SmokingHx == 'former') | (CancerStage == 'I')"
## 
## [[90]]
## [1] "(Sex == 'male') | (SmokingHx == 'never') | (CancerStage == 'I')"
## 
## [[91]]
## [1] "(Sex == 'female') | (SmokingHx == 'never') | (CancerStage == 'I')"
## 
## [[92]]
## [1] "(Sex == 'male') | (SmokingHx == 'current') | (CancerStage == 'II')"
## 
## [[93]]
## [1] "(Sex == 'female') | (SmokingHx == 'current') | (CancerStage == 'II')"
## 
## [[94]]
## [1] "(Sex == 'male') | (SmokingHx == 'former') | (CancerStage == 'II')"
## 
## [[95]]
## [1] "(Sex == 'female') | (SmokingHx == 'former') | (CancerStage == 'II')"
## 
## [[96]]
## [1] "(Sex == 'male') | (SmokingHx == 'never') | (CancerStage == 'II')"
## 
## [[97]]
## [1] "(Sex == 'female') | (SmokingHx == 'never') | (CancerStage == 'II')"
## 
## [[98]]
## [1] "(Sex == 'male') | (SmokingHx == 'current') | (CancerStage == 'III')"
## 
## [[99]]
## [1] "(Sex == 'female') | (SmokingHx == 'current') | (CancerStage == 'III')"
## 
## [[100]]
## [1] "(Sex == 'male') | (SmokingHx == 'former') | (CancerStage == 'III')"
## 
## [[101]]
## [1] "(Sex == 'female') | (SmokingHx == 'former') | (CancerStage == 'III')"
## 
## [[102]]
## [1] "(Sex == 'male') | (SmokingHx == 'never') | (CancerStage == 'III')"
## 
## [[103]]
## [1] "(Sex == 'female') | (SmokingHx == 'never') | (CancerStage == 'III')"
## 
## [[104]]
## [1] "(Sex == 'male') | (SmokingHx == 'current') | (CancerStage == 'IV')"
## 
## [[105]]
## [1] "(Sex == 'female') | (SmokingHx == 'current') | (CancerStage == 'IV')"
## 
## [[106]]
## [1] "(Sex == 'male') | (SmokingHx == 'former') | (CancerStage == 'IV')"
## 
## [[107]]
## [1] "(Sex == 'female') | (SmokingHx == 'former') | (CancerStage == 'IV')"
## 
## [[108]]
## [1] "(Sex == 'male') | (SmokingHx == 'never') | (CancerStage == 'IV')"
## 
## [[109]]
## [1] "(Sex == 'female') | (SmokingHx == 'never') | (CancerStage == 'IV')"

if (!exists("clauses")) {
    cat("clauses not found for problem 3c!")
} else {
    print(head(clauses))
    print(tail(clauses))
}

## [[1]]
## [1] "(Sex == 'male')"
## 
## [[2]]
## [1] "(Sex == 'female')"
## 
## [[3]]
## [1] "(SmokingHx == 'current')"
## 
## [[4]]
## [1] "(SmokingHx == 'former')"
## 
## [[5]]
## [1] "(SmokingHx == 'never')"
## 
## [[6]]
## [1] "(CancerStage == 'I')"
## 
## [[1]]
## [1] "(Sex == 'male') | (SmokingHx == 'current') | (CancerStage == 'IV')"
## 
## [[2]]
## [1] "(Sex == 'female') | (SmokingHx == 'current') | (CancerStage == 'IV')"
## 
## [[3]]
## [1] "(Sex == 'male') | (SmokingHx == 'former') | (CancerStage == 'IV')"
## 
## [[4]]
## [1] "(Sex == 'female') | (SmokingHx == 'former') | (CancerStage == 'IV')"
## 
## [[5]]
## [1] "(Sex == 'male') | (SmokingHx == 'never') | (CancerStage == 'IV')"
## 
## [[6]]
## [1] "(Sex == 'female') | (SmokingHx == 'never') | (CancerStage == 'IV')"

3d. Using lapply and the variable clauses, generate all subsets of data corresponding to the clauses. The result should be named subsets.

Hint: See examples of lecture code using parse and eval. You may use subset or dplyr (your choice!), but you have to ensure that you name the list of subsets with the clauses so that we know which subset corresponds to which clause.

## subsets <- ...
create_set = function(log_exp,data){
  set = lapply(log_exp, function(x) subset(data,eval(parse(text=x))))
  names(set) = log_exp
  # eval(parse(text = "string")) converts string to logical condition
  return(set)
}
subsets = create_set(clauses,trial) #-- 3d

if (!exists("subsets")) {
    cat("subsets not found for problem 3d!")
} else {
    print(head(lapply(subsets, dim)))
    print(tail(lapply(subsets, dim)))
}

## $`(Sex == 'male')`
## [1] 3410   28
## 
## $`(Sex == 'female')`
## [1] 5115   28
## 
## $`(SmokingHx == 'current')`
## [1] 1705   28
## 
## $`(SmokingHx == 'former')`
## [1] 1705   28
## 
## $`(SmokingHx == 'never')`
## [1] 5115   28
## 
## $`(CancerStage == 'I')`
## [1] 2558   28
## 
## $`(Sex == 'male') | (SmokingHx == 'current') | (CancerStage == 'IV')`
## [1] 5154   28
## 
## $`(Sex == 'female') | (SmokingHx == 'current') | (CancerStage == 'IV')`
## [1] 5920   28
## 
## $`(Sex == 'male') | (SmokingHx == 'former') | (CancerStage == 'IV')`
## [1] 5023   28
## 
## $`(Sex == 'female') | (SmokingHx == 'former') | (CancerStage == 'IV')`
## [1] 6033   28
## 
## $`(Sex == 'male') | (SmokingHx == 'never') | (CancerStage == 'IV')`
## [1] 6158   28
## 
## $`(Sex == 'female') | (SmokingHx == 'never') | (CancerStage == 'IV')`
## [1] 7518   28

3e. Select generated subsets that have at least 120 subjects in the subset. Preserve the names of the subsets. Call the result testableSubsets.

## testableSubsets <- ...
create_miniset = function(subset){
  cnt = 0
  id = list(); test = list()
  for (i in 1:length(subset)){
    #browser()
    if(nrow(subset[[i]]) > 119){
      cnt = cnt + 1
      #browser()
      test[cnt] = subset[i]
      id[cnt] = names(subset[i])
      #names(test[cnt]) = names(subset[i])
      #browser()
    }
  }#end for-loop
  names(test) = id#; browser()
  return(test)
}
testableSubsets = create_miniset(subsets) # 3e..

if (!exists("testableSubsets")) {
    cat("testableSubsets not found for problem 3e!")
} else {
    print(head(lapply(testableSubsets, dim)))
}

## $`(Sex == 'male')`
## [1] 3410   28
## 
## $`(Sex == 'female')`
## [1] 5115   28
## 
## $`(SmokingHx == 'current')`
## [1] 1705   28
## 
## $`(SmokingHx == 'former')`
## [1] 1705   28
## 
## $`(SmokingHx == 'never')`
## [1] 5115   28
## 
## $`(CancerStage == 'I')`
## [1] 2558   28

3f. For each of these subsets, perform a statistical test to see if there is a difference between the proportion of remissions in the Drug group versus Placebo (variable trt). Return a vector of \(p\)-values named pValues for each subset using a Fisher exact test (code below) for each subset d. Preserve names.

f <- function(d) {
    fisher.test(table(d$remission, d$trt))$p.value
}

## pValues <- ...
stat_eval = function(tmp){
  p = list(0)
  for (i in 1: length(tmp)){
    p[i] = f(tmp[[i]])
  }
  names(p) = names(tmp)
  return(p)
}
pValues = stat_eval(testableSubsets) # 3f--

if (!exists("pValues")) {
    cat("pValues not found for problem 3f!")
} else {
    print(head(pValues))
    print(tail(pValues))
}

## $`(Sex == 'male')`
## [1] 0.6566934
## 
## $`(Sex == 'female')`
## [1] 0.8715682
## 
## $`(SmokingHx == 'current')`
## [1] 0.8656192
## 
## $`(SmokingHx == 'former')`
## [1] 0.2259288
## 
## $`(SmokingHx == 'never')`
## [1] 0.3079317
## 
## $`(CancerStage == 'I')`
## [1] 0.9308088
## 
## $`(Sex == 'male') | (SmokingHx == 'current') | (CancerStage == 'IV')`
## [1] 0.6667534
## 
## $`(Sex == 'female') | (SmokingHx == 'current') | (CancerStage == 'IV')`
## [1] 0.7836939
## 
## $`(Sex == 'male') | (SmokingHx == 'former') | (CancerStage == 'IV')`
## [1] 1
## 
## $`(Sex == 'female') | (SmokingHx == 'former') | (CancerStage == 'IV')`
## [1] 0.6513537
## 
## $`(Sex == 'male') | (SmokingHx == 'never') | (CancerStage == 'IV')`
## [1] 0.4907512
## 
## $`(Sex == 'female') | (SmokingHx == 'never') | (CancerStage == 'IV')`
## [1] 0.706759

3g. What clauses yield a \(p\)-value of .10 or less. Your answer is expected in the variable smallPvalues and the names are expected to be clauses.

## smallPvalues <- ...
result = function(pv,nme){
  cnt = 0
  id = list(); sp = list()
  for ( i in 1:length(pv)){
    if (pv[i] <= 0.1){
      cnt = cnt + 1
      sp[cnt] = pv[i]
      id [cnt] = names(nme[i])
    }
  }
  names(sp) = id
  return(sp)
}
smallPvalues = result(pValues,testableSubsets)

if (!exists("smallPvalues")) {
    cat("smallPvalues not found for problem 3g!")
} else {
    print(smallPvalues)
}