Instructions

Follow these instructions paying full attention to details to get proper credit.

R version 3.2.3 should be used for all exercises.
Cut and paste your answers (text or code) into the chunks named q_1a, q_2, q_4i, etc. replacing ANSWER with your answer.
Do not edit or shorten this markdown file, or rename chunks.
Check that you can knit this markdown in RStudio without error before you enter anything, and again after you enter your answers.
Submit the markdown file (not the html file) by uploading it to stat290.stanford.edu. Put it under your private directory and ensure that its name is ass1.Rmd.

(Short Answers, no more than 2 sentences.)

In an interactive R session

1a. What does x = 2 do?

x = 2 assigns the value 2 to the variable x without printing anything; class (and mode) = numeric and type = double.

1b. What does (x = 5) do?

(x = 5) assigns the value 5 to x and, because of the enclosing parentheses, also prints the value; class (and mode) = numeric and type = double.

1c. What does (x == 2) do?

(x == 2) is a comparison that returns a logical value (TRUE or FALSE) depending on the value of x; class = type = mode = logical.
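The three expressions in 1a to 1c can be tried side by side. The sketch below uses withVisible() to record whether a value would be auto-printed at the prompt; the names r1, r2 and r3 are only illustrative.

```r
r1 <- withVisible({x = 2})   # plain assignment: the value is returned invisibly
r2 <- withVisible((x = 5))   # parentheses make the assigned value visible
r3 <- (x == 2)               # comparison: x is now 5, so this is FALSE

r1$visible   # FALSE
r2$visible   # TRUE
typeof(x)    # "double"
```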

1d. What is the difference between x[] and x[[]] subsetting on lists?

x[i] returns a list of length one that contains the i-th element; class = list, typeof = list.
x[[i]] returns the i-th element itself; its class and type are those of the object stored in that element.
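A small illustration of the two subsetting forms, using a made-up list lst:

```r
lst <- list(a = 1:3, b = "text")   # hypothetical example list

sub <- lst[1]     # still a list, of length 1, holding the element named a
elt <- lst[[1]]   # the integer vector 1:3 itself

class(sub)   # "list"
class(elt)   # "integer"
```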

1e. What’s the difference between & and &&?

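“&” is the vectorized (element-wise) logical “AND” operator. “&&” is the scalar “AND”: it returns a single value and short-circuits, so the right-hand side is not evaluated when the left-hand side is FALSE.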
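The difference can be seen with two short logical vectors; the vectors a and b are invented for the example, and the stop() call only serves to show that the right operand is skipped.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

elementwise <- a & b   # compares position by position: TRUE FALSE FALSE

# && works on single values and skips the right operand when the left
# one is FALSE, so the stop() call below is never evaluated.
short <- FALSE && stop("never reached")
```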

1f. Look up the definition of the function is.primitive and explain precisely how it works.

is.primitive(x) inspects typeof(x): it returns TRUE if the type is "builtin" or "special" (the two kinds of primitive function) and FALSE otherwise.
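As a quick check, primitive functions can be recognised by their type, which is either "builtin" or "special", whereas ordinary R functions are closures. The three functions below are just convenient examples.

```r
typeof(sum)     # "builtin"
typeof(quote)   # "special"
typeof(mean)    # "closure"

prim <- c(sum   = is.primitive(sum),
          quote = is.primitive(quote),
          mean  = is.primitive(mean))
prim            # TRUE TRUE FALSE
```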

1g. How does the save/load pair of functions differ from saveRDS/readRDS pair?

save/load save and restore one or more objects together with their names to/from the current environment; hence, a restored object keeps its original name. saveRDS/readRDS store (via serialization) only the representation of a single object, so it can be restored under a different name.
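A minimal round trip through both pairs, written to temporary files; the object name obj and the file names are arbitrary choices for the sketch.

```r
obj <- c(1.5, 2.5)
f.rda <- tempfile(fileext = ".RData")
f.rds <- tempfile(fileext = ".rds")

save(obj, file = f.rda)      # stores the object together with its name
saveRDS(obj, file = f.rds)   # stores only the object
rm(obj)

restored <- load(f.rda)      # recreates obj; returns the names it restored
renamed  <- readRDS(f.rds)   # returns the value; the caller picks the name
```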

1h. Download the source code for R version 3.2.3. In what file is the source for the R function readBin located relative to the top-level directory R-3.2.3?

The R source of readBin is located in R-3.2.3/src/library/base/R/connections.R; the file R-3.2.3/src/library/base/man/readBin.Rd holds only its documentation.

1i. Why is the comment ##hence length(what) == 1 correct in the source for the R function readBin?

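The argument "what" is expected to be a character vector of length 1 that indicates the mode of the vector to be read. The comment follows the tests !is.character(what) || is.na(what) || length(what) != 1L; because || short-circuits, the final test is reached only when all of these are FALSE, so at that point length(what) == 1 must hold.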

1j. What two features of R make arguments such as as.is=!stringsAsFactors, fill=!blank.lines.skip work in read.table ?

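(i) lazy evaluation: the default for as.is is a promise that is evaluated only when as.is is first used, by which time stringsAsFactors has a value; (ii) default argument expressions are evaluated inside the function's own environment, so they can refer to other arguments (here negated with !).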
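The same pattern can be reproduced with a toy function. The function reader below is hypothetical and only mirrors the shape of the read.table arguments.

```r
# Hypothetical function: the default of as.is refers to another argument.
reader <- function(stringsAsFactors = TRUE, as.is = !stringsAsFactors) {
  as.is   # the default expression is evaluated here, on first use
}

reader()                           # FALSE
reader(stringsAsFactors = FALSE)   # TRUE
reader(as.is = TRUE)               # TRUE; the default is never evaluated
```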

1k. Explain the result of

f1 <- function(x = {y <- 7; 2}, y = 0) {
x + y
}
f1()
[1] 9

This is an example of lazy evaluation. In x + y, x is referenced first, so its default expression is evaluated inside the function: it assigns y = 7 (the default y = 0 is never evaluated) and returns 2, giving x + y = 2 + 7 = 9. When I reversed the terms, y + x = 0 + 2 = 2: here y is referenced first and takes its default 0, so the later assignment y <- 7 no longer affects the sum.
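The effect of the evaluation order can be checked by defining both variants. The name f2 is introduced here only for the reversed version.

```r
f1 <- function(x = {y <- 7; 2}, y = 0) x + y   # x is forced first
f2 <- function(x = {y <- 7; 2}, y = 0) y + x   # y is forced first

r1 <- f1()   # 9
r2 <- f2()   # 2
```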

Write a function to list all the primitive functions in the base package. Then provide that list.

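show.packages.1 = function(pckg){
  # Look the names up in the package environment itself so that objects
  # masking them elsewhere on the search path are not picked up.
  objs = mget(ls(pckg), envir = as.environment(pckg))
  funs = Filter(is.function,objs)
  op1 = Filter(is.primitive,funs)
  return(op1)
}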

Invoke the function.

answer.1 = show.packages.1("package:base"); answer.1

Define a %+% inline operator that works only on two data frames x and y containing integer, character (not factor!) and numeric vectors as follows.

If the names and types all match, perform an rbind (append rows); otherwise, if the number of rows match perform a cbind (append columns). Else, throw an error due to incompatibility.

Provide meaningful names for the variables when necessary. Warnings and errors are to be thrown where appropriate.

# Answer assumes each column must be INT, CHAR or NUM (as opposed to INT, CHAR and NUM as stated).
"%+%" = function(df1, df2){
  # Both operands must be data frames.
  if (!is.data.frame(df1) || !is.data.frame(df2))
    stop("both operands of %+% must be data frames")
  # A column is valid if it is integer, numeric or character. Factors are
  # rejected explicitly because is.integer() is TRUE for a factor.
  valid.col = function(v) !is.factor(v) && (is.numeric(v) || is.character(v))
  if (!all(vapply(df1, valid.col, logical(1)), vapply(df2, valid.col, logical(1))))
    stop("columns must be integer, numeric or character vectors")
  # Names and types all match: append rows.
  same.names = identical(names(df1), names(df2))
  same.types = same.names &&
    identical(vapply(df1, typeof, character(1)), vapply(df2, typeof, character(1)))
  if (same.names && same.types) return(rbind(df1, df2))
  # Otherwise, equal row counts: append columns and make duplicate names unique.
  if (nrow(df1) == nrow(df2)) {
    df.out = cbind(df1, df2)
    if (anyDuplicated(names(df.out))) {
      warning("duplicate column names were made unique")
      names(df.out) = make.unique(names(df.out))
    }
    return(df.out)
  }
  stop("data frames are incompatible: cannot rbind or cbind")
}
#--- Generate data frames
df1=data.frame(x1=1:3,x2=2:4,stringsAsFactors=FALSE)
df2=data.frame(x1=1:2,x2=c("a","b"),stringsAsFactors=FALSE)
df3=data.frame(x1=1:2,x2=2:3,stringsAsFactors=FALSE)
df4=data.frame(x1=1:2,x2=c("a","b"),stringsAsFactors=TRUE)
#--- Test infix function (try() keeps the expected errors from stopping the knit)
final.data = try(df1 %+% df2); final.data   # error: types and row counts differ
final.data = df2 %+% df3; final.data        # cbind with a warning about names
final.data = try(df3 %+% df4); final.data   # error: df4 contains a factor
final.data = df2 %+% df2; final.data        # rbind

A classic programming example is Huffman encoding; see . That example is in Scheme but huffEx.R contains an R version. The main difference between a purely functional implementation and the one in R is the avoidance of recursion, particularly because R does not support things like tail recursion. Also, we make do with generic lists in R.

For this exercise, complete parts of the code in huffEx.R; you will need the file PandP.txt (included on coursework). For your (not our) checking you need to make sure the file is available in the same directory as this markdown.

4a. Write code to create a vector of ordered frequencies from the vector (1 liner)

helloFreq <- sort(table(helloWorld))

4b. Create a list of Leaf nodes each of which represents a character of the helloWorld vector, ordered by frequency. Hint: lapply

helloLeafList <- lapply(seq(helloFreq), function(i) makeLeaf(weight=helloFreq[i], symbol=names(helloFreq[i])))

4c. Complete the function encode and decode (one line each).

    return(result)

    return(result)

4d. Complete the function createCodeTree (no more than 4 or 5 lines).

createCodeTree <- function(trees) {
  # Keep combining the two lightest trees until a single tree remains.
  while (length(trees) != 1) {
    trees <- combine(trees)
  }
  trees
  ## end of 4d.
}

4e. Create a code tree for the helloWorld vector.

helloCodeTree <- createCodeTree(helloLeafList)

4f. Encode and decode helloWorld and check that the encoding and decoding process are consistent (3 lines max).

hello.encode <- encode(helloCodeTree,helloWorld)
c(do.call("cbind",hello.encode))
hello.decode <- decode(hello.encode,helloCodeTree)
c(do.call("cbind",hello.decode))

4g. Decode the message in testMessage and print it as one string.

tmLeafList <- lapply(seq(testMessageFreq), function(i) makeLeaf(weight=testMessageFreq[i], symbol=names(testMessageFreq[i])))
tmCodeTree <- createCodeTree(tmLeafList)
tm.decode <- decode(tm,tmCodeTree)
c(do.call("cbind",tm.decode))

4h. Explain what happens in the lines

x <- local({tmp <- sapply(target, weight); split(seq(target),
                                                 tmp > weight(item))})
c(target[x[["FALSE"]]], list(item), target[x[["TRUE"]]])

of the function combine. Particularly state what happens for the extreme cases.

local() evaluates the expression in a temporary environment, so tmp does not leak into the function. The code inserts the newly combined node (item) into target while maintaining the order based on weight.
split() divides the indices of the trees in target into two groups, those with weight <= that of item and those above it. The result is a list with *named* elements "FALSE" and "TRUE"
respectively. In the extreme cases the list has one element rather than two, because all weights are above or all are at or below that of item. Because the elements are accessed by name, a missing group yields NULL, and indexing target with NULL gives an empty list. So the code won't fail.
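The behaviour of split() in the ordinary and the extreme cases can be reproduced without the Huffman code. The weights in w and the placeholder list target below are made up for the sketch.

```r
w <- c(1, 2, 5)                      # hypothetical weights of existing trees
target <- list("a", "b", "c")        # placeholder trees

mid  <- split(seq_along(w), w > 3)   # both groups: "FALSE" and "TRUE"
low  <- split(seq_along(w), w > 0)   # only "TRUE": every tree is heavier
high <- split(seq_along(w), w > 9)   # only "FALSE": no tree is heavier

low[["FALSE"]]                       # NULL, the group does not exist
target[low[["FALSE"]]]               # list(), indexing with NULL selects nothing
```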

4i. Encode and decode PandPHex and check they encode and decode consistently. Ignoring the storage of the code tree and other metadata, what is the approximate compression (compressed size / original size) you get for the Pride and Prejudice snippet? (Assume 8 bits/char). Provide R code.

panFreq <- sort(table(PandPHex))
pLeafList <- lapply(seq(panFreq), function(i) makeLeaf(weight=panFreq[i], symbol=names(panFreq[i])))
panCodeTree <- createCodeTree(pLeafList)
pan.encode <- encode(panCodeTree,PandPHex)
c(do.call("cbind",pan.encode))
pan.decode <- decode(pan.encode,panCodeTree)
c(do.call("cbind",pan.decode))

# Compute Compression --
nbits.asc = length(PandPHex) * 8
nbits.huff = length(pan.encode)
compression = 100*nbits.huff / nbits.asc
compression
#Compression is 56.6 %
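The ratio can also be understood from symbol counts and code lengths alone: the compressed size is the sum of count times code length. The toy counts and code lengths below are invented for illustration and are not taken from PandP.txt.

```r
freq     <- c(a = 4, b = 2, c = 1, d = 1)   # hypothetical symbol counts
code.len <- c(a = 1, b = 2, c = 3, d = 3)   # bits per symbol in one valid Huffman code

bits.huff <- sum(freq * code.len)   # 4*1 + 2*2 + 1*3 + 1*3 = 14
bits.orig <- sum(freq) * 8          # 8 symbols at 8 bits each = 64
ratio <- bits.huff / bits.orig      # 0.21875
```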

Stat 290: Assignment 1

Nothing to do below here