Exercise 1. (25 points)

In the last homework, you were asked to create a table of minimum replicates using matrix algebra. For this exercise, you will recreate the table, this time using iteration.

First we will call the required.replicates function from the previous homework assignment then set repicates.table to a matrix with 0 as the data argumnt, the length of the vector delta.vec as the row argument, and the length of sigma. vec as the column argument.

#call the required replicates function from the previous homework
required.replicates <- function(delta,sigma,z.alpha = 1.959964,z.beta=0.8416212){

  list(
    rslt=list(delta=delta,sigma=sigma,z.alpha=z.alpha,z.beta=z.beta) ,
    r.exact = 2 * (z.alpha + z.beta)^2 * (sigma/delta)^2,
    r.integer = ceiling(2 * (z.alpha + z.beta)^2 * (sigma/delta)^2) )

}
#create the sigma.vec sequence
sigma.vec<- seq(2,12,by=2)

#create the sequence {1,2,5,8,10}
delta.vec <- c(1,2,5,8,10)
           
#set replicates.table equal to a 5x6 zero value matrix         
replicates.table <- matrix(0,nrow=length(delta.vec),ncol=length(sigma.vec))
replicates.table
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0
## [4,]    0    0    0    0    0    0
## [5,]    0    0    0    0    0    0

Part 1. (5 points)

Create an single empty matrix, name this matrix replicates.table. To determine the dimensions, use sigma.vec and delta.vec from the previous exercise, and let these vectors determine the matrix dimensions as before.

This is the same instruction as above, no additional coding needed.

Unit test

## [1] 2
## [1] 4

Part 2. (15 points)

Fill the individual elements of the matrix by iterating over sigma.vec and delta.vec. You may use for or while constructs as you see fit. Use the required replicates function from Homework 2, Exercise 2, to calculate a value for each element in the replicates.table from the corresponding parameters in sigma.vec and delta.vec. Use the same values for \(z_{\alpha/2}\) and \(z_{\beta}\) as before.

You might have noticed from the previous exercise that some combinations of \(\sigma\) and \(\delta\) produce unreasonable number of replicates. So, inside your loops, write an if statement that checks the result of your required replicates function. If that value is less than 3, use a minimum replicates of 3 for that entry in the table. If the number is greater than 1000, use NA. (We will be assuming that any experiment with less than 3 replications will be unreliable, and any experiment with more than 1000 replications will be impractical.)

The key to this exercise is to use a for loop to iterate over each row and each column of the 5x6 matrix replicates.table. Iterating in this manner uses less memory because the sigma and delta matrixes are not stored in memory.

#loop over each row and each column of replicates table
for(i in 1:nrow(replicates.table)){
  
  for(j in 1:ncol(replicates.table)){
    
    #iterate over each row (i) and each column(j)
    replicates.table[i, j]<-required.replicates(sigma=sigma.vec[j],
                                                delta=delta.vec[i])$r.integer
    
    
    #use an ifelse statement to replace all values less than 3 with the value 3
    replicates.table <- ifelse(replicates.table<3,3,replicates.table)
    
    #use an ifelse statement to replace all values greater than 1000 with NA
    replicates.table <- ifelse(replicates.table>=1000,NA,replicates.table) 
  }}

replicates.table
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   63  252  566   NA   NA   NA
## [2,]   16   63  142  252  393  566
## [3,]    3   11   23   41   63   91
## [4,]    3    4    9   16   25   36
## [5,]    3    3    6   11   16   23

Unit test

## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16

Part 3 (5 points)

Finish this exercise by replacing the row names and column names of your matrix with text strings matching sigma and delta values, i.e. “delta = 5”, “delta = 10” so that the resulting matrix, when printed, is easier to read.

The simplest way may be to use the paste function to create a vector of names from your previous sigma and delta vectors, but you can use whatever method you wish, as long as the output is similiar to

           sigma = 2 sigma = 4  ...
delta = 1         63        NA  ...
delta = 2         16        63  ...
.
.
.

The solution is to create a vector of column and row names by using the paste function. Assigning the vectors sigma.vec and delta.vec to the third argument will replicate a vector of the same length as the vectors in the argument which will correspond with the row and column lengths.

#use paste to create a vector of column and row names
colnames <- paste("sigma =",sigma.vec) 
rownames <- paste("delta =",delta.vec)

#use th rownames and columnnames function to set the rownames and columnnames of replicates.table
rownames(replicates.table) <- rownames
colnames(replicates.table) <- colnames
replicates.table
##            sigma = 2 sigma = 4 sigma = 6 sigma = 8 sigma = 10 sigma = 12
## delta = 1         63       252       566        NA         NA         NA
## delta = 2         16        63       142       252        393        566
## delta = 5          3        11        23        41         63         91
## delta = 8          3         4         9        16         25         36
## delta = 10         3         3         6        11         16         23

This statement will print your table, we will grade this by inspection. There is no unit test for this part.

##            sigma = 2 sigma = 4 sigma = 6 sigma = 8 sigma = 10 sigma = 12
## delta = 1         63       252       566        NA         NA         NA
## delta = 2         16        63       142       252        393        566
## delta = 5          3        11        23        41         63         91
## delta = 8          3         4         9        16         25         36
## delta = 10         3         3         6        11         16         23

Exercise 2 (30 points)

We will revisit the AtmWtAg data set from the initial homework.

Part 1 (10 points)

Go to http://www.itl.nist.gov/div898/strd/anova/AtmWtAg.html and download both data tables, http://www.itl.nist.gov/div898/strd/anova/AtmWtAgt.dat and http://www.itl.nist.gov/div898/strd/anova/AtmWtAg.dat .

Edit these into files that R can read. You may create data frames in code, in this document, or read from external files of your choice, as in Homework 2, Exercise 3. However, if you create files, you must upload them to Dropbox and provide a brief description about the expected file format, in this document.

Name the first data frame (Data File in Table Format) AtmWtAg.wide and the second (Data File in Two-Column Format) AtmWtAg.long.

Create two csv fles and read them to R. Set stringsAsFactors to FALSE.

AtmWtAg.wide <- read.csv("C:/Users/Jeremiah Lowhorn/Desktop/STAT 700/Week 3/InstrumentDat.csv", stringsAsFactors=FALSE)

AtmWtAg.long <- read.csv("C:/Users/Jeremiah Lowhorn/Desktop/STAT 700/Week 3/AtmWtAg.csv",
                         stringsAsFactors=FALSE)

Unit Test

The unit test will print the first few lines of each table so that we can grade the tables by inspection.

## [1] 18
## [1] 19
## [1] 20
##            X1          X2
## 1 107.8681568 107.8681079
## 2 107.8681465 107.8681344
## 3 107.8681572 107.8681513
## 4 107.8681785 107.8681197
## 5 107.8681446 107.8681604
## 6 107.8681903 107.8681385
## [1] 22
## [1] 23
## [1] 24
##   Instrument        AgWt
## 1          1 107.8681568
## 2          1 107.8681465
## 3          1 107.8681572
## 4          1 107.8681785
## 5          1 107.8681446
## 6          1 107.8681903

Part 2 Data File in Table Format (10 points)

Print the summary of AtmWtAg.wide. Use a function from the apply family (apply, tapply, lapply,sapply) and the appropriate R functions to compute a similar set of values (column-wise mean, median, min, max and quantiles).

Note that each column in this format are observations from a single machine.

Hint - you’ll find you can do this with fewer functions than you think, if you examine the code for the default summary function. Do this by typing getS3method('summary','default') in the console. You could also try summary.data.frame.

There is no unit test for this part, we will grade this by inspection.

After inspecting the summary function is is evident that the quantile function will return all but the mean. We will use this function with lapply to create a list with the elements. Next we will use mapply to combine the two lists into one. We will then name each row with the corresponding names.

#use lapply to apply the quantile function over the AtmWtAg.wide data set
quant <- lapply(AtmWtAg.wide,quantile)
#use lapply to apply the mean function over the AtmWtAg.wide data set
mean <- lapply(AtmWtAg.wide,mean)
#use mapply to combine the two lists
Summary <- mapply(c,quant,mean)
#create a names vector for the row names
names <- c("Min.","1st Qu.","Median","3rd Qu.","Max","Mean")
#set the row names of Summary to the vector
row.names(Summary) <- names

Summary
##                    X1            X2
## Min.    107.868133300 107.868107900
## 1st Qu. 107.868146025 107.868124000
## Median  107.868151850 107.868136650
## 3rd Qu. 107.868159275 107.868148950
## Max     107.868190300 107.868164200
## Mean    107.868153767 107.868136354

Part 3 Data File in Two-Column Format (10 points)

Print the summary of AtmWtAg.long.

Use a function from the apply family (apply, tapply, lapply,sapply) and the appropriate R functions to compute a summary similar to Part 1 but using AtmWtAg.long In other words, calculate summary values by Instrument from AtmWtAg.long.

There is no unit test for this part, we will grade this by inspection.

Using the by function we will apply the quantile and mean functions to AtmWtAg.long. We use the by function here because the long format is split into factors and this function handles that for us. Once the functions are applied we follow the same process as in the previous exercise; mapply and renaming the rows.

#use by to apply quantile over AtmWtAg.long
quant <- by(AtmWtAg.long[,2],AtmWtAg.long[,1],function(x){quantile(x)})
#use by to apply mean over AtmWtAg.long
mean <- by(AtmWtAg.long[,2],AtmWtAg.long[,1],function(x){mean(x)})
#use mapply to combine the lists
Summary2 <- mapply(c,quant,mean)
#create a vector of row names
names <- c("Min.","1st Qu.","Median","3rd Qu.","Max","Mean")
#set the row names to Summary2
row.names(Summary2) <- names

Summary2
##                     1             2
## Min.    107.868133300 107.868107900
## 1st Qu. 107.868146025 107.868124000
## Median  107.868151850 107.868136650
## 3rd Qu. 107.868159275 107.868148950
## Max     107.868190300 107.868164200
## Mean    107.868153767 107.868136354

Exercise 3 (10 points)

Use reshape to convert AtmWtAg.wide to a long form and name it AtmWtAg.wide.to.long. Use reshape to convert AtmWtAg.long to the wide form and name it AtmWtAg.long.to.wide.

You may only use the reshape function, but you can append columns to either as needed for reshape to work correctly.

Comment on the difference between the two forms - what is meant by “long” versus “wide”? Can you think of a statistical analysis that would be easier to perform on the long version over the wide version? Is there a test that would be easier on the wide version?

We will use the reshape2 package to melt and cast the data into the appropriate format. To create AtmWtAg.wide.to.long we will need to melt AtmWtAg.wide, use the plyr function revalue to revalue the factors \(X1\) and \(X2\) to the appropriate instrument names, and use the plyr rename function to rename the column names of the final data.frame.

Originally I used the reshape2 package which was shown in the lecture videos so I am not going to delete that code, just update my RMD with the desired solution.

–Long Vs Wide. For data to be in long format means that the id variables or factors are within the same column with their corresponding values to be within the same vector. To be in wide format means for the id variables to be in the columns with their corresponding values to match within the table. Typically when data is in wide format it is being used for an analysis of variance procedure, a graphic, or for it just being easier to read and manipulate. When data is placed into long format it is tyipcally being used for a predictive model. Another way of thinking of the difference is that if data is in wide format each factor is distinct and have their own ditribution separate from when they were in long format.

–Reshape 2 Method To create the wide format we will again use the reshape2 package but use the dcast function. The dcast function will produce a table of 3 columns so we will remove the first column with by selecting only rows 2 & 3. The produced data.frame will have 48 rows due to the dcast function producing NA values where the instrument did not correspond with the AtmWt. To solve this we will use a method to remove the NAs from the individual columns, bind the columns back together, and set the data back to a data.frame. Once that method is complete we will then use the name function to appropriately name the columns of the resulting data.frame.

–Base R Method To convert ‘AtmWtAg.wide’ to long we must supply the ‘direction’ argument to long, ‘varying’ argument to the wide format column names, ‘v.names’ argument to ‘AgWt’ for the correct column names in the otuput, set the ‘idvar’ argument to an arbitrary name, and the timevar to ‘Instrument for the appropriate column name. The final output should eliminate the ’ID’ vector so subset that column out. To convert ‘AtmWtAg’ from long to wide we must first create a sequence of 1:48 for the ‘ID’ argument. Then we will use base R cbind to combine the vector to the ‘AtmWtAg.long’ data.frame. The arguments to be used are: ‘v.names’, varying, idvar, ids, timevar, and direction. Finally to eliminate the ‘ID’ vector we will complete a subset from columns 2 to 3.

#use melt to melt AtmWtAg.wide to long format
AtmWtAg.wide.to.long <- reshape2::melt(AtmWtAg.wide)
## No id variables; using all as measure variables
#revalue the resulting instrument factor to the appropriate names
AtmWtAg.wide.to.long$variable <- plyr::revalue(AtmWtAg.wide.to.long$variable,c("X1"="1","X2"="2"))
#rename the columns to the appropriate names
AtmWtAg.wide.to.long <- plyr::rename(AtmWtAg.wide.to.long,c("variable"="Instrument","value"="AgWt"))
#cast the AtmWtAg.long data to wide format and deselect column 1
AtmWtAg.long.to.wide <- reshape2::dcast(AtmWtAg.long,AgWt~Instrument,value.var="AgWt")[,2:3]
#remove the NA values from the individual vectors within the data.frame and reconstruct the data.frame
AtmWtAg.long.to.wide <- as.data.frame(cbind(AtmWtAg.long.to.wide[,1][!is.na(AtmWtAg.long.to.wide[,1])],
                                            AtmWtAg.long.to.wide[,2][!is.na(AtmWtAg.long.to.wide[,2])]))
#rename the column names to the appropriate headings
AtmWtAg.long.to.wide <- plyr::rename(AtmWtAg.long.to.wide,c("V1"="X1","V2"="X2"))


AtmWtAg.wide.to.long <- reshape(AtmWtAg.wide,direction = "long",
                                varying = list(names(AtmWtAg.wide)),
                                v.names = "AgWt",
                                idvar = "ID",
                                timevar = "Instrument")[,1:2]



ID <- 1:nrow(AtmWtAg.wide)
AtmWtAglong<- cbind(AtmWtAg.long,ID)
AtmWtAg.long.to.wide <- reshape(AtmWtAglong,
                                v.names="AgWt", 
                                varying=list(names(AtmWtAg.wide)),
                                idvar= "ID",
                                ids=c("1", "2"), 
                                timevar="Instrument",
                                direction="wide")[,2:3]

Unit Test

## [1] 26
##            X1          X2
## 1 107.8681568 107.8681079
## 2 107.8681465 107.8681344
## 3 107.8681572 107.8681513
## 4 107.8681785 107.8681197
## 5 107.8681446 107.8681604
## 6 107.8681903 107.8681385
## [1] 28
##     Instrument        AgWt
## 1.1          1 107.8681568
## 2.1          1 107.8681465
## 3.1          1 107.8681572
## 4.1          1 107.8681785
## 5.1          1 107.8681446
## 6.1          1 107.8681903

If you’ve used reshape as expected, the plot below will show a straight diagonal line as each in weight value in AtmWtAg.long matches exactly with the weight value in AtmWtAg.wide.to.long Note that this requires weight values are stored in both under the same column name, AgWt.

if(exists("AtmWtAg.long") && exists("AtmWtAg.wide.to.long")) {
  plot(AtmWtAg.long$AgWt,AtmWtAg.wide.to.long$AgWt)
  abline(0,1)
}

If you’ve used reshape as expected, the code below will print two summaries showing values that are identical for the two instruments.

if(exists("AtmWtAg.wide") && exists("AtmWtAg.long.to.wide")) {
  print(summary(AtmWtAg.wide))
  print(summary(AtmWtAg.long.to.wide))
}
##        X1                   X2            
##  Min.   :107.868133   Min.   :107.868108  
##  1st Qu.:107.868146   1st Qu.:107.868124  
##  Median :107.868152   Median :107.868137  
##  Mean   :107.868154   Mean   :107.868136  
##  3rd Qu.:107.868159   3rd Qu.:107.868149  
##  Max.   :107.868190   Max.   :107.868164  
##        X1                   X2            
##  Min.   :107.868133   Min.   :107.868108  
##  1st Qu.:107.868146   1st Qu.:107.868124  
##  Median :107.868152   Median :107.868137  
##  Mean   :107.868154   Mean   :107.868136  
##  3rd Qu.:107.868159   3rd Qu.:107.868149  
##  Max.   :107.868190   Max.   :107.868164

Exercise 4 (10 points)

Use your univariate function from last exercise to compute summary statistics for AtmWtAg. Compute the mean, variance and autocorrelation for both machines combined, and for each machine, individually. Assign the results for both machines to univ.both, and for machines individually to univ.instr1 and univ.instr2.

Use data columns from AtmWtAg.wide and AtmWtAg.long as parameters to your univariate function.

You may need to modify your univariate function so that it works with a vector of arbitrary length instead of a specific data table.

First we will need to call the created functions from homework 2 so they are available in the environment.

univ.sum.stats <- function(x){
  
  #Set y_2 equal to 2 to the length of the input x
  y_2 <- x[2:(length(x))]
  
  #set y_1 euqal to 1 to the length of x - 1
  y_1 <- x[1:(length(x)-1)]
  
  #calculate the mean, variance, and autoregression coefficient with the provided         
  #formulas
  bar.y <- sum(x)/length(x)
  s.2 <- sum((x-(sum(x)/length(x)))^2)/(length(x)-1)
  r_1 <- sum((y_2-bar.y)*(y_1-bar.y))/sum((x-bar.y)^2) 
  
  #create a list of the mean, variance, and autoregression coefficient
  my.list<-  list(
    bar.y = bar.y,
    s.2 = s.2,
    r_1 = r_1
  )
  
  #set the class of the above created list to univariate and call my.list
  class(my.list) <- "univariate"
  my.list
}


summary.univariate <- function(value) {
  #use the paste function to append the statements required in the exercise
  print(paste("Estimated mean is:",value$bar.y))
  print(paste("Estimated variance is:",value$s.2))
  print(paste("Estimated autocorrelation is:", value$r_1))
}

We will then use the appropriate data.frames to construct ‘univ.both’, ‘univ.instr1’, and ‘univ.instr2’.

#use the AgWt variable within AtmWtAg.long to compute the univ.sum.stats
univ.both <- univ.sum.stats(AtmWtAg.long$AgWt)
#use the X1 variable within AtmWtAg.wide to construct the instrument 1 univ.sum.stats
univ.instr1 <- univ.sum.stats(AtmWtAg.wide$X1)
#use the X2 variable within AtmWtAg.wide to construct the instrument 1 univ.sum.stats
univ.instr2 <- univ.sum.stats(AtmWtAg.wide$X2)

Unit Test

If your tables AtmWtAg.long and AtmWtAg.wide match the structure of the original data (2 columns each), then your code will pass the unit test.

## [1] 29
## [1] 30
## [1] 31
## [1] 32
## [1] 33
## [1] 34

This section of code will print the univariate analysis for the two data sets. You may paste in your own summary function, if you wish.

if(exists("univ.both") && exists("univ.instr1") && exists("univ.instr2")) {
  summary.univariate <- function(value) {
    ret <- data.frame(
      Value=c(value$bar.y,value$s.2,value$r_1)
    )
    row.names(ret) <- c("Mean","Variance","Autocorrelation")
    return(ret)
  }
  print(summary(univ.both))
  print(summary(univ.instr1))
  print(summary(univ.instr2))
}
##                             Value
## Mean            1.07868145060e+02
## Variance        3.00713080672e-10
## Autocorrelation 7.88606838040e-02
##                              Value
## Mean             1.07868153767e+02
## Variance         1.70644927533e-10
## Autocorrelation -1.39762481256e-01
##                              Value
## Mean             1.07868136354e+02
## Variance         2.85666938415e-10
## Autocorrelation -2.29530693396e-01

Total points from Unit Tests

total.points
## [1] 34