Instructions

The purpose of this homework is to refresh concepts learned in previous statistics courses. In addition to submitting an Rmarkdown generated pdf document, you must also submit an R file named ts2test.R. This .R file must contain only the function that you have written for the assignment along with a commented section at the top that contains your name, date, and assignment number. It must not contain any R analysis. The due date is Monday, Jan 26th by 5pm as a Canvas upload (of both files). Only one .pdf (or .html) and one .R file will be accepted. No two students will program exactly alike. So it is very easy to catch cheaters.

General homework guidelines

Use complete sentences and show a “reasonable” amount of algebraic work. (Remember I grade the entire solution not just the answers.) Final answers should include the correct units. Remember to use complete sentences/equations/etc, and context should be clear (answer the word problem). Your work should be rather professional. Do the problems in order. You will be docked points for the overall assignment if the homework isn’t properly organized. Any plots you make should be large enough so that the important aspects can be clearly seen. Every plot should include the appropriate x- and y- labels, main titles, and any other important labels such as the scale or group identifier.

If asked about the probability of a particular event, you should write, e.g., \[ \mathbb{P}(\mbox{event name or label}) = \mbox{appropriate amount of work} = \mbox{answer}. \]

You are encouraged to work with other students in the course. You are free to discuss these problems with each other, however this is not permission to copy each others work. By participating in this class, you must be aware of the course policies, CMDA Integrity Statement, and the university policies with regarding the VT Honor Code.

In particular, you acknowledge and support the statement: “I have neither given nor received unauthorized assistance on this assignment. The work I am presenting is ultimately my own.”

Problem 0: ts2test()

Write a function in R called ts2test(), where “ts2” (read as \((\mathrm{ts})^2\)) stands for two-sample and two-sided, that implements functionality similar to parts of the t.test() function in R. A function prototype with default arguments is suggested below.

ts2test <- function(x, y, conf.level=0.95, type="welch", N=0)
 {
  if ( type == "welch"){
    ybar <- mean(y)
    ny <- length(y)
    xbar <- mean(x)
    nx <- length(x)
    yvar <- var(y)
    xvar <- var(x)

    ypart <- yvar/ny
    xpart <- xvar/nx

    tsat <- (ybar - xbar) / sqrt((ypart + xpart))
    num <- (ypart + xpart)^2
    den <- ((ypart)^2/(ny-1))+((xpart)^2/(nx-1))
    dof <- num/den 
    
    pval <- 2 * pt(-abs(tsat), dof)
    
    
    ret <- list(tval = tsat, pval = pval, dof = dof)
    
  }
  else if (type == "paired"){
    d <- x - y
    dbar <- mean(d)

    sd2 <- var(d)
    nd <- length(d)
    dof = nd - 1

    tsat <- (dbar - 0)/sqrt(sd2/nd)
    pval <- pt(-abs(tsat), nd - 1) + pt(abs(tsat), nd - 1, lower.tail = FALSE)

    if (N > 0) {
      sigma2 <- (nd-1)*sd2/ rchisq(N, nd-1)
      dbars <- rnorm (N, 0, sqrt(sigma2/nd))
      print(dbars)
      pval <- (mean(dbars <- -abs(dbar)) + mean(dbars > abs(dbar)))
      dof = NULL
    }
    ret <- list(estimate = dbar, tval = tsat, pval = pval, df = dof)
  }
  else if (type == "pooled"){
    ybar <- mean(y)
    ny <- length(y)
    xbar <- mean(x)
    nx <- length(x)
    yvar <- var(y)
    xvar <- var(x)
  
    dof <- nx + ny - 2
    sp <- (((ny-1)/(dof))* yvar + ((nx-1)/(dof))* xvar)
    tsat <- (ybar - xbar)/ sqrt (sp *((1/ny) + (1/nx)))
    pval <- 2 * pt(-abs(tsat), dof)
  
    ret <- list(tval = tsat, pval = pval, df = dof)
  }
  return(ret)
}

Arguments will be discussed in detail momentarily. We must be able to call your function simply by typing the following (or similar, depending on options and variable names), or through defaults:

ts2test(x=vector1, y=vector2, conf.level=0.9, type="pooled", N=1000)
ts2test(vector1, vector2)

Like t.test() the function should provide test results, but confidence intervals are optional. Unlike t.test(), there is no alternative argument. We will only do two-sided tests. The conf.level argument is the same as t.test(). The options for type are "welch", "pooled" (which is like var.equal in t.test(), and, "paired" only, for Welch’s, Pooled, and Paired \(t\)-tests, respectively. Positive integer arguments N indicate a number of Monte Carlo samples for a numerical test; the default of N=0 indicates an exact \(t\)-test.

Your function should return list whose appropriately named entries (possibly including sub-lists) include all the important results of the hypothesis test (and confidence intervals):

  • test statistic with a reminder of test type and degrees of freedom unless N > 0
  • \(p\)-value or approximate \(p\)-value if N > 0 (including a reminder of N in this case)
  • outcome of the test at the desired confidence level
  • optionally, lower and upper bound of the confidence interval (including a reminder of conf.level)

While you are writing your function, and when in doubt, look at the output from the t.test() function as a minimal guide and to check your answers. To help you and assess whether your function works, you will apply your function towards solving the following problems below. You must show that you used your function to get the correct answer and you must provide both N=0 and N=1000000 solutions, showing that they are similar.

Problem 1: treatment v. control

A new post-surgical treatment is being compared with a standard treatment. Seven subjects receive the new treatment, while seven others (the controls) receive the standard treatment. The recovery times, in days, are given below.

treatment <- c(12, 13, 15, 19, 20, 21, 24) 
control <- c(18, 23, 24, 30, 32, 35, 39)
data.frame(treatment, control)
##   treatment control
## 1        12      18
## 2        13      23
## 3        15      24
## 4        19      30
## 5        20      32
## 6        21      35
## 7        24      39

Can you conclude that mean recovery time for treated patience different (hopefully less) than that of the control group? When ansewring, condider …

  • Which method should be used for conducting hypothesis tests and generating confidence intervals for this data? Why?

  • I used the Paired T-test method for this data because the data from the control group is paired with the group with the treated patients

  • What assumptions need to be checked in order to justify the use of the statistical method? Do this and include any appropriate plots and output from software.

  • I need to assure that the paired differences follow a normal distribution

    paired_diff <-  treatment - control
    hist(paired_diff)

  • Conduct the appropriate test (and optionally confidence interval) at the 90% level.
    t.test(treatment, control, paired = TRUE)
## 
##  Paired t-test
## 
## data:  treatment and control
## t = -9.5263, df = 6, p-value = 7.633e-05
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -13.82545  -8.17455
## sample estimates:
## mean difference 
##             -11
    ts2test(treatment, control, type = "paired")
## $estimate
## [1] -11
## 
## $tval
## [1] -9.526279
## 
## $pval
## [1] 7.633413e-05
## 
## $df
## [1] 6
    ts2test(treatment, control,conf.level = 0.9, type = "paired", N = 10)
##  [1] -0.2797741  0.7693689 -0.3422884  0.4965776  0.4309616  4.3627372
##  [7] -0.4062469  0.6219027  0.5740814 -2.3240585
## $estimate
## [1] -11
## 
## $tval
## [1] -9.526279
## 
## $pval
## [1] -11
## 
## $df
## NULL

Problem 2: fattening beef steers

In an experiment to compare two diets for fattening beef steers, nine pairs of animals were chosen from the heard; members of each pair were matched as closely as possible with respect to hereditary factors. The members of each pair were randomly allocated, one to teach diet. The following table shows the weight gains (lb) of the animals over a 140-day test period on Diet 1 \((X)\) and on Diet 2 \((Y)\).

one <- c(596, 422, 524, 454, 538, 522, 478, 654, 556)
two <- c(498, 460, 468, 458, 530, 482, 528, 598, 456)
data.frame(one, two)
##   one two
## 1 596 498
## 2 422 460
## 3 524 468
## 4 454 458
## 5 538 530
## 6 522 482
## 7 478 528
## 8 654 598
## 9 556 456
  • What assumptions need to be checked in order to justify the use of the statistical method? Do this and include any appropriate plots and output from software.

    In order to use Welch’s test, it is assumed that the variances are not equal. And by looking at the box plots below it appears that the variances are not equal at all.

  boxplot(one, two)

  • Conduct a hypothesis test/build intervals at the 95% level to determine whether the wear diets are equally effective. If you reject that hypothesis, which diet is more effective? State your conclusion to the study appropriately.
  t.test(treatment, control, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  treatment and control
## t = -3.3724, df = 9.863, p-value = 0.007231
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.281488  -3.718512
## sample estimates:
## mean of x mean of y 
##  17.71429  28.71429
  ts2test(control, treatment, type = "welch")
## $tval
## [1] -3.372353
## 
## $pval
## [1] 0.007230533
## 
## $dof
## [1] 9.863025

Because the P-value of 0.007231 is less than 0.05 we can reject the null hypothesis which implies that wear diets are not as equally effective. Based on the negative t-value we can conclude that the control group is better at reducing wear.

Problem 3: tablet dissolution time

In a study of the relationship of the shape of a tablet to its dissolution time, 6 disk-shaped ibuprofen tablets and 8 oval-shaped ibuprofen tablets were dissolved in water. The dissolve times, in seconds, were as follows:

disk <- c(269.0, 249.3, 255.2, 252.7, 247.0, 261.6)
oval <- c(268.8, 260.0, 273.5, 253.9, 278.5, 289.4, 261.6, 280.2)
boxplot(disk, oval)

Can you conclude that the mean dissolve time is different for disk shaped tablets than for mean dissolve time for oval shaped tablets? Assume that the two samples come from normal distributions and \(\sigma_{\mathrm{disk}}^2 = \sigma_{\mathrm{oval}}^2\). Carry out the appropriate test/intervALS at the \(\alpha=0.05\) (95%) level.

t.test(disk, oval, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  disk and oval
## t = -2.6279, df = 12, p-value = 0.02206
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -27.322157  -2.552843
## sample estimates:
## mean of x mean of y 
##  255.8000  270.7375
ts2test(oval, disk, type = "pooled")
## $tval
## [1] -2.62793
## 
## $pval
## [1] 0.02205977
## 
## $df
## [1] 12

based on the results of the pooled t-test we can conclude that based on the p-value of 0.02206 being lower than 0.05, we can reject the null hypothesis. Because of the negative t-value, it shows that disks are more likely to dissolve faster than ovals.