In Topic 6 we extended our understanding of \(t\)-tests. We will now practice carrying out these new tests, using real data.
🏡 For this question, we will once again be assessing the wonions data set on White Imperial Spanish onion plants from the sm R package (Bowman and Azzalini 2021).
In Computer Lab 6, we assessed this data set of size \(n=84\) by focusing on two variables:
Yield (in grams per plant), and
Density of planting (in plants per square metre).
The wonions data set also contains a third variable, Locality, which denotes whether the onion planting occurred in Purnong Landing (1) or in Virginia (2). We ignored Locality in our previous analysis, but it may be beneficial to reassess the data to see if the onions planted in the two locations have statistically significant yield differences.
To do this, we can carry out an independent samples \(t\)-test.
🏡 Open RStudio and create a new script file, and run the code below to get started:
# install.packages("sm") # uncomment and run this line if the sm package is not installed
library(sm) # Load sm package
data(wonions) # Load onions data
🏡 Suppose that we want to determine if there is a statistically significant difference in the average yield between White Imperial Spanish (WIS) onions planted in Purnong Landing and Virginia.
Using this notation, define appropriate null and alternative hypotheses for our test.
Hint: There are actually two ways in which we can write the hypotheses - can you think of both ways? Check the Topic 6 readings if you are unsure.
🏡 Our two variables are Yield and Locality. Define which one is the dependent variable and which one is the independent variable for our test.
💻 As discussed in Computer Lab 6, it is generally a good idea to perform some exploratory data analysis before proceeding with your hypothesis testing.
To begin, let’s compute some Yield summary statistics for both groups.
Hint: Use the subscripts 1 (for Purnong Landing) and 2 (for Virginia) in your notation to differentiate your answers (e.g. use \(n_1\) for the Purnong Landing sample size).
💻 Compute the sample sizes for both groups.
Hint: Try using the table function.
💻 Compute the Yield sample means and standard deviations for both groups, using the tapply function (one of several functions from the apply family of functions).
Note: The tapply function can be very useful when we are assessing different categories within a data set. The function has three main arguments:
X: A vector of data - this is what we would like to assess
INDEX: A list of factors - this specifies the different categories within our data
FUN: A function to apply to the data that has been specified in X and grouped based on the categories in INDEX
Hint: Take a look at the R code below for an example use of the tapply function:
tapply(X, INDEX = wonions$Locality, FUN = min)
# You will need to replace the `X` here with the appropriate variable
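As a rough illustration of how table and tapply behave, here is a small sketch on a different data set. It assumes the built-in ToothGrowth data (tooth length len, grouped by supplement type supp), which is not part of this lab and is used here only so that the onion questions are left for you to answer.

```r
# Illustrative only: ToothGrowth is a built-in R data set, not the onion data
data(ToothGrowth)

# Group sizes: one count per level of the grouping variable
group_n <- table(ToothGrowth$supp)

# Group means and standard deviations of tooth length, split by supplement type
group_means <- tapply(ToothGrowth$len, INDEX = ToothGrowth$supp, FUN = mean)
group_sds   <- tapply(ToothGrowth$len, INDEX = ToothGrowth$supp, FUN = sd)

print(group_n)
print(round(cbind(mean = group_means, sd = group_sds), 2))
```

The same pattern should carry over to any numeric variable paired with a grouping variable: only X and INDEX change.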
💻 To visualise the Yield data, create separate histograms and box plots for each group.
Note: Using the R code in the code chunk below as a guide, try to plot both box plots in the one graph - you will have to fill in the ...s and add some additional details such as a title and axes labels.
Hint: You may be able to re-purpose some of your code from the previous computer lab for this question.
boxplot(... ~ wonions$Locality, #note that we are separating the observations by Locality
col = c("orange","chartreuse3"))
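For comparison, the sketch below draws per-group histograms and a combined box plot for a different data set. It again assumes the built-in ToothGrowth data rather than the onion data, and the titles and axis labels are placeholders chosen for this illustration.

```r
# Illustrative only: ToothGrowth is a built-in R data set, not the onion data
data(ToothGrowth)

# Two histograms side by side, one per group, selected by logical subsetting
old_par <- par(mfrow = c(1, 2))
hist(ToothGrowth$len[ToothGrowth$supp == "OJ"],
     main = "Supplement OJ", xlab = "Tooth length")
hist(ToothGrowth$len[ToothGrowth$supp == "VC"],
     main = "Supplement VC", xlab = "Tooth length")
par(old_par)  # restore the previous plotting layout

# Both box plots in one graph: the formula splits len by the levels of supp
bp <- boxplot(len ~ supp, data = ToothGrowth,
              main = "Tooth length by supplement",
              xlab = "Supplement", ylab = "Tooth length",
              col = c("orange", "chartreuse3"))
```

Using the data argument with a formula avoids repeating the data frame name, although the wonions$... style used in this lab works just as well.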
💻 Before we conduct an independent samples \(t\)-test, it is extremely important that we check that the assumptions of the independent samples \(t\)-test are satisfied.
Just as for the one-sample \(t\)-test, we need to check that:
The independent samples \(t\)-test also has a fourth assumption:
💻 We know that the data are numeric (1), and that the observations are independent (2). Therefore, we next need to check the normality assumption (3), which we can do using the Shapiro-Wilk Normality test.
Use the shapiro.test function to carry out the Shapiro-Wilk Normality test on the Yield data.
Note: We need to check for normality for both Locality regions individually. We can achieve this using subsetting.
Hint: Take a look at the R code below if you’d like some help getting started. Note that you will need to specify the variable being assessed.
shapiro.test(wonions$...[wonions$Locality == 1])
# Here we perform the shapiro test on an unspecified variable
# - you will need to replace the `...`
# with the appropriate variable.
# We use the code [wonions$Locality == 1]
# to specify that we are only
# interested in those onions from Locality 1
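The subsetting idea can be sketched on another data set as follows. This assumes the built-in ToothGrowth data (not the onion data), and the object names sw_oj, sw_vc and p_by_group are hypothetical names chosen for this illustration.

```r
# Illustrative only: ToothGrowth is a built-in R data set, not the onion data
data(ToothGrowth)

# Run the Shapiro-Wilk test separately on each group, using logical subsetting
sw_oj <- shapiro.test(ToothGrowth$len[ToothGrowth$supp == "OJ"])
sw_vc <- shapiro.test(ToothGrowth$len[ToothGrowth$supp == "VC"])

# Collect the two p-values for a side-by-side comparison
print(c(OJ = sw_oj$p.value, VC = sw_vc$p.value))

# An equivalent approach: let tapply do the splitting for us
p_by_group <- tapply(ToothGrowth$len, ToothGrowth$supp,
                     function(x) shapiro.test(x)$p.value)
print(p_by_group)
```

A small p-value suggests evidence against normality for that group, so each group's result should be read on its own.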
💻 We can check the equal variances assumption (4) for the Yield data split by Locality using Levene’s test for equality of variances.
You will need to load the car package in R to use this test. Fill out the missing details (the …) in the R code below to run this test.
library(car)
leveneTest(wonions$... ~ as.factor(wonions$...))
Based on this test, what is your conclusion?
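To see what a Levene-type test is doing, the sketch below computes one by hand in base R. It assumes the built-in ToothGrowth data rather than the onion data, and it centres each group on its median, which, as far as we are aware, matches the default centring used by leveneTest. Treat it as an illustration of the idea, not a replacement for the car function.

```r
# Illustrative only: ToothGrowth is a built-in R data set, not the onion data
data(ToothGrowth)
y <- ToothGrowth$len
g <- ToothGrowth$supp  # already stored as a factor

# Absolute deviation of each observation from its own group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the absolute deviations:
# a large F (small p-value) suggests the group spreads differ
levene_by_hand <- anova(lm(abs_dev ~ g))
print(levene_by_hand)
```

The null hypothesis here is that the groups have equal variances, so a large p-value gives no evidence against the equal variances assumption.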
💻 Regardless of your answers to 1.4.1 and 1.4.2 above, for the remainder of this question assume that all 4 independent samples \(t\)-test assumptions are satisfied.
We can now carry out an independent samples \(t\)-test. In R, we can do this using the t.test function, just as for the one-sample \(t\)-test, although we need to be careful with how we specify our arguments for this function.
Run the R code below:
t.test(wonions$Yield ~ wonions$Locality, var.equal = TRUE)
Note here that we have included the argument var.equal = TRUE since we are conducting an independent samples \(t\)-test with samples that we assume have equal variances. If we had concluded in 1.4.2 that the variances were unequal, we could still conduct our test, but would change this argument to var.equal = FALSE (R then runs Welch's version of the test, which does not pool the variances).
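The effect of var.equal can be illustrated by computing the pooled test statistic by hand and comparing it with the t.test output. The sketch below assumes the built-in ToothGrowth data rather than the onion data, and the object names (sp2, t_by_hand, pooled, welch) are hypothetical names chosen for this illustration.

```r
# Illustrative only: ToothGrowth is a built-in R data set, not the onion data
data(ToothGrowth)
x1 <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
x2 <- ToothGrowth$len[ToothGrowth$supp == "VC"]
n1 <- length(x1)
n2 <- length(x2)

# Pooled variance: a weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)

# Pooled test statistic and its degrees of freedom
t_by_hand  <- (mean(x1) - mean(x2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df_by_hand <- n1 + n2 - 2

# The same test via t.test, plus the unequal-variances version for comparison
pooled <- t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)
welch  <- t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE)

print(c(by_hand = t_by_hand, t_test = unname(pooled$statistic)))
print(c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter)))
```

The pooled test uses \(n_1 + n_2 - 2\) degrees of freedom, while the unequal-variances version adjusts the degrees of freedom downwards, which is why its value is usually not a whole number.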
🏡 Interpret the output of the independent samples \(t\)-test, and note down the test statistic value, the \(p\)-value, the degrees of freedom, the sample means, and the \(95\%\) confidence interval for the difference.
🏡 Explain, in your own words, what the \(95\%\) confidence interval tells us.
🏡 Write a short conclusion summarising this test and your findings.
🏡 For this question, we will consider data collected by Cornell Professor of Nutrition David Levitsky on students’ weight gains over their first year of college (DASL 2021). A random sample of \(68\) students from varying backgrounds was selected, and their weights (in pounds) were measured at the start of semester and again 12 weeks later, at the end of semester. These data are available in the freshman-15.txt file on LMS.
Since we have paired data (start of semester and end of semester), an independent samples \(t\)-test is not appropriate here. Instead, we can conduct a paired \(t\)-test.
🏡 Download the freshman-15.txt file from the LMS, and load the data into R using the following code (make sure you save the file to your current R working directory):
freshman <- read.table(file = "freshman-15.txt", header = TRUE)
🏡 Suppose that we would like to know whether the average difference in the weights of students before and after a semester of college is statistically significant. To determine this, we can carry out a paired \(t\)-test.
Using this notation, define the null and alternative hypotheses for this paired \(t\)-test.
Hint: Check the Topic 6 readings if you are unsure.
🏡 What are the dependent and independent variables for this test?
💻 As a first step to our analysis, we should visualise our data:
🏡 By comparing the histograms for the initial and final weights, what do you observe?
🏡 We should also look at some basic descriptive statistics. Using appropriate R functions, compute the sample means and standard deviations of the initial, final, and paired difference weights.
Comment on your findings.
🏡 Before we conduct our test, we need (as always) to check our test assumptions.
Remember, with a paired \(t\)-test, our variable of interest is not the before and after weights themselves, but rather the paired differences. Therefore, when checking the paired \(t\)-test normality assumption, we have to be very careful that we are assessing the right values.
In R, add an extra column of data to the freshman object, and store in this column the paired difference values between each individual’s initial and final weight. Label this column Paired.Difference.
Check the code chunk below if you’d like some help getting started.
freshman$Paired.Difference <- ... # Replace the ... with an appropriate expression
💻 Create the following graphs for the paired difference data:
What do you observe?
🏡 We can see that the data are numeric, and we are told that the observations are independent. Therefore, all that remains to confirm for our test assumptions is that the sample mean of the paired differences, \(\overline{X}\), is normally distributed.
Carry out a Shapiro-Wilk test for our test variable. Based on the result of this test, and your previous findings, what do you conclude?
💻 For the remainder of this question, regardless of your results in 2.4.1 and 2.4.2, we will proceed under the assumption that all the assumptions for the paired \(t\)-test have been met.
Carry out a paired \(t\)-test in R, using the code below as a guide. Note that we can still use the t.test function for this test (as long as we use the correct arguments). You will have to fill out the missing details (the …), just as in 1.4.2.
t.test(freshman$..., freshman$..., paired = TRUE)
🏡 Interpret the output of the test, and note down the test statistic, the degrees of freedom, the \(p\)-value, the mean of the differences, and the \(95\%\) confidence interval.
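If you would like to see where each of these quantities lives in the t.test output, the sketch below runs a paired test on a different data set and extracts the components one by one. It assumes the built-in sleep data (extra hours of sleep for the same 10 patients under two drugs), not the freshman data, and the object names are hypothetical.

```r
# Illustrative only: sleep is a built-in R data set, not the freshman data
data(sleep)
drug1 <- sleep$extra[sleep$group == 1]  # measurements under drug 1
drug2 <- sleep$extra[sleep$group == 2]  # same patients, in the same order, under drug 2

paired_fit <- t.test(drug1, drug2, paired = TRUE)

# Each piece of the printed output is stored as a named component
print(paired_fit$statistic)  # test statistic
print(paired_fit$parameter)  # degrees of freedom
print(paired_fit$p.value)    # p-value
print(paired_fit$estimate)   # mean of the paired differences
print(paired_fit$conf.int)   # confidence interval (95% by default)
```

A paired test only makes sense when the two vectors are in matching order, so that element i of each vector belongs to the same individual.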
🏡 Explain, in your own words, what the \(95\%\) confidence interval tells us. Based on this confidence interval, would you reject your null hypothesis?
🏡 Write a short conclusion summarising this test and your findings.
Extension: Carry out a one-sample \(t\)-test for the freshman data, testing whether the paired differences are different from 0. What are your findings? Do you notice anything interesting about the \(t\)-test output?
💻 The effect size, or relative size of the difference between means, is a useful statistic that can be computed for any \(t\)-test. We can use the cohen.d function from the effsize R package to compute the effect sizes for our \(t\)-tests.
To begin, run the R code below:
install.packages("effsize")
library(effsize)
?cohen.d
Hint: It may be helpful to refer to Section 3 of the Topic 6 readings when answering these questions.
💻 A help file should have appeared in the bottom right section of RStudio when you ran the code in 3. Let’s take a look at the details in this help file on the composition of the cohen.d function.
This function can take several arguments - we’ll cover the pertinent ones here:
NA (for a one-sample \(t\)-test), a factor with two levels (for an independent samples \(t\)-test), or the control group data (for a paired \(t\)-test).
mu: The value of \(\mu\) under \(H_0\). Use if you are interested in a single sample effect size.
paired: If the data we are assessing is paired, we set paired = TRUE. Otherwise we use paired = FALSE.
within: This is only used for paired \(t\)-tests. Set this to within = FALSE when conducting paired \(t\)-tests, and don’t include this argument for other tests.
💻 To begin, let’s consider a simple one-sample \(t\)-test. Suppose that we are just assessing the Yield variable of the wonions data set, and are not considering Locality. We run a one-sample \(t\)-test, where we have \(H_0: \mu = 110\) vs \(H_1: \mu \neq 110\). To check the effect size between the sample mean and the mean under \(H_0\), we can use the following code:
cohen.d(wonions$Yield, NA, mu = 110)
Run this code now. You should obtain an estimate of roughly \(0.1828\). This is considered a negligible, or very small effect.
Make sure you check over the arguments used here, and refer back to the details provided in 3.1 before continuing.
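To make the one-sample effect size less of a black box, the sketch below computes it by hand as the difference between the sample mean and the hypothesised mean, divided by the sample standard deviation. We assume this is the quantity cohen.d reports in the single-sample case. The example uses the built-in ToothGrowth data rather than the onion data, and mu0 = 18 is a hypothetical value chosen only for illustration.

```r
# Illustrative only: ToothGrowth is a built-in R data set, not the onion data
data(ToothGrowth)
x   <- ToothGrowth$len
mu0 <- 18  # hypothetical mean under H0, chosen only for this illustration

# One-sample Cohen's d: how many standard deviations the sample mean is from mu0
d_one_sample <- (mean(x) - mu0) / sd(x)
print(d_one_sample)
```

Because the difference is scaled by the standard deviation, the result is unit-free, which is what allows effect sizes from different studies to be compared.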
💻 Now that you have been introduced to the cohen.d function, see if you can use it to compute the effect size for the independent samples \(t\)-test conducted in 1.5. Recall that for this test we are assessing Yield between the two Locality options. Note that Locality should be treated as a factor variable here.
Interpret the effect size.
Check the code chunk below if you’d like some help getting started. You will need to replace the ...s.
cohen.d(..., as.factor(...), mu = ...)
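For two independent groups, the conventional definition of Cohen's \(d\) divides the difference in sample means by the pooled standard deviation. The sketch below computes this by hand, assuming the built-in ToothGrowth data rather than the onion data, and assuming that this pooled-SD form is what cohen.d reports with its default settings. It also shows how \(d\) relates to the pooled test statistic.

```r
# Illustrative only: ToothGrowth is a built-in R data set, not the onion data
data(ToothGrowth)
x1 <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
x2 <- ToothGrowth$len[ToothGrowth$supp == "VC"]
n1 <- length(x1)
n2 <- length(x2)

# Pooled standard deviation across the two groups
sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))

# Two-sample Cohen's d: difference in means in units of the pooled SD
d_two_sample <- (mean(x1) - mean(x2)) / sp

# The pooled t statistic is d rescaled by the sample sizes
t_pooled <- unname(t.test(x1, x2, var.equal = TRUE)$statistic)
print(c(d = d_two_sample, d_from_t = t_pooled * sqrt(1 / n1 + 1 / n2)))
```

Unlike the test statistic, \(d\) does not grow with the sample size, so a statistically significant result can still correspond to a small effect.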
💻 Compute the effect size for the paired \(t\)-test conducted in 2.5. Interpret this effect size.
Use the R code provided below as a starting point. You will have to fill in the ... sections.
cohen.d(..., ..., paired = ..., within = FALSE)
Bowman, A. W., and A. Azzalini. 2021. sm: Nonparametric Smoothing Methods (Version 2.2-5.7). University of Glasgow, UK; Università di Padova, Italia. http://www.stats.gla.ac.uk/~adrian/sm/.
These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.