Geography 415 Lab 3

Objectives:

Gain practice setting up a hypothesis test
Learn about “resampling'' as an alternative to resampling

In this lab, we will use data from a cloud seeding experiment. In this seeding experiment, days were first pre-determined as being appropriate for seeding, and then they were randomly assigned to be either seeded or not. These data are available in a number of places: either in the HSAUR package, or at the webpage http://vincentarelbundock.github.com/Rdatasets/csv/HSAUR/clouds.csv

data("clouds", package = "HSAUR")  # I chose to install the HSAUR package
ls()

## [1] "clouds"

dim(clouds)  # Print the dimension (number of rows and columns)

## [1] 24  7

names(clouds)  # Get the column names

## [1] "seeding"    "time"       "sne"        "cloudcover" "prewetness"
## [6] "echomotion" "rainfall"

# Now do some simple exploratory graphics
hist(clouds$rainfall)

plot of chunk data_load

library(ggplot2)
ggplot(data = clouds) + geom_boxplot(aes(x = seeding, y = rainfall))

plot of chunk data_load

There are 24 rows in the dataset. Each row is a rain event. The variables are seeding, whether the clouds were seeded that day or not, time, the number of days from the start of the first day of the experiment, sne, a measure of suitability for seeding, cloudcover, the cloudcover percentage, prewetness, the total rainfall one hour before seeding (cubic meters times 1e8), echomotion, whether the radar echo was moving or stationary, and rainfall, the amount of rain (cubic meters times 1e8).

Calculating the test statistic

The purpose of this study was to assess if dispersing massive amounts of silver iodide is effective for cloud seeding. This study was performed in Florida in 1975, during a period in which 24 days were judged suitable for seeding. 12 of these days were randoly chosen to be seeded.

# Calculate the average rainfall in seeded clouds - three ways
mean(clouds$rainfall[clouds$seeding == "yes"])

## [1] 4.634

with(clouds, mean(rainfall[seeding == "yes"]))

## [1] 4.634

with(subset(clouds, seeding == "yes"), mean(rainfall))

## [1] 4.634


# Calculate the average rainfall in non-seeded clouds - three ways
with(subset(clouds, seeding == "no"), mean(rainfall))

## [1] 4.172


# Do that again, but give names
mean_yes <- with(subset(clouds, seeding == "yes"), mean(rainfall))
mean_no <- with(subset(clouds, seeding == "no"), mean(rainfall))

Calculating the p-value

Under the null hypothesis, there is one common distribution, and seeding does not matter. The formula for this test is in Burt, Barber and Rigby on pages 357-358.


s_yes <- with(subset(clouds, seeding == "yes"), sd(rainfall))
s_no <- with(subset(clouds, seeding == "no"), sd(rainfall))
s_pooled <- sqrt((11 * s_yes^2 + 11 * s_no^2)/22)
# Be sure to look at the three values.  How do you describe each one of
# these to someone else?

T <- (mean_yes - mean_no)/(s_pooled * sqrt(1/12 + 1/12))

This sample T-statistic is just one value that could have been obtained from repeating this experiment and if seeding has no effect. Were we to repeat this experiment, we would have gotten a different sample T-statistic. But these sample T-statistics do come from a T distribution with 22 degrees of freedom.

# What is the probability of getting a random T statistic greater than the
# sample T-statistic?
help(pt)  # look at the description of the pt function!
pt(T, df = 22, lower.tail = FALSE)

## [1] 0.3621


# What is the probability of getting a random T statistic more extreme
# (above or below) the sample T-statistic
2 * pt(T, df = 22, lower.tail = FALSE)

## [1] 0.7242


# What if we use the Z statistic instead of the T statistic
2 * pnorm(T, lower.tail = FALSE)

## [1] 0.7208

Re-randomization

Here, I'll describe a different way of calculating p-values. Rather than assuming the t-distribution or the CLT is appropriate, we'll simulate some data and calculate the means and difference from these simulations.

Assume that seeding makes no difference. If this were so, then then the label seeding is completely random. This makes it easy to simulate new data that satisfies the null hypothesis: we can simply scramble the seeded and non-seeded labels, and get a new dataset that would have been possible under the null hypothesis. The key here is that this new, simulated dataset, would have been possible if the Do this a few hundred, (or a thousand, or all 2704156) times, and we can have a distribution of mean values under the null hypothesis.

sim_treatment <- sample(clouds$seeding)
with(subset(clouds, sim_treatment == "yes"), mean(rainfall))

## [1] 4.549

with(subset(clouds, sim_treatment == "no"), mean(rainfall))

## [1] 4.257

You can repeat the simulation simply by reruniing those three lines a few times. But we could also put those lines into a loop and automate the process. The code below first sets up an empty data.frame, and then repeats the randomization process, each time storing the mean with seeding and without.

N <- 1000
sim_data <- as.data.frame(matrix(0, N, 2))
names(sim_data) <- c("yes", "no")
# Now, we will loop through each rows of this new data.frame, each time
# resampling the data, and recalculating the treatment and control means
for (i in 1:N) {
    sim_treatment <- sample(clouds$seeding)
    sim_data[i, "yes"] <- with(subset(clouds, sim_treatment == "yes"), mean(rainfall))
    sim_data[i, "no"] <- with(subset(clouds, sim_treatment == "no"), mean(rainfall))
}

# Calculate the difference between seeded and unseeded clouds
sim_data$diff <- sim_data$yes - sim_data$no

# Draw a simple histogram of the simulated differences, and include the
# difference as a vertical line
ggplot(data = sim_data) + geom_histogram(aes(x = diff), fill = "grey70") + geom_vline(aes(xintercept = mean_yes - 
    mean_no))

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.

plot of chunk unnamed-chunk-2

Homework

Given the description of the experiment, state the null hypothesis that should be used when anaylzing the result data.
In the re-randomization, you can see the range of possible values for the mean with and without seeding if you produce a histogram. Is is symmetric? Where is it centered? Where is the observed test statistic in relation to this distribution? Informally, are the data inconsistent with the null hypothesis, or is it possible for the null hypothesis to have generated the observed data?
To compute Fisher's p-value for this data with respect to the re-randomization experiment, you need to calculate how rare the observed data are. Calculate the fraction of randomized differences that are greater (more extreme) than the observed difference between seeded and unseeded clouds. Compare the p-value obtained from the randomization method with the p-value obtained from assuming the t-distribution is appropriate. Are they similar?
Provide a brief writeup describing the results (i.e. how they were obtained, and how they should be interpreted). Be sure to describe how the two methods (t-statistic vs re-randomization) are complementary or contradictory