Objectives:
In this lab, we will use data from a cloud seeding experiment. In this seeding experiment, days were first pre-determined as being appropriate for seeding, and then they were randomly assigned to be either seeded or not. These data are available in a number of places: either in the HSAUR package, or at the webpage http://vincentarelbundock.github.com/Rdatasets/csv/HSAUR/clouds.csv
data("clouds", package = "HSAUR") # I chose to install the HSAUR package
ls()
## [1] "clouds"
dim(clouds) # Print the dimension (number of rows and columns)
## [1] 24 7
names(clouds) # Get the column names
## [1] "seeding" "time" "sne" "cloudcover" "prewetness"
## [6] "echomotion" "rainfall"
# Now do some simple exploratory graphics
hist(clouds$rainfall)
library(ggplot2)
ggplot(data = clouds) + geom_boxplot(aes(x = seeding, y = rainfall))
There are 24 rows in the dataset. Each row is a rain event. The variables are seeding, whether the clouds were seeded that day or not, time, the number of days from the start of the first day of the experiment, sne, a measure of suitability for seeding, cloudcover, the cloudcover percentage, prewetness, the total rainfall one hour before seeding (cubic meters times 1e8), echomotion, whether the radar echo was moving or stationary, and rainfall, the amount of rain (cubic meters times 1e8).
The purpose of this study was to assess if dispersing massive amounts of silver iodide is effective for cloud seeding. This study was performed in Florida in 1975, during a period in which 24 days were judged suitable for seeding. 12 of these days were randoly chosen to be seeded.
# Calculate the average rainfall in seeded clouds - three ways
mean(clouds$rainfall[clouds$seeding == "yes"])
## [1] 4.634
with(clouds, mean(rainfall[seeding == "yes"]))
## [1] 4.634
with(subset(clouds, seeding == "yes"), mean(rainfall))
## [1] 4.634
# Calculate the average rainfall in non-seeded clouds - three ways
with(subset(clouds, seeding == "no"), mean(rainfall))
## [1] 4.172
# Do that again, but give names
mean_yes <- with(subset(clouds, seeding == "yes"), mean(rainfall))
mean_no <- with(subset(clouds, seeding == "no"), mean(rainfall))
Under the null hypothesis, there is one common distribution, and seeding does not matter. The formula for this test is in Burt, Barber and Rigby on pages 357-358.
s_yes <- with(subset(clouds, seeding == "yes"), sd(rainfall))
s_no <- with(subset(clouds, seeding == "no"), sd(rainfall))
s_pooled <- sqrt((11 * s_yes^2 + 11 * s_no^2)/22)
# Be sure to look at the three values. How do you describe each one of
# these to someone else?
T <- (mean_yes - mean_no)/(s_pooled * sqrt(1/12 + 1/12))
This sample T-statistic is just one value that could have been obtained from repeating this experiment and if seeding has no effect. Were we to repeat this experiment, we would have gotten a different sample T-statistic. But these sample T-statistics do come from a T distribution with 22 degrees of freedom.
# What is the probability of getting a random T statistic greater than the
# sample T-statistic?
help(pt) # look at the description of the pt function!
pt(T, df = 22, lower.tail = FALSE)
## [1] 0.3621
# What is the probability of getting a random T statistic more extreme
# (above or below) the sample T-statistic
2 * pt(T, df = 22, lower.tail = FALSE)
## [1] 0.7242
# What if we use the Z statistic instead of the T statistic
2 * pnorm(T, lower.tail = FALSE)
## [1] 0.7208
Here, I'll describe a different way of calculating p-values. Rather than assuming the t-distribution or the CLT is appropriate, we'll simulate some data and calculate the means and difference from these simulations.
Assume that seeding makes no difference. If this were so, then then the label seeding is completely random. This makes it easy to simulate new data that satisfies the null hypothesis: we can simply scramble the seeded and non-seeded labels, and get a new dataset that would have been possible under the null hypothesis. The key here is that this new, simulated dataset, would have been possible if the Do this a few hundred, (or a thousand, or all 2704156) times, and we can have a distribution of mean values under the null hypothesis.
sim_treatment <- sample(clouds$seeding)
with(subset(clouds, sim_treatment == "yes"), mean(rainfall))
## [1] 4.549
with(subset(clouds, sim_treatment == "no"), mean(rainfall))
## [1] 4.257
You can repeat the simulation simply by reruniing those three lines a few times. But we could also put those lines into a loop and automate the process. The code below first sets up an empty data.frame, and then repeats the randomization process, each time storing the mean with seeding and without.
N <- 1000
sim_data <- as.data.frame(matrix(0, N, 2))
names(sim_data) <- c("yes", "no")
# Now, we will loop through each rows of this new data.frame, each time
# resampling the data, and recalculating the treatment and control means
for (i in 1:N) {
sim_treatment <- sample(clouds$seeding)
sim_data[i, "yes"] <- with(subset(clouds, sim_treatment == "yes"), mean(rainfall))
sim_data[i, "no"] <- with(subset(clouds, sim_treatment == "no"), mean(rainfall))
}
# Calculate the difference between seeded and unseeded clouds
sim_data$diff <- sim_data$yes - sim_data$no
# Draw a simple histogram of the simulated differences, and include the
# difference as a vertical line
ggplot(data = sim_data) + geom_histogram(aes(x = diff), fill = "grey70") + geom_vline(aes(xintercept = mean_yes -
mean_no))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.