The excel file clouds.csv contains information about a set of cloud seeding experiments. The data were collected in the summer of 1975 from an experiment to investigate the use of massive amounts of silver iodide 100 to 1000 grams per cloud) in cloud seeding to increase rainfall. Our interest is to know whether or not seeding increases the rainfall from a cloud. The file contains the following information:
Install proper packages:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(ggpubr)
Create individual vectors for later use that separate seeding experiments from non-seeding experiments:
clouds = read.csv("clouds.csv")
str(clouds)
## 'data.frame': 24 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ seeding : chr "no" "yes" "yes" "no" ...
## $ time : int 0 1 3 4 6 9 18 25 27 28 ...
## $ sne : num 1.75 2.7 4.1 2.35 4.25 1.6 1.3 3.35 2.85 2.2 ...
## $ cloudcover: num 13.4 37.9 3.9 5.3 7.1 6.9 4.6 4.9 12.1 5.2 ...
## $ prewetness: num 0.274 1.267 0.198 0.526 0.25 ...
## $ echomotion: chr "stationary" "moving" "stationary" "moving" ...
## $ rainfall : num 12.85 5.52 6.29 6.11 2.45 ...
seed = clouds %>%
filter(seeding %in% c("yes"))
noseed = clouds %>%
filter(seeding %in% c("no"))
str(seed)
## 'data.frame': 12 obs. of 8 variables:
## $ X : int 2 3 5 10 11 12 14 15 18 20 ...
## $ seeding : chr "yes" "yes" "yes" "yes" ...
## $ time : int 1 3 6 28 29 32 35 38 55 59 ...
## $ sne : num 2.7 4.1 4.25 2.2 4.4 3.1 2.9 2.05 3.7 3.4 ...
## $ cloudcover: num 37.9 3.9 7.1 5.2 4.1 2.8 3 7 3.3 6.5 ...
## $ prewetness: num 1.267 0.198 0.25 0.084 0.236 ...
## $ echomotion: chr "moving" "stationary" "moving" "moving" ...
## $ rainfall : num 5.52 6.29 2.45 5.06 2.76 ...
str(noseed)
## 'data.frame': 12 obs. of 8 variables:
## $ X : int 1 4 6 7 8 9 13 16 17 19 ...
## $ seeding : chr "no" "no" "no" "no" ...
## $ time : int 0 4 9 18 25 27 33 39 53 56 ...
## $ sne : num 1.75 2.35 1.6 1.3 3.35 2.85 3.95 4 3.35 3.8 ...
## $ cloudcover: num 13.4 5.3 6.9 4.6 4.9 12.1 6.8 11.3 4.2 2.2 ...
## $ prewetness: num 0.274 0.526 0.018 0.307 0.194 0.751 0.796 0.398 0.237 0.23 ...
## $ echomotion: chr "stationary" "moving" "stationary" "moving" ...
## $ rainfall : num 12.85 6.11 3.61 0.47 4.56 ...
To get descriptive statistics based on this variable, group by seeding since it is the variable we are interested in.
seed_stat = clouds %>%
group_by(seeding) %>%
summarize(meanrf = mean(rainfall),
sdrf = sd(rainfall))
seed_stat
## # A tibble: 2 x 3
## seeding meanrf sdrf
## <chr> <dbl> <dbl>
## 1 no 4.17 3.52
## 2 yes 4.63 2.78
From this, we see that the average rainfall in seeding experiments is 0.46 units higher than non-seeding experiments.
Next, we create some plots to understand the data better:
plot(seed_stat$meanrf, seed_stat$sdrf)
ggplot(seed_stat, aes(x = meanrf, y = sdrf)) + geom_point() + facet_wrap(~seeding)
ggplot(clouds, aes(x = rainfall)) + geom_density(aes(fill = seeding), alpha = 0.4)
ggplot(clouds, aes(x = time, y = rainfall)) + geom_point() + facet_wrap(~seeding) + geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
ggplot(clouds, aes(x = seeding, y = rainfall, col = 0.4)) + geom_boxplot()
The t-test gives us valuable insight into the data:
t.test(seed$rainfall, noseed$rainfall, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: seed$rainfall and noseed$rainfall
## t = 0.3574, df = 20.871, p-value = 0.7244
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.229691 3.154691
## sample estimates:
## mean of x mean of y
## 4.634167 4.171667
With a p-value of 0.7244, we can accept the null hypothesis that there is no difference between seeding and non-seeding experiments.
Center the variables to get the best model possible:
clouds$cloudcover.cen = clouds$cloudcover - mean(clouds$cloudcover)
clouds$prewetness.cen = clouds$prewetness - mean(clouds$prewetness)
clouds$sne.cen = clouds$sne - mean(clouds$sne)
Create a linear model with the dependent variable as rainfall, and cloudcover, prewetness, echomotion, and sne as independent variables:
model2 = lm(rainfall ~ cloudcover.cen + echomotion + prewetness.cen + sne.cen, clouds)
plot(model2)
summary (model2)
##
## Call:
## lm(formula = rainfall ~ cloudcover.cen + echomotion + prewetness.cen +
## sne.cen, data = clouds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4396 -1.5824 0.2265 1.2153 7.0778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8952 0.6618 5.885 1.15e-05 ***
## cloudcover.cen 0.0366 0.1129 0.324 0.7494
## echomotionstationary 2.4370 1.5254 1.598 0.1266
## prewetness.cen 2.1520 2.6566 0.810 0.4279
## sne.cen -1.1526 0.6647 -1.734 0.0991 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.844 on 19 degrees of freedom
## Multiple R-squared: 0.3087, Adjusted R-squared: 0.1632
## F-statistic: 2.122 on 4 and 19 DF, p-value: 0.1178
anova(model2)
## Analysis of Variance Table
##
## Response: rainfall
## Df Sum Sq Mean Sq F value Pr(>F)
## cloudcover.cen 1 16.240 16.2396 2.0076 0.17270
## echomotion 1 25.277 25.2768 3.1248 0.09316 .
## prewetness.cen 1 2.806 2.8059 0.3469 0.56282
## sne.cen 1 24.321 24.3212 3.0067 0.09912 .
## Residuals 19 153.691 8.0890
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
While no variable is significant, we see that the variable closest to approaching significance is sne (suitability criterion) with the lowest p-value of 0.0991.
First, we review our vector that was filtered for seeding experiments. Next, we create a linear model with the dependent variable as rainfall, and the independent variable as sne, using our vector for seeding experiments.
seed
## X seeding time sne cloudcover prewetness echomotion rainfall
## 1 2 yes 1 2.70 37.9 1.267 moving 5.52
## 2 3 yes 3 4.10 3.9 0.198 stationary 6.29
## 3 5 yes 6 4.25 7.1 0.250 moving 2.45
## 4 10 yes 28 2.20 5.2 0.084 moving 5.06
## 5 11 yes 29 4.40 4.1 0.236 moving 2.76
## 6 12 yes 32 3.10 2.8 0.214 moving 4.05
## 7 14 yes 35 2.90 3.0 0.124 moving 4.84
## 8 15 yes 38 2.05 7.0 0.144 moving 11.86
## 9 18 yes 55 3.70 3.3 0.960 moving 4.22
## 10 20 yes 59 3.40 6.5 0.142 stationary 5.45
## 11 21 yes 65 3.15 3.1 0.073 moving 2.02
## 12 23 yes 82 4.01 8.3 0.123 moving 1.09
sneseed = lm(rainfall ~ sne, seed)
plot(sneseed)
summary(sneseed)
##
## Call:
## lm(formula = rainfall ~ sne, data = seed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0134 -1.3297 -0.3276 0.6171 4.3867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.0202 2.9774 4.037 0.00237 **
## sne -2.2180 0.8722 -2.543 0.02921 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.27 on 10 degrees of freedom
## Multiple R-squared: 0.3927, Adjusted R-squared: 0.332
## F-statistic: 6.467 on 1 and 10 DF, p-value: 0.02921
anova(sneseed)
## Analysis of Variance Table
##
## Response: rainfall
## Df Sum Sq Mean Sq F value Pr(>F)
## sne 1 33.310 33.310 6.4669 0.02921 *
## Residuals 10 51.509 5.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the coefficients from our summary command, and the anova command, we see that sne is approaching significance because it has a p-value of 0.02921.
Next, we do the same thing for non-seeding experiments:
sneno = lm(rainfall ~ sne, noseed)
plot(sneno)
summary(sneno)
##
## Call:
## lm(formula = rainfall ~ sne, data = noseed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4892 -2.1762 0.2958 1.4902 7.3616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.319 3.160 2.317 0.043 *
## sne -1.046 0.995 -1.052 0.318
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.502 on 10 degrees of freedom
## Multiple R-squared: 0.09957, Adjusted R-squared: 0.009528
## F-statistic: 1.106 on 1 and 10 DF, p-value: 0.3177
anova(sneno)
## Analysis of Variance Table
##
## Response: rainfall
## Df Sum Sq Mean Sq F value Pr(>F)
## sne 1 13.565 13.565 1.1058 0.3177
## Residuals 10 122.667 12.267
In the non-seeeding experiments, we see that sne is no longer significant because it has a p-value of 0.3177.
With these two models, we can plot the differences with ggplot:
ggplot(clouds, aes(sne, rainfall, col = seeding)) +
geom_smooth(method = 'lm') + ggtitle("Suitability index as it relates to rainfall")
## `geom_smooth()` using formula 'y ~ x'
The difference in slope between seeding and non-seeding experiments suggests that as the suitibility index increases in seeding experiments, rainfall decreases more rapidly than in non-seeding experiments. The decrease in precipitation is so great, that rainfall in seeding experiments is lower than non-seeding experiments as the suitability criterion approaches 4.