Introduction

The excel file clouds.csv contains information about a set of cloud seeding experiments. The data were collected in the summer of 1975 from an experiment to investigate the use of massive amounts of silver iodide 100 to 1000 grams per cloud) in cloud seeding to increase rainfall. Our interest is to know whether or not seeding increases the rainfall from a cloud. The file contains the following information:

Analysis

1. Explore the difference in rainfall between seeding and non-seeding experiments by looking at the statistical characteristics (mean, sd, etc) of the experiments and by using visualization. Use a t-test to see whether there is a significant difference.

Install proper packages:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(ggpubr)

Create individual vectors for later use that separate seeding experiments from non-seeding experiments:

clouds = read.csv("clouds.csv")
str(clouds)
## 'data.frame':    24 obs. of  8 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ seeding   : chr  "no" "yes" "yes" "no" ...
##  $ time      : int  0 1 3 4 6 9 18 25 27 28 ...
##  $ sne       : num  1.75 2.7 4.1 2.35 4.25 1.6 1.3 3.35 2.85 2.2 ...
##  $ cloudcover: num  13.4 37.9 3.9 5.3 7.1 6.9 4.6 4.9 12.1 5.2 ...
##  $ prewetness: num  0.274 1.267 0.198 0.526 0.25 ...
##  $ echomotion: chr  "stationary" "moving" "stationary" "moving" ...
##  $ rainfall  : num  12.85 5.52 6.29 6.11 2.45 ...
seed = clouds %>%
  filter(seeding %in% c("yes"))
noseed = clouds %>%
  filter(seeding %in% c("no"))
str(seed)
## 'data.frame':    12 obs. of  8 variables:
##  $ X         : int  2 3 5 10 11 12 14 15 18 20 ...
##  $ seeding   : chr  "yes" "yes" "yes" "yes" ...
##  $ time      : int  1 3 6 28 29 32 35 38 55 59 ...
##  $ sne       : num  2.7 4.1 4.25 2.2 4.4 3.1 2.9 2.05 3.7 3.4 ...
##  $ cloudcover: num  37.9 3.9 7.1 5.2 4.1 2.8 3 7 3.3 6.5 ...
##  $ prewetness: num  1.267 0.198 0.25 0.084 0.236 ...
##  $ echomotion: chr  "moving" "stationary" "moving" "moving" ...
##  $ rainfall  : num  5.52 6.29 2.45 5.06 2.76 ...
str(noseed)
## 'data.frame':    12 obs. of  8 variables:
##  $ X         : int  1 4 6 7 8 9 13 16 17 19 ...
##  $ seeding   : chr  "no" "no" "no" "no" ...
##  $ time      : int  0 4 9 18 25 27 33 39 53 56 ...
##  $ sne       : num  1.75 2.35 1.6 1.3 3.35 2.85 3.95 4 3.35 3.8 ...
##  $ cloudcover: num  13.4 5.3 6.9 4.6 4.9 12.1 6.8 11.3 4.2 2.2 ...
##  $ prewetness: num  0.274 0.526 0.018 0.307 0.194 0.751 0.796 0.398 0.237 0.23 ...
##  $ echomotion: chr  "stationary" "moving" "stationary" "moving" ...
##  $ rainfall  : num  12.85 6.11 3.61 0.47 4.56 ...

To get descriptive statistics based on this variable, group by seeding since it is the variable we are interested in.

seed_stat = clouds %>%
  group_by(seeding) %>%
  summarize(meanrf = mean(rainfall),
            sdrf = sd(rainfall))
seed_stat
## # A tibble: 2 x 3
##   seeding meanrf  sdrf
##   <chr>    <dbl> <dbl>
## 1 no        4.17  3.52
## 2 yes       4.63  2.78

From this, we see that the average rainfall in seeding experiments is 0.46 units higher than non-seeding experiments.

Next, we create some plots to understand the data better:

plot(seed_stat$meanrf, seed_stat$sdrf)

ggplot(seed_stat, aes(x = meanrf, y = sdrf)) + geom_point() + facet_wrap(~seeding)

ggplot(clouds, aes(x = rainfall)) + geom_density(aes(fill = seeding), alpha = 0.4)

ggplot(clouds, aes(x = time, y = rainfall)) + geom_point() + facet_wrap(~seeding) + geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'

ggplot(clouds, aes(x = seeding, y = rainfall, col = 0.4)) + geom_boxplot()

The t-test gives us valuable insight into the data:

t.test(seed$rainfall, noseed$rainfall, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  seed$rainfall and noseed$rainfall
## t = 0.3574, df = 20.871, p-value = 0.7244
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.229691  3.154691
## sample estimates:
## mean of x mean of y 
##  4.634167  4.171667

With a p-value of 0.7244, we can accept the null hypothesis that there is no difference between seeding and non-seeding experiments.

2. Build a multiple linear regression model to model the effects of seeding on rainfall. Include cloudcover, prewetness, echomotion and ‘sne’ as independent variables. Explore the model using the summary() and anova() functions. Which variables appear to influence rainfall the most?

Center the variables to get the best model possible:

clouds$cloudcover.cen = clouds$cloudcover - mean(clouds$cloudcover)
clouds$prewetness.cen = clouds$prewetness - mean(clouds$prewetness)
clouds$sne.cen = clouds$sne - mean(clouds$sne)

Create a linear model with the dependent variable as rainfall, and cloudcover, prewetness, echomotion, and sne as independent variables:

model2 = lm(rainfall ~ cloudcover.cen + echomotion + prewetness.cen + sne.cen, clouds)
plot(model2)

summary (model2)
## 
## Call:
## lm(formula = rainfall ~ cloudcover.cen + echomotion + prewetness.cen + 
##     sne.cen, data = clouds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4396 -1.5824  0.2265  1.2153  7.0778 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.8952     0.6618   5.885 1.15e-05 ***
## cloudcover.cen         0.0366     0.1129   0.324   0.7494    
## echomotionstationary   2.4370     1.5254   1.598   0.1266    
## prewetness.cen         2.1520     2.6566   0.810   0.4279    
## sne.cen               -1.1526     0.6647  -1.734   0.0991 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.844 on 19 degrees of freedom
## Multiple R-squared:  0.3087, Adjusted R-squared:  0.1632 
## F-statistic: 2.122 on 4 and 19 DF,  p-value: 0.1178
anova(model2)
## Analysis of Variance Table
## 
## Response: rainfall
##                Df  Sum Sq Mean Sq F value  Pr(>F)  
## cloudcover.cen  1  16.240 16.2396  2.0076 0.17270  
## echomotion      1  25.277 25.2768  3.1248 0.09316 .
## prewetness.cen  1   2.806  2.8059  0.3469 0.56282  
## sne.cen         1  24.321 24.3212  3.0067 0.09912 .
## Residuals      19 153.691  8.0890                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

While no variable is significant, we see that the variable closest to approaching significance is sne (suitability criterion) with the lowest p-value of 0.0991.

3. As the suitability index (sne) appears to have a relationship with rainfall, build two new models relating this variable to rainfall, one for the seeding experiments, one for the non-seeding experiments. Compare the coefficients for the two models, and produce a figure showing the two models. What does the difference in slope suggest?

First, we review our vector that was filtered for seeding experiments. Next, we create a linear model with the dependent variable as rainfall, and the independent variable as sne, using our vector for seeding experiments.

seed
##     X seeding time  sne cloudcover prewetness echomotion rainfall
## 1   2     yes    1 2.70       37.9      1.267     moving     5.52
## 2   3     yes    3 4.10        3.9      0.198 stationary     6.29
## 3   5     yes    6 4.25        7.1      0.250     moving     2.45
## 4  10     yes   28 2.20        5.2      0.084     moving     5.06
## 5  11     yes   29 4.40        4.1      0.236     moving     2.76
## 6  12     yes   32 3.10        2.8      0.214     moving     4.05
## 7  14     yes   35 2.90        3.0      0.124     moving     4.84
## 8  15     yes   38 2.05        7.0      0.144     moving    11.86
## 9  18     yes   55 3.70        3.3      0.960     moving     4.22
## 10 20     yes   59 3.40        6.5      0.142 stationary     5.45
## 11 21     yes   65 3.15        3.1      0.073     moving     2.02
## 12 23     yes   82 4.01        8.3      0.123     moving     1.09
sneseed = lm(rainfall ~ sne, seed)
plot(sneseed)

summary(sneseed)
## 
## Call:
## lm(formula = rainfall ~ sne, data = seed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0134 -1.3297 -0.3276  0.6171  4.3867 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  12.0202     2.9774   4.037  0.00237 **
## sne          -2.2180     0.8722  -2.543  0.02921 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.27 on 10 degrees of freedom
## Multiple R-squared:  0.3927, Adjusted R-squared:  0.332 
## F-statistic: 6.467 on 1 and 10 DF,  p-value: 0.02921
anova(sneseed)
## Analysis of Variance Table
## 
## Response: rainfall
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## sne        1 33.310  33.310  6.4669 0.02921 *
## Residuals 10 51.509   5.151                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the coefficients from our summary command, and the anova command, we see that sne is approaching significance because it has a p-value of 0.02921.

Next, we do the same thing for non-seeding experiments:

sneno = lm(rainfall ~ sne, noseed)
plot(sneno)

summary(sneno)
## 
## Call:
## lm(formula = rainfall ~ sne, data = noseed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4892 -2.1762  0.2958  1.4902  7.3616 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    7.319      3.160   2.317    0.043 *
## sne           -1.046      0.995  -1.052    0.318  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.502 on 10 degrees of freedom
## Multiple R-squared:  0.09957,    Adjusted R-squared:  0.009528 
## F-statistic: 1.106 on 1 and 10 DF,  p-value: 0.3177
anova(sneno)
## Analysis of Variance Table
## 
## Response: rainfall
##           Df  Sum Sq Mean Sq F value Pr(>F)
## sne        1  13.565  13.565  1.1058 0.3177
## Residuals 10 122.667  12.267

In the non-seeeding experiments, we see that sne is no longer significant because it has a p-value of 0.3177.

With these two models, we can plot the differences with ggplot:

ggplot(clouds, aes(sne, rainfall, col = seeding)) +
  geom_smooth(method = 'lm') + ggtitle("Suitability index as it relates to rainfall")
## `geom_smooth()` using formula 'y ~ x'

The difference in slope between seeding and non-seeding experiments suggests that as the suitibility index increases in seeding experiments, rainfall decreases more rapidly than in non-seeding experiments. The decrease in precipitation is so great, that rainfall in seeding experiments is lower than non-seeding experiments as the suitability criterion approaches 4.