Introduction

The data set was collected in the summer of 1975 from an experiment to investigate the use of massive amounts of silver iodide (100 to 1000 grams per cloud) in cloud seeding to increase rainfall.

For this project, we will perform some analysis of the given data

Resources

We first load the packages potentially needed for the analysis

library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.1
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.1
library (reshape2)
## Warning: package 'reshape2' was built under R version 4.4.1
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths

And the experiment data

cloud <- read.csv("clouds.csv")

Analysis

1. Explore the difference in rainfall between seeding and non-seeding experiments by looking at the statistical characteristics (mean, sd, etc) of the experiments and by using visualization. Use a t-test to see whether there is a significant difference.

The first step is to subset the data to calculate the rainfall statistics, grouped in Seeding and Non-Seeding experiments. To do this we will use tools from the dplyr package

rainfall <- cloud %>%
  group_by(seeding) %>%
  summarise(avg = mean(rainfall),
            sd = sd(rainfall),
            max = max(rainfall),
            min = min(rainfall))

The result should be the following:

rainfall
## # A tibble: 2 × 5
##   seeding   avg    sd   max   min
##   <chr>   <dbl> <dbl> <dbl> <dbl>
## 1 no       4.17  3.52  12.8  0.28
## 2 yes      4.63  2.78  11.9  1.09

Even when this new subset comes really handy, we need to further reshape it from wide-format to long-format using the melt() function, that is included in the reshape package, which will facilitate the visualization task.

rainstat <- melt(rainfall, id = 'seeding')
rainstat
##   seeding variable     value
## 1      no      avg  4.171667
## 2     yes      avg  4.634167
## 3      no       sd  3.519196
## 4     yes       sd  2.776841
## 5      no      max 12.850000
## 6     yes      max 11.860000
## 7      no      min  0.280000
## 8     yes      min  1.090000

Although is not strictly necessary in this occasion, we can reorder it to make it look cooler ;)

rainstat <- rainstat %>%
  arrange(factor(seeding, levels = c("yes", "no")))
rainstat
##   seeding variable     value
## 1     yes      avg  4.634167
## 2     yes       sd  2.776841
## 3     yes      max 11.860000
## 4     yes      min  1.090000
## 5      no      avg  4.171667
## 6      no       sd  3.519196
## 7      no      max 12.850000
## 8      no      min  0.280000

We then proceed to compare the data through graphics. We will use the tools from the ggpubr package.

ggbarplot(rainstat, x= "variable", y= "value", fill = "seeding", main = "Cloud Seeding Experiment Statistics 1975",
             palette = "rickandmorty", facet.by = "seeding")

Let’s run a t-test to assess whether there is a significant difference in rainfall between the seeding and no seeding experiments. That requires two vector of values for the two sample t-test. In this case, we will then filter the data in yes and no seeding.

yes <- cloud %>%
  filter(seeding == "yes") %>%
  select(rainfall)

yes
##    rainfall
## 1      5.52
## 2      6.29
## 3      2.45
## 4      5.06
## 5      2.76
## 6      4.05
## 7      4.84
## 8     11.86
## 9      4.22
## 10     5.45
## 11     2.02
## 12     1.09
no <- cloud %>%
  filter (seeding == "no") %>%
  select(rainfall)

no
##    rainfall
## 1     12.85
## 2      6.11
## 3      3.61
## 4      0.47
## 5      4.56
## 6      6.35
## 7      5.74
## 8      4.45
## 9      3.66
## 10     1.16
## 11     0.82
## 12     0.28

Now that we have our vectors, we can run the t-test

t.test(yes, no, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  yes and no
## t = 0.3574, df = 20.871, p-value = 0.7244
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.229691  3.154691
## sample estimates:
## mean of x mean of y 
##  4.634167  4.171667

The results reveals a p-value = 0.7244, which indicates that there is no significance difference in rainfall between the two experiments.

2. Build a multiple linear regression model to model the effects of seeding on rainfall. Include cloudcover, prewetness, echomotion and sne as independent variables. Explore the model using the summary() and anova() functions. Which variables appear to influence rainfall the most?

To build the linear regression model, we will use the lm() function, placing the dependent variable to the left, and the independent variables to the right, separated by a Tilde (~). The Tilde indicates the direction of the relationship. Here, the rainfal depends on cloudcover, prewetness, echomotion, and sne.

mlr <- lm(rainfall ~ cloudcover + prewetness + echomotion + sne, data = cloud)
mlr
## 
## Call:
## lm(formula = rainfall ~ cloudcover + prewetness + echomotion + 
##     sne, data = cloud)
## 
## Coefficients:
##          (Intercept)            cloudcover            prewetness  
##               6.5789                0.0366                2.1520  
## echomotionstationary                   sne  
##               2.4370               -1.1526
summary(mlr)
## 
## Call:
## lm(formula = rainfall ~ cloudcover + prewetness + echomotion + 
##     sne, data = cloud)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4396 -1.5824  0.2265  1.2153  7.0778 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            6.5789     2.4157   2.723   0.0135 *
## cloudcover             0.0366     0.1129   0.324   0.7494  
## prewetness             2.1520     2.6566   0.810   0.4279  
## echomotionstationary   2.4370     1.5254   1.598   0.1266  
## sne                   -1.1526     0.6647  -1.734   0.0991 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.844 on 19 degrees of freedom
## Multiple R-squared:  0.3087, Adjusted R-squared:  0.1632 
## F-statistic: 2.122 on 4 and 19 DF,  p-value: 0.1178
anova(mlr) 
## Analysis of Variance Table
## 
## Response: rainfall
##            Df  Sum Sq Mean Sq F value  Pr(>F)  
## cloudcover  1  16.240 16.2396  2.0076 0.17270  
## prewetness  1   0.001  0.0009  0.0001 0.99180  
## echomotion  1  28.082 28.0818  3.4716 0.07796 .
## sne         1  24.321 24.3212  3.0067 0.09912 .
## Residuals  19 153.691  8.0890                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the summary() and anova() functions, non of the variables holds great significance, however, the variables that appear to influence rainfall the most are echomotion and sne, which F-Values are high compared with t-value.

3. As the suitability index (sne) appears to have a relationship with rainfall, build two new models relating this variable to rainfall, one for the seeding experiments, one for the non-seeding experiments. Compare the coefficients for the two models, and produce a figure showing the two models. What does the difference in slope suggest?

Before building the models, let’s subset the data to relate the variable sne with the two experiments

  • For the seeding experiment (yes)
m1y <- cloud %>% 
  filter(seeding == "yes") %>%
  select (rainfall, sne)
m1y
##    rainfall  sne
## 1      5.52 2.70
## 2      6.29 4.10
## 3      2.45 4.25
## 4      5.06 2.20
## 5      2.76 4.40
## 6      4.05 3.10
## 7      4.84 2.90
## 8     11.86 2.05
## 9      4.22 3.70
## 10     5.45 3.40
## 11     2.02 3.15
## 12     1.09 4.01
  • For the non-seeding experiment (no)
m2n <- cloud %>%
  filter(seeding == "no") %>%
  select (rainfall, sne)
m2n
##    rainfall  sne
## 1     12.85 1.75
## 2      6.11 2.35
## 3      3.61 1.60
## 4      0.47 1.30
## 5      4.56 3.35
## 6      6.35 2.85
## 7      5.74 3.95
## 8      4.45 4.00
## 9      3.66 3.35
## 10     1.16 3.80
## 11     0.82 3.15
## 12     0.28 4.65

Now let’s build the models and analyze the coefficients!

  • For the seeding experiment
lm1 <- lm(rainfall ~ sne, data = m1y)
lm1
## 
## Call:
## lm(formula = rainfall ~ sne, data = m1y)
## 
## Coefficients:
## (Intercept)          sne  
##      12.020       -2.218
summary(lm1)
## 
## Call:
## lm(formula = rainfall ~ sne, data = m1y)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0134 -1.3297 -0.3276  0.6171  4.3867 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  12.0202     2.9774   4.037  0.00237 **
## sne          -2.2180     0.8722  -2.543  0.02921 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.27 on 10 degrees of freedom
## Multiple R-squared:  0.3927, Adjusted R-squared:  0.332 
## F-statistic: 6.467 on 1 and 10 DF,  p-value: 0.02921

For this model, for each unit increase in sne, rainfall decreases by 2.21 units. This coefficient is statistically significant (** and p-value < 0.05), which indicates a significative relationship between suitability and rainfall in seeding experiments.

  • For the For the non-seeding experiment
lm2 <- lm(rainfall ~ sne, data = m2n)
lm2
## 
## Call:
## lm(formula = rainfall ~ sne, data = m2n)
## 
## Coefficients:
## (Intercept)          sne  
##       7.319       -1.046
summary(lm2)
## 
## Call:
## lm(formula = rainfall ~ sne, data = m2n)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4892 -2.1762  0.2958  1.4902  7.3616 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    7.319      3.160   2.317    0.043 *
## sne           -1.046      0.995  -1.052    0.318  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.502 on 10 degrees of freedom
## Multiple R-squared:  0.09957,    Adjusted R-squared:  0.009528 
## F-statistic: 1.106 on 1 and 10 DF,  p-value: 0.3177

In this model, for each unit increase in sne, rainfall decreases by 1.04. This coefficient is not statistically significant (p-value < 0.05), which indicates that there is no significative relationship between sne and rainfall in non-seeding experiments.

We plot our regression line for both models using the ggpubr package.

  • For the seeding experiment
m1plot <- ggscatter(m1y, x = "sne", y = "rainfall", 
                    main = "Seeding Experiment, Sne vs. Rainfall", 
                    ylab = "Rainfall (m3)", xlab = "Suitability",
                    add = "reg.line",
                    add.params = list(color = "red")) +
  stat_cor(method = "pearson", label.y = 12, label.x = 2.5, col = "red")
m1plot

  • For the non-seeding experiment
m2plot <- ggscatter(m2n, x = "sne", y = "rainfall", 
                    main = "No Seeding Experiment, Sne vs. Rainfall", 
                    ylab = "Rainfall (m3)", xlab = "Suitability",
                    add = "reg.line",
                    add.params = list(color = "red")) +
  stat_cor(method = "pearson", label.y = 12, label.x = 2.5, col = "red")
m2plot

The difference in slope suggests that the suitability (sne) has a more significant negative effect on rainfall in the seeding experiments compared to the non-seeding experiments. This could mean that the seeding process influences the relationship between the suitability index and rainfall.