The data set was collected in the summer of 1975 from an experiment to investigate the use of massive amounts of silver iodide (100 to 1000 grams per cloud) in cloud seeding to increase rainfall.
For this project, we will perform some analysis of the given data
We first load the packages potentially needed for the analysis
library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.1
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.1
library (reshape2)
## Warning: package 'reshape2' was built under R version 4.4.1
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
And the experiment data
cloud <- read.csv("clouds.csv")
The first step is to subset the data to calculate the rainfall statistics, grouped in Seeding and Non-Seeding experiments. To do this we will use tools from the dplyr package
rainfall <- cloud %>%
group_by(seeding) %>%
summarise(avg = mean(rainfall),
sd = sd(rainfall),
max = max(rainfall),
min = min(rainfall))
The result should be the following:
rainfall
## # A tibble: 2 × 5
## seeding avg sd max min
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 no 4.17 3.52 12.8 0.28
## 2 yes 4.63 2.78 11.9 1.09
Even when this new subset comes really handy, we need to further reshape it from wide-format to long-format using the melt() function, that is included in the reshape package, which will facilitate the visualization task.
rainstat <- melt(rainfall, id = 'seeding')
rainstat
## seeding variable value
## 1 no avg 4.171667
## 2 yes avg 4.634167
## 3 no sd 3.519196
## 4 yes sd 2.776841
## 5 no max 12.850000
## 6 yes max 11.860000
## 7 no min 0.280000
## 8 yes min 1.090000
Although is not strictly necessary in this occasion, we can reorder it to make it look cooler ;)
rainstat <- rainstat %>%
arrange(factor(seeding, levels = c("yes", "no")))
rainstat
## seeding variable value
## 1 yes avg 4.634167
## 2 yes sd 2.776841
## 3 yes max 11.860000
## 4 yes min 1.090000
## 5 no avg 4.171667
## 6 no sd 3.519196
## 7 no max 12.850000
## 8 no min 0.280000
We then proceed to compare the data through graphics. We will use the tools from the ggpubr package.
ggbarplot(rainstat, x= "variable", y= "value", fill = "seeding", main = "Cloud Seeding Experiment Statistics 1975",
palette = "rickandmorty", facet.by = "seeding")
Let’s run a t-test to assess whether there is a significant difference in rainfall between the seeding and no seeding experiments. That requires two vector of values for the two sample t-test. In this case, we will then filter the data in yes and no seeding.
yes <- cloud %>%
filter(seeding == "yes") %>%
select(rainfall)
yes
## rainfall
## 1 5.52
## 2 6.29
## 3 2.45
## 4 5.06
## 5 2.76
## 6 4.05
## 7 4.84
## 8 11.86
## 9 4.22
## 10 5.45
## 11 2.02
## 12 1.09
no <- cloud %>%
filter (seeding == "no") %>%
select(rainfall)
no
## rainfall
## 1 12.85
## 2 6.11
## 3 3.61
## 4 0.47
## 5 4.56
## 6 6.35
## 7 5.74
## 8 4.45
## 9 3.66
## 10 1.16
## 11 0.82
## 12 0.28
Now that we have our vectors, we can run the t-test
t.test(yes, no, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: yes and no
## t = 0.3574, df = 20.871, p-value = 0.7244
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.229691 3.154691
## sample estimates:
## mean of x mean of y
## 4.634167 4.171667
The results reveals a p-value = 0.7244, which indicates that there is no significance difference in rainfall between the two experiments.
To build the linear regression model, we will use the lm() function, placing the dependent variable to the left, and the independent variables to the right, separated by a Tilde (~). The Tilde indicates the direction of the relationship. Here, the rainfal depends on cloudcover, prewetness, echomotion, and sne.
mlr <- lm(rainfall ~ cloudcover + prewetness + echomotion + sne, data = cloud)
mlr
##
## Call:
## lm(formula = rainfall ~ cloudcover + prewetness + echomotion +
## sne, data = cloud)
##
## Coefficients:
## (Intercept) cloudcover prewetness
## 6.5789 0.0366 2.1520
## echomotionstationary sne
## 2.4370 -1.1526
summary(mlr)
##
## Call:
## lm(formula = rainfall ~ cloudcover + prewetness + echomotion +
## sne, data = cloud)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4396 -1.5824 0.2265 1.2153 7.0778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5789 2.4157 2.723 0.0135 *
## cloudcover 0.0366 0.1129 0.324 0.7494
## prewetness 2.1520 2.6566 0.810 0.4279
## echomotionstationary 2.4370 1.5254 1.598 0.1266
## sne -1.1526 0.6647 -1.734 0.0991 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.844 on 19 degrees of freedom
## Multiple R-squared: 0.3087, Adjusted R-squared: 0.1632
## F-statistic: 2.122 on 4 and 19 DF, p-value: 0.1178
anova(mlr)
## Analysis of Variance Table
##
## Response: rainfall
## Df Sum Sq Mean Sq F value Pr(>F)
## cloudcover 1 16.240 16.2396 2.0076 0.17270
## prewetness 1 0.001 0.0009 0.0001 0.99180
## echomotion 1 28.082 28.0818 3.4716 0.07796 .
## sne 1 24.321 24.3212 3.0067 0.09912 .
## Residuals 19 153.691 8.0890
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the summary() and anova() functions, non of the variables holds great significance, however, the variables that appear to influence rainfall the most are echomotion and sne, which F-Values are high compared with t-value.
Before building the models, let’s subset the data to relate the variable sne with the two experiments
m1y <- cloud %>%
filter(seeding == "yes") %>%
select (rainfall, sne)
m1y
## rainfall sne
## 1 5.52 2.70
## 2 6.29 4.10
## 3 2.45 4.25
## 4 5.06 2.20
## 5 2.76 4.40
## 6 4.05 3.10
## 7 4.84 2.90
## 8 11.86 2.05
## 9 4.22 3.70
## 10 5.45 3.40
## 11 2.02 3.15
## 12 1.09 4.01
m2n <- cloud %>%
filter(seeding == "no") %>%
select (rainfall, sne)
m2n
## rainfall sne
## 1 12.85 1.75
## 2 6.11 2.35
## 3 3.61 1.60
## 4 0.47 1.30
## 5 4.56 3.35
## 6 6.35 2.85
## 7 5.74 3.95
## 8 4.45 4.00
## 9 3.66 3.35
## 10 1.16 3.80
## 11 0.82 3.15
## 12 0.28 4.65
Now let’s build the models and analyze the coefficients!
lm1 <- lm(rainfall ~ sne, data = m1y)
lm1
##
## Call:
## lm(formula = rainfall ~ sne, data = m1y)
##
## Coefficients:
## (Intercept) sne
## 12.020 -2.218
summary(lm1)
##
## Call:
## lm(formula = rainfall ~ sne, data = m1y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0134 -1.3297 -0.3276 0.6171 4.3867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.0202 2.9774 4.037 0.00237 **
## sne -2.2180 0.8722 -2.543 0.02921 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.27 on 10 degrees of freedom
## Multiple R-squared: 0.3927, Adjusted R-squared: 0.332
## F-statistic: 6.467 on 1 and 10 DF, p-value: 0.02921
For this model, for each unit increase in sne, rainfall decreases by 2.21 units. This coefficient is statistically significant (** and p-value < 0.05), which indicates a significative relationship between suitability and rainfall in seeding experiments.
lm2 <- lm(rainfall ~ sne, data = m2n)
lm2
##
## Call:
## lm(formula = rainfall ~ sne, data = m2n)
##
## Coefficients:
## (Intercept) sne
## 7.319 -1.046
summary(lm2)
##
## Call:
## lm(formula = rainfall ~ sne, data = m2n)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4892 -2.1762 0.2958 1.4902 7.3616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.319 3.160 2.317 0.043 *
## sne -1.046 0.995 -1.052 0.318
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.502 on 10 degrees of freedom
## Multiple R-squared: 0.09957, Adjusted R-squared: 0.009528
## F-statistic: 1.106 on 1 and 10 DF, p-value: 0.3177
In this model, for each unit increase in sne, rainfall decreases by 1.04. This coefficient is not statistically significant (p-value < 0.05), which indicates that there is no significative relationship between sne and rainfall in non-seeding experiments.
We plot our regression line for both models using the ggpubr package.
m1plot <- ggscatter(m1y, x = "sne", y = "rainfall",
main = "Seeding Experiment, Sne vs. Rainfall",
ylab = "Rainfall (m3)", xlab = "Suitability",
add = "reg.line",
add.params = list(color = "red")) +
stat_cor(method = "pearson", label.y = 12, label.x = 2.5, col = "red")
m1plot
m2plot <- ggscatter(m2n, x = "sne", y = "rainfall",
main = "No Seeding Experiment, Sne vs. Rainfall",
ylab = "Rainfall (m3)", xlab = "Suitability",
add = "reg.line",
add.params = list(color = "red")) +
stat_cor(method = "pearson", label.y = 12, label.x = 2.5, col = "red")
m2plot
The difference in slope suggests that the suitability (sne) has a more significant negative effect on rainfall in the seeding experiments compared to the non-seeding experiments. This could mean that the seeding process influences the relationship between the suitability index and rainfall.