Introduction

The data were collected in the summer of 1975 from an experiment to investigate the use of massive amounts of silver iodide 100 to 1000 grams per cloud) in cloud seeding to increase rainfall. Our interest is to know whether or not seeding increases the rainfall from a cloud.

Analysis

library(ggplot2)
library(ggthemes)
clouds = read.csv("clouds.csv")

Part 1

Explore the difference in rainfall between seeding and non-seeding experiments by looking at the statistical characteristics (mean, sd, etc) of the experiments and by using visualization. Use a t-test to see whether there is a significant difference.

seeding = clouds$seeding
rainfall = clouds$rainfall
time = clouds$time
seeding = as.numeric(factor(clouds$seeding))
seedY = clouds$seeding == "yes"
seedN = clouds$seeding == "no"
rainseed_df = data.frame(seed=seeding, rain=rainfall)
summary(rainseed_df)
##       seed          rain       
##  Min.   :1.0   Min.   : 0.280  
##  1st Qu.:1.0   1st Qu.: 2.342  
##  Median :1.5   Median : 4.335  
##  Mean   :1.5   Mean   : 4.403  
##  3rd Qu.:2.0   3rd Qu.: 5.575  
##  Max.   :2.0   Max.   :12.850
sd(rainseed_df$seed)
## [1] 0.5107539
sd(rainseed_df$rain)
## [1] 3.109137
rainseed_plot = ggplot(clouds, aes(x=time, y=rainfall)) + geom_line(aes(group=seeding, color=seeding)) + theme_bw() + ggtitle("Rainfall over Time in relation to Seeding Action") + xlab("Time (days)") + ylab("Rainfall (m^3 * 1e+8)")
rainseed_plot

t.test(seeding, rainfall, alternative="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  seeding and rainfall
## t = -4.5135, df = 24.24, p-value = 0.0001401
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.229630 -1.576204
## sample estimates:
## mean of x mean of y 
##  1.500000  4.402917

Description of results/steps

Here, I started by setting variables, factoring seeding action from “yes” and “no” to “1” and “2” in order to get numerical values to take statistical characterisitcs and to graph approrpriately. From there, I created a separate dataframe for the main variables comparing rainfall and seeding/non-seeding experiments. Creating a simple ggplot allowed for a visualization of this comparison. However, once the t-test was completed, it can be seen that the p value does not equal 0 and as such there is no significant difference between the variables (seeding/nonseeding and rainfall).

Part 2

Build a multiple linear regression model to model the effects of seeding on rainfall. Include cloudcover, prewetness, echomotion and sne as independent variables. Explore the model using the summary() and anova() functions. Which variables appear to influence rainfall the most?

lm(rainfall ~ seeding + cloudcover + prewetness + echomotion + sne, data = clouds)
## 
## Call:
## lm(formula = rainfall ~ seeding + cloudcover + prewetness + echomotion + 
##     sne, data = clouds)
## 
## Coefficients:
##          (Intercept)            seedingyes            cloudcover  
##              6.37680               1.12011               0.01821  
##           prewetness  echomotionstationary                   sne  
##              2.55109               2.59855              -1.27530
rainseed_lm = lm(rainfall ~ seeding + cloudcover + prewetness + echomotion + sne, data = clouds)
class(rainseed_lm)
## [1] "lm"
summary(rainseed_lm)
## 
## Call:
## lm(formula = rainfall ~ seeding + cloudcover + prewetness + echomotion + 
##     sne, data = clouds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1158 -1.7078 -0.2422  1.3368  6.4827 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           6.37680    2.43432   2.620   0.0174 *
## seedingyes            1.12011    1.20725   0.928   0.3658  
## cloudcover            0.01821    0.11508   0.158   0.8761  
## prewetness            2.55109    2.70090   0.945   0.3574  
## echomotionstationary  2.59855    1.54090   1.686   0.1090  
## sne                  -1.27530    0.68015  -1.875   0.0771 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.855 on 18 degrees of freedom
## Multiple R-squared:  0.3403, Adjusted R-squared:  0.157 
## F-statistic: 1.857 on 5 and 18 DF,  p-value: 0.1524
anova(rainseed_lm)
## Analysis of Variance Table
## 
## Response: rainfall
##            Df  Sum Sq Mean Sq F value  Pr(>F)  
## seeding     1   1.283  1.2834  0.1575 0.69613  
## cloudcover  1  15.738 15.7377  1.9313 0.18157  
## prewetness  1   0.003  0.0027  0.0003 0.98557  
## echomotion  1  29.985 29.9853  3.6798 0.07108 .
## sne         1  28.649 28.6491  3.5158 0.07711 .
## Residuals  18 146.677  8.1487                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Description of results/steps

For this section, I began with creating a multiple linear regression model, which includes the two variables rainfall and seeding as well as the other independent variables (indicated with the + between them). Then, I analyzed the data using the summary and anova functions. As indicated on both, it seems that the suitability index (sne) has an influence on rainfall the most, although it is pretty slight. The way this can be seen is with the symbol next to the Pr(>F) column value.

Part 3

As the suitability index (sne) appears to have a relationship with rainfall, build two new models relating this variable to rainfall, one for the seeding experiments, one for the non-seeding experiments. Compare the coefficients for the two models, and produce a figure showing the two models. What does the difference in slope suggest?

seedingY = clouds$seeding == "yes"
seedingN = clouds$seeding == "no"
sne = clouds$sne

lm(sne ~ rainfall * seedingY, data = clouds)
## 
## Call:
## lm(formula = sne ~ rainfall * seedingY, data = clouds)
## 
## Coefficients:
##           (Intercept)               rainfall           seedingYTRUE  
##               3.40530               -0.09516                0.74521  
## rainfall:seedingYTRUE  
##              -0.08190
sne_rain_y = lm(sne ~ rainfall * seedingY, data = clouds)
summary(sne_rain_y)
## 
## Call:
## lm(formula = sne ~ rainfall * seedingY, data = clouds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.06058 -0.44268  0.05072  0.51213  1.27134 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.40530    0.40141   8.483 4.65e-08 ***
## rainfall              -0.09516    0.07486  -1.271    0.218    
## seedingYTRUE           0.74521    0.64655   1.153    0.263    
## rainfall:seedingYTRUE -0.08190    0.12084  -0.678    0.506    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8737 on 20 degrees of freedom
## Multiple R-squared:  0.2282, Adjusted R-squared:  0.1124 
## F-statistic: 1.971 on 3 and 20 DF,  p-value: 0.1508
anova(sne_rain_y)
## Analysis of Variance Table
## 
## Response: sne
##                   Df  Sum Sq Mean Sq F value  Pr(>F)  
## rainfall           1  3.3005  3.3005  4.3236 0.05067 .
## seedingY           1  0.8624  0.8624  1.1297 0.30051  
## rainfall:seedingY  1  0.3506  0.3506  0.4593 0.50571  
## Residuals         20 15.2673  0.7634                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm(sne ~ rainfall * seedingN, data = clouds)
## 
## Call:
## lm(formula = sne ~ rainfall * seedingN, data = clouds)
## 
## Coefficients:
##           (Intercept)               rainfall           seedingNTRUE  
##                4.1505                -0.1771                -0.7452  
## rainfall:seedingNTRUE  
##                0.0819
sne_rain_n = lm(sne ~ rainfall * seedingN, data = clouds)
summary(sne_rain_n)
## 
## Call:
## lm(formula = sne ~ rainfall * seedingN, data = clouds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.06058 -0.44268  0.05072  0.51213  1.27134 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.15051    0.50684   8.189 8.11e-08 ***
## rainfall              -0.17706    0.09487  -1.866   0.0767 .  
## seedingNTRUE          -0.74521    0.64655  -1.153   0.2627    
## rainfall:seedingNTRUE  0.08190    0.12084   0.678   0.5057    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8737 on 20 degrees of freedom
## Multiple R-squared:  0.2282, Adjusted R-squared:  0.1124 
## F-statistic: 1.971 on 3 and 20 DF,  p-value: 0.1508
anova(sne_rain_n)
## Analysis of Variance Table
## 
## Response: sne
##                   Df  Sum Sq Mean Sq F value  Pr(>F)  
## rainfall           1  3.3005  3.3005  4.3236 0.05067 .
## seedingN           1  0.8624  0.8624  1.1297 0.30051  
## rainfall:seedingN  1  0.3506  0.3506  0.4593 0.50571  
## Residuals         20 15.2673  0.7634                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot_sne_rain = ggplot(clouds, aes(x = rainfall, y = sne, col = seeding)) + geom_point()
plot_sne_rain = plot_sne_rain + geom_smooth(method = "lm") + ggtitle("Rainfall vs Suitability Index") + xlab("Rainfall (m^3 * 1e+8)") + ylab("Suitability Index") + theme_bw()
print(plot_sne_rain)
## `geom_smooth()` using formula 'y ~ x'

Description of results/steps

Here, in order to differentiate between the seeding and non-seeding experiments, I created two new variables. From there, I was able to build the two linear regression models, evaluate the coefficients of both and then compare them to one another in a plot. As seen in the plot, both seeding and non-seeding experiments have a downward slope but one is steeper than the other, indicated that as the rainfall increases, sne decreases but the rate at which the sne decreases is faster with the seeding experiments whereas it’s more gradual with the non-seeding experiments.