The data were collected in the summer of 1975 from an experiment to investigate the use of massive amounts of silver iodide 100 to 1000 grams per cloud) in cloud seeding to increase rainfall. Our interest is to know whether or not seeding increases the rainfall from a cloud.
library(ggplot2)
library(ggthemes)
clouds = read.csv("clouds.csv")
Explore the difference in rainfall between seeding and non-seeding experiments by looking at the statistical characteristics (mean, sd, etc) of the experiments and by using visualization. Use a t-test to see whether there is a significant difference.
seeding = clouds$seeding
rainfall = clouds$rainfall
time = clouds$time
seeding = as.numeric(factor(clouds$seeding))
seedY = clouds$seeding == "yes"
seedN = clouds$seeding == "no"
rainseed_df = data.frame(seed=seeding, rain=rainfall)
summary(rainseed_df)
## seed rain
## Min. :1.0 Min. : 0.280
## 1st Qu.:1.0 1st Qu.: 2.342
## Median :1.5 Median : 4.335
## Mean :1.5 Mean : 4.403
## 3rd Qu.:2.0 3rd Qu.: 5.575
## Max. :2.0 Max. :12.850
sd(rainseed_df$seed)
## [1] 0.5107539
sd(rainseed_df$rain)
## [1] 3.109137
rainseed_plot = ggplot(clouds, aes(x=time, y=rainfall)) + geom_line(aes(group=seeding, color=seeding)) + theme_bw() + ggtitle("Rainfall over Time in relation to Seeding Action") + xlab("Time (days)") + ylab("Rainfall (m^3 * 1e+8)")
rainseed_plot
t.test(seeding, rainfall, alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: seeding and rainfall
## t = -4.5135, df = 24.24, p-value = 0.0001401
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.229630 -1.576204
## sample estimates:
## mean of x mean of y
## 1.500000 4.402917
Here, I started by setting variables, factoring seeding action from “yes” and “no” to “1” and “2” in order to get numerical values to take statistical characterisitcs and to graph approrpriately. From there, I created a separate dataframe for the main variables comparing rainfall and seeding/non-seeding experiments. Creating a simple ggplot allowed for a visualization of this comparison. However, once the t-test was completed, it can be seen that the p value does not equal 0 and as such there is no significant difference between the variables (seeding/nonseeding and rainfall).
Build a multiple linear regression model to model the effects of seeding on rainfall. Include cloudcover, prewetness, echomotion and sne as independent variables. Explore the model using the summary() and anova() functions. Which variables appear to influence rainfall the most?
lm(rainfall ~ seeding + cloudcover + prewetness + echomotion + sne, data = clouds)
##
## Call:
## lm(formula = rainfall ~ seeding + cloudcover + prewetness + echomotion +
## sne, data = clouds)
##
## Coefficients:
## (Intercept) seedingyes cloudcover
## 6.37680 1.12011 0.01821
## prewetness echomotionstationary sne
## 2.55109 2.59855 -1.27530
rainseed_lm = lm(rainfall ~ seeding + cloudcover + prewetness + echomotion + sne, data = clouds)
class(rainseed_lm)
## [1] "lm"
summary(rainseed_lm)
##
## Call:
## lm(formula = rainfall ~ seeding + cloudcover + prewetness + echomotion +
## sne, data = clouds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1158 -1.7078 -0.2422 1.3368 6.4827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.37680 2.43432 2.620 0.0174 *
## seedingyes 1.12011 1.20725 0.928 0.3658
## cloudcover 0.01821 0.11508 0.158 0.8761
## prewetness 2.55109 2.70090 0.945 0.3574
## echomotionstationary 2.59855 1.54090 1.686 0.1090
## sne -1.27530 0.68015 -1.875 0.0771 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.855 on 18 degrees of freedom
## Multiple R-squared: 0.3403, Adjusted R-squared: 0.157
## F-statistic: 1.857 on 5 and 18 DF, p-value: 0.1524
anova(rainseed_lm)
## Analysis of Variance Table
##
## Response: rainfall
## Df Sum Sq Mean Sq F value Pr(>F)
## seeding 1 1.283 1.2834 0.1575 0.69613
## cloudcover 1 15.738 15.7377 1.9313 0.18157
## prewetness 1 0.003 0.0027 0.0003 0.98557
## echomotion 1 29.985 29.9853 3.6798 0.07108 .
## sne 1 28.649 28.6491 3.5158 0.07711 .
## Residuals 18 146.677 8.1487
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For this section, I began with creating a multiple linear regression model, which includes the two variables rainfall and seeding as well as the other independent variables (indicated with the + between them). Then, I analyzed the data using the summary and anova functions. As indicated on both, it seems that the suitability index (sne) has an influence on rainfall the most, although it is pretty slight. The way this can be seen is with the symbol next to the Pr(>F) column value.
As the suitability index (sne) appears to have a relationship with rainfall, build two new models relating this variable to rainfall, one for the seeding experiments, one for the non-seeding experiments. Compare the coefficients for the two models, and produce a figure showing the two models. What does the difference in slope suggest?
seedingY = clouds$seeding == "yes"
seedingN = clouds$seeding == "no"
sne = clouds$sne
lm(sne ~ rainfall * seedingY, data = clouds)
##
## Call:
## lm(formula = sne ~ rainfall * seedingY, data = clouds)
##
## Coefficients:
## (Intercept) rainfall seedingYTRUE
## 3.40530 -0.09516 0.74521
## rainfall:seedingYTRUE
## -0.08190
sne_rain_y = lm(sne ~ rainfall * seedingY, data = clouds)
summary(sne_rain_y)
##
## Call:
## lm(formula = sne ~ rainfall * seedingY, data = clouds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.06058 -0.44268 0.05072 0.51213 1.27134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.40530 0.40141 8.483 4.65e-08 ***
## rainfall -0.09516 0.07486 -1.271 0.218
## seedingYTRUE 0.74521 0.64655 1.153 0.263
## rainfall:seedingYTRUE -0.08190 0.12084 -0.678 0.506
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8737 on 20 degrees of freedom
## Multiple R-squared: 0.2282, Adjusted R-squared: 0.1124
## F-statistic: 1.971 on 3 and 20 DF, p-value: 0.1508
anova(sne_rain_y)
## Analysis of Variance Table
##
## Response: sne
## Df Sum Sq Mean Sq F value Pr(>F)
## rainfall 1 3.3005 3.3005 4.3236 0.05067 .
## seedingY 1 0.8624 0.8624 1.1297 0.30051
## rainfall:seedingY 1 0.3506 0.3506 0.4593 0.50571
## Residuals 20 15.2673 0.7634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm(sne ~ rainfall * seedingN, data = clouds)
##
## Call:
## lm(formula = sne ~ rainfall * seedingN, data = clouds)
##
## Coefficients:
## (Intercept) rainfall seedingNTRUE
## 4.1505 -0.1771 -0.7452
## rainfall:seedingNTRUE
## 0.0819
sne_rain_n = lm(sne ~ rainfall * seedingN, data = clouds)
summary(sne_rain_n)
##
## Call:
## lm(formula = sne ~ rainfall * seedingN, data = clouds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.06058 -0.44268 0.05072 0.51213 1.27134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.15051 0.50684 8.189 8.11e-08 ***
## rainfall -0.17706 0.09487 -1.866 0.0767 .
## seedingNTRUE -0.74521 0.64655 -1.153 0.2627
## rainfall:seedingNTRUE 0.08190 0.12084 0.678 0.5057
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8737 on 20 degrees of freedom
## Multiple R-squared: 0.2282, Adjusted R-squared: 0.1124
## F-statistic: 1.971 on 3 and 20 DF, p-value: 0.1508
anova(sne_rain_n)
## Analysis of Variance Table
##
## Response: sne
## Df Sum Sq Mean Sq F value Pr(>F)
## rainfall 1 3.3005 3.3005 4.3236 0.05067 .
## seedingN 1 0.8624 0.8624 1.1297 0.30051
## rainfall:seedingN 1 0.3506 0.3506 0.4593 0.50571
## Residuals 20 15.2673 0.7634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot_sne_rain = ggplot(clouds, aes(x = rainfall, y = sne, col = seeding)) + geom_point()
plot_sne_rain = plot_sne_rain + geom_smooth(method = "lm") + ggtitle("Rainfall vs Suitability Index") + xlab("Rainfall (m^3 * 1e+8)") + ylab("Suitability Index") + theme_bw()
print(plot_sne_rain)
## `geom_smooth()` using formula 'y ~ x'
Here, in order to differentiate between the seeding and non-seeding experiments, I created two new variables. From there, I was able to build the two linear regression models, evaluate the coefficients of both and then compare them to one another in a plot. As seen in the plot, both seeding and non-seeding experiments have a downward slope but one is steeper than the other, indicated that as the rainfall increases, sne decreases but the rate at which the sne decreases is faster with the seeding experiments whereas it’s more gradual with the non-seeding experiments.