Project 3: Cloud Seeding

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(rempsyc)

## Suggested APA citation: Thériault, R. (2022). rempsyc: Convenience functions for psychology 
## (R package version 0.1.2) [Computer software]. https://rempsyc.remi-theriault.com

library(gtable)

Analysis

Explore the difference in rainfall between seeding and non-seeding experiments by looking at the statistical characteristics (mean, sd, etc) of the experiments and by using visualization. Use a t-test to see whether there is a significant difference.

The first step in any analysis is understanding the data. The clouds.csv dataset has the following structure:

X: data sample number
seeding: (“no”, “yes”)
time: number of days after the first experiment (data sample #1)
sne: suitability criterion (ranges from 1.3 to 4.65)
cloudcover: % cloud cover in the experment area (ranges 2.2 to 37.9)
prewetness: total rainfall in area an hour before seeding (cubic meters)
echomotion: (“stationary”, “moving”) radar echo factor
rainfall: amount of rain (cubic meters)

There are 24 samples.

I will begin the analysis by dividing the cloud dataset into seeded and non-seeded groups. Using the summary() function, I can get a quick view of the two datasets and see if there is any information that stands out.

clouds <- read.csv("clouds.csv")

seed_col = c("darkorange3", "cornflowerblue")
# divide data into seeded and not seeded data frames
noseed <- clouds[clouds$seeding == "no", ]
seed <- clouds[clouds$seeding == "yes", ]

#summary(noseed)
#summary(seed)

b1 = ggplot(clouds, aes(y=sne, x=seeding, fill=seeding)) + 
  geom_boxplot() + theme_bw() + theme(legend.position = "none") +
  scale_fill_manual(values = seed_col)
b2 = ggplot(clouds, aes(y=cloudcover, x=seeding, fill=seeding)) + 
  geom_boxplot() + theme_bw() + theme(legend.position = "none") +
  scale_fill_manual(values = seed_col)
b3 = ggplot(clouds, aes(y=prewetness, x=seeding, fill=seeding)) + 
  geom_boxplot() + theme_bw() + theme(legend.position = "none") +
  scale_fill_manual(values = seed_col)
b4 = ggplot(clouds, aes(y=rainfall, x=seeding, fill=seeding)) + 
  geom_boxplot() + theme_bw() + theme(legend.position = "none") +
  scale_fill_manual(values = seed_col)
grid.arrange(b1, b2, b3, b4, nrow = 2)

There is a slight increase in the means of SNE, cloud clover, and rainfall in the seed data vs. the non-seeded data. There is a slight decrease in the mean for prewetness in the seed data vs. non-seeded data, although the range of the prewetness variable is higher in the seeded data than in the non-seeded.

Exploratory Analysis of the Relationship between Variables

p1 <- clouds %>% ggplot() + geom_point(aes(x=time, y=sne, col=seeding)) +
  ggtitle("SNE vs. Time") + theme_bw() +
  scale_color_manual(values = seed_col) + theme(legend.position = "none")

p2 = clouds %>% ggplot() + geom_point(aes(x=time, y=cloudcover, col=seeding)) +
  ggtitle("Cloud Cover vs. Time") + theme_bw()+
  scale_color_manual(values = seed_col) + theme(legend.position = "none")

p3 = clouds %>% ggplot() + geom_point(aes(x=prewetness, y=cloudcover, col=seeding)) +
  ggtitle("Cloud Cover vs. Pre-Wetness") + theme_bw()+
  scale_color_manual(values = seed_col)

p4 = clouds %>% ggplot() + geom_point(aes(x=prewetness, y=rainfall, col=seeding)) +
  ggtitle("Rainfall vs. Pre-Wetness") + theme_bw()+
  scale_color_manual(values = seed_col) + theme(legend.position = "none")

p5 = clouds %>% ggplot() + geom_point(aes(x=sne, y=rainfall, col=seeding)) +
  ggtitle("Rainfall vs. SNE") + theme_bw()+
  scale_color_manual(values = seed_col) + theme(legend.position = "none")

p6 = clouds %>% ggplot() + geom_point(aes(x=cloudcover, y=rainfall, col=seeding)) +
  ggtitle("Rainfall vs. Cloud Cover") + theme_bw()+
  scale_color_manual(values = seed_col)

grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2)

clouds %>% group_by(echomotion) %>% 
  ggplot(aes(x=echomotion, fill=seeding)) + 
  geom_bar(position = "dodge")+
  scale_fill_manual(values = seed_col) +
  theme_bw()

There are no clear trends between the variables or between the seed and no-seed data sets. Calculating the statistics for continuous variables and performing t-tests will show if there is any significant differences between the seeded and non-seeded data for each variable.

s_vs_ns_t = nice_t_test(clouds, 
                        response = c("sne", "cloudcover", 
                                     "prewetness", "rainfall"),
                        group = "seeding")

## [97mUsing Welch t-test (base R's default; cf. https://doi.org/10.5334/irsp.82).
## For the Student t-test, use `var.equal = TRUE`. 
##  [97m

nice_table(s_vs_ns_t)

Dependent Variable	t	df	p	d	95% CI
sne	-0.84	20.26	.408	-0.34	[-1.15, 0.47]
cloudcover	-0.29	14.09	.774	-0.12	[-0.92, 0.68]
prewetness	0.14	18.57	.890	0.06	[-0.74, 0.86]
rainfall	-0.36	20.87	.724	-0.15	[-0.95, 0.66]

The nice_t_test function performs t-test for each variable listed in the ‘response’. Each t-test is performed comparing the variable data for the seeded and non-seeded groups. The result of each t-test is shown in the table above. The ‘t’ is the t-value, ‘df’ is the degrees of freedom, ‘p’ is the p-value, ‘d’ is Cohen’s d, and ‘CI’ is the confidence interval for the Cohen’s d.

T-tests were run on the sne, cloudcover, prewetness, and rainfall variables.The p-values for all four tests are high, indicating that there is not a significant difference between the seeded and non-seeded variables for sne, cloudcover, prewetness, and rainfall. Perhaps modelling a combination of these variables will give better insight into their relationship.

Build a multiple linear regression model to model the effects of seeding on rainfall. Include cloudcover, prewetness, echomotion and sne as independent variables. Explore the model using the summary() and anova() functions. Which variables appear to influence rainfall the most?

clouds$echomotion = factor(clouds$echomotion)
lm_rain = lm(rainfall ~ cloudcover + prewetness + echomotion + sne, 
             data = clouds
             )
summary(lm_rain)

## 
## Call:
## lm(formula = rainfall ~ cloudcover + prewetness + echomotion + 
##     sne, data = clouds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4396 -1.5824  0.2265  1.2153  7.0778 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            6.5789     2.4157   2.723   0.0135 *
## cloudcover             0.0366     0.1129   0.324   0.7494  
## prewetness             2.1520     2.6566   0.810   0.4279  
## echomotionstationary   2.4370     1.5254   1.598   0.1266  
## sne                   -1.1526     0.6647  -1.734   0.0991 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.844 on 19 degrees of freedom
## Multiple R-squared:  0.3087, Adjusted R-squared:  0.1632 
## F-statistic: 2.122 on 4 and 19 DF,  p-value: 0.1178

Using the entire dataset, it appears that the SNE variable is the only one with a significant p-value (pVal = 0.1, which isn’t that great). The adjusted R-squared value is also very low (0.1632), suggesting that the relationship between these variables and rainfall in not strong. The F-statistic is 2.12 with a p-value of 0.12, which is ‘meh’ on the significance scale.

ggplot(lm_rain, aes(x=lm_rain$residuals)) + geom_histogram(binwidth = 1) +
  xlab("Rainfall Model Residuals") + ylab("") +
  ggtitle("Distribution of Residuals") + theme_bw()

The residuals for the model are almost centered, and sort of Normally distributed.

ggplot(lm_rain, aes(x=lm_rain$fitted.values, y=lm_rain$residuals)) + 
  geom_line() + 
  xlab("Fitted Values") + ylab("Residuals") +
  ggtitle("Linear Model Residuals") + theme_bw()

Comparing the residuals with the fitted values from the linear model, there is a large amount of variability suggesting that there may be higher order terms influencing the model.

anova(lm_rain)

## Analysis of Variance Table
## 
## Response: rainfall
##            Df  Sum Sq Mean Sq F value  Pr(>F)  
## cloudcover  1  16.240 16.2396  2.0076 0.17270  
## prewetness  1   0.001  0.0009  0.0001 0.99180  
## echomotion  1  28.082 28.0818  3.4716 0.07796 .
## sne         1  24.321 24.3212  3.0067 0.09912 .
## Residuals  19 153.691  8.0890                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Running anova() on the linear model for the entire dataset, it looks like echomotion and sne are the only variables that have influence in the model. The p-value for echomotion is 0.08 and the p-value for sne is 0.1. Both values are still on the bigger side (i.e. > 0.05).

There does not appear to be a strong relationship between rainfall and the other variables in the model for this dataset. Dividing the dataset into seeded and non-seeded groups may improve the performance of the model.

lm_rain_ns = lm(rainfall ~ 
                  cloudcover + prewetness + echomotion + sne, data = noseed)
summary(lm_rain_ns)

## 
## Call:
## lm(formula = rainfall ~ cloudcover + prewetness + echomotion + 
##     sne, data = noseed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.468 -1.680 -0.283  1.121  3.070 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)           -0.2595     3.0715  -0.084    0.935
## cloudcover             0.4422     0.2377   1.860    0.105
## prewetness             5.6243     3.8575   1.458    0.188
## echomotionstationary   3.3306     2.1268   1.566    0.161
## sne                   -0.4333     0.7693  -0.563    0.591
## 
## Residual standard error: 2.429 on 7 degrees of freedom
## Multiple R-squared:  0.6967, Adjusted R-squared:  0.5234 
## F-statistic:  4.02 on 4 and 7 DF,  p-value: 0.05282

Using the same relationship between rainfall and the previous variables, but only using the non-seeded data, the model now has a higher adjusted R-squared value of 0.52, and a F-statistic of 4.02 with a p-value of 0.05. These values are all much better than before, but none of the independent variables show any significance. They all have p-values > 0.1.

anova(lm_rain_ns)

## Analysis of Variance Table
## 
## Response: rainfall
##            Df Sum Sq Mean Sq F value  Pr(>F)  
## cloudcover  1 69.284  69.284 11.7384 0.01104 *
## prewetness  1  1.664   1.664  0.2819 0.61190  
## echomotion  1 22.096  22.096  3.7437 0.09425 .
## sne         1  1.872   1.872  0.3172 0.59088  
## Residuals   7 41.316   5.902                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the F-statistic in ANOVA for the non-seeded data set, there may be a relationship between cloudcover and echomotion in modelling rainfall for the non-seeded data.

lm_rain_s = lm(rainfall ~ 
                  cloudcover + prewetness + echomotion + sne, data = seed)
summary(lm_rain_s)

## 
## Call:
## lm(formula = rainfall ~ cloudcover + prewetness + echomotion + 
##     sne, data = seed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4331 -0.9368 -0.3556  0.9902  4.5028 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          13.06280    3.36617   3.881  0.00605 **
## cloudcover           -0.05349    0.11282  -0.474  0.64985   
## prewetness            1.70050    2.82709   0.601  0.56647   
## echomotionstationary  2.99632    1.89393   1.582  0.15765   
## sne                  -2.72001    0.97453  -2.791  0.02687 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.316 on 7 degrees of freedom
## Multiple R-squared:  0.5574, Adjusted R-squared:  0.3046 
## F-statistic: 2.204 on 4 and 7 DF,  p-value: 0.1702

Creating a similar model for rainfall with only the seeded data results in an adjusted R-squared value of 0.3, and a F-statistic of 2.2 with a p-value of 0.17. These are all slightly better than the full data set model, but not much better. The SNE variable is the only significant variable with a p-value of 0.03, which is better than the full data set model.

anova(lm_rain_s)

## Analysis of Variance Table
## 
## Response: rainfall
##            Df Sum Sq Mean Sq F value  Pr(>F)  
## cloudcover  1  1.186   1.186  0.2211 0.65248  
## prewetness  1  0.320   0.320  0.0597 0.81392  
## echomotion  1  4.002   4.002  0.7463 0.41628  
## sne         1 41.775  41.775  7.7903 0.02687 *
## Residuals   7 37.537   5.362                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the F-statistic with ANOVA for the seeded data, the sne variable is, again, the only variable that appears to influence the model.

Dividing the data into the seed and non-seed groups doesn’t appear to improve things much. The SNE variable does show better significance in the seeded group.

As the suitability index (sne) appears to have a relationship with rainfall, build two new models relating this variable to rainfall, one for the seeding experiments, one for the non-seeding experiments. Compare the coefficients for the two models, and produce a figure showing the two models. What does the difference in slope suggest?

lm_rain_ns = lm(rainfall ~ sne, data = noseed)
summary(lm_rain_ns)

## 
## Call:
## lm(formula = rainfall ~ sne, data = noseed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4892 -2.1762  0.2958  1.4902  7.3616 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    7.319      3.160   2.317    0.043 *
## sne           -1.046      0.995  -1.052    0.318  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.502 on 10 degrees of freedom
## Multiple R-squared:  0.09957,    Adjusted R-squared:  0.009528 
## F-statistic: 1.106 on 1 and 10 DF,  p-value: 0.3177

Using only the SNE variable for the model, the non-seeded group has a high p-value for both the t-test and the F-statistic. The adjusted R-squared value is very low (0.010), suggesting that there is almost no relationship between SNE and rainfall for this dataset.

The intercept is 7.319 and the coefficient for SNE is -1.046 for the non-seeded model.

lm_rain_s = lm(rainfall ~ sne, data = seed)
summary(lm_rain_s)

## 
## Call:
## lm(formula = rainfall ~ sne, data = seed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0134 -1.3297 -0.3276  0.6171  4.3867 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  12.0202     2.9774   4.037  0.00237 **
## sne          -2.2180     0.8722  -2.543  0.02921 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.27 on 10 degrees of freedom
## Multiple R-squared:  0.3927, Adjusted R-squared:  0.332 
## F-statistic: 6.467 on 1 and 10 DF,  p-value: 0.02921

The SNE variable is significant for the seeded group, with a p-value of 0.03. The seeded model also has an adjusted R-squared value of 0.33, and a F-statistic of 6.5 with a p-value of 0.03. This suggested that there is a relationship between the SNE variable and rainfall for the seeded model, although this relationship is not very strong.

h1 = ggplot(lm_rain_ns, aes(x=lm_rain_ns$residuals)) +
  xlab("Rainfall Model Residuals") + ylab("") +
  ggtitle("Non-Seeded Rainfall Model") + theme_bw() +
  geom_histogram(binwidth = 1, fill="darkorange3") +
  xlim(-8, 8)
h2 = ggplot(lm_rain_s, aes(x=lm_rain_s$residuals)) +
  xlab("Rainfall Model Residuals") + ylab("") +
  ggtitle("Seeded Rainfall Model") + theme_bw() +
  geom_histogram(binwidth = 1, fill="cornflowerblue") +
  xlim(-8, 8)

grid.arrange(h1, h2, nrow = 1)

## Warning: Removed 2 rows containing missing values (`geom_bar()`).
## Removed 2 rows containing missing values (`geom_bar()`).

The residuals for the non-seeded group is not normally distributed. The residuals for the seeded group is normally distributed, but not centered at zero.

ggplot(lm_rain_ns, aes(x=lm_rain_ns$fitted.values, y=lm_rain_ns$residuals), color="darkorange3") +
  geom_line(color = "darkorange3", lwd=1) +
  geom_line(aes(x=lm_rain_s$fitted.values, y=lm_rain_s$residuals), 
            color="cornflowerblue", lwd=1) +
  labs(title = "Model: Rainfall ~ SNE",
          x="Fitted Values",
          y="Residuals") +
  annotate("text", x=2.5, y=7.5, label="Non Seeded", col="darkorange3") +
  annotate("text", x=2.5, y=6.5, label="Seeded", col="cornflowerblue") +
  theme_bw()

Comparing the fitted values to the residuals for each data set, there is less variability in the seeded model. Neither model has a flat or close-to-zero shape for the residuals.

subtitle_ns = paste("Adj R2 = ",signif(summary(lm_rain_ns)$adj.r.squared, 2),
                    " Intercept =",signif(lm_rain_ns$coef[[1]],3),
                    " \nSlope =",signif(lm_rain_ns$coef[[2]], 3),
                    " P =",signif(summary(lm_rain_ns)$coef[2,4], 3),
                    " 95% CI")
pns = ggplot(lm_rain_ns$model, 
             aes(x = lm_rain_ns$model$sne, y = lm_rain_ns$model$rainfall)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "darkorange3") +
  ggtitle("Non Seeded Data Model", subtitle_ns) +
  xlab("SNE") + ylab("Rainfall") +
  xlim(1, 5) + theme_bw() +
  scale_y_continuous(limits=c(0,13), breaks=0:13)

subtitle_s = paste("Adj R2 = ",signif(summary(lm_rain_s)$adj.r.squared, 2),
                   " Intercept =",signif(lm_rain_s$coef[[1]],3 ),
                   " \nSlope =",signif(lm_rain_s$coef[[2]], 3),
                   " P =",signif(summary(lm_rain_s)$coef[2,4], 3),
                   " 95% CI")

ps = ggplot(lm_rain_s$model, 
             aes(x = lm_rain_s$model$sne, y = lm_rain_s$model$rainfall)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "cornflowerblue") +
  ggtitle("Seeded Data Model", subtitle_s) +
  xlab("SNE") + ylab("") + 
  xlim(1, 5) + theme_bw() +
  scale_y_continuous(limits=c(0,13), breaks=0:13)

grid.arrange(pns, ps, nrow = 1)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

The figure above displays the coefficients for each model. The shaded region represents the 95% confidence interval for the models. Both models have a negative relationship between the SNE variable and rainfall. The slope for the seeded model is more than twice that of the non-seeded model. This suggests that the seeded model is more responsive to the SNE variable than the non-seeded model. The seeded model also has less variability than the non-seeded model, shown by the narrower gray area in the seeded data model plot.

The fitted vs residual plot suggested that there may be higher order terms involved in the relationship between the variables. What would happen if we tried to model that? It seems logical that prewetness, sne, and cloud cover would have related factors (humidity, temperature, air pressure for example). Adding this relationship to the model for the seeded data samples:

lm3 = lm(rainfall ~ sne*prewetness*cloudcover, data = seed)
summary(lm3)

## 
## Call:
## lm(formula = rainfall ~ sne * prewetness * cloudcover, data = seed)
## 
## Residuals:
##         2         3         5        10        11        12        14        15 
## -0.001089  1.537529  0.037279 -0.035826 -0.917346 -1.041667  1.636277 -0.031520 
##        18        20        21        23 
##  0.022312  0.053693 -1.151637 -0.108004 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)  
## (Intercept)               -44.6223    14.8283  -3.009   0.0396 *
## sne                        14.2930     4.4889   3.184   0.0334 *
## prewetness                169.2530    77.7526   2.177   0.0951 .
## cloudcover                  8.5729     2.1133   4.057   0.0154 *
## sne:prewetness            -49.5781    23.1660  -2.140   0.0991 .
## sne:cloudcover             -2.5506     0.6837  -3.731   0.0203 *
## prewetness:cloudcover     -18.0634     8.9331  -2.022   0.1132  
## sne:prewetness:cloudcover   5.9405     3.1985   1.857   0.1368  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.442 on 4 degrees of freedom
## Multiple R-squared:  0.902,  Adjusted R-squared:  0.7305 
## F-statistic: 5.259 on 7 and 4 DF,  p-value: 0.06396

The adjusted R-squared value is 0.73 (vs. 0.3 for the previous model only using SNE), with an F-statistic of 5.3 with a p-value of 0.06. The p-values for the various relationships are also lower, with SNE, SNE:prewetness, and prewetness:cloudcover being below 0.05.

anova(lm3)

## Analysis of Variance Table
## 
## Response: rainfall
##                           Df Sum Sq Mean Sq F value  Pr(>F)  
## sne                        1 33.310  33.310 16.0281 0.01608 *
## prewetness                 1  0.052   0.052  0.0249 0.88231  
## cloudcover                 1  0.499   0.499  0.2399 0.64990  
## sne:prewetness             1  0.876   0.876  0.4214 0.55161  
## sne:cloudcover             1 14.227  14.227  6.8459 0.05902 .
## prewetness:cloudcover      1 20.374  20.374  9.8033 0.03515 *
## sne:prewetness:cloudcover  1  7.169   7.169  3.4494 0.13683  
## Residuals                  4  8.313   2.078                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analyzing the variance of the model with anova(), SNE and the combination of SNE:prewetness:cloudcover are significant factors.

r1 = ggplot(lm3, aes(x=lm3$residuals)) +
  xlab("Rainfall Model Residuals") + ylab("") +
  ggtitle("Rainfall~SNE*prewetness*cloudcover") + theme_bw() +
  geom_histogram(binwidth = 1, fill="purple4") +
  xlim(-8, 8)
r2 = ggplot(lm_rain_s, aes(x=lm_rain_s$residuals)) +
  xlab("Rainfall Model Residuals") + ylab("") +
  ggtitle("Rainfall~SNE") + theme_bw() +
  geom_histogram(binwidth = 1, fill="cornflowerblue") +
  xlim(-8, 8)

grid.arrange(r2, r1, nrow = 1)

## Warning: Removed 2 rows containing missing values (`geom_bar()`).
## Removed 2 rows containing missing values (`geom_bar()`).

The distribution of the residuals shows that the new model is closer to a normal distribution, with less variation, and is centered around zero.

ggplot(lm3, aes(x=lm3$fitted.values, y=lm3$residuals)) + 
  geom_line(col = "purple4", lwd=1) + 
  geom_line(aes(x=lm_rain_s$fitted.values, y=lm_rain_s$residuals), 
            col="cornflowerblue", lwd=1) +
  annotate("text", x=5, y=3, label="Rainfall~SNE*prewetness*cloudcover", col="purple4") +
  annotate("text", x=5.5, y=2.5, label="Rainfall~SNE", col="cornflowerblue") +
  xlab("Fitted Values") + ylab("Residuals") +
  ggtitle("Seeded Model") + theme_bw() +
  xlim(2,6) + ylim(-3,3) +
  theme_bw()

## Warning: Removed 2 rows containing missing values (`geom_line()`).

## Warning: Removed 3 rows containing missing values (`geom_line()`).

There is not much change in range for the residuals vs. the fitted data. However, the residuals for the new model are more equally distributed around zero.

subtitle_lm3 = paste("Adj R2 = ",signif(summary(lm3)$adj.r.squared, 2),
                    " Intercept =",signif(lm3$coef[[1]],3),
                    " \nSlope =",signif(lm3$coef[[2]], 3),
                    " P =",signif(summary(lm3)$coef[2,4], 3),
                    " 95% CI")
pns = ggplot(lm3$model, 
             aes(x = lm3$model$sne, y = lm3$model$rainfall)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "purple4") +
  ggtitle("Rainfall~SNE*prewetness*cloudcover", subtitle_lm3) +
  xlab("SNE") + ylab("Rainfall") +
  xlim(1, 5) + theme_bw() +
  scale_y_continuous(limits=c(0,13), breaks=0:13)

subtitle_s = paste("Adj R2 = ",signif(summary(lm_rain_s)$adj.r.squared, 2),
                   " Intercept =",signif(lm_rain_s$coef[[1]],3 ),
                   " \nSlope =",signif(lm_rain_s$coef[[2]], 3),
                   " P =",signif(summary(lm_rain_s)$coef[2,4], 3),
                   " 95% CI")

ps = ggplot(lm_rain_s$model, 
             aes(x = lm_rain_s$model$sne, y = lm_rain_s$model$rainfall)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "cornflowerblue") +
  ggtitle("Rainfall~SNE", subtitle_s) +
  xlab("SNE") + ylab("") + 
  xlim(1, 5) + theme_bw() +
  scale_y_continuous(limits=c(0,13), breaks=0:13)

grid.arrange(pns, ps, nrow = 1)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

The relationship for the new model between rainfall and the other variables is:

\[ rainfall = -44.6 + 14.3SNE+169p_w+8.57c_c-49.6SNE*p_w-2.55SNE*c_c-18.1p_w*c_c+5.94SNE*p_w*c_c \]

where \(SNE\) is the SNE variable, \(p_w\) is prewetness, and \(c_c\) is cloud cover.

While the new model does appear to be slightly better (lower p-values, higher adjusted R-squared value), it does not mean that it will perform better than the simple model. The residuals don’t appear to be decreased. The slight improvements in the model may not be worth the additional complexity of the higher order terms.

Conclusion

The initial analysis saw no obvious trends or relationships between the variables. Dividing the data into seeded and non-seeded data sets did not show any trends. Comparing the means of the two datasets for all the variables also was inconclusive. The t-tests resulted in high p-values, which means that we can not say with any certainty that the differences between the means are not just due to chance.

Modeling the data with the full dataset (seeded and non-seeded) using cloud cover, prewetness, SNE, and echomotion resulted in a pretty bad model (low adjusted R-squared value, high p-values). Dividing the data into seeded and non-seeded data sets did improve the models slightly, and the SNE variable appeared to have some influence on rainfall.

Modelling rainfall as a function of the SNE variable for each dataset, seeded and non-seeded, we could see that the SNE variable plays a bigger role in the seeded dataset than in the non-seeded dataset. Another model was created using the seeded dataset and allowing for non-independent relationship between SNE, cloud cover, and prewetness. This model had a much higher adjusted R-squared value and more normally distributed residuals, but the slight increase in performance may not be worth the added complexity of the model solution.

The main difficulty with this dataset is the low number of samples. There are only 24 samples, in total, and only 12 samples for the seeded and non-seeded data. This does not seem to be enough samples to model such a complex system as cloud seeding. More data samples would help to better determine if there are significant relationships between rainfall and the other variables that are measured.

Project 3: Cloud Seeding

Sherlyn Hilton

2023-06-19

Analysis

Exploratory Analysis of the Relationship between Variables

Conclusion