Abstract
Introduction
Methods
Results
- Models and Model Comparisons
- Model Validtion
Conclusion

library(ggplot2)
library(tidyverse)
library(gridExtra)
library(finalfit)
library(kableExtra)
library(xtable)
jrothsch_theme <-  theme_bw() + 
  theme(text = element_text(size = 16, face = "bold", color = "deepskyblue4"),panel.grid = element_blank(),axis.text = element_text(size = 10, color = "gray13"), axis.title = element_text(size = 14, color = "black"), legend.text = element_text(colour="Black", size=10), legend.title = element_text(colour="Black", size=7), plot.subtitle = element_text(size=14, face="italic", color="black"))
setwd("~/Downloads")
rain <- read_csv("Calrain.csv")

rain <- rain %>% 
  mutate(log_altitude  = log(Altitude + 200))
full  <- rain  %>%
  lm(Rainfall~log_altitude + Latitude + Distance + Shadow, data = .) 

no_distance <- rain  %>%
  lm(Rainfall~log_altitude + Latitude  + Shadow, data = .) 

no_alt <- rain  %>%
  lm(Rainfall~Distance + Latitude  + Shadow, data = .) 


no_alt_distance <-  rain  %>%
  lm(Rainfall~ + Latitude  + Shadow, data = .) 
no_lat <-  rain  %>%
  lm(Rainfall~ + log_altitude  + Shadow + Distance, data = .) 

no_shadow <-  rain  %>%
  lm(Rainfall~ + log_altitude  + Latitude + Distance, data = .)

Abstract

Introduction

In this paper, I analyze annual rainfall level across meteorological stations in california. To do so, I create a linear regressional that best explains the variance in rainfaill across stations. By modeling rainfall in this way, we will be able to better understand current rainfall patterns and predict future rainfall.

Methods

Dataset

The dataset consists of 30 meteorological stations in the State of California

Variables

The variables can be broken down into two categories: outcome variables and explanatory variables ###Explanatory Variables Altitude: Altitude of station (feet)

Latitude: Latitude coordinate of the station

Distance: Distance of station from Pacific coast (miles)

Shadow: Binary variable; whether or not the station is facing east (Shadow = 1) or west (Shadow = 0)

Outcome Variables

Rainfall: Average rainfall per year(inches)

Analysis

The analysis consists of the following steps: 1) Exploratory analysis of variable distributions and the relationships between variables 2) Variable transformations based on exploratory analysis 3) Fitting of several regression models 4) Comparison and seletion of best regression model 5) Validaition of selected model

Models

We compare the following linear models of rainfall:

FULL: Rainfall = B1 ln(Altitude) + B2 Latitude + B3 Shadow + B4 Distance + E

No Distance: Rainfall = B1 ln(Altitude) + B2 Latitude + B3 Shadow + E

No Shadow: Rainfall = B1 ln(Altitude) + B2 Latitude + B3 Distance + E

No Altitude: Rainfall = B1 Latitude + B2 Shadow + B3 Distance + E

No Latitude: Rainfall = B1 ln(Altitude) + B2 Shadow + B3 Shadow + E

No Altitude/Distance: Rainfall = B1 Latitude + B2 Shadow + E

Results

##Exploratory analysis and transformation

pairs(~rain$Rainfall + rain$Altitude + rain$Latitude + rain$Distance, lower.panel=NULL)

rain <- rain %>% 
  mutate(log_altitude  = log(Altitude + 200))
#pairs(~rain$Rainfall + rain$log_altitude + rain$Latitude + rain$Distance, lower.panel=NULL)

ggplot(rain, aes(x = Rainfall, color = as.factor(Shadow))) + geom_density() + jrothsch_theme

I first use a scatterplot matrix to understand the relationships and rough distributions of the continuous variables. Based on this scatterplot, I use a log transformation of the altitude variable, which is used for the rest of the analysis.

A density plot is also used to visualizate the relationship between shadow and rainfall.

Models and Model Comparisons

I then create the regression models described in the methods section. We create 6 models: A full model with all explanatory variables, 4 models, each dropping one of the explanatory variables, and a 6th model dropping both Altitude and Distance, as these variables seemed to be less predictive of rainfall than the others. The full regression model is shown here for reference.

full  <- rain  %>%
  lm(Rainfall~log_altitude + Latitude + Distance + Shadow, data = .) 
summary(full)

## 
## Call:
## lm(formula = Rainfall ~ log_altitude + Latitude + Distance + 
##     Shadow, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.8796  -5.5947   0.0171   3.1460  27.9965 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -115.74944   25.79209  -4.488 0.000141 ***
## log_altitude    2.27710    1.48687   1.531 0.138211    
## Latitude        3.60238    0.66532   5.415 1.28e-05 ***
## Distance       -0.02856    0.03390  -0.842 0.407553    
## Shadow        -18.29977    4.09734  -4.466 0.000149 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.406 on 25 degrees of freedom
## Multiple R-squared:  0.7239, Adjusted R-squared:  0.6798 
## F-statistic: 16.39 on 4 and 25 DF,  p-value: 1.034e-06

Based on this full model, there was reason to believe that altitude and distance may not be useful in fitting the data. ANOVA results confirmed that this is the case, as there is not a significant difference in variance explained when dropping one or both of the variables.

Model analysis and comparison reveals more information about the relationship between variables.

anova(full, no_alt_distance)

## Analysis of Variance Table
## 
## Model 1: Rainfall ~ log_altitude + Latitude + Distance + Shadow
## Model 2: Rainfall ~ +Latitude + Shadow
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     25 2211.6                           
## 2     27 2419.8 -2   -208.21 1.1768 0.3248

However, when we remove latitude or shadow, we see that there is a large drop in variance explained, suggesting that these variables are needed in the model

anova(full, no_lat)

## Analysis of Variance Table
## 
## Model 1: Rainfall ~ log_altitude + Latitude + Distance + Shadow
## Model 2: Rainfall ~ +log_altitude + Shadow + Distance
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     25 2211.6                                 
## 2     26 4805.2 -1   -2593.6 29.317 1.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(full, no_shadow)

## Analysis of Variance Table
## 
## Model 1: Rainfall ~ log_altitude + Latitude + Distance + Shadow
## Model 2: Rainfall ~ +log_altitude + Latitude + Distance
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     25 2211.6                                  
## 2     26 3976.3 -1   -1764.7 19.947 0.0001486 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One other important finding is that there is a multicollinearity problem with Shadow and Distance. This is further confirmed by the increase in the size of the Distance coeffecient when shadow is dropped from the model

cor(rain$Shadow, rain$Distance)

## [1] 0.4897733

summary(no_shadow)

## 
## Call:
## lm(formula = Rainfall ~ +log_altitude + Latitude + Distance, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.694  -7.874   0.099   3.338  32.746 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -137.25448   33.31572  -4.120 0.000342 ***
## log_altitude    3.80510    1.90251   2.000 0.056043 .  
## Latitude        3.80607    0.87271   4.361 0.000181 ***
## Distance       -0.10910    0.03774  -2.890 0.007665 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 26 degrees of freedom
## Multiple R-squared:  0.5037, Adjusted R-squared:  0.4464 
## F-statistic: 8.795 on 3 and 26 DF,  p-value: 0.0003405

We therefore must choose whether to include shadow or distance in the model. Dropping distance uses less variance, so we keep shadow in the model. We also drop altitude, as it seems to not have explanatory power. The final model is therefore No Alt Distance, or Rainfall = B1 Latitude + B2 Shadow + E

summary(no_alt_distance)

## 
## Call:
## lm(formula = Rainfall ~ +Latitude + Shadow, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.852  -5.666  -1.403   4.002  26.792 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -103.3358    24.4955  -4.219 0.000248 ***
## Latitude       3.6310     0.6584   5.515 7.66e-06 ***
## Shadow       -19.9221     3.4882  -5.711 4.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.467 on 27 degrees of freedom
## Multiple R-squared:  0.698,  Adjusted R-squared:  0.6756 
## F-statistic:  31.2 on 2 and 27 DF,  p-value: 9.569e-08

Model Validtion

y_hat = fitted(no_alt_distance)
epsilon_hat = resid(no_alt_distance)
resid_df = data.frame(y_hat, epsilon_hat)

ggplot(resid_df, aes(x = y_hat, epsilon_hat)) +
  geom_point() +
  geom_smooth() +
  geom_abline(slope = 0, intercept = 0) + 
  jrothsch_theme +
  labs(title = "Residuals of Regression Analysis", y = "Residual", x = "Estimate")

A residual plot reveals that while there is some reason to be concerned about model assumptions; expected value of the residual doesn’t appear to be constant with respect to estimate, although it isn’t too far off.

plot(cooks.distance(no_alt_distance), main="Cook’s D vs index")

The cooks distance plot also reveals that there is an observation with extremely high influence. To ensure that our model isn’t being corrupted by this point, we rerun the regression model without said observation

rain_no_outlier <- rain %>%
  filter(`Station. ID` != "29" & `Station. ID` != "19")

no_alt_distance2 <-  rain_no_outlier  %>%
  lm(Rainfall~ + Latitude  + Shadow, data = .) 

summary(no_alt_distance2)

## 
## Call:
## lm(formula = Rainfall ~ +Latitude + Shadow, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.564 -4.593 -1.272  2.307 16.815 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -96.4179    20.7460  -4.648 9.29e-05 ***
## Latitude      3.3810     0.5641   5.994 2.93e-06 ***
## Shadow      -16.4752     2.7066  -6.087 2.32e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.084 on 25 degrees of freedom
## Multiple R-squared:  0.7392, Adjusted R-squared:  0.7183 
## F-statistic: 35.43 on 2 and 25 DF,  p-value: 5.058e-08

y_hat = fitted(no_alt_distance)
epsilon_hat = resid(no_alt_distance)
resid_df = data.frame(y_hat, epsilon_hat)

ggplot(resid_df, aes(x = y_hat, epsilon_hat)) +
  geom_point() +
  geom_smooth() +
  geom_abline(slope = 0, intercept = 0) + 
  jrothsch_theme +
  labs(title = "Residuals of Regression Analysis", y = "Residual", x = "Estimate")

The fit of the model improves, and we the residual plot is now more supportiuve of the linear regression assumptions

Conclusion

In this paper, we studied the relationship between California rainfall and various geographical factors. We find that a model using latitude and the east/west direction the station is facing to be the best fit for the data. This analysis can be used to predict the rainfall measured by potential new stations. I recommend further invetigation into predictive factors, as there is still a significant amount of unexplained variance in the model.

Modelling California Rainfall

Jake Rothschild

11/29/2019