library(ggplot2)
library(tidyverse)
library(gridExtra)
library(finalfit)
library(kableExtra)
library(xtable)
jrothsch_theme <- theme_bw() +
theme(text = element_text(size = 16, face = "bold", color = "deepskyblue4"),panel.grid = element_blank(),axis.text = element_text(size = 10, color = "gray13"), axis.title = element_text(size = 14, color = "black"), legend.text = element_text(colour="Black", size=10), legend.title = element_text(colour="Black", size=7), plot.subtitle = element_text(size=14, face="italic", color="black"))
setwd("~/Downloads")
rain <- read_csv("Calrain.csv")
rain <- rain %>%
mutate(log_altitude = log(Altitude + 200))
full <- rain %>%
lm(Rainfall~log_altitude + Latitude + Distance + Shadow, data = .)
no_distance <- rain %>%
lm(Rainfall~log_altitude + Latitude + Shadow, data = .)
no_alt <- rain %>%
lm(Rainfall~Distance + Latitude + Shadow, data = .)
no_alt_distance <- rain %>%
lm(Rainfall~ + Latitude + Shadow, data = .)
no_lat <- rain %>%
lm(Rainfall~ + log_altitude + Shadow + Distance, data = .)
no_shadow <- rain %>%
lm(Rainfall~ + log_altitude + Latitude + Distance, data = .)
In this paper, I analyze annual rainfall level across meteorological stations in california. To do so, I create a linear regressional that best explains the variance in rainfaill across stations. By modeling rainfall in this way, we will be able to better understand current rainfall patterns and predict future rainfall.
The dataset consists of 30 meteorological stations in the State of California
The variables can be broken down into two categories: outcome variables and explanatory variables ###Explanatory Variables Altitude: Altitude of station (feet)
Latitude: Latitude coordinate of the station
Distance: Distance of station from Pacific coast (miles)
Shadow: Binary variable; whether or not the station is facing east (Shadow = 1) or west (Shadow = 0)
Rainfall: Average rainfall per year(inches)
The analysis consists of the following steps: 1) Exploratory analysis of variable distributions and the relationships between variables 2) Variable transformations based on exploratory analysis 3) Fitting of several regression models 4) Comparison and seletion of best regression model 5) Validaition of selected model
We compare the following linear models of rainfall:
FULL: Rainfall = B1 ln(Altitude) + B2 Latitude + B3 Shadow + B4 Distance + E
No Distance: Rainfall = B1 ln(Altitude) + B2 Latitude + B3 Shadow + E
No Shadow: Rainfall = B1 ln(Altitude) + B2 Latitude + B3 Distance + E
No Altitude: Rainfall = B1 Latitude + B2 Shadow + B3 Distance + E
No Latitude: Rainfall = B1 ln(Altitude) + B2 Shadow + B3 Shadow + E
No Altitude/Distance: Rainfall = B1 Latitude + B2 Shadow + E
##Exploratory analysis and transformation
pairs(~rain$Rainfall + rain$Altitude + rain$Latitude + rain$Distance, lower.panel=NULL)
rain <- rain %>%
mutate(log_altitude = log(Altitude + 200))
#pairs(~rain$Rainfall + rain$log_altitude + rain$Latitude + rain$Distance, lower.panel=NULL)
ggplot(rain, aes(x = Rainfall, color = as.factor(Shadow))) + geom_density() + jrothsch_theme
I first use a scatterplot matrix to understand the relationships and rough distributions of the continuous variables. Based on this scatterplot, I use a log transformation of the altitude variable, which is used for the rest of the analysis.
A density plot is also used to visualizate the relationship between shadow and rainfall.
I then create the regression models described in the methods section. We create 6 models: A full model with all explanatory variables, 4 models, each dropping one of the explanatory variables, and a 6th model dropping both Altitude and Distance, as these variables seemed to be less predictive of rainfall than the others. The full regression model is shown here for reference.
full <- rain %>%
lm(Rainfall~log_altitude + Latitude + Distance + Shadow, data = .)
summary(full)
##
## Call:
## lm(formula = Rainfall ~ log_altitude + Latitude + Distance +
## Shadow, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.8796 -5.5947 0.0171 3.1460 27.9965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -115.74944 25.79209 -4.488 0.000141 ***
## log_altitude 2.27710 1.48687 1.531 0.138211
## Latitude 3.60238 0.66532 5.415 1.28e-05 ***
## Distance -0.02856 0.03390 -0.842 0.407553
## Shadow -18.29977 4.09734 -4.466 0.000149 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.406 on 25 degrees of freedom
## Multiple R-squared: 0.7239, Adjusted R-squared: 0.6798
## F-statistic: 16.39 on 4 and 25 DF, p-value: 1.034e-06
Based on this full model, there was reason to believe that altitude and distance may not be useful in fitting the data. ANOVA results confirmed that this is the case, as there is not a significant difference in variance explained when dropping one or both of the variables.
Model analysis and comparison reveals more information about the relationship between variables.
anova(full, no_alt_distance)
## Analysis of Variance Table
##
## Model 1: Rainfall ~ log_altitude + Latitude + Distance + Shadow
## Model 2: Rainfall ~ +Latitude + Shadow
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 2211.6
## 2 27 2419.8 -2 -208.21 1.1768 0.3248
However, when we remove latitude or shadow, we see that there is a large drop in variance explained, suggesting that these variables are needed in the model
anova(full, no_lat)
## Analysis of Variance Table
##
## Model 1: Rainfall ~ log_altitude + Latitude + Distance + Shadow
## Model 2: Rainfall ~ +log_altitude + Shadow + Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 2211.6
## 2 26 4805.2 -1 -2593.6 29.317 1.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(full, no_shadow)
## Analysis of Variance Table
##
## Model 1: Rainfall ~ log_altitude + Latitude + Distance + Shadow
## Model 2: Rainfall ~ +log_altitude + Latitude + Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 2211.6
## 2 26 3976.3 -1 -1764.7 19.947 0.0001486 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One other important finding is that there is a multicollinearity problem with Shadow and Distance. This is further confirmed by the increase in the size of the Distance coeffecient when shadow is dropped from the model
cor(rain$Shadow, rain$Distance)
## [1] 0.4897733
summary(no_shadow)
##
## Call:
## lm(formula = Rainfall ~ +log_altitude + Latitude + Distance,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.694 -7.874 0.099 3.338 32.746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -137.25448 33.31572 -4.120 0.000342 ***
## log_altitude 3.80510 1.90251 2.000 0.056043 .
## Latitude 3.80607 0.87271 4.361 0.000181 ***
## Distance -0.10910 0.03774 -2.890 0.007665 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 26 degrees of freedom
## Multiple R-squared: 0.5037, Adjusted R-squared: 0.4464
## F-statistic: 8.795 on 3 and 26 DF, p-value: 0.0003405
We therefore must choose whether to include shadow or distance in the model. Dropping distance uses less variance, so we keep shadow in the model. We also drop altitude, as it seems to not have explanatory power. The final model is therefore No Alt Distance, or Rainfall = B1 Latitude + B2 Shadow + E
summary(no_alt_distance)
##
## Call:
## lm(formula = Rainfall ~ +Latitude + Shadow, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.852 -5.666 -1.403 4.002 26.792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -103.3358 24.4955 -4.219 0.000248 ***
## Latitude 3.6310 0.6584 5.515 7.66e-06 ***
## Shadow -19.9221 3.4882 -5.711 4.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.467 on 27 degrees of freedom
## Multiple R-squared: 0.698, Adjusted R-squared: 0.6756
## F-statistic: 31.2 on 2 and 27 DF, p-value: 9.569e-08
y_hat = fitted(no_alt_distance)
epsilon_hat = resid(no_alt_distance)
resid_df = data.frame(y_hat, epsilon_hat)
ggplot(resid_df, aes(x = y_hat, epsilon_hat)) +
geom_point() +
geom_smooth() +
geom_abline(slope = 0, intercept = 0) +
jrothsch_theme +
labs(title = "Residuals of Regression Analysis", y = "Residual", x = "Estimate")
A residual plot reveals that while there is some reason to be concerned about model assumptions; expected value of the residual doesn’t appear to be constant with respect to estimate, although it isn’t too far off.
plot(cooks.distance(no_alt_distance), main="Cook’s D vs index")
The cooks distance plot also reveals that there is an observation with extremely high influence. To ensure that our model isn’t being corrupted by this point, we rerun the regression model without said observation
rain_no_outlier <- rain %>%
filter(`Station. ID` != "29" & `Station. ID` != "19")
no_alt_distance2 <- rain_no_outlier %>%
lm(Rainfall~ + Latitude + Shadow, data = .)
summary(no_alt_distance2)
##
## Call:
## lm(formula = Rainfall ~ +Latitude + Shadow, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.564 -4.593 -1.272 2.307 16.815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -96.4179 20.7460 -4.648 9.29e-05 ***
## Latitude 3.3810 0.5641 5.994 2.93e-06 ***
## Shadow -16.4752 2.7066 -6.087 2.32e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.084 on 25 degrees of freedom
## Multiple R-squared: 0.7392, Adjusted R-squared: 0.7183
## F-statistic: 35.43 on 2 and 25 DF, p-value: 5.058e-08
y_hat = fitted(no_alt_distance)
epsilon_hat = resid(no_alt_distance)
resid_df = data.frame(y_hat, epsilon_hat)
ggplot(resid_df, aes(x = y_hat, epsilon_hat)) +
geom_point() +
geom_smooth() +
geom_abline(slope = 0, intercept = 0) +
jrothsch_theme +
labs(title = "Residuals of Regression Analysis", y = "Residual", x = "Estimate")
The fit of the model improves, and we the residual plot is now more supportiuve of the linear regression assumptions
In this paper, we studied the relationship between California rainfall and various geographical factors. We find that a model using latitude and the east/west direction the station is facing to be the best fit for the data. This analysis can be used to predict the rainfall measured by potential new stations. I recommend further invetigation into predictive factors, as there is still a significant amount of unexplained variance in the model.