1 Introduction

The time series data that I have chosen to analyze is the Monthly Precipitation in Pennsylvania over a span of 150 years. This data is provided by the National Centers for Environmental Information. The purpose for this data being collected is to track and monitor climate and precipitation trends. This helps support research that is related to weather forecasting and public safety. Since precipitation is a critical variable in studying the ever changing climate, knowing the long term trends in precipitation totals will impact how agriculture, infrastructure and water sources function. This will help services plan around different events such as droughts,floods, or normal precipitation.
The purpose of analyzing this time series is to observe the trends and behavior and to determine if there are any obvious patterns that can help us forecast future precipitation values. Through modeling this time series, some scientific questions that I want to answer are the following:
- Is there a seasonal pattern in precipitation, meaning is there more precipitation in the spring compared to the winter or summer to fall?
- Based on the patterns found in the historical data, can we forecast future precipitation values and compare them to the actual values to see how accurate it will be?
- Can we use this analysis to help agriculture to plan around drought or floods?

2 Exploratory Data Analysis

Total Monthly Precipitation in Pennsylvania (1895-2025)

The time series plot above shows the total monthly precipitation in Pennsylvania from the year 1895 to 2025 (present day). Along the y-axis is measured in inches of precipitation for each month and the x-axis is labeled from 0 to 1500 signifying 0 as the starting month in 1895 to a little past 1500 months.

2.1 Stationarity

Before I went ahead and detrended the time series above I found two hypothesis tests one called the Dicky-Fuller Test and another called the KPSS test. When looking at the above time series it does look obviously stationary, the mean looks constant and the variance also looks good (no big spikes). Just to make sure I used the two hypothesis to check for stationarity.

Dicky-Fuller Test:

\(H_0\) - The time series is not stationary.

\(H_a\) - The time series is stationary.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Precipitation$Value
## Dickey-Fuller = -10.218, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary

Since the p-value is less than 0.05, we reject the null hypothesis meaning that this series is stationary.

KPSS Test:

\(H_0\) - The time series is stationary.

\(H_a\) - The time series is not stationary.

## 
##  KPSS Test for Level Stationarity
## 
## data:  Precipitation$Value
## KPSS Level = 0.92124, Truncation lag parameter = 7, p-value = 0.01

The KPSS test has the opposite null hypothesis so this means that since the p-value is less than 0.05, we reject the null hypothesis meaning that this series is not stationary according to this test.
Now that I know that it fails one of the hypothesis tests I went on to detrend the time series.

2.2 Detrend

Detrended Series

After detrending the time series it does look almost identical to the original time series but in general it does look slightly better. Now when I run the hypothesis tests over again with my detrended time series they both should say it is stationary.

Dicky-Fuller Test:

\(H_0\) - The time series is not stationary.

\(H_a\) - The time series is stationary.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  detrended
## Dickey-Fuller = -10.218, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary

Test Statistic = -10.218

p-value = 0.01

Since the p-value is less than 0.05, we reject the null hypothesis meaning that this series is stationary.

KPSS Test:

\(H_0\) - The time series is stationary.

\(H_a\) - The time series is not stationary.

## 
##  KPSS Test for Level Stationarity
## 
## data:  detrended
## KPSS Level = 0.14733, Truncation lag parameter = 7, p-value = 0.1

Test Statistics = 0.14733

p-value: 0.1

Like I mentioned above,the hypotheses of the KPSS test are the opposite of the Dicky-Fuller test. So this means since the p-value is greater than 0.05, we fail to reject the null hypothesis meaning this time series is stationary.

2.3 ACF and PACF

Now that my time series is confirmed to be stationary I need to plot the ACF and PACF. I need to do this so I can fit a model to my detrended series.

ACF of Precipitation in PA

PACF of Precipitation in PA

When I was analyzing the ACF and PACF there were a few models that I believe can potentially fit this time series.
The first one is an ARMA(3,2) model. I chose this because the ACF and the PACF both seem to be tailing off with no cut offs signifying an ARIMA model, you will see later that it isn’t a horrible choice for the model.
The next two models I chose because there seems to be seasonality in the ACF and PACF, which makes sense since this is a time series on precipitation which does could depend on the season. My first seasonal model is a SARIMA(2,0,0)(1,1,1)[12] and my second one is a SARIMA(2,0,0)(1,0,1)[12]. Both of these models are a seasonal model but with one slight difference and that is the seasonal difference. Both models are taking one seasonal AR and MA term at lag 12 which makes sense when looking at the seasonal spikes in the ACF and PACF. In the next section you will see why I have very similar models with one slight difference.

3 Statistical Data Analysis

In the previous section I suggested three models that could fit this time series. They were the following:
ARMA(3,2)
SARIMA(2,0,0)(1,0,1)[12]
SARIMA(2,0,0)(1,1,1)[12]

3.1 Model 1 ARMA(3,2)

## <><><><><><><><><><><><><><>
##  
## Coefficients: 
##       Estimate     SE  t.value p.value
## ar1     1.7666 0.0254  69.5912  0.0000
## ar2    -1.0594 0.0440 -24.0883  0.0000
## ar3     0.0345 0.0254   1.3585  0.1745
## ma1    -1.7355    NaN      NaN     NaN
## ma2     0.9993    NaN      NaN     NaN
## xmean  -0.0059 0.0336  -0.1758  0.8605
## 
## sigma^2 estimated as 1.691657 on 1556 degrees of freedom 
##  
## AIC = 3.377028  AICc = 3.377063  BIC = 3.401021 
##

3.2 Model 2 SARIMA(2,0,0)(1,1,1)[12]

## <><><><><><><><><><><><><><>
##  
## Coefficients: 
##          Estimate     SE  t.value p.value
## ar1        0.0351 0.0254   1.3809  0.1675
## ar2        0.0241 0.0254   0.9473  0.3436
## sar1      -0.0031 0.0263  -0.1167  0.9071
## sma1      -0.9857 0.0110 -89.9844  0.0000
## constant   0.0000 0.0001   0.0390  0.9689
## 
## sigma^2 estimated as 1.695389 on 1545 degrees of freedom 
##  
## AIC = 3.400984  AICc = 3.401009  BIC = 3.421678 
##

3.3 Model 3 SARIMA(2,0,0)(1,0,1)[12]

## <><><><><><><><><><><><><><>
##  
## Coefficients: 
##       Estimate     SE   t.value p.value
## ar1     0.0406 0.0254    1.5968  0.1105
## ar2     0.0267 0.0253    1.0536  0.2922
## sar1    0.9991 0.0012  854.3352  0.0000
## sma1   -0.9856 0.0096 -102.4054  0.0000
## xmean   0.0092 0.1413    0.0652  0.9481
## 
## sigma^2 estimated as 1.694474 on 1557 degrees of freedom 
##  
## AIC = 3.384692  AICc = 3.384717  BIC = 3.405257 
##

When comparing these models the first thing I am going to do is compare the AIC and BIC values (Hidden) for the three of them.
Model 1 AIC: 3.37696 | BIC: 3.400953
Model 2 AIC: 3.400984 | BIC: 3.421678
Model 3 AIC: 3.384692 | BIC: 3.405257
When looking at AIC and BIC values we are primarily looking for the model that has the lowest of these values. According to the values I presented above the model with the lowest AIC and BIC values is Model 1 ARMA(3,2) which would indicate that this is the best model. However this does not mean that this actually the best model. Now we have to take a look at the residual plots and analyze them.
Model 1:

Model 1 Residuals

When looking at the plot of the standardized residuals there is no obvious pattern or any trend. Looking at the ACF of the Residuals at all of the lags the autocorrelation is between the blue lines which is good and means they are not correlated. The Normal Q-Q plot of the Std Residuals appear normal until you reach the tails where they appear to deviate from normality(which is fine) and finally the p values do all appear to be high except right in the beginning meaning not all of the residuals are not correlated and are consistent with white noise. This model would have seemed like a good fit since the AIC and BIC mentioned above for this model is the lowest but this model fails to be the best because of the p-value. For Ljung-Box statistic the null hypothesis is The data is not correlated and the alternate hypothesis is the data exhibits serial correlation. Since this model has p-value at 0 we reject the null hypothesis which makes this model no good even the AIC and BIC is the lowest.
Model 2:

Model 2 Residuals

Analyzing the residuals of model 2: Looking at the plot of the standardized residuals there is no obvious pattern or any trend that I can see. Looking at the ACF of the Residuals at all of the lags the autocorrelation is between the blue lines which is good and means they are not correlated. The Normal Q-Q plot of the Std Residuals appear normal until you reach the tails where they appear to deviate from normality (which is fine) and finally the p values do all appear to be high and above 0 meaning residuals are not correlated and are consistent with white noise. This seasonal model does seem to be a pretty good fit. Everything checks out, but I need to check one more model with a slight change.
Model 3:

Model 3 Residuals

Analyzing the residuals of model 3 when seasonal differencing is 0: Looking at the plot of the standardized residuals there is no obvious pattern or any trend that I can see, just like the last model. Looking at the ACF of the Residuals at all of the lags the autocorrelation is between the blue lines which is good and means they are not correlated. The Normal Q-Q plot of the Std Residuals appear normal until you reach the tails where they appear to deviate from normality (which is fine) and finally the p values do all appear to be high and above 0 meaning residuals are not correlated and are consistent with white noise. The residuals do look slightly better and so does the p-values when compared to the first two models. This seasonal model does seem to be the best fit model for my detrended time series.

3.4 Estimates of Coefficients:

Now I am going to show the estimates of the coefficients.

\[\left(1-\phi_1 B - \phi_2 B^2 - \phi_3 B^3\right)X_t=\left(1 + \theta_1 B + \theta_2 B^2\right)w_t \]

Coefficient Estimates for Model 1
Terms	Estimate	SE	p.value
ar1	1.7669	0.0254	0.0000
ar2	-1.0599	0.0440	0.0000
ar3	0.0348	0.0254	0.1707
ma1	-1.7356	NaN	NaN
ma2	0.9995	NaN	NaN
xmean	0.0066	0.0336	0.8436

\[ \left(1 - \phi_1 B - \phi_2 B^2\right)\left(1 - \Phi_1 B^{12}\right)\left(1 - B^{12}\right)X_t = \left(1 + \Theta_1 B^{12}\right)w_t \]

Coefficient Estimates for Model 2
Terms	Estimate	SE	p.value
ar1	0.0351	0.0254	0.1675
ar2	0.0241	0.0254	0.3436
sar1	-0.0031	0.0263	0.9071
sma1	-0.9857	0.0110	0.0000
constant	0.0000	0.0001	0.9689

\[ \left(1 - \phi_1 B - \phi_2 B^2\right)\left(1 - \Phi_1 B^{12}\right)X_t = \left(1 + \Theta_1 B^{12}\right)w_t \]

Coefficient Estimates for Model 3
Terms	Estimate	SE	p.value
ar1	0.0406	0.0254	0.1105
ar2	0.0267	0.0253	0.2922
sar1	0.9919	0.0012	0.0000
sma1	-0.9856	0.0096	0.0000
constant	0.0092	0.1413	0.9481

4 Conclusions

My analysis of the time series and the detrended time series of monthly precipitation in Pennsylvania has shown patterns in the log term and the short term since 1895 until present day. To capture these patterns and trends I decided to try three different models. These models include ARMA(3,2) model, SARIMA(2,0,0)(1,1,1)[12], SARIMA(2,0,0)(1,0,1)[12].

4.1 Scientific Questions

Is there a seasonal pattern in precipitation, meaning is there more precipitation in the spring compared to the winter or summer to fall? Now we can say yes there is a seasonal pattern in precipitation. This means that we can tell if there is more precipitation in the spring compared to the winter or summer to fall, etc. We know this because looking at the ACF and PACF of the detrended series. They showed showed strong auto correlations that suggested a yearly seasonal pattern. Also through the SARIMA model fitted it helped us fully realize that there is a seasonal pattern.
Based on the patterns found in the historical data, can we forecast future precipitation values and compare them to the actual values to see how accurate it will be? Since the SARIMA(2,0,0)(1,0,1)[12] model passed all of the residual tests it is suitable for generating forecasts. So you we can forecast values and compare them to the actual values one year from now.
Based on our analysis we can forecast this model to predict future values of precipitation (One year into the Future).

Forecasted Precipitation

Can we use this analysis to help agriculture to plan around drought or floods? Through my analysis in the previous sections I think it is fairly obvious to say that yes we can use this analysis to help agriculture in planning around a drought or floods. We know that there is a seasonal pattern so farmers can use this data to try and predict when there could be a drought or floods and grow there crops around when it will be bad.

4.2 Limitations

When doing this analysis on the monthly precipitation in Pennsylvania time series a few limitations came up. One limitation is the model I have chosen. To me the model I have chosen is a very good model and fits well to the time series, but is it the best and perfect fit for this time series? I do not know. There could be a complex model that fits better that I do not know about. Another limitation could be other parameters that are effecting monthly precipitation that could throw off the seasonality or throw off forecasting.

4.3 Future Work

In the future, some work that can be done on this data could be adding more variables that directly impact precipitation. This could include if we are in a El Niño or La Niña weather pattern, this could change the total precipitation positively or negatively. Temperature could also play a big roll and can be used as a variable. Some more future work could also include examining the seasonal trends more closely to pinpoint the exact pattern of precipitation there is. Mentioned in the limitations section the data can be examined more closely to see if a more complex model can be fitted now that we could potentially add more inputs like weather patterns and temperature. With this in mind in the future we could more accurately forecast precipitation values.
This has been a Time Series Analysis of Monthly Precipitation in Pennsylvania from 1895 to present day.

Time Series Analysis of Monthly Precipitation in Pennsylvania

Jude Harris