Abishek Arunachalam
30/03/18
Introduction
Time-series data contains a sequence of discrete data points ordered by time. As most of the data points are autocorrelated in nature regression techniques cannot be used as they fail to satisfy the basic assumptions(independance of errors). This type of data is commonly found in seismology, meteorology, geophysics, econometrics, quantitative finance etc. where patterns vary with time.
Getting data
Global Land-Sea Average temperatures dataset is used for analysis(Source URL:https://datahub.io/core/global-temp#r). The dataset contains the monthly mean average temperature(measured in °C) from the year 1880 - 2016. Data are included from the GISS Surface Temperature (GISTEMP) analysis and the global component of Climate at a Glance (GCAG).
We start by locating the URL of the JSON(Javascript object notation- A file format similar to csv for storing data) data package.
library(jsonlite)
library(ggplot2)
library(scales)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(xts)
## Warning: package 'xts' was built under R version 3.4.4
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tseries)
library(forecast)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(epitools)
json_file <- 'https://datahub.io/core/global-temp/datapackage.json'
json_data <- fromJSON(paste(readLines(json_file), collapse=""))
## Warning in readLines(json_file): incomplete final line found on 'https://
## datahub.io/core/global-temp/datapackage.json'
Data is imported. Let’s print the names of resources available in the JSON package.
print(json_data$resources$name)
## [1] "validation_report" "annual_csv" "monthly_csv"
## [4] "annual_json" "monthly_json" "global-temp_zip"
## [7] "monthly_csv_preview" "annual" "monthly"
The monthly_csv resource seems to be interesting as more trends would be captured in monthly data. Let’s import it.
path_to_file = json_data$resources$path[3]
monthly_global_temp <- read.csv(url(path_to_file))
head(monthly_global_temp)
Cleaning data
Ahh there it is! Let’s inspect the structure of the data.
str(monthly_global_temp)
## 'data.frame': 3288 obs. of 3 variables:
## $ Source: Factor w/ 2 levels "GCAG","GISTEMP": 1 2 1 2 1 2 1 2 1 2 ...
## $ Date : Factor w/ 1644 levels "1880-01-15","1880-02-15",..: 1644 1644 1643 1643 1642 1642 1641 1641 1640 1640 ...
## $ Mean : num 0.789 0.81 0.75 0.93 0.729 ...
There are two factor variables and a numeric variable. I want to convert the Date variable from factor to Date format and remove the Source column for ease of analysis.
monthly_global_temp$Date <- as.Date(monthly_global_temp$Date)
global_temp <- monthly_global_temp[,c(2,3)] #select the last two columns i.e Date and Mean.
Exploratory Analysis
Lets’ do a ggplot() of the data to see the trend.
ggplot(data=global_temp,aes(x=Date, y=ts(Mean)))+
geom_line()+
labs(x="Year",y="Average Temperature",main="Yearly Mean Temperature")+
scale_x_date(labels =date_format("%Y"))+
scale_y_continuous(limits=c(-1.0, 1.3))+
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
This is a typical time-series plot. We can see that the temperature has been raising steadily since 1950.
I’m curious to know if there is some difference in the mean values measured from GISTEMP and GCAG methods. Let’s group them and do a student’s t-test to find it.
GCAG <- filter(monthly_global_temp,Source=="GCAG")
GISETEMP <- filter(monthly_global_temp,Source=="GISTEMP")
t.test(GCAG$Mean,GISETEMP$Mean)
##
## Welch Two Sample t-test
##
## data: GCAG$Mean and GISETEMP$Mean
## t = 2.0892, df = 3277.1, p-value = 0.03676
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.001502512 0.047333132
## sample estimates:
## mean of x mean of y
## 0.04879738 0.02437956
We have a p-value of 0.03676. The mean of GCAG is 0.04879738 and GISTEMP is 0.02437956 respectively. In fact, a very marginal mean difference. But, since the values are on a smaller scale the difference is significant enough to reject null hypothesis and there is evidence for difference in the means.
Now, we are going to group the data based on year and calculate the mean temperatures to study how they have changed.
groupByYear <- cut(global_temp$Date, breaks = "year") #Group dates based on year
yearly_mean <- aggregate(global_temp$Mean, list(groupByYear), mean) #Calculate the mean temperature for each year
names(yearly_mean) <- c("Date","Mean") #Rename the columns
extract_year <- as.numeric(substring(yearly_mean$Date,1,4)) #Extract year from date
yearly_mean <- mutate(yearly_mean,Year = extract_year) # Add year as a new column
ggplot(yearly_mean, aes(x=Year, y=Mean,group=Year,fill=Mean))+
geom_bar(stat = "identity")+
theme(axis.text.x = element_text(size = 10,
angle = 45,
hjust = 1,
vjust = 1))
The mean global temperature has been below freezing point until 1940. There is a slight increase and a dip in the middle and the temperature has been increasing steadily since 1978.
Let’s do a box-plot and see how the global temperature vary by month.
extract_month <- month(as.POSIXlt(global_temp$Date, format="%Y-%m-%d"),abbr = F) #Extract year from date
Monthly_mean <- mutate(global_temp,Month = extract_month)
ggplot(Monthly_mean, aes(x=Month, y=Mean,group=Month,fill=Mean))+
geom_boxplot()+
scale_x_discrete(labels=month.abb)+
theme(axis.text.x = element_text(size = 10,
angle = 45,
hjust = 1,
vjust = 1))
The median temperature of each month is below 0°C. It might be because of the influence of large number of data paoints from cold years in the past. There are a few high temperature outliers in most of the months.
Decompose the data
Time-series data consists of three main components: Seasonality, Trend and Cycle.
Seasonality: It is the fluctuations with regards to calendar cycles.
Trend: It is the overall pattern of the series. Whether the trend is increasing or decreasing?
Cycle: It is the pattern observed that is not seasonal.
And finally, there is residual or errors associated with each of these components.
By decomposing we can extract these components from the data.
global_temp_xts <- xts(x=global_temp$Mean, order.by = global_temp$Date)
global_temp_ts <- ts(global_temp_xts, freq = 1, start = c(1880,1))
decomp <- stl(ts(global_temp$Mean,frequency = 50), s.window = "periodic") #calculate the seasonal component of the data.
deseasonal_temp <- seasadj(decomp) # remove the seasonal component from the data
plot(decomp)
Stationarity check
The series must be stationary for the ARIMA model(time-series model) to perform better(i.e) the mean, variance and covariance must be time invariant.
The augmented Dickey-Fuller(ADF) test is done to check stationarity. We set the alternative hypothesis as the series to be stationary.
adf.test(global_temp_ts,alternative = "stationary")
## Warning in adf.test(global_temp_ts, alternative = "stationary"): p-value
## smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: global_temp_ts
## Dickey-Fuller = -5.4889, Lag order = 14, p-value = 0.01
## alternative hypothesis: stationary
The p-value 0.01 is less than 0.05 significance threshold. Hence, we reject the null hypothesis and there is evidence that the series is stationary.
If the series is not stationary we usually differentiate the data to make it stationary.
Autocorrelations and model order
ACF(Auto correlation plot) and PACF(Partial autocorrelation plot) help in choosing the order parameters for the ARIMA model.
Acf(global_temp_ts)
The blue dotted line is the 95% significance boundary.
Pacf(global_temp_ts)
There are many spikes evenly spread in ACF plot. On contrary PACF plot has spike in the first 4 lags. We can ideally choose lag 1.
Prediction
auto.arima() would automatically pick the most optimized order parameters for us. Incase if we are using arima() we need to set the order parameters manually.
We use forecast() to predictions. The h parameter in the forecast() is set to the number of years we want to predict. Here the value 60 corresponds to 5 years.
plot(forecast(auto.arima(global_temp_ts),h=60))
The dark shade is the 80% confidence interval and light shade is the 95% confidence interval respectively. We can see the intervals become wider and accuracy of prediction decreases as we try to predict further years.
The forecast seems to be a straight line and does not capture the trend. In this case we need to use a more sophasticated model and is out of the scope of this vignette.
References
Dalinina,R., 2017. Introduction to Forecasting with ARIMA in R. [Online] Available at: https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials [Accessed 29 March 2018]
GISTEMP Team, 2018: GISS Surface Temperature Analysis (GISTEMP). NASA Goddard Institute for Space Studies. Dataset accessed 20YY-MM-DD at https://data.giss.nasa.gov/gistemp/.
Hansen, J., R. Ruedy, M. Sato, and K. Lo, 2010: Global surface temperature change. Rev. Geophys., 48, RG4004, doi:10.1029/2010RG000345.