Time series analysis of the bit coins price data

** Loading the libraries **

libraries <- c('forecast','ggplot2','data.table','gridExtra','tidyquant','changepoint','ggfortify','xts')
sapply(libraries, require, character = T)
##    forecast     ggplot2  data.table   gridExtra   tidyquant changepoint 
##        TRUE        TRUE        TRUE        TRUE        TRUE        TRUE 
##   ggfortify         xts 
##        TRUE        TRUE

Introduction

This note book is all about the time series analysis of the bicoins prices from 2011 to 2017. First we will start with the explonatory data analysis and draw some meaningful insights from the data. Our final goal of this project is build a predictive model for bitcoins prices. This is the first notebook of our project and it contains exploratory data analysis on the bitcoin price data set.

** Introducing the data **

setwd('G:\\BTP\\Data')

data = fread('bstmp.csv')
## 
Read 0.0% of 3045857 rows
Read 6.2% of 3045857 rows
Read 12.5% of 3045857 rows
Read 18.7% of 3045857 rows
Read 25.0% of 3045857 rows
Read 30.9% of 3045857 rows
Read 37.4% of 3045857 rows
Read 43.7% of 3045857 rows
Read 49.6% of 3045857 rows
Read 55.5% of 3045857 rows
Read 61.4% of 3045857 rows
Read 67.3% of 3045857 rows
Read 73.5% of 3045857 rows
Read 79.8% of 3045857 rows
Read 85.7% of 3045857 rows
Read 91.6% of 3045857 rows
Read 97.8% of 3045857 rows
Read 3045857 rows and 8 (of 8) columns from 0.190 GB file in 00:00:37

Checking the dimension of our data set.

dim(data)
## [1] 3045857       8

Sample data

tail(data, n=5)
##     Timestamp    Open    High     Low   Close Volume_(BTC)
## 1: 1508457360 5690.88 5690.88 5690.88 5690.88   0.16894078
## 2: 1508457420 5698.13 5704.10 5695.63 5704.10   2.31166152
## 3: 1508457480 5695.62 5695.64 5694.00 5695.64   0.12964714
## 4: 1508457540 5700.39 5700.39 5698.69 5698.69   0.08542622
## 5: 1508457600 5700.00 5700.00 5698.68 5700.00   0.11684654
##    Volume_(Currency) Weighted_Price
## 1:          961.4217       5690.880
## 2:        13174.8529       5699.300
## 3:          738.4096       5695.534
## 4:          486.9298       5700.004
## 5:          665.9139       5699.046

Columns present in the data set

names(data)
## [1] "Timestamp"         "Open"              "High"             
## [4] "Low"               "Close"             "Volume_(BTC)"     
## [7] "Volume_(Currency)" "Weighted_Price"

Lets get insight about very high level statistics about our data.

summary(data[,c(-1)])
##       Open             High             Low             Close       
##  Min.   :   3.8   Min.   :   3.8   Min.   :   1.5   Min.   :   1.5  
##  1st Qu.: 101.4   1st Qu.: 101.5   1st Qu.: 101.2   1st Qu.: 101.5  
##  Median : 362.5   Median : 362.8   Median : 362.2   Median : 362.5  
##  Mean   : 589.2   Mean   : 589.6   Mean   : 588.8   Mean   : 589.2  
##  3rd Qu.: 632.0   3rd Qu.: 632.2   3rd Qu.: 631.6   3rd Qu.: 632.0  
##  Max.   :5846.4   Max.   :5846.4   Max.   :5839.6   Max.   :5844.4  
##   Volume_(BTC)      Volume_(Currency) Weighted_Price  
##  Min.   :   0.000   Min.   :      0   Min.   :   3.8  
##  1st Qu.:   0.422   1st Qu.:     62   1st Qu.: 101.4  
##  Median :   2.002   Median :    319   Median : 362.5  
##  Mean   :  11.126   Mean   :   5522   Mean   : 589.2  
##  3rd Qu.:   8.658   3rd Qu.:   2111   3rd Qu.: 632.0  
##  Max.   :5853.852   Max.   :4339350   Max.   :5845.8

Converting the Unix time into proper date format

data$Timestamp <- as.POSIXct(data$Timestamp,origin='1970-01-01')
head(data$Timestamp)
## [1] "2011-12-31 13:22:00 IST" "2011-12-31 13:23:00 IST"
## [3] "2011-12-31 13:24:00 IST" "2011-12-31 13:25:00 IST"
## [5] "2011-12-31 13:26:00 IST" "2011-12-31 13:27:00 IST"

Checking the variables of data set

str(data)
## Classes 'data.table' and 'data.frame':   3045857 obs. of  8 variables:
##  $ Timestamp        : POSIXct, format: "2011-12-31 13:22:00" "2011-12-31 13:23:00" ...
##  $ Open             : num  4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 ...
##  $ High             : num  4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 ...
##  $ Low              : num  4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 ...
##  $ Close            : num  4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 ...
##  $ Volume_(BTC)     : num  0.456 0.456 0.456 0.456 0.456 ...
##  $ Volume_(Currency): num  2 2 2 2 2 ...
##  $ Weighted_Price   : num  4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 4.39 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Number of null values column wise

n_col_wise = sapply(data, function(x) sum(is.na(x)))
n_col_wise
##         Timestamp              Open              High               Low 
##                 0                 0                 0                 0 
##             Close      Volume_(BTC) Volume_(Currency)    Weighted_Price 
##                 0                 0                 0                 0

Hence there are no null values in any column

Plotting the closing price for each minute

ggplot(data=data, aes(Timestamp, Close)) + geom_line() + ylab('Closing Price') + xlab('Time')

Clearly, there is an increasing trend in prices of bitcoins. However, lets decompose this plots to find trend and seasonality on varying frequency. First take the log transform to reduce the fluctuations and observe it closely.

data$cp <- tsclean(data$Close, replace.missing = T) ## remove outliers from the time series
ggplot(data=data, aes(x=Timestamp, y=log(cp))) + geom_line() + xlab('Time') + ylab('log of closing prices')

The above plot give a good view of very small fluctuations while supressing the large fluctuations.

** General trend of bitcoin prices in different time stamps **

data$cp_h = ma(data$cp, order = 60)
p1 <- ggplot(data=data, aes(x=Timestamp,y= cp_h)) + geom_line() + xlab('Time') + ylab('Hourly prices') + ggtitle('Hourly trend in prices')

data$cp_d = ma(data$cp, order = 24*60)
p2 <- ggplot(data=data, aes(x=Timestamp,y=cp_d)) + geom_line() + xlab('Time') + ylab('Daily prices') + ggtitle("Daily trend in prices")

data$cp_w = ma(data$cp, order = 7*24*60)
p3 <- ggplot(data=data, aes(x=Timestamp,y=cp_w)) + geom_line() + xlab("Time") + ylab("weekly prices") + ggtitle("Weekly trend in prices")

data$cp_mo = ma(data$cp, order = 30*24*60)
p4 <- ggplot(data=data, aes(x=Timestamp,y=cp_mo)) + geom_line() + xlab('Time') + ylab('Monthly Prices') + ggtitle("Monthly trend in prices")
grid.arrange(p1,p2,p3,p4, ncol=2)
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

In 2017, there was a sudden increase in prices of bitcoins. So lets draw more insights about prices hike in this year using boxplot.

x = ts(data$cp, start = c(2012,11), end = c(2017,10), frequency = 12)
data$month = format(data$Timestamp, "%B")

ggplot(data=data,aes(x=month,y=cp)) + geom_boxplot(outlier.colour = "red",outlier.shape = 21)  + ylab('bitcoin prices')+
  theme(axis.text.x = element_text(angle=90, hjust = 1))

The above plot shows that the january has high variance in prices of bitcoins. November was the period when prices were stable. We can clearly see than on some days of October, the prices were exceptionally higher than normal days, contributing a large amoutn of outliers.

** Year wise growth of bitcoins prices **

Lets analyse how the prices of bitcons varies across different years.

data$year = format(data$Timestamp, "%Y")
ggplot(data=data, aes(x=Timestamp,y=cp)) + geom_line(color="darkblue",lwd=1.0) + facet_grid(year ~ .) + 
  theme_tq() + ylab("Prices") +
  ggtitle("Price trend in each year")

From the above plot we can see that 2017 have seen drastic change in bitcoins prices. So lets plot candlecharts for 2017 to have a better look at opening and closing prices of bitcoins.

We have taken another data set in which has been collected at daily interval. This saves our time for computation.

setwd("G:\\BTP\\Data")
data1 = read.table('bitcoin_price_daily.csv', sep=',', header = T)
head(data1)
##           Date    Open    High     Low   Close        Volume
## 1 Nov 07, 2017 7023.10 7253.32 7023.10 7144.38 2,326,340,000
## 2 Nov 06, 2017 7403.22 7445.77 7007.31 7022.76 3,111,900,000
## 3 Nov 05, 2017 7404.52 7617.48 7333.19 7407.41 2,380,410,000
## 4 Nov 04, 2017 7164.48 7492.86 7031.28 7379.95 2,483,800,000
## 5 Nov 03, 2017 7087.53 7461.29 7002.94 7207.76 3,369,860,000
## 6 Nov 02, 2017 6777.77 7367.33 6758.72 7078.50 4,653,770,000
##        Market.Cap
## 1 117,056,000,000
## 2 123,379,000,000
## 3 123,388,000,000
## 4 119,376,000,000
## 5 118,084,000,000
## 6 112,910,000,000
data1$Open <- as.numeric(data1$Open)
data1$High <- as.numeric(data1$High)
data1$Low <- as.numeric(data1$Low)
data1$Close <- as.numeric(data1$Close)

data1$Volume <- as.numeric(data1$Volume)
data1$Date <- as.character(data1$Date)

library(stringr)

x = str_replace_all(data1$Date, fixed(","),"")
x = str_replace_all(x, fixed(" "),"")
data1$Date = as.Date(x,"%B%d%Y")

** Plotting the candlestick chart for 2017 as it has observe sudden increase in prices of bitcoins**

Lets plot candle chart to observe the opening and closing prices on different days of a month.

data1 %>%
  filter(year(Date) >= 2017) %>%
  ggplot(aes(x = month(Date) ,y=Close)) +
  geom_candlestick(aes(open=Open, high=High,low=Low,close=Close),
                   fill_up = "steelblue",fill_down = "red",
                   color_up = "darkgreen",color_down = "darkblue", show.legend = T)+
  
  
  labs(title="2017 bitcoin candlestick chart",y="closing price",x="") +
  
  theme_tq()

The red part shows the days when closing prices were less than the opening prices. The month of September was not good for bitcoin traders and closing prices were less than the opening prices on most of days.

** Examining the weekly trends of 2017 **

Lets examine the weekly trend in bitcoin prices. The bollinger band is used to measure the standard deviation across weekly trend. In common words in measure the volatility in prices of bitcoins.

data1 %>%
  filter(year(Date) >= 2017) %>%
  ggplot(aes(x=Date, y=Close,open=Open, high=High,low=Low,close=Close))+
  geom_candlestick(
                   fill_up = "steelblue",fill_down = "red",
                   color_up = "darkgreen",color_down = "darkblue", show.legend = T) +
  geom_bbands(ma_fun = SMA, sd = 2, n = 20, 
                linetype = 4, size = 1, alpha = 0.1, 
                fill        = palette_light()[[1]], 
                color_bands = palette_light()[[1]], 
                color_ma    = palette_light()[[2]]) +
  ggtitle("Bitcoin price volatility for 2017") +
  theme_tq()

There is a spike in price after august, so lets zoom the above plot for last six months of 2017

data1 %>%
  filter(year(Date) >= 2017) %>%
  ggplot(aes(x=Date, y=Close,open=Open, high=High,low=Low,close=Close))+
  geom_candlestick(
                   fill_up = "steelblue",fill_down = "red",
                   color_up = "darkgreen",color_down = "darkblue", show.legend = T) +
  geom_bbands(ma_fun = SMA, sd = 2, n = 20, 
                linetype = 4, size = 1, alpha = 0.1, 
                fill        = palette_light()[[1]], 
                color_bands = palette_light()[[1]], 
                color_ma    = palette_light()[[2]]) +
  coord_x_date(xlim = c("2017-08-01", "2017-11-07"),
                 ylim = c(2000, 8000)) +
  ggtitle("Prices spikes in second half of 2017")+
  theme_tq()

There was a simillar spike in prices in 2014 as in 2017. So we analyze both prices spikes with candlestick chart and drawing bollinger band around it to measure the prices volatility.

data1 %>%
  filter(Date >="2013-11-15" & Date <= "2014-03-01") %>%
  ggplot(aes(x=Date,y=Close,high=High,low=Low,open=Open,close=Close)) +
  geom_candlestick()+
  
   geom_bbands(ma_fun = SMA, sd = 2, n = 20, 
                linetype = 4, size = 1, alpha = 0.1, 
                fill        = palette_light()[[1]], 
                color_bands = palette_light()[[1]], 
                color_ma    = palette_light()[[2]]) +
  theme_tq()+
  ggtitle("2014 prices spikes") + xlab("Months")

There were two spikes in prices of bitcoins in 2014 and 2017. However, 2014 spike is different from 2017 in sense that closing bitcoin prices were less than opening prices in most of days. Also the volatility in prices in 2014 are much more than in 2017.

** Analysing the distribution of the variables **

p1<-ggplot(data=data1, aes(x=Close, y =..density..)) + geom_histogram(col="blue",fill="steelblue",alpha=0.3) + geom_density(col='red',lwd=0.9)
p2<- ggplot(data=data1, aes(x=log(Close),y=..density..)) + geom_histogram(col="blue",fill="steelblue",alpha=0.3) + geom_density(col='red',lwd=0.9)+
  xlab("log of prices")

grid.arrange(p1,p2, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The logrithmic transformation didn’t make the plot like normal.

The plot is somewhat like the exponential distribution. Let’s analyze the changepoint detection which are quite important in the time series analysis.

Change point detection

** Changepointn detection of mean and variance using PELT algorithm **

Estimating penalty parameter for mean changepoint detection.

cptfn <- function(data, pen){
  
  ans <- cpt.mean(data, test.stat = "Normal",method="PELT",penalty = "Manual",pen.value = pen)
  length(cpts(ans)) + 1
  
}

pen.vals <- seq(0,5,0.2)
elbowplotData <- unlist(lapply(pen.vals, function(p) cptfn(data=log(data1$Close), pen=p)))

qplot(x=pen.vals, y=elbowplotData) + xlab("PELT penalty parameter") + ylab(" ")+ggtitle("Elbow method")+geom_point()

Mean changepoint detection using pelt algorithm.

penvalue = 1
pelt.meancpt = cpt.mean(rev(log(data1$Close)),test.stat = 'Normal', penalty = "Manual",pen.value = penvalue, method="PELT",Q=10)
meancpt.point = cpts(pelt.meancpt)

ggplot(data=data1, aes(x=Date, y = Close)) + geom_line(col="blue") + geom_vline(xintercept = rev(data1$Date)[meancpt.point],
                                                                      col="red",lwd=0.9,linetype="dotted")+
                                                                      ylab("Prices")+
                                                                      ggtitle("ChangePoint of mean with PELT") + theme_gray()

cat("The time for changepoint detection of mean are: \n")
## The time for changepoint detection of mean are:
rev(data1$Date)[meancpt.point]
##  [1] "2013-06-07" "2013-08-16" "2013-10-18" "2013-11-06" "2013-11-17"
##  [6] "2014-02-07" "2014-03-26" "2014-05-24" "2014-08-13" "2014-09-18"
## [11] "2015-01-02" "2015-10-26" "2015-12-04" "2016-05-27" "2016-11-15"
## [16] "2017-01-31" "2017-04-30" "2017-05-21" "2017-08-06" "2017-10-11"

Estimating the penalty parameter for variance changepoint detection.

cptfn <- function(data, pen){
  
  ans <- cpt.var(data, test.stat = "Normal",method="PELT",penalty = "Manual",pen.value = pen)
  length(cpts(ans)) + 1
  
}
pen.vals <- seq(0,10,0.2)
elbowplotData <- unlist(lapply(pen.vals, function(p) cptfn(data=log(data1$Close), pen=p)))

qplot(x=pen.vals, y=elbowplotData) + xlab("PELT penalty parameter") + ylab(" ")+ggtitle("Elbow method")+geom_point()

penvalue = 10
mu = mean(data1$Close)
pelt.varcpt = cpt.var(rev(log(data1$Close)),test.stat = "Normal",penalty = "Manual",pen.value = penvalue,method="PELT")
varcpt.point = cpts(pelt.varcpt)

ggplot(data=data1, aes(x=Date, y = Close)) + geom_line(col="blue") + geom_vline(xintercept = rev(data1$Date)[varcpt.point],
                                                                      col="red",lwd=0.9,linetype="dotted")+
                                                                      ylab("Prices")+
                                                                      ggtitle("ChangePoint of variance with PELT")+theme_gray()

cat("The date for changepoint detection of variance are \n")
## The date for changepoint detection of variance are
 rev(data1$Date)[varcpt.point]
##  [1] "2013-11-04" "2014-02-11" "2014-04-14" "2014-04-24" "2014-05-29"
##  [6] "2014-08-13" "2014-09-16" "2014-12-28" "2015-10-27" "2015-12-04"
## [11] "2016-04-19" "2016-06-10" "2016-07-30" "2016-10-21" "2016-12-21"
## [16] "2017-04-30" "2017-08-04"

Identifying the change in the mean and variance signal using Binary segmentation method

penalty.val = 1
cptm_stationary <- cpt.mean(rev(data1$Close),method="BinSeg", Q=8)
cptv_stationary <- cpt.var(rev(data1$Close), method="BinSeg", Q=8)
cptms_stationary = cpts(cptm_stationary)
cptvs_stationary = cpts(cptv_stationary)

ggplot(data=data1, aes(x=Date, y=Close)) + geom_line(col='blue') + geom_vline(xintercept = rev(data1$Date)[cptms_stationary], col='red',linetype="dotted",lwd=0.9) + theme_tq()+ggtitle("Mean Changepoint with Binary Segmentation")