ANLY 699 - Regression Assignment

Goals

In this assignment you will work with the dataset from your 699 project and run a regression model on it.

Submission Format

Submit two files: the R Markdown source and the knitted output (HTML or PDF).
Text should be entered outside of code blocks (do not use #comments to describe your figures).
Format your graphs properly: captions, titles, and axis labels.

Tasks

Scale or normalize your data. Make sure to apply imputation if needed. [-5 pts]
Build a multiple linear regression model or logistic regression (based on your Y). [-10 pts]
Print summary and interpret table (see lecture slides). Describe the summary. [-15 pts]
Perform another model and evaluate which model performs better. [-10 pts]

Project Modeling

The EDA parts are available at RPubs - part I and RPubs - part II.

The dataset is the Bitcoin price as a time series, so an ARIMA model is built instead of a regression model. An autoregressive integrated moving average (ARIMA) model is denoted ARIMA(p,d,q), where the parameters p, d, and q are, respectively, the number of autoregressive lags, the degree of differencing, and the order of the moving average. The model assumes that the series is stationary after differencing.
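The ARIMA(p,d,q) notation above can be written out as a single equation. The symbols below are not from the original write-up and are introduced here as assumptions for illustration: y_t is the observed series, B is the backshift operator (B y_t = y_{t-1}), the phi_i and theta_j are the autoregressive and moving-average coefficients, and epsilon_t is white noise.

```latex
$$
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d} \, y_t
  = \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t
$$
```

A seasonal model ARIMA(p,d,q)(P,D,Q)[s] multiplies each side by a corresponding polynomial in B^s and adds the seasonal differencing factor (1 - B^s)^D; here s = 30.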

The Augmented Dickey-Fuller (ADF) test is a unit root test that examines how strongly a trend defines a time series. Its null hypothesis (H0) is that a unit root is present, i.e., that the time series is non-stationary. For the original series, the p-value of 0.7481 is not statistically significant at an alpha level of 0.05, so we fail to reject the null hypothesis. The series is therefore first-differenced to remove the trend; the ADF test on the differenced series reports a p-value of 0.01, so the null hypothesis is rejected. A logarithmic transformation is not strictly required in this case.
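To illustrate why first-order differencing helps with a unit root, the following is a minimal sketch on simulated data, not the Bitcoin series. It uses only base R, and the names `steps`, `walk`, `rho_level`, and `rho_diff` are hypothetical. It compares the lag-1 sample autocorrelation of a random walk with that of its first difference rather than running the ADF test itself.

```r
# Simulated random walk (an assumption for illustration; not the BTC data)
set.seed(42)
steps <- rnorm(500)      # white-noise increments
walk  <- cumsum(steps)   # unit-root process: each value = previous value + noise

# Lag-1 sample autocorrelation before and after first-order differencing
rho_level <- acf(walk, lag.max = 1, plot = FALSE)$acf[2]
rho_diff  <- acf(diff(walk, 1), lag.max = 1, plot = FALSE)$acf[2]

cat("lag-1 autocorrelation, level:      ", round(rho_level, 3), "\n")
cat("lag-1 autocorrelation, differenced:", round(rho_diff, 3), "\n")
```

The level series should show a lag-1 autocorrelation close to 1, while the differenced series should be close to 0, which is the pattern a stationary series is expected to show.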

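Intuitively, the ARIMA parameters p and q are set to 1 and 0, respectively, for both the non-seasonal and seasonal parts, so an ARIMA(1,1,0)(1,1,0)[30] model is built. For comparison, auto.arima() with D=1 selects ARIMA(0,1,1)(2,1,0)[30]. In the diagnostic plots, about half of the standardized residuals fall within [-1,1], while the other half spike to as much as ±6. Most autocorrelations of the residuals are close to 0. However, only four of the Ljung-Box p-values are above 0.05, while the rest lie narrowly around it, so some autocorrelation may remain in the residuals. Finally, Bitcoin prices are forecast 30 days ahead.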
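The fit, residual check, and forecast described above can be sketched end to end in base R. This is a minimal sketch under stated assumptions: the series `sim` is simulated and stands in for the price data, the names `fit_sim`, `lb`, and `fc` are hypothetical, and `method = "ML"` and the Ljung-Box lag of 10 are illustrative choices rather than the settings used in the analysis.

```r
# Simulated series standing in for the price data (assumption: not the BTC series)
set.seed(1)
sim <- ts(100 + cumsum(rnorm(240)), frequency = 30)

# Same structure as in the text: ARIMA(1,1,0)(1,1,0)[30]
fit_sim <- arima(sim, order = c(1, 1, 0),
                 seasonal = list(order = c(1, 1, 0), period = 30),
                 method = "ML")

# Ljung-Box test on the residuals; fitdf = 2 accounts for the two estimated coefficients
lb <- Box.test(residuals(fit_sim), lag = 10, type = "Ljung-Box", fitdf = 2)

# 30-step-ahead point forecasts and their standard errors
fc <- predict(fit_sim, n.ahead = 30)

print(lb$p.value)
print(head(fc$pred))
```

A Ljung-Box p-value above 0.05 would suggest no strong evidence of remaining residual autocorrelation at the tested lags; `tsdiag()` plots the same kind of p-values across several lags.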

library(tseries)  # provides adf.test()
adf.test(BTC[which(BTC$Date=="2017-1-12"):which(BTC$Date=="2018-10-6"),5])

## 
##  Augmented Dickey-Fuller Test
## 
## data:  BTC[which(BTC$Date == "2017-1-12"):which(BTC$Date == "2018-10-6"),     5]
## Dickey-Fuller = -1.5988, Lag order = 8, p-value = 0.7481
## alternative hypothesis: stationary

price <- ts(BTC[which(BTC$Date=="2017-1-12"):which(BTC$Date=="2018-10-6"),5],freq=30)
adf.test(diff(price,1))

## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(price, 1)
## Dickey-Fuller = -7.6665, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

par(mfrow=c(1,2))
acf(diff(BTC[which(BTC$Date=="2017-1-12"):which(BTC$Date=="2018-10-6"),5],1),lag.max=30,main="Autocorrelation Plot, d=1")
pacf(diff(BTC[which(BTC$Date=="2017-1-12"):which(BTC$Date=="2018-10-6"),5],1),lag.max=30,main="Partial Autocorrelation Plot, d=1")

library(forecast)  # provides auto.arima() and forecast()
auto.arima(price,D=1)

## Series: price 
## ARIMA(0,1,1)(2,1,0)[30] 
## 
## Coefficients:
##          ma1     sar1     sar2
##       0.1028  -0.5846  -0.3300
## s.e.  0.0405   0.0374   0.0364
## 
## sigma^2 estimated as 226934:  log likelihood=-4579.02
## AIC=9166.05   AICc=9166.11   BIC=9183.65

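# Fit the manually chosen ARIMA(1,1,0)(1,1,0)[30] model
fit <- arima(price,order=c(1,1,0),seasonal=list(order=c(1,1,0),period=30))
tsdiag(fit)  # standardized residuals, residual ACF, Ljung-Box p-values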

pred <- forecast(fit,h=30)
par(mfrow=c(1,1))
plot(pred,xlab="Observation",ylab="Closing Price (USD)",main="Bitcoin Trading Price: 2017-2018\n Forecasts from ARIMA(1,1,0)(1,1,0)[30]",lwd=2)
lines(pred$fitted,col="red")
legend("topleft",legend=c("Fitted","Predicted","Original"),col=c("red","blue","black"),lty=c(1,1,1),lwd=c(2,2,2),bty="n")

Practice “A Multiple Linear Regression Model”

A closely related study using a multiple linear regression model is available in the previous project, Community Structure and Crime Rates: Evidence from Cross-Sectional Data in the U.S.

The tasks of (1) scaling, (2) regression modeling, (3) interpretation, and (4) model comparison are all covered there.