============================================================================================================
About: This document is also available at http://rpubs.com/sherloconan/601675
Data source: Yahoo Finance

 

Goals

In this assignment you will be working with the dataset from your 699 project. You will perform a missing-data analysis. Read Chapter 2 of Multivariate Data Analysis for more information on MAR and MCAR.

Submission Format

Tasks

  1. Describe the missing data and provide a summary similar to the analysis in Chapter 2 (Table 3): count and percentage of missing data per variable, type of missing data (NA, null), and total percentage of missingness in the dataset [- 10pts]

  2. Plot a visualization of the missing-data pattern [- 10pts]

  3. Describe any patterns you have observed [- 10pts]

  4. Run a statistical analysis to determine whether your data is MCAR or MAR. For example, LittleMCAR - https://www.rdocumentation.org/packages/BaylorEdPsych/versions/0.5/topics/LittleMCAR [- 10pts]

  5. Explain what type of imputation will be performed: list-wise/pair-wise deletion, mean imputation, regression imputation, etc. [- 10pts]

 

Project EDA (continued)

The beginning part is available at RPubs.

The dataset is the Bitcoin price recorded as a daily time series, so in principle it should contain neither missing values nor outliers. The Bitcoin price peaked at USD 19,870.62 on December 17, 2017 (the daily high) and bottomed at USD 0.01 on October 7, 2010 (the daily low).

options(scipen=100)
pastecs::stat.desc(BTC) %>% kable(digits=2) %>% kable_styling()
                     Date        Open        High         Low       Close   Adj.Close                 Volume
nbr.val           3005.00     3005.00     3005.00     3005.00     3005.00     3005.00                3005.00
nbr.null             0.00        0.00        0.00        0.00        0.00        0.00                   7.00
nbr.na               0.00        0.00        0.00        0.00        0.00        0.00                   0.00
min              14806.00        0.05        0.05        0.01        0.05        0.05                   0.00
max              17810.00    19346.60    19870.62    18750.91    19345.49    19345.49          6245731508.00
range             3004.00    19346.55    19870.57    18750.90    19345.44    19345.44          6245731508.00
sum           49006744.00  4344894.13  4496933.09  4170193.94  4351484.24  4351484.24        436964651190.00
median           16308.00      290.02      297.66      283.17      290.35      290.35             8502000.00
mean             16308.40     1445.89     1496.48     1387.75     1448.08     1448.08           145412529.51
SE.mean             15.83       53.81       55.92       51.16       53.83       53.83             7584131.47
CI.mean.0.95        31.03      105.50      109.64      100.30      105.55      105.55            14870616.13
var             752717.07  8699579.66  9396254.87  7863811.28  8707621.33  8707621.33  172844745732914560.00
std.dev            867.59     2949.50     3065.33     2804.25     2950.87     2950.87           415746011.08
coef.var             0.05        2.04        2.05        2.02        2.04        2.04                   2.86
BTC[BTC$Low==min(BTC$Low),]
##          Date  Open  High  Low   Close Adj.Close Volume
## 84 2010-10-07 0.067 0.088 0.01 0.08685   0.08685  10784
BTC[BTC$High==max(BTC$High),]
##            Date    Open     High      Low    Close Adj.Close     Volume
## 2711 2017-12-17 19346.6 19870.62 18750.91 19065.71  19065.71 2264650369
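
The `nbr.na` and `nbr.null` rows of the descriptive table can be cross-checked with a per-variable summary in base R. The sketch below is only an illustration: `toy` is a hypothetical data frame with made-up values standing in for `BTC`, and it assumes that `nbr.null` counts zero values rather than `NA`s.

```r
# Hypothetical toy data frame standing in for BTC (values are made up)
toy <- data.frame(
  Close  = c(0.05, NA, 0.08, 0.09, 0.07),
  Volume = c(0, 10784, NA, NA, 500)
)

# Count and percentage of NA values per variable, plus a count of zeros
missing_summary <- data.frame(
  variable    = names(toy),
  n_missing   = colSums(is.na(toy)),
  pct_missing = 100 * colMeans(is.na(toy)),
  n_zero      = colSums(toy == 0, na.rm = TRUE),
  row.names   = NULL
)
print(missing_summary)

# Total percentage of missing cells in the whole data set
total_pct_missing <- 100 * mean(is.na(toy))
print(total_pct_missing)
```

Replacing `toy` with `BTC` should give the counts and percentages asked for in Task 1.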

 

Investigate missing data

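The trading dates run from July 16, 2010 to October 6, 2018. The difference between the last and first date is 3,004 days, i.e. 3,005 calendar days counting both endpoints, which matches the 3,005 observations in the dataset. This suggests that no trading dates are missing, provided that no date appears twice; the Date column's mean of 16308.40 in the summary table, versus the 16308.00 expected for 3,005 consecutive days, indicates that this assumption deserves a direct check. The closing price and the adjusted closing price are identical.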

difftime(BTC$Date[3005], BTC$Date[1])
## Time difference of 3004 days
sum(BTC$Close==BTC$Adj.Close)==length(BTC$Adj.Close)
## [1] TRUE
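
A matching row count does not by itself rule out gaps, because a duplicated date can hide a missing one. A minimal sketch of a more direct check follows; the `dates` vector is hypothetical, with one day deliberately left out, and would be replaced by `BTC$Date` in practice.

```r
# Hypothetical dates with one day deliberately left out (2010-07-19)
dates <- as.Date(c("2010-07-16", "2010-07-17", "2010-07-18", "2010-07-20"))

# Every calendar day between the first and last observation
full_range <- seq(min(dates), max(dates), by = "day")

# Days in the full range that are absent from the data
missing_days <- full_range[!full_range %in% dates]

# Duplicated dates would also make a plain row count misleading
n_duplicates <- sum(duplicated(dates))

print(missing_days)
print(n_duplicates)
```

An empty `missing_days` together with `n_duplicates` equal to zero would confirm that the series is complete.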

 

Observed any patterns

In theory, there should be no “opening” and “closing” in Bitcoin trading, because bitcoins can be traded at any time (a new block of transactions is confirmed roughly every 10 minutes). Day 1’s closing price should therefore equal Day 2’s opening price. Nevertheless, in 195 cases the opening price did not match the previous day’s closing price, mostly between October 2017 and October 2018. Besides round-off error, one possible explanation is that trading on the exchange was halted.

#BTC[BTC$Open[-1]!=BTC$Close[-3005],]
sum(BTC$Open[-1]!=BTC$Close[-3005])
## [1] 195
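
Because exact comparison with `!=` also counts differences that are pure round-off, it can help to compare with a tolerance. The sketch below uses hypothetical, made-up prices, and the tolerance `tol` is an assumed value that should be tuned to the precision of the data.

```r
# Hypothetical prices (made up for illustration, not taken from BTC)
open_price  <- c(0.050, 0.0860, 0.0900, 0.1200)
close_price <- c(0.086, 0.0900000001, 0.1000, 0.1100)

n   <- length(open_price)
gap <- open_price[-1] - close_price[-n]  # each day's open minus the previous day's close

tol <- 1e-6                              # assumed tolerance for round-off error

n_exact_mismatch <- sum(open_price[-1] != close_price[-n])  # counts any difference
n_real_mismatch  <- sum(abs(gap) > tol)                     # ignores tiny differences

print(c(exact = n_exact_mismatch, beyond_tolerance = n_real_mismatch))
```

Applied to `BTC$Open` and `BTC$Close`, the second count would show how many of the 195 mismatches remain once round-off is set aside.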

 

MCAR or MAR

Missing Completely at Random (MCAR) means that no relationship exists between the missingness of the data and any values, observed or missing. The missing data points are a random subset of the data: nothing systematic makes some values more likely to be missing than others.

Missing at Random (MAR) means that a systematic relationship exists between the propensity of a value to be missing and the observed data, but not the missing values themselves.

Source: https://www.theanalysisfactor.com/missing-data-mechanism/
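
The difference between the two mechanisms can be illustrated with a small simulation. This is only a sketch on synthetic data: the variables `x` and `y`, the 30% missingness rate, and the logistic form `plogis(2 * x)` are all assumptions chosen for illustration.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)          # fully observed variable
y <- 2 * x + rnorm(n)  # variable that will receive missing values

# MCAR: every value of y has the same 30% chance of being missing
miss_mcar <- runif(n) < 0.3

# MAR: the chance that y is missing depends only on the observed x
miss_mar <- runif(n) < plogis(2 * x)

# Difference in mean x between rows with missing y and rows with observed y
diff_mcar <- mean(x[miss_mcar]) - mean(x[!miss_mcar])
diff_mar  <- mean(x[miss_mar])  - mean(x[!miss_mar])

print(c(MCAR = diff_mcar, MAR = diff_mar))
```

Under MCAR the two groups have nearly the same mean of `x`, whereas under MAR the rows with missing `y` have a clearly higher mean of `x`. Tests such as Little's MCAR test formalize this kind of comparison.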

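library(ggplot2)  # attached here in case the earlier part of the analysis is not loaded
n <- nrow(BTC)
# TRUE where a day's close differs from the next day's open; padded with FALSE so the
# index has one entry per row (a 3,004-long logical index would be recycled onto row 3,005)
mismatch <- c(BTC$Open[-1] != BTC$Close[-n], FALSE)
days <- data.frame(Date = BTC$Date, Included = 1)
days$Included[mismatch] <- 0
ggplot(days, aes(Date, Included)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  ylim(0, 1) +
  labs(title = "Fig. 21. Bitcoin Trading Dates", subtitle = "2010JUL16 - 2018OCT06", x = "") +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())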

Again, no missing values (NA) exist in the Bitcoin trading prices, so there is no missingness mechanism to classify as MCAR or MAR.

 

Types of imputation

  1. List-wise/pair-wise deletion. The most common approach is simply to ignore the missing values: list-wise deletion drops every observation that has a missing value in any variable, while pair-wise deletion uses, for each analysis, all observations that are complete on the variables involved.

  2. Mean imputation. Replace each missing value with the mean of the observed values of that variable. This keeps the mean and the sample size unchanged, but it has many disadvantages, such as understating the variable's variance and weakening its correlations with other variables. Almost every other imputation method described in the source below is better than mean imputation.

  3. Regression imputation. Replace each missing value with the value predicted by regressing the variable with missing data on the other variables. Instead of the overall mean, the imputed value is a prediction based on the other variables. This preserves the relationships among the variables in the imputation model, but not the variability around the predicted values.

Source: https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/
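
The three approaches listed above can be sketched in base R. The data frame `toy` is hypothetical, with made-up numbers, and the regression step assumes a simple linear model `y ~ x`; the Bitcoin data itself needs no imputation.

```r
# Hypothetical toy data: y has two missing values (made-up numbers)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, NA, 8.2, NA, 12.1)
)

# 1. List-wise deletion: drop every row that contains an NA
listwise <- na.omit(toy)

# 2. Mean imputation: replace NA with the mean of the observed y
mean_imp <- toy
mean_imp$y[is.na(mean_imp$y)] <- mean(toy$y, na.rm = TRUE)

# 3. Regression imputation: replace NA with the fitted value from y ~ x
fit <- lm(y ~ x, data = toy)  # lm() drops incomplete rows by default
reg_imp <- toy
reg_imp$y[is.na(toy$y)] <- predict(fit, newdata = toy[is.na(toy$y), ])

# Mean imputation keeps the mean but shrinks the variance of y
print(c(observed = var(toy$y, na.rm = TRUE), mean_imputed = var(mean_imp$y)))
```

Comparing the two variances shows the main drawback of mean imputation described in item 2: the imputed values add no spread, so the variance is understated.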