ANLY 699 - Missing Data and Outliers Analysis

============================================================================================================
About: This document is also available at http://rpubs.com/sherloconan/601675
Data source: Yahoo Finance

Goals

In this assignment you will work with the dataset from your 699 project and perform a missing data analysis. Read Chapter 2 of Multivariate Data Analysis for more information on MAR and MCAR.

Submission Format

Submit 2 files: Rmarkdown and a knitted Rmarkdown (html or pdf).
Text should be entered outside of code blocks (do not use #comments to describe your figures).
Format your graphs properly: captions, titles, and axis labels.

Tasks

Describe the missing data and provide a summary similar to the analysis in Chapter 2 (Table 3): count and percent of missing data per variable, type of missing data (NA, null), and total percent of missingness per dataset [ - 10pts]
Plot visualization of missing data pattern [ - 10pts]
Describe if you have observed any patterns [- 10pts]
Run statistical analysis to determine if your data is MCAR or MAR. For example, LittleMCAR - https://www.rdocumentation.org/packages/BaylorEdPsych/versions/0.5/topics/LittleMCAR [-10pts]
Explain what type of imputation will be performed: list-wise/pair-wise deletions, mean imputation, regression imputation etc [ - 10pts]

Project EDA (continued)

The beginning part is available at RPubs.

The dataset is the daily Bitcoin price as a time series, so missing data and outliers are not expected. The price peaked at USD 19,870.62 per bitcoin on December 17, 2017 (the highest daily high), while the lowest price was USD 0.01 on October 7, 2010 (the lowest daily low).

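# Assumes %>%, kable(), and kable_styling() were loaded in the beginning part (e.g., dplyr, knitr, kableExtra).
options(scipen=100)  # avoid scientific notation in the summary table
pastecs::stat.desc(BTC) %>% kable(digits=2) %>% kable_styling()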

	Date	Open	High	Low	Close	Adj.Close	Volume
nbr.val	3005.00	3005.00	3005.00	3005.00	3005.00	3005.00	3005.00
nbr.null	0.00	0.00	0.00	0.00	0.00	0.00	7.00
nbr.na	0.00	0.00	0.00	0.00	0.00	0.00	0.00
min	14806.00	0.05	0.05	0.01	0.05	0.05	0.00
max	17810.00	19346.60	19870.62	18750.91	19345.49	19345.49	6245731508.00
range	3004.00	19346.55	19870.57	18750.90	19345.44	19345.44	6245731508.00
sum	49006744.00	4344894.13	4496933.09	4170193.94	4351484.24	4351484.24	436964651190.00
median	16308.00	290.02	297.66	283.17	290.35	290.35	8502000.00
mean	16308.40	1445.89	1496.48	1387.75	1448.08	1448.08	145412529.51
SE.mean	15.83	53.81	55.92	51.16	53.83	53.83	7584131.47
CI.mean.0.95	31.03	105.50	109.64	100.30	105.55	105.55	14870616.13
var	752717.07	8699579.66	9396254.87	7863811.28	8707621.33	8707621.33	172844745732914560.00
std.dev	867.59	2949.50	3065.33	2804.25	2950.87	2950.87	415746011.08
coef.var	0.05	2.04	2.05	2.02	2.04	2.04	2.86
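The nbr.na and nbr.null rows above report counts only. The following is a minimal sketch of how the percentage of missing values per variable and for the whole dataset could be computed. The data frame `btc_demo` and its values are hypothetical stand-ins, not the project data, and `zero_count` assumes that nbr.null counts values equal to zero.

```r
# Hypothetical toy data standing in for BTC; all values are made up.
btc_demo <- data.frame(
  Open   = c(0.05, NA, 0.07, 0.08),
  Close  = c(0.06, 0.07, NA, NA),
  Volume = c(0, 100, 250, 0)
)

# Count and percentage of NA values per variable.
na_count   <- colSums(is.na(btc_demo))
na_percent <- round(100 * colMeans(is.na(btc_demo)), 2)

# Count of zero values per variable (assumed to correspond to nbr.null).
zero_count <- colSums(btc_demo == 0, na.rm = TRUE)

missing_summary <- data.frame(na_count, na_percent, zero_count)

# Overall percentage of missing cells in the dataset.
total_percent <- 100 * mean(is.na(btc_demo))

print(missing_summary)
print(total_percent)
```

Applied to the real data, the same calls would take `BTC` in place of `btc_demo`.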

BTC[BTC$Low==min(BTC$Low),]

##          Date  Open  High  Low   Close Adj.Close Volume
## 84 2010-10-07 0.067 0.088 0.01 0.08685   0.08685  10784

BTC[BTC$High==max(BTC$High),]

##            Date    Open     High      Low    Close Adj.Close     Volume
## 2711 2017-12-17 19346.6 19870.62 18750.91 19065.71  19065.71 2264650369

Investigate missing data

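The trading dates run from July 16, 2010 to October 6, 2018. The difference between the last and the first date is 3,004 days, so the range covers 3,005 calendar days inclusive, which matches the 3,005 observations in the dataset. This suggests that no trading dates are missing. The closing price and the adjusted closing price are identical.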

difftime(BTC$Date[3005], BTC$Date[1])

## Time difference of 3004 days

all(BTC$Close==BTC$Adj.Close)

## [1] TRUE
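Comparing the first and last dates confirms the length of the range but does not by itself rule out duplicated or skipped days. A minimal sketch of a stricter check is given here, using a made-up date vector (`dates` is hypothetical, with one day deliberately left out):

```r
# Hypothetical date vector with 2010-07-18 deliberately omitted.
dates <- as.Date(c("2010-07-16", "2010-07-17", "2010-07-19", "2010-07-20"))

# Every calendar day between the first and the last observation.
full_range <- seq(min(dates), max(dates), by = "day")

# Days that are expected but absent from the data.
missing_dates <- full_range[!(full_range %in% dates)]

# TRUE only when the dates are unique and exactly one day apart.
is_complete <- anyDuplicated(dates) == 0 && all(diff(dates) == 1)

print(missing_dates)
print(is_complete)
```

With the project data, `BTC$Date` would replace `dates`, assuming the column is of class Date.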

Observed any patterns

In theory, Bitcoin trading has no “opening” or “closing”: bitcoins are always ready to trade (every 10 minutes), so Day 1’s closing price should equal Day 2’s opening price. Nevertheless, in 195 cases the opening price does not match the previous day’s closing price, mostly between October 2017 and October 2018. Besides round-off error, one possible explanation is that trading on the exchange was halted.

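# Element i compares the close of day i with the open of day i+1 (3,004 comparisons).
# The FALSE pads the last day, which has no following day, so the index is not recycled.
#BTC[c(BTC$Open[-1]!=BTC$Close[-3005], FALSE),]
sum(BTC$Open[-1]!=BTC$Close[-3005])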

## [1] 195
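Because the comparison above is exact, a difference caused purely by floating-point round-off also counts as a mismatch. A minimal sketch of a tolerance-based comparison follows. The prices and the threshold `tol` are made-up assumptions for illustration, not values from the dataset.

```r
# Hypothetical prices; all values are made up.
close_price <- c(0.1 + 0.2, 6500.25, 6610.00)  # closes of days 1 to 3
open_next   <- c(0.3,       6500.25, 6612.40)  # opens of days 2 to 4

# Exact comparison: the first pair differs only by floating-point round-off.
exact_mismatch <- close_price != open_next

# Tolerance-based comparison: ignore absolute differences up to `tol`.
tol <- 1e-8  # assumed threshold; choose one that suits the price precision
tolerant_mismatch <- abs(close_price - open_next) > tol

print(sum(exact_mismatch))
print(sum(tolerant_mismatch))
```

Counting mismatches both ways would show how many of the flagged days are attributable to round-off rather than to a genuine price gap.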

MCAR or MAR

Missing Completely at Random (MCAR) means that no relationship exists between the missingness of the data and any values, observed or missing. The missing data points are a random subset of the data: nothing systematic makes some data more likely to be missing than others.

Missing at Random (MAR) means that a systematic relationship exists between the propensity of missing values and the observed data, but not the missing data.

Source: https://www.theanalysisfactor.com/missing-data-mechanism/

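# Flag day i when its close differs from the open of day i+1.
# The last day has no following day, so it is padded with FALSE.
mismatch <- c(BTC$Open[-1]!=BTC$Close[-3005], FALSE)
days <- data.frame("Date"=BTC$Date, "Included"=rep(1,3005))
days$Included[mismatch] <- 0
# Bars of height 1 mark matching days; gaps mark days with a mismatch.
ggplot(days, aes(Date,Included)) +
  geom_bar(stat="identity", fill="steelblue")+
  ylim(0,1)+labs(title="Fig. 21. Bitcoin Trading Dates",subtitle="2010JUL16 - 2018OCT06",x="")+
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())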

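Again, no missing values exist in the Bitcoin trading prices, so there is no missingness mechanism to classify as MCAR or MAR and a formal test such as LittleMCAR is not needed.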
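To make the two definitions concrete on data that do contain gaps, here is a minimal simulation sketch. All variables, the seed, and the missingness rates are made-up assumptions. The group comparison is an informal diagnostic in the spirit of Chapter 2, not Little's MCAR test.

```r
set.seed(699)            # arbitrary seed for reproducibility
n <- 2000
x <- rnorm(n)            # fully observed variable
y <- x + rnorm(n)        # variable that would receive the missing values

# MCAR: every value of y has the same 20% chance of being missing.
miss_mcar <- runif(n) < 0.2

# MAR: the chance that y is missing depends only on the observed x.
miss_mar <- runif(n) < plogis(-1.5 + 2 * x)

# Compare the observed variable x between the missing and non-missing groups.
# A clear difference indicates that missingness is related to observed data.
p_mcar <- t.test(x ~ miss_mcar)$p.value
p_mar  <- t.test(x ~ miss_mar)$p.value

print(c(MCAR = p_mcar, MAR = p_mar))
```

Under the MAR mechanism the two groups differ strongly in `x`, so the p-value is very small. Under MCAR no systematic difference is expected, although any single sample can still yield a small p-value by chance.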

Types of imputation

List-wise/pair-wise deletion. The most common approach is simply to leave out the missing values. List-wise deletion drops every observation that has a missing value in any variable, while pair-wise deletion uses, for each analysis, all observations that are complete on the variables involved.
Mean imputation. Calculate the mean of the observed values of a variable and substitute it for each missing value of that variable. This keeps the same mean and the same sample size but has many disadvantages; for example, it shrinks the variable's variance. Almost every other imputation method is better than mean imputation.
Regression imputation. Each missing value is replaced by the predicted value obtained by regressing the variable with missing values on the other variables. Instead of the mean, the imputed value is a prediction based on other variables. This preserves relationships among the variables in the imputation model, but not the variability around the predicted values.

Source: https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/
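The three approaches listed above can be sketched on a small made-up data frame as follows. `toy` and its values are hypothetical, and the linear model is only one possible choice of imputation model.

```r
# Hypothetical data: y has two missing values, x is fully observed.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 4, NA, 8, NA, 12)
)
is_missing <- is.na(toy$y)

# List-wise deletion: drop every row that contains a missing value.
listwise <- na.omit(toy)

# Mean imputation: replace missing y with the mean of the observed y.
mean_imputed <- toy
mean_imputed$y[is_missing] <- mean(toy$y, na.rm = TRUE)

# Regression imputation: predict missing y from x with a linear model
# fitted on the complete rows (lm() drops rows with NA by default).
fit <- lm(y ~ x, data = toy)
reg_imputed <- toy
reg_imputed$y[is_missing] <- predict(fit, newdata = toy[is_missing, ])

print(listwise)
print(mean_imputed)
print(reg_imputed)
```

In this toy example the observed points lie exactly on a line, so the regression imputations follow the trend in `x`, whereas mean imputation assigns the same value to both gaps.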