About: This document is also available at http://rpubs.com/sherloconan/601675
Data source: Yahoo Finance
In this assignment you will be working with the dataset from your 699 project. You will perform a missing data analysis. Read Chapter 2 of Multivariate Data Analysis for more information on MAR and MCAR.
Describe the missing data and provide a summary of it, similar to the analysis in Chapter 2 (table 3): count and percent of missing data per variable, type of missing data (NA, null), total percent of missingness per dataset [ - 10pts]
Plot visualization of missing data pattern [ - 10pts]
Describe if you have observed any patterns [- 10pts]
Run statistical analysis to determine if your data is MCAR or MAR. For example, LittleMCAR - https://www.rdocumentation.org/packages/BaylorEdPsych/versions/0.5/topics/LittleMCAR [-10pts]
Explain what type of imputation will be performed: list-wise/pair-wise deletions, mean imputation, regression imputation etc [ - 10pts]
The beginning part is available at RPubs.
The dataset is the Bitcoin price as a daily time series, so it should contain no missing data or outliers. The price peaked at USD 19,870.62 per bitcoin on December 17, 2017, and bottomed at USD 0.01 on October 7, 2010.
options(scipen=100)
pastecs::stat.desc(BTC) %>% kable(digits=2) %>% kable_styling()
| | Date | Open | High | Low | Close | Adj.Close | Volume |
|---|---|---|---|---|---|---|---|
| nbr.val | 3005.00 | 3005.00 | 3005.00 | 3005.00 | 3005.00 | 3005.00 | 3005.00 |
| nbr.null | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 |
| nbr.na | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| min | 14806.00 | 0.05 | 0.05 | 0.01 | 0.05 | 0.05 | 0.00 |
| max | 17810.00 | 19346.60 | 19870.62 | 18750.91 | 19345.49 | 19345.49 | 6245731508.00 |
| range | 3004.00 | 19346.55 | 19870.57 | 18750.90 | 19345.44 | 19345.44 | 6245731508.00 |
| sum | 49006744.00 | 4344894.13 | 4496933.09 | 4170193.94 | 4351484.24 | 4351484.24 | 436964651190.00 |
| median | 16308.00 | 290.02 | 297.66 | 283.17 | 290.35 | 290.35 | 8502000.00 |
| mean | 16308.40 | 1445.89 | 1496.48 | 1387.75 | 1448.08 | 1448.08 | 145412529.51 |
| SE.mean | 15.83 | 53.81 | 55.92 | 51.16 | 53.83 | 53.83 | 7584131.47 |
| CI.mean.0.95 | 31.03 | 105.50 | 109.64 | 100.30 | 105.55 | 105.55 | 14870616.13 |
| var | 752717.07 | 8699579.66 | 9396254.87 | 7863811.28 | 8707621.33 | 8707621.33 | 172844745732914560.00 |
| std.dev | 867.59 | 2949.50 | 3065.33 | 2804.25 | 2950.87 | 2950.87 | 415746011.08 |
| coef.var | 0.05 | 2.04 | 2.05 | 2.02 | 2.04 | 2.04 | 2.86 |
BTC[BTC$Low==min(BTC$Low),]
## Date Open High Low Close Adj.Close Volume
## 84 2010-10-07 0.067 0.088 0.01 0.08685 0.08685 10784
BTC[BTC$High==max(BTC$High),]
## Date Open High Low Close Adj.Close Volume
## 2711 2017-12-17 19346.6 19870.62 18750.91 19065.71 19065.71 2264650369
The trading dates run from July 16, 2010 to October 6, 2018. The difference between the last and first date is 3,004 days, so the span covers 3,005 calendar days, matching the 3,005 observations in the dataset. No trading dates are missing. The closing price and the adjusted closing price are identical.
difftime(BTC$Date[3005], BTC$Date[1])
## Time difference of 3004 days
sum(BTC$Close==BTC$Adj.Close)==length(BTC$Adj.Close)
## [1] TRUE
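Matching the day count to the row count shows that the totals agree, but it does not by itself locate a gap. A minimal sketch of a more direct check is given here, using a hypothetical toy vector `dates` in place of `BTC$Date` (one day is removed on purpose so that the check has something to find):

```r
# Hypothetical toy series standing in for BTC$Date; one day is removed on purpose.
dates <- seq(as.Date("2010-07-16"), as.Date("2010-07-25"), by = "day")
dates <- dates[-4]  # drop 2010-07-19 to simulate a gap

# Every calendar day between the first and the last observation.
expected <- seq(min(dates), max(dates), by = "day")

# Days that should be present but are not.
missing_days <- expected[!expected %in% dates]
print(missing_days)

# A gap-free daily series has all consecutive differences equal to one day.
no_gaps <- all(diff(dates) == 1)
print(no_gaps)
```

Applied to the real data, an empty `missing_days` together with `no_gaps` equal to `TRUE` would confirm that every calendar day is present.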
In theory, Bitcoin trading has no “opening” or “closing”. The market never shuts (a new block is mined roughly every 10 minutes), so Day 1’s closing price should equal Day 2’s opening price. Nevertheless, in 195 cases the opening price did not match the previous day’s closing price, mostly between October 2017 and October 2018. Besides round-off error, one possible explanation is that trading on the exchange was halted.
#BTC[BTC$Open[-1]!=BTC$Close[-3005],]
sum(BTC$Open[-1]!=BTC$Close[-3005])
## [1] 195
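Because the exact comparison above also counts differences caused by rounding, it may help to separate round-off from real discrepancies. The sketch below does this on a hypothetical toy data frame `toy`; the tolerance `tol` is an assumed value chosen for illustration, not one taken from the data source.

```r
# Hypothetical toy prices: the second pair differs only by round-off,
# the third pair differs by a real amount.
toy <- data.frame(
  Open  = c(0.0500, 0.0858, 0.0808, 0.0750),
  Close = c(0.0858, 0.08080001, 0.0747, 0.0790)
)
n <- nrow(toy)

# Absolute gap between each day's close and the next day's open.
gap <- abs(toy$Open[-1] - toy$Close[-n])

tol <- 1e-6  # assumed tolerance for round-off

exact_mismatch <- sum(toy$Open[-1] != toy$Close[-n])  # counts any difference
real_mismatch  <- sum(gap > tol)                      # ignores tiny differences
print(c(exact = exact_mismatch, beyond_tolerance = real_mismatch))
```

On the real series, the difference between the two counts would indicate how many of the 195 mismatches are attributable to rounding alone.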
Missing Completely at Random (MCAR) means that no relationship exists between the missingness of the data and any values, observed or missing. The missing data points are a random subset of the data; nothing systematic makes some data more likely to be missing than others.
Missing at Random (MAR) means that a systematic relationship exists between the propensity of missing values and the observed data, but not the missing data.
Source: https://www.theanalysisfactor.com/missing-data-mechanism/
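The two mechanisms can be illustrated with a small simulation. This is only a sketch on synthetic data (the variables `x` and `y` and the missingness probabilities are assumptions made for illustration), and the comparison of means is an informal diagnostic, not Little's MCAR test.

```r
set.seed(1)
n <- 5000
x <- rnorm(n)          # fully observed variable
y <- 2 * x + rnorm(n)  # variable that would receive the missing values

# MCAR: every y has the same 20% chance of being missing.
miss_mcar <- runif(n) < 0.2

# MAR: the chance that y is missing rises with the observed x.
miss_mar <- runif(n) < plogis(2 * x - 1.5)

# Difference in mean x between rows where y is missing and rows where it is observed.
diff_mcar <- mean(x[miss_mcar]) - mean(x[!miss_mcar])
diff_mar  <- mean(x[miss_mar])  - mean(x[!miss_mar])
print(c(MCAR = diff_mcar, MAR = diff_mar))
```

Under MCAR the observed variable looks the same whether or not `y` is missing, so the difference should be close to zero; under MAR the rows with missing `y` have systematically larger `x`.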
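# Flag each day whose close differs from the next day's open.
# The last day has no successor, so pad with FALSE to match the number of rows.
mismatch <- c(BTC$Open[-1] != BTC$Close[-nrow(BTC)], FALSE)
days <- data.frame(Date = BTC$Date, Included = 1)
days$Included[mismatch] <- 0
ggplot(days, aes(Date, Included)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  ylim(0, 1) +
  labs(title = "Fig. 21. Bitcoin Trading Dates", subtitle = "2010JUL16 - 2018OCT06", x = "") +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())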
Again, no missing values exist in the Bitcoin trading prices.
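For a dataset that does contain missing values, the per-variable summary requested in the assignment could be produced along the following lines. The data frame `toy` is hypothetical and only illustrates the calculation. Note that a zero (as in `Volume` here) is a recorded value and is not counted as `NA`.

```r
# Hypothetical toy data with a few NAs; the zero in Volume is a value, not NA.
toy <- data.frame(
  Open   = c(1.0, NA, 3.0, 4.0),
  Close  = c(1.1, 2.1, NA, NA),
  Volume = c(10, 0, 30, 40)
)

na_count <- colSums(is.na(toy))         # missing values per variable
na_pct   <- 100 * colMeans(is.na(toy))  # percent missing per variable

missing_summary <- data.frame(Missing = na_count, Percent = na_pct)
print(missing_summary)

# Overall percent of missing cells in the dataset.
total_pct <- 100 * mean(is.na(toy))
print(total_pct)
```

Running the same three calculations on `BTC` should return zeros throughout, in line with the `nbr.na` row of the descriptive table.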
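List-wise/pair-wise deletion. The most common approach is simply to discard the missing values. List-wise deletion drops every observation that has a missing value in any variable, while pair-wise deletion uses, for each analysis, all observations that are complete on the variables involved.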
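The difference between the two deletion strategies can be sketched with base R on a hypothetical toy data frame `toy`; the variable names and values are made up for illustration.

```r
# Hypothetical toy data with missing values in two variables.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, NA, 6, 8, 10, 13),
  c = c(6, 5, NA, 3, 2, 1)
)

# List-wise deletion: keep only the rows that are complete on every variable.
listwise <- na.omit(toy)
print(nrow(listwise))

# Pair-wise deletion: each correlation uses all rows complete on that pair.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_pairwise, 3))

# Number of rows actually available for each pair of variables.
pair_n <- crossprod(!is.na(toy))
print(pair_n)
```

List-wise deletion leaves a single, smaller sample for every analysis, whereas pair-wise deletion keeps more data per statistic at the cost of each statistic being based on a different subset of rows.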
Mean imputation. Replace each missing value with the mean of the observed values of that variable. This keeps the same mean and the same sample size, but it has many disadvantages, such as understating the variance of the variable. Pretty much every alternative method is better than mean imputation.
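A minimal sketch of mean imputation on a hypothetical vector `x` shows both properties: the mean is preserved, while the variance shrinks because the imputed values add no spread.

```r
# Hypothetical variable with two missing values.
x <- c(2, 4, NA, 8, NA, 6)

# Replace each NA with the mean of the observed values.
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
print(x_imp)

# The mean is unchanged, but the variance is smaller after imputation.
print(c(mean_observed = mean(x, na.rm = TRUE), mean_imputed = mean(x_imp)))
print(c(var_observed = var(x, na.rm = TRUE), var_imputed = var(x_imp)))
```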
Regression imputation. Replace each missing value with the value predicted by regressing the incomplete variable on other variables. Instead of taking the mean, you take the predicted value based on other variables. This preserves the relationships among the variables involved in the imputation model, but not the variability around the predicted values.
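Regression imputation can be sketched with `lm()` and `predict()`. The data frame `toy`, its linear relationship, and the positions of the missing values are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical data: y depends linearly on x, and three y values are missing.
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x + rnorm(20)
toy$y[c(4, 11, 17)] <- NA

# Fit the imputation model on the complete rows (lm drops NA rows by default).
fit <- lm(y ~ x, data = toy)

# Fill each missing y with its predicted value.
miss <- is.na(toy$y)
toy$y_imp <- toy$y
toy$y_imp[miss] <- predict(fit, newdata = toy[miss, , drop = FALSE])

print(toy[miss, ])
```

The imputed points fall exactly on the fitted line, which is why this method keeps the relationship between `x` and `y` but removes the residual scatter that real observations would have.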