============================================================================================================
About: This document is also available at http://rpubs.com/sherloconan/597131
Data source: Yahoo Finance

 

Visual Exploration of Your Data

In this assignment you will be working with the data that you have collected for your Final Project.

Submission Format

Tasks

  1. Create Univariate analysis for the variable of your interest (your Y variable). Calculate skewness and kurtosis and describe the results. [histogram, skewness values, kurtosis values, description - 10pts]

  2. Create Bivariate plot Box Plot for your Y variable and one of other important metrics (your X). Describe figure. [box plot, description - 10pts]

  3. If your variables are continuous - Create a scatter plot between your Y and your X. If your variables are categorical - Create a bar plot. Describe figure [plot, description - 10pts]

  4. Create a multivariate plot - Use the same plot as in 3 but add another important variable using colored symbols. Describe Figure. Make sure to add legend [scatterplot, description - 10pts]

 

Project EDA

The paper aims to run both a qualitative and a quantitative analysis of Bitcoin. For the qualitative analysis, the paper discusses three significant systematic risks of Bitcoin investment: insecurity, illegality, and low transparency. For the quantitative analysis, the paper will build an ARIMA model of the Bitcoin price.

The dataset is the Bitcoin price as a time series, downloaded from Yahoo Finance. However, the website occasionally revises its historical Bitcoin trading prices and volumes, which can cause inconsistencies with other research. The website’s earliest accessible trading date also changes over time. The paper uses a series starting on July 16, 2010, and this dataset was recorded on October 8, 2018. Since the dataset is a time series, the EDA is limited to line charts in this case. Prices changed rapidly over this period, as Fig. 1 shows, so the series is split into four phases in Fig. 2 to Fig. 5. For data preparation, the ARIMA model requires specific steps, such as first-order differencing to remove trend and seasonality. The descriptive analysis will focus on the unit root test and on the ACF and PACF plots.
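
The preparation steps named above (first-order differencing, a unit root test, and ACF/PACF inspection) can be sketched in base R. This is a minimal illustration on a simulated random walk, not on the Bitcoin data. The names `price` and `d_price` are hypothetical, and the Phillips-Perron test from the stats package is used here only as a stand-in for whichever unit root test the paper adopts.

```r
# Simulated random walk with drift: a stand-in for a price series, NOT the Bitcoin data.
set.seed(1)
price <- 100 + cumsum(rnorm(500, mean = 0.1, sd = 1))

# First-order differencing: d_price[t] = price[t + 1] - price[t].
d_price <- diff(price, differences = 1)

# Phillips-Perron unit root test (stats::PP.test); the null hypothesis is a unit root.
pp_level <- PP.test(price)
pp_diff  <- PP.test(d_price)
cat("p-value, levels:     ", pp_level$p.value, "\n")
cat("p-value, differences:", pp_diff$p.value, "\n")

# Sample ACF and PACF of the differenced series, computed without plotting.
acf_diff  <- acf(d_price, lag.max = 20, plot = FALSE)
pacf_diff <- pacf(d_price, lag.max = 20, plot = FALSE)
print(round(acf_diff$acf[2:6], 3))   # autocorrelations at lags 1 to 5 (element 1 is lag 0)
print(round(pacf_diff$acf[1:5], 3))  # partial autocorrelations at lags 1 to 5
```

A small p-value for the differenced series is the usual indication that one round of differencing is enough to make the series stationary; the ACF and PACF values of the differenced series then guide the choice of the MA and AR orders.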

library(ggplot2)  # provides ggplot(); the listings below assume it is attached
BTC <- read.csv("~/Documents/HU/ANLY 699-90-O/699 R/BTC-USD.csv")
BTC$Date <- as.Date(BTC$Date,format="%Y-%m-%d")

fig1 <- ggplot(BTC,aes(Date,Close))+geom_line()+labs(x="Trading Date",y="Closing Price (USD)")+ggtitle("Fig. 1. Bitcoin Trading Price: 2010-2018")+theme_classic();fig1

 

fig2 <- ggplot(BTC[which(BTC$Date=="2010-07-16"):which(BTC$Date=="2013-03-16"),],aes(Date,Close))+geom_line()+labs(title="Fig. 2. Bitcoin Trading Price",subtitle="2010JUL16 - 2013MAR16",y="Closing Price (USD)",x="Trading Date")+theme_classic()
fig3 <- ggplot(BTC[which(BTC$Date=="2013-03-17"):which(BTC$Date=="2017-01-11"),],aes(Date,Close))+geom_line()+labs(title="Fig. 3. Bitcoin Trading Price",subtitle="2013MAR17 - 2017JAN11",y="Closing Price (USD)",x="Trading Date")+theme_classic()
fig4 <- ggplot(BTC[which(BTC$Date=="2017-01-12"):which(BTC$Date=="2017-12-15"),],aes(Date,Close))+geom_line()+labs(title="Fig. 4. Bitcoin Trading Price",subtitle="2017JAN12 - 2017DEC15",y="Closing Price (USD)",x="Trading Date")+theme_classic()
fig5 <- ggplot(BTC[which(BTC$Date=="2017-12-16"):which(BTC$Date=="2018-10-06"),],aes(Date,Close))+geom_line()+labs(title="Fig. 5. Bitcoin Trading Price",subtitle="2017DEC16 - 2018OCT06",y="Closing Price (USD)",x="Trading Date")+theme_classic()
gridExtra::grid.arrange(fig2,fig3,fig4,fig5,nrow=2,bottom="Four Phases")

In general, prices remained relatively low until March 2013, fluctuated until January 2017, soared until December 2017, and have been falling since. Hence, four phases are identified in Fig. 2 to Fig. 5, namely (1) July 16, 2010 – March 16, 2013; (2) March 17, 2013 – January 11, 2017; (3) January 12, 2017 – December 15, 2017; and (4) December 16, 2017 – October 6, 2018.
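
As an alternative to indexing rows with `which()`, the four phases can be attached to the dates as a factor. The sketch below is only an illustration: `dates` is a hypothetical daily calendar standing in for `BTC$Date`, and the break dates are the phase boundaries listed above.

```r
# Phase boundaries from the text; each interval is closed on the left and open on the right,
# so the final break is the day after the last observation.
breaks <- as.Date(c("2010-07-16", "2013-03-17", "2017-01-12", "2017-12-16", "2018-10-07"))

# Hypothetical daily calendar standing in for BTC$Date.
dates <- seq(as.Date("2010-07-16"), as.Date("2018-10-06"), by = "day")

# cut() assigns every date to exactly one of the four phases.
phase <- cut(dates, breaks = breaks, labels = paste("Phase", 1:4), right = FALSE)

# Number of calendar days in each phase.
print(table(phase))
```

With such a column added to the data frame, the four panels could also be drawn from a single ggplot call using `facet_wrap()` with free scales, instead of building four separate plot objects.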

 

Practice Univariate Analysis

A histogram is not meaningful for the Bitcoin price series, so the following two sections turn to the ANLY 510 project “Estimating the effects of various intervention components on smoking cessation with fractional factorial design”. Data source: The Methodology Center, College of Health and Human Development, Pennsylvania State University.

library(readxl)  # provides read_excel()
smoking <- read_excel("~/Documents/HU/ANLY 699-90-O/699 R/smoking.xlsx")
smoking[,2:7] <- lapply(smoking[,2:7],factor)
shapiro.test(smoking$SelfEff)
## 
##  Shapiro-Wilk normality test
## 
## data:  smoking$SelfEff
## W = 0.99368, p-value = 0.03096
moments::agostino.test(smoking$SelfEff)
## 
##  D'Agostino skewness test
## 
## data:  smoking$SelfEff
## skew = -0.042716, z = -0.400007, p-value = 0.6892
## alternative hypothesis: data have a skewness
moments::anscombe.test(smoking$SelfEff)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  smoking$SelfEff
## kurt = 2.4708, z = -3.2370, p-value = 0.001208
## alternative hypothesis: kurtosis is not equal to 3
fig6 <- ggplot(smoking,aes(SelfEff))+geom_histogram(aes(y=..density..),colour="black",fill="white")+geom_density(alpha=0.3,fill="#FF6666")+ggtitle("Fig. 6. Distribution of Self-efficacy")+labs(x="Self-efficacy",y="Density")+theme_light(); fig6
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The dependent variable “SelfEff” (self-efficacy) measures beliefs about the ability to quit smoking. This continuous variable ranges from -2.26500838 to 11.949277. The distribution in Fig. 6 looks bell-shaped, but the hypothesis tests do not fully support normality. The Shapiro-Wilk test, whose null hypothesis is that the sample is normally distributed, gives a p-value of 0.03096, so the null hypothesis is rejected at the 5% level. The Anscombe-Glynn test, whose null hypothesis is that the sample has a kurtosis of 3, gives a p-value of 0.001208, so that null hypothesis is also rejected. The D’Agostino test gives a p-value of 0.6892, so the null hypothesis that the sample has no skewness is not rejected. Overall, the distribution of the dependent variable deviates slightly from the normal distribution: it is roughly symmetric, but its kurtosis is below the normal value of 3.

The dependent variable has a skewness of -0.042716 and a kurtosis of 2.4708.
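
For reference, skewness and kurtosis can be computed directly from central moments. The sketch below uses the standard moment definitions with the n denominator, under which a normal distribution has skewness 0 and kurtosis 3; these are assumed to correspond to the `skew` and `kurt` statistics reported by the tests above. The helper names and the toy vector are hypothetical, not taken from the project.

```r
# k-th central moment with the n denominator (population moment).
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness and (non-excess) kurtosis.
skewness_m <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurtosis_m <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

# Toy vector, NOT the SelfEff data.
x <- c(1, 2, 3, 4, 5)
cat("skewness:", skewness_m(x), "\n")  # 0: the values are symmetric around 3
cat("kurtosis:", kurtosis_m(x), "\n")  # 1.7: below the normal value of 3
```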

 

Practice Bivariate Plot

p1 <- ggplot(smoking,aes(PATCH,SelfEff))+geom_boxplot(fill="#4271AE",color="#1F3552",size=1,notch=T)+theme_light()+labs(y="Self-efficacy",x="Use of nicotine patch")+ggtitle("Fig. 7. Effect of PATCH")
p2 <- ggplot(smoking,aes(ADLIB,SelfEff))+geom_boxplot(fill="#4271AE",color="#1F3552",size=1,notch=T)+theme_light()+labs(y="Self-efficacy",x="Adlib use of nicotine gum")+ggtitle("Fig. 8. Effect of ADLIB")
p3 <- ggplot(smoking,aes(COUN,SelfEff))+geom_boxplot(fill="#4271AE",color="#1F3552",size=1,notch=T)+theme_light()+labs(y="Self-efficacy",x="Pre-cessation counseling")+ggtitle("Fig. 9. Effect of COUN")
p4 <- ggplot(smoking,aes(IN,SelfEff))+geom_boxplot(fill="#4271AE",color="#1F3552",size=1,notch=T)+theme_light()+labs(y="Self-efficacy",x="In-person cessation counseling")+ggtitle("Fig. 10. Effect of IN")
p5 <- ggplot(smoking,aes(PH,SelfEff))+geom_boxplot(fill="#4271AE",color="#1F3552",size=1,notch=T)+theme_light()+labs(y="Self-efficacy",x="Phone counseling")+ggtitle("Fig. 11. Effect of PH")
p6 <- ggplot(smoking,aes(MD,SelfEff))+geom_boxplot(fill="#4271AE",color="#1F3552",size=1,notch=T)+theme_light()+labs(y="Self-efficacy",x="Duration of medication use")+ggtitle("Fig. 12. Effect of MD")
gridExtra::grid.arrange(p1,p2,p3,p4,p5,p6,nrow=2)

One-way ANOVA examines whether there are statistically significant differences between the arithmetic means of two or more independent groups. The box plots of the dependent variable against the six independent variables are shown in Fig. 7 to Fig. 12. From the plots, the group means appear to differ for use of the nicotine patch (PATCH), pre-cessation counseling (COUN), and intensive phone counseling (PH).
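
A one-way ANOVA of this kind can be sketched with base R's `aov()`. The data below are simulated, not the smoking dataset: `group` is a hypothetical two-level factor and `y` a hypothetical response whose mean is shifted by one unit in the second group.

```r
# Simulated two-group data, NOT the smoking dataset: group "1" is shifted up by 1 unit.
set.seed(42)
group <- factor(rep(c("0", "1"), each = 100))
y <- rnorm(200, mean = 5, sd = 1) + ifelse(group == "1", 1, 0)

# One-way ANOVA: does the mean of y differ between the levels of group?
fit <- aov(y ~ group)
tab <- summary(fit)[[1]]
print(tab)

# p-value of the F test for the group effect.
p_value <- tab[["Pr(>F)"]][1]
cat("p-value for the group effect:", p_value, "\n")
```

With only two levels, the F test is equivalent to a pooled-variance two-sample t test, so the box plots and the ANOVA address the same comparison of group centers.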

 

Practice Scatter Plot

rawdata <- read.csv("~/Documents/HU/ANLY 699-90-O/699 R/thornton_hiv_got.csv")
vars <- c("got", "any", "tinc", "distvct", "male", "age", "hiv2004", "site", "mar", "tb", "thinktreat")
data <- rawdata[vars]
colnames(data)[1] <- "target"
data <- na.omit(data)

f1 <- ggplot(data,aes(tinc))+geom_histogram(binwidth=10)+labs(x="Total value of the incentive (kwacha)",y="Count")+ggtitle("Fig. 13. Distribution")+theme_classic()
f2 <- ggplot(data,aes(distvct))+geom_histogram(binwidth=0.2)+labs(x="Distance (km)",y="Count")+ggtitle("Fig. 14. Distribution")+theme_classic()
f3 <- ggplot(data,aes(age))+geom_histogram(binwidth=1)+labs(x="Age",y="Count")+ggtitle("Fig. 15. Distribution")+theme_classic()
f4 <- ggplot(aes(tinc, target), data=aggregate(target~tinc, data=data, mean))+ylim(0,1)+geom_point()+labs(x="Total value of the incentive (kwacha)",y="Mean response")+ggtitle("Fig. 16. Scatter plot")+theme_classic()
gridExtra::grid.arrange(f1,f2,f3,f4, nrow=2)

A scatter plot is not meaningful for the Bitcoin price series either, so this section turns to the ANLY 530 project “Predict the Chance of Learning HIV Results after being Tested by Supervised Learning”. Data source: American Economic Review. Fig. 16 is a scatter plot of the aggregated data: the arithmetic mean of the target variable at each total incentive value. Fig. 16 suggests that a larger monetary incentive is associated with a higher chance of learning HIV results.
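
The aggregation behind Fig. 16 can be illustrated on a small toy data frame. The values below are made up for illustration and are not from the HIV dataset; only the `aggregate()` call mirrors the listing above. Because the target is coded 0/1, its mean within each incentive level is the share of respondents with a value of 1 at that level.

```r
# Toy data, NOT the HIV dataset: tinc is an incentive amount, target is a 0/1 outcome.
toy <- data.frame(
  tinc   = c(0, 0, 0, 0, 100, 100, 100, 100, 200, 200, 200, 200),
  target = c(0, 0, 0, 1,   0,   1,   1,   0,   1,   1,   1,   0)
)

# Mean of the 0/1 target within each incentive level.
agg <- aggregate(target ~ tinc, data = toy, FUN = mean)
print(agg)
```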

 

Practice Multivariate Plot

Again, this last section turns to ANLY 510 project “Estimating the effects of various intervention components on smoking cessation with fractional factorial design”. Data source: The Methodology Center, College of Health and Human Development, Pennsylvania State University.

fa <- ggplot(smoking,aes(PATCH,SelfEff,group=ADLIB,color=ADLIB))+geom_smooth(aes(linetype=ADLIB),method=lm,se=F)+theme_light()+ggtitle("Fig. 17. Two-way interaction of PATCH and ADLIB")
fb <- ggplot(smoking,aes(PATCH,SelfEff,group=COUN,color=COUN))+geom_smooth(aes(linetype=COUN),method=lm,se=F)+theme_light()+ggtitle("Fig. 18. Two-way interaction of PATCH and COUN")
fc <- ggplot(smoking,aes(COUN,SelfEff,group=PH,color=PH))+geom_smooth(aes(linetype=PH),method=lm,se=F)+theme_light()+ggtitle("Fig. 19. Two-way interaction of COUN and PH")
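# interaction.ABC.plot() is assumed to come from the dae package, which the original listing does not load
fd <- dae::interaction.ABC.plot(SelfEff,PATCH,ADLIB,COUN,smoking)+theme_light()+ggtitle("Fig. 20. Three-way interaction of PATCH, ADLIB, and COUN")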

gridExtra::grid.arrange(fa,fb,fc,fd,nrow=2)

Among these terms, PATCH, COUN, PH, PATCH:COUN, ADLIB:COUN, COUN:PH, and PATCH:ADLIB:COUN are statistically significant. Fig. 17 to Fig. 19 show two-way interaction plots, and Fig. 20 shows the three-way interaction plot. Parallel lines are noticeable in the three-way interaction plot. Examining the six intervention components suggests that (1) use of the nicotine patch appears to have the dominant main effect on smoking cessation; (2) pre-cessation counseling has little additional effect when the nicotine patch is already used; and (3) when pre-cessation counseling is not included, nicotine gum use has only a small effect whether or not the nicotine patch is used.
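
The reading of an interaction plot can be made concrete with cell means. In the sketch below, which uses toy numbers rather than the smoking data, `A` and `B` are hypothetical on/off factors. The interaction contrast is the difference between the effect of `A` when `B` is on and the effect of `A` when `B` is off; a contrast of zero corresponds to parallel lines in the plot.

```r
# Toy 2x2 data, NOT the smoking dataset: A and B are two on/off factors.
toy <- data.frame(
  A = factor(rep(c("off", "on"), each = 4)),
  B = factor(rep(rep(c("off", "on"), each = 2), times = 2)),
  y = c(1, 3,  3, 5,  5, 7,  4, 6)
)

# Cell means: rows are levels of A, columns are levels of B.
cell_means <- with(toy, tapply(y, list(A = A, B = B), mean))
print(cell_means)

# Effect of A at each level of B, and their difference (the interaction contrast).
effect_A <- cell_means["on", ] - cell_means["off", ]
interaction_contrast <- unname(effect_A["on"] - effect_A["off"])
cat("interaction contrast:", interaction_contrast, "\n")  # -3: the lines are not parallel
```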