library(readr)
library(dplyr)
library(tidyr)
library(outliers)
library(forecast)
This report will be focus on prepare the dataset containing gold and cryptocurrency price information for future price inference purpose. First, the dataset wiAll be created through merging two open source data gathered from www.kaggle.com. Necessary data type conversion will be performed in order to merge the datasets successfully. Afterwards, the new dataset will be examined through structure checking. Any variables require necessary type conversion or ordering will be performed. When all variables are examined to be in correct format, the whole dataset will be checked once more to make sure it is in a tidy format. Variables will filtered to unsure the efficiency of future analysis. Furthermore, the dataset will be scaned for missing values and outliers. Any missing values and outliers detected will be replaced or excluded with appropiate methodlogy. Finally, a distribution check will be performed on the tidy data. If the data is not in normal distribution, transformations will apply to scale the data to as normal distributed as possible.
The two data sets are both open data sourced from www.kaggle.com.
Here are the links to the original data website and variable descriptions:
The dataset is about cryptocurrency market with price information and relative price radios.
Variables:
Link: https://www.kaggle.com/jessevent/all-crypto-currencies
This dataset is about the VAISHNAVI GOLD share price information and relative price ratios.
Variables:
Link: https://www.kaggle.com/lakshmi25npathi/gold-price
After importing the two datasets, the two dataframes are merged by dates where only matched dates and informations are retained. To merge them by equal dates, “date” coloums need to be in “date” format in order to perform the merge. Through check for class, both “date” coloums are factors. Therefore, both are converted into date type with the same format: Y-M-D.
“Inner_join” is selected to perform the merge, because in this way, the new dataset is more tidy with relevant datas when come to further analysis.
cryptocurrency <- read.csv("cryptomarkets.csv", sep=",")
goldprice <- read.csv("Goldprice.csv", sep=",")
cryptocurrency$date %>% class
[1] "factor"
goldprice$date %>% class
[1] "factor"
goldprice$date <- strptime(as.character(goldprice$date), "%d/%m/%Y")
goldprice$date <- as.Date(goldprice$date)
cryptocurrency$date <- as.Date(cryptocurrency$date)
crypto_gold <- inner_join(cryptocurrency, goldprice, by = "date")
The attributes in the data are checked through showing the data structures.
Apart from data conversion previously (from factor to date), the rest attributes in the data are in appropiate data types. “Slug”, “symbo”l" and “names” are decided to store as factor becuase factors saves more memory compare to character and factors contain levels which helps ordering the data for further analysis. Also there may be cryptocurrency names that contain numbers and characters in the future.
Next, the data is ordered for better visualization.
First, check the levels in “symbol”. As shown in the data structure and level checks, there are 2005 levels in the dataset. It is impossible to label them all.
BTC, ETH and XRP is the three major cryptocurrencies in the market and has majority or the market caps. Assume the analysis between gold price and cryptocurrency price will be most likely to perform on these three cryptocurrencies. Thus, the dataset is filtered to only contain price information of the top 3 cryptocurrency.
Therefore, tidy & manipulation of data is performed first and level ordering will be performed afterwards.
str(crypto_gold)
'data.frame': 137678 obs. of 25 variables:
$ slug : Factor w/ 2071 levels "0chain","0x",..: 209 209 209 209 209 209 209 209 209 209 ...
$ symbol : Factor w/ 2005 levels "$$$","$PAC","0XBTC",..: 275 275 275 275 275 275 275 275 275 275 ...
$ name : Factor w/ 2071 levels "0chain","0x",..: 205 205 205 205 205 205 205 205 205 205 ...
$ date : Date, format: "2013-04-29" "2013-04-30" ...
$ ranknow : int 1 1 1 1 1 1 1 1 1 1 ...
$ open : num 134 144 116 106 116 ...
$ high : num 147 147 126 108 125 ...
$ low : num 134 134.1 92.3 79.1 106.6 ...
$ close : num 144.5 139 105.2 97.8 112.3 ...
$ volume : num 0 0 0 0 0 0 0 0 0 0 ...
$ market : num 1.60e+09 1.54e+09 1.17e+09 1.09e+09 1.25e+09 ...
$ close_ratio : num 0.781 0.384 0.388 0.642 0.314 ...
$ spread : num 13.5 12.9 33.3 29 18 ...
$ Open : num 3.9 3.71 3.71 3.66 3.56 3.49 3.43 3.37 3.31 3.25 ...
$ High : num 3.9 3.71 3.71 3.66 3.56 3.49 3.43 3.37 3.31 3.25 ...
$ Low : num 3.9 3.71 3.53 3.53 3.56 3.49 3.43 3.37 3.31 3.25 ...
$ Close : num 3.9 3.71 3.59 3.63 3.56 3.49 3.43 3.37 3.31 3.25 ...
$ WAP : num 3.9 3.71 3.57 3.63 3.56 ...
$ No..of.Shares : int 154304 8775 20063 14141 80819 205 5040 1175 10089 50 ...
$ No..of.Trades : int 25 11 46 79 11 2 10 14 83 1 ...
$ Total.Turnover : int 601785 32555 71612 51353 287715 715 17287 3959 33394 162 ...
$ Deliverable.Quantity : int 154304 8775 20063 14141 80819 205 5040 1175 10089 50 ...
$ X..Deli..Qty.to.Traded.Qty: num 100 100 100 100 100 100 100 100 100 100 ...
$ Spread.H.L : num 0 0 0.18 0.13 0 0 0 0 0 0 ...
$ Spread.C.O : num 0 0 -0.12 -0.03 0 0 0 0 0 0 ...
crypto_gold$symbol %>% levels
[1] "$$$" "$PAC" "0XBTC" "1337" "1ST" "1WO"
[7] "2GIVE" "2GO" "300" "42" "611" "808"
[13] "8BIT" "AAA" "AAC" "ABBC" "ABC" "ABDT"
[19] "ABL" "ABS" "ABT" "ABX" "ABY" "ABYSS"
[25] "AC" "AC3" "ACAT" "ACC" "ACDC" "ACE"
[31] "ACED" "ACES" "ACM" "ACOIN" "ACP" "ACRE"
[37] "ACT" "ACTP" "ADA" "ADB" "ADC" "ADCN"
[43] "ADH" "ADI" "ADK" "ADL" "ADST" "ADT"
[49] "ADX" "ADZ" "AE" "AEC" "AEG" "AEON"
[55] "AGI" "AGLT" "AI" "AIB" "AID" "AIDOC"
[61] "AION" "AIT" "AIX" "AKA" "ALC" "ALI"
[67] "ALIS" "ALL" "ALT" "ALTX" "ALX" "AMB"
[73] "AMLT" "AMM" "AMMO" "AMN" "AMO" "AMP"
[79] "AMS" "ANC" "ANI" "ANON" "ANT" "AOA"
[85] "AOG" "APC" "APH" "APIS" "APL" "APOT"
[91] "APPC" "APR" "APX" "ARB" "ARC" "ARCO"
[97] "ARCT" "ARDR" "AREPA" "ARG" "ARGUS" "ARI"
[103] "ARION" "ARK" "ARN" "ARO" "ART" "ARY"
[109] "ASA" "ASAFE2" "AST" "AT" "ATB" "ATC"
[115] "ATCC" "ATD" "ATH" "ATL" "ATM" "ATMI"
[121] "ATMOS" "ATN" "ATOM" "ATP" "ATS" "ATX"
[127] "AU" "AUC" "AUR" "AURA" "AUTO" "AUX"
[133] "AV" "AVA" "AVH" "AVINOC" "AVT" "AXIOM"
[139] "AXPR" "AZART" "B@" "B2B" "B2X" "BAAS"
[145] "BANCA" "BANK" "BAT" "BAX" "BAY" "BBC"
[151] "BBK" "BBN" "BBO" "BBP" "BBR" "BBS"
[157] "BC" "BCA" "BCAC" "BCARD" "BCD" "BCDN"
[163] "BCDT" "BCF" "BCH" "BCI" "BCN" "BCO"
[169] "BCPT" "BCV" "BCX" "BCY" "BCZERO" "BDG"
[175] "BDL" "BDT" "BEAT" "BEE" "BEET" "BELA"
[181] "BEN" "BENJI" "BENZ" "BERN" "BERRY" "BET"
[187] "BETHER" "BETR" "BEZ" "BFF" "BFT" "BGG"
[193] "BHPC" "BIFI" "BIGUP" "BIO" "BIR" "BIRDS"
[199] "BIS" "BIT" "BITB" "BITBTC" "BITCF" "BITCNY"
[205] "BITEUR" "BITF" "BITG" "BITGOLD" "BITS" "BITSILVER"
[211] "BITUSD" "BITX" "BIX" "BKBT" "BKX" "BLACK"
[217] "BLAST" "BLAZR" "BLC" "BLK" "BLN" "BLOC"
[223] "BLOCK" "BLT" "BLU" "BLUE" "BLZ" "BMC"
[229] "BMH" "BMX" "BNB" "BNC" "BND" "BNK"
[235] "BNN" "BNT" "BNTY" "BOAT" "BOB" "BOC"
[241] "BOE" "BOLI" "BON" "BOS" "BOST" "BOT"
[247] "BOUTS" "BOX" "BOXX" "BPL" "BPT" "BQ"
[253] "BQT" "BRAT" "BRD" "BRIA" "BRIT" "BRK"
[259] "BRM" "BRO" "BRX" "BRZC" "BSC" "BSD"
[265] "BSM" "BSN" "BSTN" "BSTY" "BSV" "BSX"
[271] "BTA" "BTAD" "BTB" "BTBC" "BTC" "BTCM"
[277] "BTCN" "BTCONE" "BTCP" "BTCRED" "BTCS" "BTCZ"
[283] "BTDX" "BTG" "BTK" "BTM" "BTN" "BTNT"
[289] "BTO" "BTPL" "BTQ" "BTR" "BTRN" "BTS"
[295] "BTT" "BTW" "BTWTY" "BTX" "BTXC" "BU"
[301] "BUB" "BUBO" "BUMBA" "BUN" "BUNNY" "BURST"
[307] "BUT" "BUZZ" "BWK" "BWS" "BWT" "BWX"
[313] "BZ" "BZL" "BZNT" "BZX" "C2" "C20"
[319] "C2C" "C2P" "C8" "CAB" "CAG" "CAN"
[325] "CANDY" "CANN" "CAPP" "CAR" "CARAT" "CARBON"
[331] "CARD" "CARE" "CAS" "CASH" "CAT" "CATO"
[337] "CAZ" "CBC" "CBT" "CBX" "CCC" "CCCX"
[343] "CCL" "CCO" "CCRB" "CCT" "CDC" "CDM"
[349] "CDN" "CDT" "CDX" "CEDEX" "CEEK" "CEFS"
[355] "CEL" "CEN" "CENNZ" "CET" "CF" "CFC"
[361] "CFI" "CFL" "CFUN" "CGEN" "CHAT" "CHE"
[367] "CHEESE" "CHESS" "CHEX" "CHIPS" "CHP" "CHSB"
[373] "CHX" "CIF" "CIT" "CIV" "CJ" "CJS"
[379] "CJT" "CKUSD" "CL" "CLAM" "CLN" "CLO"
[385] "CLOAK" "CLUB" "CMCT" "CMIT" "CMM" "CMPCO"
[391] "CMS" "CMT" "CND" "CNET" "CNN" "CNNC"
[397] "CNO" "CNT" "CNX" "COAL" "COB" "COBRA"
[403] "COFI" "COIN" "COLX" "COMP" "CONI" "CONX"
[409] "COR" "COSM" "COSS" "COTN" "COU" "COUPE"
[415] "COV" "COVAL" "CPAY" "CPC" "CPLO" "CPN"
[421] "CPT" "CPX" "CPY" "CRAVE" "CRB" "CRBT"
[427] "CRC" "CRD" "CRE" "CREA" "CRED" "CREDO"
[433] "CREVA" "CRM" "CROAT" "CROP" "CRPT" "CRW"
[439] "CRYP" "CS" "CSC" "CSM" "CSNO" "CST"
[445] "CSTL" "CTC" "CTIC2" "CTIC3" "CTL" "CTRT"
[451] "CTX" "CTXC" "CURE" "CV" "CVC" "CVN"
[457] "CVT" "CWV" "CXO" "CXT" "CYFM" "CYL"
[463] "CYMT" "CZR" "D" "DAC" "DACC" "DACH"
[469] "DACS" "DADI" "DAG" "DAGT" "DAI" "DALC"
[475] "DAN" "DAPS" "DAR" "DART" "DASC" "DASH"
[481] "DAT" "DATA" "DATP" "DATX" "DAV" "DAX"
[487] "DAXT" "DAXX" "DAY" "DBC" "DBET" "DBIX"
[493] "DCC" "DCN" "DCR" "DCT" "DCY" "DDD"
[499] "DDX" "DEAL" "DEB" "DEC" "DEEX" "DELIZ"
[505] "DELTA" "DEM" "DENT" "DERO" "DEUS" "DEV"
[511] "DEW" "DEX" "DFT" "DGB" "DGC" "DGD"
[517] "DGS" "DGTX" "DGX" "DICE" "DIG" "DIM"
[523] "DIME" "DIN" "DIT" "DIVI" "DIVX" "DIX"
[529] "DKPC" "DLC" "DLT" "DMB" "DMC" "DMD"
[535] "DML" "DMT" "DNA" "DNT" "DNZ" "DOCK"
[541] "DOGE" "DOLLAR" "DOPE" "DOR" "DOT" "DOV"
[547] "DOW" "DP" "DPN" "DPY" "DRG" "DRGN"
[553] "DRM" "DROP" "DRPU" "DRT" "DRXNE" "DSR"
[559] "DT" "DTA" "DTB" "DTC" "DTEM" "DTH"
[565] "DTR" "DTRC" "DTX" "DUO" "DUTCH" "DWS"
[571] "DX" "DXT" "DYN" "EAG" "EARTH" "EBC"
[577] "EBET" "EBST" "EBTC" "ECA" "ECASH" "ECC"
[583] "ECO" "ECOB" "ECOM" "ECOREAL" "ECT" "EDG"
[589] "EDN" "EDO" "EDR" "EDRC" "EDS" "EDT"
[595] "EDU" "EFL" "EFX" "EFYT" "EGC" "EGCC"
[601] "EGEM" "EGT" "EGX" "EJOY" "EKO" "EKT"
[607] "EL" "ELA" "ELE" "ELEC" "ELF" "ELI"
[613] "ELITE" "ELIX" "ELLA" "ELLI" "ELS" "ELTCOIN"
[619] "ELY" "EMB" "EMC" "EMC2" "EMD" "EMPR"
[625] "ENG" "ENGT" "ENJ" "ENT" "ENTS" "EOS"
[631] "EOSDAC" "EPLUS" "EPY" "EQL" "EQT" "ERA"
[637] "ERC20" "ERO" "ERT" "ERY" "ESCE" "ESCO"
[643] "ESN" "ESP" "ESS" "EST" "ESZ" "ETA"
[649] "ETBS" "ETC" "ETG" "ETH" "ETHD" "ETHM"
[655] "ETHO" "ETHOS" "ETI" "ETK" "ETN" "ETP"
[661] "ETT" "ETZ" "EUC" "EUNO" "EURS" "EVC"
[667] "EVE" "EVI" "EVIL" "EVN" "EVR" "EVX"
[673] "EXC" "EXCL" "EXMR" "EXP" "EXRN" "EXT"
[679] "EXY" "EZT" "EZW" "F1C" "FACE" "FAIR"
[685] "FANS" "FBN" "FCT" "FDX" "FDZ" "FGC"
[691] "FID" "FIL" "FJC" "FKX" "FLASH" "FLAX"
[697] "FLDC" "FLIK" "FLIXX" "FLM" "FLO" "FLOT"
[703] "FLP" "FLT" "FLUZ" "FMF" "FND" "FNKOS"
[709] "FNTB" "FOIN" "FOOD" "FOR" "FORK" "FOTA"
[715] "FOX" "FOXT" "FRC" "FREC" "FREE" "FRGC"
[721] "FRN" "FRST" "FSBT" "FSN" "FST" "FT"
[727] "FTC" "FTI" "FTM" "FTO" "FTT" "FTX"
[733] "FTXT" "FUEL" "FUN" "FUNDZ" "FUZZ" "FXT"
[739] "FYP" "GAM" "GAME" "GAP" "GARD" "GARY"
[745] "GAS" "GAT" "GB" "GBC" "GBG" "GBX"
[751] "GBYTE" "GCC" "GCN" "GCR" "GCS" "GDC"
[757] "GEERT" "GEM" "GEN" "GENE" "GEO" "GET"
[763] "GETX" "GIC" "GIN" "GIO" "GLA" "GLD"
[769] "GLT" "GMCN" "GNO" "GNR" "GNT" "GNX"
[775] "GO" "GOD" "GOLF" "GOLOS" "GOOD" "GOSS"
[781] "GOT" "GPKR" "GRC" "GRFT" "GRID" "GRIM"
[787] "GRLC" "GRMD" "GRPH" "GRS" "GRWI" "GRX"
[793] "GSC" "GSE" "GSR" "GST" "GTC" "GTM"
[799] "GTO" "GUESS" "GUP" "GUSD" "GVE" "GVT"
[805] "GXS" "GZE" "GZRO" "HAC" "HAL" "HALLO"
[811] "HAND" "HAV" "HAVY" "HB" "HBC" "HBT"
[817] "HBZ" "HC" "HDAC" "HEAT" "HELP" "HER"
[823] "HERO" "HGT" "HIRE" "HIT" "HKN" "HLC"
[829] "HLM" "HMC" "HMQ" "HNC" "HNDC" "HODL"
[835] "HOLD" "HONEY" "HORSE" "HORUS" "HOT" "HPB"
[841] "HPC" "HPY" "HQT" "HQX" "HRC" "HSC"
[847] "HSN" "HST" "HT" "HTH" "HTML" "HUC"
[853] "HUM" "HUR" "HUSH" "HUZU" "HVCO" "HVN"
[859] "HWC" "HXX" "HYB" "HYC" "HYDRO" "HYP"
[865] "I0C" "IBANK" "IBTC" "IC" "ICN" "ICNQ"
[871] "ICOB" "ICON" "ICOO" "ICR" "ICX" "IDH"
[877] "IDOL" "IDT" "IDXM" "IETH" "IFC" "IFLT"
[883] "IFOOD" "IFP" "IFT" "IG" "IGNIS" "IHF"
[889] "IHT" "IIC" "ILC" "IMP" "IMS" "IMT"
[895] "IMX" "INB" "INC" "INCNT" "INCO" "INCX"
[901] "IND" "INDI" "INFX" "ING" "INK" "INN"
[907] "INO" "INS" "INSN" "INSTAR" "INSUR" "INT"
[913] "INV" "INVE" "INXT" "IOC" "IOG" "ION"
[919] "IONC" "IOP" "IOST" "IOTX" "IOV" "IPC"
[925] "IPL" "IPSX" "IQ" "IQN" "IQT" "IRD"
[931] "IRL" "ISR" "ITC" "ITI" "ITL" "ITT"
[937] "ITZ" "IVY" "IXC" "IXE" "IXT" "J"
[943] "J8T" "JC" "JET" "JEW" "JIN" "JIYO"
[949] "JIYOX" "JNT" "JOINT" "JOT" "JS" "JSE"
[955] "KAN" "KARMA" "KB3" "KBC" "KBR" "KCASH"
[961] "KCS" "KED" "KEK" "KEY" "KICK" "KIN"
[967] "KIND" "KLKS" "KLN" "KMD" "KNC" "KNDC"
[973] "KNOW" "KNT" "KOBO" "KORE" "KRB" "KRL"
[979] "KRM" "KRONE" "KST" "KUN" "KURT" "KWATT"
[985] "KWH" "KXC" "KZC" "LA" "LABH" "LALA"
[991] "LANA" "LATX" "LBA" "LBC" "LBTC" "LCC"
[997] "LCP" "LCS" "LDC" "LDOGE"
[ reached getOption("max.print") -- omitted 1005 entries ]
The dataset is filtered into price information only relate to BTC, ETH and XRP.
Then, the data will be ordered following the order BTC - ETH - XRP.
Furthermore, the date coloum will be seperated into year,month and day. So price analysis based on differenet time frame can be performed easily.
crypto_gold_filtered <- crypto_gold %>% filter(symbol == "BTC" | symbol == "ETH" | symbol == "XRP")
crypto_gold_filtered$symbol <- factor (crypto_gold_filtered$symbol, levels = c("BTC", "ETH", "XRP"), ordered =TRUE)
crypto_gold_filtered$symbol %>% levels
[1] "BTC" "ETH" "XRP"
crypto_gold_tidy <- crypto_gold_filtered %>% separate(date, into = c("Year", "Month", "Day"), sep = "-")
There is a variable named “spread” for the $USD difference between the high and low values of a cryptocurrency for the day.
It may be helpful to create a varibale forthe $USD difference between the close and open values of a cryptocurrency for the day.
The new variable will be named “spread C/O” and the origianl “spread” variable will be named “spread H/L”
crypto_gold_tidy2 <- mutate(crypto_gold_tidy,
"spread C/O" = close- open)
crypto_gold_tidy2 %>% rename("spread H/L" = spread)
Use is.na to scan for any NAs in the data. However, the dataset is too big with too many observations. Use the alternative way: identify total count of NAs in data frame to scan for NAs in the data more efficiently.
As there are 0 NAs, no actions need to take.
is.na(crypto_gold_tidy2)
slug symbol name Year Month Day ranknow open high low close volume
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[20,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[21,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[32,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[33,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
market close_ratio spread Open High Low Close WAP No..of.Shares
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[20,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[21,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[32,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[33,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
No..of.Trades Total.Turnover Deliverable.Quantity X..Deli..Qty.to.Traded.Qty
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE
[19,] FALSE FALSE FALSE FALSE
[20,] FALSE FALSE FALSE FALSE
[21,] FALSE FALSE FALSE FALSE
[22,] FALSE FALSE FALSE FALSE
[23,] FALSE FALSE FALSE FALSE
[24,] FALSE FALSE FALSE FALSE
[25,] FALSE FALSE FALSE FALSE
[26,] FALSE FALSE FALSE FALSE
[27,] FALSE FALSE FALSE FALSE
[28,] FALSE FALSE FALSE FALSE
[29,] FALSE FALSE FALSE FALSE
[30,] FALSE FALSE FALSE FALSE
[31,] FALSE FALSE FALSE FALSE
[32,] FALSE FALSE FALSE FALSE
[33,] FALSE FALSE FALSE FALSE
[34,] FALSE FALSE FALSE FALSE
[35,] FALSE FALSE FALSE FALSE
Spread.H.L Spread.C.O spread C/O
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
[5,] FALSE FALSE FALSE
[6,] FALSE FALSE FALSE
[7,] FALSE FALSE FALSE
[8,] FALSE FALSE FALSE
[9,] FALSE FALSE FALSE
[10,] FALSE FALSE FALSE
[11,] FALSE FALSE FALSE
[12,] FALSE FALSE FALSE
[13,] FALSE FALSE FALSE
[14,] FALSE FALSE FALSE
[15,] FALSE FALSE FALSE
[16,] FALSE FALSE FALSE
[17,] FALSE FALSE FALSE
[18,] FALSE FALSE FALSE
[19,] FALSE FALSE FALSE
[20,] FALSE FALSE FALSE
[21,] FALSE FALSE FALSE
[22,] FALSE FALSE FALSE
[23,] FALSE FALSE FALSE
[24,] FALSE FALSE FALSE
[25,] FALSE FALSE FALSE
[26,] FALSE FALSE FALSE
[27,] FALSE FALSE FALSE
[28,] FALSE FALSE FALSE
[29,] FALSE FALSE FALSE
[30,] FALSE FALSE FALSE
[31,] FALSE FALSE FALSE
[32,] FALSE FALSE FALSE
[33,] FALSE FALSE FALSE
[34,] FALSE FALSE FALSE
[35,] FALSE FALSE FALSE
[ reached getOption("max.print") -- omitted 1793 rows ]
sum(is.na(crypto_gold_tidy2))
[1] 0
By checking for outliers through the boxplots, outliers can be clearly identified in BTC price.
Next, we use z score to identify the outliers and find the locations of the z-scores whose absolute value is greater than 3.
There are three outliers as identified. These outliers will be replaced by the nearest neighbours that are not outliers. As in this case, the outliers are not a result of data entry error or data processing error, thus it might not be appropiate to exclude the outliers directly.
boxplot(crypto_gold_tidy2$close ~ crypto_gold_tidy2$symbol, main="price by cryptocurrency", ylab = "price", xlab = "cryptocurrency")
crypto_gold_outlier <- crypto_gold_tidy2 %>% filter( symbol == "BTC" ) %>% dplyr::select(close)
z.scores <- crypto_gold_outlier %>% scores(type = "z")
z.scores %>% summary()
close
Min. :-1.52668
1st Qu.:-0.79443
Median :-0.09933
Mean : 0.00000
3rd Qu.: 0.72392
Max. : 3.18060
which( abs(z.scores) >3 )
[1] 148 151 793
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
close_capped <- crypto_gold_outlier$close %>% cap()
Before any transformation, histogram of the close price of bitcoin is created for visualization of the distribution. From the plot, it is clear that the data is right skewed. For further inference on the price, it is preferred to transform the data into normal distribution. Box-Cox transformation is applied to transform the data into normal distribution. From the result histogram, it appears to be left-skewed rather than normal distributed.
Z score standardisation is applied to see if there is any improvement. From the histogram of z score standardiasation, it is clearly that the plot is relatively more normal than the Box-Cox transformation with most data gathered around mean = 0
hist(close_capped)
boxcox_close<- BoxCox(close_capped,lambda = "auto")
hist(boxcox_close)
z_close <- scale(close_capped, center = TRUE, scale = TRUE)
hist(z_close)