library(dplyr)
library(tidyr)
library(readr)
library(Hmisc)
library(editrules)
library(MVN)
library(outliers)
library(forecast)
In this report, there are three data sets which been merged together. These three data sets are open source provide at https://data.world/imf/consumer-price-index-cpi and https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?end=2017&start=1960&view=chart. After merging the data set, the new data set ‘CPI_GDP’ contains only numerics and characters two type of variables. Then, this data set the ‘incomegroup’ have been factorize and ordered variable from an characters attributes as required. The new data set in tidy format by using the tidyr and dplyr packages. In addition, there are two new variables which the ratio between 2015 and 2016 have been added by using mutate function and one variable is created base on these two variable, which is the ratio of GPD and CPI. The rate of changing in GDP & CPI have been sacnned by using the outliers package and editrules packages. It is applied editrules that both GDP and CPI are over 0 to constraints the impossible values. Moreover, the outliers in growth rate of GDP and CPI were removed by usiing absolute z-scores. In the last step, due to the left skewness of GDP growth rate, this variable was symmetrization by using box cox transformation from forcast package. As a resulte, it can be used to explore linear relationship with other variable.
There are three data sets which are CPI, GDP and Income. Those of data sets are open sources from website https://data.world/imf/consumer-price-index-cpi and “https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?end=2017&start=1960&view=chart”. The CPI data set provided the data of Consumer Price Index which index measure changes in the prices of goods from different country over 50 years. GDP data set is the Gross Domestic Product for different country over few decades. Meanwhile, the Income data set provided the level of income on each country compare with whole world. These three data sets will be merge by using mutate function and full_join function though their common attributes. In addition, the year of data was selected from 2015 to 2016.
CPI <- read_csv("/Users/hipposon/Desktop/Master of data science/Data prepossessing/dataset/Consumer Price Index (CPI)_CPI_03-23-2017 15-51-51-76_timeSeries.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Country Code` = col_integer()
## )
## See spec(...) for full column specifications.
GDP <- read_csv("/Users/hipposon/Desktop/Master of data science/Data prepossessing/dataset/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10134290/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10134290.csv", skip = 3)
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `Indicator Name` = col_character(),
## `Indicator Code` = col_character(),
## X63 = col_character()
## )
## See spec(...) for full column specifications.
IncomeGroup <- read_csv("/Users/hipposon/Desktop/Master of data science/Data prepossessing/dataset/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10134290/Metadata_Country_API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10134290.csv")
## Parsed with column specification:
## cols(
## `Country Code` = col_character(),
## Region = col_character(),
## IncomeGroup = col_character(),
## SpecialNotes = col_character(),
## TableName = col_character(),
## X6 = col_character()
## )
Income_GDP <- left_join(IncomeGroup, GDP, by = "Country Code") ## The GDP dataset join the income grouping dataset
Income_GDP <- Income_GDP[-c(5,6)]
Income_GDP <- Income_GDP[-c(8:62)] ## remove the historical data.
Income_GDP <- Income_GDP[-c(10:11)]
CPI <- CPI[-c(6:1110)] ##remove 1960-2014 because they are too old.
CPI <- CPI[-(7:22)]
CPI <- CPI[-c(8:25, 27)] ##remove 2017, because it has not completed yet.
CPI %>% group_by(`Indicator Name`) %>% summarise(count = n()) %>% arrange(desc(count))
try <- CPI %>% filter(`Indicator Name` == "Consumer Price Index, All items") ##Only keep the CPI, all items indicator
CPI <- try[!duplicated(try$`Country Name`),] ##remove duplicated rows for each unique country
CPI$`2015` <- as.numeric(CPI$`2015`)
CPI$`2016` <- as.numeric(CPI$`2016`)
Income_GDP$`2015` <- as.numeric(Income_GDP$`2015`)
Income_GDP$`2016` <- as.numeric(Income_GDP$`2016`)
CPI <- mutate(CPI, CPI_increase_rate_16 = CPI$`2016` / CPI$`2015`) ## create a new variable CPI increase rate base on 2015 and 2016 date
Income_GDP <- mutate(Income_GDP, GDP_increase_rate_16 = Income_GDP$`2016` / Income_GDP$`2015`) ## create a new variable GDP increase rate base on 2015 and 2016 date
CPI_GDP <- left_join(CPI, Income_GDP, by = "Country Name") # Merge them again
head(CPI_GDP)
After merging the data set, the new data set ‘CPI_GDP’ contains only numerics and characters two type of variables. Then, this data set the ‘income group’ have been factorize and ordered variable from an characters attributes.
CPI_GDP$IncomeGroup <- factor(CPI_GDP$IncomeGroup,levels = c("High income", "Upper middle income","Lower
middle income","Low income"), ordered = TRUE)
str(CPI_GDP)
## Classes 'tbl_df', 'tbl' and 'data.frame': 164 obs. of 18 variables:
## $ Country Name : chr "Brazil" "France" "Bulgaria" "Cayman Islands" ...
## $ Country Code.x : int 223 132 918 377 233 156 172 662 960 134 ...
## $ Indicator Name.x : chr "Consumer Price Index, All items" "Consumer Price Index, All items" "Consumer Price Index, All items" "Consumer Price Index, All items" ...
## $ Indicator Code.x : chr "PCPI_IX" "PCPI_IX" "PCPI_IX" "PCPI_IX" ...
## $ Attribute : chr "Value" "Value" "Value" "Value" ...
## $ 2015.x : num 4310 100 6550 102 123 ...
## $ 2016.x : num 4687 100 6498 NA 132 ...
## $ Common Reference Period: chr "1993A" "2015A" "1995A" "2008Q2" ...
## $ CPI_increase_rate_16 : num 1.087 1.002 0.992 NA 1.075 ...
## $ Country Code.y : chr "BRA" "FRA" "BGR" "CYM" ...
## $ Region : chr "Latin America & Caribbean" "Europe & Central Asia" "Europe & Central Asia" "Latin America & Caribbean" ...
## $ IncomeGroup : Ord.factor w/ 4 levels "High income"<..: 2 1 2 1 2 1 1 NA 1 1 ...
## $ SpecialNotes : chr NA "A simple multiplier is used to convert the national currencies of EMU members to euros. The following irrevocab"| __truncated__ NA NA ...
## $ Indicator Name.y : chr "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" ...
## $ Indicator Code.y : chr "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" ...
## $ 2015.y : num 1.80e+12 2.44e+12 5.02e+10 NA 2.92e+11 ...
## $ 2016.y : num 1.79e+12 2.47e+12 5.32e+10 NA 2.80e+11 ...
## $ GDP_increase_rate_16 : num 0.995 1.011 1.061 NA 0.961 ...
This data set is untidy as some column headers are values not variable names. In this step, the gather function will be used to gather those years column in one column.
CPI_GDP1 <- CPI_GDP %>% gather('2015.x','2016.x', key = "year", value = "CPI")
CPI_GDP1 <- CPI_GDP1 %>% gather('2015.y','2016.y', key = "year", value = "GDP")
This two new variable are the changing ratio between 2015 and 2016. In addition, there are a new variable which is created base on those two new variables, it is a ratio between them. As a result, it may give a insight of relationship between GDP and CPI.
## CPI <- mutate(CPI, CPI_increase_rate_16 = CPI$`2016` / CPI$`2015`)
## Income_GDP <- mutate(Income_GDP, GDP_increase_rate_16 = Income_GDP$`2016` / Income_GDP$`2015`)
CPI_GDP1 <- mutate(CPI_GDP1, GDP_CPI_increase_ratio = CPI_GDP1$GDP_increase_rate_16 / CPI_GDP1$CPI_increase_rate_16)
In this part, the NA have been imputed by using Hmisc package’s impute() function. Meanwhile, there are some of rules have been setted and give a result with no violation have been detected.
which(is.na(CPI_GDP1))
## [1] 3940 3950 3953 3957 3966 3967 3970 3981 3982 3983 3987
## [12] 3988 3990 3992 3993 3994 3995 3997 3998 4000 4017 4019
## [23] 4023 4029 4038 4040 4041 4048 4050 4056 4057 4079 4080
## [34] 4084 4087 4088 4089 4093 4094 4095 4098 4104 4114 4117
## [45] 4121 4130 4131 4134 4145 4146 4147 4151 4152 4154 4156
## [56] 4157 4158 4159 4161 4162 4164 4181 4183 4187 4193 4202
## [67] 4204 4205 4212 4214 4220 4221 4243 4244 4248 4251 4252
## [78] 4253 4257 4258 4259 4262 4268 4278 4281 4285 4294 4295
## [89] 4298 4309 4310 4311 4315 4316 4318 4320 4321 4322 4323
## [100] 4325 4326 4328 4345 4347 4351 4357 4366 4368 4369 4376
## [111] 4378 4384 4385 4407 4408 4412 4415 4416 4417 4421 4422
## [122] 4423 4426 4432 4442 4445 4449 4458 4459 4462 4473 4474
## [133] 4475 4479 4480 4482 4484 4485 4486 4487 4489 4490 4492
## [144] 4509 4511 4515 4521 4530 4532 4533 4540 4542 4548 4549
## [155] 4571 4572 4576 4579 4580 4581 4585 4586 4587 4590 4604
## [166] 4607 4620 4625 4627 4631 4635 4637 4651 4655 4688 4691
## [177] 4713 4718 4738 4748 4752 4768 4771 4784 4789 4791 4795
## [188] 4799 4801 4815 4819 4852 4855 4877 4882 4902 4912 4916
## [199] 4932 4935 4948 4953 4955 4959 4963 4965 4979 4983 5016
## [210] 5019 5041 5046 5066 5076 5080 5096 5099 5112 5117 5119
## [221] 5123 5127 5129 5143 5147 5180 5183 5205 5210 5230 5240
## [232] 5244 5260 5263 5276 5281 5283 5287 5291 5293 5307 5311
## [243] 5344 5347 5369 5374 5394 5404 5408 5424 5427 5440 5445
## [254] 5447 5451 5455 5457 5471 5475 5508 5511 5533 5538 5558
## [265] 5568 5572 5588 5591 5604 5609 5611 5615 5619 5621 5635
## [276] 5639 5672 5675 5697 5702 5722 5732 5736 5752 5755 5768
## [287] 5773 5775 5779 5783 5785 5799 5803 5836 5839 5861 5866
## [298] 5886 5896 5900 5912 5916 5919 5921 5932 5937 5939 5943
## [309] 5945 5946 5947 5949 5954 5957 5960 5963 5965 5967 5968
## [320] 5969 5977 5979 5982 5983 5998 6000 6002 6003 6005 6006
## [331] 6008 6011 6014 6015 6016 6019 6023 6024 6025 6030 6032
## [342] 6043 6047 6050 6055 6058 6059 6060 6064 6065 6066 6067
## [353] 6076 6080 6083 6085 6096 6101 6103 6107 6109 6110 6111
## [364] 6113 6118 6121 6124 6127 6129 6131 6132 6133 6141 6143
## [375] 6146 6147 6162 6164 6166 6167 6169 6170 6172 6175 6178
## [386] 6179 6180 6183 6187 6188 6189 6194 6196 6207 6211 6214
## [397] 6219 6222 6223 6224 6228 6229 6230 6231 6240 6244 6247
## [408] 6249 6260 6265 6267 6271 6273 6274 6275 6277 6282 6285
## [419] 6288 6291 6293 6295 6296 6297 6305 6307 6310 6311 6326
## [430] 6328 6330 6331 6333 6334 6336 6339 6342 6343 6344 6347
## [441] 6351 6352 6353 6358 6360 6371 6375 6378 6383 6386 6387
## [452] 6388 6392 6393 6394 6395 6404 6408 6411 6413 6424 6429
## [463] 6431 6435 6437 6438 6439 6441 6446 6449 6452 6455 6457
## [474] 6459 6460 6461 6469 6471 6474 6475 6490 6492 6494 6495
## [485] 6497 6498 6500 6503 6506 6507 6508 6511 6515 6516 6517
## [496] 6522 6524 6535 6539 6542 6547 6550 6551 6552 6556 6557
## [507] 6558 6559 6561 6563 6564 6565 6568 6569 6571 6572 6575
## [518] 6579 6581 6585 6587 6588 6593 6594 6595 6599 6601 6602
## [529] 6603 6605 6606 6607 6609 6610 6611 6612 6613 6614 6615
## [540] 6616 6617 6618 6619 6621 6622 6623 6625 6626 6627 6628
## [551] 6633 6634 6636 6637 6640 6645 6652 6656 6659 6660 6661
## [562] 6666 6668 6670 6671 6673 6675 6678 6679 6681 6683 6684
## [573] 6686 6688 6690 6692 6693 6698 6703 6704 6706 6707 6709
## [584] 6713 6715 6716 6717 6719 6720 6722 6723 6725 6727 6728
## [595] 6729 6732 6733 6735 6736 6739 6743 6745 6749 6751 6752
## [606] 6757 6758 6759 6763 6765 6766 6767 6769 6770 6771 6773
## [617] 6774 6775 6776 6777 6778 6779 6780 6781 6782 6783 6785
## [628] 6786 6787 6789 6790 6791 6792 6797 6798 6800 6801 6804
## [639] 6809 6816 6820 6823 6824 6825 6830 6832 6834 6835 6837
## [650] 6839 6842 6843 6845 6847 6848 6850 6852 6854 6856 6857
## [661] 6862 6867 6868 6870 6871 6873 6877 6879 6880 6881 6883
## [672] 6884 6886 6887 6889 6891 6892 6893 6896 6897 6899 6900
## [683] 6903 6907 6909 6913 6915 6916 6921 6922 6923 6927 6929
## [694] 6930 6931 6933 6934 6935 6937 6938 6939 6940 6941 6942
## [705] 6943 6944 6945 6946 6947 6949 6950 6951 6953 6954 6955
## [716] 6956 6961 6962 6964 6965 6968 6973 6980 6984 6987 6988
## [727] 6989 6994 6996 6998 6999 7001 7003 7006 7007 7009 7011
## [738] 7012 7014 7016 7018 7020 7021 7026 7031 7032 7034 7035
## [749] 7037 7041 7043 7044 7045 7047 7048 7050 7051 7053 7055
## [760] 7056 7057 7060 7061 7063 7064 7067 7071 7073 7077 7079
## [771] 7080 7085 7086 7087 7091 7093 7094 7095 7097 7098 7099
## [782] 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111
## [793] 7113 7114 7115 7117 7118 7119 7120 7125 7126 7128 7129
## [804] 7132 7137 7144 7148 7151 7152 7153 7158 7160 7162 7163
## [815] 7165 7167 7170 7171 7173 7175 7176 7178 7180 7182 7184
## [826] 7185 7190 7195 7196 7198 7199 7201 7205 7207 7208 7209
## [837] 7211 7212 7214 7215 7228 7231 7244 7249 7251 7255 7259
## [848] 7261 7275 7279 7312 7315 7337 7342 7362 7372 7376 7392
## [859] 7395 7408 7413 7415 7419 7423 7425 7439 7443 7476 7479
## [870] 7501 7506 7526 7536 7540 7556 7559 7572 7577 7579 7583
## [881] 7587 7589 7603 7607 7640 7643 7665 7670 7690 7700 7704
## [892] 7720 7723 7736 7741 7743 7747 7751 7753 7767 7771 7804
## [903] 7807 7829 7834 7854 7864 7868 7884 7887 7900 7905 7907
## [914] 7911 7915 7917 7931 7935 7968 7971 7993 7998 8018 8028
## [925] 8032 8048 8051 8064 8069 8071 8075 8079 8081 8095 8099
## [936] 8132 8135 8157 8162 8182 8192 8196 8212 8215 8228 8233
## [947] 8235 8239 8243 8245 8259 8263 8296 8299 8321 8326 8346
## [958] 8356 8360 8376 8379 8392 8397 8399 8403 8407 8409 8423
## [969] 8427 8460 8463 8485 8490 8510 8520 8524 8532 8540 8543
## [980] 8556 8561 8563 8567 8571 8573 8587 8588 8591 8624 8627
## [991] 8649 8654 8674 8684 8688 8696 8704 8707 8720 8725 8727
## [1002] 8731 8735 8737 8751 8752 8755 8788 8791 8813 8818 8838
## [1013] 8848 8852 8860 8868 8871 8884 8889 8891 8895 8899 8901
## [1024] 8915 8916 8919 8952 8955 8977 8982 9002 9012 9016 9024
## [1035] 9032 9035 9048 9053 9055 9059 9063 9065 9079 9080 9083
## [1046] 9116 9119 9141 9146 9166 9176 9180 9198 9205 9214 9215
## [1057] 9218 9236 9238 9240 9245 9246 9248 9265 9271 9286 9289
## [1068] 9296 9298 9305 9336 9337 9352 9362 9365 9369 9378 9379
## [1079] 9382 9393 9394 9395 9399 9400 9405 9406 9407 9409 9410
## [1090] 9429 9431 9441 9452 9453 9468 9491 9492 9496 9499 9501
## [1101] 9505 9506 9507 9510 9526 9533 9542 9543 9546 9564 9566
## [1112] 9568 9573 9574 9576 9593 9599 9614 9617 9624 9626 9633
## [1123] 9664 9665 9680 9690 9693 9697 9706 9707 9710 9721 9722
## [1134] 9723 9727 9728 9733 9734 9735 9737 9738 9757 9759 9769
## [1145] 9780 9781 9796 9819 9820 9824 9827 9829 9833 9834 9835
## [1156] 9838 10500 10508 10511 10524 10529 10531 10535 10539 10541 10555
## [1167] 10556 10559 10592 10595 10617 10622 10642 10652 10656 10664 10672
## [1178] 10675 10688 10693 10695 10699 10703 10705 10719 10720 10723 10756
## [1189] 10759 10781 10786 10806 10816 10820 10828 10836 10839 10852 10857
## [1200] 10859 10863 10867 10869 10883 10884 10887 10920 10923 10945 10950
## [1211] 10970 10980 10984 10992 11000 11003 11016 11021 11023 11027 11031
## [1222] 11033 11047 11048 11051 11084 11087 11109 11114 11134 11144 11148
## [1233] 11156 11164 11166 11167 11169 11173 11180 11182 11183 11185 11186
## [1244] 11187 11191 11195 11197 11198 11199 11203 11204 11206 11208 11209
## [1255] 11210 11211 11212 11213 11214 11215 11216 11233 11235 11239 11245
## [1266] 11248 11251 11254 11256 11257 11264 11266 11272 11273 11278 11295
## [1277] 11296 11298 11300 11303 11304 11305 11308 11309 11310 11311 11312
## [1288] 11314 11320 11328 11330 11331 11333 11337 11344 11346 11347 11349
## [1299] 11350 11351 11355 11359 11361 11362 11363 11367 11368 11370 11372
## [1310] 11373 11374 11375 11376 11377 11378 11379 11380 11397 11399 11403
## [1321] 11409 11412 11415 11418 11420 11421 11428 11430 11436 11437 11442
## [1332] 11459 11460 11462 11464 11467 11468 11469 11472 11473 11474 11475
## [1343] 11476 11478 11484 11492 11494 11495 11497 11501 11508 11510 11511
## [1354] 11513 11514 11515 11519 11523 11525 11526 11527 11531 11532 11534
## [1365] 11536 11537 11538 11539 11540 11541 11542 11543 11544 11561 11563
## [1376] 11567 11573 11576 11579 11582 11584 11585 11592 11594 11600 11601
## [1387] 11606 11623 11624 11626 11628 11631 11632 11633 11636 11637 11638
## [1398] 11639 11640 11642 11648 11656 11658 11659 11661 11665 11672 11674
## [1409] 11675 11677 11678 11679 11683 11687 11689 11690 11691 11695 11696
## [1420] 11698 11700 11701 11702 11703 11704 11705 11706 11707 11708 11725
## [1431] 11727 11731 11737 11740 11743 11746 11748 11749 11756 11758 11764
## [1442] 11765 11770 11787 11788 11790 11792 11795 11796 11797 11800 11801
## [1453] 11802 11803 11804 11806
CPI_GDP1$CPI_increase_rate_16 <- impute(CPI_GDP1$CPI_increase_rate_16, fun = mean)
CPI_GDP1$GDP_increase_rate_16 <- impute(CPI_GDP1$GDP_increase_rate_16, fun = mean)
CPI_GDP1$CPI <- impute(CPI_GDP1$CPI,fun = mean)
CPI_GDP1$GDP <- impute(CPI_GDP1$GDP,fun = mean)
(Rule1 <- editset(c("GDP > 0","CPI > 0")))
##
## Edit set:
## num1 : 0 < GDP
## num2 : 0 < CPI
violated <- violatedEdits(Rule1,CPI_GDP1)
summary(violated)
## No violations detected, 0 checks evaluated to NA
## NULL
In this step, the data set required to convert to numeric format. because those two columns have been imputed and the class of original two column have been changed. Meanwhile, the number of these two variables have been round to 2 in order to give a good presentation on the box plot. Additionally, the outliers have been removed by using Z scorece to measure.
CPI_GDP1$CPI_increase_rate_16 <- CPI_GDP1$CPI_increase_rate_16 %>% round(2)
CPI_GDP1$GDP_increase_rate_16 <- CPI_GDP1$GDP_increase_rate_16 %>% round(2)
CPI_GDP1$CPI_increase_rate_16 <- as.numeric(CPI_GDP1$CPI_increase_rate_16)
CPI_GDP1$GDP_increase_rate_16 <- as.numeric(CPI_GDP1$GDP_increase_rate_16)
CPI_GDP1$GDP_CPI_increase_ratio %>% boxplot(main = "Box plot of ratio of GDP and CPI", ylab = "ratio of GDP and CPI")
CPI_GDP1$CPI_increase_rate_16 %>% boxplot(main = "Box plot of CPI increasing rate", ylab = "CPI", col = "grey")
CPI_GDP1$GDP_increase_rate_16 %>% boxplot(main = "Box plot of GDP increasing rate", ylab = "GDP", col = "grey")
z.scores <- CPI_GDP1$CPI_increase_rate_16 %>% scores(type = "z") ##calculate the Z score for this variable
z.scores %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.40027 -0.38041 -0.01557 0.00000 0.16685 3.63283
CPI_GDP1 <- CPI_GDP1[- which(abs(z.scores) > 3),] ## remove the obserration which abosulute z score of CPI increase rate is over 3
z.scores2 <- CPI_GDP1$GDP_increase_rate_16 %>% scores(type = "z")
z.scores2 %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.74295 -0.27444 0.07841 0.00000 0.43127 2.19554
CPI_GDP1 <- CPI_GDP1[- which(abs(z.scores2) >3), ]
Originally, the increase rate of GDP is slightly left skewd. After apply the boxcox funciton, it becomes much more summetrical so that it can be used to explore the linner relationship with other variables.
hist(CPI_GDP1$GDP_increase_rate_16)
boxcox_CPI <- BoxCox(CPI_GDP1$GDP_increase_rate_16, lambda = "auto") #reduce skewness
hist(boxcox_CPI)