MATH2349 Semester 2, 2018

Required packages

library(dplyr)
library(tidyr)
library(readr)
library(Hmisc)
library(editrules)
library(MVN)
library(outliers)
library(forecast)

Executive Summary

In this report, there are three data sets which been merged together. These three data sets are open source provide at https://data.world/imf/consumer-price-index-cpi and https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?end=2017&start=1960&view=chart. After merging the data set, the new data set ‘CPI_GDP’ contains only numerics and characters two type of variables. Then, this data set the ‘incomegroup’ have been factorize and ordered variable from an characters attributes as required. The new data set in tidy format by using the tidyr and dplyr packages. In addition, there are two new variables which the ratio between 2015 and 2016 have been added by using mutate function and one variable is created base on these two variable, which is the ratio of GPD and CPI. The rate of changing in GDP & CPI have been sacnned by using the outliers package and editrules packages. It is applied editrules that both GDP and CPI are over 0 to constraints the impossible values. Moreover, the outliers in growth rate of GDP and CPI were removed by usiing absolute z-scores. In the last step, due to the left skewness of GDP growth rate, this variable was symmetrization by using box cox transformation from forcast package. As a resulte, it can be used to explore linear relationship with other variable.

Data

There are three data sets which are CPI, GDP and Income. Those of data sets are open sources from website https://data.world/imf/consumer-price-index-cpi and “https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?end=2017&start=1960&view=chart”. The CPI data set provided the data of Consumer Price Index which index measure changes in the prices of goods from different country over 50 years. GDP data set is the Gross Domestic Product for different country over few decades. Meanwhile, the Income data set provided the level of income on each country compare with whole world. These three data sets will be merge by using mutate function and full_join function though their common attributes. In addition, the year of data was selected from 2015 to 2016.

CPI <- read_csv("/Users/hipposon/Desktop/Master of data science/Data prepossessing/dataset/Consumer Price Index (CPI)_CPI_03-23-2017 15-51-51-76_timeSeries.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Country Code` = col_integer()
## )

## See spec(...) for full column specifications.

GDP <- read_csv("/Users/hipposon/Desktop/Master of data science/Data prepossessing/dataset/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10134290/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10134290.csv", skip = 3)

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   X63 = col_character()
## )
## See spec(...) for full column specifications.

IncomeGroup <- read_csv("/Users/hipposon/Desktop/Master of data science/Data prepossessing/dataset/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10134290/Metadata_Country_API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10134290.csv")

## Parsed with column specification:
## cols(
##   `Country Code` = col_character(),
##   Region = col_character(),
##   IncomeGroup = col_character(),
##   SpecialNotes = col_character(),
##   TableName = col_character(),
##   X6 = col_character()
## )

Income_GDP <- left_join(IncomeGroup, GDP, by = "Country Code")   ## The GDP dataset join the income grouping dataset
Income_GDP <- Income_GDP[-c(5,6)]
Income_GDP <- Income_GDP[-c(8:62)]    ## remove the historical data.
Income_GDP <- Income_GDP[-c(10:11)]

CPI <- CPI[-c(6:1110)]   ##remove 1960-2014 because they are too old. 
CPI <- CPI[-(7:22)]
CPI <- CPI[-c(8:25, 27)] ##remove 2017, because it has not completed yet.
CPI %>% group_by(`Indicator Name`) %>% summarise(count = n()) %>% arrange(desc(count))

try <- CPI %>% filter(`Indicator Name` == "Consumer Price Index, All items")  ##Only keep the CPI, all items indicator
CPI <- try[!duplicated(try$`Country Name`),]   ##remove duplicated rows for each unique country

CPI$`2015` <- as.numeric(CPI$`2015`)   
CPI$`2016` <- as.numeric(CPI$`2016`)
Income_GDP$`2015` <- as.numeric(Income_GDP$`2015`)
Income_GDP$`2016` <- as.numeric(Income_GDP$`2016`)
CPI <- mutate(CPI, CPI_increase_rate_16 = CPI$`2016` / CPI$`2015`)   ## create a new variable CPI increase rate base on 2015 and 2016 date
Income_GDP <- mutate(Income_GDP, GDP_increase_rate_16 = Income_GDP$`2016` / Income_GDP$`2015`)    ## create a new variable GDP increase rate base on 2015 and 2016 date
CPI_GDP <- left_join(CPI, Income_GDP, by = "Country Name")    # Merge them again

head(CPI_GDP)

Understand

After merging the data set, the new data set ‘CPI_GDP’ contains only numerics and characters two type of variables. Then, this data set the ‘income group’ have been factorize and ordered variable from an characters attributes.

CPI_GDP$IncomeGroup <- factor(CPI_GDP$IncomeGroup,levels = c("High income", "Upper middle income","Lower
                                                                   middle income","Low income"), ordered = TRUE)
str(CPI_GDP)

## Classes 'tbl_df', 'tbl' and 'data.frame':    164 obs. of  18 variables:
##  $ Country Name           : chr  "Brazil" "France" "Bulgaria" "Cayman Islands" ...
##  $ Country Code.x         : int  223 132 918 377 233 156 172 662 960 134 ...
##  $ Indicator Name.x       : chr  "Consumer Price Index, All items" "Consumer Price Index, All items" "Consumer Price Index, All items" "Consumer Price Index, All items" ...
##  $ Indicator Code.x       : chr  "PCPI_IX" "PCPI_IX" "PCPI_IX" "PCPI_IX" ...
##  $ Attribute              : chr  "Value" "Value" "Value" "Value" ...
##  $ 2015.x                 : num  4310 100 6550 102 123 ...
##  $ 2016.x                 : num  4687 100 6498 NA 132 ...
##  $ Common Reference Period: chr  "1993A" "2015A" "1995A" "2008Q2" ...
##  $ CPI_increase_rate_16   : num  1.087 1.002 0.992 NA 1.075 ...
##  $ Country Code.y         : chr  "BRA" "FRA" "BGR" "CYM" ...
##  $ Region                 : chr  "Latin America & Caribbean" "Europe & Central Asia" "Europe & Central Asia" "Latin America & Caribbean" ...
##  $ IncomeGroup            : Ord.factor w/ 4 levels "High income"<..: 2 1 2 1 2 1 1 NA 1 1 ...
##  $ SpecialNotes           : chr  NA "A simple multiplier is used to convert the national currencies of EMU members to euros. The following irrevocab"| __truncated__ NA NA ...
##  $ Indicator Name.y       : chr  "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" ...
##  $ Indicator Code.y       : chr  "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" ...
##  $ 2015.y                 : num  1.80e+12 2.44e+12 5.02e+10 NA 2.92e+11 ...
##  $ 2016.y                 : num  1.79e+12 2.47e+12 5.32e+10 NA 2.80e+11 ...
##  $ GDP_increase_rate_16   : num  0.995 1.011 1.061 NA 0.961 ...

Tidy & Manipulate Data I

This data set is untidy as some column headers are values not variable names. In this step, the gather function will be used to gather those years column in one column.

CPI_GDP1 <- CPI_GDP %>% gather('2015.x','2016.x', key = "year", value = "CPI")
CPI_GDP1 <- CPI_GDP1 %>% gather('2015.y','2016.y', key = "year", value = "GDP")

Tidy & Manipulate Data II

This two new variable are the changing ratio between 2015 and 2016. In addition, there are a new variable which is created base on those two new variables, it is a ratio between them. As a result, it may give a insight of relationship between GDP and CPI.

## CPI <- mutate(CPI, CPI_increase_rate_16 = CPI$`2016` / CPI$`2015`) 
## Income_GDP <- mutate(Income_GDP, GDP_increase_rate_16 = Income_GDP$`2016` / Income_GDP$`2015`)
CPI_GDP1 <- mutate(CPI_GDP1, GDP_CPI_increase_ratio = CPI_GDP1$GDP_increase_rate_16 / CPI_GDP1$CPI_increase_rate_16)

Scan I

In this part, the NA have been imputed by using Hmisc package’s impute() function. Meanwhile, there are some of rules have been setted and give a result with no violation have been detected.

which(is.na(CPI_GDP1))

##    [1]  3940  3950  3953  3957  3966  3967  3970  3981  3982  3983  3987
##   [12]  3988  3990  3992  3993  3994  3995  3997  3998  4000  4017  4019
##   [23]  4023  4029  4038  4040  4041  4048  4050  4056  4057  4079  4080
##   [34]  4084  4087  4088  4089  4093  4094  4095  4098  4104  4114  4117
##   [45]  4121  4130  4131  4134  4145  4146  4147  4151  4152  4154  4156
##   [56]  4157  4158  4159  4161  4162  4164  4181  4183  4187  4193  4202
##   [67]  4204  4205  4212  4214  4220  4221  4243  4244  4248  4251  4252
##   [78]  4253  4257  4258  4259  4262  4268  4278  4281  4285  4294  4295
##   [89]  4298  4309  4310  4311  4315  4316  4318  4320  4321  4322  4323
##  [100]  4325  4326  4328  4345  4347  4351  4357  4366  4368  4369  4376
##  [111]  4378  4384  4385  4407  4408  4412  4415  4416  4417  4421  4422
##  [122]  4423  4426  4432  4442  4445  4449  4458  4459  4462  4473  4474
##  [133]  4475  4479  4480  4482  4484  4485  4486  4487  4489  4490  4492
##  [144]  4509  4511  4515  4521  4530  4532  4533  4540  4542  4548  4549
##  [155]  4571  4572  4576  4579  4580  4581  4585  4586  4587  4590  4604
##  [166]  4607  4620  4625  4627  4631  4635  4637  4651  4655  4688  4691
##  [177]  4713  4718  4738  4748  4752  4768  4771  4784  4789  4791  4795
##  [188]  4799  4801  4815  4819  4852  4855  4877  4882  4902  4912  4916
##  [199]  4932  4935  4948  4953  4955  4959  4963  4965  4979  4983  5016
##  [210]  5019  5041  5046  5066  5076  5080  5096  5099  5112  5117  5119
##  [221]  5123  5127  5129  5143  5147  5180  5183  5205  5210  5230  5240
##  [232]  5244  5260  5263  5276  5281  5283  5287  5291  5293  5307  5311
##  [243]  5344  5347  5369  5374  5394  5404  5408  5424  5427  5440  5445
##  [254]  5447  5451  5455  5457  5471  5475  5508  5511  5533  5538  5558
##  [265]  5568  5572  5588  5591  5604  5609  5611  5615  5619  5621  5635
##  [276]  5639  5672  5675  5697  5702  5722  5732  5736  5752  5755  5768
##  [287]  5773  5775  5779  5783  5785  5799  5803  5836  5839  5861  5866
##  [298]  5886  5896  5900  5912  5916  5919  5921  5932  5937  5939  5943
##  [309]  5945  5946  5947  5949  5954  5957  5960  5963  5965  5967  5968
##  [320]  5969  5977  5979  5982  5983  5998  6000  6002  6003  6005  6006
##  [331]  6008  6011  6014  6015  6016  6019  6023  6024  6025  6030  6032
##  [342]  6043  6047  6050  6055  6058  6059  6060  6064  6065  6066  6067
##  [353]  6076  6080  6083  6085  6096  6101  6103  6107  6109  6110  6111
##  [364]  6113  6118  6121  6124  6127  6129  6131  6132  6133  6141  6143
##  [375]  6146  6147  6162  6164  6166  6167  6169  6170  6172  6175  6178
##  [386]  6179  6180  6183  6187  6188  6189  6194  6196  6207  6211  6214
##  [397]  6219  6222  6223  6224  6228  6229  6230  6231  6240  6244  6247
##  [408]  6249  6260  6265  6267  6271  6273  6274  6275  6277  6282  6285
##  [419]  6288  6291  6293  6295  6296  6297  6305  6307  6310  6311  6326
##  [430]  6328  6330  6331  6333  6334  6336  6339  6342  6343  6344  6347
##  [441]  6351  6352  6353  6358  6360  6371  6375  6378  6383  6386  6387
##  [452]  6388  6392  6393  6394  6395  6404  6408  6411  6413  6424  6429
##  [463]  6431  6435  6437  6438  6439  6441  6446  6449  6452  6455  6457
##  [474]  6459  6460  6461  6469  6471  6474  6475  6490  6492  6494  6495
##  [485]  6497  6498  6500  6503  6506  6507  6508  6511  6515  6516  6517
##  [496]  6522  6524  6535  6539  6542  6547  6550  6551  6552  6556  6557
##  [507]  6558  6559  6561  6563  6564  6565  6568  6569  6571  6572  6575
##  [518]  6579  6581  6585  6587  6588  6593  6594  6595  6599  6601  6602
##  [529]  6603  6605  6606  6607  6609  6610  6611  6612  6613  6614  6615
##  [540]  6616  6617  6618  6619  6621  6622  6623  6625  6626  6627  6628
##  [551]  6633  6634  6636  6637  6640  6645  6652  6656  6659  6660  6661
##  [562]  6666  6668  6670  6671  6673  6675  6678  6679  6681  6683  6684
##  [573]  6686  6688  6690  6692  6693  6698  6703  6704  6706  6707  6709
##  [584]  6713  6715  6716  6717  6719  6720  6722  6723  6725  6727  6728
##  [595]  6729  6732  6733  6735  6736  6739  6743  6745  6749  6751  6752
##  [606]  6757  6758  6759  6763  6765  6766  6767  6769  6770  6771  6773
##  [617]  6774  6775  6776  6777  6778  6779  6780  6781  6782  6783  6785
##  [628]  6786  6787  6789  6790  6791  6792  6797  6798  6800  6801  6804
##  [639]  6809  6816  6820  6823  6824  6825  6830  6832  6834  6835  6837
##  [650]  6839  6842  6843  6845  6847  6848  6850  6852  6854  6856  6857
##  [661]  6862  6867  6868  6870  6871  6873  6877  6879  6880  6881  6883
##  [672]  6884  6886  6887  6889  6891  6892  6893  6896  6897  6899  6900
##  [683]  6903  6907  6909  6913  6915  6916  6921  6922  6923  6927  6929
##  [694]  6930  6931  6933  6934  6935  6937  6938  6939  6940  6941  6942
##  [705]  6943  6944  6945  6946  6947  6949  6950  6951  6953  6954  6955
##  [716]  6956  6961  6962  6964  6965  6968  6973  6980  6984  6987  6988
##  [727]  6989  6994  6996  6998  6999  7001  7003  7006  7007  7009  7011
##  [738]  7012  7014  7016  7018  7020  7021  7026  7031  7032  7034  7035
##  [749]  7037  7041  7043  7044  7045  7047  7048  7050  7051  7053  7055
##  [760]  7056  7057  7060  7061  7063  7064  7067  7071  7073  7077  7079
##  [771]  7080  7085  7086  7087  7091  7093  7094  7095  7097  7098  7099
##  [782]  7101  7102  7103  7104  7105  7106  7107  7108  7109  7110  7111
##  [793]  7113  7114  7115  7117  7118  7119  7120  7125  7126  7128  7129
##  [804]  7132  7137  7144  7148  7151  7152  7153  7158  7160  7162  7163
##  [815]  7165  7167  7170  7171  7173  7175  7176  7178  7180  7182  7184
##  [826]  7185  7190  7195  7196  7198  7199  7201  7205  7207  7208  7209
##  [837]  7211  7212  7214  7215  7228  7231  7244  7249  7251  7255  7259
##  [848]  7261  7275  7279  7312  7315  7337  7342  7362  7372  7376  7392
##  [859]  7395  7408  7413  7415  7419  7423  7425  7439  7443  7476  7479
##  [870]  7501  7506  7526  7536  7540  7556  7559  7572  7577  7579  7583
##  [881]  7587  7589  7603  7607  7640  7643  7665  7670  7690  7700  7704
##  [892]  7720  7723  7736  7741  7743  7747  7751  7753  7767  7771  7804
##  [903]  7807  7829  7834  7854  7864  7868  7884  7887  7900  7905  7907
##  [914]  7911  7915  7917  7931  7935  7968  7971  7993  7998  8018  8028
##  [925]  8032  8048  8051  8064  8069  8071  8075  8079  8081  8095  8099
##  [936]  8132  8135  8157  8162  8182  8192  8196  8212  8215  8228  8233
##  [947]  8235  8239  8243  8245  8259  8263  8296  8299  8321  8326  8346
##  [958]  8356  8360  8376  8379  8392  8397  8399  8403  8407  8409  8423
##  [969]  8427  8460  8463  8485  8490  8510  8520  8524  8532  8540  8543
##  [980]  8556  8561  8563  8567  8571  8573  8587  8588  8591  8624  8627
##  [991]  8649  8654  8674  8684  8688  8696  8704  8707  8720  8725  8727
## [1002]  8731  8735  8737  8751  8752  8755  8788  8791  8813  8818  8838
## [1013]  8848  8852  8860  8868  8871  8884  8889  8891  8895  8899  8901
## [1024]  8915  8916  8919  8952  8955  8977  8982  9002  9012  9016  9024
## [1035]  9032  9035  9048  9053  9055  9059  9063  9065  9079  9080  9083
## [1046]  9116  9119  9141  9146  9166  9176  9180  9198  9205  9214  9215
## [1057]  9218  9236  9238  9240  9245  9246  9248  9265  9271  9286  9289
## [1068]  9296  9298  9305  9336  9337  9352  9362  9365  9369  9378  9379
## [1079]  9382  9393  9394  9395  9399  9400  9405  9406  9407  9409  9410
## [1090]  9429  9431  9441  9452  9453  9468  9491  9492  9496  9499  9501
## [1101]  9505  9506  9507  9510  9526  9533  9542  9543  9546  9564  9566
## [1112]  9568  9573  9574  9576  9593  9599  9614  9617  9624  9626  9633
## [1123]  9664  9665  9680  9690  9693  9697  9706  9707  9710  9721  9722
## [1134]  9723  9727  9728  9733  9734  9735  9737  9738  9757  9759  9769
## [1145]  9780  9781  9796  9819  9820  9824  9827  9829  9833  9834  9835
## [1156]  9838 10500 10508 10511 10524 10529 10531 10535 10539 10541 10555
## [1167] 10556 10559 10592 10595 10617 10622 10642 10652 10656 10664 10672
## [1178] 10675 10688 10693 10695 10699 10703 10705 10719 10720 10723 10756
## [1189] 10759 10781 10786 10806 10816 10820 10828 10836 10839 10852 10857
## [1200] 10859 10863 10867 10869 10883 10884 10887 10920 10923 10945 10950
## [1211] 10970 10980 10984 10992 11000 11003 11016 11021 11023 11027 11031
## [1222] 11033 11047 11048 11051 11084 11087 11109 11114 11134 11144 11148
## [1233] 11156 11164 11166 11167 11169 11173 11180 11182 11183 11185 11186
## [1244] 11187 11191 11195 11197 11198 11199 11203 11204 11206 11208 11209
## [1255] 11210 11211 11212 11213 11214 11215 11216 11233 11235 11239 11245
## [1266] 11248 11251 11254 11256 11257 11264 11266 11272 11273 11278 11295
## [1277] 11296 11298 11300 11303 11304 11305 11308 11309 11310 11311 11312
## [1288] 11314 11320 11328 11330 11331 11333 11337 11344 11346 11347 11349
## [1299] 11350 11351 11355 11359 11361 11362 11363 11367 11368 11370 11372
## [1310] 11373 11374 11375 11376 11377 11378 11379 11380 11397 11399 11403
## [1321] 11409 11412 11415 11418 11420 11421 11428 11430 11436 11437 11442
## [1332] 11459 11460 11462 11464 11467 11468 11469 11472 11473 11474 11475
## [1343] 11476 11478 11484 11492 11494 11495 11497 11501 11508 11510 11511
## [1354] 11513 11514 11515 11519 11523 11525 11526 11527 11531 11532 11534
## [1365] 11536 11537 11538 11539 11540 11541 11542 11543 11544 11561 11563
## [1376] 11567 11573 11576 11579 11582 11584 11585 11592 11594 11600 11601
## [1387] 11606 11623 11624 11626 11628 11631 11632 11633 11636 11637 11638
## [1398] 11639 11640 11642 11648 11656 11658 11659 11661 11665 11672 11674
## [1409] 11675 11677 11678 11679 11683 11687 11689 11690 11691 11695 11696
## [1420] 11698 11700 11701 11702 11703 11704 11705 11706 11707 11708 11725
## [1431] 11727 11731 11737 11740 11743 11746 11748 11749 11756 11758 11764
## [1442] 11765 11770 11787 11788 11790 11792 11795 11796 11797 11800 11801
## [1453] 11802 11803 11804 11806

CPI_GDP1$CPI_increase_rate_16 <- impute(CPI_GDP1$CPI_increase_rate_16, fun = mean)
CPI_GDP1$GDP_increase_rate_16 <- impute(CPI_GDP1$GDP_increase_rate_16, fun = mean)
CPI_GDP1$CPI <- impute(CPI_GDP1$CPI,fun = mean)
CPI_GDP1$GDP <- impute(CPI_GDP1$GDP,fun = mean)

(Rule1 <- editset(c("GDP > 0","CPI > 0")))

## 
## Edit set:
## num1 : 0 < GDP
## num2 : 0 < CPI

violated <- violatedEdits(Rule1,CPI_GDP1)
summary(violated)

## No violations detected, 0 checks evaluated to NA

## NULL

Scan II

In this step, the data set required to convert to numeric format. because those two columns have been imputed and the class of original two column have been changed. Meanwhile, the number of these two variables have been round to 2 in order to give a good presentation on the box plot. Additionally, the outliers have been removed by using Z scorece to measure.

CPI_GDP1$CPI_increase_rate_16 <- CPI_GDP1$CPI_increase_rate_16 %>% round(2)
CPI_GDP1$GDP_increase_rate_16 <- CPI_GDP1$GDP_increase_rate_16 %>% round(2)
CPI_GDP1$CPI_increase_rate_16 <- as.numeric(CPI_GDP1$CPI_increase_rate_16)
CPI_GDP1$GDP_increase_rate_16 <- as.numeric(CPI_GDP1$GDP_increase_rate_16)

CPI_GDP1$GDP_CPI_increase_ratio %>% boxplot(main = "Box plot of ratio of GDP and CPI", ylab = "ratio of GDP and CPI")

CPI_GDP1$CPI_increase_rate_16 %>% boxplot(main = "Box plot of CPI increasing rate", ylab = "CPI", col = "grey")

CPI_GDP1$GDP_increase_rate_16 %>% boxplot(main = "Box plot of GDP increasing rate", ylab = "GDP", col = "grey")

z.scores <- CPI_GDP1$CPI_increase_rate_16 %>% scores(type = "z")   ##calculate the Z score for this variable
z.scores %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -6.40027 -0.38041 -0.01557  0.00000  0.16685  3.63283

CPI_GDP1 <- CPI_GDP1[- which(abs(z.scores) > 3),]   ## remove the obserration which abosulute z score of CPI increase rate is over 3

z.scores2 <- CPI_GDP1$GDP_increase_rate_16 %>% scores(type = "z")
z.scores2 %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -8.74295 -0.27444  0.07841  0.00000  0.43127  2.19554

CPI_GDP1 <- CPI_GDP1[- which(abs(z.scores2) >3), ]

Transform

Originally, the increase rate of GDP is slightly left skewd. After apply the boxcox funciton, it becomes much more summetrical so that it can be used to explore the linner relationship with other variables.

hist(CPI_GDP1$GDP_increase_rate_16)

boxcox_CPI <- BoxCox(CPI_GDP1$GDP_increase_rate_16, lambda = "auto")  #reduce skewness
hist(boxcox_CPI)