Introduction

This competition is about the prediction of Russian house prices. In this dataset, we have house related information as well as the Russian economic data. It is a really good challenge to play with the housing dataset and linking it with a country’s economy. Both of these dataset can be joined together by date column. We have lots of missing values in both dataset. But I will use Xgboost algorithm which will take care of missing values on its own.

This dataset can be used to enhance regression skills, feature engineering, and preprocessing. This data is available at this link.

Import the Dataset

train = read.csv('C:\\Users\\uzair\\Codes\\Russian Market Price\\train.csv', stringsAsFactors = FALSE)
test = read.csv('C:\\Users\\uzair\\Codes\\Russian Market Price\\test.csv', stringsAsFactors = FALSE)
macros = read.csv("C:\\Users\\uzair\\Codes\\Russian Market Price\\macro.csv", stringsAsFactors = FALSE)

View the dataset

head(train)
##   id  timestamp full_sq life_sq floor max_floor material build_year num_room
## 1  1 2011-08-20      43      27     4        NA       NA         NA       NA
## 2  2 2011-08-23      34      19     3        NA       NA         NA       NA
## 3  3 2011-08-27      43      29     2        NA       NA         NA       NA
## 4  4 2011-09-01      89      50     9        NA       NA         NA       NA
## 5  5 2011-09-05      77      77     4        NA       NA         NA       NA
## 6  6 2011-09-06      67      46    14        NA       NA         NA       NA
##   kitch_sq state product_type          sub_area   area_m raion_popul
## 1       NA    NA   Investment          Bibirevo  6407578      155572
## 2       NA    NA   Investment Nagatinskij Zaton  9589337      115352
## 3       NA    NA   Investment    Tekstil'shhiki  4808270      101708
## 4       NA    NA   Investment            Mitino 12583536      178473
## 5       NA    NA   Investment         Basmannoe  8398461      108171
## 6       NA    NA   Investment    Nizhegorodskoe  7506452       43795
##   green_zone_part  indust_part children_preschool preschool_quota
## 1     0.189727117 0.0000699893               9576            5001
## 2     0.372602044 0.0496372570               6880            3119
## 3     0.112559644 0.1185373850               5879            1463
## 4     0.194702869 0.0697533610              13087            6839
## 5     0.015233744 0.0373164520               5706            3240
## 6     0.007670134 0.4862456210               2418             852
##   preschool_education_centers_raion children_school school_quota
## 1                                 5           10309        11065
## 2                                 5            7759         6237
## 3                                 4            6207         5580
## 4                                 9           13670        17063
## 5                                 7            6748         7770
## 6                                 2            2514         2012
##   school_education_centers_raion school_education_centers_top_20_raion
## 1                              5                                     0
## 2                              8                                     0
## 3                              7                                     0
## 4                             10                                     0
## 5                              9                                     0
## 6                              3                                     0
##   hospital_beds_raion healthcare_centers_raion university_top_20_raion
## 1                 240                        1                       0
## 2                 229                        1                       0
## 3                1183                        1                       0
## 4                  NA                        1                       0
## 5                 562                        4                       2
## 6                  NA                        0                       0
##   sport_objects_raion additional_education_raion culture_objects_top_25
## 1                   7                          3                     no
## 2                   6                          1                    yes
## 3                   5                          1                     no
## 4                  17                          6                     no
## 5                  25                          2                     no
## 6                   7                          0                     no
##   culture_objects_top_25_raion shopping_centers_raion office_raion
## 1                            0                     16            1
## 2                            1                      3            0
## 3                            0                      0            1
## 4                            0                     11            4
## 5                            0                     10           93
## 6                            0                      6           19
##   thermal_power_plant_raion incineration_raion oil_chemistry_raion
## 1                        no                 no                  no
## 2                        no                 no                  no
## 3                        no                 no                  no
## 4                        no                 no                  no
## 5                        no                 no                  no
## 6                       yes                 no                  no
##   radiation_raion railroad_terminal_raion big_market_raion
## 1              no                      no               no
## 2              no                      no               no
## 3             yes                      no               no
## 4              no                      no               no
## 5             yes                     yes               no
## 6             yes                      no               no
##   nuclear_reactor_raion detention_facility_raion full_all male_f female_f
## 1                    no                       no    86206  40477    45729
## 2                    no                       no    76284  34200    42084
## 3                    no                       no   101982  46076    55906
## 4                    no                       no    21155   9828    11327
## 5                    no                       no    28179  13522    14657
## 6                    no                       no    19940   9400    10540
##   young_all young_male young_female work_all work_male work_female ekder_all
## 1     21154      11007        10147    98207     52277       45930     36211
## 2     15727       7925         7802    70194     35622       34572     29431
## 3     13028       6835         6193    63388     31813       31575     25292
## 4     28563      14680        13883   120381     60040       60341     29529
## 5     13368       7159         6209    68043     34236       33807     26760
## 6      5291       2744         2547    29660     15793       13867      8844
##   ekder_male ekder_female X0_6_all X0_6_male X0_6_female X7_14_all X7_14_male
## 1      10580        25631     9576      4899        4677     10309       5463
## 2       9266        20165     6880      3466        3414      7759       3909
## 3       7609        17683     5879      3095        2784      6207       3269
## 4       9083        20446    13087      6645        6442     13670       7126
## 5       8563        18197     5706      2982        2724      6748       3664
## 6       2608         6236     2418      1224        1194      2514       1328
##   X7_14_female X0_17_all X0_17_male X0_17_female X16_29_all X16_29_male
## 1         4846     23603      12286        11317      17508        9425
## 2         3850     17700       8998         8702      15164        7571
## 3         2938     14884       7821         7063      19401        9045
## 4         6544     32063      16513        15550       3292        1450
## 5         3084     15237       8113         7124       5164        2583
## 6         1186      5866       3035         2831       4851        2329
##   X16_29_female X0_13_all X0_13_male X0_13_female
## 1          8083     18654       9709         8945
## 2          7593     13729       6929         6800
## 3         10356     11252       5916         5336
## 4          1842     24934      12782        12152
## 5          2581     11631       6223         5408
## 6          2522      4632       2399         2233
##   raion_build_count_with_material_info build_count_block build_count_wood
## 1                                  211                25                0
## 2                                  245                83                1
## 3                                  330                59                0
## 4                                  458                 9               51
## 5                                  746                48                0
## 6                                  188                24                0
##   build_count_frame build_count_brick build_count_monolith build_count_panel
## 1                 0                 0                    2               184
## 2                 0                67                    4                90
## 3                 0               206                    4                60
## 4                12               124                   50               201
## 5                 0               643                   16                35
## 6                 0               147                    2                15
##   build_count_foam build_count_slag build_count_mix
## 1                0                0               0
## 2                0                0               0
## 3                0                1               0
## 4                0                9               2
## 5                0                3               1
## 6                0                0               0
##   raion_build_count_with_builddate_info build_count_before_1920
## 1                                   211                       0
## 2                                   244                       1
## 3                                   330                       1
## 4                                   459                      13
## 5                                   746                     371
## 6                                   188                       0
##   build_count_1921.1945 build_count_1946.1970 build_count_1971.1995
## 1                     0                     0                   206
## 2                     1                   143                    84
## 3                     0                   246                    63
## 4                    24                    40                   130
## 5                   114                   146                    62
## 6                     5                   152                    25
##   build_count_after_1995 ID_metro metro_min_avto metro_km_avto metro_min_walk
## 1                      5        1      2.5902411     1.1312599      13.575119
## 2                     15        2      0.9366997     0.6473368       7.620630
## 3                     20        3      2.1209989     1.6379963      17.351515
## 4                    252        4      1.4890492     0.9845366      11.565624
## 5                     53        5      1.2571865     0.8766202       8.266305
## 6                      6        6      2.7358839     1.5932465      18.378170
##   metro_km_walk kindergarten_km school_km    park_km green_zone_km
## 1     1.1312599      0.14569955 0.1779754 2.15858707    0.60097310
## 2     0.6350525      0.14775427 0.2733453 0.55068974    0.06532116
## 3     1.4459596      0.04910154 0.1580719 0.37484775    0.45317241
## 4     0.9638020      0.17944096 0.2364550 0.07809029    0.10612451
## 5     0.6888588      0.24790121 0.3768381 0.25828877    0.23621405
## 6     1.5315141      0.14595482 0.1134662 1.07349543    1.49790264
##   industrial_km water_treatment_km cemetery_km incineration_km
## 1     1.0809343          23.683460    1.804127        3.633334
## 2     0.9664791           1.317476    4.655004        8.648587
## 3     0.9392751           4.912660    3.381083       11.996480
## 4     0.4511733          15.623710    2.017080       14.317640
## 5     0.3928710          10.683540    2.936581       11.903910
## 6     0.2564875           7.186740    0.780330       14.075140
##   railroad_station_walk_km railroad_station_walk_min ID_railroad_station_walk
## 1                5.4198930                  65.03872                        1
## 2                3.4119931                  40.94392                        2
## 3                1.2776580                  15.33190                        3
## 4                4.2914325                  51.49719                        4
## 5                0.8539601                  10.24752                        5
## 6                0.3753117                   4.50374                        6
##   railroad_station_avto_km railroad_station_avto_min ID_railroad_station_avto
## 1                5.4198930                  6.905893                        1
## 2                3.6417726                  4.679745                        2
## 3                1.2776580                  1.701420                        3
## 4                3.8160446                  5.271136                        4
## 5                1.5958982                  2.156284                      113
## 6                0.3753117                  1.407419                        6
##   public_transport_station_km public_transport_station_min_walk  water_km
## 1                  0.27498514                         3.2998217 0.9926311
## 2                  0.06526334                         0.7831601 0.6980813
## 3                  0.32875604                         3.9450725 0.4682646
## 4                  0.13159696                         1.5791635 1.2003365
## 5                  0.07148032                         0.8577639 0.8202943
## 6                  0.18922715                         2.2707258 0.6124473
##   water_1line   mkad_km     ttk_km  sadovoe_km bulvar_ring_km kremlin_km
## 1          no  1.422391 10.9185867 13.10061764     13.6756570  15.156211
## 2          no  9.503405  3.1039960  6.44433347      8.1326401   8.698054
## 3          no  5.604800  2.9274871  6.96340300      8.0542523   9.067885
## 4          no  2.677824 14.6065008 17.45719794     18.3094331  19.487005
## 5          no 11.616653  1.7218337  0.04680957      0.7875933   2.578671
## 6          no  8.296087  0.2848681  3.51938898      4.3950575   5.645796
##   big_road1_km ID_big_road1 big_road1_1line big_road2_km ID_big_road2
## 1    1.4223914            1              no     3.830951            5
## 2    2.8873766            2              no     3.103996            4
## 3    0.6472498            3              no     2.927487            4
## 4    2.6778243            1              no     2.780449           17
## 5    1.7218337            4              no     3.133531           10
## 6    0.2848681            4              no     1.478529            3
##   railroad_km railroad_1line zd_vokzaly_avto_km ID_railroad_terminal
## 1  1.30515949             no          14.231961                  101
## 2  0.69453573             no           9.242586                   32
## 3  0.70069112             no           9.540544                    5
## 4  1.99926542             no          17.478380                   83
## 5  0.08411254            yes           1.595898                  113
## 6  0.24467041             no           5.070197                    5
##   bus_terminal_avto_km ID_bus_terminal oil_chemistry_km nuclear_reactor_km
## 1            24.292406               1        18.152338           5.718519
## 2             5.706113               2         9.034642           3.489954
## 3             6.710302               3         5.777394           7.506612
## 4             6.734618               1        27.667863           9.522538
## 5             1.423428               4         6.515857           8.671016
## 6             6.682089               4         3.959509           8.757686
##   radiation_km power_transmission_line_km thermal_power_plant_km     ts_km
## 1    1.2100274                   1.062513               5.814135 4.3081270
## 2    2.7242954                   1.246149               3.419574 0.7255604
## 3    0.7722161                   1.602183               3.682455 3.5621877
## 4    6.3487163                   1.767612              11.178333 0.5830250
## 5    1.6383181                   3.632640               4.587917 2.6094196
## 6    0.1931270                   2.341562               1.272894 1.4380034
##   big_market_km market_shop_km fitness_km swim_pool_km ice_rink_km stadium_km
## 1     10.814172      1.6762583  0.4858414     3.065047    1.107594  8.1485908
## 2      6.910568      3.4247161  0.6683637     2.000154    8.972823  6.1270728
## 3      5.752368      1.3754428  0.7331011     1.239304    1.978517  0.7675688
## 4     27.892717      0.8112753  0.6234843     1.950317    6.483172  7.3855207
## 5      9.155057      1.9697377  0.2202877     2.544696    3.975401  3.6107538
## 6      5.374564      3.4478636  0.8104131     1.911843    2.108923  4.2330947
##   basketball_km hospice_morgue_km detention_facility_km public_healthcare_km
## 1     3.5165129          2.392353              4.248036           0.97474284
## 2     1.1615790          2.543747             12.649879           1.47772267
## 3     1.9527706          0.621357              7.682303           0.09714353
## 4     4.9238432          3.549558              8.789894           2.16373516
## 5     0.3079154          1.864637              3.779781           1.12170284
## 6     1.4509749          3.391117              4.356122           1.69872358
##   university_km workplaces_km shopping_centers_km  office_km
## 1     6.7150258     0.8843500           0.6484876 0.63718883
## 2     1.8525602     0.6862517           0.5193113 0.68879632
## 3     0.8412541     1.5100889           1.4865330 1.54304884
## 4    10.9031613     0.6222716           0.5999136 0.93427350
## 5     0.9916826     0.8926675           0.4290521 0.07790096
## 6     3.8300213     1.0422618           0.4407073 0.42235787
##   additional_education_km preschool_km big_church_km church_synagogue_km
## 1               0.9479617    0.1779754     0.6257834           0.6281865
## 2               1.0723151    0.2733453     0.9678206           0.4714465
## 3               0.3919574    0.1580719     3.1787515           0.7559460
## 4               0.8926743    0.2364550     1.0317768           1.5615048
## 5               0.8108015    0.3768381     0.3787558           0.1216806
## 6               3.0662852    0.1134662     0.6869317           0.8704465
##   mosque_km theater_km museum_km exhibition_km catering_km   ecology
## 1  3.932040  14.053047  7.389498      7.023705 0.516838085      good
## 2  4.841544   6.829889  0.709260      2.358840 0.230286910 excellent
## 3  7.922152   4.273200  3.156423      4.958214 0.190461977      poor
## 4 15.300449  16.990677 16.041521      5.029696 0.465820158      good
## 5  2.584370   1.112486  1.800125      1.339652 0.026102416 excellent
## 6  4.787706   3.388810  3.713557      2.553424 0.004469307      poor
##   green_part_500 prom_part_500 office_count_500 office_sqm_500 trc_count_500
## 1           0.00          0.00                0              0             0
## 2          25.14          0.00                0              0             0
## 3           1.67          0.00                0              0             0
## 4          17.36          0.57                0              0             0
## 5           3.56          4.44               15         293699             1
## 6           0.00         19.42                5         227705             3
##   trc_sqm_500 cafe_count_500 cafe_sum_500_min_price_avg
## 1           0              0                         NA
## 2           0              5                     860.00
## 3           0              3                     666.67
## 4           0              2                    1000.00
## 5       45000             48                     702.22
## 6      102000              7                    1000.00
##   cafe_sum_500_max_price_avg cafe_avg_price_500 cafe_count_500_na_price
## 1                         NA                 NA                       0
## 2                    1500.00            1180.00                       0
## 3                    1166.67             916.67                       0
## 4                    1500.00            1250.00                       0
## 5                    1166.67             934.44                       3
## 6                    1625.00            1312.50                       3
##   cafe_count_500_price_500 cafe_count_500_price_1000 cafe_count_500_price_1500
## 1                        0                         0                         0
## 2                        1                         3                         0
## 3                        0                         2                         1
## 4                        0                         0                         2
## 5                       17                        10                        11
## 6                        0                         1                         2
##   cafe_count_500_price_2500 cafe_count_500_price_4000 cafe_count_500_price_high
## 1                         0                         0                         0
## 2                         0                         1                         0
## 3                         0                         0                         0
## 4                         0                         0                         0
## 5                         7                         0                         0
## 6                         1                         0                         0
##   big_church_count_500 church_count_500 mosque_count_500 leisure_count_500
## 1                    0                0                0                 0
## 2                    0                1                0                 0
## 3                    0                0                0                 0
## 4                    0                0                0                 0
## 5                    1                4                0                 2
## 6                    0                0                0                 0
##   sport_count_500 market_count_500 green_part_1000 prom_part_1000
## 1               1                0            7.36           0.00
## 2               0                0           26.66           0.07
## 3               0                0            4.99           0.29
## 4               0                0           19.25          10.35
## 5               3                0            3.34           8.29
## 6               0                0            0.00          40.27
##   office_count_1000 office_sqm_1000 trc_count_1000 trc_sqm_1000 cafe_count_1000
## 1                 1           30500              3        55600              19
## 2                 2           86600              5        94065              13
## 3                 0               0              0            0               9
## 4                 1           11000              6        80780              12
## 5                46          420952              3       158200             153
## 6                10          275135              5       164000               9
##   cafe_sum_1000_min_price_avg cafe_sum_1000_max_price_avg cafe_avg_price_1000
## 1                      527.78                      888.89              708.33
## 2                      615.38                     1076.92              846.15
## 3                      642.86                     1142.86              892.86
## 4                      658.33                     1083.33              870.83
## 5                      763.45                     1272.41             1017.93
## 6                      883.33                     1416.67             1150.00
##   cafe_count_1000_na_price cafe_count_1000_price_500 cafe_count_1000_price_1000
## 1                        1                        10                          4
## 2                        0                         5                          6
## 3                        2                         0                          5
## 4                        0                         3                          4
## 5                        8                        39                         45
## 6                        3                         1                          1
##   cafe_count_1000_price_1500 cafe_count_1000_price_2500
## 1                          3                          1
## 2                          1                          0
## 3                          2                          0
## 4                          5                          0
## 5                         39                         19
## 6                          3                          1
##   cafe_count_1000_price_4000 cafe_count_1000_price_high big_church_count_1000
## 1                          0                          0                     1
## 2                          1                          0                     1
## 3                          0                          0                     0
## 4                          0                          0                     0
## 5                          2                          1                     7
## 6                          0                          0                     3
##   church_count_1000 mosque_count_1000 leisure_count_1000 sport_count_1000
## 1                 2                 0                  0                6
## 2                 2                 0                  4                2
## 3                 1                 0                  0                5
## 4                 0                 0                  0                3
## 5                12                 0                  6                7
## 6                 1                 0                  0                1
##   market_count_1000 green_part_1500 prom_part_1500 office_count_1500
## 1                 1           14.27           6.92                 3
## 2                 0           21.53           7.71                 3
## 3                 3            9.92           6.73                 0
## 4                 1           28.38           6.57                 2
## 5                 0            4.12           4.83                93
## 6                 0            0.00          50.64                18
##   office_sqm_1500 trc_count_1500 trc_sqm_1500 cafe_count_1500
## 1           39554              9       171420              34
## 2          102910              7       127065              17
## 3               0              1         2600              14
## 4           11000              7        89492              23
## 5         1195735              9       445900             272
## 6          431090              6       186400              14
##   cafe_sum_1500_min_price_avg cafe_sum_1500_max_price_avg cafe_avg_price_1500
## 1                      566.67                      969.70              768.18
## 2                      694.12                     1205.88              950.00
## 3                      516.67                      916.67              716.67
## 4                      673.91                     1130.43              902.17
## 5                      766.80                     1272.73             1019.76
## 6                      718.18                     1181.82              950.00
##   cafe_count_1500_na_price cafe_count_1500_price_500 cafe_count_1500_price_1000
## 1                        1                        14                         11
## 2                        0                         6                          7
## 3                        2                         4                          6
## 4                        0                         5                          9
## 5                       19                        70                         74
## 6                        3                         3                          3
##   cafe_count_1500_price_1500 cafe_count_1500_price_2500
## 1                          6                          2
## 2                          1                          2
## 3                          2                          0
## 4                          8                          1
## 5                         72                         30
## 6                          4                          1
##   cafe_count_1500_price_4000 cafe_count_1500_price_high big_church_count_1500
## 1                          0                          0                     1
## 2                          1                          0                     1
## 3                          0                          0                     0
## 4                          0                          0                     1
## 5                          6                          1                    18
## 6                          0                          0                     4
##   church_count_1500 mosque_count_1500 leisure_count_1500 sport_count_1500
## 1                 2                 0                  0                7
## 2                 5                 0                  4                9
## 3                 4                 0                  0                6
## 4                 0                 0                  0                9
## 5                30                 0                 10               14
## 6                 2                 0                  0               11
##   market_count_1500 green_part_2000 prom_part_2000 office_count_2000
## 1                 1           11.77          15.97                 9
## 2                 0           22.37          19.25                 4
## 3                 5           12.99          12.75                 4
## 4                 2           32.29           5.73                 2
## 5                 2            4.53           5.02               149
## 6                 0            0.38          51.58                21
##   office_sqm_2000 trc_count_2000 trc_sqm_2000 cafe_count_2000
## 1          188854             19      1244891              36
## 2          165510              8       179065              21
## 3          100200              7        52550              24
## 4           11000              7        89492              25
## 5         1625130             17       564843             483
## 6          471290             14       683945              33
##   cafe_sum_2000_min_price_avg cafe_sum_2000_max_price_avg cafe_avg_price_2000
## 1                      614.29                     1042.86              828.57
## 2                      695.24                     1190.48              942.86
## 3                      563.64                      977.27              770.45
## 4                      660.00                     1120.00              890.00
## 5                      765.93                     1269.23             1017.58
## 6                      741.38                     1258.62             1000.00
##   cafe_count_2000_na_price cafe_count_2000_price_500 cafe_count_2000_price_1000
## 1                        1                        15                         11
## 2                        0                         7                          8
## 3                        2                         8                          9
## 4                        0                         5                         11
## 5                       28                       130                        129
## 6                        4                         5                         13
##   cafe_count_2000_price_1500 cafe_count_2000_price_2500
## 1                          6                          2
## 2                          3                          2
## 3                          4                          1
## 4                          8                          1
## 5                        131                         50
## 6                          8                          2
##   cafe_count_2000_price_4000 cafe_count_2000_price_high big_church_count_2000
## 1                          1                          0                     1
## 2                          1                          0                     1
## 3                          0                          0                     0
## 4                          0                          0                     1
## 5                         14                          1                    35
## 6                          1                          0                     6
##   church_count_2000 mosque_count_2000 leisure_count_2000 sport_count_2000
## 1                 2                 0                  0               10
## 2                 5                 0                  4               11
## 3                 4                 0                  0                8
## 4                 1                 0                  0               13
## 5                61                 0                 17               21
## 6                 5                 0                  0               21
##   market_count_2000 green_part_3000 prom_part_3000 office_count_3000
## 1                 1           11.98          13.55                12
## 2                 0           18.07          27.32                12
## 3                 5           12.14          26.46                 8
## 4                 2           20.79           3.57                 4
## 5                 3            5.06           8.62               305
## 6                 1            1.82          39.99                54
##   office_sqm_3000 trc_count_3000 trc_sqm_3000 cafe_count_3000
## 1          251554             23      1419204              68
## 2          821986             14       491565              30
## 3          110856              7        52550              41
## 4          167000             12       205756              32
## 5         3420907             60      2296870            1068
## 6         1181009             29      1059171             120
##   cafe_sum_3000_min_price_avg cafe_sum_3000_max_price_avg cafe_avg_price_3000
## 1                      639.68                     1079.37              859.52
## 2                      631.03                     1086.21              858.62
## 3                      697.44                     1192.31              944.87
## 4                      718.75                     1218.75              968.75
## 5                      853.03                     1410.45             1131.74
## 6                      737.96                     1231.48              984.72
##   cafe_count_3000_na_price cafe_count_3000_price_500 cafe_count_3000_price_1000
## 1                        5                        21                         22
## 2                        1                        11                         11
## 3                        2                         9                         17
## 4                        0                         5                         14
## 5                       63                       266                        267
## 6                       12                        24                         37
##   cafe_count_3000_price_1500 cafe_count_3000_price_2500
## 1                         16                          3
## 2                          4                          2
## 3                          9                          3
## 4                         10                          3
## 5                        262                        149
## 6                         35                         11
##   cafe_count_3000_price_4000 cafe_count_3000_price_high big_church_count_3000
## 1                          1                          0                     2
## 2                          1                          0                     1
## 3                          1                          0                     0
## 4                          0                          0                     1
## 5                         57                          4                    70
## 6                          1                          0                    12
##   church_count_3000 mosque_count_3000 leisure_count_3000 sport_count_3000
## 1                 4                 0                  0               21
## 2                 7                 0                  6               19
## 3                11                 0                  0               20
## 4                 2                 0                  0               18
## 5               121                 1                 40               77
## 6                12                 0                  2               31
##   market_count_3000 green_part_5000 prom_part_5000 office_count_5000
## 1                 1           13.09          13.31                29
## 2                 1           10.26          27.47                66
## 3                 6           13.69          21.58                43
## 4                 3           14.18           3.89                 8
## 5                 5            8.38          10.92               689
## 6                 7            5.92          25.79               253
##   office_sqm_5000 trc_count_5000 trc_sqm_5000 cafe_count_5000
## 1          807385             52      4036616             152
## 2         2690465             40      2034942             177
## 3         1478160             35      1572990             122
## 4          244166             22       942180              61
## 5         8404624            114      3503058            2283
## 6         4274339             63      2010320             567
##   cafe_sum_5000_min_price_avg cafe_sum_5000_max_price_avg cafe_avg_price_5000
## 1                      708.57                     1185.71              947.14
## 2                      673.81                     1148.81              911.31
## 3                      702.68                     1196.43              949.55
## 4                      931.58                     1552.63             1242.11
## 5                      853.88                     1411.45             1132.66
## 6                      769.92                     1280.08             1025.00
##   cafe_count_5000_na_price cafe_count_5000_price_500 cafe_count_5000_price_1000
## 1                       12                        39                         48
## 2                        9                        49                         65
## 3                       10                        29                         45
## 4                        4                         7                         21
## 5                      143                       566                        578
## 6                       35                       137                        163
##   cafe_count_5000_price_1500 cafe_count_5000_price_2500
## 1                         40                          9
## 2                         36                         15
## 3                         25                         10
## 4                         15                         11
## 5                        552                        319
## 6                        155                         62
##   cafe_count_5000_price_4000 cafe_count_5000_price_high big_church_count_5000
## 1                          4                          0                    13
## 2                          3                          0                    15
## 3                          3                          0                    11
## 4                          2                          1                     4
## 5                        108                         17                   135
## 6                         14                          1                    53
##   church_count_5000 mosque_count_5000 leisure_count_5000 sport_count_5000
## 1                22                 1                  0               52
## 2                29                 1                 10               66
## 3                27                 0                  4               67
## 4                 4                 0                  0               26
## 5               236                 2                 91              195
## 6                78                 1                 20              113
##   market_count_5000 price_doc
## 1                 4   5850000
## 2                14   6000000
## 3                10   5700000
## 4                 3  13100000
## 5                14  16331452
## 6                17   9100000

Checking Dimensions

dim(train)
## [1] 30471   292

We have 292 columns and 30471 rows in training dataset.

dim(test)
## [1] 7662  291

We have 291 columns and 7662 rows in training dataset.We are not provided with the target column in testing dataset.

Missing Data

nas = colSums(is.na(train))
nas = nas[nas > 0]
nas = sort(nas, decreasing = TRUE)
head(nas)
##        hospital_beds_raion                 build_year 
##                      14441                      13605 
##                      state cafe_sum_500_min_price_avg 
##                      13559                      13281 
## cafe_sum_500_max_price_avg         cafe_avg_price_500 
##                      13281                      13281

We can see that our dataset has some missing values. We will leave it as it is because we will use xgboost which will take care of it.

Data Visualization

Distribution

ggplot(train, aes(x = price_doc)) +
geom_density(fill = 'blue') +
theme_classic() +
xlab("Price") +
ggtitle("Distribution of Price") +
scale_x_continuous(labels = scales::comma)

Our target variable is positively skewed. I will use log transformation to see whether it will fix this problem. Lets visualize the result of log transformation in order to find out whether this method will help or not.

ggplot(train, aes(x = log(price_doc))) +
geom_density(fill = 'blue') +
theme_classic() +
xlab("Price") +
ggtitle("Distribution of Price after log transformation") +
scale_x_continuous(labels = scales::comma)

The distribution of price is closer to normal distribution after log transformation. We will fix it afterwards.

Merging Train and Test

test$price_doc = NA

data = rbind(train, test)
cat("Number of Rows :", dim(data)[1], "\n", "Number of Columns :", dim(data)[2])
## Number of Rows : 38133 
##  Number of Columns : 292

Getting Year, Month and Day from timestamp

I will use famous lubridate package to extract year, month and day as it is way easier than the any other package available in R.

data$Date = ymd(data$timestamp)
data$Year = year(data$Date)
data$Month = month(data$Date)
data$Day = day(data$Date)

Data Transformation

We have a few columns that are actually numeric but they are converted to strings due to the presence of comma “,”. In order to fix this issue, I will remove comma between those numbers and convert them back to numeric.

Joining macros with data

merged_data = left_join(data, macros, by = "timestamp")

merged_data$child_on_acc_pre_school <- as.numeric(gsub(",","",merged_data$child_on_acc_pre_school))
## Warning: NAs introduced by coercion
merged_data$modern_education_share <- as.numeric(gsub(",","",merged_data$modern_education_share))
merged_data$old_education_build_share <- as.numeric(gsub(",","",merged_data$old_education_build_share))

Mode Imputation

As we dont have any information about product type column, I will impute mode in missing values.

merged_data$product_type[is.na(merged_data$product_type)] = names(sort(-table(merged_data$product_type)))[1]

## Double checking nas
any(is.na(merged_data$product_type))
## [1] FALSE

We fixed missing values in product type variable.

Encoding

Converting all the columns with yes and no values into 1 and 0 repectively.

merged_data$culture_objects_top_25 = ifelse(merged_data$culture_objects_top_25 == "yes", 1, 0)
merged_data$thermal_power_plant_raion = ifelse(merged_data$thermal_power_plant_raion == "yes", 1, 0)
merged_data$incineration_raion = ifelse(merged_data$incineration_raion == "yes", 1, 0)
merged_data$oil_chemistry_raion = ifelse(merged_data$oil_chemistry_raion == "yes", 1, 0)
merged_data$radiation_raion = ifelse(merged_data$radiation_raion == "yes", 1, 0)
merged_data$railroad_terminal_raion = ifelse(merged_data$railroad_terminal_raion == "yes", 1, 0)
merged_data$big_market_raion = ifelse(merged_data$big_market_raion == "yes", 1, 0)
merged_data$nuclear_reactor_raion = ifelse(merged_data$nuclear_reactor_raion == "yes", 1, 0)
merged_data$detention_facility_raion = ifelse(merged_data$detention_facility_raion == "yes", 1, 0)
merged_data$water_1line = ifelse(merged_data$water_1line == "yes", 1, 0)
merged_data$railroad_1line = ifelse(merged_data$railroad_1line == "yes", 1, 0)
merged_data$big_road1_1line = ifelse(merged_data$big_road1_1line == "yes", 1, 0)
table(merged_data$ecology)
## 
##    excellent         good      no data         poor satisfactory 
##         5164         8785         9421        10078         4685

Label Encoding

As ecology is an ordinal variable, I will convert this column into numeric.

number <- c('no data' = 0, 'poor' = 1, 'satisfactory' = 2, 'good' = 3, 'excellent' = 4)
merged_data$ecology = as.integer(revalue(merged_data$ecology, number))
table(merged_data$ecology)
## 
##     0     1     2     3     4 
##  9421 10078  4685  8785  5164
## Selecting all numeric columns

final_data = merged_data %>%
select_if(is.numeric)

final_data$price_doc = data$price_doc

Split the final data

I will use log transformation in order to fix skewness in our target variable i.e. Price_doc

train = final_data[!is.na(final_data$price_doc),]

test = final_data[is.na(final_data$price_doc),]

## log transformation
train$price_doc = log(train$price_doc)
print(dim(train))
## [1] 30471   391
print(dim(test))
## [1] 7662  391
## Get the number of target column

print(paste0("Target Column Number: ",which(colnames(train) == "price_doc")), quote = F)
## [1] Target Column Number: 289

Xgboost

Xgboost is a very powerful algorithm, it will take care of missing values on its own.

set.seed(123)
dtrain = xgb.DMatrix(data = as.matrix(train[, -289]), label = train$price_doc)

params = list(objective = "reg:squarederror",
              booster = "gbtree",
              eta = 0.05,
              max_depth = 4,
              min_child_weight = 1,
              subsample = 1,
              colsample_bytree = 1, 
              gamma = 1)

In order to avoid overfitting of our model, I will use early stopping rounds, which will see whether the results are improving in our validation folds. Our model will automatically stop training when it doesnt see any improvement.

cv_results = xgb.cv(params = params,
                    data = dtrain,
                    nrounds = 500,
                    early_stopping_rounds = 20,
                    maximize = FALSE,
                    print_every_n = 100,
                    nfold = 5,
                    metrics = "rmse"
       
)
## [1]  train-rmse:14.366561+0.001666   test-rmse:14.366530+0.006799 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 20 rounds.
## 
## [101]    train-rmse:0.458009+0.001947    test-rmse:0.475350+0.007486 
## [201]    train-rmse:0.431235+0.002393    test-rmse:0.462931+0.008737 
## [301]    train-rmse:0.418679+0.002715    test-rmse:0.461929+0.008788 
## Stopping. Best iteration:
## [309]    train-rmse:0.417792+0.002677    test-rmse:0.461878+0.008767
model_xgb = xgb.train(params = params,data = dtrain,
                      watchlist = list(train = dtrain),
                     nrounds = cv_results$best_iteration,
                     print_every_n = 10,
                     early_stopping_rounds = 20,
                     maximize = F)
## [1]  train-rmse:14.366540 
## Will train until train_rmse hasn't improved in 20 rounds.
## 
## [11] train-rmse:8.612225 
## [21] train-rmse:5.172215 
## [31] train-rmse:3.121177 
## [41] train-rmse:1.907147 
## [51] train-rmse:1.201945 
## [61] train-rmse:0.809774 
## [71] train-rmse:0.608603 
## [81] train-rmse:0.516049 
## [91] train-rmse:0.476934 
## [101]    train-rmse:0.460572 
## [111]    train-rmse:0.453260 
## [121]    train-rmse:0.449418 
## [131]    train-rmse:0.447085 
## [141]    train-rmse:0.444883 
## [151]    train-rmse:0.443211 
## [161]    train-rmse:0.441479 
## [171]    train-rmse:0.440058 
## [181]    train-rmse:0.438903 
## [191]    train-rmse:0.437686 
## [201]    train-rmse:0.436414 
## [211]    train-rmse:0.435065 
## [221]    train-rmse:0.433833 
## [231]    train-rmse:0.432751 
## [241]    train-rmse:0.431634 
## [251]    train-rmse:0.430495 
## [261]    train-rmse:0.429463 
## [271]    train-rmse:0.428179 
## [281]    train-rmse:0.427398 
## [291]    train-rmse:0.426308 
## [301]    train-rmse:0.425261 
## [309]    train-rmse:0.424516

With more preprocessing and feature selection, we can achieve better result.

results = predict(model_xgb, newdata = as.matrix(test[, - 289]))
prediction = exp(results)
submission = read.csv('C:\\Users\\uzair\\Codes\\Russian Market Price\\sample_submission.csv')
submission$price_doc = prediction
write.csv(submission, 'sub.csv', row.names = F)

Final result in the competition is 0.34 rmse :)