This competition is about the prediction of Russian house prices. In this dataset, we have house related information as well as the Russian economic data. It is a really good challenge to play with the housing dataset and linking it with a country’s economy. Both of these dataset can be joined together by date column. We have lots of missing values in both dataset. But I will use Xgboost algorithm which will take care of missing values on its own.
This dataset can be used to enhance regression skills, feature engineering, and preprocessing. This data is available at this link.
train = read.csv('C:\\Users\\uzair\\Codes\\Russian Market Price\\train.csv', stringsAsFactors = FALSE)
test = read.csv('C:\\Users\\uzair\\Codes\\Russian Market Price\\test.csv', stringsAsFactors = FALSE)
macros = read.csv("C:\\Users\\uzair\\Codes\\Russian Market Price\\macro.csv", stringsAsFactors = FALSE)
head(train)
## id timestamp full_sq life_sq floor max_floor material build_year num_room
## 1 1 2011-08-20 43 27 4 NA NA NA NA
## 2 2 2011-08-23 34 19 3 NA NA NA NA
## 3 3 2011-08-27 43 29 2 NA NA NA NA
## 4 4 2011-09-01 89 50 9 NA NA NA NA
## 5 5 2011-09-05 77 77 4 NA NA NA NA
## 6 6 2011-09-06 67 46 14 NA NA NA NA
## kitch_sq state product_type sub_area area_m raion_popul
## 1 NA NA Investment Bibirevo 6407578 155572
## 2 NA NA Investment Nagatinskij Zaton 9589337 115352
## 3 NA NA Investment Tekstil'shhiki 4808270 101708
## 4 NA NA Investment Mitino 12583536 178473
## 5 NA NA Investment Basmannoe 8398461 108171
## 6 NA NA Investment Nizhegorodskoe 7506452 43795
## green_zone_part indust_part children_preschool preschool_quota
## 1 0.189727117 0.0000699893 9576 5001
## 2 0.372602044 0.0496372570 6880 3119
## 3 0.112559644 0.1185373850 5879 1463
## 4 0.194702869 0.0697533610 13087 6839
## 5 0.015233744 0.0373164520 5706 3240
## 6 0.007670134 0.4862456210 2418 852
## preschool_education_centers_raion children_school school_quota
## 1 5 10309 11065
## 2 5 7759 6237
## 3 4 6207 5580
## 4 9 13670 17063
## 5 7 6748 7770
## 6 2 2514 2012
## school_education_centers_raion school_education_centers_top_20_raion
## 1 5 0
## 2 8 0
## 3 7 0
## 4 10 0
## 5 9 0
## 6 3 0
## hospital_beds_raion healthcare_centers_raion university_top_20_raion
## 1 240 1 0
## 2 229 1 0
## 3 1183 1 0
## 4 NA 1 0
## 5 562 4 2
## 6 NA 0 0
## sport_objects_raion additional_education_raion culture_objects_top_25
## 1 7 3 no
## 2 6 1 yes
## 3 5 1 no
## 4 17 6 no
## 5 25 2 no
## 6 7 0 no
## culture_objects_top_25_raion shopping_centers_raion office_raion
## 1 0 16 1
## 2 1 3 0
## 3 0 0 1
## 4 0 11 4
## 5 0 10 93
## 6 0 6 19
## thermal_power_plant_raion incineration_raion oil_chemistry_raion
## 1 no no no
## 2 no no no
## 3 no no no
## 4 no no no
## 5 no no no
## 6 yes no no
## radiation_raion railroad_terminal_raion big_market_raion
## 1 no no no
## 2 no no no
## 3 yes no no
## 4 no no no
## 5 yes yes no
## 6 yes no no
## nuclear_reactor_raion detention_facility_raion full_all male_f female_f
## 1 no no 86206 40477 45729
## 2 no no 76284 34200 42084
## 3 no no 101982 46076 55906
## 4 no no 21155 9828 11327
## 5 no no 28179 13522 14657
## 6 no no 19940 9400 10540
## young_all young_male young_female work_all work_male work_female ekder_all
## 1 21154 11007 10147 98207 52277 45930 36211
## 2 15727 7925 7802 70194 35622 34572 29431
## 3 13028 6835 6193 63388 31813 31575 25292
## 4 28563 14680 13883 120381 60040 60341 29529
## 5 13368 7159 6209 68043 34236 33807 26760
## 6 5291 2744 2547 29660 15793 13867 8844
## ekder_male ekder_female X0_6_all X0_6_male X0_6_female X7_14_all X7_14_male
## 1 10580 25631 9576 4899 4677 10309 5463
## 2 9266 20165 6880 3466 3414 7759 3909
## 3 7609 17683 5879 3095 2784 6207 3269
## 4 9083 20446 13087 6645 6442 13670 7126
## 5 8563 18197 5706 2982 2724 6748 3664
## 6 2608 6236 2418 1224 1194 2514 1328
## X7_14_female X0_17_all X0_17_male X0_17_female X16_29_all X16_29_male
## 1 4846 23603 12286 11317 17508 9425
## 2 3850 17700 8998 8702 15164 7571
## 3 2938 14884 7821 7063 19401 9045
## 4 6544 32063 16513 15550 3292 1450
## 5 3084 15237 8113 7124 5164 2583
## 6 1186 5866 3035 2831 4851 2329
## X16_29_female X0_13_all X0_13_male X0_13_female
## 1 8083 18654 9709 8945
## 2 7593 13729 6929 6800
## 3 10356 11252 5916 5336
## 4 1842 24934 12782 12152
## 5 2581 11631 6223 5408
## 6 2522 4632 2399 2233
## raion_build_count_with_material_info build_count_block build_count_wood
## 1 211 25 0
## 2 245 83 1
## 3 330 59 0
## 4 458 9 51
## 5 746 48 0
## 6 188 24 0
## build_count_frame build_count_brick build_count_monolith build_count_panel
## 1 0 0 2 184
## 2 0 67 4 90
## 3 0 206 4 60
## 4 12 124 50 201
## 5 0 643 16 35
## 6 0 147 2 15
## build_count_foam build_count_slag build_count_mix
## 1 0 0 0
## 2 0 0 0
## 3 0 1 0
## 4 0 9 2
## 5 0 3 1
## 6 0 0 0
## raion_build_count_with_builddate_info build_count_before_1920
## 1 211 0
## 2 244 1
## 3 330 1
## 4 459 13
## 5 746 371
## 6 188 0
## build_count_1921.1945 build_count_1946.1970 build_count_1971.1995
## 1 0 0 206
## 2 1 143 84
## 3 0 246 63
## 4 24 40 130
## 5 114 146 62
## 6 5 152 25
## build_count_after_1995 ID_metro metro_min_avto metro_km_avto metro_min_walk
## 1 5 1 2.5902411 1.1312599 13.575119
## 2 15 2 0.9366997 0.6473368 7.620630
## 3 20 3 2.1209989 1.6379963 17.351515
## 4 252 4 1.4890492 0.9845366 11.565624
## 5 53 5 1.2571865 0.8766202 8.266305
## 6 6 6 2.7358839 1.5932465 18.378170
## metro_km_walk kindergarten_km school_km park_km green_zone_km
## 1 1.1312599 0.14569955 0.1779754 2.15858707 0.60097310
## 2 0.6350525 0.14775427 0.2733453 0.55068974 0.06532116
## 3 1.4459596 0.04910154 0.1580719 0.37484775 0.45317241
## 4 0.9638020 0.17944096 0.2364550 0.07809029 0.10612451
## 5 0.6888588 0.24790121 0.3768381 0.25828877 0.23621405
## 6 1.5315141 0.14595482 0.1134662 1.07349543 1.49790264
## industrial_km water_treatment_km cemetery_km incineration_km
## 1 1.0809343 23.683460 1.804127 3.633334
## 2 0.9664791 1.317476 4.655004 8.648587
## 3 0.9392751 4.912660 3.381083 11.996480
## 4 0.4511733 15.623710 2.017080 14.317640
## 5 0.3928710 10.683540 2.936581 11.903910
## 6 0.2564875 7.186740 0.780330 14.075140
## railroad_station_walk_km railroad_station_walk_min ID_railroad_station_walk
## 1 5.4198930 65.03872 1
## 2 3.4119931 40.94392 2
## 3 1.2776580 15.33190 3
## 4 4.2914325 51.49719 4
## 5 0.8539601 10.24752 5
## 6 0.3753117 4.50374 6
## railroad_station_avto_km railroad_station_avto_min ID_railroad_station_avto
## 1 5.4198930 6.905893 1
## 2 3.6417726 4.679745 2
## 3 1.2776580 1.701420 3
## 4 3.8160446 5.271136 4
## 5 1.5958982 2.156284 113
## 6 0.3753117 1.407419 6
## public_transport_station_km public_transport_station_min_walk water_km
## 1 0.27498514 3.2998217 0.9926311
## 2 0.06526334 0.7831601 0.6980813
## 3 0.32875604 3.9450725 0.4682646
## 4 0.13159696 1.5791635 1.2003365
## 5 0.07148032 0.8577639 0.8202943
## 6 0.18922715 2.2707258 0.6124473
## water_1line mkad_km ttk_km sadovoe_km bulvar_ring_km kremlin_km
## 1 no 1.422391 10.9185867 13.10061764 13.6756570 15.156211
## 2 no 9.503405 3.1039960 6.44433347 8.1326401 8.698054
## 3 no 5.604800 2.9274871 6.96340300 8.0542523 9.067885
## 4 no 2.677824 14.6065008 17.45719794 18.3094331 19.487005
## 5 no 11.616653 1.7218337 0.04680957 0.7875933 2.578671
## 6 no 8.296087 0.2848681 3.51938898 4.3950575 5.645796
## big_road1_km ID_big_road1 big_road1_1line big_road2_km ID_big_road2
## 1 1.4223914 1 no 3.830951 5
## 2 2.8873766 2 no 3.103996 4
## 3 0.6472498 3 no 2.927487 4
## 4 2.6778243 1 no 2.780449 17
## 5 1.7218337 4 no 3.133531 10
## 6 0.2848681 4 no 1.478529 3
## railroad_km railroad_1line zd_vokzaly_avto_km ID_railroad_terminal
## 1 1.30515949 no 14.231961 101
## 2 0.69453573 no 9.242586 32
## 3 0.70069112 no 9.540544 5
## 4 1.99926542 no 17.478380 83
## 5 0.08411254 yes 1.595898 113
## 6 0.24467041 no 5.070197 5
## bus_terminal_avto_km ID_bus_terminal oil_chemistry_km nuclear_reactor_km
## 1 24.292406 1 18.152338 5.718519
## 2 5.706113 2 9.034642 3.489954
## 3 6.710302 3 5.777394 7.506612
## 4 6.734618 1 27.667863 9.522538
## 5 1.423428 4 6.515857 8.671016
## 6 6.682089 4 3.959509 8.757686
## radiation_km power_transmission_line_km thermal_power_plant_km ts_km
## 1 1.2100274 1.062513 5.814135 4.3081270
## 2 2.7242954 1.246149 3.419574 0.7255604
## 3 0.7722161 1.602183 3.682455 3.5621877
## 4 6.3487163 1.767612 11.178333 0.5830250
## 5 1.6383181 3.632640 4.587917 2.6094196
## 6 0.1931270 2.341562 1.272894 1.4380034
## big_market_km market_shop_km fitness_km swim_pool_km ice_rink_km stadium_km
## 1 10.814172 1.6762583 0.4858414 3.065047 1.107594 8.1485908
## 2 6.910568 3.4247161 0.6683637 2.000154 8.972823 6.1270728
## 3 5.752368 1.3754428 0.7331011 1.239304 1.978517 0.7675688
## 4 27.892717 0.8112753 0.6234843 1.950317 6.483172 7.3855207
## 5 9.155057 1.9697377 0.2202877 2.544696 3.975401 3.6107538
## 6 5.374564 3.4478636 0.8104131 1.911843 2.108923 4.2330947
## basketball_km hospice_morgue_km detention_facility_km public_healthcare_km
## 1 3.5165129 2.392353 4.248036 0.97474284
## 2 1.1615790 2.543747 12.649879 1.47772267
## 3 1.9527706 0.621357 7.682303 0.09714353
## 4 4.9238432 3.549558 8.789894 2.16373516
## 5 0.3079154 1.864637 3.779781 1.12170284
## 6 1.4509749 3.391117 4.356122 1.69872358
## university_km workplaces_km shopping_centers_km office_km
## 1 6.7150258 0.8843500 0.6484876 0.63718883
## 2 1.8525602 0.6862517 0.5193113 0.68879632
## 3 0.8412541 1.5100889 1.4865330 1.54304884
## 4 10.9031613 0.6222716 0.5999136 0.93427350
## 5 0.9916826 0.8926675 0.4290521 0.07790096
## 6 3.8300213 1.0422618 0.4407073 0.42235787
## additional_education_km preschool_km big_church_km church_synagogue_km
## 1 0.9479617 0.1779754 0.6257834 0.6281865
## 2 1.0723151 0.2733453 0.9678206 0.4714465
## 3 0.3919574 0.1580719 3.1787515 0.7559460
## 4 0.8926743 0.2364550 1.0317768 1.5615048
## 5 0.8108015 0.3768381 0.3787558 0.1216806
## 6 3.0662852 0.1134662 0.6869317 0.8704465
## mosque_km theater_km museum_km exhibition_km catering_km ecology
## 1 3.932040 14.053047 7.389498 7.023705 0.516838085 good
## 2 4.841544 6.829889 0.709260 2.358840 0.230286910 excellent
## 3 7.922152 4.273200 3.156423 4.958214 0.190461977 poor
## 4 15.300449 16.990677 16.041521 5.029696 0.465820158 good
## 5 2.584370 1.112486 1.800125 1.339652 0.026102416 excellent
## 6 4.787706 3.388810 3.713557 2.553424 0.004469307 poor
## green_part_500 prom_part_500 office_count_500 office_sqm_500 trc_count_500
## 1 0.00 0.00 0 0 0
## 2 25.14 0.00 0 0 0
## 3 1.67 0.00 0 0 0
## 4 17.36 0.57 0 0 0
## 5 3.56 4.44 15 293699 1
## 6 0.00 19.42 5 227705 3
## trc_sqm_500 cafe_count_500 cafe_sum_500_min_price_avg
## 1 0 0 NA
## 2 0 5 860.00
## 3 0 3 666.67
## 4 0 2 1000.00
## 5 45000 48 702.22
## 6 102000 7 1000.00
## cafe_sum_500_max_price_avg cafe_avg_price_500 cafe_count_500_na_price
## 1 NA NA 0
## 2 1500.00 1180.00 0
## 3 1166.67 916.67 0
## 4 1500.00 1250.00 0
## 5 1166.67 934.44 3
## 6 1625.00 1312.50 3
## cafe_count_500_price_500 cafe_count_500_price_1000 cafe_count_500_price_1500
## 1 0 0 0
## 2 1 3 0
## 3 0 2 1
## 4 0 0 2
## 5 17 10 11
## 6 0 1 2
## cafe_count_500_price_2500 cafe_count_500_price_4000 cafe_count_500_price_high
## 1 0 0 0
## 2 0 1 0
## 3 0 0 0
## 4 0 0 0
## 5 7 0 0
## 6 1 0 0
## big_church_count_500 church_count_500 mosque_count_500 leisure_count_500
## 1 0 0 0 0
## 2 0 1 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 1 4 0 2
## 6 0 0 0 0
## sport_count_500 market_count_500 green_part_1000 prom_part_1000
## 1 1 0 7.36 0.00
## 2 0 0 26.66 0.07
## 3 0 0 4.99 0.29
## 4 0 0 19.25 10.35
## 5 3 0 3.34 8.29
## 6 0 0 0.00 40.27
## office_count_1000 office_sqm_1000 trc_count_1000 trc_sqm_1000 cafe_count_1000
## 1 1 30500 3 55600 19
## 2 2 86600 5 94065 13
## 3 0 0 0 0 9
## 4 1 11000 6 80780 12
## 5 46 420952 3 158200 153
## 6 10 275135 5 164000 9
## cafe_sum_1000_min_price_avg cafe_sum_1000_max_price_avg cafe_avg_price_1000
## 1 527.78 888.89 708.33
## 2 615.38 1076.92 846.15
## 3 642.86 1142.86 892.86
## 4 658.33 1083.33 870.83
## 5 763.45 1272.41 1017.93
## 6 883.33 1416.67 1150.00
## cafe_count_1000_na_price cafe_count_1000_price_500 cafe_count_1000_price_1000
## 1 1 10 4
## 2 0 5 6
## 3 2 0 5
## 4 0 3 4
## 5 8 39 45
## 6 3 1 1
## cafe_count_1000_price_1500 cafe_count_1000_price_2500
## 1 3 1
## 2 1 0
## 3 2 0
## 4 5 0
## 5 39 19
## 6 3 1
## cafe_count_1000_price_4000 cafe_count_1000_price_high big_church_count_1000
## 1 0 0 1
## 2 1 0 1
## 3 0 0 0
## 4 0 0 0
## 5 2 1 7
## 6 0 0 3
## church_count_1000 mosque_count_1000 leisure_count_1000 sport_count_1000
## 1 2 0 0 6
## 2 2 0 4 2
## 3 1 0 0 5
## 4 0 0 0 3
## 5 12 0 6 7
## 6 1 0 0 1
## market_count_1000 green_part_1500 prom_part_1500 office_count_1500
## 1 1 14.27 6.92 3
## 2 0 21.53 7.71 3
## 3 3 9.92 6.73 0
## 4 1 28.38 6.57 2
## 5 0 4.12 4.83 93
## 6 0 0.00 50.64 18
## office_sqm_1500 trc_count_1500 trc_sqm_1500 cafe_count_1500
## 1 39554 9 171420 34
## 2 102910 7 127065 17
## 3 0 1 2600 14
## 4 11000 7 89492 23
## 5 1195735 9 445900 272
## 6 431090 6 186400 14
## cafe_sum_1500_min_price_avg cafe_sum_1500_max_price_avg cafe_avg_price_1500
## 1 566.67 969.70 768.18
## 2 694.12 1205.88 950.00
## 3 516.67 916.67 716.67
## 4 673.91 1130.43 902.17
## 5 766.80 1272.73 1019.76
## 6 718.18 1181.82 950.00
## cafe_count_1500_na_price cafe_count_1500_price_500 cafe_count_1500_price_1000
## 1 1 14 11
## 2 0 6 7
## 3 2 4 6
## 4 0 5 9
## 5 19 70 74
## 6 3 3 3
## cafe_count_1500_price_1500 cafe_count_1500_price_2500
## 1 6 2
## 2 1 2
## 3 2 0
## 4 8 1
## 5 72 30
## 6 4 1
## cafe_count_1500_price_4000 cafe_count_1500_price_high big_church_count_1500
## 1 0 0 1
## 2 1 0 1
## 3 0 0 0
## 4 0 0 1
## 5 6 1 18
## 6 0 0 4
## church_count_1500 mosque_count_1500 leisure_count_1500 sport_count_1500
## 1 2 0 0 7
## 2 5 0 4 9
## 3 4 0 0 6
## 4 0 0 0 9
## 5 30 0 10 14
## 6 2 0 0 11
## market_count_1500 green_part_2000 prom_part_2000 office_count_2000
## 1 1 11.77 15.97 9
## 2 0 22.37 19.25 4
## 3 5 12.99 12.75 4
## 4 2 32.29 5.73 2
## 5 2 4.53 5.02 149
## 6 0 0.38 51.58 21
## office_sqm_2000 trc_count_2000 trc_sqm_2000 cafe_count_2000
## 1 188854 19 1244891 36
## 2 165510 8 179065 21
## 3 100200 7 52550 24
## 4 11000 7 89492 25
## 5 1625130 17 564843 483
## 6 471290 14 683945 33
## cafe_sum_2000_min_price_avg cafe_sum_2000_max_price_avg cafe_avg_price_2000
## 1 614.29 1042.86 828.57
## 2 695.24 1190.48 942.86
## 3 563.64 977.27 770.45
## 4 660.00 1120.00 890.00
## 5 765.93 1269.23 1017.58
## 6 741.38 1258.62 1000.00
## cafe_count_2000_na_price cafe_count_2000_price_500 cafe_count_2000_price_1000
## 1 1 15 11
## 2 0 7 8
## 3 2 8 9
## 4 0 5 11
## 5 28 130 129
## 6 4 5 13
## cafe_count_2000_price_1500 cafe_count_2000_price_2500
## 1 6 2
## 2 3 2
## 3 4 1
## 4 8 1
## 5 131 50
## 6 8 2
## cafe_count_2000_price_4000 cafe_count_2000_price_high big_church_count_2000
## 1 1 0 1
## 2 1 0 1
## 3 0 0 0
## 4 0 0 1
## 5 14 1 35
## 6 1 0 6
## church_count_2000 mosque_count_2000 leisure_count_2000 sport_count_2000
## 1 2 0 0 10
## 2 5 0 4 11
## 3 4 0 0 8
## 4 1 0 0 13
## 5 61 0 17 21
## 6 5 0 0 21
## market_count_2000 green_part_3000 prom_part_3000 office_count_3000
## 1 1 11.98 13.55 12
## 2 0 18.07 27.32 12
## 3 5 12.14 26.46 8
## 4 2 20.79 3.57 4
## 5 3 5.06 8.62 305
## 6 1 1.82 39.99 54
## office_sqm_3000 trc_count_3000 trc_sqm_3000 cafe_count_3000
## 1 251554 23 1419204 68
## 2 821986 14 491565 30
## 3 110856 7 52550 41
## 4 167000 12 205756 32
## 5 3420907 60 2296870 1068
## 6 1181009 29 1059171 120
## cafe_sum_3000_min_price_avg cafe_sum_3000_max_price_avg cafe_avg_price_3000
## 1 639.68 1079.37 859.52
## 2 631.03 1086.21 858.62
## 3 697.44 1192.31 944.87
## 4 718.75 1218.75 968.75
## 5 853.03 1410.45 1131.74
## 6 737.96 1231.48 984.72
## cafe_count_3000_na_price cafe_count_3000_price_500 cafe_count_3000_price_1000
## 1 5 21 22
## 2 1 11 11
## 3 2 9 17
## 4 0 5 14
## 5 63 266 267
## 6 12 24 37
## cafe_count_3000_price_1500 cafe_count_3000_price_2500
## 1 16 3
## 2 4 2
## 3 9 3
## 4 10 3
## 5 262 149
## 6 35 11
## cafe_count_3000_price_4000 cafe_count_3000_price_high big_church_count_3000
## 1 1 0 2
## 2 1 0 1
## 3 1 0 0
## 4 0 0 1
## 5 57 4 70
## 6 1 0 12
## church_count_3000 mosque_count_3000 leisure_count_3000 sport_count_3000
## 1 4 0 0 21
## 2 7 0 6 19
## 3 11 0 0 20
## 4 2 0 0 18
## 5 121 1 40 77
## 6 12 0 2 31
## market_count_3000 green_part_5000 prom_part_5000 office_count_5000
## 1 1 13.09 13.31 29
## 2 1 10.26 27.47 66
## 3 6 13.69 21.58 43
## 4 3 14.18 3.89 8
## 5 5 8.38 10.92 689
## 6 7 5.92 25.79 253
## office_sqm_5000 trc_count_5000 trc_sqm_5000 cafe_count_5000
## 1 807385 52 4036616 152
## 2 2690465 40 2034942 177
## 3 1478160 35 1572990 122
## 4 244166 22 942180 61
## 5 8404624 114 3503058 2283
## 6 4274339 63 2010320 567
## cafe_sum_5000_min_price_avg cafe_sum_5000_max_price_avg cafe_avg_price_5000
## 1 708.57 1185.71 947.14
## 2 673.81 1148.81 911.31
## 3 702.68 1196.43 949.55
## 4 931.58 1552.63 1242.11
## 5 853.88 1411.45 1132.66
## 6 769.92 1280.08 1025.00
## cafe_count_5000_na_price cafe_count_5000_price_500 cafe_count_5000_price_1000
## 1 12 39 48
## 2 9 49 65
## 3 10 29 45
## 4 4 7 21
## 5 143 566 578
## 6 35 137 163
## cafe_count_5000_price_1500 cafe_count_5000_price_2500
## 1 40 9
## 2 36 15
## 3 25 10
## 4 15 11
## 5 552 319
## 6 155 62
## cafe_count_5000_price_4000 cafe_count_5000_price_high big_church_count_5000
## 1 4 0 13
## 2 3 0 15
## 3 3 0 11
## 4 2 1 4
## 5 108 17 135
## 6 14 1 53
## church_count_5000 mosque_count_5000 leisure_count_5000 sport_count_5000
## 1 22 1 0 52
## 2 29 1 10 66
## 3 27 0 4 67
## 4 4 0 0 26
## 5 236 2 91 195
## 6 78 1 20 113
## market_count_5000 price_doc
## 1 4 5850000
## 2 14 6000000
## 3 10 5700000
## 4 3 13100000
## 5 14 16331452
## 6 17 9100000
dim(train)
## [1] 30471 292
We have 292 columns and 30471 rows in training dataset.
dim(test)
## [1] 7662 291
We have 291 columns and 7662 rows in training dataset.We are not provided with the target column in testing dataset.
nas = colSums(is.na(train))
nas = nas[nas > 0]
nas = sort(nas, decreasing = TRUE)
head(nas)
## hospital_beds_raion build_year
## 14441 13605
## state cafe_sum_500_min_price_avg
## 13559 13281
## cafe_sum_500_max_price_avg cafe_avg_price_500
## 13281 13281
We can see that our dataset has some missing values. We will leave it as it is because we will use xgboost which will take care of it.
ggplot(train, aes(x = price_doc)) +
geom_density(fill = 'blue') +
theme_classic() +
xlab("Price") +
ggtitle("Distribution of Price") +
scale_x_continuous(labels = scales::comma)
Our target variable is positively skewed. I will use log transformation to see whether it will fix this problem. Lets visualize the result of log transformation in order to find out whether this method will help or not.
ggplot(train, aes(x = log(price_doc))) +
geom_density(fill = 'blue') +
theme_classic() +
xlab("Price") +
ggtitle("Distribution of Price after log transformation") +
scale_x_continuous(labels = scales::comma)
The distribution of price is closer to normal distribution after log transformation. We will fix it afterwards.
test$price_doc = NA
data = rbind(train, test)
cat("Number of Rows :", dim(data)[1], "\n", "Number of Columns :", dim(data)[2])
## Number of Rows : 38133
## Number of Columns : 292
I will use famous lubridate package to extract year, month and day as it is way easier than the any other package available in R.
data$Date = ymd(data$timestamp)
data$Year = year(data$Date)
data$Month = month(data$Date)
data$Day = day(data$Date)
We have a few columns that are actually numeric but they are converted to strings due to the presence of comma “,”. In order to fix this issue, I will remove comma between those numbers and convert them back to numeric.
merged_data = left_join(data, macros, by = "timestamp")
merged_data$child_on_acc_pre_school <- as.numeric(gsub(",","",merged_data$child_on_acc_pre_school))
## Warning: NAs introduced by coercion
merged_data$modern_education_share <- as.numeric(gsub(",","",merged_data$modern_education_share))
merged_data$old_education_build_share <- as.numeric(gsub(",","",merged_data$old_education_build_share))
As we dont have any information about product type column, I will impute mode in missing values.
merged_data$product_type[is.na(merged_data$product_type)] = names(sort(-table(merged_data$product_type)))[1]
## Double checking nas
any(is.na(merged_data$product_type))
## [1] FALSE
We fixed missing values in product type variable.
Converting all the columns with yes and no values into 1 and 0 repectively.
merged_data$culture_objects_top_25 = ifelse(merged_data$culture_objects_top_25 == "yes", 1, 0)
merged_data$thermal_power_plant_raion = ifelse(merged_data$thermal_power_plant_raion == "yes", 1, 0)
merged_data$incineration_raion = ifelse(merged_data$incineration_raion == "yes", 1, 0)
merged_data$oil_chemistry_raion = ifelse(merged_data$oil_chemistry_raion == "yes", 1, 0)
merged_data$radiation_raion = ifelse(merged_data$radiation_raion == "yes", 1, 0)
merged_data$railroad_terminal_raion = ifelse(merged_data$railroad_terminal_raion == "yes", 1, 0)
merged_data$big_market_raion = ifelse(merged_data$big_market_raion == "yes", 1, 0)
merged_data$nuclear_reactor_raion = ifelse(merged_data$nuclear_reactor_raion == "yes", 1, 0)
merged_data$detention_facility_raion = ifelse(merged_data$detention_facility_raion == "yes", 1, 0)
merged_data$water_1line = ifelse(merged_data$water_1line == "yes", 1, 0)
merged_data$railroad_1line = ifelse(merged_data$railroad_1line == "yes", 1, 0)
merged_data$big_road1_1line = ifelse(merged_data$big_road1_1line == "yes", 1, 0)
table(merged_data$ecology)
##
## excellent good no data poor satisfactory
## 5164 8785 9421 10078 4685
As ecology is an ordinal variable, I will convert this column into numeric.
number <- c('no data' = 0, 'poor' = 1, 'satisfactory' = 2, 'good' = 3, 'excellent' = 4)
merged_data$ecology = as.integer(revalue(merged_data$ecology, number))
table(merged_data$ecology)
##
## 0 1 2 3 4
## 9421 10078 4685 8785 5164
## Selecting all numeric columns
final_data = merged_data %>%
select_if(is.numeric)
final_data$price_doc = data$price_doc
I will use log transformation in order to fix skewness in our target variable i.e. Price_doc
train = final_data[!is.na(final_data$price_doc),]
test = final_data[is.na(final_data$price_doc),]
## log transformation
train$price_doc = log(train$price_doc)
print(dim(train))
## [1] 30471 391
print(dim(test))
## [1] 7662 391
## Get the number of target column
print(paste0("Target Column Number: ",which(colnames(train) == "price_doc")), quote = F)
## [1] Target Column Number: 289
Xgboost is a very powerful algorithm, it will take care of missing values on its own.
set.seed(123)
dtrain = xgb.DMatrix(data = as.matrix(train[, -289]), label = train$price_doc)
params = list(objective = "reg:squarederror",
booster = "gbtree",
eta = 0.05,
max_depth = 4,
min_child_weight = 1,
subsample = 1,
colsample_bytree = 1,
gamma = 1)
In order to avoid overfitting of our model, I will use early stopping rounds, which will see whether the results are improving in our validation folds. Our model will automatically stop training when it doesnt see any improvement.
cv_results = xgb.cv(params = params,
data = dtrain,
nrounds = 500,
early_stopping_rounds = 20,
maximize = FALSE,
print_every_n = 100,
nfold = 5,
metrics = "rmse"
)
## [1] train-rmse:14.366561+0.001666 test-rmse:14.366530+0.006799
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 20 rounds.
##
## [101] train-rmse:0.458009+0.001947 test-rmse:0.475350+0.007486
## [201] train-rmse:0.431235+0.002393 test-rmse:0.462931+0.008737
## [301] train-rmse:0.418679+0.002715 test-rmse:0.461929+0.008788
## Stopping. Best iteration:
## [309] train-rmse:0.417792+0.002677 test-rmse:0.461878+0.008767
model_xgb = xgb.train(params = params,data = dtrain,
watchlist = list(train = dtrain),
nrounds = cv_results$best_iteration,
print_every_n = 10,
early_stopping_rounds = 20,
maximize = F)
## [1] train-rmse:14.366540
## Will train until train_rmse hasn't improved in 20 rounds.
##
## [11] train-rmse:8.612225
## [21] train-rmse:5.172215
## [31] train-rmse:3.121177
## [41] train-rmse:1.907147
## [51] train-rmse:1.201945
## [61] train-rmse:0.809774
## [71] train-rmse:0.608603
## [81] train-rmse:0.516049
## [91] train-rmse:0.476934
## [101] train-rmse:0.460572
## [111] train-rmse:0.453260
## [121] train-rmse:0.449418
## [131] train-rmse:0.447085
## [141] train-rmse:0.444883
## [151] train-rmse:0.443211
## [161] train-rmse:0.441479
## [171] train-rmse:0.440058
## [181] train-rmse:0.438903
## [191] train-rmse:0.437686
## [201] train-rmse:0.436414
## [211] train-rmse:0.435065
## [221] train-rmse:0.433833
## [231] train-rmse:0.432751
## [241] train-rmse:0.431634
## [251] train-rmse:0.430495
## [261] train-rmse:0.429463
## [271] train-rmse:0.428179
## [281] train-rmse:0.427398
## [291] train-rmse:0.426308
## [301] train-rmse:0.425261
## [309] train-rmse:0.424516
With more preprocessing and feature selection, we can achieve better result.
results = predict(model_xgb, newdata = as.matrix(test[, - 289]))
prediction = exp(results)
submission = read.csv('C:\\Users\\uzair\\Codes\\Russian Market Price\\sample_submission.csv')
submission$price_doc = prediction
write.csv(submission, 'sub.csv', row.names = F)
Final result in the competition is 0.34 rmse :)