This is a exploratory analysis for the kaggle competition placed by the Sverbank of Russian and whose goal is to predict the price fluctuations in the Housing Market of Moscow.

This paper doesn’t try to take part in that a competition.It is only a mere exploratory analysis of that market with the goal of doing some exercise of analitic techniques on a real problem.

I will use the Single Value Decomposition (SGV) technique to extract some patterns from the 392 variable that are in the files train.csv and macro.csv.

In train.csv there are 292 variable of every type regarding the size and features of the houses and buildings, equipments near to the house, some demographic data, transport networks, roads and railways, suppliers and in general all kind of facilities near.

In macro.csv there are 100 variable with macroeconomic data in the date of the transaction.

PART 1: PREPARATORY JOB.

Load of train.csv and macro.csv and merging in a single object called merged through the common variable timestamp.

library(tidyverse)
pf1 <- read.csv("train.csv")
pf2 <- read.csv("macro.csv")
merged <- merge(pf1,pf2,"timestamp")

Conversion of the classic data frame in a tibber called merged2, substraction of the factor variables, so only numeric ones would be kept and final conversion into a matrix of 372 variables called finalMatrix. That one is the object the SGV technique can work with.

merged2 <- as_data_frame(merged)
merged2class <- sapply(merged2,class)
finalColumns <- merged2class[merged2class!="factor"]
finalData <- merged2[,names(finalColumns)]
finalMatrix <- as.matrix(finalData)

There is a last point to apply the SGV. We need to impute some values to the missing values. For that we will use the bioconductor package and will load the impute library to be able to apply the impute.knn function that will fill up the missing values with values taken from the nearest ones. But before we will remove two columns from the matrix because those exceed a 80% of missing values: the 362 and 363. The final matrix will be imputed$data

library(impute)
for (i in 1:372) {
  if((!(sum(is.na(finalMatrix[,i]))/30471)>0.8)==FALSE){
    print(i);
  }
  
}
## [1] 362
## [1] 363
finalMatrix2 <- finalMatrix[,-c(362,363)]
suppressPackageStartupMessages(imputed <- impute.knn(finalMatrix2))
## Cluster size 30471 broken into 26484 3987 
## Cluster size 26484 broken into 19442 7042 
## Cluster size 19442 broken into 6946 12496 
## Cluster size 6946 broken into 5798 1148 
## Cluster size 5798 broken into 963 4835 
## Done cluster 963 
## Cluster size 4835 broken into 1866 2969 
## Cluster size 1866 broken into 884 982 
## Done cluster 884 
## Done cluster 982 
## Done cluster 1866 
## Cluster size 2969 broken into 668 2301 
## Done cluster 668 
## Cluster size 2301 broken into 1168 1133 
## Done cluster 1168 
## Done cluster 1133 
## Done cluster 2301 
## Done cluster 2969 
## Done cluster 4835 
## Done cluster 5798 
## Done cluster 1148 
## Done cluster 6946 
## Cluster size 12496 broken into 5749 6747 
## Cluster size 5749 broken into 4425 1324 
## Cluster size 4425 broken into 2895 1530 
## Cluster size 2895 broken into 1904 991 
## Cluster size 1904 broken into 1029 875 
## Done cluster 1029 
## Done cluster 875 
## Done cluster 1904 
## Done cluster 991 
## Done cluster 2895 
## Cluster size 1530 broken into 684 846 
## Done cluster 684 
## Done cluster 846 
## Done cluster 1530 
## Done cluster 4425 
## Done cluster 1324 
## Done cluster 5749 
## Cluster size 6747 broken into 4707 2040 
## Cluster size 4707 broken into 2322 2385 
## Cluster size 2322 broken into 1002 1320 
## Done cluster 1002 
## Done cluster 1320 
## Done cluster 2322 
## Cluster size 2385 broken into 1541 844 
## Cluster size 1541 broken into 716 825 
## Done cluster 716 
## Done cluster 825 
## Done cluster 1541 
## Done cluster 844 
## Done cluster 2385 
## Done cluster 4707 
## Cluster size 2040 broken into 1020 1020 
## Done cluster 1020 
## Done cluster 1020 
## Done cluster 2040 
## Done cluster 6747 
## Done cluster 12496 
## Done cluster 19442 
## Cluster size 7042 broken into 2988 4054 
## Cluster size 2988 broken into 1202 1786 
## Done cluster 1202 
## Cluster size 1786 broken into 841 945 
## Done cluster 841 
## Done cluster 945 
## Done cluster 1786 
## Done cluster 2988 
## Cluster size 4054 broken into 2045 2009 
## Cluster size 2045 broken into 1596 449 
## Cluster size 1596 broken into 808 788 
## Done cluster 808 
## Done cluster 788 
## Done cluster 1596 
## Done cluster 449 
## Done cluster 2045 
## Cluster size 2009 broken into 778 1231 
## Done cluster 778 
## Done cluster 1231 
## Done cluster 2009 
## Done cluster 4054 
## Done cluster 7042 
## Done cluster 26484 
## Cluster size 3987 broken into 2137 1850 
## Cluster size 2137 broken into 329 1808 
## Done cluster 329 
## Cluster size 1808 broken into 947 861 
## Done cluster 947 
## Done cluster 861 
## Done cluster 1808 
## Done cluster 2137 
## Cluster size 1850 broken into 1298 552 
## Done cluster 1298 
## Done cluster 552 
## Done cluster 1850 
## Done cluster 3987
#names(imputed)
dim(imputed$data)
## [1] 30471   370

PART 1: ANALYSIS.

Now it is when we start the proper analysis. Firstly we will scale the columns of the matrix through the generic scale function. Second, we will apply the SGV function

singularValues <- svd(scale(imputed$data))

The application of SVG give us three matrix: the u and v are unitary, the columns of each form a set of orthonormal vectors.The columns of v (right-singular vectors) correspond to the variables of the original matrix. The d is the diagonal vector that give us the weight of each of the columns. We can transform those values of the diagonal in proportions or percentage of the total variability that each column of v represents.

The above left plot is the diagonal matrix of SVG. The diagonal matrix give us the weight of the columns. The above right plot is the proportion of each value of the diagonal or the variability explained by each column in v. We see how the first column explains nearly the 25% of the total variability. The second near the 15% and the third one some more than 10%. The next ones explain much less.

The next two plots will give us the patterns that we can deduct from the First and Second Right Singular Vectors.

Here we can see the patterns. With the 25% of variance explained (corresponding to the First-Right Singular Column) we see one pattern in the centre-upper part of the plot with lot of points gathering and another pattern in the right-middle part of the plot. In the left-hand side part we see the points more at random. To see these last ones nearest we can do use of the second right-singular column that explains near the 15% of the variability.

We see how in the left-hand side part of this plot corresponding to the Second-Right Singular Vector, there is a gathering of points in the lower part, around the range between the 31th and the 56th variable.

With all that information we can extract five groups of variables:

1.- This will correspond to the details of the house and facilities in the raion (administrative unit where is placed the house).

  names(finalData[1:30])
##  [1] "id"                                   
##  [2] "full_sq"                              
##  [3] "life_sq"                              
##  [4] "floor"                                
##  [5] "max_floor"                            
##  [6] "material"                             
##  [7] "build_year"                           
##  [8] "num_room"                             
##  [9] "kitch_sq"                             
## [10] "state"                                
## [11] "area_m"                               
## [12] "raion_popul"                          
## [13] "green_zone_part"                      
## [14] "indust_part"                          
## [15] "children_preschool"                   
## [16] "preschool_quota"                      
## [17] "preschool_education_centers_raion"    
## [18] "children_school"                      
## [19] "school_quota"                         
## [20] "school_education_centers_raion"       
## [21] "school_education_centers_top_20_raion"
## [22] "hospital_beds_raion"                  
## [23] "healthcare_centers_raion"             
## [24] "university_top_20_raion"              
## [25] "sport_objects_raion"                  
## [26] "additional_education_raion"           
## [27] "culture_objects_top_25_raion"         
## [28] "shopping_centers_raion"               
## [29] "office_raion"                         
## [30] "full_all"

2.- This is a group corresponding to demographic data by sex that probably would be needed to reconvert, so instead male and female being variables they should be reconverted as values of a variable sex.

names(finalData[31:56])
##  [1] "male_f"        "female_f"      "young_all"     "young_male"   
##  [5] "young_female"  "work_all"      "work_male"     "work_female"  
##  [9] "ekder_all"     "ekder_male"    "ekder_female"  "X0_6_all"     
## [13] "X0_6_male"     "X0_6_female"   "X7_14_all"     "X7_14_male"   
## [17] "X7_14_female"  "X0_17_all"     "X0_17_male"    "X0_17_female" 
## [21] "X16_29_all"    "X16_29_male"   "X16_29_female" "X0_13_all"    
## [25] "X0_13_male"    "X0_13_female"

3.- This is a group corresponding to variables related to the material of the buildings and distances to transport, suppliers and facilites of different kind.

names(finalData[57:139])
##  [1] "raion_build_count_with_material_info" 
##  [2] "build_count_block"                    
##  [3] "build_count_wood"                     
##  [4] "build_count_frame"                    
##  [5] "build_count_brick"                    
##  [6] "build_count_monolith"                 
##  [7] "build_count_panel"                    
##  [8] "build_count_foam"                     
##  [9] "build_count_slag"                     
## [10] "build_count_mix"                      
## [11] "raion_build_count_with_builddate_info"
## [12] "build_count_before_1920"              
## [13] "build_count_1921.1945"                
## [14] "build_count_1946.1970"                
## [15] "build_count_1971.1995"                
## [16] "build_count_after_1995"               
## [17] "ID_metro"                             
## [18] "metro_min_avto"                       
## [19] "metro_km_avto"                        
## [20] "metro_min_walk"                       
## [21] "metro_km_walk"                        
## [22] "kindergarten_km"                      
## [23] "school_km"                            
## [24] "park_km"                              
## [25] "green_zone_km"                        
## [26] "industrial_km"                        
## [27] "water_treatment_km"                   
## [28] "cemetery_km"                          
## [29] "incineration_km"                      
## [30] "railroad_station_walk_km"             
## [31] "railroad_station_walk_min"            
## [32] "ID_railroad_station_walk"             
## [33] "railroad_station_avto_km"             
## [34] "railroad_station_avto_min"            
## [35] "ID_railroad_station_avto"             
## [36] "public_transport_station_km"          
## [37] "public_transport_station_min_walk"    
## [38] "water_km"                             
## [39] "mkad_km"                              
## [40] "ttk_km"                               
## [41] "sadovoe_km"                           
## [42] "bulvar_ring_km"                       
## [43] "kremlin_km"                           
## [44] "big_road1_km"                         
## [45] "ID_big_road1"                         
## [46] "big_road2_km"                         
## [47] "ID_big_road2"                         
## [48] "railroad_km"                          
## [49] "zd_vokzaly_avto_km"                   
## [50] "ID_railroad_terminal"                 
## [51] "bus_terminal_avto_km"                 
## [52] "ID_bus_terminal"                      
## [53] "oil_chemistry_km"                     
## [54] "nuclear_reactor_km"                   
## [55] "radiation_km"                         
## [56] "power_transmission_line_km"           
## [57] "thermal_power_plant_km"               
## [58] "ts_km"                                
## [59] "big_market_km"                        
## [60] "market_shop_km"                       
## [61] "fitness_km"                           
## [62] "swim_pool_km"                         
## [63] "ice_rink_km"                          
## [64] "stadium_km"                           
## [65] "basketball_km"                        
## [66] "hospice_morgue_km"                    
## [67] "detention_facility_km"                
## [68] "public_healthcare_km"                 
## [69] "university_km"                        
## [70] "workplaces_km"                        
## [71] "shopping_centers_km"                  
## [72] "office_km"                            
## [73] "additional_education_km"              
## [74] "preschool_km"                         
## [75] "big_church_km"                        
## [76] "church_synagogue_km"                  
## [77] "mosque_km"                            
## [78] "theater_km"                           
## [79] "museum_km"                            
## [80] "exhibition_km"                        
## [81] "catering_km"                          
## [82] "green_part_500"                       
## [83] "prom_part_500"

4.- This is a group corresponding to cafes, resturants and other leisure facilities as well as religious services and their distances to the house.

names(finalData[140:275])
##   [1] "office_count_500"            "office_sqm_500"             
##   [3] "trc_count_500"               "trc_sqm_500"                
##   [5] "cafe_count_500"              "cafe_sum_500_min_price_avg" 
##   [7] "cafe_sum_500_max_price_avg"  "cafe_avg_price_500"         
##   [9] "cafe_count_500_na_price"     "cafe_count_500_price_500"   
##  [11] "cafe_count_500_price_1000"   "cafe_count_500_price_1500"  
##  [13] "cafe_count_500_price_2500"   "cafe_count_500_price_4000"  
##  [15] "cafe_count_500_price_high"   "big_church_count_500"       
##  [17] "church_count_500"            "mosque_count_500"           
##  [19] "leisure_count_500"           "sport_count_500"            
##  [21] "market_count_500"            "green_part_1000"            
##  [23] "prom_part_1000"              "office_count_1000"          
##  [25] "office_sqm_1000"             "trc_count_1000"             
##  [27] "trc_sqm_1000"                "cafe_count_1000"            
##  [29] "cafe_sum_1000_min_price_avg" "cafe_sum_1000_max_price_avg"
##  [31] "cafe_avg_price_1000"         "cafe_count_1000_na_price"   
##  [33] "cafe_count_1000_price_500"   "cafe_count_1000_price_1000" 
##  [35] "cafe_count_1000_price_1500"  "cafe_count_1000_price_2500" 
##  [37] "cafe_count_1000_price_4000"  "cafe_count_1000_price_high" 
##  [39] "big_church_count_1000"       "church_count_1000"          
##  [41] "mosque_count_1000"           "leisure_count_1000"         
##  [43] "sport_count_1000"            "market_count_1000"          
##  [45] "green_part_1500"             "prom_part_1500"             
##  [47] "office_count_1500"           "office_sqm_1500"            
##  [49] "trc_count_1500"              "trc_sqm_1500"               
##  [51] "cafe_count_1500"             "cafe_sum_1500_min_price_avg"
##  [53] "cafe_sum_1500_max_price_avg" "cafe_avg_price_1500"        
##  [55] "cafe_count_1500_na_price"    "cafe_count_1500_price_500"  
##  [57] "cafe_count_1500_price_1000"  "cafe_count_1500_price_1500" 
##  [59] "cafe_count_1500_price_2500"  "cafe_count_1500_price_4000" 
##  [61] "cafe_count_1500_price_high"  "big_church_count_1500"      
##  [63] "church_count_1500"           "mosque_count_1500"          
##  [65] "leisure_count_1500"          "sport_count_1500"           
##  [67] "market_count_1500"           "green_part_2000"            
##  [69] "prom_part_2000"              "office_count_2000"          
##  [71] "office_sqm_2000"             "trc_count_2000"             
##  [73] "trc_sqm_2000"                "cafe_count_2000"            
##  [75] "cafe_sum_2000_min_price_avg" "cafe_sum_2000_max_price_avg"
##  [77] "cafe_avg_price_2000"         "cafe_count_2000_na_price"   
##  [79] "cafe_count_2000_price_500"   "cafe_count_2000_price_1000" 
##  [81] "cafe_count_2000_price_1500"  "cafe_count_2000_price_2500" 
##  [83] "cafe_count_2000_price_4000"  "cafe_count_2000_price_high" 
##  [85] "big_church_count_2000"       "church_count_2000"          
##  [87] "mosque_count_2000"           "leisure_count_2000"         
##  [89] "sport_count_2000"            "market_count_2000"          
##  [91] "green_part_3000"             "prom_part_3000"             
##  [93] "office_count_3000"           "office_sqm_3000"            
##  [95] "trc_count_3000"              "trc_sqm_3000"               
##  [97] "cafe_count_3000"             "cafe_sum_3000_min_price_avg"
##  [99] "cafe_sum_3000_max_price_avg" "cafe_avg_price_3000"        
## [101] "cafe_count_3000_na_price"    "cafe_count_3000_price_500"  
## [103] "cafe_count_3000_price_1000"  "cafe_count_3000_price_1500" 
## [105] "cafe_count_3000_price_2500"  "cafe_count_3000_price_4000" 
## [107] "cafe_count_3000_price_high"  "big_church_count_3000"      
## [109] "church_count_3000"           "mosque_count_3000"          
## [111] "leisure_count_3000"          "sport_count_3000"           
## [113] "market_count_3000"           "green_part_5000"            
## [115] "prom_part_5000"              "office_count_5000"          
## [117] "office_sqm_5000"             "trc_count_5000"             
## [119] "trc_sqm_5000"                "cafe_count_5000"            
## [121] "cafe_sum_5000_min_price_avg" "cafe_sum_5000_max_price_avg"
## [123] "cafe_avg_price_5000"         "cafe_count_5000_na_price"   
## [125] "cafe_count_5000_price_500"   "cafe_count_5000_price_1000" 
## [127] "cafe_count_5000_price_1500"  "cafe_count_5000_price_2500" 
## [129] "cafe_count_5000_price_4000"  "cafe_count_5000_price_high" 
## [131] "big_church_count_5000"       "church_count_5000"          
## [133] "mosque_count_5000"           "leisure_count_5000"         
## [135] "sport_count_5000"            "market_count_5000"

5.- This group gather different demographic, social and economic data: GDP, Balance Trade, Employment, Population Natural Increase, Migrations, Rent Prices, Hospital Occupancy and so on.

names(finalData[276:370])
##  [1] "price_doc"                                 
##  [2] "oil_urals"                                 
##  [3] "gdp_quart"                                 
##  [4] "gdp_quart_growth"                          
##  [5] "cpi"                                       
##  [6] "ppi"                                       
##  [7] "gdp_deflator"                              
##  [8] "balance_trade"                             
##  [9] "balance_trade_growth"                      
## [10] "usdrub"                                    
## [11] "eurrub"                                    
## [12] "brent"                                     
## [13] "net_capital_export"                        
## [14] "gdp_annual"                                
## [15] "gdp_annual_growth"                         
## [16] "average_provision_of_build_contract"       
## [17] "average_provision_of_build_contract_moscow"
## [18] "rts"                                       
## [19] "micex"                                     
## [20] "micex_rgbi_tr"                             
## [21] "micex_cbi_tr"                              
## [22] "deposits_value"                            
## [23] "deposits_growth"                           
## [24] "deposits_rate"                             
## [25] "mortgage_value"                            
## [26] "mortgage_growth"                           
## [27] "mortgage_rate"                             
## [28] "grp"                                       
## [29] "grp_growth"                                
## [30] "income_per_cap"                            
## [31] "real_dispos_income_per_cap_growth"         
## [32] "salary"                                    
## [33] "salary_growth"                             
## [34] "fixed_basket"                              
## [35] "retail_trade_turnover"                     
## [36] "retail_trade_turnover_per_cap"             
## [37] "retail_trade_turnover_growth"              
## [38] "labor_force"                               
## [39] "unemployment"                              
## [40] "employment"                                
## [41] "invest_fixed_capital_per_cap"              
## [42] "invest_fixed_assets"                       
## [43] "profitable_enterpr_share"                  
## [44] "unprofitable_enterpr_share"                
## [45] "share_own_revenues"                        
## [46] "overdue_wages_per_cap"                     
## [47] "fin_res_per_cap"                           
## [48] "marriages_per_1000_cap"                    
## [49] "divorce_rate"                              
## [50] "construction_value"                        
## [51] "invest_fixed_assets_phys"                  
## [52] "pop_natural_increase"                      
## [53] "pop_migration"                             
## [54] "pop_total_inc"                             
## [55] "childbirth"                                
## [56] "mortality"                                 
## [57] "housing_fund_sqm"                          
## [58] "lodging_sqm_per_cap"                       
## [59] "water_pipes_share"                         
## [60] "baths_share"                               
## [61] "sewerage_share"                            
## [62] "gas_share"                                 
## [63] "hot_water_share"                           
## [64] "electric_stove_share"                      
## [65] "heating_share"                             
## [66] "old_house_share"                           
## [67] "average_life_exp"                          
## [68] "infant_mortarity_per_1000_cap"             
## [69] "perinatal_mort_per_1000_cap"               
## [70] "incidence_population"                      
## [71] "rent_price_4.room_bus"                     
## [72] "rent_price_3room_bus"                      
## [73] "rent_price_2room_bus"                      
## [74] "rent_price_1room_bus"                      
## [75] "rent_price_3room_eco"                      
## [76] "rent_price_2room_eco"                      
## [77] "rent_price_1room_eco"                      
## [78] "load_of_teachers_preschool_per_teacher"    
## [79] "load_of_teachers_school_per_teacher"       
## [80] "students_state_oneshift"                   
## [81] "provision_doctors"                         
## [82] "provision_nurse"                           
## [83] "load_on_doctors"                           
## [84] "power_clinics"                             
## [85] "hospital_beds_available_per_cap"           
## [86] "hospital_bed_occupancy_per_year"           
## [87] "provision_retail_space_sqm"                
## [88] "provision_retail_space_modern_sqm"         
## [89] "turnover_catering_per_cap"                 
## [90] "theaters_viewers_per_1000_cap"             
## [91] "seats_theather_rfmin_per_100000_cap"       
## [92] "museum_visitis_per_100_cap"                
## [93] "bandwidth_sports"                          
## [94] "population_reg_sports_share"               
## [95] "students_reg_sports_share"