1 General information

We analyze a dataset published on Kaggle. It refers to french employment, salaries, population per town. The aim is to evaluate equality/inequalities in France, and geographical distribution of business according to their size.

Such data are collected by the INSEE. Information regarding the number of firms in every french town, categorized by size can be found here. This dataset contains about 35000 units/per town.

Information about salaries around french town per job categories, age and sex (expressed in average net amount per hour in euro) can be found here. This dataset contains about 5000 units/per town.

Demographic information in France per town, age, sex and living mode can be found here. This dataset contains about 8 million units/per town. Additional info about Population Data can be found here.

These datasets have been pre-processed and put together. The final dataset contains 58 variables and 5022 observations.

1.1 Aim of the study

This project aims to explore structure of French labour market. In particular, we are interested in:

  • evaluating possible inequalities: per towns/region, sex, age, job categories etc.;
  • discover geographical distribution of business according to their size
  • predicting the … using a regression model;
  • reduce the dimensionality of … performing a PCA;
  • explore different algorithms to cluster male/females using …

1.2 Plan for the study

  1. Unsupervised learning:
  • PCA
  • Clustering methods (K-means/Hierarchical)
  1. Supervised learning:
  • GLM
  • Linear/Quadratic Discriminant Analysis
  • KNN
  • Cross-validation
  • Bootstrap
  • Subset selection
  • Shrinkage methods
  • Dimension reduction methods
  1. Description of population demographics in France
  2. Structure of the french labour market
  3. Future works

1.3 Loading tools

Loading all libraries needed throughout the notebook

Import the datasets

setwd("./data")
firms       <- read.csv("base_etablissement_par_tranche_effectif.csv", encoding = "UTF-8")
geo         <- read.csv("name_geographic_information.csv", encoding = "UTF-8")
salary      <- read.csv("net_salary_per_town_categories.csv", encoding = "UTF-8")
population  <- read.csv("population.csv", encoding = "UTF-8")
educ        <- read.csv("level_education.csv", sep =";", encoding = "UTF-8")
categ_socio <- read.csv("Categorie_socioprofessionnelle.csv", sep =";", encoding = "UTF-8")
status_work <- read.csv("Emplois_lieu_travail.csv", sep =";", encoding = "UTF-8")
ineq        <- read.csv("Comparateur_territoires.csv", sep =";", encoding = "UTF-8")
commune     <- read.csv("insee_commune.csv", encoding = "UTF-8")

2 Pre-processing

2.1 Firms data

Assign meaningful names and check the modified data:

names(firms) 
 [1] "CODGEO"   "LIBGEO"   "REG"      "DEP"      "E14TST"   "E14TS0ND" "E14TS1"  
 [8] "E14TS6"   "E14TS10"  "E14TS20"  "E14TS50"  "E14TS100" "E14TS200" "E14TS500"
names(firms)[2:ncol(firms)] <-
  c("town", 
    "regNum",
    "deptNum",
    "total",
    "null",
    "firmsEmpl_1_5",
    "firmsEmpl_6_9",
    "firmsEmpl_10_19",
    "firmsEmpl_20_49",
    "firmsEmpl_50_99",
    "firmsEmpl_100_199",
    "firmsEmpl_200_499",
    "firmsEmpl_500plus")
# preliminary checks
names(firms)
 [1] "CODGEO"            "town"              "regNum"            "deptNum"          
 [5] "total"             "null"              "firmsEmpl_1_5"     "firmsEmpl_6_9"    
 [9] "firmsEmpl_10_19"   "firmsEmpl_20_49"   "firmsEmpl_50_99"   "firmsEmpl_100_199"
[13] "firmsEmpl_200_499" "firmsEmpl_500plus"
head(firms)
str(firms)
'data.frame':   36681 obs. of  14 variables:
 $ CODGEO           : Factor w/ 36681 levels "01001","01002",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ town             : Factor w/ 34142 levels "Aast","Abainville",..: 13659 13661 442 444 460 480 484 581 638 802 ...
 $ regNum           : int  82 82 82 82 82 82 82 82 82 82 ...
 $ deptNum          : Factor w/ 101 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ total            : int  25 10 996 99 4 124 48 22 33 14 ...
 $ null             : int  22 9 577 73 4 87 28 17 23 11 ...
 $ firmsEmpl_1_5    : int  1 1 272 20 0 20 15 4 8 2 ...
 $ firmsEmpl_6_9    : int  2 0 63 3 0 10 2 1 1 1 ...
 $ firmsEmpl_10_19  : int  0 0 46 1 0 5 3 0 0 0 ...
 $ firmsEmpl_20_49  : int  0 0 24 2 0 2 0 0 0 0 ...
 $ firmsEmpl_50_99  : int  0 0 9 0 0 0 0 0 0 0 ...
 $ firmsEmpl_100_199: int  0 0 3 0 0 0 0 0 1 0 ...
 $ firmsEmpl_200_499: int  0 0 2 0 0 0 0 0 0 0 ...
 $ firmsEmpl_500plus: int  0 0 0 0 0 0 0 0 0 0 ...
summary(firms)
     CODGEO                  town           regNum         deptNum     
 01001  :    1   Sainte-Colombe:   14   Min.   : 1.00   62     :  895  
 01002  :    1   Saint-Sauveur :   12   1st Qu.:25.00   02     :  816  
 01004  :    1   Beaulieu      :   11   Median :43.00   80     :  782  
 01005  :    1   Sainte-Marie  :   11   Mean   :49.42   76     :  745  
 01006  :    1   Le Pin        :   10   3rd Qu.:73.00   57     :  730  
 01007  :    1   Saint-Aubin   :   10   Max.   :94.00   14     :  706  
 (Other):36675   (Other)       :36613                   (Other):32007  
     total               null          firmsEmpl_1_5      firmsEmpl_6_9      
 Min.   :     0.0   Min.   :     0.0   Min.   :    0.00   Min.   :    0.000  
 1st Qu.:     8.0   1st Qu.:     6.0   1st Qu.:    1.00   1st Qu.:    0.000  
 Median :    19.0   Median :    14.0   Median :    3.00   Median :    0.000  
 Mean   :   123.5   Mean   :    83.6   Mean   :   27.29   Mean   :    5.221  
 3rd Qu.:    54.0   3rd Qu.:    39.0   3rd Qu.:   11.00   3rd Qu.:    2.000  
 Max.   :427385.0   Max.   :316603.0   Max.   :76368.00   Max.   :14836.000  
                                                                             
 firmsEmpl_10_19   firmsEmpl_20_49    firmsEmpl_50_99     firmsEmpl_100_199 
 Min.   :    0.0   Min.   :   0.000   Min.   :   0.0000   Min.   :  0.0000  
 1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:   0.0000   1st Qu.:  0.0000  
 Median :    0.0   Median :   0.000   Median :   0.0000   Median :  0.0000  
 Mean   :    3.8   Mean   :   2.296   Mean   :   0.7383   Mean   :  0.3324  
 3rd Qu.:    1.0   3rd Qu.:   1.000   3rd Qu.:   0.0000   3rd Qu.:  0.0000  
 Max.   :10829.0   Max.   :5643.000   Max.   :1658.0000   Max.   :812.0000  
                                                                            
 firmsEmpl_200_499  firmsEmpl_500plus  
 Min.   :  0.0000   Min.   :  0.00000  
 1st Qu.:  0.0000   1st Qu.:  0.00000  
 Median :  0.0000   Median :  0.00000  
 Mean   :  0.1728   Mean   :  0.04842  
 3rd Qu.:  0.0000   3rd Qu.:  0.00000  
 Max.   :456.0000   Max.   :180.00000  
                                       
# Check for duplicated data: there is no
sum(duplicated.data.frame(firms))
[1] 0

Categorize firms’ size according to EU standard, but in a slightly different form for medium and large firms (i.e., medium firms have <200 instead of <250 employees):

# merge variables
firms$micro   <- firms$firmsEmpl_1_5 + firms$firmsEmpl_6_9
firms$small   <- firms$firmsEmpl_10_19 + firms$firmsEmpl_20_49
firms$medium  <- firms$firmsEmpl_50_99 + firms$firmsEmpl_100_199
firms$large   <- firms$firmsEmpl_200_499 + firms$firmsEmpl_500plus
# Drop unnecessary (at the moment) columns 
firms <- subset(firms, select = c(CODGEO, town, total, micro, small, medium, large, null))
# check
summary(firms)
     CODGEO                  town           total              micro         
 01001  :    1   Sainte-Colombe:   14   Min.   :     0.0   Min.   :    0.00  
 01002  :    1   Saint-Sauveur :   12   1st Qu.:     8.0   1st Qu.:    1.00  
 01004  :    1   Beaulieu      :   11   Median :    19.0   Median :    4.00  
 01005  :    1   Sainte-Marie  :   11   Mean   :   123.5   Mean   :   32.51  
 01006  :    1   Le Pin        :   10   3rd Qu.:    54.0   3rd Qu.:   13.00  
 01007  :    1   Saint-Aubin   :   10   Max.   :427385.0   Max.   :91204.00  
 (Other):36675   (Other)       :36613                                        
     small               medium             large               null         
 Min.   :    0.000   Min.   :   0.000   Min.   :  0.0000   Min.   :     0.0  
 1st Qu.:    0.000   1st Qu.:   0.000   1st Qu.:  0.0000   1st Qu.:     6.0  
 Median :    0.000   Median :   0.000   Median :  0.0000   Median :    14.0  
 Mean   :    6.097   Mean   :   1.071   Mean   :  0.2212   Mean   :    83.6  
 3rd Qu.:    2.000   3rd Qu.:   0.000   3rd Qu.:  0.0000   3rd Qu.:    39.0  
 Max.   :16472.000   Max.   :2470.000   Max.   :636.0000   Max.   :316603.0  
                                                                             
# there is an obs with more than 316K null data: we check if it is plausible
# get the highest 20 null values
str_firms <- sort(firms$null, decreasing = T)[1:20]
# get their indexes
str_firms_ind <- match(str_firms, firms$null)
# get the corresponding city
firms$town[str_firms_ind]
 [1] Paris                Marseille            Lyon                 Nice                
 [5] Toulouse             Bordeaux             Montpellier          Nantes              
 [9] Strasbourg           Lille                Aix-en-Provence      Boulogne-Billancourt
[13] Fort-de-France       Rennes               Grenoble             Toulon              
[17] Cannes               Saint-Denis          Neuilly-sur-Seine    Nîmes               
34142 Levels: Aast Abainville Abancourt Abaucourt Abaucourt-Hautecourt ... Zuytpeene
# they are the largest cities, hence it seems reasonable..

2.2 Geographical data

Assign names and remove some variables:

names(geo)
 [1] "EU_circo"               "code_région"            "nom_région"            
 [4] "chef.lieu_région"       "numéro_département"     "nom_département"       
 [7] "préfecture"             "numéro_circonscription" "nom_commune"           
[10] "codes_postaux"          "code_insee"             "latitude"              
[13] "longitude"              "éloignement"           
names(geo)[c(2:11, 14)] =
  c("code_region",
    "region", 
    "region_capital",
    "number_depart",
    "department", 
    "prefecture",
    "circons",
    "town_name", 
    "postal_code", 
    "CODGEO",
    "eloignement")
# drop unnecessary columns (code/num and name represents same thing) 
geo <- subset(geo, select = -c(EU_circo, code_region, number_depart, prefecture, circons, eloignement, town_name))
# preliminary checks
names(geo)
[1] "region"         "region_capital" "department"     "postal_code"    "CODGEO"        
[6] "latitude"       "longitude"     
head(geo)
str(geo)
'data.frame':   36840 obs. of  7 variables:
 $ region        : Factor w/ 28 levels "Alsace","Aquitaine",..: 27 27 27 27 27 27 27 27 27 27 ...
 $ region_capital: Factor w/ 28 levels "Ajaccio","Amiens",..: 14 14 14 14 14 14 14 14 14 14 ...
 $ department    : Factor w/ 102 levels "Ain","Aisne",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ postal_code   : Factor w/ 6106 levels "01000","01090",..: 26 19 29 26 17 1 23 16 17 17 ...
 $ CODGEO        : int  1024 1029 1038 1040 1245 1053 1065 1069 1072 1095 ...
 $ latitude      : num  46.3 46.4 46.3 46.4 46.1 ...
 $ longitude     : Factor w/ 1151 levels "","-","-0,75",..: 874 880 881 864 891 877 872 880 885 892 ...
summary(geo)
           region       region_capital           department     postal_code   
 Midi-Pyrénées: 3028   Toulouse: 3028   Pas-de-Calais :  898   51300  :   46  
 Rhône-Alpes  : 2890   Lyon    : 2890   Aisne         :  816   51800  :   44  
 Lorraine     : 2336   Metz    : 2336   Somme         :  783   70000  :   42  
 Aquitaine    : 2300   Bordeaux: 2300   Seine-Maritime:  747   88500  :   42  
 Picardie     : 2295   Amiens  : 2295   Moselle       :  732   80140  :   40  
 Bourgogne    : 2050   Dijon   : 2050   Côte-d'Or     :  709   02160  :   38  
 (Other)      :21941   (Other) :21941   (Other)       :32155   (Other):36588  
     CODGEO         latitude        longitude    
 Min.   : 1001   Min.   :41.39           : 2841  
 1st Qu.:24577   1st Qu.:45.22   2.433333:  105  
 Median :48191   Median :47.43   2.333333:  100  
 Mean   :46298   Mean   :47.00   1.833333:   99  
 3rd Qu.:67043   3rd Qu.:48.85   2.116667:   98  
 Max.   :97617   Max.   :51.08   2.25    :   94  
                 NA's   :2929    (Other) :33503  

Correct typos for longitude data and keep just the unique CODGEO to avoid towns with multiple postal codes:

# spot "," instead of "." in longitude
newLong       <- as.character(geo$longitude)    # copy the vector
sum(grep(",", newLong))                         # total commas
[1] 994302
ind_long_err  <- grep(",", newLong)             # indexing them
newLong       <- gsub(",", ".", newLong)        # substituting them with dots
indNA_Long    <- is.na(as.numeric((newLong)))   # spot NA
NAs introduced by coercion
# geo$longitude[indNA_Long]                       # verify that they were actually missing
geo$longitude <- as.numeric(newLong)            # overwrite the longitude variable with the new one
NAs introduced by coercion
# Check for duplicated data (e.g., cities with different postal codes, that we dropped):
  # e.g., to verify it,  try on the initial dataset:
  # sum(geo$nom_commune == "Paris")
  # ind_duplic <- geo$nom_commune == "Paris"
  # geo[ind_duplic,]
sum(duplicated.data.frame(geo)) 
[1] 118
# retaing unique postal cities
geo <- geo[!duplicated(geo$CODGEO),]
# check again
summary(geo)
           region       region_capital           department     postal_code   
 Midi-Pyrénées: 3020   Toulouse: 3020   Pas-de-Calais :  894   51300  :   46  
 Rhône-Alpes  : 2879   Lyon    : 2879   Aisne         :  816   51800  :   44  
 Lorraine     : 2333   Metz    : 2333   Somme         :  782   70000  :   42  
 Aquitaine    : 2296   Bordeaux: 2296   Seine-Maritime:  745   88500  :   42  
 Picardie     : 2291   Amiens  : 2291   Moselle       :  730   80140  :   40  
 Bourgogne    : 2046   Dijon   : 2046   Côte-d'Or     :  707   02160  :   38  
 (Other)      :21828   (Other) :21828   (Other)       :32019   (Other):36441  
     CODGEO         latitude       longitude      
 Min.   : 1001   Min.   :41.39   Min.   :-5.1000  
 1st Qu.:24561   1st Qu.:45.22   1st Qu.: 0.6833  
 Median :48172   Median :47.43   Median : 2.6167  
 Mean   :46264   Mean   :47.00   Mean   : 2.7324  
 3rd Qu.:67020   3rd Qu.:48.85   3rd Qu.: 4.8500  
 Max.   :97617   Max.   :51.08   Max.   : 9.5167  
                 NA's   :2923    NA's   :2902     

Assign latitude and longitude values for missing data (almost 3000). The code used to retrieve them using Google API has been commented and its result is loaded.

# index of NAs and their total
indNA_coord = is.na(geo$latitude) | is.na(geo$longitude)
sum(indNA_coord)
[1] 2987
# code used to retrieve the NA using Google API, which have been saved in a csv file
# 
# # initialize variables
# city_search = 0
# res = as.data.frame(matrix(c(0, 0, 0), 1, 3))
# names(res) = c("lon", "lat", "address")
# 
# # retrieve lat and long (Google API = 2500 request per day)
# # my_iter = floor(sum(indNA_coord)/3)
# for (i in 1:sum(indNA_coord)){
# 
#   # city searched
#   city_search[i] = paste(c(as.character(NA_coord$town_name[i]), as.character(NA_coord$postal_code[i]), as.character(NA_coord$department[i]), "France"), sep=" ", collapse = ", ")
#   
#   # solution
#   res[i,] = geocode(city_search[i], output = "latlona", source = c("google", "dsk"), messaging = FALSE)
# 
#   # retrieve still missing data, because of existing problems with API (up to 15 trials)
#   j = 0
#   while (any(is.na(res[i,])) & j < 25){
#     res[i,] = geocode(city_search[i], output = "latlona", source = c("google", "dsk"), messaging = FALSE)
#     j = j + 1
#   }
# }
# # check the solution
# sol = cbind(searched = city_search, res)
# # save it as a csv file to save time
# write.csv(retrieved_geo_NA[,2:3], "geo_NA_Final.csv", quote = FALSE, row.names=FALSE, fileEncoding = "UTF-8")
    
# read the created csv
setwd("./data")
retrieved_geo_NA = read.csv("geo_NA_Final.csv", header = T, encoding = "UTF-8")
# get only long and lat and assign to original NA 
geo$latitude[indNA_coord] = retrieved_geo_NA[,2]
geo$longitude[indNA_coord] = retrieved_geo_NA[,1]
# there are 37 still missing units, which are towns located in old colonies far from Europe
indNA_coord = is.na(geo$latitude) | is.na(geo$longitude)
sum(indNA_coord)
[1] 35
# exclude those towns
geo = geo[!indNA_coord,]
summary(geo)
           region       region_capital           department     postal_code   
 Midi-Pyrénées: 3020   Toulouse: 3020   Pas-de-Calais :  894   51300  :   46  
 Rhône-Alpes  : 2879   Lyon    : 2879   Aisne         :  816   51800  :   44  
 Lorraine     : 2333   Metz    : 2333   Somme         :  782   70000  :   42  
 Aquitaine    : 2296   Bordeaux: 2296   Seine-Maritime:  745   88500  :   42  
 Picardie     : 2291   Amiens  : 2291   Moselle       :  730   80140  :   40  
 Bourgogne    : 2046   Dijon   : 2046   Côte-d'Or     :  707   02160  :   38  
 (Other)      :21793   (Other) :21793   (Other)       :31984   (Other):36406  
     CODGEO         latitude        longitude       
 Min.   : 1001   Min.   :-21.38   Min.   :-63.0885  
 1st Qu.:24551   1st Qu.: 45.15   1st Qu.:  0.6585  
 Median :48154   Median : 47.38   Median :  2.6333  
 Mean   :46215   Mean   : 46.87   Mean   :  2.6611  
 3rd Qu.:66225   3rd Qu.: 48.83   3rd Qu.:  4.8667  
 Max.   :97613   Max.   : 51.08   Max.   : 55.8250  
                                                    

Remove DOM-TOM towns (i.e. old colonies far from Europe)

# delete non-European countries
ind_nonEur = geo$latitude < 30 | geo$latitude > 70 |geo$longitude < -20 | geo$longitude > 20
sum(ind_nonEur)
[1] 92
geo = geo[!ind_nonEur,]

2.3 Salary data

Assign meaningful names and check the data:

names(salary)
 [1] "CODGEO"    "LIBGEO"    "SNHM14"    "SNHMC14"   "SNHMP14"   "SNHME14"   "SNHMO14"  
 [8] "SNHMF14"   "SNHMFC14"  "SNHMFP14"  "SNHMFE14"  "SNHMFO14"  "SNHMH14"   "SNHMHC14" 
[15] "SNHMHP14"  "SNHMHE14"  "SNHMHO14"  "SNHM1814"  "SNHM2614"  "SNHM5014"  "SNHMF1814"
[22] "SNHMF2614" "SNHMF5014" "SNHMH1814" "SNHMH2614" "SNHMH5014"
names(salary)[2:ncol(salary)] <-
  c("town",
    "sal_general",    
    "sal_executive",
    "sal_midManager",
    "sal_employee",
    "sal_worker",
    "sal_Females",
    "sal_F_executive",
    "sal_F_midManager",
    "sal_F_employee",
    "sal_F_worker",
    "sal_Males",
    "sal_M_executive",
    "sal_M_midManager",
    "sal_M_employee",
    "sal_M_worker",
    "sal_18_25",
    "sal_26_50",
    "sal_51plus",
    "sal_F_18_25",
    "sal_F_26_50",
    "sal_F_51plus",
    "sal_M_18_25",
    "sal_M_26_50",
    "sal_M_51plus")
# preliminary checks
names(salary)
 [1] "CODGEO"           "town"             "sal_general"      "sal_executive"   
 [5] "sal_midManager"   "sal_employee"     "sal_worker"       "sal_Females"     
 [9] "sal_F_executive"  "sal_F_midManager" "sal_F_employee"   "sal_F_worker"    
[13] "sal_Males"        "sal_M_executive"  "sal_M_midManager" "sal_M_employee"  
[17] "sal_M_worker"     "sal_18_25"        "sal_26_50"        "sal_51plus"      
[21] "sal_F_18_25"      "sal_F_26_50"      "sal_F_51plus"     "sal_M_18_25"     
[25] "sal_M_26_50"      "sal_M_51plus"    
head(salary)
str(salary)
'data.frame':   5136 obs. of  26 variables:
 $ CODGEO          : Factor w/ 5136 levels "01004","01007",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ town            : Factor w/ 5085 levels "Abbeville","Ablis",..: 73 79 133 192 275 298 407 395 400 406 ...
 $ sal_general     : num  13.7 13.5 13.5 12.9 13 13.9 12.4 14 11.5 12.4 ...
 $ sal_executive   : num  24.2 22.1 27.6 21.8 22.8 22.2 24 23.1 21.2 23.4 ...
 $ sal_midManager  : num  15.5 14.7 15.6 14.1 14.1 15.1 13.1 15.3 13.5 14.1 ...
 $ sal_employee    : num  10.3 10.7 11.1 11 10.5 11 10.5 10.9 9.9 10.3 ...
 $ sal_worker      : num  11.2 11.4 11.1 11.3 11.1 11.4 10.4 11.3 10.5 10.5 ...
 $ sal_Females     : num  11.6 11.9 10.9 11.4 11.6 12.5 10.9 12.4 10.3 11 ...
 $ sal_F_executive : num  19.1 19 19.5 19 19.4 20.3 20.7 20.5 20.8 21.5 ...
 $ sal_F_midManager: num  13.2 13.3 11.7 13 13.6 14 11.8 13.9 12.3 13 ...
 $ sal_F_employee  : num  10.1 10.6 10.8 10.3 10.2 10.9 10.4 10.7 9.8 9.9 ...
 $ sal_F_worker    : num  9.6 10 9.5 9.9 9.8 10.5 9.3 10.3 9 9.5 ...
 $ sal_Males       : num  15 14.7 15.3 13.8 13.8 15.2 13.4 15.4 12.3 13.2 ...
 $ sal_M_executive : num  26.4 23.3 30.2 23 24.1 23.1 25.2 24.4 21.3 24 ...
 $ sal_M_midManager: num  16.7 15.8 17.2 14.7 14.4 15.9 13.8 16.3 14.2 14.9 ...
 $ sal_M_employee  : num  11 11.3 12.4 13.2 11.7 12.1 10.8 11.8 10.5 11.6 ...
 $ sal_M_worker    : num  11.6 11.7 11.8 11.6 11.4 11.7 10.8 11.6 11 10.9 ...
 $ sal_18_25       : num  10.5 9.8 9.3 9.6 9.4 9.7 9.3 9.7 9.6 9.7 ...
 $ sal_26_50       : num  13.7 13.8 13.3 12.9 12.8 14.1 12.5 13.9 11.5 12.3 ...
 $ sal_51plus      : num  16.1 14.6 16 14.2 15.2 15.4 13.3 16.7 12.7 13.7 ...
 $ sal_F_18_25     : num  9.7 9.2 8.9 9.3 9 9.5 8.9 9.7 9.2 9.3 ...
 $ sal_F_26_50     : num  11.8 12.2 10.6 11.4 11.8 12.8 11 12.4 10.3 11.2 ...
 $ sal_F_51plus    : num  12.5 12.5 12.5 12.2 12.3 13 11.5 13.8 11.3 11.4 ...
 $ sal_M_18_25     : num  11 10.2 9.6 9.7 9.7 9.9 9.6 9.6 10 9.9 ...
 $ sal_M_26_50     : num  14.9 14.9 15.1 13.8 13.4 15.3 13.3 15 12.3 13 ...
 $ sal_M_51plus    : num  18.6 16.4 18.6 15.9 16.9 17.2 14.9 19.3 13.9 15.4 ...
summary(salary)
     CODGEO               town       sal_general    sal_executive  sal_midManager 
 01004  :   1   Sainte-Marie:   4   Min.   :10.20   Min.   :16.0   Min.   :11.60  
 01007  :   1   Saint-Ouen  :   3   1st Qu.:12.10   1st Qu.:21.9   1st Qu.:13.80  
 01014  :   1   Allonnes    :   2   Median :13.00   Median :23.2   Median :14.40  
 01024  :   1   Andilly     :   2   Mean   :13.71   Mean   :23.7   Mean   :14.58  
 01025  :   1   Bassens     :   2   3rd Qu.:14.40   3rd Qu.:24.9   3rd Qu.:15.10  
 01027  :   1   Beaumont    :   2   Max.   :43.30   Max.   :51.5   Max.   :54.60  
 (Other):5130   (Other)     :5121                                                 
  sal_employee     sal_worker     sal_Females    sal_F_executive sal_F_midManager
 Min.   : 8.70   Min.   : 8.30   Min.   : 9.30   Min.   :12.00   Min.   :10.60   
 1st Qu.:10.00   1st Qu.:10.60   1st Qu.:10.90   1st Qu.:18.80   1st Qu.:12.60   
 Median :10.40   Median :11.00   Median :11.50   Median :20.00   Median :13.10   
 Mean   :10.56   Mean   :11.24   Mean   :12.04   Mean   :20.22   Mean   :13.27   
 3rd Qu.:10.90   3rd Qu.:11.60   3rd Qu.:12.70   3rd Qu.:21.40   3rd Qu.:13.80   
 Max.   :17.50   Max.   :46.30   Max.   :26.70   Max.   :35.50   Max.   :19.00   
                                                                                 
 sal_F_employee   sal_F_worker      sal_Males     sal_M_executive sal_M_midManager
 Min.   : 8.70   Min.   : 6.100   Min.   :10.40   Min.   :13.8    Min.   :11.80   
 1st Qu.: 9.80   1st Qu.: 9.200   1st Qu.:12.90   1st Qu.:23.1    1st Qu.:14.50   
 Median :10.10   Median : 9.700   Median :14.10   Median :24.6    Median :15.20   
 Mean   :10.31   Mean   : 9.827   Mean   :14.85   Mean   :25.2    Mean   :15.49   
 3rd Qu.:10.60   3rd Qu.:10.200   3rd Qu.:15.80   3rd Qu.:26.6    3rd Qu.:16.00   
 Max.   :16.10   Max.   :28.100   Max.   :52.40   Max.   :58.0    Max.   :93.40   
                                                                                  
 sal_M_employee   sal_M_worker    sal_18_25       sal_26_50      sal_51plus   
 Min.   : 8.00   Min.   : 8.9   Min.   : 7.90   Min.   : 9.7   Min.   :10.50  
 1st Qu.:10.50   1st Qu.:10.8   1st Qu.: 9.20   1st Qu.:12.0   1st Qu.:13.70  
 Median :11.10   Median :11.3   Median : 9.50   Median :12.9   Median :15.00  
 Mean   :11.27   Mean   :11.5   Mean   : 9.55   Mean   :13.5   Mean   :15.88  
 3rd Qu.:11.80   3rd Qu.:11.9   3rd Qu.: 9.70   3rd Qu.:14.3   3rd Qu.:16.90  
 Max.   :23.50   Max.   :53.2   Max.   :60.60   Max.   :38.1   Max.   :56.90  
                                                                              
  sal_F_18_25      sal_F_26_50     sal_F_51plus    sal_M_18_25      sal_M_26_50   
 Min.   : 7.500   Min.   : 9.10   Min.   : 9.50   Min.   : 7.800   Min.   : 9.60  
 1st Qu.: 8.900   1st Qu.:10.90   1st Qu.:11.70   1st Qu.: 9.400   1st Qu.:12.70  
 Median : 9.100   Median :11.60   Median :12.60   Median : 9.700   Median :13.80  
 Mean   : 9.162   Mean   :12.06   Mean   :13.17   Mean   : 9.821   Mean   :14.49  
 3rd Qu.: 9.400   3rd Qu.:12.70   3rd Qu.:14.00   3rd Qu.:10.000   3rd Qu.:15.50  
 Max.   :12.000   Max.   :26.60   Max.   :31.00   Max.   :93.300   Max.   :45.40  
                                                                                  
  sal_M_51plus  
 Min.   :10.80  
 1st Qu.:14.90  
 Median :16.60  
 Mean   :17.68  
 3rd Qu.:19.00  
 Max.   :68.60  
                
# Check for duplicated data: there are no
sum(duplicated.data.frame(salary))
[1] 0
# drop unnecessary variable
salary <-subset(salary, select = -c(town))

2.4 Population data

Rename the variables for population and exclude the unnecessary ones:

#names(population)
names(population)[5:7] <-
  c("ageCateg5",
    "sex",
    "peopleCategNum")
# drop unnecessary columns (NIVGEO is the same for all)
population <- subset(population, select = -c(NIVGEO, LIBGEO))
# Refactor sex and MOCO
population$MOCO <- factor(population$MOCO, levels = c(11,12,21,22,23,31,32),
                          labels = c("children_living_with_two_parents", 
                                     "children_living_with_one_parent",
                                     "adults_living_in_couple_without_child",
                                     "adults_living_in_couple_with_children",
                                     "adults_living_alone_with_children",
                                     "persons_not_from_family_living_in_the_home",
                                     "persons_living_alone"))
population$sex <- factor(population$sex, levels = c(1,2), labels = c("Male", "Female"))
# check again
summary(population)
     CODGEO                                                MOCO           ageCateg5 
 01001  :    238   children_living_with_two_parents          :1219512   Min.   : 0  
 01002  :    238   children_living_with_one_parent           :1219512   1st Qu.:20  
 01004  :    238   adults_living_in_couple_without_child     :1219512   Median :40  
 01005  :    238   adults_living_in_couple_with_children     :1219512   Mean   :40  
 01006  :    238   adults_living_alone_with_children         :1219512   3rd Qu.:60  
 01007  :    238   persons_not_from_family_living_in_the_home:1219512   Max.   :80  
 (Other):8535156   persons_living_alone                      :1219512               
     sex          peopleCategNum    
 Male  :4268292   Min.   :    0.00  
 Female:4268292   1st Qu.:    0.00  
                  Median :    0.00  
                  Mean   :    7.45  
                  3rd Qu.:    3.00  
                  Max.   :48873.00  
                                    

Understand the distribution of population according to different age categories using population pyramide (performed now because later the dataset will be modified):

# need the initial shape of data to construct the pyramide 
# Population pyramide
population_data2 <- ddply(population, .(sex, ageCateg5), function(population) {
  data.frame(total_population = sum(population$peopleCategNum))
  })
pop_pyramid <- ggplot(data = population_data2,
       mapping = aes(x = ageCateg5, fill = sex,
                     y = ifelse(test = sex == "Male",
                                yes = -total_population, no = total_population))) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = abs, limits = max(population_data2$total_population) * c(-1,1)) + 
  ggtitle("Pyramid of Population") + theme(plot.title = element_text(hjust = 0.5)) + 
  labs(x= "Age")+ labs(y = "Population") + coord_flip() # + scale_fill_brewer(palette = "Set1")
pop_pyramid

Re-organize population dataset and set CODGEO as units and create new variables using the available variables.

# Re-organize the data by creating new variables: "total population", "male", "female", "child", "elderly" and "workforce".
population <- ddply(population, .(CODGEO), function(population) {
  data.frame(total_population = sum(population$peopleCategNum),
             male = sum(population[population$sex == "Male",]$peopleCategNum),
             female = sum(population[population$sex == "Female",]$peopleCategNum),
             child = sum(population[population$ageCateg5 %in% seq(0, 10, by=5),]$peopleCategNum),
             elderly = sum(population[population$ageCateg5 %in% seq(65, 80, by=5),]$peopleCategNum),
             workforce = sum(population[population$ageCateg5 %in% seq(15, 60, by=5),]$peopleCategNum) 
  )})
# Calculate ratios using the existing variables
population$dependent <- population$child + population$elderly
population$sex_ratio <- ifelse(population$female==0, 0, population$male / population$female)
population$dependency_ratio <- ifelse(population$workforce==0, 0, population$dependent / population$workforce)
population$aged_dependency_ratio <- ifelse(population$workforce==0, 0, population$elderly / population$workforce)
population$child_dependency_ratio <- ifelse(population$workforce==0, 0, population$child / population$workforce)
# total population=65mln, which is reasonable
sum(population$total_population)
[1] 63569750
# create dummy variable for urban city
# taking the top 20% in terms of population
population$urbanDummy = population$total_population > quantile(population$total_population, 0.90)

2.5 Education data

Assign meaningful names and build some useful indicators:

names(educ)[3:ncol(educ)] <-
  c("Age15_NoDip_M","Age15_NoDip_F", "Age15_Sec_M", "Age15_Sec_F", "Age15_Hi_M","Age15_Hi_F", "Age15_Univ_M",
    "Age15_Univ_F", "Age20_NoDip_M", "Age20_NoDip_F", "Age20_Sec_M","Age20_Sec_F", "Age20_Hi_M", "Age20_Hi_F",
    "Age20_Univ_M", "Age20_Univ_F","Age25_NoDip_M", "Age25_NoDip_F","Age25_Sec_M","Age25_Sec_F", "Age25_Hi_M",
    "Age25_Hi_F","Age25_Univ_M","Age25_Univ_F", "Age30_NoDip_M", "Age30_NoDip_F", "Age30_Sec_M", "Age30_Sec_F",
    "Age30_Hi_M", "Age30_Hi_F","Age30_Univ_M","Age30_Univ_F", "Age35_NoDip_M", "Age35_NoDip_F", "Age35_Sec_M",
    "Age35_Sec_F", "Age35_Hi_M", "Age35_Hi_F", "Age35_Univ_M","Age35_Univ_F", "Age40_NoDip_M", "Age40_NoDip_F",
    "Age40_Sec_M","Age40_Sec_F", "Age40_Hi_M", "Age40_Hi_F", "Age40_Univ_M","Age40_Univ_F", "Age45_NoDip_M",
    "Age45_NoDip_F","Age45_Sec_M","Age45_Sec_F", "Age45_Hi_M", "Age45_Hi_F", "Age45_Univ_M","Age45_Univ_F",
    "Age50_NoDip_M", "Age50_NoDip_F","Age50_Sec_M","Age50_Sec_F", "Age50_Hi_M", "Age50_Hi_F", "Age50_Univ_M",
    "Age50_Univ_F", "Age55_NoDip_M", "Age55_NoDip_F","Age55_Sec_M", "Age55_Sec_F", "Age55_Hi_M", "Age55_Hi_F",
    "Age55_Univ_M", "Age55_Univ_F", "Age60_NoDip_M", "Age60_NoDip_F","Age60_Sec_M","Age60_Sec_F", "Age60_Hi_M",
    "Age60_Hi_F", "Age60_Univ_M", "Age60_Univ_F", "Age65_NoDip_M", "Age65_NoDip_F","Age65_Sec_M","Age65_Sec_F",
    "Age65_Hi_M", "Age65_Hi_F", "Age65_Univ_M", "Age65_Univ_F")
# Aggregate variables by grouping all the education level categories by gender
# total number of females with no diploma
educ$female_NoDip <- rowSums(cbind(educ$Age15_NoDip_F, educ$Age20_NoDip_F, educ$Age25_NoDip_F, educ$Age30_NoDip_F,
                                   educ$Age35_NoDip_F,educ$Age40_NoDip_F,educ$Age45_NoDip_F, educ$Age50_NoDip_F,
                                   educ$Age55_NoDip_F, educ$Age60_NoDip_F, educ$Age65_NoDip_F))
# total number of males with no diploma
educ$male_NoDip <- rowSums(cbind(educ$Age15_NoDip_M, educ$Age20_NoDip_M, educ$Age25_NoDip_M, educ$Age30_NoDip_M,
                                 educ$Age35_NoDip_M,educ$Age40_NoDip_M,educ$Age45_NoDip_M, educ$Age50_NoDip_M,
                                 educ$Age55_NoDip_M, educ$Age60_NoDip_M, educ$Age65_NoDip_M))
# total number of females with secondary education level
educ$female_Sec <- rowSums(cbind(educ$Age15_Sec_F, educ$Age20_Sec_F, educ$Age25_Sec_F, educ$Age30_Sec_F,
                                 educ$Age35_Sec_F,educ$Age40_Sec_F,educ$Age45_Sec_F, educ$Age50_Sec_F, 
                                 educ$Age55_Sec_F, educ$Age60_Sec_F, educ$Age65_Sec_F))
# total number of males with secondary education level
educ$male_Sec <- rowSums(cbind(educ$Age15_Sec_M, educ$Age20_Sec_M, educ$Age25_Sec_M, educ$Age30_Sec_M,
                               educ$Age35_Sec_M,educ$Age40_Sec_M,educ$Age45_Sec_M, educ$Age50_Sec_M, 
                               educ$Age55_Sec_M, educ$Age60_Sec_M, educ$Age65_Sec_M))
# total number of females with high-school education level
educ$female_Hi <- rowSums(cbind(educ$Age15_Hi_F, educ$Age20_Hi_F, educ$Age25_Hi_F, educ$Age30_Hi_F,
                                educ$Age35_Hi_F,educ$Age40_Hi_F,educ$Age45_Hi_F, educ$Age50_Hi_F, 
                                educ$Age55_Hi_F, educ$Age60_Hi_F, educ$Age65_Hi_F))
# total number of males with high-school education level
educ$male_Hi <- rowSums(cbind(educ$Age15_Hi_M, educ$Age20_Hi_M, educ$Age25_Hi_M, educ$Age30_Hi_M,
                              educ$Age35_Hi_M,educ$Age40_Hi_M,educ$Age45_Hi_M, educ$Age50_Hi_M, 
                              educ$Age55_Hi_M, educ$Age60_Hi_M, educ$Age65_Hi_M))
# total number of females with university degree
educ$female_Univ <- rowSums(cbind(educ$Age15_Univ_F, educ$Age20_Univ_F, educ$Age25_Univ_F, educ$Age30_Univ_F,
                                  educ$Age35_Univ_F,educ$Age40_Univ_F,educ$Age45_Univ_F, educ$Age50_Univ_F,
                                  educ$Age55_Univ_F, educ$Age60_Univ_F, educ$Age65_Univ_F))
# total number of males with university degree
educ$male_Univ <- rowSums(cbind(educ$Age15_Univ_M, educ$Age20_Univ_M, educ$Age25_Univ_M, educ$Age30_Univ_M,
                                educ$Age35_Univ_M,educ$Age40_Univ_M,educ$Age45_Univ_M, educ$Age50_Univ_M,
                                educ$Age55_Univ_M, educ$Age60_Univ_M, educ$Age65_Univ_M))
# drop all the unnecessary variables 
educ <- subset(educ, select= c(CODGEO,
                               female_NoDip, male_NoDip, 
                               female_Sec, male_Sec, 
                               female_Hi, male_Hi, 
                               female_Univ, male_Univ))
# total number of population without a diploma
educ$nodip <-rowSums(cbind(educ$female_NoDip, educ$male_NoDip))
# total number of population with secondary level of education
educ$sec <-rowSums(cbind(educ$female_Sec, educ$male_NoDip))
# total number of population with highschool diploma
educ$high <-rowSums(cbind(educ$female_Hi, educ$male_Hi))
# total number of poopulation with university diploma
educ$univ <- rowSums(cbind(educ$female_Univ, educ$male_Univ))

2.6 Demographic/social profiles data

Rename the original variables

# rename variables
names(categ_socio)[3:ncol(categ_socio)] <-
  c("M_immi_agri",
    "F_immi_agri",
    "M_NoImmi_agri",
    "F_NoImmi_agri",
    "M_immi_comm",
    "F_immi_comm",
    "M_NoImmi_comm",
    "F_NoImmi_comm",
    "M_immi_exec",
    "F_immi_exec",
    "M_NoImmi_exec",
    "F_NoImmi_exec",
    "M_immi_midman",
    "F_immi_midman",
    "M_NoImmi_midman",
    "F_NoImmi_midman",
    "M_immi_emp",
    "F_immi_emp",
    "M_NoImmi_emp",
    "F_NoImmi_emp",
    "M_immi_worker",
    "F_immi_worker",
    "M_NoImmi_worker",
    "F_NoImmi_worker",
    "M_immi_retired",
    "F_immi_retired",
    "M_NoImmi_retired",
    "F_NoImmi_retired",
    "M_immi_noAct",
    "F_immi_noAct",
    "M_NoImmi_noAct",
    "F_NoImmi_noAct")
# drop LIBGEO
categ_socio <- subset(categ_socio, select=-c(LIBGEO))

Create the total number of immigrants and natives per town, also separeted by gender.

# total number of male immigrants per town 
categ_socio$male_immig <-rowSums(cbind(categ_socio$M_immi_agri, categ_socio$M_immi_comm, categ_socio$M_immi_emp,  
                                       categ_socio$M_immi_exec, categ_socio$M_immi_midman,categ_socio$M_immi_noAct, 
                                       categ_socio$M_immi_retired, categ_socio$M_immi_worker))
 
# total number of female immigrants per town
categ_socio$female_immig <- rowSums(cbind(categ_socio$F_immi_agri,categ_socio$F_immi_comm, categ_socio$F_immi_emp, 
                                          categ_socio$F_immi_exec, categ_socio$F_immi_midman, 
                                          categ_socio$F_immi_noAct, categ_socio$F_immi_retired, categ_socio$F_immi_worker))
# total number immigrants per town
categ_socio$total_immig <- categ_socio$male_immig + categ_socio$female_immig
# working male immigrants per town 
categ_socio$male_working_immig <- categ_socio$male_immig - categ_socio$M_immi_retired 
 
# working female immigrants per town  
categ_socio$female_working_immig <- categ_socio$female_immig - categ_socio$F_immi_retired 
 
# total number of male natives per town 
categ_socio$male_native <- rowSums(cbind(categ_socio$M_NoImmi_agri, categ_socio$M_NoImmi_comm, categ_socio$M_NoImmi_exec, categ_socio$M_NoImmi_emp, categ_socio$M_NoImmi_midman, categ_socio$M_NoImmi_noAct, categ_socio$M_NoImmi_retired, categ_socio$M_NoImmi_worker))
# total number of female natives per town 
categ_socio$female_native <- rowSums(cbind(categ_socio$F_NoImmi_agri, categ_socio$F_NoImmi_comm, categ_socio$F_NoImmi_exec, categ_socio$F_NoImmi_emp, categ_socio$F_NoImmi_midman, categ_socio$F_NoImmi_noAct, categ_socio$F_NoImmi_retired, categ_socio$F_NoImmi_worker))
# total number of natives per town 
categ_socio$total_native <- categ_socio$male_native + categ_socio$female_native
# drop unnecessary variables 
categ_socio <- subset(categ_socio, select = c(CODGEO, male_immig, female_immig, total_immig, male_working_immig, 
                                               female_working_immig, male_native, female_native, total_native)) 

2.7 Work status data

Rename the original variables

# rename variables
names(status_work)[3:ncol(status_work)] <-
  c("Less20_M_wagearner_Full", "Less20_M_wagearner_Half", "Less20_M_Indp1_Full", "Less20_M_Indp1_Half",
    "Less20_M_empl_Full", "Less20_M_empl_Half", "Less20_M_trans_Full", "Less20_M_trans_Half",
    "Less20_F_wagearner_Full", "Less20_F_wagearner_Half", "Less20_F_Indp1_Full", "Less20_F_Indp1_Half",
    "Less20_F_empl_Full", "Less20_F_empl_Half", "Less20_F_trans_Full", "Less20_F_trans_Half",
    "t4_M_wagearner_Full", "t4_M_wagearner_Half", "t4_M_Indp1_Full", "t4_M_Indp1_Half", "t4_M_empl_Full",
    "t4_M_empl_Half", "t4_M_trans_Full", "t4_M_trans_Half", "t4_F_wagearner_Full", "t4_F_wagearner_Half",
    "t4_F_Indp1_Full", "t4_F_Indp1_Half", "t4_F_empl_Full", "t4_F_empl_Half", "t4_F_trans_Full", "t4_F_trans_Half",
    "t9_M_wagearner_Full", "t9_M_wagearner_Half", "t9_M_Indp1_Full", "t9_M_Indp1_Half", "t9_M_empl_Full",
    "t9_M_empl_Half", "t9_M_trans_Full", "t9_M_trans_Half", "t9_F_wagearner_Full", "t9_F_wagearner_Half",
    "t9_F_Indp1_Full", "t9_F_Indp1_Half", "t9_F_empl_Full", "t9_F_empl_Half", "t9_F_trans_Full", "t9_F_trans_Half",
    "T4_M_wagearner_Full", "T4_M_wagearner_Half", "T4_M_Indp1_Full", "T4_M_Indp1_Half", "T4_M_empl_Full",
    "T4_M_empl_Half", "T4_M_trans_Full", "T4_M_trans_Half", "T4_F_wagearner_Full", "T4_F_wagearner_Half",
    "T4_F_Indp1_Full", "T4_F_Indp1_Half", "T4_F_empl_Full", "T4_F_empl_Half", "T4_F_trans_Full", "T4_F_trans_Half",
    "T9_M_wagearner_Full", "T9_M_wagearner_Half", "T9_M_Indp1_Full", "T9_M_Indp1_Half", "T9_M_empl_Full",
    "T9_M_empl_Half", "T9_M_trans_Full", "T9_M_trans_Half", "T9_F_wagearner_Full", "T9_F_wagearner_Half",
    "T9_F_Indp1_Full", "T9_F_Indp1_Half", "T9_F_empl_Full", "T9_F_empl_Half", "T9_F_trans_Full", "T9_F_trans_Half",
    "FF_M_wagearner_Full", "FF_M_wagearner_Half", "FF_M_Indp1_Full", "FF_M_Indp1_Half", "FF_M_empl_Full",
    "FF_M_empl_Half", "FF_M_trans_Full", "FF_M_trans_Half", "FF_F_wagearner_Full", "FF_F_wagearner_Half",
    "FF_F_Indp1_Full", "FF_F_Indp1_Half", "FF_F_empl_Full", "FF_F_empl_Half", "FF_F_trans_Full", "FF_F_trans_Half",
    "F9_M_wagearner_Full", "F9_M_wagearner_Half", "F9_M_Indp1_Full", "F9_M_Indp1_Half", "F9_M_empl_Full",
    "F9_M_empl_Half", "F9_M_trans_Full", "F9_M_trans_Half", "F9_F_wagearner_Full", "F9_F_wagearner_Half",
    "F9_F_Indp1_Full", "F9_F_Indp1_Half", "F9_F_empl_Full", "F9_F_empl_Half", "F9_F_trans_Full", "F9_F_trans_Half",
    "f4_M_wagearner_Full", "f4_M_wagearner_Half", "f4_M_Indp1_Full", "f4_M_Indp1_Half", "f4_M_empl_Full",
    "f4_M_empl_Half", "f4_M_trans_Full", "f4_M_trans_Half", "f4_F_wagearner_Full", "f4_F_wagearner_Half",
    "f4_F_Indp1_Full", "f4_F_Indp1_Half", "f4_F_empl_Full", "f4_F_empl_Half", "f4_F_trans_Full", "f4_F_trans_Half",
    "f9_M_wagearner_Full", "f9_M_wagearner_Half", "f9_M_Indp1_Full", "f9_M_Indp1_Half", "f9_M_empl_Full",
    "f9_M_empl_Half", "f9_M_trans_Full", "f9_M_trans_Half", "f9_F_wagearner_Full", "f9_F_wagearner_Half",
    "f9_F_Indp1_Full", "f9_F_Indp1_Half", "f9_F_empl_Full", "f9_F_empl_Half", "f9_F_trans_Full", "f9_F_trans_Half",
    "s4_M_wagearner_Full", "s4_M_wagearner_Half", "s4_M_Indp1_Full", "s4_M_Indp1_Half", "s4_M_empl_Full",
    "s4_M_empl_Half", "s4_M_trans_Full", "s4_M_trans_Half", "s4_F_wagearner_Full", "s4_F_wagearner_Half",
    "s4_F_Indp1_Full", "s4_F_Indp1_Half", "s4_F_empl_Full", "s4_F_empl_Half", "s4_F_trans_Full", "s4_F_trans_Half",
    "s9_M_wagearner_Full", "s9_M_wagearner_Half", "s9_M_Indp1_Full", "s9_M_Indp1_Half", "s9_M_empl_Full",
    "s9_M_empl_Half", "s9_M_trans_Full", "s9_M_trans_Half", "s9_F_wagearner_Full", "s9_F_wagearner_Half",
    "s9_F_Indp1_Full", "s9_F_Indp1_Half", "s9_F_empl_Full", "s9_F_empl_Half", "s9_F_trans_Full", "s9_F_trans_Half")

Create aggregate statistics and drop unnecessary variables

# Simplify the variables by summing up by categories
# total number of wage earners for males and females.
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('wagearner', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
# names(status_work[nam.M])
# names(status_work[nam.F])
status_work$wagearner_M=rowSums(status_work[,nam.M])
status_work$wagearner_F=rowSums(status_work[,nam.F]) 
                                               
# same but for independent workers by gender
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('Indp1', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
# names(status_work[nam.M])
# names(status_work[nam.F])
status_work$independent_M=rowSums(status_work[,nam.M]) 
status_work$independent_F=rowSums(status_work[,nam.F])
# same but for government transfer receivers
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('trans', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
# names(status_work[nam.M])
# names(status_work[nam.F])
status_work$transfer_M=rowSums(status_work[,nam.M])  
status_work$transfer_F=rowSums(status_work[,nam.F])
# same but for employers
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('empl', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
# names(status_work[nam.M])
# names(status_work[nam.F])
status_work$employer_M=rowSums(status_work[,nam.M])
status_work$employer_F=rowSums(status_work[,nam.F])
 
# same but for full-time contracts
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('Full', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
# names(status_work[nam.M])
# names(status_work[nam.F])
status_work$full_M=rowSums(status_work[,nam.M])
status_work$full_F=rowSums(status_work[,nam.F])
# same but for half-time contracts
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('Half', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
#names(status_work[nam.M])
#names(status_work[nam.F])
status_work$half_M=rowSums(status_work[,nam.M])
status_work$half_F=rowSums(status_work[,nam.F])
# total of each variables
status_work$full <- status_work$full_M + status_work$full_F
status_work$half <- status_work$half_M + status_work$half_F
status_work$wagearner <- status_work$wagearner_M + status_work$wagearner_F
status_work$independent <- status_work$independent_M + status_work$independent_F
status_work$transfer <- status_work$transfer_F + status_work$transfer_M 
status_work$employer <- status_work$employer_F + status_work$employer_M 
# drop unnecessary variables
status_work <- subset(status_work, select = c(CODGEO,
                                              wagearner_M, wagearner_F, 
                                              independent_M, independent_F, 
                                              transfer_M, transfer_F, 
                                              employer_M, employer_F, 
                                              full_M, full_F,
                                              half_M, half_F,
                                              full, half,
                                              wagearner, independent, transfer, employer))
# Full variable means individuals with full time contracts. Half means individuals with part time contract. 
# "transfer" is individuals receiving transfer payments. 
# "independent" is like auto-entrepreneurs. 

2.8 Inequality data

Drop unnecessary variables and rename the other ones

names(ineq)
 [1] "CODGEO"        "LIBGEO"        "REG"           "DEP"           "P14_POP"      
 [6] "P09_POP"       "SUPERF"        "NAIS0914"      "DECE0914"      "P14_MEN"      
[11] "NAISD16"       "DECESD16"      "P14_LOG"       "P14_RP"        "P14_RSECOCC"  
[16] "P14_LOGVAC"    "P14_RP_PROP"   "NBMENFISC14"   "PIMP14"        "MED14"        
[21] "TP6014"        "P14_EMPLT"     "P14_EMPLT_SAL" "P09_EMPLT"     "P14_POP1564"  
[26] "P14_CHOM1564"  "P14_ACT1564"   "ETTOT15"       "ETAZ15"        "ETBE15"       
[31] "ETFZ15"        "ETGU15"        "ETGZ15"        "ETOQ15"        "ETTEF115"     
[36] "ETTEFP1015"   
# drop unnecessary variables
ineq <- subset(ineq, select = -c(LIBGEO, P09_POP, NAISD16, DECESD16, P14_RP_PROP, P09_EMPLT,ETTOT15, ETAZ15, ETBE15, ETFZ15, ETGU15, ETGZ15, ETOQ15, ETTEF115, ETTEFP1015))
# rename variables 
names(ineq)[4:ncol(ineq)] <-
  c("pop_2014",
    "Superficie",
    "birth09_14",
    "death09_14",
    "households14",
    "housing14",
    "princ_resid14",
    "sec_resid14",
    "vac_resid14",
    "tax_house14",
    "shared_tax_house14",
    "median_living14",
    "lev_ineq14",
    "empl",
    "emp_sal",
    "pop15_64",
    "unemp15_64",
    "act15_64")
# drop more unnecessary variables 
ineq <- subset(ineq, select = -c(princ_resid14, sec_resid14, vac_resid14, pop_2014, households14, birth09_14, 
                                death09_14, tax_house14, shared_tax_house14, lev_ineq14)) 
 
# unemployment rate 
ineq$unemp_rate <- ineq$unemp15_64/ineq$pop15_64 
# impute NA using median (only 3)
ineq$median_living14[is.na(ineq$median_living14)] = median(ineq$median_living14, na.rm = T)

2.9 Commune data

Drop unnecessary variables and rename the other ones

names(commune)
 [1] "codgeo"                           "orientationeconomique"           
 [3] "indicefiscalpartiel"              "scorefiscal"                     
 [5] "libgeo"                           "reg"                             
 [7] "dep"                              "indicedmographique"              
 [9] "population"                       "evolutionpop"                    
[11] "nbmnages"                         "nbrsidencesprincipales"          
[13] "nbpropritaire"                    "nblogement"                      
[15] "nboccupantsrsidenceprincipale"    "nbfemme"                         
[17] "nbhomme"                          "nbmineurs"                       
[19] "nbmajeurs"                        "nbetudiants"                     
[21] "nbentreprisessecteurservices"     "nbentreprisessecteurcommerce"    
[23] "nbentreprisessecteurconstruction" "nbentreprisessecteurindustrie"   
[25] "nbcrationenteprises"              "nbcrationindustrielles"          
[27] "nbcrationconstruction"            "nbcrationcommerces"              
[29] "nbcrationservices"                "moyennerevenusfiscauxdpartementa"
[31] "moyennerevenusfiscauxrgionaux"    "depmoyennesalaireshoraires"      
[33] "depmoyennesalairescadrehoraires"  "depmoyennesalairesprofintermdiai"
[35] "depmoyennesalairesemployhoraires" "depmoyennesalairesouvrihoraires" 
[37] "regmoyennesalaireshoraires"       "regmoyennesalairescadrehoraires" 
[39] "regmoyennesalairesprofintermdiai" "regmoyennesalairesemployhoraires"
[41] "regmoyennesalairesouvrihoraires"  "valeurajoutergionale"            
[43] "urbanitruralit"                   "scoreurbanit"                    
[45] "nbatifs"                          "nbactifssalaris"                 
[47] "nbactifsnonsalaris"               "nblogementsecondaireetoccasionne"
[49] "dynamiquedmographiquebv"          "tauxtudiants"                    
[51] "tauxproprit"                      "dynamiquedmographiqueinsee"      
[53] "capacitfisc"                      "capacitfiscale"                  
[55] "moyennerevnusfiscaux"             "nbindustriesdesbiensintermdiaire"
[57] "nbdecommerce"                     "nbdeservicesauxparticuliers"     
[59] "nbinstitutiondeeducationsantacti" "pibrgionnal"                     
[61] "segenvironnementdmographiqueobso" "scorecroissancepopulation"       
[63] "scorecroissanceentrepreneuriale"  "scorevargion"                    
[65] "scorepib"                         "environnementdmographique"       
[67] "fidlit"                          
#Drop unnecessary variables
commune <-subset(commune, select=-c(libgeo, scorefiscal, dep, reg, evolutionpop, nbpropritaire, 
                                    nblogementsecondaireetoccasionne, scoreurbanit, scorevargion, scorepib,  
                                    nbrsidencesprincipales, indicefiscalpartiel, nboccupantsrsidenceprincipale, 
                                    depmoyennesalaireshoraires, depmoyennesalairescadrehoraires, 
                                    depmoyennesalairesprofintermdiai, depmoyennesalairesemployhoraires, 
                                    depmoyennesalairesouvrihoraires, valeurajoutergionale, 
                                    nbindustriesdesbiensintermdiaire, nbdecommerce, nbdeservicesauxparticuliers, 
                                    nbinstitutiondeeducationsantacti, scorevargion, scorepib, 
                                    moyennerevenusfiscauxdpartementa, moyennerevenusfiscauxrgionaux, 
                                    regmoyennesalaireshoraires, regmoyennesalairescadrehoraires, 
                                    regmoyennesalairesprofintermdiai, regmoyennesalairesemployhoraires, 
                                    regmoyennesalairesouvrihoraires, tauxtudiants, dynamiquedmographiqueinsee, 
                                    capacitfisc, capacitfiscale, moyennerevnusfiscaux, nblogement, fidlit)) 
# rename variables  
names(commune) <-c("CODGEO", "orient_econ", "demo_index", "population", "nb_household", 
                    "nb_female", "nb_male", "nb_minor", "nb_major", "nb_students",  
                    "nb_firms_service", "nb_firms_commerce", "nb_firms_construction", 
                    "nb_firms_ind", "nb_created_firms", "nb_created_ind", 
                    "nb_created_construction", "nb_created_commerce",
                    "nb_created_services", "urbanRural", "nb_active", "nb_active_employees",
                    "nb_active_non_employees", "dymanic_demo_prof", "profitRate", 
                    "regional_GDP", "demo_environment", "evol_pop_score", 
                    "evol_entrepreneurial_score", "demo_envir2") 
# drop more variables 
commune <-subset(commune, select=-c(demo_index, population, profitRate, demo_envir2, demo_environment)) 

3 Produce consistent datasets

The CODGEO variables have to be merged. However, for different reasons already identified by other kaggle users they need some pre-processing. To do so, some already known mistakes are corrected:

firms$CODGEO       <- sub("A", "0", firms$CODGEO)
firms$CODGEO       <- sub("B", "0", firms$CODGEO)
salary$CODGEO      <- sub("A", "0", salary$CODGEO)
salary$CODGEO      <- sub("B", "0", salary$CODGEO)
population$CODGEO  <- sub("A", "0", population$CODGEO)
population$CODGEO  <- sub("B", "0", population$CODGEO)
geo$CODGEO         <- sub("A", "0", geo$CODGEO)
geo$CODGEO         <- sub("B", "0", geo$CODGEO)
educ$CODGEO        <- sub("A", "0", educ$CODGEO)
educ$CODGEO        <- sub("B", "0", educ$CODGEO)
categ_socio$CODGEO <- sub("A", "0", categ_socio$CODGEO)
categ_socio$CODGEO <- sub("B", "0", categ_socio$CODGEO)
ineq$CODGEO        <- sub("A", "0", ineq$CODGEO)
ineq$CODGEO        <- sub("B", "0", ineq$CODGEO)
status_work$CODGEO <- sub("A", "0", status_work$CODGEO)
status_work$CODGEO <- sub("B", "0", status_work$CODGEO)
commune$CODGEO     <- sub("A", "0", commune$CODGEO)
commune$CODGEO     <- sub("B", "0", commune$CODGEO)
# Then all CODGEO are trasformed to integers and four new datasets are created retaining only the common CODGEO:
# use only integer values
geo$CODGEO         <- as.integer(geo$CODGEO)  
population$CODGEO  <- as.integer(population$CODGEO)
firms$CODGEO       <- as.integer(firms$CODGEO)
salary$CODGEO      <- as.integer(salary$CODGEO)
status_work$CODGEO <- as.integer(status_work$CODGEO)  
ineq$CODGEO        <- as.integer(ineq$CODGEO)
educ$CODGEO        <- as.integer(educ$CODGEO)
categ_socio$CODGEO <- as.integer(categ_socio$CODGEO)
commune$CODGEO     <- as.integer(commune$CODGEO)
# store datasets' names to loop on them
dataset = c("population", "salary", "firms", "geo", "ineq", "commune", "educ", "categ_socio", "status_work")
# obtain sommon IDs for all datasets
for (i in dataset){
  # get i-th name and create a new variable concateneting "NEW" at the end
  nam <- paste(i, "NEW", sep = "")
  # initialize counter to identify the number of iteration in j
  iter = 1
  for (j in dataset){
    if (j != i){
      # for each dataset different from the i-th
      if (iter == 1){
        # 1st iteration: use the original dataset (e.g., geo)
        assign(nam, semi_join(get(i), get(j), by = "CODGEO"))
      } else{
        # successive iteration: use the new dataset (e.g., geoNEW)
        assign(nam, semi_join(get(nam), get(j), by = "CODGEO"))
      }
      iter = iter + 1
    }
  }
}
# check how many observation have been deleted
for (i in dataset){
  del_rows = nrow(get(i)) - nrow(get(paste(i, "NEW", sep = "")))
  del_prop = del_rows / nrow(get(i))
  del_obs = paste("For", i, del_rows, "units have been deleted.",
                  "They were the", round(del_prop*100, digits=2), "% of the total.", sep = " ")
  print(paste(del_obs))
}
[1] "For population 30843 units have been deleted. They were the 85.99 % of the total."
[1] "For salary 111 units have been deleted. They were the 2.16 % of the total."
[1] "For firms 31656 units have been deleted. They were the 86.3 % of the total."
[1] "For geo 31541 units have been deleted. They were the 86.26 % of the total."
[1] "For ineq 31664 units have been deleted. They were the 86.3 % of the total."
[1] "For commune 31652 units have been deleted. They were the 86.3 % of the total."
[1] "For educ 30843 units have been deleted. They were the 85.99 % of the total."
[1] "For categ_socio 30843 units have been deleted. They were the 85.99 % of the total."
[1] "For status_work 30843 units have been deleted. They were the 85.99 % of the total."
print(paste("The new dataset has", nrow(salaryNEW), "units and", 
      ncol(salaryNEW)+ncol(populationNEW)+ncol(firmsNEW)+ncol(geoNEW)+
        ncol(status_workNEW)+ncol(ineqNEW)+ncol(educNEW)+ncol(categ_socioNEW)+ncol(communeNEW), "features."))
[1] "The new dataset has 5025 units and 131 features."

Overwrite new datesets into the old ones and remove the latter ones from memory

firms       <- firmsNEW
geo         <- geoNEW
salary      <- salaryNEW
population  <- populationNEW
educ        <- educNEW
categ_socio <- categ_socioNEW
status_work <- status_workNEW
ineq        <- ineqNEW
commune     <- communeNEW
rm(list=c("firmsNEW", "geoNEW", "salaryNEW", "populationNEW", "educNEW", "categ_socioNEW", "status_workNEW", "ineqNEW", "communeNEW"))

4 Descriptive statistics

4.1 Firms data

Check the distribution of the null firms (i.e., unknown sizes) and analyze firms’ distribution per town:

# check the ratio of null firms for each town
summary(firms$null/firms$total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3000  0.5937  0.6549  0.6549  0.7152  1.0000 
# a lot of information is missing, should we remove these data?
ggplot(data=firms, aes(firms$null/firms$total)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Null firms/total firms", y="Count") +
  ggtitle("Ratio between the number of null firms and total firms per town") +
  theme(plot.title = element_text(hjust = 0.5)) 

# distribution for the firms with unknown size 
ggplot(data=firms, aes(log(firms$null))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms of unknown size (log scale)", y="Density") +
  ggtitle("Number of firms of unknown size per town") +
  theme(plot.title = element_text(hjust = 0.5))

# distribution for the total number of firms 
ggplot(data=firms, aes(log(firms$total))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms (log scale)", y="Density") +
  ggtitle("Total number of firms per town") +
  theme(plot.title = element_text(hjust = 0.5)) 

# distribution for the number of micro size firms 
ggplot(data=firms, aes(log(firms$micro))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms of micro size (log scale)", y="Density") +
  ggtitle("Number of firms of micro size per town") +
  theme(plot.title = element_text(hjust = 0.5))

# distribution for the number of small size firms 
ggplot(data=firms, aes(log(firms$small))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms of small size (log scale)", y="Density") +
  ggtitle("Number of firms of small size per town") +
  theme(plot.title = element_text(hjust = 0.5))

# distribution for the number of medium size firms 
ggplot(data=firms, aes(log(firms$medium))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms of medium size (log scale)", y="Density") +
  ggtitle("Number of firms of medium size per town") +
  theme(plot.title = element_text(hjust = 0.5))

# distribution for the number of large size firms 
ggplot(data=firms, aes(log(firms$large))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms of large size (log scale)", y="Density") +
  ggtitle("Number of firms of large size per town") +
  theme(plot.title = element_text(hjust = 0.5))

Check correlation among the variables:

corrplot(cor(firms[,3:8]))

4.1.1 What have learned

Solved problems:

  • Use the EU firms’ categorization.

Some problematic aspects:

  • Some towns have 100% of null firms, should we remove these data?
  • These variables are strongly correlated, should we use just the total amount in the following parts?

We could use these data for the following tasks:

  • predict the salaries using such information as proxy for the competition in the job market;
  • predict the total number of firms, using salary data;
  • geo-spatial plot for firms’ size

4.2 Geographic data

Plot available towns on a map:

# center of France, obtained using:
# fra_center = as.numeric(geocode("France"))
fra_center = c(2.213749, 46.227638)
# plot all European towns available
geo_pos = as.data.frame(cbind(lon = geo$longitude, lat = geo$latitude))
geo_pos = geo_pos[complete.cases(geo_pos),]
ggmap(get_googlemap(center=fra_center, scale=2, zoom=5), extent="normal") +
  geom_point(aes(x=lon, y=lat), data=geo_pos, col="orange", alpha=0.2, size=0.01) 
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=5&size=640x640&scale=2&maptype=terrain&sensor=false

# Try to cll google API until there is no error
# 
# FraMap = NA
# while (any(is.na(FraMap))){
#   FraMap = tryCatch({
#       ggmap(get_googlemap(center=fra_center, scale=2, zoom=5), extent="normal")
#   }, error = function(e) {
#     message("cannot open URL")  
#     FraMap = NA
#   })
# }

4.2.1 What we have learned

Solved problems:

  • Missing latitude and longitude;
  • Duplications due to multiple postal codes in the same town;
  • Exclude non-European towns;

These information are useful to plot any future dataset/analysis. In addition, they could be useful to compare European towns vs. old colonies.

4.3 Salary data

Univariate analysis comparing salaries for both genders among various job categories:

#  number of units
n_sex <- length(salary$sal_Females)
# vector representing males and females
Label <- c(rep("M", n_sex*5), rep("F", n_sex*5))
# vector representing the variable considered
Variable <- c(rep("General", n_sex), 
             rep("Executive", n_sex),
             rep("MidManager", n_sex),
             rep("Employee", n_sex),
             rep("Worker",n_sex),
             rep("General", n_sex), 
             rep("Executive", n_sex),
             rep("MidManager", n_sex),
             rep("Employee", n_sex),
             rep("Worker",n_sex))
# merge these data
sal_sex = cbind.data.frame(Label = Label, 
             value = c(salary$sal_Males, salary$sal_M_executive, salary$sal_M_midManager, salary$sal_M_employee, salary$sal_M_worker,
                       salary$sal_Females, salary$sal_F_executive, salary$sal_F_midManager, salary$sal_F_employee, salary$sal_F_worker),
             Variable = Variable)
# plotting phase
ggplot(data = sal_sex, aes(x=Label, y=value)) +
  geom_boxplot(aes(fill = Label)) +
  # not color points replacing colour = group instead of colour=Label
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Sex") + ylab("Mean net salary per hour") + ggtitle("Gender comparison for different job positions") +
  theme(plot.title = element_text(hjust = 0.5)) +      stat_boxplot(geom = "errorbar", width = 0.5)

  # + guides(fill=guide_legend(title="Legend"))
# the same but excluding outliers
ggplot(data = sal_sex, aes(x=Label, y=value)) +
  scale_y_continuous(limits = quantile(sal_sex$value, c(0, 0.9))) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Sex") + ylab("Mean net salary per hour") + 
  ggtitle("Gender comparison for different job positions excluding the last decile") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_boxplot(geom = "errorbar", width = 0.5)

Univariate analysis comparing salaries for both genders among various ages:

# vector representing males and females
Label <- c(rep("M", n_sex*3), rep("F", n_sex*3))
# vector representing the variable considered
Variable <- c(rep("18-25", n_sex), 
              rep("26-50", n_sex),
              rep("51+", n_sex),
              rep("18-25", n_sex), 
              rep("26-50", n_sex),
              rep("51+", n_sex))
# merge these data
sal_sex <- cbind.data.frame(Label = Label, 
                           value = c(salary$sal_M_18_25, salary$sal_M_26_50, salary$sal_M_51plus, 
                                     salary$sal_F_18_25, salary$sal_F_26_50, salary$sal_F_51plus),
                           Variable = Variable)
# plotting phase
ggplot(data = sal_sex, aes(x=Label, y=value)) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Sex") + ylab("Mean net salary per hour") + ggtitle("Gender comparison for different ages") +
  theme(plot.title = element_text(hjust = 0.5)) + ylim(c(5, 100)) +
  stat_boxplot(geom = "errorbar", width = 0.5)

The income inequality between genders, age groups and working positions is clear. In the following analyses the focus is on the salary ratio between women and men among different job positions:

# Gender salary ratio and general level of income
# Overall mean salary: The higher the net mean income, the more skewed the ratio of salary between female and male is. Only 2 towns have a ratio>1
# create overall F vs M ratio
salary$salary_ratio_FvsM <- salary$sal_Females / salary$sal_Males
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="Overall salary ratio (females/males)", y="Density") + 
  labs(title = "Overall salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 

# scatter plot vs overall mean salary
ggplot(salary, aes(x= sal_general, y=salary_ratio_FvsM)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="Overall salary ratio(females/males)") + 
  labs(title = "Overall salary ratio between females and males vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm") 

# Executives mean salary: a bit better the situation for females in this case and less skewed
# create Executives F vs M ratio
salary$salary_ratio_FvsM_Exec <- salary$sal_F_executive / salary$sal_M_executive
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_Exec)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="Executives salary ratio (females/males)", y="Density") + 
  labs(title = "Executives salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 

# scatter plot vs executives mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_Exec)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="Executives salary ratio (females/males)") + 
  labs(title = "Executives salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")

# Middle managers mean salary: ....
# create Middle managers F vs M ratio
salary$salary_ratio_FvsM_midManag <- salary$sal_F_midManager / salary$sal_M_midManager
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_Exec)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="Middle managers salary ratio (females/males)", y="Density") + 
  labs(title = "Middle managers salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 

# scatter plot vs executives mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_midManag)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="Middle managers salary ratio (females/males)") + 
  labs(title = "Middle managers salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")

# Workers mean salary: ...
# create workers F vs M ratio
salary$salary_ratio_FvsM_worker <- salary$sal_F_worker / salary$sal_M_worker
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_worker)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="Workers salary ratio (females/males)", y="Density") + 
  labs(title = "Workers salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 

# scatter plot vs workers mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_worker)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="Workers salary ratio (females/males)") + 
  labs(title = "Workers salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")

# Employee mean salary: ...
# create Employee F vs M ratio
salary$salary_ratio_FvsM_employee <- salary$sal_F_employee / salary$sal_M_employee
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_employee)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="Employee salary ratio (females/males)", y="Density") + 
  labs(title = "Employee salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 

# scatter plot vs Employee mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_employee)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="Employee salary ratio (females/males)") + 
  labs(title = "Employee salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")

Now the focus is on the salary ratio between women and men among different age groups:

# 18-25 mean salary: are quite equal apart from some outliers and a quadratic trend
# create 18-25 F vs M ratio
salary$salary_ratio_FvsM_18_25 <- salary$sal_F_18_25 / salary$sal_M_18_25
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_18_25)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="18-25 salary ratio (females/males)", y="Density") + 
  labs(title = "18-25 salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 

# scatter plot vs 18-25 mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_18_25)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="18-25 salary ratio (females/males)") + 
  labs(title = "18-25 salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")

# scatter plot vs 18-25 mean salary for them
ggplot(salary, aes(x= sal_18_25, y= salary_ratio_FvsM_18_25)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="18-25 salary", y="18-25 salary ratio (females/males)") + 
  labs(title = "18-25 salary ratio between females and males \n vs. overall 18-25 salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "loess")

# 26-50 mean salary: ...
# create 26-50 F vs M ratio
salary$salary_ratio_FvsM_26_50 <- salary$sal_F_26_50 / salary$sal_M_26_50
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_26_50)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="26-50 salary ratio (females/males)", y="Density") + 
  labs(title = "26-50 salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 

# scatter plot vs 26-50 mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_26_50)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="26-50 salary ratio (females/males)") + 
  labs(title = "26-50 salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")

# 51+ mean salary: ...
# create 51+ F vs M ratio
salary$salary_ratio_FvsM_51plus <- salary$sal_F_51plus / salary$sal_M_51plus
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_51plus)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="51+ salary ratio (females/males)", y="Density") + 
  labs(title = "51+ salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 

# scatter plot vs 26-50 mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_51plus)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="51+ salary ratio (females/males)") + 
  labs(title = "51+ salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")

Highlight bivariate relations:

# correlation matrix
corrplot(cor(salary[, 3:ncol(salary)]), method = "circle", title = "Correlation matrix for salary data", 
         diag = T, tl.cex=0.5, type="lower", 
         tl.col = "black", mar=c(0,0,1.5,0)) 

# most general pairs
pairs(salary[c(3:8, 13, 18:20)], gap=0, main = "Scatter matrix of the main variables in salary data", cex = 0.6)

# pairs highlighting genders' differences
pairs(salary[c(9:12, 14:17)], gap=0, main = "Scatter matrix of job categories for both genders", cex = 0.6)

4.3.1 What we have learned

….

4.4 Population data

Plot population data across different towns:

# Histogram of total population per town in log
ggplot(data=population, aes(ifelse(total_population!=0, log10(total_population), 0))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Log10 total population", y="Density") +
  ggtitle("Histogram of total population per town in log10 scale") +
  theme(plot.title = element_text(hjust = 0.5))

# Histogram of dependency ratio per town
ggplot(data=population, aes(ifelse(dependency_ratio!=0, log10(dependency_ratio), 0))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Log10 dependency ratio", y="Density") +
  ggtitle("Histogram of dependency ratio per town in log10 scale") +
  theme(plot.title = element_text(hjust = 0.5))

# Histogram of sex ratio per town
ggplot(data=population, aes(ifelse(sex_ratio!=0, log10(sex_ratio), 0))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Log10 sex ratio", y="Density") +
  ggtitle("Histogram of sex ratio per town in log10 scale") +
  theme(plot.title = element_text(hjust = 0.5))

Understand which towns have the highest concentration of population and ratios.

# Merge geography and population data
# geo_population <- merge(geo, population, by="CODGEO")
geo_population <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(population$CODGEO),],
                                   population[unique(population$CODGEO) %in% unique(geo$CODGEO),])
# France map
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
# Plot "Distribution of total population for each town" 
sc <- scale_colour_gradientn(colours =palette(rainbow(3)), limits=c(log10(min(geo_population$total_population)), 
                                                                    log10(max(geo_population$total_population)))) 
population_distribution <-
  FraMap +
  geom_point(aes(x=geo_population$longitude, y=geo_population$latitude,
                 colour=log10(geo_population$total_population)), 
             data=geo_population, alpha=0.2, size=0.01) +
             sc + labs(color='') + ggtitle("Distribution of Population for each town")
population_distribution
# Plot "Distribution of aged dependency ratio for each town"
sc <- scale_colour_gradientn(colours =palette(rainbow(4)), limits=c(min(geo_population$aged_dependency_ratio),
                                                                  max(geo_population$aged_dependency_ratio)))

aged_ratio_distribution <-
  FraMap +
  geom_point(aes(x=geo_population$longitude, y=geo_population$latitude,
                 colour=geo_population$aged_dependency_ratio),
             data=geo_population, alpha=1, size=1) +
             sc + labs(color='') + ggtitle("Aged dependency ratio per town")
aged_ratio_distribution

# which city has the highest/lowest aged dependency ratio 
geo_population[which.min(geo_population$aged_dependency_ratio),c('CODGEO')]  
[1] 77183
geo_population[which.max(geo_population$aged_dependency_ratio),c('CODGEO')]  
[1] 85003
 
# get list of top 10 highest aged dependency ratio 
Rank = sort(population$aged_dependency_ratio, TRUE) 
Rank10 = sort(Rank[1:10], TRUE) 
sol = rep(0, 10) 
for (i in 1:10){ 
    sol[i] = paste(geo_population$CODGEO[geo_population$aged_dependency_ratio %in% Rank10[i]] )
} 
# remove the dataset created to save memory
rm(list = c("geo_population", "Rank", "Rank10")) 

4.4.1 What we have learned

We can observe that in urban cities, the aged dependcy ratio is lower compared to non-urban cities. The city with the lowest aged dependency ratio is La Fert?-sous-Jouarre and the highest is L’Aiguillon-sur-Mer.

4.5 Education data

 
# Difference between male and female's education level (with University degrees) 
summary(educ$male_Univ) 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    42.0    206.0    329.0    960.3    656.0 440490.0 
summary(educ$female_Univ) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     57     255     395    1100     776  486406 
summary(educ$male_NoDip) 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    58.0    260.0    401.0    904.1    754.0 138888.0 
summary(educ$female_NoDip) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    107     356     567    1242    1081  177240 
 
# Merge geo and educ data to include longitude and latitude in the data. 
geo_educ <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(educ$CODGEO),], 
                                   educ[unique(educ$CODGEO) %in% unique(geo$CODGEO),]) 
 
# check the ratio of individuals with high and low education level for each town. 
summary(educ$univ/educ$nodip) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1149  0.4989  0.7689  0.9438  1.1728  7.7004 
 
# Histogram of the ratio of individuals with high and low education level for each town 
ggplot(data=educ, aes(log(educ$univ/educ$nodip))) + 
  geom_histogram(aes(y=..density..), col="black", fill="blue", alpha=0.3, bins= 50)+ 
                   geom_density(col="black") + 
                   labs(x="log. University degree/No diploma", y="Density") + 
                   ggtitle("Ratio between the number of individuals with no diploma and  
                           individuals with University degree per town") + 
                   theme(plot.title =element_text(hjust= 0.5)) 

 
# Adjust scales before plotting 
sc <- scale_colour_gradientn(colours = c("#56ddc5", "#F8766D", "#00BA38"), limits=c(min(educ$univ/educ$nodip), max(educ$univ/educ$nodip))) 
 
# Map ratio  
educ_ratio_NoDipUniv <- 
  FraMap + 
  geom_point(aes(x=geo_educ$longitude, y=geo_educ$latitude, 
                 colour=educ$univ/educ$nodip), 
             data=geo_educ, alpha=1, size=1) + 
             sc + labs(color='Ratio') + ggtitle("Education level ratio for each town") 
educ_ratio_NoDipUniv 

 
# Statistics of shape for the data 
# library(psych) 
describe(educ$univ/educ$nodip) 
   vars    n mean   sd median trimmed  mad  min max range skew kurtosis   se
X1    1 5025 0.94 0.67   0.77    0.84 0.47 0.11 7.7  7.59  2.4     9.98 0.01
 
# remove the dataset created to save memory 
rm(list=c("geo_educ", "educ_ratio_NoDipUniv")) 
 

4.5.1 What we have learned

In average 960 males have university degree and 1100 females have university degree.
In average 904 males have no diploma and 1242 females have no diploma. Highest ratio in Paris=78251=Yvelines

4.6 Demographic/social profiles data

glimpse(categ_socio)
Observations: 5,025
Variables: 9
$ CODGEO               <int> 1004, 1007, 1014, 1024, 1025, 1027, 1031, 1032, 1033, 1...
$ male_immig           <dbl> 702, 55, 456, 55, 60, 63, 397, 72, 1304, 519, 115, 90, ...
$ female_immig         <dbl> 660, 45, 427, 65, 35, 38, 397, 92, 1337, 490, 145, 65, ...
$ total_immig          <dbl> 1362, 100, 883, 120, 95, 101, 794, 164, 2641, 1009, 260...
$ male_working_immig   <dbl> 568, 25, 334, 51, 45, 59, 308, 56, 1006, 400, 90, 70, 2...
$ female_working_immig <dbl> 562, 25, 333, 52, 30, 34, 348, 80, 1069, 387, 100, 45, ...
$ male_native          <dbl> 6076, 1174, 1325, 1581, 1548, 1798, 1391, 1524, 4448, 3...
$ female_native        <dbl> 6586, 1289, 1195, 1681, 1534, 1063, 1463, 1626, 4645, 4...
$ total_native         <dbl> 12662, 2463, 2520, 3262, 3082, 2861, 2854, 3150, 9093, ...
# merge geo and categ_socio
geo_categ_socio <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(categ_socio$CODGEO),],
                                   categ_socio[unique(categ_socio$CODGEO) %in% unique(geo$CODGEO),])
# find which cities have the most and least number of immigrants
categ_socio[which.max(categ_socio$total_immig),c('CODGEO')] 
[1] 75056
categ_socio[which.min(categ_socio$total_immig),c('CODGEO')] 
[1] 33154
# get list of top 10 cities with the most number of immigrants
Rank = sort(categ_socio$total_immig, TRUE)
Rank10 = sort(Rank[1:10], TRUE)
sol = rep(0, 10)
for (i in 1:10){
    sol[i] = paste(geo_categ_socio$region[categ_socio$total_immig %in% Rank10[i]] )
}
# Adjust scales before plotting 
sc <- scale_colour_gradientn(colours = c("#F8766D",  "#00BA38"), limits=c(min(geo_categ_socio$total_immig/population$total_population), max(geo_categ_socio$total_immig/population$total_population))) 
 
# Map ratio  
distribution_immig <- 
  FraMap + 
  geom_point(aes(x=geo_categ_socio$longitude, y=geo_categ_socio$latitude, 
                 colour=geo_categ_socio$total_immig/population$total_population), 
             data=geo_categ_socio, alpha=1, size=1) + 
             sc + labs(color='') + ggtitle("Immigrants for each town") 
distribution_immig

rm(geo_categ_socio)

4.6.1 What we have learned

City with the least number of immigrants is Rennes and the most is Paris. =“Île-de-France” “Provence-Alpes-Côte d’Azur” “Rhône-Alpes” “Midi-Pyrénées”
“Provence-Alpes-Côte d’Azur” “Alsace” “Languedoc-Roussillon” “Île-de-France”
“Île-de-France” “Île-de-France”

4.7 Work status data

summary(status_work)
     CODGEO       wagearner_M      wagearner_F     independent_M     independent_F     
 Min.   : 1004   Min.   :     0   Min.   :     0   Min.   :    0.0   Min.   :    0.00  
 1st Qu.:31455   1st Qu.:   274   1st Qu.:   299   1st Qu.:   37.0   1st Qu.:   23.00  
 Median :56017   Median :   589   Median :   594   Median :   62.0   Median :   40.00  
 Mean   :51625   Mean   :  1964   Mean   :  2011   Mean   :  149.8   Mean   :   99.83  
 3rd Qu.:73008   3rd Qu.:  1412   3rd Qu.:  1334   3rd Qu.:  112.0   3rd Qu.:   75.00  
 Max.   :97113   Max.   :751026   Max.   :828113   Max.   :78180.0   Max.   :56586.00  
   transfer_M        transfer_F        employer_M        employer_F      
 Min.   :  0.000   Min.   :  0.000   Min.   :    0.0   Min.   :    0.00  
 1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.:   34.0   1st Qu.:   10.00  
 Median :  0.000   Median :  0.000   Median :   59.0   Median :   20.00  
 Mean   :  1.287   Mean   :  3.183   Mean   :  134.4   Mean   :   48.01  
 3rd Qu.:  0.000   3rd Qu.:  5.000   3rd Qu.:  110.0   3rd Qu.:   40.00  
 Max.   :761.000   Max.   :889.000   Max.   :54988.0   Max.   :20580.00  
     full_M           full_F           half_M            half_F        
 Min.   :     0   Min.   :     0   Min.   :    0.0   Min.   :     0.0  
 1st Qu.:   332   1st Qu.:   227   1st Qu.:   26.0   1st Qu.:   105.0  
 Median :   662   Median :   458   Median :   53.0   Median :   198.0  
 Mean   :  2071   Mean   :  1580   Mean   :  178.3   Mean   :   581.9  
 3rd Qu.:  1526   3rd Qu.:  1030   3rd Qu.:  118.0   3rd Qu.:   417.0  
 Max.   :790568   Max.   :712040   Max.   :94387.0   Max.   :194128.0  
      full              half            wagearner        independent      
 Min.   :      0   Min.   :     0.0   Min.   :      0   Min.   :     0.0  
 1st Qu.:    575   1st Qu.:   135.0   1st Qu.:    588   1st Qu.:    62.0  
 Median :   1126   Median :   252.0   Median :   1194   Median :   102.0  
 Mean   :   3651   Mean   :   760.2   Mean   :   3975   Mean   :   249.6  
 3rd Qu.:   2568   3rd Qu.:   536.0   3rd Qu.:   2772   3rd Qu.:   180.0  
 Max.   :1502608   Max.   :288515.0   Max.   :1579139   Max.   :134766.0  
    transfer           employer      
 Min.   :   0.000   Min.   :    0.0  
 1st Qu.:   0.000   1st Qu.:   45.0  
 Median :   0.000   Median :   79.0  
 Mean   :   4.471   Mean   :  182.4  
 3rd Qu.:   5.000   3rd Qu.:  148.0  
 Max.   :1650.000   Max.   :75568.0  
names(status_work) 
 [1] "CODGEO"        "wagearner_M"   "wagearner_F"   "independent_M" "independent_F"
 [6] "transfer_M"    "transfer_F"    "employer_M"    "employer_F"    "full_M"       
[11] "full_F"        "half_M"        "half_F"        "full"          "half"         
[16] "wagearner"     "independent"   "transfer"      "employer"     
# merge geo and status_work
# geo_status_work <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(status_work$CODGEO),],
#                                    status_work[unique(status_work$CODGEO) %in% unique(geo$CODGEO),])
geo_status_work <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(status_work$CODGEO),], 
                                    status_work[unique(status_work$CODGEO) %in% unique(geo$CODGEO),]) 
# which town has the most/least people receiving government transfers
status_work[which.max(status_work$transfer),c('CODGEO')] 
[1] 75056
status_work[which.min(status_work$transfer),c('CODGEO')] 
[1] 1007
# get list of top 10 cities with the people receiving government transfers
Rank = sort(status_work$transfer, TRUE)
Rank10 = sort(Rank[1:10], TRUE)
sol = rep(0, 10)
for (i in 1:10){
    sol[i] = paste(geo_status_work$CODGEO[status_work$transfer %in% Rank10[i]])
}
# which town has the most/least people with part time contract.
status_work[which.max(status_work$half),c('CODGEO')] 
[1] 75056
status_work[which.min(status_work$half),c('CODGEO')] 
[1] 76095
# get list of top 10 cities with the most part-time contract.
Rank = sort(status_work$half, TRUE)
Rank10 = sort(Rank[1:10], TRUE)
sol = rep(0, 10)
for (i in 1:10){
    sol[i] = paste(status_work$CODGEO[status_work$half %in% Rank10[i]] ) 
}
# Histogram (Females with full time contract)
describe(status_work$full_F)    # Skewness
   vars    n    mean       sd median trimmed    mad min    max  range  skew kurtosis
X1    1 5025 1580.12 11070.87    458  658.08 422.54   0 712040 712040 53.63  3381.11
       se
X1 156.18
ggplot(data=status_work, aes(ifelse(full_F!=0, log(full_F), 0))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="", y="Density") +
  ggtitle("Female workers with full time contract") + 
  theme(plot.title = element_text(hjust = 0.5))

# Histogram (males with full time contract) 
describe(status_work$full_M)    # Skewness 
   vars    n    mean       sd median trimmed    mad min    max  range  skew kurtosis
X1    1 5025 2070.78 12491.66    662  953.99 616.76   0 790568 790568 51.29  3167.63
       se
X1 176.22
ggplot(data=status_work, aes(ifelse(full_M!=0, log(full_M), 0))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="", y="Density") +
  ggtitle("Male workers with full time contract") + 
  theme(plot.title = element_text(hjust = 0.5))

# Histogram: Contract type ratio
describe(status_work$half/status_work$full)    # Skewness
   vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
X1    1 5024 0.24 0.09   0.23    0.23 0.07 0.04 0.82  0.78 1.24     3.45  0
ggplot(data=status_work, aes(status_work$half/status_work$full)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Contract type ratio (temporary/full time)", y="Density") +
  ggtitle("Contract type ratio") +
  theme(plot.title = element_text(hjust = 0.5))

#  number of units
n_sex <- length(status_work$wagearner_F)
# vector representing males and females
Label <- c(rep("M", n_sex*6), rep("F", n_sex*6))
# vector representing the variable considered
Variable <- c(rep("Wagearner", n_sex), 
             rep("Independent", n_sex),
             rep("Transfer", n_sex),
             rep("Employer", n_sex),
             rep("Full",n_sex),
             rep("Half", n_sex))
             
# merge these data
status_sex = cbind.data.frame(Label = Label, 
             value = c(status_work$wagearner_M, status_work$wagearner_F, status_work$independent_M,
                       status_work$independent_F, status_work$transfer_M, status_work$transfer_F,
                       status_work$employer_M, status_work$employer_F, status_work$full_M, status_work$full_F,
                       status_work$half_M, status_work$half_F),
             Variable = Variable)
# plotting phase
ggplot(data = status_sex, aes(x=Label, y=value)) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Sex") + ylab("# of individuals") + ggtitle("Gender comparison") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_boxplot(geom = "errorbar", width = 0.5)

# the same but excluding outliers
ggplot(data = status_sex, aes(x=Label, y=value)) +
  scale_y_continuous(limits = quantile(status_sex$value, c(0, 0.9))) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Sex") + ylab("# of individuals") + 
  ggtitle("Gender comparison excluding the last decile") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_boxplot(geom = "errorbar", width = 0.5)

4.7.1 What we have learned

Paris is the city where they have the most individuals receiving government transfers. Lyon is the city where they have the least individuals receiving government transfers. =“Île-de-France” “Provence-Alpes-Côte d’Azur” “Provence-Alpes-Côte d’Azur” “Rhône-Alpes”
“Midi-Pyrénées” “Aquitaine” “Languedoc-Roussillon” “Pays de la Loire”
“Rhône-Alpes” “Alsace”

Paris has most individuals with half time contract. Rouen has the least. = “Île-de-France” “Rhône-Alpes” “Provence-Alpes-Côte d’Azur” “Midi-Pyrénées”
“Pays de la Loire” “Nord-Pas-de-Calais” “Alsace” “Languedoc-Roussillon”
“Aquitaine” “Bretagne”

In average, more females receive transfers than males. In average, more females have part-time contract than males. (In average, 581 females have part-time contract (in each town) and 178 males have part-time contract. (More males have full-time contract.)

More females work as wage-earners than males. =All these facts can be related to vulnerability of females in the labour market. Many studies show that females are mostly working as part-time workers and lower job categories. On top of this, more females receive transfers. We can also observe this in our data that more males are registered as independent workers or employers.

4.8 Inequality data

glimpse(ineq)
Observations: 5,025
Variables: 12
$ CODGEO          <int> 1004, 1007, 1014, 1024, 1025, 1027, 1031, 1032, 1033, 1034, ...
$ REG             <int> 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, 84, ...
$ DEP             <fctr> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ Superficie      <int> 25, 34, 23, 19, 40, 18, 8, 13, 15, 22, 11, 9, 24, 6, 9, 33, ...
$ housing14       <int> 6838, 1160, 1453, 1362, 1242, 688, 1867, 1253, 5801, 4804, 1...
$ median_living14 <int> 19756, 21679, 19806, 21143, 21781, 22310, 18911, 20612, 1934...
$ empl            <int> 7453, 488, 2128, 852, 474, 2117, 2113, 1430, 4328, 5722, 335...
$ emp_sal         <int> 6743, 373, 1940, 738, 354, 1903, 1934, 1297, 3824, 5099, 303...
$ pop15_64        <int> 8963, 1614, 2217, 2163, 2043, 2361, 2426, 2059, 7271, 5286, ...
$ unemp15_64      <int> 1060, 108, 204, 121, 120, 69, 263, 165, 976, 647, 193, 111, ...
$ act15_64        <int> 6682, 1267, 1596, 1726, 1634, 2020, 1776, 1627, 5551, 3923, ...
$ unemp_rate      <dbl> 0.11826397, 0.06691450, 0.09201624, 0.05594082, 0.05873715, ...
# merge geo and ineq
geo_ineq <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(ineq$CODGEO),],
                                   ineq[unique(ineq$CODGEO) %in% unique(geo$CODGEO),])
# which city has the highest/lowest unemployment rate
ineq[which.max(ineq$unemp_rate),c('CODGEO')] 
[1] 33402
ineq[which.min(ineq$unemp_rate),c('CODGEO')] 
[1] 73257
# get list of top 10 cities with the highest unemployment rate.
Rank = sort(ineq$unemp_rate, TRUE)
Rank10 = sort(Rank[1:10], TRUE)
sol = rep(0, 10)
for (i in 1:10){
    sol[i] = paste(ineq$CODGEO[ineq$unemp_rate %in% Rank10[i]] ) 
}
# which is the smallest/biggest city (area)
ineq[which.max(ineq$Superficie),c('CODGEO')] 
[1] 13004
ineq[which.min(ineq$Superficie),c('CODGEO')] 
[1] 6011
# which city has the highest/lowest median living level
ineq[which.max(ineq$median_living14),c('CODGEO')] 
[1] 74016
ineq[which.min(ineq$median_living14),c('CODGEO')] 
[1] 93014
# Adjust scales before plotting
sc <- scale_colour_gradientn(colours = c("#56ddc5", "#ff3db7"), limits=c(min(ineq$unemp_rate), max(ineq$unemp_rate)))
# Map ratio 
ineq_unempl_distribution <-
  FraMap +
  geom_point(aes(x=geo_ineq$longitude, y=geo_ineq$latitude,
                 colour=geo_ineq$unemp_rate),
             data=geo_ineq, alpha=1, size=1) +
             sc + labs(color='Unemployment rate') + ggtitle("Unemployment rate for each town")
ineq_unempl_distribution

# remove the dataset created to save memory
rm(list=c("geo_ineq", "ineq_unempl_distribution"))

4.8.1 What we have learned

Bordeaux has the highest unemployment rate. Nantes has the lowest unemployment rate. =“Aquitaine” “Languedoc-Roussillon” “Nord-Pas-de-Calais” “Haute-Normandie” “Guadeloupe”
“Corse” “Nord-Pas-de-Calais” “Corse” “Rhône-Alpes” “Nord-Pas-de-Calais”

4.9 Commune data

glimpse(commune)
Observations: 5,025
Variables: 25
$ CODGEO                     <int> 1004, 1007, 1014, 1024, 1025, 1027, 1031, 1032, 1...
$ orient_econ                <fctr> Bassin Résidentiel, Bassin Résidentiel, Bassin...
$ nb_household               <int> 4640, 798, 1139, 719, 797, 482, 1484, 855, 4526, ...
$ nb_female                  <int> 11350, 2078, 3480, 1828, 2220, 1372, 3344, 2442, ...
$ nb_male                    <int> 10878, 2096, 3560, 1922, 2280, 1590, 3476, 2546, ...
$ nb_minor                   <int> 13624, 2586, 4537, 2295, 2886, 1871, 4184, 3233, ...
$ nb_major                   <int> 8604, 1588, 2503, 1455, 1614, 1091, 2636, 1755, 8...
$ nb_students                <int> 904, 173, 335, 159, 152, 133, 385, 190, 755, 519,...
$ nb_firms_service           <int> 342, 33, 63, 34, 17, 17, 70, 34, 264, 288, 108, 3...
$ nb_firms_commerce          <int> 301, 27, 96, 24, 22, 33, 54, 32, 270, 236, 140, 4...
$ nb_firms_construction      <int> 58, 17, 22, 20, 14, 9, 15, 22, 83, 36, 49, 21, 17...
$ nb_firms_ind               <int> 108, 24, 106, 22, 16, 26, 102, 34, 83, 103, 75, 3...
$ nb_created_firms           <int> 83, 14, 19, 7, 8, 5, 12, 14, 42, 54, 37, 9, 226, ...
$ nb_created_ind             <int> 4, 1, 1, 1, 0, 0, 2, 2, 3, 3, 0, 0, 17, 1, 1, 1, ...
$ nb_created_construction    <int> 14, 2, 3, 0, 1, 0, 2, 1, 15, 10, 6, 1, 31, 1, 3, ...
$ nb_created_commerce        <int> 27, 1, 7, 4, 2, 5, 3, 1, 6, 18, 7, 3, 63, 5, 2, 2...
$ nb_created_services        <int> 38, 10, 8, 2, 5, 0, 5, 10, 18, 23, 24, 5, 115, 3,...
$ urbanRural                 <fctr> Com < 50 m habts, Com rurale > 2 000 habts, Com ...
$ nb_active                  <int> 4556, 921, 1589, 912, 1070, 740, 1727, 1115, 4382...
$ nb_active_employees        <int> 4203, 798, 1418, 814, 906, 678, 1594, 1005, 4033,...
$ nb_active_non_employees    <int> 353, 123, 171, 98, 164, 62, 133, 110, 349, 320, 2...
$ dymanic_demo_prof          <fctr> 1.Accroissement par excédent naturel et migrato...
$ regional_GDP               <int> 173681, 173681, 173681, 173681, 173681, 173681, 1...
$ evol_pop_score             <fctr> 72,95082, 72,54098, 75, 72,54098, 72,54098, 74,5...
$ evol_entrepreneurial_score <fctr> 0,38471, 0,03314, 0,09539, 0,04784, 0,0317, 0,05...
# change the factor variables to numeric
# aaa = as.numeric(commune$urbanRural)
# Merge geo and commune data to include longitude and latitude in the data.
geo_commune <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(commune$CODGEO),],
                                   commune[unique(commune$CODGEO) %in% unique(geo$CODGEO),])
# Adjust scales before plotting
#sc <- scale_colour_gradientn(colours = palette(rainbow(3)), limits=c(min(aaa), max(aaa)))
# Map ratio 
 # urbanrural<-
 #  FraMap +
 #  geom_point(aes(x=geo_commune$longitude, y=geo_commune$latitude,
 #                 colour=aaa),
 #             data=geo_commune, alpha=1, size=0.5) +
 #             sc + labs(color='') + ggtitle("Urban and rural cities")
# urbanrural
# boxplot
#  number of units
n_firms <- length(commune$nb_firms_commerce)
# vector 
Label <- c(rep("Firms", n_firms*4))
# vector representing the variable considered
Variable <- c(rep("Service", n_firms), 
             rep("Industry", n_firms),
             rep("Commerce", n_firms),
             rep("Construction", n_firms))
          
             
# merge these data
firms_sector = cbind.data.frame(Label = Label, 
             value = c(commune$nb_firms_commerce, commune$nb_firms_service, commune$nb_firms_ind, commune$nb_firms_construction),
             Variable = Variable)
# plotting phase
ggplot(data = firms_sector, aes(x=Label, y=value)) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Firms") + ylab("# of firms") + ggtitle("Sector comparison") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_boxplot(geom = "errorbar", width = 0.5)

# the same but excluding outliers
ggplot(data = firms_sector, aes(x=Label, y=value)) +
  scale_y_continuous(limits = quantile(firms_sector$value, c(0, 0.9))) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Firms") + ylab("# of firms") + 
  ggtitle("Sector comparison excluding the last decile") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_boxplot(geom = "errorbar", width = 0.5)

rm(list = c("geo_commune", "Rank", "Rank10", "aaa"))
object 'aaa' not found

5 Create an unique dataset

Create an unique dataset:

# merging
newDat = merge(firms, population, by="CODGEO")
newDat = merge(newDat, salary, by="CODGEO")
newDat = merge(newDat, geo, by="CODGEO")
newDat = merge(newDat, status_work, by="CODGEO")
newDat = merge(newDat, ineq, by="CODGEO")
newDat = merge(newDat, educ, by="CODGEO")
newDat = merge(newDat, categ_socio, by="CODGEO")
newDat = merge(newDat, commune, by="CODGEO")
# check
names(newDat)
  [1] "CODGEO"                     "town"                      
  [3] "total"                      "micro"                     
  [5] "small"                      "medium"                    
  [7] "large"                      "null"                      
  [9] "total_population"           "male"                      
 [11] "female"                     "child"                     
 [13] "elderly"                    "workforce"                 
 [15] "dependent"                  "sex_ratio"                 
 [17] "dependency_ratio"           "aged_dependency_ratio"     
 [19] "child_dependency_ratio"     "urbanDummy"                
 [21] "sal_general"                "sal_executive"             
 [23] "sal_midManager"             "sal_employee"              
 [25] "sal_worker"                 "sal_Females"               
 [27] "sal_F_executive"            "sal_F_midManager"          
 [29] "sal_F_employee"             "sal_F_worker"              
 [31] "sal_Males"                  "sal_M_executive"           
 [33] "sal_M_midManager"           "sal_M_employee"            
 [35] "sal_M_worker"               "sal_18_25"                 
 [37] "sal_26_50"                  "sal_51plus"                
 [39] "sal_F_18_25"                "sal_F_26_50"               
 [41] "sal_F_51plus"               "sal_M_18_25"               
 [43] "sal_M_26_50"                "sal_M_51plus"              
 [45] "salary_ratio_FvsM"          "salary_ratio_FvsM_Exec"    
 [47] "salary_ratio_FvsM_midManag" "salary_ratio_FvsM_worker"  
 [49] "salary_ratio_FvsM_employee" "salary_ratio_FvsM_18_25"   
 [51] "salary_ratio_FvsM_26_50"    "salary_ratio_FvsM_51plus"  
 [53] "region"                     "region_capital"            
 [55] "department"                 "postal_code"               
 [57] "latitude"                   "longitude"                 
 [59] "wagearner_M"                "wagearner_F"               
 [61] "independent_M"              "independent_F"             
 [63] "transfer_M"                 "transfer_F"                
 [65] "employer_M"                 "employer_F"                
 [67] "full_M"                     "full_F"                    
 [69] "half_M"                     "half_F"                    
 [71] "full"                       "half"                      
 [73] "wagearner"                  "independent"               
 [75] "transfer"                   "employer"                  
 [77] "REG"                        "DEP"                       
 [79] "Superficie"                 "housing14"                 
 [81] "median_living14"            "empl"                      
 [83] "emp_sal"                    "pop15_64"                  
 [85] "unemp15_64"                 "act15_64"                  
 [87] "unemp_rate"                 "female_NoDip"              
 [89] "male_NoDip"                 "female_Sec"                
 [91] "male_Sec"                   "female_Hi"                 
 [93] "male_Hi"                    "female_Univ"               
 [95] "male_Univ"                  "nodip"                     
 [97] "sec"                        "high"                      
 [99] "univ"                       "male_immig"                
[101] "female_immig"               "total_immig"               
[103] "male_working_immig"         "female_working_immig"      
[105] "male_native"                "female_native"             
[107] "total_native"               "orient_econ"               
[109] "nb_household"               "nb_female"                 
[111] "nb_male"                    "nb_minor"                  
[113] "nb_major"                   "nb_students"               
[115] "nb_firms_service"           "nb_firms_commerce"         
[117] "nb_firms_construction"      "nb_firms_ind"              
[119] "nb_created_firms"           "nb_created_ind"            
[121] "nb_created_construction"    "nb_created_commerce"       
[123] "nb_created_services"        "urbanRural"                
[125] "nb_active"                  "nb_active_employees"       
[127] "nb_active_non_employees"    "dymanic_demo_prof"         
[129] "regional_GDP"               "evol_pop_score"            
[131] "evol_entrepreneurial_score"
head(newDat)

Save/load final dataset (just to avoid running all the code)

## UNCOMMENT THESE LINE AND COMMENT THE NEXT ONES TO SAVE THE DATASET
## THEN MOVE THIS FILE IN THE data FOLDER
# #remove unused factor levels on newDat
# for (i in 1:ncol(newDat)){
#   if (is.factor(newDat[,i])){
#     newDat[,i] = droplevels.factor(newDat[,i])
#   }
# }
# # save the NewDat.csv file created (which has to be moved in the data folder)
# write.table(newDat, "newDat.csv", quote = TRUE, row.names=FALSE, col.names = TRUE,
#           fileEncoding = "UTF-8", sep = ",")
 
## LOADING DATA
# load the final dataset
setwd("./data")
newDat <- read.csv("newDat.csv", encoding = "UTF-8", header = TRUE)
# JUST A CHECK, DO NOT NEED TO RUN
# check (changing newDat loaded with newDat2), the only differences are in the order of 10e-15
# identical(newDat, newDat2)
 

Spatial plot of the data to spot possible patterns. A sub-sample will be retained when needed in order to reduce geo-spatial correlations.

# center of France, obtained using: 
# fra_center = as.numeric(geocode("France")) 
fra_center = c(2.213749, 46.227638) 
 
ggmap(get_googlemap(center=fra_center, scale=2, zoom=5), extent="normal") + 
  geom_point(aes(x=longitude, y=latitude), data=newDat, col="orange", alpha=1, size=0.05)  
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=5&size=640x640&scale=2&maptype=terrain&sensor=false

NA

6 Unsupervised Learning

6.1 Multi-Dimensional Scaling (MDS)

6.1.1 Preliminary multivariate data visualisation to better conduct the unsupervised learning

region <- geo$region
# add region variable to the salary dataset
salary_with_dep <- cbind(region, salary)
# extract continuous variables
salary_variables <- salary_with_dep[, 3:34]
#scatterplot to visually see the correlations between variables
pairs(salary_variables[1:8], 
      panel = function (x, y, ...) {
          points(x, y, ...)
          abline(lm(y ~ x), col = "red")
      }, pch = ".", cex = 0.5)

#scatterplot between variables of major interest
pairs(salary_variables[, c("salary_ratio_FvsM", "sal_general", "sal_Females","sal_Males")], 
      panel = function (x, y, ...) {
          points(x, y, ...)
          abline(lm(y ~ x), col = "red")
      }, pch = ".", cex = 0.5)

#scale the data
salary_variables_scaled <- as.data.frame(scale(salary_variables))
#divide sal_general into 4 parts
salary_levels <- with(salary_variables_scaled, 
                  equal.count(sal_general,4))
#plot sal_worker against sal_midManager given 3 partitions of sal_general
plot(xyplot(sal_worker ~ sal_midManager| cut(sal_general, 3), 
            data = salary_variables, 
            layout = c(3, 1), 
            xlab = "Worker Salary", 
            ylab = "Midmanager salary"))

#multidimesnional plot of midManager salary against worker salary, given 4 partitions of F/M salary ratio, and grayscale intensities of general salary
M_F_salary_ratio  <- with(salary_variables, equal.count(salary_ratio_FvsM, 4))
general_salary.ord <- with(salary_variables, rev(order(sal_general)))
salary_variables.ordered <- salary_variables[general_salary.ord,]
sal_general.breaks <- with(salary_variables.ordered, 
                     do.breaks(range(sal_general),50))
salary_variables.ordered$color<-level.colors(salary_variables.ordered$sal_general,
                                   at=sal_general.breaks,
                                   col.regions=grey.colors)
plot(xyplot(sal_worker ~ sal_midManager | M_F_salary_ratio, 
            data = salary_variables.ordered,
            aspect = "iso", 
            groups = color, 
            cex = 1, col = "black",
       panel = function(x, y, groups, ..., subscripts) {
           fill <- groups[subscripts]
           panel.grid(h = -1, v = -1)
           panel.xyplot(x, y, pch = 21, 
                        fill = fill, ...)
       },
       legend =
       list(right =
            list(fun = draw.colorkey,
                 args=list(key=list(col=gray.colors,
                                    at = sal_general.breaks),
                           draw = FALSE))),
            xlab = "Worker Salary", 
            ylab = "Midmanager salary"))

6.1.1.1 Preliminary multivariate firms data visualisation to better conduct the unsupervised learning

# add region variable to the salary dataset
firms_with_dep <- cbind(region, firms)
# extract continuous variables
firms_variables <- firms_with_dep[, 4:8]
#scatterplot to visually see the correlations between variables
pairs(firms_variables[, 2:5], 
      panel = function (x, y, ...) {
          points(x, y, ...)
          abline(lm(y ~ x), col = "red")
      }, pch = ".", cex = 0.5)

#scale the data
firms_variables_scaled <- as.data.frame(scale(firms_variables))
#divide firms total into 4 parts
firms_levels <- with(firms_variables_scaled, 
                  equal.count(total,4))
#plot sal_worker against sal_midManager given 3 partitions of sal_general
plot(xyplot(medium ~ large| cut(total, 3), 
            data = firms_variables, 
            layout = c(3, 1), 
            xlab = "medium firms", 
            ylab = "large firms"))

6.1.2 Multi-dimensional scaling (MDS)

6.1.2.1 Multi-dimensional scaling (MDS) on salary data

#Multi-dimensional scaling (MDS) on salary data
# produces a list with covariance matrix
# for each region as component:
salary_var <- tapply(1:nrow(salary_with_dep), salary_with_dep$region, 
                     function(i) var(salary_with_dep[i,3:7]))
#remove regions with null values
salary_var <- salary_var[-c(11, 12, 15, 19, 20, 28)]
#create dataframe with the counts of towns per region
regions_count <- data.frame(table(salary_with_dep$region))
regions_count_sub<-regions_count[!(regions_count$Freq==0 | regions_count$Freq==1),]
#check total sum
#row.names(regions_count_sub) <- 1:nrow(regions_count_sub)
#sum(regions_count_sub$Freq)
# initializes common covariance matrix var:
S <- regions_count_sub$Freq[1] * as.matrix(salary_var[[1]])
# creates common covariance matrix S
s <- 0
for (v in c(1:22)) S <- S + regions_count_sub$Freq[v] * as.matrix(salary_var[[v]]) 
S <- S / 5003
# finds center of each variable (sal_general, etc)
# for each region
salary_cen <- tapply(1:nrow(salary_with_dep), salary_with_dep$region, 
    function(i) apply(salary_with_dep[i,c(3:7)], 2, mean))
salary_cen <- salary_cen[-c(11, 12, 15, 19, 20, 28)]
# create a matrix out of each components, each
# mean measurement for all variables by region
salary_cen <- matrix(unlist(salary_cen), 
    nrow = length(salary_cen), byrow = TRUE)
#salary_cen
# compute the mahalanobis distances:
salary_mah <- apply(salary_cen, 1, 
    function(cen) mahalanobis(salary_cen, cen, S))
# run it on two dimensions
salary_mds <- cmdscale(salary_mah)
# draw a scatterplot of two-dimensional solution
# from classical MDS applied to Mahalanobis
# distances:
lim <- range(salary_mds) * 1.2
plot(salary_mds, xlab = "Coordinate 1", 
     ylab = "Coordinate 2",
     xlim = lim, ylim = lim, type = "n")
text(salary_mds, labels = levels(salary_with_dep$region), 
     cex = 0.5)

One can see that Languedoc Roussilon, the worst region in terms of GDP per capita, is further away from the center.

6.1.2.2 Multi-dimensional scaling (MDS) on firms data

firms_with_dep <- firms_with_dep[, -9]
firms_with_dep_normalized <- firms_with_dep[, 4:8]/population$total_population
firms_with_dep_normalized <- cbind(firms_with_dep$region, firms_with_dep_normalized)
names(firms_with_dep_normalized)[1] <- "region"
# produces a list with covariance matrix
# for each region as component:
firms_var <- tapply(1:nrow(firms_with_dep_normalized), firms_with_dep_normalized$region, 
                     function(i) var(firms_with_dep_normalized[i,2:6]))
#remove regions with null values
firms_var <- firms_var[-c(11, 12, 15, 19, 20, 28)]
#create dataframe with the counts of towns per region
regions_count <- data.frame(table(firms_with_dep_normalized$region))
regions_count_sub<-regions_count[!(regions_count$Freq==0 | regions_count$Freq==1),]
#check total sum
# row.names(regions_count_sub) <- 1:nrow(regions_count_sub)
# sum(regions_count_sub$Freq)
# initializes common covariance matrix var:
S <- regions_count_sub$Freq[1] * as.matrix(firms_var[[1]])
# creates common covariance matrix S
s <- 0
for (v in c(1:22)) S <- S + (regions_count_sub$Freq[v] - 1) * as.matrix(firms_var[[v]]) 
S <- S / 5003
# finds center of each variable (micro, etc)
# for each region
firms_cen <- tapply(1:nrow(firms_with_dep_normalized), firms_with_dep_normalized$region, 
    function(i) apply(firms_with_dep_normalized[i,-1], 2, mean))
firms_cen <- firms_cen[-c(11, 12, 15, 19, 20, 28)]
# create a matrix out of each components, each
# mean measurement for all variables by region
firms_cen <- matrix(unlist(firms_cen), 
    nrow = length(firms_cen), byrow = TRUE)
# compute the mahalanobis distances:
firms_mah <- apply(firms_cen, 1, 
    function(cen) mahalanobis(firms_cen, cen, S))
# run it on two dimensions
firms_mds <- cmdscale(firms_mah)
# draw a scatterplot of two-dimensional solution
# from classical MDS applied to Mahalanobis
# distances:
lim <- range(firms_mds) * 1.2
plot(firms_mds, xlab = "Coordinate 1", 
     ylab = "Coordinate 2",
     xlim = lim, ylim = lim, type = "n")
text(firms_mds, labels = levels(firms_with_dep_normalized$region), 
     cex = 0.5)

Since tha data is normalized according to population size, Haute Normandie and Midi Pyrenees are further away.

6.1.2.3 Multi-Dimensional Scaling for merged dataset

nums <- unlist(lapply(commune, is.numeric)) 
commune_selected <- commune[, nums]
commune_selected <- commune_selected[, 2:20]
commune_selected_normalized <- commune_selected[]/population$total_population
unemp_percent <- as.data.frame((ineq$unemp15_64/ineq$pop15_64)*100)
names(unemp_percent)[1] <- "unemp_percent"
firms_variables_normalized <- firms_variables[]/population$total_population
selected_variables <- cbind(firms_variables_normalized, salary_variables, commune_selected_normalized, unemp_percent)
selected_variables_with_dep <- cbind(geo$region, selected_variables)
names(selected_variables_with_dep)[1] <- "region"
pca_selected_variables <- prcomp(selected_variables)
selected_pca_scores<- as.data.frame(pca_selected_variables$x[,1:3])
selected_pca_with_dep <- cbind(geo$region, selected_pca_scores)
names(selected_pca_with_dep)[1] <- "region"
# produces a list with covariance matrix
# for each region as component:
selected_PCA_var <- tapply(1:nrow(selected_pca_with_dep), selected_pca_with_dep$region,
                     function(i) var(selected_pca_with_dep[i,2:4]))
#remove regions with null values
selected_PCA_var <- selected_PCA_var[-c(11, 12, 15, 19, 20, 28)]
#create dataframe with the counts of towns per region
regions_count <- data.frame(table(selected_pca_with_dep$region))
regions_count_sub<-regions_count[!(regions_count$Freq==0 | regions_count$Freq==1),]
#check total sum
# row.names(regions_count_sub) <- 1:nrow(regions_count_sub)
# sum(regions_count_sub$Freq)
# initializes common covariance matrix var:
S <- regions_count_sub$Freq[1] * as.matrix(selected_PCA_var[[1]])
# creates common covariance matrix S
s <- 0
for (v in c(1:22)) S <- S + (regions_count_sub$Freq[v] - 1) * as.matrix(selected_PCA_var[[v]])
S <- S / 5003
# finds center of each variable (micro, etc)
# for each region
selected_cen <- tapply(1:nrow(selected_pca_with_dep), selected_pca_with_dep$region,
    function(i) apply(selected_pca_with_dep[i,-1], 2, mean))
selected_cen <- selected_cen[-c(11, 12, 15, 19, 20, 28)]
# create a matrix out of each components, each
# mean measurement for all variables by region
selected_cen <- matrix(unlist(selected_cen),
    nrow = length(selected_cen), byrow = TRUE)
# compute the mahalanobis distances:
selected_mah <- apply(selected_cen, 1,
    function(cen) mahalanobis(selected_cen, cen, S))
# run it on two dimensions
selected_mds <- cmdscale(selected_mah)
# draw a scatterplot of two-dimensional solution
# from classical MDS applied to Mahalanobis
# distances:
lim <- range(selected_mds) * 1.2
plot(selected_mds, xlab = "Coordinate 1",
     ylab = "Coordinate 2",
     xlim = lim, ylim = lim, type = "n")
text(selected_mds, labels = levels(selected_pca_with_dep$region),
     cex = 0.5)

On the merged dataset one can Rhone-Alpes and Cote d’Azur, the two biggest economies after Ile de France.

6.2 PCA

6.2.0.1 PCA on salary data and delta M-F

pca_salary <- prcomp(salary_variables)
autoplot(pca_salary, data = salary_with_dep, colour = 'region',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)

plot(pca_salary, type = "l")

#create dataset with M-F differences
salary_var_delta <- data.frame("M-F" = salary_variables$sal_Males -                                                           salary_variables$sal_Females,
                     "Executive M-F" = salary_variables$sal_M_executive -
                                         salary_variables$sal_F_executive,
                    "midManager M-F" = salary_variables$sal_M_midManager -
                                         salary_variables$sal_F_midManager,
                      "employee M-F" = salary_variables$sal_M_employee -
                                         salary_variables$sal_F_employee,
                        "worker M-F" = salary_variables$sal_M_worker -
                                         salary_variables$sal_F_worker,
                         "18-25 M-F" = salary_variables$sal_M_18_25 -
                                         salary_variables$sal_F_18_25,
                         "26-50 M-F" = salary_variables$sal_M_26_50 -
                                         salary_variables$sal_F_26_50,
                       "51 plus M-F" = salary_variables$sal_M_51plus -
                                         salary_variables$sal_F_51plus)
pca_salary_delta <- prcomp(salary_var_delta)
autoplot(pca_salary_delta, data = salary_with_dep, colour = 'region',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)

plot(pca_salary_delta, type = "l")

There are three main directions of eigenvectors contributing to principal components’ scores: variables describing salaries of ordinary staff across different age categories, variables describing mid-manager and manager salaries, and salaries of executives.

PCA on delta M-F revealed three main directions of eigenvectors contributing to scores: male-female salary differences of mid-managers, executives, and 51 and older aged people.

6.2.0.2 PCA on firms data

firms_variables_normalized <- firms_variables[, -1]/population$total_population
pca_firms <- prcomp(firms_variables_normalized)
autoplot(pca_firms, data = firms_with_dep, colour = 'region',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)

plot(pca_firms, type = "l")

Since there is a strong multicollinearity among the variables of the firms data, the conducted PCA results in the variable with considerably higher values, i.e. very small firms, to form a main eigenvector, leaving firms of other sizes to form a separate group of eigenvectors.

6.2.0.3 PCA on commune dataset

commune_trimmed <- commune[,c("nb_firms_service","nb_firms_commerce","nb_firms_construction")]
commune_trimmed$active_empl_prop <- commune$nb_active_employees/population$total_population
commune_trimmed$nb_firms_service <- commune$nb_firms_service/population$total_population
commune_trimmed$nb_firms_commerce <- commune$nb_firms_commerce/population$total_population
commune_trimmed$nb_firms_construction <- commune$nb_firms_construction/population$total_population
#commune_trimmed <- as.data.frame(scale(commune_trimmed))
names(which(sapply(commune_trimmed, anyNA)))
character(0)
str(commune_trimmed)
'data.frame':   5025 obs. of  4 variables:
 $ nb_firms_service     : num  0.02541 0.01288 0.0192 0.01021 0.00541 ...
 $ nb_firms_commerce    : num  0.02236 0.01053 0.02926 0.00721 0.007 ...
 $ nb_firms_construction: num  0.00431 0.00663 0.00671 0.00601 0.00446 ...
 $ active_empl_prop     : num  0.312 0.311 0.432 0.245 0.288 ...
commune_pca <- prcomp(commune_trimmed)
commune_pca
Standard deviations (1, .., p=4):
[1] 0.065193566 0.016941520 0.005599656 0.002815369

Rotation (n x k) = (4 x 4):
                               PC1         PC2         PC3         PC4
nb_firms_service      -0.025510236 -0.57990964 -0.81271473  0.05048565
nb_firms_commerce     -0.003487765 -0.81175562  0.57203205 -0.11755840
nb_firms_construction  0.011709990 -0.06688609  0.10896441  0.99172369
active_empl_prop      -0.999599890  0.01684835  0.02002141  0.01073949
commune_pca_scores<- as.data.frame(commune_pca$x[,1:3])
autoplot(commune_pca, data = geo, colour = 'region',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)

plot(commune_pca, type = "l")

# plot3d(commune_pca[,1:2], col=cluster$cluster+1)
scatter3d(x = commune_pca_scores$PC1, y = commune_pca_scores$PC2, z = commune_pca_scores$PC3, groups = as.factor(kmeans(commune_pca_scores, 
                  centers = 6)$cluster),
          grid = FALSE, surface = FALSE)
Loading required namespace: mgcv

PCA conducted on the commune dataset revealed two main directions of eigenvectors: proportion of active employers in each town, and industry related variables.

6.2.0.4 PCA on selected variables

nums <- unlist(lapply(commune, is.numeric)) 
commune_selected <- commune[, nums]
commune_selected <- commune_selected[, 2:20]
commune_selected_normalized <- commune_selected[]/population$total_population
unemp_percent <- as.data.frame((ineq$unemp15_64/ineq$pop15_64)*100)
names(unemp_percent)[1] <- "unemp_percent"
selected_variables <- cbind(firms_variables_normalized, salary_variables, commune_selected_normalized, unemp_percent)
pca_selected_variables <- prcomp(selected_variables)
autoplot(pca_selected_variables, data = geo, colour = 'region',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)

plot(pca_selected_variables, type = "l")

regional GDP is a striking important factor, comprising a distinctly oriented eigenvector. Another distinct eigenvector, though less substantial, is the unemployment rate.

6.3 Cluster Analysis

6.3.0.1 Cluster Analysis on salary

#create dataframe with first two principal components scores as variables
salary_pca_scores<- as.data.frame(pca_salary$x[,1:2])
salary_pca_scores <- as.data.frame(scale(salary_pca_scores))
sapply(salary_pca_scores, var)
PC1 PC2 
  1   1 
 
# K-means clustering
n <- nrow(salary_pca_scores)
wss <- rep(0, 22)
wss[1] <- (n - 1) * sum(sapply(salary_pca_scores, var))
for (i in 2:22)
    wss[i] <- sum(kmeans(salary_pca_scores,
                         centers = i)$withinss)
did not converge in 10 iterations
#we can see from the plot that optimal number of cluster is 5
plot(1:22, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

#plotting in PC space shows us that there are indeed distinguishable 5 groups 
plot(salary_pca_scores, 
     pch = kmeans(salary_pca_scores, 
                  centers = 5)$cluster)

#now we have to see whether these groups correspond to geographical regions...

6.3.0.2 clustering salary delta M-F

salary_delta_pca_scores <- as.data.frame(pca_salary_delta$x[,1:3])
n <- nrow(salary_delta_pca_scores)
wss <- rep(0, 15)
wss[1] <- (n - 1) * sum(sapply(salary_delta_pca_scores, var))
for (i in 2:15)
    wss[i] <- sum(kmeans(salary_delta_pca_scores,
                         centers = i)$withinss)
#we can see from the plot that optimal number of cluster is 5
plot(1:15, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

#plots
scatter3d(x = salary_delta_pca_scores$PC1, y = salary_delta_pca_scores$PC2, z = salary_delta_pca_scores$PC3, groups = as.factor(kmeans(salary_delta_pca_scores, 
                  centers = 6)$cluster),
          grid = FALSE, surface = FALSE)
clusters <- kmeans(salary_delta_pca_scores, 
                  centers = 6)
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(clusters$cluster)),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on male-female salary difference")
salary_geo

salary_delta_cluster <- cbind(salary_var_delta, clusters$cluster)
names(salary_delta_cluster)[9] <- "cluster"
sal_delta_by_cluster <- aggregate(salary_delta_cluster$M.F, by=list(salary_delta_cluster$cluster), FUN=mean)[2]
names(sal_delta_by_cluster)[1] <- "Salary_difference"
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= salary_variables$salary_ratio_FvsM),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Female-Male salary ratio")
salary_geo

6.3.0.3 Finding towns with large difference in male-female salaries

#detecting extreme cities in PCA
quantiles<-tapply(salary_delta_pca_scores$PC1,firms$town,quantile)
minq <- sapply(firms$town, function(x) quantiles[[x]]["25%"])
maxq <- sapply(firms$town, function(x) quantiles[[x]]["75%"])
towns <- as.data.frame(firms$town)
k <- salary_delta_pca_scores[which(salary_delta_pca_scores$PC1<minq | salary_delta_pca_scores$PC1>maxq), ]
#find towns with largedifference 
extreme_town_list <- towns[which(salary_delta_pca_scores$PC1<minq | salary_delta_pca_scores$PC1>maxq), ]
extreme_towns <- as.data.frame(extreme_town_list)
#detecting extreme cities in salary delta
quantiles<-tapply(salary_var_delta$M.F,firms$town,quantile)
minq <- sapply(firms$town, function(x) quantiles[[x]]["25%"])
maxq <- sapply(firms$town, function(x) quantiles[[x]]["75%"])
town <- towns[which(salary_var_delta$M.F<minq | 
                                    salary_var_delta$M.F>maxq), ]
extreme_towns_m_f <- as.data.frame(town)
geo <- as.data.frame(cbind(firms$town, geo))
colnames(geo)[colnames(geo)=="firms$town"] <- "town"
geo_extreme <- subset(geo, town %in% extreme_towns_m_f$town)
geo_extreme <- geo_extreme[, c("town", "latitude", "longitude")]
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo_extreme$longitude, y=geo_extreme$latitude),
             data=geo_extreme, col="red", alpha=0.5, size=0.5) + labs(color='') + ggtitle("Towns with large difference in male-female salaries")
salary_geo

#As you can see, the towns with large salry differences are concetrated around large cities such as Paris and Lyon.

6.3.0.4 clustering towns based on unemployed %

unemp_percent <- as.data.frame((ineq$unemp15_64/ineq$pop15_64)*100)
names(unemp_percent)[1] <- "unemp_percent"
n <- nrow(unemp_percent)
wss <- rep(0, 15)
wss[1] <- (n - 1) * sum(sapply(unemp_percent, var))
for (i in 2:15)
    wss[i] <- sum(kmeans(unemp_percent,
                         centers = i)$withinss)
#we can see from the plot that optimal number of cluster is 6
plot(1:15, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

clusters <- kmeans(unemp_percent, 6)
unemp_percent <- cbind(unemp_percent, clusters$cluster)
names(unemp_percent)[2] <- "cluster"
unemp_by_cluster <- aggregate(unemp_percent$unemp_percent, by=list(unemp_percent$cluster), FUN=mean)[2]
names(unemp_by_cluster)[1] <- "unemp_rate(%)" 
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(clusters$cluster), size = unemp_percent$unemp_percent), data=geo, alpha=0.5, size=0.5) + scale_fill_manual(name = "Unemployment rate", values = unemp_by_cluster$`unemp_rate(%)`) + labs(color='') + ggtitle("Clusters of towns based on unemployment rate")
salary_geo

6.3.0.5 Clustering based on Locational Gini Coefficient

industry <- commune[, c("nb_firms_service", "nb_firms_commerce", "nb_firms_construction")]
lq_service <- (industry$nb_firms_service/(industry$nb_firms_service+industry$nb_firms_commerce+industry$nb_firms_construction))/(sum(industry$nb_firms_service)/sum(industry))
lq_service <- as.data.frame((lq_service))
lq_commerce <- (industry$nb_firms_commerce/(industry$nb_firms_service+industry$nb_firms_commerce+industry$nb_firms_construction))/(sum(industry$nb_firms_commerce)/sum(industry))
lq_commerce <- as.data.frame((lq_commerce))
lq_construction <- (industry$nb_firms_construction/(industry$nb_firms_service+industry$nb_firms_commerce+industry$nb_firms_construction))/(sum(industry$nb_firms_construction)/sum(industry))
lq_construction <- as.data.frame((lq_construction))
industry_lq <- cbind(industry, lq_service, lq_commerce, lq_construction)
#create dataframe with the counts of towns per region
regions_count <- data.frame(table(salary_with_dep$region))
regions_count_sub<-regions_count[!(regions_count$Freq==0 | regions_count$Freq==1),]
#check total sum
row.names(regions_count_sub) <- 1:nrow(regions_count_sub)
sum(regions_count_sub$Freq)
[1] 5025
industry_lq <- cbind(firms$town, industry_lq)
names(industry_lq)[1] <- "town" 
names(industry_lq)[5] <- "lq_service"
names(industry_lq)[6] <- "lq_commerce"
names(industry_lq)[7] <- "lq_construction"
dm <- dist(industry_lq[, 5:7])
#round to two decimals
dm <- round(dm, 2)
 
# K-means clustering
n <- nrow(industry_lq)
wss <- rep(0, 22)
wss[1] <- (n - 1) * sum(sapply(industry_lq[, 5:7], var))
for (i in 2:22)
    wss[i] <- sum(kmeans(industry_lq[, 5:7],
                         centers = i)$withinss)
did not converge in 10 iterations
#we can see from the plot that optimal number of cluster is 5
plot(1:22, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

clusters <- kmeans(industry_lq[, 5:7], 5)
#Mapping the towns based on the locational Gini coefficient
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(clusters$cluster)),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on locational gini coefficient")
salary_geo

6.3.0.6 apping the towns based on the intensity of locational Gini coefficient

#Mapping the towns based on the intensity of locational Gini coefficient of the service industry
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= industry_lq$lq_service),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Towns based on service locational gini coefficient")
salary_geo

#Mapping the towns based on the intensity of locational Gini coefficient of the commerce industry
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= industry_lq$lq_commerce),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Towns based on commerce locational gini coefficient")
salary_geo

#Mapping the towns based on the intensity of locational Gini coefficient of the commerce industry
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= industry_lq$lq_construction),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Towns based on construction locational gini coefficient")
salary_geo

6.3.0.7 Salary clusters

#create distance matrix for hclust
dm <- dist(salary_variables)
#round to two decimals
dm <- round(dm, 2)
 
# K-means clustering
n <- nrow(salary_variables)
wss <- rep(0, 22)
wss[1] <- (n - 1) * sum(sapply(salary_variables, var))
for (i in 2:22)
    wss[i] <- sum(kmeans(salary_variables,
                         centers = i)$withinss)
Quick-TRANSfer stage steps exceeded maximum (= 251250)did not converge in 10 iterationsdid not converge in 10 iterationsdid not converge in 10 iterationsdid not converge in 10 iterationsdid not converge in 10 iterationsdid not converge in 10 iterationsdid not converge in 10 iterations
#we can see from the plot that optimal number of cluster is 5
plot(1:22, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

#now we have to see whether these groups correspond to geographical regions...

6.3.0.8 Mapping the salary clusters

 
#now we have to see whether these groups correspond to geographical regions...
clusters <- kmeans(salary_variables, 6)
salary_with_dep$cluster <- clusters$cluster
geo$cluster <- clusters$cluster
 
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(geo$cluster)),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on salary")
salary_geo

salary_cl <- cbind(salary_variables, clusters$cluster)
names(salary_cl)[33] <- "cluster"
salary_cluster <- aggregate(salary_cl$sal_general, by=list(salary_cl$cluster), FUN=mean)[2]
names(salary_cluster)[1] <- "salary" 
 

6.3.0.9 Clustering on selected variables

selected_pca_scores<- as.data.frame(pca_selected_variables$x[,1:3])
# K-means clustering
n <- nrow(selected_variables)
wss <- rep(0, 22)
wss[1] <- (n - 1) * sum(sapply(selected_variables, var))
for (i in 2:22)
    wss[i] <- sum(kmeans(selected_variables,
                         centers = i)$withinss)
did not converge in 10 iterationsdid not converge in 10 iterationsdid not converge in 10 iterationsdid not converge in 10 iterationsdid not converge in 10 iterationsdid not converge in 10 iterations
#we can see from the plot that optimal number of cluster is 7
plot(1:22, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

clusters <- kmeans(selected_variables, centers = 7)
scatter3d(x = selected_pca_scores$PC1, y = selected_pca_scores$PC2, z = selected_pca_scores$PC3, groups = as.factor(clusters$cluster),
          grid = FALSE, surface = FALSE)
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(clusters$cluster)),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on selected variables")
salary_geo

6.3.0.10 Clustering in commune

#K means
n <- nrow(commune_trimmed)
wss <- rep(0, 15)
wss[1] <- (n - 1) * sum(sapply(commune_trimmed, var))
for (i in 2:15)
    wss[i] <- sum(kmeans(commune_trimmed,
                         centers = i)$withinss)
#we can see from the plot that optimal number of cluster is 11
plot(1:15, wss, type = "b",
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

6.3.0.11 Mapping the new clusters

n <- nrow(commune_trimmed)
wss <- rep(0, 22)
wss[1] <- (n - 1) * sum(sapply(commune_trimmed$active_empl_prop, var))
for (i in 2:22)
    wss[i] <- sum(kmeans(commune_trimmed$active_empl_prop,
                         centers = i)$withinss)
Quick-TRANSfer stage steps exceeded maximum (= 251250)Quick-TRANSfer stage steps exceeded maximum (= 251250)
#we can see from the plot that optimal number of cluster is 7
plot(1:22, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

#now we have to see whether these groups correspond to geographical regions...
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
# Plot "Distribution of total population for each town"
newDat_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(kmeans(commune_trimmed$active_empl_prop, 
                  centers = 6)$cluster)),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on active employees proportion")
newDat_geo

FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=46.227638,2.213749&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false
# Plot "Distribution of total population for each town"
newDat_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= commune_trimmed$active_empl_prop),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on active employees")
newDat_geo

7 Supervised Learning

7.1 ANOVA for salary

First the distribution of salary for both genders is considered, in original and log-scale

# original data
# male
ggplot(data=data.frame(newDat$sal_Males), aes(newDat$sal_Males)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 40) +
  geom_density(col="black") +
  labs(x="Male salaries", y="Density") +
  ggtitle("Distribution of salary for males") +
  theme(plot.title = element_text(hjust = 0.5))

# females
ggplot(data=data.frame(newDat$sal_Females), aes(newDat$sal_Females)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 40) +
  geom_density(col="black") +
  labs(x="Female salaries", y="Density") +
  ggtitle("Distribution of salary for females") +
  theme(plot.title = element_text(hjust = 0.5))

# log data
# male
ggplot(data=data.frame(log10(newDat$sal_Males)), aes(log10(newDat$sal_Males))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 40) +
  geom_density(col="black") +
  labs(x="Log10 male salaries", y="Density") +
  ggtitle("Distribution of salary for males in log scale") +
  theme(plot.title = element_text(hjust = 0.5))

# females
ggplot(data=data.frame(log10(newDat$sal_Females)), aes(log10(newDat$sal_Females))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 40) +
  geom_density(col="black") +
  labs(x="Log10 female salaries", y="Density") +
  ggtitle("Distribution of salary for females in log scale") +
  theme(plot.title = element_text(hjust = 0.5))

# hist(c(log10(newDat$sal_Males), log10(newDat$sal_Females)), 30)

It is clear that the normality assumption, cannot hold.

At the same time, it is shown how to check if the difference in the mean in salary between males and females using a t-test

t.test(log10(newDat$sal_Males), log10(newDat$sal_Females), alternative = c("two.sided"))

    Welch Two Sample t-test

data:  log10(newDat$sal_Males) and log10(newDat$sal_Females)
t = 64.095, df = 9332.7, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.0857778 0.0911900
sample estimates:
mean of x mean of y 
 1.165991  1.077507 

This solution can also be otbained fitting a linear model with a dummy variable catching the sex effect, which corresponds to a one-way ANOVA model

# create the reponse attaching the salary for each sex
sal_y = c(log10(newDat$sal_Males), log10(newDat$sal_Females))
# create the dummy variable controlling for the corresponding sex
dummy = c(rep(1, length(newDat$sal_Males)), rep(0, length(newDat$sal_Females)))
# the t value is exactly the same as the one obtained in the previous t-test
sal_ANOVA = lm(sal_y ~ dummy)
summary(sal_ANOVA)

Call:
lm(formula = sal_y ~ dummy)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.14896 -0.04542 -0.01677  0.02970  0.55334 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.0775073  0.0009762 1103.81   <2e-16 ***
dummy       0.0884839  0.0013805   64.09   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06918 on 10042 degrees of freedom
Multiple R-squared:  0.2903,    Adjusted R-squared:  0.2903 
F-statistic:  4108 on 1 and 10042 DF,  p-value: < 2.2e-16
# residuals plot
plot(sal_ANOVA$residuals)
abline (0, 0, col = "red")

To improve such analysis some more aspects are now considered. In particualar, the model considered in an ANOVA, where the predictors take in account each gender for different age and job levels, plus their interaction terms.

First of all the dataset is created and a subsample of it is considered, in order to avoid strong correlation among units

set.seed(300)
# create response variable
sal_y = c(newDat$sal_M_18_25, newDat$sal_M_26_50, newDat$sal_M_51plus,
          newDat$sal_M_executive, newDat$sal_M_midManager, newDat$sal_M_employee, newDat$sal_M_worker,
          newDat$sal_F_18_25, newDat$sal_F_26_50, newDat$sal_F_51plus,
          newDat$sal_F_executive, newDat$sal_F_midManager, newDat$sal_F_employee, newDat$sal_F_worker)
n_sal_y = length(sal_y)             # length response variable
n_cat = length(newDat$sal_M_18_25)  # length of each category (i.e., original vectors)
# create sex dummy variable, 1 for males and 0 for females
sal_sex = rep(0, n_sal_y)   # full regressors
sal_sex[1:n_sal_y/2] = 1    # assign males
# create age dummy variables, 18-25 years old is the base case
sal_age = cbind(rep(0, n_sal_y), rep(0, n_sal_y)) # full regressors
# 26-50 y.o.
sal_age[(n_cat+1):(n_cat*2), 1] = 1     # males
sal_age[(n_cat*8+1):(n_cat*9), 1] = 1   # females
# 51+ y.o.
sal_age[(n_cat*2+1):(n_cat*3), 2] = 1   # males
sal_age[(n_cat*9+1):(n_cat*10), 2] = 1  # females
# create job type dummy variables, worker is the base case
sal_job = cbind(rep(0, n_sal_y), rep(0, n_sal_y), rep(0, n_sal_y)) # full regressors
# executives
sal_job[(n_cat*3+1):(n_cat*4), 1] = 1     # males
sal_job[(n_cat*10+1):(n_cat*11), 1] = 1   # females
# middle managers
sal_job[(n_cat*4+1):(n_cat*5), 2] = 1     # males
sal_job[(n_cat*11+1):(n_cat*12), 2] = 1   # females
# employee
sal_job[(n_cat*5+1):(n_cat*6), 3] =   1   # males
sal_job[(n_cat*12+1):(n_cat*13), 3] = 1   # females
# final data set
data_ANOVA = cbind.data.frame(response = sal_y, sex = sal_sex, age = sal_age, job = sal_job)
# names(data_ANOVA)
# show regressors' shape
imagemat(data_ANOVA[,-1], xaxt = "n", main = "Factors for ANOVA")
axis(1, at=1:6, labels=c("Sex", "Age 26-50", "Age 51+", "Execut.", "Mid.Man.", "Empl."))
box()

# sub sample to avoid correlation, (unbalanced dataset)
# set.seed(20)
subs = sample(1:n_sal_y, size = round(n_sal_y*0.1))
data_ANOVA = data_ANOVA[subs,]
# show randomized data
imagemat(data_ANOVA[,-1], xaxt = "n", main = "Factors for ANOVA after sub-sampling")
axis(1, at=1:6, labels=c("Sex", "Age 26-50", "Age 51+", "Execut.", "Mid.Man.", "Empl."))
box()

Run the ANOVA model, tranforming the response variable using Box-Cox transformation

# plot the transformed response variable (suggested by Box-Cox transformation)
sal_y = data_ANOVA[,1]^-1
ggplot(data=data.frame(data_ANOVA[,1]), aes(sal_y)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 60) +
  geom_density(col="black") +
  labs(x="salary^-1", y="Density") +
  ggtitle("Transformation of salary found using Box-Cox transformation") +
  theme(plot.title = element_text(hjust = 0.5))

# take the identified subsample for dummy variables
sex = sal_sex[subs]
age = sal_age[subs,]
job = sal_job[subs,]
# ANOVA model
sal_ANOVA = lm(sal_y ~ sex + age + job + sex:age + sex:job)
summary(sal_ANOVA)

Call:
lm(formula = sal_y ~ sex + age + job + sex:age + sex:job)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.084391 -0.004587  0.000657  0.005446  0.050454 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  0.1057963  0.0002737  386.559  < 2e-16 ***
sex         -0.0106870  0.0003870  -27.618  < 2e-16 ***
age1        -0.0212090  0.0004851  -43.723  < 2e-16 ***
age2        -0.0279090  0.0004717  -59.163  < 2e-16 ***
job1        -0.0555240  0.0004630 -119.912  < 2e-16 ***
job2        -0.0298464  0.0004690  -63.633  < 2e-16 ***
job3        -0.0081649  0.0004693  -17.397  < 2e-16 ***
sex:age1    -0.0023524  0.0006850   -3.434 0.000598 ***
sex:age2    -0.0083323  0.0006713  -12.412  < 2e-16 ***
sex:job1     0.0010803  0.0006723    1.607 0.108151    
sex:job2    -0.0002828  0.0006755   -0.419 0.675486    
sex:job3     0.0027415  0.0006671    4.109 4.01e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.008719 on 7019 degrees of freedom
Multiple R-squared:  0.8352,    Adjusted R-squared:  0.8349 
F-statistic:  3233 on 11 and 7019 DF,  p-value: < 2.2e-16
anova(sal_ANOVA)
Analysis of Variance Table

Response: sal_y
            Df  Sum Sq Mean Sq    F value    Pr(>F)    
sex          1 0.20446 0.20446  2689.2922 < 2.2e-16 ***
age          2 0.16187 0.08093  1064.5298 < 2.2e-16 ***
job          3 2.31845 0.77282 10164.8476 < 2.2e-16 ***
sex:age      2 0.01775 0.00887   116.7281 < 2.2e-16 ***
sex:job      3 0.00160 0.00053     7.0028 0.0001064 ***
Residuals 7019 0.53364 0.00008                         
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# aov(sal_y ~ sex + age + job + sex:age + sex:job)
plot(sal_ANOVA, 1)

# box-cox transformation suggested to use y^-1
boxcox(sal_ANOVA)
title("Box-Cox transformation of the response")

# comparison a full model with a reduced one excluding one interaction term
sal_ANOVA2 = lm(sal_y ~ sex + age + job + sex:age)
anova(sal_ANOVA, sal_ANOVA2)
Analysis of Variance Table

Model 1: sal_y ~ sex + age + job + sex:age + sex:job
Model 2: sal_y ~ sex + age + job + sex:age
  Res.Df     RSS Df  Sum of Sq      F    Pr(>F)    
1   7019 0.53364                                   
2   7022 0.53524 -3 -0.0015972 7.0028 0.0001064 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

To check if the model can be sparsified, considering groups of variables as a whole, the group-Lasso is applied. It shows that the interaction term sex:job is not so important.

# define the considered model, which includes interaction terms
form <- model.matrix(sal_y ~ (sex + age + job)^2)
# remove interaction age:job
# colnames(form)
form = form[,2:12]
# define groups' labels
grp = c(1,2,2,3,3,3,4,4,5,5,5)
#Group-Lasso with 10-folds Cross Validation and LAD cost function
fit.cv=cv.gglasso(x=form, y=sal_y, group=grp, nfolds=10, pred.loss = "L2")
# plot CV for Lambda
plot(fit.cv) 
mtext(expression("10-folds CV for group-Lasso in ANOVA"), outer=TRUE,  cex=1, line=-2.2)

# extract coefficients
# coef.mat = fit.cv$gglasso.fit$beta
# plot separately each coefficient
cols = brewer.pal(5,name="Set1")
plot(fit.cv$gglasso.fit, col = cols, lwd = 3, main = "Coefficients vs.Lambda values")
legend('bottomright', legend = paste("Group", 1:5), col=cols[1:5], cex = 0.65, pch = 1, lwd=4, bty ="n")

# plot coefficients' norms (not working), doing it manually
# https://www.rdocumentation.org/packages/gglasso/versions/1.3/topics/plot.gglasso
# https://code.google.com/archive/p/gglasso/issues
  # plot(fit.cv$gglasso.fit, group=TRUE, col =  cols, lwd = 2)
  # legend('bottomright', legend = paste("Group", 1:5), col=cols[1:5], cex = 0.65, pch = 1, lwd=4, bty ="n")
# initialize norms
coeffNorm = matrix(0, nrow =  max(fit.cv$gglasso.fit$group), length(fit.cv$gglasso.fit$lambda))
# compute norms group-wise
for (i in 1:max(fit.cv$gglasso.fit$group)){ 
  ind = fit.cv$gglasso.fit$group == i
  coeff = fit.cv$gglasso.fit$beta[ind,]
  if (sum(ind)>1){
    coeffNorm[i,] = apply(coeff, 2, function(x){sqrt(sum(x^2))})
  } else coeffNorm[i,] = coeff
} 
# coefficients' norm plot
plot(log(fit.cv$gglasso.fit$lambda), coeffNorm[1,], main="Coefficients' norm vs Lambda",
     ylab="Coefficients' norm",xlab="Ln Lambda values",
     col=cols[1],
     # xlim=c(-1,100),
     ylim=c(min(coeffNorm), max(coeffNorm)),
     type="l",lwd=4)
for (j in 2:nrow(coeffNorm)){
  lines(log(fit.cv$gglasso.fit$lambda), coeffNorm[j,], type="l",lwd=4, col=cols[j])
}
abline(0,0, lwd=3)
# plot legend once
# grid()
par(new=T)
legend('topright', legend = paste("group ", 1:5), lty=1, col=cols[1:5], cex = 0.6,lwd=2)

#Pick the best Lambda
lmbda=fit.cv$lambda.1se
(coefs=coef.gglasso(object=fit.cv$gglasso.fit,s=lmbda))
                       1
(Intercept)  0.103352890
sex         -0.009843253
age1        -0.018198974
age2        -0.025734448
job1        -0.051617852
job2        -0.027165667
job3        -0.004583826
sex:age1    -0.002729062
sex:age2    -0.006619003
sex:job1     0.000000000
sex:job2     0.000000000
sex:job3     0.000000000
#At best lambda get coefficients and fitted values
plt=sal_y-predict.gglasso(object=fit.cv$gglasso.fit,newx=form,s=lmbda,type='link')
plot(plt, ylab="residuals", xlab="index", main="Plot of residuals")
abline(0, 0, col= "red")

From this study it is possible to conclude that each factor considered is relevant to determine the salary level.

The least important factor results the interaction between sex and job, which is not statistically significant using group-Lasso. This means that there is no gender discrimination between various jobs, even if gender discrimination exists in general and for different age groups.

In the next section this topic will be investigated further.

7.2 Difference in salary: men vs. females

The focus of this section is on the on the difference on salary between males and females in various towns. Some classical approaches will be compared with robust ones, to diagnose the possible negative effect of the presence of noise or outliers. It is motivated by the assumptions that not all the towns come from the same data genereting process.

# These functions are used to create the plots used in the next chunks
# PLOTS FOR interactive regression diagnostic plots
RegressionPlots <- function(fit, textlabels, robust=F, out = NULL){
  
  # number of units
  n = length(fit$fitted.values)
  
  # original response
  y = fit$y
  
  # Extract fitted values from lm() object
  Fitted.Values <-  fit$fitted.values
  
  # Extract residuals from lm() object
  Residuals <-  fit$residuals
  
  if (robust == F){
    # Extract standardized residuals from lm() object
    Standardized.Residuals <- MASS::stdres(fit)  
    
    # Calculate Leverage
    Leverage <- lm.influence(fit)$hat
    
    # text for the labels of the plots
    tt = "OLS"
  } else{
    # standardized residuals based on a robust scale estimate
    Standardized.Residuals <- fit$residuals/fit$scale
    
    # get the robust distances based on mahalanobis distance from the robust output 
    # (it has to be specified in the call to lmrob using: control = lmrob.control(compute.rd = T))
    Leverage <- fit$MD
    
    # text for the labels of the plots
    tt = "Robust"
  }
  
  # Extract fitted values for lm() object
  Theoretical.Quantiles <- qqnorm(Residuals, plot.it = F)$x
  
  # Create data frame 
  # Will be used as input to plotly
  regMat <- data.frame(Fitted.Values, 
                       Residuals, 
                       Standardized.Residuals, 
                       Theoretical.Quantiles,
                       Leverage, 
                       textlabels)
  
  # Plot using Plotly
  
  # text info
  t <- list(
    family = "sans serif",
    size = 14,
    color = toRGB("grey50"))
  
  
  # Plot 1: Fitted vs Residuals
  # For scatter plot smoother
  LOESS <- loess.smooth(Fitted.Values, Residuals)
  
  # plt1 <- regMat %>% 
  #   plot_ly(x = Fitted.Values, y = Residuals, 
  #           type = "scatter", mode = "markers", hoverinfo = "text",
  #           color = out, name = "Data", marker = list(size = 10, opacity = 0.5),
  #           text = paste('</br> Unit: ', 1:n,
  #                        '</br> Town: ', as.character(textlabels),
  #                        '</br> x: ', Fitted.Values,
  #                        '</br> y: ', Residuals)) %>% 
  #   
  #   add_trace(x = LOESS$x, y = LOESS$y, type = "scatter", mode = "lines", name = "Smooth",
  #             inherit = F, line = list(width = 2)) %>% 
  # 
  #   layout(title = paste(tt, " Residuals vs Fitted Values"), plot_bgcolor = "#e6e6e6",
  #          xaxis = list(title = "Fitted Values"),
  #          yaxis = list(title = "Residuals"))
  
  # Plot 2: Fitted vs Standardized Residuals
  # For scatter plot smoother
  LOESS <- loess.smooth(Fitted.Values, Standardized.Residuals)
  
  plt2 <- regMat %>% 
    plot_ly(x = Fitted.Values, y = Standardized.Residuals, 
            type = "scatter", mode = "markers", hoverinfo = "text",
            color = out, name = "Data", marker = list(size = 10, opacity = 0.5),
            text = paste('</br> Unit: ', 1:n,
                         '</br> Town: ', as.character(textlabels),
                         '</br> x: ', Fitted.Values,
                         '</br> y: ', Standardized.Residuals)) %>% 
    
    add_trace(x = LOESS$x, y = LOESS$y, type = "scatter", mode = "lines", name = "Smooth",
              inherit = F, line = list(width = 2)) %>% 
    
    layout(title = paste(tt, " Standardized residuals vs Fitted Values"), plot_bgcolor = "#e6e6e6",
           xaxis = list(title = "Fitted Values"),
           yaxis = list(title = "Standardized residuals"))  
  # Plot 3: Residuals index
  plt3 <- regMat %>% 
    plot_ly(x = 1:n, y = Standardized.Residuals, 
            type = "scatter", mode = "markers", hoverinfo = "text",
            color = out, name = "Data", marker = list(size = 10, opacity = 0.5),
           text = paste('</br> Unit: ', 1:n,
                         '</br> Town: ', as.character(textlabels),
                         '</br> x: ', 1:n,
                         '</br> y: ', Standardized.Residuals)) %>% 
    
    add_segments(x = -5, xend = n+5, y = 0, yend = 0, mode = "line", inherit = F, 
                 name = "Null line", line = list(width = 2)) %>%
    
    layout(title = paste(tt, " Standardized Residuals' index"), plot_bgcolor = "#e6e6e6", 
             xaxis = list(title = "Index"),
             yaxis = list(title = "Standardized.Residuals"))
  # Plot 4: QQ Pot
  plt4 <- regMat %>% 
    plot_ly(x = Theoretical.Quantiles, y = Standardized.Residuals, 
            type = "scatter", mode = "markers", hoverinfo = "text", 
            color = out, name = "Data", marker = list(size = 10, opacity = 0.5),
           text = paste('</br> Unit: ', 1:n,
                         '</br> Town: ', as.character(textlabels),
                         '</br> x: ', Theoretical.Quantiles,
                         '</br> y: ', Standardized.Residuals)) %>% 
    
    add_trace(x = Theoretical.Quantiles, y = Theoretical.Quantiles, type = "scatter", 
              mode = "lines", name = "Theoretical", line = list(width = 2), 
              inherit = F) %>%
    
    layout(title = paste(tt, " Q-Q Plot"), plot_bgcolor = "#e6e6e6",
             xaxis = list(title = "Normal theoretical quantiles"),
             yaxis = list(title = "Data quantiles"))
  
  # Plot5: Residuals vs Leverage
  # For scatter plot smoother
  LOESS <- loess.smooth(Leverage, Residuals)
  
  plt5 <- regMat %>% 
    plot_ly(x = Leverage, y = Residuals, 
            type = "scatter", mode = "markers", hoverinfo = "text", 
            color = out, name = "Data", marker = list(size = 10, opacity = 0.5), 
           text = paste('</br> Unit: ', 1:n,
                         '</br> Town: ', as.character(textlabels),
                         '</br> x: ', Leverage,
                         '</br> y: ', Residuals)) %>% 
    
    add_trace(x = LOESS$x, y = LOESS$y, type = "scatter", mode = "lines", name = "Smooth",
              line = list(width = 2), inherit = F) %>%
    
    layout(title = paste(tt, " Leverage vs Residuals"), plot_bgcolor = "#e6e6e6", 
             xaxis = list(title = "Leverage values"),
             yaxis = list(title = "Residuals"))  
  # Plot 6: Fitted vs response
  # For scatter plot smoother
  LOESS <- loess.smooth(Fitted.Values, y)
  
  plt6 <- regMat %>% 
    plot_ly(x = Fitted.Values, y =y, 
            type = "scatter", mode = "markers", hoverinfo = "text",
            color = out, name = "Data", marker = list(size = 10, opacity = 0.5),
           text = paste('</br> Unit: ', 1:n,
                         '</br> Town: ', as.character(textlabels),
                         '</br> x: ', Fitted.Values,
                         '</br> y: ', y)) %>% 
    
    add_trace(x = LOESS$x, y = LOESS$y, type = "scatter", mode = "lines", name = "Smooth",
              inherit = F, line = list(width = 2)) %>% 
  
    layout(title = paste(tt, " Fitted Values vs response"), plot_bgcolor = "#e6e6e6",
             xaxis = list(title = "Fitted Values"),
             yaxis = list(title = "Response values"))
  # # Plot 7: MISSING: sqrt(abs(Residuals))
  # # For scatter plot smoother
  # LOESS2 <- loess.smooth(Fitted.Values, Root.Residuals)
  # 
  # plt3 <- regMat %>% 
  #   plot_ly(x = Fitted.Values, y = Root.Residuals, 
  #           type = "scatter", mode = "markers", hoverinfo = "text", 
  #           name = "Data", marker = list(size = 10, opacity = 0.5),
  #           text = paste('town: ', as.character(textlabels))) %>%
  #   
  #   add_trace(x = LOESS2$x, y = LOESS2$y, type = "scatter", mode = "lines", name = "Smooth",
  #             line = list(width = 2), inherit = F) %>% 
  #   
  #   layout(title = "Scale Location", plot_bgcolor = "#e6e6e6")
  plt = list(plt2, plt3, plt4, plt5, plt6) # plt1, 
  return(plt)
  # not working...to add in all plots
  # to avoid error with colorPalette<3
  # suppressWarnings(warning("RColorBrewer::brewer.pal"))
}
#
# PLOTS FOR BEST SUBSET REGRESSION
#
plot.regsubsets2 <-  
  function (x, labels = obj$xnames, main = NULL, scale = c("bic",  
     "Cp", "adjr2", "r2"), col = gray(seq(0, 0.9, length = 10)), ...)  
  { 
    obj <- x 
    lsum <- summary(obj) 
    par(mar = c(7, 5, 6, 3) + 0.1) 
    nmodels <- length(lsum$rsq) 
    np <- obj$np 
    propscale <- FALSE 
    sscale <- pmatch(scale[1], c("bic", "Cp", "adjr2", "r2"),  
                     nomatch = 0) 
    if (sscale == 0)  
      stop(paste("Unrecognised scale=", scale)) 
    if (propscale)  
      stop(paste("Proportional scaling only for probabilities")) 
    yscale <- switch(sscale, lsum$bic, lsum$cp, lsum$adjr2, lsum$rsq) 
    up <- switch(sscale, -1, -1, 1, 1) 
    index <- order(yscale * up) 
    colorscale <- switch(sscale, yscale, yscale, -log(pmax(yscale,  
                                                           1e-04)), -log(pmax(yscale, 1e-04))) 
    image(z = t(ifelse(lsum$which[index, ], colorscale[index],  
                       NA + max(colorscale) * 1.5)), xaxt = "n", yaxt = "n",  
          x = (1:np), y = 1:nmodels, xlab = "", ylab = scale[1],  
          col = col) 
    laspar <- par("las") 
    on.exit(par(las = laspar)) 
    par(las = 2) 
    axis(1, at = 1:np, labels = labels, ...) # I modified this line 
    axis(2, at = 1:nmodels, labels = signif(yscale[index], 2), ...) 
    # axis(2,cex.axis=0.01)
    if (!is.null(main))  
      title(main = main) 
    box() 
    invisible(NULL)
  } 

The considered model is built using all the variables that could explain the gap in salary. Some of them refers to the difference of some quantity between males and females for each town (e.g., diploma, univeristy, independent workers, etc.). The final model consist of 43 variables and 5025 units.

Additionally, some variables which would have had a huge positive impact on the prediction of the response variable have non been considered, because of their strong relation (e.g., gap in salary between males and females for various ages or occupations)

# response variable
# y = log10(newDat$sal_Males - newDat$sal_Females)
y = newDat$sal_Males - newDat$sal_Females
# original response variable histogram
ggplot(data=as.data.frame(y), aes(y)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 35) +
  geom_density(col="black") +
  labs(x="Male-Female salary", y="Density") +
  ggtitle("Original response variable") +
  theme(plot.title = element_text(hjust = 0.5))

x = cbind.data.frame(pop = newDat$total_population, 
                     firms = newDat$total,
                     ratio_pop_firms = newDat$total/newDat$total_population,
                     # delta_exec = newDat$sal_M_executive - newDat$sal_F_executive,
                     # delta_midMan = newDat$sal_M_midManager - newDat$sal_F_midManager,
                     # delta_empl = newDat$sal_M_employee - newDat$sal_F_employee,
                     # delta_worker = newDat$sal_M_worker - newDat$sal_F_worker,
                     workforce = newDat$workforce,
                     dependency_ratio = newDat$dependency_ratio,
                     delta_wagearner = newDat$wagearner_M - newDat$wagearner_F,
                     delta_indep = newDat$independent_M - newDat$independent_F,
                     # delta_transfer = newDat$transfer_M - newDat$transfer_F, # 2000 zeros
                     delta_employer = newDat$employer_M - newDat$employer_F,
                     delta_full = newDat$full_M - newDat$full_F,
                     # delta_half = newDat$half_M - newDat$half_F, # OLS gives NA
                     superficie = newDat$Superficie,
                     housing14 = newDat$housing14,
                     median_living14 = newDat$median_living14,
                     empl = newDat$empl,
                     emp_sal = newDat$emp_sal,
                     pop15_64 = newDat$pop15_64,
                     unemp15_64 = newDat$unemp15_64,
                     act15_64 = newDat$act15_64,
                     unemp_rate = newDat$unemp_rate,
                     delta_NoDip = newDat$male_NoDip - newDat$female_NoDip,
                     delta_sec = newDat$male_Sec - newDat$female_Sec,
                     delta_hi = newDat$male_Hi - newDat$female_Hi,
                     delta_univ = newDat$male_Univ - newDat$female_Univ,
                     nodip = newDat$nodip,
                     delta_immig = newDat$male_immig - newDat$female_immig,
                     delta_workingImmig = newDat$male_working_immig - newDat$female_working_immig,
                     delta_native = newDat$male_native - newDat$female_native,
                     # evol_entrepr = newDat$evol_entrepreneurial_score,
                     # evol_pop = newDat$evol_pop_score,
                     regional_GDP = newDat$regional_GDP,
                     # dymanic_demo_prof = newDat$dymanic_demo_prof,
                     nb_active_employees = newDat$nb_active_employees,
                     nb_active_non_employees = newDat$nb_active_non_employees,
                     # nb_active = newDat$nb_active,   # OLS gives NA
                     nb_created_services = newDat$nb_created_services,
                     nb_created_commerce = newDat$nb_created_commerce,
                     nb_created_construction = newDat$nb_created_construction,
                     nb_created_ind = newDat$nb_created_ind,
                     # nb_created_firms = newDat$nb_created_firms, # OLS gives  
                     nb_firms_service = newDat$nb_firms_service,
                     nb_firms_commerce = newDat$nb_firms_commerce,
                     nb_firms_construction = newDat$nb_firms_construction,
                     nb_firms_ind = newDat$nb_firms_ind,
                     nb_major = newDat$nb_major,
                     nb_minor = newDat$nb_minor,
                     delta_num_sex = newDat$nb_male - newDat$nb_female,
                     nb_household = newDat$nb_household)
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 5022, 0

These plots shows that:

  • the response is long tailed, which also motivates the robust framework
  • there is strong evidence of collinearity

The final model is built considering a random sample of observations (70% of them) in order to reduce spatial correlation among units

In the following analyses to towns will be often highlighted and they have been found ex-post. They are considered as 2 typical outliers but for different reasons. One of them is Paris, which is an outlier for obvious reasons. The other (unexpected) one is Chambourcy, which has been spotted by the following analysis and it is a very small town.

All the coefficients’ estimates obtained in the following will be saved in a dataframe to allow an easy ex-post comparison. For fitting methods that provide significance levels, only variables significant at 0.01% will be kept.

7.2.1 Naive approaches

7.2.1.1 Ordinary least squares

A classical linear model based on OLS is first fitted

It shows that:

  • there is etoreskadasticity
  • Paris has a levarage equal to 1

Also, high collinearity and robustness are not taken into account. At the same time, running different times OLS didn’t produce any coefficients’ instability, even if VIF values are very high.

7.2.1.2 Best subset selection

Model selection by exhaustive search (suffering the same drawbacks as OLS) is now applied, which is also a very heavy approach in terms of computing time.

7.2.1.3 Stepwise selection

Model selection by sequential replacement (suffering the same drawbacks as OLS) is now applied and it is very fast compared to the exhaustive search

This approach results in a sparser model compared to the exhaustive search, and the cost (e.g., in terms of BIC) is higher.

7.2.1.4 MM estimator

The model is now fitted using the MM estimator, which is a soft trimming methodology. It is based on a starting robust estimator with a high breakdown point (in this case an S estimator), which provides a robust scale estimate used by the MM estimator to recover efficiency

This model shows that:

  • residuals and fitted values seem better than the OLS fit
  • it is able to identify the most oulying units (considering all the dimensions)
  • leverage points (Lyon and Toulouse) have a lower weight compared to OLS estimates

7.2.1.5 Least trimmed squares

A very similar result is obtained using the LTS estimator (i.e., an hard trimming approach), with trimming proportion equal to 10% of the units.

Using the MM or LTS estimator, as expected, increases the adjusted R squared of around 10%.

7.2.1.6 Best subset selection with MM weights

The results obtained using a best subset selection procedure are now combined with the weights obtained using the MM estimator (pretty similar to the ones obtaiend using the LTS weights)

This approach results in a significant improvement of the cost function, but the best model is less sparse and it keeps approximately 20 variables.

7.2.1.7 Elastic net

To face the problem of strong collinearity, which has not been considere yet, an Elastic-net is now fitted using 10-folds cross validation to tune lambda and considering a range of 11 equispaced alpha values. The best model is considered the one with the lowest MSPE.

This estimator proides a fairly sparse model, but it is non robust, it is biased, and it doesn’t easy allow to perform inferential reasoning.

7.2.1.8 OLS on elastic net solution

To partially solve the latter two problems, an OLS fit could be applied only on the variables previously identified by ENET.

This results in a sparser model and the diagnostic plots shows a better result compared to the raw OLS estimates, but it is still not robust.

7.2.1.9 Elastic net with MM weights

At this point, it seems reasonable to use the weights obtained by the MM estimator in an ENET model, trying to recover some robustness and keeping a sparse model

Contrarily to what was expected, the model is not so sparse. On the other hand, the mean CV error decreases substiantially (around 0.4) compared to the pure ENET model (around 1.4), and this gain would persist even allowing for a stronger shrinkage of the lambda values and hence reducing the number of coefficients. It is also interesting to note the strong decrease in CV error as some proportion of Lasso is introduced.

Such behavior should be studied further. A motivation could be the fact that ENET is based on CV, and some oulying units with weights different from 0 can affect this result. On the other hand, even providing to ENET only units with MM-weights greater than 0.9 (i.e., almost surely non outlying according to MM based on all variables), mean CV error decrease substantially, but the problem persists.

7.2.1.10 OLS on Elastic net solution with MM weights

The same is true applying an OLS estimate on the non oulying units (i.e., with MM-weights > 0.05) and the variables identified bu ENET(MM).

Such a model is sparser. It still contains some leverage points, like Toulose and Lyon, but the diagnostic plots don’t show any particularly strange behavior.

At the same two important aspects have to be taken in account:

  • the MM weights used were obtained on the base of the full model, hence they could not be really representative
  • some units are not being modeled in this framework

7.2.1.11 Elastic net with MM weights to model outliers

To face the latter problem an approach could be based on the idea of modeling apart the most outlying units. For instance, the following fit is based on ENET where the provided weights correspond to 1 - MMweights.

This model results in:

  • in a very sparse model which seems to fit reasonably well all the points
  • provides coefficients which are sensibly different than the ones for the previously considered models (e.g., unemployment coefficients changes its sign) supporting the assumption that there is more than one data generating process
  • the mean CV error increases substantially

A possible drawback of this approach, again, reside on the fact that the MM weights are based on the full model.

7.2.2 Robust approaches incorporating shrinkage

7.2.2.1 Sparse-LTS

The estimator sparse-LTS, which is a robustification of Lasso using LTS, is now applied. The parameters used perform a trimming 10% of the units, and 10 replication of 10-folds CV

This model results in:

  • a very sparse and robust model
  • the unemployment rate is the most important variable
  • clearly shows 2 different sub-popilations
  • Paris is not a clear outlier

The main drawback resides on the fact that it is not taking into account collinearity.

7.2.2.2 Elastic net with sparse-LTS weights

In order to try to solve this problem, ENET with weights provided by sparseLTS could be applied

In this case, instead of the actual minimum obtained by ENET which corresponded to a Lasso fit, the best labda kept corresponds to alpha=0.7 (at the cost of a slightly higher CV error). The highest coefficient for this model results the unemployment rate.

To perform a further validation of these results, OLS is applied on the model identified (discarding the units identified by sparse-LTS)

They show that:

  • the model seems correctly specified and all regressors are significant
  • Paris is not considered as on outlier (its leverage is strongly decreased)
  • the unemployment rate has a strong effect on the response

7.2.2.3 Penalized ENET based on the S estimator

Penalized ENET based on the S estimator (PENSE) is now applied. The PENSE estimate minimizes the robust M-scale of the residuals penalized by the L1 and L2 norm of the regression coefficients (elastic net penalty). The level of penalization is chosen to minimize the k-fold cross-validated prediction error (using a robust measure).

This models show that:

The main drawback it’s the computing time required.

7.2.2.4 Elastic net with LTS cost function

Robust elastic net, based on LTS, is now fitted to the model trimming 10% of the units. It requires the removal of the dummy variable for urban towns, in order to avoid numerical problems.

This model shows that:

  • the algorithm (even if not parallelized for Windows users) is much faster than PENSE

7.2.2.5

7.2.3 Comparison of the estimates obtained

Print table of results

This function is used to plot a summary of the results obtained in HTML format

---
title: "TSL Project"
author: "Luca Insolia, Jisu Kim and Gevorg Yeghikyan"    
date: "`r format(Sys.time(), '%d/%m/%Y')`"
output:
  html_notebook:
    toc: yes
    toc_depth: '3' 
    highlight: tango
    keep_tex: yes
    number_sections: yes
  html_document:
    theme: united
    highlight: tango    
    toc: yes
    toc_depth: '3' 
#   pdf_document: # ADD PDF!
#     toc: yes
#     toc_depth: '3'
#     keep_tex: true
#     latex_engine: pdflatex
# editor_options:
#   # toc: yes
#   # toc_depth: 3
#   chunk_output_type: inline
---


# General information 

We analyze a dataset published on [Kaggle](https://www.kaggle.com/etiennelq/french-employment-by-town).
It refers to french employment, salaries, population per town. 
The aim is to evaluate equality/inequalities in France, and geographical distribution of business according to their size.

Such data are collected by the INSEE.
Information regarding the number of firms in every french town, categorized by size
can be found [here](https://www.insee.fr/fr/metadonnees/definition/c1135). This dataset contains about 35000 units/per town.

Information about salaries around french town per job categories, age and sex (expressed in average net amount per hour in euro) can be found [here](https://www.insee.fr/fr/statistiques/2522515). This dataset contains about 5000 units/per town. 

Demographic information in France per town, age, sex and living mode
can be found [here](https://www.insee.fr/fr/statistiques/2863607). This dataset contains about 8 million units/per town. Additional info about Population Data can be found [here](https://www.insee.fr/fr/statistiques/2863607#dictionnaire). 

These datasets have been pre-processed and put together. The final dataset contains 58 variables and 5022 observations. 


## Aim of the study 

This project aims to explore structure of French labour market. 
In particular, we are interested in:

* evaluating possible inequalities: per towns/region, sex, age, job categories etc.;
* discover geographical distribution of business according to their size
* predicting the ... using a regression model;
* reduce the dimensionality of ... performing a PCA;
* explore different algorithms to cluster male/females using ...


## Plan for the study 

1. Unsupervised learning: 
* PCA
* Clustering methods (K-means/Hierarchical)
2. Supervised learning:
* GLM
* Linear/Quadratic Discriminant Analysis
* KNN
* Cross-validation
* Bootstrap
* Subset selection 
* Shrinkage methods
* Dimension reduction methods
3. Description of population demographics in France
4. Structure of the french labour market
5. ...
6. Future works


## Loading tools

Loading all libraries needed throughout the notebook
```{r loading libraries, include=FALSE, cache=FALSE}

# data manipulation
if(!require(psych)){install.packages("psych"); library(psych)} 
if(!require(plyr)){install.packages("dplyr"); library(plyr)}
if(!require(dplyr)){install.packages("dplyr"); library(dplyr)}
# plotting
if(!require(RColorBrewer)){install.packages("RColorBrewer"); library(RColorBrewer)} # colors
if(!require(zoo)){install.packages("zoo"); library(zoo)} # colors
if(!require(ggplot2)){install.packages("ggplot2"); library(ggplot2)}
if(!require(ggfortify)){install.packages("ggfortify"); library(ggfortify)}
if(!require(ggmap)){install.packages("ggmap"); library(ggmap)}
if(!require(ggbiplot)){
  if(!require(devtools)){
    install.packages("devtools")}
  install_github("vqv/ggbiplot")
  library(ggbiplot)}
if(!require(corrplot)){install.packages("corrplot"); library(corrplot)}
if(!require(zoo)){install.packages("zoo"); library(zoo)}
if(!require(plotly)){install.packages("plotly"); library(plotly)} # interactive plotting
if(!require(lattice)){install.packages("lattice"); library(lattice)}
if(!require(MVA)){install.packages("MVA"); library(MVA)}
if(!require(ape)){install.packages("ape"); library(ape)}
if(!require(KernSmooth)){install.packages("KernSmooth"); library(KernSmooth)}
if(!require(rgl)){install.packages("rgl"); library(rgl)}
if(!require(car)){install.packages("car"); library(car)}
# Clustering
if(!require(mclust)){install.packages("mclust"); library(mclust)}
if(!require(dbscan)){install.packages("dbscan"); library(dbscan)}
# plot design matrix
if(!require(rafalib)){install.packages("rafalib"); library(rafalib)}
# cross validation
if(!require(boot)){install.packages("boot"); library(boot)}
# elastic net
if(!require(glmnet)){install.packages("glmnet"); library(glmnet)}
# group Lasso
if(!require(gglasso)){install.packages("gglasso"); library(gglasso)}
# robust fit
if(!require(robustbase)){install.packages("robustbase"); library(robustbase)}
# robust Lasso 
if(!require(robustHD)){install.packages("robustHD"); library(robustHD)}
if(!require(enetLTS)){install.packages("enetLTS"); library(enetLTS)}
# box-cox tranformation (in ANOVA) and standardized residuals
if(!require(MASS)){install.packages("MASS"); library(MASS)}
# best subset selection
if(!require(leaps)){install.packages("leaps"); library(leaps)}
# plot for PCA
# if(!require(factoextra)){install.packages("factoextra"); library(factoextra)}

```

Import the datasets
```{r loading datasets, warning=FALSE}

setwd("./data")
firms       <- read.csv("base_etablissement_par_tranche_effectif.csv", encoding = "UTF-8")
geo         <- read.csv("name_geographic_information.csv", encoding = "UTF-8")
salary      <- read.csv("net_salary_per_town_categories.csv", encoding = "UTF-8")
population  <- read.csv("population.csv", encoding = "UTF-8")
educ        <- read.csv("level_education.csv", sep =";", encoding = "UTF-8")
categ_socio <- read.csv("Categorie_socioprofessionnelle.csv", sep =";", encoding = "UTF-8")
status_work <- read.csv("Emplois_lieu_travail.csv", sep =";", encoding = "UTF-8")
ineq        <- read.csv("Comparateur_territoires.csv", sep =";", encoding = "UTF-8")
commune     <- read.csv("insee_commune.csv", encoding = "UTF-8")

```


# Pre-processing

## Firms data
 
Assign meaningful names and check the modified data:
```{r pre processing firms}

names(firms) 
names(firms)[2:ncol(firms)] <-
  c("town", 
    "regNum",
    "deptNum",
    "total",
    "null",
    "firmsEmpl_1_5",
    "firmsEmpl_6_9",
    "firmsEmpl_10_19",
    "firmsEmpl_20_49",
    "firmsEmpl_50_99",
    "firmsEmpl_100_199",
    "firmsEmpl_200_499",
    "firmsEmpl_500plus")

# preliminary checks
names(firms)
head(firms)
str(firms)
summary(firms)

# Check for duplicated data: there is no
sum(duplicated.data.frame(firms))

```

Categorize firms' size according to 
[EU standard](http://ec.europa.eu/eurostat/statistics-explained/index.php/Glossary:Enterprise_size), 
but in a slightly different form for medium and large firms 
(i.e., medium firms have <200 instead of <250 employees):
```{r modify firms}

# merge variables
firms$micro   <- firms$firmsEmpl_1_5 + firms$firmsEmpl_6_9
firms$small   <- firms$firmsEmpl_10_19 + firms$firmsEmpl_20_49
firms$medium  <- firms$firmsEmpl_50_99 + firms$firmsEmpl_100_199
firms$large   <- firms$firmsEmpl_200_499 + firms$firmsEmpl_500plus

# Drop unnecessary (at the moment) columns 
firms <- subset(firms, select = c(CODGEO, town, total, micro, small, medium, large, null))

# check
summary(firms)

# there is an obs with more than 316K null data: we check if it is plausible
# get the highest 20 null values
str_firms <- sort(firms$null, decreasing = T)[1:20]
# get their indexes
str_firms_ind <- match(str_firms, firms$null)
# get the corresponding city
firms$town[str_firms_ind]
# they are the largest cities, hence it seems reasonable..

```

## Geographical data 

Assign names and remove some variables:
```{r pre processing geo}

names(geo)
names(geo)[c(2:11, 14)] =
  c("code_region",
    "region", 
    "region_capital",
    "number_depart",
    "department", 
    "prefecture",
    "circons",
    "town_name", 
    "postal_code", 
    "CODGEO",
    "eloignement")

# drop unnecessary columns (code/num and name represents same thing) 
geo <- subset(geo, select = -c(EU_circo, code_region, number_depart, prefecture, circons, eloignement, town_name))

# preliminary checks
names(geo)
head(geo)
str(geo)
summary(geo)

```

Correct typos for longitude data and keep just the unique CODGEO to avoid towns with multiple postal codes:
```{r fix geo}

# spot "," instead of "." in longitude
newLong       <- as.character(geo$longitude)    # copy the vector
sum(grep(",", newLong))                         # total commas
ind_long_err  <- grep(",", newLong)             # indexing them
newLong       <- gsub(",", ".", newLong)        # substituting them with dots
indNA_Long    <- is.na(as.numeric((newLong)))   # spot NA
# geo$longitude[indNA_Long]                       # verify that they were actually missing
geo$longitude <- as.numeric(newLong)            # overwrite the longitude variable with the new one

# Check for duplicated data (e.g., cities with different postal codes, that we dropped):
  # e.g., to verify it,  try on the initial dataset:
  # sum(geo$nom_commune == "Paris")
  # ind_duplic <- geo$nom_commune == "Paris"
  # geo[ind_duplic,]
sum(duplicated.data.frame(geo)) 
# retaing unique postal cities
geo <- geo[!duplicated(geo$CODGEO),]

# check again
summary(geo)

```

Assign latitude and longitude values for missing data (almost 3000).
The code used to retrieve them using Google API has been commented and its result is loaded.
```{r assign NA in geo, warning=FALSE}

# index of NAs and their total
indNA_coord = is.na(geo$latitude) | is.na(geo$longitude)
sum(indNA_coord)

# code used to retrieve the NA using Google API, which have been saved in a csv file
# 
# # initialize variables
# city_search = 0
# res = as.data.frame(matrix(c(0, 0, 0), 1, 3))
# names(res) = c("lon", "lat", "address")
# 
# # retrieve lat and long (Google API = 2500 request per day)
# # my_iter = floor(sum(indNA_coord)/3)
# for (i in 1:sum(indNA_coord)){
# 
#   # city searched
#   city_search[i] = paste(c(as.character(NA_coord$town_name[i]), as.character(NA_coord$postal_code[i]), as.character(NA_coord$department[i]), "France"), sep=" ", collapse = ", ")
#   
#   # solution
#   res[i,] = geocode(city_search[i], output = "latlona", source = c("google", "dsk"), messaging = FALSE)
# 
#   # retrieve still missing data, because of existing problems with API (up to 15 trials)
#   j = 0
#   while (any(is.na(res[i,])) & j < 25){
#     res[i,] = geocode(city_search[i], output = "latlona", source = c("google", "dsk"), messaging = FALSE)
#     j = j + 1
#   }
# }

# # check the solution
# sol = cbind(searched = city_search, res)

# # save it as a csv file to save time
# write.csv(retrieved_geo_NA[,2:3], "geo_NA_Final.csv", quote = FALSE, row.names=FALSE, fileEncoding = "UTF-8")
    

# read the created csv
setwd("./data")
retrieved_geo_NA = read.csv("geo_NA_Final.csv", header = T, encoding = "UTF-8")
# get only long and lat and assign to original NA 
geo$latitude[indNA_coord] = retrieved_geo_NA[,2]
geo$longitude[indNA_coord] = retrieved_geo_NA[,1]

# there are 37 still missing units, which are towns located in old colonies far from Europe
indNA_coord = is.na(geo$latitude) | is.na(geo$longitude)
sum(indNA_coord)
# exclude those towns
geo = geo[!indNA_coord,]

summary(geo)

```

Remove DOM-TOM towns (i.e. old colonies far from Europe)
```{r}

# delete non-European countries
ind_nonEur = geo$latitude < 30 | geo$latitude > 70 |geo$longitude < -20 | geo$longitude > 20
sum(ind_nonEur)
geo = geo[!ind_nonEur,]

```


## Salary data 

Assign meaningful names and check the data:
```{r pre processing salary}

names(salary)
names(salary)[2:ncol(salary)] <-
  c("town",
    "sal_general",    
    "sal_executive",
    "sal_midManager",
    "sal_employee",
    "sal_worker",
    "sal_Females",
    "sal_F_executive",
    "sal_F_midManager",
    "sal_F_employee",
    "sal_F_worker",
    "sal_Males",
    "sal_M_executive",
    "sal_M_midManager",
    "sal_M_employee",
    "sal_M_worker",
    "sal_18_25",
    "sal_26_50",
    "sal_51plus",
    "sal_F_18_25",
    "sal_F_26_50",
    "sal_F_51plus",
    "sal_M_18_25",
    "sal_M_26_50",
    "sal_M_51plus")

# preliminary checks
names(salary)
head(salary)
str(salary)
summary(salary)

# Check for duplicated data: there are no
sum(duplicated.data.frame(salary))

# drop unnecessary variable
salary <-subset(salary, select = -c(town))

```

## Population data 

Rename the variables for population and exclude the unnecessary ones:
```{r pre processing population}

#names(population)
names(population)[5:7] <-
  c("ageCateg5",
    "sex",
    "peopleCategNum")

# drop unnecessary columns (NIVGEO is the same for all)
population <- subset(population, select = -c(NIVGEO, LIBGEO))

# Refactor sex and MOCO
population$MOCO <- factor(population$MOCO, levels = c(11,12,21,22,23,31,32),
                          labels = c("children_living_with_two_parents", 
                                     "children_living_with_one_parent",
                                     "adults_living_in_couple_without_child",
                                     "adults_living_in_couple_with_children",
                                     "adults_living_alone_with_children",
                                     "persons_not_from_family_living_in_the_home",
                                     "persons_living_alone"))
population$sex <- factor(population$sex, levels = c(1,2), labels = c("Male", "Female"))

# check again
summary(population)

```

Understand the distribution of population according to different age categories using population pyramide
(performed now because later the dataset will be modified):
```{r Population pyramide}

# need the initial shape of data to construct the pyramide 
# Population pyramide
population_data2 <- ddply(population, .(sex, ageCateg5), function(population) {
  data.frame(total_population = sum(population$peopleCategNum))
  })
pop_pyramid <- ggplot(data = population_data2,
       mapping = aes(x = ageCateg5, fill = sex,
                     y = ifelse(test = sex == "Male",
                                yes = -total_population, no = total_population))) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = abs, limits = max(population_data2$total_population) * c(-1,1)) + 
  ggtitle("Pyramid of Population") + theme(plot.title = element_text(hjust = 0.5)) + 
  labs(x= "Age")+ labs(y = "Population") + coord_flip() # + scale_fill_brewer(palette = "Set1")
pop_pyramid

```

Re-organize population dataset and set CODGEO as units and create new variables using the available variables.
```{r Re-organize population dataset by CODGEO}

# Re-organize the data by creating new variables: "total population", "male", "female", "child", "elderly" and "workforce".
population <- ddply(population, .(CODGEO), function(population) {
  data.frame(total_population = sum(population$peopleCategNum),
             male = sum(population[population$sex == "Male",]$peopleCategNum),
             female = sum(population[population$sex == "Female",]$peopleCategNum),
             child = sum(population[population$ageCateg5 %in% seq(0, 10, by=5),]$peopleCategNum),
             elderly = sum(population[population$ageCateg5 %in% seq(65, 80, by=5),]$peopleCategNum),
             workforce = sum(population[population$ageCateg5 %in% seq(15, 60, by=5),]$peopleCategNum) 
  )})

# Calculate ratios using the existing variables
population$dependent <- population$child + population$elderly
population$sex_ratio <- ifelse(population$female==0, 0, population$male / population$female)
population$dependency_ratio <- ifelse(population$workforce==0, 0, population$dependent / population$workforce)
population$aged_dependency_ratio <- ifelse(population$workforce==0, 0, population$elderly / population$workforce)
population$child_dependency_ratio <- ifelse(population$workforce==0, 0, population$child / population$workforce)

# total population=65mln, which is reasonable
sum(population$total_population)

# create dummy variable for urban city
# taking the top 20% in terms of population
population$urbanDummy = population$total_population > quantile(population$total_population, 0.90)

```

## Education data

Assign meaningful names and build some useful indicators:
```{r}

names(educ)[3:ncol(educ)] <-
  c("Age15_NoDip_M","Age15_NoDip_F", "Age15_Sec_M", "Age15_Sec_F", "Age15_Hi_M","Age15_Hi_F", "Age15_Univ_M",
    "Age15_Univ_F", "Age20_NoDip_M", "Age20_NoDip_F", "Age20_Sec_M","Age20_Sec_F", "Age20_Hi_M", "Age20_Hi_F",
    "Age20_Univ_M", "Age20_Univ_F","Age25_NoDip_M", "Age25_NoDip_F","Age25_Sec_M","Age25_Sec_F", "Age25_Hi_M",
    "Age25_Hi_F","Age25_Univ_M","Age25_Univ_F", "Age30_NoDip_M", "Age30_NoDip_F", "Age30_Sec_M", "Age30_Sec_F",
    "Age30_Hi_M", "Age30_Hi_F","Age30_Univ_M","Age30_Univ_F", "Age35_NoDip_M", "Age35_NoDip_F", "Age35_Sec_M",
    "Age35_Sec_F", "Age35_Hi_M", "Age35_Hi_F", "Age35_Univ_M","Age35_Univ_F", "Age40_NoDip_M", "Age40_NoDip_F",
    "Age40_Sec_M","Age40_Sec_F", "Age40_Hi_M", "Age40_Hi_F", "Age40_Univ_M","Age40_Univ_F", "Age45_NoDip_M",
    "Age45_NoDip_F","Age45_Sec_M","Age45_Sec_F", "Age45_Hi_M", "Age45_Hi_F", "Age45_Univ_M","Age45_Univ_F",
    "Age50_NoDip_M", "Age50_NoDip_F","Age50_Sec_M","Age50_Sec_F", "Age50_Hi_M", "Age50_Hi_F", "Age50_Univ_M",
    "Age50_Univ_F", "Age55_NoDip_M", "Age55_NoDip_F","Age55_Sec_M", "Age55_Sec_F", "Age55_Hi_M", "Age55_Hi_F",
    "Age55_Univ_M", "Age55_Univ_F", "Age60_NoDip_M", "Age60_NoDip_F","Age60_Sec_M","Age60_Sec_F", "Age60_Hi_M",
    "Age60_Hi_F", "Age60_Univ_M", "Age60_Univ_F", "Age65_NoDip_M", "Age65_NoDip_F","Age65_Sec_M","Age65_Sec_F",
    "Age65_Hi_M", "Age65_Hi_F", "Age65_Univ_M", "Age65_Univ_F")

# Aggregate variables by grouping all the education level categories by gender
# total number of females with no diploma
educ$female_NoDip <- rowSums(cbind(educ$Age15_NoDip_F, educ$Age20_NoDip_F, educ$Age25_NoDip_F, educ$Age30_NoDip_F,
                                   educ$Age35_NoDip_F,educ$Age40_NoDip_F,educ$Age45_NoDip_F, educ$Age50_NoDip_F,
                                   educ$Age55_NoDip_F, educ$Age60_NoDip_F, educ$Age65_NoDip_F))

# total number of males with no diploma
educ$male_NoDip <- rowSums(cbind(educ$Age15_NoDip_M, educ$Age20_NoDip_M, educ$Age25_NoDip_M, educ$Age30_NoDip_M,
                                 educ$Age35_NoDip_M,educ$Age40_NoDip_M,educ$Age45_NoDip_M, educ$Age50_NoDip_M,
                                 educ$Age55_NoDip_M, educ$Age60_NoDip_M, educ$Age65_NoDip_M))

# total number of females with secondary education level
educ$female_Sec <- rowSums(cbind(educ$Age15_Sec_F, educ$Age20_Sec_F, educ$Age25_Sec_F, educ$Age30_Sec_F,
                                 educ$Age35_Sec_F,educ$Age40_Sec_F,educ$Age45_Sec_F, educ$Age50_Sec_F, 
                                 educ$Age55_Sec_F, educ$Age60_Sec_F, educ$Age65_Sec_F))

# total number of males with secondary education level
educ$male_Sec <- rowSums(cbind(educ$Age15_Sec_M, educ$Age20_Sec_M, educ$Age25_Sec_M, educ$Age30_Sec_M,
                               educ$Age35_Sec_M,educ$Age40_Sec_M,educ$Age45_Sec_M, educ$Age50_Sec_M, 
                               educ$Age55_Sec_M, educ$Age60_Sec_M, educ$Age65_Sec_M))

# total number of females with high-school education level
educ$female_Hi <- rowSums(cbind(educ$Age15_Hi_F, educ$Age20_Hi_F, educ$Age25_Hi_F, educ$Age30_Hi_F,
                                educ$Age35_Hi_F,educ$Age40_Hi_F,educ$Age45_Hi_F, educ$Age50_Hi_F, 
                                educ$Age55_Hi_F, educ$Age60_Hi_F, educ$Age65_Hi_F))

# total number of males with high-school education level
educ$male_Hi <- rowSums(cbind(educ$Age15_Hi_M, educ$Age20_Hi_M, educ$Age25_Hi_M, educ$Age30_Hi_M,
                              educ$Age35_Hi_M,educ$Age40_Hi_M,educ$Age45_Hi_M, educ$Age50_Hi_M, 
                              educ$Age55_Hi_M, educ$Age60_Hi_M, educ$Age65_Hi_M))

# total number of females with university degree
educ$female_Univ <- rowSums(cbind(educ$Age15_Univ_F, educ$Age20_Univ_F, educ$Age25_Univ_F, educ$Age30_Univ_F,
                                  educ$Age35_Univ_F,educ$Age40_Univ_F,educ$Age45_Univ_F, educ$Age50_Univ_F,
                                  educ$Age55_Univ_F, educ$Age60_Univ_F, educ$Age65_Univ_F))

# total number of males with university degree
educ$male_Univ <- rowSums(cbind(educ$Age15_Univ_M, educ$Age20_Univ_M, educ$Age25_Univ_M, educ$Age30_Univ_M,
                                educ$Age35_Univ_M,educ$Age40_Univ_M,educ$Age45_Univ_M, educ$Age50_Univ_M,
                                educ$Age55_Univ_M, educ$Age60_Univ_M, educ$Age65_Univ_M))

# drop all the unnecessary variables 
educ <- subset(educ, select= c(CODGEO,
                               female_NoDip, male_NoDip, 
                               female_Sec, male_Sec, 
                               female_Hi, male_Hi, 
                               female_Univ, male_Univ))

# total number of population without a diploma
educ$nodip <-rowSums(cbind(educ$female_NoDip, educ$male_NoDip))

# total number of population with secondary level of education
educ$sec <-rowSums(cbind(educ$female_Sec, educ$male_NoDip))

# total number of population with highschool diploma
educ$high <-rowSums(cbind(educ$female_Hi, educ$male_Hi))

# total number of poopulation with university diploma
educ$univ <- rowSums(cbind(educ$female_Univ, educ$male_Univ))

```


## Demographic/social profiles data

Rename the original variables 
```{r pre-processing demographic/social profiles data}

# rename variables
names(categ_socio)[3:ncol(categ_socio)] <-
  c("M_immi_agri",
    "F_immi_agri",
    "M_NoImmi_agri",
    "F_NoImmi_agri",
    "M_immi_comm",
    "F_immi_comm",
    "M_NoImmi_comm",
    "F_NoImmi_comm",
    "M_immi_exec",
    "F_immi_exec",
    "M_NoImmi_exec",
    "F_NoImmi_exec",
    "M_immi_midman",
    "F_immi_midman",
    "M_NoImmi_midman",
    "F_NoImmi_midman",
    "M_immi_emp",
    "F_immi_emp",
    "M_NoImmi_emp",
    "F_NoImmi_emp",
    "M_immi_worker",
    "F_immi_worker",
    "M_NoImmi_worker",
    "F_NoImmi_worker",
    "M_immi_retired",
    "F_immi_retired",
    "M_NoImmi_retired",
    "F_NoImmi_retired",
    "M_immi_noAct",
    "F_immi_noAct",
    "M_NoImmi_noAct",
    "F_NoImmi_noAct")

# drop LIBGEO
categ_socio <- subset(categ_socio, select=-c(LIBGEO))

```

Create the total number of immigrants and natives per town, also separeted by gender.
```{r Categ_socio-rename variables }

# total number of male immigrants per town 
categ_socio$male_immig <-rowSums(cbind(categ_socio$M_immi_agri, categ_socio$M_immi_comm, categ_socio$M_immi_emp,  
                                       categ_socio$M_immi_exec, categ_socio$M_immi_midman,categ_socio$M_immi_noAct, 
                                       categ_socio$M_immi_retired, categ_socio$M_immi_worker))
 
# total number of female immigrants per town
categ_socio$female_immig <- rowSums(cbind(categ_socio$F_immi_agri,categ_socio$F_immi_comm, categ_socio$F_immi_emp, 
                                          categ_socio$F_immi_exec, categ_socio$F_immi_midman, 
                                          categ_socio$F_immi_noAct, categ_socio$F_immi_retired, categ_socio$F_immi_worker))

# total number immigrants per town
categ_socio$total_immig <- categ_socio$male_immig + categ_socio$female_immig

# working male immigrants per town 
categ_socio$male_working_immig <- categ_socio$male_immig - categ_socio$M_immi_retired 
 
# working female immigrants per town  
categ_socio$female_working_immig <- categ_socio$female_immig - categ_socio$F_immi_retired 
 
# total number of male natives per town 
categ_socio$male_native <- rowSums(cbind(categ_socio$M_NoImmi_agri, categ_socio$M_NoImmi_comm, categ_socio$M_NoImmi_exec, categ_socio$M_NoImmi_emp, categ_socio$M_NoImmi_midman, categ_socio$M_NoImmi_noAct, categ_socio$M_NoImmi_retired, categ_socio$M_NoImmi_worker))

# total number of female natives per town 
categ_socio$female_native <- rowSums(cbind(categ_socio$F_NoImmi_agri, categ_socio$F_NoImmi_comm, categ_socio$F_NoImmi_exec, categ_socio$F_NoImmi_emp, categ_socio$F_NoImmi_midman, categ_socio$F_NoImmi_noAct, categ_socio$F_NoImmi_retired, categ_socio$F_NoImmi_worker))

# total number of natives per town 
categ_socio$total_native <- categ_socio$male_native + categ_socio$female_native

# drop unnecessary variables 
categ_socio <- subset(categ_socio, select = c(CODGEO, male_immig, female_immig, total_immig, male_working_immig, 
                                               female_working_immig, male_native, female_native, total_native)) 

```


## Work status data

Rename the original variables
```{r pre-processing work status data}

# rename variables
names(status_work)[3:ncol(status_work)] <-
  c("Less20_M_wagearner_Full", "Less20_M_wagearner_Half", "Less20_M_Indp1_Full", "Less20_M_Indp1_Half",
    "Less20_M_empl_Full", "Less20_M_empl_Half", "Less20_M_trans_Full", "Less20_M_trans_Half",
    "Less20_F_wagearner_Full", "Less20_F_wagearner_Half", "Less20_F_Indp1_Full", "Less20_F_Indp1_Half",
    "Less20_F_empl_Full", "Less20_F_empl_Half", "Less20_F_trans_Full", "Less20_F_trans_Half",
    "t4_M_wagearner_Full", "t4_M_wagearner_Half", "t4_M_Indp1_Full", "t4_M_Indp1_Half", "t4_M_empl_Full",
    "t4_M_empl_Half", "t4_M_trans_Full", "t4_M_trans_Half", "t4_F_wagearner_Full", "t4_F_wagearner_Half",
    "t4_F_Indp1_Full", "t4_F_Indp1_Half", "t4_F_empl_Full", "t4_F_empl_Half", "t4_F_trans_Full", "t4_F_trans_Half",
    "t9_M_wagearner_Full", "t9_M_wagearner_Half", "t9_M_Indp1_Full", "t9_M_Indp1_Half", "t9_M_empl_Full",
    "t9_M_empl_Half", "t9_M_trans_Full", "t9_M_trans_Half", "t9_F_wagearner_Full", "t9_F_wagearner_Half",
    "t9_F_Indp1_Full", "t9_F_Indp1_Half", "t9_F_empl_Full", "t9_F_empl_Half", "t9_F_trans_Full", "t9_F_trans_Half",
    "T4_M_wagearner_Full", "T4_M_wagearner_Half", "T4_M_Indp1_Full", "T4_M_Indp1_Half", "T4_M_empl_Full",
    "T4_M_empl_Half", "T4_M_trans_Full", "T4_M_trans_Half", "T4_F_wagearner_Full", "T4_F_wagearner_Half",
    "T4_F_Indp1_Full", "T4_F_Indp1_Half", "T4_F_empl_Full", "T4_F_empl_Half", "T4_F_trans_Full", "T4_F_trans_Half",
    "T9_M_wagearner_Full", "T9_M_wagearner_Half", "T9_M_Indp1_Full", "T9_M_Indp1_Half", "T9_M_empl_Full",
    "T9_M_empl_Half", "T9_M_trans_Full", "T9_M_trans_Half", "T9_F_wagearner_Full", "T9_F_wagearner_Half",
    "T9_F_Indp1_Full", "T9_F_Indp1_Half", "T9_F_empl_Full", "T9_F_empl_Half", "T9_F_trans_Full", "T9_F_trans_Half",
    "FF_M_wagearner_Full", "FF_M_wagearner_Half", "FF_M_Indp1_Full", "FF_M_Indp1_Half", "FF_M_empl_Full",
    "FF_M_empl_Half", "FF_M_trans_Full", "FF_M_trans_Half", "FF_F_wagearner_Full", "FF_F_wagearner_Half",
    "FF_F_Indp1_Full", "FF_F_Indp1_Half", "FF_F_empl_Full", "FF_F_empl_Half", "FF_F_trans_Full", "FF_F_trans_Half",
    "F9_M_wagearner_Full", "F9_M_wagearner_Half", "F9_M_Indp1_Full", "F9_M_Indp1_Half", "F9_M_empl_Full",
    "F9_M_empl_Half", "F9_M_trans_Full", "F9_M_trans_Half", "F9_F_wagearner_Full", "F9_F_wagearner_Half",
    "F9_F_Indp1_Full", "F9_F_Indp1_Half", "F9_F_empl_Full", "F9_F_empl_Half", "F9_F_trans_Full", "F9_F_trans_Half",
    "f4_M_wagearner_Full", "f4_M_wagearner_Half", "f4_M_Indp1_Full", "f4_M_Indp1_Half", "f4_M_empl_Full",
    "f4_M_empl_Half", "f4_M_trans_Full", "f4_M_trans_Half", "f4_F_wagearner_Full", "f4_F_wagearner_Half",
    "f4_F_Indp1_Full", "f4_F_Indp1_Half", "f4_F_empl_Full", "f4_F_empl_Half", "f4_F_trans_Full", "f4_F_trans_Half",
    "f9_M_wagearner_Full", "f9_M_wagearner_Half", "f9_M_Indp1_Full", "f9_M_Indp1_Half", "f9_M_empl_Full",
    "f9_M_empl_Half", "f9_M_trans_Full", "f9_M_trans_Half", "f9_F_wagearner_Full", "f9_F_wagearner_Half",
    "f9_F_Indp1_Full", "f9_F_Indp1_Half", "f9_F_empl_Full", "f9_F_empl_Half", "f9_F_trans_Full", "f9_F_trans_Half",
    "s4_M_wagearner_Full", "s4_M_wagearner_Half", "s4_M_Indp1_Full", "s4_M_Indp1_Half", "s4_M_empl_Full",
    "s4_M_empl_Half", "s4_M_trans_Full", "s4_M_trans_Half", "s4_F_wagearner_Full", "s4_F_wagearner_Half",
    "s4_F_Indp1_Full", "s4_F_Indp1_Half", "s4_F_empl_Full", "s4_F_empl_Half", "s4_F_trans_Full", "s4_F_trans_Half",
    "s9_M_wagearner_Full", "s9_M_wagearner_Half", "s9_M_Indp1_Full", "s9_M_Indp1_Half", "s9_M_empl_Full",
    "s9_M_empl_Half", "s9_M_trans_Full", "s9_M_trans_Half", "s9_F_wagearner_Full", "s9_F_wagearner_Half",
    "s9_F_Indp1_Full", "s9_F_Indp1_Half", "s9_F_empl_Full", "s9_F_empl_Half", "s9_F_trans_Full", "s9_F_trans_Half")

```

Create aggregate statistics and drop unnecessary variables
```{r Status drop and aggregation}

# Simplify the variables by summing up by categories
# total number of wage earners for males and females.
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('wagearner', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
# names(status_work[nam.M])
# names(status_work[nam.F])
status_work$wagearner_M=rowSums(status_work[,nam.M])
status_work$wagearner_F=rowSums(status_work[,nam.F]) 
                                               
# same but for independent workers by gender
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('Indp1', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
# names(status_work[nam.M])
# names(status_work[nam.F])
status_work$independent_M=rowSums(status_work[,nam.M]) 
status_work$independent_F=rowSums(status_work[,nam.F])

# same but for government transfer receivers
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('trans', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
# names(status_work[nam.M])
# names(status_work[nam.F])
status_work$transfer_M=rowSums(status_work[,nam.M])  
status_work$transfer_F=rowSums(status_work[,nam.F])

# same but for employers
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('empl', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
# names(status_work[nam.M])
# names(status_work[nam.F])
status_work$employer_M=rowSums(status_work[,nam.M])
status_work$employer_F=rowSums(status_work[,nam.F])
 
# same but for full-time contracts
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('Full', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
# names(status_work[nam.M])
# names(status_work[nam.F])
status_work$full_M=rowSums(status_work[,nam.M])
status_work$full_F=rowSums(status_work[,nam.F])

# same but for half-time contracts
nam.M = rep(0, ncol(status_work))
nam.F = rep(0, ncol(status_work))
for (i in 1:ncol(status_work)){
  sol = grepl('Half', names(status_work)[i])
  if (sol==1){
    if (grepl('M', names(status_work)[i])){
      nam.M[i] = i  
    }else{nam.F[i]=i}
  }else{
    nam.F[i]=0
    nam.M[i]=0}
}
#names(status_work[nam.M])
#names(status_work[nam.F])
status_work$half_M=rowSums(status_work[,nam.M])
status_work$half_F=rowSums(status_work[,nam.F])

# total of each variables
status_work$full <- status_work$full_M + status_work$full_F
status_work$half <- status_work$half_M + status_work$half_F
status_work$wagearner <- status_work$wagearner_M + status_work$wagearner_F
status_work$independent <- status_work$independent_M + status_work$independent_F
status_work$transfer <- status_work$transfer_F + status_work$transfer_M 
status_work$employer <- status_work$employer_F + status_work$employer_M 

# drop unnecessary variables
status_work <- subset(status_work, select = c(CODGEO,
                                              wagearner_M, wagearner_F, 
                                              independent_M, independent_F, 
                                              transfer_M, transfer_F, 
                                              employer_M, employer_F, 
                                              full_M, full_F,
                                              half_M, half_F,
                                              full, half,
                                              wagearner, independent, transfer, employer))
# Full variable means individuals with full time contracts. Half means individuals with part time contract. 
# "transfer" is individuals receiving transfer payments. 
# "independent" is like auto-entrepreneurs. 

```


## Inequality data

Drop unnecessary variables and rename the other ones
```{r pre-processing inequality data}

names(ineq)

# drop unnecessary variables
ineq <- subset(ineq, select = -c(LIBGEO, P09_POP, NAISD16, DECESD16, P14_RP_PROP, P09_EMPLT,ETTOT15, ETAZ15, ETBE15, ETFZ15, ETGU15, ETGZ15, ETOQ15, ETTEF115, ETTEFP1015))

# rename variables 
names(ineq)[4:ncol(ineq)] <-
  c("pop_2014",
    "Superficie",
    "birth09_14",
    "death09_14",
    "households14",
    "housing14",
    "princ_resid14",
    "sec_resid14",
    "vac_resid14",
    "tax_house14",
    "shared_tax_house14",
    "median_living14",
    "lev_ineq14",
    "empl",
    "emp_sal",
    "pop15_64",
    "unemp15_64",
    "act15_64")

# drop more unnecessary variables 
ineq <- subset(ineq, select = -c(princ_resid14, sec_resid14, vac_resid14, pop_2014, households14, birth09_14, 
                                death09_14, tax_house14, shared_tax_house14, lev_ineq14)) 
 
# unemployment rate 
ineq$unemp_rate <- ineq$unemp15_64/ineq$pop15_64 

# impute NA using median (only 3)
ineq$median_living14[is.na(ineq$median_living14)] = median(ineq$median_living14, na.rm = T)

```


## Commune data

Drop unnecessary variables and rename the other ones
```{r pre-processing commune data}
names(commune)

#Drop unnecessary variables
commune <-subset(commune, select=-c(libgeo, scorefiscal, dep, reg, evolutionpop, nbpropritaire, 
                                    nblogementsecondaireetoccasionne, scoreurbanit, scorevargion, scorepib,  
                                    nbrsidencesprincipales, indicefiscalpartiel, nboccupantsrsidenceprincipale, 
                                    depmoyennesalaireshoraires, depmoyennesalairescadrehoraires, 
                                    depmoyennesalairesprofintermdiai, depmoyennesalairesemployhoraires, 
                                    depmoyennesalairesouvrihoraires, valeurajoutergionale, 
                                    nbindustriesdesbiensintermdiaire, nbdecommerce, nbdeservicesauxparticuliers, 
                                    nbinstitutiondeeducationsantacti, scorevargion, scorepib, 
                                    moyennerevenusfiscauxdpartementa, moyennerevenusfiscauxrgionaux, 
                                    regmoyennesalaireshoraires, regmoyennesalairescadrehoraires, 
                                    regmoyennesalairesprofintermdiai, regmoyennesalairesemployhoraires, 
                                    regmoyennesalairesouvrihoraires, tauxtudiants, dynamiquedmographiqueinsee, 
                                    capacitfisc, capacitfiscale, moyennerevnusfiscaux, nblogement, fidlit)) 


# rename variables  
names(commune) <-c("CODGEO", "orient_econ", "demo_index", "population", "nb_household", 
                    "nb_female", "nb_male", "nb_minor", "nb_major", "nb_students",  
                    "nb_firms_service", "nb_firms_commerce", "nb_firms_construction", 
                    "nb_firms_ind", "nb_created_firms", "nb_created_ind", 
                    "nb_created_construction", "nb_created_commerce",
                    "nb_created_services", "urbanRural", "nb_active", "nb_active_employees",
                    "nb_active_non_employees", "dymanic_demo_prof", "profitRate", 
                    "regional_GDP", "demo_environment", "evol_pop_score", 
                    "evol_entrepreneurial_score", "demo_envir2") 

# drop more variables 
commune <-subset(commune, select=-c(demo_index, population, profitRate, demo_envir2, demo_environment)) 

```


# Produce consistent datasets

The CODGEO variables have to be merged.
However, for different reasons already identified by other kaggle users they need some pre-processing.
To do so, some already known mistakes are corrected:

```{r fix and match CODGEO}

firms$CODGEO       <- sub("A", "0", firms$CODGEO)
firms$CODGEO       <- sub("B", "0", firms$CODGEO)
salary$CODGEO      <- sub("A", "0", salary$CODGEO)
salary$CODGEO      <- sub("B", "0", salary$CODGEO)
population$CODGEO  <- sub("A", "0", population$CODGEO)
population$CODGEO  <- sub("B", "0", population$CODGEO)
geo$CODGEO         <- sub("A", "0", geo$CODGEO)
geo$CODGEO         <- sub("B", "0", geo$CODGEO)
educ$CODGEO        <- sub("A", "0", educ$CODGEO)
educ$CODGEO        <- sub("B", "0", educ$CODGEO)
categ_socio$CODGEO <- sub("A", "0", categ_socio$CODGEO)
categ_socio$CODGEO <- sub("B", "0", categ_socio$CODGEO)
ineq$CODGEO        <- sub("A", "0", ineq$CODGEO)
ineq$CODGEO        <- sub("B", "0", ineq$CODGEO)
status_work$CODGEO <- sub("A", "0", status_work$CODGEO)
status_work$CODGEO <- sub("B", "0", status_work$CODGEO)
commune$CODGEO     <- sub("A", "0", commune$CODGEO)
commune$CODGEO     <- sub("B", "0", commune$CODGEO)

# Then all CODGEO are trasformed to integers and four new datasets are created retaining only the common CODGEO:
# use only integer values
geo$CODGEO         <- as.integer(geo$CODGEO)  
population$CODGEO  <- as.integer(population$CODGEO)
firms$CODGEO       <- as.integer(firms$CODGEO)
salary$CODGEO      <- as.integer(salary$CODGEO)
status_work$CODGEO <- as.integer(status_work$CODGEO)  
ineq$CODGEO        <- as.integer(ineq$CODGEO)
educ$CODGEO        <- as.integer(educ$CODGEO)
categ_socio$CODGEO <- as.integer(categ_socio$CODGEO)
commune$CODGEO     <- as.integer(commune$CODGEO)

# store datasets' names to loop on them
dataset = c("population", "salary", "firms", "geo", "ineq", "commune", "educ", "categ_socio", "status_work")

# obtain sommon IDs for all datasets
for (i in dataset){
  # get i-th name and create a new variable concateneting "NEW" at the end
  nam <- paste(i, "NEW", sep = "")
  # initialize counter to identify the number of iteration in j
  iter = 1
  for (j in dataset){
    if (j != i){
      # for each dataset different from the i-th
      if (iter == 1){
        # 1st iteration: use the original dataset (e.g., geo)
        assign(nam, semi_join(get(i), get(j), by = "CODGEO"))
      } else{
        # successive iteration: use the new dataset (e.g., geoNEW)
        assign(nam, semi_join(get(nam), get(j), by = "CODGEO"))
      }
      iter = iter + 1
    }
  }
}

# check how many observation have been deleted
for (i in dataset){
  del_rows = nrow(get(i)) - nrow(get(paste(i, "NEW", sep = "")))
  del_prop = del_rows / nrow(get(i))
  del_obs = paste("For", i, del_rows, "units have been deleted.",
                  "They were the", round(del_prop*100, digits=2), "% of the total.", sep = " ")
  print(paste(del_obs))
}

print(paste("The new dataset has", nrow(salaryNEW), "units and", 
      ncol(salaryNEW)+ncol(populationNEW)+ncol(firmsNEW)+ncol(geoNEW)+
        ncol(status_workNEW)+ncol(ineqNEW)+ncol(educNEW)+ncol(categ_socioNEW)+ncol(communeNEW), "features."))

```

Overwrite new datesets into the old ones and remove the latter ones from memory
```{r take unique datasets}

firms       <- firmsNEW
geo         <- geoNEW
salary      <- salaryNEW
population  <- populationNEW
educ        <- educNEW
categ_socio <- categ_socioNEW
status_work <- status_workNEW
ineq        <- ineqNEW
commune     <- communeNEW

rm(list=c("firmsNEW", "geoNEW", "salaryNEW", "populationNEW", "educNEW", "categ_socioNEW", "status_workNEW", "ineqNEW", "communeNEW"))

```


# Descriptive statistics 

## Firms data

Check the distribution of the null firms (i.e., unknown sizes) and analyze
firms' distribution per town:
```{r firms distribution}

# check the ratio of null firms for each town
summary(firms$null/firms$total)
# a lot of information is missing, should we remove these data?
ggplot(data=firms, aes(firms$null/firms$total)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Null firms/total firms", y="Count") +
  ggtitle("Ratio between the number of null firms and total firms per town") +
  theme(plot.title = element_text(hjust = 0.5)) 

# distribution for the firms with unknown size 
ggplot(data=firms, aes(log(firms$null))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms of unknown size (log scale)", y="Density") +
  ggtitle("Number of firms of unknown size per town") +
  theme(plot.title = element_text(hjust = 0.5))

# distribution for the total number of firms 
ggplot(data=firms, aes(log(firms$total))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms (log scale)", y="Density") +
  ggtitle("Total number of firms per town") +
  theme(plot.title = element_text(hjust = 0.5)) 

# distribution for the number of micro size firms 
ggplot(data=firms, aes(log(firms$micro))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms of micro size (log scale)", y="Density") +
  ggtitle("Number of firms of micro size per town") +
  theme(plot.title = element_text(hjust = 0.5))

# distribution for the number of small size firms 
ggplot(data=firms, aes(log(firms$small))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms of small size (log scale)", y="Density") +
  ggtitle("Number of firms of small size per town") +
  theme(plot.title = element_text(hjust = 0.5))

# distribution for the number of medium size firms 
ggplot(data=firms, aes(log(firms$medium))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms of medium size (log scale)", y="Density") +
  ggtitle("Number of firms of medium size per town") +
  theme(plot.title = element_text(hjust = 0.5))

# distribution for the number of large size firms 
ggplot(data=firms, aes(log(firms$large))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Number of firms of large size (log scale)", y="Density") +
  ggtitle("Number of firms of large size per town") +
  theme(plot.title = element_text(hjust = 0.5))

```

Check correlation among the variables:
```{r firms correlation}
corrplot(cor(firms[,3:8]))
```


### What have learned

Solved problems: 

* Use the EU firms' categorization.

Some problematic aspects:

* Some towns have 100% of null firms, should we remove these data?
* These variables are strongly correlated, should we use just the total amount in the following parts?

We could use these data for the following tasks:

* predict the salaries using such information as proxy for the competition in the job market;
* predict the total number of firms, using salary data;
* geo-spatial plot for firms' size 



## Geographic data

Plot available towns on a map:
```{r geo map}

# center of France, obtained using:
# fra_center = as.numeric(geocode("France"))
fra_center = c(2.213749, 46.227638)
# plot all European towns available
geo_pos = as.data.frame(cbind(lon = geo$longitude, lat = geo$latitude))
geo_pos = geo_pos[complete.cases(geo_pos),]
ggmap(get_googlemap(center=fra_center, scale=2, zoom=5), extent="normal") +
  geom_point(aes(x=lon, y=lat), data=geo_pos, col="orange", alpha=0.2, size=0.01) 

# Try to cll google API until there is no error
# 
# FraMap = NA
# while (any(is.na(FraMap))){
#   FraMap = tryCatch({
#       ggmap(get_googlemap(center=fra_center, scale=2, zoom=5), extent="normal")
#   }, error = function(e) {
#     message("cannot open URL")  
#     FraMap = NA
#   })
# }

```

### What we have learned 

Solved problems: 

* Missing latitude and longitude;
* Duplications due to multiple postal codes in the same town;
* Exclude non-European towns;

These information are useful to plot any future dataset/analysis.
In addition, they could be useful to compare European towns vs. old colonies.


## Salary data 

Univariate analysis comparing salaries for both genders among various job categories:

```{r boxplots works}

#  number of units
n_sex <- length(salary$sal_Females)
# vector representing males and females
Label <- c(rep("M", n_sex*5), rep("F", n_sex*5))
# vector representing the variable considered
Variable <- c(rep("General", n_sex), 
             rep("Executive", n_sex),
             rep("MidManager", n_sex),
             rep("Employee", n_sex),
             rep("Worker",n_sex),
             rep("General", n_sex), 
             rep("Executive", n_sex),
             rep("MidManager", n_sex),
             rep("Employee", n_sex),
             rep("Worker",n_sex))
# merge these data
sal_sex = cbind.data.frame(Label = Label, 
             value = c(salary$sal_Males, salary$sal_M_executive, salary$sal_M_midManager, salary$sal_M_employee, salary$sal_M_worker,
                       salary$sal_Females, salary$sal_F_executive, salary$sal_F_midManager, salary$sal_F_employee, salary$sal_F_worker),
             Variable = Variable)
# plotting phase
ggplot(data = sal_sex, aes(x=Label, y=value)) +
  geom_boxplot(aes(fill = Label)) +
  # not color points replacing colour = group instead of colour=Label
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Sex") + ylab("Mean net salary per hour") + ggtitle("Gender comparison for different job positions") +
  theme(plot.title = element_text(hjust = 0.5)) +      stat_boxplot(geom = "errorbar", width = 0.5)
  # + guides(fill=guide_legend(title="Legend"))

# the same but excluding outliers
ggplot(data = sal_sex, aes(x=Label, y=value)) +
  scale_y_continuous(limits = quantile(sal_sex$value, c(0, 0.9))) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Sex") + ylab("Mean net salary per hour") + 
  ggtitle("Gender comparison for different job positions excluding the last decile") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_boxplot(geom = "errorbar", width = 0.5)

```


Univariate analysis comparing salaries for both genders among various ages:

```{r boxplots ages}

# vector representing males and females
Label <- c(rep("M", n_sex*3), rep("F", n_sex*3))
# vector representing the variable considered
Variable <- c(rep("18-25", n_sex), 
              rep("26-50", n_sex),
              rep("51+", n_sex),
              rep("18-25", n_sex), 
              rep("26-50", n_sex),
              rep("51+", n_sex))
# merge these data
sal_sex <- cbind.data.frame(Label = Label, 
                           value = c(salary$sal_M_18_25, salary$sal_M_26_50, salary$sal_M_51plus, 
                                     salary$sal_F_18_25, salary$sal_F_26_50, salary$sal_F_51plus),
                           Variable = Variable)
# plotting phase
ggplot(data = sal_sex, aes(x=Label, y=value)) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Sex") + ylab("Mean net salary per hour") + ggtitle("Gender comparison for different ages") +
  theme(plot.title = element_text(hjust = 0.5)) + ylim(c(5, 100)) +
  stat_boxplot(geom = "errorbar", width = 0.5)

```


The income inequality between genders, age groups and working positions is clear.
In the following analyses the focus is on the salary ratio between women and men among different
job positions:
```{r ratio F vs M accross jobs}

# Gender salary ratio and general level of income

# Overall mean salary: The higher the net mean income, the more skewed the ratio of salary between female and male is. Only 2 towns have a ratio>1
# create overall F vs M ratio
salary$salary_ratio_FvsM <- salary$sal_Females / salary$sal_Males
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="Overall salary ratio (females/males)", y="Density") + 
  labs(title = "Overall salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 
# scatter plot vs overall mean salary
ggplot(salary, aes(x= sal_general, y=salary_ratio_FvsM)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="Overall salary ratio(females/males)") + 
  labs(title = "Overall salary ratio between females and males vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm") 


# Executives mean salary: a bit better the situation for females in this case and less skewed
# create Executives F vs M ratio
salary$salary_ratio_FvsM_Exec <- salary$sal_F_executive / salary$sal_M_executive
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_Exec)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="Executives salary ratio (females/males)", y="Density") + 
  labs(title = "Executives salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 
# scatter plot vs executives mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_Exec)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="Executives salary ratio (females/males)") + 
  labs(title = "Executives salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")


# Middle managers mean salary: ....
# create Middle managers F vs M ratio
salary$salary_ratio_FvsM_midManag <- salary$sal_F_midManager / salary$sal_M_midManager
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_Exec)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="Middle managers salary ratio (females/males)", y="Density") + 
  labs(title = "Middle managers salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 
# scatter plot vs executives mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_midManag)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="Middle managers salary ratio (females/males)") + 
  labs(title = "Middle managers salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")


# Workers mean salary: ...
# create workers F vs M ratio
salary$salary_ratio_FvsM_worker <- salary$sal_F_worker / salary$sal_M_worker
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_worker)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="Workers salary ratio (females/males)", y="Density") + 
  labs(title = "Workers salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 
# scatter plot vs workers mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_worker)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="Workers salary ratio (females/males)") + 
  labs(title = "Workers salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")


# Employee mean salary: ...
# create Employee F vs M ratio
salary$salary_ratio_FvsM_employee <- salary$sal_F_employee / salary$sal_M_employee
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_employee)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="Employee salary ratio (females/males)", y="Density") + 
  labs(title = "Employee salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 
# scatter plot vs Employee mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_employee)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="Employee salary ratio (females/males)") + 
  labs(title = "Employee salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")

```


Now the focus is on the salary ratio between women and men among different age groups:
```{r ratio F vs M accross ages}

# 18-25 mean salary: are quite equal apart from some outliers and a quadratic trend
# create 18-25 F vs M ratio
salary$salary_ratio_FvsM_18_25 <- salary$sal_F_18_25 / salary$sal_M_18_25
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_18_25)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="18-25 salary ratio (females/males)", y="Density") + 
  labs(title = "18-25 salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 
# scatter plot vs 18-25 mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_18_25)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="18-25 salary ratio (females/males)") + 
  labs(title = "18-25 salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")
# scatter plot vs 18-25 mean salary for them
ggplot(salary, aes(x= sal_18_25, y= salary_ratio_FvsM_18_25)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="18-25 salary", y="18-25 salary ratio (females/males)") + 
  labs(title = "18-25 salary ratio between females and males \n vs. overall 18-25 salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "loess")


# 26-50 mean salary: ...
# create 26-50 F vs M ratio
salary$salary_ratio_FvsM_26_50 <- salary$sal_F_26_50 / salary$sal_M_26_50
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_26_50)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="26-50 salary ratio (females/males)", y="Density") + 
  labs(title = "26-50 salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 
# scatter plot vs 26-50 mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_26_50)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="26-50 salary ratio (females/males)") + 
  labs(title = "26-50 salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")


# 51+ mean salary: ...
# create 51+ F vs M ratio
salary$salary_ratio_FvsM_51plus <- salary$sal_F_51plus / salary$sal_M_51plus
# histogram
ggplot(data=salary, aes(salary$salary_ratio_FvsM_51plus)) + 
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) + 
  geom_density(col="black") + 
  labs(x="51+ salary ratio (females/males)", y="Density") + 
  labs(title = "51+ salary ratio between females and males") + 
  theme(plot.title = element_text(hjust = 0.5)) 
# scatter plot vs 26-50 mean salary
ggplot(salary, aes(x= sal_general, y= salary_ratio_FvsM_51plus)) +  
  geom_point(size = 0.5, colour = "#0091ff")+ 
  labs(x="Overall salary", y="51+ salary ratio (females/males)") + 
  labs(title = "51+ salary ratio between females and males \n vs. overall salary") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm")

```


Highlight bivariate relations:
```{r bivariate plots}

# correlation matrix
corrplot(cor(salary[, 3:ncol(salary)]), method = "circle", title = "Correlation matrix for salary data", 
         diag = T, tl.cex=0.5, type="lower", 
         tl.col = "black", mar=c(0,0,1.5,0)) 

# most general pairs
pairs(salary[c(3:8, 13, 18:20)], gap=0, main = "Scatter matrix of the main variables in salary data", cex = 0.6)
# pairs highlighting genders' differences
pairs(salary[c(9:12, 14:17)], gap=0, main = "Scatter matrix of job categories for both genders", cex = 0.6)

```

### What we have learned

....



## Population data 

Plot population data across different towns:
```{r plotting some population data in log10}

# Histogram of total population per town in log
ggplot(data=population, aes(ifelse(total_population!=0, log10(total_population), 0))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Log10 total population", y="Density") +
  ggtitle("Histogram of total population per town in log10 scale") +
  theme(plot.title = element_text(hjust = 0.5))

# Histogram of dependency ratio per town
ggplot(data=population, aes(ifelse(dependency_ratio!=0, log10(dependency_ratio), 0))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Log10 dependency ratio", y="Density") +
  ggtitle("Histogram of dependency ratio per town in log10 scale") +
  theme(plot.title = element_text(hjust = 0.5))

# Histogram of sex ratio per town
ggplot(data=population, aes(ifelse(sex_ratio!=0, log10(sex_ratio), 0))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Log10 sex ratio", y="Density") +
  ggtitle("Histogram of sex ratio per town in log10 scale") +
  theme(plot.title = element_text(hjust = 0.5))

```

Understand which towns have the highest concentration of population and ratios.
```{r geo plot for each town}

# Merge geography and population data
# geo_population <- merge(geo, population, by="CODGEO")
geo_population <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(population$CODGEO),],
                                   population[unique(population$CODGEO) %in% unique(geo$CODGEO),])

# France map
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")

# Plot "Distribution of total population for each town" 
sc <- scale_colour_gradientn(colours =palette(rainbow(3)), limits=c(log10(min(geo_population$total_population)), 
                                                                    log10(max(geo_population$total_population)))) 
population_distribution <-
  FraMap +
  geom_point(aes(x=geo_population$longitude, y=geo_population$latitude,
                 colour=log10(geo_population$total_population)), 
             data=geo_population, alpha=0.2, size=0.01) +
             sc + labs(color='') + ggtitle("Distribution of Population for each town")
population_distribution

# Plot "Distribution of aged dependency ratio for each town"
sc <- scale_colour_gradientn(colours =palette(rainbow(4)), limits=c(min(geo_population$aged_dependency_ratio),
                                                                  max(geo_population$aged_dependency_ratio)))
aged_ratio_distribution <-
  FraMap +
  geom_point(aes(x=geo_population$longitude, y=geo_population$latitude,
                 colour=geo_population$aged_dependency_ratio),
             data=geo_population, alpha=1, size=1) +
             sc + labs(color='') + ggtitle("Aged dependency ratio per town")
aged_ratio_distribution

# which city has the highest/lowest aged dependency ratio 
geo_population[which.min(geo_population$aged_dependency_ratio),c('CODGEO')]  
geo_population[which.max(geo_population$aged_dependency_ratio),c('CODGEO')]  
 
# get list of top 10 highest aged dependency ratio 
Rank = sort(population$aged_dependency_ratio, TRUE) 
Rank10 = sort(Rank[1:10], TRUE) 
sol = rep(0, 10) 
for (i in 1:10){ 
    sol[i] = paste(geo_population$CODGEO[geo_population$aged_dependency_ratio %in% Rank10[i]] )
} 

# remove the dataset created to save memory
rm(list = c("geo_population", "Rank", "Rank10")) 

```


### What we have learned 

We can observe that in urban cities, the aged dependcy ratio is lower compared to non-urban cities. 
The city with the lowest aged dependency ratio is La Fert?-sous-Jouarre and the highest is L'Aiguillon-sur-Mer. 

## Education data

```{r Descriptive statistics-Education} 
 
# Difference between male and female's education level (with University degrees) 
summary(educ$male_Univ) 
summary(educ$female_Univ) 
summary(educ$male_NoDip) 
summary(educ$female_NoDip) 
 
# Merge geo and educ data to include longitude and latitude in the data. 
geo_educ <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(educ$CODGEO),], 
                                   educ[unique(educ$CODGEO) %in% unique(geo$CODGEO),]) 
 
# check the ratio of individuals with high and low education level for each town. 
summary(educ$univ/educ$nodip) 
 
# Histogram of the ratio of individuals with high and low education level for each town 
ggplot(data=educ, aes(log(educ$univ/educ$nodip))) + 
  geom_histogram(aes(y=..density..), col="black", fill="blue", alpha=0.3, bins= 50)+ 
                   geom_density(col="black") + 
                   labs(x="log. University degree/No diploma", y="Density") + 
                   ggtitle("Ratio between the number of individuals with no diploma and  
                           individuals with University degree per town") + 
                   theme(plot.title =element_text(hjust= 0.5)) 
 
# Adjust scales before plotting 
sc <- scale_colour_gradientn(colours = c("#56ddc5", "#F8766D", "#00BA38"), limits=c(min(educ$univ/educ$nodip), max(educ$univ/educ$nodip))) 
 
# Map ratio  
educ_ratio_NoDipUniv <- 
  FraMap + 
  geom_point(aes(x=geo_educ$longitude, y=geo_educ$latitude, 
                 colour=educ$univ/educ$nodip), 
             data=geo_educ, alpha=1, size=1) + 
             sc + labs(color='Ratio') + ggtitle("Education level ratio for each town") 
educ_ratio_NoDipUniv 
 
# Statistics of shape for the data 
# library(psych) 
describe(educ$univ/educ$nodip) 
 
# remove the dataset created to save memory 
rm(list=c("geo_educ", "educ_ratio_NoDipUniv")) 
 
``` 
 
### What we have learned 
 
In average 960 males have university degree and 1100 females have university degree.  
In average 904 males have no diploma and 1242 females have no diploma. 
Highest ratio in Paris=78251=Yvelines 
 

## Demographic/social profiles data

```{r Descriptive statistics Demographic profiles}
glimpse(categ_socio)

# merge geo and categ_socio
geo_categ_socio <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(categ_socio$CODGEO),],
                                   categ_socio[unique(categ_socio$CODGEO) %in% unique(geo$CODGEO),])

# find which cities have the most and least number of immigrants
categ_socio[which.max(categ_socio$total_immig),c('CODGEO')] 
categ_socio[which.min(categ_socio$total_immig),c('CODGEO')] 

# get list of top 10 cities with the most number of immigrants
Rank = sort(categ_socio$total_immig, TRUE)
Rank10 = sort(Rank[1:10], TRUE)
sol = rep(0, 10)
for (i in 1:10){
    sol[i] = paste(geo_categ_socio$region[categ_socio$total_immig %in% Rank10[i]] )
}

# Adjust scales before plotting 
sc <- scale_colour_gradientn(colours = c("#F8766D",  "#00BA38"), limits=c(min(geo_categ_socio$total_immig/population$total_population), max(geo_categ_socio$total_immig/population$total_population))) 
 
# Map ratio  
distribution_immig <- 
  FraMap + 
  geom_point(aes(x=geo_categ_socio$longitude, y=geo_categ_socio$latitude, 
                 colour=geo_categ_socio$total_immig/population$total_population), 
             data=geo_categ_socio, alpha=1, size=1) + 
             sc + labs(color='') + ggtitle("Immigrants for each town") 
distribution_immig

rm(geo_categ_socio)
```

### What we have learned

City with the least number of immigrants is Rennes and the most is Paris.
="Île-de-France"              "Provence-Alpes-Côte d'Azur" "Rhône-Alpes"                "Midi-Pyrénées"             
"Provence-Alpes-Côte d'Azur" "Alsace"                     "Languedoc-Roussillon"       "Île-de-France"             
"Île-de-France"              "Île-de-France"             


## Work status data

```{r descriptive stats status_work}

summary(status_work)
names(status_work) 

# merge geo and status_work
# geo_status_work <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(status_work$CODGEO),],
#                                    status_work[unique(status_work$CODGEO) %in% unique(geo$CODGEO),])
geo_status_work <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(status_work$CODGEO),], 
                                    status_work[unique(status_work$CODGEO) %in% unique(geo$CODGEO),]) 


# which town has the most/least people receiving government transfers
status_work[which.max(status_work$transfer),c('CODGEO')] 
status_work[which.min(status_work$transfer),c('CODGEO')] 

# get list of top 10 cities with the people receiving government transfers
Rank = sort(status_work$transfer, TRUE)
Rank10 = sort(Rank[1:10], TRUE)
sol = rep(0, 10)
for (i in 1:10){
    sol[i] = paste(geo_status_work$CODGEO[status_work$transfer %in% Rank10[i]])
}

# which town has the most/least people with part time contract.
status_work[which.max(status_work$half),c('CODGEO')] 
status_work[which.min(status_work$half),c('CODGEO')] 

# get list of top 10 cities with the most part-time contract.
Rank = sort(status_work$half, TRUE)
Rank10 = sort(Rank[1:10], TRUE)
sol = rep(0, 10)
for (i in 1:10){
    sol[i] = paste(status_work$CODGEO[status_work$half %in% Rank10[i]] ) 
}


# Histogram (Females with full time contract)
describe(status_work$full_F)    # Skewness

ggplot(data=status_work, aes(ifelse(full_F!=0, log(full_F), 0))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="", y="Density") +
  ggtitle("Female workers with full time contract") + 
  theme(plot.title = element_text(hjust = 0.5))

# Histogram (males with full time contract) 
describe(status_work$full_M)    # Skewness 

ggplot(data=status_work, aes(ifelse(full_M!=0, log(full_M), 0))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="", y="Density") +
  ggtitle("Male workers with full time contract") + 
  theme(plot.title = element_text(hjust = 0.5))

# Histogram: Contract type ratio
describe(status_work$half/status_work$full)    # Skewness

ggplot(data=status_work, aes(status_work$half/status_work$full)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 50) +
  geom_density(col="black") +
  labs(x="Contract type ratio (temporary/full time)", y="Density") +
  ggtitle("Contract type ratio") +
  theme(plot.title = element_text(hjust = 0.5))

```

```{r Box plot}
#  number of units
n_sex <- length(status_work$wagearner_F)

# vector representing males and females
Label <- c(rep("M", n_sex*6), rep("F", n_sex*6))

# vector representing the variable considered
Variable <- c(rep("Wagearner", n_sex), 
             rep("Independent", n_sex),
             rep("Transfer", n_sex),
             rep("Employer", n_sex),
             rep("Full",n_sex),
             rep("Half", n_sex))
             
# merge these data
status_sex = cbind.data.frame(Label = Label, 
             value = c(status_work$wagearner_M, status_work$wagearner_F, status_work$independent_M,
                       status_work$independent_F, status_work$transfer_M, status_work$transfer_F,
                       status_work$employer_M, status_work$employer_F, status_work$full_M, status_work$full_F,
                       status_work$half_M, status_work$half_F),
             Variable = Variable)

# plotting phase
ggplot(data = status_sex, aes(x=Label, y=value)) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Sex") + ylab("# of individuals") + ggtitle("Gender comparison") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_boxplot(geom = "errorbar", width = 0.5)


# the same but excluding outliers
ggplot(data = status_sex, aes(x=Label, y=value)) +
  scale_y_continuous(limits = quantile(status_sex$value, c(0, 0.9))) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Sex") + ylab("# of individuals") + 
  ggtitle("Gender comparison excluding the last decile") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_boxplot(geom = "errorbar", width = 0.5)
```

### What we have learned

Paris is the city where they have the most individuals receiving government transfers.
Lyon is the city where they have the least individuals receiving government transfers.
="Île-de-France"              "Provence-Alpes-Côte d'Azur" "Provence-Alpes-Côte d'Azur" "Rhône-Alpes"               
"Midi-Pyrénées"              "Aquitaine"                  "Languedoc-Roussillon"       "Pays de la Loire"          
"Rhône-Alpes"                "Alsace"                    

Paris has most individuals with half time contract. 
Rouen has the least. 
= "Île-de-France"              "Rhône-Alpes"                "Provence-Alpes-Côte d'Azur" "Midi-Pyrénées"             
"Pays de la Loire"           "Nord-Pas-de-Calais"         "Alsace"                     "Languedoc-Roussillon"      
"Aquitaine"                  "Bretagne"    

In average, more females receive transfers than males. 
In average, more females have part-time contract than males. (In average, 581 females have part-time contract (in each town) and 178 males have part-time contract. (More males have full-time contract.)

More females work as wage-earners than males.
=All these facts can be related to vulnerability of females in the labour market. Many studies show that females are mostly working as part-time workers and lower job categories. On top of this, more females receive transfers.
We can also observe this in our data that more males are registered as independent workers or employers. 

## Inequality data

```{r descriptive statistics Inequality data}
glimpse(ineq)

# merge geo and ineq
geo_ineq <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(ineq$CODGEO),],
                                   ineq[unique(ineq$CODGEO) %in% unique(geo$CODGEO),])

# which city has the highest/lowest unemployment rate
ineq[which.max(ineq$unemp_rate),c('CODGEO')] 
ineq[which.min(ineq$unemp_rate),c('CODGEO')] 

# get list of top 10 cities with the highest unemployment rate.
Rank = sort(ineq$unemp_rate, TRUE)
Rank10 = sort(Rank[1:10], TRUE)
sol = rep(0, 10)
for (i in 1:10){
    sol[i] = paste(ineq$CODGEO[ineq$unemp_rate %in% Rank10[i]] ) 
}

# which is the smallest/biggest city (area)
ineq[which.max(ineq$Superficie),c('CODGEO')] 
ineq[which.min(ineq$Superficie),c('CODGEO')] 

# which city has the highest/lowest median living level
ineq[which.max(ineq$median_living14),c('CODGEO')] 
ineq[which.min(ineq$median_living14),c('CODGEO')] 

# Adjust scales before plotting
sc <- scale_colour_gradientn(colours = c("#56ddc5", "#ff3db7"), limits=c(min(ineq$unemp_rate), max(ineq$unemp_rate)))

# Map ratio 
ineq_unempl_distribution <-
  FraMap +
  geom_point(aes(x=geo_ineq$longitude, y=geo_ineq$latitude,
                 colour=geo_ineq$unemp_rate),
             data=geo_ineq, alpha=1, size=1) +
             sc + labs(color='Unemployment rate') + ggtitle("Unemployment rate for each town")
ineq_unempl_distribution

# remove the dataset created to save memory
rm(list=c("geo_ineq", "ineq_unempl_distribution"))

```

### What we have learned

Bordeaux has the highest unemployment rate.
Nantes has the lowest unemployment rate. 
="Aquitaine"            "Languedoc-Roussillon" "Nord-Pas-de-Calais"   "Haute-Normandie"      "Guadeloupe"          
"Corse"                "Nord-Pas-de-Calais"   "Corse"                "Rhône-Alpes"          "Nord-Pas-de-Calais" 

## Commune data

```{r descriptive statistics Commune data}
glimpse(commune)

# change the factor variables to numeric
# aaa = as.numeric(commune$urbanRural)

# Merge geo and commune data to include longitude and latitude in the data.
geo_commune <- cbind.data.frame(geo[unique(geo$CODGEO) %in% unique(commune$CODGEO),],
                                   commune[unique(commune$CODGEO) %in% unique(geo$CODGEO),])


# Adjust scales before plotting
#sc <- scale_colour_gradientn(colours = palette(rainbow(3)), limits=c(min(aaa), max(aaa)))

# Map ratio 
 # urbanrural<-
 #  FraMap +
 #  geom_point(aes(x=geo_commune$longitude, y=geo_commune$latitude,
 #                 colour=aaa),
 #             data=geo_commune, alpha=1, size=0.5) +
 #             sc + labs(color='') + ggtitle("Urban and rural cities")
# urbanrural

# boxplot
#  number of units
n_firms <- length(commune$nb_firms_commerce)

# vector 
Label <- c(rep("Firms", n_firms*4))

# vector representing the variable considered
Variable <- c(rep("Service", n_firms), 
             rep("Industry", n_firms),
             rep("Commerce", n_firms),
             rep("Construction", n_firms))
          
             
# merge these data
firms_sector = cbind.data.frame(Label = Label, 
             value = c(commune$nb_firms_commerce, commune$nb_firms_service, commune$nb_firms_ind, commune$nb_firms_construction),
             Variable = Variable)

# plotting phase
ggplot(data = firms_sector, aes(x=Label, y=value)) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Firms") + ylab("# of firms") + ggtitle("Sector comparison") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_boxplot(geom = "errorbar", width = 0.5)


# the same but excluding outliers
ggplot(data = firms_sector, aes(x=Label, y=value)) +
  scale_y_continuous(limits = quantile(firms_sector$value, c(0, 0.9))) +
  geom_boxplot(aes(fill = Label)) +
  geom_point(aes(y=value, colour=Label), position = position_dodge(width=0.75)) +
  facet_wrap( ~ Variable, scales="free") +
  xlab("Firms") + ylab("# of firms") + 
  ggtitle("Sector comparison excluding the last decile") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_boxplot(geom = "errorbar", width = 0.5)


rm(list = c("geo_commune", "Rank", "Rank10", "aaa"))
```


# Create an unique dataset

Create an unique dataset:
```{r Merge the data and create csv file}

# merging
newDat = merge(firms, population, by="CODGEO")
newDat = merge(newDat, salary, by="CODGEO")
newDat = merge(newDat, geo, by="CODGEO")
newDat = merge(newDat, status_work, by="CODGEO")
newDat = merge(newDat, ineq, by="CODGEO")
newDat = merge(newDat, educ, by="CODGEO")
newDat = merge(newDat, categ_socio, by="CODGEO")
newDat = merge(newDat, commune, by="CODGEO")

# check
names(newDat)
head(newDat)

```

Save/load final dataset (just to avoid running all the code) 
```{r Save/load final dataset, warning=FALSE} 

## UNCOMMENT THESE LINE AND COMMENT THE NEXT ONES TO SAVE THE DATASET
## THEN MOVE THIS FILE IN THE data FOLDER
# #remove unused factor levels on newDat
# for (i in 1:ncol(newDat)){
#   if (is.factor(newDat[,i])){
#     newDat[,i] = droplevels.factor(newDat[,i])
#   }
# }
# # save the NewDat.csv file created (which has to be moved in the data folder)
# write.table(newDat, "newDat.csv", quote = TRUE, row.names=FALSE, col.names = TRUE,
#           fileEncoding = "UTF-8", sep = ",")
 
## LOADING DATA
# load the final dataset
setwd("./data")
newDat <- read.csv("newDat.csv", encoding = "UTF-8", header = TRUE)

# JUST A CHECK, DO NOT NEED TO RUN
# check (changing newDat loaded with newDat2), the only differences are in the order of 10e-15
# identical(newDat, newDat2)
 
```

Spatial plot of the data to spot possible patterns. 
A sub-sample will be retained when needed in order to reduce 
geo-spatial correlations. 
```{r newDat spatial plot} 

# center of France, obtained using: 
# fra_center = as.numeric(geocode("France")) 
fra_center = c(2.213749, 46.227638) 
 
ggmap(get_googlemap(center=fra_center, scale=2, zoom=5), extent="normal") + 
  geom_point(aes(x=longitude, y=latitude), data=newDat, col="orange", alpha=1, size=0.05)  
 
``` 
 

# Unsupervised Learning

## Multi-Dimensional Scaling (MDS)

### Preliminary multivariate data visualisation to better conduct the unsupervised learning  
```{r}

region <- geo$region

# add region variable to the salary dataset
salary_with_dep <- cbind(region, salary)

# extract continuous variables
salary_variables <- salary_with_dep[, 3:34]

#scatterplot to visually see the correlations between variables
pairs(salary_variables[1:8], 
      panel = function (x, y, ...) {
          points(x, y, ...)
          abline(lm(y ~ x), col = "red")
      }, pch = ".", cex = 0.5)

#scatterplot between variables of major interest
pairs(salary_variables[, c("salary_ratio_FvsM", "sal_general", "sal_Females","sal_Males")], 
      panel = function (x, y, ...) {
          points(x, y, ...)
          abline(lm(y ~ x), col = "red")
      }, pch = ".", cex = 0.5)


#scale the data
salary_variables_scaled <- as.data.frame(scale(salary_variables))
#divide sal_general into 4 parts
salary_levels <- with(salary_variables_scaled, 
                  equal.count(sal_general,4))

#plot sal_worker against sal_midManager given 3 partitions of sal_general
plot(xyplot(sal_worker ~ sal_midManager| cut(sal_general, 3), 
            data = salary_variables, 
            layout = c(3, 1), 
            xlab = "Worker Salary", 
            ylab = "Midmanager salary"))

#multidimesnional plot of midManager salary against worker salary, given 4 partitions of F/M salary ratio, and grayscale intensities of general salary
M_F_salary_ratio  <- with(salary_variables, equal.count(salary_ratio_FvsM, 4))
general_salary.ord <- with(salary_variables, rev(order(sal_general)))
salary_variables.ordered <- salary_variables[general_salary.ord,]
sal_general.breaks <- with(salary_variables.ordered, 
                     do.breaks(range(sal_general),50))
salary_variables.ordered$color<-level.colors(salary_variables.ordered$sal_general,
                                   at=sal_general.breaks,
                                   col.regions=grey.colors)
plot(xyplot(sal_worker ~ sal_midManager | M_F_salary_ratio, 
            data = salary_variables.ordered,
            aspect = "iso", 
            groups = color, 
            cex = 1, col = "black",
       panel = function(x, y, groups, ..., subscripts) {
           fill <- groups[subscripts]
           panel.grid(h = -1, v = -1)
           panel.xyplot(x, y, pch = 21, 
                        fill = fill, ...)
       },
       legend =
       list(right =
            list(fun = draw.colorkey,
                 args=list(key=list(col=gray.colors,
                                    at = sal_general.breaks),
                           draw = FALSE))),
            xlab = "Worker Salary", 
            ylab = "Midmanager salary"))


```

#### Preliminary multivariate firms data visualisation to better conduct the unsupervised learning 
```{r}
# add region variable to the salary dataset
firms_with_dep <- cbind(region, firms)

# extract continuous variables
firms_variables <- firms_with_dep[, 4:8]

#scatterplot to visually see the correlations between variables
pairs(firms_variables[, 2:5], 
      panel = function (x, y, ...) {
          points(x, y, ...)
          abline(lm(y ~ x), col = "red")
      }, pch = ".", cex = 0.5)


#scale the data
firms_variables_scaled <- as.data.frame(scale(firms_variables))
#divide firms total into 4 parts
firms_levels <- with(firms_variables_scaled, 
                  equal.count(total,4))


#plot sal_worker against sal_midManager given 3 partitions of sal_general
plot(xyplot(medium ~ large| cut(total, 3), 
            data = firms_variables, 
            layout = c(3, 1), 
            xlab = "medium firms", 
            ylab = "large firms"))


```


### Multi-dimensional scaling (MDS) 
#### Multi-dimensional scaling (MDS) on salary data

```{r}
#Multi-dimensional scaling (MDS) on salary data
# produces a list with covariance matrix
# for each region as component:
salary_var <- tapply(1:nrow(salary_with_dep), salary_with_dep$region, 
                     function(i) var(salary_with_dep[i,3:7]))

#remove regions with null values
salary_var <- salary_var[-c(11, 12, 15, 19, 20, 28)]


#create dataframe with the counts of towns per region
regions_count <- data.frame(table(salary_with_dep$region))
regions_count_sub<-regions_count[!(regions_count$Freq==0 | regions_count$Freq==1),]
#check total sum
#row.names(regions_count_sub) <- 1:nrow(regions_count_sub)
#sum(regions_count_sub$Freq)



# initializes common covariance matrix var:
S <- regions_count_sub$Freq[1] * as.matrix(salary_var[[1]])
# creates common covariance matrix S
s <- 0
for (v in c(1:22)) S <- S + regions_count_sub$Freq[v] * as.matrix(salary_var[[v]]) 
S <- S / 5003

# finds center of each variable (sal_general, etc)
# for each region
salary_cen <- tapply(1:nrow(salary_with_dep), salary_with_dep$region, 
    function(i) apply(salary_with_dep[i,c(3:7)], 2, mean))
salary_cen <- salary_cen[-c(11, 12, 15, 19, 20, 28)]


# create a matrix out of each components, each
# mean measurement for all variables by region
salary_cen <- matrix(unlist(salary_cen), 
    nrow = length(salary_cen), byrow = TRUE)
#salary_cen

# compute the mahalanobis distances:
salary_mah <- apply(salary_cen, 1, 
    function(cen) mahalanobis(salary_cen, cen, S))


# run it on two dimensions
salary_mds <- cmdscale(salary_mah)

# draw a scatterplot of two-dimensional solution
# from classical MDS applied to Mahalanobis
# distances:
lim <- range(salary_mds) * 1.2
plot(salary_mds, xlab = "Coordinate 1", 
     ylab = "Coordinate 2",
     xlim = lim, ylim = lim, type = "n")
text(salary_mds, labels = levels(salary_with_dep$region), 
     cex = 0.5)

```
One can see that Languedoc Roussilon, the worst region in terms of GDP per capita, is further away from the center.

#### Multi-dimensional scaling (MDS) on firms data

```{r}

firms_with_dep <- firms_with_dep[, -9]
firms_with_dep_normalized <- firms_with_dep[, 4:8]/population$total_population
firms_with_dep_normalized <- cbind(firms_with_dep$region, firms_with_dep_normalized)
names(firms_with_dep_normalized)[1] <- "region"
# produces a list with covariance matrix
# for each region as component:
firms_var <- tapply(1:nrow(firms_with_dep_normalized), firms_with_dep_normalized$region, 
                     function(i) var(firms_with_dep_normalized[i,2:6]))

#remove regions with null values
firms_var <- firms_var[-c(11, 12, 15, 19, 20, 28)]


#create dataframe with the counts of towns per region
regions_count <- data.frame(table(firms_with_dep_normalized$region))
regions_count_sub<-regions_count[!(regions_count$Freq==0 | regions_count$Freq==1),]
#check total sum
# row.names(regions_count_sub) <- 1:nrow(regions_count_sub)
# sum(regions_count_sub$Freq)



# initializes common covariance matrix var:
S <- regions_count_sub$Freq[1] * as.matrix(firms_var[[1]])

# creates common covariance matrix S
s <- 0
for (v in c(1:22)) S <- S + (regions_count_sub$Freq[v] - 1) * as.matrix(firms_var[[v]]) 
S <- S / 5003

# finds center of each variable (micro, etc)
# for each region
firms_cen <- tapply(1:nrow(firms_with_dep_normalized), firms_with_dep_normalized$region, 
    function(i) apply(firms_with_dep_normalized[i,-1], 2, mean))
firms_cen <- firms_cen[-c(11, 12, 15, 19, 20, 28)]


# create a matrix out of each components, each
# mean measurement for all variables by region
firms_cen <- matrix(unlist(firms_cen), 
    nrow = length(firms_cen), byrow = TRUE)

# compute the mahalanobis distances:
firms_mah <- apply(firms_cen, 1, 
    function(cen) mahalanobis(firms_cen, cen, S))



# run it on two dimensions
firms_mds <- cmdscale(firms_mah)

# draw a scatterplot of two-dimensional solution
# from classical MDS applied to Mahalanobis
# distances:
lim <- range(firms_mds) * 1.2
plot(firms_mds, xlab = "Coordinate 1", 
     ylab = "Coordinate 2",
     xlim = lim, ylim = lim, type = "n")
text(firms_mds, labels = levels(firms_with_dep_normalized$region), 
     cex = 0.5)

```
Since tha data is normalized according to population size, Haute Normandie and Midi Pyrenees are further away.

#### Multi-Dimensional Scaling for merged dataset

```{r}
nums <- unlist(lapply(commune, is.numeric)) 
commune_selected <- commune[, nums]

commune_selected <- commune_selected[, 2:20]

commune_selected_normalized <- commune_selected[]/population$total_population

unemp_percent <- as.data.frame((ineq$unemp15_64/ineq$pop15_64)*100)
names(unemp_percent)[1] <- "unemp_percent"

firms_variables_normalized <- firms_variables[]/population$total_population

selected_variables <- cbind(firms_variables_normalized, salary_variables, commune_selected_normalized, unemp_percent)

selected_variables_with_dep <- cbind(geo$region, selected_variables)
names(selected_variables_with_dep)[1] <- "region"

pca_selected_variables <- prcomp(selected_variables)
selected_pca_scores<- as.data.frame(pca_selected_variables$x[,1:3])
selected_pca_with_dep <- cbind(geo$region, selected_pca_scores)
names(selected_pca_with_dep)[1] <- "region"
# produces a list with covariance matrix
# for each region as component:
selected_PCA_var <- tapply(1:nrow(selected_pca_with_dep), selected_pca_with_dep$region,
                     function(i) var(selected_pca_with_dep[i,2:4]))


#remove regions with null values
selected_PCA_var <- selected_PCA_var[-c(11, 12, 15, 19, 20, 28)]


#create dataframe with the counts of towns per region
regions_count <- data.frame(table(selected_pca_with_dep$region))
regions_count_sub<-regions_count[!(regions_count$Freq==0 | regions_count$Freq==1),]
#check total sum
# row.names(regions_count_sub) <- 1:nrow(regions_count_sub)
# sum(regions_count_sub$Freq)



# initializes common covariance matrix var:
S <- regions_count_sub$Freq[1] * as.matrix(selected_PCA_var[[1]])

# creates common covariance matrix S
s <- 0
for (v in c(1:22)) S <- S + (regions_count_sub$Freq[v] - 1) * as.matrix(selected_PCA_var[[v]])
S <- S / 5003

# finds center of each variable (micro, etc)
# for each region
selected_cen <- tapply(1:nrow(selected_pca_with_dep), selected_pca_with_dep$region,
    function(i) apply(selected_pca_with_dep[i,-1], 2, mean))
selected_cen <- selected_cen[-c(11, 12, 15, 19, 20, 28)]

# create a matrix out of each components, each
# mean measurement for all variables by region
selected_cen <- matrix(unlist(selected_cen),
    nrow = length(selected_cen), byrow = TRUE)

# compute the mahalanobis distances:
selected_mah <- apply(selected_cen, 1,
    function(cen) mahalanobis(selected_cen, cen, S))


# run it on two dimensions
selected_mds <- cmdscale(selected_mah)

# draw a scatterplot of two-dimensional solution
# from classical MDS applied to Mahalanobis
# distances:
lim <- range(selected_mds) * 1.2
plot(selected_mds, xlab = "Coordinate 1",
     ylab = "Coordinate 2",
     xlim = lim, ylim = lim, type = "n")
text(selected_mds, labels = levels(selected_pca_with_dep$region),
     cex = 0.5)

```
On the merged dataset one can Rhone-Alpes and Cote d'Azur, the two biggest economies after Ile de France.

## PCA

#### PCA on salary data and delta M-F

```{r}

pca_salary <- prcomp(salary_variables)
autoplot(pca_salary, data = salary_with_dep, colour = 'region',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)
plot(pca_salary, type = "l")

#create dataset with M-F differences
salary_var_delta <- data.frame("M-F" = salary_variables$sal_Males -                                                           salary_variables$sal_Females,
                     "Executive M-F" = salary_variables$sal_M_executive -
                                         salary_variables$sal_F_executive,
                    "midManager M-F" = salary_variables$sal_M_midManager -
                                         salary_variables$sal_F_midManager,
                      "employee M-F" = salary_variables$sal_M_employee -
                                         salary_variables$sal_F_employee,
                        "worker M-F" = salary_variables$sal_M_worker -
                                         salary_variables$sal_F_worker,
                         "18-25 M-F" = salary_variables$sal_M_18_25 -
                                         salary_variables$sal_F_18_25,
                         "26-50 M-F" = salary_variables$sal_M_26_50 -
                                         salary_variables$sal_F_26_50,
                       "51 plus M-F" = salary_variables$sal_M_51plus -
                                         salary_variables$sal_F_51plus)
pca_salary_delta <- prcomp(salary_var_delta)
autoplot(pca_salary_delta, data = salary_with_dep, colour = 'region',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)
plot(pca_salary_delta, type = "l")

```
There are three main directions of eigenvectors
contributing to principal components' scores: variables describing salaries of ordinary staff
across different age categories, variables describing mid-manager and manager salaries, and
salaries of executives.

PCA on delta M-F revealed three main directions
of eigenvectors contributing to scores: male-female salary differences of mid-managers,
executives, and 51 and older aged people.


#### PCA on firms data

```{r}
firms_variables_normalized <- firms_variables[, -1]/population$total_population

pca_firms <- prcomp(firms_variables_normalized)
autoplot(pca_firms, data = firms_with_dep, colour = 'region',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)
plot(pca_firms, type = "l")

```
Since there is a strong
multicollinearity among the variables of the firms data, the conducted PCA results
in the variable with considerably higher values, i.e. very small firms, to form a main
eigenvector, leaving firms of other sizes to form a separate group of eigenvectors.


#### PCA on commune dataset

```{r}
commune_trimmed <- commune[,c("nb_firms_service","nb_firms_commerce","nb_firms_construction")]
commune_trimmed$active_empl_prop <- commune$nb_active_employees/population$total_population
commune_trimmed$nb_firms_service <- commune$nb_firms_service/population$total_population
commune_trimmed$nb_firms_commerce <- commune$nb_firms_commerce/population$total_population
commune_trimmed$nb_firms_construction <- commune$nb_firms_construction/population$total_population
#commune_trimmed <- as.data.frame(scale(commune_trimmed))

names(which(sapply(commune_trimmed, anyNA)))
str(commune_trimmed)


commune_pca <- prcomp(commune_trimmed)
commune_pca
commune_pca_scores<- as.data.frame(commune_pca$x[,1:3])

autoplot(commune_pca, data = geo, colour = 'region',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)
plot(commune_pca, type = "l")

# plot3d(commune_pca[,1:2], col=cluster$cluster+1)
scatter3d(x = commune_pca_scores$PC1, y = commune_pca_scores$PC2, z = commune_pca_scores$PC3, groups = as.factor(kmeans(commune_pca_scores, 
                  centers = 6)$cluster),
          grid = FALSE, surface = FALSE)

```
PCA conducted
on the commune dataset revealed two main directions of eigenvectors:
proportion of active employers in each town, and industry related variables.

#### PCA on selected variables

```{r}
nums <- unlist(lapply(commune, is.numeric)) 
commune_selected <- commune[, nums]

commune_selected <- commune_selected[, 2:20]

commune_selected_normalized <- commune_selected[]/population$total_population

unemp_percent <- as.data.frame((ineq$unemp15_64/ineq$pop15_64)*100)
names(unemp_percent)[1] <- "unemp_percent"

selected_variables <- cbind(firms_variables_normalized, salary_variables, commune_selected_normalized, unemp_percent)

pca_selected_variables <- prcomp(selected_variables)
autoplot(pca_selected_variables, data = geo, colour = 'region',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)
plot(pca_selected_variables, type = "l")

```
regional GDP is a striking important
factor, comprising a distinctly oriented eigenvector. Another distinct eigenvector, though less
substantial, is the unemployment rate.


## Cluster Analysis

#### Cluster Analysis on salary
```{r}

#create dataframe with first two principal components scores as variables
salary_pca_scores<- as.data.frame(pca_salary$x[,1:2])
salary_pca_scores <- as.data.frame(scale(salary_pca_scores))
sapply(salary_pca_scores, var)

 
# K-means clustering
n <- nrow(salary_pca_scores)
wss <- rep(0, 22)
wss[1] <- (n - 1) * sum(sapply(salary_pca_scores, var))
for (i in 2:22)
    wss[i] <- sum(kmeans(salary_pca_scores,
                         centers = i)$withinss)
#we can see from the plot that optimal number of cluster is 5
plot(1:22, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

#plotting in PC space shows us that there are indeed distinguishable 5 groups 
plot(salary_pca_scores, 
     pch = kmeans(salary_pca_scores, 
                  centers = 5)$cluster)

#now we have to see whether these groups correspond to geographical regions...

```
#### clustering salary delta M-F
```{r}
salary_delta_pca_scores <- as.data.frame(pca_salary_delta$x[,1:3])

n <- nrow(salary_delta_pca_scores)
wss <- rep(0, 15)
wss[1] <- (n - 1) * sum(sapply(salary_delta_pca_scores, var))
for (i in 2:15)
    wss[i] <- sum(kmeans(salary_delta_pca_scores,
                         centers = i)$withinss)
#we can see from the plot that optimal number of cluster is 5
plot(1:15, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")


#plots
scatter3d(x = salary_delta_pca_scores$PC1, y = salary_delta_pca_scores$PC2, z = salary_delta_pca_scores$PC3, groups = as.factor(kmeans(salary_delta_pca_scores, 
                  centers = 6)$cluster),
          grid = FALSE, surface = FALSE)

clusters <- kmeans(salary_delta_pca_scores, 
                  centers = 6)

FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(clusters$cluster)),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on male-female salary difference")
salary_geo


salary_delta_cluster <- cbind(salary_var_delta, clusters$cluster)
names(salary_delta_cluster)[9] <- "cluster"
sal_delta_by_cluster <- aggregate(salary_delta_cluster$M.F, by=list(salary_delta_cluster$cluster), FUN=mean)[2]
names(sal_delta_by_cluster)[1] <- "Salary_difference"

FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= salary_variables$salary_ratio_FvsM),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Female-Male salary ratio")
salary_geo

```

#### Finding towns with large difference in male-female salaries
```{r}
#detecting extreme cities in PCA
quantiles<-tapply(salary_delta_pca_scores$PC1,firms$town,quantile)
minq <- sapply(firms$town, function(x) quantiles[[x]]["25%"])
maxq <- sapply(firms$town, function(x) quantiles[[x]]["75%"])


towns <- as.data.frame(firms$town)
k <- salary_delta_pca_scores[which(salary_delta_pca_scores$PC1<minq | salary_delta_pca_scores$PC1>maxq), ]

#find towns with largedifference 
extreme_town_list <- towns[which(salary_delta_pca_scores$PC1<minq | salary_delta_pca_scores$PC1>maxq), ]

extreme_towns <- as.data.frame(extreme_town_list)


#detecting extreme cities in salary delta
quantiles<-tapply(salary_var_delta$M.F,firms$town,quantile)
minq <- sapply(firms$town, function(x) quantiles[[x]]["25%"])
maxq <- sapply(firms$town, function(x) quantiles[[x]]["75%"])

town <- towns[which(salary_var_delta$M.F<minq | 
                                    salary_var_delta$M.F>maxq), ]

extreme_towns_m_f <- as.data.frame(town)

geo <- as.data.frame(cbind(firms$town, geo))
colnames(geo)[colnames(geo)=="firms$town"] <- "town"
geo_extreme <- subset(geo, town %in% extreme_towns_m_f$town)
geo_extreme <- geo_extreme[, c("town", "latitude", "longitude")]

FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo_extreme$longitude, y=geo_extreme$latitude),
             data=geo_extreme, col="red", alpha=0.5, size=0.5) + labs(color='') + ggtitle("Towns with large difference in male-female salaries")
salary_geo

#As you can see, the towns with large salry differences are concetrated around large cities such as Paris and Lyon.

```
#### clustering towns based on unemployed %
```{r}
unemp_percent <- as.data.frame((ineq$unemp15_64/ineq$pop15_64)*100)
names(unemp_percent)[1] <- "unemp_percent"

n <- nrow(unemp_percent)
wss <- rep(0, 15)
wss[1] <- (n - 1) * sum(sapply(unemp_percent, var))
for (i in 2:15)
    wss[i] <- sum(kmeans(unemp_percent,
                         centers = i)$withinss)
#we can see from the plot that optimal number of cluster is 6
plot(1:15, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

clusters <- kmeans(unemp_percent, 6)

unemp_percent <- cbind(unemp_percent, clusters$cluster)
names(unemp_percent)[2] <- "cluster"
unemp_by_cluster <- aggregate(unemp_percent$unemp_percent, by=list(unemp_percent$cluster), FUN=mean)[2]
names(unemp_by_cluster)[1] <- "unemp_rate(%)" 

FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(clusters$cluster), size = unemp_percent$unemp_percent), data=geo, alpha=0.5, size=0.5) + scale_fill_manual(name = "Unemployment rate", values = unemp_by_cluster$`unemp_rate(%)`) + labs(color='') + ggtitle("Clusters of towns based on unemployment rate")
salary_geo



```

#### Clustering based on Locational Gini Coefficient
```{r}
industry <- commune[, c("nb_firms_service", "nb_firms_commerce", "nb_firms_construction")]

lq_service <- (industry$nb_firms_service/(industry$nb_firms_service+industry$nb_firms_commerce+industry$nb_firms_construction))/(sum(industry$nb_firms_service)/sum(industry))
lq_service <- as.data.frame((lq_service))


lq_commerce <- (industry$nb_firms_commerce/(industry$nb_firms_service+industry$nb_firms_commerce+industry$nb_firms_construction))/(sum(industry$nb_firms_commerce)/sum(industry))
lq_commerce <- as.data.frame((lq_commerce))

lq_construction <- (industry$nb_firms_construction/(industry$nb_firms_service+industry$nb_firms_commerce+industry$nb_firms_construction))/(sum(industry$nb_firms_construction)/sum(industry))
lq_construction <- as.data.frame((lq_construction))

industry_lq <- cbind(industry, lq_service, lq_commerce, lq_construction)
#create dataframe with the counts of towns per region
regions_count <- data.frame(table(salary_with_dep$region))
regions_count_sub<-regions_count[!(regions_count$Freq==0 | regions_count$Freq==1),]
#check total sum
row.names(regions_count_sub) <- 1:nrow(regions_count_sub)
sum(regions_count_sub$Freq)

industry_lq <- cbind(firms$town, industry_lq)
names(industry_lq)[1] <- "town" 
names(industry_lq)[5] <- "lq_service"
names(industry_lq)[6] <- "lq_commerce"
names(industry_lq)[7] <- "lq_construction"

dm <- dist(industry_lq[, 5:7])
#round to two decimals
dm <- round(dm, 2)
 
# K-means clustering
n <- nrow(industry_lq)
wss <- rep(0, 22)
wss[1] <- (n - 1) * sum(sapply(industry_lq[, 5:7], var))
for (i in 2:22)
    wss[i] <- sum(kmeans(industry_lq[, 5:7],
                         centers = i)$withinss)
#we can see from the plot that optimal number of cluster is 5
plot(1:22, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

clusters <- kmeans(industry_lq[, 5:7], 5)

#Mapping the towns based on the locational Gini coefficient
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(clusters$cluster)),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on locational gini coefficient")
salary_geo


```

#### apping the towns based on the intensity of locational Gini coefficient
```{r}
#Mapping the towns based on the intensity of locational Gini coefficient of the service industry
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= industry_lq$lq_service),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Towns based on service locational gini coefficient")
salary_geo

#Mapping the towns based on the intensity of locational Gini coefficient of the commerce industry
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= industry_lq$lq_commerce),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Towns based on commerce locational gini coefficient")
salary_geo

#Mapping the towns based on the intensity of locational Gini coefficient of the commerce industry
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= industry_lq$lq_construction),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Towns based on construction locational gini coefficient")
salary_geo

```


#### Salary clusters 
```{r}
#create distance matrix for hclust
dm <- dist(salary_variables)
#round to two decimals
dm <- round(dm, 2)
 
# K-means clustering
n <- nrow(salary_variables)
wss <- rep(0, 22)
wss[1] <- (n - 1) * sum(sapply(salary_variables, var))
for (i in 2:22)
    wss[i] <- sum(kmeans(salary_variables,
                         centers = i)$withinss)
#we can see from the plot that optimal number of cluster is 5
plot(1:22, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

#now we have to see whether these groups correspond to geographical regions...
```


#### Mapping the salary clusters
```{r}
 
#now we have to see whether these groups correspond to geographical regions...
clusters <- kmeans(salary_variables, 6)
salary_with_dep$cluster <- clusters$cluster
geo$cluster <- clusters$cluster
 
FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(geo$cluster)),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on salary")
salary_geo

salary_cl <- cbind(salary_variables, clusters$cluster)
names(salary_cl)[33] <- "cluster"
salary_cluster <- aggregate(salary_cl$sal_general, by=list(salary_cl$cluster), FUN=mean)[2]
names(salary_cluster)[1] <- "salary" 

 
```


#### Clustering on selected variables
```{r}
selected_pca_scores<- as.data.frame(pca_selected_variables$x[,1:3])

# K-means clustering
n <- nrow(selected_variables)
wss <- rep(0, 22)
wss[1] <- (n - 1) * sum(sapply(selected_variables, var))
for (i in 2:22)
    wss[i] <- sum(kmeans(selected_variables,
                         centers = i)$withinss)
#we can see from the plot that optimal number of cluster is 7
plot(1:22, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

clusters <- kmeans(selected_variables, centers = 7)
scatter3d(x = selected_pca_scores$PC1, y = selected_pca_scores$PC2, z = selected_pca_scores$PC3, groups = as.factor(clusters$cluster),
          grid = FALSE, surface = FALSE)

FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")
 
# Plot "Distribution of total population for each town" 
salary_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(clusters$cluster)),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on selected variables")
salary_geo


```

#### Clustering in commune
```{r}
#K means

n <- nrow(commune_trimmed)
wss <- rep(0, 15)
wss[1] <- (n - 1) * sum(sapply(commune_trimmed, var))
for (i in 2:15)
    wss[i] <- sum(kmeans(commune_trimmed,
                         centers = i)$withinss)
#we can see from the plot that optimal number of cluster is 11
plot(1:15, wss, type = "b",
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

```


#### Mapping the new clusters
```{r}
n <- nrow(commune_trimmed)
wss <- rep(0, 22)
wss[1] <- (n - 1) * sum(sapply(commune_trimmed$active_empl_prop, var))
for (i in 2:22)
    wss[i] <- sum(kmeans(commune_trimmed$active_empl_prop,
                         centers = i)$withinss)
#we can see from the plot that optimal number of cluster is 7
plot(1:22, wss, type = "b", 
     xlab = "Number of groups",
     ylab = "Within groups sum of squares")

#now we have to see whether these groups correspond to geographical regions...

FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")

# Plot "Distribution of total population for each town"
newDat_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= factor(kmeans(commune_trimmed$active_empl_prop, 
                  centers = 6)$cluster)),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on active employees proportion")
newDat_geo

FraMap = ggmap(get_googlemap(center=fra_center, scale=2, zoom=6), extent="normal")

# Plot "Distribution of total population for each town"
newDat_geo <-
  FraMap +
  geom_point(aes(x=geo$longitude, y=geo$latitude, colour= commune_trimmed$active_empl_prop),
             data=geo, alpha=0.5, size=0.5) + labs(color='') + ggtitle("Clusters of towns based on active employees")
newDat_geo
```





# Supervised Learning

## ANOVA for salary

First the distribution of salary for both genders is considered, in original and log-scale
```{r hist salary for gender}

# original data
# male
ggplot(data=data.frame(newDat$sal_Males), aes(newDat$sal_Males)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 40) +
  geom_density(col="black") +
  labs(x="Male salaries", y="Density") +
  ggtitle("Distribution of salary for males") +
  theme(plot.title = element_text(hjust = 0.5))
# females
ggplot(data=data.frame(newDat$sal_Females), aes(newDat$sal_Females)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 40) +
  geom_density(col="black") +
  labs(x="Female salaries", y="Density") +
  ggtitle("Distribution of salary for females") +
  theme(plot.title = element_text(hjust = 0.5))

# log data
# male
ggplot(data=data.frame(log10(newDat$sal_Males)), aes(log10(newDat$sal_Males))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 40) +
  geom_density(col="black") +
  labs(x="Log10 male salaries", y="Density") +
  ggtitle("Distribution of salary for males in log scale") +
  theme(plot.title = element_text(hjust = 0.5))
# females
ggplot(data=data.frame(log10(newDat$sal_Females)), aes(log10(newDat$sal_Females))) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 40) +
  geom_density(col="black") +
  labs(x="Log10 female salaries", y="Density") +
  ggtitle("Distribution of salary for females in log scale") +
  theme(plot.title = element_text(hjust = 0.5))

# hist(c(log10(newDat$sal_Males), log10(newDat$sal_Females)), 30)

```

It is clear that the normality assumption, cannot hold.

At the same time, it is shown
how to check if the difference in the mean in salary 
between males and females using a t-test 
```{r salary difference t test}

t.test(log10(newDat$sal_Males), log10(newDat$sal_Females), alternative = c("two.sided"))

```

This solution can also be otbained fitting a linear model with a dummy variable catching the sex effect,
which corresponds to a one-way ANOVA model
```{r salary difference t test as an ANOVA}

# create the reponse attaching the salary for each sex
sal_y = c(log10(newDat$sal_Males), log10(newDat$sal_Females))
# create the dummy variable controlling for the corresponding sex
dummy = c(rep(1, length(newDat$sal_Males)), rep(0, length(newDat$sal_Females)))

# the t value is exactly the same as the one obtained in the previous t-test
sal_ANOVA = lm(sal_y ~ dummy)
summary(sal_ANOVA)

# residuals plot
plot(sal_ANOVA$residuals)
abline (0, 0, col = "red")

```

To improve such analysis some more aspects are now considered.
In particualar, the model considered in an ANOVA, where the
predictors take in account each gender for different age and job levels,
plus their interaction terms.

First of all the dataset is created and a subsample of it is considered, 
in order to avoid strong correlation among units
```{r salary construct full ANOVA model}

set.seed(300)

# create response variable
sal_y = c(newDat$sal_M_18_25, newDat$sal_M_26_50, newDat$sal_M_51plus,
          newDat$sal_M_executive, newDat$sal_M_midManager, newDat$sal_M_employee, newDat$sal_M_worker,
          newDat$sal_F_18_25, newDat$sal_F_26_50, newDat$sal_F_51plus,
          newDat$sal_F_executive, newDat$sal_F_midManager, newDat$sal_F_employee, newDat$sal_F_worker)

n_sal_y = length(sal_y)             # length response variable
n_cat = length(newDat$sal_M_18_25)  # length of each category (i.e., original vectors)

# create sex dummy variable, 1 for males and 0 for females
sal_sex = rep(0, n_sal_y)   # full regressors
sal_sex[1:n_sal_y/2] = 1    # assign males

# create age dummy variables, 18-25 years old is the base case
sal_age = cbind(rep(0, n_sal_y), rep(0, n_sal_y)) # full regressors
# 26-50 y.o.
sal_age[(n_cat+1):(n_cat*2), 1] = 1     # males
sal_age[(n_cat*8+1):(n_cat*9), 1] = 1   # females
# 51+ y.o.
sal_age[(n_cat*2+1):(n_cat*3), 2] = 1   # males
sal_age[(n_cat*9+1):(n_cat*10), 2] = 1  # females

# create job type dummy variables, worker is the base case
sal_job = cbind(rep(0, n_sal_y), rep(0, n_sal_y), rep(0, n_sal_y)) # full regressors
# executives
sal_job[(n_cat*3+1):(n_cat*4), 1] = 1     # males
sal_job[(n_cat*10+1):(n_cat*11), 1] = 1   # females
# middle managers
sal_job[(n_cat*4+1):(n_cat*5), 2] = 1     # males
sal_job[(n_cat*11+1):(n_cat*12), 2] = 1   # females
# employee
sal_job[(n_cat*5+1):(n_cat*6), 3] =   1   # males
sal_job[(n_cat*12+1):(n_cat*13), 3] = 1   # females

# final data set
data_ANOVA = cbind.data.frame(response = sal_y, sex = sal_sex, age = sal_age, job = sal_job)
# names(data_ANOVA)
# show regressors' shape
imagemat(data_ANOVA[,-1], xaxt = "n", main = "Factors for ANOVA")
axis(1, at=1:6, labels=c("Sex", "Age 26-50", "Age 51+", "Execut.", "Mid.Man.", "Empl."))
box()
# sub sample to avoid correlation, (unbalanced dataset)
# set.seed(20)
subs = sample(1:n_sal_y, size = round(n_sal_y*0.1))
data_ANOVA = data_ANOVA[subs,]

# show randomized data
imagemat(data_ANOVA[,-1], xaxt = "n", main = "Factors for ANOVA after sub-sampling")
axis(1, at=1:6, labels=c("Sex", "Age 26-50", "Age 51+", "Execut.", "Mid.Man.", "Empl."))
box()

```

Run the ANOVA model, tranforming the response variable using Box-Cox transformation
```{r salary ANOVA model}

# plot the transformed response variable (suggested by Box-Cox transformation)
sal_y = data_ANOVA[,1]^-1
ggplot(data=data.frame(data_ANOVA[,1]), aes(sal_y)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 60) +
  geom_density(col="black") +
  labs(x="salary^-1", y="Density") +
  ggtitle("Transformation of salary found using Box-Cox transformation") +
  theme(plot.title = element_text(hjust = 0.5))

# take the identified subsample for dummy variables
sex = sal_sex[subs]
age = sal_age[subs,]
job = sal_job[subs,]
# ANOVA model
sal_ANOVA = lm(sal_y ~ sex + age + job + sex:age + sex:job)
summary(sal_ANOVA)
anova(sal_ANOVA)
# aov(sal_y ~ sex + age + job + sex:age + sex:job)
plot(sal_ANOVA, 1)

# box-cox transformation suggested to use y^-1
boxcox(sal_ANOVA)
title("Box-Cox transformation of the response")

# comparison a full model with a reduced one excluding one interaction term
sal_ANOVA2 = lm(sal_y ~ sex + age + job + sex:age)
anova(sal_ANOVA, sal_ANOVA2)

```


To check if the model can be sparsified, considering groups of variables
as a whole, the group-Lasso is applied.
It shows that the interaction term sex:job is not so important.
```{r salary group-Lasso}

# define the considered model, which includes interaction terms
form <- model.matrix(sal_y ~ (sex + age + job)^2)
# remove interaction age:job
# colnames(form)
form = form[,2:12]

# define groups' labels
grp = c(1,2,2,3,3,3,4,4,5,5,5)

#Group-Lasso with 10-folds Cross Validation and LAD cost function
fit.cv=cv.gglasso(x=form, y=sal_y, group=grp, nfolds=10, pred.loss = "L2")

# plot CV for Lambda
plot(fit.cv) 
mtext(expression("10-folds CV for group-Lasso in ANOVA"), outer=TRUE,  cex=1, line=-2.2)

# extract coefficients
# coef.mat = fit.cv$gglasso.fit$beta
# plot separately each coefficient
cols = brewer.pal(5,name="Set1")
plot(fit.cv$gglasso.fit, col = cols, lwd = 3, main = "Coefficients vs.Lambda values")
legend('bottomright', legend = paste("Group", 1:5), col=cols[1:5], cex = 0.65, pch = 1, lwd=4, bty ="n")

# plot coefficients' norms (not working), doing it manually
# https://www.rdocumentation.org/packages/gglasso/versions/1.3/topics/plot.gglasso
# https://code.google.com/archive/p/gglasso/issues
  # plot(fit.cv$gglasso.fit, group=TRUE, col =  cols, lwd = 2)
  # legend('bottomright', legend = paste("Group", 1:5), col=cols[1:5], cex = 0.65, pch = 1, lwd=4, bty ="n")
# initialize norms
coeffNorm = matrix(0, nrow =  max(fit.cv$gglasso.fit$group), length(fit.cv$gglasso.fit$lambda))
# compute norms group-wise
for (i in 1:max(fit.cv$gglasso.fit$group)){ 
  ind = fit.cv$gglasso.fit$group == i
  coeff = fit.cv$gglasso.fit$beta[ind,]
  if (sum(ind)>1){
    coeffNorm[i,] = apply(coeff, 2, function(x){sqrt(sum(x^2))})
  } else coeffNorm[i,] = coeff
} 
# coefficients' norm plot
plot(log(fit.cv$gglasso.fit$lambda), coeffNorm[1,], main="Coefficients' norm vs Lambda",
     ylab="Coefficients' norm",xlab="Ln Lambda values",
     col=cols[1],
     # xlim=c(-1,100),
     ylim=c(min(coeffNorm), max(coeffNorm)),
     type="l",lwd=4)
for (j in 2:nrow(coeffNorm)){
  lines(log(fit.cv$gglasso.fit$lambda), coeffNorm[j,], type="l",lwd=4, col=cols[j])
}
abline(0,0, lwd=3)
# plot legend once
# grid()
par(new=T)
legend('topright', legend = paste("group ", 1:5), lty=1, col=cols[1:5], cex = 0.6,lwd=2)

#Pick the best Lambda
lmbda=fit.cv$lambda.1se
(coefs=coef.gglasso(object=fit.cv$gglasso.fit,s=lmbda))
#At best lambda get coefficients and fitted values
plt=sal_y-predict.gglasso(object=fit.cv$gglasso.fit,newx=form,s=lmbda,type='link')
plot(plt, ylab="residuals", xlab="index", main="Plot of residuals")
abline(0, 0, col= "red")

```

From this study it is possible to conclude that each factor considered 
is relevant to determine the salary level. 

The least important factor results the interaction between sex and job,
which is not statistically significant using group-Lasso.
This means that there is no gender discrimination
between various jobs, even if gender discrimination exists 
in general and for different age groups.

In the next section this topic will be investigated further.


## Difference in salary: men vs. females

The focus of this section is on the on the difference on salary 
between males and females in various towns.
Some classical approaches will be compared with robust ones,
to diagnose the possible negative effect of the presence of noise or outliers.
It is motivated by the assumptions that not all the towns come from 
the same data genereting process.

```{r functions linear model}

# These functions are used to create the plots used in the next chunks

# PLOTS FOR interactive regression diagnostic plots
RegressionPlots <- function(fit, textlabels, robust=F, out = NULL){
  
  # number of units
  n = length(fit$fitted.values)
  
  # original response
  y = fit$y
  
  # Extract fitted values from lm() object
  Fitted.Values <-  fit$fitted.values
  
  # Extract residuals from lm() object
  Residuals <-  fit$residuals
  
  if (robust == F){
    # Extract standardized residuals from lm() object
    Standardized.Residuals <- MASS::stdres(fit)  
    
    # Calculate Leverage
    Leverage <- lm.influence(fit)$hat
    
    # text for the labels of the plots
    tt = "OLS"
  } else{
    # standardized residuals based on a robust scale estimate
    Standardized.Residuals <- fit$residuals/fit$scale
    
    # get the robust distances based on mahalanobis distance from the robust output 
    # (it has to be specified in the call to lmrob using: control = lmrob.control(compute.rd = T))
    Leverage <- fit$MD
    
    # text for the labels of the plots
    tt = "Robust"
  }
  
  # Extract fitted values for lm() object
  Theoretical.Quantiles <- qqnorm(Residuals, plot.it = F)$x
  
  # Create data frame 
  # Will be used as input to plotly
  regMat <- data.frame(Fitted.Values, 
                       Residuals, 
                       Standardized.Residuals, 
                       Theoretical.Quantiles,
                       Leverage, 
                       textlabels)
  
  # Plot using Plotly
  
  # text info
  t <- list(
    family = "sans serif",
    size = 14,
    color = toRGB("grey50"))
  
  
  # Plot 1: Fitted vs Residuals
  # For scatter plot smoother
  LOESS <- loess.smooth(Fitted.Values, Residuals)
  
  # plt1 <- regMat %>% 
  #   plot_ly(x = Fitted.Values, y = Residuals, 
  #           type = "scatter", mode = "markers", hoverinfo = "text",
  #           color = out, name = "Data", marker = list(size = 10, opacity = 0.5),
  #           text = paste('</br> Unit: ', 1:n,
  #                        '</br> Town: ', as.character(textlabels),
  #                        '</br> x: ', Fitted.Values,
  #                        '</br> y: ', Residuals)) %>% 
  #   
  #   add_trace(x = LOESS$x, y = LOESS$y, type = "scatter", mode = "lines", name = "Smooth",
  #             inherit = F, line = list(width = 2)) %>% 
  # 
  #   layout(title = paste(tt, " Residuals vs Fitted Values"), plot_bgcolor = "#e6e6e6",
  #          xaxis = list(title = "Fitted Values"),
  #          yaxis = list(title = "Residuals"))
  
  # Plot 2: Fitted vs Standardized Residuals
  # For scatter plot smoother
  LOESS <- loess.smooth(Fitted.Values, Standardized.Residuals)
  
  plt2 <- regMat %>% 
    plot_ly(x = Fitted.Values, y = Standardized.Residuals, 
            type = "scatter", mode = "markers", hoverinfo = "text",
            color = out, name = "Data", marker = list(size = 10, opacity = 0.5),
            text = paste('</br> Unit: ', 1:n,
                         '</br> Town: ', as.character(textlabels),
                         '</br> x: ', Fitted.Values,
                         '</br> y: ', Standardized.Residuals)) %>% 
    
    add_trace(x = LOESS$x, y = LOESS$y, type = "scatter", mode = "lines", name = "Smooth",
              inherit = F, line = list(width = 2)) %>% 
    
    layout(title = paste(tt, " Standardized residuals vs Fitted Values"), plot_bgcolor = "#e6e6e6",
           xaxis = list(title = "Fitted Values"),
           yaxis = list(title = "Standardized residuals"))  

  # Plot 3: Residuals index
  plt3 <- regMat %>% 
    plot_ly(x = 1:n, y = Standardized.Residuals, 
            type = "scatter", mode = "markers", hoverinfo = "text",
            color = out, name = "Data", marker = list(size = 10, opacity = 0.5),
           text = paste('</br> Unit: ', 1:n,
                         '</br> Town: ', as.character(textlabels),
                         '</br> x: ', 1:n,
                         '</br> y: ', Standardized.Residuals)) %>% 
    
    add_segments(x = -5, xend = n+5, y = 0, yend = 0, mode = "line", inherit = F, 
                 name = "Null line", line = list(width = 2)) %>%
    
    layout(title = paste(tt, " Standardized Residuals' index"), plot_bgcolor = "#e6e6e6", 
             xaxis = list(title = "Index"),
             yaxis = list(title = "Standardized.Residuals"))

  # Plot 4: QQ Pot
  plt4 <- regMat %>% 
    plot_ly(x = Theoretical.Quantiles, y = Standardized.Residuals, 
            type = "scatter", mode = "markers", hoverinfo = "text", 
            color = out, name = "Data", marker = list(size = 10, opacity = 0.5),
           text = paste('</br> Unit: ', 1:n,
                         '</br> Town: ', as.character(textlabels),
                         '</br> x: ', Theoretical.Quantiles,
                         '</br> y: ', Standardized.Residuals)) %>% 
    
    add_trace(x = Theoretical.Quantiles, y = Theoretical.Quantiles, type = "scatter", 
              mode = "lines", name = "Theoretical", line = list(width = 2), 
              inherit = F) %>%
    
    layout(title = paste(tt, " Q-Q Plot"), plot_bgcolor = "#e6e6e6",
             xaxis = list(title = "Normal theoretical quantiles"),
             yaxis = list(title = "Data quantiles"))
  
  # Plot5: Residuals vs Leverage
  # For scatter plot smoother
  LOESS <- loess.smooth(Leverage, Residuals)
  
  plt5 <- regMat %>% 
    plot_ly(x = Leverage, y = Residuals, 
            type = "scatter", mode = "markers", hoverinfo = "text", 
            color = out, name = "Data", marker = list(size = 10, opacity = 0.5), 
           text = paste('</br> Unit: ', 1:n,
                         '</br> Town: ', as.character(textlabels),
                         '</br> x: ', Leverage,
                         '</br> y: ', Residuals)) %>% 
    
    add_trace(x = LOESS$x, y = LOESS$y, type = "scatter", mode = "lines", name = "Smooth",
              line = list(width = 2), inherit = F) %>%
    
    layout(title = paste(tt, " Leverage vs Residuals"), plot_bgcolor = "#e6e6e6", 
             xaxis = list(title = "Leverage values"),
             yaxis = list(title = "Residuals"))  

  # Plot 6: Fitted vs response
  # For scatter plot smoother
  LOESS <- loess.smooth(Fitted.Values, y)
  
  plt6 <- regMat %>% 
    plot_ly(x = Fitted.Values, y =y, 
            type = "scatter", mode = "markers", hoverinfo = "text",
            color = out, name = "Data", marker = list(size = 10, opacity = 0.5),
           text = paste('</br> Unit: ', 1:n,
                         '</br> Town: ', as.character(textlabels),
                         '</br> x: ', Fitted.Values,
                         '</br> y: ', y)) %>% 
    
    add_trace(x = LOESS$x, y = LOESS$y, type = "scatter", mode = "lines", name = "Smooth",
              inherit = F, line = list(width = 2)) %>% 
  
    layout(title = paste(tt, " Fitted Values vs response"), plot_bgcolor = "#e6e6e6",
             xaxis = list(title = "Fitted Values"),
             yaxis = list(title = "Response values"))

  # # Plot 7: MISSING: sqrt(abs(Residuals))
  # # For scatter plot smoother
  # LOESS2 <- loess.smooth(Fitted.Values, Root.Residuals)
  # 
  # plt3 <- regMat %>% 
  #   plot_ly(x = Fitted.Values, y = Root.Residuals, 
  #           type = "scatter", mode = "markers", hoverinfo = "text", 
  #           name = "Data", marker = list(size = 10, opacity = 0.5),
  #           text = paste('town: ', as.character(textlabels))) %>%
  #   
  #   add_trace(x = LOESS2$x, y = LOESS2$y, type = "scatter", mode = "lines", name = "Smooth",
  #             line = list(width = 2), inherit = F) %>% 
  #   
  #   layout(title = "Scale Location", plot_bgcolor = "#e6e6e6")
  plt = list(plt2, plt3, plt4, plt5, plt6) # plt1, 
  return(plt)

  # not working...to add in all plots
  # to avoid error with colorPalette<3
  # suppressWarnings(warning("RColorBrewer::brewer.pal"))
}

#
# PLOTS FOR BEST SUBSET REGRESSION
#
plot.regsubsets2 <-  
  function (x, labels = obj$xnames, main = NULL, scale = c("bic",  
     "Cp", "adjr2", "r2"), col = gray(seq(0, 0.9, length = 10)), ...)  
  { 
    obj <- x 
    lsum <- summary(obj) 
    par(mar = c(7, 5, 6, 3) + 0.1) 
    nmodels <- length(lsum$rsq) 
    np <- obj$np 
    propscale <- FALSE 
    sscale <- pmatch(scale[1], c("bic", "Cp", "adjr2", "r2"),  
                     nomatch = 0) 
    if (sscale == 0)  
      stop(paste("Unrecognised scale=", scale)) 
    if (propscale)  
      stop(paste("Proportional scaling only for probabilities")) 
    yscale <- switch(sscale, lsum$bic, lsum$cp, lsum$adjr2, lsum$rsq) 
    up <- switch(sscale, -1, -1, 1, 1) 
    index <- order(yscale * up) 
    colorscale <- switch(sscale, yscale, yscale, -log(pmax(yscale,  
                                                           1e-04)), -log(pmax(yscale, 1e-04))) 
    image(z = t(ifelse(lsum$which[index, ], colorscale[index],  
                       NA + max(colorscale) * 1.5)), xaxt = "n", yaxt = "n",  
          x = (1:np), y = 1:nmodels, xlab = "", ylab = scale[1],  
          col = col) 
    laspar <- par("las") 
    on.exit(par(las = laspar)) 
    par(las = 2) 
    axis(1, at = 1:np, labels = labels, ...) # I modified this line 
    axis(2, at = 1:nmodels, labels = signif(yscale[index], 2), ...) 
    # axis(2,cex.axis=0.01)
    if (!is.null(main))  
      title(main = main) 
    box() 
    invisible(NULL)
  } 


```


The considered model is built using all the variables that could explain the gap in salary. 
Some of them refers to the difference of some quantity between males and females for each town 
(e.g., diploma, univeristy, independent workers, etc.).
The final model consist of 43 variables and 5025 units.

Additionally, some variables which would have had a huge positive impact on the prediction
of the response variable have non been considered, because of their strong relation
(e.g., gap in salary between males and females for various ages or occupations)
```{r linear model specification for delta M-F salary, fig.keep='all'}

# response variable
# y = log10(newDat$sal_Males - newDat$sal_Females)
y = newDat$sal_Males - newDat$sal_Females

# original response variable histogram
ggplot(data=as.data.frame(y), aes(y)) +
  geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 35) +
  geom_density(col="black") +
  labs(x="Male-Female salary", y="Density") +
  ggtitle("Original response variable") +
  theme(plot.title = element_text(hjust = 0.5))

x = cbind.data.frame(pop = newDat$total_population, 
                     firms = newDat$total,
                     ratio_pop_firms = newDat$total/newDat$total_population,
                     # delta_exec = newDat$sal_M_executive - newDat$sal_F_executive,
                     # delta_midMan = newDat$sal_M_midManager - newDat$sal_F_midManager,
                     # delta_empl = newDat$sal_M_employee - newDat$sal_F_employee,
                     # delta_worker = newDat$sal_M_worker - newDat$sal_F_worker,
                     workforce = newDat$workforce,
                     dependency_ratio = newDat$dependency_ratio,
                     delta_wagearner = newDat$wagearner_M - newDat$wagearner_F,
                     delta_indep = newDat$independent_M - newDat$independent_F,
                     # delta_transfer = newDat$transfer_M - newDat$transfer_F, # 2000 zeros
                     delta_employer = newDat$employer_M - newDat$employer_F,
                     delta_full = newDat$full_M - newDat$full_F,
                     # delta_half = newDat$half_M - newDat$half_F, # OLS gives NA
                     superficie = newDat$Superficie,
                     housing14 = newDat$housing14,
                     median_living14 = newDat$median_living14,
                     empl = newDat$empl,
                     emp_sal = newDat$emp_sal,
                     pop15_64 = newDat$pop15_64,
                     unemp15_64 = newDat$unemp15_64,
                     act15_64 = newDat$act15_64,
                     unemp_rate = newDat$unemp_rate,
                     delta_NoDip = newDat$male_NoDip - newDat$female_NoDip,
                     delta_sec = newDat$male_Sec - newDat$female_Sec,
                     delta_hi = newDat$male_Hi - newDat$female_Hi,
                     delta_univ = newDat$male_Univ - newDat$female_Univ,
                     nodip = newDat$nodip,
                     delta_immig = newDat$male_immig - newDat$female_immig,
                     delta_workingImmig = newDat$male_working_immig - newDat$female_working_immig,
                     delta_native = newDat$male_native - newDat$female_native,
                     # evol_entrepr = newDat$evol_entrepreneurial_score,
                     # evol_pop = newDat$evol_pop_score,
                     regional_GDP = newDat$regional_GDP,
                     # dymanic_demo_prof = newDat$dymanic_demo_prof,
                     nb_active_employees = newDat$nb_active_employees,
                     nb_active_non_employees = newDat$nb_active_non_employees,
                     # nb_active = newDat$nb_active,   # OLS gives NA
                     nb_created_services = newDat$nb_created_services,
                     nb_created_commerce = newDat$nb_created_commerce,
                     nb_created_construction = newDat$nb_created_construction,
                     nb_created_ind = newDat$nb_created_ind,
                     # nb_created_firms = newDat$nb_created_firms, # OLS gives  
                     nb_firms_service = newDat$nb_firms_service,
                     nb_firms_commerce = newDat$nb_firms_commerce,
                     nb_firms_construction = newDat$nb_firms_construction,
                     nb_firms_ind = newDat$nb_firms_ind,
                     nb_major = newDat$nb_major,
                     nb_minor = newDat$nb_minor,
                     delta_num_sex = newDat$nb_male - newDat$nb_female,
                     nb_household = newDat$nb_household)
                     # urbanTRUE = newDat$urbanDummy)

# add dummy for urban !!!

# merge variables
mod = cbind.data.frame(y, x)
# dim(mod)

# Specify the model with additional polynomials as predictors
# # merge variables
# modyx = cbind.data.frame(y, x)
# # merge variables excluding ouliers
# # mod = mod[!outl,]
# 
# # add transformations on the predictors
#   mod = cbind.data.frame(mod, mod[,2:ncol(mod)]^2, mod[,2:ncol(mod)]^3)
#   namesdataBS = names(mod)
#   lll = ((dim(mod)[2]-1)/3)
#   # for (i in 1:lll+1){
#   #   str = namesdataBS[i]
#   #   namesdataBS[i] = regmatches(str, regexpr("_", str), invert = TRUE)[[1]][-1]
#   # }
#   namesdataBS[(lll+2):(lll*2+1)] = paste(namesdataBS[1:lll+1], rep("^2", lll), sep="")
#   namesdataBS[(lll*2+2):(lll*3+1)] = paste(namesdataBS[1:lll+1], rep("^3", lll), sep="")
#   # namesdataBS[(lll*3+2):(lll*4+1)] = paste(namesdataBS[1:lll+1], rep("log", lll))
#   names(mod) = namesdataBS

# scatter plot matrix
# pairs(mod)
# correlations
corrplot(cor(mod), method = "color", #title = "Correlation matrix for X variables",
         diag = T, tl.cex=0.5, type="lower", #col = colorRampPalette(c("red","green","navyblue"))(100))
         tl.col = "black")  # , mar=c(0,0,1.5,0)

```

These plots shows that: 

* the response is long tailed, which also motivates the robust framework
* there is strong evidence of collinearity

The final model is built considering a random sample of observations (70% of them)
in order to reduce spatial correlation among units
```{r sub-sampling}

set.seed(1)
subs = sample(1:length(y), size = round(length(y)*0.7))
y = y[subs]
x = x[subs,]
mod = mod[subs,]
towns = newDat$town[subs]
print("The final dataset has a dimension of: ")
dim(mod)

# dummy variable for Paris and Chambourcy, used later for plots (found ex-post)
dcity = rep(1, length(y))
dcity[towns=="Paris"] = 2
dcity[towns=="Chambourcy"] = 3
dcity = as.factor(dcity)

# # reduced response variable histogram
# ggplot(data=as.data.frame(y), aes(y)) +
#   geom_histogram(aes(y =..density..), col="black", fill="blue", alpha = .3, bins = 35) +
#   geom_density(col="black") +
#   labs(x="Male-Female salary", y="Density") +
#   ggtitle("Original response variable") +
#   theme(plot.title = element_text(hjust = 0.5))
# 
# # correlations
# corrplot(cor(mod), method = "color", # title = "Correlation matrix for X variables",
#          diag = T, tl.cex=0.5, type="lower", #col = colorRampPalette(c("red","green","navyblue"))(100))
#          tl.col = "black")  # , mar=c(0,0,1.5,0)



```

In the following analyses to towns will be often highlighted and 
they have been found ex-post.
They are considered as 2 typical outliers but for different reasons.
One of them is Paris, which is an outlier for obvious reasons. 
The other (unexpected) one is [Chambourcy](https://en.wikipedia.org/wiki/Chambourcy), 
which has been spotted by the following analysis and it is a very small town.

All the coefficients' estimates obtained in the following will be saved
in a dataframe to allow an easy ex-post comparison.
For fitting methods that provide significance levels, only variables
significant at 0.01% will be kept.

### Naive approaches 

#### Ordinary least squares

A classical linear model based on OLS is first fitted
```{r OLS}

# OLS
ols = lm(y ~ ., data = mod, y=T)
summary(ols)

# VIF
# vif(ols)

# diagnostic plots
# layout(matrix(1:6, nr = 2), height = c(2, 1))
RegressionPlots(ols, towns, robust = F, out=dcity)

# extract variables significant at 0.01% and save them in a data frame
coeff.df = data.frame(as.matrix(summary(ols)[["coefficients"]][,"Pr(>|t|)"]<0.01), 
                      row.names = c("(Intercept)", names(x)))
names(coeff.df) = c("OLS")
coeff.df[coeff.df==TRUE] = sprintf("%.2e",
           as.matrix(ols$coefficients[summary(ols)[["coefficients"]][,"Pr(>|t|)"]<0.01]), 4)
coeff.df[coeff.df==FALSE] = "--"

```

It shows that:

* there is etoreskadasticity
* Paris has a levarage equal to 1

Also, high collinearity and robustness are not taken into account.
At the same time, running different times OLS didn't produce any coefficients' instability,
even if VIF values are very high.

#### Best subset selection

Model selection by exhaustive search (suffering the same drawbacks as OLS) is now applied,
which is also a very heavy approach in terms of computing time.
```{r BSS exhaustive}

set.seed(1)

# best subset selection
best.sub = regsubsets(y ~ ., data = mod, nvmax = ncol(mod), really.big=TRUE, method="exhaustive") # exhaustive # seqrep
best.sub.summary = summary(best.sub)
# manual plotting
par(mfrow =c(2,2))
# rsq
plot(best.sub.summary$rsq , xlab="Number of Variables", ylab="Rsq", type="l")
ind_Rsq = which.max(best.sub.summary$rsq)
points(ind_Rsq, best.sub.summary$adjr2[ind_Rsq], col ="red", cex=2, pch=20)
# adjRsq
plot(best.sub.summary$adjr2 ,xlab="Number of Variables", ylab="Adjusted RSq", type="l")
ind_adjRsq = which.max(best.sub.summary$adjr2)
points(ind_adjRsq, best.sub.summary$adjr2[ind_adjRsq], col ="red", cex=2, pch=20)
# Cp
plot(best.sub.summary$cp ,xlab="Number of Variables", ylab="Cp", type="l")
ind_Cp = which.min(best.sub.summary$cp)
points(ind_Cp, best.sub.summary$cp[ind_adjRsq], col ="red", cex=2, pch=20)
# bic
plot(best.sub.summary$bic ,xlab="Number of Variables", ylab="bic", type="l")
ind_bic = which.min(best.sub.summary$bic)
points(ind_bic, best.sub.summary$bic[ind_bic], col ="red", cex=2, pch=20)
mtext("Best subset selection for delta gender salary", outer=TRUE,  cex=1.2, line=-1.5)

# built-in plots
par(mfrow=c(1,1))
plot.regsubsets2(best.sub, scale = "r2", cex.axis = 0.6, main = "Best subset selection using R squared")
plot.regsubsets2(best.sub, scale = "adjr2", cex.axis = 0.6, main = "Best subset selection using adjusted R squared")
plot.regsubsets2(best.sub, scale = "Cp", cex.axis = 0.6, main = "Best subset selection using Mallows' Cp")
plot.regsubsets2(best.sub, scale = "bic", cex.axis = 0.6, main = "Best subset selection using BIC")

# mtext("Best subset selection for salary 18-25 using BIC", outer=TRUE,  cex=1.4, line=-3.5)

# retrieve the model with min BIC
coefficients(best.sub, which.min(best.sub.summary$bic))
nnn = names(coefficients(best.sub, which.min(best.sub.summary$bic)))
nnn[(nnn == "urbanTRUETRUE")] = "urbanTRUE"
  # nnn <- gsub(x = nnn,
  #             pattern = "`",
  #             replacement = "")

# apply OLS to obtain diagnostic plots (approximately the same coefficients)
mod_bic = mod[, colnames(mod) %in% nnn] 
mod_bic = cbind.data.frame(y, mod_bic)
ols_bic = lm(y ~ ., data = mod_bic, y=T)
summary(ols_bic)
RegressionPlots(ols_bic, textlabels = towns, out = dcity)

# extract variables significant at 0.01% and save them in a data frame
iBSS = row.names(coeff.df) %in% nnn
coeff.df$BSS = as.character(rep("--", nrow(coeff.df)))
coeff.df$BSS[iBSS] = sprintf("%.2e", as.matrix(coefficients(best.sub, which.min(best.sub.summary$bic))))

```


#### Stepwise selection

Model selection by sequential replacement (suffering the same drawbacks as OLS) is now applied
and it is very fast compared to the exhaustive search
```{r BSS seqrep}

set.seed(1)

# best subset selection
best.sub = regsubsets(y ~ ., data = mod, nvmax = ncol(mod), really.big=TRUE, method="seqrep") # exhaustive # seqrep
best.sub.summary = summary(best.sub)
# manual plotting
par(mfrow =c(2,2))
# rsq
plot(best.sub.summary$rsq , xlab="Number of Variables", ylab="Rsq", type="l")
ind_Rsq = which.max(best.sub.summary$rsq)
points(ind_Rsq, best.sub.summary$adjr2[ind_Rsq], col ="red", cex=2, pch=20)
# adjRsq
plot(best.sub.summary$adjr2 ,xlab="Number of Variables", ylab="Adjusted RSq", type="l")
ind_adjRsq = which.max(best.sub.summary$adjr2)
points(ind_adjRsq, best.sub.summary$adjr2[ind_adjRsq], col ="red", cex=2, pch=20)
# Cp
plot(best.sub.summary$cp ,xlab="Number of Variables", ylab="Cp", type="l")
ind_Cp = which.min(best.sub.summary$cp)
points(ind_Cp, best.sub.summary$cp[ind_adjRsq], col ="red", cex=2, pch=20)
# bic
plot(best.sub.summary$bic ,xlab="Number of Variables", ylab="bic", type="l")
ind_bic = which.min(best.sub.summary$bic)
points(ind_bic, best.sub.summary$bic[ind_bic], col ="red", cex=2, pch=20)
mtext("Best subset selection for delta gender salary", outer=TRUE,  cex=1.2, line=-1.5)

# built-in plots
par(mfrow=c(1,1))
plot.regsubsets2(best.sub, scale = "r2", cex.axis = 0.6, main = "Best subset selection using R squared")
plot.regsubsets2(best.sub, scale = "adjr2", cex.axis = 0.6, main = "Best subset selection using adjusted R squared")
plot.regsubsets2(best.sub, scale = "Cp", cex.axis = 0.6, main = "Best subset selection using Mallows' Cp")
plot.regsubsets2(best.sub, scale = "bic", cex.axis = 0.6, main = "Best subset selection using BIC")

# mtext("Best subset selection for salary 18-25 using BIC", outer=TRUE,  cex=1.4, line=-3.5)

# retrieve the model with min BIC
coefficients(best.sub, which.min(best.sub.summary$bic))
nnn = names(coefficients(best.sub, which.min(best.sub.summary$bic)))
nnn[(nnn == "urbanTRUETRUE")] = "urbanTRUE"
  # nnn <- gsub(x = nnn,
  #             pattern = "`",
  #             replacement = "")

# apply OLS to obtain diagnostic plots (approximately the same coefficients)
mod_bic = mod[, colnames(mod) %in% nnn] 
mod_bic = cbind.data.frame(y, mod_bic)
ols_bic = lm(y ~ ., data = mod_bic, y=T)
summary(ols_bic)
RegressionPlots(ols_bic, textlabels = towns, out = dcity)

# extract variables significant at 0.01% and save them in a data frame
iBSS = row.names(coeff.df) %in% nnn
coeff.df$"BSS-SR" = as.character(rep("--", nrow(coeff.df)))
coeff.df$"BSS-SR"[iBSS] = sprintf("%.2e", as.matrix(coefficients(best.sub, which.min(best.sub.summary$bic))))

```

This approach results in a sparser model compared to the exhaustive search,
and the cost (e.g., in terms of BIC) is higher.


#### MM estimator

The model is now fitted using the MM estimator,
which is a soft trimming methodology.
It is based on a starting robust estimator with a high
breakdown point (in this case an S estimator),
which provides a robust scale estimate used by
the MM estimator to recover efficiency
```{r MM}

set.seed(11) # to ensure convergence

# robust fit using MM estimator 
mm = lmrob(y~., data = mod, y=T, control = lmrob.control(compute.rd = T)) # setting = "KS2014",
summary(mm)

# outliers identified and important cities
outlMM = as.character(dcity)
outlMM[mm$rweights < 2.8e-05] = "0"
outlMM[dcity!=1] = as.character(dcity[dcity!=1])
outlMM = as.factor(outlMM)

# diagnostic plots
  # layout(matrix(1:6, nr = 2), height = c(2, 1))
  # plot(mm, cex = 0.6, id.n = 10, labels.id = newDat$town_name, sub.caption="MM")
RegressionPlots(mm, towns, robust = T,out=outlMM)

# extract variables significant at 0.01% and save them in a data frame
coeff.df$MM = as.matrix(summary(mm)[["coefficients"]][,"Pr(>|t|)"]<0.01)
coeff.df$MM[coeff.df$MM==TRUE] = sprintf("%.2e",
               as.matrix(mm$coefficients[summary(mm)[["coefficients"]][,"Pr(>|t|)"]<0.01]), 4) 
coeff.df$MM[coeff.df$MM==FALSE] = "--"

```
This model shows that:

* residuals and fitted values seem better than the OLS fit
* it is able to identify the most oulying units (considering all the dimensions)
* leverage points (Lyon and Toulouse) have a lower weight compared to OLS estimates


#### Least trimmed squares

A very similar result is obtained using the LTS estimator 
(i.e., an hard trimming approach), with trimming proportion equal 
to 10% of the units.
```{r LTS}

set.seed(1)

lts_sol = ltsReg(y~., data = mod,  method = "lts", alpha = 0.9, mcd = F)
summary(lts_sol)
# assign variables used for plotting
lts_sol$y = y
lts_sol$MD = mm$MD # use MM mahalanobis distance (for computational problems of singularity)
  # lts_sol$MD = covMcd(mod[,c(1:12, 14:ncol(mod))])$mah

# outliers identified and important cities
outlLTS = as.character(dcity)
outlLTS[lts_sol$lts.wt < 2e-05] = "0"
outlLTS[dcity!=1] = as.character(dcity[dcity!=1])
outlLTS = as.factor(outlLTS)

# plot(lts_sol)
RegressionPlots(lts_sol, towns, robust = T, out = outlLTS)

# extract variables significant at 0.01% and save them in a data frame
coeff.df$LTS = summary(lts_sol)[["coefficients"]][,"Pr(>|t|)"]<0.01
coeff.df$LTS[coeff.df$LTS==TRUE] =
  sprintf("%.2e", as.matrix(lts_sol$coefficients[summary(lts_sol)[["coefficients"]][,"Pr(>|t|)"]<0.01]), 4)
coeff.df$LTS[coeff.df$LTS==FALSE] = "--"

```

Using the MM or LTS estimator, as expected, increases the adjusted R squared of around 10%.


#### Best subset selection with MM weights

The results obtained using a best subset selection procedure are now combined with the
weights obtained using the MM estimator (pretty similar to the ones obtaiend using the LTS weights)
```{r BSS(MM) exhaustive}

set.seed(1)

# best subset selection
best.sub = regsubsets(y ~ ., data = mod, nvmax = ncol(mod), really.big=TRUE, method="exhaustive", weights = lts_sol$lts.wt) # exhaustive # seqrep
best.sub.summary = summary(best.sub)
# manual plotting
par(mfrow =c(2,2))
# rsq
plot(best.sub.summary$rsq , xlab="Number of Variables", ylab="Rsq", type="l")
ind_Rsq = which.max(best.sub.summary$rsq)
points(ind_Rsq, best.sub.summary$adjr2[ind_Rsq], col ="red", cex=2, pch=20)
# adjRsq
plot(best.sub.summary$adjr2 ,xlab="Number of Variables", ylab="Adjusted RSq", type="l")
ind_adjRsq = which.max(best.sub.summary$adjr2)
points(ind_adjRsq, best.sub.summary$adjr2[ind_adjRsq], col ="red", cex=2, pch=20)
# Cp
plot(best.sub.summary$cp ,xlab="Number of Variables", ylab="Cp", type="l")
ind_Cp = which.min(best.sub.summary$cp)
points(ind_Cp, best.sub.summary$cp[ind_adjRsq], col ="red", cex=2, pch=20)
# bic
plot(best.sub.summary$bic ,xlab="Number of Variables", ylab="bic", type="l")
ind_bic = which.min(best.sub.summary$bic)
points(ind_bic, best.sub.summary$bic[ind_bic], col ="red", cex=2, pch=20)
mtext("Best subset selection for delta gender salary", outer=TRUE,  cex=1.2, line=-1.5)

# built-in plots
par(mfrow=c(1,1))
plot.regsubsets2(best.sub, scale = "r2", cex.axis = 0.6, main = "Best subset selection using R squared")
plot.regsubsets2(best.sub, scale = "adjr2", cex.axis = 0.6, main = "Best subset selection using adjusted R squared")
plot.regsubsets2(best.sub, scale = "Cp", cex.axis = 0.6, main = "Best subset selection using Mallows' Cp")
plot.regsubsets2(best.sub, scale = "bic", cex.axis = 0.6, main = "Best subset selection using BIC")

# mtext("Best subset selection for salary 18-25 using BIC", outer=TRUE,  cex=1.4, line=-3.5)

# retrieve the model with min BIC
coefficients(best.sub, which.min(best.sub.summary$bic))
nnn = names(coefficients(best.sub, which.min(best.sub.summary$bic)))
nnn[(nnn == "urbanTRUETRUE")] = "urbanTRUE"
  # nnn <- gsub(x = nnn,
  #             pattern = "`",
  #             replacement = "")

# apply OLS to obtain diagnostic plots (approximately the same coefficients)
  # mod_bic = mod[, colnames(mod) %in% nnn] 
  # mod_bic = cbind.data.frame(y, mod_bic)
  # ols_bic = lm(y ~ ., data = mod_bic, y=T)
  # summary(ols_bic)
  # RegressionPlots(ols_bic, textlabels = towns, out = dcity)

# extract variables significant at 0.01% and save them in a data frame
iBSS = row.names(coeff.df) %in% nnn
coeff.df$"BSS(MM)" = as.character(rep("--", nrow(coeff.df)))
coeff.df$"BSS(MM)"[iBSS] = sprintf("%.2e", as.matrix(coefficients(best.sub, which.min(best.sub.summary$bic))))

```

This approach results in a significant improvement of the cost function,
but the best model is less sparse and it keeps approximately 20 variables.


#### Elastic net

To face the problem of strong collinearity, which has not been considere yet,
an Elastic-net is now fitted using 10-folds cross validation 
to tune lambda and considering a range of 11 equispaced alpha values.
The best model is considered the one with the lowest MSPE.
```{r ENET}

set.seed(2)

# Elastic net
x = as.matrix(mod[, 2:ncol(mod)])
y = mod[, 1]
par(mfrow =c(3,4))
par(oma=c(0,0,2,0))
alp = seq(0, 1, 0.1) # seq(0, 1, length.out = 20)

# initialize list to contain the variables selected, 
# number of non zero coefficients,
# cost function and the counter
nzcoef = matrix(NA, length(alp), 2)
var_selec = list()
costCV = rep(NA, length(alp))
i = 1

for (j in alp){
  # set.seed (3)
  cv.out = cv.glmnet(x, y, alpha = j, nfolds=10) # standardize.response = T , parallel = T 
  plot(cv.out)
  title(paste("alpha = ", j), line = 0.3)
  # save selected variables
  # keep lower number of variables selected bu lambda or lambda.1se
  if (sum(coef(cv.out, s = "lambda.min") != 0) > sum(coef(cv.out, s = "lambda.1se") != 0)){ # lowest is min Lambda
    tmp_coeffs <- coef(cv.out, s = "lambda.1se") 
  } else {
    tmp_coeffs <- coef(cv.out, s = "lambda.min") # cv.out$lambda.1se 
  }
  var_selec[[i]] = data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1], 
                              coefficient = tmp_coeffs@x)
  # save number of non zero variables
  nzcoef[i,1] = cv.out$nzero[cv.out$lambda == cv.out$lambda.min]
  nzcoef[i,2] = cv.out$nzero[cv.out$lambda == cv.out$lambda.1se]
  # save the value of the cost function
  costCV[i] = min(cv.out$cvm)
  # increase iterator
  i = i + 1
}
mtext(expression("Best lambda for delta gender salary using elastic-net and 10-folds CV"), outer=TRUE,  cex=1, line=0.2)

# best model (min mean Cv error)
bestCV = which.min(costCV)
# # show the coefficients obtained for the best model (added in the final table)
# print(var_selec[[bestCV]])

par(mfrow =c(1,1))
# plotmean Cv error
plot(alp, costCV, main="Mean 10-folds CV error for each alpha level",
     xlab = "alpha proportion", ylab = "mean CV error", type = "b", col = "blue")
axis(side = 1, at = c(alp))

# plot number of non zero coefficients
plot(alp, nzcoef[,1], main="Number of non-zero coefficients for each alpha level",
     xlab = "alpha proportion", ylab = "Non-zero coefficients", type = "b", col = "blue",
     ylim = c(min(nzcoef), max(nzcoef)))
lines(alp, nzcoef[,2], col = "red", type = "b")
legend('topright', c("lambda min", "lambda 1SE"), lty = 1, col = c("blue", "red"), cex = 0.7,lwd=2) 
axis(side = 1, at = c(alp))

# extract significant variables and save in a dataframe
coeff.df$ENET = as.character(rep("--", nrow(coeff.df)))
iENET = row.names(coeff.df) %in% var_selec[[bestCV]]$name
coeff.df$ENET[iENET] = sprintf("%.2e", as.matrix(var_selec[[bestCV]]$coefficient)) 

# # diagnostic plots
# # create quantities of interest
# cv.out$fitted.values = 
#   as.matrix(cbind.data.frame("(Intercept)" = rep(1, nrow(x)), x))[,i] %*% var_selec[[bestCV]]$coefficient
# cv.out$residuals = y - cv.out$fitted.values
# cv.out$residuals$y = y
# cv.out$ 
# 
# # outliers identified and important cities
# outl = dcity
# # outl[lts_sol$lts.wt < 2e-05] = "0"
# # outl[dcity!=1] = as.character(dcity[dcity!=1])
# # outl = as.factor(outl)
# 
# # plot(lts_sol)
# RegressionPlots(cv.out, towns, robust = F, out = outl)

```

This estimator proides a fairly sparse model,
but it is non robust, it is biased, and it doesn't 
easy allow to perform inferential reasoning.


#### OLS on elastic net solution

To partially solve the latter two problems,
an OLS fit could be applied only on the variables 
previously identified by ENET.
```{r ENET+OLS}

# OLS
indCoef = names(mod) %in% var_selec[[bestCV]]$name
modENETOLS = cbind.data.frame(y, mod[,indCoef])
ENET_ols = lm(y ~ ., y=T, data = modENETOLS)
summary(ENET_ols)

# diagnostic plots
# layout(matrix(1:6, nr = 2), height = c(2, 1))
RegressionPlots(ENET_ols, towns, robust = F, out=dcity)

# extract variables significant at 0.01% and save them in a data frame
coeff.df$"ENET+OLS" = as.character(rep("--", nrow(coeff.df)))
coeff.df$"ENET+OLS"[iENET] = as.matrix(summary(ENET_ols)[["coefficients"]][,"Pr(>|t|)"]<0.01)
coeff.df[coeff.df==TRUE] = sprintf("%.2e",
           as.matrix(ENET_ols$coefficients[summary(ENET_ols)[["coefficients"]][,"Pr(>|t|)"]<0.01]), 4)
coeff.df[coeff.df==FALSE] = "--"

```

This results in a sparser model and the diagnostic plots
shows a better result compared to the raw OLS estimates,
but it is still not robust.


#### Elastic net with MM weights

At this point, it seems reasonable to use the weights
obtained by the MM estimator in an ENET model,
trying to recover some robustness and keeping a sparse model
```{r ENET(MM)}

set.seed(1)

# initialize list to contain the variables selected, 
# number of non zero coefficients,
# cost function and the counter
nzcoef = matrix(NA, length(alp), 2)
var_selec = list()
costCV = rep(NA, length(alp))
i = 1

par(mfrow =c(3,4))
par(oma=c(0,0,2,0))

# xx = x[mm$rweights>0.9,]
# yy = y[mm$rweights>0.9]


for (j in alp){
  # set.seed (3)
  cv.out = cv.glmnet(x, y, alpha = j, nfolds=10, weights = mm$rweights) # lts_sol$lts.wt
  plot(cv.out)
  title(paste("alpha = ", j), line = 0.3)
  # save selected variables
  # keep lower number of variables selected bu lambda or lambda.1se
  if (sum(coef(cv.out, s = "lambda.min") != 0) > sum(coef(cv.out, s = "lambda.1se") != 0)){ # lowest is min Lambda
    tmp_coeffs <- coef(cv.out, s = "lambda.1se") 
  } else {
    tmp_coeffs <- coef(cv.out, s = "lambda.min") # cv.out$lambda.1se 
  }
  var_selec[[i]] = data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1], 
                              coefficient = tmp_coeffs@x)
  # save number of non zero variables
  nzcoef[i,1] = cv.out$nzero[cv.out$lambda == cv.out$lambda.min]
  nzcoef[i,2] = cv.out$nzero[cv.out$lambda == cv.out$lambda.1se]
  # save the value of the cost function
  costCV[i] = min(cv.out$cvm)
  # increase iterator
  i = i + 1
}
mtext(expression("Best lambda for delta gender salary using elastic-net and 10-folds CV with MM weights"), 
      outer=TRUE,  cex=0.8, line=0.2)

# best model (min mean Cv error)
bestCV = which.min(costCV)
# # show the coefficients obtained for the best model (added in the final table)
# print(var_selec[[bestCV]])

par(mfrow =c(1,1))
# plotmean Cv error
plot(alp, costCV, main="Mean 10-folds CV error for each alpha level",
     xlab = "alpha proportion", ylab = "mean CV error", type = "b", col = "blue")
axis(side = 1, at = c(alp))

# plot number of non zero coefficients
plot(alp, nzcoef[,1], main="Number of non-zero coefficients for each alpha level",
     xlab = "alpha proportion", ylab = "Non-zero coefficients", type = "b", col = "blue",
     ylim = c(min(nzcoef), max(nzcoef)))
lines(alp, nzcoef[,2], col = "red", type = "b")
legend('topright', c("lambda min", "lambda 1SE"), lty = 1, col = c("blue", "red"), cex = 0.7,lwd=2) 
axis(side = 1, at = c(alp))

# extract significant variables and save in a dataframe
coeff.df$"ENET(MM)" = as.character(rep("--", nrow(coeff.df)))
iENETMM = row.names(coeff.df) %in% var_selec[[bestCV]]$name
coeff.df$"ENET(MM)"[iENETMM] = sprintf("%.2e", as.matrix(var_selec[[bestCV]]$coefficient)) 

```

Contrarily to what was expected, the model is not so sparse.
On the other hand, the mean CV error decreases substiantially (around 0.4)
compared to the pure ENET model (around 1.4), and this gain would
persist even allowing for a stronger shrinkage of the lambda values
and hence reducing the number of coefficients.
It is also interesting to note the strong decrease in CV error
as some proportion of Lasso is introduced.

Such behavior should be studied further.
A motivation could be the fact that ENET is based on CV, and some oulying units with weights
different from 0 can affect this result.
On the other hand, even providing to ENET only units with MM-weights
greater than 0.9 (i.e., almost surely non outlying according to MM based on all variables),
mean CV error decrease substantially, but the problem persists.


#### OLS on Elastic net solution with MM weights

The same is true applying an OLS estimate on the non oulying units
(i.e., with MM-weights > 0.05) and the variables identified bu ENET(MM).
```{r ENET(MM)+OLS}

# To try to solve this problem, another step of
# OLS estimate is computed on the variables selected by
# E-Net with MM weights, excluding fromt the fit units
# with MM-weights <= 0.05

# ADD OLS
indCoef = names(mod) %in% var_selec[[bestCV]]$name
modENETMMOLS = cbind.data.frame(y, mod[,indCoef])
modENETMMOLS = modENETMMOLS[mm$rweights>0.05,]
ENETMM_ols = lm(y ~ ., y=T, data = modENETMMOLS)
summary(ENETMM_ols)

# diagnostic plots
# layout(matrix(1:6, nr = 2), height = c(2, 1))
RegressionPlots(ENETMM_ols, towns[mm$rweights>0.05], robust = F, out=dcity[mm$rweights>0.05])

# extract variables significant at 0.01% and save them in a data frame
coeff.df$"ENET(MM)+OLS" = as.character(rep("--", nrow(coeff.df)))
coeff.df$"ENET(MM)+OLS"[iENETMM] = as.matrix(summary(ENETMM_ols)[["coefficients"]][,"Pr(>|t|)"]<0.01)
coeff.df$"ENET(MM)+OLS"[coeff.df$"ENET(MM)+OLS"==TRUE] = sprintf("%.2e",
           as.matrix(ENETMM_ols$coefficients[summary(ENETMM_ols)[["coefficients"]][,"Pr(>|t|)"]<0.01]), 4)
coeff.df$"ENET(MM)+OLS"[coeff.df$"ENET(MM)+OLS"==FALSE] = "--"

```


Such a model is sparser. It still contains some leverage points,
like Toulose and Lyon, but the diagnostic plots don't show any
particularly strange behavior.

At the same two important aspects have to be taken in account:

* the MM weights used were obtained on the base of the full model, 
  hence they could not be really representative
* some units are not being modeled in this framework


#### Elastic net with MM weights to model outliers

To face the latter problem an approach could be based on the idea of modeling
apart the most outlying units.
For instance, the following fit is based on ENET where the provided weights
correspond to 1 - MMweights.
```{r ENET(MMout)}

set.seed(1)

# initialize list to contain the variables selected, 
# number of non zero coefficients,
# cost function and the counter
nzcoef = matrix(NA, length(alp), 2)
var_selec = list()
costCV = rep(NA, length(alp))
i = 1

par(mfrow =c(3,4))
par(oma=c(0,0,2,0))

for (j in alp){
  # set.seed (3)
  cv.out = cv.glmnet(x, y, alpha = j, nfolds=10, weights = 1-mm$rweights) # standardize.response = T , parallel = T 
  plot(cv.out)
  title(paste("alpha = ", j), line = 0.3)
  # save selected variables
  # keep lower number of variables selected bu lambda or lambda.1se
  if (sum(coef(cv.out, s = "lambda.min") != 0) > sum(coef(cv.out, s = "lambda.1se") != 0)){ # lowest is min Lambda
    tmp_coeffs <- coef(cv.out, s = "lambda.1se") 
  } else {
    tmp_coeffs <- coef(cv.out, s = "lambda.min") # cv.out$lambda.1se 
  }
  var_selec[[i]] = data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1], 
                              coefficient = tmp_coeffs@x)
  # save number of non zero variables
  nzcoef[i,1] = cv.out$nzero[cv.out$lambda == cv.out$lambda.min]
  nzcoef[i,2] = cv.out$nzero[cv.out$lambda == cv.out$lambda.1se]
  # save the value of the cost function
  costCV[i] = min(cv.out$cvm)
  # increase iterator
  i = i + 1
}
mtext(expression("Best lambda for delta gender salary using elastic-net and 10-folds CV with MM weights"), 
      outer=TRUE,  cex=0.8, line=0.2)

# best model (min mean Cv error)
bestCV = which.min(costCV)
# # show the coefficients obtained for the best model (added in the final table)
# print(var_selec[[bestCV]])

par(mfrow =c(1,1))
# plotmean Cv error
plot(alp, costCV, main="Mean 10-folds CV error for each alpha level",
     xlab = "alpha proportion", ylab = "mean CV error", type = "b", col = "blue")
axis(side = 1, at = c(alp))

# plot number of non zero coefficients
plot(alp, nzcoef[,1], main="Number of non-zero coefficients for each alpha level",
     xlab = "alpha proportion", ylab = "Non-zero coefficients", type = "b", col = "blue",
     ylim = c(min(nzcoef), max(nzcoef)))
lines(alp, nzcoef[,2], col = "red", type = "b")
legend('topright', c("lambda min", "lambda 1SE"), lty = 1, col = c("blue", "red"), cex = 0.7,lwd=2) 
axis(side = 1, at = c(alp))

# extract significant variables and save in a dataframe
coeff.df$"ENET(MMout)" = as.character(rep("--", nrow(coeff.df)))
iENETMM = row.names(coeff.df) %in% var_selec[[bestCV]]$name
coeff.df$"ENET(MMout)"[iENETMM] = sprintf("%.2e", as.matrix(var_selec[[bestCV]]$coefficient)) 

# ADD OLS (only to obtain diagnostic plots)
trim = 0.15 # MM weights cutoff to consider outliers
indCoef = names(mod) %in% var_selec[[bestCV]]$name
modENETMMOLSout = cbind.data.frame(y = y[mm$rweights<trim], mod[mm$rweights<trim,indCoef]) 
modENETMMOLSout = lm(y ~ ., y=T, data = modENETMMOLSout)
summary(modENETMMOLSout)

# diagnostic plots
# layout(matrix(1:6, nr = 2), height = c(2, 1))
outlOpposite = as.character(outlMM) 
outlOpposite[outlOpposite == 0] = 100
outlOpposite[outlOpposite == 1] = 0
outlOpposite[outlOpposite == 100] = 1
outlOpposite = as.factor(outlOpposite)
RegressionPlots(modENETMMOLSout, towns[mm$rweights<trim], robust = F, out=dcity[mm$rweights<trim])

# extract variables significant at 0.01% and save them in a data frame
# coeff.df$"ENET(MMout)" = as.character(rep("--", nrow(coeff.df)))
# coeff.df$"ENET(MMout)"[iENETMM] = as.matrix(summary(modENETMMOLSout)[["coefficients"]][,"Pr(>|t|)"]<0.01)
# coeff.df[coeff.df==TRUE] = sprintf("%.2e",
#            as.matrix(modENETMMOLSout$coefficients[summary(modENETMMOLSout)[["coefficients"]][,"Pr(>|t|)"]<0.01]), 4)
# coeff.df[coeff.df==FALSE] = "--" 

```

This model results in:

* in a very sparse model which seems to fit reasonably well all the points
* provides coefficients which are sensibly different than the ones for the previously considered models
(e.g., unemployment coefficients changes its sign) supporting the assumption that there is more than one data generating process
* the mean CV error increases substantially 

A possible drawback of this approach, again, reside on the fact that the MM weights are based on the full model.


### Robust approaches incorporating shrinkage

#### Sparse-LTS

The estimator sparse-LTS, which is a robustification of Lasso using LTS, is now applied.
The parameters used perform a trimming 10% of the units, and 10 replication of 10-folds CV
```{r sparseLTS, warning=FALSE}

set.seed(1)

# install.packages("snow")
# library("snow")

# fit robust Lasso
cl = makeCluster(detectCores())
sol_sparseLTS = sparseLTS(x, y, mode = "lambda", alpha = 0.9, cl = cl, 
                          model=TRUE, crit = "PE", splits = foldControl(10, 10)) # crit = "BIC"
# sol_sparseLTS_CV = perry(sol_sparseLTS, splits = foldControl(10, 10), cost = rtmspe, ncores = cl)
stopCluster(cl)

# check the solution
print(sol_sparseLTS) 

# fit and evaluate sparse LTS model
  # plot(sol_sparseLTS_CV, "density") 

# assign leverages, assigned only to obtain diagnostic plots
sol_sparseLTS$MD = rep(1, length(y)) # covMcd(x)$mah # compute standard leverages
# outliers identified and important cities
outlSLTS = as.character(sol_sparseLTS$wt)
outlSLTS[dcity!=1] = as.character(dcity[dcity!=1])
outlSLTS = as.factor(outlSLTS)
# diagnostic plots
RegressionPlots(sol_sparseLTS, textlabels=towns, robust = TRUE, out = outlSLTS)[[1]]
RegressionPlots(sol_sparseLTS, textlabels=towns, robust = TRUE, out = outlSLTS)[[2]]
RegressionPlots(sol_sparseLTS, textlabels=towns, robust = TRUE, out = outlSLTS)[[3]]
diagnosticPlot(sol_sparseLTS, id.n = 10, ask = FALSE, label=towns, which = "rdiag")

# extract variables significant and save them in a data frame
iSLTS = coefficients(sol_sparseLTS)!=0
coeff.df$SLTS = as.character(rep("--", nrow(coeff.df)))
coeff.df$SLTS[iSLTS] = sprintf("%.2e", as.matrix(coefficients(sol_sparseLTS)[iSLTS]))

```

This model results in: 

* a very sparse and robust model
* the unemployment rate is the most important variable
* clearly shows 2 different sub-popilations
* Paris is not a clear outlier

The main drawback resides on the fact that it is not taking into account collinearity.


#### Elastic net with sparse-LTS weights

In order to try to solve this problem, ENET with weights provided by sparseLTS could be applied
```{r ENET(SLTS)}

set.seed(1)

# initialize list to contain the variables selected,
# number of non zero coefficients,
# cost function and the counter
nzcoef = matrix(NA, length(alp), 2)
var_selec = list()
costCV = rep(NA, length(alp))
i = 1

par(mfrow =c(3,4))
par(oma=c(0,0,2,0))

nonOutSLTS = as.logical(sol_sparseLTS$wt)

for (j in alp){
  # set.seed (3)
  cv.out = cv.glmnet(x, y, alpha = j, nfolds=10, weights = nonOutSLTS)
  plot(cv.out)
  title(paste("alpha = ", j), line = 0.3)
  # save selected variables
  # keep lower number of variables selected by lambda or lambda.1se
  if (sum(coef(cv.out, s = "lambda.min") != 0) > sum(coef(cv.out, s = "lambda.1se") != 0)){ # lowest is min Lambda
    tmp_coeffs <- coef(cv.out, s = "lambda.1se")
  } else {
    tmp_coeffs <- coef(cv.out, s = "lambda.min") # cv.out$lambda.1se
  }
  var_selec[[i]] = data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1],
                              coefficient = tmp_coeffs@x)
  # save number of non zero variables
  nzcoef[i,1] = cv.out$nzero[cv.out$lambda == cv.out$lambda.min]
  nzcoef[i,2] = cv.out$nzero[cv.out$lambda == cv.out$lambda.1se]
  # save the value of the cost function
  costCV[i] = min(cv.out$cvm)
  # increase iterator
  i = i + 1
}
mtext(expression("Best lambda for delta gender salary using elastic-net and 10-folds CV with sparse-LTS weights"),
      outer=TRUE,  cex=0.8, line=0.2)

# best model (min mean Cv error)
bestCV = which.min(costCV)
# manually select the 8th as the best model
bestCV = 8
# # show the coefficients obtained for the best model (added in the final table)
# print(var_selec[[bestCV]])

par(mfrow =c(1,1))
# plotmean Cv error
plot(alp, costCV, main="Mean 10-folds CV error for each alpha level",
     xlab = "alpha proportion", ylab = "mean CV error", type = "b", col = "blue")
axis(side = 1, at = c(alp))

# plot number of non zero coefficients
plot(alp, nzcoef[,1], main="Number of non-zero coefficients for each alpha level",
     xlab = "alpha proportion", ylab = "Non-zero coefficients", type = "b", col = "blue",
     ylim = c(min(nzcoef), max(nzcoef)))
lines(alp, nzcoef[,2], col = "red", type = "b")
legend('topright', c("lambda min", "lambda 1SE"), lty = 1, col = c("blue", "red"), cex = 0.7,lwd=2)
axis(side = 1, at = c(alp))

# extract significant variables and save in a dataframe
coeff.df$"ENET(SLTS)" = as.character(rep("--", nrow(coeff.df)))
iENETSLTS = row.names(coeff.df) %in% var_selec[[bestCV]]$name
coeff.df$"ENET(SLTS)"[iENETSLTS] = sprintf("%.2e", as.matrix(var_selec[[bestCV]]$coefficient))

```

In this case, instead of the actual minimum obtained by ENET which
corresponded to a Lasso fit, the best labda kept corresponds to alpha=0.7
(at the cost of a slightly higher CV error).
The highest coefficient for this model results the unemployment rate.

To perform a further validation of these results, OLS is applied on the model identified 
(discarding the units identified by sparse-LTS)
```{r ENET(SLTS)+OLS}

# ADD OLS
indCoef = names(mod) %in% var_selec[[bestCV]]$name
modENETsltsOLS = cbind.data.frame(y = y[nonOutSLTS], mod[nonOutSLTS,indCoef]) 
ENET_SLTS_ols = lm(y ~ ., y=T, data = modENETsltsOLS)
summary(ENET_SLTS_ols)

# diagnostic plots
# layout(matrix(1:6, nr = 2), height = c(2, 1))
RegressionPlots(ENET_SLTS_ols, towns[nonOutSLTS], robust = F, out=dcity[nonOutSLTS])

# # extract variables significant at 0.01% and save them in a data frame
# coeff.df$"ENET(SLTS)+OLS" = as.character(rep("--", nrow(coeff.df)))
# coeff.df$"ENET(SLTS)+OLS"[iENETSLTS] = as.matrix(summary(ENET_SLTS_ols)[["coefficients"]][,"Pr(>|t|)"]<0.01)
# coeff.df$"ENET(SLTS)+OLS"[coeff.df$"ENET+SLTS+OLS"==TRUE] = sprintf("%.2e",
#            as.matrix(ENET_SLTS_ols$coefficients[summary(ENET_SLTS_ols)
#                                                 [["coefficients"]][,"Pr(>|t|)"]<0.01]), 4)
# coeff.df$"ENET(SLTS)+OLS"[coeff.df$"ENET+SLTS+OLS"==FALSE] = "--"

```

They show that:

* the model seems correctly specified and all regressors are significant
* Paris is not considered as on outlier (its leverage is strongly decreased)
* the unemployment rate has a strong effect on the response


#### Penalized ENET based on the S estimator

Penalized ENET based on the S estimator (PENSE) is now applied.
The PENSE estimate minimizes the robust M-scale of the residuals penalized by the L1 and L2
norm of the regression coefficients (elastic net penalty). The level of penalization is chosen to
minimize the k-fold cross-validated prediction error (using a robust measure).
```{r ENETS}

# install.packages("pense")
library("pense")

alp = seq(0, 1,length.out = 5)
i = 1
sol_ENETS = list()

for (j in alp){
  cl = makeCluster(detectCores())
  sol_ENETS[[i]] = pense(x[,1:ncol(x)-1], y, alpha = j, nlambda = 10, cv_k = 2, cl = cl)
  stopCluster(cl)
  i = i + 1
}

# pensem(x, alpha, scale, nlambda = 50, lambda,
# lambda_min_ratio, standardize, cv_k = 5, cv_objective,
# ncores = getOption("mc.cores", 1L), cl = NULL,
# mm_options = mstep_options(), en_options, x_train, y_train, ...)


```

This models show that:
*
*

The main drawback it's the computing time required.


#### Elastic net with LTS cost function

Robust elastic net, based on LTS, is now fitted to the model trimming 10% of the units. 
It requires the removal of the dummy variable for urban towns, 
in order to avoid numerical problems.
```{r enetLTS}

# , echo=FALSE, cache=FALSE

# slower version
sol_enetLTS = enetLTS(x[,1:ncol(x)-1], y, family="gaussian", alphas=seq(0, 1, 0.2), hsize=0.9,
                      nfold=10, para=FALSE, repl = 10, del=0.05) # parallel does not work on windows
# faster version
# sol_enetLTS = enetLTS(x[,1:ncol(x)-1], y, family="gaussian", alphas=seq(0, 1, 0.2), hsize=0.9, 
#                       nfold=10, para=TRUE, repl = 5, del=0.05) # parallel does not work on windows

print(sol_enetLTS)

# save original response (used for diagnostic plots)
sol_enetLTS$y = y
# construct mahalanobis distance (used for diagnostic plots)
sol_enetLTS$MD = hatvalues(ols) # covMcd(x)$mah # compute standard leverages
# construct robust scale (used for diagnostic plots)
sol_enetLTS$scale = sd(sol_enetLTS$residuals[as.logical(sol_enetLTS$wt)])
# interactive plots
RegressionPlots(sol_enetLTS, textlabels=towns, robust = TRUE, out = as.factor(sol_enetLTS$wt))[[1]]
RegressionPlots(sol_enetLTS, textlabels=towns, robust = TRUE, out = as.factor(sol_enetLTS$wt))[[2]]
RegressionPlots(sol_enetLTS, textlabels=towns, robust = TRUE, out = as.factor(sol_enetLTS$wt))[[3]]

# built-in plots
plotCoef.enetLTS(sol_enetLTS, cex = 5)
plotDiagnostic.enetLTS(sol_enetLTS, cex = 3)
plotResid.enetLTS(sol_enetLTS, cex = 3)
# plot(sol_enetLTS, method=c("coefficients","resid","diagnostic"), vers=c("reweighted","raw"),...)

# save outliers index 
outlENETLTS = as.logical(sol_enetLTS$wt)

# save coefficients in a dataframe
iENETLTS = c(TRUE, sol_enetLTS$coefficients != 0, FALSE) # manually include intercept and dummy for Urban
coeff.df$"ENETLTS" = as.character(rep("--", nrow(coeff.df)))
sol_enetLTS$coefficientsWithInt = c("(Intercept)" = sol_enetLTS$a0, sol_enetLTS$coefficients)
coeff.df$"ENETLTS"[iENETLTS] = sprintf("%.2e",sol_enetLTS$coefficientsWithInt[iENETLTS], 4)

## perform CV for user-values
# 
# Calculate lambda path (first get lambda_max):
# lambda_max <- 10 # max(abs(colSums(sx*sy)))/n
# epsilon <- .0001
# K <- 10
# lambdapath <- round(exp(seq(log(lambda_max), log(lambda_max*epsilon),
#                             length.out = K)), digits = 10)
# sol_cv = cv.enetLTS(index=NULL,x,y,family="gaussian",round(length(y)*0.90),
#            seq(0, 1, 0.2),lambdas = lambdapath ,ncores=1,repl=5,nfold=10,plot=TRUE)
# sol_cv

# OLS
modENETltsOLS = data.frame(mod[outlENETLTS,iENETLTS])
ENETlts_ols = lm(y ~ ., y=T, data = modENETltsOLS, subset = sol_enetLTS$wt)
summary(ENETlts_ols)

# diagnostic plots
# layout(matrix(1:6, nr = 2), height = c(2, 1))
RegressionPlots(ENETlts_ols, towns[outlENETLTS], robust = F, out=outlENETLTS[outlENETLTS==1])

# extract variables significant at 0.01% and save them in a data frame
  # coeff.df$"ENETLTS+OLS" = as.character(rep("--", nrow(coeff.df)))
  # coeff.df$"ENETLTS+OLS"[iENETLTS] = as.matrix(summary(ENETlts_ols)[["coefficients"]][,"Pr(>|t|)"]<0.01)
  # coeff.df[coeff.df==TRUE] = sprintf("%.2e",
  #                                    as.matrix(ENETlts_ols$coefficients[summary(ENETlts_ols)[["coefficients"]][,"Pr(>|t|)"]<0.01]), 4)
  # coeff.df[coeff.df==FALSE] = "--"

```

This model shows that:

* the algorithm (even if not parallelized for Windows users) is much faster than PENSE
* ...


#### 

```{r BSS weighted}

# best subset selection
best.sub = regsubsets(y ~ ., data = modENETltsOLS, nvmax = ncol(mod), method="exhaustive") # exhaustive # seqrep # really.big=TRUE, 
best.sub.summary = summary(best.sub)
# manual plotting
par(mfrow =c(2,2))
# rsq
plot(best.sub.summary$rsq , xlab="Number of Variables", ylab="Rsq", type="l")
ind_Rsq = which.max(best.sub.summary$rsq)
points(ind_Rsq, best.sub.summary$adjr2[ind_Rsq], col ="red", cex=2, pch=20)
# adjRsq
plot(best.sub.summary$adjr2 ,xlab="Number of Variables", ylab="Adjusted RSq", type="l")
ind_adjRsq = which.max(best.sub.summary$adjr2)
points(ind_adjRsq, best.sub.summary$adjr2[ind_adjRsq], col ="red", cex=2, pch=20)
# Cp
plot(best.sub.summary$cp ,xlab="Number of Variables", ylab="Cp", type="l")
ind_Cp = which.min(best.sub.summary$cp)
points(ind_Cp, best.sub.summary$cp[ind_adjRsq], col ="red", cex=2, pch=20)
# bic
plot(best.sub.summary$bic ,xlab="Number of Variables", ylab="bic", type="l")
ind_bic = which.min(best.sub.summary$bic)
points(ind_bic, best.sub.summary$bic[ind_bic], col ="red", cex=2, pch=20)
mtext("Best subset selection for delta gender salary", outer=TRUE,  cex=1.2, line=-1.5)

# built-in plots
par(mfrow=c(1,1))
plot.regsubsets2(best.sub, scale = "r2", cex.axis = 0.6, main = "Best subset selection using R squared")
plot.regsubsets2(best.sub, scale = "adjr2", cex.axis = 0.6, main = "Best subset selection using adjusted R squared")
plot.regsubsets2(best.sub, scale = "Cp", cex.axis = 0.6, main = "Best subset selection using Mallows' Cp")
plot.regsubsets2(best.sub, scale = "bic", cex.axis = 0.6, main = "Best subset selection using BIC")

# mtext("Best subset selection for salary 18-25 using BIC", outer=TRUE,  cex=1.4, line=-3.5)

# retrieve the model with min BIC
coefficients(best.sub, which.min(best.sub.summary$bic))
nnn = names(coefficients(best.sub, which.min(best.sub.summary$bic)))

# apply OLS
mod_bic = mod[mm$rweights>trim, colnames(mod) %in% nnn] 
mod_bic = cbind.data.frame(y = y[mm$rweights>trim], mod_bic)
ols_bic = lm(y ~ ., data = mod_bic, y=T)
summary(ols_bic)
RegressionPlots(ols_bic, textlabels = towns[mm$rweights>trim], out = dcity[mm$rweights>trim])

# extract variables significant at 0.01% and save them in a data frame
iBSS = row.names(coeff.df) %in% nnn
coeff.df$"BSS(SENET)" = as.character(rep("--", nrow(coeff.df)))
coeff.df$"BSS(SENET)"[iBSS] = sprintf("%.2e", as.matrix(coefficients(best.sub, which.min(best.sub.summary$bic))))

```


### Comparison of the estimates obtained

Print table of results
```{r}
#  print results in pdf, results='asis', render=normal_print


library(gridExtra)
pdf("results sparse models.pdf", height=11, width=12)
grid.table(coeff.df[1:23,])
grid.newpage()
grid.table(coeff.df[24:nrow(coeff.df),])
dev.off()

```

This function is used to plot a summary of the results obtained in HTML format
```{r interactive results for LM, echo = FALSE}

## PROCEDURE FOR WINDOWS
## Install 64bit java and set its Path in environmental variables, then
# install.packages('rJava', .libPaths()[1], 'http://www.rforge.net/')
# options(java.home="C:/Program Files/Java/jre1.8.0_171") 
# Sys.setenv(JAVA_HOME='C:/Program Files/Java/jre1.8.0_171') # for 64-bit version 
## Install more packages
# install.packages("ReporteRs")
# install.packages("shiny")
# install.packages("rsconnect")

# WARNING: in a notebook the interaction is dispayed just at the end
# BETTER TO RUN IT IN A NORMAL SCRIPT

# Load packages
library("rJava")
library("shiny")
library("ReporteRs")
library("rsconnect")

# store a dataframe with X in place of significant coefficients
mycoeff.df = coeff.df
mycoeff.df[mycoeff.df!="--"] = "X"

# ui
ui <- fluidPage(
  tags$h1("Linear models' results for gender's gap in salary"),
  tableOutput(outputId = "table"),
  br(),
  verbatimTextOutput(outputId = "out")
)

# server
server <- function(input, output) {
  output$table <- renderFlexTable({
    # Create checkboxes   
    mycoeff.df$Variable <- paste0('<label><input type="checkbox" id="car', 
                                  seq_along(rownames(coeff.df)), 
                                  '"> <span>', rownames(coeff.df), '</span></label>')
    mycoeff.df <- mycoeff.df[c("Variable", names(coeff.df))] # Put col 'Name' in the first place
    # mycoeff.df = coeff.df[,1:ncol(coeff.df)-1]
    ft <- vanilla.table(mycoeff.df) # convert to FlexTable object
    ft[, "Variable", to = "header"] <- parLeft() # left align checkboxes
    ft[, "Variable"] <- parLeft() # left align header
    return(ft)
  })
  # the inputs created are in input$car1, input$car2, ...
  output$out <- renderPrint({
    # results
    res <- unlist(lapply(1:nrow(mycoeff.df), function(i) input[[paste0("car", i)]]))
    # print(res)
    if (any(res)) {
      mycoeff.df[mycoeff.df!="--"] = "X"
      res2 = as.data.frame(coeff.df[res,]) # rownames(mycoeff.df)[res])
      print(res2)
    }
  })
}

# launch app
shinyApp(ui = ui, server = server)

```
