Unsupervised Methods

In this lab, we’ll look at how to implement two unsupervised classification methods:

  • k-means classification
  • Self-organizing maps

The data we will use is from the Gap Minder project. While you can download the individual variables from the web site, we will use a preprocessed set of the data covering the period 1801-2018, in the file gapminder_1800_2018.csv. Download this to your datafiles folder and make a new folder for today’s class called lab10. You will also need the shapefile of country borders: ne_50m_admin_0_countries.shp (you should have this from a previous lab).

Set up

## Set working directory

setwd("~/Desktop/University of Utah PhD /Course Work/Spring 2023 Semester/GEOG6160 - Spatial Modeling/Labs/lab10/")


## Load libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggpubr)
## Loading required package: ggplot2
library(RColorBrewer)
library(countrycode) # convert country codes and country names
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(kohonen) #self organizing maps
library(tmap)

Unsupervised Classification in R

## Read in the data
gap_df = read.csv("../datafiles/gapminder_1800_2018.csv")


## Check
head(gap_df)
##       country year child_mortality fertility per_cap_co2 income life_expectancy
## 1 Afghanistan 1800             469         7          NA    603            28.2
## 2 Afghanistan 1801             469         7          NA    603            28.2
## 3 Afghanistan 1802             469         7          NA    603            28.2
## 4 Afghanistan 1803             469         7          NA    603            28.2
## 5 Afghanistan 1804             469         7          NA    603            28.2
## 6 Afghanistan 1805             469         7          NA    603            28.2
##   population
## 1    3280000
## 2    3280000
## 3    3280000
## 4    3280000
## 5    3280000
## 6    3280000
##Check the variable names
names(gap_df)
## [1] "country"         "year"            "child_mortality" "fertility"      
## [5] "per_cap_co2"     "income"          "life_expectancy" "population"
## Create Histograms of the variables
h1 <- gap_df %>%
  gghistogram(x = "child_mortality")

h2 <- gap_df %>%
  gghistogram(x = "fertility")

h3 <- gap_df %>%
  gghistogram(x = "per_cap_co2")

h4 <- gap_df %>%
  gghistogram(x = "income")

h5 <- gap_df %>%
  gghistogram(x = "life_expectancy")

h6 <- gap_df %>%
  gghistogram(x = "population")

## Plot as one faceted output
ggarrange(h1, h2, h3, h4, h5, h6)

The histograms clearly show skew in most of the variables, and we will need to transform the data before analyzing it. We’ll do this in two steps. First, we’ll log transform all variables to remove the skew, then use the scale() function, which in R scales data to z-scores by subtracting the mean and dividing by the standard deviation. To simplify things, we do all this in a single step by nesting the log() function in the scale() function.

Note that we remove rows with missing observations (we could potentially impute these, but that is a little more difficult for time series data). We also remove any row where the per capita CO2 emissions are zero, partly because this would cause problems when log-transforming and partly because this is clearly not true.

## Create the data processing framework
## Drop NAs
## Remove rows of per cap CO2 of 0
## Finally log transform each variable

gap_df_scale <- gap_df %>% 
  drop_na() %>%
  filter(per_cap_co2 > 0) %>%
  mutate(
    child_mortality = scale(log(child_mortality)),
    fertility = scale(log(fertility)),
    per_cap_co2 = scale(log(per_cap_co2)),
    income = scale(log(income)),
    life_expectancy = scale(log(life_expectancy)),
    population = scale(log(population))
  )

The very last thing we’ll do is to extract a single year of data for the first part of the lab (just 2018):

## Extract the 2018 data
gap_df_scale2 = gap_df_scale %>% 
  filter(year == 2018)


##Check
head(gap_df_scale2)
##               country year child_mortality  fertility per_cap_co2     income
## 1         Afghanistan 2018     -0.26516474  0.1506520  -0.5552216 -0.8328214
## 2             Albania 2018     -1.79404903 -1.6975208   0.3237170  0.8904765
## 3             Algeria 2018     -1.02764487 -0.8336127   0.7271571  0.9982332
## 4              Angola 2018     -0.09746087  0.6444475   0.1558006  0.2173724
## 5 Antigua and Barbuda 2018     -2.04000237 -1.3562778   0.9504338  1.4721185
## 6           Argentina 2018     -1.70055729 -1.1427722   0.8125748  1.2405631
##   life_expectancy population
## 1       0.5332313  1.0768253
## 2       1.1959360 -0.3511459
## 3       1.1794885  1.1472112
## 4       0.5782861  0.9714548
## 5       1.1505013 -2.2476952
## 6       1.1338189  1.1755747

k-means Cluster Analysis

We’ll next use these data with the k-means cluster algorithm. The default function for this in R is kmeans(), and we need to pass the data we are going to use, and the number of requested clusters to this function. First let’s figure out which columns we need:

## Check the variable names

names(gap_df_scale2)
## [1] "country"         "year"            "child_mortality" "fertility"      
## [5] "per_cap_co2"     "income"          "life_expectancy" "population"
## Set up the kmeans for the last 6 variables

gap_kmeans = kmeans(gap_df_scale2[,3:8], 
                   centers = 6)

There output of this algorithm provides information about the cluster solution. We can see the size of each cluster (the number of observations assigned to each cluster):

## Check the cluster size
gap_kmeans$size
## [1] 21 34 42 29 19 39

And we can see the cluster assignments for each country:

## Check the cluster assignments for each country

gap_kmeans$cluster
##   [1] 6 2 4 6 2 4 2 3 3 4 2 2 5 2 3 3 1 6 1 5 2 1 4 2 3 6 6 5 6 3 1 6 6 3 4 4 1
##  [38] 6 6 2 6 3 4 2 3 3 1 4 4 4 5 1 6 2 1 6 1 3 3 1 6 2 3 6 3 2 5 6 6 1 6 5 3 2
##  [75] 4 4 4 4 3 3 3 2 3 5 3 6 1 3 5 5 2 2 1 6 2 2 2 6 6 3 2 6 2 6 2 4 1 2 2 2 4
## [112] 6 5 1 5 3 3 5 6 6 5 2 3 3 5 5 2 5 5 4 4 3 3 3 3 4 6 1 1 3 6 3 2 6 3 3 3 1
## [149] 6 4 3 6 3 4 2 2 6 2 3 3 5 5 6 4 1 6 1 2 4 4 4 6 4 3 3 3 2 4 1 4 4 6 6 6

And finally the cluster centers. These are the prototypes; the average value of each variable for all the observations assigned to a given cluster (note these are the log-transformed and scaled data)

## Check the cluster center averages

gap_kmeans$centers
##   child_mortality  fertility per_cap_co2      income life_expectancy population
## 1      -0.7327950 -0.4507884   0.1613643  0.23733360       0.7476560 -1.2086934
## 2      -1.8282980 -1.6122468   0.8234725  1.30657069       1.1184521 -0.9017155
## 3      -2.3342882 -1.7030141   1.1209210  1.89523576       1.2666004  0.5393696
## 4      -1.3047987 -1.2265247   0.6881191  0.90779333       1.0433002  1.2933892
## 5      -0.9467182 -0.8454630   0.1015151  0.04711013       0.9232515  0.5815970
## 6      -0.1947134  0.2959364  -0.6062523 -0.71611076       0.5354277  0.6349778

Here, we can see that cluster 6, for example, has the lowest life expectancy and GDP, as well as high infant mortality and fertility rates.

Maps

As these are spatial data, we can now plot the distribution of the clusters. First, we’ll load the world shapefile in the ne_50m_admin_0_countries folder (we used this in a previous lab:)

## Read in the border shapefile

borders = st_read("../datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp",
                  quiet = TRUE)

Next, we add the cluster assignments to the gap_df_scale2 data frame:

## Add the cluster assignments

gap_df_scale2$kmeans = as.factor(gap_kmeans$cluster)

Now we need to merge the gap_df_scale2 data frame with the world borders spatial object. To do this, we add a new column to the data frame containing the ISO3 standard country codes, using the countrycode package (you’ll need to install this).

## Add ISO3 standard country codes using the country codes package

gap_df_scale2 = gap_df_scale2 %>% 
  mutate(ISO3 = countrycode(sourcevar = country, 
                            origin = "country.name", 
                            destination = "iso3c"))

Finally, we use these ISO codes to merge the two datasets. We need to specify the column name for each dataset that contains the label to be used in merging, which we do with by.x and by.y.

## Merge the two datasets
cluster_sf = merge(borders,
                   gap_df_scale2,
                   by.x = "ADM0_A3",
                   by.y = "ISO3")

##Check
head(cluster_sf)
## Simple feature collection with 6 features and 103 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -73.57627 ymin: -55.03213 xmax: 74.89131 ymax: 42.64795
## Geodetic CRS:  WGS 84
##   ADM0_A3      featurecla scalerank LABELRANK           SOVEREIGNT SOV_A3
## 1     AFG Admin-0 country         1         3          Afghanistan    AFG
## 2     AGO Admin-0 country         1         3               Angola    AGO
## 3     ALB Admin-0 country         1         6              Albania    ALB
## 4     ARE Admin-0 country         1         4 United Arab Emirates    ARE
## 5     ARG Admin-0 country         1         2            Argentina    ARG
## 6     ARM Admin-0 country         1         6              Armenia    ARM
##   ADM0_DIF LEVEL              TYPE                ADMIN GEOU_DIF
## 1        0     2 Sovereign country          Afghanistan        0
## 2        0     2 Sovereign country               Angola        0
## 3        0     2 Sovereign country              Albania        0
## 4        0     2 Sovereign country United Arab Emirates        0
## 5        0     2 Sovereign country            Argentina        0
## 6        0     2 Sovereign country              Armenia        0
##                GEOUNIT GU_A3 SU_DIF              SUBUNIT SU_A3 BRK_DIFF
## 1          Afghanistan   AFG      0          Afghanistan   AFG        0
## 2               Angola   AGO      0               Angola   AGO        0
## 3              Albania   ALB      0              Albania   ALB        0
## 4 United Arab Emirates   ARE      0 United Arab Emirates   ARE        0
## 5            Argentina   ARG      0            Argentina   ARG        0
## 6              Armenia   ARM      0              Armenia   ARM        0
##                   NAME            NAME_LONG BRK_A3             BRK_NAME
## 1          Afghanistan          Afghanistan    AFG          Afghanistan
## 2               Angola               Angola    AGO               Angola
## 3              Albania              Albania    ALB              Albania
## 4 United Arab Emirates United Arab Emirates    ARE United Arab Emirates
## 5            Argentina            Argentina    ARG            Argentina
## 6              Armenia              Armenia    ARM              Armenia
##   BRK_GROUP ABBREV POSTAL                    FORMAL_EN FORMAL_FR
## 1      <NA>   Afg.     AF Islamic State of Afghanistan      <NA>
## 2      <NA>   Ang.     AO  People's Republic of Angola      <NA>
## 3      <NA>   Alb.     AL          Republic of Albania      <NA>
## 4      <NA> U.A.E.     AE         United Arab Emirates      <NA>
## 5      <NA>   Arg.     AR           Argentine Republic      <NA>
## 6      <NA>   Arm.    ARM          Republic of Armenia      <NA>
##             NAME_CIAWF NOTE_ADM0 NOTE_BRK            NAME_SORT NAME_ALT
## 1          Afghanistan      <NA>     <NA>          Afghanistan     <NA>
## 2               Angola      <NA>     <NA>               Angola     <NA>
## 3              Albania      <NA>     <NA>              Albania     <NA>
## 4 United Arab Emirates      <NA>     <NA> United Arab Emirates     <NA>
## 5            Argentina      <NA>     <NA>            Argentina     <NA>
## 6              Armenia      <NA>     <NA>              Armenia     <NA>
##   MAPCOLOR7 MAPCOLOR8 MAPCOLOR9 MAPCOLOR13  POP_EST POP_RANK GDP_MD_EST
## 1         5         6         8          7 34124811       15      64080
## 2         3         2         6          1 29310273       15     189000
## 3         1         4         1          6  3047987       12      33900
## 4         2         1         3          3  6072475       13     667200
## 5         3         1         3         13 44293293       15     879400
## 6         3         1         2         10  3045191       12      26300
##   POP_YEAR LASTCENSUS GDP_YEAR                   ECONOMY
## 1     2017       1979     2016 7. Least developed region
## 2     2017       1970     2016 7. Least developed region
## 3     2017       2001     2016      6. Developing region
## 4     2017       2010     2016      6. Developing region
## 5     2017       2010     2016   5. Emerging region: G20
## 6     2017       2001     2016      6. Developing region
##                INCOME_GRP WIKIPEDIA FIPS_10_ ISO_A2 ISO_A3 ISO_A3_EH ISO_N3
## 1           5. Low income       -99       AF     AF    AFG       AFG    004
## 2  3. Upper middle income       -99       AO     AO    AGO       AGO    024
## 3  4. Lower middle income       -99       AL     AL    ALB       ALB    008
## 4 2. High income: nonOECD       -99       AE     AE    ARE       ARE    784
## 5  3. Upper middle income       -99       AR     AR    ARG       ARG    032
## 6  4. Lower middle income       -99       AM     AM    ARM       ARM    051
##   UN_A3 WB_A2 WB_A3   WOE_ID WOE_ID_EH                   WOE_NOTE ADM0_A3_IS
## 1   004    AF   AFG 23424739  23424739 Exact WOE match as country        AFG
## 2   024    AO   AGO 23424745  23424745 Exact WOE match as country        AGO
## 3   008    AL   ALB 23424742  23424742 Exact WOE match as country        ALB
## 4   784    AE   ARE 23424738  23424738 Exact WOE match as country        ARE
## 5   032    AR   ARG 23424747  23424747 Exact WOE match as country        ARG
## 6   051    AM   ARM 23424743  23424743 Exact WOE match as country        ARM
##   ADM0_A3_US ADM0_A3_UN ADM0_A3_WB     CONTINENT REGION_UN       SUBREGION
## 1        AFG        -99        -99          Asia      Asia   Southern Asia
## 2        AGO        -99        -99        Africa    Africa   Middle Africa
## 3        ALB        -99        -99        Europe    Europe Southern Europe
## 4        ARE        -99        -99          Asia      Asia    Western Asia
## 5        ARG        -99        -99 South America  Americas   South America
## 6        ARM        -99        -99          Asia      Asia    Western Asia
##                    REGION_WB NAME_LEN LONG_LEN ABBREV_LEN TINY HOMEPART
## 1                 South Asia       11       11          4  -99        1
## 2         Sub-Saharan Africa        6        6          4  -99        1
## 3      Europe & Central Asia        7        7          4  -99        1
## 4 Middle East & North Africa       20       20          6  -99        1
## 5  Latin America & Caribbean        9        9          4  -99        1
## 6      Europe & Central Asia        7        7          4  -99        1
##   MIN_ZOOM MIN_LABEL MAX_LABEL      NE_ID WIKIDATAID                  NAME_AR
## 1        0         3         7 1159320319       Q889                أفغانستان
## 2        0         3         7 1159320323       Q916                   أنغولا
## 3        0         5        10 1159320325       Q222                  ألبانيا
## 4        0         4         9 1159320329       Q878 الإمارات العربية المتحدة
## 5        0         2         7 1159320331       Q414                الأرجنتين
## 6        0         5        10 1159320333       Q399                  أرمينيا
##            NAME_BN                      NAME_DE              NAME_EN
## 1       আফগানিস্তান                  Afghanistan          Afghanistan
## 2         অ্যাঙ্গোলা                       Angola               Angola
## 3         আলবেনিয়া                     Albanien              Albania
## 4 সংযুক্ত আরব আমিরাত Vereinigte Arabische Emirate United Arab Emirates
## 5        আর্জেন্টিনা                  Argentinien            Argentina
## 6         আর্মেনিয়া                     Armenien              Armenia
##                  NAME_ES             NAME_FR                 NAME_EL
## 1             Afganistán         Afghanistan              Αφγανιστάν
## 2                 Angola              Angola                 Ανγκόλα
## 3                Albania             Albanie                 Αλβανία
## 4 Emiratos Árabes Unidos Émirats arabes unis Ηνωμένα Αραβικά Εμιράτα
## 5              Argentina           Argentine               Αργεντινή
## 6                Armenia             Arménie                 Αρμενία
##           NAME_HI                 NAME_HU         NAME_ID             NAME_IT
## 1      अफ़्गानिस्तान             Afganisztán      Afganistan         Afghanistan
## 2           अंगोला                  Angola          Angola              Angola
## 3        अल्बानिया                 Albánia         Albania             Albania
## 4 संयुक्त अरब अमीरात Egyesült Arab Emírségek Uni Emirat Arab Emirati Arabi Uniti
## 5        अर्जेण्टीना               Argentína       Argentina           Argentina
## 6        आर्मीनिया            Örményország         Armenia             Armenia
##            NAME_JA      NAME_KO                      NAME_NL
## 1   アフガニスタン 아프가니스탄                  Afghanistan
## 2         アンゴラ       앙골라                       Angola
## 3       アルバニア     알바니아                      Albanië
## 4 アラブ首長国連邦 아랍에미리트 Verenigde Arabische Emiraten
## 5     アルゼンチン   아르헨티나                   Argentinië
## 6       アルメニア   아르메니아                      Armenië
##                        NAME_PL                NAME_PT
## 1                   Afganistan            Afeganistão
## 2                       Angola                 Angola
## 3                      Albania                Albânia
## 4 Zjednoczone Emiraty Arabskie Emirados Árabes Unidos
## 5                    Argentyna              Argentina
## 6                      Armenia                Arménia
##                         NAME_RU               NAME_SV                   NAME_TR
## 1                    Афганистан           Afghanistan                Afganistan
## 2                        Ангола                Angola                    Angola
## 3                       Албания              Albanien                Arnavutluk
## 4 Объединённые Арабские Эмираты Förenade Arabemiraten Birleşik Arap Emirlikleri
## 5                     Аргентина             Argentina                  Arjantin
## 6                       Армения              Armenien                Ermenistan
##                                NAME_VI          NAME_ZH              country
## 1                          Afghanistan           阿富汗          Afghanistan
## 2                               Angola           安哥拉               Angola
## 3                              Albania       阿尔巴尼亚              Albania
## 4 Các Tiểu vương quốc Ả Rập Thống nhất 阿拉伯联合酋长国 United Arab Emirates
## 5                            Argentina           阿根廷            Argentina
## 6                              Armenia         亞美尼亞              Armenia
##   year child_mortality  fertility per_cap_co2     income life_expectancy
## 1 2018     -0.26516474  0.1506520  -0.5552216 -0.8328214       0.5332313
## 2 2018     -0.09746087  0.6444475   0.1558006  0.2173724       0.5782861
## 3 2018     -1.79404903 -1.6975208   0.3237170  0.8904765       1.1959360
## 4 2018     -1.91150716 -1.6859215   1.5694888  2.3788493       0.9927787
## 5 2018     -1.70055729 -1.1427722   0.8125748  1.2405631       1.1338189
## 6 2018     -1.52762031 -1.8297863   0.4065443  0.6326736       1.0917298
##   population kmeans                       geometry
## 1  1.0768253      6 MULTIPOLYGON (((66.52227 37...
## 2  0.9714548      6 MULTIPOLYGON (((13.07275 -4...
## 3 -0.3511459      2 MULTIPOLYGON (((19.34238 41...
## 4  0.3225620      3 MULTIPOLYGON (((56.29785 25...
## 5  1.1755747      4 MULTIPOLYGON (((-57.60889 -...
## 6 -0.3377426      2 MULTIPOLYGON (((44.76826 39...

Finally, we can plot out the clusters using the tmap package, which highlights the position of this poorer cluster 6 in Saharan and sub-Saharan Africa:

## Map the clusters
tm_shape(cluster_sf) +
  tm_fill("kmeans", palette = "Dark2") +
  tm_legend(legend.position = c("left", "bottom"))

Hierarchical Cluster Analysis

We’ll briefly demonstrate the use of hierarchical clustering here. This works a little differently in R, as it requires you to first calculate a dissimilarity matrix (the aggregate difference between each pair of observations), and then use this for clustering.

The R function dist() will calculate this, based on simple Euclidean dissimiarity between samples:

## Calculate dissimilarity matrix
gap_d = dist(gap_df_scale2[,3:8])

We can now use this to carry out hierarchical clustering using the hclust() function:

## Hierarchical Clustering
gap_hclust = hclust(gap_d)

And you can then plot the resulting dendrogram with:

## Visualize

plot(gap_hclust)

To actually get a cluster assigment for each observation, we need to ‘cut’ the tree in a way that it will result in the desired number of groups, using the well-named cutree() function, To get 6 clusters:

## Cut tree into 6 clusters

gap_cluster = cutree(gap_hclust,
                     k = 6)

## Check 
head(gap_cluster)
## [1] 1 2 3 4 5 3

To see what this has done, you can overlay the clusters on the dendogram as follows:

## Visualize the cut tree

plot(gap_hclust)
rect.hclust(gap_hclust,
            k = 6)

The clusters that you obtain from the cutree() function can be used in the same way that we used the k -means clusters. Here, we’ll append them to the spatial object (cluster_sf) and map them out:

## Merge hclust to the gap_df_scale2 data frame
## Set as factor (category)
gap_df_scale2$hclust = as.factor(gap_cluster)

## Re-merge the data frame to the sf
cluster_sf = merge(x = borders,
                   y = gap_df_scale2,
                   by.x = "ADM0_A3",
                   by.y = "ISO3")
## Map the hclusters

tm_shape(cluster_sf) + 
  tm_fill("hclust", palette = "Set2") +
  tm_legend(legend.position = c("left", "bottom"))

Self-organizing Maps

We’ll now repeat this analysis with a self-organizing map (SOM), but using the full dataset. For this we’ll use the kohonen package, which allows the construction of unsupervised and supervised SOMs.

Building the SOM is relatively simple: first we construct the grid of nodes (30x20) using the somgrid() function. Note the topo argument that controls the neighborhood shape (rectangular or hexagonal):

## Build the SOM grid build 30x20 nodes
## How many nodes to build??

som_grid = somgrid(xdim = 30,
                   ydim = 20,
                   topo = "hexagonal")

## Visualize the empty grid
plot(som_grid)

Now we’ll train the SOM. The function we use is som(), and we need to specify the data to be used as well as the SOM grid. The function requires the data to be as a matrix rather than an R data frame, so we first convert this, then pass it to the function:

## Convert the dataframe into a matrix
gap_mat = as.matrix(gap_df_scale[,3:8])

## Train the SOM
gap_som = som(X = gap_mat,
              grid = som_grid)

Once finished, we need to see if the algorithm has converged. We can do this by plotting the change in the loss function as a function of the iterations. As a reminder, the loss function here is the mean distance between the node weights and the observations assigned to that node:

## Visualize the change in the loss function
plot(gap_som,
     type = "changes")

We can see here that the loss function has decreased, but not stabilized suggesting that the algorithm has converged. We’ll re-run it for a larger number of iterations by setting rlen to 1000 (this may take a couple of minutes to run):

## Time estimate
st = Sys.time()

## Rerun the SOM training
gap_som = som(X = gap_mat,
              grid = som_grid,
              rlen = 1000)


et = Sys.time()
tt = et-st

## Computation Time
print(tt)
## Time difference of 1.281102 mins
## Visualize
plot(gap_som, type = "changes")

Note: This plot shows a series of step like drops in the loss function, finally stabilizing at approximately iteration 900-1000.

Plots

Next we’ll go through some visualizations of the data. We’ve already see then loss function changes, so next we’ll make a couple of plots to show how the SOM has trained. First a plot of distances between nodes, which can be useful to identify outliers:

## Visualize

plot(gap_som, type = "dist")

Next, we plot the number of observations per cluster to look for empty nodes.

## Plot to identify empty notes
plot(gap_som, type = "counts")

The map has a couple of empty nodes. Empty nodes can be common when using larger grid sizes, and can be taken to represent a part of the parameter space that is missing in the observations.

Next, we’ll plot the codebook vectors. These are the weights associated with each feature, and can be used to interpret the organization of the SOM. For each node, there is a radial plot showing the value of each feature associated with it. Note that the type of plot can change here. If you have a large number of features, this will plot as a line plot rather than a radial plot.

## Plot to identify empty notes
plot(gap_som, type = "code")
## Warning in par(opar): argument 1 does not name a graphical parameter

The map shows a clear gradient from richer regions (income, orange) to poorer regions, and a gradient from smaller populations at the top to smaller at the bottom. Note that the life expectancy follows the gradient of GDP fairly well, and the poorer regions tend to have higher infant mortality and fertility rates.

It is also possible to plot out the individual feature weights as a heatmap to illustrate the gradient using type = “property”. The codebook vectors of weights are contained within the gap_som object. The syntax to extract them is a little complicated, and some examples are given below

Population

## Visualize the population heat map

plot(gap_som, 
     type = "property", 
     property = gap_som$codes[[1]][,"population"],
     main = "Pop")

Life Expectancy

## Visualize the heat map

plot(gap_som, 
     type = "property", 
     property = gap_som$codes[[1]][,"life_expectancy"],
     main = "Life Exp.")

Income (per capita GDP)

## Visualize the heat map

plot(gap_som, 
     type = "property", 
     property = gap_som$codes[[1]][,"income"],
     main = "Income")

CO2 Emissions

## Visualize the heat map

plot(gap_som, 
     type = "property", 
     property = gap_som$codes[[1]][,"per_cap_co2"],
     main = "Per Capita CO2")

Cluster Analysis

As a next step, we’ll cluster the nodes of the map to try and identify homogenous regions to simplify the interpretation of the SOM. While k-means is often used here, we’ll use a hierarchical algorithm (for reasons that will be clear shortly) to form 6 clusters.

Again, we first calculate the dissimilarity matrix, but this time between the values (codes) of each of the SOM nodes.

Then we use the cutree() and hclust() function to form the clusters:

## Calculate the distance/dissimilarity matrix
## Reminder codes are the values 
dist_codes = dist(x = gap_som$codes[[1]])


## Form the clusters 
som_cluster = cutree(hclust(dist_codes),
                     k = 6)

We can now show this on the SOM grid using the plot() function, setting type to mapping and using the cluster values in som_cluster to set the colors. The cluster boundaries can also be overlaid with add.cluster.boundaries()

## Create a palette
mypal = brewer.pal(n = 6,
                   "Set2")

## Plot The clusters
## Add boundaries

plot(gap_som, 
     type="mapping", 
     bgcol = mypal[som_cluster], 
     main = "Clusters", 
     pch = '.') 
add.cluster.boundaries(gap_som, som_cluster)

The main problem with this is that the clusters are not-contiguous on the SOM grid. This reflects the non-linear mapping that a SOM carries out (and may also suggest that 6 clusters is too high). We can force the clusters to be contiguous by including the distance between the nodes on the SOM grid as well as the dissimilarity between codebook values. We’ve already calulated this latter value above (as codes_dist), so let’s get the distance between the nodes using the handy helper function unit.distances():

We then multiply these two distance/dissimilarity matrices together:

## Distance between nodes

dist_grid = unit.distances(som_grid)


## Then multiply the two distance/dissimalirity matrices together
dist_adj = as.matrix(dist_codes) * dist_grid

If we now repeat the clustering, the inclusion of the distances between nodes now forces these clusters to be contiguous:

## Repeat the clustering to include the distances between nodes
## What is the ward.D2 method?

clust_adj = hclust(as.dist(dist_adj), 
                   method = 'ward.D2')
som_cluster_adj = cutree(clust_adj, 6)


## Visualize
plot(gap_som, 
     type="mapping", 
     bgcol = mypal[som_cluster_adj], 
     main = "Clusters", 
     pch = '.') 
add.cluster.boundaries(gap_som, som_cluster_adj)

Next, we obtain aggregate values for each cluster. While we can get these values from the gap_som object, we can also work back to get values from the original data (in gap_df). In the following code we:

  1. Extract a vector with the node assignments for each country
  2. Use this to get the cluster number for the node that the country is assigned to
  3. Attach this to the original gap2_df dataframe as a factor (rather than an integer)
## Extract a vector with the node assignments
nodeByCountry <- gap_som$unit.classif

## Assign the nodes to the original data frame
gap_df_scale$somclust <- as.factor(som_cluster_adj[nodeByCountry])

## Check
str(gap_df_scale)
## 'data.frame':    17676 obs. of  9 variables:
##  $ country        : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year           : int  1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 ...
##  $ child_mortality: num [1:17676, 1] 1.22 1.22 1.21 1.2 1.19 ...
##   ..- attr(*, "scaled:center")= num 4.47
##   ..- attr(*, "scaled:scale")= num 1.28
##  $ fertility      : num [1:17676, 1] 1.26 1.26 1.26 1.26 1.25 ...
##   ..- attr(*, "scaled:center")= num 1.39
##   ..- attr(*, "scaled:scale")= num 0.503
##  $ per_cap_co2    : num [1:17676, 1] -2.9 -2.06 -2.03 -2.04 -1.97 ...
##   ..- attr(*, "scaled:center")= num -0.212
##   ..- attr(*, "scaled:scale")= num 2.09
##  $ income         : num [1:17676, 1] -0.564 -0.553 -0.542 -0.528 -0.489 ...
##   ..- attr(*, "scaled:center")= num 8.41
##   ..- attr(*, "scaled:scale")= num 1.13
##  $ life_expectancy: num [1:17676, 1] -1.64 -1.63 -1.59 -1.52 -1.45 ...
##   ..- attr(*, "scaled:center")= num 3.99
##   ..- attr(*, "scaled:scale")= num 0.311
##  $ population     : num [1:17676, 1] 0.192 0.201 0.208 0.215 0.222 ...
##   ..- attr(*, "scaled:center")= num 15.5
##   ..- attr(*, "scaled:scale")= num 1.79
##  $ somclust       : Factor w/ 6 levels "1","2","3","4",..: 4 4 4 4 4 4 4 4 4 4 ...

To get the values for each cluster on the original, non-transformed scale, we need to merge gap_df_scale which contains the cluster assignments with the original data (gap_df)

## Merge the two data frames
gap_df_clus = merge(x = gap_df,
                    y = gap_df_scale,
                    by = c("country", "year"))

## Check
head(gap_df_clus)
##       country year child_mortality.x fertility.x per_cap_co2.x income.x
## 1 Afghanistan 1949               418        7.56       0.00192     2360
## 2 Afghanistan 1950               416        7.57       0.01090     2390
## 3 Afghanistan 1951               413        7.56       0.01170     2420
## 4 Afghanistan 1952               407        7.55       0.01150     2460
## 5 Afghanistan 1953               401        7.54       0.01320     2570
## 6 Afghanistan 1954               395        7.53       0.01310     2580
##   life_expectancy.x population.x child_mortality.y fertility.y per_cap_co2.y
## 1              32.4      7620000          1.223506    1.259275     -2.896148
## 2              32.5      7750000          1.219755    1.261905     -2.064037
## 3              32.9      7840000          1.214095    1.259275     -2.030096
## 4              33.6      7940000          1.202650    1.256642     -2.038359
## 5              34.3      8040000          1.191035    1.254006     -1.972291
## 6              35.0      8150000          1.179245    1.251366     -1.975935
##     income.y life_expectancy.y population.y somclust
## 1 -0.5642644         -1.637728    0.1919013        4
## 2 -0.5531338         -1.627831    0.2013428        4
## 3 -0.5421421         -1.588548    0.2077869        4
## 4 -0.5276965         -1.520938    0.2148608        4
## 5 -0.4891505         -1.454722    0.2218461        4
## 6 -0.4857286         -1.389844    0.2294304        4

Now we can use this list of clusters by country to aggregate the variables:

## Aggregate the variables by group

cluster.vals = aggregate(gap_df_clus[,3:8],
                         by = list(gap_df_scale$somclust),
                         FUN = mean )


## Check
cluster.vals
##   Group.1 child_mortality.x fertility.x per_cap_co2.x  income.x
## 1       1         344.28571    5.393164     1.4178033  2755.990
## 2       2         154.41043    5.409502     1.2812579  4202.002
## 3       3          45.87448    2.402679     6.8197455 17229.733
## 4       4         286.56349    6.227594     0.1644801  1626.150
## 5       5          20.53622    2.289385     8.7255613 22919.833
## 6       6          81.89601    4.975301     3.5040853  8377.746
##   life_expectancy.x population.x
## 1          36.38351     52044859
## 2          55.63231     25783594
## 3          69.77503     52602445
## 4          40.48865      5792850
## 5          73.31479      3792450
## 6          64.20135      1135660

Note: This table can help us infer characteristics of the clusters

Exercise

For the exercise, you will need to build use an unsupervised classification method with the California housing dataset using a self-organizing map. Your answer should use the following variables (you can use code from a previous example to create the average number of rooms per house, and the ratio of bedrooms to rooms):

  • housing_median_age
  • population
  • median_income
  • median_house_value
  • avg_rooms
  • bedroom_ratio

Note on the data: This is a data set of California house prices from the file housing.csv, which is available through Canvas. The data are taken from a paper by Kelley and Barry (“Sparse spatial autoregressions.” Statistics & Probability Letters 33.3 (1997): 291-297), and is a classic dataset used in machine learning examples. You should also download the zipped file ca.zip. This contains a shapefile of California’s border. Move both of these to your datafiles directory and unzip and extract the files from ca.zip.

The csv file contains the following columns most of which should be self-explanatory), with values for each California district taken from the 1990 census:

  • longitude
  • latitude
  • housing_median_age
  • total_rooms: total number of rooms in the district
  • total_bedrooms: total number of bedrooms in the district
  • population
  • households: total number of households in the district
  • median_income
  • median_house_value
  • ocean_proximity: a categorical value giving the position of the district relative to the ocean
## Read in the data

## Housing
housing = read.csv("../datafiles/housing.csv")

## Check
str(housing)
## 'data.frame':    20640 obs. of  10 variables:
##  $ longitude         : num  -122 -122 -122 -122 -122 ...
##  $ latitude          : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: num  41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : num  880 7099 1467 1274 1627 ...
##  $ total_bedrooms    : num  129 1106 190 235 280 ...
##  $ population        : num  322 2401 496 558 565 ...
##  $ households        : num  126 1138 177 219 259 ...
##  $ median_income     : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ median_house_value: num  452600 358500 352100 341300 342200 ...
##  $ ocean_proximity   : chr  "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
## Check for missing values
colSums(is.na(housing))
##          longitude           latitude housing_median_age        total_rooms 
##                  0                  0                  0                  0 
##     total_bedrooms         population         households      median_income 
##                207                  0                  0                  0 
## median_house_value    ocean_proximity 
##                  0                  0
##Remove missing values using dplyr's filter function

housing = housing %>% 
  filter(!is.na(total_bedrooms))

##Check
colSums(is.na(housing))
##          longitude           latitude housing_median_age        total_rooms 
##                  0                  0                  0                  0 
##     total_bedrooms         population         households      median_income 
##                  0                  0                  0                  0 
## median_house_value    ocean_proximity 
##                  0                  0
## Create a variable for average rooms
housing$avg_rooms = housing$total_rooms / housing$households

## Create a variable for bedroom ratio
housing$bedroom_ratio = housing$total_bedrooms / housing$total_rooms


## Check
head(housing)
##   longitude latitude housing_median_age total_rooms total_bedrooms population
## 1   -122.23    37.88                 41         880            129        322
## 2   -122.22    37.86                 21        7099           1106       2401
## 3   -122.24    37.85                 52        1467            190        496
## 4   -122.25    37.85                 52        1274            235        558
## 5   -122.25    37.85                 52        1627            280        565
## 6   -122.25    37.85                 52         919            213        413
##   households median_income median_house_value ocean_proximity avg_rooms
## 1        126        8.3252             452600        NEAR BAY  6.984127
## 2       1138        8.3014             358500        NEAR BAY  6.238137
## 3        177        7.2574             352100        NEAR BAY  8.288136
## 4        219        5.6431             341300        NEAR BAY  5.817352
## 5        259        3.8462             342200        NEAR BAY  6.281853
## 6        193        4.0368             269700        NEAR BAY  4.761658
##   bedroom_ratio
## 1     0.1465909
## 2     0.1557966
## 3     0.1295160
## 4     0.1844584
## 5     0.1720959
## 6     0.2317737
##Dimensions
dim(housing)
## [1] 20433    12

Note - There are 20433 total observations

## Variable name check
names(housing)
##  [1] "longitude"          "latitude"           "housing_median_age"
##  [4] "total_rooms"        "total_bedrooms"     "population"        
##  [7] "households"         "median_income"      "median_house_value"
## [10] "ocean_proximity"    "avg_rooms"          "bedroom_ratio"
## Save a copy of the coordinates
housing.crds = housing %>%
  select(longitude, latitude)

## Create a vector with the variable names for easy column extraction
var_names = c("housing_median_age", "population", "median_income", "median_house_value",
              "avg_rooms", "bedroom_ratio")

## Test the subset
#housing[, var_names]
## Create histograms to check for normality

hh1 = housing %>% 
  gghistogram(x = "housing_median_age")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
hh2 = housing %>% 
  gghistogram(x = "population")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
hh3 = housing %>% 
  gghistogram(x = "median_income")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
hh4 = housing %>% 
  gghistogram(x = "median_house_value")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
hh5 = housing %>% 
  gghistogram(x = "avg_rooms")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
hh6 = housing %>% 
  gghistogram(x = "bedroom_ratio")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
## Visualize
ggarrange(hh1, hh2, hh3, hh4, hh5, hh6)

## Do I need to scale the bedroom ratio data?
## Exclude the outliers
temp.df = housing %>% 
  filter(bedroom_ratio < 1)

## Replot bedroom ratio
hh6 = temp.df %>% 
  gghistogram(x = "bedroom_ratio",
              bins = 10)

hh6

max(temp.df$bedroom_ratio)

## The maximum is still 0.92
## Transform and Scale the data
## Note **I had an issue log transforming the bedroom ratio data**
## BR is only scaled here
housing_scale = housing %>% 
  mutate(
    housing_median_age = scale(log(housing_median_age)),
    population = scale(log(population)),
    median_income = scale(log(median_income)),
    median_house_value = scale(log(median_house_value)),
    avg_rooms = scale(log(avg_rooms)),
    bedroom_ratio = scale(log(bedroom_ratio))
  )

## Check
head(housing_scale)
##   longitude latitude housing_median_age total_rooms total_bedrooms population
## 1   -122.23    37.88          0.8603451         880            129 -1.6912077
## 2   -122.22    37.86         -0.3171483        7099           1106  1.0288051
## 3   -122.24    37.85          1.2786352        1467            190 -1.1063098
## 4   -122.25    37.85          1.2786352        1274            235 -0.9468488
## 5   -122.25    37.85          1.2786352        1627            280 -0.9299706
## 6   -122.25    37.85          1.2786352         919            213 -1.3542388
##   households median_income median_house_value ocean_proximity  avg_rooms
## 1        126     1.8584612           1.647690        NEAR BAY  1.0756316
## 2       1138     1.8523782           1.238219        NEAR BAY  0.6645069
## 3        177     1.5668026           1.206573        NEAR BAY  1.6986787
## 4        219     1.0322337           1.151844        NEAR BAY  0.4103292
## 5        259     0.2177044           1.156470        NEAR BAY  0.6899240
## 6        193     0.3204728           0.738206        NEAR BAY -0.3184999
##   bedroom_ratio
## 1    -1.4009772
## 2    -1.1517965
## 3    -1.9076447
## 4    -0.4608930
## 5    -0.7447132
## 6     0.4732963
## Dimensions
dim(housing_scale)
## [1] 20433    12

Note - Still 20433 observations

## Read in the shapefile

ca_sf = st_read("../datafiles/ca/ca.shp",
                crs = 4326,
                quiet = TRUE)

## Check
str(ca_sf)
## Classes 'sf' and 'data.frame':   1 obs. of  15 variables:
##  $ REGION  : chr "4"
##  $ DIVISION: chr "9"
##  $ STATEFP : chr "06"
##  $ STATENS : chr "01779778"
##  $ GEOID   : chr "06"
##  $ STUSPS  : chr "CA"
##  $ NAME    : chr "California"
##  $ LSAD    : chr "00"
##  $ MTFCC   : chr "G4000"
##  $ FUNCSTAT: chr "A"
##  $ ALAND   : chr "403660088482"
##  $ AWATER  : chr "20305454540"
##  $ INTPTLAT: chr "+37.1551773"
##  $ INTPTLON: chr "-119.5434183"
##  $ geometry:sfc_MULTIPOLYGON of length 1; first list element: List of 7
##   ..$ :List of 1
##   .. ..$ : num [1:11640, 1:2] -124 -124 -124 -124 -124 ...
##   ..$ :List of 1
##   .. ..$ : num [1:392, 1:2] -119 -119 -119 -119 -119 ...
##   ..$ :List of 1
##   .. ..$ : num [1:154, 1:2] -119 -119 -119 -119 -119 ...
##   ..$ :List of 1
##   .. ..$ : num [1:191, 1:2] -119 -119 -119 -119 -119 ...
##   ..$ :List of 1
##   .. ..$ : num [1:121, 1:2] -120 -120 -120 -120 -120 ...
##   ..$ :List of 1
##   .. ..$ : num [1:93, 1:2] -123 -123 -123 -123 -123 ...
##   ..$ :List of 1
##   .. ..$ : num [1:77, 1:2] -119 -119 -119 -119 -119 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:14] "REGION" "DIVISION" "STATEFP" "STATENS" ...

Loss Function Plot

## Initialize a Grid
h.som.grid = somgrid(xdim = 30,
                     ydim = 20,
                     topo = "hexagonal")

## Check empty grid
plot(h.som.grid)

## Convert from data frame to matrix
housing.mat = as.matrix(housing_scale[ , var_names])

dim(housing.mat)
## [1] 20433     6

Note - Still 20433 observations

## Timing 
st1 = Sys.time()

## Train the SOM
housing.som = som(X = housing.mat,
                  grid = h.som.grid,
                  rlen = 1500)

et1 = Sys.time()

## Total time elapsed
print(et1 - st1)
## Time difference of 2.254369 mins
## Check the number of observations
length(housing.som$unit.classif)
## [1] 20433
##PROBLEM IS HERE!
##Fixed

The issue is here Why am I losing observations? Down to 10,423??

Note: After re-reading in the data, I did not log transform the bedroom ratio as this was producing NaNs

## Plot the loss function by iterartions
plot(housing.som,
     type = "changes")

Note: The loss function stabilized after approximately 1450 iterations.

Counts per Nodes Plot

## Visualization of the number of observations per node

plot(housing.som,
     type = "counts")

Note: It appears that there is one empty node

Codebook Vectors Plot

## Visualize the codebook vector plot

plot(housing.som,
     type = "code")
## Warning in par(opar): argument 1 does not name a graphical parameter

Feature Heat Maps

## Visualizing Housing Median Age

plot(housing.som,
     type = "property",
     property = housing.som$codes[[1]][,"housing_median_age"],
     main = "Housing Median Age")

## Visualizing Median Income

plot(housing.som,
     type = "property",
     property = housing.som$codes[[1]][,"median_income"],
     main = "Median Income")

## Visualizing Population

plot(housing.som,
     type = "property",
     property = housing.som$codes[[1]][,"population"],
     main = "Population")

## Visualizing Median House Value

plot(housing.som,
     type = "property",
     property = housing.som$codes[[1]][,"median_house_value"],
     main = "Median House Value")

## Visualizing Average Rooms

plot(housing.som,
     type = "property",
     property = housing.som$codes[[1]][,"avg_rooms"],
     main = "Average Rooms")

## Visualizing Bedroom Ratio

plot(housing.som,
     type = "property",
     property = housing.som$codes[[1]][,"bedroom_ratio"],
     main = "Bedroom Ratio")

Set of Clusters for the SOM Nodes

## Calculate the dissimilarity matrix between the values (codes) of each SOM node
h.dist_codes = dist(housing.som$codes[[1]])

## Cluster build
#h.som_cluster = cutree(hclust(d = h.dist_codes),
                       #k = 6)

## k-means
h.kmeans.som_cluster = kmeans(housing.som$codes[[1]], 
                              centers = 6)
## Color Palette
h.mypal = brewer.pal(n = 6,
                     "Accent")
  

## Visualize
plot(housing.som,
     type = "mapping",
     #bgcol = h.mypal[h.som_cluster],
     bgcol = h.mypal[h.kmeans.som_cluster$cluster],
     main = "Housing Clusters",
     pch = ".")

##add boundaries
add.cluster.boundaries(housing.som,
                       h.kmeans.som_cluster$cluster)

                       #h.som_cluster)

Contiguous Mapping

## Calculate distance between nodes
h.dist_grid = unit.distances(h.som.grid)

## Multiple the node distances and the codes (value distances) together
## Adjustment to make it contiguous

h.dist_adj = as.matrix(h.dist_codes) * h.dist_grid
## Re-run the kmeans
h.clust_adj = kmeans(h.dist_adj,
                     centers = 6)

#h.hclust_adj = hclust(as.dist(h.dist_adj),
                      #"ward.D2")
#h.som_cluster_adj = cutree(clust_adj, 
                           #k= 6)


## Visualize
plot(housing.som,
     type = "mapping",
     bgcol = h.mypal[h.clust_adj$cluster],
     #bgcol = h.mypal[h.som_cluster_adj],
     main = "Housing Clusters",
     pch = ".")

##add boundaries
add.cluster.boundaries(housing.som,
                       h.clust_adj$cluster)

                       #h.som_cluster_adj)

Average Feature Values by Cluster

## Extract node by observation
nodeByObs = housing.som$unit.classif

## Attach to the housing.scale data frame
housing_scale$somclust = as.factor(h.clust_adj$cluster[nodeByObs])
## Merge the scale df to the original data

housing_df_clus = merge(x = housing,
                        y = housing_scale,
                        by = c("longitude", "latitude"))

## Check
str(housing_df_clus)
## 'data.frame':    49635 obs. of  23 variables:
##  $ longitude           : num  -114 -114 -114 -115 -115 ...
##  $ latitude            : num  34.2 34.4 34 32.8 33.7 ...
##  $ housing_median_age.x: num  15 19 17 19 17 27 20 14 25 29 ...
##  $ total_rooms.x       : num  5612 7650 2809 2570 720 ...
##  $ total_bedrooms.x    : num  1283 1901 635 820 174 ...
##  $ population.x        : num  1015 1129 83 1431 333 ...
##  $ households.x        : num  472 463 45 608 117 34 262 226 633 239 ...
##  $ median_income.x     : num  1.49 1.82 1.62 1.27 1.65 ...
##  $ median_house_value.x: num  66900 80100 87500 56100 85700 45000 65500 73400 82400 74000 ...
##  $ ocean_proximity.x   : chr  "INLAND" "INLAND" "INLAND" "INLAND" ...
##  $ avg_rooms.x         : num  11.89 16.52 62.42 4.23 6.15 ...
##  $ bedroom_ratio.x     : num  0.229 0.248 0.226 0.319 0.242 ...
##  $ housing_median_age.y: num [1:49635, 1] -0.909 -0.493 -0.689 -0.493 -0.689 ...
##  $ total_rooms.y       : num  5612 7650 2809 2570 720 ...
##  $ total_bedrooms.y    : num  1283 1901 635 820 174 ...
##  $ population.y        : num [1:49635, 1] -0.13686 0.00725 -3.52664 0.32817 -1.64573 ...
##  $ households.y        : num  472 463 45 608 117 34 262 226 633 239 ...
##  $ median_income.y     : num [1:49635, 1] -1.79 -1.37 -1.63 -2.13 -1.58 ...
##  $ median_house_value.y: num [1:49635, 1] -1.71 -1.39 -1.24 -2.02 -1.28 ...
##  $ ocean_proximity.y   : chr  "INLAND" "INLAND" "INLAND" "INLAND" ...
##  $ avg_rooms.y         : num [1:49635, 1] 3.012 4.21 9.047 -0.752 0.615 ...
##  $ bedroom_ratio.y     : num [1:49635, 1] 0.417 0.758 0.371 1.781 0.644 ...
##  $ somclust            : Factor w/ 6 levels "1","2","3","4",..: 3 3 2 3 3 1 3 3 3 3 ...
var_names2 = paste(var_names,
                   ".x",
                   sep = "")

## Aggregate the original values by cluster
housing.clust.vals = aggregate(x = housing_df_clus[, var_names2],
                               by = list(housing_df_clus$somclust),
                               FUN = mean)

## Check
housing.clust.vals
##   Group.1 housing_median_age.x population.x median_income.x
## 1       1             36.69893     1191.461        3.388889
## 2       2             35.12607     1492.598        1.988819
## 3       3             34.31376     1344.696        2.249996
## 4       4             28.02497     1639.679        3.420259
## 5       5             32.46993     1141.125        5.078442
## 6       6             16.29476     2042.762        6.620156
##   median_house_value.x avg_rooms.x bedroom_ratio.x
## 1             191436.3    4.845196       0.2223440
## 2             195170.9    4.579685       0.3834643
## 3             127576.7    4.442990       0.2599936
## 4             242585.5    4.585557       0.2473099
## 5             304540.3    5.765096       0.1866238
## 6             310820.3    6.751067       0.1638144
## Coerce into a table for a better figure
library(data.table)
library(kableExtra)

housing.table = setDT(housing.clust.vals)

housing.table.pub = housing.table %>% 
  kable() %>% 
  kable_material_dark() #%>% 
  #scroll_box(width = "600px", height= "600px" )

housing.table.pub
Group.1 housing_median_age.x population.x median_income.x median_house_value.x avg_rooms.x bedroom_ratio.x
1 36.69893 1191.461 3.388889 191436.3 4.845196 0.2223440
2 35.12607 1492.598 1.988819 195170.9 4.579685 0.3834643
3 34.31376 1344.696 2.249996 127576.7 4.442990 0.2599936
4 28.02497 1639.679 3.420259 242585.5 4.585557 0.2473099
5 32.46993 1141.125 5.078442 304540.3 5.765097 0.1866238
6 16.29476 2042.762 6.620156 310820.3 6.751067 0.1638144

Map of SOM Clusters (Optional)

## Check the combined housing dataframe
names(housing_df_clus)
##  [1] "longitude"            "latitude"             "housing_median_age.x"
##  [4] "total_rooms.x"        "total_bedrooms.x"     "population.x"        
##  [7] "households.x"         "median_income.x"      "median_house_value.x"
## [10] "ocean_proximity.x"    "avg_rooms.x"          "bedroom_ratio.x"     
## [13] "housing_median_age.y" "total_rooms.y"        "total_bedrooms.y"    
## [16] "population.y"         "households.y"         "median_income.y"     
## [19] "median_house_value.y" "ocean_proximity.y"    "avg_rooms.y"         
## [22] "bedroom_ratio.y"      "somclust"
## Combine the lat/long - did this in a previous step
## housing.crds

## Set housing_df_clus as a sf object
housing_df_clus_sf = st_as_sf(housing_df_clus,
                              coords = c("longitude", "latitude"),
                              crs = 4326)

## Visualize
tm_shape(ca_sf) +
  tm_borders() +
  tm_shape(housing_df_clus_sf) +
  tm_symbols(col = "somclust",
             border.lwd = NA,
             alpha = 0.5,
             palette = "Dark2",
             title.col = "CA Housing Clusters",
             size = 0.1)