In this lab, we’ll look at how to implement two unsupervised classification methods:
The data we will use is from the Gap Minder project. While you can download the individual variables from the web site, we will use a preprocessed set of the data covering the period 1801-2018, in the file gapminder_1800_2018.csv. Download this to your datafiles folder and make a new folder for today’s class called lab10. You will also need the shapefile of country borders: ne_50m_admin_0_countries.shp (you should have this from a previous lab).
## Set working directory
setwd("~/Desktop/University of Utah PhD /Course Work/Spring 2023 Semester/GEOG6160 - Spatial Modeling/Labs/lab10/")
## Load libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggpubr)
## Loading required package: ggplot2
library(RColorBrewer)
library(countrycode) # convert country codes and country names
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(kohonen) #self organizing maps
library(tmap)
## Read in the data
gap_df = read.csv("../datafiles/gapminder_1800_2018.csv")
## Check
head(gap_df)
## country year child_mortality fertility per_cap_co2 income life_expectancy
## 1 Afghanistan 1800 469 7 NA 603 28.2
## 2 Afghanistan 1801 469 7 NA 603 28.2
## 3 Afghanistan 1802 469 7 NA 603 28.2
## 4 Afghanistan 1803 469 7 NA 603 28.2
## 5 Afghanistan 1804 469 7 NA 603 28.2
## 6 Afghanistan 1805 469 7 NA 603 28.2
## population
## 1 3280000
## 2 3280000
## 3 3280000
## 4 3280000
## 5 3280000
## 6 3280000
##Check the variable names
names(gap_df)
## [1] "country" "year" "child_mortality" "fertility"
## [5] "per_cap_co2" "income" "life_expectancy" "population"
## Create Histograms of the variables
h1 <- gap_df %>%
gghistogram(x = "child_mortality")
h2 <- gap_df %>%
gghistogram(x = "fertility")
h3 <- gap_df %>%
gghistogram(x = "per_cap_co2")
h4 <- gap_df %>%
gghistogram(x = "income")
h5 <- gap_df %>%
gghistogram(x = "life_expectancy")
h6 <- gap_df %>%
gghistogram(x = "population")
## Plot as one faceted output
ggarrange(h1, h2, h3, h4, h5, h6)
The histograms clearly show skew in most of the variables, and we will need to transform the data before analyzing it. We’ll do this in two steps. First, we’ll log transform all variables to remove the skew, then use the scale() function, which in R scales data to z-scores by subtracting the mean and dividing by the standard deviation. To simplify things, we do all this in a single step by nesting the log() function in the scale() function.
Note that we remove rows with missing observations (we could potentially impute these, but that is a little more difficult for time series data). We also remove any row where the per capita CO2 emissions are zero, partly because this would cause problems when log-transforming and partly because this is clearly not true.
## Create the data processing framework
## Drop NAs
## Remove rows of per cap CO2 of 0
## Finally log transform each variable
gap_df_scale <- gap_df %>%
drop_na() %>%
filter(per_cap_co2 > 0) %>%
mutate(
child_mortality = scale(log(child_mortality)),
fertility = scale(log(fertility)),
per_cap_co2 = scale(log(per_cap_co2)),
income = scale(log(income)),
life_expectancy = scale(log(life_expectancy)),
population = scale(log(population))
)
The very last thing we’ll do is to extract a single year of data for the first part of the lab (just 2018):
## Extract the 2018 data
gap_df_scale2 = gap_df_scale %>%
filter(year == 2018)
##Check
head(gap_df_scale2)
## country year child_mortality fertility per_cap_co2 income
## 1 Afghanistan 2018 -0.26516474 0.1506520 -0.5552216 -0.8328214
## 2 Albania 2018 -1.79404903 -1.6975208 0.3237170 0.8904765
## 3 Algeria 2018 -1.02764487 -0.8336127 0.7271571 0.9982332
## 4 Angola 2018 -0.09746087 0.6444475 0.1558006 0.2173724
## 5 Antigua and Barbuda 2018 -2.04000237 -1.3562778 0.9504338 1.4721185
## 6 Argentina 2018 -1.70055729 -1.1427722 0.8125748 1.2405631
## life_expectancy population
## 1 0.5332313 1.0768253
## 2 1.1959360 -0.3511459
## 3 1.1794885 1.1472112
## 4 0.5782861 0.9714548
## 5 1.1505013 -2.2476952
## 6 1.1338189 1.1755747
We’ll next use these data with the k-means cluster algorithm. The default function for this in R is kmeans(), and we need to pass the data we are going to use, and the number of requested clusters to this function. First let’s figure out which columns we need:
## Check the variable names
names(gap_df_scale2)
## [1] "country" "year" "child_mortality" "fertility"
## [5] "per_cap_co2" "income" "life_expectancy" "population"
## Set up the kmeans for the last 6 variables
gap_kmeans = kmeans(gap_df_scale2[,3:8],
centers = 6)
There output of this algorithm provides information about the cluster solution. We can see the size of each cluster (the number of observations assigned to each cluster):
## Check the cluster size
gap_kmeans$size
## [1] 21 34 42 29 19 39
And we can see the cluster assignments for each country:
## Check the cluster assignments for each country
gap_kmeans$cluster
## [1] 6 2 4 6 2 4 2 3 3 4 2 2 5 2 3 3 1 6 1 5 2 1 4 2 3 6 6 5 6 3 1 6 6 3 4 4 1
## [38] 6 6 2 6 3 4 2 3 3 1 4 4 4 5 1 6 2 1 6 1 3 3 1 6 2 3 6 3 2 5 6 6 1 6 5 3 2
## [75] 4 4 4 4 3 3 3 2 3 5 3 6 1 3 5 5 2 2 1 6 2 2 2 6 6 3 2 6 2 6 2 4 1 2 2 2 4
## [112] 6 5 1 5 3 3 5 6 6 5 2 3 3 5 5 2 5 5 4 4 3 3 3 3 4 6 1 1 3 6 3 2 6 3 3 3 1
## [149] 6 4 3 6 3 4 2 2 6 2 3 3 5 5 6 4 1 6 1 2 4 4 4 6 4 3 3 3 2 4 1 4 4 6 6 6
And finally the cluster centers. These are the prototypes; the average value of each variable for all the observations assigned to a given cluster (note these are the log-transformed and scaled data)
## Check the cluster center averages
gap_kmeans$centers
## child_mortality fertility per_cap_co2 income life_expectancy population
## 1 -0.7327950 -0.4507884 0.1613643 0.23733360 0.7476560 -1.2086934
## 2 -1.8282980 -1.6122468 0.8234725 1.30657069 1.1184521 -0.9017155
## 3 -2.3342882 -1.7030141 1.1209210 1.89523576 1.2666004 0.5393696
## 4 -1.3047987 -1.2265247 0.6881191 0.90779333 1.0433002 1.2933892
## 5 -0.9467182 -0.8454630 0.1015151 0.04711013 0.9232515 0.5815970
## 6 -0.1947134 0.2959364 -0.6062523 -0.71611076 0.5354277 0.6349778
Here, we can see that cluster 6, for example, has the lowest life expectancy and GDP, as well as high infant mortality and fertility rates.
As these are spatial data, we can now plot the distribution of the clusters. First, we’ll load the world shapefile in the ne_50m_admin_0_countries folder (we used this in a previous lab:)
## Read in the border shapefile
borders = st_read("../datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp",
quiet = TRUE)
Next, we add the cluster assignments to the gap_df_scale2 data frame:
## Add the cluster assignments
gap_df_scale2$kmeans = as.factor(gap_kmeans$cluster)
Now we need to merge the gap_df_scale2 data frame with the world borders spatial object. To do this, we add a new column to the data frame containing the ISO3 standard country codes, using the countrycode package (you’ll need to install this).
## Add ISO3 standard country codes using the country codes package
gap_df_scale2 = gap_df_scale2 %>%
mutate(ISO3 = countrycode(sourcevar = country,
origin = "country.name",
destination = "iso3c"))
Finally, we use these ISO codes to merge the two datasets. We need to specify the column name for each dataset that contains the label to be used in merging, which we do with by.x and by.y.
## Merge the two datasets
cluster_sf = merge(borders,
gap_df_scale2,
by.x = "ADM0_A3",
by.y = "ISO3")
##Check
head(cluster_sf)
## Simple feature collection with 6 features and 103 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -73.57627 ymin: -55.03213 xmax: 74.89131 ymax: 42.64795
## Geodetic CRS: WGS 84
## ADM0_A3 featurecla scalerank LABELRANK SOVEREIGNT SOV_A3
## 1 AFG Admin-0 country 1 3 Afghanistan AFG
## 2 AGO Admin-0 country 1 3 Angola AGO
## 3 ALB Admin-0 country 1 6 Albania ALB
## 4 ARE Admin-0 country 1 4 United Arab Emirates ARE
## 5 ARG Admin-0 country 1 2 Argentina ARG
## 6 ARM Admin-0 country 1 6 Armenia ARM
## ADM0_DIF LEVEL TYPE ADMIN GEOU_DIF
## 1 0 2 Sovereign country Afghanistan 0
## 2 0 2 Sovereign country Angola 0
## 3 0 2 Sovereign country Albania 0
## 4 0 2 Sovereign country United Arab Emirates 0
## 5 0 2 Sovereign country Argentina 0
## 6 0 2 Sovereign country Armenia 0
## GEOUNIT GU_A3 SU_DIF SUBUNIT SU_A3 BRK_DIFF
## 1 Afghanistan AFG 0 Afghanistan AFG 0
## 2 Angola AGO 0 Angola AGO 0
## 3 Albania ALB 0 Albania ALB 0
## 4 United Arab Emirates ARE 0 United Arab Emirates ARE 0
## 5 Argentina ARG 0 Argentina ARG 0
## 6 Armenia ARM 0 Armenia ARM 0
## NAME NAME_LONG BRK_A3 BRK_NAME
## 1 Afghanistan Afghanistan AFG Afghanistan
## 2 Angola Angola AGO Angola
## 3 Albania Albania ALB Albania
## 4 United Arab Emirates United Arab Emirates ARE United Arab Emirates
## 5 Argentina Argentina ARG Argentina
## 6 Armenia Armenia ARM Armenia
## BRK_GROUP ABBREV POSTAL FORMAL_EN FORMAL_FR
## 1 <NA> Afg. AF Islamic State of Afghanistan <NA>
## 2 <NA> Ang. AO People's Republic of Angola <NA>
## 3 <NA> Alb. AL Republic of Albania <NA>
## 4 <NA> U.A.E. AE United Arab Emirates <NA>
## 5 <NA> Arg. AR Argentine Republic <NA>
## 6 <NA> Arm. ARM Republic of Armenia <NA>
## NAME_CIAWF NOTE_ADM0 NOTE_BRK NAME_SORT NAME_ALT
## 1 Afghanistan <NA> <NA> Afghanistan <NA>
## 2 Angola <NA> <NA> Angola <NA>
## 3 Albania <NA> <NA> Albania <NA>
## 4 United Arab Emirates <NA> <NA> United Arab Emirates <NA>
## 5 Argentina <NA> <NA> Argentina <NA>
## 6 Armenia <NA> <NA> Armenia <NA>
## MAPCOLOR7 MAPCOLOR8 MAPCOLOR9 MAPCOLOR13 POP_EST POP_RANK GDP_MD_EST
## 1 5 6 8 7 34124811 15 64080
## 2 3 2 6 1 29310273 15 189000
## 3 1 4 1 6 3047987 12 33900
## 4 2 1 3 3 6072475 13 667200
## 5 3 1 3 13 44293293 15 879400
## 6 3 1 2 10 3045191 12 26300
## POP_YEAR LASTCENSUS GDP_YEAR ECONOMY
## 1 2017 1979 2016 7. Least developed region
## 2 2017 1970 2016 7. Least developed region
## 3 2017 2001 2016 6. Developing region
## 4 2017 2010 2016 6. Developing region
## 5 2017 2010 2016 5. Emerging region: G20
## 6 2017 2001 2016 6. Developing region
## INCOME_GRP WIKIPEDIA FIPS_10_ ISO_A2 ISO_A3 ISO_A3_EH ISO_N3
## 1 5. Low income -99 AF AF AFG AFG 004
## 2 3. Upper middle income -99 AO AO AGO AGO 024
## 3 4. Lower middle income -99 AL AL ALB ALB 008
## 4 2. High income: nonOECD -99 AE AE ARE ARE 784
## 5 3. Upper middle income -99 AR AR ARG ARG 032
## 6 4. Lower middle income -99 AM AM ARM ARM 051
## UN_A3 WB_A2 WB_A3 WOE_ID WOE_ID_EH WOE_NOTE ADM0_A3_IS
## 1 004 AF AFG 23424739 23424739 Exact WOE match as country AFG
## 2 024 AO AGO 23424745 23424745 Exact WOE match as country AGO
## 3 008 AL ALB 23424742 23424742 Exact WOE match as country ALB
## 4 784 AE ARE 23424738 23424738 Exact WOE match as country ARE
## 5 032 AR ARG 23424747 23424747 Exact WOE match as country ARG
## 6 051 AM ARM 23424743 23424743 Exact WOE match as country ARM
## ADM0_A3_US ADM0_A3_UN ADM0_A3_WB CONTINENT REGION_UN SUBREGION
## 1 AFG -99 -99 Asia Asia Southern Asia
## 2 AGO -99 -99 Africa Africa Middle Africa
## 3 ALB -99 -99 Europe Europe Southern Europe
## 4 ARE -99 -99 Asia Asia Western Asia
## 5 ARG -99 -99 South America Americas South America
## 6 ARM -99 -99 Asia Asia Western Asia
## REGION_WB NAME_LEN LONG_LEN ABBREV_LEN TINY HOMEPART
## 1 South Asia 11 11 4 -99 1
## 2 Sub-Saharan Africa 6 6 4 -99 1
## 3 Europe & Central Asia 7 7 4 -99 1
## 4 Middle East & North Africa 20 20 6 -99 1
## 5 Latin America & Caribbean 9 9 4 -99 1
## 6 Europe & Central Asia 7 7 4 -99 1
## MIN_ZOOM MIN_LABEL MAX_LABEL NE_ID WIKIDATAID NAME_AR
## 1 0 3 7 1159320319 Q889 أفغانستان
## 2 0 3 7 1159320323 Q916 أنغولا
## 3 0 5 10 1159320325 Q222 ألبانيا
## 4 0 4 9 1159320329 Q878 الإمارات العربية المتحدة
## 5 0 2 7 1159320331 Q414 الأرجنتين
## 6 0 5 10 1159320333 Q399 أرمينيا
## NAME_BN NAME_DE NAME_EN
## 1 আফগানিস্তান Afghanistan Afghanistan
## 2 অ্যাঙ্গোলা Angola Angola
## 3 আলবেনিয়া Albanien Albania
## 4 সংযুক্ত আরব আমিরাত Vereinigte Arabische Emirate United Arab Emirates
## 5 আর্জেন্টিনা Argentinien Argentina
## 6 আর্মেনিয়া Armenien Armenia
## NAME_ES NAME_FR NAME_EL
## 1 Afganistán Afghanistan Αφγανιστάν
## 2 Angola Angola Ανγκόλα
## 3 Albania Albanie Αλβανία
## 4 Emiratos Árabes Unidos Émirats arabes unis Ηνωμένα Αραβικά Εμιράτα
## 5 Argentina Argentine Αργεντινή
## 6 Armenia Arménie Αρμενία
## NAME_HI NAME_HU NAME_ID NAME_IT
## 1 अफ़्गानिस्तान Afganisztán Afganistan Afghanistan
## 2 अंगोला Angola Angola Angola
## 3 अल्बानिया Albánia Albania Albania
## 4 संयुक्त अरब अमीरात Egyesült Arab Emírségek Uni Emirat Arab Emirati Arabi Uniti
## 5 अर्जेण्टीना Argentína Argentina Argentina
## 6 आर्मीनिया Örményország Armenia Armenia
## NAME_JA NAME_KO NAME_NL
## 1 アフガニスタン 아프가니스탄 Afghanistan
## 2 アンゴラ 앙골라 Angola
## 3 アルバニア 알바니아 Albanië
## 4 アラブ首長国連邦 아랍에미리트 Verenigde Arabische Emiraten
## 5 アルゼンチン 아르헨티나 Argentinië
## 6 アルメニア 아르메니아 Armenië
## NAME_PL NAME_PT
## 1 Afganistan Afeganistão
## 2 Angola Angola
## 3 Albania Albânia
## 4 Zjednoczone Emiraty Arabskie Emirados Árabes Unidos
## 5 Argentyna Argentina
## 6 Armenia Arménia
## NAME_RU NAME_SV NAME_TR
## 1 Афганистан Afghanistan Afganistan
## 2 Ангола Angola Angola
## 3 Албания Albanien Arnavutluk
## 4 Объединённые Арабские Эмираты Förenade Arabemiraten Birleşik Arap Emirlikleri
## 5 Аргентина Argentina Arjantin
## 6 Армения Armenien Ermenistan
## NAME_VI NAME_ZH country
## 1 Afghanistan 阿富汗 Afghanistan
## 2 Angola 安哥拉 Angola
## 3 Albania 阿尔巴尼亚 Albania
## 4 Các Tiểu vương quốc Ả Rập Thống nhất 阿拉伯联合酋长国 United Arab Emirates
## 5 Argentina 阿根廷 Argentina
## 6 Armenia 亞美尼亞 Armenia
## year child_mortality fertility per_cap_co2 income life_expectancy
## 1 2018 -0.26516474 0.1506520 -0.5552216 -0.8328214 0.5332313
## 2 2018 -0.09746087 0.6444475 0.1558006 0.2173724 0.5782861
## 3 2018 -1.79404903 -1.6975208 0.3237170 0.8904765 1.1959360
## 4 2018 -1.91150716 -1.6859215 1.5694888 2.3788493 0.9927787
## 5 2018 -1.70055729 -1.1427722 0.8125748 1.2405631 1.1338189
## 6 2018 -1.52762031 -1.8297863 0.4065443 0.6326736 1.0917298
## population kmeans geometry
## 1 1.0768253 6 MULTIPOLYGON (((66.52227 37...
## 2 0.9714548 6 MULTIPOLYGON (((13.07275 -4...
## 3 -0.3511459 2 MULTIPOLYGON (((19.34238 41...
## 4 0.3225620 3 MULTIPOLYGON (((56.29785 25...
## 5 1.1755747 4 MULTIPOLYGON (((-57.60889 -...
## 6 -0.3377426 2 MULTIPOLYGON (((44.76826 39...
Finally, we can plot out the clusters using the tmap package, which highlights the position of this poorer cluster 6 in Saharan and sub-Saharan Africa:
## Map the clusters
tm_shape(cluster_sf) +
tm_fill("kmeans", palette = "Dark2") +
tm_legend(legend.position = c("left", "bottom"))
We’ll briefly demonstrate the use of hierarchical clustering here. This works a little differently in R, as it requires you to first calculate a dissimilarity matrix (the aggregate difference between each pair of observations), and then use this for clustering.
The R function dist() will calculate this, based on simple Euclidean dissimiarity between samples:
## Calculate dissimilarity matrix
gap_d = dist(gap_df_scale2[,3:8])
We can now use this to carry out hierarchical clustering using the hclust() function:
## Hierarchical Clustering
gap_hclust = hclust(gap_d)
And you can then plot the resulting dendrogram with:
## Visualize
plot(gap_hclust)
To actually get a cluster assigment for each observation, we need to ‘cut’ the tree in a way that it will result in the desired number of groups, using the well-named cutree() function, To get 6 clusters:
## Cut tree into 6 clusters
gap_cluster = cutree(gap_hclust,
k = 6)
## Check
head(gap_cluster)
## [1] 1 2 3 4 5 3
To see what this has done, you can overlay the clusters on the dendogram as follows:
## Visualize the cut tree
plot(gap_hclust)
rect.hclust(gap_hclust,
k = 6)
The clusters that you obtain from the cutree() function can be used in the same way that we used the k -means clusters. Here, we’ll append them to the spatial object (cluster_sf) and map them out:
## Merge hclust to the gap_df_scale2 data frame
## Set as factor (category)
gap_df_scale2$hclust = as.factor(gap_cluster)
## Re-merge the data frame to the sf
cluster_sf = merge(x = borders,
y = gap_df_scale2,
by.x = "ADM0_A3",
by.y = "ISO3")
## Map the hclusters
tm_shape(cluster_sf) +
tm_fill("hclust", palette = "Set2") +
tm_legend(legend.position = c("left", "bottom"))
We’ll now repeat this analysis with a self-organizing map (SOM), but using the full dataset. For this we’ll use the kohonen package, which allows the construction of unsupervised and supervised SOMs.
Building the SOM is relatively simple: first we construct the grid of nodes (30x20) using the somgrid() function. Note the topo argument that controls the neighborhood shape (rectangular or hexagonal):
## Build the SOM grid build 30x20 nodes
## How many nodes to build??
som_grid = somgrid(xdim = 30,
ydim = 20,
topo = "hexagonal")
## Visualize the empty grid
plot(som_grid)
Now we’ll train the SOM. The function we use is som(), and we need to specify the data to be used as well as the SOM grid. The function requires the data to be as a matrix rather than an R data frame, so we first convert this, then pass it to the function:
## Convert the dataframe into a matrix
gap_mat = as.matrix(gap_df_scale[,3:8])
## Train the SOM
gap_som = som(X = gap_mat,
grid = som_grid)
Once finished, we need to see if the algorithm has converged. We can do this by plotting the change in the loss function as a function of the iterations. As a reminder, the loss function here is the mean distance between the node weights and the observations assigned to that node:
## Visualize the change in the loss function
plot(gap_som,
type = "changes")
We can see here that the loss function has decreased, but not stabilized suggesting that the algorithm has converged. We’ll re-run it for a larger number of iterations by setting rlen to 1000 (this may take a couple of minutes to run):
## Time estimate
st = Sys.time()
## Rerun the SOM training
gap_som = som(X = gap_mat,
grid = som_grid,
rlen = 1000)
et = Sys.time()
tt = et-st
## Computation Time
print(tt)
## Time difference of 1.281102 mins
## Visualize
plot(gap_som, type = "changes")
Note: This plot shows a series of step like drops in the loss function, finally stabilizing at approximately iteration 900-1000.
Next we’ll go through some visualizations of the data. We’ve already see then loss function changes, so next we’ll make a couple of plots to show how the SOM has trained. First a plot of distances between nodes, which can be useful to identify outliers:
## Visualize
plot(gap_som, type = "dist")
Next, we plot the number of observations per cluster to look for empty nodes.
## Plot to identify empty notes
plot(gap_som, type = "counts")
The map has a couple of empty nodes. Empty nodes can be common when using larger grid sizes, and can be taken to represent a part of the parameter space that is missing in the observations.
Next, we’ll plot the codebook vectors. These are the weights associated with each feature, and can be used to interpret the organization of the SOM. For each node, there is a radial plot showing the value of each feature associated with it. Note that the type of plot can change here. If you have a large number of features, this will plot as a line plot rather than a radial plot.
## Plot to identify empty notes
plot(gap_som, type = "code")
## Warning in par(opar): argument 1 does not name a graphical parameter
The map shows a clear gradient from richer regions (income, orange) to poorer regions, and a gradient from smaller populations at the top to smaller at the bottom. Note that the life expectancy follows the gradient of GDP fairly well, and the poorer regions tend to have higher infant mortality and fertility rates.
It is also possible to plot out the individual feature weights as a heatmap to illustrate the gradient using type = “property”. The codebook vectors of weights are contained within the gap_som object. The syntax to extract them is a little complicated, and some examples are given below
Population
## Visualize the population heat map
plot(gap_som,
type = "property",
property = gap_som$codes[[1]][,"population"],
main = "Pop")
Life Expectancy
## Visualize the heat map
plot(gap_som,
type = "property",
property = gap_som$codes[[1]][,"life_expectancy"],
main = "Life Exp.")
Income (per capita GDP)
## Visualize the heat map
plot(gap_som,
type = "property",
property = gap_som$codes[[1]][,"income"],
main = "Income")
CO2 Emissions
## Visualize the heat map
plot(gap_som,
type = "property",
property = gap_som$codes[[1]][,"per_cap_co2"],
main = "Per Capita CO2")
As a next step, we’ll cluster the nodes of the map to try and identify homogenous regions to simplify the interpretation of the SOM. While k-means is often used here, we’ll use a hierarchical algorithm (for reasons that will be clear shortly) to form 6 clusters.
Again, we first calculate the dissimilarity matrix, but this time between the values (codes) of each of the SOM nodes.
Then we use the cutree() and hclust() function to form the clusters:
## Calculate the distance/dissimilarity matrix
## Reminder codes are the values
dist_codes = dist(x = gap_som$codes[[1]])
## Form the clusters
som_cluster = cutree(hclust(dist_codes),
k = 6)
We can now show this on the SOM grid using the plot() function, setting type to mapping and using the cluster values in som_cluster to set the colors. The cluster boundaries can also be overlaid with add.cluster.boundaries()
## Create a palette
mypal = brewer.pal(n = 6,
"Set2")
## Plot The clusters
## Add boundaries
plot(gap_som,
type="mapping",
bgcol = mypal[som_cluster],
main = "Clusters",
pch = '.')
add.cluster.boundaries(gap_som, som_cluster)
The main problem with this is that the clusters are not-contiguous on the SOM grid. This reflects the non-linear mapping that a SOM carries out (and may also suggest that 6 clusters is too high). We can force the clusters to be contiguous by including the distance between the nodes on the SOM grid as well as the dissimilarity between codebook values. We’ve already calulated this latter value above (as codes_dist), so let’s get the distance between the nodes using the handy helper function unit.distances():
We then multiply these two distance/dissimilarity matrices together:
## Distance between nodes
dist_grid = unit.distances(som_grid)
## Then multiply the two distance/dissimalirity matrices together
dist_adj = as.matrix(dist_codes) * dist_grid
If we now repeat the clustering, the inclusion of the distances between nodes now forces these clusters to be contiguous:
## Repeat the clustering to include the distances between nodes
## What is the ward.D2 method?
clust_adj = hclust(as.dist(dist_adj),
method = 'ward.D2')
som_cluster_adj = cutree(clust_adj, 6)
## Visualize
plot(gap_som,
type="mapping",
bgcol = mypal[som_cluster_adj],
main = "Clusters",
pch = '.')
add.cluster.boundaries(gap_som, som_cluster_adj)
Next, we obtain aggregate values for each cluster. While we can get these values from the gap_som object, we can also work back to get values from the original data (in gap_df). In the following code we:
## Extract a vector with the node assignments
nodeByCountry <- gap_som$unit.classif
## Assign the nodes to the original data frame
gap_df_scale$somclust <- as.factor(som_cluster_adj[nodeByCountry])
## Check
str(gap_df_scale)
## 'data.frame': 17676 obs. of 9 variables:
## $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 ...
## $ child_mortality: num [1:17676, 1] 1.22 1.22 1.21 1.2 1.19 ...
## ..- attr(*, "scaled:center")= num 4.47
## ..- attr(*, "scaled:scale")= num 1.28
## $ fertility : num [1:17676, 1] 1.26 1.26 1.26 1.26 1.25 ...
## ..- attr(*, "scaled:center")= num 1.39
## ..- attr(*, "scaled:scale")= num 0.503
## $ per_cap_co2 : num [1:17676, 1] -2.9 -2.06 -2.03 -2.04 -1.97 ...
## ..- attr(*, "scaled:center")= num -0.212
## ..- attr(*, "scaled:scale")= num 2.09
## $ income : num [1:17676, 1] -0.564 -0.553 -0.542 -0.528 -0.489 ...
## ..- attr(*, "scaled:center")= num 8.41
## ..- attr(*, "scaled:scale")= num 1.13
## $ life_expectancy: num [1:17676, 1] -1.64 -1.63 -1.59 -1.52 -1.45 ...
## ..- attr(*, "scaled:center")= num 3.99
## ..- attr(*, "scaled:scale")= num 0.311
## $ population : num [1:17676, 1] 0.192 0.201 0.208 0.215 0.222 ...
## ..- attr(*, "scaled:center")= num 15.5
## ..- attr(*, "scaled:scale")= num 1.79
## $ somclust : Factor w/ 6 levels "1","2","3","4",..: 4 4 4 4 4 4 4 4 4 4 ...
To get the values for each cluster on the original, non-transformed scale, we need to merge gap_df_scale which contains the cluster assignments with the original data (gap_df)
## Merge the two data frames
gap_df_clus = merge(x = gap_df,
y = gap_df_scale,
by = c("country", "year"))
## Check
head(gap_df_clus)
## country year child_mortality.x fertility.x per_cap_co2.x income.x
## 1 Afghanistan 1949 418 7.56 0.00192 2360
## 2 Afghanistan 1950 416 7.57 0.01090 2390
## 3 Afghanistan 1951 413 7.56 0.01170 2420
## 4 Afghanistan 1952 407 7.55 0.01150 2460
## 5 Afghanistan 1953 401 7.54 0.01320 2570
## 6 Afghanistan 1954 395 7.53 0.01310 2580
## life_expectancy.x population.x child_mortality.y fertility.y per_cap_co2.y
## 1 32.4 7620000 1.223506 1.259275 -2.896148
## 2 32.5 7750000 1.219755 1.261905 -2.064037
## 3 32.9 7840000 1.214095 1.259275 -2.030096
## 4 33.6 7940000 1.202650 1.256642 -2.038359
## 5 34.3 8040000 1.191035 1.254006 -1.972291
## 6 35.0 8150000 1.179245 1.251366 -1.975935
## income.y life_expectancy.y population.y somclust
## 1 -0.5642644 -1.637728 0.1919013 4
## 2 -0.5531338 -1.627831 0.2013428 4
## 3 -0.5421421 -1.588548 0.2077869 4
## 4 -0.5276965 -1.520938 0.2148608 4
## 5 -0.4891505 -1.454722 0.2218461 4
## 6 -0.4857286 -1.389844 0.2294304 4
Now we can use this list of clusters by country to aggregate the variables:
## Aggregate the variables by group
cluster.vals = aggregate(gap_df_clus[,3:8],
by = list(gap_df_scale$somclust),
FUN = mean )
## Check
cluster.vals
## Group.1 child_mortality.x fertility.x per_cap_co2.x income.x
## 1 1 344.28571 5.393164 1.4178033 2755.990
## 2 2 154.41043 5.409502 1.2812579 4202.002
## 3 3 45.87448 2.402679 6.8197455 17229.733
## 4 4 286.56349 6.227594 0.1644801 1626.150
## 5 5 20.53622 2.289385 8.7255613 22919.833
## 6 6 81.89601 4.975301 3.5040853 8377.746
## life_expectancy.x population.x
## 1 36.38351 52044859
## 2 55.63231 25783594
## 3 69.77503 52602445
## 4 40.48865 5792850
## 5 73.31479 3792450
## 6 64.20135 1135660
Note: This table can help us infer characteristics of the clusters
As we have data over multiple years, we can also use the results of the SOM analysis to track the change made by an individual country over time. To do this, we need to know which node a country was assigned to for each of it’s time steps. To do this we
## Find the observation of interest e.g. US
country_id = which(gap_df_scale == "United States")
## Find the nodes assigned to the observation of interest
country_nodes = gap_som$unit.classif[country_id]
## Find the node coordinates
country_crds = som_grid$pts[country_nodes,]
Now we plot out the grid again, shaded with the cluster assignments.
Then we use R’s line() function to overlay the country’s trajectory.
In the last couple lines, we make up a vector of years to label the trajectory. As there will be a lot of these, we only keep the label for 20 year increments. Then we add these to the plot
##Plot the grid with cluster assignments
plot(gap_som, type="mapping",
bgcol = mypal[som_cluster_adj],
main = "Clusters",
pch = '',
keepMargins = TRUE)
#add.cluster.boundaries(gap_som, som_cluster_adj)
lines(country_crds, lwd=2)
##Create a vector of years
years <- ifelse(gap_df_scale$year[country_id] %% 20 == 0,
as.character(gap_df_scale$year[country_id]),
"")
text(jitter(country_crds),
labels = years,
cex = 0.75)
And finally, we can plot out the spatial distribution of these clusters for any given year in the dataset. First we need to assign the country codes to link to the shapefile:
## Assign the country code to the shapefile
gap_df_scale <- gap_df_scale %>%
mutate(ISO3 = countrycode(country,
"country.name",
"iso3c"))
##Check
str(gap_df_scale)
## 'data.frame': 17676 obs. of 10 variables:
## $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 ...
## $ child_mortality: num [1:17676, 1] 1.22 1.22 1.21 1.2 1.19 ...
## ..- attr(*, "scaled:center")= num 4.47
## ..- attr(*, "scaled:scale")= num 1.28
## $ fertility : num [1:17676, 1] 1.26 1.26 1.26 1.26 1.25 ...
## ..- attr(*, "scaled:center")= num 1.39
## ..- attr(*, "scaled:scale")= num 0.503
## $ per_cap_co2 : num [1:17676, 1] -2.9 -2.06 -2.03 -2.04 -1.97 ...
## ..- attr(*, "scaled:center")= num -0.212
## ..- attr(*, "scaled:scale")= num 2.09
## $ income : num [1:17676, 1] -0.564 -0.553 -0.542 -0.528 -0.489 ...
## ..- attr(*, "scaled:center")= num 8.41
## ..- attr(*, "scaled:scale")= num 1.13
## $ life_expectancy: num [1:17676, 1] -1.64 -1.63 -1.59 -1.52 -1.45 ...
## ..- attr(*, "scaled:center")= num 3.99
## ..- attr(*, "scaled:scale")= num 0.311
## $ population : num [1:17676, 1] 0.192 0.201 0.208 0.215 0.222 ...
## ..- attr(*, "scaled:center")= num 15.5
## ..- attr(*, "scaled:scale")= num 1.79
## $ somclust : Factor w/ 6 levels "1","2","3","4",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ ISO3 : chr "AFG" "AFG" "AFG" "AFG" ...
Next we’ll extract a single year and merge it to our spatial object (cluster_sf):
## Extract the year 2000
myyear = 2000
## temporary dataframe
tmp.df = gap_df_scale %>%
filter(year == myyear)
cluster_sf = merge(x = borders,
y = tmp.df,
by.x = "ADM0_A3",
by.y = "ISO3",
all = FALSE)
## Plot
tm_shape(cluster_sf) +
tm_fill("somclust", palette = "Set2") +
tm_layout(main.title = paste("GAPMinder data", myyear))
For the exercise, you will need to build use an unsupervised classification method with the California housing dataset using a self-organizing map. Your answer should use the following variables (you can use code from a previous example to create the average number of rooms per house, and the ratio of bedrooms to rooms):
Note on the data: This is a data set of California house prices from the file housing.csv, which is available through Canvas. The data are taken from a paper by Kelley and Barry (“Sparse spatial autoregressions.” Statistics & Probability Letters 33.3 (1997): 291-297), and is a classic dataset used in machine learning examples. You should also download the zipped file ca.zip. This contains a shapefile of California’s border. Move both of these to your datafiles directory and unzip and extract the files from ca.zip.
The csv file contains the following columns most of which should be self-explanatory), with values for each California district taken from the 1990 census:
## Read in the data
## Housing
housing = read.csv("../datafiles/housing.csv")
## Check
str(housing)
## 'data.frame': 20640 obs. of 10 variables:
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: num 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : num 880 7099 1467 1274 1627 ...
## $ total_bedrooms : num 129 1106 190 235 280 ...
## $ population : num 322 2401 496 558 565 ...
## $ households : num 126 1138 177 219 259 ...
## $ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num 452600 358500 352100 341300 342200 ...
## $ ocean_proximity : chr "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
## Check for missing values
colSums(is.na(housing))
## longitude latitude housing_median_age total_rooms
## 0 0 0 0
## total_bedrooms population households median_income
## 207 0 0 0
## median_house_value ocean_proximity
## 0 0
##Remove missing values using dplyr's filter function
housing = housing %>%
filter(!is.na(total_bedrooms))
##Check
colSums(is.na(housing))
## longitude latitude housing_median_age total_rooms
## 0 0 0 0
## total_bedrooms population households median_income
## 0 0 0 0
## median_house_value ocean_proximity
## 0 0
## Create a variable for average rooms
housing$avg_rooms = housing$total_rooms / housing$households
## Create a variable for bedroom ratio
housing$bedroom_ratio = housing$total_bedrooms / housing$total_rooms
## Check
head(housing)
## longitude latitude housing_median_age total_rooms total_bedrooms population
## 1 -122.23 37.88 41 880 129 322
## 2 -122.22 37.86 21 7099 1106 2401
## 3 -122.24 37.85 52 1467 190 496
## 4 -122.25 37.85 52 1274 235 558
## 5 -122.25 37.85 52 1627 280 565
## 6 -122.25 37.85 52 919 213 413
## households median_income median_house_value ocean_proximity avg_rooms
## 1 126 8.3252 452600 NEAR BAY 6.984127
## 2 1138 8.3014 358500 NEAR BAY 6.238137
## 3 177 7.2574 352100 NEAR BAY 8.288136
## 4 219 5.6431 341300 NEAR BAY 5.817352
## 5 259 3.8462 342200 NEAR BAY 6.281853
## 6 193 4.0368 269700 NEAR BAY 4.761658
## bedroom_ratio
## 1 0.1465909
## 2 0.1557966
## 3 0.1295160
## 4 0.1844584
## 5 0.1720959
## 6 0.2317737
##Dimensions
dim(housing)
## [1] 20433 12
Note - There are 20433 total observations
## Variable name check
names(housing)
## [1] "longitude" "latitude" "housing_median_age"
## [4] "total_rooms" "total_bedrooms" "population"
## [7] "households" "median_income" "median_house_value"
## [10] "ocean_proximity" "avg_rooms" "bedroom_ratio"
## Save a copy of the coordinates
housing.crds = housing %>%
select(longitude, latitude)
## Create a vector with the variable names for easy column extraction
var_names = c("housing_median_age", "population", "median_income", "median_house_value",
"avg_rooms", "bedroom_ratio")
## Test the subset
#housing[, var_names]
## Create histograms to check for normality
hh1 = housing %>%
gghistogram(x = "housing_median_age")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
hh2 = housing %>%
gghistogram(x = "population")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
hh3 = housing %>%
gghistogram(x = "median_income")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
hh4 = housing %>%
gghistogram(x = "median_house_value")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
hh5 = housing %>%
gghistogram(x = "avg_rooms")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
hh6 = housing %>%
gghistogram(x = "bedroom_ratio")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
## Visualize
ggarrange(hh1, hh2, hh3, hh4, hh5, hh6)
## Do I need to scale the bedroom ratio data?
## Exclude the outliers
temp.df = housing %>%
filter(bedroom_ratio < 1)
## Replot bedroom ratio
hh6 = temp.df %>%
gghistogram(x = "bedroom_ratio",
bins = 10)
hh6
max(temp.df$bedroom_ratio)
## The maximum is still 0.92
## Transform and Scale the data
## Note **I had an issue log transforming the bedroom ratio data**
## BR is only scaled here
housing_scale = housing %>%
mutate(
housing_median_age = scale(log(housing_median_age)),
population = scale(log(population)),
median_income = scale(log(median_income)),
median_house_value = scale(log(median_house_value)),
avg_rooms = scale(log(avg_rooms)),
bedroom_ratio = scale(log(bedroom_ratio))
)
## Check
head(housing_scale)
## longitude latitude housing_median_age total_rooms total_bedrooms population
## 1 -122.23 37.88 0.8603451 880 129 -1.6912077
## 2 -122.22 37.86 -0.3171483 7099 1106 1.0288051
## 3 -122.24 37.85 1.2786352 1467 190 -1.1063098
## 4 -122.25 37.85 1.2786352 1274 235 -0.9468488
## 5 -122.25 37.85 1.2786352 1627 280 -0.9299706
## 6 -122.25 37.85 1.2786352 919 213 -1.3542388
## households median_income median_house_value ocean_proximity avg_rooms
## 1 126 1.8584612 1.647690 NEAR BAY 1.0756316
## 2 1138 1.8523782 1.238219 NEAR BAY 0.6645069
## 3 177 1.5668026 1.206573 NEAR BAY 1.6986787
## 4 219 1.0322337 1.151844 NEAR BAY 0.4103292
## 5 259 0.2177044 1.156470 NEAR BAY 0.6899240
## 6 193 0.3204728 0.738206 NEAR BAY -0.3184999
## bedroom_ratio
## 1 -1.4009772
## 2 -1.1517965
## 3 -1.9076447
## 4 -0.4608930
## 5 -0.7447132
## 6 0.4732963
## Dimensions
dim(housing_scale)
## [1] 20433 12
Note - Still 20433 observations
## Read in the shapefile
ca_sf = st_read("../datafiles/ca/ca.shp",
crs = 4326,
quiet = TRUE)
## Check
str(ca_sf)
## Classes 'sf' and 'data.frame': 1 obs. of 15 variables:
## $ REGION : chr "4"
## $ DIVISION: chr "9"
## $ STATEFP : chr "06"
## $ STATENS : chr "01779778"
## $ GEOID : chr "06"
## $ STUSPS : chr "CA"
## $ NAME : chr "California"
## $ LSAD : chr "00"
## $ MTFCC : chr "G4000"
## $ FUNCSTAT: chr "A"
## $ ALAND : chr "403660088482"
## $ AWATER : chr "20305454540"
## $ INTPTLAT: chr "+37.1551773"
## $ INTPTLON: chr "-119.5434183"
## $ geometry:sfc_MULTIPOLYGON of length 1; first list element: List of 7
## ..$ :List of 1
## .. ..$ : num [1:11640, 1:2] -124 -124 -124 -124 -124 ...
## ..$ :List of 1
## .. ..$ : num [1:392, 1:2] -119 -119 -119 -119 -119 ...
## ..$ :List of 1
## .. ..$ : num [1:154, 1:2] -119 -119 -119 -119 -119 ...
## ..$ :List of 1
## .. ..$ : num [1:191, 1:2] -119 -119 -119 -119 -119 ...
## ..$ :List of 1
## .. ..$ : num [1:121, 1:2] -120 -120 -120 -120 -120 ...
## ..$ :List of 1
## .. ..$ : num [1:93, 1:2] -123 -123 -123 -123 -123 ...
## ..$ :List of 1
## .. ..$ : num [1:77, 1:2] -119 -119 -119 -119 -119 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr [1:14] "REGION" "DIVISION" "STATEFP" "STATENS" ...
## Initialize a Grid
h.som.grid = somgrid(xdim = 30,
ydim = 20,
topo = "hexagonal")
## Check empty grid
plot(h.som.grid)
## Convert from data frame to matrix
housing.mat = as.matrix(housing_scale[ , var_names])
dim(housing.mat)
## [1] 20433 6
Note - Still 20433 observations
## Timing
st1 = Sys.time()
## Train the SOM
housing.som = som(X = housing.mat,
grid = h.som.grid,
rlen = 1500)
et1 = Sys.time()
## Total time elapsed
print(et1 - st1)
## Time difference of 2.254369 mins
## Check the number of observations
length(housing.som$unit.classif)
## [1] 20433
##PROBLEM IS HERE!
##Fixed
The issue is here Why am I losing observations? Down to 10,423??
Note: After re-reading in the data, I did not log transform the bedroom ratio as this was producing NaNs
## Plot the loss function by iterartions
plot(housing.som,
type = "changes")
Note: The loss function stabilized after approximately 1450 iterations.
## Visualization of the number of observations per node
plot(housing.som,
type = "counts")
Note: It appears that there is one empty node
## Visualize the codebook vector plot
plot(housing.som,
type = "code")
## Warning in par(opar): argument 1 does not name a graphical parameter
## Visualizing Housing Median Age
plot(housing.som,
type = "property",
property = housing.som$codes[[1]][,"housing_median_age"],
main = "Housing Median Age")
## Visualizing Median Income
plot(housing.som,
type = "property",
property = housing.som$codes[[1]][,"median_income"],
main = "Median Income")
## Visualizing Population
plot(housing.som,
type = "property",
property = housing.som$codes[[1]][,"population"],
main = "Population")
## Visualizing Median House Value
plot(housing.som,
type = "property",
property = housing.som$codes[[1]][,"median_house_value"],
main = "Median House Value")
## Visualizing Average Rooms
plot(housing.som,
type = "property",
property = housing.som$codes[[1]][,"avg_rooms"],
main = "Average Rooms")
## Visualizing Bedroom Ratio
plot(housing.som,
type = "property",
property = housing.som$codes[[1]][,"bedroom_ratio"],
main = "Bedroom Ratio")
## Calculate the dissimilarity matrix between the values (codes) of each SOM node
h.dist_codes = dist(housing.som$codes[[1]])
## Cluster build
#h.som_cluster = cutree(hclust(d = h.dist_codes),
#k = 6)
## k-means
h.kmeans.som_cluster = kmeans(housing.som$codes[[1]],
centers = 6)
## Color Palette
h.mypal = brewer.pal(n = 6,
"Accent")
## Visualize
plot(housing.som,
type = "mapping",
#bgcol = h.mypal[h.som_cluster],
bgcol = h.mypal[h.kmeans.som_cluster$cluster],
main = "Housing Clusters",
pch = ".")
##add boundaries
add.cluster.boundaries(housing.som,
h.kmeans.som_cluster$cluster)
#h.som_cluster)
## Calculate distance between nodes
h.dist_grid = unit.distances(h.som.grid)
## Multiple the node distances and the codes (value distances) together
## Adjustment to make it contiguous
h.dist_adj = as.matrix(h.dist_codes) * h.dist_grid
## Re-run the kmeans
h.clust_adj = kmeans(h.dist_adj,
centers = 6)
#h.hclust_adj = hclust(as.dist(h.dist_adj),
#"ward.D2")
#h.som_cluster_adj = cutree(clust_adj,
#k= 6)
## Visualize
plot(housing.som,
type = "mapping",
bgcol = h.mypal[h.clust_adj$cluster],
#bgcol = h.mypal[h.som_cluster_adj],
main = "Housing Clusters",
pch = ".")
##add boundaries
add.cluster.boundaries(housing.som,
h.clust_adj$cluster)
#h.som_cluster_adj)
## Extract node by observation
nodeByObs = housing.som$unit.classif
## Attach to the housing.scale data frame
housing_scale$somclust = as.factor(h.clust_adj$cluster[nodeByObs])
## Merge the scale df to the original data
housing_df_clus = merge(x = housing,
y = housing_scale,
by = c("longitude", "latitude"))
## Check
str(housing_df_clus)
## 'data.frame': 49635 obs. of 23 variables:
## $ longitude : num -114 -114 -114 -115 -115 ...
## $ latitude : num 34.2 34.4 34 32.8 33.7 ...
## $ housing_median_age.x: num 15 19 17 19 17 27 20 14 25 29 ...
## $ total_rooms.x : num 5612 7650 2809 2570 720 ...
## $ total_bedrooms.x : num 1283 1901 635 820 174 ...
## $ population.x : num 1015 1129 83 1431 333 ...
## $ households.x : num 472 463 45 608 117 34 262 226 633 239 ...
## $ median_income.x : num 1.49 1.82 1.62 1.27 1.65 ...
## $ median_house_value.x: num 66900 80100 87500 56100 85700 45000 65500 73400 82400 74000 ...
## $ ocean_proximity.x : chr "INLAND" "INLAND" "INLAND" "INLAND" ...
## $ avg_rooms.x : num 11.89 16.52 62.42 4.23 6.15 ...
## $ bedroom_ratio.x : num 0.229 0.248 0.226 0.319 0.242 ...
## $ housing_median_age.y: num [1:49635, 1] -0.909 -0.493 -0.689 -0.493 -0.689 ...
## $ total_rooms.y : num 5612 7650 2809 2570 720 ...
## $ total_bedrooms.y : num 1283 1901 635 820 174 ...
## $ population.y : num [1:49635, 1] -0.13686 0.00725 -3.52664 0.32817 -1.64573 ...
## $ households.y : num 472 463 45 608 117 34 262 226 633 239 ...
## $ median_income.y : num [1:49635, 1] -1.79 -1.37 -1.63 -2.13 -1.58 ...
## $ median_house_value.y: num [1:49635, 1] -1.71 -1.39 -1.24 -2.02 -1.28 ...
## $ ocean_proximity.y : chr "INLAND" "INLAND" "INLAND" "INLAND" ...
## $ avg_rooms.y : num [1:49635, 1] 3.012 4.21 9.047 -0.752 0.615 ...
## $ bedroom_ratio.y : num [1:49635, 1] 0.417 0.758 0.371 1.781 0.644 ...
## $ somclust : Factor w/ 6 levels "1","2","3","4",..: 3 3 2 3 3 1 3 3 3 3 ...
var_names2 = paste(var_names,
".x",
sep = "")
## Aggregate the original values by cluster
housing.clust.vals = aggregate(x = housing_df_clus[, var_names2],
by = list(housing_df_clus$somclust),
FUN = mean)
## Check
housing.clust.vals
## Group.1 housing_median_age.x population.x median_income.x
## 1 1 36.69893 1191.461 3.388889
## 2 2 35.12607 1492.598 1.988819
## 3 3 34.31376 1344.696 2.249996
## 4 4 28.02497 1639.679 3.420259
## 5 5 32.46993 1141.125 5.078442
## 6 6 16.29476 2042.762 6.620156
## median_house_value.x avg_rooms.x bedroom_ratio.x
## 1 191436.3 4.845196 0.2223440
## 2 195170.9 4.579685 0.3834643
## 3 127576.7 4.442990 0.2599936
## 4 242585.5 4.585557 0.2473099
## 5 304540.3 5.765096 0.1866238
## 6 310820.3 6.751067 0.1638144
## Coerce into a table for a better figure
library(data.table)
library(kableExtra)
housing.table = setDT(housing.clust.vals)
housing.table.pub = housing.table %>%
kable() %>%
kable_material_dark() #%>%
#scroll_box(width = "600px", height= "600px" )
housing.table.pub
Group.1 | housing_median_age.x | population.x | median_income.x | median_house_value.x | avg_rooms.x | bedroom_ratio.x |
---|---|---|---|---|---|---|
1 | 36.69893 | 1191.461 | 3.388889 | 191436.3 | 4.845196 | 0.2223440 |
2 | 35.12607 | 1492.598 | 1.988819 | 195170.9 | 4.579685 | 0.3834643 |
3 | 34.31376 | 1344.696 | 2.249996 | 127576.7 | 4.442990 | 0.2599936 |
4 | 28.02497 | 1639.679 | 3.420259 | 242585.5 | 4.585557 | 0.2473099 |
5 | 32.46993 | 1141.125 | 5.078442 | 304540.3 | 5.765097 | 0.1866238 |
6 | 16.29476 | 2042.762 | 6.620156 | 310820.3 | 6.751067 | 0.1638144 |
## Check the combined housing dataframe
names(housing_df_clus)
## [1] "longitude" "latitude" "housing_median_age.x"
## [4] "total_rooms.x" "total_bedrooms.x" "population.x"
## [7] "households.x" "median_income.x" "median_house_value.x"
## [10] "ocean_proximity.x" "avg_rooms.x" "bedroom_ratio.x"
## [13] "housing_median_age.y" "total_rooms.y" "total_bedrooms.y"
## [16] "population.y" "households.y" "median_income.y"
## [19] "median_house_value.y" "ocean_proximity.y" "avg_rooms.y"
## [22] "bedroom_ratio.y" "somclust"
## Combine the lat/long - did this in a previous step
## housing.crds
## Set housing_df_clus as a sf object
housing_df_clus_sf = st_as_sf(housing_df_clus,
coords = c("longitude", "latitude"),
crs = 4326)
## Visualize
tm_shape(ca_sf) +
tm_borders() +
tm_shape(housing_df_clus_sf) +
tm_symbols(col = "somclust",
border.lwd = NA,
alpha = 0.5,
palette = "Dark2",
title.col = "CA Housing Clusters",
size = 0.1)