# Packages used throughout: RColorBrewer (brewer.pal), scales (alpha),
# cluster (silhouette), fpc (calinhara), dplyr (group_by/summarise);
# maps is called with its namespace prefix below
library(RColorBrewer)
library(scales)
library(cluster)
library(fpc)
library(dplyr)

boston <- read.csv("boston6k.csv")
climate <- read.csv("wnaclim2.csv")

Exercise 1.

Instructions: The file boston6k.csv contains information on house prices in Boston by census tract, as well as various socio-economic and environmental factors. Use this to cluster the tracts by these factors (NOT by price), then examine the characteristics of the clusters, whether they show a difference in house price and if there is any spatial structure to the clusters. Use only the following variables in the cluster analysis: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT (see the file description for explanations of these). You will need to scale the data as the variables are in a wide range of units.
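
A quick check of the raw variability (a sketch, assuming the thirteen clustering variables sit in columns 9-21 of the file, as in the code below) shows why scaling is needed before clustering:

# Standard deviations of the raw clustering variables; TAX and B are orders
# of magnitude larger than NOX and would dominate an unscaled k-means
round(apply(boston[, seq(9, 21)], 2, sd), 2)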

  1. Start by running k-means cluster analysis on the data from 2 to 20 clusters, using the approach outlined above. You should calculate either the silhouette index or the Calinski-Harabasz index for each set of clusters. Provide a plot of the index values, and identify the number of clusters (k) that gives the best solution.
  2. In your opinion, is this the best solution, or would more or fewer clusters be useful?
bostonloc <- cbind(boston$LON, boston$LAT)  # tract coordinates for mapping
ngrp <- 7  # initial guess at the number of clusters

# Clustering variables: CRIM through LSTAT (columns 9-21)
boston_vars <- boston[, seq(9, 21)]
boston_vars_s <- scale(boston_vars)

boston_kmeans <- kmeans(boston_vars_s, ngrp, nstart = 50, iter.max = 20)
table(boston_kmeans$cluster)
## 
##   1   2   3   4   5   6   7 
##  47 175  89  48  35  78  34
mycol <- brewer.pal(ngrp, "Dark2")
plot(bostonloc, xlab = '', ylab = '', bg = alpha(mycol[boston_kmeans$cluster], 0.6), pch = 21, cex = 1.25, col = "gray100", lwd = 0.4, main = "k = 7 (initial guess)")
maps::map("usa", xlim = c(-72, -69), ylim = c(41.5, 43), col = "gray10", add = TRUE)

sil.out <- silhouette(boston_kmeans$cluster, dist(boston_vars_s))
sil.out[1:4, 1]
## [1] 7 7 7 3
mean(sil.out[, 3])
## [1] 0.2922711
tapply(sil.out[, 3], sil.out[, 1], mean)
##         1         2         3         4         5         6         7 
## 0.2309354 0.2683720 0.3881778 0.2369639 0.3880952 0.3069534 0.1947742
calinhara(boston_vars_s, boston_kmeans$cluster)
## [1] 165.4948
# Run k-means for k = 2 to 20, saving the CH index and the mean silhouette
# width for each solution (index 1 of each vector is left as NA)
ch.out <- rep(NA, 20)
sil.out <- rep(NA, 20)
for (i in 2:20) {
  boston.kmeans <- kmeans(boston_vars_s, centers = i, nstart = 50)
  ch.out[i] <- calinhara(boston_vars_s, boston.kmeans$cluster)
  tmp <- silhouette(boston.kmeans$cluster, dist(boston_vars_s))
  sil.out[i] <- mean(tmp[, 3])  # column 3 holds the silhouette widths
}

par(mfrow = c(1, 2))
plot(1:20, ch.out, type = 'b', lwd = 2, xlab = "N Groups", ylab = "CH index", main = "CH")
plot(1:20, sil.out, type = 'b', lwd = 2, xlab = "N Groups", ylab = "Average silhouette width", main = "Average silhouette index")

mean(sil.out, na.rm = TRUE)
## [1] 0.2800466

Answer: I initially ran the k-means analysis with seven clusters, but I think fewer would be more appropriate. The Calinski-Harabasz plot shows an elbow at k = 5. While k = 2 maximizes the average silhouette index, two clusters probably don't capture all of the variability of Boston. The second-highest silhouette value is at k = 12, but that is considerably more clusters than the CH index suggests, so I would favor a lower k. Based on the plotted Calinski-Harabasz and silhouette indices together, I would select k = 5. Its silhouette value is 0.2878, near the elbow of the CH curve, and slightly above the mean silhouette value across all k of 0.2800.

(I like placing my plots side-by-side to minimize scrolling, but I think it’s worth viewing separately when choosing k.)
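
The candidate values can also be read straight off the ch.out and sil.out vectors computed in the loop above (a quick sketch; no new objects needed):

which.max(ch.out)               # k with the highest Calinski-Harabasz index
which.max(sil.out)              # k with the highest average silhouette width
round(sil.out[c(2, 5, 12)], 4)  # silhouette widths at the candidate k values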

  1. Re-run kmeans() using your chosen number for k.
  2. Using the aggregate() function, provide a table showing the median of the variables used in clustering. In 1-2 sentences, describe the characteristics of the clusters.
ngrp_final <- 5

# Clustering variables: CRIM through LSTAT (columns 9-21), rescaled as before
boston_vars <- boston[, seq(9, 21)]
boston_vars_s <- scale(boston_vars)

boston_kmeans <- kmeans(boston_vars_s, ngrp_final, nstart = 50, iter.max = 20)
table(boston_kmeans$cluster)
## 
##   1   2   3   4   5 
##  39  84 230 119  34
mycol <- brewer.pal(ngrp_final, "Dark2")
plot(bostonloc, xlab = '', ylab = '', bg = alpha(mycol[boston_kmeans$cluster], 0.6), pch = 21, cex = 1.25, col = "gray100", lwd = 0.4, main = "k = 5")
maps::map("usa", xlim = c(-72, -69), ylim = c(41.5, 43), col = "gray10", add = TRUE)

boston_meds <- aggregate(boston_vars, list(boston_kmeans$cluster), median)
boston_meds
##   Group.1      CRIM    ZN  INDUS CHAS   NOX     RM   AGE     DIS RAD TAX
## 1       1 12.247200  0.00 18.100    0 0.679 6.1930 93.30 1.91420  24 666
## 2       2  0.042990 53.75  3.440    0 0.427 6.7070 30.95 6.92510   4 304
## 3       3  0.154870  0.00  7.625    0 0.510 6.1385 69.95 3.75370   5 302
## 4       4  5.731160  0.00 18.100    0 0.693 6.0030 96.60 1.86620  24 666
## 5       5  0.484245  0.00 13.890    1 0.550 6.2310 88.55 2.96865   5 307
##   PTRATIO      B  LSTAT
## 1    20.2  48.45 19.780
## 2    16.6 393.07  5.670
## 3    18.4 393.26  9.845
## 4    20.2 390.11 17.730
## 5    17.6 390.58 10.990

Answer: By far the cluster with the most crime per capita is cluster 1, which is otherwise fairly similar to cluster 4: both have the highest accessibility to radial highways (RAD = 24), the highest nitric oxide concentrations, the highest percentages of lower-status population, and the highest property tax rate (666).
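
To make this ordering explicit (a sketch reusing the boston_meds table above):

# Sort the cluster medians by crime rate; clusters 1 and 4 stand apart on
# CRIM, RAD, TAX, NOX and LSTAT
boston_meds[order(-boston_meds$CRIM), c("Group.1", "CRIM", "RAD", "TAX", "NOX", "LSTAT")]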

  1. Report the mean corrected house value per cluster
  2. Use anova() to test whether the values are significantly different between clusters. You will need the vector of house prices/values and the vector of clusters from kmeans(). Give the F-statistic and the p-value
# Attach the cluster assignments to the original data frame
boston$cluster <- boston_kmeans$cluster

group_by(boston, cluster) %>%
  summarise(
    count = n(),
    mean = mean(CMEDV, na.rm = TRUE), 
    sd = sd(CMEDV, na.rm = TRUE)
  )
## # A tibble: 5 × 4
##   cluster count  mean    sd
##     <int> <int> <dbl> <dbl>
## 1       1    39  12.5  4.48
## 2       2    84  29.0  7.67
## 3       3   230  24.2  7.93
## 4       4   119  16.6  6.60
## 5       5    34  27.8 11.4
# cluster must be treated as a factor; as a plain integer, aov() would fit
# a straight-line regression on cluster number rather than a one-way ANOVA
summary(boston_aov <- aov(CMEDV ~ factor(cluster), data = boston))

Answer: The mean home values (calculated from the corrected median home values, CMEDV) are given above, with the highest in cluster 2 at 28.987 (USD 1000) and the lowest in cluster 1 at 12.500 (USD 1000). The F-statistic is 40.35 (p = 4.74e-10), which means we have sufficient evidence to reject the null hypothesis and conclude that there is a statistically significant difference in home values between the clusters.
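
As an optional follow-up (not required by the exercise; a sketch using the boston_aov fit above), Tukey's HSD shows which pairs of clusters drive the overall difference:

# Pairwise differences in mean CMEDV between clusters, with adjusted p-values
TukeyHSD(boston_aov)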

Exercise 2.

Instructions: The file wnaclim2.csv contains a set of climatic variables for sites distributed across western North America. Use principal component analysis to explore the spatial distribution of climate. This will require you to install the add-on package fields for plotting.

  1. Read in the file and perform principal component analysis using the monthly temperature and precipitation variables (these are the same as you used in the cluster analysis in the lab). Use the SVD approach with the function prcomp(). Note that you will have to scale the data in the PCA to avoid any bias from the difference in magnitude of the variables in the dataset (use the scale=TRUE parameter). Make a biplot of this ordination (biplot()) and a scree-plot showing the variance explained by the components (screeplot()).
  2. Give the variance explained by the second PC. Calculate the total percentage of variance explained by axes 1 and 2 (use summary())
climate <- read.csv("wnaclim2.csv")
clim <- climate[, seq(3, 26)]  # the 24 monthly temperature and precipitation variables
clim <- scale(clim)            # standardize (prcomp() below would also do this via scale = TRUE)

# summary() stores the proportion of variance in row 2 of $importance
var <- summary(clim.pca <- prcomp(clim, scale = TRUE))
ax_1 <- var$importance[2, 1]       # proportion of variance, PC1
ax_2 <- var$importance[2, 2]       # proportion of variance, PC2
total <- sum(var$importance[2, ])  # the proportions sum to 1
exp_var_1_2 <- (ax_1 + ax_2) / total

head(clim.pca$x)
##               PC1        PC2        PC3         PC4        PC5        PC6
## [1,]  0.008071052 -1.3874311 -0.9836279  1.37546482 -0.2239042  0.2117742
## [2,] -5.747155910  0.9793170  0.7990065 -0.45842202 -0.2070474 -0.5716732
## [3,] -1.110850774  0.2726580 -2.3350788  1.13128346 -0.4553512  0.3760333
## [4,] -5.918467141  1.0162797  0.4778910 -0.27049059 -0.4314492 -0.4778650
## [5,] -5.199307142  1.0117284 -0.1186365 -0.12540711 -0.5941900 -0.4596768
## [6,] -4.203801060  0.7803072 -0.6706470  0.08404358 -0.7348544 -0.1435930
##               PC7        PC8         PC9        PC10        PC11        PC12
## [1,]  0.035194421 -0.3159553  0.06726654 -0.05805034  0.03909418  0.00647676
## [2,] -0.763810164 -0.3245466 -0.22768624 -0.04503544  0.14008427  0.01760272
## [3,] -0.007616908 -0.5079982 -0.06260684 -0.21804157 -0.11938261 -0.02463859
## [4,] -0.725385454 -0.5392769 -0.13383186 -0.16217628  0.11596128  0.06857436
## [5,] -0.769997993 -0.6386171 -0.09193712 -0.19551564  0.08765735  0.14634885
## [6,] -0.527080924 -0.4733514 -0.10392325 -0.27130413  0.16409294 -0.04424016
##             PC13         PC14         PC15        PC16          PC17
## [1,] -0.10304484  0.034346560 -0.004171344  0.03827428 -0.0664656951
## [2,]  0.07894863  0.008881291 -0.050599419  0.06475534  0.0991135758
## [3,]  0.04330995 -0.060888579 -0.082947178 -0.09367767  0.0036155133
## [4,]  0.11447572 -0.046880594 -0.040798213 -0.04901866  0.0003842374
## [5,]  0.11209646 -0.094006903 -0.059572519 -0.08295071 -0.0251603437
## [6,]  0.13005645 -0.044753326 -0.032075194 -0.08947706 -0.0236370642
##              PC18        PC19         PC20         PC21         PC22
## [1,]  0.018140117 -0.03019069  0.006147098 -0.022525087  0.040515659
## [2,]  0.040463155 -0.02032345  0.020491664 -0.009543126 -0.002088933
## [3,] -0.002420637  0.03519190 -0.022245545  0.019584065 -0.039697096
## [4,]  0.011531935 -0.01402926 -0.035612801 -0.037170816 -0.008734897
## [5,]  0.034568474 -0.01414965 -0.034133021 -0.057726180 -0.003614496
## [6,]  0.003260044  0.01509390 -0.013932235 -0.046920943  0.031314336
##             PC23          PC24
## [1,]  0.03004748 -1.007782e-02
## [2,] -0.02948187 -3.676611e-02
## [3,]  0.02594588 -3.126072e-03
## [4,] -0.02523369 -1.285121e-02
## [5,] -0.01908725 -2.206720e-06
## [6,] -0.02717742  2.242538e-02

par(mfrow = c(1, 2))
biplot(clim.pca)
screeplot(clim.pca, main = "Climate PCA")

Answer: The proportion of the total variance explained by the second PC is 0.328 (32.8%), and the total percentage of variance explained by axes 1 and 2 is 82.7%.
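
These figures can be read directly from the importance table saved in var above (rows are the standard deviation, proportion of variance, and cumulative proportion of each component):

round(var$importance[, 1:3], 3)  # first three components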

  1. Examine the ‘loadings’ of the variables on the first two axes (wnaclim.pca$rotation). Name two variables that are highly associated (high positive or negative values) with axis 1 and two that are highly associated with axis 2, and give their scores.
loadings <- clim.pca$rotation  # variable loadings, one column per component

Answer: PC1 is most highly associated with February temperature (0.2752) and December temperature (0.2730); PC2 is most highly associated with October precipitation (0.2876).
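
To find the top loadings systematically rather than by eye (a sketch using the loadings matrix above):

# Rank variables by absolute loading; the names identify the variables
sort(abs(loadings[, 1]), decreasing = TRUE)[1:4]  # most associated with PC1
sort(abs(loadings[, 2]), decreasing = TRUE)[1:4]  # most associated with PC2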

  1. Produce a map of the sites scores on axis 1, using the quilt.plot() function from the fields package (code to do this is given with the file description below). With reference to the association between the variables and axis 1 (previous question), give a short description of the map (e.g. where do you find negative or positive values and what variables are these associated with?). Does this make sense in terms of what you know about the geography and climate of North America?
  2. Finally, produce a map of the sites scores on the second axis and give a short description of the spatial pattern in terms of the associated variables
library(spam)    # dependency of fields; loaded explicitly here
library(fields)  # provides quilt.plot() and world()
wnaclim.pca.score <- clim.pca$x[, 1]  # site scores on axis 1
quilt.plot(climate$Longitude, climate$Latitude, wnaclim.pca.score, main = "PC1 site scores")
world(add = TRUE)

wnaclim.pca.score <- clim.pca$x[, 2]  # site scores on axis 2
quilt.plot(climate$Longitude, climate$Latitude, wnaclim.pca.score, main = "PC2 site scores")
world(add = TRUE)


Answers: The spatial pattern of PC1 in the first map appears to be closely associated with coastal geography, as well as with latitude. The loadings showed that PC1 is highly associated with the temperature variables but much less so with precipitation. Temperature is less variable along the coast than inland across the contiguous U.S., so we would expect coastal sites to be more temperate. In Alaska, however, it can get extremely cold, especially farther north.

The spatial pattern of PC2 is associated more with variation in precipitation than with temperature. We expect relatively frequent precipitation in coastal Canada and the Pacific Northwest, whereas in the deserts of the Southwest showers are less frequent and more variable.
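
One way to check this reading against the data (a sketch reusing the scaled clim matrix and the PCA scores from above) is to correlate the PC2 site scores with each input variable:

# Variables with the largest absolute correlation drive the PC2 map pattern
pc2_cor <- cor(clim, clim.pca$x[, 2])[, 1]
round(sort(abs(pc2_cor), decreasing = TRUE)[1:5], 2)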