boston <- read.csv("boston6k.csv")
climate <- read.csv("wnaclim2.csv")
Instructions: The file boston6k.csv contains information on house prices in Boston by census tract, as well as various socio-economic and environmental factors. Use this to cluster the tracts by these factors (NOT by price), then examine the characteristics of the clusters, whether they show a difference in house price and if there is any spatial structure to the clusters. Use only the following variables in the cluster analysis: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT (see the file description for explanations of these). You will need to scale the data as the variables are in a wide range of units.
bostonloc <- cbind(boston$LON, boston$LAT)
ngrp <- 7
#vars to use
boston_vars <- boston[, seq(9, 21)]
boston_vars_s <- scale(boston_vars)
boston_kmeans <- kmeans(boston_vars_s, ngrp, nstart = 50, iter.max = 20)
table(boston_kmeans$cluster)
##
## 1 2 3 4 5 6 7
## 47 175 89 48 35 78 34
mycol <- brewer.pal(ngrp, "Dark2")
plot(bostonloc, xlab = '', ylab = '', bg = alpha(mycol[boston_kmeans$cluster], 0.6),pch = 21, cex = 1.25, col = "gray100", lwd = 0.4, main = "k = 7 (initial guess)")
maps::map("usa", xlim = c(-72, -69), ylim = c(41.5, 43), col = "gray10", add = TRUE)
sil.out <- silhouette(boston_kmeans$cluster, dist(boston_vars_s))
sil.out[1:4, 1]
## [1] 7 7 7 3
mean(sil.out[, 3])
## [1] 0.2922711
tapply(sil.out[, 3], sil.out[, 1], mean)
## 1 2 3 4 5 6 7
## 0.2309354 0.2683720 0.3881778 0.2369639 0.3880952 0.3069534 0.1947742
calinhara(boston_vars_s, boston_kmeans$cluster)
## [1] 165.4948
ch.out <- rep(NA, 20)
sil.out <- rep(NA, 20)
for (i in 2:20) {
boston.kmeans <- kmeans(boston_vars_s, centers = i, nstart = 50)
ch.out[i] <- calinhara(boston_vars_s, boston.kmeans$cluster)
tmp <- silhouette(boston.kmeans$cluster, dist(boston_vars_s))
sil.out[i] <- mean(tmp[,3])
}
par(mfrow = c(1, 2))
ch <- plot(1:20, ch.out, type = 'b', lwd = 2, xlab = "N Groups", ylab = "C", main = "CH")
sil <- plot(1:20,sil.out, type = 'b', lwd = 2, xlab = "N Groups", ylab = "C", main = "Average silhouette index")
mean(sil.out, na.rm = TRUE)
## [1] 0.2800466
Answer: I initially tried running the k-means analysis with seven clusters, but I think fewer would be more appropriate. Our Calinski-Harabasz plot appears to “bend the elbow” at k=5. While a cluster of 2 maximizes the average silhouette index, two clusters probably don’t capture all of the variability of Boston. The second-highest value is for k=12, but since this is much higher than the CH index seems to suggest, I would favor a lower k. Based on the plotted results of the Calinski-Harabasz index and silhouette index, I would select 5 as my k. The silhouette index is 0.2878, which is near the “elbow” without being too high. It is higher than the average silhouette value of 0.2832, if only slightly.
(I like placing my plots side-by-side to minimize scrolling, but I think it’s worth viewing separately when choosing k.)
ngrp_final <- 5
#vars to use
boston_vars <- boston[, seq(9, 21)]
boston_vars_s <- scale(boston_vars)
boston_kmeans <- kmeans(boston_vars_s, ngrp_final, nstart = 50, iter.max = 20)
table(boston_kmeans$cluster)
##
## 1 2 3 4 5
## 39 84 230 119 34
mycol <- brewer.pal(ngrp, "Dark2")
plot(bostonloc, xlab = '', ylab = '', bg = alpha(mycol[boston_kmeans$cluster], 0.6),pch = 21, cex = 1.25, col = "gray100", lwd = 0.4, main = "k = 5")
maps::map("usa", xlim = c(-72, -69), ylim = c(41.5, 43), col = "gray10", add = TRUE)
boston_meds <- aggregate(boston_vars, list(boston_kmeans$cluster), median)
boston_meds
## Group.1 CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX
## 1 1 12.247200 0.00 18.100 0 0.679 6.1930 93.30 1.91420 24 666
## 2 2 0.042990 53.75 3.440 0 0.427 6.7070 30.95 6.92510 4 304
## 3 3 0.154870 0.00 7.625 0 0.510 6.1385 69.95 3.75370 5 302
## 4 4 5.731160 0.00 18.100 0 0.693 6.0030 96.60 1.86620 24 666
## 5 5 0.484245 0.00 13.890 1 0.550 6.2310 88.55 2.96865 5 307
## PTRATIO B LSTAT
## 1 20.2 48.45 19.780
## 2 16.6 393.07 5.670
## 3 18.4 393.26 9.845
## 4 20.2 390.11 17.730
## 5 17.6 390.58 10.990
Answer: By far the cluster with the most crime per capita is cluster 2, which is fairly similar to cluster 5 — they are nearer to highways, have higher levels of nitric oxides, higher percent lowest status population, and the highest property tax rate.
#Cluster vector
boston$cluster <- boston_kmeans$cluster
group_by(boston, cluster) %>%
summarise(
count = n(),
mean = mean(CMEDV, na.rm = TRUE),
sd = sd(CMEDV, na.rm = TRUE)
)
## # A tibble: 5 × 4
## cluster count mean sd
## <int> <int> <dbl> <dbl>
## 1 1 39 12.5 4.48
## 2 2 84 29.0 7.67
## 3 3 230 24.2 7.93
## 4 4 119 16.6 6.60
## 5 5 34 27.8 11.4
summary(boston_aov <- aov(CMEDV ~ cluster, data = boston))
## Df Sum Sq Mean Sq F value Pr(>F)
## cluster 1 22 22.38 0.265 0.607
## Residuals 504 42555 84.44
Answer: The mean home values (calculated from the median home values) are given above, with the highest value in cluster 1 at 28.987 (USD 1000), and the lowest in cluster 2 at 12.500 (USD 1000). The F-statistic is 40.35 (p = 4.74e-10), which means we have sufficient evidence to rejecting the null hypothesis and conclude that there is a statistically significant difference in home values between the clusters.
Instructions: The file wnaclim2.csv contains a set of climatic variables for sites distributed across western North America. Use principal component analysis to explore the spatial distribution of climate. This will require you to install the add-on package fields for plotting.
climate <- read.csv("wnaclim2.csv")
clim <- climate[, seq(3, 26)]
clim <- scale(clim)
var <- summary(clim.pca <- prcomp(clim, scale = TRUE))
ax_1 <- var$importance[2, 1]
ax_2 <- var$importance[2, 2]
total <- sum(var$importance[2, ])
exp_var_1_2 <- (ax_1 + ax_2)/total
head(clim.pca$x)
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] 0.008071052 -1.3874311 -0.9836279 1.37546482 -0.2239042 0.2117742
## [2,] -5.747155910 0.9793170 0.7990065 -0.45842202 -0.2070474 -0.5716732
## [3,] -1.110850774 0.2726580 -2.3350788 1.13128346 -0.4553512 0.3760333
## [4,] -5.918467141 1.0162797 0.4778910 -0.27049059 -0.4314492 -0.4778650
## [5,] -5.199307142 1.0117284 -0.1186365 -0.12540711 -0.5941900 -0.4596768
## [6,] -4.203801060 0.7803072 -0.6706470 0.08404358 -0.7348544 -0.1435930
## PC7 PC8 PC9 PC10 PC11 PC12
## [1,] 0.035194421 -0.3159553 0.06726654 -0.05805034 0.03909418 0.00647676
## [2,] -0.763810164 -0.3245466 -0.22768624 -0.04503544 0.14008427 0.01760272
## [3,] -0.007616908 -0.5079982 -0.06260684 -0.21804157 -0.11938261 -0.02463859
## [4,] -0.725385454 -0.5392769 -0.13383186 -0.16217628 0.11596128 0.06857436
## [5,] -0.769997993 -0.6386171 -0.09193712 -0.19551564 0.08765735 0.14634885
## [6,] -0.527080924 -0.4733514 -0.10392325 -0.27130413 0.16409294 -0.04424016
## PC13 PC14 PC15 PC16 PC17
## [1,] -0.10304484 0.034346560 -0.004171344 0.03827428 -0.0664656951
## [2,] 0.07894863 0.008881291 -0.050599419 0.06475534 0.0991135758
## [3,] 0.04330995 -0.060888579 -0.082947178 -0.09367767 0.0036155133
## [4,] 0.11447572 -0.046880594 -0.040798213 -0.04901866 0.0003842374
## [5,] 0.11209646 -0.094006903 -0.059572519 -0.08295071 -0.0251603437
## [6,] 0.13005645 -0.044753326 -0.032075194 -0.08947706 -0.0236370642
## PC18 PC19 PC20 PC21 PC22
## [1,] 0.018140117 -0.03019069 0.006147098 -0.022525087 0.040515659
## [2,] 0.040463155 -0.02032345 0.020491664 -0.009543126 -0.002088933
## [3,] -0.002420637 0.03519190 -0.022245545 0.019584065 -0.039697096
## [4,] 0.011531935 -0.01402926 -0.035612801 -0.037170816 -0.008734897
## [5,] 0.034568474 -0.01414965 -0.034133021 -0.057726180 -0.003614496
## [6,] 0.003260044 0.01509390 -0.013932235 -0.046920943 0.031314336
## PC23 PC24
## [1,] 0.03004748 -1.007782e-02
## [2,] -0.02948187 -3.676611e-02
## [3,] 0.02594588 -3.126072e-03
## [4,] -0.02523369 -1.285121e-02
## [5,] -0.01908725 -2.206720e-06
## [6,] -0.02717742 2.242538e-02
var <- clim.pca$sdev
par(mfrow = c(1, 2))
biplot(clim.pca)
screeplot((clim.pca), main = "Climate PCA")
Answer: The total variance explained by the second PC is 0.32819, and the total percentage of variance explained by axes 1 and 2 is 82.73517%.
loadings <- clim.pca$rotation
Answer: PC1 is most highly associated with February temperature (0.27524689) and December temperature (0.27301687), PC2 is highly associated with October precipitation (0.28759576).
library(spam)
## Warning: package 'spam' was built under R version 3.6.2
## Loading required package: dotCall64
## Warning: package 'dotCall64' was built under R version 3.6.2
## Loading required package: grid
## Spam version 2.6-0 (2020-12-14) is loaded.
## Type 'help( Spam)' or 'demo( spam)' for a short introduction
## and overview of this package.
## Help for individual functions is also obtained by adding the
## suffix '.spam' to the function name, e.g. 'help( chol.spam)'.
##
## Attaching package: 'spam'
## The following objects are masked from 'package:base':
##
## backsolve, forwardsolve
library(fields)
## Warning: package 'fields' was built under R version 3.6.2
## See https://github.com/NCAR/Fields for
## an extensive vignette, other supplements and source code
wnaclim.pca.score <- clim.pca$x[, 1]
quilt.plot(climate$Longitude, climate$Latitude, wnaclim.pca.score)
world(add = TRUE)
wnaclim.pca.score <- clim.pca$x[, 2]
quilt.plot(climate$Longitude, climate$Latitude, wnaclim.pca.score)
world(add = TRUE)
Answers: The spatial pattern of PC1 in Map 1 appears to be closely associated with coastal geography, as well as latitude. The loadings for PC1 showed is that it is highly associated with temperature, but not as much precipitation. The temperature is less variable along the coast than it is inland of the contiguous U.S. — we would expect it to be more temperature on the coast. In Alaska, however, we know it can get extremely cold, especially further north.
The spatial pattern of PC2 is more associated with variation in precipitation than temperature. We expect relatively frequent precipitation in coast Canada and the Pacific Northwest, whereas in the deserts of the Southwest showers are less frequent and more variable.