Exercise 1

The file boston6k.csv contains information on house prices in Boston by census tract, as well as various socio-economic and environmental factors. Use this to cluster the tracts by these factors (NOT by price), then examine the characteristics of the clusters, whether they show a difference in house price and if there is any spatial structure to the clusters. Use only the following variables in the cluster analysis: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT (see the file description for explanations of these). You will need to scale the data as the variables are in a wide range of units.

First, read in the data and take a look at the variables.

# Packages used in these exercises
library(maps)          # map() overlays for the plots
library(cluster)       # silhouette()
library(fpc)           # calinhara()
library(RColorBrewer)  # brewer.pal() colour palettes
library(dplyr)         # data manipulation
library(fields)        # quilt.plot() and world() (Exercise 2)

boston_all <- read.csv("../datafiles/boston6k.csv")

head(boston_all)
##   ID                    TOWN TOWNNO TRACT       LON      LAT MEDV CMEDV    CRIM
## 1  1 Boston Allston-Brighton     74     1 -71.13937 42.36245 17.8  17.8 8.98296
## 2  2 Boston Allston-Brighton     74     2 -71.15855 42.35312 21.7  21.7 3.84970
## 3  3 Boston Allston-Brighton     74     3 -71.16849 42.35061 22.7  22.7 5.20177
## 4  4 Boston Allston-Brighton     74     4 -71.15616 42.34382 22.6  22.6 4.26131
## 5  5 Boston Allston-Brighton     74     5 -71.15581 42.33732 25.0  25.0 4.54192
## 6  6 Boston Allston-Brighton     74     6 -71.14548 42.35033 19.9  19.9 3.83684
##   ZN INDUS CHAS  NOX    RM  AGE    DIS RAD TAX PTRATIO      B LSTAT
## 1  0  18.1    1 0.77 6.212 97.4 2.1222  24 666    20.2 377.73 17.60
## 2  0  18.1    1 0.77 6.395 91.0 2.5052  24 666    20.2 391.34 13.27
## 3  0  18.1    1 0.77 6.127 83.4 2.7227  24 666    20.2 395.43 11.48
## 4  0  18.1    0 0.77 6.112 81.3 2.5091  24 666    20.2 390.74 12.67
## 5  0  18.1    0 0.77 6.398 88.0 2.5182  24 666    20.2 374.56  7.79
## 6  0  18.1    0 0.77 6.251 91.1 2.2955  24 666    20.2 350.65 14.19
plot(cbind(boston_all$LON, boston_all$LAT),
     xlab = "Longitude", ylab = "Latitude",
     main = "House Locations in Boston Area")
map(add=T)

This is point data for 506 census tracts in the Boston area, with 15 neighborhood variables and 6 location/ID variables.

Next, I will select the desired variables for analysis and scale the data.

vars_interest <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", 
                   "RAD", "TAX", "PTRATIO", "B", "LSTAT")

boston_data <- boston_all[ , vars_interest]
boston_data <- scale(boston_data)
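
As a quick check, each scaled column should now have a mean of (approximately) zero and a standard deviation of one:

# Verify the scaling
round(colMeans(boston_data), 3)
round(apply(boston_data, 2, sd), 3)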

Next, calculate the distance matrix for the scaled data (by default, dist() computes Euclidean distances).

boston_dist <- dist(boston_data)

Start by running \(k\)-means cluster analysis on the data from 2 to 20 clusters, using the approach outlined above. You should calculate either the silhouette index or the Calinski-Harabasz index for each set of clusters. Provide a plot of the index values, and identify the number of clusters (\(k\)) that gives the best solution.

Before running k-means, a hierarchical clustering (Ward's method) gives a first look at plausible values of \(k\).

bntown_hc <- hclust(boston_dist, method = "ward.D2")
plot(bntown_hc)

The dendrogram shows many levels of clusters, and it is not clear from this what the ideal number of clusters would be. Possible values are n = 3, 5, 7, or 8.

# Calinski-Harabasz and mean silhouette values for k = 2 to 20
ch.out <- rep(NA, 20)
sil.out <- rep(NA, 20)

for(i in 2:20) {
  bn_kmeans <- kmeans(boston_data, centers = i, nstart = 50)
  ch.out[i] <- calinhara(boston_data, bn_kmeans$cluster)
  bn_si <- silhouette(bn_kmeans$cluster, boston_dist)
  sil.out[i] <- mean(bn_si[ ,3])  # column 3 holds the silhouette widths
}

plot(x = 1:20, y = ch.out, type = 'b', lwd=2,
     xlab = "N Groups", ylab = "CH",
     main = "Calinski-Harabasz Index for Boston Data")

The Calinski-Harabasz index is highest at small numbers of groups and decreases steadily as n increases. By this criterion, a small value such as n = 3 appears to be a good choice.

plot(x = 1:20, y = sil.out, type = 'b', lwd=2,
     xlab = "N Groups", ylab = "SI",
     main = "Average Silhouette Index for Boston Data")

The silhouette index shows n = 6 to be the best number of groups (the smallest value of n that maximises the average silhouette width).
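
The best value of \(k\) under each criterion can also be read off programmatically; which.max() skips the NA in position 1, and the index position corresponds to the number of clusters:

# Best k under each criterion
which.max(ch.out)
which.max(sil.out)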

In your opinion, is this the best solution, or would more or fewer clusters be useful?

In my opinion, the best choice for the number of clusters combines the evidence from the CH index, the silhouette index, and the dendrogram. For this analysis I think n = 5 is the best value.

plot(bntown_hc)
rect.hclust(bntown_hc, k = 5)

The groups do not contain equal numbers of observations, which may cause problems for the k-means clustering approach.
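
The sizes of the groups from this cut can be checked directly with cutree():

# Number of observations in each group for a 5-cluster cut
table(cutree(bntown_hc, k = 5))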

Re-run kmeans() using your chosen number for \(k\).

ngrps <- 5

# k-means takes the (scaled) data matrix, not the distance matrix
bntown_kmeans <- kmeans(boston_data, centers = ngrps, nstart = 50, iter.max = 30)
table(bntown_kmeans$cluster)
## 
##   1   2   3   4   5 
## 217  98  63  65  63
mycol <- brewer.pal(n = ngrps, name = "Paired")
plot(cbind(boston_all$LON, boston_all$LAT),
     xlab = "", ylab = "", main = "5 Clusters in Boston Housing Data",
     ylim = c(42, 43),
     pch = 16,
     col = mycol[bntown_kmeans$cluster])
legend(x = -71.5, y = 43,
       legend = c("Group 1", "Group 2", "Group 3", "Group 4", "Group 5"),
       fill = mycol)
map(add = T)
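
As a rough check on spatial structure, cluster membership can be cross-tabulated against town: if tracts in the same town mostly fall in the same cluster, the clusters are spatially coherent. Only the first few towns are shown here:

# Cross-tabulate cluster membership by town
town_clust <- table(boston_all$TOWN, bntown_kmeans$cluster)
town_clust[1:10, ]  # first ten towns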

Using the aggregate() function, provide a table showing the mean of the variables used in clustering. In 1-2 sentences, describe the characteristics of the clusters.

clusters <- bntown_kmeans$cluster
bn_centers <- aggregate(boston_all[ , vars_interest], 
                        list(clusters), mean)
bn_centers
##   Group.1        CRIM        ZN     INDUS       CHAS       NOX       RM
## 1       1  0.23317871  6.463134  7.168664 0.00000000 0.4909521 6.264594
## 2       2  7.09521286  0.000000 19.073571 0.00000000 0.6562245 6.103612
## 3       3  0.05250905 64.365079  3.113651 0.07936508 0.4171286 6.831873
## 4       4 15.79153969  0.000000 18.498769 0.18461538 0.7143846 5.774369
## 5       5  0.83736714  4.642857 12.886032 0.28571429 0.5891270 6.614476
##        AGE      DIS       RAD      TAX  PTRATIO        B     LSTAT
## 1 58.50276 4.466467  4.534562 301.0691 18.54562 388.6646 10.193963
## 2 89.87755 2.169315 20.122449 632.7857 20.34898 362.8751 17.008878
## 3 28.98254 7.184990  3.714286 307.9206 16.69048 388.5114  5.536349
## 4 92.66308 1.771475 20.476923 622.1846 19.26769 202.9091 21.550923
## 5 84.86984 2.709129  4.936508 307.6508 16.12698 363.6473 12.283968

Group 1 has the second lowest crime rate, 6.5% of land zoned for larger residential lots, the second lowest industrial proportion, no Charles River border, the second lowest NOx concentration, 58.5% of buildings built before 1940, the second longest weighted distance to employment centers, the lowest property tax rate, a middle pupil-teacher ratio, the highest value of the B index, and a relatively low lower-status percentage (10.2%). This group represents locations on the border between suburban and urban areas.

Group 2 has the second highest crime rate, no land zoned for larger residential lots, 19.1% industrial land use, no Charles River border, the second highest NOx concentration (0.656 pp10m), 89.9% of buildings built before 1940, the second shortest weighted distance to employment centers, the second highest accessibility to highways, the highest property tax rate, the highest pupil-teacher ratio, a middle value of the B index, and 17% lower-status population. Group 2 looks like a more industrial, inner-city area.

Group 3 has the lowest crime rate, 64% of land zoned for larger residential lots, the lowest industrial proportion, some tracts bordering the Charles River, the lowest NOx concentration, only 29.0% of buildings built before 1940 (suggesting this cluster has the newest housing), the furthest commute to employment centers, the lowest accessibility to highways, a pupil-teacher ratio of 16.7, the second highest value of the B index, and the lowest percentage of lower-status population. This is most likely suburbs far from the city center.

Group 4 has the highest crime rate, no land zoned for larger residential lots, the second highest industrial proportion, the second highest proportion of tracts bordering the Charles River, the highest NOx concentration, the highest proportion of older buildings, the shortest commute distance, the highest access to highways, the second highest tax rate, the second highest pupil-teacher ratio, the lowest value of the B index, and the highest percentage of lower-status population. This group is also a more industrial, inner-city area, in an older part of downtown.

Group 5 has middle values for crime, residential zoning, industry, NOx, building age, commuting, tax, the B index, and the lower-status population. It has the lowest pupil-teacher ratio and the highest proportion of tracts bordering the Charles River. This group also appears to represent locations on the border between suburban and urban areas, but differs slightly from Group 1.
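
Because the clustering was run on scaled data, the cluster centers stored by kmeans() are in standard-deviation units (0 = overall mean), which makes it easier to see which variables distinguish each group:

# Cluster centers on the scaled variables (SD units)
round(bntown_kmeans$centers, 2)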

Report the mean corrected house value per cluster.

First, add the cluster assignments to the Boston data, then calculate the mean house value for each cluster. (MEDV is used here; the corrected value, CMEDV, differs from MEDV in only a handful of tracts.)

bntown_data_clustered <- cbind(boston_all, clusters)
bntown_data_clustered$clusters <- factor(bntown_data_clustered$clusters)

# Mean house value per cluster; summarise() returns one row per cluster,
# in factor-level order, so the means line up with the group numbers
mean_vals <- bntown_data_clustered %>%
  select(MEDV, clusters) %>%
  group_by(clusters) %>%
  summarise(mean_cluster_val = mean(MEDV))

cluster <- data.frame(group = 1:ngrps,
                      mean_val = mean_vals$mean_cluster_val)
cluster
##   group mean_val
## 1     1 14.57385
## 2     2 17.36735
## 3     3 23.10507
## 4     4 28.70635
## 5     5 30.63492
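
A boxplot gives a quick visual impression of these differences before testing them formally:

# House value distributions by cluster
boxplot(MEDV ~ clusters, data = bntown_data_clustered,
        xlab = "Cluster", ylab = "Median house value ($1000s)",
        col = mycol)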

Use anova() to test whether the values are significantly different between clusters. You will need the vector of house prices/values and the vector of clusters from kmeans(). Give the \(F\)-statistic and the \(p\)-value.

price_lm <- lm(MEDV ~ 1, data = bntown_data_clustered)
summary(price_lm)
## 
## Call:
## lm(formula = MEDV ~ 1, data = bntown_data_clustered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.533  -5.508  -1.333   2.467  27.467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.5328     0.4089   55.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.197 on 505 degrees of freedom
anova(price_lm)
## Analysis of Variance Table
## 
## Response: MEDV
##            Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 505  42716  84.587

Fitting an intercept-only model (i.e. just the overall mean house price) gives a significant intercept, with a residual sum of squares of 42716.

cluster_lm <- lm(MEDV ~ clusters, data = bntown_data_clustered)
summary(cluster_lm)
## 
## Call:
## lm(formula = MEDV ~ clusters, data = bntown_data_clustered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.506  -4.605  -1.174   2.465  35.426 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.1051     0.5198  44.449  < 2e-16 ***
## clusters2    -5.7377     0.9319  -6.157 1.52e-09 ***
## clusters3     7.5299     1.0959   6.871 1.90e-11 ***
## clusters4    -8.5312     1.0827  -7.879 2.06e-14 ***
## clusters5     5.6013     1.0959   5.111 4.56e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.657 on 501 degrees of freedom
## Multiple R-squared:  0.3123, Adjusted R-squared:  0.3068 
## F-statistic: 56.88 on 4 and 501 DF,  p-value: < 2.2e-16
anova(cluster_lm)
## Analysis of Variance Table
## 
## Response: MEDV
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## clusters    4  13340  3335.0  56.877 < 2.2e-16 ***
## Residuals 501  29376    58.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding the clusters to the model gives significant coefficients for all cluster levels, and the residual sum of squares is much lower. The two models can be compared with anova():

anova(price_lm, cluster_lm)
## Analysis of Variance Table
## 
## Model 1: MEDV ~ 1
## Model 2: MEDV ~ clusters
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    505 42716                                  
## 2    501 29376  4     13340 56.877 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding clusters to the model substantially reduces the RSS. The \(F\)-statistic is 56.88, with a \(p\)-value of less than 2.2e-16, so mean house values differ significantly between the clusters.
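
The ANOVA only tells us that at least one cluster mean differs from the others; a Tukey HSD test on the equivalent aov() fit shows which pairs of clusters differ:

# Pairwise comparisons of cluster means
price_aov <- aov(MEDV ~ clusters, data = bntown_data_clustered)
TukeyHSD(price_aov)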


Exercise 2

The file wnaclim2.csv contains a set of climatic variables for sites distributed across western North America. Use principal component analysis to explore the spatial distribution of climate. This will require you to install the add-on package fields for plotting.

Read in the data

wnaclim_all <- read.csv("../datafiles/wnaclim2.csv")

plot(cbind(wnaclim_all$Longitude, wnaclim_all$Latitude),
     xlab = "", ylab = "", main = "Climate Stations")
map(add = T)

Read in the file and perform principal component analysis using the monthly temperature and precipitation variables (these are the same as you used in the cluster analysis in the lab). Use the SVD approach with the function prcomp(). Note that you will have to scale the data in the PCA to avoid any bias from the difference in magnitude of the variables in the dataset (use the scale=TRUE parameter). Make a biplot of this ordination (biplot()) and a scree plot showing the variance explained by the components (screeplot()).

First, remove the latitude/longitude columns.

wnaclim <- wnaclim_all[ , -c(1,2)]

clim_pcs <- prcomp(wnaclim, scale = T)
biplot(clim_pcs)

This is a large dataset, and the ordination plot is a bit cluttered. It appears that PC1 is weighted more towards the temperature variables, and PC2 more towards the precipitation variables.

screeplot(clim_pcs)

More than 95% of the variance is explained by the first four principal components (PC1-PC4).
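
The proportions shown in the scree plot can be computed directly from the component standard deviations:

# Cumulative proportion of variance for the first four components
prop_var <- clim_pcs$sdev^2 / sum(clim_pcs$sdev^2)
round(cumsum(prop_var)[1:4], 4)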

Give the total variance from the PCA. Calculate the total percentage of variance explained by axes 1 and 2 (use summary()).

summary(clim_pcs)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     4.1779 3.0900 1.60849 1.00332 0.78439 0.52420 0.39309
## Proportion of Variance 0.5455 0.2984 0.08085 0.03146 0.01923 0.00859 0.00483
## Cumulative Proportion  0.5455 0.8439 0.92471 0.95617 0.97540 0.98398 0.98881
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.30346 0.27875 0.20304 0.18712 0.17359 0.15187 0.13437
## Proportion of Variance 0.00288 0.00243 0.00129 0.00109 0.00094 0.00072 0.00056
## Cumulative Proportion  0.99169 0.99412 0.99541 0.99650 0.99744 0.99816 0.99873
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.10876 0.08293 0.07672 0.05834 0.05672 0.04860 0.04102
## Proportion of Variance 0.00037 0.00021 0.00018 0.00011 0.00010 0.00007 0.00005
## Cumulative Proportion  0.99910 0.99931 0.99950 0.99960 0.99970 0.99978 0.99983
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.03837 0.03355 0.02829 0.02502 0.02456 0.01959 0.01549
## Proportion of Variance 0.00005 0.00004 0.00003 0.00002 0.00002 0.00001 0.00001
## Cumulative Proportion  0.99988 0.99991 0.99994 0.99995 0.99997 0.99999 0.99999
##                           PC29     PC30     PC31     PC32
## Standard deviation     0.01212 0.008449 0.000535 1.39e-15
## Proportion of Variance 0.00000 0.000000 0.000000 0.00e+00
## Cumulative Proportion  1.00000 1.000000 1.000000 1.00e+00

PC1 accounts for 54.55% of the variance and PC2 accounts for 29.84%; together the first two components account for 84.39% of the variance. The variances (squared standard deviations) of PC1 and PC2 are:

clim_pcs$sdev[1:2]^2
## [1] 17.455207  9.548334
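
Because the variables were scaled to unit variance, the total variance equals the number of variables (32 here), which can be confirmed by summing the component variances:

# Total variance = sum of the component variances
sum(clim_pcs$sdev^2)
ncol(wnaclim)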

Examine the ‘loadings’ of the variables on the first two axes (wnaclim.pca$rotation). Name two variables that are highly associated (high positive or negative values) with axis 1 and two that are highly associated with axis 2, and give their scores.

sort(clim_pcs$rotation[ ,1])
##         pjul         paug         pjun         psep         poct         pmay 
## -0.005671661  0.002371154  0.004448985  0.040340802  0.066096892  0.073861228 
##         annp         pnov         papr         pdec         pjan         pfeb 
##  0.084419341  0.086049233  0.091428949  0.092353434  0.092875437  0.094670457 
##         pmar         tmax         tjun         tjul         mtwa         gdd5 
##  0.101293427  0.196130102  0.199709812  0.202588315  0.204174075  0.204306749 
##         tmin         gdd0         taug         tmay         tjan         mtco 
##  0.211685887  0.217330901  0.217667932  0.219329733  0.228635104  0.229431369 
##         tdec         tsep         tapr         tfeb         tnov         toct 
##  0.230551893  0.230710874  0.231190173  0.232437510  0.233118635  0.234181837 
##         tmar         tave 
##  0.234373425  0.236613251

For PC1, the variables with the highest positive loadings are tave (0.237) and tmar (0.234); all of the temperature variables load positively. The most negative loading, pjul (-0.006), is essentially zero, so axis 1 is dominated by temperature.

sort(clim_pcs$rotation[ ,2])
##         tjul         mtwa         tjun         gdd5         taug         gdd0 
## -0.148430496 -0.144447320 -0.139592432 -0.130248573 -0.117953668 -0.103531713 
##         tmax         tmay         tsep         tave         tapr         toct 
## -0.097237292 -0.085480955 -0.080352693 -0.036116236 -0.035420680 -0.035366689 
##         tnov         tmar         tmin         tfeb         tdec         tjan 
## -0.007199719 -0.005334822  0.002463008  0.009299060  0.012830270  0.016062674 
##         mtco         pjul         paug         pjun         psep         pmay 
##  0.016106230  0.087475200  0.134792156  0.220609608  0.258582542  0.266701042 
##         pmar         pjan         pfeb         pdec         poct         pnov 
##  0.276185531  0.276538982  0.280741784  0.282098129  0.289754249  0.290545974 
##         papr         annp 
##  0.291801945  0.301634187

For PC2, the variables with the highest positive loadings are annp (0.302) and papr (0.292); the most negative loadings are tjul (-0.148) and mtwa (-0.144).
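
Rather than scanning the sorted loadings by eye, the largest loadings in absolute value on each axis can be pulled out directly:

# Top four loadings (by absolute value) on axes 1 and 2
head(sort(abs(clim_pcs$rotation[, 1]), decreasing = TRUE), 4)
head(sort(abs(clim_pcs$rotation[, 2]), decreasing = TRUE), 4)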

Produce a map of the site scores on axis 1, using the quilt.plot() function from the fields package (code to do this is given with the file description below). With reference to the association between the variables and axis 1 (previous question), give a short description of the map (e.g. where do you find negative or positive values, and what variables are these associated with?). Does this make sense in terms of what you know about the geography and climate of North America?

quilt.plot(wnaclim_all$Longitude, wnaclim_all$Latitude, clim_pcs$x[ ,1])
world(add = T)

PC1 is strongly associated with the temperature variables, so the map of PC1 scores can be read as a temperature gradient. There is a clear gradient from warm values in the southwest to cold values in the Arctic, changing with latitude. There are also clusters of areas with similar climate, such as the Pacific Northwest, the coastal ranges, the Rockies, and the continental interior.
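
This reading of axis 1 as a temperature gradient can be checked by correlating the site scores with mean annual temperature (assuming tave is the annual mean, as the loadings suggest); given the positive loading of tave on PC1, a strong positive correlation is expected:

# PC1 scores should track mean annual temperature closely
cor(clim_pcs$x[, 1], wnaclim$tave)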

Finally, produce a map of the site scores on the second axis and give a short description of the spatial pattern in terms of the associated variables.

quilt.plot(wnaclim_all$Longitude, wnaclim_all$Latitude, clim_pcs$x[ ,2])
world(add = T)

PC2 is strongly associated with precipitation. Drier areas such as the Arctic and the southwest cluster together, as do wetter areas such as the Pacific Northwest and coastal British Columbia.

