The file boston6k.csv contains information on house prices in Boston by census tract, as well as various socio-economic and environmental factors. Use this to cluster the tracts by these factors (NOT by price), then examine the characteristics of the clusters, whether they show a difference in house price and if there is any spatial structure to the clusters. Use only the following variables in the cluster analysis: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT (see the file description for explanations of these). You will need to scale the data as the variables are in a wide range of units.
First, read in the data and take a look at the variables.
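The analysis below uses functions from several add-on packages (map(), brewer.pal(), calinhara(), silhouette(), the dplyr verbs, and quilt.plot()/world()). A minimal setup sketch, assuming these packages are installed:
library(maps)          # map() to add coastlines and borders to plots
library(RColorBrewer)  # brewer.pal() for cluster colors
library(fpc)           # calinhara() for the Calinski-Harabasz index
library(cluster)       # silhouette() for silhouette widths
library(dplyr)         # select(), group_by(), mutate()
library(fields)        # quilt.plot() and world() for the climate maps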
boston_all <- read.csv("../datafiles/boston6k.csv")
head(boston_all)
## ID TOWN TOWNNO TRACT LON LAT MEDV CMEDV CRIM
## 1 1 Boston Allston-Brighton 74 1 -71.13937 42.36245 17.8 17.8 8.98296
## 2 2 Boston Allston-Brighton 74 2 -71.15855 42.35312 21.7 21.7 3.84970
## 3 3 Boston Allston-Brighton 74 3 -71.16849 42.35061 22.7 22.7 5.20177
## 4 4 Boston Allston-Brighton 74 4 -71.15616 42.34382 22.6 22.6 4.26131
## 5 5 Boston Allston-Brighton 74 5 -71.15581 42.33732 25.0 25.0 4.54192
## 6 6 Boston Allston-Brighton 74 6 -71.14548 42.35033 19.9 19.9 3.83684
## ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
## 1 0 18.1 1 0.77 6.212 97.4 2.1222 24 666 20.2 377.73 17.60
## 2 0 18.1 1 0.77 6.395 91.0 2.5052 24 666 20.2 391.34 13.27
## 3 0 18.1 1 0.77 6.127 83.4 2.7227 24 666 20.2 395.43 11.48
## 4 0 18.1 0 0.77 6.112 81.3 2.5091 24 666 20.2 390.74 12.67
## 5 0 18.1 0 0.77 6.398 88.0 2.5182 24 666 20.2 374.56 7.79
## 6 0 18.1 0 0.77 6.251 91.1 2.2955 24 666 20.2 350.65 14.19
plot(cbind(boston_all$LON, boston_all$LAT),
xlab = "Longitude", ylab = "Latitude",
main = "House Locations in Boston Area")
map(add=T)
This is point data for 506 locations in the Boston area. There are 15 neighborhood variables and 6 location variables.
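This can be confirmed by checking the dimensions of the data frame (506 rows and 21 columns; output not shown):
dim(boston_all)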
Next, I will select the desired variables for analysis and scale the data.
vars_interest <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
"RAD", "TAX", "PTRATIO", "B", "LSTAT")
boston_data <- boston_all[ , vars_interest]
boston_data <- scale(boston_data)
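As a quick check that scale() behaved as expected, the column means should now be approximately zero and the standard deviations one (output not shown):
round(colMeans(boston_data), 3)
round(apply(boston_data, 2, sd), 3)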
Next, calculate the distance matrix for the scaled data (the default is Euclidean distance).
boston_dist <- dist(boston_data)
Start by running \(k\)-means cluster analysis on the data from 2 to 20 clusters, using the approach outlined above. You should calculate either the silhouette index or the Calinski-Harabasz index for each set of clusters. Provide a plot of the index values, and identify the number of clusters (\(k\)) that gives the best solution
bntown_hc <- hclust(boston_dist, method = "ward.D2")
plot(bntown_hc)
The dendrogram shows many possible levels of clustering, and it is not clear from this what the ideal number of clusters would be. Possible values are n = 3, 5, 7, or 8.
ch.out <- rep(NA, 20)
sil.out <- rep(NA, 20)
for(i in 2:20) {
  # k-means with i centers and 50 random starts
  bn_kmeans <- kmeans(boston_data, centers = i, nstart = 50)
  # Calinski-Harabasz index for this solution
  ch.out[i] <- calinhara(boston_data, bn_kmeans$cluster)
  # average silhouette width across all observations
  bn_si <- silhouette(bn_kmeans$cluster, boston_dist)
  sil.out[i] <- mean(bn_si[ , 3])
}
plot(x = 1:20, y = ch.out, type = 'b', lwd=2,
xlab = "N Groups", ylab = "CH",
main = "Calinski-Harabasz Index for Boston Data")
The Calinski-Harabasz index generally decreases as the number of groups increases beyond n = 2. By this criterion a small number of groups, around n = 3, appears to be a good choice.
plot(x = 1:20, y = sil.out, type = 'b', lwd=2,
xlab = "N Groups", ylab = "SI",
main = "Average Silhouette Index for Boston Data")
The Silhouette index suggests n = 6 groups as the best choice (the smallest value of n that gives the maximum average silhouette width).
In your opinion, is this the best solution, or would more or fewer clusters be useful?
In my opinion, the best choice of cluster number combines the output of the CH index, the silhouette index, and the dendrogram. For this analysis I think n = 5 is the best value.
plot(bntown_hc)
rect.hclust(bntown_hc, k = 5)
The five groups do not contain equal numbers of observations, which may be a problem for the k-means clustering approach.
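The group sizes from the hierarchical solution can be checked directly by cutting the tree with cutree() (a quick sketch; output not shown):
table(cutree(bntown_hc, k = 5))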
Re-run kmeans() using your chosen number for \(k\)
ngrps <- 5
bntown_kmeans <- kmeans(boston_data, ngrps, nstart = 50, iter.max = 30)
table(bntown_kmeans$cluster)
##
## 1 2 3 4 5
## 217 98 63 65 63
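As an optional check, the k-means groups can be cross-tabulated against a five-group cut of the hierarchical tree to see how well the two methods agree (output not shown):
table(hclust = cutree(bntown_hc, k = ngrps),
      kmeans = bntown_kmeans$cluster)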
mycol <- brewer.pal(n = ngrps, name = "Paired")
plot(cbind(boston_all$LON, boston_all$LAT),
xlab = "", ylab = "", main = "5 Clusters in Boston Housing Data",
ylim = c(42, 43),
pch = 16,
col = mycol[bntown_kmeans$cluster])
legend(x = -71.5, y = 43,
legend = c("Group 1", "Group 2", "Group 3", "Group 4", "Group 5"),
fill = mycol)
map(add = T)
Using the aggregate() function, provide a table showing the median of the variables used in clustering. In 1-2 sentences, describe the characteristics of the clusters.
clusters <- bntown_kmeans$cluster
bn_centers <- aggregate(boston_all[ , vars_interest],
list(clusters), mean)
bn_centers
## Group.1 CRIM ZN INDUS CHAS NOX RM
## 1 1 0.23317871 6.463134 7.168664 0.00000000 0.4909521 6.264594
## 2 2 7.09521286 0.000000 19.073571 0.00000000 0.6562245 6.103612
## 3 3 0.05250905 64.365079 3.113651 0.07936508 0.4171286 6.831873
## 4 4 15.79153969 0.000000 18.498769 0.18461538 0.7143846 5.774369
## 5 5 0.83736714 4.642857 12.886032 0.28571429 0.5891270 6.614476
## AGE DIS RAD TAX PTRATIO B LSTAT
## 1 58.50276 4.466467 4.534562 301.0691 18.54562 388.6646 10.193963
## 2 89.87755 2.169315 20.122449 632.7857 20.34898 362.8751 17.008878
## 3 28.98254 7.184990 3.714286 307.9206 16.69048 388.5114 5.536349
## 4 92.66308 1.771475 20.476923 622.1846 19.26769 202.9091 21.550923
## 5 84.86984 2.709129 4.936508 307.6508 16.12698 363.6473 12.283968
Cluster 1 has the second lowest crime rate, about 6.5% of land zoned for larger residential lots, the second lowest industrial share, the second lowest NOx concentration, 58.5% of buildings built before 1940, middle values for commuting distance, the lowest tax rate, middle values for the pupil-teacher ratio and the lower-status population, and the highest proportion of African-American residents. This group looks like locations on the border between suburban and urban areas.
Cluster 2 has the second highest crime rate, no land zoned for larger residential lots, about 19% industrial land use, no tracts bordering the Charles River, the second highest NOx concentration (0.656 pp10m), 89.9% of buildings built before 1940, the second shortest weighted distance to employment centers, the second highest accessibility to highways, the highest property tax rate, the highest pupil-teacher ratio, a middle value for the African-American proportion, and a 17% lower-status population. This looks like a more industrial, inner-city area.
Cluster 3 has the lowest crime rate, 64% of land zoned for larger residential lots, the lowest industrial share, some tracts bordering the Charles River, the lowest NOx concentration, only 29.0% of buildings built before 1940 (so the newest housing stock), the furthest weighted distance to employment centers, the lowest accessibility to highways, a pupil-teacher ratio of 16.7, the second highest African-American proportion, and the lowest percentage of lower-status population. This area is most likely suburbs far from the city center.
Cluster 4 has the highest crime rate, no land zoned for larger residential lots, the second highest industrial share, the second highest proportion of tracts bordering the Charles River, the highest NOx concentration, the highest proportion of older buildings (92.7% built before 1940), the shortest commuting distance, the highest accessibility to highways, the second highest tax rate, the second highest pupil-teacher ratio, the lowest African-American proportion, and the highest percentage of lower-status population. This is also a more industrial, inner-city area, in an older part of downtown.
Cluster 5 has middle values for crime, residential lots, industry, NOx, building age, commuting distance, tax, and the African-American and lower-status proportions. It has the lowest pupil-teacher ratio and the highest proportion of tracts bordering the Charles River. Like cluster 1, it appears to represent locations on the border between suburban and urban areas, but with a slightly different profile.
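Note that the question asks for medians; the same aggregate() call with median in place of mean gives the more robust equivalent (a sketch; output not shown):
bn_medians <- aggregate(boston_all[ , vars_interest],
                        list(clusters), median)
bn_medians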
Report the mean corrected house value per cluster
First, we need to add the cluster information to the Boston data, then calculate the mean house value for each cluster.
bntown_data_clustered <- cbind(boston_all, clusters)
bntown_data_clustered$clusters <- factor(bntown_data_clustered$clusters)
mean_vals <- bntown_data_clustered %>% select(MEDV, clusters) %>%
group_by(clusters) %>%
mutate(mean_cluster_val = mean(MEDV))
bntown_data_clustered$mean_clust_val = mean_vals$mean_cluster_val
bntown_data_clustered$adj_house_val = bntown_data_clustered$MEDV - bntown_data_clustered$mean_clust_val
cluster_means <- data.frame(group = 1:ngrps,
                            mean_val = tapply(bntown_data_clustered$MEDV,
                                              bntown_data_clustered$clusters, mean))
cluster_means
##   group mean_val
## 1     1 23.10507
## 2     2 17.36735
## 3     3 30.63492
## 4     4 14.57385
## 5     5 28.70635
Cluster 3 (the suburban group) has the highest mean house value, and cluster 4 (the older, inner-city group) the lowest.
Use anova() to test whether the values are
significantly different between clusters. You will need the vector of
house prices/values and the vector of clusters from
kmeans(). Give the \(F\)-statistic and the \(p\)-value
price_lm <- lm(MEDV ~ 1, data = bntown_data_clustered)
summary(price_lm)
##
## Call:
## lm(formula = MEDV ~ 1, data = bntown_data_clustered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.533 -5.508 -1.333 2.467 27.467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.4089 55.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.197 on 505 degrees of freedom
anova(price_lm)
## Analysis of Variance Table
##
## Response: MEDV
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 505 42716 84.587
Fitting an intercept-only model (home price modeled by its overall mean) gives a highly significant intercept and a residual sum of squares of 42716.
cluster_lm <- lm(MEDV ~ clusters, data = bntown_data_clustered)
summary(cluster_lm)
##
## Call:
## lm(formula = MEDV ~ clusters, data = bntown_data_clustered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.506 -4.605 -1.174 2.465 35.426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.1051 0.5198 44.449 < 2e-16 ***
## clusters2 -5.7377 0.9319 -6.157 1.52e-09 ***
## clusters3 7.5299 1.0959 6.871 1.90e-11 ***
## clusters4 -8.5312 1.0827 -7.879 2.06e-14 ***
## clusters5 5.6013 1.0959 5.111 4.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.657 on 501 degrees of freedom
## Multiple R-squared: 0.3123, Adjusted R-squared: 0.3068
## F-statistic: 56.88 on 4 and 501 DF, p-value: < 2.2e-16
anova(cluster_lm)
## Analysis of Variance Table
##
## Response: MEDV
## Df Sum Sq Mean Sq F value Pr(>F)
## clusters 4 13340 3335.0 56.877 < 2.2e-16 ***
## Residuals 501 29376 58.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding the clusters to the model gives significant coefficients for all cluster levels, and the residual sum of squares is much lower (29376). We can compare the two models with anova():
anova(price_lm, cluster_lm)
## Analysis of Variance Table
##
## Model 1: MEDV ~ 1
## Model 2: MEDV ~ clusters
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 505 42716
## 2 501 29376 4 13340 56.877 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding the clusters to the model substantially reduces the RSS (from 42716 to 29376). The comparison gives an \(F\)-statistic of 56.88 with a \(p\)-value less than 2.2e-16, so mean house value differs significantly between clusters.
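As an optional follow-up (not required by the question), pairwise differences between cluster means could be examined with Tukey's HSD on an equivalent aov() fit (a sketch; output not shown):
cluster_aov <- aov(MEDV ~ clusters, data = bntown_data_clustered)
TukeyHSD(cluster_aov)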
The file wnaclim2.csv contains a set of climatic variables for sites distributed across western North America. Use principal component analysis to explore the spatial distribution of climate. This will require you to install the add-on package fields for plotting.
Read in the data
wnaclim_all <- read.csv("../datafiles/wnaclim2.csv")
plot(cbind(wnaclim_all$Longitude, wnaclim_all$Latitude),
xlab = "", ylab = "", main = "Climate Stations")
map(add = T)
Read in the file and perform principal component analysis using the monthly temperature and precipitation variables (these are the same as you used in the cluster analysis in the lab). Use the SVD approach with the function prcomp(). Note that you will have to scale the data in the PCA to avoid any bias from the difference in magnitude of the variables in the dataset (use the scale=TRUE parameter). Make a biplot of this ordination (biplot()) and a scree-plot showing the variance explained by the components (screeplot()).
We first need to remove the longitude/latitude columns.
wnaclim <- wnaclim_all[ , -c(1,2)]
clim_pcs <- prcomp(wnaclim, scale = T)
biplot(clim_pcs)
This is a large dataset and the ordination plot is fairly crowded. It appears that PC1 is weighted more toward the temperature variables, and PC2 more toward the precipitation variables.
screeplot(clim_pcs)
More than 95% of the variance is explained by the first four principal components (PC1, PC2, PC3, and PC4).
Give the total variance from the second PCA. Calculate the total
percentage of variance explained by axes 1 and 2 (use
summary())
summary(clim_pcs)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.1779 3.0900 1.60849 1.00332 0.78439 0.52420 0.39309
## Proportion of Variance 0.5455 0.2984 0.08085 0.03146 0.01923 0.00859 0.00483
## Cumulative Proportion 0.5455 0.8439 0.92471 0.95617 0.97540 0.98398 0.98881
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.30346 0.27875 0.20304 0.18712 0.17359 0.15187 0.13437
## Proportion of Variance 0.00288 0.00243 0.00129 0.00109 0.00094 0.00072 0.00056
## Cumulative Proportion 0.99169 0.99412 0.99541 0.99650 0.99744 0.99816 0.99873
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.10876 0.08293 0.07672 0.05834 0.05672 0.04860 0.04102
## Proportion of Variance 0.00037 0.00021 0.00018 0.00011 0.00010 0.00007 0.00005
## Cumulative Proportion 0.99910 0.99931 0.99950 0.99960 0.99970 0.99978 0.99983
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.03837 0.03355 0.02829 0.02502 0.02456 0.01959 0.01549
## Proportion of Variance 0.00005 0.00004 0.00003 0.00002 0.00002 0.00001 0.00001
## Cumulative Proportion 0.99988 0.99991 0.99994 0.99995 0.99997 0.99999 0.99999
## PC29 PC30 PC31 PC32
## Standard deviation 0.01212 0.008449 0.000535 1.39e-15
## Proportion of Variance 0.00000 0.000000 0.000000 0.00e+00
## Cumulative Proportion 1.00000 1.000000 1.000000 1.00e+00
PC1 accounts for 54.55% of the variance and PC2 for 29.84%; together the two components account for 84.39% of the variance. Because the variables were scaled before the PCA, the total variance is simply the number of variables (32). The variances (squared standard deviations) of PC1 and PC2 are:
clim_pcs$sdev[1:2]^2
## [1] 17.455207 9.548334
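As a check, the total variance and the combined share of the first two axes can be computed directly from the component standard deviations (this reproduces the 84.4% reported by summary(); output not shown):
sum(clim_pcs$sdev^2)                              # total variance (= 32 for scaled data)
sum(clim_pcs$sdev[1:2]^2) / sum(clim_pcs$sdev^2)  # proportion explained by PC1 and PC2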
Examine the ‘loadings’ of the variables on the first two axes
(wnaclim.pca$rotation). Name two variables that are highly
associated (high positive or negative values) with axis 1 and two that
are highly associated with axis 2, and give their scores.
sort(clim_pcs$rotation[ ,1])
## pjul paug pjun psep poct pmay
## -0.005671661 0.002371154 0.004448985 0.040340802 0.066096892 0.073861228
## annp pnov papr pdec pjan pfeb
## 0.084419341 0.086049233 0.091428949 0.092353434 0.092875437 0.094670457
## pmar tmax tjun tjul mtwa gdd5
## 0.101293427 0.196130102 0.199709812 0.202588315 0.204174075 0.204306749
## tmin gdd0 taug tmay tjan mtco
## 0.211685887 0.217330901 0.217667932 0.219329733 0.228635104 0.229431369
## tdec tsep tapr tfeb tnov toct
## 0.230551893 0.230710874 0.231190173 0.232437510 0.233118635 0.234181837
## tmar tave
## 0.234373425 0.236613251
For PC1, the two variables with the highest loadings are tave (0.237) and tmar (0.234), both temperature variables. The only negative loading, pjul (-0.006), is negligible.
sort(clim_pcs$rotation[ ,2])
## tjul mtwa tjun gdd5 taug gdd0
## -0.148430496 -0.144447320 -0.139592432 -0.130248573 -0.117953668 -0.103531713
## tmax tmay tsep tave tapr toct
## -0.097237292 -0.085480955 -0.080352693 -0.036116236 -0.035420680 -0.035366689
## tnov tmar tmin tfeb tdec tjan
## -0.007199719 -0.005334822 0.002463008 0.009299060 0.012830270 0.016062674
## mtco pjul paug pjun psep pmay
## 0.016106230 0.087475200 0.134792156 0.220609608 0.258582542 0.266701042
## pmar pjan pfeb pdec poct pnov
## 0.276185531 0.276538982 0.280741784 0.282098129 0.289754249 0.290545974
## papr annp
## 0.291801945 0.301634187
For PC2, the two variables with the highest positive loadings are annp (0.302) and papr (0.292), both precipitation variables; the strongest negative loading is tjul (-0.148).
Produce a map of the sites scores on axis 1, using the
quilt.plot() function from the fields package (code to do
this is given with the file description below). With reference to the
association between the variables and axis 1 (previous question), give a
short description of the map (e.g. where do you find negative or
positive values and what variables are these associated with?). Does
this make sense in terms of what you know about the geography and
climate of North America?
quilt.plot(wnaclim_all$Longitude, wnaclim_all$Latitude, clim_pcs$x[ ,1])
world(add = T)
PC1 is strongly associated with the temperature variables. The map shows the station locations colored by their PC1 scores, which can be read as a temperature gradient: warm values in the southwest grading to cold values in the Arctic, largely following latitude. There are also clusters of areas with similar climate, such as the Pacific Northwest, the coastal ranges, the Rockies, and the continental interior.
Finally, produce a map of the sites scores on the second axis and give a short description of the spatial pattern in terms of the associated variables
quilt.plot(wnaclim_all$Longitude, wnaclim_all$Latitude, clim_pcs$x[ ,2])
world(add = T)
PC2 is strongly associated with the precipitation variables. Drier areas such as the Arctic and the Southwest group together, while wetter areas such as the Pacific Northwest and coastal British Columbia form another cluster.