Lab 06 Exercises

Eric Goodwin || October 28, 2021

Exercise 1

The file boston6k.csv contains information on house prices in Boston by census tract, as well as various socio-economic and environmental factors. Use this to cluster the tracts by these factors (NOT by price), then examine the characteristics of the clusters, whether they show a difference in house price and if there is any spatial structure to the clusters. Use only the following variables in the cluster analysis: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT (see the file description for explanations of these). You will need to scale the data as the variables are in a wide range of units.
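As a small illustration of why scaling matters here, the sketch below uses a made-up four-row matrix (the values are hypothetical, only the column names are borrowed from the data set) to show that scale() puts every column on a mean of 0 and a standard deviation of 1, so that no single unit dominates the distances that k-means relies on.

```r
# Toy data: two variables on very different scales (values are made up).
x <- cbind(TAX = c(300, 600, 450, 700),
           NOX = c(0.45, 0.70, 0.52, 0.68))

# Unscaled Euclidean distances are driven almost entirely by TAX.
d.raw <- dist(x)

# scale() centers each column to mean 0 and rescales it to standard deviation 1.
x.s <- scale(x)
d.scaled <- dist(x.s)

# After scaling, both columns contribute on a comparable footing.
round(colMeans(x.s), 10)
apply(x.s, 2, sd)
```

Comparing d.raw with d.scaled shows how much the ordering of distances can change once NOX is allowed to count.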

Start by running k-means cluster analysis on the data from 2 to 20 clusters, using the approach outlined above. You should calculate either the silhouette index or the Calinski-Harabasz index for each set of clusters. Provide a plot of the index values, and identify the number of clusters (k) that gives the best solution.

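library(cluster)
library(fpc)  # provides calinhara()

# Read the data and keep only the 13 clustering variables (columns 9 to 21)
boston = read.csv("../datafiles/boston6k.csv")
boston2 = boston[, seq(9,21)]

# Standardize each variable so that no single unit dominates the distances
boston2.s = scale(boston2)

# Calinski-Harabasz index for k = 2 to 20 (entry 1 stays NA)
ch.out = rep(NA,20)
for (i in 2:20) {
  boston2.kmeans = kmeans(boston2.s, centers = i, nstart = 50)
  ch.out[i] = calinhara(boston2.s, boston2.kmeans$cluster)
}
plot(1:20, ch.out, type ='b', lwd=2,
     xlab = "N Groups", ylab = "Calinski-Harabasz Index",
     main = "Calinski-Harabasz Index for Boston Dataset")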

From the Calinski-Harabasz index, 2 clusters gives the best solution: the index is highest at k = 2 and drops off sharply after that.
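The prompt also allows the silhouette index, which can serve as a cross-check on the choice of k. The sketch below is not run on the Boston data: it uses simulated points (two well-separated groups, generated here purely for illustration) and computes the average silhouette width for a handful of k values with silhouette() from the cluster package.

```r
library(cluster)

set.seed(1)
# Simulated stand-in for the scaled data: two well-separated groups in 2-D.
sim <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
sim.s <- scale(sim)
sim.dist <- dist(sim.s)

# Average silhouette width for k = 2 to 6 (entry 1 stays NA).
sil.out <- rep(NA, 6)
for (k in 2:6) {
  km <- kmeans(sim.s, centers = k, nstart = 20)
  sil <- silhouette(km$cluster, sim.dist)
  sil.out[k] <- mean(sil[, "sil_width"])
}

# The k with the largest average width is the preferred solution.
best.k <- which.max(sil.out)
best.k
```

The same loop could be pointed at boston2.s with k running from 2 to 20; if both indices agree on k, that is some reassurance that the choice is not an artifact of one criterion.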

In your opinion, is this the best solution, or would more or fewer clusters be useful?

I honestly don't know. Two clusters seems like a small number, but it might be just fine. Since the index values drop off so quickly, I feel like it would be hard to justify using more than 2 groups.

Re-run kmeans() using your chosen number for k

numgroups = 2
boston2.kmeans2 = kmeans(boston2.s, numgroups, nstart = 5, iter.max = 20)
table(boston2.kmeans2$cluster)
## 
##   1   2 
## 177 329
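Because kmeans() starts from randomly chosen centers, the cluster numbering (and occasionally the membership) can change from one run to the next. A minimal sketch of making the run repeatable, using simulated data and arbitrary seed values chosen only for illustration:

```r
set.seed(42)
# Simulated stand-in for the scaled data (50 rows, 4 variables).
sim.s <- scale(matrix(rnorm(200), ncol = 4))

# Setting the same seed before each call gives the same random starts.
set.seed(123)
km.a <- kmeans(sim.s, centers = 2, nstart = 5, iter.max = 20)
set.seed(123)
km.b <- kmeans(sim.s, centers = 2, nstart = 5, iter.max = 20)

# Both runs now assign every row to the same cluster label.
identical(km.a$cluster, km.b$cluster)
table(km.a$cluster)
```

Without a seed, labels 1 and 2 may swap between runs, so any later text that refers to "cluster 1" should be checked against the table actually produced.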

Using the aggregate() function, provide a table showing the median of the variables used in clustering. In 1-2 sentences, describe the characteristics of the clusters.

boston.centers = aggregate(boston2, list(boston2.kmeans2$cluster), mean)

boston.centers
##   Group.1      CRIM      ZN     INDUS       CHAS       NOX       RM      AGE
## 1       1 9.8447302  0.0000 19.039718 0.06779661 0.6805028 5.967181 91.31808
## 2       2 0.2611723 17.4772  6.885046 0.06990881 0.4870112 6.455422 56.33921
##        DIS       RAD      TAX  PTRATIO        B     LSTAT
## 1 2.007242 18.988701 605.8588 19.60452 301.3317 18.572768
## 2 4.756868  4.471125 301.9179 17.83739 386.4479  9.468298

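Looking at the mean of each variable by cluster, cluster 1 has notably higher crime (CRIM), a higher proportion of industrial land (INDUS), higher nitric oxide concentration (NOX), a larger share of older owner-occupied units (AGE), better access to radial highways (RAD), higher property tax rates (TAX), and a higher percentage of lower-status population (LSTAT). Cluster 2 has a higher proportion of residential land zoned for large lots (ZN), slightly more rooms per dwelling (RM), greater distance to employment centers (DIS), and a higher value of B. In the standard Boston housing data description, B is a transformation of the proportion of Black residents, 1000(Bk - 0.63)^2, rather than the proportion itself, and values near its maximum of about 397 correspond to a low proportion, so cluster 2's higher B should not be read as a higher proportion of African-American residents.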
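The prompt asks for medians, whereas the table above reports means. As a sketch of the median version, the example below applies aggregate() with FUN = median to simulated data (the column names are borrowed from the data set, the values are made up); the same call shape should carry over to boston2 and boston2.kmeans2$cluster.

```r
set.seed(1)
# Simulated stand-in for the clustering variables (values are made up).
sim <- data.frame(CRIM  = rexp(60),
                  RM    = rnorm(60, mean = 6),
                  LSTAT = runif(60, 2, 30))
sim.cluster <- kmeans(scale(sim), centers = 2, nstart = 20)$cluster

# One row per cluster, one column per variable, each cell a median.
sim.medians <- aggregate(sim, by = list(cluster = sim.cluster), FUN = median)
sim.medians
```

Medians are less sensitive than means to heavily skewed variables such as CRIM, which may be why the prompt asks for them.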

Report the mean corrected house value per cluster

tapply(boston$CMEDV, boston2.kmeans2$cluster, mean)
##        1        2 
## 16.52712 25.75775

Cluster 2 has the higher mean house value (about 25.76, versus 16.53 for cluster 1).

Use anova() to test whether the values are significantly different between clusters. You will need the vector of house prices/values and the vector of clusters from kmeans(). Give the F-statistic and the p-value.

boston$cluster = boston2.kmeans2$cluster
aov(cluster ~ CMEDV, data=boston)
## Call:
##    aov(formula = cluster ~ CMEDV, data = boston)
## 
## Terms:
##                    CMEDV Residuals
## Sum of Squares  26.50438  88.58060
## Deg. of Freedom        1       504
## 
## Residual standard error: 0.4192316
## Estimated effects may be unbalanced

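This did not give me an F-statistic or p-value. Printing an aov object on its own reports only the sums of squares and degrees of freedom; the F-statistic and p-value appear when the fitted model is passed to summary(). The formula above also has the response and predictor reversed: house value should be the response, as in CMEDV ~ factor(cluster). Because there are only two groups, I ran a t-test to compare the means directly.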

t.test(CMEDV ~ cluster, data=boston)
## 
##  Welch Two Sample t-test
## 
## data:  CMEDV by cluster
## t = -12.6, df = 387.68, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -10.670982  -7.790282
## sample estimates:
## mean in group 1 mean in group 2 
##        16.52712        25.75775

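The p-value is very small (< 2.2e-16), so the difference in mean house value between the two clusters is statistically significant. The 95 percent confidence interval puts the cluster 1 mean between about 7.8 and 10.7 below the cluster 2 mean.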
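For the ANOVA step itself, a sketch of a call that does report an F-statistic and p-value is given below. It uses simulated values and cluster labels (generated here for illustration, not the Boston results) and also checks that, with two groups, the ANOVA F-statistic equals the square of the equal-variance t-statistic. The Welch test used above does not assume equal variances, so its t-statistic will generally differ slightly.

```r
set.seed(1)
# Simulated stand-in: a numeric response and a two-level cluster label.
sim <- data.frame(
  value   = c(rnorm(40, mean = 16, sd = 5), rnorm(60, mean = 25, sd = 5)),
  cluster = rep(c(1, 2), times = c(40, 60))
)

# Response on the left, grouping factor on the right; summary() gives the table.
fit <- aov(value ~ factor(cluster), data = sim)
fit.table <- summary(fit)[[1]]
f.stat <- fit.table[1, "F value"]
p.val  <- fit.table[1, "Pr(>F)"]

# With two groups, F from the ANOVA equals t^2 from the pooled-variance t-test.
t.pooled <- t.test(value ~ cluster, data = sim, var.equal = TRUE)
all.equal(f.stat, unname(t.pooled$statistic)^2)
```

Applied to the lab data, the corresponding call would be summary(aov(CMEDV ~ factor(cluster), data = boston)).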