The file boston6k.csv contains information on house prices in Boston by census tract, as well as various socio-economic and environmental factors. Use this to cluster the tracts by these factors (NOT by price), then examine the characteristics of the clusters, whether they show a difference in house price and if there is any spatial structure to the clusters. Use only the following variables in the cluster analysis: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT (see the file description for explanations of these). You will need to scale the data as the variables are in a wide range of units.
Start by running k-means cluster analysis on the data from 2 to 20 clusters, using the approach outlined above. You should calculate either the silhouette index or the Calinski-Harabasz index for each set of clusters. Provide a plot of the index values, and identify the number of clusters (k) that gives the best solution.
library(cluster)
library(fpc)
boston = read.csv("../datafiles/boston6k.csv")
boston2 = boston[, seq(9,21)]
boston2.s = scale(boston2)
ch.out = rep(NA,20)
for (i in 2:20) {
  # 50 random starts per k to reduce the chance of a poor local optimum
  boston2.kmeans = kmeans(boston2.s, centers = i, nstart = 50)
  ch.out[i] = calinhara(boston2.s, boston2.kmeans$cluster)
}
plot(1:20, ch.out, type ='b', lwd=2,
     xlab = "N Groups", ylab = "C",
     main = "Calinski-Harabasz Index for Boston Dataset")
From the Calinski-Harabasz index, 2 clusters appears to give the best solution: the index is highest at k = 2 and drops off sharply afterwards.
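To make the index less of a black box, here is a minimal sketch of how the Calinski-Harabasz value can be computed directly from what kmeans() returns: the between-cluster sum of squares divided by (k - 1), over the total within-cluster sum of squares divided by (n - k). It uses synthetic data rather than the Boston file, and `ch_index` is a hypothetical helper name, not part of fpc. I would expect it to agree with calinhara(), but I have not checked that on this dataset.

```r
# Synthetic stand-in for the scaled data: two well-separated groups in 3 dimensions
set.seed(1)
x <- rbind(matrix(rnorm(150, mean = 0), ncol = 3),
           matrix(rnorm(150, mean = 4), ncol = 3))
x.s <- scale(x)

# Hypothetical helper: Calinski-Harabasz index from a kmeans() fit
# CH = (between SS / (k - 1)) / (within SS / (n - k))
ch_index <- function(km, n) {
  k <- length(km$size)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

ch <- rep(NA, 6)
for (k in 2:6) {
  km <- kmeans(x.s, centers = k, nstart = 25)
  ch[k] <- ch_index(km, nrow(x.s))
}

best_k <- which.max(ch)  # which.max() discards the NA in position 1
print(round(ch, 1))
print(best_k)
```

Because the index divides by (k - 1), it rewards solutions where a small number of clusters already captures most of the between-group variation, which is one reason a sharp peak at k = 2 is common.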
In your opinion, is this the best solution, or would more or fewer clusters be useful?
I honestly don't know. Two clusters seems like a small number, but it may be fine. Because the index values drop off so quickly after k = 2, it would be hard to justify using more than 2 groups.
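One way to check the choice of k is to compute the other index the question mentions, the silhouette index, and see whether it points to the same number of clusters. The sketch below uses silhouette() from the cluster package on synthetic data (not the Boston file); the object names are illustrative only.

```r
library(cluster)  # provides silhouette()

# Synthetic stand-in for the scaled data: two groups in 3 dimensions
set.seed(1)
x <- rbind(matrix(rnorm(150, mean = 0), ncol = 3),
           matrix(rnorm(150, mean = 4), ncol = 3))
x.s <- scale(x)
d <- dist(x.s)  # pairwise Euclidean distances, computed once

sil.out <- rep(NA, 6)
for (k in 2:6) {
  km <- kmeans(x.s, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  # Average silhouette width over all observations for this k
  sil.out[k] <- mean(sil[, "sil_width"])
}
print(round(sil.out, 3))
```

With the objects used in this exercise, the same loop would run over boston2.s for k from 2 to 20, and the values could be plotted in the same way as ch.out. If both indices peak at the same k, that is some reassurance; if they disagree, the larger k may be worth examining.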
Re-run kmeans() using your chosen number for k
numgroups = 2
boston2.kmeans2 = kmeans(boston2.s, numgroups, nstart = 5, iter.max = 20)
table(boston2.kmeans2$cluster)
##
## 1 2
## 177 329
Using the aggregate() function, provide a table showing the median of the variables used in clustering. In 1-2 sentences, describe the characteristics of the clusters.
boston.centers = aggregate(boston2, list(boston2.kmeans2$cluster), mean)
boston.centers
## Group.1 CRIM ZN INDUS CHAS NOX RM AGE
## 1 1 9.8447302 0.0000 19.039718 0.06779661 0.6805028 5.967181 91.31808
## 2 2 0.2611723 17.4772 6.885046 0.06990881 0.4870112 6.455422 56.33921
## DIS RAD TAX PTRATIO B LSTAT
## 1 2.007242 18.988701 605.8588 19.60452 301.3317 18.572768
## 2 4.756868 4.471125 301.9179 17.83739 386.4479 9.468298
By looking at the mean value for each variable by cluster, we can see that cluster 1 has notably higher crime and a higher proportion of industrial land, a larger share of older owner-occupied units, greater access to radial highways, a higher property tax rate, and a higher percentage of lower-status population. Cluster 2 has a higher proportion of residential land, greater distance to employment centers, and a higher value of B. If B is defined as in the usual Boston housing documentation, 1000(Bk - 0.63)^2 with Bk the proportion of Black residents, a higher B does not by itself mean a higher proportion.
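The question asks for medians, while the table above reports means. A minimal sketch of the median version of the same aggregate() call, on a small synthetic data frame (the names `dat`, `a`, `b` and `med.tab` are made up for illustration):

```r
# Synthetic data: two variables, two obvious groups
set.seed(1)
dat <- data.frame(a = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
                  b = c(rnorm(20, mean = 10), rnorm(20, mean = 2)))

# Cluster on the scaled variables, as in the exercise
km <- kmeans(scale(dat), centers = 2, nstart = 25)

# Summarise the ORIGINAL (unscaled) variables by cluster, using the median
med.tab <- aggregate(dat, list(cluster = km$cluster), median)
print(med.tab)
```

With the objects from this exercise the equivalent call would be aggregate(boston2, list(boston2.kmeans2$cluster), median). Medians are less affected than means by skewed variables such as CRIM, so the two tables can differ noticeably.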
Report the mean corrected house value per cluster
tapply(boston$CMEDV, boston2.kmeans2$cluster, mean)
## 1 2
## 16.52712 25.75775
Cluster 2 has the higher mean house value (25.76 versus 16.53 for cluster 1).
Use anova() to test whether the values are significantly different between clusters. You will need the vector of house prices/values and the vector of clusters from kmeans(). Give the F-statistic and the p-value
boston$cluster = boston2.kmeans2$cluster
aov(cluster ~ CMEDV, data=boston)
## Call:
## aov(formula = cluster ~ CMEDV, data = boston)
##
## Terms:
## CMEDV Residuals
## Sum of Squares 26.50438 88.58060
## Deg. of Freedom 1 504
##
## Residual standard error: 0.4192316
## Estimated effects may be unbalanced
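This did not give me an F-statistic or p-value. Printing an aov object shows only the sums of squares; the F-statistic and p-value come from calling summary() on it. The formula is also reversed: house value should be the response, as in CMEDV ~ factor(cluster). From the sums of squares printed above, F = 26.50438 / (88.58060 / 504) ≈ 150.8 on 1 and 504 degrees of freedom, and with a single two-level predictor this F is the same whichever variable is treated as the response. Because there are only two groups, I also ran a t-test.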
t.test(CMEDV ~ cluster, data=boston)
##
## Welch Two Sample t-test
##
## data: CMEDV by cluster
## t = -12.6, df = 387.68, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -10.670982 -7.790282
## sample estimates:
## mean in group 1 mean in group 2
## 16.52712 25.75775
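The p-value is very small (< 2.2e-16), so the difference in mean house value between the two clusters is statistically significant. The 95% confidence interval puts the cluster 1 mean between about 7.8 and 10.7 below the cluster 2 mean.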
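For the ANOVA the question asks for, a minimal sketch of how the F-statistic and p-value can be extracted, using synthetic values rather than the Boston data (all object names here are illustrative). It also checks that, with two groups, the ANOVA F equals the square of the equal-variance t-statistic; the Welch test shown above does not assume equal variances, so its t will not match exactly.

```r
# Synthetic stand-in: house values for two clusters of unequal size
set.seed(1)
dat <- data.frame(
  value = c(rnorm(40, mean = 16, sd = 5), rnorm(60, mean = 25, sd = 5)),
  group = factor(rep(c(1, 2), times = c(40, 60)))  # cluster label as a factor
)

# Response on the left, grouping factor on the right
fit <- aov(value ~ group, data = dat)
tab <- summary(fit)[[1]]        # the ANOVA table
f.stat <- tab[["F value"]][1]
p.val  <- tab[["Pr(>F)"]][1]
print(c(F = f.stat, p = p.val))

# Pooled-variance t-test: t^2 should equal the ANOVA F for two groups
tt <- t.test(value ~ group, data = dat, var.equal = TRUE)
print(unname(tt$statistic)^2)
```

With the objects from this exercise the call would be summary(aov(CMEDV ~ factor(cluster), data = boston)).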