GEOG 6000
Date: 10/29/2015
Sulochan Dhungel
setwd("C:/Users/Sulochan/Copy/Fall 2015/Advance Geo/Assignment 6")
boston2 = read.csv("boston.csv")
#head(boston2)
boston = boston2[,11:23]
boston.sc = scale(boston)
#install.packages("cluster",repos="http://cran.us.r-project.org")
library(cluster)
#install.packages("fpc",repos="http://cran.us.r-project.org")
library(fpc)
ch.out = rep(NA,20)
sil.out = rep(NA,20)
for (i in 2:20){
boston.kmeans = kmeans(boston.sc, centers = i, nstart = 50)
ch.out[i] = calinhara(boston.sc, boston.kmeans$cluster)
tmp = silhouette(boston.kmeans$cluster, dist(boston.sc))
sil.out[i] = mean(tmp[,3])
}
plot(1:20,ch.out, type='b', lwd=2,
xlab="N Groups", ylab="C", main="Calinski-Harabasz index")
grid()
plot(1:20,sil.out, type='b', lwd=2,
xlab="N Groups", ylab="C", main="Average silhouette index")
grid()
Both plots indicate that 2 might be the best set of clusters. Based on these two measures, 2 clusters might be enough to separate the data. But since the dataset has 506 rows and 13 variables used to differentiate prices of house, I would think, it would be worthwhile to investigate on more than 2 clusters.
Investigating 2 clusters
boston.kmeans = kmeans(boston.sc, centers = 2, nstart = 50)
layout(matrix(c(1:2), 1, 2, byrow = TRUE))
for (i in 1:2){
ind = which(boston.kmeans$cluster == i)
boxplot(boston.sc[ind,],ylim=c(-4,10),las=2)
abline(h=0,col="gray")
mtext(paste("size = ",length(ind),sep=""),3)
}
These boxplots of the variables of the two clusters are different. Almost all the positive variables in one of the cluster is negative for the same variables in next cluster.
A PCA can help to reduce the variability of 13 variables to less number of variables so that interpretation of clusters could be easier.
Choice of cluster
For my preliminary analysis I chose 6 clusters since average silhouette index was highest at N=6 after N=2. This was based solely on this chart and requires intensive checking the cluster properties after selection.
boston.kmeans = kmeans(boston.sc, centers = 6, nstart = 50)
aggregate(boston,list(boston.kmeans$cluster),median)
## Group.1 CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX
## 1 1 0.043010 55 3.44 0 0.428 6.7180 30.80 7.03550 4 304
## 2 2 0.484245 0 13.89 1 0.550 6.2310 88.55 2.96865 5 307
## 3 3 0.144550 0 7.07 0 0.507 6.1440 67.80 3.93420 4 296
## 4 4 8.103765 0 18.10 0 0.693 6.1380 95.70 1.96480 24 666
## 5 5 1.084555 0 21.89 0 0.624 5.8795 96.60 1.87270 4 403
## 6 6 12.884700 0 18.10 0 0.679 6.1975 92.95 1.92490 24 666
## PTRATIO B LSTAT
## 1 16.6 392.900 5.570
## 2 17.6 390.580 10.990
## 3 18.4 393.490 9.680
## 4 20.2 391.085 17.500
## 5 19.1 377.880 15.965
## 6 20.2 45.755 19.930
The median values show an increasing or decreasing trends in variable values based on socio-economic aspects. The crime values, tax rates, pollution, pupil teacher ratio, age are increasing for classes 1, 3, 2, 5, 4 and 6 while distances from employment services are decreasing with each of these clusters. This indicates that the first few clusters (1,3,2) are better housing communities than the latter ones.
Although these values give an idea about how the clusters are formed, to interpret each class, I tried plotting box plots of variables of each cluster.
layout(matrix(c(1:6), 2, 3, byrow = TRUE))
for (i in 1:6){
ind = which(boston.kmeans$cluster == i)
boxplot(boston.sc[ind,],ylim=c(-4,10),las=2)
abline(h=0,col="gray")
mtext(paste("size = ",length(ind),sep=""),3)
}
These plots also visually gives the variabilities of the variables in the clusters. e.g. An interpretation could be that for class 4(size = 86) and class 6(size = 38), there is not much difference in any of the variables other than the black proportion in the cluster. The crime rates variabilty is a bit more in sixth cluster compared to the forth cluster.
Mean corrected house value per cluster
mean_cmedv = signif(aggregate(boston2$CMEDV*1000, list(boston.kmeans$cluster), mean), digits=5)
# Sorting the classes based on house values
colnames(mean_cmedv) = c("Cluster","House price in $")
mean_cmedv[with(mean_cmedv, order(-mean_cmedv[,2])), ]
## Cluster House price in $
## 1 1 29037
## 2 2 27806
## 3 3 23985
## 5 5 19552
## 4 4 16691
## 6 6 12484
The housing values are also ordered as 1,2,3, 5, 4 and 6. This shows that the house values are very closely related to the social and economic characterisitcs of neighborhood.
Analysis of Variance of house prices and clusters
aov.cmedv = aov(boston2$CMEDV~boston.kmeans$cluster)
summary(aov.cmedv)
## Df Sum Sq Mean Sq F value Pr(>F)
## boston.kmeans$cluster 1 10542 10542 165.8 <2e-16 ***
## Residuals 504 32036 64
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA tests if the house values are significantly different among groups. Based on the F-statistics and p-value, the house group values are different from each other.
#install.packages("fields",repos="http://cran.us.r-project.org")
library(fields)
wnaclim2 = read.csv("wnaclim2.csv")
#head(wnaclim2)
wnaclim <- wnaclim2[,seq(3,26)]
wnaclim.s <- scale(wnaclim)
wnaclim.pca <- prcomp(wnaclim, scale=TRUE)
biplot(wnaclim.pca,col = c("gray","red"), cex=c(0.4,0.75))
The biplot indicates that the first two major components includes precipitations and temperatures for different months .
screeplot(wnaclim.pca)
Since this plot did not plot all of the components, we could use a line plot to plot variances of all components
screeplot(wnaclim.pca, npcs = 24, type = "lines")
mtext("Components",side =1,line =3)
summary(wnaclim.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 3.4612 2.8065 1.5846 0.84733 0.63986 0.51316
## Proportion of Variance 0.4992 0.3282 0.1046 0.02992 0.01706 0.01097
## Cumulative Proportion 0.4992 0.8274 0.9320 0.96190 0.97896 0.98993
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.29261 0.20707 0.17437 0.15014 0.13534 0.10798
## Proportion of Variance 0.00357 0.00179 0.00127 0.00094 0.00076 0.00049
## Cumulative Proportion 0.99350 0.99528 0.99655 0.99749 0.99825 0.99874
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.09353 0.08322 0.06116 0.05676 0.04584 0.04161
## Proportion of Variance 0.00036 0.00029 0.00016 0.00013 0.00009 0.00007
## Cumulative Proportion 0.99910 0.99939 0.99955 0.99968 0.99977 0.99984
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.03439 0.02946 0.02562 0.02411 0.01802 0.01392
## Proportion of Variance 0.00005 0.00004 0.00003 0.00002 0.00001 0.00001
## Cumulative Proportion 0.99989 0.99993 0.99995 0.99998 0.99999 1.00000
Based on the summary statistics, the second component explains 32.82% of total variation and the first and second component together explain 82.74% of total variation.
Examination of loadings
sort(wnaclim.pca$rotation[,1])
## pjul paug pjun psep poct pmay
## 0.01369406 0.03206302 0.05730680 0.10354548 0.13939969 0.14748106
## pnov pjan pdec papr pfeb pmar
## 0.16303542 0.16803736 0.16860662 0.17074339 0.17102216 0.17796187
## tjun tjul taug tmay tsep tapr
## 0.20220435 0.20390908 0.22863878 0.23877755 0.25186014 0.26473010
## toct tjan tnov tdec tmar tfeb
## 0.26714245 0.27154635 0.27214789 0.27301687 0.27484658 0.27524689
The two variables that are highly associated with the first axis is March Temperature and February temperature. This can be interpreted as spring temperatures.The scores of both March Temp and February temperature is 0.275.
sort(wnaclim.pca$rotation[,2])
## tjul tjun taug tmay tsep toct
## -0.22851811 -0.21701738 -0.20271341 -0.16889072 -0.16881584 -0.12423680
## tapr tnov tmar tfeb tdec tjan
## -0.12257641 -0.09552584 -0.09367701 -0.07830351 -0.07380880 -0.07007925
## pjul paug pjun pmar pmay pjan
## 0.10097126 0.14782502 0.23790590 0.25650454 0.25905950 0.25995212
## pfeb psep pdec papr pnov poct
## 0.26393484 0.26528778 0.26639834 0.27824680 0.27829605 0.28759576
The two variables that are highly associated with the second axis is November precipitation and October precipitation. The scores of November and October precipitation is 0.278 and 0.287 respectively. This can be interpreted as fall precipitations.
Additionally, June and July temperatures (Summer temperatures) have higher negative association with this axis.
wnaclim.pca.score = wnaclim.pca$x[,1]
quilt.plot(wnaclim2$Longitude,wnaclim2$Latitude,wnaclim.pca.score)
world(add=TRUE)
This map provides a measure of winter temperature in western North America. It makes sense since the winter climate is a lot colder in Alaska compared to Southern US.
wnaclim.pca.score = wnaclim.pca$x[,2]
quilt.plot(wnaclim2$Longitude,wnaclim2$Latitude,wnaclim.pca.score)
world(add=TRUE)
This map presents the measure of the second component which explains variability associated with winter precipitation. It makes sense since a lot of precipitation occurs in North Western US during winter.