Advanced Geographical Statistics

Assignment 6

GEOG 6000

Date: 10/29/2015

Sulochan Dhungel

\(Answer 1\)

setwd("C:/Users/Sulochan/Copy/Fall 2015/Advance Geo/Assignment 6")
boston2 = read.csv("boston.csv")
#head(boston2)
boston = boston2[,11:23]
boston.sc = scale(boston)

#install.packages("cluster",repos="http://cran.us.r-project.org")
library(cluster)
#install.packages("fpc",repos="http://cran.us.r-project.org")
library(fpc)

ch.out = rep(NA,20) 
sil.out = rep(NA,20)
for (i in 2:20){
  boston.kmeans = kmeans(boston.sc, centers = i, nstart = 50)
  ch.out[i] = calinhara(boston.sc, boston.kmeans$cluster)
  tmp = silhouette(boston.kmeans$cluster, dist(boston.sc))
  sil.out[i] = mean(tmp[,3])
}

plot(1:20,ch.out, type='b', lwd=2,
     xlab="N Groups", ylab="C", main="Calinski-Harabasz index")
grid()

plot(1:20,sil.out, type='b', lwd=2,
     xlab="N Groups", ylab="C", main="Average silhouette index")
grid()

Both plots indicate that 2 might be the best set of clusters. Based on these two measures, 2 clusters might be enough to separate the data. But since the dataset has 506 rows and 13 variables used to differentiate prices of house, I would think, it would be worthwhile to investigate on more than 2 clusters.

Investigating 2 clusters

boston.kmeans = kmeans(boston.sc, centers = 2, nstart = 50)

layout(matrix(c(1:2), 1, 2, byrow = TRUE))
for (i in 1:2){
    ind = which(boston.kmeans$cluster == i)
    boxplot(boston.sc[ind,],ylim=c(-4,10),las=2)
    abline(h=0,col="gray")
    mtext(paste("size = ",length(ind),sep=""),3)
}

These boxplots of the variables of the two clusters are different. Almost all the positive variables in one of the cluster is negative for the same variables in next cluster.
A PCA can help to reduce the variability of 13 variables to less number of variables so that interpretation of clusters could be easier.

Choice of cluster
For my preliminary analysis I chose 6 clusters since average silhouette index was highest at N=6 after N=2. This was based solely on this chart and requires intensive checking the cluster properties after selection.

boston.kmeans = kmeans(boston.sc, centers = 6, nstart = 50)

aggregate(boston,list(boston.kmeans$cluster),median)

##   Group.1      CRIM ZN INDUS CHAS   NOX     RM   AGE     DIS RAD TAX
## 1       1  0.043010 55  3.44    0 0.428 6.7180 30.80 7.03550   4 304
## 2       2  0.484245  0 13.89    1 0.550 6.2310 88.55 2.96865   5 307
## 3       3  0.144550  0  7.07    0 0.507 6.1440 67.80 3.93420   4 296
## 4       4  8.103765  0 18.10    0 0.693 6.1380 95.70 1.96480  24 666
## 5       5  1.084555  0 21.89    0 0.624 5.8795 96.60 1.87270   4 403
## 6       6 12.884700  0 18.10    0 0.679 6.1975 92.95 1.92490  24 666
##   PTRATIO       B  LSTAT
## 1    16.6 392.900  5.570
## 2    17.6 390.580 10.990
## 3    18.4 393.490  9.680
## 4    20.2 391.085 17.500
## 5    19.1 377.880 15.965
## 6    20.2  45.755 19.930

The median values show an increasing or decreasing trends in variable values based on socio-economic aspects. The crime values, tax rates, pollution, pupil teacher ratio, age are increasing for classes 1, 3, 2, 5, 4 and 6 while distances from employment services are decreasing with each of these clusters. This indicates that the first few clusters (1,3,2) are better housing communities than the latter ones.
Although these values give an idea about how the clusters are formed, to interpret each class, I tried plotting box plots of variables of each cluster.

layout(matrix(c(1:6), 2, 3, byrow = TRUE))
for (i in 1:6){
    ind = which(boston.kmeans$cluster == i)
    boxplot(boston.sc[ind,],ylim=c(-4,10),las=2)
    abline(h=0,col="gray")
    mtext(paste("size = ",length(ind),sep=""),3)
}

These plots also visually gives the variabilities of the variables in the clusters. e.g. An interpretation could be that for class 4(size = 86) and class 6(size = 38), there is not much difference in any of the variables other than the black proportion in the cluster. The crime rates variabilty is a bit more in sixth cluster compared to the forth cluster.

Mean corrected house value per cluster

mean_cmedv = signif(aggregate(boston2$CMEDV*1000, list(boston.kmeans$cluster), mean), digits=5)
# Sorting the classes based on house values
colnames(mean_cmedv) = c("Cluster","House price in $")
mean_cmedv[with(mean_cmedv, order(-mean_cmedv[,2])), ]

##   Cluster House price in $
## 1       1            29037
## 2       2            27806
## 3       3            23985
## 5       5            19552
## 4       4            16691
## 6       6            12484

The housing values are also ordered as 1,2,3, 5, 4 and 6. This shows that the house values are very closely related to the social and economic characterisitcs of neighborhood.
Analysis of Variance of house prices and clusters

aov.cmedv = aov(boston2$CMEDV~boston.kmeans$cluster)
summary(aov.cmedv)

##                        Df Sum Sq Mean Sq F value Pr(>F)    
## boston.kmeans$cluster   1  10542   10542   165.8 <2e-16 ***
## Residuals             504  32036      64                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA tests if the house values are significantly different among groups. Based on the F-statistics and p-value, the house group values are different from each other.

\(Answer 2\)

#install.packages("fields",repos="http://cran.us.r-project.org")
library(fields)
wnaclim2 = read.csv("wnaclim2.csv")
#head(wnaclim2)
wnaclim <- wnaclim2[,seq(3,26)]
wnaclim.s <- scale(wnaclim)
wnaclim.pca <- prcomp(wnaclim, scale=TRUE)
biplot(wnaclim.pca,col = c("gray","red"), cex=c(0.4,0.75))

The biplot indicates that the first two major components includes precipitations and temperatures for different months .

screeplot(wnaclim.pca)

Since this plot did not plot all of the components, we could use a line plot to plot variances of all components

screeplot(wnaclim.pca, npcs = 24, type = "lines")
mtext("Components",side =1,line =3)

summary(wnaclim.pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     3.4612 2.8065 1.5846 0.84733 0.63986 0.51316
## Proportion of Variance 0.4992 0.3282 0.1046 0.02992 0.01706 0.01097
## Cumulative Proportion  0.4992 0.8274 0.9320 0.96190 0.97896 0.98993
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.29261 0.20707 0.17437 0.15014 0.13534 0.10798
## Proportion of Variance 0.00357 0.00179 0.00127 0.00094 0.00076 0.00049
## Cumulative Proportion  0.99350 0.99528 0.99655 0.99749 0.99825 0.99874
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.09353 0.08322 0.06116 0.05676 0.04584 0.04161
## Proportion of Variance 0.00036 0.00029 0.00016 0.00013 0.00009 0.00007
## Cumulative Proportion  0.99910 0.99939 0.99955 0.99968 0.99977 0.99984
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.03439 0.02946 0.02562 0.02411 0.01802 0.01392
## Proportion of Variance 0.00005 0.00004 0.00003 0.00002 0.00001 0.00001
## Cumulative Proportion  0.99989 0.99993 0.99995 0.99998 0.99999 1.00000

Based on the summary statistics, the second component explains 32.82% of total variation and the first and second component together explain 82.74% of total variation.
Examination of loadings

sort(wnaclim.pca$rotation[,1])

##       pjul       paug       pjun       psep       poct       pmay 
## 0.01369406 0.03206302 0.05730680 0.10354548 0.13939969 0.14748106 
##       pnov       pjan       pdec       papr       pfeb       pmar 
## 0.16303542 0.16803736 0.16860662 0.17074339 0.17102216 0.17796187 
##       tjun       tjul       taug       tmay       tsep       tapr 
## 0.20220435 0.20390908 0.22863878 0.23877755 0.25186014 0.26473010 
##       toct       tjan       tnov       tdec       tmar       tfeb 
## 0.26714245 0.27154635 0.27214789 0.27301687 0.27484658 0.27524689

The two variables that are highly associated with the first axis is March Temperature and February temperature. This can be interpreted as spring temperatures.The scores of both March Temp and February temperature is 0.275.

sort(wnaclim.pca$rotation[,2])

##        tjul        tjun        taug        tmay        tsep        toct 
## -0.22851811 -0.21701738 -0.20271341 -0.16889072 -0.16881584 -0.12423680 
##        tapr        tnov        tmar        tfeb        tdec        tjan 
## -0.12257641 -0.09552584 -0.09367701 -0.07830351 -0.07380880 -0.07007925 
##        pjul        paug        pjun        pmar        pmay        pjan 
##  0.10097126  0.14782502  0.23790590  0.25650454  0.25905950  0.25995212 
##        pfeb        psep        pdec        papr        pnov        poct 
##  0.26393484  0.26528778  0.26639834  0.27824680  0.27829605  0.28759576

The two variables that are highly associated with the second axis is November precipitation and October precipitation. The scores of November and October precipitation is 0.278 and 0.287 respectively. This can be interpreted as fall precipitations.
Additionally, June and July temperatures (Summer temperatures) have higher negative association with this axis.

wnaclim.pca.score = wnaclim.pca$x[,1]
quilt.plot(wnaclim2$Longitude,wnaclim2$Latitude,wnaclim.pca.score)
world(add=TRUE)

This map provides a measure of winter temperature in western North America. It makes sense since the winter climate is a lot colder in Alaska compared to Southern US.

wnaclim.pca.score = wnaclim.pca$x[,2]
quilt.plot(wnaclim2$Longitude,wnaclim2$Latitude,wnaclim.pca.score)
world(add=TRUE)

This map presents the measure of the second component which explains variability associated with winter precipitation. It makes sense since a lot of precipitation occurs in North Western US during winter.