Here, we will set working directory, load the required libraries and import data in R environment. Also, we use disable scientific notations using scipen argument.
#- set working Directory
setwd("C:/Users/awani/Documents/GitHub/50daysofAnalytics/Day 1 - K Means Clustering")
#- Libraries
library(ggplot2)
library(dplyr)
library(knitr)
#- read data
data = read.csv("housing.csv",stringsAsFactors = F)
options(scipen = 999)
Since, most of the data is numeric, a five point univariate summary would suffice. We can get an idea of central tendency and spread of the variables. Aditionally, histrograms of important variables gives us good idea of their distribution.
#- Exploratory Data Analysis
summary(data)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
#Housing Age
qplot(data$housing_median_age, geom="histogram", bins = 30) + xlab("Median House Age in the district")
qplot(data$population, geom="histogram", bins = 40) + xlab("Population Distribution")
qplot(data$median_house_value, geom = "histogram", bins = 30)+ xlab("Median House Value")
Before we even cluster, we need to determine how many groups/clusters to create. For this purpose, we can make an elblow plot and find the point of steep change. This point will serve as optimal number of clusters.
# prepare the dataset for clustering
mydata = data[,c("housing_median_age",'population',"median_house_value","households")]
#get appropriate number of clusters - Elblow Plot
wss = (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15)
wss[i] = sum(kmeans(mydata,centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
The elblow plot suggests, a four cluster solution and hence we divide our data into four different clusters.
# Clustering
cluster = kmeans(mydata,4)
data$cluster = cluster$cluster
Let’s summarize the clusters we have identified and understand how different they are from each other.
#cluster summary
Summary = data %>% group_by(cluster)%>%
summarise(Count = length(cluster), MedianAge = mean(housing_median_age),Population = mean(population), value = mean(median_house_value))
kable(Summary)
| cluster | Count | MedianAge | Population | value |
|---|---|---|---|---|
| 1 | 7226 | 27.77180 | 1365.878 | 99761.89 |
| 2 | 3977 | 29.09304 | 1446.172 | 303897.64 |
| 3 | 1953 | 32.94880 | 1192.269 | 466267.06 |
| 4 | 7484 | 28.11171 | 1532.881 | 190994.68 |