Basic Settings

Here, we will set working directory, load the required libraries and import data in R environment. Also, we use disable scientific notations using scipen argument.

#- set working Directory
setwd("C:/Users/awani/Documents/GitHub/50daysofAnalytics/Day 1 - K Means Clustering")

#- Libraries
library(ggplot2)
library(dplyr)
library(knitr)

#- read data
data = read.csv("housing.csv",stringsAsFactors = F)
options(scipen = 999)

Exploratory Data Analysis

Since, most of the data is numeric, a five point univariate summary would suffice. We can get an idea of central tendency and spread of the variables. Aditionally, histrograms of important variables gives us good idea of their distribution.

#- Exploratory Data Analysis
summary(data)
##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
## 
#Housing Age
qplot(data$housing_median_age, geom="histogram", bins = 30) + xlab("Median House Age in the district")

qplot(data$population, geom="histogram", bins = 40) + xlab("Population Distribution")

qplot(data$median_house_value, geom = "histogram", bins = 30)+ xlab("Median House Value")

Optimal Clusters

Before we even cluster, we need to determine how many groups/clusters to create. For this purpose, we can make an elblow plot and find the point of steep change. This point will serve as optimal number of clusters.

# prepare the dataset for clustering
mydata = data[,c("housing_median_age",'population',"median_house_value","households")]

#get appropriate number of clusters - Elblow Plot
wss = (nrow(mydata)-1)*sum(apply(mydata,2,var))

for (i in 2:15) 
  wss[i] = sum(kmeans(mydata,centers=i)$withinss)

  plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

K Means Clustering

The elblow plot suggests, a four cluster solution and hence we divide our data into four different clusters.

# Clustering
cluster = kmeans(mydata,4)
data$cluster = cluster$cluster

Cluster Summary

Let’s summarize the clusters we have identified and understand how different they are from each other.

#cluster summary
Summary = data %>% group_by(cluster)%>%
     summarise(Count = length(cluster), MedianAge = mean(housing_median_age),Population = mean(population), value = mean(median_house_value))

kable(Summary)
cluster Count MedianAge Population value
1 7226 27.77180 1365.878 99761.89
2 3977 29.09304 1446.172 303897.64
3 1953 32.94880 1192.269 466267.06
4 7484 28.11171 1532.881 190994.68