Kmeans Clustering

Basic Settings

Here, we will set working directory, load the required libraries and import data in R environment. Also, we use disable scientific notations using scipen argument.

#- set working Directory
setwd("C:/Users/awani/Documents/GitHub/50daysofAnalytics/Day 1 - K Means Clustering")

#- Libraries
library(ggplot2)
library(dplyr)
library(knitr)

#- read data
data = read.csv("housing.csv",stringsAsFactors = F)
options(scipen = 999)

Exploratory Data Analysis

Since, most of the data is numeric, a five point univariate summary would suffice. We can get an idea of central tendency and spread of the variables. Aditionally, histrograms of important variables gives us good idea of their distribution.

#- Exploratory Data Analysis
summary(data)

##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
##

#Housing Age
qplot(data$housing_median_age, geom="histogram", bins = 30) + xlab("Median House Age in the district")

qplot(data$population, geom="histogram", bins = 40) + xlab("Population Distribution")

qplot(data$median_house_value, geom = "histogram", bins = 30)+ xlab("Median House Value")

Optimal Clusters

Before we even cluster, we need to determine how many groups/clusters to create. For this purpose, we can make an elblow plot and find the point of steep change. This point will serve as optimal number of clusters.

# prepare the dataset for clustering
mydata = data[,c("housing_median_age",'population',"median_house_value","households")]

#get appropriate number of clusters - Elblow Plot
wss = (nrow(mydata)-1)*sum(apply(mydata,2,var))

for (i in 2:15) 
  wss[i] = sum(kmeans(mydata,centers=i)$withinss)

  plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")