Exploratory Analysis

K-means clustering is a basic pattern recognition technique. It is an unsupervised machine learning task that automatically divides data into clusters (groups of similar data points). It is important to understand that clustering is not a method for prediction but rather a method for knowledge discovery. In this report, I show an example use of k-means clustering on housing data. K-means clustering works well for grouping observations on two numeric variables at once. The dataset of study contains data on 545 different houses. If I were a potential home buyer, I would not want to review each and every house; I would rather narrow my search to a particular grouping of houses. I can use the k-means algorithm to group houses by price and area and narrow my search to only those houses that provide great value per square foot.
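As a minimal sketch of the mechanics (made-up toy points, not the housing data), kmeans() takes a set of numeric observations and a target number of clusters and returns a group label for each observation:

# Toy illustration: six made-up 2-D points, split into two clusters
set.seed(1)
toy = data.frame(x = c(1.0, 1.2, 0.8, 9.0, 9.5, 10.0),
                 y = c(2.0, 2.1, 1.9, 8.0, 8.2, 7.8))
toykm = kmeans(toy, centers = 2)
toykm$cluster   # group label (1 or 2) assigned to each point
toykm$centers   # coordinates of the two group centers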

Here is my data source: [link](https://www.kaggle.com/datasets/shrutipandit707/housingnewdataset).

housing = 'https://raw.githubusercontent.com/trevvshaw/Case_Study_3/main/newhousing.csv'
housingdata = read.csv(housing)   # 545 rows, one per house
library(dplyr)                    # used below for filtering and piping

You will notice the dataset has 18 different variables for 545 different houses. Many of the variables take only a few distinct values; for example, the bedrooms variable only ranges from 1 to 6, so filtering on preferences may not narrow the search down far enough. Even filtering on a variable with a wide spread of values (like price), narrowing the options is still difficult. You can see from Figure 1 that the large majority of prices fall between 3 and 5 million, while the full spread runs from about 2 to 13 million.
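As a quick sketch of that point (the bedrooms column name is assumed from the description above; price is in raw dollars), even stacking two preference filters with dplyr leaves a long shortlist:

dim(housingdata)   # 545 houses, 18 variables
shortlist = housingdata %>% filter(price <= 4000000, bedrooms >= 3)
nrow(shortlist)    # still far too many houses to review one by one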

Figure 1: Pricing Histogram

Price = housingdata$price / 1000000   # rescale prices to millions for readability
hist(Price, breaks = 50, main = "Figure 1: Prices (in Millions)")

Even if you had a hard budget at the low end of the spread (say 4 million), you have still only eliminated about half of the field of houses. Let's assume that we will eventually want to re-sell the house. If we get a good value now, we can add value to the house over time via renovations, additions, or changes in the market. So, it makes sense to consider how much house we can buy per dollar. Figure 2 shows a histogram of the square footage of each house.
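We can check that budget cut directly with the Price vector from the Figure 1 chunk (a quick sketch, not part of the main analysis):

mean(Price <= 4)   # share of houses still in play under a 4 million budget
summary(housingdata$price / housingdata$area)   # raw price paid per square foot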

Figure 2: Area Histogram

Area = housingdata$area / 1000   # rescale areas to thousands of square feet
hist(Area, breaks = 50, main = "Figure 2: Area (in Thousands of Square Feet)")

You can see that the shapes of Figures 1 and 2 are quite similar, so let's explore the correlation between the two values. We can plot Price against Area and fit a linear model to forecast Price as a function of Area (Figure 3). If the goal is to get a good value, how would one define a good value? The houses below the lower bound of the prediction interval would, in theory, provide the most square footage per dollar. However, this leaves us with only 6 houses, which could simply be outliers; there could be something wrong with a house that offers such extreme value for the dollar (e.g., significant repairs needed). Between the lower bound and the fitted line, the data points get quite congested. Perhaps there is a sweet-spot grouping of houses that gives us a great value but is not so extreme that we run the risk of finding something gravely wrong with the house. We can use a k-means clustering algorithm to create groupings of houses based on Price and Area.

Figure 3: Price/Area Correlation

scaledhousingdata = data.frame(Price, Area)              # millions vs. thousands of sq ft
model1 = lm(Price ~ Area, data = scaledhousingdata)      # linear fit of price on area
xvals = seq(from = 0, to = 20, by = 0.1)                 # grid of areas for the forecast lines
df = data.frame(Area = xvals)
forecast = predict(model1, newdata = df, interval = "prediction")   # fit plus 95% prediction bounds
forecast = as.data.frame(forecast)

plot(Area, Price, xlab = "Area (thousands of square feet)", ylab = "Price (millions)", main = "Figure 3: Area/Price Correlation")
lines(x=df$Area, y=forecast$fit, lwd=2)
lines(x=df$Area, y=forecast$lwr, lwd=2, lty="dashed", col="red")
lines(x=df$Area, y=forecast$upr, lwd=2, lty="dashed", col="red")
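To see how few houses fall below that lower bound, we can score every house against the interval from model1 (a quick sketch; the variable names here are mine):

pred_all = predict(model1, newdata = scaledhousingdata, interval = "prediction")
below = Price < pred_all[, "lwr"]   # houses priced below the lower prediction bound
sum(below)                          # the handful of extreme-value houses noted above
scaledhousingdata[below, ]          # their prices (millions) and areas (thousands of sq ft)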

An important measure when analyzing clusters is the sum of squared errors, or SSE. Within each cluster, it measures how far the data points sit from the cluster center: the lower the SSE, the more similar the points within each cluster are to one another. In Figure 4, you can see how the total SSE changes with the number of clusters you select. The trend flattens out around 8 clusters, so you do not gain much by asking the algorithm for more than 8.

Figure 4: SSE vs Number of Clusters

library(cluster)
# Total within-cluster SSE for k = 1 through 15 clusters
set.seed(123)   # kmeans starts from random centers; fix the seed for a stable curve
wss = kmeans(scaledhousingdata, centers = 1)$tot.withinss
for (i in 2:15)
  wss[i] = kmeans(scaledhousingdata, centers = i)$tot.withinss
library(ggvis)
sse = data.frame(Clusters = 1:15, SSE = wss)   # one row per candidate number of clusters
sse %>%
  ggvis(~Clusters, ~SSE) %>%
  layer_points(fill := 'blue') %>% 
  layer_lines() %>%
  set_options(height = 300, width = 400)
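To put a rough number on where the curve flattens, we can also look at the proportional SSE drop from each added cluster (an informal check, not a formal test):

# Fractional SSE reduction gained by moving from k clusters to k + 1
round(-diff(wss) / head(wss, -1), 3)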

In conclusion, I believe 8 clusters is the ideal number to use for this analysis. A visual representation of the clusters is shown in Figure 5. The houses with the most value will be in the bottom-right cluster, where area is high relative to price.

Figure 5: Clustering Visualization

library(cluster)
set.seed(123)   # kmeans uses random starting centers; fix the seed for reproducible groupings
clusters = kmeans(scaledhousingdata, centers = 8)
clusplot(scaledhousingdata, clusters$cluster, color = TRUE, shade = FALSE, labels = 0, lines = 0, main = 'Figure 5: k-Means Cluster Analysis')
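Finally, to pull that bottom-right cluster out programmatically, one option (a sketch; "value" here is simply each center's area-to-price ratio) is to rank the cluster centers and list the houses assigned to the winner:

value_ratio = clusters$centers[, "Area"] / clusters$centers[, "Price"]
best = which.max(value_ratio)        # center with the most area per dollar
sum(clusters$cluster == best)        # how many houses land in that cluster
head(scaledhousingdata[clusters$cluster == best, ])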