Each observation represents a home sold in King County, Washington over the period of May 2014 through May 2015. The original data set contains 21,613 observations, each with 21 variables.

I am interested in seeing the difference in price per sq foot throughout the county. I will create two separate graphs. Both interactive, using the library rbokeh. The first graph will display the price per sq ft for the lot, and the second graph will display the price per sq ft for the living space.

Motivation: I would like to break down the data set into smaller data sets to analyze the smaller sets separately. I suspect that if I group the homes in an educated manner based on price of sq ft and other characteristics, I can come up with models with higher predictive power. Real estate in a large metropolitan area is too complex to expect one model to fit all of the data.

For this project I used the libraries:
* gdata: to read in the .xlsx file
* dplyr: for data wrangling
* rbokeh: for the interactive graphs
* RColorBrewer: to create a color gradient.

suppressMessages(library(gdata))
suppressMessages(library(dplyr))
library(rbokeh)
library(RColorBrewer)

nwater <- read.xls("~/Documents/datasets/KC_home_sales.xlsx", sheet="not_waterfront")
water <- read.xls("~/Documents/datasets/KC_home_sales.xlsx", sheet="waterfront")
# creates the data frame homes by stacking nwater on top of water
homes <- rbind(nwater, water)

Let’s take a look at the variables available in the data frame.

names(homes)
##  [1] "id"            "date"          "price"         "bedrooms"     
##  [5] "bathrooms"     "sqft_living"   "sqft_lot"      "floors"       
##  [9] "waterfront"    "view"          "condition"     "grade"        
## [13] "sqft_above"    "sqft_basement" "yr_built"      "yr_renovated" 
## [17] "zipcode"       "lat"           "long"          "sqft_living15"
## [21] "sqft_lot15"

We create two additional variables in the data frame: \(sqft_price_lot\) and \(sqft_price_liv\).

homes$sqft_price_lot <- homes$price/homes$sqft_lot
homes$sqft_price_liv <- homes$price/homes$sqft_living

We then create a smaller data frame, with only the variables of interest.

homes_sqft <- homes %>% select(lat, long, sqft_price_lot, sqft_price_liv)
homes_sqft %>% head(3)
##       lat     long sqft_price_lot sqft_price_liv
## 1 47.5112 -122.257       39.27434       188.0508
## 2 47.7210 -122.319       74.28887       209.3385
## 3 47.7379 -122.233       18.00000       233.7662

The code below assigns color to each home based on the price per sq ft of the lot. I proceeded by breaking down the data frame into 5 bins, determined by the 20th, 40th, 60th, and 80th percentiles. Each bin is a assigned a color in a gradient. The homes with the lowest price per sq ft (lot) will be plotted yellow and the homes with the highest price per sq ft will be plotted red. The homes in between will fall somewhere in the yellow to red spectrum.

# creates a color gradient of length 5
colors <- brewer.pal(5, "YlOrRd")
# computes quantiles to later break up the data set based on sqft_price
breaks <- quantile(homes_sqft$sqft_price_lot, probs = seq(0, 1, 0.2))

# Adding a color to the houses based on sqft_price
cat1 <- homes_sqft %>% filter(sqft_price_lot >= breaks[1] & sqft_price_lot <= breaks[2])
cat2 <- homes_sqft %>% filter(sqft_price_lot > breaks[2] & sqft_price_lot <= breaks[3])
cat3 <- homes_sqft %>% filter(sqft_price_lot > breaks[3] & sqft_price_lot <= breaks[4])
cat4 <- homes_sqft %>% filter(sqft_price_lot > breaks[4] & sqft_price_lot <= breaks[5])
cat5 <- homes_sqft %>% filter(sqft_price_lot > breaks[5] & sqft_price_lot <= breaks[6])
cat1$color <- colors[1]
cat2$color <- colors[2]
cat3$color <- colors[3]
cat4$color <- colors[4]
cat5$color <- colors[5]

# data with color is the data set containing lat, long, sqft_price, and their respective color
homes_sqft_col <- rbind(cat1, cat2, cat3, cat4, cat5)
homes_sqft_col %>% head
##       lat     long sqft_price_lot sqft_price_liv   color
## 1 47.7379 -122.233       18.00000       233.7662 #FFFFB2
## 2 47.6561 -122.005       12.06710       226.9373 #FFFFB2
## 3 47.7558 -122.229       15.57711       216.7832 #FFFFB2
## 4 47.7277 -121.962       28.13390       208.9947 #FFFFB2
## 5 47.3089 -122.210       19.18782       157.5000 #FFFFB2
## 6 47.3343 -122.306       23.53182       184.0000 #FFFFB2

Once we create the new data set with the respective color of each observation, we graph it using the gmap function of rbokeh. I made the graph using a grayscale so that the yellow dots on the map would be more visible than on a regular google roadmap.

grayscale = '[{"featureType":"landscape","stylers":[{"saturation":-100},{"lightness":65},{"visibility":"on"}]},{"featureType":"poi","stylers":[{"saturation":-100},{"lightness":51},{"visibility":"simplified"}]},{"featureType":"road.highway","stylers":[{"saturation":-100},{"visibility":"simplified"}]},{"featureType":"road.arterial","stylers":[{"saturation":-100},{"lightness":30},{"visibility":"on"}]},{"featureType":"road.local","stylers":[{"saturation":-100},{"lightness":40},{"visibility":"on"}]},{"featureType":"transit","stylers":[{"saturation":-100},{"visibility":"simplified"}]},{"featureType":"administrative.province","stylers":[{"visibility":"off"}]},{"featureType":"water","elementType":"labels","stylers":[{"visibility":"on"},{"lightness":-25},{"saturation":-100}]},{"featureType":"water","elementType":"geometry","stylers":[{"hue":"#ffff00"},{"lightness":-25},{"saturation":-97}]}]'

gmap(lat = 47.51950, lng = -122.1578, zoom = 10,
     width = 800, height = 1000, map_style = grayscale) %>%
  ly_points(long, lat, data = homes_sqft_col,
            fill_alpha = 0.15, line_alpha = 0, col = color)

We’ll now carry out the same strategy to plot the price per sq foot of the living space of each home sold. As suspected, we see the same pattern as in the first graph. The highest price per sq ft occurs close to downtown Seattle (especially the immediate east and immediate north), Bellevue, Redmond, Mercer Island, Kirkland, the east part of the county. South Seattle is the most affordable part of the city to live in. Cities like Seatac, Des Moines, Kent, Renton, Auburn have a relatively low price of sq ft when compared to the east side of the county. Cities like Shorline, Bothell, and Woodinville also display some yellow dots.

breaks <- quantile(homes_sqft$sqft_price_liv, probs = seq(0, 1, 0.2))

# Adding a color to the houses based on sqft_price
cat1 <- homes_sqft %>% filter(sqft_price_liv >= breaks[1] & sqft_price_liv <= breaks[2])
cat2 <- homes_sqft %>% filter(sqft_price_liv > breaks[2] & sqft_price_liv <= breaks[3])
cat3 <- homes_sqft %>% filter(sqft_price_liv > breaks[3] & sqft_price_liv <= breaks[4])
cat4 <- homes_sqft %>% filter(sqft_price_liv > breaks[4] & sqft_price_liv <= breaks[5])
cat5 <- homes_sqft %>% filter(sqft_price_liv > breaks[5] & sqft_price_liv <= breaks[6])
cat1$colorliv <- colors[1]
cat2$colorliv <- colors[2]
cat3$colorliv <- colors[3]
cat4$colorliv <- colors[4]
cat5$colorliv <- colors[5]

# data with color is the data set containing lat, long, sqft_price, and their respective color
homes_sqft_col_liv <- rbind(cat1, cat2, cat3, cat4, cat5)
homes_sqft_col_liv %>% head
##       lat     long sqft_price_lot sqft_price_liv colorliv
## 1 47.3097 -122.327       37.76214       150.1458  #FFFFB2
## 2 47.5123 -122.337       30.72289       128.9326  #FFFFB2
## 3 47.3089 -122.210       19.18782       157.5000  #FFFFB2
## 4 47.3266 -122.169       45.23810       125.5507  #FFFFB2
## 5 47.3739 -122.172       50.61538       134.2857  #FFFFB2
## 6 47.3048 -122.218       49.60613       136.2573  #FFFFB2
gmap(lat = 47.51950, lng = -122.1578, zoom = 10,
     width = 800, height = 1000, map_style = grayscale) %>%
  ly_points(long, lat, data = homes_sqft_col_liv,
            fill_alpha = 0.15, line_alpha = 0, col = colorliv)