Each observation represents a home sold in King County, Washington over the period of May 2014 through May 2015. The original data set contains 21,613 observations, each with 21 variables.
I am interested in seeing the difference in price per sq foot throughout the county. I will create two separate graphs. Both interactive, using the library rbokeh. The first graph will display the price per sq ft for the lot, and the second graph will display the price per sq ft for the living space.
Motivation: I would like to break down the data set into smaller data sets to analyze the smaller sets separately. I suspect that if I group the homes in an educated manner based on price of sq ft and other characteristics, I can come up with models with higher predictive power. Real estate in a large metropolitan area is too complex to expect one model to fit all of the data.
For this project I used the libraries:
* gdata: to read in the .xlsx file
* dplyr: for data wrangling
* rbokeh: for the interactive graphs
* RColorBrewer: to create a color gradient.
suppressMessages(library(gdata))
suppressMessages(library(dplyr))
library(rbokeh)
library(RColorBrewer)
nwater <- read.xls("~/Documents/datasets/KC_home_sales.xlsx", sheet="not_waterfront")
water <- read.xls("~/Documents/datasets/KC_home_sales.xlsx", sheet="waterfront")
# creates the data frame homes by stacking nwater on top of water
homes <- rbind(nwater, water)
Let’s take a look at the variables available in the data frame.
names(homes)
## [1] "id" "date" "price" "bedrooms"
## [5] "bathrooms" "sqft_living" "sqft_lot" "floors"
## [9] "waterfront" "view" "condition" "grade"
## [13] "sqft_above" "sqft_basement" "yr_built" "yr_renovated"
## [17] "zipcode" "lat" "long" "sqft_living15"
## [21] "sqft_lot15"
We create two additional variables in the data frame: \(sqft_price_lot\) and \(sqft_price_liv\).
homes$sqft_price_lot <- homes$price/homes$sqft_lot
homes$sqft_price_liv <- homes$price/homes$sqft_living
We then create a smaller data frame, with only the variables of interest.
homes_sqft <- homes %>% select(lat, long, sqft_price_lot, sqft_price_liv)
homes_sqft %>% head(3)
## lat long sqft_price_lot sqft_price_liv
## 1 47.5112 -122.257 39.27434 188.0508
## 2 47.7210 -122.319 74.28887 209.3385
## 3 47.7379 -122.233 18.00000 233.7662
The code below assigns color to each home based on the price per sq ft of the lot. I proceeded by breaking down the data frame into 5 bins, determined by the 20th, 40th, 60th, and 80th percentiles. Each bin is a assigned a color in a gradient. The homes with the lowest price per sq ft (lot) will be plotted yellow and the homes with the highest price per sq ft will be plotted red. The homes in between will fall somewhere in the yellow to red spectrum.
# creates a color gradient of length 5
colors <- brewer.pal(5, "YlOrRd")
# computes quantiles to later break up the data set based on sqft_price
breaks <- quantile(homes_sqft$sqft_price_lot, probs = seq(0, 1, 0.2))
# Adding a color to the houses based on sqft_price
cat1 <- homes_sqft %>% filter(sqft_price_lot >= breaks[1] & sqft_price_lot <= breaks[2])
cat2 <- homes_sqft %>% filter(sqft_price_lot > breaks[2] & sqft_price_lot <= breaks[3])
cat3 <- homes_sqft %>% filter(sqft_price_lot > breaks[3] & sqft_price_lot <= breaks[4])
cat4 <- homes_sqft %>% filter(sqft_price_lot > breaks[4] & sqft_price_lot <= breaks[5])
cat5 <- homes_sqft %>% filter(sqft_price_lot > breaks[5] & sqft_price_lot <= breaks[6])
cat1$color <- colors[1]
cat2$color <- colors[2]
cat3$color <- colors[3]
cat4$color <- colors[4]
cat5$color <- colors[5]
# data with color is the data set containing lat, long, sqft_price, and their respective color
homes_sqft_col <- rbind(cat1, cat2, cat3, cat4, cat5)
homes_sqft_col %>% head
## lat long sqft_price_lot sqft_price_liv color
## 1 47.7379 -122.233 18.00000 233.7662 #FFFFB2
## 2 47.6561 -122.005 12.06710 226.9373 #FFFFB2
## 3 47.7558 -122.229 15.57711 216.7832 #FFFFB2
## 4 47.7277 -121.962 28.13390 208.9947 #FFFFB2
## 5 47.3089 -122.210 19.18782 157.5000 #FFFFB2
## 6 47.3343 -122.306 23.53182 184.0000 #FFFFB2
Once we create the new data set with the respective color of each observation, we graph it using the gmap function of rbokeh. I made the graph using a grayscale so that the yellow dots on the map would be more visible than on a regular google roadmap.
grayscale = '[{"featureType":"landscape","stylers":[{"saturation":-100},{"lightness":65},{"visibility":"on"}]},{"featureType":"poi","stylers":[{"saturation":-100},{"lightness":51},{"visibility":"simplified"}]},{"featureType":"road.highway","stylers":[{"saturation":-100},{"visibility":"simplified"}]},{"featureType":"road.arterial","stylers":[{"saturation":-100},{"lightness":30},{"visibility":"on"}]},{"featureType":"road.local","stylers":[{"saturation":-100},{"lightness":40},{"visibility":"on"}]},{"featureType":"transit","stylers":[{"saturation":-100},{"visibility":"simplified"}]},{"featureType":"administrative.province","stylers":[{"visibility":"off"}]},{"featureType":"water","elementType":"labels","stylers":[{"visibility":"on"},{"lightness":-25},{"saturation":-100}]},{"featureType":"water","elementType":"geometry","stylers":[{"hue":"#ffff00"},{"lightness":-25},{"saturation":-97}]}]'
gmap(lat = 47.51950, lng = -122.1578, zoom = 10,
width = 800, height = 1000, map_style = grayscale) %>%
ly_points(long, lat, data = homes_sqft_col,
fill_alpha = 0.15, line_alpha = 0, col = color)
We’ll now carry out the same strategy to plot the price per sq foot of the living space of each home sold. As suspected, we see the same pattern as in the first graph. The highest price per sq ft occurs close to downtown Seattle (especially the immediate east and immediate north), Bellevue, Redmond, Mercer Island, Kirkland, the east part of the county. South Seattle is the most affordable part of the city to live in. Cities like Seatac, Des Moines, Kent, Renton, Auburn have a relatively low price of sq ft when compared to the east side of the county. Cities like Shorline, Bothell, and Woodinville also display some yellow dots.
breaks <- quantile(homes_sqft$sqft_price_liv, probs = seq(0, 1, 0.2))
# Adding a color to the houses based on sqft_price
cat1 <- homes_sqft %>% filter(sqft_price_liv >= breaks[1] & sqft_price_liv <= breaks[2])
cat2 <- homes_sqft %>% filter(sqft_price_liv > breaks[2] & sqft_price_liv <= breaks[3])
cat3 <- homes_sqft %>% filter(sqft_price_liv > breaks[3] & sqft_price_liv <= breaks[4])
cat4 <- homes_sqft %>% filter(sqft_price_liv > breaks[4] & sqft_price_liv <= breaks[5])
cat5 <- homes_sqft %>% filter(sqft_price_liv > breaks[5] & sqft_price_liv <= breaks[6])
cat1$colorliv <- colors[1]
cat2$colorliv <- colors[2]
cat3$colorliv <- colors[3]
cat4$colorliv <- colors[4]
cat5$colorliv <- colors[5]
# data with color is the data set containing lat, long, sqft_price, and their respective color
homes_sqft_col_liv <- rbind(cat1, cat2, cat3, cat4, cat5)
homes_sqft_col_liv %>% head
## lat long sqft_price_lot sqft_price_liv colorliv
## 1 47.3097 -122.327 37.76214 150.1458 #FFFFB2
## 2 47.5123 -122.337 30.72289 128.9326 #FFFFB2
## 3 47.3089 -122.210 19.18782 157.5000 #FFFFB2
## 4 47.3266 -122.169 45.23810 125.5507 #FFFFB2
## 5 47.3739 -122.172 50.61538 134.2857 #FFFFB2
## 6 47.3048 -122.218 49.60613 136.2573 #FFFFB2
gmap(lat = 47.51950, lng = -122.1578, zoom = 10,
width = 800, height = 1000, map_style = grayscale) %>%
ly_points(long, lat, data = homes_sqft_col_liv,
fill_alpha = 0.15, line_alpha = 0, col = colorliv)