Another example with the house price data

For your homework, I have asked you to explore the changes in house prices for high- and low-value houses in high- and low-value towns. Here, I'll provide an example looking at high- and low-value houses in towns where prices changed a lot, and towns where they changed a little, leading up to the crisis.

It helps me to sketch out the structure of the data and the computations that will need to be done.
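Here is the rough plan for this example: (1) merge the sales and address data; (2) compute monthly price quantiles for each large city; (3) for each city, measure the growth from the April 2003 median price to the peak median price; (4) split the cities into fast-, middle-, and slow-change groups and plot their quantile trajectories.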

Preprocess the data

This is largely copied from last week's code. It will load the address and price data, merge them, and then select just the larger towns using the subset() function.

setwd("~/Dropbox/classes/Geog415_s13/geog415_s13_lab/lab2/")
# load packages:
require(ggplot2)
## Loading required package: ggplot2
require(plyr)  # These packages will be used later; don't worry about them now.
## Loading required package: plyr


ad <- read.csv("addresses.csv", stringsAsFactors = FALSE)
sales <- read.csv("house-sales.csv", stringsAsFactors = FALSE)
# At this point, you should inspect ad and sales to see what is included.
# In R studio, you can double click on the data to inspect them.

# Now, merge the data
geo <- merge(sales, ad, by = c("street", "city", "zip"), all.x = TRUE)

# Now tell R what format the dates and prices are.
geo$date2 <- as.Date(strptime(geo$date, "%Y-%m-%d"))
geo$price2 <- as.numeric(geo$price)
## Warning: NAs introduced by coercion
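# (A quick check, not in the original code: count the prices that failed
# to parse.  A handful of NAs is expected; a huge number would suggest a
# formatting problem.)
sum(is.na(geo$price2))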

# Calculate month of sale from date of sale.  I tried to reuse code from
# last week: geo$month <- as.Date(paste0(strftime(geo$date2,
# format='%Y-%m'),'-01')), but it didn't work.  Here is what goes wrong;
# you can see it in the first few cases:
paste0(strftime(geo$date2[1:5], format = "%Y-%m"), "-01")
## [1] "NA-01"      "NA-01"      "2004-05-01" "2008-02-01" "2008-02-01"
# That's not a date.  paste0() turns the missing values into the literal
# string 'NA'.  We'll have to go in steps, and set the missing values
# back to NA ourselves.
geo$month <- paste0(strftime(geo$date2, format = "%Y-%m"), "-01")
# This extracts the Year and month, and then sets the date to 01 (the
# first day of each month) If geo$date is NA, geo$month should be also
geo$month[is.na(geo$date2)] <- NA
# Now proceed
geo$month <- as.Date(geo$month, format = "%Y-%m-%d")
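# (An optional check, not in the original code: after the fix, the counts
# of missing months and missing dates should match.)
sum(is.na(geo$month)) == sum(is.na(geo$date2))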

# Subset on just the big cities.  First, calculate the city sizes:
cities <- as.data.frame(table(geo$city))  # Tabulate the data by city name
names(cities) <- c("city", "freq")  # Clean up the result with meaningful names!
big_cities <- subset(cities, freq > 3000)  # 3000 is a pretty arbitrary cutoff, but will select the larger cities
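# (Not required, but one way to judge the cutoff is to eyeball the
# largest cities by sale count:)
head(cities[order(-cities$freq), ], 10)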

# Now we have a list of big_cities.
geo_big <- subset(geo, city %in% big_cities$city)

# Be sure to inspect geo_big and make sure everything looks good!!!
head(geo_big)
##    street      city   zip              county  price br lsqft bsqft year
## 5          Milpitas 95035  Santa Clara County 968000 NA    NA    NA   NA
## 6          Milpitas 95035  Santa Clara County 722500 NA    NA    NA   NA
## 7              Napa 94558         Napa County 587500 NA    NA    NA   NA
## 8          Petaluma 94954       Sonoma County 380000 NA    NA    NA   NA
## 9          Richmond 94541 Contra Costa County 360000 NA  5000    NA   NA
## 11        San Mateo 94401    San Mateo County 409000 NA    NA    NA   NA
##          date datesold   long   lat               quality     match
## 5  2008-02-10     <NA> -121.9 37.43 QUALITY_CITY_CENTROID     Exact
## 6  2007-06-10     <NA> -121.9 37.43 QUALITY_CITY_CENTROID     Exact
## 7  2004-10-31     <NA> -122.3 38.30 QUALITY_CITY_CENTROID     Exact
## 8  2008-02-10     <NA> -122.6 38.25 QUALITY_CITY_CENTROID     Exact
## 9  2003-08-10     <NA> -122.3 37.93         gpsvisualizer CityLevel
## 11 2007-06-10     <NA> -122.3 37.57 QUALITY_CITY_CENTROID     Exact
##    success error      date2 price2      month
## 5  Success  None 2008-02-10 968000 2008-02-01
## 6  Success  None 2007-06-10 722500 2007-06-01
## 7  Success  None 2004-10-31 587500 2004-10-01
## 8  Success  None 2008-02-10 380000 2008-02-01
## 9  Success  None 2003-08-10 360000 2003-08-01
## 11 Success  None 2007-06-10 409000 2007-06-01

Calculate the monthly median price and other quantiles for each community and each month.

I can borrow this code from last week as well.

# Write a function to calculate the 5%, 20%, median, 80% and 95% quantiles
quant_fun <- function(df) {
    quants <- quantile(df$price2, c(0.05, 0.2, 0.5, 0.8, 0.95), na.rm = TRUE)
    return(as.data.frame(t(quants)))
}
# Test it on the complete dataset
quant_fun(geo_big)
##       5%    20%    50%    80%     95%
## 1 265500 380000 550000 770000 1175000
# Create a new dataset that has each of these calculated for each
# community and each month
geo_quantile <- ddply(geo_big, .(city, month), quant_fun)
# Be sure to inspect geo_quantile and make sure it looks good!!!
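# For example (a couple of quick checks, not in the original code):
head(geo_quantile)  # one row per city/month, with five quantile columns
str(geo_quantile)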

Calculate price change before the crisis

Now, for each community, I need to extract the starting price and find the highest price. This is a split-apply-combine operation, so I will use ddply. I'll need to write a function that calculates the starting and highest prices for a single community.

# Create a function that takes a data.frame, and calculates the median
# price on 1 April, 2003, and the highest median price

find_prices <- function(df) {
    start_price <- df[which(df$month == "2003-04-01"), "50%"]  # Median price on 2003-04-01
    high_price <- max(df[, "50%"], na.rm = TRUE)  # Max price
    # Find which month had the maximum... not required, but it might be fun
    # to look at when the peak happened.  (which.max() returns the first
    # month in case of ties, so high_month is always a single value.)
    high_month <- df$month[which.max(df[, "50%"])]
    # Put those values into a data.frame
    new_df <- data.frame(start = start_price, high = high_price, high_month = high_month)
    # Exit function and return the data.frame
    return(new_df)
}
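# (A caveat, not in the original code: find_prices() assumes every city
# has a row for 2003-04-01.  If a city had no April 2003 sales,
# start_price would be a zero-length vector and data.frame() would throw
# an error inside ddply; you would then need to handle the missing start
# month, e.g. by returning NA instead.)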

# Test that function on 'San Francisco'
find_prices(subset(geo_quantile, city == "San Francisco"))
##    start   high high_month
## 1 550000 899000 2007-07-01
# Cool, it worked.

# Now do that for every community
city_change <- ddply(geo_quantile, .(city), find_prices)

# Now calculate growth rate before the crash
city_change$rate <- (city_change$high - city_change$start)/city_change$start

# And histogram it.
ggplot() + geom_histogram(aes(x = rate), data = city_change)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.

(Plot: histogram of the pre-crash growth rate across cities.)

# There is kind of a natural break around 0.75.  There aren't too many
# data points; let's sort and print them to make sure:
sort(city_change$rate)
##  [1] 0.1385 0.2709 0.3673 0.3731 0.3952 0.4239 0.4254 0.4340 0.4637 0.4784
## [11] 0.4871 0.5147 0.5186 0.5255 0.5256 0.5293 0.5333 0.5618 0.5723 0.5885
## [21] 0.5951 0.6000 0.6031 0.6056 0.6087 0.6112 0.6244 0.6345 0.6373 0.6627
## [31] 0.6648 0.6648 0.6665 0.6701 0.6779 0.6835 0.6923 0.6955 0.7664 0.8032
## [41] 0.8095 0.8231 0.8305 0.8411 0.8593 0.8727 0.8744 0.8855 0.8932 0.9125
## [51] 0.9584 0.9658 0.9956 1.0161 1.0466

# Subset to just the fast cities
fast_cities <- subset(city_change, rate > 0.75)
# I'll also create a subset for the slow cities, below 0.5
slow_cities <- subset(city_change, rate < 0.5)
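# (A quick check, not in the original code: how many cities fall in each
# group?)
nrow(fast_cities)
nrow(slow_cities)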

Plot the housing values (high, medium, and low) for the big- and little-change communities

I want to plot monthly prices for slow- and fast-growth communities. We can use the geo_quantile data.frame, but we need to add a column that labels each community's growth as "fast", "middle", or "slow".

I will create a new column, called change, and set everything to "middle". Then I'll relabel the fast communities as "fast" and the slow communities as "slow".

geo_quantile$change <- "middle"
geo_quantile[which(geo_quantile$city %in% fast_cities$city), "change"] <- "fast"
geo_quantile[which(geo_quantile$city %in% slow_cities$city), "change"] <- "slow"

# Be sure to inspect geo_quantile!!!
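# One quick check (not in the original code): count the city-month rows
# in each group.
table(geo_quantile$change)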

# Before plotting, we need to reshape the data.frame.  We did this last
# time too.  Instead of one column for each quantile, we need multiple
# rows, one for each quantile.  The melt() function in the reshape
# package does this:
require(reshape)
## Loading required package: reshape
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
## 
## rename, round_any
new_dat <- melt(geo_quantile, measure.vars = c("5%", "20%", "50%", "80%", "95%"))
# The column names are silly.  Let's change them
names(new_dat) <- c("city", "month", "change", "quantile", "price")
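# (Worth a quick look, not in the original code: each city/month row
# should now appear five times, once per quantile.)
head(new_dat)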

# Now, at last... plot!  The x and y aesthetics are month and price.
# Each line is grouped by quantile and city, and we can facet by change.

# Let's just plot the median price (note the use of subset)
ggplot(data = subset(new_dat, quantile == "50%")) + geom_line(aes(x = month, 
    y = price, group = city, color = city)) + facet_wrap(~change)
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_path).

(Plot: median monthly price by city, faceted into fast, middle, and slow groups.)

# That's kind of a mess.  Let's simplify by plotting smoothed curves
ggplot(data = subset(new_dat, quantile == "50%")) + geom_smooth(aes(x = month, 
    y = price, group = city, color = city)) + facet_wrap(~change)
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 1 rows containing missing values (stat_smooth).
## (Similar messages and warnings repeat for each of the three facets.)

(Plot: loess-smoothed median monthly price by city, faceted by change group.)


# And let's do the whole shebang.  Create a matrix of plots, with the
# change groups across the columns and the quantiles down the rows:
ggplot(data = new_dat) + geom_smooth(aes(x = month, y = price, group = city, 
    color = city)) + facet_grid(quantile ~ change) + scale_y_log10()
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 1 rows containing missing values (stat_smooth).
## (Similar messages and warnings repeat for each of the fifteen facets.)

(Plot: smoothed prices for all five quantiles, faceted by quantile and change group, with a log y-axis.)

You could spend a long time looking at that, trying to tease out trends. I'm not terribly convinced that there is much to make of it, but a few points emerge. First, the "fast" growers fall into two groups: those that rose fast and then crashed, and those that continued to grow and never crashed. The latter group is not necessarily made of fast growers so much as "big" growers, so I need to rethink (high - start)/start as a measure of fast versus slow growth. Also, the biggest and most expensive growers are often big simply because they never stopped growing; their prices continued to rise right through the crash.
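One possible refinement, sketched below (an idea to explore, not part of the assignment): normalize the growth by how long it took, so a city that doubled in two years counts as faster than one that doubled in six. The high_month column we already computed makes this easy.

# A sketch of a time-normalized growth measure.  (years_to_peak and
# annual_rate are new columns invented here for illustration.)
city_change$years_to_peak <- as.numeric(city_change$high_month -
    as.Date("2003-04-01"))/365.25
city_change$annual_rate <- city_change$rate/city_change$years_to_peak
# A histogram of annual_rate may separate the truly fast growers from the
# merely big ones:
ggplot() + geom_histogram(aes(x = annual_rate), data = city_change)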