For your homework, I have asked you to explore the changes in house prices for high- and low-value houses, in high- and low-value towns. Here, I'll provide an example looking at high- and low-value houses in towns where the price changed a lot, and where it changed a little, leading up to the crisis.
It helps me to sketch out the structure of the data and the computations that will need to be done.
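In outline, drawing only on the steps worked through below, the plan is:
# 1. Load the address and sales data and merge them
# 2. Clean up the dates and prices
# 3. Subset to the larger cities
# 4. Calculate monthly price quantiles for each city
# 5. Classify cities as fast or slow growers from their pre-crash growth
# 6. Reshape the quantile data and plot it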
This is largely copied from last week's code. It loads the address and price data, merges them, and then selects just the larger towns using the subset() function.
setwd("~/Dropbox/classes/Geog415_s13/geog415_s13_lab/lab2/")
# load packages:
require(ggplot2)
## Loading required package: ggplot2
require(plyr) # These packages will be used later; don't worry about them now.
## Loading required package: plyr
ad <- read.csv("addresses.csv", stringsAsFactors = FALSE)
sales <- read.csv("house-sales.csv", stringsAsFactors = FALSE)
# At this point, you should inspect ad and sales to see what is included.
# In RStudio, you can double-click on a dataset to inspect it.
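# A couple of quick console checks I'd run (str(), head(), and summary()
# all work; these are just suggestions):
str(ad)
head(sales)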
# Now, merge the data
geo <- merge(sales, ad, by = c("street", "city", "zip"), all.x = TRUE)
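# all.x = TRUE keeps every row of sales even when no address matches, so
# unmatched sales get NA in the address columns (a left join).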
# Now tell R what format the dates and prices are.
geo$date2 <- as.Date(strptime(geo$date, "%Y-%m-%d"))
geo$price2 <- as.numeric(geo$price)
## Warning: NAs introduced by coercion
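# That warning means some price strings could not be converted. To see
# how many prices ended up missing, count the NAs:
sum(is.na(geo$price2))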
# Calculate the month of sale from the date of sale. I tried to reuse
# code from last week: geo$month <- as.Date(paste0(strftime(geo$date2,
# format='%Y-%m'),'-01')), but it didn't work. Here is what goes wrong;
# you can see it in the first few cases:
paste0(strftime(geo$date2[1:5], format = "%Y-%m"), "-01")
## [1] "NA-01" "NA-01" "2004-05-01" "2008-02-01" "2008-02-01"
# 'NA-01' is not a date. We'll have to go in steps, and turn the missing
# values into NA
geo$month <- paste0(strftime(geo$date2, format = "%Y-%m"), "-01")
# This extracts the year and month, then sets the day to 01 (the first
# day of each month). If geo$date2 is NA, geo$month should be NA too:
geo$month[is.na(geo$date2)] <- NA
# Now proceed
geo$month <- as.Date(geo$month, format = "%Y-%m-%d")
# Subset on just the big cities. First, calculate the city sizes:
cities <- as.data.frame(table(geo$city)) # Tabulate the data by city name
names(cities) <- c("city", "freq") # Clean up the result with meaningful names!
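# Before picking a cutoff, it can help to look at the distribution of
# city sizes (one quick way among several):
summary(cities$freq)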
big_cities <- subset(cities, freq > 3000) # 3000 is a pretty arbitrary cutoff, but will select the larger cities
# Now we have a list of big_cities.
geo_big <- subset(geo, city %in% big_cities$city)
# Be sure to inspect geo_big and make sure everything looks good!!!
head(geo_big)
## street city zip county price br lsqft bsqft year
## 5 Milpitas 95035 Santa Clara County 968000 NA NA NA NA
## 6 Milpitas 95035 Santa Clara County 722500 NA NA NA NA
## 7 Napa 94558 Napa County 587500 NA NA NA NA
## 8 Petaluma 94954 Sonoma County 380000 NA NA NA NA
## 9 Richmond 94541 Contra Costa County 360000 NA 5000 NA NA
## 11 San Mateo 94401 San Mateo County 409000 NA NA NA NA
## date datesold long lat quality match
## 5 2008-02-10 <NA> -121.9 37.43 QUALITY_CITY_CENTROID Exact
## 6 2007-06-10 <NA> -121.9 37.43 QUALITY_CITY_CENTROID Exact
## 7 2004-10-31 <NA> -122.3 38.30 QUALITY_CITY_CENTROID Exact
## 8 2008-02-10 <NA> -122.6 38.25 QUALITY_CITY_CENTROID Exact
## 9 2003-08-10 <NA> -122.3 37.93 gpsvisualizer CityLevel
## 11 2007-06-10 <NA> -122.3 37.57 QUALITY_CITY_CENTROID Exact
## success error date2 price2 month
## 5 Success None 2008-02-10 968000 2008-02-01
## 6 Success None 2007-06-10 722500 2007-06-01
## 7 Success None 2004-10-31 587500 2004-10-01
## 8 Success None 2008-02-10 380000 2008-02-01
## 9 Success None 2003-08-10 360000 2003-08-01
## 11 Success None 2007-06-10 409000 2007-06-01
I can also borrow this code from last week:
# Write a function to calculate the 5%, 20%, median, 80% and 95% quantiles
quant_fun <- function(df) {
quants <- quantile(df$price2, c(0.05, 0.2, 0.5, 0.8, 0.95), na.rm = TRUE)
return(as.data.frame(t(quants)))
}
# Test it on the complete dataset
quant_fun(geo_big)
## 5% 20% 50% 80% 95%
## 1 265500 380000 550000 770000 1175000
# Create a new dataset that has each of these calculated for each
# community and each month
geo_quantile <- ddply(geo_big, .(city, month), quant_fun)
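# ddply splits geo_big into one piece per (city, month) combination,
# applies quant_fun to each piece, and row-binds the results.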
# Be sure to inspect geo_quantile and make sure it looks good!!!
Now, for each community, I need to extract the price at the starting date and find the highest price. This is a split-apply-combine operation, so I will use ddply. I'll need to write a function to calculate the starting and highest prices.
# Create a function that takes a data.frame, and calculates the median
# price on 1 April, 2003, and the highest median price
find_prices <- function(df) {
start_price <- df[which(df$month == "2003-04-01"), "50%"] # Median price on 2003-04-01
high_price <- max(df[, "50%"], na.rm = TRUE) # Max price
# find which date was the maximum... not required, but it might be fun to
# look at when the peak happened
high_month <- df$month[which(df[, "50%"] == high_price)]
# insert those values into a data.frame
new_df <- data.frame(start = start_price, high = high_price, high_month = high_month)
# Exit function and return the data.frame
return(new_df)
}
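# A caveat (my reading of the base R semantics, worth checking): if a
# city has no sales in April 2003, start_price has length zero and the
# data.frame() call will error; and a tie for the maximum would make
# high_month longer than one, returning multiple rows for that city.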
# Test that function on 'San Francisco'
find_prices(subset(geo_quantile, city == "San Francisco"))
## start high high_month
## 1 550000 899000 2007-07-01
# Cool, it worked.
# Now do that for every community
city_change <- ddply(geo_quantile, .(city), find_prices)
# Now calculate growth rate before the crash
city_change$rate <- (city_change$high - city_change$start)/city_change$start
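# For example, for San Francisco: (899000 - 550000)/550000 = 0.6345,
# which will show up in the sorted rates below.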
# And histogram it.
ggplot() + geom_histogram(aes(x = rate), data = city_change)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
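# Following that message's advice, you can set the binwidth explicitly;
# 0.05 here is just one arbitrary choice:
ggplot() + geom_histogram(aes(x = rate), data = city_change, binwidth = 0.05)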
# There is kind of a natural break around 0.75. There aren't too many
# data points: let's sort them and print them to make sure
sort(city_change$rate)
## [1] 0.1385 0.2709 0.3673 0.3731 0.3952 0.4239 0.4254 0.4340 0.4637 0.4784
## [11] 0.4871 0.5147 0.5186 0.5255 0.5256 0.5293 0.5333 0.5618 0.5723 0.5885
## [21] 0.5951 0.6000 0.6031 0.6056 0.6087 0.6112 0.6244 0.6345 0.6373 0.6627
## [31] 0.6648 0.6648 0.6665 0.6701 0.6779 0.6835 0.6923 0.6955 0.7664 0.8032
## [41] 0.8095 0.8231 0.8305 0.8411 0.8593 0.8727 0.8744 0.8855 0.8932 0.9125
## [51] 0.9584 0.9658 0.9956 1.0161 1.0466
# Subset to just the fast cities
fast_cities <- subset(city_change, rate > 0.75)
# I'll also create a subset for the slow cities, with a cutoff of 0.5
slow_cities <- subset(city_change, rate < 0.5)
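# A quick sanity check on the split sizes (cities with rates between 0.5
# and 0.75 fall into neither group):
nrow(fast_cities)
nrow(slow_cities)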
I want to plot monthly prices for slow- and fast-growth communities. We can use the geo_quantile data.frame, but we need to add a column that labels each community's growth group.
I will create a new column, called change, and set everything to 'middle'. Then I'll change that to 'fast' for the fast communities and 'slow' for the slow ones.
geo_quantile$change <- "middle"
geo_quantile[which(geo_quantile$city %in% fast_cities$city), "change"] <- "fast"
geo_quantile[which(geo_quantile$city %in% slow_cities$city), "change"] <- "slow"
# Be sure to inspect geo_quantile!!!
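# A one-line check that the labels were assigned:
table(geo_quantile$change)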
# Before plotting, we need to reshape the data.frame. We did this last
# time too. Instead of one column per quantile, we need one row per
# quantile. The melt() function in the reshape package does this:
require(reshape)
## Loading required package: reshape
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
##
## rename, round_any
new_dat <- melt(geo_quantile, measure.vars = c("5%", "20%", "50%", "80%", "95%"))
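# melt keeps city, month, and change as id variables and stacks the five
# quantile columns into a variable/value pair, one row per quantile.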
# The column names are silly. Let's change them
names(new_dat) <- c("city", "month", "change", "quantile", "price")
# Now, at last... plot. The x and y aesthetics are month and price. Each
# line is grouped by quantile and city, and we can facet by change.
# Let's start with just the median price (note the use of subset):
ggplot(data = subset(new_dat, quantile == "50%")) + geom_line(aes(x = month,
y = price, group = city, color = city)) + facet_wrap(~change)
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_path).
# That's kind of a mess. Let's simplify by plotting smoothed curves
ggplot(data = subset(new_dat, quantile == "50%")) + geom_smooth(aes(x = month,
y = price, group = city, color = city)) + facet_wrap(~change)
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 1 rows containing missing values (stat_smooth).
## (message repeated for each facet)
# And let's do the whole shebang: create a matrix of plots, with one row
# per quantile and one column per change group, on a log price scale
ggplot(data = new_dat) + geom_smooth(aes(x = month, y = price, group = city,
color = city)) + facet_grid(quantile ~ change) + scale_y_log10()
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 1 rows containing missing values (stat_smooth).
## (the message and warning repeat for each panel in the grid)
You could spend a long time looking at that, trying to figure out the trends. I'm not terribly convinced that there is much to make of it, but a few points emerge. First, the 'fast' growers fall into two groups: those that rose fast and then crashed, and those that continued to grow and never crashed. That latter group is not necessarily fast growers so much as 'big' growers, so I need to rethink (high - start)/start as a measure of fast versus slow growth. Also, the biggest and most expensive markets often grew the most because they never stopped growing: prices continued to rise right through the crash.
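One possible refinement, sketched below but not something I have run: normalize the growth by how long it took to reach the peak, so a town that took five years to reach its high isn't grouped with one that boomed in two. The high_month column from find_prices gives the peak date; the per-month rate itself is my own hypothetical measure, not part of the lab.
# A hypothetical per-month growth rate (a sketch, not run): divide total
# growth by the number of months between the start and the peak
city_change$months_to_peak <- as.numeric(city_change$high_month -
    as.Date("2003-04-01"))/30.44  # difftime in days; ~30.44 days per month
city_change$rate_per_month <- city_change$rate/city_change$months_to_peak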