Lab 2: Solution Key

The Plan

I want to produce a series of plots showing house prices. Each plot will have time on the x-axis and price on the y-axis, for either high-value homes or low-value homes. I want to reproduce these plots for homes in high-value communities and for homes in low-value communities. Do I know yet which communities are expensive and which are cheap? No, I do not. I'll have to (1) define what is meant by “high-value communities” and “low-value communities.” Then, I'll have to (2) group (or sort) each house by whether it is in a high-value community or a low-value community. Then, I can proceed with plotting.

Data set-up

Nothing too special here: Just setting up the data for later analysis. This is copied from the lab exercise.

setwd("~/Dropbox/classes/Geog415_s13/geog415_s13_lab/lab2/")
# load packages:
require(ggplot2)

## Loading required package: ggplot2

require(plyr)  # These packages will be used later

## Loading required package: plyr



ad <- read.csv("addresses.csv", stringsAsFactors = FALSE)
sales <- read.csv("house-sales.csv", stringsAsFactors = FALSE)

# Now, merge the geocoded address and sales data
geo <- merge(sales, ad, by = c("street", "city", "zip"), all.x = TRUE)
# Now create new - properly formatted - dates and prices
geo$date2 <- as.Date(strptime(geo$date, "%Y-%m-%d"))
geo$price2 <- as.numeric(geo$price)

## Warning: NAs introduced by coercion


# Create a month variable instead of a date variable.  Each date will be
# converted to the first of the month
geo$month <- paste0(strftime(geo$date2, format = "%Y-%m"), "-01")
# This extracts the year and month, and then sets the day to 01 (the
# first day of each month).
# If geo$date2 is NA, geo$month should be NA also
geo$month[is.na(geo$date2)] <- NA
# Now proceed: reformat that text-valued month into a proper date
geo$month <- as.Date(geo$month, format = "%Y-%m-%d")
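
The month conversion above can be checked on a handful of dates before trusting it on the full dataset. The following is a minimal sketch in base R; the toy dates and the names toy_date, toy_date2, and toy_month are made up for illustration and are not part of the lab data. It uses format() with a literal "-01" in the format string, which is a slightly different route from the paste0() approach above. It should leave missing dates as NA without a separate fix-up step.

```r
# Toy dates (made up for illustration), including one value that cannot be parsed
toy_date <- c("2003-04-17", "2003-04-30", "2003-05-02", "not a date")

# Same parsing step as in the lab: unparseable text becomes NA
toy_date2 <- as.Date(strptime(toy_date, "%Y-%m-%d"))

# Keep the year and month, and force the day to 01.
# format() returns NA for NA dates, so the NA carries through on its own.
toy_month <- as.Date(format(toy_date2, "%Y-%m-01"))

print(toy_month)
```

If the printed months do not look like the first day of each month, the date format string is the first thing to check.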


# Subset on just the big cities.  First, calculate the city sizes:
cities <- as.data.frame(table(geo$city))  # Tabulate the data by city name
names(cities) <- c("city", "freq")  # Clean up the result with meaningful names!
big_cities <- subset(cities, freq > 3000)  # 3000 is a pretty arbitrary cutoff, but will select the larger cities
# Now we have a list of big_cities.
geo_big <- subset(geo, city %in% big_cities$city)

# Write a function to calculate the 20%, median, and 80% quantiles
quant_fun <- function(df) {
    quants <- quantile(df$price2, c(0.2, 0.5, 0.8), na.rm = TRUE)
    return(as.data.frame(t(quants)))
}
# Test it on the complete dataset
quant_fun(geo_big)

##      20%    50%    80%
## 1 380000 550000 770000

# Create a new dataset that has each of these calculated for each
# community and each month
geo_quantile <- ddply(geo_big, .(city, month), quant_fun)
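
To see what the split-by-community-and-month step produces, it can help to run it on a tiny dataset. This is a sketch only: the toy data frame and its values are invented, and I use base R's split() and lapply() as a stand-in for ddply() so that the example runs without plyr. The result should have the same shape as geo_quantile: one row per city and month, with one column per quantile.

```r
# Toy sales data (made-up values, for illustration only)
toy <- data.frame(
  city   = c("A", "A", "A", "B", "B", "B"),
  month  = as.Date("2003-04-01"),
  price2 = c(100, 200, 300, 400, 500, NA)
)

# Same function as in the lab
quant_fun <- function(df) {
  quants <- quantile(df$price2, c(0.2, 0.5, 0.8), na.rm = TRUE)
  return(as.data.frame(t(quants)))
}

# Base R stand-in for ddply(toy, .(city, month), quant_fun):
# split the rows into one piece per city and month, apply quant_fun
# to each piece, and stack the results back together
pieces <- split(toy, list(toy$city, toy$month), drop = TRUE)
toy_quantile <- do.call(rbind, lapply(pieces, function(piece) {
  cbind(piece[1, c("city", "month")], quant_fun(piece))
}))
rownames(toy_quantile) <- NULL

print(toy_quantile)
```

Note that the missing price in city B is dropped by na.rm = TRUE rather than turning the whole row into NA.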

Identifying low- and high-value communities

What is a low- and high-value community? I'll choose to look at the (1) median house price, (2) at the beginning of the data period, (3) for each community. Why the median? Why at the beginning of the data period? These decisions would have to be justified in a fuller analysis, but I will skip over this here.

# The median price is already calculated in the data.frame 'geo_quantile',
# but geo_quantile has a row for each date.  We're going to define high and
# low cities based on the first date.  What is the first date?
head(sort(geo_quantile$month))

## [1] "2003-04-01" "2003-04-01" "2003-04-01" "2003-04-01" "2003-04-01"
## [6] "2003-04-01"

# The first date is '2003-04-01'
geo_quantile_first <- subset(geo_quantile, month == "2003-04-01")
# What are the median prices:
sort(geo_quantile_first[, "50%"])

##  [1] 267000 283750 285000 291000 292500 298000 317750 318500 323500 330000
## [11] 335000 336500 348000 354000 355000 367500 373000 375000 387250 390000
## [21] 390000 415500 416500 420000 421750 431000 442500 443750 455000 462000
## [31] 466500 470000 474500 477000 479500 492500 495000 501000 505500 509000
## [41] 519000 536000 539000 540000 550000 553250 595000 620000 630000 650000
## [51] 655000 670000 685500 688000 730000

# I think a decent division between high- and low-value communities might
# be 500,000.  Create a new dataset out of geo_quantile_first, with only
# the high communities:
# high_communities <- subset(geo_quantile_first, 50%>500000)
# That didn't work at all... the problem is the column called 50%.  It
# should really have a text name, not a number name
names(geo_quantile_first) <- c("city", "month", "quintile_low", "median", "quintile_high")
# Rename geo_quantile while we're at it.
names(geo_quantile) <- c("city", "month", "quintile_low", "median", "quintile_high")

high_communities <- subset(geo_quantile_first, median > 5e+05)
low_communities <- subset(geo_quantile_first, median <= 5e+05)

# Let's split geo_quantile into high and low communities
geo_quantile_high <- subset(geo_quantile, city %in% high_communities$city)
geo_quantile_low <- subset(geo_quantile, city %in% low_communities$city)
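
One thing worth checking with this approach is whether every community ends up in exactly one of the two groups. A community with no sales in the first month has no row in geo_quantile_first, so it would land in neither. Here is a minimal sketch of that check on invented data. The toy values, the cutoff variable, and the value_group column are my own names for illustration, not part of the lab. It labels each row instead of splitting into two data.frames.

```r
# Toy per-city, per-month medians (made-up values, for illustration only)
toy_quantile <- data.frame(
  city   = c("A", "A", "B", "B", "C"),
  month  = as.Date(c("2003-04-01", "2003-05-01", "2003-04-01",
                     "2003-05-01", "2003-05-01")),
  median = c(600000, 610000, 300000, 310000, 450000)
)

cutoff <- 5e+05  # same threshold as in the lab

# Classify cities using only the first month in the data
first_month <- min(toy_quantile$month, na.rm = TRUE)
toy_first <- subset(toy_quantile, month == first_month)
high_cities <- toy_first$city[toy_first$median > cutoff]
low_cities <- toy_first$city[toy_first$median <= cutoff]

# Label every row; cities absent from the first month (city C here) get NA
toy_quantile$value_group <- ifelse(toy_quantile$city %in% high_cities, "high",
    ifelse(toy_quantile$city %in% low_cities, "low", NA))

print(unique(toy_quantile[, c("city", "value_group")]))
```

Any city that prints with an NA label was not classified, and would be silently missing from both sets of plots.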

Plotting

I have two data.frames: one for low communities, and one for high. Each data.frame has a row for each community and date combination, and a column for the low quintile (20%), median (50%), and high quintile (80%). To plot these, I can copy code again from the lab.

# Plot the high values in high communities
ggplot(aes(x = month, y = quintile_high), data = geo_quantile_high) + geom_line(aes(group = city, 
    color = city))

## Warning: Removed 1 rows containing missing values (geom_path).


# Plot the low values in high communities
ggplot(aes(x = month, y = quintile_low), data = geo_quantile_high) + geom_line(aes(group = city, 
    color = city))

## Warning: Removed 1 rows containing missing values (geom_path).



# Plot the high values in low communities
ggplot(aes(x = month, y = quintile_high), data = geo_quantile_low) + geom_line(aes(group = city, 
    color = city))

## Warning: Removed 1 rows containing missing values (geom_path).


# Plot the low values in low communities
ggplot(aes(x = month, y = quintile_low), data = geo_quantile_low) + geom_line(aes(group = city, 
    color = city))

## Warning: Removed 1 rows containing missing values (geom_path).

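
The four plots above differ only in which data.frame and which column they use, so they could also be drawn as one figure with four panels. This is a sketch of one possible way to do that, not the approach the lab asks for. The toy data, the community, quantile, and price columns, and the long data.frame are all assumptions made for illustration. The reshaping uses base R, and the plotting step runs only if ggplot2 is installed.

```r
# Toy versions of geo_quantile_high and geo_quantile_low (made-up values)
toy_high <- data.frame(
  city = c("A", "A"),
  month = as.Date(c("2003-04-01", "2003-05-01")),
  quintile_low = c(450000, 460000),
  quintile_high = c(800000, 820000)
)
toy_low <- data.frame(
  city = c("B", "B"),
  month = as.Date(c("2003-04-01", "2003-05-01")),
  quintile_low = c(200000, 205000),
  quintile_high = c(400000, 410000)
)

# Record which group each row came from, then stack the two data.frames
toy_high$community <- "high-value"
toy_low$community <- "low-value"
both <- rbind(toy_high, toy_low)

# Reshape to long format: one row per city, month, and quantile
id_cols <- both[, c("city", "month", "community")]
long <- rbind(
  data.frame(id_cols, quantile = "20%", price = both$quintile_low),
  data.frame(id_cols, quantile = "80%", price = both$quintile_high)
)

# One panel per quantile (rows) and community type (columns)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = month, y = price, group = city, color = city)) +
    geom_line() +
    facet_grid(quantile ~ community)
  print(p)
}
```

By default facet_grid() gives all panels the same y-axis, which makes the high-value and low-value communities directly comparable but can flatten the lines in the cheaper panels.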