There are a few different file types associated with R. An “R script” is a text file containing R commands, and has a “.R” extension. You can save all or some of the variables from a workspace into a file with a “.Rdata” extension, for example “lab2.Rdata”. More on this later.
But before even doing this, I'll introduce the “Project” feature, which is particular to RStudio; it is not a feature of R itself. An RStudio Project creates a new folder in which you can save all of your data, files, and plots. RStudio will remember all of this from session to session: if you close RStudio and then open it back up, it will remember everything, even if you open the project from a different computer.
To create a Project, from within RStudio, click Project -> Create Project. (You may be prompted whether you want to close your current R session. If you're just starting, there is no reason not to; if you have something important open, you may want to save it first.) You will then be prompted whether you want to create a new folder or to use an existing folder. This is up to you. Whatever you choose, R will then start up inside the project folder. You might create a folder called lab2, for example. There will also be a file with the extension “.Rproj”; if you click on this from your file explorer, it will start up RStudio exactly where you left off.
Once you're in RStudio, create an R script. Type CTRL + SHIFT + N or go to File -> New -> R Script. This creates a blank script. It is like a text document, but everything in it is an R command (or a comment). Suppose your R script looks something like this:
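x <- 3  # assign the value 3 to x
x + 3   # print the value of x + 3 (here, 6)
x <- 2  # now reassign x, so it equals 2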
There are a few ways you can work with this. If you place your cursor on the first line, you can either click “Run” or type CTRL + RETURN, and the line will be pasted to the Console and executed. You can also select multiple lines and Run, and the entire selection will be run. If you click Source (or type CTRL + SHIFT + S), the entire script will be run silently: you can check that x now equals 2, but you will never know that it was first 3, or that x + 3 was 6. If you click Source -> Source with Echo (CTRL + SHIFT + RETURN), then everything will run, AND all of the results will be printed to the screen.
Here's my usual workflow: I write commands in the script, running each line with CTRL + RETURN as I go and fixing anything that doesn't work before moving on; once the script runs cleanly from top to bottom, I Source the whole thing to check it as a unit.
Don't forget to save the script periodically. Your Project is saved automatically. I usually put my Projects in a Dropbox folder, so that I can run them from either my desktop or my laptop.
I love the Project feature of R Studio.
There are so many functions available in R that you might spend your whole life using R every day and still only use a small fraction of them. On the principle of only loading what you need, functions in R are collected into groups called packages, and you can load packages as you need them.
We'll be using the ggplot2 package. To use a package, there are two steps: first, you must install it from the series of tubes we call the internet to a “library” on your computer. You only need to do this once. A library is just a specific folder on your computer where R will look for packages. You can have multiple libraries on a computer, but this can cause confusion when you need to update everything to a new version and you have libraries all over the place. Still, if you can't install a package to a particular computer (such as in the computer lab), you can create a new library on the H drive or on a USB drive. Second, once the package is installed in a library, each time you want to use it you will need to load it from that library.
In RStudio, you can click on Tools -> Install Packages…, which will open a dialog box. You can then type ggplot2 in the text box and click Install. More generally, you can type the following command:
install.packages("ggplot2", dependencies = TRUE)
## Installing package(s) into
## '/Applications/RStudio.app/Contents/Resources/R/library' (as 'lib' is
## unspecified)
## Error: trying to use CRAN without setting a mirror
# If you are installing to a custom library, then you'll use
# install.packages('ggplot2', dependencies = TRUE, lib = NULL), where NULL
# would be replaced by the file path of the library (in quotes).
library(ggplot2)
# That's it! If the library is not the default one, then you can type
# library(ggplot2, lib.loc = NULL), where NULL would be replaced by the
# file path of the library (in quotes).
# To display the help for a package, you would type
help(package = "ggplot2")
If you don't want to use a package anymore, you may either leave it there, or you may choose to “detach” it.
# detach("package:ggplot2")  # Commented out, because I don't want to do this.
Why would you want to detach a package? Good question. I don't really know. One possible downside to having a lot of packages attached is that it makes help() slower. Another downside is that sometimes two different packages each have a function with the same name, and it can be confusing to know which one R will use. Detaching might help. But there is another, better solution, which I'll cover another day.
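As a sneak preview, one such solution is R's :: operator, which lets you call a function from a specific package by prefixing the package name, with no attaching or detaching involved:
stats::median(c(1, 2, 3))  # explicitly use median() from the stats package
## [1] 2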
This lab will use data on housing transactions in the San Francisco Bay Area around the period of the housing crash. These data were downloaded from the San Francisco Chronicle, geocoded, and made publicly available by Hadley Wickham.
I have made the files available for download here: https://www.dropbox.com/s/x7vreaa4qnrfdv1/addresses.csv https://www.dropbox.com/s/vxn6zbpsdr6bz7j/house-sales.csv
Merging datasets is one of the most common data management tasks. In this case, we have two datasets, addresses (which have lat/lon coordinates) and house sales, that need to be merged from two data.frames into one. To perform a merge, one needs to identify common columns that can act as unique “keys” to match on. In this case, both datasets contain a street, city, and zip that together form a unique key.
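Here is a toy example (with made-up streets, prices, and coordinates, purely for illustration) of how merge() matches rows using key columns:
df1 <- data.frame(street = c("1 Main St", "2 Oak Ave"), city = "Berkeley", price = c(500000, 650000))
df2 <- data.frame(street = c("2 Oak Ave", "1 Main St"), city = "Berkeley", lat = c(37.86, 37.87))
merge(df1, df2, by = c("street", "city"))  # rows are matched by key, not by position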
# Change this for your own computer...
setwd("~/Dropbox/classes/Geog415_s13/geog415_s13_lab/lab2/")
# load packages:
require(ggplot2)
require(plyr) # These packages will be used later; don't worry about them now.
## Loading required package: plyr
ad <- read.csv("addresses.csv", stringsAsFactors = FALSE)
sales <- read.csv("house-sales.csv", stringsAsFactors = FALSE)
# At this point, you should inspect ad and sales to see what is included.
# In RStudio, you can double click on the data to inspect them. Other
# helpful functions are:
names(ad)
names(sales)
head(ad)
head(sales)
# Now, merge the data
geo <- merge(sales, ad, by = c("street", "city", "zip"), all.x = TRUE)
# Now, tell R what format the dates and prices are in.
geo$date <- as.Date(strptime(geo$date, "%Y-%m-%d"))
# There are two conversions in that last line: 1) strptime converts from
# character to POSIX (a standard time format in computing); 2) as.Date
# converts from POSIX to R's 'Date' type.
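# A quick illustration of the two steps on a single (made-up) date string:
strptime("2008-02-13", "%Y-%m-%d")  # character -> POSIXlt
as.Date(strptime("2008-02-13", "%Y-%m-%d"))  # POSIXlt -> Date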
geo$price <- as.numeric(geo$price)
## Warning: NAs introduced by coercion
# That converted price from character to numeric
We will plot the lat/lon of each address in order to see where they are. Our map will have coordinates of lat/lon, so the x and y aesthetics will be long and lat, respectively. We will want a scatterplot, which uses the point geometry.
ggplot(aes(x = long, y = lat), data = geo) + geom_point()
## Warning: Removed 37 rows containing missing values (geom_point).
# That's not quite right: ggplot doesn't yet know that the coordinates
# are geographic coordinates. We can set the scales properly by using
# ggplot's coord_map() function.
ggplot(data = geo, aes(x = long, y = lat)) + geom_point() + coord_map()
## Warning: Removed 37 rows containing missing values (geom_point).
# coord_map() defaults to a Mercator projection.
There are many cities (type unique(geo$city) to see them all). Some of those cities are pretty small, and their number of sales may be too small to analyze reliably. We will subset to just those houses in the bigger cities.
R's table() function is a quick way to create a frequency table. We will create a frequency table of the cities (i.e. the number of sales in each city), immediately put that frequency table into a data.frame, and then select just the larger cities.
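For example, on a small made-up vector of city names:
table(c("SF", "Oakland", "SF", "SF")) # counts: Oakland 1, SF 3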
# Create a data.frame with the cities and their sample sizes
cities <- as.data.frame(table(geo$city))
# Be sure to inspect the result!!!!
names(cities) <- c("city", "freq") # Clean up the result with meaningful names!
# Create a subset of just the larger cities
big_cities <- subset(cities, freq > 3000) # 3000 is a pretty arbitrary cutoff
# Now we have a list of big_cities.
geo_big <- subset(geo, city %in% big_cities$city)
# %in% is really cool... Here we look through each row of the housing data
# set and see if the city is one of the big cities or not.
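# A tiny demo of %in% on made-up values:
c("a", "b", "c") %in% c("a", "c")
## [1]  TRUE FALSE  TRUE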
# Create a quick map
ggplot(data = geo_big, aes(x = long, y = lat)) + geom_point() + coord_map()
## Warning: Removed 34 rows containing missing values (geom_point).
# Save the graphic as a variable...
map <- ggplot(data = geo_big, aes(x = long, y = lat)) + geom_point() +
coord_map()
map # plot the graphic
## Warning: Removed 34 rows containing missing values (geom_point).
# Color by city name
map <- ggplot(data = geo_big, aes(x = long, y = lat)) + geom_point(aes(color = city)) +
coord_map()
map # plot the graphic
## Warning: Removed 34 rows containing missing values (geom_point).
In this section, we will calculate and plot the median home price in each city over time.
One of the greatest advantages of a statistical computing language is the ability to create new functions to do computations on data. In this section, we will see our first handwritten function.
The other thing we will see in this section is the plyr package. Much of the data manipulation we do requires us to 1) split the data into pieces, 2) apply a function to each piece, and 3) combine the results back into one object. The plyr package is written to do exactly this. In particular, we will split the data into a data frame for each city and time, calculate the median home price for each city and time, and then combine the results again. So finally, instead of a data frame where each row is a house sale, we will have a data.frame where each row is a particular city and point in time.
To do the split-apply-combine strategy, we will use the ddply function in the plyr package. The “dd” in ddply stands for “data.frame in, data.frame out”. There are other types, such as mdply (“matrix in, data.frame out”). For the apply step of “split-apply-combine,” we need to write a function that takes in a data.frame, calculates the median, and returns the median as a data.frame. While we are at it, I will also calculate the number of housing sales in addition to the median.
The function looks like this:
# Create a function, that takes a data.frame with a variable called price,
# and returns a new data.frame with 1) the number of rows, and 2) the
# median price
agg_fun <- function(df) {
new.df <- data.frame(n = nrow(df), med = median(df$price, na.rm = TRUE))
# I used two functions here: nrow and median. I hope they are
# self-explanatory; if not, you can look at their help pages.
return(new.df)  # Now, tell the function what to return.
}
A handwritten function has 4 basic components: 1) a name that the function is assigned to (here, agg_fun); 2) the function keyword, followed by the function's arguments in parentheses (here, df); 3) a body between curly braces, where the computation happens; and 4) a return value, set with return().
Here's how we use the function:
agg_fun(geo_big) # Take the function out for a spin...
## n med
## 1 419441 550000
Now we are ready to split-apply-combine:
# Now, apply the function to each city and date combination. This will
# take a little while... there are 400000 rows to crunch through. This is
# something I would normally do in SQL (with the sqldf() function) rather
# than with ddply, but that's a topic for another course.
bigsum <- ddply(geo_big, .(city, date), agg_fun)
# That split the geo_big data.frame by unique combinations of city and
# date, applying the agg_fun() function to each subset. The resulting
# data.frame will have columns for city, date, and whatever columns are
# returned by agg_fun().
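# Inspect the result (one row per city/date combination):
head(bigsum)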
Now we are ready to plot the median price over time for each city:
ggplot(data = bigsum, aes(x = date, y = med)) + geom_line(aes(group = city))
## Warning: Removed 2 rows containing missing values (geom_path).
# Note the use of the group aesthetic in geom_line. If I had put the
# group aesthetic in the ggplot() call, that would have worked in this
# case too.
# Try it with log scaling:
ggplot(data = bigsum, aes(x = date, y = med)) + geom_line(aes(group = city)) +
    scale_y_log10("median price", breaks = c(250000, 5e+05, 1e+06))
## Warning: Removed 2 rows containing missing values (geom_path).
# There are too many overlapping lines. Let's use a 10% alpha
# transparency to reduce the overplotting:
ggplot(data = bigsum, aes(x = date, y = med)) + geom_line(aes(group = city),
alpha = 0.1) + scale_y_log10("median price", breaks = c(250000, 5e+05, 1e+06))
## Warning: Removed 2 rows containing missing values (geom_path).
# Note: alpha is not an aesthetic in this case, because it is set outside
# of aes(). Aesthetics are visual variables, requiring a scale or a
# legend; here, the alpha value is a constant and does not need a legend
# entry.
We've seen aesthetics, geoms, and scales in ggplot. Now we'll look at faceting. We'll look at the median price by age of house. There are too many distinct house ages, though, so we'll first group the house ages into a categorical variable.
geo_big$year_r <- cut(geo_big$year, breaks = c(1840, 1940, 1960,
1980, 1990, 2000, 2009), include.lowest = TRUE, labels = c("<1940", "1940 - 1959",
"1960 - 1979", "1980 - 1989", "1990 - 1999", "2000 - 2008"))
# Create a simple set of boxplots:
ggplot(aes(x = year_r, y = price), data = geo_big) + geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
ggplot(aes(x = year_r, y = price), data = geo_big) + geom_boxplot() +
scale_y_log10()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
# Create time series. We will do the split-apply-combine strategy again,
# and we can reuse our function (yay!!!):
#   year_med <- ddply(geo_big, .(year_r, city, date), agg_fun)
# That took too long for a lab on some older machines, so we will
# collapse date to month instead. I wasn't sure how to do this, so I went
# to Stack Overflow, searched on "[r] year month Date", and immediately
# found an answer!
geo_big$month <- as.Date(paste0(strftime(geo_big$date, format = "%Y-%m"),
"-01"))
# This extracts the year and month, and then sets the day to 01 (the
# first day of each month).
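# For example, a sale dated 2008-02-13 becomes 2008-02-01:
as.Date(paste0(strftime(as.Date("2008-02-13"), format = "%Y-%m"), "-01"))
## [1] "2008-02-01"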
year_med <- ddply(geo_big, .(year_r, city, month), agg_fun)
ggplot(aes(x = month, y = med), data = year_med) + geom_line(aes(group = city),
alpha = 0.1) + facet_wrap(~year_r) + scale_y_log10()
## Warning: Removed 2 rows containing missing values (geom_path).
# I don't want to plot the NA facet:
ggplot(aes(x = month, y = med), data = subset(year_med, !is.na(year_r))) +
geom_line(aes(group = city), alpha = 0.1) + facet_wrap(~year_r) + scale_y_log10()
Here, I will map the median house price. Rather than mapping the median house price per city, I'll divide the region into 0.05 x 0.05 degree grid cells and calculate the median price per grid cell.
# Create a tile plot, with tiles of size 0.05 degrees:
geo$lat2 <- round(2 * geo$lat, digits = 1)/2
geo$long2 <- round(2 * geo$long, digits = 1)/2
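# e.g. round(2 * 37.774, digits = 1)/2 is 37.75: each coordinate snaps to
# the nearest 0.05 degrees.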
geo_plot <- ddply(geo, .(lat2, long2), agg_fun)
# Set the median to NA for tiles with too few sales:
geo_plot[geo_plot$n < 10, "med"] <- NA
# Cut the median price/cell into a categorical variable:
geo_plot$med_r <- cut_number(geo_plot$med, 7)
ggplot(aes(x = long2, y = lat2), data = geo_plot) + geom_tile(aes(fill = med_r)) +
coord_map() + scale_fill_brewer("Median Price", palette = "YlGnBu")
In this section, I will produce a time series plot of the housing prices, and overlay lines representing the .02, .1, .25, .5, .75, .9, and .98 quantiles. Hopefully, this will help to show how the distribution of housing prices changed over time.
# See how the quantile function works:
quantile(geo_big$price, c(0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98),
na.rm = TRUE)
## 2% 10% 25% 50% 75% 90% 98%
## 217000 315000 410000 550000 720000 940000 1500000
We'll use the split-apply-combine strategy again. We'll need to make the quantiles into a data.frame.
# This was my first try: it didn't work as expected
as.data.frame(quantile(geo_big$price, c(0.02, 0.1, 0.25, 0.5, 0.75,
0.9, 0.98), na.rm = TRUE))
## quantile(geo_big$price, c(0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98), na.rm = TRUE)
## 2% 217000
## 10% 315000
## 25% 410000
## 50% 550000
## 75% 720000
## 90% 940000
## 98% 1500000
# The data.frame was sideways. The transpose function t() fixes it.
as.data.frame(t(quantile(geo_big$price, c(0.02, 0.1, 0.25, 0.5, 0.75,
0.9, 0.98), na.rm = TRUE)))
## 2% 10% 25% 50% 75% 90% 98%
## 1 217000 315000 410000 550000 720000 940000 1500000
# There, that worked. Now we can write a new function to use this.
# A function that takes a data.frame with a price column, and returns the
# quantiles of price as a one-row data.frame:
quant_fun <- function(df) {
quants <- quantile(df$price, c(0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98), na.rm = TRUE)
return(as.data.frame(t(quants)))
}
new_dat <- ddply(geo_big, .(date), quant_fun)
head(new_dat)
# I want to reshape this data.frame. The melt() function in the reshape
# package does this:
require(reshape)
## Loading required package: reshape
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
##
## rename, round_any
new_dat <- melt(new_dat, id.vars = "date")
head(new_dat)
ggplot(aes(x = date, y = value, group = variable), data = new_dat) +
geom_line() + scale_y_log10()
That worked. But now I want to show how to overlay these quantile lines on the actual data points. I can create both geom_line() and geom_point() geometries. But since the x and y variables have different names in the two datasets, there is no longer a simple default. I have to move the aes() call out of the ggplot() function and into each of the geometry functions. The following code does that.
ggplot() + geom_line(aes(x = date, y = value, group = variable),
data = new_dat) + scale_y_log10() + geom_point(aes(x = date, y = price),
data = geo_big, alpha = 0.05, size = 0.75)
It appears from these plots that the highest valued homes lost some value during the crash, but didn't fall below their 2002 levels. This is in stark contrast to the lowest valued homes, which crashed to much lower values than their 2002 levels.
This homework is very open-ended, and will require some problem solving, some trial and error in R, and a little creativity.
Compare the performance of high- and low-value houses in high- and low-value cities during the housing crisis. Across the entire region, low-value homes seemed to lose more of their value than high-value homes. Is this consistent across communities?
In order to do this, you will need to determine a sensible way to categorize the communities. Functions you may find especially helpful are cut(), subset(), table(), and ddply().
Write a brief report explaining your findings: no more than one page of text. Include as many graphics as necessary to illustrate your point.
Packages used in this lab: ggplot2, plyr, reshape
Functions used in this lab: merge(), subset(), table(), as.data.frame(), plyr::ddply()