There are a few different file types associated with R. An “R script” is a text file containing R commands, and has a “.R” extension. You can save all or some of the variables from a workspace into a file with a “.Rdata” extension, for example “lab2.Rdata”. More on this later.
But before even doing this, I'll introduce the “Project” feature, which is particular to RStudio; it is not a feature of R itself. An RStudio Project creates a new folder in which you can save all of your data, files, and plots. RStudio will remember all of this from session to session: if you close RStudio and then open it back up, it will remember everything, even if you open the project from a different computer.
To create a Project, from within RStudio, click Project -> Create Project. (You may be prompted whether you want to close your current R session. If you're just starting, there is no reason not to; if you have something important open, you may want to save it first.) You will then be prompted whether you want to create a new folder or to use an existing folder. This is up to you. Whatever you choose, R will then start up inside the project folder. You might create a folder called lab2, for example. There will also be a file with the extension “.Rproj”; if you click on this from your file explorer, it will start up RStudio exactly where you left off.
Once you're in RStudio, create an R script. Type CTRL + SHIFT + N or go to File -> New -> R Script. This creates a blank script. It is like a text document, but everything in it is an R command (or a comment). Suppose your R script looks something like this:
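x <- 3  # assign the value 3 to x
x + 3   # print the value of x + 3 (here, 6)
x <- 2  # now reassign x, so it equals 2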
There are a few ways you can work with this. If you place your cursor on the first line, you can either click “Run” or type CTRL + RETURN, and the line will be pasted to the Console and executed. You can also select multiple lines and Run, and the entire selection will be run. If you click Source (or type CTRL + SHIFT + S), the entire script will be run silently: you can check that x now equals 2, but you will never know that it was first 3, or that x + 3 was 6. If you click Source -> Source with Echo (CTRL + SHIFT + RETURN), then everything will run, AND all of the results will be printed to the screen.
Here's my usual workflow: I write commands in the script, running each line with CTRL + RETURN as I go and fixing anything that doesn't work before moving on; once the script runs cleanly from top to bottom, I Source the whole thing to check it as a unit.
Don't forget to save the script periodically. Your Project is saved automatically. I usually put my Projects in a Dropbox folder, so that I can run them from either my desktop or my laptop.
I love the Project feature of R Studio.
There are so many functions available in R that you might spend your whole life using R every day and still only use a small fraction of them. On the principle of only loading what you need, functions in R are collected into groups called packages, and you can load packages as you need them.
We'll be using the ggplot2 package. To use a package, there are two steps: first, you must install it from the series of tubes we call the internet to a “library” on your computer. You only need to do this once. A library is just a specific folder on your computer where R will look for packages. You can have multiple libraries on a computer, but this can cause confusion when you need to update everything to a new version and you have libraries all over the place. Still, if you can't install a package to a particular computer (such as in the computer lab), you can create a new library on the H drive or on a USB drive. Second, once the package is installed in a library, each time you want to use it you will need to load it from that library.
In RStudio, you can click on Tools -> Install Packages…, which will open a dialog box. You can then type ggplot2 in the text box and click Install. More generally, you can type the following command:
install.packages("ggplot2", dependencies = TRUE)
## Installing package(s) into
## '/Applications/RStudio.app/Contents/Resources/R/library' (as 'lib' is
## unspecified)
## Error: trying to use CRAN without setting a mirror
# If you are installing to a custom library, then you'll use
# install.packages('ggplot2', dependencies = TRUE, lib = NULL), where NULL
# would be replaced by the file path of the library (in quotes).
library(ggplot2)
# That's it! If the library is not the default one, then you can type
# library(ggplot2, lib.loc = NULL), where NULL would be replaced by the
# file path of the library (in quotes).
# To display the help for a package, you would type
help(package = "ggplot2")
If you don't want to use a package anymore, you may either leave it there, or you may choose to “detach” it.
# detach("package:ggplot2")  # Commented out, because I don't want to do this.
Why would you want to detach a package? Good question. I don't really know. One possible downside to having a lot of packages attached is that it makes help() slower. Another downside is that sometimes two different packages each have a function with the same name, and it can be confusing to know which one R will use. Detaching might help. But there is another, better solution, which I'll cover another day.
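As a sneak preview, one such solution is R's :: operator, which lets you call a function from a specific package by prefixing the package name, with no attaching or detaching involved:
stats::median(c(1, 2, 3))  # explicitly use median() from the stats package
## [1] 2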
This lab will use data on housing transactions in the San Francisco Bay Area around the period of the housing crash. These data were downloaded from the San Francisco Chronicle, geocoded, and made publicly available by Hadley Wickham.
I have made the files available for download here: https://www.dropbox.com/s/x7vreaa4qnrfdv1/addresses.csv https://www.dropbox.com/s/vxn6zbpsdr6bz7j/house-sales.csv
Merging datasets is one of the most common data management tasks. In this case, we have two datasets, addresses (which have lat/lon coordinates) and house sales, that need to be merged from two data.frames into one. To perform a merge, one needs to identify common columns that can act as unique “keys” to match on. In this case, both datasets contain a street, city, and zip that together form a unique key.
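Here is a toy example (with made-up streets, prices, and coordinates, purely for illustration) of how merge() matches rows using key columns:
df1 <- data.frame(street = c("1 Main St", "2 Oak Ave"), city = "Berkeley", price = c(500000, 650000))
df2 <- data.frame(street = c("2 Oak Ave", "1 Main St"), city = "Berkeley", lat = c(37.86, 37.87))
merge(df1, df2, by = c("street", "city"))  # rows are matched by key, not by position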
# Change this for your own computer...
setwd("~/Dropbox/classes/Geog415_s13/geog415_s13_lab/lab2/")
# load packages:
require(ggplot2)
require(plyr) # These packages will be used later; don't worry about them now.
## Loading required package: plyr
ad <- read.csv("addresses.csv", stringsAsFactors = FALSE)
sales <- read.csv("house-sales.csv", stringsAsFactors = FALSE)
# At this point, you should inspect ad and sales to see what is included.
# In RStudio, you can double click on the data to inspect them. Other
# helpful functions are:
names(ad)
names(sales)
head(ad)
head(sales)
# Now, merge the data
geo <- merge(sales, ad, by = c("street", "city", "zip"), all.x = TRUE)
# Now, tell R what format the dates and prices are in.
geo$date <- as.Date(strptime(geo$date, "%Y-%m-%d"))
# There are two conversions in that last line: 1) strptime converts from
# character to POSIX (a standard time format in computing); 2) as.Date
# converts from POSIX to R's 'Date' type.
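# A quick illustration of the two steps on a single (made-up) date string:
strptime("2008-02-13", "%Y-%m-%d")  # character -> POSIXlt
as.Date(strptime("2008-02-13", "%Y-%m-%d"))  # POSIXlt -> Date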
geo$price <- as.numeric(geo$price)
## Warning: NAs introduced by coercion
# That converted price from character to numeric
We will plot the lat/lon of each address in order to see where they are. Our map will have coordinates of lat/lon, so the x and y aesthetics will be long and lat, respectively. We will want a scatterplot, which uses the point geometry.
ggplot(aes(x = long, y = lat), data = geo) + geom_point()
## Warning: Removed 37 rows containing missing values (geom_point).
# That's not quite right: ggplot doesn't yet know that the coordinates
# are geographic coordinates. We can set the scales properly by using
# ggplot's coord_map() function.
ggplot(data = geo, aes(x = long, y = lat)) + geom_point() + coord_map()
## Warning: Removed 37 rows containing missing values (geom_point).
# coord_map() defaults to a Mercator projection.
There are many cities (type unique(geo$city) to see them all). Some of those cities are pretty small, and their number of sales may be too small to analyze reliably. We will subset to just those houses in the bigger cities.
R's table() function is a quick way to create a frequency table. We will create a frequency table of the cities (i.e. the number of sales in each city), immediately put that frequency table into a data.frame, and then select just the larger cities.
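For example, on a small made-up vector of city names:
table(c("SF", "Oakland", "SF", "SF")) # counts: Oakland 1, SF 3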
# Create a data.frame with the cities and their sample sizes
cities <- as.data.frame(table(geo$city))
# Be sure to inspect the result!!!!
names(cities) <- c("city", "freq") # Clean up the result with meaningful names!
# Create a subset of just the larger cities
big_cities <- subset(cities, freq > 3000) # 3000 is a pretty arbitrary cutoff
# Now we have a list of big_cities.
geo_big <- subset(geo, city %in% big_cities$city)
# %in% is really cool... Here we look through each row of the housing data
# set and see if the city is one of the big cities or not.
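# A tiny demo of %in% on made-up values:
c("a", "b", "c") %in% c("a", "c")
## [1]  TRUE FALSE  TRUE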
# Create a quick map
ggplot(data = geo_big, aes(x = long, y = lat)) + geom_point() + coord_map()
## Warning: Removed 34 rows containing missing values (geom_point).
# Save the graphic as a variable...
map <- ggplot(data = geo_big, aes(x = long, y = lat)) + geom_point() +
coord_map()
map # plot the graphic
## Warning: Removed 34 rows containing missing values (geom_point).
# Color by city name
map <- ggplot(data = geo_big, aes(x = long, y = lat)) + geom_point(aes(color = city)) +
coord_map()
map # plot the graphic
## Warning: Removed 34 rows containing missing values (geom_point).
In this section, we will calculate and plot the median home price in each city over time.
One of the greatest advantages of a statistical computing language is the ability to create new functions to do computations on data. In this section, we will see our first handwritten function.
The other thing we will see in this section is the plyr package. Much of the data manipulation we do requires us to 1) split the data into pieces, 2) apply a function to each piece, and 3) combine the results back into one object. The plyr package is written to do exactly this. In particular, we will split the data into a data frame for each city and time, calculate the median home price for each city and time, and then combine the results again. So finally, instead of a data frame where each row is a house sale, we will have a data.frame where each row is a particular city and point in time.
To do the split-apply-combine strategy, we will use the ddply function in the plyr package. The “dd” in ddply stands for “data.frame in, data.frame out”. There are other types, such as mdply (“matrix in, data.frame out”). For the apply step of “split-apply-combine,” we need to write a function that takes in a data.frame, calculates the median, and returns the median as a data.frame. While we are at it, I will also calculate the number of housing sales in addition to the median.
The function looks like this:
# Create a function, that takes a data.frame with a variable called price,
# and returns a new data.frame with 1) the number of rows, and 2) the
# median price
agg_fun <- function(df) {
new.df <- data.frame(n = nrow(df), med = median(df$price, na.rm = TRUE))
# I used two functions here: nrow and median. I hope they are
# self-explanatory; if not, you can look at their help pages.
return(new.df)  # Now, tell the function what to return.
}
A handwritten function has 4 basic components: 1) a name that the function is assigned to (here, agg_fun); 2) the function keyword, followed by the function's arguments in parentheses (here, df); 3) a body between curly braces, where the computation happens; and 4) a return value, set with return().
Here's how we use the function:
agg_fun(geo_big) # Take the function out for a spin...
## n med
## 1 419441 550000
Now we are ready to split-apply-combine:
# Now, apply the function to each city and date combination. This will
# take a little while... there are 400000 rows to crunch through. This is
# something I would normally do in SQL (with the sqldf() function) rather
# than with ddply, but that's a topic for another course.
bigsum <- ddply(geo_big, .(city, date), agg_fun)
# That split the geo_big data.frame by unique combinations of city and
# date, applying the agg_fun() function to each subset. The resulting
# data.frame will have columns for city, date, and whatever columns are
# returned by agg_fun().
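# Inspect the result (one row per city/date combination):
head(bigsum)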
Now we are ready to plot the median price over time for each city:
ggplot(data = bigsum, aes(x = date, y = med)) + geom_line(aes(group = city))
## Warning: Removed 2 rows containing missing values (geom_path).
# Note the use of the group aesthetic in geom_line. If I had put the
# group aesthetic in the ggplot() call, that would have worked in this
# case too.
# Try it with log scaling:
ggplot(data = bigsum, aes(x = date, y = med)) + geom_line(aes(group = city)) +
    scale_y_log10("median price", breaks = c(250000, 5e+05, 1e+06))
## Warning: Removed 2 rows containing missing values (geom_path).
# There are too many overlapping lines. Let's use a 10% alpha
# transparency to reduce the overplotting:
ggplot(data = bigsum, aes(x = date, y = med)) + geom_line(aes(group = city),
alpha = 0.1) + scale_y_log10("median price", breaks = c(250000, 5e+05, 1e+06))
## Warning: Removed 2 rows containing missing values (geom_path).
# Note: alpha is not an aesthetic in this case, because it is set outside
# of aes(). Aesthetics are visual variables, requiring a scale or a
# legend; here, the alpha value is a constant and does not need a legend
# entry.
We've seen aesthetics, geoms, and scales in ggplot. Now we'll look at faceting. We'll look at the median price by age of house. There are too many distinct house ages, though, so we'll first group the house ages into a categorical variable.
geo_big$year_r <- cut(geo_big$year, breaks = c(1840, 1940, 1960,
1980, 1990, 2000, 2009), include.lowest = TRUE, labels = c("<1940", "1940 - 1959",
"1960 - 1979", "1980 - 1989", "1990 - 1999", "2000 - 2008"))
# Create a simple set of boxplots:
ggplot(aes(x = year_r, y = price), data = geo_big) + geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
ggplot(aes(x = year_r, y = price), data = geo_big) + geom_boxplot() +
scale_y_log10()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
# Create time series. We will do the split-apply-combine strategy again,
# and we can reuse our function (yay!!!):
#   year_med <- ddply(geo_big, .(year_r, city, date), agg_fun)
# That took too long for a lab on some older machines, so we will
# collapse date to month instead. I wasn't sure how to do this, so I went
# to Stack Overflow, searched on "[r] year month Date", and immediately
# found an answer!
geo_big$month <- as.Date(paste0(strftime(geo_big$date, format = "%Y-%m"),
"-01"))
# This extracts the year and month, and then sets the day to 01 (the
# first day of each month).
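# For example, a sale dated 2008-02-13 becomes 2008-02-01:
as.Date(paste0(strftime(as.Date("2008-02-13"), format = "%Y-%m"), "-01"))
## [1] "2008-02-01"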
year_med <- ddply(geo_big, .(year_r, city, month), agg_fun)
ggplot(aes(x = month, y = med), data = year_med) + geom_line(aes(group = city),
alpha = 0.1) + facet_wrap(~year_r) + scale_y_log10()
## Warning: Removed 2 rows containing missing values (geom_path).
# I don't want to plot the NA facet:
ggplot(aes(x = month, y = med), data = subset(year_med, !is.na(year_r))) +
geom_line(aes(group = city), alpha = 0.1) + facet_wrap(~year_r) + scale_y_log10()
Here, I will map the median house price. Rather than mapping the median house price per city, I'll divide the region into 0.05 x 0.05 degree grid cells and calculate the median price per grid cell.
# Create a tile plot, with tiles of size 0.05 degrees:
geo$lat2 <- round(2 * geo$lat, digits = 1)/2
geo$long2 <- round(2 * geo$long, digits = 1)/2
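# e.g. round(2 * 37.774, digits = 1)/2 is 37.75: each coordinate snaps to
# the nearest 0.05 degrees.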
geo_plot <- ddply(geo, .(lat2, long2), agg_fun)
# Set the median to NA for tiles with too few sales:
geo_plot[geo_plot$n < 10, "med"] <- NA
# Cut the median price/cell into a categorical variable:
geo_plot$med_r <- cut_number(geo_plot$med, 7)
ggplot(aes(x = long2, y = lat2), data = geo_plot) + geom_tile(aes(fill = med_r)) +
coord_map() + scale_fill_brewer("Median Price", palette = "YlGnBu")
In this section, I will produce a time series plot of the housing prices, and overlay lines representing the .02, .1, .25, .5, .75, .9, and .98 quantiles. Hopefully, this will help to show how the distribution of housing prices changed over time.
# See how the quantile function works:
quantile(geo_big$price, c(0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98),
na.rm = TRUE)
## 2% 10% 25% 50% 75% 90% 98%
## 217000 315000 410000 550000 720000 940000 1500000
We'll use the split-apply-combine strategy again. We'll need to make the quantiles into a data.frame.
# This was my first try: it didn't work as expected
as.data.frame(quantile(geo_big$price, c(0.02, 0.1, 0.25, 0.5, 0.75,
0.9, 0.98), na.rm = TRUE))
## quantile(geo_big$price, c(0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98), na.rm = TRUE)
## 2% 217000
## 10% 315000
## 25% 410000
## 50% 550000
## 75% 720000
## 90% 940000
## 98% 1500000
# The data.frame was sideways. The transpose function t() fixes it.
as.data.frame(t(quantile(geo_big$price, c(0.02, 0.1, 0.25, 0.5, 0.75,
0.9, 0.98), na.rm = TRUE)))
## 2% 10% 25% 50% 75% 90% 98%
## 1 217000 315000 410000 550000 720000 940000 1500000
# There, that worked. Now we can write a new function to use this.
# A function that takes a data.frame with a price column, and returns the
# quantiles of price as a one-row data.frame:
quant_fun <- function(df) {
quants <- quantile(df$price, c(0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98), na.rm = TRUE)
return(as.data.frame(t(quants)))
}
new_dat <- ddply(geo_big, .(date), quant_fun)
head(new_dat)
# I want to reshape this data.frame. The melt() function in the reshape
# package does this:
require(reshape)
## Loading required package: reshape
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
##
## rename, round_any
new_dat <- melt(new_dat, id.vars = "date")
head(new_dat)
ggplot(aes(x = date, y = value, group = variable), data = new_dat) +
geom_line() + scale_y_log10()
That worked. But now I want to show how to overlay these quantile lines on the actual data points. I can create both geom_line() and geom_point() geometries. But since the x and y variables have different names in the two datasets, there is no longer a simple default. I have to move the aes() call out of the ggplot() function and into each of the geometry functions. The following code does that.
ggplot() + geom_line(aes(x = date, y = value, group = variable),
data = new_dat) + scale_y_log10() + geom_point(aes(x = date, y = price),
data = geo_big, alpha = 0.05, size = 0.75)
It appears from these plots that the highest valued homes lost some value during the crash, but didn't fall below their 2002 levels. This is in stark contrast to the lowest valued homes, which crashed to much lower values than their 2002 levels.
This homework is very open-ended, and will require some problem solving, some trial and error in R, and a little creativity.
Compare the performance of high- and low-value houses in high- and low-value cities during the housing crisis. Across the entire region, low-value homes seemed to lose more of their value than high-value homes. Is this consistent across communities?
In order to do this, you will need to determine a sensible way to categorize the communities. Functions you may find especially helpful are cut(), subset(), table(), and ddply().
Write a brief report explaining your findings: no more than one page of text. Include as many graphics as necessary to illustrate your point.
Packages used in this lab: ggplot2, plyr, reshape
Functions used in this lab: merge(), subset(), table(), as.data.frame(), plyr::ddply()