There are a variety of ways to use GIS to abstract the world.
Vector data is when points, lines, and polygons are used to condense the natural and built world into geometric shapes. However, it is important to note that any given real-world object can be represented by a GIS (geographic information system) in multiple ways- the location of a cabin might be represented by a point, showing its location- or by a square showing the footprint of the structure itself. Rivers and lakes can be abstracted to lines and polygons, but given that the width of a river might also be of importance, rivers might also be represented as polygons. The level of detail describing the line or polygon also influences the precision with which you can make inferences about shape, size, and relationships between data (is your lake represented using a clean oval, or exploring every nook and cranny of the wetlands?). Finding areas of similarity, or connecting the dots between raster data objects often is one way of condensing and simplifying raster data into patches or lines from a surface of points.
Raster data is when a network or mesh of evenly spaced points are used to create a “surface” that describes some attribute of that space. Remotely sensed data using satellites to take primary data like land cover and climate data are often represented this way. They are mapped such that the center of the points form the center of a graph-paper like surface, where the value in the center of the point applies to the entire graph square. This creates the appearance of a surface of squares. The size of these squares (the amount of space each data point is assumed to represent) is refered to as the resolution- if the squares are 5km by 5km, the raster is said to be at a 5km resolution. However, even this is an abstraction-the points are usually sampled such that they are evenly spaced in latitude and longitude, but these “squares” can look more irregular in some areas (such as at the poles) depending on what projection you are using. Alternately, these squares can be created such that they are evenly spaced everywhere within that projection. These spatial data often carry file extensions like .TIFF, .JPEG, although different software packages and governemnt agencies sometimes have their own proprietary formats.
Both vector data and raster data can store information that corresponds to the data’s shapes, points, or pixels. This data can be already in the data format (within the “attribute table” of an ESRI shapefile, or as “bands” in remotely sensed orthoimagery). We can also think flexibly about how we relate the geometry to the data– with the use of a primary key, we can keep the data “on the side”, and only add it on when it’s necessary for analysis or visualization. That way, the spatial data formats with lots of detailed georaphy stay “lightweight”.
First, we load in the relevant libraries, and the data sets. For the purposes of this example, we have a simulated data set with observations for multiple variables in each county in the United States in multiple years. We also have a R SpatialPolygonsDataFrame of counties in the US. If you need help creating one of these, please see https://rpubs.com/BeccaStubbs/bringing_shapefiles_into_R
library(woodson) # plots data, has other libraries like data.table as dependencies, so they get loaded as well.
library(sp) # Classes and Methods for Spatial Data
library(rgdal)
rm(list=ls()) # clearing the work space of anything in memory
# Loading data
#~~~~~~~~~~~~~~~~~~~
# Simulated, county-level data.table: called "simulated_us_data"
load(paste0(getwd(),"/crude_example_data/simulated_us_data.rdata"))
# county-level Spatial Polygons object: called "mcnty_map"
load(paste0(getwd(),"/crude_example_data/mcnty_mapping_shape_file.rdata"))
We can take a look at these data:
head(simulated_us_data)
## mcnty year var1 var2 var3
## 1: 0 2000 0.4093835 0.284472722 0.8993208
## 2: 0 2001 0.6903311 0.021357884 0.8170376
## 3: 0 2002 1.0628828 -0.009044333 0.7209044
## 4: 0 2003 1.4611360 -0.159716437 0.6412607
## 5: 0 2004 1.9105751 -0.328401417 0.5625853
## 6: 0 2005 2.4323515 -0.359534092 0.4695426
plot(mcnty_map)
We also should take a look at the attributes of the mcnty_map data:
head(mcnty_map@data)
## state state_name mcnty
## 1: 1 Alabama 0
## 2: 1 Alabama 1
## 3: 1 Alabama 2
## 4: 1 Alabama 3
## 5: 1 Alabama 4
## 6: 1 Alabama 5
To generate a raster data set from a polygon data set, we need to generate a grid of points that will serve as the centroids of our pixels.
To do this, we first need to figure out the bounding box of the data– the spatial extent of the polygon layer in x and y coordinate minimums and maximums. If you need a raster even larger than the extent of your polygon layer, feel free to manually define your minimum and maximum x and y coordinates. This may be helpful if you have other data that has the top-left point in a certain location, and you want to be sure that your new raster data lines up perfectly with the old.
box<-data.table(dim=c("x","y"),bbox(mcnty_map))
box
## dim min max
## 1: x -2500000 2516373.8
## 2: y -2500000 732103.3
Now that we have the spatial extent of the raster layer we need to create (it will be a rectangle such that all parts of the polygon will be included), we can generate a mesh or grid of points covering this space in equal intervals. There are two ways conceptually to do this:
1.) We want a certain number of points in either the x or y direction, so we sample such that the distance between the points is always equal.
# Now, we find the maximum range, and say that we want 1,000 points
# between that range. "Steps" will be the interval we use to interpolate
# between the points.
box[,range:=max-min]
## dim min max range
## 1: x -2500000 2516373.8 5016374
## 2: y -2500000 732103.3 3232103
steps<-max(box$range)/1000
2.) We want a grid with meaningful, specific grid sizes. For this strategy, we look in the proj4string of our polygon object (it’s projection) to find out what the “units” are. In this case, we have units=m, which means that all of our geographic points are relating to eachother using meters as the unit of measurement. So, if we wanted a 5 km grid, we would set steps=5000, to generate pixels 5km by 5km.
proj4string(mcnty_map)
## [1] "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs"
# Now, we are setting the pixel size to be 5k x 5k
steps<-5000
Now, we sample a grid of those points, by interpolating between the minimum and maximum of the x and y boundaries, using the steps we’ve just discovered. Next, we can use expand.grid, a function in the base package that generates all the combinations of two columns, which will create
interpolated_x_coordinates<-seq(from = box[dim=="x"]$min, to = box[dim=="x"]$max, by = steps)
interpolated_y_coordinates<-seq(from =box[dim=="y"]$min, box[dim=="y"]$max, by = steps)
# This now will have a list of x and y coordinates for the full 5x5 k sampled grid
sampled_points<-expand.grid(x = interpolated_x_coordinates,
y = interpolated_y_coordinates)
head(sampled_points)
## x y
## 1 -2500000 -2500000
## 2 -2495000 -2500000
## 3 -2490000 -2500000
## 4 -2485000 -2500000
## 5 -2480000 -2500000
## 6 -2475000 -2500000
n_points<-nrow(sampled_points)
Note that the increments for x and y coordinates are going up by 5000 meters.
Now, we will build a SpatialPointsDataFrame using these coordinates.
At this point, we have a list of coordinates, but x and y values alone don’t work with spatial methods. To manipulate and use this data, we need to generate a spatialPointsDataFrame.
Right now, the object “sampled_points” is a data.frame– we can check this by querying for its class:
class(sampled_points)
## [1] "data.frame"
To do this, we first need to explicitly define the fact that two fo the columns in this data.frame are actually coordinate values. To do this, we designate the x and y columns as coordinates, and thus change the class of the sampled_points object:
coordinates(sampled_points) <- ~x + y
class(sampled_points)
## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"
At this point, R knows that sampled_points is a SpatialPoints object. To make this a SpatialPointsdataFrame object, we need to add some attribute data– we do this by passing in a data.table of a sequential ID variable that we can use to keep track of our points down the line. We also tell it what projection to use– since we created our mesh of points based on the mcnty_map layer, we need to apply the same projection here as well.
sampled_points <- sp::SpatialPointsDataFrame(coords=sampled_points,
data=data.table(point=seq(1,n_points)))
# setting the proj4string (the projection information) of this layer to be the
# same as the mcnty_map.
proj4string(sampled_points)<-proj4string(mcnty_map)
class(sampled_points)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
Excellent- we now have a SpatialPointsDataFrame. At this point, we can “sample” the polygons using our regular, 5km points. This operation is essentially a spatial join– the points that you overlay on top of the polygons inherit the data attributes of the polygons below. Once we have this information, we can apply the polygon data to the points by assigning the information from the overlay into the @data slot of the sampled_points layer.
overlay<-sp::over(sampled_points,mcnty_map) # overlaying the sampled points onto the mcnty map
sampled_points@data<-overlay # making a full-on data.table for the spatial object
At this point, we also would like the coordinates (x and y) for our mesh to be saved somewhere more convenient than in the @coords slot of this data object. As such, we can add it back into the data by saving a new data.table with the coordinate information. Given that these coordinates are in the albers equal area projection, I am designating column names that reflect that projection, in case this data becomes separate from the spatial projection information later on. We add the coordinate data back into the data.table using a cbind() operation. We also can save the data as a simple data.table, separate from the Spatial geometry.
albers_coords<-data.table(sampled_points@coords)[,list(x_albers=x,y_albers=y)]
sampled_points@data<-cbind(sampled_points@data,albers_coords)
points_dt<-copy(sampled_points@data)
At this point, we have a raster data format that we can use to map data. Let’s compare this to the polygon version.
Let’s make a chloropleth map using the Woodson library, using the polygon geometry as the vehicle for the data:
wmap(chloropleth_map = mcnty_map,
data=simulated_us_data[year==2000],
variable="var1",
geog_id="mcnty")
Now, let’s try plotting some data by merging in the county data with the points– this will be a simple, rough sketch of the same variable using the 5x5 km point data, being represented as points.
points_dt<-merge(simulated_us_data[year==2000],points_dt,by="mcnty")
ggplot(data=points_dt)+geom_point(aes(x=x_albers,y=y_albers,color=var1))
We can also use ggplot()’s “geom_raster” plotting function, which is even faster than geom_tile– it’s designed to take points of equal spacing.
ggplot(data=points_dt)+geom_raster(aes(x=x_albers,y=y_albers,fill=var1))
Or, we can use the Woodson mapping function, which can now map raster data!
wmaprast(geometry=points_dt,
xcol="x_albers",
ycol="y_albers",
geog_id="mcnty",
variable="var1",
map_title="Raster version of Variable 1")
Note that this doesn’t look very different- in part because the sample is so small. Let’s try a similar process, but this time, with much greater grid size.
steps<-50000
sampled_points<-expand.grid(x = seq(from = box[dim=="x"]$min, to = box[dim=="x"]$max, by = steps),
y = seq(from =box[dim=="y"]$min, box[dim=="y"]$max, by = steps))
n_points<-nrow(sampled_points)
# We build a SpatialPointsDataFrame using
coordinates(sampled_points) <- ~x + y
sampled_points <- sp::SpatialPointsDataFrame(sampled_points, data.table(point=seq(1,n_points))) # make spatial data frame
proj4string(sampled_points)<-proj4string(mcnty_map) # defining the same projection
overlay<-sp::over(sampled_points,mcnty_map) # overlaying the sampled points onto the mcnty map
sampled_points@data<-overlay # making afull-on data.table for the spatial object
albers_coords<-data.table(sampled_points@coords)[,list(x_albers=x,y_albers=y)]
sampled_points@data<-cbind(sampled_points@data,albers_coords)
wmaprast(geometry=sampled_points@data,
xcol="x_albers",
ycol="y_albers",
geog_id="mcnty",
data=simulated_us_data[year==2000],
variable="var1",
map_title="Coarse Raster version of Variable 1")
Ah. Impressionist!
At this point you could save your new spatialpoints data frame as a Shapefile, or other vector format:
rgdal::writeOGR(obj=sampled_points, dsn=paste0(getwd(),"/outputs/sampled_points.shp"), layer="sampled_points", driver="ESRI Shapefile")
Alternately, you give R the hint that this is actually a gridded, raster data set, and can switch the spatial format by asserting that gridded()<-TRUE.
gridded(sampled_points)<-TRUE # making R know that this is a grid of points.
class(sampled_points)
## [1] "SpatialPixelsDataFrame"
## attr(,"package")
## [1] "sp"