Andy Stauffer's R-Journal

GEOG 5023: Quantitative Methods In Geography


Purpose

The purpose of this journal is to document all of the tools and functions covered in Geography 5023: Quantitative Methods. To do this, all of the labs, homework, and in-class exercises are posted and discussed here. To keep things organized, use the table of contents below to find the assignment of interest. For each exercise, the functions covered are listed with a very brief description. Each exercise heading is hyperlinked to an external document that explains the functions in more depth and shows a practical, worked example of each. The external documents for the in-class exercises also give a very brief outline of the lecture material covered.

Here is my custom-made, unofficial expectations rubric that lists the in-class exercises. I take no responsibility for its accuracy or completeness. Last updated Apr 22, 2013.

4/23/13 Update: Note that this assignment is now extra credit instead of 10% of the course grade.

Table of Contents

Final Project
     1. Descriptive Geostatistics
Labs (note: external documents crudely converted to rmd files. See the lab write-up for details.)
     1. Linear Regression
     2. Logistic Regression
     3. Spatial Autocorrelation
     4. Point Pattern Analysis
Homework
     1. Calculate Root Mean Square Standard Deviation
     2. Hypothesis Testing
In-Class Exercises
     1. Introduction to R
     2. Review of Key Concepts
     3. Intro to Correlation and Regression
     4. Logistic Regression
     5. Logistic and Poisson Regression
     6. Multiple Regression
     7. Autocorrelation (time series)
     8. Intro to Spatial Autocorrelation (in Geoda; not shown here)
     9a. Spatial Analysis and Regression 1 Exercise 1
     9b. Spatial Analysis and Regression 1 Exercise 2
     10. Spatial Regression 2
     11. Point Pattern Analysis Part 1: Visualization
     12. Point Pattern Analysis Part 2: Hypothesis Testing
Misc. In-Class Exercises
     1. Swain v. Alabama
     2. Maps and Multiple Regression
     3. Using GGPLOT2 for visualization


Final Project

Estimating Attribution Resolution of DEMs with Descriptive Geostatistical Methods

# I'm looking at comparing variograms.  Check out the link for more
# information.
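
As a rough idea of the approach, here is a minimal sketch of comparing empirical variograms from two samples of the same terrain. The data frames dem_fine and dem_coarse and their x, y, and elev columns are hypothetical placeholders, not the actual project data.

library(gstat)
library(sp)
coordinates(dem_fine) <- ~x + y    #promote the hypothetical data frames to spatial points
coordinates(dem_coarse) <- ~x + y
v_fine <- variogram(elev ~ 1, dem_fine)      #empirical variogram of the first sample
v_coarse <- variogram(elev ~ 1, dem_coarse)  #empirical variogram of the second sample
plot(v_fine$dist, v_fine$gamma, type = "b", xlab = "Lag distance", ylab = "Semivariance")
lines(v_coarse$dist, v_coarse$gamma, type = "b", col = "red")  #overlay the second variogram for visual comparison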

Labs

Lab 1: Linear Regression

General functions

read.csv()  #reads a csv file as a table
names()  #gets the column names of a table

dataset[rows, cols]  #indexes a table by row and column
partialData <- dataset[logicalFunction, ]  #uses a logical condition to keep only the rows of interest

par(mfrow = c(2, 2))  #Creates a tiled layout to display multiple plots at the same time.  This configuration shows 4 plots.

I(x^2)  #tells R to treat x^2 literally (as arithmetic) inside a model formula.  Useful for creating quadratic regression terms.

Testing for normality

hist()  #creates a histogram of the data

qqnorm()  #creates a qq plot
qqline()  #plots the line of normality on a qq plot

shapiro.test()  #uses the Shapiro-Wilk test for normality; null hypothesis = the data is normally distributed

Comparing datasets

# Normally we could use a t-test.  But the data has to be normally
# distributed.  In this lab the data wasn't normally distributed, so we
# must use nonparametric tests.

wilcox.test(x, y, paired = T)  #Compares two datasets that do not follow normal distributions.  The null hypothesis is that there is no difference between the two datasets.

Creating and assessing linear regression models

lm(D.V. ~ I.V., dataset)  #creates a linear regression model.
plot(lm)  #plots the regression diagnostics, including the residuals.  This lets us check whether the residuals are roughly normally distributed; if not, the model is suspect.

summary(lm)  #gives p-values to assess model parameters.  It also gives us an R^2 fit statistic

anova(lm1, lm2)  #Compares 2 regression models using an F-test where the null hypothesis is that there is no difference between the two models.

Lab 2: Logistic Regression

Creating dummy variables

dataset$newField <- ifelse(dataset$categoricalField == XXXX, c(1), c(0))  #creates a dummy variable manually using the ifelse function.  XXXX would be a value within the categorical field; records matching it get a 1 and all others get a 0.

factor(dataset$categoricalField)  #automatically creates dummy variables using the factor() function.

# When doing a regression model using factors, one must always be dropped.
# It is easy to do this when creating dummy variables manually by simply
# not including a variable, but an extra step is required if the factor()
# function is done:
relevel(factor(dataset$categoricalField), "XXXX")  #Relevel sets the reference (omitted) dummy.  Specify which category to drop by placing its name inside the quotes.

# Plotting categorical data is sometimes a useful visualization:
plot(factor(dataset$categoricalField1) ~ factor(dataset$categoricalField2))  #creates a spine plot showing how the two factors relate.  See the lab for the actual plot.

Creating a logistic model

glm(D.V. ~ relevel(factor(I.V.), "OmittedDummy"), family = binomial, data = dataset)  #creates a generalized linear model - logistic regression

# The raw coefficients from summary(glm) are on the log-odds scale and are hard
# to interpret directly, so we exponentiate them to get odds ratios:
exp(cbind(OR = coef(glm), confint(glm)))


# Comparing 2 logistic models:
anova(glm_full, glm_reduced, test = "LRT")  #Calculates a Likelihood Ratio Test to compare the full and reduced logistic models.

anova(glm_full, glm_reduced, test = "Chisq")  #Calculates a Chi-Squared Test to compare the full and reduced logistic models.

Assessing Multicollinearity

vif(glm)  #Calculates the variance inflation factor.  See the in-class exercises for more explanation about this.

mean(vif(glm))  #The average VIF.  Values over 1 indicate some degree of multicollinearity.

Lab 3: Spatial Autocorrelation

Note: This lab takes a very long time to run. Beware if you are interested in rerunning it.

Create a weights matrix

# Contiguity weights matrices:
poly2nb(dataset)  #queen's contiguity
poly2nb(dataset, queen = F)  #rook's contiguity

# Fixed distance based weights matrix:
coords <- coordinates(dataset)  #gets the centroids of polygons.  This is needed to assess distance
FIPS <- row.names(as(dataset, "data.frame"))  #Creates a unique identifier for each point
dnearneigh(coords, d1 = 0, d2 = FixedDistance, row.names = FIPS)  #Creates a distance-based weights matrix

# Create a k-nearest neighbor weights matrix
library(RANN)  #Provides the fast nearest-neighbour search that knearneigh() can use
coords <- coordinates(dataset)  #gets the centroids of polygons.  This is needed to assess the nearest centroids
FIPS <- row.names(as(dataset, "data.frame"))  #Creates a unique identifier for each point
knn2nb(knearneigh(coords, k = NumberOfNeighbors), row.names = FIPS)  #Creates k-nearest neighbors using the nearest k neighbors specified.

# Row Standardized weights Matrix
nb2listw(WEIGHTS, style = "W", zero.policy = T)  #row standardized weights matrix.  The zero.policy=T allows for islands to be in the dataset.

Assessing Autocorrelation - Converting an OLS to SAR

moran.test(dataset$attribute, listw = WEIGHTS, zero.policy = T)  #Calculates the Moran's I of a variable using a weights matrix.

ols <- lm(D.V. ~ I.V., data = dataset)  #Creates a general linear model (OLS)
shapiro.test(ols$residuals)  #Are the residuals normally distributed?  They should be.
lm.morantest(ols, nbqw, zero.policy = T)  #tests the autocorrelation in the OLS model.  The null hypothesis is that there is no autocorrelation, or the Moran's I value is 0.

sar <- lagsarlm(D.V. ~ I.V., data = dataset, nbqw, zero.policy = T)  #Creates a spatial lag model.  This helps to remove autocorrelation in the OLS.
summary(sar)  # The Lagrange Multiplier test tells if there is any residual autocorrelation left in the SAR model.  If there is, other steps need to be taken to remove it.
bptest.sarlm(sar)  #Breusch-Pagan test for heteroscedasticity.  The null hypothesis is that the residuals are homoscedastic, which is good.  Heteroscedasticity can also point to autocorrelation, or at least that the residual variance depends on X (it shouldn't).

anova(sar, ols)  #Does a Likelihood Ratio Test to see if incorporating autocorrelation into the SAR model improves the model fit.

Mapping and Understanding Spillover Effects

edit <- dataset  #copy the data so we don't corrupt it.
edit@data[edit@data$attribute == "LogicalTest", "AttributeToBeEdited"] <- "EditedValue"  #edit the data

# Step 1: Predict original and edited data based on the SAR model...
orig_predict <- as.data.frame(predict(sar))  #predict a variable using the original dataset
edit_predict <- as.data.frame(predict(sar, newdata = edit, listw = nbqw, zero.policy = T))  #predict the same variable after editing one of the I.V. values.
Effects <- edit_predict$fit - orig_predict$fit  #calculate the change in the predicted values
edit$EDIT_ATTR <- Effects  #append it to the edited dataset so we can map it.

# Step 2: Make a map...
library(RColorBrewer)
library(classInt)
colorPal <- brewer.pal(5, "PuOr")
cats <- classIntervals(edit$EDIT_ATTR, n = 4, style = "jenks")  #Create class breaks.  Note that n is one less than the number of colors selected!
colors <- findColours(cats, colorPal)  #Tie the class breaks to the colors

plot(edit, col = colors)  #This actually creates the map!
legend("bottomright", legend = round(cats$brks, 2), fill = colorPal, bty = "n")
mtext("Map Title", side = 3, line = 1)

Lab 4: Point Pattern Analysis

Making a ppp object and doing simple visualizations:

convexhull.xy(x = points_df$Easting, y = points_df$Northing)  #This is a convex hull of the AOI and is used as the owin window
ppp(x = points_df$Easting, y = points_df$Northing, window = convexhull, marks = points_marks)  #ppp object.  points_marks is a table of the points df

plot(ppp)  #plots the point pattern; if marks are set, they distinguish the event types
plot(density(ppp, sigma = 500))  #kernel density map

Comparing data to CSR using hypothesis testing

quadrat.test(deaths_ppp, nx = 5, ny = 5)  #quadrat test.  The null hypothesis is that the data was created by a CSR process
plot(envelope(ppp, Kest, nsim = 99, correction = "border"))  #K-function.  Note that the PPP cannot have marks set.

Examine the distribution of points

plot(envelope(ppp, Kcross, i = "event type 1", j = "event type 2", nsim = 99, 
    correction = "border"))  #Cross K.  The ppp must have a marks object that ONLY contains the attribute that distinguishes the event type.  This is easier than the alltypes() function because the user doesn't have to compute cross k for every event type.

ktype1 <- Kest(ppp[marks$Type == "event type 1", ], correction = "best")
ktype2 <- Kest(ppp[marks$Type == "event type 2", ], correction = "best")
KdiffCEEA <- ktype1 - ktype2  # Difference in K Functions.  This examines the similarity in spatial patterns of two event types.

Create a point process model

# Prepare data for input into a ppm:
eventType1_df <- points_df[points_df$Type == "event type 1", ]
eventType1_ppp <- ppp(x = eventType1_df$Easting, y = eventType1_df$Northing, 
    window = convexHull)

eventType2_df <- points_df[points_df$Type == "event type 2", ]
eventType2_ppp <- ppp(x = eventType2_df$Easting, y = eventType2_df$Northing, 
    window = convexHull)
et2_den <- density(eventType2_ppp)  #Density Surface used to predict event type 1

# Create the PPM object
ppm(eventType1_ppp, ~et2_den, covariates = list(et2_den = et2_den))  #point process model
ppm_sim <- simulate(ppm)  #Simulates CSR with a variable lambda intensity, not needed to assess the model, but a nice visual.

envelope(ppm, Kest, nsim = 99, correction = "border", verbose = F)  #K function.  If the model is good, the K-function should look like a CSR process (i.e. we were able to explain the distribution of the data).  More usefully, this helps us judge whether the pattern really is CSR once an underlying covariate is accounted for (e.g. trees on high vs. low slopes).

Homework

Homework 1: Calculate the Root Mean Square (RMS) Standard Deviation

Part of homework 1 was to create a function to calculate RMS Standard Deviation. Using RMarkdown, we are able to code LaTeX Equations such as: \[ s_N = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \overline{x})^2} \]

Here is how the above formula was coded:

$$ s_N = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \overline{x})^2} $$

Creating a function:

functionName <- function(requiredData) {
  # ...does stuff...
  # ...it can be multiple lines
  return(result)
}
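
As a concrete (hedged) example, here is one way the population-form RMS standard deviation from the formula above could be implemented; the function name rmsSD is just illustrative:

rmsSD <- function(x) {
  N <- length(x)
  sqrt(sum((x - mean(x))^2) / N)  #divide by N (not N - 1), then take the square root
}
rmsSD(c(2, 4, 4, 4, 5, 5, 7, 9))  #returns 2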

Homework 2: Hypothesis Testing and Correlation (new material)

Creating graphs:

plot(dataset ~ dataset)
# OR
plot(dataset$Y ~ dataset$X)  #This is probably the better way to do it...

Graphs can also be customized on the screen:

myLayout <- layout(mat = matrix(c(1, 2), 1, 2), respect = TRUE)
plot(DatasetA)
plot(DatasetB)
# The result will be the two plots tiled side-by-side.  This can be useful
# when looking at summaries of multiple plots or comparing plots.
# However, the sizes are greatly reduced.

Common functions:

grep()  #searches for a pattern in a character vector (not just tables)
summary()  #returns some descriptive statistics for a list or table
is.na()  #logical test for missing (NA) values
ifelse()  #a vectorized if-else statement

Working with Tables:

dim()  #returns the number of rows and columns in a table
as.factor(table$heading)  #converts a column to a factor (categorical variable)
list_subselection <- list_original[rows, cols]  #slices through a table
# e.g. myNewList <- oldList[oldList$colA < 20, ] This will return all the
# records in the table where column A is less than 20.
head(table)  #will return all of the column headings and the first few lines of a table.

Hypothesis Testing

# State null hypothesis (the two groups are the same)
t.test(x, y)  #calculates a t-test which compares group X and group Y.
# Afterwards, state the p-value and whether you fail to reject (p-value > 0.05) or
# reject (p-value < 0.05) the null hypothesis

Understanding Correlation and Correlation Testing

# State null hypothesis (the correlation is 0)
cor.test(x, y, method)  #Runs a correlation test on the two variables
# Afterwards, state the p-value and whether you fail to reject (p-value > 0.05) or
# reject (p-value < 0.05) the null hypothesis

# Possible Methods:
cor.test(x, y, method = "pearson")  #DEFAULT
cor.test(x, y, method = "kendall")
cor.test(x, y, method = "spearman")

In-Class Exercises

1. Lecture 2: Introduction to R

Working with external datasets

read.csv()  #Import csv file
names()  #gets the column names of a table
length()  #returns the number of elements in an object (for a data frame, the number of columns)

Plotting Data

hist()  #creates a histogram of a field

boxplot()  #creates a boxplot for a given field
boxplot(x ~ y)  #will create multiple box plots: one box of x for each level of y

plot(x, y)  #can create a scatter plot

# Extra formatting can also be applied:
plot(x, y, main = "Main Title", xlab = "x-axis label", ylab = "y-axis label")

# Not shown, but useful:
par(mfrow = c(2, 2))  #Will divide the plots into a 2x2 frame, showing 4 plots at once.

2. Lecture 3: Review of Key Concepts

General Functions

summary()  #returns summary information of an object.

Descriptive Statistics

mean()  #gets the mean
# The worked example shows how to do this manually, instead of using an
# out-of-the box function

sd()  #gets the standard deviation
# homework 1 calculates this manually.

Assessing Normality

# Statistical Methods:
shapiro.test()  #null hypothesis is that the data is normally distributed
ks.test()  #Kolmogorov-Smirnov test; compares the data to a reference distribution (couldn't get it to work in the example)

# Graphical Methods:
qqnorm()  #Quantile-Quantile Plot.  The points must fall along a straight line for the data to be normally distributed
qqline()  #plots the line for normality to the QQ Plot

# The following have been done before, but we can make more edits to the
# graphical display, like color and number of bins.
hist()  #Histograms
# We can add more detail to the histogram:
hist(x, col = "red", breaks = seq(20000, 110000, 5000))

boxplot()  #boxplots

3. Lecture 4: Intro to Correlation and Regression

Creating and testing linear models

lm(D.V. ~ I.V.)  #Creates a general linear model
summary(lmObject)  #Returns a t-test of the model parameter; returns the R^2 (coefficient of determination).  The null hypothesis for a univariate model's t-test is the coefficient is 0. (nothing explained by the model)

anova(lmObject)  #Uses an F-test to identify if the model is significant.  The null hypothesis is that the variance of the data described by the model is due to random chance.

Data Transformations

# When the data is not homoscedastic, we have the option to transform the
# data.  Tukey's Ladder of Power gives us some idea of how to transform,
# but it takes a while to process...
ladder(D.V. ~ I.V.)  #Tukey's Ladder of Powers Graphic.

# This is how we can transform the data:
lm(log(D.V.) ~ I(I.V.^0.5))  #natural log of the D.V. and square root of the I.V. (note the I() wrapper so ^ is treated as arithmetic).  This makes interpreting model outputs somewhat tricky though!

Intro to Multiple Regression

lm(D.V. ~ I.V.1 + I.V.2)  #a 2 variable multiple regression

# We can use R to help us find the 'best' multivariate model (regsubsets() comes
# from the leaps library)
models <- regsubsets(D.V. ~ I.V.1 + I.V.2 + ...)  #Finds the best subset of predictors for each model size
plot(models)  #we can display this graphically to see how the variables change as we increase the number of variables

4. Logistic Regression

Logistic model creation

glm(D.V. ~ I.V.1 + I.V.2 + ..., family = binomial)  #creates a generalized linear model.  The family determines which link function to use; binomial uses the logit link

# Because the outcome of a logistic model is either a 1 or a 0, it can be
# difficult to see the plotted results.  To compensate, we can add noise
# to the plot:
plot(jitter(D.V.) ~ I.V.)  #jitter staggers the plotted data, adding noise.

# We can also see how the likelihood of a 1 occurring changes with the
# I.V. :
plot(I.V., fitted(glm))

Interpreting glm model outputs: Using the link function

# Odds Ratio
exp(confint(glm))  #exponentiates the coefficient confidence intervals to put them on the odds-ratio scale.  Values greater than 1 increase the odds of a 1 occurring; values less than 1 decrease them

anova(glm, test = "LRT")  #ANOVA's Likelihood Ratio Test.  Null hypothesis is that the model parameters do not change the likelihood of the model returning a 1.  Can also be used to compare full and reduced models.

General to-dos

# parse through data using ifelse()
ifelse(logicStatement, c(TrueValue), c(FalseValue))  #ifelse statement to select data.  The false value could also be a nested ifelse statement

5. Logistic and Poisson Regression

Using data stored in R (example data)

library(faraway)  #install the library that contains the data
data(datasetName)  #use the data() function to load a dataset (e.g. gala) shipped with the package.  data(package = "faraway") lists the datasets a library includes.

Create a Poisson Model

glm(D.V. ~ I.V., family = poisson)  #same setup as a logistic model, but with a different family.  The poisson family uses a log link by default

# Checking for overdispersion:
plot(log(fitted(glm)), log((dataset$D.V. - fitted(glm))^2))  #plot the data against its residuals
abline(0, 1)

hist(predict(glm))  #Are the fitted values following a Poisson distribution?

glm(D.V. ~ I.V., family = quasipoisson)  #the quasipoisson family accounts for overdispersion, generally returning more reliable standard errors

# We can also fit a negative binomial model (glm.nb() is in the MASS library):
glm.nb(D.V. ~ I.V., data = dataset)

General know how

fitted(glm)  #extracts fitted values of a model, the Poisson model here.
# Note: for a glm, fitted() returns values on the response scale, while predict()
# defaults to the link (linear predictor) scale; predict(glm, type = "response")
# matches fitted().

6. Multiple Regression

Reading in data

readShapePoly()  #imports a shapefile (maptools library)
read.dbf()  #imports the attribute table as a non-spatial table (foreign library)

More stuff with RegSubsets

regModels <- regsubsets(D.V. ~ I.V., nbest = XXXX, nvmax = XXXX)  #nbest tells R to save the XXXX best models of each size; nvmax sets the maximum model size (e.g. 20 builds the best 1-variable model through the best 20-variable model)

# Accessing attributes of regsubsets (through its summary)
summary(regModels)$adjr2  #extracts the Adjusted R^2 values for the models created in regsubsets()
summary(regModels)$bic  #extracts the BIC (similar to AIC) - lower values are better

# We can also represent these statistics graphically.  Where the plots
# level out indicate that adding more variables won't add much more detail
# to the predictive power of the model:
plot(1:nvmaxAsNumber, summary(regModels)$adjr2)  #Will plot the calculated adjusted R^2 for each of the models created

# easily finding out the variables used in a given model calculated by
# regsubsets.  Previously, we graphed the output, which can be difficult
# to read.  We can also show the parameters for a model in a tabular
# format:
summary(regModels)$outmat[xxxx, ]  #where xxxx is the row for the n-variable model

Multiple Regression Diagnostics

# Testing for multicolinearity
library(car)
vif(lm)  #Variance inflation factor.  Values around 10 or above are a red flag that there is multicollinearity between one or more variables
mean(vif(lm))  #returns the mean VIF - values over 1 indicate some degree of multicollinearity

# Plot residuals against fitted values
plot(predict(lm), lm$residuals)

# other ways previously covered include:
qqnorm()  #qq plot
shapiro.test()  #Shapiro-Wilk test for normality

7. Autocorrelation (time series)

Testing for autocorrelation

dwtest(lm)  #Durbin-Watson test (lmtest library).  Null hypothesis is that the autocorrelation in the data is 0 (random; not present)

acf(lm$residuals)  #plots the autocorrelation function of the residuals.  Remember: residuals must be randomly and normally distributed around 0.
acfObject$acf[lag + 1]  #returns the autocorrelation (rho) for the provided lag, where acfObject is the stored result of acf()

Create a time series function to help cope with autocorrelation

ts(D.V., start = min(dataset$TimeAttribute), frequency = Periodicity)  #time series function
plot(decompose(ts))  #produces 4 graphs that break the observed data into a trend line, periodicity effects, and random noise

Time Series Diagnostics

summary(acf())  #returns descriptive stats on the autocorrelation observed in the time series
acfObject$acf[lag + 1]  #returns the autocorrelation (rho) for the provided lag, where acfObject is the stored result of acf()

cochrane.orcutt(lm)$rho  #calculates the autocorrelation factor (rho) of the model (orcutt library)

Working with factors and dummy variables

cycle(ts)  #creates a periodicity factor for a time series
time(ts)  #returns the time index of each observation in the series

factor(cycle(ts))  #converts the periodicity into a factor so it can be used as dummy variables in a regression

8. Intro to Spatial Autocorrelation (in Geoda; not shown here)

We did stuff in Geoda. It was fun.

9a. Spatial Analysis and Regression 1; Exercise 1

Loading Data

# Data can often be exported from R, taking the file extension of .rda.
load("blabla.rda")  #imports the rda file

Weights Matrices

# Create a weights matrix
poly2nb()  #Default is queen contiguity
poly2nb(x, queen = FALSE)  #rook's contiguity

nb2listw()  #row standardized weights matrix

# We can also do distance-based weights and k-nearest neighbors weights...

Moran's I

# Global Moran's I:
moran.mc(dataset$attribute, listw = weights, nsim = 999)  #simulates Moran's I values under random permutations.  We can compare this distribution to our observed value...
hist(moran.mc$res, breaks = 50)  #If the observed Moran's statistic is extreme relative to the simulated values, it is likely not caused by random chance
moran.plot()  #We can also create a Moran's Scatterplot to see the spread of autocorrelation.  See the linked example for more detail about this and how to create one manually to add versatility to the scatterplot.

# LISA
localmoran(dataset$attribute, weights)  #Calculates the Local Indicator of Spatial Autocorrelation.

# See the worked example of how to render the LISA map
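
In lieu of the worked example, here is a minimal sketch of one way a LISA cluster map could be rendered (dataset, attribute, and weights are hypothetical placeholder names; weights is assumed to be a row-standardized listw object):

library(spdep)
lisa <- localmoran(dataset$attribute, weights)  #local Moran's I for every polygon
z <- scale(dataset$attribute)[, 1]              #standardized attribute values
lagz <- lag.listw(weights, z)                   #spatially lagged standardized values
quad <- ifelse(z > 0 & lagz > 0, "High-High",
        ifelse(z < 0 & lagz < 0, "Low-Low",
        ifelse(z > 0 & lagz < 0, "High-Low", "Low-High")))
quad[lisa[, 5] > 0.05] <- "Not Significant"     #mask polygons whose local I is insignificant (column 5 = p-value)
quad <- factor(quad, levels = c("High-High", "High-Low", "Low-High", "Low-Low", "Not Significant"))
plot(dataset, col = c("red", "pink", "lightblue", "blue", "grey90")[quad])  #LISA cluster map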

Other ways to visualize autocorrelation

# Create a correlogram of moran's I
sp.correlogram(weights, dataset$attribute, order = No.Lags, method = "I")  #Creates a correlogram with a standard error bar for a specified number of lags.

# Create a variogram for moran's I
plot(variogram(attribute ~ 1, dataset, cloud = F))  #plots an empirical variogram of the attribute (gstat library)

9b. Spatial Analysis and Regression 1; Exercise 2

Mapping in R

# Check out the worked example here to see how it works.  Also, check out
# the Maps and Multiple Regression 'Other In-Class Exercise' for much more
# detail.

Checking for Autocorrelation

# Autocorrelation in the residuals can bias regression outputs.  After a
# regression, we should make sure there is NO autocorrelation in the
# residuals.  If there is, an ordinary OLS (lm()) model is not
# appropriate.

# moran.mc(lm$resid, weights, nsim=999)#The null hypothesis is that the
# autocorrelation observed in the data is due to random chance.  Rejecting
# the null hypothesis tells us that our model is biased due to
# autocorrelation.  THIS IS WRONG!! NEVER DO THIS!

# Do this instead...

lm.morantest(lm, weights)  #Gives a Moran's I statistic for the regression residuals without running a random simulation (like above).  The simulation approach is inappropriate here because we already have assumptions about how the residuals should behave.  Returns a p-value which determines whether the Moran's statistic is significant.

Creating a spatial autoregressive model

lagsarlm(D.V. ~ I.V., data = dataset, weights)  #this creates a spatial lag model.

# We can compare the spatial model to the non-spatial model:
anova(lagsarlm, lm)  #it is EXTREMELY important that the spatial model is put first!! We can then compare the AIC to identify the better model.

10. Spatial Regression 2

creating a spatial autoregressive model

lagsarlm(D.V. ~ I.V., data = dataset, listw = weights)  #creates a spatial lag model
errorsarlm(D.V. ~ I.V., data = dataset, listw = weights)  #creates a spatial error model

summary(sarlm)  #Gives a TON of information, including the usual coefficient tests, plus information on whether including autocorrelation in the regression model significantly improves its performance.  Several fit statistics are also provided, but anova() should be used to compare them.

anova(sarlm, lm)  #must be ordered this way!  Compares the OLS and SAR models using the Likelihood Ratio Test

Understanding lag effects

impacts(sarlm, weights)  #allows us to see how the effects of the interconnections defined by the weights matrix affect direct and indirect neighbors.  Beware: This takes a long time to run.

# Otherwise, we can use the predict() function on original and altered
# data.  We can then plot a map of the predicted values and how they have
# changed by altering a single value.

General know how…

dataset_new <- dataset  #copy the existing data frame to a new object.  Now, we can make edits without corrupting the original data.

as.data.frame(predict(lm))  #creates a data frame for predicted values.  Now we can visualize the results.

abs()  #calculates the absolute value.

11. Point Pattern Analysis: Visualization

General know how…

sub(dataset$Attribute, pattern = "old string", replacement = "new string")  #a tool to help clean up table data.  It replaces the first occurrence of one string with another (use gsub() to replace all occurrences)

Creating a map using ggmap (google)

SpatialPointsDataFrame(coords, data, proj4string)  #Create a spatial points data frame.  The projection (proj4string) is optional; it defaults to NA

get_map(location = "United States", maptype = "road", zoom = zoomLevel)  #gets a tile from Google Maps.  Pretty cool!
ggmap(get_map(), extent = "panel")  #plots the map
ggmap() + geom_point(data = dataset, aes(x = Longitude, y = Latitude), 
    color = "red")  #puts points on the ggmap
geom_point(...., size = attribute)  #Can also scale the point sizes

Using spatstat to do basic data visualization

# Note: spatstat depends on using a spatial data frame and defining the
# ppp and owin objects.  Please see the worked example for implementing these.

plot.ppp(ppp)  #Creates a dot map of events.  These can also be categorized based on event type.
plot(density(ppp, sigma = DistValue))  #Creates a density image using a kernel.  By default, the kernel is a gaussian.  The spread is defined by the sigma value.
contour(density(ppp, sigma), axes = F)  #Creates a contour map based on the density image
persp(density(ppp, 1000), theta)  #Creates a pretty cool 3-D rendering of the density map.  Theta sets the viewing angle.
plot(ppp, which.marks = "variable", maxsize = radius)  #Creates a proportional circle map.  The max size parameter will likely need to be played with to create a readable map.

plot(split(ppp))
plot(density(split(ppp)))

Editing legends for images (density map)

colors()  #This returns all the colors available in R.  There are a lot...
colorRampPalette(c("Color1", "Color2"))  #Will create a gradation between these two colors
colourmap(colorRampPalette(c("Color1", "Color2"))(NumberOfClasses), range = c(min, max))  #Maps the generated colors onto the range of values they represent.

# Classifying data:
quantile(ppp, probs = (0:3)/3)  #This will create 3 classes using the quantile classification
cut(ppp, breaks = quantile(), labels = c("Label1", "Label2", "Label3"))  #We can also give sensible labels to the value ranges.

quadratcount(X = ppp, tess = quantileMap)
plot(quadratcount())  #plots the number of points falling in each category on the image

Some point pattern exploration/visualization methods…

plot(rhohat(unmark(deaths.ppp), deaths.death.im))  #plots rhohat -> estimates how point intensity varies as a function of the covariate image

nnclean(X = ppp.types$Category, k = numberOfPts)  #Will map the clusters that include k neighbors
plot(split(nnclean()))

12. Point Pattern Analysis: Hypothesis Testing

Testing CSR

rpoispp(lambda, win)  #Creates a random poisson point distribution
quadrat.test(points, nx, ny)  #Chi-squared quadrat test.  The null hypothesis is that the points were formed by a CSR process.

Distances Tests

Gest(points)  #G-Function (Nearest Neighbors)
envelope(points, Kest, nsim = 99, correction = "border")  #K-function (spatial dependence)

ppm(points, ~covariate1 + covariate2, covariates = list(covariate1 = data, covariate2 = data2))  #Creates an inhomogeneous point process model.  Like a regression for point intensity
simulate(ppm(), 1)  #Can simulate random points using the covariate intensity image.  We can then test for CSR assuming some other spatial factor is in the way (A good way to remove an underlying trend surface from an analysis)

Look at multiple events

plot(split(points))  #Like plotting the factor of some variable

alltypes(data, "Kcross", envelope = T)  #Calculates the K-Function for all event types against all other event types (This can take a while to run because it creates so many K-functions)
alltypes(data, "Kdot", envelope = T)  #Creates a Kdot plot.  I'm not sure what this tells us.  it is somewhat confusing...

lK <- localK(data, verbose = F)  #A local K-Function.  This computes a K-function for every point in the dataset, so it takes a very long time to run (i.e. if there are 1000 points in a plot, 1000 K-functions are generated.)
plot(lK, iso1606 ~ r)

Other In-Class Exercises

1. Swain v. Alabama

This was an in-class discussion on day 1 to get us thinking about statistical methods.

2. Maps and Multiple Regression

Using colorBrewer to make a map in R

display.brewer.all()  #shows all color palettes in the library

# Get Colors...
pal <- brewer.pal(XXXX, "ColorPalette")  #Extracts XXXX number of color patches from the given color palette provided by the ColorBrewer library

# Get class range breaks...
cat <- classIntervals(dataset$attribute, n = xxxx, style = "")  #classifies the attribute into n number of classes.
# These are the options for style: 'fixed', 'sd', 'equal', 'pretty',
# 'quantile', 'kmeans', 'hclust', 'bclust', 'fisher', or 'jenks'

findColours(cat, pal)  #links the color brewer palette with the defined class breaks.
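
To finish the map, the found colours just get passed to plot(); here shapefile is a hypothetical SpatialPolygonsDataFrame, following the same pattern as the Lab 3 map:

colors <- findColours(cat, pal)
plot(shapefile, col = colors)  #draws the choropleth map
legend("bottomright", legend = round(cat$brks, 2), fill = pal, bty = "n")  #legend built from the class breaks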

3. Using GGPlot2

General ggplot creation and editing

# We did a lot of stuff using the ggplot2 library.  Please see the linked
# examples for more detail.
ggplot(data, aes(x, y))  #Creates a ggplot object
ggplot + geom_point()  #adds a point layer and draws the plot.  ggplot() needs a data frame to pull the aesthetics from.
ggplot + stat_binhex()  #This will bin the scatterplot into some aggregated shape

ggplot(data, aes(x, y, color = factorAttribute))  #This will plot the ggplot and color the factored points differently

ggplot + facet_grid(. ~ factorAttribute)  #Creates multiple plots; one for each factor variable

Styling a ggplot offers a lot of customization options.

library(ggthemes)  #ggthemes provides various styling options for a ggplot.  The user can also create his/her own.  See the linked page for how to do this.  Using ?theme is also useful in understanding what can be customized.

Maps can also be made using ggplot, but can take a long time to render

fortify(shapefile, region = "UniqueID")

merge(fortify(), shapefile, by.x = "id", by.y = "UniqueID")  #This creates the dataframe that will be used to plot the map

# See the example for combining this to create a ggplot map.  The
# structure is somewhat long.  The neat thing is that the themes (above)
# can also be applied to the maps.
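
For reference, a minimal sketch of how those pieces might fit together (the counties.shp file, the "FIPS" id field, and the POPULATION fill attribute are hypothetical placeholders):

library(ggplot2)
library(maptools)
shp <- readShapePoly("counties.shp")                     #hypothetical shapefile
shp_df <- fortify(shp, region = "FIPS")                  #flatten the polygons into a data frame
shp_df <- merge(shp_df, shp@data, by.x = "id", by.y = "FIPS")
shp_df <- shp_df[order(shp_df$order), ]                  #merge() reorders rows; restore the polygon drawing order
ggplot(shp_df, aes(x = long, y = lat, group = group, fill = POPULATION)) +
  geom_polygon(color = "grey40") +
  coord_equal()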

Andy Stauffer
Last updated Apr 23, 2013