1. Introduction

In this practical you will use regression, probably the standard method for constructing predictive models. You will learn how to set up a regression, consider about some of the key issues in model construction (selection) and develop local, kernel based regression models using GWR (Brunsdon et al 2016; Fotheringham et al 2002).

You will find Chapters 5 and Chapter 8 in An Introduction to R for Spatial Analysis and Mapping by Chris Brunsdon and Lex Comber useful for this practical.

The aim of this practical is to develop your competency and expertise in generating inference from data and to understand some of the limitations of classic statistical inference in the analysis of big data. This practical provides a smooth introduction to the sessions next week on Machine Learning and Outlier detection.

To do this you will use georgia dataset in the GISTools package and the GWmodel package which is curated and supported by the original GWR team. The code below will make sure you have all of the R packages you will need installed. The tests check whether a given package is in the list of installed packages, and installs it if not. Package installation only needs to be done once - not every time you use R - so if you are sure you have all of the packages installed already, you can skip this step.

# set a mirror
options(repos = c(CRAN = "http://cran.rstudio.com"))
# test for package existance and install
if (!is.element("tidyverse", installed.packages()))
    install.packages("tidyverse", dep = T)
if (!is.element("GISTools", installed.packages()))
    install.packages("GISTools", dep = T)
if (!is.element("GWmodel", installed.packages())) 
    install.packages("GWmodel", dep = T)
if (!is.element("rgdal", installed.packages()))
    install.packages("rgdal", dep = T)
if (!is.element("tmap", installed.packages())) 
    install.packages("tmap", dep = T)
if (!is.element("knitr", installed.packages()))
    install.packages("knitr", dep = T)
if (!is.element("car", installed.packages()))
    install.packages("car", dep = T)
if (!is.element("gclus", installed.packages()))
    install.packages("gclus", dep = T)
if (!is.element("kableExtra", installed.packages()))
    install.packages("kableExtra", dep = T)

Once we are sure all the packages are installed, you need to load them into the current session:

# load into the R session
library(tidyverse)
library(GISTools)
library(GWmodel)
library(rgdal)
library(tmap)
library(knitr)
library(car)
library(gclus)
library(kableExtra)

2. Data

You should load the georgia data, examine it and assign as selection of the variables to a data frame df:

library(GISTools)
data(georgia)
# have a look!
class(georgia)
head(georgia@data)
# assign selected variables to df
df <- as_tibble(georgia@data[, c(14,4:9)])
df$MedInc <- df$MedInc/1000

You will see the df object is a tibble format and contains a number of variables for the counties in Georgia from the 1990 census including the percentage of the population in each County that

is Rural (PctRural)
have a college degree (PctBach)
are elderly (PctEld)
that are foreign born (PctFB)
that are classed as being in poverty (PctPov)
that are black (PctBlack) and the median income of the county (MedInc) (in 1000s of dollars)

Below is a quick look at the first few records of the data:

MedInc	PctRural	PctBach	PctEld	PctFB	PctPov	PctBlack
32.152	75.6	8.2	11.43	0.64	19.9	20.76
27.657	100.0	6.4	11.77	1.58	26.0	26.86
29.342	61.7	6.6	11.11	0.27	24.1	15.42
29.610	100.0	9.4	13.17	0.11	24.8	51.67
36.414	42.7	13.3	8.64	1.43	17.5	42.39
41.783	100.0	6.4	11.37	0.34	15.1	3.49

To see this in R, enter

head(df)

Our aim in this practical is to build a model that describes the median income using the factors listed above. We will use this to understand a number of key features in regression.

3. Initial explorations

We should have a look at the data to see what kinds of properties it has - we could use previously introduced functions such as summary to summarize the attributes. You may wish to use this now. But in terms of model selection what we are interested in the degree to which different variables are good predictors of Median Income.

In order to examine the relationships between the variables in df we can plot the continuous data against each other and use the cor function to examine collinearity amongst them:

plot(df, cex = 0.5, col = grey(0.145,alpha=0.5))

Correlations between numeric variables.

Another useful tool here is to show the upper triangle of the scatterplot matrix with smoothed trend lines. These are achieved with lowess curve fits (Cleveland, 1979) - these are smooth (but possibly curved) bivariate trend lines - and provide a good way of judging by eye whether there is colinearity in the predictor variables. Essentially a straight-line shaped trend with not too much scattering of the individual points suggests collinearity might be an issue. When two predictors are correlated it can be difficult to identify whether it is one or the other (or both) that influence the quantity to be predicted. The code below does this. Note that as collinearity amongst the predictors is the key concern here, we focus on these, ignoring medinc for this graphic. The upper.panel=panel.smooth causes the lowess curves to be added.

plot(df[,-1], cex = 0.5, col = grey(0.145,alpha=0.5), upper.panel=panel.smooth)

Visually, it seems that there may be some colinearity between PctPov,PctBlack and PctEld.

A numerical table of these relationships as correlations can be generated with the kable function:

library(knitr)
tab <- round(cor(df), 3)
kable(tab, booktabs = T, format = 'markdown')

	MedInc	PctRural	PctBach	PctEld	PctFB	PctPov	PctBlack
MedInc	1.000	-0.169	0.523	-0.615	0.320	-0.818	-0.536
PctRural	-0.169	1.000	-0.619	0.390	-0.547	0.174	-0.069
PctBach	0.523	-0.619	1.000	-0.458	0.672	-0.402	-0.109
PctEld	-0.615	0.390	-0.458	1.000	-0.483	0.568	0.297
PctFB	0.320	-0.547	0.672	-0.483	1.000	-0.329	-0.112
PctPov	-0.818	0.174	-0.402	0.568	-0.329	1.000	0.736
PctBlack	-0.536	-0.069	-0.109	0.297	-0.112	0.736	1.000

What we are looking for in the plots, and to be confirmed in the table, are correlations that look like straight lines, showing that values in one variable change with changes in another.

A final useful tool here is Catherine Hurley’s gclus package. This is similar to the data frame plotting, but provides tools to reorder the variables so that the most strongly correlated pairs are closer to the diagonal in the plots. It can also colour-code the plots to show whether the correlations are positive (red) negative (blue) or near to zero (yellow). For this plot we focus on the predictoes again. This is shown below

library(gclus)
cor.mat <- cov(df[,-1]) 
cpairs(df[,-1],order.single(cor.mat),dmat.color(cor.mat))

The order of the variables in the diagonal generally puts positively correlated variables near to each other as much as possible. The ordering here is based on a clustering method (single linkage) provided via the order.single functtion in gclus - the colouring is based on correlations (via the dmat.color) function in gclus.

Note, you could also write the table out to a .csv file:

write.csv(tab, file = "tab.csv")

The plot and the table above of the correlations between the variables show that a number of the variables are highly correlated (< -0.8 or > +0.8) with each other and with the response variable MedInc and others less so. This provides us with confidence that we can construct a reasonable model of from this data - at least some of the variables are correlated with MedInc - but are may be needed in interpreting some of the regression parameters due to some collinearity in the predictors.

We are interested in a predictive model of income and it is useful to how that value varies. The code below plots the histogram of the variable MedInc. Note the use of the parameter prob=T and prob = F: if set to TRUE this computes probabilities for each histogram bin - rather than a frequency count. This makes the $y$-axis compatible with a kernel density estimate (a smooth estimate of the probability distribution of $x$) , so the two may be overlaid. Here this is done by using the lines function with a kernel density estimate for the distribution given by density. A rug plot is added to show the exact locations of the values on the $x$-axis.

The graphs were obtained with the following code:

hist(df$MedInc, xlab='Median Income ($1000\'s)', main='Histogram of Median Income', col = 'DarkRed')

hist(df$MedInc, prob=T, xlab='Median Income ($1000\'s)', main='Histogram of Median Income', col='NavyBlue')
lines(density(df$MedInc,na.rm=T),col='salmon',lwd=2)
rug(jitter(df$MedInc))

# qqPlot(df$MedInc,main='Normal QQ plot of Median Income')

You may wish to explore some of the other variables in the same way. For example:

hist(df$PctFB, prob=T, xlab='Proportion of Foreign-Born Population (%)', main='Histogram of % Foreign-Born',col='peru')
lines(density(df$PctFB,na.rm=T),col='olivedrab',lwd=2)
rug(jitter(df$PctFB))

# qqPlot(df$PctBach,main='Normal QQ plot of % with a Degree')

As an aside, you will see R offers quite a range of colours - including a few I had never heared of - ‘peru’ is a colour ?! - these can be helpful in making plots more visible. Entering colors() gives the full list of named R colours - and is used in an addendum at the end of this is session to create a visual guide.

4. Regression Models

So we have some data. We have had a look at some of the variables, seen how some of them relate to the variable we are interested in predicting, MedInc, how they correlate with each other, and perhaps we have even started to think about hypotheses that we might wish to test in our model.

We will start with a hypothesized model of County Median Income in the state of Georgia that associated with rurality (PctRurual), whether people went to university (PctBach), how old the population is (PctEld), the number of people that are foreign born (PctFB), the percentage of the populations classed as being in poverty (PctPov) and being black (PctBlack).

In R we can construct a regression model using the function lm which stands for linear model. This is a standard ordinary regression function, although simple it is very powerful and there are many variants and functions for manipulating lm outputs: remember R was originally a stats package.

The equation for a regression is:

$y_i = \beta_{0} + \sum_{m}^{k=1}\beta_{k}x_{ik} + \epsilon_i \textsf{ for } i \in 1 \cdots n$

where $y_i$ is the response variable for observation $i$, $x_{ik}$ is the value of the $k^{\textsf{th}}$ predictor variable for observation $i$, $m$ is the number of predictor variables, $\beta_{0}$ is the intercept term, $\beta_{k}$ is the regression coefficient for the $k^{th}$ predictor variable, $\epsilon_i$ is the random error term for observation $i$ and $n$ is the number of observations.

In our case, the response variable $y$ is median income (MedInc), and the 6 socio-economic variables PctRural, PctBach, PctEld, PctFB, PctPov, PctBlack are the different predictor variables, the $k$ different $x$’s.

The basic formula in R to specify a linear regression model, lm is as follows:

model <- lm(variable_we_are_trying_to_predict ~ predictor_variable1 +
              predictor_variable2...)

where the ~ sign (tilde) separates the variable we are trying to model (the response variable) from the variables we will use to predict it, the predictor variables. The predictor variables are usually separated by a + sign - meaning that they are linear terms in the model added together.

We can specify our first model.

m <- lm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = df)

And look at its summary attributes

summary(m)

## 
## Call:
## lm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov + 
##     PctBlack, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4203  -2.9897  -0.6163   2.2095  25.8201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.59895    3.10893  16.919  < 2e-16 ***
## PctRural     0.07377    0.02043   3.611 0.000414 ***
## PctBach      0.69726    0.11221   6.214 4.73e-09 ***
## PctEld      -0.78862    0.17979  -4.386 2.14e-05 ***
## PctFB       -1.29030    0.47388  -2.723 0.007229 ** 
## PctPov      -0.95400    0.10459  -9.121 4.19e-16 ***
## PctBlack     0.03313    0.03717   0.891 0.374140    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.155 on 152 degrees of freedom
## Multiple R-squared:  0.7685, Adjusted R-squared:  0.7593 
## F-statistic: 84.09 on 6 and 152 DF,  p-value: < 2.2e-16

There are lots of things to look at in the summary of the model but we will focus on 3 aspects:

First, Significance. If we examine the model we can see that nearly all are significant predictors of Median Income. The level of significance is indicated by the $P$-value in the column $Pr(>|t|)$ and the asterisks $*$. For each coefficient $\beta_k$ these are the significance levels associated with the hypothesis $\textsf{H}_0 : \beta_k = 0$ against the alternative $\textsf{H}_0 : \beta_k \ne 0$ - this corresponds to the probability that one would obtain a $t$-statistic for $\beta_k$ at least as large as the one observed given $\textsf{H}_0$ is true.

It is a measure of how surprising the relationship is and indicative of whether the relationship between $x_k$ and $y$ found in this data could have been found by chance. Very small values suggest that the level of association found here could have come from a random sample of data if $\beta_k$ really is zero. Cutting to the chase, small $P$-values mean strong evidence for a non-zero regression coefficient.

So in this case we can say:

‘changes in PctFB are significantly associated with changes in MedInc at the $0.001$ level’. This means that the association between the predictor and the response variables is not one that would be found by chance in a series of random samples $99.999\%$ of the time. Formally we can reject the Null hypothesis that there is no relationship between changes in MedInc and PctFB
‘changes in PctRural, PctBach, PctEld, and PctPov are significantly associated with changes in MedInc at the at the $<0.001$ level’, which is actually an indicator of less than $0.001$. So were are even more confident that the observed relationship between these variables and MedInc are not by chance and can again confidently reject the Null hypotheses.
PctBlack was not found to be significantly associated with MedInc.

Second, the Coeffeicient Estimates. These are the $\beta_{k}$ values that tell you the rate of change between $x_k$ and $y$ in the column ‘Estimate’ in the summary table. The first of these is the Intercept - this is $\beta_0$ in the regression equation and tells you the value of $y$ if $x_1 \cdots x_k$ are all zero.

Remember regression is like taking lots of $x_k$ and $y$ values and seeing how well they are correlated. The regression fits a line between them. So in the figure below a regression model with just one $x$ is defined, it is summarised, and then the $x$ and $y$ values are plotted and the model of the relationship is plotted.

m2 <- lm(df$MedInc~df$PctBach)
summary(m2)

## 
## Call:
## lm(formula = df$MedInc ~ df$PctBach)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.124  -6.025  -1.905   3.416  39.969 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.5913     1.5476  17.182  < 2e-16 ***
## df$PctBach    0.9643     0.1255   7.684 1.57e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.986 on 157 degrees of freedom
## Multiple R-squared:  0.2733, Adjusted R-squared:  0.2687 
## F-statistic: 59.04 on 1 and 157 DF,  p-value: 1.566e-12

plot(df$PctBach, df$MedInc, xlab = "x = PctBach", ylab = "y = MedInc")
abline(m2, col = "red", lty = 2, lwd = 2)

Here we can see that the intercept of the relationship is 26.6 and the line shows this, and that the rate of change coefficient between MedInc and PctBach, when they are modelled on their own, is 0.964. This means that each increase in 1 of PctBach (each extra 1% of people with a degree) is associated with an increase in 0.964 in income, ie about $964 per annum.

So returning to our model m, we can now see how changes of 1 unit for the different predictor variables are associated with changes in MedInc. The association of PctRural is positive but small, the association of PctBach is reduced due to interactions with other $x$’s but is still positive, the PctEld, PctFB and PctPov variables are strongly negatively associated with MedInc.

So the regression model m gives us an idea of how different factors are associated in different ways with Median Income in the counties of the state of Georgia.

Third the ** $R^2$ ** value. This provides a measure of model fit. They are calculated as the difference between the actual values of $y$ (eg of MedInc) and the values predicted by the model - sometimes written as $\hat{y\ }$ (called ‘$y$-hat’). A good description is here http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit: Generally $R^2$’s of greater than 0.7 indicate a reasonably well fitting model. Some users of this method describe low $R^2$ as ‘bad’ - they are not - it just means you a have a more imprecise model - the predictions are a bit wrong-er! Essentially there is a sliding scale of reliability of the predictions - based on the relative amounts of deterministic and random variability in the system being analysed - and $R^2$ is a measure of this.

Finally, we are also interested in outliers - these are records that the model does not describe very well, either vastly over-predicting Median Income or under-predicting it. When the bivariate regression was considered earlier, these were easy to find, since they were just the points lying a long way from the regression line. However, when there are multiple predictors, relationships cannot be encapsulated in two dimensional space.

For this situation, a useful diagnostic is the studentised residual. For different values of predictors, the expected difference between the actual and fitted $y$ values has a different variance even if the theoretic model gives $\epsilon_i$ a constant variance. This discrepancy is due to issues such as collinear predictors. Standardised variables attempt to compensate for this. Plotting these and looking for notably high or low values is a helpful way of spotting outliers in the regression model.

A simple way of doing this is just to plot the residuals in order:

s.resids <- rstudent(m)
plot(s.resids,type='h')
abline(h=c(-2,2),col='lightskyblue')

Here we see some residuals are relatively high. Normally, one might expect standardised residuals to lie between -2 and 2. Adding lines to this plot helps to identify when this is not the case - as is done above.

Another useful job is to identify which are the outlying cases - here just done as a basic table (via kable)

df_out <- data_frame(Name=georgia$Name,Residual=s.resids)

## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.

df_out <- df_out[abs(df_out$Residual)>2,]
kable(df_out)

Name	Residual
Charlton	-2.248484
Chattahoochee	-2.205354
Clarke	-2.918361
Fayette	2.984996
Forsyth	5.620108
Seminole	2.954353
Wilkinson	-2.307853

Negative values here have a much lower income than the model predicts, positive values suggest it is much higher.

Finally, we need to check another model assumption in lm - that the error term $\epsilon_i$’s are normally distributed. The best way to do this is via a qq-plot which plots the empirical quantiles of a variable whose distribution we are interested in against the theoretical distribution we are testing for. If the variable is from this distribution, the results should yield a straight line - frequently a reference line is provided to test this. Here s.resids is the variable of interest - the studentised residuals, and these are compared to the $t$ distribution. Although the $\epsilon_i$’s have a normal distribution, theory shows that the studentised residuals derived from a sample coming from this model have a $t$ distribution. However, the qqPlot function from the package car does all the work, and requires only the regression model object to create a plot:

qqPlot(m)

This demonstrates that mostly the residuals do conform to this, with a couple of extreme outliers at the higher end - labelled in the plot. Consulting the earlier table suggests these are Fayette and Forsythe Counties.

It is also important to consider spatial patterns in the residuals, from a geographical viewpoint. The code below plots the outliers and their spatial distributions:

# determine residuals
s.resids = rstudent(m)

# 1.5 standard deviations
resid.shades = shading(c(-1.5,1.5),c("indianred1","antiquewhite2","powderblue"))
# define a color scheme
cols = resid.shades$cols[1 + findInterval(s.resids, resid.shades$breaks)]
# set plot paramters
par(mar = c(0,0,2,0))

# have a look at the spatial distribution of outliers
choropleth(georgia,s.resids,resid.shades, main = "Model Outliers")

This session finishes here, but you will need the results of this session later on. To save all of the variables, regression objects and so on from this session, enter

save.image(file='session2.RData')

This saves an image of all of the R objects currently defined in a file called session2.RData - later on you will be able to load this again via the load function.

Addendum - all the named colours of R

As referred to earlier in the text. For those interested the code used is listed below. You’ll need kableExtra installed to get this to work.

library(kableExtra)

### Matrix of rgb values for each named colour
rgbmat <- t(col2rgb(colors()))

### Turned into hex formats
hexes <- apply(rgbmat, 1, function(x) sprintf("#%02X%02X%02X",x[1],x[2],x[3]))

### These will be backgrounds - foreground either black or white depending on lightness of colour
textcol <- ifelse(rowMeans(rgbmat) > 156, 'black', 'white')

### Now create the cells with fg/bg colour spec
cols <- colours()
cols <- cell_spec(cols,background=hexes,color=textcol)

### Make this a grid
colour_grid <- matrix(c(cols,rep("",7)),83,8)
colnames(colour_grid)=rep(" ",8)

### Use kableExtra stuff to create html table
ktab <- kable(colour_grid,align='l',format='html',escape=FALSE)
kable_styling(ktab)


white	darkmagenta	gray14	gray97	grey72	lightgray	orange1	seashell3
aliceblue	darkolivegreen	gray15	gray98	grey73	lightgreen	orange2	seashell4
antiquewhite	darkolivegreen1	gray16	gray99	grey74	lightgrey	orange3	sienna
antiquewhite1	darkolivegreen2	gray17	gray100	grey75	lightpink	orange4	sienna1
antiquewhite2	darkolivegreen3	gray18	green	grey76	lightpink1	orangered	sienna2
antiquewhite3	darkolivegreen4	gray19	green1	grey77	lightpink2	orangered1	sienna3
antiquewhite4	darkorange	gray20	green2	grey78	lightpink3	orangered2	sienna4
aquamarine	darkorange1	gray21	green3	grey79	lightpink4	orangered3	skyblue
aquamarine1	darkorange2	gray22	green4	grey80	lightsalmon	orangered4	skyblue1
aquamarine2	darkorange3	gray23	greenyellow	grey81	lightsalmon1	orchid	skyblue2
aquamarine3	darkorange4	gray24	grey	grey82	lightsalmon2	orchid1	skyblue3
aquamarine4	darkorchid	gray25	grey0	grey83	lightsalmon3	orchid2	skyblue4
azure	darkorchid1	gray26	grey1	grey84	lightsalmon4	orchid3	slateblue
azure1	darkorchid2	gray27	grey2	grey85	lightseagreen	orchid4	slateblue1
azure2	darkorchid3	gray28	grey3	grey86	lightskyblue	palegoldenrod	slateblue2
azure3	darkorchid4	gray29	grey4	grey87	lightskyblue1	palegreen	slateblue3
azure4	darkred	gray30	grey5	grey88	lightskyblue2	palegreen1	slateblue4
beige	darksalmon	gray31	grey6	grey89	lightskyblue3	palegreen2	slategray
bisque	darkseagreen	gray32	grey7	grey90	lightskyblue4	palegreen3	slategray1
bisque1	darkseagreen1	gray33	grey8	grey91	lightslateblue	palegreen4	slategray2
bisque2	darkseagreen2	gray34	grey9	grey92	lightslategray	paleturquoise	slategray3
bisque3	darkseagreen3	gray35	grey10	grey93	lightslategrey	paleturquoise1	slategray4
bisque4	darkseagreen4	gray36	grey11	grey94	lightsteelblue	paleturquoise2	slategrey
black	darkslateblue	gray37	grey12	grey95	lightsteelblue1	paleturquoise3	snow
blanchedalmond	darkslategray	gray38	grey13	grey96	lightsteelblue2	paleturquoise4	snow1
blue	darkslategray1	gray39	grey14	grey97	lightsteelblue3	palevioletred	snow2
blue1	darkslategray2	gray40	grey15	grey98	lightsteelblue4	palevioletred1	snow3
blue2	darkslategray3	gray41	grey16	grey99	lightyellow	palevioletred2	snow4
blue3	darkslategray4	gray42	grey17	grey100	lightyellow1	palevioletred3	springgreen
blue4	darkslategrey	gray43	grey18	honeydew	lightyellow2	palevioletred4	springgreen1
blueviolet	darkturquoise	gray44	grey19	honeydew1	lightyellow3	papayawhip	springgreen2
brown	darkviolet	gray45	grey20	honeydew2	lightyellow4	peachpuff	springgreen3
brown1	deeppink	gray46	grey21	honeydew3	limegreen	peachpuff1	springgreen4
brown2	deeppink1	gray47	grey22	honeydew4	linen	peachpuff2	steelblue
brown3	deeppink2	gray48	grey23	hotpink	magenta	peachpuff3	steelblue1
brown4	deeppink3	gray49	grey24	hotpink1	magenta1	peachpuff4	steelblue2
burlywood	deeppink4	gray50	grey25	hotpink2	magenta2	peru	steelblue3
burlywood1	deepskyblue	gray51	grey26	hotpink3	magenta3	pink	steelblue4
burlywood2	deepskyblue1	gray52	grey27	hotpink4	magenta4	pink1	tan
burlywood3	deepskyblue2	gray53	grey28	indianred	maroon	pink2	tan1
burlywood4	deepskyblue3	gray54	grey29	indianred1	maroon1	pink3	tan2
cadetblue	deepskyblue4	gray55	grey30	indianred2	maroon2	pink4	tan3
cadetblue1	dimgray	gray56	grey31	indianred3	maroon3	plum	tan4
cadetblue2	dimgrey	gray57	grey32	indianred4	maroon4	plum1	thistle
cadetblue3	dodgerblue	gray58	grey33	ivory	mediumaquamarine	plum2	thistle1
cadetblue4	dodgerblue1	gray59	grey34	ivory1	mediumblue	plum3	thistle2
chartreuse	dodgerblue2	gray60	grey35	ivory2	mediumorchid	plum4	thistle3
chartreuse1	dodgerblue3	gray61	grey36	ivory3	mediumorchid1	powderblue	thistle4
chartreuse2	dodgerblue4	gray62	grey37	ivory4	mediumorchid2	purple	tomato
chartreuse3	firebrick	gray63	grey38	khaki	mediumorchid3	purple1	tomato1
chartreuse4	firebrick1	gray64	grey39	khaki1	mediumorchid4	purple2	tomato2
chocolate	firebrick2	gray65	grey40	khaki2	mediumpurple	purple3	tomato3
chocolate1	firebrick3	gray66	grey41	khaki3	mediumpurple1	purple4	tomato4
chocolate2	firebrick4	gray67	grey42	khaki4	mediumpurple2	red	turquoise
chocolate3	floralwhite	gray68	grey43	lavender	mediumpurple3	red1	turquoise1
chocolate4	forestgreen	gray69	grey44	lavenderblush	mediumpurple4	red2	turquoise2
coral	gainsboro	gray70	grey45	lavenderblush1	mediumseagreen	red3	turquoise3
coral1	ghostwhite	gray71	grey46	lavenderblush2	mediumslateblue	red4	turquoise4
coral2	gold	gray72	grey47	lavenderblush3	mediumspringgreen	rosybrown	violet
coral3	gold1	gray73	grey48	lavenderblush4	mediumturquoise	rosybrown1	violetred
coral4	gold2	gray74	grey49	lawngreen	mediumvioletred	rosybrown2	violetred1
cornflowerblue	gold3	gray75	grey50	lemonchiffon	midnightblue	rosybrown3	violetred2
cornsilk	gold4	gray76	grey51	lemonchiffon1	mintcream	rosybrown4	violetred3
cornsilk1	goldenrod	gray77	grey52	lemonchiffon2	mistyrose	royalblue	violetred4
cornsilk2	goldenrod1	gray78	grey53	lemonchiffon3	mistyrose1	royalblue1	wheat
cornsilk3	goldenrod2	gray79	grey54	lemonchiffon4	mistyrose2	royalblue2	wheat1
cornsilk4	goldenrod3	gray80	grey55	lightblue	mistyrose3	royalblue3	wheat2
cyan	goldenrod4	gray81	grey56	lightblue1	mistyrose4	royalblue4	wheat3
cyan1	gray	gray82	grey57	lightblue2	moccasin	saddlebrown	wheat4
cyan2	gray0	gray83	grey58	lightblue3	navajowhite	salmon	whitesmoke
cyan3	gray1	gray84	grey59	lightblue4	navajowhite1	salmon1	yellow
cyan4	gray2	gray85	grey60	lightcoral	navajowhite2	salmon2	yellow1
darkblue	gray3	gray86	grey61	lightcyan	navajowhite3	salmon3	yellow2
darkcyan	gray4	gray87	grey62	lightcyan1	navajowhite4	salmon4	yellow3
darkgoldenrod	gray5	gray88	grey63	lightcyan2	navy	sandybrown	yellow4
darkgoldenrod1	gray6	gray89	grey64	lightcyan3	navyblue	seagreen	yellowgreen
darkgoldenrod2	gray7	gray90	grey65	lightcyan4	oldlace	seagreen1
darkgoldenrod3	gray8	gray91	grey66	lightgoldenrod	olivedrab	seagreen2
darkgoldenrod4	gray9	gray92	grey67	lightgoldenrod1	olivedrab1	seagreen3
darkgray	gray10	gray93	grey68	lightgoldenrod2	olivedrab2	seagreen4
darkgreen	gray11	gray94	grey69	lightgoldenrod3	olivedrab3	seashell
darkgrey	gray12	gray95	grey70	lightgoldenrod4	olivedrab4	seashell1
darkkhaki	gray13	gray96	grey71	lightgoldenrodyellow	orange	seashell2

References

Brunsdon, C. and Comber, L., 2015. An introduction to R for spatial analysis and mapping. Sage.

Cleveland, W. S. (1979) Robust locally weighted regression and smoothing scatterplots. J. American Statistical Association 74, 829–836.

Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships. John Wiley & Sons.

Hurley, Catherine B. (2004) Clustering Visualisations of Multidimensional Data, Journal of Computational and Graphical Statistics, 13(4), 788-806.

Groningen Spatial Analysis Workshop - Session 2: Regression