1. Introduction

In this practical you will use regression, perhaps the most widely used method for constructing predictive models. You will learn how to set up a regression, consider some of the key issues in model construction (selection) and develop local, kernel-based regression models using GWR (Brunsdon and Comber 2015; Fotheringham et al. 2002).

You will find Chapters 5 and 8 of An Introduction to R for Spatial Analysis and Mapping by Chris Brunsdon and Lex Comber useful for this practical.

The aim of this practical is to develop your competency and expertise in generating inference from data and to understand some of the limitations of classic statistical inference in the analysis of big data. This practical provides a smooth introduction to the sessions next week on Machine Learning and Outlier detection.

To do this you will use the georgia dataset in the GISTools package and the GWmodel package, which is curated and supported by the original GWR team. The code below will make sure you have all of the R packages you need installed. Each test checks whether a given package is in the list of installed packages and installs it if not. Package installation only needs to be done once - not every time you use R - so if you are sure you already have all of the packages installed, you can skip this step.

# set a mirror
options(repos = c(CRAN = "http://cran.rstudio.com"))
# test for package existance and install
if (!is.element("tidyverse", installed.packages()))
    install.packages("tidyverse", dep = T)
if (!is.element("GISTools", installed.packages()))
    install.packages("GISTools", dep = T)
if (!is.element("GWmodel", installed.packages())) 
    install.packages("GWmodel", dep = T)
if (!is.element("rgdal", installed.packages()))
    install.packages("rgdal", dep = T)
if (!is.element("tmap", installed.packages())) 
    install.packages("tmap", dep = T)
if (!is.element("knitr", installed.packages()))
    install.packages("knitr", dep = T)
if (!is.element("car", installed.packages()))
    install.packages("car", dep = T)
if (!is.element("gclus", installed.packages()))
    install.packages("gclus", dep = T)
if (!is.element("kableExtra", installed.packages()))
    install.packages("kableExtra", dep = T)

Once we are sure all the packages are installed, you need to load them into the current session:

# load into the R session
library(tidyverse)
library(GISTools)
library(GWmodel)
library(rgdal)
library(tmap)
library(knitr)
library(car)
library(gclus)
library(kableExtra)

2. Data

You should load the georgia data, examine it and assign a selection of the variables to a data frame df:

library(GISTools)
data(georgia)
# have a look!
class(georgia)
head(georgia@data)
# assign selected variables to df
df <- as_tibble(georgia@data[, c(14,4:9)])
df$MedInc <- df$MedInc/1000

You will see that the df object is in tibble format and contains a number of variables for the counties in Georgia from the 1990 census: the median income of each county (MedInc) and the percentage of the population in each county that is rural (PctRural), went to university (PctBach), is elderly (PctEld), is foreign born (PctFB), is classed as being in poverty (PctPov) and is black (PctBlack).

Below is a quick look at the first few records of the data:

MedInc PctRural PctBach PctEld PctFB PctPov PctBlack
32.152 75.6 8.2 11.43 0.64 19.9 20.76
27.657 100.0 6.4 11.77 1.58 26.0 26.86
29.342 61.7 6.6 11.11 0.27 24.1 15.42
29.610 100.0 9.4 13.17 0.11 24.8 51.67
36.414 42.7 13.3 8.64 1.43 17.5 42.39
41.783 100.0 6.4 11.37 0.34 15.1 3.49

To see this in R, enter

head(df)

Our aim in this practical is to build a model that describes the median income using the factors listed above. We will use this to understand a number of key features in regression.

3. Initial explorations

We should have a look at the data to see what kinds of properties it has. In terms of model selection, what we are ultimately interested in is the degree to which the different variables are good predictors of Median Income, but it is worth getting a quick overview of the attributes first using previously introduced functions such as summary.
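For example, a quick numerical summary of each variable can be obtained with:

summary(df)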

In order to examine the relationships between the variables in df we can plot the continuous data against each other and use the cor function to examine collinearity amongst them:

plot(df, cex = 0.5, col = grey(0.145,alpha=0.5))
Correlations between numeric variables.

Another useful tool here is to show the upper triangle of the scatterplot matrix with smoothed trend lines. These are achieved with lowess curve fits (Cleveland, 1979) - smooth (but possibly curved) bivariate trend lines - and provide a good way of judging by eye whether there is collinearity in the predictor variables. Essentially, a straight-line shaped trend without too much scattering of the individual points suggests collinearity might be an issue. When two predictors are correlated it can be difficult to identify whether it is one or the other (or both) that influences the quantity to be predicted. The code below does this. Note that as collinearity amongst the predictors is the key concern here, we focus on these and ignore MedInc for this graphic. The upper.panel=panel.smooth argument causes the lowess curves to be added.

plot(df[,-1], cex = 0.5, col = grey(0.145,alpha=0.5), upper.panel=panel.smooth)

Visually, it seems that there may be some collinearity between PctPov, PctBlack and PctEld.

A numerical table of these relationships as correlations can be generated with the kable function:

library(knitr)
tab <- round(cor(df), 3)
kable(tab, booktabs = T, format = 'markdown')
MedInc PctRural PctBach PctEld PctFB PctPov PctBlack
MedInc 1.000 -0.169 0.523 -0.615 0.320 -0.818 -0.536
PctRural -0.169 1.000 -0.619 0.390 -0.547 0.174 -0.069
PctBach 0.523 -0.619 1.000 -0.458 0.672 -0.402 -0.109
PctEld -0.615 0.390 -0.458 1.000 -0.483 0.568 0.297
PctFB 0.320 -0.547 0.672 -0.483 1.000 -0.329 -0.112
PctPov -0.818 0.174 -0.402 0.568 -0.329 1.000 0.736
PctBlack -0.536 -0.069 -0.109 0.297 -0.112 0.736 1.000

What we are looking for in the plots - to be confirmed in the table - are relationships that look like straight lines, showing that values of one variable change systematically with values of another.
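If you prefer a short list to scanning the whole matrix, one purely illustrative way of pulling the more strongly correlated pairs out of tab is:

# list the pairs of variables with |r| > 0.7
# (each pair appears twice because the matrix is symmetric)
strong <- which(abs(tab) > 0.7 & abs(tab) < 1, arr.ind = TRUE)
data.frame(var1 = rownames(tab)[strong[, 1]],
           var2 = colnames(tab)[strong[, 2]],
           r    = tab[strong])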

A final useful tool here is Catherine Hurley’s gclus package. This is similar to the data frame plotting above, but provides tools to reorder the variables so that the most strongly correlated pairs are closer to the diagonal in the plots. It can also colour-code the panels to show whether the correlations are positive (red), negative (blue) or near to zero (yellow). For this plot we again focus on the predictors. This is shown below.

library(gclus)
cor.mat <- cor(df[,-1])
cpairs(df[,-1], order.single(cor.mat), dmat.color(cor.mat))

The order of the variables along the diagonal generally puts positively correlated variables as near to each other as possible. The ordering here is based on a clustering method (single linkage) provided via the order.single function in gclus; the colouring is based on the correlations (via the dmat.color function in gclus).

Note, you could also write the table out to a .csv file:

write.csv(tab, file = "tab.csv")

The plot and the table above of the correlations between the variables show that some of the variables are highly correlated (less than -0.8 or greater than +0.8) with each other and with the response variable MedInc, and others less so. This gives us some confidence that we can construct a reasonable model from this data - at least some of the variables are correlated with MedInc - but care may be needed in interpreting some of the regression parameters due to collinearity in the predictors.

We are interested in a predictive model of income, so it is useful to see how that value varies. The code below plots histograms of the variable MedInc. Note the use of the parameter prob: if set to TRUE the histogram shows densities for each bin rather than frequency counts. This makes the \(y\)-axis compatible with a kernel density estimate (a smooth estimate of the probability distribution of \(x\)), so the two may be overlaid. Here this is done by using the lines function with a kernel density estimate of the distribution given by density. A rug plot is added to show the exact locations of the values on the \(x\)-axis.

The graphs were obtained with the following code:

hist(df$MedInc, xlab='Median Income ($1000\'s)', main='Histogram of Median Income', col = 'DarkRed')

hist(df$MedInc, prob=T, xlab='Median Income ($1000\'s)', main='Histogram of Median Income', col='NavyBlue')
lines(density(df$MedInc,na.rm=T),col='salmon',lwd=2)
rug(jitter(df$MedInc))

# qqPlot(df$MedInc,main='Normal QQ plot of Median Income') 

You may wish to explore some of the other variables in the same way. For example:

hist(df$PctFB, prob=T, xlab='Proportion of Foreign-Born Population (%)', main='Histogram of % Foreign-Born',col='peru')
lines(density(df$PctFB,na.rm=T),col='olivedrab',lwd=2)
rug(jitter(df$PctFB))

# qqPlot(df$PctBach,main='Normal QQ plot of % with a Degree') 

As an aside, you will see R offers quite a range of colours - including a few I had never heard of - ‘peru’ is a colour?! - these can be helpful in making plots more visible. Entering colors() gives the full list of named R colours - this is used in an addendum at the end of this session to create a visual guide.

4. Regression Models

So we have some data. We have had a look at some of the variables, seen how some of them relate to the variable we are interested in predicting, MedInc, how they correlate with each other, and perhaps we have even started to think about hypotheses that we might wish to test in our model.

We will start with a hypothesized model of County Median Income in the state of Georgia that associates it with rurality (PctRural), whether people went to university (PctBach), how old the population is (PctEld), the percentage of people that are foreign born (PctFB), the percentage of the population classed as being in poverty (PctPov) and the percentage that is black (PctBlack).

In R we can construct a regression model using the function lm, which stands for linear model. This is a standard ordinary least squares regression function; although simple, it is very powerful and there are many variants and functions for manipulating lm outputs: remember R was originally a stats package.

The equation for a regression is:

\(y_i = \beta_{0} + \sum_{k=1}^{m}\beta_{k}x_{ik} + \epsilon_i \textsf{ for } i \in 1 \cdots n\)

where \(y_i\) is the response variable for observation \(i\), \(x_{ik}\) is the value of the \(k^{\textsf{th}}\) predictor variable for observation \(i\), \(m\) is the number of predictor variables, \(\beta_{0}\) is the intercept term, \(\beta_{k}\) is the regression coefficient for the \(k^{th}\) predictor variable, \(\epsilon_i\) is the random error term for observation \(i\) and \(n\) is the number of observations.

In our case, the response variable \(y\) is median income (MedInc), and the 6 socio-economic variables PctRural, PctBach, PctEld, PctFB, PctPov and PctBlack are the predictor variables, the \(m\) different \(x\)’s.

The basic formula used in R to specify a linear regression model with lm is as follows:

model <- lm(variable_we_are_trying_to_predict ~ predictor_variable1 +
              predictor_variable2...)

where the ~ sign (tilde) separates the variable we are trying to model (the response variable) from the variables we will use to predict it, the predictor variables. The predictor variables are usually separated by a + sign - meaning that they are linear terms in the model added together.
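As an aside, because df contains only MedInc and the six predictors, the same set of predictors can be specified with the . shorthand, which in a formula means ‘all other columns in the data’. A hypothetical equivalent (m_alt is just an illustrative name) would be:

# equivalent shorthand: use every remaining column of df as a predictor
m_alt <- lm(MedInc ~ ., data = df)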

We can specify our first model.

m <- lm(MedInc~PctRural+PctBach+PctEld+PctFB+PctPov+PctBlack, data = df) 

And look at its summary attributes

summary(m)
## 
## Call:
## lm(formula = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov + 
##     PctBlack, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4203  -2.9897  -0.6163   2.2095  25.8201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.59895    3.10893  16.919  < 2e-16 ***
## PctRural     0.07377    0.02043   3.611 0.000414 ***
## PctBach      0.69726    0.11221   6.214 4.73e-09 ***
## PctEld      -0.78862    0.17979  -4.386 2.14e-05 ***
## PctFB       -1.29030    0.47388  -2.723 0.007229 ** 
## PctPov      -0.95400    0.10459  -9.121 4.19e-16 ***
## PctBlack     0.03313    0.03717   0.891 0.374140    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.155 on 152 degrees of freedom
## Multiple R-squared:  0.7685, Adjusted R-squared:  0.7593 
## F-statistic: 84.09 on 6 and 152 DF,  p-value: < 2.2e-16

There are lots of things to look at in the summary of the model but we will focus on 3 aspects:

First, Significance. If we examine the model we can see that nearly all of the variables are significant predictors of Median Income. The level of significance is indicated by the \(P\)-value in the column \(Pr(>|t|)\) and the asterisks \(*\). For each coefficient \(\beta_k\) these are the significance levels associated with the null hypothesis \(\textsf{H}_0 : \beta_k = 0\) against the alternative \(\textsf{H}_1 : \beta_k \ne 0\) - the \(P\)-value is the probability that one would obtain a \(t\)-statistic for \(\beta_k\) at least as large (in absolute value) as the one observed, given that \(\textsf{H}_0\) is true.

It is a measure of how surprising the relationship is, indicating whether the association between \(x_k\) and \(y\) found in this data could plausibly have arisen by chance. Very small values suggest that the level of association found here would be very unlikely to occur in a random sample of data if \(\beta_k\) really were zero. Cutting to the chase, small \(P\)-values mean strong evidence for a non-zero regression coefficient.

So in this case we can say that all of the predictors except PctBlack are significantly associated with Median Income.
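One quick way to confirm this - an illustrative snippet rather than part of the standard workflow - is to extract the \(P\)-values directly from the summary object:

# extract and round the coefficient p-values from the fitted model
round(summary(m)$coefficients[, "Pr(>|t|)"], 4)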

Second, the Coefficient Estimates. These are the \(\beta_{k}\) values in the ‘Estimate’ column of the summary table that tell you the rate of change between \(x_k\) and \(y\). The first of these is the Intercept - this is \(\beta_0\) in the regression equation and tells you the value of \(y\) when all of the predictors \(x_1 \cdots x_m\) are zero.

Remember, regression is like taking lots of \(x_k\) and \(y\) values and seeing how well they are correlated: the regression fits a line between them. In the figure below a regression model with just one \(x\) is defined and summarised, and then the \(x\) and \(y\) values are plotted along with the fitted line.

m2 <- lm(df$MedInc~df$PctBach)
summary(m2)
## 
## Call:
## lm(formula = df$MedInc ~ df$PctBach)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.124  -6.025  -1.905   3.416  39.969 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.5913     1.5476  17.182  < 2e-16 ***
## df$PctBach    0.9643     0.1255   7.684 1.57e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.986 on 157 degrees of freedom
## Multiple R-squared:  0.2733, Adjusted R-squared:  0.2687 
## F-statistic: 59.04 on 1 and 157 DF,  p-value: 1.566e-12
plot(df$PctBach, df$MedInc, xlab = "x = PctBach", ylab = "y = MedInc")
abline(m2, col = "red", lty = 2, lwd = 2)

Here we can see that the intercept of the relationship is 26.6, as the line shows, and that the rate of change coefficient between MedInc and PctBach, when they are modelled on their own, is 0.964. This means that each increase of 1 in PctBach (each extra 1% of people with a degree) is associated with an increase of 0.964 in MedInc - as MedInc is measured in $1000s, that is about $964 per annum.
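As a quick sanity check of this interpretation, you can compute a fitted value directly from the estimated intercept and slope - for example, for a hypothetical county where 20% of people have a degree:

# fitted MedInc (in $1000s) for PctBach = 20, using the estimated coefficients
coef(m2)[1] + coef(m2)[2] * 20   # approximately 26.59 + 0.964 * 20 = 45.88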

So returning to our model m, we can now see how changes of 1 unit in the different predictor variables are associated with changes in MedInc. The association of PctRural is positive but small, the association of PctBach is reduced (relative to the bivariate model) by the presence of the other \(x\)’s but is still positive, and PctEld, PctFB and PctPov are strongly negatively associated with MedInc.

So the regression model m gives us an idea of how different factors are associated in different ways with Median Income in the counties of the state of Georgia.

Third, the \(R^2\) value. This provides a measure of model fit. It is computed from the differences between the actual values of \(y\) (e.g. of MedInc) and the values predicted by the model - sometimes written as \(\hat{y}\) (called ‘\(y\)-hat’) - and describes the proportion of the variation in \(y\) that the model explains. A good description is here: http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit. Generally, \(R^2\) values greater than 0.7 indicate a reasonably well fitting model. Some users of this method describe low \(R^2\) values as ‘bad’ - they are not - they just mean you have a more imprecise model: the predictions are a bit more wrong! Essentially there is a sliding scale of reliability of the predictions - based on the relative amounts of deterministic and random variability in the system being analysed - and \(R^2\) is a measure of this.
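To make the idea concrete, \(R^2\) can be computed ‘by hand’ from the model residuals; the result should match the Multiple R-squared reported by summary(m). A minimal sketch:

# R-squared = 1 - residual sum of squares / total sum of squares
rss <- sum(residuals(m)^2)
tss <- sum((df$MedInc - mean(df$MedInc))^2)
1 - rss / tss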

Finally, we are also interested in outliers - these are records that the model does not describe very well, either vastly over-predicting Median Income or under-predicting it. When the bivariate regression was considered earlier, these were easy to find, since they were just the points lying a long way from the regression line. However, when there are multiple predictors, relationships cannot be encapsulated in two dimensional space.

For this situation, a useful diagnostic is the studentised residual. For different values of the predictors, the difference between the actual and fitted \(y\) values has a different variance, even if the theoretical model gives \(\epsilon_i\) a constant variance. This discrepancy arises because observations with different predictor values have different leverage on the fitted model. Studentised residuals compensate for this. Plotting them and looking for notably high or low values is a helpful way of spotting outliers in the regression model.

A simple way of doing this is just to plot the residuals in order:

s.resids <- rstudent(m)
plot(s.resids,type='h')
abline(h=c(-2,2),col='lightskyblue')

Here we see that some residuals are relatively high. Normally one might expect studentised residuals to lie between -2 and 2. Adding horizontal lines to the plot, as above, helps to identify the cases where this is not so.

Another useful task is to identify which are the outlying cases - here this is done as a basic table (via kable):

df_out <- tibble(Name = georgia$Name, Residual = s.resids)
df_out <- df_out[abs(df_out$Residual)>2,]
kable(df_out)
Name Residual
Charlton -2.248484
Chattahoochee -2.205354
Clarke -2.918361
Fayette 2.984996
Forsyth 5.620108
Seminole 2.954353
Wilkinson -2.307853

Counties with large negative values here have a much lower income than the model predicts; large positive values suggest the income is much higher than predicted.

Finally, we need to check another assumption of lm - that the error terms \(\epsilon_i\) are normally distributed. The best way to do this is via a qq-plot, which plots the empirical quantiles of a variable whose distribution we are interested in against the quantiles of the theoretical distribution we are testing for. If the variable is from this distribution, the points should lie close to a straight line - frequently a reference line is provided to judge this. Here s.resids is the variable of interest - the studentised residuals - and these are compared to the \(t\) distribution: although the \(\epsilon_i\)’s are assumed to have a normal distribution, theory shows that studentised residuals derived from a sample generated by this model follow a \(t\) distribution. The qqPlot function from the car package does all the work and requires only the regression model object to create the plot:

qqPlot(m)

This demonstrates that the residuals mostly do conform to this, with a couple of extreme outliers at the higher end - labelled in the plot. Consulting the earlier table suggests these are Fayette and Forsyth Counties.

It is also important to consider the spatial patterns in the residuals from a geographical viewpoint. The code below maps the residuals, highlighting the outlying counties:

# determine residuals
s.resids = rstudent(m)

# 1.5 standard deviations
resid.shades = shading(c(-1.5,1.5),c("indianred1","antiquewhite2","powderblue"))
# define a color scheme
cols = resid.shades$cols[1 + findInterval(s.resids, resid.shades$breaks)]
# set plot parameters
par(mar = c(0,0,2,0))

# have a look at the spatial distribution of outliers
choropleth(georgia,s.resids,resid.shades, main = "Model Outliers") 

This session finishes here, but you will need the results of this session later on. To save all of the variables, regression objects and so on from this session, enter

save.image(file='session2.RData')

This saves an image of all of the R objects currently defined in a file called session2.RData - later on you will be able to load this again via the load function.
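For example, at the start of a later session you could restore everything that was saved with:

load('session2.RData')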

Addendum - all the named colours of R

As referred to earlier in the text, the code used to generate the colour guide is listed below for those interested. You’ll need kableExtra installed to get this to work.

library(kableExtra)

### Matrix of rgb values for each named colour
rgbmat <- t(col2rgb(colors()))

### Turned into hex formats
hexes <- apply(rgbmat, 1, function(x) sprintf("#%02X%02X%02X",x[1],x[2],x[3]))

### These will be backgrounds - foreground either black or white depending on lightness of colour
textcol <- ifelse(rowMeans(rgbmat) > 156, 'black', 'white')

### Now create the cells with fg/bg colour spec
cols <- colours()
cols <- cell_spec(cols,background=hexes,color=textcol)

### Make this a grid
colour_grid <- matrix(c(cols,rep("",7)),83,8)
colnames(colour_grid)=rep(" ",8)

### Use kableExtra stuff to create html table
ktab <- kable(colour_grid,align='l',format='html',escape=FALSE)
kable_styling(ktab)
white darkmagenta gray14 gray97 grey72 lightgray orange1 seashell3
aliceblue darkolivegreen gray15 gray98 grey73 lightgreen orange2 seashell4
antiquewhite darkolivegreen1 gray16 gray99 grey74 lightgrey orange3 sienna
antiquewhite1 darkolivegreen2 gray17 gray100 grey75 lightpink orange4 sienna1
antiquewhite2 darkolivegreen3 gray18 green grey76 lightpink1 orangered sienna2
antiquewhite3 darkolivegreen4 gray19 green1 grey77 lightpink2 orangered1 sienna3
antiquewhite4 darkorange gray20 green2 grey78 lightpink3 orangered2 sienna4
aquamarine darkorange1 gray21 green3 grey79 lightpink4 orangered3 skyblue
aquamarine1 darkorange2 gray22 green4 grey80 lightsalmon orangered4 skyblue1
aquamarine2 darkorange3 gray23 greenyellow grey81 lightsalmon1 orchid skyblue2
aquamarine3 darkorange4 gray24 grey grey82 lightsalmon2 orchid1 skyblue3
aquamarine4 darkorchid gray25 grey0 grey83 lightsalmon3 orchid2 skyblue4
azure darkorchid1 gray26 grey1 grey84 lightsalmon4 orchid3 slateblue
azure1 darkorchid2 gray27 grey2 grey85 lightseagreen orchid4 slateblue1
azure2 darkorchid3 gray28 grey3 grey86 lightskyblue palegoldenrod slateblue2
azure3 darkorchid4 gray29 grey4 grey87 lightskyblue1 palegreen slateblue3
azure4 darkred gray30 grey5 grey88 lightskyblue2 palegreen1 slateblue4
beige darksalmon gray31 grey6 grey89 lightskyblue3 palegreen2 slategray
bisque darkseagreen gray32 grey7 grey90 lightskyblue4 palegreen3 slategray1
bisque1 darkseagreen1 gray33 grey8 grey91 lightslateblue palegreen4 slategray2
bisque2 darkseagreen2 gray34 grey9 grey92 lightslategray paleturquoise slategray3
bisque3 darkseagreen3 gray35 grey10 grey93 lightslategrey paleturquoise1 slategray4
bisque4 darkseagreen4 gray36 grey11 grey94 lightsteelblue paleturquoise2 slategrey
black darkslateblue gray37 grey12 grey95 lightsteelblue1 paleturquoise3 snow
blanchedalmond darkslategray gray38 grey13 grey96 lightsteelblue2 paleturquoise4 snow1
blue darkslategray1 gray39 grey14 grey97 lightsteelblue3 palevioletred snow2
blue1 darkslategray2 gray40 grey15 grey98 lightsteelblue4 palevioletred1 snow3
blue2 darkslategray3 gray41 grey16 grey99 lightyellow palevioletred2 snow4
blue3 darkslategray4 gray42 grey17 grey100 lightyellow1 palevioletred3 springgreen
blue4 darkslategrey gray43 grey18 honeydew lightyellow2 palevioletred4 springgreen1
blueviolet darkturquoise gray44 grey19 honeydew1 lightyellow3 papayawhip springgreen2
brown darkviolet gray45 grey20 honeydew2 lightyellow4 peachpuff springgreen3
brown1 deeppink gray46 grey21 honeydew3 limegreen peachpuff1 springgreen4
brown2 deeppink1 gray47 grey22 honeydew4 linen peachpuff2 steelblue
brown3 deeppink2 gray48 grey23 hotpink magenta peachpuff3 steelblue1
brown4 deeppink3 gray49 grey24 hotpink1 magenta1 peachpuff4 steelblue2
burlywood deeppink4 gray50 grey25 hotpink2 magenta2 peru steelblue3
burlywood1 deepskyblue gray51 grey26 hotpink3 magenta3 pink steelblue4
burlywood2 deepskyblue1 gray52 grey27 hotpink4 magenta4 pink1 tan
burlywood3 deepskyblue2 gray53 grey28 indianred maroon pink2 tan1
burlywood4 deepskyblue3 gray54 grey29 indianred1 maroon1 pink3 tan2
cadetblue deepskyblue4 gray55 grey30 indianred2 maroon2 pink4 tan3
cadetblue1 dimgray gray56 grey31 indianred3 maroon3 plum tan4
cadetblue2 dimgrey gray57 grey32 indianred4 maroon4 plum1 thistle
cadetblue3 dodgerblue gray58 grey33 ivory mediumaquamarine plum2 thistle1
cadetblue4 dodgerblue1 gray59 grey34 ivory1 mediumblue plum3 thistle2
chartreuse dodgerblue2 gray60 grey35 ivory2 mediumorchid plum4 thistle3
chartreuse1 dodgerblue3 gray61 grey36 ivory3 mediumorchid1 powderblue thistle4
chartreuse2 dodgerblue4 gray62 grey37 ivory4 mediumorchid2 purple tomato
chartreuse3 firebrick gray63 grey38 khaki mediumorchid3 purple1 tomato1
chartreuse4 firebrick1 gray64 grey39 khaki1 mediumorchid4 purple2 tomato2
chocolate firebrick2 gray65 grey40 khaki2 mediumpurple purple3 tomato3
chocolate1 firebrick3 gray66 grey41 khaki3 mediumpurple1 purple4 tomato4
chocolate2 firebrick4 gray67 grey42 khaki4 mediumpurple2 red turquoise
chocolate3 floralwhite gray68 grey43 lavender mediumpurple3 red1 turquoise1
chocolate4 forestgreen gray69 grey44 lavenderblush mediumpurple4 red2 turquoise2
coral gainsboro gray70 grey45 lavenderblush1 mediumseagreen red3 turquoise3
coral1 ghostwhite gray71 grey46 lavenderblush2 mediumslateblue red4 turquoise4
coral2 gold gray72 grey47 lavenderblush3 mediumspringgreen rosybrown violet
coral3 gold1 gray73 grey48 lavenderblush4 mediumturquoise rosybrown1 violetred
coral4 gold2 gray74 grey49 lawngreen mediumvioletred rosybrown2 violetred1
cornflowerblue gold3 gray75 grey50 lemonchiffon midnightblue rosybrown3 violetred2
cornsilk gold4 gray76 grey51 lemonchiffon1 mintcream rosybrown4 violetred3
cornsilk1 goldenrod gray77 grey52 lemonchiffon2 mistyrose royalblue violetred4
cornsilk2 goldenrod1 gray78 grey53 lemonchiffon3 mistyrose1 royalblue1 wheat
cornsilk3 goldenrod2 gray79 grey54 lemonchiffon4 mistyrose2 royalblue2 wheat1
cornsilk4 goldenrod3 gray80 grey55 lightblue mistyrose3 royalblue3 wheat2
cyan goldenrod4 gray81 grey56 lightblue1 mistyrose4 royalblue4 wheat3
cyan1 gray gray82 grey57 lightblue2 moccasin saddlebrown wheat4
cyan2 gray0 gray83 grey58 lightblue3 navajowhite salmon whitesmoke
cyan3 gray1 gray84 grey59 lightblue4 navajowhite1 salmon1 yellow
cyan4 gray2 gray85 grey60 lightcoral navajowhite2 salmon2 yellow1
darkblue gray3 gray86 grey61 lightcyan navajowhite3 salmon3 yellow2
darkcyan gray4 gray87 grey62 lightcyan1 navajowhite4 salmon4 yellow3
darkgoldenrod gray5 gray88 grey63 lightcyan2 navy sandybrown yellow4
darkgoldenrod1 gray6 gray89 grey64 lightcyan3 navyblue seagreen yellowgreen
darkgoldenrod2 gray7 gray90 grey65 lightcyan4 oldlace seagreen1
darkgoldenrod3 gray8 gray91 grey66 lightgoldenrod olivedrab seagreen2
darkgoldenrod4 gray9 gray92 grey67 lightgoldenrod1 olivedrab1 seagreen3
darkgray gray10 gray93 grey68 lightgoldenrod2 olivedrab2 seagreen4
darkgreen gray11 gray94 grey69 lightgoldenrod3 olivedrab3 seashell
darkgrey gray12 gray95 grey70 lightgoldenrod4 olivedrab4 seashell1
darkkhaki gray13 gray96 grey71 lightgoldenrodyellow orange seashell2

References

Brunsdon, C. and Comber, L. (2015). An Introduction to R for Spatial Analysis and Mapping. Sage.

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829-836.

Fotheringham, A. S., Brunsdon, C. and Charlton, M. (2002). Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. John Wiley & Sons.

Hurley, C. B. (2004). Clustering visualisations of multidimensional data. Journal of Computational and Graphical Statistics, 13(4), 788-806.