IAR Group Meeting 1

Introduction

The month that changed my life

¡¡¡¡¡¡ Freedom !!!!!!

Workflow

SPSS

ArcGIS

Notepad

GWR4.0

Notepad

ArcGIS

Word

R-based analysis

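# Load the packages used in this example
library(GWmodel)    # gw.dist(), bw.gwr(), gwr.basic() and the LondonHP data
library(tidyverse)  # the %>% pipe
library(sf)         # st_as_sf(), st_set_crs()
library(tmap)       # tm_shape(), tm_dots()
# Load data (this creates the object "londonhp")
data(LondonHP)
# Create distance matrix
DM <- gw.dist(dp.locat = coordinates(londonhp))
# Find the optimal bandwidth by cross-validation
bw1 <- bw.gwr(PURCHASE ~ FLOORSZ, data = londonhp, kernel = "gaussian", dMat = DM)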

## Fixed bandwidth: 28008.52 CV score: 901202470969 
## Fixed bandwidth: 17313.68 CV score: 842907727581 
## Fixed bandwidth: 10703.9 CV score: 736181883398 
## Fixed bandwidth: 6618.837 CV score: 607130814353 
## Fixed bandwidth: 4094.128 CV score: 529769270141 
## Fixed bandwidth: 2533.772 CV score: 496244493691 
## Fixed bandwidth: 1569.419 CV score: 558268461315 
## Fixed bandwidth: 3129.775 CV score: 504912379213 
## Fixed bandwidth: 2165.422 CV score: 500148436808 
## Fixed bandwidth: 2761.425 CV score: 498237171785 
## Fixed bandwidth: 2393.075 CV score: 496399101924 
## Fixed bandwidth: 2620.728 CV score: 496732595196 
## Fixed bandwidth: 2480.03 CV score: 496150911653 
## Fixed bandwidth: 2446.816 CV score: 496183570335 
## Fixed bandwidth: 2500.558 CV score: 496166149711 
## Fixed bandwidth: 2467.344 CV score: 496154814854 
## Fixed bandwidth: 2487.871 CV score: 496153635785 
## Fixed bandwidth: 2475.184 CV score: 496151178429 
## Fixed bandwidth: 2483.025 CV score: 496151494463 
## Fixed bandwidth: 2478.179 CV score: 496150836405 
## Fixed bandwidth: 2477.035 CV score: 496150899230 
## Fixed bandwidth: 2478.886 CV score: 496150839373 
## Fixed bandwidth: 2477.742 CV score: 496150850527 
## Fixed bandwidth: 2478.449 CV score: 496150833774 
## Fixed bandwidth: 2478.616 CV score: 496150834475 
## Fixed bandwidth: 2478.346 CV score: 496150834229 
## Fixed bandwidth: 2478.513 CV score: 496150833832 
## Fixed bandwidth: 2478.41 CV score: 496150833868 
## Fixed bandwidth: 2478.474 CV score: 496150833765 
## Fixed bandwidth: 2478.489 CV score: 496150833779 
## Fixed bandwidth: 2478.465 CV score: 496150833764 
## Fixed bandwidth: 2478.459 CV score: 496150833766 
## Fixed bandwidth: 2478.468 CV score: 496150833764 
## Fixed bandwidth: 2478.47 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.466 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.468 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764

# Run GWR
gwr.res1 <- gwr.basic(PURCHASE~FLOORSZ, data=londonhp, 
                      bw=bw1,kernel = "gaussian", dMat=DM)
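
To give a feel for what the bandwidth does, here is a minimal base-R sketch of Gaussian distance-decay weighting. The helper `gauss_weight()` and the example distances are made up for illustration, and the formula used is the standard Gaussian kernel, which I assume matches what kernel = "gaussian" does. The bandwidth is the rounded value from the search above.

```r
# Illustrative only: Gaussian distance-decay weights for a fixed bandwidth.
# gauss_weight() is a hypothetical helper, not a GWmodel function.
gauss_weight <- function(d, bw) exp(-0.5 * (d / bw)^2)

bw <- 2478.467                            # bandwidth from the search above
d  <- c(0, 1000, 2478.467, 5000, 10000)   # made-up distances to one regression point

weights <- gauss_weight(d, bw)
round(weights, 3)   # weight is 1 at distance 0 and shrinks with distance
```

Observations far beyond the bandwidth get a weight close to zero, so each local regression is driven mostly by nearby houses.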

Output

# Make map interactive
tmap_mode("view")
# Take spatial data, change format, make sure projection is correct, create map
gwr.res1$SDF %>% st_as_sf() %>% st_set_crs(27700) %>% 
  tm_shape()+tm_dots(col="FLOORSZ", palette="RdYlGn") -> plot1

Plot

Conclude

R is brilliant
It made my life a lot easier
Coming up next:
- A very, very short introduction, covering some of the basics
- Too much material for the time we have
- Goal is to (hopefully) enthuse and show what’s possible
- Links to (free!) textbooks and resources at the end

Goals

By the end of this session you will be able to:

Install and load packages
Open data
Create simple tables
Create simple models
Create maps with data and models

So how does that help you?

What can you do in R?

Stop doing repetitive tasks

Wiersma et al. 2022

What can you do in R?

Do reproducible science
- Quantitative
- And very much qualitative as well
Create presentations

What can you do in R?

Write entire papers

Rijnks et al. 2022

What can you do in R?

R does everything better than any other software package

Using R

What do you need to know?

Overview of RStudio
Basic R commands
- Installing and loading packages
- Loading data
- Viewing data
- Brief overview of data types
- Simple models
- Find help
- Learn to love errors

Two

Links to books / examples
- Book on geographic data science with R
- Book on text mining with R
- Some fun and useful links

How does R work?

Stand-alone
In IDE (e.g. RStudio)

Stand-alone

RStudio

REPL vs Quarto

REPL: Instantaneous, but fleeting
Quarto: bit more faff to write, but saves as document
REPL: Could do comments using #
Quarto: Text / presentation / notes / descriptions separate
REPL: Model output, but barely legible
Quarto: Lovely tables and figures

Start a project

File –> New Project
New Directory –> New Project
Give it a name (e.g. Winterschool) and location (e.g. Desktop)
Start a new file in that project: Quarto Document
Sort of optional: Change the “editor” from “visual” to “source”

Installing and loading packages

install.packages in REPL
- once installed you have them accessible on your pc
load packages in Quarto
- that way you can load only project-specific packages
How to make R chunk in Quarto?

Installing and loading packages

# Install the packages
install.packages("tidyverse")
install.packages("sf")
install.packages("tmap")
install.packages("GWmodel")
install.packages("RColorBrewer")
install.packages("kableExtra")
install.packages("gridExtra")
install.packages("xtable")  # used later for the regression table
# Load the packages (note the lack of quotation marks)

library(tidyverse)
library(sf)
library(tmap)
library(GWmodel)
library(RColorBrewer)
library(kableExtra)
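
If you rerun the install lines often, a small check can avoid reinstalling what is already there. This is only a sketch: `missing_packages()` is a hypothetical helper written for this slide, built on base R's `requireNamespace()`.

```r
# Hypothetical helper: which of these packages are not installed yet?
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("tidyverse", "sf", "tmap", "GWmodel")
to_install <- missing_packages(wanted)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```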

Data basics

Loading data

# Data as part of packages
data(LondonHP)
# Change from SpatialPointsDataFrame (old) to SF (modern)
london_sf <- londonhp %>% st_as_sf() %>% st_set_crs(27700)

# CSV files
#read.csv("path_to_data")

# Dutch-style CSV files (";" as separator, "," as decimal mark).
# Use this if read.csv() gives you many observations of just 1 variable.
#read.csv2("path_to_data")

# R's own data files
#load("name_of_data.Rda")

# SPSS files? See package: "foreign" or "haven". Stata13 files? see package "readstata13"

# Shapefiles? st_read("path_to_shape.shp")
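
A minimal sketch of the difference between the two CSV readers, using a tiny made-up table written to temporary files so that it runs anywhere. The data frame `houses` and the file names are illustrative only.

```r
# Write a tiny made-up table in both CSV flavours, then read it back.
houses <- data.frame(price = c(157000, 113500, 81750),
                     floor_size = c(77.5, 75, 64))

comma_file <- tempfile(fileext = ".csv")
semi_file  <- tempfile(fileext = ".csv")
write.csv(houses, comma_file, row.names = FALSE)    # "," separator, "." decimal
write.csv2(houses, semi_file, row.names = FALSE)    # ";" separator, "," decimal

from_comma <- read.csv(comma_file)    # matches write.csv()
from_semi  <- read.csv2(semi_file)    # matches write.csv2()

str(from_comma)
str(from_semi)
```

Both calls should give back three observations of two variables. If you see a single variable instead, you most likely picked the wrong reader.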

Arrow gets…

Typing “<-” means object on left “gets” whatever is on the right
Typing “->” vice versa
Convention: “<-”
Try it in REPL:
- test <- 5
- 2 * test
- test_squared <- test * test
- sqrt(test_squared)

Now what?

Head

“head” displays the first few rows of the data file

head(london_sf)

## Simple feature collection with 6 features and 20 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 531900 ymin: 159400 xmax: 535700 ymax: 161700
## Projected CRS: OSGB36 / British National Grid
##   PURCHASE FLOORSZ TYPEDETCH TPSEMIDTCH TYPETRRD TYPEBNGLW TYPEFLAT BLDPWW1
## 0   157000      77         1          0        0         1        0       0
## 1   113500      75         0          0        1         0        0       1
## 2    81750      64         0          0        0         0        1       0
## 3   150000      95         0          1        0         0        0       0
## 4   190000     107         1          0        0         0        0       0
## 5   159950     100         0          1        0         0        0       0
##   BLDPOSTW BLD60S BLD70S BLD80S BLD90S BATH2 BEDS2 GARAGE1 CENTHEAT   UNEMPLOY
## 0        0      0      0      0      0     0     1       0        1 0.03566768
## 1        0      0      0      0      0     0     1       0        1 0.03566768
## 2        0      0      0      1      0     0     0       1        0 0.03566768
## 3        0      0      0      0      0     0     1       0        1 0.03566768
## 4        0      0      0      0      0     0     1       1        1 0.03566768
## 5        0      0      1      0      0     0     1       1        1 0.02408854
##        PROF BLDINTW              geometry
## 0 0.4786992       1 POINT (533200 159400)
## 1 0.4786992       0 POINT (533300 159700)
## 2 0.4786992       0 POINT (532000 159800)
## 3 0.4786992       1 POINT (531900 160100)
## 4 0.4786992       1 POINT (532800 160300)
## 5 0.4773715       0 POINT (535700 161700)

“View” (mind the capital V) opens the data in a spreadsheet-style viewer

#View(london_sf)

Maybe some documentation

Bring up data documentation

?LondonHP

Get to work

Load data
How many:
- Garages
- Post-war
- From the 90s
Correlate
- Price x Professional / managerial
- Floorsize x unemployed
Plot map of purchase price

Useful commands

- use the dollar sign to access variables:
london_sf$BEDS2
- table(x) creates simple tables
- cor(x,y) correlates variables
- cor.test(x,y) runs correlation test
- draw map: 
qtm(shapefilename, dots.col="VARIABLE_YOU_WANT")
- note the quotation marks
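
To try these commands without loading the full data set, here is a small sketch. The data frame `toy` is a stand-in that copies three columns from the six rows printed by head() earlier; six rows are far too few for real conclusions.

```r
# Stand-in for london_sf: three columns of the six rows shown by head().
toy <- data.frame(
  PURCHASE = c(157000, 113500, 81750, 150000, 190000, 159950),
  FLOORSZ  = c(77, 75, 64, 95, 107, 100),
  GARAGE1  = c(0, 0, 1, 0, 1, 1)
)

toy$GARAGE1                           # the dollar sign picks one variable
table(toy$GARAGE1)                    # counts per category
cor(toy$FLOORSZ, toy$PURCHASE)        # correlation coefficient
cor.test(toy$FLOORSZ, toy$PURCHASE)   # adds a test and a confidence interval
```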

Improving map

london_sf %>% mutate(log_purch = log(PURCHASE)) %>%
tm_shape() +
  tm_dots(col="log_purch", palette="-RdYlGn")

# to see more options go to:
# vignette("tmap-getstarted")

Modelling

Brief intro to modelling

Model is an abstraction / simplification of reality
“All models are wrong [therefore] the scientist must be alert to what is importantly wrong” (after George Box)
The standard intro into modelling:
- Correlation
- ANOVA
- Linear regression
- And their expansions / non-parametric equivalents
Simple, usually wrong, superseded by more complicated models

Boston model

Boston Marathon: very difficult to get into
Qualifying time: age + gender stratified
- Example: 18-34 years, men: 3:00, women: 3:30 (hours:minutes)
Bib numbers are assigned sequentially by qualifying time
You’d expect finish time to be linked to qual time
Model: \[Time_{finish} = \alpha + \beta * Time_{quali} + \epsilon\]
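
The model above can be sketched in R with simulated data. Nothing here is real marathon data: the sample size, the range of qualifying times and the "true" values of alpha and beta are all assumptions chosen for illustration.

```r
set.seed(42)  # fixed seed so the simulated numbers are repeatable

# Simulated, not real, marathon times (in minutes).
n <- 200
time_quali  <- runif(n, min = 150, max = 210)
time_finish <- 10 + 1.0 * time_quali + rnorm(n, sd = 8)  # alpha = 10, beta = 1

boston <- lm(time_finish ~ time_quali)
coef(boston)   # estimates of alpha (intercept) and beta (slope)
```

The estimated slope should land close to the value of 1 that went into the simulation.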

Boston scatterplot

marathoninvestigation.com, 17-2-2017

London HP Example

london_sf %>% st_drop_geometry() %>%
  ggplot(aes(x=FLOORSZ, y=PURCHASE)) +
  geom_point() + theme_classic()

London HP Example

Stepping up to linear regression

Regression equation \[ y = \alpha + \beta_1 * x_1 + \beta_2 * x_2 + \epsilon\]
In R: lm(y ~ x1 + x2, data=name_of_data)
Be sure to save the output as an object!
- lm1 <- lm(y ~ x1 + x2, data=name_of_data)
Now you can have fun with it:
- summary(lm1)
- plot(lm1)
- or in Quarto: xtable()
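
A runnable version of the steps above, as a sketch. It uses the `mtcars` data that ships with R rather than the London data, so the variables (mpg, wt, hp) are stand-ins for y, x1 and x2.

```r
# mtcars ships with base R, so no extra packages are needed.
lm1 <- lm(mpg ~ wt + hp, data = mtcars)

summary(lm1)    # coefficients, standard errors, R-squared
coef(lm1)       # just the estimates
# plot(lm1)     # diagnostic plots (opens a graphics device)
```

Because the model is saved as `lm1`, every later command can reuse it without refitting.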

Table neat

london_sf %>% st_drop_geometry() %>% 
  mutate(PURCHASE = log(PURCHASE), FLOORSZ = log(FLOORSZ)) %>%
  lm(PURCHASE ~ FLOORSZ, data=.) -> lm1
library(xtable)
# make sure to mark the chunk with  results='asis'
print(xtable(summary(lm1)),type='html')

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    8.4775      0.1986    42.69    0.0000
FLOORSZ        0.7611      0.0450    16.92    0.0000

Improve model

Your turn

Starting from the log-log model
Build a parsimonious model for house-prices
- Be aware of dummy variables: what is your reference category
- Make sure to save your models
- For model selection (which one is best)
  - AIC()
Interpret the model
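
A sketch of the workflow for this exercise, again with `mtcars` as a stand-in so that it runs without the London data. The object names `mt`, `m1` and `m2` are illustrative.

```r
# Turn cyl into a factor: R builds the dummies and uses the first level
# as the reference category.
mt <- transform(mtcars, cyl = factor(cyl))

m1 <- lm(log(mpg) ~ log(wt), data = mt)         # log-log starting point
m2 <- lm(log(mpg) ~ log(wt) + cyl, data = mt)   # add a categorical variable

levels(mt$cyl)[1]   # current reference category
AIC(m1, m2)         # lower AIC is preferred

# Change the reference category with relevel(), then refit if needed:
mt$cyl <- relevel(mt$cyl, ref = "8")
levels(mt$cyl)[1]
```

Saving each model under its own name is what makes the AIC() comparison possible.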

Help, errors, etc.

Help

Getting help in R is easy:
- Package or function documentation: ?name_of_function
- Online: CRAN is the official R package repository
- Online: most packages will have a GitHub repo
Help outside R:
- Stack Overflow / Stack Exchange
- R-bloggers
- bookdown books

Error messages

Denoted in red in the console
Try and read them fully and understand them
They are meant to help
You’ll develop a love-hate relationship with them
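
One way to practise reading error messages is to trigger one on purpose. The sketch below uses base R's `tryCatch()` to capture the message text instead of stopping; the faulty call `log("five")` is just an example.

```r
# Trigger a deliberate error and capture its message instead of stopping.
msg <- tryCatch(
  log("five"),                              # log() needs a number, not text
  error = function(e) conditionMessage(e)   # return the message as a string
)
msg
```

The message says what R expected (a numeric argument), which is usually the quickest pointer to the fix.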

R Books

R books

Key book for learning geographic data-science

https://geocompr.robinlovelace.net

Other great books:
- https://uk.sagepub.com/en-gb/eur/an-introduction-to-r-for-spatial-analysis-and-mapping/book258267
- https://r4ds.had.co.nz

Not just quants

Text mining tweets
Twitter still has an open API for data science
Can be used for downloading selections of tweets
Can give really interesting (or really wrong) insights

Trump tweets

All of the following is from:

http://varianceexplained.org/r/trump-tweets/
This website gives code and interpretation
It’s fascinating work, and typical of R:
- Open, shared, reproducible

Time of day

varianceexplained.org/r/trump-tweets

Picture yes or no

varianceexplained.org/r/trump-tweets

Words in tweet

varianceexplained.org/r/trump-tweets

Words in tweet

varianceexplained.org/r/trump-tweets

For you to explore

Authoring books with R-Markdown (Yihui Xie)

https://bookdown.org/yihui/bookdown/

Shiny apps (Hadley Wickham)

https://mastering-shiny.org

GGPlot graphics (Hadley Wickham)

https://ggplot2-book.org

General fun and interesting

https://r-bloggers.com

The last bookmark you’ll ever need

https://www.bigbookofr.com/index.html

Final words

Reproducible science

“Reproducible science helps you do science, better science”

Reproducing science is a valid way of “doing” science
Reproducible science reduces errors in new studies
Reproducible science reduces publication bias, effect size bias, etc.
Reproducible science helps you do science
Better science
Reproducing science: see Kuhn, The Structure of Scientific Revolutions

Future of science

Open
- Access
- Source
Reproducible

It’s also a great way of getting cited…

https://ejd.econ.mathematik.uni-ulm.de

Get in touch

If you try R and want to discuss
Get in touch
- with me (r.h.rijnks@rug.nl)
- or with whoever published the package you are working with; they are usually happy to help

Be an Rsehole!!