Introduction

The month that changed my life

¡¡¡¡¡¡ Freedom !!!!!!

Workflow

SPSS

ArcGIS

Notepad

GWR4.0

Notepad

ArcGIS

Word

R-based analysis

# Load data
data(LondonHP)
# Create distance matrix
DM <- gw.dist(dp.locat=coordinates(londonhp))
# Create optimized bandwidth
bw1 <- bw.gwr(PURCHASE~FLOORSZ, data=londonhp, kernel = "gaussian",dMat=DM)
## Fixed bandwidth: 28008.52 CV score: 901202470969 
## Fixed bandwidth: 17313.68 CV score: 842907727581 
## Fixed bandwidth: 10703.9 CV score: 736181883398 
## Fixed bandwidth: 6618.837 CV score: 607130814353 
## Fixed bandwidth: 4094.128 CV score: 529769270141 
## Fixed bandwidth: 2533.772 CV score: 496244493691 
## Fixed bandwidth: 1569.419 CV score: 558268461315 
## Fixed bandwidth: 3129.775 CV score: 504912379213 
## Fixed bandwidth: 2165.422 CV score: 500148436808 
## Fixed bandwidth: 2761.425 CV score: 498237171785 
## Fixed bandwidth: 2393.075 CV score: 496399101924 
## Fixed bandwidth: 2620.728 CV score: 496732595196 
## Fixed bandwidth: 2480.03 CV score: 496150911653 
## Fixed bandwidth: 2446.816 CV score: 496183570335 
## Fixed bandwidth: 2500.558 CV score: 496166149711 
## Fixed bandwidth: 2467.344 CV score: 496154814854 
## Fixed bandwidth: 2487.871 CV score: 496153635785 
## Fixed bandwidth: 2475.184 CV score: 496151178429 
## Fixed bandwidth: 2483.025 CV score: 496151494463 
## Fixed bandwidth: 2478.179 CV score: 496150836405 
## Fixed bandwidth: 2477.035 CV score: 496150899230 
## Fixed bandwidth: 2478.886 CV score: 496150839373 
## Fixed bandwidth: 2477.742 CV score: 496150850527 
## Fixed bandwidth: 2478.449 CV score: 496150833774 
## Fixed bandwidth: 2478.616 CV score: 496150834475 
## Fixed bandwidth: 2478.346 CV score: 496150834229 
## Fixed bandwidth: 2478.513 CV score: 496150833832 
## Fixed bandwidth: 2478.41 CV score: 496150833868 
## Fixed bandwidth: 2478.474 CV score: 496150833765 
## Fixed bandwidth: 2478.489 CV score: 496150833779 
## Fixed bandwidth: 2478.465 CV score: 496150833764 
## Fixed bandwidth: 2478.459 CV score: 496150833766 
## Fixed bandwidth: 2478.468 CV score: 496150833764 
## Fixed bandwidth: 2478.47 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.466 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.468 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764 
## Fixed bandwidth: 2478.467 CV score: 496150833764
# Run GWR
gwr.res1 <- gwr.basic(PURCHASE~FLOORSZ, data=londonhp, 
                      bw=bw1,kernel = "gaussian", dMat=DM)
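
To make the idea behind the call above more concrete, here is a minimal base-R sketch of what a geographically weighted regression does at each location: it fits an ordinary regression in which nearby observations count more than distant ones. Everything in it is an assumption for illustration: the data are simulated (not londonhp), the helper names gauss_weights and local_fit are hypothetical, and the kernel is the usual Gaussian form exp(-0.5 * (d / bw)^2). GWmodel itself does considerably more (bandwidth search, diagnostics, spatial output).

```r
# Simulated stand-in data: NOT the londonhp data set
set.seed(1)
n  <- 200
xy <- cbind(runif(n, 0, 10000), runif(n, 0, 10000))  # hypothetical coordinates in metres
floorsz <- runif(n, 50, 200)

# The true slope drifts from west to east, so one global slope hides the pattern
slope    <- 1000 + 0.2 * xy[, 1]
purchase <- 50000 + slope * floorsz + rnorm(n, sd = 5000)

# Gaussian kernel: weight 1 at distance 0, shrinking smoothly with distance
gauss_weights <- function(d, bw) exp(-0.5 * (d / bw)^2)

# Weighted least squares centred on observation i
local_fit <- function(i, bw) {
  d <- sqrt((xy[, 1] - xy[i, 1])^2 + (xy[, 2] - xy[i, 2])^2)
  w <- gauss_weights(d, bw)
  coef(lm(purchase ~ floorsz, weights = w))
}

bw <- 2500  # fixed bandwidth in metres, chosen by hand for this sketch
local_coefs <- t(sapply(seq_len(n), local_fit, bw = bw))

# One slope per location instead of one slope for the whole map
summary(local_coefs[, "floorsz"])
```

The local slopes follow the west-to-east drift built into the simulated data, which is the kind of spatial variation the GWR output is meant to reveal.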

Output

# Make map interactive
tmap_mode("view")
# Take spatial data, change format, make sure projection is correct, create map
gwr.res1$SDF %>% st_as_sf() %>% st_set_crs(27700) %>% 
  tm_shape()+tm_dots(col="FLOORSZ", palette="RdYlGn") -> plot1

Plot

Conclude

  • R is brilliant
  • It made my life a lot easier
  • Coming up next:
    • A very very short introduction, covering some of the basics
    • Too much material for the time we have
    • Goal is to (hopefully) enthuse and show what’s possible
    • Links to (free!) textbooks and resources at the end

Goals

By the end of this session you will be able to:

  • Install packages, open packages
  • Open data
  • Create simple tables
  • Create simple models
  • Create maps with data and models

So how does that help you?

What can you do in R?

  • Stop doing repetitive tasks

Wiersma et al. 2022

What can you do in R?

  • Do reproducible science
    • Quantitative
    • And very much qualitative as well
  • Create presentations

What can you do in R?

  • Write entire papers

Rijnks et al. 2022

What can you do in R?

  • R does everything better than any other software package

Using R

What do you need to know?

  1. Overview of RStudio

  2. Basic R commands

    • Installing and loading packages
    • Loading data
    • Viewing data
    • Brief overview of data types
    • Simple models
    • Find help
    • Learn to love errors

Two

  3. Links to books / examples

    • Book on geographic data science with R
    • Book on text mining with R
    • Some fun and useful links

How does R work?

  • Stand-alone
  • In IDE (e.g. RStudio)

Stand-alone

RStudio

REPL vs Quarto

  • REPL: Instantaneous, but fleeting
  • Quarto: bit more faff to write, but saves as document
  • REPL: Could do comments using #
  • Quarto: Text / presentation / notes / descriptions separate
  • REPL: Model output, but barely legible
  • Quarto: Lovely tables and figures

Start a project

  1. File –> New Project

  2. New Directory –> New Project

  3. Give it a name (e.g. Winterschool) and location (e.g. Desktop)

  4. Start a new file in that project: Quarto Document

  5. Sort of optional: Change the “editor” from “visual” to “source”

Installing and loading packages

  • install.packages in REPL
    • once installed you have them accessible on your pc
  • load packages in Quarto
    • that way you can load only project-specific packages
  • How to make R chunk in Quarto?

Installing and loading packages

# Install the packages
install.packages("tidyverse")
install.packages("sf")
install.packages("tmap")
install.packages("GWmodel")
install.packages("RColorBrewer")
install.packages("kableExtra")
install.packages("gridExtra")
install.packages("xtable")
# Load the packages (note the lack of quotation marks)

library(tidyverse)
library(sf)
library(tmap)
library(GWmodel)
library(RColorBrewer)
library(kableExtra)
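
Since install.packages() downloads a package again every time it runs, it can help to install only what is missing. A minimal sketch, assuming the same package list as above; the helper name missing_packages is hypothetical, and the install line is left commented out so the snippet has no side effects:

```r
# Packages used in this session
pkgs <- c("tidyverse", "sf", "tmap", "GWmodel",
          "RColorBrewer", "kableExtra", "gridExtra")

# Return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)
to_install

# Uncomment to actually install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Run this once in the REPL; the library() calls still belong in the Quarto document.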

Data basics

Loading data

# Data as part of packages
data(LondonHP)
# Change from SpatialPointsDataFrame (old) to SF (modern)
london_sf <- londonhp %>% st_as_sf() %>% st_set_crs(27700)

# CSV files
#read.csv("path_to_data")

# Semicolon-separated CSV files with decimal commas, common in Dutch exports (try this if read.csv gives you many observations of 1 variable)
#read.csv2("path_to_data")

# R's own data files (.Rda / .RData)
#load("name_of_data.Rda")

# SPSS files? See package: "foreign" or "haven". Stata13 files? see package "readstata13"

# Shapefiles? st_read("path_to_shape.shp")
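
The difference between read.csv and read.csv2 is easiest to see on a small file. A minimal sketch, assuming two made-up files written to a temporary location (the column names and values are invented for illustration):

```r
# A semicolon-separated file, as many Dutch programs export it
tmp <- tempfile(fileext = ".csv")
writeLines(c("price;floorsz", "250000;80", "310000;95"), tmp)

wrong <- read.csv(tmp)    # expects commas: everything ends up in one column
right <- read.csv2(tmp)   # expects semicolons: two proper columns

dim(wrong)
dim(right)

# read.csv2 also treats the comma as the decimal mark
tmp2 <- tempfile(fileext = ".csv")
writeLines(c("price;floorsz", "250000,5;80,2", "310000,0;95,7"), tmp2)
dec <- read.csv2(tmp2)
str(dec)
```

If dim() reports a single variable where you expected several, the separator is the first thing to check.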

Arrow gets…

  • Typing “<-” means object on left “gets” whatever is on the right
  • Typing “->” vice versa
  • Convention: “<-”
  • Try it in REPL:
    • test <- 5
    • 2 * test
    • test_squared <- test * test
    • sqrt(test_squared)

Now what?

Head

Maybe some documentation

  • Bring up data documentation
?LondonHP

Get to work

  1. Load data
  2. How many:
    • Garages
    • Post-war
    • From the 90s
  3. Correlate
    • Price x Professional / managerial
    • Floorsize x unemployed
  4. Plot map of purchase price

Useful commands

- use the dollar sign to access variables:
london_sf$BEDS2
- table(x) creates simple tables
- cor(x,y) correlates variables
- cor.test(x,y) runs correlation test
- draw map: 
qtm(shapefilename, dots.col="VARIABLE_YOU_WANT")
- note the quotation marks
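
The table and correlation commands above can be tried out on a small simulated data frame first. This is a sketch under stated assumptions: toy is made-up data that only borrows the column names GARAGE1, FLOORSZ and PURCHASE, so the numbers say nothing about London.

```r
# Simulated stand-in data (not london_sf)
set.seed(42)
toy <- data.frame(
  GARAGE1 = rbinom(100, 1, 0.3),   # 1 = has a garage, 0 = no garage
  FLOORSZ = runif(100, 40, 250)    # floor size
)
toy$PURCHASE <- 50000 + 1200 * toy$FLOORSZ + rnorm(100, sd = 20000)

# How many houses have a garage?
table(toy$GARAGE1)

# Correlation coefficient only
cor(toy$FLOORSZ, toy$PURCHASE)

# Correlation with test statistic, confidence interval and p-value
ct <- cor.test(toy$FLOORSZ, toy$PURCHASE)
ct$estimate
ct$p.value
```

Swapping toy for london_sf gives the answers for the exercise.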

Improving map

london_sf %>% mutate(log_purch = log(PURCHASE)) %>%
tm_shape() +
  tm_dots(col="log_purch", palette="-RdYlGn")
# to see more options go to:
# vignette("tmap-getstarted")

Modelling

Brief intro to modelling

  • Model is an abstraction / simplification of reality
  • “All models are wrong [therefore] the scientist must be alert to what is importantly wrong” (Box, 1976)
  • The standard intro into modelling:
    • Correlation
    • ANOVA
    • Linear regression
    • And their expansions / non-parametric equivalents
  • Simple, usually wrong, superseded by more complicated models

Boston model

  • Boston Marathon: very difficult to get into
  • Qualifying time: age + gender stratified
    • Example: 18-34 years, men: 3:00, women: 3:30 (hours:minutes)
  • Bib numbers are assigned sequentially by qualifying time
  • You’d expect finish time to be linked to qual time
  • Model: \[Time_{finish} = \alpha + \beta * Time_{quali} + \epsilon\]
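
The model above can be fitted in a few lines. A minimal sketch on simulated runners: the times, the sample size and the "true" values of alpha and beta are all invented for illustration and are not the Boston data.

```r
# Simulated runners (not real Boston Marathon data)
set.seed(2017)
n      <- 500
quali  <- runif(n, 170, 240)                       # qualifying time in minutes
finish <- 15 + 1.0 * quali + rnorm(n, sd = 12)     # finish time in minutes

# Time_finish = alpha + beta * Time_quali + epsilon
fit <- lm(finish ~ quali)

coef(fit)           # estimated alpha (intercept) and beta (slope)

# Runners whose finish time is far from what the model predicts
resid_z <- rstandard(fit)
head(sort(abs(resid_z), decreasing = TRUE))
```

The residuals are where the interesting cases sit: runners who finished much faster or slower than their qualifying time suggests.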

Boston scatterplot

marathoninvestigation.com, 17-2-2017

London HP Example

london_sf %>% st_drop_geometry() %>%
  ggplot(aes(x=FLOORSZ, y=PURCHASE))+
           geom_point() + theme_classic()

London HP Example

Stepping up to linear regression

  • Regression equation \[ y = \alpha + \beta_1 * x_1 + \beta_2 * x_2 + \epsilon\]
  • In R: lm(y ~ x1 + x2, data=name_of_data)
  • Be sure to save the output as an object!
    • lm1 <- lm(y ~ x1 + x2, data=name_of_data)
  • Now you can have fun with it:
    • summary(lm1)
    • plot(lm1)
    • or in Quarto: xtable()

Table neat

london_sf %>% st_drop_geometry() %>% 
  mutate(PURCHASE = log(PURCHASE), FLOORSZ = log(FLOORSZ)) %>%
  lm(PURCHASE ~ FLOORSZ, data=.) -> lm1
library(xtable)
# make sure to mark the chunk with  results='asis'
print(xtable(summary(lm1)),type='html')
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.4775     0.1986   42.69   0.0000
FLOORSZ       0.7611     0.0450   16.92   0.0000
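
Because both PURCHASE and FLOORSZ are logged, the FLOORSZ estimate reads as an elasticity: the percentage change in price that goes with a percentage change in floor size. A minimal sketch of that arithmetic, using the estimate 0.7611 from the table; the 10% increase is only an illustrative choice.

```r
beta     <- 0.7611   # FLOORSZ estimate from the log-log model
increase <- 0.10     # a 10% larger floor size (illustrative)

# Exact percentage change in price implied by the log-log model
pct_change <- ((1 + increase)^beta - 1) * 100
round(pct_change, 1)

# Rule of thumb for small changes: % change in price is about beta * % change in size
beta * increase * 100
```

Under this model a 10% larger floor goes with a price that is roughly 7.5% higher, slightly below the rule-of-thumb value because the relationship is curved.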

Improve model

Your turn

  1. Starting from the log-log model
  2. Build a parsimonious model for house prices
    • Be aware of dummy variables: what is your reference category?
    • Make sure to save your models
    • For model selection (which one is best)
      • AIC()
  3. Interpret the model
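
The two pitfalls in this exercise, reference categories and model selection, can be sketched as follows. The data are simulated and the TYPE variable with its three levels is hypothetical; it only stands in for the house-type dummies in the real data.

```r
# Simulated stand-in data (not london_sf)
set.seed(7)
n   <- 300
toy <- data.frame(
  FLOORSZ = runif(n, 40, 250),
  TYPE    = factor(sample(c("flat", "terraced", "detached"), n, replace = TRUE))
)
type_effect   <- c(flat = 0, terraced = 0.15, detached = 0.40)
toy$log_price <- 8.5 + 0.75 * log(toy$FLOORSZ) +
  unname(type_effect[as.character(toy$TYPE)]) + rnorm(n, sd = 0.2)

# R picks the first level alphabetically ("detached") as the reference category.
# Set it explicitly so the coefficients compare each type against flats.
toy$TYPE <- relevel(toy$TYPE, ref = "flat")

# Save each model as an object so they can be compared afterwards
m1 <- lm(log_price ~ log(FLOORSZ), data = toy)
m2 <- lm(log_price ~ log(FLOORSZ) + TYPE, data = toy)

coef(m2)      # TYPEterraced and TYPEdetached are differences relative to flats
AIC(m1, m2)   # lower AIC = better trade-off between fit and complexity
```

With several candidate models, AIC() accepts them all at once and returns one row per model.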

Help, errors, etc.

Help

  • Getting help in R is easy:
    • Package or function documentation: ?name_of_function
    • Online: CRAN (cran.r-project.org) is the official R package archive
    • Online: most packages will have a GitHub repo
  • Help outside R:
    • Stack Exchange / Stack Overflow
    • r-bloggers
    • bookdowns

Error messages

  • Denoted in red in the console
  • Try and read them fully and understand them
  • They are meant to help
  • You’ll develop a love-hate relationship with them
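
A small example of reading an error instead of fearing it. This sketch triggers a common beginner error on purpose and captures the message with tryCatch so the script keeps running; in the console you would simply see the message in red.

```r
# Adding a number to text fails: "5" is a character string, not a number
result <- tryCatch(
  "5" + 1,
  error = function(e) conditionMessage(e)
)
result   # the error message, stored as plain text

# The message points at the cause: one argument is not numeric.
# Converting the text to a number fixes it.
fixed <- as.numeric("5") + 1
fixed
```

Most error messages name either the function that failed or the argument it did not like, which is usually enough to search for a fix.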

R Books

R books

Not just quants

  • Text mining tweets
  • Twitter had an open API for data science when these examples were made (free access was restricted in 2023, so check the current terms)
  • Can be used for downloading selections of tweets
  • Can give really interesting (or really wrong) insights
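
At its simplest, text mining means splitting text into words and counting them. A minimal base-R sketch on three made-up sentences (not real tweets); dedicated packages add a lot on top of this, but the core idea is the same.

```r
# Made-up example texts (not real tweets)
texts <- c("R makes maps",
           "R makes models and maps",
           "Maps are fun")

# Lower-case everything, split on anything that is not a letter, drop empty strings
words <- unlist(strsplit(tolower(texts), "[^a-z]+"))
words <- words[nzchar(words)]

# Count how often each word occurs, most frequent first
freq <- sort(table(words), decreasing = TRUE)
freq
```

From a table like this it is a short step to comparing word use between groups of texts.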

Trump tweets

Time of day

varianceexplained.org/r/trumptweets

Picture yes or no

varianceexplained.org/r/trumptweets

Words in tweet

varianceexplained.org/r/trumptweets

Words in tweet

varianceexplained.org/r/trumptweets

For you to explore

The last bookmark you’ll ever need

Final words

Reproducible science

“Reproducible science helps you do science, better science”

  • Reproducing science is a valid way of “doing” science

  • Reproducible science reduces errors in new studies

  • Reproducible science reduces publication bias, effect size bias, etc.

  • Reproducible science helps you do science

  • Better science

  • Reproducing science: Kuhn, The Structure of Scientific Revolutions

Future of science

Get in touch

  • If you try R and want to discuss
  • Get in touch
    • with me (r.h.rijnks@rug.nl)
    • or with whoever published the package you are working on. They are usually happy to help

Be an Rsehole!!