SPUR R Lecture - Session 1

J. Kavanagh

2022-07-14

Introduction

This lecture is designed to show you the potential for using R for spatial analysis when applied to historical questions.

You must have the following installed on your computer before we begin.

Open RStudio and type the following in the console to display the current Working Directory

getwd()
## [1] "/Users/jackkavanagh/Dropbox/R_SPUR"

Set the Working Directory using this command, placing the path to your folder in quotation marks inside the brackets

    setwd()
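As a minimal sketch of how setwd() is called, the example below switches to a folder and back again. The temporary folder returned by tempdir() is only a stand-in for your own project folder, an assumption made so that the example runs on any computer.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# tempdir() is a stand-in for your own folder (the one that will hold the lecture data)
setwd(tempdir())

# Confirm that the working directory has changed
getwd()

# Return to the original working directory
setwd(old_wd)
```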

Load the Libraries

Load the following libraries

library(tidyverse)
library(lubridate)
library(ggthemes)
library(reshape2)
library(historydata)
library(HistData)

After you’ve loaded the libraries, place the file SPUR.RData in your Working Directory and import it using the following command

load('SPUR.RData')

There should be a number of distinct dataframes:
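One way to check what a .RData file contained is to keep the value returned by load(), which is the names of the objects it restored. The sketch below uses a small stand-in file with two hypothetical dataframes (demo_a and demo_b) rather than SPUR.RData itself, so the names shown are not those of the lecture data.

```r
# Build a stand-in .RData file in a temporary location (hypothetical objects)
demo_a <- data.frame(x = 1:3)
demo_b <- data.frame(y = c("a", "b"))
demo_file <- file.path(tempdir(), "demo.RData")
save(demo_a, demo_b, file = demo_file)

# Remove the objects, then load them back from the file
rm(demo_a, demo_b)
loaded_names <- load(demo_file)

# load() returns the names of the objects it restored
loaded_names

# Check which of the loaded objects are dataframes
is_df <- sapply(mget(loaded_names), is.data.frame)
is_df
```

With the lecture data, the same pattern would be loaded_names <- load('SPUR.RData').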

Key Commands to Remember

The %>% pipe operator will be used throughout this lecture to chain commands together. It passes the result on its left to the command on its right

    early_colleges %>% select(established)

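The %in% operator checks whether each value in a column matches any value in a given vector. Because established is stored as an integer column, the years are written as numbers rather than quoted strings

    early_colleges %>% filter(established %in% c(1795, 1797, 1802))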

The $ operator extracts a single column from a dataframe. In RStudio, typing $ after the name of a dataframe brings up a list of its columns to choose from

    early_colleges$college

Please run these now in the Console section of RStudio
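The three operators can also be tried on a small table. This sketch uses colleges_sample, a hypothetical stand-in built by hand from a few rows of early_colleges, so that it runs even before the lecture data has been loaded.

```r
library(dplyr)

# A small stand-in for early_colleges (four rows copied by hand)
colleges_sample <- tibble(
  college     = c("Harvard", "William and Mary", "Columbia", "Brown"),
  state       = c("MA", "VA", "NY", "RI"),
  established = c(1636L, 1693L, 1754L, 1765L)
)

# %>% passes the object on its left to the function on its right
piped <- colleges_sample %>% select(established)

# The same call written without the pipe gives the same result
nested <- select(colleges_sample, established)
identical(piped, nested)

# %in% tests each value of a column against a set of candidates
in_ny_or_va <- colleges_sample$state %in% c("NY", "VA")
in_ny_or_va

# $ pulls out one column as a plain vector
colleges_sample$college
```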

Explore the Data

The head() and tail() commands are useful for exploring the datasets. By default, head() shows the first six rows and tail() shows the last six. This is particularly useful when importing data, as a quick check that all the information has been read in correctly.

head(early_colleges)
## # A tibble: 6 × 6
##   college                original_name       city  state established sponsorship
##   <chr>                  <chr>               <chr> <chr>       <int> <chr>      
## 1 Harvard                <NA>                Camb… MA           1636 Congregati…
## 2 William and Mary       <NA>                Will… VA           1693 Anglican   
## 3 Yale                   <NA>                New … CT           1701 Congregati…
## 4 Pennsylvania, Univ. of <NA>                Phil… PA           1740 Nondenomin…
## 5 Princeton              College of New Jer… Prin… NJ           1746 Presbyteri…
## 6 Columbia               King's College      New … NY           1754 Anglican
tail(early_colleges)
## # A tibble: 6 × 6
##   college             original_name city      state established sponsorship   
##   <chr>               <chr>         <chr>     <chr>       <int> <chr>         
## 1 Beloit              <NA>          Beloit    WI           1846 Congregational
## 2 Bucknell            <NA>          Lewisburg PA           1846 Baptist       
## 3 Grinnell            <NA>          Grinnell  IA           1846 Congregational
## 4 Mount Union         <NA>          Alliance  OH           1846 Methodist     
## 5 Earlham             <NA>          Richmond  IN           1847 Quaker        
## 6 Wisconsin, Univ. of <NA>          Madison   WI           1848 Secular
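Both commands accept an n argument to change how many rows are returned. A short sketch, using a hypothetical 20-row dataframe called demo_rows instead of the lecture data:

```r
# A stand-in dataframe with 20 rows (hypothetical data)
demo_rows <- data.frame(id = 1:20, value = (1:20)^2)

# With no n argument, six rows are returned from each end
nrow(head(demo_rows))
nrow(tail(demo_rows))

# Ask for ten rows explicitly with n
first_ten <- head(demo_rows, n = 10)
last_ten  <- tail(demo_rows, n = 10)
first_ten
last_ten
```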

A bird’s eye view of the dataset

Another way to view an entire dataset is to use the glimpse() command, which lists every column together with its data type and its first few values

glimpse(early_colleges)
## Rows: 65
## Columns: 6
## $ college       <chr> "Harvard", "William and Mary", "Yale", "Pennsylvania, Un…
## $ original_name <chr> NA, NA, NA, NA, "College of New Jersey", "King's College…
## $ city          <chr> "Cambridge", "Williamsburg", "New Haven", "Philadelphia"…
## $ state         <chr> "MA", "VA", "CT", "PA", "NJ", "NY", "RI", "NJ", "NH", "S…
## $ established   <int> 1636, 1693, 1701, 1740, 1746, 1754, 1765, 1766, 1769, 17…
## $ sponsorship   <chr> "Congregational; after 1805 Unitarian", "Anglican", "Con…

The summary() command provides an overview of the dataset. It is most informative for numerical columns such as established; for character columns it reports only the length, class and mode

summary(early_colleges)
##    college          original_name          city              state          
##  Length:65          Length:65          Length:65          Length:65         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   established   sponsorship       
##  Min.   :1636   Length:65         
##  1st Qu.:1793   Class :character  
##  Median :1823   Mode  :character  
##  Mean   :1810                     
##  3rd Qu.:1838                     
##  Max.   :1848
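The difference between numerical and character data can be seen by calling summary() on single vectors. The values below are copied by hand from a few rows of early_colleges, and table() is suggested here as one possible base R way to count character values; it is not part of the lecture's own code.

```r
# Founding years and states copied by hand from six rows of early_colleges
established <- c(1636L, 1693L, 1754L, 1765L, 1846L, 1848L)
state <- c("MA", "VA", "NY", "RI", "WI", "WI")

# Numeric data: minimum, quartiles, median, mean and maximum
est_summary <- summary(established)
est_summary

# Character data: summary() reports only length, class and mode
summary(state)

# table() counts how often each value occurs
state_table <- table(state)
state_table
```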

Selecting and Filtering the Data

The select() command from ‘dplyr’ picks out columns by name. It is very versatile and can be linked to other commands in sequence using the %>% pipe

early_colleges %>% select(college, city, state) %>% head()
## # A tibble: 6 × 3
##   college                city         state
##   <chr>                  <chr>        <chr>
## 1 Harvard                Cambridge    MA   
## 2 William and Mary       Williamsburg VA   
## 3 Yale                   New Haven    CT   
## 4 Pennsylvania, Univ. of Philadelphia PA   
## 5 Princeton              Princeton    NJ   
## 6 Columbia               New York     NY

The filter() command, also from ‘dplyr’, keeps only the rows that meet a condition. The condition can be numerical or text based.

This example shows the first six colleges established prior to 1800

early_colleges %>% filter(established < 1800) %>% head()
## # A tibble: 6 × 6
##   college                original_name       city  state established sponsorship
##   <chr>                  <chr>               <chr> <chr>       <int> <chr>      
## 1 Harvard                <NA>                Camb… MA           1636 Congregati…
## 2 William and Mary       <NA>                Will… VA           1693 Anglican   
## 3 Yale                   <NA>                New … CT           1701 Congregati…
## 4 Pennsylvania, Univ. of <NA>                Phil… PA           1740 Nondenomin…
## 5 Princeton              College of New Jer… Prin… NJ           1746 Presbyteri…
## 6 Columbia               King's College      New … NY           1754 Anglican

Filtering (cont.)

This example shows the colleges in the state of New York, using the == (equal to) operator

early_colleges %>% filter(state == "NY")
## # A tibble: 6 × 6
##   college               original_name  city        state established sponsorship
##   <chr>                 <chr>          <chr>       <chr>       <int> <chr>      
## 1 Columbia              King's College New York    NY           1754 Anglican   
## 2 Union College         <NA>           Schenectady NY           1795 Presbyteri…
## 3 U.S. Military Academy <NA>           West Point  NY           1802 Secular    
## 4 Colgate               <NA>           Hamilton    NY           1819 Baptist    
## 5 New York Univ.        <NA>           New York    NY           1831 Nondenomin…
## 6 Fordham               <NA>           Fordham     NY           1841 Roman Cath…

Using the != (not equal to) operator displays all colleges that are not in New York

early_colleges %>% filter(state != "NY") %>% head()
## # A tibble: 6 × 6
##   college                original_name       city  state established sponsorship
##   <chr>                  <chr>               <chr> <chr>       <int> <chr>      
## 1 Harvard                <NA>                Camb… MA           1636 Congregati…
## 2 William and Mary       <NA>                Will… VA           1693 Anglican   
## 3 Yale                   <NA>                New … CT           1701 Congregati…
## 4 Pennsylvania, Univ. of <NA>                Phil… PA           1740 Nondenomin…
## 5 Princeton              College of New Jer… Prin… NJ           1746 Presbyteri…
## 6 Brown                  <NA>                Prov… RI           1765 Baptist

The simple logical operators for the filter() command are:

    & (and)

    | (or)

    ! (not)
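A minimal sketch of the three operators in use follows. It relies on colleges_sample, a hypothetical stand-in built by hand from six rows of early_colleges, so the results cover only those rows.

```r
library(dplyr)

# A small stand-in for early_colleges (six rows copied by hand)
colleges_sample <- tibble(
  college     = c("William and Mary", "Columbia", "Hampden-Sydney",
                  "U.S. Military Academy", "Colgate", "Wisconsin, Univ. of"),
  state       = c("VA", "NY", "VA", "NY", "NY", "WI"),
  established = c(1693L, 1754L, 1775L, 1802L, 1819L, 1848L),
  sponsorship = c("Anglican", "Anglican", "Presbyterian",
                  "Secular", "Baptist", "Secular")
)

# & (and): both conditions must hold
ny_before_1800 <- colleges_sample %>%
  filter(state == "NY" & established < 1800)

# | (or): at least one condition must hold
ny_or_va <- colleges_sample %>%
  filter(state == "NY" | state == "VA")

# ! (not): reverses a condition
not_secular <- colleges_sample %>%
  filter(!(sponsorship == "Secular"))

ny_before_1800
ny_or_va
not_secular
```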

Filtering (cont.)

This example shows the first six colleges located in New York, Virginia or Massachusetts

early_colleges %>% filter(state %in% c( 'NY','VA','MA')) %>% head()
## # A tibble: 6 × 6
##   college          original_name  city           state established sponsorship  
##   <chr>            <chr>          <chr>          <chr>       <int> <chr>        
## 1 Harvard          <NA>           Cambridge      MA           1636 Congregation…
## 2 William and Mary <NA>           Williamsburg   VA           1693 Anglican     
## 3 Columbia         King's College New York       NY           1754 Anglican     
## 4 Hampden-Sydney   <NA>           Hampden-Sydney VA           1775 Presbyterian 
## 5 Williams         <NA>           Williamstown   MA           1793 Congregation…
## 6 Union College    <NA>           Schenectady    NY           1795 Presbyterian…

Note that the vector used with %in% can also be created first and then referred to by name:

# Create a new character vector of three states using their abbreviations
three_states <- c("NY", "VA", "MA")

# Filter for these states using the %in% operator
early_colleges %>% filter(state %in% three_states) %>% head()
## # A tibble: 6 × 6
##   college          original_name  city           state established sponsorship  
##   <chr>            <chr>          <chr>          <chr>       <int> <chr>        
## 1 Harvard          <NA>           Cambridge      MA           1636 Congregation…
## 2 William and Mary <NA>           Williamsburg   VA           1693 Anglican     
## 3 Columbia         King's College New York       NY           1754 Anglican     
## 4 Hampden-Sydney   <NA>           Hampden-Sydney VA           1775 Presbyterian 
## 5 Williams         <NA>           Williamstown   MA           1793 Congregation…
## 6 Union College    <NA>           Schenectady    NY           1795 Presbyterian…

Creating new variables using count() and mutate()

Counting the values of a specific variable will be one of the most common ways to explore a dataset

early_colleges %>% select(state) %>% count(state)
## # A tibble: 25 × 2
##    state     n
##    <chr> <int>
##  1 CT        3
##  2 DC        2
##  3 GA        2
##  4 IA        1
##  5 IL        1
##  6 IN        2
##  7 KY        1
##  8 LA        1
##  9 MA        6
## 10 MD        2
## # … with 15 more rows
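count() also accepts sort = TRUE, which places the most frequent values first. The sketch below runs on colleges_sample, a hypothetical stand-in built by hand from eight rows of early_colleges, so the counts are for those rows only and not for the full dataset.

```r
library(dplyr)

# A small stand-in for early_colleges (eight rows copied by hand)
colleges_sample <- tibble(
  college = c("Harvard", "William and Mary", "Columbia", "Hampden-Sydney",
              "U.S. Military Academy", "Colgate", "Beloit", "Wisconsin, Univ. of"),
  state   = c("MA", "VA", "NY", "VA", "NY", "NY", "WI", "WI")
)

# Count colleges per state, largest count first
state_counts <- colleges_sample %>% count(state, sort = TRUE)
state_counts

# count(state) is shorthand for grouping and then counting the rows in each group
state_counts_long <- colleges_sample %>%
  group_by(state) %>%
  summarise(n = n())
state_counts_long
```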

To create a new column, in this case merging two geographic attributes into a single column, use mutate(). The sep= argument sets the text placed between the two words. In this example it is a comma with no space after it, e.g. New York,NY

early_colleges %>% mutate(location=paste(city,state,sep=","))
## # A tibble: 65 × 7
##    college            original_name city  state established sponsorship location
##    <chr>              <chr>         <chr> <chr>       <int> <chr>       <chr>   
##  1 Harvard            <NA>          Camb… MA           1636 Congregati… Cambrid…
##  2 William and Mary   <NA>          Will… VA           1693 Anglican    William…
##  3 Yale               <NA>          New … CT           1701 Congregati… New Hav…
##  4 Pennsylvania, Uni… <NA>          Phil… PA           1740 Nondenomin… Philade…
##  5 Princeton          College of N… Prin… NJ           1746 Presbyteri… Princet…
##  6 Columbia           King's Colle… New … NY           1754 Anglican    New Yor…
##  7 Brown              <NA>          Prov… RI           1765 Baptist     Provide…
##  8 Rutgers            Queen's Coll… New … NJ           1766 Dutch Refo… New Bru…
##  9 Dartmouth          <NA>          Hano… NH           1769 Congregati… Hanover…
## 10 Charleston, Coll.… <NA>          Char… SC           1770 Anglican    Charles…
## # … with 55 more rows
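Because the location column is cut off in the printed output, the effect of sep= is easier to see on a small table. This sketch uses a hypothetical two-row stand-in and compares two separators; the column names location_tight and location_spaced are made up for the illustration.

```r
library(dplyr)

# A two-row stand-in with the city and state columns only
colleges_sample <- tibble(
  city  = c("Cambridge", "New York"),
  state = c("MA", "NY")
)

with_location <- colleges_sample %>%
  mutate(
    # sep = "," joins the words with a comma and no space
    location_tight  = paste(city, state, sep = ","),
    # sep = ", " adds a space after the comma
    location_spaced = paste(city, state, sep = ", ")
  )

with_location$location_tight
with_location$location_spaced
```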

Saving new variables

Although mutate() creates a new column, the result is lost unless you save it. To keep the new column, always assign the result back to the original dataframe using the -> operator at the end of the pipeline (or, equivalently, the <- operator at the start).

Some programmers use the = sign for assignment. This is not recommended in R, as = is also used to pass named arguments inside function calls, for example sep="," above.

# Now when you run this code, the number of variables will increase to 7
early_colleges %>% mutate(location=paste(city,state,sep=","))-> early_colleges
early_colleges
## # A tibble: 65 × 7
##    college            original_name city  state established sponsorship location
##    <chr>              <chr>         <chr> <chr>       <int> <chr>       <chr>   
##  1 Harvard            <NA>          Camb… MA           1636 Congregati… Cambrid…
##  2 William and Mary   <NA>          Will… VA           1693 Anglican    William…
##  3 Yale               <NA>          New … CT           1701 Congregati… New Hav…
##  4 Pennsylvania, Uni… <NA>          Phil… PA           1740 Nondenomin… Philade…
##  5 Princeton          College of N… Prin… NJ           1746 Presbyteri… Princet…
##  6 Columbia           King's Colle… New … NY           1754 Anglican    New Yor…
##  7 Brown              <NA>          Prov… RI           1765 Baptist     Provide…
##  8 Rutgers            Queen's Coll… New … NJ           1766 Dutch Refo… New Bru…
##  9 Dartmouth          <NA>          Hano… NH           1769 Congregati… Hanover…
## 10 Charleston, Coll.… <NA>          Char… SC           1770 Anglican    Charles…
## # … with 55 more rows
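The following sketch shows what happens with and without assignment, again on a hypothetical two-row stand-in. The object names right_assigned and left_assigned are made up for the illustration.

```r
library(dplyr)

# A two-row stand-in with the city and state columns only
colleges_sample <- tibble(
  city  = c("Cambridge", "New York"),
  state = c("MA", "NY")
)

# Without assignment the result is printed and then discarded
colleges_sample %>% mutate(location = paste(city, state, sep = ","))
ncol(colleges_sample)

# Assignment at the end of the pipeline with ->
colleges_sample %>%
  mutate(location = paste(city, state, sep = ",")) -> right_assigned

# Assignment at the start of the pipeline with <-
left_assigned <- colleges_sample %>%
  mutate(location = paste(city, state, sep = ","))

# Both directions store the same result
identical(right_assigned, left_assigned)
```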

Using the filter() and mutate() operators together

early_colleges %>% filter(established < 1812) %>% mutate(is_secular=ifelse(sponsorship!="Secular", "no", "yes")) -> secular_colleges_before_1812

secular_colleges_before_1812
## # A tibble: 23 × 8
##    college original_name city  state established sponsorship location is_secular
##    <chr>   <chr>         <chr> <chr>       <int> <chr>       <chr>    <chr>     
##  1 Harvard <NA>          Camb… MA           1636 Congregati… Cambrid… no        
##  2 Willia… <NA>          Will… VA           1693 Anglican    William… no        
##  3 Yale    <NA>          New … CT           1701 Congregati… New Hav… no        
##  4 Pennsy… <NA>          Phil… PA           1740 Nondenomin… Philade… no        
##  5 Prince… College of N… Prin… NJ           1746 Presbyteri… Princet… no        
##  6 Columb… King's Colle… New … NY           1754 Anglican    New Yor… no        
##  7 Brown   <NA>          Prov… RI           1765 Baptist     Provide… no        
##  8 Rutgers Queen's Coll… New … NJ           1766 Dutch Refo… New Bru… no        
##  9 Dartmo… <NA>          Hano… NH           1769 Congregati… Hanover… no        
## 10 Charle… <NA>          Char… SC           1770 Anglican    Charles… no        
## # … with 13 more rows
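A quick way to check a yes/no column like this one is to count its values. The sketch below repeats the same filter() and mutate() steps on colleges_sample, a hypothetical stand-in built by hand from six rows of early_colleges, so the counts do not match the full dataset.

```r
library(dplyr)

# A small stand-in for early_colleges (six rows copied by hand)
colleges_sample <- tibble(
  college     = c("William and Mary", "Columbia", "Brown",
                  "U.S. Military Academy", "Colgate", "Wisconsin, Univ. of"),
  established = c(1693L, 1754L, 1765L, 1802L, 1819L, 1848L),
  sponsorship = c("Anglican", "Anglican", "Baptist",
                  "Secular", "Baptist", "Secular")
)

# Keep colleges founded before 1812, then flag the secular ones
sample_before_1812 <- colleges_sample %>%
  filter(established < 1812) %>%
  mutate(is_secular = ifelse(sponsorship != "Secular", "no", "yes"))

# Count how many rows fall under each answer
secular_counts <- sample_before_1812 %>% count(is_secular)
secular_counts
```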

Class Exercise

Filter the early_colleges to show the colleges established in the original 13 colonies of the United States of America

Create a new object from the early_colleges dataset showing the largest religious sponsorship of colleges

Displaying Research Findings

A useful package for visualising research findings as charts and graphs is ‘ggplot2’. It is included in the ‘tidyverse’ package and follows the guidelines of the ‘Layered Grammar of Graphics’.

The key layers are:

Brief tutorial for ggplot2

We’re going to go through Prof. Chris Brunsdon’s introductory lecture on using ggplot2, available here

# A number of training datasets are included with R; a common one used for tutorials is 'mtcars'

data(mtcars)
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
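As a minimal sketch of how a ggplot2 chart is built up in layers, the example below plots two mtcars columns against each other. It is an illustration written for this handout and is not taken from Prof. Brunsdon's lecture; the choice of wt and mpg is an assumption made for the example.

```r
library(ggplot2)

data(mtcars)

# Start with the data and the aesthetic mapping (which column goes on which axis)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # Add a geometry layer that draws one point per car
  geom_point() +
  # Add axis labels
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Printing the object draws the chart
p
```

Each + adds another layer to the same plot object, so a chart can be built and adjusted one step at a time.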

Plotting the results of the colleges dataset

ggplot(secular_colleges_before_1812) +
geom_bar(aes(x=is_secular, fill=is_secular)) +
labs(x="Is the college secular?") + 
theme_economist()