Advanced GIS, Week 8

Outline

  • Assignment 3

  • File organization

  • Intro to R and R Studio

  • Lab 3 overview & Part 1 deliverables

  • Importing data into R

  • Data manipulation with the tidyverse

  • Exporting data frames

  • Lab 3 deliverables detail

  • Getting help

Assignment 3


Research Proposal:

One PDF (3-5 pages) describing the research question, methods and expected results of the research project.

  • Due date extended: Nov 1

File organization

File organization is very important for keeping track of your data and your analysis.

We’ll all work in the same folders as you learn so that you can follow my lessons more easily.

  • Download the main data folder and save it in your advanced-gis folder.

Use R to process and explore data



From R for Data Science, by Hadley Wickham & Garrett Grolemund

Terms/Definitions

  • R: a programming language and software environment for statistical computing and graphics
  • RStudio: an application that helps you write in R in a user-friendly way
  • R script: an executable document with instructions in R to complete a task (it’s like a recipe!).
  • function: a set of statements organized together to perform a specific task.
  • Base R: the functions that come standard with your R installation.
  • R package: a collection of functions created by the community that other R users can use.

Lecture Format: Quarto


I use Quarto to make slides for this course.

  • They contain text and executable R code together

  • Text in a box is a code chunk - something you can execute in R

  • The output from the code chunk displays below the box

# this is a gray box
print("this is the output from the code chunk")
[1] "this is the output from the code chunk"
  • You can follow along with all of the lectures in R Studio on your computer.

A few things

  • R scripts sometimes run slightly differently on different machines
    (different operating systems, R versions, package versions, etc)
  • You can google R, try it! (“r import csv”)
  • Other people have probably had the same problem as you or asked the same question
  • Style matters! Make it readable for future you, and for your colleagues
    • Indent for readability
    • Include comments explaining what you are doing/thinking
    • Include sources in your script
  • Learn R shortcuts (cheatsheet)

Lab 3 Structure and Deliverables


  • Part 1 R: Data Manipulation for Mapping
    • due October 25
  • Part 2 R: tidycensus + revisions to part 1
    • due November 1

Lab 3, Part 1 Deliverables


due October 25

  • Deliverable 1 R script to process student poverty data for New York school districts in 2018, save it to your computer, and create summary statistics about student poverty in New York.
  • Deliverable 1.1 (optional) R script to join student poverty data to school district shapefile
  • Deliverable 2 Choropleth map of school districts in New York, styled by student poverty rate
  • Deliverable 3 R script to process student poverty data for New York school districts in 2019, save it to your computer, and create summary statistics about student poverty in New York.

Installation


R is installed on all of the computers in this lab. You can also install it on your own computer if you want to work from home.

  • First, install R

  • Second, install RStudio

Open R Studio


Create a New R Project in your existing folder called main_data

R Studio Layout

The Console pane

The Console is:

  • a calculator: type code, press enter, and the answer displays below.
  • a terminal: warnings and errors are displayed in the console.

Type into your console, and then press enter:

4 + 5
[1] 9


14 - seven

The second line returns an error, because R does not know an object called seven yet.

The Source pane

Use the Source Pane to write R scripts.

  • scripts are like a recipe, a list of instructions for how to process your data

Create your first script

  • In R Studio, select File > New File > R Script
    • Save the script in your main_data/scripts/data_exploration folder as ny_process_school_district_data_2018.R
    • At the top, type “# this is my first script” (make sure you include the hash #)
    • Create a variable called seven
    • Type 14 - seven
    • Highlight both lines with your cursor and click Run in the upper-right of your Source pane
# this is my first script

seven <- 7
14 - seven
[1] 7

Notice

  • A hash # tells R not to run that line of code
    • In R-speak, it’s called “comment out”
  • In R, create a new object with an assignment operator <-
    • keyboard shortcut = Alt+- (Windows) / Option+- (Mac)
  • When you define an object, the console does not display the value
    • type the object name to return the current value
    • you can see all defined objects in the Environment pane
  • Object names must begin with a letter, and only contain letters, numbers, _(underscores) and .(periods).
  • There are lots of different ways to run your script
    • Place your cursor at the end of a line, Cmd+Return (Mac) / Ctrl+Return (Windows)
    • Place your cursor at the end of a line, Click RUN
    • Highlight the code to run, use keyboard shortcut or Click RUN
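
The points above about assignment can be tried out with a minimal sketch like the one below. The object name fourteen is made up for illustration and is not part of the lab script.

```r
# assigning a value prints nothing to the console
seven <- 7

# typing the object name returns its current value
seven

# objects can be used to define other objects
# (fourteen is a hypothetical name used only in this example)
fourteen <- seven * 2
fourteen - seven

# exists() checks whether an object is defined, like looking in the Environment pane
exists("seven")
```

Run the lines one at a time and watch the Environment pane update as each object is created.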

The Files Pane: Files tab


The Files window is like File Explorer: you can see all of the files on your computer.

Notice

We created a new R Project in the main_data folder.

  • That tells R Studio to define main_data as your working directory
  • Your Files pane should automatically be in your main_data folder

The Files Pane: Packages tab

The Packages window lists the packages you have installed and provides a user interface to search for other packages and install them.

Packages are collections of functions and datasets developed by the R community to expand the things you can do in R.

  • Some, like the tidyverse, have become the backbone of analysis in R and we use it all the time
  • Install the tidyverse by typing into your console:
install.packages("tidyverse")
  • You only need to install a package onto your computer once
  • But you need to load any package you use at the top of each script with library(), so R knows you are using it

The Environment pane


The Environment shows all of the objects that you have in your workspace

  • If you are following along, you should have at least 1 object in your Environment: the variable “seven”.

Import a csv

  • We will import a csv of every school district in New York that is in your main_data/raw/school_district folder

In your script:

  • Delete your practice equations
  • At the top of your script, describe what you will do in this script (see below)
  • Load the tidyverse package into your environment
  • “read in” your csv into R Studio
## Process 2018 student poverty data for every school district in New York

library(tidyverse)

## import the student poverty dataset for 2018 using read_csv from the readr package of the tidyverse
# source = https://www.census.gov/data/datasets/2018/demo/saipe/2018-school-districts.html
raw_sd_poverty_18 <- read_csv("raw/school_district/ny_student_poverty_2018.csv")

File paths and working directories


The specification of the list of folders to get to a file on your computer is called a path.

  • Absolute path: starts at the root folder of the computer

"/Users/sarahodges/spatial/main_data/raw/school_district/ny_student_poverty_2018.csv"

  • Relative path: starts at a given folder and provides the path starting from that folder

"raw/school_district/ny_student_poverty_2018.csv"

Windows and filepaths


  • Windows file paths use backslashes (R doesn’t like them); use forward slashes in R instead
  • If you have issues with an absolute filepath, use the file.path() function
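
As a small sketch of the two points above: file.path() builds a path from folder names, and a path copied from Windows needs its backslashes doubled or swapped. The Windows folder shown here is a made-up example, not a path from the lab.

```r
# file.path() joins folder names with forward slashes,
# which R accepts on Windows, Mac and Linux
csv_path <- file.path("raw", "school_district", "ny_student_poverty_2018.csv")
csv_path

# a path copied from Windows uses backslashes; inside an R string each
# backslash must be typed twice (this folder is a hypothetical example)
windows_path <- "C:\\Users\\student\\advanced-gis\\main_data"

# swap the backslashes for forward slashes
fixed_path <- gsub("\\", "/", windows_path, fixed = TRUE)
fixed_path
```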

Data frames

Data tables are called dataframes in R.

  • When you read a table into your R Environment, none of the changes that you make to the dataframe are saved to your computer until you save the dataframe to your computer (in R we say write it out).

Let’s explore our first data frame:

  • To view your dataframe, click on the name in your Environment pane
Postal  FIPS  district_id  Name                                Estimated_total_pop  Estimated_pop_age_5_17  Estimated_pop_age_5_17_in_poverty
NY      36    3602370      Addison Central School District     6887                 1228                    277
NY      36    3605040      Adirondack Central School District  8456                 1346                    209
NY      36    3602400      Afton Central School District       3723                 589                     111
NY      36    3602430      Akron Central School District       9644                 1517                    157
NY      36    3602460      Albany City School District         98781                11020                   3259

Basic data types in R

Type glimpse(raw_sd_poverty_18) in the console to look at a snapshot of your dataframe

  • Numeric
    • Integers (whole numbers)
    • Doubles (fractions)
  • Character (string)
  • Logical (boolean) - TRUE or FALSE

Notice

  • Every value in a column has the same type. If there is even one character value in a numeric column, the whole column will be type = character.
  • NA indicates a missing value
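
The sketch below illustrates these data types on made-up values (they are not taken from the lab data). It uses only base R functions: class(), as.numeric(), is.na() and mean().

```r
# class() reports the type of a value or a vector
class(5L)      # "integer"
class(0.25)    # "numeric" (stored as a double)
class("NY")    # "character"
class(TRUE)    # "logical"

# one character value turns the whole vector into character
mixed <- c(6887, 8456, "unknown")
class(mixed)   # "character"

# as.numeric() converts back; values that are not numbers become NA
# (R normally prints a warning, which is silenced here)
pop <- suppressWarnings(as.numeric(mixed))
pop
is.na(pop)     # FALSE FALSE TRUE

# many functions return NA if any value is missing, unless told to skip NAs
mean(pop)                 # NA
mean(pop, na.rm = TRUE)   # 7671.5
```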

tidyverse data processing functions

We’ll learn five functions that are the backbone of data transformation in R

  • mutate() - create or redefine a variable (column)
  • select() - subset variables (columns) by their names
  • filter() - subset observations (rows) by their values
  • rename() - rename the variables
  • summarise() - calculate summary statistics

mutate() - create or redefine a variable

  • Create new data frame from raw_sd_poverty_18
  • New column that equals the student poverty rate
stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty/Estimated_pop_age_5_17)

Notice

  • The tidyverse uses a pipe operator “%>%” to string together multiple commands
    • think of it as meaning “and then”
    • you’ll eventually love the “%>%”
    • keystroke = cmd/ctrl - shift - m
  • To create a new variable mutate(var_name = equation)
    • you use the equal sign within dplyr (tidyverse) functions
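
To see what the pipe does, here is a sketch on a tiny made-up table (toy, kids and kids_in_pov are illustration names, not the lab data). It loads only dplyr, the tidyverse package that provides mutate() and the pipe.

```r
library(dplyr)

# a tiny made-up table (not the lab data)
toy <- tibble(kids = c(100, 200), kids_in_pov = c(10, 50))

# without the pipe: the dataframe is the first argument of mutate()
toy_rate_nested <- mutate(toy, pov_rate = kids_in_pov / kids)

# with the pipe: "take toy, and then mutate"
toy_rate_piped <- toy %>%
  mutate(pov_rate = kids_in_pov / kids)

# compare the two results
all.equal(toy_rate_nested, toy_rate_piped)
toy_rate_piped$pov_rate
```

The piped version becomes easier to read than the nested one as soon as a second or third step is added.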

Create a text column

  • year = “2018”
stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty/Estimated_pop_age_5_17,
         year = "2018")

Notice

  • You can create more than one new column in the same code chunk
    • separate each new variable by a comma
  • Use double quotes around the value if you want the new variable’s data type to be character

select()

  • subset columns by their names

Our student poverty table has 9 columns; we may not need all of them.

Type names(stpov18) in your console to see column names

stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty/Estimated_pop_age_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_total_pop, 
         Estimated_pop_age_5_17, Estimated_pop_age_5_17_in_poverty, stud_pov_rate, year)

script so far

## Process 2018 student poverty data for every school district in New York

library(tidyverse)

## import the student poverty dataset for 2018 using read_csv from the readr package of the tidyverse
# source = https://www.census.gov/data/datasets/2018/demo/saipe/2018-school-districts.html
raw_sd_poverty_18 <- read_csv("raw/school_district/ny_student_poverty_2018.csv")

# create student poverty rate & year column
# select necessary variables
stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty/Estimated_pop_age_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_total_pop, 
         Estimated_pop_age_5_17, Estimated_pop_age_5_17_in_poverty, stud_pov_rate, year)
Postal  district_id  Name                                Estimated_total_pop  Estimated_pop_age_5_17  Estimated_pop_age_5_17_in_poverty  stud_pov_rate  year
NY      3602370      Addison Central School District     6887                 1228                    277                                0.2255700      2018
NY      3605040      Adirondack Central School District  8456                 1346                    209                                0.1552749      2018
NY      3602400      Afton Central School District       3723                 589                     111                                0.1884550      2018

Notice

  • Script format:
    • purpose at the top
    • all packages needed for the script next
    • import all raw data next (name the dataframe “raw_…”)
    • new dataframe to create dataset you want for your analysis
  • Use a pipe operator %>% to add a new function
    • mutate() all of your variables, then %>% to use select()

rename()

  • rename your columns
    • new_name = old_name
    • no spaces in variable names

stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty/Estimated_pop_age_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_total_pop, 
         Estimated_pop_age_5_17, Estimated_pop_age_5_17_in_poverty, stud_pov_rate, year) %>%
  rename(id = district_id,
         district = Name,
         tpop = Estimated_total_pop,
         stpop = Estimated_pop_age_5_17,
         stpov = Estimated_pop_age_5_17_in_poverty,
         stpovrate = stud_pov_rate)

filter()

  • subset observations (rows) by their values

  • Create new dataframe without districts with fewer than 100 children

stpov18_no_low_enroll <- stpov18 %>%
  filter(stpop >= 100)

Notice

  • You can filter on a character variable too
  • You can have multiple arguments in a filter
  • Use “&” (AND) if every expression must be true for a row to be included in the output
  • Use “|” (OR) if any expression can be true for a row to be included in the output
  • “!” means not
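
The operators in this list can be sketched on a few made-up districts. The table toy and its values are invented for illustration; %in% is a base R operator that checks whether a value appears in a list of values.

```r
library(dplyr)

# made-up districts for illustration (not the lab data)
toy <- tibble(
  district = c("A", "B", "C", "D"),
  tpop     = c(300, 5000, 12000, 800),
  stpop    = c(40, 900, 2500, 90)
)

# "|" keeps rows where either condition is true
small_or_large <- toy %>%
  filter(tpop < 500 | tpop > 10000)      # districts A and C

# "!" negates a condition: keep rows that are NOT in the list
not_a_or_b <- toy %>%
  filter(!district %in% c("A", "B"))     # districts C and D

# conditions separated by a comma behave like "&"
both <- toy %>%
  filter(tpop > 500, stpop >= 100)       # districts B and C
```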

filter examples

Try these!

# filter based on text value
nyc <- stpov18 %>% 
  filter(district == "New York City Department Of Education")

# remove new york city
ny_no_nyc <- stpov18 %>% 
  filter(district != "New York City Department Of Education")

# remove all districts with more than 10,000 people
ny_no_large_districts <- stpov18 %>% 
  filter(tpop <= 10000)

# remove all districts with more than 10,000 people or with 500 or fewer people
ny_medium_districts <- stpov18 %>% 
  filter(tpop <= 10000 & tpop > 500)

Write out your dataframe to csv


Save this processed dataframe to your computer

# write out processed student poverty data for 2018
write_csv(stpov18, "processed/school_district/ny_student_poverty_rate_2018.csv")

Notice

  • You have only saved stpov18 - the processed dataframe with all 682 school districts
  • The filtered dataframes that you created are not saved to your computer.
    • You have the recipe to recreate them in this script if you need them

Exploratory Analysis


Now that we have our analysis dataset, it’s time to use it to learn more about poverty in New York state. We’ll use descriptive statistics and visualization to interpret the data.

  • Summary statistics for each variable with the summary() function
  • A very simple scatterplot & histogram to understand the shape of our data
  • Calculate our own descriptive statistics for the state with the summarise() function

Calculate some quick summary statistics

  • type into your console
summary(stpov18)
    Postal                id            district              tpop        
 Length:682         Min.   :3600001   Length:682         Min.   :    104  
 Class :character   1st Qu.:3609368   Class :character   1st Qu.:   5292  
 Mode  :character   Median :3617055   Mode  :character   Median :   9644  
                    Mean   :3616789                      Mean   :  29022  
                    3rd Qu.:3624262                      3rd Qu.:  20624  
                    Max.   :3632010                      Max.   :8398748  
     stpop               stpov             stpovrate           year          
 Min.   :      1.0   Min.   :     0.00   Min.   :0.00000   Length:682        
 1st Qu.:    788.5   1st Qu.:    84.25   1st Qu.:0.07119   Class :character  
 Median :   1453.0   Median :   159.00   Median :0.12362   Mode  :character  
 Mean   :   4292.7   Mean   :   749.48   Mean   :0.12986                     
 3rd Qu.:   3306.5   3rd Qu.:   306.25   3rd Qu.:0.17642                     
 Max.   :1204282.0   Max.   :277784.00   Max.   :0.49781                     

Look at a histogram

A histogram is a chart that shows the distribution of your data.

  • The height of each bar indicates how many districts have a poverty rate within that range.
  • To create a histogram use the hist() function, with format:
    • hist(dataframe_name$column_name)
    • $ indicates a column name
hist(stpov18$stpovrate)
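
As a rough illustration of the $ notation and of what the bars count, here is a sketch on four made-up poverty rates. The toy dataframe and the bin edges are assumptions for illustration, not the lab data. Setting plot = FALSE makes hist() return the bin counts instead of drawing the chart.

```r
# made-up poverty rates for four districts
toy <- data.frame(district = c("A", "B", "C", "D"),
                  stpovrate = c(0.05, 0.12, 0.14, 0.31))

# $ pulls one column out of the dataframe as a vector
toy$stpovrate

# plot = FALSE returns the numbers behind the histogram instead of drawing it
toy_hist <- hist(toy$stpovrate, breaks = c(0, 0.1, 0.2, 0.3, 0.4), plot = FALSE)
toy_hist$counts   # 1 2 0 1: one district under 10%, two between 10% and 20%, ...

# main and xlab add a title and an axis label to the chart
hist(toy$stpovrate, breaks = c(0, 0.1, 0.2, 0.3, 0.4),
     main = "Student poverty rate", xlab = "Poverty rate")
```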

summarise()


  • We can use the summarise() function to create our own statistics to help answer our questions.
    • we’ll create a new dataframe that is a single row summarizing our dataframe
    • useful functions to use within summarise():
      • mean(), median(), min(), max(), sum(), n()
      • n() returns the number of rows

summarise() New York poverty


# calculate student poverty statistics for New York

ny_pov_stats <- stpov18 %>%
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            average_school_district_stpovrate = mean(stpovrate),
            max_school_district_stpovrate = max(stpovrate),
            min_school_district_stpovrate = min(stpovrate),
            statewide_student_poverty_rate = kids_in_pov/kids,
            poverty_range = max_school_district_stpovrate - min_school_district_stpovrate)


districts  kids     kids_in_pov  average_school_district_stpovrate  max_school_district_stpovrate  min_school_district_stpovrate  statewide_student_poverty_rate  poverty_range
682        2927648  511147       0.1298569                          0.4978058                      0                              0.1745931                       0.4978058

Write out the summary stats as a csv


Save summary stats dataframe to your computer

# write out student poverty summary stats for 2018
write_csv(ny_pov_stats, "output/school_district/ny_student_poverty_rate_statistics_2018.csv")

Deliverable 1

Neaten up the script for processing the student-age poverty rate for New York school districts in 2018. It should include everything we have done above:

  • One sentence description at the top of the purpose of the script (commented out)
  • Load the tidyverse package needed to run the script
  • Read in the raw school-age poverty rate data
  • Transform it with mutate(), select(), and rename()
  • Create 5 additional dataframes using the filter() function
  • Write out the processed stpov18 dataframe
  • Quick summary statistics with summary()
  • Quick histogram with hist()
  • Bespoke summary statistics with summarise()
  • Write out summary statistics dataframe

Upload your script to CANVAS

Deliverable 2

Join poverty data to school district shapefile and make a choropleth map of student poverty rate in QGIS or ArcMap.

Step 1: Two options to join the data:

  1. Use QGIS or ArcMap to join school district poverty rate data to a school district shapefile
  2. Use R to join school district poverty rate data to a school district shapefile (see option 2 slides for instructions)

Step 2: Create choropleth map of New York school districts, colored by student-age poverty rate in 2018

  • Make the map nice: include a title, a legend, and carefully chosen bins for the student-age poverty rate.

Upload your map (and optionally, your script to join the data to the shapefile)

Option 2 - join with R

  • Create a new script called ny_school_districts_poverty_shape_join.R and save it in main_data/scripts/data_processing
  • Install sf, the best package for handling spatial data in R
    • In the packages tab, click install, search for sf, and install in default location
  • Read in your processed 2018 student poverty dataset
  • Use the sf function st_read() to read in the school district shapefile
  • Use the left_join() function to join the poverty rate data to the shapefile
    • for any join you have to find the common key (or common, unique column that is in both datasets)
    • in this case, they have a common column with different names: GEOID (in the shapefile) and id (in the poverty data)
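
The idea of joining on a common key with different names can be sketched on two made-up tables. The tables shapes and poverty and their values are invented, and storing GEOID as text is an assumption; check the column types in your own data with glimpse() before joining.

```r
library(dplyr)

# made-up stand-ins for the shapefile attribute table and the poverty table
shapes <- tibble(GEOID = c("3602370", "3605040", "3609999"),
                 label = c("first", "second", "third"))
poverty <- tibble(id = c(3602370, 3605040),
                  stpovrate = c(0.23, 0.16))

# the key columns must have the same type before joining;
# here the numeric id is converted to character to match GEOID
poverty <- poverty %>%
  mutate(id = as.character(id))

# by = c("left_name" = "right_name") matches keys that have different names
joined <- shapes %>%
  left_join(poverty, by = c("GEOID" = "id"))

# rows in the left table with no match are kept, with NA in the new columns
joined
```

If the key types already match in your data, the mutate() step is not needed.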

Option 2 - Join with R script


library(tidyverse)
library(sf)

# import your processed student poverty dataset
ny_sd_pov18 <- read_csv("processed/school_district/ny_student_poverty_rate_2018.csv")

# import new york school districts, crs = 2260 (NAD83 / New York East)
raw_ny_sd_shp <- st_read("raw/school_district/ny_school_districts.shp") 

ny_sd_pov_shp <- raw_ny_sd_shp %>% 
  left_join(ny_sd_pov18, by = c("GEOID"="id"))

st_write(ny_sd_pov_shp, "raw/school_district/ny_school_districts_poverty18.shp")

Deliverable 3

One of the magic aspects of scripts is you can reuse them! You are going to process the 2019 poverty rate data by making a copy of your 2018 script and slightly adjusting it to work with the 2019 data.

  • Save a copy of your ny_process_school_district_data_2018.R script as ny_process_school_district_data_2019.R in the main_data/scripts/data_exploration folder.
  • You’ll find the raw 2019 student-age poverty data in the same folder as the 2018 data.
  • Adjust your 2019 script so that it imports the 2019 data (hint, change all the “18”s to “19”s)
  • There is one column that is named slightly differently, so you’ll need to use the rename() function to change it.
  • Use this script to do exactly the same processing steps to the 2019 data as you did for 2018 data, including saving the processed data and the summary stats to the same folders
  • There is an additional data file in the raw/school_district folder that has enrollment data for every school district in the country (full_data_19_geo_exc.csv). Optionally, read it in and explore it for more practice. See Optional Deliverable 3.1 for instructions.

Upload your script to CANVAS

Optional Deliverable 3.1

Process the 2019 school district enrollment dataset and join it to the poverty dataframe.

In your ny_process_school_district_data_2019.R script:

  • Use read_csv() to read in the full_data_19_geo_exc.csv dataset of enrollment data for every school district in the country.
  • This dataset has a lot of fields; create a dataframe with the following variables:
    • NCESID, NAME, County, dEnroll_district, dWhite, dHispanic, dBlack, dAsian_PI
    • that is the id field, district name, and enrollment by race/ethnicity
  • Create the following variables:
    • percent_latinx, percent_black, percent_white, percent_asian
    • convert NCESID to numeric type (mutate(NCESID = as.numeric(NCESID)))
  • Create a new dataframe called ny_pov_enroll from your 2019 ny poverty dataframe
    • Use left_join() to join the enrollment data to the ny poverty data
      • Notice that the left join keeps all of the rows in the first dataframe, and adds the columns from the second dataframe for rows that match
      • There will be 5 districts from the enrollment dataframe that don’t join properly (don’t worry about them for now)
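
The steps above can be sketched on made-up rows that borrow the column names from the list. All values are invented, the table names enroll and pov are hypothetical, and expressing percent_latinx as a proportion between 0 and 1 (like the poverty rate) is an assumption.

```r
library(dplyr)

# made-up enrollment rows using the column names listed above
enroll <- tibble(NCESID = c("3600001", "3600002"),
                 dEnroll_district = c(1000, 400),
                 dHispanic = c(250, 100)) %>%
  mutate(percent_latinx = dHispanic / dEnroll_district,
         NCESID = as.numeric(NCESID))

# made-up poverty rows; id 3600003 has no enrollment record
pov <- tibble(id = c(3600001, 3600002, 3600003),
              stpovrate = c(0.10, 0.20, 0.30))

# keep every poverty row and add the enrollment columns where the ids match
ny_pov_enroll <- pov %>%
  left_join(enroll, by = c("id" = "NCESID"))

# poverty rows without a matching enrollment record have NA in the enrollment columns
unmatched <- ny_pov_enroll %>%
  filter(is.na(dEnroll_district))
unmatched
```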

Upload this script to CANVAS for .5 points for processing the enrollment data, and .5 points for joining the dataframes

Getting Help with R

