Advanced GIS, Week 8

Outline

Assignment 3
File organization
Intro to R and R Studio
Lab 3 overview & Part 1 deliverables
Importing data into R
Data manipulation with the tidyverse
Exporting data frames
Lab 3 deliverables detail
Getting help

Assignment 3

Research Proposal:

One PDF (3-5 pages) describing the research question, methods and expected results of the research project.

Due date extended: Nov 1

File organization

File organization is very important for keeping track of your data and your analysis.

We’ll all work in the same folder structure so that you can follow the lessons more easily as you learn.

Download the main data folder and save it in your advanced-gis folder.

Use R to process and explore data

From R for Data Science, by Hadley Wickham & Garrett Grolemund

Terms/Definitions

R: a programming language and software environment for statistical computing and graphics
RStudio: an application that helps you write in R in a user-friendly way
R script: an executable document with instructions in R to complete a task (it’s like a recipe!).
function: a set of statements organized together to perform a specific task.
Base R: the functions that come standard with your R installation.
R package: a collection of functions created by the community that other R users can use.

Lecture Format: Quarto

I use Quarto to make slides for this course.

They contain text and executable R code together
Text in a box is a code chunk - something you can execute in R
The output from the code chunk displays below the box

# this is a gray box
print("this is the output from the code chunk")

[1] "this is the output from the code chunk"

You can follow along with all of the lectures in R Studio on your computer.

A few things

R scripts sometimes run slightly differently on different machines
(different operating systems, R versions, package versions, etc)
You can google R. Try it! (“r import csv”)
Other people have probably had the same problem as you or asked the same question
- Stack Overflow
- Google your error message
Style matters! Make it readable for future you, and for your colleagues
- Indent for readability
- Include comments explaining what you are doing/thinking
- Include sources in your script
Learn R shortcuts (cheatsheet)

Lab 3 Structure and Deliverables

Part 1 R: Data Manipulation for Mapping
- due October 25
Part 2 R: tidycensus + revisions to part 1
- due November 1

Lab 3, Part 1 Deliverables

due October 25

Deliverable 1 R script to process student poverty data for New York school districts in 2018, save it to your computer, and create summary statistics about student poverty in New York.
Deliverable 1.1 (optional) R script to join student poverty data to school district shapefile
Deliverable 2 Choropleth map of school districts in New York, styled by student poverty rate
Deliverable 3 R script to process student poverty data for New York school districts in 2019, save it to your computer, and create summary statistics about student poverty in New York.

Installation

R is installed on all of the computers in this lab. You can install it on your own computer if you want to work from home.

First, install R
Second, install RStudio

Open R Studio

Create a New R Project in your existing folder called main_data

R Studio Layout

The Console pane

The Console is:

a calculator: Type code, press enter, the answer displays below.
a terminal: Warnings and errors are displayed in the console

Type into your console, and then press enter:

4 + 5

[1] 9

14 - seven

The Source pane

Use the Source Pane to write R scripts.

scripts are like a recipe, a list of instructions for how to process your data

Create your first script

In R Studio, select File > New File > R Script
- Save the script in your main_data/scripts/data_exploration folder as ny_process_school_district_data_2018.R
- At the top, type “# this is my first script” (make sure you include the hash #)
- Create a variable called seven
- Type 14 - seven
- Highlight both lines with your cursor and click Run in the upper-right of your Source pane

# this is my first script

seven <- 7
14 - seven

[1] 7

…

Notice

A hash # tells R not to run that line of code
- In R-speak, it’s called “commenting out”
In R, create a new object with an assignment operator <-
- keyboard shortcut = Alt+- (Windows) / Option+- (Mac)
When you define an object, the console does not display the value
- type the object name to return the current value
- you can see all defined objects in the Environment pane
Object names must begin with a letter, and only contain letters, numbers, _(underscores) and .(periods).
There are lots of different ways to run your script
Place your cursor at the end of a line, Cmd+Return (Mac) / Ctrl+Return (Windows)
Place your cursor at the end of a line, Click RUN
Highlight the code to run, use keyboard shortcut or Click RUN

The Files Pane: Files tab

The Files window is like file explorer. You can see all of the files on your computer

Notice

We created a new R Project in the main_data folder.

That tells R Studio to define main_data as your working directory
Your Files pane should automatically be in your main_data folder

The Files Pane: Packages tab

The Packages window lists the packages you have installed and provides a user interface to search for other packages and install them.

Packages are collections of functions and datasets developed by the R community to expand the things you can do in R.

Some, like the tidyverse, have become the backbone of analysis in R, and we use them all the time
Install the tidyverse by typing into your console:

install.packages("tidyverse")

You only need to install a package onto your computer once
But you need to load any package you use at the top of each script with library() so R knows you are using it

The Environment pane

The Environment shows all of the objects that you have in your workspace

If you are following along, you should have at least 1 object in your Environment: the variable “seven”.

Import a csv

We will import a csv of every school district in New York that is in your main_data/raw/school_district folder

In your script:

Delete your practice equations
At the top of your script, describe what you will do in this script (see below)
Load the tidyverse package into your environment
“read in” your csv into R Studio

## Process 2018 student poverty data for every school district in New York

library(tidyverse)

## import the student poverty dataset for 2018 using read_csv from the readr package of the tidyverse
# source = https://www.census.gov/data/datasets/2018/demo/saipe/2018-school-districts.html
raw_sd_poverty_18 <- read_csv("raw/school_district/ny_student_poverty_2018.csv")

File paths and working directories

The specification of the list of folders to get to a file on your computer is called a path.

Absolute path: starts at the root folder of the computer

"/Users/sarahodges/spatial/main_data/raw/school_district/ny_student_poverty_2018.csv"

Relative path: starts at a given folder and provides the path starting from that folder

"raw/school_district/ny_student_poverty_2018.csv"

Windows and filepaths

Windows file paths use backslashes (\), which R treats as escape characters; write paths in R with forward slashes (/) instead
If you have issues with an absolute filepath, use the file.path() function
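A minimal sketch of building the lab's relative path with file.path(). The folder and file names are the ones used in this lab; whether the file is found depends on your own working directory.

```r
# file.path() joins folder names with "/" so you never type a backslash
csv_path <- file.path("raw", "school_district", "ny_student_poverty_2018.csv")
csv_path

# getwd() prints the working directory that relative paths start from
getwd()

# file.exists() checks whether the file is where you expect before you read it
file.exists(csv_path)
```

If file.exists() returns FALSE, the usual cause is that the working directory is not the main_data project folder.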

Data frames

Data tables are called dataframes in R.

When you read a table into your R Environment, none of the changes that you make to the dataframe are saved to your computer until you save the dataframe to your computer (in R we say write it out).

Let’s explore our first data frame:

To view your dataframe, click on the name in your Environment pane

Postal	FIPS	district_id	Name	Estimated_total_pop	Estimated_pop_age_5_17	Estimated_pop_age_5_17_in_poverty
NY	36	3602370	Addison Central School District	6887	1228	277
NY	36	3605040	Adirondack Central School District	8456	1346	209
NY	36	3602400	Afton Central School District	3723	589	111
NY	36	3602430	Akron Central School District	9644	1517	157
NY	36	3602460	Albany City School District	98781	11020	3259

Basic data types in R

Type glimpse(raw_sd_poverty_18) in the console to look at a snapshot of your dataframe

Numeric
- Integers (whole numbers)
- Doubles (numbers with a decimal point)
Character (string)
Logical (boolean) - TRUE or FALSE

Notice

The whole column is always the same type. If there is even one character value in a numeric column, the whole column will be type = character.
NA for missing value
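A small sketch of the points above, using made-up values (the "n/a" entry is hypothetical, not taken from the lab data). It shows how one text value changes the type of a whole column, and how converting it produces NA.

```r
# one stray text value makes the whole vector (column) character
mixed <- c("1228", "1346", "n/a")
class(mixed)

# as.numeric() converts what it can; anything else becomes NA
# (R also prints a warning, silenced here to keep the output clean)
converted <- suppressWarnings(as.numeric(mixed))
converted

# is.na() flags the missing values
is.na(converted)

# class() reports the basic type of any value
class(1228L)   # integer
class(0.2256)  # numeric (double)
class("NY")    # character
class(TRUE)    # logical
```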

tidyverse data processing functions

We’ll learn five functions that are the backbone of data transformation in R

mutate() - create or redefine a variable (column)
select() - subset variables (columns) by their names
filter() - subset observations (rows) by their values
rename() - rename the variables
summarise() - calculate summary statistics

mutate() - create or redefine a variable

Create new data frame from raw_sd_poverty_18
New column that equals the student poverty rate

stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty/Estimated_pop_age_5_17)

Notice

The tidyverse uses a pipe operator “%>%” to string together multiple commands
- think of it as meaning “and then”
- you’ll eventually love the “%>%”
- keystroke = cmd/ctrl - shift - m
To create a new variable, use mutate(var_name = equation)
- you use the equal sign within dplyr (tidyverse) functions
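To see what the pipe is doing, here is a sketch on a two-row table. The table toy and its values are hypothetical; only the column names match the lab data. Both versions produce the same result.

```r
library(dplyr)

# hypothetical two-row table using the same column names as the lab data
toy <- tibble(
  Name = c("District A", "District B"),
  Estimated_pop_age_5_17 = c(1000, 500),
  Estimated_pop_age_5_17_in_poverty = c(200, 50)
)

# without the pipe: the data frame is the first argument of mutate()
no_pipe <- mutate(toy,
                  stud_pov_rate = Estimated_pop_age_5_17_in_poverty / Estimated_pop_age_5_17)

# with the pipe: "take toy, and then mutate"
with_pipe <- toy %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty / Estimated_pop_age_5_17)

# the two results are the same
identical(no_pipe, with_pipe)
```

The pipe becomes more useful as you add steps, because each new function reads top to bottom instead of being nested inside the previous one.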

Create a text column

year = “2018”

stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty/Estimated_pop_age_5_17,
         year = "2018")

Notice

You can create more than one new column in the same code chunk
- separate each new variable by a comma
Use double quotes around the value if you want the new variable’s data type to be character

select()

subset columns by their names

Our student poverty table has 9 columns; we may not need all of them.

Type names(stpov18) in your console to see column names

stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty/Estimated_pop_age_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_total_pop, 
         Estimated_pop_age_5_17, Estimated_pop_age_5_17_in_poverty, stud_pov_rate, year)

script so far

## Process 2018 student poverty data for every school district in New York

library(tidyverse)

## import the student poverty dataset for 2018 using read_csv from the readr package of the tidyverse
# source = https://www.census.gov/data/datasets/2018/demo/saipe/2018-school-districts.html
raw_sd_poverty_18 <- read_csv("raw/school_district/ny_student_poverty_2018.csv")

# create student poverty rate & year column
# select necessary variables
stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty/Estimated_pop_age_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_total_pop, 
         Estimated_pop_age_5_17, Estimated_pop_age_5_17_in_poverty, stud_pov_rate, year)

Postal	district_id	Name	Estimated_total_pop	Estimated_pop_age_5_17	Estimated_pop_age_5_17_in_poverty	stud_pov_rate	year
NY	3602370	Addison Central School District	6887	1228	277	0.2255700	2018
NY	3605040	Adirondack Central School District	8456	1346	209	0.1552749	2018
NY	3602400	Afton Central School District	3723	589	111	0.1884550	2018

Notice

Script format:
- purpose at the top
- all packages needed for the script next
- import all raw data next (name the dataframe “raw_…”)
- new dataframe to create dataset you want for your analysis
Use a pipe operator %>% to add a new function
- mutate() all of your variables, then %>% to use select()

rename()

rename your columns
- new_name = old_name
- no spaces in variable names

stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty/Estimated_pop_age_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_total_pop, 
         Estimated_pop_age_5_17, Estimated_pop_age_5_17_in_poverty, stud_pov_rate, year) %>%
  rename(id = district_id,
         district = Name,
         tpop = Estimated_total_pop,
         stpop = Estimated_pop_age_5_17,
         stpov = Estimated_pop_age_5_17_in_poverty,
         stpovrate = stud_pov_rate)

filter()

subset observations (rows) by their values
Create new dataframe without districts with fewer than 100 children

stpov18_no_low_enroll <- stpov18 %>%
  filter(stpop >= 100)

Notice

You can filter on a character variable too
You can have multiple arguments in a filter
Use “&” if every expression must be true for a row to be included in the output
Use “|” (OR) if any expression can be true for a row to be included in the output
“!” means not

filter examples

Try these!

# filter based on text value
nyc <- stpov18 %>% 
  filter(district == "New York City Department Of Education")

# remove new york city
ny_no_nyc <- stpov18 %>% 
  filter(district != "New York City Department Of Education")

# remove all districts with more than 10,000 people
ny_no_large_districts <- stpov18 %>% 
  filter(tpop <= 10000)

# keep only districts with more than 500 people AND no more than 10,000 people
ny_medium_districts <- stpov18 %>% 
  filter(tpop <= 10000 & tpop > 500)
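The examples above use & and !=. Here is a sketch of OR (|) and NOT (!) on a small made-up table. The districts and values in toy are hypothetical; the column names are the renamed ones from this lab.

```r
library(dplyr)

# hypothetical districts, using the renamed columns from the lab
toy <- tibble(
  district = c("District A", "District B", "District C", "District D"),
  tpop = c(400, 8000, 25000, 12000),
  stpovrate = c(0.30, 0.10, 0.05, 0.22)
)

# OR: keep rows where either condition is true
small_or_high_pov <- toy %>%
  filter(tpop < 500 | stpovrate > 0.20)

# NOT: keep rows where the condition is false
not_high_pov <- toy %>%
  filter(!(stpovrate > 0.20))

# separating conditions with a comma works the same as &
medium <- toy %>%
  filter(tpop > 500, tpop <= 10000)
```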

Write out your dataframe to csv

Save this processed dataframe to your computer

# write out processed student poverty data for 2018
write_csv(stpov18, "processed/school_district/ny_student_poverty_rate_2018.csv")

Notice

You have only saved stpov18 - the processed dataframe with all 682 school districts
The filtered dataframes that you created are not saved to your computer.
- You have the recipe to recreate them in this script if you need them
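One way to confirm what was written out is to read the file back in. This is a sketch with a made-up two-row table; tempfile() is used only so the example runs anywhere, and in your own script you would use the processed/ path instead.

```r
library(dplyr)
library(readr)

# hypothetical processed table
toy <- tibble(district = c("District A", "District B"),
              stpovrate = c(0.2, 0.1))

# write to a temporary file (stand-in for your processed/ folder)
out_path <- tempfile(fileext = ".csv")
write_csv(toy, out_path)

# read it back and check the number of rows and the column names
check <- read_csv(out_path, show_col_types = FALSE)
nrow(check)
names(check)
```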

Exploratory Analysis

Now that we have our analysis dataset, it’s time to use it to learn more about poverty in New York state. We’ll use descriptive statistics and visualization to interpret the data.

Summary statistics for each variable with the summary() function
A very simple scatterplot & histogram to understand the shape of our data
Calculate our own descriptive statistics for the state with the summarise() function

Calculate some quick summary statistics

type into your console

summary(stpov18)

    Postal                id            district              tpop        
 Length:682         Min.   :3600001   Length:682         Min.   :    104  
 Class :character   1st Qu.:3609368   Class :character   1st Qu.:   5292  
 Mode  :character   Median :3617055   Mode  :character   Median :   9644  
                    Mean   :3616789                      Mean   :  29022  
                    3rd Qu.:3624262                      3rd Qu.:  20624  
                    Max.   :3632010                      Max.   :8398748  
     stpop               stpov             stpovrate           year          
 Min.   :      1.0   Min.   :     0.00   Min.   :0.00000   Length:682        
 1st Qu.:    788.5   1st Qu.:    84.25   1st Qu.:0.07119   Class :character  
 Median :   1453.0   Median :   159.00   Median :0.12362   Mode  :character  
 Mean   :   4292.7   Mean   :   749.48   Mean   :0.12986                     
 3rd Qu.:   3306.5   3rd Qu.:   306.25   3rd Qu.:0.17642                     
 Max.   :1204282.0   Max.   :277784.00   Max.   :0.49781

Look at a histogram

A histogram is a chart that shows the distribution of your data.

The height of each bar indicates how many districts have a poverty rate within that range.
To create a histogram use the hist() function, with format:
- hist(dataframe_name$column_name)
- $ indicates a column name

hist(stpov18$stpovrate)
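hist() also accepts arguments for the bin edges and labels. A sketch with made-up rates; the vector rates and the label text are assumptions for illustration, not lab data.

```r
# hypothetical poverty rates for nine districts
rates <- c(0.02, 0.05, 0.08, 0.11, 0.12, 0.13, 0.18, 0.25, 0.38)

# breaks sets the bin edges; main and xlab set the title and x-axis label
h <- hist(rates,
          breaks = seq(0, 0.5, by = 0.1),
          main = "Student poverty rate by district",
          xlab = "Student poverty rate")

# hist() invisibly returns the bin counts, which you can inspect
h$counts
```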

summarise()

We can use the summarise() function to create our own statistics to help answer our questions.
- we’ll create a new dataframe that is a single row summarizing our dataframe
- useful functions to use within summarise():
  - mean(), median(), min(), max(), sum(), n()
  - n() returns the number of rows

summarise() New York poverty

# calculate student poverty statistics for New York

ny_pov_stats <- stpov18 %>%
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            average_school_district_stpovrate = mean(stpovrate),
            max_school_district_stpovrate = max(stpovrate),
            min_school_district_stpovrate = min(stpovrate),
            statewide_student_poverty_rate = kids_in_pov/kids,
            poverty_range = max_school_district_stpovrate - min_school_district_stpovrate)

districts	kids	kids_in_pov	average_school_district_stpovrate	max_school_district_stpovrate	min_school_district_stpovrate	statewide_student_poverty_rate	poverty_range
682	2927648	511147	0.1298569	0.4978058	0	0.1745931	0.4978058
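In the output above, the average district rate (about 0.13) is lower than the statewide rate (about 0.17). mean(stpovrate) counts every district equally, while kids_in_pov/kids counts every child equally, so large districts matter more. A sketch with two made-up districts shows the same effect (the numbers are hypothetical):

```r
library(dplyr)

# hypothetical districts: one small with low poverty, one large with high poverty
toy <- tibble(
  district = c("Small district", "Large district"),
  stpop = c(100, 10000),
  stpov = c(5, 3000)
) %>%
  mutate(stpovrate = stpov / stpop)

toy_stats <- toy %>%
  summarise(average_district_rate = mean(stpovrate),       # each district counts once
            pooled_rate = sum(stpov) / sum(stpop))         # each child counts once

toy_stats
```

Which number to report depends on the question: the typical district, or the typical child.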

Write out the summary stats as a csv

Save summary stats dataframe to your computer

# write out student poverty summary stats for 2018
write_csv(ny_pov_stats, "output/school_district/ny_student_poverty_rate_statistics_2018.csv")

Deliverable 1

Neaten up the script for processing the student-age poverty rate for New York school districts in 2018. It should include everything we have done above:

One sentence description at the top of the purpose of the script (commented out)
Load the tidyverse package needed to run the script
Read in the raw school-age poverty rate data
Transform it with mutate(), select() and rename()
Create 5 additional dataframes using the filter() function
Write out the processed stpov18 dataframe
Quick summary statistics with summary()
Quick histogram with hist()
Bespoke summary statistics with summarise()
Write out summary statistics dataframe

Upload your script to CANVAS

Deliverable 2

Join poverty data to school district shapefile and make a choropleth map of student poverty rate in QGIS or ArcMap.

Step 1: Two options to join the data:

Use QGIS or ArcMap to join school district poverty rate data to a school district shapefile
Use R to join school district poverty rate data to a school district shapefile (see option 2 slides for instructions)

Step 2: Create choropleth map of New York school districts, colored by student-age poverty rate in 2018

Make the map nice: include a title, a legend, and carefully chosen bins for the student-age poverty rate

Upload your map (and optionally, your script to join the data to the shapefile)

Option 2 - join with R

Create a new script called ny_school_districts_poverty_shape_join.R and save it in main_data/scripts/data_processing
Install sf, the best package for handling spatial data in R
- In the packages tab, click install, search for sf, and install in default location
Read in your processed 2018 student poverty dataset
Use the sf function st_read() to read in the school district shapefile
Use the left_join() function to join the poverty rate data to the shapefile
- for any join you have to find the common key (or common, unique column that is in both datasets)
- in this case, the common column has a different name in each dataset: GEOID and id

Option 2 - Join with R script

library(tidyverse)
library(sf)

# import your processed student poverty dataset
ny_sd_pov18 <- read_csv("processed/school_district/ny_student_poverty_rate_2018.csv")

# import new york school districts, crs = 2260 (NAD83 / New York East)
raw_ny_sd_shp <- st_read("raw/school_district/ny_school_districts.shp") 

ny_sd_pov_shp <- raw_ny_sd_shp %>% 
  left_join(ny_sd_pov18, by = c("GEOID"="id"))

st_write(ny_sd_pov_shp, "raw/school_district/ny_school_districts_poverty18.shp")
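If the join stops with an error about incompatible types, the two key columns are probably stored differently (for example, text in the shapefile and numbers in the csv). This sketch assumes that situation and uses plain made-up tables in place of the real shapefile; the third id and the rates are hypothetical.

```r
library(dplyr)

# hypothetical stand-ins: shapefile attributes with a text key, csv with a numeric key
shapes <- tibble(GEOID = c("3602370", "3605040", "3699999"))
poverty <- tibble(id = c(3602370, 3605040),
                  stpovrate = c(0.23, 0.16))

# convert the text key to numeric so both keys have the same type, then join
joined <- shapes %>%
  mutate(GEOID = as.numeric(GEOID)) %>%
  left_join(poverty, by = c("GEOID" = "id"))

# rows from the left table with no match are kept, with NA in the new columns
joined %>%
  filter(is.na(stpovrate))
```

Checking for NA after a join is a quick way to see which districts did not find a match.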

Deliverable 3

One of the magic aspects of scripts is you can reuse them! You are going to process the 2019 poverty rate data by making a copy of your 2018 script and slightly adjusting it to work with the 2019 data.

Save a copy of your ny_process_school_district_data_2018.R script as ny_process_school_district_data_2019.R in the main_data/scripts/data_exploration folder.
You’ll find the raw 2019 student-age poverty data in the same folder as the 2018 data.
Adjust your 2019 script so that it imports the 2019 data (hint, change all the “18”s to “19”s)
There is one column that is named slightly differently, so you’ll need to use the rename() function to change it.
Use this script to do exactly the same processing steps to the 2019 data as you did for 2018 data, including saving the processed data and the summary stats to the same folders
There is an additional data file in the raw/school_district folder that has enrollment data for every school district in the country (full_data_19_geo_exc.csv). Optionally, read it in and explore it for more practice. See Optional Deliverable 3.1 for instructions.

Upload your script to CANVAS

Optional Deliverable 3.1

Process the 2019 school district enrollment dataset and join it to the poverty dataframe.

In your ny_process_school_district_data_2019.R script:

Use read_csv() to read in the full_data_19_geo_exc.csv dataset of enrollment data for every school district in the country.
This dataset has a lot of fields; create a dataframe with the following variables:
- NCESID, NAME, County, dEnroll_district, dWhite, dHispanic, dBlack, dAsian_PI
- that is the id field, district name, and enrollment by race/ethnicity
Create the following variables:
- percent_latinx, percent_black, percent_white, percent_asian
- convert NCESID to numeric type (mutate(NCESID = as.numeric(NCESID)))
Create a new dataframe called ny_pov_enroll from your 2019 ny poverty dataframe
- Use left_join() to join the enrollment data to the ny poverty data
  - Notice that the left join keeps all of the rows in the first dataset, and adds columns from the second dataframe that match
  - There will be 5 districts from the enrollment dataframe that don’t join properly (don’t worry about them for now)
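A sketch of these steps on made-up tables, plus one way to list rows that did not find a match. The tables enroll and pov, their values, and the third id are hypothetical; only the column names come from the lab. anti_join() is a dplyr function that returns the rows of the first table with no match in the second.

```r
library(dplyr)

# hypothetical enrollment rows using column names from the list above
enroll <- tibble(
  NCESID = c("3602370", "3605040"),
  dEnroll_district = c(1000, 800),
  dHispanic = c(100, 40)
) %>%
  mutate(percent_latinx = dHispanic / dEnroll_district,
         NCESID = as.numeric(NCESID))   # match the type of the poverty id

# hypothetical poverty rows; the third id has no enrollment record
pov <- tibble(id = c(3602370, 3605040, 3699999),
              stpovrate = c(0.22, 0.15, 0.10))

# left join keeps every poverty row and adds matching enrollment columns
ny_pov_enroll <- pov %>%
  left_join(enroll, by = c("id" = "NCESID"))

# rows of pov that found no match in enroll
unmatched <- pov %>%
  anti_join(enroll, by = c("id" = "NCESID"))
```

Swapping the two tables in anti_join() lists the enrollment rows that have no poverty record instead.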

Upload this script to CANVAS for .5 points for processing the enrollment data, and .5 points for joining the dataframes

Getting Help with R