Research Proposal:
One PDF (3-5 pages) describing the research question, methods and expected results of the research project.
File organization is very important for keeping track of your data and your analysis.
We’ll all work in these folders as you learn so that we can follow my lessons more easily.
From R for Data Science, by Hadley Wickham & Garrett Grolemund
I use Quarto to make slides for this course.
They contain text and executable R code together
Text in a box is a code chunk - something you can execute in R
The output from the code chunk displays below the box
[1] "this is the output from the code chunk"
due October 25
Create a New R Project in your existing folder called main_data
Use the Source Pane to write R scripts.
Save your script in the main_data/scripts/data_exploration folder as ny_process_school_district_data_2018.R
Click Run in the upper-right of your Source pane
Notice
# tells R not to run that line of code
<-
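As a small illustration of the two notes above, here is a sketch in base R. The object name `district_count` is made up for this example; the value 682 is the number of districts reported later in these slides.

```r
# This whole line is a comment, so R skips it.
district_count <- 682   # `<-` stores the value 682 in an object called district_count
# district_count <- 0   # commented out, so the value above is not overwritten
print(district_count)
```

After you run these lines, `district_count` should appear in your Environment.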
The Files window is like file explorer. You can see all of the files on your computer
Notice
We created a new R Project in the main_data folder.
main_data folder
The Packages window lists the packages you have installed and provides a user interface to search for other packages and install them.
Packages are collections of functions and datasets developed by the R community to expand the things you can do in R.
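A minimal sketch of how a package is usually installed once and then loaded in every script. It assumes an internet connection and a configured CRAN mirror for the install step; the `requireNamespace()` check is an addition for this sketch so that the install is skipped when the package is already present.

```r
# Install once per computer; skipped here if tidyverse is already installed.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the package at the top of every script that uses its functions.
library(tidyverse)
```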
Some packages, like tidyverse, have become the backbone of analysis in R, and we use them all the time
Install tidyverse by typing into your console:
The Environment shows all of the objects that you have in your workspace
main_data/raw/school_district folder
In your script:
## Process 2018 student poverty data for New York school districts
library(tidyverse)
## import the 2018 poverty dataset using read_csv from the readr package of the tidyverse
# source = https://www.census.gov/data/datasets/2018/demo/saipe/2018-school-districts.html
raw_sd_poverty_18 <- read_csv("raw/school_district/ny_student_poverty_2018.csv")
The specification of the list of folders to get to a file on your computer is called a path.
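To make the idea of a path concrete, here is a sketch that builds one with base R's `file.path()`, using the folder and file names from this project. The object names (`project_root`, `relative_path`, `absolute_path`) are made up for illustration. A relative path starts from the working directory, which is normally the project folder when you open an R Project.

```r
# Absolute location of the project folder (specific to one computer)
project_root <- "/Users/sarahodges/spatial/main_data"

# Relative path: starts from the project folder, so it works on any computer
relative_path <- file.path("raw", "school_district", "ny_student_poverty_2018.csv")

# Absolute path: the project folder plus the relative path
absolute_path <- file.path(project_root, relative_path)

print(relative_path)
print(absolute_path)

# file.exists() is a quick check that a path points at a real file
file.exists(relative_path)
```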
"/Users/sarahodges/spatial/main_data/raw/school_district/ny_student_poverty_2018.csv"
"raw/school_district/ny_student_poverty_2018.csv"
file.path() function
Data tables are called dataframes in R.
Let’s explore our first data frame:
| Postal | FIPS | district_id | Name | Estimated_total_pop | Estimated_pop_age_5_17 | Estimated_pop_age_5_17_in_poverty |
|---|---|---|---|---|---|---|
| NY | 36 | 3602370 | Addison Central School District | 6887 | 1228 | 277 |
| NY | 36 | 3605040 | Adirondack Central School District | 8456 | 1346 | 209 |
| NY | 36 | 3602400 | Afton Central School District | 3723 | 589 | 111 |
| NY | 36 | 3602430 | Akron Central School District | 9644 | 1517 | 157 |
| NY | 36 | 3602460 | Albany City School District | 98781 | 11020 | 3259 |
Type glimpse(raw_sd_poverty_18) in the console to look at a snapshot of your dataframe
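If you want to see what `glimpse()` does before loading the real file, this sketch rebuilds the first three rows of the table above as a toy dataframe. The name `toy_sd_poverty` is made up for this example, and the FIPS column is left out to keep it short.

```r
library(tidyverse)

# Toy dataframe copied from the first three rows of the table above
toy_sd_poverty <- tibble(
  Postal = c("NY", "NY", "NY"),
  district_id = c(3602370, 3605040, 3602400),
  Name = c("Addison Central School District",
           "Adirondack Central School District",
           "Afton Central School District"),
  Estimated_total_pop = c(6887, 8456, 3723),
  Estimated_pop_age_5_17 = c(1228, 1346, 589),
  Estimated_pop_age_5_17_in_poverty = c(277, 209, 111)
)

# One line per column: the column name, its type, and the first few values
glimpse(toy_sd_poverty)

# Just the column names
names(toy_sd_poverty)
```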
Notice
We’ll learn five functions that are the backbone of data transformation in R
Notice
Notice
character
Our student poverty table has 9 columns; we may not need all of them.
Type names(stpov18) in your console to see column names
## Process 2018 student poverty data for New York school districts
library(tidyverse)
## import the 2018 poverty dataset using read_csv from the readr package of the tidyverse
# source = https://www.census.gov/data/datasets/2018/demo/saipe/2018-school-districts.html
raw_sd_poverty_18 <- read_csv("raw/school_district/ny_student_poverty_2018.csv")
# create student poverty rate & year column
# select necessary variables
stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty / Estimated_pop_age_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_total_pop,
         Estimated_pop_age_5_17, Estimated_pop_age_5_17_in_poverty, stud_pov_rate, year)
| Postal | district_id | Name | Estimated_total_pop | Estimated_pop_age_5_17 | Estimated_pop_age_5_17_in_poverty | stud_pov_rate | year |
|---|---|---|---|---|---|---|---|
| NY | 3602370 | Addison Central School District | 6887 | 1228 | 277 | 0.2255700 | 2018 |
| NY | 3605040 | Adirondack Central School District | 8456 | 1346 | 209 | 0.1552749 | 2018 |
| NY | 3602400 | Afton Central School District | 3723 | 589 | 111 | 0.1884550 | 2018 |
Notice
Use %>% to add a new function
mutate() all of your variables, then %>% to use select()
rename your columns
* new_name = old_name
* no spaces in variable names
stpov18 <- raw_sd_poverty_18 %>%
  mutate(stud_pov_rate = Estimated_pop_age_5_17_in_poverty / Estimated_pop_age_5_17,
         year = "2018") %>%
  select(Postal, district_id, Name, Estimated_total_pop,
         Estimated_pop_age_5_17, Estimated_pop_age_5_17_in_poverty, stud_pov_rate, year) %>%
  rename(id = district_id,
         district = Name,
         tpop = Estimated_total_pop,
         stpop = Estimated_pop_age_5_17,
         stpov = Estimated_pop_age_5_17_in_poverty,
         stpovrate = stud_pov_rate)
Notice
Try these!
# filter based on text value
nyc <- stpov18 %>%
  filter(district == "New York City Department Of Education")
# remove new york city
ny_no_nyc <- stpov18 %>%
  filter(district != "New York City Department Of Education")
# remove all districts with more than 10,000 people
ny_no_large_districts <- stpov18 %>%
  filter(tpop <= 10000)
# keep only districts with more than 500 and at most 10,000 people
ny_medium_districts <- stpov18 %>%
  filter(tpop <= 10000 & tpop > 500)
Save this processed dataframe to your computer
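One way to save a dataframe is readr's `write_csv()`. The sketch below is not the course's exact script: it uses a toy stand-in for `stpov18` and writes into a temporary folder so it runs anywhere. The file name matches the processed file that is read in later in these slides. In your own project you would write to a relative path inside main_data instead.

```r
library(tidyverse)

# Toy stand-in for stpov18 (two districts from the tables above)
toy_stpov18 <- tibble(
  id = c(3602370, 3605040),
  district = c("Addison Central School District",
               "Adirondack Central School District"),
  stpovrate = c(277 / 1228, 209 / 1346)
)

# Assumption: a temporary folder stands in for the main_data project folder
out_dir <- file.path(tempdir(), "processed", "school_district")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
out_file <- file.path(out_dir, "ny_student_poverty_rate_2018.csv")

# Write the dataframe to a csv file
write_csv(toy_stpov18, out_file)

# Read it back in to confirm that the file was written
check <- read_csv(out_file, show_col_types = FALSE)
```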
Notice
Now that we have our analysis dataset, it’s time to use it to learn more about poverty in New York state. We’ll use descriptive statistics and visualization to interpret the data.
summary() function
summarise() function
    Postal                id              district              tpop
 Length:682         Min.   :3600001   Length:682         Min.   :    104
 Class :character   1st Qu.:3609368   Class :character   1st Qu.:   5292
 Mode  :character   Median :3617055   Mode  :character   Median :   9644
                    Mean   :3616789                      Mean   :  29022
                    3rd Qu.:3624262                      3rd Qu.:  20624
                    Max.   :3632010                      Max.   :8398748
     stpop               stpov             stpovrate           year
 Min.   :      1.0   Min.   :     0.00   Min.   :0.00000   Length:682
 1st Qu.:    788.5   1st Qu.:    84.25   1st Qu.:0.07119   Class :character
 Median :   1453.0   Median :   159.00   Median :0.12362   Mode  :character
 Mean   :   4292.7   Mean   :   749.48   Mean   :0.12986
 3rd Qu.:   3306.5   3rd Qu.:   306.25   3rd Qu.:0.17642
 Max.   :1204282.0   Max.   :277784.00   Max.   :0.49781
A histogram is a chart that shows the distribution of your data.
Use the hist() function, with format:
hist(dataframe_name$column_name)
$ indicates a column name
Use the summarise() function to create our own statistics to help answer our questions.
mean(), median(), min(), max(), sum(), n()
n() returns the number of rows
# calculate student poverty statistics for New York
ny_pov_stats <- stpov18 %>%
  summarise(districts = n(),
            kids = sum(stpop),
            kids_in_pov = sum(stpov),
            average_school_district_stpovrate = mean(stpovrate),
            max_school_district_stpovrate = max(stpovrate),
            min_school_district_stpovrate = min(stpovrate),
            statewide_student_poverty_rate = kids_in_pov / kids,
            poverty_range = max_school_district_stpovrate - min_school_district_stpovrate)
| districts | kids | kids_in_pov | average_school_district_stpovrate | max_school_district_stpovrate | min_school_district_stpovrate | statewide_student_poverty_rate | poverty_range |
|---|---|---|---|---|---|---|---|
| 682 | 2927648 | 511147 | 0.1298569 | 0.4978058 | 0 | 0.1745931 | 0.4978058 |
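The table reports two different poverty rates: the average of the district rates (about 0.13) and the statewide rate (511147 / 2927648, about 0.17). The sketch below shows why they can differ, using three districts from the tables above. The names `toy` and `toy_stats` are made up for this example. The average treats every district equally, while the pooled rate treats every child equally, so districts with many children carry more weight.

```r
library(tidyverse)

# Three districts from the tables above: school-age children and children in poverty
toy <- tibble(
  district = c("Addison", "Akron", "Albany City"),
  stpop = c(1228, 1517, 11020),
  stpov = c(277, 157, 3259)
) %>%
  mutate(stpovrate = stpov / stpop)

toy_stats <- toy %>%
  summarise(
    average_district_rate = mean(stpovrate),   # every district counts equally
    pooled_rate = sum(stpov) / sum(stpop)      # every child counts equally
  )

toy_stats
```

In this toy example the pooled rate is the higher of the two because the largest district, Albany City, also has the highest poverty rate. The statewide numbers in the table show the same pattern, which suggests that districts with more children tend to have higher poverty rates.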
Neaten up the script for processing the student-age poverty rate for New York school districts in 2018. It should include everything we have done above:
tidyverse package needed to run the script
mutate(), rename(), select()
filter() function
summary()
hist()
summarise()
Upload your script to CANVAS
Join poverty data to school district shapefile and make a choropleth map of student poverty rate in QGIS or ArcMap.
Step 1: Two options to join the data:
Step 2: Create choropleth map of New York school districts, colored by student-age poverty rate in 2018
Upload your map (and optionally, your script to join the data to the shapefile)
Create a new script called ny_school_districts_poverty_shape_join.R and save it in main_data/scripts/data_processing
sf, the best package for handling spatial data in R
Use the sf function st_read() to read in the school district shapefile
Use the left_join() function to join the poverty rate data to the shapefile
GEOID and id
library(tidyverse)
library(sf)
# import your processed student poverty dataset
ny_sd_pov18 <- read_csv("processed/school_district/ny_student_poverty_rate_2018.csv")
# import new york school districts, crs = 2260 (NAD83 / New York East)
raw_ny_sd_shp <- st_read("raw/school_district/ny_school_districts.shp")
# join the poverty data to the shapefile, matching GEOID (shapefile) to id (poverty data)
ny_sd_pov_shp <- raw_ny_sd_shp %>%
  left_join(ny_sd_pov18, by = c("GEOID" = "id"))
# write the joined shapefile to your computer
st_write(ny_sd_pov_shp, "raw/school_district/ny_school_districts_poverty18.shp")
One of the magic aspects of scripts is that you can reuse them! You are going to process the 2019 poverty rate data by making a copy of your 2018 script and slightly adjusting it to work with the 2019 data.
Save a copy of your ny_process_school_district_data_2018.R script as ny_process_school_district_data_2019.R in the main_data/scripts/data_exploration folder.
rename() function to change it.
raw/school_district folder that has enrollment data for every school district in the country (full_data_19_geo_exc.csv). Optionally, read it in and explore it for more practice. See Optional Deliverable 3.1 for instructions.
Upload your script to CANVAS
Process the 2019 school district enrollment dataset and join it to the poverty dataframe.
In your ny_process_school_district_data_2019.R script:
Use read_csv() to read in the full_data_19_geo_exc.csv dataset of enrollment data for every school district in the country.
(mutate(NCESID = as.numeric(NCESID)))
ny_pov_enroll from your 2019 ny poverty dataframe
Use left_join() to join the enrollment data to the ny poverty data
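The steps above could be sketched as follows on toy data. This is not the real dataset: the `enroll` column and its values are made up for illustration, and the sketch assumes that NCESID is read in as text, which is why it is converted with `as.numeric()` before the join. `left_join()` needs the key columns in both dataframes to have the same type.

```r
library(tidyverse)

# Toy poverty dataframe: id is numeric, as in the tables earlier in these slides
toy_pov <- tibble(
  id = c(3602370, 3605040, 3602400),
  district = c("Addison Central School District",
               "Adirondack Central School District",
               "Afton Central School District")
)

# Toy enrollment dataframe. Assumptions: NCESID is stored as text here,
# and `enroll` is a hypothetical column with made-up values.
toy_enroll <- tibble(
  NCESID = c("3602370", "3605040"),
  enroll = c(1000, 1200)
) %>%
  mutate(NCESID = as.numeric(NCESID))   # make the key numeric so it matches id

# Keep every row of the poverty data and attach enrollment where the ids match
toy_pov_enroll <- toy_pov %>%
  left_join(toy_enroll, by = c("id" = "NCESID"))

toy_pov_enroll
```

Because this is a left join, the third district stays in the result even though it has no match in the enrollment data; its `enroll` value is `NA`.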
Upload this script to CANVAS for 0.5 points for processing the enrollment data, and 0.5 points for joining the dataframes