library(readr)
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.1
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.1
## Warning: package 'purrr' was built under R version 4.3.1
## Warning: package 'dplyr' was built under R version 4.3.1
## Warning: package 'lubridate' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(knitr)
## Warning: package 'knitr' was built under R version 4.3.1
library(reticulate)
## Warning: package 'reticulate' was built under R version 4.3.1

# Beginner Track


Prompt 1. Map All School Districts in the US


We begin by reading in files from two educational data sources. The EDGE_GEOCODE file comes from the 2017-18 School Locations & Geoassignments (SLGA) data set, and the LEA data comes from the 2017-18 Civil Rights Data Collection (CRDC), which captures achievement metrics.

# read in EDGE data from SLGA dataset
path <- "EDGE_GEOCODE_PUBLICLEA_1718/EDGE_GEOCODE_PUBLICLEA_1718/EDGE_GEOCODE_PUBLICLEA_1718.csv"
data1718 <- read.csv(path)

# read in LEA data from CRDC dataset
LEApath <- "2017-18-crdc-data/2017-18 Public-Use Files/Data/LEA/CRDC/CSV/LEA Characteristics.csv"
LEAdata <- read.csv(LEApath)
# Check for missing values
sum(is.na(data1718))
## [1] 0
# no missing values!

sum(is.na(LEAdata))
## [1] 30
# 30 missing values.

Checking the data frames for missing values, we find none in the SLGA data, but 30 missing values among the LEA data's 17,604 records.
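If we want to see where those missing values sit, a quick column-wise tally works. A minimal sketch using the LEAdata object loaded above (na_by_col is just an illustrative name):

# count missing values per column and keep only columns with at least one NA
na_by_col <- colSums(is.na(LEAdata))
na_by_col[na_by_col > 0]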


We are tasked with mapping all school districts in the US. We first count the number of distinct counties appearing in our data set, using the county codes in the CNTY column rather than county names, which are not necessarily unique.

n_distinct(data1718$CNTY)
## [1] 3135

According to the 2020 Census, there are 3,234 counties and county equivalents in the US, including the territories. The number of counties containing at least one school district is slightly lower than that total, which is plausible given that some extremely remote or sparsely populated areas may not host a district of their own.
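As a rough check on that comparison, we can compute the share of counties covered, using the Census total quoted above:

# share of counties/county equivalents containing at least one district (about 0.97 here)
n_distinct(data1718$CNTY) / 3234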

Next, we count the number of school districts in each county, then join this count back onto our data frame in case it is useful later.

countdata <- data1718 %>% 
  group_by(CNTY) %>% 
  summarize(count = n())

# merge the district data frame with countdata
map_group <- left_join(data1718, countdata, by = "CNTY")
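If we want a feel for how districts are spread across counties, a quick look at the per-county counts is easy; a small sketch using the countdata table just created (output not shown):

# distribution of school districts per county, and the counties with the most districts
summary(countdata$count)
countdata %>% arrange(desc(count)) %>% head()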

Now, we plot a density map of school district locations in the US. Translucent data points stack on top of one another, so areas with many school districts appear more opaque. Two views are offered: one showing school district density across all US states and territories, and one examining the US mainland in greater detail.

# density is conveyed by overplotting translucent points, so no fill/colour scale is needed
fullplot <- ggplot(map_group, aes(x = LON, y = LAT)) +
  geom_point(alpha = 0.05, color = "blue") +
  labs(
    title = "US School Districts by Location"
  )

# restrict the axes to the contiguous US; points outside these limits are dropped
mainland <- ggplot(map_group, aes(x = LON, y = LAT)) +
  geom_point(alpha = 0.05, color = "blue") +
  scale_x_continuous(limits = c(-130, -65)) +
  scale_y_continuous(limits = c(22, 52)) +
  labs(
    title = "US Mainland School Districts by Location"
  )

fullplot

mainland
## Warning: Removed 60 rows containing missing values (`geom_point()`).
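The warning above is expected: supplying limits to scale_x_continuous() and scale_y_continuous() drops the roughly 60 districts outside the mainland window (Alaska, Hawaii, and the territories) before plotting. If we preferred to zoom without discarding those points, coord_cartesian() is an alternative; a minimal sketch (mainland_zoom is just an illustrative name):

# zoom in on the contiguous US without dropping out-of-range districts
mainland_zoom <- ggplot(map_group, aes(x = LON, y = LAT)) +
  geom_point(alpha = 0.05, color = "blue") +
  coord_cartesian(xlim = c(-130, -65), ylim = c(22, 52)) +
  labs(title = "US Mainland School Districts by Location")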


Prompt 2. Visualizing Geographic Data in CODAP


We present an alternative visualization using the Common Online Data Analysis Platform (CODAP). We export our data frame as a comma-delimited file and import it into CODAP for mapping.

write.csv(map_group, "counts.csv", row.names = F)

# use this csv in CODAP to show density

Our first attempt at a CODAP visualization of US school districts by county yielded strange results.

CODAP visualization organized by CNTY

Observe the range of the scale at the bottom of the map: 1001 to 78030. These values are far too large to be counts of school districts; CODAP is treating the CNTY geographic codes themselves as the intensity variable, which also explains why recognizable state shapes emerge (counties within a state share similar code values).


Instead, using the previously calculated count column as the intensity variable for the CODAP map yields much more reasonable results. We again present two views: all US states and territories, and a mainland focus.

CODAP visualization organized by count

CODAP visualization organized by count: Mainland


Prompt 3. Student Poverty Rate per School District


We now examine the data sets more closely to gain insight into relationships within the data. We begin by reading in an additional data set, from the Census Bureau's Small Area Income and Poverty Estimates (SAIPE) program, showing the population of school-aged children and the number of school-aged children in poverty in each district.

# read in excel file, skipping the two title rows at the top
SAIPEpath <- "ussd17.xls"
SAIPE <- read_excel(SAIPEpath, skip = 2)

In order to join the data sets, we must link them by common key values. However, as is often the case in data science, the keys are stored in different formats and must be manipulated before a successful join can be performed.

# coerce the SAIPE state FIPS code from character to integer to match the district data
SAIPE$`State FIPS Code` <- as.integer(SAIPE$`State FIPS Code`)

# helper to keep only the last 5 digits of each LEAID so that district codes match SAIPE
KL5 <- function(x) {
  ifelse(nchar(x) <= 5, x, substr(x, nchar(x) - 4, nchar(x)))
}

# right join because poverty data only exists in second table
povertyDB <- map_group |>
  mutate(LEAID = sapply(LEAID, KL5)) |>
  right_join(SAIPE, by = c("LEAID" = "District ID", "OPSTFIPS" = "State FIPS Code"))

# rename the SAIPE enrollment and child-poverty columns (positions 32 and 33 after the join)
colnames(povertyDB)[32] <- "Enroll"
colnames(povertyDB)[33] <- "NumPov"

# poverty rate = impoverished school-aged children / total school-aged children
povertyDB <- povertyDB |>
  mutate(PovRate = NumPov/Enroll)

In the chunk above, we renamed the SAIPE enrollment and poverty-count columns and computed a new variable, PovRate, capturing the poverty rate among school-aged children in each school district.
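Before mapping, it is worth a quick sanity check that the new rate behaves as expected, with values between 0 and 1 plus any NAs introduced by the join; for example (output not shown):

# quick check of the computed poverty rate
summary(povertyDB$PovRate)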


We can use this poverty rate variable to create a color-coded map of poverty in US school districts, again in two views: all states and territories, and the US mainland.

fullpov <- ggplot(povertyDB, aes(x=LON, y=LAT, color = PovRate)) +
  geom_point(alpha = 0.2) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(
    title = "US School Districts by Poverty Rate among School-Aged Children"
  )

mainpov <- ggplot(povertyDB, aes(x=LON, y=LAT, color = PovRate)) +
  geom_point(alpha = 0.2) +
  scale_color_gradient(low = "blue", high = "red") +
  scale_x_continuous(limits = c(-130, -65)) +
  scale_y_continuous(limits = c(22,52)) +
  labs(
    title = "US Mainland School Districts by Poverty Rate among School-Aged Children"
  )

fullpov
## Warning: Removed 88 rows containing missing values (`geom_point()`).

mainpov
## Warning: Removed 143 rows containing missing values (`geom_point()`).

Redder areas show higher child poverty rates, while bluer areas indicate lower rates. The rows removed in the warnings above are districts that lack coordinates or poverty figures after the join (and, in the mainland view, points that fall outside the plotting window).


We export our joined data to a new CSV for future use.

write.csv(povertyDB, "povertymap.csv", row.names = F)

# use this csv in CODAP to map poverty rates

# Advanced Track


Prompt 2


To load the mortgage data set into the database, we first read one of the CSV files into pandas to infer the data types, then ran a simple script to map each column to its data type:

import pandas as pd

df = pd.read_csv(mortgage_data_filepath)

column_names = list(df.columns)
# map each column name to its inferred dtype
datatype_dictionary = {}
for index, dtype in enumerate(df.dtypes):
    datatype_dictionary[column_names[index]] = dtype

We then got the following dictionary:

datatype_dictionary = {'as_of_year': np.dtype('int64'),
 'respondent_id': np.dtype('O'),
 'agency_name': np.dtype('O'),
 'agency_abbr': np.dtype('O'),
 'agency_code': np.dtype('int64'),
 'loan_type_name': np.dtype('O'),
 'loan_type': np.dtype('int64'),
 'property_type_name': np.dtype('O'),
 'property_type': np.dtype('int64'),
 'loan_purpose_name': np.dtype('O'),
 'loan_purpose': np.dtype('int64'),
 'owner_occupancy_name': np.dtype('O'),
 'owner_occupancy': np.dtype('int64'),
 'loan_amount_000s': np.dtype('float64'),
 'preapproval_name': np.dtype('O'),
 'preapproval': np.dtype('int64'),
 'action_taken_name': np.dtype('O'),
 'action_taken': np.dtype('int64'),
 'msamd_name': np.dtype('O'),
 'msamd': np.dtype('float64'),
 'state_name': np.dtype('O'),
 'state_abbr': np.dtype('O'),
 'state_code': np.dtype('float64'),
 'county_name': np.dtype('O'),
 'county_code': np.dtype('float64'),
 'census_tract_number': np.dtype('float64'),
 'applicant_ethnicity_name': np.dtype('O'),
 'applicant_ethnicity': np.dtype('int64'),
 'co_applicant_ethnicity_name': np.dtype('O'),
 'co_applicant_ethnicity': np.dtype('int64'),
 'applicant_race_name_1': np.dtype('O'),
 'applicant_race_1': np.dtype('int64'),
 'applicant_race_name_2': np.dtype('O'),
 'applicant_race_2': np.dtype('float64'),
 'applicant_race_name_3': np.dtype('O'),
 'applicant_race_3': np.dtype('float64'),
 'applicant_race_name_4': np.dtype('O'),
 'applicant_race_4': np.dtype('float64'),
 'applicant_race_name_5': np.dtype('O'),
 'applicant_race_5': np.dtype('float64'),
 'co_applicant_race_name_1': np.dtype('O'),
 'co_applicant_race_1': np.dtype('int64'),
 'co_applicant_race_name_2': np.dtype('O'),
 'co_applicant_race_2': np.dtype('float64'),
 'co_applicant_race_name_3': np.dtype('O'),
 'co_applicant_race_3': np.dtype('float64'),
 'co_applicant_race_name_4': np.dtype('O'),
 'co_applicant_race_4': np.dtype('float64'),
 'co_applicant_race_name_5': np.dtype('O'),
 'co_applicant_race_5': np.dtype('float64'),
 'applicant_sex_name': np.dtype('O'),
 'applicant_sex': np.dtype('int64'),
 'co_applicant_sex_name': np.dtype('O'),
 'co_applicant_sex': np.dtype('int64'),
 'applicant_income_000s': np.dtype('float64'),
 'purchaser_type_name': np.dtype('O'),
 'purchaser_type': np.dtype('int64'),
 'denial_reason_name_1': np.dtype('O'),
 'denial_reason_1': np.dtype('float64'),
 'denial_reason_name_2': np.dtype('O'),
 'denial_reason_2': np.dtype('float64'),
 'denial_reason_name_3': np.dtype('O'),
 'denial_reason_3': np.dtype('float64'),
 'rate_spread': np.dtype('float64'),
 'hoepa_status_name': np.dtype('O'),
 'hoepa_status': np.dtype('int64'),
 'lien_status_name': np.dtype('O'),
 'lien_status': np.dtype('int64'),
 'edit_status_name': np.dtype('float64'),
 'edit_status': np.dtype('float64'),
 'sequence_number': np.dtype('float64'),
 'population': np.dtype('float64'),
 'minority_population': np.dtype('float64'),
 'hud_median_family_income': np.dtype('float64'),
 'tract_to_msamd_income': np.dtype('float64'),
 'number_of_owner_occupied_units': np.dtype('float64'),
 'number_of_1_to_4_family_units': np.dtype('float64'),
 'application_date_indicator': np.dtype('float64')}

Now we wanted to create a table in the database for this data set, so we fed the dictionary into GPT-4 and had it generate a CREATE TABLE statement with the corresponding SQL types:

CREATE TABLE hmd_2017 (
    as_of_year INTEGER,
    respondent_id VARCHAR,
    agency_name VARCHAR,
    agency_abbr VARCHAR,
    agency_code INTEGER,
    loan_type_name VARCHAR,
    loan_type INTEGER,
    property_type_name VARCHAR,
    property_type INTEGER,
    loan_purpose_name VARCHAR,
    loan_purpose INTEGER,
    owner_occupancy_name VARCHAR,
    owner_occupancy INTEGER,
    loan_amount_000s DOUBLE PRECISION,
    preapproval_name VARCHAR,
    preapproval INTEGER,
    action_taken_name VARCHAR,
    action_taken INTEGER,
    msamd_name VARCHAR,
    msamd DOUBLE PRECISION,
    state_name VARCHAR,
    state_abbr VARCHAR,
    state_code DOUBLE PRECISION,
    county_name VARCHAR,
    county_code DOUBLE PRECISION,
    census_tract_number DOUBLE PRECISION,
    applicant_ethnicity_name VARCHAR,
    applicant_ethnicity INTEGER,
    co_applicant_ethnicity_name VARCHAR,
    co_applicant_ethnicity INTEGER,
    applicant_race_name_1 VARCHAR,
    applicant_race_1 INTEGER,
    applicant_race_name_2 VARCHAR,
    applicant_race_2 DOUBLE PRECISION,
    applicant_race_name_3 VARCHAR,
    applicant_race_3 DOUBLE PRECISION,
    applicant_race_name_4 VARCHAR,
    applicant_race_4 DOUBLE PRECISION,
    applicant_race_name_5 VARCHAR,
    applicant_race_5 DOUBLE PRECISION,
    co_applicant_race_name_1 VARCHAR,
    co_applicant_race_1 INTEGER,
    co_applicant_race_name_2 VARCHAR,
    co_applicant_race_2 DOUBLE PRECISION,
    co_applicant_race_name_3 VARCHAR,
    co_applicant_race_3 DOUBLE PRECISION,
    co_applicant_race_name_4 VARCHAR,
    co_applicant_race_4 DOUBLE PRECISION,
    co_applicant_race_name_5 VARCHAR,
    co_applicant_race_5 DOUBLE PRECISION,
    applicant_sex_name VARCHAR,
    applicant_sex INTEGER,
    co_applicant_sex_name VARCHAR,
    co_applicant_sex INTEGER,
    applicant_income_000s DOUBLE PRECISION,
    purchaser_type_name VARCHAR,
    purchaser_type INTEGER,
    denial_reason_name_1 VARCHAR,
    denial_reason_1 DOUBLE PRECISION,
    denial_reason_name_2 VARCHAR,
    denial_reason_2 DOUBLE PRECISION,
    denial_reason_name_3 VARCHAR,
    denial_reason_3 DOUBLE PRECISION,
    rate_spread DOUBLE PRECISION,
    hoepa_status_name VARCHAR,
    hoepa_status INTEGER,
    lien_status_name VARCHAR,
    lien_status INTEGER,
    edit_status_name DOUBLE PRECISION,
    edit_status DOUBLE PRECISION,
    sequence_number DOUBLE PRECISION,
    population DOUBLE PRECISION,
    minority_population DOUBLE PRECISION,
    hud_median_family_income DOUBLE PRECISION,
    tract_to_msamd_income DOUBLE PRECISION,
    number_of_owner_occupied_units DOUBLE PRECISION,
    number_of_1_to_4_family_units DOUBLE PRECISION,
    application_date_indicator DOUBLE PRECISION
);

After this we were almost ready to load the data; the only problem was that many of the numeric fields contained empty strings instead of NULL values. Postgres will not accept an empty string in a numeric column, so to get around this we first collected a list of all the numeric columns:

import numpy as np  # needed for the dtype comparisons

numeric_columns = [x for x in datatype_dictionary.keys() if datatype_dictionary[x] == np.dtype('float64') or datatype_dictionary[x] == np.dtype('int64')]

Then we ran the following script to COPY each part of the mortgage data set into our database:

from sqlalchemy import create_engine 
import psycopg2 

connection_string = CONNECTION_STRING_REDACTED
connection = psycopg2.connect(connection_string)
cursor = connection.cursor()

# repeat for each of the 3 CSV parts
with open("/Volumes/Untitled/Data-20231026T061813Z-001/Data/parts_hmda_2017_nationwide_all-records_labels/part_1_of_3_hmda_2017_nationwide_all-records_labels.csv") as csv_file:
    cursor.copy_expert("COPY hmd_2017 FROM STDIN WITH (FORMAT csv,DELIMITER ',',HEADER,FORCE_NULL ({}))".format(','.join(numeric_columns)), csv_file)
    print(cursor.rowcount)
connection.commit()
connection.close()

Notice the FORCE_NULL option, to which we pass all of the numeric columns; for those columns Postgres converts empty CSV fields to NULL, which solves the issue described above. This script ran flawlessly for all 3 files, and in the end we had over 14 million rows in our production database hosted on AWS.


Part 3