Module 2 Lab

Author

Benedict Leonardi

Part 1: HTML Bio

Figure: Me, walking with a baguette in Paris (image alt text: "yum bread")

Introduction

I am an aspiring economist interested in the intersection of urban and real estate economics, the history and philosophy of social science, and creative applications of AI/ML methods to questions in those areas.

Academic Background

B.A. in Economics (Carl H. Lindner College of Business, University of Cincinnati, Cincinnati, OH)
B.A. in Mathematics (College of Arts & Sciences, University of Cincinnati, Cincinnati, OH)
M.Ed. in Mathematics Education (University of Notre Dame, Notre Dame, IN)
Ph.D. in Economics (Carl H. Lindner College of Business, University of Cincinnati, Cincinnati, OH)

Professional Background

I am currently a Ph.D. student in economics at the Lindner College of Business, where I am honored to be a Wolfgang-Mayer fellow. I have worked in a variety of data science-adjacent roles, most recently at the OKI Regional Council of Governments, where I performed geospatial analysis. I have experience and interests in:

Geospatial analysis
LLM integration
- Code assistance
- Data cleaning
- Image analysis
Time series analysis
Historically aware social science

Experience with R/Analytical Software

I performed most of the above work in R, though I also have experience using Python for natural language processing and mathematical analysis (NLTK, SciPy).

Part 2: Importing Data

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

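# readr and dplyr are already attached by library(tidyverse); these calls are redundant but harmless
library(readr)
library(dplyr)

# Import the blood transfusion data
blood_data <- read_csv("blood_transfusion.csv")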

Rows: 748 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Class
dbl (4): Recency, Frequency, Monetary, Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Import the PDI (Police Data Initiative) crime incidents data
pdi_data <- read_csv("PDI__Police_Data_Initiative__Crime_Incidents.csv")

Rows: 15155 Columns: 40
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (34): INSTANCEID, INCIDENT_NO, DATE_REPORTED, DATE_FROM, DATE_TO, CLSD, ...
dbl  (6): UCR, LONGITUDE_X, LATITUDE_X, TOTALNUMBERVICTIMS, TOTALSUSPECTS, ZIP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
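
The readr message above suggests declaring column types up front. The following is a minimal sketch of that approach, using a small inline CSV in place of the real file. The sample rows, the `Class` labels, and the names `csv_text` and `blood_sample` are invented for illustration; only the column names and types come from the column specification printed for `blood_data`.

```r
library(readr)

# Toy stand-in for blood_transfusion.csv; these rows are invented for illustration.
csv_text <- I("Recency,Frequency,Monetary,Time,Class
2,50,12500,98,donated
0,13,3250,28,not donated
")

# Declare the types explicitly so read_csv() does not guess them
# and does not print the column specification message.
blood_sample <- read_csv(
  csv_text,
  col_types = cols(
    Recency   = col_double(),
    Frequency = col_double(),
    Monetary  = col_double(),
    Time      = col_double(),
    Class     = col_character()
  )
)

print(blood_sample)
```

For the real files, the same `col_types` argument could be passed alongside the file path; alternatively, `show_col_types = FALSE` silences the message while leaving type guessing in place.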

# Descriptive statistics for each data frame:
# number of columns, number of rows, and total count of missing cells
data_describe <- tribble(
  ~data_type, ~num_cols,        ~num_rows,        ~missing_vals,
  "blood",    ncol(blood_data), nrow(blood_data), sum(is.na(blood_data)),
  "pdi",      ncol(pdi_data),   nrow(pdi_data),   sum(is.na(pdi_data))
)
# Print the summary table
print(data_describe)

# A tibble: 2 × 4
  data_type num_cols num_rows missing_vals
  <chr>        <int>    <int>        <int>
1 blood            5      748            0
2 pdi             40    15155        95592

# Monetary value of the 100th observation
blood_data %>% slice(100) %>% pull(Monetary)

[1] 1750

# Mean of the Monetary column
avg_mon <- blood_data %>% pull(Monetary) %>% mean()
print(avg_mon)

[1] 1378.676

# number of observations greater than the mean
blood_data %>% filter(Monetary > avg_mon) %>% nrow

[1] 267

# Missing values per column in the PDI data, largest first (columns with none are dropped)
pdi_data %>% summarise(across(everything(), ~ sum(is.na(.)))) %>% 
  pivot_longer(cols = everything(), names_to = "column", values_to = "missing_vals") %>% 
  filter(missing_vals > 0) %>%
  arrange(desc(missing_vals))

# A tibble: 31 × 2
   column            missing_vals
   <chr>                    <int>
 1 OPENING                  14508
 2 FLOOR                    14127
 3 SIDE                     14120
 4 THEFT_CODE               10167
 5 SUSPECT_RACE              7082
 6 SUSPECT_ETHNICITY         7082
 7 SUSPECT_GENDER            7082
 8 TOTALSUSPECTS             7082
 9 DATE_OF_CLEARANCE         2613
10 VICTIM_RACE               2192
# ℹ 21 more rows

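# Date range of DATE_REPORTED
# Caveat: DATE_REPORTED was read as character, so range() compares strings
# lexicographically; the result is not guaranteed to be the chronological range.
pdi_data %>% pull(DATE_REPORTED) %>% range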

[1] "01/01/2022 01:08:00 AM" "06/26/2022 12:50:00 AM"
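
Because `DATE_REPORTED` is a character column (see the column specification printed at import), the range above is a string comparison. The sketch below shows how parsing the timestamps first would give a chronological range. The timestamps are invented but follow the same `MM/DD/YYYY HH:MM:SS AM/PM` layout, and `date_strings` and `parsed` are hypothetical names. The values are parsed as UTC, which is lubridate's default; the data's actual time zone is not stated here.

```r
library(lubridate)

# Invented timestamps in the same layout as DATE_REPORTED.
date_strings <- c(
  "01/01/2022 01:08:00 AM",
  "06/26/2022 12:50:00 AM",
  "06/25/2022 11:30:00 PM",
  "12/31/2021 09:15:00 PM"
)

# Character comparison: ordering is driven by the leading month digits,
# so the December 2021 entry is treated as the "latest" value.
range(date_strings)

# Parse to date-times first (UTC by default), then take the range.
parsed <- mdy_hms(date_strings)
range(parsed)
```

Applied to the real data, the same idea would be `pdi_data %>% pull(DATE_REPORTED) %>% mdy_hms() %>% range()`, assuming every value follows this layout.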

# Most common suspect age range (UNKNOWN is counted as its own category)
pdi_data %>% group_by(SUSPECT_AGE) %>%
  summarize(ct = n()) %>%
  arrange(desc(ct))

# A tibble: 9 × 2
  SUSPECT_AGE    ct
  <chr>       <int>
1 UNKNOWN      9003
2 18-25        1778
3 31-40        1525
4 26-30        1126
5 41-50         659
6 UNDER 18      629
7 51-60         298
8 61-70         121
9 OVER 70        16
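
In the table above, UNKNOWN is the largest category, so the most common known age range is the second row. As a rough sketch, the placeholder category could be filtered out before counting. The `suspects` tibble below is an invented sample standing in for `pdi_data`, and `known_ages` is a hypothetical name.

```r
library(dplyr)

# Invented sample; the real SUSPECT_AGE column comes from pdi_data.
suspects <- tibble(
  SUSPECT_AGE = c("UNKNOWN", "UNKNOWN", "UNKNOWN", "18-25", "18-25", "31-40")
)

# Drop the placeholder category, then count and sort in one step.
# count(..., sort = TRUE) is shorthand for group_by() + summarize(n()) + arrange(desc()).
known_ages <- suspects %>%
  filter(SUSPECT_AGE != "UNKNOWN") %>%
  count(SUSPECT_AGE, sort = TRUE, name = "ct")

print(known_ages)
```

Note that `filter()` also drops rows where `SUSPECT_AGE` is `NA`, since the comparison evaluates to `NA` for those rows.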

# most common zip code
pdi_data %>% group_by(ZIP) %>%
  summarize(ct = n()) %>%
  arrange(desc(ct))

# A tibble: 39 × 2
     ZIP    ct
   <dbl> <int>
 1 45202  2049
 2 45205  1110
 3 45211  1094
 4 45238   956
 5 45229   913
 6 45219   863
 7 45225   811
 8 45214   774
 9 45237   699
10 45223   653
# ℹ 29 more rows

# most common days of week
pdi_data %>% group_by(DAYOFWEEK) %>%
  summarize(ct = n()) %>%
  arrange(desc(ct))

# A tibble: 8 × 2
  DAYOFWEEK    ct
  <chr>     <int>
1 SATURDAY   2272
2 SUNDAY     2134
3 MONDAY     2119
4 TUESDAY    2111
5 WEDNESDAY  2070
6 FRIDAY     2018
7 THURSDAY   2008
8 <NA>        423