Module 2 Lab

Author

Benedict Leonardi

Part 1: HTML Bio

Me, walking with a baguette in Paris

Introduction

I am an aspiring economist interested in the intersection of urban and real estate economics, the history and philosophy of social science, and the creative use of AI/ML methods to answer questions in those areas.

Academic Background

  • B.A. in Economics (Carl H. Lindner College of Business, University of Cincinnati, Cincinnati, OH)
  • B.A. in Mathematics (College of Arts & Sciences, University of Cincinnati, Cincinnati, OH)
  • M.Ed. in Mathematics Education (University of Notre Dame, Notre Dame, IN)
  • Ph.D. in Economics, in progress (Carl H. Lindner College of Business, University of Cincinnati, Cincinnati, OH)

Professional Background

I am currently a PhD student in economics at the Lindner College of Business, where I am honored to be a Wolfgang-Mayer fellow. I have worked in a variety of data science-adjacent roles, most recently performing geospatial analysis at the OKI Regional Council of Governments. My experience and interests include:

  • Geospatial analysis
  • LLM integration
    • Code assistance
    • Data cleaning
    • Image analysis
  • Time series analysis
  • Historically-aware social science

Experience with R/Analytical Software

Most of the work above was done in R, though I also have experience using Python for natural language processing and mathematical analysis (NLTK, SciPy).

Part 2: Importing Data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(dplyr)

# Importing Data
blood_data <- read_csv("blood_transfusion.csv")
Rows: 748 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Class
dbl (4): Recency, Frequency, Monetary, Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pdi_data <- read_csv("PDI__Police_Data_Initiative__Crime_Incidents.csv")
Rows: 15155 Columns: 40
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (34): INSTANCEID, INCIDENT_NO, DATE_REPORTED, DATE_FROM, DATE_TO, CLSD, ...
dbl  (6): UCR, LONGITUDE_X, LATITUDE_X, TOTALNUMBERVICTIMS, TOTALSUSPECTS, ZIP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Descriptive statistics for each data frame
data_describe <- tribble(~data_type, ~num_cols, ~num_rows, ~ missing_vals,
                          "blood", blood_data %>% ncol(), blood_data %>% nrow(), sum(is.na(blood_data)),
                          "pdi", pdi_data %>% ncol(), pdi_data %>% nrow(), sum(is.na(pdi_data)))
# Print the summary table (columns, rows, and total missing values per data set)
print(data_describe)
# A tibble: 2 × 4
  data_type num_cols num_rows missing_vals
  <chr>        <int>    <int>        <int>
1 blood            5      748            0
2 pdi             40    15155        95592
# Monetary value of the 100th observation
blood_data %>% slice(100) %>% pull(`Monetary`)
[1] 1750
# mean monetary
avg_mon <- blood_data %>% pull(`Monetary`) %>% mean()
print(avg_mon)
[1] 1378.676
# number of observations greater than the mean
blood_data %>% filter(Monetary > avg_mon) %>% nrow()
[1] 267
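
The count above can also be expressed as a share of all rows. The following is a minimal sketch on a made-up tibble (the name toy_blood and its values are hypothetical, not taken from blood_transfusion.csv). It relies on the fact that the mean of a logical vector is the proportion of TRUE values.

```r
library(dplyr)

# Hypothetical toy data standing in for the Monetary column
toy_blood <- tibble(Monetary = c(250, 500, 750, 1000, 5000))

toy_summary <- toy_blood %>%
  summarise(
    avg_mon     = mean(Monetary),                  # column mean
    n_above     = sum(Monetary > mean(Monetary)),  # rows above the mean
    share_above = mean(Monetary > mean(Monetary))  # same count as a proportion
  )

print(toy_summary)
```

Applied to blood_data, the same pattern would report the 267 rows found above as a fraction of the 748 observations.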
# missing values in PDI data
pdi_data %>% summarise(across(everything(), ~ sum(is.na(.)))) %>% 
  pivot_longer(cols = everything(), names_to = "column", values_to = "missing_vals") %>% 
  filter(missing_vals > 0) %>%
  arrange(desc(missing_vals))
# A tibble: 31 × 2
   column            missing_vals
   <chr>                    <int>
 1 OPENING                  14508
 2 FLOOR                    14127
 3 SIDE                     14120
 4 THEFT_CODE               10167
 5 SUSPECT_RACE              7082
 6 SUSPECT_ETHNICITY         7082
 7 SUSPECT_GENDER            7082
 8 TOTALSUSPECTS             7082
 9 DATE_OF_CLEARANCE         2613
10 VICTIM_RACE               2192
# ℹ 21 more rows
# date range (DATE_REPORTED was read as character, so this range is alphabetical, not chronological)
pdi_data %>% pull(DATE_REPORTED) %>% range
[1] "01/01/2022 01:08:00 AM" "06/26/2022 12:50:00 AM"
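
Since DATE_REPORTED is stored as text, range() compares the strings character by character. One way to get a chronological range is to parse the values into date-times first. The sketch below assumes the month/day/year, 12-hour format visible in the output above and uses lubridate's mdy_hms(); the vector toy_dates is made up for illustration and is not drawn from the PDI file.

```r
library(lubridate)

# Hypothetical timestamps in the same format as DATE_REPORTED
toy_dates <- c(
  "01/15/2022 11:30:00 PM",
  "12/01/2021 09:00:00 AM",
  "02/03/2022 01:08:00 AM"
)

# Character comparison orders by the leading month digits
range(toy_dates)

# Parsing first gives the earliest and latest moments in time
parsed <- mdy_hms(toy_dates)
range(parsed)
```

In the toy data the December 2021 entry is the earliest date, yet the character-based range reports it as the maximum because "12" sorts after "01" and "02". On the real data, the equivalent step would be pdi_data %>% pull(DATE_REPORTED) %>% mdy_hms() %>% range().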
# most common age range
pdi_data %>% group_by(SUSPECT_AGE) %>%
  summarize(ct = n()) %>%
  arrange(desc(ct))
# A tibble: 9 × 2
  SUSPECT_AGE    ct
  <chr>       <int>
1 UNKNOWN      9003
2 18-25        1778
3 31-40        1525
4 26-30        1126
5 41-50         659
6 UNDER 18      629
7 51-60         298
8 61-70         121
9 OVER 70        16
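
The top row is the UNKNOWN placeholder, so the most common reported age range is the second row, 18-25. A minimal sketch of the same tally with the placeholder dropped, using dplyr's count() as a shorthand for group_by() plus summarize(); the tibble toy_pdi is hypothetical and only mimics the SUSPECT_AGE column.

```r
library(dplyr)

# Hypothetical toy data standing in for pdi_data
toy_pdi <- tibble(
  SUSPECT_AGE = c("UNKNOWN", "UNKNOWN", "UNKNOWN", "18-25", "18-25", "31-40")
)

known_ages <- toy_pdi %>%
  filter(SUSPECT_AGE != "UNKNOWN") %>%          # drop the placeholder category
  count(SUSPECT_AGE, sort = TRUE, name = "ct")  # tally and sort descending

print(known_ages)
```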
# most common zip code
pdi_data %>% group_by(ZIP) %>%
  summarize(ct = n()) %>%
  arrange(desc(ct))
# A tibble: 39 × 2
     ZIP    ct
   <dbl> <int>
 1 45202  2049
 2 45205  1110
 3 45211  1094
 4 45238   956
 5 45229   913
 6 45219   863
 7 45225   811
 8 45214   774
 9 45237   699
10 45223   653
# ℹ 29 more rows
# most common days of the week
pdi_data %>% group_by(DAYOFWEEK) %>%
  summarize(ct = n()) %>%
  arrange(desc(ct))
# A tibble: 8 × 2
  DAYOFWEEK    ct
  <chr>     <int>
1 SATURDAY   2272
2 SUNDAY     2134
3 MONDAY     2119
4 TUESDAY    2111
5 WEDNESDAY  2070
6 FRIDAY     2018
7 THURSDAY   2008
8 <NA>        423
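
The eighth row collects the 423 incidents with no recorded day of the week. If a calendar-ordered table is preferred over a frequency-ordered one, a possible approach is to drop the missing values and convert the column to a factor with explicit levels. This is a sketch on hypothetical data (toy_days and week_levels are illustrative names), not a change to the analysis above.

```r
library(dplyr)

# Hypothetical toy data standing in for the DAYOFWEEK column
toy_days <- tibble(
  DAYOFWEEK = c("SATURDAY", "MONDAY", "SATURDAY", NA, "SUNDAY", "MONDAY", "SATURDAY")
)

# Assumed ordering: week starts on Monday
week_levels <- c("MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
                 "FRIDAY", "SATURDAY", "SUNDAY")

day_counts <- toy_days %>%
  filter(!is.na(DAYOFWEEK)) %>%                                 # remove missing days
  mutate(DAYOFWEEK = factor(DAYOFWEEK, levels = week_levels)) %>%
  count(DAYOFWEEK, name = "ct", .drop = FALSE)                  # keep zero-count days

print(day_counts)
```

Setting .drop = FALSE keeps days that never occur in the data as rows with a count of zero, which makes gaps visible instead of silently omitting them.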