Which one is me?

This is me

Synopsis

Hi! I’m Kenidy, a Cincinnatian living in Columbus and building the foundation of my career. I’m interested in analytics and in how data can support better decisions across a corporate supply chain.

Academic Background

Current program: MSBA / Business Analytics
School: University of Cincinnati
Areas of interest: analytics, data manipulation, visualization, operations management

Professional Background

I currently work in Consumer Packaged Goods, where I support supply chain and finance. I’m especially interested in roles that involve process improvement, analysis, and translating data into actionable insights.

Experience with R

I’m still building my foundation in R. So far, I’ve learned how to work in RStudio, use R Markdown, and import data.

Experience with Other Analytic Software

I have basic to intermediate experience with Excel (spreadsheets, formulas, basic analysis), Power BI, Tableau, SQL, and Python.

Part 2: Importing Data

df <- readr::read_csv("blood_transfusion.csv")
## Rows: 748 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Class
## dbl (4): Recency, Frequency, Monetary, Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# missing values
sum(is.na(df))
## [1] 0
# dimensions
dim(df)
## [1] 748   5
# first 10 rows
head(df, 10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1       2        50    12500    98 donated    
##  2       0        13     3250    28 donated    
##  3       1        16     4000    35 donated    
##  4       2        20     5000    45 donated    
##  5       1        24     6000    77 not donated
##  6       4         4     1000     4 not donated
##  7       2         7     1750    14 donated    
##  8       1        12     3000    35 not donated
##  9       2         9     2250    22 donated    
## 10       5        46    11500    98 donated
# last 10 rows
tail(df, 10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1      23         1      250    23 not donated
##  2      23         4     1000    52 not donated
##  3      23         1      250    23 not donated
##  4      23         7     1750    88 not donated
##  5      16         3      750    86 not donated
##  6      23         2      500    38 not donated
##  7      21         2      500    52 not donated
##  8      23         3      750    62 not donated
##  9      39         1      250    39 not donated
## 10      72         1      250    72 not donated
# 100th row, Monetary column
df[100, "Monetary"]
## # A tibble: 1 × 1
##   Monetary
##      <dbl>
## 1     1750
# mean of Monetary
mean(df[["Monetary"]])
## [1] 1378.676
# how many rows have Monetary > mean
above_avg <- df[["Monetary"]] > mean(df[["Monetary"]])
sum(above_avg)
## [1] 267
df[above_avg, "Monetary"]
## # A tibble: 267 × 1
##    Monetary
##       <dbl>
##  1    12500
##  2     3250
##  3     4000
##  4     5000
##  5     6000
##  6     1750
##  7     3000
##  8     2250
##  9    11500
## 10     5750
## # ℹ 257 more rows
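
The counting step above relies on R treating TRUE as 1 when a logical vector is summed. A minimal sketch of that idea follows, using a made-up vector named `monetary` (an assumption for illustration, not values from the dataset):

```r
# Made-up values standing in for the Monetary column (illustration only).
monetary <- c(250, 500, 750, 1000, 6000)

# The comparison returns one TRUE/FALSE per element.
above_avg <- monetary > mean(monetary)

# TRUE counts as 1 when summed, so this is the number of above-mean values.
n_above <- sum(above_avg)
n_above

# The same logical vector also selects the matching values.
monetary[above_avg]
```

In this toy vector a single large value pulls the mean up, so fewer than half of the values sit above it. The same effect may explain why fewer than half of the rows in the real data exceed the mean.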

There are 267 observations where the Monetary value is greater than the mean of 1,378.68, which is about 36% of the 748 rows.

Dataset 2: Cincinnati Police Data

df <- readr::read_csv("PDI__Police_Data_Initiative__Crime_Incidents.csv")
## Rows: 15155 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): INSTANCEID, INCIDENT_NO, DATE_REPORTED, DATE_FROM, DATE_TO, CLSD, ...
## dbl  (6): UCR, LONGITUDE_X, LATITUDE_X, TOTALNUMBERVICTIMS, TOTALSUSPECTS, ZIP
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(df)
## [1] 15155    40
sum(is.na(df))
## [1] 95592
sort(colSums(is.na(df)), decreasing = TRUE)[1:5]
##      OPENING        FLOOR         SIDE   THEFT_CODE SUSPECT_RACE 
##        14508        14127        14120        10167         7082
range(df[["DATE_REPORTED"]])
## [1] "01/01/2022 01:08:00 AM" "06/26/2022 12:50:00 AM"
sort(table(df[["SUSPECT_AGE"]]), decreasing = TRUE)[1:5]
## 
## UNKNOWN   18-25   31-40   26-30   41-50 
##    9003    1778    1525    1126     659
sort(table(df[["ZIP"]]), decreasing = TRUE)[1:5]
## 
## 45202 45205 45211 45238 45229 
##  2049  1110  1094   956   913
sort(prop.table(table(df[["DAYOFWEEK"]])), decreasing = TRUE)
## 
##  SATURDAY    SUNDAY    MONDAY   TUESDAY WEDNESDAY    FRIDAY  THURSDAY 
## 0.1542221 0.1448547 0.1438365 0.1432935 0.1405105 0.1369807 0.1363019

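This dataset contains 15,155 rows and 40 columns, with 95,592 missing values in total. The OPENING column has the most missing values at 14,508, followed by FLOOR (14,127) and SIDE (14,120).

The reported dates run from 01/01/2022 01:08:00 AM to 06/26/2022 12:50:00 AM. Because DATE_REPORTED was read in as text, range() compares the values as character strings rather than as dates, so this range should be confirmed after converting the column to a date-time.

The most frequent SUSPECT_AGE value is UNKNOWN (9,003 records); among the known age ranges, 18–25 is the most common (1,778). The ZIP code with the most incidents is 45202 (2,049). I did not see any unusual values or ZIP codes in these summaries. Saturday has the largest share of incidents at about 15.4% (a proportion of 0.154), although the days are fairly evenly spread, ranging from 13.6% to 15.4%.

Based on this dataset, I would be interested in analyzing trends related to time, location, and demographics. Columns such as DAYOFWEEK, ZIP, and SUSPECT_AGE could help identify patterns in incident occurrences. Several variables contain many missing values, which may need to be addressed before further analysis. Summary statistics and frequency tables help highlight trends and potential data quality issues.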
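
One way to check the DATE_REPORTED range is to parse the text into date-times before calling range(). The sketch below uses four made-up timestamps in the same "MM/DD/YYYY HH:MM:SS AM/PM" layout rather than rows from the dataset. The vector name `stamps`, the UTC time zone, and an English-style locale for the AM/PM marker are all assumptions.

```r
# Made-up timestamps in the same layout as DATE_REPORTED (illustration only).
stamps <- c("06/26/2022 12:50:00 AM",
            "01/01/2022 01:08:00 AM",
            "06/26/2022 11:15:00 PM",
            "12/31/2021 09:30:00 PM")

# On character data, range() orders the values as text, not by calendar date.
text_range <- range(stamps)

# Parse first: %I is the 12-hour clock and %p is the AM/PM marker.
# tz = "UTC" is an assumption; the source does not state a time zone.
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# On parsed date-times, range() returns the earliest and latest moments.
time_range <- range(parsed)
time_range
```

With real data, the same `format` string could be applied to `df[["DATE_REPORTED"]]`. Any value that does not match the layout comes back as NA, so it is worth checking `sum(is.na(parsed))` afterward.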