Part 1

Introduction

Greetings - hope you’re doing well. I prefer to go by my last name, McKellar, instead of my first, although my friends shorten my name to Mick. I’m a nerd, a gamer, I work in IT, and in general like to play with puzzles and learn. I’m also a fan of stage musicals, which you can probably tell from the picture I’ve included here where I’m dressed as Patrick Star from the SpongeBob musical for my office Halloween costume contest.

Background

I got my bachelor’s in computer engineering from Rose-Hulman Institute of Technology, and now I’m ever so slowly getting an MBA from University of Cincinnati. In a perfect world I would have graduated already, but after enough curveballs from work and life that finish line has been pushed out a bit. Through my previous education and work I’ve worked with analytic software before, mainly Python Pandas and Power BI / Power Query.

Part 2

Part 2 Question 1

# import data and the message print out will also tell you
# what the data types are
df <- readr::read_csv("blood_transfusion.csv")
## Rows: 748 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Class
## dbl (4): Recency, Frequency, Monetary, Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

There are the following number of missing values

# Are there any missing values?
sum(is.na(df))
## [1] 0

The data has the following dimensions

# What are the dimensions of this data
dim(df)
## [1] 748   5

With these values in the first and last 10 rows

# Check out the first 10 rows
head(df, 10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1       2        50    12500    98 donated    
##  2       0        13     3250    28 donated    
##  3       1        16     4000    35 donated    
##  4       2        20     5000    45 donated    
##  5       1        24     6000    77 not donated
##  6       4         4     1000     4 not donated
##  7       2         7     1750    14 donated    
##  8       1        12     3000    35 not donated
##  9       2         9     2250    22 donated    
## 10       5        46    11500    98 donated
# Check out the last 10 rows
tail(df, 10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1      23         1      250    23 not donated
##  2      23         4     1000    52 not donated
##  3      23         1      250    23 not donated
##  4      23         7     1750    88 not donated
##  5      16         3      750    86 not donated
##  6      23         2      500    38 not donated
##  7      21         2      500    52 not donated
##  8      23         3      750    62 not donated
##  9      39         1      250    39 not donated
## 10      72         1      250    72 not donated

The value in the ‘Monetary’ column at the 100th row is

# Index for the 100th row and just the `Monetary` column. What is the value?
df[100, 'Monetary']
## # A tibble: 1 × 1
##   Monetary
##      <dbl>
## 1     1750

The mean of the ‘Monetary’ column is

# Index for just the `Monetary` column. What is the mean of this vector?
mean(df[['Monetary']])
## [1] 1378.676

A subset of data where the value of the ‘Monetary’ column is less than the mean of that column

# Subset this data frame for all observations where `Monetary` is greater
# than the mean value. How many rows are in the resulting data frame?
above_avg <- df[['Monetary']] > mean(df[['Monetary']])
df[above_avg, 'Monetary']
## # A tibble: 267 × 1
##    Monetary
##       <dbl>
##  1    12500
##  2     3250
##  3     4000
##  4     5000
##  5     6000
##  6     1750
##  7     3000
##  8     2250
##  9    11500
## 10     5750
## # ℹ 257 more rows

Part 2 Question 2

df <- readr::read_csv("PDI__Police_Data_Initiative__Crime_Incidents.csv")
## Rows: 15155 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): INSTANCEID, INCIDENT_NO, DATE_REPORTED, DATE_FROM, DATE_TO, CLSD, ...
## dbl  (6): UCR, LONGITUDE_X, LATITUDE_X, TOTALNUMBERVICTIMS, TOTALSUSPECTS, ZIP
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

For the data provided in the dile ’PDI__Police_Data_Initiative__Crime_Incidents.csv’ it has the following dimensions

# dimensions of this data
dim(df)
## [1] 15155    40

This data has this amount of missing data and are in the following columns

# Are there any missing data
sum(is.na(df))
## [1] 95592
# If so, how many missing values are in each column?
colSums(is.na(df))
##                     INSTANCEID                    INCIDENT_NO 
##                              0                              0 
##                  DATE_REPORTED                      DATE_FROM 
##                              0                              2 
##                        DATE_TO                           CLSD 
##                              9                            545 
##                            UCR                            DST 
##                             10                              0 
##                           BEAT                        OFFENSE 
##                             28                             10 
##                       LOCATION                     THEFT_CODE 
##                              2                          10167 
##                          FLOOR                           SIDE 
##                          14127                          14120 
##                        OPENING                      HATE_BIAS 
##                          14508                              0 
##                      DAYOFWEEK                       RPT_AREA 
##                            423                            239 
##               CPD_NEIGHBORHOOD                        WEAPONS 
##                            249                              5 
##              DATE_OF_CLEARANCE                      HOUR_FROM 
##                           2613                              2 
##                        HOUR_TO                      ADDRESS_X 
##                              9                            148 
##                    LONGITUDE_X                     LATITUDE_X 
##                           1714                           1714 
##                     VICTIM_AGE                    VICTIM_RACE 
##                              0                           2192 
##               VICTIM_ETHNICITY                  VICTIM_GENDER 
##                           2192                           2192 
##                    SUSPECT_AGE                   SUSPECT_RACE 
##                              0                           7082 
##              SUSPECT_ETHNICITY                 SUSPECT_GENDER 
##                           7082                           7082 
##             TOTALNUMBERVICTIMS                  TOTALSUSPECTS 
##                             33                           7082 
##                      UCR_GROUP                            ZIP 
##                             10                              1 
## COMMUNITY_COUNCIL_NEIGHBORHOOD               SNA_NEIGHBORHOOD 
##                              0                              0

The ‘DATE_REPORTED’ column has has the following range of dates

# Using the `DATE_REPORTED` column, what is the `range` of dates included in this data?
range(df[['DATE_REPORTED']])
## [1] "01/01/2022 01:08:00 AM" "06/26/2022 12:50:00 AM"

The column ‘SUSPECT_AGE’ has the following range

# Using `table()`, what is the most common age range for known `SUSPECT_AGE`s?
table(df[['SUSPECT_AGE']])
## 
##    18-25    26-30    31-40    41-50    51-60    61-70  OVER 70 UNDER 18 
##     1778     1126     1525      659      298      121       16      629 
##  UNKNOWN 
##     9003

By using the function ‘table()’ we can see the number of incidents per zip code and them manipulate that data to present it from greatest number to fewest

# Use `table()` to get the number of incidents per zip code. Sort this table
# for those zip codes with the most activity to the least activity.
sort(table(df['ZIP']), decreasing = TRUE)
## ZIP
## 45202 45205 45211 45238 45229 45219 45225 45214 45237 45223 45206 45220 45232 
##  2049  1110  1094   956   913   863   811   774   699   653   616   477   477 
## 45224 45209 45208 45204 45216 45227 45207 45203 45230 45213 45239 45226 45217 
##   429   380   359   348   302   286   245   226   214   190   169   112   100 
## 45221 45233 45212 45215 45231 45228 42502 45236 45244 45248  4523  5239 
##    90    77    61    47     7     5     3     3     3     3     2     1

Finally, using the same table method and some averaging we can see which day of the week the most incidents occur on

# Using the `DAYOFWEEK` column, which day do most incidents occur on? What
# is the proportion of incidents that fall on this day?
table(df[['DAYOFWEEK']]) / sum(table(df[['DAYOFWEEK']]))
## 
##    FRIDAY    MONDAY  SATURDAY    SUNDAY  THURSDAY   TUESDAY WEDNESDAY 
## 0.1369807 0.1438365 0.1542221 0.1448547 0.1363019 0.1432935 0.1405105

At a glance, three columns that jumped out to me are SUSPECT_AGE, WEAPONS, and HATE_BIAS, as it’d be interesting to see if there’s a trend or connection between these. Despite whatever interesting insights could be found from those values, there might not be enough SUSPECT_AGE data since of 15155 rows 9003 are marked as UNKNOWN. This can likely be exchanged for VICTIM_AGE, which while a different angle of analysis only has 2283 UNKNOWN values. The other columns are pretty solid without blanks, although even without blanks HATE_BIAS might not have enough data to yield interesting info since “N–NO BIAS/NOT APPLICABLE” makes up 13885 of its rows. Using these columns, I’d like to tabulate the average number of co-occurrences and see if anything jumps out.