This is me!

Who Am I?

Beyond work, I’m proud to play professional ultimate frisbee for the San Diego Growlers. Competing at a high level has taught me discipline, teamwork, and resilience—qualities I bring into every project I take on. I also enjoy surfing and being outdoors.

I’m passionate about creating impact in healthcare by combining analytical thinking, collaborative leadership, and a drive to push new ideas forward.

Academic Background

I studied Biochemistry at UC San Diego with a minor in Business, which gave me a mix of scientific depth and an understanding of how organizations operate. My time in the Yeo Lab was especially formative: I worked on RNA projects aimed at identifying new biomarkers for ALS, and I learned to approach complex problems with both precision and curiosity. That experience not only sharpened my technical and analytical skills but also showed me how research can translate into real-world impact.

Professional Background

I started my career with a Business Development internship at Rehab United, a physical therapy and wellness company in San Diego. I was invited to return and now work there as a Business Development Associate. In this role, I focus on expanding our impact in healthcare by identifying growth opportunities, building partnerships, and supporting new initiatives in preventative care and musculoskeletal health. A big part of my work involves turning data into strategy, whether through pricing optimization, revenue forecasting, or operational dashboards that help leadership make faster, more informed decisions.

Experience with R

My experience with R is limited. I completed the Google Data Analytics course online, which gave me a basic understanding of how to use R, but I have not used it much since. I also used R in a limited way in the Yeo Lab to run analyses and create visuals.

Experience with other Analytic Software

I have extensive experience with Excel, where I have built many interactive dashboards and performed complex analyses, including revenue and cost forecasts for my current company. I have also used Power BI and Tableau to create visuals and interactive dashboards that our company relies on. I do not have much experience with other programming languages.

Part 2: Importing Data

Problem 1

df <- readr::read_csv("C:/Users/tynan/OneDrive/Documents/BANA 7025/lab2_data/blood_transfusion.csv")
## Rows: 748 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Class
## dbl (4): Recency, Frequency, Monetary, Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
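
As the message suggests, the column types can also be declared up front so the specification is explicit and the note is suppressed on future knits. A minimal sketch, assuming the same file and readr's compact type codes (d = double, c = character):

# Not run: same import with the column types stated explicitly
df <- readr::read_csv(
  "C:/Users/tynan/OneDrive/Documents/BANA 7025/lab2_data/blood_transfusion.csv",
  col_types = "ddddc"
)
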
# Dimensions of Data
dim(df)
## [1] 748   5
# Number of missing values
sum(is.na(df))
## [1] 0
# First 10 rows of data
head(df, 10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1       2        50    12500    98 donated    
##  2       0        13     3250    28 donated    
##  3       1        16     4000    35 donated    
##  4       2        20     5000    45 donated    
##  5       1        24     6000    77 not donated
##  6       4         4     1000     4 not donated
##  7       2         7     1750    14 donated    
##  8       1        12     3000    35 not donated
##  9       2         9     2250    22 donated    
## 10       5        46    11500    98 donated
# Last 10 rows of data
tail(df, 10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1      23         1      250    23 not donated
##  2      23         4     1000    52 not donated
##  3      23         1      250    23 not donated
##  4      23         7     1750    88 not donated
##  5      16         3      750    86 not donated
##  6      23         2      500    38 not donated
##  7      21         2      500    52 not donated
##  8      23         3      750    62 not donated
##  9      39         1      250    39 not donated
## 10      72         1      250    72 not donated
# 100th row, Monetary column
df[100, "Monetary"]
## # A tibble: 1 × 1
##   Monetary
##      <dbl>
## 1     1750
# Mean of Monetary column
mean_mon <- mean(df[["Monetary"]])

# Which rows are above the mean?
above_avg <- df[["Monetary"]] > mean_mon

# Subset those rows, keep Monetary column
subset_df <- df[above_avg, "Monetary"]

# Count rows
nrow(subset_df)
## [1] 267
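
Since summing a logical vector counts its TRUE values, the same answer can be reached in one step. A minimal sketch reusing the mean_mon object created above, which should likewise return 267:

# Not run: count rows whose Monetary value exceeds the mean directly
sum(df[["Monetary"]] > mean_mon)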

Problem 2

df2 <- readr::read_csv("C:/Users/tynan/OneDrive/Documents/BANA 7025/lab2_data/PDI__Police_Data_Initiative__Crime_Incidents.csv")
## Rows: 15155 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): INSTANCEID, INCIDENT_NO, DATE_REPORTED, DATE_FROM, DATE_TO, CLSD, ...
## dbl  (6): UCR, LONGITUDE_X, LATITUDE_X, TOTALNUMBERVICTIMS, TOTALSUSPECTS, ZIP
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
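
With 40 columns, spelling out every type is tedious; if the default guesses are acceptable, the message can simply be silenced. A sketch assuming readr 2.0 or later, which provides the show_col_types argument:

# Not run: same import with the column-specification message suppressed
df2 <- readr::read_csv(
  "C:/Users/tynan/OneDrive/Documents/BANA 7025/lab2_data/PDI__Police_Data_Initiative__Crime_Incidents.csv",
  show_col_types = FALSE
)
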
# dimensions of this data
dim(df2)
## [1] 15155    40
# Are there any missing data
any(is.na(df2))
## [1] TRUE
# If so, how many missing values are in each column?
sort(colSums(is.na(df2)), decreasing = TRUE)
##                        OPENING                          FLOOR 
##                          14508                          14127 
##                           SIDE                     THEFT_CODE 
##                          14120                          10167 
##                   SUSPECT_RACE              SUSPECT_ETHNICITY 
##                           7082                           7082 
##                 SUSPECT_GENDER                  TOTALSUSPECTS 
##                           7082                           7082 
##              DATE_OF_CLEARANCE                    VICTIM_RACE 
##                           2613                           2192 
##               VICTIM_ETHNICITY                  VICTIM_GENDER 
##                           2192                           2192 
##                    LONGITUDE_X                     LATITUDE_X 
##                           1714                           1714 
##                           CLSD                      DAYOFWEEK 
##                            545                            423 
##               CPD_NEIGHBORHOOD                       RPT_AREA 
##                            249                            239 
##                      ADDRESS_X             TOTALNUMBERVICTIMS 
##                            148                             33 
##                           BEAT                            UCR 
##                             28                             10 
##                        OFFENSE                      UCR_GROUP 
##                             10                             10 
##                        DATE_TO                        HOUR_TO 
##                              9                              9 
##                        WEAPONS                      DATE_FROM 
##                              5                              2 
##                       LOCATION                      HOUR_FROM 
##                              2                              2 
##                            ZIP                     INSTANCEID 
##                              1                              0 
##                    INCIDENT_NO                  DATE_REPORTED 
##                              0                              0 
##                            DST                      HATE_BIAS 
##                              0                              0 
##                     VICTIM_AGE                    SUSPECT_AGE 
##                              0                              0 
## COMMUNITY_COUNCIL_NEIGHBORHOOD               SNA_NEIGHBORHOOD 
##                              0                              0
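
Expressing the same counts as a share of rows makes near-empty columns such as OPENING easier to spot. A minimal sketch using colMeans() on the logical matrix returned by is.na():

# Not run: proportion of missing values per column, highest first
sort(colMeans(is.na(df2)), decreasing = TRUE)
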
# Using the `DATE_REPORTED` column, what is the `range` of dates included in this data?
range(df2[['DATE_REPORTED']])
## [1] "01/01/2022 01:08:00 AM" "06/26/2022 12:50:00 AM"
# Using `table()`, what is the most common age range for known `SUSPECT_AGE`s?
table(df2[['SUSPECT_AGE']])
## 
##    18-25    26-30    31-40    41-50    51-60    61-70  OVER 70 UNDER 18 
##     1778     1126     1525      659      298      121       16      629 
##  UNKNOWN 
##     9003
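
UNKNOWN dominates the table, so the most common known range (18-25 here) can also be pulled programmatically after dropping that bucket. A minimal sketch, where ages is a hypothetical helper object:

# Not run: most common age range among known suspect ages
ages <- table(df2[["SUSPECT_AGE"]])
names(which.max(ages[names(ages) != "UNKNOWN"]))
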
# Use `table()` to get the number of incidents per zip code. Sort this table
# for those zip codes with the most activity to the least activity.
sort(table(df2['ZIP']), decreasing = TRUE)
## ZIP
## 45202 45205 45211 45238 45229 45219 45225 45214 45237 45223 45206 45220 45232 
##  2049  1110  1094   956   913   863   811   774   699   653   616   477   477 
## 45224 45209 45208 45204 45216 45227 45207 45203 45230 45213 45239 45226 45217 
##   429   380   359   348   302   286   245   226   214   190   169   112   100 
## 45221 45233 45212 45215 45231 45228 42502 45236 45244 45248  4523  5239 
##    90    77    61    47     7     5     3     3     3     3     2     1
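
If only the busiest areas are of interest, head() on the sorted table keeps the output manageable. A minimal sketch:

# Not run: five zip codes with the most incidents
head(sort(table(df2[["ZIP"]]), decreasing = TRUE), 5)
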
# Using the `DAYOFWEEK` column, which day do most incidents occur on? 
sort(table(df2[['DAYOFWEEK']]), decreasing = TRUE)
## 
##  SATURDAY    SUNDAY    MONDAY   TUESDAY WEDNESDAY    FRIDAY  THURSDAY 
##      2272      2134      2119      2111      2070      2018      2008
# What is the proportion of incidents that fall on this day?
sort(table(df2[['DAYOFWEEK']]) / sum(table(df2[['DAYOFWEEK']])), decreasing = TRUE)
## 
##  SATURDAY    SUNDAY    MONDAY   TUESDAY WEDNESDAY    FRIDAY  THURSDAY 
## 0.1542221 0.1448547 0.1438365 0.1432935 0.1405105 0.1369807 0.1363019
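
prop.table() computes the same proportions directly from the table, which avoids writing the denominator by hand and should reproduce the values above. A minimal sketch:

# Not run: proportion of incidents per day of week, highest first
sort(prop.table(table(df2[["DAYOFWEEK"]])), decreasing = TRUE)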