I am a second-year student studying Business Analytics at the University of Cincinnati College of Business. I am currently an Offensive Student Assistant coach for the University of Cincinnati football team.
I attended Lakota West High School. I came to Cincinnati as a Mechanical Engineering student but switched majors to Business Analytics to better align with my coaching career.
My first job was an unpaid internship developing applications for clients with Butler Tech. I then completed another unpaid internship as an Engineering Team Leader for the nonprofit Luke5Adventures, followed by two summer programming internships with P&G: one as a front-end web developer and another working on the back end with data analytics. I am now in my third season coaching football for the University of Cincinnati.
While I had not used R before this class, I have a lot of experience with similar programming languages such as Python. I am also using R this semester in another BANA class, Forecasting and Risk Analysis.
I have previously used Excel for analysis. Besides using it in BANA I and II, I have also leveraged it in coaching to analyze opposing defenses for exploitable tendencies: I manually recorded data from opponents' games, converted it to binary, and then used Excel to find tendencies in that data.
Fill in the blanks below to import the blood_transfusion.csv file (provided via Canvas) and answer the following questions.
Import the data.
# import data and the message print out will also tell you
# what the data types are
df <- readr::read_csv("/Users/ghaga/Desktop/School/Data Mining/Data/lab2_data/blood_transfusion.csv")
## Rows: 748 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Class
## dbl (4): Recency, Frequency, Monetary, Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Missing values?
# Are there any missing values?
sum(is.na(df))
## [1] 0
Dimensions of the data.
# What are the dimensions of this data
dim(df)
## [1] 748 5
First 10 rows.
# Check out the first 10 rows
head(df, 10)
## # A tibble: 10 × 5
## Recency Frequency Monetary Time Class
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2 50 12500 98 donated
## 2 0 13 3250 28 donated
## 3 1 16 4000 35 donated
## 4 2 20 5000 45 donated
## 5 1 24 6000 77 not donated
## 6 4 4 1000 4 not donated
## 7 2 7 1750 14 donated
## 8 1 12 3000 35 not donated
## 9 2 9 2250 22 donated
## 10 5 46 11500 98 donated
Last 10 rows.
# Check out the last 10 rows
tail(df, 10)
## # A tibble: 10 × 5
## Recency Frequency Monetary Time Class
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 23 1 250 23 not donated
## 2 23 4 1000 52 not donated
## 3 23 1 250 23 not donated
## 4 23 7 1750 88 not donated
## 5 16 3 750 86 not donated
## 6 23 2 500 38 not donated
## 7 21 2 500 52 not donated
## 8 23 3 750 62 not donated
## 9 39 1 250 39 not donated
## 10 72 1 250 72 not donated
What is the value of the 100th row in the Monetary column?
# Index for the 100th row and just the `Monetary` column. What is the value?
df[100, 'Monetary']
## # A tibble: 1 × 1
## Monetary
## <dbl>
## 1 1750
What is the average of the monetary column?
# Index for just the `Monetary` column. What is the mean of this vector?
mean(df[['Monetary']])
## [1] 1378.676
Find all values of the monetary column that are greater than the average value.
# Subset this data frame for all observations where `Monetary` is greater
# than the mean value. How many rows are in the resulting data frame?
above_avg <- df[['Monetary']] > mean(df[['Monetary']])
df[above_avg, 'Monetary']
## # A tibble: 267 × 1
## Monetary
## <dbl>
## 1 12500
## 2 3250
## 3 4000
## 4 5000
## 5 6000
## 6 1750
## 7 3000
## 8 2250
## 9 11500
## 10 5750
## # ℹ 257 more rows
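The comment above asks how many rows the subset contains; the tibble header reports 267. The count can also be read straight from the logical vector. This is a minimal sketch, assuming a small invented vector `monetary` in place of `df[['Monetary']]` (the values are illustrative, not taken from the lab data):

```r
# Toy stand-in for df[['Monetary']]; values are invented for illustration
monetary <- c(250, 500, 750, 1000, 12500)

# Logical vector: TRUE where the value exceeds the mean of the column
above_avg <- monetary > mean(monetary)

# TRUE counts as 1, so summing the logical vector counts the qualifying rows
n_above <- sum(above_avg)
n_above
```

On the real data, `sum(above_avg)` or `nrow(df[above_avg, ])` should agree with the row count shown in the tibble header.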
Fill in the blanks below to import the PDI__Police_Data_Initiative__Crime_Incidents.csv data (provided via Canvas) and answer the questions that follow. Data is taken from the City of Cincinnati Open Data Portal website, which you may need to read to put your answers in context.
Import the data.
df <- readr::read_csv("Data/lab2_data/PDI__Police_Data_Initiative__Crime_Incidents.csv")
## Rows: 15155 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): INSTANCEID, INCIDENT_NO, DATE_REPORTED, DATE_FROM, DATE_TO, CLSD, ...
## dbl (6): UCR, LONGITUDE_X, LATITUDE_X, TOTALNUMBERVICTIMS, TOTALSUSPECTS, ZIP
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Dimensions of the data set.
# dimensions of this data
dim(df)
## [1] 15155 40
Any missing data?
# Are there any missing data
sum(is.na(df))
## [1] 95592
How many missing values (NA) are in each column?
# If so, how many missing values are in each column?
colSums(is.na(df))
## INSTANCEID INCIDENT_NO
## 0 0
## DATE_REPORTED DATE_FROM
## 0 2
## DATE_TO CLSD
## 9 545
## UCR DST
## 10 0
## BEAT OFFENSE
## 28 10
## LOCATION THEFT_CODE
## 2 10167
## FLOOR SIDE
## 14127 14120
## OPENING HATE_BIAS
## 14508 0
## DAYOFWEEK RPT_AREA
## 423 239
## CPD_NEIGHBORHOOD WEAPONS
## 249 5
## DATE_OF_CLEARANCE HOUR_FROM
## 2613 2
## HOUR_TO ADDRESS_X
## 9 148
## LONGITUDE_X LATITUDE_X
## 1714 1714
## VICTIM_AGE VICTIM_RACE
## 0 2192
## VICTIM_ETHNICITY VICTIM_GENDER
## 2192 2192
## SUSPECT_AGE SUSPECT_RACE
## 0 7082
## SUSPECT_ETHNICITY SUSPECT_GENDER
## 7082 7082
## TOTALNUMBERVICTIMS TOTALSUSPECTS
## 33 7082
## UCR_GROUP ZIP
## 10 1
## COMMUNITY_COUNCIL_NEIGHBORHOOD SNA_NEIGHBORHOOD
## 0 0
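Raw counts are easier to compare once they are expressed as a share of rows (for example, OPENING is missing in 14508 of the 15155 rows). The sketch below shows one way to rank columns by their fraction of missing values. It runs on a small invented data frame `toy`, not on the crime data:

```r
# Toy data frame with some missing values; invented for illustration
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# is.na() returns a logical matrix; the column means are the NA fractions
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
na_share
```

Applied to the lab data, the same pattern would be `sort(colMeans(is.na(df)), decreasing = TRUE)`.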
Range of the DATE_REPORTED column.
# Using the `DATE_REPORTED` column, what is the `range` of dates included in this data?
range(df[['DATE_REPORTED']])
## [1] "01/01/2022 01:08:00 AM" "06/26/2022 12:50:00 AM"
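One caveat: readr imported `DATE_REPORTED` as a character column, so `range()` compares the strings alphabetically, not chronologically. The two values above are therefore the alphabetically first and last timestamps, which need not be the earliest and latest ones. The sketch below parses the strings before taking the range. It assumes every timestamp follows the month/day/year 12-hour layout seen in the output, that the session locale recognizes AM/PM for `%p`, and it uses a small invented vector `dates_chr` in place of the real column:

```r
# Toy timestamps in the same layout as DATE_REPORTED; invented for illustration
dates_chr <- c("01/01/2022 01:08:00 AM",
               "06/26/2022 12:50:00 AM",
               "06/26/2022 11:15:00 PM",
               "02/03/2022 09:30:00 AM")

# Character range: alphabetical order, so "12:50:00 AM" sorts after "11:15:00 PM"
chr_range <- range(dates_chr)

# Parse to date-times first; the time zone is fixed to UTC only to keep the example reproducible
parsed <- as.POSIXct(dates_chr, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Chronological range
dt_range <- range(parsed)
dt_range
```

In the toy data the two approaches disagree on the latest timestamp, which is the kind of discrepancy worth checking for on the full column.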
What is the most common suspect age range?
# Using `table()`, what is the most common age range for known `SUSPECT_AGE`s?
table(df[["SUSPECT_AGE"]])
##
## 18-25 26-30 31-40 41-50 51-60 61-70 OVER 70 UNDER 18
## 1778 1126 1525 659 298 121 16 629
## UNKNOWN
## 9003
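In the table above, UNKNOWN is by far the largest category (9003), while the question asks about known ages, where 18-25 leads with 1778. Here is a minimal sketch of dropping the UNKNOWN label before tabulating, using an invented vector `ages` in place of `df[["SUSPECT_AGE"]]`:

```r
# Toy stand-in for df[["SUSPECT_AGE"]]; invented for illustration
ages <- c("18-25", "UNKNOWN", "31-40", "18-25", "UNKNOWN", "UNKNOWN", "26-30")

# Keep only entries with a known age range
known <- ages[ages != "UNKNOWN"]

# Tabulate and sort so the most frequent known range comes first
counts <- sort(table(known), decreasing = TRUE)
most_common <- names(counts)[1]
most_common
```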
Crime by zip code.
# Use `table()` to get the number of incidents per zip code. Sort this table
# for those zip codes with the most activity to the least activity.
sort(table(df['ZIP']), decreasing = TRUE)
## ZIP
## 45202 45205 45211 45238 45229 45219 45225 45214 45237 45223 45206 45220 45232
## 2049 1110 1094 956 913 863 811 774 699 653 616 477 477
## 45224 45209 45208 45204 45216 45227 45207 45203 45230 45213 45239 45226 45217
## 429 380 359 348 302 286 245 226 214 190 169 112 100
## 45221 45233 45212 45215 45231 45228 42502 45236 45244 45248 4523 5239
## 90 77 61 47 7 5 3 3 3 3 2 1
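The tail of this table includes values such as 4523 and 5239, which have only four digits and so look like data-entry errors, not real ZIP codes. The sketch below filters to five-digit values before tabulating. The vector `zips` is invented, and the five-digit rule is an assumption about what counts as valid. It would not catch a five-digit typo.

```r
# Toy stand-in for df[["ZIP"]]; invented for illustration
zips <- c(45202, 45205, 4523, 5239, 45202, NA)

# Flag entries that are present and exactly five digits long
is_five_digit <- !is.na(zips) & nchar(as.character(zips)) == 5

# Tabulate only the plausible ZIP codes, most frequent first
clean <- zips[is_five_digit]
zip_counts <- sort(table(clean), decreasing = TRUE)
zip_counts
```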
On which day of the week does the largest proportion of crimes occur?
# Using the `DAYOFWEEK` column, which day do most incidents occur on? What
# is the proportion of incidents that fall on this day?
sort(table(df[["DAYOFWEEK"]]) / sum(table(df[["DAYOFWEEK"]])), decreasing = TRUE)
##
## SATURDAY SUNDAY MONDAY TUESDAY WEDNESDAY FRIDAY THURSDAY
## 0.1542221 0.1448547 0.1438365 0.1432935 0.1405105 0.1369807 0.1363019
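Saturday has the largest share, at about 15.4% of incidents. Note that `table()` drops missing values by default, and `DAYOFWEEK` has 423 of them, so these proportions are shares of the incidents with a recorded day. The same calculation can be written more compactly with `prop.table()`. Here is a sketch on an invented vector `days`:

```r
# Toy stand-in for df[["DAYOFWEEK"]]; invented for illustration
days <- c("SATURDAY", "SUNDAY", "SATURDAY", "MONDAY", NA)

# prop.table() divides each count by the total of the table;
# the NA entry is excluded by table() before proportions are computed
props <- sort(prop.table(table(days)), decreasing = TRUE)
props
```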