Hello! My name is Denise Huynh. I grew up in central Vietnam, came to Pennsylvania five years ago to study abroad, and now live in Cincinnati, OH.
I currently work as a Student Worker in the Aerospace Engineering and Mechanical Engineering departments at the University of Cincinnati. My daily work includes:
In addition to R, I have experience with tools such as:
Fill in the blanks below to import the blood_transfusion.csv file (provided via Canvas) and answer the following questions.
df <- readr::read_csv('blood_transfusion.csv')
## Rows: 748 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Class
## dbl (4): Recency, Frequency, Monetary, Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(df)
## [1] 748 5
sum(is.na(df))
## [1] 0
head(df, 10)
## # A tibble: 10 × 5
## Recency Frequency Monetary Time Class
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2 50 12500 98 donated
## 2 0 13 3250 28 donated
## 3 1 16 4000 35 donated
## 4 2 20 5000 45 donated
## 5 1 24 6000 77 not donated
## 6 4 4 1000 4 not donated
## 7 2 7 1750 14 donated
## 8 1 12 3000 35 not donated
## 9 2 9 2250 22 donated
## 10 5 46 11500 98 donated
tail(df, 10)
## # A tibble: 10 × 5
## Recency Frequency Monetary Time Class
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 23 1 250 23 not donated
## 2 23 4 1000 52 not donated
## 3 23 1 250 23 not donated
## 4 23 7 1750 88 not donated
## 5 16 3 750 86 not donated
## 6 23 2 500 38 not donated
## 7 21 2 500 52 not donated
## 8 23 3 750 62 not donated
## 9 39 1 250 39 not donated
## 10 72 1 250 72 not donated
df[100, 'Monetary']
## # A tibble: 1 × 1
## Monetary
## <dbl>
## 1 1750
mean(df[['Monetary']])
## [1] 1378.676
above_avg <- df[['Monetary']] > mean(df[['Monetary']])
df[above_avg, 'Monetary']
## # A tibble: 267 × 1
## Monetary
## <dbl>
## 1 12500
## 2 3250
## 3 4000
## 4 5000
## 5 6000
## 6 1750
## 7 3000
## 8 2250
## 9 11500
## 10 5750
## # ℹ 257 more rows
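The comparison above builds a logical vector and then uses it as a row filter. The following is a minimal sketch of that pattern on a small made-up vector (the values in `monetary` are illustrative and are not taken from the file); it shows that summing the logical vector gives the number of rows the filter returns.

```r
# Hypothetical toy values standing in for df[['Monetary']]
monetary <- c(250, 500, 1000, 1750, 12500)

avg <- mean(monetary)            # arithmetic mean of the toy values
above_avg <- monetary > avg      # logical vector, one entry per value

n_above <- sum(above_avg)        # TRUE counts as 1, so this counts matches
share_above <- mean(above_avg)   # proportion of values above the mean

selected <- monetary[above_avg]  # keeps only entries where above_avg is TRUE
```

Applied to the blood transfusion data, `sum(above_avg)` should agree with the 267 rows reported in the tibble header above.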
Fill in the blanks below to import the PDI__Police_Data_Initiative__Crime_Incidents.csv data (provided via Canvas) and answer the questions that follow. Data is taken from the City of Cincinnati Open Data Portal website, which you may need to read to place context in your answers.
df <- readr::read_csv('PDI__Police_Data_Initiative__Crime_Incidents.csv')
## Rows: 15155 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): INSTANCEID, INCIDENT_NO, DATE_REPORTED, DATE_FROM, DATE_TO, CLSD, ...
## dbl (6): UCR, LONGITUDE_X, LATITUDE_X, TOTALNUMBERVICTIMS, TOTALSUSPECTS, ZIP
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
• What are the dimensions of this data (number of rows and columns)?
dim(df)
## [1] 15155 40
• What do you think these columns represent?
These columns likely aim to provide a detailed understanding of crime incidents, including when and where they occur, and some demographic data on suspects and victims.
• Are there any missing values in this data? If so, how many missing values are in each column? Which column has the most missing values?
sum(is.na(df))
## [1] 95592
colSums(is.na(df))
## INSTANCEID INCIDENT_NO
## 0 0
## DATE_REPORTED DATE_FROM
## 0 2
## DATE_TO CLSD
## 9 545
## UCR DST
## 10 0
## BEAT OFFENSE
## 28 10
## LOCATION THEFT_CODE
## 2 10167
## FLOOR SIDE
## 14127 14120
## OPENING HATE_BIAS
## 14508 0
## DAYOFWEEK RPT_AREA
## 423 239
## CPD_NEIGHBORHOOD WEAPONS
## 249 5
## DATE_OF_CLEARANCE HOUR_FROM
## 2613 2
## HOUR_TO ADDRESS_X
## 9 148
## LONGITUDE_X LATITUDE_X
## 1714 1714
## VICTIM_AGE VICTIM_RACE
## 0 2192
## VICTIM_ETHNICITY VICTIM_GENDER
## 2192 2192
## SUSPECT_AGE SUSPECT_RACE
## 0 7082
## SUSPECT_ETHNICITY SUSPECT_GENDER
## 7082 7082
## TOTALNUMBERVICTIMS TOTALSUSPECTS
## 33 7082
## UCR_GROUP ZIP
## 10 1
## COMMUNITY_COUNCIL_NEIGHBORHOOD SNA_NEIGHBORHOOD
## 0 0
Column with most missing values
na_counts <- colSums(is.na(df))
print(paste("The column with the most missing values is:", names(na_counts)[na_counts == max(na_counts)], " with total number of", max(na_counts), "missing values "))
## [1] "The column with the most missing values is: OPENING with total number of 14508 missing values "
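As an alternative way to read off the same answer, the per-column counts can be sorted, or `which.max()` can pick the first column with the largest count. This is only a sketch on a made-up data frame: `toy` and its values are hypothetical, and the column names are borrowed from the crime data purely for illustration.

```r
# Hypothetical toy data frame; values are made up
toy <- data.frame(
  OPENING = c(NA, NA, NA, "DOOR"),
  FLOOR   = c(NA, "1", NA, "2"),
  ZIP     = c(45202, 45205, NA, 45211)
)

na_counts <- colSums(is.na(toy))                 # named vector: missing values per column
sorted_na <- sort(na_counts, decreasing = TRUE)  # most missing first

worst_col <- names(which.max(na_counts))         # first column with the largest count
worst_n   <- max(na_counts)
```

Note that `which.max()` returns only the first maximum, whereas comparing against `max()` returns every tied column.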
• Using the DATE_REPORTED column, what is the range of dates included in this data?
range(df[['DATE_REPORTED']])
## [1] "01/01/2022 01:08:00 AM" "06/26/2022 12:50:00 AM"
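One caveat: `DATE_REPORTED` was read as a character column, so `range()` compares the strings alphabetically rather than chronologically. The sketch below uses made-up timestamps in the same layout to show how the two can differ; the format string is an assumption based on the values printed above, and parsing `%p` (AM/PM) can depend on the session locale.

```r
# Hypothetical timestamps in the same MM/DD/YYYY hh:mm:ss AM/PM layout
stamps <- c(
  "01/01/2022 01:08:00 AM",
  "01/01/2022 12:05:00 AM",
  "06/26/2022 12:50:00 AM",
  "06/26/2022 09:15:00 PM"
)

text_range <- range(stamps)   # alphabetical comparison of the strings

# Parse into date-times; %I is the 12-hour clock and %p the AM/PM marker
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
time_range <- range(parsed)   # chronological comparison

time_range_txt <- format(time_range, "%Y-%m-%d %H:%M", tz = "UTC")
```

In this toy example the string range ends at 12:50 AM, while the parsed range runs from 12:05 AM on the first day to 9:15 PM on the last. The same conversion could be applied to `df[['DATE_REPORTED']]` before taking the range.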
• Using table(), what is the most common age range for known SUSPECT_AGEs?
table(df[['SUSPECT_AGE']])
##
## 18-25 26-30 31-40 41-50 51-60 61-70 OVER 70 UNDER 18
## 1778 1126 1525 659 298 121 16 629
## UNKNOWN
## 9003
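Because the question asks about known ages, the `UNKNOWN` category has to be set aside before picking the largest count. The sketch below re-enters the counts printed above by hand (the vector `age_counts` is just a transcription of that output) and drops the placeholder category.

```r
# Counts copied from the table() output above
age_counts <- c(
  "18-25" = 1778, "26-30" = 1126, "31-40" = 1525, "41-50" = 659,
  "51-60" = 298, "61-70" = 121, "OVER 70" = 16, "UNDER 18" = 629,
  "UNKNOWN" = 9003
)

known <- age_counts[names(age_counts) != "UNKNOWN"]  # drop the placeholder category
top_known <- names(which.max(known))                 # most frequent known range
share_unknown <- age_counts[["UNKNOWN"]] / sum(age_counts)
```

`UNKNOWN` is stored as an ordinary value rather than `NA`, which is why `colSums(is.na(df))` reports 0 missing values for `SUSPECT_AGE` even though roughly 59% of rows carry no usable age.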
• Use table() to get the number of incidents per zip code. Sort this table for those zip codes with the most activity to the least activity. Which zip code has the most incidents? Do you see any peculiar data quality issues with any of these zip code values?
sort(table(df[['ZIP']]), decreasing = TRUE)
##
## 45202 45205 45211 45238 45229 45219 45225 45214 45237 45223 45206 45220 45232
## 2049 1110 1094 956 913 863 811 774 699 653 616 477 477
## 45224 45209 45208 45204 45216 45227 45207 45203 45230 45213 45239 45226 45217
## 429 380 359 348 302 286 245 226 214 190 169 112 100
## 45221 45233 45212 45215 45231 45228 42502 45236 45244 45248 4523 5239
## 90 77 61 47 7 5 3 3 3 3 2 1
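The tail of the sorted table contains values that do not look like the others: 4523 and 5239 have only four digits, and 42502 does not share the prefix of the remaining codes. A rough way to flag such values is sketched below. The prefix rule is an assumption drawn from the pattern in the table, not an authoritative list of valid ZIP codes.

```r
# A few ZIP values taken from the sorted table above
zips <- c(45202, 45205, 45211, 42502, 4523, 5239)
zip_chr <- as.character(zips)

five_digits <- grepl("^[0-9]{5}$", zip_chr)   # well-formed five-digit ZIP
# Assumption: valid codes in this data share the "45" prefix seen in the rest of the table
expected_prefix <- startsWith(zip_chr, "45")

suspect <- zips[!(five_digits & expected_prefix)]
```

Whether the flagged values are typos (for example, 42502 as a transposition of 45202) cannot be confirmed from the data alone, so recoding them to `NA` may be safer than guessing a correction.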
• Using the DAYOFWEEK column, which day do most incidents occur on? What is the proportion of incidents that fall on this day?
table(df[['DAYOFWEEK']]) / sum(table(df[['DAYOFWEEK']]))
##
## FRIDAY MONDAY SATURDAY SUNDAY THURSDAY TUESDAY WEDNESDAY
## 0.1369807 0.1438365 0.1542221 0.1448547 0.1363019 0.1432935 0.1405105
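The same proportions can be computed with `prop.table()` and sorted so the busiest day comes first. The sketch below uses a short made-up vector of day labels (`days` is hypothetical) and also shows how the share changes when rows with a missing day are kept in the denominator, since `table()` drops `NA` by default.

```r
# Hypothetical day labels, including one missing value
days <- c("SATURDAY", "MONDAY", "SATURDAY", "FRIDAY", NA, "SATURDAY", "MONDAY")

day_counts <- table(days)   # NA is dropped by default
day_props  <- sort(prop.table(day_counts), decreasing = TRUE)

top_day   <- names(day_props)[1]
top_share <- as.numeric(day_props[1])   # share among rows with a recorded day

# Share of all rows, counting the missing day in the denominator
top_share_all <- max(day_counts) / length(days)
```

In the crime data, `DAYOFWEEK` has 423 missing values, so the proportions printed above are shares of incidents with a recorded day.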
• Looking at the information this data set provides, what are some insights you’d be interested in assessing? Analyze three different columns that could start to provide you with these insights. Are there missing values in these columns? What are some summary statistics you can compute for these columns? Are there any outliers or aberrant values in these columns? How do you know? Would you remove or recode them?
One area I’d be interested in exploring is the geographic distribution of crime incidents, identifying ZIP codes or neighborhoods with higher crime rates. Additionally, I would analyze the data to see if there are patterns in the ethnicity of both victims and suspects. Columns like LOCATION, VICTIM_ETHNICITY, and SUSPECT_ETHNICITY can provide these insights. Based on the missing-value counts above, LOCATION has 2 missing values, VICTIM_ETHNICITY has 2,192, and SUSPECT_ETHNICITY has 7,082.
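A first pass over these three columns could use a small helper that reports missing values, the number of distinct values, and the most frequent value. The sketch below is illustrative only: `summarise_column` is a hypothetical helper, and the toy data frame uses made-up values rather than rows from the crime data.

```r
# Hypothetical helper: basic summary for one categorical column
summarise_column <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)   # frequencies of non-missing values
  list(
    n_missing  = sum(is.na(x)),
    n_distinct = length(counts),
    top_value  = names(counts)[1],
    top_share  = as.numeric(counts[1]) / sum(counts)
  )
}

# Toy data; values are made up for illustration only
toy <- data.frame(
  LOCATION          = c("STREET", "STREET", "PARKING LOT", NA),
  VICTIM_ETHNICITY  = c("GROUP A", "GROUP A", NA, NA),
  SUSPECT_ETHNICITY = c(NA, NA, NA, "GROUP B")
)

cols <- c("LOCATION", "VICTIM_ETHNICITY", "SUSPECT_ETHNICITY")
summaries <- lapply(toy[cols], summarise_column)
```

Running the same helper on `df[cols]` would give a starting point for spotting rare or aberrant categories before deciding whether to recode them.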
This dataset lists crime incidents with attributes such as location and victim and suspect demographics. It contains 95,592 missing values in total; the OPENING column has the most, with 14,508 missing entries. The reports span January 1, 2022, to June 26, 2022. Among suspects with a known age, the most common range is 18-25. ZIP code 45202 has the most incidents (2,049), and Saturday is the day with the most reported crimes, accounting for 15.42% of incidents that have a recorded day of the week.