Part 1

Synopsis of Me

Hello! My name is Johnny Liu and I am a student at the University of Cincinnati.I was born in Dayton, Ohio but lived the first 3-4 years of my life in China before my parents eventually settled down in Toledo, Ohio.

Academic Background

  • BS in Business Analytics (University of Cincinnati) (in progress)

  • BBA in Information Systems (University of Cincinnati) (in progress)

Professional Background

I have not had any professional experience as of yet.

Experience with R

I have used R in a previous class before, and would say I am fairly familiar with the basics of it.

Experience with Other Analytic Software

I have some experiences with SQL from previous classes and am currently learning Python.

Part 2

Question 1

# Fill in the blanks below to import the blood_transfusion.csv file (provided via Canvas) and answer the following questions.
df <- readr::read_csv("data/blood_transfusion.csv")
## Rows: 748 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Class
## dbl (4): Recency, Frequency, Monetary, Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# What are the dimensions of this data (number of rows and columns)?
dim(df)
## [1] 748   5
# What are the data types of each column?
str(df)
## spc_tbl_ [748 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Recency  : num [1:748] 2 0 1 2 1 4 2 1 2 5 ...
##  $ Frequency: num [1:748] 50 13 16 20 24 4 7 12 9 46 ...
##  $ Monetary : num [1:748] 12500 3250 4000 5000 6000 1000 1750 3000 2250 11500 ...
##  $ Time     : num [1:748] 98 28 35 45 77 4 14 35 22 98 ...
##  $ Class    : chr [1:748] "donated" "donated" "donated" "donated" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Recency = col_double(),
##   ..   Frequency = col_double(),
##   ..   Monetary = col_double(),
##   ..   Time = col_double(),
##   ..   Class = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Are there any missing values?
sum(is.na(df))
## [1] 0
# Check out the first 10 rows? What are the Class values for the first 10 observations?\
head(df, 10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1       2        50    12500    98 donated    
##  2       0        13     3250    28 donated    
##  3       1        16     4000    35 donated    
##  4       2        20     5000    45 donated    
##  5       1        24     6000    77 not donated
##  6       4         4     1000     4 not donated
##  7       2         7     1750    14 donated    
##  8       1        12     3000    35 not donated
##  9       2         9     2250    22 donated    
## 10       5        46    11500    98 donated
# Check out the last 10 rows? What are the Class values for the last 10 observations?
tail(df, 10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1      23         1      250    23 not donated
##  2      23         4     1000    52 not donated
##  3      23         1      250    23 not donated
##  4      23         7     1750    88 not donated
##  5      16         3      750    86 not donated
##  6      23         2      500    38 not donated
##  7      21         2      500    52 not donated
##  8      23         3      750    62 not donated
##  9      39         1      250    39 not donated
## 10      72         1      250    72 not donated
# Index for the 100th row and just the Monetary column. What is the value?
df[100, 'Monetary']
## # A tibble: 1 × 1
##   Monetary
##      <dbl>
## 1     1750
# Index for just the Monetary column. What is the mean of this vector?
mean(df[['Monetary']])
## [1] 1378.676
# Subset this data frame for all observations where Monetary is greater than the mean value. How many rows are in the resulting data frame?
above_avg <- df[['Monetary']] > mean(df[['Monetary']])
df[above_avg, 'Monetary']
## # A tibble: 267 × 1
##    Monetary
##       <dbl>
##  1    12500
##  2     3250
##  3     4000
##  4     5000
##  5     6000
##  6     1750
##  7     3000
##  8     2250
##  9    11500
## 10     5750
## # … with 257 more rows

Question 2

# Go to this University of Dayton webpage http://academic.udayton.edu/kissock/http/Weather/citylistUS.htm, scroll down to Ohio and import the Cincinnati (OHCINCIN.txt) file. Fill in the blanks below to import the data directly from online and answer the questions that follow.

df <- readr::read_table('http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt', col_names = c('Month', 'Day', 'Year', 'Avg_temp'))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Month = col_double(),
##   Day = col_double(),
##   Year = col_double(),
##   Avg_temp = col_double()
## )
## Warning: 3 parsing failures.
##  row col  expected    actual                                                                           file
## 4623  -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
## 5016  -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
## 5213  -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
# What are the dimensions of this data (number of rows and columns)? 
dim(df)
## [1] 9265    4
# What do you think these columns represent?

I believe these columns represent the average daily temperature of Cincinnati from January 1, 1995 to May 13, 2020.

# Are there any missing values in this data?
sum(is.na(df))
## [1] 0
# Index for the 365th row. What is the date of this observation and what was the average temperature?
df[365, c("Day", "Avg_temp")]
## # A tibble: 1 × 2
##     Day Avg_temp
##   <dbl>    <dbl>
## 1    31     39.3
# Subset for all observations that happened during January of 2000. What was the median average temp for this month?
jan_2000 <- df[df[['Month']] == 1 & df[['Year']] == 2000, ]
median(jan_2000[['Avg_temp']])
## [1] 27.1
# Which date was the highest average temp recorded (hint: which.max)? 
df[which.max(df[['Avg_temp']]),]
## # A tibble: 1 × 4
##   Month   Day  Year Avg_temp
##   <dbl> <dbl> <dbl>    <dbl>
## 1     7     7  2012     89.2
# Which date was the coldest average temp recorded? Does this temp make sense? Are there more than just one date that has this temperature value recorded? If so, how many?
df[which.min(df[['Avg_temp']]),]
## # A tibble: 1 × 4
##   Month   Day  Year Avg_temp
##   <dbl> <dbl> <dbl>    <dbl>
## 1    12    24  1998      -99
sum(df[['Avg_temp']] == -99)
## [1] 14
# Compute the mean of the average temp column. Now re-code all -99s to NA and recompute the mean.
mean(df[['Avg_temp']])
## [1] 54.39876
bad_values <- df[['Avg_temp']] == -99
df[bad_values, 'Avg_temp'] <- NA
mean(df[['Avg_temp']], na.rm = TRUE)
## [1] 54.6309

Question 3

# Fill in the blanks below to import the PDI__Police_Data_Initiative__Crime_Incidents.csv data (provided via Canvas) and answer the questions that follow. Data is taken from the City of Cincinnati Open Data Portal website 4, which you may need to read to place context in your answers.
df <- readr::read_csv("data/PDI__Police_Data_Initiative__Crime_Incidents.csv")
## Rows: 15155 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): INSTANCEID, INCIDENT_NO, DATE_REPORTED, DATE_FROM, DATE_TO, CLSD, ...
## dbl  (6): UCR, LONGITUDE_X, LATITUDE_X, TOTALNUMBERVICTIMS, TOTALSUSPECTS, ZIP
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# What are the dimensions of this data (number of rows and columns)?
dim(df)
## [1] 15155    40
# What do you think these columns represent?

I think these columns represent all the different incidents that happened, including information such as the dates, offense types, locations, etc.

# Are there any missing values in this data? If so, how many missing values are in each column? Which column has the most missing values?
sum(is.na(df))
## [1] 95592
colSums(is.na(df))
##                     INSTANCEID                    INCIDENT_NO 
##                              0                              0 
##                  DATE_REPORTED                      DATE_FROM 
##                              0                              2 
##                        DATE_TO                           CLSD 
##                              9                            545 
##                            UCR                            DST 
##                             10                              0 
##                           BEAT                        OFFENSE 
##                             28                             10 
##                       LOCATION                     THEFT_CODE 
##                              2                          10167 
##                          FLOOR                           SIDE 
##                          14127                          14120 
##                        OPENING                      HATE_BIAS 
##                          14508                              0 
##                      DAYOFWEEK                       RPT_AREA 
##                            423                            239 
##               CPD_NEIGHBORHOOD                        WEAPONS 
##                            249                              5 
##              DATE_OF_CLEARANCE                      HOUR_FROM 
##                           2613                              2 
##                        HOUR_TO                      ADDRESS_X 
##                              9                            148 
##                    LONGITUDE_X                     LATITUDE_X 
##                           1714                           1714 
##                     VICTIM_AGE                    VICTIM_RACE 
##                              0                           2192 
##               VICTIM_ETHNICITY                  VICTIM_GENDER 
##                           2192                           2192 
##                    SUSPECT_AGE                   SUSPECT_RACE 
##                              0                           7082 
##              SUSPECT_ETHNICITY                 SUSPECT_GENDER 
##                           7082                           7082 
##             TOTALNUMBERVICTIMS                  TOTALSUSPECTS 
##                             33                           7082 
##                      UCR_GROUP                            ZIP 
##                             10                              1 
## COMMUNITY_COUNCIL_NEIGHBORHOOD               SNA_NEIGHBORHOOD 
##                              0                              0
# Using the DATE_REPORTED column, what is the range of dates included in this data?
range(df[['DATE_REPORTED']])
## [1] "01/01/2022 01:08:00 AM" "06/26/2022 12:50:00 AM"
# Using table(), what is the most common age range for known SUSPECT_AGEs?
table(df['SUSPECT_AGE'])
## SUSPECT_AGE
##    18-25    26-30    31-40    41-50    51-60    61-70  OVER 70 UNDER 18 
##     1778     1126     1525      659      298      121       16      629 
##  UNKNOWN 
##     9003
# Use table() to get the number of incidents per zip code. Sort this table for those zip codes with the most activity to the least activity. Which zip code has the most incidents? Do you see any peculiar data quality issues with any of these zip code values?
table(df['ZIP'])
## ZIP
##  4523  5239 42502 45202 45203 45204 45205 45206 45207 45208 45209 45211 45212 
##     2     1     3  2049   226   348  1110   616   245   359   380  1094    61 
## 45213 45214 45215 45216 45217 45219 45220 45221 45223 45224 45225 45226 45227 
##   190   774    47   302   100   863   477    90   653   429   811   112   286 
## 45228 45229 45230 45231 45232 45233 45236 45237 45238 45239 45244 45248 
##     5   913   214     7   477    77     3   699   956   169     3     3
# Using the DAYOFWEEK column, which day do most incidents occur on? What is the proportion of incidents that fall on this day?
table(df[['DAYOFWEEK']]) / sum(table(df[['DAYOFWEEK']]))
## 
##    FRIDAY    MONDAY  SATURDAY    SUNDAY  THURSDAY   TUESDAY WEDNESDAY 
## 0.1369807 0.1438365 0.1542221 0.1448547 0.1363019 0.1432935 0.1405105
# Looking at the information this data set provides, what are some insights you’d be interested in assessing? Analyze three different columns that could start to provide you with these insights. Are there missing values in these columns? What are some summary statistics you can compute for these columns? Are there any outliers or aberrant values in these columns? How do you know? Would you remove or recode them?

I’d be interested in finding out which locations have the most incidents. Does the victim’s ethnicity or gender have anything to do with it? Do certain ethnicity or genders have a higher chance of falling victim to crimes? Some of the columns I could analyze for these questions could be the LOCATION, VICTIM_GENDER, and VICTIM_ETHNICITY.

sum(is.na(df['LOCATION']))
## [1] 2
sum(is.na(df['VICTIM_GENDER']))
## [1] 2192
sum(is.na(df['VICTIM_ETHNICITY']))
## [1] 2192

I can see that the genders and ethnicity columns have a lot of missing values. These could be due to various reasons, but for analyzing purposes, I think the easiest solution would be to remove these missing values. The downside to that is that we would be losing a decent amount of data as well.