Part 1: HTML Bio

knitr::include_graphics("IMG_5076.JPG")

Intro to Me

Above is the Clifton-based trivia team, “Dirty Lube and the Boys” (13x Murphy’s Pub trivia champions). From left to right are my teammates Emmett, Austin (also known as Lube), Jack, and myself. I am wearing a neon orange Nas hoodie because Nas is the best rapper of all time.

A synopsis of myself

I am from the Cincinnati area: I grew up in West Chester and attended Archbishop Moeller High School. At Moeller I was involved in golf, lacrosse, ski team, MACH1, and men’s chorus. After high school, I attended the University of Dayton for three years, where I studied Mechanical Engineering. I transferred to the University of Cincinnati in the spring of ’20. I am currently in my last semester of undergrad, majoring in Marketing with a minor in Business Analytics. Outside of school, I am heavily involved in my fraternity, Sigma Phi Epsilon. In my free time I enjoy listening to vinyl, which I have been collecting for a while. Some of my other hobbies include:

  • Playing guitar
  • Skateboarding
  • Fishing
  • Tennis
  • Golf
  • Watching the Bengals
  • Watching NASCAR / Formula 1

Academic Background

  • Archbishop Moeller High School
  • University of Dayton
  • University of Cincinnati

Professional Background

This past month I completed my summer internship with Interprise Holdings. Because it was a sales internship, my job didn’t really explore the analytical side of the company. I do enjoy sales, but I would much rather go into analytics. After I graduate this semester, I will be looking for a full-time position in marketing research or business analytics.

Experience With R

I have used R in at least one class in each of the past three semesters, so about a year and a half in total. I enjoy R very much and like the fact that it is extremely versatile. One of my favorite features is its data visualization tooling, such as ggplot. I have used R on several projects, including building histograms from Excel files with millions of cells.

Other Analytical Experience

I have been using Excel and VBA since high school, and I also used them in some of my engineering classes at UD. Excel is one of my favorite programs, and I think there is a lot that can be done with it. Other tools I have used to analyze data include SAS, Python, and SQL.

Part 2: Importing Data

Question 1

blood_trans <- read.csv(file = 'blood_transfusion.csv')
  • What are the dimensions of this data (number of rows and columns)?
nrow(blood_trans)
## [1] 748
ncol(blood_trans)
## [1] 5
  • What are the data types of each column?
class(blood_trans$Recency)
## [1] "integer"
class(blood_trans$Frequency)
## [1] "integer"
class(blood_trans$Monetary)
## [1] "integer"
class(blood_trans$Time)
## [1] "integer"
class(blood_trans$Class)
## [1] "character"
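The row, column, and per-column type checks above can also be done in fewer calls. The following is a minimal sketch, assuming a small made-up data frame named `toy` that only mimics the shape of blood_trans (its values are illustrative, not taken from the file):

```r
# Hypothetical stand-in for blood_trans; the values are made up.
toy <- data.frame(
  Recency = c(2L, 0L, 1L),
  Frequency = c(50L, 13L, 16L),
  Class = c("donated", "donated", "not donated"),
  stringsAsFactors = FALSE
)

# dim() returns the number of rows and columns in one vector.
toy_dim <- dim(toy)

# sapply() applies class() to every column and returns a named vector.
col_types <- sapply(toy, class)

print(toy_dim)
print(col_types)
```

On the real data, the same two calls would be applied to blood_trans instead of `toy`.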
  • Are there any missing values?
sum(is.na(blood_trans))
## [1] 0
  • Check out the first 10 rows? What are the Class values for the first 10 observations?
head(blood_trans,10)
##    Recency Frequency Monetary Time       Class
## 1        2        50    12500   98     donated
## 2        0        13     3250   28     donated
## 3        1        16     4000   35     donated
## 4        2        20     5000   45     donated
## 5        1        24     6000   77 not donated
## 6        4         4     1000    4 not donated
## 7        2         7     1750   14     donated
## 8        1        12     3000   35 not donated
## 9        2         9     2250   22     donated
## 10       5        46    11500   98     donated
  • Check out the last 10 rows? What are the Class values for the last 10 observations?
tail(blood_trans,10)
##     Recency Frequency Monetary Time       Class
## 739      23         1      250   23 not donated
## 740      23         4     1000   52 not donated
## 741      23         1      250   23 not donated
## 742      23         7     1750   88 not donated
## 743      16         3      750   86 not donated
## 744      23         2      500   38 not donated
## 745      21         2      500   52 not donated
## 746      23         3      750   62 not donated
## 747      39         1      250   39 not donated
## 748      72         1      250   72 not donated
  • Index for the 100th row and just the Monetary column. What is the value?
blood_trans$Monetary[100]
## [1] 1750
  • Index for just the Monetary column. What is the mean of this vector?
mean(blood_trans$Monetary)
## [1] 1378.676

  • Subset this data frame for all observations where Monetary is greater than the mean value. How many rows are in the resulting data frame?

sum(blood_trans$Monetary > mean(blood_trans$Monetary))
## [1] 267
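The count above comes from summing a logical vector. If the subset itself is needed as a data frame, one possible approach is sketched below on a made-up data frame (`toy` and its values are assumptions for illustration only):

```r
# Hypothetical data; only the Monetary column is needed for the sketch.
toy <- data.frame(Monetary = c(250, 500, 1000, 4000, 6000))

# Keep only the rows whose Monetary value exceeds the column mean.
# drop = FALSE keeps the result a data frame even with one column.
above_mean <- toy[toy$Monetary > mean(toy$Monetary), , drop = FALSE]

# The row count of the subset equals the sum of the logical condition.
n_above <- nrow(above_mean)
print(n_above)
```

Both routes give the same number; the subset is only worth keeping if the rows are used again later.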

Question 2

UD_Table <- 'https://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt' 
UDtemp <- read.table(UD_Table, col.names = c("Month", "Day", "Year", "Temp"))
  • What are the dimensions of this data (number of rows and columns)?
nrow(UDtemp)
## [1] 9265
ncol(UDtemp)
## [1] 4
  • What do you think these columns represent?
    • These columns represent the month, day of the month, year, and average daily temperature (presumably in degrees Fahrenheit).
  • Are there any missing values in this data?
sum(is.na(UDtemp))
## [1] 0
  • Index for the 365th row. What is the date of this observation and what was the average temperature?
UDtemp[365,]
##     Month Day Year Temp
## 365    12  31 1995 39.3

  • Subset for all observations that happened during January of 2000. What was the median average temp for this month?

JanTemp <- UDtemp[ which( UDtemp$Month == 1 & UDtemp$Year == 2000),]
median(JanTemp$Temp)
## [1] 27.1

  • Which date was the highest average temp recorded (hint: which.max)?

UDtemp[ which.max(UDtemp$Temp),]
##      Month Day Year Temp
## 6398     7   7 2012 89.2

  • Which date was the coldest average temp recorded? Does this temp make sense? Are there more than just one date that has this temperature value recorded? If so, how many?

UDtemp[ which.min(UDtemp$Temp),]
##      Month Day Year Temp
## 1454    12  24 1998  -99
sum(UDtemp$Temp == -99)
## [1] 14
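##A temp of -99 doesn't make sense as a real reading. Since the same value appears on 14 dates, it is most likely a placeholder for missing data.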

  • Compute the mean of the average temp column. Now re-code all -99s to NA and recompute the mean.

mean(UDtemp$Temp)
## [1] 54.39876
Temp_clean <- ifelse(UDtemp$Temp <= -99, NA, UDtemp$Temp)
mean(Temp_clean, na.rm = TRUE)
## [1] 54.6309
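The recoding step can also be written with logical indexing instead of ifelse(). This is a minimal sketch on a made-up vector (`temps` and its values are assumptions, not rows from the weather file):

```r
# Hypothetical temperatures with -99 used as a missing-value placeholder.
temps <- c(39.3, -99, 27.1, -99, 89.2)

# Work on a copy so the raw values stay available for comparison.
temps_clean <- temps
temps_clean[temps_clean == -99] <- NA

# The placeholder drags the raw mean down; na.rm = TRUE skips the NAs.
raw_mean <- mean(temps)
clean_mean <- mean(temps_clean, na.rm = TRUE)

print(raw_mean)
print(clean_mean)
```

The effect is much larger in this toy vector than in the real data because two of its five values are placeholders.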

Question 3

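Police_data <- read.csv('PDI__Police_Data_Initiative__Crime_Incidents.csv', na.strings = "")
  • What are the dimensions of this data (number of rows and columns)?
##The data has 15155 rows and 40 columns, which matches the SUSPECT_AGE table total and the number of columns listed by colSums() below.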
  • What do you think these columns represent?
head(Police_data)
##                             INSTANCEID INCIDENT_NO          DATE_REPORTED
## 1 4B312B08-FE95-4DD4-8A62-20D1A1138E82   229000003 01/01/2022 12:09:00 AM
## 2 4B312B08-FE95-4DD4-8A62-20D1A1138E82   229000003 01/01/2022 12:09:00 AM
## 3 4B312B08-FE95-4DD4-8A62-20D1A1138E82   229000003 01/01/2022 12:09:00 AM
## 4 4B312B08-FE95-4DD4-8A62-20D1A1138E82   229000003 01/01/2022 12:09:00 AM
## 5 4B312B08-FE95-4DD4-8A62-20D1A1138E82   229000003 01/01/2022 12:09:00 AM
## 6 4B312B08-FE95-4DD4-8A62-20D1A1138E82   229000003 01/01/2022 12:09:00 AM
##                DATE_FROM                DATE_TO                         CLSD
## 1 12/31/2021 11:50:00 PM 01/01/2022 12:08:00 AM F--CLEARED BY ARREST - ADULT
## 2 12/31/2021 11:50:00 PM 01/01/2022 12:08:00 AM F--CLEARED BY ARREST - ADULT
## 3 12/31/2021 11:50:00 PM 01/01/2022 12:08:00 AM F--CLEARED BY ARREST - ADULT
## 4 12/31/2021 11:50:00 PM 01/01/2022 12:08:00 AM F--CLEARED BY ARREST - ADULT
## 5 12/31/2021 11:50:00 PM 01/01/2022 12:08:00 AM F--CLEARED BY ARREST - ADULT
## 6 12/31/2021 11:50:00 PM 01/01/2022 12:08:00 AM F--CLEARED BY ARREST - ADULT
##    UCR DST BEAT                       OFFENSE LOCATION THEFT_CODE FLOOR SIDE
## 1  803   2    2                      MENACING   26-BAR       <NA>  <NA> <NA>
## 2  803   2    2                      MENACING   26-BAR       <NA>  <NA> <NA>
## 3  803   2    2                      MENACING   26-BAR       <NA>  <NA> <NA>
## 4 1493   2    2 CRIMINAL DAMAGING/ENDANGERING   26-BAR       <NA>  <NA> <NA>
## 5 1493   2    2 CRIMINAL DAMAGING/ENDANGERING   26-BAR       <NA>  <NA> <NA>
## 6 1493   2    2 CRIMINAL DAMAGING/ENDANGERING   26-BAR       <NA>  <NA> <NA>
##   OPENING                 HATE_BIAS DAYOFWEEK RPT_AREA CPD_NEIGHBORHOOD
## 1    <NA> N--NO BIAS/NOT APPLICABLE    FRIDAY      124           OAKLEY
## 2    <NA> N--NO BIAS/NOT APPLICABLE    FRIDAY      124           OAKLEY
## 3    <NA> N--NO BIAS/NOT APPLICABLE    FRIDAY      124           OAKLEY
## 4    <NA> N--NO BIAS/NOT APPLICABLE    FRIDAY      124           OAKLEY
## 5    <NA> N--NO BIAS/NOT APPLICABLE    FRIDAY      124           OAKLEY
## 6    <NA> N--NO BIAS/NOT APPLICABLE    FRIDAY      124           OAKLEY
##             WEAPONS      DATE_OF_CLEARANCE HOUR_FROM HOUR_TO       ADDRESS_X
## 1         99 - NONE 01/01/2022 12:00:00 AM      2350       8 30XX MADISON RD
## 2         99 - NONE 01/01/2022 12:00:00 AM      2350       8 30XX MADISON RD
## 3         99 - NONE 01/01/2022 12:00:00 AM      2350       8 30XX MADISON RD
## 4 80 - OTHER WEAPON 01/01/2022 12:00:00 AM      2350       8 30XX MADISON RD
## 5 80 - OTHER WEAPON 01/01/2022 12:00:00 AM      2350       8 30XX MADISON RD
## 6 80 - OTHER WEAPON 01/01/2022 12:00:00 AM      2350       8 30XX MADISON RD
##   LONGITUDE_X LATITUDE_X VICTIM_AGE VICTIM_RACE     VICTIM_ETHNICITY
## 1   -84.43017   39.15166      18-25       WHITE NOT OF HISPANIC ORIG
## 2   -84.43140   39.15350    UNKNOWN        <NA>                 <NA>
## 3   -84.43091   39.15360      31-40       WHITE NOT OF HISPANIC ORIG
## 4   -84.42995   39.15224      18-25       WHITE NOT OF HISPANIC ORIG
## 5   -84.43008   39.15209    UNKNOWN        <NA>                 <NA>
## 6   -84.42980   39.15228      31-40       WHITE NOT OF HISPANIC ORIG
##   VICTIM_GENDER SUSPECT_AGE SUSPECT_RACE SUSPECT_ETHNICITY SUSPECT_GENDER
## 1          MALE     UNKNOWN         <NA>              <NA>           <NA>
## 2          <NA>     UNKNOWN         <NA>              <NA>           <NA>
## 3        FEMALE     UNKNOWN         <NA>              <NA>           <NA>
## 4          MALE     UNKNOWN         <NA>              <NA>           <NA>
## 5          <NA>     UNKNOWN         <NA>              <NA>           <NA>
## 6        FEMALE     UNKNOWN         <NA>              <NA>           <NA>
##   TOTALNUMBERVICTIMS TOTALSUSPECTS    UCR_GROUP   ZIP
## 1                  3            NA PART 2 MINOR 45209
## 2                  3            NA PART 2 MINOR 45209
## 3                  3            NA PART 2 MINOR 45209
## 4                  3            NA PART 2 MINOR 45209
## 5                  3            NA PART 2 MINOR 45209
## 6                  3            NA PART 2 MINOR 45209
##   COMMUNITY_COUNCIL_NEIGHBORHOOD SNA_NEIGHBORHOOD
## 1                         OAKLEY           OAKLEY
## 2                         OAKLEY           OAKLEY
## 3                         OAKLEY           OAKLEY
## 4                         OAKLEY           OAKLEY
## 5                         OAKLEY           OAKLEY
## 6                         OAKLEY           OAKLEY
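##These columns describe each incident: when it was reported and when it occurred (DATE_REPORTED, DATE_FROM, DATE_TO), the offense and its clearance status (OFFENSE, UCR, CLSD), where it happened (ADDRESS_X, LONGITUDE_X, LATITUDE_X, ZIP, and the neighborhood columns), the weapon involved (WEAPONS), and victim and suspect demographics (age, race, ethnicity, gender).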

  • Are there any missing values in this data? If so, how many missing values are in each column?

sum(is.na(Police_data))
## [1] 95592
colSums(is.na(Police_data))
##                     INSTANCEID                    INCIDENT_NO 
##                              0                              0 
##                  DATE_REPORTED                      DATE_FROM 
##                              0                              2 
##                        DATE_TO                           CLSD 
##                              9                            545 
##                            UCR                            DST 
##                             10                              0 
##                           BEAT                        OFFENSE 
##                             28                             10 
##                       LOCATION                     THEFT_CODE 
##                              2                          10167 
##                          FLOOR                           SIDE 
##                          14127                          14120 
##                        OPENING                      HATE_BIAS 
##                          14508                              0 
##                      DAYOFWEEK                       RPT_AREA 
##                            423                            239 
##               CPD_NEIGHBORHOOD                        WEAPONS 
##                            249                              5 
##              DATE_OF_CLEARANCE                      HOUR_FROM 
##                           2613                              2 
##                        HOUR_TO                      ADDRESS_X 
##                              9                            148 
##                    LONGITUDE_X                     LATITUDE_X 
##                           1714                           1714 
##                     VICTIM_AGE                    VICTIM_RACE 
##                              0                           2192 
##               VICTIM_ETHNICITY                  VICTIM_GENDER 
##                           2192                           2192 
##                    SUSPECT_AGE                   SUSPECT_RACE 
##                              0                           7082 
##              SUSPECT_ETHNICITY                 SUSPECT_GENDER 
##                           7082                           7082 
##             TOTALNUMBERVICTIMS                  TOTALSUSPECTS 
##                             33                           7082 
##                      UCR_GROUP                            ZIP 
##                             10                              1 
## COMMUNITY_COUNCIL_NEIGHBORHOOD               SNA_NEIGHBORHOOD 
##                              0                              0
  • Which column has the most missing values?
na_counts <- colSums(is.na(Police_data))
names(na_counts)[na_counts == max(na_counts)]
## [1] "OPENING"
max(sapply(Police_data, function(x) sum(is.na(x))))
## [1] 14508
  • Using the DATE_REPORTED column, what is the range of dates included in this data?
range(Police_data$DATE_REPORTED)
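## [1] "01/01/2022 01:08:00 AM" "06/26/2022 12:50:00 AM"
##DATE_REPORTED is stored as character, so this range is alphabetical rather than chronological. For example, the first rows show a report at 01/01/2022 12:09:00 AM, which is earlier than the minimum printed here.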
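Because range() on a character column compares strings, a chronological range needs the timestamps parsed first. Below is a minimal sketch, assuming the timestamps follow the month/day/year 12-hour format seen in the output above; the vector `stamps` is made up, and parsing `%p` (AM/PM) depends on an English or C locale:

```r
# Hypothetical timestamps in the same layout as DATE_REPORTED.
stamps <- c(
  "01/01/2022 12:09:00 AM",
  "01/01/2022 01:08:00 AM",
  "06/26/2022 12:50:00 AM",
  "06/26/2022 09:15:00 PM"
)

# Character comparison: "01:08" sorts before "12:09" even though
# 12:09 AM is the earlier time of day.
char_range <- range(stamps)

# Parse to date-times (%I = 12-hour clock, %p = AM/PM), then take the range.
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
time_range <- range(parsed)

print(char_range)
print(time_range)
```

The time zone is set to UTC only to keep the sketch reproducible; the real data would call for the local time zone.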
  • Using table(), what is the most common age range for known SUSPECT_AGEs?
table(Police_data$SUSPECT_AGE)
## 
##    18-25    26-30    31-40    41-50    51-60    61-70  OVER 70 UNDER 18 
##     1778     1126     1525      659      298      121       16      629 
##  UNKNOWN 
##     9003
##Among known ages, 18-25 is the most common range (1778 rows).
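To have R pick out the most common known age range rather than reading it off the table, the "UNKNOWN" label can be dropped before counting. A minimal sketch on a made-up vector (`ages` is an assumption for illustration):

```r
# Hypothetical age labels, including the "UNKNOWN" placeholder.
ages <- c("18-25", "UNKNOWN", "31-40", "18-25", "UNKNOWN", "UNKNOWN", "26-30")

# Drop the placeholder so only known age ranges are counted.
known <- ages[ages != "UNKNOWN"]

# Count each range and sort from most to least frequent.
age_counts <- sort(table(known), decreasing = TRUE)

# The first name in the sorted table is the most common known range.
top_age <- names(age_counts)[1]
print(age_counts)
print(top_age)
```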
  • Use table() to get the number of incidents per zip code. Sort this table for those zip codes with the most activity to the least activity. Which zip code has the most incidents? Do you see any peculiar data quality issues with any of these zip code values?
table(Police_data$ZIP)
## 
##  4523  5239 42502 45202 45203 45204 45205 45206 45207 45208 45209 45211 45212 
##     2     1     3  2049   226   348  1110   616   245   359   380  1094    61 
## 45213 45214 45215 45216 45217 45219 45220 45221 45223 45224 45225 45226 45227 
##   190   774    47   302   100   863   477    90   653   429   811   112   286 
## 45228 45229 45230 45231 45232 45233 45236 45237 45238 45239 45244 45248 
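##     5   913   214     7   477    77     3   699   956   169     3     3
##45202 has the most rows (2049). The values 4523 and 5239 have only four digits, so they look like data entry errors, and 42502 falls outside the 452xx range of every other zip code here.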
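The question also asks for the table sorted from most to least activity. One way to do that, and to flag zip codes that do not have five digits, is sketched here on a made-up vector (`zips` is an assumption for illustration):

```r
# Hypothetical zip codes, including two malformed four-digit values.
zips <- c(45202, 45202, 45205, 4523, 45202, 45205, 5239, 45211)

# Count rows per zip code and sort from most to least frequent.
zip_counts <- sort(table(zips), decreasing = TRUE)

# The first name in the sorted table is the busiest zip code.
top_zip <- names(zip_counts)[1]

# Table names are character strings, so nchar() can flag bad lengths.
suspect_zips <- names(zip_counts)[nchar(names(zip_counts)) != 5]

print(zip_counts)
print(suspect_zips)
```

A length check will not catch a five-digit value from the wrong region; that would need a comparison against a list of valid local zip codes.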
  • Using the DAYOFWEEK column, which day do most incidents occur on? What is the proportion of incidents that fall on this day?
table(Police_data$DAYOFWEEK) / length(Police_data$DAYOFWEEK)
## 
##    FRIDAY    MONDAY  SATURDAY    SUNDAY  THURSDAY   TUESDAY WEDNESDAY 
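## 0.1331574 0.1398218 0.1499175 0.1408116 0.1324975 0.1392940 0.1365886
##Saturday has the most rows, about 15% of the data (0.1499). The denominator here includes the 423 rows where DAYOFWEEK is missing.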
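Dividing by length() counts rows with a missing day in the denominator, while prop.table() divides by the number of known days only. The difference is sketched below on a made-up vector (`days` is an assumption for illustration):

```r
# Hypothetical day labels with two missing values.
days <- c("SATURDAY", "FRIDAY", "SATURDAY", NA, "MONDAY", "SATURDAY", NA, "FRIDAY")

# table() drops NA by default, so the counts cover known days only.
day_counts <- table(days)

# Share of all rows, missing days included in the denominator.
prop_all <- day_counts / length(days)

# Share of rows with a known day; these proportions sum to 1.
prop_known <- prop.table(day_counts)

# which.max() returns the first day with the highest count.
busiest <- names(which.max(day_counts))

print(prop_all)
print(prop_known)
print(busiest)
```

Which denominator is right depends on the question being asked: share of all reports, or share of reports with a recorded day.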
  • Looking at the information this data set provides, what are some insights you’d be interested in assessing? Analyze three different columns that could start to provide you with these insights. Are there missing values in these columns? What are some summary statistics you can compute for these columns? Are there any outliers or aberrant values in these columns? How do you know? Would you remove or recode them?

One thing worth looking at is whether a weapon was involved (WEAPONS); combined with the neighborhood or zip code columns, that would show where in the city the most gun incidents occur. Another is using the date columns to see which season incidents occur in. A third is SUSPECT_AGE, to see which age ranges are most often tied to certain crimes.

knitr::purl(input = "LAB2DrewAsher.Rmd", output = "Module_2_lab_Asher_Andrew.R",documentation = 0)
## 
## 
## processing file: LAB2DrewAsher.Rmd
## 
## output file: Module_2_lab_Asher_Andrew.R
## [1] "Module_2_lab_Asher_Andrew.R"