R Markdown
# Import the blood_transfusion.csv file -----------------------------------
blood_transfusion <- data.table::fread("data/blood_transfusion.csv", data.table = FALSE)
### What are the dimensions of this data (number of rows and columns)?
dim(blood_transfusion)
## [1] 748 5
### What are the data types of each column?
dplyr::glimpse(blood_transfusion)
## Rows: 748
## Columns: 5
## $ Recency <int> 2, 0, 1, 2, 1, 4, 2, 1, 2, 5, 4, 0, 2, 1, 2, 2, 2, 2, 2, 2, …
## $ Frequency <int> 50, 13, 16, 20, 24, 4, 7, 12, 9, 46, 23, 3, 10, 13, 6, 5, 14…
## $ Monetary <int> 12500, 3250, 4000, 5000, 6000, 1000, 1750, 3000, 2250, 11500…
## $ Time <int> 98, 28, 35, 45, 77, 4, 14, 35, 22, 98, 58, 4, 28, 47, 15, 11…
## $ Class <chr> "donated", "donated", "donated", "donated", "not donated", "…
### Are there any missing values?
sum(is.na(blood_transfusion))
## [1] 0
### Check out the first 10 rows? What are the Class values for the first 10 observations?
head(blood_transfusion, 10)
## Recency Frequency Monetary Time Class
## 1 2 50 12500 98 donated
## 2 0 13 3250 28 donated
## 3 1 16 4000 35 donated
## 4 2 20 5000 45 donated
## 5 1 24 6000 77 not donated
## 6 4 4 1000 4 not donated
## 7 2 7 1750 14 donated
## 8 1 12 3000 35 not donated
## 9 2 9 2250 22 donated
## 10 5 46 11500 98 donated
### Check out the last 10 rows? What are the Class values for the last 10 observations?
tail(blood_transfusion, 10)
## Recency Frequency Monetary Time Class
## 739 23 1 250 23 not donated
## 740 23 4 1000 52 not donated
## 741 23 1 250 23 not donated
## 742 23 7 1750 88 not donated
## 743 16 3 750 86 not donated
## 744 23 2 500 38 not donated
## 745 21 2 500 52 not donated
## 746 23 3 750 62 not donated
## 747 39 1 250 39 not donated
## 748 72 1 250 72 not donated
### Index for the 100th row and just the Monetary column. What is the value?
blood_transfusion[100,'Monetary']
## [1] 1750
### Index for just the Monetary column. What is the mean of this vector?
x<- mean(blood_transfusion[,'Monetary'])
x
## [1] 1378.676
### Subset this data frame for all observations where Monetary is greater than the mean value. How many rows are in the resulting data frame?
y<- blood_transfusion[blood_transfusion$Monetary > x, ]
dim(y)
## [1] 267 5
# Import the Cincinnati weather file -----------------------------------
weather<- data.table::fread("https://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt")
### What are the dimensions of this data (number of rows and columns)?
dim(weather)
## [1] 9265 4
### What do you think these columns represent?
dplyr::glimpse(weather)
## Rows: 9,265
## Columns: 4
## $ V1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ V2 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, …
## $ V3 <int> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1…
## $ V4 <dbl> 41.1, 22.2, 22.8, 14.9, 9.5, 23.8, 31.1, 26.9, 31.3, 31.5, 44.4, 58…
# V1 : Month number || V2 : Date || V3 : Year || V4 : Avg. Temperature in Fahrenheit
### Are there any missing values in this data?
sum(is.na(weather))
## [1] 0
### Index for the 365th row. What is the date of this observation and what was the average temperature?
weather[365,]
## V1 V2 V3 V4
## 1: 12 31 1995 39.3
#Date - 12-31-1995
#Avg. Temp - 39.3 F
### Subset for all observations that happened during January of 2000. What was the median average temp for this month?
df1<-weather[weather$V3==2000 & weather$V1==1]
median(df1$V4)
## [1] 27.1
### Which date was the highest average temp recorded (hint: which.max)?
max(weather$V4)
## [1] 89.2
### Which date was the cold average temp recorded? Does this temp make sense? Are there more than just one date that has this temperature value recorded? If so, how many?
min(weather$V4)
## [1] -99
sum(weather$V4 == -99)
## [1] 14
weather[weather$V4==-99]
## V1 V2 V3 V4
## 1: 12 24 1998 -99
## 2: 12 25 1998 -99
## 3: 12 30 1998 -99
## 4: 12 31 1998 -99
## 5: 1 10 1999 -99
## 6: 6 18 2002 -99
## 7: 6 19 2002 -99
## 8: 6 20 2002 -99
## 9: 6 21 2002 -99
## 10: 9 7 2002 -99
## 11: 3 1 2003 -99
## 12: 8 28 2007 -99
## 13: 9 24 2008 -99
## 14: 4 9 2009 -99
### Compute the mean of the average temp column. Now re-code all -99s to NA and recompute the mean.
mean(weather$V4) #54.39876
## [1] 54.39876
df3<- weather[weather$V4 != -99]
mean(df3$V4)
## [1] 54.6309
# Import the PDI__Police_Data_Initiative__Crime_Incidents data --------------------
crime_data <- data.table::fread("data/PDI__Police_Data_Initiative__Crime_Incidents.csv", data.table = FALSE)
### What are the dimensions of this data (number of rows and columns)?
dim(crime_data)
## [1] 15155 40
### What do you think these columns represent?
dplyr::glimpse(crime_data)
## Rows: 15,155
## Columns: 40
## $ INSTANCEID <chr> "4B312B08-FE95-4DD4-8A62-20D1A1138E82",…
## $ INCIDENT_NO <chr> "229000003", "229000003", "229000003", …
## $ DATE_REPORTED <chr> "01/01/2022 12:09:00 AM", "01/01/2022 1…
## $ DATE_FROM <chr> "12/31/2021 11:50:00 PM", "12/31/2021 1…
## $ DATE_TO <chr> "01/01/2022 12:08:00 AM", "01/01/2022 1…
## $ CLSD <chr> "F--CLEARED BY ARREST - ADULT", "F--CLE…
## $ UCR <int> 803, 803, 803, 1493, 1493, 1493, 810, 8…
## $ DST <chr> "2", "2", "2", "2", "2", "2", "2", "2",…
## $ BEAT <chr> "2", "2", "2", "2", "2", "2", "2", "2",…
## $ OFFENSE <chr> "MENACING", "MENACING", "MENACING", "CR…
## $ LOCATION <chr> "26-BAR", "26-BAR", "26-BAR", "26-BAR",…
## $ THEFT_CODE <chr> "", "", "", "", "", "", "", "", "", "",…
## $ FLOOR <chr> "", "", "", "", "", "", "", "", "", "",…
## $ SIDE <chr> "", "", "", "", "", "", "", "", "", "",…
## $ OPENING <chr> "", "", "", "", "", "", "", "", "", "",…
## $ HATE_BIAS <chr> "N--NO BIAS/NOT APPLICABLE", "N--NO BIA…
## $ DAYOFWEEK <chr> "FRIDAY", "FRIDAY", "FRIDAY", "FRIDAY",…
## $ RPT_AREA <chr> "124", "124", "124", "124", "124", "124…
## $ CPD_NEIGHBORHOOD <chr> "OAKLEY", "OAKLEY", "OAKLEY", "OAKLEY",…
## $ WEAPONS <chr> "99 - NONE", "99 - NONE", "99 - NONE", …
## $ DATE_OF_CLEARANCE <chr> "01/01/2022 12:00:00 AM", "01/01/2022 1…
## $ HOUR_FROM <int> 2350, 2350, 2350, 2350, 2350, 2350, 235…
## $ HOUR_TO <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 17, 19, 20, …
## $ ADDRESS_X <chr> "30XX MADISON RD", "30XX MADISON RD", "…
## $ LONGITUDE_X <dbl> -84.43017, -84.43140, -84.43091, -84.42…
## $ LATITUDE_X <dbl> 39.15166, 39.15350, 39.15360, 39.15224,…
## $ VICTIM_AGE <chr> "18-25", "UNKNOWN", "31-40", "18-25", "…
## $ VICTIM_RACE <chr> "WHITE", "", "WHITE", "WHITE", "", "WHI…
## $ VICTIM_ETHNICITY <chr> "NOT OF HISPANIC ORIG", "", "NOT OF HIS…
## $ VICTIM_GENDER <chr> "MALE", "", "FEMALE", "MALE", "", "FEMA…
## $ SUSPECT_AGE <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNO…
## $ SUSPECT_RACE <chr> "", "", "", "", "", "", "", "", "", "UN…
## $ SUSPECT_ETHNICITY <chr> "", "", "", "", "", "", "", "", "", "UN…
## $ SUSPECT_GENDER <chr> "", "", "", "", "", "", "", "", "", "UN…
## $ TOTALNUMBERVICTIMS <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ TOTALSUSPECTS <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, …
## $ UCR_GROUP <chr> "PART 2 MINOR", "PART 2 MINOR", "PART 2…
## $ ZIP <int> 45209, 45209, 45209, 45209, 45209, 4520…
## $ COMMUNITY_COUNCIL_NEIGHBORHOOD <chr> "OAKLEY", "OAKLEY", "OAKLEY", "OAKLEY",…
## $ SNA_NEIGHBORHOOD <chr> "OAKLEY", "OAKLEY", "OAKLEY", "OAKLEY",…
### Are there any missing values in this data? If so, how many missing values are in each column? Which column has the most missing values?
sum(is.na(crime_data))
## [1] 10565
### Using the DATE_REPORTED column, what is the range of dates included in this data?
range(crime_data$DATE_REPORTED, na.rm=TRUE)
## [1] "01/01/2022 01:08:00 AM" "06/26/2022 12:50:00 AM"
### Using table(), what is the most common age range for known SUSPECT_AGEs?
table(crime_data$SUSPECT_AGE)
##
## 18-25 26-30 31-40 41-50 51-60 61-70 OVER 70 UNDER 18
## 1778 1126 1525 659 298 121 16 629
## UNKNOWN
## 9003
### Use table() to get the number of incidents per zip code. Sort this table for those zip codes with the most activity to the least activity. Which zip code has the most incidents? Do you see any peculiar data quality issues with any of these zip code values?
table(crime_data$ZIP)
##
## 4523 5239 42502 45202 45203 45204 45205 45206 45207 45208 45209 45211 45212
## 2 1 3 2049 226 348 1110 616 245 359 380 1094 61
## 45213 45214 45215 45216 45217 45219 45220 45221 45223 45224 45225 45226 45227
## 190 774 47 302 100 863 477 90 653 429 811 112 286
## 45228 45229 45230 45231 45232 45233 45236 45237 45238 45239 45244 45248
## 5 913 214 7 477 77 3 699 956 169 3 3
#45205
### Using the DAYOFWEEK column, which day do most incidents occur on? What is the proportion of incidents that fall on this day?
table(crime_data$DAYOFWEEK)
##
## FRIDAY MONDAY SATURDAY SUNDAY THURSDAY TUESDAY WEDNESDAY
## 423 2018 2119 2272 2134 2008 2111 2070
#saturday
### Looking at the information this data set provides, what are some insights you’d be interested in assessing? Analyze three different columns that could start to provide you with these insights. Are there missing values in these columns? What are some summary statistics you can compute for these columns? Are there any outliers or aberrant values in these columns? How do you know? Would you remove or recode them?