I am a student at the University of Cincinnati.
I am majoring in Business Analytics.
I am a Front Desk Assistant in University Housing.
I am a beginner in R.
I have experience with Python.
df <- readr::read_csv("../data/blood_transfusion.csv")
## Rows: 748 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Class
## dbl (4): Recency, Frequency, Monetary, Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#What are the dimensions of this data (number of rows and columns)?
dim(df)
## [1] 748 5
#What are the data types of each column?
colnames(df)
## [1] "Recency" "Frequency" "Monetary" "Time" "Class"
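Note that `colnames()` returns the column names rather than their types. A minimal sketch of one way to list the types, using a small stand-in data frame built from the first three rows of this data (the name `toy` is hypothetical, and base `data.frame` is used instead of a tibble so the example has no dependencies):

```r
# Toy stand-in for the blood transfusion data (values from the first rows).
toy <- data.frame(
  Recency   = c(2, 0, 1),
  Frequency = c(50, 13, 16),
  Monetary  = c(12500, 3250, 4000),
  Time      = c(98, 28, 35),
  Class     = c("donated", "donated", "donated"),
  stringsAsFactors = FALSE
)

# class() applied to every column gives one type label per column.
col_types <- sapply(toy, class)
print(col_types)
```

On the real data this should agree with the column specification printed by `read_csv()`: four numeric columns and one character column.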
#Are there any missing values?
sapply(df, function(x) sum(is.na(x)))
## Recency Frequency Monetary Time Class
## 0 0 0 0 0
#Check out the first 10 rows? What are the Class values for the first 10 observations?
head(df, 10)
## # A tibble: 10 × 5
## Recency Frequency Monetary Time Class
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2 50 12500 98 donated
## 2 0 13 3250 28 donated
## 3 1 16 4000 35 donated
## 4 2 20 5000 45 donated
## 5 1 24 6000 77 not donated
## 6 4 4 1000 4 not donated
## 7 2 7 1750 14 donated
## 8 1 12 3000 35 not donated
## 9 2 9 2250 22 donated
## 10 5 46 11500 98 donated
#Check out the last 10 rows? What are the Class values for the last 10 observations?
tail(df, 10)
## # A tibble: 10 × 5
## Recency Frequency Monetary Time Class
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 23 1 250 23 not donated
## 2 23 4 1000 52 not donated
## 3 23 1 250 23 not donated
## 4 23 7 1750 88 not donated
## 5 16 3 750 86 not donated
## 6 23 2 500 38 not donated
## 7 21 2 500 52 not donated
## 8 23 3 750 62 not donated
## 9 39 1 250 39 not donated
## 10 72 1 250 72 not donated
#Index for the 100th row and just the Monetary column. What is the value?
df[100, 'Monetary']
## # A tibble: 1 × 1
## Monetary
## <dbl>
## 1 1750
#Index for just the Monetary column. What is the mean of this vector?
mean(df[['Monetary']])
## [1] 1378.676
#Subset this data frame for all observations where Monetary is greater than the mean value. How many rows are in the resulting data frame?
above_avg <- df[['Monetary']] > mean(df[['Monetary']])
df[above_avg, 'Monetary']
## # A tibble: 267 × 1
## Monetary
## <dbl>
## 1 12500
## 2 3250
## 3 4000
## 4 5000
## 5 6000
## 6 1750
## 7 3000
## 8 2250
## 9 11500
## 10 5750
## # ℹ 257 more rows
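The subset above keeps only the `Monetary` column. A small sketch of how the same logical mask can keep every column, and how the row count can be read off directly, is shown here on a toy data frame holding the first ten `Monetary` values (the names `toy` and `above_df` are hypothetical):

```r
# First ten Monetary values from the data, as a stand-in data frame.
toy <- data.frame(
  Monetary = c(12500, 3250, 4000, 5000, 6000, 1000, 1750, 3000, 2250, 11500)
)

# Logical mask: TRUE where Monetary is above the column mean.
above_avg <- toy[["Monetary"]] > mean(toy[["Monetary"]])

# Leaving the column index empty keeps all columns; drop = FALSE keeps a data frame.
above_df <- toy[above_avg, , drop = FALSE]

# Two equivalent ways to count the matching rows.
n_rows <- nrow(above_df)
n_true <- sum(above_avg)
print(c(n_rows, n_true))
```

Summing a logical vector counts its `TRUE` values, so `sum(above_avg)` gives the row count without building the subset.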
df <- readr::read_table('http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt', col_names = c('Month', 'Day', 'Year', 'Avg_temp'))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Month = col_double(),
## Day = col_double(),
## Year = col_double(),
## Avg_temp = col_double()
## )
## Warning: 3 parsing failures.
## row col expected actual file
## 4623 -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
## 5016 -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
## 5213 -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
dim(df)
## [1] 9265 4
The columns give the month, day, and year of each observation and the average daily temperature in Cincinnati, from 1 January 1995 to 13 May 2020.
sum(is.na(df))
## [1] 0
df[365, c("Day", "Avg_temp")]
## # A tibble: 1 × 2
## Day Avg_temp
## <dbl> <dbl>
## 1 31 39.3
jan_2000 <- df[df[['Month']] == 1 & df[['Year']] == 2000, ]
median(jan_2000[['Avg_temp']])
## [1] 27.1
df[which.max(df[['Avg_temp']]), ]
## # A tibble: 1 × 4
## Month Day Year Avg_temp
## <dbl> <dbl> <dbl> <dbl>
## 1 7 7 2012 89.2
df[which.min(df[['Avg_temp']]), ]
## # A tibble: 1 × 4
## Month Day Year Avg_temp
## <dbl> <dbl> <dbl> <dbl>
## 1 12 24 1998 -99
The coldest average temperature, -99, was recorded on 24 December 1998. This value does not make sense as a real temperature.
# Are there more than just one date that has this temperature value recorded? If so, how many?
sum(df[['Avg_temp']] == -99)
## [1] 14
Yes, there are more. The average temperature was recorded as -99 on 14 dates.
# Compute the mean of the average temp column. Now re-code all -99s to NA and recompute the mean
mean(df[['Avg_temp']])
## [1] 54.39876
bad_values <- df[['Avg_temp']] == -99
df[bad_values, 'Avg_temp'] <- NA
mean(df[['Avg_temp']], na.rm = TRUE)
## [1] 54.6309
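As a quick consistency check, the recomputed mean can be reproduced from the numbers already reported: 9265 rows, 14 of them equal to -99, and a mean of 54.39876 before recoding. The sketch below uses only those reported figures (the variable names are hypothetical) and does not touch the data itself:

```r
# Figures reported earlier in this section.
n_total <- 9265             # rows in the temperature data
n_bad <- 14                 # rows recorded as -99
mean_with_sentinel <- 54.39876

# Undo the mean to get the column total, then remove the -99 entries.
total_with_sentinel <- mean_with_sentinel * n_total
total_clean <- total_with_sentinel - n_bad * (-99)

# Mean over the remaining valid rows.
mean_clean <- total_clean / (n_total - n_bad)
print(round(mean_clean, 4))
```

The result agrees with the value returned by `mean(..., na.rm = TRUE)` above, up to rounding of the reported inputs.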
df <- readr::read_csv("../data/PDI__Police_Data_Initiative__Crime_Incidents.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 511088 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (32): INSTANCEID, INCIDENT_NO, DATE_REPORTED, DATE_FROM, DATE_TO, CLSD, ...
## dbl (8): UCR, BEAT, RPT_AREA, LONGITUDE_X, LATITUDE_X, TOTALNUMBERVICTIMS, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(df)
## [1] 511088 40
sum(is.na(df))
## [1] 3173571
colSums(is.na(df))
## INSTANCEID INCIDENT_NO
## 0 1
## DATE_REPORTED DATE_FROM
## 508 20
## DATE_TO CLSD
## 1439 1720
## UCR DST
## 331 560
## BEAT OFFENSE
## 1592 313
## LOCATION THEFT_CODE
## 21 338160
## FLOOR SIDE
## 452057 455174
## OPENING HATE_BIAS
## 462761 36
## DAYOFWEEK RPT_AREA
## 7223 3469
## CPD_NEIGHBORHOOD WEAPONS
## 3856 51
## DATE_OF_CLEARANCE HOUR_FROM
## 8644 12
## HOUR_TO ADDRESS_X
## 1405 4749
## LONGITUDE_X LATITUDE_X
## 67110 67110
## VICTIM_AGE VICTIM_RACE
## 2 76506
## VICTIM_ETHNICITY VICTIM_GENDER
## 76508 76506
## SUSPECT_AGE SUSPECT_RACE
## 0 266238
## SUSPECT_ETHNICITY SUSPECT_GENDER
## 266242 266238
## TOTALNUMBERVICTIMS TOTALSUSPECTS
## 409 266238
## UCR_GROUP ZIP
## 343 19
## COMMUNITY_COUNCIL_NEIGHBORHOOD SNA_NEIGHBORHOOD
## 0 0
range(df[['DATE_REPORTED']])
## [1] NA NA
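`range()` returns `NA NA` here because `DATE_REPORTED` contains missing values (508 of them, per the counts above). The column is also read as character, so even with `na.rm = TRUE` the result would be an alphabetical range rather than a chronological one. A minimal sketch of one way to handle both issues follows, on toy values; the `%m/%d/%Y` format is an assumption and should be checked against the actual strings in the file:

```r
# Toy stand-in for DATE_REPORTED; the month/day/year layout is assumed.
reported <- c("01/15/2015", NA, "12/31/2019", "06/01/2012")

# With an NA present, range() on the raw strings is NA NA.
raw_range <- range(reported)
print(raw_range)

# Parse to Date first, then drop NAs when taking the range.
parsed <- as.Date(reported, format = "%m/%d/%Y")
date_range <- range(parsed, na.rm = TRUE)
print(date_range)
```

Parsing before taking the range matters: as strings, "12/31/2012" sorts after "01/01/2020".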
table(df['SUSPECT_AGE'])
## SUSPECT_AGE
## 18-25 26-30 31-40 41-50 51-60 61-70 OVER 70 UNDER 18
## 58014 32089 38289 19391 9401 2327 414 20646
## UNKNOWN
## 330517
table(df['ZIP'])
## ZIP
## 33 123 300 452 4202 4204 4205
## 4 4 1 3 1 1 2
## 4211 4219 4225 4237 4422 4462 4492
## 1 1 1 2 1 3 1
## 4502 4505 4508 4513 4519 4520 4521
## 3 2 1 1 6 1 4
## 4522 4523 4524 4525 4526 4539 5205
## 2 16 8 1 1 1 5
## 5211 5219 5225 5232 5239 5281 24206
## 3 4 1 1 1 2 1
## 34205 35209 40226 40520 41015 41018 41095
## 2 3 1 1 1 1 1
## 42220 42502 45001 45002 45011 45014 45020
## 2 15 4 3 1 2 1
## 45030 45040 45069 45102 45140 45174 45200
## 4 1 1 2 8 3 1
## 45201 45202 45203 45204 45205 45206 45207
## 14 55069 7547 12841 44024 24883 10260
## 45208 45209 45210 45211 45212 45213 45214
## 8445 10647 126 36525 2080 7686 26744
## 45215 45216 45217 45218 45219 45220 45221
## 1921 10382 2664 2 32510 15814 1749
## 45222 45223 45224 45225 45226 45227 45228
## 3 22680 14931 24236 6024 10480 542
## 45229 45230 45231 45232 45233 45234 45235
## 30126 7465 467 17064 3478 11 3
## 45236 45237 45238 45239 45240 45241 45242
## 94 23475 30383 7194 11 4 8
## 45243 45244 45245 45246 45247 45248 45249
## 3 11 2 4 4 74 1
## 45251 45252 45253 45255 45268 45299 45449
## 5 2 2 13 2 2 1
## 45520 45523 45524 45525 45891 47012 54202
## 1 2 1 1 1 1 4
## 54204 54205 54211 54214 54222 54224 54225
## 4 36 25 17 1 2 4
## 54233 54238 56205 91016 98031 345230 445211
## 7 12 1 1 1 1 1
## 445214 452002 452020 452022 452024 452025 452054
## 1 2 15 2 1 3 1
## 452071 452111 452154 452258 452338 452376 452389
## 1 2 1 1 1 1 4
## 458204 475211 475219 545232 945211 945226 531110230
## 2 4 1 1 1 4 1
## 4521345213 4521945219 4522045220 4522445224
## 4 1 1 2
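The table shows many `ZIP` values that cannot be valid five-digit ZIP codes, such as 33, 452020, and 4521345213. A rough sketch of how such entries could be flagged is given below on a few values copied from the table. The five-digit rule is an assumption made for illustration (it would also reject ZIP codes with a leading zero, which do not apply to Cincinnati), and the names are hypothetical:

```r
# A few ZIP values from the frequency table above, plus one NA.
zip <- c(33, 45202, 45205, 452020, 4521345213, NA)

# Treat a value as plausible only if it is a five-digit number.
is_five_digit <- !is.na(zip) & zip >= 10000 & zip <= 99999

# Non-missing values that fail the check.
suspicious <- zip[!is_five_digit & !is.na(zip)]
print(suspicious)
print(sum(is_five_digit))
```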
table(df[['DAYOFWEEK']]) / sum(table(df[['DAYOFWEEK']]))
##
## FRIDAY MONDAY SATURDAY SUNDAY THURSDAY TUESDAY WEDNESDAY
## 0.1470632 0.1422881 0.1481369 0.1443760 0.1379278 0.1405654 0.1396426
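Dividing a table by its total is what base R's `prop.table()` does. A short sketch on toy values (the vector `days` is hypothetical) shows that the two approaches agree, and that `table()` leaves out `NA` by default, so the proportions are over non-missing days only:

```r
# Toy stand-in for DAYOFWEEK, including one missing value.
days <- c("FRIDAY", "MONDAY", "FRIDAY", "SUNDAY", NA)

# table() drops NA unless useNA is set, so the total here is 4.
counts <- table(days)

# Manual proportions, as computed above, and the prop.table() equivalent.
manual <- counts / sum(counts)
shares <- prop.table(counts)
print(shares)
```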
Each row represents a police report. Each column represents an attribute of that report.