Synopsis of Me

I am a student at the University of Cincinnati.

Academic Background

I am majoring in Business Analytics.

Professional Background

I am a Front Desk Assistant in University Housing.

Experience with R

I am a beginner in R.

Experience with other analytic software

I have experience with Python.

df <- readr::read_csv("../data/blood_transfusion.csv")
## Rows: 748 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Class
## dbl (4): Recency, Frequency, Monetary, Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#What are the dimensions of this data (number of rows and columns)?

dim(df)
## [1] 748   5

#What are the data types of each column?

colnames(df)
## [1] "Recency"   "Frequency" "Monetary"  "Time"      "Class"
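
Note that colnames() returns the column names rather than their types. The column specification printed by read_csv() above already reports one chr column (Class) and four dbl columns. A minimal sketch of checking the types directly, using a small made-up data frame (`toy` and its values are illustrative stand-ins, not the real data):

```r
# Toy data frame standing in for the blood transfusion data (illustrative values)
toy <- data.frame(
  Recency = c(2, 0, 1),
  Class = c("donated", "donated", "not donated"),
  stringsAsFactors = FALSE
)

# class() of each column, returned as a named character vector
col_types <- sapply(toy, class)
print(col_types)

# str() gives a compact overview of the types and the first few values
str(toy)
```

On the real data the same calls would be applied to df instead of toy.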

#Are there any missing values?

sapply(df, function(x) sum(is.na(x)))
##   Recency Frequency  Monetary      Time     Class 
##         0         0         0         0         0

#Check out the first 10 rows? What are the Class values for the first 10 observations?

head(df,10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1       2        50    12500    98 donated    
##  2       0        13     3250    28 donated    
##  3       1        16     4000    35 donated    
##  4       2        20     5000    45 donated    
##  5       1        24     6000    77 not donated
##  6       4         4     1000     4 not donated
##  7       2         7     1750    14 donated    
##  8       1        12     3000    35 not donated
##  9       2         9     2250    22 donated    
## 10       5        46    11500    98 donated

#Check out the last 10 rows? What are the Class values for the last 10 observations?

tail(df,10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1      23         1      250    23 not donated
##  2      23         4     1000    52 not donated
##  3      23         1      250    23 not donated
##  4      23         7     1750    88 not donated
##  5      16         3      750    86 not donated
##  6      23         2      500    38 not donated
##  7      21         2      500    52 not donated
##  8      23         3      750    62 not donated
##  9      39         1      250    39 not donated
## 10      72         1      250    72 not donated

#Index for the 100th row and just the Monetary column. What is the value?

df[100, 'Monetary']
## # A tibble: 1 × 1
##   Monetary
##      <dbl>
## 1     1750

#Index for just the Monetary column. What is the mean of this vector?

mean(df[['Monetary']])
## [1] 1378.676

#Subset this data frame for all observations where Monetary is greater than the mean value. How many rows are in the resulting data frame?

above_avg <- df[['Monetary']] > mean(df[['Monetary']])
df[above_avg, 'Monetary']
## # A tibble: 267 × 1
##    Monetary
##       <dbl>
##  1    12500
##  2     3250
##  3     4000
##  4     5000
##  5     6000
##  6     1750
##  7     3000
##  8     2250
##  9    11500
## 10     5750
## # ℹ 257 more rows
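
The tibble header above reports 267 rows. The count can also be obtained directly instead of being read off the printout. A minimal sketch on made-up values (`toy` is a hypothetical stand-in for df):

```r
# Toy stand-in for the Monetary column (illustrative values)
toy <- data.frame(Monetary = c(250, 500, 1750, 12500))

# Logical vector: TRUE where Monetary exceeds the column mean
above_avg <- toy[["Monetary"]] > mean(toy[["Monetary"]])

# Keep all columns for the matching rows, then count the rows
subset_rows <- toy[above_avg, , drop = FALSE]
n_above <- nrow(subset_rows)

# Summing the logical vector gives the same count
print(n_above)
print(sum(above_avg))
```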

2. Go to this University of Dayton webpage http://academic.udayton.edu/kissock/http/Weather/citylistUS.htm, scroll down to Ohio and import the Cincinnati (OHCINCIN.txt) file.

df <- readr::read_table('http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt', col_names = c('Month', 'Day', 'Year', 'Avg_temp'))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Month = col_double(),
##   Day = col_double(),
##   Year = col_double(),
##   Avg_temp = col_double()
## )
## Warning: 3 parsing failures.
##  row col  expected    actual                                                                           file
## 4623  -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
## 5016  -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
## 5213  -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'

What are the dimensions of this data (number of rows and columns)?

dim(df)
## [1] 9265    4

What do you think these columns represent?

The columns represent the month, day, and year of each observation and the average daily temperature in Cincinnati. The data run from 1st January 1995 to 13th May 2020.

Are there any missing values in this data?

sum(is.na(df))
## [1] 0

Index for the 365th row. What is the date of this observation and what was the average temperature?

df[365,c("Day", "Avg_temp")]
## # A tibble: 1 × 2
##     Day Avg_temp
##   <dbl>    <dbl>
## 1    31     39.3
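
The index above returns only Day and Avg_temp, so the month and year are not shown. Since the data start on 1st January 1995, row 365 should be 31st December 1995. A minimal sketch of pulling the whole row and building a date from it, using a made-up three-row data frame (`toy` and most of its values are illustrative):

```r
# Toy stand-in for the weather data (illustrative values)
toy <- data.frame(
  Month = c(12, 12, 1),
  Day = c(30, 31, 1),
  Year = c(1995, 1995, 1996),
  Avg_temp = c(35.0, 39.3, 30.1)
)

# Index the full row so Month, Day, and Year are all visible
row <- toy[2, ]
print(row)

# Combine the three date parts into a Date object
obs_date <- as.Date(sprintf(
  "%d-%02d-%02d",
  as.integer(row$Year), as.integer(row$Month), as.integer(row$Day)
))
print(obs_date)
```

On the real data the equivalent index would be df[365, ].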

Subset for all observations that happened during January of 2000. What was the median average temp for this month?

jan_2000 <- df[df[['Month']] == 1 & df[['Year']] == 2000, ]
median(jan_2000[['Avg_temp']])
## [1] 27.1

Which date was the highest average temp recorded (hint: which.max)?

df[which.max(df[['Avg_temp']]),]
## # A tibble: 1 × 4
##   Month   Day  Year Avg_temp
##   <dbl> <dbl> <dbl>    <dbl>
## 1     7     7  2012     89.2

Which date was the coldest average temp recorded? Does this temp make sense?

df[which.min(df[['Avg_temp']]),]
## # A tibble: 1 × 4
##   Month   Day  Year Avg_temp
##   <dbl> <dbl> <dbl>    <dbl>
## 1    12    24  1998      -99

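The coldest average temp was recorded on 24th December 1998. This temp does not make sense: -99 is far below any plausible daily average for Cincinnati and is most likely a placeholder for a missing value.

Are there more than just one date that has this temperature value recorded? If so, how many?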

sum(df[['Avg_temp']] == -99)
## [1] 14

Yes, there are more. The average temperature was recorded as -99 on 14 dates.

Compute the mean of the average temp column. Now re-code all -99s to NA and recompute the mean

mean(df[['Avg_temp']])
## [1] 54.39876
bad_values <- df[['Avg_temp']] == -99
df[bad_values, 'Avg_temp'] <- NA
mean(df[['Avg_temp']], na.rm = TRUE)
## [1] 54.6309

3. Fill in the blanks below to import the PDI__Police_Data_Initiative__Crime_Incidents.csv data (provided via Canvas) and answer the questions that follow

df <- readr::read_csv("../data/PDI__Police_Data_Initiative__Crime_Incidents.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 511088 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (32): INSTANCEID, INCIDENT_NO, DATE_REPORTED, DATE_FROM, DATE_TO, CLSD, ...
## dbl  (8): UCR, BEAT, RPT_AREA, LONGITUDE_X, LATITUDE_X, TOTALNUMBERVICTIMS, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

What are the dimensions of this data (number of rows and columns)?

dim(df)
## [1] 511088     40

What do you think these columns represent?

Are there any missing values in this data? If so, how many missing values are in each column? Which column has the most missing values?

sum(is.na(df))
## [1] 3173571
colSums(is.na(df))
##                     INSTANCEID                    INCIDENT_NO 
##                              0                              1 
##                  DATE_REPORTED                      DATE_FROM 
##                            508                             20 
##                        DATE_TO                           CLSD 
##                           1439                           1720 
##                            UCR                            DST 
##                            331                            560 
##                           BEAT                        OFFENSE 
##                           1592                            313 
##                       LOCATION                     THEFT_CODE 
##                             21                         338160 
##                          FLOOR                           SIDE 
##                         452057                         455174 
##                        OPENING                      HATE_BIAS 
##                         462761                             36 
##                      DAYOFWEEK                       RPT_AREA 
##                           7223                           3469 
##               CPD_NEIGHBORHOOD                        WEAPONS 
##                           3856                             51 
##              DATE_OF_CLEARANCE                      HOUR_FROM 
##                           8644                             12 
##                        HOUR_TO                      ADDRESS_X 
##                           1405                           4749 
##                    LONGITUDE_X                     LATITUDE_X 
##                          67110                          67110 
##                     VICTIM_AGE                    VICTIM_RACE 
##                              2                          76506 
##               VICTIM_ETHNICITY                  VICTIM_GENDER 
##                          76508                          76506 
##                    SUSPECT_AGE                   SUSPECT_RACE 
##                              0                         266238 
##              SUSPECT_ETHNICITY                 SUSPECT_GENDER 
##                         266242                         266238 
##             TOTALNUMBERVICTIMS                  TOTALSUSPECTS 
##                            409                         266238 
##                      UCR_GROUP                            ZIP 
##                            343                             19 
## COMMUNITY_COUNCIL_NEIGHBORHOOD               SNA_NEIGHBORHOOD 
##                              0                              0
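
In the counts above, OPENING has the most missing values (462761), followed by SIDE (455174) and FLOOR (452057). Sorting the counts makes this easier to read than scanning the two-column printout. A minimal sketch on a made-up data frame (`toy` is hypothetical; only the column names are borrowed from the real data):

```r
# Toy stand-in for the crime data (illustrative values)
toy <- data.frame(
  INCIDENT_NO = c("a", NA, "c", "d"),
  OPENING = c(NA, NA, NA, "x"),
  ZIP = c(45202, 45205, NA, 45211),
  stringsAsFactors = FALSE
)

# Missing values per column, largest first
na_counts <- colSums(is.na(toy))
na_sorted <- sort(na_counts, decreasing = TRUE)
print(na_sorted)

# Name of the column with the most missing values
most_missing <- names(na_sorted)[1]
print(most_missing)
```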

Using the DATE_REPORTED column, what is the range of dates included in this data?

range(df[['DATE_REPORTED']])
## [1] NA NA
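
range() returns NA here because DATE_REPORTED contains 508 missing values and NA propagates unless na.rm = TRUE is set. The column was also read as character, so even with the NAs removed the result would be an alphabetical range rather than a chronological one. A minimal sketch of parsing first and then taking the range. The month/day/year timestamp format and the sample values below are assumptions, not taken from the file:

```r
# Hypothetical timestamps; the real column's format is an assumption here
date_reported <- c("01/15/2019 08:30:00 AM", NA, "03/02/2012 11:05:00 PM")

# Parse the date part only; trailing characters after the format are ignored
parsed <- as.Date(date_reported, format = "%m/%d/%Y")

# Drop NAs so the range reflects the known dates
date_range <- range(parsed, na.rm = TRUE)
print(date_range)
```

If the real column uses a different layout, the format string would need to change accordingly.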

Using table(), what is the most common age range for known SUSPECT_AGEs?

table(df['SUSPECT_AGE'])
## SUSPECT_AGE
##    18-25    26-30    31-40    41-50    51-60    61-70  OVER 70 UNDER 18 
##    58014    32089    38289    19391     9401     2327      414    20646 
##  UNKNOWN 
##   330517
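
In the table above, UNKNOWN is the largest group (330517), and among the known age ranges 18-25 is the most common (58014). A minimal sketch of dropping the UNKNOWN category and sorting, on a made-up vector (`suspect_age` here is illustrative, not the real column):

```r
# Toy stand-in for the SUSPECT_AGE column (illustrative values)
suspect_age <- c("18-25", "UNKNOWN", "31-40", "18-25",
                 "UNKNOWN", "UNKNOWN", "26-30")

# Keep only the known age ranges
known <- suspect_age[suspect_age != "UNKNOWN"]

# Tabulate and sort from most to least common
age_counts <- sort(table(known), decreasing = TRUE)
print(age_counts)

# Most common known age range
most_common <- names(age_counts)[1]
print(most_common)
```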

Use table() to get the number of incidents per zip code. Sort this table for those zip codes with the most activity to the least activity. Which zip code has the most incidents? Do you see any peculiar data quality issues with any of these zip code values?

table(df['ZIP'])
## ZIP
##         33        123        300        452       4202       4204       4205 
##          4          4          1          3          1          1          2 
##       4211       4219       4225       4237       4422       4462       4492 
##          1          1          1          2          1          3          1 
##       4502       4505       4508       4513       4519       4520       4521 
##          3          2          1          1          6          1          4 
##       4522       4523       4524       4525       4526       4539       5205 
##          2         16          8          1          1          1          5 
##       5211       5219       5225       5232       5239       5281      24206 
##          3          4          1          1          1          2          1 
##      34205      35209      40226      40520      41015      41018      41095 
##          2          3          1          1          1          1          1 
##      42220      42502      45001      45002      45011      45014      45020 
##          2         15          4          3          1          2          1 
##      45030      45040      45069      45102      45140      45174      45200 
##          4          1          1          2          8          3          1 
##      45201      45202      45203      45204      45205      45206      45207 
##         14      55069       7547      12841      44024      24883      10260 
##      45208      45209      45210      45211      45212      45213      45214 
##       8445      10647        126      36525       2080       7686      26744 
##      45215      45216      45217      45218      45219      45220      45221 
##       1921      10382       2664          2      32510      15814       1749 
##      45222      45223      45224      45225      45226      45227      45228 
##          3      22680      14931      24236       6024      10480        542 
##      45229      45230      45231      45232      45233      45234      45235 
##      30126       7465        467      17064       3478         11          3 
##      45236      45237      45238      45239      45240      45241      45242 
##         94      23475      30383       7194         11          4          8 
##      45243      45244      45245      45246      45247      45248      45249 
##          3         11          2          4          4         74          1 
##      45251      45252      45253      45255      45268      45299      45449 
##          5          2          2         13          2          2          1 
##      45520      45523      45524      45525      45891      47012      54202 
##          1          2          1          1          1          1          4 
##      54204      54205      54211      54214      54222      54224      54225 
##          4         36         25         17          1          2          4 
##      54233      54238      56205      91016      98031     345230     445211 
##          7         12          1          1          1          1          1 
##     445214     452002     452020     452022     452024     452025     452054 
##          1          2         15          2          1          3          1 
##     452071     452111     452154     452258     452338     452376     452389 
##          1          2          1          1          1          1          4 
##     458204     475211     475219     545232     945211     945226  531110230 
##          2          4          1          1          1          4          1 
## 4521345213 4521945219 4522045220 4522445224 
##          4          1          1          2
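
The table above is ordered by zip code value, not by count. Reading through it, 45202 has the most incidents (55069), followed by 45205 (44024). Several values do not have five digits (for example 33, 452, 452020, and 4521345213), which points to data entry problems. A minimal sketch of sorting by count and flagging such values, on a made-up vector (`zip` here is illustrative):

```r
# Toy stand-in for the ZIP column (illustrative values)
zip <- c(45202, 45202, 45205, 4520, 45202, 452020, 45205, NA)

# Count incidents per zip code, most active first (NAs are dropped by table())
zip_counts <- sort(table(zip), decreasing = TRUE)
print(zip_counts)

# Zip code with the most incidents
top_zip <- names(zip_counts)[1]
print(top_zip)

# Flag values that are not exactly five digits long
suspicious <- names(zip_counts)[nchar(names(zip_counts)) != 5]
print(suspicious)
```

A five-digit check only catches malformed lengths; a five-digit value outside the Cincinnati area would still need a separate look.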

Using the DAYOFWEEK column, which day do most incidents occur on? What is the proportion of incidents that fall on this day?

table(df[['DAYOFWEEK']]) / sum(table(df[['DAYOFWEEK']]))
## 
##    FRIDAY    MONDAY  SATURDAY    SUNDAY  THURSDAY   TUESDAY WEDNESDAY 
## 0.1470632 0.1422881 0.1481369 0.1443760 0.1379278 0.1405654 0.1396426
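
From the proportions above, Saturday has the most incidents, at about 14.8% of those with a recorded day. The same result can be computed with prop.table() and which.max() rather than read off by eye. A minimal sketch on a made-up vector (`day_of_week` here is illustrative):

```r
# Toy stand-in for the DAYOFWEEK column (illustrative values)
day_of_week <- c("SATURDAY", "FRIDAY", "SATURDAY", "MONDAY",
                 NA, "SATURDAY", "FRIDAY")

# Proportion of incidents per day; table() drops NAs by default
day_props <- prop.table(table(day_of_week))
print(day_props)

# Day with the largest share and its proportion
busiest <- names(which.max(day_props))
busiest_share <- as.numeric(day_props[busiest])
print(busiest)
print(busiest_share)
```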

Looking at the information this data set provides, what are some insights you’d be interested in assessing? Analyze three different columns that could start to provide you with these insights. Are there missing values in these columns? What are some summary statistics you can compute for these columns? Are there any outliers or aberrant values in these columns? How do you know? Would you remove or recode them?

Each row represents a police report, and each column represents an attribute of that report.