Proposal Draft

For our first data set, we chose one describing the passengers on Titanic’s maiden voyage. We downloaded this data set from openml.org. Here is the link: https://www.openml.org/d/40945

For this data set, we first read in the data and load the libraries we need.

#Reading in data
titanic <- read.csv("Titanic New Data.csv" , header = TRUE , stringsAsFactors = TRUE , na.strings = c("?" , "NA"))
#Library 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v stringr 1.4.0
## v tidyr   1.1.2     v forcats 0.5.0
## v readr   1.4.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

After this initial setup, we then explored its dimensions and variables.

#Exploring the data
colnames(titanic)

##  [1] "pclass"    "survived"  "name"      "sex"       "age"       "sibsp"    
##  [7] "parch"     "ticket"    "fare"      "cabin"     "embarked"  "boat"     
## [13] "body"      "home.dest"

summary(titanic)

##      pclass         survived                                   name     
##  Min.   :1.000   Min.   :0.000   Connolly, Miss. Kate            :   2  
##  1st Qu.:2.000   1st Qu.:0.000   Kelly, Mr. James                :   2  
##  Median :3.000   Median :0.000   Abbing, Mr. Anthony             :   1  
##  Mean   :2.295   Mean   :0.382   Abbott, Master. Eugene Joseph   :   1  
##  3rd Qu.:3.000   3rd Qu.:1.000   Abbott, Mr. Rossmore Edward     :   1  
##  Max.   :3.000   Max.   :1.000   Abbott, Mrs. Stanton (Rosa Hunt):   1  
##                                  (Other)                         :1301  
##      sex           age              sibsp            parch      
##  female:466   Min.   : 0.1667   Min.   :0.0000   Min.   :0.000  
##  male  :843   1st Qu.:21.0000   1st Qu.:0.0000   1st Qu.:0.000  
##               Median :28.0000   Median :0.0000   Median :0.000  
##               Mean   :29.8811   Mean   :0.4989   Mean   :0.385  
##               3rd Qu.:39.0000   3rd Qu.:1.0000   3rd Qu.:0.000  
##               Max.   :80.0000   Max.   :8.0000   Max.   :9.000  
##               NA's   :263                                       
##       ticket          fare                     cabin      embarked  
##  CA. 2343:  11   Min.   :  0.000   C23 C25 C27    :   6   C   :270  
##  1601    :   8   1st Qu.:  7.896   B57 B59 B63 B66:   5   Q   :123  
##  CA 2144 :   8   Median : 14.454   G6             :   5   S   :914  
##  3101295 :   7   Mean   : 33.295   B96 B98        :   4   NA's:  2  
##  347077  :   7   3rd Qu.: 31.275   C22 C26        :   4             
##  347082  :   7   Max.   :512.329   (Other)        : 271             
##  (Other) :1261   NA's   :1         NA's           :1014             
##       boat          body                      home.dest  
##  13     : 39   Min.   :  1.0   New York, NY        : 64  
##  C      : 38   1st Qu.: 72.0   London              : 14  
##  15     : 37   Median :155.0   Montreal, PQ        : 10  
##  14     : 33   Mean   :160.8   Cornwall / Akron, OH:  9  
##  4      : 31   3rd Qu.:256.0   Paris, France       :  9  
##  (Other):308   Max.   :328.0   (Other)             :639  
##  NA's   :823   NA's   :1188    NA's                :564

We next wanted to clean up any missing data. The first variable to receive treatment is the age variable.

#Replacing missing Ages with the mean value in age
titanic$age[is.na(titanic$age)] <- mean(titanic$age , na.rm = TRUE)

After cleaning, all 263 missing ages hold the mean age (about 29.9). This keeps every passenger in the data set for our later analysis, although it also places all of those passengers at a single age.
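As a minimal sketch of what this replacement does, the toy vector below (made-up ages, not rows from the Titanic data) fills each missing value with the mean of the observed values. The names `ages`, `observed_mean`, and `ages_filled` are only for illustration.

```r
# Toy ages (hypothetical values, not taken from the Titanic data)
ages <- c(22, NA, 30, NA, 38)

# Mean of the observed values only; na.rm = TRUE skips the NAs
observed_mean <- mean(ages, na.rm = TRUE)

# Copy the vector and overwrite only the missing positions
ages_filled <- ages
ages_filled[is.na(ages_filled)] <- observed_mean

ages_filled
```

Every filled position receives the same value, so the imputed passengers will all sit at one age in any later plot.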

Next, we wanted to round all the numeric values in the data set to make the numbers cleaner.

#Rounding all the numerical values
titanic2 <- titanic %>%
  mutate_if(is.numeric , round)
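
The same idea can be sketched in base R on a toy data frame, which makes it easier to see which columns are touched. The data frame `toy` and its values are invented for illustration; only the numeric columns are rounded and the text column is left alone.

```r
# Toy data frame (hypothetical values) with one text and two numeric columns
toy <- data.frame(
  name = c("A", "B", "C"),
  age  = c(29.88, 0.92, 45.2),
  fare = c(7.896, 14.454, 512.329),
  stringsAsFactors = FALSE
)

# Flag the numeric columns, then round only those to whole numbers
num_cols <- vapply(toy, is.numeric, logical(1))
toy[num_cols] <- lapply(toy[num_cols], round)

toy
```

Rounding to whole numbers discards the decimal part of the fares; `round(x, 2)` would keep two decimal places if that detail matters later.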

After rounding all the numeric variables, we next wanted to make the data set easier to understand. The two variables “parch” and “sibsp” are quite confusing. To make the analysis easier to follow, we decided to combine these two variables into one variable called “family.”

#Creating new column, family, that is sum of sibsp(sibling/spouse) + parch(parent/child) 
titanic2$family <- (titanic2$sibsp + titanic2$parch)

After this initial cleaning, we wanted to determine the best variables to use to predict whether a passenger would survive the voyage. We started with some exploratory analysis of the data.

First, we wanted to look at the distribution of family size among the passengers. Here is the code and output.

#Family size Histogram
hist(titanic2$family , main = "Family Size" , xlab = "Family Size" , ylab = "Number of Passengers" , col = "orange", breaks = 10)

With this histogram, we can see that the majority of passengers had zero other family members on board. This aligns with the background of the Titanic, as many passengers were single people traveling to America for work and planning on sending money back home.
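
A histogram shows the shape of the distribution, while a table gives exact counts. The sketch below uses a made-up vector of family sizes (not the real column) to show how the share of passengers traveling alone could be computed.

```r
# Toy family sizes (hypothetical values); 0 means no relatives on board
family <- c(0, 0, 1, 0, 2, 0, 5, 1, 0, 0)

# Count how many passengers fall in each family size
family_counts <- table(family)

# Share of passengers with no family on board:
# the mean of a TRUE/FALSE vector is the proportion of TRUE values
share_alone <- mean(family == 0)

family_counts
share_alone
```

On the real data, the same two lines would be applied to `titanic2$family`.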

Next, we wanted to check the fares that the passengers paid. This will give us a good look at the distribution of wealth among the Titanic’s passengers.

hist(titanic2$fare , main = "Ticket Cost" , xlab = "Price Paid" , ylab = "Number of Tickets" , col = "green" , breaks = 10)

This histogram shows us that the vast majority of passengers paid between 0 and 50 dollars for their ticket, while the second most common range was 50-100 dollars per ticket. This fits the Titanic, as the majority of passengers traveled in third class, which had the cheapest tickets on the ship. The histogram also shows that a small number of passengers paid over 500 dollars for their ticket, making them a part of the elite who sailed on Titanic’s maiden voyage.

With some of the exploratory analysis finished, it is time for us to ask some questions. For this data set, we chose three:

1. Did men survive more or less in proportion to women?

2. Is passenger class a good predictor for survival rate?

3. Is age a factor in survival?

Question 1. Did men survive more or less in proportion to women?

For this first question, we feel a good starting point is to plot survival by sex to see if there is any relationship. Before we plot the chart, we first changed the survival variable from 1s and 0s to Yes and No. This makes the graph easier to understand without knowing the data set.

titanic2$survived[titanic2$survived == 1] <- "Yes"
titanic2$survived[titanic2$survived == 0] <- "No"
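
One alternative, sketched here on a toy vector rather than the real column, is to recode with `factor()`. This does the relabeling in one step and keeps the variable categorical with a fixed level order. The vector `survived` below is invented for illustration.

```r
# Toy survival codes (hypothetical values): 1 = survived, 0 = did not
survived <- c(1, 0, 0, 1, 0)

# Map 0 -> "No" and 1 -> "Yes"; levels and labels are matched by position
survived_label <- factor(survived, levels = c(0, 1), labels = c("No", "Yes"))

table(survived_label)
```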

After this change, we will now look at survival by sex.

ggplot(data = titanic2 , aes(x = sex , y = survived , color = sex)) + geom_jitter() + scale_color_manual(values = c("male" = "blue" , "female" = "#fe019a")) + labs(title = "Survival by Gender" , x = "Gender" , y = "Survived")

After analyzing this chart, it seems that men have a lower number of survivors compared to women, even though there were more men on board (843 men and 466 women). This chart suggests that sex is a strong predictor of survival: men were much less likely than women to survive the sinking.
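
A jitter plot shows counts, so a table of proportions is a useful check on a claim about survival rates. The sketch below uses a small invented data frame (`toy`), not the Titanic data, and computes the share of each sex that survived.

```r
# Toy passengers (hypothetical values, not rows from the data set)
toy <- data.frame(
  sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  survived = c("Yes", "Yes", "No", "No", "No", "Yes", "No", "No")
)

# Cross-tabulate sex against survival
counts <- table(toy$sex, toy$survived)

# margin = 1 divides each row by its row total,
# so each row gives the survival split within one sex
rates <- prop.table(counts, margin = 1)

round(rates, 2)
```

On the real data, `table(titanic2$sex, titanic2$survived)` would take the place of `counts`.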

Question 2. Is passenger class a good predictor for survival rate?

For this next question, we want to know whether a passenger’s class would affect their survival rate. We believe a good start would be to analyze the proportion of passengers in each class who survived.

ggplot(data = titanic2  , aes(x = pclass , y = survived , color = pclass)) + geom_jitter() + scale_color_gradient(low = "purple" , high = "red") +labs(title = "Survival by Passenger Class" , x = "Passenger Class" , y = "Survived")

After plotting the three passenger classes, we are given some interesting information. It seems that a slightly higher proportion of first class passengers survived the sinking than did not, which suggests that being in first class increased survival chances by a small amount. This makes sense, as the lifeboat deck on the ship was closest to the first class cabins. For second class, the plot shows about as many survivors as non-survivors, so being in second class was not a good predictor of survival. Finally, the third class data was very clear: far more third class passengers died than survived. Being a third class passenger made one less likely to survive, which makes third class an excellent predictor of survival.
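
The comparison of classes can be made concrete with a survival rate per class. The sketch below is one way to do that, using an invented data frame (`toy`) and a helper column `survived_flag` that is not part of our data set.

```r
# Toy passengers (hypothetical values, not rows from the data set)
toy <- data.frame(
  pclass   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  survived = c("Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No")
)

# TRUE/FALSE version of the survival column so that mean() gives a rate
toy$survived_flag <- toy$survived == "Yes"

# One row per class, holding the share of that class that survived
rate_by_class <- aggregate(survived_flag ~ pclass, data = toy, FUN = mean)

rate_by_class
```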

Question 3. Is age a factor in survival?

Our final question asks whether a passenger’s age can be a predictor of survival. The best way to start answering this question is to compare survival with age.

ggplot(data = titanic2  , aes(x = age , y = survived , color = age)) + geom_jitter() + scale_color_gradient(low = "green" , high = "red") + labs(title = "Survival by Age" , x = "Age" , y = "Survived")

This chart is very interesting. It seems that passengers in the 0-20 age group survived more often than not. This is also slightly true for passengers in their 40s. However, we cannot say with great confidence that being in these age groups increased one’s chances of survival. We conclude that age is not a great predictor of survival.
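
Reading age groups off a jitter plot is hard, so one option is to bin age and compute a survival rate per bin. The sketch below uses an invented data frame (`toy`) and 20-year bands that are an assumption made for illustration, not a choice from our analysis.

```r
# Toy passengers (hypothetical values, not rows from the data set)
toy <- data.frame(
  age      = c(4, 15, 22, 30, 30, 41, 47, 63),
  survived = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No")
)

# Bands are closed on the left: [0,20), [20,40), [40,60), [60,80]
# include.lowest = TRUE closes the last band so that age 80 is not dropped
toy$age_band <- cut(toy$age, breaks = c(0, 20, 40, 60, 80),
                    right = FALSE, include.lowest = TRUE)

# Share of survivors within each band
rate_by_band <- tapply(toy$survived == "Yes", toy$age_band, mean)

rate_by_band
```

Because the missing ages were replaced with the mean, the band that contains the mean age will hold all of those imputed passengers.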

In conclusion, the best predictors for survival on Titanic’s maiden voyage are passenger sex and passenger class. Within passenger class, one was slightly more likely to survive in first class and much less likely to survive in third class. With sex, a female passenger’s odds of survival were much greater than a male passenger’s. This leads us to the conclusion that female passengers in first class had the greatest chances of survival, while male passengers in third class had the slimmest chances of survival on Titanic’s maiden voyage.
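
The plots above look at one variable at a time. One possible next step, which is not part of our analysis so far, is a logistic regression that considers sex, class, and age together. The sketch below runs on simulated data: the data frame `toy`, the sample size, and the effect sizes used to generate it are all invented for illustration.

```r
set.seed(1)
n <- 300

# Simulated passengers (hypothetical data, not the Titanic data set)
toy <- data.frame(
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  pclass = sample(1:3, n, replace = TRUE),
  age    = round(runif(n, min = 1, max = 70))
)

# Invented survival probabilities: lower for men and for lower classes
p <- plogis(1.5 - 2 * (toy$sex == "male") - 0.5 * (toy$pclass - 1))
toy$survived <- rbinom(n, size = 1, prob = p)

# Logistic regression; factor(pclass) treats class as categorical
fit <- glm(survived ~ sex + factor(pclass) + age, data = toy, family = binomial)

summary(fit)$coefficients
```

On the real data, the outcome would need to be the 0/1 coding of survival, or a factor, rather than the "Yes"/"No" text values.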

Data Set 2

Our first step for this data set is to load the libraries, as before.

library(tidyverse)
library(dplyr)

Next, we read the data set into a data frame.

pdi <- read.csv("PDI__Police_Data_Initiative__Police_Calls_for_Service__CAD_.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")

Next we wanted to deal with missing data with the code below.

pdi <- pdi[!(is.na(pdi$LATITUDE_X ) |
               is.na(pdi$LONGITUDE_X)|
               is.na(pdi$DISTRICT)|
               is.na(pdi$PRIORITY)), ]
summary(Filter(is.numeric, pdi))

##    LATITUDE_X     LONGITUDE_X        PRIORITY        DISTRICT    
##  Min.   :39.05   Min.   :-84.71   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:39.11   1st Qu.:-84.55   1st Qu.: 7.00   1st Qu.:2.000  
##  Median :39.13   Median :-84.52   Median :18.00   Median :3.000  
##  Mean   :39.14   Mean   :-84.52   Mean   :18.94   Mean   :3.047  
##  3rd Qu.:39.15   3rd Qu.:-84.49   3rd Qu.:29.00   3rd Qu.:4.000  
##  Max.   :39.22   Max.   :-84.37   Max.   :35.00   Max.   :5.000

pdic <- na.omit(pdi)

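The summary above shows the numeric columns after we removed rows with missing coordinates, district, or priority. We then used na.omit() to drop rows with a missing value in any other column, which leaves 384,790 complete rows. Next, we want to summarize the data and look at some sample data points.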
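
The two cleaning steps differ in which columns they check. As a minimal sketch on an invented data frame (`toy`, with made-up rows that reuse a few of the real column names), `complete.cases()` on selected columns mirrors the first step, while `na.omit()` checks every column.

```r
# Toy calls (hypothetical values); NA marks a missing entry
toy <- data.frame(
  LATITUDE_X  = c(39.11, NA, 39.13, 39.20),
  LONGITUDE_X = c(-84.51, -84.52, -84.55, -84.46),
  DISTRICT    = c(1, 1, 3, 4),
  PRIORITY    = c(27, 35, 18, NA),
  BEAT        = c("P121", "P011", NA, "P451")
)

# Step 1: keep rows that are complete in the required columns only
required <- c("LATITUDE_X", "LONGITUDE_X", "DISTRICT", "PRIORITY")
toy_required <- toy[complete.cases(toy[, required]), ]

# Step 2: drop rows with a missing value in any column
toy_complete <- na.omit(toy)

nrow(toy_required)
nrow(toy_complete)
```

Row 3 survives the first step because only `BEAT` is missing, but it is removed by `na.omit()`.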

Overview of data

str(pdic)

## 'data.frame':    384790 obs. of  19 variables:
##  $ ADDRESS_X                     : Factor w/ 49869 levels "**SECRET** 200 W NORTH BEND",..: 4088 46524 35475 47733 23819 1465 12652 13002 19531 47091 ...
##  $ LATITUDE_X                    : num  39.1 39.1 39.1 39.1 39.2 ...
##  $ LONGITUDE_X                   : num  -84.5 -84.5 -84.5 -84.5 -84.5 ...
##  $ AGENCY                        : Factor w/ 2 levels "CP","CPD": 2 2 2 2 2 2 2 2 2 2 ...
##  $ CREATE_TIME_INCIDENT          : Factor w/ 808959 levels "1/1/2015 0:01",..: 734962 735126 734591 734719 734818 734873 734480 734452 734641 734365 ...
##  $ DISPOSITION_TEXT              : Factor w/ 1041 levels "26: AVAIL/DETAIL COMPLETED",..: 410 838 693 838 936 771 838 202 771 771 ...
##  $ EVENT_NUMBER                  : Factor w/ 1048572 levels "CPD170124000097",..: 318203 317011 317561 317795 316799 318033 317337 317272 317652 317105 ...
##  $ INCIDENT_TYPE_ID              : Factor w/ 317 levels "1","911CALL",..: 63 161 92 63 67 92 105 89 47 92 ...
##  $ INCIDENT_TYPE_DESC            : Factor w/ 258 levels "911 CALL EMERGENCY INDICATED, NO SPECIFICS",..: 72 150 101 72 76 101 116 97 52 101 ...
##  $ PRIORITY                      : int  27 35 18 27 12 18 29 29 6 18 ...
##  $ PRIORITY_COLOR                : Factor w/ 4 levels "BLUE","PURPLE",..: 2 2 1 2 1 1 2 2 4 1 ...
##  $ ARRIVAL_TIME_PRIMARY_UNIT     : Factor w/ 539247 levels "1/1/2015 0:01",..: 489167 489293 488913 488955 489043 489084 488778 488762 488914 488690 ...
##  $ CLOSED_TIME_INCIDENT          : Factor w/ 791474 levels "1/1/2015 0:01",..: 719463 719172 719149 719214 719471 719346 719094 718977 719162 718858 ...
##  $ DISPATCH_TIME_PRIMARY_UNIT    : Factor w/ 595858 levels "1/1/2015 0:25",..: 540858 540990 540573 540645 540734 540776 540445 540429 540575 540358 ...
##  $ BEAT                          : Factor w/ 167 levels "CLR","OTHER COUNTY",..: 14 4 76 90 108 130 62 79 139 20 ...
##  $ COMMUNITY_COUNCIL_NEIGHBORHOOD: Factor w/ 71 levels "AVONDALE","AVONDALE - NORTH AVONDALE",..: 50 68 62 67 58 5 22 61 71 68 ...
##  $ DISTRICT                      : int  1 1 3 4 4 5 3 3 5 1 ...
##  $ SNA_NEIGHBORHOOD              : Factor w/ 51 levels "AVONDALE","BOND HILL",..: 35 48 44 47 40 4 13 43 51 48 ...
##  $ CPD_NEIGHBORHOOD              : Factor w/ 54 levels "AVONDALE","BONDHILL",..: 38 51 48 50 44 5 14 26 54 51 ...
##  - attr(*, "na.action")= 'omit' Named int [1:395180] 3 4 7 12 14 16 23 26 32 35 ...
##   ..- attr(*, "names")= chr [1:395180] "3" "4" "7" "13" ...

summary(pdic)

##                  ADDRESS_X        LATITUDE_X     LONGITUDE_X     AGENCY      
##  23XX FERGUSON RD     :  5309   Min.   :39.05   Min.   :-84.71   CP :     0  
##  3XX EZZARD CHARLES DR:  4803   1st Qu.:39.11   1st Qu.:-84.55   CPD:384790  
##  41XX READING RD      :  4103   Median :39.13   Median :-84.52               
##  58XX HAMILTON AV     :  3316   Mean   :39.14   Mean   :-84.52               
##  17XX REPUBLIC ST     :  2291   3rd Qu.:39.15   3rd Qu.:-84.49               
##  35XX READING RD      :  1970   Max.   :39.22   Max.   :-84.37               
##  (Other)              :362998                                                
##        CREATE_TIME_INCIDENT               DISPOSITION_TEXT 
##  10/31/2020 21:16:     8    NTR: NOTHING TO REPORT:104734  
##  10/22/2020 13:07:     7    INV: INV              : 70962  
##  10/1/2020 17:03 :     6    ADV:ADVISED           : 34645  
##  10/11/2020 20:52:     6    AST: ASSIST           : 24074  
##  10/28/2020 9:58 :     6    GOA: GOA              : 20710  
##  10/7/2020 14:03 :     6    301:OFFENSE REPORT    : 15557  
##  (Other)         :384751    (Other)               :114108  
##           EVENT_NUMBER    INCIDENT_TYPE_ID                 INCIDENT_TYPE_DESC
##  CPD180306001717:     2   DIRPAT : 56180   DIRECTED PATROL - VEHICLE: 56180  
##  CPD200327001402:     2   SDET   : 26924   OFF DUTY POLICE DETAILS  : 26924  
##  CPD170124000185:     1   INV    : 19661   STATION RUN              : 24881  
##  CPD170124000380:     1   TSTOP  : 18400   INVESTIGATION            : 19661  
##  CPD170124000854:     1   ACC    : 18338   TRAFFIC STOP             : 18400  
##  CPD170124000892:     1   ST     : 17495   ACCIDENT NO INJURIES     : 18338  
##  (Other)        :384782   (Other):227792   (Other)                  :220406  
##     PRIORITY     PRIORITY_COLOR     ARRIVAL_TIME_PRIMARY_UNIT
##  Min.   : 1.00   BLUE  :115585   8/8/2019 16:31  :    10     
##  1st Qu.:12.00   PURPLE:207181   11/7/2020 19:01 :     8     
##  Median :25.00   RED   :  4317   5/29/2020 13:55 :     8     
##  Mean   :20.52   YELLOW: 57707   6/27/2020 12:53 :     8     
##  3rd Qu.:29.00                   10/17/2020 18:48:     7     
##  Max.   :35.00                   11/17/2020 12:48:     7     
##                                  (Other)         :384742     
##        CLOSED_TIME_INCIDENT   DISPATCH_TIME_PRIMARY_UNIT      BEAT       
##  8/20/2020 10:05 :    13    8/14/2019 13:08:     7       P321   : 34915  
##  6/2/2020 0:00   :     8    10/28/2020 9:58:     6       P331   : 32174  
##  6/27/2020 13:55 :     8    10/3/2020 16:09:     6       P131   : 29225  
##  8/23/2020 2:24  :     8    11/12/2020 0:01:     6       P121   : 27598  
##  8/25/2020 23:00 :     8    3/20/2020 17:09:     6       P011   : 26554  
##  10/13/2020 21:51:     7    3/3/2020 13:40 :     6       P411   : 19813  
##  (Other)         :384738    (Other)        :384753       (Other):214511  
##  COMMUNITY_COUNCIL_NEIGHBORHOOD    DISTRICT            SNA_NEIGHBORHOOD 
##  WESTWOOD       : 29696         Min.   :1.000   WESTWOOD       : 29422  
##  OTR            : 25793         1st Qu.:2.000   OVER-THE-RHINE : 25409  
##  WEST END       : 24313         Median :3.000   DOWNTOWN       : 24519  
##  DOWNTOWN       : 24112         Mean   :3.006   AVONDALE       : 23763  
##  AVONDALE       : 19283         3rd Qu.:4.000   WEST END       : 22933  
##  EAST PRICE HILL: 18015         Max.   :5.000   EAST PRICE HILL: 22147  
##  (Other)        :243578                         (Other)        :236597  
##               CPD_NEIGHBORHOOD 
##  WESTWOOD             : 29187  
##  C. B. D. / RIVERFRONT: 26565  
##  OVER-THE-RHINE       : 25914  
##  WEST  END            : 22670  
##  EAST PRICE HILL      : 18426  
##  AVONDALE             : 18069  
##  (Other)              :243959

class(pdic)

## [1] "data.frame"

head(pdic)

##                      ADDRESS_X LATITUDE_X LONGITUDE_X AGENCY
## 1               16XX WALNUT ST   39.11293   -84.51470    CPD
## 2        W 7TH ST / CENTRAL AV   39.10219   -84.51901    CPD
## 5  HARRISON AV / QUEEN CITY AV   39.12569   -84.54933    CPD
## 6            WAYNE ST / MAY ST   39.12458   -84.49463    CPD
## 8              76XX READING RD   39.19973   -84.45657    CPD
## 10              11XX HOPPLE ST   39.13976   -84.53463    CPD
##    CREATE_TIME_INCIDENT       DISPOSITION_TEXT    EVENT_NUMBER INCIDENT_TYPE_ID
## 1        8/7/2019 23:29            ARR: ARREST CPD190807001817           DIRPAT
## 2         8/7/2019 9:23 NTR: NOTHING TO REPORT CPD190807000383             SDET
## 5        8/7/2019 15:35               GOA: GOA CPD190807001056           HAZARD
## 6        8/7/2019 18:17 NTR: NOTHING TO REPORT CPD190807001333           DIRPAT
## 8         8/7/2019 2:51       SOW: SENT ON WAY CPD190807000133           DISORD
## 10       8/7/2019 21:16               INV: INV CPD190807001617           HAZARD
##              INCIDENT_TYPE_DESC PRIORITY PRIORITY_COLOR
## 1     DIRECTED PATROL - VEHICLE       27         PURPLE
## 2       OFF DUTY POLICE DETAILS       35         PURPLE
## 5  HAZARD TO TRAFFIC/PEDESTRIAN       18           BLUE
## 6     DIRECTED PATROL - VEHICLE       27         PURPLE
## 8          DISORDERLY PERSON(S)       12           BLUE
## 10 HAZARD TO TRAFFIC/PEDESTRIAN       18           BLUE
##    ARRIVAL_TIME_PRIMARY_UNIT CLOSED_TIME_INCIDENT DISPATCH_TIME_PRIMARY_UNIT
## 1             8/7/2019 23:29       8/7/2019 23:53             8/7/2019 23:29
## 2              8/7/2019 9:24       8/7/2019 17:40              8/7/2019 9:24
## 5             8/7/2019 17:12       8/7/2019 17:13             8/7/2019 16:37
## 6             8/7/2019 18:17       8/7/2019 18:32             8/7/2019 18:17
## 8              8/7/2019 2:58        8/7/2019 3:03              8/7/2019 2:52
## 10            8/7/2019 21:16       8/7/2019 21:18             8/7/2019 21:16
##    BEAT COMMUNITY_COUNCIL_NEIGHBORHOOD DISTRICT SNA_NEIGHBORHOOD
## 1  P121                            OTR        1   OVER-THE-RHINE
## 2  P011                       WEST END        1         WEST END
## 5  P341                SOUTH FAIRMOUNT        3  SOUTH FAIRMOUNT
## 6  P421                   WALNUT HILLS        4     WALNUT HILLS
## 8  P451                       ROSELAWN        4         ROSELAWN
## 10 P521                CAMP WASHINGTON        5  CAMP WASHINGTON
##    CPD_NEIGHBORHOOD
## 1    OVER-THE-RHINE
## 2         WEST  END
## 5  SOUTH  FAIRMOUNT
## 6      WALNUT HILLS
## 8          ROSELAWN
## 10 CAMP  WASHINGTON

With the data now clean, we can begin data analysis. We want to first propose three questions about this data to find some insight. Here are our three questions:

1. How frequently does each priority level occur?

2. Which CPD Neighborhoods are the most Dangerous?

3. Which CPD Neighborhoods are the safest?

Question 1. How frequently does each priority level occur?

Priority 1 calls are those that pose the greatest danger to the community and to the officers who respond. We want to see how often these dangerous calls come up compared to the other, less dangerous calls.

hist(pdic$PRIORITY, col = "green", breaks = 35)

This histogram shows that priority 1 calls are thankfully not the most common priority call level. There are 9 other levels of priority that occur more often. In order to drill down on this subject, we want to see which district has the most police calls. This may help us determine the district that poses the greatest danger to officers.
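
The ranking of priority levels can also be read from a sorted frequency table instead of the histogram. The sketch below uses an invented vector of priority codes (`priority`), not the real column, to show how the rank of priority 1 could be found.

```r
# Toy priority codes (hypothetical values, not taken from the call data)
priority <- c(rep(1, 2), rep(7, 3), rep(18, 5), rep(29, 4))

# Count calls per priority level, most frequent first
freq <- sort(table(priority), decreasing = TRUE)

# Position of priority 1 in that ordering (1 = most frequent)
rank_of_1 <- which(names(freq) == "1")

freq
rank_of_1
```

On the real data, `pdic$PRIORITY` would take the place of `priority`.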

barplot(table(pdic$DISTRICT),
        main = "Number of Police Calls by District",
        xlab = "District Number",
        ylab = "Number of Calls",
        col = "red")

After analyzing the bar chart, we can see that District 2 has the fewest police calls, while District 3 has the most. Using call volume as a rough measure of danger, we assume that District 2 is the safest, as it has by far the least police activity, while District 3 is the most dangerous, as it has the most.

Question 2. Which CPD Neighborhoods are the most Dangerous?

After figuring out which districts were the safest and most dangerous, we now want to find out which neighborhood is the most dangerous. We first considered looking up the total number of police calls for each neighborhood. However, we decided to be more thorough and look only at calls with priority level 1. Level 1 calls are the most dangerous for both the community and the officers taking the call, as they involve the most violent crimes. We created a bar chart for the analysis.

pdic %>%
  filter(PRIORITY == 1) %>%
  count(CPD_NEIGHBORHOOD,sort = TRUE) %>%
  filter(n > 15) %>%
  mutate(CPD_NEIGHBORHOOD = reorder(CPD_NEIGHBORHOOD, n)) %>%
  ggplot(aes(CPD_NEIGHBORHOOD, n)) +
  geom_col(fill = "blue") +
  xlab(NULL)+
  ylab("No. of Priority 1 calls") +
  theme_minimal()+
  coord_flip()

This chart shows us that the two neighborhoods with the most level 1 calls are West Price Hill and Avondale. By this measure, people who live in one of these two neighborhoods are the most likely to have a level 1 call nearby, making them the most dangerous places to live in the Cincinnati area.

Question 3. Which CPD Neighborhoods are the safest?

After discovering which CPD neighborhoods are the most dangerous, we wanted to see which neighborhoods are the safest. We first want to see which neighborhoods have the fewest police calls.

pdic %>%
  # %in% tests membership; == would recycle 1:35 down the rows and keep only chance matches
  filter(PRIORITY %in% 1:35) %>%
  count(CPD_NEIGHBORHOOD,sort = TRUE) %>%
  filter(n < 45) %>%
  mutate(CPD_NEIGHBORHOOD = reorder(CPD_NEIGHBORHOOD, n)) %>%
  ggplot(aes(CPD_NEIGHBORHOOD, n)) +
  geom_col(fill = "purple") +
  xlab(NULL)+
  ylab("No. of Police Calls") +
  theme_minimal()+
  coord_flip()

After the analysis, we get a list of neighborhoods with fewer than 45 total police calls. However, further analysis is needed to determine which of these neighborhoods is actually the safest, as priority level is not taken into account. To remedy this, we will run new code that counts only the high/medium level calls. We will use priority levels 1-25 to make this determination.
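
Selecting calls whose priority falls in a range is a membership test, and in R the choice of operator matters. The sketch below uses an invented vector (`priority`) to show the difference: `==` compares element by element and recycles the shorter vector, while `%in%` checks each value against the whole set.

```r
# Toy priority codes (hypothetical values) and the levels we want to keep
priority <- c(1, 30, 3, 2, 1, 3)
wanted   <- 1:3

# == recycles `wanted` to 1, 2, 3, 1, 2, 3 and compares position by position
keep_eq <- priority == wanted

# %in% asks, for each call, whether its priority is anywhere in `wanted`
keep_in <- priority %in% wanted

sum(keep_eq)  # calls kept by position-wise comparison
sum(keep_in)  # calls kept by membership test
```

With `==`, the calls with priority 2 and 1 in positions 4 and 5 are dropped even though both values are in the wanted range. R only warns about this when the longer length is not a multiple of the shorter one.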

pdic %>%
  # %in% keeps every call whose priority is between 1 and 25
  filter(PRIORITY %in% 1:25) %>%
  count(CPD_NEIGHBORHOOD,sort = TRUE) %>%
  filter(n < 45) %>%
  mutate(CPD_NEIGHBORHOOD = reorder(CPD_NEIGHBORHOOD, n)) %>%
  ggplot(aes(CPD_NEIGHBORHOOD, n)) +
  geom_col(fill = "green") +
  xlab(NULL)+
  ylab("No. of Priority 1-25 calls") +
  theme_minimal()+
  coord_flip()

With this output, we can make the reasonable assumption that O’Bryonville is the safest neighborhood in Cincinnati, as it has the fewest police calls overall as well as the fewest high/medium priority calls.

With our analysis, we can conclude that police respond to lower priority calls more often than to higher priority ones. We can also see that nine other priority levels occur more often than the most dangerous calls (priority level 1). We also determined that the district with the most calls is District 3 and the district with the fewest is District 2, which had fewer police calls than any of the other four districts.

We can also conclude that the neighborhoods of West Price Hill and Avondale are the most dangerous for police calls, as they have the highest number of level 1 priority calls. This may mean that the community members and police officers in these neighborhoods face the greatest risk of danger.

We also analyzed which neighborhoods are the safest, and we concluded that O’Bryonville was the safest place to live, as it has the fewest high/medium priority calls and the fewest calls overall.

Project Proposal Group 1

Kevin Hazenfield M12044610, Johnny Dumoulin M12371984, Jacob Leusch M12272783, Kevin Leson M02553127, Erik Randall M12942660

1/29/2021