Proposal Draft

For our first data set, we chose one describing the passengers on Titanic’s maiden voyage. We downloaded it from Kaggle.com. Here is the link: https://www.kaggle.com/azeembootwala/titanic?select=train_data.csv

For the data set, we first read it as a data frame.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

titanic = read.csv("Titanic Train.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
titanic_df <- data.frame(titanic)
train <- titanic_df

After this initial setup, we then explored its dimensions and variables.

class(train)

## [1] "data.frame"

nrow(train)

## [1] 792

ncol(train)

## [1] 17

names(train)

##  [1] "X"             "PassengerId"   "Survived"      "Sex"          
##  [5] "Age"           "Fare"          "Pclass_1"      "Pclass_2"     
##  [9] "Pclass_3"      "Family_size"   "Married.Man"   "Married.Woman"
## [13] "Single.Man"    "Single.Woman"  "Chebourg"      "Queenstown"   
## [17] "Southampton"

head(train)

##   X PassengerId Survived Sex    Age       Fare Pclass_1 Pclass_2 Pclass_3
## 1 0           1        0   1 0.2750 0.01415106        0        0        1
## 2 1           2        1   0 0.4750 0.13913574        1        0        0
## 3 2           3        1   0 0.3250 0.01546857        0        0        1
## 4 3           4        1   0 0.4375 0.10364430        1        0        0
## 5 4           5        0   1 0.4375 0.01571255        0        0        1
## 6 5           6        0   1 0.3500 0.01650950        0        0        1
##   Family_size Married.Man Married.Woman Single.Man Single.Woman Chebourg
## 1         0.1           1             0          0            0        0
## 2         0.1           1             0          0            0        1
## 3         0.0           0             0          0            1        0
## 4         0.1           1             0          0            0        0
## 5         0.0           1             0          0            0        0
## 6         0.0           1             0          0            0        0
##   Queenstown Southampton
## 1          0           1
## 2          0           0
## 3          0           1
## 4          0           1
## 5          0           1
## 6          1           0

tail(train)

##       X PassengerId Survived Sex    Age       Fare Pclass_1 Pclass_2 Pclass_3
## 787 786         787        1   0 0.2250 0.01463083        0        0        1
## 788 787         788        0   1 0.1000 0.05684821        0        0        1
## 789 788         789        1   1 0.0125 0.04015973        0        0        1
## 790 789         790        0   1 0.5750 0.15458811        1        0        0
## 791 790         791        0   1 0.3500 0.01512699        0        0        1
## 792 791         792        0   1 0.2000 0.05074862        0        1        0
##     Family_size Married.Man Married.Woman Single.Man Single.Woman Chebourg
## 787         0.0           0             0          0            1        0
## 788         0.5           0             0          1            0        0
## 789         0.3           0             0          1            0        0
## 790         0.0           1             0          0            0        1
## 791         0.0           1             0          0            0        0
## 792         0.0           1             0          0            0        0
##     Queenstown Southampton
## 787          0           1
## 788          1           0
## 789          0           1
## 790          0           0
## 791          1           0
## 792          0           1

With this dataset, we wanted to determine the best variables to use to predict whether one would survive the voyage.

We first wanted to do some exploratory analysis on the data.

First, we wanted to look at the distribution of family size among the passengers. Here is the code and output.

hist(train$Family_size, col = "blue", breaks = 15)

With this histogram, we can see that the normalized family size for the majority of passengers is less than 0.20. This means there is a great disparity between the number of passengers traveling with few or no family members and the number traveling with many.
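To put a number on what the histogram suggests, the share of passengers below a cutoff could be computed directly. This is only a minimal sketch: the helper `share_below` and the toy values are our own illustration and are not taken from the data set.

```r
# Hypothetical helper: fraction of values strictly below a cutoff.
share_below <- function(x, cutoff = 0.2) {
  mean(x < cutoff, na.rm = TRUE)
}

# Made-up normalized family sizes standing in for train$Family_size.
toy_family_size <- c(0, 0, 0.1, 0.1, 0.5, 0)

# Five of the six toy values fall below 0.2.
share_below(toy_family_size)
```

On the real data, the equivalent call would be `share_below(train$Family_size)`.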

Next, we wanted to check the fare prices that the passengers paid.

hist(train$Fare, col = "green", breaks = 15)

Again, the data shows that the majority of normalized fares paid for the voyage are below 0.20. This means that most passengers paid a small fare, while a minority paid a much larger one. This makes sense, as the gap in price and passenger numbers between first, second, and third class was pronounced on Titanic’s maiden voyage.

With some of the exploratory analysis finished, it is time for us to ask some questions. For this data set, we chose three:

1. Did men survive at a higher or lower rate than women?

2. Is passenger class a good predictor for survival rate?

3. Is a larger family size better for survival?

Question 1. Did men survive at a higher or lower rate than women?

For this first question, we feel a good starting point would be to plot survival against sex, to see if there is any relationship.

p <- ggplot(train) + aes(Sex,Survived)

p + geom_jitter()

This chart compares survival (1 = survived, 0 = did not survive) for males (1) and females (0). It shows that, among men, fewer survived than did not, while among women, more survived than did not. This suggests that women most likely had a higher survival rate than men. However, more analysis is needed on whether being single or not is important for survival.
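The jitter plot gives a visual impression only; the survival rate within each sex could also be computed directly. The sketch below assumes the same 0/1 coding as `train`, but the toy rows are made up for illustration.

```r
# Made-up rows standing in for `train`; Sex: 1 = male, 0 = female.
toy <- data.frame(
  Sex      = c(1, 1, 1, 1, 0, 0, 0, 0),
  Survived = c(0, 0, 0, 1, 1, 1, 1, 0)
)

# The mean of a 0/1 column is the proportion of 1s, so this gives
# the survival rate within each sex.
rate_by_sex <- aggregate(Survived ~ Sex, data = toy, FUN = mean)
rate_by_sex
```

On the real data, replacing `toy` with `train` would give one survival rate for women (Sex = 0) and one for men (Sex = 1).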

Question 2. Is passenger class a good predictor for survival rate?

For this next question, we want to know whether a passenger’s class would affect their survival rate. We believe a good start would be to analyze the proportion of passengers in each class and their survival rate.

p <- ggplot(train) + aes(Pclass_1,Survived)

p + geom_jitter()

p2 <- ggplot(train) + aes(Pclass_2,Survived)

p2 + geom_jitter()

p3 <- ggplot(train) + aes(Pclass_3,Survived)

p3 + geom_jitter()

After plotting the three charts, we can see that there is some difference in survival rate between the three classes. With further analysis, we may be able to raise our confidence in passenger class being an accurate predictor of survival.
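Because class is spread over three indicator columns, one possible next step is to collapse them into a single class variable and compare survival rates side by side. This is a sketch under the assumption that exactly one of the three columns is 1 in each row; the toy rows and the new `Pclass` column are our own illustration.

```r
# Made-up rows standing in for `train`, with the same indicator columns.
toy <- data.frame(
  Pclass_1 = c(1, 0, 0, 1, 0, 0),
  Pclass_2 = c(0, 1, 0, 0, 0, 1),
  Pclass_3 = c(0, 0, 1, 0, 1, 0),
  Survived = c(1, 1, 0, 1, 0, 0)
)

# max.col() returns, for each row, the index of the column holding the
# largest value; with indicator columns that index is the class number.
class_cols <- c("Pclass_1", "Pclass_2", "Pclass_3")
toy$Pclass <- factor(max.col(as.matrix(toy[, class_cols]),
                             ties.method = "first"))

# Survival rate within each class.
rate_by_class <- tapply(toy$Survived, toy$Pclass, mean)
rate_by_class
```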

Question 3. Is a larger family size better for survival?

Our final question tasks us with determining whether a larger family size can increase the survival rate for each passenger. The best way to start answering this question is to compare survival rate with family size.

p4 <- ggplot(train) + aes(Family_size,Survived)

p4 + geom_jitter()

This chart is very interesting. It shows that the passengers with the most family members did not seem to survive the voyage. However, more analysis is needed to answer our question. Comparing family size to passenger class could paint a better picture for our analysis.
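One way to follow up on the chart would be to group family size into a few bins and compute the survival rate in each. The sketch below is illustrative only: the bin boundaries are arbitrary choices of ours, and the toy rows are made up rather than drawn from `train`.

```r
# Made-up rows standing in for `train`.
toy <- data.frame(
  Family_size = c(0, 0, 0.1, 0.1, 0.3, 0.5, 0.7, 1.0),
  Survived    = c(0, 1, 1, 1, 1, 0, 0, 0)
)

# Arbitrary bins: [0, 0.2], (0.2, 0.5], (0.5, 1].
toy$size_bin <- cut(toy$Family_size,
                    breaks = c(0, 0.2, 0.5, 1),
                    include.lowest = TRUE)

# Survival rate within each family-size bin.
rate_by_bin <- tapply(toy$Survived, toy$size_bin, mean)
rate_by_bin
```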

Data Set 2

library(tidyverse)
library(dplyr)

pdi = read.csv("PDI__Police_Data_Initiative__Police_Calls_for_Service__CAD_.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")

pdi <- pdi[!(is.na(pdi$LATITUDE_X ) |
               is.na(pdi$LONGITUDE_X)|
               is.na(pdi$DISTRICT)|
               is.na(pdi$PRIORITY)), ]
summary(Filter(is.numeric, pdi))

##    LATITUDE_X     LONGITUDE_X        PRIORITY        DISTRICT    
##  Min.   :39.05   Min.   :-84.71   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:39.11   1st Qu.:-84.55   1st Qu.: 7.00   1st Qu.:2.000  
##  Median :39.13   Median :-84.52   Median :18.00   Median :3.000  
##  Mean   :39.14   Mean   :-84.52   Mean   :18.94   Mean   :3.047  
##  3rd Qu.:39.15   3rd Qu.:-84.49   3rd Qu.:29.00   3rd Qu.:4.000  
##  Max.   :39.22   Max.   :-84.37   Max.   :35.00   Max.   :5.000

The summary above shows the numeric columns after rows with missing latitude, longitude, district, or priority values were removed.
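The same row filter can also be written with `complete.cases()`, which makes it easy to count how many rows are dropped. This is a sketch of an equivalent formulation on made-up rows, not the code used for the output above; the names `required`, `keep`, and `cleaned` are our own.

```r
# Made-up rows standing in for `pdi`; NA marks a missing value.
toy <- data.frame(
  LATITUDE_X  = c(39.11, NA, 39.13, 39.19),
  LONGITUDE_X = c(-84.51, -84.52, -84.52, -84.48),
  PRIORITY    = c(27L, 35L, NA, 18L),
  DISTRICT    = c(1L, 1L, 5L, 4L),
  BEAT        = c("P121", NA, "P511", "P461")
)

# Only these columns must be present; NAs elsewhere (e.g. BEAT) are kept.
required <- c("LATITUDE_X", "LONGITUDE_X", "DISTRICT", "PRIORITY")
keep <- complete.cases(toy[, required])
cleaned <- toy[keep, ]

sum(!keep)     # number of rows dropped
nrow(cleaned)  # number of rows remaining
```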

Overview of data

str(pdi)

## 'data.frame':    779970 obs. of  19 variables:
##  $ ADDRESS_X                     : Factor w/ 49869 levels "**SECRET** 200 W NORTH BEND",..: 4088 46524 6237 45838 35475 47733 11839 23819 1465 12652 ...
##  $ LATITUDE_X                    : num  39.1 39.1 39.1 39.2 39.1 ...
##  $ LONGITUDE_X                   : num  -84.5 -84.5 -84.5 -84.5 -84.5 ...
##  $ AGENCY                        : Factor w/ 2 levels "CP","CPD": 2 2 2 2 2 2 2 2 2 2 ...
##  $ CREATE_TIME_INCIDENT          : Factor w/ 808959 levels "1/1/2015 0:01",..: 734962 735126 734571 734721 734591 734719 734853 734818 734873 734480 ...
##  $ DISPOSITION_TEXT              : Factor w/ 1041 levels "26: AVAIL/DETAIL COMPLETED",..: 410 838 330 771 693 838 642 936 771 838 ...
##  $ EVENT_NUMBER                  : Factor w/ 1048572 levels "CPD170124000097",..: 318203 317011 317524 317799 317561 317795 317997 316799 318033 317337 ...
##  $ INCIDENT_TYPE_ID              : Factor w/ 317 levels "1","911CALL",..: 63 161 13 132 92 63 169 67 92 105 ...
##  $ INCIDENT_TYPE_DESC            : Factor w/ 258 levels "911 CALL EMERGENCY INDICATED, NO SPECIFICS",..: 72 150 11 NA 101 72 215 76 101 116 ...
##  $ PRIORITY                      : int  27 35 35 18 18 27 10 12 18 29 ...
##  $ PRIORITY_COLOR                : Factor w/ 4 levels "BLUE","PURPLE",..: 2 2 2 NA 1 2 1 1 1 2 ...
##  $ ARRIVAL_TIME_PRIMARY_UNIT     : Factor w/ 539247 levels "1/1/2015 0:01",..: 489167 489293 NA NA 488913 488955 NA 489043 489084 488778 ...
##  $ CLOSED_TIME_INCIDENT          : Factor w/ 791474 levels "1/1/2015 0:01",..: 719463 719172 719045 719203 719149 719214 719339 719471 719346 719094 ...
##  $ DISPATCH_TIME_PRIMARY_UNIT    : Factor w/ 595858 levels "1/1/2015 0:25",..: 540858 540990 NA 540647 540573 540645 NA 540734 540776 540445 ...
##  $ BEAT                          : Factor w/ 167 levels "CLR","OTHER COUNTY",..: 14 4 124 117 76 90 28 108 130 62 ...
##  $ COMMUNITY_COUNCIL_NEIGHBORHOOD: Factor w/ 71 levels "AVONDALE","AVONDALE - NORTH AVONDALE",..: 50 68 18 6 62 67 27 58 5 22 ...
##  $ DISTRICT                      : int  1 1 5 4 3 4 2 4 5 3 ...
##  $ SNA_NEIGHBORHOOD              : Factor w/ 51 levels "AVONDALE","BOND HILL",..: 35 48 10 5 44 47 17 40 4 13 ...
##  $ CPD_NEIGHBORHOOD              : Factor w/ 54 levels "AVONDALE","BONDHILL",..: 38 51 18 6 48 50 17 44 5 14 ...

summary(pdi)

##                  ADDRESS_X        LATITUDE_X     LONGITUDE_X     AGENCY      
##  20XX RADCLIFF DR     : 38768   Min.   :39.05   Min.   :-84.71   CP :105078  
##  3XX EZZARD CHARLES DR:  8599   1st Qu.:39.11   1st Qu.:-84.55   CPD:674892  
##  23XX FERGUSON RD     :  8168   Median :39.13   Median :-84.52               
##  41XX READING RD      :  6236   Mean   :39.14   Mean   :-84.52               
##  17XX REPUBLIC ST     :  5236   3rd Qu.:39.15   3rd Qu.:-84.49               
##  (Other)              :712912   Max.   :39.22   Max.   :-84.37               
##  NA's                 :    51                                                
##        CREATE_TIME_INCIDENT               DISPOSITION_TEXT 
##  8/3/2020 17:00  :     9    NTR: NOTHING TO REPORT:127820  
##  9/1/2020 19:11  :     9    ADV:ADVISED           : 84266  
##  9/25/2020 14:54 :     9    INV: INV              : 82925  
##  10/1/2020 17:03 :     8    CAN:CANCEL            : 80887  
##  10/15/2020 18:14:     8    ADVISED               : 48083  
##  10/31/2020 21:16:     8    (Other)               :340895  
##  (Other)         :779919    NA's                  : 15094  
##           EVENT_NUMBER    INCIDENT_TYPE_ID                 INCIDENT_TYPE_DESC
##  CPD180306001717:     2   DIRPAT : 68919   DIRECTED PATROL - VEHICLE: 61094  
##  CPD200327001402:     2   ADV    : 62015   ADVISED INCIDENT         : 53039  
##  CPD200424001443:     2   SDET   : 51780   OFF DUTY POLICE DETAILS  : 41634  
##  CPD170124000097:     1   CELL   : 35663   STATION RUN              : 33763  
##  CPD170124000102:     1   REPO   : 33093   CELL DISCON OR SICAL     : 33504  
##  CPD170124000111:     1   911CALL: 32837   (Other)                  :482251  
##  (Other)        :779961   (Other):495663   NA's                     : 74685  
##     PRIORITY     PRIORITY_COLOR    ARRIVAL_TIME_PRIMARY_UNIT
##  Min.   : 1.00   BLUE  :181210   8/8/2019 16:31 :    10     
##  1st Qu.: 7.00   PURPLE:344862   2/29/2020 0:26 :     9     
##  Median :18.00   RED   :  5608   11/7/2020 19:01:     8     
##  Mean   :18.94   YELLOW: 68527   5/29/2020 13:55:     8     
##  3rd Qu.:29.00   NA's  :179763   6/27/2020 12:53:     8     
##  Max.   :35.00                   (Other)        :496911     
##                                  NA's           :283016     
##        CLOSED_TIME_INCIDENT    DISPATCH_TIME_PRIMARY_UNIT      BEAT       
##  8/20/2020 10:05 :    14    7/30/2020 11:07 :     8       P321   : 60540  
##  6/2/2020 0:00   :    10    8/14/2019 13:08 :     7       P331   : 59237  
##  10/31/2020 13:59:     9    8/16/2020 13:46 :     7       P131   : 47581  
##  7/5/2020 1:01   :     9    8/23/2020 9:31  :     7       P011   : 46543  
##  8/23/2020 2:24  :     9    10/11/2020 16:25:     6       P121   : 45661  
##  (Other)         :762211    (Other)         :550425       (Other):518524  
##  NA's            : 17708    NA's            :229510       NA's   :  1884  
##  COMMUNITY_COUNCIL_NEIGHBORHOOD    DISTRICT            SNA_NEIGHBORHOOD 
##  WESTWOOD       : 59587         Min.   :1.000   EAST PRICE HILL: 78488  
##  SOUTH FAIRMOUNT: 49549         1st Qu.:2.000   WESTWOOD       : 59135  
##  DOWNTOWN       : 46587         Median :3.000   DOWNTOWN       : 47449  
##  OTR            : 45507         Mean   :3.047   OVER-THE-RHINE : 44872  
##  WEST END       : 42275         3rd Qu.:4.000   AVONDALE       : 44849  
##  AVONDALE       : 36460         Max.   :5.000   WEST END       : 39579  
##  (Other)        :500005                         (Other)        :465598  
##               CPD_NEIGHBORHOOD 
##  WESTWOOD             : 58720  
##  SOUTH  FAIRMOUNT     : 51491  
##  C. B. D. / RIVERFRONT: 50962  
##  OVER-THE-RHINE       : 45766  
##  WEST  END            : 39052  
##  EAST PRICE HILL      : 33615  
##  (Other)              :500364

class(pdi)

## [1] "data.frame"

head(pdi)

##                     ADDRESS_X LATITUDE_X LONGITUDE_X AGENCY
## 1              16XX WALNUT ST   39.11293   -84.51470    CPD
## 2       W 7TH ST / CENTRAL AV   39.10219   -84.51901    CPD
## 3           1XX W MCMILLAN ST   39.12864   -84.51835    CPD
## 4         VINE ST / E 70TH ST   39.19225   -84.48156    CPD
## 5 HARRISON AV / QUEEN CITY AV   39.12569   -84.54933    CPD
## 6           WAYNE ST / MAY ST   39.12458   -84.49463    CPD
##   CREATE_TIME_INCIDENT       DISPOSITION_TEXT    EVENT_NUMBER INCIDENT_TYPE_ID
## 1       8/7/2019 23:29            ARR: ARREST CPD190807001817           DIRPAT
## 2        8/7/2019 9:23 NTR: NOTHING TO REPORT CPD190807000383             SDET
## 3       8/7/2019 15:06            ADV:ADVISED CPD190807001005              ADV
## 4       8/7/2019 18:19               INV: INV CPD190807001337  PERDWP-COMBINED
## 5       8/7/2019 15:35               GOA: GOA CPD190807001056           HAZARD
## 6       8/7/2019 18:17 NTR: NOTHING TO REPORT CPD190807001333           DIRPAT
##             INCIDENT_TYPE_DESC PRIORITY PRIORITY_COLOR
## 1    DIRECTED PATROL - VEHICLE       27         PURPLE
## 2      OFF DUTY POLICE DETAILS       35         PURPLE
## 3             ADVISED INCIDENT       35         PURPLE
## 4                         <NA>       18           <NA>
## 5 HAZARD TO TRAFFIC/PEDESTRIAN       18           BLUE
## 6    DIRECTED PATROL - VEHICLE       27         PURPLE
##   ARRIVAL_TIME_PRIMARY_UNIT CLOSED_TIME_INCIDENT DISPATCH_TIME_PRIMARY_UNIT
## 1            8/7/2019 23:29       8/7/2019 23:53             8/7/2019 23:29
## 2             8/7/2019 9:24       8/7/2019 17:40              8/7/2019 9:24
## 3                      <NA>       8/7/2019 15:06                       <NA>
## 4                      <NA>       8/7/2019 18:20             8/7/2019 18:19
## 5            8/7/2019 17:12       8/7/2019 17:13             8/7/2019 16:37
## 6            8/7/2019 18:17       8/7/2019 18:32             8/7/2019 18:17
##   BEAT COMMUNITY_COUNCIL_NEIGHBORHOOD DISTRICT SNA_NEIGHBORHOOD
## 1 P121                            OTR        1   OVER-THE-RHINE
## 2 P011                       WEST END        1         WEST END
## 3 P511                  CUF - HEIGHTS        5              CUF
## 4 P461                       CARTHAGE        4         CARTHAGE
## 5 P341                SOUTH FAIRMOUNT        3  SOUTH FAIRMOUNT
## 6 P421                   WALNUT HILLS        4     WALNUT HILLS
##   CPD_NEIGHBORHOOD
## 1   OVER-THE-RHINE
## 2        WEST  END
## 3         FAIRVIEW
## 4         CARTHAGE
## 5 SOUTH  FAIRMOUNT
## 6     WALNUT HILLS

Histogram

hist(pdi$PRIORITY, col = "green", breaks = 20)

Density Plot

plot(density(pdi$PRIORITY))

Priority vs Density Histogram

hist(pdi$PRIORITY, probability = TRUE, col = "green", breaks = 20, main = "PRIORITY", xlim = c(1, 45), xlab = "PRIORITY")
lines(density(pdi$PRIORITY), col = "red", lwd = 2)
abline(v = mean(pdi$PRIORITY), col = "blue", lty = 2, lwd = 1.5)

Boxplot of Priority

boxplot(pdi$PRIORITY)

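The boxplot shows the median of the priority values (18 in the summary above), the first and third quartiles (7 and 29), and how far the values spread around the median.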
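The numbers behind a boxplot can be inspected with `boxplot.stats()`. As a small illustration, the vector below uses only the five summary values reported earlier for PRIORITY (1, 7, 18, 29, 35) as a stand-in for the full column, so it is a sketch rather than a reproduction of the plot.

```r
# Stand-in for pdi$PRIORITY: the five summary values reported above.
x <- c(1, 7, 18, 29, 35)

# $stats holds lower whisker, lower hinge, median, upper hinge, upper whisker.
# $out holds points beyond 1.5 times the hinge spread from the box.
bp <- boxplot.stats(x)
bp$stats
bp$out

# Width of the box (hinge spread) and the resulting outlier fences.
box_width <- bp$stats[4] - bp$stats[2]
fences <- c(bp$stats[2] - 1.5 * box_width, bp$stats[4] + 1.5 * box_width)
fences
```

With these values the fences fall well outside the range 1 to 35, so no point is flagged as an outlier.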

Priority vs District Plot

# The formula is y ~ x, so DISTRICT is on the x-axis and PRIORITY on the y-axis.
plot(pdi$PRIORITY ~ pdi$DISTRICT, xlab = "DISTRICT", ylab = "PRIORITY",
     main = "PRIORITY VS DISTRICT")

Agency vs priority

xtabs(~ PRIORITY + AGENCY, data = pdi)

##         AGENCY
## PRIORITY     CP    CPD
##       1     432   7055
##       2    6846  22308
##       3     470  18222
##       4   11565  35198
##       5    5815   8634
##       6   12575  10874
##       7   60659  17762
##       8    6710   3652
##       9       6   7545
##       10      0  39260
##       11      0  21107
##       12      0  38662
##       13      0   8881
##       14      0  18946
##       15      0     12
##       16      0   3175
##       17      0     77
##       18      0  31965
##       19      0    296
##       20      0    436
##       21      0  26267
##       22      0   6805
##       23      0    381
##       24      0   3108
##       25      0  29803
##       26      0   2290
##       27      0  64542
##       28      0   5221
##       29      0  54726
##       30      0    306
##       31      0  32611
##       32      0  19889
##       33      0    455
##       34      0   2021
##       35      0 132400

plot(xtabs(~ PRIORITY + AGENCY, data = pdi), main="Agency vs level of priority")

The output shows which agency handles calls at each priority level: CP appears only at priorities 1 through 9, while CPD handles calls at every priority level.

Convert the cross tabulation to proportions by row

crosstabs <- xtabs(~ PRIORITY + AGENCY, data = pdi)
prop.table(crosstabs, 1)

##         AGENCY
## PRIORITY           CP          CPD
##       1  0.0577000134 0.9422999866
##       2  0.2348219798 0.7651780202
##       3  0.0251444468 0.9748555532
##       4  0.2473109082 0.7526890918
##       5  0.4024499965 0.5975500035
##       6  0.5362702034 0.4637297966
##       7  0.7735045460 0.2264954540
##       8  0.6475583864 0.3524416136
##       9  0.0007945967 0.9992054033
##       10 0.0000000000 1.0000000000
##       11 0.0000000000 1.0000000000
##       12 0.0000000000 1.0000000000
##       13 0.0000000000 1.0000000000
##       14 0.0000000000 1.0000000000
##       15 0.0000000000 1.0000000000
##       16 0.0000000000 1.0000000000
##       17 0.0000000000 1.0000000000
##       18 0.0000000000 1.0000000000
##       19 0.0000000000 1.0000000000
##       20 0.0000000000 1.0000000000
##       21 0.0000000000 1.0000000000
##       22 0.0000000000 1.0000000000
##       23 0.0000000000 1.0000000000
##       24 0.0000000000 1.0000000000
##       25 0.0000000000 1.0000000000
##       26 0.0000000000 1.0000000000
##       27 0.0000000000 1.0000000000
##       28 0.0000000000 1.0000000000
##       29 0.0000000000 1.0000000000
##       30 0.0000000000 1.0000000000
##       31 0.0000000000 1.0000000000
##       32 0.0000000000 1.0000000000
##       33 0.0000000000 1.0000000000
##       34 0.0000000000 1.0000000000
##       35 0.0000000000 1.0000000000

Each row now adds up to 1, so the entries are the proportion of calls at that priority handled by each agency.
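The row-sum property can be checked directly, and the proportions can be turned into percentages for easier reading. The sketch below uses a made-up table with the same two agencies; the object names `counts` and `shares` are our own.

```r
# Made-up calls standing in for `pdi`.
toy <- data.frame(
  PRIORITY = c(1, 1, 1, 1, 2, 2, 10, 10),
  AGENCY   = c("CP", "CPD", "CPD", "CPD", "CP", "CPD", "CPD", "CPD")
)

counts <- xtabs(~ PRIORITY + AGENCY, data = toy)

# margin = 1 divides each cell by its row total.
shares <- prop.table(counts, margin = 1)

rowSums(shares)         # every row sums to 1
round(100 * shares, 1)  # the same table as percentages
```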

Project Proposal Group 1

Kevin Hazenfield

1/29/2021