For our first data set, we chose one describing the passengers on the Titanic’s maiden voyage. We downloaded it from Kaggle.com: https://www.kaggle.com/azeembootwala/titanic?select=train_data.csv
We first read the data set into a data frame.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
titanic = read.csv("Titanic Train.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
titanic_df <- data.frame(titanic)
train <- titanic_df
After this initial setup, we then explored its dimensions and variables.
class(train)
## [1] "data.frame"
nrow(train)
## [1] 792
ncol(train)
## [1] 17
names(train)
## [1] "X" "PassengerId" "Survived" "Sex"
## [5] "Age" "Fare" "Pclass_1" "Pclass_2"
## [9] "Pclass_3" "Family_size" "Married.Man" "Married.Woman"
## [13] "Single.Man" "Single.Woman" "Chebourg" "Queenstown"
## [17] "Southampton"
head(train)
## X PassengerId Survived Sex Age Fare Pclass_1 Pclass_2 Pclass_3
## 1 0 1 0 1 0.2750 0.01415106 0 0 1
## 2 1 2 1 0 0.4750 0.13913574 1 0 0
## 3 2 3 1 0 0.3250 0.01546857 0 0 1
## 4 3 4 1 0 0.4375 0.10364430 1 0 0
## 5 4 5 0 1 0.4375 0.01571255 0 0 1
## 6 5 6 0 1 0.3500 0.01650950 0 0 1
## Family_size Married.Man Married.Woman Single.Man Single.Woman Chebourg
## 1 0.1 1 0 0 0 0
## 2 0.1 1 0 0 0 1
## 3 0.0 0 0 0 1 0
## 4 0.1 1 0 0 0 0
## 5 0.0 1 0 0 0 0
## 6 0.0 1 0 0 0 0
## Queenstown Southampton
## 1 0 1
## 2 0 0
## 3 0 1
## 4 0 1
## 5 0 1
## 6 1 0
tail(train)
## X PassengerId Survived Sex Age Fare Pclass_1 Pclass_2 Pclass_3
## 787 786 787 1 0 0.2250 0.01463083 0 0 1
## 788 787 788 0 1 0.1000 0.05684821 0 0 1
## 789 788 789 1 1 0.0125 0.04015973 0 0 1
## 790 789 790 0 1 0.5750 0.15458811 1 0 0
## 791 790 791 0 1 0.3500 0.01512699 0 0 1
## 792 791 792 0 1 0.2000 0.05074862 0 1 0
## Family_size Married.Man Married.Woman Single.Man Single.Woman Chebourg
## 787 0.0 0 0 0 1 0
## 788 0.5 0 0 1 0 0
## 789 0.3 0 0 1 0 0
## 790 0.0 1 0 0 0 1
## 791 0.0 1 0 0 0 0
## 792 0.0 1 0 0 0 0
## Queenstown Southampton
## 787 0 1
## 788 1 0
## 789 0 1
## 790 0 0
## 791 1 0
## 792 0 1
With this data set, we wanted to determine the best variables for predicting whether a passenger would survive the voyage.
We first did some exploratory analysis on the data.
First, we looked at the distribution of family size among the passengers. Here is the code and output.
hist(train$Family_size, col = "blue", breaks = 15)
With this histogram, we can see that the normalized family size for the majority of passengers is less than 0.20. This means there is a great disparity between the number of passengers traveling with few or no family members and the number traveling with many.
Next, we wanted to check the fares that the passengers paid.
hist(train$Fare, col = "green", breaks = 15)
Again, the data show that the normalized fare for the majority of passengers is below 0.20. This means that most passengers paid a small fare compared with the minority who paid a much larger one. This makes sense, as the disparity in price and passenger numbers between first, second, and third class was pronounced on the Titanic’s maiden voyage.
With some of the exploratory analysis finished, it is time for us to ask some questions. For this data set, we chose three:
1. Did men survive at a higher or lower rate than women?
2. Is passenger class a good predictor of survival?
3. Is a larger family size better for survival?
For the first question, we feel a good starting point is to plot survival by Sex, to see if there is any correlation.
p <- ggplot(train) + aes(Sex,Survived)
p + geom_jitter()
This chart compares survival (1 = survived, 0 = did not survive) for males (1) and females (0). It shows that, among men, those who died outnumber those who survived, while among women the survivors outnumber those who died. This suggests that women most likely had a higher survival rate than men. However, more analysis is needed on whether being single or married matters for survival.
For the next question, we want to know whether a passenger’s class affected their survival rate. We believe a good start is to look at the passengers in each class and their survival.
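To put a number on what the jitter plot suggests, the survival rate within each sex can be computed directly. The sketch below uses a small made-up data frame (`toy`) instead of the real `train` data, with the same coding as above (Sex: 1 = male, 0 = female). The values are illustrative assumptions only, not results from the Titanic data.

```r
# Hypothetical toy data with the same coding as `train`
# (Sex: 1 = male, 0 = female; Survived: 1 = yes, 0 = no).
toy <- data.frame(
  Sex      = c(1, 1, 1, 1, 0, 0, 0, 0),
  Survived = c(0, 0, 0, 1, 1, 1, 1, 0)
)

# Because Survived is coded 0/1, its mean within a group
# is that group's survival rate.
rate_by_sex <- tapply(toy$Survived, toy$Sex, mean)
print(rate_by_sex)
```

On the real data, the equivalent call would be `tapply(train$Survived, train$Sex, mean)`.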
p <- ggplot(train) + aes(Pclass_1,Survived)
p + geom_jitter()
p2 <- ggplot(train) + aes(Pclass_2,Survived)
p2 + geom_jitter()
p3 <- ggplot(train) + aes(Pclass_3,Survived)
p3 + geom_jitter()
After plotting the three charts, we can see that there is some difference in survival rate between the three classes. With further analysis, we may be able to raise our confidence that passenger class is an accurate predictor of survival.
Our final question asks whether a larger family size increased a passenger’s chance of survival. The best way to start answering this question is to compare survival with family size.
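One possible next step is to collapse the three indicator columns into a single class variable and compare survival rates side by side. The sketch below does this on a made-up data frame (`toy`) that only mimics the column layout of `train`. The helper column `Pclass` and all values are assumptions for illustration.

```r
# Hypothetical toy data mimicking the one-hot class columns in `train`.
toy <- data.frame(
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1),
  Pclass_1 = c(1, 1, 0, 0, 0, 0, 0, 0),
  Pclass_2 = c(0, 0, 1, 1, 0, 0, 0, 0),
  Pclass_3 = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# For each row, find which indicator column holds the 1,
# then label the result as a single factor.
class_cols <- c("Pclass_1", "Pclass_2", "Pclass_3")
class_index <- max.col(as.matrix(toy[class_cols]), ties.method = "first")
toy$Pclass <- factor(class_index, levels = 1:3, labels = c("1st", "2nd", "3rd"))

# Mean of the 0/1 Survived column per class = survival rate per class.
rate_by_class <- tapply(toy$Survived, toy$Pclass, mean)
print(rate_by_class)
```

A single class column also makes it possible to draw one plot instead of three.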
p4 <- ggplot(train) + aes(Family_size,Survived)
p4 + geom_jitter()
This chart is very interesting. It shows that the passengers with the most family members tended not to survive the voyage. However, more analysis is needed to answer our question. Comparing family size with passenger class might paint a better picture for our analysis.
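As a sketch of how that further analysis might begin, family size can be binned into a few groups and the survival rate computed per group. The data frame below is made up, and the cut points (0 and 0.25 on the normalized scale) and group labels are arbitrary assumptions chosen only for illustration.

```r
# Hypothetical toy data; Family_size is on the same normalized scale as `train`.
toy <- data.frame(
  Family_size = c(0.0, 0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 0.6),
  Survived    = c(0,   1,   1,   1,   1,   0,   0,   0)
)

# Bin family size into three groups (cut points are assumptions).
# Intervals are right-closed: (-Inf, 0], (0, 0.25], (0.25, Inf).
toy$Family_group <- cut(
  toy$Family_size,
  breaks = c(-Inf, 0, 0.25, Inf),
  labels = c("alone", "small", "large")
)

# Survival rate within each family-size group.
rate_by_family <- tapply(toy$Survived, toy$Family_group, mean)
print(rate_by_family)
```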
Data Set 2
library(tidyverse)
library(dplyr)
pdi = read.csv("PDI__Police_Data_Initiative__Police_Calls_for_Service__CAD_.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
pdi <- pdi[!(is.na(pdi$LATITUDE_X ) |
is.na(pdi$LONGITUDE_X)|
is.na(pdi$DISTRICT)|
is.na(pdi$PRIORITY)), ]
summary(Filter(is.numeric, pdi))
## LATITUDE_X LONGITUDE_X PRIORITY DISTRICT
## Min. :39.05 Min. :-84.71 Min. : 1.00 Min. :1.000
## 1st Qu.:39.11 1st Qu.:-84.55 1st Qu.: 7.00 1st Qu.:2.000
## Median :39.13 Median :-84.52 Median :18.00 Median :3.000
## Mean :39.14 Mean :-84.52 Mean :18.94 Mean :3.047
## 3rd Qu.:39.15 3rd Qu.:-84.49 3rd Qu.:29.00 3rd Qu.:4.000
## Max. :39.22 Max. :-84.37 Max. :35.00 Max. :5.000
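The row filter above can also be written with `complete.cases()`, which avoids chaining several `is.na()` calls. The sketch below shows the idea on a made-up data frame (`toy`) that borrows four column names from `pdi`. The values and the `required` vector are assumptions for illustration.

```r
# Hypothetical toy data with some missing values.
toy <- data.frame(
  LATITUDE_X  = c(39.11, NA, 39.13, 39.19, 39.12),
  LONGITUDE_X = c(-84.51, -84.52, -84.52, NA, -84.55),
  DISTRICT    = c(1L, 1L, 5L, 4L, 3L),
  PRIORITY    = c(27L, 35L, 35L, 18L, NA),
  AGENCY      = c("CPD", "CPD", NA, "CPD", "CPD")
)

# Keep only rows where every required column is present.
# Missing values in other columns (here AGENCY) do not cause a row to be dropped.
required <- c("LATITUDE_X", "LONGITUDE_X", "DISTRICT", "PRIORITY")
cleaned <- toy[complete.cases(toy[, required]), ]
print(cleaned)
```

Restricting the check to the required columns matters here: calling `complete.cases()` on the whole data frame would also drop rows that are only missing optional fields.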
This is what the data looks like after removing the rows with missing latitude, longitude, district, or priority values.
str(pdi)
## 'data.frame': 779970 obs. of 19 variables:
## $ ADDRESS_X : Factor w/ 49869 levels "**SECRET** 200 W NORTH BEND",..: 4088 46524 6237 45838 35475 47733 11839 23819 1465 12652 ...
## $ LATITUDE_X : num 39.1 39.1 39.1 39.2 39.1 ...
## $ LONGITUDE_X : num -84.5 -84.5 -84.5 -84.5 -84.5 ...
## $ AGENCY : Factor w/ 2 levels "CP","CPD": 2 2 2 2 2 2 2 2 2 2 ...
## $ CREATE_TIME_INCIDENT : Factor w/ 808959 levels "1/1/2015 0:01",..: 734962 735126 734571 734721 734591 734719 734853 734818 734873 734480 ...
## $ DISPOSITION_TEXT : Factor w/ 1041 levels "26: AVAIL/DETAIL COMPLETED",..: 410 838 330 771 693 838 642 936 771 838 ...
## $ EVENT_NUMBER : Factor w/ 1048572 levels "CPD170124000097",..: 318203 317011 317524 317799 317561 317795 317997 316799 318033 317337 ...
## $ INCIDENT_TYPE_ID : Factor w/ 317 levels "1","911CALL",..: 63 161 13 132 92 63 169 67 92 105 ...
## $ INCIDENT_TYPE_DESC : Factor w/ 258 levels "911 CALL EMERGENCY INDICATED, NO SPECIFICS",..: 72 150 11 NA 101 72 215 76 101 116 ...
## $ PRIORITY : int 27 35 35 18 18 27 10 12 18 29 ...
## $ PRIORITY_COLOR : Factor w/ 4 levels "BLUE","PURPLE",..: 2 2 2 NA 1 2 1 1 1 2 ...
## $ ARRIVAL_TIME_PRIMARY_UNIT : Factor w/ 539247 levels "1/1/2015 0:01",..: 489167 489293 NA NA 488913 488955 NA 489043 489084 488778 ...
## $ CLOSED_TIME_INCIDENT : Factor w/ 791474 levels "1/1/2015 0:01",..: 719463 719172 719045 719203 719149 719214 719339 719471 719346 719094 ...
## $ DISPATCH_TIME_PRIMARY_UNIT : Factor w/ 595858 levels "1/1/2015 0:25",..: 540858 540990 NA 540647 540573 540645 NA 540734 540776 540445 ...
## $ BEAT : Factor w/ 167 levels "CLR","OTHER COUNTY",..: 14 4 124 117 76 90 28 108 130 62 ...
## $ COMMUNITY_COUNCIL_NEIGHBORHOOD: Factor w/ 71 levels "AVONDALE","AVONDALE - NORTH AVONDALE",..: 50 68 18 6 62 67 27 58 5 22 ...
## $ DISTRICT : int 1 1 5 4 3 4 2 4 5 3 ...
## $ SNA_NEIGHBORHOOD : Factor w/ 51 levels "AVONDALE","BOND HILL",..: 35 48 10 5 44 47 17 40 4 13 ...
## $ CPD_NEIGHBORHOOD : Factor w/ 54 levels "AVONDALE","BONDHILL",..: 38 51 18 6 48 50 17 44 5 14 ...
summary(pdi)
## ADDRESS_X LATITUDE_X LONGITUDE_X AGENCY
## 20XX RADCLIFF DR : 38768 Min. :39.05 Min. :-84.71 CP :105078
## 3XX EZZARD CHARLES DR: 8599 1st Qu.:39.11 1st Qu.:-84.55 CPD:674892
## 23XX FERGUSON RD : 8168 Median :39.13 Median :-84.52
## 41XX READING RD : 6236 Mean :39.14 Mean :-84.52
## 17XX REPUBLIC ST : 5236 3rd Qu.:39.15 3rd Qu.:-84.49
## (Other) :712912 Max. :39.22 Max. :-84.37
## NA's : 51
## CREATE_TIME_INCIDENT DISPOSITION_TEXT
## 8/3/2020 17:00 : 9 NTR: NOTHING TO REPORT:127820
## 9/1/2020 19:11 : 9 ADV:ADVISED : 84266
## 9/25/2020 14:54 : 9 INV: INV : 82925
## 10/1/2020 17:03 : 8 CAN:CANCEL : 80887
## 10/15/2020 18:14: 8 ADVISED : 48083
## 10/31/2020 21:16: 8 (Other) :340895
## (Other) :779919 NA's : 15094
## EVENT_NUMBER INCIDENT_TYPE_ID INCIDENT_TYPE_DESC
## CPD180306001717: 2 DIRPAT : 68919 DIRECTED PATROL - VEHICLE: 61094
## CPD200327001402: 2 ADV : 62015 ADVISED INCIDENT : 53039
## CPD200424001443: 2 SDET : 51780 OFF DUTY POLICE DETAILS : 41634
## CPD170124000097: 1 CELL : 35663 STATION RUN : 33763
## CPD170124000102: 1 REPO : 33093 CELL DISCON OR SICAL : 33504
## CPD170124000111: 1 911CALL: 32837 (Other) :482251
## (Other) :779961 (Other):495663 NA's : 74685
## PRIORITY PRIORITY_COLOR ARRIVAL_TIME_PRIMARY_UNIT
## Min. : 1.00 BLUE :181210 8/8/2019 16:31 : 10
## 1st Qu.: 7.00 PURPLE:344862 2/29/2020 0:26 : 9
## Median :18.00 RED : 5608 11/7/2020 19:01: 8
## Mean :18.94 YELLOW: 68527 5/29/2020 13:55: 8
## 3rd Qu.:29.00 NA's :179763 6/27/2020 12:53: 8
## Max. :35.00 (Other) :496911
## NA's :283016
## CLOSED_TIME_INCIDENT DISPATCH_TIME_PRIMARY_UNIT BEAT
## 8/20/2020 10:05 : 14 7/30/2020 11:07 : 8 P321 : 60540
## 6/2/2020 0:00 : 10 8/14/2019 13:08 : 7 P331 : 59237
## 10/31/2020 13:59: 9 8/16/2020 13:46 : 7 P131 : 47581
## 7/5/2020 1:01 : 9 8/23/2020 9:31 : 7 P011 : 46543
## 8/23/2020 2:24 : 9 10/11/2020 16:25: 6 P121 : 45661
## (Other) :762211 (Other) :550425 (Other):518524
## NA's : 17708 NA's :229510 NA's : 1884
## COMMUNITY_COUNCIL_NEIGHBORHOOD DISTRICT SNA_NEIGHBORHOOD
## WESTWOOD : 59587 Min. :1.000 EAST PRICE HILL: 78488
## SOUTH FAIRMOUNT: 49549 1st Qu.:2.000 WESTWOOD : 59135
## DOWNTOWN : 46587 Median :3.000 DOWNTOWN : 47449
## OTR : 45507 Mean :3.047 OVER-THE-RHINE : 44872
## WEST END : 42275 3rd Qu.:4.000 AVONDALE : 44849
## AVONDALE : 36460 Max. :5.000 WEST END : 39579
## (Other) :500005 (Other) :465598
## CPD_NEIGHBORHOOD
## WESTWOOD : 58720
## SOUTH FAIRMOUNT : 51491
## C. B. D. / RIVERFRONT: 50962
## OVER-THE-RHINE : 45766
## WEST END : 39052
## EAST PRICE HILL : 33615
## (Other) :500364
class(pdi)
## [1] "data.frame"
head(pdi)
## ADDRESS_X LATITUDE_X LONGITUDE_X AGENCY
## 1 16XX WALNUT ST 39.11293 -84.51470 CPD
## 2 W 7TH ST / CENTRAL AV 39.10219 -84.51901 CPD
## 3 1XX W MCMILLAN ST 39.12864 -84.51835 CPD
## 4 VINE ST / E 70TH ST 39.19225 -84.48156 CPD
## 5 HARRISON AV / QUEEN CITY AV 39.12569 -84.54933 CPD
## 6 WAYNE ST / MAY ST 39.12458 -84.49463 CPD
## CREATE_TIME_INCIDENT DISPOSITION_TEXT EVENT_NUMBER INCIDENT_TYPE_ID
## 1 8/7/2019 23:29 ARR: ARREST CPD190807001817 DIRPAT
## 2 8/7/2019 9:23 NTR: NOTHING TO REPORT CPD190807000383 SDET
## 3 8/7/2019 15:06 ADV:ADVISED CPD190807001005 ADV
## 4 8/7/2019 18:19 INV: INV CPD190807001337 PERDWP-COMBINED
## 5 8/7/2019 15:35 GOA: GOA CPD190807001056 HAZARD
## 6 8/7/2019 18:17 NTR: NOTHING TO REPORT CPD190807001333 DIRPAT
## INCIDENT_TYPE_DESC PRIORITY PRIORITY_COLOR
## 1 DIRECTED PATROL - VEHICLE 27 PURPLE
## 2 OFF DUTY POLICE DETAILS 35 PURPLE
## 3 ADVISED INCIDENT 35 PURPLE
## 4 <NA> 18 <NA>
## 5 HAZARD TO TRAFFIC/PEDESTRIAN 18 BLUE
## 6 DIRECTED PATROL - VEHICLE 27 PURPLE
## ARRIVAL_TIME_PRIMARY_UNIT CLOSED_TIME_INCIDENT DISPATCH_TIME_PRIMARY_UNIT
## 1 8/7/2019 23:29 8/7/2019 23:53 8/7/2019 23:29
## 2 8/7/2019 9:24 8/7/2019 17:40 8/7/2019 9:24
## 3 <NA> 8/7/2019 15:06 <NA>
## 4 <NA> 8/7/2019 18:20 8/7/2019 18:19
## 5 8/7/2019 17:12 8/7/2019 17:13 8/7/2019 16:37
## 6 8/7/2019 18:17 8/7/2019 18:32 8/7/2019 18:17
## BEAT COMMUNITY_COUNCIL_NEIGHBORHOOD DISTRICT SNA_NEIGHBORHOOD
## 1 P121 OTR 1 OVER-THE-RHINE
## 2 P011 WEST END 1 WEST END
## 3 P511 CUF - HEIGHTS 5 CUF
## 4 P461 CARTHAGE 4 CARTHAGE
## 5 P341 SOUTH FAIRMOUNT 3 SOUTH FAIRMOUNT
## 6 P421 WALNUT HILLS 4 WALNUT HILLS
## CPD_NEIGHBORHOOD
## 1 OVER-THE-RHINE
## 2 WEST END
## 3 FAIRVIEW
## 4 CARTHAGE
## 5 SOUTH FAIRMOUNT
## 6 WALNUT HILLS
hist(pdi$PRIORITY, col = "green", breaks = 20)
plot(density(pdi$PRIORITY))
hist(pdi$PRIORITY, probability = TRUE, col = "green", breaks = 20, main = "PRIORITY", xlim = c(1, 45), xlab = "PRIORITY")
lines(density(pdi$PRIORITY), col = "red", lwd = 2)
abline(v = mean(pdi$PRIORITY), col = "blue", lty = 2, lwd = 1.5)
boxplot(pdi$PRIORITY)
The box plot shows the median of PRIORITY, the spread of the middle half of the values (the interquartile range), and how far the extremes lie from the median.
plot(pdi$PRIORITY ~ pdi$DISTRICT, xlab = "DISTRICT", ylab = "PRIORITY",
main = "PRIORITY VS DISTRICT")
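The numbers behind a box plot can be inspected with `boxplot.stats()`, which returns the whisker ends, hinges, and median that `boxplot()` draws. The sketch below uses a short made-up vector, not the real `pdi$PRIORITY` column.

```r
# Hypothetical priority values for illustration only.
priority <- c(1, 4, 7, 7, 10, 18, 21, 27, 29, 35, 35)

# $stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker.
# $out holds any points beyond 1.5 * IQR from the hinges (drawn as outliers).
bp <- boxplot.stats(priority)
print(bp$stats)
print(bp$out)

# The interquartile range is the width of the box.
iqr <- IQR(priority)
print(iqr)
```

On the real data, the equivalent call would be `boxplot.stats(pdi$PRIORITY)`.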
xtabs(~ PRIORITY + AGENCY, data = pdi)
## AGENCY
## PRIORITY CP CPD
## 1 432 7055
## 2 6846 22308
## 3 470 18222
## 4 11565 35198
## 5 5815 8634
## 6 12575 10874
## 7 60659 17762
## 8 6710 3652
## 9 6 7545
## 10 0 39260
## 11 0 21107
## 12 0 38662
## 13 0 8881
## 14 0 18946
## 15 0 12
## 16 0 3175
## 17 0 77
## 18 0 31965
## 19 0 296
## 20 0 436
## 21 0 26267
## 22 0 6805
## 23 0 381
## 24 0 3108
## 25 0 29803
## 26 0 2290
## 27 0 64542
## 28 0 5221
## 29 0 54726
## 30 0 306
## 31 0 32611
## 32 0 19889
## 33 0 455
## 34 0 2021
## 35 0 132400
plot(xtabs(~ PRIORITY + AGENCY, data = pdi), main="Agency vs level of priority")
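The table and mosaic plot show how the calls at each priority level are split between the two agencies. CP appears only at priority levels 1 through 9, while levels 10 through 35 are handled exclusively by CPD.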
crosstabs <- xtabs(~ PRIORITY + AGENCY, data = pdi)
prop.table(crosstabs, 1)
## AGENCY
## PRIORITY CP CPD
## 1 0.0577000134 0.9422999866
## 2 0.2348219798 0.7651780202
## 3 0.0251444468 0.9748555532
## 4 0.2473109082 0.7526890918
## 5 0.4024499965 0.5975500035
## 6 0.5362702034 0.4637297966
## 7 0.7735045460 0.2264954540
## 8 0.6475583864 0.3524416136
## 9 0.0007945967 0.9992054033
## 10 0.0000000000 1.0000000000
## 11 0.0000000000 1.0000000000
## 12 0.0000000000 1.0000000000
## 13 0.0000000000 1.0000000000
## 14 0.0000000000 1.0000000000
## 15 0.0000000000 1.0000000000
## 16 0.0000000000 1.0000000000
## 17 0.0000000000 1.0000000000
## 18 0.0000000000 1.0000000000
## 19 0.0000000000 1.0000000000
## 20 0.0000000000 1.0000000000
## 21 0.0000000000 1.0000000000
## 22 0.0000000000 1.0000000000
## 23 0.0000000000 1.0000000000
## 24 0.0000000000 1.0000000000
## 25 0.0000000000 1.0000000000
## 26 0.0000000000 1.0000000000
## 27 0.0000000000 1.0000000000
## 28 0.0000000000 1.0000000000
## 29 0.0000000000 1.0000000000
## 30 0.0000000000 1.0000000000
## 31 0.0000000000 1.0000000000
## 32 0.0000000000 1.0000000000
## 33 0.0000000000 1.0000000000
## 34 0.0000000000 1.0000000000
## 35 0.0000000000 1.0000000000
Each row now sums to 1, so the entries are the proportion of calls at that priority level handled by each agency.
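The effect of the second argument to `prop.table()` can be checked on a small table. The sketch below rebuilds the first three rows of the count table above by hand (the object names `counts`, `row_props`, and `col_props` are ours) and compares row-wise with column-wise proportions.

```r
# First three rows of the PRIORITY x AGENCY count table shown earlier.
counts <- matrix(
  c(432,  7055,
    6846, 22308,
    470,  18222),
  nrow = 3, byrow = TRUE,
  dimnames = list(PRIORITY = c("1", "2", "3"), AGENCY = c("CP", "CPD"))
)

# margin = 1 divides each cell by its row total:
# the share of each agency within a priority level.
row_props <- prop.table(counts, margin = 1)
print(rowSums(row_props))

# margin = 2 divides each cell by its column total:
# the share of each priority level within an agency.
col_props <- prop.table(counts, margin = 2)
print(colSums(col_props))
```

Which margin to use depends on the question: row proportions answer "who handles calls at this priority?", while column proportions answer "what priorities does this agency handle?".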