1.0 Warming Up.

In this project I will be using the tidyr, dplyr, and stringr packages extensively.

library(tidyr)
library(dplyr)
library(stringr)

I will be looking into following three datasets:

  1. Transit Funding Time Series
  2. NYPD Motor Vehicle Collisions
  3. Global Shack Attack Data

2.0 Transit Funding Time Series

The original dataset can be found here: https://www.transit.dot.gov/ntd/data-product/ts11-total-funding-time-series-2

This is the Total Funding Summary for transit agencies in United States. It includes the total dollar spent annually reported by each transit agency from 1991 to 2016.

2.1 Cleaning Data

First I load the data.

theURL1 <- "https://raw.githubusercontent.com/Tyllis/Data607/master/TS1.1TimeSeriesOpCapFundingSummary_3.csv"
t.data <- read.csv(theURL1, stringsAsFactors = FALSE)
dimbfr <- dim(t.data) 
str(t.data)
## 'data.frame':    1210 obs. of  44 variables:
##  $ ï..Last.Report.Year: int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ X5.Digit.NTDID     : int  1 2 3 5 6 7 8 11 12 16 ...
##  $ X4.Digit.NTDID     : chr  "0001" "0002" "0003" "0005" ...
##  $ Agency.Name        : chr  "King County Department of Transportation - Metro Transit Division(King County Metro)" "Spokane Transit Authority(STA)" "Pierce County Transportation Benefit Area Authority(Pierce Transit)" "Everett Transit(ET)" ...
##  $ Agency.Status      : chr  "Active" "Active" "Active" "Active" ...
##  $ Reporter.Type      : chr  "Full Reporter" "Full Reporter" "Full Reporter" "Full Reporter" ...
##  $ City               : chr  "Seattle" "Spokane" "Tacoma" "Everett" ...
##  $ State              : chr  "WA" "WA" "WA" "WA" ...
##  $ Census.Year        : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Primary.UZA.Name   : chr  "Seattle, WA" "Spokane, WA" "Seattle, WA" "Seattle, WA" ...
##  $ UZA                : int  14 96 14 14 248 151 24 108 149 431 ...
##  $ UZA.Area.SQ.Miles  : chr  "1,010" "164" "1,010" "1,010" ...
##  $ UZA.Population     : chr  "3,059,393" "387,847" "3,059,393" "3,059,393" ...
##  $ X2016.Status       : chr  "Existing 2016" "Existing 2016" "Existing 2016" "Existing 2016" ...
##  $ X1991              : chr  "$383,107,583" "$27,042,832" "$54,776,294" "$6,604,377" ...
##  $ X1992              : chr  "$329,694,739" "$27,976,993" "$64,034,797" "$6,939,479" ...
##  $ X1993              : chr  "$275,639,731" "$29,374,328" "$44,060,100" "$9,082,327" ...
##  $ X1994              : chr  "$277,424,929" "$42,316,768" "$52,209,179" "$8,126,093" ...
##  $ X1995              : chr  "$289,759,749" "$37,303,829" "$58,145,467" "$6,433,211" ...
##  $ X1996              : chr  "$342,032,852" "$32,192,970" "$50,672,958" "$8,593,023" ...
##  $ X1997              : chr  "$399,974,914" "$40,737,394" "$63,142,957" "$9,712,609" ...
##  $ X1998              : chr  "$367,127,087" "$33,591,348" "$69,330,047" "$8,266,938" ...
##  $ X1999              : chr  "$483,918,410" "$34,253,966" "$75,940,929" "$15,969,679" ...
##  $ X2000              : chr  "$416,810,718" "$35,204,499" "$83,031,163" "$19,554,697" ...
##  $ X2001              : chr  "$400,536,050" "$37,802,003" "$62,516,343" "$31,962,983" ...
##  $ X2002              : chr  "$493,524,290" "$39,682,730" "$75,453,800" "$18,290,801" ...
##  $ X2003              : chr  "$506,818,105" "$45,535,339" "$74,596,167" "$12,947,676" ...
##  $ X2004              : chr  "$651,927,092" "$40,763,480" "$97,130,128" "$11,608,063" ...
##  $ X2005              : chr  "$509,808,277" "$48,952,703" "$103,805,219" "$14,806,506" ...
##  $ X2006              : chr  "$535,030,832" "$56,450,544" "$101,860,747" "$18,193,940" ...
##  $ X2007              : chr  "$572,783,766" "$64,963,968" "$115,427,096" "$21,030,199" ...
##  $ X2008              : chr  "$651,636,069" "$73,982,401" "$132,252,330" "$20,090,792" ...
##  $ X2009              : chr  "$668,646,858" "$66,417,392" "$121,813,470" "$25,076,998" ...
##  $ X2010              : chr  "$669,733,012" "$67,478,671" "$132,109,408" "$21,677,219" ...
##  $ X2011              : chr  "$867,772,522" "$64,728,901" "$130,505,371" "$22,596,747" ...
##  $ X2012              : chr  "$817,791,736" "$70,077,825" "$117,485,734" "$21,626,515" ...
##  $ X2013              : chr  "$762,388,692" "$65,295,886" "$115,083,767" "$25,333,502" ...
##  $ X2014              : chr  "$782,670,566" "$70,699,557" "$128,861,135" "$20,388,976" ...
##  $ X2015              : chr  "$831,702,187" "$72,960,135" "$132,886,339" "$20,726,374" ...
##  $ X2016              : chr  "$988,752,004" "$73,146,122" "$142,349,888" "$21,506,189" ...
##  $ X                  : logi  NA NA NA NA NA NA ...
##  $ X.1                : logi  NA NA NA NA NA NA ...
##  $ X.2                : logi  NA NA NA NA NA NA ...
##  $ X.3                : logi  NA NA NA NA NA NA ...

Here are some problems that needed to be fixed before the data.frame can be used for analysis:

  1. The table contains some redundant columns, “X” thru “X.3”.
  2. The table contains some redundant rows at the end.
  3. The time series data is in “wide” format.
  4. Dollar amounts and the years are stored as strings.

Also the first column name needs to be corrected.

Below, I used select to remove the redundant columns, gather to stack the dollar amount data, str_replace_all to remove the “$” and “,”, and as.numeric to turn string into numeric.

names(t.data)[1] <- "Last.Report.Year"
t.data <- t.data %>% 
  select(Last.Report.Year : X2016) %>% 
  filter(!is.na(Last.Report.Year)) %>% 
  gather("Year", "Amount", X1991 : X2016)
t.data$Amount <- t.data$Amount %>% 
  str_replace_all(pattern = "[$,]", replacement = "") %>% 
  as.numeric()
t.data$Year <- t.data$Year %>% 
  str_replace_all(pattern = "X", replacement = "") %>% 
  as.numeric()
dimaft <- dim(t.data) 
str(t.data)
## 'data.frame':    31278 obs. of  16 variables:
##  $ Last.Report.Year : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ X5.Digit.NTDID   : int  1 2 3 5 6 7 8 11 12 16 ...
##  $ X4.Digit.NTDID   : chr  "0001" "0002" "0003" "0005" ...
##  $ Agency.Name      : chr  "King County Department of Transportation - Metro Transit Division(King County Metro)" "Spokane Transit Authority(STA)" "Pierce County Transportation Benefit Area Authority(Pierce Transit)" "Everett Transit(ET)" ...
##  $ Agency.Status    : chr  "Active" "Active" "Active" "Active" ...
##  $ Reporter.Type    : chr  "Full Reporter" "Full Reporter" "Full Reporter" "Full Reporter" ...
##  $ City             : chr  "Seattle" "Spokane" "Tacoma" "Everett" ...
##  $ State            : chr  "WA" "WA" "WA" "WA" ...
##  $ Census.Year      : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Primary.UZA.Name : chr  "Seattle, WA" "Spokane, WA" "Seattle, WA" "Seattle, WA" ...
##  $ UZA              : int  14 96 14 14 248 151 24 108 149 431 ...
##  $ UZA.Area.SQ.Miles: chr  "1,010" "164" "1,010" "1,010" ...
##  $ UZA.Population   : chr  "3,059,393" "387,847" "3,059,393" "3,059,393" ...
##  $ X2016.Status     : chr  "Existing 2016" "Existing 2016" "Existing 2016" "Existing 2016" ...
##  $ Year             : num  1991 1991 1991 1991 1991 ...
##  $ Amount           : num  3.83e+08 2.70e+07 5.48e+07 6.60e+06 4.23e+06 ...

Notice that the Amount are large numbers. I created a column to scale the amount in millions.

t.data$Amount.in.Mil <- t.data$Amount/1000000

The data is now in its tidy form. Each row is a observation. Each column is a variable.

2.2 Analyze Data

First I would like to find out how many transit agencies are in the NY and NJ states.

nynj.data <- filter(t.data, State %in% c("NY", "NJ"))
nynj <- table(nynj.data$State) / 26
nynj
## 
## NJ NY 
## 47 81

Note that because I stack the time series over 26 years (2016-1991+1 = 26 columns), each original row is stacked 26 times. So I divide the counts by 26.

It looks like NY has 81 agencies and NJ has 47 agencies.

I want to do some analysis on NYC MTA. But it looks like MTA is consisted of several sub-agencies and they are reported separately in the data base. First I would like to find out how many sub-agencies are there for MTA. I use grep to locate names that has “MTA” in it.

mta.names <- nynj.data$Agency.Name %>% 
  grep(pattern = "MTA", value = TRUE) %>% 
  unique()
mta.names
## [1] "Metropolitan Suburban Bus Authority, dba: MTA Long Island Bus(MTA Long Island Bus)"     
## [2] "MTA New York City Transit(NYCT)"                                                        
## [3] "Metro-North Commuter Railroad Company, dba: MTA Metro-North Railroad(MTA-MNCR)"         
## [4] "Staten Island Rapid Transit Operating Authority, dba: MTA Staten Island Railway(SIRTOA)"
## [5] "MTA Long Island Rail Road(MTA LIRR)"                                                    
## [6] "MTA Bus Company(MTABUS)"

So there are 6 agencies in MTA. I now extract the MTA agencies and perform some analysis. Here, I use group_by and summarise.

mta.data <- filter(nynj.data, Agency.Name %in% mta.names)
sum.mta <- mta.data %>% 
  group_by(Year) %>% 
  summarise(Total = sum(Amount.in.Mil, na.rm = TRUE))
plot(x = sum.mta$Year, y = sum.mta$Total, xlab = "Year" , ylab = "Amount in Millions", type = "l", main = "MTA Total Spending")

Looks like MTA spending has been increasing steadily.

Next I would like to see a bar plot, which should show break down of spending among the sub-agencies. But first I have to build a “wide” format table. To use barplot, the years must be in columns, so I must spread the data.

Also, I have to do something with the MTA agency names. They are way too long. I just need the agency abbreviation inside the ().

mta.data$Agency.Name <- mta.data$Agency.Name %>% 
  str_extract_all(pattern = "\\([[:alpha:]- ]+\\)") %>% 
  str_extract_all(pattern = "[[:alpha:]- ]+") %>% 
  unlist()
sum2.mta <- mta.data %>% 
  group_by(Agency.Name, Year) %>% 
  summarise(Total = sum(Amount.in.Mil, na.rm = TRUE)) %>% 
  spread(Year, Total)
sum2.mta
## # A tibble: 6 x 27
## # Groups:   Agency.Name [6]
##           Agency.Name     `1991`     `1992`     `1993`     `1994`
## *               <chr>      <dbl>      <dbl>      <dbl>      <dbl>
## 1            MTA-MNCR  665.33515  628.27748  539.57303  633.98635
## 2            MTA LIRR  774.32193  837.58532  801.45880  863.04390
## 3 MTA Long Island Bus   70.61361   62.39569   62.42088   60.10073
## 4              MTABUS    0.00000    0.00000    0.00000    0.00000
## 5                NYCT 4781.66803 4098.92414 4182.69142 4251.42833
## 6              SIRTOA   21.95250   22.68951   23.25011   23.52328
## # ... with 22 more variables: `1995` <dbl>, `1996` <dbl>, `1997` <dbl>,
## #   `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>,
## #   `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>,
## #   `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
## #   `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>

Now I can build the barplot.

colors <- c("blue", "green", "red", "yellow", "purple", "pink")
Agency <- sum2.mta$Agency.Name
lgspec <- list(x = "topleft", bty = "n")
barplot(as.matrix(sum2.mta), legend.text = Agency, col = colors, xlab = "Year", ylab = "Amount in Millions", main = "MTA Spending Breakdown", args.legend = lgspec, las = 3)
## Warning in apply(height, 2L, cumsum): NAs introduced by coercion

As suspected, NYCT is driving the majority of the spending growth. New York City Transit is the MTA agency running the NYC Subway and Bus. That explains why they keep raising the fares. /eyeroll

3.0 NYPD Motor Vehicle Collisions

This data contains detail records of vehicle collisions in New York City reported by the New York Police Department.

The original dataset can be found here: https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95

The original dataset has more than 1 million rows. The entire csv file is 245 MB. For this project, to save space on Github and avoid crashing R, I randomly sampled 1/100 of the data for analysis.

The subsetted data can be found here: https://raw.githubusercontent.com/Tyllis/Data607/master/NYPD_Motor_Vehicle_Collisions_subset.csv

theURL2 <- "https://raw.githubusercontent.com/Tyllis/Data607/master/NYPD_Motor_Vehicle_Collisions_subset.csv"
c.data <- read.csv(theURL2, stringsAsFactors = F)
str(c.data)
## 'data.frame':    11255 obs. of  30 variables:
##  $ ORI.ROW.NUM                  : int  594464 516462 1064397 402796 772612 798178 530865 230645 1091157 435177 ...
##  $ DATE                         : chr  "02/09/2015" "06/22/2015" "10/19/2012" "12/22/2015" ...
##  $ TIME                         : chr  "16:49" "18:20" "19:50" "19:25" ...
##  $ BOROUGH                      : chr  "BROOKLYN" "MANHATTAN" "BRONX" "MANHATTAN" ...
##  $ ZIP.CODE                     : int  11230 10003 10456 10033 10033 10304 NA 11223 11225 10470 ...
##  $ LATITUDE                     : num  40.6 40.7 40.8 40.9 40.8 ...
##  $ LONGITUDE                    : num  -74 -74 -73.9 -73.9 -73.9 ...
##  $ LOCATION                     : chr  "(40.6119531, -73.9681646)" "(40.7326034, -73.9876815)" "(40.8317235, -73.9150083)" "(40.8504481, -73.9366612)" ...
##  $ ON.STREET.NAME               : chr  "OCEAN PARKWAY                   " "EAST 13 STREET                  " "MORRIS AVENUE                   " "WEST 181 STREET                 " ...
##  $ CROSS.STREET.NAME            : chr  "AVENUE O                        " "3 AVENUE                        " "MCCLELLAN STREET                " "BENNETT AVENUE                  " ...
##  $ OFF.STREET.NAME              : chr  "" "" "" "" ...
##  $ NUMBER.OF.PERSONS.INJURED    : int  0 1 1 0 0 0 0 2 1 1 ...
##  $ NUMBER.OF.PERSONS.KILLED     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.PEDESTRIANS.INJURED: int  0 0 0 0 0 0 0 0 1 1 ...
##  $ NUMBER.OF.PEDESTRIANS.KILLED : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.CYCLIST.INJURED    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.CYCLIST.KILLED     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.MOTORIST.INJURED   : int  0 1 1 0 0 0 0 2 0 0 ...
##  $ NUMBER.OF.MOTORIST.KILLED    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.1: chr  "Driver Inattention/Distraction" "Driver Inattention/Distraction" "Physical Disability" "Unspecified" ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.2: chr  "Unspecified" "Prescription Medication" "Unspecified" "Unspecified" ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.3: chr  "" "" "" "" ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.4: chr  "" "" "" "" ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.5: chr  "" "" "" "" ...
##  $ UNIQUE.KEY                   : int  3167071 3243985 85811 3358626 313499 301002 3229815 3531769 152896 3326684 ...
##  $ VEHICLE.TYPE.CODE.1          : chr  "PASSENGER VEHICLE" "TAXI" "PASSENGER VEHICLE" "SPORT UTILITY / STATION WAGON" ...
##  $ VEHICLE.TYPE.CODE.2          : chr  "LARGE COM VEH(6 OR MORE TIRES)" "PASSENGER VEHICLE" "SPORT UTILITY / STATION WAGON" "SPORT UTILITY / STATION WAGON" ...
##  $ VEHICLE.TYPE.CODE.3          : chr  "" "" "" "" ...
##  $ VEHICLE.TYPE.CODE.4          : chr  "" "" "" "" ...
##  $ VEHICLE.TYPE.CODE.5          : chr  "" "" "" "" ...
3.1 Cleaning Data

The subsetted data has 11,255 rows and 30 columns.

The original data has 29 columns. I added one more column “ORI.ROW.NUM” that contains the row numbers from the orignal data.

head(c.data)
##   ORI.ROW.NUM       DATE  TIME       BOROUGH ZIP.CODE LATITUDE LONGITUDE
## 1      594464 02/09/2015 16:49      BROOKLYN    11230 40.61195 -73.96816
## 2      516462 06/22/2015 18:20     MANHATTAN    10003 40.73260 -73.98768
## 3     1064397 10/19/2012 19:50         BRONX    10456 40.83172 -73.91501
## 4      402796 12/22/2015 19:25     MANHATTAN    10033 40.85045 -73.93666
## 5      772612 04/01/2014 13:15     MANHATTAN    10033 40.84812 -73.93088
## 6      798178 02/12/2014  8:05 STATEN ISLAND    10304 40.59759 -74.09609
##                    LOCATION                   ON.STREET.NAME
## 1 (40.6119531, -73.9681646) OCEAN PARKWAY                   
## 2 (40.7326034, -73.9876815) EAST 13 STREET                  
## 3 (40.8317235, -73.9150083) MORRIS AVENUE                   
## 4 (40.8504481, -73.9366612) WEST 181 STREET                 
## 5 (40.8481201, -73.9308839) WEST 181 STREET                 
## 6 (40.5975895, -74.0960906) DUTCHESS AVENUE                 
##                  CROSS.STREET.NAME OFF.STREET.NAME
## 1 AVENUE O                                        
## 2 3 AVENUE                                        
## 3 MCCLELLAN STREET                                
## 4 BENNETT AVENUE                                  
## 5 AMSTERDAM AVENUE                                
## 6 VISTA AVENUE                                    
##   NUMBER.OF.PERSONS.INJURED NUMBER.OF.PERSONS.KILLED
## 1                         0                        0
## 2                         1                        0
## 3                         1                        0
## 4                         0                        0
## 5                         0                        0
## 6                         0                        0
##   NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED
## 1                             0                            0
## 2                             0                            0
## 3                             0                            0
## 4                             0                            0
## 5                             0                            0
## 6                             0                            0
##   NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED
## 1                         0                        0
## 2                         0                        0
## 3                         0                        0
## 4                         0                        0
## 5                         0                        0
## 6                         0                        0
##   NUMBER.OF.MOTORIST.INJURED NUMBER.OF.MOTORIST.KILLED
## 1                          0                         0
## 2                          1                         0
## 3                          1                         0
## 4                          0                         0
## 5                          0                         0
## 6                          0                         0
##    CONTRIBUTING.FACTOR.VEHICLE.1 CONTRIBUTING.FACTOR.VEHICLE.2
## 1 Driver Inattention/Distraction                   Unspecified
## 2 Driver Inattention/Distraction       Prescription Medication
## 3            Physical Disability                   Unspecified
## 4                    Unspecified                   Unspecified
## 5                    Unspecified                   Unspecified
## 6                    Unspecified                   Unspecified
##   CONTRIBUTING.FACTOR.VEHICLE.3 CONTRIBUTING.FACTOR.VEHICLE.4
## 1                                                            
## 2                                                            
## 3                                                            
## 4                                                            
## 5                                                            
## 6                                                            
##   CONTRIBUTING.FACTOR.VEHICLE.5 UNIQUE.KEY           VEHICLE.TYPE.CODE.1
## 1                                  3167071             PASSENGER VEHICLE
## 2                                  3243985                          TAXI
## 3                                    85811             PASSENGER VEHICLE
## 4                                  3358626 SPORT UTILITY / STATION WAGON
## 5                                   313499             PASSENGER VEHICLE
## 6                                   301002             PASSENGER VEHICLE
##              VEHICLE.TYPE.CODE.2 VEHICLE.TYPE.CODE.3 VEHICLE.TYPE.CODE.4
## 1 LARGE COM VEH(6 OR MORE TIRES)                                        
## 2              PASSENGER VEHICLE                                        
## 3  SPORT UTILITY / STATION WAGON                                        
## 4  SPORT UTILITY / STATION WAGON                                        
## 5        SMALL COM VEH(4 TIRES)                                         
## 6                        UNKNOWN                                        
##   VEHICLE.TYPE.CODE.5
## 1                    
## 2                    
## 3                    
## 4                    
## 5                    
## 6

Upon inspection, the CONTRIBUTING.FACTOR.VEHICLE.1 thru 5 and VEHICLE.TYPE.CODE.1 thru 5 for each vehicle involved are in “wide” format. I need to stack these two variables. Here’s how I did it:

  1. Use gather to stack the CONTRIBUTING.FACTOR.VEHICLE.1 thru 5 and VEHICLE.TYPE.CODE.1 thru 5 into one column named key. The factor and type are stored in the column named value.
  2. Use str_extract_all to extract the vehicle number and store in a new column VEHICLE.NUM.
  3. Use str_replace_all to remove the vehicle number from the key column.
  4. Use spread to spread the CONTRIBUTING.FACTOR.VEHICLE and VEHICLE.TYPE.CODE rows into columns.
idx1 <- which(names(c.data) == "VEHICLE.TYPE.CODE.1")
idx2 <- which(names(c.data) == "VEHICLE.TYPE.CODE.5")
idx3 <- which(names(c.data) == "CONTRIBUTING.FACTOR.VEHICLE.1")
idx4 <- which(names(c.data) == "CONTRIBUTING.FACTOR.VEHICLE.5")
c.data <- gather(c.data, key, value, c(idx1:idx2, idx3:idx4))
c.data$VEHICLE.NUM <- unlist(as.numeric(str_extract_all(c.data$key, pattern = "[0-9]")))
c.data$key <- str_replace_all(c.data$key, pattern = ".[0-9]", replacement = "")
c.data <- spread(c.data, key, value)
str(c.data)
## 'data.frame':    56275 obs. of  23 variables:
##  $ ORI.ROW.NUM                  : int  594464 516462 1064397 402796 772612 798178 530865 230645 1091157 435177 ...
##  $ DATE                         : chr  "02/09/2015" "06/22/2015" "10/19/2012" "12/22/2015" ...
##  $ TIME                         : chr  "16:49" "18:20" "19:50" "19:25" ...
##  $ BOROUGH                      : chr  "BROOKLYN" "MANHATTAN" "BRONX" "MANHATTAN" ...
##  $ ZIP.CODE                     : int  11230 10003 10456 10033 10033 10304 NA 11223 11225 10470 ...
##  $ LATITUDE                     : num  40.6 40.7 40.8 40.9 40.8 ...
##  $ LONGITUDE                    : num  -74 -74 -73.9 -73.9 -73.9 ...
##  $ LOCATION                     : chr  "(40.6119531, -73.9681646)" "(40.7326034, -73.9876815)" "(40.8317235, -73.9150083)" "(40.8504481, -73.9366612)" ...
##  $ ON.STREET.NAME               : chr  "OCEAN PARKWAY                   " "EAST 13 STREET                  " "MORRIS AVENUE                   " "WEST 181 STREET                 " ...
##  $ CROSS.STREET.NAME            : chr  "AVENUE O                        " "3 AVENUE                        " "MCCLELLAN STREET                " "BENNETT AVENUE                  " ...
##  $ OFF.STREET.NAME              : chr  "" "" "" "" ...
##  $ NUMBER.OF.PERSONS.INJURED    : int  0 1 1 0 0 0 0 2 1 1 ...
##  $ NUMBER.OF.PERSONS.KILLED     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.PEDESTRIANS.INJURED: int  0 0 0 0 0 0 0 0 1 1 ...
##  $ NUMBER.OF.PEDESTRIANS.KILLED : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.CYCLIST.INJURED    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.CYCLIST.KILLED     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.MOTORIST.INJURED   : int  0 1 1 0 0 0 0 2 0 0 ...
##  $ NUMBER.OF.MOTORIST.KILLED    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ UNIQUE.KEY                   : int  3167071 3243985 85811 3358626 313499 301002 3229815 3531769 152896 3326684 ...
##  $ VEHICLE.NUM                  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ CONTRIBUTING.FACTOR.VEHICLE  : chr  "Driver Inattention/Distraction" "Driver Inattention/Distraction" "Physical Disability" "Unspecified" ...
##  $ VEHICLE.TYPE.CODE            : chr  "PASSENGER VEHICLE" "TAXI" "PASSENGER VEHICLE" "SPORT UTILITY / STATION WAGON" ...
head(c.data)
##   ORI.ROW.NUM       DATE  TIME       BOROUGH ZIP.CODE LATITUDE LONGITUDE
## 1      594464 02/09/2015 16:49      BROOKLYN    11230 40.61195 -73.96816
## 2      516462 06/22/2015 18:20     MANHATTAN    10003 40.73260 -73.98768
## 3     1064397 10/19/2012 19:50         BRONX    10456 40.83172 -73.91501
## 4      402796 12/22/2015 19:25     MANHATTAN    10033 40.85045 -73.93666
## 5      772612 04/01/2014 13:15     MANHATTAN    10033 40.84812 -73.93088
## 6      798178 02/12/2014  8:05 STATEN ISLAND    10304 40.59759 -74.09609
##                    LOCATION                   ON.STREET.NAME
## 1 (40.6119531, -73.9681646) OCEAN PARKWAY                   
## 2 (40.7326034, -73.9876815) EAST 13 STREET                  
## 3 (40.8317235, -73.9150083) MORRIS AVENUE                   
## 4 (40.8504481, -73.9366612) WEST 181 STREET                 
## 5 (40.8481201, -73.9308839) WEST 181 STREET                 
## 6 (40.5975895, -74.0960906) DUTCHESS AVENUE                 
##                  CROSS.STREET.NAME OFF.STREET.NAME
## 1 AVENUE O                                        
## 2 3 AVENUE                                        
## 3 MCCLELLAN STREET                                
## 4 BENNETT AVENUE                                  
## 5 AMSTERDAM AVENUE                                
## 6 VISTA AVENUE                                    
##   NUMBER.OF.PERSONS.INJURED NUMBER.OF.PERSONS.KILLED
## 1                         0                        0
## 2                         1                        0
## 3                         1                        0
## 4                         0                        0
## 5                         0                        0
## 6                         0                        0
##   NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED
## 1                             0                            0
## 2                             0                            0
## 3                             0                            0
## 4                             0                            0
## 5                             0                            0
## 6                             0                            0
##   NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED
## 1                         0                        0
## 2                         0                        0
## 3                         0                        0
## 4                         0                        0
## 5                         0                        0
## 6                         0                        0
##   NUMBER.OF.MOTORIST.INJURED NUMBER.OF.MOTORIST.KILLED UNIQUE.KEY
## 1                          0                         0    3167071
## 2                          1                         0    3243985
## 3                          1                         0      85811
## 4                          0                         0    3358626
## 5                          0                         0     313499
## 6                          0                         0     301002
##   VEHICLE.NUM    CONTRIBUTING.FACTOR.VEHICLE             VEHICLE.TYPE.CODE
## 1           1 Driver Inattention/Distraction             PASSENGER VEHICLE
## 2           1 Driver Inattention/Distraction                          TAXI
## 3           1            Physical Disability             PASSENGER VEHICLE
## 4           1                    Unspecified SPORT UTILITY / STATION WAGON
## 5           1                    Unspecified             PASSENGER VEHICLE
## 6           1                    Unspecified             PASSENGER VEHICLE

The data is now in its tidy form.

3.2 Analyzing Data

First I would like to know which borough has the most accident. I use filter to remove the rows that does not have borough information, and use table to find the result. Since each accident is now recorded in 5 rows (each row for a vehicle), I divide the values calculated by 5.

table(c.data$BOROUGH) / 5
## 
##                       BRONX      BROOKLYN     MANHATTAN        QUEENS 
##          3069          1089          2564          2073          2095 
## STATEN ISLAND 
##           365
table(c.data$BOROUGH) / sum(table(c.data$BOROUGH))
## 
##                       BRONX      BROOKLYN     MANHATTAN        QUEENS 
##    0.27267881    0.09675700    0.22780986    0.18418481    0.18613949 
## STATEN ISLAND 
##    0.03243003

The data shows that Brooklyn has majority of the accidents in NYC, and Staten Island has the least.

Next I would like to know what kind of vehicles are involved in most accident. However, I first have to filter out the blank rows.

c1.data <- filter(c.data, VEHICLE.TYPE.CODE != "")
vtype <- table(c1.data$VEHICLE.TYPE.CODE)
barplot(vtype, las = 3)

It looks like overwhelming majority of accidents invovled “passenger vehicles”. This makes sense since most of vehicles on the road are passenger vehicles, I presume.

What about contribution factors? Which factor causes the most accidents?

I filter out the blank row as well as the “Unspecified” row. There are many factors contributing to accident. Here I look at the top 5 factors.

c2.data <- filter(c.data, !(CONTRIBUTING.FACTOR.VEHICLE %in% c("", "Unspecified")))
cfactor <- sort(table(c2.data$CONTRIBUTING.FACTOR.VEHICLE))
cfactor[(length(cfactor)-4) : length(cfactor)] / sum(cfactor)
## 
##               Backing Unsafely                Other Vehicular 
##                     0.05918142                     0.08061394 
##                Fatigued/Drowsy  Failure to Yield Right-of-Way 
##                     0.08780420                     0.09126106 
## Driver Inattention/Distraction 
##                     0.27917588

It looks like “driver inattention/distraction” causes overwhelming majority of the accidents. So, we better pay attention while driving… Especially in New York!

4.0 Global Shack Attack Data

This dataset contains records of shack attack incidents reported all over the world, from Year 2014 going all the way back to Year 1845.

The original data can be found here: http://www.sharkattackfile.net/incidentlog.htm

4.1 Cleaning Data
theURL3 <- "https://raw.githubusercontent.com/Tyllis/Data607/master/attacks.csv"
s.data <- read.csv(theURL3, stringsAsFactors = F)
str(s.data)
## 'data.frame':    25614 obs. of  22 variables:
##  $ Case.Number           : chr  "2017.06.11" "2017.06.10.b" "2017.06.10.a" "2017.06.07.R" ...
##  $ Date                  : chr  "11-Jun-17" "10-Jun-17" "10-Jun-17" "Reported 07-Jun-2017" ...
##  $ Year                  : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ Type                  : chr  "Unprovoked" "Unprovoked" "Unprovoked" "Unprovoked" ...
##  $ Country               : chr  "AUSTRALIA" "AUSTRALIA" "USA" "UNITED KINGDOM" ...
##  $ Area                  : chr  "Western Australia" "Victoria" "Florida" "South Devon" ...
##  $ Location              : chr  "Point Casuarina, Bunbury" "Flinders, Mornington Penisula" "Ponce Inlet, Volusia County" "Bantham Beach" ...
##  $ Activity              : chr  "Body boarding" "Surfing" "Surfing" "Surfing " ...
##  $ Name                  : chr  "Paul Goff" "female" "Bryan Brock" "Rich Thomson" ...
##  $ Sex                   : chr  "M" "F" "M" "M" ...
##  $ Age                   : chr  "48" "" "19" "30" ...
##  $ Injury                : chr  "No injury, board bitten" "No injury, knocke off board" "Laceration to left foot" "Bruise to leg, cuts to hand sustained when he hit the shark" ...
##  $ Fatal..Y.N.           : chr  "N" "N" "N" "N" ...
##  $ Time                  : chr  "08h30" "15h45" "10h00" "" ...
##  $ Species               : chr  "White shark, 4 m" "7 gill shark" "" "3m shark, probably a smooth hound" ...
##  $ Investigator.or.Source: chr  "WA Today, 6/11/2017" "" "Daytona Beach News-Journal, 6/10/2017" "C. Moore, GSAF" ...
##  $ pdf                   : chr  "2017.06.11-Goff.pdf" "2017.06.10.b-Flinders.pdf" "2017.06.10.a-Brock.pdf" "2017.06.07.R-Thomson.pdf" ...
##  $ href.formula          : chr  "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.11-Goff.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.b-Flinders.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.a-Brock.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.07.R-Thomson.pdf" ...
##  $ href                  : chr  "http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/201"| __truncated__ "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.b-Flinders.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.a-Brock.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.07.R-Thomson.pdf" ...
##  $ Case.Number.1         : chr  "2017.06.11" "2017.06.10.b" "2017.06.10.a" "2017.06.07.R" ...
##  $ Case.Number.2         : chr  "2017.06.11" "2017.06.10.b" "2017.06.10.a" "2017.06.07.R" ...
##  $ original.order        : int  6095 6094 6093 6092 6091 6090 6089 6088 6087 6086 ...
tail(s.data)
##       Case.Number Date Year Type Country Area Location Activity Name Sex
## 25609                    NA                                             
## 25610                    NA                                             
## 25611                    NA                                             
## 25612                    NA                                             
## 25613                    NA                                             
## 25614          xx        NA                                             
##       Age Injury Fatal..Y.N. Time Species Investigator.or.Source pdf
## 25609                                                               
## 25610                                                               
## 25611                                                               
## 25612                                                               
## 25613                                                               
## 25614                                                               
##       href.formula href Case.Number.1 Case.Number.2 original.order
## 25609                                                           NA
## 25610                                                           NA
## 25611                                                           NA
## 25612                                                           NA
## 25613                                                           NA
## 25614                                                           NA

When the csv file was imported, there are some blank rows introduced at the end of the table. Also, there seems to be two duplicated Case.Number columns at the end. I remove these first.

s.data <- s.data %>% 
  filter(!(Case.Number %in% c("", NA, 0, "xx"))) %>% 
  select(-Case.Number.1) %>% 
  select(-Case.Number.2)

The data is now in its tidy form.

4.2 Analyzing Data

First I’m interested to see if the number of reported attacks has increased/decreased over the years. I will have to filter the Year column using as.numeric, and drop any 0 or NA rows.

s1.data <- s.data
s1.data$Year <- as.numeric(s1.data$Year)
s1.data <- filter(s1.data, !(Year %in% c(0, NA)))
a <- s1.data %>% 
  group_by(Year) %>% 
  summarise(Attacks = n())
plot(a, type = "l")

Looks like most attacks are reported in the last 100 years. Let’s look at attacks from 1900 to 2016.

b <- s1.data %>% 
  filter(Year >= 1900 & Year < 2017) %>% 
  group_by(Year) %>% 
  summarise(Attacks = n())
plot(b, type = "l")

There is a spike in 1960. It has been increasing since. I think that the increasing trend is probably due to advancement of information technologies since 1980s - it is getting easier for people around the world to communicate and share shack attack reports.

Next I want to see the percentage of fatalities reported.

s2.data <- filter(s.data, Fatal..Y.N. %in% c("Y", "N"))
(a <- table(s2.data$Fatal..Y.N.))
## 
##    N    Y 
## 4391 1566
a/sum(a)
## 
##        N        Y 
## 0.737116 0.262884

So the percentage of fatality is about 26.2884002%.

Let’s look at the type of attacks.

table(s.data$Type)
## 
##                      Boat      Boating      Invalid     Provoked 
##            4          202          110          529          563 
## Sea Disaster   Unprovoked 
##          220         4466

Majority of the attacks are unprovoked. The GSAF website defines a provoked incident as one in which the shark was speared, hooked, captured or in which a human drew “first blood”.

Lastly, I want to see if the Species variable can tell me something about shack attacks.

However, this variable is very messy, containing all kinds of strings:

(len <- length(unique(s.data$Species)))
## [1] 1555

This column contains 1555 number of different descriptions. Because shack attacks usually happen very fast, victims typically were just able to remember the length, sizes (weights), colors, and/or other apparent features of the shack. This Species column is basically a description of the shacks.

My goal is to use stringr to extract the length of the attacking shacks. Even just this task is difficult:

  1. The lengths were recorded in different units. They can be in metric or British unit.
  2. The lengths were sometimes integer and other times decimal numbers.
  3. Sometimes the length were recorded in a range, for example “2m to 3 m shack”.

I use this pattern to extract the length: pattern = “[0-9.]+ ?[cm’]+”. This pattern will perform the following search:

  1. Look for numbers first “[0-9.]+”"
  2. Look for an optional white space " ?"
  3. Look for string units “[cm’]+”

Note that this search pattern will not extract every length information from all rows that contains the information. Since the data is so messy, loss is inevitable. My hope is that this will retain enough data for analysis. Also keep in mind that not all rows has length information.

species <- s.data$Species
slen <- unlist(str_extract(species, pattern = "[0-9.]+ ?[cm']+"))
slen <- slen[!is.na(slen)]
length(slen)
## [1] 1951

So I have extracted 1951 non-NA rows. Recall that the original data has 6094 rows. I have retained 32.0150968% of the dataset.

Next I extract units from the numbers, and create a data.frame object containing the lengths.

num <- unlist(str_extract(slen, pattern = "[0-9.]+"))
unit <- unlist(str_extract(slen, pattern = "[cm']+"))
slen.data <- data.frame(num = as.numeric(num), unit, stringsAsFactors = F)
## Warning in data.frame(num = as.numeric(num), unit, stringsAsFactors = F):
## NAs introduced by coercion

When converting num to numeric, I introduced NA rows. I will filter out these NA rows, instead of trying to find out the reason.

slen.data <- filter(slen.data, !is.na(num))
table(is.na(slen.data$num))
## 
## FALSE 
##  1947

Notice only few rows are removed.

Next I try to convert the units from metric to British. I simply subset out the metric units, apply the conversion factor, then stick them back together in a vector.

cm <- filter(slen.data, unit == "cm")$num * 0.0328084
m <- filter(slen.data, unit == "m")$num * 3.28084
mm <- filter(slen.data, unit == "mm")$num * 0.00328084
ft <- filter(slen.data, unit == "'")$num * 1
shack.len.ft <- c(cm, m, mm, ft)
str(shack.len.ft)
##  num [1:1947] 1.31 1.97 2.62 1.97 1.97 ...

Now I can finally do some analysis.

(a <- summary(shack.len.ft))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0164  4.9213  7.0000  8.4710 11.4829 24.9344

So average length of a attacking shack is 8.4709872 ft, with the median length being 7 ft. The maximum size ever observed was a monsterous 24.934384 ft shack.

Let’s look at the distribution.

hist(shack.len.ft)

It is bell-shape, unimodal, and skewed to the right.