In this project I will be using the tidyr, dplyr, and stringr packages extensively.
library(tidyr)
library(dplyr)
library(stringr)
I will be looking into following three datasets:
The original dataset can be found here: https://www.transit.dot.gov/ntd/data-product/ts11-total-funding-time-series-2
This is the Total Funding Summary for transit agencies in United States. It includes the total dollar spent annually reported by each transit agency from 1991 to 2016.
First I load the data.
theURL1 <- "https://raw.githubusercontent.com/Tyllis/Data607/master/TS1.1TimeSeriesOpCapFundingSummary_3.csv"
t.data <- read.csv(theURL1, stringsAsFactors = FALSE)
dimbfr <- dim(t.data)
str(t.data)
## 'data.frame': 1210 obs. of 44 variables:
## $ ï..Last.Report.Year: int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ X5.Digit.NTDID : int 1 2 3 5 6 7 8 11 12 16 ...
## $ X4.Digit.NTDID : chr "0001" "0002" "0003" "0005" ...
## $ Agency.Name : chr "King County Department of Transportation - Metro Transit Division(King County Metro)" "Spokane Transit Authority(STA)" "Pierce County Transportation Benefit Area Authority(Pierce Transit)" "Everett Transit(ET)" ...
## $ Agency.Status : chr "Active" "Active" "Active" "Active" ...
## $ Reporter.Type : chr "Full Reporter" "Full Reporter" "Full Reporter" "Full Reporter" ...
## $ City : chr "Seattle" "Spokane" "Tacoma" "Everett" ...
## $ State : chr "WA" "WA" "WA" "WA" ...
## $ Census.Year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Primary.UZA.Name : chr "Seattle, WA" "Spokane, WA" "Seattle, WA" "Seattle, WA" ...
## $ UZA : int 14 96 14 14 248 151 24 108 149 431 ...
## $ UZA.Area.SQ.Miles : chr "1,010" "164" "1,010" "1,010" ...
## $ UZA.Population : chr "3,059,393" "387,847" "3,059,393" "3,059,393" ...
## $ X2016.Status : chr "Existing 2016" "Existing 2016" "Existing 2016" "Existing 2016" ...
## $ X1991 : chr "$383,107,583" "$27,042,832" "$54,776,294" "$6,604,377" ...
## $ X1992 : chr "$329,694,739" "$27,976,993" "$64,034,797" "$6,939,479" ...
## $ X1993 : chr "$275,639,731" "$29,374,328" "$44,060,100" "$9,082,327" ...
## $ X1994 : chr "$277,424,929" "$42,316,768" "$52,209,179" "$8,126,093" ...
## $ X1995 : chr "$289,759,749" "$37,303,829" "$58,145,467" "$6,433,211" ...
## $ X1996 : chr "$342,032,852" "$32,192,970" "$50,672,958" "$8,593,023" ...
## $ X1997 : chr "$399,974,914" "$40,737,394" "$63,142,957" "$9,712,609" ...
## $ X1998 : chr "$367,127,087" "$33,591,348" "$69,330,047" "$8,266,938" ...
## $ X1999 : chr "$483,918,410" "$34,253,966" "$75,940,929" "$15,969,679" ...
## $ X2000 : chr "$416,810,718" "$35,204,499" "$83,031,163" "$19,554,697" ...
## $ X2001 : chr "$400,536,050" "$37,802,003" "$62,516,343" "$31,962,983" ...
## $ X2002 : chr "$493,524,290" "$39,682,730" "$75,453,800" "$18,290,801" ...
## $ X2003 : chr "$506,818,105" "$45,535,339" "$74,596,167" "$12,947,676" ...
## $ X2004 : chr "$651,927,092" "$40,763,480" "$97,130,128" "$11,608,063" ...
## $ X2005 : chr "$509,808,277" "$48,952,703" "$103,805,219" "$14,806,506" ...
## $ X2006 : chr "$535,030,832" "$56,450,544" "$101,860,747" "$18,193,940" ...
## $ X2007 : chr "$572,783,766" "$64,963,968" "$115,427,096" "$21,030,199" ...
## $ X2008 : chr "$651,636,069" "$73,982,401" "$132,252,330" "$20,090,792" ...
## $ X2009 : chr "$668,646,858" "$66,417,392" "$121,813,470" "$25,076,998" ...
## $ X2010 : chr "$669,733,012" "$67,478,671" "$132,109,408" "$21,677,219" ...
## $ X2011 : chr "$867,772,522" "$64,728,901" "$130,505,371" "$22,596,747" ...
## $ X2012 : chr "$817,791,736" "$70,077,825" "$117,485,734" "$21,626,515" ...
## $ X2013 : chr "$762,388,692" "$65,295,886" "$115,083,767" "$25,333,502" ...
## $ X2014 : chr "$782,670,566" "$70,699,557" "$128,861,135" "$20,388,976" ...
## $ X2015 : chr "$831,702,187" "$72,960,135" "$132,886,339" "$20,726,374" ...
## $ X2016 : chr "$988,752,004" "$73,146,122" "$142,349,888" "$21,506,189" ...
## $ X : logi NA NA NA NA NA NA ...
## $ X.1 : logi NA NA NA NA NA NA ...
## $ X.2 : logi NA NA NA NA NA NA ...
## $ X.3 : logi NA NA NA NA NA NA ...
Here are some problems that needed to be fixed before the data.frame can be used for analysis:
Also the first column name needs to be corrected.
Below, I used select to remove the redundant columns, gather to stack the dollar amount data, str_replace_all to remove the “$” and “,”, and as.numeric to turn string into numeric.
names(t.data)[1] <- "Last.Report.Year"
t.data <- t.data %>%
select(Last.Report.Year : X2016) %>%
filter(!is.na(Last.Report.Year)) %>%
gather("Year", "Amount", X1991 : X2016)
t.data$Amount <- t.data$Amount %>%
str_replace_all(pattern = "[$,]", replacement = "") %>%
as.numeric()
t.data$Year <- t.data$Year %>%
str_replace_all(pattern = "X", replacement = "") %>%
as.numeric()
dimaft <- dim(t.data)
str(t.data)
## 'data.frame': 31278 obs. of 16 variables:
## $ Last.Report.Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ X5.Digit.NTDID : int 1 2 3 5 6 7 8 11 12 16 ...
## $ X4.Digit.NTDID : chr "0001" "0002" "0003" "0005" ...
## $ Agency.Name : chr "King County Department of Transportation - Metro Transit Division(King County Metro)" "Spokane Transit Authority(STA)" "Pierce County Transportation Benefit Area Authority(Pierce Transit)" "Everett Transit(ET)" ...
## $ Agency.Status : chr "Active" "Active" "Active" "Active" ...
## $ Reporter.Type : chr "Full Reporter" "Full Reporter" "Full Reporter" "Full Reporter" ...
## $ City : chr "Seattle" "Spokane" "Tacoma" "Everett" ...
## $ State : chr "WA" "WA" "WA" "WA" ...
## $ Census.Year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Primary.UZA.Name : chr "Seattle, WA" "Spokane, WA" "Seattle, WA" "Seattle, WA" ...
## $ UZA : int 14 96 14 14 248 151 24 108 149 431 ...
## $ UZA.Area.SQ.Miles: chr "1,010" "164" "1,010" "1,010" ...
## $ UZA.Population : chr "3,059,393" "387,847" "3,059,393" "3,059,393" ...
## $ X2016.Status : chr "Existing 2016" "Existing 2016" "Existing 2016" "Existing 2016" ...
## $ Year : num 1991 1991 1991 1991 1991 ...
## $ Amount : num 3.83e+08 2.70e+07 5.48e+07 6.60e+06 4.23e+06 ...
Notice that the Amount are large numbers. I created a column to scale the amount in millions.
t.data$Amount.in.Mil <- t.data$Amount/1000000
The data is now in its tidy form. Each row is a observation. Each column is a variable.
First I would like to find out how many transit agencies are in the NY and NJ states.
nynj.data <- filter(t.data, State %in% c("NY", "NJ"))
nynj <- table(nynj.data$State) / 26
nynj
##
## NJ NY
## 47 81
Note that because I stack the time series over 26 years (2016-1991+1 = 26 columns), each original row is stacked 26 times. So I divide the counts by 26.
It looks like NY has 81 agencies and NJ has 47 agencies.
I want to do some analysis on NYC MTA. But it looks like MTA is consisted of several sub-agencies and they are reported separately in the data base. First I would like to find out how many sub-agencies are there for MTA. I use grep to locate names that has “MTA” in it.
mta.names <- nynj.data$Agency.Name %>%
grep(pattern = "MTA", value = TRUE) %>%
unique()
mta.names
## [1] "Metropolitan Suburban Bus Authority, dba: MTA Long Island Bus(MTA Long Island Bus)"
## [2] "MTA New York City Transit(NYCT)"
## [3] "Metro-North Commuter Railroad Company, dba: MTA Metro-North Railroad(MTA-MNCR)"
## [4] "Staten Island Rapid Transit Operating Authority, dba: MTA Staten Island Railway(SIRTOA)"
## [5] "MTA Long Island Rail Road(MTA LIRR)"
## [6] "MTA Bus Company(MTABUS)"
So there are 6 agencies in MTA. I now extract the MTA agencies and perform some analysis. Here, I use group_by and summarise.
mta.data <- filter(nynj.data, Agency.Name %in% mta.names)
sum.mta <- mta.data %>%
group_by(Year) %>%
summarise(Total = sum(Amount.in.Mil, na.rm = TRUE))
plot(x = sum.mta$Year, y = sum.mta$Total, xlab = "Year" , ylab = "Amount in Millions", type = "l", main = "MTA Total Spending")
Looks like MTA spending has been increasing steadily.
Next I would like to see a bar plot, which should show break down of spending among the sub-agencies. But first I have to build a “wide” format table. To use barplot, the years must be in columns, so I must spread the data.
Also, I have to do something with the MTA agency names. They are way too long. I just need the agency abbreviation inside the ().
mta.data$Agency.Name <- mta.data$Agency.Name %>%
str_extract_all(pattern = "\\([[:alpha:]- ]+\\)") %>%
str_extract_all(pattern = "[[:alpha:]- ]+") %>%
unlist()
sum2.mta <- mta.data %>%
group_by(Agency.Name, Year) %>%
summarise(Total = sum(Amount.in.Mil, na.rm = TRUE)) %>%
spread(Year, Total)
sum2.mta
## # A tibble: 6 x 27
## # Groups: Agency.Name [6]
## Agency.Name `1991` `1992` `1993` `1994`
## * <chr> <dbl> <dbl> <dbl> <dbl>
## 1 MTA-MNCR 665.33515 628.27748 539.57303 633.98635
## 2 MTA LIRR 774.32193 837.58532 801.45880 863.04390
## 3 MTA Long Island Bus 70.61361 62.39569 62.42088 60.10073
## 4 MTABUS 0.00000 0.00000 0.00000 0.00000
## 5 NYCT 4781.66803 4098.92414 4182.69142 4251.42833
## 6 SIRTOA 21.95250 22.68951 23.25011 23.52328
## # ... with 22 more variables: `1995` <dbl>, `1996` <dbl>, `1997` <dbl>,
## # `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>,
## # `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>,
## # `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
## # `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>
Now I can build the barplot.
colors <- c("blue", "green", "red", "yellow", "purple", "pink")
Agency <- sum2.mta$Agency.Name
lgspec <- list(x = "topleft", bty = "n")
barplot(as.matrix(sum2.mta), legend.text = Agency, col = colors, xlab = "Year", ylab = "Amount in Millions", main = "MTA Spending Breakdown", args.legend = lgspec, las = 3)
## Warning in apply(height, 2L, cumsum): NAs introduced by coercion
As suspected, NYCT is driving the majority of the spending growth. New York City Transit is the MTA agency running the NYC Subway and Bus. That explains why they keep raising the fares. /eyeroll
This data contains detail records of vehicle collisions in New York City reported by the New York Police Department.
The original dataset can be found here: https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95
The original dataset has more than 1 million rows. The entire csv file is 245 MB. For this project, to save space on Github and avoid crashing R, I randomly sampled 1/100 of the data for analysis.
The subsetted data can be found here: https://raw.githubusercontent.com/Tyllis/Data607/master/NYPD_Motor_Vehicle_Collisions_subset.csv
theURL2 <- "https://raw.githubusercontent.com/Tyllis/Data607/master/NYPD_Motor_Vehicle_Collisions_subset.csv"
c.data <- read.csv(theURL2, stringsAsFactors = F)
str(c.data)
## 'data.frame': 11255 obs. of 30 variables:
## $ ORI.ROW.NUM : int 594464 516462 1064397 402796 772612 798178 530865 230645 1091157 435177 ...
## $ DATE : chr "02/09/2015" "06/22/2015" "10/19/2012" "12/22/2015" ...
## $ TIME : chr "16:49" "18:20" "19:50" "19:25" ...
## $ BOROUGH : chr "BROOKLYN" "MANHATTAN" "BRONX" "MANHATTAN" ...
## $ ZIP.CODE : int 11230 10003 10456 10033 10033 10304 NA 11223 11225 10470 ...
## $ LATITUDE : num 40.6 40.7 40.8 40.9 40.8 ...
## $ LONGITUDE : num -74 -74 -73.9 -73.9 -73.9 ...
## $ LOCATION : chr "(40.6119531, -73.9681646)" "(40.7326034, -73.9876815)" "(40.8317235, -73.9150083)" "(40.8504481, -73.9366612)" ...
## $ ON.STREET.NAME : chr "OCEAN PARKWAY " "EAST 13 STREET " "MORRIS AVENUE " "WEST 181 STREET " ...
## $ CROSS.STREET.NAME : chr "AVENUE O " "3 AVENUE " "MCCLELLAN STREET " "BENNETT AVENUE " ...
## $ OFF.STREET.NAME : chr "" "" "" "" ...
## $ NUMBER.OF.PERSONS.INJURED : int 0 1 1 0 0 0 0 2 1 1 ...
## $ NUMBER.OF.PERSONS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.PEDESTRIANS.INJURED: int 0 0 0 0 0 0 0 0 1 1 ...
## $ NUMBER.OF.PEDESTRIANS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.CYCLIST.INJURED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.CYCLIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.MOTORIST.INJURED : int 0 1 1 0 0 0 0 2 0 0 ...
## $ NUMBER.OF.MOTORIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.1: chr "Driver Inattention/Distraction" "Driver Inattention/Distraction" "Physical Disability" "Unspecified" ...
## $ CONTRIBUTING.FACTOR.VEHICLE.2: chr "Unspecified" "Prescription Medication" "Unspecified" "Unspecified" ...
## $ CONTRIBUTING.FACTOR.VEHICLE.3: chr "" "" "" "" ...
## $ CONTRIBUTING.FACTOR.VEHICLE.4: chr "" "" "" "" ...
## $ CONTRIBUTING.FACTOR.VEHICLE.5: chr "" "" "" "" ...
## $ UNIQUE.KEY : int 3167071 3243985 85811 3358626 313499 301002 3229815 3531769 152896 3326684 ...
## $ VEHICLE.TYPE.CODE.1 : chr "PASSENGER VEHICLE" "TAXI" "PASSENGER VEHICLE" "SPORT UTILITY / STATION WAGON" ...
## $ VEHICLE.TYPE.CODE.2 : chr "LARGE COM VEH(6 OR MORE TIRES)" "PASSENGER VEHICLE" "SPORT UTILITY / STATION WAGON" "SPORT UTILITY / STATION WAGON" ...
## $ VEHICLE.TYPE.CODE.3 : chr "" "" "" "" ...
## $ VEHICLE.TYPE.CODE.4 : chr "" "" "" "" ...
## $ VEHICLE.TYPE.CODE.5 : chr "" "" "" "" ...
The subsetted data has 11,255 rows and 30 columns.
The original data has 29 columns. I added one more column “ORI.ROW.NUM” that contains the row numbers from the orignal data.
head(c.data)
## ORI.ROW.NUM DATE TIME BOROUGH ZIP.CODE LATITUDE LONGITUDE
## 1 594464 02/09/2015 16:49 BROOKLYN 11230 40.61195 -73.96816
## 2 516462 06/22/2015 18:20 MANHATTAN 10003 40.73260 -73.98768
## 3 1064397 10/19/2012 19:50 BRONX 10456 40.83172 -73.91501
## 4 402796 12/22/2015 19:25 MANHATTAN 10033 40.85045 -73.93666
## 5 772612 04/01/2014 13:15 MANHATTAN 10033 40.84812 -73.93088
## 6 798178 02/12/2014 8:05 STATEN ISLAND 10304 40.59759 -74.09609
## LOCATION ON.STREET.NAME
## 1 (40.6119531, -73.9681646) OCEAN PARKWAY
## 2 (40.7326034, -73.9876815) EAST 13 STREET
## 3 (40.8317235, -73.9150083) MORRIS AVENUE
## 4 (40.8504481, -73.9366612) WEST 181 STREET
## 5 (40.8481201, -73.9308839) WEST 181 STREET
## 6 (40.5975895, -74.0960906) DUTCHESS AVENUE
## CROSS.STREET.NAME OFF.STREET.NAME
## 1 AVENUE O
## 2 3 AVENUE
## 3 MCCLELLAN STREET
## 4 BENNETT AVENUE
## 5 AMSTERDAM AVENUE
## 6 VISTA AVENUE
## NUMBER.OF.PERSONS.INJURED NUMBER.OF.PERSONS.KILLED
## 1 0 0
## 2 1 0
## 3 1 0
## 4 0 0
## 5 0 0
## 6 0 0
## NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## NUMBER.OF.MOTORIST.INJURED NUMBER.OF.MOTORIST.KILLED
## 1 0 0
## 2 1 0
## 3 1 0
## 4 0 0
## 5 0 0
## 6 0 0
## CONTRIBUTING.FACTOR.VEHICLE.1 CONTRIBUTING.FACTOR.VEHICLE.2
## 1 Driver Inattention/Distraction Unspecified
## 2 Driver Inattention/Distraction Prescription Medication
## 3 Physical Disability Unspecified
## 4 Unspecified Unspecified
## 5 Unspecified Unspecified
## 6 Unspecified Unspecified
## CONTRIBUTING.FACTOR.VEHICLE.3 CONTRIBUTING.FACTOR.VEHICLE.4
## 1
## 2
## 3
## 4
## 5
## 6
## CONTRIBUTING.FACTOR.VEHICLE.5 UNIQUE.KEY VEHICLE.TYPE.CODE.1
## 1 3167071 PASSENGER VEHICLE
## 2 3243985 TAXI
## 3 85811 PASSENGER VEHICLE
## 4 3358626 SPORT UTILITY / STATION WAGON
## 5 313499 PASSENGER VEHICLE
## 6 301002 PASSENGER VEHICLE
## VEHICLE.TYPE.CODE.2 VEHICLE.TYPE.CODE.3 VEHICLE.TYPE.CODE.4
## 1 LARGE COM VEH(6 OR MORE TIRES)
## 2 PASSENGER VEHICLE
## 3 SPORT UTILITY / STATION WAGON
## 4 SPORT UTILITY / STATION WAGON
## 5 SMALL COM VEH(4 TIRES)
## 6 UNKNOWN
## VEHICLE.TYPE.CODE.5
## 1
## 2
## 3
## 4
## 5
## 6
Upon inspection, the CONTRIBUTING.FACTOR.VEHICLE.1 thru 5 and VEHICLE.TYPE.CODE.1 thru 5 for each vehicle involved are in “wide” format. I need to stack these two variables. Here’s how I did it:
gather to stack the CONTRIBUTING.FACTOR.VEHICLE.1 thru 5 and VEHICLE.TYPE.CODE.1 thru 5 into one column named key. The factor and type are stored in the column named value.str_extract_all to extract the vehicle number and store in a new column VEHICLE.NUM.str_replace_all to remove the vehicle number from the key column.spread to spread the CONTRIBUTING.FACTOR.VEHICLE and VEHICLE.TYPE.CODE rows into columns.idx1 <- which(names(c.data) == "VEHICLE.TYPE.CODE.1")
idx2 <- which(names(c.data) == "VEHICLE.TYPE.CODE.5")
idx3 <- which(names(c.data) == "CONTRIBUTING.FACTOR.VEHICLE.1")
idx4 <- which(names(c.data) == "CONTRIBUTING.FACTOR.VEHICLE.5")
c.data <- gather(c.data, key, value, c(idx1:idx2, idx3:idx4))
c.data$VEHICLE.NUM <- unlist(as.numeric(str_extract_all(c.data$key, pattern = "[0-9]")))
c.data$key <- str_replace_all(c.data$key, pattern = ".[0-9]", replacement = "")
c.data <- spread(c.data, key, value)
str(c.data)
## 'data.frame': 56275 obs. of 23 variables:
## $ ORI.ROW.NUM : int 594464 516462 1064397 402796 772612 798178 530865 230645 1091157 435177 ...
## $ DATE : chr "02/09/2015" "06/22/2015" "10/19/2012" "12/22/2015" ...
## $ TIME : chr "16:49" "18:20" "19:50" "19:25" ...
## $ BOROUGH : chr "BROOKLYN" "MANHATTAN" "BRONX" "MANHATTAN" ...
## $ ZIP.CODE : int 11230 10003 10456 10033 10033 10304 NA 11223 11225 10470 ...
## $ LATITUDE : num 40.6 40.7 40.8 40.9 40.8 ...
## $ LONGITUDE : num -74 -74 -73.9 -73.9 -73.9 ...
## $ LOCATION : chr "(40.6119531, -73.9681646)" "(40.7326034, -73.9876815)" "(40.8317235, -73.9150083)" "(40.8504481, -73.9366612)" ...
## $ ON.STREET.NAME : chr "OCEAN PARKWAY " "EAST 13 STREET " "MORRIS AVENUE " "WEST 181 STREET " ...
## $ CROSS.STREET.NAME : chr "AVENUE O " "3 AVENUE " "MCCLELLAN STREET " "BENNETT AVENUE " ...
## $ OFF.STREET.NAME : chr "" "" "" "" ...
## $ NUMBER.OF.PERSONS.INJURED : int 0 1 1 0 0 0 0 2 1 1 ...
## $ NUMBER.OF.PERSONS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.PEDESTRIANS.INJURED: int 0 0 0 0 0 0 0 0 1 1 ...
## $ NUMBER.OF.PEDESTRIANS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.CYCLIST.INJURED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.CYCLIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.MOTORIST.INJURED : int 0 1 1 0 0 0 0 2 0 0 ...
## $ NUMBER.OF.MOTORIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ UNIQUE.KEY : int 3167071 3243985 85811 3358626 313499 301002 3229815 3531769 152896 3326684 ...
## $ VEHICLE.NUM : num 1 1 1 1 1 1 1 1 1 1 ...
## $ CONTRIBUTING.FACTOR.VEHICLE : chr "Driver Inattention/Distraction" "Driver Inattention/Distraction" "Physical Disability" "Unspecified" ...
## $ VEHICLE.TYPE.CODE : chr "PASSENGER VEHICLE" "TAXI" "PASSENGER VEHICLE" "SPORT UTILITY / STATION WAGON" ...
head(c.data)
## ORI.ROW.NUM DATE TIME BOROUGH ZIP.CODE LATITUDE LONGITUDE
## 1 594464 02/09/2015 16:49 BROOKLYN 11230 40.61195 -73.96816
## 2 516462 06/22/2015 18:20 MANHATTAN 10003 40.73260 -73.98768
## 3 1064397 10/19/2012 19:50 BRONX 10456 40.83172 -73.91501
## 4 402796 12/22/2015 19:25 MANHATTAN 10033 40.85045 -73.93666
## 5 772612 04/01/2014 13:15 MANHATTAN 10033 40.84812 -73.93088
## 6 798178 02/12/2014 8:05 STATEN ISLAND 10304 40.59759 -74.09609
## LOCATION ON.STREET.NAME
## 1 (40.6119531, -73.9681646) OCEAN PARKWAY
## 2 (40.7326034, -73.9876815) EAST 13 STREET
## 3 (40.8317235, -73.9150083) MORRIS AVENUE
## 4 (40.8504481, -73.9366612) WEST 181 STREET
## 5 (40.8481201, -73.9308839) WEST 181 STREET
## 6 (40.5975895, -74.0960906) DUTCHESS AVENUE
## CROSS.STREET.NAME OFF.STREET.NAME
## 1 AVENUE O
## 2 3 AVENUE
## 3 MCCLELLAN STREET
## 4 BENNETT AVENUE
## 5 AMSTERDAM AVENUE
## 6 VISTA AVENUE
## NUMBER.OF.PERSONS.INJURED NUMBER.OF.PERSONS.KILLED
## 1 0 0
## 2 1 0
## 3 1 0
## 4 0 0
## 5 0 0
## 6 0 0
## NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## NUMBER.OF.MOTORIST.INJURED NUMBER.OF.MOTORIST.KILLED UNIQUE.KEY
## 1 0 0 3167071
## 2 1 0 3243985
## 3 1 0 85811
## 4 0 0 3358626
## 5 0 0 313499
## 6 0 0 301002
## VEHICLE.NUM CONTRIBUTING.FACTOR.VEHICLE VEHICLE.TYPE.CODE
## 1 1 Driver Inattention/Distraction PASSENGER VEHICLE
## 2 1 Driver Inattention/Distraction TAXI
## 3 1 Physical Disability PASSENGER VEHICLE
## 4 1 Unspecified SPORT UTILITY / STATION WAGON
## 5 1 Unspecified PASSENGER VEHICLE
## 6 1 Unspecified PASSENGER VEHICLE
The data is now in its tidy form.
First I would like to know which borough has the most accident. I use filter to remove the rows that does not have borough information, and use table to find the result. Since each accident is now recorded in 5 rows (each row for a vehicle), I divide the values calculated by 5.
table(c.data$BOROUGH) / 5
##
## BRONX BROOKLYN MANHATTAN QUEENS
## 3069 1089 2564 2073 2095
## STATEN ISLAND
## 365
table(c.data$BOROUGH) / sum(table(c.data$BOROUGH))
##
## BRONX BROOKLYN MANHATTAN QUEENS
## 0.27267881 0.09675700 0.22780986 0.18418481 0.18613949
## STATEN ISLAND
## 0.03243003
The data shows that Brooklyn has majority of the accidents in NYC, and Staten Island has the least.
Next I would like to know what kind of vehicles are involved in most accident. However, I first have to filter out the blank rows.
c1.data <- filter(c.data, VEHICLE.TYPE.CODE != "")
vtype <- table(c1.data$VEHICLE.TYPE.CODE)
barplot(vtype, las = 3)
It looks like overwhelming majority of accidents invovled “passenger vehicles”. This makes sense since most of vehicles on the road are passenger vehicles, I presume.
What about contribution factors? Which factor causes the most accidents?
I filter out the blank row as well as the “Unspecified” row. There are many factors contributing to accident. Here I look at the top 5 factors.
c2.data <- filter(c.data, !(CONTRIBUTING.FACTOR.VEHICLE %in% c("", "Unspecified")))
cfactor <- sort(table(c2.data$CONTRIBUTING.FACTOR.VEHICLE))
cfactor[(length(cfactor)-4) : length(cfactor)] / sum(cfactor)
##
## Backing Unsafely Other Vehicular
## 0.05918142 0.08061394
## Fatigued/Drowsy Failure to Yield Right-of-Way
## 0.08780420 0.09126106
## Driver Inattention/Distraction
## 0.27917588
It looks like “driver inattention/distraction” causes overwhelming majority of the accidents. So, we better pay attention while driving… Especially in New York!
This dataset contains records of shack attack incidents reported all over the world, from Year 2014 going all the way back to Year 1845.
The original data can be found here: http://www.sharkattackfile.net/incidentlog.htm
theURL3 <- "https://raw.githubusercontent.com/Tyllis/Data607/master/attacks.csv"
s.data <- read.csv(theURL3, stringsAsFactors = F)
str(s.data)
## 'data.frame': 25614 obs. of 22 variables:
## $ Case.Number : chr "2017.06.11" "2017.06.10.b" "2017.06.10.a" "2017.06.07.R" ...
## $ Date : chr "11-Jun-17" "10-Jun-17" "10-Jun-17" "Reported 07-Jun-2017" ...
## $ Year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ Type : chr "Unprovoked" "Unprovoked" "Unprovoked" "Unprovoked" ...
## $ Country : chr "AUSTRALIA" "AUSTRALIA" "USA" "UNITED KINGDOM" ...
## $ Area : chr "Western Australia" "Victoria" "Florida" "South Devon" ...
## $ Location : chr "Point Casuarina, Bunbury" "Flinders, Mornington Penisula" "Ponce Inlet, Volusia County" "Bantham Beach" ...
## $ Activity : chr "Body boarding" "Surfing" "Surfing" "Surfing " ...
## $ Name : chr "Paul Goff" "female" "Bryan Brock" "Rich Thomson" ...
## $ Sex : chr "M" "F" "M" "M" ...
## $ Age : chr "48" "" "19" "30" ...
## $ Injury : chr "No injury, board bitten" "No injury, knocke off board" "Laceration to left foot" "Bruise to leg, cuts to hand sustained when he hit the shark" ...
## $ Fatal..Y.N. : chr "N" "N" "N" "N" ...
## $ Time : chr "08h30" "15h45" "10h00" "" ...
## $ Species : chr "White shark, 4 m" "7 gill shark" "" "3m shark, probably a smooth hound" ...
## $ Investigator.or.Source: chr "WA Today, 6/11/2017" "" "Daytona Beach News-Journal, 6/10/2017" "C. Moore, GSAF" ...
## $ pdf : chr "2017.06.11-Goff.pdf" "2017.06.10.b-Flinders.pdf" "2017.06.10.a-Brock.pdf" "2017.06.07.R-Thomson.pdf" ...
## $ href.formula : chr "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.11-Goff.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.b-Flinders.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.a-Brock.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.07.R-Thomson.pdf" ...
## $ href : chr "http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/201"| __truncated__ "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.b-Flinders.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.a-Brock.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.07.R-Thomson.pdf" ...
## $ Case.Number.1 : chr "2017.06.11" "2017.06.10.b" "2017.06.10.a" "2017.06.07.R" ...
## $ Case.Number.2 : chr "2017.06.11" "2017.06.10.b" "2017.06.10.a" "2017.06.07.R" ...
## $ original.order : int 6095 6094 6093 6092 6091 6090 6089 6088 6087 6086 ...
tail(s.data)
## Case.Number Date Year Type Country Area Location Activity Name Sex
## 25609 NA
## 25610 NA
## 25611 NA
## 25612 NA
## 25613 NA
## 25614 xx NA
## Age Injury Fatal..Y.N. Time Species Investigator.or.Source pdf
## 25609
## 25610
## 25611
## 25612
## 25613
## 25614
## href.formula href Case.Number.1 Case.Number.2 original.order
## 25609 NA
## 25610 NA
## 25611 NA
## 25612 NA
## 25613 NA
## 25614 NA
When the csv file was imported, there are some blank rows introduced at the end of the table. Also, there seems to be two duplicated Case.Number columns at the end. I remove these first.
s.data <- s.data %>%
filter(!(Case.Number %in% c("", NA, 0, "xx"))) %>%
select(-Case.Number.1) %>%
select(-Case.Number.2)
The data is now in its tidy form.
First I’m interested to see if the number of reported attacks has increased/decreased over the years. I will have to filter the Year column using as.numeric, and drop any 0 or NA rows.
s1.data <- s.data
s1.data$Year <- as.numeric(s1.data$Year)
s1.data <- filter(s1.data, !(Year %in% c(0, NA)))
a <- s1.data %>%
group_by(Year) %>%
summarise(Attacks = n())
plot(a, type = "l")
Looks like most attacks are reported in the last 100 years. Let’s look at attacks from 1900 to 2016.
b <- s1.data %>%
filter(Year >= 1900 & Year < 2017) %>%
group_by(Year) %>%
summarise(Attacks = n())
plot(b, type = "l")
There is a spike in 1960. It has been increasing since. I think that the increasing trend is probably due to advancement of information technologies since 1980s - it is getting easier for people around the world to communicate and share shack attack reports.
Next I want to see the percentage of fatalities reported.
s2.data <- filter(s.data, Fatal..Y.N. %in% c("Y", "N"))
(a <- table(s2.data$Fatal..Y.N.))
##
## N Y
## 4391 1566
a/sum(a)
##
## N Y
## 0.737116 0.262884
So the percentage of fatality is about 26.2884002%.
Let’s look at the type of attacks.
table(s.data$Type)
##
## Boat Boating Invalid Provoked
## 4 202 110 529 563
## Sea Disaster Unprovoked
## 220 4466
Majority of the attacks are unprovoked. The GSAF website defines a provoked incident as one in which the shark was speared, hooked, captured or in which a human drew “first blood”.
Lastly, I want to see if the Species variable can tell me something about shack attacks.
However, this variable is very messy, containing all kinds of strings:
(len <- length(unique(s.data$Species)))
## [1] 1555
This column contains 1555 number of different descriptions. Because shack attacks usually happen very fast, victims typically were just able to remember the length, sizes (weights), colors, and/or other apparent features of the shack. This Species column is basically a description of the shacks.
My goal is to use stringr to extract the length of the attacking shacks. Even just this task is difficult:
I use this pattern to extract the length: pattern = “[0-9.]+ ?[cm’]+”. This pattern will perform the following search:
Note that this search pattern will not extract every length information from all rows that contains the information. Since the data is so messy, loss is inevitable. My hope is that this will retain enough data for analysis. Also keep in mind that not all rows has length information.
species <- s.data$Species
slen <- unlist(str_extract(species, pattern = "[0-9.]+ ?[cm']+"))
slen <- slen[!is.na(slen)]
length(slen)
## [1] 1951
So I have extracted 1951 non-NA rows. Recall that the original data has 6094 rows. I have retained 32.0150968% of the dataset.
Next I extract units from the numbers, and create a data.frame object containing the lengths.
num <- unlist(str_extract(slen, pattern = "[0-9.]+"))
unit <- unlist(str_extract(slen, pattern = "[cm']+"))
slen.data <- data.frame(num = as.numeric(num), unit, stringsAsFactors = F)
## Warning in data.frame(num = as.numeric(num), unit, stringsAsFactors = F):
## NAs introduced by coercion
When converting num to numeric, I introduced NA rows. I will filter out these NA rows, instead of trying to find out the reason.
slen.data <- filter(slen.data, !is.na(num))
table(is.na(slen.data$num))
##
## FALSE
## 1947
Notice only few rows are removed.
Next I try to convert the units from metric to British. I simply subset out the metric units, apply the conversion factor, then stick them back together in a vector.
cm <- filter(slen.data, unit == "cm")$num * 0.0328084
m <- filter(slen.data, unit == "m")$num * 3.28084
mm <- filter(slen.data, unit == "mm")$num * 0.00328084
ft <- filter(slen.data, unit == "'")$num * 1
shack.len.ft <- c(cm, m, mm, ft)
str(shack.len.ft)
## num [1:1947] 1.31 1.97 2.62 1.97 1.97 ...
Now I can finally do some analysis.
(a <- summary(shack.len.ft))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0164 4.9213 7.0000 8.4710 11.4829 24.9344
So average length of a attacking shack is 8.4709872 ft, with the median length being 7 ft. The maximum size ever observed was a monsterous 24.934384 ft shack.
Let’s look at the distribution.
hist(shack.len.ft)
It is bell-shape, unimodal, and skewed to the right.