Collecting - Processing - Integrating - Visualizing
Data for the project was collected from three different sources
1) Parking violations data from the portal https://www.opendata.dc.gov. The dataset constains information about parking violations in Washington, DC in March 2015.
2) DC bikeshare data from the portal https://www.capitalbikeshare.com/. The dataset contains information about bike rides in March 2015.
3) DC bikeshare stations from the portal https://www.capitalbikeshare.com/. The dataset contains information about bike stations in Washington, DC.
4) DC crime data from the portal https://www.opendata.gov.dc/. The dataset contains information about crimes in Washington DC in March 2015 (this dataset is processing).
Data will be processed in the following order
Firstly, cleaning - removing NAs and renaming columns. Some cleaning will be done in Excel
Secondly, transforming data is some convenient formats. For instance, date and time will be transformed into more appropriate format.
Integrating will require certain merging of data. For example, attempt to merge bike share data with bike stations data in order to visualize it properly
Other integration ideas are under consideration now.
Data will be vizualized with ggplot2 package primarily.
The planned final project will be a mixture of dataset about parking violations, bike rides, and crimes in Washington, DC in March 2015.
The project is inspired by “The Broken Windows Theory”, introduced by James Q. Wilson and George L. Kelling. The “Broken Windows Theory”" is a criminological theory of the norm-setting and signaling effect of urban disorder and vandalism on additional crime and anti-social behavior. The theory states that maintaining and monitoring urban environments to prevent small crimes such as vandalism, public drinking, and toll-jumping helps to create an atmosphere of order and lawfulness, thereby preventing more serious crimes from happening. https://en.wikipedia.org/wiki/Broken_windows_theory
As I described before, I will try to analyze and combine four different datasets and find certain patterns among them.
What is common among the data sets to provide the linkage (Descriptive, Visual). The common thing about datasets that they have simmilar structure. Each event has certain characteristics - for instance, date and time, location coordinates, and others. Each row of data describe one event.
Primarily I plan to use dplyr package for data wrangling.
So far, I am working with existing data without too much acquisition. But I will try to merge data of bike rides and bike stations in order to demonstrate some patterns of geolocation.
All detailed information about datasets (without crime dataset, whch is in process) will be provided below, describing names of datasets, column names, the number of rows, and others. I will use such functions as str(), summary(), colnames() and others.
Proposed datasets have pretty high quality of data, but missing data also exists. For instance, ticket issue column misses 3777 rows with date and time. As a result, I removed NAs in order to visualize ticket time more properly. Other columns in the same dataset also have missing values. For example, 1. VIOLATION_DESCRIPTION, which specifies the type of violation. Currently 4281 out of 143693 cells in the column are empty (about 3%). 2. BODY_STYLE, which specifies the US state, where the car-violator is registered. Currently 8126 out of 143693 cells in the column are empty (less than 5 %).
At the same time, the problem with dataset is that they have too much data. It creates a mess (will be demonstrated below), when I tried to visualize it. As a result, data should be splitted.
Data is pretty structured serious problems are not expected here
Conducted steps will be demonstrated below
Some data cleansing was already made. Date and time was cleaned for creating histograms (see below). Steps/functions are provided below
** This project will unlikely analyze data from the econometrics perspective; hence, no multicollinearity is expected.**
This part is still under consideration.
# Packages
library(ggplot2)
library(dplyr)
library(ggthemes)
library(ggmap)
library(leaflet)
# Directory
setwd("C:/DC/Advanced Information Analytics/Final Project")
tickets <- read.csv("Parking_Violations_in_March_2015.csv", header = TRUE)
summary(tickets)
## ï..X Y OBJECTID ROWID_
## Min. :-77.11 Min. :38.82 Min. :24665146 Min. : 27453
## 1st Qu.:-77.04 1st Qu.:38.90 1st Qu.:28296845 1st Qu.: 3652150
## Median :-77.03 Median :38.90 Median :31124561 Median : 6507797
## Mean :-77.03 Mean :38.91 Mean :31276505 Mean : 6651752
## 3rd Qu.:-77.02 3rd Qu.:38.92 3rd Qu.:34291891 3rd Qu.: 9669648
## Max. :-76.91 Max. :38.99 Max. :38657620 Max. :13913115
##
## DAY_OF_WEEK HOLIDAY WEEK_OF_YEAR MONTH_OF_YEAR ISSUE_TIME
## FRIDAY :19389 Min. :0 Min. :10.0 Min. :3 Min. : 0
## MONDAY :28189 1st Qu.:0 1st Qu.:11.0 1st Qu.:3 1st Qu.: 956
## SATURDAY :10824 Median :0 Median :12.0 Median :3 Median :1251
## SUNDAY : 5010 Mean :0 Mean :11.9 Mean :3 Mean :1290
## THURSDAY :24040 3rd Qu.:0 3rd Qu.:13.0 3rd Qu.:3 3rd Qu.:1650
## TUESDAY :29892 Max. :0 Max. :14.0 Max. :3 Max. :2359
## WEDNESDAY:26349 NA's :72
## VIOLATION_CODE
## P039 :20280
## P003 :12330
## P281 :10513
## P173 : 9543
## P012 : 9072
## P259 : 6426
## (Other):75529
## VIOLATION_DESCRIPTION
## PARK AT EXPIRED METER :20280
## RESIDENTIAL PERMIT PKING BEYOND LIMIT W/O PERMIT:12330
## FAIL TO DISPLAY A MULTISPACE METER RECEIPT :10513
## NO PARKING STREET CLEANING : 9543
## DISOBEYING OFFICIAL SIGN : 9072
## NO STOPPING OR STANDING IN PM RUSH HOUR ZONE : 6426
## (Other) :75529
## LOCATION RP_PLATE_STATE
## 1300 BLOCK H ST NE SOUTH SIDE : 356 MD :51584
## 1300 BLOCK F ST NW NORTH SIDE : 333 DC :42415
## 1300 BLOCK H ST NE NORTH SIDE : 284 VA :31906
## 1300 BLOCK G ST NW NORTH SIDE : 282 PA : 2057
## 600 BLOCK MARYLAND AVE SW SOUTH SID: 250 NY : 1620
## 1300 BLOCK S ST NW NORTH SIDE : 224 FL : 1525
## (Other) :141964 (Other):12586
## BODY_STYLE ADDRESS_ID STREETSEGID XCOORD
## 4D :80913 Min. : 16 Min. : 1 Min. :390217
## UT :16451 1st Qu.:802201 1st Qu.: 3375 1st Qu.:396372
## 2D :11488 Median :806782 Median : 6812 Median :397528
## VA : 9798 Mean :738652 Mean : 6764 Mean :397656
## : 8126 3rd Qu.:811335 3rd Qu.:10191 3rd Qu.:398580
## TK : 6920 Max. :816524 Max. :15627 Max. :407773
## (Other): 9997 NA's :4
## YCOORD TICKET_ISSUE_DATE
## Min. :128055 2015-03-25T00:00:00.000Z: 7346
## 1st Qu.:136593 2015-03-24T00:00:00.000Z: 7300
## Median :137300 2015-03-18T00:00:00.000Z: 7298
## Mean :137700 2015-03-19T00:00:00.000Z: 7265
## 3rd Qu.:138634 2015-03-26T00:00:00.000Z: 7209
## Max. :147149 2015-03-12T00:00:00.000Z: 6743
## (Other) :100532
str(tickets) # 143693 obs. of 19 variables
## 'data.frame': 143693 obs. of 19 variables:
## $ ï..X : num -77 -77 -77 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 24838706 24838707 24838708 24858535 24859070 24859071 24859072 24859073 24859074 24859075 ...
## $ ROWID_ : int 200051 200052 200053 221785 221904 221905 221906 221907 221908 221909 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 6 6 6 5 5 5 5 5 5 5 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 14 14 14 12 12 12 12 12 12 12 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 1238 1738 2052 841 1917 1917 1000 857 1410 2044 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 3 3 94 2 9 76 78 72 78 94 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 125 125 23 126 11 39 57 64 57 23 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 8955 10612 16615 1411 5873 5873 17138 16459 17556 12230 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 11 11 61 27 11 11 61 27 11 61 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 7 12 47 49 12 12 54 49 12 12 ...
## $ ADDRESS_ID : int 804577 802342 807504 243076 812954 812954 803748 807858 800400 814423 ...
## $ STREETSEGID : int 3948 2626 6437 3036 11134 11134 3354 6734 335 12430 ...
## $ XCOORD : int 400277 400656 398186 396796 398598 398598 398003 398186 398684 397560 ...
## $ YCOORD : int 132405 140124 136476 137342 138408 138408 134827 136885 136994 136190 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 31 31 31 19 19 19 19 19 19 19 ...
glimpse(tickets) # Observations: 143,693. Variables: 19
## Observations: 143,693
## Variables: 19
## $ ï..X <dbl> -76.99681, -76.99244, -77.02091, -77.036...
## $ Y <dbl> 38.85947, 38.92900, 38.89614, 38.90394, ...
## $ OBJECTID <int> 24838706, 24838707, 24838708, 24858535, ...
## $ ROWID_ <int> 200051, 200052, 200053, 221785, 221904, ...
## $ DAY_OF_WEEK <fctr> TUESDAY, TUESDAY, TUESDAY, THURSDAY, TH...
## $ HOLIDAY <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ WEEK_OF_YEAR <int> 14, 14, 14, 12, 12, 12, 12, 12, 12, 12, ...
## $ MONTH_OF_YEAR <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ ISSUE_TIME <int> 1238, 1738, 2052, 841, 1917, 1917, 1000,...
## $ VIOLATION_CODE <fctr> P003, P003, P281, P002, P012, P170, P17...
## $ VIOLATION_DESCRIPTION <fctr> RESIDENTIAL PERMIT PKING BEYOND LIMIT W...
## $ LOCATION <fctr> 2600 BLOCK WADE RD SE EAST SIDE, 3200 B...
## $ RP_PLATE_STATE <fctr> DC, DC, VA, MD, DC, DC, VA, MD, DC, VA,...
## $ BODY_STYLE <fctr> 2D, 4D, SU, TK, 4D, 4D, UT, TK, 4D, 4D,...
## $ ADDRESS_ID <int> 804577, 802342, 807504, 243076, 812954, ...
## $ STREETSEGID <int> 3948, 2626, 6437, 3036, 11134, 11134, 33...
## $ XCOORD <int> 400277, 400656, 398186, 396796, 398598, ...
## $ YCOORD <int> 132405, 140124, 136476, 137342, 138408, ...
## $ TICKET_ISSUE_DATE <fctr> 2015-03-31T00:00:00.000Z, 2015-03-31T00...
colnames(tickets) # names of columns
## [1] "ï..X" "Y"
## [3] "OBJECTID" "ROWID_"
## [5] "DAY_OF_WEEK" "HOLIDAY"
## [7] "WEEK_OF_YEAR" "MONTH_OF_YEAR"
## [9] "ISSUE_TIME" "VIOLATION_CODE"
## [11] "VIOLATION_DESCRIPTION" "LOCATION"
## [13] "RP_PLATE_STATE" "BODY_STYLE"
## [15] "ADDRESS_ID" "STREETSEGID"
## [17] "XCOORD" "YCOORD"
## [19] "TICKET_ISSUE_DATE"
# Columns need to be renamed in Excel or in R
# Monday
monday <- tickets[tickets$DAY_OF_WEEK == "MONDAY",]
str(monday) # 28189 tickets on Mondays. We had five Mondays.
## 'data.frame': 28189 obs. of 19 variables:
## $ ï..X : num -77 -77.1 -77.1 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 25032129 25032130 25032131 25032132 25032133 25032134 25032135 25032136 25032137 25032138 ...
## $ ROWID_ : int 402877 402878 402879 402880 402881 402882 402883 402884 402885 402886 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 14 14 14 14 14 14 14 14 14 14 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 920 1233 725 731 1033 709 1304 2204 2003 1613 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 89 43 72 11 78 82 9 47 94 6 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 60 120 64 14 57 84 11 54 23 83 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 7938 10706 8578 4874 17759 10744 14268 12228 10316 20920 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 4 11 11 11 10 11 27 61 27 11 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 12 12 12 54 54 12 54 12 12 54 ...
## $ ADDRESS_ID : int 810340 805250 808981 806811 809535 800814 811975 805338 273857 254617 ...
## $ STREETSEGID : int 8859 4502 7692 5845 8151 600 10275 4577 12650 11017 ...
## $ XCOORD : int 395858 394574 393732 396807 397907 394489 398477 400846 394856 397594 ...
## $ YCOORD : int 136859 137844 139296 139163 141438 142422 136477 136259 137469 143111 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 30 30 30 30 30 30 30 30 30 30 ...
28189/5 # Every monday - 5637.8 tickets or approximately 5638
## [1] 5637.8
# Tuesday
tuesday <- tickets[tickets$DAY_OF_WEEK == "TUESDAY",]
str(tuesday) # 29892 tickets on Tuesday. We had five Tuesdays.
## 'data.frame': 29892 obs. of 19 variables:
## $ ï..X : num -77 -77 -77 -77 -77.1 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 24838706 24838707 24838708 24861787 24861788 24861789 24861790 24861791 24861792 24861793 ...
## $ ROWID_ : int 200051 200052 200053 223406 223407 223408 223409 223410 223411 223412 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 14 14 14 12 12 12 12 12 12 12 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 1238 1738 2052 1338 157 154 200 228 750 1058 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 3 3 94 11 54 54 54 54 72 47 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 125 125 23 14 1 1 1 1 64 54 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 8955 10612 16615 3032 8770 8939 7981 17522 14309 8004 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 11 11 61 27 19 39 15 61 11 11 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 7 12 47 12 54 54 12 12 7 12 ...
## $ ADDRESS_ID : int 804577 802342 807504 807842 804652 804652 814231 801824 807978 809604 ...
## $ STREETSEGID : int 3948 2626 6437 6721 4010 4010 12265 1682 6834 8211 ...
## $ XCOORD : int 400277 400656 398186 397497 395424 395424 395817 396578 398117 400823 ...
## $ YCOORD : int 132405 140124 136476 137186 138083 138083 138142 136992 139950 133087 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 31 31 31 17 17 17 17 17 17 17 ...
29892/5 # Every Tuesday - 5978.4 tickets or approximately 5978
## [1] 5978.4
# Wednesday
wednesday <- tickets[tickets$DAY_OF_WEEK == "WEDNESDAY",]
str(wednesday) # 26349 tickets on Wednesdays. We had four Wednesdays.
## 'data.frame': 26349 obs. of 19 variables:
## $ ï..X : num -77 -77 -77 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 25088369 25088370 25088371 25088372 25088373 25088374 25088375 25088376 25088377 25088378 ...
## $ ROWID_ : int 438142 438143 438144 438145 438146 438147 438148 438149 438150 438151 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 12 12 12 12 12 12 12 12 12 12 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 1216 824 1333 1129 730 1234 757 1435 931 2101 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 94 72 31 2 133 78 33 2 78 33 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 23 64 90 126 123 57 80 126 57 80 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 1223 18079 2246 21599 1514 12386 8477 21647 12940 6018 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 39 61 61 61 39 27 8 27 11 61 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 56 49 12 49 12 12 12 55 12 12 ...
## $ ADDRESS_ID : int 810366 807502 802767 240249 806616 812831 808968 814139 801547 800219 ...
## $ STREETSEGID : int 8885 6436 2264 2222 5677 11030 7680 12182 1152 184 ...
## $ XCOORD : int 397833 397101 397609 397138 397430 398641 398180 397497 404807 396547 ...
## $ YCOORD : int 136476 137254 136476 137440 137253 137053 139323 136610 136726 137534 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 18 18 18 18 18 18 18 18 18 18 ...
26349/4 # Every Wednesday - 6587.25 tickets or 6587
## [1] 6587.25
# Thursday
thursday <- tickets[tickets$DAY_OF_WEEK == "THURSDAY",]
str(thursday) # 24040 tickets on Thursdays. We had four Thursdays.
## 'data.frame': 24040 obs. of 19 variables:
## $ ï..X : num -77 -77 -77 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 24858535 24859070 24859071 24859072 24859073 24859074 24859075 24859076 24859077 24859078 ...
## $ ROWID_ : int 221785 221904 221905 221906 221907 221908 221909 221910 221911 221912 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 12 12 12 12 12 12 12 12 12 12 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 841 1917 1917 1000 857 1410 2044 2210 1115 752 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 2 9 76 78 72 78 94 78 76 119 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 126 11 39 57 64 57 23 57 39 107 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 1411 5873 5873 17138 16459 17556 12230 2921 8465 7538 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 27 11 11 61 27 11 61 11 27 61 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 49 12 12 54 49 12 12 12 54 12 ...
## $ ADDRESS_ID : int 243076 812954 812954 803748 807858 800400 814423 811171 814267 804964 ...
## $ STREETSEGID : int 3036 11134 11134 3354 6734 335 12430 9570 12297 4282 ...
## $ XCOORD : int 396796 398598 398598 398003 398186 398684 397560 396372 393318 397431 ...
## $ YCOORD : int 137342 138408 138408 134827 136885 136994 136190 137667 139120 138855 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 19 19 19 19 19 19 19 19 19 19 ...
24040/4 # Every Thursday - 6010 tickets
## [1] 6010
# Friday
friday <- tickets[tickets$DAY_OF_WEEK == "FRIDAY",]
str(friday) # 19389 tickets on Fridays. We had four Fridays.
## 'data.frame': 19389 obs. of 19 variables:
## $ ï..X : num -77 -77 -76.9 -77 -77.1 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 25017849 25017850 25017851 25017852 25017853 25017854 25017855 25017856 25017857 25017858 ...
## $ ROWID_ : int 385568 385569 385570 385571 385572 385573 385574 385575 385576 385577 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 11 11 11 11 11 11 11 11 11 11 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 2024 2102 1155 853 1209 945 754 2006 1219 1237 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 9 82 3 6 119 112 72 3 119 76 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 11 84 125 83 107 31 64 125 107 39 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 206 225 15925 16284 8765 31 16282 2856 9841 9266 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 61 61 11 61 27 27 11 27 27 61 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 1 1 54 55 49 12 1 12 54 12 ...
## $ ADDRESS_ID : int 814307 306041 811663 806705 805558 810238 801279 814179 812162 808931 ...
## $ STREETSEGID : int 12329 8852 10006 5749 4766 8766 1257 12220 10438 7649 ...
## $ XCOORD : int 398477 397983 407549 397228 395484 406540 401258 394269 398475 396935 ...
## $ YCOORD : int 136329 136520 136017 136665 137977 135604 136697 137616 136194 139583 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 13 13 13 13 13 13 13 13 13 13 ...
19389/4 # Every Friday - 4847.25 tickets or 4847
## [1] 4847.25
# Saturday
saturday <- tickets[tickets$DAY_OF_WEEK == "SATURDAY",]
str(saturday) # 10824 tickets on Saturdays. We had four Saturdays.
## 'data.frame': 10824 obs. of 19 variables:
## $ ï..X : num -77 -77 -77.1 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 25019284 25019285 25019286 25019287 25019288 25019289 25019290 25019291 25019292 25019293 ...
## $ ROWID_ : int 384925 384926 384927 384928 384929 384930 384931 384932 384933 384934 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 11 11 11 11 11 11 11 11 11 11 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 2326 2326 210 2146 2227 959 2257 1442 1738 2153 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 43 76 10 89 76 33 47 33 2 9 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 120 39 115 60 39 80 54 80 126 11 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 7535 7535 9380 7555 2294 2490 6106 6126 22235 3212 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 8 8 11 27 61 13 27 11 11 11 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 12 12 12 12 7 12 12 1 12 7 ...
## $ ADDRESS_ID : int 813671 813671 814430 801649 811771 802517 800055 235103 238639 802660 ...
## $ STREETSEGID : int 11766 11766 12435 1553 10098 2222 49 8283 11459 2039 ...
## $ XCOORD : int 397564 397564 395165 402466 400791 397171 398190 395967 398224 397330 ...
## $ YCOORD : int 138854 138854 140996 139177 136928 137408 138552 138503 137394 138713 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 14 14 14 14 14 14 14 14 14 14 ...
10824/4 # Every Saturday - 2706 tickets.
## [1] 2706
# Sunday
sunday <- tickets[tickets$DAY_OF_WEEK == "SUNDAY",]
str(sunday) # 5010 tickets on Sundays. We had five Sundays
## 'data.frame': 5010 obs. of 19 variables:
## $ ï..X : num -77 -77 -77 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 39 38.9 ...
## $ OBJECTID : int 25338205 25338206 25338207 25338208 25338209 25338210 25338211 25338212 25338213 25338214 ...
## $ ROWID_ : int 703425 703426 703427 703428 703429 703430 703431 703432 703433 703434 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 12 12 12 12 12 12 12 12 12 12 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 1749 1152 1647 736 2347 132 814 214 407 2348 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 76 47 36 47 77 89 76 126 47 54 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 39 54 102 54 40 60 39 144 54 1 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 18998 18343 19012 3994 4401 2338 13838 8343 6336 16466 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 11 11 27 11 35 61 27 27 11 27 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 12 12 12 1 43 7 12 12 12 12 ...
## $ ADDRESS_ID : int 800849 812047 808100 805481 806635 807343 803381 286782 226361 807978 ...
## $ STREETSEGID : int 632 10341 6941 4702 5694 6297 2869 12850 2917 6834 ...
## $ XCOORD : int 397963 396895 397936 397231 401338 397609 405614 402624 396102 398117 ...
## $ YCOORD : int 136596 137737 134627 144037 135773 137186 137893 139155 139354 139950 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 15 15 15 15 15 15 15 15 15 15 ...
5010/5 # Every Sunday - 1002 tickets.
## [1] 1002
# Visualize data by the day of the week (average)
# Create a dataframe
# Create a dataframe
tickets.frame <- data.frame(
day = factor(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"),
levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
ticket = c(5638, 5978, 6587, 6010, 4847, 2706, 1002)
)
# ggplot graph
tickets.by.day <- ggplot(data=tickets.frame, aes(x=day, y=ticket, fill=day)) + geom_bar(stat="identity") +
labs(title = "Average number of tickets in March 2015 \nper the day of the week",
subtitle = "Source: opendata.dc.gov") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "Day of Week",
y = "Average number of tickets",
fill = "Day of Week") +
scale_color_fivethirtyeight() +
theme_fivethirtyeight() +
theme(axis.line.x = element_line(size = .5, colour = "black"),
axis.title = element_text(size = 14),
legend.position = "right",
legend.direction = "vertical",
legend.box = "vertical",
legend.key.size = unit(0.7, "cm"),
legend.text = element_text(size = 10),
text = element_text(family = "OfficinaSanITC-Book"),
plot.title = element_text(family = "OfficinaSanITC-Book"))
tickets.by.day
# We can see that on weekends the amount of tickets is significantly lower that during
# weekdays.
# We have five weeks - Week 10, Week 11, Week 12, Week 13, Week 14.
# Let's compare
week10 <- tickets[tickets$WEEK_OF_YEAR == "10",]
str(week10) # 21847 tickets (7 days)
## 'data.frame': 21847 obs. of 19 variables:
## $ ï..X : num -77 -77 -77 -77.1 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 28804668 28804669 28804670 28804671 28804672 28804673 28804674 28804675 28804676 28804677 ...
## $ ROWID_ : int 4176967 4176968 4176969 4176970 4176971 4176972 4176973 4176974 4176975 4176976 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 10 10 10 10 10 10 10 10 10 10 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 1149 907 1131 1018 750 2000 2144 1834 923 1652 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 32 33 93 32 33 9 33 1 72 76 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 106 80 17 106 80 11 80 78 64 39 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 8216 8216 1647 8750 16985 15601 1810 5834 17192 10980 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 27 27 27 27 27 61 61 11 11 27 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 55 55 43 55 55 7 54 10 12 54 ...
## $ ADDRESS_ID : int 809051 809051 810334 803114 802853 810070 807502 801673 813267 806368 ...
## $ STREETSEGID : int 7749 7749 8853 2540 2340 8615 6436 1388 11407 5473 ...
## $ XCOORD : int 395743 395743 397700 395459 395953 398316 397101 396387 399214 394331 ...
## $ YCOORD : int 136465 136465 136476 137490 136790 137532 137254 138430 136893 137994 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 3 3 3 3 3 3 3 3 3 3 ...
21847/7 # 3121 ticket per day on average during week 10
## [1] 3121
week11 <- tickets[tickets$WEEK_OF_YEAR == "11",]
str(week11) # 34384 tickets (7 days)
## 'data.frame': 34384 obs. of 19 variables:
## $ ï..X : num -77 -77 -77.1 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 25019284 25019285 25019286 25019287 25019288 25019289 25019290 25019291 25019292 25019293 ...
## $ ROWID_ : int 384925 384926 384927 384928 384929 384930 384931 384932 384933 384934 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 11 11 11 11 11 11 11 11 11 11 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 2326 2326 210 2146 2227 959 2257 1442 1738 2153 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 43 76 10 89 76 33 47 33 2 9 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 120 39 115 60 39 80 54 80 126 11 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 7535 7535 9380 7555 2294 2490 6106 6126 22235 3212 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 8 8 11 27 61 13 27 11 11 11 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 12 12 12 12 7 12 12 1 12 7 ...
## $ ADDRESS_ID : int 813671 813671 814430 801649 811771 802517 800055 235103 238639 802660 ...
## $ STREETSEGID : int 11766 11766 12435 1553 10098 2222 49 8283 11459 2039 ...
## $ XCOORD : int 397564 397564 395165 402466 400791 397171 398190 395967 398224 397330 ...
## $ YCOORD : int 138854 138854 140996 139177 136928 137408 138552 138503 137394 138713 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 14 14 14 14 14 14 14 14 14 14 ...
38384/7 # 5483 (approx.) tickets per day on average during week 11
## [1] 5483.429
week12 <- tickets[tickets$WEEK_OF_YEAR == "12",]
str(week12) # 36384 tickets (7 days)
## 'data.frame': 36384 obs. of 19 variables:
## $ ï..X : num -77 -77 -77 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 24858535 24859070 24859071 24859072 24859073 24859074 24859075 24859076 24859077 24859078 ...
## $ ROWID_ : int 221785 221904 221905 221906 221907 221908 221909 221910 221911 221912 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 12 12 12 12 12 12 12 12 12 12 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 841 1917 1917 1000 857 1410 2044 2210 1115 752 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 2 9 76 78 72 78 94 78 76 119 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 126 11 39 57 64 57 23 57 39 107 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 1411 5873 5873 17138 16459 17556 12230 2921 8465 7538 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 27 11 11 61 27 11 61 11 27 61 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 49 12 12 54 49 12 12 12 54 12 ...
## $ ADDRESS_ID : int 243076 812954 812954 803748 807858 800400 814423 811171 814267 804964 ...
## $ STREETSEGID : int 3036 11134 11134 3354 6734 335 12430 9570 12297 4282 ...
## $ XCOORD : int 396796 398598 398598 398003 398186 398684 397560 396372 393318 397431 ...
## $ YCOORD : int 137342 138408 138408 134827 136885 136994 136190 137667 139120 138855 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 19 19 19 19 19 19 19 19 19 19 ...
36384/7 # 5198 (approx.) tickets per day on average during week 12
## [1] 5197.714
week13 <- tickets[tickets$WEEK_OF_YEAR == "13",]
str(week13) # 38222 tickets (7 days)
## 'data.frame': 38222 obs. of 19 variables:
## $ ï..X : num -77 -77 -77 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 28832176 28832177 28832178 28832179 28832180 28832181 28832182 28832183 28832184 28832185 ...
## $ ROWID_ : int 4190462 4190463 4190464 4190465 4190466 4190467 4190468 4190469 4190470 4190471 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 13 13 13 13 13 13 13 13 13 13 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 1640 1143 1542 1728 729 1626 1245 1025 957 1338 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 3 17 17 27 78 94 119 27 78 94 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 125 112 112 137 57 23 107 137 57 23 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 5183 5486 15402 20404 3041 11129 6759 7441 5841 1090 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 61 11 27 11 44 11 27 27 61 27 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 12 7 12 12 12 12 12 12 54 12 ...
## $ ADDRESS_ID : int 807791 803980 802169 218152 802422 808160 812099 240946 801861 810315 ...
## $ STREETSEGID : int 6680 3382 1814 10578 2139 6992 10385 7498 1717 8835 ...
## $ XCOORD : int 396918 396697 399696 396680 397509 395016 398817 396067 396108 397836 ...
## $ YCOORD : int 138306 137710 136542 140267 140199 140659 136101 137485 138388 136718 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 23 23 23 23 23 23 23 23 23 23 ...
38222/7 # 5460 (approx.) tickets per day on average during week 13
## [1] 5460.286
week14 <- tickets[tickets$WEEK_OF_YEAR == "14",]
str(week14) # 12856 tickets (3 days)
## 'data.frame': 12856 obs. of 19 variables:
## $ ï..X : num -77 -77 -77 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 24838706 24838707 24838708 24904306 24904307 24904308 24904309 24904310 24904311 24904312 ...
## $ ROWID_ : int 200051 200052 200053 267588 267589 267590 267591 267592 267593 267594 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 14 14 14 14 14 14 14 14 14 14 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 1238 1738 2052 1625 2048 1015 1151 1031 2148 2027 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 3 3 94 82 47 78 33 78 94 3 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 125 125 23 84 54 57 80 57 23 125 ...
## $ LOCATION : Factor w/ 22268 levels "#11 6TH ST SE",..: 8955 10612 16615 2264 3744 2473 2231 3579 2997 8195 ...
## $ RP_PLATE_STATE : Factor w/ 68 levels "","AB","AK","AL",..: 11 11 61 49 11 11 61 11 11 11 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 7 12 47 12 12 12 49 12 12 12 ...
## $ ADDRESS_ID : int 804577 802342 807504 812923 803667 802368 803862 242636 800027 813104 ...
## $ STREETSEGID : int 3948 2626 6437 11108 3285 1897 3162 2986 26 11268 ...
## $ XCOORD : int 400277 400656 398186 397609 397318 397747 397495 395939 400930 395863 ...
## $ YCOORD : int 132405 140124 136476 136611 139544 141436 135237 137878 136928 139091 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 31 31 31 31 31 31 31 31 31 31 ...
12856/3 # 4285 (approx.) tickets per day on average dueing week 14
## [1] 4285.333
# Create a dataframe
tickets.frame1 <- data.frame(
week = factor(c("Week 10", "Week 11", "Week 12", "Week 13", "Week 14"),
levels=c("Week 10", "Week 11", "Week 12", "Week 13", "Week 14")),
ticket1 = c(3121, 5483, 5198, 5460, 4285)
)
# ggplot graph
tickets.by.week <- ggplot(data=tickets.frame1, aes(x=week, y=ticket1, fill=week)) + geom_bar(stat="identity") +
labs(title = "Average number of tickets \nin March 2015 per week",
subtitle = "Source: opendata.dc.gov") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "Week",
y = "Average number of tickets",
fill = "Week") +
scale_color_fivethirtyeight() +
theme_fivethirtyeight() +
theme(axis.line.x = element_line(size = .5, colour = "black"),
axis.title = element_text(size = 14),
legend.position = "right",
legend.direction = "vertical",
legend.box = "vertical",
legend.key.size = unit(0.7, "cm"),
legend.text = element_text(size = 10),
text = element_text(family = "OfficinaSanITC-Book"),
plot.title = element_text(family = "OfficinaSanITC-Book"))
tickets.by.week
# We can see that during Week 10 the average number of tickets was the lowest.
ticket.date <- tickets$TICKET_ISSUE_DATE
# Our date format - 2015-03-17T00:00:00.000Z
# Source: https://stackoverflow.com/questions/40938733/converting-character-to-timestamp-in-dataframe
ticket.date1 <- strptime(ticket.date,'%Y-%m-%dT%H:%M:%OSZ') # we transform data
ticket.day <- format(as.POSIXct(ticket.date1, format = '%Y-%m-%dT%H:%M:%OSZ'), "%d")
# ticket.day demonstrates only the day, when the ticket was issued in March 2015
# Create a histogram for March 2015
ticket.hist <- ggplot(data = tickets, aes(x = ticket.day, fill = ticket.day)) + geom_histogram(stat="count") +
labs(title = "Number of tickets per day in March 2015",
subtitle = "Source: opendata.dc.gov") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "Day of Month",
y = "Number of tickets") +
scale_color_fivethirtyeight() +
theme_fivethirtyeight() +
theme(axis.line.x = element_line(size = .5, colour = "black"),
axis.title = element_text(size = 14),
legend.position = "right",
legend.direction = "vertical",
legend.box = "vertical",
legend.key.size = unit(0.7, "cm"),
legend.text = element_text(size = 10),
text = element_text(family = "OfficinaSanITC-Book"),
plot.title = element_text(family = "OfficinaSanITC-Book")) +
theme(legend.position="none")
ticket.hist
# We can see that the number of tickets drops dramatically
# every Saturday and especially Sunday.
# Now we need to analyze bike rides in March 2015
bikes <- read.csv("Bikes March 2015.csv", header = TRUE)
str(bikes) # 193107 obs
## 'data.frame': 193107 obs. of 7 variables:
## $ Total.duration..ms.: int 436022 750081 1306219 935705 140159 1132544 464562 187135 239613 216353 ...
## $ Start.date : Factor w/ 31896 levels "3/1/2015 0:00",..: 1 1 1 2 2 3 4 5 6 6 ...
## $ Start.station : Factor w/ 343 levels "10th & E St NW",..: 280 275 242 121 182 2 343 217 181 181 ...
## $ End.date : Factor w/ 31918 levels "3/1/2015 0:05",..: 2 6 10 8 1 10 5 3 4 4 ...
## $ End.station : Factor w/ 346 levels "10th & E St NW",..: 54 66 238 145 104 172 244 302 108 108 ...
## $ Bike.number : Factor w/ 3116 levels "W00005","W00006",..: 462 68 1100 2735 1707 275 234 322 739 1218 ...
## $ Subscription.Type : Factor w/ 2 levels "Casual","Registered": 2 2 2 2 2 2 2 2 2 2 ...
summary(bikes)
## Total.duration..ms. Start.date
## Min. : 749 3/16/2015 18:13: 37
## 1st Qu.: 351485 3/26/2015 17:26: 37
## Median : 597323 3/9/2015 17:50 : 36
## Mean : 932866 3/12/2015 17:55: 34
## 3rd Qu.: 1010658 3/16/2015 18:23: 34
## Max. :328173346 3/12/2015 17:34: 33
## (Other) :192896
## Start.station End.date
## Columbus Circle / Union Station : 4665 3/16/2015 17:36: 40
## Lincoln Memorial : 3576 3/30/2015 17:19: 37
## Massachusetts Ave & Dupont Circle NW: 3266 3/26/2015 17:59: 36
## Jefferson Dr & 14th St SW : 3053 3/12/2015 17:35: 35
## 15th & P St NW : 2714 3/24/2015 17:43: 35
## New Hampshire Ave & T St NW : 2483 3/26/2015 18:40: 35
## (Other) :173350 (Other) :192889
## End.station Bike.number
## Columbus Circle / Union Station : 4753 W01190 : 156
## Massachusetts Ave & Dupont Circle NW: 3921 W01226 : 154
## Lincoln Memorial : 3513 W20226 : 154
## Jefferson Dr & 14th St SW : 3225 W21588 : 151
## 15th & P St NW : 3030 W21128 : 148
## 14th & V St NW : 2753 W21581 : 148
## (Other) :171912 (Other):192196
## Subscription.Type
## Casual : 33691
## Registered:159416
##
##
##
##
##
colnames(bikes)
## [1] "Total.duration..ms." "Start.date" "Start.station"
## [4] "End.date" "End.station" "Bike.number"
## [7] "Subscription.Type"
# [1] "Total.duration..ms." "Start.date" "Start.station" "End.date"
# [5] "End.station" "Bike.number" "Subscription.Type"
# Attempt to create a histogram of bikes per day
# Date
ride.day <- bikes$Start.date
# 3/2/2015 7:56
ride.day1 <- format(as.POSIXct(ride.day, format = "%m/%d/%Y %H:%M"), "%d")
ride.day1
# Histogramm
ride.hist <- ggplot(data = bikes, aes(x = ride.day1, fill = ride.day1)) + geom_histogram(stat="count") +
labs(title = "Number of rides per day in March 2015",
subtitle = "Source: capitalbikeshare.com") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "Day of Month",
y = "Number of rides") +
scale_color_fivethirtyeight() +
theme_fivethirtyeight() +
theme(axis.line.x = element_line(size = .5, colour = "black"),
axis.title = element_text(size = 14),
legend.position = "right",
legend.direction = "vertical",
legend.box = "vertical",
legend.key.size = unit(0.7, "cm"),
legend.text = element_text(size = 10),
text = element_text(family = "OfficinaSanITC-Book"),
plot.title = element_text(family = "OfficinaSanITC-Book")) +
theme(legend.position="none")
ride.hist
# We can see a certain pattern between the number of tickets on weekends and
# the increasing number of bike rides on weekend. However, it is not super clear.
# Ok, let's try to understand time of tickets a time of bike rides
# Parking tickets
# Preliminary I cleaned data in Excel, deleting 81 rows with 0 or blank cells for issue time.
tickets1 <- read.csv("Parking_Violations_in_March_2015-1.csv", header = TRUE)
str(tickets1) # 139916 obs
## 'data.frame': 139916 obs. of 19 variables:
## $ X : num -77 -77 -77 -77 -77 ...
## $ Y : num 38.9 38.9 38.9 38.9 38.9 ...
## $ OBJECTID : int 24838706 24838707 24838708 24858535 24859070 24859071 24859072 24859073 24859074 24859075 ...
## $ ROWID_ : int 200051 200052 200053 221785 221904 221905 221906 221907 221908 221909 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "FRIDAY","MONDAY",..: 6 6 6 5 5 5 5 5 5 5 ...
## $ HOLIDAY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WEEK_OF_YEAR : int 14 14 14 12 12 12 12 12 12 12 ...
## $ MONTH_OF_YEAR : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ISSUE_TIME : int 1238 1738 2052 841 1917 1917 1000 857 1410 2044 ...
## $ VIOLATION_CODE : Factor w/ 146 levels "P001","P002",..: 3 3 94 2 9 76 78 72 78 94 ...
## $ VIOLATION_DESCRIPTION: Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 125 125 23 126 11 39 57 64 57 23 ...
## $ LOCATION : Factor w/ 21478 levels "#11 6TH ST SE",..: 8626 10203 15911 1386 5657 5657 16410 15763 16814 11735 ...
## $ RP_PLATE_STATE : Factor w/ 67 levels "","AB","AK","AL",..: 11 11 61 27 11 11 61 27 11 61 ...
## $ BODY_STYLE : Factor w/ 58 levels "","15","16","2",..: 7 12 47 49 12 12 54 49 12 12 ...
## $ ADDRESS_ID : int 804577 802342 807504 243076 812954 812954 803748 807858 800400 814423 ...
## $ STREETSEGID : int 3948 2626 6437 3036 11134 11134 3354 6734 335 12430 ...
## $ XCOORD : int 400277 400656 398186 396796 398598 398598 398003 398186 398684 397560 ...
## $ YCOORD : int 132405 140124 136476 137342 138408 138408 134827 136885 136994 136190 ...
## $ TICKET_ISSUE_DATE : Factor w/ 31 levels "2015-03-01T00:00:00.000Z",..: 31 31 31 19 19 19 19 19 19 19 ...
# So I deleted in Excel 3777 rows with missing or messy data about tickets issue time.
# Then I deleted two last character in variable. It shows me only the hour of issued ticket.
# Even if the ticket was issued at 3:58 PM, it will be only 15, not 16 hours.
ticket.time <- tickets1$ISSUE_TIME
ticket.time.cleaned <- gsub('.{2}$', '', ticket.time)
str(ticket.time.cleaned)
## chr [1:139916] "12" "17" "20" "8" "19" "19" "10" "8" "14" "20" "22" ...
# How to reorder the values or hours?
tickets2 <- within(tickets1,
ticket.time.cleaned <- factor(ticket.time.cleaned,
levels=names(sort(table(ticket.time.cleaned),
decreasing=TRUE))))
# Histogram of time
ticket.time.hist <- ggplot(data = tickets2, aes(x = ticket.time.cleaned,
fill = ticket.time.cleaned)) +
geom_histogram(stat="count") +
labs(title = "Number of tickets per hour in March 2015",
subtitle = "Source: opendata.dc.gov") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "Hour of day",
y = "Number of tickets") +
scale_color_fivethirtyeight() +
theme_fivethirtyeight() +
theme(axis.line.x = element_line(size = .5, colour = "black"),
axis.title = element_text(size = 14),
legend.position = "right",
legend.direction = "vertical",
legend.box = "vertical",
legend.key.size = unit(0.7, "cm"),
legend.text = element_text(size = 10),
text = element_text(family = "OfficinaSanITC-Book"),
plot.title = element_text(family = "OfficinaSanITC-Book")) +
theme(legend.position="none")
ticket.time.hist
# So, here we can see which hours have the highest number of tickets
# Now let's compare with rides by hour
start.hour <- format(as.POSIXct(bikes$Start.date, format = "%m/%d/%Y %H:%M"), "%H")
class(start.hour)
as.numeric(start.hour)
# Histogram
bike.hour.hist <- ggplot(data = bikes, aes(x = start.hour, fill = start.hour)) +
geom_histogram(stat="count") +
labs(title = "Number of rides per hour in March 2015",
subtitle = "Source: capitalbikeshare.com") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "Hour",
y = "Number of rides") +
scale_color_fivethirtyeight() +
theme_fivethirtyeight() +
theme(axis.line.x = element_line(size = .5, colour = "black"),
axis.title = element_text(size = 14),
legend.position = "right",
legend.direction = "vertical",
legend.box = "vertical",
legend.key.size = unit(0.7, "cm"),
legend.text = element_text(size = 10),
text = element_text(family = "OfficinaSanITC-Book"),
plot.title = element_text(family = "OfficinaSanITC-Book")) +
theme(legend.position="none")
bike.hour.hist
# Which violations do we have?
violation <- tickets$VIOLATION_DESCRIPTION
str(violation)
## Factor w/ 144 levels "","ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1",..: 125 125 23 126 11 39 57 64 57 23 ...
summary(violation)
## PARK AT EXPIRED METER
## 20280
## RESIDENTIAL PERMIT PKING BEYOND LIMIT W/O PERMIT
## 12330
## FAIL TO DISPLAY A MULTISPACE METER RECEIPT
## 10513
## NO PARKING STREET CLEANING
## 9543
## DISOBEYING OFFICIAL SIGN
## 9072
## NO STOPPING OR STANDING IN PM RUSH HOUR ZONE
## 6426
## NO PARKING ANYTIME
## 5435
## PARK OVERTIME AT A METER
## 4988
## FAILURE TO DISPLAY CURRENT TAGS
## 4951
## UNAUTHORIZED VEHICLE IN LOADING ZONE
## 4906
## FAILURE TO SECURE DC TAGS
## 4299
##
## 4281
## NO STANDING ANYTIME
## 3577
## PARKING ON SNOW EMERGENCY ROUTE WHEN PROHIBITED
## 3116
## PARK OVERTIME IN TIMED ZONE
## 2514
## NO STOPPING OR STANDING IN AM RUSH HOUR ZONE
## 2320
## EXPIRATION TIME ON METER RECEIPT LAPSED
## 2306
## PARKED WITHIN 25 FEET OF A STOP SIGN
## 2297
## PARK IN A BUS ZONE
## 2215
## FAIL TO REPORT FOR INSPECTION.
## 2208
## NO PARKING SPECIFIC HOURS
## 2165
## STAND OR PARK IN ALLEY
## 1992
## RELOCATE TOW FEE
## 1890
## NO FRONT TAGS.
## 1815
## PARK ABREAST OF ANOTHER VEHICLE
## 1685
## PARK WITHIN 5 FEET OF DRIVEWAY OR ALLEY
## 1431
## PARK IN A DESIGNATED ENTRANCE
## 1285
## EMERGENCY NO PARKING
## 1208
## PARK WITHIN 10 FEET OF A FIRE HYDRANT
## 1178
## VEHICLE ON PRIVATE/PUBLIC PROPERTY WITHOUT CONSENT
## 1164
## NO STOPPING ANYTIME
## 1029
## FAIL TO DISPLAY CURRENT INSPECTION STICKER
## 916
## NOT PARKED IN A METER SPACE
## 644
## OBSTRUCTING CROSSWALK
## 519
## PARK IN OFFICIAL PARKING PERMIT ONLY SPACE
## 447
## PARK IN RESERVED SPACE FOR MOTORCYCLES ONLY
## 444
## FAIL TO PROPERLY AFFIX VALIDATION STICKERS
## 421
## STOPPING, STANDING OR PARKING VEHICLE IN BIKE LANE
## 365
## NO STOPPING OR STANDING ANYTIME
## 360
## PARK LESS THAN 40 FEET FROM AN INTERSECTION
## 358
## PARK LESS THAN 25 FEET FROM AN INTERSECTION
## 351
## PARKED ON DESIGNATED VENDOR AREA
## 334
## PARK IN SCHOOL ZONE
## 288
## EXPIRED HANDICAP METER
## 264
## PARK ON PUBLIC SPACE
## 247
## EXPIRED INSPECTION REJECTION STICKER
## 240
## INDIVID W/ DISABILITIES ONLY UNAUTH. USE RSRVD SP
## 238
## PARK WITH LEFT WHEEL TO THE CURB
## 206
## NO REAR TAGS.
## 157
## RESERVED RESIDENTIAL SPACE PERSONS W/ DISABILTIES
## 141
## FAIL TO DISPLAY METER RECEIPT IN A LOADING ZONE
## 140
## GOVERNMENT VEHICLE PARKING ONLY
## 137
## PARKED IN DRIVEWAY OR ALLEY TO OBSTRUCT SIDEWALK
## 134
## PARK ON SIDEWALK OR AREA RESERVED FOR PEDESTRIANS
## 122
## PARK IN RESERVED EMBASSY SPACE
## 119
## NO PARKING 7 AM TO 6:30 PM
## 118
## IMPROPER DISPLAY OF TAGS
## 116
## PARK IN A FIRE LANE
## 111
## OVERSIZED COMM VEHICLE PARKED AT A METER
## 98
## PARK ON TAXICAB STAND
## 98
## STREETCAR-PARK, STOP STAND VEH GUIDEWAY/PLATFORM
## 95
## VALET PARKING/STAGING ZONE
## 81
## PARK MORE THAN 12 INCHES FROM CURB
## 80
## PARK IN MEDIAN, ISLAND, OR SAFETY ZONE
## 79
## NON-TAXI ON HACK STAND
## 54
## OBSTRUCTING ENTRANCE OF PKG GARAGE, DOOR OR GATE
## 49
## MOTOR RUNNING UNATTENDED
## 44
## PARK WITHIN 20 FEET OF A BUS STOP OR ZONE
## 41
## FAIL TO PARK PARALLEL
## 37
## ABANDON VEHICLE ON PUBLIC/PRIVATE PROPERTY VIOL 1
## 32
## PARK IN RESERVED SPACE
## 26
## PARK ON MEDIAN, ISLAND, OR SAFETY ZONE
## 26
## PARK VEHICLE IN FRONT OF A BARRICADE
## 26
## DANGEROUS VEHICLE ON PRIV/PUBLIC PROP VIOL 1
## 22
## PARK ON OR UNDER AN ELEVATED STRUCTURE
## 21
## OBSTRUCTED TAGS.
## 20
## PARK ON PRIVATE OR PUBLIC PROPERTY
## 20
## PARK BETWEEN ISLAND OR SAFETY ZONE
## 19
## NO STANDING COMMERCIAL VEHICLE
## 17
## VEHICLE > 22 PUBLIC ST FRT DWELLING,SCHOOL, HOSP
## 17
## COMMERCIAL MOTOR VEHICLE - NOT OUT OF SERVICE
## 16
## PARK WITHIN 20 FEET OF A FIREHOUSE ENTRANCE
## 16
## FAIL TO PROPERLY DISPLAY MULTISPACE METER RECEIPT
## 15
## OBSTRUCT AN INTERSECTION
## 15
## PARK TO REDUCE ROADWAY TO LESS THAN 10 FEET
## 15
## PARK AT CURB IN LOADING ZONE
## 14
## PARKED IN A RESERVED CAR SHARING SPACE
## 14
## OPP OR BESIDE CONTRUCTION SITE OBSTRUCTING TRAFFIC
## 13
## COMMERCIAL MOTOR VEHICLE - OUT OF SERVICE
## 11
## FAIL TO DISPLAY CURRENT DC VEHICLE TAGS
## 11
## TAXICAB LOITERING.
## 11
## VEHICLE HAUL TRASH, DEBRIS OR JUNK
## 11
## PARK IN AN UNDESIGNATED SPACE
## 10
## COMMERCIAL PMT PARK DECAL, FAIL DISPLAY PROPERLY
## 9
## EXCEED GROSS WEIGHT INDICATED ON IRP REGISTRATION
## 9
## EXPIRED MULTISPACE METER RECEIPT IN LOADING ZONE
## 8
## PARK IN BUS STOP OR ZONE DURING RUSH HOUR
## 8
## PARKED LESS THAN 3 FEET FROM ANOTHER VEHICLE
## 8
## PERMIT UNLICENSED HACKER
## 8
## (Other)
## 113
head(summary(violation), 10) # we can see our top 10 violations
## PARK AT EXPIRED METER
## 20280
## RESIDENTIAL PERMIT PKING BEYOND LIMIT W/O PERMIT
## 12330
## FAIL TO DISPLAY A MULTISPACE METER RECEIPT
## 10513
## NO PARKING STREET CLEANING
## 9543
## DISOBEYING OFFICIAL SIGN
## 9072
## NO STOPPING OR STANDING IN PM RUSH HOUR ZONE
## 6426
## NO PARKING ANYTIME
## 5435
## PARK OVERTIME AT A METER
## 4988
## FAILURE TO DISPLAY CURRENT TAGS
## 4951
## UNAUTHORIZED VEHICLE IN LOADING ZONE
## 4906
# PARK AT EXPIRED METER 20280
# RESIDENTIAL PERMIT PKING BEYOND LIMIT W/O PERMIT 12330
# FAIL TO DISPLAY A MULTISPACE METER RECEIPT 10513
# NO PARKING STREET CLEANING 9543
# DISOBEYING OFFICIAL SIGN 9072
# NO STOPPING OR STANDING IN PM RUSH HOUR ZONE 6426
# NO PARKING ANYTIME 5435
# PARK OVERTIME AT A METER 4988
# FAILURE TO DISPLAY CURRENT TAGS 4951
# UNAUTHORIZED VEHICLE IN LOADING ZONE 4906
DCmap <- qmap("Washington, DC", zoom = 13, extent = "device",
color = "bw", legend = "topleft")
DCmap + geom_point(aes(x = tickets$ï..X, y = tickets$Y,
colour = violation), data = tickets) +
theme(legend.position="none") # this removes legend
# Removed 14820 rows containing missing values (geom_point)
# This map is overloaded.
# Question 8. Visualize all DC bikeshare stations.
bike.stations <- read.csv("Capital_Bike_Share_Locations.csv", header = TRUE)
summary(bike.stations)
## ï..OBJECTID ID ADDRESS
## Min. :159554260 Min. : 1.0 10th & E St NW : 1
## 1st Qu.:159555658 1st Qu.:124.5 10th & Florida Ave NW: 1
## Median :159555775 Median :245.0 10th & G St NW : 1
## Mean :159555522 Mean :244.4 10th & K St NW : 1
## 3rd Qu.:159555892 3rd Qu.:368.5 10th & Monroe St NE : 1
## Max. :159556010 Max. :488.0 10th & U St NW : 1
## (Other) :465
## TERMINAL_NUMBER LATITUDE LONGITUDE INSTALLED LOCKED
## Min. :31000 Min. :38.80 Min. :-77.37 YES:471 NO:471
## 1st Qu.:31118 1st Qu.:38.88 1st Qu.:-77.09
## Median :31309 Median :38.90 Median :-77.05
## Mean :31481 Mean :38.91 Mean :-77.06
## 3rd Qu.:31856 3rd Qu.:38.93 3rd Qu.:-77.02
## Max. :32225 Max. :39.13 Max. :-76.91
##
## INSTALL_DATE REMOVAL_DATE TEMPORARY_INSTALL NUMBER_OF_BIKES
## Mode:logical Mode:logical NO:471 Min. : 0.000
## NA's:471 NA's:471 1st Qu.: 3.000
## Median : 5.000
## Mean : 5.998
## 3rd Qu.: 8.000
## Max. :24.000
##
## NUMBER_OF_EMPTY_DOCKS X Y SE_ANNO_CAD_DATA
## Min. : 0.00 Min. :368067 Min. :125535 Mode:logical
## 1st Qu.: 6.00 1st Qu.:392513 1st Qu.:135129 NA's:471
## Median :10.00 Median :395959 Median :136995
## Mean :10.55 Mean :394576 Mean :138556
## 3rd Qu.:14.00 3rd Qu.:398077 3rd Qu.:140534
## Max. :41.00 Max. :407499 Max. :162007
##
str(bike.stations) # 471 obs. of 16 variables
## 'data.frame': 471 obs. of 16 variables:
## $ ï..OBJECTID : int 159554260 159554261 159554262 159554263 159554264 159554265 159554266 159554267 159554268 159554269 ...
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ADDRESS : Factor w/ 471 levels "10th & E St NW",..: 229 66 88 42 171 363 402 223 16 109 ...
## $ TERMINAL_NUMBER : int 31000 31001 31002 31003 31004 31005 31006 31007 31008 31009 ...
## $ LATITUDE : num 38.9 38.9 38.9 38.9 38.9 ...
## $ LONGITUDE : num -77.1 -77.1 -77 -77 -77.1 ...
## $ INSTALLED : Factor w/ 1 level "YES": 1 1 1 1 1 1 1 1 1 1 ...
## $ LOCKED : Factor w/ 1 level "NO": 1 1 1 1 1 1 1 1 1 1 ...
## $ INSTALL_DATE : logi NA NA NA NA NA NA ...
## $ REMOVAL_DATE : logi NA NA NA NA NA NA ...
## $ TEMPORARY_INSTALL : Factor w/ 1 level "NO": 1 1 1 1 1 1 1 1 1 1 ...
## $ NUMBER_OF_BIKES : int 5 3 8 6 8 5 6 16 5 9 ...
## $ NUMBER_OF_EMPTY_DOCKS: int 8 8 8 4 3 13 8 2 5 9 ...
## $ X : num 395380 395372 395727 395696 394837 ...
## $ Y : num 132352 132161 132069 132485 132230 ...
## $ SE_ANNO_CAD_DATA : logi NA NA NA NA NA NA ...
colnames(bike.stations)
## [1] "ï..OBJECTID" "ID"
## [3] "ADDRESS" "TERMINAL_NUMBER"
## [5] "LATITUDE" "LONGITUDE"
## [7] "INSTALLED" "LOCKED"
## [9] "INSTALL_DATE" "REMOVAL_DATE"
## [11] "TEMPORARY_INSTALL" "NUMBER_OF_BIKES"
## [13] "NUMBER_OF_EMPTY_DOCKS" "X"
## [15] "Y" "SE_ANNO_CAD_DATA"
DCmap.bike.stations <- qmap("Washington, DC", zoom = 13, extent = "device",
color = "bw", legend = "topleft")
DCmap + geom_point(aes(x = bike.stations$LONGITUDE, y = bike.stations$LATITUDE,
colour = bike.stations$NUMBER_OF_BIKES, size = bike.stations$NUMBER_OF_BIKES),
data = bike.stations) +
theme(legend.position="none")
# Removed 226 rows containing missing values (geom_point).
# So, we can see locations of all stations.
# Size and color show us which stations have more bikes.
# Attempt to visualize in Leaflet
m1 <- leaflet(bike.stations) %>% addTiles("http://korona.geog.uni-heidelberg.de/tiles/roadsg/x={x}&y={y}&z={z}",
attribution='Map tiles by <a href="http://stamen.com">Stamen Design</a>, <a href="http://creativecommons.org/licenses/by/3.0">CC BY 3.0</a> — Map data © <a href="http://www.openstreetmap.org/copyright">OpenStreetMap</a>')
m1 %>% setView(-77.0369, 38.9072, zoom = 13)
m1 %>% addCircles(~bike.stations$LONGITUDE,
~bike.stations$LATITUDE,
popup=bike.stations,
weight = 3,
radius=bike.stations$NUMBER_OF_BIKES,
stroke = TRUE, fillOpacity = 0.8)