Short summary of the research problem and its importance: what you do and what you find.
A person reading your abstract should get a good sense of what problem you addressed and how you addressed it, without having to look at the rest of the paper.
Bikeshare, Weather, Cycling
What is the general area? What is the exact problem you are addressing?
Why is it important? (why should I be interested as a reader?)
What are the objectives of the research? What are your hypotheses?
How is the paper structured?
(I think the ones near the bottom of the list may be most promising…)
Define data collection method
Show that the sample accurately represents the target population, and address any coverage issues
Description of Data
A complete description of the desired output
Data Analysis
Describe the instrumentation
Describe the analysis plan
Describe the scope and limitations of the methodology
The dataset is currently composed of XXXXX records and VVV variables.
Comment on missing values
Comment on cleanup
To retain the maximum information possible, we discarded many variables and focused on the following:
var1
var2
var3
var4
From the above, we decided to …
Should state whether the hypotheses were verified or refuted or, if no hypotheses were given, whether the research questions were answered. The authors should also comment on their results in light of previous studies, explain what differences (if any) exist between their findings and those reported by others, and attempt to explain any discrepancies.
Should illustrate the important features of the methods and results.
Should allow the reader to understand the figure or graph without having to refer back to the text of the manuscript.
Common mistakes made by inexperienced authors are failing to include figures that best depict their findings, writing unclear figure legends, and making poor use of arrows.
Should summarize the data, make the data more easily understandable, and point out important comparisons.
Description of the data in the text, if possible, is preferable to the use of a space-consuming table.
Recap briefly what you do in the paper
Evaluate the effectiveness of your research and provide recommendations (if applicable)
Make sure that all of the questions raised in the introduction and the literature review have been addressed
Compare the final results against the original aims and objectives
Identify any shortcomings and future research
knitr::opts_chunk$set(echo = TRUE, fig.pos = 'h')
#mydir = "C:/Users/Michael/Dropbox/priv/CUNY/MSDS/201909-Fall/DATA621_Nasrin/20201214_FinalProject/"
mydir = "./"
setwd(mydir)
knitr::opts_knit$set(root.dir = mydir)
options(digits=7,scipen=999,width=120)
datadir = paste0(mydir,"data/")
# Load the required packages (attach messages condensed):
library(tidyverse)  # tidyverse 1.2.1: ggplot2 3.2.1, tibble 2.1.3, tidyr 1.0.0, readr 1.3.1,
                    # purrr 0.3.3, dplyr 0.8.3, stringr 1.4.0, forcats 0.4.0
                    # (dplyr::filter() and dplyr::lag() mask their stats equivalents)
library(lubridate)  # masks base::date
library(Hmisc)      # also loads lattice, survival, Formula; masks dplyr::src, dplyr::summarize
library(sp)         # for spDists() below; attaches silently
# Weather data is obtained from the NCDC (National Climatic Data Center) via https://www.ncdc.noaa.gov/cdo-web/
# click on search tool https://www.ncdc.noaa.gov/cdo-web/search
# select "daily summaries"
# select Search for Stations
# Enter Search Term "USW00094728" for Central Park Station:
# https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094728/detail
# "add to cart"
weatherfilenames = list.files(path="./", pattern = '\\.csv$', full.names = TRUE) # ending with .csv ; not .zip
weatherfilenames
## [1] ".//NYC_Weather_Data_2013-2019.csv"
weatherfile <- "NYC_Weather_Data_2013-2019.csv"
## Perhaps we should rename the columns to more clearly reflect their meaning?
weatherspec <- cols(
STATION = col_character(),
NAME = col_character(),
LATITUDE = col_double(),
LONGITUDE = col_double(),
ELEVATION = col_double(),
DATE = col_date(format = "%m/%d/%Y"), #col_date(format = "%F"), # readr::parse_datetime() : "%F" = "%Y-%m-%d"
AWND = col_double(), # Average Daily Wind Speed
AWND_ATTRIBUTES = col_character(),
PGTM = col_double(), # Peak Wind-Gust Time
PGTM_ATTRIBUTES = col_character(),
PRCP = col_double(), # Amount of Precipitation
PRCP_ATTRIBUTES = col_character(),
SNOW = col_double(), # Amount of Snowfall
SNOW_ATTRIBUTES = col_character(),
SNWD = col_double(), # Depth of snow on the ground
SNWD_ATTRIBUTES = col_character(),
TAVG = col_double(), # Average Temperature (not populated)
TAVG_ATTRIBUTES = col_character(),
TMAX = col_double(), # Maximum temperature for the day
TMAX_ATTRIBUTES = col_character(),
TMIN = col_double(), # Minimum temperature for the day
TMIN_ATTRIBUTES = col_character(),
TSUN = col_double(), # Daily Total Sunshine (not populated)
TSUN_ATTRIBUTES = col_character(),
WDF2 = col_double(), # Direction of fastest 2-minute wind
WDF2_ATTRIBUTES = col_character(),
WDF5 = col_double(), # Direction of fastest 5-second wind
WDF5_ATTRIBUTES = col_character(),
WSF2 = col_double(), # Fastest 2-minute wind speed
WSF2_ATTRIBUTES = col_character(),
WSF5 = col_double(), # fastest 5-second wind speed
WSF5_ATTRIBUTES = col_character(),
WT01 = col_double(), # Fog
WT01_ATTRIBUTES = col_character(),
WT02 = col_double(), # Heavy Fog
WT02_ATTRIBUTES = col_character(),
WT03 = col_double(), # Thunder
WT03_ATTRIBUTES = col_character(),
WT04 = col_double(), # Sleet
WT04_ATTRIBUTES = col_character(),
WT06 = col_double(), # Glaze
WT06_ATTRIBUTES = col_character(),
WT08 = col_double(), # Smoke or haze
WT08_ATTRIBUTES = col_character(),
WT13 = col_double(), # Mist
WT13_ATTRIBUTES = col_character(),
WT14 = col_double(), # Drizzle
WT14_ATTRIBUTES = col_character(),
WT16 = col_double(), # Rain
WT16_ATTRIBUTES = col_character(),
WT18 = col_double(), # Snow
WT18_ATTRIBUTES = col_character(),
WT19 = col_double(), # Unknown source of precipitation
WT19_ATTRIBUTES = col_character(),
WT22 = col_double(), # Ice fog
WT22_ATTRIBUTES = col_character()
)
# load all the daily weather data
weather <- read_csv(weatherfile,col_types = weatherspec)
# extract just 2019
weather2019 <- weather[(weather$DATE>="2019-01-01" & weather$DATE<="2019-12-31"),]
# extract just one month
weather201906 <- weather[(weather$DATE>="2019-06-01" & weather$DATE<="2019-06-30"),]
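# Quick sanity checks on the weather import (a sketch; readr's problems() lists parsing issues):
# problems(weather)                            # expect zero rows
# range(weather$DATE)                          # date coverage of the file
# c(nrow(weather2019), nrow(weather201906))    # expect roughly 365 and 30 daily records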
library(readr)
read_CB_data_file = function(f){
Startloadtime = Sys.time()
print(paste("reading data file ", f, " at ", Startloadtime))
datafile = read_csv(f,skip = 1,
col_names=c("trip_duration", # in seconds
"s_time", # start date/time
"e_time", # end date/time
"s_station_id", # station ID for beginning of trip
"s_station_name",
"s_lat", # start station latitude
"s_long", # start station longitude
"e_station_id", # station ID for end of trip
"e_station_name",
"e_lat", # latitude
"e_long", # longitude
"bike_id", # every bike has a 5-digit ID number
"user_type", # Annual Subscriber or Daily Customer
"birth_year", # Can infer age from this
"gender") # 1=Male,2=Female,0=unknown
,col_types = "dTTffddffddifif" # d=decimal; T=datetime; f=factor; i=integer
)
Endloadtime = Sys.time()
print(paste("done with data file ", f, " at ", Endloadtime))
Totalloadtime = round(Endloadtime - Startloadtime, 2)
print(paste("Totaltime = ", Totalloadtime))
print("------------------------------------------------------------")
return(datafile)
}
filenames = list.files("/home/ann/projects/nyc-citibike-data/data", pattern = "data\\.csv$") # ending with data.csv ; not .zip
length(filenames)
## [1] 75
as.matrix(filenames)   # list the monthly trip-data files
## [,1]
## [1,] "2013-07 - Citi Bike trip data.csv"
## [2,] "2013-08 - Citi Bike trip data.csv"
## [3,] "2013-09 - Citi Bike trip data.csv"
## [4,] "2013-10 - Citi Bike trip data.csv"
## [5,] "2013-11 - Citi Bike trip data.csv"
## [6,] "2013-12 - Citi Bike trip data.csv"
## [7,] "2014-01 - Citi Bike trip data.csv"
## [8,] "2014-02 - Citi Bike trip data.csv"
## [9,] "2014-03 - Citi Bike trip data.csv"
## [10,] "2014-04 - Citi Bike trip data.csv"
## [11,] "2014-05 - Citi Bike trip data.csv"
## [12,] "2014-06 - Citi Bike trip data.csv"
## [13,] "2014-07 - Citi Bike trip data.csv"
## [14,] "2014-08 - Citi Bike trip data.csv"
## [15,] "201409-citibike-tripdata.csv"
## [16,] "201410-citibike-tripdata.csv"
## [17,] "201411-citibike-tripdata.csv"
## [18,] "201412-citibike-tripdata.csv"
## [19,] "201501-citibike-tripdata.csv"
## [20,] "201502-citibike-tripdata.csv"
## [21,] "201503-citibike-tripdata.csv"
## [22,] "201504-citibike-tripdata.csv"
## [23,] "201505-citibike-tripdata.csv"
## [24,] "201506-citibike-tripdata.csv"
## [25,] "201507-citibike-tripdata.csv"
## [26,] "201508-citibike-tripdata.csv"
## [27,] "201509-citibike-tripdata.csv"
## [28,] "201510-citibike-tripdata.csv"
## [29,] "201511-citibike-tripdata.csv"
## [30,] "201512-citibike-tripdata.csv"
## [31,] "201601-citibike-tripdata.csv"
## [32,] "201602-citibike-tripdata.csv"
## [33,] "201603-citibike-tripdata.csv"
## [34,] "201604-citibike-tripdata.csv"
## [35,] "201605-citibike-tripdata.csv"
## [36,] "201606-citibike-tripdata.csv"
## [37,] "201607-citibike-tripdata.csv"
## [38,] "201608-citibike-tripdata.csv"
## [39,] "201609-citibike-tripdata.csv"
## [40,] "201610-citibike-tripdata.csv"
## [41,] "201611-citibike-tripdata.csv"
## [42,] "201612-citibike-tripdata.csv"
## [43,] "201701-citibike-tripdata.csv"
## [44,] "201702-citibike-tripdata.csv"
## [45,] "201703-citibike-tripdata.csv"
## [46,] "201704-citibike-tripdata.csv"
## [47,] "201705-citibike-tripdata.csv"
## [48,] "201706-citibike-tripdata.csv"
## [49,] "201707-citibike-tripdata.csv"
## [50,] "201708-citibike-tripdata.csv"
## [51,] "201709-citibike-tripdata.csv"
## [52,] "201710-citibike-tripdata.csv"
## [53,] "201711-citibike-tripdata.csv"
## [54,] "201712-citibike-tripdata.csv"
## [55,] "201801-citibike-tripdata.csv"
## [56,] "201802-citibike-tripdata.csv"
## [57,] "201803-citibike-tripdata.csv"
## [58,] "201804-citibike-tripdata.csv"
## [59,] "201805-citibike-tripdata.csv"
## [60,] "201806-citibike-tripdata.csv"
## [61,] "201807-citibike-tripdata.csv"
## [62,] "201808-citibike-tripdata.csv"
## [63,] "201809-citibike-tripdata.csv"
## [64,] "201810-citibike-tripdata.csv"
## [65,] "201811-citibike-tripdata.csv"
## [66,] "201812-citibike-tripdata.csv"
## [67,] "201901-citibike-tripdata.csv"
## [68,] "201902-citibike-tripdata.csv"
## [69,] "201903-citibike-tripdata.csv"
## [70,] "201904-citibike-tripdata.csv"
## [71,] "201905-citibike-tripdata.csv"
## [72,] "201906-citibike-tripdata.csv"
## [73,] "201907-citibike-tripdata.csv"
## [74,] "201908-citibike-tripdata.csv"
## [75,] "201909-citibike-tripdata.csv"
#### Load all the data files, noting how much time each takes to load
Starttime = Sys.time()
print(paste("Start time: ", Starttime))
## [1] "Start time: 2019-12-22 20:27:49"
### Loading all 75 files at once exceeds available memory on this machine,
### so the rbind approach is left commented out:
#print("About to load multiple datafiles:")
#print(filenames)
#CB <- do.call(rbind,lapply(filenames,read_CB_data_file))
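# One possible workaround (a sketch, not used here): read each file and keep only a
# small random sample of rows, so that all 75 months can fit in memory together.
# set.seed(621)
# CB_sample <- do.call(rbind, lapply(filenames, function(f) {
#   dplyr::sample_frac(read_CB_data_file(f), size = 0.05)   # keep ~5% of each month
# }))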
### load up just a single month of data ("filename")
#### Look at just June 2019
filename = filenames[72]   # 201906-citibike-tripdata.csv
print(paste("Loading data for ", filename))
CB <- read_CB_data_file(paste0("data/", filename))
## [1] "Loading data for 201906-citibike-tripdata.csv"
## [1] "reading data file data/201906-citibike-tripdata.csv at 2019-12-22 20:27:49"
## [1] "done with data file data/201906-citibike-tripdata.csv at 2019-12-22 20:28:03"
## [1] "Totaltime = 14.58"
## [1] "------------------------------------------------------------"
## [1] "End time: 2019-12-22 20:28:03"
Totaltime = round(Endtime - Starttime,2)
print(paste("Totaltime for loading above file(s) = ", Totaltime))
## [1] "Totaltime for loading above file(s) = 14.6"
glimpse(CB)   # glimpse the dataset
## Observations: 2,125,370
## Variables: 15
## $ trip_duration <dbl> 330, 830, 380, 1155, 1055, 128, 315, 471, 1554, 392, 588, 1906, 2754, 2037, 454, 642, 1371, 13…
## $ s_time <dttm> 2019-06-01 00:00:01, 2019-06-01 00:00:04, 2019-06-01 00:00:06, 2019-06-01 00:00:06, 2019-06-0…
## $ e_time <dttm> 2019-06-01 00:05:31, 2019-06-01 00:13:55, 2019-06-01 00:06:26, 2019-06-01 00:19:22, 2019-06-0…
## $ s_station_id <fct> 3602, 3054, 229, 3771, 441, 3236, 3129, 3467, 379, 467, 2000, 532, 3016, 3285, 3125, 3429, 223…
## $ s_station_name <fct> 31 Ave & 34 St, Greene Ave & Throop Ave, Great Jones St, McKibbin St & Bogart St, E 52 St & 2 …
## $ s_lat <dbl> 40.76315, 40.68949, 40.72743, 40.70624, 40.75601, 40.75898, 40.75110, 40.72495, 40.74916, 40.6…
## $ s_long <dbl> -73.92083, -73.94206, -73.99379, -73.93387, -73.96742, -73.99380, -73.94074, -74.00166, -73.99…
## $ e_station_id <fct> 3570, 3781, 326, 3016, 3159, 495, 3560, 401, 2006, 3368, 366, 3712, 3764, 540, 3221, 157, 450,…
## $ e_station_name <fct> 35 Ave & 37 St, Greene Av & Myrtle Av, E 11 St & 1 Ave, Kent Ave & N 7 St, W 67 St & Broadway,…
## $ e_lat <dbl> 40.75573, 40.69857, 40.72954, 40.72037, 40.77493, 40.76270, 40.75488, 40.72020, 40.76591, 40.6…
## $ e_long <dbl> -73.92366, -73.91888, -73.98427, -73.96165, -73.98267, -73.99301, -73.93433, -73.98998, -73.97…
## $ bike_id <int> 20348, 34007, 20587, 33762, 31290, 25137, 25648, 26972, 32969, 33539, 19770, 16708, 25711, 270…
## $ user_type <fct> Subscriber, Subscriber, Subscriber, Subscriber, Subscriber, Subscriber, Subscriber, Subscriber…
## $ birth_year <int> 1992, 1987, 1990, 1987, 1973, 1989, 1995, 1990, 1970, 1980, 1974, 1993, 1993, 1965, 1989, 1969…
## $ gender <fct> 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 0…
str(CB)   # structure of the dataset
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2125370 obs. of 15 variables:
## $ trip_duration : num 330 830 380 1155 1055 ...
## $ s_time : POSIXct, format: "2019-06-01 00:00:01" "2019-06-01 00:00:04" "2019-06-01 00:00:06" "2019-06-01 00:00:06" ...
## $ e_time : POSIXct, format: "2019-06-01 00:05:31" "2019-06-01 00:13:55" "2019-06-01 00:06:26" "2019-06-01 00:19:22" ...
## $ s_station_id : Factor w/ 793 levels "3602","3054",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ s_station_name: Factor w/ 793 levels "31 Ave & 34 St",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ s_lat : num 40.8 40.7 40.7 40.7 40.8 ...
## $ s_long : num -73.9 -73.9 -74 -73.9 -74 ...
## $ e_station_id : Factor w/ 806 levels "3570","3781",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ e_station_name: Factor w/ 806 levels "35 Ave & 37 St",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ e_lat : num 40.8 40.7 40.7 40.7 40.8 ...
## $ e_long : num -73.9 -73.9 -74 -74 -74 ...
## $ bike_id : int 20348 34007 20587 33762 31290 25137 25648 26972 32969 33539 ...
## $ user_type : Factor w/ 2 levels "Subscriber","Customer": 1 1 1 1 1 1 1 1 2 1 ...
## $ birth_year : int 1992 1987 1990 1987 1973 1989 1995 1990 1970 1980 ...
## $ gender : Factor w/ 3 levels "1","2","0": 1 2 2 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. trip_duration = col_double(),
## .. s_time = col_datetime(format = ""),
## .. e_time = col_datetime(format = ""),
## .. s_station_id = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. s_station_name = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. s_lat = col_double(),
## .. s_long = col_double(),
## .. e_station_id = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. e_station_name = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. e_lat = col_double(),
## .. e_long = col_double(),
## .. bike_id = col_integer(),
## .. user_type = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. birth_year = col_integer(),
## .. gender = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
## .. )
summary(CB)
## trip_duration s_time e_time s_station_id
## Min. : 61 Min. :2019-06-01 00:00:01 Min. :2019-06-01 00:02:16 519 : 14778
## 1st Qu.: 395 1st Qu.:2019-06-07 18:46:00 1st Qu.:2019-06-07 19:02:42 514 : 13512
## Median : 683 Median :2019-06-15 14:17:56 Median :2019-06-15 14:40:21 426 : 13410
## Mean : 1109 Mean :2019-06-15 21:45:53 Mean :2019-06-15 22:04:23 402 : 12191
## 3rd Qu.: 1199 3rd Qu.:2019-06-23 18:48:22 3rd Qu.:2019-06-23 19:09:09 3255 : 11973
## Max. :3379585 Max. :2019-06-30 23:59:54 Max. :2019-07-15 03:53:26 499 : 11680
## (Other):2047826
## s_station_name s_lat s_long e_station_id e_station_name
## Pershing Square North: 14778 Min. :40.66 Min. :-74.03 519 : 14696 Pershing Square North: 14696
## 12 Ave & W 40 St : 13512 1st Qu.:40.72 1st Qu.:-74.00 426 : 14163 West St & Chambers St: 14163
## West St & Chambers St: 13410 Median :40.74 Median :-73.99 514 : 13614 12 Ave & W 40 St : 13614
## Broadway & E 22 St : 12191 Mean :40.74 Mean :-73.98 402 : 12639 Broadway & E 22 St : 12639
## 8 Ave & W 31 St : 11973 3rd Qu.:40.76 3rd Qu.:-73.97 3255 : 12055 8 Ave & W 31 St : 12055
## Broadway & W 60 St : 11680 Max. :40.81 Max. :-73.91 499 : 11310 Broadway & W 60 St : 11310
## (Other) :2047826 (Other):2046893 (Other) :2046893
## e_lat e_long bike_id user_type birth_year gender
## Min. :40.66 Min. :-74.07 Min. :14529 Subscriber:1752526 Min. :1885 1:1399108
## 1st Qu.:40.72 1st Qu.:-74.00 1st Qu.:20961 Customer : 372844 1st Qu.:1969 2: 527577
## Median :40.74 Median :-73.99 Median :28996 Median :1983 0: 198685
## Mean :40.74 Mean :-73.98 Mean :27614 Mean :1980
## 3rd Qu.:40.76 3rd Qu.:-73.97 3rd Qu.:33016 3rd Qu.:1990
## Max. :40.81 Max. :-73.91 Max. :39934 Max. :2003
##
## [1] "Number of columns with missing values = 0"
## [1] "Names of columns with missing values = "
In this data (just a single month), there are no missing values.
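The chunk that produced the missing-value report above is not shown; a minimal sketch of such a check (an assumed reconstruction, not the original code) might be:

missing_cols <- names(CB)[colSums(is.na(CB)) > 0]
print(paste("Number of columns with missing values =", length(missing_cols)))
print(paste("Names of columns with missing values = ", paste(missing_cols, collapse = ", ")))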
The trip_duration is specified in seconds, but there appear to be outliers that may be incorrect, as the maximum value is quite high: 3379585 seconds, or about 39.1 days. We can treat such values as erroneous, since nobody would willingly rent a bicycle for that long given the fees that would be charged.
It may be easier to think of trip duration in other units (minutes, hours, or days) rather than in seconds, so let's create such variables. Let's also confirm that the recorded value (in seconds) is consistent with the difference between the start time and the end time:
#express trip duration in seconds, minutes, hours, days
CB$trip_duration_s = as.numeric(CB$e_time - CB$s_time,"secs")
CB$trip_duration_m = as.numeric(CB$e_time - CB$s_time,"mins")
CB$trip_duration_h = as.numeric(CB$e_time - CB$s_time,"hours")
CB$trip_duration_d = as.numeric(CB$e_time - CB$s_time,"days")
summary(CB$trip_duration_h)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0169 0.1098 0.1900 0.3083 0.3333 938.7739
## [1] 2024
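The code that produced the count above is not shown; one plausible reconstruction of the start/end-time consistency check (an assumption, not the original code):

# count trips where the recorded duration differs from the computed one by more than a second
sum(abs(CB$trip_duration - CB$trip_duration_s) > 1)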
Let's assume that nobody would rent a bicycle for more than a specified time limit, and drop the records which exceed it:
## [1] "Initial number of trips: 2125370"
# choose only trips that were at most 3 hrs, as longer trips may reflect an error
# remove long trips from the data set -- something may be wrong (e.g., the system failed to properly record the return of a bike)
longtripthreshold_s = 60 * 60 *3 # 10800 seconds = 180 minutes = 2 hours
longtripthreshold_m = longtripthreshold_s / 60
longtripthreshold_h = longtripthreshold_m / 60
long_trips <- CB %>% filter(trip_duration_s > longtripthreshold_s)
num_long_trips_removed = dim(long_trips)[1]
pct_long_trips_removed = round(100*num_long_trips_removed / total_rows, 3)
CB <- CB %>% filter(trip_duration_s <= longtripthreshold_s)   # use the same computed duration as above
reduced_rows = dim(CB)[1]
print(paste0("Removed ", num_long_trips_removed, " trips (", pct_long_trips_removed, "%) longer than ", longtripthreshold_h, " hours."))
print(paste0("Remaining number of trips: ", reduced_rows))
## [1] "Removed 4003 trips (0.188%) longer than 3 hours."
## [1] "Remaining number of trips: 2121369"
par(mfrow=c(1,2))
hist(CB$trip_duration_m, col="lightgreen", xlab="Trip duration, in minutes")
hist(log(CB$trip_duration_m), col="lightgreen", xlab="log(Trip duration, in minutes)")
The birth year for some users is as old as 1885, which is implausible.
summary(CB$birth_year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1885 1969 1983 1980 1990 2003
# Deduce age from trip date and birth year
#library(lubridate) #loaded above
CB$age <- year(CB$s_time) - CB$birth_year
par(mfrow=c(1,2))
hist(CB$age, col="lightblue", xlab="User Age, inferred from birth year")
hist(log(CB$age), col="lightblue", xlab="log(User Age, inferred from birth year)")
# keep only trips where the inferred age is plausible; greater ages likely reflect bad birth-year data
age_threshold = 90
aged_trips <- CB %>% filter(age > age_threshold)
num_aged_trips_removed = dim(aged_trips)[1]
pct_aged_trips_removed = round(100*num_aged_trips_removed / total_rows, 3)
CB <- CB %>% filter(age <= age_threshold)
reduced_rows = dim(CB)[1]
print(paste0("Removed ", num_aged_trips_removed, " trips (", pct_aged_trips_removed, "%) of users older than ", age_threshold, " years."))
print(paste0("Remaining number of trips: ", reduced_rows))
## [1] "Removed 1052 trips (0.049%) of users older than 90 years."
## [1] "Remaining number of trips: 2120317"
par(mfrow=c(1,2))
hist(CB$age, col="lightgreen", xlab="User Age, inferred from birth year")
hist(log(CB$age), col="lightgreen", xlab="log(User Age, inferred from birth year)")
Note that this is straight-line (great-circle) distance; it does not reflect the actual route taken. There are services (e.g., from Google) which can compute a recommended bicycle route between two points, but using such services requires a subscription and incurs costs.
# Compute the great-circle distance between start and end stations
# (sp::spDists with longlat=TRUE expects coordinates ordered longitude-then-latitude and returns kilometres)
s_long_lat <- CB %>% select(c(s_long, s_lat)) %>% as.matrix
e_long_lat <- CB %>% select(c(e_long, e_lat)) %>% as.matrix
#library(sp) # loaded above
CB$distance_km <- spDists(s_long_lat, e_long_lat, longlat=TRUE, diagonal=TRUE)
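# Optional cross-check with the haversine formula (a sketch assuming the geosphere
# package is installed; not part of the original analysis):
# library(geosphere)
# head(distHaversine(s_long_lat, e_long_lat) / 1000)   # metres to km; should closely match head(CB$distance_km)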
# There is a time-based usage fee for rides longer than an initial period.
# For user_type=Subscriber, the fee is $2.50 per 15 minutes following an initial free 45 minutes.
# For user_type=Customer, the fee is $4.00 per 15 minutes following an initial free 30 minutes.
CB$trip_fee <- 0   # initialize, so the subset assignments below don't hit an undefined column
CB$trip_fee[CB$user_type=="Subscriber"] <- 2.50 * pmax(0, ceiling(CB$trip_duration_m[CB$user_type=="Subscriber"] / 15) - 3) # first 45 minutes are free
CB$trip_fee[CB$user_type=="Customer"]   <- 4.00 * pmax(0, ceiling(CB$trip_duration_m[CB$user_type=="Customer"]   / 15) - 2) # first 30 minutes are free
# e.g., a 50-minute Subscriber trip: ceiling(50/15) - 3 = 1 billable block, so the fee is $2.50
# extract selected fields
CBlite <- select(CB, c(trip_duration, trip_fee, distance_km,
s_station_id, s_lat, s_long,
e_station_id, e_lat, e_long,
user_type, gender, age))
#make numeric variables
CBlite$user_type <- as.integer(CBlite$user_type)
CBlite$gender <- as.integer(CBlite$gender)
# function to revert factor back to its numeric levels
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
CBlite$s_station_id <- as.numeric.factor(CBlite$s_station_id)
CBlite$e_station_id <- as.numeric.factor(CBlite$e_station_id)
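# e.g., as.numeric.factor(factor(c("519","3255"))) returns c(519, 3255), recovering the original station IDs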
#library(Hmisc) #loaded above
res2 <- rcorr(as.matrix(CBlite))                          # default type is Pearson
respearson <- rcorr(as.matrix(CBlite), type = "pearson")
resspearman <- rcorr(as.matrix(CBlite), type = "spearman")
res3 <- cor(as.matrix(CBlite))
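One way to inspect these correlation matrices visually (a sketch assuming the corrplot package is installed; it is not used elsewhere in this project):

library(corrplot)
corrplot(res3, method = "color", type = "upper", tl.cex = 0.7)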