I will be taking you through the different steps one by one.
The data set that was given for this task was from
I took the December 2013 data as per the suggestions for this exercise.
Let us read the data in a data frame
setwd("D:/Vibs/Work/Tech-Prep/Profile/Elula")
#clear the environment of all existing variables
rm(list=ls())
# setting the time zone UTC
Sys.setenv(TZ='UTC')
trip_fare <- read.csv("D:/Vibs/Work/Tech-Prep/Profile/Elula/trip_fare/trip_fare_12.csv", header=TRUE)
Let us now analyze the payment type from the other data set (trip_fare)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
library(plyr)
## Warning: package 'plyr' was built under R version 3.4.4
summary(trip_fare$payment_type)
## CRD CSH DIS NOC UNK
## 7444798 6426107 10100 35184 54929
g15 <- ggplot(trip_fare,aes(trip_fare$payment_type)) + geom_bar() +
xlab("payment type") + ylab("Frequency") + ggtitle("Payment type Distribution") + coord_flip()
g15
There are 5 types of payments of which CRD is the most popular followed by CSH & DIS is the least popular.
We can further analyze distribution of payment type by grouping the total_amount into bins or by the tip_amount.
summary(trip_fare$total_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.50 8.00 11.30 15.13 17.00 650.00
trip_fare <- cbind(trip_fare, total_amount_category = 'NA')
b <- c(2.5, 8, 11, 17, 650)
names <- c("Low", "Medium", "High","Very High")
trip_fare$total_amount_category <- cut(trip_fare$total_amount, breaks = b, labels = names)
trip_fare$total_amount_category <- as.factor(trip_fare$total_amount_category)
table(trip_fare$total_amount_category)
##
## Low Medium High Very High
## 3695808 3187403 3675488 3412399
g16 <- ggplot(trip_fare,aes(trip_fare$payment_type, fill=total_amount_category)) + geom_bar(position ="stack") + xlab("payment type") + ylab("Frequency") + ggtitle("Payment type Distribution over total amount category") + coord_flip()
g16
Inference: The CRD & CSG payment types are almost equally distributed among all total amount categories.
summary(trip_fare$tip_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.414 2.000 200.000
trip_fare <- cbind(trip_fare, tip_amount_category = 'NA')
b <- c(-Inf, 0, 1, 2, 200)
names <- c("Zero", "One", "Two","High")
trip_fare$tip_amount_category <- cut(trip_fare$tip_amount, breaks = b, labels = names)
trip_fare$tip_amount_category <- as.factor(trip_fare$tip_amount_category)
table(trip_fare$tip_amount_category)
##
## Zero One Two High
## 6720363 1343343 2675992 3231420
g17 <- ggplot(trip_fare,aes(trip_fare$payment_type, fill=tip_amount_category)) + geom_bar(position ="stack") + xlab("payment type") + ylab("Frequency") + ggtitle("Payment type Distribution over tip amount category") + coord_flip()
g17
Inference: If the payment is by cash, then the tip amount is zero. High tip is given in case of CRD payment as well as tip of 1$ & 2$ is popular amoung CRD payments.
g19 <- ggplot(trip_fare,aes(trip_fare$payment_type, fill=vendor_id)) + geom_bar(position ="stack") + xlab("payment type") + ylab("Frequency") + ggtitle("Payment type Distribution over vendor") + coord_flip()
g19
Inference: CSH & CRD modes are almost equally divided between CMT & VMS vendors. However, UNK payment type is only done for VTS & NOC and DIS payment type is only done for CMT.
Let us now understand how the fare is distributed.
summary(trip_fare$fare_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.50 6.50 9.50 12.63 14.50 500.00
trip_fare <- cbind(trip_fare, fare_amount_category = 'NA')
b <- c(2, 6.5, 9.5, 14.5, 500)
names <- c("Low", "Medium", "High","very High")
trip_fare$fare_amount_category <- cut(trip_fare$fare_amount, breaks = b, labels = names)
trip_fare$fare_amount_category <- as.factor(trip_fare$fare_amount_category)
table(trip_fare$fare_amount_category)
##
## Low Medium High very High
## 3748826 3539946 3293220 3389126
g18 <- ggplot(trip_fare,aes(trip_fare$fare_amount_category, fill=payment_type)) + geom_bar(position ="stack") + xlab("Fare Amount") + ylab("Frequency") + ggtitle("Fare Amount Distribution over payment type") + coord_flip()
g18
Inference: Whatever is the Fare Amount, it is mostly paid by CRD or CSH.
g20 <- ggplot(trip_fare,aes(trip_fare$fare_amount_category, fill=vendor_id)) + geom_bar(position ="stack") + xlab("fare amount") + ylab("Frequency") + ggtitle("Fare Amount Distribution over vendor") + coord_flip()
g20
Inference: Fare amounts are almost equally distributed between CMT & VTS Vendors.
Let us now see the distribution of fare amount over pickup time.
trip_fare <- cbind(trip_fare, pickUpTime = 'NA')
trip_fare$pickUpTime <- as.character(trip_fare$pickUpTime)
trip_fare$pickUpTime <- strftime(trip_fare$pickup_datetime, format="%H")
trip_fare$pickUpTime <- as.factor(trip_fare$pickUpTime)
g21 <- ggplot(trip_fare,aes(trip_fare$pickUpTime, fill=fare_amount_category)) + geom_bar(position ="stack") + xlab("Pick Up Time") + ylab("Frequency") + ggtitle("Fare Amount Distribution over pick up time") + coord_flip()
g21
Inference: Very High fare amount starts from 12 noon to 12 midnight.
summary(trip_fare$tip_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.414 2.000 200.000
g24 <- ggplot(trip_fare,aes(trip_fare$pickUpTime, fill=tip_amount_category)) + geom_bar(position ="stack") + xlab("Pick Up Time") + ylab("Frequency") + ggtitle("Tip Amount Distribution over pick up time") + coord_flip()
g24
Inference: Highest Tip amount is between 6 pm to midnight.
In repect of time, here I will pause on the exploratory analysis and start with the Open Questions section.