3.1 Parameter Visualization

We start by visualizing key parameters to gauge their importance in
our data.
We first look at the number of trips by day of the week and find that
most trips occurred on Sunday, Monday, Wednesday, and Thursday.

The chart above shows that VeriFone holds the larger share of yellow
taxi rides, at approximately 75%.

The graph above indicates that the number of journeys increases as
the number of passengers decreases. The right-skewed graph corroborates
this observation.

We had expected evenings to have the most trips, but our data revealed
that afternoons were the busiest in terms of number of trips, followed
by mornings.
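
The time-of-day comparison above depends on how pickup times are grouped. The following is a minimal sketch of one way to bucket timestamps and count trips per bucket. The hour boundaries and the names `time_of_day` and `trips_per_bucket` are assumptions made for illustration; the report does not state the boundaries actually used, and the timestamps are toy values.

```python
from collections import Counter
from datetime import datetime


def time_of_day(ts: datetime) -> str:
    """Map a pickup timestamp to a coarse bucket (boundaries are assumed)."""
    hour = ts.hour
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def trips_per_bucket(pickups):
    """Count trips in each time-of-day bucket."""
    return Counter(time_of_day(ts) for ts in pickups)


# Toy timestamps, not taken from the dataset.
sample = [
    datetime(2021, 1, 4, 8, 15),
    datetime(2021, 1, 4, 13, 5),
    datetime(2021, 1, 4, 14, 40),
    datetime(2021, 1, 4, 19, 30),
]
counts = trips_per_bucket(sample)
```

Shifting the boundaries (for example, ending the afternoon at 18:00) can change which bucket appears busiest, so the chosen cut-offs are worth reporting alongside the chart.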
3.2 Relationship Exploration

Observations - We first examined the correlation coefficients between
tip amount and our continuous variables, travel distance and trip time.
As seen in the correlation plot above, the coefficients were 0.65 and
0.33, which suggest a moderate association. However, since correlation
does not imply causation, statistical tests must be performed on these
variables to establish their relationship.
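
As a reminder of what the coefficients above measure, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the toy values are illustrative assumptions; the report's own figures come from its correlation plot, not from this code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data: tips that grow exactly linearly with distance.
distance = [1.0, 2.0, 3.0, 4.0]
tip = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(distance, tip)
```

The coefficient ranges from -1 to 1 and only captures linear association, which is one reason follow-up tests are needed.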
3.3 Location Analysis
Location Distribution
| Pick-up Location | Drop-off Location | No. of Trips | Avg. Fare ($) |
|------------------|-------------------|--------------|---------------|
| Queens           | Manhattan         | 59411        | 36.9          |
| Manhattan        | Queens            | 33523        | 33.8          |
| Manhattan        | Brooklyn          | 6681         | 25.9          |
| Manhattan        | EWR               | 3587         | 66.1          |
| Queens           | Bronx             | 1466         | 39.6          |
| Manhattan        | Manhattan         | 1328         | 30.5          |
| Manhattan        | Bronx             | 1094         | 31.7          |
| Queens           | Queens            | 744          | 46.3          |
| Brooklyn         | Manhattan         | 232          | 23.6          |
| Manhattan        | Staten Island     | 61           | 47.2          |
Observations - At first glance, Queens has the most
pickups, followed by Manhattan and Brooklyn in second and third,
respectively. Similarly, Manhattan, Queens, and Brooklyn make up the top
three drop-off locations. Further investigation showed that the highest
number of trips was between Queens and Manhattan, followed by Manhattan
to Queens and Manhattan to Brooklyn. These results are credible when
considering yellow cabs (the focus of this analysis), as they primarily
serve the above regions, in contrast to green cabs, which serve areas
where yellow cabs do not operate.
The highest average fare, $102, is for trips from Staten Island to
EWR; however, we do not have sufficient data for these locations, so we
consider only the top 10 pickup and drop-off borough pairs by number of
trips. Among these, a trip from Manhattan to EWR costs around $66 on
average, and a trip within Queens costs around $46.
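
The restriction to the busiest routes described above can be sketched as a group-by over pickup and drop-off pairs. This is a minimal illustration, assuming each trip is a `(pickup, dropoff, fare)` tuple; the function name `route_summary` and the toy fares are hypothetical and not taken from the dataset.

```python
from collections import defaultdict


def route_summary(trips, top_n=10):
    """Return the top_n routes by trip count as (pickup, dropoff, count, avg_fare)."""
    totals = defaultdict(lambda: [0, 0.0])  # route -> [trip count, fare sum]
    for pickup, dropoff, fare in trips:
        entry = totals[(pickup, dropoff)]
        entry[0] += 1
        entry[1] += fare
    rows = [
        (pickup, dropoff, count, fare_sum / count)
        for (pickup, dropoff), (count, fare_sum) in totals.items()
    ]
    # Rank by number of trips so sparsely observed routes drop out.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top_n]


# Toy trips, not taken from the dataset.
trips = [
    ("Queens", "Manhattan", 30.0),
    ("Queens", "Manhattan", 40.0),
    ("Manhattan", "EWR", 66.0),
]
top = route_summary(trips, top_n=1)
```

Ranking by trip count before comparing averages keeps routes with only a handful of trips from dominating the fare comparison.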
This graph illustrates that Queens has the most tipping passengers
among pickup locations, followed by Manhattan. Similarly, Manhattan has
more tipping passengers than any other drop-off location.
##                 Df  Sum Sq Mean Sq F value Pr(>F)
## PULocation       5    7691    1538     132 <2e-16 ***
## Residuals   108326 1257827      12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                 Df  Sum Sq Mean Sq F value Pr(>F)
## DOLocation       5  174314   34863    3461 <2e-16 ***
## Residuals   108326 1091205      10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observations - The p-value for both variables is reported as below
2 × 10^-16 (less than 0.0000000000000002), which is far smaller than
the significance level of 0.05. We therefore reject the null hypothesis
that the group means are equal across locations; the means are
statistically different.
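
The F values in the ANOVA output above follow directly from the sums of squares and degrees of freedom: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The short check below reproduces them from the printed figures; the function name `f_statistic` is an illustrative choice.

```python
def f_statistic(ss_between, df_between, ss_within, df_within):
    """One-way ANOVA F statistic from sums of squares and degrees of freedom."""
    ms_between = ss_between / df_between  # mean square for the factor
    ms_within = ss_within / df_within     # mean square for the residuals
    return ms_between / ms_within


# Figures copied from the two ANOVA tables above.
f_pickup = f_statistic(7691, 5, 1257827, 108326)
f_dropoff = f_statistic(174314, 5, 1091205, 108326)

print(round(f_pickup), round(f_dropoff))
```

Both results agree with the printed F values of 132 and 3461 after rounding. The drop-off F is much larger because the between-group sum of squares is far greater while the residual mean square is similar.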
3.4 Trip Duration Impact on Tips

Observations - The data is approximately normally distributed with a
slight right skew. The majority of our trips are 20-40 minutes in
length.

Observations - A t-test between travel time and customer tip
percentage gives a p-value of 2 × 10^-16. Since this value is much
lower than the significance level of 0.05, we reject the null
hypothesis that the means of the two variables are equal and conclude
that yellow taxi passengers tip differently depending on the length of
the trip.
3.5 Trip Length and Tips
We want to determine whether passengers taking shorter trips are more
likely to leave larger tips, or whether those taking longer trips are
more generous.

To study these two groups separately, we divide the data for trip
distance into two categories: short and long trips. When we plot the
trip distance against the number of tips paid, we notice that
passengers tip more often on shorter rides than on longer ones.

Observations - A simple two-sided t-test gives a p-value far below the
significance level of 0.05, so we can reject the null hypothesis. A
Z-test cannot be used because we do not know the population's mean and
standard deviation.
Declaring hypotheses
Null Hypothesis (H0): The tip amount is the same for short-distance and
long-distance passengers.
Alternate Hypothesis (Ha): The tip amount is NOT the same for
short-distance and long-distance passengers.
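
To make the two-sample comparison above concrete, here is a minimal sketch of a t statistic for short versus long trips. It uses Welch's unequal-variance form, which is an assumption: the report does not say which t-test variant was run. The function name `welch_t` and the tip values are illustrative only.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Toy tip amounts, not taken from the dataset.
short_tips = [2.0, 3.0, 4.0]
long_tips = [5.0, 6.0, 7.0]
t_stat, dof = welch_t(short_tips, long_tips)
```

The p-value is then read from the t distribution with `dof` degrees of freedom; a large absolute t relative to that distribution leads to rejecting H0.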
3.6 Importance of Passenger Count and Vendor

An ANOVA test between tip percentage and passenger count shows a
significant relationship between the number of passengers and the amount
of tips because the p-value is 0.00006, which is less than the
significance level (0.005). Hence, we can reject the null
hypothesis (H0).

Observation - Finally, we explore the relationship between the vendor
and our response variable, tip. Unsurprisingly, a two-sided t-test
between these variables shows no significant relationship, with a
p-value of 0.264.
Comments: At first glance, there are a total of 67604356 values across 3558124 observations and 19 variables, of which 7 are categorical and 12 are numerical. The data was procured from the NYC Open Source GIS website - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.