This dataset covers taxi rides in Mexico City (CDMX) and includes several types of taxi service. Rides lasting more than 6 hours are excluded. It contains data from June 2016 to July 2017. The data is available at this link. I’m not responsible for the data content.
The first thing to do is load the data so we can inspect, manipulate, and visualize it. The str function shows the data type of each variable. The result is 12 variables.
## 'data.frame': 12694 obs. of 12 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ vendor_id : chr "México DF Taxi de Sitio" "México DF Taxi Libre" "México DF Taxi Libre" "México DF Taxi Libre" ...
## $ pickup_datetime : chr "16/09/16 7:14" "18/09/16 6:16" "18/09/16 10:11" "18/09/16 10:23" ...
## $ dropoff_datetime : chr "18/09/16 4:41" "18/09/16 10:11" "18/09/16 10:23" "18/09/16 10:30" ...
## $ pickup_longitude : num -99.1 -99.3 -99.3 -99.3 -99.3 ...
## $ pickup_latitude : num 19.4 19.3 19.3 19.3 19.3 ...
## $ dropoff_longitude : num -99.2 -99.3 -99.3 -99.3 -99.3 ...
## $ dropoff_latitude : num 19.4 19.3 19.3 19.3 19.3 ...
## $ store_and_fwd_flag: chr "N" "N" "N" "N" ...
## $ trip_duration : int 120449 14110 681 436 442 100 345 544 1226 1723 ...
## $ dist_meters : int 12373 1700 2848 1409 1567 797 676 3771 5662 14349 ...
## $ wait_sec : num 242 461 129 106 85 19 169 37 572 459 ...
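The datetime columns arrive as character strings, so they need parsing before any time-based analysis. Here is a minimal sketch, assuming the day/month/year format seen in the output above. The small `rides` data frame is built by hand from the printed values for illustration, the time zone is an assumption, and the filter simply applies the 6-hour cutoff mentioned in the introduction.

```r
# Hand-built sample using values printed by str() above (illustrative only)
rides <- data.frame(
  id = 1:4,
  pickup_datetime = c("16/09/16 7:14", "18/09/16 6:16",
                      "18/09/16 10:11", "18/09/16 10:23"),
  trip_duration = c(120449L, 14110L, 681L, 436L),
  stringsAsFactors = FALSE
)

# Parse "dd/mm/yy H:MM" strings; the time zone is an assumption
rides$pickup <- as.POSIXct(rides$pickup_datetime,
                           format = "%d/%m/%y %H:%M",
                           tz = "America/Mexico_City")

# Convert seconds to minutes and keep rides of at most 6 hours
rides$minutes <- rides$trip_duration / 60
rides_ok <- rides[rides$trip_duration <= 6 * 3600, ]

str(rides_ok)
```

With the columns parsed this way, month, weekday, and duration summaries can all be derived from `pickup` and `minutes`.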
As you can see, the available data lends itself to maps, histograms, bar plots, and so on. The data looks sound and suits the task, so let’s look at the total taxi rides over time.
The waterfall graph shows the total rides per month, with the overall total at the end. The busiest month is July 2017, with 1,760 rides, the largest contribution in the dataset.
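The numbers behind a waterfall chart are just the monthly counts and their running total. A rough sketch in base R follows; the `pickup` timestamps are hypothetical values, not the real data.

```r
# Hypothetical pickup timestamps (not the real data)
pickup <- as.POSIXct(c("2016-06-03 08:10", "2016-06-20 19:45",
                       "2016-07-01 07:30", "2017-07-04 12:00",
                       "2017-07-15 09:20", "2017-07-28 22:05"),
                     tz = "UTC")

# Count rides per calendar month; "YYYY-MM" labels sort chronologically
per_month <- table(format(pickup, "%Y-%m"))

# A waterfall stacks each month on top of the running total
running_total <- cumsum(per_month)

print(per_month)
print(running_total)

# Month with the most rides
busiest <- names(per_month)[which.max(per_month)]
print(busiest)
```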
The next thing is to see where the taxi rides take place. A heat map is a good way to explore the data. Please feel free to play with it.
The heat map shows how the taxi rides are spread over the city. It is a nice way to view the data spatially, though there is no clear visual pattern.
Now, let’s look at a bubble graph of the total rides and the average minutes per ride for each type of service.
At first glance, Taxi Libre has the most rides but fewer average minutes, whereas Radio Taxi has almost 2,000 rides and an average above 40 minutes. The Uber services also see little demand in this dataset.
From the bubble graph, it’s easy to determine that Taxi Libre is the service with the most rides but the fewest minutes per ride; maybe the service is used mainly for short distances.
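The two quantities plotted in the bubble graph can be computed with a grouped summary. This is a minimal sketch with made-up rides; the service labels are shortened and the durations are hypothetical.

```r
# Hypothetical rides: service type and duration in seconds
rides <- data.frame(
  vendor_id = c("Taxi Libre", "Taxi Libre", "Taxi Libre",
                "Taxi de Sitio", "Taxi de Sitio"),
  trip_duration = c(600, 900, 1200, 2400, 3000),
  stringsAsFactors = FALSE
)

# Number of rides per service
n_rides <- tapply(rides$trip_duration, rides$vendor_id, length)

# Average ride length in minutes per service
avg_minutes <- tapply(rides$trip_duration / 60, rides$vendor_id, mean)

# Both results share the same group order, so they can be combined directly
summary_by_service <- data.frame(
  service = names(n_rides),
  rides = as.vector(n_rides),
  avg_minutes = as.vector(avg_minutes)
)
print(summary_by_service)
```

A service with many rides and a low `avg_minutes` is what suggests short-distance use.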
The graph above shows a matrix heat map with the months on the y-axis and the weekdays on the x-axis. The color intensity reflects the number of taxi rides. July 2017 has more taxi rides than the other months.
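The matrix behind this kind of heat map is a month-by-weekday count table. One way to build it in base R is sketched below with hypothetical timestamps; the weekday is taken from `POSIXlt` so that the result does not depend on the system locale.

```r
# Hypothetical pickup timestamps (not the real data)
pickup <- as.POSIXct(c("2016-09-16 07:14", "2016-09-18 06:16",
                       "2017-07-04 12:00", "2017-07-04 18:30",
                       "2017-07-05 09:00", "2017-07-07 21:15"),
                     tz = "UTC")

month <- format(pickup, "%Y-%m")

# POSIXlt stores the weekday as 0 (Sunday) to 6 (Saturday);
# reorder the levels so the week starts on Monday
wday <- as.POSIXlt(pickup)$wday
weekday <- factor(wday, levels = c(1:6, 0),
                  labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# Rows are months, columns are weekdays, cells are ride counts
counts <- table(month, weekday)
print(counts)
```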
A boxplot represents the distribution of a variable. In this case, the graph shows the distribution for each month, and each jittered point is an individual observation.
The graph shows that some months, such as August 2016 and February 2017, have more minutes on average per day; there are even some outliers above 100 minutes.
Finally, the boxplot is a way to see the distribution of the variable. In this case, it shows that the average minutes per day are typically under 30 minutes. In other words, the typical taxi ride does not exceed 30 minutes, and 75% of the values are under 40 minutes.
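The box in a boxplot is drawn from the quartiles, so the same reading can be checked numerically. A small sketch with hypothetical rides follows (the days and durations are made up):

```r
# Hypothetical rides: pickup day and duration in minutes
rides <- data.frame(
  day = c("2017-07-01", "2017-07-01", "2017-07-02",
          "2017-07-02", "2017-07-03", "2017-07-04"),
  minutes = c(10, 30, 25, 35, 45, 20),
  stringsAsFactors = FALSE
)

# Average ride length for each day
daily_avg <- tapply(rides$minutes, rides$day, mean)

# Quartiles of the daily averages (the edges and middle line of the box)
q <- quantile(daily_avg, probs = c(0.25, 0.5, 0.75))
print(q)

# Share of individual rides shorter than 30 minutes
share_under_30 <- mean(rides$minutes < 30)
print(share_under_30)
```

The median (`50%`) is the value that half of the observations fall below, and the `75%` quartile is the upper edge of the box.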
The last thing to see is the interactive map that shows the connections of each ride. There’s a special control to segment the data by time group: 1 hour, 2 hours, 3 hours, and more than 3 hours.
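Time groups like these can be derived from the ride duration with `cut`. This is only a sketch: the exact group boundaries used by the map are not stated, so the breaks and labels below are assumptions, and the durations are sample values.

```r
# Sample durations in seconds (illustrative values)
trip_duration <- c(436, 681, 3599, 5400, 9000, 14110)
hours <- trip_duration / 3600

# Assumed boundaries: (0,1], (1,2], (2,3], and anything above 3 hours
time_group <- cut(hours,
                  breaks = c(0, 1, 2, 3, Inf),
                  labels = c("up to 1 hour", "1 to 2 hours",
                             "2 to 3 hours", "more than 3 hours"))

group_counts <- table(time_group)
print(group_counts)
```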
The data is nice, but not balanced
There are more rides in 2017
The heat map shows the dimension of the data and some points with more rides than others
There are more rides from Taxi Libre than from other services
The taxi services seem to show a distinct behavior in terms of short or long distances
The days with the most rides are Tuesday to Friday
The average time per day doesn’t show months with a special behavior
50% of the taxi rides are under 30 minutes
The map shows that distance is not a determinant of the total time. For example, rides of more than 180 minutes cover very different distances. This reflects the complexity of a city like CDMX.
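The first rows printed by str earlier already hint at this. As a rough check, assuming those ten `trip_duration` and `dist_meters` values belong to the same rides, the long rides can be pulled out and compared:

```r
# First ten values of each column as printed by str() (illustrative sample)
trip_duration <- c(120449, 14110, 681, 436, 442, 100, 345, 544, 1226, 1723)
dist_meters   <- c(12373, 1700, 2848, 1409, 1567, 797, 676, 3771, 5662, 14349)

sample_rides <- data.frame(
  minutes = trip_duration / 60,
  km = dist_meters / 1000
)

# Average speed in km/h for each ride
sample_rides$kmh <- sample_rides$km / (sample_rides$minutes / 60)

# Rides longer than 180 minutes and the distances they covered
long_rides <- sample_rides[sample_rides$minutes > 180, ]
print(long_rides)
```

In this small sample the rides over 180 minutes cover quite different distances, which is the pattern described above.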