Intro

Lately, I have noticed a skyrocketing number of mobility services around my neighborhood, ranging from shared bikes to electric scooters. My dad once told me that he used to walk 1-2 hours just to get to school, but these days I can take an electric scooter to the light-rail station, ride the light rail to the airport, fly to Chicago, and then take an Uber to my friend’s house. How easy is that?

The way we travel from point A to point B has been changing dramatically, and Uber may well have been what started it in the mobility industry. Every time I use Uber, I wonder what kind of data they might be collecting from my ride, and I keep coming back to the following questions:

  • Can they infer whether I am a student or a full-time worker based on my profile and my routes at specific times?
  • Which cluster of people would I fall into?
  • Would they know that I am doing an internship in Chicago just for this summer and will head back home afterwards? If so, what kind of promotion can they provide?
  • How do they optimize their pricing model when demand surges during rain or snow storms?

While I think through possible answers to these back-to-back questions and the tools I could use to answer them, I arrive at my destination in no time.

These questions made me curious enough to look for related data, and I stumbled upon a dataset called “UBER RIDE HISTORY” that Stan Tyan shared on Kaggle. Stan collected it from his 678 Uber and Gett (an Uber-like service in Russia) rides over 2 years by syncing his mobile app account to a Google Spreadsheet, which automatically logged each ride when he arrived at his destination.

Stan is Russian, and I was able to reach him via his blog. I asked if it would be okay to use the dataset for some EDA and publish it on my blog, and he kindly gave me permission. Below is a snapshot of the email I received from Stan:

Side Notes:

Unlike in the US, where anyone with a car and a driver’s license can drive for Uber, in Russia only taxi drivers are allowed to drive for Uber. In many countries, Uber has drawn resentment, especially from taxi drivers, for taking rides away from local taxi drivers and even for the commission it charges for its service. In Korea, a taxi driver’s license is an asset that drivers can sell for over $50,000, so the Korean government banned Uber because it would allow anyone to drive and make money.

========================================================================================================================

Data

========================================================================================================================

  • The data looks as shown above. The names of the variables are self-explanatory, so I will not go into detail.

  • Note that the data has location info (longitude, latitude) and weather info.

========================================================================================================================

EDA - Basic

## [1] "Number of Days for Data Collection:  1082"
## [1] "Data Colelction Date:  2015-05-11 2018-04-27"
## [1] "Average # of rides per day:  0.626617375231054"
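
The three figures above follow directly from the date range and the ride count. Here is a minimal sketch of that arithmetic in Python (used only for illustration; the printed output above looks like it came from R). The dates and the 678 rides are taken from the post, and the variable names are my own:

```python
from datetime import date

# Values taken from the post: first and last ride dates, and total rides.
first_ride = date(2015, 5, 11)
last_ride = date(2018, 4, 27)
total_rides = 678

# Days between the first and the last recorded ride.
collection_days = (last_ride - first_ride).days

# Average number of rides per day over the collection window.
rides_per_day = total_rides / collection_days

print("Number of days for data collection:", collection_days)
print("Average # of rides per day:", rides_per_day)
```

In other words, Stan took a ride a little more often than every other day.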

Travel Time & Waiting Time

========================================================================================================================

  • trip_min (travel time in minutes) is 21 minutes on average, and wait_min (waiting time in minutes) is 9 minutes on average.

  • There were 2 rides with wait_min greater than 100 minutes. My initial thought was that this is related to weather conditions, so let’s take a look.

========================================================================================================================

Rides with Waiting Time > 100 minutes

========================================================================================================================

  • It did rain on those 2 rides with wait_min > 100 minutes (check the precipitation column).

  • The trip_start_time variable is the time Stan requested the ride rather than the time the ride began.

  • Therefore, we get the equation trip_end_time = trip_start_time + trip_time + wait_time
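
The equation above can be sketched with plain datetime arithmetic. This is only an illustration: the function name and the ride below are hypothetical, and the waiting and travel times are assumed to be in minutes, as wait_min and trip_min are.

```python
from datetime import datetime, timedelta

def expected_end_time(trip_start_time, wait_min, trip_min):
    """Request time plus waiting time plus travel time."""
    return trip_start_time + timedelta(minutes=wait_min + trip_min)

# Hypothetical ride, not taken from the dataset: requested at 08:30,
# with the average waiting (9 min) and travel (21 min) times from above.
requested_at = datetime(2018, 1, 15, 8, 30)
end = expected_end_time(requested_at, wait_min=9, trip_min=21)
print(end)  # 2018-01-15 09:00:00
```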

========================================================================================================================

Rides with Travel Time > 50 minutes

========================================================================================================================

  • It rained during the ride that had trip_min > 50 minutes and the shortest distance.

  • We will need to look at how other variables change accordingly with precipitation.

========================================================================================================================

Impact of Precipitation on Rides

========================================================================================================================

  • There seems to be a slight difference among the precipitation groups, with average speed decreasing in the order none > rain > snow.

  • However, the difference does not look large, so I won’t carry out an ANOVA test.
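
A comparison like this boils down to computing a speed for each ride and averaging it within each precipitation group. Below is a small sketch under stated assumptions: the rides are made up for illustration (they are not from Stan's data), speed is taken as distance divided by trip duration, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical rides: (precipitation, distance_kms, trip_min). Not real data.
rides = [
    ("none", 10.0, 20.0),
    ("none", 6.0, 15.0),
    ("rain", 8.0, 20.0),
    ("rain", 5.0, 15.0),
    ("snow", 6.0, 20.0),
]

def mean_speed_by_group(rows):
    """Average speed in km/h for each precipitation group."""
    speeds = defaultdict(list)
    for precipitation, distance_kms, trip_min in rows:
        speeds[precipitation].append(distance_kms / (trip_min / 60.0))
    return {group: sum(v) / len(v) for group, v in speeds.items()}

avg_speed = mean_speed_by_group(rides)

# Print groups from fastest to slowest.
for group, speed in sorted(avg_speed.items(), key=lambda kv: -kv[1]):
    print(f"{group}: {speed:.1f} km/h")
```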

========================================================================================================================

Seasonality

========================================================================================================================

  • Of course it’s Russia! Stan is a human after all and tends to take more rides during winter.

========================================================================================================================

========================================================================================================================

  • This is the trend by hour of the day. It peaks at 10am, which is probably when Stan goes to work or school. We will check this inference later in the analysis.

========================================================================================================================

Ride Fee

========================================================================================================================

  • There are many outliers, so the distribution is heavily skewed to the right.

  • Two peaks catch my eye. Let’s dig a little deeper into the fee.

========================================================================================================================

Ride Fee by Type of Rides

========================================================================================================================

  • Click on the legend to filter by type

  • Since uberBLACK is a premium service, its price-to-distance ratio is higher than uberX’s.

  • Business by Gett seems to be priced higher than uberBLACK.

  • The price-to-distance ratio follows the order Business > uberBLACK > uberX = EconomyFix.

  • uberBLACK has high variance.

  • uberX has relatively equal variance.

Let’s carry out a one-way ANOVA test to see if there’s a statistically significant difference in price by type of ride.

========================================================================================================================

ONE-WAY ANOVA on Average Ride Fee by Different Types of Ride

========================================================================================================================

  • A one-way ANOVA model should satisfy these assumptions: independence, normality, equal variance, and randomness.

  • Normal Q-Q plot (top right): the model violates normality because the points at the bottom and top don’t lie on a straight line.

  • Therefore, we can’t trust this model’s test of the null hypothesis that “all types of rides have the same price.”

========================================================================================================================

========================================================================================================================

  • Since our previous model did not satisfy the assumptions, we are going to take the log of price_usd to bring the data closer to normality.

  • Independence: Since the price of one ride does not affect the price of another, the independence assumption holds.

  • Normal Q-Q plot (top right): The points lie on the straight line; therefore, the normality assumption holds.

  • Residual vs. Fitted plot (top left): The spread of the residuals does not change across the fitted values; therefore, the equal variance assumption holds.

  • Since the data now satisfies all the assumptions, we will move forward with the model.

========================================================================================================================

##                       Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(trip_type)   7  40.47   5.781   18.55 <2e-16 ***
## Residuals            670 208.79   0.312                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

========================================================================================================================

  • We got an F-statistic of 18.55 with a p-value < 2 * 10^(-16), which is less than our predetermined significance level of 0.05.

  • Therefore, we can confidently reject the null hypothesis and say that there is at least one type of ride whose price is statistically different from the other types.

  • However, one-way ANOVA does not tell us which groups differ. Let’s carry out a post hoc test to check where the difference exists.
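
As a sanity check, the F value can be recomputed by hand from the sums of squares and degrees of freedom printed in the ANOVA table. The sketch below only redoes that arithmetic with the numbers copied from the table; the variable names are my own. The degrees of freedom are consistent with 8 ride types (8 - 1 = 7) and 678 rides (678 - 8 = 670).

```python
# Numbers copied from the ANOVA table in the post.
df_between, ss_between = 7, 40.47     # as.factor(trip_type)
df_within, ss_within = 670, 208.79    # Residuals

# A mean square is a sum of squares divided by its degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# The F statistic is the ratio of the two mean squares.
f_value = ms_between / ms_within

print(round(ms_between, 3), round(ms_within, 3), round(f_value, 2))
```

The rounded values match the Mean Sq and F value columns of the table.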

========================================================================================================================

## 
##   Posthoc multiple comparisons of means : Bonferroni 
##     95% family-wise confidence level
## 
## $`as.factor(trip_type)`
##                              diff     lwr.ci      upr.ci    pval    
## Comfort-Business      -1.05984597 -2.2270423  0.10735033 0.12702    
## EconomyFix-Business   -1.04755493 -2.1196932  0.02458334 0.06352 .  
## uberBEAUTY-Business   -1.40880756 -3.4304509  0.61283575 0.81736    
## uberBLACK-Business    -0.06047580 -1.1259754  1.00502378 1.00000    
## uberELKA-Business      0.52819280 -1.4934505  2.54983611 1.00000    
## uberSELECT-Business   -0.32800189 -1.9262513  1.27024748 1.00000    
## uberX-Business        -1.17979153 -2.1930917 -0.16649136 0.00787 ** 
## EconomyFix-Comfort     0.01229105 -0.6720384  0.69662054 1.00000    
## uberBEAUTY-Comfort    -0.34896159 -2.1944610  1.49653782 1.00000    
## uberBLACK-Comfort      0.99937018  0.3254891  1.67325128 0.00011 ***
## uberELKA-Comfort       1.58803877 -0.2574606  3.43353818 0.19977    
## uberSELECT-Comfort     0.73184409 -0.6368149  2.10050307 1.00000    
## uberX-Comfort         -0.11994556 -0.7078262  0.46793507 1.00000    
## uberBEAUTY-EconomyFix -0.36125263 -2.1481497  1.42564448 1.00000    
## uberBLACK-EconomyFix   0.98707913  0.4959081  1.47825016 1.5e-08 ***
## uberELKA-EconomyFix    1.57574773 -0.2111494  3.36264484 0.16342    
## uberSELECT-EconomyFix  0.71955304 -0.5689968  2.00810287 1.00000    
## uberX-EconomyFix      -0.13223661 -0.4965673  0.23209412 1.00000    
## uberBLACK-uberBEAUTY   1.34833176 -0.4345900  3.13125357 0.50348    
## uberELKA-uberBEAUTY    1.93700036 -0.5389969  4.41299763 0.40314    
## uberSELECT-uberBEAUTY  1.08080567 -1.0634709  3.22508221 1.00000    
## uberX-uberBEAUTY       0.22901603 -1.5232106  1.98124263 1.00000    
## uberELKA-uberBLACK     0.58866860 -1.1942532  2.37159041 1.00000    
## uberSELECT-uberBLACK  -0.26752609 -1.5505575  1.01550530 1.00000    
## uberX-uberBLACK       -1.11931573 -1.4636205 -0.77501098 < 2e-16 ***
## uberSELECT-uberELKA   -0.85619469 -3.0004712  1.28808185 1.00000    
## uberX-uberELKA        -1.70798433 -3.4602109  0.04424227 0.06506 .  
## uberX-uberSELECT      -0.85178965 -2.0918128  0.38823352 0.88385    
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

========================================================================================================================

  • There are several post hoc tests, such as Fisher’s LSD, Tukey’s HSD, and the Bonferroni correction. I wanted to be conservative with my analysis, so I chose the Bonferroni test.

  • (uberX~uberBLACK), (uberBLACK~EconomyFix), (uberBLACK~Comfort), and (uberX~Business) have p-values below our significance level of 0.05, which means these are the pairs of groups with a statistically significant difference in price.
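
To make the Bonferroni idea concrete: with 8 ride types there are 8 * 7 / 2 = 28 pairwise comparisons, which matches the 28 rows of the table, and each raw p-value is multiplied by that count and capped at 1. The sketch below is illustrative only. The helper name is hypothetical, the p-values are a subset copied from the table, and I am assuming the table already reports adjusted p-values (many of them are capped at 1.00000).

```python
n_groups = 8   # ride types in the post hoc table
alpha = 0.05   # significance level used in the post

# Number of pairwise comparisons among the groups.
n_pairs = n_groups * (n_groups - 1) // 2

def bonferroni_adjust(raw_p, n_tests):
    """Bonferroni-adjusted p-value: raw p times the number of tests, capped at 1."""
    return min(1.0, raw_p * n_tests)

# A few p-values copied from the table (assumed to be already adjusted).
adjusted_p = {
    "uberX-Business": 0.00787,
    "uberBLACK-Comfort": 0.00011,
    "uberBLACK-EconomyFix": 1.5e-08,
    "EconomyFix-Business": 0.06352,
    "uberX-uberELKA": 0.06506,
}

# Pairs whose adjusted p-value is below the significance level.
significant = [pair for pair, p in adjusted_p.items() if p < alpha]

print(n_pairs)
print(significant)
```

The two pairs marked with a dot in the table (EconomyFix-Business and uberX-uberELKA) fall just short of the 0.05 cutoff.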

========================================================================================================================

Driver’s Speed

driver_name_en max_speed
Abuzar 76.77778
Vadim 71.67097
Valeriy 71.18961
Sergey 67.48052
Vyacheslav 66.71688
Boburzhon 64.21571
Evgeniy 62.52245
Vasiliy 61.83317
Sagynbek 60.57624
Viktor 60.52662

========================================================================================================================

  • The table at the bottom shows Mr. Abuzar’s trip from the airport; he was assigned to Stan’s ride only once.

  • max_speed is calculated as ‘distance / trip duration’, so it is really the average speed over the entire ride. This also means that the driver may well have exceeded his max_speed at times.

  • My guess based on the data above: Stan has taken the same route several times, but he was assigned to Mr. Abuzar only once. It’s possible that Stan gave the driver a bad review and was no longer matched with Mr. Abuzar. Or it could be the other way around, with Mr. Abuzar giving Stan a bad review. Unfortunately, we don’t have review data to verify my guess.

  • However, we can’t conclude that Mr. Abuzar went over the speed limit or drove aggressively, given that the ride took place around midnight near the airport.

  • Let’s now compare the speed of other drivers who took the same route in similar weather and at a similar time.

========================================================================================================================

========================================================================================================================

  • There are 3 drivers who had the same departure and destination and drove at different average speeds.

  • If other conditions that are not available in our dataset (traffic, condition of the car, etc.) were similar, Mr. Abuzar seems to have driven relatively faster than the others.

  • However, considering that there’s no review information, a very limited sample size, and a different driver matched for each ride, we can’t say for sure.

  • Now that I take a closer look at the dataset, there is an error. On 11/25/2017, Mr. Abuzar and Mr. Leonid both drove at the same time around midnight! There must have been an error in data collection.

========================================================================================================================

Correlation

========================================================================================================================

  • In the correlation plot, I set the significance level at 0.10; correlations with a p-value greater than 0.10 are considered not statistically significant.

  • It’s noteworthy that there are weak positive correlations among temperature, feels_like, wait_time, trip_time, price/fee, distance, and uberBLACK.

  • price_usd & distance_kms: Weak positive correlation

  • Premium service like uberBLACK & price_usd: Weak positive correlation
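
For readers who want to see what sits behind one cell of a correlation plot, here is a minimal sketch of Pearson's r together with the t statistic that its significance test is based on. The numbers are made up for illustration (they are not Stan's rides) and the function name is hypothetical. Turning the t statistic into a p-value needs the t distribution with n - 2 degrees of freedom, which is left to a statistics library.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical distances (km) and prices (USD), not taken from the dataset.
distance_kms = [1.0, 2.0, 3.0, 4.0, 5.0]
price_usd = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(distance_kms, price_usd)

# Test statistic for H0: no correlation, with n - 2 degrees of freedom.
n = len(distance_kms)
t_stat = r * sqrt((n - 2) / (1 - r ** 2))

print(round(r, 4), round(t_stat, 4))
```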

========================================================================================================================

EDA - GPS

Number of Ride by Departure Address

trip_start_address pickup_cnt
Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 173
Sverdlovskaya naberezhnaya, 44?/4?, Sankt-Peterburg, Russia, 195027 40
Sofyi Kovalevskoy ul., 14к6?, Sankt-Peterburg, Russia, 195256 23
Pulkovo Airport (LED), Unnamed Road, Sankt-Peterburg, Russia, 196210 20
Sofyi Kovalevskoy ulitsa, 14 ко?п?? 6?, Sankt-Peterburg, Russia, 195256 16
Irinovskiy Prospekt, 32 Sankt-Peterburg 195030 10

Number of Ride by Destination Address

trip_end_address dropoff_cnt
Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 183
Sverdlovskaya naberezhnaya, 44?/4?, Sankt-Peterburg, Russia, 195027 59
Sofyi Kovalevskoy ul., 14к6?, Sankt-Peterburg, Russia, 195256 29
Pulkovo Airport (LED), Unnamed Road, Sankt-Peterburg, Russia, 196210 28
Sofyi Kovalevskoy ulitsa, 14 ко?п?? 6?, Sankt-Peterburg, Russia, 195256 15
Kirishskaya ul., 11, Sankt-Peterburg, Russia, 195299 13

========================================================================================================================

  • Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014. I am guessing this is the home address, as it appears most frequently as both departure and destination.

  • Sverdlovskaya naberezhnaya, 44?/4?, Sankt-Peterburg, Russia, 195027. I am guessing this is a company or school because it ranks 2nd in frequency.

  • Let’s now look at the trips between departure and destination.

========================================================================================================================

Frequency of Trips between Grouped Departure and Destination & Average Departure Time

trip_start_address trip_end_address pickup_cnt avg_time
Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 Sverdlovskaya naberezhnaya, 44?/4?, Sankt-Peterburg, Russia, 195027 32 10:37:34
Sverdlovskaya naberezhnaya, 44?/4?, Sankt-Peterburg, Russia, 195027 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 28 17:27:04
Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 Pulkovo Airport (LED), Unnamed Road, Sankt-Peterburg, Russia, 196210 16 15:54:19
Pulkovo Airport (LED), Unnamed Road, Sankt-Peterburg, Russia, 196210 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 11 11:55:05
Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 Kirishskaya ul., 11, Sankt-Peterburg, Russia, 195299 7 10:15:17
Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 Yakornaya Ulitsa, 5?, Sankt-Peterburg, Russia, 195027 7 12:08:00
Sofyi Kovalevskoy ul., 14к6?, Sankt-Peterburg, Russia, 195256 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 6 15:01:30
ul. Kollontay, 1, Sankt-Peterburg, Russia, 193230 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 6 14:59:30
Magnitogorskaya ul., 11, Sankt-Peterburg, Russia, 195027 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 5 14:27:00
Ulitsa Dzhona Rida, 2, Sankt-Peterburg, Russia, 193318 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 5 14:37:36

========================================================================================================================

  • The first two rows show that the most frequent trips were between what I guessed to be home and company/school.

  • They also show that Stan leaves home for work around 10:37am and heads back home around 5:27pm.

  • The 3rd row shows that Stan leaves home for the airport around 4:00pm. At first I thought he might be a consultant, but I don’t think consultants leave home for the airport around 4:00pm. Maybe he is a freelancer?
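
A table like the one above comes from grouping rides by their (departure, destination) pair, counting them, and averaging the departure time. Here is a small sketch of that idea. The trips and place labels are hypothetical stand-ins rather than Stan's data, the function names are my own, and the average is taken over seconds since midnight, which is an assumption about how avg_time was computed.

```python
from collections import defaultdict
from datetime import time

# Hypothetical trips: (start, end, departure time). Not real data.
trips = [
    ("home", "office", time(10, 30, 0)),
    ("home", "office", time(10, 45, 0)),
    ("office", "home", time(17, 20, 0)),
    ("office", "home", time(17, 40, 0)),
    ("home", "airport", time(16, 0, 0)),
]

def seconds_since_midnight(t):
    return t.hour * 3600 + t.minute * 60 + t.second

def format_hms(seconds):
    """Format a number of seconds since midnight as HH:MM:SS."""
    seconds = int(round(seconds))
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"

def route_summary(rows):
    """Return (route, ride count, average departure time), most frequent first."""
    by_route = defaultdict(list)
    for start, end, departed in rows:
        by_route[(start, end)].append(seconds_since_midnight(departed))
    summary = [
        (route, len(secs), format_hms(sum(secs) / len(secs)))
        for route, secs in by_route.items()
    ]
    summary.sort(key=lambda row: -row[1])
    return summary

summary = route_summary(trips)
for route, count, avg_time in summary:
    print(route, count, avg_time)
```

One caveat with this approach: averaging clock times as seconds since midnight gives misleading results for routes whose rides fall on both sides of midnight.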

========================================================================================================================

Frequency of Trips between Grouped Departure and Destination & Average Arrival Time

trip_start_address trip_end_address dropoff_cnt avg_time
Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 Sverdlovskaya naberezhnaya, 44?/4?, Sankt-Peterburg, Russia, 195027 32 11:05:30
Sverdlovskaya naberezhnaya, 44?/4?, Sankt-Peterburg, Russia, 195027 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 28 17:50:43
Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 Pulkovo Airport (LED), Unnamed Road, Sankt-Peterburg, Russia, 196210 16 16:38:04
Pulkovo Airport (LED), Unnamed Road, Sankt-Peterburg, Russia, 196210 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 11 12:39:38
Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 Kirishskaya ul., 11, Sankt-Peterburg, Russia, 195299 7 10:47:00
Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 Yakornaya Ulitsa, 5?, Sankt-Peterburg, Russia, 195027 7 12:31:09
Sofyi Kovalevskoy ul., 14к6?, Sankt-Peterburg, Russia, 195256 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 6 15:32:00
ul. Kollontay, 1, Sankt-Peterburg, Russia, 193230 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 6 15:29:50
Magnitogorskaya ul., 11, Sankt-Peterburg, Russia, 195027 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 5 14:51:36
Ulitsa Dzhona Rida, 2, Sankt-Peterburg, Russia, 193318 Paradnaya Ulitsa, 3, Sankt-Peterburg, Russia, 191014 5 15:06:12

========================================================================================================================

  • This table shows the time Stan arrives at his destination.

  • The top row shows that he arrives at work around 11:05am.

  • I guess his company does not require him to be at work at 11 o’clock sharp.

========================================================================================================================

Interactive Map: Frequency of Rides based on Departure

========================================================================================================================

  • Click or zoom in for detailed information

  • If you keep zooming into the point with the highest number, you will find the neighborhood of his house. This sounds creepy.

========================================================================================================================

Interactive Map: Trip Path (Line from Departure to Destination)

========================================================================================================================

  • Click or zoom in for detailed information

  • This map draws a line between departure and destination.

  • You can see that there is a cluster in the bottom right corner and no line between the clusters. This implies that he did not take Uber to get to the other city. My guess is that Stan got a ride from a friend or flew there.

========================================================================================================================

In retrospect…

All in all, we have explored Stan’s Uber history, analyzed drivers’ speed, and guesstimated some information about him, such as his job. While I was doing the EDA, I was a little worried that I might be digging too deep, but thanks to Stan’s generosity, I was able to explore and scratch that itch. The analysis might have been more interesting with review information.

Besides ride-hailing services like Uber, I am curious about what kind of data car-sharing companies like Zipcar, which rent cars to customers for short periods, are accumulating and what fancy pricing models they are building. I don’t think companies would share such crucial data, so I might as well look for different datasets and combine them to come up with a more intriguing analysis!

If you are curious about Stan’s website, please click here :)