Introduction

Uber is a car-for-hire service that relies on smartphone technology to dispatch drivers and manage fees and has become the most recognized alternative to traditional taxi cabs.
It is growing rapidly and, according to Uber, is available in 65 countries worldwide with over 75 million riders.
It is dynamically changing the traditional way people commute.
According to RideGuru, the average Uber trip is about 6 miles long.
This presentation asks whether the sample of Uber drives taken between January and December 2016 (\(n = 1155\)) comes from a population whose trips are, on average, 6 miles long (\(\mu = 6\)).
In other words, the aim of this investigation is to find out how unusual the sample mean is, assuming \(\mu = 6\).

Problem Statement

As the car-for-hire provider Uber dominates the taxi and rideshare market, this investigation uses a hypothesis test on the sample to determine whether the sample mean trip length is unusual.
In other words, is there evidence that the population mean trip length differs from 6 miles?
For this analysis, a two-tailed one-sample t-test is used to test whether the sample result is statistically significant.
The significance level, denoted by \(\alpha\), is set at \(\alpha = 0.05\) to judge the “unusualness”.
Before testing, the data is preprocessed to remove inconsistencies and log-transformed to reduce the skewness of the original dataset.
Two assumptions are made for this test: (a) the population standard deviation is unknown; (b) the data is normally distributed.
The hypothesis test is performed using three methods: (a) the t-critical value method; (b) the p-value approach; (c) the confidence interval approach.

Data

The data is the Uber Drives dataset, containing 7 variables and 1155 observations, sourced from the open data platform Kaggle. More information about this dataset can be found here: https://www.kaggle.com/zusmani/uberdrives
The variables consist of:

START_DATE: start date of the travel

END_DATE: end date of the travel

CATEGORY: Category of travel (e.g. business, personal)

START: Starting point of the travel

STOP: End point of travel

MILES: Miles travelled

PURPOSE: Purpose of travel (e.g. meeting, errands, meal)

The data is preprocessed before the analysis: the last row, which contains the total number of miles rather than a single trip, is removed to make the data tidy and appropriate for analysis.
As the data is highly right-skewed, it is transformed using the log function to reduce the skewness.
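The two preprocessing steps can be sketched on a toy data frame. This is only an illustration under stated assumptions: the rows, the values and the `Totals` label below are made up and stand in for the Kaggle file, which is not loaded here; `uber_raw` and `uber_toy` are hypothetical names.

```r
# Toy stand-in for the raw data: the final row holds a grand total, not a trip.
# All values here are invented for illustration.
uber_raw <- data.frame(
  START_DATE. = c("1/1/2016 21:11", "1/2/2016 1:25", "1/2/2016 20:25", "Totals"),
  MILES.      = c(5.1, 5.0, 4.8, 14.9),
  stringsAsFactors = FALSE
)

# Step 1: drop the last row so that every remaining row is a single trip.
uber_toy <- uber_raw[-nrow(uber_raw), ]

# Step 2: log-transform the trip lengths to reduce right skew.
uber_toy$MILES_transformed <- log(uber_toy$MILES.)

nrow(uber_toy)
```

Dropping the row by position keeps the sketch short; filtering on the label in the date column would be a more defensive choice if the position of the totals row were not known.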

Descriptive Statistics and Visualisation

The average number of miles travelled in the sample is 10.57, with a standard deviation of 21.58.
The maximum distance travelled is 310.30 miles, while the minimum is 0.50 miles.

summary(uber$MILES., na.rm = TRUE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50    2.90    6.00   10.57   10.40  310.30

sd(uber$MILES.)

## [1] 21.57911

Descriptive Statistics and Visualisation Contd:

The box plot visualises a number of outliers in the sample.

boxplot(uber$MILES.)

Descriptive Statistics and Visualisation Contd:

The z-score method can also be used to identify outliers. From the summary() output, the minimum z-score is -0.47 and the maximum is 13.89.

zscores_miles <- uber$MILES. %>%  scores(type = "z")
zscores_miles %>% summary()

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.466509 -0.355290 -0.211632  0.000000 -0.007732 13.889971

The number of outliers identified by the z-score method (|z| > 3) is small: 22 out of 1155 observations, or just 1.9%. Because the outliers appear to be part of the pattern in the data, and the reasons for them are uncertain, they are not removed.

length (which( abs(zscores_miles) >3 ))

## [1] 22
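The same rule can be written in base R without a helper package. The sketch below uses made-up trip lengths, not the Uber sample, and applies the usual z-score definition (distance from the mean divided by the sample standard deviation). That definition is consistent with the summary above: (0.50 - 10.57) / 21.58 is about -0.47.

```r
# Illustrative trip lengths in miles (invented values, not the Uber sample).
miles <- c(rep(c(2, 3, 5, 6, 8, 10), times = 5), 150)

# z-score: distance from the mean in units of the sample standard deviation.
z <- (miles - mean(miles)) / sd(miles)

# Flag observations more than 3 standard deviations from the mean.
outlier_idx <- which(abs(z) > 3)
length(outlier_idx)
```

In this toy vector only the single 150-mile trip is flagged.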

Descriptive Statistics and Visualisation Contd: (preprocessing of data)

The data can be plotted in a histogram to check the distribution of the data.

hist(uber$MILES., col = "yellow")

Descriptive Statistics and Visualisation Contd: (transformation of data)

The distribution is heavily right-skewed, so the data is transformed using the log function to reduce the skewness. The hypothesis test is performed on this transformed data, saved as a variable named “MILES_transformed”.

uber$MILES_transformed <- log(uber$MILES.)


par(mfrow=c(2,2)) 
hist(uber$MILES., col = "yellow", main = "Histogram for Uber Miles", xlab = "Uber Miles Before")
hist(uber$MILES_transformed, col = "yellow", main = "Histogram for Transformed Uber Miles", xlab = "Uber Miles After Transformation")
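The effect of the transformation can also be quantified rather than judged by eye. The sketch below is an assumption-laden illustration: it uses simulated log-normal trip lengths instead of the Uber data, and `skewness()` is a small helper defined here (a moment-based estimate), not a base R function.

```r
set.seed(42)

# Simulated right-skewed trip lengths (log-normal), NOT the Uber data.
miles_sim <- rlnorm(1000, meanlog = 1.75, sdlog = 1)

# Moment-based sample skewness: close to 0 for a symmetric distribution.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_before <- skewness(miles_sim)
skew_after  <- skewness(log(miles_sim))

c(before = skew_before, after = skew_after)
```

A strongly positive value before the transformation and a value near zero afterwards would indicate that the log function has removed most of the right skew.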

Hypothesis Testing

Hypothesis for two-tailed one sample t-test:

\(H_0: \mu = 1.791759\) (log 6)

\(H_A: \mu \ne 1.791759\) (log 6)

log(6)

## [1] 1.791759

Assumption:

One sample t-test assumes that:

The population standard deviation is unknown, which is true in this case.
The data is normally distributed.

The normality assumption could have been relaxed because of the large sample size (n = 1155). According to the Central Limit Theorem, when the sample size is large (typically defined as n > 30), the sampling distribution of the mean is approximately normal regardless of the variable’s underlying population distribution. Nevertheless, the log transformation is applied to reduce the skewness so that the data better satisfies the normality assumption.
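The Central Limit Theorem argument can be illustrated with a small simulation. This is a sketch under assumptions: the skewed population is a simulated log-normal distribution chosen for illustration, not the Uber data, and `skewness()` is a helper defined here.

```r
set.seed(2016)

# Moment-based sample skewness: close to 0 for a symmetric distribution.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# One simulated skewed sample of the same size as the Uber sample.
one_sample <- rlnorm(1155, meanlog = 1.75, sdlog = 1)

# Means of 2000 repeated samples, each of size 1155, from the same population.
sample_means <- replicate(2000, mean(rlnorm(1155, meanlog = 1.75, sdlog = 1)))

c(raw = skewness(one_sample), means = skewness(sample_means))
```

The individual observations are strongly skewed, while the sample means are far closer to symmetric, which is what the theorem predicts for a sample of this size.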

Hypothesis Testing Contd:

Significance level (\(\alpha\)):

\(\alpha = 0.05\)

We define “unusual” as there being a less than 5% chance of obtaining the observed result, or one even more extreme, assuming \(H_0\) is true. This 5% is the significance level (\(\alpha\)) of the test.

Decision Rules:

Reject \(H_0\) :

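if the t-statistic falls outside the range -1.96 to +1.96 (the t-critical values)

if the p-value < 0.05 (the \(\alpha\) significance level)

if the CI for the mean of the log-transformed miles does not capture the hypothesised value under \(H_0\): \(\mu = 1.791759\) (log 6)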

otherwise, fail to reject \(H_0\)
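The three rules rest on the same quantities. Written out in standard notation (the symbols \(\bar{x}\), \(s\), \(\mu_0\), \(T_{n-1}\) and \(t^{*}\) are introduced here for illustration and do not appear in the original slides), with \(\bar{x}\) and \(s\) the mean and standard deviation of the log-transformed miles, \(\mu_0 = \log 6 = 1.791759\) and \(n = 1155\):

\[
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad
p = 2\,P\left(T_{n-1} \ge |t|\right), \qquad
\text{CI}_{95\%} = \bar{x} \pm t^{*}\,\frac{s}{\sqrt{n}}
\]

Here \(T_{n-1}\) follows a t-distribution with \(n - 1\) degrees of freedom and \(t^{*}\) is its 97.5th percentile (about 1.96 for this sample size). Because all three expressions share the same standard error \(s / \sqrt{n}\), the condition \(|t| > t^{*}\) holds exactly when \(p < 0.05\) and exactly when \(\mu_0\) lies outside the interval.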

Hypothesis Testing Contd:

To perform the t-critical value method, we first need to find the t-critical value (t*). It is the value beyond which a sample result has less than \(\alpha\) probability (\(\alpha = 0.05\) in this case) of occurring when \(H_0\) is true. Because this is a two-tailed test, \(\alpha\) is split between the two tails, so the t-critical values sit about 1.96 standard errors below and above the hypothesised population mean of 6 miles (log value = 1.791759).

qt(p = 0.05/2, df = 1155-1, lower.tail = TRUE)

## [1] -1.962022
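The call above returns only the lower critical value. A short sketch of both tails, with the standard normal value shown for comparison (the variable names are illustrative):

```r
alpha <- 0.05
n     <- 1155

# Two-tailed test: alpha is split equally between the two tails.
t_lower <- qt(alpha / 2, df = n - 1)
t_upper <- qt(1 - alpha / 2, df = n - 1)

# Standard normal critical value, which the t value approaches as df grows.
z_upper <- qnorm(1 - alpha / 2)

c(t_lower = t_lower, t_upper = t_upper, z_upper = z_upper)
```

With 1154 degrees of freedom the t-critical values are symmetric about zero and only slightly wider than the normal value, which is why the decision rule quotes them as -1.96 and +1.96.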

The one-sample t-test is very simple to perform in R. Its output covers all three methods (the t-statistic for the t-critical value method, the p-value, and the confidence interval), which lets us decide whether to reject or fail to reject the null on the basis of the decision rules above.

t.test(uber$MILES_transformed, mu = log(6), alternative="two.sided")

## 
##  One Sample t-test
## 
## data:  uber$MILES_transformed
## t = -1.374, df = 1154, p-value = 0.1697
## alternative hypothesis: true mean is not equal to 1.791759
## 95 percent confidence interval:
##  1.69446 1.80891
## sample estimates:
## mean of x 
##  1.751685
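To show where these numbers come from, the pieces of the output can be computed by hand and compared with `t.test()`. This is a sketch on simulated values, not the Uber data; `x`, `mu0` and the other names are illustrative.

```r
set.seed(1)

# Simulated log-scale values standing in for the transformed miles.
x   <- rnorm(200, mean = 1.75, sd = 1)
mu0 <- log(6)

n      <- length(x)
se     <- sd(x) / sqrt(n)                    # standard error of the mean
t_stat <- (mean(x) - mu0) / se               # t-statistic
t_crit <- qt(0.975, df = n - 1)              # two-tailed critical value
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)   # two-tailed p-value
ci     <- mean(x) + c(-1, 1) * t_crit * se   # 95% confidence interval

# The three decision rules, each expressed as "reject H0?".
reject_t  <- abs(t_stat) > t_crit
reject_p  <- p_val < 0.05
reject_ci <- mu0 < ci[1] || mu0 > ci[2]

# Reference result from the built-in test.
ref <- t.test(x, mu = mu0, alternative = "two.sided")
```

The hand-computed statistic, p-value and interval should match `ref`, and the three logical flags always agree because they are built from the same standard error.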

Hypothesis Testing Contd:

According to the one sample t-test:

t-statistic: -1.374

p-value: 0.1697 (approximately 0.17)

Confidence interval: lower bound: 1.69; upper bound: 1.81
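Because the test was run on log-transformed miles, these bounds are on the log scale. As a hedged illustration, exponentiating them gives an interval in miles. Note that the exponential of a mean of logs is the geometric mean, so the back-transformed values describe a typical trip on that scale rather than the arithmetic mean of 10.57 miles. The variable names are illustrative; the numbers are copied from the t-test output.

```r
# Values copied from the t-test output (log scale).
ci_log   <- c(lower = 1.69446, upper = 1.80891)
mean_log <- 1.751685

# Back-transform to miles.
ci_miles   <- exp(ci_log)
mean_miles <- exp(mean_log)

round(c(ci_miles, estimate = mean_miles), 2)
```

The back-transformed interval runs from roughly 5.4 to 6.1 miles and contains 6, which matches the log-scale interval containing log 6.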

Discussion

The t-statistic (-1.37) is not more extreme than the t-critical value (-1.96), so we fail to reject the null.
The p-value (0.17) is greater than the significance level (\(\alpha = 0.05\)), so we fail to reject the null by the p-value approach as well.
The lower and upper bounds of the confidence interval (1.69 and 1.81 respectively) capture the null hypothesised mean (\(\mu = 1.791759\), i.e. log 6), so we fail to reject the null on the basis of the confidence interval as well.
All three methods always lead to the same decision. The t-critical value method only tells us whether the result is statistically significant; it does not tell us how probable the sample mean is when the null hypothesised population mean (6 in our case) is true.
The p-value gives the probability of obtaining a sample mean at least as extreme as the one observed when the null hypothesised population mean is true. Hence the p-value and confidence interval are preferred over the t-critical value method, but all three point in the same direction.

Discussion (contd):

Strengths:

A decent-sized dataset with 1155 observations.
The dataset used for this hypothesis test was collected in as random a fashion as possible, thereby reducing the chance of sampling bias. The data is:

Collected from three geographically and demographically different countries, namely the USA, Pakistan and Sri Lanka.
Taken over a period of one year, from January to December 2016, which takes into account any temporary fluctuations due to external factors such as weather, vacations and festivals.
Drawn from trips by random people picked up at random locations, taking Uber for various purposes.

Limitations:

Certain limitations are acknowledged for this investigation:
Firstly, considering the base of over 75 million riders reported by Uber, the dataset accounts for a very small proportion.
Secondly, the data was created by a single individual source (a driver), so its reliability cannot be verified.
Thirdly, the outliers, although few in number, could not be removed because the reasons for them are uncertain and they were found to be part of the pattern in the data.

Discussion (contd):

Direction for future investigations:

As the sample size increases, the chance of sampling error decreases; hence the bigger the sample the better, considering Uber’s very large customer base.
Data from multiple sources is preferred, as it makes the data more reliable than data from one arbitrary source.
Test finding: A two-tailed, one-sample t-test was used to determine whether the mean Uber drive length (in miles) was significantly different from the previously assumed population mean of 6 miles (transformed value = 1.791759, i.e. log 6). The 0.05 level of significance was used. The sample’s mean trip length was M = 10.57 miles, SD = 21.58 miles. The one-sample t-test found the mean of the log-transformed trip length not to be statistically significantly different from the hypothesised population value, t(1154) = -1.37, p = 0.17, 95% CI [1.69, 1.81].
Final Conclusion:

Through the hypothesis test on the sample, the sample mean trip length of 10.57 miles (transformed mean: 1.751685) is not found to be unusual; the sample is consistent with coming from a population whose trips are, on average, 6 miles long (\(\mu = 6\)).

References

Zeeshan-ul-hassan Usmani, “My Uber Drives Dataset”, Kaggle Dataset Repository, March 23, 2017, viewed October 18, 2018, https://www.kaggle.com/zusmani/uberdrives/home
Ippei, RideGuru, April 10, 2018, viewed October 2018, https://ride.guru/lounge/p/what-is-the-average-trip-distance-for-an-uber-or-lyft-ride
Uber, viewed October 18, 2018, https://www.uber.com/en-AU/newsroom/company-info/

EXPLORING UBER TRIPS

How far does Uber take you?
