Neeraj Sehrawat, StudentId- s3711712
Ria Talwar, StudentId- s3729618
Radhika Santosh Zawar, StudentId- s3734939
Last updated: 28 October, 2018
Uber is a car-for-hire service that relies on smartphone technology to dispatch drivers and manage fees and has become the most recognized alternative to traditional taxi cabs.
It is growing rapidly and, according to Uber, is available in 65 countries worldwide with over 75 million riders.
It is changing the way people traditionally commute.
According to RideGuru, the average trip distance for an Uber ride is about 6 miles.
This presentation asks: “Does the sample of Uber drives taken between January and December 2016 (n = 1155) come from a population whose mean trip length is 6 miles (\(\mu = 6\))?”
In other words, the aim of this investigation is to find out how unusual the sample mean is, assuming \(\mu = 6\).
As the car-for-hire service provider Uber dominates the taxi and rideshare market, hypothesis testing on the sample is used to determine whether the sample mean trip length is unusual or not.
Put differently, is there evidence that the population mean trip length differs from 6 miles?
For this analysis, a two-tailed one-sample t-test is used to test whether the sample result is statistically significant.
The significance level, denoted by \(\alpha\), is set at \(\alpha = 0.05\) to judge the “unusualness”.
The data is preprocessed before the hypothesis test to remove inconsistencies, and transformed to reduce the skewness of the original dataset.
Two assumptions are made for this test: (a) the population standard deviation is unknown, and (b) the data is normally distributed.
Hypothesis testing is performed using three methods: (a) the t-critical value method, (b) the p-value approach, and (c) the confidence interval approach.
The data is the Uber Drives dataset, containing 7 variables and 1155 observations, sourced from the open data platform Kaggle. More information about this dataset can be found here: https://www.kaggle.com/zusmani/uberdrives
The variables consist of:
START_DATE: Start date of the travel
END_DATE: End date of the travel
CATEGORY: Category of travel (e.g. business, personal)
START: Starting point of the travel
STOP: End point of the travel
MILES: Miles travelled
PURPOSE: Purpose of travel (e.g. meeting, errands, meal)
The data is preprocessed before the analysis and the last row containing the total number of miles is removed in order to make it tidy and appropriate for analysis.
As the data is highly right-skewed, it is transformed using the natural logarithm to reduce the skewness.
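The two preprocessing steps described above can be sketched as follows. This is a minimal illustration on a small made-up data frame, not the authors' actual code: the toy values, the object name `uber_raw`, the column name `START_DATE.`, and the `Totals` label in the last row are assumptions for illustration. Only `MILES.` and `MILES_transformed` are names taken from the analysis.

```r
# Toy stand-in for the raw file (values are illustrative, not from the dataset).
# The last row mimics a summary row holding the total miles (assumed layout).
uber_raw <- data.frame(
  START_DATE. = c("1/1/2016 21:11", "1/2/2016 1:25", "1/2/2016 20:25", "Totals"),
  MILES.      = c(5.1, 5.0, 4.8, 14.9),
  stringsAsFactors = FALSE
)

# Step 1: drop the final row, which holds a total rather than a trip
uber <- uber_raw[-nrow(uber_raw), ]

# Step 2: log-transform the right-skewed trip lengths (natural log)
uber$MILES_transformed <- log(uber$MILES.)
```

Dropping the row by position assumes the totals row is always last; filtering on the label would be an alternative if that is not guaranteed.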
summary(uber$MILES., na.rm = TRUE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.50 2.90 6.00 10.57 10.40 310.30
sd(uber$MILES.)
## [1] 21.57911
boxplot(uber$MILES.)
zscores_miles <- uber$MILES. %>% scores(type = "z")
zscores_miles %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.466509 -0.355290 -0.211632 0.000000 -0.007732 13.889971
The total number of outliers by z-score (|z| > 3) is small: 22 out of 1155 observations, or just 1.9%. The outliers appear to be part of the pattern in the data, and the reasons for them are uncertain, so they are not removed.
length(which(abs(zscores_miles) > 3))
## [1] 22
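The z-scores above can be cross-checked by hand from the reported summary statistics. The sketch below assumes that `scores(type = "z")` computes (x - mean) / sd, which is consistent with the summary output but is not shown in the source. The helper name `z_score` is introduced here for illustration only.

```r
# Reported summary statistics for MILES (rounded values from the output above)
mean_miles <- 10.57
sd_miles   <- 21.57911

# Hypothetical helper: standard z-score using the sample mean and sd
z_score <- function(x) (x - mean_miles) / sd_miles

z_min <- z_score(0.50)    # shortest trip in the data
z_max <- z_score(310.30)  # longest trip in the data

# Trip length above which |z| > 3, i.e. the outlier threshold in miles
cutoff_miles <- mean_miles + 3 * sd_miles
```

With these inputs the smallest and largest z-scores agree with the reported -0.4665 and 13.89 to about three decimal places (the small gap comes from the rounded mean), and the outlier rule corresponds to trips longer than roughly 75 miles.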
hist(uber$MILES., col = "yellow")
uber$MILES_transformed <- log(uber$MILES.)
par(mfrow=c(2,2))
hist(uber$MILES., col = "yellow", main = "Histogram for Uber Miles", xlab = "Uber Miles Before")
hist(uber$MILES_transformed, col = "yellow", main = "Histogram for Transformed Uber Miles", xlab = "Uber Miles After Transformation")
\(H_0: \mu = 1.791759\) (log 6)
\(H_A: \mu \ne 1.791759\) (log 6)
log(6)
## [1] 1.791759
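Because the test is run on log-transformed miles, \(\mu\) in these hypotheses is the mean of log(miles) rather than the mean of miles. The short note below is added as a clarification and is not part of the original analysis; the symbol \(X\) for trip length in miles is introduced here for illustration.

```latex
% mu is the mean on the log scale; exp(mu) is the geometric mean of X
\[
\mu = E[\log X], \qquad
H_0:\ E[\log X] = \log 6 \iff \exp\bigl(E[\log X]\bigr) = 6
\]
% Jensen's inequality (log is concave) relates the two kinds of mean
\[
E[\log X] \le \log E[X]
\quad\Longrightarrow\quad
\exp\bigl(E[\log X]\bigr) \le E[X]
\]
```

Read this way, the test compares the geometric mean trip length with 6 miles. For right-skewed data the geometric mean sits below the arithmetic mean, which is consistent with the raw sample mean of 10.57 miles being well above the back-transformed log mean of about 5.76 miles.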
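The one-sample t-test assumes that:
The population standard deviation is unknown, which is true in this case.
The data is normally distributed; the log transformation is applied to bring the right-skewed data closer to this assumption.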
\(\alpha = 0.05\)
We define “unusual” as there being a less than 5% chance of a result, or a more extreme result, occurring, assuming \(H_0\) is true. We will call this 5% the significance level (\(\alpha\)) of the test.
Reject \(H_0\) :
if the t-statistic falls outside the critical values -1.96 to +1.96
if the p-value < 0.05 (the \(\alpha\) significance level)
if the CI of the mean of the transformed miles does not capture \(H_0: \mu = 1.791759\) (log 6)
otherwise, fail to reject \(H_0\)
qt(p = 0.05/2, df = 1155-1, lower.tail = TRUE)
## [1] -1.962022
The one-sample t-test is simple to perform in R. A single call gives the results for all three methods (the t-statistic to compare with the critical value, the p-value, and the confidence interval), which lets us decide whether to reject or fail to reject the null hypothesis using the decision rules above.
t.test(uber$MILES_transformed, mu = log(6), alternative="two.sided")
##
## One Sample t-test
##
## data: uber$MILES_transformed
## t = -1.374, df = 1154, p-value = 0.1697
## alternative hypothesis: true mean is not equal to 1.791759
## 95 percent confidence interval:
## 1.69446 1.80891
## sample estimates:
## mean of x
## 1.751685
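To make the output above less of a black box, the t-statistic and p-value can be rebuilt from the reported numbers using t = (sample mean - hypothesised mean) / standard error. This is a sketch under stated assumptions: the raw data is not used, and the standard error is backed out of the reported 95% confidence interval rather than computed from the sample, so results match only up to rounding.

```r
n    <- 1155             # number of trips (df = n - 1 = 1154)
xbar <- 1.751685         # reported mean of log(miles)
mu0  <- log(6)           # hypothesised mean on the log scale
ci   <- c(1.69446, 1.80891)  # reported 95% confidence interval

# Two-sided critical value at alpha = 0.05
t_crit <- qt(0.975, df = n - 1)

# Standard error implied by the CI: half-width divided by the critical value
se <- diff(ci) / (2 * t_crit)

# Test statistic and two-sided p-value
t_stat  <- (xbar - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
```

Up to rounding, `t_stat` and `p_value` reproduce the t = -1.374 and p-value = 0.1697 printed by `t.test()`.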
t-statistic: -1.374
P-value: 0.1697 (approximately 0.17)
Confidence interval: lower bound: 1.69; upper bound: 1.81
The t-statistic (-1.37) is not more extreme than the t-critical value (-1.96), so we fail to reject the null.
The p-value (0.17) is greater than the significance level (\(\alpha = 0.05\)), so we fail to reject the null by the p-value approach as well.
The lower and upper bounds of the confidence interval (1.69 and 1.81 respectively) capture the hypothesised mean (\(\mu_0 = 1.791759\), the value of log 6), so on the basis of the confidence interval we also fail to reject the null.
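The three decision rules can be applied side by side to the reported results, as in the sketch below. The variable names are illustrative, and the inputs are the rounded values printed by `t.test()`, not recomputed from the data. The last line back-transforms the interval to miles, which under the log transform describes the geometric mean rather than the arithmetic mean.

```r
alpha   <- 0.05
n       <- 1155
t_stat  <- -1.374               # reported t-statistic
p_value <- 0.1697               # reported p-value
ci      <- c(1.69446, 1.80891)  # reported 95% CI on the log scale
mu0     <- log(6)               # hypothesised mean on the log scale

# (a) t-critical value method
t_crit      <- qt(1 - alpha / 2, df = n - 1)
reject_by_t <- abs(t_stat) > t_crit

# (b) p-value approach
reject_by_p <- p_value < alpha

# (c) confidence interval approach
reject_by_ci <- mu0 < ci[1] || mu0 > ci[2]

# Interval back-transformed from log(miles) to miles
ci_miles <- exp(ci)
```

All three flags come out `FALSE`, matching the fail-to-reject decision, and the back-transformed interval of roughly 5.44 to 6.10 miles contains 6.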
All three methods always lead to the same decision. The t-critical value method only tells whether the result is statistically significant; it does not tell how probable, or how usual, the sample mean is when the hypothesised population mean (6 in our case) is true.
The p-value gives the probability of obtaining a sample mean at least as extreme as the one observed, assuming the hypothesised population mean is true. Hence the p-value and confidence interval are preferred over the t-critical value method, although all three point in the same direction.
Strengths:
A reasonably large dataset of 1155 observations.
The dataset used for this hypothesis test was collected in a largely random fashion, which reduces the chance of sampling bias. The data was:
Collected from three geographically and demographically different countries, namely the USA, Pakistan and Sri Lanka.
Taken over a period of one year, from January to December 2016, which accounts for temporary fluctuations due to external factors such as weather, vacations and festivals.
Drawn from random people picked up at random locations, taking Uber for various purposes.
Limitations:
Certain limitations are acknowledged for this investigation:
Firstly, considering Uber's customer base of over 75 million riders, as stated on Uber's official website, the dataset accounts for a very small proportion.
Secondly, the data comes from a single individual source (a driver), so its reliability cannot be verified.
Thirdly, the outliers could not be removed, despite being few in number, because the reasons for them are uncertain and they appear to be part of the pattern in the data.
Direction for future investigations:
As the sample size increases, the chance of sampling error decreases, so a bigger sample is better, especially considering Uber's very large customer base.
Data from multiple sources is preferred, as it is more reliable than data from one arbitrary source.
Test finding: A two-tailed, one-sample t-test was used to determine whether the mean Uber drive length (in miles) differed significantly from the assumed population mean of 6 miles (transformed value 1.791759, i.e. log 6). The 0.05 level of significance was used. The sample mean was M = 10.57 miles, SD = 21.58 miles (transformed mean 1.751685). The one-sample t-test found the mean trip length to be not statistically significantly different from the hypothesised population mean, t(1154) = -1.37, p = 0.17, 95% CI [1.69, 1.81] on the log scale.
Final Conclusion:
Through hypothesis testing on the sample, the sample mean trip length of 10.57 miles (transformed mean: 1.751685) is not found to be unusual. We fail to reject \(H_0\): the data is consistent with a population whose mean trip length is 6 miles (\(\mu = 6\)).
Zeeshan-ul-hassan Usmani, “My Uber Drives Dataset”, Kaggle Dataset Repository, March 23, 2017, viewed October 18, 2018, https://www.kaggle.com/zusmani/uberdrives/home
Ippei, RideGuru, April 10, 2018, viewed October 2018, https://ride.guru/lounge/p/what-is-the-average-trip-distance-for-an-uber-or-lyft-ride
Uber, viewed October 18, 2018, https://www.uber.com/en-AU/newsroom/company-info/