Ankit Munot , S3764950
2nd June 2019
Uber is a car rental service that relies on mobile technology to book cars and is a leading cab service worldwide.
It is expanding across the globe and, according to its own data, is available in 65 countries.
Uber is changing the way people travel, and how comfortably, by letting customers choose the car and its type.
This presentation examines whether trip distances average 6 miles, based on a sample of 1155 trips recorded in 2016.
As Uber increases its hold on the car rental market, we carried out an analysis to find whether the length of a trip was as expected or different from it.
For this analysis, a two-tailed one-sample t-test was used to check the sample results.
The data was pre-processed before hypothesis testing to remove inconsistent records.
Two assumptions were made for the test: (a) the population SD is unknown and (b) the data is normally distributed.
The dataset contains 7 variables and 1155 observations. This open data was collected from Kaggle: https://www.kaggle.com/zusmani/uberdrives
The attributes were:
Start_date: Starting date of travel
End_date: End date of travel
Category: Category of travel (e.g. business)
Start: Starting point of journey
Stop: Last stop of journey
Miles: Total miles travelled
Purpose: Purpose of travel (e.g. dinner)
The last row of the data, which contained the total number of miles travelled, was removed so that it would not distort the analysis.
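Dropping a trailing totals row can be sketched as follows. This is a minimal illustration on a made-up three-row data frame, not the original pre-processing code; the column names and values in `uber_raw` are assumptions chosen to resemble the Kaggle file.

```r
# Toy stand-in for the raw file (values are made up for illustration).
uber_raw <- data.frame(
  START_DATE. = c("1/1/2016 21:11", "1/2/2016 1:25", "Totals"),
  MILES.      = c(5.1, 5.0, 10.1),
  stringsAsFactors = FALSE
)

# head(x, -1) keeps every row except the last one,
# which here holds the grand total rather than a single trip.
uber <- head(uber_raw, -1)

print(nrow(uber_raw))  # rows before removing the totals row
print(nrow(uber))      # rows after removing it
```

Leaving the totals row in place would add one very large value to the Miles column and inflate both the mean and the standard deviation.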
From the summary below, the average number of miles travelled is 10.57, with a minimum distance of 0.5 miles and a maximum distance of 310.3 miles.
summary(uber$MILES., na.rm = TRUE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.50    2.90    6.00   10.57   10.40  310.30
sd(uber$MILES.)
## [1] 21.57911
The boxplot shows any outliers present in the sample.
boxplot(uber$MILES., xlab = "Miles", col = "orange",
        border = "brown")
The data is plotted as a histogram to check its distribution.
hist(uber$MILES., col = "Green", main = "Histogram for Miles")
The distribution is right-skewed. To reduce the skewness, the data was transformed using the log function, and the hypothesis test was then performed on the transformed data.
uber$MILES_transformed <- log(uber$MILES.)
par(mfrow=c(2,2))
hist(uber$MILES., col = "Green", main = "Histogram for Miles", xlab = " Miles Before")
hist(uber$MILES_transformed, col = "Green", main = "Histogram for Transformed Miles", xlab = "Miles After Transformation")
Hypothesis for a two-tailed one-sample t-test
\[H_0: \mu = \log(6) \approx 1.792 \] \[H_A: \mu \ne \log(6) \approx 1.792 \]
log(6)
## [1] 1.791759
Assumptions:
The population standard deviation is unknown, which is true in this case.
The data is normally distributed.
The normality assumption can be relaxed because of the large sample size. By the central limit theorem, when the sample size is large (n > 30) the sampling distribution of the mean is approximately normal regardless of the variable's underlying population distribution.
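The central limit theorem argument can be illustrated with a small simulation. This is a sketch under stated assumptions: the log-normal population (`meanlog = 1.75`, `sdlog = 1`) is a hypothetical right-skewed stand-in, not the Uber data, and `skew()` is a simple moment-based helper defined here.

```r
set.seed(1)   # reproducible simulation
n <- 1155     # same size as the Uber sample

# Simple moment-based skewness (helper defined for this sketch).
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# One right-skewed sample from the assumed population.
raw <- rlnorm(n, meanlog = 1.75, sdlog = 1)

# Means of 2000 independent samples of the same size.
means <- replicate(2000, mean(rlnorm(n, meanlog = 1.75, sdlog = 1)))

skew_raw   <- skew(raw)    # strongly positive: the raw data is right-skewed
skew_means <- skew(means)  # much closer to 0: the sample means look near-normal

print(c(raw = skew_raw, means = skew_means))
```

The raw values stay heavily skewed, while the distribution of the sample means is far more symmetric, which is what allows the t-test to be used here.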
Significance level \[ \alpha = 0.05 \]
We will call this 5% the significance level of the test.
Decision Rules:
Reject H0 :
If the t-statistic falls outside the critical values -1.96 to +1.96
If the p-value < 0.05 (significance level)
If the 95% CI of the mean of the transformed miles does not capture \[H_0:\mu= \log(6) \]
Otherwise, fail to reject \[H_0\]
qt(p = 0.05/2, df = 1155-1, lower.tail = TRUE)
## [1] -1.962022
The one-sample t-test is simple to perform in R. Its output supports all three methods (the t-critical value method, the p-value method and the confidence interval method), which lets us decide whether to reject or fail to reject the null on the basis of the decision rules above.
t.test(uber$MILES_transformed, mu = log(6), alternative = "two.sided")
##
## One Sample t-test
##
## data: uber$MILES_transformed
## t = -1.374, df = 1154, p-value = 0.1697
## alternative hypothesis: true mean is not equal to 1.791759
## 95 percent confidence interval:
## 1.69446 1.80891
## sample estimates:
## mean of x
## 1.751685
The t-statistic (-1.37) is less extreme than the t-critical value (-1.96), so we fail to reject the null.
The p-value (0.17) is greater than the significance level (0.05), so we fail to reject the null.
The 95% confidence interval (1.69 to 1.81) captures the null-hypothesised mean (log(6) = 1.791759), so on the basis of the CI we again fail to reject the null.
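The three decision rules always agree, because they are computed from the same sample mean, standard error and degrees of freedom. A minimal sketch that reproduces them by hand and compares them with `t.test()`; the vector `x` is simulated as a stand-in for the transformed miles (an assumption), so its numbers will differ from the output above.

```r
set.seed(42)
x   <- rnorm(1155, mean = 1.75, sd = 1)  # simulated stand-in for log(miles)
mu0 <- log(6)                            # null-hypothesised mean
n   <- length(x)

se     <- sd(x) / sqrt(n)                     # standard error of the mean
t_stat <- (mean(x) - mu0) / se                # t-statistic
t_crit <- qt(1 - 0.05 / 2, df = n - 1)        # two-tailed critical value
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)    # two-tailed p-value
ci     <- mean(x) + c(-1, 1) * t_crit * se    # 95% confidence interval

# The same test via the built-in function, for comparison.
res <- t.test(x, mu = mu0, alternative = "two.sided")

reject_by_t  <- abs(t_stat) > t_crit
reject_by_p  <- p_val < 0.05
reject_by_ci <- mu0 < ci[1] || mu0 > ci[2]

print(c(t = reject_by_t, p = reject_by_p, ci = reject_by_ci))
```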
The t-critical value method tells us whether the result is statistically significant, but it does not give the probability of a result at least as extreme as the one observed when the null hypothesis is true.
The p-value gives that probability: the chance of a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. Hence, the p-value and confidence interval methods are preferred over the t-critical value method.
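As a quick check, the two-tailed p-value can be recomputed from the reported t-statistic and degrees of freedom. The inputs below are the rounded values printed in the t-test output, so the result should match the reported p-value only up to rounding.

```r
t_stat <- -1.374   # t-statistic as printed in the t.test output (rounded)
df     <- 1154     # degrees of freedom, n - 1

# Two-tailed p-value: probability of a t-statistic at least this extreme
# in either tail, assuming the null hypothesis is true.
p_val <- 2 * pt(-abs(t_stat), df = df)

print(p_val)
```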
Few limitations were found for this investigation:
Firstly, considering the customer base of billions stated on Uber's website, the dataset covers only a very small share of trips.
Then, the dataset was created by a single individual, so its reliability cannot be verified.
Next, the outliers could not be removed: they are few in number, and the reasons behind them are uncertain.
Final Conclusion:
The hypothesis test on the sample (mean trip distance 10.57 miles; transformed mean 1.751685) did not find the sample to be unusual: we fail to reject the null, so the data is consistent with a population whose trip length is on average 6 miles (log(6) on the transformed scale).
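Because the test was run on log miles, the confidence interval can be mapped back to miles with `exp()`. A minimal sketch using the interval reported in the t-test output; note that the back-transformed interval describes the geometric mean of trip distance, not the arithmetic mean of 10.57 miles.

```r
# 95% confidence interval on the log scale, as reported by t.test().
ci_log <- c(1.69446, 1.80891)

# Back-transform to miles; this interval is for the geometric mean.
ci_miles <- exp(ci_log)

# Does the back-transformed interval capture the hypothesised 6 miles?
contains_6 <- ci_miles[1] < 6 && 6 < ci_miles[2]

print(round(ci_miles, 2))
print(contains_6)
```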
My Uber Drives: Complete Details of My Uber Drives in 2016. Kaggle. https://www.kaggle.com/zusmani/uberdrives
R-Bootcamp courses
R-Course module 6