Introduction

Uber is a car hire service that relies on mobile technology to book rides and is a leading cab service worldwide.
It is expanding rapidly across the globe and, according to its own data, is available in as many as 65 countries.
Uber is changing the way people travel, and how comfortably, by letting riders choose the car and its type.
This presentation investigates whether trip distances average 6 miles, using a sample of 1155 trips recorded in 2016.

Problem Statement

As Uber, a car hire service provider, increases its hold on the market, we carried out an analysis to find out whether trip length was as expected or different from expected.
For this analysis, a one-sample two-tailed t-test was used to test the sample.
The data was pre-processed before hypothesis testing to remove inconsistent records.
Two assumptions were made for the test: a) the population standard deviation is unknown; b) the data is normally distributed.

Data

The dataset contains 7 variables and 1155 observations. This open data was collected from Kaggle: https://www.kaggle.com/zusmani/uberdrives

The attributes were:

Start_date: Starting date of travel

End_date: End date of travel

Category: Category of travel (e.g. business)

Start: Starting point of journey

Stop: Last stop of journey

Miles: Total miles travelled

Purpose: Purpose of travel (e.g. dinner)

Descriptive Statistics and Visualisation

The last row of the data, which contained the total number of miles travelled, was removed so that the analysis would give correct results.
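The row removal described above can be sketched on a toy data frame. This is only an illustration: the toy values, the `uber_raw` and `uber_toy` names, and the "Totals" marker are assumptions, and the real data would be loaded from the Kaggle CSV instead.

```r
# Toy stand-in for the Kaggle file (hypothetical values, NOT the real data).
# The column name MILES. mirrors the one used in this presentation.
uber_raw <- data.frame(
  START_DATE. = c("1/1/2016 21:11", "1/2/2016 1:25", "Totals"),
  MILES.      = c(5.1, 5.0, 10.1),
  stringsAsFactors = FALSE
)

# Drop the final row, which holds the grand total rather than a single trip.
uber_toy <- uber_raw[-nrow(uber_raw), ]

nrow(uber_toy)        # number of trips left after removing the totals row
sum(uber_toy$MILES.)  # matches the total that was stored in the dropped row
```

Leaving the totals row in would add one enormous "trip" to the sample and inflate both the mean and the standard deviation.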

From the summary below, the average number of miles travelled is 10.57, with a minimum distance of 0.5 miles and a maximum distance of 310.3 miles.

summary(uber$MILES.,na.rm=TRUE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50    2.90    6.00   10.57   10.40  310.30

sd(uber$MILES.)

## [1] 21.57911

Descriptive Statistics Cont.

The boxplot shows any outliers present in the sample.

boxplot(uber$MILES.,xlab = "Miles",col="orange",
border="brown")
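As a rough guide to what the boxplot flags as outliers, the sketch below computes the usual 1.5 × IQR fences from the quartiles reported in the summary output. It assumes the default `range = 1.5` of `boxplot()`. R draws the box from hinges, which can differ slightly from the quartiles, so the fences are approximate.

```r
# Quartiles as reported by summary(uber$MILES.) in this presentation.
q1 <- 2.90
q3 <- 10.40

iqr <- q3 - q1                  # interquartile range
lower_fence <- q1 - 1.5 * iqr   # points below this are drawn as outliers
upper_fence <- q3 + 1.5 * iqr   # points above this are drawn as outliers

c(iqr = iqr, lower = lower_fence, upper = upper_fence)
```

The lower fence is negative, so only long trips can be flagged. The reported maximum of 310.3 miles lies far above the upper fence.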

Descriptive Statistics Cont.

The data is plotted as a histogram to check its distribution.

hist(uber$MILES.,col="Green",main = "Histogram for Miles")

Descriptive Statistics Continued.

The distribution is right-skewed, so the data was transformed with the log function to reduce the skewness. Hypothesis testing was then performed on the transformed data.

uber$MILES_transformed <- log(uber$MILES.)

par(mfrow=c(2,2)) 

hist(uber$MILES., col = "Green", main = "Histogram for Miles", xlab = " Miles Before")

hist(uber$MILES_transformed, col = "Green", main = "Histogram for Transformed  Miles", xlab = "Miles After Transformation")
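The effect of the log transformation can also be checked numerically. The sketch below uses simulated right-skewed data, not the Uber data, and a hypothetical `skewness()` helper written in base R. Note that `log()` needs strictly positive values, which holds here since the reported minimum is 0.50 miles.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Simulated right-skewed "miles"; an assumption for illustration only.
miles_sim <- rlnorm(1000, meanlog = 1.75, sdlog = 1)

# Moment-based sample skewness (hypothetical helper, base R only).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skewness(miles_sim)       # clearly positive: long right tail
skewness(log(miles_sim))  # close to zero after the log transform
```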

Hypothesis Testing

Hypotheses for a two-tailed one-sample t-test on the log-transformed miles

\[H_0: \mu = \log(6) \approx 1.792 \] \[H_A: \mu \ne \log(6) \approx 1.792 \]

log(6)

## [1] 1.791759

Assumptions:

The population standard deviation is unknown, which is true in this case.
The data is normally distributed.

The normality assumption can be relaxed because of the large sample size. This follows from the central limit theorem, which states that when the sample size is large (>30), the sampling distribution of the mean is approximately normal regardless of the variable's underlying population distribution.
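The central limit theorem claim can be illustrated with a small simulation. This is a sketch under assumptions: the population is a made-up log-normal distribution, not the Uber data, and all names are hypothetical. It draws many samples of size 1155 and checks how often the standardised sample mean falls within the normal 95% limits.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

n       <- 1155   # sample size used in this presentation
reps    <- 2000   # number of simulated samples
meanlog <- 1.75   # assumed log-normal parameters (illustrative only)
sdlog   <- 1

# Exact mean and standard deviation of the assumed log-normal population.
pop_mean <- exp(meanlog + sdlog^2 / 2)
pop_sd   <- sqrt((exp(sdlog^2) - 1) * exp(2 * meanlog + sdlog^2))

# Standardised sample means from a strongly right-skewed population.
z <- replicate(reps, (mean(rlnorm(n, meanlog, sdlog)) - pop_mean) / (pop_sd / sqrt(n)))

# Share of standardised means inside the normal 95% limits.
coverage <- mean(abs(z) < qnorm(0.975))
coverage
```

If the sampling distribution of the mean is close to normal, `coverage` should come out near 0.95 even though the individual values are heavily skewed.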

Hypothesis Testing Cont.

Significance level \[ \alpha = 0.05 \]

We will call this 5% the significance level of the test.

Decision Rules:

Reject H0:
If the t-statistic falls outside the range -1.96 to +1.96
If the p-value < 0.05 (significance level)
If the CI for the mean of the log-transformed miles does not capture \[H_0: \mu = \log(6) \]
Otherwise, fail to reject \[H_0\]
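The three decision rules can be written as one small helper. This is a minimal sketch: `decide()` and the simulated `x_demo` vector are hypothetical names introduced for illustration, and the data is simulated, not the Uber data.

```r
# Apply the three decision rules to a one-sample, two-tailed t-test.
# Returns TRUE for each rule that says "reject H0".
decide <- function(x, mu0, alpha = 0.05) {
  res    <- t.test(x, mu = mu0, alternative = "two.sided", conf.level = 1 - alpha)
  t_stat <- unname(res$statistic)
  t_crit <- qt(1 - alpha / 2, df = unname(res$parameter))
  ci     <- res$conf.int
  c(
    critical_value = abs(t_stat) > t_crit,        # rule 1: t-statistic beyond +/- t*
    p_value        = res$p.value < alpha,         # rule 2: p-value below alpha
    conf_interval  = mu0 < ci[1] || mu0 > ci[2]   # rule 3: CI misses the null value
  )
}

set.seed(7)
x_demo <- rnorm(200, mean = 1.75, sd = 1)  # simulated log-scale values
decide(x_demo, mu0 = log(6))
```

For a two-tailed t-test the three rules are equivalent, so they always give the same answer. They are three views of one calculation.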

Hypothesis Testing Cont.

To use the critical value method for the hypothesis test, we need to find the t-critical value (t*).
It marks the point beyond which a sample result has a probability of less than 0.05, in this case, of occurring when the null hypothesis is true.
Because this is a two-tailed test, the t-critical values sit 1.96 standard errors above and below the hypothesised population mean of 6 miles (log value = 1.791759).

qt(p = 0.05/2, df = 1155-1, lower.tail = TRUE)

## [1] -1.962022
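The call above returns only the lower critical value. Since the test is two-tailed, both limits can be obtained in one call, as in the sketch below (`t_crit` is a name introduced here). The normal quantile is shown only for comparison.

```r
# Lower and upper critical values for alpha = 0.05 with df = 1155 - 1.
t_crit <- qt(c(0.05 / 2, 1 - 0.05 / 2), df = 1155 - 1)
t_crit

# With this many degrees of freedom the t limits are very close to the
# standard normal limits of about +/- 1.96.
qnorm(c(0.025, 0.975))
```

The t distribution is symmetric, so the upper limit is the lower limit with the sign flipped.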

The one-sample t-test is very simple to perform in R. It gives the results needed for all three methods, namely the t-critical value method, the p-value method and the confidence interval method, which lets us decide whether to reject or fail to reject the null on the basis of the decision rules above.

t.test(uber$MILES_transformed, mu = log(6), alternative="two.sided")

## 
##  One Sample t-test
## 
## data:  uber$MILES_transformed
## t = -1.374, df = 1154, p-value = 0.1697
## alternative hypothesis: true mean is not equal to 1.791759
## 95 percent confidence interval:
##  1.69446 1.80891
## sample estimates:
## mean of x 
##  1.751685
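As a consistency check, the p-value and confidence interval can be recomputed by hand from the figures printed in the output. This sketch uses only the reported numbers, so small rounding differences are expected. The standard error is not printed by `t.test()`, so it is backed out of the t-statistic, and the variable names are introduced here.

```r
# Figures copied from the t.test() output above.
x_bar  <- 1.751685   # sample mean of the log-transformed miles
mu0    <- log(6)     # hypothesised mean on the log scale
t_stat <- -1.374     # reported t-statistic (rounded)
df     <- 1154

# t = (x_bar - mu0) / SE, so SE = (x_bar - mu0) / t.
se <- (x_bar - mu0) / t_stat

# Two-tailed p-value: probability of a t at least this extreme in either tail.
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval: x_bar +/- t* x SE.
ci <- x_bar + c(-1, 1) * qt(0.975, df) * se

se
p_value
ci
```

The recomputed values should agree with the reported p-value of 0.1697 and interval of 1.69446 to 1.80891 up to rounding.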

Discussion

The t-statistic (-1.37) is not more extreme than the t-critical value (i.e. -1.96), so we fail to reject the null.
The p-value (0.17) is greater than the significance level (0.05), so we fail to reject the null.
The lower and upper bounds of the confidence interval (i.e. 1.69 and 1.81 respectively) capture the null hypothesised mean (H0 = 1.791759, the log of 6), so on the basis of the CI we again fail to reject the null.
The t-critical value method tells us whether the result is statistically significant, but it does not give the probability of a result at least as extreme as the sample mean when the null hypothesis is true.
The p-value gives that probability: how likely a sample mean at least this extreme would be if the null hypothesis were true. Hence, the p-value and confidence interval methods are preferred over the t-critical value method.
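Because the test was run on log-transformed miles, the results can be easier to read once they are converted back to miles. The sketch below exponentiates the reported mean and confidence limits. The variable names are introduced here. Note that exponentiating the mean of the logs gives a geometric mean, which is not the same quantity as the arithmetic mean of 10.57 miles.

```r
# Figures copied from the t.test() output on the log scale.
log_mean <- 1.751685
log_ci   <- c(1.69446, 1.80891)

# Back-transform to miles (geometric mean and its 95% interval).
geo_mean <- exp(log_mean)
geo_ci   <- exp(log_ci)

round(geo_mean, 2)  # about 5.76 miles
round(geo_ci, 2)    # about 5.44 to 6.10 miles
```

The back-transformed interval contains 6 miles, which matches the decision to fail to reject the null.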

Limitations:

A few limitations were found in this investigation:

Firstly, considering the customer base of billions stated on Uber's website, the dataset covers only a very small share of trips.
Secondly, the data comes from a single individual source, so its reliability cannot be verified.
Finally, the outliers, although few in number, could not be removed because the reasons for them are uncertain.

Final Conclusion:

Based on the hypothesis test, the sample mean trip distance of 10.57 miles (log-transformed mean: 1.751685) is not found to be unusual, and the data is consistent with a population whose trips average 6 miles. Because the test was run on the log-transformed data, this conclusion strictly applies to the mean of the log miles.

Assignment-3 Uber Trips