Assignment-3 Uber Trips

Ankit Munot, S3764950

2nd June 2019

Introduction

Problem Statement

Data

The dataset contains 7 variables and 1155 observations. This open data was collected from Kaggle: https://www.kaggle.com/zusmani/uberdrives

The attributes were:

Start_date: Starting date of travel

End_date: End date of travel

Category: Category of travel (e.g. business)

Start: Starting point of journey

Stop: Last stop of journey

Miles: Total miles travelled

Purpose: Purpose of travel (e.g. dinner)

Descriptive Statistics and Visualisation

The last row of the data, which contained the total number of miles travelled, was removed so that it would not distort the analysis.

From the summary below, the mean trip distance is 10.57 miles, with a minimum of 0.5 miles and a maximum of 310.3 miles. The mean is well above the median of 6.0 miles, which points to a right-skewed distribution.

summary(uber$MILES.,na.rm=TRUE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50    2.90    6.00   10.57   10.40  310.30
sd(uber$MILES.)
## [1] 21.57911

Descriptive Statistics Cont.

The boxplot shows any outliers present in the sample.

boxplot(uber$MILES.,xlab = "Miles",col="orange",
border="brown")
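
The points a boxplot draws as outliers are those lying more than 1.5 times the interquartile range beyond the hinges. As a minimal sketch, assuming a simulated log-normal vector `miles` as a stand-in for `uber$MILES.` (it is not the Kaggle data), those points could be listed with `boxplot.stats()`:

```r
# Simulated right-skewed stand-in for uber$MILES. (assumption, not the real data)
set.seed(1)
miles <- rlnorm(1000, meanlog = 1.75, sdlog = 1)

# boxplot.stats() returns the five-number summary and the flagged points
stats    <- boxplot.stats(miles)
outliers <- stats$out

# Fences written out by hand from the hinges (stats$stats[2] and stats$stats[4])
iqr         <- stats$stats[4] - stats$stats[2]
lower_fence <- stats$stats[2] - 1.5 * iqr
upper_fence <- stats$stats[4] + 1.5 * iqr

length(outliers)   # number of trips flagged as outliers
range(outliers)    # how extreme the flagged trips are
```

With strongly right-skewed data most flagged points sit above the upper fence, which is one reason to examine a transformation before testing.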

Descriptive Statistics Cont.

The data are plotted as a histogram to check their distribution.

hist(uber$MILES.,col="Green",main = "Histogram for Miles")

Descriptive Statistics Cont.

The distribution is right-skewed, so the data were log-transformed to reduce the skewness. Hypothesis testing was then performed on the transformed data.

uber$MILES_transformed <- log(uber$MILES.)

par(mfrow=c(2,2)) 

hist(uber$MILES., col = "Green", main = "Histogram for Miles", xlab = " Miles Before")

hist(uber$MILES_transformed, col = "Green", main = "Histogram for Transformed  Miles", xlab = "Miles After Transformation")
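
The effect of the log transformation can also be checked numerically. The sketch below is illustrative only: base R has no built-in skewness function, so `skewness` is a helper defined here, and `miles` is simulated log-normal data standing in for `uber$MILES.`. The log transformation requires strictly positive values, which holds for this dataset because the minimum is 0.5 miles.

```r
# Moment-based sample skewness (helper defined here; not part of base R)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated stand-in for uber$MILES. (assumption, not the real data)
set.seed(42)
miles <- rlnorm(1000, meanlog = 1.75, sdlog = 1)

skew_before <- skewness(miles)        # clearly positive for right-skewed data
skew_after  <- skewness(log(miles))   # near 0 when the logs are roughly symmetric

c(before = skew_before, after = skew_after)
```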

Hypothesis Testing

Hypotheses for a two-tailed one-sample t-test on the log-transformed miles:

\[H_0: \mu = \log(6) \approx 1.792 \] \[H_A: \mu \ne \log(6) \approx 1.792 \]

log(6)
## [1] 1.791759

Assumptions:

  1. The population standard deviation is unknown, which is true in this case.

  2. The data are normally distributed.

The normality assumption can be relaxed because of the large sample size. By the central limit theorem, when the sample size is large (n > 30), the sampling distribution of the mean is approximately normal regardless of the variable’s underlying population distribution.
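
The central limit theorem argument can be illustrated by simulation. The sketch below assumes a log-normal population (an arbitrary right-skewed stand-in, not the Uber data) and a locally defined `skewness` helper; it compares the skewness of a single sample with the skewness of many sample means of size 1155.

```r
# Moment-based sample skewness (helper defined here; not part of base R)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(123)
n_obs  <- 1155   # sample size reported in this document
n_reps <- 2000   # number of simulated samples (arbitrary choice)

# One sample from the skewed population
single_sample <- rlnorm(n_obs, meanlog = 1.75, sdlog = 1)

# Means of many independent samples of the same size
sample_means <- replicate(n_reps, mean(rlnorm(n_obs, meanlog = 1.75, sdlog = 1)))

skew_single <- skewness(single_sample)  # strongly skewed raw data
skew_means  <- skewness(sample_means)   # much closer to 0 for the sample means
```

The approximation is never exact: for heavily skewed populations some skewness remains in the sampling distribution, and it shrinks as the sample size grows.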

Hypothesis Testing Cont.

Significance level \[ \alpha = 0.05 \]

We will call this 5% the significance level of the test.

Decision Rules:

Hypothesis Testing Cont.

qt(p = 0.05/2, df = 1155-1, lower.tail = TRUE)
## [1] -1.962022
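
For a two-tailed test the critical value method compares the absolute value of the test statistic with the upper critical value. A minimal sketch, using the t statistic of -1.374 that `t.test()` reports in this document:

```r
alpha <- 0.05
df    <- 1155 - 1
t_obs <- -1.374   # t statistic reported by t.test() in this document

# Upper critical value; by symmetry the lower one is -t_crit
t_crit <- qt(1 - alpha / 2, df = df)

# Reject H0 only if the statistic falls outside [-t_crit, t_crit]
reject_h0 <- abs(t_obs) > t_crit
reject_h0   # FALSE: |t| = 1.374 is below the critical value of about 1.962
```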

The one-sample t-test is simple to perform in R. Its output supports all three methods, namely the t critical value method, the p-value method, and the confidence interval method, any of which lets us decide whether to reject or fail to reject the null hypothesis on the basis of the decision rules mentioned above.

t.test(uber$MILES_transformed, mu = log(6), alternative="two.sided")
## 
##  One Sample t-test
## 
## data:  uber$MILES_transformed
## t = -1.374, df = 1154, p-value = 0.1697
## alternative hypothesis: true mean is not equal to 1.791759
## 95 percent confidence interval:
##  1.69446 1.80891
## sample estimates:
## mean of x 
##  1.751685
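
To show where the numbers in this output come from, the test can be computed by hand. The sketch below is an illustration under stated assumptions: `one_sample_t` is a hypothetical helper, and `log_miles` is simulated normal data standing in for `uber$MILES_transformed`, so its results will not match the output above.

```r
# One-sample two-tailed t-test computed by hand (hypothetical helper)
one_sample_t <- function(x, mu0, alpha = 0.05) {
  x      <- x[!is.na(x)]
  n      <- length(x)
  se     <- sd(x) / sqrt(n)              # standard error of the mean
  t_stat <- (mean(x) - mu0) / se
  df     <- n - 1
  t_crit <- qt(1 - alpha / 2, df)
  list(
    t       = t_stat,
    df      = df,
    p_value = 2 * pt(-abs(t_stat), df),  # two-tailed p-value
    ci      = mean(x) + c(-1, 1) * t_crit * se
  )
}

# Simulated stand-in for uber$MILES_transformed (assumption, not the real data)
set.seed(7)
log_miles <- rnorm(1155, mean = 1.75, sd = 1)

manual  <- one_sample_t(log_miles, mu0 = log(6))
builtin <- t.test(log_miles, mu = log(6), alternative = "two.sided")

# The hand computation should agree with t.test()
all.equal(manual$t, unname(builtin$statistic))
all.equal(manual$p_value, builtin$p.value)
all.equal(manual$ci, as.numeric(builtin$conf.int))
```

The three decision methods are equivalent for a two-tailed test: the p-value is below alpha exactly when |t| exceeds the critical value, and exactly when the hypothesised mean falls outside the confidence interval.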

Discussion

Limitations:

A few limitations were identified in this investigation:

Final Conclusion:

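The one-sample t-test on the log-transformed trip distances gave t = -1.374 (df = 1154) with a p-value of 0.1697, which is greater than the 0.05 significance level, and the 95% confidence interval [1.694, 1.809] contains log(6) = 1.792. We therefore fail to reject the null hypothesis: the sample mean of the transformed distances (1.751685; the untransformed mean is 10.57 miles) is not unusual for a population whose mean log distance is log(6), that is, a typical trip of about 6 miles.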
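
Because the test was run on the log scale, its estimates can be mapped back to miles with `exp()`. A minimal sketch using the values reported by `t.test()` in this document; note that the back-transformed mean of the logs is the geometric mean, which is why it differs from the arithmetic mean of 10.57 miles:

```r
# Values reported by t.test() on the log scale in this document
mean_log <- 1.751685
ci_log   <- c(1.69446, 1.80891)

# exp() of a mean of logs gives the geometric mean, not the arithmetic mean
geo_mean <- exp(mean_log)
ci_miles <- exp(ci_log)

round(geo_mean, 2)   # geometric mean trip distance in miles
round(ci_miles, 2)   # 95% interval for the geometric mean in miles
```

The back-transformed interval contains 6 miles, which matches the decision not to reject the null hypothesis.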

References