## Parsed with column specification:
## cols(
## Cab = col_character(),
## Trip_distance = col_double(),
## Fare_amount = col_double(),
## Tip_amount = col_double(),
## Payment_type = col_integer(),
## Trip_type = col_integer()
## )
Graph for Trip Distance
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Graph for Fare Amount
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Based on the two histograms above, both trip_distance and fare_amount are right-skewed, with quite a few outliers.
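The visual impression of right skew can also be checked numerically. The sketch below is a minimal illustration on simulated data, not the taxi data: the vector `x`, the seed, and the log-normal parameters are all assumptions made for the example. It compares the mean with the median, computes a simple moment-based skewness, and counts points above the usual upper boxplot fence.

```r
# Simulated right-skewed sample; NOT the taxi data used in this report.
set.seed(7)
x <- rlnorm(1000, meanlog = 1, sdlog = 0.8)  # hypothetical trip distances

# In a right-skewed sample the long upper tail pulls the mean above the median
mean_x   <- mean(x)
median_x <- median(x)

# Moment-based sample skewness: positive values indicate right skew
skewness <- mean((x - mean_x)^3) / sd(x)^3

# Points above Q3 + 1.5 * IQR, the upper fence used by boxplot()
upper_fence <- quantile(x, 0.75) + 1.5 * IQR(x)
n_outliers  <- sum(x > upper_fence)

print(c(mean = mean_x, median = median_x, skewness = skewness))
print(n_outliers)
```

Running the same three checks on `distance` and `fareAmount` would give a numeric counterpart to the histograms.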
Trip Distance 99% Confidence Interval:
# t.test(distance, conf.level = 0.99) can also work
n <- length(distance) # size of data
# Margin of error = z * sd / sqrt(n), where z = qnorm(0.995) for a 99% interval
disMean <- mean(distance)
disVariance <- var(distance)
dis_sd <- sqrt(disVariance)
dis_error <- qnorm(0.995)*dis_sd/sqrt(n) # 99% Confidence Interval
left_distance <- disMean-dis_error
right_distance <- disMean+dis_error
disInterval <- c(left_distance, right_distance)
intervalHead <- c("Lower.Limit", "Upper.Limit")
names(disInterval) <- intervalHead
print(disInterval)
## Lower.Limit Upper.Limit
## 3.015609 3.416238
Fare Amount 99% Confidence Interval:
fareMean <- mean(fareAmount)
fareVariance <- var(fareAmount)
fare_sd <- sqrt(fareVariance)
fare_error <- qnorm(0.995)*fare_sd/sqrt(n) # 99% Confidence Interval; size did not change
left_fare <- fareMean-fare_error
right_fare <- fareMean+fare_error
fareInterval <- c(left_fare, right_fare)
names(fareInterval) <- intervalHead
print(fareInterval)
## Lower.Limit Upper.Limit
## 12.21424 13.40468
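Both intervals above use the normal quantile `qnorm(0.995)`, while the comment in the first chunk notes that `t.test()` can also be used. The sketch below, on simulated data rather than the taxi data (the vector `x` and its distribution are assumptions for illustration), shows how the two approaches relate: `t.test()` uses a t quantile with `n - 1` degrees of freedom, so its interval is slightly wider, and for a sample of this size the difference is very small.

```r
# Simulated right-skewed sample; NOT the taxi data used in this report.
set.seed(42)
x <- rexp(1500, rate = 1 / 3)   # hypothetical trip distances, mean about 3

n  <- length(x)
se <- sd(x) / sqrt(n)           # standard error of the mean

# z-based 99% interval: same formula as the manual calculation
z_interval <- mean(x) + c(-1, 1) * qnorm(0.995) * se

# t-based 99% interval: same formula with a t quantile instead
t_interval <- mean(x) + c(-1, 1) * qt(0.995, df = n - 1) * se

# t.test() should reproduce the t-based interval
tt_interval <- as.numeric(t.test(x, conf.level = 0.99)$conf.int)

print(rbind(z = z_interval, t = t_interval, t.test = tt_interval))
```

With a much smaller sample the t quantile would be noticeably larger than the normal one, and the t-based interval would be the safer choice.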
# Z-Score = (X-mean)/sd
three <- (3-disMean)/dis_sd
thirteen <- (13-fareMean)/fare_sd
percentile <- (three+thirteen)/2
percent(pnorm(percentile))
## [1] "49%"
t.test(fareAmount, mu = 12)
##
## One Sample t-test
##
## data: fareAmount
## t = 3.503, df = 1542, p-value = 0.0004733
## alternative hypothesis: true mean is not equal to 12
## 95 percent confidence interval:
## 12.35620 13.26273
## sample estimates:
## mean of x
## 12.80946
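The one-sample t-test of the null hypothesis that the mean fare is $12 gives a sample mean of $12.81, with t = 3.503 and a p-value of 0.0004733. Because the p-value is well below 0.05, and the 95% confidence interval (12.36 to 13.26) does not contain 12, we can reject the null hypothesis.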
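The decision to reject rests on the p-value (or, equivalently, on whether the hypothesised mean falls inside the confidence interval), not on the sample mean alone. The sketch below illustrates both decision rules on simulated fares; the vector `fares`, the seed, and the gamma parameters are assumptions for the example and are not the taxi data.

```r
# Simulated fares; NOT the taxi data used in this report.
set.seed(1)
fares <- rgamma(1500, shape = 2, scale = 6.5)  # hypothetical, mean about 13

alpha  <- 0.05
result <- t.test(fares, mu = 12)

# Decision rule 1: reject H0 when the p-value is below alpha
reject_by_p <- result$p.value < alpha

# Decision rule 2: reject H0 when mu = 12 lies outside the 95% interval
ci <- result$conf.int
reject_by_ci <- 12 < ci[1] || 12 > ci[2]

# For a two-sided one-sample t-test the two rules give the same answer
print(c(p_value_rule = reject_by_p, interval_rule = reject_by_ci))
```

The two rules agree because the 95% interval is built from the same t statistic that produces the two-sided p-value.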
The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other. The Spearman correlation, by comparison, evaluates the monotonic relationship between two continuous or ordinal variables: the two variables tend to change together, but not necessarily at a constant rate.
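A small toy example can make the distinction concrete. In the sketch below (made-up data, not the taxi data), `y` always increases with `x` but at an accelerating rate, so the relationship is monotonic without being linear.

```r
# Toy data, not the taxi data: y grows with x, but not at a constant rate
x <- 1:20
y <- exp(x / 4)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman works on ranks, so any strictly increasing relationship gives 1;
# Pearson is below 1 because the points do not fall on a straight line
print(c(pearson = pearson, spearman = spearman))
```

When the two coefficients are close to each other and both are high, the relationship is likely close to linear as well as monotonic.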
Scatterplot Correlation between Trip Distance and Fare Amount
plot(distance, fareAmount, col = c("deepskyblue4"), pch = c(20), cex = 1, lty = "solid", lwd = 2)
cor.test(distance, fareAmount, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: distance and fareAmount
## t = 110.93, df = 1541, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9368827 0.9480153
## sample estimates:
## cor
## 0.9427109
cor.test(distance, fareAmount, method = "spearman")
## Warning in cor.test.default(distance, fareAmount, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: distance and fareAmount
## S = 21265000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9652692
The Pearson measure is the more suitable one for this dataset because trip_distance and fare_amount are roughly proportional to each other: as the distance of a trip increases, so does the fare amount. The scatter plot above also shows that the relationship is approximately linear rather than merely monotonic, which is the case the Pearson correlation is designed for. A linear relationship is also monotonic, which is why Spearman's rho is high here as well.