1. Load the Dataset (green_tripdata_2015-01.csv)

## Parsed with column specification:
## cols(
##   Cab = col_character(),
##   Trip_distance = col_double(),
##   Fare_amount = col_double(),
##   Tip_amount = col_double(),
##   Payment_type = col_integer(),
##   Trip_type = col_integer()
## )
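
The vectors `distance` and `fareAmount` used in the later steps are not defined in the output above. The following is a minimal sketch of how they might be created, assuming they are simply the Trip_distance and Fare_amount columns. The three inline rows are made-up stand-ins for the real file, and base `read.csv` is used only to keep the example self-contained (the column specification message above suggests the original used readr).

```r
# Made-up stand-in for green_tripdata_2015-01.csv (illustrative rows only)
csv_text <- "Cab,Trip_distance,Fare_amount,Tip_amount,Payment_type,Trip_type
green,1.2,6.5,0,2,1
green,3.0,13.0,2.5,1,1
green,7.4,24.0,0,2,1"

# With the real file this would be something like read.csv("green_tripdata_2015-01.csv")
trips <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Assumed definitions of the vectors used in the later steps
distance   <- trips$Trip_distance
fareAmount <- trips$Fare_amount
```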

2. Determine if the trip_distance and fare_amount variables are normal:

Graph for Trip Distance

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Graph for Fare Amount

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on the above two graphs, both trip_distance and fare_amount are right-skewed with quite a few outliers, so neither variable appears to be normally distributed.
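
The visual impression of skew can be backed up numerically. The sketch below is an illustration, not the analysis from this report: it uses a simulated right-skewed sample (an assumption, not the taxi data), a hand-written `skewness` helper, and base R's `shapiro.test`.

```r
set.seed(1)
# Simulated right-skewed sample standing in for trip_distance (not real data)
x <- rlnorm(1500, meanlog = 1, sdlog = 0.6)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew <- skewness(x)      # positive values indicate a right skew
sw   <- shapiro.test(x)  # a small p-value is evidence against normality
```

A positive skewness together with a small Shapiro-Wilk p-value would support the reading of the histograms. Note that `shapiro.test` accepts between 3 and 5000 observations.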

3. Develop 99% confidence intervals for the trip_distance and fare_amount, using z-scores:

Trip Distance 99% Confidence Interval:

# t.test(distance, conf.level = 0.99) gives a nearly identical t-based interval for a sample this large
n <- length(distance) # sample size
# Margin of error = z * sd / sqrt(n)
disMean <- mean(distance)
disVariance <- var(distance)
dis_sd <- sqrt(disVariance)
dis_error <- qnorm(0.995)*dis_sd/sqrt(n) # 99% Confidence Interval
left_distance <- disMean-dis_error
right_distance <- disMean+dis_error
disInterval <- c(left_distance, right_distance)
intervalHead <- c("Lower.Limit", "Upper.Limit")
names(disInterval) <- intervalHead
print(disInterval)
## Lower.Limit Upper.Limit 
##    3.015609    3.416238

Fare Amount 99% Confidence Interval:

fareMean <- mean(fareAmount)
fareVariance <- var(fareAmount)
fare_sd <- sqrt(fareVariance)
fare_error <- qnorm(0.995)*fare_sd/sqrt(n) # 99% Confidence Interval; size did not change
left_fare <- fareMean-fare_error
right_fare <- fareMean+fare_error
fareInterval <- c(left_fare, right_fare)
names(fareInterval) <- intervalHead
print(fareInterval)
## Lower.Limit Upper.Limit 
##    12.21424    13.40468
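
Both intervals follow the same recipe, so the steps above can be wrapped in one function. This is a sketch only; `z_interval` is a hypothetical helper name, and the input vector is made up for illustration.

```r
# z-based confidence interval for the mean: mean +/- z * sd / sqrt(n)
z_interval <- function(x, conf.level = 0.99) {
  n <- length(x)
  z <- qnorm(1 - (1 - conf.level) / 2)  # e.g. qnorm(0.995) for 99%
  margin <- z * sd(x) / sqrt(n)
  c(Lower.Limit = mean(x) - margin, Upper.Limit = mean(x) + margin)
}

# Made-up example input
ci <- z_interval(c(1, 2, 3, 4, 5))
```

With the report's data, `z_interval(distance)` and `z_interval(fareAmount)` should match the intervals computed step by step above.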

4. Using z-scores, determine what percentile a 3-mile trip that costs $13 falls into.

# Z-Score = (X-mean)/sd; percent() is assumed to come from the scales package
three <- (3-disMean)/dis_sd
thirteen <- (13-fareMean)/fare_sd
percentile <- (three+thirteen)/2
percent(pnorm(percentile))
## [1] "49%"
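
The code above averages the two z-scores into a single figure. Another way to read the question is to report one percentile per variable. The sketch below shows that alternative, along with an empirical percentile that does not rely on the normal assumption (worth checking, since both variables are right-skewed). The helper names and the input vector are hypothetical.

```r
# Percentile of `value` under a normal model fitted to x
percentile_z <- function(x, value) pnorm((value - mean(x)) / sd(x))

# Empirical percentile: share of observations less than or equal to `value`
percentile_empirical <- function(x, value) ecdf(x)(value)

# Made-up example input
x <- c(1, 2, 3, 4, 5)
p_normal <- percentile_z(x, 3)          # z-score is 0, so this is 0.5
p_emp    <- percentile_empirical(x, 3)  # 3 of 5 values are <= 3, so 0.6
```

With the report's data this would be `percentile_z(distance, 3)` and `percentile_z(fareAmount, 13)`, reported separately.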

5. The null hypothesis for green cab fare amount is that the average fare is $12. From the 95% confidence interval, can you reject the null hypothesis?

t.test(fareAmount, mu = 12)
## 
##  One Sample t-test
## 
## data:  fareAmount
## t = 3.503, df = 1542, p-value = 0.0004733
## alternative hypothesis: true mean is not equal to 12
## 95 percent confidence interval:
##  12.35620 13.26273
## sample estimates:
## mean of x 
##  12.80946

The 95% confidence interval from the t-test is (12.36, 13.26), which does not contain the hypothesized mean of $12, and the p-value (0.0004733) is well below 0.05. Therefore, we can reject the null hypothesis. The sample mean fare is $12.81.
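
The decision rule can also be written out in code: reject the null hypothesis when the hypothesized mean lies outside the confidence interval, which for a two-sided test agrees with comparing the p-value to 0.05. This is a sketch on made-up fares, not the report's data.

```r
# Made-up fares for illustration
x   <- c(12.5, 13.0, 13.5, 12.8, 13.2)
mu0 <- 12

res <- t.test(x, mu = mu0)  # two-sided one-sample t-test, 95% CI by default
ci  <- res$conf.int

# Reject when mu0 falls outside the interval
reject_by_ci <- mu0 < ci[1] || mu0 > ci[2]
# Equivalent check through the p-value
reject_by_p  <- res$p.value < 0.05
```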

6. Understand Pearson and Spearman Correlation Coefficients by reading “Pearson vs Spearman.pdf” document.

The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other. The Spearman correlation, by comparison, evaluates the monotonic relationship between two continuous or ordinal variables: the two variables tend to change together, but not necessarily at a constant rate.
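
A small made-up example illustrates the difference: for a perfectly linear relationship both coefficients equal 1, while for a monotonic but curved relationship only Spearman stays at 1.

```r
x        <- 1:10
y_linear <- 2 * x + 1  # constant rate of change
y_curved <- exp(x)     # always increasing, but not at a constant rate

pearson_linear  <- cor(x, y_linear, method = "pearson")   # 1
spearman_linear <- cor(x, y_linear, method = "spearman")  # 1
pearson_curved  <- cor(x, y_curved, method = "pearson")   # below 1
spearman_curved <- cor(x, y_curved, method = "spearman")  # 1, ranks agree exactly
```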

7. Apply Pearson and Spearman measures to this data set (“trip_distance” and “fare_amount”). Which one of these two is suitable for this scenario and why?

Scatterplot Correlation between Trip Distance and Fare Amount

plot(distance, fareAmount, col = c("deepskyblue4"), pch = c(20), cex = 1, lty = "solid", lwd = 2)

cor.test(distance, fareAmount, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  distance and fareAmount
## t = 110.93, df = 1541, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9368827 0.9480153
## sample estimates:
##       cor 
## 0.9427109
cor.test(distance, fareAmount, method = "spearman")
## Warning in cor.test.default(distance, fareAmount, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  distance and fareAmount
## S = 21265000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9652692

The Pearson measure is more suitable for this scenario because trip_distance and fare_amount are roughly proportional: as the distance of a trip increases, so does the fare amount, at an approximately constant rate. The scatter plot above shows that the relationship is linear, which is the case Pearson is designed to measure. Spearman only assumes a monotonic relationship, not necessarily a linear one.

8. Are “trip_distance” and “fare_amount” strongly correlated (Explain)?

Pearson’s product-moment correlation between “trip_distance” and “fare_amount” gives a sample correlation coefficient (r) of 0.9427109, with a 95% confidence interval of 0.937 to 0.948 and a p-value below 2.2e-16. The closer r is to +1 or -1, the more closely the two variables are related. Since r is very close to 1, “trip_distance” and “fare_amount” are strongly positively correlated.