A correlation describes the relationship between two variables: the correlation coefficient tells us how strongly, and in which direction, two variables move in relation to each other. The coefficient ranges from -1 to 1. It can be understood as a measure of how closely the points cluster around a line of best fit, so it tells us the strength of the linear relation between two variables.
A covariance tells us how two random variables vary in relation to each other. In other words, it tells us whether variable “x” and variable “y” tend to move in the same direction or in opposite directions. The covariance can take any value in (-∞, +∞), and its magnitude depends on the units of the two variables. It therefore tells us the direction (positive or negative) of the linear relation between two variables rather than its strength.
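The two statistics are linked: the Pearson correlation is the covariance rescaled by both standard deviations. A minimal sketch on made-up vectors (the values of `x` and `y` are purely illustrative and are not taken from the taxi data):

```r
# Toy vectors, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sample covariance from its definition (divides by n - 1, as cov() does)
n <- length(x)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Pearson correlation: covariance divided by the product of the standard deviations
cor_manual <- cov_manual / (sd(x) * sd(y))

# Both should agree with the built-in functions up to floating-point error
all.equal(cov_manual, cov(x, y))
all.equal(cor_manual, cor(x, y))
```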
I’d like to look at the taxi trips that occurred in NYC over 2018-2019. I selected 6 variables (see interest_variables in the code block below). To limit memory use, I read only the first 100,000 rows of each original dataset.
df.2018 <- read.csv("/Users/jiwonban/ADEC7301/Week 7/2018_taxi_trips.csv", nrows=100000)
df.2019 <- read.csv("/Users/jiwonban/ADEC7301/Week 7/2019_taxi_trips.csv", nrows=100000)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
interest_variables <- c(ID = "VendorID",
                        Passenger_Count = "passenger_count",
                        Pick_Up_Location_ID = "PULocationID",
                        fare_amount = "fare_amount",
                        total_amount = "total_amount",
                        Trip_Distance = "trip_distance"
)
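# Keep only the variables of interest from each year
new_df.2018 <- df.2018[, interest_variables]
new_df.2019 <- df.2019[, interest_variables]  # subset the 2019 data, not df.2018 again
# With no `by` argument, merge() performs an inner join on all shared columns
MERGED_TRIPS <- merge(
  x = new_df.2018,
  y = new_df.2019)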
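Because `merge()` is called without a `by` argument, it joins on every column name the two data frames share. The sketch below uses two tiny made-up data frames (`a` and `b` are hypothetical, not the taxi data) to show how that differs from stacking rows with `rbind()`. Which one is appropriate depends on whether the goal is to match rows across years or to pool both years.

```r
# Two toy data frames with identical column names (illustrative values only)
a <- data.frame(dist = c(1, 1, 2), fare = c(5, 5, 6))
b <- data.frame(dist = c(1, 1, 3), fare = c(5, 5, 9))

# With no `by`, merge() performs an inner join on all shared columns:
# each row of `a` is paired with every identical row of `b`
joined <- merge(x = a, y = b)

# rbind() simply stacks the rows of both data frames
stacked <- rbind(a, b)

nrow(joined)   # matching rows only, with duplicate matches multiplied
nrow(stacked)  # all rows of a followed by all rows of b
```

A row count in the merged result that differs from the sum of the input row counts is a quick sign that a join, rather than a stack, took place.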
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(MERGED_TRIPS,
          type = "text",
          title = "Summary")
##
## Summary
## =====================================================
## Statistic          N     Mean  St. Dev.  Min     Max  
## -----------------------------------------------------
## VendorID        387,182 2.000   0.000     2      2    
## passenger_count 387,182 1.000   0.000     1      1    
## PULocationID    387,182 90.345  63.749    1     265   
## fare_amount     387,182 7.571   5.734   1.000 437.500 
## total_amount    387,182 8.372   5.734   1.800 438.300 
## trip_distance   387,182 1.422   1.635   0.000 98.270  
## -----------------------------------------------------
summary(MERGED_TRIPS)
##     VendorID passenger_count  PULocationID     fare_amount     
##  Min.   :2   Min.   :1       Min.   :  1.00   Min.   :  1.000  
##  1st Qu.:2   1st Qu.:1       1st Qu.: 52.00   1st Qu.:  4.500  
##  Median :2   Median :1       Median : 74.00   Median :  6.000  
##  Mean   :2   Mean   :1       Mean   : 90.34   Mean   :  7.572  
##  3rd Qu.:2   3rd Qu.:1       3rd Qu.: 83.00   3rd Qu.:  9.500  
##  Max.   :2   Max.   :1       Max.   :265.00   Max.   :437.500  
##   total_amount     trip_distance   
##  Min.   :  1.800   Min.   : 0.000  
##  1st Qu.:  5.300   1st Qu.: 0.550  
##  Median :  6.800   Median : 0.950  
##  Mean   :  8.372   Mean   : 1.422  
##  3rd Qu.: 10.300   3rd Qu.: 1.770  
##  Max.   :438.300   Max.   :98.270  
summary(new_df.2018)
##     VendorID passenger_count  PULocationID    fare_amount      total_amount   
##  Min.   :2   Min.   :1       Min.   :  1.0   Min.   :  1.00   Min.   :  1.80  
##  1st Qu.:2   1st Qu.:1       1st Qu.: 65.0   1st Qu.:  7.00   1st Qu.:  7.80  
##  Median :2   Median :1       Median : 75.0   Median : 10.50   Median : 11.30  
##  Mean   :2   Mean   :1       Mean   :104.1   Mean   : 12.39   Mean   : 13.19  
##  3rd Qu.:2   3rd Qu.:1       3rd Qu.:147.0   3rd Qu.: 15.50   3rd Qu.: 16.30  
##  Max.   :2   Max.   :1       Max.   :265.0   Max.   :437.50   Max.   :438.30  
##  trip_distance   
##  Min.   : 0.000  
##  1st Qu.: 1.150  
##  Median : 2.105  
##  Mean   : 2.731  
##  3rd Qu.: 3.450  
##  Max.   :98.270  
summary(new_df.2019)
##     VendorID passenger_count  PULocationID    fare_amount      total_amount   
##  Min.   :2   Min.   :1       Min.   :  1.0   Min.   :  1.00   Min.   :  1.80  
##  1st Qu.:2   1st Qu.:1       1st Qu.: 65.0   1st Qu.:  7.00   1st Qu.:  7.80  
##  Median :2   Median :1       Median : 75.0   Median : 10.50   Median : 11.30  
##  Mean   :2   Mean   :1       Mean   :104.1   Mean   : 12.39   Mean   : 13.19  
##  3rd Qu.:2   3rd Qu.:1       3rd Qu.:147.0   3rd Qu.: 15.50   3rd Qu.: 16.30  
##  Max.   :2   Max.   :1       Max.   :265.0   Max.   :437.50   Max.   :438.30  
##  trip_distance   
##  Min.   : 0.000  
##  1st Qu.: 1.150  
##  Median : 2.105  
##  Mean   : 2.731  
##  3rd Qu.: 3.450  
##  Max.   :98.270  
Run a correlation (measures strength of linear relationship) between the two variables, and run the covariance between the two variables. Interpret.
?cor
cor(MERGED_TRIPS$trip_distance, MERGED_TRIPS$total_amount, method = c("pearson"))
## [1] 0.9515365
?cov
cov(MERGED_TRIPS$trip_distance, MERGED_TRIPS$total_amount, method = c("pearson"))
## [1] 8.921415
Specifically, I was curious to explore the relation between the total amount charged (total_amount) and the distance traveled (trip_distance).
The Pearson correlation between trip distance and the total amount charged was 0.9515. This value is very close to 1, which indicates a very strong positive linear relationship between the distance of a trip and its total cost. This is expected, as the base fare of a taxi trip is calculated based on the distance traveled.
The covariance of 8.92 is positive, which tells us that the two selected variables move in the same direction: longer trips tend to come with larger total amounts. Because the size of a covariance depends on the units of the variables, the value 8.92 by itself does not tell us how strong the relationship is, and it does not show that a change in one variable causes the other to change.
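The unit dependence of the covariance is easy to check. A small sketch, assuming made-up distances in miles and fares in dollars (none of these numbers come from the dataset):

```r
# Toy data, invented for illustration
miles <- c(0.5, 1.2, 2.0, 3.5, 5.0)
fare  <- c(4.0, 6.5, 9.0, 13.5, 18.0)

# The same distances expressed in kilometres
km <- miles * 1.609344

# The covariance scales with the change of units...
cov_miles <- cov(miles, fare)
cov_km    <- cov(km, fare)

# ...while the correlation is unchanged
cor_miles <- cor(miles, fare)
cor_km    <- cor(km, fare)

all.equal(cov_km, cov_miles * 1.609344)
all.equal(cor_km, cor_miles)
```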
plot(MERGED_TRIPS$trip_distance, MERGED_TRIPS$total_amount)
Interestingly, the plot shows some outliers, which may indicate either a recording error (e.g., $4.30 recorded as $430) or fares that were driven more by traffic conditions than by distance alone!
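One possible way to follow up on such points is to flag trips whose cost per mile is far from typical. The sketch below uses a small made-up data frame `trips` and a cutoff based on the interquartile range. Both the data and the threshold are assumptions for illustration, not part of the original analysis.

```r
# Toy trips (illustrative values; the last row mimics a misplaced decimal point)
trips <- data.frame(
  trip_distance = c(0.8, 1.5, 2.2, 3.0, 4.1, 1.0),
  total_amount  = c(5.8, 8.3, 10.8, 13.3, 17.0, 430.0)
)

# Cost per mile for each trip (zero-distance trips would give Inf and need separate handling)
trips$per_mile <- trips$total_amount / trips$trip_distance

# Flag values above the upper quartile plus 1.5 times the IQR (a common rule of thumb)
q <- quantile(trips$per_mile, c(0.25, 0.75))
upper <- q[2] + 1.5 * (q[2] - q[1])
trips$flagged <- trips$per_mile > upper

# Show only the flagged trips
trips[trips$flagged, ]
```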
lm(total_amount~trip_distance, data=MERGED_TRIPS)
##
## Call:
## lm(formula = total_amount ~ trip_distance, data = MERGED_TRIPS)
##
## Coefficients:
##   (Intercept)  trip_distance  
##         3.626          3.337  
cov(MERGED_TRIPS$trip_distance,MERGED_TRIPS$total_amount)/var(MERGED_TRIPS$trip_distance)
## [1] 3.337144
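Yes, the slope coefficient of 3.337 is the same. In a simple linear regression, the least-squares slope equals the covariance of the two variables divided by the variance of the predictor, so the two calculations must agree.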
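The same check can be reproduced on any data. A minimal sketch on made-up vectors (the values of `x` and `y` are illustrative, not from the taxi data), which also shows the equivalent form r * sd(y) / sd(x):

```r
# Toy data, invented for illustration
x <- c(0.5, 1.2, 2.0, 3.5, 5.0)
y <- c(4.2, 6.1, 9.4, 13.0, 18.3)

# Slope from the fitted simple linear regression
fit <- lm(y ~ x)
slope_lm <- unname(coef(fit)["x"])

# Slope from the covariance/variance ratio
slope_cov <- cov(x, y) / var(x)

# Equivalent form using the correlation coefficient
slope_cor <- cor(x, y) * sd(y) / sd(x)

all.equal(slope_lm, slope_cov)
all.equal(slope_lm, slope_cor)
```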