1.  Do a few Google searches and tell us what correlation is (5 lines max).

A correlation describes the relationship between two variables: the correlation coefficient tells us how strongly, and in which direction, two variables move together. Its value ranges from -1 to 1, where values near -1 or 1 indicate a strong linear relationship and values near 0 indicate a weak one. It can also be understood as a measure of how closely the data points cluster around a line of best fit, so it tells us the strength of the linear relation between two variables.
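As a minimal sketch of the definition above, the Pearson coefficient can be computed by hand and compared with R's built-in cor(). The vectors x and y are made-up toy values for illustration, not taken from the taxi data.

```r
# Toy data (hypothetical values, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson correlation from its definition:
# sum of cross-deviations divided by the square root of the
# product of the sums of squared deviations
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in version for comparison
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree up to floating-point error, and either one always falls between -1 and 1.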

2.  Do a few Google searches and tell us what covariance is (5 lines max).

A covariance tells us how two random variables vary together. In other words, it tells us whether a change in variable “x” tends to be accompanied by a change in variable “y”. The covariance can take any value in (-∞, +∞), and its magnitude depends on the units in which the variables are measured. It captures the direction (positive or negative) of the linear relation between two variables, but not its strength.
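The following sketch illustrates why the covariance conveys direction rather than strength: rescaling one variable changes the covariance but leaves the correlation untouched. The vectors miles and dollars are hypothetical toy values, not rows from the taxi data.

```r
# Toy data (hypothetical values, for illustration only)
miles   <- c(0.5, 1.2, 2.0, 3.5, 5.0)
dollars <- c(4.5, 6.8, 9.0, 13.5, 18.0)

# Same distances expressed in kilometres
km <- miles * 1.609344

cov_miles <- cov(miles, dollars)
cov_km    <- cov(km, dollars)    # scaled by 1.609344
cor_miles <- cor(miles, dollars)
cor_km    <- cor(km, dollars)    # unchanged by the rescaling

print(c(cov_miles = cov_miles, cov_km = cov_km))
print(c(cor_miles = cor_miles, cor_km = cor_km))
```

The sign of the covariance is the same in both cases, which is the part that carries over to interpretation.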

3. Try merging any dataset that interests you based on the data dictionary (pay attention to the unique keys), and create a meaningful dataset that has an interesting y (outcome) and an interesting x (independent variable).

I’d like to look at the taxi trips that occurred in NYC in 2018 and 2019. I selected 6 variables (see interest_variables in the code block below). To stay within my computer’s memory, I limit each original dataset to its first 100,000 observations.

df.2018 <- read.csv("/Users/jiwonban/ADEC7301/Week 7/2018_taxi_trips.csv", nrows=100000)
df.2019 <- read.csv("/Users/jiwonban/ADEC7301/Week 7/2019_taxi_trips.csv", nrows=100000)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
interest_variables <- c(ID = "VendorID",
                        Passenger_Count = "passenger_count",
                        Pick_Up_Location_ID = "PULocationID", 
                        fare_amount = "fare_amount",
                        total_amount = "total_amount",
                        Trip_Distance = "trip_distance" 
                     )

new_df.2018 <- df.2018[, interest_variables]
# Subset the 2019 trips from the 2019 data frame
new_df.2019 <- df.2019[, interest_variables]


MERGED_TRIPS <- merge(
  x = new_df.2018,
  y = new_df.2019)
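A note on what merge() does when no `by` argument is given: it joins on every column name the two data frames share and keeps only the rows that match, and duplicated matches multiply. This would be consistent with the merged table having more rows than either 100,000-row input. The sketch below uses two tiny made-up data frames (a and b, with hypothetical columns key and fare) to show that behaviour next to a plain row-wise stack with rbind(); it is an illustration under those assumptions, not a claim about how the taxi files should be combined.

```r
# Toy data frames (hypothetical, for illustration only)
a <- data.frame(key = c(1, 1, 2), fare = c(5, 5, 7))
b <- data.frame(key = c(1, 1, 3), fare = c(5, 5, 9))

# No `by`: merge() joins on all shared columns (key and fare)
# and keeps only matching rows (an inner join).
joined <- merge(x = a, y = b)
nrow(joined)  # 4 rows: 2 copies of (1, 5) in a times 2 copies in b

# Stacking the two years instead keeps every row exactly once.
a$year <- 2018
b$year <- 2019
stacked <- rbind(a, b)
nrow(stacked)  # 6 rows: 3 from a plus 3 from b
```

Which of the two is appropriate depends on whether the datasets share a genuine unique key; without one, stacking with a year column is often the safer choice.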

4.  Create a summary statistics table of the merged dataset, and the two unmerged datasets (so that one can see the before/after datasets). This will give the reader some idea about the variables in your data, and their distribution.  

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(MERGED_TRIPS, 
          type = "text", 
          title = "Summary")
## 
## Summary
## =====================================================
## Statistic          N     Mean  St. Dev.  Min    Max  
## -----------------------------------------------------
## VendorID        387,182 2.000   0.000     2      2   
## passenger_count 387,182 1.000   0.000     1      1   
## PULocationID    387,182 90.345  63.749    1     265  
## fare_amount     387,182 7.571   5.734   1.000 437.500
## total_amount    387,182 8.372   5.734   1.800 438.300
## trip_distance   387,182 1.422   1.635   0.000 98.270 
## -----------------------------------------------------
summary(MERGED_TRIPS)
##     VendorID passenger_count  PULocationID     fare_amount     
##  Min.   :2   Min.   :1       Min.   :  1.00   Min.   :  1.000  
##  1st Qu.:2   1st Qu.:1       1st Qu.: 52.00   1st Qu.:  4.500  
##  Median :2   Median :1       Median : 74.00   Median :  6.000  
##  Mean   :2   Mean   :1       Mean   : 90.34   Mean   :  7.572  
##  3rd Qu.:2   3rd Qu.:1       3rd Qu.: 83.00   3rd Qu.:  9.500  
##  Max.   :2   Max.   :1       Max.   :265.00   Max.   :437.500  
##   total_amount     trip_distance   
##  Min.   :  1.800   Min.   : 0.000  
##  1st Qu.:  5.300   1st Qu.: 0.550  
##  Median :  6.800   Median : 0.950  
##  Mean   :  8.372   Mean   : 1.422  
##  3rd Qu.: 10.300   3rd Qu.: 1.770  
##  Max.   :438.300   Max.   :98.270
summary(new_df.2018)
##     VendorID passenger_count  PULocationID    fare_amount      total_amount   
##  Min.   :2   Min.   :1       Min.   :  1.0   Min.   :  1.00   Min.   :  1.80  
##  1st Qu.:2   1st Qu.:1       1st Qu.: 65.0   1st Qu.:  7.00   1st Qu.:  7.80  
##  Median :2   Median :1       Median : 75.0   Median : 10.50   Median : 11.30  
##  Mean   :2   Mean   :1       Mean   :104.1   Mean   : 12.39   Mean   : 13.19  
##  3rd Qu.:2   3rd Qu.:1       3rd Qu.:147.0   3rd Qu.: 15.50   3rd Qu.: 16.30  
##  Max.   :2   Max.   :1       Max.   :265.0   Max.   :437.50   Max.   :438.30  
##  trip_distance   
##  Min.   : 0.000  
##  1st Qu.: 1.150  
##  Median : 2.105  
##  Mean   : 2.731  
##  3rd Qu.: 3.450  
##  Max.   :98.270
summary(new_df.2019)
##     VendorID passenger_count  PULocationID    fare_amount      total_amount   
##  Min.   :2   Min.   :1       Min.   :  1.0   Min.   :  1.00   Min.   :  1.80  
##  1st Qu.:2   1st Qu.:1       1st Qu.: 65.0   1st Qu.:  7.00   1st Qu.:  7.80  
##  Median :2   Median :1       Median : 75.0   Median : 10.50   Median : 11.30  
##  Mean   :2   Mean   :1       Mean   :104.1   Mean   : 12.39   Mean   : 13.19  
##  3rd Qu.:2   3rd Qu.:1       3rd Qu.:147.0   3rd Qu.: 15.50   3rd Qu.: 16.30  
##  Max.   :2   Max.   :1       Max.   :265.0   Max.   :437.50   Max.   :438.30  
##  trip_distance   
##  Min.   : 0.000  
##  1st Qu.: 1.150  
##  Median : 2.105  
##  Mean   : 2.731  
##  3rd Qu.: 3.450  
##  Max.   :98.270

5.  Pick any two quantitative variables from the data set that interests you. 

 Run a correlation (measures strength of linear relationship) between the two variables, and run the covariance between the two variables. Interpret.

?cor
cor(MERGED_TRIPS$trip_distance, MERGED_TRIPS$total_amount, method = c("pearson"))
## [1] 0.9515365
?cov
cov(MERGED_TRIPS$trip_distance, MERGED_TRIPS$total_amount, method = c("pearson"))
## [1] 8.921415

Specifically, I was curious to explore the relation between the total amount charged and the distance traveled.

The Pearson correlation between trip distance and the total amount charged was 0.9515. This value is very close to 1, which indicates a very strong positive linear relationship between the distance of a trip and the amount charged. This is expected, as the base fare of a taxi trip is calculated based on the distance traveled.

The covariance of 8.92 is positive, which tells us that the two variables tend to move in the same direction: longer trips tend to come with larger total amounts. Unlike the correlation, its magnitude depends on the units of the variables (here, distance units times dollars), so 8.92 by itself does not say how strong the relationship is. Neither statistic shows that a change in one variable causes the other to change.
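The two statistics are linked by cor(x, y) = cov(x, y) / (sd(x) * sd(y)). As a rough check, the sketch below plugs in the covariance reported above and the rounded standard deviations from the stargazer table (1.635 for trip_distance, 5.734 for total_amount). Because those standard deviations are rounded to three decimals, the result should only approximately match the reported correlation.

```r
# Values copied from the output above (standard deviations are rounded)
cov_xy <- 8.921415   # cov(trip_distance, total_amount)
sd_x   <- 1.635      # St. Dev. of trip_distance (stargazer table)
sd_y   <- 5.734      # St. Dev. of total_amount (stargazer table)

# Correlation recovered from the covariance and the two standard deviations
r_approx <- cov_xy / (sd_x * sd_y)
print(round(r_approx, 3))
```

The result agrees with the reported 0.9515 to about three decimal places, which is as close as the rounded inputs allow.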

plot(MERGED_TRIPS$trip_distance, MERGED_TRIPS$total_amount)

Interestingly, when you look at the plot, you see some outliers, which may indicate either a recording error (e.g., $4.30 recorded as $430) or fares that were related more to traffic conditions than to distance alone!

In the next discussion, you will run a bivariate regression -  lm(y~x, data=df). Look at the slope coefficient.  Now, calculate cov(x,y)/var(x) . Is this the slope coefficient from your linear regression?  HINT - it should be ! 

lm(total_amount~trip_distance, data=MERGED_TRIPS)
## 
## Call:
## lm(formula = total_amount ~ trip_distance, data = MERGED_TRIPS)
## 
## Coefficients:
##   (Intercept)  trip_distance  
##         3.626          3.337
cov(MERGED_TRIPS$trip_distance,MERGED_TRIPS$total_amount)/var(MERGED_TRIPS$trip_distance)
## [1] 3.337144

Yes, the slope coefficient of 3.337 from lm() is the same as cov(x,y)/var(x) = 3.337144.
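This match is not a coincidence: for a bivariate least-squares regression, the slope equals cov(x, y) / var(x), which can also be written as cor(x, y) * sd(y) / sd(x). The sketch below checks the identity on simulated data; the sample size, coefficients, and noise level are arbitrary assumptions chosen for illustration and are unrelated to the taxi data.

```r
# Simulated data (arbitrary parameters, for illustration only)
set.seed(42)
x <- runif(200, min = 0, max = 10)
y <- 3.5 + 3.3 * x + rnorm(200, sd = 2)

# Slope from the fitted bivariate regression
fit <- lm(y ~ x)
slope_lm <- unname(coef(fit)["x"])

# The same slope from the covariance and variance
slope_cov <- cov(x, y) / var(x)

# And from the correlation and the two standard deviations
slope_cor <- cor(x, y) * sd(y) / sd(x)

print(c(lm = slope_lm, cov_over_var = slope_cov, cor_times_sd_ratio = slope_cor))
```

All three numbers should agree up to floating-point error, whatever seed or parameters are used.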