Part 1

Correlation: Simply put, correlation is the relationship between two or more things. In statistics, it measures the strength and direction of the linear association between two variables. The Pearson correlation coefficient ranges from -1 (perfect negative/inverse relationship) to 1 (perfect positive/direct relationship), with 0 indicating no linear relationship. It is important to note that correlation can indicate a relationship, but it does not imply causation.

Covariance: Covariance measures the extent to which two variables vary together linearly. It reveals whether the variables move in the same or opposite directions by examining their co-variability around their respective means. A negative value means the variables move in opposite directions; a positive value means they move in the same direction. Covariance ranges from negative infinity to positive infinity and depends on the units of the variables, so its magnitude can be tricky to interpret.

Correlation and covariance tell us similar things, but correlation standardizes the covariance (dividing it by the product of the two standard deviations), so it can be compared across different data sets.
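As a minimal sketch of why the standardized version is easier to compare, the toy example below (made-up height and weight values, not taken from the basketball data) rescales one variable and shows that the covariance changes with the units while the correlation does not.

```r
# Toy data, invented purely for illustration: height in cm and weight in kg
x <- c(150, 160, 170, 180, 190)
y <- c(52, 58, 69, 75, 88)

# Correlation is the covariance divided by the product of the standard deviations
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

# Rescale x from centimetres to metres
x_m <- x / 100

cov_cm <- cov(x, y)     # covariance with x in cm
cov_m  <- cov(x_m, y)   # covariance with x in m (100 times smaller)
r_m    <- cor(x_m, y)   # correlation is unchanged by the rescaling

print(c(cov_cm = cov_cm, cov_m = cov_m))
print(c(r_manual = r_manual, r_builtin = r_builtin, r_m = r_m))
```

The two covariances differ by a factor of 100 even though the underlying relationship is identical, which is why the report leans on the correlation coefficient when comparing the two players.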

Part 2

#Import the DataSets
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
setwd("/Users/Ryan/OneDrive/Documents/Data Analysis- Sharma")
jordan_career <- read_csv("Discussions/Discussion12/jordan_career.csv")
## Rows: 1072 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): age, team, opp, result
## dbl  (20): game, mp, fg, fga, fgp, three, threeatt, threep, ft, fta, ftp, or...
## lgl   (1): plus_minus
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(jordan_career)


lebron_career <- read_csv("Discussions/Discussion12/lebron_career.csv")
## Rows: 1265 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): age, team, opp, result
## dbl  (21): game, mp, fg, fga, fgp, three, threeatt, threep, ft, fta, ftp, or...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(lebron_career)
#Create new columns of interest

#Total career game column

jordan_career <- jordan_career %>%
  mutate(Career_Game = row_number())

lebron_career <- lebron_career %>%
  mutate(Career_Game = row_number())
#Create new columns of interest

#(total rebounds, points, and assists for any given game. We will call this 'total stats')

jordan_career <- jordan_career %>%
  mutate(Total_Stats = trb + ast + pts)

lebron_career <- lebron_career %>%
  mutate(Total_Stats = trb + ast + pts)
#Create Cleaner DataSets with the stats we need
clean_jordan <- select(jordan_career, Career_Game, Total_Stats)

clean_lebron <- select(lebron_career, Career_Game, Total_Stats)
#Merge the data: we want to join each player's total stats on career game number.

merged_data <- merge(clean_jordan, clean_lebron, by = "Career_Game", suffixes = c("Jordan", "Lebron"), all = TRUE)
#Eliminate the NA values: we only go up to game 1072, the career game number where Jordan retired.

merged_data <- slice(merged_data, 1:1072)
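The slice above works because merge() sorts the result by Career_Game and Jordan's data has exactly 1072 rows. A hedged alternative, sketched here on two tiny made-up data frames (`a` and `b` are hypothetical stand-ins for clean_jordan and clean_lebron), is to drop the incomplete rows directly so the row count does not need to be hard-coded.

```r
# Hypothetical stand-ins for the two cleaned data sets (values are invented)
a <- data.frame(Career_Game = 1:3, Total_Stats = c(30, 41, 25))
b <- data.frame(Career_Game = 1:5, Total_Stats = c(28, 35, 44, 39, 31))

# Full outer merge, as in the report: games 4 and 5 get NA for the shorter career
m <- merge(a, b, by = "Career_Game", suffixes = c("Jordan", "Lebron"), all = TRUE)

# Option 1: keep only the rows where both players have a value
m_complete <- m[complete.cases(m), ]

# Option 2: an inner merge (the default, all = FALSE) never creates the NA rows
m_inner <- merge(a, b, by = "Career_Game", suffixes = c("Jordan", "Lebron"))

print(m_complete)
```

Either option leaves only the games both players have in common, which is the same set of rows the slice selects in the report.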

The data frame “merged_data” contains the career game number and the total stats (points, rebounds, and assists) each player recorded in that game. We will run a correlation to see if the players got better over time with experience, which would show up as a generally positive correlation between career game and total stats. We will also plot both players on the same graph for comparison.

Summary Stats.

stargazer(merged_data, type = 'text')
## 
## ==================================================
## Statistic           N    Mean   St. Dev. Min  Max 
## --------------------------------------------------
## Career_Game       1,072 536.500 309.604   1  1,072
## Total_StatsJordan 1,072 41.602   11.165   8   93  
## Total_StatsLebron 1,072 41.466   9.354   11   75  
## --------------------------------------------------

From the summary stats we can see that Jordan averaged slightly higher total stats per game than Lebron over these 1,072 games (41.6 vs 41.5; remember we omitted steals and blocks). Jordan has the higher max at 93 total stats in a game, whereas Lebron’s max sits at 75. Jordan also owns the lower min at 8 vs Lebron’s 11. Jordan had a higher standard deviation than Lebron (11.2 vs 9.4), meaning that Lebron’s output was slightly more consistent around his mean.

Run Correlation

#Calc correlation
correlation_lebron <- cor(merged_data$Career_Game, merged_data$Total_StatsLebron)

correlation_jordan <- cor(merged_data$Career_Game, merged_data$Total_StatsJordan)
  
print(correlation_lebron)
## [1] 0.06288728
print(correlation_jordan)
## [1] -0.3117435
#Calc Co-variance
cov_lebron <- cov(merged_data$Career_Game, merged_data$Total_StatsLebron)

print(cov_lebron)
## [1] 182.1289
cov_jordan <- cov(merged_data$Career_Game, merged_data$Total_StatsJordan)

print(cov_jordan)
## [1] -1077.649

The correlation shows a very weak positive relationship between career games and total stats for Lebron James (r ≈ 0.06): the more games played, the slightly higher the total stats. Jordan shows a stronger, though still fairly weak, negative relationship (r ≈ -0.31): the more games played, the lower the total stats.
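One rough way to put these strengths in perspective is to square each coefficient, which for a simple linear relationship gives the share of variance in total stats associated with career game number. The sketch below just re-enters the two printed correlations by hand; the variable names are illustrative.

```r
# Correlation coefficients as printed in the output above
r_lebron <- 0.06288728
r_jordan <- -0.3117435

# Squared correlation: share of variance linked to a linear trend over games
r2_lebron <- r_lebron^2
r2_jordan <- r_jordan^2

# Express as percentages, rounded to one decimal place
print(round(100 * c(Lebron = r2_lebron, Jordan = r2_jordan), 1))
```

This works out to roughly 0.4% for Lebron and 9.7% for Jordan, so even the stronger of the two relationships leaves most of the game-to-game variation unexplained by career game number alone.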

The covariances show the same trend. Lebron’s covariance is positive and Jordan’s is negative, matching the signs of the correlation coefficients. The magnitude of a covariance is hard to interpret on its own because it depends on the units of the variables (the correlation coefficient is the standardized version). Since we are tracking the same variables, in the same league, over the same range of games, we can compare these two values with some accuracy: the larger absolute covariance for Jordan is consistent with his stronger correlation, and the signs agree with the direction of the correlation for both players.
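As a quick consistency check, the reported covariances can be divided by the product of the standard deviations from the summary table to recover the correlations. The sketch below re-enters the printed (rounded) values by hand, so small differences in the last digits are expected.

```r
# Standard deviations as printed in the stargazer summary table (rounded)
sd_game   <- 309.604
sd_jordan <- 11.165
sd_lebron <- 9.354

# Covariances as printed by cov() above
cov_jordan <- -1077.649
cov_lebron <- 182.1289

# Correlation = covariance / (sd of x * sd of y)
r_jordan_check <- cov_jordan / (sd_game * sd_jordan)
r_lebron_check <- cov_lebron / (sd_game * sd_lebron)

print(c(Jordan = r_jordan_check, Lebron = r_lebron_check))
```

Both results agree with the cor() output (about -0.312 and 0.063) to within rounding, which confirms that the two measures describe the same relationship on different scales.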

#Lebron in Red, Jordan in Blue.
plot(merged_data$Career_Game, merged_data$Total_StatsJordan, type = 'l', col = 'blue', xlab = "Game", ylab = "TotalStats")
lines(merged_data$Career_Game, merged_data$Total_StatsLebron, col = 'red')
legend("topright", legend = c("JordanStats", "LebronStats"), col = c('blue', 'red'), lty = 1) #lty = 1 draws the colored line samples in the legend

It appears that Jordan declined faster than Lebron toward the end of his career, which could account for the negative correlation. A better test may be to use only the data up to each player’s prime, excluding the decline, if you are looking solely at experience and improvement.
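A minimal sketch of that idea follows. It uses simulated data with a made-up trajectory (improvement until game 600, then decline), and both the cutoff and the helper `cor_up_to()` are hypothetical; choosing where a real player's prime ends would be a separate judgment call.

```r
set.seed(1)

# Simulated career: stats rise until game 600, then fall, plus random noise.
# These numbers are invented and are not Jordan's or Lebron's data.
toy <- data.frame(Career_Game = 1:1000)
trend <- ifelse(toy$Career_Game <= 600,
                30 + 0.02 * toy$Career_Game,
                42 - 0.05 * (toy$Career_Game - 600))
toy$Total_Stats <- trend + rnorm(nrow(toy), sd = 3)

# Hypothetical helper: correlation using only games up to a chosen cutoff
cor_up_to <- function(data, cutoff) {
  sub <- data[data$Career_Game <= cutoff, ]
  cor(sub$Career_Game, sub$Total_Stats)
}

cor_full  <- cor_up_to(toy, 1000)  # whole simulated career, decline included
cor_prime <- cor_up_to(toy, 600)   # only the games before the decline starts

print(c(full_career = cor_full, up_to_prime = cor_prime))
```

In this simulation the correlation up to the cutoff is clearly positive, while the full-career correlation is pulled down by the late decline, which is the effect suspected in Jordan's numbers.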