# Clearing workspace
rm(list = ls()) # Clear environment
gc()            # Clear unused memory
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 522804 28.0    1163480 62.2   660491 35.3
## Vcells 949031  7.3    8388608 64.0  1769514 13.6
cat("\f")       # Clear the console

1 Do a few Google searches and tell us what is correlation (5 lines max).

Correlation is a statistical measure of the extent to which two variables are linearly related. It is measured by the coefficient r, which takes values between -1 and 1 inclusive. A value above 0 indicates a positive relationship, and a value below 0 indicates a negative one. In a positive relationship both variables tend to increase together; in a negative relationship one variable tends to decrease as the other increases. The closer r is to either 1 or -1, the stronger the linear relationship.
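
As a quick check of the definition, here is a minimal sketch using base R's `cor()`. The vectors `x`, `y_up`, and `y_down` are toy values invented for illustration, not data from this assignment.

```r
# Toy values, invented purely for illustration
x      <- c(1, 2, 3, 4, 5)
y_up   <- c(2, 4, 5, 4, 6)   # tends to rise with x
y_down <- c(9, 7, 6, 4, 1)   # tends to fall as x rises

r_up   <- cor(x, y_up)       # Pearson's r is the default method
r_down <- cor(x, y_down)

print(r_up > 0)              # positive relationship
print(r_down < 0)            # negative relationship

# An exact straight line gives r of 1 or -1 (up to floating-point rounding)
print(all.equal(cor(x, 3 * x + 2), 1))
print(all.equal(cor(x, -3 * x + 2), -1))
```

The sign of r gives the direction of the relationship, and its distance from 0 gives the strength.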

2 Do a few Google searches and tell us what is covariance (5 lines max).

Covariance measures the extent to which two random variables change together. Like correlation, it can be positive or negative. If the covariance of two variables is positive, they tend to move in the same direction. If it is negative, larger values of one variable tend to correspond to smaller values of the other. For example, two stocks that tend to move up together have a positive covariance.
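
The link between covariance and correlation can be sketched in a few lines of base R. The vectors below are toy values invented for illustration. The sketch computes the sample covariance by hand, compares it with `cov()`, and then shows that dividing by both standard deviations recovers `cor()`.

```r
# Toy values, invented purely for illustration
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)

# Sample covariance: average product of deviations, with n - 1 in the denominator
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
cov_xy     <- cov(x, y)
print(all.equal(cov_manual, cov_xy))

# Correlation is covariance rescaled by the two standard deviations
r_from_cov <- cov_xy / (sd(x) * sd(y))
print(all.equal(r_from_cov, cor(x, y)))

# Covariance depends on the units of measurement; correlation does not
print(all.equal(cov(100 * x, y), 100 * cov_xy))
print(all.equal(cor(100 * x, y), cor(x, y)))
```

This is why correlation is usually easier to interpret: it is unit-free and bounded, while the size of a covariance changes with the scale of the variables.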

3 Try merging any datasets that interest you based on the data dictionary (pay attention to the unique keys), and create a meaningful dataset (one that has an interesting y (outcome) and an interesting x (independent variable)).

For this, I decided to look at Michael Jordan's and LeBron James' regular season statistics from their careers. These two players have constantly been compared to one another in the debate over who is the greatest basketball player of all time. While there are plenty of other factors, such as playoff statistics, championships, and teammates, I thought it would be interesting to look at their data. The original data sets contain many variables, but for the summary statistics I only wanted to look at a few. I therefore created data frames containing Games Played, Win/Loss, Minutes Played, Field Goal Percentage, Assists, Steals, Blocks, Points, and Plus/Minus.

# MJ Regular Season Data
MJData <- read.csv('./jordan_career.csv') # Reading MJ data from a local CSV file
# Lebron Regular Season Data
LebronData <- read.csv('./lebron_career.csv') # Reading Lebron data from a local CSV file
?merge
## starting httpd help server ... done
# Creating MJ Data Frame
selected_columns <- c(Games = "game", 
                      Win.Loss = "result", 
                      Minutes.Played = "mp", 
                      FG.Percentage = "fgp", 
                      Assists = "ast", 
                      Steals = "stl", 
                      Blocks = "blk", 
                      Points = "pts", 
                      Plus.Minus = "plus_minus")

MJData.df <- MJData[, selected_columns]
# Creating Lebron Data Frame/columns we are more interested in
selected_columns <- c(Games = "game", 
                      Win.Loss = "result", 
                      Minutes.Played = "mp", 
                      FG.Percentage = "fgp", 
                      Assists = "ast", 
                      Steals = "stl", 
                      Blocks = "blk", 
                      Points = "pts", 
                      Plus.Minus = "plus_minus")

LebronData.df <- LebronData[, selected_columns]
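
One detail worth noting about selecting columns with a named character vector: the data frame is indexed by the vector's values, so the vector's names (for example `Games`) are not applied as column names. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical):

```r
# Hypothetical data frame with invented values
toy <- data.frame(game = 1:2, mp = c(40, 38), pts = c(16, 21))

cols <- c(Games = "game", Points = "pts")
sub  <- toy[, cols]
print(names(sub))           # still the original column names

# To apply the friendlier labels, assign them explicitly
renamed <- sub
names(renamed) <- names(cols)
print(names(renamed))
```

Keeping the original names is convenient when a later step, such as a merge, refers to columns like `game` or `pts`.
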
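# Merging on game number (assumes `game` is each player's career game number,
# so each row pairs the two players' Nth regular season game)
Merged.Data <- merge(x = MJData.df,
                     y = LebronData.df,
                     by = "game",
                     suffixes = c(".MJ", ".Lebron"))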
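
When merging two tables, the key column should identify the same thing in both of them and should be unique within each. The sketch below illustrates this on two made-up data frames; `player_a`, `player_b`, and all of their values are hypothetical and are not taken from the CSV files.

```r
# Hypothetical per-game tables with invented values
player_a <- data.frame(game = 1:3, pts = c(16, 21, 37))
player_b <- data.frame(game = 2:4, pts = c(21, 8, 33))

# Check that the key is unique in each table before merging
print(anyDuplicated(player_a$game) == 0)
print(anyDuplicated(player_b$game) == 0)

# Inner join on the shared key; suffixes label the non-key columns
merged <- merge(player_a, player_b,
                by = "game",
                suffixes = c(".a", ".b"))
print(merged)

# all = TRUE keeps unmatched games from both sides, filling gaps with NA
merged_all <- merge(player_a, player_b,
                    by = "game",
                    suffixes = c(".a", ".b"),
                    all = TRUE)
print(nrow(merged_all))
```

With the default inner join only games present in both tables survive, which matters when two careers have different lengths.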