1 Pick any two quantitative variables from a data set that interests you. You can continue to use the dataset from your last discussion, or pick up a new dataset.

I will continue with my Michael Jordan and LeBron James datasets.

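MJData <- read.csv('./jordan_career.csv') # Read the MJ career data from a local CSV
# Select the columns of interest. The names on the left are labels only;
# subsetting with [ , ] keeps the original column names (game, result, ...).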
selected_columns <- c(Games = "game", 
                      Win.Loss = "result", 
                      Minutes.Played = "mp", 
                      FG.Percentage = "fgp", 
                      Assist = "ast", 
                      Steals = "stl", 
                      Blocks = "blk", 
                      Points = "pts", 
                      Plus.Minus = "plus_minus")

MJData.df <- MJData[, selected_columns]

A. Tell us what the dependent and independent variables are. Type out your estimating equation. I am expecting to see subscripts i on your y, x, and error term, professionally done. Make sure to describe these two variables.

In the last discussion, I looked at points vs. field goal percentage for LeBron, so this time I will change it up. In this discussion, I will look at points vs. games played for Michael Jordan, to see how his scoring changes as the regular season progresses. This should tell us whether he typically scores more at the beginning, middle, or end of an 82-game season. Our dataset gives the game number within the season for every game.

head(MJData.df) # Show the first six rows
##   game  result mp   fgp ast stl blk pts plus_minus
## 1    1 W (+16) 40 0.313   7   2   4  16         NA
## 2    2  L (-2) 34 0.615   5   2   1  21         NA
## 3    3  W (+6) 34 0.542   5   6   2  37         NA
## 4    4  W (+5) 36 0.381   5   3   1  25         NA
## 5    5 L (-16) 33 0.467   5   1   1  17         NA
## 6    6  W (+4) 27 0.474   3   3   1  25         NA

I have included the first rows of this large dataset so it is easy to see what I mean by every game having a game number. The data continue this way for all 1072 games of his career.

Independent Variable: Our independent variable is games played (the game number within a season), which I will put on the x-axis. It is the independent variable because the game numbers follow the schedule of an 82-game season and do not depend on how he plays. The only way one year may differ from another is when he missed a game due to injury or another reason. This should not affect our analysis much, since we are mainly interested in how he performs in the beginning, middle, and end of a long season. Still, injuries or time off could have an effect.

Dependent Variable: Our dependent variable is points scored, which we will put on the y-axis. This is often called the response or the outcome, and it is the focus of our study. Since we want to know how his points per game change over an 82-game season, we are interested in the number of points scored at different parts of the season.
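
The prompt also asks for the estimating equation. Based on the two variables described above, one way to write it is the following, where the labels pts and game are taken from the column names in the data and i indexes the individual games (i = 1, ..., 1072):

$$
\text{pts}_i = \beta_0 + \beta_1 \, \text{game}_i + \varepsilon_i
$$

Here \beta_0 is the intercept, \beta_1 is the slope on game number, and \varepsilon_i is the error term for game i.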

# Scatterplot this data
plot(x = MJData.df$game,
     y = MJData.df$pts,
     xlab = "Games Played",
     ylab = "Points",
     main = "Games Played vs Points"
     )

B. Estimate the linear regression in R using the lm() command.

?lm
MJ.LM <- lm(formula = MJData.df$pts~MJData.df$game)
summary(MJ.LM)
## 
## Call:
## lm(formula = MJData.df$pts ~ MJData.df$game)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.674  -6.878  -0.026   6.272  38.083 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    29.05546    0.57988  50.106   <2e-16 ***
## MJData.df$game  0.02698    0.01258   2.145   0.0322 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.736 on 1070 degrees of freedom
## Multiple R-squared:  0.00428,    Adjusted R-squared:  0.003349 
## F-statistic: 4.599 on 1 and 1070 DF,  p-value: 0.03221

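C. Interpret the slope and intercept parameters.

Slope: The slope gives the expected change in the dependent variable when the independent variable increases by one unit. Here the estimate is 0.02698, so each game later in the season (for example, going from game 52 to game 53) is associated with about 0.027 more points on average. The estimate is statistically significant at the 5% level (p = 0.0322), but the effect is small, and game number explains less than 1% of the variation in points (R-squared = 0.00428).

Intercept: The intercept represents the expected response when X = 0. Here the estimate is 29.05546, which is the fitted number of points at game number 0. Game numbers start at 1, so there is no game 0 in the data, and the intercept is best read as the baseline of the fitted line: roughly 29 points per game at the very start of a season.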
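
To make the size of the slope concrete, here is a small sketch that plugs the coefficients from the summary(MJ.LM) output into the fitted line at the first and last game of an 82-game season. The names b0, b1, fit_first, and fit_last are introduced only for this illustration.

```r
# Coefficients copied from the summary(MJ.LM) output (rounded as printed there)
b0 <- 29.05546  # intercept
b1 <- 0.02698   # slope on game number

# Fitted points at the first and last game of an 82-game season
fit_first <- b0 + b1 * 1
fit_last  <- b0 + b1 * 82

fit_first             # about 29.08
fit_last              # about 31.27
fit_last - fit_first  # about 2.19, which equals b1 * 81
```

Under this fitted line, the expected difference between game 1 and game 82 is a little over 2 points, which is small next to the residual standard error of 9.736.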

D. Replicate the slope and intercept parameters using the covariance/variance formulas, like we did in class with Excel.
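
For reference, the standard simple-regression formulas being replicated in this part can be written as follows, taking x_i to be the game number and y_i to be the points scored in game i (this notation is assumed here for illustration):

$$
\hat{\beta}_1 = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \, \bar{x}
$$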

Slope

# Covariance
cov(x = MJData.df$game, 
    y = MJData.df$pts)
## [1] 15.08709
# Variance
var(x = MJData.df$game)
## [1] 559.2103
# Beta 1
beta_1 <- cov(x = MJData.df$game, y = MJData.df$pts)/var(x = MJData.df$game)
beta_1
## [1] 0.02697928

Intercept

# Beta 0 
beta_0 <- mean(MJData.df$pts) - mean(MJData.df$game) * beta_1
beta_0
## [1] 29.05546
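
Both values match the lm() estimates. As a cross-check that this agreement holds in general and not only on this dataset, here is a minimal sketch on simulated data. The variables x and y, the seed, and the numbers used to generate them are made up for illustration and are not the Jordan data.

```r
set.seed(1)  # arbitrary seed so the simulated numbers are reproducible

# Simulated stand-ins for game number (x) and points (y); not the real data
x <- rep(1:82, times = 5)
y <- 29 + 0.03 * x + rnorm(length(x), sd = 10)

# Closed-form estimates from the covariance and variance
slope_manual     <- cov(x, y) / var(x)
intercept_manual <- mean(y) - slope_manual * mean(x)

# Estimates from lm() on the same simulated data
fit <- lm(y ~ x)

# Both comparisons should return TRUE up to floating-point tolerance
all.equal(slope_manual,     unname(coef(fit)["x"]))
all.equal(intercept_manual, unname(coef(fit)["(Intercept)"]))
```

Because cov() and var() both divide by n - 1, the divisor cancels in the ratio, so the result is the same least-squares slope that lm() reports.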