I will continue with my Michael Jordan and LeBron James datasets.
MJData <- read.csv('./jordan_career.csv') # Read MJ's career game log from a local CSV
# Creating MJ Data Frame
selected_columns <- c(Games = "game",
Win.Loss = "result",
Minutes.Played = "mp",
FG.Percentage = "fgp",
Assist = "ast",
Steals = "stl",
Blocks = "blk",
Points = "pts",
Plus.Minus = "plus_minus")
# Columns are selected by the values of selected_columns, so the original column names are kept
MJData.df <- MJData[, selected_columns]
In the last discussion, I looked at points versus field goal percentage for LeBron, so this time I will change it up. In this discussion, I will look at points versus game number for Michael Jordan to see how his scoring changes as the regular season progresses. This should tell us whether he typically scores more at the beginning, middle, or end of an 82-game season. Our dataset gives us the game number of each game in every season.
head(MJData.df) # First six rows
##   game  result mp   fgp ast stl blk pts plus_minus
## 1    1 W (+16) 40 0.313   7   2   4  16         NA
## 2    2  L (-2) 34 0.615   5   2   1  21         NA
## 3    3  W (+6) 34 0.542   5   6   2  37         NA
## 4    4  W (+5) 36 0.381   5   3   1  25         NA
## 5    5 L (-16) 33 0.467   5   1   1  17         NA
## 6    6  W (+4) 27 0.474   3   3   1  25         NA
I have included the first rows of this large dataset so it is easy to see what I mean by each game having a game number. The log continues in this way for all 1,072 games of his career.
Independent Variable: Our independent variable is the game number within the season, which I will put on the x-axis. It is the independent variable because, per our dataset, the game numbers in an 82-game season do not depend on how he plays. The only way one year may differ from another is when he missed a game due to injury or another reason. This should not affect our analysis much, since we are more interested in how he performs in the beginning, middle, and end of a long season, though injuries or time off could still have an effect.
Dependent Variable: Our dependent variable is points scored, which we will put on the y-axis. This is often called the response or the outcome, and it is the focus of our study. Since we want to know how his points per game vary over an 82-game season, we are interested in the number of points scored in different parts of the season.
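Before fitting a line, one simple way to look at "different parts of the season" is to split the game numbers into thirds and compare average points. The sketch below is only an illustration under stated assumptions: it uses made-up data (the toy data frame and its values are simulated, not Michael Jordan's actual game log), and the cut points 27 and 55 are an arbitrary choice for splitting 82 games into three phases.

```r
# Simulated stand-in for a game log: two 82-game seasons (NOT real MJ data)
set.seed(42)
toy <- data.frame(game = rep(1:82, times = 2))
toy$pts <- round(30 + rnorm(nrow(toy), sd = 9))

# Label each game as beginning (1-27), middle (28-55), or end (56-82)
toy$phase <- cut(toy$game,
                 breaks = c(0, 27, 55, 82),
                 labels = c("beginning", "middle", "end"))

# Average points within each phase of the season
phase_means <- tapply(toy$pts, toy$phase, mean)
phase_means
```

With the real data, the same idea would use MJData.df$game and MJData.df$pts in place of the simulated columns.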
# Scatterplot this data
plot(x = MJData.df$game,
y = MJData.df$pts,
xlab = "Games Played",
ylab = "Points",
main = "Games Played vs Points"
)
# Fit a simple linear regression of points on game number
MJ.LM <- lm(formula = MJData.df$pts ~ MJData.df$game)
summary(MJ.LM)
##
## Call:
## lm(formula = MJData.df$pts ~ MJData.df$game)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -28.674  -6.878  -0.026   6.272  38.083
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    29.05546    0.57988  50.106   <2e-16 ***
## MJData.df$game  0.02698    0.01258   2.145   0.0322 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.736 on 1070 degrees of freedom
## Multiple R-squared: 0.00428, Adjusted R-squared: 0.003349
## F-statistic: 4.599 on 1 and 1070 DF, p-value: 0.03221
Slope: The slope gives us the expected change in the dependent variable when the independent variable increases by one unit. In our data, it is how much Michael Jordan's expected points increase or decrease when we go from, say, game 52 to game 53 in a season. The fitted slope of 0.02698 means roughly 0.027 more points for each additional game.
Intercept: The intercept represents the expected response when X = 0. In our case, this is the model's predicted points at game number 0, which the output above estimates at 29.06, not 0. Since there is no game 0 in a season, this value is an extrapolation; it is best read as the level of the fitted line just before the first game.
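To make these two interpretations concrete, here is a minimal sketch that plugs game numbers into the fitted line. The coefficient values are copied by hand from the summary output above, and predict_points is a hypothetical helper name introduced only for this illustration.

```r
# Coefficients copied by hand from the summary(MJ.LM) output (rounded values)
intercept <- 29.05546
slope <- 0.02698

# Hypothetical helper: points predicted by the fitted line for a game number
predict_points <- function(game) intercept + slope * game

# Predicted points at game 0 (the intercept), game 1, and game 82
predict_points(c(0, 1, 82))

# Total change the line implies between game 1 and game 82
slope * (82 - 1)
```

Under these rounded coefficients the fitted line rises by roughly 2.2 points between game 1 and game 82, which is small next to the residual standard error of 9.736 reported in the summary.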
Slope
# Covariance
cov(x = MJData.df$game,
y = MJData.df$pts)
## [1] 15.08709
# Variance
var(x = MJData.df$game)
## [1] 559.2103
# Beta 1
beta_1 <- cov(x = MJData.df$game, y = MJData.df$pts)/var(x = MJData.df$game)
beta_1
## [1] 0.02697928
Intercept
# Beta 0
beta_0 <- mean(MJData.df$pts) - mean(MJData.df$game) * beta_1
beta_0
## [1] 29.05546
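The hand-computed values match the lm() coefficients because, in simple linear regression, the least-squares slope equals cov(x, y) / var(x) and the intercept equals mean(y) minus the slope times mean(x). The sketch below checks this identity on simulated data, since the CSV is not available here; the toy data frame and the numbers used to generate it are assumptions for illustration only.

```r
# Simulated stand-in data (NOT real MJ data): three 82-game seasons
set.seed(1)
toy <- data.frame(game = rep(1:82, times = 3))
toy$pts <- 29 + 0.03 * toy$game + rnorm(nrow(toy), sd = 10)

# Coefficients from lm()
fit <- lm(pts ~ game, data = toy)

# Same coefficients from the covariance / variance formulas
beta_1 <- cov(toy$game, toy$pts) / var(toy$game)
beta_0 <- mean(toy$pts) - beta_1 * mean(toy$game)

# Compare (intercept, slope) from both approaches
same <- isTRUE(all.equal(unname(coef(fit)), c(beta_0, beta_1)))
same
```

On the real data, the equivalent check would compare coef(MJ.LM) with c(beta_0, beta_1).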