library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

lahman_data = read.csv("/Users/anuragreddy/Desktop/Statistics with R/Lahmans Databse .csv")
lahman_data <- lahman_data|>
  filter(yearID!=2020)

Contents:

1. Division Wins (DivWin) - Binary Variable

2. Logistic Regression - DivWin_Binary ~ Runs Per Game

3. C.I. of the Coefficient

Division Wins - Binary Variable

Data_set <- lahman_data |>
  mutate(RPG = round(R/G,2))|>
  select(DivWin,RPG,R)

Data_set <- Data_set |>
  mutate(DivWin_Binary = ifelse(DivWin =='N',0,1))

Interpretation: Division wins (DivWin) serve as the response variable in this logistic regression. The category ‘N’ (the team did not win its division) is encoded as 0, and ‘Y’ (the team won its division) is encoded as 1.
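The encoding rule can be checked on a small toy vector. This is only a sketch: the values below are made up for illustration and are not taken from the Lahman data, and `div_win` and `div_win_binary` are hypothetical names. It also shows that the rule maps every value other than ‘N’ to 1, so it is worth confirming that the real column contains only ‘Y’ and ‘N’.

```r
# Toy values standing in for the DivWin column (hypothetical, not from the dataset)
div_win <- c("Y", "N", "N", "Y", NA)

# Same rule as in the chunk above: 'N' -> 0, anything else -> 1
div_win_binary <- ifelse(div_win == "N", 0, 1)

# Cross-tabulate original and encoded values; a missing value stays missing
table(div_win, div_win_binary, useNA = "ifany")
```

Running the same `table()` call on `Data_set$DivWin` and `Data_set$DivWin_Binary` would confirm the mapping on the real data.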

Logistic Regression (DivWin_Binary ~ Runs Per Game)

In this logistic regression model, we aim to classify teams into those who have won division titles and those who have not, based on the metric of runs per game.

Data_set |>
  ggplot(aes(x=RPG,y=DivWin_Binary))+
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_point()+
  geom_smooth(method = "lm",se=FALSE)+
  theme_economist()
## `geom_smooth()` using formula = 'y ~ x'

Interpretation: The plot above shows the two classes, 0 (‘N’) and 1 (‘Y’), against runs per game (RPG). RPG on its own does not appear to separate division winners from non-winners cleanly. The straight regression line is also a poor fit for DivWin_Binary ~ Runs Per Game, because a line is not bounded between 0 and 1 while the response takes only those two values. Let's fit a logistic regression to the same data and see whether its curve suits the data better.

Logistic Regression.

Bin_Model <- glm(DivWin_Binary ~ RPG, data = Data_set,
             family = binomial(link = 'logit'))
Bin_Model
## 
## Call:  glm(formula = DivWin_Binary ~ RPG, family = binomial(link = "logit"), 
##     data = Data_set)
## 
## Coefficients:
## (Intercept)          RPG  
##      -9.812        1.797  
## 
## Degrees of Freedom: 629 Total (i.e. Null);  628 Residual
## Null Deviance:       630.5 
## Residual Deviance: 555.2     AIC: 559.2

Interpretation: For each one-unit increase in RPG, the log-odds of winning the division (DivWin_Binary = 1) increase by approximately 1.797. RPG is the only predictor in this model, so there are no other variables to hold constant.

To interpret this on the odds scale, we can exponentiate the coefficient. The odds ratio associated with RPG is approximately exp(1.797) = 6.03. This implies that a one-unit increase in RPG multiplies the odds of winning the division title by approximately 6.03.
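The odds-ratio reading can be made concrete with a short sketch that uses the coefficients printed in the model output (-9.812 and 1.797). The RPG values 4 and 5 are arbitrary illustrative choices, and `b0`, `b1`, and `odds` are names introduced here only for the example.

```r
# Coefficients copied from the printed glm output above
b0 <- -9.812
b1 <- 1.797

# Odds ratio for a one-unit increase in RPG
odds_ratio <- exp(b1)

# Predicted probabilities at two illustrative RPG values
p4 <- plogis(b0 + b1 * 4)
p5 <- plogis(b0 + b1 * 5)

# Convert a probability to odds
odds <- \(p) p / (1 - p)

# The ratio of the odds at RPG = 5 versus RPG = 4 reproduces the odds ratio
c(odds_ratio = odds_ratio, ratio_of_odds = odds(p5) / odds(p4), p4 = p4, p5 = p5)
```

The odds are multiplied by the same factor for every one-unit step in RPG, but the change in probability depends on where on the curve the step starts.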

# Logistic curve built from the fitted coefficients of Bin_Model
b <- unname(coef(Bin_Model))
binary <- \(x) 1 / (1 + exp(-(b[1] + b[2] * x)))
Data_set |>
  ggplot(aes(x=RPG,y=DivWin_Binary))+
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_point()+
  geom_function(fun = binary, color = 'blue', linewidth = 1) +  
  theme_economist()

Interpretation: The logistic curve fits the binary response better than the straight line shown in the previous graph.

Confidence Interval of the Coefficient:

tidy(Bin_Model, conf.int = TRUE) |>
  ggplot(mapping = aes(x = estimate, y = term,color=term)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = 'dotted', color = 'gray') +
  geom_errorbarh(mapping = aes(xmin = conf.low, 
                               xmax = conf.high, 
                               height = 0.5)) +
  labs(title = "Model Coefficient C.I.")+
  theme_economist()

Interpretation: The graph above shows the 95% confidence intervals for the intercept and the RPG coefficient.

df <- tidy(Bin_Model, conf.int = TRUE)
RPG_Row <- df[df$term == "RPG", ]

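# Half-width of the 95% Wald interval: z * standard error
# (named `margin` to avoid masking base::scale)
margin <- qnorm(0.975) * RPG_Row$std.error

CI_Lower <- RPG_Row$estimate - margin
CI_Upper <- RPG_Row$estimate + margin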

paste("The C.I of the RPG -", paste("(",round(CI_Lower,2),round(CI_Upper,2),")"))
## [1] "The C.I of the RPG - ( 1.36 2.24 )"
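The manual calculation above is a Wald interval (estimate plus or minus z times the standard error). To our understanding, `confint()` on a `glm` uses profile likelihood instead, so its endpoints can differ slightly from the Wald ones. The sketch below compares the two on simulated toy data, not on the Lahman data; `rpg`, `win`, and `toy_model` are hypothetical names, and the coefficients used to simulate are arbitrary.

```r
set.seed(1)  # simulated toy data, not the Lahman data

# Hypothetical runs-per-game values and outcomes drawn from a logistic model
rpg <- runif(300, min = 3.5, max = 6)
win <- rbinom(300, size = 1, prob = plogis(-9.8 + 1.8 * rpg))
toy_model <- glm(win ~ rpg, family = binomial(link = "logit"))

# Wald interval by hand: estimate +/- z * standard error
est <- coef(summary(toy_model))["rpg", "Estimate"]
se  <- coef(summary(toy_model))["rpg", "Std. Error"]
wald_manual <- est + c(-1, 1) * qnorm(0.975) * se

# confint.default() computes the same Wald interval
wald_builtin <- confint.default(toy_model)["rpg", ]

# confint() on a glm profiles the likelihood (older R versions need MASS installed)
profile_ci <- suppressMessages(confint(toy_model))["rpg", ]

rbind(wald_manual, wald_builtin, profile_ci)
```

With a few hundred observations the two kinds of interval are usually close, which is why either is commonly reported.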

Interpretation: When the confidence interval of a coefficient does not contain zero, the coefficient is statistically significant at the corresponding level of significance (0.05 for a 95% interval).

The confidence interval for RPG, (1.36, 2.24), lies entirely above zero, which indicates a statistically significant positive relationship between RPG and division wins. In other words, as RPG increases, the odds of winning the division also increase.
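The interval can also be read on the odds-ratio scale by exponentiating its endpoints. This is a minimal sketch using the rounded endpoints printed above; `ci_log_odds` and `ci_odds_ratio` are names introduced only for the example.

```r
# Endpoints copied from the printed interval above (log-odds scale)
ci_log_odds <- c(lower = 1.36, upper = 2.24)

# Exponentiating maps the interval to the odds-ratio scale
ci_odds_ratio <- exp(ci_log_odds)
round(ci_odds_ratio, 2)
```

An interval that lies above 0 on the log-odds scale lies above 1 on the odds-ratio scale; here it runs from roughly 3.9 to 9.4.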