library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
setwd("C:\\Users\\srini\\OneDrive\\Desktop\\Regression Analysis\\Homework 2")
getwd()
## [1] "C:/Users/srini/OneDrive/Desktop/Regression Analysis/Homework 2"
baseball_data=read.csv("S25-HW2-baseball.csv",header=TRUE)
View(baseball_data)
plot(x = baseball_data$BA,
y = baseball_data$W,
xlab = "Batting Average(BA)",
ylab = "Total Wins in 2024",
main = "Wins VS Batting Average",
pch = 21,
bg = "white",
col = "black"
)
The scatter plot suggests a positive correlation between Batting Average and Total Wins for the MLB teams, so fitting a simple linear regression may offer further insight.
lm_model=lm(W~BA,data=baseball_data)
summary(lm_model)
##
## Call:
## lm(formula = W ~ BA, data = baseball_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.498 -5.719 1.545 6.525 17.409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -88.73 41.00 -2.164 0.039160 *
## BA 697.85 168.46 4.143 0.000286 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.992 on 28 degrees of freedom
## Multiple R-squared: 0.38, Adjusted R-squared: 0.3578
## F-statistic: 17.16 on 1 and 28 DF, p-value: 0.0002864
From the summary above, and following the model Y = β₀ + β₁X + ε, the fitted regression equation is W = -88.73 + 697.85 × BA, where W denotes the predicted number of wins.
The intercept (β₀) in a simple linear regression model represents the predicted value of the response variable (y) when the predictor variable (x) is equal to zero. In other words, it is the point where the least squares line crosses the y-axis.
The slope parameter (β₁), the coefficient of x, represents the change in the mean value of the dependent variable (y) for a one-unit change in the independent variable (x).
In this example the estimated slope is 697.85, which means that a one-unit increase in Batting Average is associated with 697.85 more wins in a season. However, since Batting Average ranges between 0 and 1, a one-unit increase is not meaningful. It is more useful to say that a 0.010 increase in Batting Average is associated with about 7 more wins (697.85 × 0.010 ≈ 6.98).
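As a minimal sketch of this rescaling, the R snippet below converts the per-unit slope into the expected change in wins for smaller Batting Average increments. It uses the rounded slope printed by the summary; the variable names and the chosen increments are illustrative assumptions, not part of the original analysis.

```r
# Slope as printed by summary(lm_model), rounded (wins per 1.000 of BA)
slope_per_unit <- 697.85

# Illustrative increments in Batting Average (assumed for this example)
delta_ba <- c(0.001, 0.005, 0.010)

# Expected change in mean wins for each increment
delta_wins <- slope_per_unit * delta_ba

print(data.frame(delta_ba = delta_ba, delta_wins = round(delta_wins, 2)))
```

With the fitted model available, `coef(lm_model)["BA"]` would supply the unrounded slope in place of the hard-coded value.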
In this example the estimated intercept is -88.73, which says that a team with a Batting Average of 0 would be expected to win -88.73 games. This is not a realistic situation, because a Batting Average of 0 lies far outside the range of the data. The lowest Batting Average in the data is 0.221, for which the model still predicts a positive number of wins (-88.73 + 697.85 × 0.221 ≈ 65.5), so negative predictions do not arise within the observed range.
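The point about the intercept can be checked numerically. The sketch below, which assumes the rounded coefficients from the summary and uses a hypothetical helper named `predict_wins`, evaluates the fitted line at the lowest Batting Average mentioned in the text and finds the Batting Average at which the line would cross zero wins.

```r
# Coefficients as printed by summary(lm_model), rounded
b0 <- -88.73
b1 <- 697.85

# Hypothetical helper: predicted mean wins at a given Batting Average
predict_wins <- function(ba) b0 + b1 * ba

# Prediction at the lowest Batting Average mentioned in the text
wins_at_min_ba <- predict_wins(0.221)

# Batting Average at which the fitted line crosses zero wins
ba_zero_wins <- -b0 / b1

cat("Predicted wins at BA = 0.221:", round(wins_at_min_ba, 1), "\n")
cat("BA where predicted wins = 0:", round(ba_zero_wins, 3), "\n")
```

The zero crossing falls well below 0.221, which is why the negative intercept never produces a negative prediction for the teams in the data. With the model object at hand, `predict(lm_model, newdata = data.frame(BA = 0.221))` would give the same prediction from the unrounded coefficients.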
σ = 9.992 on 28 degrees of freedom
σ² = (9.992)² = 99.84
SSE = σ² × degrees of freedom = 99.84 × 28 = 2795.52
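The arithmetic above can be reproduced in R. This is a sketch that assumes only the rounded values printed in the summary (residual standard error, residual degrees of freedom, and R-squared); the variable names are illustrative. As a consistency check, it also recovers the F-statistic from R-squared.

```r
# Values as printed by summary(lm_model), rounded
sigma_hat <- 9.992   # residual standard error
df_resid  <- 28      # residual degrees of freedom (n - 2 in simple regression)
r_squared <- 0.38    # multiple R-squared

# Estimated error variance and error sum of squares
mse <- sigma_hat^2
sse <- mse * df_resid

# Consistency check: with one predictor, F = (R^2 / (1 - R^2)) * df_resid
f_from_r2 <- (r_squared / (1 - r_squared)) * df_resid

cat("MSE:", round(mse, 2), "\n")
cat("SSE:", round(sse, 2), "\n")
cat("F recovered from R-squared:", round(f_from_r2, 2), "\n")
```

If the model object is available, `sigma(lm_model)`, `df.residual(lm_model)`, and `deviance(lm_model)` return the unrounded residual standard error, degrees of freedom, and SSE directly.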
ΔW = 697.85 × ΔBA
For ΔW = 5:
ΔBA = 5 / 697.85
ΔBA ≈ 0.0072
For an increase of 5 wins, the Batting Average must increase by approximately 0.0072.
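The same inversion can be written as a short R sketch. It assumes the rounded slope from the summary, and the variable names are illustrative.

```r
# Slope as printed by summary(lm_model), rounded (wins per 1.000 of BA)
slope <- 697.85

# Target increase in wins
target_delta_wins <- 5

# Required increase in Batting Average: delta_BA = delta_W / slope
required_delta_ba <- target_delta_wins / slope

cat("Required increase in BA:", round(required_delta_ba, 4), "\n")
```

Note that this describes the fitted association across teams; it does not by itself show that raising a team's Batting Average would cause the additional wins.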
plot(x = baseball_data$ERA,
y = baseball_data$W,
xlab = "Earned Run Average(ERA)",
ylab = "Total Wins in 2024",
main = "Wins VS Earned Run Average",
pch = 21,
bg = "white",
col = "black"
)
The trend suggests a negative correlation: as ERA increases, the number of wins tends to decrease, so teams with a lower ERA generally achieve more wins. This contrasts with Batting Average, which is positively correlated with Wins: a higher Batting Average generally goes with a greater number of wins.
plot(x = baseball_data$ERA,
y = baseball_data$BA,
xlab = "Earned Run Average(ERA)",
ylab = "Batting Average(BA)",
main = "Batting Average VS Earned Run Average",
pch = 21,
bg = "white",
col = "black"
)
The scatter plot of Batting Average (BA) vs. Earned Run Average (ERA) does not show a strong or clear correlation between the two variables. The data points appear scattered without a distinct trend, suggesting little linear association between a team's batting average and its ERA. While there may be some minor patterns, this plot alone does not provide strong insight into their relationship.
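The visual impressions from the three scatter plots could be quantified with Pearson correlations. The sketch below defines a hypothetical helper, `pairwise_correlations`, which assumes a data frame with numeric columns named W, BA, and ERA, as used in the plots above. No values from the homework data are shown here.

```r
# Hypothetical helper: Pearson correlation matrix for wins, BA, and ERA.
# Assumes df has numeric columns W, BA, and ERA; rows with missing
# values in any of the three columns are dropped.
pairwise_correlations <- function(df) {
  cor(df[, c("W", "BA", "ERA")], use = "complete.obs")
}

# Usage with the homework data (requires baseball_data to be loaded):
# round(pairwise_correlations(baseball_data), 3)
```

A positive W-BA entry, a negative W-ERA entry, and a BA-ERA entry near zero would agree with the readings of the plots given above.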
Part 2 is continued below