library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)
##
## Attaching package: 'pwrss'
## The following object is masked from 'package:stats':
##
## power.t.test
Data_set <- "/Users/ba/Documents/IUPUI/Masters/First Sem/Statistics/Dataset/PitchingPost.csv"
Pitching_Data <- read.csv(Data_set)
# Keep only rows where ERA and BAOpp are finite (drops NA, NaN and Inf)
Regression_Data <-
  Pitching_Data |>
  filter(is.finite(ERA),
         is.finite(BAOpp))
# Boxplot of earned runs for three selected teams
Regression_Data |>
  filter(teamID %in% c("BOS", "CLE", "NYA")) |>
  ggplot(aes(x = teamID, y = ER, fill = teamID)) +
  geom_boxplot() +
  theme_economist()
ANOVA_test <- aov(ER~teamID,data=Regression_Data)
summary(ANOVA_test)
##               Df Sum Sq Mean Sq F value Pr(>F)
## teamID        31    163   5.251   1.342 0.0982 .
## Residuals   3670  14365   3.914
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The null hypothesis is that mean ER is the same for every team. The ANOVA uses all 32 teams in Regression_Data (31 degrees of freedom for teamID), not only the three teams shown in the boxplot. The F statistic is 1.342 with a p-value of 0.0982, which is greater than the commonly used significance level of 0.05. Therefore, we fail to reject the null hypothesis at the 0.05 significance level. The symbols printed after the p-value mark its level of significance: the more asterisks, the lower the p-value. Here there are no asterisks, only a '.', which flags a p-value between 0.05 and 0.1, so the result is not statistically significant at the conventional levels (0.05, 0.01, 0.001).
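As a check on the ANOVA table above, the F statistic and p-value can be recomputed from the printed mean squares and degrees of freedom. This is a minimal sketch that uses only the rounded values shown in the output, so the last digits may differ slightly from R's own result; the variable names are illustrative.

```r
# Values copied from the ANOVA table (rounded as printed)
ms_team  <- 5.251   # Mean Sq for teamID
ms_resid <- 3.914   # Mean Sq for Residuals
df_team  <- 31      # Df for teamID
df_resid <- 3670    # Df for Residuals

# F is the ratio of the between-team to the within-team mean square
f_value <- ms_team / ms_resid

# The p-value is the upper-tail probability of the F distribution
p_value <- pf(f_value, df_team, df_resid, lower.tail = FALSE)

print(round(f_value, 3))
print(p_value)
```

Comparing p_value with 0.05 leads to the same decision as reading Pr(>F) directly from the table.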
Explanation of Continuous and Categorical Variables:
‘ER’ (earned runs) is a quantitative measure: the number of runs a pitcher allows that are not the result of errors or passed balls. Strictly speaking it is a count, but it is treated here as a continuous response variable. ‘ER’ provides a direct metric of a pitcher’s performance that can be influenced by various factors including team strategy, opposing team strength, and in-game circumstances. Treating ‘ER’ as continuous lets us use ANOVA and regression techniques to estimate how pitcher performance varies across teams.
The objective was to assess whether the number of earned runs a pitcher allows differs by team. Such differences could suggest that certain teams are associated with better or worse pitching performances due to factors like defensive support, ballpark factors, or team management strategies, although an ANOVA on observational data shows association only and cannot attribute a difference to any one cause.
‘teamID’ classifies the data into groups based on the team for which each pitcher played. This variable is categorical because each team is distinct and represents a unique grouping within the dataset. Using ‘teamID’ as the categorical explanatory variable, we can perform a one-way ANOVA to compare the mean earned runs across teams.
With ‘teamID’ as the explanatory variable, the ANOVA tests whether there are statistically significant differences in earned runs allowed among teams. A significant result would indicate that at least one team's mean differs from the others, which could help identify teams that consistently foster better or poorer pitching performances and be useful for team management and strategic planning.
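The role of the categorical variable can be sketched on a small example. The data frame below is simulated and purely hypothetical (it is not the PitchingPost data); it only shows how treating the team as a factor gives one group mean per team and how the ANOVA degrees of freedom follow from the number of groups. Restricting the test to the three teams from the boxplot is an assumption made for illustration.

```r
# Hypothetical stand-in for Regression_Data: simulated values, not real results
set.seed(1)
toy <- data.frame(
  teamID = rep(c("BOS", "CLE", "NYA", "CHN"), each = 25),
  ER     = rpois(100, lambda = 2)
)

# Keep the three teams shown in the boxplot and treat teamID as a factor
three_teams <- subset(toy, teamID %in% c("BOS", "CLE", "NYA"))
three_teams$teamID <- factor(three_teams$teamID)

# The group means that the one-way ANOVA compares
team_means <- tapply(three_teams$ER, three_teams$teamID, mean)

# One-way ANOVA: 3 groups give 3 - 1 = 2 degrees of freedom for teamID
fit <- aov(ER ~ teamID, data = three_teams)
anova_table <- summary(fit)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]

print(team_means)
print(anova_table)
```

With k teams and n rows, the table has k - 1 degrees of freedom for teamID and n - k for the residuals, which is why the full analysis above reports 31 and 3670.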
model <- lm(W~ER, data=Regression_Data)
model
##
## Call:
## lm(formula = W ~ ER, data = Regression_Data)
##
## Coefficients:
## (Intercept)           ER
##    0.208688     0.005726
Interpretation:
Intercept (0.208688): This is the estimated value of W (wins) when the explanatory variable ER is zero. In other words, a pitching record with no earned runs has a predicted W of about 0.21.
ER (0.005726): This coefficient is the estimated change in the response variable W for a one-unit increase in ER. Because ER is the only explanatory variable in this model, there are no other variables to hold constant. Each additional earned run is associated with an increase of about 0.0057 in the predicted value of W; for example, at ER = 5 the prediction is 0.208688 + 0.005726 × 5 ≈ 0.237.
The slope is positive but very small, and the printed output reports no standard errors or p-values, so it does not show whether the slope differs significantly from zero. The coefficient describes an association in these data, not a causal effect of earned runs on wins.
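The printed coefficients give only point estimates. The sketch below first reproduces the fitted line from the two coefficients shown above, then illustrates how summary() and confint() report standard errors, p-values and confidence intervals for a model of the same form. The data frame toy is simulated and hypothetical, so its numbers say nothing about the real pitching data.

```r
# Fitted line from the coefficients printed above
predicted_w <- function(er) 0.208688 + 0.005726 * er
print(predicted_w(c(0, 1, 5)))

# Hypothetical stand-in data: simulated, not the real PitchingPost records
set.seed(42)
toy <- data.frame(ER = rpois(200, lambda = 2))
toy$W <- rbinom(200, size = 1, prob = 0.2)

fit <- lm(W ~ ER, data = toy)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))
print(coefs)

# 95% confidence interval for the slope
slope_ci <- confint(fit)["ER", ]
print(slope_ci)

# The prediction at ER = 0 is the intercept
pred0 <- predict(fit, newdata = data.frame(ER = 0))
```

If the confidence interval for the ER slope contains zero, the data do not distinguish the slope from no linear association at that confidence level.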