#Fetch data
data <- read.csv(file = "BaseballTimes.csv", header = TRUE, sep = ",")
names(data)
[1] "Game" "League" "Runs" "Margin" "Pitchers"
[6] "Attendance" "Time"
#dim(data)
#head(data, n = 10)
#Correlation
cor(data$Time, data$Attendance, use = "complete.obs")
[1] 0.2571925
cor(data$Time, data$Pitchers, use = "complete.obs")
[1] 0.8943082
cor(data$Time, data$Margin, use = "complete.obs")
[1] -0.07135831
cor(data$Time, data$Runs, use = "complete.obs")
[1] 0.6813144
Pitchers is the best single predictor of Time, since its correlation with Time (0.894) is the largest in absolute value.
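The four separate `cor()` calls can also be folded into one ranking step. The following is a minimal sketch, not the original analysis code: `rank_predictors` is a hypothetical helper name, and the usage part assumes `BaseballTimes.csv` sits in the working directory with the column names printed earlier.

```r
# Rank candidate predictors by the absolute value of their correlation
# with the response. `rank_predictors` is a hypothetical helper name.
rank_predictors <- function(df, response, predictors) {
  r <- sapply(predictors, function(p) {
    cor(df[[response]], df[[p]], use = "complete.obs")
  })
  # Sort by strength of association, ignoring sign
  r[order(abs(r), decreasing = TRUE)]
}

# Usage on the baseball data, only if the CSV is available
if (file.exists("BaseballTimes.csv")) {
  data <- read.csv("BaseballTimes.csv", header = TRUE)
  print(rank_predictors(data, "Time",
                        c("Runs", "Margin", "Pitchers", "Attendance")))
}
```

Sorting on the absolute value matters because a strong negative correlation would be just as useful for prediction as a strong positive one.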
##Fit the model
model <- lm(Time ~ Pitchers, data = data)
summary(model)
Call:
lm(formula = Time ~ Pitchers, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-37.945  -8.445  -3.104   9.751  50.794

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   94.843     13.387   7.085 8.24e-06 ***
Pitchers      10.710      1.486   7.206 6.88e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.46 on 13 degrees of freedom
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7844
F-statistic: 51.93 on 1 and 13 DF, p-value: 6.884e-06
The fitted regression equation, based on the predictor most correlated with Time (Pitchers), is Time = 94.843 + 10.710 × Pitchers. Interpretation of the slope coefficient: each additional pitcher used is associated with an increase of about 10.7 in predicted game time (in the units of Time).
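As a worked example of the slope interpretation, the fitted line can be evaluated by hand. This is a small sketch using the rounded coefficients from the `summary(model)` output. The value of 8 pitchers is an arbitrary illustration, and `predict_time` is a hypothetical helper name.

```r
# Coefficients copied (rounded) from the summary(model) output
b0 <- 94.843   # intercept
b1 <- 10.710   # slope for Pitchers

# Predicted game time for a given number of pitchers
predict_time <- function(pitchers) b0 + b1 * pitchers

predict_time(8)                     # 94.843 + 10.710 * 8 = 180.523
predict_time(9) - predict_time(8)   # one more pitcher adds the slope, 10.71
```

With the fitted model object available, `predict(model, newdata = data.frame(Pitchers = 8))` should give essentially the same value, up to rounding of the coefficients.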
To assess whether Pitchers is truly related to game time, we tested the slope in the regression of Time on Pitchers with a t-test. The t-statistic for Pitchers is 7.206 with a p-value of 6.88 × 10^-6, so we reject the null hypothesis of zero slope: there is strong evidence of a positive linear association.
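In simple linear regression the slope t-test is equivalent to a test of the correlation, so the reported t-statistic can be recovered from r alone via t = r·sqrt(n − 2) / sqrt(1 − r²). The sketch below checks this with the numbers printed in this analysis. It assumes n = 15, inferred from the 13 residual degrees of freedom plus the two estimated coefficients.

```r
r <- 0.8943082   # cor(Time, Pitchers) from the correlation step
n <- 15          # assumed: 13 residual df + 2 estimated coefficients

# t-statistic and two-sided p-value for H0: no linear association
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(t_stat, 3)   # about 7.206, matching the Pitchers row
p_value            # should be close to the reported 6.88e-06
r^2                # about 0.7998, the Multiple R-squared
```

The last line shows why the correlation step and the R-squared agree: with a single predictor, R-squared is the squared correlation.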
# Diagnostic plots: Residuals vs. Fitted: Check for linearity and constant variance.
plot(model, which = 1)
The residuals show mostly random scatter around zero, suggesting linearity is reasonable. There is slight curvature and increased spread at higher fitted values, likely due to a few outliers, but no severe violations are observed.
# Diagnostic plots: Normal Q-Q Plot- Check for normality of residuals.
plot(model, which = 2)
The residuals fall mostly along the straight reference line of the normal Q-Q plot, suggesting an approximately normal distribution. However, two outliers may raise concern about the normality assumption.
# Diagnostic plots: Scale-Location Plot- Check for constant variance.
plot(model, which = 3)
The Scale-Location plot shows a small increase in residual spread at higher fitted values, suggesting a small departure from constant variance.
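# Diagnostic plots: Cook's Distance Plot- Check for influential observations.
plot(model, which = 4)
Observations 4, 13, and 15 have the largest Cook's distances, meaning they have a notable influence on the estimated regression coefficients.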
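The visual reading of the Cook's distance plot can be backed up numerically. The sketch below rests on assumptions: `flag_influential` is a hypothetical helper name, and the 4/n cutoff is a common rule of thumb rather than something taken from this analysis. Because `plot.lm` labels the three most extreme points by default, the flagged set may not coincide exactly with the three labelled points.

```r
# Flag observations whose Cook's distance exceeds a cutoff.
# The 4/n default is a common rule of thumb, not part of the original analysis.
flag_influential <- function(model, cutoff = 4 / nobs(model)) {
  d <- cooks.distance(model)
  d[d > cutoff]   # named vector: observation label -> Cook's distance
}

# Usage on the baseball data, only if the CSV is available
if (file.exists("BaseballTimes.csv")) {
  data  <- read.csv("BaseballTimes.csv", header = TRUE)
  model <- lm(Time ~ Pitchers, data = data)
  print(flag_influential(model))
}
```

Refitting the model without the flagged rows and comparing the slope is one simple way to see how much those games actually move the estimate.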