Project

Author

Mujahid

#Fetch data

data <- read.csv(file = "BaseballTimes.csv", header = TRUE, sep = ",")
names(data)
[1] "Game"       "League"     "Runs"       "Margin"     "Pitchers"  
[6] "Attendance" "Time"      
#dim(data)
#head(data, n = 10)

#Correlation

cor(data$Time,data$Attendance, use = "complete.obs")
[1] 0.2571925
cor(data$Time,data$Pitchers, use = "complete.obs")
[1] 0.8943082
cor(data$Time,data$Margin, use = "complete.obs")
[1] -0.07135831
cor(data$Time,data$Runs, use = "complete.obs")
[1] 0.6813144

The best single predictor of Time is Pitchers, since it has the largest correlation with Time (r = 0.894), followed by Runs (r = 0.681).
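The four cor() calls above can be collapsed into one ranking step. The sketch below is illustrative only: rank_predictors is a hypothetical helper, not part of the original analysis, and it is demonstrated on R's built-in mtcars data as a stand-in for BaseballTimes.csv. It ranks by absolute correlation, so a strong negative correlation would not be overlooked.

```r
# Hypothetical helper: rank numeric predictors by |correlation| with a response.
rank_predictors <- function(df, response) {
  predictors <- setdiff(names(df)[sapply(df, is.numeric)], response)
  r <- sapply(df[predictors],
              function(x) cor(df[[response]], x, use = "complete.obs"))
  # Sort by absolute value so sign does not affect the ranking
  r[order(abs(r), decreasing = TRUE)]
}

# Stand-in example on built-in data (not the baseball data)
ranked <- rank_predictors(mtcars, "mpg")
head(ranked, 3)
```

On the baseball data, the equivalent call would presumably be rank_predictors(data, "Time"), assuming the predictor columns are numeric.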

##Fit the model

model <- lm(Time ~ Pitchers, data = data)
summary(model)

Call:
lm(formula = Time ~ Pitchers, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.945  -8.445  -3.104   9.751  50.794 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   94.843     13.387   7.085 8.24e-06 ***
Pitchers      10.710      1.486   7.206 6.88e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.46 on 13 degrees of freedom
Multiple R-squared:  0.7998,    Adjusted R-squared:  0.7844 
F-statistic: 51.93 on 1 and 13 DF,  p-value: 6.884e-06

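The output above gives the fitted regression equation for the predictor most strongly correlated with Time (Pitchers): Time = 94.843 + 10.710 * Pitchers. Interpretation of the slope coefficient: each additional pitcher used in a game is associated with an increase of about 10.7 in predicted game time, in the units in which Time is recorded.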
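As a small worked example of the slope interpretation, the sketch below plugs the coefficients from the summary(model) output into the fitted line. The function name predict_time and the choice of 10 and 11 pitchers are illustrative assumptions, not values taken from the data.

```r
# Coefficients copied from the summary(model) output above
intercept <- 94.843
slope     <- 10.710

# Hypothetical helper: predicted Time for a given number of pitchers
predict_time <- function(pitchers) intercept + slope * pitchers

predict_time(10)                     # predicted Time with 10 pitchers
predict_time(11) - predict_time(10)  # change from one extra pitcher = slope
```

With the fitted model object available, predict(model, newdata = data.frame(Pitchers = 10)) should give the same prediction up to rounding of the printed coefficients.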

To assess whether the predictor is truly related to game time, we use the t-test on the Pitchers slope from the regression of Time on Pitchers. The t value is 7.206 with a p-value of 6.88 * 10^-6, far below 0.05, so the slope differs significantly from zero and there is strong evidence of a positive linear relationship between Pitchers and Time.
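The reported statistics can be cross-checked against each other. In a simple linear regression, the t value is the estimate divided by its standard error, R-squared is the squared correlation, and the F-statistic is the squared t value. The sketch below uses only numbers copied from the output above; small discrepancies are expected because the printed values are rounded.

```r
# Values copied from the cor() and summary(model) output above
r        <- 0.8943082   # cor(Time, Pitchers)
estimate <- 10.710      # slope estimate
std_err  <- 1.486       # slope standard error
resid_df <- 13          # residual degrees of freedom

t_value   <- estimate / std_err               # t statistic for the slope
p_value   <- 2 * pt(-abs(t_value), resid_df)  # two-sided p-value
r_squared <- r^2                              # Multiple R-squared in simple regression
f_value   <- t_value^2                        # F-statistic with a single predictor

round(c(t = t_value, r_squared = r_squared, F = f_value), 4)
p_value
```

These should agree with the summary output (t = 7.206, R-squared = 0.7998, F = 51.93) up to rounding.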

# Diagnostic plots: Residuals vs. Fitted: Check for linearity and constant variance.
plot(model, which = 1)

The residuals show mostly random scatter around zero, suggesting linearity is reasonable. There is slight curvature and increased spread at higher fitted values, likely due to a few outliers, but no severe violations are observed.

# Diagnostic plots: Normal Q-Q Plot- Check for normality of residuals.
plot(model, which = 2)  

The residuals fall mostly along the straight reference line in the normal Q-Q plot, suggesting an approximately normal distribution. However, two outlying points depart from the line and may raise concern about the normality assumption.

# Diagnostic plots: Scale-Location Plot- Check for constant variance.
plot(model, which = 3)  

The Scale-Location plot shows a small increase in residual spread at higher fitted values, suggesting a small departure from constant variance.

# Diagnostic plots: Cook's Distance Plot- Check for influential observations.
plot(model, which = 4)

The Cook's distance plot flags observations 4, 13, and 15 as the most influential points, meaning they have the largest impact on the estimated regression coefficients.
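The influential points can also be listed numerically instead of being read off the plot. The sketch below is a minimal illustration on R's built-in cars data, used as a stand-in for the baseball data. The 4/n cutoff is a common rule of thumb and an assumption here, not something taken from the original analysis.

```r
# Stand-in model on R's built-in `cars` data (not the baseball data)
fit <- lm(dist ~ speed, data = cars)

cooks  <- cooks.distance(fit)   # one Cook's distance per observation
cutoff <- 4 / nrow(cars)        # rule-of-thumb threshold (assumption)

# Observations whose Cook's distance exceeds the cutoff, largest first
flagged <- sort(cooks[cooks > cutoff], decreasing = TRUE)
flagged
```

Applied to the model above, cooks.distance(model) with a cutoff of 4 / nrow(data) would give a comparable list. Refitting without the flagged games is one way to see how much the slope depends on them.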