# Read in data
#The firstbasestats.csv file contains all the first base players’ stats.
#We can read this file in R using the read.csv() function.
firstbase = read.csv("firstbasestats.csv")
#read.csv() reads the file "firstbasestats.csv" and returns a data frame, which is assigned to the variable firstbase
str(firstbase)
'data.frame': 23 obs. of 15 variables:
$ Player : chr "Freddie Freeman" "Jose Abreu" "Nate Lowe" "Paul Goldschmidt" ...
$ Pos : chr "1B" "1B" "1B" "1B" ...
$ Team : chr "LAD" "CHW" "TEX" "STL" ...
$ GP : int 159 157 157 151 160 140 160 145 146 143 ...
$ AB : int 612 601 593 561 638 551 583 555 545 519 ...
$ H : int 199 183 179 178 175 152 141 139 132 124 ...
$ X2B : int 47 40 26 41 35 27 25 28 40 23 ...
$ HR : int 21 15 27 35 32 20 36 22 8 18 ...
$ RBI : int 100 75 76 115 97 84 94 85 53 63 ...
$ AVG : num 0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
$ OBP : num 0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
$ SLG : num 0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
$ OPS : num 0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
$ WAR : num 5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
$ Payroll.Salary2023: num 27000000 19500000 4050000 26000000 14500000 ...
#Use the str() function to get a concise summary of the structure of firstbase, including the number of observations (rows) and variables (columns), along with the data type and first few values of each variable.
summary(firstbase)
Player Pos Team GP AB H X2B HR RBI
Length:23 Length:23 Length:23 Min. : 5.0 Min. : 14.0 Min. : 3.0 Min. : 1.00 Min. : 0.00 Min. : 1.00
Class :character Class :character Class :character 1st Qu.:105.5 1st Qu.:309.0 1st Qu.: 74.5 1st Qu.:13.50 1st Qu.: 8.00 1st Qu.: 27.00
Mode :character Mode :character Mode :character Median :131.0 Median :465.0 Median :115.0 Median :23.00 Median :18.00 Median : 63.00
Mean :120.2 Mean :426.9 Mean :110.0 Mean :22.39 Mean :17.09 Mean : 59.43
3rd Qu.:152.0 3rd Qu.:558.0 3rd Qu.:146.5 3rd Qu.:28.00 3rd Qu.:24.50 3rd Qu.: 84.50
Max. :160.0 Max. :638.0 Max. :199.0 Max. :47.00 Max. :36.00 Max. :115.00
AVG OBP SLG OPS WAR Payroll.Salary2023
Min. :0.2020 Min. :0.2140 Min. :0.2860 Min. :0.5000 Min. :-1.470 Min. : 720000
1st Qu.:0.2180 1st Qu.:0.3030 1st Qu.:0.3505 1st Qu.:0.6445 1st Qu.: 0.190 1st Qu.: 739200
Median :0.2420 Median :0.3210 Median :0.4230 Median :0.7290 Median : 1.310 Median : 4050000
Mean :0.2499 Mean :0.3242 Mean :0.4106 Mean :0.7346 Mean : 1.788 Mean : 6972743
3rd Qu.:0.2750 3rd Qu.:0.3395 3rd Qu.:0.4690 3rd Qu.:0.8175 3rd Qu.: 3.140 3rd Qu.: 8150000
Max. :0.3250 Max. :0.4070 Max. :0.5780 Max. :0.9810 Max. : 7.860 Max. :27000000
#We used the function summary() to generate a summary of each variable (column) in firstbase. The output includes:
#For numeric variables (int or num):
#Minimum value (Min): The smallest value in the variable.
#1st Quartile (1st Qu): The value below which 25% of the data falls (Q1 or lower quartile).
#Median (Median or 50th percentile): The middle value in the sorted variable (Q2).
#Mean (Mean): The average value of the variable.
#3rd Quartile (3rd Qu): The value below which 75% of the data falls (Q3 or upper quartile).
#Maximum value (Max): The largest value in the variable.
#For character variables (chr):
#Length, Class, and Mode only. summary() does not count unique values for character columns; table() can be used for that.
#Example
#Length:23: Indicates there are 23 observations (rows) in firstbase.
#Class :character, Mode :character: Indicates Player is a character variable (names of players).
# Linear Regression (one variable)
#lm() function is used to fit linear models, and the summary() function provides a detailed summary of the fitted model.
#lm(Payroll.Salary2023 ~ RBI, data = firstbase): This line of code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable) and RBI is the independent variable (predictor variable). data = firstbase specifies that R should use the data stored in the data frame firstbase.
#model1 <- lm(...): Assigns the linear regression model object to the variable model1.
#summary(model1): The summary() function then provides a detailed summary of the fitted linear regression model model1. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.
model1 = lm(Payroll.Salary2023 ~ RBI, data=firstbase)
summary(model1)
Call:
lm(formula = Payroll.Salary2023 ~ RBI, data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-10250331 -5220790 -843455 2386848 13654950
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2363744 2866320 -0.825 0.41883
RBI 157088 42465 3.699 0.00133 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6516000 on 21 degrees of freedom
Multiple R-squared: 0.3945, Adjusted R-squared: 0.3657
F-statistic: 13.68 on 1 and 21 DF, p-value: 0.001331
#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.
#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t value at least as extreme as the one reported if the null hypothesis (that the coefficient is zero) were true.
#Model Fit:
#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6516000.
#Multiple R-squared: This is the coefficient of determination, which measures the proportion of variance in the dependent variable (Payroll.Salary2023) that is explained by the independent variable (RBI). In this case, Multiple R-squared is 0.3945, indicating that 39.45% of the variance in Payroll.Salary2023 is explained by RBI.
#Adjusted R-squared: This is the R-squared adjusted for the number of predictors in the model. It penalizes for adding predictors that do not improve the model's fit. In this case, Adjusted R-squared is 0.3657, which is slightly lower than Multiple R-squared.
#F-statistic: This statistic tests the overall significance of the model. It compares the fit of the intercept-only model (null model) with the fit of the current model. In this case, the F-statistic is 13.68 with 1 and 21 degrees of freedom, and the associated p-value (0.001331) indicates that the model as a whole is statistically significant.
# Sum of Squared Errors
#This expression retrieves the residuals from the linear regression model model1.
model1$residuals
1 2 3 4 5 6 7 8 9 10 11 12 13
13654950.2 10082148.6 -5524939.3 10298631.2 1626214.0 -6731642.8 -5902522.2 -10250330.7 -4711916.8 -532796.1 -6667082.5 -6696203.1 7582148.6
14 15 16 17 18 19 20 21 22 23
-4916640.9 -1898125.3 -336532.3 -995042.5 -1311618.3 -843454.5 8050721.3 1250336.9 1847040.4 2926656.0
#SSE (sum of squared errors) is the sum of the squared differences between the observed values of the dependent variable and the values predicted by the model.
#model1$residuals^2: This calculates the squared value of each residual in model1.
#sum(model1$residuals^2): This computes the sum of all squared residuals, resulting in the SSE
SSE = sum(model1$residuals^2)
SSE #Printing the values of SSE. In this case 8.914926e+14
[1] 8.914926e+14
# Linear Regression (two variables)
#fit a multiple linear regression model. In this case model2, which includes more than one predictor variable (AVG and RBI)
model2 = lm(Payroll.Salary2023 ~ AVG + RBI, data=firstbase)
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and AVG and RBI are the independent variables (predictor variables) taken from the firstbase dataset.
#model2 <- lm(...): Assigns the linear regression model object to the variable model2.
#summary(model2): The summary() function then provides a detailed summary of the fitted linear regression model model2. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.
summary(model2)
Call:
lm(formula = Payroll.Salary2023 ~ AVG + RBI, data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-9097952 -4621582 -33233 3016541 10260245
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -18083756 9479037 -1.908 0.0709 .
AVG 74374031 42934155 1.732 0.0986 .
RBI 108850 49212 2.212 0.0388 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6226000 on 20 degrees of freedom
Multiple R-squared: 0.4735, Adjusted R-squared: 0.4209
F-statistic: 8.994 on 2 and 20 DF, p-value: 0.001636
#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.
#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t value at least as extreme as the one reported if the null hypothesis (that the coefficient is zero) were true.
#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6226000.
#Statistical Significance:
#RBI is statistically significant (p-value = 0.0388).
#AVG shows marginal significance (p-value = 0.0986)
#Model Fit.
#Multiple R-squared: Approximately 47.35% of the variance in Payroll.Salary2023 is explained by AVG and RBI.
# Sum of Squared Errors
#SSE (sum of squared errors) is the sum of the squared differences between the observed values of the dependent variable and the values predicted by the model.
#model2$residuals^2: This calculates the squared value of each residual in model2.
#sum(model2$residuals^2): This computes the sum of all squared residuals, resulting in the SSE
SSE = sum(model2$residuals^2)
SSE #Printing the values of SSE. In this case 7.751841e+14
[1] 7.751841e+14
# Linear Regression (all variables)
#Fit a multiple linear regression model. In this case model3, which includes five predictor variables (HR, RBI, AVG, OBP and OPS)
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and HR, RBI, AVG, OBP and OPS are the independent variables (predictor variables) taken from the firstbase dataset.
#model3 <- lm(...): Assigns the linear regression model object to the variable model3.
#summary(model3): The summary() function then provides a detailed summary of the fitted linear regression model model3. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.
model3 = lm(Payroll.Salary2023 ~ HR + RBI + AVG + OBP+ OPS, data=firstbase)
summary(model3)
Call:
lm(formula = Payroll.Salary2023 ~ HR + RBI + AVG + OBP + OPS,
data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-9611440 -3338119 64016 4472451 9490309
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31107859 11738494 -2.650 0.0168 *
HR -341069 552069 -0.618 0.5449
RBI 115786 113932 1.016 0.3237
AVG -63824769 104544645 -0.611 0.5496
OBP 27054948 131210166 0.206 0.8391
OPS 60181012 95415131 0.631 0.5366
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6023000 on 17 degrees of freedom
Multiple R-squared: 0.5811, Adjusted R-squared: 0.4579
F-statistic: 4.717 on 5 and 17 DF, p-value: 0.006951
#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.
#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t value at least as extreme as the one reported if the null hypothesis (that the coefficient is zero) were true.
#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6023000.
#Model Fit.
#Multiple R-squared: Approximately 58.11% of the variance in Payroll.Salary2023 is explained by HR, RBI, AVG, OBP and OPS.
# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
[1] 6.167793e+14
# Remove HR
# Linear Regression (four variables)
#Fit a multiple linear regression model. In this case model4, which includes four predictor variables (RBI, AVG, OBP and OPS)
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and RBI, AVG, OBP and OPS are the independent variables (predictor variables) taken from the firstbase dataset.
#model4 <- lm(...): Assigns the linear regression model object to the variable model4.
#summary(model4): The summary() function then provides a detailed summary of the fitted linear regression model model4. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.
model4 = lm(Payroll.Salary2023 ~ RBI + AVG + OBP + OPS, data=firstbase)
summary(model4)
Call:
lm(formula = Payroll.Salary2023 ~ RBI + AVG + OBP + OPS, data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-9399551 -3573842 98921 3979339 9263512
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -29466887 11235931 -2.623 0.0173 *
RBI 71495 87015 0.822 0.4220
AVG -11035457 59192453 -0.186 0.8542
OBP 86360720 87899074 0.982 0.3389
OPS 9464546 47788458 0.198 0.8452
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5919000 on 18 degrees of freedom
Multiple R-squared: 0.5717, Adjusted R-squared: 0.4765
F-statistic: 6.007 on 4 and 18 DF, p-value: 0.00298
#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.
#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t value at least as extreme as the one reported if the null hypothesis (that the coefficient is zero) were true.
#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 5919000.
#Model Fit.
#Multiple R-squared: Approximately 57.17% of the variance in Payroll.Salary2023 is explained by RBI, AVG, OBP and OPS.
firstbase<-firstbase[,-(1:3)]
# Correlations
cor(firstbase$RBI, firstbase$Payroll.Salary2023)
[1] 0.6281239
cor(firstbase$AVG, firstbase$OBP)
[1] 0.8028894
cor(firstbase)
GP AB H X2B HR RBI AVG OBP SLG OPS WAR Payroll.Salary2023
GP 1.0000000 0.9779421 0.9056508 0.8446267 0.7432552 0.8813917 0.4430808 0.4841583 0.6875270 0.6504483 0.5645243 0.4614889
AB 0.9779421 1.0000000 0.9516701 0.8924632 0.7721339 0.9125839 0.5126292 0.5026125 0.7471949 0.6980141 0.6211558 0.5018820
H 0.9056508 0.9516701 1.0000000 0.9308318 0.7155225 0.9068893 0.7393167 0.6560021 0.8211406 0.8069779 0.7688712 0.6249911
X2B 0.8446267 0.8924632 0.9308318 1.0000000 0.5889699 0.8485911 0.6613085 0.5466537 0.7211259 0.6966830 0.6757470 0.6450730
HR 0.7432552 0.7721339 0.7155225 0.5889699 1.0000000 0.8929048 0.3444242 0.4603408 0.8681501 0.7638721 0.6897677 0.5317619
RBI 0.8813917 0.9125839 0.9068893 0.8485911 0.8929048 1.0000000 0.5658479 0.5704463 0.8824090 0.8156612 0.7885666 0.6281239
AVG 0.4430808 0.5126292 0.7393167 0.6613085 0.3444242 0.5658479 1.0000000 0.8028894 0.7254274 0.7989005 0.7855945 0.5871543
OBP 0.4841583 0.5026125 0.6560021 0.5466537 0.4603408 0.5704463 0.8028894 1.0000000 0.7617499 0.8987390 0.7766375 0.7025979
SLG 0.6875270 0.7471949 0.8211406 0.7211259 0.8681501 0.8824090 0.7254274 0.7617499 1.0000000 0.9686752 0.8611140 0.6974086
OPS 0.6504483 0.6980141 0.8069779 0.6966830 0.7638721 0.8156612 0.7989005 0.8987390 0.9686752 1.0000000 0.8799893 0.7394981
WAR 0.5645243 0.6211558 0.7688712 0.6757470 0.6897677 0.7885666 0.7855945 0.7766375 0.8611140 0.8799893 1.0000000 0.8086359
Payroll.Salary2023 0.4614889 0.5018820 0.6249911 0.6450730 0.5317619 0.6281239 0.5871543 0.7025979 0.6974086 0.7394981 0.8086359 1.0000000
#In this code we use the function cor() to compute the correlation matrix for numeric columns of the entire dataset, in this case firstbase.
#Removing AVG
model5 = lm(Payroll.Salary2023 ~ RBI + OBP+OPS, data=firstbase)
summary(model5)
Call:
lm(formula = Payroll.Salary2023 ~ RBI + OBP + OPS, data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-9465449 -3411234 259746 4102864 8876798
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -29737007 10855411 -2.739 0.013 *
RBI 72393 84646 0.855 0.403
OBP 82751360 83534224 0.991 0.334
OPS 7598051 45525575 0.167 0.869
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5767000 on 19 degrees of freedom
Multiple R-squared: 0.5709, Adjusted R-squared: 0.5031
F-statistic: 8.426 on 3 and 19 DF, p-value: 0.000913
model6 = lm(Payroll.Salary2023 ~ RBI + OBP, data=firstbase)
summary(model6)
Call:
lm(formula = Payroll.Salary2023 ~ RBI + OBP, data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-9045497 -3487008 139497 4084739 9190185
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -28984802 9632560 -3.009 0.00693 **
RBI 84278 44634 1.888 0.07360 .
OBP 95468873 33385182 2.860 0.00969 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5625000 on 20 degrees of freedom
Multiple R-squared: 0.5703, Adjusted R-squared: 0.5273
F-statistic: 13.27 on 2 and 20 DF, p-value: 0.0002149
# Read in test set
#The firstbasestats_test.csv file contains a test sample of only 2 of the first base players’ stats.
#We can read this file in R using the read.csv() function.
firstbaseTest = read.csv("firstbasestats_test.csv")
#read.csv() reads the file "firstbasestats_test.csv" and returns a data frame, which is assigned to the variable firstbaseTest
str(firstbaseTest)
'data.frame': 2 obs. of 15 variables:
$ Player : chr "Matt Olson" "Josh Bell"
$ Pos : chr "1B" "1B"
$ Team : chr "ATL" "SD"
$ GP : int 162 156
$ AB : int 616 552
$ H : int 148 147
$ X2B : int 44 29
$ HR : int 34 17
$ RBI : int 103 71
$ AVG : num 0.24 0.266
$ OBP : num 0.325 0.362
$ SLG : num 0.477 0.422
$ OPS : num 0.802 0.784
$ WAR : num 3.29 3.5
$ Payroll.Salary2023: num 21000000 16500000
#Use the str() function to get a concise summary of the structure of firstbaseTest, including the number of observations (rows) and variables (columns), along with the data type and first few values of each variable.
# Make test set predictions
predictTest = predict(model6, newdata=firstbaseTest)
predictTest
1 2
10723186 11558647
# Compute R-squared
SSE = sum((firstbaseTest$Payroll.Salary2023 - predictTest)^2)
SST = sum((firstbaseTest$Payroll.Salary2023 - mean(firstbase$Payroll.Salary2023))^2)
1 - SSE/SST
[1] 0.5477734
---
title: "Activity_6"
output: html_notebook
---


```{r}
# Read in data
#The firstbasestats.csv file contains all the first base players’ stats. 
#We can read this file in R using the read.csv() function.

firstbase = read.csv("firstbasestats.csv")
#read.csv() reads the file "firstbasestats.csv" and returns a data frame, which is assigned to the variable firstbase
str(firstbase)
#Use the str() function to get a concise summary of the structure of firstbase, including the number of observations (rows) and variables (columns), along with the data type and first few values of each variable.
```
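As a minimal sketch of the same read-then-inspect pattern, the chunk below parses a small inline CSV through the `text` argument of `read.csv()` instead of a file. The three rows and the names `csv_text` and `toy_stats` are made up for illustration and are not taken from firstbasestats.csv.

```{r}
# Hypothetical three-row CSV; the values are illustrative only
csv_text = "Player,HR,AVG
A,21,0.325
B,15,0.305
C,27,0.302"

# text= lets read.csv() parse a string instead of opening a file
toy_stats = read.csv(text = csv_text, stringsAsFactors = FALSE)

# str() lists the dimensions, then each column's type and first values
str(toy_stats)
```

The column types reported by `str()` (chr, int, num) are inferred by `read.csv()` from the text in each column.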
```{r}
summary(firstbase)
#We used the function summary() to generate a summary of each variable (column) in firstbase. The output includes:

#For numeric variables (int or num):

#Minimum value (Min): The smallest value in the variable.
#1st Quartile (1st Qu): The value below which 25% of the data falls (Q1 or lower quartile).
#Median (Median or 50th percentile): The middle value in the sorted variable (Q2).
#Mean (Mean): The average value of the variable.
#3rd Quartile (3rd Qu): The value below which 75% of the data falls (Q3 or upper quartile).
#Maximum value (Max): The largest value in the variable.

#For character variables (chr):

#Length, Class, and Mode only. summary() does not count unique values for character columns; table() can be used for that.

#Example 
#Length:23: Indicates there are 23 observations (rows) in firstbase.
#Class :character, Mode :character: Indicates Player is a character variable (names of players).

```
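To make the six numbers concrete, this sketch recomputes them by hand for a small made-up vector and prints them next to `summary()`. It also shows `table()`, which is one way to get per-value counts for a character vector, since `summary()` reports only length, class, and mode for that type. The names `x`, `pos`, and `by_hand` are illustrative.

```{r}
x = c(1, 2, 3, 4, 10)           # made-up numeric values
pos = c("1B", "1B", "DH")       # made-up character values

# The same six statistics summary() prints for a numeric column
by_hand = c(min(x), quantile(x, 0.25), median(x),
            mean(x), quantile(x, 0.75), max(x))
by_hand
summary(x)

# Counts of each distinct value in a character vector
table(pos)
```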
```{r}
# Linear Regression (one variable)
#lm() function is used to fit linear models, and the summary() function provides a detailed summary of the fitted model. 

#lm(Payroll.Salary2023 ~ RBI, data = firstbase): This line of code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable) and RBI is the independent variable (predictor variable). data = firstbase specifies that R should use the data stored in the data frame firstbase.

#model1 <- lm(...): Assigns the linear regression model object to the variable model1.

#summary(model1): The summary() function then provides a detailed summary of the fitted linear regression model model1. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.

model1 = lm(Payroll.Salary2023 ~ RBI, data=firstbase)
summary(model1)

#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t value at least as extreme as the one reported if the null hypothesis (that the coefficient is zero) were true.

#Model Fit:
#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6516000.
#Multiple R-squared: This is the coefficient of determination, which measures the proportion of variance in the dependent variable (Payroll.Salary2023) that is explained by the independent variable (RBI). In this case, Multiple R-squared is 0.3945, indicating that 39.45% of the variance in Payroll.Salary2023 is explained by RBI.
#Adjusted R-squared: This is the R-squared adjusted for the number of predictors in the model. It penalizes for adding predictors that do not improve the model's fit. In this case, Adjusted R-squared is 0.3657, which is slightly lower than Multiple R-squared.
#F-statistic: This statistic tests the overall significance of the model. It compares the fit of the intercept-only model (null model) with the fit of the current model. In this case, the F-statistic is 13.68 with 1 and 21 degrees of freedom, and the associated p-value (0.001331) indicates that the model as a whole is statistically significant.

```
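The relationships among the coefficient-table columns can be checked directly. The sketch below fits a one-predictor model on five made-up points (not the baseball data) and recomputes the t value and the two-sided p-value from the estimate and its standard error. The names `toy`, `toy_fit`, and `coef_tab` are illustrative.

```{r}
# Five made-up points, chosen so the arithmetic is easy to follow
toy = data.frame(x = 1:5, y = c(2, 4, 5, 4, 5))
toy_fit = lm(y ~ x, data = toy)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_tab = summary(toy_fit)$coefficients

# t value is the estimate divided by its standard error
t_by_hand = coef_tab[, "Estimate"] / coef_tab[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_by_hand = 2 * pt(-abs(t_by_hand), df = df.residual(toy_fit))

cbind(coef_tab, t_by_hand, p_by_hand)
```

The last two columns of the printed table should repeat the `t value` and `Pr(>|t|)` columns that `summary()` computed.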
```{r}
# Sum of Squared Errors
#This expression retrieves the residuals from the linear regression model model1.
model1$residuals
```

```{r}
#SSE (sum of squared errors) is the sum of the squared differences between the observed values of the dependent variable and the values predicted by the model.
#model1$residuals^2: This calculates the squared value of each residual in model1.
#sum(model1$residuals^2): This computes the sum of all squared residuals, resulting in the SSE
SSE = sum(model1$residuals^2)
SSE #Printing the values of SSE. In this case 8.914926e+14
```
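SSE also ties back to the residual standard error reported by `summary()`, which equals the square root of SSE divided by the residual degrees of freedom. For model1, sqrt(8.914926e+14 / 21) is about 6.516e+06, in line with the 6516000 quoted earlier. A minimal sketch of the same identity on made-up data (the names `toy`, `toy_fit`, and `toy_sse` are illustrative):

```{r}
toy = data.frame(x = 1:5, y = c(2, 4, 5, 4, 5))   # made-up points
toy_fit = lm(y ~ x, data = toy)

# Sum of squared residuals, computed two equivalent ways
toy_sse = sum(residuals(toy_fit)^2)
deviance(toy_fit)

# Residual standard error = sqrt(SSE / residual degrees of freedom)
sqrt(toy_sse / df.residual(toy_fit))
summary(toy_fit)$sigma
```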
```{r}
# Linear Regression (two variables)
#fit a multiple linear regression model. In this case model2, which includes more than one predictor variable (AVG and RBI)
model2 = lm(Payroll.Salary2023 ~ AVG + RBI, data=firstbase)
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and AVG and RBI are the independent variables (predictor variables) taken from the firstbase dataset.

#model2 <- lm(...): Assigns the linear regression model object to the variable model2.

#summary(model2): The summary() function then provides a detailed summary of the fitted linear regression model model2. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.

summary(model2)

#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t value at least as extreme as the one reported if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6226000.
#Statistical Significance:
#RBI is statistically significant (p-value = 0.0388).
#AVG shows marginal significance (p-value = 0.0986)

#Model Fit.
#Multiple R-squared: Approximately 47.35% of the variance in Payroll.Salary2023 is explained by AVG and RBI.

```
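Multiple R-squared cannot decrease when a predictor is added, so Adjusted R-squared is the fairer number for comparing model1 with model2. The sketch below recomputes it from the usual formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), using R's built-in `mtcars` data as a stand-in for the baseball file. The names `cars_fit`, `r2`, `n`, `p`, and `adj_by_hand` are illustrative.

```{r}
# Built-in mtcars data stands in for firstbase here
cars_fit = lm(mpg ~ wt + hp, data = mtcars)

r2 = summary(cars_fit)$r.squared
n = nrow(mtcars)                  # number of observations
p = length(coef(cars_fit)) - 1    # number of predictors, excluding the intercept

# Adjusted R-squared penalizes each additional predictor
adj_by_hand = 1 - (1 - r2) * (n - 1) / (n - p - 1)
c(by_hand = adj_by_hand, from_summary = summary(cars_fit)$adj.r.squared)
```

Plugging in model2's values (R-squared 0.4735, n = 23, p = 2) gives 1 - 0.5265 * 22/20, or about 0.4209, which is the Adjusted R-squared that `summary(model2)` reports.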
```{r}
# Sum of Squared Errors
#SSE (sum of squared errors) is the sum of the squared differences between the observed values of the dependent variable and the values predicted by the model.
#model2$residuals^2: This calculates the squared value of each residual in model2.
#sum(model2$residuals^2): This computes the sum of all squared residuals, resulting in the SSE
SSE = sum(model2$residuals^2)
SSE #Printing the values of SSE. In this case 7.751841e+14
```
```{r}
# Linear Regression (all variables)
#Fit a multiple linear regression model. In this case model3, which includes five predictor variables (HR, RBI, AVG, OBP and OPS)
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and HR, RBI, AVG, OBP and OPS are the independent variables (predictor variables) taken from the firstbase dataset.

#model3 <- lm(...): Assigns the linear regression model object to the variable model3.

#summary(model3): The summary() function then provides a detailed summary of the fitted linear regression model model3. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.

model3 = lm(Payroll.Salary2023 ~ HR + RBI + AVG + OBP+ OPS, data=firstbase)
summary(model3)

#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t value at least as extreme as the one reported if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6023000.

#Model Fit.
#Multiple R-squared: Approximately 58.11% of the variance in Payroll.Salary2023 is explained by HR, RBI, AVG, OBP and OPS.

```
```{r}
# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE #Printing the values of SSE. In this case 6.167793e+14
```
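The SSE values fall from 8.914926e+14 (model1) to 7.751841e+14 (model2) to 6.167793e+14 (model3). That pattern is expected: each model contains all the predictors of the one before it, and for nested least-squares models adding predictors cannot increase the in-sample SSE. A lower SSE alone therefore does not show that the extra variables are useful. A small sketch with the built-in `mtcars` data (a stand-in, not the baseball data; all names are illustrative):

```{r}
# Three nested models on built-in data, each adding a predictor to the last
m_small = lm(mpg ~ wt, data = mtcars)
m_medium = lm(mpg ~ wt + hp, data = mtcars)
m_large = lm(mpg ~ wt + hp + qsec, data = mtcars)
nested = list(small = m_small, medium = m_medium, large = m_large)

# In-sample SSE for each model; the sequence never goes up
sse_nested = sapply(nested, function(m) sum(residuals(m)^2))
sse_nested

# Adjusted R-squared can move either way, which makes it more informative here
sapply(nested, function(m) summary(m)$adj.r.squared)
```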
```{r}
# Remove HR
# Linear Regression (four variables)
#Fit a multiple linear regression model. In this case model4, which includes four predictor variables (RBI, AVG, OBP and OPS)
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and RBI, AVG, OBP and OPS are the independent variables (predictor variables) taken from the firstbase dataset.

#model4 <- lm(...): Assigns the linear regression model object to the variable model4.

#summary(model4): The summary() function then provides a detailed summary of the fitted linear regression model model4. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.
model4 = lm(Payroll.Salary2023 ~ RBI + AVG + OBP + OPS, data=firstbase)
summary(model4)

#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic, used to test whether each coefficient differs from zero. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least this extreme if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 5919000.

#Model Fit.
#Multiple R-squared: Approximately 57.17% of the variance in Payroll.Salary2023 is explained by RBI, AVG, OBP and OPS.
```
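The `t value` and `Pr(>|t|)` columns described above can be reproduced by hand from the coefficient table. The following is a minimal sketch on the built-in `mtcars` data (an assumption for illustration; `fit` is a hypothetical model, not `model4`).

```r
# Hypothetical model on built-in data, used only to illustrate the arithmetic
fit   <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- coef(summary(fit))  # columns: Estimate, Std. Error, t value, Pr(>|t|)

# t value = Estimate / Std. Error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = df.residual(fit))

cbind(t_manual, t_reported = coefs[, "t value"],
      p_manual, p_reported = coefs[, "Pr(>|t|)"])
```

The manual and reported columns should agree, which confirms that a small p-value corresponds to an estimate that is large relative to its standard error.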
```{r}
firstbase<-firstbase[,-(1:3)]
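#This code removes the first three columns (Player, Pos and Team) from the firstbase dataset, leaving only numeric columns. Run it only once: rerunning the chunk would drop three more columns.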
```
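Dropping columns by position works, but it silently removes more columns if the chunk is run twice. One alternative, sketched here as an option rather than as the method used in this analysis, is to keep columns by type. The toy data frame `toy` reuses the first three rows shown by `str(firstbase)`; the names `toy` and `numeric_only` are hypothetical.

```r
# Toy data frame built from the first three rows shown by str(firstbase)
toy <- data.frame(
  Player = c("Freddie Freeman", "Jose Abreu", "Nate Lowe"),
  Pos    = c("1B", "1B", "1B"),
  Team   = c("LAD", "CHW", "TEX"),
  HR     = c(21, 15, 27),
  RBI    = c(100, 75, 76)
)

# Keep only numeric columns; rerunning this line gives the same result,
# unlike toy[, -(1:3)], which drops three more columns each time
numeric_only <- toy[, sapply(toy, is.numeric), drop = FALSE]
names(numeric_only)
```

Selecting by type also keeps working if the column order of the CSV changes.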
```{r}
# Correlations
cor(firstbase$RBI, firstbase$Payroll.Salary2023)
#The above code uses the cor() function to find the correlation between two variables (RBI and Payroll.Salary2023). In this case the answer is 0.6281239.

#The positive value (0.6281239) indicates a moderate positive linear relationship between RBI (Runs Batted In) and Payroll.Salary2023. As RBI increases, Payroll.Salary2023 tends to increase as well, though not in a perfectly linear fashion.
```
```{r}
cor(firstbase$AVG, firstbase$OBP)

#The above code uses the cor() function to find the correlation between two numeric vectors (AVG and OBP). In this case the answer is 0.8028894.

#The correlation coefficient (0.8028894) indicates a strong positive linear relationship between AVG (Batting Average) and OBP (On-Base Percentage): players with higher batting averages also tend to have higher on-base percentages.

#Because the two predictors carry much of the same information, including both in one model can make their coefficient estimates unstable.
```
```{r}
cor(firstbase)
#Here cor() is applied to the whole data frame to compute the correlation matrix for every pair of columns. This works because the non-numeric columns (Player, Pos and Team) were removed earlier; cor() requires all columns to be numeric.
```
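A full correlation matrix is hard to scan by eye, so it can help to list only the strongly correlated pairs, such as AVG and OBP. The sketch below uses a few columns of the built-in `mtcars` data as a stand-in; the 0.8 cutoff and the names `cm`, `high` and `high_pairs` are assumptions chosen for illustration.

```r
# Correlation matrix for a handful of numeric columns (stand-in data)
cm <- cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])

# Look only above the diagonal so each pair appears once;
# 0.8 is an arbitrary illustrative threshold
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
high_pairs
```

Pairs of predictors that show up in such a list are candidates for dropping one of the two when simplifying a regression model.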
```{r}
#Removing AVG
model5 = lm(Payroll.Salary2023 ~ RBI + OBP + OPS, data=firstbase)
# Linear regression (reduced model without HR and AVG)
#Fit a multiple linear regression model, model5, with three predictor variables (RBI, OBP and OPS).
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and RBI, OBP and OPS are the independent variables (predictor variables) taken from the firstbase dataset.

#model5 = lm(...): Assigns the linear regression model object to the variable model5.

#summary(model5): The summary() function then provides a detailed summary of the fitted linear regression model model5. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.
summary(model5)

#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic, used to test whether each coefficient differs from zero. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least this extreme if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 5767000.

#Model Fit.
#Multiple R-squared: Approximately 57.09% of the variance in Payroll.Salary2023 is explained by RBI, OBP and OPS.

```
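When predictors are removed one at a time, as in the sequence of models here, multiple R-squared can only stay the same or fall, so it is not a fair way to compare models of different sizes. Two common checks are an F-test between nested models with `anova()` and the adjusted R-squared. The sketch below shows both on the built-in `mtcars` data (an assumption for illustration; `full` and `reduced` are hypothetical models, not the ones fitted above).

```r
# Two nested models on stand-in data: 'reduced' drops qsec from 'full'
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# F-test: does dropping qsec significantly worsen the fit?
cmp <- anova(reduced, full)
cmp

# Adjusted R-squared penalises extra predictors,
# so it is comparable across model sizes
c(full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

A large p-value in the `anova()` table suggests the dropped predictor added little, which supports keeping the simpler model.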
```{r}
model6 = lm(Payroll.Salary2023 ~ RBI + OBP, data=firstbase)
# Linear regression (reduced model without HR, AVG and OPS)
#Fit a multiple linear regression model, model6, with two predictor variables (RBI and OBP).
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and RBI and OBP are the independent variables (predictor variables) taken from the firstbase dataset.

#model6 = lm(...): Assigns the linear regression model object to the variable model6.

#summary(model6): The summary() function then provides a detailed summary of the fitted linear regression model model6. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.
summary(model6)
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic, used to test whether each coefficient differs from zero. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least this extreme if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 5625000.

#Model Fit.
#Multiple R-squared: Approximately 57.09% of the variance in Payroll.Salary2023 is explained by RBI and OBP.

```
```{r}
# Read in test set
#The firstbasestats_test.csv file contains a test sample of only 2 of the first base players’ stats. 
#We can read this file in R using the read.csv() function.
firstbaseTest = read.csv("firstbasestats_test.csv")
#Declare the variable firstbaseTest.
#Use the function read.csv() to read the file "firstbasestats_test.csv" and assign the contents to the variable firstbaseTest.
str(firstbaseTest)
#Use the str() function to provide a concise summary of the structure of firstbaseTest, including the number of observations (rows) and variables (columns), along with the data type and first few values of each variable.
```
```{r}
# Make test set predictions
#In this code we use the predict() function to generate predictions using model6 on new data (firstbaseTest). 
predictTest = predict(model6, newdata=firstbaseTest)
predictTest
#Typing the object name on its own line prints the predictions; no explicit print() call is needed.
```
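Point predictions say nothing about how uncertain they are. `predict()` can also return a prediction interval for each new observation through its `interval` argument. This is a minimal sketch on the built-in `mtcars` data; `fit` and the two rows in `new_cars` are hypothetical and are not taken from the test file.

```r
# Hypothetical model and new observations, for illustration only
fit      <- lm(mpg ~ wt + hp, data = mtcars)
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))

# interval = "prediction" adds lower and upper bounds for each new observation
pred <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)
pred  # columns: fit, lwr, upr
```

With a small training set the intervals tend to be wide, which is a useful reminder of how much individual predictions can miss.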
```{r}
# Compute R-squared
SSE = sum((firstbaseTest$Payroll.Salary2023 - predictTest)^2)
SST = sum((firstbaseTest$Payroll.Salary2023 - mean(firstbase$Payroll.Salary2023))^2)
1 - SSE/SST
#The code above computes two quantities, SSE and SST, and combines them into the out-of-sample R-squared, 1 - SSE/SST.
#SSE: Sum of Squared Errors, which measures the variation in the actual values that is not explained by the model.

#predictTest: This is the vector of predicted values obtained from your model (model6) applied to firstbaseTest.
#firstbaseTest$Payroll.Salary2023: These are the actual values of the response variable (Payroll.Salary2023) in your test dataset.
#sum((firstbaseTest$Payroll.Salary2023 - predictTest)^2): Calculates the sum of squared differences between the actual values (firstbaseTest$Payroll.Salary2023) and the predicted values (predictTest).

#SST: Total Sum of Squares, which measures the total variation in the actual values.

#mean(firstbase$Payroll.Salary2023): This computes the mean of the response variable (Payroll.Salary2023) from the training dataset (firstbase). The training mean serves as the baseline prediction.
#sum((firstbaseTest$Payroll.Salary2023 - mean(firstbase$Payroll.Salary2023))^2): Calculates the sum of squared differences between each actual test value (firstbaseTest$Payroll.Salary2023) and the training-set mean (mean(firstbase$Payroll.Salary2023)).

#In this case the result of 0.5477734 means that model6 reduces the squared prediction error on the test set by approximately 54.78% relative to always predicting the training mean. Because the test set contains only 2 players, this estimate is very imprecise.

```
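The out-of-sample R-squared calculation can be wrapped in a small function so that the baseline (the training mean) is explicit. The helper `oos_r2` below is hypothetical, not a built-in R function, and the train/test split of the built-in `mtcars` data is an assumption made only so the sketch runs on its own.

```r
# Hypothetical helper: out-of-sample R-squared with the training mean as baseline
oos_r2 <- function(actual, predicted, train_mean) {
  sse <- sum((actual - predicted)^2)   # error of the model's predictions
  sst <- sum((actual - train_mean)^2)  # error of always predicting the training mean
  1 - sse / sst
}

# Illustration on built-in data: hold out the last 8 rows of mtcars
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]
fit   <- lm(mpg ~ wt + hp, data = train)

oos_r2(test$mpg, predict(fit, newdata = test), mean(train$mpg))
```

Unlike the in-sample R-squared, this value can be negative, which happens when the model predicts the test set worse than the training mean does.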






