# Read in data
#The firstbasestats.csv file contains all the first base players’ stats. 
#We can read this file in R using the read.csv() function.

firstbase = read.csv("firstbasestats.csv")
#Declared the variable firstbase 
#Use the function read.csv() to read the file "firstbasestats.csv" and assign the contents to the variable firstbase
str(firstbase)
'data.frame':   23 obs. of  15 variables:
 $ Player            : chr  "Freddie Freeman" "Jose Abreu" "Nate Lowe" "Paul Goldschmidt" ...
 $ Pos               : chr  "1B" "1B" "1B" "1B" ...
 $ Team              : chr  "LAD" "CHW" "TEX" "STL" ...
 $ GP                : int  159 157 157 151 160 140 160 145 146 143 ...
 $ AB                : int  612 601 593 561 638 551 583 555 545 519 ...
 $ H                 : int  199 183 179 178 175 152 141 139 132 124 ...
 $ X2B               : int  47 40 26 41 35 27 25 28 40 23 ...
 $ HR                : int  21 15 27 35 32 20 36 22 8 18 ...
 $ RBI               : int  100 75 76 115 97 84 94 85 53 63 ...
 $ AVG               : num  0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
 $ OBP               : num  0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
 $ SLG               : num  0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
 $ OPS               : num  0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
 $ WAR               : num  5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
 $ Payroll.Salary2023: num  27000000 19500000 4050000 26000000 14500000 ...
#Use the str() function to provide a concise summary of the structure of firstbase, including the number of observations (rows) and variables (columns), along with the data type and first few values of each variable.
summary(firstbase)
    Player              Pos                Team                 GP              AB              H              X2B              HR             RBI        
 Length:23          Length:23          Length:23          Min.   :  5.0   Min.   : 14.0   Min.   :  3.0   Min.   : 1.00   Min.   : 0.00   Min.   :  1.00  
 Class :character   Class :character   Class :character   1st Qu.:105.5   1st Qu.:309.0   1st Qu.: 74.5   1st Qu.:13.50   1st Qu.: 8.00   1st Qu.: 27.00  
 Mode  :character   Mode  :character   Mode  :character   Median :131.0   Median :465.0   Median :115.0   Median :23.00   Median :18.00   Median : 63.00  
                                                          Mean   :120.2   Mean   :426.9   Mean   :110.0   Mean   :22.39   Mean   :17.09   Mean   : 59.43  
                                                          3rd Qu.:152.0   3rd Qu.:558.0   3rd Qu.:146.5   3rd Qu.:28.00   3rd Qu.:24.50   3rd Qu.: 84.50  
                                                          Max.   :160.0   Max.   :638.0   Max.   :199.0   Max.   :47.00   Max.   :36.00   Max.   :115.00  
      AVG              OBP              SLG              OPS              WAR         Payroll.Salary2023
 Min.   :0.2020   Min.   :0.2140   Min.   :0.2860   Min.   :0.5000   Min.   :-1.470   Min.   :  720000  
 1st Qu.:0.2180   1st Qu.:0.3030   1st Qu.:0.3505   1st Qu.:0.6445   1st Qu.: 0.190   1st Qu.:  739200  
 Median :0.2420   Median :0.3210   Median :0.4230   Median :0.7290   Median : 1.310   Median : 4050000  
 Mean   :0.2499   Mean   :0.3242   Mean   :0.4106   Mean   :0.7346   Mean   : 1.788   Mean   : 6972743  
 3rd Qu.:0.2750   3rd Qu.:0.3395   3rd Qu.:0.4690   3rd Qu.:0.8175   3rd Qu.: 3.140   3rd Qu.: 8150000  
 Max.   :0.3250   Max.   :0.4070   Max.   :0.5780   Max.   :0.9810   Max.   : 7.860   Max.   :27000000  
#We used the function summary() to generate a summary of each variable (column) in firstbase. The output includes:

#For numeric variables (int or num):

#Minimum value (Min): The smallest value in the variable.
#1st Quartile (1st Qu): The value below which 25% of the data falls (Q1 or lower quartile).
#Median (Median or 50th percentile): The middle value in the sorted variable (Q2).
#Mean (Mean): The average value of the variable.
#3rd Quartile (3rd Qu): The value below which 75% of the data falls (Q3 or upper quartile).
#Maximum value (Max): The largest value in the variable.

#For character variables (chr):

#Only Length, Class, and Mode are reported. Counts of each unique value are shown for factor columns, not character columns.

#Example 
#Length:23: Indicates there are 23 observations (rows) in firstbase.
#Class :character, Mode :character: Indicates Player is a character variable (names of players).
# Linear Regression (one variable)
#lm() function is used to fit linear models, and the summary() function provides a detailed summary of the fitted model. 

#lm(Payroll.Salary2023 ~ RBI, data = firstbase): This line of code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable) and RBI is the independent variable (predictor variable). data = firstbase specifies that R should use the data stored in the data frame firstbase.

#model1 <- lm(...): Assigns the linear regression model object to the variable model1.

#summary(model1): The summary() function then provides a detailed summary of the fitted linear regression model model1. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.

model1 = lm(Payroll.Salary2023 ~ RBI, data=firstbase)
summary(model1)

Call:
lm(formula = Payroll.Salary2023 ~ RBI, data = firstbase)

Residuals:
      Min        1Q    Median        3Q       Max 
-10250331  -5220790   -843455   2386848  13654950 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -2363744    2866320  -0.825  0.41883   
RBI           157088      42465   3.699  0.00133 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6516000 on 21 degrees of freedom
Multiple R-squared:  0.3945,    Adjusted R-squared:  0.3657 
F-statistic: 13.68 on 1 and 21 DF,  p-value: 0.001331
#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least as extreme as the one computed if the null hypothesis (that the coefficient is zero) were true.

#Model Fit:
#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6516000.
#Multiple R-squared: This is the coefficient of determination, which measures the proportion of variance in the dependent variable (Payroll.Salary2023) that is explained by the independent variable (RBI). In this case, Multiple R-squared is 0.3945, indicating that 39.45% of the variance in Payroll.Salary2023 is explained by RBI.
#Adjusted R-squared: This is the R-squared adjusted for the number of predictors in the model. It penalizes for adding predictors that do not improve the model's fit. In this case, Adjusted R-squared is 0.3657, which is slightly lower than Multiple R-squared.
#F-statistic: This statistic tests the overall significance of the model. It compares the fit of the intercept-only model (null model) with the fit of the current model. In this case, the F-statistic is 13.68 with 1 and 21 degrees of freedom, and the associated p-value (0.001331) indicates that the model as a whole is statistically significant.
# Sum of Squared Errors
#This expression retrieves the residuals from the linear regression model model1.
model1$residuals
          1           2           3           4           5           6           7           8           9          10          11          12          13 
 13654950.2  10082148.6  -5524939.3  10298631.2   1626214.0  -6731642.8  -5902522.2 -10250330.7  -4711916.8   -532796.1  -6667082.5  -6696203.1   7582148.6 
         14          15          16          17          18          19          20          21          22          23 
 -4916640.9  -1898125.3   -336532.3   -995042.5  -1311618.3   -843454.5   8050721.3   1250336.9   1847040.4   2926656.0 
#SSE represents the sum of the squared differences between the observed values of the dependent variable and the values predicted by the model
#model1$residuals^2: This calculates the squared value of each residual in model1.
#sum(model1$residuals^2): This computes the sum of all squared residuals, resulting in the SSE
SSE = sum(model1$residuals^2)
SSE #Printing the values of SSE. In this case 8.914926e+14
[1] 8.914926e+14
# Linear Regression (two variables)
#fit a multiple linear regression model. In this case model2, which includes more than one predictor variable (AVG and RBI)
model2 = lm(Payroll.Salary2023 ~ AVG + RBI, data=firstbase)
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and AVG and RBI are the independent variables (predictor variables) taken from the firstbase dataset.

#model2 <- lm(...): Assigns the linear regression model object to the variable model2.

#summary(model2): The summary() function then provides a detailed summary of the fitted linear regression model model2. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.

summary(model2)

Call:
lm(formula = Payroll.Salary2023 ~ AVG + RBI, data = firstbase)

Residuals:
     Min       1Q   Median       3Q      Max 
-9097952 -4621582   -33233  3016541 10260245 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -18083756    9479037  -1.908   0.0709 .
AVG          74374031   42934155   1.732   0.0986 .
RBI            108850      49212   2.212   0.0388 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6226000 on 20 degrees of freedom
Multiple R-squared:  0.4735,    Adjusted R-squared:  0.4209 
F-statistic: 8.994 on 2 and 20 DF,  p-value: 0.001636
#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least as extreme as the one computed if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6226000.
#Statistical Significance:
#RBI is statistically significant (p-value = 0.0388).
#AVG shows marginal significance (p-value = 0.0986)

#Model Fit.
#Multiple R-squared: Approximately 47.35% of the variance in Payroll.Salary2023 is explained by AVG and RBI.
# Sum of Squared Errors
#SSE represents the sum of the squared differences between the observed values of the dependent variable and the values predicted by the model
#model2$residuals^2: This calculates the squared value of each residual in model2.
#sum(model2$residuals^2): This computes the sum of all squared residuals, resulting in the SSE
SSE = sum(model2$residuals^2)
SSE #Printing the values of SSE. In this case 7.751841e+14
[1] 7.751841e+14
# Linear Regression (all variables)
#fit a multiple linear regression model. In this case model3, which includes more than one predictor variable (HR,RBI,AVG,OBP and OPS)
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and HR,RBI,AVG,OBP and OPS are the independent variables (predictor variables) taken from the firstbase dataset.

#model3 <- lm(...): Assigns the linear regression model object to the variable model3.

#summary(model3): The summary() function then provides a detailed summary of the fitted linear regression model model3. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.

model3 = lm(Payroll.Salary2023 ~ HR + RBI + AVG + OBP+ OPS, data=firstbase)
summary(model3)

Call:
lm(formula = Payroll.Salary2023 ~ HR + RBI + AVG + OBP + OPS, 
    data = firstbase)

Residuals:
     Min       1Q   Median       3Q      Max 
-9611440 -3338119    64016  4472451  9490309 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -31107859   11738494  -2.650   0.0168 *
HR            -341069     552069  -0.618   0.5449  
RBI            115786     113932   1.016   0.3237  
AVG         -63824769  104544645  -0.611   0.5496  
OBP          27054948  131210166   0.206   0.8391  
OPS          60181012   95415131   0.631   0.5366  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6023000 on 17 degrees of freedom
Multiple R-squared:  0.5811,    Adjusted R-squared:  0.4579 
F-statistic: 4.717 on 5 and 17 DF,  p-value: 0.006951
#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least as extreme as the one computed if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6023000.

#Model Fit.
#Multiple R-squared: Approximately 58.11% of the variance in Payroll.Salary2023 is explained by HR,RBI,AVG,OBP and OPS.
# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
[1] 6.167793e+14
# Remove HR
# Linear Regression (all variables except HR)
#fit a multiple linear regression model. In this case model4, which includes more than one predictor variable (RBI,AVG,OBP and OPS)
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and RBI,AVG,OBP and OPS are the independent variables (predictor variables) taken from the firstbase dataset.

#model4 <- lm(...): Assigns the linear regression model object to the variable model4.

#summary(model4): The summary() function then provides a detailed summary of the fitted linear regression model model4. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.
model4 = lm(Payroll.Salary2023 ~ RBI + AVG + OBP + OPS, data=firstbase)
summary(model4)

Call:
lm(formula = Payroll.Salary2023 ~ RBI + AVG + OBP + OPS, data = firstbase)

Residuals:
     Min       1Q   Median       3Q      Max 
-9399551 -3573842    98921  3979339  9263512 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -29466887   11235931  -2.623   0.0173 *
RBI             71495      87015   0.822   0.4220  
AVG         -11035457   59192453  -0.186   0.8542  
OBP          86360720   87899074   0.982   0.3389  
OPS           9464546   47788458   0.198   0.8452  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5919000 on 18 degrees of freedom
Multiple R-squared:  0.5717,    Adjusted R-squared:  0.4765 
F-statistic: 6.007 on 4 and 18 DF,  p-value: 0.00298
#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least as extreme as the one computed if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 5919000.

#Model Fit.
#Multiple R-squared: Approximately 57.17% of the variance in Payroll.Salary2023 is explained by RBI,AVG,OBP and OPS.
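#Drop the first three columns (Player, Pos, Team), which are character variables, so that cor() can be applied to the remaining numeric columns
firstbase<-firstbase[,-(1:3)]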
# Correlations
cor(firstbase$RBI, firstbase$Payroll.Salary2023)
[1] 0.6281239
cor(firstbase$AVG, firstbase$OBP)
[1] 0.8028894
cor(firstbase)
                          GP        AB         H       X2B        HR       RBI       AVG       OBP       SLG       OPS       WAR Payroll.Salary2023
GP                 1.0000000 0.9779421 0.9056508 0.8446267 0.7432552 0.8813917 0.4430808 0.4841583 0.6875270 0.6504483 0.5645243          0.4614889
AB                 0.9779421 1.0000000 0.9516701 0.8924632 0.7721339 0.9125839 0.5126292 0.5026125 0.7471949 0.6980141 0.6211558          0.5018820
H                  0.9056508 0.9516701 1.0000000 0.9308318 0.7155225 0.9068893 0.7393167 0.6560021 0.8211406 0.8069779 0.7688712          0.6249911
X2B                0.8446267 0.8924632 0.9308318 1.0000000 0.5889699 0.8485911 0.6613085 0.5466537 0.7211259 0.6966830 0.6757470          0.6450730
HR                 0.7432552 0.7721339 0.7155225 0.5889699 1.0000000 0.8929048 0.3444242 0.4603408 0.8681501 0.7638721 0.6897677          0.5317619
RBI                0.8813917 0.9125839 0.9068893 0.8485911 0.8929048 1.0000000 0.5658479 0.5704463 0.8824090 0.8156612 0.7885666          0.6281239
AVG                0.4430808 0.5126292 0.7393167 0.6613085 0.3444242 0.5658479 1.0000000 0.8028894 0.7254274 0.7989005 0.7855945          0.5871543
OBP                0.4841583 0.5026125 0.6560021 0.5466537 0.4603408 0.5704463 0.8028894 1.0000000 0.7617499 0.8987390 0.7766375          0.7025979
SLG                0.6875270 0.7471949 0.8211406 0.7211259 0.8681501 0.8824090 0.7254274 0.7617499 1.0000000 0.9686752 0.8611140          0.6974086
OPS                0.6504483 0.6980141 0.8069779 0.6966830 0.7638721 0.8156612 0.7989005 0.8987390 0.9686752 1.0000000 0.8799893          0.7394981
WAR                0.5645243 0.6211558 0.7688712 0.6757470 0.6897677 0.7885666 0.7855945 0.7766375 0.8611140 0.8799893 1.0000000          0.8086359
Payroll.Salary2023 0.4614889 0.5018820 0.6249911 0.6450730 0.5317619 0.6281239 0.5871543 0.7025979 0.6974086 0.7394981 0.8086359          1.0000000
#In this code we use the function cor() to compute the correlation matrix for numeric columns of the entire dataset, in this case firstbase.
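A correlation matrix of this size is hard to scan by eye. The sketch below shows one way to list the strongly correlated pairs programmatically. It uses made-up toy data, and the 0.8 cutoff is an illustrative assumption rather than a rule taken from this analysis.

```r
# Toy data with one deliberately redundant pair (x1 and x2).
set.seed(1)
x1 <- rnorm(50)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(50, sd = 0.1),  # nearly a copy of x1
  x3 = rnorm(50)
)

cm <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above a chosen cutoff.
cutoff <- 0.8  # illustrative threshold, an assumption
high <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
high_pairs
```

Applied to the matrix above, the same approach would flag pairs such as AVG and OBP (0.80) or SLG and OPS (0.97), which is the motivation for dropping predictors one at a time.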
#Removing AVG
model5 = lm(Payroll.Salary2023 ~ RBI + OBP+OPS, data=firstbase)
summary(model5)

Call:
lm(formula = Payroll.Salary2023 ~ RBI + OBP + OPS, data = firstbase)

Residuals:
     Min       1Q   Median       3Q      Max 
-9465449 -3411234   259746  4102864  8876798 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -29737007   10855411  -2.739    0.013 *
RBI             72393      84646   0.855    0.403  
OBP          82751360   83534224   0.991    0.334  
OPS           7598051   45525575   0.167    0.869  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5767000 on 19 degrees of freedom
Multiple R-squared:  0.5709,    Adjusted R-squared:  0.5031 
F-statistic: 8.426 on 3 and 19 DF,  p-value: 0.000913
#Removing OPS
model6 = lm(Payroll.Salary2023 ~ RBI + OBP, data=firstbase)
summary(model6)

Call:
lm(formula = Payroll.Salary2023 ~ RBI + OBP, data = firstbase)

Residuals:
     Min       1Q   Median       3Q      Max 
-9045497 -3487008   139497  4084739  9190185 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) -28984802    9632560  -3.009  0.00693 **
RBI             84278      44634   1.888  0.07360 . 
OBP          95468873   33385182   2.860  0.00969 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5625000 on 20 degrees of freedom
Multiple R-squared:  0.5703,    Adjusted R-squared:  0.5273 
F-statistic: 13.27 on 2 and 20 DF,  p-value: 0.0002149
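The models above are refit by retyping the formula with one predictor removed each time. As a hedged alternative, `update()` can drop a term from an existing fit. The sketch uses toy data with hypothetical variable names (`x1`, `x2`, `x3`, `y`), not the first-base data.

```r
# Toy data: y depends on x1 only; x2 is nearly a copy of x1, x3 is noise.
set.seed(42)
n  <- 40
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.05),
  x3 = rnorm(n)
)
toy$y <- 2 * toy$x1 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = toy)

# update() refits the model with a term removed from the formula.
reduced <- update(full, . ~ . - x2)

# Compare adjusted R-squared before and after dropping the predictor.
c(full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

Comparing adjusted R-squared rather than multiple R-squared matters here, because multiple R-squared cannot increase when a predictor is removed.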
# Read in test set
#The firstbasestats_test.csv file contains a test sample of only 2 of the first base players’ stats. 
#We can read this file in R using the read.csv() function.
firstbaseTest = read.csv("firstbasestats_test.csv")
#Declared the variable firstbaseTest 
#Use the function read.csv() to read the file "firstbasestats_test.csv" and assign the contents to the variable firstbaseTest
str(firstbaseTest)
'data.frame':   2 obs. of  15 variables:
 $ Player            : chr  "Matt Olson" "Josh Bell"
 $ Pos               : chr  "1B" "1B"
 $ Team              : chr  "ATL" "SD"
 $ GP                : int  162 156
 $ AB                : int  616 552
 $ H                 : int  148 147
 $ X2B               : int  44 29
 $ HR                : int  34 17
 $ RBI               : int  103 71
 $ AVG               : num  0.24 0.266
 $ OBP               : num  0.325 0.362
 $ SLG               : num  0.477 0.422
 $ OPS               : num  0.802 0.784
 $ WAR               : num  3.29 3.5
 $ Payroll.Salary2023: num  21000000 16500000
#Use the str() function to provide a concise summary of the structure of firstbaseTest, including the number of observations (rows) and variables (columns), along with the data type and values of each variable.
# Make test set predictions
predictTest = predict(model6, newdata=firstbaseTest)
predictTest
       1        2 
10723186 11558647 
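To make the `predict()` call less of a black box, the predictions can be approximated by hand from the printed model6 coefficients. This is a sketch using the rounded coefficients from the summary output, so small differences from `predictTest` are expected.

```r
# Coefficients as printed (rounded) in the model6 summary.
intercept <- -28984802
b_rbi     <- 84278
b_obp     <- 95468873

# RBI and OBP for the two test players (Matt Olson, Josh Bell).
rbi <- c(103, 71)
obp <- c(0.325, 0.362)

# Prediction = intercept + b_rbi * RBI + b_obp * OBP
manual_pred <- intercept + b_rbi * rbi + b_obp * obp
manual_pred
```

The results land within a few tens of dollars of the values returned by `predict()`. The gap comes from rounding in the printed coefficients.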
# Compute R-squared
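#SSE: squared prediction errors of model6 on the test set
SSE = sum((firstbaseTest$Payroll.Salary2023 - predictTest)^2)
#SST: squared errors of a baseline that always predicts the mean salary of the training set (firstbase)
SST = sum((firstbaseTest$Payroll.Salary2023 - mean(firstbase$Payroll.Salary2023))^2)
#Out-of-sample R-squared
1 - SSE/SST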
[1] 0.5477734
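The out-of-sample R-squared can be reproduced from numbers already printed in this notebook. The sketch below assumes the rounded training mean shown by `summary(firstbase)` (6972743) is close enough to the exact mean.

```r
actual    <- c(21000000, 16500000)   # test-set salaries
predicted <- c(10723186, 11558647)   # predictTest, as printed
baseline  <- 6972743                 # mean training salary (rounded)

sse <- sum((actual - predicted)^2)   # error of the model on the test set
sst <- sum((actual - baseline)^2)    # error of always predicting the training mean
r2_test <- 1 - sse / sst
r2_test
```

Because the baseline is the training mean rather than the test mean, this quantity can be negative when the model predicts worse than the baseline. With only two test observations the value is also a very noisy estimate.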
---
title: "Activity_6"
output: html_notebook
---


```{r}
# Read in data
#The firstbasestats.csv file contains all the first base players’ stats. 
#We can read this file in R using the read.csv() function.

firstbase = read.csv("firstbasestats.csv")
#Declared the variable firstbase 
#Use the function read.csv() to read the file "firstbasestats.csv" and assign the contents to the variable firstbase
str(firstbase)
#Use the str() function to provide a concise summary of the structure of firstbase, including the number of observations (rows) and variables (columns), along with the data type and first few values of each variable.
```
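The `str()` output lists columns named `X2B` and `Payroll.Salary2023`. A likely explanation is that `read.csv()` converts headers into syntactically valid R names, since `check.names = TRUE` is its default. The raw file is not shown here, so the header names `2B` and `Payroll Salary2023` in this sketch are an assumption.

```r
# Hypothetical CSV text; the header names are an assumption about the raw file.
csv_text <- "Player,2B,Payroll Salary2023
Player A,47,27000000
Player B,40,19500000"

# Default behaviour: headers are made syntactically valid R names.
cleaned <- read.csv(text = csv_text)
names(cleaned)

# check.names = FALSE keeps the headers exactly as written in the file.
raw <- read.csv(text = csv_text, check.names = FALSE)
names(raw)
```

Keeping the default is usually more convenient for formulas such as `Payroll.Salary2023 ~ RBI`, because names with spaces would need backticks.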
```{r}
summary(firstbase)
#We used the function summary() to generate a summary of each variable (column) in firstbase. The output includes:

#For numeric variables (int or num):

#Minimum value (Min): The smallest value in the variable.
#1st Quartile (1st Qu): The value below which 25% of the data falls (Q1 or lower quartile).
#Median (Median or 50th percentile): The middle value in the sorted variable (Q2).
#Mean (Mean): The average value of the variable.
#3rd Quartile (3rd Qu): The value below which 75% of the data falls (Q3 or upper quartile).
#Maximum value (Max): The largest value in the variable.

#For character variables (chr):

#Only Length, Class, and Mode are reported. Counts of each unique value are shown for factor columns, not character columns.

#Example 
#Length:23: Indicates there are 23 observations (rows) in firstbase.
#Class :character, Mode :character: Indicates Player is a character variable (names of players).

```
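How `summary()` behaves depends on the column type. The sketch below uses small made-up vectors (not the first-base data) to contrast numeric, character, and factor input.

```r
# Toy data (not from firstbasestats.csv).
hits <- c(1, 2, 3, 4, 10)
positions <- c("1B", "1B", "DH", "1B", "DH")

# Numeric vector: minimum, quartiles, median, mean, and maximum.
num_summary <- summary(hits)
num_summary

# Character vector: a brief description rather than per-value counts.
summary(positions)

# Factor: counts of each level.
pos_counts <- summary(factor(positions))
pos_counts
```

Converting a character column with `factor()` before calling `summary()` is one way to get per-category counts.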
```{r}
# Linear Regression (one variable)
#lm() function is used to fit linear models, and the summary() function provides a detailed summary of the fitted model. 

#lm(Payroll.Salary2023 ~ RBI, data = firstbase): This line of code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable) and RBI is the independent variable (predictor variable). data = firstbase specifies that R should use the data stored in the data frame firstbase.

#model1 <- lm(...): Assigns the linear regression model object to the variable model1.

#summary(model1): The summary() function then provides a detailed summary of the fitted linear regression model model1. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.

model1 = lm(Payroll.Salary2023 ~ RBI, data=firstbase)
summary(model1)

#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least as extreme as the one computed if the null hypothesis (that the coefficient is zero) were true.

#Model Fit:
#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6516000.
#Multiple R-squared: This is the coefficient of determination, which measures the proportion of variance in the dependent variable (Payroll.Salary2023) that is explained by the independent variable (RBI). In this case, Multiple R-squared is 0.3945, indicating that 39.45% of the variance in Payroll.Salary2023 is explained by RBI.
#Adjusted R-squared: This is the R-squared adjusted for the number of predictors in the model. It penalizes for adding predictors that do not improve the model's fit. In this case, Adjusted R-squared is 0.3657, which is slightly lower than Multiple R-squared.
#F-statistic: This statistic tests the overall significance of the model. It compares the fit of the intercept-only model (null model) with the fit of the current model. In this case, the F-statistic is 13.68 with 1 and 21 degrees of freedom, and the associated p-value (0.001331) indicates that the model as a whole is statistically significant.

```
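The quantities in the regression summary can be recomputed by hand, which helps confirm what each one means. This is a minimal sketch on toy data with hypothetical column names (`rbi`, `salary`), not the first-base data set.

```r
# Toy data (hypothetical values, salary in millions).
toy <- data.frame(
  rbi    = c(20, 35, 50, 65, 80, 95),
  salary = c(1.0, 2.5, 2.0, 4.5, 4.0, 6.5)
)

fit <- lm(salary ~ rbi, data = toy)
coef_table <- summary(fit)$coefficients

# t value = Estimate / Std. Error
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# R-squared = 1 - SSE / SST
sse <- sum(residuals(fit)^2)
sst <- sum((toy$salary - mean(toy$salary))^2)
r2_manual <- 1 - sse / sst

# With a single predictor, the F-statistic equals the squared t value of the slope.
f_manual <- unname(t_manual["rbi"])^2

all.equal(unname(t_manual), unname(coef_table[, "t value"]))
all.equal(r2_manual, summary(fit)$r.squared)
all.equal(f_manual, summary(fit)$fstatistic[["value"]])
```

The same relationship holds in the model1 output: the RBI t value of 3.699 squared is about 13.68, the reported F-statistic.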
```{r}
# Sum of Squared Errors
#This expression retrieves the residuals from the linear regression model model1.
model1$residuals
```

```{r}
#SSE represents the sum of the squared differences between the observed values of the dependent variable and the values predicted by the model
#model1$residuals^2: This calculates the squared value of each residual in model1.
#sum(model1$residuals^2): This computes the sum of all squared residuals, resulting in the SSE
SSE = sum(model1$residuals^2)
SSE #Printing the values of SSE. In this case 8.914926e+14
```
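SSE and the residual standard error are directly related: the residual standard error is the square root of SSE divided by the residual degrees of freedom. The sketch below checks this with the model1 values quoted in the comments of this notebook.

```r
# Values copied from the model1 comments in this notebook.
sse_model1 <- 8.914926e+14
n_obs      <- 23   # rows in firstbase
n_coef     <- 2    # intercept + RBI

# Residual standard error = sqrt(SSE / residual degrees of freedom)
rse_model1 <- sqrt(sse_model1 / (n_obs - n_coef))
rse_model1   # close to the reported 6516000
```

With a fitted model object available, `sigma(model1)` and `df.residual(model1)` should give the same quantities without retyping numbers.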
```{r}
# Linear Regression (two variables)
#fit a multiple linear regression model. In this case model2, which includes more than one predictor variable (AVG and RBI)
model2 = lm(Payroll.Salary2023 ~ AVG + RBI, data=firstbase)
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and AVG and RBI are the independent variables (predictor variables) taken from the firstbase dataset.

#model2 <- lm(...): Assigns the linear regression model object to the variable model2.

#summary(model2): The summary() function then provides a detailed summary of the fitted linear regression model model2. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.

summary(model2)

#Residuals:
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of the coefficient estimate.
#t value: This is the t-statistic value, which measures the significance of each coefficient estimate. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least as extreme as the one computed if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6226000.
#Statistical Significance:
#RBI is statistically significant (p-value = 0.0388).
#AVG shows marginal significance (p-value = 0.0986)

#Model Fit.
#Multiple R-squared: Approximately 47.35% of the variance in Payroll.Salary2023 is explained by AVG and RBI.

```
```{r}
# Sum of Squared Errors
#SSE represents the sum of the squared differences between the observed values of the dependent variable and the values predicted by the model
#model2$residuals^2: This calculates the squared value of each residual in model2.
#sum(model2$residuals^2): This computes the sum of all squared residuals, resulting in the SSE
SSE = sum(model2$residuals^2)
SSE #Printing the values of SSE. In this case 7.751841e+14
```
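The drop in SSE from model1 to model2 can be turned into a formal test of whether adding AVG helps. The sketch below computes a partial F-test from the two SSE values quoted in this notebook. It is an illustration of the arithmetic, not a step from the original analysis.

```r
# SSE values copied from the comments in this notebook.
sse_model1 <- 8.914926e+14   # Payroll.Salary2023 ~ RBI
sse_model2 <- 7.751841e+14   # Payroll.Salary2023 ~ AVG + RBI

n_obs       <- 23
df_model2   <- n_obs - 3     # residual df: 23 rows minus 3 coefficients
extra_terms <- 1             # model2 adds one predictor (AVG)

# Partial F statistic for the added predictor
f_stat <- ((sse_model1 - sse_model2) / extra_terms) / (sse_model2 / df_model2)
p_val  <- pf(f_stat, extra_terms, df_model2, lower.tail = FALSE)

f_stat
p_val
```

The resulting p-value is in line with the AVG p-value of 0.0986 quoted above, as expected when a single predictor is added. `anova(model1, model2)` should report the same test when both fitted models are available.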
```{r}
# Linear Regression (all variables)
#fit a multiple linear regression model. In this case model3, which includes more than one predictor variable (HR,RBI,AVG,OBP and OPS)
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and HR,RBI,AVG,OBP and OPS are the independent variables (predictor variables) taken from the firstbase dataset.

#model3 <- lm(...): Assigns the linear regression model object to the variable model3.

#summary(model3): The summary() function then provides a detailed summary of the fitted linear regression model model3. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.

model3 = lm(Payroll.Salary2023 ~ HR + RBI + AVG + OBP+ OPS, data=firstbase)
summary(model3)

#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of each coefficient estimate.
#t value: This is the t-statistic, used to test whether each coefficient differs from zero. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least as extreme as the one computed if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 6023000.

#Model Fit.
#Multiple R-squared: Approximately 58.11% of the variance in Payroll.Salary2023 is explained by HR, RBI, AVG, OBP and OPS.

```
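The `t value` and `Pr(>|t|)` columns described above can be reproduced by hand from the coefficient table. The following is a minimal sketch on the built-in `mtcars` data (a stand-in; `toy_fit` and its predictors are assumptions chosen for illustration, not the salary model).

```{r}
# Stand-in model on a built-in dataset (assumption: not the firstbase data)
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
toy_coefs <- coef(summary(toy_fit))

# t value = Estimate / Std. Error
toy_t <- toy_coefs[, "Estimate"] / toy_coefs[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
toy_p <- 2 * pt(abs(toy_t), df = df.residual(toy_fit), lower.tail = FALSE)

# Both should match the columns printed by summary()
all.equal(toy_t, toy_coefs[, "t value"])
all.equal(toy_p, toy_coefs[, "Pr(>|t|)"])
```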
```{r}
# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE #Printing the values of SSE. In this case 6.167793e+14
```
```{r}
# Remove HR
# Linear Regression (RBI, AVG, OBP and OPS)
#Fit a multiple linear regression model, model4, which includes more than one predictor variable (RBI, AVG, OBP and OPS).
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and RBI, AVG, OBP and OPS are the independent variables (predictor variables) taken from the firstbase dataset.

#model4 <- lm(...): Assigns the linear regression model object to the variable model4.

#summary(model4): The summary() function then provides a detailed summary of the fitted linear regression model model4. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.
model4 = lm(Payroll.Salary2023 ~ RBI + AVG + OBP + OPS, data=firstbase)
summary(model4)

#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of each coefficient estimate.
#t value: This is the t-statistic, used to test whether each coefficient differs from zero. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least as extreme as the one computed if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 5919000.

#Model Fit.
#Multiple R-squared: Approximately 57.17% of the variance in Payroll.Salary2023 is explained by RBI, AVG, OBP and OPS. This is only slightly below the 58.11% of model3, so removing HR costs little explanatory power.
```
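When a predictor is dropped, as HR is above, multiple R-squared can only stay the same or fall, so it helps to look at adjusted R-squared and a partial F-test as well. This is a hedged sketch on the built-in `mtcars` data; the models `toy_full` and `toy_reduced` are assumptions for illustration and are not the salary models.

```{r}
# Stand-in nested models on a built-in dataset (assumption: not the firstbase data)
toy_full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
toy_reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Multiple R-squared never increases when a predictor is removed
c(full = summary(toy_full)$r.squared, reduced = summary(toy_reduced)$r.squared)

# Adjusted R-squared penalises extra predictors and can move either way
c(full = summary(toy_full)$adj.r.squared, reduced = summary(toy_reduced)$adj.r.squared)

# Partial F-test: does the dropped predictor explain a significant share of variance?
toy_anova <- anova(toy_reduced, toy_full)
toy_anova
```

With a single dropped predictor, the F-test p-value equals the p-value of that predictor's t-test in the full model.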
```{r}
firstbase<-firstbase[,-(1:3)]
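#This code removes the first three columns (Player, Pos and Team) from the firstbase dataset, leaving only numeric columns.
#Run this chunk only once: each rerun would drop three more columns.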
```
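An alternative to dropping columns by position is to keep columns by type, which gives the same result however many times it is run. The sketch below uses a small made-up data frame (`toy_players` and its values are hypothetical, not read from the CSV file).

```{r}
# Small illustrative data frame (assumption: hypothetical values, not the real file)
toy_players <- data.frame(
  Player = c("A", "B", "C"),
  Team   = c("X", "Y", "Z"),
  RBI    = c(100L, 75L, 76L),
  AVG    = c(0.325, 0.305, 0.302)
)

# Keep only the numeric columns; drop = FALSE keeps the result a data frame
toy_numeric <- toy_players[, sapply(toy_players, is.numeric), drop = FALSE]
names(toy_numeric)
```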
```{r}
# Correlations
cor(firstbase$RBI, firstbase$Payroll.Salary2023)
#The above code uses the cor() function to find the correlation between two variables (RBI and Payroll.Salary2023). In this case the answer is 0.6281239.

#The positive value (0.6281239) indicates a moderate positive linear relationship between RBI (Runs Batted In) and Payroll.Salary2023: as RBI increases, Payroll.Salary2023 tends to increase as well, though not in a perfectly linear fashion.
```
```{r}
cor(firstbase$AVG, firstbase$OBP)

#The above code uses the cor() function to find the correlation between two numeric vectors (AVG and OBP). In this case the answer is 0.8028894.

#The correlation coefficient (0.8028894) indicates a strong positive linear relationship between AVG (Batting Average) and OBP (On-Base Percentage): players with higher batting averages also tend to have higher on-base percentages.
```
```{r}
cor(firstbase)
#In this code we use the function cor() to compute the correlation matrix for every pair of columns in firstbase. This works because the non-numeric columns were removed above; cor() requires numeric input.
```
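A full correlation matrix can be hard to scan by eye. As a rough sketch, strongly correlated pairs can be listed programmatically. The example uses a few columns of the built-in `mtcars` data as a stand-in, and the 0.8 cutoff is an arbitrary assumption, not a value taken from this analysis.

```{r}
# Stand-in numeric data (assumption: built-in mtcars, not the firstbase data)
toy_cor <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# Keep each pair once by blanking the diagonal and the lower triangle
toy_cor[lower.tri(toy_cor, diag = TRUE)] <- NA

# Positions of pairs whose absolute correlation exceeds the chosen cutoff
toy_pairs <- which(abs(toy_cor) > 0.8, arr.ind = TRUE)

# Readable table of the flagged pairs
data.frame(
  var1 = rownames(toy_cor)[toy_pairs[, "row"]],
  var2 = colnames(toy_cor)[toy_pairs[, "col"]],
  r    = toy_cor[toy_pairs]
)
```

Pairs flagged this way are candidates for multicollinearity when both variables are used as predictors in the same model.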
```{r}
# Remove AVG
model5 = lm(Payroll.Salary2023 ~ RBI + OBP + OPS, data=firstbase)
# Linear Regression (RBI, OBP and OPS)
#Fit a multiple linear regression model, model5, which includes more than one predictor variable (RBI, OBP and OPS).
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and RBI, OBP and OPS are the independent variables (predictor variables) taken from the firstbase dataset.

#model5 <- lm(...): Assigns the linear regression model object to the variable model5.

#summary(model5): The summary() function then provides a detailed summary of the fitted linear regression model model5. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.
summary(model5)

#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of each coefficient estimate.
#t value: This is the t-statistic, used to test whether each coefficient differs from zero. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least as extreme as the one computed if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 5767000.

#Model Fit.
#Multiple R-squared: Approximately 57.09% of the variance in Payroll.Salary2023 is explained by RBI, OBP and OPS.

```
```{r}
model6 = lm(Payroll.Salary2023 ~ RBI + OBP, data=firstbase)
# Remove OPS
# Linear Regression (RBI and OBP)
#Fit a multiple linear regression model, model6, which includes more than one predictor variable (RBI and OBP).
#This code fits a linear regression model (lm) where Payroll.Salary2023 is the dependent variable (response variable), and RBI and OBP are the independent variables (predictor variables) taken from the firstbase dataset.

#model6 <- lm(...): Assigns the linear regression model object to the variable model6.

#summary(model6): The summary() function then provides a detailed summary of the fitted linear regression model model6. This summary includes information such as coefficients, standard errors, t-statistics, p-values, and goodness-of-fit measures like R-squared.
summary(model6)
#Residuals: These are the differences between the observed values of the dependent variable (Payroll.Salary2023) and the values predicted by the model. The summary shows descriptive statistics of the residuals:
#Min, 1Q, Median, 3Q, Max: These values represent the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals, respectively. They provide insight into the distribution and range of errors in the model predictions.

#Std. Error: These are the standard errors associated with the coefficients. They measure the variability of each coefficient estimate.
#t value: This is the t-statistic, used to test whether each coefficient differs from zero. It's calculated as Estimate / Std. Error.
#Pr(>|t|): This is the p-value associated with each coefficient. It is the probability of observing a t-statistic at least as extreme as the one computed if the null hypothesis (that the coefficient is zero) were true.

#Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it's approximately 5625000.

#Model Fit.
#Multiple R-squared: Approximately 57.09% of the variance in Payroll.Salary2023 is explained by RBI and OBP.

```
```{r}
# Read in test set
#The firstbasestats_test.csv file contains a test sample with the stats of only 2 first base players.
#We can read this file in R using the read.csv() function.
firstbaseTest = read.csv("firstbasestats_test.csv")
#Declare the variable firstbaseTest.
#Use the function read.csv() to read the file "firstbasestats_test.csv" and assign the contents to the variable firstbaseTest.
str(firstbaseTest)
#Use the str() function to provide a concise summary of the structure of firstbaseTest, including the number of observations (rows) and variables (columns), along with the data type and first few values of each variable.
```
```{r}
# Make test set predictions
#In this code we use the predict() function to generate predictions using model6 on new data (firstbaseTest).
predictTest = predict(model6, newdata=firstbaseTest)
predictTest
#Typing the object name on its own line prints the predicted salaries; an explicit print() call is not needed.
```
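Point predictions say nothing about their uncertainty. As a hedged sketch, `predict()` for `lm` objects accepts an `interval` argument that adds lower and upper bounds. The model and new data below are stand-ins built from `mtcars` (an assumption for illustration), not `model6` or `firstbaseTest`.

```{r}
# Stand-in model and new data (assumption: built-in mtcars, not the firstbase data)
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)
toy_new <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))

# interval = "prediction" adds lower and upper bounds for a single new observation
toy_pred <- predict(toy_fit, newdata = toy_new, interval = "prediction", level = 0.95)
toy_pred
```

The result is a matrix with one row per new observation and the columns `fit`, `lwr` and `upr`.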
```{r}
# Compute R-squared
SSE = sum((firstbaseTest$Payroll.Salary2023 - predictTest)^2)
SST = sum((firstbaseTest$Payroll.Salary2023 - mean(firstbase$Payroll.Salary2023))^2)
1 - SSE/SST
#The code above computes two quantities, SSE and SST, and combines them as 1 - SSE/SST to get the R-squared on the test set.
#SSE: Sum of Squared Errors, which measures the variation in the actual values that is not explained by the model.

#predictTest: This is the vector of predicted values obtained from the model (model6) applied to firstbaseTest.
#firstbaseTest$Payroll.Salary2023: These are the actual values of the response variable (Payroll.Salary2023) in the test dataset.
#sum((firstbaseTest$Payroll.Salary2023 - predictTest)^2): Calculates the sum of squared differences between the actual values (firstbaseTest$Payroll.Salary2023) and the predicted values (predictTest).

#SST: Total Sum of Squares, which measures the variation of the actual test values around a baseline prediction.

#mean(firstbase$Payroll.Salary2023): This computes the mean of the response variable (Payroll.Salary2023) from the training dataset (firstbase). It serves as the baseline prediction.
#sum((firstbaseTest$Payroll.Salary2023 - mean(firstbase$Payroll.Salary2023))^2): Calculates the sum of squared differences between each actual test value (firstbaseTest$Payroll.Salary2023) and the training mean (mean(firstbase$Payroll.Salary2023)).

#In this case the result of 0.5477734 is the out-of-sample R-squared: on the test set, model6 reduces the squared prediction error by approximately 54.78% compared with always predicting the training-set mean salary. Because the test set contains only 2 players, this estimate is rough.

```
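The test-set calculation above can be wrapped in a small function. This is a sketch only: `oos_r2` is a hypothetical helper name, and the train/test split of `mtcars` is a stand-in for the firstbase files.

```{r}
# Hypothetical helper (not from the original analysis): out-of-sample R-squared
# with the training-set mean as the baseline prediction
oos_r2 <- function(actual, predicted, train_mean) {
  sse <- sum((actual - predicted)^2)   # error of the model on the test set
  sst <- sum((actual - train_mean)^2)  # error of the baseline on the test set
  1 - sse / sst
}

# Stand-in train/test split on built-in data (assumption: not the firstbase files)
toy_train <- mtcars[1:24, ]
toy_test  <- mtcars[25:32, ]
toy_fit   <- lm(mpg ~ wt + hp, data = toy_train)

oos_r2(toy_test$mpg, predict(toy_fit, newdata = toy_test), mean(toy_train$mpg))

# A model that predicts worse than the baseline gives a negative value
oos_r2(actual = c(1, 2, 3), predicted = c(3, 2, 1), train_mean = 2)  # -3
```

Unlike the R-squared reported by `summary()`, this value is not bounded below by 0, so a negative result signals that the model does worse on new data than the training mean.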






