setwd("/Users/kayhanbabakan/OneDrive/MIT/Analytics Edge/HW7")
rm(list=ls())
Hitters <- read.csv("Hitters.csv")
set.seed(15071)
train.obs <- sort(sample(seq_len(nrow(Hitters)), 0.7*nrow(Hitters))) 
train <- Hitters[train.obs,2:21]
test <- Hitters[-train.obs,2:21]

Initial Predictors
Explore the dataset by plotting the player salaries as a function of a few predictors. Include 2–3 visualizations. Characterize and discuss a few patterns that you observe.

AtBat=ggplot(Hitters,aes(AtBat,Salary))+
  geom_point()+
  geom_smooth()

Hits=ggplot(Hitters,aes(CHits,Salary))+
  geom_point()+
  geom_smooth()

HmRun=ggplot(Hitters,aes(CHmRun,Salary))+
  geom_point()+
  geom_smooth()

ggarrange(AtBat,Hits,HmRun,ncol=1,nrow=3)

The visualization I chose to incoporate show some of they key performance measures the league uses to justify salaray. As ATBat CHits, and Homeruns generally increases the total salary. It is good to note here that there are a few outliers in the data.

Correlation Matrix
Report the correlation matrix between the numerical predictors (i.e., all predictors except Name, League, Division and NewLeague). You can restrict the dataset to the numerical predictors with Hitters[,2:18]. What do you observe? Does this make sense, in view of the problem?

corplot=ggcorr(select(Hitters,-c("Name","League","Division","NewLeague")),size=2)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
corplot

As expected there is a strong correltion amongst baseball stats and metrics relative to the players salary. The better the performance the stronger the salary! Additionaly there is alot of multicolinearity amongst the predictors. i.e the number of times at bat has a strong correlation with the number of times you are able to make a hit, a home run or even a run.

Linear Regression Fit a linear regression model with all the predictors using the training set, and make predictions on the test set. Report the in-sample and out-of-sample R2.Comment briefly on the sign and significance of the variables and the R2 values. Does this make sense, in view of your earlier observations?

lm = lm(Salary~.,train)
predict_train=predict(lm,train)
predict_test=predict(lm,test)
SSETrain= sum((predict_train - train$Salary)^2)
SSTTrain = sum((mean(train$Salary)-train$Salary)^2)
SSETest= sum((predict_test - test$Salary)^2)
SSTTest = sum((mean(train$Salary)-test$Salary)^2)
R2.LM = 1-(SSETrain/SSTTrain)
OSR2.LM <- 1-(SSETest/SSTTest)
summary(lm)

Call:
lm(formula = Salary ~ ., data = train)

Residuals:
   Min     1Q Median     3Q    Max 
-742.5 -176.4  -24.5  139.2 1838.7 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 246.38953  108.15508   2.278 0.024011 *  
AtBat        -3.18561    0.80557  -3.954 0.000114 ***
Hits         10.29267    2.97333   3.462 0.000685 ***
HmRun        -6.69027    7.47490  -0.895 0.372082    
Runs         -1.55120    3.65385  -0.425 0.671729    
RBI           1.27969    3.12671   0.409 0.682869    
Walks         7.67079    2.18198   3.516 0.000568 ***
Years       -22.90596   16.02952  -1.429 0.154910    
CAtBat        0.03047    0.16788   0.181 0.856207    
CHits         0.13791    0.83710   0.165 0.869350    
CHmRun        2.18851    2.12401   1.030 0.304355    
CRuns         0.47897    0.93183   0.514 0.607939    
CRBI          0.16728    0.86209   0.194 0.846387    
CWalks       -0.89886    0.40397  -2.225 0.027440 *  
PutOuts       0.25947    0.09390   2.763 0.006376 ** 
Assists       0.38031    0.28782   1.321 0.188236    
Errors        0.49402    5.88410   0.084 0.933193    
LeagueN      33.68519   96.03250   0.351 0.726212    
DivisionW   -84.15153   50.39587  -1.670 0.096865 .  
NewLeagueN   -6.42531   96.03884  -0.067 0.946740    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 319.9 on 164 degrees of freedom
Multiple R-squared:  0.5603,    Adjusted R-squared:  0.5093 
F-statistic:    11 on 19 and 164 DF,  p-value: < 2.2e-16
LMMod=data.frame("Model"="LM","R2"=R2.LM,"OSR2"=OSR2.LM)
LMMod

The significant independent variables are Walks, Cwalks, and PutOuts. Hits and AtBats. The signifcant varialbes dont always make sense as your hits increase your salary increases however as at bats increase you get less. This is due to the significant amount of multi-colinearity throughout the data. The R2 and OSR2 are .56 and .40, osr2 should be able to obtain better results by reducing the noise in the model.

Lasso/Ridge
Train ridge regression and LASSO models with 10-fold cross-validation to select the appropriate value of the shrinkage parameter λ, using the Mean Squared Error as the performance metric (this is the default option in the cv.glmnet() function). Plot the cross-validated Mean Squared Error as a function of λ. Report the value of λ that minimizes the Mean Squared Error for each method. [

x_salary.train=model.matrix(Salary~.-1,train) 
y_salary.train<-train[,c("Salary")] #set Y for glmnet fitting
x_salary.test=model.matrix(Salary~.-1,test) 
y_salary.test=test[,c("Salary")]


all.lambdas <- c(exp(seq(15, -10, -.1)))
set.seed(15071)
cv.ridge = cv.glmnet(x_salary.train,y_salary.train,alpha=0,lambda=all.lambdas,nfolds=10)
set.seed(15071)
cv.lasso = cv.glmnet(x_salary.train,y_salary.train,alpha=1,lambda=all.lambdas,nfolds=10)

bestlambdaridge = cv.ridge$lambda.min
bestlambdalasso = cv.lasso$lambda.min

plot(cv.ridge)

plot(cv.lasso)


rbind(data.frame("Model"="Ridge","Min RMSE Lambda"=bestlambdaridge),
data.frame("Model"="Lasso","Min RMSE Lambda"=bestlambdalasso))

#Retrain using best Lambda With the selected values of λ, re-train your ridge regression and LASSO models on the full training set. Report each model’s coefficients and comment on the effects of ridge regression vs. LASSO. Use each model to make predictions on the test set. Report the values of the in-sample R2 and the out-of-sample R2. Comment on your results.

set.seed(15071)
ridge.final <- glmnet(x_salary.train,y_salary.train,alpha=0,lambda=bestlambdaridge )
lasso.final <- glmnet(x_salary.train,y_salary.train,alpha=1,lambda=bestlambdalasso)

pred.train.lasso <- predict(lasso.final,x_salary.train)
pred.test.lasso <- predict(lasso.final,x_salary.test)
pred.train.ridge <- predict(ridge.final,x_salary.train)
pred.test.ridge <- predict(ridge.final,x_salary.test)

SSETrainLasso= sum((pred.train.lasso - train$Salary)^2)
SSETestLasso= sum((pred.test.lasso - test$Salary)^2)
SSETrainRidge= sum((pred.train.ridge - train$Salary)^2)
SSETestRidge= sum((pred.test.ridge - test$Salary)^2)

R2.Lasso = 1-(SSETrainLasso/SSTTrain)
OSR2.lasso <- 1-(SSETestLasso/SSTTest)
R2.Ridge = 1-(SSETrainRidge/SSTTrain)
OSR2.Ridge <- 1-(SSETestRidge/SSTTest)

LassoMod=data.frame("Model"="Lasso","R2"=R2.Lasso,"OSR2"=OSR2.lasso)
RidgeMod=data.frame("Model"="Ridge","R2"=R2.Ridge,"OSR2"=OSR2.Ridge)
rbind(LMMod,LassoMod,RidgeMod)

The main differnece between ridge and lasso is lassos ability to drive the coefficients to 0 where ridge brings them to a very small number. Of the models, Lasso performs optimally providing the best OSR2 and only slightly worse R2. LM includes all variables with alot of noise which allows it to fit the training data better than the other methods but poorly on the test data.

---
title: "R Notebook"
output: html_notebook
---
```{r include=FALSE}
library("ggplot2")
library("gplots")
library("glmnet")
library("MASS")
library("tidyverse")
library("dplyr")
library("reshape")
library("ggpubr")
library("ggplot2")
library("glmnet")
library("reshape2")
library("heatmaply")
library("dummies")
library("dplyr")
library("tidyr")
library("caTools")
library("caret")
library("ROCR")
library("ggpubr")
library("glmnetUtils")
library("GGally")
library("glmnet")
library("dplyr")
library("ggplot2")
library("tidyr")
library("lars")
library("leaps")
library("gbm")
library("rpart")
library("corrplot")
library("Metrics")
library("rpart.plot")
library("randomForest")
```

```{r}
setwd("/Users/kayhanbabakan/OneDrive/MIT/Analytics Edge/HW7")
rm(list=ls())
Hitters <- read.csv("Hitters.csv")
set.seed(15071)
train.obs <- sort(sample(seq_len(nrow(Hitters)), 0.7*nrow(Hitters))) 
train <- Hitters[train.obs,2:21]
test <- Hitters[-train.obs,2:21]
```
<b>Initial Predictors</b></br>
Explore the dataset by plotting the player salaries as a function of a few predictors. Include 2–3
visualizations. Characterize and discuss a few patterns that you observe. 
```{r message=FALSE}
# Salary against a current-season stat (AtBat) and two career stats (CHits, CHmRun)
p.AtBat <- ggplot(Hitters, aes(AtBat, Salary)) +
  geom_point() +
  geom_smooth()

p.CHits <- ggplot(Hitters, aes(CHits, Salary)) +
  geom_point() +
  geom_smooth()

p.CHmRun <- ggplot(Hitters, aes(CHmRun, Salary)) +
  geom_point() +
  geom_smooth()

ggarrange(p.AtBat, p.CHits, p.CHmRun, ncol = 1, nrow = 3)
```
> The visualizations incorporate some of the key performance measures the league uses to justify salary. Salary generally increases with AtBat, CHits, and CHmRun, although the trend flattens at the high end of each predictor. It is also worth noting that the data contain a few clear outliers.

<b>Correlation Matrix</b></br>
Report the correlation matrix between the numerical predictors (i.e., all predictors except Name,
League, Division and NewLeague). You can restrict the dataset to the numerical predictors with
Hitters[,2:18]. What do you observe? Does this make sense, in view of the problem? 

```{r echo=TRUE}
# Pairwise correlations among the numerical predictors and Salary
corplot <- ggcorr(select(Hitters, -c("Name", "League", "Division", "NewLeague")), size = 2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
corplot
```
> As expected, there is a strong correlation between the baseball statistics and player salary: the better the performance, the higher the salary. Additionally, there is a lot of multicollinearity among the predictors; for example, the number of at-bats is strongly correlated with the number of hits, home runs, and runs, since a player who bats more often accumulates more of every counting statistic.
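
To make that claim concrete, the pairwise correlations can be checked directly; a quick sketch (an addition, not part of the original analysis) using base R's `cor()`:
```{r}
# Correlations of AtBat with the other current-season counting stats;
# values near 1 would confirm the multicollinearity noted above.
cor(Hitters[, c("AtBat", "Hits", "HmRun", "Runs")])
```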

<b>Linear Regression</b></br>
Fit a linear regression model with all the predictors using the training set, and make predictions on the test set. Report the in-sample and out-of-sample R2. Comment briefly on the sign and significance of the variables and the R2 values. Does this make sense, in view of your earlier observations?
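
For reference, the out-of-sample R2 below follows the convention of benchmarking against the training-set mean (this mirrors the `SSTTest` computation in the chunk):
$$R^2 = 1-\frac{\sum_{i \in \text{train}} (\hat{y}_i - y_i)^2}{\sum_{i \in \text{train}} (\bar{y}_{\text{train}} - y_i)^2}, \qquad OSR^2 = 1-\frac{\sum_{i \in \text{test}} (\hat{y}_i - y_i)^2}{\sum_{i \in \text{test}} (\bar{y}_{\text{train}} - y_i)^2}$$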
```{r echo=TRUE}
# Avoid shadowing the base function lm()
mod.lm <- lm(Salary ~ ., data = train)
predict_train <- predict(mod.lm, train)
predict_test <- predict(mod.lm, test)

# In-sample and out-of-sample sums of squares; OSR2 benchmarks
# the test predictions against the training-set mean
SSETrain <- sum((predict_train - train$Salary)^2)
SSTTrain <- sum((mean(train$Salary) - train$Salary)^2)
SSETest <- sum((predict_test - test$Salary)^2)
SSTTest <- sum((mean(train$Salary) - test$Salary)^2)
R2.LM <- 1 - (SSETrain / SSTTrain)
OSR2.LM <- 1 - (SSETest / SSTTest)
summary(mod.lm)
LMMod <- data.frame("Model" = "LM", "R2" = R2.LM, "OSR2" = OSR2.LM)
LMMod
```
> The significant independent variables are AtBat, Hits, Walks, CWalks, and PutOuts. The signs of the significant variables don't always make sense: salary increases as hits increase, yet decreases as at-bats increase. This is due to the significant amount of multicollinearity throughout the data. The in-sample R2 is about 0.56 and the OSR2 about 0.40; the OSR2 should improve once the noise in the model is reduced.
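
One way to quantify the multicollinearity blamed above is via variance inflation factors; a minimal sketch, assuming the `car` package is installed (it is not loaded in the setup chunk, so the chunk is left unevaluated):
```{r eval=FALSE}
# Hypothetical diagnostic, not part of the original analysis:
# VIF measures how much a coefficient's variance is inflated by
# collinearity; values far above 10 (expected here for career
# totals such as CAtBat and CHits) flag the problem.
library(car)
vif(mod.lm)
```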

<b>Lasso/Ridge</b></br>
Train ridge regression and LASSO models with 10-fold cross-validation to select the appropriate value of the shrinkage parameter λ, using the Mean Squared Error as the performance metric (this is the default option in the cv.glmnet() function). Plot the cross-validated Mean Squared Error as a function of λ. Report the value of λ that minimizes the Mean Squared Error for each method.
```{r echo=TRUE}
# glmnet takes a numeric matrix, so expand factors into dummies
# (the -1 drops the intercept column; glmnet adds its own)
x_salary.train <- model.matrix(Salary ~ . - 1, train)
y_salary.train <- train[, "Salary"]
x_salary.test <- model.matrix(Salary ~ . - 1, test)
y_salary.test <- test[, "Salary"]

# Grid of candidate lambdas, from very heavy to almost no shrinkage
all.lambdas <- exp(seq(15, -10, -.1))
set.seed(15071)
cv.ridge <- cv.glmnet(x_salary.train, y_salary.train, alpha = 0, lambda = all.lambdas, nfolds = 10)
set.seed(15071)
cv.lasso <- cv.glmnet(x_salary.train, y_salary.train, alpha = 1, lambda = all.lambdas, nfolds = 10)

bestlambdaridge <- cv.ridge$lambda.min
bestlambdalasso <- cv.lasso$lambda.min

plot(cv.ridge)
plot(cv.lasso)

rbind(data.frame("Model" = "Ridge", "Min.MSE.Lambda" = bestlambdaridge),
      data.frame("Model" = "Lasso", "Min.MSE.Lambda" = bestlambdalasso))
```
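
As an aside, cv.glmnet() also reports lambda.1se, the largest λ whose cross-validated error is within one standard error of the minimum; a quick comparison (an addition, not part of the original assignment):
```{r}
# lambda.1se trades a little CV error for extra shrinkage,
# which often yields a sparser, more conservative model
c(ridge.min = cv.ridge$lambda.min, ridge.1se = cv.ridge$lambda.1se,
  lasso.min = cv.lasso$lambda.min, lasso.1se = cv.lasso$lambda.1se)
```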
<b>Retrain using best Lambda</b></br>
With the selected values of λ, re-train your ridge regression and LASSO models on the full training set.
Report each model’s coefficients and comment on the effects of ridge regression vs. LASSO. Use each
model to make predictions on the test set. Report the values of the in-sample R2 and the out-of-sample
R2. Comment on your results.
```{r echo=TRUE}
# glmnet fits are deterministic, but keep the seed for reproducibility
set.seed(15071)
ridge.final <- glmnet(x_salary.train, y_salary.train, alpha = 0, lambda = bestlambdaridge)
lasso.final <- glmnet(x_salary.train, y_salary.train, alpha = 1, lambda = bestlambdalasso)

pred.train.lasso <- predict(lasso.final, x_salary.train)
pred.test.lasso <- predict(lasso.final, x_salary.test)
pred.train.ridge <- predict(ridge.final, x_salary.train)
pred.test.ridge <- predict(ridge.final, x_salary.test)

SSETrainLasso <- sum((pred.train.lasso - train$Salary)^2)
SSETestLasso <- sum((pred.test.lasso - test$Salary)^2)
SSETrainRidge <- sum((pred.train.ridge - train$Salary)^2)
SSETestRidge <- sum((pred.test.ridge - test$Salary)^2)

# Same baselines (SSTTrain, SSTTest) as the linear model above
R2.Lasso <- 1 - (SSETrainLasso / SSTTrain)
OSR2.Lasso <- 1 - (SSETestLasso / SSTTest)
R2.Ridge <- 1 - (SSETrainRidge / SSTTrain)
OSR2.Ridge <- 1 - (SSETestRidge / SSTTest)

LassoMod <- data.frame("Model" = "Lasso", "R2" = R2.Lasso, "OSR2" = OSR2.Lasso)
RidgeMod <- data.frame("Model" = "Ridge", "R2" = R2.Ridge, "OSR2" = OSR2.Ridge)
rbind(LMMod, LassoMod, RidgeMod)
```
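
The question also asks for each model's coefficients; the chunk above never prints them, so here is a minimal addition to report them side by side:
```{r}
# LASSO drives some coefficients exactly to zero, while ridge
# merely shrinks them toward (but not to) zero
data.frame(Ridge = as.matrix(coef(ridge.final))[, 1],
           LASSO = as.matrix(coef(lasso.final))[, 1])
```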
> The main difference between ridge and LASSO is LASSO's ability to drive coefficients exactly to zero, whereas ridge only shrinks them to very small values. Of the three models, LASSO performs best, providing the highest OSR2 at the cost of only a slightly lower R2. The unpenalized linear model includes all the variables, noise and all, which lets it fit the training data better than the other methods but predict poorly on the test set.


