1.1. Comparing model fit

1a. Load the Credit data (from the ISLR package). Also load the tidyverse, ggplot2, and dplyr packages; you will need them for arranging the data and plotting

library(ISLR)
library(tidyverse)
library(ggplot2)
library(dplyr)
data("Credit")



1b. Set seed to 123 (set.seed(123)), do a 75-25 train-test split on the variable Rating, generate training and testing data

library(lattice)
library(caret)
set.seed(123)

train_index <- createDataPartition(Credit$Rating, p = .75, list = FALSE)
training <- Credit[train_index,]
testing <- Credit[-train_index,]
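
As a quick sanity check (optional, not part of the question), the realized split proportions can be confirmed:

nrow(training) / nrow(Credit)  # should be close to 0.75
nrow(testing) / nrow(Credit)   # should be close to 0.25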



1b. Fit a linear regression model up to 5-degree polynomials on the training data by regressing Rating on Income, use the appropriate function f(x)

rating_poly5 <- lm(Rating ~ poly(Income, 5), data = training)
summary(rating_poly5)
## 
## Call:
## lm(formula = Rating ~ poly(Income, 5), data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -193.914  -79.108    5.176   75.956  175.680 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       355.734      5.393  65.962   <2e-16 ***
## poly(Income, 5)1 2255.251     93.565  24.103   <2e-16 ***
## poly(Income, 5)2   51.768     93.565   0.553   0.5805    
## poly(Income, 5)3  -12.479     93.565  -0.133   0.8940    
## poly(Income, 5)4  161.241     93.565   1.723   0.0859 .  
## poly(Income, 5)5  -82.348     93.565  -0.880   0.3795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.57 on 295 degrees of freedom
## Multiple R-squared:  0.6648, Adjusted R-squared:  0.6591 
## F-statistic:   117 on 5 and 295 DF,  p-value: < 2.2e-16



1b. Bonus question: in Q1b, if you use the poly() function, you can keep the default raw setting (which is FALSE, and you should use this option here) or set raw = TRUE. What are their pros and cons in terms of adjudicating i) the optimal number of knots and ii) the effects of additional polynomial terms of the original variable x?

Setting raw = FALSE makes the polynomial terms orthogonal to one another. Because the terms are then uncorrelated, each coefficient and p-value can be read independently, so it is easy to see whether each additional polynomial term improves the fit and at which degree the terms stop being statistically significant. Setting raw = TRUE uses the raw powers of x, which are highly correlated with one another; that multicollinearity inflates the standard errors and makes it hard to tell whether an additional term improves the regression model.

On the other hand, if the effect of the original x is of interest, raw = TRUE gives coefficients that are directly interpretable in terms of x itself. Even so, raw = TRUE is generally not recommended here because of the multicollinearity among the raw powers: with polynomial models we usually want to know whether a higher degree improves the fit, and raw = FALSE captures the incremental effect of each polynomial term cleanly.
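
A quick way to see the difference (a minimal sketch reusing the training data and the rating_poly5 model from above; degree 3 is used in the correlation checks only to keep the printed matrices small):

# Raw powers of Income are highly correlated with one another
round(cor(poly(training$Income, 3, raw = TRUE)), 2)

# Orthogonal polynomial terms are uncorrelated by construction
round(cor(poly(training$Income, 3)), 2)

# Both parameterizations give identical fitted values; only the
# coefficients and their p-values differ
fit_raw <- lm(Rating ~ poly(Income, 5, raw = TRUE), data = training)
all.equal(fitted(rating_poly5), fitted(fit_raw))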

1c. Fit a linear regression model on the training data by regressing Rating on the log of Income, use the appropriate function f(x)

rating_log <- lm(Rating ~ log(Income), data = training)
summary(rating_log)
## 
## Call:
## lm(formula = Rating ~ log(Income), data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -208.17  -82.03   -2.29   86.01  351.96 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -248.76      32.31   -7.70 2.01e-13 ***
## log(Income)   168.74       8.85   19.07  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 107.8 on 299 degrees of freedom
## Multiple R-squared:  0.5487, Adjusted R-squared:  0.5472 
## F-statistic: 363.5 on 1 and 299 DF,  p-value: < 2.2e-16



1d. Use the results from the two models to predict on the testing data, append the RMSE and R^2 of both models to a data.frame() object and print them out. Explain which model you would prefer based on these two metrics

poly_pred <- predict(rating_poly5, testing)
log_pred <- predict(rating_log, testing)

data.frame(type = c('Poly','Log'),
  RMSE = c(RMSE(poly_pred, testing$Rating), RMSE(log_pred, testing$Rating)),
  R2 = c(R2(poly_pred, testing$Rating), R2(log_pred, testing$Rating))
)
##   type      RMSE        R2
## 1 Poly  98.67925 0.4924893
## 2  Log 104.40012 0.4404227

RMSE is the root mean squared error: the square root of the average squared deviation of the predicted values from the actual values. A smaller RMSE means the predictions sit closer to the observed values, i.e., a better-fitting model.

R^2 equals 1 - SSR/TSS, where SSR is the sum of squared residuals and TSS is the total sum of squares. A model that explains more of the variation leaves a smaller SSR and therefore has a larger R^2.
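
To make these formulas concrete, here is a minimal check computing both metrics by hand for the polynomial model; the values should match caret's RMSE() and R2() (which by default returns the squared correlation between predictions and observations):

sqrt(mean((poly_pred - testing$Rating)^2))  # RMSE by hand
cor(poly_pred, testing$Rating)^2            # R^2 as squared correlation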

On both metrics the polynomial model is preferable to the log model: its RMSE is smaller and its R^2 is larger, meaning it explains more of the variation in Rating and leaves smaller residuals.

1e. Plot the results of Q1b and Q1c using the ggplot() function

ggplot(training, aes(Income, Rating)) +
  geom_point(alpha = 0.2) +
  labs(x = "Income", y = "Rating", title = "Rating as a polynomial function of Income") +
  stat_smooth(method = lm, formula = y ~ poly(x, 5), col = "red")


ggplot(training, aes(Income, Rating)) +
  geom_point(alpha = 0.2) +
  labs(x = "Income", y = "Rating", title = "Rating as a log function of Income") +
  stat_smooth(method = lm, formula = y ~ log(x), col = "red")


Based on the plots above, the polynomial model clearly fits the data more closely, especially once Income exceeds roughly 100 (thousand dollars). A plausible reason is that the polynomial is more flexible and can keep adjusting as Income grows large, whereas log(Income) flattens out.

2. Predicting purchasing behaviors with Support Vector Machine (SVM)

2a. Load the Social_Network_Ads data from dropbox using the following code: read.csv("https://www.dropbox.com/s/11rmse4ay9uu8vu/Social_Network_Ads.csv?dl=1")

library(e1071)  # provides svm(), the only package needed for this section

Social_Network_Ads <- read.csv("https://www.dropbox.com/s/11rmse4ay9uu8vu/Social_Network_Ads.csv?dl=1")



2b. Subset Age, EstimatedSalary, and Purchased from the data and convert them into a new dataframe object (name it with any name you like)

df <- data.frame(Age = Social_Network_Ads$Age,
                 Est.Salary = Social_Network_Ads$EstimatedSalary,
                 Purchase = Social_Network_Ads$Purchased)

I named the subsetted data frame df.
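
Since dplyr is already loaded from Q1a, an equivalent one-liner with select() (just an alternative, producing the same data frame):

df <- Social_Network_Ads %>%
  select(Age, Est.Salary = EstimatedSalary, Purchase = Purchased)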

2c. Convert Purchased to factor variable (i.e., a variable of factor class)

class(df$Purchase)
## [1] "integer"
df$Purchase <- as.factor(df$Purchase)
class(df$Purchase)
## [1] "factor"



2d. Set seed to 123 (set.seed(123)), do a 75-25 train-test split on the variable Purchased, generate training and testing data

library(lattice)
library(caret)
set.seed(123)

svm_train_index <- createDataPartition(df$Purchase, p = .75, list = FALSE)
svm_training <- df[svm_train_index,]
svm_testing <- df[-svm_train_index,]



2e. Fit an SVM on the training data using Purchased as the outcome variable and Age and EstimatedSalary as explanatory variables. Use the svm() function (from the e1071 package), set type to "C-classification" and kernel to "linear". You DO NOT need to tweak the gamma argument unless you want to create curvature on the separating hyperplane (SHP).

svm.fit <- svm(Purchase ~ .,
               type = "C-classification",
               kernel = "linear",
               data = svm_training)
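
For reference only (not required by the question): creating curvature on the SHP, as the prompt mentions, would just mean switching to a nonlinear kernel; the gamma value below is illustrative, not tuned.

svm_radial <- svm(Purchase ~ .,
                  type = "C-classification",
                  kernel = "radial",  # RBF kernel bends the boundary
                  gamma = 0.5,        # illustrative value, not tuned
                  data = svm_training)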



2f. Use the result from Q2e to predict on the testing data

predict_test_svm <- predict(svm.fit, svm_testing)



2g. Generate confusion matrix and show prediction accuracy

confmat_test_svm <- table(Predicted = predict_test_svm, 
               Actual = svm_testing$Purchase)
confmat_test_svm
##          Actual
## Predicted  0  1
##         0 57  9
##         1  7 26
(confmat_test_svm[1, 1] + confmat_test_svm[2, 2]) / sum(confmat_test_svm) * 100
## [1] 83.83838

Correct predictions are the cells where predicted equals actual, i.e., (predicted, actual) = (0, 0) and (1, 1).
Prediction accuracy is the count of (0, 0) plus the count of (1, 1), divided by the total number of predictions, which gives about 83.84%.
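
The same numbers can be read off directly with caret's confusionMatrix(), which reports accuracy (and more) in one call, since caret is already loaded:

confusionMatrix(predict_test_svm, svm_testing$Purchase)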

2h. Visualize the result of Q2e with plot() function

colors <- c("bisque", "snow2")
plot(svm.fit, svm_training, col = colors)



3. Predicting MLB ballers’ salary with LASSO regression

3a. Load the ISLR, glmnet, and plotmo packages, require the Hitters data (from the ISLR package)

library(ISLR)
library(glmnet)
library(plotmo)
data("Hitters")



3b. Remove missing values from the Hitters data, generate the variable Salary as y, use the rest of the variables as x and generate them as a matrix. Fit a LASSO model on the Hitters data using the glmnet() function

dim(Hitters)
## [1] 322  20
Hitters_noNA <- Hitters[complete.cases(Hitters), ]
dim(Hitters_noNA)
## [1] 263  20

After removing NAs, the data frame contains 263 rows; that is, 59 rows with missing values were deleted (na.omit(Hitters) would give the same result).

X <- model.matrix(Salary ~ . - 1, data = Hitters_noNA)
Y <- Hitters_noNA$Salary

Hitter_LASSO <- glmnet(X, Y, alpha = 1)

Setting alpha = 1 tells glmnet() to use the LASSO (L1) penalty; alpha = 0 would give ridge (L2) regression, and values in between give the elastic net.

3c.

(i) Plot the result of Q3b using plot_glmnet() on the top 5 variables. Describe the sign and magnitude of the effects of these 5 variables at optimized lambda value

par(mfrow = c(1, 2))

plot_glmnet(Hitter_LASSO, xvar = "lambda",
            label = 5,
            xlab = expression(paste("log(", lambda, ")")),
            ylab = expression(beta))

plot_glmnet(Hitter_LASSO, label = 5,
            xlab = expression(paste("log(", lambda, ")")),
            ylab = expression(beta))


As the plot shows, the top 5 labeled variables are Hits, Walks, NewLeagueN, LeagueA, and DivisionW. Hits and walks are plausibly associated with a hitter's salary. Checking the variable definitions at rdrr.io (https://rdrr.io/cran/ISLR/man/Hitters.html):

"NewLeague": a factor with levels A and N indicating the player's league at the beginning of 1987.

"League": a factor with levels A and N indicating the player's league at the end of 1986.

"Division": a factor with levels E and W indicating the player's division at the end of 1986.

However, I do not see why league or division should carry much weight in grading a player's performance; there are too many confounds (some divisions are highly competitive while others are weak), and more detailed data would be needed to adjust for individual performance. As for the LASSO result itself, could these three variables stand out simply because they are binary?
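
One way to check that intuition is to compare the scales of the columns of X: the league/division dummies only take values 0 and 1 (standard deviation around 0.5), while the count variables vary over tens or hundreds, so a one-unit coefficient means very different things across them. glmnet() standardizes the predictors internally by default, but the coefficients it returns are on the original scale, which is why the binary dummies can dominate the labeled plot.

# Column standard deviations: dummies ~0.5, count variables much larger
round(apply(X, 2, sd), 2)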

coef(Hitter_LASSO, s = cv.glmnet(X, Y)$lambda.1se)
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) 144.37970458
## AtBat         .         
## Hits          1.36380384
## HmRun         .         
## Runs          .         
## RBI           .         
## Walks         1.49731098
## Years         .         
## CAtBat        .         
## CHits         .         
## CHmRun        .         
## CRuns         0.15275165
## CRBI          0.32833941
## CWalks        .         
## LeagueA       .         
## LeagueN       .         
## DivisionW     .         
## PutOuts       0.06625755
## Assists       .         
## Errors        .         
## NewLeagueN    .

I chose the lambda value by cross-validation with cv.glmnet(), using the lambda.1se rule. Note that lambda.1se is not the lambda that minimizes the cross-validated error (that is lambda.min); it is the largest lambda whose CV error lies within one standard error of that minimum, which gives a sparser, more heavily penalized model.
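
Because cv.glmnet() assigns observations to folds at random, the selected lambda changes from run to run. A minimal reproducibility sketch (assuming the same X and Y as above; the seed value 123 is arbitrary) fixes the seed and shows both candidate values:

set.seed(123)      # CV folds are random, so fix the seed
cv_fit <- cv.glmnet(X, Y, alpha = 1)
cv_fit$lambda.min  # lambda minimizing mean CV error
cv_fit$lambda.1se  # largest lambda within 1 SE of the minimum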

With this lambda, the five nonzero predictors are Hits, Walks, CRuns (career runs), CRBI (career RBIs), and PutOuts, which makes more sense than league and division.
In terms of their effects (Salary is measured in thousands of dollars):
Hits: one additional hit in the 1986 season raises the predicted salary by about 1.364.
Walks: one additional walk in the 1986 season raises the predicted salary by about 1.497.
CRuns: one additional run over the entire career raises the predicted salary by about 0.153.
CRBI: one additional RBI over the entire career raises the predicted salary by about 0.328.
PutOuts: one additional putout in the 1986 season raises the predicted salary by about 0.066.

(ii) Display and describe the sign and magnitude of the effects of the 5 variables at log(lambda) = 0

coef(Hitter_LASSO, s = exp(0))
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  1.993305e+02
## AtBat       -1.924762e+00
## Hits         6.771856e+00
## HmRun        1.251049e+00
## Runs        -9.833022e-01
## RBI          .           
## Walks        5.602395e+00
## Years       -7.706032e+00
## CAtBat      -6.515206e-02
## CHits        .           
## CHmRun       2.296343e-01
## CRuns        1.118645e+00
## CRBI         5.689114e-01
## CWalks      -7.263206e-01
## LeagueA     -4.653101e+01
## LeagueN      6.335197e-12
## DivisionW   -1.167282e+02
## PutOuts      2.818971e-01
## Assists      2.879515e-01
## Errors      -2.855204e+00
## NewLeagueN  -1.015361e+01

I set s = exp(0) because exp(0) = 1 and log(1) = 0, so this evaluates the coefficients at log(lambda) = 0.
For the top 5 labeled variables in the plot (Salary again in thousands of dollars):
Hits: one additional hit in the 1986 season raises the predicted salary by about 6.77.
Walks: one additional walk in the 1986 season raises the predicted salary by about 5.60.
LeagueA: playing in League A in 1986 lowers the predicted salary by about 46.53.
NewLeagueN: playing in League N at the beginning of 1987 lowers the predicted salary by about 10.15.
DivisionW: playing in Division W in 1986 lowers the predicted salary by about 116.73.