library(ISLR)
library(tidyverse)
library(ggplot2)
library(dplyr)
data("Credit")
library(lattice)
library(caret)
set.seed(123)
train_index <- createDataPartition(Credit$Rating, p = .75, list = FALSE)
training <- Credit[train_index,]
testing <- Credit[-train_index,]
rating_poly5 = lm(Rating ~ poly(Income, 5), data = training)
summary(rating_poly5)
##
## Call:
## lm(formula = Rating ~ poly(Income, 5), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -193.914 -79.108 5.176 75.956 175.680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 355.734 5.393 65.962 <2e-16 ***
## poly(Income, 5)1 2255.251 93.565 24.103 <2e-16 ***
## poly(Income, 5)2 51.768 93.565 0.553 0.5805
## poly(Income, 5)3 -12.479 93.565 -0.133 0.8940
## poly(Income, 5)4 161.241 93.565 1.723 0.0859 .
## poly(Income, 5)5 -82.348 93.565 -0.880 0.3795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.57 on 295 degrees of freedom
## Multiple R-squared: 0.6648, Adjusted R-squared: 0.6591
## F-statistic: 117 on 5 and 295 DF, p-value: < 2.2e-16
Setting raw = False, each degree of x will be orthogonal. It makes
additional polynomial term easier to see if it improves the model
fitting. Also, orthogonalized polynomial makes x more possible to be
statistically significant. Setting raw = True makes variables be highly
correlated so that it’s hard to tell whether additional term improves
regression model. Yet, the p-value of higher degree x will increase
largely in the sense that it can sometimes distinguish at which level
the x stop being significant.
On the other hand, if original x is interested, raw = True can better
interpret original x. However, it’s still not recommended to adopt raw =
True which makes polynomial model highly correlated. When using
polynomial model, mostly we’d like to see whether higher degree improves
fitting. And, raw = False capture effects of polynomial terms.
rating_log <- lm(Rating ~ log(Income), data = training)
summary(rating_log)
##
## Call:
## lm(formula = Rating ~ log(Income), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -208.17 -82.03 -2.29 86.01 351.96
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -248.76 32.31 -7.70 2.01e-13 ***
## log(Income) 168.74 8.85 19.07 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 107.8 on 299 degrees of freedom
## Multiple R-squared: 0.5487, Adjusted R-squared: 0.5472
## F-statistic: 363.5 on 1 and 299 DF, p-value: < 2.2e-16
poly_pred <- predict(rating_poly5, testing)
log_pred <- predict(rating_log, testing)
data.frame(type = c('Poly','Log'),
RMSE = c(RMSE(poly_pred, testing$Rating), RMSE(log_pred, testing$Rating)),
R2 = c(R2(poly_pred, testing$Rating), R2(log_pred, testing$Rating))
)
## type RMSE R2
## 1 Poly 98.67925 0.4924893
## 2 Log 104.40012 0.4404227
RMSE is rooted mean square error which is standard deviation of
predicted value from actual value. If RMSE is smaller, the deviation of
predicted value from actual value is closer which simply means the
fitting model is better.
R^2 is (1 - SSR/TSS) where SSR is
squared sum of residuals and TSS is total sum of squares. In order to
find a good prediction model explaining variation a lot, little SSR is
good criterion which also brings a larger R^2.
Apparently,
polynomial model is better than log model. Compared to log model, RMSE
is smaller while R^2 larger which means the model explains more and less
residual.
ggplot(training, aes(Income, Rating) ) +
geom_point(alpha = 0.2) +
labs(x = "Income", y = "Rating", title = "Rating as a polynomial function of Income") +
stat_smooth(method = lm, formula = y ~ poly(x, 5), col = "red")
ggplot(training, aes(Income, Rating) ) +
geom_point(alpha = 0.2) +
labs(x = "Income", y = "Rating", title = "Rating as a log function of Income") +
stat_smooth(method = lm, formula = y ~ log(x, 5), col = "red")
Based on the graph above, it seems pretty clear that polynomial model
fits data closely especially as income being larger than 100 thousand. I
guess the reason could be polynomial function is more flexible for
adjusting when value getting large than log model be.
df <- data.frame( Age = Social_Network_Ads$Age, Est.Salary = Social_Network_Ads$EstimatedSalary, Purchase = Social_Network_Ads$Purchased)
I named subset dataframe “df”.
class(df$Purchase)
## [1] "integer"
df$Purchase <- as.factor(df$Purchase)
class(df$Purchase)
## [1] "factor"
library(lattice)
library(caret)
set.seed(123)
svm_train_index <- createDataPartition(df$Purchase, p = .75, list = FALSE)
svm_training <- df[svm_train_index,]
svm_testing <- df[-svm_train_index,]
svm.fit <- svm(Purchase~.,
type = "C-classification",
kernel = "linear",
data = svm_training)
predict_test_svm <- predict(svm.fit, svm_testing)
confmat_test_svm <- table(Predicted = predict_test_svm,
Actual = svm_testing$Purchase)
confmat_test_svm
## Actual
## Predicted 0 1
## 0 57 9
## 1 7 26
(confmat_test_svm[1, 1] + confmat_test_svm[2, 2]) / sum(confmat_test_svm) * 100
## [1] 83.83838
Accurate prediction are (predicted, actual) = (0, 0) and (1, 1).
Prediction accuracy is the number of (0, 0) plus the number of (1, 1)
and divided by the number of all prediction. The result is 83.83838.
colors <- c("bisque", "snow2")
plot(svm.fit, svm_training, col = colors)
library(ISLR)
library(glmnet)
library(plotmo)
data("Hitters")
dim(Hitters)
## [1] 322 20
Hitters_noNA <- Hitters[complete.cases(Hitters), ]
dim(Hitters_noNA)
## [1] 263 20
After removing NA, the dataframe only contains 263 rows. That is, I
deleted 59 rows where NA exist.
X = model.matrix(Salary ~ .-1, data = Hitters_noNA)
Y = Hitters_noNA$Salary
Hitter_LASSO <- glmnet(X, Y, alpha = 1)
Let alpha be 1 which means adopting LASSO L1 regression.
par(mfrow=c(1,2))
plot_glmnet(Hitter_LASSO, xvar = "lambda",
label=5,
xlab = expression(paste("log(", lambda, ")")),
ylab = expression(beta))
plot_glmnet(Hitter_LASSO, label = 5, xlab = expression(paste("log(", lambda, ")")), ylab = expression(beta))
As plot shows, Hits, Walks, NewLeagN, LeagueA, and DivisionW are
top 5 variables. I understand that hits and walks must be associated
with hitters’ salary. I check the definition of each variable from
rdrr.io(https://rdrr.io/cran/ISLR/man/Hitters.html) found out
that
“NewLeague”: A factor with levels A and N indicating player’s league
at the beginning of 1987.
“League”: A factor with levels A and N indicating player’s league at
the end of 1986
“Division”: A factor with levels E and W indicating player’s division
at the end of 1986
Yet, I do not see any importance for grading a player’s performance
on league or division. I mean there are too many confounds in terms of
player’s league and division (some division are competitive, while some
division just sucks). I think it should have more detailed data to
adjust personal performance nowadays.
As for the result of LASSO
regression, is it because these there variables are binary variable?
coef(Hitter_LASSO, s = cv.glmnet(X, Y)$lambda.1se)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 144.37970458
## AtBat .
## Hits 1.36380384
## HmRun .
## Runs .
## RBI .
## Walks 1.49731098
## Years .
## CAtBat .
## CHits .
## CHmRun .
## CRuns 0.15275165
## CRBI 0.32833941
## CWalks .
## LeagueA .
## LeagueN .
## DivisionW .
## PutOuts 0.06625755
## Assists .
## Errors .
## NewLeagueN .
I set the lambda value using cross validation glmnet() with
lambda.1se rule which presents lambda value that minimizes the loss
function.
Now, it shows five variables which are Hits, Walks, C(career)Runs,
C(career)RBI, and PutOuts instead of league and division makes more
sense.
In terms of the effect,
Hits: increase one hit in 1986
season, the predicted salary increases by 1.36380384 unit.
Walks:
increase one walk in 1986 season, the predicted salary increases by
1.49731098 unit.
CRuns: increase one runs in entire career, the
predicted salary increases by 0.15275165 unit.
CRBI: increase one
RBI in entire career, the predicted salary increases by 0.32833941 unit.
PutOuts: increase one putout in 1986 season, the predicted salary
increases by 0.06625755 unit.
coef(Hitter_LASSO, s = exp(0))
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.993305e+02
## AtBat -1.924762e+00
## Hits 6.771856e+00
## HmRun 1.251049e+00
## Runs -9.833022e-01
## RBI .
## Walks 5.602395e+00
## Years -7.706032e+00
## CAtBat -6.515206e-02
## CHits .
## CHmRun 2.296343e-01
## CRuns 1.118645e+00
## CRBI 5.689114e-01
## CWalks -7.263206e-01
## LeagueA -4.653101e+01
## LeagueN 6.335197e-12
## DivisionW -1.167282e+02
## PutOuts 2.818971e-01
## Assists 2.879515e-01
## Errors -2.855204e+00
## NewLeagueN -1.015361e+01
I set the lambda value, exp(0), because exp(0) = 1 and log(1) = 0.
In terms of effect of top 5 variables in the plot,
Hits:
increase one hit in 1986 season, the predicted salary increases by
6.771856e+00 unit.
Walks: increase one walk in 1986 season, the
predicted salary increases by 5.602395e+00 unit.
LeagueA: being in
leagueA in 1986 season, the predicted salary decreases by 4.653101e+01
unit.
NewLeagueN: being in leagueN in 1987 season, the predicted
salary drops by 1.015361e+01 unit.
DivisionW: being in divisionW in
1986 season, the predicted salary drops by 1.167282e+02 unit.