train <- read.csv("C:/Users/dorgo/Desktop/kaggle/data/train.csv")
test <- read.csv("C:/Users/dorgo/Desktop/kaggle/data/test.csv")
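
Throughout this post I rely on a handful of packages (DataExplorer, caret, DALEX, plus ggplot2, gridExtra, cowplot and a couple of tidyverse helpers used in the snippets below). A minimal setup sketch, assuming they are already installed:

library(DataExplorer)  # automated data exploration
library(caret)         # unified interface for model training & tuning
library(DALEX)         # model-agnostic explanations
library(ggplot2)       # plotting
library(gridExtra)     # grid.arrange() for arranging ggplots
library(cowplot)       # plot_grid()
library(dplyr)         # pipes & data manipulation
library(tibble)        # rownames_to_column()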

Workflow

In this session I’m going to walk through the workflow of a prediction problem. I will focus on the preliminaries, the prediction process itself, and some post-prediction analysis, and along the way introduce a few R packages that are very helpful for those tasks.

The workflow comprises a few stages:

  1. Data exploration: There are several packages that help with this task. I am going to use a nice (and quite new) one: the DataExplorer package.

  2. Model choosing & predictions: In this stage I will examine a few models. In order to choose the best model we have to test many of them and calibrate their parameters. For that I will use one very useful package named caret, a very popular package for ML implementations in R. I chose it because I wanted a single package covering many models, rather than a model-oriented package for each model family. (You may also use h2o, MASS, rpart etc.)

  3. Explain: ML models are considered by many to be “black boxes”. So I found the best model in RMSE terms, but now what? Do I understand any of it? Is the rm variable important, or is lstat the most important one? In order to answer those questions we have to dive into the world of agnostic (i.e. model-indifferent) interpretation of those black boxes. I will use the DALEX package, which can explain models from various ML packages. You can also look at the iml and breakDown packages.

Exploration

Data description

The Boston data frame has 506 rows and 14 columns.

This data frame contains the following features:

crim - per capita crime rate by town.

zn - proportion of residential land zoned for lots over 25,000 sq.ft.

indus - the proportion of non-retail business acres per town.

chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox - nitrogen oxides concentration (parts per 10 million).

rm - average number of rooms per dwelling.

age - proportion of owner-occupied units built prior to 1940.

dis - weighted mean of distances to five Boston employment centres.

rad - index of accessibility to radial highways.

tax - full-value property-tax rate per $10,000.

ptratio - pupil-teacher ratio by town.

black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat - lower status of the population (percent).

medv - median value of owner-occupied homes in $1000s.

Data basic features

We are now going to run some general queries on the basic features of our data. To do so I will use the DataExplorer package, which is very nice! Before you continue reading, I urge you to take some dataset and run the following command:
DataExplorer::create_report(your_data)

The create_report command creates an automatic report containing various views of your data. I will now use its functions one by one, so we can see them separately. Not every graph will be relevant to our case, but I want to show their functionality.

First, we want to know the dimensions of our data, its memory usage, missing data and so on. So we will use the following functions:

introduce(train)
##   rows columns discrete_columns continuous_columns all_missing_columns
## 1  333      15                0                 15                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                    0           333               4995        38400
plot_intro(train)

plot_missing(train)

It is also possible that we would like to know the structure of our data & variables:

plot_bar(train, with = "medv")

plot_histogram(train)

plot_density(train)

plot_boxplot(train, by = "medv")

plot_scatterplot(split_columns(train)$continuous, by = "medv")

It is also possible to get some statistical features of our data, such as QQplots, correlation map, and principal components:

plot_qq(train, by = "medv")

plot_correlation(train)

plot_prcomp(train, maxcat = 5L)

In order to check whether we should treat some columns we will check the class for each column:

sapply(train, class)
##        ID      crim        zn     indus      chas       nox        rm 
## "integer" "numeric" "numeric" "numeric" "integer" "numeric" "numeric" 
##       age       dis       rad       tax   ptratio     black     lstat 
## "numeric" "numeric" "integer" "integer" "numeric" "numeric" "numeric" 
##      medv 
## "numeric"

As we can see, our data is already clean, and we can start searching for the best model.
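
If a column did need treatment, this is where we would do it. A hypothetical example (not run here, since the models below handle chas fine as a numeric 0/1 dummy):

# Illustrative only: treat the Charles River dummy as a categorical variable
# train$chas <- factor(train$chas, levels = c(0, 1))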

Model

Building the models

After the exploration of the data we want to choose the model that gives the best predictions for the Boston dataset. We will try to predict the relevant values using several models that we studied this semester. We will tune each model by picking the best parameter values such as tree depth, \(\lambda\) and so on. We will then move on to evaluate their performance.
In order to keep the workflow clean we will use only the caret package for estimation.

Next, we will implement the usual steps for model-fitting: Splitting, training etc. We will not go into details right now. You can follow the code if you’d like to.

Nonetheless, we will explain the main structure of caret: after you have your train & test data, you choose a cross-validation method using the trainControl function. Several methods are available, such as repeatedcv, oob (Out-of-Bag), boot and so on (check ?trainControl for further information). We chose (for no particular reason) repeatedcv with 10 folds and 10 repeats. Next you choose the models you want to try; you can find the list of models supported by caret here. We tried some linear models such as OLS (duh), pcr and stepwise (backward & forward); also KNN, Ridge and Lasso (we actually tried an elastic net with an optimal \(\alpha\)); and some tree-based models such as random forest, boosting & bagging. We don’t have much to show you in this section, so just read the code…
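
As a quick aside, caret itself can tell you which methods it supports and which parameters each one tunes; a small sketch (the method names below are simply the ones we use later):

length(names(getModelInfo()))   # how many modelling methods caret knows about
modelLookup("rf")               # random forest tunes mtry
modelLookup("glmnet")           # elastic net tunes alpha & lambda
modelLookup("gbm")              # gbm tunes n.trees, interaction.depth, shrinkage, n.minobsinnode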

Boston <- read.csv("C:/Users/dorgo/Desktop/kaggle/data/train.csv")
set.seed(998)
inTraining <- createDataPartition(Boston$medv, p = .75, list = FALSE)
training <- Boston[ inTraining,]
testing  <- Boston[-inTraining,]
cutoff <- 35

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10)
formula <- medv ~ . - ID
set.seed(825)

gbmFit1 <- train(formula, data = training, 
                 method = "gbm", 
                 trControl = fitControl,
                 verbose = FALSE)


bagging <- train(formula, data = training, 
                 method = "treebag", 
                 trControl = fitControl,
                 verbose = FALSE)

random_forest <- train(formula, data = training, 
                 method = "rf", 
                 trControl = fitControl,
                 verbose = FALSE)


principal_component <- train(formula, data = training, 
                       method = "pcr", 
                       trControl = fitControl,
                       verbose = FALSE)

stepwise_backward <- train(formula, data = training, 
                             method = "leapBackward", 
                             trControl = fitControl,
                             verbose = FALSE)

stepwise_forward <- train(formula, data = training, 
                           method = "leapForward", 
                           trControl = fitControl,
                           verbose = FALSE)

lm <- train(formula, data = training, 
                  method = "lm", 
                  trControl = fitControl)

knn <- train(formula, data = training, 
            method = "knn", 
            trControl = fitControl)



fitControl <- trainControl(method = 'cv', number=6,
                           summaryFunction=defaultSummary)
set.seed(123)

glmnet_object <- train(formula, data=training, method = 'glmnet',
                      trControl=fitControl,
                      metric='RMSE')

Tuning parameters

So we have our models. Nice. But how do we know what the best parameters are? Maybe with the right parameters we would get better results from a different model than the one we currently consider the best?
Luckily, caret does some of this evaluation behind the scenes to find the best parameters. The parameters we have to tune differ from one model to another: we have \(\alpha\) for the mix between Ridge & Lasso, \(\lambda\) for the amount of regularization, tree depth for boosting, the number of predictors for random forest and bagging, \(k\) for KNN and so on. Moreover, we can use different objective functions - RMSE, R-squared and so on. So the art of choosing parameters is not easy. We will just use the parameters found by caret.
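
Each fitted caret object stores the winning parameter combination in its bestTune element, so you can inspect what the search settled on; for example:

gbmFit1$bestTune        # n.trees, interaction.depth, shrinkage, n.minobsinnode
random_forest$bestTune  # mtry
knn$bestTune            # k
glmnet_object$bestTune  # alpha & lambda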

Next, you can see some caret graphs showing the performance (in both RMSE and R-squared terms) of the different parameter values in each model.

p3 <- ggplot(glmnet_object) +geom_point(show.legend = FALSE)+ggtitle("glmnet")+ theme(axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
p4 <- ggplot(gbmFit1)+ geom_point(show.legend = FALSE)+ggtitle("GBM")+ theme(axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
p5 <- ggplot(random_forest)+ geom_point(show.legend = FALSE)+ggtitle("RF")+ theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
p6 <- ggplot(stepwise_backward)+ geom_point(show.legend = FALSE)+ggtitle("BACK")+ theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
p7 <- ggplot(stepwise_forward)+ geom_point(show.legend = FALSE)+ggtitle("FOR")+ theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
p8 <- ggplot(knn)+ geom_point(show.legend = FALSE)+ggtitle("KNN")+ theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
gridExtra::grid.arrange(p3,p4, nrow = 2)

gridExtra::grid.arrange(p5,p6,p7,p8, ncol = 2)

p9 <- ggplot(glmnet_object, metric = "Rsquared")+ geom_point(show.legend = FALSE)+ggtitle("glmnet")+ theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
p10 <- ggplot(gbmFit1, metric = "Rsquared")+ geom_point(show.legend = FALSE)+ggtitle("GBM")+ theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
p11 <- ggplot(random_forest, metric = "Rsquared")+ geom_point(show.legend = FALSE)+ggtitle("RF")+ theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
p12 <- ggplot(stepwise_backward, metric = "Rsquared")+ geom_point(show.legend = FALSE)+ggtitle("BACK")+ theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
p13 <- ggplot(stepwise_forward, metric = "Rsquared")+ geom_point(show.legend = FALSE)+ggtitle("FOR")+ theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
gridExtra::grid.arrange(p9,p10, nrow = 2)

gridExtra::grid.arrange(p11,p12,p13, ncol = 2)

Note that you can also set the candidate parameter values yourself in caret (look here for further information).
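
A minimal sketch of how that looks for glmnet, assuming you pass your own grid through the tuneGrid argument (the values below are purely illustrative):

glmnet_grid <- expand.grid(alpha  = seq(0, 1, by = 0.25),
                           lambda = 10^seq(-4, 0, length.out = 20))
glmnet_tuned <- train(formula, data = training,
                      method    = "glmnet",
                      trControl = fitControl,
                      tuneGrid  = glmnet_grid,
                      metric    = "RMSE")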

Model performance

So we have our models, and hopefully the best parameters for each of them. How can we know which model is the best one? We can compare how well the models fit, for example through their residuals. Here the DALEX package comes in handy: we will create objects called “explainers”. We will come back to these explainers later, but for now all we need to know is that we can feed them to the model_performance function and get the following graph:

gbm_explainer <- explain(gbmFit1, label="gbm", 
                         data = training,
                         y = training$medv)

bagging_explainer <- explain(bagging, label="treebag", 
                             data = training,
                             y = training$medv)

random_forest_explainer <- explain(random_forest, label="rf", 
                                   data = training,
                                   y = training$medv)

glm_explainer <- explain(glmnet_object, label="glmnet", 
                         data = training,
                         y = training$medv)

pcr_explainer <- explain(principal_component, label="pcr", 
                         data = training,
                         y = training$medv)

backward_explainer <- explain(stepwise_backward,
                              label="leapBackward", 
                         data = training,
                         y = training$medv)

forward_explainer <- explain(stepwise_forward,
                              label="leapForward", 
                              data = training,
                              y = training$medv)

lm_explainer <- explain(lm, label="lm", 
                             data = training,
                             y = training$medv)

knn_explainer <- explain(knn, label="knn", 
                        data = training,
                        y = training$medv)
mp_gbm <- model_performance(gbm_explainer)
mp_glm <- model_performance(glm_explainer)
mp_bag <- model_performance(bagging_explainer)
mp_rf <- model_performance(random_forest_explainer)
mp_pc <- model_performance(pcr_explainer)
mp_back <- model_performance(backward_explainer)
mp_for <- model_performance(forward_explainer)
mp_lm <- model_performance(lm_explainer)
mp_knn <- model_performance(knn_explainer)


plot(mp_gbm,mp_bag,mp_rf,mp_glm,mp_pc,mp_back,mp_for,mp_lm,mp_knn)

We can also get the same picture as a boxplot; all you need to do is specify geom = "boxplot" in the generic plot function.

plot(mp_gbm,mp_bag,mp_rf,mp_glm,mp_pc,mp_back,mp_for,mp_lm,mp_knn,
     geom = "boxplot")

So let us dig deeper into these two graphs. The first graph shows the performance of each model. How should you read it?
The y-axis is the percentage of observations, the x-axis is the residual. We can see that \(100\)% of the observations have residuals greater than or equal to \(0\). For the random forest model, only about 10% of the observations have residuals larger than ~\(2.5\). Or, if you prefer the short version:
>“The closer the line is to the y-axis - the better the model”

An interesting thing to notice (in general, not necessarily in this case) is that if two lines intersect, each model predicts better in a different region of the data. From this graph alone you cannot tell where each model performs better.

It works the same way for the “boxplot” graph:
>The closer the black vertical line is to the y-axis, the better the model.

The red dot is the RMSE. If one model’s box overlaps another model’s box and extends past it, then the other model is a better predictor on some part of the data.

In our case we can see that the tree-based models perform better than the linear models, and that the random forest gives the best predictions.
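
As a cross-check, you can also let caret compare the models on their resampling results; a sketch (glmnet is left out because we fit it with a different resampling scheme):

resamps <- resamples(list(gbm = gbmFit1, rf = random_forest, bagging = bagging,
                          pcr = principal_component, lm = lm, knn = knn))
summary(resamps)                   # RMSE / R-squared across resamples
dotplot(resamps, metric = "RMSE")  # visual comparison of the models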

We also want to check whether the models perform equally well at every price level, so we will predict on the testing data (the split we created ourselves, not the test data we got from “Kaggle”). Once we have the predictions from each model, we plot them against the actual medv of the testing data.

gbm_pred <- predict(gbmFit1, newdata = testing)
rf_pred <- predict(random_forest, newdata = testing)
bagging_pred <- predict(bagging, newdata = testing)
glm_pred <- predict(glmnet_object, newdata = testing)
pc_pred <- predict(principal_component, newdata = testing)
back_pred <- predict(stepwise_backward, newdata = testing)
for_pred <- predict(stepwise_forward, newdata = testing)
lm_pred <- predict(lm, newdata = testing)
knn_pred <- predict(knn, newdata = testing)
plotting <- function(model, name){
  ggplot(data.frame(real_price = testing$medv, predicted_price = model),
         aes(x = real_price, y = predicted_price, color = real_price)) +
    geom_point(alpha = 0.8, show.legend = FALSE) +
    geom_abline() +
    ggtitle(name) +
    theme(axis.title.x = element_blank(), axis.text.x = element_blank(),
          axis.ticks.x = element_blank(), axis.title.y = element_blank(),
          axis.text.y = element_blank(), axis.ticks.y = element_blank())
}

a <- plotting(gbm_pred, "Gbm")
b <- plotting(rf_pred, "Random Forest")
c <- plotting(bagging_pred, "Bagging")
d <- plotting(pc_pred, "Principal Component")
e <- plotting(back_pred, "Backward")
f <- plotting(for_pred," Forward")
g <- plotting(lm_pred, "Linear Model")
h <- plotting(knn_pred, "KNN")

grid.arrange(a,b,c,d, ncol=2)

grid.arrange(e,f,g,h, ncol=2)

We can see that, consistently, every model underestimates the prices of expensive houses. We want to check how large the deviation of the predictions from the actual prices is:

a <- rf_pred %>% as.data.frame() %>% rownames_to_column()  # rf predictions with their row labels
a$rowname <- a$rowname %>% as.numeric()
b <- a  %>% filter(.>cutoff)             # keep only predictions above the cutoff (35)
b$real <- Boston[b$rowname,][["medv"]]   # look up the actual medv for those rows
bias <- mean(b$.-b$real)                 # mean(prediction - actual): the bias on expensive houses
rm(a)
rm(b)

We get a bias of -2.7824878. That is actually quite a lot given our RMSEs, so we would like to apply a correction; we will address this issue later.
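
To put that number in context, here is a quick hold-out check of the winning model with caret’s postResample, using the predictions we already computed:

postResample(pred = rf_pred, obs = testing$medv)   # RMSE / R-squared on the held-out split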

Explanations

We have explored the data, built the models and tuned them, and now we can easily predict on the test set. But before that we want to introduce the concept of explanations for “black box” models.
ML models are considered “black boxes” because even though we know how the models were built, we don’t know what happens “inside” them. This is particularly important if you want to make decisions based on these models (and also for issues of “racist algorithms”). Say you want to improve your wine, and you have a model which predicts wine quality. What should you do with this info? (Yes, I know the model doesn’t check for causation. Get off my back.) In which parameter should you invest first? And how does each level of alcohol influence the quality? In order to answer these questions and more, one should use the concept of “explanations”; some neat R packages for this task are iml, breakDown, lime & DALEX. (You can see some nice examples here, here for image recognition, and here for text classification.) We will use some of the functions of DALEX (an extensive description can be found here).

We already saw some DALEX functions in action in the model-performance graphs. Now we will see two more: variable_importance and variable_response.
Variable importance is a concept we already know from principal components. Now we will see it again in a different setting. The function has its own way of calculating the importance of each variable (roughly, it permutes a variable and measures how much the loss increases); we will not go into the details. You can see the importance of each variable in the following graph:

importance_gbm <- variable_importance(gbm_explainer, 
                    loss_function = loss_root_mean_square)

importance_glm <- variable_importance(glm_explainer, 
                    loss_function = loss_root_mean_square)

importance_bag <- variable_importance(bagging_explainer, 
                    loss_function = loss_root_mean_square)

importance_rf <- variable_importance(random_forest_explainer, 
                    loss_function = loss_root_mean_square)

importance_pcr <- variable_importance(pcr_explainer, 
                    loss_function = loss_root_mean_square)

importance_back <- variable_importance(backward_explainer, 
                      loss_function = loss_root_mean_square)

importance_for <- variable_importance(forward_explainer, 
                      loss_function = loss_root_mean_square)

importance_lm <- variable_importance(lm_explainer, 
                      loss_function = loss_root_mean_square)

a <- plot(importance_bag,importance_gbm,importance_glm,
          importance_pcr)
b <- plot(importance_rf,importance_back,importance_for,
          importance_lm)
plot_grid(a,b)

As we can see from the graphs, the lstat and rm variables are the most important ones. So it is only natural to try to understand how those variables influence medv. For that we will use the variable_response function (with type = "pdp", i.e. partial dependence plots). We can see how the predicted price behaves across the range of those variables:

back_stat <- variable_response(backward_explainer,
                               variable = "lstat",
                               type = "pdp")
for_stat <- variable_response(forward_explainer,
                               variable = "lstat",
                               type = "pdp")
rf_stat <- variable_response(random_forest_explainer,
                              variable = "lstat",
                              type = "pdp")
bag_stat <- variable_response(bagging_explainer,
                             variable = "lstat",
                             type = "pdp")
gbm_stat <- variable_response(gbm_explainer,
                              variable = "lstat",
                              type = "pdp")
glm_stat <- variable_response(glm_explainer,
                              variable = "lstat",
                              type = "pdp")
lm_stat <- variable_response(lm_explainer,
                              variable = "lstat",
                              type = "pdp")
pcr_stat <- variable_response(pcr_explainer,
                             variable = "lstat",
                             type = "pdp")


trees_response_lstat <- plot(rf_stat,bag_stat,gbm_stat)
linear_response_lstat <- plot(back_stat,for_stat,glm_stat,
                        lm_stat,pcr_stat)
plot_grid(trees_response_lstat,linear_response_lstat)

back_stat <- variable_response(backward_explainer,
                               variable = "rm",
                               type = "pdp")
for_stat <- variable_response(forward_explainer,
                               variable = "rm",
                               type = "pdp")
rf_stat <- variable_response(random_forest_explainer,
                              variable = "rm",
                              type = "pdp")
bag_stat <- variable_response(bagging_explainer,
                             variable = "rm",
                             type = "pdp")
gbm_stat <- variable_response(gbm_explainer,
                              variable = "rm",
                              type = "pdp")
glm_stat <- variable_response(glm_explainer,
                              variable = "rm",
                              type = "pdp")
lm_stat <- variable_response(lm_explainer,
                              variable = "rm",
                              type = "pdp")
pcr_stat <- variable_response(pcr_explainer,
                             variable = "rm",
                             type = "pdp")


trees_response_rm <- plot(rf_stat,bag_stat,gbm_stat)
linear_response_rm <- plot(back_stat,for_stat,glm_stat,
                        lm_stat,pcr_stat)
plot_grid(trees_response_rm,linear_response_rm)

We can see that the “tree models” capture a non-linear shape that the linear models cannot.
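
DALEX can also explain a single prediction, i.e. show how each variable pushes the price of one specific house up or down. A sketch, with the caveat that the function name depends on your DALEX version (prediction_breakdown() in the older API used throughout this post, predict_parts() in current releases):

new_house <- testing[1, ]                       # one observation to explain
rf_breakdown <- prediction_breakdown(random_forest_explainer, new_house)
plot(rf_breakdown)                              # contribution of each variable to this prediction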

Final prediction

All that is left to do is predict on the test dataset with our best model - the random forest - and apply the bias correction for expensive houses:

rf_pred_test <- predict(random_forest, newdata = test)
logi <- c(rf_pred_test > cutoff)        # flag predictions above the cutoff (35)
bias_v <- logi * bias                   # bias correction only for those predictions
rf_pred_test <- rf_pred_test - bias_v   # bias is negative, so this shifts them upward
rf_outcome <- cbind(ID = test$ID, medv = rf_pred_test)
rownames(rf_outcome) <- rf_outcome[,"ID"]
write.csv(rf_outcome,"rf_outcome.csv")

And that’s it for now! We hope you had a nice time reading it.