0.1 Introduction

Reproducibility is the ability to replicate the findings of others using the same tools and data. Put simply, it means being able to take someone's code, understand it, reuse it to reproduce the same result, or modify it for one's own purpose. For example, one can fork an open source repository, run the same code on the same data, and reproduce the same result. Reproducibility is therefore centred around "replicability" and "repeatability".

Reproducibility in data science is important because:

- it reduces the time needed to revisit our old projects;
- it makes it easier to work together with team members;
- it allows published reports to be checked and reused by others.

This report provides a few best practices for delivering reproducible data science projects, most of which come from the software engineering domain. We take our previous project as an example to explain each of these concepts. That project was a group assignment in which we sought to answer the research question of whether a sample of key economic and financial variables, such as those used by Legacy Pricing Capital Management (LPCM), can sufficiently predict the movement of the ASX 200 for the following month.

0.2 Background

Figure: Levels of reproducibility (Medium, 2019)

According to Medium (2019), the standard of reproducibility can range across five levels, from "not reproducible", through code and documentation, up to a "gold standard". Data and metadata are a must for any published data project, because there is no way to reproduce the results without them. Code and documentation are widely considered sufficient for publishing data projects; however, to reach the highest level, easy installation and runtime environment reproduction are two further components that should be shipped together as part of the package.

After receiving a publication, the first step for any reproducer is to install the project, and at least 50% of publications fail at this step (Medium, 2019). There can be many reasons: complicated or no longer existing dependencies, poorly written README files, or even broken code. An installation procedure that keeps track of the different dependencies and installs everything is therefore important to smooth this first major step.

Even then, the goal of achieving results identical to the published paper with the same input data is not guaranteed. Publishers should also copy their software environment and offer it as a virtual machine or as a container. It should include the exact versions of the dependencies, because different versions can make the software behave differently or change the results of the analysis.
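For R projects like ours, one practical way to record and restore exact dependency versions is the renv package. The commands below are a minimal sketch of that idea, assuming renv; it was not part of our original workflow.

# Sketch: pinning dependency versions with renv (assumed, not used in the original project)
install.packages("renv")

renv::init()      # create a project-local library and a renv.lock lockfile
renv::snapshot()  # record the exact package versions currently in use

# A reproducer can then recreate the same environment with:
renv::restore()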

0.3 Version control

Version control keeps track of changes to a file or set of files over time so that you can recall specific versions later.

Figure: Git branching model (Leyden, 2019)

The general idea is to deliver your code to the appropriate branches for effective release management. In this model, there are two key principles to keep in mind: keep the master branch as the "as-yet-unreleased" mainline of development, and put any changes, updates or new features on separate branches, merging into master only when the code is well documented and ready for production.

Here we explain the process in short:

- Create a new branch from master for each change or new feature.
- Commit your work to that branch and push it to the shared repository.
- Open a pull request so the team can review the code and its documentation.
- Merge the branch back into master only when it is ready for production.

To apply this process to our assignment, each member should create a separate branch for the part they are working on (for example the exploratory analysis, the logistic regression model or the random forest model), commit and push their work to that branch, and merge it into master only after the code has been reviewed and documented.
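The commands below are a minimal sketch of this branching workflow; the branch name and file path are examples only.

# Start a feature branch from an up-to-date master
git checkout master
git pull
git checkout -b feature/random-forest-model

# Commit the work and push the branch to the shared repository
git add src/modeling/modeling.R
git commit -m "Add random forest modelling function"
git push origin feature/random-forest-model

# After review, merge the branch back into master, which stays release-ready
git checkout master
git merge feature/random-forest-model
git push origin master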

0.4 Code style

Code style, or coding conventions, is a set of rules or guidelines for writing code in a format that can be easily read by other developers and that helps avoid mistakes. There can be several code styles per programming language, so as a team we should agree on a coding standard. Without a consistent style between team members, the code becomes ugly, hard to read or even confusing.

A few popular coding standards:

- PEP 8 for Python
- the tidyverse style guide for R
- Google's style guides, which cover many languages

Here are a few major things a style guide typically covers:

- naming conventions for variables, functions and files
- indentation, spacing and line length
- how to write comments and documentation

Below are some tips which can make a developer's life easier:

- If you use an IDE such as PyCharm or Visual Studio Code, you can add extensions that highlight syntactical and stylistic problems in your code, such as Python linting.
- There are several tools that format code seamlessly and transparently, such as https://github.com/psf/black and https://prettier.io/
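As a concrete illustration for R, the snippet below runs a linter and a formatter over our modelling code. It assumes the lintr and styler packages, which we did not use in the original assignment.

# Sketch: automated style checking and formatting (assumes lintr and styler)
install.packages(c("lintr", "styler"))

lintr::lint("src/modeling/modeling.R")         # report style and syntax issues
styler::style_file("src/modeling/modeling.R")  # reformat the file in place to the tidyverse style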

0.5 API

An API is a set of defined methods for communication among various components. RESTful APIs in particular are popular among data scientists; they use the Hypertext Transfer Protocol (HTTP) to send and retrieve information between services or systems.

The idea of an API is to decompose a large system into distributed, modular units which can be deployed separately and connected together to form a whole application. An API can therefore be seen as an abstraction layer that delivers a simple interface to clients while hiding all kinds of complexity behind the scenes. The core principle of API design is to keep these components simple, reusable and deliverable.

To apply API design principles, we could design a couple of APIs for our assignment. Since we use several algorithms in our project, we would expose them as APIs, such as the logistic regression model, the random forest model and the time series model. Once built, team members can call an API from anywhere to use it in their reports, and it can later be upgraded to new versions without breaking their old reports.

API: Predict by logistic regression (one set of variables)

- Input: a set of independent variables
- Outcome: Yes / No (Yes: predicted increase of the ASX; No: predicted decrease of the ASX)

API: Predict by logistic regression (array of sets)

- Input: an array of sets of independent variables
- Outcome: array of Yes / No (Yes: predicted increase of the ASX; No: predicted decrease of the ASX)
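As an illustration only, the sketch below shows how the single-prediction endpoint could be exposed with the plumber package. The file name, the saved model object and the subset of predictors are assumptions; we did not actually deploy this API in the assignment.

# predict_api.R -- hypothetical plumber endpoint (sketch only)
library(plumber)

# Assumes a fitted logistic regression model saved earlier with saveRDS()
glm_model <- readRDS("models/glm_asx.rds")

#* Predict ASX direction from one set of independent variables
#* @post /predict
#* @param djia:numeric
#* @param pe_ratio:numeric
#* @param oecd_li:numeric
function(djia, pe_ratio, oecd_li) {
  newdata <- data.frame(
    djia     = as.numeric(djia),
    pe_ratio = as.numeric(pe_ratio),
    oecd_li  = as.numeric(oecd_li)
  )
  prob <- predict(glm_model, newdata = newdata, type = "response")
  list(prediction = ifelse(prob > 0.5, "Yes", "No"))
}

# Start the API from the R console:
# plumber::plumb("predict_api.R")$run(port = 8000)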

0.6 Modules and Packages

A Python module allows us to logically organize code by splitting it into several files for easier maintenance. Simply put, a module is a file of code in which we can define functions, classes and variables. When you have a large number of modules, a package is a way to organize them by structuring the module namespace hierarchically using dot notation.

You may be unsure whether an API or a module is a better fit for your project; this is the long-standing debate between service-oriented and modular architectures. From a reproducibility point of view they take the same approach: split a large application into a number of smaller units which are independent, simple, testable and deliverable. We recommend the modular approach for small and medium applications, and reserve the service-oriented approach for large and very large systems, because of its complexity, cost and resource consumption.

In our project, the modular approach can be applied by breaking a few repeated tasks into functions which can later be reused. We have separated our code into multiple functions, which you can see in the Appendixes; these functions can be reused in our next project.

- Function to run all machine learning models
- Function to run cross validation
- Function to call the random forest model
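A minimal sketch of how these functions can be reused, assuming the module lives at src/modeling/modeling.R as in the Appendixes:

# Load the modelling module and reuse its functions in a report or a new project
source("src/modeling/modeling.R")

results <- run_all_algorithm(combizs)       # compare all machine learning models
res_cv  <- cross_validation(combizs[, -1])  # k-fold cross validation of logistic regression
rf      <- run_random_forest(combizs)       # train the random forest model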

0.7 Documentation

“It doesn’t matter how good your software is, because if the documentation is not good enough, people will not use it.”
- Daniele Procida

Documenting is highly recommended if you want to save time when revisiting your project at a later date, or to make it easier for others to reuse your code or contribute to it publicly. Documentation helps your audience, specifically your users and your developers, understand the project and use it properly, whereas poor documentation creates a frustrating feeling that slows you down or deters you from working efficiently. As with readable code, documentation should be kept short, concise and straightforward.

There are several types of documentation that could apply to our project:

Project documentation:

A README file that describes what the project does, how to install its dependencies and how to run the analysis, so that a reproducer can get past the installation step easily.

Comments:

Comments should be kept brief and focused (see the four essential rules at https://blog.codinghorror.com/when-good-comments-go-bad/). Below is a commented example from our project:


# Preparing data for train
combi_ml <- combizs[, -1]
t <- createDataPartition(combi_ml$direction, p = 0.8, list = FALSE)
combi_train <- combi_ml[t, ]
combi_test <- combi_ml[-t, ]


# Logistic regression with train set
glm1 <- glm(direction ~ ., family = binomial(logit), data = combi_train)
summary(glm1)
plot(glm1)

# predict with train set
probability <- predict(glm1, newdata = combi_test, type = "response")
plot(probability)

# setting a threshold value of 0.5 for positive...
# you may want to see if there are better settings that you
# could use here (Hint: do a search for "ROC curve")
prediction <- ifelse(probability > 0.5, 1, 0)

# building a contingency table of the counts at each combination of factor levels
confusion <- table(combi_test$direction, prediction)
confusion

Docstrings

A docstring is a string literal in the source code, normally placed immediately before a module, function, class or method definition, that documents that specific segment of code. In R, the equivalent is a roxygen comment block, as in the example below.

#' Run random forest
#' 
#' @description This function trains a random forest model on the
#' prepared dataset using 10-fold cross-validation.
#'
#' @param combizs dataframe
#' @usage run_random_forest(combizs)
#' @return a list containing the fitted model and the train control object
#' 
#' @examples
#' run_random_forest(combizs)

0.8 Test

As mentioned earlier, a key principle of reproducibility is the quality of the individual components being delivered, and there is no better way to guarantee that quality than appropriate testing. Testing is a broad topic, but here we focus on unit tests and code coverage. A unit test validates that each unit of the software performs as designed, while code coverage measures the proportion of your code that is exercised by your unit tests. Having sufficient tests increases confidence when changing and maintaining code, and speeds up development.

In our project, we use the "testthat" package to test a few of our defined machine learning model functions.

# Install the released version from CRAN
install.packages("testthat")

# Function under test (defined in modeling.R, body omitted here)
run_all_algorithm <- function(combizs) {
    ...
}

test_that("Check run all algorithm returns a list", {
    results <- run_all_algorithm(combizs)
    expect_type(results, "list")
})

Run the test:


test_file("model_test.R")

# If the test passes, the reported type is [1] "list"
# If it fails, the output is an error such as:
# Error: Test failed: 'Check run all algorithm returns a list' * NULL has type `NULL`, not `list`.

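For the code coverage side, one option is the covr package; the sketch below is an assumption rather than part of our original test setup.

# Sketch: measure how much of modeling.R the tests exercise (assumes the covr package)
install.packages("covr")

cov <- covr::file_coverage(
  source_files = "src/modeling/modeling.R",
  test_files   = "test/model_test.R"
)
covr::percent_coverage(cov)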

0.9 Containerisation

A container is a standardized unit of software that packages up code and all of its dependencies into a single object which can run consistently in any environment, from a developer's laptop to a testing environment, or from staging to production. Conceptually, containers are similar to virtual machines in terms of resource isolation and allocation benefits. However, virtual machines take a full-isolation approach, which includes abstraction of the physical hardware, operating system and processes, whereas containers leverage the host operating system and run each container as an isolated process in user space, so they take up less space, use less RAM and deliver better performance.

A Docker container image packages all code, runtime, system tools, system libraries and settings into a lightweight, standalone, executable object which becomes a container at runtime. As a package, a container isolates itself from the host environment, so it always runs the same regardless of the infrastructure. In this regard, a Docker image can serve as a standard package that is portable, shareable and reusable.

Docker architecture

Figure: Docker architecture (Docker, 2019)

Docker uses a client-server architecture, in which the client talks to the daemon through a REST API, over UNIX sockets or a network interface. The Docker client provides an interface, either a command line or an application, which allows users to interact with the Docker daemon: it converts commands into Docker API calls and sends them to the daemon, and one client can communicate with more than one daemon.

A Docker registry is a place to store Docker images: users can push images to the registry, pull them back to use, and manage them there. Docker Hub hosts the world's largest library and community of container images, but users can also host a private registry on their own desktop or private server.
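As an illustration of this workflow, the commands below use example repository and tag names; dsp-3-rmd is the image we build later in this section.

# Sketch: sharing an image through a registry (repository name is an example)
docker tag dsp-3-rmd ourteam/dsp-3-rmd:1.0
docker push ourteam/dsp-3-rmd:1.0     # push to Docker Hub or a private registry
docker pull ourteam/dsp-3-rmd:1.0     # teammates pull and run the exact same image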

A few familiar terms when working with Docker:

- Docker image: a read-only template with instructions for creating a Docker container
- Docker container: a runnable instance of an image

How containers can be used to create and share data science environments

It is common for a data scientist to need to reproduce an analysis, share their work, have another team repeat the result, or reuse it in different projects. Containers fit this type of mission well, enabling the scientist to build, test and ship a containerised analysis from a local machine to other environments.

As a publisher shipping a data science project, a data scientist has a few options:

- share the code and data together with documentation and installation instructions for all dependencies;
- copy the software environment and offer it as a virtual machine image;
- ship a Dockerfile or a pre-built Docker image so that the runtime environment can be reproduced exactly.

Within a team, the Dockerfile is shared on GitHub, so every member can download it and run a Docker instance. For large and complex projects which require several systems to run at the same time, Docker is an excellent candidate, since it is designed for multiple containers to run together seamlessly on the same host machine: one container can run the database, another the API, another the machine learning model, and so on. It takes just one or two command lines to get a container up and running.

How we design containers for our project

In our project, we build a Dockerfile and push it to GitHub, so every member can pull it onto their laptop and get the container running.

Our Dockerfile:

- The image is built on top of rocker/r-rmd, since we need rmarkdown to generate the HTML for our report.
- When running Docker, we bind our source code on the host machine into the container, so that when the report is generated in the container we can also retrieve it on the host machine.

FROM rocker/r-rmd:latest

# Make folder to host the source code
RUN mkdir /analysis

# Install the packages
RUN R -e 'install.packages(c("Amelia", \
                              "mlbench", \
                              "psych", \
                              "corrplot", \
                              "lattice", \
                              "munsell", \
                              "ggplot2", \
                              "caret", \
                              "randomForest", \
                              "e1071", \
                              "plyr"))'
                              
# Run the rmarkdown to generate the report
CMD ["R", "-e", "rmarkdown::render('/analysis/notebook/eda/vinc_eda.Rmd', output_file='/analysis/notebook/eda/vinc_eda.html')"]

# Following commands are to build and run docker
# docker build -t dsp-3-rmd .
# docker run -v ~/vincent-dsp-3-stats:/analysis  dsp-3-rmd
# Remember to replace ~ with your current directory

0.10 Conclusion

Reproducibility is centred around replicability and repeatability, that is, the ability of another person to produce the same results using the same tools and the same data. In the field of data science it is important and beneficial because it reduces the time needed to revisit our old projects, to work together with our team members and to publish reports. Code and documentation are widely considered sufficient to publish a reproducible data project, but according to Medium (2019), the gold standard of data science publication is a combination of the publication itself, data, code, documentation, easy installation and runtime environment reproduction.

Several tools and practices can help achieve this gold standard for published data projects. In this regard, this paper provides a basic understanding of reproducibility and introduces a few best practices and tools specifically for the publication of data science projects. We also use our own project as a sample, which makes the concepts easier to explain and gives the reader some hands-on experience.

So far, we have covered version control, code style, APIs, modules and packages, documentation, testing and containerisation. Among these, we particularly recommend using containers to deliver your data publications: they fit data science projects well in terms of easy installation and runtime environment reproduction, because each container is meant to be a standardized unit of software that packages up code and all its dependencies into a single object which can run consistently in any environment.

0.11 References

Docker. (2019). Docker overview. [online] Available at: https://docs.docker.com/engine/docker-overview/ [Accessed 5 Nov. 2019].

Docker. (2019). What is a Container? | Docker. [online] Available at: https://www.docker.com/resources/what-container [Accessed 5 Nov. 2019].

Leyden, T. (2019). A successful git branching model with enterprise support - Seven Story Rabbit Hole. [online] Tleyden.github.io. Available at: http://tleyden.github.io/blog/2014/04/09/a-successful-git-branching-model-with-enterprise-support/ [Accessed 6 Nov. 2019].

Medium. (2019). Data Science’s Reproducibility Crisis. [online] Available at: https://towardsdatascience.com/data-sciences-reproducibility-crisis-b87792d88513 [Accessed 5 Nov. 2019].

Medium. (2019). Scientific Data Analysis Pipelines and Reproducibility. [online] Available at: https://towardsdatascience.com/scientific-data-analysis-pipelines-and-reproducibility-75ff9df5b4c5 [Accessed 6 Nov. 2019].

0.12 Appendixes

Below is the source code for our previous project, which has been rewritten for reproducibility. It includes the package setup, data preparation, exploratory analysis, modelling code, the test file and the modeling.R module.

knitr::opts_chunk$set(echo = TRUE)
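# Install any packages that are missing, then load them all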
for (package in c("Amelia",
                  "mlbench",
                  "psych",
                  "corrplot",
                  "lattice",
                  "munsell",
                  "ggplot2",
                  "caret",
                  "randomForest",
                  "e1071",
                  "plyr")) {
  if (!package %in% installed.packages()) {
    install.packages(package)
  }
  if (!package %in% .packages()) {
    library(package, character.only = TRUE)
  }
}

Preparation


# include defined modelling functions
source("../../src/modeling/modeling.R")
library(tidyverse)

combi <- read_csv("../../data/processed/final_combi.csv")
combizs <- read_csv("../../data/processed/final_combi_zs.csv")

# Format date for combi
combi$date <- as.Date(combi$Date, "%Y-%m-%d")
glimpse(combi)

# Add date to combi zs
date_seq <- seq(as.Date("2005-01-01"),
               by = "1 month",
               length.out = nrow(combizs))
combizs_d <- combizs
combizs_d$date <- date_seq

After munging and merging the data, we have 15 variables and 175 observations, covering the 175 months from 2005 to 2019.

Check missing data

Missing data can have a big impact on modelling, so we check again whether any data is missing. We use a missingness map, in which the variables are on the y-axis and the observations are on the x-axis. The graph below shows no red, so there is no missing data in this dataset.

library(Amelia)
missmap(combi, col=c("red", "lightgreen"), legend=FALSE)

Remove missing values

combi <- combi %>% drop_na()
glimpse(combi)

After removing missing values, one row of data is dropped. Now we have 15 variables with 174 observations.

Feature engineering data

As the goal is to predict the up/down movement of the asx variable, we add a column called 'direction' with value 1 for up and 0 for down. In this report, we use direction as the response variable; it shows whether the asx went up or down since the previous month.


# Add 1 row at top 
asxt <- rbind(combi[1,], combi)
asxt <- asxt[-nrow(asxt),]

combi$direction = combi$asx - asxt$asx
for (i in seq_along(combi$direction)) {
    t = combi$direction[[i]]
    if (t > 0) {
        combi$direction[[i]] = 1
    } else {
        combi$direction[[i]] = 0
    }
}

# Switch to factor direction for combi
combi$direction = as.factor(combi$direction)

# Switch to factor direction for combi zs
colnames(combizs)[colnames(combizs) == "binary_asx"] <- "direction"
combizs_nofactor  = combizs
combizs$direction = as.factor(combizs$direction)


head(combi %>% select(asx, direction))

Visualise data

Starting with histograms gives us an indication of the distribution of each variable.

library(psych)
combi %>%
  keep(is.numeric) %>% 
    multi.hist() 

Dividend, direction and unemployment are quite skewed to one side, while asx, abs_imports and unemployment have roughly normal distributions; the remaining variables look roughly bimodal. What we can take from this plot is that unemployment and abs_imports have distributions fairly similar to asx, and that up months are noticeably more frequent than down months.

par(mfrow = c(1, 8))
for (i in 1:8) {
  boxplot(combi[, i], main = names(combi)[i])
}

par(mfrow = c(1, 8))
for (i in 9:15) {
  boxplot(combi[, i], main = names(combi)[i])
}

The distributions can also be seen in the box plots. Here, oecd_li, pe_ratio and dividend have a few outliers, while rba_cash_ra and iron are fairly evenly spread.

How we should deal with these outliers is still an open question.

Now, we plot the correlation between each pair of numeric variables to give an idea of which variables change together.

library(corrplot)
correlations <- cor(combi[, 1:15])
corrplot(correlations, method = "circle")

In the correlation matrix, blue represents positive correlation, red negative, and a larger dot a stronger correlation. In this report we focus on the correlation between asx and the other variables. In this plot, asx appears to have a positive correlation with djia, pe_ratio, oecd_li, abs_imports and abs_exports, and a negative correlation with dividend, yearly_inflation, iron and exchange_rate.

Given those highly correlated variables, we put them into a scatter plot matrix with the up/down direction as an indicator: blue for up and red for down.

group <- NA
group[combi$direction == 1] <- 1
group[combi$direction == 0] <- 2
pairs(combi %>% select("asx", 
                       "djia", 
                       "pe_ratio",
                       "oecd_li",
                       "abs_imports", 
                       "abs_exports", 
                       "direction"), 
      col = c("blue", "red")[group])

djia and pe_ratio show some signs of positive correlation with asx, but nothing too strong.

Next, we take a further step and view the distribution of each variable broken down by direction. The plot below shows that the distributions for up and down are quite similar across all variables. There are slight differences between up and down for iron, oil, quarterly_inflation and abs_exports, but it is still hard to predict anything from them. In general, it is hard to predict up and down using only one or two variables.

library(caret)

# Density plot of every predictor, split by direction
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
featurePlot(x = combi[, 1:15], 
            y = combi$direction, 
            plot = "density", 
            scales = scales, 
            auto.key = list(columns = 3))

Next, we explore the trends of all variables alongside asx to observe how they differ. We log-scale the variables so they can be placed on the same plot without losing their individual trends.


combi_t <- combi[, c(1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16)] %>%
  gather(key = "indicator", value = "value", -Date)

ggplot(combi_t, aes(x = Date, y = value)) + 
  geom_line(aes(color = indicator), size = 1) + 
  theme_minimal() +
  scale_y_log10()

Even with the variables on a log scale, it is still hard to discover a common trend across them; however, no variable appears to share the same trend as asx.

Compare different machine learning algorithms


if (exists("run_all_algorithm", mode = "function")) {
  source("../../src/modeling/modeling.R")
}

results <- run_all_algorithm(combizs)
summary(results)
dotplot(results)

Run k-fold logistic regression

Running logistic regression with a random partition

library(randomForest)
library(caret)
library(e1071)

# Preparing data for train
combi_ml <- combizs[, -1]
t <- createDataPartition(combi_ml$direction, p = 0.8, list = FALSE)
combi_train <- combi_ml[t, ]
combi_test <- combi_ml[-t, ]


# Logistic regression with train set
glm1 <- glm(direction ~ ., family = binomial(logit), data = combi_train)
summary(glm1)
plot(glm1)

# predict with train set
probability <- predict(glm1, newdata = combi_test, type = "response")
plot(probability)

# setting a threshold value of 0.5 for positive...
# you may want to see if there are better settings that you
# could use here (Hint: do a search for "ROC curve")
prediction <- ifelse(probability > 0.5, 1, 0)

# building a contingency table of the counts at each combination of factor levels
confusion <- table(combi_test$direction, prediction)
confusion

Running default k-fold


library(lattice)
library(munsell)
library(ggplot2)
library(caret)


# combi_ml <- combi[,c(-16, -1)]
combi_ml <- combizs[, -1]

# Run algorithms using 10-fold cross validation
# control <- trainControl(method="cv", number=10)
control <- trainControl(
  method = "cv",
  number = 10,
  search = "grid"
)
metric <- "Accuracy"

# Train the model
set.seed(100)
fit.glm <- train(direction ~ ., data = combi_ml, method = "glm", metric = metric, trControl = control)

# Summarize the results
print(fit.glm)

Running k-fold manually


# Load data
combi_ml <- combizs[, -1]

# include function
if (exists("cross_validation", mode = "function")) {
  source("../../src/modeling/modeling.R")
}

# Cross validation (customized)
res_cv <- cross_validation(combi_ml)
acc <- res_cv$acc
fpr <- res_cv$fpr
fnr <- res_cv$fnr
misclss <- res_cv$misclss
mean(acc)

# Histogram of accuracy
hist(acc,
  xlab = "Accuracy",
  ylab = "Freq",
  col = "cyan",
  border = "blue",
  density = 30
)

# Boxplot of accuracy
boxplot(acc,
  col = "cyan",
  border = "blue",
  horizontal = T,
  xlab = "Accuracy",
  main = "Accuracy CV"
)

# Confusion matrix and plots of fpr and fnr
mean(fpr)
mean(fnr)
hist(fpr,
  xlab = "% of fnr",
  ylab = "Freq",
  main = "FPR",
  col = "cyan",
  border = "blue",
  density = 30
)

hist(fnr,
  xlab = "% of fnr",
  ylab = "Freq",
  main = "FNR",
  col = "cyan",
  border = "blue",
  density = 30
)

hist(misclss,
  xlab = "% of misclassification",
  ylab = "Freq",
  main = "Misclassification",
  col = "cyan",
  border = "blue",
  density = 30
)

Run random forests

Step 1. Use default settings to train the model

Build the model with the default values:


# include function
if (exists("run_random_forest", mode = "function")) {
  source("../../src/modeling/modeling.R")
}

# Cross validation (customized)
res <- run_random_forest(combizs)
rf_default <- res$rf_default
tr_control <- res$tr_control


# Print the results
plot(rf_default)

The algorithm tested three different values of mtry: 2, 8 and 14. The optimum value is 14, with accuracy 0.64. The next step is to find a better mtry.

Step 2. Find a better mtry

Test the model with values of mtry from 1 to 14.

set.seed(100)
tuneGrid <- expand.grid(.mtry = c(1:14))
rf_mtry <- train(direction ~ .,
  data = combi_train,
  method = "rf",
  metric = "Accuracy",
  tuneGrid = tuneGrid,
  trControl = tr_control,
  importance = TRUE,
  nodesize = 14,
  ntree = 300
)
plot(rf_mtry)
best_mtry <- rf_mtry$bestTune$mtry
max(rf_mtry$results$Accuracy)

We find that the optimum value of mtry is 10, with accuracy 0.69, so we go with 10.

Step 3. Search for the best maxnodes

store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(5:20)) {
  set.seed(100)
  rf_maxnode <- train(direction ~ .,
    data = combi_train,
    method = "rf",
    metric = "Accuracy",
    tuneGrid = tuneGrid,
    trControl = tr_control,
    importance = TRUE,
    nodesize = 14,
    maxnodes = maxnodes,
    ntree = 300
  )
  current_iteration <- toString(maxnodes)
  store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry)

Running maxnodes from 5 to 20, the optimum value is 13.

Step 4. Search for the best ntree

store_maxtrees <- list()
for (ntree in c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)) {
  set.seed(100)
  rf_maxtrees <- train(direction ~ .,
    data = combi_train,
    method = "rf",
    metric = "Accuracy",
    tuneGrid = tuneGrid,
    trControl = tr_control,
    importance = TRUE,
    nodesize = 14,
    maxnodes = 13,
    ntree = ntree
  )
  key <- toString(ntree)
  store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
summary(results_tree)

The best value of ntree is 300. So finally, we have:

- mtry = 10
- maxnodes = 13
- ntree = 300

Predict with the model

Now we evaluate the model on the test set:

fit_rf <- train(direction ~ .,
  combi_train,
  method = "rf",
  metric = "Accuracy",
  tuneGrid = tuneGrid,
  trControl = tr_control,
  importance = TRUE,
  nodesize = 14,
  ntree = 300,
  maxnodes = 13
)

prediction <- predict(fit_rf, combi_test)
confusionMatrix(prediction, combi_test$direction)

We get an accuracy of 0.7222, which is higher than with the default settings.

Visualise the result


varimp_rf <- varImp(fit_rf)
plot(varimp_rf, main = "Variable Importance (Random Forest)")

## ------------------------------------------------------------ ##

Test file

library("testthat")

source("src/modeling/modeling.R")
source("test/prepare_data.R")

combizs <- prepare_data()

test_that("Check run all agorithm return list", {
  expect_type(run_all_algorithm(run_all_algorithm(combizs)), "list")
})

## ------------------------------------------------------------ ##

File modeling.R

This file defines the public functions which are used in the report and are also reusable in other projects:


library(lattice)
library(munsell)
library(ggplot2)
library(caret)
library(randomForest)
library(e1071)
library(plyr)  # provides create_progress_bar(), used in cross_validation()


run_all_algorithm <- function(combizs) {
  combi_ml <- combizs[, -1]

  # Run algorithms using 10-fold cross validation
  control <- trainControl(
    method = "cv",
    number = 10,
    search = "grid"
  )

  metric <- "Accuracy"

  # linear discriminant analysis
  set.seed(100)
  lda <- train(direction ~ .,
    data = combi_ml,
    method = "lda",
    metric = metric,
    trControl = control
  )

  # rpart
  set.seed(100)
  cart <- train(direction ~ .,
    data = combi_ml,
    method = "rpart",
    metric = metric,
    trControl = control
  )


  # kNN
  set.seed(100)
  knn <- train(direction ~ .,
    data = combi_ml,
    method = "knn",
    metric = metric,
    trControl = control
  )


  # svm
  set.seed(100)
  svm <- train(direction ~ .,
    data = combi_ml,
    method = "svmRadial",
    metric = metric,
    trControl = control
  )


  # random forest
  set.seed(100)
  rf <- train(direction ~ .,
    data = combi_ml,
    method = "rf",
    metric = metric,
    trControl = control
  )


  # logistic regression
  set.seed(100)
  glm <- train(direction ~ .,
    data = combi_ml,
    method = "glm", metric = metric, trControl = control
  )

  # summarize accuracy of models
  results <- resamples(list(
    lda = lda,
    cart = cart,
    knn = knn,
    svm = svm,
    rf = rf,
    glm = glm
  ))

  results
}


cross_validation <- function(combi_ml) {
  # False positive rate
  fpr <- NULL

  # False negative rate
  fnr <- NULL

  # Miss classification
  misclss <- NULL

  # Number of iterations
  k <- 10

  # Initialize progress bar
  pbar <- create_progress_bar("text")
  pbar$init(k)

  # Accuracy
  acc <- NULL

  set.seed(123)

  for (i in 1:k)
  {
    # Train-test splitting
    # 95% of samples -> fitting
    # 5% of samples -> testing
    smp_size <- floor(0.95 * nrow(combi_ml))
    index <- sample(seq_len(nrow(combi_ml)), size = smp_size)
    train <- combi_ml[index, ]
    test <- combi_ml[-index, ]

    # Fitting
    model <- glm(direction ~ ., family = binomial, data = train)

    # Predict results
    results_prob <- predict(model, test, type = "response")

    # If prob > 0.5 then 1, else 0
    results <- ifelse(results_prob > 0.5, 1, 0)

    # Actual answers
    answers <- test$direction

    # Accuracy calculation
    mis_error <- mean(answers != results)

    # Collecting results
    acc[i] <- 1 - mis_error

    # Confusion matrix
    cm <- confusionMatrix(table(results, answers))
    fpr[i] <- cm$table[2] / (cm$table[1] + cm$table[2])
    fnr[i] <- cm$table[3] / (cm$table[3] + cm$table[4])
    misclss[i] <- (cm$table[2] + cm$table[3]) / (nrow(combi_ml) - smp_size)

    pbar$step()
  }

  list(
    "acc" = acc,
    "fpr" = fpr,
    "fnr" = fnr,
    "misclss" = misclss
  )
}

#' Run random forest
#' 
#' @description This function trains a random forest model on the
#' prepared dataset using 10-fold cross-validation.
#'
#' @param combizs dataframe
#' @usage run_random_forest(combizs)
#' @return a list containing the fitted model and the train control object
#' 
#' @examples
#' run_random_forest(combizs)

run_random_forest <- function(combizs) {
  
  combi_ml <- combizs[, -1]
  t <- createDataPartition(combi_ml$direction, p = 0.8, list = FALSE)
  combi_train <- combi_ml[t, ]
  combi_test <- combi_ml[-t, ]
  
  # Define the control
  tr_control <- trainControl(
    method = "cv",
    number = 10,
    search = "grid"
  )
  set.seed(1234)
  
  # Run the model
  rf_default <- train(direction ~ .,
    data = combi_train,
    method = "rf",
    metric = "Accuracy",
    trControl = tr_control
  )
  
  list(
    "rf_default" = rf_default,
    "tr_control" = tr_control
  )
}