Info about the activity

Objectives

By the end of this activity, you should be able to

  1. Apply the pipe operator to increase the readability of your code

  2. Use the tidyverse system of libraries for simple data manipulation

  3. Produce scatterplots in R

  4. Fit a decision tree and plot it

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer, which you type right in this document; others ask you to modify the R code, which you also do right here. In either case, you may discuss your work with your peers and with the instructor.

Data

We will work with a credit card fraud dataset.

Source: https://www.kaggle.com/mlg-ulb/creditcardfraud

This is a collection of credit card transactions in Europe. The original features have been transformed to ensure anonymity.

library(tidyverse)  # data transformation and plotting
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret)  # framework for training ML models
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(rpart)  # decision tree fitting
library(rpart.plot)  # decision tree plotting

C <- read_csv("creditcard.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double()
## )
## i Use `spec()` for the full column specifications.
head(C)
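
If you want a quick overview of all columns and their types, glimpse() from the tidyverse is a handy complement to head() (an optional one-liner):

glimpse(C)  # lists every column with its type and first few values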

Pipe operator

An English text is read from left to right, but functions in mathematics are applied from right to left, i.e., \[ \ln(\cos(\exp\sqrt{7})) \] is calculated by starting with the number \(7\), then applying the square root function, then the exponential function, and so on. In R (as in most programming languages), the order is also from right to left:

log(cos(exp(sqrt(7))))
## [1] -3.143688

It is, however, often more convenient to apply functions from left to right. This is done with the pipe operator %>% from the libraries tidyverse and magrittr (the former is a large collection of tools, the latter contains just the pipe operator):

7 %>% sqrt %>% exp %>% cos %>% log
## [1] -3.143688

If a function used in such a pipe chain needs several arguments, then whatever comes in from the left of the pipe is inserted as the first argument. For example,

max(exp(-3), 2)
## [1] 2

is the same as

-3 %>% exp %>% max(2)
## [1] 2
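
The same insertion rule applies when the remaining arguments are named. For instance (a small illustration):

signif(pi, digits = 3)     # rounds pi to 3 significant digits, i.e., 3.14
pi %>% signif(digits = 3)  # pi is inserted as the first argument of signif(), same result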

Pipe operator for data transformation

Let’s say we want to apply the following transformations to our dataset C:

  1. Keep variables from V10 to Class and remove the rest of the variables

  2. Remove the variable V20

  3. Change values of Class from 0 and 1 to no and yes

  4. Extract observations with Amount > 100

  5. Introduce the new variable VAR whose value is V10+V11-V12

  6. Sort the resulting dataset by Amount

This is doable in base R (and in Python / Pandas) but requires noticeably more effort. With the pipe operator and the data transformation functions select, filter, mutate, and arrange (these functions come from the library dplyr, which is part of the tidyverse), this sequence of transformations is much easier to implement:

A <- C %>%
  select(V10:Class) %>%
  select(-V20) %>%
  mutate(Class = ifelse(Class == 1, "yes", "no")) %>%
  filter(Amount > 100) %>%
  mutate(VAR = V10+V11-V12) %>%
  arrange(Amount)

head(A)
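
For comparison, here is roughly what the same sequence of steps could look like in base R (a sketch only, assuming the columns appear in the file in the order V10, ..., Class):

A_base <- C[, which(names(C) == "V10"):which(names(C) == "Class")]  # keep V10 through Class
A_base$V20 <- NULL                                                  # remove V20
A_base$Class <- ifelse(A_base$Class == 1, "yes", "no")              # recode Class
A_base <- A_base[A_base$Amount > 100, ]                             # keep transactions over 100
A_base$VAR <- A_base$V10 + A_base$V11 - A_base$V12                  # new variable VAR
A_base <- A_base[order(A_base$Amount), ]                            # sort by Amount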

If you are interested in a Python analogue of this chain of transformations, here are some resources:

Question

Write a pipe chain that transforms our dataset as follows:

  1. Remove the time stamp

  2. Change values of Class from 0 and 1 to no and yes

  3. Convert Class from character to factor

  4. Sort all records by Amount in decreasing order

Save the result to the variable A again; later we will construct predictive models for A.

A <- C %>%
  select(-Time) %>%
  mutate(Class = ifelse(Class == 1, "yes", "no")) %>%
  mutate(Class = as.factor(Class)) %>%
  arrange(-Amount)

head(A)
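
A quick optional check that Class has indeed become a factor:

str(A$Class)     # should report a factor
levels(A$Class)  # should show the levels "no" and "yes"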

Scatterplots

Plots in the tidyverse ecosystem are produced with the library ggplot2 (also available in Python). It is more powerful and easier to work with than base R graphics. It is arguably not as good as plotly, but plotly only produces interactive plots; interactivity is usually an advantage, but it becomes a problem if you just need to write an ordinary non-interactive report.

Here is how we do a scatterplot:

ggplot(data = A, aes(x = V1, y = V2)) + 
  geom_point()

And here is how we can colour it according to Class and add transparency so that it is easier to see what is going on:

ggplot(data = A, aes(x = V1, y = V2, color = Class)) + 
  geom_point(alpha = 0.2)
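
ggplot2 builds plots by adding layers with +. For example, a title and axis labels can be added with labs() (an optional sketch):

ggplot(data = A, aes(x = V1, y = V2, color = Class)) + 
  geom_point(alpha = 0.2) +
  labs(title = "Transactions by class", x = "V1", y = "V2")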

Question

The variable ‘Class’ contains the value ‘yes’ for fraudulent transactions and ‘no’ for genuine transactions. Find the percentages of fraudulent and genuine transactions. Is the dataset well balanced?

Answer

You are probably familiar with the function table from base R:

table(A$Class) / nrow(A)
## 
##          no         yes 
## 0.998272514 0.001727486

There is also a way to do this with tidyverse functions (we will later cover these in more depth):

A %>%
  group_by(Class) %>%
  summarise(N = n()) %>%
  ungroup %>%
  mutate(Frequency = N / sum(N))
## `summarise()` ungrouping output (override with `.groups` argument)
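
A shorter equivalent uses count(), which combines group_by() and summarise(n = n()) in one step:

A %>%
  count(Class) %>%
  mutate(Frequency = n / sum(n))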

Training and test datasets

Normally, when a lot of data are available, one sets aside only a small portion as the test set. However, since our data are highly imbalanced, we will split them into equally sized training and test sets.

set.seed(128)
ind_train <- sample(1:nrow(A), round(nrow(A)/2))
train_data <- A %>% slice(ind_train)
test_data <- A %>% slice(-ind_train)
cat("Dimensions of training data are", dim(train_data), "\n")
## Dimensions of training data are 142404 30
cat("Dimensions of test data are", dim(test_data), "\n")
## Dimensions of test data are 142403 30
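
Note that this is a plain random split. If you want both halves to preserve the proportion of frauds, caret's createDataPartition() performs a stratified split (a sketch, not used in the rest of this activity):

set.seed(128)
ind_strat <- as.integer(createDataPartition(A$Class, p = 0.5, list = FALSE))  # stratified by Class
train_strat <- A %>% slice(ind_strat)
test_strat  <- A %>% slice(-ind_strat)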

Modelling

Training the model

Below we fit a decision tree to the training data and plot it:

mod_tree <- rpart(Class ~ . , data = train_data)
# saveRDS(mod_tree, "tree_rf.rds")
rpart.plot(mod_tree)

If we want to regularize it, we need to choose the value of the regularization parameter \(\alpha\) that minimizes the cross-validation error. Below is the plot of cross-validation error vs \(\alpha\):

plotcp(mod_tree)
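
The cross-validation errors behind this plot are stored in the model's cp table, so the best value can also be extracted programmatically (a sketch; plotcp() labels its axis with geometric means of consecutive cp values, so the number read off the plot may differ slightly from the table entry):

cp_table <- mod_tree$cptable                     # columns include CP, nsplit, xerror, xstd
cp_table[which.min(cp_table[, "xerror"]), "CP"]  # cp with the smallest cross-validation error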

The optimal value is \(0.019.\) Here is the tree pruned with \(\alpha = 0.019\):

mod_tree %>%
  prune(cp = 0.019) %>%
  rpart.plot()

Performance on the test set

Predictions of a classification model are usually probability vectors:

mod_tree %>%
  predict(test_data) %>%
  head
##          no          yes
## 1 0.1403509 0.8596491228
## 2 0.9997184 0.0002815969
## 3 0.9997184 0.0002815969
## 4 0.9997184 0.0002815969
## 5 0.9997184 0.0002815969
## 6 0.9997184 0.0002815969

However, in order to find the accuracy of our model, we need predicted classes. The default threshold for differentiating the two classes is \(0.5\) (in some cases it can and should be changed; we will look at such cases later in the course):

mod_tree %>%
  predict(test_data, type = "class") %>%
  head
##   1   2   3   4   5   6 
## yes  no  no  no  no  no 
## Levels: no yes
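
This is essentially the result of thresholding the fraud probability at \(0.5\); a manual version (a sketch) looks like this:

prob_yes <- predict(mod_tree, test_data)[, "yes"]     # predicted probability of fraud
pred_class <- factor(ifelse(prob_yes > 0.5, "yes", "no"),
                     levels = c("no", "yes"))         # apply the 0.5 threshold
head(pred_class)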

Here is the confusion matrix of our model:

mod_tree %>%
  predict(test_data, type = "class") %>%
  confusionMatrix(test_data$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     no    yes
##        no  142129     59
##        yes     29    186
##                                           
##                Accuracy : 0.9994          
##                  95% CI : (0.9992, 0.9995)
##     No Information Rate : 0.9983          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8084          
##                                           
##  Mcnemar's Test P-Value : 0.001992        
##                                           
##             Sensitivity : 0.9998          
##             Specificity : 0.7592          
##          Pos Pred Value : 0.9996          
##          Neg Pred Value : 0.8651          
##              Prevalence : 0.9983          
##          Detection Rate : 0.9981          
##    Detection Prevalence : 0.9985          
##       Balanced Accuracy : 0.8795          
##                                           
##        'Positive' Class : no              
## 

Note that the accuracy is very high, but that is not surprising: even if we simply predicted all transactions to be genuine, the accuracy would still be very high (this is the ‘No Information Rate’ in the output above).
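
You can verify this baseline directly; it reproduces the ‘No Information Rate’ from the output above:

mean(test_data$Class == "no")  # accuracy of always predicting "genuine"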

Question

Find the confusion matrix of the pruned tree model on the test set. Is the pruned tree more or less accurate than the full tree?

mod_tree %>%
  prune(cp = 0.019) %>%
  predict(test_data, type = "class") %>%
  confusionMatrix(test_data$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     no    yes
##        no  142132     70
##        yes     26    175
##                                           
##                Accuracy : 0.9993          
##                  95% CI : (0.9992, 0.9995)
##     No Information Rate : 0.9983          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7844          
##                                           
##  Mcnemar's Test P-Value : 1.14e-05        
##                                           
##             Sensitivity : 0.9998          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.9995          
##          Neg Pred Value : 0.8706          
##              Prevalence : 0.9983          
##          Detection Rate : 0.9981          
##    Detection Prevalence : 0.9986          
##       Balanced Accuracy : 0.8571          
##                                           
##        'Positive' Class : no              
## 

Apparently, regularization did not help here: the pruned tree is slightly less accurate (0.9993 vs 0.9994) and has lower specificity (0.7143 vs 0.7592), i.e., it misses more fraudulent transactions than the full tree.