By the end of this activity, you should learn how to:
- Apply the pipe operator to increase the readability of your code
- Use the tidyverse system of libraries for simple data manipulation
- Produce scatterplots in R
- Fit a decision tree and plot it
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer, which you type right in this document; others require modifying the R code, which you also do here. In either case, you can discuss your work with your peers and with the instructor.
We will work with a dataset of credit card frauds.
Source: https://www.kaggle.com/mlg-ulb/creditcardfraud
This is a collection of credit card transactions in Europe. The original features have been transformed to ensure anonymity.
library(tidyverse) # data transformation and plotting
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret) # framework for training ML models
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(rpart) # decision tree fitting
library(rpart.plot) # decision tree plotting
C <- read_csv("creditcard.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double()
## )
## i Use `spec()` for the full column specifications.
head(C)
An English text is read from left to right, but functions in mathematics are applied from right to left, i.e., \[ \ln(\cos(\exp\sqrt{7})) \] is calculated by starting with the number \(7\), then applying the square root function, then the exponential function, etc. In R (as in most programming languages), the order is also from right to left:
log(cos(exp(sqrt(7))))
## [1] -3.143688
It is, however, often more convenient to apply functions from left to right. This is done with the pipe operator %>% from the libraries tidyverse and magrittr (the former is a big collection of tools, the latter contains just the pipe operator):
7 %>% sqrt %>% exp %>% cos %>% log
## [1] -3.143688
If a function used in such a pipe chain needs several arguments, then whatever comes from the left of the pipe chain is inserted as the first argument. For example,
max(exp(-3), 2)
## [1] 2
is the same as
-3 %>% exp %>% max(2)
## [1] 2
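If the piped value should go somewhere other than the first argument, magrittr lets you mark its position with a dot. A minimal sketch, not part of the original chunks:
-3 %>% exp %>% max(2, .)  # the dot marks where the piped value is inserted; same as max(2, exp(-3)), i.e. 2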
Let’s say we want to apply the following transformations to our dataset C:
- Keep variables from V10 to Class and remove the rest of the variables
- Remove the variable V20
- Change values of Class from 0 and 1 to no and yes
- Extract observations with Amount > 100
- Introduce the new variable VAR whose value is V10 + V11 - V12
- Sort the resulting dataset by Amount
This is doable in base R (and in Python / Pandas) but requires a lot of effort. With the pipe operator and the data transformation functions select, filter, mutate, and arrange (these functions are in the library dplyr that is a part of tidyverse), this sequence of transformations is much easier to implement:
A <- C %>%
select(V10:Class) %>%
select(-V20) %>%
mutate(Class = ifelse(Class == 1, "yes", "no")) %>%
filter(Amount > 100) %>%
mutate(VAR = V10+V11-V12) %>%
arrange(Amount)
head(A)
If you are interested in a Python analogue of this chain of transformations, here are some resources:
https://towardsdatascience.com/dplyr-style-data-manipulation-with-pipes-in-python-380dcb137000
https://stmorse.github.io/journal/tidyverse-style-pandas.html
Write a pipe chain that transforms our dataset as follows:
- Remove the time stamp
- Change values of Class from 0 and 1 to no and yes
- Convert Class from character to factor
- Sort all records by Amount in decreasing order
Save the result to the variable A again - later we will construct predictive models for A.
A <- C %>%
select(-Time) %>%
mutate(Class = ifelse(Class == 1, "yes", "no")) %>%
mutate(Class = as.factor(Class)) %>%
arrange(-Amount)
head(A)
Plots in the tidyverse ecosystem are done with the library ggplot2 (also available in Python). It is better and easier to work with than base R graphics. It is probably not as powerful as plotly, but plotly only produces interactive plots; that is usually an advantage, but becomes a problem if you just need to write an ordinary non-interactive report.
Here is how we do a scatterplot:
ggplot(data = A, aes(x = V1, y = V2)) +
geom_point()
And here is how we can colour it according to Class and add transparency so that it is easier to see what is going on:
ggplot(data = A, aes(x = V1, y = V2, color = Class)) +
geom_point(alpha = 0.2)
The variable ‘Class’ contains the value ‘yes’ for fraudulent transactions and ‘no’ for genuine transactions. Find the percentages of fraudulent and genuine transactions, respectively. Is the dataset well-balanced?
Answer
You are probably familiar with the function table from base R:
table(A$Class) / nrow(A)
##
## no yes
## 0.998272514 0.001727486
There is also a way to do this with tidyverse functions (we will later cover these in more depth):
A %>%
group_by(Class) %>%
summarise(N = n()) %>%
ungroup %>%
mutate(Frequency = N / sum(N))
## `summarise()` ungrouping output (override with `.groups` argument)
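As an aside, dplyr's count() gives the same table a bit more compactly; a minimal sketch, not part of the original chunk:
A %>%
  count(Class) %>%                 # number of rows in each class (column n)
  mutate(Frequency = n / sum(n))   # convert counts to proportions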
Normally, when a lot of data are available, one designates only a small portion as the test set. However, since the data here are highly imbalanced, we will split them into equally large training and test sets.
set.seed(128)
ind_train <- sample(1:nrow(A), round(nrow(A)/2))
train_data <- A %>% slice(ind_train)
test_data <- A %>% slice(-ind_train)
cat("Dimensions of training data are", dim(train_data), "\n")
## Dimensions of training data are 142404 30
cat("Dimensions of test data are", dim(test_data), "\n")
## Dimensions of test data are 142403 30
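Since fraudulent transactions are so rare, it is worth checking that both halves actually contain some of them; a quick sketch, not part of the original activity:
table(train_data$Class)   # class counts in the training set
table(test_data$Class)    # class counts in the test set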
Below we fit a decision tree to the training data and plot it:
mod_tree <- rpart(Class ~ . , data = train_data)
# saveRDS(mod_tree, "tree_rf.rds")
rpart.plot(mod_tree)
If we want to regularize it, we need to choose the value of the regularization parameter \(\alpha\) that minimizes the cross-validation error. Below is the plot of cross-validation error vs \(\alpha\):
plotcp(mod_tree)
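Instead of reading the value off the plot, we can inspect the complexity parameter table stored in the fitted model. A sketch, assuming we simply pick the cp with the smallest cross-validated error (this may differ slightly from the value read off plotcp, whose x-axis labels are geometric means of consecutive cp values):
cp_table <- mod_tree$cptable                                # columns CP, nsplit, rel error, xerror, xstd
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]  # cp with the smallest cross-validation error
best_cp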
The optimal value is \(0.019.\) Here is the tree pruned with \(\alpha = 0.019\):
mod_tree %>%
prune(cp = 0.019) %>%
rpart.plot()
Predictions of a classification model are usually probability vectors:
mod_tree %>%
predict(test_data) %>%
head
## no yes
## 1 0.1403509 0.8596491228
## 2 0.9997184 0.0002815969
## 3 0.9997184 0.0002815969
## 4 0.9997184 0.0002815969
## 5 0.9997184 0.0002815969
## 6 0.9997184 0.0002815969
However, in order to find the accuracy of our model, we need predicted classes. The default threshold for differentiating between the two classes is \(0.5\) (in some cases, it can and should be changed; we will look at such cases later in the course):
mod_tree %>%
predict(test_data, type = "class") %>%
head
## 1 2 3 4 5 6
## yes no no no no no
## Levels: no yes
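If we ever want a threshold other than \(0.5\), we can convert the predicted probabilities into classes ourselves. A minimal sketch; the threshold of 0.3 below is purely illustrative and not a recommendation from this activity:
threshold <- 0.3                                    # hypothetical threshold, for illustration only
prob_yes <- predict(mod_tree, test_data)[, "yes"]   # predicted probability of fraud for each test observation
pred_class <- factor(ifelse(prob_yes > threshold, "yes", "no"), levels = c("no", "yes"))
head(pred_class)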
Here is the confusion matrix of our model:
mod_tree %>%
predict(test_data, type = "class") %>%
confusionMatrix(test_data$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 142129 59
## yes 29 186
##
## Accuracy : 0.9994
## 95% CI : (0.9992, 0.9995)
## No Information Rate : 0.9983
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8084
##
## Mcnemar's Test P-Value : 0.001992
##
## Sensitivity : 0.9998
## Specificity : 0.7592
## Pos Pred Value : 0.9996
## Neg Pred Value : 0.8651
## Prevalence : 0.9983
## Detection Rate : 0.9981
## Detection Prevalence : 0.9985
## Balanced Accuracy : 0.8795
##
## 'Positive' Class : no
##
Note that accuracy is very high, but that’s not surprising. In fact, even if we just predict all transactions to be genuine, accuracy will be very high.
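To see this, we can compute the accuracy of the trivial classifier that labels every transaction as genuine (essentially the 'No Information Rate' reported above); a quick sketch:
mean(test_data$Class == "no")   # accuracy of predicting "no" for every transaction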
Find the confusion matrix of the pruned tree model. Is the pruned tree more or less accurate than the full tree?
mod_tree %>%
prune(cp = 0.019) %>%
predict(test_data, type = "class") %>%
confusionMatrix(test_data$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 142132 70
## yes 26 175
##
## Accuracy : 0.9993
## 95% CI : (0.9992, 0.9995)
## No Information Rate : 0.9983
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7844
##
## Mcnemar's Test P-Value : 1.14e-05
##
## Sensitivity : 0.9998
## Specificity : 0.7143
## Pos Pred Value : 0.9995
## Neg Pred Value : 0.8706
## Prevalence : 0.9983
## Detection Rate : 0.9981
## Detection Prevalence : 0.9986
## Balanced Accuracy : 0.8571
##
## 'Positive' Class : no
##
Apparently, regularization didn't help: the pruned tree has slightly lower accuracy and lower specificity (it misses more fraudulent transactions) than the full tree.