Lecture 3: Data Modeling

The goal of a model is to provide a simple, low-dimensional, interpretable summary of a dataset. Models are a very useful way to peel back layers of structure as you explore your data. Every statistical model can be divided into two parts: 1. a family of models that expresses a precise, but generic, pattern that you want to capture (e.g., a straight line or a quadratic curve); 2. a fitted model, found by selecting the member of that family that is closest to your data.

It is important to understand that a fitted model is just the closest model from a family of models. This means that you have the “best” model according to some criterion and under a set of assumptions; it does not mean that your model is a good model, or that your model is “true”. George Box, a famous British statistician, is the source of one of the most quoted lines in statistics: “all models are wrong, but some are useful”. It is worth reading the fuller context of the quote, as it is quite illustrative of the philosophy behind any statistical model: “Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an “ideal” gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules. For such a model there is no need to ask the question “Is the model true?” If “truth” is to be the “whole truth” the answer must be “No.” The only question of interest is “Is the model illuminating and useful?””

This does not mean that all models are equally wrong and that we should simply settle for the least wrong one. The quote should be interpreted as a call for carefully laying out the assumptions on which the quality of the model rests. As Berkeley statistician Mark van der Laan stated in a recent article, “The statistical formulation and theory should define the algorithm” (source).

In this lecture we will see how to fit two types of models in R: 1. linear regression models; 2. non-linear predictive models (decision trees, random forests).

As you have probably seen the former in your econometrics classes, I will skip the mathematical details and focus on the functions used to build these models and on the interpretation of their outputs.

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.2.0     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.1
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(modelr)
library(readxl)
data <- read_excel("G:\\Il mio Drive\\Econometrics Lab\\Data\\Compustat Data.xlsx")
data <- data[, !names(data) %in% c("Interest Expense - Total (Financial Services)", "Net Interest Income", "Nonperforming Assets - Total")]
data_clean <- na.omit(data)
# Keep firms below the 95th percentile of total assets, to limit the influence of extreme outliers
x <- data_clean$`Assets - Total`[which(data_clean$`Assets - Total` < quantile(data_clean$`Assets - Total`, 0.95))]
y <- data_clean$`Sales/Turnover (Net)`[which(data_clean$`Assets - Total` < quantile(data_clean$`Assets - Total`, 0.95))]

reg_data <- data.frame(x = x, y = y)
ggplot(reg_data, aes(x, y)) +
  geom_point()

You can see quite a clear pattern in the data. Let’s now use a model to capture that pattern and make it more explicit.

Let’s first generate a set of random models and overlay them on the data.

models <- tibble(
  beta1 = runif(length(x), 0, 200),  # random intercepts
  beta2 = runif(length(x), -4, 4)    # random slopes
)
ggplot(reg_data, aes(x, y)) +
  geom_abline(
    aes(intercept = beta1,
        slope = beta2),
    data = models, alpha = 1/15
  ) + 
  geom_point()

# The model family: a line with intercept beta[1] and slope beta[2]
model1 <- function(beta, data){
  beta[1] + data$x * beta[2]
}
fitted.values <- model1(c(50, 1.5), reg_data)
head(fitted.values)
## [1] 360.4925 415.7135 409.4930  73.7855  84.3965  51.2975

Let’s now define a measure of the distance between the data and a candidate model: the root-mean-squared deviation \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}\), where \(\hat{y}_i\) are the model’s fitted values.

measure_distance <- function(mod, data) {
 diff <- data$y - model1(mod, data)  # residuals
 sqrt(mean(diff ^ 2))                # root-mean-squared deviation
}
measure_distance(c(50, 1.5), reg_data)

We can use “purrr” to compute the distance for all the models defined previously. We will need a helper function because our distance function expects the model parameters as a numeric vector of length 2.

reg_data_dist <- function(beta1, beta2) {
 measure_distance(c(beta1, beta2), reg_data)
}
models <- models %>%
 mutate(dist = purrr::map2_dbl(beta1, beta2, reg_data_dist))
models
## # A tibble: 5,453 x 3
##    beta1  beta2  dist
##    <dbl>  <dbl> <dbl>
##  1 178.   0.255  288.
##  2  72.9 -0.263  378.
##  3  79.3 -0.909  546.
##  4  95.0 -2.29   942.
##  5  13.9  0.528  243.
##  6 161.  -1.39   655.
##  7 200.   3.68  1032.
##  8 121.  -2.02   852.
##  9  85.9  1.90   453.
## 10  11.6  2.66   624.
## # ... with 5,443 more rows

We can now overlay the best 10 models on the data.

ggplot(reg_data, aes(x, y)) +
 geom_point(size = 2, color = "grey30") +
 geom_abline(
 aes(intercept = beta1, slope = beta2, color = -dist),
 data = filter(models, rank(dist) <= 10)
 )

We can also think about these models as observations, and visualize them with a scatterplot of beta1 versus beta2, again colored by -dist. We can no longer directly see how the model compares to the data, but we can see many models at once. Again, I’ve highlighted the 10 best models, this time by drawing red circles underneath them:

ggplot(models, aes(beta1, beta2)) +
 geom_point(
 data = filter(models, rank(dist) <= 10),
 size = 4, color = "red"
 ) +
 geom_point(aes(colour = -dist))

Instead of trying lots of random models, we could be more systematic and generate an evenly spaced grid of points (this is called a grid search). I picked the parameters of the grid roughly by looking at where the best models were in the preceding plot:

grid <- expand.grid(
 beta1 = seq(0, 200, length = 50),
 beta2 = seq(-4, 4, length = 50)
 ) %>%
 mutate(dist = purrr::map2_dbl(beta1, beta2, reg_data_dist))
grid %>%
 ggplot(aes(beta1, beta2)) +
 geom_point(
 data = filter(grid, rank(dist) <= 10),
 size = 4, colour = "red"
 ) +
 geom_point(aes(color = -dist))

When you overlay the best 10 models back on the original data, they all look pretty good:

ggplot(reg_data, aes(x, y)) +
 geom_point(size = 2, color = "grey30") +
 geom_abline(
 aes(intercept = beta1, slope = beta2, color = -dist),
 data = filter(grid, rank(dist) <= 10)
 )

You could imagine iteratively making the grid finer and finer until you narrowed in on the best model. But there’s a better way to tackle that problem: numerical minimization. The intuition is pretty simple: you pick a starting point and look around for the steepest downhill slope. You then ski down that slope a little way, and repeat again and again, until you can’t go any lower. This is the idea behind gradient descent; R’s general-purpose minimizer optim() defaults to the derivative-free Nelder-Mead method, but the intuition is the same:

best <- optim(c(0, 0), measure_distance, data = reg_data)
best$par
## [1] -3.1063985  0.8127066
ggplot(reg_data, aes(x, y)) +
 geom_point(size = 2, color = "grey30") +
 geom_abline(intercept = best$par[1], slope = best$par[2])
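To make the downhill intuition concrete, here is a toy local search of my own (a sketch only, not what optim() actually does): repeatedly propose a small random step and keep it only when it shrinks the distance.

set.seed(1)
beta <- c(0, 0)  # starting point
for (i in 1:5000) {
 # propose a small random perturbation of the current parameters
 proposal <- beta + c(rnorm(1, sd = 1), rnorm(1, sd = 0.01))
 # keep the proposal only if it reduces the distance to the data
 if (measure_distance(proposal, reg_data) < measure_distance(beta, reg_data)) {
   beta <- proposal
 }
}
beta  # should land close to best$par found by optim() above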

Don’t worry too much about the details of how optim() works. It’s the intuition that’s important here. If you have a function that defines the distance between a model and a dataset, and an algorithm that can minimize that distance by modifying the parameters of the model, you can find the best model. The neat thing about this approach is that it will work for any family of models that you can write an equation for.

There’s one more approach that we can use for this model, because it is a special case of a broader family: linear models. A linear model has the general form \(y = a_1 + a_2 \cdot x_1 + a_3 \cdot x_2 + ... + a_n \cdot x_{n - 1}\). So this simple model is equivalent to a general linear model where \(n\) is 2 and \(x_1\) is \(x\). R has a tool specifically designed for fitting linear models called lm(). lm() has a special way to specify the model family: formulas. Formulas look like y ~ x, which lm() translates to a function like \(y = a_1 + a_2 \cdot x\). We can fit the model and look at the output:

model_1 <- lm(y ~ x, data = reg_data)
summary(model_1)
## 
## Call:
## lm(formula = y ~ x, data = reg_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1221.1   -27.4     0.9     9.9  6578.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.06484    3.62636  -0.845    0.398    
## x            0.81267    0.01172  69.333   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 229.5 on 5451 degrees of freedom
## Multiple R-squared:  0.4686, Adjusted R-squared:  0.4685 
## F-statistic:  4807 on 1 and 5451 DF,  p-value: < 2.2e-16

Now let’s add an additional regressor to the linear model, so that we can compare the two different models.

z <- data_clean$Employees[which(data_clean$`Assets - Total`< quantile(data_clean$`Assets - Total`, 0.95))]
reg_data <- cbind(reg_data, z)
model_2 <- lm(y ~ x + z, data = reg_data)
summary(model_2)
## 
## Call:
## lm(formula = y ~ x + z, data = reg_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2675.1   -22.5     2.5    11.0  6700.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.31960    3.38664  -1.275    0.202    
## x            0.69931    0.01166  59.999   <2e-16 ***
## z           20.62225    0.72862  28.303   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 214.3 on 5450 degrees of freedom
## Multiple R-squared:  0.5367, Adjusted R-squared:  0.5365 
## F-statistic:  3157 on 2 and 5450 DF,  p-value: < 2.2e-16

In R, you can either write down all the variables that you want to use as regressors in your model, or use the shorthand y ~ ., which includes every other column of the data as a regressor.

model_3 <- lm(y ~ ., data = reg_data)
summary(model_3)
## 
## Call:
## lm(formula = y ~ ., data = reg_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2675.1   -22.5     2.5    11.0  6700.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.31960    3.38664  -1.275    0.202    
## x            0.69931    0.01166  59.999   <2e-16 ***
## z           20.62225    0.72862  28.303   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 214.3 on 5450 degrees of freedom
## Multiple R-squared:  0.5367, Adjusted R-squared:  0.5365 
## F-statistic:  3157 on 2 and 5450 DF,  p-value: < 2.2e-16

A very easy way to compare two nested linear regressions is the likelihood ratio test. In statistics, the likelihood-ratio test assesses the goodness of fit of two competing statistical models based on the ratio of their likelihoods.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
lrtest(model_1, model_2)
## Likelihood ratio test
## 
## Model 1: y ~ x
## Model 2: y ~ x + z
##   #Df LogLik Df  Chisq Pr(>Chisq)    
## 1   3 -37377                         
## 2   4 -37004  1 747.81  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
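The statistic reported by lrtest() is easy to reproduce by hand: it is twice the gap between the two log-likelihoods, compared against a chi-squared distribution whose degrees of freedom equal the number of extra parameters (here, one).

lr_stat <- 2 * as.numeric(logLik(model_2) - logLik(model_1))
lr_stat                                      # matches the Chisq column above
pchisq(lr_stat, df = 1, lower.tail = FALSE)  # matches Pr(>Chisq)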

\(p < 0.001\) indicates that the model with both predictors fits significantly better than the model with only one predictor. Another “goodness-of-fit” measure that can be used is the \(R^2\): \[\begin{equation} R^2 = 1 - \frac{RSS}{TSS}, \end{equation}\] where \(RSS\) is the residual sum of squares and \(TSS\) is the total sum of squares.

summary(model_1)$r.squared
## [1] 0.4686107
summary(model_2)$r.squared
## [1] 0.5367084
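As a check, we can recover the \(R^2\) of model_2 directly from its definition:

rss <- sum(resid(model_2)^2)                   # residual sum of squares
tss <- sum((reg_data$y - mean(reg_data$y))^2)  # total sum of squares
1 - rss / tss                                  # matches summary(model_2)$r.squared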

We can also compute the model’s predicted value for any \(x\) and \(z\) by running the following chunk of code.

coeffs <- coefficients(model_2)
assets <- 159
employees <- 2
pred <- coeffs[1] + coeffs[2]*assets + coeffs[3]*employees  # prediction by hand
pred
## (Intercept) 
##    148.1153

Or, equivalently:

newdata <- data.frame(x = 159, z = 2)
predict(model_2, newdata) 
##        1 
## 148.1153
predict(model_2, newdata, interval="confidence") 
##        fit      lwr      upr
## 1 148.1153 142.2269 154.0036

Once we have fitted our favourite model, we can check the residuals from the model: \(e_i = y_i - \hat{f}(x_i)\).

model.res = resid(model_2)
plot(reg_data$y, model.res, ylab="Residuals", xlab="Sales", main="Residuals v. Sales") 
abline(0, 0)  

Moreover, we can standardize the residuals and plot them against the quantiles of a standard normal distribution (a normal Q-Q plot).

model_2.stdres <- rstandard(model_2)
qqnorm(model_2.stdres, ylab="Standardized Residuals", xlab="Normal Scores", main="Normal Q-Q Plot of Standardized Residuals")
qqline(model_2.stdres)

In R, you can introduce an interaction between regressors by using *. Note that x*z expands to x + z + x:z, so the individual regressors are included automatically; if you instead write the interaction alone as x:z, always remember to include the individual regressors in the formula yourself. (A quick check that the two formulations agree follows the output below.)

model_int<-lm(y ~ x  + z + x*z, data = reg_data)
summary(model_int)
## 
## Call:
## lm(formula = y ~ x + z + x * z, data = reg_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1568.2   -25.4    -9.4     6.9  6775.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.308146   3.274783   2.842  0.00449 ** 
## x           0.605046   0.011784  51.343  < 2e-16 ***
## z           2.426270   1.033420   2.348  0.01892 *  
## x:z         0.034463   0.001451  23.754  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 204 on 5449 degrees of freedom
## Multiple R-squared:  0.5802, Adjusted R-squared:  0.5799 
## F-statistic:  2510 on 3 and 5449 DF,  p-value: < 2.2e-16
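As promised above, x*z already expands to the main effects plus the interaction, so the shorter formula y ~ x * z fits the identical model:

model_int2 <- lm(y ~ x * z, data = reg_data)
all.equal(coef(model_int), coef(model_int2))  # TRUE: identical coefficients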

You can’t type a quadratic term like x^2 directly into the regression formula, because ^ has a special meaning in formula syntax. You can either create an additional variable containing the squared term, as below, or protect the term with I() inside the formula (a check of the equivalence follows the output).

x2 <- x^2
model_squared<-lm(y ~ x  +  x2 + z, data = reg_data)
summary(model_squared)
## 
## Call:
## lm(formula = y ~ x + x2 + z, data = reg_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2676.7   -22.3    -1.9     8.9  6669.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.055e+00  3.866e+00   0.273  0.78489    
## x           6.189e-01  3.031e-02  20.418  < 2e-16 ***
## x2          7.753e-05  2.696e-05   2.875  0.00405 ** 
## z           2.066e+01  7.283e-01  28.370  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 214.1 on 5449 degrees of freedom
## Multiple R-squared:  0.5374, Adjusted R-squared:  0.5372 
## F-statistic:  2110 on 3 and 5449 DF,  p-value: < 2.2e-16
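As a check of the equivalence mentioned above, the same model can be fit without the auxiliary variable by protecting the square with I() inside the formula (poly(x, 2) is another option, with an orthogonal parameterization):

model_squared2 <- lm(y ~ x + I(x^2) + z, data = reg_data)
all.equal(unname(coef(model_squared)), unname(coef(model_squared2)))  # TRUE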

Non-linear predictive models

In some scenarios we may be interested in building a statistical model to predict a certain outcome. Here, for instance, we may want to use different models to predict the location of a firm: the Compustat data contain both US and Canadian enterprises. In the next chunks of code I will build three different models (logistic regression, CART and random forest) to predict whether a firm is located in the US or in Canada.

Before running the analyses, I restrict the set of predictors to the following variables and create a dummy variable that takes the value 1 if the firm is located in the US and 0 if it is located in Canada.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
myvariables <- c("ISO Currency Code",
                 "Assets - Total", "Average Short-Term Borrowings",
                 "Current Assets - Total", "Long-Term Debt Due in One Year",
                 "Debt in Current Liabilities - Total", "Employees",
                 "Earnings Before Interest and Taxes", "Liabilities - Total",
                 "Net Income (Loss)", "In Process R&D Expense",
                 "GIC Sectors", "Standard Industry Classification Code")
data_prediction <- data_clean[myvariables]
data_prediction$iso_code <- ifelse(data_prediction$`ISO Currency Code`=="USD", 1, 0)
data_prediction <- data_prediction[, !names(data_prediction) %in% c("ISO Currency Code")]

In order to check how well the three models perform, I randomly split the data into two disjoint sets: a training set that I will use to build the models, and a test set that I will use to validate the quality of the models’ predictions.

set.seed(123)
index <- sample(seq_len(nrow(data_prediction)),
                    size = nrow(data_prediction)*0.5) 

train <- data_prediction[index,]
test <- data_prediction[-index,]

Logistic Regression

The first model that I run is a logistic regression including all the covariates.

logit<-glm(iso_code ~ ., data= train, family=binomial(link='logit'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit)
## 
## Call:
## glm(formula = iso_code ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1015   0.1754   0.3027   0.4930   1.4832  
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                             -8.663e-01  1.508e-01  -5.743
## `Assets - Total`                        -4.599e-04  3.459e-04  -1.330
## `Average Short-Term Borrowings`          1.684e-02  9.810e-03   1.716
## `Current Assets - Total`                 2.555e-03  7.378e-04   3.463
## `Long-Term Debt Due in One Year`         6.835e-03  6.994e-03   0.977
## `Debt in Current Liabilities - Total`   -1.073e-02  6.779e-03  -1.582
## Employees                                4.996e-02  3.530e-02   1.415
## `Earnings Before Interest and Taxes`    -2.414e-03  2.280e-03  -1.059
## `Liabilities - Total`                    8.082e-04  6.295e-04   1.284
## `Net Income (Loss)`                      4.237e-04  2.352e-03   0.180
## `In Process R&D Expense`                -1.391e+02  5.892e+03  -0.024
## `GIC Sectors`                            4.632e-02  6.250e-03   7.411
## `Standard Industry Classification Code`  3.329e-04  3.724e-05   8.938
##                                         Pr(>|z|)    
## (Intercept)                             9.29e-09 ***
## `Assets - Total`                        0.183633    
## `Average Short-Term Borrowings`         0.086105 .  
## `Current Assets - Total`                0.000535 ***
## `Long-Term Debt Due in One Year`        0.328398    
## `Debt in Current Liabilities - Total`   0.113545    
## Employees                               0.156980    
## `Earnings Before Interest and Taxes`    0.289682    
## `Liabilities - Total`                   0.199163    
## `Net Income (Loss)`                     0.857055    
## `In Process R&D Expense`                0.981162    
## `GIC Sectors`                           1.26e-13 ***
## `Standard Industry Classification Code`  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2140.2  on 2869  degrees of freedom
## Residual deviance: 1669.4  on 2857  degrees of freedom
## AIC: 1695.4
## 
## Number of Fisher Scoring iterations: 25

To assess the accuracy of the model, I first obtain the predicted probabilities on the test set, then threshold them at 0.5 to obtain the predicted classes, and finally check the root-mean-squared error (RMSE), the mean absolute error (MAE), the \(R^2\), and the confusion matrix.

# Predicted probabilities and classes on the test set
fitted.results.logit <- predict(logit, newdata = test, type='response')
fitted.logit <- ifelse(fitted.results.logit >= 0.5, 1, 0)

# RMSE, R-squared and MAE on the test set
caret::postResample(fitted.logit, test$iso_code)
##       RMSE   Rsquared        MAE 
## 0.36476662 0.02285701 0.13305468
# For a good predictive model, the RMSE and MAE values should be low
confusionMatrix(data = as.factor(fitted.logit),
                reference = as.factor(test$iso_code))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0   16    6
##          1  376 2473
##                                          
##                Accuracy : 0.8669         
##                  95% CI : (0.854, 0.8792)
##     No Information Rate : 0.8635         
##     P-Value [Acc > NIR] : 0.3045         
##                                          
##                   Kappa : 0.0637         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.040816       
##             Specificity : 0.997580       
##          Pos Pred Value : 0.727273       
##          Neg Pred Value : 0.868024       
##              Prevalence : 0.136538       
##          Detection Rate : 0.005573       
##    Detection Prevalence : 0.007663       
##       Balanced Accuracy : 0.519198       
##                                          
##        'Positive' Class : 0              
## 
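The headline numbers of the confusion matrix are easy to reproduce by hand; recall that the “positive” class here is 0 (Canada).

mean(fitted.logit == test$iso_code)          # accuracy
mean(fitted.logit[test$iso_code == 0] == 0)  # sensitivity: true 0s predicted as 0
mean(fitted.logit[test$iso_code == 1] == 1)  # specificity: true 1s predicted as 1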

Classification and Regression Tree

The second model that I run is a classification and regression tree (CART) from the “rpart” package in R.

library(rpart)
cart <- rpart(iso_code ~ ., data=train, method="class")

printcp(cart) # display the results
## 
## Classification tree:
## rpart(formula = iso_code ~ ., data = train, method = "class")
## 
## Variables actually used in tree construction:
## [1] Assets - Total                       
## [2] Current Assets - Total               
## [3] Debt in Current Liabilities - Total  
## [4] Earnings Before Interest and Taxes   
## [5] GIC Sectors                          
## [6] Liabilities - Total                  
## [7] Standard Industry Classification Code
## 
## Root node error: 353/2870 = 0.123
## 
## n= 2870 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.158640      0   1.00000 1.00000 0.049844
## 2 0.133144      1   0.84136 0.94618 0.048667
## 3 0.014164      2   0.70822 0.72805 0.043333
## 4 0.010387      6   0.65156 0.72805 0.043333
## 5 0.010000     10   0.60907 0.74788 0.043860
plotcp(cart) # visualize cross-validation results

summary(cart) # detailed summary of splits
## Call:
## rpart(formula = iso_code ~ ., data = train, method = "class")
##   n= 2870 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.15864023      0 1.0000000 1.0000000 0.04984405
## 2 0.13314448      1 0.8413598 0.9461756 0.04866673
## 3 0.01416431      2 0.7082153 0.7280453 0.04333319
## 4 0.01038716      6 0.6515581 0.7280453 0.04333319
## 5 0.01000000     10 0.6090652 0.7478754 0.04386049
## 
## Variable importance
## Standard Industry Classification Code 
##                                    40 
##                           GIC Sectors 
##                                    23 
##                   Liabilities - Total 
##                                    10 
##                        Assets - Total 
##                                     9 
##                Current Assets - Total 
##                                     6 
##    Earnings Before Interest and Taxes 
##                                     4 
##                             Employees 
##                                     3 
##                     Net Income (Loss) 
##                                     3 
##   Debt in Current Liabilities - Total 
##                                     2 
##         Average Short-Term Borrowings 
##                                     1 
##        Long-Term Debt Due in One Year 
##                                     1 
## 
## Node number 1: 2870 observations,    complexity param=0.1586402
##   predicted class=1  expected loss=0.1229965  P(node) =1
##     class counts:   353  2517
##    probabilities: 0.123 0.877 
##   left son=2 (338 obs) right son=3 (2532 obs)
##   Primary splits:
##       Standard Industry Classification Code < 1661.5   to the left,  improve=162.026300, (0 missing)
##       GIC Sectors                           < 17.5     to the left,  improve=122.071700, (0 missing)
##       Employees                             < 0.1175   to the left,  improve= 21.343270, (0 missing)
##       Current Assets - Total                < 108.6385 to the left,  improve= 15.732660, (0 missing)
##       Assets - Total                        < 2.631    to the right, improve=  6.835526, (0 missing)
##   Surrogate splits:
##       GIC Sectors < 17.5     to the left,  agree=0.945, adj=0.536, (0 split)
## 
## Node number 2: 338 observations,    complexity param=0.1331445
##   predicted class=0  expected loss=0.4171598  P(node) =0.11777
##     class counts:   197   141
##    probabilities: 0.583 0.417 
##   left son=4 (245 obs) right son=5 (93 obs)
##   Primary splits:
##       Assets - Total                        < 10.73    to the right, improve=28.88831, (0 missing)
##       Current Assets - Total                < 1.106    to the right, improve=26.66774, (0 missing)
##       Liabilities - Total                   < 0.1225   to the right, improve=24.78560, (0 missing)
##       Standard Industry Classification Code < 1067     to the left,  improve=12.88777, (0 missing)
##       GIC Sectors                           < 12.5     to the right, improve=12.67121, (0 missing)
##   Surrogate splits:
##       Liabilities - Total    < 0.261    to the right, agree=0.908, adj=0.667, (0 split)
##       Current Assets - Total < 1.705    to the right, agree=0.902, adj=0.645, (0 split)
##       Employees              < 0.0035   to the right, agree=0.825, adj=0.366, (0 split)
## 
## Node number 3: 2532 observations
##   predicted class=1  expected loss=0.06161137  P(node) =0.88223
##     class counts:   156  2376
##    probabilities: 0.062 0.938 
## 
## Node number 4: 245 observations,    complexity param=0.01416431
##   predicted class=0  expected loss=0.2897959  P(node) =0.08536585
##     class counts:   174    71
##    probabilities: 0.710 0.290 
##   left son=8 (125 obs) right son=9 (120 obs)
##   Primary splits:
##       GIC Sectors                           < 12.5     to the right, improve=10.849650, (0 missing)
##       Standard Industry Classification Code < 1065     to the left,  improve= 9.095067, (0 missing)
##       Liabilities - Total                   < 1.897    to the left,  improve= 7.060408, (0 missing)
##       Current Assets - Total                < 15.342   to the left,  improve= 5.298863, (0 missing)
##       Net Income (Loss)                     < 0.032    to the left,  improve= 2.838368, (0 missing)
##   Surrogate splits:
##       Standard Industry Classification Code < 1065     to the left,  agree=0.935, adj=0.867, (0 split)
##       Liabilities - Total                   < 2.5085   to the left,  agree=0.731, adj=0.450, (0 split)
##       Earnings Before Interest and Taxes    < -0.0795  to the left,  agree=0.710, adj=0.408, (0 split)
##       Net Income (Loss)                     < -0.6345  to the left,  agree=0.682, adj=0.350, (0 split)
##       Assets - Total                        < 45.735   to the left,  agree=0.633, adj=0.250, (0 split)
## 
## Node number 5: 93 observations,    complexity param=0.01038716
##   predicted class=1  expected loss=0.2473118  P(node) =0.03240418
##     class counts:    23    70
##    probabilities: 0.247 0.753 
##   left son=10 (49 obs) right son=11 (44 obs)
##   Primary splits:
##       Standard Industry Classification Code < 1155     to the left,  improve=6.805474, (0 missing)
##       Liabilities - Total                   < 0.0605   to the right, improve=5.417307, (0 missing)
##       GIC Sectors                           < 12.5     to the right, improve=4.870546, (0 missing)
##       Assets - Total                        < 1.907    to the right, improve=4.105832, (0 missing)
##       Current Assets - Total                < 1.106    to the right, improve=3.529394, (0 missing)
##   Surrogate splits:
##       GIC Sectors                        < 12.5     to the right, agree=0.935, adj=0.864, (0 split)
##       Earnings Before Interest and Taxes < -0.487   to the left,  agree=0.742, adj=0.455, (0 split)
##       Current Assets - Total             < 0.355    to the right, agree=0.656, adj=0.273, (0 split)
##       Employees                          < 0.0055   to the right, agree=0.645, adj=0.250, (0 split)
##       Liabilities - Total                < 0.041    to the right, agree=0.645, adj=0.250, (0 split)
## 
## Node number 8: 125 observations
##   predicted class=0  expected loss=0.144  P(node) =0.04355401
##     class counts:   107    18
##    probabilities: 0.856 0.144 
## 
## Node number 9: 120 observations,    complexity param=0.01416431
##   predicted class=0  expected loss=0.4416667  P(node) =0.04181185
##     class counts:    67    53
##    probabilities: 0.558 0.442 
##   left son=18 (19 obs) right son=19 (101 obs)
##   Primary splits:
##       Liabilities - Total                 < 196.3925 to the right, improve=6.833151, (0 missing)
##       Debt in Current Liabilities - Total < 0.023    to the right, improve=6.628602, (0 missing)
##       Average Short-Term Borrowings       < 2.6985   to the right, improve=4.256061, (0 missing)
##       Assets - Total                      < 487.2625 to the right, improve=4.079792, (0 missing)
##       Net Income (Loss)                   < 43.2665  to the right, improve=3.879236, (0 missing)
##   Surrogate splits:
##       Assets - Total                      < 415.227  to the right, agree=0.950, adj=0.684, (0 split)
##       Net Income (Loss)                   < 43.2665  to the right, agree=0.933, adj=0.579, (0 split)
##       Earnings Before Interest and Taxes  < 55.055   to the right, agree=0.925, adj=0.526, (0 split)
##       Average Short-Term Borrowings       < 60.251   to the right, agree=0.867, adj=0.158, (0 split)
##       Debt in Current Liabilities - Total < 71.65    to the right, agree=0.867, adj=0.158, (0 split)
## 
## Node number 10: 49 observations,    complexity param=0.01038716
##   predicted class=1  expected loss=0.4285714  P(node) =0.01707317
##     class counts:    21    28
##    probabilities: 0.429 0.571 
##   left son=20 (37 obs) right son=21 (12 obs)
##   Primary splits:
##       Liabilities - Total                < 0.0605   to the right, improve=5.837838, (0 missing)
##       Assets - Total                     < 1.907    to the right, improve=5.121212, (0 missing)
##       Current Assets - Total             < 1.633    to the right, improve=4.015686, (0 missing)
##       Employees                          < 5e-04    to the left,  improve=3.000000, (0 missing)
##       Earnings Before Interest and Taxes < -0.799   to the right, improve=1.618729, (0 missing)
##   Surrogate splits:
##       Earnings Before Interest and Taxes < -0.0225  to the left,  agree=0.796, adj=0.167, (0 split)
##       Current Assets - Total             < 0.0015   to the right, agree=0.776, adj=0.083, (0 split)
## 
## Node number 11: 44 observations
##   predicted class=1  expected loss=0.04545455  P(node) =0.01533101
##     class counts:     2    42
##    probabilities: 0.045 0.955 
## 
## Node number 18: 19 observations
##   predicted class=0  expected loss=0.05263158  P(node) =0.006620209
##     class counts:    18     1
##    probabilities: 0.947 0.053 
## 
## Node number 19: 101 observations,    complexity param=0.01416431
##   predicted class=1  expected loss=0.4851485  P(node) =0.03519164
##     class counts:    49    52
##    probabilities: 0.485 0.515 
##   left son=38 (16 obs) right son=39 (85 obs)
##   Primary splits:
##       Debt in Current Liabilities - Total < 0.023    to the right, improve=5.778975, (0 missing)
##       Current Assets - Total              < 19.392   to the left,  improve=4.172308, (0 missing)
##       Average Short-Term Borrowings       < 2.6985   to the right, improve=3.987360, (0 missing)
##       Net Income (Loss)                   < 23.475   to the left,  improve=2.271607, (0 missing)
##       Long-Term Debt Due in One Year      < 0.023    to the right, improve=2.081585, (0 missing)
##   Surrogate splits:
##       Average Short-Term Borrowings         < 2.6985   to the right, agree=0.911, adj=0.438, (0 split)
##       Long-Term Debt Due in One Year        < 0.023    to the right, agree=0.911, adj=0.438, (0 split)
##       Standard Industry Classification Code < 1346     to the right, agree=0.871, adj=0.188, (0 split)
##       Current Assets - Total                < 3.1195   to the left,  agree=0.851, adj=0.063, (0 split)
## 
## Node number 20: 37 observations,    complexity param=0.01038716
##   predicted class=0  expected loss=0.4324324  P(node) =0.01289199
##     class counts:    21    16
##    probabilities: 0.568 0.432 
##   left son=40 (15 obs) right son=41 (22 obs)
##   Primary splits:
##       Earnings Before Interest and Taxes < -0.8915  to the right, improve=4.513677, (0 missing)
##       Net Income (Loss)                  < -0.9225  to the right, improve=3.382400, (0 missing)
##       Current Assets - Total             < 1.0465   to the right, improve=2.444515, (0 missing)
##       Employees                          < 0.0015   to the left,  improve=1.966358, (0 missing)
##       Assets - Total                     < 1.907    to the right, improve=1.416708, (0 missing)
##   Surrogate splits:
##       Net Income (Loss)      < -0.9225  to the right, agree=0.973, adj=0.933, (0 split)
##       Employees              < 0.0015   to the left,  agree=0.784, adj=0.467, (0 split)
##       Liabilities - Total    < 0.1325   to the left,  agree=0.730, adj=0.333, (0 split)
##       Assets - Total         < 0.092    to the left,  agree=0.703, adj=0.267, (0 split)
##       Current Assets - Total < 0.035    to the left,  agree=0.703, adj=0.267, (0 split)
## 
## Node number 21: 12 observations
##   predicted class=1  expected loss=0  P(node) =0.004181185
##     class counts:     0    12
##    probabilities: 0.000 1.000 
## 
## Node number 38: 16 observations
##   predicted class=0  expected loss=0.125  P(node) =0.005574913
##     class counts:    14     2
##    probabilities: 0.875 0.125 
## 
## Node number 39: 85 observations,    complexity param=0.01416431
##   predicted class=1  expected loss=0.4117647  P(node) =0.02961672
##     class counts:    35    50
##    probabilities: 0.412 0.588 
##   left son=78 (9 obs) right son=79 (76 obs)
##   Primary splits:
##       Liabilities - Total                < 1.8985   to the left,  improve=2.696938, (0 missing)
##       Current Assets - Total             < 17.346   to the left,  improve=1.928219, (0 missing)
##       Net Income (Loss)                  < 23.475   to the left,  improve=1.336176, (0 missing)
##       Earnings Before Interest and Taxes < -12.381  to the right, improve=1.307255, (0 missing)
##       Assets - Total                     < 44.226   to the left,  improve=1.061030, (0 missing)
##   Surrogate splits:
##       Standard Industry Classification Code < 1155     to the left,  agree=0.906, adj=0.111, (0 split)
## 
## Node number 40: 15 observations
##   predicted class=0  expected loss=0.1333333  P(node) =0.005226481
##     class counts:    13     2
##    probabilities: 0.867 0.133 
## 
## Node number 41: 22 observations,    complexity param=0.01038716
##   predicted class=1  expected loss=0.3636364  P(node) =0.007665505
##     class counts:     8    14
##    probabilities: 0.364 0.636 
##   left son=82 (8 obs) right son=83 (14 obs)
##   Primary splits:
##       Current Assets - Total                < 2.785    to the right, improve=3.7532470, (0 missing)
##       Assets - Total                        < 1.992    to the right, improve=1.4318180, (0 missing)
##       Earnings Before Interest and Taxes    < -3.2205  to the left,  improve=0.6818182, (0 missing)
##       Standard Industry Classification Code < 1020     to the left,  improve=0.6818182, (0 missing)
##       Liabilities - Total                   < 0.1925   to the right, improve=0.6091686, (0 missing)
##   Surrogate splits:
##       Assets - Total                        < 2.839    to the right, agree=0.818, adj=0.500, (0 split)
##       Liabilities - Total                   < 0.4985   to the right, agree=0.773, adj=0.375, (0 split)
##       Earnings Before Interest and Taxes    < -3.2205  to the left,  agree=0.727, adj=0.250, (0 split)
##       Standard Industry Classification Code < 1020     to the left,  agree=0.727, adj=0.250, (0 split)
##       Employees                             < 0.0065   to the right, agree=0.682, adj=0.125, (0 split)
## 
## Node number 78: 9 observations
##   predicted class=0  expected loss=0.2222222  P(node) =0.003135889
##     class counts:     7     2
##    probabilities: 0.778 0.222 
## 
## Node number 79: 76 observations
##   predicted class=1  expected loss=0.3684211  P(node) =0.02648084
##     class counts:    28    48
##    probabilities: 0.368 0.632 
## 
## Node number 82: 8 observations
##   predicted class=0  expected loss=0.25  P(node) =0.002787456
##     class counts:     6     2
##    probabilities: 0.750 0.250 
## 
## Node number 83: 14 observations
##   predicted class=1  expected loss=0.1428571  P(node) =0.004878049
##     class counts:     2    12
##    probabilities: 0.143 0.857

You can depict the classification tree by using the rpart.plot() function from the “rpart.plot” package.

# Plot the tree
rpart.plot::rpart.plot(cart)

# Predicted probabilities and classes on the test set
fitted.results.rpart <- predict(cart, newdata=test, type='prob')
fitted.rpart <- ifelse(fitted.results.rpart[,2] >= 0.5, 1, 0)

# RMSE, R-squared and MAE on the test set
caret::postResample(fitted.rpart, test$iso_code)
##      RMSE  Rsquared       MAE 
## 0.3183684 0.2371798 0.1013584
# For a good predictive model, the RMSE and MAE values should be low
confusionMatrix(data = as.factor(fitted.rpart),
                reference = as.factor(test$iso_code))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  144   43
##          1  248 2436
##                                          
##                Accuracy : 0.8986         
##                  95% CI : (0.887, 0.9094)
##     No Information Rate : 0.8635         
##     P-Value [Acc > NIR] : 6.802e-09      
##                                          
##                   Kappa : 0.4488         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.36735        
##             Specificity : 0.98265        
##          Pos Pred Value : 0.77005        
##          Neg Pred Value : 0.90760        
##              Prevalence : 0.13654        
##          Detection Rate : 0.05016        
##    Detection Prevalence : 0.06513        
##       Balanced Accuracy : 0.67500        
##                                          
##        'Positive' Class : 0              
## 

Random Forest

The last model that I build is a random forest from the “randomForest” package.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Rename the columns to syntactic names (the originals contain spaces)
colnames(train) <- c("assets", "short_term_borrow",
                     "current_assets", "debt",
                     "debt_liabilities", "employees", 
                     "EBIT", "liabilities",
                     "net_income", "r_d",
                     "gic", "SICC", "iso_code")
colnames(test) <- c("assets", "short_term_borrow",
                     "current_assets", "debt",
                     "debt_liabilities", "employees", 
                     "EBIT", "liabilities",
                     "net_income", "r_d",
                     "gic", "SICC", "iso_code")
set.seed(133234)
rf <- randomForest(iso_code ~ ., data=train, importance=TRUE, ntree=200)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
pred.rf <- predict(rf, test)            # predicted values from the regression forest
fitted.rf <- as.numeric(pred.rf > 0.5)  # threshold at 0.5 to obtain predicted classes
print(rf)
## 
## Call:
##  randomForest(formula = iso_code ~ ., data = train, importance = TRUE,      ntree = 200) 
##                Type of random forest: regression
##                      Number of trees: 200
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 0.05384666
##                     % Var explained: 50.08
plot(rf)

varImpPlot(rf)
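As an aside: because iso_code is numeric, randomForest() grew a regression forest (hence the warning above), and we had to threshold its predictions at 0.5 ourselves. A sketch of the classification alternative, which predicts class labels directly:

# With a factor response, randomForest() grows a classification forest
rf_class <- randomForest(as.factor(iso_code) ~ ., data = train,
                         importance = TRUE, ntree = 200)
fitted.rf_class <- predict(rf_class, test)  # factor of predicted classes, no manual threshold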

# RMSE, R-squared and MAE on the test set
caret::postResample(fitted.rf, test$iso_code)
##       RMSE   Rsquared        MAE 
## 0.29032945 0.35613049 0.08429119
# For a good predictive model, the RMSE and MAE values should be low
confusionMatrix(data = as.factor(fitted.rf),
                reference = as.factor(test$iso_code))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  200   50
##          1  192 2429
##                                           
##                Accuracy : 0.9157          
##                  95% CI : (0.9049, 0.9256)
##     No Information Rate : 0.8635          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5782          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.51020         
##             Specificity : 0.97983         
##          Pos Pred Value : 0.80000         
##          Neg Pred Value : 0.92675         
##              Prevalence : 0.13654         
##          Detection Rate : 0.06966         
##    Detection Prevalence : 0.08708         
##       Balanced Accuracy : 0.74502         
##                                           
##        'Positive' Class : 0               
##
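Finally, putting the three test-set accuracies side by side makes the comparison explicit: the random forest does best, followed by the tree and then the logistic regression.

sapply(list(logit = fitted.logit, cart = fitted.rpart, rf = fitted.rf),
       function(pred) mean(pred == test$iso_code))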