The goal of a model is to provide a simple, low-dimensional, interpretable summary of a dataset. Models are a really useful way to help you peel back layers of structure as you explore your dataset. Every statistical model can be divided into two parts: 1. a family of models that expresses a precise, but generic, pattern that you want to capture (e.g., the pattern can be a straight line or a quadratic curve); 2. a fitted model, found by selecting the member of that family that is closest to your data.
It is important to understand that a fitted model is just the closest model from a family of models. This implies that you have the “best” model according to some criterion and based on a set of assumptions; it does not imply that your model is a good model or that your model is “true”. George Box, a famous British statistician, coined one of the most quoted lines in statistics: “all models are wrong, but some are useful”. It is worth reading the fuller context of the quote, as it is quite illustrative of the philosophy behind any statistical model: “Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an ‘ideal’ gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules. For such a model there is no need to ask the question ‘Is the model true?’ If ‘truth’ is to be the ‘whole truth’ the answer must be ‘No.’ The only question of interest is ‘Is the model illuminating and useful?’”
This does not mean that all models are wrong and we should just go for the least wrong one. The quote should be interpreted as a call for carefully laying out the assumptions on which the quality of the model rests. As Berkeley statistician Mark van der Laan stated in a recent article, “The statistical formulation and theory should define the algorithm” (source).
In this lecture we will see how to fit two types of models in R: 1. linear regression models; 2. non-linear predictive models (decision trees, random forests).
Since you have probably seen the former in your econometrics classes, I will skip the mathematical details and focus on the functions used to build these models and on the interpretation of their outputs.
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.2.0 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(modelr)
library(readxl)
data <- read_excel("G:\\Il mio Drive\\Econometrics Lab\\Data\\Compustat Data.xlsx")
# Drop three financial-services variables, then remove rows with missing values
data <- data[, !names(data) %in% c("Interest Expense - Total (Financial Services)", "Net Interest Income", "Nonperforming Assets - Total")]
data_clean <- na.omit(data)
# Keep firms below the 95th percentile of total assets (trims extreme outliers)
x <- data_clean$`Assets - Total`[which(data_clean$`Assets - Total` < quantile(data_clean$`Assets - Total`, 0.95))]
y <- data_clean$`Sales/Turnover (Net)`[which(data_clean$`Assets - Total` < quantile(data_clean$`Assets - Total`, 0.95))]
reg_data <- data.frame(x, y)
ggplot(reg_data, aes(x, y)) +
geom_point()
You can see a fairly clear pattern in the data. Let’s now use a model to capture the pattern and make it more explicit.
Let’s first generate a set of random models and overlay them on the data.
models <- tibble(
  beta1 = runif(length(x), 0, 200),  # random intercepts
  beta2 = runif(length(x), -4, 4)    # random slopes
)
ggplot(reg_data, aes(x, y)) +
geom_abline(
aes(intercept = beta1,
slope = beta2),
data = models, alpha = 1/15
) +
geom_point()
# Predictions of a linear model with parameters beta = (intercept, slope)
model1 <- function(beta, data){
  beta[1] + data$x * beta[2]
}
fitted.values <- model1(c(50, 1.5), reg_data)
head(fitted.values)
## [1] 360.4925 415.7135 409.4930 73.7855 84.3965 51.2975
Let’s now define a distance between the model and the data: the root mean square of the residuals.
measure_distance <- function(mod, data) {
  diff <- data$y - model1(mod, data)  # residuals
  sqrt(mean(diff ^ 2))                # root-mean-squared deviation
}
measure_distance(c(50, 1.5), reg_data)
We can use “purrr” to compute the distance for all the models defined previously. We will need a helper function because our distance function expects the model as a numeric vector of length 2.
reg_data_dist <- function(beta1, beta2) {
measure_distance(c(beta1, beta2), reg_data)
}
models <- models %>%
mutate(dist = purrr::map2_dbl(beta1, beta2, reg_data_dist))
models
## # A tibble: 5,453 x 3
## beta1 beta2 dist
## <dbl> <dbl> <dbl>
## 1 178. 0.255 288.
## 2 72.9 -0.263 378.
## 3 79.3 -0.909 546.
## 4 95.0 -2.29 942.
## 5 13.9 0.528 243.
## 6 161. -1.39 655.
## 7 200. 3.68 1032.
## 8 121. -2.02 852.
## 9 85.9 1.90 453.
## 10 11.6 2.66 624.
## # ... with 5,443 more rows
We can now overlay the best 10 models on the data.
ggplot(reg_data, aes(x, y)) +
geom_point(size = 2, color = "grey30") +
geom_abline(
aes(intercept = beta1, slope = beta2, color = -dist),
data = filter(models, rank(dist) <= 10)
)
We can also think about these models as observations, and visualize them with a scatterplot of beta1 versus beta2, again colored by -dist. We can no longer directly see how the model compares to the data, but we can see many models at once. Again, I’ve highlighted the 10 best models, this time by drawing red circles underneath them:
ggplot(models, aes(beta1, beta2)) +
geom_point(
data = filter(models, rank(dist) <= 10),
size = 4, color = "red"
) +
geom_point(aes(colour = -dist))
Instead of trying lots of random models, we could be more systematic and generate an evenly spaced grid of points (this is called a grid search). I picked the parameters of the grid roughly by looking at where the best models were in the preceding plot:
grid <- expand.grid(
beta1 = seq(0, 200, length = 50),
beta2 = seq(-4, 4, length = 50)
) %>%
mutate(dist = purrr::map2_dbl(beta1, beta2, reg_data_dist))
grid %>%
ggplot(aes(beta1, beta2)) +
geom_point(
data = filter(grid, rank(dist) <= 10),
size = 4, colour = "red"
) +
geom_point(aes(color = -dist))
When you overlay the best 10 models back on the original data, they all look pretty good:
ggplot(reg_data, aes(x, y)) +
geom_point(size = 2, color = "grey30") +
geom_abline(
aes(intercept = beta1, slope = beta2, color = -dist),
data = filter(grid, rank(dist) <= 10)
)
You could imagine iteratively making the grid finer and finer until you narrowed in on the best model. But there’s a better way to tackle that problem: numerical minimization. The intuition is pretty simple: you pick a starting point and look around for the steepest slope. You then ski down that slope a little way, and repeat again and again, until you can’t go any lower. In R, we can do that with optim() (whose default method, Nelder-Mead, is a derivative-free variant of this idea):
best <- optim(c(0, 0), measure_distance, data = reg_data)
best$par
## [1] -3.1063985 0.8127066
ggplot(reg_data, aes(x, y)) +
geom_point(size = 2, color = "grey30") +
geom_abline(intercept = best$par[1], slope = best$par[2])
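As a quick sanity check, we can compare the distance at the parameters found by optim() with the best distance from the grid search; the numerical optimizer should do at least as well:
measure_distance(best$par, reg_data)  # distance at the optimum
min(grid$dist)                        # best distance on the grid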
Don’t worry too much about the details of how optim() works. It’s the intuition that’s important here: if you have a function that defines the distance between a model and a dataset, and an algorithm that can minimize that distance by modifying the parameters of the model, you can find the best model. The neat thing about this approach is that it will work for any family of models that you can write an equation for.
There’s one more approach that we can use for this model, because it is a special case of a broader family: linear models. A linear model has the general form \(y = a_1 + a_2 \cdot x_1 + a_3 \cdot x_2 + \dots + a_n \cdot x_{n - 1}\). So this simple model is equivalent to a general linear model where \(n\) is 2 and \(x_1\) is \(x\). R has a tool specifically designed for fitting linear models called lm(). lm() has a special way to specify the model family: formulas. Formulas look like \(y \sim x\), which lm() will translate to a function like \(y = a_1 + a_2 \cdot x\). We can fit the model and look at the output:
model_1 <- lm(y ~ x, data = reg_data)
summary(model_1)
##
## Call:
## lm(formula = y ~ x, data = reg_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1221.1 -27.4 0.9 9.9 6578.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.06484 3.62636 -0.845 0.398
## x 0.81267 0.01172 69.333 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 229.5 on 5451 degrees of freedom
## Multiple R-squared: 0.4686, Adjusted R-squared: 0.4685
## F-statistic: 4807 on 1 and 5451 DF, p-value: < 2.2e-16
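To see exactly how lm() encodes the formula, you can inspect the design matrix it builds behind the scenes, e.g. with model_matrix() from modelr (loaded above); the intercept enters as a column of ones:
head(model_matrix(reg_data, y ~ x))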
Now let’s add an additional variable to the linear regression so that we can compare two different models.
# Employees, trimmed with the same 95th-percentile filter as x and y
z <- data_clean$Employees[which(data_clean$`Assets - Total` < quantile(data_clean$`Assets - Total`, 0.95))]
reg_data <- cbind(reg_data, z)
model_2 <- lm(y ~ x + z, data = reg_data)
summary(model_2)
##
## Call:
## lm(formula = y ~ x + z, data = reg_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2675.1 -22.5 2.5 11.0 6700.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.31960 3.38664 -1.275 0.202
## x 0.69931 0.01166 59.999 <2e-16 ***
## z 20.62225 0.72862 28.303 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 214.3 on 5450 degrees of freedom
## Multiple R-squared: 0.5367, Adjusted R-squared: 0.5365
## F-statistic: 3157 on 2 and 5450 DF, p-value: < 2.2e-16
In R, you can either write down all the variables that you want to use as regressors in your model, or you can use the shorthand \(y \sim .\), which includes every other column of the data frame as a regressor.
model_3 <- lm(y ~ ., data = reg_data)
summary(model_3)
##
## Call:
## lm(formula = y ~ ., data = reg_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2675.1 -22.5 2.5 11.0 6700.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.31960 3.38664 -1.275 0.202
## x 0.69931 0.01166 59.999 <2e-16 ***
## z 20.62225 0.72862 28.303 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 214.3 on 5450 degrees of freedom
## Multiple R-squared: 0.5367, Adjusted R-squared: 0.5365
## F-statistic: 3157 on 2 and 5450 DF, p-value: < 2.2e-16
A very easy way to compare two nested linear regressions is through the likelihood ratio test. In statistics, the likelihood-ratio test assesses the goodness of fit of two competing statistical models based on the ratio of their likelihoods; it requires the models to be nested, i.e., the smaller model must be a special case of the larger one, as is the case here.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
lrtest(model_1, model_2)
## Likelihood ratio test
##
## Model 1: y ~ x
## Model 2: y ~ x + z
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -37377
## 2 4 -37004 1 747.81 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
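For intuition, the test can be reproduced by hand from the two log-likelihoods: twice their difference is compared to a chi-squared distribution with degrees of freedom equal to the number of extra parameters (here, 1):
# Reproduce the LR statistic and p-value reported by lrtest()
lr_stat <- as.numeric(2 * (logLik(model_2) - logLik(model_1)))
pchisq(lr_stat, df = 1, lower.tail = FALSE)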
\(p < 0.001\) indicates that the model with both predictors fits significantly better than the model with only one predictor. Another “goodness-of-fit” measure that can be used is the \(R^2\): \[\begin{equation} R^2 = 1 - \frac{RSS}{TSS}, \end{equation}\] where \(RSS\) is the residual sum of squares and \(TSS\) is the total sum of squares.
summary(model_1)$r.squared
## [1] 0.4686107
summary(model_2)$r.squared
## [1] 0.5367084
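As a sanity check, the same number can be computed directly from the definition above:
# R^2 for model_2 from its residuals
rss <- sum(resid(model_2)^2)                   # residual sum of squares
tss <- sum((reg_data$y - mean(reg_data$y))^2)  # total sum of squares
1 - rss / tss                                  # matches summary(model_2)$r.squared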
We can also get the model’s prediction for any \(x\) and \(z\) by running the following chunk of code.
coeffs <- coefficients(model_2)
assets <- 159
employees <- 2
y_hat <- coeffs[1] + coeffs[2]*assets + coeffs[3]*employees  # manual prediction
y_hat
## (Intercept)
## 148.1153
Or, equivalently:
newdata <- data.frame(x = 159, z = 2)
predict(model_2, newdata)
## 1
## 148.1153
predict(model_2, newdata, interval="confidence")
## fit lwr upr
## 1 148.1153 142.2269 154.0036
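Note that the interval above is a confidence interval for the mean response at those regressor values. If you want an interval for a single new firm, ask for a prediction interval instead; it is wider because it also accounts for the residual variance:
predict(model_2, newdata, interval = "prediction")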
Once we have fitted our favourite model, we can check its residuals: \(e_i = y_i - \hat{f}(x_i)\).
model.res <- resid(model_2)
plot(reg_data$y, model.res, ylab="Residuals", xlab="Sales", main="Residuals v. Sales")
abline(0, 0)  # horizontal reference line at zero
Moreover, we can standardize the residuals and plot them against the theoretical normal quantiles (a normal Q-Q plot) to check the normality assumption.
model_2.stdres <- rstandard(model_2)
qqnorm(model_2.stdres , ylab="Standardized Residuals", xlab="Normal Scores", main="Standardized Residuals v. Sales")
qqline(model_2.stdres)
In R, you can introduce an interaction between the regressors by using \(*\). Note that x * z automatically expands to x + z + x:z, so the main effects are included along with the interaction (spelling them out, as below, is redundant but harmless); if you want the interaction term alone, use x:z. A quick check of this equivalence follows the output below.
model_int<-lm(y ~ x + z + x*z, data = reg_data)
summary(model_int)
##
## Call:
## lm(formula = y ~ x + z + x * z, data = reg_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1568.2 -25.4 -9.4 6.9 6775.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.308146 3.274783 2.842 0.00449 **
## x 0.605046 0.011784 51.343 < 2e-16 ***
## z 2.426270 1.033420 2.348 0.01892 *
## x:z 0.034463 0.001451 23.754 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 204 on 5449 degrees of freedom
## Multiple R-squared: 0.5802, Adjusted R-squared: 0.5799
## F-statistic: 2510 on 3 and 5449 DF, p-value: < 2.2e-16
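Since x * z already expands to the main effects plus the interaction, the compact formula y ~ x * z fits exactly the same model; a quick check (model_int2 is just an illustrative name):
model_int2 <- lm(y ~ x * z, data = reg_data)
all.equal(coef(model_int), coef(model_int2))  # TRUE: identical coefficients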
You can’t directly introduce a quadratic term like x^2 in the regression formula, because ^ has a special meaning inside R formulas. One option is to create an additional variable with the squared term and include that in the regression, as below; another is to wrap the term in I() (see the sketch after the output).
x2 <- x^2  # lives in the global environment, not in reg_data; lm() will still find it
model_squared <- lm(y ~ x + x2 + z, data = reg_data)
summary(model_squared)
##
## Call:
## lm(formula = y ~ x + x2 + z, data = reg_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2676.7 -22.3 -1.9 8.9 6669.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.055e+00 3.866e+00 0.273 0.78489
## x 6.189e-01 3.031e-02 20.418 < 2e-16 ***
## x2 7.753e-05 2.696e-05 2.875 0.00405 **
## z 2.066e+01 7.283e-01 28.370 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 214.1 on 5449 degrees of freedom
## Multiple R-squared: 0.5374, Adjusted R-squared: 0.5372
## F-statistic: 2110 on 3 and 5449 DF, p-value: < 2.2e-16
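Alternatively, you can keep the squared term inside the formula by wrapping it in I(), which shields ^ from its special formula meaning; this sketch fits the same model without the helper variable:
model_squared2 <- lm(y ~ x + I(x^2) + z, data = reg_data)
all.equal(unname(coef(model_squared)), unname(coef(model_squared2)))  # TRUE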
In some scenarios we may be interested in building a statistical model to predict a certain outcome. In this case, for instance, we may want to use different models to predict the location of a firm: the Compustat data contain US and Canadian enterprises. In the next chunks of code I will build three different models (logistic regression, CART, and random forest) to predict the location of the firm.
Before running the analyses, I restrict the set of predictors to the following variables and create a dummy variable that takes value 1 if the firm is located in the US and 0 if it is located in Canada.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
myvariables <- c("ISO Currency Code",
"Assets - Total", "Average Short-Term Borrowings",
"Current Assets - Total", "Long-Term Debt Due in One Year",
"Debt in Current Liabilities - Total", "Employees",
"Earnings Before Interest and Taxes", "Liabilities - Total",
"Net Income (Loss)", "In Process R&D Expense",
"GIC Sectors", "Standard Industry Classification Code")
data_prediction <- data_clean[myvariables]
data_prediction$iso_code <- ifelse(data_prediction$`ISO Currency Code`=="USD", 1, 0)
data_prediction <- data_prediction[, !names(data_prediction) %in% c("ISO Currency Code")]
In order to check how good the three models are, I randomly split the data into two disjoint sets: a training set that I will use to build the models and a test set that I will use to validate the quality of their predictions.
set.seed(123)
index <- sample(seq_len(nrow(data_prediction)),
size = nrow(data_prediction)*0.5)
train <- data_prediction[index,]
test <- data_prediction[-index,]
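With an unbalanced outcome like iso_code, a plain random split can leave the two halves with slightly different class proportions. A stratified alternative (a sketch, not used below) is caret’s createDataPartition(), which preserves the class ratio in both halves:
# Stratified 50/50 split on the outcome
strat_index <- caret::createDataPartition(as.factor(data_prediction$iso_code),
                                          p = 0.5, list = FALSE)
train_strat <- data_prediction[strat_index, ]
test_strat <- data_prediction[-strat_index, ]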
The first model that I run is a logistic regression including all the covariates.
logit<-glm(iso_code ~ ., data= train, family=binomial(link='logit'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit)
##
## Call:
## glm(formula = iso_code ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1015 0.1754 0.3027 0.4930 1.4832
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -8.663e-01 1.508e-01 -5.743
## `Assets - Total` -4.599e-04 3.459e-04 -1.330
## `Average Short-Term Borrowings` 1.684e-02 9.810e-03 1.716
## `Current Assets - Total` 2.555e-03 7.378e-04 3.463
## `Long-Term Debt Due in One Year` 6.835e-03 6.994e-03 0.977
## `Debt in Current Liabilities - Total` -1.073e-02 6.779e-03 -1.582
## Employees 4.996e-02 3.530e-02 1.415
## `Earnings Before Interest and Taxes` -2.414e-03 2.280e-03 -1.059
## `Liabilities - Total` 8.082e-04 6.295e-04 1.284
## `Net Income (Loss)` 4.237e-04 2.352e-03 0.180
## `In Process R&D Expense` -1.391e+02 5.892e+03 -0.024
## `GIC Sectors` 4.632e-02 6.250e-03 7.411
## `Standard Industry Classification Code` 3.329e-04 3.724e-05 8.938
## Pr(>|z|)
## (Intercept) 9.29e-09 ***
## `Assets - Total` 0.183633
## `Average Short-Term Borrowings` 0.086105 .
## `Current Assets - Total` 0.000535 ***
## `Long-Term Debt Due in One Year` 0.328398
## `Debt in Current Liabilities - Total` 0.113545
## Employees 0.156980
## `Earnings Before Interest and Taxes` 0.289682
## `Liabilities - Total` 0.199163
## `Net Income (Loss)` 0.857055
## `In Process R&D Expense` 0.981162
## `GIC Sectors` 1.26e-13 ***
## `Standard Industry Classification Code` < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2140.2 on 2869 degrees of freedom
## Residual deviance: 1669.4 on 2857 degrees of freedom
## AIC: 1695.4
##
## Number of Fisher Scoring iterations: 25
To assess the accuracy of the model, I first get the predicted probabilities, then assign each test observation to a class by thresholding the probability at 0.5, and finally check the root-mean-squared error (RMSE), the mean absolute error (MAE), the \(R^2\), and the confusion matrix.
# Accuracy on the held-out test set
fitted.results.logit <- predict(logit, newdata = test, type='response')
fitted.logit <- ifelse(fitted.results.logit >= 0.5, 1, 0)
# RMSE
caret::postResample(fitted.logit, test$iso_code)
## RMSE Rsquared MAE
## 0.36476662 0.02285701 0.13305468
# For a good predictive model, RMSE and MAE should be low
confusionMatrix(data = as.factor(fitted.logit),
reference = as.factor(test$iso_code))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16 6
## 1 376 2473
##
## Accuracy : 0.8669
## 95% CI : (0.854, 0.8792)
## No Information Rate : 0.8635
## P-Value [Acc > NIR] : 0.3045
##
## Kappa : 0.0637
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.040816
## Specificity : 0.997580
## Pos Pred Value : 0.727273
## Neg Pred Value : 0.868024
## Prevalence : 0.136538
## Detection Rate : 0.005573
## Detection Prevalence : 0.007663
## Balanced Accuracy : 0.519198
##
## 'Positive' Class : 0
##
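The headline numbers in the confusion matrix can also be recovered directly from the predictions, which makes their definitions concrete:
mean(fitted.logit == test$iso_code)  # overall accuracy
# Sensitivity of the positive class (0 = Canada): share of true 0s flagged
sum(fitted.logit == 0 & test$iso_code == 0) / sum(test$iso_code == 0)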
The second model that I run is a classification and regression tree (CART), built with the rpart() function from the “rpart” package. To avoid masking the function name, I store the fitted tree in rpart_model.
library(rpart)
rpart_model <- rpart(iso_code ~ ., data=train, method="class")
printcp(rpart_model) # display the results
##
## Classification tree:
## rpart(formula = iso_code ~ ., data = train, method = "class")
##
## Variables actually used in tree construction:
## [1] Assets - Total
## [2] Current Assets - Total
## [3] Debt in Current Liabilities - Total
## [4] Earnings Before Interest and Taxes
## [5] GIC Sectors
## [6] Liabilities - Total
## [7] Standard Industry Classification Code
##
## Root node error: 353/2870 = 0.123
##
## n= 2870
##
## CP nsplit rel error xerror xstd
## 1 0.158640 0 1.00000 1.00000 0.049844
## 2 0.133144 1 0.84136 0.94618 0.048667
## 3 0.014164 2 0.70822 0.72805 0.043333
## 4 0.010387 6 0.65156 0.72805 0.043333
## 5 0.010000 10 0.60907 0.74788 0.043860
plotcp(rpart_model) # visualize cross-validation results
summary(rpart_model) # detailed summary of splits
## Call:
## rpart(formula = iso_code ~ ., data = train, method = "class")
## n= 2870
##
## CP nsplit rel error xerror xstd
## 1 0.15864023 0 1.0000000 1.0000000 0.04984405
## 2 0.13314448 1 0.8413598 0.9461756 0.04866673
## 3 0.01416431 2 0.7082153 0.7280453 0.04333319
## 4 0.01038716 6 0.6515581 0.7280453 0.04333319
## 5 0.01000000 10 0.6090652 0.7478754 0.04386049
##
## Variable importance
## Standard Industry Classification Code
## 40
## GIC Sectors
## 23
## Liabilities - Total
## 10
## Assets - Total
## 9
## Current Assets - Total
## 6
## Earnings Before Interest and Taxes
## 4
## Employees
## 3
## Net Income (Loss)
## 3
## Debt in Current Liabilities - Total
## 2
## Average Short-Term Borrowings
## 1
## Long-Term Debt Due in One Year
## 1
##
## Node number 1: 2870 observations, complexity param=0.1586402
## predicted class=1 expected loss=0.1229965 P(node) =1
## class counts: 353 2517
## probabilities: 0.123 0.877
## left son=2 (338 obs) right son=3 (2532 obs)
## Primary splits:
## Standard Industry Classification Code < 1661.5 to the left, improve=162.026300, (0 missing)
## GIC Sectors < 17.5 to the left, improve=122.071700, (0 missing)
## Employees < 0.1175 to the left, improve= 21.343270, (0 missing)
## Current Assets - Total < 108.6385 to the left, improve= 15.732660, (0 missing)
## Assets - Total < 2.631 to the right, improve= 6.835526, (0 missing)
## Surrogate splits:
## GIC Sectors < 17.5 to the left, agree=0.945, adj=0.536, (0 split)
##
## Node number 2: 338 observations, complexity param=0.1331445
## predicted class=0 expected loss=0.4171598 P(node) =0.11777
## class counts: 197 141
## probabilities: 0.583 0.417
## left son=4 (245 obs) right son=5 (93 obs)
## Primary splits:
## Assets - Total < 10.73 to the right, improve=28.88831, (0 missing)
## Current Assets - Total < 1.106 to the right, improve=26.66774, (0 missing)
## Liabilities - Total < 0.1225 to the right, improve=24.78560, (0 missing)
## Standard Industry Classification Code < 1067 to the left, improve=12.88777, (0 missing)
## GIC Sectors < 12.5 to the right, improve=12.67121, (0 missing)
## Surrogate splits:
## Liabilities - Total < 0.261 to the right, agree=0.908, adj=0.667, (0 split)
## Current Assets - Total < 1.705 to the right, agree=0.902, adj=0.645, (0 split)
## Employees < 0.0035 to the right, agree=0.825, adj=0.366, (0 split)
##
## Node number 3: 2532 observations
## predicted class=1 expected loss=0.06161137 P(node) =0.88223
## class counts: 156 2376
## probabilities: 0.062 0.938
##
## Node number 4: 245 observations, complexity param=0.01416431
## predicted class=0 expected loss=0.2897959 P(node) =0.08536585
## class counts: 174 71
## probabilities: 0.710 0.290
## left son=8 (125 obs) right son=9 (120 obs)
## Primary splits:
## GIC Sectors < 12.5 to the right, improve=10.849650, (0 missing)
## Standard Industry Classification Code < 1065 to the left, improve= 9.095067, (0 missing)
## Liabilities - Total < 1.897 to the left, improve= 7.060408, (0 missing)
## Current Assets - Total < 15.342 to the left, improve= 5.298863, (0 missing)
## Net Income (Loss) < 0.032 to the left, improve= 2.838368, (0 missing)
## Surrogate splits:
## Standard Industry Classification Code < 1065 to the left, agree=0.935, adj=0.867, (0 split)
## Liabilities - Total < 2.5085 to the left, agree=0.731, adj=0.450, (0 split)
## Earnings Before Interest and Taxes < -0.0795 to the left, agree=0.710, adj=0.408, (0 split)
## Net Income (Loss) < -0.6345 to the left, agree=0.682, adj=0.350, (0 split)
## Assets - Total < 45.735 to the left, agree=0.633, adj=0.250, (0 split)
##
## Node number 5: 93 observations, complexity param=0.01038716
## predicted class=1 expected loss=0.2473118 P(node) =0.03240418
## class counts: 23 70
## probabilities: 0.247 0.753
## left son=10 (49 obs) right son=11 (44 obs)
## Primary splits:
## Standard Industry Classification Code < 1155 to the left, improve=6.805474, (0 missing)
## Liabilities - Total < 0.0605 to the right, improve=5.417307, (0 missing)
## GIC Sectors < 12.5 to the right, improve=4.870546, (0 missing)
## Assets - Total < 1.907 to the right, improve=4.105832, (0 missing)
## Current Assets - Total < 1.106 to the right, improve=3.529394, (0 missing)
## Surrogate splits:
## GIC Sectors < 12.5 to the right, agree=0.935, adj=0.864, (0 split)
## Earnings Before Interest and Taxes < -0.487 to the left, agree=0.742, adj=0.455, (0 split)
## Current Assets - Total < 0.355 to the right, agree=0.656, adj=0.273, (0 split)
## Employees < 0.0055 to the right, agree=0.645, adj=0.250, (0 split)
## Liabilities - Total < 0.041 to the right, agree=0.645, adj=0.250, (0 split)
##
## Node number 8: 125 observations
## predicted class=0 expected loss=0.144 P(node) =0.04355401
## class counts: 107 18
## probabilities: 0.856 0.144
##
## Node number 9: 120 observations, complexity param=0.01416431
## predicted class=0 expected loss=0.4416667 P(node) =0.04181185
## class counts: 67 53
## probabilities: 0.558 0.442
## left son=18 (19 obs) right son=19 (101 obs)
## Primary splits:
## Liabilities - Total < 196.3925 to the right, improve=6.833151, (0 missing)
## Debt in Current Liabilities - Total < 0.023 to the right, improve=6.628602, (0 missing)
## Average Short-Term Borrowings < 2.6985 to the right, improve=4.256061, (0 missing)
## Assets - Total < 487.2625 to the right, improve=4.079792, (0 missing)
## Net Income (Loss) < 43.2665 to the right, improve=3.879236, (0 missing)
## Surrogate splits:
## Assets - Total < 415.227 to the right, agree=0.950, adj=0.684, (0 split)
## Net Income (Loss) < 43.2665 to the right, agree=0.933, adj=0.579, (0 split)
## Earnings Before Interest and Taxes < 55.055 to the right, agree=0.925, adj=0.526, (0 split)
## Average Short-Term Borrowings < 60.251 to the right, agree=0.867, adj=0.158, (0 split)
## Debt in Current Liabilities - Total < 71.65 to the right, agree=0.867, adj=0.158, (0 split)
##
## Node number 10: 49 observations, complexity param=0.01038716
## predicted class=1 expected loss=0.4285714 P(node) =0.01707317
## class counts: 21 28
## probabilities: 0.429 0.571
## left son=20 (37 obs) right son=21 (12 obs)
## Primary splits:
## Liabilities - Total < 0.0605 to the right, improve=5.837838, (0 missing)
## Assets - Total < 1.907 to the right, improve=5.121212, (0 missing)
## Current Assets - Total < 1.633 to the right, improve=4.015686, (0 missing)
## Employees < 5e-04 to the left, improve=3.000000, (0 missing)
## Earnings Before Interest and Taxes < -0.799 to the right, improve=1.618729, (0 missing)
## Surrogate splits:
## Earnings Before Interest and Taxes < -0.0225 to the left, agree=0.796, adj=0.167, (0 split)
## Current Assets - Total < 0.0015 to the right, agree=0.776, adj=0.083, (0 split)
##
## Node number 11: 44 observations
## predicted class=1 expected loss=0.04545455 P(node) =0.01533101
## class counts: 2 42
## probabilities: 0.045 0.955
##
## Node number 18: 19 observations
## predicted class=0 expected loss=0.05263158 P(node) =0.006620209
## class counts: 18 1
## probabilities: 0.947 0.053
##
## Node number 19: 101 observations, complexity param=0.01416431
## predicted class=1 expected loss=0.4851485 P(node) =0.03519164
## class counts: 49 52
## probabilities: 0.485 0.515
## left son=38 (16 obs) right son=39 (85 obs)
## Primary splits:
## Debt in Current Liabilities - Total < 0.023 to the right, improve=5.778975, (0 missing)
## Current Assets - Total < 19.392 to the left, improve=4.172308, (0 missing)
## Average Short-Term Borrowings < 2.6985 to the right, improve=3.987360, (0 missing)
## Net Income (Loss) < 23.475 to the left, improve=2.271607, (0 missing)
## Long-Term Debt Due in One Year < 0.023 to the right, improve=2.081585, (0 missing)
## Surrogate splits:
## Average Short-Term Borrowings < 2.6985 to the right, agree=0.911, adj=0.438, (0 split)
## Long-Term Debt Due in One Year < 0.023 to the right, agree=0.911, adj=0.438, (0 split)
## Standard Industry Classification Code < 1346 to the right, agree=0.871, adj=0.188, (0 split)
## Current Assets - Total < 3.1195 to the left, agree=0.851, adj=0.063, (0 split)
##
## Node number 20: 37 observations, complexity param=0.01038716
## predicted class=0 expected loss=0.4324324 P(node) =0.01289199
## class counts: 21 16
## probabilities: 0.568 0.432
## left son=40 (15 obs) right son=41 (22 obs)
## Primary splits:
## Earnings Before Interest and Taxes < -0.8915 to the right, improve=4.513677, (0 missing)
## Net Income (Loss) < -0.9225 to the right, improve=3.382400, (0 missing)
## Current Assets - Total < 1.0465 to the right, improve=2.444515, (0 missing)
## Employees < 0.0015 to the left, improve=1.966358, (0 missing)
## Assets - Total < 1.907 to the right, improve=1.416708, (0 missing)
## Surrogate splits:
## Net Income (Loss) < -0.9225 to the right, agree=0.973, adj=0.933, (0 split)
## Employees < 0.0015 to the left, agree=0.784, adj=0.467, (0 split)
## Liabilities - Total < 0.1325 to the left, agree=0.730, adj=0.333, (0 split)
## Assets - Total < 0.092 to the left, agree=0.703, adj=0.267, (0 split)
## Current Assets - Total < 0.035 to the left, agree=0.703, adj=0.267, (0 split)
##
## Node number 21: 12 observations
## predicted class=1 expected loss=0 P(node) =0.004181185
## class counts: 0 12
## probabilities: 0.000 1.000
##
## Node number 38: 16 observations
## predicted class=0 expected loss=0.125 P(node) =0.005574913
## class counts: 14 2
## probabilities: 0.875 0.125
##
## Node number 39: 85 observations, complexity param=0.01416431
## predicted class=1 expected loss=0.4117647 P(node) =0.02961672
## class counts: 35 50
## probabilities: 0.412 0.588
## left son=78 (9 obs) right son=79 (76 obs)
## Primary splits:
## Liabilities - Total < 1.8985 to the left, improve=2.696938, (0 missing)
## Current Assets - Total < 17.346 to the left, improve=1.928219, (0 missing)
## Net Income (Loss) < 23.475 to the left, improve=1.336176, (0 missing)
## Earnings Before Interest and Taxes < -12.381 to the right, improve=1.307255, (0 missing)
## Assets - Total < 44.226 to the left, improve=1.061030, (0 missing)
## Surrogate splits:
## Standard Industry Classification Code < 1155 to the left, agree=0.906, adj=0.111, (0 split)
##
## Node number 40: 15 observations
## predicted class=0 expected loss=0.1333333 P(node) =0.005226481
## class counts: 13 2
## probabilities: 0.867 0.133
##
## Node number 41: 22 observations, complexity param=0.01038716
## predicted class=1 expected loss=0.3636364 P(node) =0.007665505
## class counts: 8 14
## probabilities: 0.364 0.636
## left son=82 (8 obs) right son=83 (14 obs)
## Primary splits:
## Current Assets - Total < 2.785 to the right, improve=3.7532470, (0 missing)
## Assets - Total < 1.992 to the right, improve=1.4318180, (0 missing)
## Earnings Before Interest and Taxes < -3.2205 to the left, improve=0.6818182, (0 missing)
## Standard Industry Classification Code < 1020 to the left, improve=0.6818182, (0 missing)
## Liabilities - Total < 0.1925 to the right, improve=0.6091686, (0 missing)
## Surrogate splits:
## Assets - Total < 2.839 to the right, agree=0.818, adj=0.500, (0 split)
## Liabilities - Total < 0.4985 to the right, agree=0.773, adj=0.375, (0 split)
## Earnings Before Interest and Taxes < -3.2205 to the left, agree=0.727, adj=0.250, (0 split)
## Standard Industry Classification Code < 1020 to the left, agree=0.727, adj=0.250, (0 split)
## Employees < 0.0065 to the right, agree=0.682, adj=0.125, (0 split)
##
## Node number 78: 9 observations
## predicted class=0 expected loss=0.2222222 P(node) =0.003135889
## class counts: 7 2
## probabilities: 0.778 0.222
##
## Node number 79: 76 observations
## predicted class=1 expected loss=0.3684211 P(node) =0.02648084
## class counts: 28 48
## probabilities: 0.368 0.632
##
## Node number 82: 8 observations
## predicted class=0 expected loss=0.25 P(node) =0.002787456
## class counts: 6 2
## probabilities: 0.750 0.250
##
## Node number 83: 14 observations
## predicted class=1 expected loss=0.1428571 P(node) =0.004878049
## class counts: 2 12
## probabilities: 0.143 0.857
You can depict the classification tree by using the rpart.plot() function from the “rpart.plot” package.
# Plot the tree
rpart.plot::rpart.plot(rpart_model)
# Accuracy on the held-out test set
fitted.results.rpart <- predict(rpart_model, newdata=test, type='prob')
fitted.rpart <- ifelse(fitted.results.rpart[,2] >= 0.5, 1, 0)
# RMSE
caret::postResample(fitted.rpart, test$iso_code)
## RMSE Rsquared MAE
## 0.3183684 0.2371798 0.1013584
# For a good predictive model, RMSE and MAE should be low
confusionMatrix(data = as.factor(fitted.rpart),
reference = as.factor(test$iso_code))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 144 43
## 1 248 2436
##
## Accuracy : 0.8986
## 95% CI : (0.887, 0.9094)
## No Information Rate : 0.8635
## P-Value [Acc > NIR] : 6.802e-09
##
## Kappa : 0.4488
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.36735
## Specificity : 0.98265
## Pos Pred Value : 0.77005
## Neg Pred Value : 0.90760
## Prevalence : 0.13654
## Detection Rate : 0.05016
## Detection Prevalence : 0.06513
## Balanced Accuracy : 0.67500
##
## 'Positive' Class : 0
##
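The cross-validated errors reported by printcp() above can also guide pruning: a common heuristic is to cut the tree back at the complexity parameter with the lowest xerror. A minimal sketch under that heuristic:
# Prune at the CP value with the smallest cross-validated error
best_cp <- rpart_model$cptable[which.min(rpart_model$cptable[, "xerror"]), "CP"]
rpart_pruned <- prune(rpart_model, cp = best_cp)
rpart.plot::rpart.plot(rpart_pruned)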
The last model that I build is a random forest from the “randomForest” package. I first rename the columns of both the training and the test set to short, syntactic names, since the formula interface works more smoothly without spaces and special characters in variable names.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
colnames(train) <- c("assets", "short_term_borrow",
"current_assets", "debt",
"debt_liabilities", "employees",
"EBIT", "liabilities",
"net_income", "r_d",
"gic", "SICC", "iso_code")
colnames(test) <- c("assets", "short_term_borrow",
"current_assets", "debt",
"debt_liabilities", "employees",
"EBIT", "liabilities",
"net_income", "r_d",
"gic", "SICC", "iso_code")
set.seed(133234)
rf <- randomForest(iso_code ~ ., data=train, importance=TRUE, ntree=200)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
predict.rf <- predict(rf, test)
fitted.rf <- as.numeric(predict.rf > 0.5)  # threshold regression output at 0.5
print(rf)
##
## Call:
## randomForest(formula = iso_code ~ ., data = train, importance = TRUE, ntree = 200)
## Type of random forest: regression
## Number of trees: 200
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 0.05384666
## % Var explained: 50.08
plot(rf)
varImpPlot(rf)
# Fitted results for the random forest
caret::postResample(fitted.rf, test$iso_code)
## RMSE Rsquared MAE
## 0.29032945 0.35613049 0.08429119
# For a good predictive model, RMSE and MAE should be low
confusionMatrix(data = as.factor(fitted.rf),
reference = as.factor(test$iso_code))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 200 50
## 1 192 2429
##
## Accuracy : 0.9157
## 95% CI : (0.9049, 0.9256)
## No Information Rate : 0.8635
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5782
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.51020
## Specificity : 0.97983
## Pos Pred Value : 0.80000
## Neg Pred Value : 0.92675
## Prevalence : 0.13654
## Detection Rate : 0.06966
## Detection Prevalence : 0.08708
## Balanced Accuracy : 0.74502
##
## 'Positive' Class : 0
##
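Because iso_code is numeric, randomForest() ran in regression mode (hence the warning printed when the forest was fitted) and we thresholded its predictions at 0.5. A sketch of the same forest in classification mode, where each tree votes on a class directly (results should be similar but not identical):
set.seed(133234)
rf_class <- randomForest(as.factor(iso_code) ~ ., data = train,
                         importance = TRUE, ntree = 200)
# predict() now returns a factor of classes, so no manual threshold is needed
confusionMatrix(data = predict(rf_class, test),
                reference = as.factor(test$iso_code))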