In the following study:

Òscar Jordà, Moritz Schularick, and Alan M. Taylor. 2017. "Macrofinancial History and the New Business Cycle Facts." In NBER Macroeconomics Annual 2016, Volume 31, edited by Martin Eichenbaum and Jonathan A. Parker. Chicago: University of Chicago Press.

the researchers provide convenient, no-cost open access to the most extensive long-run macro-financial dataset available to date. The database covers 17 advanced economies since 1870 on an annual basis and comprises 25 real and nominal variables, such as bank credit to the non-financial private sector, mortgage lending, and long-term house prices. It captures the near-universe of advanced-country macroeconomic and asset price dynamics, covering on average over 90 percent of advanced-economy output and over 50 percent of world output.

In this project we will use machine learning algorithms to predict the economic cycle: the natural fluctuation of the economy between periods of expansion (growth) and contraction (recession).

The data for this project can be downloaded from this original source: http://www.macrohistory.net/data/.

Getting and Cleaning the Data

Factors such as gross domestic product (GDP), interest rates, levels of employment and consumer spending can help to determine the current stage of the economic cycle.

An economic growth rate is a measure of economic growth from one period to another in percentage terms. The economic growth rate provides insight into the general direction and magnitude of growth for the overall economy.
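In the code below, growth rates are computed as first differences of the log series, which closely approximate percentage changes when growth is small. A toy illustration with made-up numbers:

gdp <- c(100, 103, 105, 104)               # hypothetical GDP series
pct_growth <- diff(gdp) / head(gdp, -1)    # exact percentage growth
log_growth <- diff(log(gdp))               # log-difference approximation
round(cbind(pct_growth, log_growth), 4)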

In the first part of the project, we plan to predict whether the annual growth rate rises or falls from one year to the next. If an economy experiences two consecutive years of falling growth rates, it can be said to be falling into a recession.
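As a toy illustration of this rule (hypothetical numbers, not part of the dataset), a recession flag could be constructed like this:

growth <- c(0.030, 0.025, 0.010, 0.020, 0.015, 0.005)  # hypothetical growth rates
falling <- diff(growth) < 0                             # TRUE where the rate fell
# TRUE in years that complete two consecutive declines
recession_flag <- c(FALSE, falling) & c(FALSE, FALSE, head(falling, -1))
recession_flag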

set.seed(1)
library(dplyr)
library(ggplot2)
library(gdata)  # provides read.xls() for Excel files

# Local path to the downloaded JST dataset
setwd("/Users/junwen/Documents/Project_Incubator/Data")
df <- read.xls("JSTdatasetR1.xlsx", sheet = 2, header = TRUE)

# Restrict the analysis to the United States
df <- subset(df, iso == "USA")

# Annual growth rates, computed as log differences
df$gdp_change         <- c(NA, diff(log(df$gdp)))
df$stock_change       <- c(NA, diff(log(df$stocks)))
df$import_change      <- c(NA, diff(log(df$imports)))
df$export_change      <- c(NA, diff(log(df$exports)))
df$revenue_change     <- c(NA, diff(log(df$revenue)))
df$expenditure_change <- c(NA, diff(log(df$expenditure)))

# Target: the upcoming change in the GDP growth rate
df$gdp_rate_change <- c(diff(df$gdp_change), NA)

rownames(df) <- 1:nrow(df)
df <- df[-c(2:4)]  # drop the identifier columns

Because there are many missing values, we first impute them with the column medians; columns that are mostly missing (60 percent or more) are set to zero so they can be dropped in the next step.

library(caret)

# Median-impute each column; a column that is at least 60% missing
# is zeroed out so it can be dropped afterwards
f <- function(x) {
   if (sum(is.na(x)) >= length(x) * 0.6) {
     x <- 0
   }
   x[is.na(x)] <- median(x, na.rm = TRUE)
   x
}
df <- data.frame(apply(df, 2, f))
dim(df)
## [1] 144  33

# Drop the all-zero columns, then any remaining near-zero-variance predictors
df <- df[!(apply(df == 0, 2, all))]
nzv_cols <- nearZeroVar(df)
if (length(nzv_cols) > 0) df <- df[, -nzv_cols]
dim(df)
## [1] 144  31

Exploratory analysis and statistical inference

One interesting question is how changes in the stock market affect the economic growth rate. We want to compare the mean change in the growth rate between two groups: years with positive stock market change and years without.

boxplot(gdp_rate_change ~ as.factor(stock_change > 0), df,
        xlab = "Stock market growing", ylab = "GDP growth rate change")

We cannot use a paired t interval because the two groups are independent, so we compare them with a two-sample t interval instead. We show both intervals: first the Welch interval, which does not assume equal variances, then the equal-variance interval for comparison.

t.test(gdp_rate_change ~ as.factor(stock_change > 0), paired = FALSE, var.equal = FALSE, data = df)
## 
##  Welch Two Sample t-test
## 
## data:  gdp_rate_change by as.factor(stock_change > 0)
## t = -2.2731, df = 78.367, p-value = 0.02576
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.061919537 -0.004101324
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##         -0.02121195          0.01179848
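The equal-variance interval is obtained by setting var.equal = TRUE (output not shown):

t.test(gdp_rate_change ~ as.factor(stock_change > 0), paired = FALSE, var.equal = TRUE, data = df)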

The confidence interval of the Welch test does not contain zero, so we reject the null hypothesis of equal means: on average, the change in the GDP growth rate is higher in years when the stock market grows.

Similarly, we can investigate whether revenue growth has a positive impact on the change in the economic growth rate.

boxplot(gdp_rate_change ~ as.factor(revenue_change > 0), df,
        xlab = "Revenue growing", ylab = "GDP growth rate change")

t.test(gdp_rate_change ~ as.factor(revenue_change > 0), paired = FALSE, var.equal = FALSE, data = df)
## 
##  Welch Two Sample t-test
## 
## data:  gdp_rate_change by as.factor(revenue_change > 0)
## t = 2.4178, df = 56.057, p-value = 0.01889
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.006725553 0.071708456
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##          0.02838758         -0.01082942

Again the confidence interval does not contain zero, so we reject the null hypothesis of equal means. Here, however, the direction is the opposite: the change in the GDP growth rate is higher, on average, in years when revenue falls.

boxplot(gdp_rate_change ~ as.factor(expenditure_change > 0), df,
        xlab = "Expenditure growing", ylab = "GDP growth rate change")

t.test(gdp_rate_change ~ as.factor(expenditure_change > 0), paired = FALSE, var.equal = FALSE, data = df)
## 
##  Welch Two Sample t-test
## 
## data:  gdp_rate_change by as.factor(expenditure_change > 0)
## t = 0.49874, df = 55.555, p-value = 0.6199
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02520152  0.04190598
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##         0.006310694        -0.002041537

However, since the confidence interval contains zero, this test provides no evidence that expenditure growth affects the change in the economic growth rate.

Using machine learning to evaluate the data

# Split the data 75/25 into training and test sets
inTraining <- createDataPartition(y = df$gdp, p = 0.75, list = FALSE)
myTraining <- df[inTraining, ]
myTesting  <- df[-inTraining, ]

# Binarize the target: 1 if the GDP growth rate rises, 0 otherwise
myTraining$gdp_rate_change <- factor(as.numeric(myTraining$gdp_rate_change > 0))
myTesting$gdp_rate_change  <- factor(as.numeric(myTesting$gdp_rate_change > 0))
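Before fitting any model, it is worth confirming that both classes are reasonably represented in each split:

# Class balance of the binary target in training and test sets
table(myTraining$gdp_rate_change)
table(myTesting$gdp_rate_change)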

Random forests

A random forest is an ensemble approach to supervised learning. The algorithm samples both cases and variables to grow a large number of decision trees; each case is then classified by every tree, and the most common classification across trees is used as the outcome. Random forests are grown with the randomForest() function in the randomForest package.

library(randomForest)
# na.roughfix imputes any remaining NAs;
# importance = TRUE stores variable importance measures
fit.forest <- randomForest(gdp_rate_change ~ ., data = myTraining,
                           na.action = na.roughfix, importance = TRUE)
fit.forest
## 
## Call:
##  randomForest(formula = gdp_rate_change ~ ., data = myTraining,      importance = TRUE, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 43.52%
## Confusion matrix:
##    0  1 class.error
## 0 29 24   0.4528302
## 1 23 32   0.4181818
# Evaluate the forest on the held-out test set
forest.pred <- predict(fit.forest, myTesting)
forest.perf <- table(myTesting$gdp_rate_change, forest.pred,
                     dnn = c("Actual", "Predicted"))
confusionMatrix(forest.perf)
## Confusion Matrix and Statistics
## 
##       Predicted
## Actual  0  1
##      0 15  6
##      1  4 11
##                                          
##                Accuracy : 0.7222         
##                  95% CI : (0.5481, 0.858)
##     No Information Rate : 0.5278         
##     P-Value [Acc > NIR] : 0.01383        
##                                          
##                   Kappa : 0.4393         
##  Mcnemar's Test P-Value : 0.75183        
##                                          
##             Sensitivity : 0.7895         
##             Specificity : 0.6471         
##          Pos Pred Value : 0.7143         
##          Neg Pred Value : 0.7333         
##              Prevalence : 0.5278         
##          Detection Rate : 0.4167         
##    Detection Prevalence : 0.5833         
##       Balanced Accuracy : 0.7183         
##                                          
##        'Positive' Class : 0              
## 
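Since the forest was fit with importance = TRUE, we can also inspect which predictors drive the classification (output not shown):

# Rank predictors by their contribution to the forest's accuracy
importance(fit.forest)
varImpPlot(fit.forest)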

Boosted tree model and parameter tuning

The accuracy of a predictive model can often be improved with boosting algorithms such as gradient boosting. The first step is tuning the model. caret's train() function currently supports k-fold cross-validation, leave-one-out cross-validation, and bootstrap resampling.
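These resampling schemes are selected through trainControl(); for reference, the corresponding calls look like this (the numbers are illustrative):

# 10-fold cross-validation
trainControl(method = "cv", number = 10)
# Leave-one-out cross-validation
trainControl(method = "LOOCV")
# Bootstrap resampling with 25 resamples
trainControl(method = "boot", number = 25)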

library(gbm)
# Tune with repeated 10-fold cross-validation (10 repeats)
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10)
fit.gbm <- train(gdp_rate_change ~., data = myTraining, 
                 method = "gbm", 
                 trControl = fitControl, 
                 verbose = FALSE)
fit.gbm
## Stochastic Gradient Boosting 
## 
## 108 samples
##  30 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 98, 98, 97, 96, 97, 97, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.6390909  0.2787831
##   1                  100      0.6356818  0.2727525
##   1                  150      0.6362424  0.2739408
##   2                   50      0.6367727  0.2734621
##   2                  100      0.6325606  0.2654764
##   2                  150      0.6376061  0.2757573
##   3                   50      0.6131061  0.2277796
##   3                  100      0.6235758  0.2483904
##   3                  150      0.6140000  0.2283201
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 50, interaction.depth
##  = 1, shrinkage = 0.1 and n.minobsinnode = 10.
# Evaluate the boosted model on the held-out test set
gbm.pred <- predict(fit.gbm, myTesting)
gbm.perf <- table(myTesting$gdp_rate_change, gbm.pred,
                  dnn = c("Actual", "Predicted"))
confusionMatrix(gbm.perf)
## Confusion Matrix and Statistics
## 
##       Predicted
## Actual  0  1
##      0 14  7
##      1  3 12
##                                          
##                Accuracy : 0.7222         
##                  95% CI : (0.5481, 0.858)
##     No Information Rate : 0.5278         
##     P-Value [Acc > NIR] : 0.01383        
##                                          
##                   Kappa : 0.4495         
##  Mcnemar's Test P-Value : 0.34278        
##                                          
##             Sensitivity : 0.8235         
##             Specificity : 0.6316         
##          Pos Pred Value : 0.6667         
##          Neg Pred Value : 0.8000         
##              Prevalence : 0.4722         
##          Detection Rate : 0.3889         
##    Detection Prevalence : 0.5833         
##       Balanced Accuracy : 0.7276         
##                                          
##        'Positive' Class : 0              
## 
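Above, train() searched its default grid for gbm. A custom grid can be supplied through tuneGrid; the values below are illustrative assumptions, not tuned recommendations:

# Hypothetical tuning grid over gbm's standard tuning parameters
gbmGrid <- expand.grid(interaction.depth = c(1, 2, 3),
                       n.trees = c(50, 100, 150),
                       shrinkage = c(0.05, 0.1),
                       n.minobsinnode = c(5, 10))
fit.gbm2 <- train(gdp_rate_change ~ ., data = myTraining,
                  method = "gbm",
                  trControl = fitControl,
                  tuneGrid = gbmGrid,
                  verbose = FALSE)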

Further Questions

1. Economic growth can be spurred by a variety of factors or occurrences. Most commonly, increases in aggregate demand encourage a corresponding increase in overall output, which brings in a new source of income. Technological advancements and new product developments can exert positive influences on economic growth, as can increases in demand or availability in foreign markets that raise exports. I will look at the specific variables that pertain to these factors to improve the prediction.

2. An economic cycle, also referred to as the business cycle, has four stages: expansion, peak, contraction, and trough. I plan to use unsupervised learning algorithms to classify each period of the cycle; a sketch of one possible starting point follows.
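As a first, purely illustrative step, one could cluster the years into four groups with k-means, using the growth-rate features built earlier (the clusters would still have to be mapped onto the named stages):

# Cluster years into four groups using the growth-rate features
features <- scale(df[, c("gdp_change", "stock_change", "import_change",
                         "export_change", "revenue_change", "expenditure_change")])
set.seed(1)
km <- kmeans(features, centers = 4, nstart = 25)
table(km$cluster)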