After the 2008 financial crisis, the US government formed the Consumer Financial Protection Bureau (CFPB) to protect consumers and to ensure that financial companies play by the same consumer rules and compete fairly. This blog post uses the publicly available Consumer Complaints data from the CFPB database to demonstrate quantitative (statistical) analysis and to produce a forecast. The following is the workflow for achieving that goal:
library(lubridate)  # extracting year/month/day from dates
library(dplyr)      # data manipulation and aggregation
library(tseries)    # adf.test() for stationarity testing
library(forecast)   # time-series models and forecasting
Almost everyone who demonstrates time-series analysis uses data sets such as AirPassengers or lynx from base R. That's fine, except real-life data don't arrive already cleaned up and converted to a ts or mts object, so those demos are limited in that regard. Personally, I would have liked to see someone work with data that had not already been preprocessed and transformed into a time series, hence the motivation for using data that has every aspect of what you are likely to see in your everyday analytical role. This post also adheres to the core steps of a forecasting project. The assumptions made here are that you already know what a time series is, can do basic data manipulation in R, and are comfortable with the R language and RStudio.
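If you want to follow along, the sketch below shows one way to read the data in; the file name is only a placeholder for whatever the CSV export from the CFPB complaint database is called on your machine.
# Placeholder file name -- point this at your own CSV export of the CFPB complaints data
consumers <- read.csv('Consumer_Complaints.csv', stringsAsFactors = FALSE)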
str(consumers)
## 'data.frame': 632080 obs. of 18 variables:
## $ Date.received : chr "7/29/2013" "7/29/2013" "7/29/2013" "7/29/2013" ...
## $ Product : chr "Consumer Loan" "Bank account or service" "Bank account or service" "Bank account or service" ...
## $ Sub.product : chr "Vehicle loan" "Checking account" "Checking account" "Checking account" ...
## $ Issue : chr "Managing the loan or lease" "Using a debit or ATM card" "Account opening, closing, or management" "Deposits and withdrawals" ...
## $ Sub.issue : chr "" "" "" "" ...
## $ Consumer.complaint.narrative: chr "" "" "" "" ...
## $ Company.public.response : chr "" "" "" "" ...
## $ Company : chr "Wells Fargo & Company" "Wells Fargo & Company" "Santander Bank US" "Wells Fargo & Company" ...
## $ State : chr "VA" "CA" "NY" "GA" ...
## $ ZIP.code : chr "24540" "95992" "10065" "30084" ...
## $ Tags : chr "" "Older American" "" "" ...
## $ Consumer.consent.provided. : chr "N/A" "N/A" "N/A" "N/A" ...
## $ Submitted.via : chr "Phone" "Web" "Fax" "Web" ...
## $ Date.sent.to.company : chr "7/30/2013" "7/31/2013" "7/31/2013" "7/30/2013" ...
## $ Company.response.to.consumer: chr "Closed with explanation" "Closed with explanation" "Closed" "Closed with explanation" ...
## $ Timely.response : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Consumer.disputed : chr "No" "No" "No" "No" ...
## $ Complaint.ID : int 468882 468889 468879 468949 475823 468981 467801 475728 469026 469035 ...
The data pulled capture consumer complaints filed between 2011 and 2016 for various financial products such as student loans and mortgages. For this post we will single out one product of interest for the analysis; stress not, though, the process demonstrated is reproducible for any other product of your choosing. To start, let's narrow the data down to just two variables:
consumers <- consumers[c('Date.received', 'Product')]
From the str() output we saw that Date.received is chr (character) rather than a date, so we need to transform this variable to its appropriate type as follows:
# Convert the character date column(s) to Date objects; listing the columns in
# `dates` lets the same pattern handle more than one date column if needed
dates <- "Date.received"
consumers[, dates] <- lapply(dates, function(x) as.Date(consumers[, x], '%m/%d/%Y'))
sapply(consumers, class)
## Date.received Product
## "Date" "character"
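As an optional sanity check, you can confirm that no dates failed to parse (a failed parse would show up as NA):
sum(is.na(consumers$Date.received))  # should be 0 if every date parsed cleanly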
Next, let's find the product with the most consumer complaints and narrow our focus to it for this demo.
table(consumers$Product)
##
## Bank account or service Consumer Loan Credit card
## 71133 24659 74545
## Credit reporting Debt collection Money transfers
## 110414 116473 4427
## Mortgage Other financial service Payday loan
## 202337 711 4479
## Prepaid card Student loan Virtual currency
## 2953 19937 12
As seen above, most consumers complained about mortgages, so we will subset our data to focus on this product.
consumers <- subset(consumers, Product == 'Mortgage')
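If you would rather not hard-code the product name, a hedged alternative to the subset above is to pick the most-complained-about product programmatically:
# Alternative to the hard-coded subset: select whichever product has the most complaints
top_product <- names(which.max(table(consumers$Product)))
consumers <- subset(consumers, Product == top_product)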
We can further preprocess the timestamp so it is easier to work with, using the lubridate package. Since we want to analyze complaints at a monthly granularity across the years (not a hard rule), we will extract the year, month, and day, and then take monthly counts of the complaints using dplyr.
# Extract the year, month, and day components, then drop the original timestamp
consumers$year <- lubridate::year(consumers$Date.received)
consumers$month <- lubridate::month(consumers$Date.received)
consumers$day <- lubridate::day(consumers$Date.received)
consumers$Date.received <- NULL
Then perform the monthly aggregation using dplyr as follows:
monthly <- consumers %>% group_by(year, month)
per_month <- monthly %>% dplyr::summarize(num_complaint = n())
tail(per_month, n=1)
## Source: local data frame [1 x 3]
## Groups: year [1]
##
## year month num_complaint
## (dbl) (dbl) (int)
## 1 2016 9 351
head(per_month,n=1)
## Source: local data frame [1 x 3]
## Groups: year [1]
##
## year month num_complaint
## (dbl) (dbl) (int)
## 1 2011 12 1280
# Build a year-month label and keep only the two columns we need
per_month$Date <- paste(per_month$year, per_month$month, sep = "-")
per_month <- per_month[c("Date", "num_complaint")]
Finally, let's transform the data into a time series so we can begin the time-series analysis. We will slice the series so that it starts in January 2012 and ends in December 2015, and then perform some exploratory analysis.
cc <- ts((per_month$num_complaint),start = c(2012,1), end = c(2015, 12), frequency = 12)
start(cc)
## [1] 2012 1
end(cc)
## [1] 2015 12
class(cc)
## [1] "ts"
plot(cc, ylab = '', main = 'Consumer complaints 2012-2015', col = 'blue', bty = 'l')
2012 starts off at a very low dip that steadily climbs until mid-year and then shows a series of ups and downs, a pattern that repeats throughout the time span. The data appear to have weak seasonality and no clear trend. As for the sharp spike in early 2013, it is clear that consumers filed far more complaints then than in the other years. The series does not look stationary, but we will run a formal test to verify that intuition.

The boxplots show complaints rising as the months progress towards mid-year and then falling towards the end of the year. Decomposing the series makes the seasonality a bit clearer, but there is still no obvious trend. Finally, the seasonal plot shows the seasonal patterns most clearly and makes it easy to spot that 2012 began at a very low minimum while early 2013 had a sharp spike. A quick Google search confirms this rise but does not give specific reasons why consumers complained about mortgages more in that year than in the rest of the time span; perhaps some sentiment analysis could help answer that question.
# Distribution of complaints by calendar month
boxplot(cc ~ cycle(cc))
# Decompose the series into seasonal, trend, and remainder components
plot(stl(cc, s.window = 'periodic', t.window = 15))
# Overlay the years to compare seasonal patterns directly
seasonplot(cc, year.labels = TRUE, year.labels.left = TRUE, col = 1:4,
           labelgap = 0.4, main = 'Comparing Seasons')
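For one more angle on the seasonality, base R's monthplot() (an optional extra, not part of the original workflow) draws a sub-series for each calendar month along with its mean, which makes the mid-year bulge easy to spot:
# One sub-series per calendar month; the horizontal bar is that month's mean
monthplot(cc, ylab = 'Complaints')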
A common requirement of many time-series techniques is that the data must be stationary, meaning that the mean, variance, and autocorrelation do not change over time. If this requirement is not met, particularly for ARIMA models, a technique known as differencing is applied to make the data stationary. Fortunately, the forecast package provides auto.arima(), which automates the differencing step. It remains good practice, however, to test the data for stationarity. There are quite a few ways to do this; for our purposes we will use the Augmented Dickey-Fuller (ADF) test (see ?tseries::adf.test for details).
adf.test(cc)
##
## Augmented Dickey-Fuller Test
##
## data: cc
## Dickey-Fuller = -3.438, Lag order = 3, p-value = 0.06139
## alternative hypothesis: stationary
The test returns a p-value greater than 0.05, so we cannot reject the null hypothesis of non-stationarity and will treat the data as non-stationary. As mentioned earlier, auto.arima() automates both the differencing and the ARMA order selection for us, so we do not have to difference the series by hand.
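If you did want to difference manually, a minimal sketch using forecast::ndiffs() and base R's diff() could look like this (re-running the ADF test afterwards is just for illustration):
ndiffs(cc)            # estimated number of regular differences needed
cc_diff <- diff(cc)   # first difference of the series
adf.test(cc_diff)     # re-check stationarity on the differenced series
Next, we check for autocorrelation.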
acf(cc)
pacf(cc)
The plots above show significant autocorrelation out to lag 2 in the ACF and a significant partial autocorrelation at lag 1 in the PACF. These plots show the correlation between points separated by various time lags. As a rule of thumb, the PACF suggests the AR order (p) and the ACF suggests the MA order (q), so if we had to choose ARIMA orders manually, a good starting point would be p = 1 with q = 1 or q = 2, as sketched below.
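For completeness, here is a rough sketch of what a hand-picked fit might look like using forecast::Arima(); the non-seasonal (1,1,1) order and the seasonal (0,1,1) term are illustrative guesses based on the reading above, not tuned values.
# Illustrative hand-picked candidate; auto.arima() will do this search for us later
fit_manual <- Arima(cc, order = c(1, 1, 1), seasonal = c(0, 1, 1))
summary(fit_manual)
acf(residuals(fit_manual))  # residuals should show little remaining autocorrelation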
Whether you are fitting machine-learning models or time-series models for forecasting, the recommended approach is to fit a wide range of models, say 10 to 15, and work your way down until you can single out the best performer based on specific diagnostics such as error rate or accuracy. For this analysis we use the forecast package, which makes model fitting very efficient. I was inspired by some of the industry's best, and if you are new to quantitative analysis I recommend modelling your work after this demo; my code can be copied and modified to suit your needs.
# Hold out 2015 as the test set; train on everything up to December 2014
test_cc <- window(cc, start = c(2015, 1))
train_cc <- window(cc, end = c(2014, 12))
# Next, fit a collection of candidate models
models <- list(
  mod_arima  = auto.arima(train_cc, ic = 'aicc', stepwise = FALSE),  # ARIMA via exhaustive search
  mod_exp    = ets(train_cc, ic = 'aicc', restrict = FALSE),         # exponential smoothing state space
  mod_neural = nnetar(train_cc, p = 12, size = 25),                  # neural network autoregression
  mod_tbats  = tbats(train_cc, ic = 'aicc', seasonal.periods = 12),  # TBATS
  mod_bats   = bats(train_cc, ic = 'aicc', seasonal.periods = 12),   # BATS
  mod_stl    = stlm(train_cc, s.window = 12, ic = 'aicc', robust = TRUE, method = 'ets'),  # STL + ETS
  mod_sts    = StructTS(train_cc)                                    # basic structural model
)
# Generate 12-month-ahead forecasts from every fitted model
forecasts <- lapply(models, forecast, 12)
# Add a naive benchmark that simply carries the last observed value forward
forecasts$naive <- naive(train_cc)
# Plot each forecast with the held-out 2015 observations overlaid in red
par(mfrow = c(4, 2))
par(mar = c(1, 1, 1, 1))
for (f in forecasts) {
  plot(f)
  lines(test_cc, col = 'red')
}
You can see that the ARIMA model did not perform well on this data, while the structural model seems to have done a much better job of fitting it. A naive model, which forecasts a flat line, is included as a benchmark. Let's evaluate performance on the test set to see whether the structural model really is a good candidate for the final forecast.
# Test-set accuracy (second row of accuracy()) for each model, ranked by MASE
acc <- lapply(forecasts, function(f){
  accuracy(f, test_cc)[2, , drop = FALSE]
})
acc <- Reduce(rbind, acc)
row.names(acc) <- names(forecasts)
acc <- acc[order(acc[, 'MASE']), ]
round(acc, 2)
## ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
## mod_sts 70.26 333.71 287.18 1.49 8.38 0.31 0.14 1.04
## mod_stl -240.84 493.74 391.48 -7.51 11.52 0.42 0.66 1.90
## mod_bats -210.25 498.97 413.10 -6.46 12.37 0.45 0.31 1.86
## mod_tbats -337.59 573.43 427.69 -10.29 12.78 0.46 0.49 2.23
## mod_exp 384.02 531.29 441.39 9.89 11.79 0.48 0.57 1.75
## naive 577.30 691.53 577.30 15.25 15.25 0.62 0.63 2.43
## mod_arima -516.56 930.54 675.43 -16.03 20.36 0.73 0.49 3.66
## mod_neural -559.86 801.28 711.81 -14.82 19.85 0.77 0.46 2.56
The structural model did better than the other models on this data across almost every diagnostic metric. Techniques such as time-series cross-validation would be another route to squeezing more performance out of these models; we will explore that in the next blog post, so make sure to check back! In the meantime, the sketch below shows what producing the final forecast could look like.
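This is a minimal sketch, assuming we stick with the structural model: refit it on the full 2012-2015 series and project 12 months ahead (the horizon is an assumption; choose whatever fits your use case).
# Refit the best-performing model on the full series and forecast a year ahead
fit_final <- StructTS(cc)
fc_final <- forecast(fit_final, h = 12)
plot(fc_final, main = 'Mortgage complaints: 12-month forecast')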
If you found this post helpful, please provide feedback in the comment section below.
Thanks!