Introduction & CRISP-DM Overview

Describe the data set, what you plan to do in each phase of CRISP-DM, and the models you expect to build, evaluate, and apply.

Business Understanding

Background

I’ve been working as a freelance web & graphic designer, event coordinator, yoga teacher & studio manager, and administrative consultant for the past few years. The sole proprietorship was marginally profitable in 2016, which made filing taxes considerably more time-consuming and detail-intensive. To reduce that workload in future years, I’ve decided to use the financial exploration done for tax reporting as the dataset for this project and to build classification models that can identify deductibles with a high degree of accuracy, requiring less manual labor on my part.

In the CRISP-DM overview below I outline the primary outcomes I hope to achieve with the project.

Business Objectives

My objectives with this expense data exploration are to:

  • Find a means of making the tax filing process easier by leveraging machine learning techniques learned in DA 5030. See ML Models for Deduction Classification.
  • Create projections for future years based on data from previous years. See Budget Projections.
  • Discern differences in expenses associated with the relocation for school. See Cost of Living comparison.
  • Find clusters of expense records that can serve as indicators of where budgeting/spending can be improved. See Expense Data Cluster Analysis.
  • (If time allows) develop a tool for better selecting well-performing stock options to bolster the self-managed stock portfolio that is currently paying for this course of education.

Situation Assessment

I currently track all expense data manually in a smartphone application called Expense Manager. Each transaction has the following fields; each is noted below as being entered manually or automatically:

  1. Date & Time (Automatic)
  2. Amount (Manual)
  3. Category (Manual)
  4. Subcategory (Manual - autopopulates based on previous Category combination)
  5. Payment Method (Manual - also autopopulates based on prediction from Category/Subcategory combination)
  6. Description (Manual)
  7. Payee/Payer (Manual - autopopulates)
  8. Account (Manual)
  9. Tag (Manual - autopopulates. I will use this for tagging items as deductible in the future)

I have also imported transaction data from checking, savings, and PayPal accounts for some of the tax information and visualizations in the Data Visualization section.

Business Success Criteria

The success criteria are:

  • To develop an ensemble machine learning model capable of accurately identifying deductibles, in order to expedite the tax filing process in future years.

    Classification techniques used for determining deductions are:

    • Radial SVM (package: kernlab; caret method: svmRadial)
    • Naive Bayes (package: klaR; caret method: nb)
    • Boosted Logistic Regression (package: caTools; caret method: LogitBoost)
    • Adaptive Boosted Decision Trees (package: fastAdaboost; caret method: adaboost)
    • C5.0 Decision Tree (package: C50; caret method: C5.0)
    • A caretStack stacked generalized linear model ensemble of the svmRadial, LogitBoost, adaboost, and C5.0 models.
  • To use cluster analysis to identify pockets of spending in order to understand “networks” of transactions.

    Two datasets were used: one that included the following features:

    • Date
    • Amount
    • Category
    • Subcategory
    • Payee/Payer

    and one that used only Date & Amount. The clustering techniques used are:

    • Gaussian mixture models (fit Gaussian curves to the data to identify clusters)
    • Clustering around medoids (uses an actual data point, the medoid, as each cluster center)

    with the following distance measures:

    • Euclidean
    • Mahalanobis

    Clustering criteria from the ClusterCrit package were used to assess the best clusters via a voting algorithm.

  • (If time allows) to prototype a tool for making wise stock-selection and buy/sell decisions

Development of Project Plan

Trello is used to track and plan the project. GitKraken is used to track document changes. Within the document, timestamped comments are used to plan the coding of each section and to track changes.

Data Understanding

Considering data requirements

The raw data was exported from the Expense Manager app as a CSV file. Various measures were taken to wrangle the data into formats conducive to the various analyses; these are documented inline in the body of the document. The transactions were manually coded as deductible or not in Google Sheets for ease. The deductible flags were originally coded as T/F, became “1”/“2” after conversion to a factor, and were then recoded to “yes”/“no” for readability in the model results output and confusion matrices.
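
A minimal sketch of that recoding step (assuming a data frame named em17.tr with a logical Deductible column, as used in the modeling sections):

# Sketch: recode the logical Deductible flag to a "no"/"yes" factor for
# readable model output and confusion matrices
em17.tr$Deductible <- factor(em17.tr$Deductible, levels = c(FALSE, TRUE), 
    labels = c("no", "yes"))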

Initial data collection, exploration, and quality assessment.

I have received a 1099-MISC for all major contract clients to verify income. I have downloaded all transaction history from checking/credit accounts and exported all manually recorded transactions from the app that I use to track expenses. The data I generated is relatively high quality with few missing values. Because of my familiarity with the data, and the ability to cross-reference transactions against the checking and credit card account histories, I was able to fill in missing values and correct mislabellings where discovered. Data generated for 2018 thus far provides an out-of-time test set for evaluating the performance of the models.

Factor levels in some of the features were relabelled for use in specific models where special characters caused errors; these instances are documented inline. Conversion of factor levels to numeric and scaling were necessary for the clustering algorithms.

Selection of required data

The transaction histories from the bank accounts do not allow for the creation of additional categories for labelling specific transaction types, and the descriptions of the transactions themselves are cryptic and often difficult to interpret. Thus, I will be using only the manually tracked expenses, because that data has more legible features for labelling each transaction.

Data acquisition

Described above

Data integration and formatting

Described above

Data cleaning

Described above

Data transformation and enrichment

Derived attributes such as category sums were necessary for much of the manual calculation of deductible expense categories in the tax filing process. See Self-Employment Income.

Modeling

Selection of appropriate modeling technique

I initially selected Naive Bayes, Radial SVM, and LogitBoost (see Exploratory 1st Run for details). LogitBoost had the best performance, followed by the Radial SVM and finally Naive Bayes. C5.0 was added to the ensemble to replace Naive Bayes and performed exceptionally well, for reasons discussed in depth in the Add C5.0 section. In summary, the Category and Subcategory factors are the primary means of identifying the deductibles, so the bias decision trees show toward factors with many levels actually works to their advantage in labelling the deductibles.
In light of the performance of C5.0, another boosted decision tree, adaboost, was added to the final mix of models used for the ensemble.

Splitting of the dataset into training and testing subsets for evaluation purposes

The caret package’s createDataPartition function works especially well for partitioning data by a percentage split stratified on the factor levels of the response variable, for the purpose of training and validating models. An out-of-time test set with transaction data from 2018 was used for the predictions and final evaluation from which the results herein are derived. The construction of the training set was multiphasic: transformation in R, porting to Google Sheets for cleaning and deductible labeling, and loading back into R for further cleaning and matching of levels with the test set. This process can be found in part in the Data Loading and Cleaning section. The construction of the test set can be found here.
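
For illustration, a minimal sketch of that stratified split (the proportion is illustrative; em17.tr is the labelled 2017 training frame used throughout the modeling sections):

# Sketch: stratified split on the Deductible factor with caret
library(caret)
set.seed(1)
idx <- createDataPartition(em17.tr$Deductible, p = 0.9, list = FALSE)
train_set <- em17.tr[idx, ]
holdout <- em17.tr[-idx, ]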

Development and examination of alternative modeling algorithms and parameter settings

I used caret train on all initial models to find tuning parameters for each model type. I then used a manual iterative approach with the various Naive Bayes kernels to attempt to fine-tune that model. The cosine kernel performed best among the kernels, but was still significantly worse than the Radial SVM, so Naive Bayes was dropped from future models. C5.0 and adaboost were added for the final ensemble, as discussed above.

Fine tuning of the model settings according to an initial assessment of the model’s performance

A custom function called tuneGrids was developed to hone the tuning grids, using refined tuning values from each train result across repeated iterations of model building with train. The performance of the first run can be found here. Models were then built individually with their respective packages, using the best tuning parameters from caret plus additional parameters not available in the caret implementation. See Manual Models: Refinements for the code and performance measures from the manual refinement process.

Model evaluation

Evaluation of the model in the context of the business success criteria

The C5.0 algorithm outperformed even the ensemble models built on a generalized linear model of class probabilities. The ensemble model may be more effective for a dataset with a high degree of variability in the factors used, where C5.0 might overfit; for this particular classification task, however, C5.0 was the perfect fit. See the performance for all models included in the first run. The performance of the set of models included in the ensemble, and of the ensembles using the 1st resample probabilities and the full resample probabilities, can be found in Ensemble Model Performance.

Model approval

The C5.0 decision tree will be used for 2018 deductible classification due to its rapid training and AUC of 0.99.

Deployment

Create a report of findings

Here it is!

Planning and development of the deployment procedure

The work-up used herein will be sufficient for loading transaction data from 2018 and classifying deductibles with far less effort than was necessary in 2017. I would be amenable to developing a Shiny deployment of the C5.0 algorithm for classifying deductibles if there were sufficient interest from the developers of the Expense Manager app. For this to be useful to other individuals using the app, they would need to label their deductibles consistently so that C5.0 can effectively classify their data.

Load Data

Fn for converting date to decimal

This function reduces a date to a decimal representing the fraction of the year elapsed (e.g., June 30th is about 0.5). It subtracts the year component of the date so that the decimal dates in the training set for 2017 and the out-of-time test set for 2018 are on the same scale.
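
A minimal sketch of such a conversion, assuming lubridate is available (the function name is illustrative, not necessarily the one used in the hidden chunk):

# Sketch: convert a Date to the fraction of its year elapsed so that
# 2017 training dates and 2018 test dates share the same scale
library(lubridate)
dateToDecimal <- function(d) {
    decimal_date(d) - year(d)
}
dateToDecimal(as.Date("2017-07-02"))  # ~0.5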

Add Shared Expenses to Expense Manager Data

The following section is used for manual calculation of deductible categories for tax filing purposes. Inline documentation of Finance for Freelancers continues in ML Models for Deduction Classification. All chunks in the following section are set to eval=F and some lines are not echoed for privacy.

Taxes

Capital Gains Liquidation for Tuition Payments

For form 1040

Tax Exemption from certain types of Federal Obligations for the state of MA (Valueline.MA) - TIR 89.8

Ordinary Dividends

Long Term Gains and Losses

Capital Loss Carryover Worksheet

1040SD - cont

Qualified Dividends and Capital Gain Tax Worksheet

# ----------------------- Sat Apr 14 23:14:03 2018
# ------------------------# Qualified Dividends and Capital
# Gain Tax Worksheet—Line 28
CGTWs <- c("1. Enter the amount from Form 1040, line 43. However, if you are filing Form 2555 or 2555-EZ (relating to foreign earned income), enter the amount from line 3 of the Foreign Earned Income Tax Worksheet . . . . . . . . . . . . . . . . . . . . . 1.", 
    "2. Enter the amount from Form 1040, line 9b* . . . . . . . 2.", 
    "3. Are you filing Schedule D?* Yes. Enter the smaller of line 15 or 16 of Schedule D. If either line 15 or 16 is blank or a loss, enter -0-. 3. No. Enter the amount from Form 1040 line 13.", 
    "4. Add lines 2 and 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.", 
    "5. If filing Form 4952 (used to figure investment interest expense deduction), enter any amount from line 4g of that form. Otherwise, enter -0- . . . . . . . . . . 5.", 
    "6. Subtract line 5 from line 4. If zero or less, enter -0- . . . . . . . . . . . . . . . . . . . . . . 6.", 
    "7. Subtract line 6 from line 1. If zero or less, enter -0- . . . . . . . . . . . . . . . . . . . . . . 7.", 
    "8. Enter: $37,950 if single or married filing separately $75,900 if married filing jointly or qualifying widow(er) $50,800 if head of household. . . . . . . . . . . . . . 8.", 
    "9. Enter the smaller of line 1 or line 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.", 
    "10. Enter the smaller of line 7 or line 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10.", 
    "11. Subtract line 10 from line 9. This amount is taxed at 0% . . . . . . . . . . . . . . . . . .11.", 
    "12. Enter the smaller of line 1 or line 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12.", 
    "13. Enter the amount from line 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13.", 
    "14. Subtract line 13 from line 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14.", 
    "15. Enter: $418,400 if single $235,350 if married filing separately $470,700 if married filing jointly or qualifying widow(er) $444,550 if head of household. . . . . . . . . . . . . . 15.", 
    "16. Enter the smaller of line 1 or line 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16.", 
    "17. Add lines 7 and 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17.", 
    "18. Subtract line 17 from line 16. If zero or less, enter -0- . . . . . . . . . . . . . . . . . . . .18.", 
    "19. Enter the smaller of line 14 or line 18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .19.", 
    "20. Multiply line 19 by 15% (0.15) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20.", 
    "21. Add lines 11 and 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .21.", 
    "22. Subtract line 21 from line 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .22.", 
    "23. Multiply line 22 by 20% (0.20) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23.", 
    "24. Figure the tax on the amount on line 7. If the amount on line 7 is less than $100,000, use the Tax", 
    "Table to figure the tax. If the amount on line 7 is $100,000 or more, use the Tax Computation Worksheet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24.", 
    "25. Add lines 20, 23, and 24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25.", 
    "26. Figure the tax on the amount on line 1. If the amount on line 1 is less than $100,000, use the Tax Table to figure the tax. If the amount on line 1 is $100,000 or more, use the Tax Computation Worksheet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26.", 
    "27. Tax on all taxable income. Enter the smaller of line 25 or 26. Also include this amount on Form 1040, line 44. If you are filing Form 2555 or 2555-EZ, don’t enter this amount on Form 1040 line 44. Instead, enter it on line 4 of the Foreign Earned Income Tax Worksheet . . . . . . . . . . . . . . 27. *If you are filing Form 2555 or 2555-EZ, see the footnote in the Foreign Earned Income Tax Worksheet before completing")
CGTWs %<>% gsub("\\.*\\s+[1-9.]+$", "", .) %>% gsub("(?:[.]\\s[.])|\\s{2,}", 
    "", .) %>% gsub("\\s{2,}", "", .)

Add to Gsheets

Expense Deductions

For form 1040SC & 1040SE

Data Loading and Cleaning

Combine and Replace NA

##           Date         Amount       Category    Subcategory Payment.Method 
##          FALSE          FALSE          FALSE          FALSE          FALSE 
##    Payee.Payer        Account            Tag     Deductible 
##          FALSE          FALSE           TRUE          FALSE
##           Date         Amount       Category    Subcategory Payment.Method 
##          FALSE          FALSE          FALSE          FALSE          FALSE 
##    Payee.Payer        Account            Tag     Deductible 
##          FALSE          FALSE          FALSE          FALSE

Itemized Business Deductions

Car and Truck

Deductible Expenses - Car and Truck

Supplies

Deductible Expenses - Supplies

Registration Fees

Deductible Expenses - Registration Fees

ML Models for Deduction Classification

This section includes the initial loading and cleaning of data from the data understanding, data preparation, and modeling phases prior to the project proposal. Many of the chunks from this preliminary testing are kept to show the discovery process, but have been set to ## NOT RUN because the process was quite error-prone. Interwoven are refinements, made upon revising these phases, that generate new models with improved code. Inline documentation appears throughout.

Load and Transform data

Load data

Loading Labelled Data from GS

Clean OOT Test data

##             X1           Date         Amount       Category    Subcategory 
##      "integer"         "Date"      "numeric"    "character"    "character" 
## Payment.Method    Payee.Payer        Account            Tag     Deductible 
##    "character"    "character"    "character"    "character"      "logical"
##           Date         Amount       Category    Subcategory Payment.Method 
##          FALSE          FALSE          FALSE          FALSE          FALSE 
##    Payee.Payer        Account            Tag     Deductible 
##          FALSE          FALSE          FALSE          FALSE

matchLevels

Fn for Matching Levels
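
The function itself is not echoed here; a hypothetical sketch of what such a level-matching helper might look like (the helper name and the olddata/newdata return structure mirror the matchedlevels object used later in the ensemble section):

# Hedged sketch: give every shared factor column the union of the two sets'
# levels so that predict() does not fail on levels unseen during training
matchLevels <- function(olddata, newdata) {
    fac <- intersect(names(olddata)[sapply(olddata, is.factor)], 
        names(newdata)[sapply(newdata, is.factor)])
    for (v in fac) {
        lvls <- union(levels(olddata[[v]]), levels(newdata[[v]]))
        olddata[[v]] <- factor(olddata[[v]], levels = lvls)
        newdata[[v]] <- factor(newdata[[v]], levels = lvls)
    }
    list(olddata = olddata, newdata = newdata)
}
# e.g. matchedlevels <- matchLevels(em17.tr, em.test)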

Apply the function to the training and test sets

## NULL
## [1] TRUE

Modeling

Exploratory 1st Run

This chunk contains the original training phase that was run prior to the project proposal. Many errors were encountered due to incongruent levels in the test & training sets and due to providing the ‘finalModel’ to predict functions instead of the train object. I managed to get through creating the first three models, to ensure the project was feasible, via a fairly haphazard method, as can be seen in the code below. The matchLevels function above was written upon revising this section in preparation for using caretEnsemble to create ensemble models, and solves some of the issues in this chunk more elegantly. The models were saved because of the long training time, and the accuracy of each is reported herein.

Train Models on 2017 Coded Data

# NOT RUN - from first run prior to project proposal
library(caret)

em2017.train <- createDataPartition(em17.tr$Deductible, times = 2, 
    p = 0.9, list = T)
em2017.train <- trainControl(method = "repeatedcv", index = em2017.train, 
    number = 2, repeats = 1, search = "grid", verboseIter = T, 
    allowParallel = F)

# kernlab::sigest(Deductible ~ .,data=em17.tr) em.2017.mod <-
# train(Deductible ~ . , data=em17.tr, method='svmRadial',
# na.action='na.pass',
# metric='Accuracy',tuneLength=10,trCtrl=em2017.train)
# ----------------------- Sat Mar 31 20:12:08 2018
# ------------------------# Throwing error
# `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more
# levels

# em.lvls <-
# purrr::map2(.x=sapply(em17.tr[,sapply(em17.tr,is.factor)],levels),.y=sapply(em.test,levels),
# .f=function(.x,.y){ c(.x,.y) %>% unique })
# em.lvls[['Date']] <- NULL em.lvls[['Amount']] <- NULL
# em.fac <- names(em.lvls) for(v in seq_along(em.fac)){
# em17.tr[[em.fac[v]]] %<>% as.factor em.test[[em.fac[v]]]
# %<>% as.factor levels(em17.tr[[em.fac[v]]]) <-
# em.lvls[[em.fac[v]]] levels(em.test[[em.fac[v]]]) <-
# em.lvls[[em.fac[v]]] }
#-------- Does not properly merge factors ------#
# ----------------------- Sat Mar 31 21:51:34 2018
# ------------------------# Merge test and train data and
# recompute factors for consistency in factor levels


# ----------------------- Sat Mar 31 20:51:18 2018
# ------------------------# LEt's see if this resolved it
em.2017.mod <- train(Deductible ~ ., data = em17.tr, method = "svmRadial", 
    na.action = "na.pass", metric = "Accuracy", tuneLength = 10, 
    trControl = em2017.train)
# Works ----------------------- Sat Mar 31 20:56:12 2018
# ------------------------# Let's try some additional methods

library(doParallel)
# make a cluster with all but one of the available logical cores (threads)
cl <- makeCluster(detectCores() - 1)
# register the number of parallel workers (here all CPUs)
registerDoParallel(cl)
getDoParWorkers()
# insert parallel calculations here

em.2017.svmRad <- train(Deductible ~ ., data = em17.tr, method = "svmRadial", 
    na.action = "na.pass", metric = "Accuracy", tuneLength = 10, 
    trControl = em2017.train)
# ----------------------- Sat Mar 31 20:53:10 2018
# ------------------------# A boosted logistic regression
em.2017.LB <- train(Deductible ~ ., data = em17.tr, method = "LogitBoost", 
    na.action = "na.pass", metric = "Accuracy", tuneLength = 10, 
    trControl = em2017.train)
# ----------------------- Sat Mar 31 20:53:55 2018
# ------------------------# And Naive Bayes


em.2017.nB <- caret::train(Deductible ~ ., data = em17.tr, method = "nb", 
    na.action = "na.pass", metric = "Accuracy", tuneGrid = expand.grid(fL = c(0, 
        1), usekernel = T, adjust = c(0.1, 0.5, 1)), trControl = em2017.train)
# stop the cluster and remove the Rscript.exe child processes (Windows)
stopCluster(cl)
registerDoSEQ()
em.2017.mods <- list(SVM = em.2017.svmRad, LB = em.2017.LB, NB = em.2017.nB)
save(em.2017.mods, file = "em2017modes.RData")

Eval Models

## [1] "svmRadial: 0.8904"
## [1] "LogitBoost: 0.9382"
## [1] "nb: 0.7528"
## [[1]]

## 
## [[2]]

## 
## [[3]]

It appears that the best model is the boosted logistic regression. This makes sense given that the task is a binary classification, namely whether an entry is deductible or not. The boosting adds predictive power, making it a bit more accurate than the radial SVM.

Predictions

To account for what may be errors or incompatibilities between the caret final models and the predict functions, we will manually construct the models based on the tuning parameters provided by caret’s training.

Manual Models: Refinements

ksvm, manual
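
The manual ksvm chunk is not echoed here; a sketch of how the radial SVM might be rebuilt directly with kernlab, using the sigma and C values stored in the caret object’s bestTune (object names follow the earlier chunks; this is an illustrative reconstruction, not the exact code):

# Sketch: rebuild the radial SVM manually from caret's best tuning parameters
library(kernlab)
bt <- em.2017.mods$SVM$bestTune  # data frame with columns sigma and C
em.ksvm <- ksvm(Deductible ~ ., data = em17.tr, kernel = "rbfdot", 
    kpar = list(sigma = bt$sigma), C = bt$C, prob.model = TRUE)
caret::confusionMatrix(predict(em.ksvm, em.test), em.test$Deductible)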

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  64   6
##        yes  3  20
##                                           
##                Accuracy : 0.9032          
##                  95% CI : (0.8242, 0.9548)
##     No Information Rate : 0.7204          
##     P-Value [Acc > NIR] : 0.00001483      
##                                           
##                   Kappa : 0.751           
##  Mcnemar's Test P-Value : 0.505           
##                                           
##             Sensitivity : 0.9552          
##             Specificity : 0.7692          
##          Pos Pred Value : 0.9143          
##          Neg Pred Value : 0.8696          
##              Prevalence : 0.7204          
##          Detection Rate : 0.6882          
##    Detection Prevalence : 0.7527          
##       Balanced Accuracy : 0.8622          
##                                           
##        'Positive' Class : no              
## 

An accuracy of ~90% was achieved with the test data. The majority of incorrectly labelled deductions were accounted for by a single recurring payment associated with a Category/Subcategory combination that was not associated with a deduction in the test data. As the dataset grows over the years, and enough categories are created that they can remain static, the accuracy of the algorithm will likely improve.

logitboost, manual

naive bayes,manual

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  63   5
##        yes  4  21
##                                           
##                Accuracy : 0.9032          
##                  95% CI : (0.8242, 0.9548)
##     No Information Rate : 0.7204          
##     P-Value [Acc > NIR] : 0.00001483      
##                                           
##                   Kappa : 0.7569          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9403          
##             Specificity : 0.8077          
##          Pos Pred Value : 0.9265          
##          Neg Pred Value : 0.8400          
##              Prevalence : 0.7204          
##          Detection Rate : 0.6774          
##    Detection Prevalence : 0.7312          
##       Balanced Accuracy : 0.8740          
##                                           
##        'Positive' Class : no              
## 

It appears that the Naive Bayes model is not as well-suited for predicting the deductibles, at least with this set of tuning parameters. Perhaps this can improve with the Laplace correction?

Naive Bayes tuning

nb,laplace tuning

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  48   9
##        yes 19  17
##                                          
##                Accuracy : 0.6989         
##                  95% CI : (0.595, 0.7897)
##     No Information Rate : 0.7204         
##     P-Value [Acc > NIR] : 0.72196        
##                                          
##                   Kappa : 0.3313         
##  Mcnemar's Test P-Value : 0.08897        
##                                          
##             Sensitivity : 0.7164         
##             Specificity : 0.6538         
##          Pos Pred Value : 0.8421         
##          Neg Pred Value : 0.4722         
##              Prevalence : 0.7204         
##          Detection Rate : 0.5161         
##    Detection Prevalence : 0.6129         
##       Balanced Accuracy : 0.6851         
##                                          
##        'Positive' Class : no             
## 

The Laplace correction does not improve the model. Does another kernel?

nb,kernel tuning
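
The kernel-tuning chunk is not echoed; a hedged sketch of such a sweep with klaR, whose NaiveBayes passes extra arguments through to density(), so the kernel names below are the standard density() kernels (data frame names follow the earlier chunks):

# Sketch: fit klaR Naive Bayes with each density kernel and compare test predictions
library(klaR)
kernels <- c("gaussian", "epanechnikov", "rectangular", "triangular", 
    "biweight", "cosine", "optcosine")
nb.kernel.cMs <- lapply(setNames(kernels, kernels), function(k) {
    fit <- NaiveBayes(Deductible ~ ., data = em17.tr, usekernel = TRUE, kernel = k)
    caret::confusionMatrix(predict(fit, newdata = em.test)$class, em.test$Deductible)
})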

## $gaussian
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  48   9
##        yes 19  17
##                                          
##                Accuracy : 0.6989         
##                  95% CI : (0.595, 0.7897)
##     No Information Rate : 0.7204         
##     P-Value [Acc > NIR] : 0.72196        
##                                          
##                   Kappa : 0.3313         
##  Mcnemar's Test P-Value : 0.08897        
##                                          
##             Sensitivity : 0.7164         
##             Specificity : 0.6538         
##          Pos Pred Value : 0.8421         
##          Neg Pred Value : 0.4722         
##              Prevalence : 0.7204         
##          Detection Rate : 0.5161         
##    Detection Prevalence : 0.6129         
##       Balanced Accuracy : 0.6851         
##                                          
##        'Positive' Class : no             
##                                          
## 
## $epanechnikov
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  43   7
##        yes 24  19
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.5613, 0.7611)
##     No Information Rate : 0.7204          
##     P-Value [Acc > NIR] : 0.896531        
##                                           
##                   Kappa : 0.3105          
##  Mcnemar's Test P-Value : 0.004057        
##                                           
##             Sensitivity : 0.6418          
##             Specificity : 0.7308          
##          Pos Pred Value : 0.8600          
##          Neg Pred Value : 0.4419          
##              Prevalence : 0.7204          
##          Detection Rate : 0.4624          
##    Detection Prevalence : 0.5376          
##       Balanced Accuracy : 0.6863          
##                                           
##        'Positive' Class : no              
##                                           
## 
## $rectangular
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  47   9
##        yes 20  17
##                                           
##                Accuracy : 0.6882          
##                  95% CI : (0.5837, 0.7802)
##     No Information Rate : 0.7204          
##     P-Value [Acc > NIR] : 0.79233         
##                                           
##                   Kappa : 0.3146          
##  Mcnemar's Test P-Value : 0.06332         
##                                           
##             Sensitivity : 0.7015          
##             Specificity : 0.6538          
##          Pos Pred Value : 0.8393          
##          Neg Pred Value : 0.4595          
##              Prevalence : 0.7204          
##          Detection Rate : 0.5054          
##    Detection Prevalence : 0.6022          
##       Balanced Accuracy : 0.6777          
##                                           
##        'Positive' Class : no              
##                                           
## 
## $triangular
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  42   7
##        yes 25  19
##                                           
##                Accuracy : 0.6559          
##                  95% CI : (0.5502, 0.7514)
##     No Information Rate : 0.7204          
##     P-Value [Acc > NIR] : 0.931075        
##                                           
##                   Kappa : 0.2951          
##  Mcnemar's Test P-Value : 0.002654        
##                                           
##             Sensitivity : 0.6269          
##             Specificity : 0.7308          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.4318          
##              Prevalence : 0.7204          
##          Detection Rate : 0.4516          
##    Detection Prevalence : 0.5269          
##       Balanced Accuracy : 0.6788          
##                                           
##        'Positive' Class : no              
##                                           
## 
## $biweight
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  45   8
##        yes 22  18
##                                           
##                Accuracy : 0.6774          
##                  95% CI : (0.5725, 0.7707)
##     No Information Rate : 0.7204          
##     P-Value [Acc > NIR] : 0.85059         
##                                           
##                   Kappa : 0.3125          
##  Mcnemar's Test P-Value : 0.01762         
##                                           
##             Sensitivity : 0.6716          
##             Specificity : 0.6923          
##          Pos Pred Value : 0.8491          
##          Neg Pred Value : 0.4500          
##              Prevalence : 0.7204          
##          Detection Rate : 0.4839          
##    Detection Prevalence : 0.5699          
##       Balanced Accuracy : 0.6820          
##                                           
##        'Positive' Class : no              
##                                           
## 
## $cosine
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  48   7
##        yes 19  19
##                                           
##                Accuracy : 0.7204          
##                  95% CI : (0.6178, 0.8086)
##     No Information Rate : 0.7204          
##     P-Value [Acc > NIR] : 0.55260         
##                                           
##                   Kappa : 0.3919          
##  Mcnemar's Test P-Value : 0.03098         
##                                           
##             Sensitivity : 0.7164          
##             Specificity : 0.7308          
##          Pos Pred Value : 0.8727          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.7204          
##          Detection Rate : 0.5161          
##    Detection Prevalence : 0.5914          
##       Balanced Accuracy : 0.7236          
##                                           
##        'Positive' Class : no              
##                                           
## 
## $optcosine
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  42   7
##        yes 25  19
##                                           
##                Accuracy : 0.6559          
##                  95% CI : (0.5502, 0.7514)
##     No Information Rate : 0.7204          
##     P-Value [Acc > NIR] : 0.931075        
##                                           
##                   Kappa : 0.2951          
##  Mcnemar's Test P-Value : 0.002654        
##                                           
##             Sensitivity : 0.6269          
##             Specificity : 0.7308          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.4318          
##              Prevalence : 0.7204          
##          Detection Rate : 0.4516          
##    Detection Prevalence : 0.5269          
##       Balanced Accuracy : 0.6788          
##                                           
##        'Positive' Class : no              
## 

Changing the kernel for the density calculation of the numeric variables does not improve the accuracy enough to warrant inclusion in further testing. The poor performance of Naive Bayes could be due to the data having only two numeric variables.
All predicted deductibles will need to be manually checked for accuracy. For this process it may be useful to have the rules from previous years as a reference, and a decision tree can provide such rules. We will try a decision tree algorithm to replace Naive Bayes in the chunks to come.
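
As a preview of that reference, a sketch of pulling human-readable rules out of a C5.0 fit with the C50 package (rules = TRUE switches the tree to rule form; the data frame name follows the earlier chunks):

# Sketch: fit C5.0 in rules mode and print the rules for manual reference
library(C50)
em.c50.rules <- C5.0(Deductible ~ ., data = em17.tr, rules = TRUE)
summary(em.c50.rules)  # prints rules such as Category/Subcategory combinations -> yes/no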

Add C50

Decision Tree

It appears the C5.0 algorithm performs exceptionally well, with an accuracy of ~98%. The well-categorized financial data makes the decision tree particularly well-suited to determining whether a transaction is deductible. Given this performance, I’m fairly certain that this will be the algorithm of choice for determining deductions in future years. The 98% performance on the test set might be partly due to the presence of many recurring monthly subscription payments scheduled for the coming year, which provide easy-to-fit observations because they are labeled identically to the previous year. In addition to eliminating these, I will rerun the test with additional data generated over the course of working on this project plus all data used in the previous test (all transactions YTD 2018), which will likely provide a more realistic measure of accuracy.
One explanation for why the decision tree performed so well on this particular dataset is its bias toward factors with numerous levels, which, given this dataset’s numerous categories and subcategories, works to its advantage. The Category & Subcategory labels with which I tag each transaction are fairly consistent, giving the decision tree an edge through its bias toward these two many-leveled factors.

In the next section, an out-of-time test set with new data for 2018 will be loaded and filtered to include only transactions year-to-date (eliminating scheduled recurring transactions).

tuneGrids v1

In the model creation phase we used caret to create and tune models, and then manually created models of the same type using the parameters provided by the caret tuning. These models need to be regenerated with the new datasets that have matching levels. We can use the tuning parameters from the previous iterations to refine the tuning grids in this round of model generation.

Model Parameters

load("em2017modes.RData")
load("em.dt.RData")
names(em.2017.mods) <- c("svmRadial", "LogitBoost", "nb")
em.2017.mods$C5.0 <- em.dt

best.Tunes <- lapply(em.2017.mods, FUN = function(x) {
    purrr::pluck(x, list("bestTune"))
})
# ----------------------- Thu Apr 12 17:41:23 2018
# ------------------------# Based on previous tuning
# parameters, we will try ranges of values near to the best
# tunes to further refine the results.
tuneGrids <- lapply(seq_along(best.Tunes), tunes = best.Tunes, 
    mods = em.2017.mods, function(tune, tunes, mods) {
        modelnm <- names(tunes)[tune]
        varnms <- names(tunes[[tune]])
        rowmatch <- sapply(varnms, function(v) {
            which(mods[[modelnm]][["results"]][, v] %in% tunes[[modelnm]][, 
                v])
        }) %>% Reduce(intersect, .)  # Match the best tune to the row number
        out <- sapply(varnms, modelnm = modelnm, rowmatch = rowmatch, 
            function(v, modelnm, rowmatch) {
                if (mods[[modelnm]][["results"]][, v] %>% is.numeric) {
                  range(mods[[modelnm]][["results"]][c((rowmatch - 
                    1):(rowmatch + 1)), v]) %>% summary %>% c(mods[[modelnm]][["results"]][rowmatch, 
                    v]) %>% unique
                } else {
                  mods[[modelnm]][["results"]][, v] %>% unique
                }
            })
        if (is.list(out)) {
            out <- expand.grid(out)
        } else {
            out <- as.data.frame(out)
        }
        # Create new tuning grids based on the best tune parameters,
        # and parameter values from iterations before and after
        return(out)
    })
mods <- vector("list", length(tuneGrids))
names(mods) <- names(tuneGrids) <- sapply(em.2017.mods, purrr::pluck, 
    "method") %>% subset(., subset = !sapply(., is.null))

Test on 2018 Out-of-Time test set

Regenerate Models

Modified AUC
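
The modified AUC helper is not echoed, but the decision-theory summaries below match the output format of psych::AUC, so the underlying call presumably resembles the hedged sketch here (pred and truth are placeholders for a model’s test-set predictions and the true Deductible labels; the table is oriented with truth in rows to match the printed tables):

# Sketch: decision-theory / AUC summary from a 2 x 2 truth-by-prediction table
library(psych)
tab <- table(truth, pred)  # truth in rows, predictions in columns
psych::AUC(tab)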

Now to compare the performance across the various models

All model performance on test set

## $svmRadial
## Decision Theory and Area under the Curve
## 
## The original data implied the following 2 x 2 table
##          Predicted.Pos Predicted.Neg
## True.Pos         0.151          0.13
## True.Neg         0.043          0.68
## 
## Conditional probabilities of 
##          Predicted.Pos Predicted.Neg
## True.Pos          0.54          0.46
## True.Neg          0.06          0.94
## 
## Accuracy =  0.83  Sensitivity =  0.54   Specificity =  0.94 
## with Area Under the Curve =  0.88
## d.prime =  1.65  Criterion =  1.56  Beta =  0.22
## Observed Phi correlation =  0.54 
## Inferred latent (tetrachoric) correlation  =  0.8
## $LogitBoost
## Decision Theory and Area under the Curve
## 
## The original data implied the following 2 x 2 table
##          Predicted.Pos Predicted.Neg
## True.Pos         0.221         0.047
## True.Neg         0.035         0.698
## 
## Conditional probabilities of 
##          Predicted.Pos Predicted.Neg
## True.Pos         0.826          0.17
## True.Neg         0.048          0.95
## 
## Accuracy =  0.92  Sensitivity =  0.83   Specificity =  0.95 
## with Area Under the Curve =  0.97
## d.prime =  2.61  Criterion =  1.67  Beta =  0.32
## Observed Phi correlation =  0.79 
## Inferred latent (tetrachoric) correlation  =  0.95
## $C5.0
## Decision Theory and Area under the Curve
## 
## The original data implied the following 2 x 2 table
##          Predicted.Pos Predicted.Neg
## True.Pos         0.258         0.022
## True.Neg         0.032         0.688
## 
## Conditional probabilities of 
##          Predicted.Pos Predicted.Neg
## True.Pos         0.923         0.077
## True.Neg         0.045         0.955
## 
## Accuracy =  0.95  Sensitivity =  0.92   Specificity =  0.96 
## with Area Under the Curve =  0.99
## d.prime =  3.12  Criterion =  1.7  Beta =  0.38
## Observed Phi correlation =  0.87 
## Inferred latent (tetrachoric) correlation  =  0.98

C5.0 is again the top-performing algorithm for predicting deductibles, by measures of Accuracy and AUC and by the visible class separation in the Decision Theory graph.

tuneGrids v2

Refine tuneGrids

best.Tunes <- lapply(mods, FUN = function(x) {
    purrr::pluck(x, list("bestTune"))
})
# ----------------------- Wed Apr 18 15:10:39 2018
# ------------------------# Refine tuneGrids function to use
# the range of those trials that have significance at the .05
# alpha levels as being the best performers

tuneGrids <- lapply(seq_along(best.Tunes), tunes = best.Tunes, 
    mods = mods, function(tune, tunes, mods) {
        modelnm <- names(tunes)[tune]
        varnms <- names(tunes[[tune]])
        rowmatch <- sapply(varnms, function(v) {
            which(mods[[modelnm]][["results"]][, v] %in% tunes[[modelnm]][, 
                v])
        }) %>% Reduce(intersect, .)  # Match the best tune to the row number
        perf.measures <- list(Accuracy = mods[[modelnm]][["results"]][["Accuracy"]] %>% 
            .[order(., decreasing = T)], Kappa = mods[[modelnm]][["results"]][["Kappa"]] %>% 
            .[order(., decreasing = T)])
        
        top.ind <- purrr::map2(.x = perf.measures, .y = names(perf.measures), 
            function(.x, .y) {
                which(mods[[modelnm]][["results"]][[.y]] %in% 
                  .x[.x > qnorm(0.95, mean(.x), sd(.x))])
            }) %>% unlist %>% unique  # Select the best-performing tunes based on Accuracy and Kappa
        if (top.ind %>% length > 0) {
            btunes <- mods[[modelnm]][["results"]][c(top.ind, 
                rowmatch), ]
        } else {
            btunes <- c((rowmatch - 1):(rowmatch + 1))
        }
        out <- sapply(varnms, modelnm = modelnm, rowmatch = rowmatch, 
            function(v, modelnm, rowmatch) {
                if (mods[[modelnm]][["results"]][, v] %>% is.numeric) {
                  range(mods[[modelnm]][["results"]][btunes, 
                    v]) %>% summary %>% c(mods[[modelnm]][["results"]][rowmatch, 
                    v]) %>% unique
                } else {
                  mods[[modelnm]][["results"]][, v] %>% unique
                }
            })
        if (is.list(out)) {
            out <- expand.grid(out)
        } else {
            out <- as.data.frame(out)
        }
        # Create new tuning grids based on the best tune parameters,
        # and parameter values from iterations before and after
        return(out)
    })

Ensemble

caretEnsemble

## NOT RUN - BUG: Error: cannot compute class probabilities
## for regressionError in evalSummaryFunction(y, wts =
## weights, ctrl = trControl, lev = classLevels, : ##
## train()'s use of ROC codes requires class probabilities.
## See the classProbs option of trainControl()
## ----------------------- Thu Apr 12 23:14:30 2018
## ------------------------# Predictions from caret train
## apparently will not save even when savePredictions=T. We
## will have to try the caretEnsemble full workup from the
## vignette tomorrow.  ----------------------- Fri Apr 13
## 07:34:18 2018 ------------------------# Since we've seen
## that naive bayes doesn't perform too well with this data, we
## will replace it with the adaboost algorithm, a boosted
## decision tree learner ----------------------- Wed Apr 18
## 18:35:23 2018 ------------------------# FIX: formula is
## declared via the variable form - not formula



set.seed(1)
req.packages <- c("doParallel", "kernlab", "caTools", "C50", 
    "parallel", "iterators", "MASS", "foreach", "caret", "tidyverse", 
    "dplyr", "htmltools", "magrittr", "repmis")
for (q in seq_along(req.packages)) {
    suppressPackageStartupMessages(library(req.packages[q], character.only = T))
}
# repmis::source_data('https://github.com/yogat3ch/da5030/blob/master/matchedlevels.RData?raw=true')
data.train <- caret::createDataPartition(matchedlevels$olddata[["Deductible"]], 
    times = 2, p = 0.85)
data.train <- caret::trainControl(method = "repeatedcv", index = data.train, 
    number = 10, repeats = 1, search = "grid", allowParallel = T, 
    summaryFunction = caret::twoClassSummary, classProbs = T, 
    savePredictions = "final", returnResamp = "final")
form <- as.formula(paste0("Deductible", " ~ ."))
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
getDoParWorkers()
model.list <- caretEnsemble::caretList(form = form, data = matchedlevels$olddata, 
    trControl = data.train, metric = "ROC", methodList = c("svmRadial", 
        "LogitBoost", "adaboost", "C5.0"), tuneList = list(svmRadial = caretEnsemble::caretModelSpec(method = "svmRadial", 
        tuneGrid = tuneGrids$svmRadial), LogitBoost = caretEnsemble::caretModelSpec(method = "LogitBoost", 
        tuneGrid = tuneGrids$LogitBoost), adaboost = caretEnsemble::caretModelSpec(method = "adaboost", 
        tuneLength = 10), C5.0 = caretEnsemble::caretModelSpec(method = "C5.0", 
        tuneGrid = tuneGrids$C5.0)))
stopCluster(cl)
registerDoSEQ()
# ----------------------- Tue Apr 17 14:10:03 2018
# ------------------------# Models need to be filtered to
# only return one resample, saved back to the caretList and
# then used with caretEnsemble
modellist.cM <- list(svmRadial = confusionMatrix(predict(model.list$svmRadial, 
    newdata = matchedlevels$newdata), matchedlevels$newdata$Deductible, 
    positive = "yes"), LogitBoost = confusionMatrix(predict(model.list$LogitBoost, 
    newdata = matchedlevels$newdata), matchedlevels$newdata$Deductible, 
    positive = "yes"), adaboost = confusionMatrix(predict(model.list$adaboost, 
    newdata = matchedlevels$newdata), matchedlevels$newdata$Deductible, 
    positive = "yes"), C5.0 = confusionMatrix(predict(model.list$C5.0, 
    newdata = matchedlevels$newdata), matchedlevels$newdata$Deductible, 
    positive = "yes"))
save(modellist.cM, file = "modellist.cM.RData")
save(model.list, file = "caretEnsemble_model.list.RData")

Run the ensemble

Confusion Matrix of small stacked ensemble model

Does it work with the original model list?

caretStack with original model list

caretStack Ensemble confusion Matrix

It does appear to work.
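
For reference, a sketch of the stacking step itself: caretEnsemble::caretStack fits a meta-model (here a GLM) on the base models’ resample predictions, and the stack is then evaluated on the out-of-time test set (control settings are illustrative; model.list and matchedlevels come from the chunk above):

# Sketch: stack the caretList base models with a GLM meta-learner
library(caretEnsemble)
stack.ctrl <- caret::trainControl(method = "cv", number = 10, classProbs = TRUE, 
    summaryFunction = caret::twoClassSummary)
glm.stack <- caretEnsemble::caretStack(model.list, method = "glm", metric = "ROC", 
    trControl = stack.ctrl)
caret::confusionMatrix(predict(glm.stack, newdata = matchedlevels$newdata), 
    matchedlevels$newdata$Deductible, positive = "yes")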

Ensemble model performance

Performance of all Models

## $svmRadial
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  66  15
##        yes  1  11
##                                           
##                Accuracy : 0.828           
##                  95% CI : (0.7357, 0.8983)
##     No Information Rate : 0.7204          
##     P-Value [Acc > NIR] : 0.011366        
##                                           
##                   Kappa : 0.4887          
##  Mcnemar's Test P-Value : 0.001154        
##                                           
##             Sensitivity : 0.4231          
##             Specificity : 0.9851          
##          Pos Pred Value : 0.9167          
##          Neg Pred Value : 0.8148          
##              Prevalence : 0.2796          
##          Detection Rate : 0.1183          
##    Detection Prevalence : 0.1290          
##       Balanced Accuracy : 0.7041          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## $LogitBoost
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  64   4
##        yes  3  22
##                                          
##                Accuracy : 0.9247         
##                  95% CI : (0.851, 0.9692)
##     No Information Rate : 0.7204         
##     P-Value [Acc > NIR] : 0.0000008947   
##                                          
##                   Kappa : 0.8109         
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.8462         
##             Specificity : 0.9552         
##          Pos Pred Value : 0.8800         
##          Neg Pred Value : 0.9412         
##              Prevalence : 0.2796         
##          Detection Rate : 0.2366         
##    Detection Prevalence : 0.2688         
##       Balanced Accuracy : 0.9007         
##                                          
##        'Positive' Class : yes            
##                                          
## 
## $adaboost
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  43   4
##        yes 24  22
##                                          
##                Accuracy : 0.6989         
##                  95% CI : (0.595, 0.7897)
##     No Information Rate : 0.7204         
##     P-Value [Acc > NIR] : 0.7219591      
##                                          
##                   Kappa : 0.395          
##  Mcnemar's Test P-Value : 0.0003298      
##                                          
##             Sensitivity : 0.8462         
##             Specificity : 0.6418         
##          Pos Pred Value : 0.4783         
##          Neg Pred Value : 0.9149         
##              Prevalence : 0.2796         
##          Detection Rate : 0.2366         
##    Detection Prevalence : 0.4946         
##       Balanced Accuracy : 0.7440         
##                                          
##        'Positive' Class : yes            
##                                          
## 
## $C5.0
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  64   2
##        yes  3  24
##                                          
##                Accuracy : 0.9462         
##                  95% CI : (0.879, 0.9823)
##     No Information Rate : 0.7204         
##     P-Value [Acc > NIR] : 0.00000003033  
##                                          
##                   Kappa : 0.8681         
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9231         
##             Specificity : 0.9552         
##          Pos Pred Value : 0.8889         
##          Neg Pred Value : 0.9697         
##              Prevalence : 0.2796         
##          Detection Rate : 0.2581         
##    Detection Prevalence : 0.2903         
##       Balanced Accuracy : 0.9392         
##                                          
##        'Positive' Class : yes            
##                                          
## 
## $StackedSmall
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  65   8
##        yes  2  18
##                                           
##                Accuracy : 0.8925          
##                  95% CI : (0.8111, 0.9472)
##     No Information Rate : 0.7204          
##     P-Value [Acc > NIR] : 0.00005051      
##                                           
##                   Kappa : 0.7128          
##  Mcnemar's Test P-Value : 0.1138          
##                                           
##             Sensitivity : 0.6923          
##             Specificity : 0.9701          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.8904          
##              Prevalence : 0.2796          
##          Detection Rate : 0.1935          
##    Detection Prevalence : 0.2151          
##       Balanced Accuracy : 0.8312          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## $StackedLarge
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  58   8
##        yes  9  18
##                                           
##                Accuracy : 0.8172          
##                  95% CI : (0.7235, 0.8898)
##     No Information Rate : 0.7204          
##     P-Value [Acc > NIR] : 0.02146         
##                                           
##                   Kappa : 0.5515          
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.6923          
##             Specificity : 0.8657          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.8788          
##              Prevalence : 0.2796          
##          Detection Rate : 0.1935          
##    Detection Prevalence : 0.2903          
##       Balanced Accuracy : 0.7790          
##                                           
##        'Positive' Class : yes             
## 

ROC and AUC for all models

## $svmRadial
## $svmRadial$Name
## [1] "svmRadial"
## 
## $svmRadial$AUC
## Decision Theory and Area under the Curve
## 
## The original data implied the following 2 x 2 table
##          Predicted.Pos Predicted.Neg
## True.Pos         0.118          0.16
## True.Neg         0.011          0.71
## 
## Conditional probabilities of 
##          Predicted.Pos Predicted.Neg
## True.Pos         0.423          0.58
## True.Neg         0.015          0.99
## 
## Accuracy =  0.83  Sensitivity =  0.42   Specificity =  0.99 
## with Area Under the Curve =  0.92
## d.prime =  1.98  Criterion =  2.17  Beta =  0.17
## Observed Phi correlation =  0.55 
## Inferred latent (tetrachoric) correlation  =  0.87
## 
## $LogitBoost
## $LogitBoost$Name
## [1] "LogitBoost"
## 
## $LogitBoost$AUC
## Decision Theory and Area under the Curve
## 
## The original data implied the following 2 x 2 table
##          Predicted.Pos Predicted.Neg
## True.Pos         0.237         0.043
## True.Neg         0.032         0.688
## 
## Conditional probabilities of 
##          Predicted.Pos Predicted.Neg
## True.Pos         0.846          0.15
## True.Neg         0.045          0.96
## 
## Accuracy =  0.92  Sensitivity =  0.85   Specificity =  0.96 
## with Area Under the Curve =  0.97
## d.prime =  2.72  Criterion =  1.7  Beta =  0.34
## Observed Phi correlation =  0.81 
## Inferred latent (tetrachoric) correlation  =  0.96
## 
## $adaboost
## $adaboost$Name
## [1] "adaboost"
## 
## $adaboost$AUC
## Decision Theory and Area under the Curve
## 
## The original data implied the following 2 x 2 table
##          Predicted.Pos Predicted.Neg
## True.Pos          0.24         0.043
## True.Neg          0.26         0.462
## 
## Conditional probabilities of 
##          Predicted.Pos Predicted.Neg
## True.Pos          0.85          0.15
## True.Neg          0.36          0.64
## 
## Accuracy =  0.7  Sensitivity =  0.85   Specificity =  0.64 
## with Area Under the Curve =  0.84
## d.prime =  1.38  Criterion =  0.36  Beta =  0.51
## Observed Phi correlation =  0.44 
## Inferred latent (tetrachoric) correlation  =  0.69
## 
## $C5.0
## $C5.0$Name
## [1] "C5.0"
## 
## $C5.0$AUC
## Decision Theory and Area under the Curve
## 
## The original data implied the following 2 x 2 table
##          Predicted.Pos Predicted.Neg
## True.Pos         0.258         0.022
## True.Neg         0.032         0.688
## 
## Conditional probabilities of 
##          Predicted.Pos Predicted.Neg
## True.Pos         0.923         0.077
## True.Neg         0.045         0.955
## 
## Accuracy =  0.95  Sensitivity =  0.92   Specificity =  0.96 
## with Area Under the Curve =  0.99
## d.prime =  3.12  Criterion =  1.7  Beta =  0.38
## Observed Phi correlation =  0.87 
## Inferred latent (tetrachoric) correlation  =  0.98
## 
## $StackedSmall
## $StackedSmall$Name
## [1] "StackedSmall"
## 
## $StackedSmall$AUC
## Decision Theory and Area under the Curve
## 
## The original data implied the following 2 x 2 table
##          Predicted.Pos Predicted.Neg
## True.Pos         0.194         0.086
## True.Neg         0.022         0.699
## 
## Conditional probabilities of 
##          Predicted.Pos Predicted.Neg
## True.Pos          0.69          0.31
## True.Neg          0.03          0.97
## 
## Accuracy =  0.89  Sensitivity =  0.69   Specificity =  0.97 
## with Area Under the Curve =  0.95
## d.prime =  2.39  Criterion =  1.88  Beta =  0.28
## Observed Phi correlation =  0.72 
## Inferred latent (tetrachoric) correlation  =  0.93
## 
## $StackedLarge
## $StackedLarge$Name
## [1] "StackedLarge"
## 
## $StackedLarge$AUC
## Decision Theory and Area under the Curve
## 
## The original data implied the following 2 x 2 table
##          Predicted.Pos Predicted.Neg
## True.Pos         0.194         0.086
## True.Neg         0.097         0.624
## 
## Conditional probabilities of 
##          Predicted.Pos Predicted.Neg
## True.Pos          0.69          0.31
## True.Neg          0.13          0.87
## 
## Accuracy =  0.82  Sensitivity =  0.69   Specificity =  0.87 
## with Area Under the Curve =  0.87
## d.prime =  1.61  Criterion =  1.11  Beta =  0.31
## Observed Phi correlation =  0.55 
## Inferred latent (tetrachoric) correlation  =  0.78

Comparing the small and large ensemble models shows that including the second set of resampling predictions in the caretList added variance to the stacked GLM probabilities and thus reduced the accuracy, AUC, and class separation. Overall, C5.0 performs exceptionally well while requiring far less computation (it trains in a matter of seconds running in parallel) and memory (the ensemble models are nearly 300 MB each, whereas the C5.0 object is 1.6 MB). For the task of classifying deductibles from well-labelled transaction data, C5.0 is clearly the better tool, though building the ensemble models was a valuable learning experience.
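These size and speed claims can be checked directly from the fitted objects. A minimal sketch, assuming the C5.0 train object and the stacked ensemble are stored as mod.C50 and mod.stacked (hypothetical names; the actual objects are created in the modeling chunks above):

format(object.size(mod.C50), units = "MB")  # single C5.0 model, on the order of a few MB
format(object.size(mod.stacked), units = "MB")  # caretStack ensemble, on the order of hundreds of MB
mod.C50$times$everything  # elapsed training time recorded by caret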

detach packages

Expense Data Cluster Analysis

Preparing data for Cluster Analysis

## $Date
## NULL
## 
## $Amount
## NULL
## 
## $Category
## NULL
## 
## $Subcategory
## NULL
## 
## $Payment.Method
## NULL

Clustering with Gaussian Mixture Models

ClusterR::GMM fits Gaussian components to the data to form clusters. Two datasets are tested for clustering outcomes:

  • A full data set with category and subcategory labels
  • Since expenses tend to vary by time and amount primarily, a reduced dataset with just date and amount is also used.
Further explanation of the rationale for using both Mahalanobis and Euclidean distances can be found in the paragraph introducing the Cluster Medoids section below.
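Before fixing gaussian_comps at 5 in the chunk below, the component count could be sanity-checked with an information criterion. A minimal sketch (not part of the original analysis) using ClusterR::Optimal_Clusters_GMM on the scaled matrix em.cl.trScaled created in that chunk:

# Compute BIC for 1 to 10 Gaussian components; the elbow/minimum in the curve
# suggests a reasonable component count before committing to gaussian_comps = 5.
opt_gmm <- ClusterR::Optimal_Clusters_GMM(em.cl.trScaled, max_clusters = 10, 
    criterion = "BIC", dist_mode = "eucl_dist", km_iter = 12, em_iter = 10, 
    plot_data = TRUE, seed = 1)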

Cluster Analysis with Gaussian Mixture Models

## [1] 4639.9215 2193.3093 1838.5082 1056.4875  876.3358
## attr(,"class")
## [1] "Gaussian Mixture Models"
em.cl.trScaled <- ClusterR::center_scale(em.cl.tr)  # A vectorized scaling function for data frames.
emCluster <- ClusterR::GMM(em.cl.trScaled, gaussian_comps = 5, 
    dist_mode = "eucl_dist", km_iter = 12, em_iter = 10, seed = 1)
em.cl.tr$ClusterGMM.euc <- ClusterR::predict_GMM(em.cl.trScaled, 
    emCluster$centroids, emCluster$covariance_matrices, emCluster$weights)$cluster_labels
emCluster <- ClusterR::GMM(em.cl.trScaled, gaussian_comps = 5, 
    dist_mode = "maha_dist", km_iter = 12, em_iter = 10, seed = 1)
em.cl.tr$ClusterGMM.mah <- ClusterR::predict_GMM(em.cl.trScaled, 
    emCluster$centroids, emCluster$covariance_matrices, emCluster$weights)$cluster_labels


# ----------------------- Sun Apr 08 20:01:08 2018
# ------------------------# Reduced set of cluster data with
# just date & amount
em.cl.trReduced <- sapply(em17.tr[, c(1:2)] %>% filter(Amount < 
    0), as.numeric) %>% as.data.frame %>% ClusterR::center_scale()
emReducedCluster <- ClusterR::GMM(em.cl.trReduced, gaussian_comps = 5, 
    dist_mode = "eucl_dist", km_iter = 12, em_iter = 10, seed = 1)
em.cl.tr$Red.ClusterGMM.euc <- ClusterR::predict_GMM(em.cl.trReduced, 
    emReducedCluster$centroids, emReducedCluster$covariance_matrices, 
    emReducedCluster$weights)$cluster_labels
emReducedCluster <- ClusterR::GMM(em.cl.trReduced, gaussian_comps = 5, 
    dist_mode = "maha_dist", km_iter = 12, em_iter = 10, seed = 1)
em.cl.tr$Red.ClusterGMM.mah <- ClusterR::predict_GMM(em.cl.trReduced, 
    emReducedCluster$centroids, emReducedCluster$covariance_matrices, 
    emReducedCluster$weights)$cluster_labels

Clustering with Cluster Medoids

Cluster medoids (k-medoids) is a clustering algorithm that uses actual data points as cluster centers. It is more robust to noise and outliers because it minimizes the sum of pairwise dissimilarities rather than the sum of squared Euclidean distances. I'm going to try Mahalanobis distance to account for the transactional nature of the data across time, whereby a cluster of transactions may have an ellipsoidal shape when mapped against time. Furthermore, an outlier on the amount axis will likely still be absorbed into a cluster, because distances are measured along the eigenvectors of the covariance matrix, so extreme values along a correlated direction are more likely to join an existing cluster than to form one of their own.
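To make the Mahalanobis intuition concrete, a small illustration on synthetic data (not the expense data): two points equally far from the centroid in Euclidean terms can differ greatly in Mahalanobis terms when the data cloud is elongated.

# Toy bivariate data with strong correlation, mimicking an ellipsoidal cluster.
set.seed(1)
xy <- MASS::mvrnorm(500, mu = c(0, 0), Sigma = matrix(c(1, 0.9, 0.9, 1), 2))
p_along <- c(2, 2)  # lies along the major axis of the cloud
p_across <- c(2, -2)  # lies across the major axis
# Both points are ~2.8 units from the centroid in Euclidean distance, but the
# point along the axis has a much smaller Mahalanobis distance and so would be
# absorbed into the cluster rather than forming its own.
sqrt(stats::mahalanobis(rbind(p_along, p_across), colMeans(xy), cov(xy)))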

The results of both clustering techniques will be compared using the clusterCrit package.

kmedoids clustering

# ----------------------- Sun Apr 08 20:31:22 2018
# ------------------------#
# ClusterR::Optimal_Clusters_Medoids(em.cl.trScaled,max_clusters
# = 12, distance_metric = 'euclidean', criterion =
# 'dissimilarity') The plot shows an elbow at 5 clusters,
# with decreases in dissimilarity of .005 and .006
# thereafter. 5 clusters will be used for comparison with the
# GMM method. Euclidean distance metric will also be
# considered for comparison with the GMM method.
em.mah.Medoids <- ClusterR::Cluster_Medoids(em.cl.trScaled, clusters = 5, 
    distance_metric = "mahalanobis")
em.euc.Medoids <- ClusterR::Cluster_Medoids(em.cl.trScaled, clusters = 5, 
    distance_metric = "euclidean")
em.cl.tr$ClusterM.mah <- ClusterR::predict_Medoids(data = em.cl.trScaled, 
    MEDOIDS = em.mah.Medoids$medoids, distance_metric = "mahalanobis")$clusters
em.cl.tr$ClusterM.euc <- ClusterR::predict_Medoids(data = em.cl.trScaled, 
    MEDOIDS = em.euc.Medoids$medoids, distance_metric = "euclidean")$clusters
# Reduced
em.mah.Medoids <- ClusterR::Cluster_Medoids(em.cl.trReduced, 
    clusters = 5, distance_metric = "mahalanobis")
em.euc.Medoids <- ClusterR::Cluster_Medoids(em.cl.trReduced, 
    clusters = 5, distance_metric = "euclidean")
em.cl.tr$Red.ClusterM.mah <- ClusterR::predict_Medoids(data = em.cl.trReduced, 
    MEDOIDS = em.mah.Medoids$medoids, distance_metric = "mahalanobis")$clusters
em.cl.tr$Red.ClusterM.euc <- ClusterR::predict_Medoids(data = em.cl.trReduced, 
    MEDOIDS = em.euc.Medoids$medoids, distance_metric = "euclidean")$clusters

Cluster Evaluation with clusterCrit

The clusterCrit package provides numerous clustering evaluation criteria, most of them based on measures of variance minimization. I would like to learn more about each criterion as time allows; for now, all of them are computed and paired with a voting system, so the best clustering is the one that receives the most votes across the full set of criteria.
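A minimal sketch of that voting scheme, scoring two of the labellings produced above against the scaled data (the actual evaluation chunk scores all of the labellings and pairs each with the data it was clustered on, so its results differ):

# Each internal criterion votes for the labelling it scores best; the labelling
# with the most votes wins.
crit_names <- clusterCrit::getCriteriaNames(TRUE)  # all internal criteria
score <- function(labels) unlist(clusterCrit::intCriteria(as.matrix(em.cl.trScaled), 
    as.integer(as.factor(labels)), crit_names))
crit_vals <- cbind(GMM.euc = score(em.cl.tr$ClusterGMM.euc), 
    M.mah = score(em.cl.tr$ClusterM.mah))
votes <- sapply(seq_along(crit_names), function(i) colnames(crit_vals)[clusterCrit::bestCriterion(crit_vals[i, 
    ], crit_names[i])])
sort(table(votes), decreasing = TRUE)  # the most-voted labelling is kept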

Cluster Evaluation

## .
##       ClusterM.mah Red.ClusterGMM.euc   Red.ClusterM.euc 
##                  3                  3                 14 
##   Red.ClusterM.mah 
##                 20
## .
##   Red.ClusterM.mah   Red.ClusterM.euc       ClusterM.mah 
##                 20                 14                  3 
## Red.ClusterGMM.euc 
##                  3

The best-performing technique was medoid clustering, and it performed best on the reduced dataset regardless of distance measure. This method uses an actual data point as each cluster center rather than a randomly initialized centroid, and at each iteration it updates the medoids to reduce the total pairwise dissimilarity within clusters. On the reduced dataset of only dates and amounts, this should pinpoint clusters of expenses and allow spending trends to be observed across time.

Bind best performing clusters to original data

Data Visualization

Final draft of the visualization

Cluster Analysis

The cluster analysis can be found in the section above entitled Expense Data Cluster Analysis.
Documentation for the code contained in each chunk can be found preceding and sometimes following the chunks. Comments that elaborate specific code segments can be found inline.

Data Cleaning and Transformation

Much of the data transformation took place as required by a particular analytical task. The data transformation can be found in part in the Data Loading and Cleaning section. Further cleaning of the data can be found in Load and Transform Data. Additional data cleaning was performed as needed by the specific visualization type and can be found documented accompanying the code for the visualization.


Narrative

As per suggestions during the evaluation period, the order of the visualizations in the narrative was rearranged to place the timeseries in the first position. Additionally, a comparison of the change in median income to the actual change in expenses between Asheville and Boston was positioned to complement the information already present in the title area.
The timeseries visualization now includes custom-selected shapes that allow for easier differentiation of transactions in adjacent clusters in time. The legend key shapes are also larger to draw a direct parallel with the network graph of the clusters below the timeseries. The cluster nodes in the network graph have been manually modified to match the corresponding shapes used in the timeseries. Links connecting the central ‘2017’ node to the cluster nodes have been given a heavier weight so that the location of each cluster node ID can be easily discerned and each cluster can be visually tracked outward from there. The color legend for categories sits between the network graph and the timeseries to make it easy to connect colors with categories as the two visualizations are cross-referenced. Furthermore, the explanations positioned near each cluster are now in black for legibility, with a small cluster-shape signifier as a reference point connecting the text with the appropriate cluster. I feel these modifications make the connection between the timeseries and the network graph more apparent.

The cost of living visualization is straightforward and includes notable insights highlighted with text on the graph. As per a peer's suggestion, the legend has been simplified to fill-only and fill-stroked bars that signify location, instead of repetitive text along the x-axis. Additionally, ‘+’ signs have been added to all positive values for contrast with negative values.

The visualization of market-signal backtesting on Google Class A & B stocks for 2017 uses color-coded buy & sell labels in the description that correspond with vertical lines marking the points in time where each action takes place. The success or failure of a signal can thus be seen in the number and momentum of the upward/downward triangle points denoting the Parabolic SAR indicator that lie between each blue buy line and each red sell line. The algorithm definitely needs further work, because it actually performs slightly worse than buy & hold.

Choice of Visualization

My intention with this project was to take a network perspective on expense transactions in order to better understand patterns of spending. I felt that a cluster analysis combined with a network graph would be a way of gaining this perspective into the data. Once the network graph took form, it became apparent that richer insights about the networked nature of transactions could be gained from a timeseries graph imbued with visual continuity to the network graph than from a typical timeseries graph (as seen in the first draft). The result appears in the final draft. The stacked bar graph in the first draft served as a summary of the expenses in each category (split by location), intended to magnify the massive difference in total expense between Asheville and Boston. The inclusion of educational expense (an outlier which occurred upon moving to Boston) emphasized this difference greatly. However, the stacked bar graph did not make for a fluid comparison of individual expense categories between the cities. It was later decided that the most relevant information to communicate to an audience unconcerned with my personal expenses was the change in cost-of-living expenses they might expect to incur when moving from a smaller town (like Asheville) to a metropolitan area (like Boston). The facet graph with accompanying percentage change by category shows these direct comparisons more clearly. In addition, educational expense was averaged over all months of the year, as it is typically thought about in terms of years even though it might be paid in one or two transactions. Even with this generous distribution, it is clear that the cost per month is still greater than the largest cost-of-living expense: rent for a two-bedroom apartment (which houses two people, as indicated by the split on the respective bar).
The final visualization is a timeseries, as is typical when visualizing financial data. Space was limited for this visualization and thus it did not print very well, but the PDF maintains sufficient granularity to see the rise and fall of the Parabolic SAR between the buy & sell lines, so the performance of the indicator algorithm can be tracked across the timeseries.


Consistency of Layout and Visual Language

My intention was to maintain a consistent color mapping across all visualizations that used the original transaction data set, so that a specific category of expense could be tracked across all three: its frequency (in the timeseries), the types of transactions that typically occur around it in time or are similar in amount (the network graph), and how it changed with the move from Asheville to Boston (the cost of living comparison). For tracking the clusters, carrying the shape-to-cluster association from the point representations in the timeseries, to the cluster nodes in the network graph, to the symbol reference points beside the respective cluster descriptions provides a more coherent narrative of each cluster should the viewer choose to follow it. Additionally, a sizeable effort went into maintaining subcategory colors in the network graph by custom coding a function that varies the luminance of the primary category color to generate colors for its subcategories. In the portfolio management timeseries, a symbolic link is added between the description of the Parabolic SAR indicator and its representation in the graph, complementing the existing color link between the description of the buy and sell signals and their vertical lines in the visualization. The legend has been repositioned to the top of the graph to provide explanatory detail about the SAR symbols as the gaze moves from description to visualization.


Iterative Improvements and Peer Feedback

The project has changed shape dramatically over its time in development; the only thing remaining from the first few drafts is the concept. Peer and professor reception was taken into account when the first two drafts were presented, and it was apparent that some ingenuity would be necessary to make this dataset engaging as a visualization since, admittedly, tax data doesn't necessarily make for brilliant visualizations. As per a suggestion during the first evaluation, a long (vertical/digital) leporello format was adopted for its familiarity in the realm of infographics and for the amount of space it provides. The timeseries, per the professor's advice, was log transformed. In the first attempt I did not understand why the transformation and subsequent graphing failed; the issue was traced to NaNs appearing in the dataset, leaving only a few points visible in the graph. On approaching the task again later, it became clear that negative values become NaN when log transformed. Thus the sign of each value was extracted prior to the transform, the absolute value was log transformed, and the sign was reapplied to arrive at the dataset that appears in the final timeseries.
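The fix described above amounts to a sign-preserving log transform; a minimal sketch of the idea (the version actually used appears in the Prepare Timeseries Data chunk):

# Log the magnitude, then restore the sign, so negative amounts (expenses) no
# longer become NaN. Amounts of exactly zero would still need separate handling.
signed_log <- function(x) sign(x) * log(abs(x))
signed_log(c(-1500, -25, 40, 800))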
Peer Feedback
After posting the second-to-final draft on NuVustudio fairly late in the process, it was a pleasant surprise to see significant and well-thought-out peer reviews appear in the comment thread on short notice. Each of the three commenters' feedback was taken into account and incorporated into the final result. The suggestions were:

  • Moving the Network graph to the top as it appears to be the center of attention (though this was later revised as per advice of the professors in the evaluation session)
  • Better consistency of text sizes and fonts (implemented manually: all text was changed to a single font and only four font sizes were used, for primary titles, subtitles, axis text & document descriptions, and axis tick text)
  • Adding + signs to positive values of change in Cost of living (to better contrast with negative)
  • Removal of the non-log transformed timeseries as it didn’t provide any additional relevant information
  • Removal of the grid artifact made by using the shape tool then object>split to grid to provide 1/3rds gridlines in Illustrator
  • Clarification of the description for cost of living comparisons to indicate that the data was derived from my personal expense data and not an online source
  • Removal of the repetitive Asheville/Boston on the x-axis of the cost of living comparison, and the suggestion of instead using a legend in conjunction with a visual modification of each bar.
Suggestions (and realizations) from the poster evaluation sessions were implemented in the final version:
  • The shape legend for the timeseries was re-added.
  • The size of the shapes in the legend was increased for legibility.
  • The network graph was modified such that cluster nodes maintained these shapes and therefore stand out to the viewer as being linked to the clusters in the timeseries.
  • Shape reference points were added to the descriptions of each cluster. The descriptions themselves were not removed, as they provide some insight into the possible reasons for the formation of each cluster (though the time allotted for the evaluation meant this couldn't be delved into, so they may have seemed unnecessary as a result).
  • There was a suggestion to use cluster hulls to confine each of the clusters on the network graph. However, since each node consists of a center coordinate only and the shape itself is not defined by coordinates, adding a cluster hull would only be possible through a manual operation in Illustrator. Considering how it might appear: a border line might add busyness to what is already a fairly busy visualization, while an underlying colored area would not look busy, but color is not mapped to the cluster (it is mapped to categories) and therefore could create false associations. Thus it was decided that the shapes from the timeseries would be used for the cluster nodes, and that the links from the center outward to each node would be given heavier weight so the cluster node could be easily found by moving out from the center, allowing the viewer to more clearly see the links within that cluster and the whitespace that separates it from adjacent clusters.
  • The addition of a comparison between average change across all categories of cost of living and the change in (census reported) median income between the two cities.
  • The legend for the portfolio management visual was moved to the top and the shapes were added to the description.

Helpful feedback was essential in the evolution of the project from its first form to the final draft. The peer-review and studio style of this class was especially helpful in improving our skills through trial and error. I would like to see more classes at Northeastern adopt the degree of collaborative and constructive exchange between class members that this class exemplified.

Clusters as Networks on Txns

The chunk below uses the cluster labels that received the most votes in the analysis to create a long format edge data frame.

Txn Link DF

best.clustermethod <- names(emClusters.best[order(emClusters.best,decreasing = T)][1]) %>% as.name # convert the column name of the cluster labels to a name for use in future dplyr function calls
em2017.net <- rbind(
matrix(data=c(
rep("2017",5),
1:5,
em2017.clusters %>% group_by(!!best.clustermethod) %>% summarize(amount=sum(Amount)) %>% .[["amount"]] %>% as.character), # bind year to cluster & amount
ncol=3,dimnames = list(row=NULL,col=c("from","to","amount"))),
em2017.clusters %>% group_by(!!best.clustermethod,Category) %>% summarise(amount=sum(Amount)) %>% rowwise %>%  mutate(to=paste(!!best.clustermethod,Category,sep="_")) %>% select(!!best.clustermethod,to,amount) %>% as.matrix, # cluster to category & amount
em2017.clusters %>% group_by(!!best.clustermethod,Category,Subcategory) %>% summarise(amount=sum(Amount)) %>% rowwise %>% mutate(from=paste(!!best.clustermethod,Category,sep="_")) %>%  mutate(to=paste(!!best.clustermethod,Category,Subcategory,sep="_")) %>% select(from,to,amount) %>% as.matrix # cat to subcat & amt 
) # Create an edge data frame from the results of the clustering 
em2017.net %<>% as.data.frame()
em2017.net$amount %<>% as.character %>% as.numeric
# Remove periods from names bc they interfere with javascript variable names
em2017.net$to %<>% gsub("\\&"," ",.) %>% as.factor
em2017.net[,sapply(em2017.net,is.factor)] %<>% lapply(snakecase::to_parsed_case)
em2017.net$amount <- em2017.net$amount %>% abs %>% scale %>% percent_rank(.)*30
em2017.net$to[str_detect(em2017.net$to,"office\\/")] %<>% gsub("\\(\\w+\\/\\w+\\_\\w+\\)","",.) 
save(em2017.net,file="em2017.net.RData")

Google Sheets was used in initial runs to manipulate the data into the format necessary for the edge data frame. In later drafts the transformations were revised to be entirely reproducible in R (shown above).

Manually construct links in Googlesheets

Create custom colors for subcategories

A custom method for generating subcategory colors is in the chunk below. The function outline is as follows:

  1. Use purrr::map2 to iterate in parallel over the Category names and their hexadecimal color values. The edge data frame em2017.net is also supplied.
  2. Convert each hexadecimal value to HCL (Luv) coordinates using grDevices::col2rgb and convertColor.
  3. Use the colorspace package to create a sequence of colors that vary in luminance from the starting luminance up to a luminance value of 80.
  4. Output a nested list with category names as the first level of sublists, and the subcategories within each sublist paired with their respective hexadecimal color values.
  5. Plot each spread of colors for verification.
I have no idea why the animation of the color plots comes out with a pink tint.

Configure Colors for Categories

load(file = "em2017.net.RData")
CatCols <- c(Automobile = "#F3756D", Travel = "#EF66A3", Business = "#E28925", 
    Education = "#C39B2D", Entertainment = "#99AA3A", Food = "#4EB548", 
    Health_Care = "#2DB456", Household = "#18BCC3", Insurance = "#19B6E9", 
    Other = "#469BD4", Personal = "#9488C0", Tax = "#D56CAB", 
    Software = "#B77AB4", Home_Office = "#23B991")
# Utilities=colorspace::mixcolor(alpha=.5,colorspace::hex2RGB(CatCols['Personal']),colorspace::hex2RGB(CatCols['Travel']))
# %>% colorspace::hex() %>% as.character Utilities was not
# included in the original graph, so it is coerced by
# blending the colors of the categories that are
# alphabetically before and after it such that it blends
# nicely when these colors are used to modify the vertbar
# graph
SubCols <- purrr::map2(.x = names(CatCols), .y = CatCols, em.links = em2017.net, 
    .f = function(.x, .y, em.links) {
        nm <- .x
        
        rgb <- t(grDevices::col2rgb(.y))/255  # hex -> sRGB coordinates in [0, 1]
        coords <- grDevices::convertColor(rgb, "sRGB", "Luv")  # sRGB -> CIE Luv
        h <- atan2(coords[, "v"], coords[, "u"]) * 180/pi  # hue angle in degrees
        chroma <- sqrt(coords[, "u"]^2 + coords[, "v"]^2)  # chroma (distance from the grey axis)
        l <- coords[, "L"]  # starting luminance of the category color
        cat(nm)
        len <- em.links %>% filter(grepl(nm, from)) %>% .[["to"]] %>% 
            gsub("\\d\\_", "", .) %>% unique %>% length
        subcols <- colorspace::sequential_hcl(len, h = h, c. = chroma, 
            l = c(l, 80), power = 1, gamma = NULL, fixup = TRUE, 
            alpha = 1)
        # rsubcols <- sapply(subcols,grDevices::col2rgb) out <-
        # vector() nc <- ifelse(ncol(rsubcols)>0, ncol(rsubcols),
        # length(rsubcols)) for (i in 1:nc) { jsrgb <-
        # paste0(rsubcols[,i],collapse=',') out[i] <-
        # paste0('rgb(',jsrgb,')') %>% print }
        out <- subcols
        plot(x = 1:length(out), y = 1:length(out), pch = 19, 
            cex = 5, col = subcols, main = nm)
        names(out) <- em.links %>% filter(grepl(nm, from)) %>% 
            .[["to"]] %>% gsub("\\d\\_", "", .) %>% unique
        return(out)
    })
## Automobile
## Travel
## Business
## Education
## Entertainment
## Food
## Health_Care
## Household
## Insurance
## Other
## Personal
## Tax
## Software
## Home_Office

Timeseries

Cleaning data in preparation for the timeseries. This was used in the original timeseries on the first draft; later iterations use the Expense Manager data prepared in the prior sections of the full analysis.

Timeseries Tx Data prep

## [1] "Date"     "Desc"     "Category" "Amt"
## [1] "Date"     "Desc"     "Category" "Amt"
##        Date        Desc    Category         Amt        Acct         G.L 
##      "Date" "character"    "factor"   "numeric"    "factor" "character"

A visualization of the timeseries. Most transactions fall within a small range of amounts while a few outliers exist, so there is a trade-off between a scale on which the outliers are visible but all other points are overplotted, and a more granular scale on which the outliers are cut off. Ultimately, a log transformation was used in a subsequent chunk to better visualize the data.

Vis Timeseries 1st Draft

limits <- c(-1800, 1600)
bstn.exp <- Tx %>% filter(Date > lubridate::mdy("08-03-2017") & 
    Amt < 0 & Amt > -1000)
bstn.lm <- lm(Amt ~ Date, data = bstn.exp)
ts.colors <- RColorBrewer::brewer.pal(n = 11, name = "RdYlGn")
names(ts.colors)[c(1, 10, 11)] <- c("Expense", "Support", "Income")


(Timeseries <- ggplot(data = Tx, mapping = aes(x = Date, y = Amt)) + 
    geom_smooth(method = "lm", formula = Amt ~ Date, data = bstn.exp, 
        colour = ts.colors[11]) + geom_line(mapping = aes(color = G.L), 
    alpha = 1) + geom_vline(xintercept = lubridate::mdy("08-03-2017"), 
    color = ts.colors[2]) + geom_text(aes(x = lubridate::mdy("08-03-2017"), 
    label = "Relocation", y = 1000), colour = ts.colors[2], angle = 90, 
    vjust = 1.2, size = 4) + geom_vline(xintercept = lubridate::ymd("2017-09-15"), 
    color = ts.colors[2]) + geom_text(aes(x = lubridate::ymd("2017-09-15"), 
    label = "Tuition", y = 1000), colour = ts.colors[2], angle = 90, 
    vjust = 1.2, size = 4) + geom_point(data = Tx %>% filter(Category == 
    "Investment Income"), color = ts.colors[4], size = 1, shape = 23, 
    fill = ts.colors[11]) + scale_y_continuous(name = "Amount", 
    breaks = seq(limits[1], limits[2], by = 200), minor_breaks = seq(limits[1], 
        limits[2], by = 100), limits = limits) + scale_x_date(date_breaks = "1 month", 
    date_minor_breaks = "1 week", date_labels = "%m") + scale_color_manual(values = ts.colors[c(1, 
    10, 11)], name = "Gain/Loss") + scale_alpha(range = c(1, 
    1)) + labs(title = "2017 Income & Expenses", subtitle = "Investment savings as points on Expenses", 
    caption = "", x = "Month Number", y = "Amount ($)") + theme(plot.title = element_text(hjust = 0.5), 
    plot.subtitle = element_text(hjust = 0.5)) + theme_minimal() + 
    theme(legend.position = "bottom", legend.margin = margin(rep(0, 
        4), "pt"), plot.margin = margin(rep(0, 4), "pt")))

The data is log transformed, with the sign of each amount extracted beforehand and reapplied to the log values afterwards (since the log transformation does not work on negative values). Duplicates and NAs introduced by the transformation had to be removed accordingly.

Prepare Timeseries Data

em2017[rownames(em2017) %in% c(428:431), ]$Amount <- em2017[rownames(em2017) %in% 
    c(428:431), ]$Amount * -1  # Fix Electric to be an expense
emTS <- em2017 %>% left_join(em2017.clusters, by = c(Amount = "Amount", 
    Category = "Category", Subcategory = "Subcategory")) %>% 
    rename(Date = Date.x, Payment.Method = Payment.Method.y, 
        Cluster = !(!best.clustermethod)) %>% select(one_of(names(.)[c(1:5, 
    7, 8, 13)])) %>% mutate_at(.vars = "Cluster", .funs = function(x) {
    sapply(x, FUN = function(x) {
        if (is.na(x)) {
            x <- 6
        } else {
            x
        }
    })
}) %>% mutate_at(.vars = "Cluster", .funs = as.factor) %>% unique
# Join the best cluster to the original data (because date
# was lost in the cluster analysis) - Rename the columns to
# intuitive names & select only those that are necessary for
# the timeseries. Mutate the cluster column to make a new
# Miscellaneous cluster 6 for all txns that were not included
# in the cluster analysis, filter the result for unique
# values.
emTS[duplicated(emTS[, c("Date", "Amount", "Category", "Subcategory")]), 
    ] %>% nrow
# Above is the # of txns that were duplicated in the process
emTS <- emTS[not(duplicated(emTS[, c("Date", "Amount", "Category", 
    "Subcategory")])), ]  # Filter them
logT <- function(num) {
    # Log transform that works for negative values
    log(abs(num)) * sign(num)
    # if(sign(num)==-1){ lnum <- log(abs(num))*-1 }else{lnum <-
    # log(num)} return(lnum)
}
emTS %<>% mutate(LogT = logT(Amount))  # Log transform the data to account for outliers using the function above that reapplies the sign to the amount after the log transform.
emTS$LogT %>% sign %>% table
emTS$Amount %>% sign %>% table
save(emTS, file = "emTS.RData")

The second run of the timeseries visualization was included in the second-to-final draft but removed per peer feedback, as it did not add any information that could not already be derived from the other visualizations present.

Vis Timeseries

The log-transformed timeseries is visualized in the chunk below. Many aspects of the chart were set manually.

Vis Timeseries Log Transformed

Cost of Living

Cost of Living Comparison

Granular detail as to the differences in cost of living between Asheville and Boston is computed in the table below.

Cost of Living Comparison

##           Date         Amount       Category    Subcategory Payment.Method 
##          FALSE          FALSE          FALSE          FALSE          FALSE 
##    Payee.Payer        Account            Tag     Deductible 
##          FALSE          FALSE          FALSE          FALSE
##           Date         Amount       Category    Subcategory Payment.Method 
##         "Date"      "numeric"       "factor"       "factor"       "factor" 
##    Payee.Payer        Account            Tag     Deductible 
##       "factor"       "factor"       "factor"       "factor"
## $AVL
## $AVL$Data
## # A tibble: 12 x 2
##    Category      MonAvg
##    <fct>          <dbl>
##  1 Automobile      87.4
##  2 Business        37.5
##  3 Education      172. 
##  4 Entertainment   16.6
##  5 Food           282. 
##  6 Health Care    113. 
##  7 Home Office     40.9
##  8 Household      651. 
##  9 Other           33.8
## 10 Personal        69.3
## 11 Software        22  
## 12 Tax            104  
## 
## 
## $BSN
## $BSN$Data
## # A tibble: 12 x 2
##    Category      MonAvg
##    <fct>          <dbl>
##  1 Automobile      50.2
##  2 Business       173. 
##  3 Education     6908. 
##  4 Entertainment   22.2
##  5 Food           498. 
##  6 Health Care    141  
##  7 Household     1632. 
##  8 Insurance      120. 
##  9 Other          270  
## 10 Personal        66.1
## 11 Travel          57.2
## 12 Utilities       63.4

As per Christian's interest, the average difference in cost of living and in median income between the two cities is computed below.

## [1] 90.72727
## [1] 0.7083059

Original stacked bar chart comparison

The original stacked bar graph from the 1st draft remains in the chunk below.

Stacked Bar of Primary Expense Categories

Final draft facet wrap comparison

This is the facet wrap of the expense categories used in the final draft. It was chosen because the direct comparison of categories was found to be more effective at visually communicating the sizeable differences in cost of living between Boston and Asheville. The plot builds on previous efforts and took quite a bit of fine-tuning, hence the commented-out portions of the ggplot call.

Facet Wrap of Primary Expense Categories

#em2017.vertbar <- em2017.avl %>% filter(Amount < 0) %>% mutate(Mon=lubridate::month(Date)) %>% group_by(Mon,Category,Loc) %>% summarise(MonSum=abs(sum(Amount))) %>% ungroup %>% group_by(Loc,Category) %>% summarise(MonAvg=mean(MonSum))
em2017.vertbar <- col.df %>% filter(!Category %in% (c("Education","Software","Tax","Utilities","Travel"))) %>% mutate_at(.vars="Category",.funs=snakecase::to_parsed_case)
em2017.vertbar <- em2017.vertbar[order(em2017.vertbar$MonAvg.bsn, decreasing = F),] %>% subset(subset=!apply(.[,2:3],1,FUN=function(x)all(is.na(x))))
#em2017.vertbar %<>% rbind(list("Education",0,0,0)) %>% cbind(Education=c(rep(NA,(nrow(.)-1)),1913.33)) %>% mutate_at(.vars="Category",.funs=as.factor) 


em2017.sumexp <- sum(colSums(em2017.vertbar[,c("MonAvg.bsn","MonAvg.avl")]))
`+.uneval` <- function(a,b) {
    `class<-`(modifyList(a,b), "uneval")
} # A function necessary to use aes_string and aes in the same geom call
(Vertbar <- ggplot(data=em2017.vertbar[,c(1:4)], mapping=aes(fill=Category))+
geom_bar(stat="identity",
position="stack",aes_string(x=shQuote("Boston"), y="MonAvg.bsn"))+
    geom_bar(stat="identity",
position="stack",aes_string(x=shQuote("Asheville"), y="MonAvg.avl"))+
#     geom_bar(stat="identity",
# position="stack",aes_string(x=shQuote("Education"), y="Education"))+
#geom_text(size = 3, position = position_stack(vjust = 0.5),mapping = aes(label=paste(round(MonAvg)," | ",round(MonAvg/em2017.sumexp*100),"%",sep="")), hjust=.6)+
#   geom_label(position=position_stack(vjust = 1),hjust=0.5,
# mapping=aes(label=paste0(Category,"\n",round(MonAvg.bsn)," | ",round(MonAvg.bsn/sum(MonAvg.bsn)*100),"%"),lineheight = .7)+aes_string(x=shQuote("Boston"),y="MonAvg.bsn"),
# size=3,
# label.padding = unit(0, "lines"),
# label.size= unit(0, "lines"),
# label.r=unit(0, "lines"))+
#     geom_label(position=position_stack(vjust = 1),hjust=0.5,
# mapping=aes(label=paste0(Category,"\n",round(MonAvg.avl)," | ",round(MonAvg.avl/sum(MonAvg.avl)*100),"%"),lineheight = .7)+aes_string(x=shQuote("Asheville"),y="MonAvg.avl"),
# size=3,
# label.padding = unit(0, "lines"),
# label.size= unit(0, "lines"),
# label.r=unit(0, "lines"))+
  facet_wrap(~Category, ncol=5,scales="fixed")+
    geom_label(aes_string(x=shQuote("Asheville"))+aes(label=paste0(PctDiff,"%"),y=1000),size=7)+
scale_y_continuous(minor_breaks = seq(1900,2000,by=10),expand=c(0.009,0))+
scale_color_manual(values=CatCols)+
labs(title = "Cost of Living\nAsheville & Boston",
subtitle = "",
caption = "",
x = "Location",y = "Monthly Average ($)") +
theme(plot.title = element_text(hjust = .5),plot.subtitle = element_text(hjust = .5),legend.position = "none"))

As is evident from the chart above, the cost of education is astronomical compared to the cost of living in either location. Also notable are the significant differences in cost of living between Asheville (A) and Boston (B).

This was an idea for a visualization to include in the final draft. However, it was omitted because the comparison of cost of living at the category level provides more relevant comparisons for a viewer unfamiliar with the data and uninterested in the details of my expenses.

Appendix

Budget Projections

This section of the project was abandoned because of the relocation that took place this year. The relocation, and the consequent turbulence in expenses, make it impossible to produce accurate projections for 2018, when no move will be taking place. A model can certainly be built, but it would not have much predictive validity for future years.

Projections

Minimal Repro

A minimal reproducible example created for posting an issue on the caretEnsemble GitHub. This also encountered an error, likely because the formula argument was declared with formula instead of form.

Minimal Reproducible Example

# Minimal reproducible example made for reporting errors in
# the above code to zachmayer/caretEnsemble github
bc <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data")
nms <- readLines("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names")[110:126] %>% 
    str_match("^\\\t?[a-z1-2]\\)\\s?(\\w+\\s?\\w+)") %>% na.omit %>% 
    .[, 2]
names(bc) <- c(nms[1:2], nms[3:12] %>% paste0(".mean"), nms[3:12] %>% 
    paste0(".se"), nms[3:12] %>% paste0(".wst")) %>% gsub("\\s", 
    "\\.", .)
rownames(bc) <- bc[, 1]
bc <- bc[, -1]
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
getDoParWorkers()
data.train <- caret::createMultiFolds(bc$Diagnosis, times = 2)
data.train <- caret::trainControl(method = "repeatedcv", index = data.train, 
    number = 10, repeats = 1, search = "grid", allowParallel = T, 
    classProbs = T, savePredictions = "all", returnResamp = "all", 
    summaryFunction = caret::twoClassSummary)
f <- as.formula(paste0("Diagnosis", " ~ ."))
mod.list <- caretEnsemble::caretList(formula = f, data = bc, 
    trControl = data.train, methodList = c("svmRadial", "LogitBoost", 
        "adaboost", "C5.0"), metric = "ROC", tuneList = list(svmRadial = caretEnsemble::caretModelSpec(method = "svmRadial", 
        tuneGrid = tuneGrids$svmRadial), LogitBoost = caretEnsemble::caretModelSpec(method = "LogitBoost", 
        tuneGrid = tuneGrids$LogitBoost), adaboost = caretEnsemble::caretModelSpec(method = "adaboost", 
        tuneLength = 10), C5.0 = caretEnsemble::caretModelSpec(method = "C5.0", 
        tuneGrid = tuneGrids$C5.0)))
stopCluster(cl)
registerDoSEQ()

networkD3 JS node formatting

I was not able to get the custom node coloring via groups to work in networkD3, and thus abandoned this route for creating the network graph. Per the advice of Dr. Offenhuber, I used ggnet, which employs the familiar grammar-of-graphics syntax, to plot the clustered transaction network graph. I believe the reason this did not work with networkD3 is that the group feature has a function beyond aesthetic properties and also affects how nodes are tied into the graph; the labels used for the groups shown below are not represented in the edge data frame, and so the graph rendered blank.
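For reference, a minimal sketch (with hypothetical toy node and link frames) of how forceNetwork keys its node colouring on the Group column of the Nodes data frame via colourScale; every label used in the JS domain should correspond to a group value actually present in the data:

# Toy graph: the ordinal colour scale is keyed on nodes$group, so group labels
# missing from the domain (or domain labels absent from the data) render unexpectedly.
nodes <- data.frame(name = c("2017", "1", "1_Food", "1_Food_Groceries"), 
    group = c("root", "cluster", "Food", "Food"))
links <- data.frame(source = c(0, 1, 2), target = c(1, 2, 3), value = c(30, 20, 10))
networkD3::forceNetwork(Links = links, Nodes = nodes, Source = "source", 
    Target = "target", Value = "value", NodeID = "name", Group = "group", 
    opacity = 0.9, colourScale = networkD3::JS('d3.scaleOrdinal().domain(["root","cluster","Food"]).range(["white","grey","#4EB548"])'))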

Format names for networkD3

# ----------------------- Mon Apr 16 15:23:11 2018 ------------------------#
# https://community.rstudio.com/t/networkd3-customize-node-colors-based-on-the-existing-nodes/5594
# Fix the names to upper camel case will require revising em.links

paste0('"',names(property),'",',collapse="") %>% cat #for pasting into JS call
paste0('"',property,'",',collapse="") %>% cat
colJS <- networkD3::JS('d3.scaleOrdinal().domain(["2017",1","2","3","4","5","Automobile","Travel","Business","Education","Entertainment","Food","Health_Care","Household","Insurance","Other","Personal","Tax","Software","Home_Office","Utilities","Automobile_Fuel","Automobile_Parking","Automobile_Registration_tax","Automobile_Maintenance","Travel_Parking","Travel_Train","Travel_Car_Rental","Business_Application","Business_Home_Office","Business_Utilities","Business_Supplies","Business_Office_Expenses","Business_Wordpress_Plugins","Business_Insurance","Education_Tuition","Education_Article","Education_Books","Education_Supplies","Education_Transcripts","Education_Gre_Study","Entertainment_Podcast","Entertainment_Article","Entertainment_Concert","Entertainment_Movies","Entertainment_Books","Food_Bulk","Food_Groceries","Food_Restaurant","Food_Snack","Food_Tea","Food_Raw_Juice","Health_Care_Massage_Therapy","Health_Care_Health_Insurance","Health_Care_Medical","Health_Care_Meditation","Health_Care_Prescription","Health_Care_Genomics","Health_Care_Other","Health_Care_Yoga","Household_Appliance","Household_Consumables","Household_Home_Maintenance","Household_Miscellaneous_Household_Items","Household_Rent","Household_Household_Tools","Household_Other","Household_Gardening","Insurance_Auto","Insurance_Renter","Other_Cash_Back","Other_Other","Other_Charitable_Contributions","Personal_Clothing","Personal_Gift","Personal_Haircut","Personal_Personal_Care","Personal_Other","Personal_Donation","Tax_Other","Software_Productivity","Home_Office_Electronics","Home_Office_Computer","Home_Office_Office_Supply","Utilities_Gas"]).range(["white","red","green","blue","yellow","purple","#F3756D","#EF66A3","#E28925","#C39B2D","#99AA3A","#4EB548","#2DB456","#18BCC3","#19B6E9","#469BD4","#9488C0","#D56CAB","#B77AB4","#23B991","#C277B2","rgb(144,15,0)","rgb(192,69,59)","rgb(241,115,107)","rgb(255,161,153)","rgb(152,0,84)","rgb(219,81,144)","rgb(255,153,211)","rgb(126,49,0)","rgb(149,69,0)","rgb(173,89,0)","rgb(197,110,0)","rgb(221,132,26)","rgb(246,154,67)","rgb(255,177,97)","rgb(100,66,0)","rgb(126,89,0)","rgb(152,114,0)","rgb(179,139,0)","rgb(207,166,63)","rgb(235,193,99)","rgb(63,78,0)","rgb(92,108,0)","rgb(123,140,0)","rgb(156,173,63)","rgb(190,207,106)","rgb(0,89,0)","rgb(0,113,0)","rgb(2,139,0)","rgb(57,165,52)","rgb(90,192,86)","rgb(121,220,117)","rgb(0,91,0)","rgb(0,108,0)","rgb(0,126,0)","rgb(0,144,42)","rgb(0,163,67)","rgb(47,182,89)","rgb(76,202,110)","rgb(101,222,131)","rgb(0,90,96)","rgb(0,106,113)","rgb(0,123,130)","rgb(0,141,148)","rgb(0,159,166)","rgb(0,178,185)","rgb(46,197,204)","rgb(81,217,224)","rgb(0,87,138)","rgb(86,212,255)","rgb(0,80,138)","rgb(39,139,196)","rgb(133,206,255)","rgb(76,61,121)","rgb(99,86,143)","rgb(123,111,167)","rgb(149,137,193)","rgb(175,164,220)","rgb(203,191,247)","rgb(138,0,99)","rgb(113,41,110)","rgb(0,90,50)","rgb(0,152,113)","rgb(91,220,180)","rgb(121,33,106)"])')