I’ve been working as a freelance web & graphic designer, event coordinator, yoga teacher & studio manager, and administrative consultant for the past few years. This sole proprietorship was marginally profitable in 2016, which made filing taxes considerably more time-consuming and detail-intensive. To reduce that workload in future years, I’ve decided to use my financial exploration for tax reporting as the dataset for this project and to build classification models capable of labelling deductibles with a high degree of accuracy, requiring less manual labor on my part.
In the CRISP-DM overview below I outline the primary outcomes I hope to achieve with the project.
My objectives with this expense data exploration are to:
I currently track all expense data manually in a smartphone application called Expense Manager. Each transaction has the following fields, each denoted below as manually or automatically entered:
I have also imported transaction data from checking, savings, and PayPal accounts for some of the tax information and visualizations in the Data Visualization section.
The success criteria are:
Classification techniques used for determining deductions are:
Two datasets were used. The first included the features Category, Subcategory, and Payee/Payer. Cluster techniques used are:
The second used only Date & Amount. Cluster techniques used are:
With distance measures:
Clustering criteria from the ClusterCrit package were used to assess the best clusters via a voting algorithm.
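A minimal sketch of that voting idea, assuming the clusterCrit package (the objects here are stand-ins, not the project’s actual data):

set.seed(1)
library(clusterCrit)
m <- scale(matrix(rnorm(200), ncol = 2)) # stand-in for the scaled transaction features
parts <- lapply(2:6, function(k) kmeans(m, centers = k)$cluster) # candidate partitions
crits <- c("Silhouette", "Dunn", "Calinski_Harabasz")
vals <- sapply(parts, function(p) unlist(intCriteria(m, as.integer(p), crits)))
votes <- sapply(seq_along(crits), function(i) bestCriterion(vals[i, ], crits[i]))
table((2:6)[votes]) # each criterion votes; the cluster count with the most votes wins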
Trello is used to track and plan the project. GitKraken is used to track document changes. In the document, timestamped comments are used to formulate plans for coding each section and to track changes.
The raw data was exported from the Expense Manager app as a csv file. Various measures were taken to wrangle the data into formats conducive to the various analyses. These are documented inline in the body of the document. The transactions were manually coded as deductible or not in Google Sheets for ease. The deductibles were originally coded as T/F, became “1”/“2” after conversion to a factor, and were then recoded to “yes”/“no” for readability in model results output and confusion matrices (see the sketch below).
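A minimal sketch of that recoding, mirroring the fct_recode call used later in the Load Coded Data chunk (assumes magrittr’s pipes are loaded, as elsewhere in this document):

x <- factor(c("TRUE", "FALSE", "TRUE")) # the T/F coding arrives as a two-level factor
x %<>% as.character %>% forcats::fct_recode(no = "FALSE", yes = "TRUE")
levels(x) # "no" "yes" - readable in model output and confusion matrices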
I have received a 1099-MISC from all major contract clients to verify income. I have downloaded all transaction history from checking/credit accounts, and have exported all manually recorded transactions from the app that I use to track expenses. The data I generated is relatively high quality with few missing values. Because of my familiarity with the data, and with the ability to cross-reference against the checking and credit card transaction records, I was easily able to fill in missing values and correct mislabellings where discovered. Data generated for 2018 thus far provides an out-of-time test set for evaluating the performance of the models.
Factor levels in some of the features are relabelled for use in specific models where special characters caused errors. These instances are documented inline. Conversion of factor levels to numeric and scaling were necessary for the clustering algorithms (a minimal sketch follows).
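A minimal sketch of that conversion and scaling step, using a toy data frame (the project’s actual columns differ):

df <- data.frame(Amount = c(12.5, 99, 3.2),
    Category = factor(c("Business", "Food", "Business")))
df.num <- data.frame(lapply(df, as.numeric)) # factor levels become their integer codes
df.scaled <- scale(df.num) # center and scale each column for distance-based clustering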
The transaction histories from the bank accounts do not allow for the creation of additional categories for labelling specific transaction types, and the transaction descriptions themselves are cryptic and often difficult to interpret. Thus, I will use only the manually tracked expenses, because that data has more legible features for labelling each transaction.
Data acquisition: Described above
Data integration and formatting: Described above
Data cleaning: Described above
Data transformation and enrichment: Derived attributes such as categorical sums were necessary for much of the manual calculation of deductible expense categories in the tax filing process (an example follows). See Self-Employment Income.
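For example, the categorical sums take the same form as this summary from the Deductible Expenses chunks later in the document:

em17.ded <- em17.tr %>% filter(Deductible == "yes") %>% group_by(Category,
    Subcategory) %>% summarize(Total = sum(Amount)) # per-category deductible totals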
I initially selected Naive Bayes, Radial SVM, and LogitBoost (see Exploratory 1st Run for details). LogitBoost had the best performance, followed by the Radial SVM and finally Naive Bayes. To replace Naive Bayes, C5.0 was added to the ensemble. C5.0 performed exceptionally well for reasons discussed in depth in the Add C5.0 section. In summary, the Category and Subcategory factors are the primary means of identifying the deductibles, so the bias decision trees give to factors with many classes actually works to C5.0’s advantage in labelling the deductibles.
In light of the performance of C5.0, another boosted decision tree, adaboost, was added to the final mix of models used for the ensemble.
The caret package’s createDataPartition function works especially well at partitioning data for training and validation based on a percentage split stratified by the factor levels of the response variable. An out-of-time test set with transaction data from 2018 was used for the predictions and final evaluation from which the results herein are derived. The construction of the training set was multiphasic: transformation in R, porting to Google Sheets for cleaning and deductible labeling, and then loading back into R for further cleaning and matching of levels with the test set. This process can be found in part in the Data Loading and Cleaning section. The construction of the test set can be found here.
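A minimal sketch of the stratified split (parameters illustrative; the actual calls appear in the training chunks below):

library(caret)
idx <- createDataPartition(em17.tr$Deductible, p = 0.9, list = FALSE) # stratified on the response
train <- em17.tr[idx, ] # ~90% of rows, preserving the yes/no proportion
holdout <- em17.tr[-idx, ] # remainder held out for validation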
I used caret’s train on all initial models to find tuning parameters for each model type. I then used a manual iterative approach with the various Naive Bayes kernels to attempt to fine-tune the model. The cosine kernel performed best among the kernels, but was still significantly worse than the Radial SVM, so Naive Bayes was dropped from future models. C5.0 and adaboost were added for the final ensemble, as discussed above.
A custom function called tuneGrids was developed to refine the tuning grids from each train result across repeated iterations of model building with train. The performance of the first run can be found here. Models were then built individually with their respective packages, using the best tuning parameters from caret plus additional parameters not available in the caret implementation. See Manual Models: Refinements for the code and performance measures from the manual refinement process.
The C5.0 algorithm outperformed even the ensemble models built on a generalized linear model of class probabilities. The ensemble model may be more effective for a dataset with a high degree of variability in the factors used, as C5.0 may overfit such a dataset. For this particular classification task, however, C5.0 was the perfect fit. See the performance for all models included in the first run. The performance of the set of models included in the ensemble, and of the ensemble using the 1st-resample probabilities and full-resample probabilities, can be found in Ensemble Model Performance.
The C5.0 decision tree will be used for 2018 deductible classification due to its rapid training and 99% AUC.
Here it is!
The work-up used herein will be sufficient for loading transaction data from 2018 and classifying deductibles with far less effort than was necessary in 2017. I would be amenable to developing a Shiny deployment of the C5.0 algorithm for classifying deductibles if there were sufficient interest from the developers of the Expense Manager app. For it to be useful to other individuals using the app, they would need to classify their deductibles consistently such that C5.0 can effectively label their data.
Dates are converted, the duplicated title row is removed, and the data is subsetted to include only transactions from 2017. Amount is converted to numeric values.
# NOT RUN deprecated
em <- readr::read_csv("C:\\Users\\Stephen\\Documents\\Finance\\Taxes\\2017\\2018-03-31.expensemanager.csv")
em$Date %<>% lubridate::ymd() # Convert date strings to Date
em <- em[-1, ] # Remove the duplicated title row
em$Amount %<>% as.numeric() # Convert amounts to numeric before subsetting so em2017 inherits it
em2017 <- em[em$Date >= lubridate::ymd("2017-01-01") & em$Date <
    lubridate::ymd("2018-01-01"), ] # Extract only data for 2017 (inclusive of Jan 1)
This function reduces the date to a decimal representing the elapsed fraction of the year (e.g., June 30th would be about .5). It subtracts the year part of the date so that the decimal dates in the training set for 2017 and the out-of-time test set for 2018 are directly comparable.
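The dd function itself is defined in an unechoed chunk; a minimal sketch of the behavior described above, assuming lubridate:

dd <- function(x) {
    lubridate::decimal_date(x) - lubridate::year(x) # fraction of the year elapsed
}
dd(lubridate::ymd("2017-07-02")) # ~0.5, matching dd(lubridate::ymd("2018-07-02"))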
Create test and train data
Load Txn Data
Transaction data taken directly from bank records is read into a variable. Columns with no data are removed and more intuitive names are added. Strange characters are cleaned and columns are given appropriate classes.
# ----------------------- Fri Mar 23 18:42:21 2018
# ------------------------# USAA
usaa <- readr::read_csv("C:\\Users\\Stephen\\Documents\\Finance\\Taxes\\2017\\USAA\\USAAChecking2017.csv",
col_names = F) # Read Transaction data from Checking acct
usaa <- usaa[, -c(1, 2, 4)] # Remove useless columns
names(usaa) <- c("Date", "Desc", "Category", "Amt") #Rename columns to intuitive names
usaa$Amt %<>% gsub("\\-\\-", "", .) # Remove the double negative signs for positive values
usaa$Amt %<>% as.numeric() # Convert to numeric
usaa.visa <- readr::read_csv("C:\\Users\\Stephen\\Documents\\Finance\\Taxes\\2017\\USAA\\USAAVisa.csv",
col_names = F) # Read transaction data from Visa
usaa.visa <- usaa.visa[, -c(1, 2, 4)] # Remove Useless columns
names(usaa.visa) <- c("Date", "Desc", "Category", "Amt") # Add intuitive names
usaa.visa$Amt %<>% gsub("\\-\\-", "", .) # Change double neg to pos
usaa.visa$Amt %<>% as.numeric() # Convert to numeric
# visa.deductible <- edit(usaa.visa %>% group_by(Desc) %>%
# summarise(TotalAmt=sum(Amt)))
# ----------------------- Fri Mar 23 21:40:36 2018
# ------------------------# Deductible Expenses Reference
# from Intuit
htm <- xml2::read_html("https://quickbooks.intuit.com/r/professional/complete-list-of-self-employed-expenses-and-tax-deductions/")
Deductibles <- htm %>% rvest::html_node(xpath = "//*[@id='main']/div/div[3]/div/div[1]/div[1]/div/div/div/div/div/div/table[1]") %>%
rvest::html_table(header = T)
Deductibles %<>% rbind(htm %>% rvest::html_node(xpath = "//*[@id='main']/div/div[3]/div/div[1]/div[1]/div/div/div/div/div/div/table[2]") %>%
rvest::html_table(header = T))
Deductibles$Amount <- rep(NA, nrow(Deductibles))
# ----------------------- Fri Mar 30 21:12:36 2018
# ------------------------# Import shared expenses from
# Google Sheets googlesheets::gs_auth() lsbalance <-
# googlesheets::gs_url('https://docs.google.com/spreadsheets/d/1e_iiwJ6HEXXEbpjEXj1FKohs6tWsXw2r9nurrZSxZmQ/edit#gid=504516022')
# shared.expenses <-
# googlesheets::gs_read(lsbalance,ws=2,range='A1:E192')
Add Shared Expenses to Expense Manager Data
shared.expenses$Date %<>% lubridate::ymd()
names(shared.expenses)[4:5]
shared.expenses[4:5] %<>% lapply(function(x) gsub("\\$", "",
x)) %>% lapply(as.numeric)
shared.expenses <- edit(shared.expenses)
shared.expenses$Category[is.na(shared.expenses$Category)] <- "Food"
shared.expenses$Subcategory[is.na(shared.expenses$Subcategory)] <- "Groceries"
shared.expenses$`Payment Method` <- "Lia"
names(shared.expenses) <- c("Date", "Description", "Qty", "Cost",
"Amount", "Category", "Subcategory", "Payment Method")
shared.expenses %>% write.csv("shared.expenses.csv")
# em2017 <- plyr::rbind.fill(em2017, shared.expenses[, c('Date', 'Description',
#     'Amount', 'Category', 'Subcategory', 'Payment Method')])
# em2017$Date %<>% lubridate::ymd() # Convert dates to Date
The following section is used for manual calculation of deductible categories for tax filing purposes. Inline documentation of Finance for Freelancers continues in ML Models for Deduction Classification. All chunks in the following section are set to eval=F and some lines are not echoed for privacy.
For form 1040
Tax Exemption from certain types of Federal Obligations for the state of MA (Valueline.MA) - TIR 89.8
Ordinary Dividends
Long Term Gains and Losses
(lapply(CGains, FUN = `[`, "Amount") %>% unlist) - (lapply(CGains,
FUN = `[`, "CB") %>% unlist)
Cgain17 <- lapply(CGains, FUN = `[`, "Amount") %>% unlist %>%
sum - lapply(CGains, FUN = `[`, "CB") %>% unlist %>% sum
Capital Loss Carryover Worksheet
1040SD - cont
Qualified Dividends and Capital Gain Tax Worksheet
# ----------------------- Sat Apr 14 23:14:03 2018
# ------------------------# Qualified Dividends and Capital
# Gain Tax Worksheet—Line 28
CGTWs <- c("1. Enter the amount from Form 1040, line 43. However, if you are filing Form 2555 or 2555-EZ (relating to foreign earned income), enter the amount from line 3 of the Foreign Earned Income Tax Worksheet . . . . . . . . . . . . . . . . . . . . . 1.",
"2. Enter the amount from Form 1040, line 9b* . . . . . . . 2.",
"3. Are you filing Schedule D?* Yes. Enter the smaller of line 15 or 16 of Schedule D. If either line 15 or 16 is blank or a loss, enter -0-. 3. No. Enter the amount from Form 1040 line 13.",
"4. Add lines 2 and 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.",
"5. If filing Form 4952 (used to figure investment interest expense deduction), enter any amount from line 4g of that form. Otherwise, enter -0- . . . . . . . . . . 5.",
"6. Subtract line 5 from line 4. If zero or less, enter -0- . . . . . . . . . . . . . . . . . . . . . . 6.",
"7. Subtract line 6 from line 1. If zero or less, enter -0- . . . . . . . . . . . . . . . . . . . . . . 7.",
"8. Enter: $37,950 if single or married filing separately $75,900 if married filing jointly or qualifying widow(er) $50,800 if head of household. . . . . . . . . . . . . . 8.",
"9. Enter the smaller of line 1 or line 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.",
"10. Enter the smaller of line 7 or line 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10.",
"11. Subtract line 10 from line 9. This amount is taxed at 0% . . . . . . . . . . . . . . . . . .11.",
"12. Enter the smaller of line 1 or line 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12.",
"13. Enter the amount from line 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13.",
"14. Subtract line 13 from line 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14.",
"15. Enter: $418,400 if single $235,350 if married filing separately $470,700 if married filing jointly or qualifying widow(er) $444,550 if head of household. . . . . . . . . . . . . . 15.",
"16. Enter the smaller of line 1 or line 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16.",
"17. Add lines 7 and 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17.",
"18. Subtract line 17 from line 16. If zero or less, enter -0- . . . . . . . . . . . . . . . . . . . .18.",
"19. Enter the smaller of line 14 or line 18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .19.",
"20. Multiply line 19 by 15% (0.15) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20.",
"21. Add lines 11 and 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .21.",
"22. Subtract line 21 from line 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .22.",
"23. Multiply line 22 by 20% (0.20) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23.",
"24. Figure the tax on the amount on line 7. If the amount on line 7 is less than $100,000, use the Tax",
"Table to figure the tax. If the amount on line 7 is $100,000 or more, use the Tax Computation Worksheet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24.",
"25. Add lines 20, 23, and 24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25.",
"26. Figure the tax on the amount on line 1. If the amount on line 1 is less than $100,000, use the Tax Table to figure the tax. If the amount on line 1 is $100,000 or more, use the Tax Computation Worksheet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26.",
"27. Tax on all taxable income. Enter the smaller of line 25 or 26. Also include this amount on Form 1040, line 44. If you are filing Form 2555 or 2555-EZ, don’t enter this amount on Form 1040 line 44. Instead, enter it on line 4 of the Foreign Earned Income Tax Worksheet . . . . . . . . . . . . . . 27. *If you are filing Form 2555 or 2555-EZ, see the footnote in the Foreign Earned Income Tax Worksheet before completing")
CGTWs %<>% gsub("\\.*\\s+[1-9.]+$", "", .) %>% gsub("(?:[.]\\s[.])|\\s{2,}",
"", .) %>% gsub("\\s{2,}", "", .)
Add to Gsheets
Reload EM Data with handpicked Deductions
googlesheets::gs_auth()
em2017 <- googlesheets::gs_read(emgs, ws = 1, range = "A1:J500")
em2017 %>% write.csv("em2017.csv")
Load Coded Data
em2017 <- read.csv("em2017.csv")
rownames(em2017) <- em2017[, 1]
em2017 <- em2017[, -1]
em2017$Deductible %<>% as.character %>% forcats::fct_recode(no = "FALSE",
yes = "TRUE")
em2017$Date %<>% lubridate::mdy()
em17.tr <- em2017 %>% filter(Category != "Income") %>% dplyr::select(-Description)
em17.tr %>% sapply(class)
## Date Amount Category Subcategory Payment.Method
## "Date" "numeric" "factor" "factor" "factor"
## Payee.Payer Account Tag Deductible
## "factor" "factor" "factor" "factor"
Filter Income
inc2017 <- em2017 %>% filter(Category == "Income") %>% group_by(Payee.Payer,
Account) %>% summarise(Total = sum(Amount))
Self-Employment Income
For form 1040SC & 1040SE
Combine and Replace NA
## Date Amount Category Subcategory Payment.Method
## FALSE FALSE FALSE FALSE FALSE
## Payee.Payer Account Tag Deductible
## FALSE FALSE TRUE FALSE
em17.tr[, emna] %<>% lapply(renamena) # Replace NA placeholders in the affected columns (emna and renamena are defined in an unechoed chunk)
em17.tr[, sapply(em17.tr, is.character)] %<>% lapply(as.factor) # Re-class character columns as factors
em17.tr %>% lapply(function(x) any(is.na(x))) %>% unlist # Confirm no NAs remain
## Date Amount Category Subcategory Payment.Method
## FALSE FALSE FALSE FALSE FALSE
## Payee.Payer Account Tag Deductible
## FALSE FALSE FALSE FALSE
Form 8917 Educational Deductions vs 8863 Education Tax Credits
8863 Instructions
Deductible Expenses - Education
em17.ded <- em17.tr %>% filter(Deductible == "yes") %>% group_by(Category,
Subcategory) %>% summarize(Total = sum(Amount))
# ----------------------- Fri Apr 13 14:45:28 2018
# ------------------------# Education
em17.ded[em17.ded$Subcategory %in% "Tuition", ]$Total <- (-22960 +
(em17.tr %>% filter(Payee.Payer == "AB Tech" & Category ==
"Education") %>% .$Amount)) # 1098-T Northeastern + AB Tech pre-requisites
em17.ded %>% filter(Category == "Education") %>% summarize(Sum = sum(Total)) # Total Education expenses for FY2017 for Form 8917
# https://www.eitc.irs.gov/other-refundable-credits-toolkit/compare-education-credits-and-tuition-and-fees-deduction/compare
Deductible Expenses - Car and Truck
Deductible Expenses - Contract Labor
Deductible Expenses - Business Insurance
Deductible Expenses - Office Expense
em17.tr <- em17.tr[!(str_detect(em17.tr$Payee.Payer, "Newegg") &
    str_detect(em17.tr$Account, "Personal")), ] # Remove newegg entry that got cancelled
em17.tr %>% filter(Category == "Business" & !Subcategory %in%
c("Utilities", "Insurance") & Account != "Heartwood" & Deductible ==
"yes") %>% group_by(Subcategory) %>% summarize(Total = sum(Amount))
# bus.exp <- Deductibles[str_detect(Deductibles$Type,'Office
# Expense'),]$Amount <- bus.exp[bus.exp$Subcategory %in%
# c('Application','Home Office','Office Expenses','Shipping &
# Handling','Supplies','Wordpress Plugins'),] %>% .$Total %>%
# sum
Deductible Expenses - Other Expenses
Deductibles[str_detect(Deductibles$Type, "Other"), ]$Amount <- em17.tr %>%
filter(Payee.Payer == "Plus for Trello") %>% .$Amount # Should have been in Office expense, but rather than recalculate, it's included here.
Percent of Residence used for Home Office for following Calculations
Deductible Expenses - Home Office Maintenance
Deductible Expenses - Utilities
Deductible Expenses - Supplies
Deductible Expenses - Home Office
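The percentage itself (pHO) is computed in a chunk not echoed for privacy; a hypothetical illustration of the quantity the chunk below relies on:

pHO <- 120/800 # hypothetical: a 120 sq ft office in an 800 sq ft residence = 15% of the home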
DeductibleExpenses <- vector()
names(Rent) <- month.abb
HO.de <- pHO * (sum(Rent) + em17.tr %>% filter(Subcategory ==
"Renter") %>% .$Amount)
em17.tr %>% filter(Subcategory == "Renter") %>% .$Amount # Home Office Insurance
net <- sum(Income[1:3] %>% unlist) + (Deductibles$Amount %>%
sum) + HO.de # Line 31
Deductible Expenses - Charitable Giving
em %>% filter(Tag %in% c("Non-Profit 501(c)3", "Light a Path") |
Subcategory == "Charitable Contributions") %>% filter(Date >
lubridate::mdy("01/01/2017") & Date < lubridate::mdy("01/01/2018")) %>%
group_by(Payee.Payer) %>% summarise(Total = sum(Amount))
em %>% filter(Tag %in% c("Non-Profit 501(c)3", "Light a Path") |
Subcategory == "Charitable Contributions") %>% filter(Date >
lubridate::mdy("01/01/2016") & Date < lubridate::mdy("01/01/2017")) %>%
group_by(Payee.Payer) %>% summarise(Total = sum(Amount))
Deductible Expenses - Educational Supplies
Deductible Expenses - Registration Fees
Deductible Expenses - Totals
For form 1040SE
1040SE
net * 0.9235 * 0.153 # Line 5: 92.35% of net SE earnings, taxed at the 15.3% SE rate
SE.de <- net * 0.9235 * 0.153 * 0.5 # Line 6: deduction for 50% of the SE tax
1040
# Form 8917 Line 4 (Sum of 1040 Line 23-33)
HI.de <- c(NEBCBS = 2159, AVLIFM = em17.tr %>% filter(Subcategory ==
"Health Insurance") %>% summarise(Total = sum(Amount)) %>%
.$Total %>% abs) #Health insurance deductible for NE
SE.de + sum(HI.de) # F8917 Line 4
Income$`1040Total` - (SE.de + sum(HI.de)) # 8917 Line 5
(SE.de + sum(HI.de) + 4000) # 1040 Line 36
Income$`1040Total` - (SE.de + sum(HI.de) + 4000) # 1040 Line 41
Income$`1040Total` - (SE.de + sum(HI.de) + 4000) - 6350 # minus the $6,350 standard deduction, 1040 Line 41
This section includes the initial loading and cleaning of data from the data understanding, data prep, and modeling phases prior to the project proposal. Many of the chunks from this preliminary testing are kept to show the discovery process, but have been set to ## NOT RUN because the process was quite error-prone. Interwoven are refinements made upon revision of these phases, used to generate new models with improved code. Inline documentation appears throughout.
Load data
## NOT RUN - used for manual cleaning prior to labelling
em2018 <- read.csv("~/Northeastern/Git/da5030/Project/em2018.csv",
sep = ",", skip = 1)
em2018$Date %<>% lubridate::ymd()
em2018 %<>% filter(Date < lubridate::now())
em18.te <- em2018[, names(em17.tr)[names(em17.tr) %in% names(em2018)]]
em18.te$Deductible <- rep(F, nrow(em18.te))
Loading Labelled Data from GS
## NOT RUN ----------------------- Thu Apr 12 14:57:12 2018
## ------------------------# Upload to Google sheets to
## manually label the deductibles.
library(googlesheets)
gs_ws_new(emgs, ws_title = "2018")
gs_ws_ls(emgs)
# Refresh the object
emgs <- gs_key(emgs[["sheet_key"]])
gs_edit_cells(emgs, ws = "2018", input = em18.te)
em18.te <- googlesheets::gs_read(emgs, ws = "2018")
detach("package:googlesheets", unload = T)
em18.te %>% sapply(class)
em18.te %>% write.csv(file = "em18.te.csv")
Clean OOT Test data
## X1 Date Amount Category Subcategory
## "integer" "Date" "numeric" "character" "character"
## Payment.Method Payee.Payer Account Tag Deductible
## "character" "character" "character" "character" "logical"
em18.te <- em18.te[, -1]
em18.te$Date %<>% sapply(dd)
em18.te[, sapply(em18.te, is.character)] %<>% lapply(as.factor)
em18.te %<>% as.data.frame
em18.te$Deductible %<>% as.character %>% forcats::fct_recode(no = "FALSE",
yes = "TRUE")
em18.te %>% lapply(function(x) {
any(is.na(x))
}) %>% unlist
## Date Amount Category Subcategory Payment.Method
## FALSE FALSE FALSE FALSE FALSE
## Payee.Payer Account Tag Deductible
## FALSE FALSE FALSE FALSE
# ----------------------- Thu Apr 12 16:03:04 2018
# ------------------------# Factors need to be releveled to
# exclude new factor levels.
Fn for Matching Levels
# Union the factor levels of each shared column across two data frames so that
# models trained on one can generate predictions on the other.
matchLevels <- function(newdata, olddata, verbose = F) {
mL <- function(l, nd, od, verbose = verbose) {
nm <- names(newdata)[l]
if (verbose == T) {
print(nm)
}
if (is.factor(nd[[nm]])) {
out <- factor(nd[[nm]], levels = levels(as.factor(c(levels(od[[nm]]),
levels(nd[[nm]])))))
} else {
out <- nd[[nm]]
}
return(out)
}
newdata <- lapply(seq_along(newdata), nd = newdata, od = olddata,
verbose = verbose, FUN = mL) %>% as.data.frame()
if (verbose == T) {
names(newdata) %>% print
}
names(newdata) <- names(olddata)
olddata <- lapply(seq_along(olddata), nd = olddata, od = newdata,
verbose = verbose, FUN = mL) %>% as.data.frame()
if (verbose == T) {
olddata %>% names %>% print
identical(olddata %>% names %>% length, newdata %>% names %>%
length) %>% print
}
names(olddata) <- names(newdata)
return(list(newdata = newdata, olddata = olddata))
}
Apply the function to the training and test sets
if (lubridate::is.Date(em17.tr$Date)) em17.tr$Date %<>% sapply(dd) # Convert date to decimal on train
matchedlevels <- matchLevels(newdata = em18.te, olddata = em17.tr)
lapply(matchedlevels, findna) %>% unlist
## NULL
# Looks like it worked without making an NA values. Let's see
# if the levels match
matchedlevels.levels <- lapply(matchedlevels, function(l) {
sapply(l, function(x) {
if (is.factor(x)) {
levels(x)
}
}, simplify = T)
})
identical(matchedlevels.levels[[1]], matchedlevels.levels[[2]])
## [1] TRUE
This chunk contains the original training phase that was run prior to the project proposal. Many errors were encountered due to incongruent levels in the test & training sets, and due to providing the ‘finalModel’ to predict functions instead of the train object. I managed to get through creating the first three models to ensure the project was feasible via a fairly haphazard method, as can be seen in the code below. The matchLevels function above was written upon revising this section in preparation for using caretEnsemble to create ensemble models, and solves some of the issues in this chunk more elegantly. The models were saved because of the long training time, and the accuracy of each is reported herein.
Train Models on 2017 Coded Data
# NOT RUN - from first run prior to project proposal
library(caret)
em2017.train <- createDataPartition(em17.tr$Deductible, times = 2,
p = 0.9, list = T)
em2017.train <- trainControl(method = "repeatedcv", index = em2017.train,
number = 2, repeats = 1, search = "grid", verboseIter = T,
allowParallel = F)
# kernlab::sigest(Deductible ~ .,data=em17.tr) em.2017.mod <-
# train(Deductible ~ . , data=em17.tr, method='svmRadial',
# na.action='na.pass',
# metric='Accuracy',tuneLength=10,trCtrl=em2017.train)
# ----------------------- Sat Mar 31 20:12:08 2018
# ------------------------# Throwing error
# `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more
# levels
# em.lvls <-
# purrr::map2(.x=sapply(em17.tr[,sapply(em17.tr,is.factor)],levels),.y=sapply(em.test,levels),
# .f=function(.x,.y){ c(.x,.y) %>% unique })
# em.lvls[['Date']] <- NULL em.lvls[['Amount']] <- NULL
# em.fac <- names(em.lvls) for(v in seq_along(em.fac)){
# em17.tr[[em.fac[v]]] %<>% as.factor em.test[[em.fac[v]]]
# %<>% as.factor levels(em17.tr[[em.fac[v]]]) <-
# em.lvls[[em.fac[v]]] levels(em.test[[em.fac[v]]]) <-
# em.lvls[[em.fac[v]]] }
#-------- Does not properly merge factors ------#
# ----------------------- Sat Mar 31 21:51:34 2018
# ------------------------# Merge test and train data and
# recompute factors for consistency in factor levels
# ----------------------- Sat Mar 31 20:51:18 2018
# ------------------------# LEt's see if this resolved it
em.2017.mod <- train(Deductible ~ ., data = em17.tr, method = "svmRadial",
na.action = "na.pass", metric = "Accuracy", tuneLength = 10,
trCtrl = em2017.train)
# Works ----------------------- Sat Mar 31 20:56:12 2018
# ------------------------# Let's try some additional methods
library(doParallel)
# make a cluster using all but one of the available logical cores
cl <- makeCluster(detectCores() - 1)
# register the parallel workers
registerDoParallel(cl)
getDoParWorkers()
# insert parallel calculations here
em.2017.svmRad <- train(Deductible ~ ., data = em17.tr, method = "svmRadial",
na.action = "na.pass", metric = "Accuracy", tuneLength = 10,
trCtrl = em2017.train)
# ----------------------- Sat Mar 31 20:53:10 2018
# ------------------------# A boosted logistic regression
em.2017.LB <- train(Deductible ~ ., data = em17.tr, method = "LogitBoost",
na.action = "na.pass", metric = "Accuracy", tuneLength = 10,
trCtrl = em2017.train)
# ----------------------- Sat Mar 31 20:53:55 2018
# ------------------------# And Naive Bayes
em.2017.nB <- caret::train(Deductible ~ ., data = em17.tr, method = "nb",
na.action = "na.pass", metric = "Accuracy", tuneGrid = expand.grid(fL = c(0,
1), usekernel = T, adjust = c(0.1, 0.5, 1)), trCtrl = em2017.train)
# stop the cluster and remove Rscript.exe childs (WIN)
stopCluster(cl)
registerDoSEQ()
em.2017.mods <- list(SVM = em.2017.svmRad, LB = em.2017.LB, NB = em.2017.nB)
save(em.2017.mods, file = "em2017modes.RData")
Eval Models
load(file = "em2017modes.RData")
names(em.2017.mods) <- c("svmRadial", "LogitBoost", "nb")
lapply(names(em.2017.mods), mods = em.2017.mods, FUN = function(x,
mods) {
purrr::pluck(mods[[x]], list("results", "Accuracy")) %>%
max %>% round(., 4) %>% paste0(x, ": ", .) %>% print()
plot(mods[[x]])
})
## [1] "svmRadial: 0.8904"
## [1] "LogitBoost: 0.9382"
## [1] "nb: 0.7528"
## [[1]]
##
## [[2]]
##
## [[3]]
It appears that the best model is the boosted logistic regression. This makes sense given that the task is binary classification, namely whether an entry is deductible or not. The boosting additionally adds predictive power, making it a bit more accurate than the radial SVM.
Predictions
## NOT RUN ----------------------- Sun Apr 01 08:14:15 2018
## ------------------------# BUG: Unfortunately, it appears
## that the predict functions all encounter errors, even when
## the training data itself is used. -----------------------
## Thu Apr 12 15:19:57 2018 ------------------------# FIX:
## This has since been resolved as the main model is meant to
## be used - not the final model. See section entitled 'Test
## on 2018 Out-of-Time test set''
# ----------------------- Sat Mar 31 16:10:01 2018
# ------------------------# Test on all data
em.pred <- predict(em.2017.svmRad, newdata = em17.tr[, names(em.test) !=
"Deductible"], type = "response")
# Error: 'Error in .local(object', '...) : test vector does
# not match model !' TODO(Troubleshoot 2018-03-31 2300)
identical(sapply(em17.tr, levels), sapply(em.test, levels))
# ----------------------- Sun Apr 01 06:36:05 2018
# ------------------------# The same error when the training
# data itself is used. There must be a problem in the code.
# It appears that there is no remedy for this issue. Skipping
# the ksvm model
em.pred <- predict(em.2017.nB, newdata = em17.tr[, names(em.test) !=
"Deductible"])
# ----------------------- Sun Apr 01 06:47:57 2018
# ------------------------# Also doesn't work... trying
# LogitBoost
library(caTools)
em.pred <- predict(em.2017.LB, xtest = model.frame(Deductible ~
Date + Amount + Category + Subcategory + Payment.Method +
Payee.Payer + Account + Tag, data = em17.tr))
To account for what may be errors/incompatibilities between the caret final models and the predict functions, we will manually construct the models based on the tuning parameters provided by the caret training.
ksvm, manual
best.Tunes <- lapply(em.2017.mods, FUN = function(x) {
purrr::pluck(x, list("bestTune"))
})
em.svm <- kernlab::ksvm(Deductible ~ ., data = em17.tr, kpar = list(sigma = best.Tunes[[1]]$sigma),
C = best.Tunes[[1]]$C, kernel = "rbfdot")
library(kernlab)
# This appears to work, to determine the accuracy, I'll have
# to manually check the data. em.test <- edit(em.test)
# #Manually coding deductible categories
caret::confusionMatrix(predict(em.svm, newdata = em18.te), em18.te$Deductible)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 64 6
## yes 3 20
##
## Accuracy : 0.9032
## 95% CI : (0.8242, 0.9548)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.00001483
##
## Kappa : 0.751
## Mcnemar's Test P-Value : 0.505
##
## Sensitivity : 0.9552
## Specificity : 0.7692
## Pos Pred Value : 0.9143
## Neg Pred Value : 0.8696
## Prevalence : 0.7204
## Detection Rate : 0.6882
## Detection Prevalence : 0.7527
## Balanced Accuracy : 0.8622
##
## 'Positive' Class : no
##
A ~90% accuracy was achieved with the test data. The majority of incorrectly labelled deductions were accounted for by the addition of a single recurring payment associated with a Category/Subcategory combination that was not associated with a deduction in the test data. As the dataset grows over the years, and enough categories are created that they can remain static, the accuracy of the algorithm will likely improve.
logitboost, manual
library(caTools)
em.LB <- caTools::LogitBoost(xlearn = em17.tr[, names(em17.tr) !=
"Deductible"], ylearn = em17.tr$Deductible, nIter = best.Tunes[[2]]$nIter)
# BUG: Error in if (MinLS > vLS1) { : argument is of length
# zero
detach("package:caTools")
naive bayes, manual
library(klaR)
em.nB <- klaR::NaiveBayes(Deductible ~ ., data = em17.tr, usekernel = T,
fL = best.Tunes[[3]]$fL, kernel = "gaussian", adjust = best.Tunes[[3]]$adjust)
caret::confusionMatrix(predict(em.nB, newdata = em18.te, type = "response")$class,
em18.te$Deductible)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 63 5
## yes 4 21
##
## Accuracy : 0.9032
## 95% CI : (0.8242, 0.9548)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.00001483
##
## Kappa : 0.7569
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9403
## Specificity : 0.8077
## Pos Pred Value : 0.9265
## Neg Pred Value : 0.8400
## Prevalence : 0.7204
## Detection Rate : 0.6774
## Detection Prevalence : 0.7312
## Balanced Accuracy : 0.8740
##
## 'Positive' Class : no
##
It appears that the Naive Bayes model is not as well-suited to predicting the deductibles, at least with this set of tuning parameters. Perhaps it can improve with the Laplace correction?
nb,laplace tuning
em.nB <- klaR::NaiveBayes(Deductible ~ ., data = em17.tr, usekernel = T,
fL = 1, kernel = "gaussian", adjust = 0.01)
caret::confusionMatrix(predict(em.nB, newdata = em18.te, type = "response")$class,
em18.te$Deductible)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 48 9
## yes 19 17
##
## Accuracy : 0.6989
## 95% CI : (0.595, 0.7897)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.72196
##
## Kappa : 0.3313
## Mcnemar's Test P-Value : 0.08897
##
## Sensitivity : 0.7164
## Specificity : 0.6538
## Pos Pred Value : 0.8421
## Neg Pred Value : 0.4722
## Prevalence : 0.7204
## Detection Rate : 0.5161
## Detection Prevalence : 0.6129
## Balanced Accuracy : 0.6851
##
## 'Positive' Class : no
##
The Laplace correction does not improve the model. Does another kernel?
nb,kernel tuning
kerns <- c("gaussian", "epanechnikov", "rectangular", "triangular",
"biweight", "cosine", "optcosine")
em.nB <- lapply(kerns, function(k) {
em.nB <- klaR::NaiveBayes(Deductible ~ ., data = em17.tr,
usekernel = T, fL = 1, kernel = k, adjust = 0.01)
})
nB.cM <- lapply(em.nB, function(.x) {
caret::confusionMatrix(predict(.x, newdata = em18.te, type = "response")$class,
em18.te$Deductible)
})
names(nB.cM) <- kerns
print(nB.cM)
## $gaussian
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 48 9
## yes 19 17
##
## Accuracy : 0.6989
## 95% CI : (0.595, 0.7897)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.72196
##
## Kappa : 0.3313
## Mcnemar's Test P-Value : 0.08897
##
## Sensitivity : 0.7164
## Specificity : 0.6538
## Pos Pred Value : 0.8421
## Neg Pred Value : 0.4722
## Prevalence : 0.7204
## Detection Rate : 0.5161
## Detection Prevalence : 0.6129
## Balanced Accuracy : 0.6851
##
## 'Positive' Class : no
##
##
## $epanechnikov
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 43 7
## yes 24 19
##
## Accuracy : 0.6667
## 95% CI : (0.5613, 0.7611)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.896531
##
## Kappa : 0.3105
## Mcnemar's Test P-Value : 0.004057
##
## Sensitivity : 0.6418
## Specificity : 0.7308
## Pos Pred Value : 0.8600
## Neg Pred Value : 0.4419
## Prevalence : 0.7204
## Detection Rate : 0.4624
## Detection Prevalence : 0.5376
## Balanced Accuracy : 0.6863
##
## 'Positive' Class : no
##
##
## $rectangular
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 47 9
## yes 20 17
##
## Accuracy : 0.6882
## 95% CI : (0.5837, 0.7802)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.79233
##
## Kappa : 0.3146
## Mcnemar's Test P-Value : 0.06332
##
## Sensitivity : 0.7015
## Specificity : 0.6538
## Pos Pred Value : 0.8393
## Neg Pred Value : 0.4595
## Prevalence : 0.7204
## Detection Rate : 0.5054
## Detection Prevalence : 0.6022
## Balanced Accuracy : 0.6777
##
## 'Positive' Class : no
##
##
## $triangular
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 42 7
## yes 25 19
##
## Accuracy : 0.6559
## 95% CI : (0.5502, 0.7514)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.931075
##
## Kappa : 0.2951
## Mcnemar's Test P-Value : 0.002654
##
## Sensitivity : 0.6269
## Specificity : 0.7308
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.4318
## Prevalence : 0.7204
## Detection Rate : 0.4516
## Detection Prevalence : 0.5269
## Balanced Accuracy : 0.6788
##
## 'Positive' Class : no
##
##
## $biweight
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 45 8
## yes 22 18
##
## Accuracy : 0.6774
## 95% CI : (0.5725, 0.7707)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.85059
##
## Kappa : 0.3125
## Mcnemar's Test P-Value : 0.01762
##
## Sensitivity : 0.6716
## Specificity : 0.6923
## Pos Pred Value : 0.8491
## Neg Pred Value : 0.4500
## Prevalence : 0.7204
## Detection Rate : 0.4839
## Detection Prevalence : 0.5699
## Balanced Accuracy : 0.6820
##
## 'Positive' Class : no
##
##
## $cosine
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 48 7
## yes 19 19
##
## Accuracy : 0.7204
## 95% CI : (0.6178, 0.8086)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.55260
##
## Kappa : 0.3919
## Mcnemar's Test P-Value : 0.03098
##
## Sensitivity : 0.7164
## Specificity : 0.7308
## Pos Pred Value : 0.8727
## Neg Pred Value : 0.5000
## Prevalence : 0.7204
## Detection Rate : 0.5161
## Detection Prevalence : 0.5914
## Balanced Accuracy : 0.7236
##
## 'Positive' Class : no
##
##
## $optcosine
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 42 7
## yes 25 19
##
## Accuracy : 0.6559
## 95% CI : (0.5502, 0.7514)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.931075
##
## Kappa : 0.2951
## Mcnemar's Test P-Value : 0.002654
##
## Sensitivity : 0.6269
## Specificity : 0.7308
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.4318
## Prevalence : 0.7204
## Detection Rate : 0.4516
## Detection Prevalence : 0.5269
## Balanced Accuracy : 0.6788
##
## 'Positive' Class : no
##
Changing the kernel for the density calculation of numeric values does not improve the accuracy enough to warrant inclusion for further testing. The poor performance of Naive Bayes could be due to the data having only two numeric variables.
All predicted deductibles will need to be manually checked for accuracy. Having the decision rules from previous years on hand would aid this process, and a decision tree can provide exactly those rules (a hedged sketch of extracting them follows). We will try a decision tree algorithm to replace Naive Bayes in the chunks to come.
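A hedged sketch of one way to surface those rules, assuming the C50 package (when trained through caret as below, the underlying C50 fit sits in em.dt$finalModel):

library(C50)
rule.mod <- C50::C5.0(Deductible ~ ., data = em17.tr, rules = TRUE) # rule-based variant
summary(rule.mod) # prints the if-then rules it uses to flag deductibles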
Decision Tree
library(doParallel)
# make a cluster using all but one of the available logical cores
cl <- makeCluster(detectCores() - 1)
# register the parallel workers
registerDoParallel(cl)
getDoParWorkers()
em.dt <- caret::train(Deductible ~ ., data = em17.tr, method = "C5.0",
    na.action = "na.pass", metric = "Accuracy", tuneLength = 10,
    trControl = em2017.train) # trControl (not trCtrl) is caret's argument name
stopCluster(cl)
registerDoSEQ()
detach("package:doParallel")
library(C50)
caret::confusionMatrix(predict(em.dt, newdata = em18.te, type = "raw"),
em18.te$Deductible)
save(em.dt, file = "em.dt.RData")
It appears the C5.0 algorithm performs exceptionally well, with an accuracy of ~98%. The well-categorized financial data makes the decision tree particularly well-suited to determining whether a transaction is deductible. Given the performance, I’m fairly certain that this will be the algorithm of choice for determining deductions in future years. The 98% performance on the test set may be partly due to the many recurring monthly subscription payments scheduled for the coming year: because they are labeled identically to the previous year, they provide a number of easy-to-fit observations. In addition to eliminating these, I will run the test with the additional data generated over the course of working on this project plus all data used in the previous test (all transactions YTD 2018), which will likely provide a more realistic measure of accuracy.
An explanation for why the decision tree performed so well on this particular dataset is its bias towards factors with numerous levels, an advantage given this dataset’s numerous categories and subcategories. The Category & Subcategory labels with which I tag each transaction are fairly consistent, so the tree’s bias towards these two many-leveled factors works in its favor.
In the next section, an out-of-time test set with new data for 2018 will be loaded and filtered to include only transactions year-to-date (eliminating scheduled recurring transactions).
In the model creation phase we used caret to create and tune models, and then manually created models of the same type using the tuning parameters caret provided. These models need to be regenerated with the new datasets that have matching levels. We can use the tuning parameters from the previous iterations to refine the tuning grids in this round of model generation.
Model Parameters
load("em2017modes.RData")
load("em.dt.RData")
names(em.2017.mods) <- c("svmRadial", "LogitBoost", "nb")
em.2017.mods$C5.0 <- em.dt
best.Tunes <- lapply(em.2017.mods, FUN = function(x) {
purrr::pluck(x, list("bestTune"))
})
# ----------------------- Thu Apr 12 17:41:23 2018
# ------------------------# Based on previous tuning
# parameters, we will try ranges of values near to the best
# tunes to further refine the results.
tuneGrids <- lapply(seq_along(best.Tunes), tunes = best.Tunes,
mods = em.2017.mods, function(tune, tunes, mods) {
modelnm <- names(tunes)[tune]
varnms <- names(tunes[[tune]])
rowmatch <- sapply(varnms, function(v) {
which(mods[[modelnm]][["results"]][, v] %in% tunes[[modelnm]][,
v])
}) %>% Reduce(intersect, .) # Match the best tune to the row number
out <- sapply(varnms, modelnm = modelnm, rowmatch = rowmatch,
function(v, modelnm, rowmatch) {
if (mods[[modelnm]][["results"]][, v] %>% is.numeric) {
range(mods[[modelnm]][["results"]][c((rowmatch -
1):(rowmatch + 1)), v]) %>% summary %>% c(mods[[modelnm]][["results"]][rowmatch,
v]) %>% unique
} else {
mods[[modelnm]][["results"]][, v] %>% unique
}
})
if (is.list(out)) {
out <- expand.grid(out)
} else {
out <- as.data.frame(out)
}
# Create new tuning grids based on the best tune parameters,
# and parameter values from iterations before and after
return(out)
})
mods <- vector("list", length(tuneGrids))
names(mods) <- names(tuneGrids) <- sapply(em.2017.mods, purrr::pluck,
"method") %>% subset(., subset = !sapply(., is.null))
Regenerate Models
load("matchedlevels.RData")
req.packages <- c("doParallel", "klaR", "kernlab", "caTools",
"C50", "parallel", "iterators", "MASS", "foreach", "caret",
"fastAdaboost")
for (q in seq_along(req.packages)) {
suppressPackageStartupMessages(library(req.packages[q], character.only = T))
}
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
getDoParWorkers()
nms <- names(mods)
mods <- lapply(seq_along(mods), data = matchedlevels$olddata,
mods = mods, rv = "Deductible", tG = tuneGrids, function(i,
data, mods, rv, tG) {
method <- names(mods)[i]
data.train <- createDataPartition(data[[rv]], times = 2,
p = 0.8, list = T)
data.train <- trainControl(method = "repeatedcv", index = data.train,
number = 5, repeats = 1, search = "grid", verboseIter = T,
allowParallel = T, classProbs = T, savePredictions = T)
form <- as.formula(paste0(rv, " ~ ."))
        mod <- train(form, data = data, method = method, na.action = "na.pass",
            metric = "Accuracy", trControl = data.train, tuneGrid = tG[[i]])
return(mod)
})
names(mods) <- nms
stopCluster(cl)
registerDoSEQ()
save(mods, file = "mods.RData")
Modified AUC
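The chunk that modifies the plotting function is not echoed; judging from the commented assignInNamespace call in the ROC section below, the pattern is roughly the following (the edit itself is hypothetical):

AUC <- psych::AUC # start from the packaged decision-theory AUC, then edit (e.g. the plot labels)
# ... make the desired modifications to the function body ...
assignInNamespace("AUC", AUC, ns = "psych") # push the edited copy back into the psych namespace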
Now to compare the performance across the various models
All model performance on test set
# ----------------------- Thu Apr 12 15:28:06 2018
# ------------------------# Explore the results on the test
# sets ----------------------- Thu Apr 12 20:20:59 2018
# ------------------------# Model performances
load("mods.Rdata")
req.packages <- c("doParallel", "klaR", "kernlab", "caTools",
"C50", "parallel", "iterators", "MASS", "foreach", "caret")
for (q in seq_along(req.packages)) {
suppressPackageStartupMessages(library(req.packages[q], character.only = T))
}
mods.cM <- lapply(mods, newdata = matchedlevels$newdata, function(l,
newdata) {
em.pred <- predict(l, newdata = newdata)
caret::confusionMatrix(em.pred, newdata$Deductible, positive = "yes")
})
# When attempting to generate the AUC graphs, the naive bayes
# prediction had no positive predictions causing a divide by
# 0 error, thus it is removed.
purrr::map2(.x = mods.cM[-3], .y = names(mods.cM[-3]), function(.x,
.y) {
.x %<>% lapply(rev)
AUC(t = .x$table, plot = "b", labels = .y)
})
## $svmRadial
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## Predicted.Pos Predicted.Neg
## True.Pos 0.151 0.13
## True.Neg 0.043 0.68
##
## Conditional probabilities of
## Predicted.Pos Predicted.Neg
## True.Pos 0.54 0.46
## True.Neg 0.06 0.94
##
## Accuracy = 0.83 Sensitivity = 0.54 Specificity = 0.94
## with Area Under the Curve = 0.88
## d.prime = 1.65 Criterion = 1.56 Beta = 0.22
## Observed Phi correlation = 0.54
## Inferred latent (tetrachoric) correlation = 0.8
## $LogitBoost
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## Predicted.Pos Predicted.Neg
## True.Pos 0.221 0.047
## True.Neg 0.035 0.698
##
## Conditional probabilities of
## Predicted.Pos Predicted.Neg
## True.Pos 0.826 0.17
## True.Neg 0.048 0.95
##
## Accuracy = 0.92 Sensitivity = 0.83 Specificity = 0.95
## with Area Under the Curve = 0.97
## d.prime = 2.61 Criterion = 1.67 Beta = 0.32
## Observed Phi correlation = 0.79
## Inferred latent (tetrachoric) correlation = 0.95
## $C5.0
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## Predicted.Pos Predicted.Neg
## True.Pos 0.258 0.022
## True.Neg 0.032 0.688
##
## Conditional probabilities of
## Predicted.Pos Predicted.Neg
## True.Pos 0.923 0.077
## True.Neg 0.045 0.955
##
## Accuracy = 0.95 Sensitivity = 0.92 Specificity = 0.96
## with Area Under the Curve = 0.99
## d.prime = 3.12 Criterion = 1.7 Beta = 0.38
## Observed Phi correlation = 0.87
## Inferred latent (tetrachoric) correlation = 0.98
C5.0 is again the top performing algorithm for predicting deductibles by measures of Accuracy, AUC and in the visible class separation in the Decision Theory graph.
Refine tuneGrids
best.Tunes <- lapply(mods, FUN = function(x) {
purrr::pluck(x, list("bestTune"))
})
# ----------------------- Wed Apr 18 15:10:39 2018
# ------------------------# Refine tuneGrids function to use
# the range of those trials that have significance at the .05
# alpha levels as being the best performers
tuneGrids <- lapply(seq_along(best.Tunes), tunes = best.Tunes,
mods = mods, function(tune, tunes, mods) {
modelnm <- names(tunes)[tune]
varnms <- names(tunes[[tune]])
rowmatch <- sapply(varnms, function(v) {
which(mods[[modelnm]][["results"]][, v] %in% tunes[[modelnm]][,
v])
}) %>% Reduce(intersect, .) # Match the best tune to the row number
perf.measures <- list(Accuracy = mods[[modelnm]][["results"]][["Accuracy"]] %>%
.[order(., decreasing = T)], Kappa = mods[[modelnm]][["results"]][["Kappa"]] %>%
.[order(., decreasing = T)])
    top.ind <- purrr::map2(.x = perf.measures, .y = names(perf.measures),
        function(.x, .y) {
            which(mods[[modelnm]][["results"]][[.y]] %in%
                .x[.x > qnorm(0.95, mean(.x), sd(.x))])
        }) %>% unlist %>% unique # Identify the best performers based on Accuracy and Kappa
    if (length(top.ind) > 0) {
        btunes <- c(top.ind, rowmatch) # row indices of the top performers plus the best tune
    } else {
        btunes <- c((rowmatch - 1):(rowmatch + 1))
    }
out <- sapply(varnms, modelnm = modelnm, rowmatch = rowmatch,
function(v, modelnm, rowmatch) {
if (mods[[modelnm]][["results"]][, v] %>% is.numeric) {
range(mods[[modelnm]][["results"]][btunes,
v]) %>% summary %>% c(mods[[modelnm]][["results"]][rowmatch,
v]) %>% unique
} else {
mods[[modelnm]][["results"]][, v] %>% unique
}
})
if (is.list(out)) {
out <- expand.grid(out)
} else {
out <- as.data.frame(out)
}
# Create new tuning grids based on the best tune parameters,
# and parameter values from iterations before and after
return(out)
})
caretEnsemble
## NOT RUN - BUG: Error: cannot compute class probabilities
## for regressionError in evalSummaryFunction(y, wts =
## weights, ctrl = trControl, lev = classLevels, : ##
## train()'s use of ROC codes requires class probabilities.
## See the classProbs option of trainControl()
## ----------------------- Thu Apr 12 23:14:30 2018
## ------------------------# Predictions from caret train
## apparently will not save even when savePredictions=T. We
## will have to try the caretEnsemble full workup from the
## vignette tomorrow. ----------------------- Fri Apr 13
## 07:34:18 2018 ------------------------# Since we've seen
## that naive bayes doesn't perform to well with this data, we
## will replace it with the adaboost algorithm, a boosted
## decision tree leaner ----------------------- Wed Apr 18
## 18:35:23 2018 ------------------------# FIX: formula is
## declared via the variable form - not formula
set.seed(1)
req.packages <- c("doParallel", "kernlab", "caTools", "C50",
"parallel", "iterators", "MASS", "foreach", "caret", "tidyverse",
"dplyr", "htmltools", "magrittr", "repmis")
for (q in seq_along(req.packages)) {
suppressPackageStartupMessages(library(req.packages[q], character.only = T))
}
# repmis::source_data('https://github.com/yogat3ch/da5030/blob/master/matchedlevels.RData?raw=true')
data.train <- caret::createDataPartition(matchedlevels$olddata[["Deductible"]],
times = 2, p = 0.85)
data.train <- caret::trainControl(method = "repeatedcv", index = data.train,
number = 10, repeats = 1, search = "grid", allowParallel = T,
summaryFunction = caret::twoClassSummary, classProbs = T,
savePredictions = "final", returnResamp = "final")
form <- as.formula(paste0("Deductible", " ~ ."))
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
getDoParWorkers()
# tuneList alone specifies all four methods; passing the same methods in
# methodList as well would train duplicate models.
model.list <- caretEnsemble::caretList(form = form, data = matchedlevels$olddata,
    trControl = data.train, metric = "ROC", tuneList = list(svmRadial = caretEnsemble::caretModelSpec(method = "svmRadial",
        tuneGrid = tuneGrids$svmRadial), LogitBoost = caretEnsemble::caretModelSpec(method = "LogitBoost",
        tuneGrid = tuneGrids$LogitBoost), adaboost = caretEnsemble::caretModelSpec(method = "adaboost",
        tuneLength = 10), C5.0 = caretEnsemble::caretModelSpec(method = "C5.0",
        tuneGrid = tuneGrids$C5.0)))
stopCluster(cl)
registerDoSEQ()
# ----------------------- Tue Apr 17 14:10:03 2018
# ------------------------# Models need to be filtered to
# only return one resample, saved back to the caretList and
# then used with caretEnsemble
modellist.cM <- list(svmRadial = confusionMatrix(predict(model.list$svmRadial,
newdata = matchedlevels$newdata), matchedlevels$newdata$Deductible,
positive = "yes"), LogitBoost = confusionMatrix(predict(model.list$LogitBoost,
newdata = matchedlevels$newdata), matchedlevels$newdata$Deductible,
positive = "yes"), adaboost = confusionMatrix(predict(model.list$adaboost,
newdata = matchedlevels$newdata), matchedlevels$newdata$Deductible,
positive = "yes"), C5.0 = confusionMatrix(predict(model.list$C5.0,
newdata = matchedlevels$newdata), matchedlevels$newdata$Deductible,
positive = "yes"))
save(modellist.cM, file = "modellist.cM.RData")
save(model.list, file = "caretEnsemble_model.list.RData")
Run the ensemble
load("caretEnsemble_model.list.RData")
# ----------------------- Wed Apr 18 13:50:04 2018
# ------------------------# Resampling filtration
lapply(model.list, function(x) purrr::pluck(x, "pred") %>% nrow)
lapply(model.list, function(x) purrr::pluck(x, "pred", "Resample") %>%
unique)
# Perhaps these might need to be subset into just one
# resample?
Preds <- purrr::map2(.x = model.list, .y = names(model.list),
function(.x, .y) {
samples <- purrr::pluck(.x, "pred", "Resample") %>% unique
sample <- purrr::pluck(.x, "pred") %>% filter(Resample ==
samples[1])
list(pred = sample)
})
for (i in seq_along(model.list)) {
model.list[[i]][["pred"]] <- Preds[[i]][["pred"]]
}
mods.Ens <- caretEnsemble::caretStack(model.list, method = "glm")
# it works!
save(mods.Ens, file = "mods.enssm.RData")
Confusion Matrix of small stacked ensemble model
load(file = "mods.enssm.RData")
(modellist.cM$StackedSmall <- confusionMatrix(predict(mods.Ens,
newdata = matchedlevels$newdata), matchedlevels$newdata$Deductible,
positive = "yes"))
# The stacked model does not appear as accurate as the
# decision tree models (though might be more useful if we
# would like to be careful about overfitting)
save(modellist.cM, file = "modellist.cM.RData")
rm(mods.Ens)
Does it work with the original model list?
caretStack with original model list
load("caretEnsemble_model.list.Rdata")
mods.Ens <- caretEnsemble::caretStack(model.list, method = "glm")
save(mods.Ens, file = "mods.enslg.RData")
confusionMatrix(predict(model.list$svmRadial, newdata = matchedlevels$newdata),
matchedlevels$newdata$Deductible, positive = "yes")
caretStack Ensemble confusion Matrix
load(file = "mods.enslg.RData")
(modellist.cM$StackedLarge <- caret::confusionMatrix(predict(mods.Ens,
    newdata = matchedlevels$newdata), matchedlevels$newdata$Deductible,
    positive = "yes"))
rm(mods.Ens)
It does appear to work.
Performance of all Models
## $svmRadial
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 66 15
## yes 1 11
##
## Accuracy : 0.828
## 95% CI : (0.7357, 0.8983)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.011366
##
## Kappa : 0.4887
## Mcnemar's Test P-Value : 0.001154
##
## Sensitivity : 0.4231
## Specificity : 0.9851
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.8148
## Prevalence : 0.2796
## Detection Rate : 0.1183
## Detection Prevalence : 0.1290
## Balanced Accuracy : 0.7041
##
## 'Positive' Class : yes
##
##
## $LogitBoost
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 64 4
## yes 3 22
##
## Accuracy : 0.9247
## 95% CI : (0.851, 0.9692)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.0000008947
##
## Kappa : 0.8109
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8462
## Specificity : 0.9552
## Pos Pred Value : 0.8800
## Neg Pred Value : 0.9412
## Prevalence : 0.2796
## Detection Rate : 0.2366
## Detection Prevalence : 0.2688
## Balanced Accuracy : 0.9007
##
## 'Positive' Class : yes
##
##
## $adaboost
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 43 4
## yes 24 22
##
## Accuracy : 0.6989
## 95% CI : (0.595, 0.7897)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.7219591
##
## Kappa : 0.395
## Mcnemar's Test P-Value : 0.0003298
##
## Sensitivity : 0.8462
## Specificity : 0.6418
## Pos Pred Value : 0.4783
## Neg Pred Value : 0.9149
## Prevalence : 0.2796
## Detection Rate : 0.2366
## Detection Prevalence : 0.4946
## Balanced Accuracy : 0.7440
##
## 'Positive' Class : yes
##
##
## $C5.0
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 64 2
## yes 3 24
##
## Accuracy : 0.9462
## 95% CI : (0.879, 0.9823)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.00000003033
##
## Kappa : 0.8681
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9231
## Specificity : 0.9552
## Pos Pred Value : 0.8889
## Neg Pred Value : 0.9697
## Prevalence : 0.2796
## Detection Rate : 0.2581
## Detection Prevalence : 0.2903
## Balanced Accuracy : 0.9392
##
## 'Positive' Class : yes
##
##
## $StackedSmall
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 65 8
## yes 2 18
##
## Accuracy : 0.8925
## 95% CI : (0.8111, 0.9472)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.00005051
##
## Kappa : 0.7128
## Mcnemar's Test P-Value : 0.1138
##
## Sensitivity : 0.6923
## Specificity : 0.9701
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.8904
## Prevalence : 0.2796
## Detection Rate : 0.1935
## Detection Prevalence : 0.2151
## Balanced Accuracy : 0.8312
##
## 'Positive' Class : yes
##
##
## $StackedLarge
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 58 8
## yes 9 18
##
## Accuracy : 0.8172
## 95% CI : (0.7235, 0.8898)
## No Information Rate : 0.7204
## P-Value [Acc > NIR] : 0.02146
##
## Kappa : 0.5515
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.6923
## Specificity : 0.8657
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.8788
## Prevalence : 0.2796
## Detection Rate : 0.1935
## Detection Prevalence : 0.2903
## Balanced Accuracy : 0.7790
##
## 'Positive' Class : yes
##
ROC and AUC for all models
# library(psych) assignInNamespace('AUC',AUC,ns='psych')
(model.perfs <- purrr::map2(.x = modellist.cM, .y = names(modellist.cM),
    function(.x, .y) {
        .x %<>% lapply(rev)  # reverse each component so the positive ('yes') cell comes first, as psych::AUC expects
        list(Name = .y, AUC = AUC(t = .x$table, plot = "b", labels = .y))
    }))
## $svmRadial
## $svmRadial$Name
## [1] "svmRadial"
##
## $svmRadial$AUC
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## Predicted.Pos Predicted.Neg
## True.Pos 0.118 0.16
## True.Neg 0.011 0.71
##
## Conditional probabilities of
## Predicted.Pos Predicted.Neg
## True.Pos 0.423 0.58
## True.Neg 0.015 0.99
##
## Accuracy = 0.83 Sensitivity = 0.42 Specificity = 0.99
## with Area Under the Curve = 0.92
## d.prime = 1.98 Criterion = 2.17 Beta = 0.17
## Observed Phi correlation = 0.55
## Inferred latent (tetrachoric) correlation = 0.87
##
## $LogitBoost
## $LogitBoost$Name
## [1] "LogitBoost"
##
## $LogitBoost$AUC
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## Predicted.Pos Predicted.Neg
## True.Pos 0.237 0.043
## True.Neg 0.032 0.688
##
## Conditional probabilities of
## Predicted.Pos Predicted.Neg
## True.Pos 0.846 0.15
## True.Neg 0.045 0.96
##
## Accuracy = 0.92 Sensitivity = 0.85 Specificity = 0.96
## with Area Under the Curve = 0.97
## d.prime = 2.72 Criterion = 1.7 Beta = 0.34
## Observed Phi correlation = 0.81
## Inferred latent (tetrachoric) correlation = 0.96
##
## $adaboost
## $adaboost$Name
## [1] "adaboost"
##
## $adaboost$AUC
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## Predicted.Pos Predicted.Neg
## True.Pos 0.24 0.043
## True.Neg 0.26 0.462
##
## Conditional probabilities of
## Predicted.Pos Predicted.Neg
## True.Pos 0.85 0.15
## True.Neg 0.36 0.64
##
## Accuracy = 0.7 Sensitivity = 0.85 Specificity = 0.64
## with Area Under the Curve = 0.84
## d.prime = 1.38 Criterion = 0.36 Beta = 0.51
## Observed Phi correlation = 0.44
## Inferred latent (tetrachoric) correlation = 0.69
##
## $C5.0
## $C5.0$Name
## [1] "C5.0"
##
## $C5.0$AUC
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## Predicted.Pos Predicted.Neg
## True.Pos 0.258 0.022
## True.Neg 0.032 0.688
##
## Conditional probabilities of
## Predicted.Pos Predicted.Neg
## True.Pos 0.923 0.077
## True.Neg 0.045 0.955
##
## Accuracy = 0.95 Sensitivity = 0.92 Specificity = 0.96
## with Area Under the Curve = 0.99
## d.prime = 3.12 Criterion = 1.7 Beta = 0.38
## Observed Phi correlation = 0.87
## Inferred latent (tetrachoric) correlation = 0.98
##
## $StackedSmall
## $StackedSmall$Name
## [1] "StackedSmall"
##
## $StackedSmall$AUC
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## Predicted.Pos Predicted.Neg
## True.Pos 0.194 0.086
## True.Neg 0.022 0.699
##
## Conditional probabilities of
## Predicted.Pos Predicted.Neg
## True.Pos 0.69 0.31
## True.Neg 0.03 0.97
##
## Accuracy = 0.89 Sensitivity = 0.69 Specificity = 0.97
## with Area Under the Curve = 0.95
## d.prime = 2.39 Criterion = 1.88 Beta = 0.28
## Observed Phi correlation = 0.72
## Inferred latent (tetrachoric) correlation = 0.93
##
## $StackedLarge
## $StackedLarge$Name
## [1] "StackedLarge"
##
## $StackedLarge$AUC
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## Predicted.Pos Predicted.Neg
## True.Pos 0.194 0.086
## True.Neg 0.097 0.624
##
## Conditional probabilities of
## Predicted.Pos Predicted.Neg
## True.Pos 0.69 0.31
## True.Neg 0.13 0.87
##
## Accuracy = 0.82 Sensitivity = 0.69 Specificity = 0.87
## with Area Under the Curve = 0.87
## d.prime = 1.61 Criterion = 1.11 Beta = 0.31
## Observed Phi correlation = 0.55
## Inferred latent (tetrachoric) correlation = 0.78
Comparing the small and large ensemble models shows that including the 2nd resampling predictions in the caretList created more variance in the stacked glm probabilities and thus reduced the Accuracy, AUC and class separation. Overall, C5.0 performs exceptionally well, with far fewer computational resources needed to train (it trains in a matter of seconds running in parallel) and far less memory (the ensemble models are nearly 300 MB each, whereas the C5.0 object is 1.6 MB). For the machine learning task of classifying deductibles from well-labelled transaction data, C5.0 is clearly the optimal tool, though building the ensemble models was a valuable learning experience.
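For a quick side-by-side of the headline numbers, accuracy and kappa for every model can be tabulated straight from the stored confusionMatrix objects. This is a minimal sketch that assumes modellist.cM holds the confusionMatrix objects printed above:
Model summary sketch
# Tabulate Accuracy and Kappa across all models; cm$overall is the
# named vector caret stores in each confusionMatrix object
sapply(modellist.cM, function(cm) round(cm$overall[c("Accuracy", "Kappa")], 3))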
detach packages
Preparing data for Cluster Analysis
library(ClusterR)
library(clusterCrit)
em.cl.tr <- em17.tr[, c(1:5)] %>% filter(Amount < 0)
if (em.cl.tr$Date %>% lubridate::is.Date()) {
em.cl.tr$Date %<>% lubridate::decimal_date()
}
em.cl.tr[, sapply(em.cl.tr, is.factor)] %<>% lapply(as.numeric)
em.cl.tr %>% findna()
## $Date
## NULL
##
## $Amount
## NULL
##
## $Category
## NULL
##
## $Subcategory
## NULL
##
## $Payment.Method
## NULL
em.cl.tr[is.na(em.cl.tr$Subcategory), "Subcategory"] <- "None"
em.cl.tr$Subcategory %<>% as.factor %>% as.numeric
em.cl.tr[, 2:5] %<>% lapply(scale)
The ClusterR::GMM function fits Gaussian curves to the data to form clusters. Two datasets are tested for clustering outcomes: the full scaled feature set prepared above, and a reduced set containing only Date and Amount.
Cluster Analysis with Gaussian Mixture Models
ClusterR::Optimal_Clusters_GMM(em.cl.tr, max_clusters = 5, criterion = "AIC",
dist_mode = "eucl_dist") # Function finds the optimal number of clusters based on a given max, criteria and distance measure
## [1] 4639.9215 2193.3093 1838.5082 1056.4875 876.3358
## attr(,"class")
## [1] "Gaussian Mixture Models"
em.cl.trScaled <- ClusterR::center_scale(em.cl.tr) # A vectorized scaling function for data frames.
emCluster <- ClusterR::GMM(em.cl.trScaled, gaussian_comps = 5,
dist_mode = "eucl_dist", km_iter = 12, em_iter = 10, seed = 1)
em.cl.tr$ClusterGMM.euc <- ClusterR::predict_GMM(em.cl.trScaled,
emCluster$centroids, emCluster$covariance_matrices, emCluster$weights)$cluster_labels
emCluster <- ClusterR::GMM(em.cl.trScaled, gaussian_comps = 5,
dist_mode = "maha_dist", km_iter = 12, em_iter = 10, seed = 1)
em.cl.tr$ClusterGMM.mah <- ClusterR::predict_GMM(em.cl.trScaled,
emCluster$centroids, emCluster$covariance_matrices, emCluster$weights)$cluster_labels
# ----------------------- Sun Apr 08 20:01:08 2018
# ------------------------# Reduced set of cluster data with
# just date & amount
em.cl.trReduced <- sapply(em17.tr[, c(1:2)] %>% filter(Amount <
0), as.numeric) %>% as.data.frame %>% ClusterR::center_scale()
emReducedCluster <- ClusterR::GMM(em.cl.trReduced, gaussian_comps = 5,
dist_mode = "eucl_dist", km_iter = 12, em_iter = 10, seed = 1)
em.cl.tr$Red.ClusterGMM.euc <- ClusterR::predict_GMM(em.cl.trReduced,
emReducedCluster$centroids, emReducedCluster$covariance_matrices,
emReducedCluster$weights)$cluster_labels
emReducedCluster <- ClusterR::GMM(em.cl.trReduced, gaussian_comps = 5,
dist_mode = "maha_dist", km_iter = 12, em_iter = 10, seed = 1)
em.cl.tr$Red.ClusterGMM.mah <- ClusterR::predict_GMM(em.cl.trReduced,
emReducedCluster$centroids, emReducedCluster$covariance_matrices,
emReducedCluster$weights)$cluster_labels
Cluster Medoids is a clustering algorithm that uses actual data points as cluster centers. It is more robust to noise and outliers because it minimizes the sum of pairwise dissimilarities rather than the sum of squared Euclidean distances. I'm going to try Mahalanobis distance to account for the dataset being transactional across time, whereby a cluster of transactions may have an ellipsoidal shape when mapped against time. Furthermore, an outlier on the amount metric will likely still be included in a cluster because distances are measured along the eigenvectors of the covariance matrix, where extreme values are more likely to be absorbed into an existing cluster rather than forming a cluster of their own.
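To illustrate why Mahalanobis distance suits an elongated cloud of transactions, the toy sketch below (hypothetical data, base R only) shows a point that looks extreme in Euclidean terms but is unremarkable once the covariance structure is taken into account:
Mahalanobis distance sketch
set.seed(1)
toy <- cbind(Date = rnorm(200), Amount = 3 * rnorm(200))  # cloud stretched along the Amount axis
x <- c(0, 6)  # extreme on the Amount axis only
sqrt(mahalanobis(x, center = colMeans(toy), cov = cov(toy)))  # ~2: well inside the ellipsoid
sqrt(sum((x - colMeans(toy))^2))  # ~6 in Euclidean terms: looks like an outlier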
The results of both clustering techniques will be compared using the clusterCrit package.
k-medoids clustering
# ----------------------- Sun Apr 08 20:31:22 2018
# ------------------------#
# ClusterR::Optimal_Clusters_Medoids(em.cl.trScaled,max_clusters
# = 12, distance_metric = 'euclidean', criterion =
# 'dissimilarity') The plot shows an elbow at 5 clusters,
# with decreases in dissimilarity of .005 and .006
# thereafter. 5 clusters will be used for comparison with the
# GMM method. Euclidean distance metric will also be
# considered for comparison with the GMM method.
em.mah.Medoids <- ClusterR::Cluster_Medoids(em.cl.trScaled, clusters = 5,
distance_metric = "mahalanobis")
em.euc.Medoids <- ClusterR::Cluster_Medoids(em.cl.trScaled, clusters = 5,
distance_metric = "euclidean")
em.cl.tr$ClusterM.mah <- ClusterR::predict_Medoids(data = em.cl.trScaled,
MEDOIDS = em.mah.Medoids$medoids, distance_metric = "mahalanobis")$clusters
em.cl.tr$ClusterM.euc <- ClusterR::predict_Medoids(data = em.cl.trScaled,
MEDOIDS = em.euc.Medoids$medoids, distance_metric = "euclidean")$clusters
# Reduced
em.mah.Medoids <- ClusterR::Cluster_Medoids(em.cl.trReduced,
clusters = 5, distance_metric = "mahalanobis")
em.euc.Medoids <- ClusterR::Cluster_Medoids(em.cl.trReduced,
clusters = 5, distance_metric = "euclidean")
em.cl.tr$Red.ClusterM.mah <- ClusterR::predict_Medoids(data = em.cl.trReduced,
MEDOIDS = em.mah.Medoids$medoids, distance_metric = "mahalanobis")$clusters
em.cl.tr$Red.ClusterM.euc <- ClusterR::predict_Medoids(data = em.cl.trReduced,
MEDOIDS = em.euc.Medoids$medoids, distance_metric = "euclidean")$clusters
The clusterCrit package provides various clustering evaluation criteria, most of which are based on measures of variance minimization. There are numerous criteria in the package and I would like to learn more about each one as time allows. For now, I've used all of the criteria paired with a voting system such that the best clustering is selected based on the number of votes it receives across the criteria (see the sketch below).
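A single vote works as sketched here: for each criterion, clusterCrit::bestCriterion knows whether higher or lower is better and returns the index of the winning clustering (toy values for illustration):
Voting sketch
vals <- c(2.1, 3.7, 1.9)  # one criterion's scores for three candidate clusterings
clusterCrit::bestCriterion(vals, "Calinski_Harabasz")  # higher is better for this criterion, so 2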
Cluster Evaluation
emCluster.crits <- purrr::map2(.x = names(em.cl.tr[, 6:13]),
    .y = em.cl.tr[, 6:13], sca = em.cl.trScaled, red = em.cl.trReduced,
    .f = function(.x, .y, sca, red) {
        # Score labels fit on the reduced set against the reduced
        # data, all others against the full scaled data
        if (str_detect(.x, "Red")) {
            crit <- clusterCrit::intCriteria(red, as.integer(.y),
                "all")
        } else {
            crit <- clusterCrit::intCriteria(sca, as.integer(.y),
                "all")
        }
        return(crit)
    })  # Apply all of the criteria and return the criteria outputs
names(emCluster.crits) <- names(em.cl.tr[, 6:13])
(emClusters.best <- sapply(clusterCrit::getCriteriaNames(isInternal = T),
cc = emCluster.crits, FUN = function(crit, cc) {
crit.vals <- lapply(cc, `[[`, tolower(crit)) %>% unlist
bestcrit.index <- clusterCrit::bestCriterion(crit.vals,
crit = crit)
crit.vote <- c(names(cc)[bestcrit.index])
names(crit.vote) <- crit
return(crit.vote)
}) %>% table) # Use best criterion score to vote for a cluster (1 cluster vote per criterion).
## .
## ClusterM.mah Red.ClusterGMM.euc Red.ClusterM.euc
## 3 3 14
## Red.ClusterM.mah
## 20
It appears that the best-performing clustering technique was medoid clustering, and it performed best on the reduced dataset regardless of distance measure. This method uses a data point as a cluster center rather than a randomly assigned starting point and moves the clusters to the center of variance with each iteration using pairwise dissimilarity. On the reduced dataset that includes only dates and amounts, this method hopefully pinpointed clusters of expenses and will allow trends in spending to be observed across time.
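Because medoids are actual rows of the data, the winning clustering can be inspected directly; a small sketch using the objects created above:
Inspect winning clusters
em.mah.Medoids$medoids  # one representative (scaled) Date/Amount row per cluster
table(em.cl.tr$Red.ClusterM.mah)  # how many transactions landed in each cluster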
Bind best performing clusters to original data
The cluster analysis can be found in the section above entitled Expense Data Cluster Analysis.
Documentation for the code contained in each chunk can be found preceding and sometimes following the chunks. Comments that elaborate specific code segments can be found inline.
Much of the data transformation took place as required by a particular analytical task. The data transformation can be found in part in the Data Loading and Cleaning section. Further cleaning of the data can be found in Load and Transform Data. Additional data cleaning was performed as needed by the specific visualization type and can be found documented accompanying the code for the visualization.
As per suggestions during the evaluation period, the order of the visualizations in the narrative was rearranged to include the Timeseries in the first position. Additionally, a comparison of the change in Median Income to the actual change in expenses between Asheville and Boston was positioned to complement the information already present in the title area.
The timeseries visualization now includes custom-selected shapes that allow for easier differentiation of transactions in clusters adjacent in time. The legend key shapes are also larger to draw a direct parallel with the network graph of the clusters below the timeseries. The cluster nodes in the network graph have been manually modified to reflect the corresponding shape used in the timeseries. Links connecting the central '2017' node to the cluster nodes have been given a heavier weight such that the location of each cluster node ID can be easily discerned and each cluster can be visually tracked outwards from there. The color legend for categories sits between the network graph and the timeseries for ease of connecting colors with categories as the two visualizations are cross-referenced. Furthermore, the explanations positioned near each cluster are now in black for legibility, with a small cluster-shape signifier as a reference point to connect the text with the appropriate cluster. I feel these modifications make the connectivity between the timeseries and the network graph more apparent.
The cost of living visualization is straightforward and includes notable insights highlighted with text on the graph. Per a peer's suggestion, the legend has been simplified to fill-only and fill-stroked bars to signify location, instead of repetitive text along the x-axis. Additionally, '+' signs have been added to all positive values for contrast with negative values.
The visualization of market-signal backtesting on Google Class A & B stocks for 2017 uses color-coded buy & sell labels in the description that correspond with vertical lines indicating the points in time where each action takes place. Thus the success or failure of the signal can be seen in the number and momentum of upward/downward triangle points denoting the Parabolic SAR indicator that lie between each blue buy signal and each red sell signal. The algorithm definitely needs further work because it actually performs slightly worse than buy & hold.
It was my intention with this project to take a network perspective on expense transactions to better understand patterns of spending. I felt that a cluster analysis combined with the network graph would be a way of gaining this perspective into the data. Once the network graph came into form, it became apparent that richer insights about the networked nature of transactions could be gained from a timeseries graph imbued with visual continuity to the network graph than from a typical timeseries graph (as can be seen on the first draft). The result appears on the final draft. The stacked bar graph included in the first draft acted as a summary of the expenses in each category (split by location), intended to magnify the massive difference in total expense between Asheville and Boston. The inclusion of educational expense (an outlier which occurred upon moving to Boston) emphasized this difference to a great magnitude. However, the stacked bar graph did not make for a fluid comparison of the various categories of expense between the cities. It was later decided that the most relevant information to communicate to an audience unconcerned with my personal expenses was the amount of change in cost-of-living expenses they might expect to incur on moving from a smaller town (like Asheville) to a metropolitan area (like Boston). The facet graph with accompanying percentage change by category more clearly shows these direct comparisons of changes in expense. In addition, educational expense was averaged over all months in the year, as it is typically thought about in terms of years even though it might be paid in 1-2 transactions. Even with this generous distribution, it is clear that the cost per month is still greater than the largest cost-of-living expense: rent for a 2-bedroom apartment (that houses two people, as indicated by the split on the respective bar).
The final visualization is a timeseries, as is typical when visualizing financial data. Space was limited for this visualization and thus it did not print very well, but the pdf maintains sufficient granularity to see the rise and fall of the Parabolic SAR between buy & sell lines such that the performance of the indicator algorithm can be tracked across the timeseries.
It was my intention to maintain a continuity of color mapped across all visualizations that included the original transaction dataset, such that a specific category of expense could be tracked across all three visualizations, providing insight into its frequency (the timeseries), the types of transactions that typically occur around it in time or are similar in amount (the network graph), and how it changed with the move from Asheville to Boston (the cost of living comparison). With regard to tracking the clusters, the shape-cluster association carried from the point representations in the timeseries, to the cluster nodes in the network graph, to the symbol reference points for the respective cluster descriptions provides a more coherent narrative of each cluster should the viewer choose to inquire in this way. Additionally, a sizeable effort went into maintaining subcategory cluster colors in the network graph by custom-coding a function that varies the luminance of the primary category color to generate colors for the subcategories. In the portfolio management timeseries, a symbolic link is added in the description of the Parabolic SAR indicator to its corresponding representation in the graph, complementing the existing color link between the descriptions of the buy and sell signals and their representative vertical lines in the visualization. The legend has been repositioned to the top of the graph to provide explanatory detail about the SAR symbols as the gaze moves from description to visualization.
The project has changed shape dramatically over its time in development; the only similarity remaining from the first few drafts is the concept. Peer and professor reception were taken into account when the first two drafts were presented, and it was apparent that some ingenuity would be necessary to make this dataset engaging as a visualization since, admittedly, tax data doesn't necessarily make for brilliant visualizations. As per a suggestion during the first evaluation, a long (vertical/digital) leporello format was adopted for its familiarity in the realm of infographics and for the amount of space it provides. The timeseries, per the advice of the professor, was log transformed, though on the first attempt I did not understand why the transformation and subsequent graphing failed. The issue was traced to NaNs appearing in the dataset such that only a few points showed up in the graph. Upon approaching the task again later, it was realized that negative values become NaN when log transformed. Thus the sign of each value was extracted prior to log transforming, the absolute value was log transformed, and the sign was reapplied to arrive at the dataset that appears in the final timeseries.
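The signed-log trick described above can be demonstrated in two lines (a minimal illustration; the logT() helper later in the document does the same thing):
Signed log illustration
log(-50)  # NaN with a warning: log() is undefined for negative values
sign(-50) * log(abs(-50))  # ~ -3.91: magnitude compressed, sign preserved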
Peer Feedback
After posting the 2nd-to-final draft on NuVustudio fairly late in the process, it was a pleasant surprise to see significant and well-thought-out peer reviews appear in the comment thread on short notice. Each of the three commenters' feedback was taken into account and incorporated into the final result, as described in the revision notes above.
The chunk below uses the cluster labels that received the most votes in the analysis to create a long format edge data frame.
Txn Link DF
best.clustermethod <- names(emClusters.best[order(emClusters.best,decreasing = T)][1]) %>% as.name # convert the column name of the cluster labels to a name for use in future dplyr function calls
em2017.net <- rbind(
matrix(data=c(
rep("2017",5),
1:5,
em2017.clusters %>% group_by(!!best.clustermethod) %>% summarize(amount=sum(Amount)) %>% .[["amount"]] %>% as.character), # bind year to cluster & amount
ncol=3,dimnames = list(row=NULL,col=c("from","to","amount"))),
em2017.clusters %>% group_by(!!best.clustermethod,Category) %>% summarise(amount=sum(Amount)) %>% rowwise %>% mutate(to=paste(!!best.clustermethod,Category,sep="_")) %>% select(!!best.clustermethod,to,amount) %>% as.matrix, # cluster to category & amount
em2017.clusters %>% group_by(!!best.clustermethod,Category,Subcategory) %>% summarise(amount=sum(Amount)) %>% rowwise %>% mutate(from=paste(!!best.clustermethod,Category,sep="_")) %>% mutate(to=paste(!!best.clustermethod,Category,Subcategory,sep="_")) %>% select(from,to,amount) %>% as.matrix # cat to subcat & amt
) # Create an edge data frame from the results of the clustering
em2017.net %<>% as.data.frame()
em2017.net$amount %<>% as.character %>% as.numeric
# Remove periods from names bc they interfere with javascript variable names
em2017.net$to %<>% gsub("\\&"," ",.) %>% as.factor
em2017.net[,sapply(em2017.net,is.factor)] %<>% lapply(snakecase::to_parsed_case)
em2017.net$amount <- em2017.net$amount %>% abs %>% scale %>% percent_rank(.)*30
em2017.net$to[str_detect(em2017.net$to,"office\\/")] %<>% gsub("\\(\\w+\\/\\w+\\_\\w+\\)","",.)
save(em2017.net,file="em2017.net.RData")
Google Sheets was used in initial runs to manipulate the data into the format necessary for the edge data frame. In later drafts the transformations were revised to be entirely reproducible in R (shown above).
Manually construct links in Google Sheets
## NOT RUN ----------------------- Mon Apr 09 10:52:10 2018
## ------------------------# Move to Google Sheets to
## construct link df
library(googlesheets)
gs_auth()
emgs <- gs_url("https://docs.google.com/spreadsheets/d/1rec4HN6L6oGMAHnmw5p-HYj44IgfsULQYGwNStp6xuY/edit#gid=0")
gs_ws_new(emgs, ws_title = "Links")
emgs <- gs_url("https://docs.google.com/spreadsheets/d/1rec4HN6L6oGMAHnmw5p-HYj44IgfsULQYGwNStp6xuY/edit#gid=0")
gs_edit_cells(emgs, ws = "Links", input = em2017.net)
gs_edit_cells(emgs, ws = "Links", anchor = "A56", input = em2017.clusters %>%
group_by(Red.ClusterM.euc, Category, Subcategory) %>% summarise(amount = sum(Amount)))
em.links <- gs_read(emgs, ws = "Links", range = "A1:C180")
save(em.links, file = "em.links.RData")
detach("package:googlesheets")
A custom method for generating subcategory colors is in the chunk below. In outline, the function converts each category color to Luv space, extracts its hue, chroma and luminance, counts that category's subcategories, and then generates one color per subcategory by varying luminance while holding hue and chroma fixed.
Configure Colors for Categories
load(file = "em2017.net.RData")
CatCols <- c(Automobile = "#F3756D", Travel = "#EF66A3", Business = "#E28925",
Education = "#C39B2D", Entertainment = "#99AA3A", Food = "#4EB548",
Health_Care = "#2DB456", Household = "#18BCC3", Insurance = "#19B6E9",
Other = "#469BD4", Personal = "#9488C0", Tax = "#D56CAB",
Software = "#B77AB4", Home_Office = "#23B991")
# Utilities=colorspace::mixcolor(alpha=.5,colorspace::hex2RGB(CatCols['Personal']),colorspace::hex2RGB(CatCols['Travel']))
# %>% colorspace::hex() %>% as.character Utilities was not
# included in the original graph, so it is coerced by
# blending the colors of the categories that are
# alphabetically before and after it such that it blends
# nicely when these colors are used to modify the vertbar
# graph
SubCols <- purrr::map2(.x = names(CatCols), .y = CatCols, em.links = em2017.net,
.f = function(.x, .y, em.links) {
nm <- .x
rgb <- t(grDevices::col2rgb(.y))/255
coords <- grDevices::convertColor(rgb, "sRGB", "Luv")
h <- atan2(coords[, "v"], coords[, "u"]) * 180/pi
chroma <- sqrt(coords[, "u"]^2 + coords[, "v"]^2)
l <- coords[, "L"]
cat(nm)
len <- em.links %>% filter(grepl(nm, from)) %>% .[["to"]] %>%
gsub("\\d\\_", "", .) %>% unique %>% length
subcols <- colorspace::sequential_hcl(len, h = h, c. = chroma,
l = c(l, 80), power = 1, gamma = NULL, fixup = TRUE,
alpha = 1)
# rsubcols <- sapply(subcols,grDevices::col2rgb) out <-
# vector() nc <- ifelse(ncol(rsubcols)>0, ncol(rsubcols),
# length(rsubcols)) for (i in 1:nc) { jsrgb <-
# paste0(rsubcols[,i],collapse=',') out[i] <-
# paste0('rgb(',jsrgb,')') %>% print }
out <- subcols
plot(x = 1:length(out), y = 1:length(out), pch = 19,
cex = 5, col = subcols, main = nm)
names(out) <- em.links %>% filter(grepl(nm, from)) %>%
.[["to"]] %>% gsub("\\d\\_", "", .) %>% unique
return(out)
})
## Automobile
## Travel
## Business
## Education
## Entertainment
## Food
## Health_Care
## Household
## Insurance
## Other
## Personal
## Tax
## Software
## Home_Office
# subcols <- sapply(seq(1,100,length.out =
# len),function(x)scales::col2hcl(cc,h=h,c=chroma,l=x)) %>%
# as.vector() tcol <- colorspace::sequential_hcl(3, c. =
# t(col2rgb('#4A6FE3')), l = c(80, 0), power = c(1/5, 1),
# gamma = NULL, fixup = TRUE, alpha = 1)
# plot(x=1:length(tcol), y=1:length(tcol), pch=19, cex=5,
# col=tcol)
plot(x = 1:15, y = 1:15, pch = 19, cex = 5, col = colorspace::sequential_hcl(3,
c. = t(col2rgb("#4A6FE3")), l = c(40, 100), power = c(1/5,
1), gamma = NULL, fixup = TRUE, alpha = 1)) # Generates colors with deviation in luminosity
A node df with color properties of each of the values in the edge df is constructed from the output of the function above.
Create Nodes DF
em.nodes <- data.frame(id = c("2017", em2017.net[, 2, drop = T]))
property <- c(`2017` = "white", `1` = "red", `2` = "green", `3` = "blue",
`4` = "yellow", `5` = "purple", CatCols, SubCols %>% unlist)
# findprop <- function(x,property){
# property[grep(as.character(x),names(property),fixed =
# T)[1]]} em.nodes$prop <- sapply(em.nodes$id %>%
# as.character %>%
# gsub('\\d\\_','',.),findprop,property=property) # Add
# Color
em.nodes %<>% rowwise %>% mutate(group = names(property)[grep(id %>%
as.character %>% gsub("\\d\\_", "", .), names(property),
fixed = T)[1]])
em.nodes$group[1:3] <- c("2017", "1", "2") # Add the group such that colors can be accessed in the networkD3 call
em.nodes$Radius <- c(10, em2017.net$amount) %>% scales::rescale(c(1,
9)) # 10 for the 2017 node
# View(em.nodes)
Thanks to sound advice (much appreciated!), ggnet is used to plot the edge and node data frames successfully.
Create ggnet graph
library(ggnet)
library(network)
# em2017.net <- em2017.net[-1,]
em2017.net <- rbind(c(from = 1, to = 2017, amount = 10), em2017.net)
em2017.net %<>% inner_join(em.nodes[, 1:2], by = c(to = "id"))
em2017.net$amount %<>% scales::rescale(c(1, 15)) # rescale amounts such that they map nicely to sizes in the network graph
nms <- names(CatCols)
label.trim <- function(col1, nms) {
labels <- gsub(paste0(regex(nms), collapse = "|"), "", col1) %>%
gsub("^_", "", .) %>% gsub("\\_+", " ", .) %>% gsub("^\\s",
"", .) %>% substr(1, 10)
labels[nchar(labels) < 1] <- col1[which(nchar(labels) < 1)] %>%
substr(1, 10)
return(labels)
} # Trim the labels down to 10 characters each to avoid too much text on the network graph
em2017.net %<>% mutate(Label = label.trim(group, nms))
net <- network(em2017.net[, c(1, 2)], matrix.type = "edgelist",
vertex.attr = list(group = em2017.net[order(em2017.net$to),
]$group, radius = em2017.net[order(em2017.net$to), ]$amount,
label = em2017.net[order(em2017.net$to), ]$Label))
ggnet2(net, alpha = 0.75, size = "radius", max_size = 15, color = "group",
palette = property, edge.alpha = 0.5, label = "label", label.size = 3,
layout.par = list(cell.pointpointrad = 2)) + theme(legend.position = "none")
Cleaning data in preparation for the timeseries. This was actually used in the original timeseries on the first draft; later iterations use the Expense Manager data employed in the prior sections of the full analysis.
Timeseries Tx Data prep
## [1] "Date" "Desc" "Category" "Amt"
## [1] "Date" "Desc" "Category" "Amt"
Tx <- rbind(cbind(usaa, Acct = rep("C", nrow(usaa))), cbind(usaa.visa,
Acct = rep("V", nrow(usaa.visa)))) # Add a factor to label the source of each txn, V for visa, C for checking.
Tx$Date %<>% lubridate::mdy()
Tx$Category %<>% as.factor()
Tx$Amt %<>% as.numeric
tx.fac <- function(x) {
sapply(x, function(x) {
if (x == 1000 | x == 1500) {
x <- "Support"
return(x)
} else if (x < 0) {
x <- "Expense"
return(x)
} else if (x > 0) {
x <- "Income"
return(x)
}
})
}
Tx <- Tx %>% dplyr::mutate(G.L = tx.fac(Amt)) # Add a factor to label whether the txn is a gain or a loss, or support income G.L
Tx <- Tx %>% filter(Category != "Credit Card Payments") # Filter out all values that deduct from checking account to pay visa balance as they are already accounted for in visa data
sapply(Tx, class)
## Date Desc Category Amt Acct G.L
## "Date" "character" "factor" "numeric" "factor" "character"
# ----------------------- Sun Apr 01 17:16:42 2018
# ------------------------# Attribute appropriate classes to
# features
Tx$G.L %<>% as.factor
# ----------------------- Fri Apr 06 15:29:51 2018
# ------------------------# Edit 1: Use log transformation to
# plot extreme values. All log transformations introduce NaN;
# sticking with outlier removal method
A visualization of the timeseries. Most transactions fall within a small range of amounts while some outliers exist, forcing a trade-off: either a scale where the outliers are visible but all other points are overplotted, or some granularity in the points with the outliers cut off. Ultimately, a log transformation was used in a later chunk to better visualize the data.
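The spread that forces this trade-off can be checked directly; a quick sketch, assuming the Tx data frame built above:
Amount spread sketch
quantile(Tx$Amt, probs = c(0.01, 0.25, 0.5, 0.75, 0.99))  # the bulk is narrow, the tails are long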
Vis Timeseries 1st Draft
limits <- c(-1800, 1600)
bstn.exp <- Tx %>% filter(Date > lubridate::mdy("08-03-2017") &
Amt < 0 & Amt > -1000)
bstn.lm <- lm(Amt ~ Date, data = bstn.exp)
ts.colors <- RColorBrewer::brewer.pal(n = 11, name = "RdYlGn")
names(ts.colors)[c(1, 10, 11)] <- c("Expense", "Support", "Income")
(Timeseries <- ggplot(data = Tx, mapping = aes(x = Date, y = Amt)) +
geom_smooth(method = "lm", formula = y ~ x, data = bstn.exp,
colour = ts.colors[11]) + geom_line(mapping = aes(color = G.L),
alpha = 1) + geom_vline(xintercept = lubridate::mdy("08-03-2017"),
color = ts.colors[2]) + geom_text(aes(x = lubridate::mdy("08-03-2017"),
label = "Relocation", y = 1000), colour = ts.colors[2], angle = 90,
vjust = 1.2, size = 4) + geom_vline(xintercept = lubridate::ymd("2017-09-15"),
color = ts.colors[2]) + geom_text(aes(x = lubridate::ymd("2017-09-15"),
label = "Tuition", y = 1000), colour = ts.colors[2], angle = 90,
vjust = 1.2, size = 4) + geom_point(data = Tx %>% filter(Category ==
"Investment Income"), color = ts.colors[4], size = 1, shape = 23,
fill = ts.colors[11]) + scale_y_continuous(name = "Amount",
breaks = seq(limits[1], limits[2], by = 200), minor_breaks = seq(limits[1],
limits[2], by = 100), limits = limits) + scale_x_date(date_breaks = "1 month",
date_minor_breaks = "1 week", date_labels = "%m") + scale_color_manual(values = ts.colors[c(1,
10, 11)], name = "Gain/Loss") + scale_alpha(range = c(1,
1)) + labs(title = "2017 Income & Expenses", subtitle = "Investment savings as points on Expenses",
caption = "", x = "Month Number", y = "Amount ($)") + theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) + theme_minimal() +
theme(legend.position = "bottom", legend.margin = margin(rep(0,
4), "pt"), plot.margin = margin(rep(0, 4), "pt")))
# ggplot2::ggsave(filename=paste('Timeseries','.pdf',sep=''),plot=last_plot(),
# path='C:\\Users\\Stephen\\Documents\\Northeastern\\Git\\ppua5302\\Project
# 3\\Plots',device = 'pdf',width = 9, height = 1/3*17,
# units = 'in')
The data is transformed using a log transformation whereby the sign of each amount is reapplied to the log value after transformation (as log transformation does not work on negative values). Duplicates and NAs were introduced by the transformation and had to be removed accordingly.
Prepare Timeseries Data
em2017[rownames(em2017) %in% c(428:431), ]$Amount <- em2017[rownames(em2017) %in%
c(428:431), ]$Amount * -1 # Fix Electric to be an expense
emTS <- em2017 %>% left_join(em2017.clusters, by = c(Amount = "Amount",
Category = "Category", Subcategory = "Subcategory")) %>%
rename(Date = Date.x, Payment.Method = Payment.Method.y,
Cluster = !!best.clustermethod) %>% select(one_of(names(.)[c(1:5,
7, 8, 13)])) %>% mutate_at(.vars = "Cluster", .funs = function(x) {
sapply(x, FUN = function(x) {
if (is.na(x)) {
x <- 6
} else {
x
}
})
}) %>% mutate_at(.vars = "Cluster", .funs = as.factor) %>% unique
# Join the best cluster to the original data (because date
# was lost in the cluster analysis) - Rename the columns to
# intuitive names & select only those that are necessary for
# the timeseries. Mutate the cluster column to make a new
# Miscellaneous cluster 6 for all txns that were not included
# in the cluster analysis, filter the result for unique
# values.
emTS[duplicated(emTS[, c("Date", "Amount", "Category", "Subcategory")]),
] %>% nrow
# Above is the # of txns that were duplicated in the process
emTS <- emTS[not(duplicated(emTS[, c("Date", "Amount", "Category",
"Subcategory")])), ] # Filter them
logT <- function(num) {
# Log transform that works for negative values
log(abs(num)) * sign(num)
# if(sign(num)==-1){ lnum <- log(abs(num))*-1 }else{lnum <-
# log(num)} return(lnum)
}
emTS %<>% mutate(LogT = logT(Amount)) # Log transform the data to account for outliers using the function above that reapplies the sign to the amount after the log transform.
emTS$LogT %>% sign %>% table
emTS$Amount %>% sign %>% table
save(emTS, file = "emTS.RData")
The second run of the timeseries visualization was included in the 2nd-to-final draft but removed per peer feedback, as it didn't include any additional information that could not already be derived from the other visualizations present.
Vis Timeseries
limits <- c(-1700, 1200)
load(file = "emTS.Rdata")
(Timeseries <- ggplot(data = emTS, mapping = aes(x = Date, y = Amount)) +
geom_point(mapping = aes(color = Category, shape = Cluster),
alpha = 1) + scale_y_continuous(name = "Amount", breaks = seq(limits[1],
limits[2], by = 200), minor_breaks = seq(limits[1], limits[2],
by = 100), limits = limits) + scale_x_date(date_breaks = "1 month",
date_minor_breaks = "1 week", date_labels = "%m") + geom_vline(xintercept = lubridate::mdy("08-03-2017"),
color = ts.colors[2]) + geom_text(aes(x = lubridate::mdy("08-03-2017"),
label = "Relocation", y = -6), colour = ts.colors[2], angle = 90,
vjust = 1.2, size = 4) + labs(title = "2017 Income & Expenses",
subtitle = "Category by Color, Clusters by Shape", caption = "",
x = "Month Number", y = "Amount($)") + theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) + theme_minimal() +
theme(legend.position = "bottom", legend.margin = margin(rep(0,
4), "pt"), plot.margin = margin(rep(2, 4), "pt"), legend.key.size = unit(28,
"pt"), legend.text = element_text(size = 14), legend.title = element_text(size = 14),
plot.title = element_text(size = unit(18, "pt")), plot.subtitle = element_text(size = unit(14,
"pt"))) + guides(col = F, shape = guide_legend(nrow = 1)))
# ----------------------- Mon Apr 23 14:05:57 2018
# ------------------------# for creating legend
# guide_legend(override.aes = list(size = 10)
# ggplot2::ggsave(filename=paste('Timeseries','.pdf',sep=''),plot=last_plot(),
# path='C:\\Users\\Stephen\\Documents\\Northeastern\\Git\\ppua5302\\Project
# 3\\Plots',device = 'pdf',width = 10.5, height = 4, units
# = 'in')
The log transformed time series is visualized in the chunk below. Many aspects of the chart were set manually.
Vis Timeseries Log Transformed
load(file="emTS.Rdata")
(TimeseriesLog <- ggplot(data = emTS,mapping=aes(x=Date,y=LogT))+
geom_point(mapping = aes(color=Category,shape=Cluster),alpha=1,size=2)+
#scale_y_continuous(name="LogT",breaks=seq(limits[1], limits[2], by = 200),minor_breaks = seq(limits[1], limits[2], by = 100), limits=limits)+
scale_x_date(date_breaks = "1 month",date_minor_breaks = "1 week", date_labels = "%m")+
geom_vline(xintercept = lubridate::mdy("08-03-2017"), color=ts.colors[2])+
geom_text(aes(x=lubridate::mdy("08-03-2017"), label="Relocation", y=-6), colour=ts.colors[2], angle=90, vjust = 1.2, size=4)+
scale_shape_manual(values=c('1' = 21, '2'= 17, '3'= 19,'4'= 23,'5'=11, '6' = 8))+
labs(title = "2017 Income & Expenses Log Transformed",
caption = "",
x = "Month Number",y = "Log(Amount ($))") +
theme(plot.title = element_text(hjust = .5),plot.subtitle = element_text(hjust = .5))+
theme_minimal()+
theme(legend.position = "bottom",legend.margin = margin(rep(0,4),"pt"),plot.margin = margin(rep(2,4),"pt"),legend.key.size = unit(22,"pt"), legend.text = element_text(size=14),legend.title = element_text(size=14), plot.title = element_text(size=unit(18,"pt")), plot.subtitle = element_text(size=unit(14,"pt")))+
guides(col = F,shape = guide_legend(nrow=1,override.aes = list(size=3))))
# ----------------------- Mon Apr 23 14:05:57 2018 ------------------------#
# for creating legend
#guide_legend(override.aes = list(size = 10)
ggplot2::ggsave(filename=paste("TimeseriesLog",".pdf",sep=""),plot=last_plot(), path="C:\\Users\\Stephen\\Documents\\Northeastern\\Git\\ppua5302\\Project 3\\Plots",device = cairo_pdf,width = 10.5, height = 4, units = "in")
Granular detail on the differences in cost of living between Asheville and Boston is computed in the table below.
Cost of Living Comparison
em2017.avl <- em17.tr %>% filter(Amount < 0) # Filter out Income
em2017.avl %>% sapply(function(x) {
any(is.na(x))
}) # Check for NA
## Date Amount Category Subcategory Payment.Method
## FALSE FALSE FALSE FALSE FALSE
## Payee.Payer Account Tag Deductible
## FALSE FALSE FALSE FALSE
em2017.avl$Date <- em2017 %>% filter(Amount < 0) %>% .$Date # Transform the date column to Date class
sapply(em2017.avl, class) #check classes
## Date Amount Category Subcategory Payment.Method
## "Date" "numeric" "factor" "factor" "factor"
## Payee.Payer Account Tag Deductible
## "factor" "factor" "factor" "factor"
em2017.avl %<>% mutate(Loc = ifelse(Date > lubridate::mdy("08-04-2017"),
"BSN", "AVL")) #Add a factor for location
em2017.avl[em2017.avl$Amount == -2550, ]$Loc <- "BSN"
em2017.avl$Loc %<>% factor() # Make it a factor
(em2017.col <- list(AVL = list(Data = em2017.avl %>% filter(Loc ==
"AVL") %>% mutate(Mon = lubridate::month(Date)) %>% group_by(Mon,
Category) %>% summarise(MonSum = abs(sum(Amount))) %>% ungroup %>%
group_by(Category) %>% summarise(MonAvg = mean(MonSum))),
BSN = list(Data = em2017.avl %>% filter(Loc == "BSN") %>%
mutate(Mon = lubridate::month(Date)) %>% group_by(Mon,
Category) %>% summarise(MonSum = abs(sum(Amount))) %>%
ungroup %>% group_by(Category) %>% summarise(MonAvg = mean(MonSum))))) # Create a list of the monthly averages by category per location
## $AVL
## $AVL$Data
## # A tibble: 12 x 2
## Category MonAvg
## <fct> <dbl>
## 1 Automobile 87.4
## 2 Business 37.5
## 3 Education 172.
## 4 Entertainment 16.6
## 5 Food 282.
## 6 Health Care 113.
## 7 Home Office 40.9
## 8 Household 651.
## 9 Other 33.8
## 10 Personal 69.3
## 11 Software 22
## 12 Tax 104
##
##
## $BSN
## $BSN$Data
## # A tibble: 12 x 2
## Category MonAvg
## <fct> <dbl>
## 1 Automobile 50.2
## 2 Business 173.
## 3 Education 6908.
## 4 Entertainment 22.2
## 5 Food 498.
## 6 Health Care 141
## 7 Household 1632.
## 8 Insurance 120.
## 9 Other 270
## 10 Personal 66.1
## 11 Travel 57.2
## 12 Utilities 63.4
## LM Not necessary, monthly averages can be compared directly
## for (i in seq_along(em2017.col)) { # Add Linear Models to
## the lists per location em2017.col[[i]]$LM <- lm(MonAvg ~ 0
## + Category,data=em2017.col[[i]]$Data) } This isn't the best
## way to establish the difference in cost of living em2017.col
## <- em2017.avl %>% group_by(Loc, Category) %>%
## summarize(Total=abs(sum(Amount))) %>% spread(key =
## Loc,value = Total) Or perhaps it is since the # of months
## differ.
col.df <- data.frame(Category = c(em2017.col[["AVL"]][["Data"]][["Category"]] %>%
levels, em2017.col[["BSN"]][["Data"]][["Category"]] %>% levels) %>%
unique) %>% left_join(em2017.col[["BSN"]][["Data"]]) %>%
left_join(em2017.col[["AVL"]][["Data"]], by = "Category",
suffix = c(".bsn", ".avl")) %>% subset(subset = !apply(.[,
2:3], 1, FUN = function(x) all(is.na(x)))) %>% mutate_all(.funs = function(co) {
sapply(co, function(x) {
if (is.na(x)) {
x <- 0
} else {
x <- x
}
})
}) %>% mutate(PctDiff = round(subtract(MonAvg.bsn, MonAvg.avl)/MonAvg.avl *
100)) # Combine the averages with a column for each place, filter where NA in both columns. Change NA to 0. Create a column with the percent diff.
# apply(col.df[,2:3],1,FUN=function(x)all(is.na(x))) #
# Function to find the rows where all values are NA, used in
# subset above
Per Christian's interest, the average cost-of-living difference and the change in median income are shown below.
col.df %>% filter(Category != "Education") %>% .$PctDiff %>%
.[is.finite(.)] %>% mean # Average cost of Living difference
## [1] 90.72727
## [1] 0.7083059
The original stacked bar graph from the 1st draft remains in the chunk below.
Stacked Bar of Primary Expense Categories
em2017.vertbar <- em2017.avl %>% filter(Amount < 0) %>% mutate(Mon=lubridate::month(Date)) %>% group_by(Mon,Category,Loc) %>% summarise(MonSum=abs(sum(Amount))) %>% ungroup %>% group_by(Loc,Category) %>% summarise(MonAvg=mean(MonSum))
em2017.sumexp <- sum(em2017.vertbar$MonAvg)
(Vertbar <- ggplot(data=em2017.vertbar[order(em2017.vertbar$MonAvg, decreasing = F),], mapping=aes(x=Loc,y=MonAvg,fill=Category))+
geom_bar(stat="identity",
position="stack")+
#geom_text(size = 3, position = position_stack(vjust = 0.5),mapping = aes(label=paste(round(MonAvg)," | ",round(MonAvg/em2017.sumexp*100),"%",sep="")), hjust=.6)+
geom_label(position=position_stack(vjust = 1),hjust=0.5,
mapping=aes(label=paste0(Category,"\n",round(MonAvg)," | ",round(MonAvg/em2017.sumexp*100),"%"),lineheight = .7),
size=3,
label.padding = unit(0, "lines"),
label.size= unit(0, "lines"),
label.r=unit(0, "lines"))+
scale_y_continuous(expand=c(0.009,0))+
labs(title = "Cost of Living\nAsheville & Boston",
subtitle = "",
caption = "",
x = "Location",y = "Monthly Average ($)") +
theme(plot.title = element_text(hjust = .5),plot.subtitle = element_text(hjust = .5),legend.position = "none", axis.text.y = element_blank(), axis.ticks.y = element_blank()))
This is the facet wrap of the expense categories used in the final draft. It was chosen because the direct comparison of categories was found to be more effective at visually communicating the sizeable differences in cost of living between Boston and Asheville. The plot builds on previous efforts and took quite a bit of fine-tuning; hence the commented-out portions of the ggplot call.
Facet Wrap of Primary Expense Categories
#em2017.vertbar <- em2017.avl %>% filter(Amount < 0) %>% mutate(Mon=lubridate::month(Date)) %>% group_by(Mon,Category,Loc) %>% summarise(MonSum=abs(sum(Amount))) %>% ungroup %>% group_by(Loc,Category) %>% summarise(MonAvg=mean(MonSum))
em2017.vertbar <- col.df %>% filter(!Category %in% (c("Education","Software","Tax","Utilities","Travel"))) %>% mutate_at(.vars="Category",.funs=snakecase::to_parsed_case)
em2017.vertbar <- em2017.vertbar[order(em2017.vertbar$MonAvg.bsn, decreasing = F),] %>% subset(subset=!apply(.[,2:3],1,FUN=function(x)all(is.na(x))))
#em2017.vertbar %<>% rbind(list("Education",0,0,0)) %>% cbind(Education=c(rep(NA,(nrow(.)-1)),1913.33)) %>% mutate_at(.vars="Category",.funs=as.factor)
em2017.sumexp <- sum(colSums(em2017.vertbar[,c("MonAvg.bsn","MonAvg.avl")]))
`+.uneval` <- function(a,b) {
`class<-`(modifyList(a,b), "uneval")
} # A function necessary to use aes_string and aes in the same geom call
(Vertbar <- ggplot(data=em2017.vertbar[,c(1:4)], mapping=aes(fill=Category))+
geom_bar(stat="identity",
position="stack",aes_string(x=shQuote("Boston"), y="MonAvg.bsn"))+
geom_bar(stat="identity",
position="stack",aes_string(x=shQuote("Asheville"), y="MonAvg.avl"))+
# geom_bar(stat="identity",
# position="stack",aes_string(x=shQuote("Education"), y="Education"))+
#geom_text(size = 3, position = position_stack(vjust = 0.5),mapping = aes(label=paste(round(MonAvg)," | ",round(MonAvg/em2017.sumexp*100),"%",sep="")), hjust=.6)+
# geom_label(position=position_stack(vjust = 1),hjust=0.5,
# mapping=aes(label=paste0(Category,"\n",round(MonAvg.bsn)," | ",round(MonAvg.bsn/sum(MonAvg.bsn)*100),"%"),lineheight = .7)+aes_string(x=shQuote("Boston"),y="MonAvg.bsn"),
# size=3,
# label.padding = unit(0, "lines"),
# label.size= unit(0, "lines"),
# label.r=unit(0, "lines"))+
# geom_label(position=position_stack(vjust = 1),hjust=0.5,
# mapping=aes(label=paste0(Category,"\n",round(MonAvg.avl)," | ",round(MonAvg.avl/sum(MonAvg.avl)*100),"%"),lineheight = .7)+aes_string(x=shQuote("Asheville"),y="MonAvg.avl"),
# size=3,
# label.padding = unit(0, "lines"),
# label.size= unit(0, "lines"),
# label.r=unit(0, "lines"))+
facet_wrap(~Category, ncol=5,scales="fixed")+
geom_label(aes_string(x=shQuote("Asheville"))+aes(label=paste0(PctDiff,"%"),y=1000),size=7)+
scale_y_continuous(minor_breaks = seq(1900,2000,by=10),expand=c(0.009,0))+
scale_color_manual(values=CatCols)+
labs(title = "Cost of Living\nAsheville & Boston",
subtitle = "",
caption = "",
x = "Location",y = "Monthly Average ($)") +
theme(plot.title = element_text(hjust = .5),plot.subtitle = element_text(hjust = .5),legend.position = "none"))
As is evident from the chart above, the cost of education is astronomical compared to the cost of living in either location. Also notable are the significant differences in cost of living between Asheville (A) and Boston (B).
This was an idea for a visualization to include in the final draft. However, it was omitted because the comparison of cost of living at the category level provides more relevant comparisons for a viewer unfamiliar with the data and uninterested in the details of my expenses.
em.Payee <- emTS %>% group_by(Payee.Payer) %>% summarize(Total = sum(Amount),
Type = sign(Total), n = n()) %>% arrange(Total) %>% inner_join(emTS[,
c("Payee.Payer", "Category")], by = "Payee.Payer") %>% mutate_at(.vars = c("Category",
"Type"), .funs = as.factor) %>% mutate_at(.vars = "Total",
.funs = abs) %>% unique %>% subset(subset = not(duplicated(.[,
c("Payee.Payer", "Total", "n"), ]))) %>% mutate_at(.vars = "Payee.Payer",
.funs = function(x) {
gsub("(\\w+)\\s(\\w+).*", "\\1 \\2", x)
})
em.Payee$Type %<>% forcats::fct_recode(Expense = "-1", Income = "1") # %>% mutate_at(.vars='Total',.funs=scales::rescale,to=c(0,4))
# em.Payee %>% View  # interactive inspection only
ggplot(data = head(em.Payee[order(em.Payee$Total, decreasing = T),
], 15), aes(color = Type)) + geom_bar(stat = "identity",
aes(y = Total, x = reorder(Payee.Payer, Total), fill = Category,
alpha = Type)) + scale_color_manual(values = CatCols) +
scale_alpha_discrete(range = c(0.4, 1)) + coord_flip() +
theme(legend.position = "right") + guides(fill = F, alpha = guide_legend(title = "Alpha")) +
labs(title = "Top Payees/Payers", subtitle = "", caption = "",
x = "Payee/Payer", y = "Amount") + theme(plot.title = element_text(hjust = 0.5,
size = unit(18, "pt")), plot.subtitle = element_text(hjust = 0.5),
axis.title = element_text(size = unit(14, "pt")))
This section of the project has been abandoned because of the relocation that took place this year. The relocation and the consequent turbulence in expenses make it impossible to produce accurate projections for 2018, a year in which no move will take place. A model can indeed be built, but it will not have much predictive validity for future years.
Projections
library(mgcv)
em2017.avl.gam <- mgcv::gam(Amount ~ Loc + Date + Category, data = em2017.avl,
method = "P-ML", knots = 2)
summary(em2017.avl.gam)
# ----------------------- Fri Apr 06 15:29:25 2018
# ------------------------# TODO(Add geom_line with fitted
# values to timeseries graph)
ggplot(data = em2017.avl, mapping = aes(x = Date, y = Amount)) +
geom_line(data = data.frame(Date = em2017.avl$Date, Amount = em2017.avl.gam$fitted.values),
color = "red") + geom_line()
A minimal reproducible example created for posting an issue on the caretEnsemble GitHub. This also encountered an error, likely due to the formula argument being supplied as formula instead of form.
Minimal Reproducible Example
# Minimal reproducible example made for reporting errors in
# the above code to zachmayer/caretEnsemble github
bc <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data")
nms <- readLines("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names")[110:126] %>%
str_match("^\\\t?[a-z1-2]\\)\\s?(\\w+\\s?\\w+)") %>% na.omit %>%
.[, 2]
names(bc) <- c(nms[1:2], nms[3:12] %>% paste0(".mean"), nms[3:12] %>%
paste0(".se"), nms[3:12] %>% paste0(".wst")) %>% gsub("\\s",
"\\.", .)
rownames(bc) <- bc[, 1]
bc <- bc[, -1]
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
getDoParWorkers()
data.train <- caret::createMultiFolds(bc$Diagnosis, times = 2)
data.train <- caret::trainControl(method = "repeatedcv", index = data.train,
number = 10, repeats = 1, search = "grid", allowParallel = T,
classProbs = T, savePredictions = "all", returnResamp = "all",
summaryFunction = caret::twoClassSummary)
f <- as.formula(paste0("Diagnosis", " ~ ."))
mod.list <- caretEnsemble::caretList(form = f, data = bc,  # form, not formula: train()'s formal argument is form
trControl = data.train, methodList = c("svmRadial", "LogitBoost",
"adaboost", "C5.0"), metric = "ROC", tuneList = list(svmRadial = caretEnsemble::caretModelSpec(method = "svmRadial",
tuneGrid = tuneGrids$svmRadial), LogitBoost = caretEnsemble::caretModelSpec(method = "LogitBoost",
tuneGrid = tuneGrids$LogitBoost), adaboost = caretEnsemble::caretModelSpec(method = "adaboost",
tuneLength = 10), C5.0 = caretEnsemble::caretModelSpec(method = "C5.0",
tuneGrid = tuneGrids$C5.0)))
stopCluster(cl)
registerDoSEQ()
This was an attempt to manually create an ensemble model to be used for predictions because of the errors encountered in the first few attempts with caretEnsemble.
Manual Ensemble Model
tuneGrids$adaboost <- data.frame(nIter = seq(10, 100, 10))
set.seed(1)
req.packages <- c("doParallel", "kernlab", "caTools", "C50",
"parallel", "iterators", "MASS", "foreach", "caret", "tidyverse",
"dplyr", "htmltools", "magrittr", "repmis")
for (q in seq_along(req.packages)) {
suppressPackageStartupMessages(library(req.packages[q], character.only = T))
}
data.train <- caret::createDataPartition(matchedlevels$olddata[["Deductible"]],
times = 2, p = 0.85)
data.train <- caret::trainControl(method = "repeatedcv", index = data.train,
number = 10, repeats = 1, search = "grid", allowParallel = T,
classProbs = T, savePredictions = "all", summaryFunction = caret::twoClassSummary,
returnResamp = "all", verboseIter = T)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
getDoParWorkers()
ensmod.nms <- c("svmRadial", "LogitBoost", "adaboost", "C5.0")
ensmods <- vector("list", length(ensmod.nms))
for (i in seq_along(ensmod.nms)) {
ensmods[[ensmod.nms[i]]] <- caret::train(Deductible ~ .,
method = ensmod.nms[i], metric = "Accuracy", trControl = data.train,
tuneGrid = tuneGrids[[ensmod.nms[i]]], data = matchedlevels$olddata)
}
stopCluster(cl)
registerDoSEQ()
A demo of the igraph package creating the edge and node dfs used in an attempt to generate the networkD3 graph.
cluster visualization example
# http://kateto.net/network-visualization
Ex.1Nodes <- read.csv("~/Northeastern/Git/ppua5302/Project 3/Dataset1-Media-Example-NODES.csv")
Ex.1Links <- read.csv("~/Northeastern/Git/ppua5302/Project 3/Dataset1-Media-Example-EDGES.csv")
nrow(Ex.1Nodes)
length(unique(Ex.1Nodes$id))
nrow(Ex.1Links)
nrow(unique(Ex.1Links[, c("from", "to")]))
Ex.1Links <- aggregate(Ex.1Links[, 3], Ex.1Links[, -3], sum)
colnames(Ex.1Links)[4] <- "weight"
rownames(Ex.1Links) <- NULL
A demo used to orient to the syntax of ggnet and to the data structures that generate the network graphs.
ggnet example
r = "https://raw.githubusercontent.com/briatte/ggnet/master/"
# read nodes
v = read.csv(paste0(r, "inst/extdata/nodes.tsv"), sep = "\t")
e = read.csv(paste0(r, "inst/extdata/network.tsv"), sep = "\t")
names(e)
x = data.frame(Twitter = network.vertex.names(net))
x = merge(x, v, by = "Twitter", sort = FALSE)$Group
I was not able to get the custom node coloring via groups to work for networkD3, and thus abandoned this route of creating the network graph. Per the advice of Dr. Offenhuber, I used ggnet, which employs the familiar grammar-of-graphics syntax to plot the clustered transaction network graph. I believe the reason this did not work with networkD3 is that the group feature has function beyond aesthetic properties and actually also determines connections between nodes; thus the labels used for the groups, as shown below, are not represented in the edge df and the graph was blank.
Format names for networkD3
# ----------------------- Mon Apr 16 15:23:11 2018 ------------------------#
# https://community.rstudio.com/t/networkd3-customize-node-colors-based-on-the-existing-nodes/5594
# Fix the names to upper camel case will require revising em.links
paste0('"',names(property),'",',collapse="") %>% cat #for pasting into JS call
paste0('"',property,'",',collapse="") %>% cat
colJS <- networkD3::JS('d3.scaleOrdinal().domain(["2017","1","2","3","4","5","Automobile","Travel","Business","Education","Entertainment","Food","Health_Care","Household","Insurance","Other","Personal","Tax","Software","Home_Office","Utilities","Automobile_Fuel","Automobile_Parking","Automobile_Registration_tax","Automobile_Maintenance","Travel_Parking","Travel_Train","Travel_Car_Rental","Business_Application","Business_Home_Office","Business_Utilities","Business_Supplies","Business_Office_Expenses","Business_Wordpress_Plugins","Business_Insurance","Education_Tuition","Education_Article","Education_Books","Education_Supplies","Education_Transcripts","Education_Gre_Study","Entertainment_Podcast","Entertainment_Article","Entertainment_Concert","Entertainment_Movies","Entertainment_Books","Food_Bulk","Food_Groceries","Food_Restaurant","Food_Snack","Food_Tea","Food_Raw_Juice","Health_Care_Massage_Therapy","Health_Care_Health_Insurance","Health_Care_Medical","Health_Care_Meditation","Health_Care_Prescription","Health_Care_Genomics","Health_Care_Other","Health_Care_Yoga","Household_Appliance","Household_Consumables","Household_Home_Maintenance","Household_Miscellaneous_Household_Items","Household_Rent","Household_Household_Tools","Household_Other","Household_Gardening","Insurance_Auto","Insurance_Renter","Other_Cash_Back","Other_Other","Other_Charitable_Contributions","Personal_Clothing","Personal_Gift","Personal_Haircut","Personal_Personal_Care","Personal_Other","Personal_Donation","Tax_Other","Software_Productivity","Home_Office_Electronics","Home_Office_Computer","Home_Office_Office_Supply","Utilities_Gas"]).range(["white","red","green","blue","yellow","purple","#F3756D","#EF66A3","#E28925","#C39B2D","#99AA3A","#4EB548","#2DB456","#18BCC3","#19B6E9","#469BD4","#9488C0","#D56CAB","#B77AB4","#23B991","#C277B2","rgb(144,15,0)","rgb(192,69,59)","rgb(241,115,107)","rgb(255,161,153)","rgb(152,0,84)","rgb(219,81,144)","rgb(255,153,211)","rgb(126,49,0)","rgb(149,69,0)","rgb(173,89,0)","rgb(197,110,0)","rgb(221,132,26)","rgb(246,154,67)","rgb(255,177,97)","rgb(100,66,0)","rgb(126,89,0)","rgb(152,114,0)","rgb(179,139,0)","rgb(207,166,63)","rgb(235,193,99)","rgb(63,78,0)","rgb(92,108,0)","rgb(123,140,0)","rgb(156,173,63)","rgb(190,207,106)","rgb(0,89,0)","rgb(0,113,0)","rgb(2,139,0)","rgb(57,165,52)","rgb(90,192,86)","rgb(121,220,117)","rgb(0,91,0)","rgb(0,108,0)","rgb(0,126,0)","rgb(0,144,42)","rgb(0,163,67)","rgb(47,182,89)","rgb(76,202,110)","rgb(101,222,131)","rgb(0,90,96)","rgb(0,106,113)","rgb(0,123,130)","rgb(0,141,148)","rgb(0,159,166)","rgb(0,178,185)","rgb(46,197,204)","rgb(81,217,224)","rgb(0,87,138)","rgb(86,212,255)","rgb(0,80,138)","rgb(39,139,196)","rgb(133,206,255)","rgb(76,61,121)","rgb(99,86,143)","rgb(123,111,167)","rgb(149,137,193)","rgb(175,164,220)","rgb(203,191,247)","rgb(138,0,99)","rgb(113,41,110)","rgb(0,90,50)","rgb(0,152,113)","rgb(91,220,180)","rgb(121,33,106)"])')
A failed attempt to plot the network graph with igraph and networkD3.
Network Graph of Clustered Txn Data
library(networkD3)
net <- graph_from_data_frame(d = em2017.net, vertices = em.nodes,
directed = T)
wc <- cluster_walktrap(net)
members <- membership(wc)
net_d3 <- igraph_to_networkD3(net, group = em.nodes$group)
# Create force directed network plot
forceNetwork(Links = net_d3$links, Nodes = net_d3$nodes, Source = "source",
Target = "target", NodeID = "name", Group = "group", colourScale = colJS,
opacity = 0.5)
networkD3::forceNetwork(Links = em2017.net, Nodes = em.nodes,
Source = "from", Target = "to", NodeID = "id", Nodesize = "Radius",
Group = "group", colourScale = colJS, opacity = 1)
URL <- paste0("https://cdn.rawgit.com/christophergandrud/networkD3/",
"master/JSONdata//flare.json")
## Convert to list format ----------------------- Mon Apr 16
## 21:59:32 2018 ------------------------# not working, could
## be the radius size ----------------------- Mon Apr 16
## 22:01:36 2018 ------------------------#
library(igraph)
# Use igraph to make the graph and find membership
karate <- make_graph("Zachary")
wc <- cluster_walktrap(karate)
members <- membership(wc)
# Convert to object suitable for networkD3
karate_d3 <- igraph_to_networkD3(karate, group = members)
# Create force directed network plot
forceNetwork(Links = karate_d3$links, Nodes = karate_d3$nodes,
Source = "source", Target = "target", NodeID = "name", Group = "group")
A more efficient method is used above.
Deprecated method of finding cost of living
(ColTable <- data.frame(PctDiff = (abs(em2017.col[["BSN"]][["Data"]][["MonAvg"]][em2017.commonCats]) -
    abs(em2017.col[["AVL"]][["Data"]][["MonAvg"]][em2017.commonCats]))/abs(em2017.col[["BSN"]][["Data"]][["MonAvg"]][em2017.commonCats]) *
    100, AmtDiff = (abs(em2017.col[["BSN"]][["Data"]][["MonAvg"]][em2017.commonCats]) -
    abs(em2017.col[["AVL"]][["Data"]][["MonAvg"]][em2017.commonCats])))) # Compute the percent difference, and actual differences
round(ColTable$PctDiff)