Import Data

To begin, let’s start by importing the data and taking a quick look at what we’re working with.

file_path <- 'C:/Downloads/pave.dev tha/data-analyst/data.parquet'

df <- read_parquet(file_path)

glimpse(df)
## Rows: 4,139
## Columns: 8
## $ transaction_id          <chr> "tx_793434", "tx_2323260", "tx_792384", "tx_79~
## $ user_id                 <chr> "user_294", "user_723", "user_294", "user_294"~
## $ account_id              <chr> "account_0664", "account_7540", "account_0664"~
## $ amount                  <dbl> 2.77, 1.83, -200.00, 11.53, 36.09, 133.00, 5.4~
## $ date                    <date> 2017-07-25, 2018-12-07, 2018-06-06, 2017-06-1~
## $ transaction_description <chr> "DUNKIN #345683 Q35 SPRINGFIELD MO 07/24", "7-~
## $ merchant_name           <chr> "Dunkin' Donuts", "7-Eleven", NA, "Family Expr~
## $ merchant_category       <chr> "Chain Restaurant", "Chain Restaurant", NA, "C~
head(df)
## # A tibble: 6 x 8
##   transaction_id user_id  account_id    amount date       transaction_descripti~
##   <chr>          <chr>    <chr>          <dbl> <date>     <chr>                 
## 1 tx_793434      user_294 account_0664    2.77 2017-07-25 DUNKIN #345683 Q35 SP~
## 2 tx_2323260     user_723 account_7540    1.83 2018-12-07 7-Eleven              
## 3 tx_792384      user_294 account_0664 -200    2018-06-06 MO CHILD SUPPORT MO18~
## 4 tx_793593      user_294 account_0664   11.5  2017-06-12 FAMILY EXPRESS #63 SP~
## 5 tx_792481      user_294 account_0664   36.1  2018-05-10 SPEEDWAY 08333 750 IN~
## 6 tx_794343      user_294 account_0664  133    2016-11-21 CHECK # 1611          
## # ... with 2 more variables: merchant_name <chr>, merchant_category <chr>

Missing Data

Looks like there are some null values for the merchant info, let’s check for other nulls.

VIM::aggr(df, 
          col=c('green','red'), 
          numbers = T, 
          sortVars = T,
          cex.axis = .5,
          ylab=c("Proportion of Data", "Combinations and Percentiles")
          ) 

## 
##  Variables sorted by number of missings: 
##                 Variable     Count
##            merchant_name 0.4189418
##        merchant_category 0.4189418
##           transaction_id 0.0000000
##                  user_id 0.0000000
##               account_id 0.0000000
##                   amount 0.0000000
##                     date 0.0000000
##  transaction_description 0.0000000

The merchant fields are the only ones with null values, with 42% of transactions in this dataset missing merchant info.

Understanding Missing Data

In the first few rows, it looks like child support and checks were missing these fields, which makes sense as all transactions aren’t always related to purchases. Let’s take a look at other transactions without merchants.

# transactions without merchant data
df %>%
  filter(is.na(merchant_name)) %>%
  select(transaction_description, amount)
## # A tibble: 1,734 x 2
##    transaction_description                                 amount
##    <chr>                                                    <dbl>
##  1 MO CHILD SUPPORT MO18000 XXXXXXXXXX CCD ID: XXXXXXXXXX -200   
##  2 CHECK # 1611                                            133   
##  3 SMARTSTYLE 17909 SPRINGFIELD MO 11/08                    40.1 
##  4 ZOOM TAN                                                 20.0 
##  5 TRANSFER FROM CHK XXXXXXXXX                            -250   
##  6 Aegon Online Pmt Organa,Leia -                            9.95
##  7 ATM MAC 839 BEDFORD                                     102.  
##  8 POS DEBIT TAQUERIA AND RESTAURAN                         31.6 
##  9 INTERNET TRF TO CCA XXXXXXXXXXXXXXXX 2819                30   
## 10 Antone's                                                  7.35
## # ... with 1,724 more rows

So it looks like the merchant fields aren’t completely reliable. There’s clearly transactions related to purchases that don’t have the merchant info provided in the table.

Also, it looks like there’s transactions types that might be interesting to tag later on:

  • child support payments
  • paper checks
  • atm tx
  • overdraft fees
  • quickpay w/ zelle

Missing Merchants by Account

Just curious if the level of missing merchants is an average across the board or associated with a specific account.

# missing merchant user/account summary
df %>%
  group_by(user_id, account_id) %>%
  summarize(missing_merchant_pct = sum(is.na(merchant_name)) / n(),
            tx_count = n())
## # A tibble: 4 x 4
## # Groups:   user_id [3]
##   user_id  account_id   missing_merchant_pct tx_count
##   <chr>    <chr>                       <dbl>    <int>
## 1 user_294 account_0664                0.419     2501
## 2 user_294 account_6137                1          111
## 3 user_723 account_7540                0.343     1309
## 4 user_815 account_7221                0.578      218

It looks like we have 3 users total. One user has two accounts and the other users have one account. It looks like acct 6137 doesn’t have any merchant info at all, while the other accounts all have a considerable chunk of merchants null.

Let’s take a peek at that account with zero merchant data.

# check account with no merchant info at all
df %>%
  filter(account_id == 'account_6137') %>%
  select(date, transaction_description, amount) %>%
  arrange(date, amount)
## # A tibble: 111 x 3
##    date       transaction_description                                     amount
##    <date>     <chr>                                                        <dbl>
##  1 2016-11-14 Online Transfer to CHK ....XXXX transaction#: XXXXXXXXXX 1~     50
##  2 2016-11-16 Online Transfer to CHK ....XXXX transaction#: XXXXXXXXXX 1~    190
##  3 2016-11-18 TRANSFER FROM CHK XXXXXXXXX                                   -250
##  4 2016-11-28 Online Transfer to CHK ....XXXX transaction#: XXXXXXXXXX 1~    260
##  5 2016-12-02 TRANSFER FROM CHK XXXXXXXXX                                   -250
##  6 2016-12-08 Online Transfer to CHK ....XXXX transaction#: XXXXXXXXXX 1~    250
##  7 2016-12-16 TRANSFER FROM CHK XXXXXXXXX                                   -250
##  8 2016-12-16 Online Transfer to CHK ....9732 transaction#: XXXXXXXXXX 1~    250
##  9 2016-12-30 TRANSFER FROM CHK XXXXXXXXX                                   -250
## 10 2017-01-06 Online Transfer to CHK ....XXXX transaction#: XXXXXXXXXX 0~    250
## # ... with 101 more rows

Okay, so it looks like a bank account. Deposits coming in and then the same amount being sent out within the next few days.

The same amount gets sent out on the same day quite often. It’s possible that there’s a direct deposit directly into savings, and this person is having trouble keeping it there.

User 294

Let’s check out this user’s activity across both accounts.

# income only
df %>%
  filter(user_id == 'user_294' & amount < 0) %>%
  select(date, transaction_description, amount, merchant_name, merchant_category) %>%
  arrange(date, amount)
## # A tibble: 281 x 5
##    date       transaction_description      amount merchant_name merchant_catego~
##    <date>     <chr>                         <dbl> <chr>         <chr>           
##  1 2016-11-14 Online Transfer from SAV ..~   -50  <NA>          <NA>            
##  2 2016-11-16 Online Transfer from SAV ..~  -190  <NA>          <NA>            
##  3 2016-11-18 THE METHODIST HO PR PAYMENT~ -2475. <NA>          <NA>            
##  4 2016-11-18 TRANSFER FROM CHK XXXXXXXXX   -250  <NA>          <NA>            
##  5 2016-11-23 MO CHILD SUPPORT MO18000 XX~  -200  <NA>          <NA>            
##  6 2016-11-28 Online Transfer from SAV ..~  -260  <NA>          <NA>            
##  7 2016-12-02 THE METHODIST HO PR PAYMENT~ -2015. <NA>          <NA>            
##  8 2016-12-02 TRANSFER FROM CHK XXXXXXXXX   -250  <NA>          <NA>            
##  9 2016-12-07 MO CHILD SUPPORT MO18000 XX~  -200  <NA>          <NA>            
## 10 2016-12-08 Online Transfer from SAV ..~  -250  <NA>          <NA>            
## # ... with 271 more rows

I went through a few pages and it looks like this person has 3 sources of regular income:

  • ~2k biweekly, the methodist hospital – job/work
  • 200 biweekly, child support
  • movement between account 6137, which can now be determined as a savings account
# payments only
df %>%
  filter(user_id == 'user_294' & amount > 0) %>%
  select(date, transaction_description, amount, merchant_name, merchant_category) %>%
  arrange(date, amount)
## # A tibble: 2,331 x 5
##    date       transaction_description     amount merchant_name merchant_category
##    <date>     <chr>                        <dbl> <chr>         <chr>            
##  1 2016-11-14 iTunes                        1.06 iTunes        ""               
##  2 2016-11-14 DUNKIN #308715 Q35 SPRINGF~   2.66 Dunkin' Donu~ "Chain Restauran~
##  3 2016-11-14 DUNKIN #352549 Q COLUMBIA ~   2.77 Dunkin' Donu~ "Chain Restauran~
##  4 2016-11-14 DUNKIN #336784 Q35 JOPLIN ~   2.77 Dunkin' Donu~ "Chain Restauran~
##  5 2016-11-14 DUNKIN #352549 Q COLUMBIA ~   2.77 Dunkin' Donu~ "Chain Restauran~
##  6 2016-11-14 DUNKIN #345683 Q35 SPRINGF~   3.09 Dunkin' Donu~ "Chain Restauran~
##  7 2016-11-14 Wendy's                       8.34 Wendy's       "Chain Restauran~
##  8 2016-11-14 FAMILY EXPRESS #57 SEDALIA~  15.0  Family Expre~ "Convenience Sto~
##  9 2016-11-14 FAMILY EXPRESS #57 SEDALIA~  15.1  Family Expre~ "Convenience Sto~
## 10 2016-11-14 FAMILY EXPRESS #63 SPRINGF~  17.2  Family Expre~ "Convenience Sto~
## # ... with 2,321 more rows

Observations:

  • This person spends money frequently. The first day of data, there’s 5 transactions at dunkin donuts.
  • There’s payments to the banks at rounded numbers, which means they’re probably carrying a balance and paying near minimum.
  • First money source mortgage payments seen 3 months in. In the following weeks, there’s some overdraft and insufficient funds fees.
  • acceptancenow.com – looks like this person is financing furniture.
  • there are transactions back and forth between the two accounts – checking and savings

I think we can assume this person doesn’t have a lot saved if they’re getting hit with bank fees. It looks like they have a healthy income, but their spending is a bit out of control.

Setup

The right set of visualizations depends on the audience. Visualizations need to be relevant to the viewer and easy to digest.

Since the data is limited, let’s build off of user 294.

We’re going to flag transfers to/from savings at this point for later use.

df2 <- df %>%
  filter(user_id == 'user_294') %>%
  mutate(income = ifelse(amount < 0, abs(amount), 0),
         payments = ifelse(amount > 0, amount, 0),
         month = floor_date(date, 'month'),
         savings_tx = ifelse(str_detect(transaction_description, 'Online Transfer'), 1, 0)
         )

df$savings_tx <- ifelse(str_detect(df$transaction_description, 'Online Transfer'), 1, 0)

# check flag for transfers between accounts
df2 %>% filter(savings_tx == 1) %>%
  arrange(date, amount) %>%
  select(account_id, transaction_description, amount)
## # A tibble: 116 x 3
##    account_id   transaction_description                                   amount
##    <chr>        <chr>                                                      <dbl>
##  1 account_0664 Online Transfer from SAV ...XXXX transaction#: XXXXXXXXXX    -50
##  2 account_6137 Online Transfer to CHK ....XXXX transaction#: XXXXXXXXXX~     50
##  3 account_0664 Online Transfer from SAV ...XXXX transaction#: XXXXXXXXXX   -190
##  4 account_6137 Online Transfer to CHK ....XXXX transaction#: XXXXXXXXXX~    190
##  5 account_0664 Online Transfer from SAV ...XXXX transaction#: XXXXXXXXXX   -260
##  6 account_6137 Online Transfer to CHK ....XXXX transaction#: XXXXXXXXXX~    260
##  7 account_0664 Online Transfer from SAV ...XXXX transaction#: XXXXXXXXXX   -250
##  8 account_6137 Online Transfer to CHK ....XXXX transaction#: XXXXXXXXXX~    250
##  9 account_0664 Online Transfer from SAV ...XXXX transaction#: XXXXXXXXXX   -250
## 10 account_6137 Online Transfer to CHK ....9732 transaction#: XXXXXXXXXX~    250
## # ... with 106 more rows

By Month

I think a logical place to start when analyzing finances is looking at what’s happening at a monthly level. This is really just because housing and other costs occur on a monthly basis (mortgage, rent, utilities, phone bill, etc.) and I’m really just curious if this person is spending more than they make.

# aggregate to monthly level, exclude movement between two accounts
mo <- df2 %>%
  filter(savings_tx == 0) %>%
  group_by(month) %>%
  summarize(income = sum(income),
            payments = sum(payments),
            net = income - payments
            )

mo
## # A tibble: 25 x 4
##    month      income payments     net
##    <date>      <dbl>    <dbl>   <dbl>
##  1 2016-11-01  2925.    3400.  -475. 
##  2 2016-12-01  7762.    5788.  1974. 
##  3 2017-01-01  4811.    6629. -1818. 
##  4 2017-02-01  4837.    5215.  -379. 
##  5 2017-03-01  5522.    4671.   851. 
##  6 2017-04-01  8879.    8816.    62.8
##  7 2017-05-01  8506.    8848.  -342. 
##  8 2017-06-01  7956.    6903.  1054. 
##  9 2017-07-01  8043.    9036.  -993. 
## 10 2017-08-01  4949.    5271.  -322. 
## # ... with 15 more rows

For this monthly rollup, let’s exclude the first and last month since they’re not full months. Also, let’s add in a running total for the net value at the end of each month to see net over time.

min_mo <- min(mo$month)
max_mo <- max(mo$month)

mo2 <- mo %>%
  filter(month > min_mo & month < max_mo) %>%
  arrange(month) %>%
  mutate(running_net = cumsum(net))

cat(paste0('new measurement period: ', min(mo2$month), ' - ', max(mo2$month)))
## new measurement period: 2016-12-01 - 2018-10-01
mo2
## # A tibble: 23 x 5
##    month      income payments       net running_net
##    <date>      <dbl>    <dbl>     <dbl>       <dbl>
##  1 2016-12-01  7762.    5788.  1974.         1974. 
##  2 2017-01-01  4811.    6629. -1818.          156. 
##  3 2017-02-01  4837.    5215.  -379.         -222. 
##  4 2017-03-01  5522.    4671.   851.          629. 
##  5 2017-04-01  8879.    8816.    62.8         691. 
##  6 2017-05-01  8506.    8848.  -342.          349. 
##  7 2017-06-01  7956.    6903.  1054.         1403. 
##  8 2017-07-01  8043.    9036.  -993.          409. 
##  9 2017-08-01  4949.    5271.  -322.           87.0
## 10 2017-09-01  6170.    6169.     0.840        87.8
## # ... with 13 more rows

Change in Value Over Time

Looking at a time series of the cumulative net balance will show us the change in the overall value of her bank account over time.

ggplot(mo2, aes(x = month, y = running_net)) +
  geom_line() +
  geom_smooth(se = F)

It looks like the beginning of 2017 wasn’t a great time for this person, as we can see that their account dipped below zero. Earlier, I noticed the fees around that time, and I’m glad to see that it wasn’t a common event in the future. It also looks like the account value spikes at the end of the year and dips back after the next year turns – possibly a holiday bonus with serious spending for the holidays?

In any case, this person seems to be saving more money recently. Keep in mind that we can’t be sure that all accounts are included here and this might not be an accurate view of their total wealth.

Spend or Save

Continuing on measuring this person’s ability to manage their monthly finances, let’s take a look at the distribution of where they balance at the end of the month.

ggplot(mo2, aes(x = net)) +
  geom_boxplot()

cat('mean:', mean(mo2$net), '\nmedian:', median(mo2$net))
## mean: 166.0557 
## median: 22.99

There’s a very small chance that this person is going to have a net positive balance any given month. Almost half of the months in the measurement period ended with this person spending more than they make.

The average of the net value would tell us how much the account value would change by month over this measurement period.

So on average and based on these two accounts only, this person is saving about $166 per month. If this person is paying interest on credit cards, it’s likely that the accrued interest on the balance would offset these savings and show an overall net negative balance.

Income

Now that we understand this person’s finances from a high level based on these two accounts, let’s take a deeper dive into where this money is coming from and where it’s being spent.

Let’s start with calculating an average annual salary:

# Monthly Average Income, Annualized
sum(mo2$income) / nrow(mo2) * 12
## [1] 86502.27

This person’s average annual after tax income is about 86.5k. Let’s take a quick look at if this has been consistent every month.

ggplot(mo2, aes(x = month, y = income)) +
  geom_col()

ggplot(mo2, aes(x = income)) +
  geom_boxplot()

cat('average monthly income:', mean(mo2$income), '\nrange of monthly income:', max(mo2$income) - min(mo2$income))
## average monthly income: 7208.522 
## range of monthly income: 5903.21

So this person takes home about 7,200 per month, but the amount can range by almost 6k. Sometimes they’re making under 5k, and on the best month they pull in almost 11k.

Let’s see if we can see where the money’s coming from each month.

df3 <- df2 %>%
  filter(amount < 0) %>%
  filter(month > min_mo & month < max_mo) %>%
  filter(savings_tx == 0) %>%
  mutate(income_source = ifelse(str_detect(transaction_description,'THE METHODIST HO'), 'Job', 
                                ifelse(str_detect(transaction_description, 'CHILD SUPPORT'), 'Child Support'
                                ,'Other'
                                ))
         ) %>%
  arrange(date)
  
ggplot(df3, aes(x = month, y = income, fill = reorder(income_source, income))) +
  geom_col() +
  labs(fill = '') +
  theme(legend.position = 'bottom') +
  scale_fill_manual(values = c('yellowgreen', 'chartreuse4', 'darkslategrey'))

Not really sure why income from the hospital varies so much, it’s possible it’s related to variables like overtime, commission, and bonuses rather than a set salary. Some months are seeing a large portion of income coming from the ‘other’ category, and this can come from unexpected income like tax returns and tuition returns.

Spending

When thinking about personal finance, we have two ways to improve our position– make more or spend less.

For this section we’ll leverage the merchant name and category rather than spend too much time cleaning up the data.

Let’s start by comparing the frequency of transactions – as an average number of spending transactions per day.

# exclude savings account transfers
df4 <- df2 %>%
  filter(savings_tx == 0) %>%
  mutate(tx_type = ifelse(amount > 0, 'spend', 'income' )) %>%
  arrange(date)

days_in_period <- difftime(max(df4$date), min(df4$date), units = 'days') %>% 
  as.double()

df4 %>%
  summarize(daily_spend_rate = sum(amount > 0) / days_in_period)
## # A tibble: 1 x 1
##   daily_spend_rate
##              <dbl>
## 1             3.11
df4 %>%
  group_by(date) %>%
  summarize(purchases = sum(amount > 0)) %>%
  ggplot(aes(x = purchases)) +
  geom_boxplot() +
  scale_x_continuous(breaks = seq(0,28,2))

Looks like this person is seeing about 4 expenses per day. Half the time, there’s between 2-6 expenses and at the extreme end there was 26 transactions in one day.

Compared To Other Users

Let’s see how this ratio stacks up against the other two users.

user_date_ranges <- df %>%
  group_by(user_id) %>%
  summarize(min_date = min(date),
            max_date = max(date),
            days_in_period = as.double(difftime(max_date, min_date, 'day'))
  )

df %>%
  filter(user_id != 'user_294') %>%
  left_join(user_date_ranges, (by = c('user_id' = 'user_id'))) %>%
  group_by(user_id) %>%
  summarize(daily_spend_rate = sum(amount > 0) / max(days_in_period))
## # A tibble: 2 x 2
##   user_id  daily_spend_rate
##   <chr>               <dbl>
## 1 user_723             2.26
## 2 user_815             3.30

The other two users are spending about the same, with user 723 spending a bit less often.

# create a dataframe with all dates listed
for (i in 1:nrow(user_date_ranges)) {
   x <- data.frame(user_id = user_date_ranges[i,]$user_id, 
                   date = seq(user_date_ranges[i,]$min_date, user_date_ranges[i,]$max_date, '1 day'))

  if (i == 1) {
    all_dates <- x
  } else {
    all_dates <- bind_rows(all_dates, x)
  }
  
}

# total spend by user by day
day_sum <- df %>%
  filter(savings_tx == 0) %>%
  group_by(date, user_id) %>%
  summarize(purchases = sum(amount > 0))
  
# total spend by day, including days with no spend
all_dates %>%
  left_join(day_sum, by = c('date' = 'date')) %>%
  mutate(purchases = coalesce(purchases, 0)) %>%
  select(user_id = user_id.x, date, purchases) %>%
  ggplot(aes(x = purchases, y = user_id)) +
  geom_boxplot() +
  scale_x_continuous(breaks = seq(0,28,2))

This is the distribution of daily expense transactions, including days without transactions. It looks like all 3 users share a median of 2 expense tx per day, with user 723 being the least likely to go a day without buying or paying for something.

Lets take a look at how much they each spent as well.

df %>%
  filter(savings_tx == 0) %>%
  mutate(month = floor_date(date, 'month')) %>%
  group_by(user_id, month) %>%
  summarize(expenses = sum(ifelse(amount > 0, amount, 0))) %>%
  ggplot(aes(x = month, y = expenses, color = user_id)) +
  geom_line()

Looking at this, not sure these users are comparable. User 723 has a wide range of spend every month and user 815 only has a handful of months of data. Depending on the audience, it might be interesting to see a current month summary with a benchmark against the previous month, as well as the same month last year (year over year).

By Merchant

Let’s get back to looking at spend by merchant and category for user 294.

library(plotly)

ggplotly(
  df4 %>%
    group_by(month, merchant_name) %>%
    summarize(expenses = sum(payments)) %>%
    ggplot(aes(x = month, y = expenses, fill = merchant_name)) +
    geom_col() +
    theme(legend.position = 'none') +
    labs(title = 'monthly spend by merchant name')
)
ggplotly(
  df4 %>%
    group_by(month, merchant_category) %>%
    summarize(expenses = sum(payments)) %>%
    ggplot(aes(x = month, y = expenses, fill = merchant_category)) +
    geom_col() +
    theme(legend.position = 'none') +
    labs(title = 'monthly spend by merchant category')
)

The charts above have too much going on – let’s try to look at overall spend and see if there’s anything interesting.

purchase_denom <- sum(df4$payments)

df4 %>%
  filter(!is.na(merchant_name)) %>%
  group_by(merchant_name) %>%
  summarize(spend = sum(payments),
            pct_of_spend = spend / purchase_denom) %>%
  arrange(desc(spend)) %>%
  top_n(30) %>%
  ggplot(aes(y = reorder(merchant_name, pct_of_spend), x = pct_of_spend)) +
  geom_col() +
  scale_x_continuous(labels = percent_format(1), breaks = seq(0,0.12,0.02)) +
  labs(title = 'Spend by Top 30 Merchants',
       y = element_blank(),
       x = '% of Total Spend')

  • It looks like the top 3 merchants are related to bills (home, car, cable).
  • check into cash is a payday loan service
df4 %>%
  mutate(merchant_category = ifelse(merchant_category == '' | is.na(merchant_category), 'N/A', merchant_category)) %>%
  group_by(merchant_category) %>%
  summarize(spend = sum(payments),
            pct_of_spend = spend / purchase_denom) %>%
  ggplot(aes(y = reorder(merchant_category, pct_of_spend), x = pct_of_spend)) +
  geom_col() +
  scale_x_continuous(labels = percent_format(1)) +
  labs(title = 'Spend by Category',
       y = element_blank(),
       x = '% of Total Spend')

  • Almost half of all spend is missing merchant info
  • We’re seeing about a third of spend going towards bills + utilities, which isn’t that easy to cut down.

Let’s see if we can get some more detail on the chart above by including merchant names.

spend_by_cat <- df4 %>%
  group_by(merchant_category) %>%
  summarize(spend_by_cat = sum(payments))

ggplotly(
df4 %>%
  mutate(merchant_category = ifelse(merchant_category == '' | is.na(merchant_category), 'N/A', merchant_category)) %>%
  group_by(merchant_category, merchant_name) %>%
  summarize(spend = sum(payments),
            pct_of_spend = spend / purchase_denom) %>%
  left_join(spend_by_cat, by = c('merchant_category' = 'merchant_category')) %>%
  ggplot(aes(y = reorder(merchant_category, spend_by_cat), x = pct_of_spend, fill = merchant_name)) +
  geom_col() +
  scale_x_continuous(labels = percent_format(1)) +
  labs(title = 'Spend by Category',
       y = element_blank(),
       x = '% of Total Spend') +
  theme(legend.position = 'none')
)

Since bills are most likely to be paid first, let’s exclude bills and missing info to see where the daily expenses are happening in order to target opportunities for savings.

ggplotly(
df4 %>%
  mutate(merchant_category = ifelse(merchant_category == '' | is.na(merchant_category), 'N/A', merchant_category)) %>%
  filter(merchant_category != 'N/A' & merchant_category != 'Bill' & merchant_category != 'Utility') %>%
  group_by(merchant_category, merchant_name) %>%
  summarize(spend = sum(payments),
            pct_of_spend = spend / purchase_denom) %>%
  left_join(spend_by_cat, by = c('merchant_category' = 'merchant_category')) %>%
  ggplot(aes(y = reorder(merchant_category, spend_by_cat), x = pct_of_spend, fill = merchant_name)) +
  geom_col() +
  scale_x_continuous(labels = percent_format(1)) +
  labs(title = 'Spend by Category',
       y = element_blank(),
       x = '% of Total Spend') +
  theme(legend.position = 'none')
)

```