Data Acquisition

Project Overview

This Dashboard present an analysis about my credit card expense data. The credit card related used attributes have been analyzed from different perspectives and several interactive visuals of varied kind have been created to help understand few important aspects. Further some interesting usage patterns have been identified which did not seem very obvious before analyzing the visuals.

Data Acquisition:

The following sequential steps were followed to prepare the data:

1. Raw Data Acquisition: The raw data was procured from my American Express credit card account by fetching the historical transaction data from the time the card was issues (September, 2017).

2. Data Cleaning & Data Enhancement: The raw data was not in a ready to sue format. It wither did not have data in the format suitable for analyzing. Also, a couple of new data fields were needed for the conceived analysis. Hence a few key enhancements were performed on the following fields to make them suitable for uploading in R.

Date field: The date were not exactly in the date format acceptable by R. Dates were like 01/01/2018 Mon. SO first I Split this into two columns and cleaned it to finally create two columns like : 01-01-2018 (suitable data format for R) & Mon (Day of the Week column).
Expense_Type data field: This specific field was not present in the raw transaction data. There was a column which had transaction detail like name of the shop, portal where the transaction happened and some more cryptic details in the txt format. So, I manually added a column expense type and based on the details present classified each of the 5000+ transaction into a suitable category like Grocery, Restaurant etc.
Day_of_Week: As mentioned in last para, Day_of _week column was not directly present but was obtained by splitting the original column into date and Day_of_Week
year: This filed was created using r commands from the Date field.
month: This field represents month of the year with 12 different factors. This filed was created using r commands from the Date field.
day: This filed was created using r commands from the Date field.
month_name: This was created using r commands (mapvalue function) to give more meaningful name to month which was a numeric field

Dataset (after enhancements) Cleaned and enhanced data contains: 572 rows 12 columns Date range: From September 13, 2017 to Feb 06, 2019. Hence, we have data for all 12 months of the year only for year 2018. Therefore, some of the bars corresponding to months are empty for year 2017 &2019.

The column names are as follows:

Date
Day_of_Week
Card_User
Amount
Source_Raw
Expense_Type
Transaction_ID
Revised_Transaction_ID
year
month
day
month_name

The next sections have been arranged as follows:

Interactive Visualization Charts

Key Observations/Questions Addressed: This section will start by analyzing each of the visuals presented in previous section and will answer a few important questions. Some of these questions can be answered by a quick observation of visualizations, however some of the questions can only be answered by a careful and comparative analysis of the data visualizations presented.

Spending by Categories

Row

Spending by Categories

Transactions Volume

Row

Transactions Volume by Expense Categories

Spending: Day of Week v/s Month

Row

Spending pattern by Day of the week

Row

Spend by Month of the Year

Spending Dygraph

Row

Spending over time by date range

Spending: Primary v/s sencondary user

Row

Spending pattern of primary user v/s sencondary user

Observations/Questions

A few important Questions can be answered using the visuals presented in the previous section

Question 1: Which expense category has highest share in overall expense and how much percentage is there share?

Grocery, 25.4% share by dollar amount

Question 2. Which expense category has lowest share in overall expense and dhow much percentage is their share?

General Utilities, 0.42% share by dollar amount

Question 3: Which expense category has highest count of transactions for the total time span and how much is the count of total number of transactions?

Grocery, 153 number of transactions

Question 4: Over the whole time span which day has the highest amount of transaction and how much is the value of transaction?

Feb 02, 2018. $802.15

Question5: For year 2018, which month has highest overall spending and which month has lowest overall spending?

Month with the highest spending: February Month with lowest Spending: January

Some Indirect & More Complex Questions

To answer these questions require a deeper analysis of data. Also specific visuals presented in this dashboard helps under some complex phenomenonsin a simple way.

Question6: Which days of the week usually see highest expenditure and which days of the week see lowest expenditure?

Fridays are the days of highest expense on average, which is followed by Thursdays. Also, Sundays seems days of lowest expenditure on average usually.

Question7: Whether the expense pattern for primary and secondary are similar? Are they similar in low spending range? Are they similar in High spending range?

Very similar in low spending range as the density graphs for both users almost overlap each other in low dollar value range. Very different in mid spending and high spending range. Here primary user has higher spending instances in comparison to lower expense region. Overall the spending pattern are very similar as the density curve for the two users overlap in most of the spending region.

---
title: "Data Visualization - Final Project"
author: "Anil Jhanwar "
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    source_code: embed  
---

```{r setup, include=FALSE}


knitr::opts_chunk$set(echo = FALSE)

library(plotly)
library(flexdashboard)
library(ggplot2)
library(lubridate)
library(plyr)
library(colorRamps)
library(RColorBrewer)
library(dygraphs)
library(xts)
library(data.table)

```


```{r,echo = FALSE, message = FALSE}

credit_card<-read.csv("final_cleaned_data.csv")

#add some more columns which will be useful for creating meaningful visualization
credit_card$year<-as.factor(year(credit_card$Date))
credit_card$month<-as.factor(month(credit_card$Date))
credit_card$day<-day(credit_card$Date)
credit_card$month_name<-mapvalues(credit_card$month, c(1:12), c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "July", "Aug", "Sep", "Oct", "Nov", "Dec"))

credit_card$Card_User<-mapvalues(credit_card$Card_User, c("Anil Jhanwar", "Kanika Jhanwar"), c("Primary User", "Addon User"))

credit_card_sub<-credit_card[ credit_card$Expense_Type != "Credit Card Payment" & credit_card$Expense_Type !="Cash Reward", ]


```


Data Acquisition
===================================== 


**  **
**  **
**Project Overview**

This Dashboard present an analysis about my credit card expense data. The credit card related used attributes have been analyzed from different perspectives and several interactive visuals of varied kind have been created to help understand few important aspects. Further some interesting usage patterns have been identified which did not seem very obvious before analyzing the visuals.

**Data Acquisition:**

The following sequential steps were followed to prepare the data:

**1. Raw Data Acquisition:** 
The raw data was procured from my American Express credit card account by fetching the historical transaction data from the time the card was issues (September, 2017).



**2. Data Cleaning & Data Enhancement:**
The raw data was not in a ready to sue format. It wither did not have data in the format suitable for analyzing. Also, a couple of new data fields were needed for the conceived analysis. Hence a few key enhancements were performed on the following fields to make them suitable for uploading in R.

i. Date field: The date were not exactly in the date format acceptable by R. Dates were like  01/01/2018 Mon. SO first I Split this into two columns and cleaned it to finally create two columns like : 01-01-2018 (suitable data format for R) & Mon (Day of the Week column).

ii. Expense_Type data field: This specific field was not present in the raw transaction data. There was a column which had transaction detail like name of the shop, portal where the transaction happened and some more cryptic details in the txt format. So, I manually added a column expense type and based on the details present classified each of the 5000+ transaction into a suitable category like Grocery, Restaurant etc.

iii. Day_of_Week: As mentioned in last para, Day_of _week column was not directly present but was obtained by splitting the original column into date and Day_of_Week

iv. year: This filed was created using r commands from the Date field.

v. month: This field represents month of the year with 12 different factors. This filed was created using r commands from the Date field.

vi. day: This filed was created using r commands from the Date field.

vii. month_name: This was created using r commands (mapvalue function) to give more meaningful name to month which was a numeric field


**Dataset (after enhancements)**
 Cleaned and enhanced data contains:
 572 rows 
 12 columns
 Date range: From September 13, 2017 to Feb 06, 2019. Hence, we have data for all 12 months of the year only for year 2018. Therefore, some of the bars corresponding to months are **empty** for year 2017 &2019.
 
 The column names are as follows:
 
 - Date     
 - Day_of_Week           
 - Card_User             
 - Amount               
 - Source_Raw     
 - Expense_Type          
 - Transaction_ID        
 - Revised_Transaction_ID
 - year                  
 - month                 
 - day                 
 - month_name 


The next sections have been arranged as follows:

**Interactive Visualization Charts**

**Key Observations/Questions Addressed:**
This section will start by analyzing each of the visuals presented in previous section and will answer a few important questions. Some of these questions can be answered by a quick observation of visualizations, however some of the questions can only be answered by a careful and comparative analysis of the data visualizations presented.


Spending by Categories
=======================================================================
Row {.tabset data-width=600}
----------------------------------------------------------------------
### Spending by Categories

```{r,echo = FALSE, message = FALSE}

#Pie chart

#data preperation for pie chart

spend_proportion <- ddply(credit_card_sub, ~ Expense_Type, summarise, proportion = round(100*sum(Amount)/sum(credit_card_sub$Amount),2))

colourCount = length(unique(credit_card_sub$Expense_Type))
getPalette = colorRampPalette(brewer.pal(9, "Set1"))


p1 <- plot_ly(spend_proportion, labels = ~Expense_Type, values = ~proportion, type = 'pie',textposition = 'outside',textinfo = 'label+percent') %>%
  layout(title = 'Distribution of Credit Card Spending by Categories',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p1

```




Transactions Volume
====================================
Row {.tabset data-width=600}
----------------------------------------------------------------------
### Transactions Volume by Expense Categories


```{r, echo = FALSE, message = FALSE}

count_transaction <- ddply(credit_card_sub, ~ Expense_Type, summarise, count_by_category=length(Revised_Transaction_ID), Percentage_share = round(100*length(Revised_Transaction_ID)/length(credit_card_sub$Revised_Transaction_ID),2))


p6<-ggplot(count_transaction, aes(x = reorder(Expense_Type,count_by_category), y = count_by_category, fill=Expense_Type)) + geom_bar(aes(),stat="identity",width = 0.75, position = position_dodge()) + xlab("Spending Category") + ylab("Amount ($)") +
theme_gray()+
theme(plot.background = element_rect(fill = "grey80")
      ) +  ggtitle("Number of Transactions by Expense Categories")
ggplotly(p6)

```



Spending: Day of Week v/s Month
====================================
Row {.tabset data-width=500}
----------------------------------------------------------------------
### Spending pattern by Day of the week


```{r, echo = FALSE, message = FALSE}


#Spending by day of the week

#we need to order the dataframe by the day of the week
credit_card_sub$Day_of_Week <- factor(credit_card_sub$Day_of_Week, levels= c("Sun", "Mon", 
    "Tue", "Wed", "Thu", "Fri", "Sat"))
credit_card_sub<-credit_card_sub[order(credit_card_sub$Day_of_Week), ]


p3<-ggplot(credit_card_sub, aes(x = Day_of_Week, y = Amount, fill= Day_of_Week)) + geom_bar(aes(),stat="identity",width = 0.75, position = position_dodge()) + xlab("Month") + ylab("Amount ($)") +
theme_gray()+
scale_fill_brewer(palette ="PuBu")+
theme(plot.background = element_rect(fill = "grey80")
      ) +  facet_wrap("year")+ ggtitle("Spending by Day of the week")

ggplotly(p3)


```


Row {.tabset data-width=600}
----------------------------------------------------------------------
### Spend by Month of the Year 


```{r, echo = FALSE, message = FALSE}



#Spending By Month

p2<- ggplot(credit_card_sub, aes(x = month_name, y = Amount, fill=month_name)) + geom_bar(aes(),stat="identity",width = 0.75, position = position_dodge()) + xlab("Month") + ylab("Amount ($)") +
theme_gray()+
theme(plot.background = element_rect(fill = "grey80")
      ) +  ggtitle("Spending by Month")+facet_wrap("year")
ggplotly(p2)


```



Spending Dygraph
====================================
Row {.tabset data-width=600}
----------------------------------------------------------------------
### Spending over time by date range


```{r, echo = FALSE, message = FALSE}


#Spending over time by date range


credit_card_sub_2 <- credit_card_sub[,c(1,4)]


dygraph_data <- xts(credit_card_sub_2[,2],order.by=as.Date(credit_card_sub_2$Date))

dygraph(dygraph_data) %>% dyRangeSelector(dateWindow = c("2017-09-13", "2019-02-06"))%>%
  dyAxis("y",label="Amount Spent in (USD)")

```



Spending: Primary v/s sencondary user
====================================
Row {.tabset data-width=600}
----------------------------------------------------------------------
### Spending pattern of primary user v/s sencondary user


```{r, echo = FALSE, message = FALSE}


####density chart for comarision of spending pattern of primary user v/s sencondary user


p5<-ggplot(credit_card_sub, aes(x = Amount, fill =Card_User )) + geom_density()+ggtitle("Density Plot: spending pattern of primary user v/s sencondary user")

ggplotly(p5)

```




Observations/Questions
====================================

**  **
**  **


A few important Questions can be answered using the visuals presented in the previous section

Question 1: Which expense category has highest share in overall expense and how much percentage is there share?

**Grocery, 25.4% share by dollar amount**

Question 2. Which expense category has lowest share  in overall expense and dhow much percentage is their share?

**General Utilities, 0.42% share by dollar amount**

Question 3: Which expense category has highest count of transactions for the total time span and how much is the count of total number of transactions?

**Grocery, 153 number of transactions**

Question 4: Over the whole time span which day has the highest amount of transaction and how much is the value of transaction?

**Feb 02, 2018. $802.15**

Question5:
For year 2018, which month has highest overall spending and which month has lowest overall spending?

**Month with the highest spending: February**
**Month with lowest Spending: January**


----------------------------------



**Some Indirect & More Complex Questions** 

To answer these questions require a deeper analysis of data. Also specific visuals presented in this dashboard helps under some complex phenomenonsin a simple way.


Question6: Which days of the week usually see highest expenditure  and which days of the week see lowest expenditure?

**Fridays are the days of highest expense on average, which is followed by Thursdays. Also, Sundays seems days of lowest expenditure on average usually. **

Question7: Whether the expense pattern for primary and secondary are similar? Are they similar in low spending range? Are they similar in High spending range?

**Very similar in low spending range as the density graphs for both users almost overlap each other in low dollar value range.**
**Very different in mid spending and high spending range. Here primary user has higher spending instances in comparison to lower expense region.**
**Overall the spending pattern are very similar as the density curve for the two users overlap in most of the spending region.**