A project for ANLY512: Data Visualization
The Quantified Self movement grew out of the rise of the internet of things, the mass collection of personal information, and mobile technologies (primarily wearable computing). This final class project uses one year of spending and payment data recorded on a Discover credit card.
The goal of the project is to collect, analyze, and visualize the data using the tools and methods covered in class. Using this data-driven approach, I will also write a summary that answers the following questions about the collected data:
1. What is the total spending by month in 2018?
2. Which category cost the most in 2018?
3. For Merchandise, what is the spending by month in 2018?
4. In March and November, what is the spending in each category?
5. What is the total spending per transaction type?
data <- read.csv("C:/Users/Mingmei Yang/Documents/HU/Final-Project-Excel.csv")
There are 7 variables we are interested in:
Year: 2018
###Q1: What is the total spending by month in 2018?
# Total spending per month; aggregate() names the summed column "x"
library(ggplot2)
agg1 <- aggregate(data$Amount, by = list(Month = data$Month), FUN = sum)
ggplot(agg1, aes(x = Month, y = x)) +
  geom_bar(stat = "identity", position = "identity", fill = "blue") +
  # Force calendar order on the x axis instead of alphabetical order
  scale_x_discrete(limits = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")) +
  labs(title = "Spending per Month", x = "Month", y = "Amount") +
  # Print each monthly total just above its bar
  geom_text(aes(label = x), position = position_dodge(width = 0.5), vjust = -0.25, size = 3) +
  theme_minimal()
Because Month is categorical and Amount is continuous, and the aim is to compare total spending across months, a bar chart was used to summarize the data. I summed the amount spent in each month and displayed the totals as bars. The plot shows that average monthly spending is around 800 dollars. Total spending is highest in March and November. August has the most spending in 2018, followed by April and November; June has the least, followed by July and September. The seasonality effect in the spending trend is very strong, although it is not consistent across all months.
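The monthly summary described above can be sketched on a small made-up data set. The `toy` data frame and its values are hypothetical and stand in for the real CSV; only base R is used. The sketch sums `Amount` per month, restores calendar order, and averages the monthly totals.

```r
# Hypothetical toy transactions; the real data come from the CSV loaded earlier.
toy <- data.frame(
  Month  = c("Jan", "Jan", "Feb", "Mar", "Mar"),
  Amount = c(100, 50, 200, 300, 150)
)

# Sum Amount within each Month; aggregate() names the summed column "x".
monthly <- aggregate(toy$Amount, by = list(Month = toy$Month), FUN = sum)

# aggregate() returns groups in alphabetical order, so restore calendar order
# by matching against the built-in month.abb vector ("Jan", "Feb", ...).
monthly <- monthly[order(match(monthly$Month, month.abb)), ]

# Average of the monthly totals (the "average spending per month" figure).
avg_monthly <- mean(monthly$x)

print(monthly)
print(avg_monthly)
```

Averaging the monthly totals, rather than the individual transactions, is what makes the result comparable to the bar heights in the plot.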
###Q2: Which category cost the most in 2018?
library(ggplot2)
# Total spending per category; aggregate() names the summed column "x"
agg2 <- aggregate(data$Amount, by = list(Category = data$Category), FUN = sum)
ggplot(agg2, aes(x = Category, y = x)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Spending per Category", x = "Category", y = "Amount") +
  geom_text(aes(label = x), position = position_dodge(width = 0.5), vjust = -0.25, size = 3) +
  # Flip to horizontal bars so the long category names stay readable
  coord_flip()
*** For the same reason, a bar chart was used here. The category names are too long to display along the bottom axis, so the chart was flipped to horizontal bars. The category I spent the most on is Merchandise, at around $5600, which is much higher than any other category. The second-highest category is Travel/Entertainment, at around $1300.
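To make the ranking explicit rather than reading it off the bars, the category totals can be sorted and expressed as shares of the overall total. This is a minimal sketch on hypothetical totals (`toy_totals`, including the made-up "Restaurants" category, is not taken from the real data), using only base R.

```r
# Hypothetical category totals, shaped like the output of aggregate():
# one row per category, with the summed amount in column "x".
toy_totals <- data.frame(
  Category = c("Restaurants", "Merchandise", "Travel/Entertainment"),
  x        = c(80, 500, 120)
)

# Sort from largest to smallest total.
ranked <- toy_totals[order(-toy_totals$x), ]

# Share of overall spending for each category, in percent.
ranked$share <- round(100 * ranked$x / sum(ranked$x), 1)

print(ranked)
```

In a ggplot2 bar chart the same ordering can be applied to the axis with `aes(x = reorder(Category, x), y = x)`, where `reorder()` comes from base R's stats package.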
###Q3: For Merchandise, what is the spending by month in 2018?
# Keep only Merchandise transactions, then total them per month
data1 <- subset(data, Category == "Merchandise")
agg3 <- aggregate(data1$Amount, by = list(Month = data1$Month), FUN = sum)
ggplot(agg3, aes(x = Month, y = x)) +
  geom_bar(stat = "identity", fill = "blue") +
  scale_x_discrete(limits = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")) +
  labs(title = "Merchandise Spending per Month", x = "Month", y = "Amount") +
  geom_text(aes(label = x), position = position_dodge(width = 0.5), vjust = -0.25, size = 3) +
  theme_minimal()
*** For most months, spending on Merchandise was less than 500 dollars. In March and November, however, it was about three times higher than in the other months.
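A claim such as "three times higher than the other months" can be checked numerically by comparing each month with a typical month. The sketch below uses hypothetical monthly totals (`toy_monthly`) and takes the median as the baseline; both the values and the choice of median are assumptions for illustration, not results from the real data.

```r
# Hypothetical monthly totals for one category.
toy_monthly <- c(Jan = 300, Feb = 280, Mar = 950, Apr = 310, May = 290, Jun = 305)

# Use the median as the "typical month" so that spikes do not inflate the baseline.
baseline <- median(toy_monthly)

# Ratio of each month to the baseline.
ratio <- toy_monthly / baseline

# Months that reach at least three times the typical month.
spikes <- names(ratio)[ratio >= 3]

print(round(ratio, 2))
print(spikes)
```

The median is preferred over the mean here because the mean would be pulled upward by the very months being tested.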
###Q4: In March and November, what is the spending in each category?
# The Q1 plot assumes abbreviated month names while this filter used numbers,
# so both encodings are accepted here (an assumption about the Month column).
data2 <- subset(data, Month %in% c("3", "Mar"))
# Individual transactions are stacked within each category bar
ggplot(data2, aes(x = Category, y = Amount)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "March Spending per Category", x = "Category", y = "Amount") +
  coord_flip()
data3 <- subset(data, Month %in% c("11", "Nov"))
ggplot(data3, aes(x = Category, y = Amount)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  labs(title = "November Spending per Category", x = "Category", y = "Amount") +
  coord_flip()
*** The plots show that Merchandise is still the largest category in both months. In November, the second-largest category is Travel/Entertainment. Checking the detailed transactions, the auto insurance was renewed in March and the Thanksgiving holiday fell in November, which accounts for the majority of the spending.
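The two monthly breakdowns can also be read side by side from a single cross-tabulation. The sketch below is one way to do that with base R's `xtabs()`; the `toy` data frame, its values, and the "Services" category are hypothetical.

```r
# Hypothetical transactions for two months.
toy <- data.frame(
  Month    = c("Mar", "Mar", "Mar", "Nov", "Nov"),
  Category = c("Merchandise", "Services", "Merchandise",
               "Merchandise", "Travel/Entertainment"),
  Amount   = c(400, 250, 100, 600, 300)
)

# Sum Amount for every Category x Month combination;
# combinations with no transactions are filled with 0.
by_cat <- xtabs(Amount ~ Category + Month, data = toy)
print(by_cat)

# List the largest individual transactions to see what drives a month's total.
largest <- toy[order(-toy$Amount), ]
print(head(largest, 2))
```

Inspecting the largest rows is the scripted counterpart of checking the detailed transactions by hand.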
###Q5: Show a pie chart of transactions by category in 2018
# Count the number of transactions in each category
slices <- table(data$Category)
pie(slices,
    main = "Pie Chart of Transactions by Category")
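Note that `table()` counts transactions, so the pie slices reflect how often each category occurs, not how much was spent. The sketch below, on hypothetical data (`toy` and the "Gasoline" and "Services" categories are made up), adds percentage labels to the count-based pie and shows how spending totals per category could be computed instead with `tapply()`.

```r
# Hypothetical transactions.
toy <- data.frame(
  Category = c("Merchandise", "Merchandise", "Services", "Gasoline"),
  Amount   = c(100, 300, 50, 50)
)

# Number of transactions per category (what the pie chart above is built from).
counts <- table(toy$Category)

# Percentage of transactions per category, used as slice labels.
pct <- round(100 * prop.table(counts), 1)
slice_labels <- paste0(names(counts), " (", pct, "%)")
pie(counts, labels = slice_labels, main = "Transactions by Category (toy data)")

# Total amount spent per category, for a spending-based comparison.
totals <- tapply(toy$Amount, toy$Category, sum)
print(totals)
```

With these toy values Merchandise accounts for half of the transactions but a larger share of the spending, which is why the two views can tell different stories.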