This article aims to analyse the monthly transaction data set and provide insights that help us better understand customer transaction patterns. It also offers a study of the linear regression model, an important concept in machine learning, and discusses how this model can assist in the decision-making process of identifying trends in bank transactions over the years 2013 to 2016.
To capture this information well, the CRISP-DM management model is adopted to provide a structured planning approach to a data mining project with 6 high-level phases. These phases help companies comprehend the data mining process and serve as a road map for planning and executing a data mining project (Medeiros, 2021). This study explores each of the six phases, and the tasks associated with each, in the following order:
Cross-Industry Standard Process for Data Mining (CRISP-DM project, 2000)
Business Understanding is the first step in the CRISP-DM methodology. In this stage, the main task is to understand the purpose of the analysis and to provide a clear and crisp definition of the problem in terms of both the business objectives and the data mining objectives.
In our case study, the business question, paraphrased from the sales manager's request, is: "What is driving the trends and increasing total monthly revenue?". The data mining objective, in turn, is to apply data visualization tools to identify any underlying patterns in the dataset.
Following that, the Data Understanding phase focuses on understanding the data collected to support the Business Understanding and resolve the business challenge (Wijaya, 2021). Data preprocessing and data visualization techniques play an essential role here, so I divide this section into 2 main components:
The data was imported into the software package R to construct visualizations representing the findings of the analysis.
First, I load the libraries needed for reading and manipulating the data and for producing the graphs.
##----------------------------------------------------------------
## Load the Libraries --
##----------------------------------------------------------------
library(here) # access the file path
library(DataExplorer) # EDA visualizations
library(tidyverse) # data wrangling
library(kableExtra) # write table
library(bannerCommenter) # create comment banner
library(ggplot2) # data visualization
library(forecast) # time-series forecasting
library(ggradar) # plot seasonal trend
library(sqldf) # using SQL
library(dplyr) # data processing
library(ggpubr) # combine plots into single page
theme_set(theme_pubr())
library(reshape2) # transpose table
library(fmsb) # create radar chart
library(modelr) # computing regression model performance metrics
library(caret) # streamline the model training process
library(xts) # convert df to ts object
Once the libraries are loaded, we explore the data with the goal of understanding its dimensions, data types, and distribution of values. In this assignment, a time-series data set of financial transactions is used as the major source of data. The attribute information is presented in Appendix 1.
##----------------------------------------------------------------
## Dataset Overview --
##----------------------------------------------------------------
# Load Data File
df <- read_csv(here("dataset/transactions.csv"))
# Quick overview of the data set then transpose for better table display.
# The information including:
# * index and column data types
# * non-null values
# * memory usage
df_overview <- introduce(df) %>% t()
# turn the table into the Markdown format
df_overview %>% kbl() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| Metric | Value |
|---|---|
| rows | 94248 |
| columns | 5 |
| discrete_columns | 2 |
| continuous_columns | 3 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 94248 |
| total_observations | 471240 |
| memory_usage | 4203448 |
As the table shows, the data set records 470,000+ observations across 5 columns, which correspond to 94,000+ bank transactions. The 5 features (date, customer_id, industry, location and monthly_amount) capture the total transaction amount for each customer per month over a three-year period across a range of industries and locations. The column names are self-explanatory, so no renaming is required.
# inspect columns data type
sapply(df, class)
## date customer_id industry location monthly_amount
## "character" "character" "numeric" "numeric" "numeric"
It is also worth noting that the features come in multiple formats, including both numerical and time-series data. However, the output shows that the date column has the wrong data type and will need to be converted to a date format later.
Additionally, I investigate the response field. Recalling the business question, the monthly_amount column is the natural target field, since the goal is to predict the total monthly transaction value for the next month. Because the observations in this column are continuous, the problem is a supervised regression problem. Knowing this is essential for selecting the right machine learning model in a later stage of this report.
# Check for missing value in each column by plotting
plot_missing(df)
Missing values plot
The plot shows that there are no missing values in any field. Nevertheless, some data sets encode missing observations in categorical/character columns as a separate category such as "NA" or "NULL", so there is a chance that such observations are overlooked, which could seriously distort the real data distribution. Consequently, the categorical columns need to be checked further for these missing-value categories to confirm this observation.
# convert character values of columns to upper case for better missing value inspection
missing_df <- data.frame(
lapply(df, function(v) {
if (is.character(v)) return(toupper(v))
else return(v)
}))
# check whether missing values are encoded as a separate category
##::::::::::::::::::
## 1. date column
##::::::::::::::::::
sprintf(paste0("Is there any missing value observation categories in date column (T/F)?: ",
missing_df[1] %in% c("NA","N/A","NULL","")))
## [1] "Is there any missing value observation categories in date column (T/F)?: FALSE"
# "FALSE"
##::::::::::::::::::::::::::
## 2. customer_id column
##::::::::::::::::::::::::::
sprintf(paste0("Is there any missing value observation categories in customer_id column (T/F)?: ",
missing_df[1] %in% c("NA","N/A","NULL","")))
## [1] "Is there any missing value observation categories in customer_id column (T/F)?: FALSE"
# "FALSE"
# 3. Check for any transaction with zero values
sprintf(paste0("How many rows contained 0 value in monthly transaction amount?: ",
sum(df$monthly_amount==0)))
## [1] "How many rows contained 0 value in monthly transaction amount?: 1"
# "FALSE"
The code output above shows that no additional missing-value category exists in the categorical columns. Thus, we can confirm the hypothesis that there are no missing values in either the numerical or the categorical columns of this data set. Furthermore, it also indicates that there is 1 row with a zero value in the monthly_amount column, which will need to be resolved.
To evaluate the impact of each feature, univariate, bivariate, and multivariate analyses are performed on all features.
Univariate analysis studies the distribution of the data. According to Sharma (2020), the distributions of the independent variables and the target variable are a crucial component in building linear models; therefore, understanding the skewness of the data helps us create better models.
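As a quick numeric cross-check (this snippet is not part of the original analysis and assumes the e1071 package is installed), the skewness of the target variable can be computed directly before any plotting:
# hypothetical skewness check of the target variable (requires the e1071 package)
library(e1071)
skewness(df$monthly_amount) # a value well above 0 indicates a right-skewed target
# a log transform is one common way to reduce right skew before linear modelling
hist(log1p(df$monthly_amount), main = "log(1 + monthly_amount)", xlab = "log-transformed transaction amount")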
First, I plot histograms to check which industry and location groups account for the largest share of transactions.
##---------------------------------------------------------------
## Data Distribution --
##---------------------------------------------------------------
##:::::::::::::::::::::::::::::::
## 1. Data Skewness/Imbalance
##:::::::::::::::::::::::::::::::
# combine 2 plots into 1 plot
par(mfrow=c(1,2))
# plot data distribution
# 1. transaction counts by INDUSTRY code
hist(df$industry,
main = "Trans by Industry",
xlab="Industry",
xlim = c(0,10),
ylim=c(0,50000),
las=0)
## 2. transaction counts by LOCATION code
hist(df$location,
main = "Trans by Location",
xlab="Location",
xlim = c(0,10),
ylim=c(0,50000),
las=0)
Data distribution
As can be seen from the plots, industries 1 and 2 account for the highest transaction counts, while locations 1 and 2 dominate the location distribution. These results imply that the model may perform better at predicting next month's total transaction amount for locations 1 and 2 and/or industries 1 and 2, where the most data is available.
##---------------------------------------------------------------
## Data Distribution --
##---------------------------------------------------------------
##:::::::::::::::::::::::::::::::
## 2. Data Outliers
##:::::::::::::::::::::::::::::::
# combine 2 plots into 1 plot
par(mfrow=c(1,2))
# plot boxplot to check for outliers
# 1. industry
boxplot(monthly_amount~industry,
data=df,
main='Transaction vs. Industry',
xlab='Industry',
ylab='Transaction Amount',
horizontal=TRUE)
# 2. location
boxplot(monthly_amount~location,
data=df,
main='Transaction vs. Location',
xlab='Location',
ylab='Transaction Amount',
horizontal=TRUE)
Boxplot to check for outliers when plotting Monthly Amount against Location & Industry
Next, the boxplots of transactions by industry and location show high variance, with a considerable number of outliers. The median spending per customer is highest for industries 6 and 9, at over 500,000, while the lowest medians belong to industries 1 and 10, at less than 200,000. In terms of locations, most locations have a median spending amount below 500,000.
Having examined the distribution of the transaction data set, it is also essential to check the correlation and collinearity assumptions between fields in the bivariate analysis. Some basic transformations of the data types are performed beforehand for the sake of plotting the visualizations.
##---------------------------------------------------------------
## Data Transformation --
##---------------------------------------------------------------
# convert date column into the date format
df$date <- as.Date(df$date,"%d/%m/%Y")
# convert customer_id column into character format
df$customer_id <- as.character(df$customer_id)
# convert location column into character format
df$location <- as.character(df$location)
# convert industry column into character format
df$industry <- as.character(df$industry)
# filter out value with 0 transaction amount
df<-filter(df, monthly_amount!=0)
########################################################
# DATA VISUALIZATION: CORRELATION PLOT
########################################################
plot_correlation(df)
Correlation Plot
This information is important for gaining a better understanding of the transaction data set and provides useful insights for transforming the data in a later stage.
##----------------------------------------------------------------
## Data Visualization --
##----------------------------------------------------------------
##:::::::::::::::::::::::::::::::::::::::::
## 1. Transaction Amount & # Transaction
##:::::::::::::::::::::::::::::::::::::::::
# create new df contain total transaction amount
transaction_amount <- sqldf(
"SELECT
date,
'Transaction Amount' as type, -- specify value type
SUM(monthly_amount) AS value -- sum total transaction amount
FROM df
GROUP BY date -- group by date
ORDER BY date
"
)
# create new df contain number of transaction
transaction_count <- sqldf(
"SELECT
date,
'Transaction Count' as type, -- specify value type
COUNT(*) as value -- count total number of transactions
FROM df
GROUP BY date -- group by date
ORDER BY date
"
)
# merge 2 df into 1 new TRANSACTION df vertically
transaction_df <- rbind(transaction_amount,
transaction_count)
# plot transaction amount over time
monthly_amount_plot <- transaction_df %>%
# filter by transaction amount only
filter(type=="Transaction Amount") %>%
# assign x and y-axis from the dataset
ggplot(aes(x = date, y = value/1e6)) +
# add the line graph, color, and the size
geom_line(color = "indianred", size=1.6) +
# the relationship graph between x and y
geom_smooth(formula = y~x,
method = 'loess') +
labs(x = "Year",
y = "Total transaction amount (in millions)",
title = "Monthly Transaction Amount",
subtitle = "2013 to 2016") +
theme_minimal()
# plot total transaction number over time
monthly_count_plot <- transaction_df %>%
# filter by total transaction number count only
filter(type=="Transaction Count") %>%
# assign x and y-axis from the dataset
ggplot(aes(x = date, y = value)) +
# add the line graph, color, and the size
geom_line(color = "indianred", size=1.6) +
# the relationship graph between x and y
geom_smooth(formula = y~x,
method = 'loess') +
labs(x = "Year",
y = "Total transaction number",
title = "Total Transaction Number",
subtitle = "2013 to 2016") +
theme_minimal()
## combine individual plots into a single page
ggarrange(monthly_amount_plot,
monthly_count_plot,
ncol = 2, nrow = 1)
Transaction amount vs. transaction number trend over time
The number of transactions and the total transaction amount rose sharply over the years from 2013 to 2016. A seasonal pattern is visible in the total transaction amount, while the upward trend in the number of transactions is comparatively smooth.
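As a complementary check (a minimal sketch, not part of the original code), base R's stl() can decompose the monthly totals into trend, seasonal and remainder components; this assumes the transaction_amount data frame built above covers consecutive months starting in January 2013:
# hedged sketch: decompose the monthly transaction totals into trend + seasonal + remainder
amount_ts <- ts(transaction_amount$value, start = c(2013, 1), frequency = 12)
plot(stl(amount_ts, s.window = "periodic"))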
This confirms a seasonal pattern, although the trend is not yet clear-cut. To investigate further, we build a yearly polar (radar) plot:
##----------------------------------------------------------------
## Data Visualization --
##----------------------------------------------------------------
##::::::::::::::::::::::::::::::::::::::
## 2. Time-series and Seasonal Trend
##::::::::::::::::::::::::::::::::::::::
## Create new MONTHLY_TREND df to plot seasonal transaction trend
new_ts_df <- sqldf(
"SELECT
strftime('%m', date * 3600 * 24, 'unixepoch') as month, --extract month (date reaches SQLite as days since 1970)
strftime('%Y', --extract year from date column
date * 3600 * 24,
'unixepoch') as year,
SUM(monthly_amount) AS transaction_amount
FROM df
GROUP BY
month,
year
ORDER BY
month,
year
"
)
## transpose the dataset to prepare for the data visualization
monthly_trend <- recast(new_ts_df,
year + variable ~ month,
id.var = c("month", "year"))
monthly_trend <- data.frame(monthly_trend[,-1], # drop the year column from the data
row.names = monthly_trend[,1]) # and use the year values as row names
monthly_trend <- subset(monthly_trend, select = -variable) # remove the unnecessary column
## create new vector specify month column names
colnames(monthly_trend) <- c('January', 'February', 'March', 'April',
'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December')
## To use the fmsb package, I have to add 2 lines to the dataframe:
## the max and min of each variable to show on the plot!
data <- rbind(rep(1400e6,12) , rep(0,12) , monthly_trend)
## create color vector
colr_1 <- rgb(0.2,0.5,0.5,0.9)
colr_2 <- rgb(0.8,0.2,0.5,0.9)
colr_3 <- rgb(0.7,0.5,0.1,0.9)
colr_4 <- "#FC4E07"
## set color theme for radar border
colors_border=c(colr_1,
colr_2,
colr_3,
colr_4)
## plot with default options:
radarchart(data, axistype=0,
#custom polygon
pcol=colors_border, plwd=2 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", cglwd=0.8,
#custom labels
vlcex=0.8, title=paste("Transaction Seasonal Trend"), cex.main = 1
)
## Add a legend
legend(x=1.5, y=0.5,
legend =
rownames(data[-c(1,2),]),
bty = "n", pch=15 ,
col=colors_border,
text.col = "black",
cex=1.2,
pt.cex=1,
title = "Year")
Seasonal Trend Over the Years
# Individual Years Bar Charts
ggplot(new_ts_df) +
aes(x = month, fill = year, weight = transaction_amount/1e6) +
geom_bar() +
scale_fill_brewer(palette = "Dark2",
direction = -1) +
labs(y = "Total transaction amount (in millions)",
title = "Monthly Seasonal Trend",
subtitle = "Individual Year (from 2013 to 2016)") +
coord_flip() +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size = 15L)) +
facet_wrap(vars(year), scales = "free", nrow = 1L)
Seasonal Trend Over the Years
A closer examination reveals that the total transaction amount increases significantly from January to October and subsequently decreases from November towards the end of the year. This pattern can be ascribed to people trading less during the holidays, particularly in the months surrounding big holidays such as Christmas and New Year.
This, however, might be driven by a variety of factors and may not hold uniformly for each region or industry. As a result, additional information is required to substantiate these hypotheses.
##::::::::::::::::::::::::::::::::::::::::::::::::
## 3. Transaction Amount by Location & Industry
##::::::::::::::::::::::::::::::::::::::::::::::::
# create new INDUSTRY df contain total transaction by industry over time
industry <- sqldf(
"SELECT
date,
industry,
SUM(monthly_amount) AS transaction_amount, -- sum total transaction amount
COUNT(*) as transaction_count -- count total number of transactions
FROM df
GROUP BY
date, -- group by date
industry -- and by industry
ORDER BY date
"
)
# create new LOCATION df contain total transaction by location over time
location <- sqldf(
"SELECT
date,
location,
SUM(monthly_amount) AS transaction_amount, -- sum total transaction amount
COUNT(*) as transaction_count -- count total number of transactions
FROM df
GROUP BY
date, -- group by date
location -- and by location
ORDER BY date
"
)
# plot transaction info by industry
industry_amount_plot <- ggplot(industry) +
aes(x = date, y = transaction_amount, colour = industry) +
geom_line(size = 1) +
scale_color_hue(direction = 1) +
labs(x = "Year",
y = "Total transaction amount (in millions)",
title = "Transaction Amount",
subtitle = "By industry (from 2013 to 2016)") +
theme_minimal()
# plot transaction info by location
location_amount_plot <- ggplot(location) +
aes(x = date, y = transaction_amount, colour = location) +
geom_line(size = 1) +
scale_color_hue(direction = 1) +
labs(x = "Year",
y = "Total transaction amount (in millions)",
title = "Transaction Amount",
subtitle = "By location (from 2013 to 2016)") +
theme_minimal()
# combine plots into a single page
ggarrange(industry_amount_plot, location_amount_plot,
ncol = 2, nrow = 1)
Transaction amount by Location vs. Industry
Looking at the monthly amount by location and industry, it is not surprising that total sales for locations 1 and 2 increased significantly compared with the other locations. In terms of industry, industries 2, 3 and 1 show rapid growth over the years, while the others progress quite slowly.
Once the data has been fully understood, data scientists generally need to go back to the data collection and data cleaning phases of the data science pipeline to transform the data set in line with the expected business outcomes. To expand the information already at hand and better represent it, the best practice is to perform Data Preparation or Feature Engineering, that is, the creation of new features from the ones already existing.
In this case study, the data needs to be aggregated and reshaped because a linear regression model will be applied later on.
# write a reusable function
aggregate_transactions <- function(df) {
# aggregate the data, grouping by date, industry and location,
# and calculating the mean monthly_amount
output = df %>%
group_by(date, industry, location) %>%
summarize(monthly_amount = mean(monthly_amount, na.rm = TRUE))
# create a column for the month number and another one for year number
output = output %>%
# create new column for month number
mutate(month_number = format(as.Date(date), "%m")) %>%
# create new column for year number
mutate(year_number = format(as.Date(date), "%Y"))
# keep the new columns as character for now; they are converted to numeric later when needed for modelling
output$month_number = as.character(output$month_number)
output$year_number = as.character(output$year_number)
return(output)
}
# create a new variable that store new df with transformed features
aggregated_transactions <- aggregate_transactions(df)
# A tibble: 3,886 x 6
# Groups: date, industry [470]
# date industry location monthly_amount month_number year_number
# <date> <chr> <chr> <dbl> <chr> <chr>
# 1 2013-01-01 1 1 136081. 01 2013
# 2 2013-01-01 1 10 188735. 01 2013
# 3 2013-01-01 1 2 177840. 01 2013
# 4 2013-01-01 1 3 141632. 01 2013
# 5 2013-01-01 1 4 221058. 01 2013
# 6 2013-01-01 1 5 178138. 01 2013
# 7 2013-01-01 1 6 133400. 01 2013
# 8 2013-01-01 1 7 231599. 01 2013
# 9 2013-01-01 1 8 143778. 01 2013
# 10 2013-01-01 1 9 157416. 01 2013
# ... with 3,876 more rows
# turn the df into a Markdown table format
rmarkdown::paged_table(aggregated_transactions)
An aggregated data set is created using the fields date, industry and location, with the mean of the monthly amount. There are 3,886 rows in total, each presenting a mean monthly amount for the period from 2013 to 2016.
Now that we have the adjusted data set, I split the data into a train set and a test set for the purpose of building a prediction model. The train set covers the three years from 2013 to 2015, while the test set covers the final year of data, 2016.
#################################################################
## Task 2A.iii - Modelling ##
#################################################################
##---------------------------------------------------------------
## 1. Train-Test Split --
##---------------------------------------------------------------
# create helper function with the creation of train-test split
create_train_test <- function(df){
# Test set
# Use data for the year 2016 to test accuracy
test_df <- df[df$date >= "2016-01-01",]
# Training set
# Use data prior to 2016 to train the model
train_df <- df[(df$date < "2016-01-01"),]
return (list(train_df, test_df))
}
# Create train and test set
# create new variable and apply train-test split function
split_list <- create_train_test(aggregated_transactions)
# 1. train set
train_set <- split_list[[1]]
# 2. test set
test_set <- split_list[[2]]
Additionally, there are 2 requirements in this assignment: industry = 1 and location = 1. I therefore generate an additional data set filtered to Industry 1 and Location 1 records only. The train-test split for the Advanced Model Fitting section can be kept the same, as no further adjustments are needed.
# create new df filter by only Location 1 & Industry 1
agg_1_1 <- aggregated_transactions %>%
filter(industry == 1, location == 1)
# Create train and test sets
# using the adjusted data set with the new features
split_list_1_1 <- create_train_test(agg_1_1)
# 1. train set: filter by only Location 1 & Industry 1
train_set_1_1 <- split_list_1_1[[1]]
# 2. test set: filter by only Location 1 & Industry 1
test_set_1_1 <- split_list_1_1[[2]]
# create new df to plot our test-train set
# add new column to specify the value type
train_set_1_1$type <- "Train"
test_set_1_1$type <- "Test"
# create data frame to display year range of test and train data
train_test <- sqldf(
"SELECT * FROM train_set_1_1
UNION ALL
SELECT * FROM test_set_1_1
ORDER BY type;
"
)
# plot year range of train set & test set (L1 & I1)
ggplot(train_test) +
aes(x = date, y = monthly_amount, colour = type) +
geom_line(size = 0.5) +
scale_color_hue(direction = 1) +
labs(x = "Year", y = "Total monthly amount",
title = "Train-Test Split",
subtitle = "Location 1 & Industry 1") +
theme_classic() +
theme(legend.position = "bottom", plot.title = element_text(size = 15L))
Now that the new data set has been created, I will also use it to create a line plot of the variable monthly_amount for industry = 1 and location = 1, with the purpose of gaining more insight from the targeted segment.
#############################################################################
## Task 2A.ii - Visualization: Mean Monthly Transaction Amount (L1 & I1) ##
#############################################################################
agg_1_1 %>%
ggplot() +
aes(x = date, y = monthly_amount, colour = year_number) +
geom_line(size = 0.7) +
geom_point(size=1) +
scale_color_brewer(palette = "Dark2", direction = 1) +
labs(
x = "year",
y = "Transaction amount",
title = "Mean Monthly Transaction Amount",
subtitle = "Industry 1 & Location 1",
color = "Year"
) +
ggthemes::theme_par() +
theme(
plot.title = element_text(size = 12L),
plot.subtitle = element_text(hjust = 0.5)
)
The graph clearly shows a seasonal trend in the mean transaction amount for Industry 1 & Location 1. More specifically, a downtrend at the end of each year is followed by an uptrend at the beginning of the next: December and January are low months for this industry and location, and sales bounce back between March and June. This pattern of fluctuation repeats each year across the 2013 to 2016 time span, and on average the mean monthly sales amount increases slowly over time.
However, it is worth mentioning that the year-end trend in 2016 was upward, the inverse of previous years. As a result, we need to take a closer look at this occurrence by examining the transaction amount by month for each year using the chart below.
# Plot Mean Transaction Amount Bar chart
agg_1_1 %>%
ggplot() +
aes(x = month_number, fill = year_number, weight = monthly_amount) +
geom_bar() +
scale_fill_manual(values = c(`2013` = "#EBEBAE", `2014` = "#BBE395", `2015` = "#379E54", `2016` = "#004529"
)) +
labs(x = "Month", y = "Total transaction amount", title = "Transaction Amount by Month", subtitle = "Location 1 & Industry 1",
fill = "year") +
theme_classic() +
theme(legend.position = "bottom",
plot.title = element_text(size = 15L),
axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
facet_wrap(vars(year_number), nrow = 1L) +
coord_flip()
As can be seen, the anomalous increase towards the end of 2016 noticed earlier is the result of missing transaction data for December 2016. This is another insight drawn from the trend chart above.
In this section, I focus on developing multiple linear regression (MLR) models filtered to Location 1 & Industry 1.
Numerous MLR models will be developed and assessed with various combinations of predictor variables. As the approach, I adopt a stepwise model selection method, namely backwards elimination: I start with a full model, that is, a model with all candidate predictors included, and then drop variables one at a time until a parsimonious model is reached.
Note that even though the search starts with all variables, I exclude location, industry and year, since the data is already filtered to Location 1, Industry 1 and the years 2013-2015; including them adds no information and could overfit the MLR model.
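As an aside (not part of the assignment workflow), R's built-in step() function can automate this backward search using AIC; a minimal illustrative sketch on the same training set, assuming the candidate predictors are date and month_number:
# hedged sketch: automated backward elimination by AIC on the Industry 1 / Location 1 train set
full_model <- lm(monthly_amount ~ date + month_number, data = train_set_1_1)
backward_model <- step(full_model, direction = "backward", trace = FALSE)
summary(backward_model)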
##::::::::::::::::::::::::::::
## M1: date + month_number
##::::::::::::::::::::::::::::
M1_1 <- lm(monthly_amount~date+month_number, data =train_set_1_1)
summary(M1_1)
##
## Call:
## lm(formula = monthly_amount ~ date + month_number, data = train_set_1_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12943.4 -1553.1 494.6 2858.6 10733.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.758e+05 5.029e+04 -3.495 0.001954 **
## date 1.981e+01 3.123e+00 6.342 1.80e-06 ***
## month_number02 1.828e+04 4.561e+03 4.008 0.000551 ***
## month_number03 2.508e+04 4.563e+03 5.496 1.38e-05 ***
## month_number04 9.805e+03 4.568e+03 2.146 0.042623 *
## month_number05 2.188e+04 4.575e+03 4.783 8.00e-05 ***
## month_number06 1.448e+04 4.584e+03 3.159 0.004389 **
## month_number07 1.903e+04 4.595e+03 4.142 0.000395 ***
## month_number08 2.004e+04 4.608e+03 4.349 0.000236 ***
## month_number09 1.453e+04 4.622e+03 3.144 0.004546 **
## month_number10 2.475e+04 4.639e+03 5.335 2.04e-05 ***
## month_number11 1.983e+04 4.658e+03 4.258 0.000296 ***
## month_number12 4.360e+03 4.678e+03 0.932 0.360955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5584 on 23 degrees of freedom
## Multiple R-squared: 0.8329, Adjusted R-squared: 0.7457
## F-statistic: 9.555 on 12 and 23 DF, p-value: 2.627e-06
The month_number variable is introduced to account for the seasonality of the sales amount. As summarized for the linear model with formula monthly_amount ~ date + month_number, this model performs quite well, with an Adjusted R-squared of 0.7457. In other words, approximately 74.57% of the variance in the training set is explained by the model.
Model 2: using only the month_number variable
##::::::::::::::::::::::::::::
## M2: month_number
##::::::::::::::::::::::::::::
M1_2 <-lm(monthly_amount~month_number, data =train_set_1_1)
summary(M1_2)
##
## Call:
## lm(formula = monthly_amount ~ month_number, data = train_set_1_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14715.9 -6444.4 625.6 6639.3 12664.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 142568 5233 27.244 < 2e-16 ***
## month_number02 18895 7401 2.553 0.017455 *
## month_number03 26248 7401 3.547 0.001641 **
## month_number04 11588 7401 1.566 0.130484
## month_number05 24258 7401 3.278 0.003179 **
## month_number06 17471 7401 2.361 0.026703 *
## month_number07 22616 7401 3.056 0.005431 **
## month_number08 24239 7401 3.275 0.003200 **
## month_number09 19347 7401 2.614 0.015209 *
## month_number10 30155 7401 4.075 0.000436 ***
## month_number11 25854 7401 3.493 0.001873 **
## month_number12 10976 7401 1.483 0.151073
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9064 on 24 degrees of freedom
## Multiple R-squared: 0.5407, Adjusted R-squared: 0.3302
## F-statistic: 2.569 on 11 and 24 DF, p-value: 0.02583
Based on the Multiple R-squared value, Model 2 accounts for only about 54% of the variance. This indicates that month_number alone is a moderate predictor of monthly_amount and performs noticeably worse than the first model. The overall F-test p-value of 0.02583 shows that the month predictors are still statistically significant at the 5% level, but the much lower Adjusted R-squared (0.33) confirms that the fit is substantially poorer than Model 1's.
Model 3: using only the date variable
##::::::::::::::::::::::::::::
## M3: 1 VARIABLE - date
##::::::::::::::::::::::::::::
M1_3 <-lm(monthly_amount~date, data =train_set_1_1)
summary(M1_3)
##
## Call:
## lm(formula = monthly_amount ~ date, data = train_set_1_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17585 -7236 1484 6569 17030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.764e+05 7.681e+04 -2.297 0.0279 *
## date 2.083e+01 4.729e+00 4.405 0.0001 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8966 on 34 degrees of freedom
## Multiple R-squared: 0.3633, Adjusted R-squared: 0.3446
## F-statistic: 19.4 on 1 and 34 DF, p-value: 1e-04
In the third model, where only the date variable is fitted, performance is even worse: just 36% of the variability in the average monthly sales amount is explained, leaving 64% of the variance unexplained.
In conclusion, Model 1 provides the best fit of the three combinations. Thus, I will use this model to make a prediction of monthly_amount for December 2016.
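As an optional check (not part of the original workflow), the nested models can also be compared formally with F-tests via anova():
# does adding date improve on month_number alone?
anova(M1_2, M1_1)
# does adding month_number improve on date alone?
anova(M1_3, M1_1)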
Having chosen Model 1 as the final model, my next step is to create a new data frame specifying only the 2016 records. I then make the prediction for the transaction amount in December 2016 with the predict() function as follows.
##################################################################
## Task 2A.iv - Making the Prediction on Dec 2016 ##
##################################################################
##----------------------------------------------------------------
## Dec 2016 Prediction --
##----------------------------------------------------------------
# create new vector that specify column names
x <- c("date", "industry", "location", "monthly_amount",
"month_number","year_number","type")
# create new df for fitted df
fit_df <- data.frame(train_set_1_1[-7]) %>%
# assign value type for the dataset
mutate(type="Fitted")
# Rename the fitted dataset column names
colnames(fit_df) <- x
# create new df for prediction df
testset_all <- data.frame(test_set_1_1[-7]) %>%
# assign value type for the dataset
mutate(type="Predicted")
# Rename the predicted dataset column names
colnames(testset_all) <- x
# create new df for prediction row
dec2016<-data.frame(date = "2016/12/01",
industry="1",
location="1",
monthly_amount = 0,
month_number="12",
year_number="2016",
type="Predicted"
)
# convert the date column to the right date format to merge with predicted df later
dec2016$date <- as.Date(dec2016$date,format = "%Y/%m/%d")
## Use predict function + model we just built and apply to dec_2016 data frame
fit_df$monthly_amount <- predict(M1_1, train_set_1_1)
testset_all$monthly_amount <- predict(M1_1, test_set_1_1)
pred <- predict(M1_1, dec2016)
dec2016$monthly_amount <- pred
# combine the predicted df with the new predicted row
testset_all <- sqldf(
"SELECT * FROM testset_all
UNION ALL
SELECT * FROM dec2016"
)
Next, I examine whether the December 2016 forecast is reasonable by plotting the predicted data as a line plot.
# create new column to specify value type for the dataset
agg_1_1$type <- "Real"
# make sure our variables have the right format for plotting
fit_df <- as.data.frame(fit_df)
testset_all <- as.data.frame(testset_all)
# merge all necessary variables into 1 single df variable with rbind
newData <- rbind(fit_df, testset_all,agg_1_1)
# Plot goodness of fit from our MLR
newData %>%
filter(!(year_number %in% "2016")) %>%
filter(!(type %in% "Predicted")) %>%
ggplot() +
aes(x = date, y = monthly_amount, colour = type) +
geom_line(size = 0.5) +
scale_color_hue(direction = 1) +
labs(
x = "Year",
y = "Mean Transaction Amount",
title = "Fit from MLR"
) +
theme_light()
# Plot the comparison between our MLR prediction on test set with the actual values
ggplot(newData) +
aes(x = date, y = monthly_amount, colour = type) +
geom_line(size = 0.7) +
scale_color_hue(direction = 1) +
labs(x = "Year", y = "Mean transaction amount",
title = "December 2016 Prediction") +
theme_light() +
theme(plot.title = element_text(face = "bold"))
We can quantify the residuals by calculating a number of commonly used evaluation metrics. We focus on the following:
Mean Square Error (MSE): the mean of the squared differences between predicted and actual values. This yields a relative metric in which the smaller the value, the better the fit of the model.
Root Mean Square Error (RMSE): the square root of the MSE. This yields an absolute metric in the same unit as the label. The smaller the value, the better the model.
Mean Absolute Error (MAE): the mean of the absolute differences between predicted and actual values, also in the same unit as the label and reported by the code below.
Coefficient of Determination (usually known as R-squared or R2): a relative metric in which the higher the value, the better the fit of the model. In essence, this metric represents how much of the variance between predicted and actual label values the model is able to explain.
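For transparency, these metrics can also be computed by hand from the model's predictions on the test set; the short sketch below is illustrative and not part of the original code (the variable names are arbitrary, and the R-squared here uses the 1 - SSE/SST definition, which may differ slightly from the modelr helper used next):
# hedged sketch: compute MSE, RMSE, MAE and R-squared manually on the test set
pred_test <- predict(M1_1, newdata = test_set_1_1)
actual <- test_set_1_1$monthly_amount
mse <- mean((actual - pred_test)^2) # mean squared error
rmse <- sqrt(mse) # root mean squared error
mae <- mean(abs(actual - pred_test)) # mean absolute error
r2 <- 1 - sum((actual - pred_test)^2) / sum((actual - mean(actual))^2) # coefficient of determination
c(MSE = mse, RMSE = rmse, MAE = mae, R2 = r2)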
#################################################################
## Task 2B - Describe the model ##
#################################################################
##----------------------------------------------------------------
## Evaluation Metrics --
##----------------------------------------------------------------
# Predictions vs Test data
evaluation_metrics_test_1_1 <- data.frame(
X = "Predictions vs Test data",
R2 = modelr::rsquare(M1_1, data = test_set_1_1), # R-squared evaluated on the test set
RMSE = modelr::rmse(M1_1, data = test_set_1_1),
MAE = modelr::mae(M1_1, data = test_set_1_1)
)
# table formatting
rmarkdown::paged_table(evaluation_metrics_test_1_1)
library(modelr)
# Fitted value vs Train data
evaluation_metrics_train_1_1 <- data.frame(
X = "Fitted value vs Train data",
R2 = rsquare(M1_1, train_set_1_1),
RMSE = modelr::rmse(M1_1, train_set_1_1),
MAE = modelr::mae(M1_1, train_set_1_1)
)
# table formatting
rmarkdown::paged_table(evaluation_metrics_train_1_1)
# X R2 RMSE MAE
# 1 Fitted value vs Train data 0.8329209 4463.717 3172.824
# Model test error
test_error <- (evaluation_metrics_test_1_1$RMSE)/mean(test_set_1_1$monthly_amount)
# view the result
test_error # [1] 0.06164264
## [1] 0.06164264
The model's performance on the test data is noticeably lower than on the training data: the test R2 of 0.55 is well below the training R2 of 0.83, indicating a weaker fit to the unseen data. The prediction error RMSE of 11,293 represents an error rate of roughly 6% of the mean monthly amount, which is still good.
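Because a single train/test split can be sensitive to which months fall in the hold-out year, the caret package loaded earlier could also be used for rolling-origin (time-slice) cross-validation; the sketch below is purely illustrative, with the window sizes chosen arbitrarily:
# hedged sketch: rolling-origin cross-validation of Model 1 with caret
ctrl <- trainControl(method = "timeslice",
initialWindow = 24, # train on the first 24 months
horizon = 3, # evaluate on the following 3 months
fixedWindow = FALSE)
cv_fit <- train(monthly_amount ~ date + month_number,
data = train_set_1_1,
method = "lm",
trControl = ctrl)
cv_fit$results # average RMSE / R-squared / MAE across the rolling folds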
We now want to extend the modelling across all industries and geographical locations. To do this, I construct a loop function, calculate_predictions, to run everything through.
To be more specific, the loop runs every location and industry pair through a simpler per-subset model, monthly_amount ~ time_number, where time_number represents the date order, i.e. the position of each month within that subset.
# format the aggregate data columns to make sure they have right formats when performing modelling
aggregated_transactions$industry <- as.numeric(aggregated_transactions$industry)
aggregated_transactions$location <- as.numeric(aggregated_transactions$location)
aggregated_transactions$month_number <- as.numeric(aggregated_transactions$month_number)
aggregated_transactions$year_number <- as.numeric(aggregated_transactions$year_number)
# Using the aggregated dataset from part 2 (aggregated_transactions)
df <- aggregated_transactions
calculate_predictions <- function(df, industries, locations) {
output = data.frame()
# Use a 9-month test window so a couple of months from the preceding year are captured, covering the seasonal pattern
testingNum = 9
for (ind in industries) {
for (loc in locations) {
# Create a subset of the data
temp = df[df$industry == ind & df$location == loc, ]
# Check that this industry/location pair has at least 'testingNum' distinct dates
if (length(unique(temp$date)) >= testingNum) {
# Arrange dataset by date (assign the result so the ordering actually takes effect)
temp = arrange(temp, date)
# Add a number to represent date order
temp$time_number = c(1:nrow(temp))
# Training number is the number of rows minus the testingNum
trainingNum = nrow(temp) - testingNum
# Training set is all rows from the start minus the number in the test set. We'll arrange again, just in case.
trainingSet = head(arrange(temp, time_number), trainingNum)
# Testing set is the last 'testingNum' rows. We'll arrange again, just in case.
testingSet = tail(arrange(temp, time_number), testingNum)
# Run the model
training.model = lm(monthly_amount~time_number, data=trainingSet)
# testing.model = lm(monthly_amount~time_number, data=testingSet)
# Calculate the mean squared error
training.mse <- mean(residuals(training.model)^2)
#testing.mse <- mean(residuals(testing.model)^2)
# Calculate root mean squared error
training.rmse <- sqrt(training.mse)
# testing.rmse <- sqrt(testing.mse)
### Now, add an extra row into temp for the December 2016 prediction, giving December 2016 a monthly_amount of 0
# Create a dataframe containing just the December 2016 data
december_2016 = data.frame(date = "2016-12-01",
industry=ind,
location=loc,
monthly_amount=0,
month_number=12,
year_number=2016,
time_number=(nrow(temp)+1))
# Ensure temp is of type data frame
temp = as.data.frame(temp)
#testingSet = as.data.frame(testingSet)
# Add the December 2016 row
temp = rbind(temp, december_2016)
# testingSet = rbind(testingSet, december_2016)
# Output a prediction based on all rows and add it to the temp data frame
temp$prediction = predict(training.model, temp)
testingSet$prediction = predict(training.model, testingSet)
# Get the last prediction value (which is the Dec 2016 value).
train_dec_2016_prediction = tail(temp$prediction, 1)
# test_dec_2016_prediction = tail(testingSet$prediction, 1)
# Create row to add to the output data frame, including industry and location variables
dataRow = c(ind,loc,training.rmse,train_dec_2016_prediction)
} else {
# Append entry to output data frame when not enough data to compute
dataRow = c(ind,loc,NA,NA)
}
# Add the row to the output dataframe
output = rbind(output, dataRow)
}
}
#Add column names to the output data frame
colnames(output) <- c("Industry","Location", "RMSE", "Dec 2016 Prediction")
#Return the output
return(output)
}
# Get the list of unique industries, sorted in numerical order
industries <- sort(unique(df$industry))
# Get the list of unique locations, sorted in numerical order
locations <- sort(unique(df$location))
# Calculate the predictions for all industry and location pairs
all_predictions <- calculate_predictions(df, industries, locations)
# Order by RMSE
rmarkdown::paged_table(arrange(all_predictions, RMSE))
Among all the industry and location subsets, we pick the two worst-performing pairs by highest RMSE:
Industry 6 & Location 1
Industry 10 & Location 8
In order to find potential reasons for the poor performance of these pairs, I retrain the model for these two industry and location combinations and then plot the fits to see how they are performing, as sketched below.
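The refitting step itself is not shown in the output, so the sketch below reconstructs it under stated assumptions: the object names (i6_l1.lm, i10_l8.lm, trainingSet, trainingSet_2) are assumed to match those referenced in the plots that follow, and for simplicity each model is fitted on all available months for its pair rather than reproducing the exact train/test cut used inside the loop.
# hedged sketch: refit the time_number-only model for the two worst industry/location pairs
refit_pair <- function(data, ind, loc) {
subset_df <- data %>%
ungroup() %>% # drop the date/industry grouping carried over from the aggregation
filter(industry == ind, location == loc) %>%
arrange(date) %>%
mutate(time_number = row_number()) # date order within the subset
list(data = subset_df, model = lm(monthly_amount ~ time_number, data = subset_df))
}
i6_l1 <- refit_pair(aggregated_transactions, 6, 1)
i10_l8 <- refit_pair(aggregated_transactions, 10, 8)
trainingSet <- i6_l1$data
trainingSet_2 <- i10_l8$data
i6_l1.lm <- i6_l1$model
i10_l8.lm <- i10_l8$model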
Let's take a look at the diagnostic plots for these 2 models:
# the diagnostic plot
################################
# Industry 6 & Location 1
################################
par(mfrow=c(2,2))
plot(i6_l1.lm)
################################
# Industry 10 & Location 8
################################
par(mfrow=c(2,2))
plot(i10_l8.lm)
We can see that both models have outlier records. For industry 6 and location 1, there is one point far above the fitted model and two points far below it. For industry 10 and location 8, three outliers sit above the fitted line.
# plot the model
i6_l1_model <- ggplot(data = trainingSet, mapping = aes(x = time_number, y =
monthly_amount)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title="Model 2, Using time_number as only predictor",
subtitle = "industry 6, location 1",
x="month number",
y="Monthly Amount")
i10_l8_model <- ggplot(data = trainingSet_2, mapping = aes(x = time_number, y =
monthly_amount)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title="Model 2, Using time_number as only predictor",
subtitle = "industry 10, location 8",
x="month number",
y="Monthly Amount")
## combine individual plots into a single page
ggarrange(i6_l1_model,
i10_l8_model,
ncol = 2, nrow = 1)
Confirming this, the fitted linear models for industry 6 & location 1 and industry 10 & location 8 are plotted above. These plots give a clearer view of the prominent outliers contained in both subsets. It is also apparent that the fitted line cannot capture the recurring rise and fall of the monthly mean caused by seasonality, which is another likely reason for the poor performance. Developing more advanced models that account for those fluctuations, and removing these outliers, would likely lead to a more accurate and powerful model.
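As one concrete illustration of such an improvement (a hedged sketch, not part of the original analysis), month can be added as a categorical predictor to the per-pair model defined above:
# hedged sketch: add a monthly seasonal term to the time_number-only model for Industry 6 / Location 1
i6_l1_seasonal.lm <- lm(monthly_amount ~ time_number + factor(month_number), data = trainingSet)
# compare the adjusted R-squared with the time_number-only fit
summary(i6_l1_seasonal.lm)$adj.r.squared
summary(i6_l1.lm)$adj.r.squared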
Medeiros, L. (2021, December 19). The CRISP-DM methodology - Lucas Medeiros. Medium. https://medium.com/@lucas.medeiross/the-crisp-dm-methodology-d1b1fc2dc653
Sharma, A. (2020, December 23). What is Skewness in Statistics? | Statistics for Data Science. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/
Wijaya, C. Y. (2021, December 19). CRISP-DM Methodology For Your First Data Science Project. Medium. https://towardsdatascience.com/crisp-dm-methodology-for-your-first-data-science-project-769f35e0346c
## load data description csv file
dd <- read_csv(here("dataset/data_description.csv"))
# display the information under table format
dd %>%
kbl() %>%
kable_styling(full_width = F)
| Field | Data Type | Description |
|---|---|---|
| date | Date | Date of the first day of each month |
| customer_id | String | Unique customer identifier |
| industry | Integer | Code for 10 industries, ranging from 1 to 10 |
| location | Integer | Code for 10 locations, ranging from 1 to 10 |
| monthly_amount | Numeric | Total transaction amount for customer in given month |