This report analyses and provides insight from the provided monthly financial transactions data file. The transactions cover the period from January 2013 to November 2016. Initial EDA will be undertaken to identify any interesting trends, and machine learning algorithms will then be developed and applied to the data to predict future monthly earnings.
This report has been prepared using the CRISP-DM format and is broken into the following sections:
Business and data mining objectives
Data understanding and preparation
Modelling
Evaluation and conclusion
Business objectives. Financial transactions data describes the total transaction amounts for customers each month over a range of industries and locations. The monthly transactions can vary greatly depending on both the industry and the location. We wish to gain insight into what is driving the trends and to increase total monthly revenue.
Data mining objectives. Analysis of the data will be performed first, followed by data visualisation to identify any underlying patterns.
The data set provided contains records from 10 different industries and 10 different locations, and provides the date and monthly amount over the period from January 2013 until November 2016. The customers are also included; however, they have been anonymised to protect their privacy. There are 94,248 separate data points and no missing values. The data structure is relatively simple, with 5 columns (date, anonymised customer id, industry, location and monthly amount).
The data was imported into the software package R. No missing values were found. The dates were stored as characters and were converted to date objects, which makes analysis easier in the modelling process. The remaining variables were untouched as they were already the correct data type.
try(library(tidyverse), silent=TRUE)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
try(library(dplyr), silent=TRUE)
try(library(ggplot2), silent=TRUE)
try(library(caret), silent=TRUE)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
try(library(Rcpp), silent=TRUE)
# Set up working directory and import the dataset (transactions.csv)
working_dir = setwd('C:/Users/Admin/Desktop/Masters of Data Science and Innovation/Autumn 2021/36106 - Machine learning algorithms and applications/Assignment 1 MLAA AUT 2021/')
dataset = read.csv("transactions.csv")
# convert data types to correct types as set out in data dictionary
dataset$date = as.Date(dataset$date, format = "%d/%m/%Y")
dataset$customer_id = as.character(dataset$customer_id)
# view the data and check the data types
summary(dataset)
## date customer_id industry location
## Min. :2013-01-01 Length:94248 Min. : 1.000 Min. : 1.000
## 1st Qu.:2014-05-01 Class :character 1st Qu.: 1.000 1st Qu.: 2.000
## Median :2015-06-01 Mode :character Median : 2.000 Median : 4.000
## Mean :2015-03-26 Mean : 2.455 Mean : 4.214
## 3rd Qu.:2016-03-01 3rd Qu.: 3.000 3rd Qu.: 6.000
## Max. :2016-11-01 Max. :10.000 Max. :10.000
## monthly_amount
## Min. : 0
## 1st Qu.: 95323
## Median : 179399
## Mean : 395397
## 3rd Qu.: 375439
## Max. :100000000
str(dataset)
## 'data.frame': 94248 obs. of 5 variables:
## $ date : Date, format: "2013-01-01" "2013-02-01" ...
## $ customer_id : chr "70efdf2ec9b086079795c442636b55fb" "70efdf2ec9b086079795c442636b55fb" "70efdf2ec9b086079795c442636b55fb" "70efdf2ec9b086079795c442636b55fb" ...
## $ industry : int 8 8 8 8 8 8 8 8 8 8 ...
## $ location : int 9 9 9 9 9 9 9 9 9 9 ...
## $ monthly_amount: num 753851 651548 1138769 659739 770675 ...
First, we want to gain some insight by visualising the data, plotting a range of variables to see if any trends are apparent.
Scatter plot of total monthly transactions per industry:
plot_date_vs_month = ggplot(data = dataset, aes(x = date, y = monthly_amount/1e6)) +
geom_point(aes(colour = industry)) +
scale_y_continuous(labels = scales::comma) +
labs (title = "Total transactions by month per industry",
x = "Date",
y = "Monthly transation amount in millions")
plot_date_vs_month
A second scatter plot of total transactions by month per industry, separating the industries by colour:
plot_date_vs_month_2 = ggplot(data = dataset, aes(x = date, y = monthly_amount/1e6)) +
geom_point(aes(colour = cut(industry, c(0,1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11)))) +
scale_color_manual(name = "Industry",
values = c("(0,1]" = "black",
"(1,2]" = "yellow",
"(2,3]" = "red",
"(3,4]" = "blue",
"(4,5]" = "pink",
"(5,6]" = "green",
"(6,7]" = "purple",
"(7,8]" = "orange",
"(8,9]" = "grey",
"(9,10]" = "brown"),
labels = c("Ind 1", "Ind 2", "Ind 3", "Ind 4", "Ind 5", "Ind 6", "Ind 7", "Ind 8", "Ind 9", "Ind 10"))+
scale_y_continuous(labels = scales::comma) +
labs (title = "Total transactions by month per industry",
x = "Date",
y = "Monthly transation amount in millions")
plot_date_vs_month_2
Now we will look at the relationship between total transactions per month and location.
plot_date_vs_month_location = ggplot(data = dataset, aes(x = date, y = monthly_amount/1e6)) +
geom_point(aes(colour = location)) +
scale_y_continuous(labels = scales::comma) +
labs (title = "Total transactions by month per location",
x = "Date",
y = "Monthly transation amount in millions")
plot_date_vs_month_location
Instead of looking at total monthly transactions by dollar amount, we'll now look at the total number of transactions, both by industry and by location.
par(mfrow=c(1,2))
hist(dataset$industry, main = "Trans by Industry", xlab="Industry", ylab="Number of transactions", xlim = c(0,10), ylim=c(0,50000), las=0)
hist(dataset$location, main = "Trans by Location", xlab="Location", ylab="Number of transactions", xlim = c(0,10), ylim=c(0,50000), las=0)
Now let's compare each monthly transaction to the average to see if there is any further insight. The process involves creating a new boolean variable that records whether each monthly amount is above (TRUE) or below (FALSE) the mean.
# create a boolean variable indicating whether each value is above or below the mean
transactions_with_mean_boolean <- dataset %>%
mutate(monthly_amount_above_mean = ifelse(monthly_amount > mean(monthly_amount), TRUE, FALSE))
# plot the monthly counts of transactions above/below the mean
qplot(x = date, fill = monthly_amount_above_mean, data = transactions_with_mean_boolean, geom = "bar",
main = "Number of transactions per month compared to average monthly transaction")+
scale_fill_discrete(name = "Above or below mean")+
ylab("Number of transactions")+
xlab("Date")
From the above it is clear that most monthly transactions are below the mean monthly transaction value. It is also clear that the number of monthly transactions is growing steadily without any significant rises or declines. The only decline, albeit minimal, appears to happen between December and January, which can be explained by the seasonality of the business and the holiday period.
Let's now analyse how each location performed relative to the average monthly transaction amount.
qplot(x = location, fill = monthly_amount_above_mean, data = transactions_with_mean_boolean, geom = "bar",
main = "Number of transactions by location, compared to average monthly transaction")+
scale_fill_discrete(name = "Above or below mean")+
scale_x_continuous(breaks=1:10) +
ylab("Number of transactions")+
xlab("Location")
Let's now look at the monthly average compared by industry:
qplot(x = industry, fill = monthly_amount_above_mean, data = transactions_with_mean_boolean, geom = "bar",
main = "Number of transactions by industry, compared to average monthly transaction")+
scale_fill_discrete(name = "Above or below mean")+
scale_x_continuous(breaks=1:10) +
ylab("Number of transations")+
xlab("Industry")
From the previous two plots, we have gained some insight into how the locations and industries perform compared to the average monthly transaction amount. We can also view each industry's and location's number of transactions over the entire date period. It is clear that industries 1 and 2 generate the largest number of transactions; however, the majority of the time they come in below the average monthly transaction amount, whereas industries 9 and 3, while having fewer transactions, are regularly above the average.
Similar insights can be gained from the location data, where locations 1 and 2 dominate the number of sales. However, the proportion of above- to below-average monthly transactions appears similar across all locations, so no location clearly performs best on that metric.
The aim will be to build a model that can predict monthly_amounts for the future.
Numerous models will be developed and assessed with various combinations of predictor variables.
The data will need to be modified as we will be applying a linear regression model. This requires us to convert each month into an integer, assigned sequentially, e.g. Jan 2013 = 1, Feb 2013 = 2, etc. The data will be split roughly 70:30 train:test, to ensure that the clear seasonality is captured in our training set, which will then be used to predict December 2016.
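To make the sequential encoding concrete, here is a minimal sketch (month_index is a hypothetical helper, not part of the report's pipeline) mapping each date to its integer index, assuming the series starts in January 2013:
# Hypothetical helper: integer month index counted from January 2013.
month_index <- function(d) {
(as.integer(format(d, "%Y")) - 2013) * 12 + as.integer(format(d, "%m"))
}
month_index(as.Date("2013-01-01"))  # 1
month_index(as.Date("2014-01-01"))  # 13
month_index(as.Date("2016-12-01"))  # 48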
This section will also include references to the assignment tasks
Basic Model Fitting
a) Aggregate the data, grouping by date, industry and location, and calculating the mean monthly_amount.
# generating a function that will be used repeatedly
monthly_avg_function = function(df) {
# group by date, location and industry
output = df %>%
group_by(date, industry, location) %>%
summarize(monthly_amount = mean(monthly_amount, na.rm = TRUE))
# make sure date is in the proper format before integer conversion
output = output %>%
mutate(month = format(as.Date(date), "%m")) %>%
mutate(year = format(as.Date(date), "%Y"))
# convert month and year to integers
output$month = as.integer(output$month)
output$year = as.integer(output$year)
return(output)
}
# have a look at the dataset to make sure it has worked properly
aggregated_dataset = monthly_avg_function(dataset)
## `summarise()` has grouped output by 'date', 'industry'. You can override using the `.groups` argument.
aggregated_dataset
## # A tibble: 3,886 x 6
## # Groups: date, industry [470]
## date industry location monthly_amount month year
## <date> <int> <int> <dbl> <int> <int>
## 1 2013-01-01 1 1 136081. 1 2013
## 2 2013-01-01 1 2 177840. 1 2013
## 3 2013-01-01 1 3 141632. 1 2013
## 4 2013-01-01 1 4 221058. 1 2013
## 5 2013-01-01 1 5 178138. 1 2013
## 6 2013-01-01 1 6 133400. 1 2013
## 7 2013-01-01 1 7 231599. 1 2013
## 8 2013-01-01 1 8 143778. 1 2013
## 9 2013-01-01 1 9 157416. 1 2013
## 10 2013-01-01 1 10 188735. 1 2013
## # ... with 3,876 more rows
Create a line plot of the variable monthly_amount for industry = 1 and location = 1
ggplot(data=filter(aggregated_dataset, industry == 1 & location == 1), aes(x=date, y=monthly_amount, group=1)) +
geom_line(color="red") + geom_point() +
labs(title="Line plot of mean monthly amount for Industry 1 and Location 1", x="Date", y="Mean Monthly Amount")
Industry and location are constant within this filtered subset, so they will not be used for model prediction.
It was hinted that we will need to hold out more than 12 months due to seasonality; the number of held-out months will be called month_number.
# set up a temporary data set of aggregated values
temporary <- aggregated_dataset[aggregated_dataset$industry == 1 & aggregated_dataset$location == 1, ]
# sort by date so that time_number is assigned in chronological order
temporary <- arrange(temporary, date)
temporary$time_number = 1:nrow(temporary)
# view the data ordered by time_number to check it has worked properly
arrange(temporary, time_number)
## # A tibble: 47 x 7
## # Groups: date, industry [47]
## date industry location monthly_amount month year time_number
## <date> <int> <int> <dbl> <int> <int> <int>
## 1 2013-01-01 1 1 136081. 1 2013 1
## 2 2013-02-01 1 1 152964. 2 2013 2
## 3 2013-03-01 1 1 158481. 3 2013 3
## 4 2013-04-01 1 1 152341. 4 2013 4
## 5 2013-05-01 1 1 170330. 5 2013 5
## 6 2013-06-01 1 1 152575. 6 2013 6
## 7 2013-07-01 1 1 158755. 7 2013 7
## 8 2013-08-01 1 1 165313. 8 2013 8
## 9 2013-09-01 1 1 147199. 9 2013 9
## 10 2013-10-01 1 1 163409. 10 2013 10
## # ... with 37 more rows
# number of months used to capture seasonality, may need to test out other values. Initially set to 14
month_number = 14
#Training number is the number of rows minus the month_number
trainingNum = nrow(arrange(temporary, date, time_number)) - month_number
#Training set is all rows from the start minus the number in the test set
trainingSet = head(arrange(temporary, date, time_number), trainingNum)
#Testing set is the last 'month_number' rows
testingSet = tail(arrange(temporary, date, time_number), month_number)
trainingSet
## # A tibble: 33 x 7
## # Groups: date, industry [33]
## date industry location monthly_amount month year time_number
## <date> <int> <int> <dbl> <int> <int> <int>
## 1 2013-01-01 1 1 136081. 1 2013 1
## 2 2013-02-01 1 1 152964. 2 2013 2
## 3 2013-03-01 1 1 158481. 3 2013 3
## 4 2013-04-01 1 1 152341. 4 2013 4
## 5 2013-05-01 1 1 170330. 5 2013 5
## 6 2013-06-01 1 1 152575. 6 2013 6
## 7 2013-07-01 1 1 158755. 7 2013 7
## 8 2013-08-01 1 1 165313. 8 2013 8
## 9 2013-09-01 1 1 147199. 9 2013 9
## 10 2013-10-01 1 1 163409. 10 2013 10
## # ... with 23 more rows
month_number.lm <- lm(monthly_amount~month, data=trainingSet)
summary(month_number.lm)
##
## Call:
## lm(formula = monthly_amount ~ month, data = trainingSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21904.9 -7126.0 162.6 8778.7 17485.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156947.7 3836.2 40.913 <2e-16 ***
## month 629.8 553.5 1.138 0.264
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10510 on 31 degrees of freedom
## Multiple R-squared: 0.04009, Adjusted R-squared: 0.009127
## F-statistic: 1.295 on 1 and 31 DF, p-value: 0.2639
Based on the multiple R-squared value, our model can only account for approximately 4% of the variance. This is very low, and the model is a poor predictor of monthly_amount. This is confirmed by the p-value of 0.264, which tells us that the month predictor alone is unlikely to fit the data well.
Looking at the residuals, we have a minimum of approximately -21,905 and a maximum of approximately 17,485. We will need to improve our model.
Let's have a look at the plot of the model against the data to assess how our model performed.
# plot of residuals of model loc = 1, ind = 1
ggplot(data = trainingSet, mapping = aes(x = month, y = monthly_amount)) +
geom_point() +
scale_x_continuous(breaks=1:12)+
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title="Model 1. Using month as only predictor", x="Month Number", y="Monthly Amount")
## `geom_smooth()` using formula 'y ~ x'
As expected from the statistics above, our model is clearly a poor fit. It does, however, capture a trend, especially between months 2 and 11. As mentioned earlier, the number of monthly transactions, and therefore the amount, decreases over December and January, which is reflected above.
Let's now look at the diagnostic plots.
par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(month_number.lm)
Model 2. Linear model to predict monthly_amount using time_number as the sole predictor (Task b i, ii). time_number is the month in sequential order over the entire data set, e.g. Jan 2013 = 1, Jan 2014 = 13, etc.
time_number.lm <- lm(monthly_amount~time_number, data=trainingSet)
summary(time_number.lm)
##
## Call:
## lm(formula = monthly_amount ~ time_number, data = trainingSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17279 -6992 1486 6303 16923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150333.8 3159.5 47.581 < 2e-16 ***
## time_number 614.7 162.2 3.791 0.000651 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8870 on 31 degrees of freedom
## Multiple R-squared: 0.3168, Adjusted R-squared: 0.2947
## F-statistic: 14.37 on 1 and 31 DF, p-value: 0.0006507
Looking at the summary output, the R-squared value has improved significantly, rising to 0.3168: this model accounts for ~32% of the variation in the data. The p-value shows that time_number has a much stronger relationship with monthly_amount than month did.
Let's plot our new model against the data to gain some insight.
# plot of time_number model vs the data
ggplot(data = trainingSet, mapping = aes(x = time_number, y = monthly_amount)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title="Model 2, Using time_number as only predictor",
x="month number",
y="Monthly Amount")
## `geom_smooth()` using formula 'y ~ x'
The plot above shows that using time_number as the predictor captures more of the data and exhibits a stronger trend.
Again, let's look at the diagnostic plots.
par(mfrow=c(2,2))
plot(time_number.lm)
We will now build a model using both month and time_number to see if there is any additional improvement.
# build model using month and time_number
month_number_and_time_number.lm <- lm(monthly_amount~month+ time_number, data=trainingSet)
summary(month_number_and_time_number.lm)
##
## Call:
## lm(formula = monthly_amount ~ month + time_number, data = trainingSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17139 -6451 1404 6149 17038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 148686.3 3998.7 37.183 <2e-16 ***
## month 326.5 478.6 0.682 0.5003
## time_number 594.6 166.2 3.578 0.0012 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8947 on 30 degrees of freedom
## Multiple R-squared: 0.3272, Adjusted R-squared: 0.2823
## F-statistic: 7.295 on 2 and 30 DF, p-value: 0.00262
Since we have used two predictors, we should judge the model on the adjusted R-squared value, which is 0.2823. This is lower than Model 2, where time_number was the sole predictor. The predictors' p-values also show that the month variable is statistically insignificant and should likely be discarded.
Let's look at a 4th model using time_number and date as predictors.
date_and_time_number.lm <- lm(monthly_amount ~ date + time_number, data = trainingSet)
summary(date_and_time_number.lm)
##
## Call:
## lm(formula = monthly_amount ~ date + time_number, data = trainingSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14204 -7043 1639 7116 13018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67605444 32044036 2.110 0.0433 *
## date -4303 2044 -2.105 0.0438 *
## time_number 131489 62171 2.115 0.0429 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8416 on 30 degrees of freedom
## Multiple R-squared: 0.4047, Adjusted R-squared: 0.365
## F-statistic: 10.2 on 2 and 30 DF, p-value: 0.000418
Again, looking at the adjusted R-squared gives a value of 0.365, an improvement on our previous models. However, the individual predictors' p-values (~0.04) are considerably weaker than time_number's in Model 2, so the evidence for each predictor is less convincing. As such we will proceed with Model 2, where the relationship is much stronger.
Summary: Model 2 (time_number) performed best based on the R-squared value and the p-value of its predictor, and will be used going forward.
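As a supplementary cross-check (a sketch, not part of the original analysis), the four fitted models can also be compared with AIC, which penalises additional predictors; lower is better. This assumes the model objects above are still in the workspace, with the date-based model stored as date_and_time_number.lm:
# Compare the four candidate models by AIC (lower is better).
# Like R-squared, this is an in-sample measure, not an out-of-sample test.
AIC(month_number.lm,
time_number.lm,
month_number_and_time_number.lm,
date_and_time_number.lm)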
Add a row for December 2016 and predict its value using Model 2.
december2016<-data.frame(date = "01/12/16",
industry=1,
location=1,
monthly_amount = 0,
month=12,
year=2016,
time_number=48)
december2016$date <- as.Date(december2016$date,
format = "%d/%m/%y")
december2016$industry <- as.integer(december2016$industry)
december2016$location <- as.integer(december2016$location)
december2016$time_number <- as.integer(december2016$time_number)
december2016$monthly_amount <- predict(time_number.lm,december2016)
predicted_outcome <- rbind(temporary, december2016)
predicted_outcome
## # A tibble: 48 x 7
## # Groups: date, industry [48]
## date industry location monthly_amount month year time_number
## <date> <int> <int> <dbl> <dbl> <dbl> <int>
## 1 2013-01-01 1 1 136081. 1 2013 1
## 2 2013-02-01 1 1 152964. 2 2013 2
## 3 2013-03-01 1 1 158481. 3 2013 3
## 4 2013-04-01 1 1 152341. 4 2013 4
## 5 2013-05-01 1 1 170330. 5 2013 5
## 6 2013-06-01 1 1 152575. 6 2013 6
## 7 2013-07-01 1 1 158755. 7 2013 7
## 8 2013-08-01 1 1 165313. 8 2013 8
## 9 2013-09-01 1 1 147199. 9 2013 9
## 10 2013-10-01 1 1 163409. 10 2013 10
## # ... with 38 more rows
Let's now plot our predicted data and see if our prediction for December 2016 makes sense.
plot_predict_ind1_december2016 <- ggplot(data=predicted_outcome, aes(x=date, y=monthly_amount)) +
geom_smooth(stat="identity", method = "lm") +
geom_point(colour = "red") +
scale_y_continuous(labels = scales::comma) +
scale_x_date(date_minor_breaks = "1 month") +
labs (title = "Model 2. Industry 1 & Location 1",
subtitle = "December 2016 Prediction",
x = "Date", y = "Average Amount")
plot_predict_ind1_december2016
From the plot we can clearly see that the prediction for December 2016 has decreased compared to the supplied November value. This agrees with our earlier observations about the seasonality of the data, namely that it decreases in the December and January months.
Advanced Model Fitting
Let's create a new data set called p3. We'll need to sort the locations and industries.
# creating the aggregated data set and arranging locations and industries
p3_data_set = monthly_avg_function(aggregated_dataset)
## `summarise()` has grouped output by 'date', 'industry'. You can override using the `.groups` argument.
p3_data_set
## # A tibble: 3,886 x 6
## # Groups: date, industry [470]
## date industry location monthly_amount month year
## <date> <int> <int> <dbl> <int> <int>
## 1 2013-01-01 1 1 136081. 1 2013
## 2 2013-01-01 1 2 177840. 1 2013
## 3 2013-01-01 1 3 141632. 1 2013
## 4 2013-01-01 1 4 221058. 1 2013
## 5 2013-01-01 1 5 178138. 1 2013
## 6 2013-01-01 1 6 133400. 1 2013
## 7 2013-01-01 1 7 231599. 1 2013
## 8 2013-01-01 1 8 143778. 1 2013
## 9 2013-01-01 1 9 157416. 1 2013
## 10 2013-01-01 1 10 188735. 1 2013
## # ... with 3,876 more rows
industries <- sort(unique(p3_data_set$industry))
locations <- sort(unique(p3_data_set$location))
output = data.frame()
month_number = 11
We want to apply our model (monthly_amount ~ time_number) across all industries and locations, so we will put everything through a loop.
The loop will do the following:
1. Train the model for each industry and location as described above
2. Append a placeholder row for December 2016
3. Calculate the MSE and RMSE
4. Make a prediction for December 2016
5. Bind everything into a dataframe
for (ind in industries) {
for (loc in locations) {
temporary = p3_data_set[p3_data_set$industry == ind & p3_data_set$location == loc, ]
if (length(unique(temporary$date)) >= month_number) {
# sort by date so that time_number is assigned in chronological order
temporary = arrange(temporary, date)
temporary$time_number = c(1:nrow(temporary))
# Training number is the number of rows minus the month_number
trainingNum = nrow(temporary) - month_number
# Training set is all rows from the start minus the number in the test set. We'll arrange again, just in case.
trainingSet = head(arrange(temporary, time_number), trainingNum)
# Testing set is the last 'month_number' rows. We'll arrange again, just in case.
testingSet = tail(arrange(temporary, time_number), month_number)
# Run the model
training.model = lm(monthly_amount~time_number, data=trainingSet)
# testing.model = lm(monthly_amount~time_number, data=testingSet)
# Calculate the mean squared error of the training residuals
training.mse <- mean(residuals(training.model)^2)
#testing.mse <- mean(residuals(testing.model)^2)
# Calculate root mean squared error
training.rmse <- sqrt(training.mse)
# testing.rmse <- sqrt(testing.mse)
### Now, add an extra row into temp for the December 2016 prediction, giving December 2016 a monthly_amount of 0
# Create a dataframe containing just the December 2016 data
december_2016 = data.frame(date = "2016-12-01",
industry=ind,
location=loc,
monthly_amount=0,
month=12,
year=2016,
time_number=(nrow(temporary)+1))
# Make sure temporary has data frame type
temporary = as.data.frame(temporary)
#testingSet = as.data.frame(testingSet)
# Add the December 2016 row
temporary = rbind(temporary, december_2016)
# testingSet = rbind(testingSet, december_2016)
# Output a prediction based on all rows and add it to the temp data frame
temporary$prediction = predict(training.model, temporary)
testingSet$prediction = predict(training.model, testingSet)
# Get the last prediction value (which is the Dec 2016 value).
train_dec_2016_prediction = tail(temporary$prediction, 1)
# Create row to add to the output data frame, including industry and location variables
dataRow = c(ind, loc, training.rmse, train_dec_2016_prediction)
} else {
dataRow = c(ind,loc,NA,NA)
}
output = rbind(output, dataRow)
}
}
## Warning in predict.lm(training.model, temporary): prediction from a rank-
## deficient fit may be misleading
## Warning in predict.lm(training.model, testingSet): prediction from a rank-
## deficient fit may be misleading
output
## X1 X1.1 X8708.71861352349 X180582.52684704
## 1 1 1 8708.719 180582.53
## 2 1 2 8338.057 213241.27
## 3 1 3 8661.918 209985.53
## 4 1 4 15348.917 212896.23
## 5 1 5 9642.893 176660.84
## 6 1 6 7458.941 118608.92
## 7 1 7 18096.106 160786.91
## 8 1 8 5141.575 126819.12
## 9 1 9 10385.604 203987.20
## 10 1 10 14882.014 244951.94
## 11 2 1 28775.232 442855.05
## 12 2 2 24948.666 489477.47
## 13 2 3 58636.625 599354.87
## 14 2 4 49884.885 259835.34
## 15 2 5 54709.346 240381.99
## 16 2 6 29615.736 396781.68
## 17 2 7 26428.476 270319.06
## 18 2 8 48701.291 514809.95
## 19 2 9 71165.747 448156.64
## 20 2 10 26813.509 276164.23
## 21 3 1 96066.470 576221.29
## 22 3 2 61577.633 694542.07
## 23 3 3 108525.427 497038.50
## 24 3 4 112224.980 1038817.10
## 25 3 5 145509.181 282798.15
## 26 3 6 105817.990 85692.28
## 27 3 7 65670.448 676483.37
## 28 3 8 38483.766 175587.61
## 29 3 9 177989.626 310535.16
## 30 3 10 199913.555 -328462.07
## 31 4 1 26899.381 506804.79
## 32 4 2 16126.258 229728.26
## 33 4 3 38707.015 323510.53
## 34 4 4 44150.073 424171.07
## 35 4 5 42349.315 452839.66
## 36 4 6 153722.021 1225786.96
## 37 4 7 27318.186 546953.04
## 38 4 8 20896.022 234778.81
## 39 4 9 21650.678 227705.05
## 40 4 10 28435.318 321276.40
## 41 5 1 64433.278 603845.37
## 42 5 2 35046.534 435939.22
## 43 5 3 59049.222 308443.96
## 44 5 4 47842.417 240670.00
## 45 5 5 17675.791 246935.90
## 46 5 6 82370.650 745553.53
## 47 5 7 38740.777 340446.60
## 48 5 8 49761.420 485398.40
## 49 5 9 49744.111 428682.22
## 50 5 10 91451.847 431699.86
## 51 6 1 4353935.576 34422974.97
## 52 6 2 NA NA
## 53 6 3 NA NA
## 54 6 4 NA NA
## 55 6 5 NA NA
## 56 6 6 NA NA
## 57 6 7 NA NA
## 58 6 8 NA NA
## 59 6 9 NA NA
## 60 6 10 NA NA
## 61 7 1 24479.867 244206.31
## 62 7 2 33828.547 296698.43
## 63 7 3 89834.438 315015.71
## 64 7 4 55119.851 424611.83
## 65 7 5 45271.123 313633.17
## 66 7 6 36249.150 228236.58
## 67 7 7 17164.975 221666.84
## 68 7 8 36528.455 312872.89
## 69 7 9 92875.342 806894.01
## 70 7 10 54741.414 510940.74
## 71 8 1 283037.392 1536448.27
## 72 8 2 96333.939 789059.33
## 73 8 3 171915.695 699734.95
## 74 8 4 45769.216 333562.97
## 75 8 5 210197.449 107575.00
## 76 8 6 138846.516 1051881.52
## 77 8 7 183196.776 234663.58
## 78 8 8 183280.814 418496.05
## 79 8 9 71346.848 560103.12
## 80 8 10 116958.758 646710.25
## 81 9 1 97861.243 592133.92
## 82 9 2 136493.074 1345442.89
## 83 9 3 177917.008 1064834.04
## 84 9 4 550472.431 -338379.59
## 85 9 5 0.000 132464.38
## 86 9 6 270549.474 2184139.77
## 87 9 7 149838.574 1593969.07
## 88 9 8 NA NA
## 89 9 9 198407.002 1215139.24
## 90 9 10 118761.793 5254169.65
## 91 10 1 1054.842 61938.37
## 92 10 2 60958.468 247645.15
## 93 10 3 0.000 94960.92
## 94 10 4 4375.048 93558.22
## 95 10 5 11232.251 147650.11
## 96 10 6 NA NA
## 97 10 7 115263.316 265167.95
## 98 10 8 2589092.890 26384897.61
## 99 10 9 10160.973 100733.81
## 100 10 10 NA NA
We need to change the column names to their correct names.
colnames(output) <- c("Industry","Location", "RMSE", "Dec 2016 Prediction")
arrange(output, RMSE)
## Industry Location RMSE Dec 2016 Prediction
## 1 9 5 0.000 132464.38
## 2 10 3 0.000 94960.92
## 3 10 1 1054.842 61938.37
## 4 10 4 4375.048 93558.22
## 5 1 8 5141.575 126819.12
## 6 1 6 7458.941 118608.92
## 7 1 2 8338.057 213241.27
## 8 1 3 8661.918 209985.53
## 9 1 1 8708.719 180582.53
## 10 1 5 9642.893 176660.84
## 11 10 9 10160.973 100733.81
## 12 1 9 10385.604 203987.20
## 13 10 5 11232.251 147650.11
## 14 1 10 14882.014 244951.94
## 15 1 4 15348.917 212896.23
## 16 4 2 16126.258 229728.26
## 17 7 7 17164.975 221666.84
## 18 5 5 17675.791 246935.90
## 19 1 7 18096.106 160786.91
## 20 4 8 20896.022 234778.81
## 21 4 9 21650.678 227705.05
## 22 7 1 24479.867 244206.31
## 23 2 2 24948.666 489477.47
## 24 2 7 26428.476 270319.06
## 25 2 10 26813.509 276164.23
## 26 4 1 26899.381 506804.79
## 27 4 7 27318.186 546953.04
## 28 4 10 28435.318 321276.40
## 29 2 1 28775.232 442855.05
## 30 2 6 29615.736 396781.68
## 31 7 2 33828.547 296698.43
## 32 5 2 35046.534 435939.22
## 33 7 6 36249.150 228236.58
## 34 7 8 36528.455 312872.89
## 35 3 8 38483.766 175587.61
## 36 4 3 38707.015 323510.53
## 37 5 7 38740.777 340446.60
## 38 4 5 42349.315 452839.66
## 39 4 4 44150.073 424171.07
## 40 7 5 45271.123 313633.17
## 41 8 4 45769.216 333562.97
## 42 5 4 47842.417 240670.00
## 43 2 8 48701.291 514809.95
## 44 5 9 49744.111 428682.22
## 45 5 8 49761.420 485398.40
## 46 2 4 49884.885 259835.34
## 47 2 5 54709.346 240381.99
## 48 7 10 54741.414 510940.74
## 49 7 4 55119.851 424611.83
## 50 2 3 58636.625 599354.87
## 51 5 3 59049.222 308443.96
## 52 10 2 60958.468 247645.15
## 53 3 2 61577.633 694542.07
## 54 5 1 64433.278 603845.37
## 55 3 7 65670.448 676483.37
## 56 2 9 71165.747 448156.64
## 57 8 9 71346.848 560103.12
## 58 5 6 82370.650 745553.53
## 59 7 3 89834.438 315015.71
## 60 5 10 91451.847 431699.86
## 61 7 9 92875.342 806894.01
## 62 3 1 96066.470 576221.29
## 63 8 2 96333.939 789059.33
## 64 9 1 97861.243 592133.92
## 65 3 6 105817.990 85692.28
## 66 3 3 108525.427 497038.50
## 67 3 4 112224.980 1038817.10
## 68 10 7 115263.316 265167.95
## 69 8 10 116958.758 646710.25
## 70 9 10 118761.793 5254169.65
## 71 9 2 136493.074 1345442.89
## 72 8 6 138846.516 1051881.52
## 73 3 5 145509.181 282798.15
## 74 9 7 149838.574 1593969.07
## 75 4 6 153722.021 1225786.96
## 76 8 3 171915.695 699734.95
## 77 9 3 177917.008 1064834.04
## 78 3 9 177989.626 310535.16
## 79 8 7 183196.776 234663.58
## 80 8 8 183280.814 418496.05
## 81 9 9 198407.002 1215139.24
## 82 3 10 199913.555 -328462.07
## 83 8 5 210197.449 107575.00
## 84 9 6 270549.474 2184139.77
## 85 8 1 283037.392 1536448.27
## 86 9 4 550472.431 -338379.59
## 87 10 8 2589092.890 26384897.61
## 88 6 1 4353935.576 34422974.97
## 89 6 2 NA NA
## 90 6 3 NA NA
## 91 6 4 NA NA
## 92 6 5 NA NA
## 93 6 6 NA NA
## 94 6 7 NA NA
## 95 6 8 NA NA
## 96 6 9 NA NA
## 97 6 10 NA NA
## 98 9 8 NA NA
## 99 10 6 NA NA
## 100 10 10 NA NA
Looking at the worst performers
Industry 6, location 1 and industry 10, location 8 were the worst performers by RMSE.
Let's have a look at how our model performed on those two data sets.
# start by looking at the data
ggplot(data=filter(aggregated_dataset, industry == 6 & location == 1), aes(x=date, y=monthly_amount, group=1)) +
geom_line(color="red") + geom_point() +
labs(title="Line plot of mean monthly amount for Industry 6 and Location 1", x="Date", y="Mean Monthly Amount")
There is a clear upward trend towards the more recent data points. The model is unlikely to capture the strength of this trend, which is why it has performed so poorly.
# set up a temporary data set of aggregated values
industry6_location_1 <- aggregated_dataset[aggregated_dataset$industry == 6 & aggregated_dataset$location == 1, ]
industry6_location_1 <- arrange(industry6_location_1, date)
industry6_location_1$time_number = 1:nrow(industry6_location_1)
# view the data ordered by time_number to check it has worked properly
arrange(industry6_location_1, time_number)
## # A tibble: 47 x 7
## # Groups: date, industry [47]
## date industry location monthly_amount month year time_number
## <date> <int> <int> <dbl> <int> <int> <int>
## 1 2013-01-01 6 1 22275239. 1 2013 1
## 2 2013-02-01 6 1 18357462. 2 2013 2
## 3 2013-03-01 6 1 21865386. 3 2013 3
## 4 2013-04-01 6 1 21565534. 4 2013 4
## 5 2013-05-01 6 1 22374034. 5 2013 5
## 6 2013-06-01 6 1 24337420. 6 2013 6
## 7 2013-07-01 6 1 26165409. 7 2013 7
## 8 2013-08-01 6 1 21955779. 8 2013 8
## 9 2013-09-01 6 1 23139375. 9 2013 9
## 10 2013-10-01 6 1 22440661. 10 2013 10
## # ... with 37 more rows
Retraining the model based on the original data split and month_number of 14
# number of months used to capture seasonality, may need to test out other values. Initially set to 14
month_number = 14
#Training number is the number of rows minus the month_number
trainingNum = nrow(arrange(industry6_location_1, date, time_number)) - month_number
#Training set is all rows from the start minus the number in the test set
trainingSet = head(arrange(industry6_location_1, date, time_number), trainingNum)
#Testing set is the last 'month_number' rows
testingSet = tail(arrange(industry6_location_1, date, time_number), month_number)
trainingSet
## # A tibble: 33 x 7
## # Groups: date, industry [33]
## date industry location monthly_amount month year time_number
## <date> <int> <int> <dbl> <int> <int> <int>
## 1 2013-01-01 6 1 22275239. 1 2013 1
## 2 2013-02-01 6 1 18357462. 2 2013 2
## 3 2013-03-01 6 1 21865386. 3 2013 3
## 4 2013-04-01 6 1 21565534. 4 2013 4
## 5 2013-05-01 6 1 22374034. 5 2013 5
## 6 2013-06-01 6 1 24337420. 6 2013 6
## 7 2013-07-01 6 1 26165409. 7 2013 7
## 8 2013-08-01 6 1 21955779. 8 2013 8
## 9 2013-09-01 6 1 23139375. 9 2013 9
## 10 2013-10-01 6 1 22440661. 10 2013 10
## # ... with 23 more rows
time_number.lm <- lm(monthly_amount~time_number, data=trainingSet)
summary(time_number.lm)
##
## Call:
## lm(formula = monthly_amount ~ time_number, data = trainingSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11364450 -1966108 291910 1983046 6947682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20085909 1377758 14.579 2.02e-15 ***
## time_number 206284 70709 2.917 0.00651 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3868000 on 31 degrees of freedom
## Multiple R-squared: 0.2154, Adjusted R-squared: 0.1901
## F-statistic: 8.511 on 1 and 31 DF, p-value: 0.006513
The original model had an R-squared value of 0.3168, whereas for this data set it is only 0.2154: an additional ~10% of the variation is unexplained. Looking at the residuals, there is a substantial increase in the range between the minimum (-11,364,450) and the maximum (6,947,682) compared to our initial training set's minimum (-17,279) and maximum (16,923).
Let's also plot it:
ggplot(data = trainingSet, mapping = aes(x = time_number, y = monthly_amount)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title="Model 2, Using time_number as only predictor",
subtitle = "industry 6, location 1",
x="month number",
y="Monthly Amount")
## `geom_smooth()` using formula 'y ~ x'
The data appears to follow a sinusoidal pattern which is not captured by our model. A new model would have to be developed to better capture this data, but it may only work for this particular data set.
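One way such a cyclical pattern could be captured is to add 12-month harmonic terms alongside the linear trend. The following is a minimal sketch only (not fitted or evaluated in this report), assuming the industry 6, location 1 trainingSet from above:
# Hypothetical seasonal extension of Model 2: a sine/cosine pair with a
# 12-month period added to the linear time trend.
seasonal.lm <- lm(monthly_amount ~ time_number +
sin(2 * pi * time_number / 12) +
cos(2 * pi * time_number / 12),
data = trainingSet)
summary(seasonal.lm)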
We will now follow the exact same process for industry 10, location 8.
# set up a temporary data set of aggregated values
industry10_location_8 <- aggregated_dataset[aggregated_dataset$industry == 10 & aggregated_dataset$location == 8, ]
industry10_location_8 <- arrange(industry10_location_8, date)
industry10_location_8$time_number = 1:nrow(industry10_location_8)
# view the data ordered by time_number to check it has worked properly
arrange(industry10_location_8, time_number)
## # A tibble: 47 x 7
## # Groups: date, industry [47]
## date industry location monthly_amount month year time_number
## <date> <int> <int> <dbl> <int> <int> <int>
## 1 2013-01-01 10 8 16065862. 1 2013 1
## 2 2013-02-01 10 8 15893236. 2 2013 2
## 3 2013-03-01 10 8 16822430. 3 2013 3
## 4 2013-04-01 10 8 16308113. 4 2013 4
## 5 2013-05-01 10 8 18623262. 5 2013 5
## 6 2013-06-01 10 8 17892471. 6 2013 6
## 7 2013-07-01 10 8 18356617. 7 2013 7
## 8 2013-08-01 10 8 19987638. 8 2013 8
## 9 2013-09-01 10 8 17688355. 9 2013 9
## 10 2013-10-01 10 8 19358444. 10 2013 10
## # ... with 37 more rows
# number of months used to capture seasonality, may need to test out other values. Initially set to 14
month_number = 14
#Training number is the number of rows minus the month_number
trainingNum = nrow(arrange(industry10_location_8, date, time_number)) - month_number
#Training set is all rows from the start minus the number in the test set
trainingSet = head(arrange(industry10_location_8, date, time_number), trainingNum)
#Testing set is the last 'month_number' rows
testingSet = tail(arrange(industry10_location_8, date, time_number), month_number)
trainingSet
## # A tibble: 33 x 7
## # Groups: date, industry [33]
## date industry location monthly_amount month year time_number
## <date> <int> <int> <dbl> <int> <int> <int>
## 1 2013-01-01 10 8 16065862. 1 2013 1
## 2 2013-02-01 10 8 15893236. 2 2013 2
## 3 2013-03-01 10 8 16822430. 3 2013 3
## 4 2013-04-01 10 8 16308113. 4 2013 4
## 5 2013-05-01 10 8 18623262. 5 2013 5
## 6 2013-06-01 10 8 17892471. 6 2013 6
## 7 2013-07-01 10 8 18356617. 7 2013 7
## 8 2013-08-01 10 8 19987638. 8 2013 8
## 9 2013-09-01 10 8 17688355. 9 2013 9
## 10 2013-10-01 10 8 19358444. 10 2013 10
## # ... with 23 more rows
time_number.lm <- lm(monthly_amount~time_number, data=trainingSet)
summary(time_number.lm)
##
## Call:
## lm(formula = monthly_amount ~ time_number, data = trainingSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2522424 -954886 -307057 254751 8481001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17115140 805939 21.236 < 2e-16 ***
## time_number 165305 41362 3.997 0.000369 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2262000 on 31 degrees of freedom
## Multiple R-squared: 0.34, Adjusted R-squared: 0.3187
## F-statistic: 15.97 on 1 and 31 DF, p-value: 0.0003688
ggplot(data = trainingSet, mapping = aes(x = time_number, y = monthly_amount)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title="Model 2, Using time_number as only predictor",
subtitle = "industry 10, location 8",
x="month number",
y="Monthly Amount")
## `geom_smooth()` using formula 'y ~ x'
Looking at the plot and the statistical summary for industry 10, location 8, we can see a clear sinusoidal trend in the data. There are two obvious points far removed from our model; accounting for these two points would likely improve the model's predictive power.
Based solely on the R-squared value, our model has performed reasonably, explaining 34% of the variation. I suspect removing the outliers would significantly improve the R-squared value.
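One quick way to test that suspicion (a sketch only; the two-standard-deviation cutoff is an arbitrary choice, not from the report) is to drop the points with the largest residuals and refit:
# Refit after removing points whose residuals exceed a hypothetical
# 2-standard-deviation threshold.
resid_sd <- sd(residuals(time_number.lm))
keep <- abs(residuals(time_number.lm)) < 2 * resid_sd
trimmed.lm <- lm(monthly_amount ~ time_number, data = trainingSet[keep, ])
summary(trimmed.lm)$r.squared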
Because we assessed performance only on the RMSE, outliers such as those above substantially inflate our RMSE values.
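Since the RMSE in the loop above was computed from training residuals, a fairer measure would be the out-of-sample error on the held-back months. A minimal sketch, assuming the industry 10, location 8 trainingSet and testingSet from above:
# Hold-out RMSE: compare predictions on the unseen testingSet to the actuals.
test_pred <- predict(time_number.lm, newdata = testingSet)
test_rmse <- sqrt(mean((testingSet$monthly_amount - test_pred)^2))
test_rmse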
Data analysis following the CRISP-DM methodology confirmed the relationship between the monthly amount and the month number when treated as a sequential series. The relationship has been modelled across numerous locations and industries and has been found to explain approximately 30% of the variation on average across the entire data set. There is a clear upward trend year on year, in addition to variations due to seasonality. By developing more advanced models that account for those fluctuations, a more accurate and powerful model could be built; however, such a model may not be suitable across every industry and location.
Considering the insights of this report, the lower-performing locations and industries should be assessed, as the difference between them and the higher performers is significant. In addition, efforts should be made in preparation for the December-January period, where transactions decrease. Finally, the model developed with the available data may offer little benefit, as it does not account for significant variation and would have low predictive performance. More data or more complex analysis would be necessary to improve the model.