Introduction

This report analyses the provided monthly financial transactions data file and draws insights from it. The transactions cover the period from January 2013 to November 2016. Initial exploratory data analysis (EDA) will be undertaken to identify any interesting trends, and machine learning algorithms will then be developed and applied to the data to predict future monthly earnings.

This report has been prepared using the CRISP-DM format and is broken into the following sections:

  1. Business understanding
  2. Data preparation
  3. Data understanding
  4. Modelling
  5. Evaluation and conclusion

1. Business understanding

  1. Business objectives. The financial transactions data describes the total transaction amount for each customer each month, over a range of industries and locations. Monthly transactions can vary greatly depending on both the industry and the location. We wish to gain insight into what is driving the trends and to increase total monthly revenue.

  2. Data mining objectives. First, the data will be analysed; second, data visualisation tools will be applied to identify any underlying patterns.

2. Data preparation - Task 1 a)

  1. The data set provided contains records from 10 different industries and 10 different locations, and provides the month and monthly amount over the period from Jan-2013 until Nov-2016. The customers are also included; however, they have been anonymised to ensure their privacy. There are 94,248 separate records and no missing values. The data structure is relatively simple, with 5 columns (date, anonymised customer id, industry, location and monthly amount).

  2. The data was imported into the software package R. No missing values were found. The dates were stored as character strings and were converted to Date objects, which makes analysis easier in the modelling process. The remaining variables were left untouched as they were already of the correct data type.
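A minimal sketch of the missing-value check, assuming the data has been imported as dataset (the import code appears in the next section):

# count missing values per column; a row of zeros confirms there is no missing data
colSums(is.na(dataset))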

3. Data understanding

try(library(tidyverse), silent=TRUE)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
try(library(dplyr), silent=TRUE)
try(library(ggplot2), silent=TRUE)
try(library(caret), silent=TRUE)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
try(library(Rcpp), silent=TRUE)
# Set up working directory and import the dataset (transactions.csv)
working_dir = setwd('C:/Users/Admin/Desktop/Masters of Data Science and Innovation/Autumn 2021/36106 - Machine learning algorithms and applications/Assignment 1 MLAA AUT 2021/')
dataset = read.csv("transactions.csv")

# convert data types to correct types as set out in data dictionary
dataset$date = as.Date(dataset$date, format = "%d/%m/%Y")
dataset$customer_id = as.character(dataset$customer_id)

# view the data and check the data types 
summary(dataset)
##       date            customer_id           industry         location     
##  Min.   :2013-01-01   Length:94248       Min.   : 1.000   Min.   : 1.000  
##  1st Qu.:2014-05-01   Class :character   1st Qu.: 1.000   1st Qu.: 2.000  
##  Median :2015-06-01   Mode  :character   Median : 2.000   Median : 4.000  
##  Mean   :2015-03-26                      Mean   : 2.455   Mean   : 4.214  
##  3rd Qu.:2016-03-01                      3rd Qu.: 3.000   3rd Qu.: 6.000  
##  Max.   :2016-11-01                      Max.   :10.000   Max.   :10.000  
##  monthly_amount     
##  Min.   :        0  
##  1st Qu.:    95323  
##  Median :   179399  
##  Mean   :   395397  
##  3rd Qu.:   375439  
##  Max.   :100000000
str(dataset)
## 'data.frame':    94248 obs. of  5 variables:
##  $ date          : Date, format: "2013-01-01" "2013-02-01" ...
##  $ customer_id   : chr  "70efdf2ec9b086079795c442636b55fb" "70efdf2ec9b086079795c442636b55fb" "70efdf2ec9b086079795c442636b55fb" "70efdf2ec9b086079795c442636b55fb" ...
##  $ industry      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ location      : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ monthly_amount: num  753851 651548 1138769 659739 770675 ...

First, we want to gain some insight by visualising the data, plotting a range of variables to see whether any trends are apparent.

Scatter plot of total monthly transactions per industry:

plot_date_vs_month = ggplot(data = dataset, aes(x = date, y = monthly_amount/1e6)) +
  geom_point(aes(colour = industry)) +
  scale_y_continuous(labels = scales::comma) +
  labs (title = "Total transactions by month per industry",
        x = "Date",
        y = "Monthly transation amount in millions")
        
plot_date_vs_month

A second scatter plot of total transactions by month per industry, separating the industries by colour:

plot_date_vs_month_2 = ggplot(data = dataset, aes(x = date, y = monthly_amount/1e6)) +
  geom_point(aes(colour = cut(industry, c(0,1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11)))) +
  scale_color_manual(name = "Industry",
                     values = c("(0,1]" = "black",
                                  "(1,2]" = "yellow",
                                  "(2,3]" = "red",
                                  "(3,4]" = "blue",
                                  "(4,5]" = "pink",
                                  "(5,6]" = "green",
                                  "(6,7]" = "purple",
                                  "(7,8]" = "orange",
                                  "(8,9]" = "grey",
                                  "(9,10]" = "brown"),                              

                     labels = c("Ind 1", "Ind 2", "Ind 3", "Ind 4", "Ind 5", "Ind 6", "Ind 7", "Ind 8", "Ind 9", "Ind 10"))+
  scale_y_continuous(labels = scales::comma) +
  labs (title = "Total transactions by month per industry",
        x = "Date",
        y = "Monthly transation amount in millions")
        
plot_date_vs_month_2

Now we will look at the relationship between total transactions per month and location.

plot_date_vs_month_location = ggplot(data = dataset, aes(x = date, y = monthly_amount/1e6)) +
  geom_point(aes(colour = location)) +
  scale_y_continuous(labels = scales::comma) +
  labs (title = "Total transactions by month per location",
        x = "Date",
        y = "Monthly transation amount in millions")

plot_date_vs_month_location

Instead of looking at total monthly transactions by dollar amount, we'll now look at the total number of transactions, both by industry and by location.

par(mfrow=c(1,2))
hist(dataset$industry, main = "Trans by Industry", xlab="Industry", ylab="Number of transactions", xlim = c(0,10), ylim=c(0,50000), las=0)
hist(dataset$location, main = "Trans by Location", xlab="Location", ylab="Number of transactions", xlim = c(0,10), ylim=c(0,50000), las=0)

Now let's compare each monthly transaction to the overall average to see if there is any insight. The process involves creating a new boolean variable that records whether each monthly amount is above (TRUE) or below (FALSE) the mean.

# we first need to create a boolean flag indicating whether each value is above or below the mean
transactions_with_mean_boolean <- dataset %>%
  mutate(monthly_amount_above_mean = monthly_amount > mean(monthly_amount))

# plotting the monthly counts of transactions above/below the mean
qplot(x = date, fill = monthly_amount_above_mean, data = transactions_with_mean_boolean, geom = "bar",
      main = "Number of transactions per month compared to average monthly transaction")+
     scale_fill_discrete(name = "Above or below mean")+
    ylab("Number of transations")+
    xlab("Date")

Task 1b)

From the above it is clear that most monthly transactions are below the mean monthly transaction value. It is also clear that the number of monthly transactions is growing steadily without any significant rises or declines. The only decline, albeit minimal, appears to happen between December and January, which can be explained by the seasonality of the business and the holiday period.
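One way to sanity-check this dip is to count transactions per calendar month; a minimal sketch, assuming dataset and the tidyverse are loaded as above:

# transactions per calendar month across all years; lower counts for months 12 and 01
# would support the seasonality claim (note: the data ends Nov 2016, so month 12
# has one fewer year of observations)
dataset %>%
  mutate(calendar_month = format(date, "%m")) %>%
  count(calendar_month)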

Let's now analyse how each location performed in relation to the average monthly transaction amount.

qplot(x = location, fill = monthly_amount_above_mean, data = transactions_with_mean_boolean, geom = "bar",
      main = "Number of transactions by location, compared to average monthly transaction")+
      scale_fill_discrete(name = "Above or below mean")+
      scale_x_continuous(breaks=1:10) +
     ylab("Number of transactions")+
     xlab("Location")

Let's now look at the monthly average compared by industry.

qplot(x = industry, fill = monthly_amount_above_mean, data = transactions_with_mean_boolean, geom = "bar",
      main = "Number of transactions by industry, compared to average monthly transaction")+
      scale_fill_discrete(name = "Above or below mean")+
      scale_x_continuous(breaks=1:10) +
      ylab("Number of transations")+
     xlab("Industry")

Task 1b) continued

From the previous two plots, we have gained some insight into how the locations and industries perform compared to the average monthly transaction amount. We can also view each industry and location's number of transactions over the entire date period. It is clear that industries 1 and 2 generate by far the most transactions; however, the majority of the time they come in below the average monthly transaction amount. Industries 9 and 3, while having a lower number of transactions, are regularly above the average.

Similar insights can be gained from the location data, where locations 1 and 2 dominate the number of sales. However, the proportion of above-mean to below-mean monthly transactions appears similar across all locations, so no location clearly performs best on that metric.
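To quantify this claim about locations, a minimal sketch using the flag created earlier:

# proportion of transactions above the overall mean, by location;
# similar values across locations would support the observation above
transactions_with_mean_boolean %>%
  group_by(location) %>%
  summarise(prop_above_mean = mean(monthly_amount_above_mean))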

4. Modelling

The aim is to build a model that can predict monthly_amount values for the future.

Numerous models will be developed and assessed with various combinations of predictor variables.

The data will need to be modified as we will be applying a linear regression model. This requires us to convert each month into an integer, numbered sequentially, e.g. Jan 2013 = 1, Feb 2013 = 2, etc. (see the sketch below). The data will be split roughly 70:30 train:test. This ensures that the clear seasonality in the data is captured in our training set, which will be used to predict data for December 2016.
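As a sketch of the sequential month index described above (the modelling code below derives it from row order after sorting by date; the arithmetic version here is an equivalent illustration):

# sequential month index: Jan 2013 = 1, Feb 2013 = 2, ..., Nov 2016 = 47
month_index = function(d) {
  (as.integer(format(d, "%Y")) - 2013) * 12 + as.integer(format(d, "%m"))
}
month_index(as.Date("2013-01-01"))  # 1
month_index(as.Date("2016-11-01"))  # 47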

This section will also include references to the assignment tasks

Task 2 a i

Basic model fitting. a) Aggregate the data, grouping by date, industry and location, and calculating the mean monthly_amount.

# generating a function that will be used repeatedly
monthly_avg_function = function(df) {
  # group by date, industry and location and take the mean monthly amount
  output = df %>% 
    group_by(date, industry, location) %>% 
    summarize(monthly_amount = mean(monthly_amount, na.rm = TRUE))
  # make sure date is in the proper format, then extract month and year as integers
  output = output %>% 
    mutate(month = as.integer(format(as.Date(date), "%m")),
           year  = as.integer(format(as.Date(date), "%Y")))
  return(output)
}

# have a look at the dataset to make sure it has worked properly
aggregated_dataset = monthly_avg_function(dataset)  
## `summarise()` has grouped output by 'date', 'industry'. You can override using the `.groups` argument.
aggregated_dataset
## # A tibble: 3,886 x 6
## # Groups:   date, industry [470]
##    date       industry location monthly_amount month  year
##    <date>        <int>    <int>          <dbl> <int> <int>
##  1 2013-01-01        1        1        136081.     1  2013
##  2 2013-01-01        1        2        177840.     1  2013
##  3 2013-01-01        1        3        141632.     1  2013
##  4 2013-01-01        1        4        221058.     1  2013
##  5 2013-01-01        1        5        178138.     1  2013
##  6 2013-01-01        1        6        133400.     1  2013
##  7 2013-01-01        1        7        231599.     1  2013
##  8 2013-01-01        1        8        143778.     1  2013
##  9 2013-01-01        1        9        157416.     1  2013
## 10 2013-01-01        1       10        188735.     1  2013
## # ... with 3,876 more rows

Task 2 a ii

Create a line plot of the variable monthly_amount for industry = 1 and location = 1

ggplot(data=filter(aggregated_dataset, industry == 1 & location == 1), aes(x=date, y=monthly_amount, group=1)) + 
  geom_line(color="red") + geom_point() + 
  labs(title="Line plot of mean monthly amount for Industry 1 and Location 1", x="Date", y="Mean Monthly Amount")

Industry and location are constant within this filtered subset, so they will not be used for model prediction.

It was hinted that we will need to hold out more than 12 months due to seasonality; the hold-out length variable will be called month_number.

# set up a temporary data set of aggregated values
temporary <- aggregated_dataset[aggregated_dataset$industry == 1 & aggregated_dataset$location == 1, ]
temporary <- arrange(temporary, date)
temporary$time_number = 1:nrow(temporary)
# arrange the data by time_number to make sure it's worked properly
arrange(temporary, time_number)
## # A tibble: 47 x 7
## # Groups:   date, industry [47]
##    date       industry location monthly_amount month  year time_number
##    <date>        <int>    <int>          <dbl> <int> <int>       <int>
##  1 2013-01-01        1        1        136081.     1  2013           1
##  2 2013-02-01        1        1        152964.     2  2013           2
##  3 2013-03-01        1        1        158481.     3  2013           3
##  4 2013-04-01        1        1        152341.     4  2013           4
##  5 2013-05-01        1        1        170330.     5  2013           5
##  6 2013-06-01        1        1        152575.     6  2013           6
##  7 2013-07-01        1        1        158755.     7  2013           7
##  8 2013-08-01        1        1        165313.     8  2013           8
##  9 2013-09-01        1        1        147199.     9  2013           9
## 10 2013-10-01        1        1        163409.    10  2013          10
## # ... with 37 more rows
# number of months used to capture seasonality, may need to test out other values. Initially set to 14
month_number = 14
#Training number is the number of rows minus the month_number
trainingNum = nrow(arrange(temporary, date, time_number)) - month_number

#Training set is all rows from the start minus the number in the test set
trainingSet = head(arrange(temporary, date, time_number), trainingNum)

#Testing set is the last 'month_number' rows
testingSet = tail(arrange(temporary, date, time_number), month_number)

trainingSet
## # A tibble: 33 x 7
## # Groups:   date, industry [33]
##    date       industry location monthly_amount month  year time_number
##    <date>        <int>    <int>          <dbl> <int> <int>       <int>
##  1 2013-01-01        1        1        136081.     1  2013           1
##  2 2013-02-01        1        1        152964.     2  2013           2
##  3 2013-03-01        1        1        158481.     3  2013           3
##  4 2013-04-01        1        1        152341.     4  2013           4
##  5 2013-05-01        1        1        170330.     5  2013           5
##  6 2013-06-01        1        1        152575.     6  2013           6
##  7 2013-07-01        1        1        158755.     7  2013           7
##  8 2013-08-01        1        1        165313.     8  2013           8
##  9 2013-09-01        1        1        147199.     9  2013           9
## 10 2013-10-01        1        1        163409.    10  2013          10
## # ... with 23 more rows

Task 2 a iii

Model 1. Linear model to predict monthly_amount using the month variable as the sole predictor

month_number.lm <- lm(monthly_amount~month, data=trainingSet)
summary(month_number.lm)
## 
## Call:
## lm(formula = monthly_amount ~ month, data = trainingSet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21904.9  -7126.0    162.6   8778.7  17485.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156947.7     3836.2  40.913   <2e-16 ***
## month          629.8      553.5   1.138    0.264    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10510 on 31 degrees of freedom
## Multiple R-squared:  0.04009,    Adjusted R-squared:  0.009127 
## F-statistic: 1.295 on 1 and 31 DF,  p-value: 0.2639

Based on the multiple R-squared value, our model can only account for approximately 4% of the variance. This is very low, and the model is a poor predictor of monthly_amount. We get further confirmation from the p-value of 0.2639, which tells us that the month predictor is unlikely to be a good fit to the data.

Turning to the residuals, we have a minimum of ~ -21,900 and a maximum of ~ 17,490. We will need to improve our model.
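Rather than reading these figures off the printout, they can be extracted programmatically; a minimal sketch using the fitted object above:

# pull the headline statistics out of the model summary
fit_stats = summary(month_number.lm)
fit_stats$r.squared                           # multiple R-squared (~0.04)
fit_stats$coefficients["month", "Pr(>|t|)"]   # p-value for the month predictor (~0.26)
range(residuals(month_number.lm))             # minimum and maximum residuals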

Let's have a look at the plot of the model against the data to assess how our model performed.

# plot of the data with the fitted line for loc = 1, ind = 1
ggplot(data = trainingSet, mapping = aes(x = month, y = monthly_amount)) + 
  geom_point() +
  scale_x_continuous(breaks=1:12)+
  geom_smooth(method = "lm", se = FALSE, color = "red") + 
  labs(title="Model 1. Using month as only predictor", x="Month Number", y="Monthly Amount")
## `geom_smooth()` using formula 'y ~ x'

As expected from the statistics above, it is clear that our model is a poor fit. It does, however, capture a trend, especially between months 2 and 11. As mentioned earlier, the monthly transactions, and therefore the amount, decrease over December and January, which is reflected above.

Let's now look at the diagnostic plots.

par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(month_number.lm)

Model 2. Linear model to predict monthly_amount using time_number as the sole predictor (Task 2 b i, ii). time_number is the month in sequential order over the entire data set, e.g. Jan 2013 = 1, Jan 2014 = 13, etc.

time_number.lm <- lm(monthly_amount~time_number, data=trainingSet)
summary(time_number.lm)
## 
## Call:
## lm(formula = monthly_amount ~ time_number, data = trainingSet)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17279  -6992   1486   6303  16923 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 150333.8     3159.5  47.581  < 2e-16 ***
## time_number    614.7      162.2   3.791 0.000651 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8870 on 31 degrees of freedom
## Multiple R-squared:  0.3168, Adjusted R-squared:  0.2947 
## F-statistic: 14.37 on 1 and 31 DF,  p-value: 0.0006507

Looking at the summary output, we can see that the R-squared value has improved significantly, to 0.3168; this model accounts for ~32% of the variation in the data. Looking at the p-value, the predictor time_number has a much stronger relationship with the monthly_amount variable.

Let's plot our new model against the data to gain some insight.

# plot of time_number model vs the data
ggplot(data = trainingSet, mapping = aes(x = time_number, y = monthly_amount)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title="Model 2, Using time_number as only predictor",
       x="month number",
       y="Monthly Amount")
## `geom_smooth()` using formula 'y ~ x'

The above plot shows that using time_number as the predictor captures more of the data and exhibits a stronger trend.

Again, let's look at the diagnostic plots.

par(mfrow=c(2,2))
plot(time_number.lm)

Model 3. Using month and time_number as predictors

We will now build a model using both month and time_number to see if there is any additional improvement.

# build model using month and time_number
month_number_and_time_number.lm <- lm(monthly_amount~month+ time_number, data=trainingSet)
summary(month_number_and_time_number.lm)
## 
## Call:
## lm(formula = monthly_amount ~ month + time_number, data = trainingSet)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17139  -6451   1404   6149  17038 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 148686.3     3998.7  37.183   <2e-16 ***
## month          326.5      478.6   0.682   0.5003    
## time_number    594.6      166.2   3.578   0.0012 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8947 on 30 degrees of freedom
## Multiple R-squared:  0.3272, Adjusted R-squared:  0.2823 
## F-statistic: 7.295 on 2 and 30 DF,  p-value: 0.00262

Since we have used two predictors, we need to judge the model on the adjusted R-squared value, which is 0.2823. This is lower than Model 2, where time_number was the sole predictor. The p-values of the predictors also show that the month variable is statistically insignificant and should likely be discarded.
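Since Model 2 is nested within Model 3, we can also test formally whether month adds explanatory power; a sketch, assuming both fitted objects are still in scope:

# nested-model F-test of Model 2 vs Model 3;
# a large p-value indicates month does not significantly improve the fit
anova(time_number.lm, month_number_and_time_number.lm)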

Model 4.

Let's look at a fourth model using time_number and date as predictors.

date_and_time_number.lm <- lm(monthly_amount ~ date + time_number, data=trainingSet)
summary(date_and_time_number.lm)
## 
## Call:
## lm(formula = monthly_amount ~ date + time_number, data = trainingSet)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14204  -7043   1639   7116  13018 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 67605444   32044036   2.110   0.0433 *
## date           -4303       2044  -2.105   0.0438 *
## time_number   131489      62171   2.115   0.0429 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8416 on 30 degrees of freedom
## Multiple R-squared:  0.4047, Adjusted R-squared:  0.365 
## F-statistic:  10.2 on 2 and 30 DF,  p-value: 0.000418

Again, looking at the adjusted R-squared gives us a value of 0.365, an improvement on our previous models. However, the p-values of the predictors are higher than in Model 2, suggesting that the individual relationships are weaker. As such, we will proceed with Model 2, where the relationship is much stronger.

Summary: Model 2 (time_number) performed best when weighing the R-squared value against the p-values of its predictor(s), and will be used going forward.
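For completeness, a small sketch collecting the headline statistics of all four models side by side, using the object names defined above (AIC was not used elsewhere in this report):

# adjusted R-squared and AIC for each candidate model (lower AIC is better)
models = list(model_1 = month_number.lm,
              model_2 = time_number.lm,
              model_3 = month_number_and_time_number.lm,
              model_4 = date_and_time_number.lm)
data.frame(adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
           AIC           = sapply(models, AIC))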

Task 2 a iv

Add a row for December 2016 and predict its value using Model 2.

december2016<-data.frame(date = "01/12/16",
                         industry=1,
                         location=1,
                         monthly_amount = 0,
                         month=12,
                         year=2016,
                         time_number=48)

december2016$date <- as.Date(december2016$date,
                             format = "%d/%m/%y")

december2016$industry <- as.integer(december2016$industry)
december2016$location <- as.integer(december2016$location)
december2016$time_number <- as.integer(december2016$time_number)

december2016$monthly_amount <- predict(time_number.lm,december2016)

predicted_outcome <- rbind(temporary, december2016)
predicted_outcome
## # A tibble: 48 x 7
## # Groups:   date, industry [48]
##    date       industry location monthly_amount month  year time_number
##    <date>        <int>    <int>          <dbl> <dbl> <dbl>       <int>
##  1 2013-01-01        1        1        136081.     1  2013           1
##  2 2013-02-01        1        1        152964.     2  2013           2
##  3 2013-03-01        1        1        158481.     3  2013           3
##  4 2013-04-01        1        1        152341.     4  2013           4
##  5 2013-05-01        1        1        170330.     5  2013           5
##  6 2013-06-01        1        1        152575.     6  2013           6
##  7 2013-07-01        1        1        158755.     7  2013           7
##  8 2013-08-01        1        1        165313.     8  2013           8
##  9 2013-09-01        1        1        147199.     9  2013           9
## 10 2013-10-01        1        1        163409.    10  2013          10
## # ... with 38 more rows

Let's now plot our predicted data and see if our prediction for December 2016 makes sense.

plot_predict_ind1_december2016 <- ggplot(data=predicted_outcome, aes(x=date, y=monthly_amount)) +
  geom_smooth(stat="identity", method = "lm") +
  geom_point(colour = "red") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_minor_breaks = "1 month") +
  labs (title = "Model 2. Industry 1 & Location 1",
        subtitle = "December 2016 Prediction",
        x = "Date", y = "Average Amount")


plot_predict_ind1_december2016

From the plot we can clearly see that the prediction for December 2016 has decreased compared to the supplied November value. This agrees with our previous assertions about the seasonality of the data, namely that it decreases in the December and January months.

Task 3 a

Advanced model fitting

Let's create a new data set called p3_data_set. We'll need to sort the locations and industries.

# creating the aggregated data set and arranging locations and industries
p3_data_set = monthly_avg_function(aggregated_dataset)
## `summarise()` has grouped output by 'date', 'industry'. You can override using the `.groups` argument.
p3_data_set
## # A tibble: 3,886 x 6
## # Groups:   date, industry [470]
##    date       industry location monthly_amount month  year
##    <date>        <int>    <int>          <dbl> <int> <int>
##  1 2013-01-01        1        1        136081.     1  2013
##  2 2013-01-01        1        2        177840.     1  2013
##  3 2013-01-01        1        3        141632.     1  2013
##  4 2013-01-01        1        4        221058.     1  2013
##  5 2013-01-01        1        5        178138.     1  2013
##  6 2013-01-01        1        6        133400.     1  2013
##  7 2013-01-01        1        7        231599.     1  2013
##  8 2013-01-01        1        8        143778.     1  2013
##  9 2013-01-01        1        9        157416.     1  2013
## 10 2013-01-01        1       10        188735.     1  2013
## # ... with 3,876 more rows
industries <- sort(unique(p3_data_set$industry))
locations <- sort(unique(p3_data_set$location))

output = data.frame()
month_number = 11

# lets have a look at the data
p3_data_set
## # A tibble: 3,886 x 6
## # Groups:   date, industry [470]
##    date       industry location monthly_amount month  year
##    <date>        <int>    <int>          <dbl> <int> <int>
##  1 2013-01-01        1        1        136081.     1  2013
##  2 2013-01-01        1        2        177840.     1  2013
##  3 2013-01-01        1        3        141632.     1  2013
##  4 2013-01-01        1        4        221058.     1  2013
##  5 2013-01-01        1        5        178138.     1  2013
##  6 2013-01-01        1        6        133400.     1  2013
##  7 2013-01-01        1        7        231599.     1  2013
##  8 2013-01-01        1        8        143778.     1  2013
##  9 2013-01-01        1        9        157416.     1  2013
## 10 2013-01-01        1       10        188735.     1  2013
## # ... with 3,876 more rows

We want to apply our model (monthly_amount ~ time_number) across all industries and locations, so we will put everything through a loop.

The loop will do the following:

  1. Train the model for each industry and location, as described above
  2. Append an empty row for December 2016
  3. Calculate the MSE and RMSE
  4. Make a prediction for December 2016
  5. Bind everything into a data frame

for (ind in industries) {
  for (loc in locations) {
    temporary = p3_data_set[p3_data_set$industry == ind & p3_data_set$location == loc, ]
    
    if (length(unique(temporary$date)) >= month_number) {
      temporary = arrange(temporary, date)
      temporary$time_number = 1:nrow(temporary)
      
      # Training number is the number of rows minus the month_number
      trainingNum = nrow(temporary) - month_number
      # Training set is all rows from the start minus the number in the test set. We'll arrange again, just in case.
      trainingSet = head(arrange(temporary, time_number), trainingNum)
      
      # Testing set is the last 'month_number' rows. We'll arrange again, just in case.
      testingSet = tail(arrange(temporary, time_number), month_number)
      
      # Fit the model on the training set
      training.model = lm(monthly_amount ~ time_number, data = trainingSet)
      
      # Calculate the mean squared error and the root mean squared error
      training.mse <- mean(residuals(training.model)^2)
      training.rmse <- sqrt(training.mse)
      
      # Now add an extra row to temporary for the December 2016 prediction,
      # giving December 2016 a monthly_amount of 0
      december_2016 = data.frame(date = "2016-12-01",
                                 industry = ind,
                                 location = loc,
                                 monthly_amount = 0,
                                 month = 12,
                                 year = 2016,
                                 time_number = nrow(temporary) + 1)
      
      # Make sure temporary is a plain data frame, then add the December 2016 row
      temporary = as.data.frame(temporary)
      temporary = rbind(temporary, december_2016)
      
      # Output a prediction for all rows and attach it
      temporary$prediction = predict(training.model, temporary)
      testingSet$prediction = predict(training.model, testingSet)
      
      # Get the last prediction value (which is the Dec 2016 value)
      train_dec_2016_prediction = tail(temporary$prediction, 1)
      
      # Create the row to add to the output data frame, including industry and location
      dataRow = c(ind, loc, training.rmse, train_dec_2016_prediction)
    } else {
      dataRow = c(ind, loc, NA, NA)
    }
    output = rbind(output, dataRow)
  }
}
## Warning in predict.lm(training.model, temporary): prediction from a rank-
## deficient fit may be misleading
## Warning in predict.lm(training.model, testingSet): prediction from a rank-
## deficient fit may be misleading
output
##     X1 X1.1 X8708.71861352349 X180582.52684704
## 1    1    1          8708.719        180582.53
## 2    1    2          8338.057        213241.27
## 3    1    3          8661.918        209985.53
## 4    1    4         15348.917        212896.23
## 5    1    5          9642.893        176660.84
## 6    1    6          7458.941        118608.92
## 7    1    7         18096.106        160786.91
## 8    1    8          5141.575        126819.12
## 9    1    9         10385.604        203987.20
## 10   1   10         14882.014        244951.94
## 11   2    1         28775.232        442855.05
## 12   2    2         24948.666        489477.47
## 13   2    3         58636.625        599354.87
## 14   2    4         49884.885        259835.34
## 15   2    5         54709.346        240381.99
## 16   2    6         29615.736        396781.68
## 17   2    7         26428.476        270319.06
## 18   2    8         48701.291        514809.95
## 19   2    9         71165.747        448156.64
## 20   2   10         26813.509        276164.23
## 21   3    1         96066.470        576221.29
## 22   3    2         61577.633        694542.07
## 23   3    3        108525.427        497038.50
## 24   3    4        112224.980       1038817.10
## 25   3    5        145509.181        282798.15
## 26   3    6        105817.990         85692.28
## 27   3    7         65670.448        676483.37
## 28   3    8         38483.766        175587.61
## 29   3    9        177989.626        310535.16
## 30   3   10        199913.555       -328462.07
## 31   4    1         26899.381        506804.79
## 32   4    2         16126.258        229728.26
## 33   4    3         38707.015        323510.53
## 34   4    4         44150.073        424171.07
## 35   4    5         42349.315        452839.66
## 36   4    6        153722.021       1225786.96
## 37   4    7         27318.186        546953.04
## 38   4    8         20896.022        234778.81
## 39   4    9         21650.678        227705.05
## 40   4   10         28435.318        321276.40
## 41   5    1         64433.278        603845.37
## 42   5    2         35046.534        435939.22
## 43   5    3         59049.222        308443.96
## 44   5    4         47842.417        240670.00
## 45   5    5         17675.791        246935.90
## 46   5    6         82370.650        745553.53
## 47   5    7         38740.777        340446.60
## 48   5    8         49761.420        485398.40
## 49   5    9         49744.111        428682.22
## 50   5   10         91451.847        431699.86
## 51   6    1       4353935.576      34422974.97
## 52   6    2                NA               NA
## 53   6    3                NA               NA
## 54   6    4                NA               NA
## 55   6    5                NA               NA
## 56   6    6                NA               NA
## 57   6    7                NA               NA
## 58   6    8                NA               NA
## 59   6    9                NA               NA
## 60   6   10                NA               NA
## 61   7    1         24479.867        244206.31
## 62   7    2         33828.547        296698.43
## 63   7    3         89834.438        315015.71
## 64   7    4         55119.851        424611.83
## 65   7    5         45271.123        313633.17
## 66   7    6         36249.150        228236.58
## 67   7    7         17164.975        221666.84
## 68   7    8         36528.455        312872.89
## 69   7    9         92875.342        806894.01
## 70   7   10         54741.414        510940.74
## 71   8    1        283037.392       1536448.27
## 72   8    2         96333.939        789059.33
## 73   8    3        171915.695        699734.95
## 74   8    4         45769.216        333562.97
## 75   8    5        210197.449        107575.00
## 76   8    6        138846.516       1051881.52
## 77   8    7        183196.776        234663.58
## 78   8    8        183280.814        418496.05
## 79   8    9         71346.848        560103.12
## 80   8   10        116958.758        646710.25
## 81   9    1         97861.243        592133.92
## 82   9    2        136493.074       1345442.89
## 83   9    3        177917.008       1064834.04
## 84   9    4        550472.431       -338379.59
## 85   9    5             0.000        132464.38
## 86   9    6        270549.474       2184139.77
## 87   9    7        149838.574       1593969.07
## 88   9    8                NA               NA
## 89   9    9        198407.002       1215139.24
## 90   9   10        118761.793       5254169.65
## 91  10    1          1054.842         61938.37
## 92  10    2         60958.468        247645.15
## 93  10    3             0.000         94960.92
## 94  10    4          4375.048         93558.22
## 95  10    5         11232.251        147650.11
## 96  10    6                NA               NA
## 97  10    7        115263.316        265167.95
## 98  10    8       2589092.890      26384897.61
## 99  10    9         10160.973        100733.81
## 100 10   10                NA               NA

We need to change the column names to their correct names.

colnames(output) <- c("Industry","Location", "RMSE", "Dec 2016 Prediction")
arrange(output, RMSE)
##     Industry Location        RMSE Dec 2016 Prediction
## 1          9        5       0.000           132464.38
## 2         10        3       0.000            94960.92
## 3         10        1    1054.842            61938.37
## 4         10        4    4375.048            93558.22
## 5          1        8    5141.575           126819.12
## 6          1        6    7458.941           118608.92
## 7          1        2    8338.057           213241.27
## 8          1        3    8661.918           209985.53
## 9          1        1    8708.719           180582.53
## 10         1        5    9642.893           176660.84
## 11        10        9   10160.973           100733.81
## 12         1        9   10385.604           203987.20
## 13        10        5   11232.251           147650.11
## 14         1       10   14882.014           244951.94
## 15         1        4   15348.917           212896.23
## 16         4        2   16126.258           229728.26
## 17         7        7   17164.975           221666.84
## 18         5        5   17675.791           246935.90
## 19         1        7   18096.106           160786.91
## 20         4        8   20896.022           234778.81
## 21         4        9   21650.678           227705.05
## 22         7        1   24479.867           244206.31
## 23         2        2   24948.666           489477.47
## 24         2        7   26428.476           270319.06
## 25         2       10   26813.509           276164.23
## 26         4        1   26899.381           506804.79
## 27         4        7   27318.186           546953.04
## 28         4       10   28435.318           321276.40
## 29         2        1   28775.232           442855.05
## 30         2        6   29615.736           396781.68
## 31         7        2   33828.547           296698.43
## 32         5        2   35046.534           435939.22
## 33         7        6   36249.150           228236.58
## 34         7        8   36528.455           312872.89
## 35         3        8   38483.766           175587.61
## 36         4        3   38707.015           323510.53
## 37         5        7   38740.777           340446.60
## 38         4        5   42349.315           452839.66
## 39         4        4   44150.073           424171.07
## 40         7        5   45271.123           313633.17
## 41         8        4   45769.216           333562.97
## 42         5        4   47842.417           240670.00
## 43         2        8   48701.291           514809.95
## 44         5        9   49744.111           428682.22
## 45         5        8   49761.420           485398.40
## 46         2        4   49884.885           259835.34
## 47         2        5   54709.346           240381.99
## 48         7       10   54741.414           510940.74
## 49         7        4   55119.851           424611.83
## 50         2        3   58636.625           599354.87
## 51         5        3   59049.222           308443.96
## 52        10        2   60958.468           247645.15
## 53         3        2   61577.633           694542.07
## 54         5        1   64433.278           603845.37
## 55         3        7   65670.448           676483.37
## 56         2        9   71165.747           448156.64
## 57         8        9   71346.848           560103.12
## 58         5        6   82370.650           745553.53
## 59         7        3   89834.438           315015.71
## 60         5       10   91451.847           431699.86
## 61         7        9   92875.342           806894.01
## 62         3        1   96066.470           576221.29
## 63         8        2   96333.939           789059.33
## 64         9        1   97861.243           592133.92
## 65         3        6  105817.990            85692.28
## 66         3        3  108525.427           497038.50
## 67         3        4  112224.980          1038817.10
## 68        10        7  115263.316           265167.95
## 69         8       10  116958.758           646710.25
## 70         9       10  118761.793          5254169.65
## 71         9        2  136493.074          1345442.89
## 72         8        6  138846.516          1051881.52
## 73         3        5  145509.181           282798.15
## 74         9        7  149838.574          1593969.07
## 75         4        6  153722.021          1225786.96
## 76         8        3  171915.695           699734.95
## 77         9        3  177917.008          1064834.04
## 78         3        9  177989.626           310535.16
## 79         8        7  183196.776           234663.58
## 80         8        8  183280.814           418496.05
## 81         9        9  198407.002          1215139.24
## 82         3       10  199913.555          -328462.07
## 83         8        5  210197.449           107575.00
## 84         9        6  270549.474          2184139.77
## 85         8        1  283037.392          1536448.27
## 86         9        4  550472.431          -338379.59
## 87        10        8 2589092.890         26384897.61
## 88         6        1 4353935.576         34422974.97
## 89         6        2          NA                  NA
## 90         6        3          NA                  NA
## 91         6        4          NA                  NA
## 92         6        5          NA                  NA
## 93         6        6          NA                  NA
## 94         6        7          NA                  NA
## 95         6        8          NA                  NA
## 96         6        9          NA                  NA
## 97         6       10          NA                  NA
## 98         9        8          NA                  NA
## 99        10        6          NA                  NA
## 100       10       10          NA                  NA

Task 3 b i

Looking at the worst performers

Industry 6 location 1 and Industry 10 location 8 were the worst performers

Let's have a look at how our model performed on those two data sets.

Starting with industry 6, location 1:

# start by looking at the data
ggplot(data=filter(aggregated_dataset, industry == 6 & location == 1), aes(x=date, y=monthly_amount, group=1)) + 
  geom_line(color="red") + geom_point() + 
  labs(title="Line plot of mean monthly amount for Industry 6 and Location 1", x="Date", y="Mean Monthly Amount")

There is a clear upward trend towards the more recent data points. The model is unlikely to be able to capture the strength of this trend, which is why it has performed so poorly.

# set up a temporary data set of aggregated values
industry6_location_1 <- aggregated_dataset[aggregated_dataset$industry == 6 & aggregated_dataset$location == 1, ]
industry6_location_1 <- arrange(industry6_location_1, date)
industry6_location_1$time_number = 1:nrow(industry6_location_1)
# arrange the data by time_number to make sure it's worked properly
arrange(industry6_location_1, time_number)
## # A tibble: 47 x 7
## # Groups:   date, industry [47]
##    date       industry location monthly_amount month  year time_number
##    <date>        <int>    <int>          <dbl> <int> <int>       <int>
##  1 2013-01-01        6        1      22275239.     1  2013           1
##  2 2013-02-01        6        1      18357462.     2  2013           2
##  3 2013-03-01        6        1      21865386.     3  2013           3
##  4 2013-04-01        6        1      21565534.     4  2013           4
##  5 2013-05-01        6        1      22374034.     5  2013           5
##  6 2013-06-01        6        1      24337420.     6  2013           6
##  7 2013-07-01        6        1      26165409.     7  2013           7
##  8 2013-08-01        6        1      21955779.     8  2013           8
##  9 2013-09-01        6        1      23139375.     9  2013           9
## 10 2013-10-01        6        1      22440661.    10  2013          10
## # ... with 37 more rows

Retraining the model using the original data split and a month_number of 14:

# number of months used to capture seasonality, may need to test out other values. Initially set to 14
month_number = 14
#Training number is the number of rows minus the month_number
trainingNum = nrow(arrange(industry6_location_1, date, time_number)) - month_number

#Training set is all rows from the start minus the number in the test set
trainingSet = head(arrange(industry6_location_1, date, time_number), trainingNum)

#Testing set is the last 'month_number' rows
testingSet = tail(arrange(industry6_location_1, date, time_number), month_number)

trainingSet
## # A tibble: 33 x 7
## # Groups:   date, industry [33]
##    date       industry location monthly_amount month  year time_number
##    <date>        <int>    <int>          <dbl> <int> <int>       <int>
##  1 2013-01-01        6        1      22275239.     1  2013           1
##  2 2013-02-01        6        1      18357462.     2  2013           2
##  3 2013-03-01        6        1      21865386.     3  2013           3
##  4 2013-04-01        6        1      21565534.     4  2013           4
##  5 2013-05-01        6        1      22374034.     5  2013           5
##  6 2013-06-01        6        1      24337420.     6  2013           6
##  7 2013-07-01        6        1      26165409.     7  2013           7
##  8 2013-08-01        6        1      21955779.     8  2013           8
##  9 2013-09-01        6        1      23139375.     9  2013           9
## 10 2013-10-01        6        1      22440661.    10  2013          10
## # ... with 23 more rows
time_number.lm <- lm(monthly_amount~time_number, data=trainingSet)
summary(time_number.lm)
## 
## Call:
## lm(formula = monthly_amount ~ time_number, data = trainingSet)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -11364450  -1966108    291910   1983046   6947682 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20085909    1377758  14.579 2.02e-15 ***
## time_number   206284      70709   2.917  0.00651 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3868000 on 31 degrees of freedom
## Multiple R-squared:  0.2154, Adjusted R-squared:  0.1901 
## F-statistic: 8.511 on 1 and 31 DF,  p-value: 0.006513

The original model had an R-squared value of 0.3168, whereas for this data set it is only 0.2154, an additional ~10% of unexplained variation in the data. Looking at the residuals, there is a substantial increase in the range between the minimum (-11,364,450) and the maximum (6,947,682) compared to our initial training set's minimum (-17,279) and maximum (16,923).

Let's also plot it.

ggplot(data = trainingSet, mapping = aes(x = time_number, y = monthly_amount)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title="Model 2, Using time_number as only predictor",
       subtitle = "industry 6, location 1",
       x="month number",
       y="Monthly Amount")
## `geom_smooth()` using formula 'y ~ x'

The data appears to follow a sinusoidal curve, which is not captured by our model. A new model would have to be developed to better capture this data, but it may only work for this particular data set.
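One candidate, not fitted in this report, would be a harmonic regression that adds an annual sine/cosine pair to the linear trend; a minimal sketch using the training set above:

# linear trend plus a 12-month sinusoidal component
seasonal.lm <- lm(monthly_amount ~ time_number +
                    sin(2 * pi * time_number / 12) +
                    cos(2 * pi * time_number / 12),
                  data = trainingSet)
summary(seasonal.lm)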

Looking at industry = 10, location = 8.

We will follow the exact same process.

# set up a temporary data set of aggregated values
industry10_location_8 <- aggregated_dataset[aggregated_dataset$industry == 10 & aggregated_dataset$location == 8, ]
industry10_location_8 <- arrange(industry10_location_8, date)
industry10_location_8$time_number = 1:nrow(industry10_location_8)
# arrange the data by time_number to make sure it's worked properly
arrange(industry10_location_8, time_number)
## # A tibble: 47 x 7
## # Groups:   date, industry [47]
##    date       industry location monthly_amount month  year time_number
##    <date>        <int>    <int>          <dbl> <int> <int>       <int>
##  1 2013-01-01       10        8      16065862.     1  2013           1
##  2 2013-02-01       10        8      15893236.     2  2013           2
##  3 2013-03-01       10        8      16822430.     3  2013           3
##  4 2013-04-01       10        8      16308113.     4  2013           4
##  5 2013-05-01       10        8      18623262.     5  2013           5
##  6 2013-06-01       10        8      17892471.     6  2013           6
##  7 2013-07-01       10        8      18356617.     7  2013           7
##  8 2013-08-01       10        8      19987638.     8  2013           8
##  9 2013-09-01       10        8      17688355.     9  2013           9
## 10 2013-10-01       10        8      19358444.    10  2013          10
## # ... with 37 more rows
# number of months used to capture seasonality, may need to test out other values. Initially set to 14
month_number = 14
#Training number is the number of rows minus the month_number
trainingNum = nrow(arrange(industry10_location_8, date, time_number)) - month_number

#Training set is all rows from the start minus the number in the test set
trainingSet = head(arrange(industry10_location_8, date, time_number), trainingNum)

#Testing set is the last 'month_number' rows
testingSet = tail(arrange(industry10_location_8, date, time_number), month_number)

trainingSet
## # A tibble: 33 x 7
## # Groups:   date, industry [33]
##    date       industry location monthly_amount month  year time_number
##    <date>        <int>    <int>          <dbl> <int> <int>       <int>
##  1 2013-01-01       10        8      16065862.     1  2013           1
##  2 2013-02-01       10        8      15893236.     2  2013           2
##  3 2013-03-01       10        8      16822430.     3  2013           3
##  4 2013-04-01       10        8      16308113.     4  2013           4
##  5 2013-05-01       10        8      18623262.     5  2013           5
##  6 2013-06-01       10        8      17892471.     6  2013           6
##  7 2013-07-01       10        8      18356617.     7  2013           7
##  8 2013-08-01       10        8      19987638.     8  2013           8
##  9 2013-09-01       10        8      17688355.     9  2013           9
## 10 2013-10-01       10        8      19358444.    10  2013          10
## # ... with 23 more rows
time_number.lm <- lm(monthly_amount~time_number, data=trainingSet)
summary(time_number.lm)
## 
## Call:
## lm(formula = monthly_amount ~ time_number, data = trainingSet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2522424  -954886  -307057   254751  8481001 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17115140     805939  21.236  < 2e-16 ***
## time_number   165305      41362   3.997 0.000369 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2262000 on 31 degrees of freedom
## Multiple R-squared:   0.34,  Adjusted R-squared:  0.3187 
## F-statistic: 15.97 on 1 and 31 DF,  p-value: 0.0003688
ggplot(data = trainingSet, mapping = aes(x = time_number, y = monthly_amount)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title="Model 2, Using time_number as only predictor",
       subtitle = "industry 10, location 8",
       x="month number",
       y="Monthly Amount")
## `geom_smooth()` using formula 'y ~ x'

Looking at the plot and the statistical summary for industry 10, location 8, we can see a clear sinusoidal trend in the data. There are two obvious points which sit far from our model; accounting for these two points would likely improve the model's predictive power.

Based solely on the R-squared value, our model has performed reasonably well, explaining 34% of the variation. I suspect removing the outliers would significantly improve the R-squared value.

As we have assessed performance solely on the RMSE, outliers such as those visible above substantially inflate our RMSE value.
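A sketch of the suspected improvement, dropping points with large standardised residuals and refitting (the 2.5 SD cutoff is an arbitrary assumption):

# flag observations whose standardised residuals exceed 2.5 SD, drop them and refit
std_res = rstandard(time_number.lm)
trainingSet_clean = trainingSet[abs(std_res) <= 2.5, ]
time_number_clean.lm = lm(monthly_amount ~ time_number, data = trainingSet_clean)
summary(time_number_clean.lm)$r.squared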

5. Evaluation and conclusion

Data analysis following the CRISP-DM methodology confirmed the relationship between monthly amount and the month number when treated as a sequential series. The relationship has been modelled across numerous locations and industries and was found to explain approximately 30% of the variation, on average, across the entire data set. There is a clear year-on-year upward trend in addition to variation due to seasonality. By developing more advanced models that account for those fluctuations, a more accurate and powerful model could be built. However, such a model may not be suitable across every industry and location.

Considering the insights of this report, the lower-performing locations and industries should be assessed, as the difference between them and the higher performers is significant. In addition, efforts should be made in preparation for the December-January period, when transactions decrease. Finally, the model developed using the available data may offer little benefit as it does not account for significant variation and would have low predictive performance; more data or more complex analysis would be necessary to improve the model.