Linear Regression Model on a time series data set of financial Transactions.

Sales manager wants to have an accurate prediction for monthly_amount next month.

Import the data & loading the neccesary libraries

Use the link above to download the database to a local folder. Make sure that the file is extracted and is in a native .csv format. Make sure the working directory is set to where the file is extracted.

1 - Undertake EDA (Exploratory Data Analysis)

1a. Data Cleansing

library(tidyverse)
library(dplyr)
library(ggplot2)
library(caret)
library(Rcpp)

setwd("~/Documents/36106/Assignment_1/36106AT1")

trans_month <- read_csv("transactions.csv")

lets have a look at this dataset

trans_month

## # A tibble: 94,248 x 5
##    date    customer_id                    industry location monthly_amount
##    <chr>   <chr>                             <int>    <int>          <dbl>
##  1 1/1/13  70efdf2ec9b086079795c442636b5…        8        9        753851.
##  2 1/2/13  70efdf2ec9b086079795c442636b5…        8        9        651548.
##  3 1/3/13  70efdf2ec9b086079795c442636b5…        8        9       1138769.
##  4 1/4/13  70efdf2ec9b086079795c442636b5…        8        9        659739.
##  5 1/5/13  70efdf2ec9b086079795c442636b5…        8        9        770675.
##  6 1/6/13  70efdf2ec9b086079795c442636b5…        8        9        592246.
##  7 1/7/13  70efdf2ec9b086079795c442636b5…        8        9        820428.
##  8 1/8/13  70efdf2ec9b086079795c442636b5…        8        9        759836.
##  9 1/9/13  70efdf2ec9b086079795c442636b5…        8        9        892087.
## 10 1/10/13 70efdf2ec9b086079795c442636b5…        8        9        931722.
## # ... with 94,238 more rows

Looking at the dataset and how the read_csv imported the csv, we can understand that the date is imported as a string rather than .

trans_month$date <- as.Date(trans_month$date,
                            format = "%d/%m/%y")

trans_month

## # A tibble: 94,248 x 5
##    date       customer_id                 industry location monthly_amount
##    <date>     <chr>                          <int>    <int>          <dbl>
##  1 2013-01-01 70efdf2ec9b086079795c44263…        8        9        753851.
##  2 2013-02-01 70efdf2ec9b086079795c44263…        8        9        651548.
##  3 2013-03-01 70efdf2ec9b086079795c44263…        8        9       1138769.
##  4 2013-04-01 70efdf2ec9b086079795c44263…        8        9        659739.
##  5 2013-05-01 70efdf2ec9b086079795c44263…        8        9        770675.
##  6 2013-06-01 70efdf2ec9b086079795c44263…        8        9        592246.
##  7 2013-07-01 70efdf2ec9b086079795c44263…        8        9        820428.
##  8 2013-08-01 70efdf2ec9b086079795c44263…        8        9        759836.
##  9 2013-09-01 70efdf2ec9b086079795c44263…        8        9        892087.
## 10 2013-10-01 70efdf2ec9b086079795c44263…        8        9        931722.
## # ... with 94,238 more rows

Are there any missing values in the training data? A nice way to quickly find that out:

library(Amelia)

missmap(trans_month, main = "Missing values vs observed")

As per the above exercises, it is crucial to change the date format from “character” to “Date” so R can classify this column correctly. This will help us able to data munge properly.
We can see that there arn’t any missing values which is great!

1b. Insights from EDA

lets try and plot this with date vs monthly_amount to see if theres any insights.

p1 <- ggplot(data = trans_month, aes(x=date, y=monthly_amount)) +
  geom_point(aes(color=industry)) +
  scale_y_continuous(labels = scales::comma) +
  labs (title = "Total Transactions per Month",
        subtitle = "Per Industry",
        x = "Date", y = "Monthly Transaction Amount") 
  

p1

Lets check to see if theres any distinct differences changing the variable investigated to location

p2 <- ggplot(data = trans_month, aes(x=date, y=monthly_amount)) +
  geom_point(aes(color=location)) +
  scale_y_continuous(labels = scales::comma) +
  labs (title = "Total Transactions per Month",
        subtitle = "Per Location",
        x = "Date", y = "Monthly Transaction Amount")

p2

It’s hard to grasp how the all this information is in a scatter plot as there is a dense amount of points <12,500,000 range. Lets try other plot methods to see if it can explain this data set better. It is however clear that the highest monthly transaction amounts are dominated in the locations closer to 1.

bar charts Grouped by industry

Ask someone whether we can put the percentage into the bar itself.

p4 <- ggplot(data=trans_month, aes(x=industry, y=monthly_amount)) +
  geom_bar(stat="identity") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks=1:10) +
  labs (title = "Total Transactions",
        subtitle = "Per Industry",
        x = "Industry", y = "Amount")


p4

grouped by location

p5 <- ggplot(data=trans_month, aes(x=location, y=monthly_amount)) +
  geom_bar(stat="identity") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks=1:10)  +
  labs (title = "Total Transactions",
        subtitle = "Per Location",
        x = "Location", y = "Amount")

p5

at the moment i haven’t removed any of the outliers yet as it’s still EDA. Lets try and see how the company/business is performing against their mean.

trans_above_mean <- trans_month %>%
  mutate(above_mean = ifelse(monthly_amount > mean(monthly_amount), TRUE, FALSE))

ggplot(data=trans_above_mean, aes(x=date, y=monthly_amount, fill = above_mean)) +
  geom_bar(stat="identity") +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_discrete(name = "Above Mean?") +
  labs (title = "Total Transactions",
        subtitle = "Per Month",
        x = "Month", y = "Amount")

What are some insights?

It is clear that every year, the sales peak in December and drops dramatically in January the following year.
Location 1 is the strongest performer in this data set with location 2 coming in second.*
Industries 1 & 2 are the strongest performers with 3 & 6 coming in second.*

*Note, this is based off the sum of all transactions over the 4 years.

2 - Basic Model fitting

2a. i. Aggregated data set using the fields date, industry and location, with a mean of monthly_amount

a1 <- trans_month %>%
  group_by(date, industry, location) %>%
  summarize(avg = mean(monthly_amount))

a1

## # A tibble: 3,886 x 4
## # Groups:   date, industry [?]
##    date       industry location     avg
##    <date>        <int>    <int>   <dbl>
##  1 2013-01-01        1        1 136081.
##  2 2013-01-01        1        2 177840.
##  3 2013-01-01        1        3 141632.
##  4 2013-01-01        1        4 221058.
##  5 2013-01-01        1        5 178138.
##  6 2013-01-01        1        6 133400.
##  7 2013-01-01        1        7 231599.
##  8 2013-01-01        1        8 143778.
##  9 2013-01-01        1        9 157416.
## 10 2013-01-01        1       10 188735.
## # ... with 3,876 more rows

2a. ii. Create a line plot of the variable monthly_amount for industry = 1 and location = 1.

Note the seasonality by month in this time series.

#filtered for industry 1 and location 1
il1 <- 
  a1 %>%
  filter(location == 1, industry == 1)
  
p7 <- ggplot(data=il1, aes(x=date, y=avg)) +
  geom_smooth(stat="identity") +
  geom_point(colour = "red", alpha = "0.5") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_minor_breaks = "1 month") +
  labs (title = "Industry 1 & Location 1",
        subtitle = "Monthly Transaction Amount",
        x = "Date", y = "Average Amount")

p7

2a. iii. For industry = 1 and location = 1, train a linear regression model with monthly_amount as the target.

Note 1. Remember that time is very important in this model, so be sure to include a variable for the time sequence (HINT: this can simply be a 1 for the first month, 2 for the second month, etc) Note 2. Carefully think about how you split your test and train sets. (Hint: Random is not appropriate!)

With time series forecasting, one-step forecasts may not be as relevant as multi-step forecasts. In this case, the cross-validation procedure based on a rolling forecasting origin can be modified to allow multi-step errors to be used. Suppose that we are interested in models that produce good 4-step-ahead forecasts. Then the corresponding diagram is shown below.

multistep times series forecasting: Recursive forecasting and direct forecasting:

In recursive forecasting (also called iterated forecasting) you train your model for one step ahead forecasts only. After the training is done you apply your final model recursively to forecast 1 step ahead, 2 steps ahead, etc…until you reach the desired n steps forecast horizon. To do this, you feed the forecast from each successive step back into the model to generate the next step. This approach is used by traditional forecasting algorithms like ARIMA and Exponential Smoothing algorithms, and can be also used for Machine Learning based forecasting (see this post for an example, and this post for some discussion).

Direct forecasting is when you train a separate model for each step (so you trying to “directly” forecast the nth step ahead instead of reaching n steps recursively. See Ben Taied et al. for a discussion of direct forecasting and more complex combined approaches. Note that Hyndman’s blog post on cross validation for time series covers both one step ahead and direct forecasting.

We’ve only got 47 observations and will require to create a Train/Test. k-fold is too naive to deal with the autocorrelation.

With 47 observations for this specific function, it’d be best to capture each year plus a few months to allow for seasonality. Lets split this into 60/20/20 Train/Test/Validate. This gives us 29 observations in the train set, 9 observations in the test set and 9 in the Validate set.

#function to create the average monthly amount per transaction. Note that we have not included the ID number as of yet.

mean.fun <- function(df) {
    
  output = df %>%
    group_by(date, industry, location) %>%
    summarize(avg = mean(avg))
  
  output = output %>%
    mutate(month = format(as.Date(date), "%m")) %>%
    mutate(year = format(as.Date(date), "%Y"))
  
  output$month = as.integer(output$month)
  output$year = as.integer(output$year)
  
  transform(output, month = as.integer(month), 
                    year = as.integer(year))
  
  return(output)

}

il1 <- mean.fun(il1)

il1$id <- 1:nrow(il1)
il1 <- as.data.frame(il1)

# lets create a train/test model for this exercise. 

il1.train <- slice(il1, 1:38)
il1.test <- slice(il1, 39:47)


#lets create a linear model to see what we can get with this train set.

il1.lm = lm(formula = avg ~ date + id + month, 
            data = il1.train)

summary(il1.lm)

## 
## Call:
## lm(formula = avg ~ date + id + month, data = il1.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19393  -8260   2334   6935  17362 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56745633.6 34054908.8   1.666    0.105
## date           -3610.7     2172.6  -1.662    0.106
## id            110414.2    66105.2   1.670    0.104
## month            378.9      445.9   0.850    0.401
## 
## Residual standard error: 9533 on 34 degrees of freedom
## Multiple R-squared:  0.3773, Adjusted R-squared:  0.3223 
## F-statistic: 6.867 on 3 and 34 DF,  p-value: 0.0009706

#lets plot it.

p8 <- ggplot(data=il1.train, aes(x=date, y=avg)) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +
  geom_point(colour = "red", alpha = "0.5") +
  scale_x_date(date_minor_breaks = "1 month")

p8

plot(il1.lm)

prediction <- predict(il1.lm, il1.test)
print(((prediction - il1.test$avg)/il1.test$avg)*100)

##          1          2          3          4          5          6 
## -10.127527  -5.500026  -8.258356  -8.739239   4.526101  -9.847730 
##          7          8          9 
##  -5.230346  -4.504547 -13.920244

We can see that the result above varies -14% to +5% of the actual figures in our test set. Note, that when i didn’t include the factors month, the results decreased by 1% accuracy.

2a. iv. Create a prediction for monthly_amount in December 2016

#method 1, create new data frame with the value id = 48 for the prediction model to predict the outcome at this sequence. 

december2016<-data.frame(date = "01/12/16",
                                   industry=1,
                                   location=1,
                                   avg=0,
                                  month=12,
                                  year=2016,
                                   id=48)

december2016$date <- as.Date(december2016$date,
                            format = "%d/%m/%y")
december2016$industry <- as.integer(december2016$industry)
december2016$location <- as.integer(december2016$location)
december2016$id <- as.integer(december2016$id)

#apply the linear model and predict december 2016

december2016$avg <- predict(il1.lm,december2016)



december2016

##         date industry location      avg month year id
## 1 2016-12-01        1        1 176893.6    12 2016 48

predicted_outcome <- rbind(il1, december2016)

p9 <- ggplot(data=predicted_outcome, aes(x=date, y=avg)) +
  geom_smooth(stat="identity", method = "lm") +
  geom_point(colour = "red", alpha = "0.5") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_minor_breaks = "1 month") +
  labs (title = "Industry 1 & Location 1",
        subtitle = "December 2016 Prediction",
        x = "Date", y = "Average Amount")


p9

As per the plot above, we can see that the prediction has catered for seasonality and have followed the trend for the last quarter.

2b. i. How well does your model fit the data it is trained on in a statistical sense? Define & Describe an appropriate quantitative measure. Justify your choice of measure.

Often the validation of a model seems to consist of nothing more than quoting the R2 statistic from the fit (which measures the fraction of the total variability in the response that is accounted for by the model). Unfortunately, a high R2 value does not guarantee that the model fits the data well. Use of a model that does not fit the data well cannot provide good answers to the underlying engineering or scientific questions under investigation.

Numerical methods for model validation, such as the R2 statistic, are also useful, but usually to a lesser degree than graphical methods. Graphical methods have an advantage over numerical methods for model validation because they readily illustrate a broad range of complex aspects of the relationship between the model and the data. Numerical methods for model validation tend to be narrowly focused on a particular aspect of the relationship between the model and the data and often try to compress that information into a single descriptive number or test result.

Numerical methods do play an important role as confirmatory methods for graphical techniques, however. For example, the lack-of-fit test for assessing the correctness of the functional part of the model can aid in interpreting a borderline residual plot. There are also a few modeling situations in which graphical methods cannot easily be used. In these cases, numerical methods provide a fallback position for model validation. One common situation when numerical validation methods take precedence over graphical methods is when the number of parameters being estimated is relatively close to the size of the data set. In this situation residual plots are often difficult to interpret due to constraints on the residuals imposed by the estimation of the unknown parameters. One area in which this typically happens is in optimization applications using designed experiments. Logistic regression with binary data is another area in which graphical residual analysis can be difficult.

Make sure the assumptions are satisfactorily met

Use scatterplot or component plus residual plot to examine the linear relationship between the independent predictor(s) and the dependent variable.

Compose a plot with standardized residual versus predicted value and ensure there isn’t extreme point with very high residual, and the spread of the residual is largely similar along the predicted value, as well as spreading largely equally above and below the mean of residual, zero.

You can also change the y-axis to residual2. This plot helps identifying unequal variance.

Re-examine the study design to ensure the assumption of independence is reasonable.

Retrieve the variance inflation factor (VIF) or tolerance statistics to examine possible collinearity.

Examine potential influential point(s)

Check statistics such as Cook’s D, DFits, or DF Beta to find out if a certain data point is drastically changing your regression results. You can find more here. Examine the change in R2 and Adjusted R2 statistics

Being the ratio of regression sum of squares to total sum of squares, R2 can tell you how many % of variability in your dependent variable are explained by the model. Adjusted R2 can be used to check if the extra sum of squares brought about my the additional predictor(s) is really worth the degrees of freedom they’ll take. Check necessary interaction

If there is a main independent predictor, before you make any interpretation of its independent effect, check if it is interacting with other independent variables. Interaction, if left unadjusted, can bias your estimate. Apply your model to another data set and check its performance

You can also apply the regression formula to other separate data and see how well it predicts. Graph like scatter plot and statistics like % difference from the observed value can serve as a good start.

2b. ii. How well does your model predicitng out-of-sample? Define & describe an appropriate quantitative measure. Justify your choice of measure.

2b. iii. Which features did you end up using in your final model? Justify your choice of features (this feature selection method).

3 - Advanced Model Fitting

3a. Apply the modelling process you built for industry 1 and location 1 to all industries and locations programmatically.

#create a loop parameter
#lets assume its for 2016 December

#apply to whole of dataset and add additional columns for year and month

a1 <- mean.fun(a1)

Applying the linear model built from Location 1 and Industry 1 and applying the the rest of the industries and locations.

now we have additional columns for the month number and year.

calculate_predictions <- function(df, industries, locations) {
  
  output = data.frame()
  
  # Want to make sure we capture a couple of months from the preceeding year to capture the seasonal data
  testingNum = 9
  
  # The test/train split should ensure training data has 'trainingMultiplier' times the number of entries of testing entries
  
  # * 1.6 was the minimum size based on trials that produced the same RMSE compared to higher sampling.
  # * A value of 4 or above returned no results due to insufficient rows. 1.5 or below returned warnings about rank-deficient fits
  # * Between 1.6 and 3 provided the same RMSE error values
  # * However, when trainingMultiplier was >=1.6 or <=2, results included one extra location/industry combination.
  # * The lowest number was chosen that did not change RMSE values, to maximise chance of calculation on smaller sample sizes without compromising predictions.
  
  
  
  for (ind in industries) {
    for (loc in locations) {
      
      # Create a subset of the data
      temp = df[df$industry == ind & df$location == loc, ]
      
      # Check to make sure you have at least 'trainingMultiplier' times the number of training rows than testing rows
      if (length(unique(temp$date)) >= testingNum) {
        
        # Arrange dataset by date
        arrange(temp, date)
        
        # Add a number to represent date order
        temp$time_number = c(1:nrow(temp))
        
        # Training number is the number of rows minus the testingNum
        trainingNum = nrow(temp) - testingNum
        
        # Training set is all rows from the start minus the number in the test set. We'll arrange again, just in case.
        trainingSet = head(arrange(temp, time_number), trainingNum)
        
        # Testing set is the last 'testingNum' rows. We'll arrange again, just in case.
        testingSet = tail(arrange(temp, time_number), testingNum)
        
        # Run the model
        training.model = lm(avg~time_number, data=trainingSet)
        # testing.model = lm(monthly_amount~time_number, data=testingSet)
        
        # Calculate the mean standard error
        training.mse <- mean(residuals(training.model)^2)
        #testing.mse <- mean(residuals(testing.model)^2)
        
        # Calculate root mean squared error
        training.rmse <- sqrt(training.mse)
        # testing.rmse <- sqrt(testing.mse)
        
        ### Now, add an extra row into temp for the December 2016 prediction, giving December 2016 a monthly_amount of 0
        
        # Create a dataframe containing just the December 2016 data
        december_2016 = data.frame(date = "2016-12-01",
                                   industry=ind,
                                   location=loc,
                                   avg=0,
                                   month=12,
                                   year=2016,
                                   time_number=(nrow(temp)+1))
        
        # Ensure temp is of type data frame
        temp = as.data.frame(temp)
        #testingSet = as.data.frame(testingSet)
        
        # Add the December 2016 row
        temp = rbind(temp, december_2016)
        # testingSet = rbind(testingSet, december_2016)
        
        # Output a prediction based on all rows and add it to the temp data frame
        temp$prediction = predict(training.model, temp)
        testingSet$prediction = predict(training.model, testingSet)
        
        # Get the last prediction value (which is the Dec 2016 value).
        train_dec_2016_prediction = tail(temp$prediction, 1)
        # test_dec_2016_prediction = tail(testingSet$prediction, 1)
        
        # Create row to add to the output data frame, including industry and location variables
        dataRow = c(ind,loc,training.rmse,train_dec_2016_prediction)
        
      } else {
        # Append entry to output data frame when not enough data to compute
        dataRow = c(ind,loc,NA,NA)
      }
      
      # Add the row to the output dataframe
      output = rbind(output, dataRow)
    }
  }
  
  #Add column names to the output data frame
  colnames(output) <- c("Industry","Location", "RMSE", "Dec 2016 Prediction")
  
  #Return the output
  return(output)
}


# Get the list of unique industries, sorted in numerical order 
industries <- sort(unique(a1$industry))
# Get the list of unique locations, sorted in numerical order
locations <- sort(unique(a1$location))

# Calculate the predictions for all industry and location pairs
all_predictions <- calculate_predictions(a1, industries, locations)

# Order by RMSE
arrange(all_predictions, RMSE)

##     Industry Location         RMSE Dec 2016 Prediction
## 1         10        3     603.6323            79534.52
## 2          9        5    2012.4789            66600.21
## 3         10        1    4446.7008            49845.73
## 4          1        8    5538.6968           126570.66
## 5         10        4    7106.1359           106503.13
## 6          1        6    7387.6404           119469.94
## 7          1        2    8548.6526           210623.53
## 8          1        1    9515.8878           178543.72
## 9          1        3    9603.3477           205911.95
## 10        10        9    9608.8493            99452.73
## 11         1        5    9863.2697           173610.63
## 12        10        5   11517.5418           143048.51
## 13         1        9   11805.5751           198723.95
## 14         1       10   15489.9964           242128.47
## 15         1        4   15988.3260           208957.94
## 16         4        2   16095.2431           225127.74
## 17         7        7   17066.0262           226031.89
## 18         1        7   17740.0193           161432.82
## 19         5        5   17751.3792           241300.70
## 20         4        8   21034.0076           237613.89
## 21         4        9   21262.1287           225816.21
## 22         7        1   24685.4813           252164.27
## 23         2        2   24841.8438           482435.01
## 24         2        7   25802.9804           273000.87
## 25         2       10   26797.2850           284117.69
## 26         4        7   27077.5698           540174.60
## 27         4        1   28557.1759           491501.14
## 28         4       10   28749.5933           311294.54
## 29         2        1   29719.7977           429667.31
## 30         7        2   33023.1388           293742.06
## 31         5        2   34858.0487           429031.34
## 32         7        6   35929.7034           219105.71
## 33         7        8   36228.8703           303746.65
## 34         2        6   36700.1712           366288.29
## 35         4        3   37680.6488           322742.01
## 36         3        8   37833.7998           178052.35
## 37         5        7   39769.1310           329872.86
## 38         4        5   42716.2924           445966.52
## 39         4        4   43166.4768           419228.93
## 40         7        5   44955.1432           325303.54
## 41         8        4   47181.7160           354409.03
## 42         5        4   47942.3169           254968.32
## 43         2        4   48671.9104           255885.48
## 44         5        9   49200.0968           440365.08
## 45         2        8   49243.8068           500516.34
## 46         5        8   50651.0459           465697.13
## 47         7        4   53833.1084           428257.03
## 48         7       10   54213.4159           497701.79
## 49         5        3   57612.6727           299553.86
## 50         2        3   57701.5777           607024.75
## 51        10        2   59745.2694           238273.67
## 52         2        5   60035.4295           274250.46
## 53         3        2   60244.4948           688737.10
## 54         5        1   63162.4640           610234.60
## 55         3        7   67254.4225           648664.85
## 56         8        9   70051.8602           553241.26
## 57         2        9   70325.1443           460783.34
## 58         5        6   87166.2626           699754.56
## 59         7        3   87524.9314           313296.41
## 60         5       10   91846.4981           462057.03
## 61         7        9   92696.3108           779314.88
## 62         3        1   95393.2153           601300.03
## 63         8        2   96793.6409           760218.25
## 64         9        1   99543.2810           631064.75
## 65         3        6  103163.3940           125025.75
## 66         3        4  109780.5334          1027582.90
## 67         3        3  114278.7645           555142.52
## 68        10        7  115089.4627           299069.69
## 69         8       10  115965.6449           656832.62
## 70         8        6  139653.7829          1010321.22
## 71         3        5  146819.0369           332931.68
## 72         9        2  148446.3713          1256372.52
## 73         9        7  155797.6856          1521353.94
## 74         4        6  162835.4399          1139849.79
## 75         8        3  169513.3461           663966.23
## 76         9        3  179000.5605          1004181.55
## 77         8        8  179116.9772           437249.51
## 78         8        7  179337.4219           257994.28
## 79         3        9  181730.2677           375564.46
## 80         9        9  196809.2936          1137264.71
## 81         3       10  198078.4934          -199643.72
## 82         8        5  204982.4257           123212.52
## 83         8        1  284506.8298          1460039.43
## 84         9        6  297460.1852          1763920.96
## 85         9       10  322657.7149          1308952.43
## 86         9        4  539522.8534          -253357.89
## 87        10        8 2574934.2447         25676380.07
## 88         6        1 4372304.0117         35859656.20
## 89         6        2           NA                  NA
## 90         6        3           NA                  NA
## 91         6        4           NA                  NA
## 92         6        5           NA                  NA
## 93         6        6           NA                  NA
## 94         6        7           NA                  NA
## 95         6        8           NA                  NA
## 96         6        9           NA                  NA
## 97         6       10           NA                  NA
## 98         9        8           NA                  NA
## 99        10        6           NA                  NA
## 100       10       10           NA                  NA

3b. calculate your evaluation measure for the training data and your testing data for all models. Identify the two industries and two locations for which your method performs worst.

Using industry 1 and location 1 as the linear model and applying predictions to the rest of the locations and industry’s yielded poor results. To increase the the performance of the model, i expanded the

RMSD is sensitive to outliers RMSD is proportional to the size of the squared error RMSD is the square root of the average of squared errors

3c. What might be causing the models on these two industries and lcoations to be performing poorly? How might you fix this in the future?

#lets see if this model improves understanding all the variables available?


a1id <- a1
a1id$id <- 1:nrow(a1)
test.lm <- lm(avg ~ industry + location + id + month + year, data = a1id)

summary(test.lm)

## 
## Call:
## lm(formula = avg ~ industry + location + id + month + year, data = a1id)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2979831 -1014044  -514055   -38004 47457473 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.716e+10  1.970e+10   3.409 0.000659 ***
## industry    -7.178e+04  8.484e+04  -0.846 0.397525    
## location    -1.289e+05  2.370e+04  -5.441 5.63e-08 ***
## id           3.374e+04  9.858e+03   3.422 0.000627 ***
## month       -2.759e+06  8.140e+05  -3.389 0.000707 ***
## year        -3.336e+07  9.787e+06  -3.409 0.000659 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3938000 on 3880 degrees of freedom
## Multiple R-squared:  0.03266,    Adjusted R-squared:  0.03142 
## F-statistic:  26.2 on 5 and 3880 DF,  p-value: < 2.2e-16

We can see that industry has a very low p value and association with the rest of the variables. With all the factors, we understand that the Mult R2 is 0.03266 whilst the adjust r-squared is 0.03142.

#lets see if this model improves by removing industry?

test.lm <- lm(avg ~ location + id + month + year, data = a1id)

summary(test.lm)

## 
## Call:
## lm(formula = avg ~ location + id + month + year, data = a1id)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2867050 -1016602  -507317   -37700 47595431 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.106e+10  5.117e+09   9.979  < 2e-16 ***
## location    -1.215e+05  2.202e+04  -5.518 3.64e-08 ***
## id           2.568e+04  2.560e+03  10.031  < 2e-16 ***
## month       -2.094e+06  2.121e+05  -9.871  < 2e-16 ***
## year        -2.536e+07  2.542e+06  -9.979  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3937000 on 3881 degrees of freedom
## Multiple R-squared:  0.03248,    Adjusted R-squared:  0.03149 
## F-statistic: 32.58 on 4 and 3881 DF,  p-value: < 2.2e-16

This actually decreased our R2 values (although marginal)!. Building an overall model to predict all 10 industries and 10 locations is not feasible as it can only explain approximately 3 percent of the dataset.

#lets try and seperate it by location variable


a1loc1 <- a1

a1loc1 <- 
  a1loc1 %>%
  filter(location == 1)

a1loc1 <- mean.fun(a1loc1)

a1loc1$id <- 1:nrow(a1loc1)
test1.lm <- lm(avg ~ industry + id + month + year, data = a1loc1)

summary(test1.lm)

## 
## Call:
## lm(formula = avg ~ industry + id + month + year, data = a1loc1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6670942 -3753849 -2705337 -1655322 44441942 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.239e+10  5.866e+10   0.211    0.833
## industry     3.096e+05  3.185e+05   0.972    0.332
## id           6.220e+04  2.660e+05   0.234    0.815
## month       -4.585e+05  2.472e+06  -0.185    0.853
## year        -6.153e+06  2.914e+07  -0.211    0.833
## 
## Residual standard error: 9327000 on 432 degrees of freedom
## Multiple R-squared:  0.01992,    Adjusted R-squared:  0.01084 
## F-statistic: 2.195 on 4 and 432 DF,  p-value: 0.06875

Even worse! 0.01992, adjusted r2 0.01084

4 - Reporting

4a. Using all the notes and answers you have above, wrap up all your work into a report for the sales manager that follows the CRISP-DM methodology. Whilst the sales manager is not a DS, they are intelligent and have some experience in data analytics. Therefore it will be an important task to ensure you include enough technical details to withstand QA and technical scrutiny whilst positioning for a business audience.

4b. Ensure you include your predictions for Decemeber 2016 as an appendix. (Your prdictions, the actual, the difference) which can be referenced in your report.

5 - Submission

5a. Submit your report and professionally commented R-code.

i6 <- 
  a1 %>%
  filter(industry == 6)

p8 <- ggplot(data=i6, aes(x=date, y=avg)) +
  geom_smooth(stat="identity") +
  geom_point(colour = "red", alpha = "0.5") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_minor_breaks = "1 month")

p8

i1 <- 
  a1 %>%
  filter(industry == 1)

p9 <- ggplot(data=i1, aes(x=date, y=avg)) +
  geom_smooth(stat="identity") +
  geom_point(colour = "red", alpha = "0.5") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_minor_breaks = "1 month")

p9

36106AT1 Part A - Linear Regression

Edward Truong

03/09/2018