Linear Regression Model on a time series data set of financial transactions.
The sales manager wants an accurate prediction of monthly_amount for next month.
Use the link above to download the data set to a local folder. Make sure the file is extracted and in native .csv format, and that the working directory is set to where the file is extracted.
library(tidyverse)
library(dplyr)
library(ggplot2)
library(caret)
library(Rcpp)
setwd("~/Documents/36106/Assignment_1/36106AT1")
trans_month <- read_csv("transactions.csv")
Let's have a look at this dataset.
trans_month
## # A tibble: 94,248 x 5
## date customer_id industry location monthly_amount
## <chr> <chr> <int> <int> <dbl>
## 1 1/1/13 70efdf2ec9b086079795c442636b5… 8 9 753851.
## 2 1/2/13 70efdf2ec9b086079795c442636b5… 8 9 651548.
## 3 1/3/13 70efdf2ec9b086079795c442636b5… 8 9 1138769.
## 4 1/4/13 70efdf2ec9b086079795c442636b5… 8 9 659739.
## 5 1/5/13 70efdf2ec9b086079795c442636b5… 8 9 770675.
## 6 1/6/13 70efdf2ec9b086079795c442636b5… 8 9 592246.
## 7 1/7/13 70efdf2ec9b086079795c442636b5… 8 9 820428.
## 8 1/8/13 70efdf2ec9b086079795c442636b5… 8 9 759836.
## 9 1/9/13 70efdf2ec9b086079795c442636b5… 8 9 892087.
## 10 1/10/13 70efdf2ec9b086079795c442636b5… 8 9 931722.
## # ... with 94,238 more rows
Looking at the dataset and how read_csv imported the file, we can see that the date column was imported as a character string, so we convert it to a proper Date type:
trans_month$date <- as.Date(trans_month$date,
format = "%d/%m/%y")
trans_month
## # A tibble: 94,248 x 5
## date customer_id industry location monthly_amount
## <date> <chr> <int> <int> <dbl>
## 1 2013-01-01 70efdf2ec9b086079795c44263… 8 9 753851.
## 2 2013-02-01 70efdf2ec9b086079795c44263… 8 9 651548.
## 3 2013-03-01 70efdf2ec9b086079795c44263… 8 9 1138769.
## 4 2013-04-01 70efdf2ec9b086079795c44263… 8 9 659739.
## 5 2013-05-01 70efdf2ec9b086079795c44263… 8 9 770675.
## 6 2013-06-01 70efdf2ec9b086079795c44263… 8 9 592246.
## 7 2013-07-01 70efdf2ec9b086079795c44263… 8 9 820428.
## 8 2013-08-01 70efdf2ec9b086079795c44263… 8 9 759836.
## 9 2013-09-01 70efdf2ec9b086079795c44263… 8 9 892087.
## 10 2013-10-01 70efdf2ec9b086079795c44263… 8 9 931722.
## # ... with 94,238 more rows
Are there any missing values in the training data? A quick way to find out is a missingness map:
library(Amelia)
missmap(trans_month, main = "Missing values vs observed")
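As a quick numeric cross-check alongside the map, we can also count the missing values per column with base R:
# count NAs in each column of the transactions data
colSums(is.na(trans_month))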
Let's plot date vs monthly_amount to see if there are any insights.
p1 <- ggplot(data = trans_month, aes(x=date, y=monthly_amount)) +
geom_point(aes(color=industry)) +
scale_y_continuous(labels = scales::comma) +
labs (title = "Total Transactions per Month",
subtitle = "Per Industry",
x = "Date", y = "Monthly Transaction Amount")
p1
Let's check whether there are any distinct differences when we change the variable investigated to location.
p2 <- ggplot(data = trans_month, aes(x=date, y=monthly_amount)) +
geom_point(aes(color=location)) +
scale_y_continuous(labels = scales::comma) +
labs (title = "Total Transactions per Month",
subtitle = "Per Location",
x = "Date", y = "Monthly Transaction Amount")
p2
It’s hard to extract much from this scatter plot, as the points are densely packed below the 12,500,000 mark. Let's try other plot types to see if they explain this data set better. It is, however, clear that the highest monthly transaction amounts are concentrated in the locations closer to 1.
Bar charts grouped by industry:
p4 <- ggplot(data=trans_month, aes(x=industry, y=monthly_amount)) +
geom_bar(stat="identity") +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(breaks=1:10) +
labs (title = "Total Transactions",
subtitle = "Per Industry",
x = "Industry", y = "Amount")
p4
Grouped by location:
p5 <- ggplot(data=trans_month, aes(x=location, y=monthly_amount)) +
geom_bar(stat="identity") +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(breaks=1:10) +
labs (title = "Total Transactions",
subtitle = "Per Location",
x = "Location", y = "Amount")
p5
At this stage I haven't removed any outliers, as this is still exploratory data analysis. Let's see how the business is performing against its mean.
trans_above_mean <- trans_month %>%
mutate(above_mean = ifelse(monthly_amount > mean(monthly_amount), TRUE, FALSE))
ggplot(data=trans_above_mean, aes(x=date, y=monthly_amount, fill = above_mean)) +
geom_bar(stat="identity") +
scale_y_continuous(labels = scales::comma) +
scale_fill_discrete(name = "Above Mean?") +
labs (title = "Total Transactions",
subtitle = "Per Month",
x = "Month", y = "Amount")
What are some insights?
Note that this is based on the sum of all transactions over the four years.
a1 <- trans_month %>%
group_by(date, industry, location) %>%
summarize(avg = mean(monthly_amount))
a1
## # A tibble: 3,886 x 4
## # Groups: date, industry [?]
## date industry location avg
## <date> <int> <int> <dbl>
## 1 2013-01-01 1 1 136081.
## 2 2013-01-01 1 2 177840.
## 3 2013-01-01 1 3 141632.
## 4 2013-01-01 1 4 221058.
## 5 2013-01-01 1 5 178138.
## 6 2013-01-01 1 6 133400.
## 7 2013-01-01 1 7 231599.
## 8 2013-01-01 1 8 143778.
## 9 2013-01-01 1 9 157416.
## 10 2013-01-01 1 10 188735.
## # ... with 3,876 more rows
Note the seasonality by month in this time series.
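To make that monthly seasonality easier to inspect, one option (a sketch reusing the aggregated a1 from above) is to plot the distribution of the averages by calendar month:
# distribution of average amounts by calendar month, pooled across years
ggplot(data = a1, aes(x = factor(format(date, "%m")), y = avg)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Seasonality Check", x = "Month", y = "Average Amount")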
#filtered for industry 1 and location 1
il1 <-
a1 %>%
filter(location == 1, industry == 1)
p7 <- ggplot(data=il1, aes(x=date, y=avg)) +
geom_smooth(stat="identity") +
geom_point(colour = "red", alpha = "0.5") +
scale_y_continuous(labels = scales::comma) +
scale_x_date(date_minor_breaks = "1 month") +
labs (title = "Industry 1 & Location 1",
subtitle = "Monthly Transaction Amount",
x = "Date", y = "Average Amount")
p7
Note 1. Remember that time is very important in this model, so be sure to include a variable for the time sequence (hint: this can simply be a 1 for the first month, 2 for the second month, etc.).
Note 2. Carefully think about how you split your test and train sets (hint: random is not appropriate!).
With time series forecasting, one-step forecasts may not be as relevant as multi-step forecasts. In this case, the cross-validation procedure based on a rolling forecasting origin can be modified to allow multi-step errors to be used. Suppose that we are interested in models that produce good 4-step-ahead forecasts: each training window then grows by one observation at a time, and the error is measured on the observation four steps beyond it. A minimal sketch of this rolling-origin evaluation follows.
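Here rolling_origin_errors and min_train are illustrative names (not from any package), and the simple trend model lm(y ~ t) stands in for whichever model is being evaluated; the series il1$avg comes from the filtering above.
rolling_origin_errors <- function(y, h = 4, min_train = 24) {
  # assumes length(y) > min_train + h
  errs <- c()
  for (origin in min_train:(length(y) - h)) {
    # fit on the first `origin` observations only
    train <- data.frame(t = 1:origin, y = y[1:origin])
    fit <- lm(y ~ t, data = train)
    # forecast the observation h steps past the training window
    pred <- predict(fit, newdata = data.frame(t = origin + h))
    errs <- c(errs, y[origin + h] - pred)
  }
  errs
}
sqrt(mean(rolling_origin_errors(il1$avg)^2)) # 4-step-ahead RMSE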
Multi-step time series forecasting: recursive forecasting and direct forecasting.
In recursive forecasting (also called iterated forecasting) you train your model for one-step-ahead forecasts only. After the training is done, you apply your final model recursively to forecast 1 step ahead, 2 steps ahead, etc., until you reach the desired n-step forecast horizon. To do this, you feed the forecast from each successive step back into the model to generate the next step. This approach is used by traditional forecasting algorithms like ARIMA and exponential smoothing, and can also be used for machine-learning-based forecasting.
Direct forecasting is when you train a separate model for each step, so you are trying to “directly” forecast the nth step ahead instead of reaching n steps recursively. See Ben Taieb et al. for a discussion of direct forecasting and more complex combined approaches; note that Hyndman’s blog post on cross-validation for time series covers both one-step-ahead and direct forecasting. A sketch contrasting the two approaches follows.
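This is a minimal sketch using lag-based linear models on il1$avg; recursive_forecast and direct_forecast are illustrative names, not library functions.
# Recursive: fit one lag-1 model, then feed each forecast back in as the next lag
recursive_forecast <- function(y, h) {
  fit <- lm(cur ~ lag1, data = data.frame(cur = y[-1], lag1 = y[-length(y)]))
  out <- numeric(h)
  last <- tail(y, 1)
  for (i in 1:h) {
    out[i] <- predict(fit, newdata = data.frame(lag1 = last))
    last <- out[i] # the forecast becomes the input for the next step
  }
  out
}
# Direct: fit a separate lag-k model for each horizon k
direct_forecast <- function(y, h) {
  n <- length(y)
  sapply(1:h, function(k) {
    fit <- lm(ahead ~ now, data = data.frame(ahead = y[(1 + k):n], now = y[1:(n - k)]))
    predict(fit, newdata = data.frame(now = y[n]))
  })
}
recursive_forecast(il1$avg, 4)
direct_forecast(il1$avg, 4)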
We’ve only got 47 observations and will need to create a train/test split. k-fold cross-validation is too naive here, as it ignores the autocorrelation.
With 47 observations for this specific series, it’d be best to capture each year plus a few months to allow for seasonality. Let's split this roughly 60/20/20 into train/test/validate, giving 29 observations in the train set, 9 in the test set and 9 in the validate set. (In the code below, the train and validate portions, the first 38 observations, are fitted together and the final 9 are held out as the test set.)
#function to create the average monthly amount per date/industry/location, with integer month and year columns. Note that we have not included the id (time sequence) number as of yet; it is added after the call.
mean.fun <- function(df) {
output = df %>%
group_by(date, industry, location) %>%
summarize(avg = mean(avg))
# add integer month and year columns derived from the date
output = output %>%
mutate(month = format(as.Date(date), "%m")) %>%
mutate(year = format(as.Date(date), "%Y"))
output$month = as.integer(output$month)
output$year = as.integer(output$year)
return(output)
}
il1 <- mean.fun(il1)
il1$id <- 1:nrow(il1)
il1 <- as.data.frame(il1)
# Let's create a train/test split for this exercise.
il1.train <- slice(il1, 1:38)
il1.test <- slice(il1, 39:47)
#let's fit a linear model to see what we can get with this train set
il1.lm = lm(formula = avg ~ date + id + month,
data = il1.train)
summary(il1.lm)
##
## Call:
## lm(formula = avg ~ date + id + month, data = il1.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19393 -8260 2334 6935 17362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56745633.6 34054908.8 1.666 0.105
## date -3610.7 2172.6 -1.662 0.106
## id 110414.2 66105.2 1.670 0.104
## month 378.9 445.9 0.850 0.401
##
## Residual standard error: 9533 on 34 degrees of freedom
## Multiple R-squared: 0.3773, Adjusted R-squared: 0.3223
## F-statistic: 6.867 on 3 and 34 DF, p-value: 0.0009706
#let's plot it
p8 <- ggplot(data=il1.train, aes(x=date, y=avg)) +
geom_smooth(method = "lm", se = FALSE, colour = "blue") +
geom_point(colour = "red", alpha = "0.5") +
scale_x_date(date_minor_breaks = "1 month")
p8
plot(il1.lm)
prediction <- predict(il1.lm, il1.test)
print(((prediction - il1.test$avg)/il1.test$avg)*100)
## 1 2 3 4 5 6
## -10.127527 -5.500026 -8.258356 -8.739239 4.526101 -9.847730
## 7 8 9
## -5.230346 -4.504547 -13.920244
We can see that the predictions above vary from -14% to +5% of the actual figures in our test set. Note that when I didn't include the month variable, accuracy dropped by about 1%.
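As a compact numeric summary of the same comparison, a quick sketch using the prediction and il1.test objects from above:
# root mean squared error on the held-out set
sqrt(mean((prediction - il1.test$avg)^2))
# mean absolute percentage error, in %
mean(abs((prediction - il1.test$avg) / il1.test$avg)) * 100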
#method 1: create a new data frame with id = 48 so the prediction model can predict the outcome at this point in the sequence
december2016<-data.frame(date = "01/12/16",
industry=1,
location=1,
avg=0,
month=12,
year=2016,
id=48)
december2016$date <- as.Date(december2016$date,
format = "%d/%m/%y")
december2016$industry <- as.integer(december2016$industry)
december2016$location <- as.integer(december2016$location)
december2016$id <- as.integer(december2016$id)
#apply the linear model and predict December 2016
december2016$avg <- predict(il1.lm,december2016)
december2016
## date industry location avg month year id
## 1 2016-12-01 1 1 176893.6 12 2016 48
predicted_outcome <- rbind(il1, december2016)
p9 <- ggplot(data=predicted_outcome, aes(x=date, y=avg)) +
geom_smooth(stat="identity", method = "lm") +
geom_point(colour = "red", alpha = "0.5") +
scale_y_continuous(labels = scales::comma) +
scale_x_date(date_minor_breaks = "1 month") +
labs (title = "Industry 1 & Location 1",
subtitle = "December 2016 Prediction",
x = "Date", y = "Average Amount")
p9
As per the plot above, we can see that the prediction has catered for seasonality and followed the trend of the last quarter.
Often the validation of a model seems to consist of nothing more than quoting the R2 statistic from the fit (which measures the fraction of the total variability in the response that is accounted for by the model). Unfortunately, a high R2 value does not guarantee that the model fits the data well. Use of a model that does not fit the data well cannot provide good answers to the underlying engineering or scientific questions under investigation.
Numerical methods for model validation, such as the R2 statistic, are also useful, but usually to a lesser degree than graphical methods. Graphical methods have an advantage over numerical methods for model validation because they readily illustrate a broad range of complex aspects of the relationship between the model and the data. Numerical methods for model validation tend to be narrowly focused on a particular aspect of the relationship between the model and the data and often try to compress that information into a single descriptive number or test result.
Numerical methods do play an important role as confirmatory methods for graphical techniques, however. For example, the lack-of-fit test for assessing the correctness of the functional part of the model can aid in interpreting a borderline residual plot. There are also a few modeling situations in which graphical methods cannot easily be used. In these cases, numerical methods provide a fallback position for model validation. One common situation when numerical validation methods take precedence over graphical methods is when the number of parameters being estimated is relatively close to the size of the data set. In this situation residual plots are often difficult to interpret due to constraints on the residuals imposed by the estimation of the unknown parameters. One area in which this typically happens is in optimization applications using designed experiments. Logistic regression with binary data is another area in which graphical residual analysis can be difficult.
Make sure the assumptions are satisfactorily met:
Use a scatterplot or component-plus-residual plot to examine the linear relationship between the independent predictor(s) and the dependent variable.
Compose a plot of standardized residuals versus predicted values and ensure there is no extreme point with a very high residual, that the spread of the residuals is largely similar along the predicted values, and that they spread roughly equally above and below the residual mean of zero. You can also change the y-axis to residual²; this plot helps identify unequal variance.
Re-examine the study design to ensure the assumption of independence is reasonable.
Retrieve the variance inflation factor (VIF) or tolerance statistics to examine possible collinearity.
Examine potential influential point(s):
Check statistics such as Cook's D, DFFITS, or DFBETA to find out if a certain data point is drastically changing your regression results.
Examine the change in R2 and adjusted R2 statistics:
Being the ratio of the regression sum of squares to the total sum of squares, R2 tells you what percentage of the variability in your dependent variable is explained by the model. Adjusted R2 can be used to check whether the extra sum of squares brought about by the additional predictor(s) is really worth the degrees of freedom they take.
Check necessary interactions:
If there is a main independent predictor, check whether it interacts with other independent variables before interpreting its independent effect. Interaction, if left unadjusted, can bias your estimate.
Apply your model to another data set and check its performance:
You can apply the regression formula to other, separate data and see how well it predicts. A scatter plot, and statistics like the % difference from the observed values, can serve as a good start.
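A sketch of a few of these checks applied to the il1.lm model from earlier (assuming the car package is installed for the VIF):
# standardized residuals vs fitted values: look for non-random patterns or fanning
plot(fitted(il1.lm), rstandard(il1.lm),
  xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)
# variance inflation factors: large values flag collinearity (date and id are near-collinear here)
car::vif(il1.lm)
# largest Cook's distances flag potentially influential points
head(sort(cooks.distance(il1.lm), decreasing = TRUE))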
#create the loop parameters
#let's assume the target is December 2016
#apply mean.fun to the whole dataset to add the year and month columns
a1 <- mean.fun(a1)
Now we take the linear-modelling approach built for Industry 1 and Location 1 and apply it to the rest of the industries and locations.
calculate_predictions <- function(df, industries, locations) {
output = data.frame()
# Hold out a fixed number of months for testing; 9 captures a couple of months from the preceding year so the seasonal pattern is represented
testingNum = 9
# The test/train split should ensure the training data has 'trainingMultiplier' times the number of entries of the testing entries:
# * 1.6 was the minimum size based on trials that produced the same RMSE compared to higher sampling.
# * A value of 4 or above returned no results due to insufficient rows; 1.5 or below returned warnings about rank-deficient fits.
# * Between 1.6 and 3 provided the same RMSE error values.
# * However, when trainingMultiplier was >=1.6 or <=2, results included one extra location/industry combination.
# * The lowest number was chosen that did not change RMSE values, to maximise the chance of calculation on smaller sample sizes without compromising predictions.
trainingMultiplier = 1.6
for (ind in industries) {
for (loc in locations) {
# Create a subset of the data
temp = df[df$industry == ind & df$location == loc, ]
# Check that there are at least 'trainingMultiplier' times as many training rows as testing rows
if (nrow(temp) - testingNum >= testingNum * trainingMultiplier) {
# Arrange dataset by date (assign the result back; arrange() does not modify in place)
temp = arrange(temp, date)
# Add a number to represent date order
temp$time_number = c(1:nrow(temp))
# Training number is the number of rows minus the testingNum
trainingNum = nrow(temp) - testingNum
# Training set is all rows from the start minus the number in the test set. We'll arrange again, just in case.
trainingSet = head(arrange(temp, time_number), trainingNum)
# Testing set is the last 'testingNum' rows. We'll arrange again, just in case.
testingSet = tail(arrange(temp, time_number), testingNum)
# Run the model
training.model = lm(avg~time_number, data=trainingSet)
# testing.model = lm(monthly_amount~time_number, data=testingSet)
# Calculate the mean squared error
training.mse <- mean(residuals(training.model)^2)
#testing.mse <- mean(residuals(testing.model)^2)
# Calculate root mean squared error
training.rmse <- sqrt(training.mse)
# testing.rmse <- sqrt(testing.mse)
### Now, add an extra row into temp for the December 2016 prediction, giving December 2016 a monthly_amount of 0
# Create a dataframe containing just the December 2016 data
december_2016 = data.frame(date = "2016-12-01",
industry=ind,
location=loc,
avg=0,
month=12,
year=2016,
time_number=(nrow(temp)+1))
# Ensure temp is of type data frame
temp = as.data.frame(temp)
#testingSet = as.data.frame(testingSet)
# Add the December 2016 row
temp = rbind(temp, december_2016)
# testingSet = rbind(testingSet, december_2016)
# Output a prediction based on all rows and add it to the temp data frame
temp$prediction = predict(training.model, temp)
testingSet$prediction = predict(training.model, testingSet)
# Get the last prediction value (which is the Dec 2016 value).
train_dec_2016_prediction = tail(temp$prediction, 1)
# test_dec_2016_prediction = tail(testingSet$prediction, 1)
# Create row to add to the output data frame, including industry and location variables
dataRow = c(ind,loc,training.rmse,train_dec_2016_prediction)
} else {
# Append entry to output data frame when not enough data to compute
dataRow = c(ind,loc,NA,NA)
}
# Add the row to the output dataframe
output = rbind(output, dataRow)
}
}
#Add column names to the output data frame
colnames(output) <- c("Industry","Location", "RMSE", "Dec 2016 Prediction")
#Return the output
return(output)
}
# Get the list of unique industries, sorted in numerical order
industries <- sort(unique(a1$industry))
# Get the list of unique locations, sorted in numerical order
locations <- sort(unique(a1$location))
# Calculate the predictions for all industry and location pairs
all_predictions <- calculate_predictions(a1, industries, locations)
# Order by RMSE
arrange(all_predictions, RMSE)
## Industry Location RMSE Dec 2016 Prediction
## 1 10 3 603.6323 79534.52
## 2 9 5 2012.4789 66600.21
## 3 10 1 4446.7008 49845.73
## 4 1 8 5538.6968 126570.66
## 5 10 4 7106.1359 106503.13
## 6 1 6 7387.6404 119469.94
## 7 1 2 8548.6526 210623.53
## 8 1 1 9515.8878 178543.72
## 9 1 3 9603.3477 205911.95
## 10 10 9 9608.8493 99452.73
## 11 1 5 9863.2697 173610.63
## 12 10 5 11517.5418 143048.51
## 13 1 9 11805.5751 198723.95
## 14 1 10 15489.9964 242128.47
## 15 1 4 15988.3260 208957.94
## 16 4 2 16095.2431 225127.74
## 17 7 7 17066.0262 226031.89
## 18 1 7 17740.0193 161432.82
## 19 5 5 17751.3792 241300.70
## 20 4 8 21034.0076 237613.89
## 21 4 9 21262.1287 225816.21
## 22 7 1 24685.4813 252164.27
## 23 2 2 24841.8438 482435.01
## 24 2 7 25802.9804 273000.87
## 25 2 10 26797.2850 284117.69
## 26 4 7 27077.5698 540174.60
## 27 4 1 28557.1759 491501.14
## 28 4 10 28749.5933 311294.54
## 29 2 1 29719.7977 429667.31
## 30 7 2 33023.1388 293742.06
## 31 5 2 34858.0487 429031.34
## 32 7 6 35929.7034 219105.71
## 33 7 8 36228.8703 303746.65
## 34 2 6 36700.1712 366288.29
## 35 4 3 37680.6488 322742.01
## 36 3 8 37833.7998 178052.35
## 37 5 7 39769.1310 329872.86
## 38 4 5 42716.2924 445966.52
## 39 4 4 43166.4768 419228.93
## 40 7 5 44955.1432 325303.54
## 41 8 4 47181.7160 354409.03
## 42 5 4 47942.3169 254968.32
## 43 2 4 48671.9104 255885.48
## 44 5 9 49200.0968 440365.08
## 45 2 8 49243.8068 500516.34
## 46 5 8 50651.0459 465697.13
## 47 7 4 53833.1084 428257.03
## 48 7 10 54213.4159 497701.79
## 49 5 3 57612.6727 299553.86
## 50 2 3 57701.5777 607024.75
## 51 10 2 59745.2694 238273.67
## 52 2 5 60035.4295 274250.46
## 53 3 2 60244.4948 688737.10
## 54 5 1 63162.4640 610234.60
## 55 3 7 67254.4225 648664.85
## 56 8 9 70051.8602 553241.26
## 57 2 9 70325.1443 460783.34
## 58 5 6 87166.2626 699754.56
## 59 7 3 87524.9314 313296.41
## 60 5 10 91846.4981 462057.03
## 61 7 9 92696.3108 779314.88
## 62 3 1 95393.2153 601300.03
## 63 8 2 96793.6409 760218.25
## 64 9 1 99543.2810 631064.75
## 65 3 6 103163.3940 125025.75
## 66 3 4 109780.5334 1027582.90
## 67 3 3 114278.7645 555142.52
## 68 10 7 115089.4627 299069.69
## 69 8 10 115965.6449 656832.62
## 70 8 6 139653.7829 1010321.22
## 71 3 5 146819.0369 332931.68
## 72 9 2 148446.3713 1256372.52
## 73 9 7 155797.6856 1521353.94
## 74 4 6 162835.4399 1139849.79
## 75 8 3 169513.3461 663966.23
## 76 9 3 179000.5605 1004181.55
## 77 8 8 179116.9772 437249.51
## 78 8 7 179337.4219 257994.28
## 79 3 9 181730.2677 375564.46
## 80 9 9 196809.2936 1137264.71
## 81 3 10 198078.4934 -199643.72
## 82 8 5 204982.4257 123212.52
## 83 8 1 284506.8298 1460039.43
## 84 9 6 297460.1852 1763920.96
## 85 9 10 322657.7149 1308952.43
## 86 9 4 539522.8534 -253357.89
## 87 10 8 2574934.2447 25676380.07
## 88 6 1 4372304.0117 35859656.20
## 89 6 2 NA NA
## 90 6 3 NA NA
## 91 6 4 NA NA
## 92 6 5 NA NA
## 93 6 6 NA NA
## 94 6 7 NA NA
## 95 6 8 NA NA
## 96 6 9 NA NA
## 97 6 10 NA NA
## 98 9 8 NA NA
## 99 10 6 NA NA
## 100 10 10 NA NA
Using the Industry 1 and Location 1 linear model and applying its predictions to the rest of the locations and industries yielded poor results. To increase the performance of the model, I expanded the approach to fit a separate linear model for each industry and location combination, as in the function above.
A few reminders about the RMSE used above: it is sensitive to outliers, it is proportional to the size of the squared errors, and it is the square root of the average of the squared errors.
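For reference, a minimal helper matching that definition:
# RMSE: square root of the mean of the squared errors
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))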
#let's see if a single model improves when it can use all the available variables
a1id <- a1
a1id$id <- 1:nrow(a1)
test.lm <- lm(avg ~ industry + location + id + month + year, data = a1id)
summary(test.lm)
##
## Call:
## lm(formula = avg ~ industry + location + id + month + year, data = a1id)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2979831 -1014044 -514055 -38004 47457473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.716e+10 1.970e+10 3.409 0.000659 ***
## industry -7.178e+04 8.484e+04 -0.846 0.397525
## location -1.289e+05 2.370e+04 -5.441 5.63e-08 ***
## id 3.374e+04 9.858e+03 3.422 0.000627 ***
## month -2.759e+06 8.140e+05 -3.389 0.000707 ***
## year -3.336e+07 9.787e+06 -3.409 0.000659 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3938000 on 3880 degrees of freedom
## Multiple R-squared: 0.03266, Adjusted R-squared: 0.03142
## F-statistic: 26.2 on 5 and 3880 DF, p-value: < 2.2e-16
We can see that industry has a high p-value (0.398), i.e. little association with the response once the other variables are included. With all the factors, the multiple R-squared is 0.03266 and the adjusted R-squared is 0.03142.
#let's see if the model improves by removing industry
test.lm <- lm(avg ~ location + id + month + year, data = a1id)
summary(test.lm)
##
## Call:
## lm(formula = avg ~ location + id + month + year, data = a1id)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2867050 -1016602 -507317 -37700 47595431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.106e+10 5.117e+09 9.979 < 2e-16 ***
## location -1.215e+05 2.202e+04 -5.518 3.64e-08 ***
## id 2.568e+04 2.560e+03 10.031 < 2e-16 ***
## month -2.094e+06 2.121e+05 -9.871 < 2e-16 ***
## year -2.536e+07 2.542e+06 -9.979 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3937000 on 3881 degrees of freedom
## Multiple R-squared: 0.03248, Adjusted R-squared: 0.03149
## F-statistic: 32.58 on 4 and 3881 DF, p-value: < 2.2e-16
This actually decreased the multiple R-squared slightly (0.03248 vs 0.03266), although the adjusted R-squared rose marginally. Either way, building one overall model to predict all 10 industries and 10 locations is not feasible, as it explains only about 3 percent of the variability in the dataset.
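That conclusion is consistent with the per-combination loop earlier. As a compact alternative sketch of the same idea (assuming the broom package is available), the per-group fits can also be expressed with dplyr:
# R-squared of a simple trend model within each industry/location group;
# id increases with date, so it serves as a rough time index
a1id %>%
  group_by(industry, location) %>%
  group_modify(~ broom::glance(lm(avg ~ id, data = .x))) %>%
  select(industry, location, r.squared)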
#let's try separating it by the location variable
a1loc1 <- a1
a1loc1 <-
a1loc1 %>%
filter(location == 1)
a1loc1 <- mean.fun(a1loc1)
a1loc1$id <- 1:nrow(a1loc1)
test1.lm <- lm(avg ~ industry + id + month + year, data = a1loc1)
summary(test1.lm)
##
## Call:
## lm(formula = avg ~ industry + id + month + year, data = a1loc1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6670942 -3753849 -2705337 -1655322 44441942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.239e+10 5.866e+10 0.211 0.833
## industry 3.096e+05 3.185e+05 0.972 0.332
## id 6.220e+04 2.660e+05 0.234 0.815
## month -4.585e+05 2.472e+06 -0.185 0.853
## year -6.153e+06 2.914e+07 -0.211 0.833
##
## Residual standard error: 9327000 on 432 degrees of freedom
## Multiple R-squared: 0.01992, Adjusted R-squared: 0.01084
## F-statistic: 2.195 on 4 and 432 DF, p-value: 0.06875
Even worse: a multiple R-squared of 0.01992 and an adjusted R-squared of 0.01084.
i6 <-
a1 %>%
filter(industry == 6)
p8 <- ggplot(data=i6, aes(x=date, y=avg)) +
geom_smooth(stat="identity") +
geom_point(colour = "red", alpha = "0.5") +
scale_y_continuous(labels = scales::comma) +
scale_x_date(date_minor_breaks = "1 month")
p8
i1 <-
a1 %>%
filter(industry == 1)
p9 <- ggplot(data=i1, aes(x=date, y=avg)) +
geom_smooth(stat="identity") +
geom_point(colour = "red", alpha = "0.5") +
scale_y_continuous(labels = scales::comma) +
scale_x_date(date_minor_breaks = "1 month")
p9