You are given survey data that records the amount each visitor spent, among other data points. However, some of the spend amounts (totshopping.rep) are missing. Build a model that predicts the amount spent by a visitor from the data provided.
# Loading of Libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
# Loading the Data
train_in <- read.csv('./spendata.csv', header = TRUE)
# Cleaning the Data
# Keep only the columns that contain no missing values
train_in <- train_in[, colSums(is.na(train_in)) == 0]
# Drop the first column
train_in <- train_in[, -1]
dim(train_in)
## [1] 18379 228
We split the data (spendata.csv) into 50% for training and 50% for testing our Multiple Linear Regression Model.
# Splitting the Data
set.seed(1234)
inTrain <- createDataPartition(train_in$totshopping.rep, p = 0.5, list = FALSE)
trainData <- train_in[inTrain, ]
testData <- train_in[-inTrain, ]
# Removing Predictors with Near Zero Variance
NZV <- nearZeroVar(trainData)
# Guard: negative indexing with an empty vector would drop every column
if (length(NZV) > 0) {
  trainData <- trainData[, -NZV]
  testData <- testData[, -NZV]
}
dim(trainData)
## [1] 9191 109
dim(testData)
## [1] 9188 109
# Coerce every column to numeric (factor columns become their integer level codes)
trainData <- as.data.frame(lapply(trainData, as.numeric))
testData <- as.data.frame(lapply(testData, as.numeric))
# Plotting a Correlation Plot for Training Data
cor_mat <- cor(trainData)
corrplot(cor_mat, order = "FPC", method = "color",
         type = "upper", tl.cex = 0.8, tl.col = rgb(0, 0, 0))
In the Correlation Plot shown above, highly correlated variables appear as dark blue cells at their intersections. We used a threshold value of 0.95 to identify these highly correlated variables.
We used the highly correlated variables identified this way as the predictors for our Multiple Linear Regression Model.
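The report does not show how the 0.95 threshold was applied. The sketch below is one possible reading, assuming the threshold was applied to each variable's absolute correlation with the response. It runs on synthetic data, and the column names (`target`, `near_dup`, `related`, `noise`) are hypothetical stand-ins, not columns of spendata.csv.

```r
# Synthetic stand-in data; all column names here are hypothetical.
set.seed(1)
n <- 200
target <- rnorm(n, mean = 100, sd = 20)
toy <- data.frame(
  target   = target,
  near_dup = target + rnorm(n, sd = 1),   # almost a copy of the target
  related  = target + rnorm(n, sd = 30),  # only loosely related
  noise    = rnorm(n)                     # unrelated
)

toy_cor <- cor(toy)

# Correlation of every other column with the target
cor_with_target <- toy_cor["target", colnames(toy_cor) != "target"]

# Keep the columns whose absolute correlation exceeds the threshold
threshold <- 0.95
selected <- names(cor_with_target)[abs(cor_with_target) > threshold]
print(selected)
```

If the threshold was instead meant for pairs of predictors, caret's `findCorrelation()` applies a cutoff to a correlation matrix, and is usually used to remove redundant predictors rather than to select them.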
# Building our Multiple Linear Regression Model
modFit <- train(totshopping.rep ~ f.188 + b.15 + b.16 + b.17 + c.58 + c.159 + c.161 +
                  c.164 + c.165 + c.166 + var2 + c.32,
                method = "lm", data = trainData)
finMod <- modFit$finalModel
print(modFit)
## Linear Regression
##
## 9191 samples
## 12 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 9191, 9191, 9191, 9191, 9191, 9191, ...
## Resampling results:
##
##   RMSE      Rsquared   MAE
##   59.20283  0.9768327  43.17505
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Plotting Regression Diagnostic Plots
par(mfrow = c(2,2))
plot(finMod)
The Regression Diagnostic Plots show the residuals in four different ways: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.
The linearity assumption can be examined by inspecting the Residuals vs Fitted Plot. Fitted values are the predictions our model makes for the training data, while residuals are the differences between the observed and fitted values.
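As a small illustration of these two definitions, the sketch below fits a linear model to synthetic data (not the survey data) and checks that the residuals are the observed values minus the fitted values. The variable names are illustrative only.

```r
# Synthetic data with a known linear relationship
set.seed(2)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50)

fit <- lm(y ~ x)

# Fitted values: the model's predictions for the data it was trained on
fitted_vals <- predict(fit)

# Residuals: observed minus fitted
resid_vals <- y - fitted_vals

# These match what R stores on the fitted model object
same_as_builtin <- isTRUE(all.equal(unname(resid_vals), unname(residuals(fit))))
print(same_as_builtin)
```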
# Plotting the Residuals vs Fitted Plot
plot(finMod, 1, pch = 19, cex = 0.5)
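Our Residuals vs Fitted Plot shown above suggests that our data has good linearity. In this plot, linearity is supported when the residuals show no distinct pattern and the red smoothing line stays approximately horizontal around zero.
The normality assumption for the residuals can be examined by inspecting the Normal Q-Q Plot.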
# Plotting the Normal Q-Q Plot
plot(finMod, 2, pch = 19, cex = 0.5)
Our Normal Q-Q Plot shown above suggests that the residuals are close to normally distributed. In this plot, normality is supported when the points follow the straight dashed reference line.
However, there are some outliers in the Theoretical Quantiles range of 3 to 4. These outliers should be examined individually to determine whether they reflect something unusual about the observation or a data entry error.
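One way to pull out such observations for individual inspection is to flag rows by their standardized residuals. The sketch below uses synthetic data with one planted outlier. The cutoff of 3 is a common rule of thumb and an assumption here, not a value taken from this analysis.

```r
# Synthetic data with one deliberately planted outlier at row 10
set.seed(3)
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x + rnorm(100, sd = 0.5)
y[10] <- y[10] + 10

fit <- lm(y ~ x)

# Standardized residuals put every residual on a comparable scale
std_res <- rstandard(fit)

# Flag observations beyond the assumed cutoff of 3
flagged <- which(abs(std_res) > 3)
print(flagged)

# The flagged rows can then be inspected one by one
print(data.frame(x = x, y = y)[flagged, ])
```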
Homogeneity of the residual variance (homoscedasticity) can be examined by inspecting the Scale-Location Plot.
# Plotting the Scale-Location Plot
plot(finMod, 3, pch = 19, cex = 0.5)
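Our Scale-Location Plot shown above suggests that our data has good homoscedasticity. In this plot, homoscedasticity is supported when the red line is approximately horizontal and the points are spread evenly around it across the range of fitted values.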
Outliers, High Leverage Points and Influential Values can be identified by inspecting Cook’s Distance and the Residuals vs Leverage Plot. An Outlier is a data point that has an extreme outcome variable value. A High Leverage Point is a data point that has an extreme predictor variable value. An Influential Value is associated with a large residual and its inclusion or exclusion can alter the regression analysis.
Not all extreme data points are influential in regression analysis. Cook’s Distance is a metric used to determine the influence of a value. It defines influence as a combination of leverage and residual size.
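To make the "leverage and residual size" combination concrete, the sketch below computes Cook's Distance by hand on synthetic data, using the standard identity D = (r^2 / p) * h / (1 - h), where r is the standardized residual, h the leverage (hat value), and p the number of model coefficients. It then compares the result with R's built-in `cooks.distance()`. The data and variable names are illustrative only.

```r
# Synthetic data: the last point has an extreme predictor value (high leverage)
# and is also pushed far off the trend (large residual)
set.seed(4)
x <- c(runif(30, 0, 10), 25)
y <- 2 + 1.5 * x + rnorm(31)
y[31] <- y[31] - 15

fit <- lm(y ~ x)

h <- hatvalues(fit)      # leverage of each observation
r <- rstandard(fit)      # standardized residuals
p <- length(coef(fit))   # number of coefficients, including the intercept

# Cook's Distance combines residual size and leverage
cooks_manual <- (r^2 / p) * h / (1 - h)

# Agrees with the built-in implementation
matches_builtin <- isTRUE(all.equal(cooks_manual, cooks.distance(fit)))
print(matches_builtin)

# The planted point is the most influential observation
most_influential <- unname(which.max(cooks_manual))
print(most_influential)
```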
# Plotting Cook's Distance and the Residuals vs Leverage Plot
par(mfrow = c(1,2))
plot(finMod, 4, pch = 19, cex = 0.5)
plot(finMod, 5, pch = 19, cex = 0.5)
Our Residuals vs Leverage Plot shown above does not indicate any influential points. In this plot, that reading is supported when no points fall outside the dashed Cook’s Distance contours.
However, the plots above do show some outliers. These outliers should be examined individually to determine whether they reflect something unusual about the observation or a data entry error.
# Prediction with our Multiple Linear Regression Model
Prediction <- predict(modFit, testData)
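# Observed vs predicted spend, coloured by month (qplot() is deprecated in recent ggplot2)
ggplot(testData, aes(totshopping.rep, Prediction, colour = month)) + geom_point()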
# Summary of our Prediction
summary(Prediction)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    19.0   113.5   224.0   326.1   384.6  5188.4
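The summary above describes the distribution of the predictions but not how close they are to the observed values. A minimal sketch of a held-out evaluation follows, using synthetic data and base R only. The names `toy`, `train_toy`, and `test_toy` are hypothetical, and the metrics mirror the RMSE, Rsquared, and MAE columns that caret reports.

```r
# Synthetic regression data, split 50/50 into training and test sets
set.seed(5)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 10 + 4 * toy$x1 - 2 * toy$x2 + rnorm(n)

idx <- sample(n, n / 2)
train_toy <- toy[idx, ]
test_toy  <- toy[-idx, ]

fit <- lm(y ~ x1 + x2, data = train_toy)
pred <- predict(fit, newdata = test_toy)

# Error metrics on the held-out rows
err  <- test_toy$y - pred
rmse <- sqrt(mean(err^2))          # root mean squared error
mae  <- mean(abs(err))             # mean absolute error
r2   <- cor(test_toy$y, pred)^2    # squared correlation of observed and predicted

print(c(RMSE = rmse, Rsquared = r2, MAE = mae))
```

With the objects from this report, the same three numbers should be obtainable from `postResample(Prediction, testData$totshopping.rep)` in caret.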