Overview

You are given a set of survey data that captures the spend amount among other data points. However, some of the spend amounts (totshopping.rep) are missing. Build a model that predicts the amount spent by a visitor from the data provided.

Loading of Libraries

# Loading of Libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded

Loading and Cleaning of Data

# Loading the Data
train_in <- read.csv('./spendata.csv', header = TRUE)

# Cleaning the Data
# Keep only the columns that contain no missing (NA) values
train_in <- train_in[, colSums(is.na(train_in)) == 0]
# Drop the first column
train_in <- train_in[, -1]
dim(train_in)
## [1] 18379   228
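
To make the column filter used above concrete, here is a minimal sketch on a toy data frame. The column names (id, spend, age, visits) are hypothetical and are not taken from spendata.csv. Note that a column is dropped as soon as it contains a single NA, so a response column whose gaps are coded as NA would be removed by the same filter.

```r
# Toy data frame with hypothetical columns (not from spendata.csv)
toy <- data.frame(
  id     = 1:4,
  spend  = c(120, NA, 80, 45),
  age    = c(34, 51, 28, 40),
  visits = c(2, 3, NA, 1)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the columns with no missing values
complete_cols <- toy[, na_counts == 0, drop = FALSE]
print(names(complete_cols))
```

Checking na_counts before filtering is a cheap way to confirm which columns are about to be discarded.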

Preparing Datasets for Prediction

We split the data (spendata.csv) into 50% for training and 50% for testing our Multiple Linear Regression Model.

# Splitting the Data
set.seed(1234)
inTrain <- createDataPartition(train_in$totshopping.rep, p = 0.5, list = FALSE)
trainData <- train_in[inTrain, ]
testData <- train_in[-inTrain, ]

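# Removing Predictors with Near Zero Variance
# (indices are computed on the training data only and applied to both sets)
NZV <- nearZeroVar(trainData)
if (length(NZV) > 0) {  # an empty negative index would drop every column
  trainData <- trainData[, -NZV]
  testData <- testData[, -NZV]
}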
dim(trainData)
## [1] 9191  109
dim(testData)
## [1] 9188  109
# Converting every column to numeric (factor columns, if any, become their integer level codes)
trainData <- as.data.frame(lapply(trainData, as.numeric))
testData <- as.data.frame(lapply(testData, as.numeric))
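
The near zero variance filter above relies on two per-column metrics. The sketch below computes them in base R for a single hypothetical predictor. It is an illustration of the idea, not caret's implementation; the thresholds mirror what we understand to be the documented defaults of nearZeroVar (freqCut = 95/5 and uniqueCut = 10).

```r
# Hypothetical toy predictor: 98 zeros and 2 ones
x <- c(rep(0, 98), rep(1, 2))

# Frequency ratio: count of the most common value over the second most common
counts <- sort(table(x), decreasing = TRUE)
freq_ratio <- as.numeric(counts[1] / counts[2])

# Percentage of distinct values relative to the number of samples
pct_unique <- 100 * length(unique(x)) / length(x)

# Thresholds assumed to match caret's defaults (freqCut = 95/5, uniqueCut = 10)
is_nzv <- freq_ratio > 95 / 5 && pct_unique <= 10

print(c(freq_ratio = freq_ratio, pct_unique = pct_unique))
print(is_nzv)
```

A column like this carries almost no information for a linear model, which is why such columns are removed before fitting.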

Plotting a Correlation Plot for Training Data

# Plotting a Correlation Plot for Training Data
cor_mat <- cor(trainData)
corrplot(cor_mat, order = "FPC", method = "color",
         type = "upper", tl.cex = 0.8, tl.col = rgb(0, 0, 0))

In the Correlation Plot shown above, dark blue cells mark pairs of variables with a strong positive correlation and dark red cells mark a strong negative correlation. The plot itself applies no cutoff; the threshold value of 0.95 is applied in the next step to determine the highly correlated variables.

Finding Highly Correlated Variables in Training Data

# Finding Highly Correlated Variables in Training Data
highlyCorrelated <- findCorrelation(cor_mat, cutoff = 0.95)
names(trainData)[highlyCorrelated]
##  [1] "f.188" "b.15"  "b.16"  "b.17"  "c.58"  "c.159" "c.161" "c.164" "c.165"
## [10] "c.166" "var2"  "c.32"
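
To see which pairs of variables exceed a cutoff, the correlation matrix can also be inspected directly. The sketch below uses base R on toy data with hypothetical columns a, b and c. It lists the offending pairs and is not the algorithm that findCorrelation uses to choose which column of each pair to flag.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.05),  # nearly a copy of a
  c = rnorm(n)                  # unrelated to a and b
)

cor_toy <- cor(toy)

# Find pairs above the cutoff, looking at the upper triangle only
# so that each pair is reported once and the diagonal is skipped
cutoff <- 0.95
idx <- which(abs(cor_toy) > cutoff & upper.tri(cor_toy), arr.ind = TRUE)

pairs_found <- data.frame(
  first  = rownames(cor_toy)[idx[, "row"]],
  second = colnames(cor_toy)[idx[, "col"]],
  r      = cor_toy[idx],
  stringsAsFactors = FALSE
)
print(pairs_found)
```

Listing the pairs makes it easier to judge whether a flagged column duplicates another predictor or is closely tied to the response.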

Building our Multiple Linear Regression Model

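We used the variables flagged above as Predictors for our Multiple Linear Regression Model. Note that findCorrelation() flags columns that are highly correlated with other columns in the correlation matrix (it is designed to suggest columns to remove); it does not by itself measure how strongly each column relates to totshopping.rep.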

# Building our Multiple Linear Regression Model
modFit <- train(totshopping.rep ~ f.188 + b.15 + b.16 + b.17 + c.58 + c.159 + c.161 +
                  c.164 + c.165 + c.166 + var2 + c.32, method = "lm", data = trainData)
finMod <- modFit$finalModel
print(modFit)
## Linear Regression 
## 
## 9191 samples
##   12 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 9191, 9191, 9191, 9191, 9191, 9191, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   59.20283  0.9768327  43.17505
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
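
The resampling summary above reports 25 bootstrap replicates, each the same size as the training set. The sketch below shows the general idea in base R on simulated data. It assumes that each replicate is scored on the rows left out of the resample, which is our understanding of how caret handles bootstrap resampling; the variables x1, x2 and y are hypothetical.

```r
set.seed(42)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 + 2 * toy$x1 - toy$x2 + rnorm(n, sd = 1)

reps <- 25
boot_rmse <- numeric(reps)

for (i in seq_len(reps)) {
  in_bag  <- sample(n, n, replace = TRUE)   # resample rows with replacement
  out_bag <- setdiff(seq_len(n), in_bag)    # rows never drawn in this replicate
  fit  <- lm(y ~ x1 + x2, data = toy[in_bag, ])
  pred <- predict(fit, newdata = toy[out_bag, ])
  boot_rmse[i] <- sqrt(mean((toy$y[out_bag] - pred)^2))
}

# The reported figure is the average over the replicates
mean_rmse <- mean(boot_rmse)
print(mean_rmse)
```

Because every replicate is scored on rows that the fit did not see, the averaged RMSE is a less optimistic estimate than the error measured on the training rows themselves.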

Plotting Regression Diagnostic Plots

# Plotting Regression Diagnostic Plots
par(mfrow = c(2,2))
plot(finMod)

The Regression Diagnostic Plots show residuals in 4 different ways:

  1. Residuals vs Fitted - Used to examine the linearity assumptions. A horizontal line, without distinct patterns, is a good indication of linearity.
  2. Normal Q-Q - Used to examine whether the residuals are normally distributed. The residuals are approximately normal if the points follow the straight dashed line.
  3. Scale-Location / Spread-Location - Used to examine the homogeneity in variances of residuals (Homoscedasticity). A horizontal line with equally spread points is a good indication of Homoscedasticity.
  4. Residuals vs Leverage - Used to identify extreme values that may influence the analysis.

Linearity of the Data

The linearity assumptions can be examined by inspecting the Residuals vs Fitted Plot. Fitted values are predictions derived from our model and training data, while residuals are the difference between the observed and estimated values.

# Plotting the Residuals vs Fitted Plot
plot(finMod, 1, pch = 19, cex = 0.5)

The Residuals vs Fitted Plot shown above suggests that our data has good linearity. Characteristics to support this include:

  1. The residuals are well distributed around the 0 straight dashed line, suggesting good linearity.
  2. The residuals form a horizontal band around the 0 straight dashed line, suggesting similar variances.
  3. Few residuals stand out from the distribution pattern, suggesting few outliers.

Normality of Residuals

The normality assumptions can be examined by inspecting the Normal Q-Q Plot.

# Plotting the Normal Q-Q Plot
plot(finMod, 2, pch = 19, cex = 0.5)

The Normal Q-Q Plot shown above suggests that the residuals are approximately normally distributed. Characteristics to support this include:

  1. The residual points fall approximately along the straight dashed line.

However, some points depart from the line in the Theoretical Quantiles range of 3 to 4. These outliers will need to be examined individually to determine whether they reflect genuinely unusual visitors or data entry errors.
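
A numeric check can complement the visual reading of the Q-Q plot. The sketch below is an illustration on simulated data, not part of our original analysis. It applies the Shapiro-Wilk test to standardized residuals. Because shapiro.test() accepts at most 5000 values, a random subset is tested, and residuals beyond 3 standard deviations are listed for individual review (the cutoff of 3 is an assumed rule of thumb).

```r
set.seed(7)
n <- 9000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Standardized residuals are on a common scale, which makes cutoffs meaningful
res <- rstandard(fit)

# shapiro.test() is limited to 5000 observations, so test a random subset
res_sample <- sample(res, 5000)
sw <- shapiro.test(res_sample)
print(sw)

# Rows whose standardized residual exceeds 3 in absolute value
flagged <- which(abs(res) > 3)
print(length(flagged))
```

With thousands of observations, even small departures from normality can give a small p-value, so the test is best read alongside the plot rather than instead of it.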

Homoscedasticity

Homogeneity in variances of residuals can be examined by inspecting the Scale-Location Plot.

# Plotting the Scale-Location Plot
plot(finMod, 3, pch = 19, cex = 0.5)

The Scale-Location Plot shown above suggests that our data has good homoscedasticity. Characteristics to support this include:

  1. The residuals are equally spread along the range of fitted values.
  2. The residuals are equally spread around the horizontal red line.
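
The visual check can be backed by a simple score test. The sketch below is a hand-rolled, Breusch-Pagan-style test in Koenker's studentized form, using the fitted values as the single auxiliary regressor. That choice is an assumption made for illustration. The helper bp_pvalue and the simulated variables are hypothetical and are not part of our original analysis.

```r
set.seed(11)
n <- 500
x <- runif(n, 1, 10)
y_const <- 5 + 2 * x + rnorm(n, sd = 2)        # constant error variance
y_fan   <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)  # error variance grows with x

# Hypothetical helper: regress the squared residuals on the fitted values;
# n * R-squared of that auxiliary fit is compared with a chi-square(1)
bp_pvalue <- function(fit) {
  e2   <- residuals(fit)^2
  aux  <- lm(e2 ~ fitted(fit))
  stat <- length(e2) * summary(aux)$r.squared
  pchisq(stat, df = 1, lower.tail = FALSE)
}

p_const <- bp_pvalue(lm(y_const ~ x))
p_fan   <- bp_pvalue(lm(y_fan ~ x))
print(c(constant = p_const, fanning = p_fan))
```

A very small p-value, as expected for the fanning series, indicates that the residual spread changes with the fitted values.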

Outliers, High Leverage Points and Influential Values

Outliers, High Leverage Points and Influential Values can be identified by inspecting Cook’s Distance and the Residuals vs Leverage Plot. An Outlier is a data point that has an extreme outcome variable value. A High Leverage Point is a data point that has an extreme predictor variable value. An Influential Value is a data point whose inclusion or exclusion noticeably alters the regression analysis; it typically combines high leverage with a large residual.

Not all extreme data points are influential in regression analysis. Cook’s Distance is a metric used to determine the influence of a value. It defines influence as a combination of leverage and residual size.

# Plotting the Cook's Distance Plot and the Residuals vs Leverage Plot
par(mfrow = c(1,2))
plot(finMod, 4, pch = 19, cex = 0.5)
plot(finMod, 5, pch = 19, cex = 0.5)

The Residuals vs Leverage Plot shown above does not reveal any influential points. Characteristics to support this include:

  1. All data points are well inside the Cook’s Distance Lines, which are represented by the red dashed line in our Residuals vs Leverage Plot.

However, the plots above do show some outliers. These will need to be examined individually to determine whether they reflect genuinely unusual visitors or data entry errors.
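
The quantities behind these two plots can also be extracted directly, which helps when individual rows need to be reviewed. The sketch below uses simulated data with one deliberately planted extreme point. All values are hypothetical and are not part of our original analysis.

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Plant one hypothetical point with an extreme x value and a large residual
x[n] <- 6
y[n] <- -10
fit <- lm(y ~ x)

cd  <- cooks.distance(fit)  # influence: combines leverage and residual size
lev <- hatvalues(fit)       # leverage: how extreme the predictor values are
res <- rstandard(fit)       # standardized residuals

# The three most influential rows, with the ingredients of their influence
top <- order(cd, decreasing = TRUE)[1:3]
print(data.frame(obs = top, cooks = cd[top], leverage = lev[top], std_resid = res[top]))

most_influential <- which.max(cd)
```

The planted row stands out on all three measures. The dashed contours drawn by plot.lm sit at Cook's Distance values of 0.5 and 1, so rows near or beyond those values are the ones to review first.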

Prediction with our Multiple Linear Regression Model

# Prediction with our Multiple Linear Regression Model
Prediction <- predict(modFit, testData)
qplot(totshopping.rep, Prediction, colour = month, data = testData)

Summary of our Prediction

# Summary of our Prediction
summary(Prediction)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    19.0   113.5   224.0   326.1   384.6  5188.4
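
The summary above describes the distribution of the predicted values, not how close they are to the observed spend. A sketch of how the held-out error could be measured follows, using simulated data and base R. The variables are hypothetical, and R-squared is taken here as the squared correlation between observed and predicted values, which is an assumption about the definition. With caret loaded, postResample(pred, obs) reports the same three metrics.

```r
set.seed(99)
n <- 400
toy <- data.frame(x = rnorm(n))
toy$y <- 10 + 4 * toy$x + rnorm(n, sd = 2)

# 50/50 split, fit on one half and predict the other
train_idx <- sample(n, n / 2)
fit  <- lm(y ~ x, data = toy[train_idx, ])
held <- toy[-train_idx, ]
pred <- predict(fit, newdata = held)

# Error metrics on the held-out half
rmse <- sqrt(mean((held$y - pred)^2))
mae  <- mean(abs(held$y - pred))
r2   <- cor(held$y, pred)^2

print(round(c(RMSE = rmse, Rsquared = r2, MAE = mae), 3))
```

Comparing these held-out figures with the resampled figures from training shows whether the model generalizes or has overfit the training half.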