The objective of this case study is to understand what drives the total credit card spend (Primary Card + Secondary Card).

The dataset has been provided by Analytixlabs and is for academic purposes only.

Life Cycle of the Project:

Importing the required libraries

dplyr, magrittr, readxl, ggplot2, Hmisc, MASS
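
The packages above are the ones named in the report; later chunks also call corrplot() for the heatmap, fa()/fa.sort() from psych, and vif() from car, so a fuller preamble would look roughly like this (the extra packages are inferred from those calls):

# Packages named in the report
library(dplyr)
library(magrittr)
library(readxl)
library(ggplot2)
library(Hmisc)     # impute()
library(MASS)      # stepAIC()

# Additional packages used by later chunks (inferred from the function calls)
library(corrplot)  # corrplot()
library(psych)     # fa(), fa.sort()
library(car)       # vif()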

Reading the data

data = read_xlsx("D:\\Business Analytics\\R\\CLASS 13 - 14\\Linear Regression Case\\Linear Regression Case Study.xlsx")

Pre-Modelling (Exploratory Data Analysis)

Number of rows and columns

dim(data)
## [1] 5000  132

In Data Analysis we will analyse the following:

  • Missing values
  • All the numerical and categorical variables
  • Cardinality of the categorical variables
  • Outliers
  • Relationship between the independent and dependent features

Features with Missing Values

## townsize - 6 , lncreddebt - 4942 , lnothdebt - 4973 , commutetime - 42 , lnlongten - 4438 , cardten - 698 ,

As we can observe, lncreddebt and lnothdebt have a lot of missing values, so we have to impute them with something meaningful.
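
The counts above can be reproduced with a short helper; a sketch (feature_with_na is the vector of affected column names used later in the imputation step):

# Count missing values per column and keep only the features that have any
na_counts = colSums(is.na(data))
na_counts[na_counts > 0]

# Names of features with missing values (used later for imputation)
feature_with_na = names(na_counts)[na_counts > 0]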

Number of Categorical Features

## [1] 92

Number of Continuous Features

## [1] 24
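
The code producing these counts is not shown. One plausible reconstruction is to classify columns by their number of distinct values; the threshold of 10 below is an assumption, and the exact rule used in the report may differ (the two counts do not cover all 132 columns):

# Assumed split: numeric columns with many distinct values are treated as continuous,
# the remaining columns as categorical (the threshold of 10 is an assumption)
cont_num_var    = names(data)[sapply(data, function(x) is.numeric(x) && length(unique(x)) > 10)]
categorical_var = setdiff(names(data), cont_num_var)

length(categorical_var)
length(cont_num_var)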

Exploring the Continuous Features
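
The summary below is produced with a user-defined function my_csf, which is not shown in the report. A minimal sketch consistent with the output columns (nmiss, n, outlier flags, mean, std, cv, percentiles and the upper/lower caps) is given here; the exact flag thresholds are assumptions:

# Sketch of a descriptive-statistics function similar to my_csf (assumed implementation)
my_csf = function(x) {
  nmiss = sum(is.na(x))
  x = x[!is.na(x)]
  m = mean(x); s = sd(x)
  p = quantile(x, c(0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99))
  uc2 = m + 3 * s; lc2 = m - 3 * s                               # mean +/- 3 sd bounds
  c(nmiss = nmiss, n = length(x),
    outlier_flag1 = as.numeric(max(x) > p[7] | min(x) < p[1]),   # beyond the 1st/99th percentiles
    outlier_flag2 = as.numeric(max(x) > uc2 | min(x) < lc2),     # beyond the 3-sd bounds
    mean = m, std = s, cv = s / m,
    min = min(x), p1 = p[1], p5 = p[2], q1 = p[3], q2 = p[4],
    q3 = p[5], p95 = p[6], p99 = p[7], max = max(x),
    uc1 = p[7], lc1 = p[1], uc2 = uc2, lc2 = lc2)
}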

#-- Applying the summary function to each continuous feature
options(scipen = 999)  # avoid scientific notation in the printed summary
test = t(apply(data[cont_num_var], 2, my_csf))

Below is some of the important summary information about the continuous features:

##            nmiss    n outlier_flag1.99% outlier_flag2       mean        std
## age            0 5000                 0             0 47.0256000 17.7703377
## employ         0 5000                 1             1  9.7304000  9.6909287
## lninc          0 5000                 1             1  3.6999094  0.7470719
## debtinc        0 5000                 1             1  9.9541600  6.3997833
## lncreddebt     1 4999                 1             1 -0.1304535  1.2730584
## lnothdebt      1 4999                 1             1  0.6969153  1.1285782
##                    cv       min     p1.1%     p5.5%      q1.25%      q2.50%
## age         0.3778865 18.000000 18.000000 20.000000 31.00000000 47.00000000
## employ      0.9959435  0.000000  0.000000  0.000000  2.00000000  7.00000000
## lninc       0.2019163  2.197225  2.197225  2.564949  3.17805383  3.63758616
## debtinc     0.6429255  0.000000  0.700000  1.900000  5.10000000  8.80000000
## lncreddebt -9.7587123 -6.597334 -3.401690 -2.291604 -0.95268511 -0.07610597
## lnothdebt   1.6193909 -4.092107 -2.168241 -1.243483 -0.01898651  0.74153726
##                q3.75%   p95.95%   p99.99%       max   uc1.99%    lc1.1%
## age        62.0000000 76.000000 79.000000 79.000000 79.000000 18.000000
## employ     15.0000000 31.000000 39.000000 52.000000 39.000000  0.000000
## lninc       4.2046926  4.990433  5.605839  6.978214  5.605839  2.197225
## debtinc    13.6000000 22.200000 29.200000 43.100000 29.200000  0.700000
## lncreddebt  0.7246652  1.852297  2.658910  4.692014  2.658910 -3.401690
## lnothdebt   1.4620533  2.469586  3.180802  4.952011  3.180802 -2.168241
##                   uc2        lc2
## age        100.336613  -6.285413
## employ      38.803186 -19.342386
## lninc        5.941125   1.458694
## debtinc     29.153510  -9.245190
## lncreddebt   3.688722  -3.949629
## lnothdebt    4.082650  -2.688819

Distribution of Spend (Dependent)

ggplot(data = data,aes(total_spend))+geom_histogram(fill = "lightblue",col="black",bins = 25)+ ylab("Count")+xlab("Spend")+ggtitle("Histogram of Spend")

We can observe from the above histogram that the distribution of Spend (target) is slightly skewed, so we will transform it to make it approximately normal.

Distribution of the Continuous Features

par(mfrow=c(6,4))
col_names = colnames(data[cont_num_var])
for(feature in cont_num_var){
  hist(data[[feature]], main = paste("Histogram of",feature),xlab = feature, ylab = "count", col = "lightblue",border = "black") 
}

From the above histograms we can conclude that the majority of the features are skewed, so we have to transform them to make them approximately normally distributed.

Outliers in Spend (Dependent)

ggplot(data = data,aes(total_spend))+geom_boxplot(fill = "lightblue",col="black")+coord_flip()+ ylab("Count")+xlab("Spend")+ggtitle("Boxplot of Spend")

From the above boxplot we can conclude that there are outliers in Spend.

Outliers in Continuous Features

par(mfrow=c(6,4))
col_names = colnames(data[cont_num_var])
for(feature in cont_num_var){
par(bg="white")
boxplot(data[feature],main=paste("Box Plot of",feature),ylab=feature, col = "lightblue")
}

From the above boxplots we can conclude that majority of the continuous features have outliers.

Scatter plot

par(mfrow=c(6,4))
col_names = colnames(data[cont_num_var])
for(feature in cont_num_var){
par(bg="white")
plot(data[[feature]],data[["total_spend"]],main=paste("Scatter Plot of",feature),ylab="total spend", xlab=feature, col = "darkblue") 
}

From the above scatter plots we can conclude that several of the continuous features show some relationship with spend.

Feature Engineering

Transformation of Target Variable

# The target (Y) should be approximately normal, so apply a log transformation
data$total_spend = log(data$total_spend)

Missing Values

Missing data (or missing values) are data values that are not stored for a variable in the observation of interest. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data.

#-- Treating missing values: impute each affected feature with its median (Hmisc::impute)
data[feature_with_na] = data.frame(apply(data[feature_with_na], 2, function(x) impute(x, median)))

Outliers

An outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error. An outlier can cause serious problems in statistical analysis.

# Treating outliers: cap each continuous feature at its 5th and 95th percentiles
otl_udf = function(x) {
  
  quantiles = quantile(x, c(0.05, 0.95), na.rm = T)
  
  x[x < quantiles[1]] = quantiles[1]  # floor at the 5th percentile
  x[x > quantiles[2]] = quantiles[2]  # cap at the 95th percentile
  x
}
data[cont_num_var] = apply(data[cont_num_var], 2, otl_udf)

We have capped the outliers at the 5th and 95th percentiles.

Correlation Heatmap

Correlation is used to test relationships between continuous variables. In other words, it’s a measure of how things are related. The study of how variables are correlated is called correlation analysis.

cor_mat=round(cor(data[cont_num_var]),2)
col <- colorRampPalette(c("darkred","white","darkgreen"))
corrplot(cor_mat,method = "color",col = col(200),tl.cex = 1,tl.col="black", addCoef.col = "black", number.cex = 0.6)

From the above correlation heatmap we can observe that a few features have almost no correlation with total spend, so we will remove them. We will also remove features that show high multicollinearity.

We will remove those features that have no correlation with the target.

data = subset(data, select = -c(age,debtinc,hourstv,cardmon,cardten,commutetime,card2tenure,tollmon,lnlongten,tenure))

Feature Reduction (Factor Analysis)

Factor analysis is a technique used to reduce a large number of variables into a smaller number of factors. The technique extracts the maximum common variance from all variables and puts it into a common score. Factor analysis is part of the general linear model (GLM) and makes several assumptions:

  • There is a linear relationship
  • There is no multicollinearity
  • The relevant variables are included in the analysis
  • There is a true correlation between variables and factors

In factor analysis we will group variables into different factors and then select a few variables from each factor, in proportion to the number of variables in each factor.

Here we have calculated the correlation matrix and then computed its eigenvalues. These are the first few eigenvalues:

cormat = cor(data)

# Deciding the number of factors using a scree plot
head(eigen(cormat)$values,12)
##  [1] 9.824568 9.239085 4.903071 4.320629 3.612203 3.460933 2.037795 2.004863
##  [9] 1.875219 1.705524 1.527542 1.487476

A factor is a group of variables that are related to each other. In order to decide how many factors to create, we use the concept of eigenvalues.

Eigenvalues are also called characteristic roots. An eigenvalue shows the variance explained by a particular factor out of the total variance. From the communality column we can see how much variance is explained by the first factor out of the total variance. For example, if our first factor explains 68% of the total variance, then the remaining 32% is explained by the other factors.

eigen_values = mutate(data.frame(eigen(cormat)$values),
                cum_sum_eigen = cumsum(eigen.cormat..values),
                pct_var = eigen.cormat..values/sum(eigen.cormat..values),
                cum_pct_var = cum_sum_eigen/sum(eigen.cormat..values))

Now we have calculated the eigenvalues, the cumulative sum of the eigenvalues, the percent variation and the cumulative percent variation. The purpose of calculating these is to understand how many factors we should consider.

head(eigen_values)
##   eigen.cormat..values cum_sum_eigen    pct_var cum_pct_var
## 1             9.824568      9.824568 0.09268460   0.0926846
## 2             9.239085     19.063653 0.08716118   0.1798458
## 3             4.903071     23.966724 0.04625539   0.2261012
## 4             4.320629     28.287353 0.04076065   0.2668618
## 5             3.612203     31.899556 0.03407739   0.3009392
## 6             3.460933     35.360490 0.03265031   0.3335895
plot(eigen_values$pct_var,type="b")

Here we have plotted the percent variation for each eigenvalue to have a look at how much variation is explained at each number of factors.

factor_analysis = fa(r=cormat,32,rotate = "varimax" ,fm = "pca")
## factor method not specified correctly, minimum residual (unweighted least squares  used
## In smc, smcs < 0 were set to .0
## In smc, smcs < 0 were set to .0
## In smc, smcs < 0 were set to .0
## In factor.stats, the correlation matrix is singular, an approximation is used
## In factor.scores, the correlation matrix is singular, an approximation is used
## I was unable to calculate the factor score weights, factor loadings used instead
fa_sort = fa.sort(factor_analysis)

loadings = data.frame(fa_sort$loadings[1:ncol(data),])

We have sorted the loadings at each factor level.

#write.csv(loadings,"loadings.csv",row.names = T)

Numerical and Categorical Data

# separating categorical and numerical data
numeric_data = data[cont_num_var]
categorical_data = data[categorical_var]

These are the variables selected from the different factors.

vars = c('cardfee','gender','commutecarpool','polparty','homeown','carbuy','pets_dogs','region','commutepublic','active','tollten','reason','pets_cats','jobcat','card','carditems','ed','wiremon','pets_freshfish','ownvcr','owncd','cars','carcatvalue','reside','spoused','lnothdebt','carvalue','lninc','callid','confer','tollfree','response_01','ownfax','pager','voice','equipten','ownpc','equipmon','internet','retire','callcard','multline','news','employ','address','lnlongmon','cardtenure','total_spend')
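
The later steps use only these selected variables, so the data is presumably restricted to them at this point (a one-line sketch of that assumed step):

# Keep only the variables selected from the factors; total_spend remains the last column
data = data[vars]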

Feature Scaling

Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

total_spend = data$total_spend
data = cbind(scale(data.frame(data[, 1:(ncol(data) - 1)]), center = TRUE, scale = TRUE), data.frame(total_spend))

Splitting the data into train and test

set.seed(129)

train_ind = sample(1:nrow(data),size = floor(0.70*nrow(data)))

train = data[train_ind,]
test = data[-train_ind,]

Creating model with all features

fit = lm(total_spend~., data = train)
summary(fit)

Using stepAIC to select significant features

step = stepAIC(fit,direction = "both")

We have used stepAIC to select the significant variables.

fit2 =lm(total_spend ~ gender + commutecarpool + region + jobcat + card + 
    carditems + carcatvalue + carvalue + lninc + equipten + retire, 
    data = train)

Using VIF to Check for Multicollinearity

A variance inflation factor detects multicollinearity in regression analysis. Multicollinearity occurs when there is correlation between the predictors in a model; its presence can adversely affect the regression results. The VIF estimates how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

vif(fit2)
##         gender commutecarpool         region         jobcat           card 
##       1.004413       1.004240       1.002355       1.057805       1.004606 
##      carditems    carcatvalue       carvalue          lninc       equipten 
##       1.008366       5.055332       9.575863       3.855157       1.032694 
##         retire 
##       1.292207

Here we have used VIF to check whether there is any multicollinearity among the selected variables. carcatvalue and carvalue both show elevated VIFs, so carcatvalue is dropped in the final model below.

Model Building

fit_f =lm(total_spend ~ gender + commutecarpool + region + jobcat + card + 
    carditems  + carvalue + lninc + equipten + retire, 
    data = train)
summary(fit_f)
## 
## Call:
## lm(formula = total_spend ~ gender + commutecarpool + region + 
##     jobcat + card + carditems + carvalue + lninc + equipten + 
##     retire, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39853 -0.30238  0.00112  0.30678  1.45220 
## 
## Coefficients:
##                 Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     6.007986   0.007356 816.740 < 0.0000000000000002 ***
## gender         -0.035129   0.007369  -4.767           0.00000195 ***
## commutecarpool  0.009443   0.007391   1.278               0.2015    
## region          0.006624   0.007334   0.903               0.3664    
## jobcat         -0.011115   0.007531  -1.476               0.1401    
## card           -0.128621   0.007372 -17.447 < 0.0000000000000002 ***
## carditems       0.255320   0.007301  34.969 < 0.0000000000000002 ***
## carvalue        0.007269   0.012456   0.584               0.5595    
## lninc           0.196420   0.013184  14.898 < 0.0000000000000002 ***
## equipten        0.002557   0.007575   0.338               0.7357    
## retire         -0.020844   0.008288  -2.515               0.0119 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.435 on 3489 degrees of freedom
## Multiple R-squared:  0.4231, Adjusted R-squared:  0.4214 
## F-statistic: 255.9 on 10 and 3489 DF,  p-value: < 0.00000000000000022

Now that we have built the final model, we will diagnose it by creating diagnostic plots.

layout(matrix(c(1,2,3,4),2,2))
plot(fit_f,col = "darkblue")

Diagnostic Plots

  • The residuals do not follow any pattern
  • The Q-Q plot is approximately linear between the theoretical normal quantiles and the residual quantiles
  • There are a few influential observations

hist(fit_f$residuals, col = "lightblue", main = "Histogram of Residuals", xlab = "Residuals", border = "black")

The Residuals should be Normally Distributed

mean(fit_f$residuals) 
## [1] 0.00000000000000001632087

The Mean of Residuals should be close to Zero

Scoring using the predict function

# Training Data
t1 = cbind(train, pred_spent = exp(predict(fit_f,train)))

t1$total_spend = exp(t1$total_spend)

t1 = transform(t1,APE = abs(pred_spent - total_spend)/total_spend ) 
# Test Data
test$total_spend = exp(test$total_spend)

t2 = cbind(test, pred_spent = exp(predict(fit_f,test)))

t2 = transform(t2,APE = abs(pred_spent - total_spend)/total_spend)

Creating Decile Analysis (Training)
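
The decile tables were presumably built along these lines with dplyr (a sketch; applying the same code to t2 gives the test table further below):

# Sketch: rank scored observations into 10 deciles by predicted spend and compare averages
decile_train = t1 %>%
  mutate(Decile = factor(ntile(pred_spent, 10))) %>%
  group_by(Decile) %>%
  summarise(Count = n(),
            Avg_pred_spent = mean(pred_spent),
            Avg_actual_spent = mean(total_spend))
decile_train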

## # A tibble: 10 x 4
##    Decile Count Avg_pred_spent Avg_actual_spent
##    <fct>  <int>          <dbl>            <dbl>
##  1 1        350           213.             226.
##  2 2        350           279.             300.
##  3 3        350           319.             352.
##  4 4        350           356.             384.
##  5 5        350           392.             452.
##  6 6        350           427.             505.
##  7 7        350           469.             535.
##  8 8        350           521.             574.
##  9 9        350           598.             632.
## 10 10       350           797.             814.

Creating Decile Analysis (Testing)

## # A tibble: 10 x 4
##    Decile Count Avg_pred_spent Avg_actual_spent
##    <fct>  <int>          <dbl>            <dbl>
##  1 1        150           216.             228.
##  2 2        150           274.             300.
##  3 3        150           314.             351.
##  4 4        150           351.             400.
##  5 5        150           384.             424.
##  6 6        150           423.             511.
##  7 7        150           465.             512.
##  8 8        150           517.             596.
##  9 9        150           599.             698.
## 10 10       150           779.             816.

Metrics
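
The metric values reported below can be reproduced from the scored data frames; a sketch of a helper that prints them in the same format (assuming t1 and t2 from the scoring step above):

# Sketch: compute MAPE, MSE, RMSE and MAE for a scored data frame
report_metrics = function(df) {
  err = df$pred_spent - df$total_spend
  print(paste("Mean Absolute Percentage Error -", round(mean(df$APE), 3)))
  print(paste("Mean Square Error -", round(mean(err^2), 3)))
  print(paste("Root Mean Square Error -", round(sqrt(mean(err^2)), 3)))
  print(paste("Mean Absolute Error -", round(mean(abs(err)), 3)))
}

report_metrics(t1)  # training data
report_metrics(t2)  # test data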

Training Data

## [1] "Mean Absolute Percentage Error - 0.373"
## [1] "Mean Square Error - 46959.225"
## [1] "Root Mean Square Error - 216.701"
## [1] "Mean Absolute Error - 160.077"

Test Data

## [1] "Mean Absolute Percentage Error - 0.378"
## [1] "Mean Square Error - 50289.914"
## [1] "Root Mean Square Error - 224.254"
## [1] "Mean Absolute Error - 164.481"