The objective of this case study is to understand what drives the total credit card spend (Primary Card + Secondary Card).
The dataset has been provided by Analytixlabs and is for academic purposes only.
dplyr, magrittr, readxl, ggplot2, Hmisc, MASS
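The packages above are loaded before reading the data. corrplot, psych, and car are added here as assumptions, since corrplot(), fa(), and vif() are called later in the analysis.
library(dplyr)     # data manipulation
library(magrittr)  # pipe operator
library(readxl)    # read_xlsx() to load the Excel file
library(ggplot2)   # plots
library(Hmisc)     # impute() for missing-value treatment
library(MASS)      # stepAIC() for stepwise variable selection
library(corrplot)  # corrplot() heatmap (assumed; used later)
library(psych)     # fa() factor analysis (assumed; used later)
library(car)       # vif() (assumed; used later)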
data = read_xlsx("D:\\Business Analytics\\R\\CLASS 13 - 14\\Linear Regression Case\\Linear Regression Case Study.xlsx")
dim(data)
## [1] 5000 132
## townsize - 6 , lncreddebt - 4942 , lnothdebt - 4973 , commutetime - 42 , lnlongten - 4438 , cardten - 698 ,
As we can observe, there are a lot of missing values in lncreddebt and lnothdebt, so we will have to impute them with something meaningful.
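The code that produced the missing-value counts above is not shown; a minimal sketch of one way to obtain them:
# Count missing values in each column and keep only the columns that have any
na_counts = sapply(data, function(x) sum(is.na(x)))
na_counts[na_counts > 0]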
## [1] 92
## [1] 24
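cont_num_var and categorical_var are used throughout the rest of the analysis but their construction is not shown; the two counts above presumably report how many variables fall into each group. A minimal sketch, where the unique-value threshold is an assumption:
# Assumption: numeric columns with more than 10 distinct values are treated as
# continuous; the remaining columns are treated as categorical
is_cont = sapply(data, function(x) is.numeric(x) && length(unique(x)) > 10)
cont_num_var = names(data)[is_cont]
categorical_var = setdiff(names(data), cont_num_var)
length(cont_num_var)
length(categorical_var)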
#--Applying function
options(scipen = 999)
test = t(apply(data[cont_num_var],2,my_csf))
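The definition of my_csf is not shown; below is a sketch reconstructing it from the summary columns printed in the output. The 1%/99% and mean ± 3·sd capping limits are assumptions based on the column names.
my_csf = function(x) {
  nmiss = sum(is.na(x))
  x = x[!is.na(x)]
  m = mean(x)
  s = sd(x)
  p = quantile(x, c(0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99))
  uc1 = p["99%"]; lc1 = p["1%"]        # percentile-based capping limits
  uc2 = m + 3 * s; lc2 = m - 3 * s     # mean +/- 3*sd capping limits
  c(nmiss = nmiss, n = length(x),
    outlier_flag1 = as.numeric(max(x) > uc1 | min(x) < lc1),
    outlier_flag2 = as.numeric(max(x) > uc2 | min(x) < lc2),
    mean = m, std = s, cv = s / m,
    min = min(x), p1 = p["1%"], p5 = p["5%"], q1 = p["25%"], q2 = p["50%"],
    q3 = p["75%"], p95 = p["95%"], p99 = p["99%"], max = max(x),
    uc1 = uc1, lc1 = lc1, uc2 = uc2, lc2 = lc2)
}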
Below is some important summary information about the continuous features:
## nmiss n outlier_flag1.99% outlier_flag2 mean std
## age 0 5000 0 0 47.0256000 17.7703377
## employ 0 5000 1 1 9.7304000 9.6909287
## lninc 0 5000 1 1 3.6999094 0.7470719
## debtinc 0 5000 1 1 9.9541600 6.3997833
## lncreddebt 1 4999 1 1 -0.1304535 1.2730584
## lnothdebt 1 4999 1 1 0.6969153 1.1285782
## cv min p1.1% p5.5% q1.25% q2.50%
## age 0.3778865 18.000000 18.000000 20.000000 31.00000000 47.00000000
## employ 0.9959435 0.000000 0.000000 0.000000 2.00000000 7.00000000
## lninc 0.2019163 2.197225 2.197225 2.564949 3.17805383 3.63758616
## debtinc 0.6429255 0.000000 0.700000 1.900000 5.10000000 8.80000000
## lncreddebt -9.7587123 -6.597334 -3.401690 -2.291604 -0.95268511 -0.07610597
## lnothdebt 1.6193909 -4.092107 -2.168241 -1.243483 -0.01898651 0.74153726
## q3.75% p95.95% p99.99% max uc1.99% lc1.1%
## age 62.0000000 76.000000 79.000000 79.000000 79.000000 18.000000
## employ 15.0000000 31.000000 39.000000 52.000000 39.000000 0.000000
## lninc 4.2046926 4.990433 5.605839 6.978214 5.605839 2.197225
## debtinc 13.6000000 22.200000 29.200000 43.100000 29.200000 0.700000
## lncreddebt 0.7246652 1.852297 2.658910 4.692014 2.658910 -3.401690
## lnothdebt 1.4620533 2.469586 3.180802 4.952011 3.180802 -2.168241
## uc2 lc2
## age 100.336613 -6.285413
## employ 38.803186 -19.342386
## lninc 5.941125 1.458694
## debtinc 29.153510 -9.245190
## lncreddebt 3.688722 -3.949629
## lnothdebt 4.082650 -2.688819
ggplot(data = data,aes(total_spend))+geom_histogram(fill = "lightblue",col="black",bins = 25)+ ylab("Count")+xlab("Spend")+ggtitle("Histogram of Spend")
We can observe from the above histogram that the distribution of Spend (the target) is slightly skewed, so we will transform it to make it approximately normal.
par(mfrow=c(6,4))
col_names = colnames(data[cont_num_var])
for(feature in cont_num_var){
hist(data[[feature]], main = paste("Histogram of",feature),xlab = feature, ylab = "count", col = "lightblue",border = "black")
}
From the above histograms we can conclude that the majority of the features are skewed, so we will have to transform them to make them more normally distributed.
ggplot(data = data,aes(total_spend))+geom_boxplot(fill = "lightblue",col="black")+coord_flip()+ xlab("Spend")+ylab("")+ggtitle("Boxplot of Spend")
From the above boxplot we can conclude that there are outliers in Spend.
par(mfrow=c(6,4))
col_names = colnames(data[cont_num_var])
for(feature in cont_num_var){
par(bg="white")
boxplot(data[feature],main=paste("Box Plot of",feature),ylab=feature, col = "lightblue")
}
From the above boxplots we can conclude that the majority of the continuous features have outliers.
par(mfrow=c(6,4))
col_names = colnames(data[cont_num_var])
for(feature in cont_num_var){
par(bg="white")
plot(data[[feature]],data[["total_spend"]],main=paste("Scatter Plot of",feature),ylab="total spend", xlab=feature, col = "darkblue")
}
From the above scatter plots we can conclude that there is some relationship between several of the continuous features and spend.
# Y should be normal
data$total_spend = log(data$total_spend)
Missing data (or missing values) are data values that are not stored for a variable in the observation of interest. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data.
#--Treating Missing Values
data[feature_with_na] = data.frame(apply(data[feature_with_na],2, function(x) impute(x,median)))
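feature_with_na is not defined in the code shown; it is assumed to be the vector of column names that contain missing values, for example:
# Assumption: impute only the columns that actually contain missing values
feature_with_na = names(data)[sapply(data, anyNA)]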
An outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error. An outlier can cause serious problems in statistical analysis.
# Treating Outliers
otl_udf = function(x) {
quantiles = quantile(x, c(0.05,0.95),na.rm = T)
x[x < quantiles[1]] = quantiles[1]
x[x > quantiles[2]] = quantiles[2]
x
}
data[cont_num_var] = apply(data[cont_num_var],2,otl_udf)
We have capped the outliers at the 5th and 95th percentiles.
Correlation is used to test relationships between continuous variables. In other words, it’s a measure of how things are related. The study of how variables are correlated is called correlation analysis.
cor_mat=round(cor(data[cont_num_var]),2)
col <- colorRampPalette(c("darkred","white","darkgreen"))
corrplot(cor_mat,method = "color",col = col(200),tl.cex = 1,tl.col="black", addCoef.col = "black", number.cex = 0.6)
From the above correlation heatmap we can observe that a few features have almost no correlation with total spend, so we will remove them. We will also remove features that exhibit high multicollinearity.
data = subset(data ,select = -c(age,debtinc,hourstv,cardmon,cardten,commutetime,card2tenure,tollmon,lnlongten,tenure))
Factor analysis is a technique used to reduce a large number of variables into a smaller number of factors. It extracts the maximum common variance from all variables and puts it into a common score. Factor analysis is part of the general linear model (GLM) family and relies on several assumptions:
- There is a linear relationship.
- There is no multicollinearity.
- The relevant variables are included in the analysis.
- There is a true correlation between variables and factors.
In factor analysis we will group variables into different factors and then select a few variables from each factor, roughly in proportion to the number of variables in that factor.
Here we have calculated the correlation matrix and then found its eigenvalues. These are the first few eigenvalues:
cormat = cor(data)
# Deciding number of factor using scree plot
head(eigen(cormat)$values,12)
## [1] 9.824568 9.239085 4.903071 4.320629 3.612203 3.460933 2.037795 2.004863
## [9] 1.875219 1.705524 1.527542 1.487476
A factor is a group of variables that are related to each other. In order to decide the number of factors to create, we use the concept of eigenvalues.
Eigenvalues are also called characteristic roots. An eigenvalue shows the variance explained by that particular factor out of the total variance. From the cumulative variance column, we can see how much of the total variance is explained by the first factors. For example, if our first factor explains 68% of the total variance, this means that the remaining 32% of the variance will be explained by the other factors.
eigen_values = mutate(data.frame(eigen(cormat)$values),
cum_sum_eigen = cumsum(eigen.cormat..values),
pct_var = eigen.cormat..values/sum(eigen.cormat..values),
cum_pct_var = cum_sum_eigen/sum(eigen.cormat..values))
Now we have calculated the eigenvalues, the cumulative sum of the eigenvalues, the percentage of variance, and the cumulative percentage of variance. The purpose of calculating these is to understand how many factors we should consider.
head(eigen_values)
## eigen.cormat..values cum_sum_eigen pct_var cum_pct_var
## 1 9.824568 9.824568 0.09268460 0.0926846
## 2 9.239085 19.063653 0.08716118 0.1798458
## 3 4.903071 23.966724 0.04625539 0.2261012
## 4 4.320629 28.287353 0.04076065 0.2668618
## 5 3.612203 31.899556 0.03407739 0.3009392
## 6 3.460933 35.360490 0.03265031 0.3335895
plot(eigen_values$pct_var,type="b")
Here we have plotted the proportion of variance explained by each factor (a scree plot) to have a look at how the explained variation drops as the number of factors increases.
factor_analysis = fa(r=cormat,32,rotate = "varimax" ,fm = "pca")
## factor method not specified correctly, minimum residual (unweighted least squares used
## In smc, smcs < 0 were set to .0
## In smc, smcs < 0 were set to .0
## In smc, smcs < 0 were set to .0
## In factor.stats, the correlation matrix is singular, an approximation is used
## In factor.scores, the correlation matrix is singular, an approximation is used
## I was unable to calculate the factor score weights, factor loadings used instead
fa_sort = fa.sort(factor_analysis)
loadings = data.frame(fa_sort$loadings[1:ncol(data),])
We have sorted the loadings at each factor level.
#write.csv(loadings,"loadings.csv",row.names = T)
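How the representative variables were shortlisted from each factor is not shown (the exported loadings were presumably inspected); one possible sketch, assuming loadings is the sorted loading matrix created above, is to list the variables with the largest absolute loading on each factor:
# For every factor, show the three variables with the largest absolute loading
top_vars_per_factor = apply(abs(loadings), 2, function(x)
  names(sort(x, decreasing = TRUE))[1:3])
top_vars_per_factor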
# separating categorical and numerical data
numeric_data = data[cont_num_var]
categorical_data = data[categorical_var]
These are the features selected from the different factors.
vars = c('cardfee','gender','commutecarpool','polparty','homeown','carbuy','pets_dogs','region','commutepublic','active','tollten','reason','pets_cats','jobcat','card','carditems','ed','wiremon','pets_freshfish','ownvcr','owncd','cars','carcatvalue','reside','spoused','lnothdebt','carvalue','lninc','callid','confer','tollfree','response_01','ownfax','pager','voice','equipten','ownpc','equipmon','internet','retire','callcard','multline','news','employ','address','lnlongmon','cardtenure','total_spend')
Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
total_spend = data$total_spend
data = cbind(scale(data.frame(data[,1:(ncol(data)-1)]),center = TRUE,scale = TRUE),data.frame(total_spend))
set.seed(129)
train_ind = sample(1:nrow(data),size = floor(0.70*nrow(data)))
train = data[train_ind,]
test = data[-train_ind,]
fit = lm(total_spend~., data = train)
summary(fit)
step = stepAIC(fit,direction = "both")
We have used stepAIC to select the significant variables.
fit2 =lm(total_spend ~ gender + commutecarpool + region + jobcat + card +
carditems + carcatvalue + carvalue + lninc + equipten + retire,
data = train)
A variance inflation factor detects multicollinearity in regression analysis. Multicollinearity occurs when there is correlation between predictors in a model; its presence can adversely affect the regression results. The VIF estimates how much the variance of a regression coefficient is inflated due to multicollinearity in the model.
vif(fit2)
## gender commutecarpool region jobcat card
## 1.004413 1.004240 1.002355 1.057805 1.004606
## carditems carcatvalue carvalue lninc equipten
## 1.008366 5.055332 9.575863 3.855157 1.032694
## retire
## 1.292207
Here we have used VIF to check whether there is any multicollinearity among the selected variables. Since carcatvalue and carvalue show elevated VIF values (they are strongly related to each other), we drop carcatvalue and refit the model.
fit_f =lm(total_spend ~ gender + commutecarpool + region + jobcat + card +
carditems + carvalue + lninc + equipten + retire,
data = train)
summary(fit_f)
##
## Call:
## lm(formula = total_spend ~ gender + commutecarpool + region +
## jobcat + card + carditems + carvalue + lninc + equipten +
## retire, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39853 -0.30238 0.00112 0.30678 1.45220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.007986 0.007356 816.740 < 0.0000000000000002 ***
## gender -0.035129 0.007369 -4.767 0.00000195 ***
## commutecarpool 0.009443 0.007391 1.278 0.2015
## region 0.006624 0.007334 0.903 0.3664
## jobcat -0.011115 0.007531 -1.476 0.1401
## card -0.128621 0.007372 -17.447 < 0.0000000000000002 ***
## carditems 0.255320 0.007301 34.969 < 0.0000000000000002 ***
## carvalue 0.007269 0.012456 0.584 0.5595
## lninc 0.196420 0.013184 14.898 < 0.0000000000000002 ***
## equipten 0.002557 0.007575 0.338 0.7357
## retire -0.020844 0.008288 -2.515 0.0119 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.435 on 3489 degrees of freedom
## Multiple R-squared: 0.4231, Adjusted R-squared: 0.4214
## F-statistic: 255.9 on 10 and 3489 DF, p-value: < 0.00000000000000022
Now that we have built the final model, we will diagnose it by creating diagnostic plots.
layout(matrix(c(1,2,3,4),2,2))
plot(fit_f,col = "darkblue")
hist(fit_f$residuals,col="lightblue", main = paste("Histogram of Residuals"),xlab = "Residuals",border = "black")
The residuals should be normally distributed.
mean(fit_f$residuals)
## [1] 0.00000000000000001632087
The mean of the residuals should be close to zero.
# Training Data
t1 = cbind(train, pred_spent = exp(predict(fit_f,train)))
t1$total_spend = exp(t1$total_spend)
t1 = transform(t1,APE = abs(pred_spent - total_spend)/total_spend )
# Test Data
test$total_spend = exp(test$total_spend)
t2 = cbind(test, pred_spent = exp(predict(fit_f,test)))
t2 = transform(t2,APE = abs(pred_spent - total_spend)/total_spend)
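The decile tables below compare average predicted and actual spend; the first table (350 observations per decile) corresponds to the 3,500-row training set and the second (150 per decile) to the 1,500-row test set. The code that produced them is not shown; a minimal dplyr sketch, where ranking the deciles by predicted spend is an assumption:
decile_report = function(df) {
  df %>%
    mutate(Decile = ntile(pred_spent, 10)) %>%   # 10 equal-sized groups by predicted spend
    group_by(Decile) %>%
    summarise(Count = n(),
              Avg_pred_spent = mean(pred_spent),
              Avg_actual_spent = mean(total_spend))
}
decile_report(t1)  # training data
decile_report(t2)  # test data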
## # A tibble: 10 x 4
## Decile Count Avg_pred_spent Avg_actual_spent
## <fct> <int> <dbl> <dbl>
## 1 1 350 213. 226.
## 2 2 350 279. 300.
## 3 3 350 319. 352.
## 4 4 350 356. 384.
## 5 5 350 392. 452.
## 6 6 350 427. 505.
## 7 7 350 469. 535.
## 8 8 350 521. 574.
## 9 9 350 598. 632.
## 10 10 350 797. 814.
## # A tibble: 10 x 4
## Decile Count Avg_pred_spent Avg_actual_spent
## <fct> <int> <dbl> <dbl>
## 1 1 150 216. 228.
## 2 2 150 274. 300.
## 3 3 150 314. 351.
## 4 4 150 351. 400.
## 5 5 150 384. 424.
## 6 6 150 423. 511.
## 7 7 150 465. 512.
## 8 8 150 517. 596.
## 9 9 150 599. 698.
## 10 10 150 779. 816.
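The error metrics below (presumably the training data first, then the test data) can be computed directly from the actual and predicted spend; a minimal sketch, where the print format is an assumption:
report_errors = function(actual, pred) {
  print(paste("Mean Absolute Percentage Error -", round(mean(abs(pred - actual) / actual), 3)))
  print(paste("Mean Square Error -", round(mean((pred - actual)^2), 3)))
  print(paste("Root Mean Square Error -", round(sqrt(mean((pred - actual)^2)), 3)))
  print(paste("Mean Absolute Error -", round(mean(abs(pred - actual)), 3)))
}
report_errors(t1$total_spend, t1$pred_spent)  # training data
report_errors(t2$total_spend, t2$pred_spent)  # test data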
## [1] "Mean Absolute Percentage Error - 0.373"
## [1] "Mean Square Error - 46959.225"
## [1] "Root Mean Square Error - 216.701"
## [1] "Mean Absolute Error - 160.077"
## [1] "Mean Absolute Percentage Error - 0.378"
## [1] "Mean Square Error - 50289.914"
## [1] "Root Mean Square Error - 224.254"
## [1] "Mean Absolute Error - 164.481"