Data Filtering

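# Keep only salary payments
salary <- transactions %>%
  filter(txn_description == "PAY/SALARY") %>%
  # Annualize each payment by multiplying by 12 (this treats every payment as monthly)
  mutate(annual_salary = 12 * amount)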

Explore correlations between annual salary and various customer attributes

Create Dummy Variables for Non-Numerical Attributes

library(fastDummies)
DummyData <- dummy_cols(salary, 
                        select_columns =c("gender"))
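
As a rough illustration of what dummy encoding produces, the sketch below builds the same kind of 0/1 indicator columns in base R. The `toy` data frame and its values are hypothetical and are not taken from the transactions data; the `gender_F` / `gender_M` column names follow the "column name, underscore, level" pattern that `dummy_cols()` uses.

```r
# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  gender = c("F", "M", "M", "F"),
  annual_salary = c(40000, 52000, 47000, 45000)
)

# Create one 0/1 indicator column per level of the categorical variable.
for (lvl in sort(unique(toy$gender))) {
  toy[[paste0("gender_", lvl)]] <- as.integer(toy$gender == lvl)
}

toy
```

Each row has exactly one indicator set to 1, which is what lets a categorical attribute enter a numeric correlation matrix.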

Correlation Matrix

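library(ggcorrplot)
# Correlations between the selected columns (chosen by position), rounded to 1 decimal
corr <- round(cor(DummyData[, c(11, 14, 24:26)]), 1)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE)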

Annual salary is only weakly correlated with gender. The direction of the correlation suggests that male customers tend to be paid more than female customers.
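
To show how the sign of a correlation with a 0/1 dummy is read, here is a minimal sketch on made-up numbers. The `toy` values are hypothetical and do not reproduce the correlation matrix above; the point is only that the two gender dummies give correlations of equal size and opposite sign.

```r
# Hypothetical toy data; values are made up for illustration only.
toy <- data.frame(
  annual_salary = c(40000, 52000, 47000, 45000, 60000, 38000),
  gender_M      = c(0, 1, 1, 0, 1, 0)
)
toy$gender_F <- 1 - toy$gender_M

# Correlation of a numeric variable with each 0/1 indicator.
r_m <- cor(toy$annual_salary, toy$gender_M)
r_f <- cor(toy$annual_salary, toy$gender_F)

c(male = r_m, female = r_f)
```

A positive value for `gender_M` means the group coded 1 (male) has the higher mean salary in this toy sample. It says nothing about causes, and rounding to one decimal can hide small differences.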

Predictive models

Training Data and Test Data

library(caret)
set.seed(123)
ind_test <- createDataPartition(salary$annual_salary,times=1, p=0.2, list=FALSE)
train_data <- salary[-ind_test,]
test_set <- salary[c(ind_test),]
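
For comparison, the sketch below does an 80/20 holdout split in base R on simulated data. It is a plain random split, which is a simpler technique than `createDataPartition()`: for a numeric outcome, caret samples within percentile groups of the outcome so both sets have similar distributions. The `toy` data frame is hypothetical.

```r
set.seed(123)

# Hypothetical simulated data standing in for the salary table.
toy <- data.frame(
  id = 1:100,
  annual_salary = rnorm(100, mean = 50000, sd = 10000)
)

# Draw 20% of the row indices at random for the test set.
n_test   <- round(0.2 * nrow(toy))
test_idx <- sample(seq_len(nrow(toy)), size = n_test)

toy_test  <- toy[test_idx, ]
toy_train <- toy[-test_idx, ]

c(train = nrow(toy_train), test = nrow(toy_test))
```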

Linear regression

fit_lm <- lm(annual_salary~ balance+gender+age, 
                data= train_data)
summary(fit_lm)
## 
## Call:
## lm(formula = annual_salary ~ balance + gender + age, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30231  -8515  -2899   6023  76244 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24478.7457  1494.6526  16.378  < 2e-16 ***
## balance         0.1357     0.0164   8.271 6.69e-16 ***
## genderM      4414.5820   986.1029   4.477 8.84e-06 ***
## age          -190.1483    41.2059  -4.615 4.68e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12910 on 700 degrees of freedom
## Multiple R-squared:  0.1387, Adjusted R-squared:  0.135 
## F-statistic: 37.56 on 3 and 700 DF,  p-value: < 2.2e-16

The regression as a whole is significant (F-test p-value < 2.2e-16), and each predictor (balance, gender, and age) is individually significant. However, the R-squared is low: the model explains only about 14% of the variability in annual salary, so its predictive power is limited. The next step is to compute the Root Mean Squared Error (RMSE) on the test set.
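
To make the R-squared figures in the summary concrete, the sketch below recomputes R-squared and adjusted R-squared by hand for a linear model. It uses the built-in `mtcars` data as a stand-in, since the salary data is not reproduced here, so the numbers differ from the output above.

```r
# Stand-in model on a built-in dataset (not the salary data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared = 1 - (residual sum of squares / total sum of squares).
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - rss / tss

# Adjusted R-squared penalizes the number of predictors p.
n <- nrow(mtcars)
p <- 2
adj_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

c(r2 = r2_manual, adj_r2 = adj_manual)
```

Applying the same adjustment to the summary above (R-squared 0.1387, three predictors, 700 residual degrees of freedom) gives approximately 0.135, matching the reported adjusted value.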

predict_lm <- predict(fit_lm,newdata=test_set)
sqrt(mean((predict_lm-test_set$annual_salary)^2))
## [1] 12437.05

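The test-set RMSE of about 12,437 is close to the residual standard error of 12,910 from the training fit, which confirms that the model is not a good estimator of annual salary. More information about the customers is needed to explain it.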
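
One way to judge whether an RMSE is large is to compare it with a naive baseline that always predicts the training mean. The sketch below shows the idea on the built-in `mtcars` data as a stand-in. The `rmse()` helper is a hypothetical convenience function, not part of the original analysis.

```r
# Hypothetical helper: root mean squared error.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

set.seed(123)
test_idx <- sample(seq_len(nrow(mtcars)), size = 8)
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# Model predictions versus a constant prediction (the training mean).
fit <- lm(mpg ~ wt + hp, data = train)
rmse_model    <- rmse(predict(fit, newdata = test), test$mpg)
rmse_baseline <- rmse(rep(mean(train$mpg), nrow(test)), test$mpg)

c(model = rmse_model, baseline = rmse_baseline)
```

If the model's RMSE is not clearly below the baseline's, the predictors add little beyond the average.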

Challenge: Build a decision-tree

library(rpart)
set.seed(123)
grid<- expand.grid(cp=seq(0,0.05,0.002))
train_rpart <-train(annual_salary~balance+gender+age, data=train_data,
                    method= "rpart",
                    tuneGrid = grid,
                    control= rpart.control(minsplit = 20))

ggplot(train_rpart,highlight = TRUE)

The optimal complexity parameter (cp) is:

train_rpart$bestTune
##      cp
## 9 0.016
plot(train_rpart$finalModel, uniform=TRUE, branch=0.6, margin = 0.1)
text(train_rpart$finalModel, use.n=TRUE, cex= 0.7)
title("Regression Tree Fitted on the Training Set")

library(rpart.plot)
rpart.plot(train_rpart$finalModel)
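
To illustrate what the complexity parameter controls, the sketch below grows a deliberately large regression tree and then prunes it. In `rpart`, a split is kept only if it reduces the overall lack of fit by at least a factor of `cp`, so a larger `cp` gives a smaller tree. The example uses the built-in `mtcars` data as a stand-in, the `cp = 0.05` value is arbitrary, and `n_leaves()` is a hypothetical helper.

```r
library(rpart)

# Grow a large regression tree with no complexity penalty (cp = 0).
full_tree <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova",
                   control = rpart.control(cp = 0, minsplit = 5))

# Prune back: drop splits that do not improve the fit by at least cp.
pruned_tree <- prune(full_tree, cp = 0.05)

# Hypothetical helper: count terminal nodes of an rpart tree.
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

c(full = n_leaves(full_tree), pruned = n_leaves(pruned_tree))
```

The `cptable` component of a fitted tree (for example `full_tree$cptable`) lists the candidate `cp` values along with their cross-validated errors.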

With this tuned value of cp, the model can be used to predict on the test set and compute its RMSE:

predict_rpart <- predict(train_rpart, newdata = test_set)
sqrt(mean((predict_rpart - test_set$annual_salary)^2))
## [1] 11539.15

The error is still large, so the tree is not a good predictive model either. The available customer information is not enough to give a good salary estimate; more attributes are needed. That said, the tree's test RMSE (about 11,539) is roughly 7% lower than that of the linear regression (about 12,437), so it performs slightly better.

2D Visualization of the Regression Tree

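library(plotmo)
plotmo(train_rpart, ngrid2 = 20,
       # Colour points by gender; use train_data so the colours line up with the fitted rows
       pt.col = ifelse(train_data$gender == "F", "red", "lightblue"),
       type2 = "contour",
       degree1 = FALSE)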

3D Visualization of the Regression Tree

plotmo(train_rpart,
       degree1= FALSE, type2= "persp")