# Keep salary payments only; 12 * amount assumes each payment is monthly
salary <- transactions %>%
  filter(txn_description == "PAY/SALARY") %>%
  mutate(annual_salary = 12 * amount)
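library(fastDummies)
library(ggcorrplot)
# One-hot encode gender so it can enter the correlation matrix
DummyData <- dummy_cols(salary, select_columns = c("gender"))
# Correlation matrix of the columns selected by position, rounded to one decimal
corr <- round(cor(DummyData[, c(11, 14, 24:26)]), 1)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE)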
Annual salary is only weakly correlated with gender. The sign of the correlation suggests that male customers tend to be paid somewhat more than female customers, but the relationship is not strong.
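As a hedged illustration of how to read a correlation with a 0/1 dummy column: its sign matches the sign of the difference in group means, and testing it is equivalent to a pooled two-sample t-test. The sketch below uses simulated data; `salary_sim`, the 4000 gap and the noise level are made-up assumptions, not values taken from the transactions dataset.

```r
set.seed(1)
n <- 200

# Simulated customers (assumption: not the real data)
gender <- factor(sample(c("F", "M"), n, replace = TRUE))
salary_sim <- 60000 + 4000 * (gender == "M") + rnorm(n, sd = 13000)

# Dummy column: 1 for male, 0 for female
gender_M <- as.integer(gender == "M")

# Correlation between the dummy and salary
r <- cor(gender_M, salary_sim)

# Difference in mean salary, male minus female
mean_diff <- mean(salary_sim[gender == "M"]) - mean(salary_sim[gender == "F"])

# Testing the correlation is equivalent to an equal-variance t-test
ct <- cor.test(gender_M, salary_sim)
tt <- t.test(salary_sim ~ gender, var.equal = TRUE)

c(correlation = r, mean_difference = mean_diff)
c(p_cor_test = ct$p.value, p_t_test = tt$p.value)
```

A weak correlation therefore corresponds to a mean difference that is small relative to the spread of salaries within each group.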
library(caret)
set.seed(123)
ind_test <- createDataPartition(salary$annual_salary,times=1, p=0.2, list=FALSE)
train_data <- salary[-ind_test,]
test_set <- salary[c(ind_test),]
fit_lm <- lm(annual_salary~ balance+gender+age,
data= train_data)
summary(fit_lm)
##
## Call:
## lm(formula = annual_salary ~ balance + gender + age, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30231 -8515 -2899 6023 76244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24478.7457 1494.6526 16.378 < 2e-16 ***
## balance 0.1357 0.0164 8.271 6.69e-16 ***
## genderM 4414.5820 986.1029 4.477 8.84e-06 ***
## age -190.1483 41.2059 -4.615 4.68e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12910 on 700 degrees of freedom
## Multiple R-squared: 0.1387, Adjusted R-squared: 0.135
## F-statistic: 37.56 on 3 and 700 DF, p-value: < 2.2e-16
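The Multiple and Adjusted R-squared values in the summary above can be reproduced by hand from the residual and total sums of squares. The following is a minimal sketch on the built-in `mtcars` data (an assumption chosen only to keep the example self-contained; it is not the salary data), followed by the same adjustment formula applied to the values reported above.

```r
# Illustrative only: built-in mtcars data, not the salary data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared = 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors p
n <- nrow(mtcars)
p <- length(coef(fit)) - 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

c(manual = r2_manual, summary = summary(fit)$r.squared)
c(manual = adj_r2_manual, summary = summary(fit)$adj.r.squared)

# Same adjustment applied to the values reported above:
# R-squared 0.1387, 3 predictors, 700 residual degrees of freedom (n = 704)
adj_r2_reported <- 1 - (1 - 0.1387) * (704 - 1) / 700
round(adj_r2_reported, 3)
```

The last line rounds to 0.135, which agrees with the Adjusted R-squared printed in the summary.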
The regression is significant overall, and every predictor (balance, gender and age) is significant at the 0.001 level. However, the R-squared is low: the model explains only about 14% of the variability in annual salary, so its predictive power is weak. The next step is to compute the Root Mean Squared Error (RMSE) on the test set.
predict_lm <- predict(fit_lm,newdata=test_set)
sqrt(mean((predict_lm-test_set$annual_salary)^2))
## [1] 12437.05
The test RMSE of about 12,437 is roughly as large as the residual standard error (12,910), which confirms that the model is not a good estimator. More information is needed to explain annual salary.
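One way to judge whether an RMSE is large is to compare it with a naive baseline that always predicts the training-set mean. The sketch below shows this on simulated data; the helper `rmse()`, the data frame `sim` and all of its coefficients are hypothetical and chosen only for illustration.

```r
# Root mean squared error between predictions and observed values
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

set.seed(42)
n <- 500

# Simulated customers (assumption: not the real data)
sim <- data.frame(
  age     = round(runif(n, 18, 70)),
  balance = rexp(n, rate = 1 / 10000)
)
sim$annual_salary <- 50000 + 0.15 * sim$balance - 200 * sim$age +
  rnorm(n, sd = 13000)

# Hold out 20% of the rows as a test set
test_idx  <- sample(n, size = n / 5)
train_sim <- sim[-test_idx, ]
test_sim  <- sim[test_idx, ]

fit_sim <- lm(annual_salary ~ balance + age, data = train_sim)

# Model error versus the error of always predicting the training mean
rmse_model    <- rmse(predict(fit_sim, newdata = test_sim), test_sim$annual_salary)
rmse_baseline <- rmse(mean(train_sim$annual_salary), test_sim$annual_salary)
c(model = rmse_model, baseline = rmse_baseline)
```

If the model's RMSE is only marginally below the baseline's, the predictors add little beyond the average salary, which is consistent with a low R-squared.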
library(rpart)
set.seed(123)
grid<- expand.grid(cp=seq(0,0.05,0.002))
train_rpart <-train(annual_salary~balance+gender+age, data=train_data,
method= "rpart",
tuneGrid = grid,
control= rpart.control(minsplit = 20))
ggplot(train_rpart,highlight = TRUE)
The optimal complexity parameter (cp) is:
train_rpart$bestTune
## cp
## 9 0.016
plot(train_rpart$finalModel, uniform=TRUE, branch=0.6, margin = 0.1)
text(train_rpart$finalModel, use.n=TRUE, cex= 0.7)
title("Regression Tree Fitted on the Training Set")
library(rpart.plot)
rpart.plot(train_rpart$finalModel)
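To make the role of the complexity parameter more concrete: in `rpart`, a split is kept only if it reduces the overall lack of fit by at least a factor of `cp`, so larger values give smaller trees. The sketch below grows a deliberately large regression tree on the built-in `mtcars` data (an assumption for illustration, not the salary data) and prunes it at a few arbitrary `cp` values; `count_leaves()` is a hypothetical helper.

```r
library(rpart)

# Illustrative only: grow a deliberately large regression tree on mtcars
full_tree <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova",
                   control = rpart.control(cp = 0, minsplit = 5))

# Count terminal nodes in a fitted rpart tree
count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

# Prune the same tree at increasing cp values and record its size
cp_values <- c(0, 0.01, 0.05, 0.2)
leaves <- sapply(cp_values, function(cp) count_leaves(prune(full_tree, cp = cp)))
data.frame(cp = cp_values, leaves = leaves)
```

The number of leaves never increases as `cp` grows, which is why tuning `cp` by cross-validation trades off tree size against fit.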
With this tuned complexity parameter, the model can now be used to predict on the test set and its RMSE can be checked.
predic_rpart <- predict(train_rpart, newdata = test_set)
sqrt(mean((predic_rpart-test_set$annual_salary)^2))
## [1] 11539.15
The error is still large, so this is not a good predictive model either. More information about each customer is needed to predict salary well; the current variables are not enough to give a good estimate. The tree does, however, perform slightly better than the linear regression (RMSE of 11,539 versus 12,437).
2D visualization of the regression tree's predictions
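library(plotmo)
# Colour the points by gender; use train_data so the colours line up
# with the rows the model was fitted on
plotmo(train_rpart, ngrid2 = 20,
       pt.col = ifelse(train_data$gender == "F", "red", "lightblue"),
       type2 = "contour",
       degree1 = FALSE)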
3D visualization of the regression tree's predictions
plotmo(train_rpart,
degree1= FALSE, type2= "persp")