library(readxl)
Bank_Salaries <- read_excel("G:/My Drive/Data Analysis/Week 5/Bank Salaries.xlsx")
We want to estimate the probability that an employee is female, based on salary and age.
First, let’s create a dummy variable for gender, coded as female = 1 and male = 0.
Bank_Salaries$Female<-ifelse(Bank_Salaries$Gender=="Female", 1, 0)
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
describe(Bank_Salaries$Female)
##    vars   n mean   sd median trimmed mad min max range  skew kurtosis   se
## X1    1 208 0.67 0.47      1    0.71   0   0   1     1 -0.73    -1.47 0.03
Now let’s split the data into training and test sets. We set a seed first so that the random split is reproducible.
set.seed(123)
TrainIndex<-sample(1:nrow(Bank_Salaries), 150)
trainBank<-Bank_Salaries[TrainIndex,]
testBank<-Bank_Salaries[-TrainIndex,]
str(trainBank)
## tibble [150 × 10] (S3: tbl_df/tbl/data.frame)
## $ Employee : num [1:150] 159 207 179 14 195 170 50 118 43 205 ...
## $ Education: num [1:150] 5 5 5 2 5 3 2 4 5 5 ...
## $ Grade : num [1:150] 4 6 5 1 6 4 1 3 1 6 ...
## $ Years1 : num [1:150] 4 35 8 10 14 16 2 6 7 36 ...
## $ Years2 : num [1:150] 0 0 2 6 0 0 0 0 0 0 ...
## $ Age : num [1:150] 27 59 37 58 49 49 53 38 35 61 ...
## $ Salary : num [1:150] 41800 94000 49000 34700 60000 ...
## $ Gender : chr [1:150] "Male" "Male" "Male" "Female" ...
## $ PC Job : chr [1:150] "No" "No" "No" "No" ...
## $ Female : num [1:150] 0 0 0 1 0 1 1 1 1 0 ...
str(testBank)
## tibble [58 × 10] (S3: tbl_df/tbl/data.frame)
## $ Employee : num [1:58] 2 3 10 12 15 18 19 28 29 31 ...
## $ Education: num [1:58] 1 1 3 2 3 2 3 1 1 1 ...
## $ Grade : num [1:58] 1 1 1 1 1 1 1 1 1 1 ...
## $ Years1 : num [1:58] 14 12 9 8 4 8 5 13 12 7 ...
## $ Years2 : num [1:58] 1 0 0 8 0 9 6 0 6 4 ...
## $ Age : num [1:58] 38 35 31 37 33 37 44 48 40 35 ...
## $ Salary : num [1:58] 39100 33200 29500 31300 30000 ...
## $ Gender : chr [1:58] "Female" "Female" "Female" "Female" ...
## $ PC Job : chr [1:58] "No" "No" "No" "No" ...
## $ Female : num [1:58] 1 1 1 1 1 1 1 1 1 1 ...
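As a quick sanity check on the split logic, the sketch below repeats the same sample-and-negative-index pattern on a stand-in data frame. The names toy, toy_train, and toy_test are hypothetical and only illustrate the idea; they are not part of the analysis above.

```r
# Hypothetical stand-in with the same number of rows as Bank_Salaries (208)
toy <- data.frame(id = 1:208)

set.seed(123)
# Draw 150 distinct row numbers without replacement
train_index <- sample(seq_len(nrow(toy)), 150)

# Positive indexing keeps the sampled rows; negative indexing keeps the rest
toy_train <- toy[train_index, , drop = FALSE]
toy_test  <- toy[-train_index, , drop = FALSE]

# The two sets should have 150 and 58 rows and share no ids
n_train <- nrow(toy_train)
n_test  <- nrow(toy_test)
n_overlap <- length(intersect(toy_train$id, toy_test$id))
print(c(train = n_train, test = n_test, overlap = n_overlap))
```

Because sample() draws without replacement by default, every row ends up in exactly one of the two sets.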
Now, on to the estimation. We fit a logistic regression of Female on Age and Salary, using the training set only.
logit1<-glm(Female~Age+Salary, data=trainBank, family="binomial")
summary(logit1)
##
## Call:
## glm(formula = Female ~ Age + Salary, family = "binomial", data = trainBank)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.1399  -1.0502   0.6053   0.8932   1.8344
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  2.116e+00  8.736e-01   2.422   0.0154 *
## Age          6.132e-02  2.179e-02   2.814   0.0049 **
## Salary      -9.831e-05  2.252e-05  -4.366 1.27e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 198.21 on 149 degrees of freedom
## Residual deviance: 169.76 on 147 degrees of freedom
## AIC: 175.76
##
## Number of Fisher Scoring iterations: 4
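The coefficients in the summary output above are on the log-odds scale, which is hard to read directly. A minimal sketch of one way to interpret them follows: exponentiate to get odds ratios, and apply the inverse logit to get a predicted probability. The coefficient values are copied by hand from the output; the 40-year-old employee with a salary of 40,000 is a hypothetical example, not a row from the data.

```r
# Coefficients copied from the summary(logit1) output
b <- c(intercept = 2.116, age = 6.132e-02, salary = -9.831e-05)

# Odds ratios: multiplicative change in the odds that Female = 1
or_age       <- exp(b[["age"]])             # per additional year of age
or_salary_1k <- exp(1000 * b[["salary"]])   # per additional 1,000 of salary

# Linear predictor (log-odds) for a hypothetical employee: Age 40, Salary 40000
eta <- b[["intercept"]] + b[["age"]] * 40 + b[["salary"]] * 40000

# Inverse logit, 1 / (1 + exp(-eta)), converts log-odds to a probability
p_female <- plogis(eta)

print(round(c(or_age = or_age, or_salary_1k = or_salary_1k, p_female = p_female), 3))
```

Under these numbers, each extra year of age multiplies the odds of being female by about 1.06, each extra 1,000 of salary multiplies them by about 0.91, and the hypothetical employee has a predicted probability of roughly 0.65. With the fitted model in memory, exp(coef(logit1)) would give the odds ratios without copying values by hand.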
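Now let’s predict on the test set. With type="response", predict() returns the estimated probability that Female = 1; we then classify anyone with a probability above 0.5 as female.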
testBank$Predicted<-predict(logit1, testBank, type="response")
testBank$Predicted<-ifelse(testBank$Predicted>.5, "Female", "Male")
Let’s build a confusion matrix. One way is to cross-tabulate the predicted labels (rows) against the actual values of Female (columns):
table(testBank$Predicted, testBank$Female)
##
##           0  1
##   Female  7 42
##   Male    5  4
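The counts in the confusion matrix above can be turned into summary measures by hand. The sketch below re-enters the four counts and computes accuracy, sensitivity, and specificity, treating Female as the positive class; that choice of positive class is an assumption made for illustration.

```r
# Counts copied from the table: rows = predicted label, columns = actual Female (0/1)
cm <- matrix(c(7, 5, 42, 4), nrow = 2,
             dimnames = list(Predicted = c("Female", "Male"),
                             Actual = c("0", "1")))

# Correct predictions: predicted Female and actually 1, predicted Male and actually 0
accuracy    <- (cm["Female", "1"] + cm["Male", "0"]) / sum(cm)

# Sensitivity: share of actual women classified as Female
sensitivity <- cm["Female", "1"] / sum(cm[, "1"])

# Specificity: share of actual men classified as Male
specificity <- cm["Male", "0"] / sum(cm[, "0"])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))
```

This gives an accuracy of about 0.81 (47 of 58), a sensitivity of about 0.91 (42 of 46), and a specificity of about 0.42 (5 of 12). Since 46 of the 58 test employees are women, labelling everyone Female would already be right about 79% of the time, so the model's accuracy is only slightly above that baseline.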
NOTE:
The confusionMatrix command I attempted is part of the caret package. I am still working through the dependencies needed to install it successfully. If I succeed, I will share detailed notes on the process, to illustrate troubleshooting that is not the result of user error (such as using trainBank$variable to build the training model).
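For reference, here is a hedged sketch of how the caret call might look once the package installs. It assumes that confusionMatrix() is given two factors with identical levels, and it uses short made-up vectors (predicted, actual) in place of testBank$Predicted and testBank$Gender. If caret is not available, the sketch falls back to the base-R table() used above.

```r
# Hypothetical stand-ins for the predicted and actual gender labels
predicted <- c("Female", "Female", "Male", "Female", "Male")
actual    <- c("Female", "Male",   "Male", "Female", "Female")

# Convert both to factors with the same levels in the same order
lv <- c("Female", "Male")
predicted_f <- factor(predicted, levels = lv)
actual_f    <- factor(actual, levels = lv)

if (requireNamespace("caret", quietly = TRUE)) {
  # data = predictions, reference = true labels, positive = class of interest
  print(caret::confusionMatrix(data = predicted_f, reference = actual_f,
                               positive = "Female"))
} else {
  # Fallback when caret is not installed: plain cross-tabulation
  print(table(Predicted = predicted_f, Actual = actual_f))
}
```

Note that the comparison here is between two sets of labels. In the analysis above, testBank$Female is coded 0/1 while testBank$Predicted holds "Female"/"Male", so testBank$Gender would likely be the more natural reference column.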