library(readxl)
Bank_Salaries <- read_excel("G:/My Drive/Data Analysis/Week 5/Bank Salaries.xlsx")

We want to estimate the probability that an employee is female, based on salary and age.

First, let’s create a dummy variable for gender, coded Female = 1 and Male = 0.

Bank_Salaries$Female<-ifelse(Bank_Salaries$Gender=="Female", 1, 0)
library (psych)
## Warning: package 'psych' was built under R version 4.2.2
describe (Bank_Salaries$Female)
##    vars   n mean   sd median trimmed mad min max range  skew kurtosis   se
## X1    1 208 0.67 0.47      1    0.71   0   0   1     1 -0.73    -1.47 0.03
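Because Female is coded 0/1, its mean is the share of women in the data, so the mean of 0.67 reported above says roughly two thirds of the 208 employees are female. A minimal sketch on a made-up vector (the `gender` values are illustrative and not taken from the bank data):

```r
# Hypothetical toy data, not the Bank_Salaries file
gender <- c("Female", "Male", "Female", "Female")

# Same recoding as above: Female = 1, anything else = 0
female <- ifelse(gender == "Female", 1, 0)

# The mean of a 0/1 dummy equals the proportion of 1s
share_female <- mean(female)
print(share_female)  # 0.75
```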

Now let’s split the data into training and test sets: 150 randomly chosen rows for training and the remaining 58 for testing. We set a seed so the random split is reproducible.

set.seed(123)

TrainIndex<-sample(1:nrow(Bank_Salaries), 150)
trainBank<-Bank_Salaries[TrainIndex,]
testBank<-Bank_Salaries[-TrainIndex,]

str(trainBank)
## tibble [150 × 10] (S3: tbl_df/tbl/data.frame)
##  $ Employee : num [1:150] 159 207 179 14 195 170 50 118 43 205 ...
##  $ Education: num [1:150] 5 5 5 2 5 3 2 4 5 5 ...
##  $ Grade    : num [1:150] 4 6 5 1 6 4 1 3 1 6 ...
##  $ Years1   : num [1:150] 4 35 8 10 14 16 2 6 7 36 ...
##  $ Years2   : num [1:150] 0 0 2 6 0 0 0 0 0 0 ...
##  $ Age      : num [1:150] 27 59 37 58 49 49 53 38 35 61 ...
##  $ Salary   : num [1:150] 41800 94000 49000 34700 60000 ...
##  $ Gender   : chr [1:150] "Male" "Male" "Male" "Female" ...
##  $ PC Job   : chr [1:150] "No" "No" "No" "No" ...
##  $ Female   : num [1:150] 0 0 0 1 0 1 1 1 1 0 ...
str(testBank)
## tibble [58 × 10] (S3: tbl_df/tbl/data.frame)
##  $ Employee : num [1:58] 2 3 10 12 15 18 19 28 29 31 ...
##  $ Education: num [1:58] 1 1 3 2 3 2 3 1 1 1 ...
##  $ Grade    : num [1:58] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Years1   : num [1:58] 14 12 9 8 4 8 5 13 12 7 ...
##  $ Years2   : num [1:58] 1 0 0 8 0 9 6 0 6 4 ...
##  $ Age      : num [1:58] 38 35 31 37 33 37 44 48 40 35 ...
##  $ Salary   : num [1:58] 39100 33200 29500 31300 30000 ...
##  $ Gender   : chr [1:58] "Female" "Female" "Female" "Female" ...
##  $ PC Job   : chr [1:58] "No" "No" "No" "No" ...
##  $ Female   : num [1:58] 1 1 1 1 1 1 1 1 1 1 ...
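As a sanity check on the split logic, the sketch below repeats the same `sample()` and negative-index pattern on a stand-in data frame with 208 rows. The `toy` data frame and its columns are assumptions for illustration, so the rows drawn will not match the ones shown above.

```r
set.seed(123)

# Stand-in for Bank_Salaries: 208 rows with an ID column (hypothetical data)
toy <- data.frame(Employee = 1:208,
                  Salary   = round(runif(208, 25000, 95000)))

# Draw 150 row numbers without replacement for training
train_index <- sample(1:nrow(toy), 150)
train_toy <- toy[train_index, ]
test_toy  <- toy[-train_index, ]   # every row not drawn goes to the test set

# The two sets should be disjoint and together cover the full data
n_overlap <- length(intersect(train_toy$Employee, test_toy$Employee))
print(c(train = nrow(train_toy), test = nrow(test_toy), overlap = n_overlap))
```

If the overlap is zero and the two row counts sum to the original number of rows, no observation is used for both fitting and evaluation.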

Now, on to the estimation. We fit a logistic regression of Female on Age and Salary using the training set.

logit1<-glm(Female~Age+Salary, data=trainBank, family="binomial")
summary(logit1)
## 
## Call:
## glm(formula = Female ~ Age + Salary, family = "binomial", data = trainBank)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1399  -1.0502   0.6053   0.8932   1.8344  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.116e+00  8.736e-01   2.422   0.0154 *  
## Age          6.132e-02  2.179e-02   2.814   0.0049 ** 
## Salary      -9.831e-05  2.252e-05  -4.366 1.27e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 198.21  on 149  degrees of freedom
## Residual deviance: 169.76  on 147  degrees of freedom
## AIC: 175.76
## 
## Number of Fisher Scoring iterations: 4
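To read the coefficients, it can help to convert them to odds ratios and to a predicted probability. The sketch below hard-codes the estimates printed above rather than refitting the model. The example employee (age 40, salary 40,000) and the choice to express Salary per 1,000 are illustrative assumptions.

```r
# Estimates copied from the summary(logit1) output above
b <- c(Intercept = 2.116, Age = 6.132e-02, Salary = -9.831e-05)

# Odds ratio for one more year of age, and for 1,000 more in salary
or_age      <- exp(b[["Age"]])
or_salary1k <- exp(1000 * b[["Salary"]])

# Predicted probability of Female for a hypothetical employee
eta <- b[["Intercept"]] + b[["Age"]] * 40 + b[["Salary"]] * 40000
p   <- plogis(eta)   # inverse logit: 1 / (1 + exp(-eta))

print(round(c(or_age = or_age, or_salary1k = or_salary1k, p = p), 3))
```

Holding the other variable fixed, each extra year of age multiplies the odds of being female by about 1.06, and each extra 1,000 in salary multiplies them by about 0.91. For the hypothetical employee, the predicted probability of being female is about 0.65.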

Predict on the test set, using the model fit on the training data.

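# Predicted probability that Female = 1 for each row of the test set
testBank$Predicted<-predict(logit1, testBank, type="response")
# Convert probabilities to class labels using a 0.5 cutoff;
# cross-tabulating these labels against the actual values is one way to build a confusion matrix
testBank$Predicted<-ifelse(testBank$Predicted>.5, "Female", "Male")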

Let’s build a confusion matrix. Rows are the predicted labels and columns are the actual values (1 = Female, 0 = Male).

table(testBank$Predicted, testBank$Female)
##         
##           0  1
##   Female  7 42
##   Male    5  4
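The counts in the table can be turned into summary metrics by hand. The sketch below re-enters the four cells from the output above (rows are predictions, columns are the actual 0/1 values) and treats Female as the positive class, which is an assumption about how one would want to report the results.

```r
# Cells copied from the confusion matrix above (filled column by column)
cm <- matrix(c(7, 5, 42, 4), nrow = 2,
             dimnames = list(Predicted = c("Female", "Male"),
                             Actual    = c("0", "1")))

accuracy    <- (cm["Female", "1"] + cm["Male", "0"]) / sum(cm)  # correct / total
sensitivity <- cm["Female", "1"] / sum(cm[, "1"])  # actual women predicted Female
specificity <- cm["Male", "0"] / sum(cm[, "0"])    # actual men predicted Male
base_rate   <- sum(cm[, "1"]) / sum(cm)            # share of women in the test set

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, base_rate = base_rate), 3))
```

Accuracy comes to about 0.81 (47 of 58), only slightly above the 0.79 one would get by labelling every test employee Female, and the model classifies only 5 of the 12 men correctly.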

NOTE:

The confusionMatrix command I attempted is part of the caret package. I am still working through the dependencies needed to install it. If I succeed, I will share detailed notes on the process, to illustrate troubleshooting that is not the result of user error (such as using trainBank$variable to build the training model).