Introduction

In this assignment, I use Zelig 5 syntax to estimate and interpret, by simulation, the probability of IBM employee attrition. The data set, retrieved from Kaggle, contains 1,470 observations and 35 variables. The binary outcome variable is attrition (“Yes”/“No”), and there are 33 possible predictor variables. This analysis examines the relationship between attrition and three predictors: age, number of companies worked, and overtime. The goal is to answer the following questions: What effect does age have on the probability of attrition? How does the number of companies an employee has worked for affect their likelihood of leaving IBM? Finally, what impact does working overtime have?


Data and Variables

The variables used in this analysis are as follows:

head(AttritionDataset3)

Descriptive Analysis

Age

The majority of employees in this data set appear to be between the ages of 30 and 40.

library(ggplot2)
ggplot(AttritionDataset3, aes(x = Age)) + geom_histogram()


Number of Companies Worked

Most employees have worked for one other company prior to IBM.

ggplot(AttritionDataset3, aes(x = NumCompaniesWorked)) + geom_bar()


Overtime

The vast majority of employees at IBM do not regularly work overtime.

ggplot(AttritionDataset3, aes(x = OverTime)) + geom_bar()


Estimating the Model

This logistic regression model estimates the effect of age, the number of companies worked, and whether or not an employee works overtime on the probability of attrition. A squared term for age was included to account for non-linearity; its effect is easier to read from the simulation graph, so further analysis follows there. For the other predictors, the coefficients are on the log-odds scale, not the probability scale: each additional company worked for prior to IBM increases the log-odds of attrition by 0.14 (an odds ratio of about 1.15), and the relationship is statistically significant. Working overtime increases the log-odds of attrition by 1.45 (an odds ratio of about 4.27); this effect is also statistically significant.

library(Zelig)  # provides zlogit (Zelig 5 reference-class syntax)
z5attrition <- zlogit$new()
z5attrition$zelig(Attrition ~ Age + I(Age^2) + NumCompaniesWorked + OverTime, data = AttritionDataset3)
summary(z5attrition)
Model: 

Call:
z5attrition$zelig(formula = Attrition ~ Age + I(Age^2) + NumCompaniesWorked + 
    OverTime, data = AttritionDataset3)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6300  -0.5961  -0.4157  -0.3143   2.5680  

Coefficients:
                     Estimate Std. Error z value             Pr(>|z|)
(Intercept)         4.4242991  0.9875380   4.480        0.00000745975
Age                -0.3286633  0.0546210  -6.017        0.00000000178
I(Age^2)            0.0034658  0.0007076   4.898        0.00000096929
NumCompaniesWorked  0.1387759  0.0305315   4.545        0.00000548488
OverTimeYes         1.4511580  0.1549134   9.368 < 0.0000000000000002

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1298.6  on 1469  degrees of freedom
Residual deviance: 1132.5  on 1465  degrees of freedom
AIC: 1142.5

Number of Fisher Scoring iterations: 5

Next step: Use 'setx' method

Setting Parameters, Generating Simulations, and Plotting/Summarizing

Effect of Age on Attrition

The simulation results indicate that as employees get older, the probability of leaving IBM decreases. However, this holds only until the late 40s (the estimated coefficients place the turning point near age 47), after which the probability of attrition begins to increase again, as depicted in the figure below.

z5attrition$setrange(Age = min(AttritionDataset3$Age):max(AttritionDataset3$Age))
z5attrition$sim()
z5attrition$graph()
Age chosen as the x-axis variable. Use the var argument to specify a different variable.


Effect of Number of Companies Worked on Attrition

Employees who have worked for more companies have a higher probability of attrition. In other words, those who have changed employers more often are more likely to leave IBM as well.

z5attrition$setrange(NumCompaniesWorked = min(AttritionDataset3$NumCompaniesWorked):max(AttritionDataset3$NumCompaniesWorked))
z5attrition$sim()
z5attrition$graph()
NumCompaniesWorked chosen as the x-axis variable. Use the var argument to specify a different variable.


Effect of Working Overtime on Attrition

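For simplicity, I created a separate model object to examine the effect of working overtime on attrition. The simulation results indicate that the probability of leaving IBM is higher among those who work overtime: the expected probability is about 0.24 with overtime (setx) versus about 0.07 without (setx1). The first difference of -0.17 is negative because it is computed as the setx1 scenario (no overtime) minus the setx scenario (overtime).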

z5OverTime <- zlogit$new()
z5OverTime$zelig(Attrition ~ Age + I(Age^2) + NumCompaniesWorked + OverTime, data = AttritionDataset3)
z5OverTime$setx(OverTime = "Yes")
z5OverTime$setx1(OverTime = "No")
z5OverTime$sim()
summary(z5OverTime)

 sim x :
 -----
ev
          mean         sd       50%      2.5%     97.5%
[1,] 0.2382665 0.02392028 0.2375257 0.1941806 0.2886036
pv
         0     1
[1,] 0.777 0.223

 sim x1 :
 -----
ev
           mean          sd        50%       2.5%      97.5%
[1,] 0.06870589 0.008498964 0.06826441 0.05401017 0.08703405
pv
         0     1
[1,] 0.941 0.059
fd
           mean         sd        50%       2.5%     97.5%
[1,] -0.1695606 0.02268267 -0.1689708 -0.2187875 -0.126963
plot(z5OverTime)


Analysis of Results

This logistic regression model examined the effect of age, the number of companies worked, and whether or not an employee works overtime on the probability of attrition. Unlike the model in the previous assignment, this one includes a squared term for age to account for non-linearity. The simulation results indicate that as employees get older, they become less likely to leave IBM. However, the effect starts to reverse in the late 40s, at which point the probability of attrition begins to increase again. This increase is plausible as employees get closer to retirement age. In addition, the results suggest that employees who have worked for more companies have a higher probability of attrition; those who have changed employers more often are more likely to leave IBM as well. IBM may need to do more work to figure out how best to retain these employees. Lastly, the probability of attrition is higher for employees who work overtime than for those who do not: about 0.24 versus 0.07, a gap of roughly 17 percentage points (reported as a first difference of -0.17 because the simulation subtracts the overtime scenario from the no-overtime scenario).

---
title: "Soc 712 Homework #6b - Analyzing IBM Employee Attrition using Zelig 5 syntax"
output: html_notebook
---

---

## Introduction 

In this assignment, I use Zelig 5 syntax to estimate and interpret, by simulation, the probability of IBM employee attrition. The data set, retrieved from Kaggle, contains 1,470 observations and 35 variables. The binary outcome variable is attrition ("Yes"/"No"), and there are 33 possible predictor variables. This analysis examines the relationship between attrition and three predictors: *age*, *number of companies worked*, and *overtime*. The goal is to answer the following questions: What effect does age have on the probability of attrition? How does the number of companies an employee has worked for affect their likelihood of leaving IBM? Finally, what impact does working overtime have?

---

## Data and Variables

The variables used in this analysis are as follows:

* **Attrition** - Attrition serves as the outcome variable; the responses are either "Yes" or "No".
* **Age** - This variable measures the employee's age.
* **NumCompaniesWorked** - This variable records the number of companies an employee has worked for prior to IBM. 
* **OverTime** - This variable describes whether or not the employee regularly works overtime; the responses are either "Yes" or "No". 

```{r, echo=FALSE}
library(readr)
AttritionDataset <- read_csv("C:/Users/Raven/Desktop/AttritionDataset.xls")
```

```{r, echo=FALSE}
library(dplyr)
AttritionDataset3 <- select(AttritionDataset, 
               Attrition, Age, NumCompaniesWorked, OverTime)
```

```{r, echo=FALSE}
AttritionDataset3$Attrition <- as.integer(as.factor(AttritionDataset3$Attrition))
AttritionDataset3$OverTime <- as.factor(AttritionDataset3$OverTime)
```

```{r, echo=FALSE}
AttritionDataset3 <- mutate(AttritionDataset3,
                            Attrition = sjmisc::rec(Attrition, rec = "2=1; 1=0"))
```
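
The recoding steps above turn the "No"/"Yes" responses into a 0/1 indicator in two passes. As a minimal sketch of the same mapping in base R, assuming the column contains only the strings "No" and "Yes" (the vector `attrition_raw` is a made-up example, not the Kaggle data), a single comparison gives the same result:

```{r}
# Hypothetical toy vector standing in for the Attrition column
attrition_raw <- c("Yes", "No", "No", "Yes", "No")

# Two-pass route: factor levels sort alphabetically ("No" = 1, "Yes" = 2),
# so subtracting 1 reproduces the "2=1; 1=0" recode
two_step <- as.integer(as.factor(attrition_raw)) - 1L

# One-pass alternative: TRUE/FALSE coerced to 1/0
one_step <- as.integer(attrition_raw == "Yes")

# Check that both routes agree
identical(two_step, one_step)
```

The one-pass version does not depend on the alphabetical ordering of factor levels, which may make the intent easier to read.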


```{r}
head(AttritionDataset3)
```

---

##  Descriptive Analysis

**Age**

The majority of employees in this data set appear to be between the ages of 30 and 40.
```{r}
library(ggplot2)
ggplot(AttritionDataset3, aes(x = Age)) + geom_histogram()
```

---

**Number of Companies Worked**

Most employees have worked for one other company prior to IBM.  
```{r}
ggplot(AttritionDataset3, aes(x = NumCompaniesWorked)) + geom_bar()
```

---

**Overtime**

The vast majority of employees at IBM do *not* regularly work overtime.
```{r}
ggplot(AttritionDataset3, aes(x = OverTime)) + geom_bar()
```

---

## Estimating the Model 

This logistic regression model estimates the effect of age, the number of companies worked, and whether or not an employee works overtime on the probability of attrition. A squared term for age was included to account for non-linearity; its effect is easier to read from the simulation graph, so further analysis follows there. For the other predictors, the coefficients are on the log-odds scale, not the probability scale: each additional company worked for prior to IBM increases the log-odds of attrition by 0.14 (an odds ratio of about 1.15), and the relationship is statistically significant. Working overtime increases the log-odds of attrition by 1.45 (an odds ratio of about 4.27); this effect is also statistically significant.

```{r}
library(Zelig)  # provides zlogit (Zelig 5 reference-class syntax)
z5attrition <- zlogit$new()
z5attrition$zelig(Attrition ~ Age + I(Age^2) + NumCompaniesWorked + OverTime, data = AttritionDataset3)
summary(z5attrition)
```
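
Logit coefficients are reported on the log-odds scale, so exponentiating them is one common way to make them easier to read. The sketch below does this for two of the estimates; the values are copied by hand from the model summary rather than extracted from the Zelig object, and the name `coefs` is only illustrative:

```{r}
# Coefficients copied from the model summary (log-odds scale)
coefs <- c(NumCompaniesWorked = 0.1387759, OverTimeYes = 1.4511580)

# Exponentiating a logit coefficient gives the multiplicative change in the odds
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
```

An odds ratio above 1 means the odds of attrition rise with the predictor. It is not a change in probability, which depends on the values of the other covariates.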


---

## Setting Parameters, Generating Simulations, and Plotting/Summarizing

### Effect of Age on Attrition
  
The simulation results indicate that as employees get older, the probability of leaving IBM decreases. However, this holds only until the late 40s (the estimated coefficients place the turning point near age 47), after which the probability of attrition begins to increase again, as depicted in the figure below.

```{r}
z5attrition$setrange(Age = min(AttritionDataset3$Age):max(AttritionDataset3$Age))
z5attrition$sim()
z5attrition$graph()
```
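
Because the model contains both Age and Age squared, the age at which the fitted probability of attrition is lowest can be approximated from the two age coefficients alone, holding the other predictors fixed. As a rough check, writing $\beta_{1}$ for the Age coefficient and $\beta_{2}$ for the squared term (this notation is introduced here only for illustration) and using the rounded estimates from the model summary:

$$
\frac{\partial}{\partial\,\text{Age}}\left(\beta_{1}\,\text{Age} + \beta_{2}\,\text{Age}^{2}\right) = \beta_{1} + 2\beta_{2}\,\text{Age} = 0
\quad\Longrightarrow\quad
\text{Age}^{*} = -\frac{\beta_{1}}{2\beta_{2}} = \frac{0.3287}{2 \times 0.003466} \approx 47.4
$$

This is a point estimate only. The simulated confidence bands in the graph give a better sense of how precisely the turning point is located.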

---

### Effect of Number of Companies Worked on Attrition

Employees who have worked for more companies have a higher probability of attrition. In other words, those who have changed employers more often are more likely to leave IBM as well.

```{r}
z5attrition$setrange(NumCompaniesWorked = min(AttritionDataset3$NumCompaniesWorked):max(AttritionDataset3$NumCompaniesWorked))
z5attrition$sim()
z5attrition$graph()
```
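
To see roughly what the simulation is tracing out, the predicted probability can also be computed by hand from the coefficients. The sketch below is illustrative only: the coefficients are copied from the model summary, while the employee profile (age 37, no overtime) and the 0 to 9 grid of companies are assumptions chosen for the example, not values taken from the Zelig object.

```{r}
# Coefficients copied from the model summary (log-odds scale)
b <- c(intercept = 4.4242991, age = -0.3286633, age_sq = 0.0034658,
       num_companies = 0.1387759, overtime_yes = 1.4511580)

# Assumed profile (illustrative only): age 37, no overtime
age <- 37
companies <- 0:9  # assumed grid of values

# Linear predictor, then the inverse logit (plogis) to get probabilities
eta <- b[["intercept"]] + b[["age"]] * age + b[["age_sq"]] * age^2 +
  b[["num_companies"]] * companies
prob <- plogis(eta)

data.frame(companies, prob = round(prob, 3))
```

Unlike the Zelig simulation, this hand calculation gives point predictions only and carries no measure of uncertainty.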

---

### Effect of Working Overtime on Attrition

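For simplicity, I created a separate model object to examine the effect of working overtime on attrition. The simulation results indicate that the probability of leaving IBM is higher among those who work overtime: the expected probability is about 0.24 with overtime (`setx`) versus about 0.07 without (`setx1`). The first difference of **-0.17** is negative because it is computed as the `setx1` scenario (no overtime) minus the `setx` scenario (overtime).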

```{r}
z5OverTime <- zlogit$new()
z5OverTime$zelig(Attrition ~ Age + I(Age^2) + NumCompaniesWorked + OverTime, data = AttritionDataset3)
z5OverTime$setx(OverTime = "Yes")
z5OverTime$setx1(OverTime = "No")
z5OverTime$sim()
summary(z5OverTime)
```

```{r fig.height=8}
plot(z5OverTime)
```
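
The sign of the first difference follows from the order of the two scenarios. As a small arithmetic sketch, using the mean expected values copied by hand from the simulation summary (the variable names are illustrative):

```{r}
# Mean expected values copied from the simulation summary
ev_overtime    <- 0.2382665   # setx:  OverTime = "Yes"
ev_no_overtime <- 0.06870589  # setx1: OverTime = "No"

# First difference as reported: x1 scenario minus x scenario
fd <- ev_no_overtime - ev_overtime
round(fd, 2)

# Same gap expressed as "overtime minus no overtime", in percentage points
round(-fd * 100, 1)
```

Swapping the values passed to `setx` and `setx1` would be expected to flip the sign of the reported first difference without changing its size.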

---

## Analysis of Results

This logistic regression model examined the effect of age, the number of companies worked, and whether or not an employee works overtime on the probability of attrition. Unlike the model in the previous assignment, this one includes a squared term for age to account for non-linearity. The simulation results indicate that as employees get older, they become less likely to leave IBM. However, the effect starts to reverse in the late 40s, at which point the probability of attrition begins to increase again. This increase is plausible as employees get closer to retirement age. In addition, the results suggest that employees who have worked for more companies have a higher probability of attrition; those who have changed employers more often are more likely to leave IBM as well. IBM may need to do more work to figure out how best to retain these employees. Lastly, the probability of attrition is higher for employees who work overtime than for those who do not: about 0.24 versus 0.07, a gap of roughly 17 percentage points (reported as a first difference of **-0.17** because the simulation subtracts the overtime scenario from the no-overtime scenario).












