Introduction

I prepared the data in Excel before loading it into R, so several steps were done in Excel; I explain them here.

First, I built the client dissatisfaction indicator, which is used in the first question: cancelled rate (cancelled times / years) + replaced rate (replaced times / years) + complaint rate (complaint times / years).
Second, the difference between the joining date and the date of first hire.
Third, the number of failed interviews, i.e. Number Of Interviews done - Number of successful interviews (if it exists).
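
The derived columns described above could also be computed in R instead of Excel. The following is a minimal sketch in base R on a made-up data frame; the column names (`cancelled`, `replaced`, `complaints`, `years`, `interviews_done`, `interviews_ok`) are hypothetical stand-ins rather than the names in the actual workbook, and treating a missing success count as zero is an assumption.

```r
# Toy data; all column names and values here are illustrative assumptions.
maids <- data.frame(
  cancelled       = c(2, 0, 1),
  replaced        = c(1, 1, 0),
  complaints      = c(0, 3, 1),
  years           = c(1, 2, 0.5),
  interviews_done = c(5, 3, 4),
  interviews_ok   = c(1, NA, 2)   # NA: no successful interview recorded
)

# Client dissatisfaction indicator = sum of the three yearly rates.
maids$dissatisfaction <- with(maids,
  cancelled / years + replaced / years + complaints / years)

# Failed interviews = interviews done minus successful ones
# (assumption: a missing success count is treated as 0).
ok <- ifelse(is.na(maids$interviews_ok), 0, maids$interviews_ok)
maids$failed_interviews <- maids$interviews_done - ok

print(maids[, c("dissatisfaction", "failed_interviews")])
```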

When loading the data into R, we have to be careful with maids in the following statuses:

REJECTED
VISA_UNSUCCESSFUL
PASSED_EXIT
UNREACHABLE_AFTER_EXIT
IN_EXIT
AVAILABLE
LANDED_IN_DUBAI
TRACKED
PENDING_FOR_DISCIPLINE

These maids have no values in the fields analysed, so keeping them would distort the data.
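
One way to drop these statuses is a simple `%in%` filter. This is only a sketch on toy data: the column name `status` and the value "WITH_CLIENT" are assumptions, since the actual status column of the workbook is not shown here.

```r
# Statuses to exclude, as listed above.
excluded <- c("REJECTED", "VISA_UNSUCCESSFUL", "PASSED_EXIT",
              "UNREACHABLE_AFTER_EXIT", "IN_EXIT", "AVAILABLE",
              "LANDED_IN_DUBAI", "TRACKED", "PENDING_FOR_DISCIPLINE")

# Toy data; the column name `status` and the value "WITH_CLIENT" are assumptions.
maids <- data.frame(
  id     = 1:5,
  status = c("WITH_CLIENT", "REJECTED", "IN_EXIT", "WITH_CLIENT", "TRACKED")
)

# Keep only the rows whose status is not in the excluded list.
kept <- maids[!(maids$status %in% excluded), ]
print(kept)
```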

Load the data into R

library(tidyverse)

## -- Attaching packages ----------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts -------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(readxl)
data<- read_excel("D:/Files/miad.cc/obada aladib test file 2020-04-17-17-35.xlsx", 
    sheet = "Table2", col_types = c("numeric", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "numeric", "numeric", 
        "numeric", "numeric", "numeric", 
        "date", "numeric", "date", "text", 
        "text", "numeric", "numeric", "numeric", 
        "text", "numeric", "numeric", "date"))

The resulting data frame can be explored in the table below.

DT::datatable(data)

Methodology

Before I begin, one important note: one way to measure the correlation between a numeric variable and a nominal (two-level) variable is the point-biserial correlation. I will apply it here together with a t-test, which detects a difference between the means of two groups.
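
The link between the two methods can be illustrated on simulated data. The point-biserial correlation is the Pearson correlation with a 0/1 group code, and its t statistic matches that of an equal-variance two-sample t-test. The sketch below uses arbitrary, assumed group sizes and means; it is not run on the maids data.

```r
set.seed(1)
# Simulated data; group sizes and means are arbitrary assumptions.
type  <- rep(c(1, 0), times = c(40, 25))          # 1 = operator, 0 = walk-in
score <- c(rnorm(40, mean = 5, sd = 2), rnorm(25, mean = 7, sd = 2))

# Point-biserial correlation: Pearson correlation with the 0/1 group code.
r_pb <- cor.test(score, type)

# Equal-variance two-sample t-test on the same data.
tt <- t.test(score[type == 1], score[type == 0], var.equal = TRUE)

# The two t statistics coincide.
print(c(correlation_t = unname(r_pb$statistic), pooled_t = unname(tt$statistic)))
```

A Welch test (`var.equal = FALSE`, the default of `t.test`) does not pool the variances, so its t statistic and p-value can differ from those of `cor.test` on the same data.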

First Question

EDA (Exploratory Data Analysis)

As noted above, the data should exclude maids with certain statuses, and also maids whose times hired by a client equals zero.

Prepare Data

data1 <- filter (data , data$`Times Hired by a client`!=0 , data$`date left`!=0 )
data1$type<- if_else(data1$`Maid Type (wlak-in / operator)`=="FREEDOM_OPERATOR",1,0)

We start with a histogram of the client dissatisfaction indicator to see the distribution of the data.

hist(data1$`client dissatisfaction indicator`, xlab = "client dissatisfaction indicator", main = "Hist of client dissatisfaction indicator",prob=TRUE)
lines(density(data1$`client dissatisfaction indicator`),col="blue")

As we see, the histogram does not follow a normal distribution, so we go to the next step.

Next, draw the histogram for the two groups (FREEDOM_OPERATOR and WALK-INs).

The distribution of the data by group is:

library(ggplot2)
ggplot(data1, aes(`client dissatisfaction indicator`, fill = `Maid Type (wlak-in / operator)`)) + 
   geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity',bins = 30) +
  geom_density(alpha = 0.2)+xlab(label = "client dissatisfaction indicator")

As we see here, neither group follows a normal distribution. The distribution for the WALK-INs type is closer to normal and suggests that client dissatisfaction is larger than for the OPERATOR type. This is a first observation, but we cannot be sure of it yet, so we go to the next step.

Please note that I am using a density histogram because the number of maids in each type is not equal.

Now we can look at a box plot to understand this situation.

ggplot(data1, aes(`Maid Type (wlak-in / operator)`, `client dissatisfaction indicator`)) + geom_boxplot(aes(fill = factor(`Maid Type (wlak-in / operator)`))) +
  labs(title = "BoxPlot", x = "Maid Type (wlak-in / operator)", y = "client dissatisfaction indicator")

Here the chart points to something important: the data has outliers, and client dissatisfaction for the WALK-INs type is higher than for OPERATOR. This means the data calls for inference.

Inference

As we saw, the data is not normally distributed and there are many outliers, so there are two approaches to testing the difference: a parametric test and a non-parametric test.

The parametric method requires some conditions that this data does not meet, such as a normal distribution. However, some hold that if the sample is large (more than 1000 observations, say) either approach can be used. So I will use a t-test to see the difference between the means of the two groups, and we will also apply the point-biserial correlation.
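
For the non-parametric route mentioned above, one common choice is the Wilcoxon rank-sum (Mann-Whitney) test, available in base R as `wilcox.test`. It is not part of this analysis; the sketch below only shows the call on simulated right-skewed data whose sizes and rates are arbitrary assumptions.

```r
set.seed(42)
# Simulated right-skewed samples; sizes and rates are arbitrary assumptions.
operator <- rexp(200, rate = 1 / 6)    # mean about 6
walkin   <- rexp(40,  rate = 1 / 20)   # mean about 20

# H1: operator values tend to be smaller than walk-in values.
res <- wilcox.test(operator, walkin, alternative = "less")
print(res$p.value)
```

The test compares ranks rather than means, so it does not rely on normality and is less sensitive to outliers.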

library(polycor)
# Prepare the data for testing
OPERATOR <- data1 %>% filter(`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR")
WALKIN <- data1 %>% filter(`Maid Type (wlak-in / operator)` == "WALKIN")
summary(OPERATOR$`client dissatisfaction indicator`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5391  2.6469  4.5666  5.9530  7.5536 59.1892

summary(WALKIN$`client dissatisfaction indicator`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.755   9.181  15.279  19.557  22.702  62.989

# Welch t-test, one-sided (H1: mean of OPERATOR is greater than mean of WALKIN)
t.test(OPERATOR$`client dissatisfaction indicator`,WALKIN$`client dissatisfaction indicator`,var.equal = FALSE,alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  OPERATOR$`client dissatisfaction indicator` and WALKIN$`client dissatisfaction indicator`
## t = -5.6168, df = 35.51, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -17.6948      Inf
## sample estimates:
## mean of x mean of y 
##  5.953016 19.557189

cor.test(data1$`client dissatisfaction indicator`,data1$type)

## 
##  Pearson's product-moment correlation
## 
## data:  data1$`client dissatisfaction indicator` and data1$type
## t = -12.862, df = 738, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4850380 -0.3671793
## sample estimates:
##        cor 
## -0.4279261

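As the results show, the mean client dissatisfaction indicator for the WALK-INs group (19.56) is greater than for OPERATOR (5.95). Note that the t-test above was run as a one-sided test of whether the OPERATOR mean exceeds the WALK-INs mean, which is why its p-value is 1; with t = -5.6168, it is the opposite one-sided alternative that the data supports. The point-biserial correlation agrees (r = -0.43, p-value < 2.2e-16): OPERATOR, coded as 1, goes with lower dissatisfaction.

So the result leads us to prefer OPERATOR with respect to client dissatisfaction.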
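
The direction of a one-sided t-test matters: `alternative = "greater"` tests whether the first sample has the larger mean, and `alternative = "less"` tests the opposite. A small sketch on simulated data (the group sizes, means and standard deviations are assumed values, not taken from the maids data) shows that the two one-sided p-values are complementary:

```r
set.seed(7)
# Simulated groups; x has the lower mean by construction (assumed values).
x <- rnorm(100, mean = 6,  sd = 5)
y <- rnorm(30,  mean = 20, sd = 12)

p_greater <- t.test(x, y, alternative = "greater")$p.value  # H1: mean(x) > mean(y)
p_less    <- t.test(x, y, alternative = "less")$p.value     # H1: mean(x) < mean(y)

# The two one-sided p-values are complementary.
print(c(greater = p_greater, less = p_less, sum = p_greater + p_less))
```

When the first sample has the smaller mean, the "greater" test returns a p-value close to 1 and the "less" test returns a small one.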

==========================================================================

Second Question

EDA (Exploratory Data Analysis)

Here the question is which type is faster to get hired by a client. I think we can measure this using two variables: the difference between the joining date and the date of first hire, and the Nbr. Of days unemployed (available).

We will apply the same analysis as before, so let's get started.

Prepare Data

I will select all maids who have joined the company using the following code:

data2<- data %>% filter(data$`date of joining the company`!=0)
data2$type<- if_else(data2$`Maid Type (wlak-in / operator)`=="FREEDOM_OPERATOR",1,0)

summary(data2$`difference between joining date and first time to get hired date`)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -8.3697   0.0000   0.0000   3.6117   0.7728 372.0827

We start with a histogram of the difference between the joining date and the date of first hire.

hist(data2$`difference between joining date and first time to get hired date`, xlab = "difference between joining date and first time to get hired date", main = "Hist of difference between joining date and first time to get hired date",prob=TRUE)
lines(density(data2$`difference between joining date and first time to get hired date`),col="blue")

Oops, here we see some negative values. After checking, I found wrong data for some maids, for example IDs 10082, 11110, 11318, 11320, 11378, 12542, 12558, 13069, 13095, 14517 and 19841 (these are only some of many; you can check them). So I will prepare the data again without the negative values.
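
A quick way to list the affected IDs before filtering them out is to index on the sign of the difference. This is a sketch on toy data; the column names `id` and `diff_join_hire` are hypothetical stand-ins for the real ones.

```r
# Toy data; column names `id` and `diff_join_hire` are hypothetical stand-ins.
maids <- data.frame(
  id             = c(101, 102, 103, 104),
  diff_join_hire = c(0.5, -8.4, 0, -1.2)
)

# IDs whose first hire is recorded before the joining date.
bad_ids <- maids$id[maids$diff_join_hire < 0]
print(bad_ids)

# Keep only the rows with a non-negative difference.
clean <- maids[maids$diff_join_hire >= 0, ]
```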

data2<- data %>% filter(data$`date of joining the company`!=0, data$`difference between joining date and first time to get hired date`>=0)
data2$type<- if_else(data2$`Maid Type (wlak-in / operator)`=="FREEDOM_OPERATOR",1,0)

Apply the histogram again:

hist(data2$`difference between joining date and first time to get hired date`, xlab = "difference between joining date and first time to get hired date", main = "Hist of difference between joining date and first time to get hired date",prob=TRUE)
lines(density(data2$`difference between joining date and first time to get hired date`),col="blue")

Then a histogram of the Nbr. Of days unemployed (available), to see the distribution of the data:

hist(data2$`Nbr. Of days unemployed (available)`, xlab = "Nbr. Of days unemployed", main = "Hist of Nbr. Of days unemployed",prob=TRUE)
lines(density(data2$`Nbr. Of days unemployed (available)`),col="blue")

We see that the histograms do not follow a normal distribution, so we go to the next step.

library(ggplot2)
ggplot(data2, aes(`difference between joining date and first time to get hired date`, fill = `Maid Type (wlak-in / operator)`)) +
     geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity') + xlab(label = "difference between joining date and first time to get hired date")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(ggplot2)
ggplot(data2, aes(`Nbr. Of days unemployed (available)`, fill = `Maid Type (wlak-in / operator)`)) +
     geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity') + xlab(label = "Nbr. Of days unemployed (available)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As you see, the two distributions are close to each other, so there may be no difference between the two groups; but, as you know, we have to run further tests.

The next one is the box plot.

ggplot(data2, aes(`Maid Type (wlak-in / operator)`, `difference between joining date and first time to get hired date`)) + geom_boxplot(aes(fill = factor(`Maid Type (wlak-in / operator)`))) +
  labs(title = "BoxPlot", x = "Maid Type (wlak-in / operator)", y = "difference between joining date and first time to get hired date")

This chart is confusing because of the outliers, so we move on to the other variable, Nbr. Of days unemployed (available).

ggplot(data2, aes(`Maid Type (wlak-in / operator)`, `Nbr. Of days unemployed (available)`)) + geom_boxplot(aes(fill = factor(`Maid Type (wlak-in / operator)`))) +
  labs(title = "BoxPlot", x = "Maid Type (wlak-in / operator)", y = "Nbr. Of days unemployed (available)")

This chart is less confusing, but there are still many outliers. In any case, the two boxes look the same, so there may be no difference between the two groups.

Inference

We will apply the same steps as before, using the t-test. Let's start with the difference between the joining date and the date of first hire.

# Prepare the data for testing
OPERATOR2 <- data2 %>% filter(`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR")
WALKIN2 <- data2 %>% filter(`Maid Type (wlak-in / operator)` == "WALKIN")
summary(OPERATOR2$`difference between joining date and first time to get hired date`)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.5479   4.4698   0.8130 372.0827

summary(WALKIN2$`difference between joining date and first time to get hired date`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.5990  3.0773  0.8814 51.4559

# Welch t-test (two-sided)
t.test(OPERATOR2$`difference between joining date and first time to get hired date`,WALKIN2$`difference between joining date and first time to get hired date`)

## 
##  Welch Two Sample t-test
## 
## data:  OPERATOR2$`difference between joining date and first time to get hired date` and WALKIN2$`difference between joining date and first time to get hired date`
## t = 2.3218, df = 899.38, p-value = 0.02047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2154216 2.5697157
## sample estimates:
## mean of x mean of y 
##  4.469850  3.077281

cor.test(data2$`difference between joining date and first time to get hired date`,data2$type)

## 
##  Pearson's product-moment correlation
## 
## data:  data2$`difference between joining date and first time to get hired date` and data2$type
## t = 1.4722, df = 1714, p-value = 0.1411
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01180195  0.08271883
## sample estimates:
##        cor 
## 0.03553792

Here the t-test shows a significant difference between the means of the two groups at the 95% confidence level, but the difference is small and the p-value is 0.02, which means that at the 0.01 significance level I would conclude there is no difference. In addition, the point-biserial correlation is not significant (p-value = 0.1411). So let's test Nbr. Of days unemployed (available), which may lead us to a more accurate decision.

# Prepare the data for testing
OPERATOR22 <- data2 %>% filter(`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR")
WALKIN22 <- data2 %>% filter(`Maid Type (wlak-in / operator)` == "WALKIN")
summary(OPERATOR22$`Nbr. Of days unemployed (available)`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00   14.00   24.34   32.00  750.00

summary(WALKIN22$`Nbr. Of days unemployed (available)`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     3.5    10.0    16.7    24.0   102.0

# Welch t-test (two-sided)
t.test(OPERATOR22$`Nbr. Of days unemployed (available)`,WALKIN22$`Nbr. Of days unemployed (available)`,var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  OPERATOR22$`Nbr. Of days unemployed (available)` and WALKIN22$`Nbr. Of days unemployed (available)`
## t = 5.0206, df = 1039.8, p-value = 6.058e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.648731 10.614003
## sample estimates:
## mean of x mean of y 
##  24.33520  16.70383

cor.test(data2$`Nbr. Of days unemployed (available)`,data2$type)

## 
##  Pearson's product-moment correlation
## 
## data:  data2$`Nbr. Of days unemployed (available)` and data2$type
## t = 3.0285, df = 1714, p-value = 0.002494
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02572492 0.11986236
## sample estimates:
##        cor 
## 0.07295614

Here we have a clear picture: WALK-INs are faster to get hired by a client. The average number of days each type of maid stays with us before getting hired by a client is:

OPERATOR = 24.34 days
WALK-INs = 16.70 days

Third Question

EDA (Exploratory Data Analysis)

Here the question is about the correlation between the type of maid and the number of failed interviews. I think we can measure this using Number of Failed interview, which equals Number Of Interviews done - Number of successful interviews (if it exists).

We will apply the same analysis as before, so let's get started.

Prepare Data

After exploring the data I found some illogical cases, such as maids who have a successful interview but no date of joining the company. So I will exclude the maids who don't have a date of joining the company, and keep only maids with at least one interview.

data3<- data %>% filter(data$`date of joining the company`!=0,data$`Number Of Interviews done` >= 1)
data3$type<- if_else(data3$`Maid Type (wlak-in / operator)`=="FREEDOM_OPERATOR",1,0)

summary(data3$`Number of Failed interview`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   6.000   8.531  14.000  58.000

We start with a histogram of the number of failed interviews.

hist(data3$`Number of Failed interview`, xlab = "Number of Failed interview", main = "Hist for Number of Failed interview",prob=TRUE)
lines(density(data3$`Number of Failed interview`),col="blue")

As we see, the histogram does not follow a normal distribution, so we go to the next step.

Next, draw the histogram for the two groups (FREEDOM_OPERATOR and WALK-INs).

The distribution of the data by group is:

library(ggplot2)
ggplot(data3, aes(`Number of Failed interview`, fill = `Maid Type (wlak-in / operator)`)) + 
   geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity',bins = 30) +
  geom_density(alpha = 0.2)+xlab(label = "Number of Failed interview")

As you see, the two distributions are close to each other, so there may be no difference between the two groups; but, as you know, we have to run further tests.

The next one is the box plot.

ggplot(data3, aes(`Maid Type (wlak-in / operator)`, `Number of Failed interview`)) + geom_boxplot(aes(fill = factor(`Maid Type (wlak-in / operator)`))) +
  labs(title = "BoxPlot", x = "Maid Type (wlak-in / operator)", y = "Number of Failed interview")

Here the chart shows that the data has outliers, and the number of failed interviews appears higher for the OPERATOR type than for WALK-INs, so we move on to inference.

Inference

We will apply the same steps as before, using the t-test and the point-biserial correlation. Let's start with Number of Failed interview.

# Prepare the data for testing
OPERATOR3 <- data3 %>% filter(`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR")
WALKIN3 <- data3 %>% filter(`Maid Type (wlak-in / operator)` == "WALKIN")
summary(OPERATOR3$`Number of Failed interview`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   6.000   8.942  15.000  58.000

summary(WALKIN3$`Number of Failed interview`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   4.000   6.467  10.000  53.000

# Welch t-test, one-sided (H1: mean of OPERATOR is greater than mean of WALKIN)
t.test(OPERATOR3$`Number of Failed interview`,WALKIN3$`Number of Failed interview`,alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  OPERATOR3$`Number of Failed interview` and WALKIN3$`Number of Failed interview`
## t = 5.232, df = 508.64, p-value = 1.228e-07
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  1.695983      Inf
## sample estimates:
## mean of x mean of y 
##  8.942405  6.466667

cor.test(data3$`Number of Failed interview`,data3$type)

## 
##  Pearson's product-moment correlation
## 
## data:  data3$`Number of Failed interview` and data3$type
## t = 4.6584, df = 1893, p-value = 3.408e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06172711 0.15076675
## sample estimates:
##       cor 
## 0.1064604

Both tests agree: OPERATOR maids have more failed interviews on average than WALK-INs (8.94 vs. 6.47), although the point-biserial correlation is weak (r = 0.11).

Fourth Question

Here the question is about the correlation between the type of maid and their termination rate. I think we can measure this relation using the chi-square test, which tells us about the association between two nominal variables.

Here we have two types of termination rate: (1) for maids and (2) for contracts. First I will prepare the data for the first type, then go to the other type.

Prepare Data

I will exclude the maids who don't have a date of joining the company, and prepare the data for the chi-square test of association.

data4<- data %>% filter(data$`date of joining the company`!=0)
data4$maids_Termination_rate<-if_else(is.na(data4$`date left`), "Exist", "NotExist")
table(data4$`Maid Type (wlak-in / operator)`,data4$maids_Termination_rate)

##                   
##                    Exist NotExist
##   FREEDOM_OPERATOR   901      777
##   WALKIN             280       42

Inference

Chi-square test

test<-chisq.test(x = data4$`Maid Type (wlak-in / operator)`,data4$maids_Termination_rate)
test$expected

##                   data4$maids_Termination_rate
##                      Exist NotExist
##   FREEDOM_OPERATOR 990.859  687.141
##   WALKIN           190.141  131.859

test

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data4$`Maid Type (wlak-in / operator)` and data4$maids_Termination_rate
## X-squared = 122.23, df = 1, p-value < 2.2e-16

From the test we see a p-value < 2.2e-16, which means we reject the null hypothesis: there is an association between maid type (walk-ins vs. operator) and termination. From the previous table we note that the termination rate for the WALK-INs type (42 of 322, about 13%) is lower than for OPERATOR (777 of 1678, about 46%).
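
The termination rates can be read directly from the contingency table as row proportions. The sketch below rebuilds the table from the counts printed above (nothing is loaded from the workbook) and reruns the chi-square test on it.

```r
# Counts copied from the table above (Exist = still with the company).
tab <- matrix(c(901, 280, 777, 42), nrow = 2,
              dimnames = list(type   = c("FREEDOM_OPERATOR", "WALKIN"),
                              status = c("Exist", "NotExist")))

# Row proportions: the "NotExist" column is the termination rate per type.
rates <- prop.table(tab, margin = 1)
print(round(rates, 3))

# Chi-square test on the same counts (Yates' correction is the 2x2 default).
res <- chisq.test(tab)
print(res$expected)
```

Reporting the row proportions next to the test makes the direction of the association visible, which the chi-square statistic alone does not show.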

Fifth Question

Which type to proceed with depends on the company's strategy. If the company focuses on client satisfaction, the results point to proceeding with OPERATOR. From my side, I think the WALK-INs type is better, because it has fewer failed interviews and gets hired faster.

End

Thank you for reading this file. I found this quiz really interesting.

With best regards

Analysis difference between walk-ins and operator maids using R

Obada Aladib

4/17/2020
