I prepared the data in Excel before importing it into R, so several steps were done in Excel. I explain them here.
First, I built the client dissatisfaction indicator: cancelled rate (cancelled times / years) + replaced rate (replaced times / years) + complaint rate (complaint times / years). It is used in the first question.
Second, the difference between the joining date and the date of first hire.
Third, the Number of Failed interview, which is Number Of Interviews done - Number of successful interviews (if exists).
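The derived variables described above were built in Excel, but the same arithmetic could be sketched in R as follows. This is only an illustration on toy data: the data frame `maids`, all column names, and the choice to treat a missing count of successful interviews as 0 are assumptions, not part of the original workbook.

```r
# Toy data; every column name and value here is hypothetical.
maids <- data.frame(
  cancelled = c(2, 0, 5),
  replaced = c(1, 1, 0),
  complaints = c(3, 0, 2),
  years = c(2, 1, 4),
  interviews_done = c(10, 3, 7),
  interviews_successful = c(4, NA, 7)
)

# Dissatisfaction indicator: sum of the three yearly rates.
maids$dissatisfaction <- with(
  maids,
  cancelled / years + replaced / years + complaints / years
)

# Failed interviews: interviews done minus successful ones.
# Assumption: a missing number of successful interviews counts as 0.
successful <- ifelse(is.na(maids$interviews_successful), 0,
                     maids$interviews_successful)
maids$failed_interviews <- maids$interviews_done - successful

print(maids[, c("dissatisfaction", "failed_interviews")])
```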
When loading the data into R, we have to be careful with maids in the statuses listed below:
This is because they do not have any values and would distort the data.
Getting the data into R: code
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readxl)
data<- read_excel("D:/Files/miad.cc/obada aladib test file 2020-04-17-17-35.xlsx",
sheet = "Table2", col_types = c("numeric",
"text", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"date", "numeric", "date", "text",
"text", "numeric", "numeric", "numeric",
"text", "numeric", "numeric", "date"))
We can explore the resulting data frame in the table below.
DT::datatable(data)
Before I begin, one important note. One way to measure the correlation between two variables, where one is numeric and the other is nominal with two levels, is the point-biserial correlation. I will apply it here together with another method, the t-test, to detect a difference between the means of the two groups.
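To make the link between the two methods concrete, here is a minimal sketch on simulated data (the variables `group` and `score` are hypothetical, not the maid data). With the group coded as 0/1, the point-biserial correlation is simply Pearson's correlation, and its t statistic matches the equal-variance two-sample t-test.

```r
set.seed(1)

# Simulated data: a 0/1 group label and a numeric score.
group <- rep(c(0, 1), times = c(40, 60))
score <- rnorm(100, mean = 5 + 2 * group)

# Point-biserial correlation = Pearson correlation with a 0/1 variable.
r_pb <- cor.test(score, group)

# Pooled (equal-variance) t-test comparing group 1 with group 0.
tt <- t.test(score[group == 1], score[group == 0], var.equal = TRUE)

print(r_pb$estimate)
# The two t statistics coincide.
print(all.equal(unname(r_pb$statistic), unname(tt$statistic)))
```

The equivalence holds only for the equal-variance t-test. A Welch test (`var.equal = FALSE`, the default of `t.test`) can give a different t value from `cor.test` on the same data.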
As noted, the data should exclude maids with certain statuses, as well as maids who were never hired by a client:
# note: `date left` != 0 also drops rows where `date left` is NA
data1 <- filter(data, `Times Hired by a client` != 0, `date left` != 0)
data1$type <- if_else(data1$`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR", 1, 0)
We start with a histogram of the client dissatisfaction indicator to see the distribution of the data.
hist(data1$`client dissatisfaction indicator`, xlab = "client dissatisfaction indicator", main = "Hist of client dissatisfaction indicator",prob=TRUE)
lines(density(data1$`client dissatisfaction indicator`),col="blue")
As we can see, the histogram does not follow a normal distribution, so we go to the next step:
draw the histogram for the two groups (FREEDOM_OPERATOR and WALK-INs).
The distribution of the data is shown below.
library(ggplot2)
ggplot(data1, aes(`client dissatisfaction indicator`, fill = `Maid Type (wlak-in / operator)`)) +
geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity', bins = 30) +
geom_density(alpha = 0.2) + xlab(label = "client dissatisfaction indicator")
As we see here, neither group has a normal distribution. The distribution for the WALK-INs type is closer to normal and suggests that client dissatisfaction is larger than for the OPERATOR type. This is a first observation, but we cannot be sure of it yet, so we go to the next step.
Please note that I use a density histogram because the number of maids in each type is not equal.
Now we can look at a box plot to understand the situation.
ggplot(data1, aes(`Maid Type (wlak-in / operator)`, `client dissatisfaction indicator`)) + geom_boxplot(aes(fill = factor(`Maid Type (wlak-in / operator)`))) +
labs(title = "BoxPlot", x = "Maid Type (wlak-in / operator)", y = "client dissatisfaction indicator")
Here the chart shows something important: the data has outliers, and client dissatisfaction for the WALK-INs type is higher than for OPERATOR, so we need inferential tests to confirm it.
As we saw, the data is not normally distributed and there are many outliers, so there are two approaches to testing the difference: a parametric one and a non-parametric one.
The parametric method requires some conditions that this data does not meet, such as a normal distribution. However, some argue that if the sample is large (meaning more than 1000 observations), both approaches can be used. So I will use a t-test to compare the means of the two groups, and we will also apply the point-biserial correlation.
library(polycor)
# prepare the data for the test
OPERATOR <- data1 %>% filter(`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR")
WALKIN <- data1 %>% filter(`Maid Type (wlak-in / operator)` == "WALKIN")
summary(OPERATOR$`client dissatisfaction indicator`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5391 2.6469 4.5666 5.9530 7.5536 59.1892
summary(WALKIN$`client dissatisfaction indicator`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.755 9.181 15.279 19.557 22.702 62.989
# run the t-test; alternative = "greater" tests mean(OPERATOR) > mean(WALKIN)
t.test(OPERATOR$`client dissatisfaction indicator`, WALKIN$`client dissatisfaction indicator`, var.equal = FALSE, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: OPERATOR$`client dissatisfaction indicator` and WALKIN$`client dissatisfaction indicator`
## t = -5.6168, df = 35.51, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -17.6948 Inf
## sample estimates:
## mean of x mean of y
## 5.953016 19.557189
cor.test(data1$`client dissatisfaction indicator`,data1$type)
##
## Pearson's product-moment correlation
##
## data: data1$`client dissatisfaction indicator` and data1$type
## t = -12.862, df = 738, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4850380 -0.3671793
## sample estimates:
## cor
## -0.4279261
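As the results show, the mean client dissatisfaction indicator for the WALK-INs group (19.56) is greater than for the OPERATOR group (5.95). Note that the t-test above was one-sided in the direction OPERATOR > WALK-INs, so its p-value of 1 only says there is no evidence for that direction. The large negative t statistic (-5.6168) and the negative point-biserial correlation (-0.43, p-value < 2.2e-16) both point to higher dissatisfaction for WALK-INs.
So, in terms of client dissatisfaction, the results lead us to prefer OPERATOR.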
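The non-parametric approach mentioned earlier was not run in this report. As a rough sketch of what it could look like, the Wilcoxon rank-sum (Mann-Whitney) test below compares two skewed samples without assuming normality. The vectors `operator` and `walkin` are simulated stand-ins, not the real indicator values, and the choice of this particular test is my assumption.

```r
set.seed(42)

# Simulated, right-skewed samples standing in for the two groups.
operator <- rexp(200, rate = 1 / 6)   # smaller values on average
walkin   <- rexp(40,  rate = 1 / 20)  # larger values on average

# One-sided test: are operator values shifted below walk-in values?
res <- wilcox.test(operator, walkin, alternative = "less")
print(res$p.value)
```

On the real data, the two columns of `client dissatisfaction indicator` from `OPERATOR` and `WALKIN` would take the place of the simulated vectors.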
==========================================================================
Here the question is which type is faster to get hired by a client. I think we can measure this with two variables: the difference between the joining date and the date of first hire, and the Nbr. Of days unemployed (available).
We will apply the same analysis as before, so let's get started.
I get the data for all maids who joined the company using the following code:
data2 <- data %>% filter(`date of joining the company` != 0)
data2$type <- if_else(data2$`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR", 1, 0)
summary(data2$`difference between joining date and first time to get hired date`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.3697 0.0000 0.0000 3.6117 0.7728 372.0827
We start with a histogram of the difference between the joining date and the date of first hire.
hist(data2$`difference between joining date and first time to get hired date`, xlab = "difference between joining date and first time to get hired date", main = "Hist of difference between joining date and first time to get hired date", prob = TRUE)
lines(density(data2$`difference between joining date and first time to get hired date`), col = "blue")
Oops, here we see some negative values! After checking, I found wrong data for some maids, for example IDs (10082, 11110, 11318, 11320, 11378, 12542, 12558, 13069, 13095, 14517, 19841). These are only some of many; you can check them. So I prepare the data again without the negative values.
data2 <- data %>% filter(`date of joining the company` != 0, `difference between joining date and first time to get hired date` >= 0)
data2$type <- if_else(data2$`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR", 1, 0)
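The check for negative differences can be sketched on a small toy table. The data frame `records` and its columns `id` and `days_to_first_hire` are hypothetical names used only for illustration. The sketch lists the suspicious IDs first and then keeps the valid rows.

```r
# Toy records; names and values are hypothetical.
records <- data.frame(
  id = c(1, 2, 3, 4),
  days_to_first_hire = c(0.5, -8.4, 12, -0.1)
)

# IDs whose first hire date is before the joining date (negative difference).
bad_ids <- records$id[records$days_to_first_hire < 0]

# Keep only rows with a non-negative difference.
clean <- records[records$days_to_first_hire >= 0, ]

print(bad_ids)
print(nrow(clean))
```

Listing the IDs before dropping the rows makes it easier to report the wrong records back to whoever maintains the source data.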
Draw the histogram again:
hist(data2$`difference between joining date and first time to get hired date`, xlab = "difference between joining date and first time to get hired date", main = "Hist of difference between joining date and first time to get hired date", prob = TRUE)
lines(density(data2$`difference between joining date and first time to get hired date`), col = "blue")
and a histogram of the Nbr. Of days unemployed (available) to see the distribution of the data:
hist(data2$`Nbr. Of days unemployed (available)`, xlab = "Nbr. Of days unemployed", main = "Hist of Nbr. Of days unemployed", prob = TRUE)
lines(density(data2$`Nbr. Of days unemployed (available)`), col = "blue")
We see that the histograms do not follow a normal distribution, so we go to the next step.
library(ggplot2)
ggplot(data2, aes(`difference between joining date and first time to get hired date`, fill = `Maid Type (wlak-in / operator)`)) +
geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity') + xlab(label = "difference between joining date and first time to get hired date")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(ggplot2)
ggplot(data2, aes(`Nbr. Of days unemployed (available)`, fill = `Maid Type (wlak-in / operator)`)) +
geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity') + xlab(label = "Nbr. Of days unemployed (available)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As you can see, the two distributions are close to each other, so there may be no difference between the two groups. Still, we have to run more tests.
The next one is the box plot.
ggplot(data2, aes(`Maid Type (wlak-in / operator)`, `difference between joining date and first time to get hired date`)) + geom_boxplot(aes(fill = factor(`Maid Type (wlak-in / operator)`))) +
labs(title = "BoxPlot", x = "Maid Type (wlak-in / operator)", y = "difference between joining date and first time to get hired date")
The chart is hard to read because of the outliers, so we move to the other variable, Nbr. Of days unemployed (available).
ggplot(data2, aes(`Maid Type (wlak-in / operator)`, `Nbr. Of days unemployed (available)`)) + geom_boxplot(aes(fill = factor(`Maid Type (wlak-in / operator)`))) +
labs(title = "BoxPlot", x = "Maid Type (wlak-in / operator)", y = "Nbr. Of days unemployed (available)")
This chart is less confusing, but there are still many outliers. In any case, the two boxes look similar, so there may be no difference between the two groups.
We will apply the same steps as before using the t-test. Let's start with the difference between the joining date and the date of first hire.
# prepare the data for the test
OPERATOR2 <- data2 %>% filter(`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR")
WALKIN2 <- data2 %>% filter(`Maid Type (wlak-in / operator)` == "WALKIN")
summary(OPERATOR2$`difference between joining date and first time to get hired date`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.5479 4.4698 0.8130 372.0827
summary(WALKIN2$`difference between joining date and first time to get hired date`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.5990 3.0773 0.8814 51.4559
# run the t-test
t.test(OPERATOR2$`difference between joining date and first time to get hired date`,WALKIN2$`difference between joining date and first time to get hired date`)
##
## Welch Two Sample t-test
##
## data: OPERATOR2$`difference between joining date and first time to get hired date` and WALKIN2$`difference between joining date and first time to get hired date`
## t = 2.3218, df = 899.38, p-value = 0.02047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2154216 2.5697157
## sample estimates:
## mean of x mean of y
## 4.469850 3.077281
cor.test(data2$`difference between joining date and first time to get hired date`,data2$type)
##
## Pearson's product-moment correlation
##
## data: data2$`difference between joining date and first time to get hired date` and data2$type
## t = 1.4722, df = 1714, p-value = 0.1411
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01180195 0.08271883
## sample estimates:
## cor
## 0.03553792
Here we see a statistically significant difference between the two group means at the 95% confidence level, but the difference is small and the p-value is 0.02, so at the 0.01 significance level I would conclude there is no difference. The point-biserial correlation is also not significant (p-value = 0.1411). So let's test Nbr. Of days unemployed (available), which may lead us to a more reliable decision.
# prepare the data for the test
OPERATOR22 <- data2 %>% filter(`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR")
WALKIN22 <- data2 %>% filter(`Maid Type (wlak-in / operator)` == "WALKIN")
summary(OPERATOR22$`Nbr. Of days unemployed (available)`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.00 14.00 24.34 32.00 750.00
summary(WALKIN22$`Nbr. Of days unemployed (available)`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 3.5 10.0 16.7 24.0 102.0
# run the t-test
t.test(OPERATOR22$`Nbr. Of days unemployed (available)`,WALKIN22$`Nbr. Of days unemployed (available)`,var.equal = F)
##
## Welch Two Sample t-test
##
## data: OPERATOR22$`Nbr. Of days unemployed (available)` and WALKIN22$`Nbr. Of days unemployed (available)`
## t = 5.0206, df = 1039.8, p-value = 6.058e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.648731 10.614003
## sample estimates:
## mean of x mean of y
## 24.33520 16.70383
cor.test(data2$`Nbr. Of days unemployed (available)`,data2$type)
##
## Pearson's product-moment correlation
##
## data: data2$`Nbr. Of days unemployed (available)` and data2$type
## t = 3.0285, df = 1714, p-value = 0.002494
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02572492 0.11986236
## sample estimates:
## cor
## 0.07295614
Here the picture is clear: WALK-INs are faster to get hired by a client. The average number of days each type of maid stays with us before getting hired by a client is about 24.3 days for OPERATOR and 16.7 days for WALK-INs.
Here the question is about the correlation between the type of maid and the number of failed interviews. I think we can measure this using Number of Failed interview, which equals Number Of Interviews done - Number of successful interviews (if exists).
We will apply the same analysis as before, so let's get started.
While exploring the data I found some illogical records, such as maids who have a successful interview but no date of joining the company! So I exclude maids without a "date of joining the company" and keep only maids with at least one interview.
data3 <- data %>% filter(`date of joining the company` != 0, `Number Of Interviews done` > 0)
data3$type <- if_else(data3$`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR", 1, 0)
summary(data3$`Number of Failed interview`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 6.000 8.531 14.000 58.000
We start with a histogram of the Number of Failed interview.
hist(data3$`Number of Failed interview`, xlab = "Number of Failed interview", main = "Hist for Number of Failed interview", prob = TRUE)
lines(density(data3$`Number of Failed interview`), col = "blue")
As we see, the histogram does not follow a normal distribution, so we go to the next step:
draw the histogram for the two groups (FREEDOM_OPERATOR and WALK-INs).
The distribution of the data is shown below.
library(ggplot2)
ggplot(data3, aes(`Number of Failed interview`, fill = `Maid Type (wlak-in / operator)`)) +
geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity', bins = 30) +
geom_density(alpha = 0.2) + xlab(label = "Number of Failed interview")
As you can see, the two distributions are close to each other, so there may be no difference between the two groups. Still, we have to run more tests.
The next one is the box plot.
ggplot(data3, aes(`Maid Type (wlak-in / operator)`, `Number of Failed interview`)) + geom_boxplot(aes(fill = factor(`Maid Type (wlak-in / operator)`))) +
labs(title = "BoxPlot", x = "Maid Type (wlak-in / operator)", y = "Number of Failed interview")
Here the chart shows that the data has outliers and that the number of failed interviews for the OPERATOR type looks higher than for WALK-INs, so we need inferential tests to confirm it.
We will apply the same steps as before using the t-test and the point-biserial correlation. Let's start with the Number of Failed interview.
# prepare the data for the test
OPERATOR3 <- data3 %>% filter(`Maid Type (wlak-in / operator)` == "FREEDOM_OPERATOR")
WALKIN3 <- data3 %>% filter(`Maid Type (wlak-in / operator)` == "WALKIN")
summary(OPERATOR3$`Number of Failed interview`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 6.000 8.942 15.000 58.000
summary(WALKIN3$`Number of Failed interview`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 4.000 6.467 10.000 53.000
# run the t-test; alternative = "greater" tests mean(OPERATOR3) > mean(WALKIN3)
t.test(OPERATOR3$`Number of Failed interview`, WALKIN3$`Number of Failed interview`, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: OPERATOR3$`Number of Failed interview` and WALKIN3$`Number of Failed interview`
## t = 5.232, df = 508.64, p-value = 1.228e-07
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 1.695983 Inf
## sample estimates:
## mean of x mean of y
## 8.942405 6.466667
cor.test(data3$`Number of Failed interview`,data3$type)
##
## Pearson's product-moment correlation
##
## data: data3$`Number of Failed interview` and data3$type
## t = 4.6584, df = 1893, p-value = 3.408e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06172711 0.15076675
## sample estimates:
## cor
## 0.1064604
As the results show, the mean Number of Failed interview for OPERATOR (8.94) is significantly greater than for WALK-INs (6.47), and the point-biserial correlation is positive but weak (0.11).
Here the question is about the correlation between the type of maid and their termination rate. I think we can measure this relation using the chi-square test, which tells us about the association between two nominal variables.
There are two types of termination rate here: (1) for maids and (2) for contracts. I will first prepare the data for the first type, then go to the other type.
I exclude the maids who do not have a "date of joining the company" and prepare the data for the chi-square test of association.
data4 <- data %>% filter(`date of joining the company` != 0)
# "Exist": no leaving date recorded; "NotExist": the maid has a leaving date
data4$maids_Termination_rate <- if_else(is.na(data4$`date left`), "Exist", "NotExist")
table(data4$`Maid Type (wlak-in / operator)`, data4$maids_Termination_rate)
##
## Exist NotExist
## FREEDOM_OPERATOR 901 777
## WALKIN 280 42
Chi-square test of association
test<-chisq.test(x = data4$`Maid Type (wlak-in / operator)`,data4$maids_Termination_rate)
test$expected
## data4$maids_Termination_rate
## Exist NotExist
## FREEDOM_OPERATOR 990.859 687.141
## WALKIN 190.141 131.859
test
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data4$`Maid Type (wlak-in / operator)` and data4$maids_Termination_rate
## X-squared = 122.23, df = 1, p-value < 2.2e-16
The test gives a p-value < 2.2e-16, so we reject the null hypothesis: there is an association between maid type (walk-ins vs operator) and termination. From the table above, the termination rate for the WALK-INs type (42 of 322, about 13%) is lower than for OPERATOR (777 of 1678, about 46%).
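As a cross-check of the rates, the contingency table printed above can be rebuilt directly from its four counts. The sketch below uses only those counts; the object names `counts`, `rates` and `res` are mine.

```r
# Counts copied from the table() output above.
counts <- matrix(
  c(901, 777,
    280,  42),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    type = c("FREEDOM_OPERATOR", "WALKIN"),
    status = c("Exist", "NotExist")
  )
)

# Row-wise proportions: share of each maid type in each status.
rates <- prop.table(counts, margin = 1)
print(round(rates, 3))

# Chi-square test on the 2x2 table (Yates' correction is the default).
res <- chisq.test(counts)
print(res$expected)
```

The `NotExist` column of `rates` is the share of maids of each type with a leaving date, which is the termination rate compared in the text.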
Which type to proceed with depends on the company's strategy. If the company focuses on client satisfaction, the results point to OPERATOR. From my side, I prefer the WALK-INs type, because it has fewer failed interviews and is faster to get hired.
Thank you for reading this file; I really enjoyed this quiz.
With best regards