Bivariate Analysis
# city & target
dat0 <- data.frame(table(data$city,data$target))
names(dat0) <- c("City","target","Count")
dat0<-spread(dat0,target,Count)
dat0['prob_of_stay']<-round(dat0['0']/(dat0['0']+dat0['1']),4)
dat0<-dat0[order(dat0[,"prob_of_stay"],decreasing=TRUE),]
dat0
## City 0 1 prob_of_stay
## 13 111 3 0 1.0000
## 26 129 3 0 1.0000
## 35 140 1 0 1.0000
## 63 2 7 0 1.0000
## 77 39 11 0 1.0000
## 93 62 5 0 1.0000
## 109 8 4 0 1.0000
## 112 82 4 0 1.0000
## 106 77 31 1 0.9688
## 121 97 96 8 0.9231
## 71 28 177 15 0.9219
## 75 36 147 13 0.9188
## 32 138 110 10 0.9167
## 66 23 166 16 0.9121
## 7 104 273 28 0.9070
## 92 61 178 19 0.9036
## 14 114 1203 133 0.9004
## 101 72 18 2 0.9000
## 122 98 71 8 0.8987
## 104 75 274 31 0.8984
## 64 20 26 3 0.8966
## 31 136 525 61 0.8959
## 2 10 77 9 0.8953
## 86 50 125 15 0.8929
## 8 105 70 9 0.8861
## 1 1 23 3 0.8846
## 49 16 1354 179 0.8832
## 97 69 15 2 0.8824
## 27 13 42 6 0.8750
## 56 173 132 19 0.8742
## 90 57 90 13 0.8738
## 96 67 374 57 0.8677
## 110 80 13 2 0.8667
## 46 157 19 3 0.8636
## 95 65 151 24 0.8629
## 94 64 98 16 0.8596
## 83 45 97 16 0.8584
## 72 30 18 3 0.8571
## 76 37 12 2 0.8571
## 111 81 6 1 0.8571
## 100 71 227 39 0.8534
## 78 40 58 10 0.8529
## 36 141 23 4 0.8519
## 48 159 80 14 0.8511
## 19 12 11 2 0.8462
## 123 99 79 15 0.8404
## 113 83 120 23 0.8392
## 20 120 5 1 0.8333
## 5 102 252 52 0.8289
## 16 116 106 22 0.8281
## 98 7 22 5 0.8148
## 70 27 38 9 0.8085
## 119 93 21 5 0.8077
## 24 127 8 2 0.8000
## 29 133 8 2 0.8000
## 79 41 71 18 0.7978
## 69 26 19 5 0.7917
## 6 103 3427 928 0.7869
## 57 175 11 3 0.7857
## 88 54 11 3 0.7857
## 44 152 40 11 0.7843
## 52 165 64 18 0.7805
## 67 24 48 14 0.7742
## 42 149 78 24 0.7647
## 50 160 646 199 0.7645
## 3 100 210 65 0.7636
## 115 89 51 16 0.7612
## 43 150 49 16 0.7538
## 34 14 21 7 0.7500
## 53 166 3 1 0.7500
## 60 18 3 1 0.7500
## 73 31 3 1 0.7500
## 102 73 206 74 0.7357
## 22 123 58 21 0.7342
## 84 46 93 35 0.7266
## 39 144 21 8 0.7241
## 116 9 13 5 0.7222
## 30 134 31 12 0.7209
## 105 76 36 14 0.7200
## 61 180 5 2 0.7143
## 89 55 10 4 0.7143
## 51 162 91 37 0.7109
## 58 176 17 7 0.7083
## 114 84 17 7 0.7083
## 54 167 7 3 0.7000
## 91 59 7 3 0.7000
## 37 142 37 16 0.6981
## 47 158 34 15 0.6939
## 17 117 9 4 0.6923
## 87 53 18 8 0.6923
## 117 90 135 62 0.6853
## 9 106 6 3 0.6667
## 11 109 6 3 0.6667
## 18 118 18 9 0.6667
## 21 121 2 1 0.6667
## 28 131 6 3 0.6667
## 82 44 12 6 0.6667
## 118 91 29 16 0.6444
## 41 146 5 3 0.6250
## 108 79 5 3 0.6250
## 62 19 74 45 0.6218
## 120 94 16 10 0.6154
## 15 115 33 21 0.6111
## 38 143 25 16 0.6098
## 99 70 25 19 0.5682
## 68 25 2 2 0.5000
## 107 78 15 16 0.4839
## 103 74 50 54 0.4808
## 23 126 13 15 0.4643
## 85 48 6 7 0.4615
## 25 128 40 52 0.4348
## 4 101 32 43 0.4267
## 81 43 5 7 0.4167
## 40 145 26 37 0.4127
## 65 21 1105 1597 0.4090
## 12 11 100 147 0.4049
## 59 179 2 3 0.4000
## 74 33 6 11 0.3529
## 10 107 2 4 0.3333
## 80 42 4 9 0.3077
## 45 155 3 11 0.2143
## 33 139 1 4 0.2000
## 55 171 0 1 0.0000
According to the table, all data scientists based in cities 111, 129,
140, 2, 39, 62, 8 and 82 chose to stay with their company and are not
looking for a job change. Each of these cities has at most 11 records,
so a prob_of_stay of 1 here rests on very few observations.
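Because a city with only a handful of records can reach a prob_of_stay of 1 by chance, it may help to rank only cities with enough observations. The following is a minimal sketch in base R, using a few rows copied from the table above; the threshold `min_n` of 30 and the column names `stay` and `leave` are assumptions made for illustration, not part of the original analysis.

```r
# A few rows copied from the table above (stay = column '0', leave = column '1')
cities <- data.frame(
  City  = c("111", "39", "77", "97", "103", "21"),
  stay  = c(3, 11, 31, 96, 3427, 1105),
  leave = c(0, 0, 1, 8, 928, 1597)
)
cities$total <- cities$stay + cities$leave
cities$prob_of_stay <- round(cities$stay / cities$total, 4)

# Hypothetical threshold: keep only cities with at least min_n records
min_n <- 30
reliable <- cities[cities$total >= min_n, ]

# Rank the remaining cities by probability of staying
reliable <- reliable[order(reliable$prob_of_stay, decreasing = TRUE), ]
print(reliable)
```

With this threshold, the small cities with a prob_of_stay of 1 drop out and the ranking is based on cities with more data behind them. The right threshold depends on how much uncertainty is acceptable.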
filter(dat0,prob_of_stay<0.5)
## City 0 1 prob_of_stay
## 1 78 15 16 0.4839
## 2 74 50 54 0.4808
## 3 126 13 15 0.4643
## 4 48 6 7 0.4615
## 5 128 40 52 0.4348
## 6 101 32 43 0.4267
## 7 43 5 7 0.4167
## 8 145 26 37 0.4127
## 9 21 1105 1597 0.4090
## 10 11 100 147 0.4049
## 11 179 2 3 0.4000
## 12 33 6 11 0.3529
## 13 107 2 4 0.3333
## 14 42 4 9 0.3077
## 15 155 3 11 0.2143
## 16 139 1 4 0.2000
## 17 171 0 1 0.0000
Out of the 123 cities where the data scientists live, only 17 have a
prob_of_stay below 0.5, i.e. data scientists there are more likely to
change jobs than to stay. Among these cities, the highest number of data
scientists who choose to leave is in city 21 (1597), followed by city 11
(147) and city 74 (54).
dat0['total']<-dat0['0']+dat0['1']
dat0<-dat0[order(dat0[,"total"],decreasing=TRUE),]
dat0
## City 0 1 prob_of_stay total
## 6 103 3427 928 0.7869 4355
## 65 21 1105 1597 0.4090 2702
## 49 16 1354 179 0.8832 1533
## 14 114 1203 133 0.9004 1336
## 50 160 646 199 0.7645 845
## 31 136 525 61 0.8959 586
## 96 67 374 57 0.8677 431
## 104 75 274 31 0.8984 305
## 5 102 252 52 0.8289 304
## 7 104 273 28 0.9070 301
## 102 73 206 74 0.7357 280
## 3 100 210 65 0.7636 275
## 100 71 227 39 0.8534 266
## 12 11 100 147 0.4049 247
## 92 61 178 19 0.9036 197
## 117 90 135 62 0.6853 197
## 71 28 177 15 0.9219 192
## 66 23 166 16 0.9121 182
## 95 65 151 24 0.8629 175
## 75 36 147 13 0.9188 160
## 56 173 132 19 0.8742 151
## 113 83 120 23 0.8392 143
## 86 50 125 15 0.8929 140
## 16 116 106 22 0.8281 128
## 84 46 93 35 0.7266 128
## 51 162 91 37 0.7109 128
## 32 138 110 10 0.9167 120
## 62 19 74 45 0.6218 119
## 94 64 98 16 0.8596 114
## 83 45 97 16 0.8584 113
## 121 97 96 8 0.9231 104
## 103 74 50 54 0.4808 104
## 90 57 90 13 0.8738 103
## 42 149 78 24 0.7647 102
## 48 159 80 14 0.8511 94
## 123 99 79 15 0.8404 94
## 25 128 40 52 0.4348 92
## 79 41 71 18 0.7978 89
## 2 10 77 9 0.8953 86
## 52 165 64 18 0.7805 82
## 122 98 71 8 0.8987 79
## 8 105 70 9 0.8861 79
## 22 123 58 21 0.7342 79
## 4 101 32 43 0.4267 75
## 78 40 58 10 0.8529 68
## 115 89 51 16 0.7612 67
## 43 150 49 16 0.7538 65
## 40 145 26 37 0.4127 63
## 67 24 48 14 0.7742 62
## 15 115 33 21 0.6111 54
## 37 142 37 16 0.6981 53
## 44 152 40 11 0.7843 51
## 105 76 36 14 0.7200 50
## 47 158 34 15 0.6939 49
## 27 13 42 6 0.8750 48
## 70 27 38 9 0.8085 47
## 118 91 29 16 0.6444 45
## 99 70 25 19 0.5682 44
## 30 134 31 12 0.7209 43
## 38 143 25 16 0.6098 41
## 106 77 31 1 0.9688 32
## 107 78 15 16 0.4839 31
## 64 20 26 3 0.8966 29
## 39 144 21 8 0.7241 29
## 34 14 21 7 0.7500 28
## 23 126 13 15 0.4643 28
## 36 141 23 4 0.8519 27
## 98 7 22 5 0.8148 27
## 18 118 18 9 0.6667 27
## 1 1 23 3 0.8846 26
## 119 93 21 5 0.8077 26
## 87 53 18 8 0.6923 26
## 120 94 16 10 0.6154 26
## 69 26 19 5 0.7917 24
## 58 176 17 7 0.7083 24
## 114 84 17 7 0.7083 24
## 46 157 19 3 0.8636 22
## 72 30 18 3 0.8571 21
## 101 72 18 2 0.9000 20
## 116 9 13 5 0.7222 18
## 82 44 12 6 0.6667 18
## 97 69 15 2 0.8824 17
## 74 33 6 11 0.3529 17
## 110 80 13 2 0.8667 15
## 76 37 12 2 0.8571 14
## 57 175 11 3 0.7857 14
## 88 54 11 3 0.7857 14
## 89 55 10 4 0.7143 14
## 45 155 3 11 0.2143 14
## 19 12 11 2 0.8462 13
## 17 117 9 4 0.6923 13
## 85 48 6 7 0.4615 13
## 80 42 4 9 0.3077 13
## 81 43 5 7 0.4167 12
## 77 39 11 0 1.0000 11
## 24 127 8 2 0.8000 10
## 29 133 8 2 0.8000 10
## 54 167 7 3 0.7000 10
## 91 59 7 3 0.7000 10
## 9 106 6 3 0.6667 9
## 11 109 6 3 0.6667 9
## 28 131 6 3 0.6667 9
## 41 146 5 3 0.6250 8
## 108 79 5 3 0.6250 8
## 63 2 7 0 1.0000 7
## 111 81 6 1 0.8571 7
## 61 180 5 2 0.7143 7
## 20 120 5 1 0.8333 6
## 10 107 2 4 0.3333 6
## 93 62 5 0 1.0000 5
## 59 179 2 3 0.4000 5
## 33 139 1 4 0.2000 5
## 109 8 4 0 1.0000 4
## 112 82 4 0 1.0000 4
## 53 166 3 1 0.7500 4
## 60 18 3 1 0.7500 4
## 73 31 3 1 0.7500 4
## 68 25 2 2 0.5000 4
## 13 111 3 0 1.0000 3
## 26 129 3 0 1.0000 3
## 21 121 2 1 0.6667 3
## 35 140 1 0 1.0000 1
## 55 171 0 1 0.0000 1
If we sort the table by total instead, most of the data scientists live
in city 103 (4355), followed by city 21 (2702) and city 16 (1533).
# city_development_index & target
dat1 <- data.frame(table(data$city_development_index,data$target))
names(dat1) <- c("city_development_index","target","Count")
dat1<-spread(dat1,target,Count)
dat1['total']<-dat1['0']+dat1['1']
dat1<-dat1[order(dat1[,"total"],decreasing=TRUE),]
dat1
## city_development_index 0 1 total
## 86 0.92 4073 1127 5200
## 15 0.624 1105 1597 2702
## 83 0.91 1354 179 1533
## 91 0.926 1203 133 1336
## 28 0.698 489 194 683
## 79 0.897 525 61 586
## 92 0.939 451 46 497
## 68 0.855 374 57 431
## 58 0.804 252 52 304
## 89 0.924 273 28 301
## 41 0.754 206 74 280
## 74 0.887 210 65 275
## 73 0.884 227 39 266
## 9 0.55 100 147 247
## 84 0.913 178 19 197
## 81 0.899 166 16 182
## 57 0.802 151 24 175
## 90 0.925 147 24 171
## 76 0.893 147 13 160
## 72 0.878 132 19 151
## 39 0.743 119 27 146
## 88 0.923 120 23 143
## 78 0.896 125 15 140
## 61 0.827 113 24 137
## 14 0.579 65 70 135
## 42 0.762 93 35 128
## 46 0.767 91 37 128
## 63 0.836 110 10 120
## 24 0.682 74 45 119
## 22 0.666 98 16 114
## 75 0.89 97 16 113
## 71 0.866 90 13 103
## 25 0.689 78 24 102
## 65 0.843 80 14 94
## 85 0.915 79 15 94
## 54 0.794 82 11 93
## 8 0.527 40 52 92
## 77 0.895 77 9 86
## 49 0.776 69 13 82
## 82 0.903 64 18 82
## 35 0.738 58 21 79
## 93 0.949 71 8 79
## 12 0.558 32 43 75
## 37 0.74 43 24 67
## 10 0.555 26 37 63
## 53 0.789 33 21 54
## 32 0.727 37 16 53
## 45 0.766 34 15 49
## 67 0.848 38 9 47
## 26 0.691 29 16 45
## 66 0.847 36 5 41
## 62 0.83 31 1 32
## 69 0.856 27 5 32
## 56 0.796 26 3 29
## 64 0.84 21 8 29
## 2 0.479 13 15 28
## 19 0.647 22 5 27
## 30 0.722 18 9 27
## 43 0.763 23 4 27
## 70 0.865 21 5 26
## 44 0.764 17 7 24
## 47 0.769 19 3 22
## 55 0.795 18 2 20
## 31 0.725 12 6 18
## 1 0.448 6 11 17
## 11 0.556 3 11 14
## 36 0.739 10 4 14
## 4 0.493 6 7 13
## 13 0.563 4 9 13
## 17 0.64 11 2 13
## 6 0.516 5 7 12
## 80 0.898 11 0 11
## 38 0.742 8 2 10
## 40 0.745 8 2 10
## 48 0.775 7 3 10
## 87 0.921 7 3 10
## 23 0.68 6 3 9
## 29 0.701 6 3 9
## 34 0.735 5 3 8
## 33 0.73 6 1 7
## 52 0.788 7 0 7
## 7 0.518 2 4 6
## 50 0.78 5 1 6
## 3 0.487 1 4 5
## 5 0.512 2 3 5
## 18 0.645 5 0 5
## 20 0.649 3 1 4
## 27 0.693 4 0 4
## 59 0.807 3 1 4
## 60 0.824 3 1 4
## 16 0.625 3 0 3
## 51 0.781 2 1 3
## 21 0.664 0 1 1
plot(data$city_development_index,data$target)

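From the scatter plot and the table above, it can be seen that most of
the entries in this dataset come from cities with a city development
index between about 0.8 and 0.94. The main exceptions are the cities
with a city development index of 0.624 (2702 entries), 0.698 (683),
0.754 (280) and 0.55 (247).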
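Since target is binary, a scatter plot against the index over-plots heavily, so an aggregated view may be easier to read. Below is a minimal sketch in base R that groups a few rows copied from the table above into index bands; the band boundaries (0.4, 0.6, 0.8, 1.0) and the column names `stay` and `leave` are assumptions chosen for illustration.

```r
# A few rows copied from the table above (stay = column '0', leave = column '1')
cdi <- data.frame(
  index = c(0.92, 0.624, 0.91, 0.926, 0.698, 0.55),
  stay  = c(4073, 1105, 1354, 1203, 489, 100),
  leave = c(1127, 1597, 179, 133, 194, 147)
)

# Assign each index value to a band (assumed boundaries)
cdi$band <- cut(cdi$index, breaks = c(0.4, 0.6, 0.8, 1.0))

# Sum the stay/leave counts within each band
by_band <- aggregate(cbind(stay, leave) ~ band, data = cdi, FUN = sum)
by_band$total <- by_band$stay + by_band$leave
by_band$prob_of_stay <- round(by_band$stay / by_band$total, 4)
print(by_band)
```

Only six index values are included here, so the band totals are not the totals for the full dataset; applying the same steps to the complete table would give the full picture.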
# Gender & target
dat2 <- data.frame(table(data$gender,data$target))
names(dat2) <- c("Gender","target","Count")
dat2
## Gender target Count
## 1 Female 0 912
## 2 Male 0 10209
## 3 Other 0 141
## 4 unknown 0 3119
## 5 Female 1 326
## 6 Male 1 3012
## 7 Other 1 50
## 8 unknown 1 1389
ggplot(data = dat2,aes(y = Gender,
x = Count, fill=target))+
geom_bar(stat="identity",position="dodge")+
labs(title = 'Number of target by gender', x= 'Number of target', y='Gender')

The bar chart shows that most of the data is collected from male data
scientists. For every gender category, more employees choose to stay
than to leave. Looking in more detail, the probability of staying with
the company is higher for males (about 77%) than for females (about
74%).
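The stay probabilities by gender can be computed directly from the counts rather than read off the chart. Here is a minimal sketch in base R that rebuilds the count table printed above and converts it to row proportions; it uses only `xtabs()` and `prop.table()`, and `stay_rate` is a name introduced here for illustration.

```r
# Counts copied from the Gender table above
dat2 <- data.frame(
  Gender = rep(c("Female", "Male", "Other", "unknown"), times = 2),
  target = rep(c("0", "1"), each = 4),
  Count  = c(912, 10209, 141, 3119, 326, 3012, 50, 1389)
)

# Cross-tabulate Gender by target, then take proportions within each row
counts <- xtabs(Count ~ Gender + target, data = dat2)
stay_rate <- round(prop.table(counts, margin = 1)[, "0"], 4)
print(stay_rate)
```

The same pattern should carry over to the other categorical variables in this section by swapping the grouping column.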
# relevant_experience & target
dat3 <- data.frame(table(data$relevant_experience,data$target))
names(dat3) <- c("Relevant_experience","target","Count")
dat3
## Relevant_experience target Count
## 1 no 0 3550
## 2 yes 0 10831
## 3 no 1 1816
## 4 yes 1 2961
ggplot(data = dat3,aes(y = Relevant_experience,
x = Count, fill=target))+
geom_bar(stat="identity",position="dodge")+
labs(title = 'Number of targets by relevant experience', x= 'Number of target', y='Relevant experience')

It can be seen from the bar chart that, regardless of whether they
have relevant work experience, more employees choose to stay than to
leave. Most of the data scientists in this dataset have relevant
experience, and the proportion who choose to stay with their company is
much higher among those with relevant experience (about 79%) than among
those without (about 66%).
# enrolled_university & target
dat4 <- data.frame(table(data$enrolled_university,data$target))
names(dat4) <- c("Enrolled_university","target","Count")
dat4
## Enrolled_university target Count
## 1 Full time course 0 2326
## 2 no_enrollment 0 10896
## 3 Part time course 0 896
## 4 unknown 0 263
## 5 Full time course 1 1431
## 6 no_enrollment 1 2921
## 7 Part time course 1 302
## 8 unknown 1 123
ggplot(data = dat4,aes(y = Enrolled_university,
x = Count, fill=target))+
geom_bar(stat="identity",position="dodge")+
labs(title = 'Number of targets by enrolled university', x= 'Number of target', y='Enrolled university')

The bar chart shows that more employees choose to stay than to leave,
regardless of their university enrollment. Most of the data is collected
from those who are not enrolled in any full-time or part-time university
course, and they are more prone to stay with the company (about 79%,
compared with about 62% for those in a full-time course).
# education_level & target
dat5 <- data.frame(table(data$education_level,data$target))
names(dat5) <- c("Education_level","target","Count")
dat5
## Education_level target Count
## 1 Graduate 0 8353
## 2 High School 0 1623
## 3 Masters 0 3426
## 4 Phd 0 356
## 5 Primary School 0 267
## 6 unknown 0 356
## 7 Graduate 1 3245
## 8 High School 1 394
## 9 Masters 1 935
## 10 Phd 1 58
## 11 Primary School 1 41
## 12 unknown 1 104
ggplot(data = dat5,aes(y = Education_level,
x = Count, fill=target))+
geom_bar(stat="identity",position="dodge")+
labs(title = 'Number of targets by education level', x= 'Number of target', y='Education level')

The bar chart shows that, regardless of education level, more employees
choose to stay than to leave. Graduates account for the largest share of
the data. We can also see from the graph that it is very rare for a data
scientist to have only a primary school education. Those with a high
school education are more likely to stay with the company (about 80%),
while graduates are the most likely to go for a job change (about 28%).
# major_discipline & target
dat6 <- data.frame(table(data$major_discipline,data$target))
names(dat6) <- c("Major_discipline","target","Count")
dat6
## Major_discipline target Count
## 1 Arts 0 200
## 2 Business Degree 0 241
## 3 Humanities 0 528
## 4 No Major 0 168
## 5 Other 0 279
## 6 STEM 0 10701
## 7 unknown 0 2264
## 8 Arts 1 53
## 9 Business Degree 1 86
## 10 Humanities 1 141
## 11 No Major 1 55
## 12 Other 1 102
## 13 STEM 1 3791
## 14 unknown 1 549
ggplot(data = dat6,aes(y = Major_discipline,
x = Count, fill=target))+
geom_bar(stat="identity",position="dodge")+
labs(title = 'Number of targets by major discipline', x= 'Number of target', y='Major discipline')

From the bar chart, we can observe that most of the employees working
as data scientists majored in STEM.
# experience & target
dat7 <- data.frame(table(data$experience,data$target))
names(dat7) <- c("Experience","target","Count")
dat7<-spread(dat7,target,Count)
dat7['prob_of_stay']<-round(dat7['0']/(dat7['0']+dat7['1']),4)
dat7<-dat7[order(dat7[,"prob_of_stay"],decreasing=TRUE),]
dat7
## Experience 0 1 prob_of_stay
## 10 16 436 72 0.8583
## 12 18 237 43 0.8464
## 2 >20 2825 526 0.8430
## 9 15 572 114 0.8338
## 11 17 285 57 0.8333
## 13 19 251 53 0.8257
## 8 14 479 107 0.8174
## 6 12 402 92 0.8138
## 7 13 322 77 0.8070
## 4 10 778 207 0.7898
## 22 9 767 213 0.7827
## 15 20 115 33 0.7770
## 5 11 513 151 0.7726
## 21 8 607 195 0.7569
## 19 6 873 343 0.7179
## 18 5 1018 412 0.7119
## 20 7 725 303 0.7053
## 17 4 946 457 0.6743
## 14 2 753 374 0.6681
## 16 3 876 478 0.6470
## 3 1 316 233 0.5756
## 1 <1 285 237 0.5460
We can observe that data scientists with less experience are more
likely to go for a job change, while those with more than 10 years of
experience have the highest probability of staying with the company.
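Experience is stored as text ("<1", ">20", "10", ...), so it does not sort in chronological order by default. A minimal sketch in base R of one way to order it and check the trend, using a few rows copied from the table above: the helper `exp_to_num()` is hypothetical, and coding "<1" as 0 and ">20" as 21 is an assumption made only so the values can be ranked.

```r
# A few rows copied from the table above (stay = column '0', leave = column '1')
dat7 <- data.frame(
  Experience = c("<1", "1", "3", "5", "10", "16", ">20"),
  stay  = c(285, 316, 876, 1018, 778, 436, 2825),
  leave = c(237, 233, 478, 412, 207, 72, 526)
)

# Hypothetical helper: map the text labels to numbers for ordering
exp_to_num <- function(x) {
  x <- as.character(x)
  out <- suppressWarnings(as.numeric(x))  # "<1" and ">20" become NA here
  out[x == "<1"] <- 0                     # assumed coding
  out[x == ">20"] <- 21                   # assumed coding
  out
}

dat7$years <- exp_to_num(dat7$Experience)
dat7$prob_of_stay <- round(dat7$stay / (dat7$stay + dat7$leave), 4)
dat7 <- dat7[order(dat7$years), ]
print(dat7)

# Spearman rank correlation between years of experience and prob_of_stay
rho <- cor(dat7$years, dat7$prob_of_stay, method = "spearman")
print(rho)
```

A rank correlation is used because the coded values for "<1" and ">20" are only meaningful as an ordering, not as exact numbers of years.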
# company_size & target
dat8 <- data.frame(table(data$company_size,data$target))
names(dat8) <- c("Company_size","target","Count")
dat8
## Company_size target Count
## 1 <10 0 1084
## 2 >9999 0 1634
## 3 10-49 0 1127
## 4 100-499 0 2156
## 5 1000-4999 0 1128
## 6 50-99 0 2538
## 7 500-999 0 725
## 8 5000-9999 0 461
## 9 unknown 0 3528
## 10 <10 1 224
## 11 >9999 1 385
## 12 10-49 1 344
## 13 100-499 1 415
## 14 1000-4999 1 200
## 15 50-99 1 545
## 16 500-999 1 152
## 17 5000-9999 1 102
## 18 unknown 1 2410
ggplot(data = dat8,aes(y = Company_size, x = Count, fill=target))+
geom_bar(stat="identity",position="dodge")+
labs(title = 'Number of targets by company size', x= 'Number of target', y='Company size')

Since the ‘unknown’ category represents missing values, it will not be
considered. It can be seen from the bar chart that most of the data
scientists work in companies with 50-99 employees.
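Dropping the ‘unknown’ category and totalling the remaining sizes can be done in a few lines. The sketch below is a minimal base R example that rebuilds the count table printed above; the names `known`, `totals`, `largest` and `unknown_share` are introduced here for illustration.

```r
# Counts copied from the Company_size table above
sizes <- c("<10", ">9999", "10-49", "100-499", "1000-4999",
           "50-99", "500-999", "5000-9999", "unknown")
dat8 <- data.frame(
  Company_size = rep(sizes, times = 2),
  target = rep(c("0", "1"), each = 9),
  Count  = c(1084, 1634, 1127, 2156, 1128, 2538, 725, 461, 3528,
             224, 385, 344, 415, 200, 545, 152, 102, 2410)
)

# Share of records whose company size is missing
unknown_share <- sum(dat8$Count[dat8$Company_size == "unknown"]) / sum(dat8$Count)
print(round(unknown_share, 4))

# Exclude the missing category, then total stay + leave per company size
known <- dat8[dat8$Company_size != "unknown", ]
totals <- sort(tapply(known$Count, known$Company_size, sum), decreasing = TRUE)
print(totals)

# Company size with the most data scientists
largest <- names(totals)[1]
print(largest)
```

Printing the share of ‘unknown’ records alongside the totals makes it visible how much data is set aside when the category is excluded.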
# company_type & target
dat9 <- data.frame(table(data$company_type,data$target))
names(dat9) <- c("Company_type","target","Count")
dat9
## Company_type target Count
## 1 Early Stage Startup 0 461
## 2 Funded Startup 0 861
## 3 NGO 0 424
## 4 Other 0 92
## 5 Public Sector 0 745
## 6 Pvt Ltd 0 8042
## 7 unknown 0 3756
## 8 Early Stage Startup 1 142
## 9 Funded Startup 1 140
## 10 NGO 1 97
## 11 Other 1 29
## 12 Public Sector 1 210
## 13 Pvt Ltd 1 1775
## 14 unknown 1 2384
ggplot(data = dat9,aes(y =Company_type,
x = Count, fill=target))+
geom_bar(stat="identity",position="dodge")+
labs(title = 'Number of targets by company type', x= 'Number of target', y='Company type')

We can observe that most of the data scientists work in companies of
type Pvt Ltd. Funded startups are best able to retain their data
scientist workforce (about 86% stay).
# last_new_job & target
dat10 <- data.frame(table(data$last_new_job,data$target))
names(dat10) <- c("Last_new_job","target","Count")
dat10
## Last_new_job target Count
## 1 >4 0 2690
## 2 0 0 1713
## 3 1 0 5915
## 4 2 0 2200
## 5 3 0 793
## 6 4 0 801
## 7 unknown 0 269
## 8 >4 1 600
## 9 0 1 739
## 10 1 1 2125
## 11 2 1 700
## 12 3 1 231
## 13 4 1 228
## 14 unknown 1 154
ggplot(data = dat10,aes(y = Last_new_job, x = Count, fill=target))+
geom_bar(stat="identity",position="dodge")+
labs(title = 'Number of targets by last_new_job', x= 'Number of target', y='last_new_job')

We can observe that employees who left their last job about a year ago
form the largest group, followed by those who left more than four years
ago and two years ago. If we analyse the data by probability, the data
scientists who have not changed jobs for more than 4 years are the most
likely to continue working for their company (about 82% stay). However,
those who left their previous job less than 1 year ago (category 0) are
the most likely to leave the company they are working for now (about 30%
leave).
# training_hours & target
dat11 <- data.frame(table(data$training_hours,data$target))
names(dat11) <- c("training_hours","target","Count")
dat11<-spread(dat11,target,Count)
dat11['total']<-dat11['0']+dat11['1']
dat11['prob_of_stay']<-dat11['0']/dat11['total']
dat11<-dat11[order(dat11[,"total"],decreasing=TRUE),]
dat11
## training_hours 0 1 total prob_of_stay
## 28 28 250 79 329 0.7598784
## 12 12 229 63 292 0.7842466
## 18 18 215 76 291 0.7388316
## 22 22 220 62 282 0.7801418
## 50 50 195 84 279 0.6989247
## 20 20 210 68 278 0.7553957
## 17 17 202 71 273 0.7399267
## 24 24 200 73 273 0.7326007
## 6 6 193 68 261 0.7394636
## 34 34 194 67 261 0.7432950
## 23 23 192 66 258 0.7441860
## 21 21 186 70 256 0.7265625
## 26 26 182 72 254 0.7165354
## 56 56 175 75 250 0.7000000
## 42 42 186 56 242 0.7685950
## 10 10 175 66 241 0.7261411
## 11 11 187 50 237 0.7890295
## 48 48 174 63 237 0.7341772
## 9 9 162 72 234 0.6923077
## 14 14 177 54 231 0.7662338
## 15 15 174 56 230 0.7565217
## 8 8 176 51 227 0.7753304
## 4 4 186 38 224 0.8303571
## 46 46 163 60 223 0.7309417
## 13 13 148 65 213 0.6948357
## 36 36 156 55 211 0.7393365
## 7 7 148 61 209 0.7081340
## 32 32 144 63 207 0.6956522
## 44 44 151 54 205 0.7365854
## 25 25 143 56 199 0.7185930
## 43 43 141 58 199 0.7085427
## 52 52 155 41 196 0.7908163
## 16 16 145 47 192 0.7552083
## 40 40 137 55 192 0.7135417
## 30 30 139 48 187 0.7433155
## 31 31 144 40 184 0.7826087
## 29 29 132 47 179 0.7374302
## 39 39 131 47 178 0.7359551
## 51 51 122 54 176 0.6931818
## 45 45 129 46 175 0.7371429
## 55 55 131 40 171 0.7660819
## 78 78 107 58 165 0.6484848
## 19 19 119 44 163 0.7300613
## 37 37 125 38 163 0.7668712
## 35 35 124 38 162 0.7654321
## 54 54 127 34 161 0.7888199
## 47 47 117 40 157 0.7452229
## 72 72 120 33 153 0.7843137
## 33 33 114 36 150 0.7600000
## 41 41 117 28 145 0.8068966
## 80 80 110 34 144 0.7638889
## 57 57 102 40 142 0.7183099
## 101 102 95 42 137 0.6934307
## 53 53 99 37 136 0.7279412
## 64 64 102 30 132 0.7727273
## 70 70 108 24 132 0.8181818
## 58 58 98 33 131 0.7480916
## 74 74 99 32 131 0.7557252
## 62 62 91 37 128 0.7109375
## 93 94 97 27 124 0.7822581
## 95 96 92 31 123 0.7479675
## 3 3 87 32 119 0.7310924
## 90 90 96 22 118 0.8135593
## 27 27 87 29 116 0.7500000
## 38 38 88 27 115 0.7652174
## 68 68 77 36 113 0.6814159
## 99 100 87 26 113 0.7699115
## 84 84 89 22 111 0.8018018
## 5 5 81 26 107 0.7570093
## 66 66 85 22 107 0.7943925
## 61 61 73 25 98 0.7448980
## 82 82 78 20 98 0.7959184
## 60 60 75 22 97 0.7731959
## 92 92 76 21 97 0.7835052
## 2 2 71 25 96 0.7395833
## 111 112 66 30 96 0.6875000
## 107 108 76 19 95 0.8000000
## 86 86 76 18 94 0.8085106
## 105 106 76 18 94 0.8085106
## 88 88 66 26 92 0.7173913
## 67 67 70 20 90 0.7777778
## 77 77 65 23 88 0.7386364
## 83 83 68 18 86 0.7906977
## 76 76 59 24 83 0.7108434
## 65 65 62 17 79 0.7848101
## 97 98 63 16 79 0.7974684
## 69 69 61 17 78 0.7820513
## 109 110 57 20 77 0.7402597
## 63 63 54 21 75 0.7200000
## 162 166 56 15 71 0.7887324
## 59 59 57 12 69 0.8260870
## 91 91 49 19 68 0.7205882
## 113 114 48 19 67 0.7164179
## 103 104 48 17 65 0.7384615
## 104 105 56 9 65 0.8615385
## 89 89 47 17 64 0.7343750
## 114 116 41 23 64 0.6406250
## 106 107 43 20 63 0.6825397
## 73 73 48 14 62 0.7741935
## 81 81 46 16 62 0.7419355
## 79 79 48 13 61 0.7868852
## 85 85 52 9 61 0.8524590
## 110 111 39 21 60 0.6500000
## 108 109 42 17 59 0.7118644
## 153 156 38 20 58 0.6551724
## 75 75 40 17 57 0.7017544
## 87 87 35 20 55 0.6363636
## 132 134 39 15 54 0.7222222
## 156 160 44 10 54 0.8148148
## 128 130 42 9 51 0.8235294
## 143 146 38 12 50 0.7600000
## 49 49 36 12 48 0.7500000
## 120 122 37 11 48 0.7708333
## 137 140 34 14 48 0.7083333
## 98 99 37 10 47 0.7872340
## 149 152 35 11 46 0.7608696
## 112 113 36 9 45 0.8000000
## 141 144 32 12 44 0.7272727
## 96 97 33 10 43 0.7674419
## 122 124 32 11 43 0.7441860
## 147 150 31 12 43 0.7209302
## 155 158 36 6 42 0.8571429
## 151 154 35 6 41 0.8536585
## 158 162 32 9 41 0.7804878
## 134 136 28 12 40 0.7000000
## 135 138 30 10 40 0.7500000
## 171 182 31 9 40 0.7750000
## 164 168 26 13 39 0.6666667
## 175 192 26 13 39 0.6666667
## 100 101 28 10 38 0.7368421
## 116 118 26 12 38 0.6842105
## 94 95 25 12 37 0.6756757
## 126 128 29 8 37 0.7837838
## 102 103 26 9 35 0.7428571
## 124 126 29 5 34 0.8529412
## 145 148 26 8 34 0.7647059
## 189 222 27 6 33 0.8181818
## 169 178 25 7 32 0.7812500
## 181 204 27 5 32 0.8437500
## 142 145 23 8 31 0.7419355
## 185 214 24 7 31 0.7741935
## 130 132 21 9 30 0.7000000
## 159 163 24 5 29 0.8275862
## 170 180 23 6 29 0.7931034
## 177 196 21 8 29 0.7241379
## 183 210 24 5 29 0.8275862
## 167 174 19 9 28 0.6785714
## 182 206 24 4 28 0.8571429
## 129 131 23 4 27 0.8518519
## 154 157 20 7 27 0.7407407
## 173 188 19 8 27 0.7037037
## 178 198 18 9 27 0.6666667
## 133 135 21 5 26 0.8076923
## 138 141 19 7 26 0.7307692
## 146 149 23 3 26 0.8846154
## 152 155 19 6 25 0.7600000
## 172 184 18 7 25 0.7200000
## 123 125 18 6 24 0.7500000
## 165 170 21 3 24 0.8750000
## 187 218 20 4 24 0.8333333
## 131 133 19 4 23 0.8260870
## 136 139 16 7 23 0.6956522
## 179 200 18 5 23 0.7826087
## 140 143 20 2 22 0.9090909
## 115 117 14 7 21 0.6666667
## 125 127 17 4 21 0.8095238
## 148 151 13 8 21 0.6190476
## 180 202 15 6 21 0.7142857
## 71 71 14 6 20 0.7000000
## 121 123 18 2 20 0.9000000
## 127 129 15 5 20 0.7500000
## 139 142 14 6 20 0.7000000
## 190 224 17 3 20 0.8500000
## 191 226 16 4 20 0.8000000
## 160 164 15 4 19 0.7894737
## 166 172 14 5 19 0.7368421
## 168 176 16 3 19 0.8421053
## 176 194 16 3 19 0.8421053
## 117 119 13 5 18 0.7222222
## 157 161 15 3 18 0.8333333
## 174 190 15 3 18 0.8333333
## 118 120 14 3 17 0.8235294
## 163 167 13 4 17 0.7647059
## 186 216 12 5 17 0.7058824
## 188 220 12 5 17 0.7058824
## 119 121 13 3 16 0.8125000
## 161 165 12 4 16 0.7500000
## 184 212 12 4 16 0.7500000
## 150 153 10 5 15 0.6666667
## 204 256 14 1 15 0.9333333
## 208 264 13 2 15 0.8666667
## 231 314 12 3 15 0.8000000
## 236 326 13 2 15 0.8666667
## 144 147 11 3 14 0.7857143
## 193 232 12 2 14 0.8571429
## 200 246 12 2 14 0.8571429
## 214 278 14 0 14 1.0000000
## 228 308 10 4 14 0.7142857
## 202 250 11 2 13 0.8461538
## 205 258 7 6 13 0.5384615
## 223 298 7 6 13 0.5384615
## 226 304 10 3 13 0.7692308
## 234 322 11 2 13 0.8461538
## 198 242 12 0 12 1.0000000
## 207 262 11 1 12 0.9166667
## 219 288 10 2 12 0.8333333
## 224 300 11 1 12 0.9166667
## 227 306 10 2 12 0.8333333
## 230 312 11 1 12 0.9166667
## 232 316 9 3 12 0.7500000
## 239 332 8 4 12 0.6666667
## 201 248 8 3 11 0.7272727
## 210 268 6 5 11 0.5454545
## 221 292 8 3 11 0.7272727
## 237 328 9 2 11 0.8181818
## 238 330 10 1 11 0.9090909
## 240 334 9 2 11 0.8181818
## 241 336 8 3 11 0.7272727
## 233 320 9 1 10 0.9000000
## 1 1 7 2 9 0.7777778
## 203 254 8 1 9 0.8888889
## 206 260 8 1 9 0.8888889
## 217 284 6 3 9 0.6666667
## 220 290 5 4 9 0.5555556
## 225 302 6 3 9 0.6666667
## 235 324 7 2 9 0.7777778
## 199 244 6 2 8 0.7500000
## 216 282 6 2 8 0.7500000
## 229 310 8 0 8 1.0000000
## 192 228 3 4 7 0.4285714
## 195 236 7 0 7 1.0000000
## 197 240 6 1 7 0.8571429
## 211 270 4 3 7 0.5714286
## 215 280 4 3 7 0.5714286
## 209 266 5 1 6 0.8333333
## 213 276 6 0 6 1.0000000
## 222 294 6 0 6 1.0000000
## 194 234 5 0 5 1.0000000
## 212 272 4 1 5 0.8000000
## 218 286 2 3 5 0.4000000
## 196 238 4 0 4 1.0000000
plot(data$training_hours,data$target)

From the scatter plot, it can be seen that most of the data scientists
have completed about 150 hours of training or less.
dat11<-dat11[order(dat11[,"prob_of_stay"],decreasing=TRUE),]
dat11
## training_hours 0 1 total prob_of_stay
## 214 278 14 0 14 1.0000000
## 198 242 12 0 12 1.0000000
## 229 310 8 0 8 1.0000000
## 195 236 7 0 7 1.0000000
## 213 276 6 0 6 1.0000000
## 222 294 6 0 6 1.0000000
## 194 234 5 0 5 1.0000000
## 196 238 4 0 4 1.0000000
## 204 256 14 1 15 0.9333333
## 207 262 11 1 12 0.9166667
## 224 300 11 1 12 0.9166667
## 230 312 11 1 12 0.9166667
## 140 143 20 2 22 0.9090909
## 238 330 10 1 11 0.9090909
## 121 123 18 2 20 0.9000000
## 233 320 9 1 10 0.9000000
## 203 254 8 1 9 0.8888889
## 206 260 8 1 9 0.8888889
## 146 149 23 3 26 0.8846154
## 165 170 21 3 24 0.8750000
## 208 264 13 2 15 0.8666667
## 236 326 13 2 15 0.8666667
## 104 105 56 9 65 0.8615385
## 155 158 36 6 42 0.8571429
## 182 206 24 4 28 0.8571429
## 193 232 12 2 14 0.8571429
## 200 246 12 2 14 0.8571429
## 197 240 6 1 7 0.8571429
## 151 154 35 6 41 0.8536585
## 124 126 29 5 34 0.8529412
## 85 85 52 9 61 0.8524590
## 129 131 23 4 27 0.8518519
## 190 224 17 3 20 0.8500000
## 202 250 11 2 13 0.8461538
## 234 322 11 2 13 0.8461538
## 181 204 27 5 32 0.8437500
## 168 176 16 3 19 0.8421053
## 176 194 16 3 19 0.8421053
## 187 218 20 4 24 0.8333333
## 157 161 15 3 18 0.8333333
## 174 190 15 3 18 0.8333333
## 219 288 10 2 12 0.8333333
## 227 306 10 2 12 0.8333333
## 209 266 5 1 6 0.8333333
## 4 4 186 38 224 0.8303571
## 159 163 24 5 29 0.8275862
## 183 210 24 5 29 0.8275862
## 59 59 57 12 69 0.8260870
## 131 133 19 4 23 0.8260870
## 128 130 42 9 51 0.8235294
## 118 120 14 3 17 0.8235294
## 70 70 108 24 132 0.8181818
## 189 222 27 6 33 0.8181818
## 237 328 9 2 11 0.8181818
## 240 334 9 2 11 0.8181818
## 156 160 44 10 54 0.8148148
## 90 90 96 22 118 0.8135593
## 119 121 13 3 16 0.8125000
## 125 127 17 4 21 0.8095238
## 86 86 76 18 94 0.8085106
## 105 106 76 18 94 0.8085106
## 133 135 21 5 26 0.8076923
## 41 41 117 28 145 0.8068966
## 84 84 89 22 111 0.8018018
## 107 108 76 19 95 0.8000000
## 112 113 36 9 45 0.8000000
## 191 226 16 4 20 0.8000000
## 231 314 12 3 15 0.8000000
## 212 272 4 1 5 0.8000000
## 97 98 63 16 79 0.7974684
## 82 82 78 20 98 0.7959184
## 66 66 85 22 107 0.7943925
## 170 180 23 6 29 0.7931034
## 52 52 155 41 196 0.7908163
## 83 83 68 18 86 0.7906977
## 160 164 15 4 19 0.7894737
## 11 11 187 50 237 0.7890295
## 54 54 127 34 161 0.7888199
## 162 166 56 15 71 0.7887324
## 98 99 37 10 47 0.7872340
## 79 79 48 13 61 0.7868852
## 144 147 11 3 14 0.7857143
## 65 65 62 17 79 0.7848101
## 72 72 120 33 153 0.7843137
## 12 12 229 63 292 0.7842466
## 126 128 29 8 37 0.7837838
## 92 92 76 21 97 0.7835052
## 31 31 144 40 184 0.7826087
## 179 200 18 5 23 0.7826087
## 93 94 97 27 124 0.7822581
## 69 69 61 17 78 0.7820513
## 169 178 25 7 32 0.7812500
## 158 162 32 9 41 0.7804878
## 22 22 220 62 282 0.7801418
## 67 67 70 20 90 0.7777778
## 1 1 7 2 9 0.7777778
## 235 324 7 2 9 0.7777778
## 8 8 176 51 227 0.7753304
## 171 182 31 9 40 0.7750000
## 73 73 48 14 62 0.7741935
## 185 214 24 7 31 0.7741935
## 60 60 75 22 97 0.7731959
## 64 64 102 30 132 0.7727273
## 120 122 37 11 48 0.7708333
## 99 100 87 26 113 0.7699115
## 226 304 10 3 13 0.7692308
## 42 42 186 56 242 0.7685950
## 96 97 33 10 43 0.7674419
## 37 37 125 38 163 0.7668712
## 14 14 177 54 231 0.7662338
## 55 55 131 40 171 0.7660819
## 35 35 124 38 162 0.7654321
## 38 38 88 27 115 0.7652174
## 145 148 26 8 34 0.7647059
## 163 167 13 4 17 0.7647059
## 80 80 110 34 144 0.7638889
## 149 152 35 11 46 0.7608696
## 33 33 114 36 150 0.7600000
## 143 146 38 12 50 0.7600000
## 152 155 19 6 25 0.7600000
## 28 28 250 79 329 0.7598784
## 5 5 81 26 107 0.7570093
## 15 15 174 56 230 0.7565217
## 74 74 99 32 131 0.7557252
## 20 20 210 68 278 0.7553957
## 16 16 145 47 192 0.7552083
## 27 27 87 29 116 0.7500000
## 49 49 36 12 48 0.7500000
## 135 138 30 10 40 0.7500000
## 123 125 18 6 24 0.7500000
## 127 129 15 5 20 0.7500000
## 161 165 12 4 16 0.7500000
## 184 212 12 4 16 0.7500000
## 232 316 9 3 12 0.7500000
## 199 244 6 2 8 0.7500000
## 216 282 6 2 8 0.7500000
## 58 58 98 33 131 0.7480916
## 95 96 92 31 123 0.7479675
## 47 47 117 40 157 0.7452229
## 61 61 73 25 98 0.7448980
## 23 23 192 66 258 0.7441860
## 122 124 32 11 43 0.7441860
## 30 30 139 48 187 0.7433155
## 34 34 194 67 261 0.7432950
## 102 103 26 9 35 0.7428571
## 81 81 46 16 62 0.7419355
## 142 145 23 8 31 0.7419355
## 154 157 20 7 27 0.7407407
## 109 110 57 20 77 0.7402597
## 17 17 202 71 273 0.7399267
## 2 2 71 25 96 0.7395833
## 6 6 193 68 261 0.7394636
## 36 36 156 55 211 0.7393365
## 18 18 215 76 291 0.7388316
## 77 77 65 23 88 0.7386364
## 103 104 48 17 65 0.7384615
## 29 29 132 47 179 0.7374302
## 45 45 129 46 175 0.7371429
## 100 101 28 10 38 0.7368421
## 166 172 14 5 19 0.7368421
## 44 44 151 54 205 0.7365854
## 39 39 131 47 178 0.7359551
## 89 89 47 17 64 0.7343750
## 48 48 174 63 237 0.7341772
## 24 24 200 73 273 0.7326007
## 3 3 87 32 119 0.7310924
## 46 46 163 60 223 0.7309417
## 138 141 19 7 26 0.7307692
## 19 19 119 44 163 0.7300613
## 53 53 99 37 136 0.7279412
## 141 144 32 12 44 0.7272727
## 201 248 8 3 11 0.7272727
## 221 292 8 3 11 0.7272727
## 241 336 8 3 11 0.7272727
## 21 21 186 70 256 0.7265625
## 10 10 175 66 241 0.7261411
## 177 196 21 8 29 0.7241379
## 132 134 39 15 54 0.7222222
## 117 119 13 5 18 0.7222222
## 147 150 31 12 43 0.7209302
## 91 91 49 19 68 0.7205882
## 63 63 54 21 75 0.7200000
## 172 184 18 7 25 0.7200000
## 25 25 143 56 199 0.7185930
## 57 57 102 40 142 0.7183099
## 88 88 66 26 92 0.7173913
## 26 26 182 72 254 0.7165354
## 113 114 48 19 67 0.7164179
## 180 202 15 6 21 0.7142857
## 228 308 10 4 14 0.7142857
## 40 40 137 55 192 0.7135417
## 108 109 42 17 59 0.7118644
## 62 62 91 37 128 0.7109375
## 76 76 59 24 83 0.7108434
## 43 43 141 58 199 0.7085427
## 137 140 34 14 48 0.7083333
## 7 7 148 61 209 0.7081340
## 186 216 12 5 17 0.7058824
## 188 220 12 5 17 0.7058824
## 173 188 19 8 27 0.7037037
## 75 75 40 17 57 0.7017544
## 56 56 175 75 250 0.7000000
## 134 136 28 12 40 0.7000000
## 130 132 21 9 30 0.7000000
## 71 71 14 6 20 0.7000000
## 139 142 14 6 20 0.7000000
## 50 50 195 84 279 0.6989247
## 32 32 144 63 207 0.6956522
## 136 139 16 7 23 0.6956522
## 13 13 148 65 213 0.6948357
## 101 102 95 42 137 0.6934307
## 51 51 122 54 176 0.6931818
## 9 9 162 72 234 0.6923077
## 111 112 66 30 96 0.6875000
## 116 118 26 12 38 0.6842105
## 106 107 43 20 63 0.6825397
## 68 68 77 36 113 0.6814159
## 167 174 19 9 28 0.6785714
## 94 95 25 12 37 0.6756757
## 164 168 26 13 39 0.6666667
## 175 192 26 13 39 0.6666667
## 178 198 18 9 27 0.6666667
## 115 117 14 7 21 0.6666667
## 150 153 10 5 15 0.6666667
## 239 332 8 4 12 0.6666667
## 217 284 6 3 9 0.6666667
## 225 302 6 3 9 0.6666667
## 153 156 38 20 58 0.6551724
## 110 111 39 21 60 0.6500000
## 78 78 107 58 165 0.6484848
## 114 116 41 23 64 0.6406250
## 87 87 35 20 55 0.6363636
## 148 151 13 8 21 0.6190476
## 211 270 4 3 7 0.5714286
## 215 280 4 3 7 0.5714286
## 220 290 5 4 9 0.5555556
## 210 268 6 5 11 0.5454545
## 205 258 7 6 13 0.5384615
## 223 298 7 6 13 0.5384615
## 192 228 3 4 7 0.4285714
## 218 286 2 3 5 0.4000000
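However, the training-hour values with the highest probability of
staying are mostly above 150 hours, which suggests that data scientists
who took more than 150 hours of training are more likely to stay with
the company. These groups are small, though (the eight values with a
prob_of_stay of 1 have at most 14 records each), and several of the
lowest probabilities also occur above 150 hours.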
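Many of the training_hours values in the table above have only a few records, so their prob_of_stay is uncertain. One way to make this visible is an exact binomial confidence interval. The sketch below is a minimal example using `binom.test()` from base R's stats package on two rows copied from the table; comparing these two particular rows is a choice made here for illustration.

```r
# 278 training hours: 14 of 14 stayed (prob_of_stay = 1)
ci_small <- binom.test(x = 14, n = 14)$conf.int

# 28 training hours: 250 of 329 stayed (prob_of_stay about 0.76)
ci_large <- binom.test(x = 250, n = 329)$conf.int

# 95% confidence intervals for the probability of staying
print(round(ci_small, 3))
print(round(ci_large, 3))
```

The interval for the 14-record group is much wider than the one for the 329-record group, so a prob_of_stay of 1 in a small group is weaker evidence than it first appears.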