simulation & linear regression for employee datatest

Author

kirit ved

Published

November 19, 2023

Lecture-5

author’s details

author’s image

author’s website

https://kiritved.com

setting R environment

rm(list=ls(all.names = T))
set.seed(1234)
setwd("d:/met/met_lect6")
if(! require("pacman")) install.packages("pacman")
Loading required package: pacman
library("pacman")
p_load(tidyverse,janitor)
p_loaded()
 [1] "janitor"   "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
 [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "pacman"   
myage=function(dt){
  t=Sys.Date()-dt
  t=as.numeric(t)
  t=t/365.25
  t=ceiling(t)+1
  return(t)
}

simulation

two machine simulation

two m/cs run in series in the assembly line to get the final product

However time taken by m/c 1 is normally distributed with mean 80 & sd of 15 mins, while that of the second m/c is 60 & 10. The manager wants to know :

What are the chances that the product will be available in

  1. less than 125 mins

  2. above 150 mins.

  3. between 120 to 140 mins.

```{r}
n=100000
t1=rnorm(n,80,15);t1=round(t1,2)
t2=rnorm(n,60,10);t2=round(t2,2)
ttlt=t1+t2
fl1=fl2=fl3=c()
for(i in 1:n){
if(ttlt[i]<=125){
  fl1[i]=1
} else{
  fl1[i]=0
}
if(ttlt[i]>=150){
  fl2[i]=1
} else{
  fl2[i]=0
}
 if(ttlt[i]<=140 & ttlt[i]>120){
  fl3[i]=1
} else{
  fl3[i]=0
} 
}
s1=sum(fl1)
s2=sum(fl2)
s3=sum(fl3)
p1=s1/n
p2=s2/n
p3=s3/n
s1;s2;s3;p1;p2;p3
d=data.frame(t1,t2,ttlt,fl1,fl2,fl3)
head(d)
```
[1] 20199
[1] 29065
[1] 36394
[1] 0.20199
[1] 0.29065
[1] 0.36394
t1 t2 ttlt fl1 fl2 fl3
61.89 45.84 107.73 1 0 0
84.16 63.17 147.33 0 0 0
96.27 67.28 163.55 0 1 0
44.81 41.00 85.81 1 0 0
86.44 48.55 134.99 0 0 1
87.59 65.77 153.36 0 1 0

linear regression with employee dataset

load employee dataset & correct date format

d=read.csv("d:/met/met_lect6/emp.csv") |>
  select(-c(6),c(6)) 
d$bdate=mdy(d$bdate)
Warning: 1 failed to parse.
d$bdate=d$bdate+years(30)
head(d)
id gender bdate educ jobcat salbegin jobtime prevexp minority salary
1 m 1982-02-03 15 3 27000 98 144 0 57000
2 m 1988-05-23 16 1 18750 98 36 0 40200
3 f 1959-07-26 12 1 12000 98 381 0 21450
4 f 1977-04-15 8 1 13200 98 190 0 21900
5 m 1985-02-09 15 1 21000 98 138 0 45000
6 m 1988-08-22 15 1 13500 98 67 0 32100
tail(d)
id gender bdate educ jobcat salbegin jobtime prevexp minority salary
469 469 f 1994-06-01 15 1 13950 64 57 0 25200
470 470 m 1994-01-22 12 1 15750 64 69 1 26250
471 471 m 1996-08-03 15 1 15750 64 32 1 26400
472 472 m 1996-02-21 15 1 15750 63 46 0 39150
473 473 f 1967-11-25 12 1 12750 63 139 0 21450
474 474 f 1998-11-05 12 1 14250 63 9 0 29400
summary(d)
       id           gender              bdate                 educ      
 Min.   :  1.0   Length:474         Min.   :1959-02-10   Min.   : 8.00  
 1st Qu.:119.2   Class :character   1st Qu.:1978-01-03   1st Qu.:12.00  
 Median :237.5   Mode  :character   Median :1992-01-23   Median :12.00  
 Mean   :237.5                      Mean   :1986-10-08   Mean   :13.49  
 3rd Qu.:355.8                      3rd Qu.:1995-07-06   3rd Qu.:15.00  
 Max.   :474.0                      Max.   :2001-02-10   Max.   :21.00  
                                    NA's   :1                           
     jobcat         salbegin        jobtime         prevexp      
 Min.   :1.000   Min.   : 9000   Min.   :63.00   Min.   :  0.00  
 1st Qu.:1.000   1st Qu.:12488   1st Qu.:72.00   1st Qu.: 19.25  
 Median :1.000   Median :15000   Median :81.00   Median : 55.00  
 Mean   :1.411   Mean   :17016   Mean   :81.11   Mean   : 95.86  
 3rd Qu.:1.000   3rd Qu.:17490   3rd Qu.:90.00   3rd Qu.:138.75  
 Max.   :3.000   Max.   :79980   Max.   :98.00   Max.   :476.00  
                                                                 
    minority          salary      
 Min.   :0.0000   Min.   : 15750  
 1st Qu.:0.0000   1st Qu.: 24000  
 Median :0.0000   Median : 28875  
 Mean   :0.2194   Mean   : 34420  
 3rd Qu.:0.0000   3rd Qu.: 36938  
 Max.   :1.0000   Max.   :135000  
                                  
str(d)
'data.frame':   474 obs. of  10 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender  : chr  "m" "m" "f" "f" ...
 $ bdate   : Date, format: "1982-02-03" "1988-05-23" ...
 $ educ    : int  15 16 12 8 15 15 15 12 15 12 ...
 $ jobcat  : int  3 1 1 1 1 1 1 1 1 1 ...
 $ salbegin: int  27000 18750 12000 13200 21000 13500 18750 9750 12750 13500 ...
 $ jobtime : int  98 98 98 98 98 98 98 98 98 98 ...
 $ prevexp : int  144 36 381 190 138 67 114 0 115 244 ...
 $ minority: int  0 0 0 0 0 0 0 0 0 0 ...
 $ salary  : int  57000 40200 21450 21900 45000 32100 36000 21900 27900 24000 ...

remove rows with na value

tmp=!complete.cases(d)
sum(tmp)
[1] 1
d=d[tmp==F,]
head(d)
id gender bdate educ jobcat salbegin jobtime prevexp minority salary
1 m 1982-02-03 15 3 27000 98 144 0 57000
2 m 1988-05-23 16 1 18750 98 36 0 40200
3 f 1959-07-26 12 1 12000 98 381 0 21450
4 f 1977-04-15 8 1 13200 98 190 0 21900
5 m 1985-02-09 15 1 21000 98 138 0 45000
6 m 1988-08-22 15 1 13500 98 67 0 32100
tail(d)
id gender bdate educ jobcat salbegin jobtime prevexp minority salary
469 469 f 1994-06-01 15 1 13950 64 57 0 25200
470 470 m 1994-01-22 12 1 15750 64 69 1 26250
471 471 m 1996-08-03 15 1 15750 64 32 1 26400
472 472 m 1996-02-21 15 1 15750 63 46 0 39150
473 473 f 1967-11-25 12 1 12750 63 139 0 21450
474 474 f 1998-11-05 12 1 14250 63 9 0 29400
str(d)
'data.frame':   473 obs. of  10 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender  : chr  "m" "m" "f" "f" ...
 $ bdate   : Date, format: "1982-02-03" "1988-05-23" ...
 $ educ    : int  15 16 12 8 15 15 15 12 15 12 ...
 $ jobcat  : int  3 1 1 1 1 1 1 1 1 1 ...
 $ salbegin: int  27000 18750 12000 13200 21000 13500 18750 9750 12750 13500 ...
 $ jobtime : int  98 98 98 98 98 98 98 98 98 98 ...
 $ prevexp : int  144 36 381 190 138 67 114 0 115 244 ...
 $ minority: int  0 0 0 0 0 0 0 0 0 0 ...
 $ salary  : int  57000 40200 21450 21900 45000 32100 36000 21900 27900 24000 ...
summary(d)
       id           gender              bdate                 educ      
 Min.   :  1.0   Length:473         Min.   :1959-02-10   Min.   : 8.00  
 1st Qu.:119.0   Class :character   1st Qu.:1978-01-03   1st Qu.:12.00  
 Median :237.0   Mode  :character   Median :1992-01-23   Median :12.00  
 Mean   :237.1                      Mean   :1986-10-08   Mean   :13.49  
 3rd Qu.:355.0                      3rd Qu.:1995-07-06   3rd Qu.:15.00  
 Max.   :474.0                      Max.   :2001-02-10   Max.   :21.00  
     jobcat         salbegin        jobtime         prevexp      
 Min.   :1.000   Min.   : 9000   Min.   :63.00   Min.   :  0.00  
 1st Qu.:1.000   1st Qu.:12450   1st Qu.:72.00   1st Qu.: 19.00  
 Median :1.000   Median :15000   Median :81.00   Median : 55.00  
 Mean   :1.412   Mean   :17009   Mean   :81.14   Mean   : 95.95  
 3rd Qu.:1.000   3rd Qu.:17490   3rd Qu.:90.00   3rd Qu.:139.00  
 Max.   :3.000   Max.   :79980   Max.   :98.00   Max.   :476.00  
    minority          salary      
 Min.   :0.0000   Min.   : 15750  
 1st Qu.:0.0000   1st Qu.: 24000  
 Median :0.0000   Median : 28800  
 Mean   :0.2199   Mean   : 34418  
 3rd Qu.:0.0000   3rd Qu.: 37050  
 Max.   :1.0000   Max.   :135000  
do=d

data visualization

mybarplot=function(df,cn,ttl="bar plot",xlbl,ylbl){
  colnm=colnames(df)[cn]
  v=df|>select(c(cn))
  t=table(v)
  ttl=paste(ttl,"for",colnm,sep=" ")
  b=barplot(t,col="lightgreen",main=ttl,xlab=colnm)
  text(b,min(t)*1.1,labels=as.character(t),col="red")
  #return (T)
}
mypie=function(df,cn,ttl="pie chart"){
  colnm=colnames(df)[cn]
  v=df|>select(c(cn))
  
  t=table(v)

  revnm=paste(names(t),t,sep="--")
  ttl=paste(ttl,"for",colnm,sep=" ")
    pie(t,labels=revnm,main=ttl)
}
dtmp=d|>group_by(gender)|>summarise(n=n())|>mutate(nper=100*n/sum(n))
dtmp
gender n nper
f 216 45.66596
m 257 54.33404
mybarplot(d,2)
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(cn)

  # Now:
  data %>% select(all_of(cn))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.

mybarplot(d,4)

mybarplot(d,5)

mybarplot(d,9)

mypie(d,2)

mypie(d,4)

mypie(d,5)

mypie(d,9)

plot(d$salbegin,d$salary,col="lightgreen")

boxplot(salary~gender,d,col="lightgreen",main="boxplot of salary vs gender")

boxplot(salary~jobcat,d,col="lightgreen",main="boxplot of salary vs jobcategory")

boxplot(salary~educ,d,col="lightgreen",main="boxplot of salary vs educ")

boxplot(salary~minority,d,col="lightgreen",main="boxplot of salary vs minority")

hist(d$salary,
     main="histogram of salary",
     xlab="salary",
     ylab="frequency",col="lightgreen")

hist(d$salbegin,
     main="histogram of begining salary",
     xlab="begining salary",
     ylab="frequency",col="lightgreen")

hist(d$educ,
     main="histogram of education",
     xlab="education",
     ylab="frequency",col="lightgreen")

structure of data needed for employee database

str(do)
'data.frame':   473 obs. of  10 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender  : chr  "m" "m" "f" "f" ...
 $ bdate   : Date, format: "1982-02-03" "1988-05-23" ...
 $ educ    : int  15 16 12 8 15 15 15 12 15 12 ...
 $ jobcat  : int  3 1 1 1 1 1 1 1 1 1 ...
 $ salbegin: int  27000 18750 12000 13200 21000 13500 18750 9750 12750 13500 ...
 $ jobtime : int  98 98 98 98 98 98 98 98 98 98 ...
 $ prevexp : int  144 36 381 190 138 67 114 0 115 244 ...
 $ minority: int  0 0 0 0 0 0 0 0 0 0 ...
 $ salary  : int  57000 40200 21450 21900 45000 32100 36000 21900 27900 24000 ...

calculate age of employee & replacing gender with numerical value using pipe operator

do1=d|>mutate(age=myage(bdate),
gn=ifelse(gender=="f",0,1))|> select(-c(2,3)) |>select(-c(8),c(8))
head(do1)
id educ jobcat salbegin jobtime prevexp minority age gn salary
1 15 3 27000 98 144 0 43 1 57000
2 16 1 18750 98 36 0 37 1 40200
3 12 1 12000 98 381 0 66 0 21450
4 8 1 13200 98 190 0 48 0 21900
5 15 1 21000 98 138 0 40 1 45000
6 15 1 13500 98 67 0 37 1 32100
str(do1)
'data.frame':   473 obs. of  10 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ educ    : int  15 16 12 8 15 15 15 12 15 12 ...
 $ jobcat  : int  3 1 1 1 1 1 1 1 1 1 ...
 $ salbegin: int  27000 18750 12000 13200 21000 13500 18750 9750 12750 13500 ...
 $ jobtime : int  98 98 98 98 98 98 98 98 98 98 ...
 $ prevexp : int  144 36 381 190 138 67 114 0 115 244 ...
 $ minority: int  0 0 0 0 0 0 0 0 0 0 ...
 $ age     : num  43 37 66 48 40 37 39 29 49 49 ...
 $ gn      : num  1 1 0 0 1 1 1 0 0 0 ...
 $ salary  : int  57000 40200 21450 21900 45000 32100 36000 21900 27900 24000 ...

performing linear regression

split the data

split_ratio=2/3
tmp=sample(1:nrow(do1),nrow(do1)*split_ratio)
 # tmp
  do1train=do1[tmp,]
  do1test=do1[-tmp,]
  nrow(do1train);nrow(do1test)
[1] 315
[1] 158
  mylm=lm(salary~.,do1train)
  mylm

Call:
lm(formula = salary ~ ., data = do1train)

Coefficients:
(Intercept)           id         educ       jobcat     salbegin      jobtime  
  66276.700      -56.075      396.790     6105.053        1.315     -623.172  
    prevexp     minority          age           gn  
    -13.188     -583.189      -87.839     1002.326  
  summary(mylm)

Call:
lm(formula = salary ~ ., data = do1train)

Residuals:
   Min     1Q Median     3Q    Max 
-19990  -3248   -754   2298  46447 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 66276.700  69502.101   0.954   0.3410    
id            -56.075     51.581  -1.087   0.2778    
educ          396.789    199.836   1.986   0.0480 *  
jobcat       6105.053    824.215   7.407 1.27e-12 ***
salbegin        1.315      0.109  12.062  < 2e-16 ***
jobtime      -623.172    702.139  -0.888   0.3755    
prevexp       -13.188      7.046  -1.872   0.0622 .  
minority     -583.189   1036.303  -0.563   0.5740    
age           -87.839     60.670  -1.448   0.1487    
gn           1002.326   1008.018   0.994   0.3208    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7011 on 305 degrees of freedom
Multiple R-squared:  0.8285,    Adjusted R-squared:  0.8234 
F-statistic: 163.7 on 9 and 305 DF,  p-value: < 2.2e-16
  pred=predict(mylm,do1test)
  pred=round(pred,0)
  #paste(pred,do1test$salary,sep=" - ")
  plot(do1test$salary,pred)

  tmpcor=cor(pred,do1test$salary)
  tmpcor
[1] 0.9325219
  RMSE = sqrt(mean((pred - do1test$salary)^2))
  RMSE
[1] 6458.141

create dummy data of employees for prediction

str(do1)
'data.frame':   473 obs. of  10 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ educ    : int  15 16 12 8 15 15 15 12 15 12 ...
 $ jobcat  : int  3 1 1 1 1 1 1 1 1 1 ...
 $ salbegin: int  27000 18750 12000 13200 21000 13500 18750 9750 12750 13500 ...
 $ jobtime : int  98 98 98 98 98 98 98 98 98 98 ...
 $ prevexp : int  144 36 381 190 138 67 114 0 115 244 ...
 $ minority: int  0 0 0 0 0 0 0 0 0 0 ...
 $ age     : num  43 37 66 48 40 37 39 29 49 49 ...
 $ gn      : num  1 1 0 0 1 1 1 0 0 0 ...
 $ salary  : int  57000 40200 21450 21900 45000 32100 36000 21900 27900 24000 ...
n=20
id=round(runif(n,474,550))
gender=(sample(c("m","f"),n,replace=T))
bdate=runif(n,min(do$bdate),max(d$bdate))
bdate=as.Date(bdate)
educ=round(runif(n,min(do1$educ),max(do1$educ)))
jobcat=round(runif(n,min(do1$jobcat),max(do1$jobcat)))
salbegin=round(runif(n,min(do1$salbegin),max(do1$salbegin)))
jobtime=round(runif(n,min(do1$jobtime),max(do1$jobtime)))
prevexp=round(runif(n,min(do1$prevexp),max(do1$prevexp)))
minority=round(runif(n,min(do1$educ),max(do1$educ)))
age=round(runif(n,min(do1$age),max(do1$age)))
gendern=(sample(c("m","f"),n,replace=T))
dpred=data.frame(id,gender,bdate,educ,jobcat,salbegin,jobtime,prevexp,minority)
dpred1=dpred|>mutate(age=myage(bdate),
gn=ifelse(gender=="f",0,1))|> select(-c(2,3)) |>select(-c(8),c(8))
dpred1
id educ jobcat salbegin jobtime prevexp minority gn age
509 13 1 63807 90 351 13 1 51
510 12 1 54070 88 419 19 0 26
510 12 3 29066 70 264 19 1 46
526 21 2 30578 75 448 19 0 58
499 18 2 52008 73 254 9 0 30
491 17 1 14759 97 429 8 0 45
476 19 2 24860 96 109 11 1 55
549 18 3 47387 85 236 18 1 40
537 18 2 65835 95 241 15 1 32
486 16 1 34519 70 49 20 0 60
517 19 2 12079 93 468 10 0 58
515 8 3 20486 97 250 9 1 26
494 11 1 27079 76 381 9 0 42
546 14 2 17823 75 472 9 1 63
487 18 2 10180 86 283 19 1 51
535 10 2 15513 91 61 8 1 64
534 16 3 71359 78 55 11 1 45
501 17 2 33728 75 232 19 0 61
506 8 2 74491 67 202 10 0 65
517 12 2 75295 83 216 20 1 55

predicting salary for new employee

finpred=predict(mylm,dpred1)
finpred
         1          2          3          4          5          6          7 
61137.9836 45923.9954 37757.8505 28716.3883 69318.9276 -3719.1655 20522.0709 
         8          9         10         11         12         13         14 
54177.6800 69161.2660 35672.0545 -2132.8699 15554.3237 23333.9666 14121.6831 
        15         16         17         18         19         20 
 -174.4141  6058.1332 96143.3023 35258.6887 95293.3527 83214.4725 
dpred1$salary=round(finpred)
dpred1
id educ jobcat salbegin jobtime prevexp minority gn age salary
509 13 1 63807 90 351 13 1 51 61138
510 12 1 54070 88 419 19 0 26 45924
510 12 3 29066 70 264 19 1 46 37758
526 21 2 30578 75 448 19 0 58 28716
499 18 2 52008 73 254 9 0 30 69319
491 17 1 14759 97 429 8 0 45 -3719
476 19 2 24860 96 109 11 1 55 20522
549 18 3 47387 85 236 18 1 40 54178
537 18 2 65835 95 241 15 1 32 69161
486 16 1 34519 70 49 20 0 60 35672
517 19 2 12079 93 468 10 0 58 -2133
515 8 3 20486 97 250 9 1 26 15554
494 11 1 27079 76 381 9 0 42 23334
546 14 2 17823 75 472 9 1 63 14122
487 18 2 10180 86 283 19 1 51 -174
535 10 2 15513 91 61 8 1 64 6058
534 16 3 71359 78 55 11 1 45 96143
501 17 2 33728 75 232 19 0 61 35259
506 8 2 74491 67 202 10 0 65 95293
517 12 2 75295 83 216 20 1 55 83214