lect5

linear regression

Author

kirit ved

Published

November 12, 2023

author’s details

author’s website

https://kiritved.com

setting R environment

Loading required package: pacman
 [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
 [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "pacman"   
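
The setup code is not shown in the rendered page. The message "Loading required package: pacman" and the package list above are consistent with a pacman-based setup, so a minimal sketch might look like the following. The package vector and the `install = FALSE` choice are assumptions made here so the snippet never tries to install anything; the original may well have relied on pacman's default of installing missing packages.

```r
# Hypothetical setup sketch; the original setup chunk is not visible.
pkgs <- c("tidyverse", "lubridate")  # assumed from the package list printed above

if (requireNamespace("pacman", quietly = TRUE)) {
  # p_load() attaches each package; install = FALSE skips automatic installation
  pacman::p_load(char = pkgs, install = FALSE)
  print(pacman::p_loaded())          # names of the packages currently attached
} else {
  message("pacman is not installed; run install.packages('pacman') first")
}
```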

multivariate linear regression analysis

generating dummy data

x1 x2 x3 y
-1.4189153 55.31654 24.88214 4798.574
2.0742629 40.85885 19.47918 3762.265
-0.2076010 61.95378 20.18491 4704.574
3.2907369 46.19061 19.97652 4012.559
3.8724917 60.96752 19.51187 4740.914
-4.8036025 51.37820 15.49174 3637.568
1.0078027 64.38995 14.97110 4381.711
-1.6938027 54.58989 18.76212 4176.199
0.2391636 63.29888 11.90487 3992.911
-0.5811786 50.28625 23.13990 4466.557
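
The code that produced this table is hidden, so the sketch below only illustrates one way such data could be simulated. The seed, the means and standard deviations, the true coefficients (30, 45, 95) and the noise level are all assumptions chosen to give values of a similar magnitude; they are not taken from the original.

```r
# Hypothetical simulation of a small data set with three predictors.
set.seed(42)                          # assumed seed, for reproducibility only
n <- 10

dd <- data.frame(
  x1 = rnorm(n, mean = 0,  sd = 3),   # assumed distribution parameters
  x2 = rnorm(n, mean = 55, sd = 8),
  x3 = rnorm(n, mean = 20, sd = 4)
)

# Response as a linear combination of the predictors plus Gaussian noise
dd$y <- 30 * dd$x1 + 45 * dd$x2 + 95 * dd$x3 + rnorm(n, sd = 30)

print(dd)
```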

performing regression analysis


Call:
lm(formula = y ~ ., data = dd)

Coefficients:
(Intercept)           x1           x2           x3  
     -80.64        29.91        46.52        94.93  

Call:
lm(formula = y ~ ., data = dd)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.469 -12.044  -2.961  16.735  30.869 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -80.639    108.157  -0.746 0.484100    
x1            29.906      3.886   7.695 0.000252 ***
x2            46.521      1.375  33.833 4.44e-08 ***
x3            94.932      2.812  33.760 4.50e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.81 on 6 degrees of freedom
Multiple R-squared:  0.9966,    Adjusted R-squared:  0.9948 
F-statistic: 579.5 on 3 and 6 DF,  p-value: 8.888e-08
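
The two outputs above are what `print()` and `summary()` show for an `lm` object. As a sketch, the ten rows printed under "generating dummy data" can be re-entered by hand and fitted with the same formula. Because the printed values are rounded, the estimates should come out close to, but not necessarily identical with, the ones shown above.

```r
# The ten rows from the dummy-data table, typed in from the printed output.
dd <- data.frame(
  x1 = c(-1.4189153, 2.0742629, -0.2076010, 3.2907369, 3.8724917,
         -4.8036025, 1.0078027, -1.6938027, 0.2391636, -0.5811786),
  x2 = c(55.31654, 40.85885, 61.95378, 46.19061, 60.96752,
         51.37820, 64.38995, 54.58989, 63.29888, 50.28625),
  x3 = c(24.88214, 19.47918, 20.18491, 19.97652, 19.51187,
         15.49174, 14.97110, 18.76212, 11.90487, 23.13990),
  y  = c(4798.574, 3762.265, 4704.574, 4012.559, 4740.914,
         3637.568, 4381.711, 4176.199, 3992.911, 4466.557)
)

# "y ~ ." regresses y on every other column of dd
fit <- lm(y ~ ., data = dd)

print(fit)            # call and coefficients only
print(summary(fit))   # residuals, standard errors, t tests, R-squared, F test
```

With 10 observations and 4 estimated parameters (intercept plus three slopes), 6 residual degrees of freedom remain, which matches the summary.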

generating dummy data for prediction

x1 x2 x3
-2.8041404 49.53004 16.28137
3.3852353 64.19679 17.93383
-0.2897971 55.71272 24.16503

predicting output using regression model

x1 x2 x3 y
-2.8041404 49.53004 16.28137 3685.325
3.3852353 64.19679 17.93383 4709.609
-0.2897971 55.71272 24.16503 4796.554
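
A sketch of this prediction step using `predict()` with a `newdata` argument. The training rows and the three new rows are copied from the printed tables; the object names `fit` and `newd` are placeholders, since the original code is not shown.

```r
# Training data, re-entered from the dummy-data table
dd <- data.frame(
  x1 = c(-1.4189153, 2.0742629, -0.2076010, 3.2907369, 3.8724917,
         -4.8036025, 1.0078027, -1.6938027, 0.2391636, -0.5811786),
  x2 = c(55.31654, 40.85885, 61.95378, 46.19061, 60.96752,
         51.37820, 64.38995, 54.58989, 63.29888, 50.28625),
  x3 = c(24.88214, 19.47918, 20.18491, 19.97652, 19.51187,
         15.49174, 14.97110, 18.76212, 11.90487, 23.13990),
  y  = c(4798.574, 3762.265, 4704.574, 4012.559, 4740.914,
         3637.568, 4381.711, 4176.199, 3992.911, 4466.557)
)
fit <- lm(y ~ ., data = dd)

# New observations; column names must match the predictors used in the fit
newd <- data.frame(
  x1 = c(-2.8041404, 3.3852353, -0.2897971),
  x2 = c(49.53004, 64.19679, 55.71272),
  x3 = c(16.28137, 17.93383, 24.16503)
)

# Append the fitted values as a new column y
newd$y <- predict(fit, newdata = newd)
print(newd)
```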

load employee data

Warning: 1 failed to parse.
id gender bdate educ jobcat salbegin jobtime prevexp minority salary
1 m 1952-02-03 15 3 27000 98 144 0 57000
2 m 1958-05-23 16 1 18750 98 36 0 40200
3 f 1929-07-26 12 1 12000 98 381 0 21450
4 f 1947-04-15 8 1 13200 98 190 0 21900
5 m 1955-02-09 15 1 21000 98 138 0 45000
6 m 1958-08-22 15 1 13500 98 67 0 32100
id gender bdate educ jobcat salbegin jobtime prevexp minority salary
469 f 1964-06-01 15 1 13950 64 57 0 25200
470 m 1964-01-22 12 1 15750 64 69 1 26250
471 m 1966-08-03 15 1 15750 64 32 1 26400
472 m 1966-02-21 15 1 15750 63 46 0 39150
473 f 1937-11-25 12 1 12750 63 139 0 21450
474 f 1968-11-05 12 1 14250 63 9 0 29400
'data.frame':   474 obs. of  10 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender  : chr  "m" "m" "f" "f" ...
 $ bdate   : Date, format: "1952-02-03" "1958-05-23" ...
 $ educ    : int  15 16 12 8 15 15 15 12 15 12 ...
 $ jobcat  : int  3 1 1 1 1 1 1 1 1 1 ...
 $ salbegin: int  27000 18750 12000 13200 21000 13500 18750 9750 12750 13500 ...
 $ jobtime : int  98 98 98 98 98 98 98 98 98 98 ...
 $ prevexp : int  144 36 381 190 138 67 114 0 115 244 ...
 $ minority: int  0 0 0 0 0 0 0 0 0 0 ...
 $ salary  : int  57000 40200 21450 21900 45000 32100 36000 21900 27900 24000 ...
       id           gender              bdate                 educ      
 Min.   :  1.0   Length:474         Min.   :1929-02-10   Min.   : 8.00  
 1st Qu.:119.2   Class :character   1st Qu.:1948-01-03   1st Qu.:12.00  
 Median :237.5   Mode  :character   Median :1962-01-23   Median :12.00  
 Mean   :237.5                      Mean   :1956-10-08   Mean   :13.49  
 3rd Qu.:355.8                      3rd Qu.:1965-07-06   3rd Qu.:15.00  
 Max.   :474.0                      Max.   :1971-02-10   Max.   :21.00  
                                    NA's   :1                           
     jobcat         salbegin        jobtime         prevexp      
 Min.   :1.000   Min.   : 9000   Min.   :63.00   Min.   :  0.00  
 1st Qu.:1.000   1st Qu.:12488   1st Qu.:72.00   1st Qu.: 19.25  
 Median :1.000   Median :15000   Median :81.00   Median : 55.00  
 Mean   :1.411   Mean   :17016   Mean   :81.11   Mean   : 95.86  
 3rd Qu.:1.000   3rd Qu.:17490   3rd Qu.:90.00   3rd Qu.:138.75  
 Max.   :3.000   Max.   :79980   Max.   :98.00   Max.   :476.00  
                                                                 
    minority          salary      
 Min.   :0.0000   Min.   : 15750  
 1st Qu.:0.0000   1st Qu.: 24000  
 Median :0.0000   Median : 28875  
 Mean   :0.2194   Mean   : 34420  
 3rd Qu.:0.0000   3rd Qu.: 36938  
 Max.   :1.0000   Max.   :135000  
                                  
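
The warning "1 failed to parse" and the single `NA's :1` entry under `bdate` both point to one birth date that could not be converted to a `Date`. The raw file and its date format are not shown, so the sketch below uses made-up input strings and base R's `as.Date()` purely to illustrate how an unparseable value ends up as `NA`; the original most likely used a lubridate parser instead.

```r
# Hypothetical raw values; the real file format is not visible in the page.
raw_bdate <- c("02/03/1952", "05/23/1958", "")   # last entry cannot be parsed

# With an explicit format, as.Date() returns NA for non-matching strings
bdate <- as.Date(raw_bdate, format = "%m/%d/%Y")

print(bdate)
n_failed <- sum(is.na(bdate))   # number of values that failed to parse
print(n_failed)
```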

removing data with NA

calculating age & recoding gender

educ jobcat salbegin jobtime prevexp minority age gn salary
15 3 27000 98 144 0 42 1 57000
16 1 18750 98 36 0 36 1 40200
12 1 12000 98 381 0 65 0 21450
8 1 13200 98 190 0 47 0 21900
15 1 21000 98 138 0 39 1 45000
15 1 13500 98 67 0 36 1 32100
'data.frame':   473 obs. of  9 variables:
 $ educ    : int  15 16 12 8 15 15 15 12 15 12 ...
 $ jobcat  : int  3 1 1 1 1 1 1 1 1 1 ...
 $ salbegin: int  27000 18750 12000 13200 21000 13500 18750 9750 12750 13500 ...
 $ jobtime : int  98 98 98 98 98 98 98 98 98 98 ...
 $ prevexp : int  144 36 381 190 138 67 114 0 115 244 ...
 $ minority: int  0 0 0 0 0 0 0 0 0 0 ...
 $ age     : num  42 36 65 47 39 36 38 28 48 48 ...
 $ gn      : num  1 1 0 0 1 1 1 0 0 0 ...
 $ salary  : int  57000 40200 21450 21900 45000 32100 36000 21900 27900 24000 ...
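
A sketch of the two preparation steps named above, using the first six rows of the employee table. The reference date used for the age calculation is not stated in the page; `1995-01-01` is an assumption that happens to reproduce the six ages shown. The coding `m = 1`, `f = 0` is read off the printed `gn` column.

```r
# First six employees, copied from the printed table (subset of columns)
emp <- data.frame(
  gender = c("m", "m", "f", "f", "m", "m"),
  bdate  = as.Date(c("1952-02-03", "1958-05-23", "1929-07-26",
                     "1947-04-15", "1955-02-09", "1958-08-22")),
  salary = c(57000, 40200, 21450, 21900, 45000, 32100)
)

# Drop rows whose birth date is missing
emp <- emp[!is.na(emp$bdate), ]

# Age in completed years at an assumed reference date
ref_date <- as.Date("1995-01-01")
emp$age  <- floor(as.numeric(ref_date - emp$bdate) / 365.25)

# Recode gender to a numeric indicator: 1 for "m", 0 for "f"
emp$gn <- ifelse(emp$gender == "m", 1, 0)

# Keep only the columns used for modelling
d1 <- emp[, c("age", "gn", "salary")]
print(d1)
```

Dividing by 365.25 is an approximation that can be off by one year for dates very close to a birthday; a calendar-aware calculation would avoid that.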

linear regression on employee data


Call:
lm(formula = salary ~ ., data = d1)

Coefficients:
(Intercept)         educ       jobcat     salbegin      jobtime      prevexp  
 -11771.821      455.136     5753.379        1.329      154.787      -14.582  
   minority          age           gn  
   -968.946      -69.214     1803.320  

Call:
lm(formula = salary ~ ., data = d1)

Residuals:
   Min     1Q Median     3Q    Max 
-23217  -3013   -738   2655  46337 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.177e+04  3.252e+03  -3.619 0.000328 ***
educ         4.551e+02  1.541e+02   2.954 0.003293 ** 
jobcat       5.753e+03  6.226e+02   9.242  < 2e-16 ***
salbegin     1.329e+00  7.037e-02  18.884  < 2e-16 ***
jobtime      1.548e+02  3.164e+01   4.893 1.38e-06 ***
prevexp     -1.458e+01  5.506e+00  -2.649 0.008359 ** 
minority    -9.689e+02  7.845e+02  -1.235 0.217386    
age         -6.921e+01  4.770e+01  -1.451 0.147414    
gn           1.803e+03  7.744e+02   2.329 0.020311 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6803 on 464 degrees of freedom
Multiple R-squared:  0.8443,    Adjusted R-squared:  0.8416 
F-statistic: 314.5 on 8 and 464 DF,  p-value: < 2.2e-16
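
Several numbers in this summary can be checked by hand from the others, which helps when reading the table. The sketch below recomputes the adjusted R-squared, the F statistic and the t value for `educ` from the printed (rounded) figures, so small differences in the last digit are expected.

```r
# Quantities read from the summary above
r2       <- 0.8443        # multiple R-squared
n        <- 473           # observations after dropping the NA row
p        <- 8             # predictors, excluding the intercept
df_resid <- n - p - 1     # residual degrees of freedom (464)

# Adjusted R-squared penalises the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F statistic: explained variance per predictor over residual variance
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

# A t value is the estimate divided by its standard error
t_educ <- 455.1 / 154.1

print(round(c(adj_r2 = adj_r2, f_stat = f_stat, t_educ = t_educ), 4))
```

In this fit `minority` and `age` have p-values above 0.05, so their coefficients are not distinguishable from zero at that level given the other predictors.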

data generation for prediction

id gender bdate educ jobcat salbegin jobtime prevexp minority
1 f 1964-07-01 21 2 17047 91 90 1
2 f 1962-10-16 10 2 57257 72 79 1
3 m 1968-05-14 12 2 24527 89 64 1
4 f 1938-12-27 18 2 11581 84 65 0
5 m 1935-03-04 21 2 12535 89 83 0
6 f 1946-01-29 8 1 18641 76 71 1
7 f 1968-04-07 8 3 71653 91 96 1
8 m 1963-08-08 15 1 51759 88 91 0
9 m 1968-03-27 11 2 9244 94 87 0
10 f 1955-08-07 10 1 45229 85 96 0
educ jobcat salbegin jobtime prevexp minority age gn
21 2 17047 91 90 1 60 0
10 2 57257 72 79 1 62 0
12 2 24527 89 64 1 56 1
18 2 11581 84 65 0 85 0
21 2 12535 89 83 0 89 1
8 1 18641 76 71 1 78 0
'data.frame':   10 obs. of  8 variables:
 $ educ    : int  21 10 12 18 21 8 8 15 11 10
 $ jobcat  : int  2 2 2 2 2 1 3 1 2 1
 $ salbegin: num  17047 57257 24527 11581 12535 ...
 $ jobtime : num  91 72 89 84 89 76 91 88 94 85
 $ prevexp : num  90 79 64 65 83 71 96 91 87 96
 $ minority: num  1 1 1 0 0 1 1 0 0 0
 $ age     : num  60 62 56 85 89 78 56 61 56 69
 $ gn      : num  0 0 1 0 1 0 0 1 1 0
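
The generating code for these ten new employees is not shown. One possible sketch draws each column at random with `sample()` and `runif()`. The seed and all ranges are assumptions, loosely based on the minima and maxima in the training summary, and will not reproduce the exact rows above.

```r
# Hypothetical generator for new employee records
set.seed(1)                       # assumed seed
n <- 10

all_days <- seq(as.Date("1930-01-01"), as.Date("1970-12-31"), by = "day")

newemp <- data.frame(
  id       = seq_len(n),
  gender   = sample(c("m", "f"), n, replace = TRUE),
  bdate    = sample(all_days, n),                 # random birth dates
  educ     = sample(8:21, n, replace = TRUE),     # years of education
  jobcat   = sample(1:3, n, replace = TRUE),      # job category code
  salbegin = round(runif(n, 9000, 80000)),        # starting salary
  jobtime  = sample(63:98, n, replace = TRUE),
  prevexp  = sample(0:476, n, replace = TRUE),
  minority = sample(0:1, n, replace = TRUE)
)

print(newemp)
```

The new records then need the same `age` and `gn` columns as the training data before they can be passed to the model.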

predicting data using regression model

educ jobcat salbegin jobtime prevexp minority age gn salarypred
21 2 17047 91 90 1 60 0 39598.35
10 2 57257 72 79 1 62 0 85108.82
12 2 24527 89 64 1 56 1 47592.21
18 2 11581 84 65 0 85 0 29488.70
21 2 12535 89 83 0 89 1 34159.81
8 1 18641 76 71 1 78 0 26755.91
8 3 71653 91 96 1 56 0 112191.44
15 1 51759 88 91 0 61 1 79467.80
11 2 9244 94 87 0 56 1 28234.65
10 1 45229 85 96 0 69 0 65619.96
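
Each predicted salary is the intercept plus the sum of coefficient times predictor value. As a sketch, the first row can be recomputed by hand from the coefficients printed by `lm`. Since those coefficients are rounded (the `salbegin` slope in particular), the result should land within a few units of the 39598.35 shown in the table rather than match it exactly.

```r
# Coefficients as printed in the lm output for the employee model
coefs <- c("(Intercept)" = -11771.821, educ = 455.136, jobcat = 5753.379,
           salbegin = 1.329, jobtime = 154.787, prevexp = -14.582,
           minority = -968.946, age = -69.214, gn = 1803.320)

# First new employee; the leading 1 multiplies the intercept
x <- c("(Intercept)" = 1, educ = 21, jobcat = 2, salbegin = 17047,
       jobtime = 91, prevexp = 90, minority = 1, age = 60, gn = 0)

# Linear predictor: sum of coefficient * value, matched by name
salarypred <- sum(coefs * x[names(coefs)])
print(salarypred)
```

With a fitted model object, `predict(model, newdata = ...)` performs the same calculation for all rows using the unrounded coefficients.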