MULTIPLE REGRESSION

NOTE: since I have installed packages to check code, I put # in front of all those commands. You should remove it and run the installation. Also, follow instructions in Logit Rmd on the bottom to update R. Also, this Rmd file only looks at linear regression, I already posted Rmd for logistic regression.

Background work

First let’s import the data (Rmarkdown seems to refuse to knit if you don’t do so)

library(readxl)
Bank_Salaries <- read_excel("C:/Users/atomi/Downloads/Bank Salaries.xlsx")
View(Bank_Salaries)

Let’s start with straight-forward OLS estimation. First, let’s make sure all the data is in the proper form

library(psych)
str(Bank_Salaries)
## tibble [208 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Employee : num [1:208] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Education: num [1:208] 3 1 1 2 3 3 3 3 1 3 ...
##  $ Grade    : num [1:208] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Years1   : num [1:208] 3 14 12 8 3 3 4 8 4 9 ...
##  $ Years2   : num [1:208] 1 1 0 7 0 0 0 2 0 0 ...
##  $ Age      : num [1:208] 26 38 35 40 28 24 27 33 62 31 ...
##  $ Salary   : num [1:208] 32000 39100 33200 30600 29000 30500 30000 27000 34000 29500 ...
##  $ Gender   : chr [1:208] "Male" "Female" "Female" "Female" ...
##  $ PC Job   : chr [1:208] "No" "No" "No" "No" ...
describe(Bank_Salaries)
##           vars   n     mean       sd  median  trimmed     mad   min   max range
## Employee     1 208   104.50    60.19   104.5   104.50   77.10     1   208   207
## Education    2 208     3.16     1.47     3.0     3.20    1.48     1     5     4
## Grade        3 208     2.76     1.57     3.0     2.62    1.48     1     6     5
## Years1       4 208     9.67     6.99     8.0     8.58    5.93     2    39    37
## Years2       5 208     2.38     3.14     1.0     1.80    1.48     0    18    18
## Age          6 208    40.39    10.32    38.5    39.78   11.12    22    65    43
## Salary       7 208 39921.92 11256.15 37000.0 38099.70 8154.30 26700 97000 70300
## Gender*      8 208     1.33     0.47     1.0     1.29    0.00     1     2     1
## PC Job*      9 208     1.09     0.29     1.0     1.00    0.00     1     2     1
##           skew kurtosis     se
## Employee  0.00    -1.22   4.17
## Education 0.00    -1.33   0.10
## Grade     0.53    -0.82   0.11
## Years1    1.65     3.15   0.48
## Years2    1.59     2.80   0.22
## Age       0.46    -0.73   0.72
## Salary    2.43     8.34 780.47
## Gender*   0.73    -1.47   0.03
## PC Job*   2.82     5.96   0.02

Modelling

Let’s run a salary model.

Let’s do Age only

regAge<-lm(Bank_Salaries$Salary~Bank_Salaries$Age)
summary (regAge)
## 
## Call:
## lm(formula = Bank_Salaries$Salary ~ Bank_Salaries$Age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20155  -6652  -1251   5144  48451 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       23010.70    2925.50   7.866 2.04e-13 ***
## Bank_Salaries$Age   418.65      70.18   5.965 1.05e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10420 on 206 degrees of freedom
## Multiple R-squared:  0.1473, Adjusted R-squared:  0.1432 
## F-statistic: 35.59 on 1 and 206 DF,  p-value: 1.05e-08
plot(regAge)

Let’s add gender

regAgeGender<-lm(Bank_Salaries$Salary~Bank_Salaries$Age+Bank_Salaries$Gender)
summary(regAgeGender)
## 
## Call:
## lm(formula = Bank_Salaries$Salary ~ Bank_Salaries$Age + Bank_Salaries$Gender)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23474  -5766   -968   4702  41839 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              18948.46    2757.29   6.872 7.45e-11 ***
## Bank_Salaries$Age          446.65      64.48   6.926 5.46e-11 ***
## Bank_Salaries$GenderMale  8966.98    1415.11   6.337 1.46e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9551 on 205 degrees of freedom
## Multiple R-squared:  0.287,  Adjusted R-squared:   0.28 
## F-statistic: 41.25 on 2 and 205 DF,  p-value: 8.792e-16
plot(regAgeGender)

Let’s interact age and gender

regAgeGenderInt<-lm(Bank_Salaries$Salary~Bank_Salaries$Age+Bank_Salaries$Gender+Bank_Salaries$Age*Bank_Salaries$Gender)

summary(regAgeGenderInt)
## 
## Call:
## lm(formula = Bank_Salaries$Salary ~ Bank_Salaries$Age + Bank_Salaries$Gender + 
##     Bank_Salaries$Age * Bank_Salaries$Gender)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30598  -4921  -1097   5004  33098 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                 33130.87    3374.81   9.817
## Bank_Salaries$Age                              99.77      80.54   1.239
## Bank_Salaries$GenderMale                   -21139.38    4923.58  -4.293
## Bank_Salaries$Age:Bank_Salaries$GenderMale    751.22     118.52   6.338
##                                            Pr(>|t|)    
## (Intercept)                                 < 2e-16 ***
## Bank_Salaries$Age                             0.217    
## Bank_Salaries$GenderMale                   2.72e-05 ***
## Bank_Salaries$Age:Bank_Salaries$GenderMale 1.46e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8751 on 204 degrees of freedom
## Multiple R-squared:  0.4043, Adjusted R-squared:  0.3955 
## F-statistic: 46.15 on 3 and 204 DF,  p-value: < 2.2e-16
plot(regAgeGenderInt)

#install.packages("car")
#install.packages("carData")

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
vif(regAgeGender)
##    Bank_Salaries$Age Bank_Salaries$Gender 
##             1.004715             1.004715

VIF command checks for variance inflation factor. Value of more than 10 is an indicator of multicolinearity.

regAgeGenderYears1Grade<-lm(Bank_Salaries$Salary~Bank_Salaries$Age+Bank_Salaries$Gender+Bank_Salaries$Age*Bank_Salaries$Gender+Bank_Salaries$Years1+Bank_Salaries$Grade)

summary(regAgeGenderYears1Grade)
## 
## Call:
## lm(formula = Bank_Salaries$Salary ~ Bank_Salaries$Age + Bank_Salaries$Gender + 
##     Bank_Salaries$Age * Bank_Salaries$Gender + Bank_Salaries$Years1 + 
##     Bank_Salaries$Grade)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30090.0  -2726.6   -541.7   2574.6  23149.4 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                 28818.43    2343.08  12.299
## Bank_Salaries$Age                            -139.09      60.22  -2.310
## Bank_Salaries$GenderMale                   -12590.38    3260.58  -3.861
## Bank_Salaries$Years1                          495.87      85.29   5.814
## Bank_Salaries$Grade                          3921.90     309.11  12.688
## Bank_Salaries$Age:Bank_Salaries$GenderMale    405.32      79.86   5.075
##                                            Pr(>|t|)    
## (Intercept)                                 < 2e-16 ***
## Bank_Salaries$Age                          0.021916 *  
## Bank_Salaries$GenderMale                   0.000152 ***
## Bank_Salaries$Years1                       2.35e-08 ***
## Bank_Salaries$Grade                         < 2e-16 ***
## Bank_Salaries$Age:Bank_Salaries$GenderMale 8.76e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5684 on 202 degrees of freedom
## Multiple R-squared:  0.7512, Adjusted R-squared:  0.745 
## F-statistic:   122 on 5 and 202 DF,  p-value: < 2.2e-16

Train / Test split and training ->testing approach

set.seed(123)

TrainIndex<-sample(1:nrow(Bank_Salaries), 150)

trainBank<-Bank_Salaries[TrainIndex,]
testBank<-Bank_Salaries[-TrainIndex,]

Let’s run the last, big, model from above on training set and then test it on the test set. IMPORTANT, you must use just variable name and data=testBank option. If you use testBank$Variable option, prediction will not work.

reg1train<-lm(Salary~Age+Gender+Age*Gender+Years1+Grade, data=trainBank)

Now let’s predict using test set

predictions<-predict(reg1train, testBank)
summary(predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28550   33274   37231   39551   44969   66225

For diagnosis, in this case, we use measures of fit such as Root Mean Square Error (RMSE), Mean Percentage Error (MAPE) and Mean Absolute Error (MAE) (just to name a few)

#install.packages("Metrics")
library(Metrics)
RMSE<-rmse(testBank$Salary, predictions)
MSE<-mse(testBank$Salary, predictions)
MAE<-mae(testBank$Salary, predictions)

Now, let’s create a table of these measures

diagnostics_table<-data.frame(Metric=c("RMSE", "MSE", "MAE"), value=c(RMSE, MSE, MAE))
print(diagnostics_table)
##   Metric        value
## 1   RMSE     6869.554
## 2    MSE 47190775.897
## 3    MAE     4382.831

Heteroskedasticity robust standard errors

Most common issue when estimating linear models is heteroskedasticity. It does not bias the coefficients, but it does bias standard errors of coefficients which, then, yields inappropriate p-values. Good practice is to always use the robust errors so that you have more accurate picture of which coefficients are actually significant. If heteroskedasticity is not present, then there will be no differnce between the two approaches.

So, let’s Look at robust standard errors and see if there are any changes in t-values for the coefficients in the big model we just ran on train data (this can be done with all other regression objects above)

#install.packages("sandwich")
#install.packages("lmtest")
#install.packages("zoo")
library(sandwich)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
robust_se<-coeftest(reg1train, vcov = vcovHC(reg1train, type="HC3"))
print(robust_se)
## 
## t test of coefficients:
## 
##                 Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)    26300.210   2231.650 11.7851 < 2.2e-16 ***
## Age             -120.560     69.647 -1.7310  0.085592 .  
## GenderMale     -9477.294   3197.930 -2.9636  0.003559 ** 
## Years1           690.376    166.530  4.1457  5.76e-05 ***
## Grade           4102.805    351.935 11.6578 < 2.2e-16 ***
## Age:GenderMale   310.375     96.637  3.2118  0.001628 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(reg1train)
## 
## Call:
## lm(formula = Salary ~ Age + Gender + Age * Gender + Years1 + 
##     Grade, data = trainBank)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13876.9  -2884.2   -390.3   2396.9  19128.0 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    26300.21    2571.05  10.229  < 2e-16 ***
## Age             -120.56      64.37  -1.873  0.06311 .  
## GenderMale     -9477.29    3565.82  -2.658  0.00875 ** 
## Years1           690.38      97.69   7.067 6.29e-11 ***
## Grade           4102.81     345.84  11.863  < 2e-16 ***
## Age:GenderMale   310.38      88.92   3.491  0.00064 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5398 on 144 degrees of freedom
## Multiple R-squared:  0.8234, Adjusted R-squared:  0.8173 
## F-statistic: 134.3 on 5 and 144 DF,  p-value: < 2.2e-16

Notice that Age is actually insignificant when we use robust standard errors.

THAT’S ALL FOLKS :). For logistic regression see the other Rmd File