NOTE: since I have installed packages to check code, I put # in front of all those commands. You should remove it and run the installation. Also, follow instructions in Logit Rmd on the bottom to update R. Also, this Rmd file only looks at linear regression, I already posted Rmd for logistic regression.
First let’s import the data (Rmarkdown seems to refuse to knit if you don’t do so)
library(readxl)
Bank_Salaries <- read_excel("C:/Users/atomi/Downloads/Bank Salaries.xlsx")
View(Bank_Salaries)
Let’s start with straight-forward OLS estimation. First, let’s make sure all the data is in the proper form
library(psych)
str(Bank_Salaries)
## tibble [208 × 9] (S3: tbl_df/tbl/data.frame)
## $ Employee : num [1:208] 1 2 3 4 5 6 7 8 9 10 ...
## $ Education: num [1:208] 3 1 1 2 3 3 3 3 1 3 ...
## $ Grade : num [1:208] 1 1 1 1 1 1 1 1 1 1 ...
## $ Years1 : num [1:208] 3 14 12 8 3 3 4 8 4 9 ...
## $ Years2 : num [1:208] 1 1 0 7 0 0 0 2 0 0 ...
## $ Age : num [1:208] 26 38 35 40 28 24 27 33 62 31 ...
## $ Salary : num [1:208] 32000 39100 33200 30600 29000 30500 30000 27000 34000 29500 ...
## $ Gender : chr [1:208] "Male" "Female" "Female" "Female" ...
## $ PC Job : chr [1:208] "No" "No" "No" "No" ...
describe(Bank_Salaries)
## vars n mean sd median trimmed mad min max range
## Employee 1 208 104.50 60.19 104.5 104.50 77.10 1 208 207
## Education 2 208 3.16 1.47 3.0 3.20 1.48 1 5 4
## Grade 3 208 2.76 1.57 3.0 2.62 1.48 1 6 5
## Years1 4 208 9.67 6.99 8.0 8.58 5.93 2 39 37
## Years2 5 208 2.38 3.14 1.0 1.80 1.48 0 18 18
## Age 6 208 40.39 10.32 38.5 39.78 11.12 22 65 43
## Salary 7 208 39921.92 11256.15 37000.0 38099.70 8154.30 26700 97000 70300
## Gender* 8 208 1.33 0.47 1.0 1.29 0.00 1 2 1
## PC Job* 9 208 1.09 0.29 1.0 1.00 0.00 1 2 1
## skew kurtosis se
## Employee 0.00 -1.22 4.17
## Education 0.00 -1.33 0.10
## Grade 0.53 -0.82 0.11
## Years1 1.65 3.15 0.48
## Years2 1.59 2.80 0.22
## Age 0.46 -0.73 0.72
## Salary 2.43 8.34 780.47
## Gender* 0.73 -1.47 0.03
## PC Job* 2.82 5.96 0.02
Let’s run a salary model.
Let’s do Age only
regAge<-lm(Bank_Salaries$Salary~Bank_Salaries$Age)
summary (regAge)
##
## Call:
## lm(formula = Bank_Salaries$Salary ~ Bank_Salaries$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20155 -6652 -1251 5144 48451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23010.70 2925.50 7.866 2.04e-13 ***
## Bank_Salaries$Age 418.65 70.18 5.965 1.05e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10420 on 206 degrees of freedom
## Multiple R-squared: 0.1473, Adjusted R-squared: 0.1432
## F-statistic: 35.59 on 1 and 206 DF, p-value: 1.05e-08
plot(regAge)
Let’s add gender
regAgeGender<-lm(Bank_Salaries$Salary~Bank_Salaries$Age+Bank_Salaries$Gender)
summary(regAgeGender)
##
## Call:
## lm(formula = Bank_Salaries$Salary ~ Bank_Salaries$Age + Bank_Salaries$Gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23474 -5766 -968 4702 41839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18948.46 2757.29 6.872 7.45e-11 ***
## Bank_Salaries$Age 446.65 64.48 6.926 5.46e-11 ***
## Bank_Salaries$GenderMale 8966.98 1415.11 6.337 1.46e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9551 on 205 degrees of freedom
## Multiple R-squared: 0.287, Adjusted R-squared: 0.28
## F-statistic: 41.25 on 2 and 205 DF, p-value: 8.792e-16
plot(regAgeGender)
Let’s interact age and gender
regAgeGenderInt<-lm(Bank_Salaries$Salary~Bank_Salaries$Age+Bank_Salaries$Gender+Bank_Salaries$Age*Bank_Salaries$Gender)
summary(regAgeGenderInt)
##
## Call:
## lm(formula = Bank_Salaries$Salary ~ Bank_Salaries$Age + Bank_Salaries$Gender +
## Bank_Salaries$Age * Bank_Salaries$Gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30598 -4921 -1097 5004 33098
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 33130.87 3374.81 9.817
## Bank_Salaries$Age 99.77 80.54 1.239
## Bank_Salaries$GenderMale -21139.38 4923.58 -4.293
## Bank_Salaries$Age:Bank_Salaries$GenderMale 751.22 118.52 6.338
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Bank_Salaries$Age 0.217
## Bank_Salaries$GenderMale 2.72e-05 ***
## Bank_Salaries$Age:Bank_Salaries$GenderMale 1.46e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8751 on 204 degrees of freedom
## Multiple R-squared: 0.4043, Adjusted R-squared: 0.3955
## F-statistic: 46.15 on 3 and 204 DF, p-value: < 2.2e-16
plot(regAgeGenderInt)
#install.packages("car")
#install.packages("carData")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
vif(regAgeGender)
## Bank_Salaries$Age Bank_Salaries$Gender
## 1.004715 1.004715
VIF command checks for variance inflation factor. Value of more than 10 is an indicator of multicolinearity.
regAgeGenderYears1Grade<-lm(Bank_Salaries$Salary~Bank_Salaries$Age+Bank_Salaries$Gender+Bank_Salaries$Age*Bank_Salaries$Gender+Bank_Salaries$Years1+Bank_Salaries$Grade)
summary(regAgeGenderYears1Grade)
##
## Call:
## lm(formula = Bank_Salaries$Salary ~ Bank_Salaries$Age + Bank_Salaries$Gender +
## Bank_Salaries$Age * Bank_Salaries$Gender + Bank_Salaries$Years1 +
## Bank_Salaries$Grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30090.0 -2726.6 -541.7 2574.6 23149.4
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 28818.43 2343.08 12.299
## Bank_Salaries$Age -139.09 60.22 -2.310
## Bank_Salaries$GenderMale -12590.38 3260.58 -3.861
## Bank_Salaries$Years1 495.87 85.29 5.814
## Bank_Salaries$Grade 3921.90 309.11 12.688
## Bank_Salaries$Age:Bank_Salaries$GenderMale 405.32 79.86 5.075
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Bank_Salaries$Age 0.021916 *
## Bank_Salaries$GenderMale 0.000152 ***
## Bank_Salaries$Years1 2.35e-08 ***
## Bank_Salaries$Grade < 2e-16 ***
## Bank_Salaries$Age:Bank_Salaries$GenderMale 8.76e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5684 on 202 degrees of freedom
## Multiple R-squared: 0.7512, Adjusted R-squared: 0.745
## F-statistic: 122 on 5 and 202 DF, p-value: < 2.2e-16
set.seed(123)
TrainIndex<-sample(1:nrow(Bank_Salaries), 150)
trainBank<-Bank_Salaries[TrainIndex,]
testBank<-Bank_Salaries[-TrainIndex,]
Let’s run the last, big, model from above on training set and then test it on the test set. IMPORTANT, you must use just variable name and data=testBank option. If you use testBank$Variable option, prediction will not work.
reg1train<-lm(Salary~Age+Gender+Age*Gender+Years1+Grade, data=trainBank)
Now let’s predict using test set
predictions<-predict(reg1train, testBank)
summary(predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28550 33274 37231 39551 44969 66225
For diagnosis, in this case, we use measures of fit such as Root Mean Square Error (RMSE), Mean Percentage Error (MAPE) and Mean Absolute Error (MAE) (just to name a few)
#install.packages("Metrics")
library(Metrics)
RMSE<-rmse(testBank$Salary, predictions)
MSE<-mse(testBank$Salary, predictions)
MAE<-mae(testBank$Salary, predictions)
Now, let’s create a table of these measures
diagnostics_table<-data.frame(Metric=c("RMSE", "MSE", "MAE"), value=c(RMSE, MSE, MAE))
print(diagnostics_table)
## Metric value
## 1 RMSE 6869.554
## 2 MSE 47190775.897
## 3 MAE 4382.831
Most common issue when estimating linear models is heteroskedasticity. It does not bias the coefficients, but it does bias standard errors of coefficients which, then, yields inappropriate p-values. Good practice is to always use the robust errors so that you have more accurate picture of which coefficients are actually significant. If heteroskedasticity is not present, then there will be no differnce between the two approaches.
So, let’s Look at robust standard errors and see if there are any changes in t-values for the coefficients in the big model we just ran on train data (this can be done with all other regression objects above)
#install.packages("sandwich")
#install.packages("lmtest")
#install.packages("zoo")
library(sandwich)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
robust_se<-coeftest(reg1train, vcov = vcovHC(reg1train, type="HC3"))
print(robust_se)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26300.210 2231.650 11.7851 < 2.2e-16 ***
## Age -120.560 69.647 -1.7310 0.085592 .
## GenderMale -9477.294 3197.930 -2.9636 0.003559 **
## Years1 690.376 166.530 4.1457 5.76e-05 ***
## Grade 4102.805 351.935 11.6578 < 2.2e-16 ***
## Age:GenderMale 310.375 96.637 3.2118 0.001628 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(reg1train)
##
## Call:
## lm(formula = Salary ~ Age + Gender + Age * Gender + Years1 +
## Grade, data = trainBank)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13876.9 -2884.2 -390.3 2396.9 19128.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26300.21 2571.05 10.229 < 2e-16 ***
## Age -120.56 64.37 -1.873 0.06311 .
## GenderMale -9477.29 3565.82 -2.658 0.00875 **
## Years1 690.38 97.69 7.067 6.29e-11 ***
## Grade 4102.81 345.84 11.863 < 2e-16 ***
## Age:GenderMale 310.38 88.92 3.491 0.00064 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5398 on 144 degrees of freedom
## Multiple R-squared: 0.8234, Adjusted R-squared: 0.8173
## F-statistic: 134.3 on 5 and 144 DF, p-value: < 2.2e-16
Notice that Age is actually insignificant when we use robust standard errors.
THAT’S ALL FOLKS :). For logistic regression see the other Rmd File