When one of the predictor variables is qualitative (non-metric), we use dummy variables. If a qualitative variable with a limited set of categories is strongly related to the dependent (response) variable, we cannot leave it out of the model. But because it is qualitative, we cannot compute statistics such as the mean, SD, or variance on it directly for comparisons. Hence we encode it with dummy variables.
Dummy variables are -
* Numeric stand-ins for the categorical values of a qualitative independent variable.
* Always N-1 in number, where N is the number of possible categorical values of the qualitative independent variable.
* Restricted to only 2 numeric values, 0 or 1 (see the short sketch below).
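As a minimal illustration of these points (toy data, not the salary data used below), R's model.matrix() shows how a factor with N = 3 categories expands into N-1 = 2 dummy columns of 0s and 1s, with the omitted level acting as the reference:
# Toy example: a 3-level factor yields 2 dummy columns; "Advanced" is the reference level
edu_toy <- factor(c("HS", "Bachelor", "Advanced", "HS", "Advanced"),
                  levels = c("Advanced", "Bachelor", "HS"))
model.matrix(~ edu_toy)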
We’ll now work through a practical regression problem using dummy variables.
We load the data first. The data set contains the salaries of 46 people. The predictor variables are years of experience (Exp), education (Edu), and a management indicator (Mngt) that records whether the person holds a managerial position. The Edu variable has 3 levels - High School (1), Bachelor Degree (2), Advanced Degree (3). The Mngt variable has 2 levels - 1 (holds a management position) and 0 (does not).
The Mngt variable is already coded as 0/1, so it needs no extra dummy variable; with N = 2 categories it already satisfies the N-1 rule. The Edu variable, however, has 3 possible values, so we need N-1 = 2 dummy variables. These are D1 (High School) and D2 (Bachelor Degree), with Advanced Degree as the reference category.
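The data set shown below already contains the dummy columns. As a rough sketch of how they could have been created (the file name salaries.csv is an assumption, since the loading step is not shown in this post):
# Hypothetical loading step; the actual file/source is not shown here
sal <- read.csv("salaries.csv")
sal$D1 <- ifelse(sal$Edu == 1, 1, 0)  # 1 = High School
sal$D2 <- ifelse(sal$Edu == 2, 1, 0)  # 1 = Bachelor Degree; Advanced Degree is the reference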
head(sal,10)
## Row.Nos Salary Exp Edu Mngt D1 D2
## 1 1 13876 1 1 1 1 0
## 2 2 11608 1 3 0 0 0
## 3 3 18701 1 3 1 0 0
## 4 4 11283 1 2 0 0 1
## 5 5 11767 1 3 0 0 0
## 6 6 20872 2 2 1 0 1
## 7 7 11772 2 2 0 0 1
## 8 8 10535 2 1 0 1 0
## 9 9 12195 2 3 0 0 0
## 10 10 12313 3 2 0 0 1
str(sal)
## 'data.frame': 46 obs. of 7 variables:
## $ Row.Nos: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Salary : int 13876 11608 18701 11283 11767 20872 11772 10535 12195 12313 ...
## $ Exp : int 1 1 1 1 1 2 2 2 2 3 ...
## $ Edu : int 1 3 3 2 3 2 2 1 3 2 ...
## $ Mngt : int 1 0 1 0 0 1 0 0 0 0 ...
## $ D1 : int 1 0 0 0 0 0 0 1 0 0 ...
## $ D2 : int 0 0 0 1 0 1 1 0 0 1 ...
We build the regression model, dropping the row index and the original Edu column (which the dummies replace):
sal_reg <- lm(Salary ~ . -Row.Nos -Edu, data = sal)
summary(sal_reg)
##
## Call:
## lm(formula = Salary ~ . - Row.Nos - Edu, data = sal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1884.6 -653.6 22.2 844.9 1716.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11031.8 383.2 28.79 < 2e-16 ***
## Exp 546.2 30.5 17.90 < 2e-16 ***
## Mngt 6883.5 313.9 21.93 < 2e-16 ***
## D1 -2996.2 411.8 -7.28 6.7e-09 ***
## D2 147.8 387.7 0.38 0.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1030 on 41 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.953
## F-statistic: 227 on 4 and 41 DF, p-value: <2e-16
# Residual plots
par(mfrow = c(2,2))
plot(sal_reg)
We check the Durbin-Watson test value and the VIF values to see whether the model satisfies the regression assumptions.
require(lmtest)
require(car)
dwtest(sal_reg)
##
## Durbin-Watson test
##
## data: sal_reg
## DW = 2.237, p-value = 0.7639
## alternative hypothesis: true autocorrelation is greater than 0
vif(sal_reg)
## Exp Mngt D1 D2
## 1.062 1.055 1.564 1.588
Assumptions
1. Errors should follow a normal distribution - the Normal Q-Q plot looks roughly linear, so this assumption is satisfied.
2. No heteroscedasticity - the residuals-vs-fitted plot shows 6 distinct patterns rather than a random scatter, so this assumption is not met (see the sketch after this list for an optional formal check).
3. No multicollinearity - the VIF values of all predictor variables are below 5, so this assumption is satisfied.
4. No autocorrelation - the Durbin-Watson statistic is 2.237, which is close to 2, so this assumption is also satisfied.
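As an optional, more formal check of the heteroscedasticity seen in the plot, the Breusch-Pagan test from the already-loaded lmtest package could be run (this was not part of the original analysis, and output is not shown; a small p-value would indicate non-constant error variance):
# Optional formal check for heteroscedasticity (not in the original write-up)
bptest(sal_reg)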
Interpretation of the above regression model
1. R-squared is 0.9568, so the model accounts for a high proportion of the variation in salary.
2. The coefficient of Exp is 546.18: each additional year of experience is estimated to raise annual salary by 546.18, holding the other variables fixed.
3. The coefficient of Mngt is 6883.53: holding a managerial position is associated with an annual salary that is, on average, 6883.53 higher than that of a comparable person not in a managerial position.
4. For the education variable:
* D1 measures the salary difference between the High School category and the Advanced Degree category, which is the reference category here.
* D2 measures the salary difference between the Bachelor Degree category and the Advanced Degree reference category.
* D1 - D2 measures the salary difference between the High School and Bachelor Degree levels: -2996.21 - 147.82 = -3144.03, i.e., High School earns about 3144 less per year than Bachelor Degree.
* A person with a High School education earns an annual salary 2996.21 lower than a person with an Advanced Degree.
* A person with a Bachelor Degree earns 147.82 more per year than a person with an Advanced Degree, although this coefficient is not significant in this model. (A short calculation using these coefficients follows this list.)
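To make the coefficient interpretation concrete, here is a small sketch (the chosen profile is hypothetical) computing the fitted annual salary for a person with 5 years of experience, a Bachelor Degree, and a managerial position:
# Fitted salary for a hypothetical profile: Exp = 5, Bachelor Degree (D1 = 0, D2 = 1), Mngt = 1
b <- coef(sal_reg)
b["(Intercept)"] + b["Exp"] * 5 + b["Mngt"] * 1 + b["D1"] * 0 + b["D2"] * 1
# 11031.8 + 546.2*5 + 6883.5 + 147.8 = approximately 20794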
Now that we know our residual plot shows 6 trends, let’s address that.
The residual plot shows 6 trends: the residuals cluster according to the education-management category of each observation. The two categorical variables, Edu (3 levels) and Mngt (2 levels, 0 and 1), appear to interact with each other, producing 3 x 2 = 6 combinations and hence 6 trends.
This behavior implies that the model does not adequately explain the relationship between salary and experience, education, and management variables. The graph points to some hidden structure in the data that has not been explored.
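One way to see this structure directly (not part of the original output) is to colour the residuals by the Edu-Mngt combination; the six clusters should then separate visually:
# Residuals coloured by the 3 x 2 education-management grouping
grp <- interaction(sal$Edu, sal$Mngt)
plot(fitted(sal_reg), resid(sal_reg), col = as.integer(grp), pch = 19,
     xlab = "Fitted values", ylab = "Residuals")
legend("topleft", legend = levels(grp), col = seq_along(levels(grp)), pch = 19)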
We will use an interaction between the independent variables.
Why interaction? An interaction is a situation where the effect of one variable differs depending on the value of a second variable. In regression, the interaction is examined as a separate predictor: an interaction variable is constructed by multiplying the values of one variable by the values of the other, creating a new variable (a brief R sketch follows).
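In R this can be done either by multiplying the columns by hand or by letting the formula interface build the product term; both give the same interaction variable (the hand-built vector below is purely illustrative and is not used in the model that follows):
# Hand-built interaction term (illustrative only; not added to the model below)
mngt_d1 <- sal$Mngt * sal$D1
# Equivalent formula notation: Mngt:D1 is the product term, and Mngt*D1
# expands to Mngt + D1 + Mngt:D1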
We will build our second model with interactions between Mngt and D1, D2.
sal_reg2 <- lm(Salary ~ Exp + Mngt + D1 + D2 + Mngt*D1 + Mngt*D2, data = sal)
summary(sal_reg2)
##
## Call:
## lm(formula = Salary ~ Exp + Mngt + D1 + D2 + Mngt * D1 + Mngt *
## D2, data = sal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -928.1 -46.2 24.3 65.9 204.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11203.43 79.07 141.70 < 2e-16 ***
## Exp 496.99 5.57 89.28 < 2e-16 ***
## Mngt 7047.41 102.59 68.70 < 2e-16 ***
## D1 -1730.75 105.33 -16.43 < 2e-16 ***
## D2 -349.08 97.57 -3.58 0.00095 ***
## Mngt:D1 -3066.04 149.33 -20.53 < 2e-16 ***
## Mngt:D2 1836.49 131.17 14.00 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 174 on 39 degrees of freedom
## Multiple R-squared: 0.999, Adjusted R-squared: 0.999
## F-statistic: 5.52e+03 on 6 and 39 DF, p-value: <2e-16
# Residual plots
par(mfrow = c(2,2))
plot(sal_reg2)
#D-W test
dwtest(sal_reg2)
##
## Durbin-Watson test
##
## data: sal_reg2
## DW = 2.244, p-value = 0.7823
## alternative hypothesis: true autocorrelation is greater than 0
vif(sal_reg2)
## Exp Mngt D1 D2 Mngt:D1 Mngt:D2
## 1.234 3.938 3.577 3.514 3.290 3.380
Final Interpretation
1. The assumptions are now satisfied: the residual plot shows a random scatter rather than distinct clusters.
2. R-squared is 0.999, so the model explains nearly all of the variation in salary.
3. Adjusted R-squared is also 0.999, and every predictor, dummy, and interaction term in the summary is significant.
4. The residual standard error (RMSE) is 173.8, much lower than the previous model's roughly 1030.
5. The overall F-test p-value is below 2.2e-16, far smaller than 0.05, so the model fit is significant.
6. The regression coefficients change once the dummy and interaction terms are included, and they are now straightforward to interpret (a quick comparison of the two models is sketched below).
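As a final sanity check (not shown in the original write-up), the two models are nested, so an F-test can confirm that adding the interaction terms significantly improves the fit:
# Nested-model F-test comparing the model without and with interaction terms
anova(sal_reg, sal_reg2)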