Zahid Asghar
11/25/2020
Qualitative variables are nominal scale variables which have no particular numerical values.
We can “quantify” them by creating so-called dummy variables, which take the value 0 or 1.
For example, a variable denoting gender can be quantified as female = 1 and male = 0 or vice versa.
Dummy variables are also called indicator variables, categorical variables, and qualitative variables. Examples of what they can represent: gender, race, color, religion, nationality, geographical region, party affiliation, and political upheavals.
If an intercept is included in the model and if a qualitative variable has m categories, then introduce only (m – 1) dummy variables.
If we consider self-reported health as a choice among excellent, good, and poor, we can have at most two dummy variables to represent the three categories.
If we do not follow this rule, we will fall into what is called the dummy variable trap, the situation of perfect collinearity.
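The perfect collinearity behind the dummy variable trap can be checked directly on a design matrix. The following is a minimal sketch in base R, assuming a small made-up `health` variable with the three categories mentioned above. The data and the names `all_dummies`, `X_trap`, and `X_ok` are illustrative and not part of the original analysis.

```r
# Hypothetical self-reported health variable with three categories
health <- factor(c("excellent", "good", "poor", "good", "excellent", "poor"))

# One dummy per category and no intercept: 3 columns
all_dummies <- model.matrix(~ health + 0)

# Adding a constant column makes the columns linearly dependent,
# because the three dummies sum to 1 in every row
X_trap <- cbind(intercept = 1, all_dummies)
ncol(X_trap)       # 4 columns
qr(X_trap)$rank    # but only rank 3: perfect collinearity

# With an intercept, model.matrix() keeps only m - 1 = 2 dummies
X_ok <- model.matrix(~ health)
ncol(X_ok)         # 3 columns
qr(X_ok)$rank      # rank 3: full column rank
```

Because the rank of `X_trap` is smaller than its number of columns, the least-squares coefficients are not uniquely determined. Dropping one dummy (or the intercept) restores full column rank.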
The category that gets the value of 0 is called the reference, benchmark, or comparison category.
All comparisons are made in relation to the reference category.
If there are several dummy variables, you must keep track of the reference category; otherwise, it will be difficult to interpret the results.
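One way to keep track of the reference category in R is to store the qualitative variable as a factor and set its first level explicitly. This is a sketch on made-up data; `health` and `health_poor_ref` are hypothetical names used only for illustration.

```r
# Hypothetical self-reported health variable with three categories
health <- factor(c("excellent", "good", "poor", "good", "excellent", "poor"))

# By default the first level (alphabetical order) is the reference category
levels(health)[1]

# relevel() moves the chosen category to the front, making it the reference
health_poor_ref <- relevel(health, ref = "poor")
levels(health_poor_ref)[1]

# The design matrix now contains dummies for the non-reference categories only
colnames(model.matrix(~ health_poor_ref))
```

With `poor` as the reference, each dummy coefficient in a regression would be read as the difference from the `poor` group.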
If there is an intercept in the regression model, the number of dummy variables must be one less than the number of classifications of each qualitative variable.
If you drop the (common) intercept from the model, you can have as many dummy variables as the number of categories of the qualitative variable.
The coefficient of a dummy variable must always be interpreted in relation to the reference category.
Dummy variables can interact with quantitative regressors as well as with qualitative regressors. If a model has several qualitative variables with several categories, introduction of dummies for all the combinations can consume a large number of degrees of freedom.
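To see how quickly interacted dummies use up degrees of freedom, one can count the columns of the design matrix. The sketch below assumes two made-up qualitative variables, `region` (4 categories) and `health` (3 categories), plus a tiny toy data set for a dummy interacted with a quantitative regressor. None of these objects come from the original data.

```r
# Hypothetical qualitative variables: every region/health combination once
grid <- expand.grid(
  region = factor(c("north", "south", "east", "west")),
  health = factor(c("excellent", "good", "poor"))
)

# Main effects only: intercept + (4 - 1) + (3 - 1) = 6 parameters
X_main <- model.matrix(~ region + health, data = grid)
ncol(X_main)

# All combinations: 4 * 3 = 12 parameters, one per cell
X_full <- model.matrix(~ region * health, data = grid)
ncol(X_full)

# A dummy interacted with a quantitative regressor adds a differential slope
toy <- data.frame(beauty = c(-1, 0, 1, 2), female = c(0, 1, 0, 1))
colnames(model.matrix(~ beauty * female, data = toy))
```

Each extra column is one more coefficient to estimate, so with a small sample a fully interacted specification may leave few degrees of freedom.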
Dummy coefficients are often called differential intercept coefficients, for they show the difference between the intercept of the category that gets the value of 1 and the intercept of the reference category.
The common intercept value refers to all those categories that take a value of 0.
If we have Yi = B1 + B2 Fi, where Y = wage and F = a female dummy variable (1 for females, 0 for males),
then, on average, females earn a wage of (B1 + B2) and males earn a wage of B1. (Note that B2 can be negative.)
Thus the average female wage differs from the average male wage by B2: it is higher if B2 is positive and lower if B2 is negative.
Since wages tend to be skewed to the right, we might instead model the wage function as: lnYi = B1 + B2 Fi
In this case, females earn (exp(B2) – 1)*100% more than males on average (less than males if B2 is negative).
On average, male wages are equal to exp(B1), and female wages are equal to exp(B1+B2).
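The percentage interpretation in the log-wage model follows from comparing the two predicted values. A short derivation, ignoring the error term and writing Y_female and Y_male for the predicted wages of the two groups (this notation is introduced here only for the derivation):

```latex
\begin{aligned}
\ln Y_{\text{female}} - \ln Y_{\text{male}} &= (B_1 + B_2) - B_1 = B_2 \\
\frac{Y_{\text{female}}}{Y_{\text{male}}} &= e^{B_2} \\
\frac{Y_{\text{female}} - Y_{\text{male}}}{Y_{\text{male}}} \times 100\% &= \left(e^{B_2} - 1\right) \times 100\%
\end{aligned}
```

As a purely hypothetical illustration, B2 = -0.20 would give (exp(-0.20) - 1)*100% ≈ -18.1%, that is, female wages about 18.1% below male wages, which is close to but not the same as -20%.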
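library(readxl)  # provides read_excel()
library(dplyr)   # provides sample_n(), mutate(), and %>%
TeachingRatings <- read_excel("C:/Users/hp/Dropbox/Applied Econometrics SBP/Stock and Watson Data sets/TeachingRatings.xls")
sample_n(TeachingRatings, size = 5)  # inspect five randomly chosen rows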
minority | age | female | onecredit | beauty | course_eval | intro | nnenglish |
---|---|---|---|---|---|---|---|
0 | 49 | 0 | 0 | 1.05 | 4 | 0 | 0 |
0 | 42 | 0 | 0 | 0.217 | 3.8 | 0 | 0 |
0 | 38 | 1 | 0 | -1.02 | 3.5 | 1 | 0 |
0 | 35 | 0 | 0 | 0.275 | 4.2 | 0 | 0 |
0 | 39 | 0 | 0 | 0.577 | 4.2 | 0 | 0 |
TeachingRatings1 <- TeachingRatings %>% mutate(male = 1 - female, advanced = 1 - intro)  # complementary dummies
set.seed(12345)  # make the random sample reproducible
sample_data <- sample_n(TeachingRatings1, size = 20)
sample_data
minority | age | female | onecredit | beauty | course_eval | intro | nnenglish | male | advanced |
---|---|---|---|---|---|---|---|---|---|
0 | 47 | 0 | 0 | 0.541 | 4.7 | 0 | 0 | 1 | 1 |
0 | 33 | 1 | 0 | 0.724 | 4.4 | 0 | 0 | 0 | 1 |
0 | 64 | 0 | 0 | -0.111 | 4.4 | 0 | 0 | 1 | 1 |
1 | 52 | 0 | 0 | 0.212 | 3.5 | 0 | 1 | 1 | 1 |
1 | 52 | 0 | 0 | 0.212 | 3.9 | 0 | 1 | 1 | 1 |
0 | 42 | 0 | 0 | 0.217 | 3.7 | 0 | 0 | 1 | 1 |
0 | 57 | 0 | 0 | 0.632 | 4.2 | 0 | 0 | 1 | 1 |
0 | 32 | 0 | 0 | 1.23 | 4.3 | 1 | 0 | 1 | 0 |
0 | 47 | 1 | 0 | 0.339 | 3.8 | 1 | 0 | 0 | 0 |
0 | 52 | 1 | 0 | -1.09 | 4.4 | 0 | 0 | 0 | 1 |
1 | 52 | 0 | 0 | 0.212 | 4.6 | 0 | 1 | 1 | 1 |
1 | 47 | 0 | 0 | -1.05 | 3.4 | 0 | 0 | 1 | 1 |
0 | 42 | 0 | 0 | 1.77 | 4.9 | 1 | 0 | 1 | 0 |
0 | 60 | 1 | 0 | -0.0567 | 4 | 0 | 1 | 0 | 1 |
0 | 62 | 0 | 0 | -0.728 | 4 | 0 | 0 | 1 | 1 |
0 | 40 | 1 | 0 | -0.678 | 4.6 | 0 | 0 | 0 | 1 |
0 | 52 | 1 | 0 | -1.09 | 3.7 | 0 | 0 | 0 | 1 |
0 | 60 | 0 | 0 | -0.395 | 4.5 | 0 | 0 | 1 | 1 |
0 | 57 | 0 | 0 | -0.767 | 4.7 | 1 | 0 | 1 | 0 |
0 | 37 | 0 | 0 | 0.933 | 3.5 | 0 | 0 | 1 | 1 |
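library(huxtable)  # provides huxreg() and set_caption()
# Intercept plus one dummy: male is the reference category
lm_dummy <- lm(course_eval ~ beauty + female, data = sample_data)
# Intercept plus both dummies (dummy variable trap): male is perfectly collinear, so lm() cannot estimate its coefficient
lm_saturated <- lm(course_eval ~ beauty + female + male, data = sample_data)
# Both dummies without an intercept: each dummy coefficient is that group's own intercept
lm_nointercept <- lm(course_eval ~ beauty + female + male + 0, data = sample_data)
huxreg("One Less Dummy" = lm_dummy, "Constant+Dummies" = lm_saturated, "full dummies without constant" = lm_nointercept) %>% set_caption("Teaching Evaluation as function of Beauty")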
| | One Less Dummy | Constant+Dummies | full dummies without constant |
|---|---|---|---|
| (Intercept) | 4.140 *** | 4.140 *** | |
| | (0.130) | (0.130) | |
| beauty | 0.117 | 0.117 | 0.117 |
| | (0.142) | (0.142) | (0.142) |
| female | 0.046 | 0.046 | 4.186 *** |
| | (0.242) | (0.242) | (0.198) |
| male | | | 4.140 *** |
| | | | (0.130) |
| N | 20 | 20 | 20 |
| R2 | 0.039 | 0.039 | 0.989 |
| logLik | -11.760 | -11.760 | -11.760 |
| AIC | 31.520 | 31.520 | 31.520 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
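The three columns describe the same fit in different parameterizations. In the table, the no-intercept coefficient on male (4.140) equals the intercept of the first model, and the no-intercept coefficient on female (4.186) equals intercept plus differential, 4.140 + 0.046. The log-likelihood and AIC are identical across columns. The much larger R2 in the last column should not be read as a better fit: when a model has no intercept, R's `summary.lm()` computes R2 around zero rather than around the mean of the dependent variable. The sketch below reproduces these relations on simulated data, since the original sample is not available here. All variable values are made up, and the object names `with_intercept`, `trap`, and `no_intercept` are illustrative.

```r
# Simulated data for illustration only; not the TeachingRatings sample
set.seed(1)
n <- 50
female <- rep(c(0, 1), times = n / 2)   # alternate 0/1 so both groups are present
male <- 1 - female
beauty <- rnorm(n)
course_eval <- 4 + 0.1 * beauty + 0.05 * female + rnorm(n, sd = 0.5)

with_intercept <- lm(course_eval ~ beauty + female)             # one less dummy
trap <- lm(course_eval ~ beauty + female + male)                # constant + both dummies
no_intercept <- lm(course_eval ~ beauty + female + male + 0)    # both dummies, no constant

b <- coef(with_intercept)
g <- coef(no_intercept)

# In the trap model lm() cannot estimate the redundant dummy and reports NA
is.na(coef(trap)["male"])

# Male coefficient without intercept = intercept of the reference category
all.equal(unname(g["male"]), unname(b["(Intercept)"]))

# Female coefficient without intercept = intercept + differential intercept
all.equal(unname(g["female"]), unname(b["(Intercept)"] + b["female"]))

# Same fitted values, so the fit itself is unchanged
all.equal(fitted(with_intercept), fitted(no_intercept))

# R2 differs only because it is computed around zero without an intercept
summary(with_intercept)$r.squared
summary(no_intercept)$r.squared
```

The fitted values agree exactly up to numerical precision, which is why comparing the two specifications by R2 would be misleading.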