In regression analysis, a prerequisite is that all variables are measured on an interval scale, i.e., the distance between each step on the scale is equal. It is, however, not always the case that variables of interest can be placed on such a scale. Different types of classes and properties may not have an inherent rank order. For example, if you are interested in the effect of different areas of specialisation on political attitudes, you cannot assume that a natural science education is twice as much as an education in the social sciences, or that a librarian education is 0.5 of a course of training in the biomedical field. The various programs are simply different (though some aspects of them may be compared, for example, duration).
But what do you do if you believe that education is a key factor that you want to take into account in your analysis? This is where dummy coding comes into play. Dummy variables are, quite simply, binary variables that represent categories. You make a variable for each of the properties you are interested in. This variable, in turn, has only two possible values: 0 and 1. The variable has a value of 1 when the unit of analysis has the property you are interested in, and 0 otherwise. This means that there is only one step on the variable's scale, so it can be treated as interval-scaled: all the steps are the same length! It is therefore suitable for use in regression analysis.
In the case of education, you might have a dummy-coded variable for education in the social sciences. People who studied social sciences have a 1 on the dummy variable “social sciences”. Those who have a 0 studied another subject. You can also use a dummy variable for gender. You then choose a gender, for example, woman, and create a dummy variable that measures the property of being a woman. Everyone with a 1 on the variable is a woman, and those with a 0 have another property (probably man).
You simply make a dummy variable for each of the properties you are interested in!
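To make the idea concrete, here is a minimal sketch in R. The `education` vector and its category labels are hypothetical toy data invented for illustration; they are not part of the dataset used later in this guide.

```r
# Hypothetical toy data: field of education for six respondents
education <- c("social", "natural", "library", "social", "biomedical", "natural")

# Dummy for "social sciences": 1 if the respondent studied it, 0 otherwise
social_sciences <- as.numeric(education == "social")

# A second dummy, this time for "natural sciences"
natural_sciences <- as.numeric(education == "natural")

# Put the original variable and its dummies side by side
toy <- data.frame(education, social_sciences, natural_sciences)
print(toy)
```

Each dummy answers one yes/no question about a respondent, which is why it only ever takes the values 0 and 1.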
dataFrame <- read.csv("https://stats.idre.ucla.edu/stat/data/hsb2.csv")
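We can create dummy variables by using the as.numeric() function, for example. Here, the race variable has four categories coded 1 to 4. We create dummy variables for categories 2, 3, and 4, which leaves category 1 as the reference group.
dataFrame$Dum1 <- as.numeric(dataFrame$race == 2)
dataFrame$Dum2 <- as.numeric(dataFrame$race == 3)
dataFrame$Dum3 <- as.numeric(dataFrame$race == 4)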
It is worth pointing out here that we can create dummy variables in many other ways in R. For instance, we can use other packages or the ifelse() function. You can see several methods for making dummy variables in R here: https://www.marsja.se/create-dummy-variables-in-r/.
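As a rough illustration of two such alternatives, the sketch below uses ifelse() and base R's model.matrix(). The `toy` data frame is a hypothetical stand-in for the race column (codes 1 to 4), so the example runs without downloading anything.

```r
# Hypothetical stand-in for the race column of the dataset (codes 1 to 4)
toy <- data.frame(race = c(1, 2, 3, 4, 4, 2, 1, 3))

# ifelse() returns 1 where the condition holds and 0 elsewhere
toy$Dum1 <- ifelse(toy$race == 2, 1, 0)
toy$Dum2 <- ifelse(toy$race == 3, 1, 0)
toy$Dum3 <- ifelse(toy$race == 4, 1, 0)

# model.matrix() builds indicator columns from a factor;
# its first column is the intercept, so we drop it here
mm <- model.matrix(~ factor(race), data = toy)[, -1]

# Check that both approaches give the same 0/1 values
same <- all(as.matrix(toy[, c("Dum1", "Dum2", "Dum3")]) == mm)
print(same)
```

Which approach to use is mostly a matter of taste: ifelse() keeps each dummy explicit, while model.matrix() scales better when a variable has many categories.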
Now that we have created the dummy variables, we can go ahead and carry out the regression analysis in R.
fit <- lm(write ~ Dum1 + Dum2 + Dum3, data=dataFrame)
summary(fit)
##
## Call:
## lm(formula = write ~ Dum1 + Dum2 + Dum3, data = dataFrame)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.0552  -5.4583   0.9724   7.0000  18.8000
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   46.458      1.842  25.218  < 2e-16 ***
## Dum1          11.542      3.286   3.512 0.000552 ***
## Dum2           1.742      2.732   0.637 0.524613
## Dum3           7.597      1.989   3.820 0.000179 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.025 on 196 degrees of freedom
## Multiple R-squared: 0.1071, Adjusted R-squared: 0.0934
## F-statistic: 7.833 on 3 and 196 DF, p-value: 5.785e-05
From this result, we can infer that the effects of the first and third dummy variables (Dum1 and Dum3) were statistically significant. This means that Group 2 had greater outcome values than Group 1, and Group 4 also had greater outcome values than Group 1. However, Group 3 did not differ significantly from Group 1. Note, however, that we can obtain the same results by changing the variable to a factor variable:
fit1 <- lm(write ~ factor(race), data=dataFrame)
summary(fit1)
##
## Call:
## lm(formula = write ~ factor(race), data = dataFrame)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.0552  -5.4583   0.9724   7.0000  18.8000
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     46.458      1.842  25.218  < 2e-16 ***
## factor(race)2   11.542      3.286   3.512 0.000552 ***
## factor(race)3    1.742      2.732   0.637 0.524613
## factor(race)4    7.597      1.989   3.820 0.000179 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.025 on 196 degrees of freedom
## Multiple R-squared: 0.1071, Adjusted R-squared: 0.0934
## F-statistic: 7.833 on 3 and 196 DF, p-value: 5.785e-05
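The two outputs match because factor() makes R create the same kind of 0/1 indicators internally, with the first level as the reference group. The sketch below illustrates this on hypothetical toy data (the `toy` data frame, its `group` and `y` columns, and all values are invented for the example). It also shows that the intercept is the mean of the reference group, that each dummy coefficient is a difference in group means, and that relevel() can be used to choose another reference group.

```r
# Hypothetical toy data: an outcome y measured in three groups
toy <- data.frame(
  group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  y     = c(10, 12, 14, 20, 22, 24, 11, 13, 15)
)

# Manual dummies, with group 1 left out as the reference category
toy$D2 <- as.numeric(toy$group == 2)
toy$D3 <- as.numeric(toy$group == 3)

fit_dummy  <- lm(y ~ D2 + D3, data = toy)
fit_factor <- lm(y ~ factor(group), data = toy)

# Same estimates; only the coefficient names differ
print(all.equal(unname(coef(fit_dummy)), unname(coef(fit_factor))))

# Intercept = mean of the reference group;
# each dummy coefficient = that group's mean minus the reference mean
group_means <- tapply(toy$y, toy$group, mean)
print(group_means)
print(coef(fit_factor))

# relevel() selects a different reference category (here group 3)
toy$group_f <- relevel(factor(toy$group), ref = "3")
fit_ref3 <- lm(y ~ group_f, data = toy)
print(coef(fit_ref3))
```

Changing the reference group does not change the fit of the model, only which comparisons the coefficients express.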