This handout focuses on a basic but important task in R: how do you create a dummy variable from existing data? This is one instance of a more general skill in R: making new variables.
Let’s motivate this task with a straightforward problem. Imagine you were interested in understanding why some regions have different outcomes on different social indicators. For instance, let’s say you were interested in … why different French-speaking provinces in Switzerland in 1888 had different fertility rates.
Using the swiss dataset pre-loaded in R, we can, fortunately, test some hypotheses:
out.lm1 <- lm(Fertility ~ Agriculture + Examination + Catholic + Infant.Mortality, data = swiss)
summary(out.lm1)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Catholic +
##     Infant.Mortality, data = swiss)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.9194  -3.5530  -0.6489   6.5956  14.1767
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)      59.60267   13.04246   4.570 4.25e-05 ***
## Agriculture      -0.04759    0.08032  -0.593 0.556688
## Examination      -0.96805    0.25284  -3.829 0.000423 ***
## Catholic          0.02611    0.03843   0.679 0.500551
## Infant.Mortality  1.39597    0.46259   3.018 0.004315 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.82 on 42 degrees of freedom
## Multiple R-squared:  0.5448, Adjusted R-squared:  0.5014
## F-statistic: 12.57 on 4 and 42 DF,  p-value: 8.272e-07
You might look at this and conclude that provinces with higher examination scores (a rough proxy for education) and lower infant mortality tend to have lower birth rates, while the share of men working in agriculture and the percent of the population that is Catholic don’t appear to matter (neither coefficient is statistically distinguishable from zero).
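If you would rather not read significance stars off the printout, the coefficient table can be pulled out as a matrix. This is a minimal sketch: the names fit and coefs are made up for illustration, and the model is refit on datasets::swiss so that it runs on an untouched copy of the data.

```r
# Refit the motivating regression on a pristine copy of the data
fit <- lm(Fertility ~ Agriculture + Examination + Catholic + Infant.Mortality,
          data = datasets::swiss)

# coef(summary(...)) returns the coefficient table as a numeric matrix
coefs <- coef(summary(fit))

# Show just the estimates and p-values, rounded for readability
round(coefs[, c("Estimate", "Pr(>|t|)")], 4)

# Which terms have p-values below the conventional 0.05 cutoff?
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
```

The last line should list the intercept, Examination, and Infant.Mortality, matching the starred rows in the summary above.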
But what if you think that there’s a difference between a really Catholic region – one with, say, fifty percent or more Catholics – and a marginally Catholic region? You might have a theory about why this would be so (given the state of 19th century social science, it’s probably a pretty bigoted one, but whatever). So you want to test your theory by creating a new variable, Majority.Catholic, from the existing data.
To do this, we’re going to have to first create a new column in the data frame:
swiss[,7] <- NA ## creates a column full of missing values (NA)
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality V7
## Courtelary               22.2 NA
## Delemont                 22.2 NA
## Franches-Mnt             20.2 NA
## Moutier                  20.3 NA
## Neuveville               20.6 NA
## Porrentruy               26.6 NA
We start by creating a column full of missing values because we don’t want to inadvertently assign a value to a record that we don’t actually have information on: any row that our recoding rules fail to match stays NA instead of silently picking up a 0 or a 1.
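To see what the NA-first habit buys you, here is a minimal sketch on made-up data (the vectors pct and flag are hypothetical, not part of swiss). One percentage is missing, so neither recode rule applies to it, and the flag stays NA. which() is used so that the missing value is simply skipped when picking rows.

```r
# Hypothetical toy data: the third percentage is missing
pct <- c(10, 80, NA, 50)

# Start with a flag that is missing everywhere
flag <- rep(NA, length(pct))

# Recode only the positions where a rule actually applies
flag[which(pct >= 50)] <- 1
flag[which(pct < 50)] <- 0

flag  # expect: 0, 1, NA, 1
```

Had flag started out as all zeros, the record with no data would have been counted as a 0 without anyone noticing.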
Now, we recode the variable we have created based on other columns’ information:
swiss[swiss[,5] >= 50, 7] <- 1
swiss[swiss[,5] < 50, 7] <- 0
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality V7
## Courtelary               22.2  0
## Delemont                 22.2  1
## Franches-Mnt             20.2  1
## Moutier                  20.3  0
## Neuveville               20.6  0
## Porrentruy               26.6  1
That worked!
Why?
Go back and read through the code. Read from the inside-out: We are taking all of the rows of swiss in which column 5 is greater than or equal to 50 (swiss[,5]>=50). This corresponds to the logical condition we’re interested in: creating a variable that takes a 1 if the percentage of Catholics in a province is at least 50 and 0 otherwise. So that part is all about identifying those records.
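It can help to look at that inner condition by itself. In this small sketch, sw and is_majority are names invented for illustration, and sw is a fresh copy of the data so nothing above is disturbed.

```r
# Work on a pristine copy so the running example is untouched
sw <- datasets::swiss

# The inner condition is just a vector of TRUE/FALSE, one per row
is_majority <- sw[, 5] >= 50

head(is_majority)
## [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE

# TRUE counts as 1, so sum() gives the number of rows that match
sum(is_majority)
```

Compare the six values with the Catholic column in the head(swiss) output: the TRUEs line up with Delemont, Franches-Mnt, and Porrentruy.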
We then specify that we’re going to be doing something to column 7 when we find those (swiss[swiss[,5]>=50,7]). That’s because, in this example, the column we created is the 7th one in the data frame. (swiss is a data frame, btw, so we could have named this column from the beginning; here we are addressing it by its column number instead.) That thing that we’re doing to that column, based on the logical condition we identified, is turning it from NA to a 1.
The second line of code is the exact opposite: we’re taking any record in which the values in column 5 are less than 50 and turning it into a 0 instead of an NA.
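For comparison, the same recode can be written in one step with ifelse(). This is offered as an alternative sketch, not as the approach this handout uses; sw, two_step, and one_step are hypothetical names, and the check at the end confirms the two routes agree on this dataset.

```r
sw <- datasets::swiss

# The two-step recode from above, done on a standalone vector
two_step <- rep(NA, nrow(sw))
two_step[sw$Catholic >= 50] <- 1
two_step[sw$Catholic < 50] <- 0

# The one-step version: 1 where the condition holds, 0 otherwise
one_step <- ifelse(sw$Catholic >= 50, 1, 0)

# Do the two approaches give the same answer for every province?
all(two_step == one_step)
## [1] TRUE
```

The two-step version is more typing, but it makes each rule explicit, which is handy once there are more than two categories.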
We then run head(swiss) as a quick validation of our code – did it work?
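head() only shows six rows, so a fuller check is worth a couple of extra lines. The sketch below rebuilds the dummy on a fresh copy (sw and the column name dummy are placeholders chosen for illustration) and then checks all 47 provinces at once.

```r
sw <- datasets::swiss
sw$dummy <- NA
sw$dummy[sw$Catholic >= 50] <- 1
sw$dummy[sw$Catholic < 50] <- 0

# Count the 0s and 1s, and show NAs if any rows were left unrecoded
table(sw$dummy, useNA = "ifany")

# Range of Catholic within each group: the 0 group should top out
# below 50 and the 1 group should start at 50 or above
tapply(sw$Catholic, sw$dummy, range)
```

If an NA category shows up in the table, some rows slipped past both recode rules.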
The first time I wrote this code, I specified swiss[,5]>=.5,7 – why did I do that, and what was the result?
A nice thing to do is to name the columns once you’ve created them. (A named column is what lets you refer to the variable by name, in a regression formula for instance, as we’ll see later on.)
colnames(swiss)
## [1] "Fertility" "Agriculture" "Examination"
## [4] "Education" "Catholic" "Infant.Mortality"
## [7] "V7"
colnames(swiss) <- c(colnames(swiss)[1:6],"Majority.Catholic")
names(swiss)
## [1] "Fertility" "Agriculture" "Examination"
## [4] "Education" "Catholic" "Infant.Mortality"
## [7] "Majority.Catholic"
head(swiss[,c(5,7)])
##              Catholic Majority.Catholic
## Courtelary       9.96                 0
## Delemont        84.84                 1
## Franches-Mnt    93.40                 1
## Moutier         33.77                 0
## Neuveville       5.16                 0
## Porrentruy      90.57                 1
colnames() is a funny command: it means both “tell me the names of this object’s columns” and also “change the names of this object’s columns” (see ?colnames for more). In this case, I’m using it both ways (which is simultaneously showy and sloppy; you’d want to be careful with this in the wild).
I’m using colnames() on the right hand side of the second line of code to retrieve the names of the swiss columns, but only the first six elements. Why? Because, as the output of the first line of code shows, the column names are 7 elements long, and the final element is V7, which is the very thing we want to replace! In place of that, I’m putting in the new, desired element: “Majority.Catholic”, using a period to avoid R’s mistrust of empty spaces (like this " ").
On the left hand side of the assignment, I then use colnames() to set the names of the columns to the character vector we built on the right hand side. names(swiss) and head(swiss[,c(5,7)]) are the data integrity checks.
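A less showy way to do the same rename is to assign to a single element of colnames(). This is a minimal sketch on a fresh copy of the data (sw is a hypothetical name), offered as an alternative and not as the approach used above.

```r
sw <- datasets::swiss
sw[, 7] <- NA              # same step as before: new column, named V7

# Replace only the 7th name and leave the other six alone
colnames(sw)[7] <- "Majority.Catholic"

colnames(sw)
```

Because only one name is touched, there is no need to count how many columns to carry over from the old names.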
Great! We’ve created a dummy variable. Did it change anything? Re-run the regression from the motivating example but substitute in Majority.Catholic for Catholic. Does that change our results? How?
As a tireless p-hacker, you’re interested in knowing whether there’s any effect of living in a more industrialized province. Create a dummy variable De.Pastoralized that takes a value of 1 if the percent of men working in agriculture is less than 66 percent and 0 otherwise. Re-run the regression from the first example with this new variable in place of Agriculture and interpret the changes.