Recoding and Manipulating GSS Variables in R

Often we want to manipulate how the data is organized. We might do this for theoretical or simply exploratory reasons. As a researcher, it is your prerogative to create new variables based on given variables, so long as you can explain and defend what you're doing.

Let's load the data.

setwd("~/Dropbox/Data General/GSS")  #Set your working directory to whatever folder holds GSS.csv
options(scipen = 999)  #Turn off scientific notation
x <- read.csv("GSS.csv", stringsAsFactors = TRUE)  #Read text columns as factors (no longer the default as of R 4.0.0)

Create a dummy variable (only two levels) from a factor variable with multiple levels

summary(x$wrkstat)  #summarize the variable capturing multiple categories of work status

##    keeping house            other          retired           school 
##             9177             1078             7285             1681 
## temp not working unempl, laid off working fulltime working parttime 
##             1173             1769            27295             5616 
##             NA's 
##               13

library(memisc)

## Loading required package: lattice

## Loading required package: grid

## Loading required package: MASS

## Attaching package: 'memisc'

## The following object(s) are masked from 'package:stats':
## 
## contr.sum, contr.treatment, contrasts

## The following object(s) are masked from 'package:base':
## 
## as.array

x$unemployed <- x$wrkstat
x$unemployed <- recode(x$unemployed, "unemployed" <- "unempl, laid off", otherwise = "not unemployed")
summary(x$unemployed)

##     unemployed not unemployed 
##           1769          53318
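
The counts above add up to 55,087, which is the full sample including the 13 NA's, so the otherwise clause appears to have swept missing values into “not unemployed”. If you would rather keep missing values missing, one base R alternative is ifelse(). The sketch below uses a small made-up vector (wrkstat.toy is a hypothetical stand-in, not the GSS data) so it runs on its own.

```r
# Toy stand-in for x$wrkstat (hypothetical values, not the GSS data)
wrkstat.toy <- factor(c("working fulltime", "unempl, laid off", NA,
                        "retired", "unempl, laid off"))

# Comparing a factor to a string gives NA wherever the factor is NA,
# and ifelse() passes that NA through instead of recoding it
unemployed.toy <- factor(
  ifelse(wrkstat.toy == "unempl, laid off", "unemployed", "not unemployed"),
  levels = c("unemployed", "not unemployed")
)

# useNA = "ifany" shows the missing value as its own column
table(unemployed.toy, useNA = "ifany")
```

Whether NA's belong in the “not unemployed” group is a substantive decision; the point is only that you should make it deliberately.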

Make a factor variable ordered or reorder it

Many GSS variables are factors, and many of those have a natural ordering. In R, “ordered factors” are factors whose levels can be ranked with less-than (<) and greater-than (>) comparisons. If the levels of a factor can be ranked, it is useful to make it an ordered factor: R will then let you compare and sort values by rank, and many modeling functions treat the variable as ordinal, without extra work by you.

For instance, the variable recording respondents' political ideology is clearly ordinal, but by default R sorts its categories (levels) alphabetically.

summary(x$polviews)  #Notice the levels are in alphabetical order

##         conservative    extremely liberal extrmly conservative 
##                 6800                 1249                 1438 
##              liberal             moderate slghtly conservative 
##                 5338                17781                 7423 
##     slightly liberal                 NA's 
##                 5973                 9085

is.ordered(x$polviews)

## [1] FALSE

We use the ordered() function to make this variable ordered. We could simply type x$polviews<-ordered(x$polviews), but that would leave the levels in alphabetical order, which is useless to us. Instead, let's reorder the levels using the levels argument. We set “levels” equal to a vector of the level names in the order we want them, spelled exactly as they appear in the data (including abbreviations such as “slghtly”). The little “c” is just a function that combines whatever is listed in the parentheses into a vector.

x$ideology <- ordered(x$polviews, levels = c("extremely liberal", "liberal", 
    "slightly liberal", "moderate", "slghtly conservative", "conservative", 
    "extrmly conservative"))

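To see what ordering buys you, here is a minimal sketch on a made-up three-level variable (views.toy and ideology.toy are hypothetical names, not GSS columns). It checks that the factor is ordered, compares values against a level, and shows what happens when a value is not listed in levels.

```r
# Toy responses (hypothetical); "moderat" is a deliberate misspelling
views.toy <- c("moderate", "liberal", "conservative", "liberal", "moderat")

# List the levels from lowest to highest rank
ideology.toy <- ordered(views.toy,
                        levels = c("liberal", "moderate", "conservative"))

is.ordered(ideology.toy)     # TRUE
levels(ideology.toy)         # levels in the order we supplied
ideology.toy < "moderate"    # rank comparison against a level name
max(ideology.toy, na.rm = TRUE)  # highest-ranked level present
sum(is.na(ideology.toy))     # values not found in levels become NA
```

Because unmatched values silently become NA, it is worth comparing the NA count before and after reordering, as the two summaries of polviews and ideology in this tutorial allow you to do.
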
Use ordered factors as quantitative variables

Because ordered factors represent magnitudes of some sort, it is sometimes reasonable to use them as quantitative variables. In the example above, it is reasonable to argue that the variable we created, x$ideology, is really a quantitative variable measuring conservatism. If we replaced all responses of “extremely liberal” with a 1, all responses of “extremely conservative” with a 7, and all the levels in between with corresponding numerical values increasing by one for each level, then we could arguably use this variable in analyses that call for a single numerical scale. Of course, this is questionable because the difference between “extremely liberal” and “liberal” might be less than the difference between “liberal” and “moderate”, in which case the numerical scale would not exactly represent reality. This sort of thing should be done sparingly and with caution, but as a researcher you are a fundamentally creative agent and it's your prerogative to model the world according to your interests.

We do this using the as.numeric() function. Applied to a factor, it returns each value's position in the order of the levels, so for our ordered factor “extremely liberal” becomes 1 and “extrmly conservative” becomes 7. Let's make a new variable called x$ideology.numeric, which is the numerical version of x$ideology.

summary(x$ideology)

##    extremely liberal              liberal     slightly liberal 
##                 1249                 5338                 5973 
##             moderate slghtly conservative         conservative 
##                17781                 7423                 6800 
## extrmly conservative                 NA's 
##                 1438                 9085

x$ideology.numeric <- as.numeric(x$ideology)
summary(x$ideology.numeric)  #Values now run from 1 to 7; NA's stay NA
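
A quick way to confirm that the numeric codes line up with the levels is to cross-tabulate the two versions. The sketch below does this on a made-up three-level factor (ideology.toy is a hypothetical stand-in, not the GSS variable).

```r
# Toy ordered factor (hypothetical values), including one missing response
ideology.toy <- ordered(c("moderate", "liberal", "conservative", NA),
                        levels = c("liberal", "moderate", "conservative"))

# as.numeric() returns each value's position in the level order
ideology.toy.numeric <- as.numeric(ideology.toy)
ideology.toy.numeric   # 2 1 3 NA

# Each level should pair with exactly one code
table(ideology.toy, ideology.toy.numeric, useNA = "ifany")

# The numeric version can now be averaged; na.rm drops the missing case
mean(ideology.toy.numeric, na.rm = TRUE)
```

Note that the codes depend entirely on the level order, which is why the reordering step has to come before the conversion.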