Often we want to manipulate how the data is organized. We might do this for theoretical or simply exploratory reasons. As a researcher, it is your prerogative to create new variables based on given variables, so long as you can explain and defend what you're doing.
Let's load the data.
setwd("~/Dropbox/Data General/GSS") #Set your working directory to whatever folder holds GSS.csv
options(scipen = 999) #Turn off scientific notation
x <- read.csv("GSS.csv")
summary(x$wrkstat) #summarize the variable capturing multiple categories of work status
##    keeping house            other          retired           school
##             9177             1078             7285             1681
## temp not working unempl, laid off working fulltime working parttime
##             1173             1769            27295             5616
##             NA's
##               13
library(memisc) #Load the memisc package, which provides the recode() function used below
## Loading required package: lattice
## Loading required package: grid
## Loading required package: MASS
## Attaching package: 'memisc'
## The following object(s) are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object(s) are masked from 'package:base':
##
## as.array
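x$unemployed <- x$wrkstat #Copy work status into a new variable so the original stays intact
#Relabel "unempl, laid off" as "unemployed"; every other response becomes "not unemployed"
x$unemployed <- recode(x$unemployed, "unemployed" <- "unempl, laid off", otherwise = "not unemployed")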
summary(x$unemployed)
##     unemployed not unemployed
##           1769          53318
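Note that 53318 is all 55087 rows minus the 1769 unemployed, so the 13 NA's appear to have been counted as “not unemployed” here. If you would rather keep missing responses missing, the same two-category recode can be sketched in base R. This is only a sketch: the vector wrkstat_demo and its values are made up for illustration and are not the GSS data.

```r
# Toy stand-in for x$wrkstat (hypothetical values, not the GSS data)
wrkstat_demo <- factor(c("working fulltime", "unempl, laid off", "retired",
                         NA, "unempl, laid off", "school"))

# ifelse() returns NA wherever the test is NA, so missing values stay missing
unemployed_demo <- factor(ifelse(wrkstat_demo == "unempl, laid off",
                                 "unemployed", "not unemployed"))

# useNA = "ifany" adds a column for NA's when there are any
table(unemployed_demo, useNA = "ifany")
```

Whether missing responses belong in “not unemployed” is a substantive choice, so whichever approach you use, check the counts afterwards.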
Many of the GSS variables are factors, and many of those have a natural ordering. In R, “ordered factors” are factors whose levels can be ranked by relations of less than (<) or greater than (>). If the levels of a factor can be ranked, it is useful to make it an ordered factor: R then knows the ranking, so comparisons, sorting, and functions such as min() and max() follow it without extra work by you.
For instance, the variable recording respondents' political ideology is clearly an ordinal, categorical variable, but by default R puts the categories (levels) in alphabetical order.
summary(x$polviews) #Notice the levels are in alphabetical order
##         conservative    extremely liberal extrmly conservative
##                 6800                 1249                 1438
##              liberal             moderate slghtly conservative
##                 5338                17781                 7423
##     slightly liberal                 NA's
##                 5973                 9085
is.ordered(x$polviews)
## [1] FALSE
We use the ordered() function to make this variable ordered. We could simply type x$polviews <- ordered(x$polviews), but that would leave the levels in alphabetical order, which is useless to us. Instead, let's reorder the levels using the levels argument. We set “levels” equal to a vector of the level names in the order we want them, spelled exactly as they appear in the data (any response that does not match a listed level becomes NA). The little “c” is just a function that combines whatever is listed in the parentheses into a vector.
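Before applying this to the GSS variable, here is a minimal sketch of what the levels argument changes. The vector views_demo and its three categories are made up for illustration; they are not the GSS data.

```r
# Toy responses (hypothetical, not the GSS data)
views_demo <- c("moderate", "liberal", "conservative", "moderate")

# Without levels=, the ranking is alphabetical:
# conservative < liberal < moderate
alphabetical <- ordered(views_demo)

# With levels=, the ranking follows the order we list
ranked <- ordered(views_demo, levels = c("liberal", "moderate", "conservative"))

levels(alphabetical)       # "conservative" "liberal" "moderate"
levels(ranked)             # "liberal" "moderate" "conservative"
ranked < "conservative"    # TRUE TRUE FALSE TRUE
max(ranked)                # conservative, the highest-ranked response present
```

The same kind of call, applied to x$polviews with all seven GSS categories, looks like this: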
x$ideology <- ordered(x$polviews, levels = c("extremely liberal", "liberal",
"slightly liberal", "moderate", "slghtly conservative", "conservative",
"extrmly conservative"))
Because ordered factors represent magnitudes of some sort, it is sometimes reasonable to use them as quantitative variables. In the example above, it is reasonable to argue that the variable we created, x$ideology, is really a quantitative variable measuring conservatism. If we replaced all responses of “extremely liberal” with a 1, all responses of “extremely conservative” with a 7, and all the levels in between with corresponding numerical values increasing by one for each level, then we could arguably use this variable in analyses that call for a single numerical scale. Of course, this is questionable because the difference between “extremely liberal” and “liberal” might be less than the difference between “liberal” and “moderate”, in which case the numerical scale would not exactly represent reality. This sort of thing should be done sparingly and with caution, but as a researcher you are a fundamentally creative agent and it's your prerogative to model the world according to your interests.
We do this using the as.numeric() function. When we use this function on an ordered factor, R returns each response's position in the ordering: 1 for the first level, 2 for the second, and so on, with NA's left as NA. Let's make a new variable called x$ideology.numeric, which is the numerical version of x$ideology.
summary(x$ideology)
##    extremely liberal              liberal     slightly liberal
##                 1249                 5338                 5973
##             moderate slghtly conservative         conservative
##                17781                 7423                 6800
## extrmly conservative                 NA's
##                 1438                 9085
x$ideology.numeric <- as.numeric(x$ideology)
summary(x$ideology.numeric) #Summarize the new numeric variable (values run from 1 to 7, plus NA's)
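To see why the order of the levels matters for this conversion, consider a minimal sketch on a made-up vector (answers_demo and its three categories are hypothetical, not the GSS data). as.numeric() simply returns the position of each response's level, so the numbers are only meaningful if the levels were put in a meaningful order first.

```r
# Toy responses (hypothetical, not the GSS data)
answers_demo <- c("liberal", "moderate", "conservative", NA)

# Levels in substantive order: the codes follow the ranking
ranked <- ordered(answers_demo, levels = c("liberal", "moderate", "conservative"))
as.numeric(ranked)    # 1 2 3 NA

# Default factor: levels are alphabetical, so the codes do not follow the ranking
unranked <- factor(answers_demo)
as.numeric(unranked)  # 2 3 1 NA
```

Had we converted x$polviews directly instead of x$ideology, the numbers would likewise follow the alphabetical order of the labels rather than the liberal-to-conservative scale.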