The Need to Recode Data

Refining data often involves reorganizing the values of variables so that the data are easier to visualize and analyze. This is called recoding, and it sometimes creates new variables. In this report, I demonstrate how to recode data in R.

Demonstration of Recoding

The data

I have a dataset that contains 100 observations coded into four categories, which combine two colors and two designs of neckties:

  • 1 = plain red tie

  • 2 = red tie with yellow stripes

  • 3 = plain blue tie

  • 4 = blue tie with yellow stripes

I created an RProject and placed a .csv file containing the data in the RProject directory. In a single chunk, I load dplyr and then read the data into R from the file dataOct-12-2015.csv, which contains one variable that I have named tietype:

suppressPackageStartupMessages(library(dplyr))
# "suppressPackageStartupMessages" hides the messages that dplyr prints when it is loaded.
#
neckties <- as_tibble(read.csv(file="dataOct-12-2015.csv", header=TRUE, sep=","))
# Notice that this one line combines "as_tibble" with "read.csv".
# ("as_tibble" replaces "tbl_df", which is deprecated in current versions of dplyr.)
#
# Now, I print the first few rows of "neckties," a data frame
# containing 100 rows and one column.
neckties
## # A tibble: 100 × 1
##    tietype
##      <int>
## 1        3
## 2        3
## 3        4
## 4        2
## 5        4
## 6        2
## 7        4
## 8        3
## 9        2
## 10       1
## # ... with 90 more rows

Using ifelse to recode the neckties data frame

Here is the frequency distribution of neckties$tietype:

# Make neckties$tietype into a table.
type <- table(neckties$tietype)
# Print the table
type
## 
##  1  2  3  4 
## 23 41 16 20
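Readers who do not have the original .csv file can follow along with a stand-in. The sketch below is an assumption on my part, not the original data: it builds a data frame (hypothetically named neckties_standin) whose tietype counts match the frequency table above, with the rows in sorted order rather than the original order.

```r
# Stand-in for the original file (an assumption, not the real data):
# repeat each tie type as many times as it appears in the frequency table.
counts <- c(`1` = 23, `2` = 41, `3` = 16, `4` = 20)
neckties_standin <- data.frame(
  tietype = rep(as.integer(names(counts)), times = counts)
)

# The frequency distribution should match the one printed above.
table(neckties_standin$tietype)
```

Because only the counts are reproduced, any result that depends on row order (such as the first ten rows printed earlier) will differ from the original.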

Next is a recode using the versatile ifelse function available in R.

How many ties are red?

To begin, I create a new variable, neckties$color, from the variable neckties$tietype. neckties$color is coded “1” if the tie is red (tietype 1 or 2) and “0” if it is not red.

# We start with `neckties$color` equal to 0, and we change it to 1
# only if `neckties$tietype` is red (type 1 or 2).
neckties$color <- 0
neckties$color <- ifelse(neckties$tietype == 1, 1, neckties$color)
neckties$color <- ifelse(neckties$tietype == 2, 1, neckties$color)

The frequency distribution of neckties$color:

tiecolor <- table(neckties$color)
tiecolor 
## 
##  0  1 
## 36 64
# 64 ties are red; 36 are not red.
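The same recode can also be written in one step. This is an alternative sketch, not the method used above. So that it runs on its own, it uses a small vector (hypothetically named tietype_demo) holding the first ten tietype values printed earlier.

```r
# The first ten tietype values from the printed data frame.
tietype_demo <- c(3, 3, 4, 2, 4, 2, 4, 3, 2, 1)

# Types 1 and 2 are the red ties, so test for both at once with %in%.
color_demo <- ifelse(tietype_demo %in% c(1, 2), 1, 0)
color_demo
```

Testing both codes in a single condition removes the need to initialize the variable to 0 first, because the third argument supplies the value for every tie that is not red.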

Let’s break down the syntax of the ifelse command. We create a new variable, neckties$color, that is coded “1” if the tie is red and “0” if the tie is not red. The ifelse command takes three arguments:

  • The first argument states the condition: “Is the tie red?” (for example, is neckties$tietype == 1?).

  • The second argument is the value given to the new variable when the condition is true, that is, when the tie is red.

  • The third argument is the value given to the new variable when the condition is false, that is, when the tie is not red.
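As a minimal illustration of the three arguments, the sketch below spells them out by the names base R gives them (test, yes, and no). The vector tietype_demo is a made-up example, not the neckties data.

```r
# One made-up tie of each type.
tietype_demo <- c(1, 2, 3, 4)

# test: the condition, checked separately for every element
# yes:  the value returned where the condition is TRUE
# no:   the value returned where the condition is FALSE
is_type1 <- ifelse(test = tietype_demo == 1, yes = 1, no = 0)
is_type1
```

The condition is evaluated element by element, so the result has one value for each element of the input.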

[Figure: the logic of the ifelse statement that sorts red ties from ties of other colors]

How many ties are striped?

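# As before, start with `neckties$stripe` equal to 0 and change it to 1
# only if `neckties$tietype` is striped (type 2 or 4).
neckties$stripe <- 0
neckties$stripe <- ifelse(neckties$tietype == 2, 1, neckties$stripe)
neckties$stripe <- ifelse(neckties$tietype == 4, 1, neckties$stripe)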
tiestripe <- table(neckties$stripe)
tiestripe 
## 
##  0  1 
## 39 61
# 61 ties are striped; 39 ties are not striped.
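One way to check both recodes is to cross-tabulate them: each cell should equal one of the four original tietype counts. The sketch below assumes stand-in data built from the frequency table (23, 41, 16, 20) rather than the original file, and it writes each recode as a single %in% test.

```r
# Stand-in data with the same counts as the frequency table
# (an assumption; this is not the original file).
tietype <- rep(1:4, times = c(23, 41, 16, 20))

# Recode color (red = types 1 and 2) and stripe (striped = types 2 and 4).
color  <- ifelse(tietype %in% c(1, 2), 1, 0)
stripe <- ifelse(tietype %in% c(2, 4), 1, 0)

# Cross-tabulate the two recoded variables.
check <- table(color = color, stripe = stripe)
check
```

If the recodes are right, the cell for color 1 and stripe 0 holds the 23 plain red ties, color 1 and stripe 1 the 41 striped red ties, color 0 and stripe 0 the 16 plain blue ties, and color 0 and stripe 1 the 20 striped blue ties.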

The Structure of neckties After Recoding

We started with one variable, tietype, and created two new variables, color and stripe, using the ifelse command to recode tietype.

# Print the first few rows of the final `neckties` data frame.
neckties
## # A tibble: 100 × 3
##    tietype color stripe
##      <int> <dbl>  <dbl>
## 1        3     0      0
## 2        3     0      0
## 3        4     0      1
## 4        2     1      1
## 5        4     0      1
## 6        2     1      1
## 7        4     0      1
## 8        3     0      0
## 9        2     1      1
## 10       1     1      0
## # ... with 90 more rows
# Print stats for variables in `neckties`.
summary(neckties)
##     tietype         color          stripe    
##  Min.   :1.00   Min.   :0.00   Min.   :0.00  
##  1st Qu.:2.00   1st Qu.:0.00   1st Qu.:0.00  
##  Median :2.00   Median :1.00   Median :1.00  
##  Mean   :2.33   Mean   :0.64   Mean   :0.61  
##  3rd Qu.:3.00   3rd Qu.:1.00   3rd Qu.:1.00  
##  Max.   :4.00   Max.   :1.00   Max.   :1.00
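Since dplyr is already loaded in this report, both recodes could also be done in a single mutate call. The following is a sketch of that alternative, not the method used above. It assumes stand-in data (hypothetically named neckties_demo) built from the frequency table instead of the original file, and it uses dplyr's if_else, a stricter variant of ifelse.

```r
suppressPackageStartupMessages(library(dplyr))

# Stand-in data with the same counts as the frequency table
# (an assumption; this is not the original file).
neckties_demo <- tibble(tietype = rep(1:4, times = c(23, 41, 16, 20)))

# Create both recoded variables in one step.
neckties_demo <- neckties_demo %>%
  mutate(
    color  = if_else(tietype %in% c(1, 2), 1, 0),  # red = types 1 and 2
    stripe = if_else(tietype %in% c(2, 4), 1, 0)   # striped = types 2 and 4
  )

# The summary statistics should agree with those printed above.
summary(neckties_demo)
```

The summary depends only on the counts, not on row order, so it can be compared directly with the output above.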