Some Best Practices for Variable Transformation

Transformation: Safe

Starting Off

Please do the following:

Empty your Global Environment.
Restart R.
Attach the package ggplot2:
```
library(ggplot2)
```

Our Goal

Let’s say that we plan to make violin plots of height, broken down by sex, using the data from the m111survey data frame in the tigerstats package. We want the height to be measured in feet, though, instead of inches.

Getting to the Data

Since we need the m111survey data, let’s attach tigerstats, the package that contains it:

library(tigerstats)

Transforming

We need to transform height from inches to feet. This seems easy enough:

First, make the heights into a new vector in your Global Environment:

tempHeight <- m111survey$height

Check your Global Environment: you should see that tempHeight is there, now:

ls()

## [1] "tempHeight"

Next, transform the heights:

heightInFeet <- tempHeight / 12

Do you see the new vector heightInFeet in your Global Environment? Check:

ls()

## [1] "heightInFeet" "tempHeight"

Yup, there it is, along with tempHeight.

Finally, create a new variable in m111survey, called heightInFeet:

m111survey$heightInFeet <- heightInFeet

Let’s check that heightInFeet really does exist as a new variable in m111survey:

str(m111survey)

## 'data.frame':    71 obs. of  13 variables:
##  $ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...
##  $ ideal_ht       : num  78 76 NA 65 72 NA 72 76 61 67 ...
##  $ sleep          : num  9.5 7 9 7 8 10 4 6 7 7 ...
##  $ fastest        : int  119 110 85 100 95 100 85 160 90 90 ...
##  $ weight_feel    : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
##  $ love_first     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ extra_life     : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
##  $ seat           : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
##  $ GPA            : num  3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
##  $ enough_Sleep   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
##  $ sex            : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
##  $ diff.ideal.act.: num  2 2 NA 3 0 NA 2 -3 2 0 ...
##  $ heightInFeet   : num  6.33 6.17 5.33 5.17 6 ...

Yep, there it is!
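The same three steps can be tried without tigerstats installed. The sketch below is an illustration only: it assumes the built-in data frame women from the datasets package (which also records height in inches) as a stand-in for m111survey, and reuses the names tempHeight and heightInFeet from above.

```r
# Illustration only: datasets::women stands in for m111survey.
tempHeight <- women$height           # step 1: copy the original column
heightInFeet <- tempHeight / 12      # step 2: transform the copy
women$heightInFeet <- heightInFeet   # step 3: add a NEW column; height is untouched

# Check the result before moving on.
str(women)
stopifnot(isTRUE(all.equal(women$heightInFeet * 12, women$height)))
```

Because the original height column is never overwritten, a mistake in step 2 can be fixed by rerunning steps 2 and 3.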

Make the Graph

Now we can make our desired graph:

ggplot(m111survey, aes(x = sex, y = heightInFeet)) +
  geom_violin(fill = "burlywood")

Something Interesting

Check the Global Environment again:

ls()

## [1] "heightInFeet" "m111survey"   "tempHeight"

Hmm, m111survey is there now. But hey, isn’t it supposed to be in package tigerstats, not in your Global Environment?

Well, actually it is still very much in tigerstats. What you did when you added heightInFeet was to create a new data frame, also called m111survey, that resides in your Global Environment. That’s the object that R found when it ran the plotting command.

Inside tigerstats, m111survey is unchanged:

str(tigerstats::m111survey)

## 'data.frame':    71 obs. of  12 variables:
##  $ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...
##  $ ideal_ht       : num  78 76 NA 65 72 NA 72 76 61 67 ...
##  $ sleep          : num  9.5 7 9 7 8 10 4 6 7 7 ...
##  $ fastest        : int  119 110 85 100 95 100 85 160 90 90 ...
##  $ weight_feel    : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
##  $ love_first     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ extra_life     : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
##  $ seat           : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
##  $ GPA            : num  3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
##  $ enough_Sleep   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
##  $ sex            : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
##  $ diff.ideal.act.: num  2 2 NA 3 0 NA 2 -3 2 0 ...

See? No heightInFeet there! You can’t alter tigerstats::m111survey: R won’t let you. Instead, it makes a copy in your Global Environment, and that copy is what you modify.
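To see this copying behavior in a smaller setting, here is a minimal sketch. It assumes the built-in datasets::women data frame as a stand-in for m111survey, so the column names differ from those above.

```r
# Illustration only: datasets::women stands in for m111survey.
women$heightInFeet <- women$height / 12   # R makes a modified copy for us

names(women)             # our copy: includes heightInFeet
names(datasets::women)   # the package's version: no heightInFeet

# When run at the console, this reports whether a copy now lives
# in the Global Environment.
exists("women", envir = globalenv(), inherits = FALSE)
```

The `package::name` form always reaches the package's version, no matter what is in your Global Environment.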

Transformation: Risky

Please empty your Global Environment. You still have ggplot2 and tigerstats loaded.

We’re going to repeat the transformation process, but this time without making intermediate variables. Here we go, all in one gulp:

m111survey$height <- m111survey$height / 12

Note that since we attempted to change m111survey, a copy of it pops up in our Global Environment:

ls()

## [1] "m111survey"

Now make the graph:

ggplot(m111survey, aes(x = sex, y = height)) +
  geom_violin(fill = "burlywood")

Great, that was quick. So why don’t we always work that way?

Transformation: Shooting Yourself in Your Foot

Here’s why we think it’s a good idea to create intermediate variables, and not change the values of existing variables in a data frame, if we can avoid doing so.

Suppose we try the quick way, but make a mistake in our code:

m111survey$height <- 12

We graph:

ggplot(m111survey, aes(x = sex, y = height)) +
  geom_violin(fill = "burlywood")

Hmm, we think, something went wrong. We look back and realize our coding mistake: we made ALL of the heights equal to 12.

No problem, we think, we’ll just do it right:

m111survey$height <- m111survey$height / 12

We graph:

ggplot(m111survey, aes(x = sex, y = height)) +
  geom_violin(fill = "burlywood")

Aargh! Now we realize: height became all 12s, and now we have divided all of those 12s by 12, so now all of the heights are 1.

We’ve lost the original heights!

How can we get them back? We’ll have to clear the Global Environment, getting rid of our own copy of m111survey, and then start again with the correct code.

It can take a while to recognize an error like that one, so it’s best not to overwrite variables in a data frame unless you absolutely have to.
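The mistake and one way to recover from it can be sketched as follows. This is an illustration only, assuming the built-in datasets::women data frame in place of m111survey. It removes just the damaged copy with rm() rather than clearing the whole Global Environment.

```r
# Illustration only: datasets::women stands in for m111survey.
women$height <- 12                  # the mistake: every height becomes 12
women$height <- women$height / 12   # the "fix" only divides the 12s by 12
unique(women$height)                # every height is now the same value

rm(women)              # discard our damaged copy
head(women$height)     # R finds the package's original again
```

After rm(), the name women once again refers to the untouched data frame in the package, so the correct transformation can be redone from scratch.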

Other Advice

Proceed in very small steps. Check each step to see if it worked, before you move on to the next one.
Always know what’s in your Global Environment, and what packages you have loaded. Clear your Global Environment every time you finish solving a problem. Restart R frequently, and then load only the packages you need for your next little job.
When you are working on an R Markdown document, knit it up frequently, at least after each problem you have solved. That way you can find and correct mistakes as soon as you make them.
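A few base R commands help with this housekeeping. The sketch below is only a suggestion; the object name scratch is made up for the example.

```r
scratch <- 1:3   # hypothetical leftover object from an earlier problem

ls()             # what is in the current environment?
search()         # which packages are attached, in lookup order?

rm(scratch)      # remove one object by name
# rm(list = ls()) would clear everything in the current environment at once
```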