Transformation: Safe
Starting Off
Please do the following:
- Empty your Global Environment.
- Restart R.
Attach the package ggplot2:
library(ggplot2)
Our Goal
Let’s say that we plan to make violin plots of height, broken down by sex, using the data from the m111survey data frame in the tigerstats package. We want the height to be measured in feet, though, instead of inches.
Getting to the Data
Since we need the m111survey data, let’s attach tigerstats, the package that contains it:
library(tigerstats)
Transforming
We need to transform height from inches to feet. This seems easy enough:
First, make the heights into a new vector in your Global Environment:
tempHeight <- m111survey$height
Check your Global Environment: you should see that tempHeight is there, now:
ls()
## [1] "tempHeight"
Next, transform the heights:
heightInFeet <- tempHeight / 12
Do you see the new vector heightInFeet in your Global Environment? Check:
ls()
## [1] "heightInFeet" "tempHeight"
Yup, there it is, along with tempHeight.
Finally, create a new variable in m111survey, called heightInFeet:
m111survey$heightInFeet <- heightInFeet
Let’s check that heightInFeet really does exist as a new variable in m111survey:
str(m111survey)
## 'data.frame': 71 obs. of 13 variables:
## $ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
## $ ideal_ht : num 78 76 NA 65 72 NA 72 76 61 67 ...
## $ sleep : num 9.5 7 9 7 8 10 4 6 7 7 ...
## $ fastest : int 119 110 85 100 95 100 85 160 90 90 ...
## $ weight_feel : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
## $ love_first : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ extra_life : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
## $ seat : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
## $ GPA : num 3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
## $ enough_Sleep : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
## $ diff.ideal.act.: num 2 2 NA 3 0 NA 2 -3 2 0 ...
## $ heightInFeet : num 6.33 6.17 5.33 5.17 6 ...
Yep, there it is!
Make the Graph
Now we can make our desired graph:
ggplot(m111survey, aes(x = sex, y = heightInFeet)) +
  geom_violin(fill = "burlywood")
Something Interesting
Check the Global Environment again:
ls()
## [1] "heightInFeet" "m111survey" "tempHeight"
Hmm, m111survey is there, now. But hey, isn’t it supposed to be in package tigerstats, not in your Global Environment?
Well, actually it is still very much in tigerstats. What you did when you added heightInFeet was to create a new data frame, also called m111survey, that resides in your Global Environment. That’s the object that R found when it ran the plotting command.
Inside tigerstats, m111survey is unchanged:
str(tigerstats::m111survey)
## 'data.frame': 71 obs. of 12 variables:
## $ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
## $ ideal_ht : num 78 76 NA 65 72 NA 72 76 61 67 ...
## $ sleep : num 9.5 7 9 7 8 10 4 6 7 7 ...
## $ fastest : int 119 110 85 100 95 100 85 160 90 90 ...
## $ weight_feel : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
## $ love_first : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ extra_life : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
## $ seat : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
## $ GPA : num 3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
## $ enough_Sleep : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
## $ diff.ideal.act.: num 2 2 NA 3 0 NA 2 -3 2 0 ...
See? No heightInFeet there! You can’t alter tigerstats::m111survey: R won’t let you. Instead it will make copies for you to use.
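The same behavior can be seen with any data frame that lives in a package. Here is a minimal sketch that uses mtcars, from R's built-in datasets package, as a stand-in for m111survey, so that it runs even without tigerstats. The column name kpl and the conversion factor are our own illustrative choices, not part of the survey example.

```r
# Stand-in example: mtcars (built-in datasets package) plays the role of
# m111survey, and kpl is a hypothetical new column.
mtcars$kpl <- mtcars$mpg * 0.425   # rough conversion: miles per gallon to km per litre

# Our own copy has the new column ...
"kpl" %in% names(mtcars)

# ... but the data frame inside the package does not.
"kpl" %in% names(datasets::mtcars)

# find() lists the places on the search path that hold an object of this name.
# Run at the top level of a fresh session, the Global Environment should
# appear ahead of the package, which is why R uses our copy first.
find("mtcars")
```

The package::name form always reaches the original, no matter what copies you have made.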
Transformation: Risky
Please empty your Global Environment. You still have ggplot2 and tigerstats loaded.
We’re going to repeat the transformation process, but this time without making intermediate variables. Here we go, all in one gulp:
m111survey$height <- m111survey$height / 12
Note that since we attempted to change m111survey, a copy of it pops up in our Global Environment:
ls()
## [1] "m111survey"
Now make the graph:
ggplot(m111survey, aes(x = sex, y = height)) +
  geom_violin(fill = "burlywood")
Great, that was quick. So why don’t we always work that way?
Transformation: Shooting Yourself in Your Foot
Here’s why we think it’s a good idea to create intermediate variables, and not change the values of existing variables in a data frame, if we can avoid doing so.
Suppose we try the quick way, but make a mistake in our code:
m111survey$height <- 12
We graph:
ggplot(m111survey, aes(x = sex, y = height)) +
  geom_violin(fill = "burlywood")
Hmm, we think, something went wrong. We look back and realize our coding mistake: we made ALL of the heights equal to 12.
No problem, we think, we’ll just do it right:
m111survey$height <- m111survey$height / 12
We graph:
ggplot(m111survey, aes(x = sex, y = height)) +
  geom_violin(fill = "burlywood")
Aargh! Now we realize: height became all 12s, and now we have divided all of those 12s by 12, so now all of the heights are 1.
We’ve lost the original heights!
How can we get them back? We’ll have to clear the Global Environment, getting rid of our own copy of m111survey, and then start again with the correct code.
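If you would rather not clear everything, one targeted alternative is to remove only the damaged copy with rm(). The sketch below again uses the built-in mtcars data frame as a stand-in for m111survey (an assumption made so the code runs without tigerstats). The idea carries over as long as the original still lives in an attached package.

```r
# Stand-in example with the built-in mtcars data frame.
mtcars$mpg <- 12      # the mistake: every value in our copy is overwritten
unique(mtcars$mpg)    # only one distinct value is left

rm(mtcars)            # delete our own copy, and nothing else

# With the copy gone, R finds the untouched version in the package again.
head(mtcars$mpg)
```

This only works because the package still holds the original. Data that you read in from a file and then overwrote would have to be read in again.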
It can take a while to recognize an error like that one, so it’s best not to overwrite variables in a data frame unless you absolutely have to.
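To see why a new variable is more forgiving, here is a small sketch that makes the same mistake but writes the result to a new column. The data frame survey and its five heights are made up for illustration; they stand in for m111survey.

```r
# Hypothetical mini data frame standing in for m111survey.
survey <- data.frame(height = c(76, 74, 64, 62, 72))

survey$heightInFeet <- 12                   # same mistake, but in a NEW column
survey$heightInFeet <- survey$height / 12   # the corrected line simply replaces it

# The original heights were never touched, so the fix works on the first try.
survey
```

Because height itself is never assigned to, we can rerun the corrected line as often as we like and always get the same answer.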
Other Advice
- Proceed in very small steps. Check each step to see if it worked, before you move on to the next one.
- Always know what’s in your Global Environment, and what packages you have loaded. Clear your Global Environment every time you finish solving a problem. Restart R frequently, and then load only the packages you need for your next little job.
- When you are working on an R Markdown document, knit it up frequently, at least after each problem you have solved. That way you can find and correct mistakes as soon as you make them.
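As one way to "check each step", you can follow a transformation with a quick summary and a few sanity checks. This is only a sketch: the heights are made up, and the plausible range of 3 to 8 feet is our own assumption, not something taken from the survey data.

```r
# Hypothetical heights in inches, standing in for m111survey$height.
heightInches <- c(76, 74, 64, 62, 72)
heightFeet <- heightInches / 12

# A quick look at the result of the step.
summary(heightFeet)

# stopifnot() halts with an error as soon as one of the conditions is false.
stopifnot(
  length(heightFeet) == length(heightInches),          # nothing lost or added
  length(unique(heightFeet)) > 1,                      # not all the same value
  all(heightFeet > 3 & heightFeet < 8, na.rm = TRUE)   # assumed plausible range
)
```

A check like the second one would have caught the all-12s mistake before we ever made a graph.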