1 Goal


The goal of this tutorial is to learn how to create new variables using existing ones to do so. This can be useful to create total results from partial variables, calculate percentages or calculating total benefits from different sources.


2 Create new variable


# We need to load the library dplyr and ggplot in this exercise
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# First of all we load the data
# For this tutorial we are going to use the iris plant dataset
data(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# We can now create a new variable called petal area that is just the product of petal length and petal width

iris <- mutate(iris, Petal.Area = Petal.Width * Petal.Length)
str(iris)
## 'data.frame':    150 obs. of  6 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Petal.Area  : num  0.28 0.28 0.26 0.3 0.28 0.68 0.42 0.3 0.28 0.15 ...
# We can try now to see if there is any difference in area between species
ggplot() + geom_point(data = iris, aes(x = Species, y = Petal.Area))

# We can now try to divide the Petal Length by the Sepal Length to see if we observe any further separation

iris <- mutate(iris, Ratio = Petal.Length / Sepal.Length)
str(iris)
## 'data.frame':    150 obs. of  7 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Petal.Area  : num  0.28 0.28 0.26 0.3 0.28 0.68 0.42 0.3 0.28 0.15 ...
##  $ Ratio       : num  0.275 0.286 0.277 0.326 0.28 ...
# Plot the new variable
ggplot() + geom_point(data = iris, aes(x = Species, y = Ratio))

# We can add all the columns and see what we observe

iris <- mutate(iris, Total = Petal.Width + Petal.Length + Sepal.Width + Sepal.Length )

# Plot the new variable
ggplot() + geom_point(data = iris, aes(x = Species, y = Total))


3 Conclusion


In this tutorial we have learnt how to create new variables using information that already exists in our dataset.