Introduction

When we work on a data project, we usually download and merge several datasets with different types of variables.

In some cases, we have to go back many times to the web pages or papers we are reviewing for the project to recall the descriptions of the variables we are using.

Keeping a record of the descriptions of the important variables therefore lets us work more efficiently. These descriptions also help us build a model, understand the results, make graphics with pre-existing labels, and communicate the results better.

In this vignette, I will show how to attach labels to variables so that we can remember their long descriptions.

1. Hmisc package

The Hmisc package provides the function “label” to label variables.

1.1 Variables created by us

For example, I create a data frame “dataej1” and a named vector “label1” that holds the descriptions of its variables.

library(tidyverse)
library(Hmisc)
dataej1<- data.frame(Co2= floor(rnorm(6,25,8)), Envir=(c("1","0","1","1","0","0")), Year = (c(2000:2005))) 
label1 <- c(Co2="Co2 in millions of tonnes", Envir="Environment in which is collected the Co2, Urban is 0 and Rural is 1", Year= "year of collection of Co2 data")

Then I can match the labels to the variables by name:

label(dataej1) = as.list(label1[match(names(dataej1),names(label1))])
label(dataej1)
##                                                                    Co2 
##                                            "Co2 in millions of tonnes" 
##                                                                  Envir 
## "Environment in which is collected the Co2, Urban is 0 and Rural is 1" 
##                                                                   Year 
##                                       "year of collection of Co2 data"
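To see what the matching step does, here is a minimal base R sketch. It assumes only that a variable label is stored in the "label" attribute of each column, which is the attribute that Hmisc's label() reads and writes. The data frame `df` and the vector `descriptions` are made up for the illustration.

```r
# Made-up data frame and descriptions; the descriptions are deliberately
# given in a different order from the columns.
df <- data.frame(Year = 2000:2002, Co2 = c(21, 30, 18))
descriptions <- c(Co2 = "Co2 in millions of tonnes",
                  Year = "Year of collection of Co2 data")

# match() returns, for each column name, its position in the lookup vector.
idx <- match(names(df), names(descriptions))  # 2 1

# Store each description in the "label" attribute of the matching column.
for (i in seq_along(df)) {
  if (!is.na(idx[i])) {
    attr(df[[i]], "label") <- unname(descriptions[idx[i]])
  }
}

# Collect the labels again, one per column.
labels_found <- vapply(df, function(col) attr(col, "label", exact = TRUE),
                       character(1))
```

Because the matching is done by name, the order of the descriptions does not matter; a column without a description would simply stay unlabelled.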

1.2 Downloaded Dataset

In many cases, however, we do not want to type all the labels by hand as in the example above, or the descriptions of the variables are already available on the website from which we downloaded the dataset.

For example, I downloaded a public dataset from data.gov.au about Australia's net greenhouse gas emissions by sector for 1990 and 2000 to 2015.

url<-"https://data.gov.au/dataset/b41195f8-1e86-45d3-a92f-b49f1fbb456a/resource/e1574848-0579-402c-8429-30b10d9dfd8d/download/2016-soe-atmosphere-australias-net-greenhouse-gas-emissions-by-sector-1990-and-2000-2015.csv"
emissions<-read.csv(url)
colnames(emissions) 
##  [1] "Year.to"                                           
##  [2] "Electricity..Mt.CO2.e."                            
##  [3] "Stationary.energy.excluding.electricity..Mt.CO2.e."
##  [4] "Transport..Mt.CO2.e."                              
##  [5] "Fugitive.emissions..Mt.CO2.e."                     
##  [6] "Industrialprocesses.and.product.use..Mt.CO2.e."    
##  [7] "Agriculture..Mt.CO2.e."                            
##  [8] "Waste..Mt.CO2.e."                                  
##  [9] "LULUCF..Mt.CO2.e."                                 
## [10] "Total"

The names are very long, so I will keep them as the descriptions of the variables and then create new short names for the variables [1].

emissions <- emissions[-c(1),] # drop the first row because it is not useful
var.labels = as.character(names(emissions)) # the original column names serve as the descriptions of the variables
names(emissions) = c("year","electr","staewte","transp","fugem","indp&p","agr","wast", "lulucf","total")

We can also rewrite some of the labels or add information to them, for example:

var.labels[9] = "Land use, land-use change and forestry"
var.labels[1] = "Year"

Unlike in the first example, the names of the variables and the labels do not match here. They are in the same order, though, so we can use a loop to pair them by position:

for(i in seq_along(emissions)){
  Hmisc::label(emissions[, i]) <- var.labels[i]
}
label(emissions)
##                                                 year 
##                                               "Year" 
##                                               electr 
##                             "Electricity..Mt.CO2.e." 
##                                              staewte 
## "Stationary.energy.excluding.electricity..Mt.CO2.e." 
##                                               transp 
##                               "Transport..Mt.CO2.e." 
##                                                fugem 
##                      "Fugitive.emissions..Mt.CO2.e." 
##                                               indp&p 
##     "Industrialprocesses.and.product.use..Mt.CO2.e." 
##                                                  agr 
##                             "Agriculture..Mt.CO2.e." 
##                                                 wast 
##                                   "Waste..Mt.CO2.e." 
##                                               lulucf 
##             "Land use, land-use change and forestry" 
##                                                total 
##                                              "Total"
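The dotted labels above appear because read.csv, by default, turns the column headers into syntactic R names. A possible alternative is to read the file with check.names = FALSE, so that the original headers can be kept as labels. The sketch below assumes the original headers look like "Electricity (Mt CO2-e)"; the three-column extract and its numbers are invented for the illustration, and the labels are stored in the "label" attribute with base R only.

```r
# Invented extract: the headers are an assumption and the numbers are made up.
csv_text <- "Year to,Electricity (Mt CO2-e),Total
1990,10,100
2000,20,200"

# check.names = FALSE keeps the headers exactly as they are in the file.
raw <- read.csv(text = csv_text, check.names = FALSE)
var_labels <- names(raw)  # readable descriptions, spaces and brackets included

# This is the conversion read.csv applies by default.
default_names <- make.names(var_labels)

# Give the columns short names and keep the headers as labels, by position.
names(raw) <- c("year", "electr", "total")
for (i in seq_along(raw)) {
  attr(raw[[i]], "label") <- var_labels[i]
}
```

With this approach the labels need no manual cleaning, although the short names still have to be written by hand.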

The labels are also useful for making graphics. We can pass “label” to “labs” in ggplot:

library(ggplot2)
ggplot(emissions, aes(year, lulucf)) +
  geom_col() +
  labs(x = label(emissions$year), y = "Mt of Co2", title = "Emissions of Co2 in Australia", subtitle = label(emissions$lulucf)) # the labels supply the x-axis title and the subtitle
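When some variables have no label, the plot would get an empty axis title. A small helper can fall back to the variable name in that case. This is only a sketch: `axis_label` is a hypothetical function of mine, not part of Hmisc or ggplot2, and it assumes that the label is stored in the "label" attribute of the column.

```r
# Hypothetical helper: return the label of a column, or its name if it has none.
axis_label <- function(data, var) {
  lab <- attr(data[[var]], "label", exact = TRUE)
  if (is.null(lab) || !nzchar(lab)) var else lab
}

# Made-up data: only lulucf carries a label.
df <- data.frame(year = 2000:2002, lulucf = c(10, 12, 9))
attr(df$lulucf, "label") <- "Land use, land-use change and forestry"

x_title <- axis_label(df, "year")    # falls back to the column name
y_title <- axis_label(df, "lulucf")  # uses the stored label
```

The two strings can then be passed to labs(x = x_title, y = y_title) in the same way as the output of label() above.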

2. Expss package

We can do the same with the “expss” package.

“This package helps people to move data from ‘Excel’/‘SPSS’ to R (additionally, the package offers useful functions for data processing in marketing research / social surveys) and support to base R functions and some functions from other packages” (github).

Now, I will create variables and their labels with the apply_labels function:

detach("package:Hmisc", unload = TRUE)
library(expss)
dataej2<- data.frame(Co2= c(15,23,30,69,4,80), Envirn=as.character(c("1","0","1","1","0","0")), Year = (c(2000:2005)))
dataej2 = apply_labels(dataej2,
                      Co2 = "Co2 in millions of tonnes",
                      Envirn="Environment in which is collected the Co2",
                      Year= "year of the Co2 data")

Now, I will add value labels:

val_lab(dataej2$Envirn) = num_lab("
             1 Urban
             0 Rural
             ")

And I will add more labels for values and variables:

add_val_lab(dataej2$Envirn) = num_lab("
                          -1 NAN
                          ")
dataej2$Precip = c(10,3,4,6,12,0)
dataej2=apply_labels(dataej2,Precip= "Precipitations in ml")

We can also remove value labels, either by value or by name:

# by value 
val_lab(dataej2$Envirn) = val_lab(dataej2$Envirn) %d% -1
# by name
val_lab(dataej2$Envirn) = val_lab(dataej2$Envirn) %n_d% "NAN" # the name must match the label added above

In addition, we can use the use_labels function to make graphics:

# graphic: inside use_labels, `..data` is the expss shortcut for the data frame
use_labels(dataej2,
  ggplot(..data, aes(Year, Co2)) +
    geom_col()
)

3. Haven package

If the data come in a format such as SPSS, SAS, or Stata, the “haven” package reads the variable labels along with the variable names. For example:

detach("package:expss", unload = TRUE)
library(haven)
library(tidyverse)

3.1 Variables created by us

The haven package provides the function “labelled” for creating labelled variables together with their value labels:

x1<-c(1,2,3)

x11<-labelled(x1, c( employment = 1, unemployment= 2, vacation = 3),label = "Employment situation") 
# Usage: labelled(x, labels = value labels (if any), label = description of the variable)

3.2 Downloaded dataset

This package is also useful when we download a dataset. Here I use a dataset from “INE” (the National Institute of Statistics of Chile) in .dta format:

INE<-read_dta("ene-2019-01.dta") 
str(INE[6:8]) # the labels are in Spanish, but you can see that the variables carry labels
## Classes 'tbl_df', 'tbl' and 'data.frame':    105264 obs. of  3 variables:
##  $ region   : 'haven_labelled' num  5 5 5 5 5 5 5 5 5 5 ...
##   ..- attr(*, "label")= chr "Región"
##   ..- attr(*, "format.stata")= chr "%16.0f"
##   ..- attr(*, "labels")= Named num  1 2 3 4 5 6 7 8 9 10 ...
##   .. ..- attr(*, "names")= chr  "Región de Tarapacá" "Región de Antofagasta" "Región de Atacama" "Región de Coquimbo" ...
##  $ region_15: num  5 5 5 5 5 5 5 5 5 5 ...
##   ..- attr(*, "label")= chr "Región (división política administrativa a quince regiones)"
##   ..- attr(*, "format.stata")= chr "%16.0f"
##  $ estrato  : 'haven_labelled' num  5099 5099 5099 5099 5099 ...
##   ..- attr(*, "label")= chr "Estrato muestral"
##   ..- attr(*, "format.stata")= chr "%16.0f"
##   ..- attr(*, "labels")= Named num  1021 1032 1049 1051 2011 ...
##   .. ..- attr(*, "names")= chr  "Iquique CD" "I Región RAU" "I Región Rural" "Alto Hospicio CD" ...
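The output above shows that the value labels are stored in a "labels" attribute: a named numeric vector whose values are the codes and whose names are the texts. A minimal base R sketch of how such codes could be turned into readable text follows. The vector `codes` is made up and reuses the employment example from section 3.1. As far as I know, haven also offers as_factor() for the same purpose, which I do not use here so that the sketch runs without any package.

```r
# Made-up codes with value labels stored the way the str() output shows.
codes <- c(1, 2, 1, 3)
value_labels <- c(employment = 1, unemployment = 2, vacation = 3)
attr(codes, "labels") <- value_labels

# Map every code to its text: the codes are the levels, the names are the labels.
stored <- attr(codes, "labels", exact = TRUE)
as_text <- factor(as.vector(codes),
                  levels = unname(stored),
                  labels = names(stored))

text_values <- as.character(as_text)
```

A code without a value label would become NA in this sketch, so it is worth checking for missing values after the conversion.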

4. Sjlabelled package

“This package covers reading and writing data between other statistical packages (like ‘SPSS’) and R, based on the haven and foreign packages. Hence, this package also includes functions to make working with labelled data easier.” (link in rdocumentation). In other words, we can use labels to visualize and understand the data better.

detach("package:haven", unload = TRUE)
#devtools::install_github("strengejacke/sjlabelled") # run this line to install the package
library(sjlabelled)

I will use only a subset of the data, edad (age), sexo (sex) and nivel (level of education), to show the labels.

INE2<- select(INE, edad, sexo, nivel)
get_label(INE2) # variable labels
##                                  edad                                  sexo 
##                  "Edad de la persona"                                "Sexo" 
##                                 nivel 
## "Nivel educacional más alto aprobado"
get_labels(INE2) # value labels
## $edad
## NULL
## 
## $sexo
## [1] "Hombre" "Mujer" 
## 
## $nivel
##  [1] "Nunca Estudio"             "Sala Cuna"                
##  [3] "Kinder"                    "Basica o Primaria"        
##  [5] "Media Comun"               "Media tecnico profesional"
##  [7] "Humanidades"               "Centro Formacion Tecnica" 
##  [9] "Instituto Profesional"     "Universitario"            
## [11] "Post titulo"               "Magister"                 
## [13] "Doctorado"                 "Normalista"               
## [15] "Nivel Ignorado"            "Nivel Ignorado"

When we subset data, we do not want to lose the labels. We can restore the original labels with the copy_labels function:

INE3 <- subset(INE2, subset = sexo == 1, select = c(1:2))
str(INE3) # the edad (age) label has been lost
## Classes 'tbl_df', 'tbl' and 'data.frame':    50459 obs. of  2 variables:
##  $ edad: num  42 58 20 21 35 1 77 86 83 11 ...
##  $ sexo: 'haven_labelled' num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Sexo"
##   ..- attr(*, "labels")= Named num  1 2
##   .. ..- attr(*, "names")= chr  "Hombre" "Mujer"
INE3 <- copy_labels(INE3, INE2)
str(INE3)
## Classes 'tbl_df', 'tbl' and 'data.frame':    50459 obs. of  2 variables:
##  $ edad: num  42 58 20 21 35 1 77 86 83 11 ...
##   ..- attr(*, "label")= chr "Edad de la persona"
##  $ sexo: 'haven_labelled' num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Sexo"
##   ..- attr(*, "labels")= Named num  1 2
##   .. ..- attr(*, "names")= chr  "Hombre" "Mujer"
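The label of edad disappears because subsetting a plain vector in base R drops its attributes. The sketch below reproduces the problem and the repair with base R only. It is a rough imitation of the idea behind copy_labels, not its actual implementation: `copy_label_attr` is a hypothetical helper and the data are made up.

```r
# Made-up data frame with a labelled plain numeric column.
people <- data.frame(id = 1:4)
people$age <- c(42, 58, 20, 21)
attr(people$age, "label") <- "Age of the person"

# Row subsetting drops the "label" attribute of the plain numeric column.
adults <- people[people$age > 30, , drop = FALSE]
lost <- is.null(attr(adults$age, "label", exact = TRUE))

# Hypothetical helper: copy the "label" attribute for all shared columns.
copy_label_attr <- function(target, source) {
  for (v in intersect(names(target), names(source))) {
    attr(target[[v]], "label") <- attr(source[[v]], "label", exact = TRUE)
  }
  target
}

adults <- copy_label_attr(adults, people)
restored <- attr(adults$age, "label", exact = TRUE)
```

Columns of class haven_labelled keep their labels when subset, as the sexo column shows in the output above, so only plain columns need this repair.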

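To remove variable labels, use remove_label(); to remove value labels, use remove_labels().

To set a variable label, use set_label(data, label = “description”); to set value labels, use set_labels(data, labels = c(“description 1”, “description 2”)).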

And we can use the labels to make graphics:

# Refer to the columns explicitly instead of calling attach()
ggplot(INE2, aes(x = as.factor(nivel), fill = as.factor(sexo))) +
  geom_bar(stat = "count") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_x_discrete(labels = get_labels(INE2$nivel)) + # change the labels of the x axis
  scale_fill_discrete(labels = get_labels(INE2$sexo)) + # change the labels of the legend
  labs(x = get_label(INE2$nivel), y = NULL, fill = get_label(INE2$sexo))

Summary

References


  1. This dataset is tidy, so I do not have to fix it; for other datasets, I recommend tidying the data first.