This post explains the basics of using the haven package in R

It is intended as a reminder for myself when going back and forth between SPSS and R

Hopefully it will help some other people too.
Feel free to comment and suggest improvements to the script

Let’s begin

Load the libraries

library(tidyverse)
## Warning: package 'tidyr' was built under R version 3.5.2
library(haven)

Import an SPSS file (.sav) into R

To follow along, you can download the Wages.sav file from here
This dataset has both scale (numerical) and categorical (factor) variables
All categorical variables are coded as numeric with assigned value labels
For example, sex is a categorical variable coded as 0 for Male, and 1 for Female

wages <- read_sav("Wages.sav")

Let’s take a glimpse at what was imported

glimpse(wages)
## Observations: 400
## Variables: 9
## $ id    <dbl> 3, 4, 5, 12, 13, 14, 17, 20, 21, 23, 25, 28, 30, 31, 32,...
## $ educ  <dbl> 12, 13, 10, 9, 9, 12, 11, 12, 11, 6, 10, 8, 12, 12, 12, ...
## $ south <dbl+lbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0...
## $ sex   <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exper <dbl> 17, 9, 27, 30, 29, 37, 16, 9, 14, 45, 30, 19, 36, 20, 35...
## $ wage  <dbl> 7.50, 13.07, 4.45, 6.25, 19.98, 7.30, 3.65, 3.75, 4.50, ...
## $ occup <dbl+lbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ marr  <dbl+lbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1...
## $ ed    <dbl+lbl> 2, 3, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4...

Note the data type of sex (or any other categorical variable)

<dbl+lbl>

This is a labelled variable in the haven package

See what comes along with the sex variable when we import the SPSS file

attributes(wages$sex)
## $format.spss
## [1] "F1.0"
## 
## $display_width
## [1] 3
## 
## $class
## [1] "haven_labelled"
## 
## $labels
##   Male Female 
##      0      1

We have the $class attribute, which says the variable is a haven_labelled vector (haven_labelled is an S3 class)
We also have the $labels attribute
These are the Value Labels [0 Male, 1 Female]
This is going to be useful (as we will see in a moment)
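As a minimal sketch of how the value labels can be read back, the snippet below builds a small toy vector with `labelled()` (the values are made up, it is not the wages data) and pulls the labels out of the `labels` attribute. `print_labels()` is a haven helper that prints the same information as a table.

```r
library(haven)

# Toy vector (hypothetical values), mimicking the imported sex variable
sex <- labelled(c(0, 0, 1, 0), labels = c(Male = 0, Female = 1))

# The value labels are a named vector stored in the "labels" attribute
value_labels <- attr(sex, "labels")
names(value_labels)    # the label text
unname(value_labels)   # the underlying numeric codes

# haven helper that prints the value/label table
print_labels(sex)
```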

Let’s see another one

attributes(wages$marr)
## $label
## [1] "Marital status"
## 
## $format.spss
## [1] "F1.0"
## 
## $display_width
## [1] 4
## 
## $class
## [1] "haven_labelled"
## 
## $labels
## Not married     Married 
##           0           1

Now we also have the $label attribute
This is of course the Variable Label
We didn’t have this for sex because there was no Variable Label for sex in the Wages.sav file

I think we can assign one if we want
We can use the attr function

attr(wages$sex, "label") <- "Gender"

# let's see
attributes(wages$sex)
## $format.spss
## [1] "F1.0"
## 
## $display_width
## [1] 3
## 
## $class
## [1] "haven_labelled"
## 
## $labels
##   Male Female 
##      0      1 
## 
## $label
## [1] "Gender"

It worked: the Variable Label is stored as just another attribute of the vector
(I think I need to read more about S3 classes in Hands-On Programming with R)
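To check that setting the label this way leaves everything else alone, here is a small sketch on a toy vector (made-up values, not the wages data). It only uses `attr()` from base R plus `labelled()` and `is.labelled()` from haven.

```r
library(haven)

# Toy labelled vector (hypothetical values)
sex <- labelled(c(0, 0, 1, 0), labels = c(Male = 0, Female = 1))

attr(sex, "label")             # NULL: no Variable Label yet
attr(sex, "label") <- "Gender"
attr(sex, "label")             # now holds the Variable Label

# The class and the Value Labels are untouched
is.labelled(sex)
attr(sex, "labels")
```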

Also have a look at this
(use the head function to avoid “overwhelming” the console)

head(wages$marr)
## <Labelled double>: Marital status
## [1] 1 0 0 0 1 1
## 
## Labels:
##  value       label
##      0 Not married
##      1     Married

Convert to factors

We can convert these labelled variables into factors
This is important for modelling and/or plotting

See what happens if we don’t convert

wages %>% 
  ggplot(aes(sex,wage))+
  geom_boxplot()
## Don't know how to automatically pick scale for object of type haven_labelled. Defaulting to continuous.
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

We can convert to factors with the as_factor function

wages %>% 
  mutate(sex = as_factor(sex))
## # A tibble: 400 x 9
##       id  educ south     sex   exper  wage occup     marr      ed       
##    <dbl> <dbl> <dbl+lbl> <fct> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
##  1     3    12 0         Male     17  7.5  6         1         2        
##  2     4    13 0         Male      9 13.1  6         0         3        
##  3     5    10 1         Male     27  4.45 6         0         1        
##  4    12     9 1         Male     30  6.25 6         0         1        
##  5    13     9 1         Male     29 20.0  6         1         1        
##  6    14    12 0         Male     37  7.3  6         1         2        
##  7    17    11 0         Male     16  3.65 6         0         1        
##  8    20    12 0         Male      9  3.75 6         0         2        
##  9    21    11 1         Male     14  4.5  6         1         1        
## 10    23     6 1         Male     45  5.75 6         1         1        
## # ... with 390 more rows
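Putting the two steps together, the sketch below converts first and then plots, so ggplot2 gets a discrete x axis. It runs on a small toy data frame with made-up values, so it can be tried without the wages file.

```r
library(haven)
library(ggplot2)

# Toy labelled vector (hypothetical values), not the wages file
sex_lbl <- labelled(c(0, 1, 1, 0, 1, 0), labels = c(Male = 0, Female = 1))

toy <- data.frame(
  sex  = as_factor(sex_lbl),            # labelled -> factor, levels taken from the labels
  wage = c(7.5, 13.1, 4.45, 6.25, 20, 7.3)
)

# sex is now a factor, so the boxplot gets one box per level
p <- ggplot(toy, aes(sex, wage)) +
  geom_boxplot()
p
```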

We can also use this little trick to turn all the labelled variables into factors at once

First get the variable names that are haven_labelled and save them into a vector

# use the almighty `map_lgl` function in the purrr package
# together with is.labelled() from haven to flag the labelled columns
variables_with_labels = wages %>% 
  map_lgl(is.labelled) %>% 
  which() %>% 
  names()

print(variables_with_labels)
## [1] "south" "sex"   "occup" "marr"  "ed"

Feed this vector into mutate_at

wages_factored = wages %>%
  mutate_at( vars(variables_with_labels), as_factor)

#let's have a look
wages_factored
## # A tibble: 400 x 9
##       id  educ south          sex   exper  wage occup marr     ed          
##    <dbl> <dbl> <fct>          <fct> <dbl> <dbl> <fct> <fct>    <fct>       
##  1     3    12 does not live~ Male     17  7.5  Other Married  High school~
##  2     4    13 does not live~ Male      9 13.1  Other Not mar~ Some college
##  3     5    10 lives in South Male     27  4.45 Other Not mar~ Less than h~
##  4    12     9 lives in South Male     30  6.25 Other Not mar~ Less than h~
##  5    13     9 lives in South Male     29 20.0  Other Married  Less than h~
##  6    14    12 does not live~ Male     37  7.3  Other Married  High school~
##  7    17    11 does not live~ Male     16  3.65 Other Not mar~ Less than h~
##  8    20    12 does not live~ Male      9  3.75 Other Not mar~ High school~
##  9    21    11 lives in South Male     14  4.5  Other Married  Less than h~
## 10    23     6 lives in South Male     45  5.75 Other Married  Less than h~
## # ... with 390 more rows
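Two shorter routes to the same result are sketched below on a toy tibble (made-up values). The first relies on `as_factor()` also accepting a whole data frame, where by default it converts only the labelled columns. The second assumes dplyr 1.0.0 or later for `across()`.

```r
library(haven)
library(dplyr)

# Toy data (hypothetical values), not the wages file
toy <- tibble(
  id   = c(1, 2, 3),
  sex  = labelled(c(0, 1, 1), labels = c(Male = 0, Female = 1)),
  marr = labelled(c(1, 0, 1), labels = c("Not married" = 0, Married = 1))
)

# Route 1: as_factor() on the data frame converts only the labelled columns
toy_factored <- as_factor(toy)

# Route 2: select the labelled columns with where(is.labelled)
toy_factored2 <- toy %>%
  mutate(across(where(is.labelled), as_factor))

toy_factored
```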

How do we go back to SPSS?

Option 1

A toy data frame

test_1 = data.frame( 
  
  # factor with numeric levels and associated labels
  sex = factor( c(1,2,2,2,1), 
              levels  = c(1,2), 
              labels = c("Male", "Female")) ,
  # numeric variable
  wage = c(40,50,35,70,55)
  
  )

test_1
##      sex wage
## 1   Male   40
## 2 Female   50
## 3 Female   35
## 4 Female   70
## 5   Male   55

Now create an SPSS file

write_sav(test_1, "test_1.sav")

The spss variable view of test_1

Option 2

We use the labelled function in haven

test_2 = data.frame( 
  
  # labelled numeric vector with value labels
  sex = labelled(c(1,2,2,2,1), 
                 #the value labels
                 c(Male = 1, Female = 2), 
                 # we can also assign a Variable Label in SPSS style
                 label="Assigned sex at birth") ,
  # numeric variable
  wage = c(40,50,35,70,55)
  
  )

test_2
##   sex wage
## 1   1   40
## 2   2   50
## 3   2   35
## 4   2   70
## 5   1   55

write_sav(test_2, "test_2.sav")

The spss variable view of test_2
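Without SPSS at hand, a round trip is one way to check what was written. The sketch below writes a toy data frame like test_2 to a temporary file (the temp path is my assumption, any .sav path would do), reads it back with `read_sav()`, and inspects the attributes.

```r
library(haven)

# Same toy values as test_2, built column by column
test <- data.frame(wage = c(40, 50, 35, 70, 55))
test$sex <- labelled(c(1, 2, 2, 2, 1),
                     labels = c(Male = 1, Female = 2),
                     label  = "Assigned sex at birth")

# Write to a temporary .sav file and read it straight back
path <- tempfile(fileext = ".sav")
write_sav(test, path)
back <- read_sav(path)

attr(back$sex, "label")    # the Variable Label
attr(back$sex, "labels")   # the Value Labels
as_factor(back$sex)        # and back to a factor
```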