This post explains the basics of using the haven package in R
It is intended as a reminder for myself when going back and forth between SPSS and R
Hopefully it will help some other people too.
Feel free to comment and suggest improvements to the script
Let’s begin
Load the libraries
library(tidyverse)
## Warning: package 'tidyr' was built under R version 3.5.2
library(haven)
To follow along, you can download the Wages.sav file from here
This dataset has both scale (numerical) and categorical (factor) variables
All categorical variables are coded as numeric with assigned value labels
For example, sex is a categorical variable coded as 0 for Male and 1 for Female
wages <- read_sav("Wages.sav")
Let’s take a glimpse at what was imported
glimpse(wages)
## Observations: 400
## Variables: 9
## $ id <dbl> 3, 4, 5, 12, 13, 14, 17, 20, 21, 23, 25, 28, 30, 31, 32,...
## $ educ <dbl> 12, 13, 10, 9, 9, 12, 11, 12, 11, 6, 10, 8, 12, 12, 12, ...
## $ south <dbl+lbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0...
## $ sex <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exper <dbl> 17, 9, 27, 30, 29, 37, 16, 9, 14, 45, 30, 19, 36, 20, 35...
## $ wage <dbl> 7.50, 13.07, 4.45, 6.25, 19.98, 7.30, 3.65, 3.75, 4.50, ...
## $ occup <dbl+lbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ marr <dbl+lbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1...
## $ ed <dbl+lbl> 2, 3, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4...
Note the data type of sex (or any other categorical variable): <dbl+lbl>
This is a labelled variable in the haven package
See what comes along with the sex variable when we import the SPSS file
attributes(wages$sex)
## $format.spss
## [1] "F1.0"
##
## $display_width
## [1] 3
##
## $class
## [1] "haven_labelled"
##
## $labels
## Male Female
## 0 1
We have the $class attribute, which says it’s a haven_labelled object (this is an S3 class)
We also have the $labels attribute
These are the Value Labels [0 Male, 1 Female]
This is going to be useful (as we will see in a moment)
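To make the two attributes concrete, here is a minimal stand-alone sketch. The vector sex_demo is a made-up example, not part of Wages.sav; the only assumption is that haven is installed.

```r
library(haven)

# hypothetical toy vector, built to mimic the imported sex variable
sex_demo <- labelled(c(0, 1, 1, 0), labels = c(Male = 0, Female = 1))

# the class vector contains "haven_labelled"
# (newer haven versions add further entries to it)
inherits(sex_demo, "haven_labelled")

# the value labels are a named numeric vector stored in the "labels" attribute
value_labels <- attr(sex_demo, "labels")
names(value_labels)   # the label text
unname(value_labels)  # the underlying codes
```

Using inherits() rather than comparing class(x) to a single string keeps the check working when the class vector has more than one entry.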
Let’s see another one
attributes(wages$marr)
## $label
## [1] "Marital status"
##
## $format.spss
## [1] "F1.0"
##
## $display_width
## [1] 4
##
## $class
## [1] "haven_labelled"
##
## $labels
## Not married Married
## 0 1
Now we also have the $label attribute
This is of course the Variable Label
We didn’t have this for sex because there was no Variable Label for sex in the Wages.sav file
We can assign one if we want, using the attr function
attr(wages$sex, "label") <- "Gender"
# let's see
attributes(wages$sex)
## $format.spss
## [1] "F1.0"
##
## $display_width
## [1] 3
##
## $class
## [1] "haven_labelled"
##
## $labels
## Male Female
## 0 1
##
## $label
## [1] "Gender"
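The same assignment can be tried on a small stand-alone vector. This is only a sketch: sex_demo is a hypothetical example, not a column of Wages.sav. One detail worth knowing when reading the attribute back is that attr() matches names partially by default, so asking for "label" on a vector that only has "labels" returns the value labels; exact = TRUE avoids that.

```r
library(haven)

# hypothetical toy vector with value labels but no variable label
sex_demo <- labelled(c(0, 1, 1, 0), labels = c(Male = 0, Female = 1))

# exact = TRUE stops attr() from partially matching "label" to "labels"
had_label <- !is.null(attr(sex_demo, "label", exact = TRUE))
had_label

# assign a variable label, then read it back
attr(sex_demo, "label") <- "Gender"
attr(sex_demo, "label", exact = TRUE)
```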
something with the '
s..
(I think I need to read more about S3 classes in Hands-On Programming with R)
Also have a look at this (use the head function to avoid “overwhelming” the console)
head(wages$marr)
## <Labelled double>: Marital status
## [1] 1 0 0 0 1 1
##
## Labels:
## value label
## 0 Not married
## 1 Married
We can convert these labelled variables into factors
This is important for modelling and/or plotting
See what happens if we don’t convert
wages %>%
  ggplot(aes(sex, wage)) +
  geom_boxplot()
## Don't know how to automatically pick scale for object of type haven_labelled. Defaulting to continuous.
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
We can convert to factors with the as_factor function
wages %>%
  mutate(sex = as_factor(sex))
## # A tibble: 400 x 9
## id educ south sex exper wage occup marr ed
## <dbl> <dbl> <dbl+lbl> <fct> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
## 1 3 12 0 Male 17 7.5 6 1 2
## 2 4 13 0 Male 9 13.1 6 0 3
## 3 5 10 1 Male 27 4.45 6 0 1
## 4 12 9 1 Male 30 6.25 6 0 1
## 5 13 9 1 Male 29 20.0 6 1 1
## 6 14 12 0 Male 37 7.3 6 1 2
## 7 17 11 0 Male 16 3.65 6 0 1
## 8 20 12 0 Male 9 3.75 6 0 2
## 9 21 11 1 Male 14 4.5 6 1 1
## 10 23 6 1 Male 45 5.75 6 1 1
## # ... with 390 more rows
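As a small illustration of what as_factor does to a single labelled vector, here is a sketch on made-up data (sex_demo is hypothetical, not taken from Wages.sav). It also shows the levels argument of haven's as_factor, which controls whether the factor levels are built from the value labels or from the codes.

```r
library(haven)

# hypothetical toy vector
sex_demo <- labelled(c(0, 1, 1, 0), labels = c(Male = 0, Female = 1))

# default: factor levels come from the value labels
by_label <- as_factor(sex_demo)
levels(by_label)

# levels = "values": keep the original codes as the level names
by_value <- as_factor(sex_demo, levels = "values")
levels(by_value)
```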
We can also use this little trick to turn all the labelled variables into factors
First get the names of the variables that are haven_labelled and save them into a vector
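# use the almighty `map` family from the purrr package
# is.labelled() is TRUE only for haven_labelled columns, so columns with
# another class (or several classes) cannot slip into the result
variables_with_labels = wages %>%
  map_lgl(is.labelled) %>%
  which() %>%
  names()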
print(variables_with_labels)
## [1] "south" "sex" "occup" "marr" "ed"
Feed this vector into mutate_at
wages_factored = wages %>%
  mutate_at(vars(variables_with_labels), as_factor)
#let's have a look
wages_factored
## # A tibble: 400 x 9
## id educ south sex exper wage occup marr ed
## <dbl> <dbl> <fct> <fct> <dbl> <dbl> <fct> <fct> <fct>
## 1 3 12 does not live~ Male 17 7.5 Other Married High school~
## 2 4 13 does not live~ Male 9 13.1 Other Not mar~ Some college
## 3 5 10 lives in South Male 27 4.45 Other Not mar~ Less than h~
## 4 12 9 lives in South Male 30 6.25 Other Not mar~ Less than h~
## 5 13 9 lives in South Male 29 20.0 Other Married Less than h~
## 6 14 12 does not live~ Male 37 7.3 Other Married High school~
## 7 17 11 does not live~ Male 16 3.65 Other Not mar~ Less than h~
## 8 20 12 does not live~ Male 9 3.75 Other Not mar~ High school~
## 9 21 11 lives in South Male 14 4.5 Other Married Less than h~
## 10 23 6 lives in South Male 45 5.75 Other Married Less than h~
## # ... with 390 more rows
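For comparison, the same conversion can be sketched without building the vector of names first. The tibble toy below is made-up data that only mimics a few columns of Wages.sav. Option 1 assumes dplyr 1.0.0 or later (for across and where); option 2 relies on the data frame method of haven's as_factor, which by default converts only the labelled columns.

```r
library(haven)
library(dplyr)

# hypothetical toy data, mimicking a few Wages.sav columns
toy <- tibble(
  sex  = labelled(c(0, 1, 1), labels = c(Male = 0, Female = 1)),
  marr = labelled(c(1, 0, 1), labels = c("Not married" = 0, Married = 1)),
  wage = c(7.5, 13.07, 4.45)
)

# option 1: across() + where(), assuming dplyr >= 1.0.0
toy_factored <- toy %>%
  mutate(across(where(is.labelled), as_factor))

# option 2: as_factor() on the whole data frame touches labelled columns only
toy_factored_2 <- as_factor(toy)

toy_factored
```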
A toy data frame
test_1 = data.frame(
  # factor with numeric levels and associated labels
  sex = factor(c(1, 2, 2, 2, 1),
               levels = c(1, 2),
               labels = c("Male", "Female")),
  # numeric variable
  wage = c(40, 50, 35, 70, 55)
)
test_1
## sex wage
## 1 Male 40
## 2 Female 50
## 3 Female 35
## 4 Female 70
## 5 Male 55
Now create an SPSS file
write_sav(test_1, "test_1.sav")
The SPSS variable view of test_1
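Without SPSS at hand, one way to check what was written is to read the file back into R. The sketch below rebuilds the same toy data so that it runs on its own, and writes to a temporary file (an assumption made here to avoid touching the working directory).

```r
library(haven)

# same toy data as test_1, rebuilt so the sketch is self-contained
toy <- data.frame(
  sex  = factor(c(1, 2, 2, 2, 1), levels = c(1, 2), labels = c("Male", "Female")),
  wage = c(40, 50, 35, 70, 55)
)

# write to a temporary file and read it back
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)
back <- read_sav(tmp)

# inspect how the factor came back
class(back$sex)
attr(back$sex, "labels")
```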
We use the labelled function in haven
test_2 = data.frame(
  # labelled vector with numeric values and associated value labels
  sex = labelled(c(1, 2, 2, 2, 1),
                 # the value labels
                 c(Male = 1, Female = 2),
                 # we can also assign a Variable Label in SPSS style
                 label = "Assigned sex at birth"),
  # numeric variable
  wage = c(40, 50, 35, 70, 55)
)
test_2
## sex wage
## 1 1 40
## 2 2 50
## 3 2 35
## 4 2 70
## 5 1 55
write_sav(test_2, "test_2.sav")
The SPSS variable view of test_2
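The same read-back check can be sketched for the labelled version, to confirm that both the Value Labels and the Variable Label survive the trip through a .sav file. The data is rebuilt here as a tibble (an assumption, chosen so the sketch does not depend on how data.frame handles labelled vectors in a given haven version) and written to a temporary file.

```r
library(haven)

# same toy data as test_2, rebuilt as a tibble so the sketch is self-contained
toy <- tibble::tibble(
  sex = labelled(c(1, 2, 2, 2, 1),
                 labels = c(Male = 1, Female = 2),
                 label = "Assigned sex at birth"),
  wage = c(40, 50, 35, 70, 55)
)

# write to a temporary file and read it back
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)
back <- read_sav(tmp)

# exact = TRUE stops attr() from partially matching "label" to "labels"
attr(back$sex, "label", exact = TRUE)
attr(back$sex, "labels")
```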