This document covers one way in which a SAS dataset can be converted directly to an R dataframe. I’ll include a reference to the raw source data and I’ll show the SAS code used to read the raw data and produce a SAS dataset. I’ll wrap up with the R library and code required to read the dataset.
The raw data for this example is drawn from a public source and is freely available for download. We’ll be looking at: Inpatient Prospective Payment System (IPPS) Provider Summary for the Top 100 Diagnosis-Related Groups (DRG) - FY2011. A link to this dataset is here and a detailed description of it can be found here.
This is a good size dataset with 12 columns and 163,065 rows. It was downloaded as a CSV file for purposes of this demonstration.
Note: SAS University Edition running SAS Studio 3.4 in an Oracle Virtual Box was used for this demo.
Here is the SAS code that will be used to read this file and construct a SAS dataset. I’ve taken a very explicit approach here instead of using some of the data import wizards packaged with various SAS interfaces. This code will skip the first row of data (the column header row) so I need to spell out the variable names we’ll be creating. Also, the code will only read the next 999 rows to keep processing time to a minimum.
filename dd1 "/folders/myshortcuts/Projects/SAS to R Demo/Inpatient_Prospective_Payment_System.csv";
libname demo "/folders/myshortcuts/Projects/SAS to R Demo/";
data demo.raw_sas;
length drg_def $ 50
provider_id $ 10
provider_name $ 50
provider_street $ 50
provider_city $ 50
provider_state $ 50
provider_zip $ 5
hosp_ref_region_desc $ 50
total_discharges 8
avg_covered_charges 8
average_total_payments 8
average_medicare_payments 8;
infile dd1 dlm = "," dsd firstobs=2 obs=1000 missover truncover;
input drg_def $
provider_id $
provider_name $
provider_street $
provider_city $
provider_state $
provider_zip $
hosp_ref_region_desc $
total_discharges 2.
avg_covered_charges dollar10.2
average_total_payments dollar10.2
average_medicare_payments dollar10.2;
run;
The SAS log reported the following results:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
55
56 filename dd1 "/folders/myshortcuts/Projects/SAS to R Demo/Inpatient_Prospective_Payment_System.csv";
57 libname demo "/folders/myshortcuts/Projects/SAS to R Demo/";
NOTE: Libref DEMO was successfully assigned as follows:
Engine: V9
Physical Name: /folders/myshortcuts/Projects/SAS to R Demo
58
59 data demo.raw_sas;
60 length drg_def $ 50
61 provider_id $ 10
62 provider_name $ 50
63 provider_street $ 50
64 provider_city $ 50
65 provider_state $ 50
66 provider_zip $ 5
67 hosp_ref_region_desc $ 50
68 total_discharges 8
69 avg_covered_charges 8
70 average_total_payments 8
71 average_medicare_payments 8;
72
73 infile dd1 dlm = "," dsd firstobs=2 obs=1000 missover truncover;
WARNING: END-OF-RECORD option TRUNCOVER in effect for infile DD1.
74 input drg_def $
75 provider_id $
76 provider_name $
77 provider_street $
78 provider_city $
79 provider_state $
80 provider_zip $
81 hosp_ref_region_desc $
82 total_discharges 2.
83 avg_covered_charges dollar10.2
84 average_total_payments dollar10.2
85 average_medicare_payments dollar10.2;
86 run;
NOTE: The infile DD1 is:
Filename=/folders/myshortcuts/Projects/SAS to R Demo/Inpatient_Prospective_Payment_System.csv,
Owner Name=root,Group Name=vboxsf,
Access Permission=-rwxrwx---,
Last Modified=16Dec2015:10:18:18,
File Size (bytes)=27330796
NOTE: 999 records were read from the infile DD1.
The minimum record length was 130.
The maximum record length was 197.
NOTE: The data set DEMO.RAW_SAS has 999 observations and 12 variables.
NOTE: DATA statement used (Total process time):
real time 0.08 seconds
cpu time 0.04 seconds
87
88
89
90 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
102
The key takeaway here is that the resulting SAS dataset is called RAW_SAS
, and has 999 observations and 12 variables. We want to reproduce this in R.
The SAS dataset is sitting in the current working directory of this project. To read it and save it into an R dataframe we’ll use the sas7bdat
library. You can read more about this package by installing it and consulting its help files.
From the documentation it appears this package was last published in 2014 and may not be under active development. Also, the author notes that this package is experimental. Use at your own risk!
library(sas7bdat)
## Warning: package 'sas7bdat' was built under R version 3.2.3
raw_r <- read.sas7bdat("raw_sas.sas7bdat")
Now that we have built a data frame called raw_r
we can investigate the contents of it as follows:
str(raw_r)
## 'data.frame': 999 obs. of 12 variables:
## $ drg_def : Factor w/ 1 level "039 - EXTRACRANIAL PROCEDURES W/O CC/MCC": 1 1 1 1 1 1 1 1 1 1 ...
## $ provider_id : Factor w/ 999 levels "100002","100006",..: 4 21 24 43 54 70 87 89 90 91 ...
## $ provider_name : Factor w/ 973 levels "ABBOTT NORTHWESTERN HOSPITAL",..: 760 440 212 844 741 54 202 911 346 273 ...
## $ provider_street : Factor w/ 998 levels "1 COOPER PLAZA",..: 82 420 332 671 29 342 316 785 51 46 ...
## $ provider_city : Factor w/ 682 levels "ABILENE","ABINGTON",..: 149 58 193 55 5 386 438 55 266 210 ...
## $ provider_state : Factor w/ 47 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ provider_zip : Factor w/ 956 levels "10003","10016",..: 320 317 313 310 304 318 326 309 314 316 ...
## $ hosp_ref_region_desc : Factor w/ 281 levels "AK - Anchorage",..: 3 2 2 2 2 6 2 2 4 2 ...
## $ total_discharges : num 91 14 24 25 18 67 51 32 13 34 ...
## $ avg_covered_charges : num 32963 15132 37560 13998 31633 ...
## $ average_total_payments : num 5777 5788 5435 5418 5658 ...
## $ average_medicare_payments: num 4764 4977 4454 4129 4851 ...
## - attr(*, "pkg.version")= chr "0.5"
## - attr(*, "column.info")=List of 12
## ..$ :List of 10
## .. ..$ name : chr "drg_def"
## .. ..$ offset: int 32
## .. ..$ length: int 50
## .. ..$ type : chr "character"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## ..$ :List of 10
## .. ..$ name : chr "provider_id"
## .. ..$ offset: int 82
## .. ..$ length: int 10
## .. ..$ type : chr "character"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## ..$ :List of 10
## .. ..$ name : chr "provider_name"
## .. ..$ offset: int 92
## .. ..$ length: int 50
## .. ..$ type : chr "character"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## ..$ :List of 10
## .. ..$ name : chr "provider_street"
## .. ..$ offset: int 142
## .. ..$ length: int 50
## .. ..$ type : chr "character"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## ..$ :List of 10
## .. ..$ name : chr "provider_city"
## .. ..$ offset: int 192
## .. ..$ length: int 50
## .. ..$ type : chr "character"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## ..$ :List of 10
## .. ..$ name : chr "provider_state"
## .. ..$ offset: int 242
## .. ..$ length: int 50
## .. ..$ type : chr "character"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## ..$ :List of 10
## .. ..$ name : chr "provider_zip"
## .. ..$ offset: int 292
## .. ..$ length: int 5
## .. ..$ type : chr "character"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## ..$ :List of 10
## .. ..$ name : chr "hosp_ref_region_desc"
## .. ..$ offset: int 297
## .. ..$ length: int 50
## .. ..$ type : chr "character"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## ..$ :List of 10
## .. ..$ name : chr "total_discharges"
## .. ..$ offset: int 0
## .. ..$ length: int 8
## .. ..$ type : chr "numeric"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## ..$ :List of 10
## .. ..$ name : chr "avg_covered_charges"
## .. ..$ offset: int 8
## .. ..$ length: int 8
## .. ..$ type : chr "numeric"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## ..$ :List of 10
## .. ..$ name : chr "average_total_payments"
## .. ..$ offset: int 16
## .. ..$ length: int 8
## .. ..$ type : chr "numeric"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## ..$ :List of 10
## .. ..$ name : chr "average_medicare_payments"
## .. ..$ offset: int 24
## .. ..$ length: int 8
## .. ..$ type : chr "numeric"
## .. ..$ fhdr : int 0
## .. ..$ foff : int 0
## .. ..$ flen : int 0
## .. ..$ lhdr : int 0
## .. ..$ loff : int 0
## .. ..$ llen : int 0
## - attr(*, "date.created")= POSIXct, format: "2015-12-16 17:58:11"
## - attr(*, "date.modified")= POSIXct, format: "2015-12-16 17:58:11"
## - attr(*, "SAS.release")= chr "9.0401M3"
## - attr(*, "SAS.host")= chr "Linux"
## - attr(*, "OS.version")= chr "2.6.32-504.23.4."
## - attr(*, "OS.maker")= chr ""
## - attr(*, "OS.name")= chr "x86_64"
## - attr(*, "endian")= chr "little"
## - attr(*, "winunix")= chr "unix"
The resulting R data frame contains more info about each of the fields captured by the read.sas7bdat
function than is typical of other common R “read” functions.
In particular the new data frame contains the additional attribute $column.info
which is itself a list giving more details about each of the fields that it processed.
To explore the final result we can display a few rows and columns of the data frame:
library(knitr)
kable(raw_r[1:5, 1:5])
drg_def | provider_id | provider_name | provider_street | provider_city |
---|---|---|---|---|
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN |
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ |
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE |
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT’S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM |
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER |
We can also plot some of the data:
library(ggplot2)
ggplot(raw_r) +
geom_histogram(aes(x = total_discharges), binwidth = 5)
In summary it appears this package can be used to read an existing SAS dataset without needing to invoke SAS directly. My demo here is certainly not exhaustive of all SAS objects one might want to read - your own experience may vary! As noted above, it does not appear that the package is being actively developed so use it at your own risk.
Other packages, such foreign
, require that the SAS dataset be converted first to “export” format before reading, or that SAS be invokable at the command line in order to convert the SAS dataset to an R-readable file on the fly.
Given these observations, a reliable alternative is to simply export the SAS dataset of interest to a common format such as CSV, XML, TXT, XLSX, or JSON (i.e. via PROC JSON
) and go from there. R has packages that can handle all of these formats.