RLadies Philly & DC Lightning Talk

Data Wrangling In R For SAS users

Introduction

Priyanka Gagneja

Senior R Developer at ProCogia

Say Hi:

Check Out:

Let’s Discuss

  • Who

  • Why

  • How

Analytics Pipeline

  • Importing data

  • Wrangling data

  • Plotting

  • Macros vs Functions

  • Modeling

Yay!!

SAS users start to feel at ease thinking about doing things with ease in R

Importing data

SAS

  • Operate in ‘Work’ library by default
  • Use libname to define file locations
libname library_name "file_location"; 

data data_in_use ;
set library_name.saved_data ; 
run;

proc import datafile = "my_file.csv" 
            out = my_data dbms = csv;
run;

R

  • Operate in a given ‘working directory’- use getwd() )
  • Use setwd() to switch to other locations
setwd("file_location")


save(data_in_use , file = "saved_data.rda")




write.csv()

Wrangling data

Working with columns - create, (de)select, rename variables

SAS

# assigning a new data frame - data step
data new_data; 
set old_data;
run;

# keeping & dropping a column
data new_data (keep=id);
set old_data (drop=job_title) ;
run;

# dropping multiple columns
data new_data (drop= temp: ); 
set old_data;
run;

# renaming a column   
data new_data;
set old_data;
rename old_name = new_name;
run;

R

# assigning a new data frame
new_data <- old_data



# selecting & deselecting a column
new_data <- old_data %>% 
  select(-job_title) %>% 
  select(id)

# selecting a bunch of columns to remove
new_data <- old_data %>% 
  select( -starts_with("temp"))

          
# renaming a column         


new_data <- old_data %>%
  rename(new_name = old_name)

Wrangling data.. contd

Working with rows - Sorting & Filtering

SAS

# Sorting data
proc sort data=old_data out=new_data; 
by id descending income ;
run;

# Remove duplicates while sorting
proc sort data=old_data nodup; 
by id job_type;
run;


# Filtering for rows satisfying certain criteria
data new_data;
set old_data; 
if year = 2020;
run;

data new_data;
set old_data;
by id;
counter + 1 ;
if first.id then counter = 1;
if counter <= 5;
run;

R

# Arranging data
new_data <- old_data %>% 
  arrange(id, desc(income))


# Remove duplicates while sorting
old_data <- old_data %>% 
  arrange(id, job_type) %>% 
  distinct()


# Filtering for rows satisfying certain criteria
new_data <- old_data %>%
  filter(year == 2020)





new_data <- old_data %>% 
  group_by(id) %>%
  mutate(counter = row_number()) %>% 
  filter(counter <=5)

Wrangling data.. contd

Working with aggregates - Summarising, Tabulating

SAS

# Counting across one/multiple columns
proc freq data = old_data ; 
table job_type ;
run;

proc freq data = old_data ;
table job_type*region ; 
run;

# Summarise data
proc summary data = old_data nway ; 
class job_type region ;
output out = new_data ;
run;

proc summary data = old_data nway ; 
class job_type region ;
var salary ;

R

# Counting across one/multiple columns
old_data %>%
  count(job_type) %>% 
  mutate(percent = n*100/sum(n))

old_data %>%
 count(job_type, region)


# Summarise data (equi to with nway option)
new_data <- old_data %>% 
  group_by(job_type, region) %>% 
  summarise(Count = n())

new_data <- old_data %>%
  group_by(job_type, region) %>% 
  summarise(total_salaries = sum(salary) ,
            Count          = n() )

Wrangling data.. contd

Combining various dataframes - Merging, Appending et al

SAS

# Append/Bind rows
data new_data ;
set data_1 data_2 ; 
run;

# Join df to add columns
data new_data ;
merge data_1 (in= in_1) data_2 ; 
by id ;
if in_1 ;
run;

R

# Append/Bind rows
new_data <- bind_rows(data_1, 
                      data_2)



# Join df to add columns
new_data <- left_join(data_1 , 
                      data_2 , 
                      by = "id"
                      )

Wrangling data.. contd

More to cover ?

  • working with dates

  • working with strings

  • Plotting data

Automating the tasks

SAS

%macro add_variable(dataset_name); 

data &dataset_name;
set &dataset_name;
new_variable = 1; 
run;

%mend;

%add_variable( my_data );

R

add_variable <- function(dataset_name){ 
  
  dataset_name <- dataset_name %>%
    mutate(new_variable = 1)
  
return( dataset_name ) 
}

my_data <- add_variable( my_data )

Automating the tasks

  • For simpler functions, helper functions like across() from tidyverse could also be used to perform an action across multiple columns without needing to use a loop.

  • For nested functions, you could use map (or ancillary) family of functions from {purrr} pkg to get a more efficient form of for/while loops.

Thank You !!