RLadies Philly & DC Lightning Talk

Data Wrangling In R For SAS users

Introduction

Priyanka Gagneja

Senior R Developer at ProCogia

Say Hi:

Check Out:

Github
Blog

Let’s Discuss

Analytics Pipeline

Importing data
Wrangling data
Plotting
Macros vs Functions
Modeling

Yay!!

SAS users start to feel at ease thinking about doing things with ease in R

Importing data

SAS

Operate in ‘Work’ library by default
Use libname to define file locations

libname library_name "file_location"; 

data data_in_use ;
set library_name.saved_data ; 
run;

proc import datafile = "my_file.csv" 
            out = my_data dbms = csv;
run;

Operate in a given ‘working directory’- use getwd() )
Use setwd() to switch to other locations

setwd("file_location")


save(data_in_use , file = "saved_data.rda")




write.csv()

Wrangling data

Working with columns - create, (de)select, rename variables

SAS

# assigning a new data frame - data step
data new_data; 
set old_data;
run;

# keeping & dropping a column
data new_data (keep=id);
set old_data (drop=job_title) ;
run;

# dropping multiple columns
data new_data (drop= temp: ); 
set old_data;
run;

# renaming a column   
data new_data;
set old_data;
rename old_name = new_name;
run;

# assigning a new data frame
new_data <- old_data



# selecting & deselecting a column
new_data <- old_data %>% 
  select(-job_title) %>% 
  select(id)

# selecting a bunch of columns to remove
new_data <- old_data %>% 
  select( -starts_with("temp"))

          
# renaming a column         


new_data <- old_data %>%
  rename(new_name = old_name)

Wrangling data.. contd

Working with rows - Sorting & Filtering

SAS

# Sorting data
proc sort data=old_data out=new_data; 
by id descending income ;
run;

# Remove duplicates while sorting
proc sort data=old_data nodup; 
by id job_type;
run;


# Filtering for rows satisfying certain criteria
data new_data;
set old_data; 
if year = 2020;
run;

data new_data;
set old_data;
by id;
counter + 1 ;
if first.id then counter = 1;
if counter <= 5;
run;

# Arranging data
new_data <- old_data %>% 
  arrange(id, desc(income))


# Remove duplicates while sorting
old_data <- old_data %>% 
  arrange(id, job_type) %>% 
  distinct()


# Filtering for rows satisfying certain criteria
new_data <- old_data %>%
  filter(year == 2020)





new_data <- old_data %>% 
  group_by(id) %>%
  mutate(counter = row_number()) %>% 
  filter(counter <=5)

Wrangling data.. contd

Working with aggregates - Summarising, Tabulating

SAS

# Counting across one/multiple columns
proc freq data = old_data ; 
table job_type ;
run;

proc freq data = old_data ;
table job_type*region ; 
run;

# Summarise data
proc summary data = old_data nway ; 
class job_type region ;
output out = new_data ;
run;

proc summary data = old_data nway ; 
class job_type region ;
var salary ;

# Counting across one/multiple columns
old_data %>%
  count(job_type) %>% 
  mutate(percent = n*100/sum(n))

old_data %>%
 count(job_type, region)


# Summarise data (equi to with nway option)
new_data <- old_data %>% 
  group_by(job_type, region) %>% 
  summarise(Count = n())

new_data <- old_data %>%
  group_by(job_type, region) %>% 
  summarise(total_salaries = sum(salary) ,
            Count          = n() )

Wrangling data.. contd

Combining various dataframes - Merging, Appending et al

SAS

# Append/Bind rows
data new_data ;
set data_1 data_2 ; 
run;

# Join df to add columns
data new_data ;
merge data_1 (in= in_1) data_2 ; 
by id ;
if in_1 ;
run;

# Append/Bind rows
new_data <- bind_rows(data_1, 
                      data_2)



# Join df to add columns
new_data <- left_join(data_1 , 
                      data_2 , 
                      by = "id"
                      )

Wrangling data.. contd

More to cover ?

working with dates
working with strings
Plotting data

Automating the tasks

SAS

%macro add_variable(dataset_name); 

data &dataset_name;
set &dataset_name;
new_variable = 1; 
run;

%mend;

%add_variable( my_data );

add_variable <- function(dataset_name){ 
  
  dataset_name <- dataset_name %>%
    mutate(new_variable = 1)
  
return( dataset_name ) 
}

my_data <- add_variable( my_data )

Automating the tasks

For simpler functions, helper functions like across() from tidyverse could also be used to perform an action across multiple columns without needing to use a loop.
For nested functions, you could use map (or ancillary) family of functions from {purrr} pkg to get a more efficient form of for/while loops.

RLadies Philly & DC Lightning Talk

Introduction

Let’s Discuss

Analytics Pipeline

Importing data

Wrangling data

Wrangling data.. contd

Wrangling data.. contd

Wrangling data.. contd

Wrangling data.. contd

Automating the tasks

Automating the tasks

Thank You !!