Data Wrangling In R For SAS users
Priyanka Gagneja
Senior R Developer at ProCogia
Say Hi:
Check Out:
Who
Why
How
Importing data
Wrangling data
Plotting
Macros vs Functions
Modeling
Yay!!
SAS users start to feel at ease thinking about doing things with ease in R
SAS
Working with columns - create, (de)select, rename variables
SAS
# assigning a new data frame - data step
data new_data;
set old_data;
run;
# keeping & dropping a column
data new_data (keep=id);
set old_data (drop=job_title) ;
run;
# dropping multiple columns
data new_data (drop= temp: );
set old_data;
run;
# renaming a column
data new_data;
set old_data;
rename old_name = new_name;
run;R
# assigning a new data frame
new_data <- old_data
# selecting & deselecting a column
new_data <- old_data %>%
select(-job_title) %>%
select(id)
# selecting a bunch of columns to remove
new_data <- old_data %>%
select( -starts_with("temp"))
# renaming a column
new_data <- old_data %>%
rename(new_name = old_name)Working with rows - Sorting & Filtering
SAS
# Sorting data
proc sort data=old_data out=new_data;
by id descending income ;
run;
# Remove duplicates while sorting
proc sort data=old_data nodup;
by id job_type;
run;
# Filtering for rows satisfying certain criteria
data new_data;
set old_data;
if year = 2020;
run;
data new_data;
set old_data;
by id;
counter + 1 ;
if first.id then counter = 1;
if counter <= 5;
run;R
# Arranging data
new_data <- old_data %>%
arrange(id, desc(income))
# Remove duplicates while sorting
old_data <- old_data %>%
arrange(id, job_type) %>%
distinct()
# Filtering for rows satisfying certain criteria
new_data <- old_data %>%
filter(year == 2020)
new_data <- old_data %>%
group_by(id) %>%
mutate(counter = row_number()) %>%
filter(counter <=5)Working with aggregates - Summarising, Tabulating
SAS
# Counting across one/multiple columns
proc freq data = old_data ;
table job_type ;
run;
proc freq data = old_data ;
table job_type*region ;
run;
# Summarise data
proc summary data = old_data nway ;
class job_type region ;
output out = new_data ;
run;
proc summary data = old_data nway ;
class job_type region ;
var salary ;R
# Counting across one/multiple columns
old_data %>%
count(job_type) %>%
mutate(percent = n*100/sum(n))
old_data %>%
count(job_type, region)
# Summarise data (equi to with nway option)
new_data <- old_data %>%
group_by(job_type, region) %>%
summarise(Count = n())
new_data <- old_data %>%
group_by(job_type, region) %>%
summarise(total_salaries = sum(salary) ,
Count = n() )Combining various dataframes - Merging, Appending et al
More to cover ?
working with dates
working with strings
Plotting data
SAS
R
For simpler functions, helper functions like across() from tidyverse could also be used to perform an action across multiple columns without needing to use a loop.
For nested functions, you could use map (or ancillary) family of functions from {purrr} pkg to get a more efficient form of for/while loops.
R-Ladies theme for Quarto Presentations. Code available on GitHub.