SAS stands for Statistical Analysis System. It is a comprehensive software suite for data management, advanced analytics, and statistical analysis. Here are some important highlights of the SAS programming language:
SAS Program Structure:
DATA Step Functionality:
PROC Step Functionality:
SAS Statement Conventions:
Code Formatting:
Comments:
Missing Values:
Dataset Limitations:
Ending Steps:
In summary,
Running SAS can be accomplished through various methods depending on your environment and requirements. Here’s a comprehensive overview of the different methods for executing SAS code:
With batch mode, you save your SAS program in a file. Then you submit the file for processing by SAS. You may start the execution at a time you specify.
Here are steps to Run a SAS Program in Batch Mode:
Command-Line Execution Syntax (depending upon the operating system):
sas -sysin /home/user/my_program.sas -log /home/user/my_program.log -print /home/user/my_program.lst
Explanation
-sysin: Specifies the path to the SAS program file you want to run. -log: Specifies the path for the log file where SAS will write information about the execution process. -print: Specifies the path for the output file where SAS will write the results of the execution.
In SAS, system options are parameters that control various aspects of the SAS environment and the behavior of SAS programs. These options can be set globally or locally to modify how SAS processes data, manages output, and handles different tasks.
System options can be set in different ways, including the use of the OPTIONS statement in your SAS code. The OPTIONS statement starts with the keyword OPTIONS and follows with a list of options and their values.
/* Set SAS system options */
options nodate nocenter linesize=80 pagesize=50 validvarname=v7;
Explanation
Sets system options to customize output formatting:
Other useful options include
Before performing an analysis in SAS, you need to access your data and convert it into a SAS dataset. This conversion can be achieved through various methods, including using a DATA step with the DATA statement, utilizing the IMPORT procedure with a database engine, or applying SQL procedures with SQL queries. We will begin by discussing the DATA step method.
SAS code in each data step looks like
DATA SAS_data_name; /* This statement tells SAS to create a SAS data table*/
code line2; /*Each line hereafter gives instructions in order*/
code line3;
.
.
.
RUN;
Each SAS data step
Two things to mention:
/*Create a SAS dataset*/
DATA raw_data;
INFILE datalines DSD DLM=',' FIRSTOBS=1;
INPUT Name $ Age Salary Date : MMDDYY10. Gender $; /*Data Types must be specified*/
FORMAT Salary DOLLAR10.2 Date MMDDYY10.; /*Output format*/
DATALINES;
John Doe, 25, 35000.50, 12/31/2023, M
Jane Smith, 32, 42000.75, 11/15/2023, F
Tom Brown, 45, 50000.00,, M
Alice Johnson, 29, 47000.90, 08/10/23, F
; /* Use semicolon to indicate the End of data */
RUN; /* End of data step */
/*Print the created data with the PRINT PROCedure. We introduce different procedure later.*/
PROC PRINT DATA=raw_data;
RUN;
I have used indentation in the above code. Indentation, though not necessary, makes the code clear.
Explanation of code
How does SAS work in the data step in general? Refer to the videos https://www.youtube.com/watch?v=gonmrtJopWo, https://www.youtube.com/watch?v=uPYA_7cc7UY. If you understand the PDV mechanism, you are already a SAS data step expert.
I strongly suggest you watch all of the videos from https://www.youtube.com/@SAS_ArthurLi.
Now, let’s look at how PDV works here.
In summary,
SAS comes with some built-in libraries such as
You can use a libname statement to create your own permanent libraries for storing SAS data sets.
/*Create a library with a libref called mylib in the directory C:\Users\username\Documents*/
libname mylib 'C:\Users\username\Documents\SASData';
/*Using the Permanent Library*/
data mylib.new_dataset;
input name $ age salary;
datalines;
John 30 50000
Jane 25 60000
Tom 40 55000
;
run;
Explanation
If you are working on a network drive, you can also create a permanent library by pointing to a network path.
libname netlib '\\NetworkDrive\Shared\SASData';
Note: You can use direct referencing to create permannant SAS data set. To do so, you specify a path to the file:
/*Using direct referencing to create a permannant SAS dataset*/
data "/home/u688923/my_shared_file_links/u688923/try";
x=5;
y=7;
run;
Explanation
We can list relevant information about a SAS dataset using the PROC CONTENTS procedure.
data employees;
input name $ age salary;
datalines;
John 30 50000
Jane 25 60000
Tom 40 55000
;
/*Print all relevant info about SAS dataset employees*/
PROC CONTENTS DATA=employees;
RUN;
The content stores information (called descriptor) about the SAS dataset created. The descriptor includes the SAS engine, the date on which the SAS data was created, the number of observations and variables, along with many other pieces of information.
If your data is stored in an external CSV file “raw_data” in a path, you would still use the INFILE statement. Here’s how you handle a comma-separated file:
DATA raw_data;
INFILE 'path/to/raw_data.csv' DSD DLM=',' FIRSTOBS=1;
INPUT Name $ FirstName $ LastName $ Age Salary Date : MMDDYY10. Gender $;
FORMAT Salary DOLLAR10.2 Date MMDDYY10.;
RUN;
Explanation
In this example, the data are read using list input (or delimiter input). In fact, it’s comma-delimited input here.
If each of the variable’s values is always found in the same place in the line of data, then you can use the column input (fixed width). The advantage of column input is that it can read character data longer than 8 characters and numeric data in scientific notation. Watch this video https://www.youtube.com/watch?v=Fos7iN0jUxg (start at 11’13”) to learn column input.
data example;
infile datalines;
input id $ 1-5 /* ID from columns 1 to 5 */
name $ 6-15 /* Name from columns 6 to 15 */
age
height;
datalines;
12345John 30 180
67890Alice 25 165
;
run;
Note: Fields that contain embedded delimiters (e.g., commas, tabs) within the data itself can cause problems with list input. This is because list input assumes that fields are separated by delimiters and does not handle embedded delimiters well. So, don’t use list input for address (containing commas) or date and time values (containing / or - or :).
DATA scientific_data;
INFILE datalines;
INPUT ID $ ScientificValue BEST32.; /*the informat BEST32. reads numeric values in various formats, including scientific notation*/
DATALINES;
001 1.23E+04
002 5.67E-03
003 9.01E+01
;
RUN;
More on informats later, but we list some commonly used informats:
Purpose: Reads numeric values with commas as thousands separators. Usage: COMMA10. (Handles numbers like 1,234,567)
Purpose: Reads numeric values with dollar signs and commas. Usage: DOLLAR8.2 (Reads values like $1,234.56).
Purpose: Reads numeric values formatted as percentages. Usage: PERCENT6.2 (Reads 50.00%)
Purpose: Reads dates in the MM/DD/YYYY format. Usage: MMDDYY10. (Reads 08/15/2024).
Purpose: Reads dates in a default date format. Usage: DATE9. (Reads 15AUG2024).
Purpose: Reads time in the HH:MM:SS format. Usage: TIME8. (Reads 12:34:56).
Purpose: Reads character data without any special formatting. Usage: $CHAR20. (Reads up to 20 characters).
If you want to create a SAS dataset in a DATA step without reading from raw data or modifying existing SAS data, you can use direct variable assignments to generate the dataset.
data circle_area;
r = 4; /* radius */
A = 3.1415926*r**2; /* Calculate area of circle*/
run;
Explanation
The SAS data set has only one observation and two variables (columns) r and A with values 4 and 50.26548.
data example1;
input name $ 1-10 @; /* Hold the line for further processing */
input age 12-13; /* Read age from columns 12-13 */
datalines;
John Doe 30
Alice Smith 25
;
run;
data example2;
input name $ age @@; /* Read multiple records from the same line */
datalines;
John 30 Alice 25 Bob 35
;
run;
Explanation:
Column pointer controls are used to specify exact columns for reading fixed-width data and manage how SAS skips or moves through columns.
+n (Column Pointer Control): Moves the input pointer n columns to the right from its current position. Useful for skipping columns or starting to read data from a specific position.
#n (Column Pointer Control in Informats): In some contexts, #n is used to specify column positions directly in the context of reading data. However, #n is not a standard SAS syntax for column pointer control. Instead, “+n” is used for specifying column positions.
data example;
input name $ 1-10 +2 age 12-13; /* Skips 2 columns after reading name */
datalines;
John Doe 30
Alice Smith 25
;
run;
Let’s look at an example.
Each year engineering students from around the USA and Canada build concrete canoes and hold regional and national competitions. Part of the competition involves racing the canoes. The following data contain the final results of a men’s sprint competition. The data lines start with the name of the canoe, followed by the school, and the time.
Bellatorum School CSULA Time 1:40.5
The Kraken School ASU Time 1:45.35
Black Widow School UoA Time 1:33.7
Koicrete School CSUF Time 1:40.25
Khaos School UNLV Time 2:03.45
Max School UCSD Time 1:26.47
Hakuna Matata School UCLA Time 1:20.64
Prospector School CPSLOlIPP Time 1:12.08
Andromeda School CPP Time 1:25.1
Kekoapohaku School UHM Time 1:24.49
You can see that because the canoe names are not always the same number of characters, the school names do not line up in the same column. Also, the time values are sometimes six characters and sometimes seven.
DATA canoeresults;
INPUT Name & $13. @'School' School $ @'Time' RaceTime :STIMER8.; /*The informat ":STIMER8." tells SAS how to read time*/
format RaceTime TIME11.2; /*Output time as hh:mm:ss.xx*/
datalines;
Bellatorum School CSULA Time 1:40.5
The Kraken School ASU Time 1:45.35
Black Widow School UoA Time 1:33.7
Koicrete School CSUF Time 1:40.25
Khaos School UNLV Time 2:03.45
Max School UCSD Time 1:26.47
Hakuna Matata School UCLA Time 1:20.64
Prospector School CPSLOIPP Time 1:12.08
Andromeda School CPP Time 1:25.1
Kekoapohaku School UHM Time 1:24.49
;
RUN;
proc print data=canoeresults noobs;
format RaceTime time11.2; /*Rather than in data step, you can also place the output format here*/
run;
Explanation of code
Refer to the shared files on SAS studio. To see those shared files, double click Files (Home). Files with “.sas” suffix are SAS files. Other files are data CSV, XLS, XLSX, or DAT files. Read the sas file to learn how to create SAS datasets.
We will introduce how to
Modifying data using a SAS DATA step is a common practice for data cleaning, transformation, and manipulation. Below, I’ll walk you through various types of data modifications you can perform using a DATA step in SAS.
Example 1. Modifying Existing Variables
data raw_data;
set raw_data;
/* Increasing salary by 10% */
Salary = Salary * 1.10;
/* Adjusting age if it's under 30 */
if Age < 30 then Age = Age + 1;
run;
Example 2. Adding New Variables
data raw_data;
set raw_data; /* Reads the existing dataset */
/* Adding a new variable for annual salary */
AnnualSalary = Salary * 12;
/* Adding a new variable for age in months */
AgeInMonths = Age * 12;
run;
Example 3. Filtering Data
data young_raw_data;
set raw_data;
/* Keep only raw_data who are under 30 */
if Age < 30;
run;
Example 4. Renaming Variables
data raw_data_renamed;
set raw_data;
/* Renaming variables */
rename Name = FullName
Date = HireDate;
run;
Example 5. Dropping and Keeping Variables
data raw_data_filtered;
set raw_data;
/* Drop unwanted variables */
drop AgeInMonths;
/* Keep only certain variables */
keep Name Salary;
run;
Example 6. Reordering Variables: You can change the order of variables in the dataset.
data raw_data_reordered;
set raw_data;
/* Reorder variables */
retain Name Salary Age HireDate Gender;
run;
Example 7. Handling Missing Values: You can address missing values by imputing or setting defaults.
data raw_data_cleaned;
set raw_data;
/* Replace missing Salary with the average salary */
if missing(Salary) then Salary = 40000; /* Example default value */
/* Replace missing Gender with 'Unknown' */
if missing(Gender) then Gender = 'Unknown';
run;
In the above code,
Example 8. Combining Modifications: Here is a more complex example that combines several modifications.
data raw_data_modified;
set raw_data;
/* Adding a new variable for performance bonus */
PerformanceBonus = Salary * 0.05;
/* Modify salary based on age */
if Age < 30 then Salary = Salary * 1.10;
else if Age >= 50 then Salary = Salary * 1.05;
/* Drop the AgeInMonths variable */
drop AgeInMonths;
/* Rename variables for clarity */
rename Name = FullName
Date = HireDate;
/* Keep only relevant variables */
keep FullName Salary PerformanceBonus HireDate Gender;
run;
Conditional functions in SAS can be used to apply logic to data transformation.
DATA conditional_transformed;
SET raw_data;
/* Classify people into age groups */
AgeGroup =
case
when Age < 30 then 'Young'
when Age < 50 then 'Middle-aged'
else 'Senior'
end;
/* Assign a performance rating based on Salary */
PerformanceRating =
if Salary < 40000 then 'Needs Improvement'
else if Salary < 60000 then 'Meets Expectations'
else 'Exceeds Expectations';
RUN;
Note:
In SAS, the DO statements within a DATA step are used to execute a block of code multiple times, either based on conditions or iteratively over a range of values. They provide a way to perform repetitive tasks and control flow within the data step. Here’s a guide to the different types of DO statements and how they can be used:
Example 1. Simple DO Loop: A simple DO loop iterates a block of code a specified number of times.
data loop_example;
do i = 1 to 5; /* Loop from 1 to 5 */
output; /* Output a row for each iteration */
end;
run;
Explanation
Example 2. DO Loop with BY Statement: You can use a DO loop with a BY statement to process data by groups.
data grouped_summary;
set employees;
by Gender;
if first.Gender then do;
TotalSalary = 0; /* Initialize a variable for each group */
Count = 0;
end;
TotalSalary + Salary; /* Accumulate salary */
Count + 1; /* Count the number of records in each group */
if last.Gender then do;
AvgSalary = TotalSalary / Count; /* Calculate average salary */
output; /* Output a summary for each group */
end;
run;
Explanation
Example 3. 3. Nested DO Loops: Nested DO loops can be used for more complex iterative processes.
data matrix_example;
do row = 1 to 3; /* Outer loop for rows */
do col = 1 to 3; /* Inner loop for columns */
Value = row * col; /* Compute a value */
output; /* Output the result */
end;
end;
run;
Explanation
Example 4. Conditional DO Loops: You can use DO loops with conditional logic to execute code based on specific conditions.
data conditional_loops;
set employees;
do while (Salary < 50000); /* Loop while Salary is less than 50000 */
Salary = Salary * 1.05; /* Increase salary by 5% */
output;
end;
do until (Salary >= 50000); /* Loop until Salary reaches 50000 */
Salary = Salary * 1.05; /* Increase salary by 5% */
output;
end;
run;
Explanation
Example 5. DO-UNTIL Loop: The DO-UNTIL loop continues until a specified condition becomes true.
data until_loop;
set employees;
Salary = Salary * 1.02; /* Start with a slight increase */
do until (Salary > 60000); /* Continue until Salary exceeds 60000 */
Salary = Salary * 1.05; /* Increase salary by 5% */
output; /* Output each iteration */
end;
run;
Explanation
Arrays in SAS are powerful tools for handling repetitive tasks, such as table lookups, rotating datasets, and performing iterations with indexed variables. We give a detailed guide on using arrays for these purposes with practical examples.
Example 1. (Table Look-ups Using Arrays) Arrays can be used to simplify table look-ups by creating a mapping between values and their corresponding outputs. This is particularly useful when you have a predefined set of values that need to be translated or matched.
Assume you have a dataset with employee IDs and you want to map these IDs to department names based on a lookup table.
/* Define the lookup table */
data dept_lookup;
input DeptID $ DeptName $;
datalines;
1 HR
2 Finance
3 IT
;
run;
/* Define the employee dataset */
data employees;
input EmpID $ DeptID $;
datalines;
101 1
102 2
103 3
104 1
;
run;
Let’s use arrays to perform the look-up.
data employees_with_dept;
/* Define arrays for the lookup */
array dept_ids[3] _temporary_ (1 2 3);
array dept_names[3] $ _temporary_ ('HR' 'Finance' 'IT');
set employees;
length DeptName $ 10;
/* Perform the look-up using arrays */
do i = 1 to 3;
if DeptID = dept_ids[i] then DeptName = dept_names[i];
end;
drop i;
run;
/* Print the resulting dataset */
proc print data=employees_with_dept;
run;
Explanation of code
Example 2. (Rotating Datasets Using Arrays) Arrays can be used to rotate data, which means transforming columns into rows (wide to long) or vice versa. This is useful for reshaping data to meet different analytical needs.
Given a dataset with monthly sales data for each quarter, you want to rotate it to have sales data by month.
/* Define the original dataset */
data sales_quarterly;
input ID $ Q1 Q2 Q3 Q4;
datalines;
1 100 150 200 250
2 110 160 210 260
;
run;
We now rotate the dataset using arrays.
data sales_monthly;
set sales_quarterly;
array quarters[4] Q1-Q4; /*Q1-Q4 is the shortcut for Q1 Q2 Q3 Q4*/
array months[4] $ _temporary_ ('Jan' 'Apr' 'Jul' 'Oct');
do i = 1 to 4;
Month = months[i];
Sales = quarters[i];
output;
end;
drop i Q1-Q4;
run;
/* Print the rotated dataset */
proc print data=sales_monthly;
run;
Example 3. (Performing Repeated Iterations Using Indexed Variables) Arrays in SAS are also useful for performing repeated iterations, such as applying the same computation across multiple variables.
Suppose you have a dataset with scores from five different tests and you want to compute the average score for each student.
/* Define the dataset */
data student_scores;
input StudentID $ Test1-Test5;
datalines;
001 85 90 78 92 88
002 80 82 85 87 90
;
run;
We now calculate the average score for each of the 5 tests using arrays.
data student_averages;
set student_scores;
array tests[5] Test1-Test5;
total_score = 0;
/* Sum up the test scores */
do i = 1 to 5;
total_score + tests[i];
end;
/* Calculate average score */
average_score = total_score / 5;
drop i total_score;
run;
/* Print the results */
proc print data=student_averages;
run;
Subsetting SAS data involves creating a new SAS dataset that contains only a subset of the observations or variables from an existing dataset. This can be done using various methods within the DATA step, PROC SQL, and even by using SAS procedures designed for data manipulation.
Example 1. Using IF statements
DATA high_salary_employees;
SET employees;
IF Salary > 50000; /* Filters records to include only those with Salary greater than 50000 */
RUN;
Example 2. Using WHERE statements to filter Rows Based on Conditions
DATA young_employees;
SET employees;
WHERE Age < 30; /* Select employees younger than 30 */
RUN;
Explanation
Example 3. Selecting Specific Variables
data selected_vars;
set employees(keep=Name Salary HireDate); /* Keep only specified variables */
run;
In SAS, handling dates involves understanding SAS date values, date formats, and various date functions. SAS stores dates as the number of days since January 1, 1960, which makes it different from other systems that may use different reference dates or formats. Here’s a comprehensive guide to working with dates in SAS:
Example 1. Example of Formatting and Informat
data formatted_dates;
input date_var mmddyy10.;
format date_var date9.;
datalines;
08/15/2024
01/01/2024
12/31/2023
;
run;
SAS provides various functions to manipulate and calculate dates.
Common Date Functions: - TODAY(): Returns the current date. - DATEPART(datetime): Extracts the date part from a datetime value. - DAY(date): Extracts the day of the month from a date value. - MONTH(date): Extracts the month from a date value. - YEAR(date): Extracts the year from a date value.
In SAS, RETURN and SUM statements are part of the SAS programming language used for specific tasks in a DATA step.
Example 1.
data example;
do i = 1 to 5;
if i = 3 then return; /* Exit the DATA step when i equals 3 */
output;
end;
run;
Explanation:
In this example, only the observations where i is 1 and 2 are included in the dataset because the RETURN statement exits the DATA step when i is 3.
Example 2. Using the “a + b” type of statements
data numbers;
input value;
datalines;
2
5
9
14
;
run;
/*Calculate the cumulative sum of values in the SAS dataset "numbers"*/
data cumulative_sum;
set numbers; /* Assuming a dataset 'numbers' with a variable 'value' */
cumulative_total + value; /* Accumulate sum of 'value' */
run;
/*Print Data*/
proc print data=cumulative_sum;
title 'Employee Details'; /*ancillary statement*/
footnote 'Thank somebody!' /*another ancillary statement*/
run;
Explanation:
Example 3. Conditional Sum: Finding the sum for each group
data conditional_sum;
set numbers;
by group; /* Assuming 'group' is a variable used for grouping */
if first.group then cumulative_total = 0; /* Reset sum at the beginning of each group */
cumulative_total + value; /* Accumulate sum */
run;
Explanation:
In SAS, the OUTPUT statement is used within a DATA step to control which observations are written to a dataset. It allows you to specify which observations to write and can also be used to output observations to different datasets based on conditions.
Example 1. (Only one SAS dataset)
data example;
input name $ age height;
if age > 30 then output; /* Only output observations where age > 30 */
datalines;
John 25 175
Alice 35 160
Bob 40 180
;
run;
Note: In the code, output is the same as output example.
Example 2. ( Multiple SAS datasets)
data dataset1 dataset2;
input name $ age height;
if age < 30 then output dataset1; /* Write observations where age < 30 to dataset1 */
else output dataset2; /* Write observations where age >= 30 to dataset2 */
datalines;
John 25 175
Alice 35 160
Bob 40 180
;
run;
Each SAS procedure step
Example 1a. The PRINT procedure
PROC PRINT data=raw_data;
TITLE 'Employee Data';
run;
Explanation
Example 1b. The PRINT procedure with WHERE Conditions
proc print data=raw_data;
where Age > 40; /* Print only those with Age greater than 40 */
run;
Example 2. The MEANS procedure: produce the summary statistics like mean, median, standard deviation, and range for numeric variables.
PROC MEANS data=raw_data;
VAR Age Salary;
TITLE "Summary Statistics for Age and Salary";
run;
Explanation:
Example 3. The FREQ procedure: produce the summary statistics
PROC freq data=raw_data;
TABLES Gender;
TITLE "Frequency of Gender";
run;
Explanation
Example 4a. The SORT procedure: For sorting data.
proc sort data=raw_data out=sorted_raw_data;
by Salary; /* Sort data by Salary */
run;
Example 4b. The SORT procedure: For sorting data with Conditions.
data high_salary_sorted_raw_data;
set sorted_raw_data;
where Salary > 50000; /* Include only those with Salary greater than 50000 */
run;
FILENAME REFFILE '/home/u688923/my_shared_file_links/u688923/FruitFlies.csv';
PROC IMPORT DATAFILE=REFFILE DBMS=CSV OUT=WORK.IMPORT;
GETNAMES=YES;
RUN;
PROC CONTENTS DATA=WORK.IMPORT;
RUN;
Explaination
PROC SQL can also be used to subset data with SQL-like syntax.
proc sql;
create table young_females as
select Name, Age, Salary
from raw_data
where Age < 30 and Gender = 'F';
quit;
Explanation
Nested SQL statements, or subqueries, are SQL queries embedded within other SQL queries. They allow you to perform complex queries by using the results of one query as the input for another. In SAS, you can use nested SQL statements within PROC SQL to perform various types of data analysis and manipulation.
proc print data = sashelp.class;
run;
proc sql;
/* Find maximum height within each group and filter groups with overall maximum height */
select name, sex, max(height) as max_height
from sashelp.class
where height = (select max(height) from sashelp.class);
quit;
Note: The WHERE statement is Equivalent to using the HAVING statement HAVING height = max(height). This is because HAVING is applied at the group level while WHERE is at the observation level. You can’t do WHERE height = max(height). In the code where height = (select max(height) from sashelp.class);, the right side to the equality sign is already done.
What does the code do?
In SAS, Formats and Informats are used to control how data is read into and displayed from SAS datasets.
Determine how data values are displayed in SAS output. For example, you might use a format to display numbers with commas or dates in a specific format. Formats are used in the FORMAT statement in SAS. You apply formats either in a DATA step or when creating reports.
Example: Applying Formats in a DATA Step
DATA formatted_data;
SET raw_data;
FORMAT salary dollar10.2 date_of_birth DATE9.;
RUN;
Explanation
Example: Applying Formats in PROC PRINT
PROC PRINT DATA=formatted_data;
FORMAT salary dollar10.2 date_of_birth DATE9.;
RUN;
Determine how data values are read and interpreted from raw data. Informats are used in the INFORMAT statement to control how SAS reads raw data values.
Example: Applying Informats in a DATA Step
DATA raw_data;
INFILE datalines;
INPUT name $ age income : dollar8.2 date_of_birth : MMDDYY10. steps COMMA8.;
DATALINES;
John Doe 30 $1,234.56 08/15/1994 23,089
Jane Smith 25 $2,345.67 12/01/1998 4,621
;
RUN;
PROC PRINT DATA=raw_data;
RUN;
Explanation
You can use both formats and informats when processing and displaying data.
DATA formatted_data;
INFILE 'datafile.txt';
INPUT name $ age income : dollar8.2 date_of_birth : MMDDYY10.;
FORMAT income dollar10.2 date_of_birth DATE9.;
RUN;
Explanation
If you don’t want to read an observation satisfying certain conditions, use the single trailing at (@) symbol and the if-condition to skip the observation.
data filtered_data;
infile datalines;
input id $ name $ age @; /* Use @ to hold the line for subsequent INPUT statement */
if age < 26 then delete; /* Skip the observation if age is less than 26 */
input height; /*Continue to input*/
datalines;
001 John 25 70
002 Alice 35 65
003 Bob 22 68
004 Carol 28 72
;
run;
proc print data = filtered_data;
run;
In SAS, the MISSOVER option is used with the INFILE statement when the data on the current line is not sufficient (due to missing values) to fill all the variables in the INPUT statement. This option ensures that SAS does not skip lines when it encounters missing data but instead fills in the missing values with blanks or specified missing values.
data missoverexample;
infile datalines missover;
input id $ name $ age height;
datalines;
12345 John 25 177
67890 Alice 30 183
54321 Bob
89765 David 54
; /*record 3 has missing on age and height, record 4 has missing on height*/
run;
If a character value is longer than 8 in length, a LENGTH staement can be used to prevent truncation. The LENGTH statement must be placed before the relevant variable occurs.
DATA Length_example;
LENGTH Name $20 Address $30;
infile datalines delimiter=",";
input Name $ Age Address $;
Datalines;
David Moore, 80, 1234 6th Ave. S
Jessica Ottesen, 60, 12345 6th Ave. S
;
RUN;
proc print data = Length_example;
var Name Age Address; /* Specify the order of columns */
run;
Explanation:
Use the SET statement to read data from a SAS dataset, not from raw data. When filtering from a SAS dataset, use a WHERE statement.
data employees;
input Name $ Age Salary;
datalines;
John 28 50000
Jane 35 60000
Jim 45 55000
Jill 40 65000
;
run;
proc print data=employees;
run;
data high_salary;
set employees; /*The SET statement reads data from a SAS dataset*/
WHERE Salary > 60000;
run;
proc print data=high_salary;
run;
Note:
You can also use the IF keyword instead of WHERE. The difference is that “WHERE” filters data before it is read into the PDV, but “IF” filters data after it has been read into the PDV.
The WHERE statement can also work in a procedure such as PROC PRINT.
We will cover more PROC steps.
PROC FORMAT in SAS is a procedure used to define custom formats and informats for variables. Formats control how data is displayed, while informats control how data is read. This procedure is especially useful for formatting data in reports or converting codes into meaningful labels.
A video to watch: https://www.youtube.com/watch?v=IeUdnCsetqY (right click and open in new window)
You can create custom formats using PROC FORMAT. This allows you to define specific ways to display or categorize data.
Example. Define custom formats.
PROC FORMAT;
VALUE $genderfmt
'M' = 'Male'
'F' = 'Female';
VALUE agefmt
LOW-20 = 'Young'
21-40 = 'Adult'
41-HIGH = 'Senior';
RUN;
Explanation
Now, apply the defined custom formats.
DATA formatted_data;
SET raw_data;
FORMAT gender $genderfmt. age agefmt.;
RUN;
You can also use formats and informats directly in procedures such as PROC REPORT and PROC MEANS.
PROC REPORT DATA=formatted_data;
COLUMN name age income date_of_birth;
DEFINE age / FORMAT=agefmt.;
DEFINE income / FORMAT=dollar10.2;
DEFINE date_of_birth / FORMAT=DATE9.;
RUN;
More on the REPORT procedurelater.
PROC MEANS DATA=formatted_data;
VAR income;
FORMAT income dollar10.2;
RUN;
The TABULATE procedure in SAS is used to create detailed and flexible summary reports, such as frequency tables, cross-tabulations, and descriptive statistics.
Here is a video to watch: https://www.youtube.com/watch?v=K9V_tHrXgrA
Now, you can read my notes on PROC Tabulate.
/*Simple frequency table*/
proc tabulate data=sashelp.cars;
class type origin;
table type, origin;
run;
/*Summary Statistics*/
proc tabulate data=sashelp.cars;
class type origin;
table type, origin;
run;
/* Multi-Dimensional Table with Multiple Statistics*/
proc tabulate data=sashelp.cars;
class type origin;
var MSRP Horsepower;
table type*(origin),
(MSRP*mean MSRP*sum)
(Horsepower*mean Horsepower*sum);
run;
Explanation:
The REPORT procedure in SAS is used for creating customized reports that can present data in a tabular format. PROC REPORT provides advanced options for formatting, summarizing, and customizing the appearance of reports, making it a powerful tool for generating professional and user-friendly output.
/*Data*/
DATA nationalparks;
INPUT Park $ Type $ Region $ Museums Camping;
DATALINES;
Yosemite NP West 100 50
Yellowstone NP East 150 .
GrandCanyon NM . 200 70
Zion NP West 80 30
Glacier NP East 120 60
RockyMountain NP West 90 40
Olympic NM . 95 .
Shenandoah NM East 130 65
GreatSmoky NP West 100 55
Badlands NP East . 110
BryceCanyon NP West 110 75
Everglades NM West . 85
Denali NP East 140 85
MountRainier NP West 160 90
Sequoia NP . 105 45
CuyahogaValley NM West . 70
Haleakala NP . 80 60
;
RUN;
/* A simple report*/
PROC REPORT DATA=nationalparks nowd;
COLUMN Park Type Region Museums Camping;
RUN;
/*A custom report*/
PROC REPORT DATA=nationalparks MISSING;
COLUMN Park Type Region Museums Camping;
/* Custom column labels and settings */
DEFINE Region / ORDER; /* Sort by Region */
DEFINE Camping / ANALYSIS 'Campgrounds'; /* 'ANALYSIS' used for summary statistics*/
/* Adding a title to the report */
TITLE "National Parks Report";
RUN;
Explanation
We next create two summary reports:
/*summary report #1*/
PROC REPORT DATA=nationalparks;
COLUMN Region Type Museums Camping;
DEFINE Region / GROUP;
DEFINE Type / GROUP ;
/* Adding a title to the report */
TITLE "National Parks Summary Report with Two Group Variables";
RUN;
/*Summary report #2*/
PROC REPORT DATA=nationalparks;
COLUMN Region Type, (Museums Camping);
DEFINE Region / GROUP;
DEFINE Type / Across ;
/* Adding a title to the report */
TITLE "National Parks Summary Report with a Group and An Across Variable";
RUN;
We can also add summary breaks to the report:
PROC REPORT DATA=nationalparks MISSING;
COLUMN Region Type Museums Camping;
/* Custom column labels and settings */
DEFINE Region / ORDER; /* Sort by Region */
BREAK AFTER Region / SUMMARIZE;
BREAK AFTER / SUMMARIZE;
/* Adding a title to the report */
TITLE "National Parks Detailed Report with Summary Breaks";
RUN;
Next, we can add statistics to the report:
/*summary report #1*/
PROC REPORT DATA=nationalparks;
COLUMN Region Type N (Museums Camping), MEAN;
DEFINE Region / GROUP;
DEFINE Type / GROUP ;
TITLE "National Parks Statistics with Two Group Variables";
RUN;
/*Summary report #2*/
PROC REPORT DATA=nationalparks;
COLUMN Region N Type, (Museums Camping), MEAN;
DEFINE Region / GROUP;
DEFINE Type / Across ;
TITLE "National Parks Statistics with a Group and An Across Variable";
RUN;
Finally, we can add computed variables to the report:
/* Generate the report with computed variable */
PROC REPORT DATA=nationalparks MISSING;
COLUMN Park Region Museums Camping TotalFacilities Note;
/* Define column attributes */
DEFINE Museums / ANALYSIS SUM NOPRINT;
DEFINE Camping / ANALYSIS SUM NOPRINT;
DEFINE TotalFacilities / COMPUTED 'Total Facilities';
DEFINE Note / COMPUTED;
/* Compute TotalFacilities as the sum of Museums and Camping */
COMPUTE TotalFacilities / ;
TotalFacilities = Museums.SUM + Camping.SUM;
ENDCOMP;
COMPUTE Note / CHAR LENGTH = 10;
IF Museums.SUM = 0 THEN Note = 'No Museum';
IF Camping.SUM = 0 THEN Note = 'No Camping';
ENDCOMP;
/* Add a title to the report */
TITLE "National Parks Report with Computed Total Facilities";
RUN;
You can enhance your SAS output using ODS to create reports in various formats like HTML and PDF, and you can apply custom styles to make them visually appealing.
SAS output can be sent to the HTML or PDF destination.
Example 1: HTML output
/* Open the ODS HTML destination and specify the file path */
ODS HTML FILE='/home/u688923/my_shared_file_links/u688923/simple_output.html'
STYLE=htmlblue;
/*Remove procedure title*/
ODS noproctitle;
/* Create a dataset using datalines */
DATA squares;
INPUT Number Square;
DATALINES;
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
10 100
;
RUN;
/* Print the dataset to display the table */
PROC PRINT DATA=squares NOOBS;
TITLE 'Square Table for Numbers 1 to 10';
VAR Number Square; /* Display 'Number' and 'Square' variables */
RUN;
/* Close the ODS HTML destination */
ODS HTML CLOSE;
Explanation
Example 2: HTML output
/* Open the ODS PDF destination and specify the file path */
ODS PDF FILE='/home/u688923/my_shared_file_links/u688923/simple_output.pdf'
STYLE=journal
STARTPAGE=NO;
/* Create a dataset using datalines */
DATA squares;
INPUT Number Square;
DATALINES;
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
10 100
;
RUN;
/* Print the dataset to display the table */
PROC PRINT DATA=squares NOOBS;
TITLE 'Square Table for Numbers 1 to 10';
VAR Number Square; /* Display 'Number' and 'Square' variables */
RUN;
/* Close the ODS PDF destination */
ODS PDF CLOSE;
Explanation
Creating a styled report in SAS can enhance the presentation and readability of your output. You can achieve this using ODS (Output Delivery System) with custom styles, and by incorporating PROC REPORT or PROC PRINT. Below is an example that demonstrates how to create a styled report.
/* Create a simple dataset */
data employees;
input Name $ Age Salary;
datalines;
John 30 50000
Alice 28 55000
Bob 35 60000
;
run;
/* Generate a report with styled title and column */
PROC REPORT DATA=employees NOWD;
COLUMN Name Age Salary;
/* Define columns with optional styling */
DEFINE Salary / DISPLAY 'Annual Salary'
STYLE(COLUMN)=[BACKGROUND=lightgreen FONT_SIZE=12pt FONT_WEIGHT=bold COLOR=darkgreen];
RUN;
Explanation
In SAS, merging datasets involves combining observations from two or more datasets into a single dataset based on common variables. This operation is typically done using the MERGE statement in a DATA step. The merge process can be used to combine datasets in a variety of ways, including inner joins, left joins, right joins, and full joins.
SAS can perform different types of joins (inner, left, right, and full) using both the MERGE statement in the DATA step and PROC SQL. SAS can also perform inner join using the SET statement.
A video to watch: https://www.youtube.com/watch?v=fQNt7Pqb_0s
Example 1. INNER join: An inner join returns only the records with matching keys in both datasets.
/* Sample datasets */
data data1;
input ID Name $;
datalines;
1 John
2 Alice
3 Bob
;
run;
data data2;
input ID Age;
datalines;
1 30
2 25
4 40
;
run;
/* Inner join using DATA step */
data inner_join;
merge data1(in=a) data2(in=b);
by ID;
if a and b; /* Keep only records present in both datasets */
run;
Example 2. LEFT join: A left join returns all records from the left dataset and the matched records from the right dataset.
/* Left join using DATA step */
data left_join;
merge data1(in=a) data2(in=b);
by ID;
if a; /* Keep all records from data1 */
run;
Example 3. Right Join: A right join returns all records from the right dataset and the matched records from the left dataset.
/* Right join using DATA step */
data right_join;
merge data1(in=a) data2(in=b);
by ID;
if b; /* Keep all records from data2 */
run;
Example 4. Full Outer Join: A full outer join returns all records when there is a match in either dataset.
/* Full outer join using DATA step */
data full_outer_join;
merge data1(in=a) data2(in=b);
by ID;
/* Keep all records from both datasets */
run;
/*Inner join*/
proc sql;
create table inner_sql as
select a.ID, a.Name, b.Age
from data1 as a
inner join data2 as b
on a.ID = b.ID;
quit;
proc print data=inner_sql;
run;
/*Left join*/
proc sql;
create table left_sql as
select a.ID, a.Name, b.Age
from data1 as a
left join data2 as b
on a.ID = b.ID;
quit;
proc print data=left_sql;
run;
/*Right join*/
proc sql;
create table right_sql as
select a.ID, a.Name, b.Age
from data1 as a
right join data2 as b
on a.ID = b.ID;
quit;
proc print data=left_sql;
run;
/*Full join*/
proc sql;
create table full_sql as
select a.ID, a.Name, b.Age
from data1 as a
full join data2 as b
on a.ID = b.ID;
quit;
proc print data=full_sql;
run;
data sales;
input Product $ Quarter $ Sales;
datalines;
A Q1 100
A Q2 150
A Q3 200
B Q1 120
B Q2 180
B Q3 220
;
run;
proc transpose data=sales out=transposed_sales;
by Product; /*Each product will be an observation or row in the new SAS data*/
id Quarter; /* Different quarters become different variables or columns*/
var Sales; /* The sales become values under corresponding columns*/
run;
proc print data=transposed_sales;
run;
The BY statement may include a combination of a few variables.
SAS automatic variables are special variables that are created and maintained by SAS during data step processing. They provide useful information about the current state or environment of the data step. Here are the key automatic variables in SAS.
data example;
input x;
_n_ + 1; /* This adds 1 to _n_ on each iteration */
datalines;
10
20
30
;
run;
proc print data=example;
run;
data error_check;
input x;
if x < 0 then _ERROR_ = 1; /* Sets error flag if x is negative */
else _ERROR_ = 0;
datalines;
10
-5
20
;
run;
proc print data=error_check;
run;
data example;
set sashelp.class;
by sex;
if last.sex then last_observation = 1;
else last_observation = 0;
run;
proc print data=example;
run;
In SAS, macros are a powerful feature that allow you to automate repetitive tasks, create dynamic code, and improve the efficiency of your programming. Here’s an overview of the key concepts related to SAS macros:
/*Define the macro variable*/
%let var = age;
/*Usage: Use & to reference the macro variable. In the code below, &var will be replace by age everywhere*/
proc print data=example;
var &var;
run;
/*Create the macro program. The macro is called report taking data as the only parameter*/
%macro report(data=);
proc print data=&data;
run;
%mend report;
/*Execution: Call the macro by using its name and passing any required parameters.*/
%report(data=example);
Definition: Macro functions provide a way to perform operations on macro variables or to generate text strings dynamically.
/* The existing micro function "%sysfunc" allows you to call SAS functions within macro code.*/
/*Define a macro variable by calling the macro function "%sysfunc".*/
%let today = %sysfunc(today(), date9.);
Definition: Parameters are values passed to a macro when it is called. They allow you to create more flexible and reusable macros.
Definition in Macro: Define parameters within the macro definition.
%macro summary(data=, var=);
proc means data=&data;
var &var;
run;
%mend summary;
/*Passing Parameters: Provide parameter values when calling the macro.*/
%summary(data=example, var=age);
Definition: Macro statements are commands used to control macro processing. They include:
%LET for defining macro variables. %DO and %END for looping. %IF and %THEN for conditional processing.
We create a macro that prints a summary table for a dataset in SAS
%macro summary_table(dataset);
/* Create a dataset with variable names and types */
proc contents data=&dataset out=vars(keep=name type) noprint;
run;
/* Create a macro variable containing a list of variable names */
proc sql noprint;
select name into :varlist separated by ' '
from vars
where type = 1; /* Numeric variables only */
quit;
/* Generate a summary table for each variable */
proc means data=&dataset n mean std min max;
var &varlist;
run;
%mend;
/* Create an example dataset */
data example_data;
input id age height weight;
datalines;
1 25 175 70
2 30 180 80
3 22 160 60
4 28 170 75
;
run;
/* Use the macro to print a summary table */
%summary_table(example_data);
In SAS, the SELECT … INTO statement is used within PROC SQL to assign the result of a query to a macro variable. This feature is useful for extracting single values or aggregating results and using them later in your SAS program.
/*The following SQL code creates a macro variable "max_height"*/
proc sql;
select max(height)
into :max_height
from sashelp.class;
quit;
/* Using the macro variable */
%put Maximum height is &max_height; /*The result is in the LOG*/
Visualizing data in SAS can enhance your analysis by making patterns and insights easier to identify. Here’s a guide on various methods to visualize your data using SAS, focusing on PROC SGPLOT and PROC SGPANEL.
A video to watch: https://www.youtube.com/watch?v=EZAlVFQflTI
PROC SGPLOT is used for creating single-panel plots. Here’s a basic example that demonstrates how to create a scatter plot.
/* Generate a histogram of MPG */
proc sgplot data=sashelp.cars;
title "Histogram of Miles Per Gallon (MPG)"; /* Set the title of the histogram */
/* Create the histogram */
histogram MPG_city / binwidth=2 fillattrs=(color=lightblue) ;
/* Label for the x-axis */
xaxis label="Miles Per Gallon (MPG)";
/* Label for the y-axis */
yaxis label="Frequency";
run;
/* Create a dataset from the built-in SAS dataset 'cars' */
data work.cars;
set sashelp.cars; /* Load data from sashelp.cars into work.cars */
run;
/* Generate a scatter plot */
proc sgplot data=work.cars;
title "Scatter Plot of Horsepower vs. MPG"; /* Set the title of the plot */
/* Create a scatter plot with Horsepower on the x-axis and MPG on the y-axis */
scatter x=Horsepower y=MPG_city / markerattrs=(symbol=circlefilled color=blue);
/* Label for the x-axis */
xaxis label="Horsepower";
/* Label for the y-axis */
yaxis label="Miles Per Gallon (MPG)";
run;
The following example uses a dataset from the SASHELP library to demonstrate how to create a vertical bar plot showing the average MPG by car type.
/* Create a dataset from the built-in SAS dataset 'cars' */
data work.cars;
set sashelp.cars; /* Load data from sashelp.cars into work.cars */
run;
/* Calculate average MPG by Type */
proc sql;
create table work.avg_mpg as
select Type, mean(MPG_city) as Avg_MPG
from work.cars
group by Type;
quit;
/* Generate a vertical bar plot */
proc sgplot data=work.avg_mpg;
title "Average MPG by Car Type"; /* Set the title of the plot */
/* Create a vertical barplot. Use the hbar statement if horizontal barplot is wanted. */
vbar Type / response=Avg_MPG barwidth=0.5 fillattrs=(color=blue);
/* Label for the y-axis */
yaxis label="Average MPG";
/* Label for the x-axis */
xaxis label="Car Type";
run;
In the following example, we’ll create a box plot to visualize the distribution of MPG (miles per gallon) by car type.
/* Create a dataset from the built-in SAS dataset 'cars' */
data work.cars;
set sashelp.cars; /* Load data from sashelp.cars into work.cars */
run;
/* Generate a box plot */
proc sgplot data=work.cars;
title "Box Plot of MPG by Car Type"; /* Set the title of the plot */
/* Create a box plot for MPG by Type */
vbox MPG / category=Type fillattrs=(color=lightblue)
lineattrs=(color=blue);
/* Label for the y-axis */
yaxis label="Miles Per Gallon (MPG)";
/* Label for the x-axis */
xaxis label="Car Type";
run;
You can easily add a regression line to a scatter plot using PROC SGPLOT in SAS. Here’s how to do that with an example using the sashelp.cars dataset to visualize the relationship between Horsepower and MPG_City.
proc sgplot data=sashelp.cars;
/* Create a scatter plot */
scatter x=Horsepower y=MPG_City / markerattrs=(symbol=circlefilled color=blue);
/* Add a regression line */
reg x=Horsepower y=MPG_City / lineattrs=(color=red thickness=2);
/* Set titles and axis labels */
title "Scatter Plot of Horsepower vs. MPG City with Regression Line";
xaxis label="Horsepower";
yaxis label="Miles Per Gallon (City)";
run;
PROC SGPANEL is used for creating panel plots, allowing for multi-faceted visualizations based on one or more categorical variables. Here’s an example where we create a panel of scatter plots based on the type of car.
/* Generate a panel plot using the same 'cars' dataset */
proc sgpanel data=work.cars;
title "Panel of Horsepower vs. MPG by Type"; /* Set the title of the panel plot */
/* Create panels based on the Type variable, arranged in a row lattice layout */
panelby Type / layout=rowlattice onepanel;
/* Create a scatter plot for each panel */
scatter x=Horsepower y=MPG_city / markerattrs=(symbol=circlefilled);
/* Label for the y-axis of the panel */
rowaxis label="Miles Per Gallon (MPG)";
/* Label for the x-axis of the panel */
colaxis label="Horsepower";
run;
Another example:
proc sgpanel data=sashelp.class;
panelby sex; /* Create panels for each sex */
/* Create box plots for height by age */
vbox height / category=age; /* Use age as the categorical axis */
title "Box Plot of Height by Age and Sex";
rowaxis label="Height (inches)";
colaxis label="Age";
run;
A third example: Using two grouping variables (one on rows and one on columns)
proc sgpanel data=sashelp.cars;
/* Create panels by Type and Origin */
panelby Type Origin / layout=lattice
onepanel
sort=(ascformat descformat)
uniscale=column
novarname;
/* Create a scatter plot */
scatter x=Horsepower y=MPG_City;
/* Set titles and axis labels */
title "Scatter Plot of Horsepower vs. MPG City by Car Type and Origin";
rowaxis label="Miles Per Gallon (City)";
colaxis label="Horsepower";
run;
PROC UNIVARIATE provides detailed summary statistics and distribution analyses, including histograms and quantiles.
proc univariate data=sashelp.cars;
title "Univariate Analysis of MPG";
var MPG_city; /* Variable to analyze */
histogram MPG_city; /* Generate a histogram */
run;
PROC MEANS provides descriptive statistics such as mean, standard deviation, minimum, and maximum for numeric variables.
proc means data=sashelp.cars n mean std min max;
title "Descriptive Statistics for Cars Data";
var MPG_city Horsepower Weight; /* Variables to analyze */
run;
PROC FREQ is used to generate frequency tables for categorical variables.
proc freq data=sashelp.cars;
title "Frequency Distribution of Car Types";
tables Type; /* Categorical variable to analyze */
run;
PROC CORR computes the correlation coefficients between numeric variables.
proc corr data=sashelp.cars;
title "Correlation Matrix for Numeric Variables";
var MPG_city Horsepower Weight; /* Variables to analyze */
run;
PROC TTEST performs t-tests to compare means between two groups.
proc ttest data=sashelp.cars;
title "T-Test for MPG by Car Type";
class Type; /* Grouping variable */
var MPG_city; /* Variable to test */
run;
PROC ANOVA is used for analysis of variance to compare means across multiple groups.
proc anova data=sashelp.cars;
title "ANOVA for MPG by Car Type";
class Type; /* Grouping variable */
model MPG_city = Type; /* Dependent variable */
run;
In SAS, PROC REG is used for performing linear regression analysis. This procedure helps you understand the relationship between one dependent variable and one or more independent variables by estimating the coefficients of the regression equation.
Key Features of PROC REG:
Let’s create a simple linear regression model to predict MPG (miles per gallon) based on Horsepower and Weight from the sashelp.cars dataset.
/* Perform linear regression analysis */
proc reg data=sashelp.cars;
title "Linear Regression of MPG on Horsepower and Weight";
/* Specify the dependent and independent variables */
model MPG_city = Horsepower Weight;
/* Output diagnostics */
output out=reg_results p=predicted r=residual; /* Save predicted values and residuals */
run;
/* Print the results dataset */
proc print data=reg_results;
title "Predicted Values and Residuals";
run;