Chapter 1. Getting Started Using the SAS Programming Language

SAS stands for Statistical Analysis System. It is a comprehensive software suite for data management, advanced analytics, and statistical analysis. Here are some important highlights of the SAS programming language:

In summary,

Choosing a Method for Running SAS

Running SAS can be accomplished through various methods depending on your environment and requirements. Here’s a comprehensive overview of the different methods for executing SAS code:

  1. Interactive SAS Environment
  1. SAS Display Manager (SAS Windowing Environment)
  • Purpose: Provides a graphical user interface for running SAS code interactively.
  • It has
    • Program Editor: Write and edit SAS code.
    • Log Window: View execution logs, errors, warnings.
    • Output Window: Display results of procedures.
    • Explorer Window: Manage libraries and datasets.
    • Viewtable Window: View, edit, or create SAS datasets.
    • Results Window (Results Viewer): Show results from procedures.
  1. SAS Studio
  • Purpose: A web-based interface provided by SAS that allows you to write, run, and manage SAS code from a web browser.
  1. SAS Enterprise Guide
  • Purpose: A graphical user interface (GUI) provided by SAS for managing, analyzing, and reporting data. It is designed to simplify the process of working with SAS data and performing statistical analyses, especially for users who may not be familiar with SAS programming or prefer a more visual approach.
  1. Batch or Background Mode

With batch mode, you save your SAS program in a file. Then you submit the file for processing by SAS. You may start the execution at a time you specify.

  • Purpose: Allows for the execution of SAS code from the command line without requiring an interactive session.

Here are steps to Run a SAS Program in Batch Mode:

  • Write your SAS code in a text editor or within the SAS environment.
  • Save the SAS code in a file with a .sas extension (e.g., my_program.sas).
  • Use a command-line interface (CLI) to execute the SAS program.

Command-Line Execution Syntax (depending upon the operating system):

sas -sysin /home/user/my_program.sas -log /home/user/my_program.log -print /home/user/my_program.lst

Explanation

-sysin: Specifies the path to the SAS program file you want to run. -log: Specifies the path for the log file where SAS will write information about the execution process. -print: Specifies the path for the output file where SAS will write the results of the execution.

Using SAS System Options

In SAS, system options are parameters that control various aspects of the SAS environment and the behavior of SAS programs. These options can be set globally or locally to modify how SAS processes data, manages output, and handles different tasks.

System options can be set in different ways, including the use of the OPTIONS statement in your SAS code. The OPTIONS statement starts with the keyword OPTIONS and follows with a list of options and their values.

/* Set SAS system options */
options nodate nocenter linesize=80 pagesize=50 validvarname=v7;

Explanation

Sets system options to customize output formatting:

  • nodate: No date or time on output.
  • nocenter: Left-align output.
  • linesize=80: Limit line width to 80 characters.
  • pagesize=50: Set the number of lines per page to 50.
  • validvarname=v7: Variable names adhere to the rules of SAS Version 7. They must start with a letter or underscore, followed by letters, underscores, or digits. This is the most restrictive option, and special characters are not allowed. The option validvarname=v8: allows variable names to be up to 32 characters long and can include special characters (e.g., @, #, $). This option is more flexible than V7 but still retains some restrictions.

Other useful options include

  • “OBS=”: Limits the number of observations read from a dataset. Useful for debugging or when working with large datasets.
  • “FIRSTOBS=”: Specifies the first observation to read from a dataset. Useful for skipping initial rows in a large dataset.
  • “BUFNO=”: Sets the number of buffers to use when reading and writing datasets. More buffers can improve performance.

Chapter 2. Accessing Your Data

Before performing an analysis in SAS, you need to access your data and convert it into a SAS dataset. This conversion can be achieved through various methods, including using a DATA step with the DATA statement, utilizing the IMPORT procedure with a database engine, or applying SQL procedures with SQL queries. We will begin by discussing the DATA step method.

SAS code in each data step looks like

DATA SAS_data_name; /* This statement tells SAS to create a SAS data table*/
  code line2; /*Each line hereafter gives instructions in order*/
  code line3;
  .
  .
  .
RUN;

Each SAS data step

Two things to mention:

Read in-stream data lines

/*Create a SAS dataset*/
DATA raw_data;
    INFILE datalines DSD DLM=',' FIRSTOBS=1;
    INPUT Name $ Age Salary Date : MMDDYY10. Gender $;       /*Data Types must be specified*/
    FORMAT Salary DOLLAR10.2 Date MMDDYY10.;                 /*Output format*/
    DATALINES;
John Doe, 25, 35000.50, 12/31/2023, M
Jane Smith, 32, 42000.75, 11/15/2023, F
Tom Brown, 45, 50000.00,, M    
Alice Johnson, 29, 47000.90, 08/10/23, F
    ;       /* Use semicolon to indicate the End of data */
RUN;        /* End of data step */

/*Print the created data with the PRINT PROCedure. We introduce different procedure later.*/
PROC PRINT DATA=raw_data;
RUN;

I have used indentation in the above code. Indentation, though not necessary, makes the code clear.

Explanation of code

  • “infile datalines dsd dlm=‘,’”: dsd (short for delimiter-sensitive-data) ensures that consecutive delimiters are treated as missing values and properly handles quoted strings. dlm=‘,’ specifies that the delimiter is a comma. “datalines” indicates data are in-stream.
  • “input Name $ Age Salary Date : MMDDYY10. Gender $;”: Defines the variables in the dataset with their formats. Each $ sign indicates that the variable before it is a character variable. “: MMDDYY10.” specifies that the Date variable is in MM/DD/YYYY format. Because not every date in the data has the same length of 10, the modifier “:” is added to the informat “MMDDYY10.” In the INPUT statement in SAS, when no symbol (such as $, : or .) follows a variable name, the variable is numeric. When a colon (:) follows a variable name (e.g., Date : MMDDYY10.), it tells SAS to use an informat to read the data. Informats are used to specify how the data should be interpreted or formatted during the reading process.
  • “format Salary DOLLAR10.2 Date MMDDYY10.;”: Defines how numeric and date values should be displayed. “DOLLAR” specifies that the format is for displaying currency values, “10” is the total width of the formatted value, including the decimal point, commas, and currency symbol, and “.2” specifies the number of decimal places to display.
  • “DATALINES;”: Indicates the start of the inline data. You can also use CARDS. They are synonymous.

How does SAS work in the data step in general? Refer to the videos https://www.youtube.com/watch?v=gonmrtJopWo, https://www.youtube.com/watch?v=uPYA_7cc7UY. If you understand the PDV mechanism, you are already a SAS data step expert.

I strongly suggest you watch all of the videos from https://www.youtube.com/@SAS_ArthurLi.

Now, let’s look at how PDV works here.

  1. Initialization of the PDV:
  • Variable Initialization: Variables in the PDV are initialized to missing values (. for numeric and blank for character) before processing begins.
  1. Data Processing Loop:
  • Read First Observation:
    • Read John Doe 25 35000.50 12/31/2023 M
    • Assign values to Name, Age, Salary, Date, and Gender in the PDV.
  • Write Observation: After reading and assigning values, SAS writes this observation to the raw_data dataset.
  • Repeat for All Observations: SAS continues this process for Jane Smith, Tom Brown, and Alice Johnson.
  1. Completion:
  • Finalize Dataset: After processing all lines, SAS completes the dataset creation and applies the formats specified.

In summary,

  • Compilation Phase: SAS sets up the PDV and reads the metadata from INFILE, INPUT, and FORMAT statements.
  • Execution Phase: SAS reads the data lines, assigns values to variables in the PDV, applies formats, and writes observations to the dataset.
  • End of Step: The RUN; statement finalizes the data set.

SAS Data Libraries

SAS comes with some built-in libraries such as

  • WORK: the default library for temporary SAS data sets, if you don’t specify a libref in the SAS dataset you are creating. For example, the SAS data name “raw_data” is the same as “WORK.raw_data”. The latter is a two-level name and the first level (WORK) is called a libref, short for library reference. The WORK library is erased when exiting SAS.
  • SASHELP: a permanant library containing sample SAS data sets
  • SASUSER: a permanant library provided for you to save your SAS data sets.

You can use a libname statement to create your own permanent libraries for storing SAS data sets.

/*Create a library with a libref called mylib in the directory C:\Users\username\Documents*/
libname mylib 'C:\Users\username\Documents\SASData';

/*Using the Permanent Library*/
data mylib.new_dataset;
    input name $ age salary;
    datalines;
John 30 50000
Jane 25 60000
Tom 40 55000
;
run;

Explanation

  • “libname mylib ‘C:’;”: Assigns the libaray reference “mylib” to the directory - C:. You can now use mylib to refer to datasets in this directory. A libref must be 1 to 8 characters long and can include letters, numbers, and underscores, but must start with a letter or underscore.
  • “data mylib.new_dataset;”: Creates a new dataset named new_dataset in the mylib library.

If you are working on a network drive, you can also create a permanent library by pointing to a network path.

libname netlib '\\NetworkDrive\Shared\SASData';

Note: You can use direct referencing to create permannant SAS data set. To do so, you specify a path to the file:

/*Using direct referencing to create a permannant SAS dataset*/
data "/home/u688923/my_shared_file_links/u688923/try";
    x=5;
    y=7;
run;

Explanation

  • In the windows operating environment, you would use the path “88923_shared_file_links88923“.

Checking Information about SAS Data

We can list relevant information about a SAS dataset using the PROC CONTENTS procedure.

data employees;
    input name $ age salary;
    datalines;
John 30 50000
Jane 25 60000
Tom 40 55000
;

/*Print all relevant info about SAS dataset employees*/
PROC CONTENTS DATA=employees; 
RUN;

The content stores information (called descriptor) about the SAS dataset created. The descriptor includes the SAS engine, the date on which the SAS data was created, the number of observations and variables, along with many other pieces of information.

Importing Data

If your data is stored in an external CSV file “raw_data” in a path, you would still use the INFILE statement. Here’s how you handle a comma-separated file:

DATA raw_data;
    INFILE 'path/to/raw_data.csv' DSD DLM=',' FIRSTOBS=1; 
    INPUT Name $ FirstName $ LastName $ Age Salary Date : MMDDYY10. Gender $;
    FORMAT Salary DOLLAR10.2 Date MMDDYY10.;
RUN;

Explanation

  • INFILE: Specifies the external file to be read. In this case, ‘path/to/raw_data.csv’ is the path to the CSV file containing the data.
  • DSD: Stands for Delimiter-Sensitive Data. This option tells SAS to handle delimiters in a way that properly accounts for missing values and treats consecutive delimiters as missing values. It also allows for proper reading of values that might contain embedded delimiters.
  • DLM=‘,’: Specifies the delimiter used in the file. Here, the delimiter is a comma, which is typical for CSV files.
  • “FIRSTOBS=1”: Starts reading from the first line of the file (useful if your file has a header row). If your CSV file had a header spanning 3 rows,, you would skip the first 3 rows and FIRSTOBS would be set to 4.

In this example, the data are read using list input (or delimiter input). In fact, it’s comma-delimited input here.

If each of the variable’s values is always found in the same place in the line of data, then you can use the column input (fixed width). The advantage of column input is that it can read character data longer than 8 characters and numeric data in scientific notation. Watch this video https://www.youtube.com/watch?v=Fos7iN0jUxg (start at 11’13”) to learn column input.

data example;
  infile datalines;
  input id $ 1-5        /* ID from columns 1 to 5 */
        name $ 6-15      /* Name from columns 6 to 15 */
        age
        height;
  datalines;
12345John      30  180
67890Alice     25  165
;
run;

Note: Fields that contain embedded delimiters (e.g., commas, tabs) within the data itself can cause problems with list input. This is because list input assumes that fields are separated by delimiters and does not handle embedded delimiters well. So, don’t use list input for address (containing commas) or date and time values (containing / or - or :).

DATA scientific_data;
    INFILE datalines;
    INPUT ID $ ScientificValue BEST32.; /*the informat BEST32. reads numeric values in various formats, including scientific notation*/
    DATALINES;
001 1.23E+04
002 5.67E-03
003 9.01E+01
;
RUN;

More on informats later, but we list some commonly used informats:

  1. COMMAw.:

Purpose: Reads numeric values with commas as thousands separators. Usage: COMMA10. (Handles numbers like 1,234,567)

  1. DOLLARw.d:

Purpose: Reads numeric values with dollar signs and commas. Usage: DOLLAR8.2 (Reads values like $1,234.56).

  1. PERCENTw.d:

Purpose: Reads numeric values formatted as percentages. Usage: PERCENT6.2 (Reads 50.00%)

  1. MMDDYYw.:

Purpose: Reads dates in the MM/DD/YYYY format. Usage: MMDDYY10. (Reads 08/15/2024).

  1. DATEw.:

Purpose: Reads dates in a default date format. Usage: DATE9. (Reads 15AUG2024).

  1. TIMEw.d:

Purpose: Reads time in the HH:MM:SS format. Usage: TIME8. (Reads 12:34:56).

  1. $CHARw.:

Purpose: Reads character data without any special formatting. Usage: $CHAR20. (Reads up to 20 characters).

DATA Step That Does Not Read or Modify Existing Data

If you want to create a SAS dataset in a DATA step without reading from raw data or modifying existing SAS data, you can use direct variable assignments to generate the dataset.

data circle_area;
    r = 4;          /* radius */
    A = 3.1415926*r**2;  /* Calculate area of circle*/
run;

Explanation

  • a*b: the product of a and b
  • a+b: Sum of a and b
  • a-b: the difference between a and b
  • a/b: ratio of a to b

The SAS data set has only one observation and two variables (columns) r and A with values 4 and 50.26548.

Reading Data using Line Pointer or Column Pointer Controls

  • The #n character is used to advance to the first column of the nth record.
  • The forward slash (/) line pointer control can be used to advance to a new record before reading the next data value.
  • The **@** line pointer control is used at the end of an INPUT statement, it tells SAS to hold a record in the input buffer until a new line is encountered. (a few lines/records form one observation)
  • The @‘XYZ’ column pointer control can locate a series of characters ‘XYZ’ in the input record and move the pointer to the first column after the series.
  • The @n line pointer control moves the input pointer to column n of the current line, allowing you to read data starting from that column.
  • The +n line pointer control advances the input pointer by n columns from its current position.
  • **@@** (Double Trailing @): Holds the current line for multiple observations, allowing SAS to read multiple records from the same line until the end of the line is reached or a condition is met. (one line/record may have a few observations)
  • The ampersand (&) character is used in the INPUT statement to indicate that character values may include one or more embedded blanks. When this modifier is used, SAS will read the character value until it encounters two or more consecutive blanks, the variable’s defined length, or the end of the input line.
data example1;
    input name $ 1-10 @; /* Hold the line for further processing */
    input age 12-13; /* Read age from columns 12-13 */
    datalines;
John Doe     30
Alice Smith  25
;
run;

data example2;
    input name $ age @@; /* Read multiple records from the same line */
    datalines;
John 30 Alice 25 Bob 35
;
run;

Explanation:

  • The @ holds the current line for further reading. The second INPUT statement continues to read from the same line.
  • The @@ allows reading of multiple observations from the same line, so John 30, Alice 25, and Bob 35 are all read from a single line of data.

Column pointer controls are used to specify exact columns for reading fixed-width data and manage how SAS skips or moves through columns.

  • +n (Column Pointer Control): Moves the input pointer n columns to the right from its current position. Useful for skipping columns or starting to read data from a specific position.

  • #n (Column Pointer Control in Informats): In some contexts, #n is used to specify column positions directly in the context of reading data. However, #n is not a standard SAS syntax for column pointer control. Instead, “+n” is used for specifying column positions.

data example;
    input name $ 1-10 +2 age 12-13; /* Skips 2 columns after reading name */
    datalines;
John Doe     30
Alice Smith  25
;
run;

Reading Messy Raw Data

Let’s look at an example.

Each year engineering students from around the USA and Canada build concrete canoes and hold regional and national competitions. Part of the competition involves racing the canoes. The following data contain the final results of a men’s sprint competition. The data lines start with the name of the canoe, followed by the school, and the time.

Bellatorum  School  CSULA Time  1:40.5
The Kraken  School  ASU Time  1:45.35
Black Widow  School  UoA Time  1:33.7
Koicrete  School  CSUF Time  1:40.25
Khaos  School  UNLV Time  2:03.45
Max  School  UCSD Time  1:26.47
Hakuna Matata  School  UCLA Time  1:20.64
Prospector  School  CPSLOlIPP Time  1:12.08
Andromeda  School  CPP Time  1:25.1
Kekoapohaku  School  UHM Time  1:24.49

You can see that because the canoe names are not always the same number of characters, the school names do not line up in the same column. Also, the time values are sometimes six characters and sometimes seven.

DATA canoeresults;
 INPUT Name & $13. @'School' School $ @'Time' RaceTime :STIMER8.;  /*The informat ":STIMER8." tells SAS how to read time*/
 format RaceTime TIME11.2;    /*Output time as hh:mm:ss.xx*/
 datalines;
Bellatorum  School  CSULA Time  1:40.5
The Kraken  School  ASU Time  1:45.35
Black Widow  School  UoA Time  1:33.7
Koicrete  School  CSUF Time  1:40.25
Khaos  School  UNLV Time  2:03.45
Max  School  UCSD Time  1:26.47
Hakuna Matata  School  UCLA Time  1:20.64
Prospector  School  CPSLOIPP Time  1:12.08
Andromeda  School  CPP Time  1:25.1
Kekoapohaku  School  UHM Time  1:24.49
 ; 
RUN;

proc print data=canoeresults noobs;
   format RaceTime time11.2;    /*Rather than in data step, you can also place the output format here*/
run;

Explanation of code

  • Name & $13.: Reads Name as a character variable up to 13 characters. The & modifier tells SAS to read a name until two or more spaces are encountered even if the length of the canoe name is less than 13. This allows one space in a name, since some names have a space in between words.
  • @‘School’ School $: Position the line pointer at “School” to read the content that follows the word and assign the content to the School variable.
  • @‘Time’ RaceTime :STIMER8.: Line pointer will find “Time” and read RaceTime using the modified informat “:STIMER8.” Here, the modifier “:” allows time to have a length less than 8. In the SAS dataset, 1:40.5 will be output as 60 + 40.5 or 100.50. You can also use the more general informat :STIMER11.2, length being 11 with two decimal places.
  • “format RaceTime TIME8.;”: Applies the “TIME11.2” format to RaceTime for display. For example, 1:40.5 will be displayed as 1:41.

Other Ways of Reading Data

Refer to the shared files on SAS studio. To see those shared files, double click Files (Home). Files with “.sas” suffix are SAS files. Other files are data CSV, XLS, XLSX, or DAT files. Read the sas file to learn how to create SAS datasets.

Chapter 3. Working with SAS Data

We will introduce how to

Using Data Step to Modify SAS Data

Modifying data using a SAS DATA step is a common practice for data cleaning, transformation, and manipulation. Below, I’ll walk you through various types of data modifications you can perform using a DATA step in SAS.

Example 1. Modifying Existing Variables

data raw_data;
    set raw_data;

    /* Increasing salary by 10% */
    Salary = Salary * 1.10;

    /* Adjusting age if it's under 30 */
    if Age < 30 then Age = Age + 1;
run;

Example 2. Adding New Variables

data raw_data;
    set raw_data; /* Reads the existing dataset */

    /* Adding a new variable for annual salary */
    AnnualSalary = Salary * 12;

    /* Adding a new variable for age in months */
    AgeInMonths = Age * 12;
run;

Example 3. Filtering Data

data young_raw_data;
    set raw_data;
    /* Keep only raw_data who are under 30 */
    if Age < 30;
run;

Example 4. Renaming Variables

data raw_data_renamed;
    set raw_data;
    
    /* Renaming variables */
    rename Name = FullName 
           Date = HireDate;
run;

Example 5. Dropping and Keeping Variables

data raw_data_filtered;
    set raw_data;
    
    /* Drop unwanted variables */
    drop AgeInMonths;

    /* Keep only certain variables */
    keep Name Salary;
run;

Example 6. Reordering Variables: You can change the order of variables in the dataset.

data raw_data_reordered;
    set raw_data;
    /* Reorder variables */
    retain Name Salary Age HireDate Gender;
run;

Example 7. Handling Missing Values: You can address missing values by imputing or setting defaults.

data raw_data_cleaned;
    set raw_data;

    /* Replace missing Salary with the average salary */
    if missing(Salary) then Salary = 40000; /* Example default value */

    /* Replace missing Gender with 'Unknown' */
    if missing(Gender) then Gender = 'Unknown';
run;

In the above code,

  • “missing” is called a function in SAS. Functions in SAS can be used for a variety of purposes, including mathematical operations, character manipulations, and date handling. The function “missing” will check whether an observation has a missing value on the Salary variable: if yes, then the missing value is set to 40000. A similar explanation can be made for Gender.
  • A SAS function, when used, must be followed by a pair of parentheses with a list of parameters.
  • Examples of mathematical functions in SAS are log() for logarithmic transformation and sqrt() for square root.
  • Examples of character functions in SAS are upcase() for converting a string to uppercase and length() for the number of chars in a string.
  • Examples of date functions in SAS are year() for extracting the year from a date, month() for extracting the month, week() for getting the day of the week from a date (1=Sunday, 2=Monday, etc.), and today() for extracting today’s date value

Example 8. Combining Modifications: Here is a more complex example that combines several modifications.

data raw_data_modified;
    set raw_data;
    
    /* Adding a new variable for performance bonus */
    PerformanceBonus = Salary * 0.05;

    /* Modify salary based on age */
    if Age < 30 then Salary = Salary * 1.10;
    else if Age >= 50 then Salary = Salary * 1.05;

    /* Drop the AgeInMonths variable */
    drop AgeInMonths;

    /* Rename variables for clarity */
    rename Name = FullName
           Date = HireDate;
    
    /* Keep only relevant variables */
    keep FullName Salary PerformanceBonus HireDate Gender;
run;

SAS Conditional Functions

Conditional functions in SAS can be used to apply logic to data transformation.

DATA conditional_transformed;
    SET raw_data;

    /* Classify people into age groups */
    AgeGroup = 
        case 
            when Age < 30 then 'Young'
            when Age < 50 then 'Middle-aged'
            else 'Senior'
        end;

    /* Assign a performance rating based on Salary */
    PerformanceRating = 
        if Salary < 40000 then 'Needs Improvement'
        else if Salary < 60000 then 'Meets Expectations'
        else 'Exceeds Expectations';
RUN;

Note:

  • in the situation with only two cases, a simple if-then-else structure is enough; that is, there is no “else if” statement.
  • The case-end structure is equivalent to a “if-then-else if-then-else” structure.

SAS DO Statements

In SAS, the DO statements within a DATA step are used to execute a block of code multiple times, either based on conditions or iteratively over a range of values. They provide a way to perform repetitive tasks and control flow within the data step. Here’s a guide to the different types of DO statements and how they can be used:

Example 1. Simple DO Loop: A simple DO loop iterates a block of code a specified number of times.

data loop_example;
    do i = 1 to 5; /* Loop from 1 to 5 */
        output; /* Output a row for each iteration */
    end;
run;

Explanation

  • “do i = 1 to 5;”: Starts a loop that iterates from 1 through 5.
  • “output;”: Outputs a row to the dataset for each iteration.
  • “end;”: Ends the loop.

Example 2. DO Loop with BY Statement: You can use a DO loop with a BY statement to process data by groups.

data grouped_summary;
    set employees;
    by Gender;
    
    if first.Gender then do;
        TotalSalary = 0; /* Initialize a variable for each group */
        Count = 0;
    end;

    TotalSalary + Salary; /* Accumulate salary */
    Count + 1; /* Count the number of records in each group */
    
    if last.Gender then do;
        AvgSalary = TotalSalary / Count; /* Calculate average salary */
        output; /* Output a summary for each group */
    end;
run;

Explanation

  • “by Gender;”: Processes data by the Gender variable.
  • “if first.Gender then do;”: Initializes variables at the start of each group.
  • “TotalSalary + Salary;”: Sums salaries within each group.
  • “if last.Gender then do;”: Computes and outputs summary statistics at the end of each group.

Example 3. 3. Nested DO Loops: Nested DO loops can be used for more complex iterative processes.

data matrix_example;
    do row = 1 to 3; /* Outer loop for rows */
        do col = 1 to 3; /* Inner loop for columns */
            Value = row * col; /* Compute a value */
            output; /* Output the result */
        end;
    end;
run;

Explanation

  • “do row = 1 to 3;”: Outer loop for rows.
  • “do col = 1 to 3;”: Inner loop for columns.
  • Value = row * col;“: Computes a value based on the row and column indices.
  • “output;”: Outputs the result for each combination of row and column.

Example 4. Conditional DO Loops: You can use DO loops with conditional logic to execute code based on specific conditions.

data conditional_loops;
    set employees;
    
    do while (Salary < 50000); /* Loop while Salary is less than 50000 */
        Salary = Salary * 1.05; /* Increase salary by 5% */
        output;
    end;
    
    do until (Salary >= 50000); /* Loop until Salary reaches 50000 */
        Salary = Salary * 1.05; /* Increase salary by 5% */
        output;
    end;
run;

Explanation

  • “do while (Salary < 50000);”: Executes the loop while the condition is true.
  • “do until (Salary >= 50000);”: Executes the loop until the condition is true.
  • “Salary = Salary * 1.05;”: Increases salary by 5% in each iteration.
  • “output;”: Outputs the data after each modification.

Example 5. DO-UNTIL Loop: The DO-UNTIL loop continues until a specified condition becomes true.

data until_loop;
    set employees;
    Salary = Salary * 1.02; /* Start with a slight increase */
    do until (Salary > 60000); /* Continue until Salary exceeds 60000 */
        Salary = Salary * 1.05; /* Increase salary by 5% */
        output; /* Output each iteration */
    end;
run;

Explanation

  • “do until (Salary > 60000);”: Continues looping until Salary exceeds 60,000.
  • “Salary = Salary * 1.05;”: Applies a 5% increase to the salary.
  • “output;”: Outputs the updated record.

Simplifying SAS Programs with Array

Arrays in SAS are powerful tools for handling repetitive tasks, such as table lookups, rotating datasets, and performing iterations with indexed variables. We give a detailed guide on using arrays for these purposes with practical examples.

Example 1. (Table Look-ups Using Arrays) Arrays can be used to simplify table look-ups by creating a mapping between values and their corresponding outputs. This is particularly useful when you have a predefined set of values that need to be translated or matched.

Assume you have a dataset with employee IDs and you want to map these IDs to department names based on a lookup table.

/* Define the lookup table */
data dept_lookup;
  input DeptID $ DeptName $;
  datalines;
1 HR
2 Finance
3 IT
;
run;

/* Define the employee dataset */
data employees;
  input EmpID $ DeptID $;
  datalines;
101 1
102 2
103 3
104 1
;
run;

Let’s use arrays to perform the look-up.

data employees_with_dept;
  /* Define arrays for the lookup */
  array dept_ids[3] _temporary_ (1 2 3);
  array dept_names[3] $ _temporary_ ('HR' 'Finance' 'IT');
  
  set employees;
  length DeptName $ 10;

  /* Perform the look-up using arrays */
  do i = 1 to 3;
    if DeptID = dept_ids[i] then DeptName = dept_names[i];
  end;

  drop i;
run;

/* Print the resulting dataset */
proc print data=employees_with_dept;
run;

Explanation of code

  • array dept_ids[3] temporary (1 2 3);: Defines a temporary array dept_ids with three elements: 1, 2, and 3. These represent the department IDs from the lookup table.
  • array dept_names[3] $ temporary (‘HR’ ‘Finance’ ‘IT’);: Defines a temporary array dept_names with three elements: ‘HR’, ‘Finance’, and ‘IT’. These represent the department names corresponding to the department IDs.
  • length DeptName $ 10;: Specifies that the variable DeptName should have a length of 10 characters.
  • if DeptID = dept_ids[i] then DeptName = dept_names[i];: Checks if the DeptID from the employees dataset matches an element in the dept_ids array. If it does, assigns the corresponding department name from the dept_names array to DeptName.
  • drop i;: Removes the loop index variable i from the final dataset.

Example 2. (Rotating Datasets Using Arrays) Arrays can be used to rotate data, which means transforming columns into rows (wide to long) or vice versa. This is useful for reshaping data to meet different analytical needs.

Given a dataset with monthly sales data for each quarter, you want to rotate it to have sales data by month.

/* Define the original dataset */
data sales_quarterly;
  input ID $ Q1 Q2 Q3 Q4;
  datalines;
1 100 150 200 250
2 110 160 210 260
;
run;

We now rotate the dataset using arrays.

data sales_monthly;
  set sales_quarterly;
  array quarters[4] Q1-Q4;    /*Q1-Q4 is the shortcut for Q1 Q2 Q3 Q4*/ 
  array months[4] $ _temporary_ ('Jan' 'Apr' 'Jul' 'Oct');
  
  do i = 1 to 4;
    Month = months[i];
    Sales = quarters[i];
    output;
  end;

  drop i Q1-Q4;
run;

/* Print the rotated dataset */
proc print data=sales_monthly;
run;

Example 3. (Performing Repeated Iterations Using Indexed Variables) Arrays in SAS are also useful for performing repeated iterations, such as applying the same computation across multiple variables.

Suppose you have a dataset with scores from five different tests and you want to compute the average score for each student.

/* Define the dataset */
data student_scores;
  input StudentID $ Test1-Test5;
  datalines;
001 85 90 78 92 88
002 80 82 85 87 90
;
run;

We now calculate the average score for each of the 5 tests using arrays.

data student_averages;
  set student_scores;
  array tests[5] Test1-Test5;
  total_score = 0;
  
  /* Sum up the test scores */
  do i = 1 to 5;
    total_score + tests[i];
  end;

  /* Calculate average score */
  average_score = total_score / 5;
  
  drop i total_score;
run;

/* Print the results */
proc print data=student_averages;
run;

Subsetting Data in SAS Data Steps

Subsetting SAS data involves creating a new SAS dataset that contains only a subset of the observations or variables from an existing dataset. This can be done using various methods within the DATA step, PROC SQL, and even by using SAS procedures designed for data manipulation.

Example 1. Using IF statements

DATA high_salary_employees;
    SET employees;
    IF Salary > 50000; /* Filters records to include only those with Salary greater than 50000 */
RUN;

Example 2. Using WHERE statements to filter Rows Based on Conditions

DATA young_employees;
    SET employees;
    WHERE Age < 30; /* Select employees younger than 30 */
RUN;

Explanation

  • “set employees;”: Reads the existing dataset.
  • “where Age < 30;”: Includes only those records where Age is less than 30.

Example 3. Selecting Specific Variables

data selected_vars;
    set employees(keep=Name Salary HireDate); /* Keep only specified variables */
run;

Working with SAS Dates

In SAS, handling dates involves understanding SAS date values, date formats, and various date functions. SAS stores dates as the number of days since January 1, 1960, which makes it different from other systems that may use different reference dates or formats. Here’s a comprehensive guide to working with dates in SAS:

  1. Understanding SAS Date Values SAS dates are numeric values representing the number of days since January 1, 1960. For example:
  • January 1, 1960, is day 0.
  • January 2, 1960, is day 1.
  • December 31, 1960, is day 365.
  1. Date Formats and Informat
  • Date Format: Used to display SAS date values in a readable format. Examples: DATE9., MMDDYY10., DDMMYY10.
  • Date Informat: Used to read date values into SAS. Examples: DATE9., MMDDYY10., DDMMYY10.

Example 1. Example of Formatting and Informat

data formatted_dates;
    input date_var mmddyy10.;
    format date_var date9.;
    datalines;
08/15/2024
01/01/2024
12/31/2023
;
run;
  1. Date Functions

SAS provides various functions to manipulate and calculate dates.

Common Date Functions: - TODAY(): Returns the current date. - DATEPART(datetime): Extracts the date part from a datetime value. - DAY(date): Extracts the day of the month from a date value. - MONTH(date): Extracts the month from a date value. - YEAR(date): Extracts the year from a date value.

RETURN and SUM statements

In SAS, RETURN and SUM statements are part of the SAS programming language used for specific tasks in a DATA step.

  1. RETURN Statement
  • The RETURN statement is used within a DATA step to exit the current DATA step and return control to the SAS system.
  • It is commonly used in DATA step iterations when processing data with DO loops, especially in conjunction with the CALL RETURN routine to manage complex iterative data processing.

Example 1.

data example;
    do i = 1 to 5;
        if i = 3 then return; /* Exit the DATA step when i equals 3 */
        output;
    end;
run;

Explanation:

  • “do i = 1 to 5;”: Iterates from 1 to 5.
  • “if i = 3 then return;”: Exits the DATA step when i is 3.
  • “output;”: Writes the current observation to the dataset.

In this example, only the observations where i is 1 and 2 are included in the dataset because the RETURN statement exits the DATA step when i is 3.

  1. SUM Statement
  • The SUM statement in SAS is used to perform cumulative or running total calculations.
  • It is a special kind of assignment that automatically initializes the variable to zero the first time it is used.
  • This is particularly useful for accumulating values across rows in a DATA step.

Example 2. Using the “a + b” type of statements

data numbers;
    input value;
    datalines;
2
5
9
14
    ;
run;

/*Calculate the cumulative sum of values in the SAS dataset "numbers"*/
data cumulative_sum;
    set numbers; /* Assuming a dataset 'numbers' with a variable 'value' */

    cumulative_total + value; /* Accumulate sum of 'value' */
run;

/*Print Data*/
proc print data=cumulative_sum; 
  title 'Employee Details';  /*ancillary statement*/
  footnote 'Thank somebody!' /*another ancillary statement*/
run;

Explanation:

  • “set numbers;”: Reads data from an existing dataset named numbers.
  • “cumulative_total + value;”: Calculates the running total of the value variable and stores it in cumulative_total.
  • In SAS, ancillary statements are additional statements used within procedures or DATA steps to control various aspects of processing, output, and behavior.

Example 3. Conditional Sum: Finding the sum for each group

data conditional_sum;
    set numbers;
    by group; /* Assuming 'group' is a variable used for grouping */

    if first.group then cumulative_total = 0; /* Reset sum at the beginning of each group */
    cumulative_total + value; /* Accumulate sum */
run;

Explanation:

  • “by group;”: Processes data by group.
  • “if first.group then cumulative_total = 0;”: Resets the cumulative_total at the beginning of each group.
  • “cumulative_total + value;”: Accumulates the sum of value for each group.

Writing Multiple Datasets Using OUTPUT statements

In SAS, the OUTPUT statement is used within a DATA step to control which observations are written to a dataset. It allows you to specify which observations to write and can also be used to output observations to different datasets based on conditions.

Example 1. (Only one SAS dataset)

data example;
    input name $ age height;
    if age > 30 then output; /* Only output observations where age > 30 */
    datalines;
John 25 175
Alice 35 160
Bob 40 180
;
run;

Note: In the code, output is the same as output example.

Example 2. ( Multiple SAS datasets)

data dataset1 dataset2;
    input name $ age height;
    if age < 30 then output dataset1; /* Write observations where age < 30 to dataset1 */
    else output dataset2; /* Write observations where age >= 30 to dataset2 */
    datalines;
John 25 175
Alice 35 160
Bob 40 180
;
run;

Working with Data: SAS PROC Steps

Each SAS procedure step

  • Begin with a PROC statement
  • Perform specific analysis or task
  • Produce results or report

Example 1a. The PRINT procedure

PROC PRINT data=raw_data;
    TITLE 'Employee Data';
run;

Explanation

  • “PROC PRINT data=raw_data;”: This tells SAS to use the PRINT procedure on the dataset named raw_data.
  • “TITLE ‘Employee Data’;”: Adds a title to the output for better readability.
  • “run;”: Executes the procedure.

Example 1b. The PRINT procedure with WHERE Conditions

proc print data=raw_data;
    where Age > 40; /* Print only those with Age greater than 40 */
run;

Example 2. The MEANS procedure: produce the summary statistics like mean, median, standard deviation, and range for numeric variables.

PROC MEANS data=raw_data;
    VAR Age Salary;
    TITLE "Summary Statistics for Age and Salary";
run;

Explanation:

  • “PROC MEANS data=raw_data;”: Calls the MEANS procedure to calculate summary statistics for the dataset raw_data.
  • “VAR Age Salary;”: Specifies which variables you want statistics for. Here, it’s Age and Salary.

Example 3. The FREQ procedure: produce the summary statistics

PROC freq data=raw_data;
    TABLES Gender;
    TITLE "Frequency of Gender";
run;

Explanation

  • “TABLES Gender;”: Specifies which variable to create a frequency table for. Here, it’s Gender.

Example 4a. The SORT procedure: For sorting data.

proc sort data=raw_data out=sorted_raw_data;
    by Salary; /* Sort data by Salary */
run;

Example 4b. The SORT procedure: For sorting data with Conditions.

data high_salary_sorted_raw_data;
    set sorted_raw_data;
    where Salary > 50000; /* Include only those with Salary greater than 50000 */
run;

The PROC IMPORT

FILENAME REFFILE '/home/u688923/my_shared_file_links/u688923/FruitFlies.csv';

PROC IMPORT DATAFILE=REFFILE DBMS=CSV OUT=WORK.IMPORT; 
  GETNAMES=YES;
RUN;

PROC CONTENTS DATA=WORK.IMPORT; 
RUN;

Explaination

  • FILENAME REFFILE: The FILENAME statement in SAS is used to assign a fileref (file reference) to an external file, such as a data file or program file, allowing SAS to reference the file without needing to specify the full file path each time. This statement is essential for file handling operations, including reading from and writing to external files.
  • PROC IMPORT: This procedure reads data from an external file and creates a SAS dataset from it.
  • DATAFILE=REFFILE: Specifies the source of the data file using the fileref REFFILE, which was defined by the FILENAME statement.
  • DBMS=CSV: Indicates that the file format is CSV (Comma-Separated Values). For Excel file with suffix “.XLS”, use DBMS=XLS. For Excel file with suffix “.XLSX”, use DBMS=XLSX.
  • OUT=WORK.IMPORT: Specifies the name of the output SAS dataset to be created. WORK.IMPORT means the dataset IMPORT will be stored in the temporary WORK library.
  • GETNAMES=YES: This optional statement tells SAS to use the first row of the CSV file as variable names for the dataset. Other optional statements such as MIXED=YES (default) or MIXED=NO tells SAS to automatically detect and convert data types or not. If a column contains both numeric and character values, SAS might convert the entire column to a character variable, which could affect data analysis and processing. Setting MIXED=NO ensures that SAS does not automatically convert mixed data types in a column to character variables. The optional statement GUESSINGROWS=50: Directs SAS to scan the first 50 rows to determine the data types for each column, which can be useful if the default 20 rows are insufficient for accurate type inference.

The PROC SQL

PROC SQL can also be used to subset data with SQL-like syntax.

proc sql;
    create table young_females as
    select Name, Age, Salary
    from raw_data
    where Age < 30 and Gender = 'F';
quit;

Explanation

  • “create table young_females as”: Creates a new table with the filtered data.
  • “select Name, Age, Salary”: Specifies the columns to include in the new dataset.
  • “from employees”: Indicates that the data are from SAS data called “raw_data”.
  • “where Age < 30 and Gender = ‘F’;”: Filters rows based on the specified conditions.

Nested SQL Statements

Nested SQL statements, or subqueries, are SQL queries embedded within other SQL queries. They allow you to perform complex queries by using the results of one query as the input for another. In SAS, you can use nested SQL statements within PROC SQL to perform various types of data analysis and manipulation.

proc print data = sashelp.class; 
run;

proc sql;
   /* Find maximum height within each group and filter groups with overall maximum height */
   select name, sex, max(height) as max_height
   from sashelp.class
   where height = (select max(height) from sashelp.class); 
quit;

Note: The WHERE statement is Equivalent to using the HAVING statement HAVING height = max(height). This is because HAVING is applied at the group level while WHERE is at the observation level. You can’t do WHERE height = max(height). In the code where height = (select max(height) from sashelp.class);, the right side to the equality sign is already done.

What does the code do?

SAS Formats and Informats

In SAS, Formats and Informats are used to control how data is read into and displayed from SAS datasets.

Formats

Determine how data values are displayed in SAS output. For example, you might use a format to display numbers with commas or dates in a specific format. Formats are used in the FORMAT statement in SAS. You apply formats either in a DATA step or when creating reports.

Example: Applying Formats in a DATA Step

DATA formatted_data;
    SET raw_data;
    FORMAT salary dollar10.2 date_of_birth DATE9.;
RUN;

Explanation

  • “salary dollar10.2”: Formats “salary” as currency with a dollar sign, 10 total characters wide, including 2 decimal places.
  • “date_of_birth DATE9.”: Formats “date_of_birth” as a date in the format DDMMMYYYY (e.g., 01JAN2024).

Example: Applying Formats in PROC PRINT

PROC PRINT DATA=formatted_data;
    FORMAT salary dollar10.2 date_of_birth DATE9.;
RUN;

Informats

Determine how data values are read and interpreted from raw data. Informats are used in the INFORMAT statement to control how SAS reads raw data values.

Example: Applying Informats in a DATA Step

DATA raw_data;
    INFILE datalines;
    INPUT name $ age income : dollar8.2 date_of_birth : MMDDYY10. steps COMMA8.;
    DATALINES;
John Doe 30 $1,234.56 08/15/1994 23,089
Jane Smith 25 $2,345.67 12/01/1998 4,621
;
RUN;

PROC PRINT DATA=raw_data;
RUN;

Explanation

  • income : dollar8.2: Reads the “income” as a dollar amount, with up to 8 characters and 2 decimal places.
  • date_of_birth : MMDDYY10.: Reads “date_of_birth” in MMDDYYYY format (e.g., 12/31/2024).
  • steps COMMA8.: Reads steps as a numeric variable with commas as thousands separators.

Using Formats and Informats Together

You can use both formats and informats when processing and displaying data.

DATA formatted_data;
    INFILE 'datafile.txt';
    INPUT name $ age income : dollar8.2 date_of_birth : MMDDYY10.;
    FORMAT income dollar10.2 date_of_birth DATE9.;
RUN;

Explanation

  • “income : dollar8.2”: Reads “income” in a dollar format.
  • “FORMAT income dollar10.2”: Displays “income” with a dollar sign and two decimal places.

Summary

  • Formats control how data is displayed in SAS outputs.
  • Informats control how data is read and interpreted from raw data.
  • Use the FORMAT statement in a DATA step or procedures to apply formats.
  • Use the INFORMAT statement to specify informats when reading raw data.
  • By correctly applying formats and informats, you can ensure your data is both accurately read and clearly displayed in SAS.

Avoiding Reading Some Observations of Raw Data

If you don’t want to read an observation satisfying certain conditions, use the single trailing at (@) symbol and the if-condition to skip the observation.

data filtered_data;
  infile datalines;
  input id $ name $ age @; /* Use @ to hold the line for subsequent INPUT statement */
  if age < 26 then delete; /* Skip the observation if age is less than 26 */
  input height; /*Continue to input*/
  datalines;
001 John 25 70
002 Alice 35 65
003 Bob 22 68
004 Carol 28 72
;
run;

proc print data = filtered_data;
run;

Reading Observations with Insufficient Data

In SAS, the MISSOVER option is used with the INFILE statement when the data on the current line is not sufficient (due to missing values) to fill all the variables in the INPUT statement. This option ensures that SAS does not skip lines when it encounters missing data but instead fills in the missing values with blanks or specified missing values.

data missoverexample;
  infile datalines missover;
  input id $ name $ age height;
  datalines;
12345 John   25 177
67890 Alice  30 183
54321 Bob 
89765 David  54
; /*record 3 has missing on age and height, record 4 has missing on height*/

run;

Avoiding Truncation When Reading Data

If a character value is longer than 8 in length, a LENGTH staement can be used to prevent truncation. The LENGTH statement must be placed before the relevant variable occurs.

DATA Length_example;
  LENGTH Name $20 Address $30;
  infile datalines delimiter=",";
  input Name $ Age Address $;
  Datalines;
David Moore, 80, 1234 6th Ave. S
Jessica Ottesen, 60, 12345 6th Ave. S
  ;
RUN;

proc print data = Length_example;
  var Name Age Address; /* Specify the order of columns */
run;

Explanation:

  • “length Name $20 Address $30;”: Sets the length of Name to 20 characters and Address to 30 characters. This ensures that the full data for these fields will be read without truncation.
  • In SAS, when you use PROC PRINT to display a dataset, the order of columns might not always match the order specified in the DATA step. To ensure that the columns appear in the order you want when printing, you can use the VAR statement in PROC PRINT to explicitly specify the order of variables.

Using the WHERE statement to filter observations from Existing SAS Data

Use the SET statement to read data from a SAS dataset, not from raw data. When filtering from a SAS dataset, use a WHERE statement.

data employees;
  input Name $ Age Salary;
  datalines;
John 28 50000
Jane 35 60000
Jim 45 55000
Jill 40 65000
;
run;

proc print data=employees;
run;

data high_salary;
  set employees; /*The SET statement reads data from a SAS dataset*/
  WHERE Salary > 60000;
run;

proc print data=high_salary;
run;

Note:

You can also use the IF keyword instead of WHERE. The difference is that “WHERE” filters data before it is read into the PDV, but “IF” filters data after it has been read into the PDV.

The WHERE statement can also work in a procedure such as PROC PRINT.

Chapter 4. Sorting, Printing, and Summarizing Your Data

We will cover more PROC steps.

PROC FORMAT

PROC FORMAT in SAS is a procedure used to define custom formats and informats for variables. Formats control how data is displayed, while informats control how data is read. This procedure is especially useful for formatting data in reports or converting codes into meaningful labels.

A video to watch: https://www.youtube.com/watch?v=IeUdnCsetqY (right click and open in new window)

Custom Formats

You can create custom formats using PROC FORMAT. This allows you to define specific ways to display or categorize data.

Example. Define custom formats.

PROC FORMAT;
    VALUE $genderfmt
        'M' = 'Male'
        'F' = 'Female';
    VALUE agefmt
        LOW-20 = 'Young'
        21-40 = 'Adult'
        41-HIGH = 'Senior';
RUN;

Explanation

  • “$genderfmt”: Defines a format for the “gender” variable.
  • “agefmt”: Defines a format for the “age” variable.

Now, apply the defined custom formats.

DATA formatted_data;
    SET raw_data;
    FORMAT gender $genderfmt. age agefmt.;
RUN;

Using Formats and Informats in Procedures

You can also use formats and informats directly in procedures such as PROC REPORT and PROC MEANS.

PROC REPORT DATA=formatted_data;
    COLUMN name age income date_of_birth;
    DEFINE age / FORMAT=agefmt.;
    DEFINE income / FORMAT=dollar10.2;
    DEFINE date_of_birth / FORMAT=DATE9.;
RUN;

More on the REPORT procedurelater.

PROC MEANS DATA=formatted_data;
    VAR income;
    FORMAT income dollar10.2;
RUN;

PROC TABULATE

The TABULATE procedure in SAS is used to create detailed and flexible summary reports, such as frequency tables, cross-tabulations, and descriptive statistics.

Here is a video to watch: https://www.youtube.com/watch?v=K9V_tHrXgrA

Now, you can read my notes on PROC Tabulate.

/*Simple frequency table*/
proc tabulate data=sashelp.cars;
    class type origin;
    table type, origin;
run;

/*Summary Statistics*/
proc tabulate data=sashelp.cars;
    class type origin;
    table type, origin;
run;
  
/* Multi-Dimensional Table with Multiple Statistics*/
proc tabulate data=sashelp.cars;
    class type origin;
    var MSRP Horsepower;
    table type*(origin),
          (MSRP*mean MSRP*sum) 
          (Horsepower*mean Horsepower*sum);
run;

Explanation:

  • “class type origin;”: Specifies the classification variables type and origin.
  • “table type, origin;”: Creates a simple frequency table with type as rows and origin as columns.
  • “var MSRP;”: Specifies MSRP as the analysis variable.
  • “table type, origin(MSRPmean MSRP*sum);“: Creates a table with type as rows and origin as columns, showing the mean and sum of MSRP.
  • “table type*(origin),“: Creates a multi-dimensional table with type as rows and origin as columns.
  • “(MSRPmean MSRPsum)”: Provides the mean and sum of MSRP.
  • “(Horsepowermean Horsepowersum)”: Provides the mean and sum of Horsepower.

PROC REPORT

The REPORT procedure in SAS is used for creating customized reports that can present data in a tabular format. PROC REPORT provides advanced options for formatting, summarizing, and customizing the appearance of reports, making it a powerful tool for generating professional and user-friendly output.

/*Data*/
DATA nationalparks;
    INPUT Park $ Type $ Region $ Museums Camping;
    DATALINES;
Yosemite NP West 100 50
Yellowstone NP East 150 .
GrandCanyon NM . 200 70
Zion NP West  80 30
Glacier NP East 120 60
RockyMountain NP West  90 40
Olympic NM . 95 .
Shenandoah NM East 130 65
GreatSmoky NP West 100 55
Badlands NP East . 110 
BryceCanyon NP West 110 75
Everglades NM West  . 85
Denali NP East 140 85
MountRainier NP West 160 90
Sequoia NP . 105 45
CuyahogaValley NM West . 70 
Haleakala NP . 80 60
;
RUN;
  
  
/* A simple report*/
PROC REPORT DATA=nationalparks nowd;
    COLUMN Park Type Region Museums Camping;
RUN;
  
/*A custom report*/
PROC REPORT DATA=nationalparks MISSING;
    COLUMN Park Type Region Museums Camping;
    /* Custom column labels and settings */
    DEFINE Region / ORDER; /* Sort by Region */
    DEFINE Camping / ANALYSIS 'Campgrounds'; /* 'ANALYSIS' used for summary statistics*/
    
    /* Adding a title to the report */
    TITLE "National Parks Report";
RUN;

Explanation

  • MISSING: This option includes observations where any of the variables have missing values in the report. By default, PROC REPORT omits observations with missing values. This option ensures that even those observations with missing values are displayed.
  • COLUMN Park Type Region Museums Camping;: Specifies the columns to be included in the report: Park, Type, Region, Museums, and Camping.
  • DEFINE Region / ORDER;: ORDER indicates that the Region column should be used for sorting the report. The ORDER option ensures that the data is sorted based on the values in the Region column. The default sorting order is ascending.
  • DEFINE Camping / ANALYSIS ‘Campgrounds’;: The ANALYSIS option is generally used when you want to perform summary statistics (such as sums, means, or counts) on a numeric column. Since it’s specified here, it prepares the Camping column for such calculations. However, if no specific statistics are requested, the ANALYSIS option doesn’t perform any calculations by itself but is ready for such functionality if needed. Campgrounds provides a custom label for the Camping column, renaming it from Camping to ‘Campgrounds’ in the report output.
  • TITLE “National Parks Report”;: Adds a title to the report. This title will appear at the top of the generated report, making it clear that the report is about National Parks.

We next create two summary reports:

/*summary report #1*/
PROC REPORT DATA=nationalparks;
    COLUMN Region Type Museums Camping;
    DEFINE Region / GROUP; 
    DEFINE Type / GROUP ;
    
    /* Adding a title to the report */
    TITLE "National Parks Summary Report with Two Group Variables";
RUN;


/*Summary report #2*/
PROC REPORT DATA=nationalparks;
    COLUMN Region Type, (Museums Camping);
    DEFINE Region / GROUP; 
    DEFINE Type / Across ;
    
    /* Adding a title to the report */
    TITLE "National Parks Summary Report with a Group and An Across Variable";
RUN;

We can also add summary breaks to the report:

PROC REPORT DATA=nationalparks MISSING;
    COLUMN Region Type Museums Camping;
    /* Custom column labels and settings */
    DEFINE Region / ORDER; /* Sort by Region */
    BREAK AFTER Region / SUMMARIZE;
    BREAK AFTER  / SUMMARIZE;

    /* Adding a title to the report */
    TITLE "National Parks Detailed Report with Summary Breaks";
RUN;

Next, we can add statistics to the report:

/*summary report #1*/
PROC REPORT DATA=nationalparks;
    COLUMN Region Type N (Museums Camping), MEAN;
    DEFINE Region / GROUP; 
    DEFINE Type / GROUP ;
    
    TITLE "National Parks Statistics with Two Group Variables";
RUN;


/*Summary report #2*/
PROC REPORT DATA=nationalparks;
    COLUMN Region N Type, (Museums Camping), MEAN;
    DEFINE Region / GROUP; 
    DEFINE Type / Across ;
    
    TITLE "National Parks Statistics with a Group and An Across Variable";
RUN;

Finally, we can add computed variables to the report:

/* Generate the report with computed variable */
PROC REPORT DATA=nationalparks MISSING;
    COLUMN Park Region Museums Camping TotalFacilities Note;
    
    /* Define column attributes */
    DEFINE Museums / ANALYSIS SUM NOPRINT;
    DEFINE Camping / ANALYSIS SUM NOPRINT;
    DEFINE TotalFacilities / COMPUTED 'Total Facilities';
    DEFINE Note / COMPUTED;
    /* Compute TotalFacilities as the sum of Museums and Camping */
    COMPUTE TotalFacilities / ;
        TotalFacilities = Museums.SUM + Camping.SUM;
    ENDCOMP;
    COMPUTE Note / CHAR LENGTH = 10;
        IF Museums.SUM = 0 THEN Note = 'No Museum';
        IF Camping.SUM = 0 THEN Note = 'No Camping';
        
    ENDCOMP;

    /* Add a title to the report */
    TITLE "National Parks Report with Computed Total Facilities";
RUN;

Chapter 5. Enhancing Your Output with ODS

You can enhance your SAS output using ODS to create reports in various formats like HTML and PDF, and you can apply custom styles to make them visually appealing.

Creating HTML and PDF Output

SAS output can be sent to the HTML or PDF destination.

Example 1: HTML output

/* Open the ODS HTML destination and specify the file path */
ODS HTML FILE='/home/u688923/my_shared_file_links/u688923/simple_output.html'
          STYLE=htmlblue;
          
/*Remove procedure title*/
ODS noproctitle; 

/* Create a dataset using datalines */
DATA squares;
    INPUT Number Square;
    DATALINES;
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
10 100
;
RUN;

/* Print the dataset to display the table */
PROC PRINT DATA=squares NOOBS;
    TITLE 'Square Table for Numbers 1 to 10';
    VAR Number Square; /* Display 'Number' and 'Square' variables */
RUN;

/* Close the ODS HTML destination */
ODS HTML CLOSE;

Explanation

  • “FILE=‘/home/u688923/my_shared_file_links/u688923/simple_output.html’”: Specifies the path and filename where the HTML output will be saved.
  • “STYLE=htmlblue”: Applies the “htmlblue” style template to format the HTML output. This style affects the visual appearance of the HTML report, including fonts, colors, and table formatting. You can use other styles such as html, minimal, or statdoc depending on your preference.
  • “ODS noproctitle;”: Suppresses the default procedure titles from appearing in the output.
  • “ODS HTML CLOSE;”: Closes the ODS HTML destination. This finalizes the HTML file and ensures that all output is written to the specified file.

Example 2: HTML output

/* Open the ODS PDF destination and specify the file path */
ODS PDF FILE='/home/u688923/my_shared_file_links/u688923/simple_output.pdf'
         STYLE=journal
         STARTPAGE=NO;

/* Create a dataset using datalines */
DATA squares;
    INPUT Number Square;
    DATALINES;
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
10 100
;
RUN;

/* Print the dataset to display the table */
PROC PRINT DATA=squares NOOBS;
    TITLE 'Square Table for Numbers 1 to 10';
    VAR Number Square; /* Display 'Number' and 'Square' variables */
RUN;

/* Close the ODS PDF destination */
ODS PDF CLOSE;

Explanation

  • The ODS PDF statement opens a PDF file for output and specifies formatting options.
  • “FILE=‘/home/u688923/my_shared_file_links/u688923/simple_output.pdf’”: This line opens the ODS PDF destination and specifies the path and filename for the PDF output. The PDF file will be saved at the specified location.
  • “STYLE=journal”: Applies the “journal” style template to format the PDF output. This style affects the layout and appearance of the PDF. Other styles like default, minimal, or statdoc can be used based on your needs.
  • “STARTPAGE=NO”: Prevents the insertion of a new page at the start of each procedure or data step. This is useful for controlling the pagination of the output, especially if you want to keep related content on the same page.

Styled Report

Creating a styled report in SAS can enhance the presentation and readability of your output. You can achieve this using ODS (Output Delivery System) with custom styles, and by incorporating PROC REPORT or PROC PRINT. Below is an example that demonstrates how to create a styled report.

/* Create a simple dataset */
data employees;
    input Name $ Age Salary;
    datalines;
John 30 50000
Alice 28 55000
Bob 35 60000
;
run;

/* Generate a report with styled title and column */
PROC REPORT DATA=employees NOWD;
    COLUMN Name Age Salary;

    /* Define columns with optional styling */
    DEFINE Salary / DISPLAY 'Annual Salary'
                    STYLE(COLUMN)=[BACKGROUND=lightgreen FONT_SIZE=12pt FONT_WEIGHT=bold COLOR=darkgreen];
RUN;

Explanation

  • NOWD stands for “no windowing”, which means the output will be directed to the ODS destination (like HTML or PDF) rather than a separate interactive window.
  • The DEFINE statement customizes the presentation of the Salary column.
  • DISPLAY indicates that this column will show values as they are (without any additional summarization).
  • ‘Annual Salary’ provides a custom label for the column header, which will appear in the report instead of just “Salary”.
  • STYLE(COLUMN)=[BACKGROUND=lightgreen FONT_SIZE=12pt FONT_WEIGHT=bold COLOR=darkgreen];:
    • This part of the DEFINE statement applies specific styles to the Salary column.
    • BACKGROUND=lightgreen: Sets the background color of the column to light green.
    • FONT_SIZE=12pt: Sets the font size for the text in this column to 12 points.
    • FONT_WEIGHT=bold: Makes the font weight bold.
    • COLOR=darkgreen: Sets the text color to dark green.

Chapter 6. Modifyoing and Combining SAS Data

In SAS, merging datasets involves combining observations from two or more datasets into a single dataset based on common variables. This operation is typically done using the MERGE statement in a DATA step. The merge process can be used to combine datasets in a variety of ways, including inner joins, left joins, right joins, and full joins.

SAS can perform different types of joins (inner, left, right, and full) using both the MERGE statement in the DATA step and PROC SQL. SAS can also perform inner join using the SET statement.

A video to watch: https://www.youtube.com/watch?v=fQNt7Pqb_0s

Merging Data Using MERGE statements

Example 1. INNER join: An inner join returns only the records with matching keys in both datasets.

/* Sample datasets */
data data1;
    input ID Name $;
    datalines;
1 John
2 Alice
3 Bob
;
run;

data data2;
    input ID Age;
    datalines;
1 30
2 25
4 40
;
run;

/* Inner join using DATA step */
data inner_join;
    merge data1(in=a) data2(in=b);
    by ID;
    if a and b; /* Keep only records present in both datasets */
run;

Example 2. LEFT join: A left join returns all records from the left dataset and the matched records from the right dataset.

/* Left join using DATA step */
data left_join;
    merge data1(in=a) data2(in=b);
    by ID;
    if a; /* Keep all records from data1 */
run;

Example 3. Right Join: A right join returns all records from the right dataset and the matched records from the left dataset.

/* Right join using DATA step */
data right_join;
    merge data1(in=a) data2(in=b);
    by ID;
    if b; /* Keep all records from data2 */
run;

Example 4. Full Outer Join: A full outer join returns all records when there is a match in either dataset.

/* Full outer join using DATA step */
data full_outer_join;
    merge data1(in=a) data2(in=b);
    by ID;
    /* Keep all records from both datasets */
run;

Merging Data Using the PROC SQL Procedure

/*Inner join*/
proc sql;
    create table inner_sql as
    select a.ID, a.Name, b.Age
    from data1 as a
    inner join data2 as b
    on a.ID = b.ID;
quit;

proc print data=inner_sql;
run;

/*Left join*/
proc sql;
    create table left_sql as
    select a.ID, a.Name, b.Age
    from data1 as a
    left join data2 as b
    on a.ID = b.ID;
quit;

proc print data=left_sql;
run;    
    
/*Right join*/    
proc sql;
    create table right_sql as
    select a.ID, a.Name, b.Age
    from data1 as a
    right join data2 as b
    on a.ID = b.ID;
quit;    
    
proc print data=left_sql;
run;    
    
/*Full join*/    
proc sql;
    create table full_sql as
    select a.ID, a.Name, b.Age
    from data1 as a
    full join data2 as b
    on a.ID = b.ID;
quit;

proc print data=full_sql;
run;

PROC TRANSPOSE

data sales;
    input Product $ Quarter $ Sales;
    datalines;
A Q1 100
A Q2 150
A Q3 200
B Q1 120
B Q2 180
B Q3 220
;
run;

proc transpose data=sales out=transposed_sales;
    by Product;  /*Each product will be an observation or row in the new SAS data*/
    id Quarter;  /* Different quarters become different variables or columns*/
    var Sales;   /* The sales become values under corresponding columns*/
run;

proc print data=transposed_sales;
run;

The BY statement may include a combination of a few variables.

Using SAS Automatic Variables

SAS automatic variables are special variables that are created and maintained by SAS during data step processing. They provide useful information about the current state or environment of the data step. Here are the key automatic variables in SAS.

  • N: holds the number of iterations that have been executed in the data step. It starts at 1 and increments by 1 with each iteration of the data step.
  • ERROR: is set to 1 if an error occurs during the execution of the data step, otherwise, it is 0. It indicates whether an error was encountered for the current observation.
  • LAST: is created when using the BY statement in conjunction with FIRST. and LAST. variables. It helps identify the last observation within a BY group.
data example;
    input x;
    _n_ + 1;  /* This adds 1 to _n_ on each iteration */
    datalines;
10
20
30
;
run;

proc print data=example;
run;
data error_check;
    input x;
    if x < 0 then _ERROR_ = 1; /* Sets error flag if x is negative */
    else _ERROR_ = 0;
    datalines;
10
-5
20
;
run;

proc print data=error_check;
run;
data example;
    set sashelp.class;
    by sex;
    if last.sex then last_observation = 1;
    else last_observation = 0;
run;

proc print data=example;
run;

Chapter 7. Writing Flexible Code with the SAS Macro Facility

In SAS, macros are a powerful feature that allow you to automate repetitive tasks, create dynamic code, and improve the efficiency of your programming. Here’s an overview of the key concepts related to SAS macros:

Macro Variables

  • Definition: Macro variables are placeholders that store text strings or values. They can be used to make your code more flexible and reusable by substituting different values or texts.
  • Creation: Use %LET to define a macro variable.
/*Define the macro variable*/
%let var = age;

/*Usage: Use & to reference the macro variable. In the code below, &var will be replace by age everywhere*/
proc print data=example;
  var &var;
run;

Macro Programs

  • Definition: A macro program is a set of SAS statements that can be reused in different parts of your code or across different programs.
  • Creation: Use %macro to start the macro definition and %mend to end it.
/*Create the macro program. The macro is called report taking data as the only parameter*/
%macro report(data=);
  proc print data=&data;
  run;
%mend report;

/*Execution: Call the macro by using its name and passing any required parameters.*/
%report(data=example);

Macro Functions

Definition: Macro functions provide a way to perform operations on macro variables or to generate text strings dynamically.

/* The existing micro function "%sysfunc" allows you to call SAS functions within macro code.*/

/*Define a macro variable by calling the macro function "%sysfunc".*/
%let today = %sysfunc(today(), date9.);

Macro Parameters

Definition: Parameters are values passed to a macro when it is called. They allow you to create more flexible and reusable macros.

Definition in Macro: Define parameters within the macro definition.

%macro summary(data=, var=);
  proc means data=&data;
    var &var;
  run;
%mend summary;

/*Passing Parameters: Provide parameter values when calling the macro.*/

%summary(data=example, var=age);

Macro Statements

Definition: Macro statements are commands used to control macro processing. They include:

%LET for defining macro variables. %DO and %END for looping. %IF and %THEN for conditional processing.

Examples Using Macros

We create a macro that prints a summary table for a dataset in SAS

%macro summary_table(dataset);
    /* Create a dataset with variable names and types */
    proc contents data=&dataset out=vars(keep=name type) noprint;
    run;

    /* Create a macro variable containing a list of variable names */
    proc sql noprint;
        select name into :varlist separated by ' '
        from vars
        where type = 1;  /* Numeric variables only */
    quit;

    /* Generate a summary table for each variable */
    proc means data=&dataset n mean std min max;
        var &varlist;
    run;
%mend;

/* Create an example dataset */
data example_data;
    input id age height weight;
    datalines;
1 25 175 70
2 30 180 80
3 22 160 60
4 28 170 75
;
run;

/* Use the macro to print a summary table */
%summary_table(example_data);

The SELECT … INTO Structure

In SAS, the SELECT … INTO statement is used within PROC SQL to assign the result of a query to a macro variable. This feature is useful for extracting single values or aggregating results and using them later in your SAS program.

/*The following SQL code creates a macro variable "max_height"*/
proc sql;
   select max(height) 
   into :max_height
   from sashelp.class;
quit;

/* Using the macro variable */
%put Maximum height is &max_height;  /*The result is in the LOG*/

Chapter 8. Visualizing Your Data

Visualizing data in SAS can enhance your analysis by making patterns and insights easier to identify. Here’s a guide on various methods to visualize your data using SAS, focusing on PROC SGPLOT and PROC SGPANEL.

A video to watch: https://www.youtube.com/watch?v=EZAlVFQflTI

PROC SGPLOT

PROC SGPLOT is used for creating single-panel plots. Here’s a basic example that demonstrates how to create a scatter plot.

Histograms with PROC SGPLOT

/* Generate a histogram of MPG */
proc sgplot data=sashelp.cars;
    title "Histogram of Miles Per Gallon (MPG)";  /* Set the title of the histogram */
    
    /* Create the histogram */
    histogram MPG_city / binwidth=2 fillattrs=(color=lightblue) ;
    
    /* Label for the x-axis */
    xaxis label="Miles Per Gallon (MPG)";  
    /* Label for the y-axis */
    yaxis label="Frequency";  
run;

Scatterplots with PROC SGPLOT

/* Create a dataset from the built-in SAS dataset 'cars' */
data work.cars;
    set sashelp.cars;  /* Load data from sashelp.cars into work.cars */
run;

/* Generate a scatter plot */
proc sgplot data=work.cars;
    title "Scatter Plot of Horsepower vs. MPG";  /* Set the title of the plot */
    
    /* Create a scatter plot with Horsepower on the x-axis and MPG on the y-axis */
    scatter x=Horsepower y=MPG_city / markerattrs=(symbol=circlefilled color=blue); 
    
    /* Label for the x-axis */
    xaxis label="Horsepower";  
    /* Label for the y-axis */
    yaxis label="Miles Per Gallon (MPG)";  
run;

Barplots with PROC SGPLOT

The following example uses a dataset from the SASHELP library to demonstrate how to create a vertical bar plot showing the average MPG by car type.

/* Create a dataset from the built-in SAS dataset 'cars' */
data work.cars;
    set sashelp.cars;  /* Load data from sashelp.cars into work.cars */
run;

/* Calculate average MPG by Type */
proc sql;
    create table work.avg_mpg as
    select Type, mean(MPG_city) as Avg_MPG
    from work.cars
    group by Type;
quit;

/* Generate a vertical bar plot */
proc sgplot data=work.avg_mpg;
    title "Average MPG by Car Type";  /* Set the title of the plot */
    
    /* Create a vertical barplot. Use the hbar statement if horizontal barplot is wanted. */
    vbar Type / response=Avg_MPG barwidth=0.5 fillattrs=(color=blue);
    
    /* Label for the y-axis */
    yaxis label="Average MPG";  
    /* Label for the x-axis */
    xaxis label="Car Type";  
run;

Box Plot with PROC SGPLOT

In the following example, we’ll create a box plot to visualize the distribution of MPG (miles per gallon) by car type.

/* Create a dataset from the built-in SAS dataset 'cars' */
data work.cars;
    set sashelp.cars;  /* Load data from sashelp.cars into work.cars */
run;

/* Generate a box plot */
proc sgplot data=work.cars;
    title "Box Plot of MPG by Car Type";  /* Set the title of the plot */
    
    /* Create a box plot for MPG by Type */
    vbox MPG / category=Type fillattrs=(color=lightblue) 
               lineattrs=(color=blue);
    
    /* Label for the y-axis */
    yaxis label="Miles Per Gallon (MPG)";  
    /* Label for the x-axis */
    xaxis label="Car Type";  
run;

Regression Lines with PROC SGPLOT

You can easily add a regression line to a scatter plot using PROC SGPLOT in SAS. Here’s how to do that with an example using the sashelp.cars dataset to visualize the relationship between Horsepower and MPG_City.

proc sgplot data=sashelp.cars;
    /* Create a scatter plot */
    scatter x=Horsepower y=MPG_City / markerattrs=(symbol=circlefilled color=blue);
    
    /* Add a regression line */
    reg x=Horsepower y=MPG_City / lineattrs=(color=red thickness=2);

    /* Set titles and axis labels */
    title "Scatter Plot of Horsepower vs. MPG City with Regression Line";
    xaxis label="Horsepower";
    yaxis label="Miles Per Gallon (City)";
run;

PROC SGPANEL

PROC SGPANEL is used for creating panel plots, allowing for multi-faceted visualizations based on one or more categorical variables. Here’s an example where we create a panel of scatter plots based on the type of car.

/* Generate a panel plot using the same 'cars' dataset */
proc sgpanel data=work.cars;
    title "Panel of Horsepower vs. MPG by Type";  /* Set the title of the panel plot */
    
    /* Create panels based on the Type variable, arranged in a row lattice layout */
    panelby Type / layout=rowlattice onepanel;  
    
    /* Create a scatter plot for each panel */
    scatter x=Horsepower y=MPG_city / markerattrs=(symbol=circlefilled);  
    
    /* Label for the y-axis of the panel */
    rowaxis label="Miles Per Gallon (MPG)";  
    /* Label for the x-axis of the panel */
    colaxis label="Horsepower";  
run;

Another example:

proc sgpanel data=sashelp.class;
    panelby sex;  /* Create panels for each sex */
    
    /* Create box plots for height by age */
    vbox height / category=age;  /* Use age as the categorical axis */
    
    title "Box Plot of Height by Age and Sex";
    rowaxis label="Height (inches)";
    colaxis label="Age";
run;

A third example: Using two grouping variables (one on rows and one on columns)

proc sgpanel data=sashelp.cars;
    /* Create panels by Type and Origin */
    panelby Type Origin / layout=lattice 
                         onepanel 
                         sort=(ascformat descformat) 
                         uniscale=column 
                         novarname; 

    /* Create a scatter plot */
    scatter x=Horsepower y=MPG_City;  

    /* Set titles and axis labels */
    title "Scatter Plot of Horsepower vs. MPG City by Car Type and Origin";
    rowaxis label="Miles Per Gallon (City)";
    colaxis label="Horsepower";
run;

Chapter 9. Using Basic Statistical Procedures

PROC UNIVARIATE

PROC UNIVARIATE provides detailed summary statistics and distribution analyses, including histograms and quantiles.

proc univariate data=sashelp.cars;
    title "Univariate Analysis of MPG";
    var MPG_city;  /* Variable to analyze */
    histogram MPG_city;  /* Generate a histogram */
run;

PROC MEANS

PROC MEANS provides descriptive statistics such as mean, standard deviation, minimum, and maximum for numeric variables.

proc means data=sashelp.cars n mean std min max;
    title "Descriptive Statistics for Cars Data";
    var MPG_city Horsepower Weight;  /* Variables to analyze */
run;

PROC FREQ

PROC FREQ is used to generate frequency tables for categorical variables.

proc freq data=sashelp.cars;
    title "Frequency Distribution of Car Types";
    tables Type;  /* Categorical variable to analyze */
run;

PROC CORR

PROC CORR computes the correlation coefficients between numeric variables.

proc corr data=sashelp.cars;
    title "Correlation Matrix for Numeric Variables";
    var MPG_city Horsepower Weight;  /* Variables to analyze */
run;

PROC TTEST

PROC TTEST performs t-tests to compare means between two groups.

proc ttest data=sashelp.cars;
    title "T-Test for MPG by Car Type";
    class Type;  /* Grouping variable */
    var MPG_city;  /* Variable to test */
run;

PROC ANOVA

PROC ANOVA is used for analysis of variance to compare means across multiple groups.

proc anova data=sashelp.cars;
    title "ANOVA for MPG by Car Type";
    class Type;  /* Grouping variable */
    model MPG_city = Type;  /* Dependent variable */
run;

PROC REG

In SAS, PROC REG is used for performing linear regression analysis. This procedure helps you understand the relationship between one dependent variable and one or more independent variables by estimating the coefficients of the regression equation.

Key Features of PROC REG:

  • Multiple Linear Regression: You can include multiple independent variables.
  • Model Diagnostics: Provides options to assess the fit and assumptions of the regression model.
  • Output Options: Customizable output for coefficients, diagnostics, and plots.

Let’s create a simple linear regression model to predict MPG (miles per gallon) based on Horsepower and Weight from the sashelp.cars dataset.

/* Perform linear regression analysis */
proc reg data=sashelp.cars;
    title "Linear Regression of MPG on Horsepower and Weight";
    
    /* Specify the dependent and independent variables */
    model MPG_city = Horsepower Weight;  
    
    /* Output diagnostics */
    output out=reg_results p=predicted r=residual;  /* Save predicted values and residuals */
run;

/* Print the results dataset */
proc print data=reg_results;
    title "Predicted Values and Residuals";
run;