Overview

The PROC UNIVARIATE statement is required to invoke the UNIVARIATE procedure. If you do not specify any other statements, it produces a variety of statistics that summarize the data distribution of each analysis variable:

  • sample moments

  • basic measures of location and variability

  • confidence intervals for the mean, standard deviation, and variance

  • tests for location

  • tests for normality

  • trimmed and Winsorized means

  • robust estimates of scale

  • quantiles and related confidence intervals

  • extreme observations and extreme values

  • frequency counts for observations

  • missing values

proc univariate data=baseball;
run;

Moments, Basic Statistical Measures and Tests for Location

Quartiles and Extreme Observations

The UNIVARIATE procedure provides the following:

  • descriptive statistics based on moments (including skewness and kurtosis), quantiles or percentiles (such as the median), frequency tables, and extreme values

  • histograms that optionally can be fitted with probability density curves for various distributions and with kernel density estimates

  • cumulative distribution function plots (CDF plots). Optionally, these can be superimposed with probability distribution curves for various distributions.

  • quantile-quantile plots (Q-Q plots), probability plots, and probability-probability plots (P-P plots). These plots facilitate the comparison of a data distribution with various theoretical distributions.

  • goodness-of-fit tests for a variety of distributions including the normal

  • the ability to inset summary statistics on plots

  • the ability to analyze data sets with a frequency variable

  • the ability to create output data sets containing summary statistics, histogram intervals, and parameters of fitted curves

You can use the PROC UNIVARIATE statement, together with the VAR statement, to compute summary statistics. See the section Getting Started: UNIVARIATE Procedure for introductory examples. In addition, you can use the following statements to request plots:

  • the CDFPLOT statement for creating CDF plots

  • the HISTOGRAM statement for creating histograms

  • the PPPLOT statement for creating P-P plots

  • the PROBPLOT statement for creating probability plots

  • the QQPLOT statement for creating Q-Q plots

  • the CLASS statement together with any of these plot statements for creating comparative plots

  • the INSET statement with any of the plot statements for enhancing the plot with an inset table of summary statistics
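
None of the examples below uses the CDFPLOT statement, so here is a minimal sketch of one. It assumes the same baseball data set and nAtBat variable that the later examples use. The NORMAL option superimposes a normal cumulative distribution function whose parameters are estimated from the data:

/* Sketch only: assumes the baseball data set used in the examples below */
ods graphics on;
proc univariate data=baseball noprint;
   cdfplot nAtBat / normal;   /* overlay a fitted normal CDF */
run;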

Examples

Computing Descriptive Statistics for Multiple Variables

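This example illustrates how to compute descriptive statistics for the analysis variables named in the VAR statement. Here a single variable, nAtBat, is analyzed; list additional variables in the VAR statement to analyze several at once: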

title 'Results of Numbers at Bat';
ods select BasicMeasures Quantiles;
proc univariate data=baseball;
   var nAtBat;
run;

Results of Numbers at Bat

  • The ODS SELECT statement restricts the output to the “BasicMeasures” and “Quantiles” tables.

  • The PROC UNIVARIATE statement requests univariate statistics for the variables listed in the VAR statement, which specifies the analysis variables and their order in the output.
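
To analyze several variables in one step, list them in the VAR statement. The following sketch assumes that nHits and nRBI, which appear in later examples, are numeric variables in the same data set. The title text is illustrative:

title 'Results for Three Batting Variables';
ods select BasicMeasures Quantiles;
proc univariate data=baseball;
   var nAtBat nHits nRBI;   /* tables are produced in this order */
run;

If you omit the VAR statement, PROC UNIVARIATE analyzes all numeric variables in the data set.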

Creating a Frequency Table

This example illustrates how to create a frequency table:

title 'Results of Numbers at Bat';
ods select Frequencies;
proc univariate data=baseball freq;
   var nAtBat;
run;

Results of Numbers at Bat

  • The ODS SELECT statement restricts the output to the “Frequencies” table.

  • The FREQ option in the PROC UNIVARIATE statement requests the table of frequencies shown in the output.
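
If you want the frequency table as a data set rather than as printed output, one option is the ODS OUTPUT statement. In this sketch, FreqTable is a hypothetical data set name chosen for illustration:

ods output Frequencies=FreqTable;   /* FreqTable is an illustrative name */
proc univariate data=baseball freq;
   var nAtBat;
run;

proc print data=FreqTable (obs=10);   /* show the first ten rows */
run;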

Creating Basic Summary Plots

This example illustrates how to create basic summary plots, first as line printer plots (ODS Graphics disabled) and then as ODS Graphics output:

ods graphics off;
ods select Plots SSPlots;
proc univariate data=baseball plot;
   /*by Team;*/
   var nAtBat;
run;

ods graphics on;
ods select Plots SSPlots;
proc univariate data=baseball plot;
   /*by Team;*/
   var nAtBat;
run;

Results of Numbers at Bat

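  • The ODS GRAPHICS OFF statement, specified before the first PROC UNIVARIATE statement, disables ODS Graphics, which causes the PLOTS option to produce legacy line printer plots. The second step runs after ODS GRAPHICS ON, so the same request produces ODS Graphics output instead.

  • The PLOTS option produces a stem-and-leaf plot, a box plot, and a normal probability plot for the nAtBat variable. The BY statement is commented out in this code. If you uncomment it (the input data set must first be sorted by Team), a side-by-side box plot is also created to compare the teams.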

  • The ODS SELECT statement restricts the output to the “Plots” and “SSPlots” tables.
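
The side-by-side box plot requires an active BY statement, and BY-group processing expects the input to be sorted by the BY variable. A sketch of that variant follows. It assumes Team is a variable in the baseball data set, and baseball_sorted is a hypothetical name for the sorted copy:

proc sort data=baseball out=baseball_sorted;   /* illustrative output name */
   by Team;
run;

ods graphics off;
ods select Plots SSPlots;
proc univariate data=baseball_sorted plot;
   by Team;          /* one set of plots per team, plus SSPlots */
   var nAtBat;
run;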

Computing Confidence Limits for the Mean, Standard Deviation, and Variance

This example illustrates how to compute confidence limits for the mean, standard deviation, and variance:

title 'Analysis of Numbers at Bat';
ods select BasicIntervals;
proc univariate data=baseball cibasic;
   var nAtBat;
run;

title 'Analysis of Numbers at Bat';
ods select BasicIntervals;
proc univariate data=baseball cibasic(alpha=.1);
   var nAtBat;
run;

Analysis of Numbers at Bat

  • The CIBASIC option requests confidence limits for the mean, standard deviation, and variance.

  • The ODS SELECT statement restricts the output to the “BasicIntervals” table.

  • By default, the CIBASIC option produces 95% confidence limits. You can request a different confidence level by using the ALPHA= option in parentheses after the CIBASIC option. In the second step above, ALPHA=.1 requests 90% confidence limits.
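
For reference, two-sided limits of this kind correspond to the standard normal-theory formulas, which assume that the data are normally distributed. With sample size \(n\), sample mean \(\bar{x}\), and sample standard deviation \(s\), the \(100(1-\alpha)\%\) limits for the mean and the variance are:

\[
\bar{x} \pm t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}
\]

\[
\frac{(n-1)\,s^2}{\chi^2_{1-\alpha/2,\,n-1}} \;\le\; \sigma^2 \;\le\; \frac{(n-1)\,s^2}{\chi^2_{\alpha/2,\,n-1}}
\]

Here \(t_{p,\,\nu}\) and \(\chi^2_{p,\,\nu}\) denote the \(p\)th quantiles of the Student's t and chi-square distributions with \(\nu\) degrees of freedom. The limits for \(\sigma\) are the square roots of the limits for \(\sigma^2\). With ALPHA=.1, \(\alpha = 0.1\) and the limits have 90% coverage.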

Creating a Histogram

This example illustrates how to create a histogram:

title 'Analysis of Numbers at Bat';
ods graphics on;
proc univariate data=baseball noprint;
   histogram nAtBat / odstitle = title;
run;
title 'Enhancing a Histogram';
proc univariate data=baseball noprint;
   histogram nAtBat / midpoints    = 100 to 700 by 50
                     rtinclude
                     outhistogram = OutMdpts
                     odstitle     = title;
run;

proc print data=OutMdpts;   /* print the data set created by OUTHISTOGRAM= */
run;

Analysis of Numbers at Bat

  • The NOPRINT option in the PROC UNIVARIATE statement suppresses tables of summary statistics for the variable nAtBat that would be displayed by default.

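  • A histogram is created for each variable listed in the HISTOGRAM statement.

  • The ODSTITLE=TITLE option uses the text of the TITLE statement as the title of the graph.

  • In the second step, the MIDPOINTS= option specifies the values to use as bin midpoints, and the RTINCLUDE option includes the right endpoint of each bin in the interval instead of the left endpoint, which is the default.

  • The OUTHISTOGRAM= option creates a data set named OutMdpts that contains the midpoints and the percentage of observations in each bin.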
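
The overview mentions kernel density estimates. A minimal sketch of overlaying one on the same histogram follows, using the KERNEL option with its default settings. The title text is illustrative:

title 'Histogram with Kernel Density Estimate';
ods graphics on;
proc univariate data=baseball noprint;
   histogram nAtBat / kernel            /* default kernel and bandwidth */
                      odstitle = title;
run;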

Creating a Two-Way Comparative Histogram

This example illustrates how to create a two-way comparative histogram:

title 'Results of Numbers at Bat';
ods graphics on;
proc univariate data=baseball;
   class League Division / keylevel = ('American' 'East');
   histogram nAtBat / vaxis      = 0 10 20 30
                     ncols      = 2
                     nrows      = 2
                     odstitle   = title;
run;

Results of Numbers at Bat

  • The KEYLEVEL= option specifies the key cell as the cell for which League is equal to ‘American’ and Division is equal to ‘East’. This cell determines the binning for the other cells, and the rows and columns are arranged so that this cell is displayed in the upper left corner. If the class levels are in their default order (ascending internal value), ‘American’ and ‘East’ are already the first levels, so this is also the key cell you would get without the KEYLEVEL= option. Specifying it makes the choice explicit and keeps it fixed if the level ordering changes.

  • The VAXIS= option specifies the tick mark labels for the vertical axis.

  • The NROWS=2 and NCOLS=2 options specify an arrangement for the tiles.
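
A one-way comparison is a simpler variant of the same idea. This sketch uses a single CLASS variable and, assuming League has two levels, stacks the two cells vertically so that they share a horizontal axis. The title text is illustrative:

title 'Numbers at Bat by League';
ods graphics on;
proc univariate data=baseball noprint;
   class League;
   histogram nAtBat / nrows    = 2      /* one row per League level */
                      odstitle = title;
run;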

Adding a Normal Curve to a Histogram

The following statements fit a normal distribution to the number of hits (nHits) in the Baseball data set and superimpose the fitted density curve on the histogram:

title 'Analysis of Number of Hits';
ods select Histogram ParameterEstimates GoodnessOfFit FitQuantiles Bins;
proc univariate data=baseball;
   histogram nHits / normal(percents=20 40 60 80 midpercents)
                     odstitle = title;
   inset n normal(ksdpval) / pos = ne format = 6.3;
run;

Analysis of Number of Hits

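  • The ODS SELECT statement restricts the output to the histogram and the “ParameterEstimates,” “GoodnessOfFit,” “FitQuantiles,” and “Bins” tables.

  • The NORMAL option specifies that the normal curve be displayed on the histogram. It also requests a summary of the fitted distribution, which is shown in the output.

  • The PERCENTS= option specifies the percentages (here 20, 40, 60, and 80) for which observed and estimated quantiles are listed in the “FitQuantiles” table. The MIDPERCENTS option requests the “Bins” table, which lists the midpoints, the observed percentage of observations, and the estimated percentage of the population in each interval (estimated from the fitted normal distribution).

  • The INSET statement adds a box to the histogram that shows the sample size (N) and the p-value for the Kolmogorov-Smirnov test of the normal fit (NORMAL(KSDPVAL)). The POS=NE option places the inset in the northeast corner, and the FORMAT=6.3 option controls how the values are displayed.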

Creating a Normal Quantile Plot

This example illustrates how to create a normal quantile-quantile plot (Q-Q plot) to examine the distribution of runs batted in (nRBI):

title 'Normal Quantile-Quantile Plot for Runs Batted In';
ods graphics on;
proc univariate data=baseball noprint;
   qqplot nRBI / odstitle = title;
run;

Normal QQ Plot for RBI

  • The plot compares the ordered values of nRBI with quantiles of the normal distribution. An approximately linear point pattern indicates that the values are consistent with a normal distribution; systematic curvature indicates a departure from normality.

  • Note that a normal Q-Q plot is created by default.
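
A reference line makes departures from linearity easier to judge. The following sketch adds one whose intercept and slope are the sample mean and sample standard deviation, requested with MU=EST and SIGMA=EST. The title text is illustrative:

title 'Normal Q-Q Plot with Reference Line';
ods graphics on;
proc univariate data=baseball noprint;
   qqplot nRBI / normal(mu=est sigma=est)   /* estimated reference line */
                 square
                 odstitle = title;
run;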

Creating a P-P Plot

To check whether the runs batted in are normally distributed, the following statements first compute summary statistics for nRBI and then create a P-P plot, which is based on the normal distribution with mean \(\mu = 49\) and standard deviation \(\sigma = 25.5\):

proc univariate data=baseball;
   var nRBI;
run;

title 'Normal Probability-Probability Plot for Runs Batted In';
ods graphics on;
proc univariate data=baseball noprint;
   ppplot nRBI / normal(mu=49 sigma=25.5)
                     square
                     odstitle = title;
run;

Normal P-P Plot for RBI

  • The first PROC UNIVARIATE step computes the sample mean and standard deviation of nRBI. These values are then supplied to the PPPLOT statement in the second step.

  • The NORMAL option in the PPPLOT statement requests a P-P plot based on the normal cumulative distribution function, and the MU= and SIGMA= normal-options specify \(\mu\) and \(\sigma\).

  • The SQUARE option displays the plot in a square frame.

  • Note that a P-P plot is always based on a completely specified distribution—in other words, a distribution with specific parameters. In this example, if you did not specify the MU= and SIGMA= normal-options, the sample mean and sample standard deviation would be used for \(\mu\) and \(\sigma\).
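
As a sketch of the default behavior described in the last point, the following step omits MU= and SIGMA=, so the normal distribution is specified by the sample estimates. The title text is illustrative:

title 'Normal P-P Plot with Estimated Parameters';
ods graphics on;
proc univariate data=baseball noprint;
   ppplot nRBI / normal     /* mu and sigma are estimated from the data */
                 square
                 odstitle = title;
run;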

Comparison of Normal P-P Plot and Normal Q-Q Plot

  • A P-P plot compares the empirical cumulative distribution function of a data set with a specified theoretical cumulative distribution function F(·).

  • A Q-Q plot compares the quantiles of a data distribution with the quantiles of a standardized theoretical distribution from a specified family of distributions.
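
In symbols, with ordered observations \(x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}\), the \(i\)th point on each plot can be written as follows. Here \(F\) is the completely specified cdf, and \(F_0\) is the standardized member of the family. The plotting position shown for the Q-Q plot is one common convention and should be read as an assumption, not as a description of every implementation:

\[
\text{P-P plot: } \left( F\!\left(x_{(i)}\right),\ \frac{i}{n} \right)
\qquad
\text{Q-Q plot: } \left( F_0^{-1}\!\left( \frac{i - 0.375}{n + 0.25} \right),\ x_{(i)} \right)
\]

The P-P coordinates require \(F\) itself, including its location and scale parameters, whereas the Q-Q coordinates need only the standardized \(F_0\).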

There are three important differences in the way P-P plots and Q-Q plots are constructed and interpreted:

  • The construction of a Q-Q plot does not require that the location or scale parameters of F(·) be specified. The theoretical quantiles are computed from a standard distribution within the specified family. A linear point pattern indicates that the specified family reasonably describes the data distribution, and the location and scale parameters can be estimated visually as the intercept and slope of the linear pattern. In contrast, the construction of a P-P plot requires the location and scale parameters of F(·) to evaluate the cdf at the ordered data values.

  • The linearity of the point pattern on a Q-Q plot is unaffected by changes in location or scale. On a P-P plot, changes in location or scale do not necessarily preserve linearity.

  • On a Q-Q plot, the reference line representing a particular theoretical distribution depends on the location and scale parameters of that distribution, having intercept and slope equal to the location and scale parameters. On a P-P plot, the reference line for any distribution is always the diagonal line y=x.

Consequently, you should use a Q-Q plot if your objective is to compare the data distribution with a family of distributions that vary only in location and scale, particularly if you want to estimate the location and scale parameters from the plot.

An advantage of P-P plots is that they are discriminating in regions of high probability density, since in these regions the empirical and theoretical cumulative distributions change more rapidly than in regions of low probability density. For example, if you compare a data distribution with a particular normal distribution, differences in the middle of the two distributions are more apparent on a P-P plot than on a Q-Q plot.