Whether you need to analyze the results of clinical trials, predict credit card usage patterns, track endangered wildlife, model air pollution patterns or produce a sample of people to administer a health behavior survey, your challenges include the large amount of data you need to manage and finding the best methods for understanding that data. Statistics is the science of collecting, interpreting, and presenting data. It provides methods for analyzing and assessing the significance of data. Statistics enables the transformation of data into information that can then serve as the basis for decision-making.
In this lesson, you learn fundamental statistical concepts and techniques that you’ll use throughout the remainder of this course. First, you review some basic statistical concepts that help you better understand the nature of your data and which statistical methods you can apply to your data. Then you learn to produce descriptive statistics, including numeric summaries, histograms, normal probability plots and boxplots. You also learn some basic concepts of inferential statistics and how to calculate the standard error of the mean and confidence intervals for the mean. Finally, you learn the steps for conducting a hypothesis test in order to answer statistical questions about your data and draw conclusions about the population based on the sample data.
In this lesson, you learn to use descriptive statistics. In descriptive statistics, you examine the location, spread, and shape of the data’s distribution. This component of statistics is also referred to as exploratory data analysis, or EDA. For descriptive statistics, you learn to use PROC MEANS, PROC UNIVARIATE, and PROC SGPLOT. To calculate the standard error of the mean and confidence intervals for the mean, you learn to use PROC MEANS. To perform a one sample statistical hypothesis test, you learn to use PROC UNIVARIATE. To produce graphics, such as histograms, box plots, regression plots, and scatter plots, you learn to use PROC SGPLOT.
Let’s start by reviewing some statistical concepts that you need in order to perform basic statistical analyses. In this topic, you learn how to
distinguish between descriptive and inferential statistics
define populations and samples
distinguish between parameters and statistics
classify variables
explain other statistical concepts, including scale of measurement.
Descriptive statistics organize, describe, and summarize data using numbers and graphical techniques. This branch of statistics uses a set of standard measures such as percent, averages, and variability, as well as simple graphs, charts, and tables. Descriptive statistics help you to better understand your data by describing and summarizing its basic features. You learn how to generate and understand numerical summaries. These include frequency; measures of location, including minimum, maximum, percentiles, quartiles, and central tendency (mean, median, and mode); and measures of dispersion or variability, including range, interquartile range, variance, and standard deviation. The graphical summaries you learn include the histogram, normal probability plot, and box plot. The goals when you’re describing data are to:
screen for unusual data values,
inspect the spread and shape of your data,
characterize the central tendency, and
draw preliminary conclusions about your data.
Is the data as error free as possible? What unique features can you identify? Are there data values that cluster or show some unusual shape? Does your data include any possible outliers? When you have a basic understanding of your data, then you can use inferential statistics.
Inferential statistics is the branch of statistics concerned with drawing conclusions about a population from analysis of a random sample drawn from that population. It is also concerned with the precision and reliability of those inferences. Inferential statistics generalize from data you observe to the population that you have not observed. Descriptive statistics describe your sample data, but inferential statistics help you draw conclusions about the entire population of data. Descriptive statistics can also be referred to as exploratory data analysis, or EDA. Inferential statistics can also be called explanatory modeling.
Before beginning your analysis, you should use descriptive statistics to explore your data. After getting familiar with your data, you can use inferential statistics, or explanatory modeling, to draw conclusions about the population. You can also use predictive modeling to make predictions about future observations. Let’s briefly compare explanatory and predictive modeling.
In explanatory modeling, the goal is to develop a model that answers the question, how is X related to Y? Sample sizes are typically small and include few variables. The focus is on the parameters of the model. To assess the model, you use p-values and confidence intervals.
The goal of predictive modeling is to answer the question, if you know X, can you predict Y? Sample sizes are typically quite large and include many predictor variables, also called input variables. The focus is on the predictions of observations, rather than the parameters of the model. To assess a predictive model, you validate predictions using holdout sample data.
A population is the complete set of observations or the entire group of objects that you are researching. A sample is a subset of the population. You gather a sample so that you don’t have to obtain data for the entire population. The sample should be representative of the population, meaning that the sample’s characteristics are similar to the population’s characteristics. One way to obtain a representative sample is to collect a simple random sample. With this sampling method, every possible sample of a given size in the population has an equal chance of being selected. Random sampling can help to ensure that the sample is representative of the population. You should avoid collecting your sample from a section of the population that is easily available to you. This is called convenience sampling, and it can lead to a biased sample that is not representative of the population from which it is drawn. A sample that’s not representative can cause you to draw incorrect conclusions. Let’s look at an example.
Suppose a university wants to estimate the percent of its freshmen who plan to return for their sophomore year. The population for this study is the entire set of 2,500 freshmen in attendance. Researchers gathered a representative sample of 100 freshmen by selecting 100 student ID numbers at random from the entire set of 2,500 freshmen. If the researchers had simply selected the first 100 freshmen who responded to an e-mail questionnaire, this would have resulted in a biased sample. This could lead to an incorrect estimate of the number who plan to return for their sophomore year. If you have a representative sample, you can make correct inferences to the entire population. In this course, we always assume that the sample is representative. Click the Information button for information on how to generate random samples.
PROC SURVEYSELECT DATA=name-of-SAS-data-set
                  OUT=name-of-output-data-set
                  METHOD=method-of-random-sampling (SRS, URS)
                  SEED=seed-value (optional)
                  SAMPSIZE=number-of-observations-desired-in-sample;
RUN;
/* Program A */
proc surveyselect data=Statdata.cars      /* sample from data table */
                  seed=31475              /* recommended that you use this option */
                  method=srs              /* simple random sample */
                  sampsize=120            /* sample size */
                  out=work.CarSample12;   /* sample stored in this data set */
run;
proc print data=work.CarSample12; run;
Parameters are numerical values that summarize characteristics of a population. For example, the mean, or average, of some measurement on every individual in a population is a parameter. Parameter values are typically unknown, because you can’t usually measure your entire population. You use Greek letters to represent population parameters, for example, μ, σ, and ρ. Statistics summarize characteristics of a sample. You use letters from the English alphabet to represent sample statistics, for example, x̄, p̂, r, and s. You can measure characteristics of your sample and provide numerical values that summarize those characteristics. You use statistics to estimate parameters. Here is a table listing some common parameters and statistics.
                      Population Parameter    Sample Statistic
Mean                  μ                       x̄
Variance              σ²                      s²
Standard Deviation    σ                       s
Let’s look at these statistics in more detail and see how you can use them to estimate parameters. x̄, for example, is the sample mean or sample average. You can use x̄ to estimate the population mean or population average, μ. Suppose, for example, that you have a sample x1, x2, and so on, through xn from some population. You can calculate the mean for that sample using the formula shown here. x̄ = (Σxi) / n
The sample variance, s2, measures the variability of your sample around the mean. Sample variance gives you a specific measurement indicating how much your data values vary in comparison with the average value. You can calculate sample variance using the formula shown here.
s² = Σ(xi − x̄)² / (n − 1)
After you calculate your sample variance, you can use this statistic to estimate the population variance, σ².
The sample standard deviation is another common measure of variability. It is simply the square root of the variance. You calculate the sample standard deviation using the formula shown here.
s = √( Σ(xi − x̄)² / (n − 1) )
Because it is the square root of the variance, the resulting measure of variability will be in the same units as the data and, therefore, the same units as the mean. You can use this statistic to estimate the standard deviation for the population. What do we mean when we say that the sample standard deviation is a measure of variability reported in the same units as the data? Let’s look at an example.
Suppose that you are interested in knowing the average dollar amount people spend in a particular store. The unit of measurement is dollars. The data you gather and the mean you calculate from the data will be in dollars. The sample variance will be a measure of the spread in your data in dollars squared. Because the standard deviation is the square root of the variance, it puts the measure of spread back on the original dollar scale.
Variables are characteristics or properties of data that can take on different values or amounts for different individuals in the population. They are data attributes such as ID number, age, gender, and body temperature. Variables can be classified according to their function or the way they’re used in a study. A variable can be independent or dependent. An independent variable can take different values. An independent variable affects or determines a dependent variable. A dependent variable can take different values in response to an independent variable. In some contexts, a researcher selects or controls the value of an independent variable, in order to determine its relationship to the dependent variable.
For example, a researcher investigating the effect of fertilizer on plant growth could change the amount of fertilizer (the independent variable) in order to observe the effect on the plants (the dependent variable). In other contexts, however, the independent variable’s values are simply taken as given. For example, suppose a researcher is trying to determine the effect of a variable such as incarceration rate on crime rate. The researcher can’t manipulate the variable incarceration rate. It is simply observed. In this case, you might hear this variable referred to as a predictor variable. You might also hear this type of variable referred to as an explanatory, control, or input variable. A dependent variable is also known as a response, outcome, or target variable.
Variables are also classified according to their characteristics. They can be quantitative or categorical. In order to plan a statistical analysis or interpret your results, you need to know which types of variables you have. Data that consists of counts or measurements is called quantitative data. You also hear this type of data referred to as numerical data. If you can perform arithmetic operations, like addition and subtraction, or take a sample average of your data, then you know that it is quantitative. Suppose you take a survey of the buying habits of families. An example of quantitative data in your survey is the age in years of the respondents. Age is a quantitative variable because it would make sense to compute the average age of individuals in a sample.
Quantitative data can be further distinguished by two types: discrete and continuous. Discrete data consists of variables that can have only a countable number of values within a measurement range. That is, the values can be 0, 1, 2, 3, and so on. An example of discrete data is the number of children in a family. A family can have two or three children, but not 2.65 children. Continuous data consists of variables that are measured on a scale that has an infinite number of values and has no breaks or jumps. An example of a continuous variable is gas mileage. The gas mileage for a particular car might be 19 miles per gallon or 19.1 miles per gallon or 19.191034 miles per gallon, and so on. Remember that practical limitations can affect the precision of the measurement.
Categorical data consists of variables that denote groupings or labels. This type of data is also called attribute data. Categorical data can be distinguished from quantitative, because it does not make sense to perform arithmetic operations on categorical variables. For example, your survey includes a variable for the political party affiliation of survey respondents (Democrat, Republican, Independent, other). It doesn’t make sense to try to add or average the responses Republican and Democrat.
There are two main types of categorical variables: nominal and ordinal. A nominal categorical variable exhibits no ordering within its observed levels, groups, or categories. Gender is an example of a nominal variable. There is no ordering to the groups male and female. The type of beverage you can order from a menu, such as soda, coffee, or juice, has no logical ordering to it, so it is also a nominal variable. Nominal categorical variables can be coded to appear numeric, but their numbers are meaningless. For example, the variable Gender can be coded 1 for male and 2 for female. These numbers are not inherently meaningful: they could be reversed, or replaced, by any random set of numbers. A variable that lies on a nominal scale is sometimes called a qualitative or classification variable.
With ordinal categorical variables, the observed levels of the variable can be ordered in some meaningful way that implies that the differences between the groups or categories are due to magnitude. Disease condition divided into categories of low, moderate, or severe is an example of an ordinal variable. The size of beverage you can order from a menu being small, medium, or large does have a logical order to it, so it is also an ordinal variable.
As you’ve learned, variables are classified differently depending on the characteristics of that variable. We often refer to a variable’s classification as its scale of measurement. You need to know the scale of measurement for each variable in order to determine the statistical procedures appropriate for use with that variable. You already know two scales of measurement for categorical variables: nominal and ordinal. The nominal scale enables you to categorize or label variables such as gender or beverage type where there is no ordering to the levels of those variables. The ordinal scale indicates categories that can be ordered in a meaningful way, as in size of beverage or severity of disease.
There are two scales of measurement for continuous variables: interval and ratio. Data from an interval scale can be rank-ordered like ordinal data, but it also has a sensible spacing of observations such that differences between measurements are meaningful. For example, in measuring patient temperature, you can indicate specific differences in temperature, between the standard measurement of normal body temperature, 98.6 degrees F, and an observed body temperature of 98.2. Interval scales lack, however, the ability to calculate ratios between numbers on the scale. In the case of the Fahrenheit scale, for example, there is no true zero point. Zero does not imply the lack of temperature. Another example of an interval scale is pH value. Sea water, which has a pH of 8, is not twice as alkaline as tomato juice, which has a pH of 4.
Data on a ratio scale is not only rank-ordered with meaningful spacing, but it also includes a true zero point and can therefore accurately indicate the ratio of difference between two spaces on the measurement scale. For example, the Kelvin temperature scale has a true zero point. A temperature of 50 Kelvin is half as hot as 100 Kelvin. Another example of a ratio scale is money. If an individual has zero dollars, this does imply an absence of money. And one individual can have twice as much money as another.
The appropriate statistical method for your data also depends on the number of variables involved. Univariate analysis provides techniques for analyzing and describing a single variable at a time. Univariate analysis reveals patterns in the data, by looking at the range of values, measures of dispersion, the central tendency of the values, and frequency distribution. It also summarizes large amounts of data and organizes data into graphs and tables so that it is more easily understood.
Bivariate analysis describes and explains the relationship between two variables and how they change, or covary, together. It includes techniques such as correlation analysis and chi-square tests of independence.
Multivariate or multivariable analysis examines two or more variables at the same time, in order to understand the relationships among them. Techniques such as multiple linear regression and n-way ANOVA are typically called multivariable analyses because there is only one response variable. Techniques such as factor analysis and clustering are typically called multivariate analysis because they consider more than one response variable. Multivariate linear regression and multivariate ANOVA (MANOVA) are extensions of these techniques when there is more than one response variable. You learn many of these statistical methods in this course.
Now that you understand some basic concepts about variables and descriptive statistics, let’s apply that knowledge by looking at an example.
As a project, students in Ms. Chao’s statistics course gathered data on the SAT scores of high school students who attend magnet high schools in Carver County. The purpose of the study is to determine if these students achieved an average combined SAT score of 1200 in math and reading, which is the goal set by the school board. The students in Ms. Chao’s class selected 80 students at random and recorded their test scores in a data file named TestScores.
TestScores
Gender    SATScore    IDNumber
Male      1170        61469897
Female    1090        33081197
Male      1240        68137597
Female    1000        37070397
Male      1210        64608797
Female     970        60714297
Male      1020        16907997
Female    1490         9589297
Male      1200        93891897
Female    1260        85859397
The TestScores data includes three variables. The first variable is the gender of each student. The second variable is each student’s cumulative SAT score. The last variable is IDNumber. This is a unique identifier assigned to each student.
Now that you’ve examined the three variables in the TestScores data, you need to explore the particular values in this data. You should check for unusual data points or outliers. Are there any test scores that are unusually high or low? Are there any points that might be data entry errors? Having a unique ID number for each student is helpful if you notice any outliers. You can then look back at your data and try to understand why that point is unusual.
empphones
First    Gender    EmpID     Type    Phone
Togar    M         121150    Home    +61(2)5555-1793
Togar    M         121150    Work    +61(2)5555-1794
Kylie    F         121151    Home    +61(2)5555-1849
Birin    M         121152    Work    +61(2)5555-1850
Birin    M         121152    Home    +61(2)5555-1665
Birin    M         121152    Cell    +61(2)5555-1666
Before you move on to statistical analysis of your data, you must ensure that it is as error-free as possible. For example, you should identify unique aspects of the data such as data values that cluster or show some unusual shape. If there is a possible outlier in a variable, and it goes undetected, it could cause gross errors in your interpretation of the statistics. You can use descriptive statistics to look at the distribution of your data, checking the range, frequency, and shape of your data. Then you can draw some preliminary conclusions about your data.
Now you’re ready to learn how to use descriptive statistics to better understand your data. Here’s what you learn in this topic:
explain the basics of descriptive statistics (why, when, and how do I use them?)
describe what distributions can tell you about your data
use the MEANS procedure to produce descriptive statistics.
To summarize your data, you can construct a table of frequencies of the different values in the data. You refer to the frequency table as the distribution of your data. The distribution of your data tells you what values your data takes and how often it takes those values.
Chart: Vertical bar chart (histogram) with each bar representing the number of times that data value occurs in the data.
When you look at a data distribution, you can learn several things: You can look at the range of data values and demonstrate the data’s spread or dispersion. You can verify the number of times each value appears by looking at data frequency. You can establish the shape of your data by creating a histogram to determine whether the data is symmetric
Chart: Histogram with the highest bars in the center and an equal number of bars on both sides, gradually decreasing by the same amount. A curved line touches the top corner of each bar; the line looks like a hill with symmetrical drops on each side that gradually level out until nearly flat. This is called a normal distribution.
or skewed,
Chart: Histogram overlaid with a curve shaped like a hill whose highest point is shifted to the right, so that the right side of the curve drops off quickly and the left side decreases gradually.
meaning that the distribution has a longer tail either to the left or the right. You can check for outliers.
You can calculate descriptive statistics that measure locations in your data. Statistics that locate the center of the data are called measures of central tendency. These include mean, median, and mode. The mean is the average of all data values. The mean is obtained by adding all the values in the data and dividing by n, the number of values. The formula is shown here. x̄ = (1/n) Σxi Let’s take a look at an example. Suppose you have the set of 16 data values shown in this example.
93 89 88 84 83 82 79 78 78 77 74 73 72 68 67 63
To calculate the mean, you add the 16 values together and divide by 16. A property of the sample mean is that the sum of the differences of each data value from the mean is always 0. This is represented by the formula shown here.
Σ(xi − x̄) = 0
The mean is useful for measuring the center of your data when the data is balanced on both sides.
Chart: The same symmetric, bell-shaped histogram (normal distribution) shown earlier.
However, the mean is highly influenced by outliers because those unusual values are included in its calculation. If the data in this example also included the value 35, for example, the mean would change from 78 to 75.47.
Chart: Histogram overlaid with a curve whose highest point is shifted to the right, so that the right side of the curve drops off quickly and the left side decreases gradually.
The median is another measure of central location. The median is less sensitive to the presence of outliers. The median is the middle value in the data when the data is ordered. If you have an odd number of values in your data, the median is just the middle value. If you have an even number of values in your data, the median is the average of the two middle values. In this example, with the value 35 included, there are 17 observations, so the median is just the middle value. Be aware that this is the most typical way to calculate the median, but not the only one.
The mode is the data value that occurs the most. The mode in this example is 78. Be aware that the mode is not as informative in small data files. Thus, most statisticians simply compare the mean and the median when exploring their data.
Percentiles are descriptive statistics that give us reference points in our data. A percentile is the value of a variable below which a certain percentage of observations fall. The most commonly reported percentiles are quartiles, which break the data up into quarters.
Chart: Histogram that is divided into 4 areas.
In our example, the 25th percentile is 72.5, which means that 25% of the data values fall at or below 72.5. You can also refer to the 25th percentile as the 1st quartile, Q1, or the lower quartile. The 50th percentile in this example is 78, which means that 50% of the data values fall at or below 78. You can also refer to the 50th percentile as the median, Q2, or the middle quartile. The 75th percentile is 83.5, which means 75% of the data values fall at or below 83.5. You can also refer to the 75th percentile as the 3rd quartile, Q3, or the upper quartile.
93      (100th percentile / quartile 4)
89
88
84
(83.5 = 75th percentile / quartile 3)
83
82
79
78
(78 = 50th percentile / quartile 2, the median)
78
77
74
73
(72.5 = 25th percentile / quartile 1)
72
68
67
63
There are several descriptive statistics that measure the spread or dispersion of your data. Statisticians also refer to this characteristic as variability. The range of the data is a single value that measures the difference between the maximum and minimum values. In this example, the range is 30, that is, 93 minus 63.
The interquartile range, also known by the abbreviation IQR, is the difference between the 25th and 75th percentiles. In this example, the interquartile range is 83.5 minus 72.5, which is 11.0. The interquartile range is a robust estimate of the variability because changes in the upper and lower 25% of the data do not affect it. If there are outliers in the data, then the IQR is a more reliable measure of the spread than the overall range.
Variance is a measure of variability of the data around the mean. It is defined as the average squared difference of the observations from the mean. The formula for sample variance is shown here.
s² = Σ(xi − x̄)² / (n − 1)
Click the information button on the course interface if you’d like to see how to calculate the variance for this example.
Standard deviation indicates how much variation there is from the mean, thereby measuring how spread out your data is. Standard deviation is the square root of the variance, as shown in this formula.
s = √( Σ(xi − x̄)² / (n − 1) )
So for this example, the standard deviation equals 8.36. Because standard deviation is the square root of the variance, it is expressed in the same units of measurement as your data and therefore the mean. A low standard deviation indicates that the data points tend to be very close to the mean. A high standard deviation indicates that the data is spread out over a larger range of values.
Another measure of dispersion in your data is the coefficient of variation, also referred to by the abbreviation C.V. The coefficient of variation is a measure of the standard deviation expressed as a percentage of the mean. It is useful when comparing data that has different units of measure, for example, weight and height. It is a way of standardizing the units of measure for comparison. The formula is shown here.
(SD/Mean) x 100
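To tie these measures together, here is a minimal sketch (the data set name scores and the variable name score are made up for illustration) that reads the 16 example values and requests the location and dispersion statistics discussed above from PROC MEANS. With SAS’s default percentile definition, the output should agree with the values worked out in this topic: mean 78, median 78, mode 78, range 30, Q1 72.5, Q3 83.5, interquartile range 11, variance about 69.87, and standard deviation about 8.36.

data scores;                 /* hypothetical data set, for illustration only */
   input score @@;           /* read all 16 values from one line */
   datalines;
93 89 88 84 83 82 79 78 78 77 74 73 72 68 67 63
;

proc means data=scores maxdec=2
           n mean median mode range qrange q1 q3 var std cv;
   var score;
run;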
As a project, students in Ms. Chao’s statistics course must assess whether or not magnet school students in their district have accomplished the goal set by the school board of an average combined SAT score of 1200 in math and reading. The SAT math and reading sections have a maximum combined score of 1600. The students in Ms. Chao’s statistics course selected 80 students at random from magnet school students in the district, recorded the test scores of those 80 students, and assigned each sample member an identification number. Click the information button to view the sample data.
Before the class can answer the question Ms. Chao has set for them, the class should explore the data by calculating descriptive statistics, including the sample mean and standard deviation.
Let’s look at how to use the MEANS procedure to generate the descriptive statistics needed for Ms. Chao’s class project. You can use the MEANS procedure in SAS to summarize and generate descriptive statistics for your data. PROC MEANS calculates a standard set of statistics, including the minimum, maximum, and mean data values, as well as standard deviation and n, which is the number of non missing values in the sample. You can also specify options that provide additional statistics. Here’s the syntax.
PROC MEANS DATA=SAS-data-set <options> <statistic-keyword(s)>;
        CLASS variable(s);
        VAR variable(s);
RUN;
In the PROC MEANS statement, you specify your input data set. Then you indicate any optional statistics. Other statistics that you can specify include the MEDIAN, MODE, variance (VAR), lower quartile (Q1), upper quartile (Q3), RANGE, and interquartile range (QRANGE). Note that if you do specify any statistics in the PROC MEANS statement, they will override the default. Thus you need to specify the entire list of statistics you want produced in the options area of the PROC MEANS statement, if you choose to list any. In the CLASS statement, you specify the variable that the procedure uses to group the data. In the VAR statement, you list the analysis variables for which SAS calculates descriptive statistics. If you list no VAR statement, SAS analyzes all numeric variables in the data set.
Here’s the code that you could use to generate the descriptive statistics needed for Ms. Chao’s class project.
proc means data=statdata.testscores maxdec=2 fw=10 printalltypes;
   class Gender;
   var SATScore;
   title 'Descriptive Statistics Using PROC MEANS';
run;
First, in the PROC MEANS statement, you specify the input data set, in this case, TestScores. You can also specify options that will control which statistics SAS generates and what the output will look like. MAXDEC=2 specifies a maximum of two decimal places for numeric values. FW=10 specifies that the field width for all columns is 10. The PRINTALLTYPES option displays statistics for all requested combinations of class variables, that is, for each level or occurrence of the variable and for all occurrences combined. In the CLASS statement, you specify Gender as the variable that PROC MEANS will use to group the data. This is the variable referred to in the PRINTALLTYPES option. Because the Gender variable has two possible values, Male and Female (in this case), the PRINTALLTYPES option will result in three sets of descriptive statistics: one set for Male, one set for Female, and a third set for all values. In the VAR statement, you identify the analysis variable, in this case, SATScore. This code also includes an optional TITLE statement so that SAS identifies your output appropriately. You end the MEANS procedure with a RUN statement.
Before we run the MEANS procedure that we’ve just learned, let’s use PROC PRINT to take a look at our raw data. In the PROC PRINT statement, we specify the input data set, indicating with the OBS= option that we wish to view only 10 observations, or rows, of data. The TITLE statement specifies the output heading.
proc print data=statdata.testscores (obs=10);
   title 'Listing of the SAT Data Set';
run;
title;
Obs    Gender    SATScore    IDNumber
  1    Male      1170        61469897
  2    Female    1090        33081197
  3    Male      1240        68137597
  4    Female    1000        37070397
  5    Male      1210        64608797
  6    Female     970        60714297
  7    Male      1020        16907997
  8    Female    1490         9589297
  9    Male      1200        93891897
 10    Female    1260        85859397
When we submit the program, we can see the first 10 observations in our data set. The first column lists the observation numbers. The next three columns contain three variables: the gender of the student, the student’s SAT score, and the unique ID number.
Now that we’ve confirmed the basic structure of our data, let’s use PROC MEANS to generate descriptive statistics for the variable SATScore.
proc means data=statdata.testscores maxdec=2 fw=10 printalltypes;
   class Gender;
   var SATScore;
   title 'Descriptive Statistics Using PROC MEANS';
run;
title;
When we run the program, we can see how using the CLASS statement produced separate statistics for all 80 observations, then for just females and then just males.
Analysis Variable : SATScore
N Obs    N     Mean       Std Dev    Minimum    Maximum
   80    80    1190.63    147.06     890.00     1600.00
Analysis Variable : SATScore
Gender    N Obs    N     Mean       Std Dev    Minimum    Maximum
Female       40    40    1221.00    157.40     910.00     1590.00
Male         40    40    1160.25    130.92     890.00     1600.00
It is interesting to see that the mean SATScore for all 80 students is 1190.6. When we look at just females, the mean is 1221, and the mean for males is 1160. Because we didn’t specify which statistics we wanted, SAS generated the default statistics for PROC MEANS. The output provides the sample size (that is, the number of non-missing values), the mean, standard deviation, the minimum, and the maximum values of SATScore.
Now let’s modify our PROC MEANS statement and specify some different statistics. Remember that SAS will override the default statistics, so we must specify all statistics that we want included in the output. Let’s request the sample size (N), the mean and median values, standard deviation and variance, in addition to the lower quartile (Q1) and upper quartile (Q3). Let’s go ahead and submit our program.
proc means data=statdata.testscores maxdec=2 fw=10 printalltypes
           n mean median std var q1 q3;
   class Gender;
   var SATScore;
   title 'Selected Descriptive Statistics for SAT Scores';
run;
title;
Analysis Variable : SATScore
N Obs    N     Mean       Median     Std Dev    Variance    Lower Quartile    Upper Quartile
   80    80    1190.63    1170.00    147.06     21626.19    1085.00           1280.00
Analysis Variable : SATScore
Gender    N Obs    N     Mean       Median     Std Dev    Variance    Lower Quartile    Upper Quartile
Female       40    40    1221.00    1215.00    157.40     24773.33    1100.00           1315.00
Male         40    40    1160.25    1145.00    130.92     17140.96    1050.00           1240.00
Notice that the output is still broken out with all of the observations and then by gender because we specified Gender in the CLASS statement and PRINTALLTYPES in the PROC MEANS statement. Now we can see the values of the mean compared to the median. So, although the average score for all students was 1190.63, the median or middle value in the data was 1170. We have the values for variance as well as standard deviation. We’ll talk more about how these values relate to the data later. This output also includes the values for the 25th and 75th percentiles. These figures tell us that 25% of the data values for all students fell below 1085 and that the top 25% were above 1280.
Another way to understand your data is to display it using graphs and plots. In this topic you learn how to
look at the distribution of continuous variables
describe the normal distribution
use PROC UNIVARIATE to generate descriptive statistics, including histograms and normal probability plots.
There are several ways to visualize the distribution of your data. One of the most effective techniques is to plot a histogram.
Chart: Vertical bar chart (histogram) with each bar representing the percentage of times that data value occurs in the data.
A histogram is a visual representation of the frequency distribution of your data. The frequencies are represented by bars. Each bar in the histogram represents a group of values, sometimes referred to as a bin. The heights of the bars represent the frequency or count of values in the bins. In this example, the height of the bar represents a percentage value.
The normal distribution is a common theoretical distribution in statistics. The normal distribution is shaped like a bell, with values concentrated near the mean. Underlying the normal distribution is a mathematical function called the probability density function. The height of the function at any point on the horizontal axis is the probability density at that point. The shape of a normal distribution depends on the value of two parameters, the mean (μ) and the standard deviation (σ). The mean (μ) determines the center of the distribution. The normal distribution is symmetric around the mean. The standard deviation (σ) determines how variable the distribution is. A larger standard deviation implies a wider normal distribution.
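For reference, the probability density function that produces this bell shape can be written as f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)). The curve peaks at x = μ, and larger values of σ spread the curve out.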
Chart: Vertical bar chart (histogram) with a curve touching the top of each bar; the highest part of the curve is in the middle, and the curve falls symmetrically on each side, tapering outward near the bottom of the histogram. At the bottom of the histogram, the symbol for the mean is located in the middle, the symbols for the mean plus and minus 2 standard deviations appear about midway between the mean and the right and left edges, respectively, and the symbols for the mean plus and minus 3 standard deviations appear far to the right and left of the mean, respectively.
You can calculate probabilities using the normal probability density function by calculating areas under the bell-shaped curve. Approximately 68% of the area under the bell curve falls within 1 standard deviation of the mean. This means that approximately 68% of the data values fall within this area. Approximately 95% of the area under the curve falls within 2 standard deviations of the mean. This means that approximately 95% of data values fall within this area. Finally, approximately 99% of the area under the curve falls within 3 standard deviations of the mean, meaning that approximately 99% of the data values fall within this area.
Statisticians often consider values that are more than 2 standard deviations from the mean as unusual. Now you can see why. Only about 5% of all values are that far away from the mean. Depending on the context, sometimes statisticians treat only values more than 3 standard deviations away from the mean as unusual. Because the normal distribution has many useful mathematical properties, statistical procedures for data based on a random sample often assume the normal distribution. So it’s important to know how to check this assumption for your data.
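You can verify these areas directly in SAS. This minimal sketch uses a DATA _NULL_ step and the PROBNORM function, which returns the cumulative probability of the standard normal distribution, to compute the area within 1, 2, and 3 standard deviations of the mean.

data _null_;
   within1 = probnorm(1) - probnorm(-1);   /* approximately 0.6827 */
   within2 = probnorm(2) - probnorm(-2);   /* approximately 0.9545 */
   within3 = probnorm(3) - probnorm(-3);   /* approximately 0.9973 */
   put within1= within2= within3=;         /* write the results to the log */
run;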
To check the assumption that your random sample has a normal distribution, it can be useful to plot a histogram. The sample of data in this histogram is from a normal distribution.
Chart: Vertical bar chart (histogram) with a curve touching the top of each bar; the highest part of the curve is in the middle, and the curve falls symmetrically on each side, tapering outward near the bottom of the histogram. At the bottom of the histogram, the symbol for the mean is located in the middle, the symbols for the mean plus and minus 2 standard deviations appear about midway between the mean and the right and left edges, respectively, and the symbols for the mean plus and minus 3 standard deviations appear at the right and left edges, respectively. A legend on the chart shows a skewness of -0.0073 and a kurtosis of -0.1700.
The grey curve in this diagram represents the normal distribution. You can see that the histogram follows the shape of the curve fairly closely. In addition to plotting a histogram, you can also look at statistical summaries of your data in order to check the assumption that your data is normally distributed. Two such summaries are skewness and kurtosis. Skewness and kurtosis measure certain aspects of the shape of a distribution. The closer these statistics are to 0 for your sample, the closer your data is shaped like the normal distribution.
Skewness measures the tendency of your data to be more spread out on one side of the mean than on the other. That is, it is a measure of the asymmetry of the distribution. Here are examples of two skewed distributions.
Chart: Two vertical bar charts (histograms), each with a curve touching the top of each bar. In the left graph, the highest point of the curve is to the right of the middle; in the right graph, the highest point of the curve is to the left of the middle.
The graph on the left is an example of a left-skewed distribution. The graph on the right is an example of a right-skewed distribution. You can think of the direction of skewness as the direction the data is trailing off to. The closer the skewness statistic is to 0, the more normal or symmetric the data. When the statistic is negative, the data is left-skewed. When the statistic is positive, the data is right-skewed. A left-skewed distribution tells us that the mean is less than the median. A right-skewed distribution tells us that the mean is greater than the median.
Kurtosis measures the tendency of your data to be concentrated toward the center or toward the tails of the distribution. It is sometimes described as a measure of the peakedness of the data, reflecting the shape of the peak relative to the rest of the distribution. It is basically a measure of tail thickness. The closer the kurtosis statistic is to 0, the closer the tails of your data resemble the tail thickness of the normal distribution.
Be aware that the normal distribution actually has a kurtosis value of positive 3. SAS reports kurtosis standardized to zero (the kurtosis minus 3, sometimes called excess kurtosis), so that when you are assessing normality, the values against which to compare the sample skewness and kurtosis are the same.
A negative kurtosis statistic means that the data has lighter tails than in a normal distribution. Data is less heavily concentrated about the mean. You refer to this type of distribution as platykurtic.
Chart: Vertical bar chart (histogram) with the shortest bars in the middle, the highest bars on both sides of the middle, and few or no bars at the ends. The curve representing the distribution is highest in the middle but much flatter than a normal distribution, and it flattens and extends out along the bottom of the histogram in both tails. The kurtosis value is -1.9289.
In the example above, the platykurtic distribution is symmetric. A symmetric platykurtic distribution is characterized by being flatter than the normal distribution, that is, less peaked, with heavier flanks and thinner tails. A distribution with two peaks is also referred to as a bimodal distribution. Rectangular, bimodal, and multimodal distributions tend to have low values of kurtosis.
A positive kurtosis statistic means that the data has heavier tails and is more concentrated about the mean than a normal distribution.
Chart: Vertical bar chart (histogram) with the highest bars just to the left of the middle. The bar heights drop quickly and are concentrated near the middle. The curve representing the distribution appears flatter than a normal distribution, and its tails drop to just above the base of the histogram and extend much farther than those of a normal distribution. The kurtosis value is 6.5557.
If you were to zoom in on the histogram shown above, you’d see that the data extends well beyond 2 standard deviations of the mean. You refer to this type of distribution as leptokurtic. A leptokurtic distribution is often referred to as heavy-tailed and might sometimes also be referred to as an outlier-prone distribution.
If the distribution is symmetric, a leptokurtic distribution tends to have a higher peak than the normal, with an excess of values near the mean and in the tails, but with thinner flanks. Distributions that are asymmetric also tend to have nonzero kurtosis. In these cases, understanding kurtosis is considerably more complex than in situations where the distribution is approximately symmetric. Kurtosis can be difficult to assess visually.
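As a rough illustration (not part of the course data; the data set name, seed, and distributions below are chosen only for this sketch), you could simulate a right-skewed sample and a heavier-tailed sample and request the skewness and kurtosis statistics from PROC MEANS. The right-skewed column should show a clearly positive skewness, and the heavy-tailed column a clearly positive kurtosis.

data skewkurt;
   call streaminit(27513);                 /* arbitrary seed so results can be reproduced */
   do i = 1 to 1000;
      right_skewed = rand('exponential');  /* long right tail */
      heavy_tailed = rand('t', 5);         /* heavier tails than the normal */
      output;
   end;
   drop i;
run;

proc means data=skewkurt maxdec=2 n mean std skewness kurtosis;
   var right_skewed heavy_tailed;
run;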
In addition to plotting your data using a histogram, there are other ways to visualize and assess the distribution of your data. You can use normal probability plots and box plots to compare your data with the normal distribution. A normal probability plot is a visual method for determining whether or not your data comes from a distribution that is approximately normal.
Chart: Scatter plot with a diagonal reference line starting in the lower left corner and extending to the upper right corner and the data values are clustered along this line.
The vertical axis represents the actual data values, and the horizontal axis displays the expected percentiles from a standard normal distribution. The normal reference line along the diagonal is where your data would fall if it were perfectly normal. The points hovering around the reference line are the actual data.
Let’s look at some sample plots.
There are 5 plots in this image. The first plot shows data values clustered along the diagonal reference line, which extends from the bottom left corner of the plot to the upper right corner. The second plot shows the data points in a long curve below the diagonal reference line. The third plot shows the data points in a long curve above the diagonal reference line. The fourth plot shows the data points starting at the top of the reference line and curving above it until, at the middle of the reference line, they drop below it in a curve that touches the reference line at the bottom. The fifth plot shows the data points starting at the top of the reference line and curving below it, then, at the middle of the reference line, curving above it until they touch the bottom of the reference line.
In the first plot, the distribution of data points follows the normal reference line quite closely. Thus, this data distribution looks fairly normal. In the second plot, the distribution of data is skewed to the right of the reference line. In the third plot, the distribution of data is skewed to the left of the reference line. The fourth plot shows a light-tailed distribution. Remember that the vertical axis is your ordered response values, so the mean of your data is a horizontal line running approximately through the middle of the plot. If you examine your data points paying particular attention to the tails, the data in the tails looks like it is being squashed from the top and the bottom, toward the mean. In other words, the data points are closer to the mean than they should be if they were normal, meaning that the tails of the distribution are light. The fifth plot shows a heavy-tailed distribution. If you examine your data points paying particular attention to the tails, the data in the tails looks like it is being stretched in the vertical direction, away from the mean. In other words, the data points are further from the mean than they should be if they were normal, meaning the tails of the distribution are heavy. The main point here is that the non-normal data distributions in the second through fifth examples are visually very obvious.
Another way to examine the distribution of your data is to create a box plot.
This type of graph makes it easy to see how spread out your data is and if there are any outliers. The box represents the middle 50% of your data, which is the interquartile range. The bottom of the box indicates the 25th percentile. The middle line represents the 50th percentile, or the median, and the top line indicates the 75th percentile. The diamond denotes the mean. Thus, you are able to visualize very easily how close the mean is to the median.
Chart: Horizontal and vertical axes with an elongated rectangular box in the middle; the box is about four times as long as it is wide. A straight line starts at the middle of the top of the box and extends to almost the top of the plot (like a whisker). Another straight line, not quite as long as the top line, extends from the middle of the bottom of the box to almost the bottom of the plot (another whisker). A dotted line goes across the middle of the box and is labeled the 50th percentile, or median. Another dotted line goes across the plot at the top of the box and is labeled the 75th percentile. A dotted line also goes across the plot at the bottom of the box and is labeled the 25th percentile. A diamond shape just above the middle of the box is labeled the mean.
The whiskers extend from the box as far as the data extends, to a maximum length of 1.5 times the interquartile range (IQR) beyond the 25th and 75th percentiles. This is referred to as 1.5 interquartile units. Any data points farther than 1.5 interquartile units away from the box are considered possible outliers and are represented in this plot as circles. You get a rough impression of the symmetry of your distribution by comparing the mean and median, as well as by assessing the symmetry of the box and whiskers around the median line. This box plot shows a data distribution that is approximately symmetric.
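As a quick arithmetic example of this rule, consider the quartiles reported earlier for all 80 SAT scores: Q1 = 1085 and Q3 = 1280, so the IQR is 195 and 1.5 × IQR = 292.5. The whiskers could therefore extend no lower than 1085 − 292.5 = 792.5 and no higher than 1280 + 292.5 = 1572.5, and any score outside that interval would be plotted as a possible outlier.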
In assessing whether your data comes from a normal distribution, you should consider the following points. Compare the mean and the median. If they are nearly equal, that is an indicator of symmetry, which is one requirement for normality. Check if the skewness and kurtosis statistics are close to 0. If the mean and the median coincide and the skewness and kurtosis statistics are close to 0, it is likely that your data comes from a normal distribution. As a loose rule of thumb, if either the skewness or kurtosis statistic is greater than 1 or less than -1, many researchers conclude that the data is not normal. Other researchers conclude that the data is not normal if either statistic is greater than 2 or less than -2. It’s also useful to visually assess your data by creating histograms, box plots, and normal probability plots.
You know several ways to visualize the distribution of your data using histograms, normal probability plots, and box plots. Let’s apply these techniques to an example. You need to assess SAT score data for a sample of 80 students from magnet schools in Carver County. Before you can answer specific questions about this data, you need to ensure that you understand the distribution and other features of the SAT score data.
You can use descriptive statistics, including graphical techniques, to look at the distribution of your data, checking the range, frequency, and shape of your data. Does it exhibit a normal distribution? What are the measures of the mean and median, skewness and kurtosis, and what do those statistics tell us about the data? Also, you should check for unusual data points or outliers. Are there test scores that are unusually high or low, or are there any data points that might be data entry errors? After you complete this assessment, you’re ready to answer specific questions about this data by applying additional statistical techniques. Click the information button if you’d like to review the sample data.
You can use PROC UNIVARIATE to generate descriptive statistics, including skewness and kurtosis, quantiles or percentiles, frequency tables and extreme values. It also generates histograms and normal probability plots that assist you in assessing the distribution of your data.
Here’s the syntax.
PROC UNIVARIATE DATA=SAS-data-set;
        VAR variable(s);
        ID variable(s);
        HISTOGRAM variable(s) < / options>;
        INSET keyword(s) < / options>;
        PROBPLOT variable(s) < / options>;
        INSET keyword(s) < / options>;
RUN;
In the PROC UNIVARIATE statement, you specify your SAS data set. In the VAR statement, you specify the analysis variables. If you do not include a VAR statement, SAS analyzes all numeric variables in the data set. Following the VAR statement you can provide an ID statement. Here you list the variable or variables that SAS should label in the table of extreme observations and identify as outliers in the graphs. By default, SAS displays the five lowest and five highest observations in the table of extreme observations. Then in the HISTOGRAM statement, you list the variable or variables you want histograms created for. You can add additional options to the HISTOGRAM statement. The NORMAL option creates a normal curve overlay to the histogram using the estimates of the population mean and standard deviation. In the PROBPLOT statement, you list the variables you want probability plots created for. You can add additional options to the PROBPLOT statement. The NORMAL option creates a diagonal reference line based on the estimates of the population mean and standard deviation. You use the INSET statement to create a box of summary statistics directly on the graphs. The INSET statement must follow the line of code that creates the plot you want augmented. Specify INSET, followed by keywords for the summary statistics you want included. You can add additional options, such as specifying the placement of the box in the graph window.
Here’s the code for using PROC UNIVARIATE to produce descriptive statistics, a histogram, and a normal probability plot for the TestScores data.
proc univariate data=statdata.testscores;
   var SATScore;
   id IDNumber;
   histogram SATScore / normal(mu=est sigma=est);
   inset skewness kurtosis / position=ne;
   probplot SATScore / normal(mu=est sigma=est);
   inset skewness kurtosis;
   title 'Descriptive Statistics Using PROC UNIVARIATE';
run;
In the PROC UNIVARIATE statement, you specify the TestScores data set. And in the VAR statement, identify SATScore as the analysis variable. The ID statement specifies that SAS will use the IDNumber variable as a label in the table of extreme observations and as an identifier for any extreme observations in the graphs. In the HISTOGRAM statement, you specify that you want SAS to create a histogram of the continuous variable SATScore with a normal curve superimposed using estimates of the population mean and the population standard deviation. Note that SAS determines the width and number of bins automatically. The INSET statement with the option POSITION=NE specifies that SAS prints a box with the skewness and kurtosis statistics in the northeast corner of the histogram. The PROBPLOT statement creates a normal probability plot of SATScore with the overlay of a normal reference line created using estimates of the population mean and the population standard deviation. A second INSET statement with no option added specifies a box with the skewness and kurtosis statistics in the default position of the plot, which is the northwest corner.
In addition to the statistical graphics available to you with PROC UNIVARIATE, you might want to use PROC SGSCATTER, PROC SGPLOT, PROC SGPANEL, and PROC SGRENDER to produce a wide variety of additional plot types.
You can use PROC SGSCATTER to produce several types of scatter plots. You can create a single-cell (or a simple Y by X) scatter plot, a multi-cell scatter plot with multiple independent scatter plots in a grid, and a scatter plot matrix, which produces a matrix of scatter plots comparing multiple variables.
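As a brief sketch of two of these plot types (using the SASHELP.CLASS practice data set that ships with SAS, because the TestScores data contains only one continuous analysis variable), the PLOT statement requests a single-cell scatter plot and the MATRIX statement requests a scatter plot matrix.

proc sgscatter data=sashelp.class;
   plot Weight*Height;            /* single-cell scatter plot of Weight by Height */
run;

proc sgscatter data=sashelp.class;
   matrix Height Weight Age;      /* scatter plot matrix of three variables */
run;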
Using PROC SGPLOT, you can produce a wide variety of plot types, including scatter plots, line graphs, histograms with overlaid distribution curves, regression lines with confidence and prediction bands, dot plots, box plots, bar charts, and so on. You can also overlay plots together to produce many different types of graphs. Plus, you can use statements and options to control graph appearance and add features such as legends and reference lines.
Using PROC SGPANEL, you can produce panels of plots for different levels of a factor or several different time periods, depending on the classification variable. For example, you can create several bar charts side by side, each representing a different quarter of the year. You can also produce side-by-side histograms, one representing males and the other representing females for whatever type of data you are looking at. This enables you to quickly provide a visual comparison for your data.
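For example, a minimal sketch of the side-by-side histograms described here, using the TestScores data, might look like the following; the PANELBY statement names the classification variable that defines the panels.

proc sgpanel data=statdata.testscores;
   panelby Gender;                /* one panel per value of Gender */
   histogram SATScore;            /* a histogram of SATScore in each panel */
run;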
PROC SGRENDER enables you to create plots from graph templates you have modified or written yourself. For more information about each of these procedures, consult the SAS documentation.
Let’s look at the SGPLOT procedure in detail. Here’s the basic syntax.
PROC SGPLOT DATA=SAS-data-set <options>;
        plot-statement(s);
RUN;
In the PROC SGPLOT statement, you first specify your input data set and any options. You can then use the DOT, HBAR, and VBAR statements to generate dot plots, horizontal bar charts, and vertical bar charts that summarize the values of a category variable. You can generate several other types of distribution plots using PROC SGPLOT: histograms, box plots, and density curves. You use the HBOX statement to create a horizontal box plot that shows the distribution of your data. The VBOX statement generates a vertical box plot. The HISTOGRAM statement creates a histogram that displays the frequency distribution of a continuous variable.
You can use PROC SGPLOT to generate other basic plots as well: scatter plots, series plots, band plots, needle plots, and vector plots. You can use the SCATTER statement to create a scatter plot of two continuous variables. The NEEDLE statement creates a plot with needles connecting each point to the baseline. The REG statement generates a fitted regression line or curve. You use a REFLINE statement to create a horizontal or vertical reference line on the plot.
REFLINE variable | value-1 <… value-n> ;
If you are creating a horizontally oriented plot, the reference line will be vertical. And if the plot is vertical in orientation, the reference line will be horizontal. The REFLINE statement can be placed anywhere in the procedure.
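Here is a sketch that combines the SCATTER, REG, and REFLINE statements. It uses the sashelp.class sample data that ships with SAS, because the point is the syntax rather than the course data, and the reference value 100 is arbitrary:

proc sgplot data=sashelp.class;
   scatter x=Height y=Weight;                       /* scatter plot of two continuous variables */
   reg x=Height y=Weight / nomarkers;               /* fitted regression line without repeating the markers */
   refline 100 / axis=y lineattrs=(pattern=dash);   /* horizontal reference line at Weight=100 */
run;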
Let’s use PROC SGPLOT to create a vertical box plot of the variable SATScore for the TestScores data.
proc sgplot data=statdata.testscores;
   refline 1200 / axis=y lineattrs=(color=blue);
   vbox SATScore / datalabel=IDNumber;
   format IDNumber 8.;
   title "Box Plots of SAT Scores";
run;
In the PROC SGPLOT statement, you specify the TestScores data set. The REFLINE statement with additional options creates a blue horizontal reference line (on the y axis) at the data value 1200. In the VBOX statement, you specify the creation of a vertical box plot using the continuous variable SATScore. The DATALABEL= option specifies IDNumber as the label for any outliers. If there are no outliers, then this option has no effect. The FORMAT statement applies the 8. format to IDNumber so that the ID values display as eight-digit whole numbers.
ODS Graphics is an extension of the SAS Output Delivery System. With ODS Graphics, statistical procedures produce graphs as automatically as they produce tables, and graphs are integrated with tables in the ODS output. You can find a list of the graphs available for each SAS procedure in the SAS documentation.
You can use ODS statements to control your statistical graph output as easily as your tabular output. To specify options for graphs, you submit the ODS GRAPHICS statement. You can submit this statement with or without the keyword ON.
ODS GRAPHICS ON
For example, you can use the WIDTH= option to specify the width of the graph.
ods graphics on / width=25;
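A slightly fuller sketch of the statement, assuming you want to reset any previously set options and control both dimensions (the specific values here are arbitrary):

ods graphics on / reset=all width=6in height=4in;   /* reset options, then set the plot dimensions */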
ODS statistical graphics are ODS output objects, just like tables that are produced by SAS procedures. Procedures assign a name to each output object, which enables you to specify individual statistical graphs and tables in ODS statements.
To select or exclude specific test results, graphs, or tables from your output, you can use the ODS SELECT and ODS EXCLUDE statements.
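For example, to keep only the Moments and Basic Statistical Measures tables from PROC UNIVARIATE (Moments and BasicMeasures are the ODS table names the procedure assigns to them), you could submit something like this sketch:

ods select Moments BasicMeasures;            /* keep only these two output objects */
proc univariate data=statdata.testscores;
   var SATScore;
run;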
Another way to control your output is to use PLOTS=, an option that is usually available in the procedure statement. This option enables you to specify which graphs SAS should create, either in addition to or instead of the default plots.
PROC UNIVARIATE DATA= SAS-data-set PLOTS=;
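The available plot-request keywords differ by procedure, so check the documentation for the procedure you are using. As one sketch, the TTEST procedure (mentioned later in this lesson as an alternative way to test a mean) accepts a PLOTS= option:

proc ttest data=statdata.testscores h0=1200 plots=all;   /* request all available ODS graphs */
   var SATScore;
run;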
In addition, there are several ways you can control the layout of your graphics output.
You can use ODS templates to modify the layout and details of each graph.
You can use ODS styles to control the general appearance and consistency of your graphs and tables. By default, SAS displays your ODS Graphics output using the HTMLBLUE style.
In this course, you learn to control your statistical graphics in some of these ways. To learn more about ODS Graphics features that are not covered in this course, see the SAS/STAT documentation.
The program in the editor uses PROC UNIVARIATE to produce descriptive statistics, a histogram and a normal probability plot for the variable SATScore. Let’s specify the WIDTH= option so that our graphics are sized correctly.
Let’s submit the program.
ods graphics on / width=600;
proc univariate data=statdata.testscores;
   var SATScore;
   id idnumber;
   histogram SATScore / normal(mu=est sigma=est);
   inset skewness kurtosis / position=ne;
   probplot SATScore / normal(mu=est sigma=est);
   inset skewness kurtosis;
   title 'Descriptive Statistics Using PROC UNIVARIATE';
run;
title;
Descriptive Statistics Using PROC UNIVARIATE
The UNIVARIATE Procedure
Variable: SATScore

Moments
N                          80    Sum Weights                80
Mean                 1190.625    Sum Observations        95250
Std Deviation      147.058447    Variance           21626.1867
Skewness           0.64202018    Kurtosis           0.42409987
Uncorrected SS      115115500    Corrected SS       1708468.75
Coeff Variation    12.3513656    Std Error Mean     16.4416342

Basic Statistical Measures
    Location                     Variability
Mean     1190.625    Std Deviation            147.05845
Median   1170.000    Variance                     21626
Mode     1050.000    Range                    710.00000
                     Interquartile Range      195.00000

Tests for Location: Mu0=0
Test           Statistic         p Value
Student’s t    t     72.41525    Pr > |t|     <.0001
Sign           M     40          Pr >= |M|    <.0001
Signed Rank    S     1620        Pr >= |S|    <.0001

Quantiles (Definition 5)
Level         Quantile
100% Max          1600
99%               1600
95%               1505
90%               1375
75% Q3            1280
50% Median        1170
25% Q1            1085
10%               1020
5%                 995
1%                 890
0% Min             890

Extreme Observations
          Lowest                        Highest
Value    IDNumber    Obs      Value    IDNumber    Obs
  890    76526697     69       1490     9589297      8
  910    30834797     74       1520    73461797     42
  970    60714297      6       1520    40177297     54
  990    61728297     51       1590    23573597     70
 1000    37070397      4       1600    39196697     25
Histogram for SATScore
Fitted Normal Distribution for SATScore

Parameters for Normal Distribution
Parameter    Symbol    Estimate
Mean         Mu        1190.625
Std Dev      Sigma     147.0584

Goodness-of-Fit Tests for Normal Distribution
Test                   Statistic             p Value
Kolmogorov-Smirnov     D       0.08382224    Pr > D       >0.150
Cramer-von Mises       W-Sq    0.09964577    Pr > W-Sq     0.114
Anderson-Darling       A-Sq    0.70124822    Pr > A-Sq     0.068

Quantiles for Normal Distribution
             --------Quantile--------
Percent      Observed       Estimated
   1.0        890.000         848.516
   5.0        995.000         948.735
  10.0       1020.000        1002.162
  25.0       1085.000        1091.436
  50.0       1170.000        1190.625
  75.0       1280.000        1289.814
  90.0       1375.000        1379.088
  95.0       1505.000        1432.515
  99.0       1600.000        1532.734
Probability Plot for SATScore

The log shows that SAS processed the code without errors. When we look at our results, the tabular output tells us quite a lot about our data.
The sample size (N) is 80. The mean is 1190.6, which is roughly equivalent to the median of 1170 in the quantiles report.
The standard deviation is approximately 147, which is a measure of the average variability around the mean. The variance is the standard deviation squared.
The skewness statistic is 0.642. It is positive, but close to 0, so the distribution is slightly right-skewed. The kurtosis statistic is 0.424. It is positive, but also close to 0, so the distribution is slightly heavy-tailed.
The coefficient of variation is 12.35. This is the standard deviation expressed as a percentage of the mean. This is useful if you need to compare data with different units of measurement, for example, inches to centimeters. It’s a way of standardizing units of measurement.
The standard error of the mean of 16.4 measures the variability of the mean.
In the quantiles report, SAS provides various reference points in the data set. We can see that the minimum for SATScore is 890 and the maximum is 1600. We are also given other percentile values, such as the 25th, the median, and the 75th. In the extreme observations output, SAS specifies the five lowest values of SATScores and the five highest values. Remember that we specified IDNumber as an identifier in our code.
Next, let’s scroll down and review our graphs.
This histogram provides some additional information about the TestScores data. The bin identified with the midpoint of 1100 has approximately 33% of the values. The inset box displays the skewness and kurtosis values. The data looks approximately normal.
Let’s review the normal probability plot. The diagonal reference line here represents where the data values would fall if they came from a normal distribution. The circles represent the observed data values. Because the circles closely follow the diagonal reference line in the graph, you can conclude that there does not appear to be a significant departure from normality.
Now we use PROC SGPLOT to create a boxplot of the variable SATScore. Let’s specify a reference line at 1200. This is the SAT test score goal for magnet schools in the Carver County school district. We also specify the DATALABEL option in the VBOX statement and ask SAS to label any outliers with their IDNumber values. Let’s submit this code.
proc sgplot data=statdata.testscores;
   refline 1200 / axis=y lineattrs=(color=blue);
   vbox SATScore / datalabel=IDNumber;
   format IDNumber 8.;
   title "Box Plots of SAT Scores";
run;
title;
The SGPlot Procedure
The log verifies that the code ran successfully. Now let’s look at our output. The top whisker represents the largest point up to 1.5 interquartile units from the box. The top line of the box represents the 75th percentile. The horizontal line inside the box represents the median or 50th percentile. The bottom line of the box represents the 25th percentile. The bottom whisker represents the smallest point up to 1.5 interquartile units from the box. The diamond represents the mean. The blue horizontal line is the reference line we added where SATScore is 1200. Note that there are two outliers, values beyond 1.5 interquartile units. SAS displays their IDNumber values as we requested.
You know how to assess normality and use descriptive statistics to better understand your data. Now you want to use inferential statistics to calculate the standard error of the mean and confidence intervals for the mean. In this topic, you learn how to
define the distribution of sample means and the central limit theorem
calculate and interpret the standard error of the mean and confidence intervals for the mean
use the MEANS procedure to generate the standard error of the mean and confidence intervals for the mean.
When you gather a sample of data, you calculate sample statistics to estimate parameters from the population. A point estimator is a sample statistic used to estimate a population parameter. You’re already familiar with several point estimators. For example, you know that you can use x̄, the sample mean, to estimate μ, the population mean. Similarly, you can use s, the sample standard deviation, to estimate σ, the population standard deviation. An estimator takes on different values from sample to sample, so it’s important to know the variance of an estimator.
Let’s look at an example. A research company is interested in whether drivers adhere to the posted speed limit of 45 mph on a particular street. They collect data for 25 drivers. Here’s a question. If you look at a sample from this data that has a mean of 45.56 mph, why are you not absolutely certain that this is the mean speed for the population?
The answer is because this sample mean is only an estimate of the population mean. If you collected another sample of drivers, you’d likely obtain a different estimate of the mean. Different samples yield different estimates of the mean for the same population. The variance of the sample mean refers to how much the value of the sample mean varies from sample to sample. Any sample statistic has some variability. The key is to understand and estimate this variability.
A statistic that measures the variability of your estimator is the standard error. Standard error differs from the sample standard deviation because sample standard deviation deals with the variability of your data whereas standard error deals with the variability of your sample statistic. An example is the standard error of the mean, which measures the variability of your sample mean. It’s an estimate of how much you can expect the sample mean to vary from sample to sample. You calculate the standard error of the mean using the formula shown here, where s is the sample standard deviation and n is the sample size.
SE(Mean) = s/sqrt(n)
The larger the sample size, the smaller the standard error of the mean will be. The smaller the standard error of the sample mean, the more precise the sample mean is as an estimator of the population mean.
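To make this concrete with the values that appear in the PROC UNIVARIATE output earlier in this lesson: for the TestScores data, s is approximately 147.06 and n is 80, so SE(Mean) = 147.06/sqrt(80) ≈ 147.06/8.94 ≈ 16.44, which matches the Std Error Mean of 16.4416 that SAS reports.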
What happens if you look at the distribution of all possible sample means from the population? This is called the distribution of sample means. Let’s look at the car example. Suppose you take 500 random samples, all with a sample size of 10, from an identified population size of 5,000 and calculate a mean for each sample. The histogram below shows the distribution of all 5000 observations.
Chart: Vertical bar chart (histogram) with each bar representing the number of times that data value occurs in the data. The distribution is normal, showing the highest bars in the middle and each side gradually and symmetrically decreasing.
The histogram below represents the distribution of the 500 sample means.
Chart: Vertical bar chart (histogram) with the highest bars centered around the middle and the bar heights dropping quickly. The bars are clustered around the middle point on the graph.
The variability of the distribution of sample means is smaller than the variability of the distribution of the 5000 observations. That should make sense. With a hypothesized mean of 45, it seems relatively likely to find one driver with a speed of 62. But it’s not likely that the mean of a sample of 10 drivers would be 62. The distribution of the mean is always less variable than the data.
Because we know that point estimators vary from sample to sample, it would be nice to have an estimator of the mean that directly accounts for this natural variability.
Chart: Normal distribution curve with a straight line starting at the middle of the base of the graph and extending downward. There are three horizontal lines of different lengths located on the straight line. The differing lengths of the horizontal lines represent the different sample ranges.
An interval estimator is another way to estimate the true population mean. The interval estimator gives us a range of values that is likely to contain the population mean. It incorporates the uncertainty that arises from random variability. The interval is centered at the sample mean. The margin of error is how far the interval extends on each side of the sample mean.
You calculate the interval from the standard error of the mean that we just talked about as well as a value that is determined by the degree of certainty we require.
Confidence intervals are a type of interval estimator used to estimate the population mean, while taking into account the variability of the sample statistic, in this case, the sample mean. A confidence interval for the mean gives a range of plausible values for the unknown population mean. It places an upper and lower bound around the sample mean. To construct a confidence interval for the mean, you must first select a significance level. Commonly, you use a 95% confidence interval to estimate the population mean. A 95% confidence interval indicates that you’re 95% confident that the true population mean lies between the two calculated values. In other words, if you were to sample repeatedly and calculate a confidence interval for each sample mean, approximately 95% of the confidence intervals you calculate will contain the true population mean.
To use the speed example, you interpret a 95% confidence interval to mean that you’re 95% confident that the interval contains the true population mean. In other words, if you select 100 different samples from the same population and calculate 100 intervals, approximately 95 of them will contain the true population mean. Here’s a question. Why wouldn’t you want to raise the confidence level and be more confident in your statistic? As you increase the confidence level, the width of the interval increases, potentially making it less able to provide useful information.
Here’s the formula for calculating the confidence interval for the mean.
x̄ ± t · sx̄
x̄ is the sample mean. t is the t quantile value that is determined by the confidence level and the sample size. sx̄ is the standard error of the mean:
sx̄ = s/sqrt(n)
Here’s a question. How can you make the confidence interval smaller? By increasing n, the sample size. You can also decrease the confidence level.
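As a preview of the PROC MEANS output later in this lesson: with x̄ = 1190.625, sx̄ = 16.4416, and n = 80 (79 degrees of freedom), the t quantile for a 95% confidence level is about 1.99, so the interval is 1190.625 ± 1.99 × 16.4416, or roughly 1157.9 to 1223.4.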
For the purposes of finding confidence limits for parameters (such as the mean), you might make assumptions about a theoretical population distribution. You might, for instance, assume normality of sample means. In fact, the confidence intervals in this course assume that the sample means are normally distributed. Remember that in a normal distribution, approximately 68% of the area under the bell curve falls within 1 standard deviation of the mean. Approximately 95% of the area under the curve falls within 2 standard deviations of the mean, and approximately 99% of the area under the curve falls within 3 standard deviations of the mean.
Chart: Normal distribution curve with the symbol for the mean in the middle. A vertical line is shown on each side of the mean, about halfway between the middle of the graph and the tails. These lines are labeled as the mean plus and minus the standard deviation, and the area between them is labeled 68%. Another vertical line is shown on each side of the curve, between the previous vertical lines and the tails. These lines are labeled as the mean plus and minus 2 times the standard deviation, and the area under the curve between them is labeled 95%. At the tails on each side of the curve is the label mean plus and minus 3 times the standard deviation, and the area under the curve between these lines is labeled 99%.
If the distribution of sample means is normal, you can use the probabilities associated with the normal distribution when you construct a confidence interval. The probability corresponds to the confidence level. Therefore, if you construct a 95% confidence interval, you have a 95% probability of constructing a confidence interval that contains the population mean. Even if you faithfully calculate a 95% confidence interval and the other assumptions are met, still 5% of the confidence intervals will not include the true population mean. Unfortunately, there’s no way to know whether the one confidence interval you compute happens to be one of the 95% that contain the population mean or one of the 5% that don’t.
The graph shown here illustrates the distribution of sample means, with three different samples represented beneath the graph.
Chart: The same normal curve as in the previous figure, now representing the distribution of sample means, with the population mean at the center and the same one, two, and three standard deviation markings. Below the graph are three horizontal lines, one for each sample. Each line is centered at that sample’s mean (x̄), with arrows extending outward and parentheses marking the ends of its confidence interval, about two standard errors on each side of the sample mean. The intervals for samples 1 and 2 include the population mean. The interval for sample 3, whose mean falls more than two standard errors above the population mean, extends from about one standard deviation above the mean to beyond the right tail and does not include the population mean.
Each sample has a different mean. The standard errors are all about the same, and all close to the true standard error of the mean. The double-headed arrows around each of the means measure about 2 standard errors to each side of each sample mean (t is approximately 2). The means for samples 1 and 2 fall within 2 standard errors from the population mean, just by chance. In fact, you expect 95% of all sample means to fall within 2 standard errors of the population mean. Sample 3 was among the 5% whose mean, by chance alone, fell outside of 2 standard errors. Even though this sample was collected in the same manner as the others, because the sample mean was more than 2 standard errors from the population mean, the confidence interval did not extend far enough to include the true mean.
What if your data doesn’t look like it comes from a normal distribution? To satisfy the assumption of normality, you can either verify that the population distribution is approximately normal or apply the central limit theorem. The central limit theorem states that the distribution of sample means is approximately normal, regardless of the population distribution’s shape, if the sample size is large enough. “Large enough” is usually about 30 observations. It’s more if the data is heavily skewed, fewer if the data is symmetric. How does this work? Let’s see how a distribution of sample means approaches normality as the sample size increases.
Here’s a group of histograms. The first histogram shows data values drawn from an exponential distribution, which is a right-skewed population.
There are 4 histograms. The first shows a cluster of the highest bars on the left of the graph with a long tail of short bars on the right; the overlaid curve is strongly skewed to the right. The second shows a cluster of bars on the left with a long tail on the right; the overlaid curve still has a longer right tail, but the skewness is less pronounced than in the first histogram. The third shows a larger cluster of bars in the middle, with decreasing height and number of bars on the left and a longer right tail that is less dramatic than in the previous histogram. The fourth shows the highest bars in the middle with the height and number of bars decreasing symmetrically on each side; the overlaid curve is a normal distribution curve.
The remaining charts display histograms of the sample means for samples of differing sizes drawn from the same exponential distribution. The second chart displays the distribution of sample means from 1000 samples, each of size 5, from the same right-skewed population. The third chart displays the distribution of sample means from 1000 samples, each of size 10, from the same right-skewed population. Notice that the sample is moving closer to normal. The fourth chart displays the distribution of sample means from 1000 samples, each of size 30, from the same right-skewed population. This distribution is approximately bell-shaped and symmetric. You can see that when the sample size is large enough, it doesn’t matter what your original population looked like. The distribution of sample means is close enough to normal.
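If you want to reproduce this kind of picture yourself, the following sketch (not one of the course programs; the data set and variable names are arbitrary) draws 1000 samples of size 30 from an exponential distribution, computes each sample mean, and plots the distribution of those means:

data expsamples;
   call streaminit(27513);              /* arbitrary seed so the results are reproducible */
   do sample = 1 to 1000;               /* 1000 samples */
      do i = 1 to 30;                   /* 30 observations per sample */
         x = rand('exponential');       /* draw from a right-skewed population */
         output;
      end;
   end;
run;

proc means data=expsamples noprint nway;
   class sample;
   var x;
   output out=samplemeans mean=xbar;    /* one mean per sample */
run;

proc sgplot data=samplemeans;
   histogram xbar;                      /* distribution of the 1000 sample means */
   density xbar / type=normal;          /* roughly bell-shaped, as the central limit theorem predicts */
run;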
You need to assess whether or not the average combined SAT score in math and reading of magnet high school students in Carver County is 1200, which is the goal set by the school board. The TestScores data includes information on 80 of these students selected at random. You already know that this sample data is approximately normal. You’ve calculated the sample mean and sample standard deviation. However, you don’t know the actual population mean and population standard deviation. You need to calculate the standard error of the mean for the TestScores data so that you can assess how precise your sample estimate is. You also need to calculate the confidence interval for the mean so that you can estimate the population mean while taking into account the variability of the sample statistic.
You can use the MEANS procedure to generate a 95% confidence interval for the mean. You’re already familiar with the syntax for PROC MEANS. You can use the CLM option in the PROC MEANS statement to calculate the confidence limits for the mean.
Here’s the code that you can use to calculate a 95% confidence interval for the mean of the variable SATScore in the TestScores data.
proc means data=statdata.testscores maxdec=4 n mean stderr clm;
   var SATScore;
   title '95% Confidence Interval for SAT';
run;
You want to display the number of observations, the mean of the data, the standard error of the mean, and the confidence interval for the mean. The CLM option calculates the confidence limits for the mean. SAS defaults to a 0.05 α, or 95% confidence level. If you want to construct confidence intervals with a different confidence level, you can add the ALPHA= option to the PROC MEANS statement. For example, here the ALPHA= option specifies a 99% confidence level.
proc means data=statdata.testscores maxdec=4 n mean stderr clm alpha=.01;
   var SATScore;
   title '99% Confidence Interval for SAT';
run;
Let’s use PROC MEANS to calculate a 95% confidence interval for the mean of the variable SATScore in the TestScores data set. The MAXDEC= option rounds the values in the table to four decimal places. We specify that SAS should display the number of observations, the mean, the standard error of the mean, and the confidence limits of the mean at the default 95% confidence level.
proc means data=statdata.testscores maxdec=4 n mean stderr clm;
   var SATScore;
   title '95% Confidence Interval for SAT';
run;
title;
95% Confidence Interval for SAT
The MEANS Procedure

Analysis Variable : SATScore
                               Lower 95%      Upper 95%
 N         Mean    Std Error   CL for Mean    CL for Mean
80    1190.6250      16.4416     1157.8987      1223.3513
Let’s look at the output. The sample size is 80. The mean of SATScore is 1190.625. The standard error of the mean is 16.4416. The standard error measures the variability of the sample mean or how much error we can expect if we use the sample mean to estimate the true population mean. The 95% confidence interval of the mean is 1157.9 to 1223.4. This indicates that you’re 95% confident that the true population mean is contained within this interval.
How does this relate to your original question? You want to know whether the average SAT score for the Carver County magnet high schools is different from the standard of 1200 set by the school board. Because 1200 is contained within the 95% confidence interval of 1157.9 to 1223.4, 1200 is a plausible value for the population mean, so the data do not provide evidence that the true average SAT score differs from 1200.
Now you’re ready to learn a key technique of inferential statistics: hypothesis testing. A hypothesis test uses sample data to evaluate a question about a population. It provides us with a way to make inferences about a population based on sample data. We don’t usually know the true value of population parameters. We estimate these values based on statistics from a sample. In hypothesis testing, we reject or fail to reject a statistical hypothesis that is a statement about the value of a population parameter. In this topic, you learn how to
design and conduct a hypothesis test
use the p-value to determine statistical significance
use the UNIVARIATE procedure to perform a statistical hypothesis test
perform a one-sample, two-sided t-test to determine if the population mean is significantly different from a known value.
To help you understand how to perform a hypothesis test, it’s helpful to look at a decision-making process that’s similar. Statistical hypothesis testing is like a court trial. In a criminal court, you put defendants on trial because you suspect that they’re guilty of a crime. But how does the trial proceed? First you must determine your hypotheses, referred to in statistics as the null and alternative hypotheses. The null hypothesis is what you assume to be true when you start your analysis. So, in this case, the null hypothesis is that the defendant is not guilty. The alternative hypothesis is your initial research hypothesis, that is, your proposed explanation.
In this case, you propose that the defendant is guilty. Then you select the significance level. In this example, this is the amount of evidence needed to convict the defendant. In a criminal court of law, the evidence must prove guilt “beyond a reasonable doubt.” In a civil court, there must be a preponderance of evidence. Next, you gather your data. In this case, the police and investigators collect evidence. Lawyers present the evidence to the jury. Finally, the jury makes a judgment using a decision rule. If the evidence is sufficiently strong, the jury rejects the null hypothesis that the defendant is not guilty. If the evidence is not strong enough, the jury fails to reject the null hypothesis that the defendant is not guilty. Note that failing to prove guilt does not prove that the defendant is innocent. Statistical hypothesis testing follows this same basic path.
As in a court trial, there are four steps in conducting a hypothesis test.
Let’s look at an example. Suppose you want to know whether a coin is fair. You can’t flip it endlessly, so you decide to take a sample. You flip the coin five times and count the number of heads and tails. The first step is to identify the population of interest and determine the null and alternative hypotheses. The null hypothesis, which statisticians designate H0, is what you assume to be true, unless proven otherwise. The null hypothesis is usually a hypothesis of equality. Your null hypothesis in this case is that the coin is fair. Suppose you suspect that the coin is not fair. This is the alternative hypothesis, which statisticians designate Ha or H1. The alternative hypothesis is typically what you suspect, or are attempting to demonstrate. The alternative hypothesis is usually a hypothesis of inequality.
The second step is to select the significance level. This is the amount of evidence needed to reject the null hypothesis. A common significance level is 0.05 (1 chance in 20). This is what you’ll use in this course. If you require a stricter cutoff, you might consider lowering your significance level when planning your own analysis. For this example, if you observe five heads or five tails in a row (1 chance in 16), you conclude the coin is not fair and you reject the null hypothesis. If you get a mix of heads and tails, you decide there isn’t enough evidence to determine that the coin is not fair. To understand why the probability is 1 out of 16, see the short calculation that follows.
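To see where 1 chance in 16 comes from: each flip of a fair coin lands heads with probability 1/2, so the probability of five heads in a row is (1/2)^5 = 1/32, and the probability of five tails in a row is also 1/32. The probability of one or the other is 1/32 + 1/32 = 2/32 = 1/16.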
The third step is to collect the data. In this case, you flip the coin five times and record each outcome as a head or a tail. You compute the sample statistic by counting the total number of heads.
The fourth step is to use a decision rule to evaluate the data. You decide whether or not there is enough evidence to reject the null hypothesis. In this example, you conclude whether the coin is fair.
You perform a hypothesis test and make a decision. But is that decision correct? You start by assuming that the null hypothesis is true. In the coin example, you start by assuming that the coin is fair. However, there’s always a risk that you’re wrong. If you reject the null hypothesis when it’s actually true, you’ve made a Type I error. The probability of committing a Type I error is α. α is the significance level of a test. In the coin example, it’s the probability that you conclude that the coin is not fair when it is fair.
If you fail to reject the null hypothesis and it’s actually false, you’ve made a Type II error. The probability of committing a Type II error is β. In the coin example, it’s the probability that you fail to find that the coin is not fair when it is not fair. Type I and II errors are inversely related. As one type increases, the other decreases.
Power is the probability that you correctly reject the null hypothesis. It is the ability of the statistical test to detect a true difference, or the ability to successfully reject a false null hypothesis. The power of a statistical test is equal to (1 − β), where β is the Type II error rate.
If you flip a coin 100 times and count the number of heads, you do not doubt that the coin is fair if you observe exactly 50 heads. However, you might be somewhat skeptical that the coin is fair if you observe 40 or 60 heads. You’d be more skeptical that the coin is fair if you observe 37 or 63 heads and highly skeptical that the coin is fair if you observe 15 or 85 heads. In this situation, the greater the difference between the number of heads and tails, the more evidence you have that the coin is not fair. Statisticians refer to the difference between the observed statistic and the hypothesized value as the effect size.
A p-value measures the probability of observing a value as extreme as the one observed or more extreme, assuming that the null hypothesis is true. For example, if your null hypothesis is that the coin is fair and you observe 55 heads and 45 tails, the p-value is the probability of observing a difference in the number of heads and tails of 10 or more from a fair coin tossed 100 times. If the p-value is large, for example p-value = 0.3682, you would often see a difference this large in experiments with a fair coin. If the p-value is small, for example, p-value = <.0001, you would rarely see differences this large with a fair coin. In the latter situation, you have evidence that the coin is not fair. A large p-value indicates a high probability of observing your results, or more extreme results, given that the null hypothesis is true. So, it is reasonable to assume that the null hypothesis is true. A small p-value indicates a low probability of observing your results, or more extreme results, given that the null hypothesis is true. So, it is no longer reasonable to assume that the null hypothesis is true. So the p-value is used to determine statistical significance. It helps you assess whether you should reject the null hypothesis.
A p-value is not only affected by the effect size (in this example, the observed proportion of heads). It is also affected by the sample size (here, the number of coin flips). For a fair coin, you’d expect 50% of the flips to turn up heads. Let’s say that you flip the coin 10 times and observe 4 heads. In this example, the proportion of heads is 0.4. This value is different from the 0.5 you’d expect according to the null hypothesis.
The evidence becomes stronger as the number of trials on which the proportion is based becomes greater. As you saw in the section on confidence intervals, the variability around a mean estimate gets smaller as the sample size gets larger. For larger sample sizes, you can measure means more precisely. Therefore, 40% heads out of 400 flips makes you more sure that this was not just a chance difference from 50% than 40% heads out of 10 flips, and the smaller p-value reflects this. For 40% heads out of 400 flips, the p-value is less than .0001, indicating that a difference this large from 50% would almost never occur by chance with a fair coin.
So let’s summarize what we know about statistical hypothesis testing. In statistics, the null hypothesis is denoted as H0. This is your initial assumption, usually one of equality or no relationship. The alternative hypothesis is denoted as Ha or H1. This is what you suspect or are trying to demonstrate, and it is typically an expression of an inequality or a relationship. You specify the significance level, which is also the Type I error rate, before collecting the data. It is a function of both your knowledge of the data and theoretical considerations. Statisticians refer to the significance level as α. You choose the level of α based on the cost of making a Type I error.
You measure the strength of the evidence by using a p-value. You calculate the p-value from the collected data. The decision rule is that you fail to reject the null hypothesis if the p-value is greater than or equal to α. You reject the null hypothesis if the p-value is less than α. You never conclude that two things are the same or have no relationship. You can only fail to show a difference or a relationship.
You know that you can compare α with the associated p-value in order to make a decision about the null hypothesis. Another way to measure statistical significance and test the null hypothesis is to compute a test statistic. Two common reference distributions for statistical hypothesis testing are the t distribution and the F distribution. A reference distribution enables you to quantify the probability of observing a particular outcome (the calculated sample statistic) or a more extreme outcome, if the null hypothesis is true. That probability is the p-value you’re already familiar with. The t distribution and F distribution are characterized by the degrees of freedom associated with your data. Depending on the type of analysis you’re performing, you’d use your data to calculate a sample t statistic or a sample F statistic and compare that statistic to the reference distribution. Values that fall in the tails of the distribution are values that are possible under the null hypothesis, but unlikely.
The t distribution arises when you’re making inferences about a population mean and (as in nearly all practical statistical work) the population standard deviation (and therefore, the standard error) is unknown and has to be estimated from the data. The t distribution approaches the normal distribution as the sample size grows larger. The t statistic measures how far X-bar, the sample mean, is from the hypothesized mean, μ0, expressed as the number of standard errors that the sample mean is from the hypothesized mean: t = (x̄ − μ0)/sx̄, where sx̄ is the standard error of the mean. The t statistic is positive when the sample mean is larger than the hypothesized mean, and negative when the sample mean is less than the hypothesized mean. If the t statistic is much higher or lower than 0 and has a small corresponding p-value, this indicates that the sample mean is quite different from the hypothesized mean, and you would reject the null hypothesis.
You use the t statistic when you don’t know the true population standard deviation, σ, and you have only an estimated standard deviation, s. This graph shows the t critical region in relation to the t distribution. If the t statistic falls in the critical region (the shaded region on the graph), then you reject the null hypothesis. Otherwise, you fail to reject the null hypothesis. The area in each of the tails corresponds to α divided by 2 [α/2], or 2.5%. The sum of the areas under the tails is 5%, which is α. Because the rejection region for the t statistic is contained in both tails of the t distribution, statisticians refer to this type of hypothesis test as a two-sided t-test.
If the data comes from a normal population, then the t statistic calculated from the data follows a t distribution, and we can use the t distribution to compute p-values. The t distribution is a symmetric distribution like the normal distribution, except that the t distribution has thicker tails than a normal distribution. If the data does not come from a normal distribution, then the t statistic approximately follows a t distribution as long as the sample size is large, and we can still use the t distribution to calculate a p-value. This is another application of the central limit theorem.
You need to assess whether or not the average combined SAT score in math and reading of magnet high school students in Carver County is 1200, which is the goal set by the school board. The TestScores data includes information on 80 of these students selected at random. You already know that this sample data is approximately normal. You’ve calculated the sample mean and sample standard deviation. However, you don’t know the actual population mean and population standard deviation.
You can perform a statistical hypothesis test to see if the sample mean is different from the standard of 1200 set by the school board. 1200 is your hypothesized mean, μ0. You’ll use a significance level of 0.05. You can calculate the t statistic using the formula you learned earlier: t = (x̄ − μ0)/sx̄
x̄ is the sample mean SAT score of the students selected from the school district, and sx̄ is the standard error of the mean that you calculated earlier in this lesson.
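Plugging in the values from earlier in this lesson gives t = (1190.625 − 1200)/16.4416 = −9.375/16.4416 ≈ −0.57, so the sample mean is a little more than half a standard error below the hypothesized mean.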
You can use either the UNIVARIATE procedure or the TTEST procedure to test the hypothesis that the mean of the SAT score is equal to 1200. In this course, you learn to do this using PROC UNIVARIATE.
You’re already familiar with the syntax for the UNIVARIATE procedure.
PROC UNIVARIATE DATA=SAS-data-set
By default, PROC UNIVARIATE sets the value of μ0 to zero. You can use the MU0= option in the PROC UNIVARIATE statement to specify the value of the hypothesized mean.
Here’s the code that you can use to test the hypothesis that the mean of the SAT score is equal to 1200.
ods select testsforlocation;
proc univariate data=statdata.testscores mu0=1200;
   var SATScore;
   title 'Testing Whether the Mean of SAT Scores=1200';
run;
You use the ODS SELECT statement to specify that SAS provide output only for the TestsForLocation table. This statement produces three tests for location, one of them being the t-test. All three tests produce a test statistic for the null hypothesis indicating that the mean or median is equal to a given value μ0 against the two-sided alternative that the mean or median is not equal to μ0. By the way, if you don’t know which table name to specify for the output you need, you can browse details for the UNIVARIATE procedure in the SAS documentation.
Then in the PROC UNIVARIATE statement, you specify the input data set, TestScores. You specify the MU0= option to be 1200. This program uses the default α of 0.05. You can change it in your own analysis if you want, by adding the ALPHA= option to the PROC UNIVARIATE statement. Then in the VAR statement, you identify SATScore as the analysis variable.
This program uses PROC UNIVARIATE to test the hypothesis that the mean of SATScore is equal to 1200. Your null hypothesis is that the population’s mean SAT score for the Carver County magnet high schools is 1200.
H0: μ = 1200
Your alternative hypothesis is that the population’s mean SAT score is not 1200.
Ha: μ ≠ 1200
Let’s submit this program and take a look at the output SAS produces.
ods select testsforlocation;
proc univariate data=statdata.testscores mu0=1200;
   var SATScore;
   title 'Testing Whether the Mean of SAT Scores = 1200';
run;
title;
The Tests for Location table provides the t statistic, labeled Student’s t, and the corresponding p-value. The p-value is greater than the significance level, or α, of 0.05 that we had set. Note, by the way, that it is a coincidence that the t statistic and p-value have the same numeric value (although one is negative and the other positive). Because the p-value is greater than alpha, we fail to reject the null hypothesis. Therefore, we conclude that the difference between the sample mean of 1190.625 and the hypothesized mean of 1200 is not statistically significant.
Here’s another way to look at it. If the null hypothesis is true, how likely are we to see a t statistic with an absolute value of .5702 or greater? Well, about 57% of the time. This value confirms that we do not have enough evidence to reject the null hypothesis. To summarize: the original question was whether the mean SAT score for Carver County magnet high school students equals 1200. From this hypothesis test, we conclude that there is not enough evidence to say that the sample mean score of 1190.625 is statistically different from 1200.
Summary: Lesson 1: Introduction to Statistics

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries
Basic Statistical Concepts
Descriptive statistics organizes, describes, and summarizes data using numbers and graphical techniques. Inferential statistics is concerned with drawing conclusions about a population from the analysis of a random sample drawn from that population. Inferential statistics is also concerned with the precision and reliability of those inferences.
A population is the complete set of observations or the entire group of objects that you are researching. A sample is a subset of the population. The sample should be representative of the population. You can obtain a representative sample by collecting a simple random sample.
Parameters are numerical values that summarize characteristics of a population. Parameter values are typically unknown and are represented by Greek letters. Statistics summarize characteristics of a sample. You use letters from the English alphabet to represent sample statistics. You can measure characteristics of your sample and provide numerical values that summarize those characteristics. You use statistics to estimate parameters.
Variables are characteristics or properties of data that take on different values or amounts. A variable can be independent or dependent. In some contexts, you select the value of an independent variable in order to determine its relationship to the dependent variable. In other contexts, the independent variable’s values are simply taken as given.
Variables are also classified according to their characteristics. They can be quantitative or categorical. Data that consists of counts or measurements is quantitative. Quantitative data can be further distinguished by two types: discrete and continuous. Discrete data takes on only a finite, or countable, number of values. Continuous data has an infinite number of values and no breaks or jumps.
Categorical or attribute data consists of variables that denote groupings or labels. There are two main types: nominal and ordinal. A nominal categorical variable exhibits no ordering within its groups or categories. With ordinal categorical variables, the observed levels of the variable can be ordered in a meaningful way that implies differences due to magnitude.
A variable’s classification is its scale of measurement. There are two scales of measurement for categorical variables: nominal and ordinal. There are two scales of measurement for continuous variables: interval and ratio. Data from an interval scale can be rank-ordered and has a sensible spacing of observations such that differences between measurements are meaningful. However, interval scales lack the ability to calculate ratios between numbers on the scale because there is no true zero point. Data on a ratio scale includes a true zero point and can therefore accurately represent the ratio between two values on the measurement scale.
The appropriate statistical method for your data also depends on the number of variables involved. Univariate analysis provides techniques for analyzing and describing a single variable at a time. Bivariate analysis describes and explains the relationship between two variables and how they change, or covary, together. Multivariate analysis examines two or more variables at the same time, in order to understand the relationships among them.
Descriptive Statistics
The distribution of your data tells you what values your data takes and how often it takes those values.
You can calculate descriptive statistics that measure locations in your data. Statistics that locate the center of the data are measures of central tendency. These include mean, median, and mode.
Percentiles are descriptive statistics that give you reference points in your data. A percentile is the value of a variable below which a certain percentage of observations fall. The most commonly reported percentiles are quartiles, which break the data into quarters.
There are several descriptive statistics that measure the variability of your data: range, interquartile range (IQR), variance, standard deviation, and coefficient of variation (C.V.).
To summarize and generate descriptive statistics, you use the MEANS procedure. PROC MEANS calculates a standard set of statistics, including the minimum, maximum, and mean data values, as well as standard deviation and n. The PRINTALLTYPES option displays statistics for all requested combinations of class variables.
Picturing Your Data
A histogram is a visual representation of the frequency distribution of your data. The frequencies are represented by bars.
The normal distribution is a common theoretical distribution in statistics. It is bell-shaped, with values concentrated near the mean, and it is symmetric around the mean. The standard deviation (σ) determines how variable the distribution is. Underlying the normal distribution is a mathematical function named the probability density function.
To check the assumption that your random sample has a normal distribution, you can plot a histogram. You can also look at statistical summaries of your data. The closer skewness and kurtosis are to 0, the closer your data is shaped like the normal distribution.
Skewness measures the tendency of your data to be more spread out on one side of the mean than on the other. It measures the asymmetry of the distribution. The direction of skewness is the direction to which the data is trailing off. The closer the skewness is to 0, the more normal or symmetric the data.
Kurtosis measures the tendency of data to be concentrated toward the center or toward the tails of the distribution. The closer kurtosis is to 0, the closer the tails of the data resemble the tail thickness of the normal distribution. Kurtosis can be difficult to assess visually.
A negative kurtosis statistic means that the data has lighter tails than in a normal distribution and is less heavily concentrated about the mean. This is a platykurtic distribution.
A positive kurtosis statistic means that the data has heavier tails and is more concentrated about the mean than a normal distribution. This is a leptokurtic distribution, which is often referred to as heavy-tailed and also as an outlier-prone distribution.
A normal probability plot is another way to visualize and assess the distribution of your data. The vertical axis represents the actual data values. The horizontal axis displays the expected percentiles from a standard normal distribution. The normal reference line along the diagonal indicates where the data would fall if it were perfectly normal.
A box plot makes it easy to see how spread out the data is and if there are any outliers.
You can use the UNIVARIATE procedure to generate descriptive statistics, histograms, and normal probability plots.
In the ID statement, you list the variable or variables that SAS should label in the table of extreme observations and identify as outliers in the graphs.
You can add more options to the HISTOGRAM and PROBPLOT statements. The NORMAL option uses estimates of the population mean and standard deviation to add a normal curve overlay to the histogram and a diagonal reference line to the normal probability plot.
You can use the INSET statement to create a box of summary statistics directly on the graphs.
In addition to the statistical graphics available to you with PROC UNIVARIATE, you might want to use the SGSCATTER, SGPLOT, SGPANEL, and SGRENDER procedures to produce a wide variety of additional plot types.
You can use PROC SGPLOT to generate dot plots, horizontal and vertical bar charts, histograms, box plots, density curves, scatter plots, series plots, band plots, needle plots, and vector plots. The REG statement generates a fitted regression line or curve. You use a REFLINE statement to create a horizontal or vertical reference line on the plot.
ODS Graphics is an extension of the SAS Output Delivery System. With ODS Graphics, statistical procedures produce graphs as automatically as they produce tables, and graphs are integrated with tables in the ODS output. You can find a list of the graphs available for each SAS procedure in the SAS documentation.
Confidence Intervals for the Mean
A point estimator is a sample statistic used to estimate a population parameter. A statistic that measures the variability of your estimator is the standard error.
The standard error of the mean measures the variability of your sample mean. It’s an estimate of how much you can expect the sample mean to vary from sample to sample.
The distribution of sample means is the distribution of all possible sample means from the population. The distribution of the mean is always less variable than the data.
An interval estimator is another way to estimate a population parameter. It incorporates the uncertainty that arises from random variability.
Confidence intervals are a type of interval estimator used to estimate the population mean, while taking into account the variability of the sample mean.
The central limit theorem states that the distribution of sample means is approximately normal, regardless of the population distribution’s shape, if the sample size is large enough.
You can use the MEANS procedure to generate a 95% confidence interval for the mean.
You can use the CLM option in the PROC MEANS statement to calculate the confidence limits for the mean.
You can add the ALPHA= option to the PROC MEANS statement in order to construct confidence intervals with a different confidence level.
Hypothesis Testing A hypothesis test uses sample data to evaluate a question about a population. It provides a way to make inferences about a population, based on sample data.
There are four steps in conducting a hypothesis test. The first step is to identify the population of interest and determine the null and alternative hypotheses. The null hypothesis, H0, is what you assume to be true, unless proven otherwise. It is usually a hypothesis of equality. The alternative hypothesis, Ha or H1, is typically what you suspect, or are attempting to demonstrate. It is usually a hypothesis of inequality.
The second step in hypothesis testing is to select the significance level. This is the amount of evidence needed to reject the null hypothesis. A common significance level is 0.05 (1 chance in 20).
The third step is to collect the data. The fourth step is to use a decision rule to evaluate the data. You decide whether or not there is enough evidence to reject the null hypothesis.
If you reject the null hypothesis when it’s actually true, you’ve made a Type I error. The probability of committing a Type I error is α. α is the significance level of a test. If you fail to reject the null hypothesis and it’s actually false, you’ve made a Type II error. The probability of committing a Type II error is β. Type I and II errors are inversely related. The power of a statistical test is equal to 1 minus beta (1 − β).
The difference between the observed statistic and the hypothesized value is the effect size. A p-value measures the probability of observing a value as extreme as the one observed or more extreme, assuming that the null hypothesis is true. A p-value is affected not only by the effect size, but also by the sample size.
The t statistic measures how far X-bar, the sample mean, is from the hypothesized mean, μ0. If the t statistic is much higher or lower than 0 and has a small corresponding p-value, this indicates that the sample mean is quite different from the hypothesized mean, and you would reject the null hypothesis.
You can use PROC UNIVARIATE to perform a statistical hypothesis test. You use the MU0= option to specify the value of the hypothesized mean, μ0. You can use the ALPHA= option to change the significance level.
Syntax
PROC MEANS DATA=SAS-data-set
        CLASS variables;
        VAR variables;
RUN;
PROC UNIVARIATE DATA=SAS-data-set
        VAR variables;
        ID variables;
        HISTOGRAM variables
PROC SGPLOT DATA=SAS-data-set
Sample Programs
Using PROC MEANS to Generate Descriptive Statistics
proc means data=statdata.testscores maxdec=2 fw=10 printalltypes
           n mean median std var q1 q3;
   class Gender;
   var SATScore;
   title 'Selected Descriptive Statistics for SAT Scores';
run;
title;
Using SAS to Picture Your Data
proc univariate data=statdata.testscores;
   var SATScore;
   id idnumber;
   histogram SATScore / normal(mu=est sigma=est);
   inset skewness kurtosis / position=ne;
   probplot SATScore / normal(mu=est sigma=est);
   inset skewness kurtosis;
   title 'Descriptive Statistics Using PROC UNIVARIATE';
run;
title;
proc sgplot data=statdata.testscores;
   refline 1200 / axis=y lineattrs=(color=blue);
   vbox SATScore / datalabel=IDNumber;
   format IDNumber 8.;
   title "Box Plots of SAT Scores";
run;
title;
Calculating a 95% Confidence Interval
proc means data=statdata.testscores maxdec=4 n mean stderr clm;
   var SATScore;
   title '95% Confidence Interval for SAT';
run;
title;
Using PROC UNIVARIATE to Perform a Hypothesis Test
ods select testsforlocation;
proc univariate data=statdata.testscores mu0=1200;
   var SATScore;
   title 'Testing Whether the Mean of SAT Scores = 1200';
run;
title;
Copyright © 2017 SAS Institute Inc., Cary, NC, USA. All rights reserved.