Statistics is the science of collecting data, organizing and summarizing data, and drawing conclusions from data.
There are two main branches of statistics: descriptive statistics and inferential statistics. Descriptive statistics focus on graphical and numerical procedures that are used to organize and summarize data. Inferential statistics focus on methods used to draw conclusions about a population from a sample.
A population is the complete set of all items that interest an investigator. The population size is denoted by N. A parameter is a numerical measure that describes a specific characteristic of the population. For example, if we were interested in the average GPA of all college students, then the population would be all college students and the parameter would be their average GPA.
A sample is an observed subset of a population. The sample size is denoted by n. A statistic is a numerical measure that describes a specific characteristic of the sample. For example, if we took a sample of 100 college students and computed their average GPA, then the sample would be the 100 college students and the statistic would be the sample average GPA.
The population is usually quite large and it would take a lot of money and time to examine every item in the population. So we take a sample because it is smaller and easier to handle. If the sample is representative of the population then we can use the sample information to learn about the whole population.
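The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the GPA values are made up (drawn uniformly between 2.0 and 4.0), and the population size of 10,000 is hypothetical. Only the sample size of 100 comes from the example above.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: made-up GPAs for N = 10,000 college students.
N = 10_000
population = [round(random.uniform(2.0, 4.0), 2) for _ in range(N)]

# Parameter: a numerical measure of the whole population.
parameter = mean(population)

# Sample: an observed subset of n = 100 students, chosen at random.
n = 100
sample = random.sample(population, n)

# Statistic: the same numerical measure, computed from the sample only.
statistic = mean(sample)

print(f"N = {N}, parameter (population average GPA) = {parameter:.3f}")
print(f"n = {n}, statistic (sample average GPA)     = {statistic:.3f}")
```

In practice the parameter is usually unknown, because examining all N items is too costly. The sketch can compute it only because the population here is simulated.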
There are many ways of choosing a sample from a population. The simplest method is called a simple random sample (SRS). Simple random sampling is a procedure used to select a sample of n objects from a population in such a way that each member of the population is chosen strictly by chance, the selection of one member does not influence the selection of any other member, each member of the population is equally likely to be chosen, and every possible sample of a given size, n, has the same chance of selection. This method is so common that the adjective simple is generally dropped, and the resulting sample is called a random sample.
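As a rough sketch of this definition, the code below draws simple random samples with Python's standard `random.sample` and checks empirically that every possible sample of size n turns up about equally often. The five-member population and the number of repetitions are assumptions chosen only to keep the example small.

```python
import random
from collections import Counter
from itertools import combinations

random.seed(1)  # fixed seed so the sketch is reproducible

population = ["A", "B", "C", "D", "E"]  # hypothetical population, N = 5
n = 2                                   # sample size

# One simple random sample of n members, chosen strictly by chance.
one_sample = random.sample(population, n)
print("one sample:", one_sample)

# Every possible sample of size n (order ignored): 10 of them when N = 5, n = 2.
possible_samples = list(combinations(population, n))

# Repeat the sampling many times and count how often each sample occurs.
draws = 10_000
counts = Counter(
    tuple(sorted(random.sample(population, n))) for _ in range(draws)
)

# Each relative frequency should be close to 1 / 10.
for s in possible_samples:
    print(s, counts[s] / draws)
```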
Example: The management at a housing development is interested in knowing how satisfied the residents are with the amenities provided. A questionnaire is sent to a random sample of 35 current residents and 30 are returned. Identify the population, sample, parameter and statistic.
Answer: The population is all residents of the housing development. The parameter is the proportion of all residents who are satisfied. The sample is the 30 residents who completed the questionnaire. The statistic is the proportion of the 30 residents in the sample who are satisfied.
The information in the table below is part of a study of homes currently for sale in the Rochester area. Data were collected on zip code, age of the home (in years), the number of bedrooms, the distance (in miles) from the University of Rochester, and the type of community (Suburban, Urban, or Rural):
| Zip Code | Age (Years) | Bedrooms | Distance (Miles) | Community Type |
|---|---|---|---|---|
| 14618 | 51 | 2 | 5 | S |
| 14623 | 24 | 4 | 10 | S |
| 14620 | 13 | 3 | 1 | U |
| 14214 | 75 | 2 | 17 | R |
Information such as that in the table is called data. We can obtain data about almost anything: hospitals, companies, stores, automobiles, schools, molecules, and so on, as well as people. The data in the table provide information on five characteristics of four homes. Each characteristic is called a variable because its values may vary from home to home. Each number or letter in the table is called an observation. The observational units are what you take measurements on. In these data, the observational units are the homes.
There are two types of variables: quantitative and categorical. A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense. The values of a quantitative variable are usually recorded in a unit of measurement such as seconds or kilograms. A categorical variable simply places an observational unit into one of several groups or categories.
In our data above, the variables age, bedrooms, and distance are quantitative. Let’s look at these three variables more closely. In the table, every recorded value is a whole number: for instance, 51 years, 2 bedrooms, or 5 miles. The number of bedrooms, however, can only ever be a whole number (0, 1, 2, 3, and so on); no house could have 2.7 bedrooms. Variables such as this are referred to as discrete variables. On the other hand, the age of the home could be 50.5 or 50.48 years (or, in theory, a value with any number of decimal places). Similarly, the distance of the home from the University of Rochester could be 5.2 or 4.8 miles. Variables such as these are referred to as continuous variables.
In general, we may distinguish between discrete and continuous variables as follows: A discrete variable is one that has gaps between possible values (e.g., 2 and 3 bedrooms). It is usually a count. A continuous variable is one that has no gaps between possible values. It is usually a measurement. Thus it is always possible to find another amount between any two distances, e.g., 5.643 miles lies between 5.64 and 5.65 miles.
In our data above, the variables zip code and community type are categorical. Each home is classified into a community type category: Suburban, Urban, or Rural. At first glance, the variable zip code may appear quantitative. However, numeric coding does not make a variable quantitative: it makes no sense to perform arithmetic operations on these numbers, and they have no units attached to them.
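The classification above can be illustrated with a short Python sketch that stores the four homes from the table and summarizes each variable according to its type: an average for quantitative variables and category counts for categorical ones. The field names and the `variable_types` labels are hypothetical names chosen for this illustration. Zip codes are stored as strings to reflect that arithmetic on them is meaningless.

```python
from collections import Counter
from statistics import mean

# The four homes from the table; each dict is one observational unit.
homes = [
    {"zip_code": "14618", "age": 51, "bedrooms": 2, "distance": 5, "community": "S"},
    {"zip_code": "14623", "age": 24, "bedrooms": 4, "distance": 10, "community": "S"},
    {"zip_code": "14620", "age": 13, "bedrooms": 3, "distance": 1, "community": "U"},
    {"zip_code": "14214", "age": 75, "bedrooms": 2, "distance": 17, "community": "R"},
]

# Variable types as classified in the text.
variable_types = {
    "zip_code": "categorical",
    "age": "quantitative (continuous)",
    "bedrooms": "quantitative (discrete)",
    "distance": "quantitative (continuous)",
    "community": "categorical",
}

summaries = {}
for name, kind in variable_types.items():
    values = [home[name] for home in homes]
    if kind.startswith("quantitative"):
        # Arithmetic such as averaging makes sense for quantitative variables.
        summaries[name] = mean(values)
    else:
        # For categorical variables we can only count units per category.
        summaries[name] = Counter(values)

for name, summary in summaries.items():
    print(f"{name} [{variable_types[name]}]: {summary}")
```

Choosing the summary from the variable type, rather than from how the values happen to be stored, is the point of the sketch: a zip code looks numeric but is still only counted.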
Example: Consider the students in STT 213 this semester as the observational units in a statistical study. For each of the following variables, indicate whether the variable is categorical or quantitative (discrete) or quantitative (continuous).
Answer:
Note, however, that average amount of sleep in the past 24 hours among students in our class would not be considered a variable for the above study on college students because it does not vary from college student to college student. However, if the observational units were college classes then it would be considered a variable.
The first step in deciding what kind of analysis to perform on a given dataset is to identify the observational units and the variables. Identifying them is fundamental to knowing how to analyze the data: what kind of graph to produce, which statistic(s) to calculate, and what inference procedure to use. For this reason, we will emphasize the identification of observational units and types of variables throughout the course.
Example: For each of the studies described below, identify the observational units and variable(s). Also classify each variable as quantitative or categorical.
Answer: