How to distinguish between a population and a sample and between a parameter and a statistic
How to distinguish between descriptive statistics and inferential statistics
A Definition of Statistics
Data consist of information coming from observations, counts, measurements, or responses.
Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.
Data Sets
There are two types of data sets you will use when studying statistics. These data sets are called populations and samples.
A population is the collection of all outcomes, responses, measurements, or counts that are of interest.
A sample is a subset, or part, of a population.
Example 1: Identifying Data Sets
In a survey, 834 employees in the United States were asked whether they thought their jobs were highly stressful. Of the 834 respondents, 517 said yes. Identify the population and the sample. Describe the sample data set.
Example 1 | Solution
The population consists of the responses of all employees in the United States.
The sample consists of the responses of the 834 employees in the survey.
The sample data set consists of 517 people who said yes and 317 who said no.
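The counts in this solution can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in Example 1; the variable names are illustrative.

```python
# Sample data from Example 1: 834 respondents, 517 of whom said yes
sample_size = 834
said_yes = 517

# Every respondent who did not say yes said no
said_no = sample_size - said_yes

# Proportion of the sample that said yes (a summary of the sample only)
proportion_yes = said_yes / sample_size

print(said_no)                   # 317
print(round(proportion_yes, 3))  # 0.62
```

The proportion describes the 834 respondents only; it says nothing certain about all employees in the United States.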
Try It Yourself 1 |
In a survey of 1501 ninth to twelfth graders in the United States, 1215 said “leaders today are more concerned with their own agenda than with achieving the overall goals of the organization they serve.” Identify the population and the sample. Describe the sample data set.
Solution 1 | Try It Yourself
The population consists of the responses of all ninth to twelfth graders in the United States.
The sample consists of the responses of the 1501 ninth to twelfth graders in the survey.
The sample data set consists of 1215 ninth to twelfth graders who said leaders today are more concerned with their own agenda than with achieving the overall goals of the organization they serve and 286 ninth to twelfth graders who did not say that.
Data Sets | Parameter and Statistic
A parameter is a numerical description of a population characteristic.
A statistic is a numerical description of a sample characteristic.
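To make the distinction concrete, here is a minimal sketch that computes the same summary (a mean) once for a whole population and once for a sample drawn from it. The population of weekly study hours is made up for illustration and is not data from this course.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: weekly study hours for all 500 students at a school
population = [random.randint(0, 40) for _ in range(500)]

# A sample is a subset of the population (50 students, chosen without replacement)
sample = random.sample(population, 50)

parameter = mean(population)  # numerical description of the population
statistic = mean(sample)      # numerical description of the sample

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The two values are usually close but not equal, which is why it matters to say which group a number describes.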
Example 2 | Distinguishing Between a Parameter and a Statistic
Determine whether each number describes a population parameter or a sample statistic. Explain your reasoning.
A survey of several hundred collegiate student-athletes in the United States found that, during the season of their sport, the average time spent on athletics by student-athletes is 50 hours per week.
The freshman class at a university has an average WASSCE math score of 40%.
Example 2 | Solution
Because the average of 50 hours per week is based on a subset of the population, it is a sample statistic.
Because the average WASSCE math score of 40% is based on the entire freshman class, it is a population parameter.
Try It Yourself 2 |
Determine whether each number describes a population parameter or a sample statistic. Explain your reasoning.
In a random check of several hundred retail stores, the Food and Drug Administration found that 34% of the stores were not storing fish at the proper temperature.
Last year, a small company spent a total of $5,150,694 on employees’ salaries.
Solution 2 | Try It Yourself
Because 34% is based on a subset of the population, it is a sample statistic.
Because the total spent on employees’ salaries, $5,150,694, is based on the entire company, it is a population parameter.
Branches of Statistics
The study of statistics has two major branches: descriptive statistics and inferential statistics.
Descriptive statistics is the branch of statistics that involves the organization, summarization, and display of data.
Inferential statistics is the branch of statistics that involves using a sample to draw conclusions about a population. A basic tool in the study of inferential statistics is probability. (You will learn more about probability in Chapter 3.)
Example 3 | Descriptive and Inferential Statistics
For each study, identify the population and the sample. Then determine which part of the study represents the descriptive branch of statistics. What conclusions might be drawn from the study using inferential statistics?
A study of 300 Wall Street analysts found that the percentage who incorrectly forecasted high-tech earnings in a recent year was 44%.
Example 3 | Solution
The population consists of the high-tech earnings forecasts of all Wall Street analysts, and the sample consists of the forecasts of the 300 Wall Street analysts in the study.
The part of this study that represents the descriptive branch of statistics involves the statement “the percentage [of Wall Street analysts] who incorrectly forecasted high-tech earnings in a recent year was 44%.”
A possible inference drawn from the study is that the stock market is difficult to forecast, even for professionals.
Try It Yourself 3 |
A study of 1000 U.S. adults found that when they have a question about their medication, three out of four adults will consult with their physician or pharmacist and only 8% visit a medication-specific website.
Identify the population and the sample.
Determine which part of the study represents the descriptive branch of statistics.
What conclusions might be drawn from the study using inferential statistics?
Solution 3 | Try It Yourself
The population consists of the responses of all U.S. adults, and the sample consists of the responses of the 1000 U.S. adults in the study.
The part of this study that represents the descriptive branch of statistics involves the statement “three out of four adults will consult with their physician or pharmacist and only 8% visit a medication-specific website [when they have a question about their medication].”
A possible inference drawn from the study is that most adults consult with their physician or pharmacist when they have a question about their medication.
1.2 Data Classification
What You Should Learn
How to distinguish between qualitative data and quantitative data
How to classify data with respect to the four levels of measurement: nominal, ordinal, interval, and ratio.
Types of Data
Data sets can consist of two types of data: qualitative data and quantitative data.
Qualitative data consist of attributes, labels, or nonnumerical entries.
Quantitative data consist of numbers that are measurements or counts.
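As a rough illustration, the sketch below splits a small table into its two data sets and labels each one. The table rows and the helper `data_type` are hypothetical, introduced only for this example.

```python
# Hypothetical table: each row pairs a label with a count
rows = [("Sport A", 1200), ("Sport B", 850), ("Sport C", 430)]


def data_type(entries):
    """Return 'quantitative' if every entry is a number, else 'qualitative'."""

    def is_number(x):
        # bool is excluded because True/False act as labels, not counts
        return isinstance(x, (int, float)) and not isinstance(x, bool)

    return "quantitative" if all(is_number(x) for x in entries) else "qualitative"


names = [name for name, _ in rows]     # first data set: the labels
counts = [count for _, count in rows]  # second data set: the counts

print(data_type(names))   # qualitative
print(data_type(counts))  # quantitative
```

This check is deliberately crude: entries such as phone numbers or ZIP codes are written with digits but serve as labels, so they would still be qualitative data.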
Example 1 | Classifying Data by Type
The table shows sports-related head injuries treated in U.S. emergency rooms during a recent five-year span for several sports. Which data are qualitative data and which are quantitative data? Explain your reasoning.
Example 1 | Solution
The information shown in the table can be separated into two data sets. One data set contains the names of sports, and the other contains the numbers of head injuries treated. The names are nonnumerical entries, so these are qualitative data. The numbers of head injuries treated are numerical entries, so these are quantitative data.
Try It Yourself 1 |
The populations of several Liberian counties are shown in the table. Which data are qualitative data and which are quantitative data? Explain your reasoning. (Source: LISGIS, the Liberia Institute of Statistics and Geo-Information Services)
Solution 1 | Try It Yourself
The information shown in the table can be separated into two data sets. One data set contains the names of counties, and the other contains the numbers of people (populations). The names are nonnumerical entries, so these are qualitative data. The numbers of people are numerical entries, so these are quantitative data.
Levels of Measurement
Another characteristic of data is its level of measurement. The level of measurement determines which statistical calculations are meaningful. The four levels of measurement, in order from lowest to highest, are nominal, ordinal, interval, and ratio.
Nominal and Ordinal
Data at the nominal level of measurement are qualitative only. Data at this level are categorized using names, labels, or qualities. No mathematical computations can be made at this level.
Data at the ordinal level of measurement are qualitative or quantitative. Data at this level can be arranged in order, or ranked, but differences between data entries are not meaningful.
Example 2 | Classifying Data by Level
For each data set, determine whether the data are at the nominal level or at the ordinal level. Explain your reasoning.
1.
code
library(flextable)
library(tidyverse)
library(formattable)
library(gt)
library(DT)
library(reactablefmtr)

# Occupations, listed in rank order
head1 <- c('1. Personal care aides',
           '2. Registered nurses',
           '3. Home health aides',
           '4. Combined food preparation and serving workers, including fast food',
           '5. Retail salespersons')
dataH <- data.frame(head1)

# Give the single column a descriptive title
dataH2 <- dataH |>
  rename("Top five U.S. occupations with the most job growth (projected 2024)" = head1)

# Print the table
knitr::kable(dataH2, format = "pandoc", caption = "")
Top five U.S. occupations with the most job growth (projected 2024)
1. Personal care aides
2. Registered nurses
3. Home health aides
4. Combined food preparation and serving workers, including fast food
5. Retail salespersons
Example 2 | Solution
1. This data set lists the ranks of the five fastest-growing occupations in the U.S. over the next few years. The data set consists of the ranks 1, 2, 3, 4, and 5. Because the ranks can be listed in order, these data are at the ordinal level. Note that the difference between ranks 1 and 5 has no mathematical meaning.
2. This data set consists of the names of movie genres. No mathematical computations can be made with the names, and the names cannot be ranked, so these data are at the nominal level.
Interval and Ratio
The two highest levels of measurement consist of quantitative data only.
Data at the interval level of measurement can be ordered, and meaningful differences between data entries can be calculated. At the interval level, a zero entry simply represents a position on a scale; the entry is not an inherent zero.
Data at the ratio level of measurement are similar to data at the interval level, with the added property that a zero entry is an inherent zero. A ratio of two data entries can be formed so that one data entry can be meaningfully expressed as a multiple of another.
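One way to see why ratios are meaningful only when zero is an inherent zero is to change units and check whether the ratio survives. The sketch below is illustrative: the temperatures and masses are made-up values, and the conversion constant is the standard kilograms-to-pounds factor.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Interval level: temperatures (zero is only a position on the scale)
t1_c, t2_c = 10.0, 20.0
ratio_c = t2_c / t1_c                  # 20 / 10
ratio_f = c_to_f(t2_c) / c_to_f(t1_c)  # 68 / 50

# Ratio level: masses (zero means "no mass" in every unit)
KG_TO_LB = 2.20462
m1_kg, m2_kg = 5.0, 10.0
ratio_kg = m2_kg / m1_kg
ratio_lb = (m2_kg * KG_TO_LB) / (m1_kg * KG_TO_LB)

print(f"temperature ratio: {ratio_c:.2f} in Celsius, {ratio_f:.2f} in Fahrenheit")
print(f"mass ratio:        {ratio_kg:.2f} in kilograms, {ratio_lb:.2f} in pounds")
```

The temperature "ratio" changes with the unit, so saying one temperature is twice another has no fixed meaning. The mass ratio is the same in both units. Differences behave well at both levels.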
Example 3 | Classifying Data by Level
Two data sets are shown at the left. Which data set consists of data at the interval level? Which data set consists of data at the ratio level? Explain your reasoning.
Data set 1
Data set 2
Example 3 | Solution
Both of these data sets contain quantitative data. Consider the dates of the Yankees’ World Series victories. It makes sense to find differences between specific dates. For instance, the time between the Yankees’ first and last World Series victories is the difference between the two years. But it does not make sense to say that one year is a multiple of another. So, these data are at the interval level. However, using the home run totals, you can find differences and write ratios. For instance, Boston hit 22 more home runs than Cleveland hit because 81 - 59 = 22 home runs. Also, Chicago hit about 1.25 times as many home runs as Baltimore hit, a comparison found by dividing Chicago’s total by Baltimore’s total.
Try It Yourself 2 |
For each data set, determine whether the data are at the nominal level or at the ordinal level. Explain your reasoning.
The final standings for the Pacific Division of the National Basketball Association
A collection of phone numbers
Try It Yourself 3 |
For each data set, determine whether the data are at the interval level or at the ratio level. Explain your reasoning.
The body temperatures (in degrees Fahrenheit) of an athlete during an exercise session
The heart rates (in beats per minute) of an athlete during an exercise session
Solution 2 | Try It Yourself
Ordinal, because the data can be put in order.
Nominal, because no mathematical computations can be made.
Solution 3 | Try It Yourself
Interval, because the data can be ordered and meaningful differences can be calculated, but it does not make sense to write a ratio using the temperatures.
Ratio, because the data can be ordered, meaningful differences can be calculated, the data can be written as a ratio, and the data set contains an inherent zero.
1.3 Data Collection and Experimental Design
What You Should Learn
How to design a statistical study and how to distinguish between an observational study and an experiment
How to collect data by using a survey or a simulation
How to design an experiment
How to create a sample using random sampling, simple random sampling, stratified sampling, cluster sampling, and systematic sampling and how to identify a biased sample
Design of a Statistical Study | GUIDELINES
Identify the variable(s) of interest (the focus) and the population of the study.
Develop a detailed plan for collecting data. If you use a sample, make sure the sample is representative of the population.
Collect the data.
Describe the data, using descriptive statistics techniques.
Interpret the data and make decisions about the population using inferential statistics.
Identify any possible errors.
Design of a Statistical Study
A statistical study can usually be categorized as an observational study or an experiment.
In an observational study, a researcher does not influence the responses.
In an experiment, a researcher deliberately applies a treatment before observing the responses. Here is a brief summary of these types of studies.
Observational Study
In an observational study, a researcher observes and measures characteristics of interest of part of a population but does not change existing conditions.
For instance, an observational study was conducted in which researchers measured the amount of time people spent doing various activities, such as volunteering, paid work, childcare, and socializing.
Experiment
In performing an experiment, a treatment is applied to part of a population, called a treatment group, and responses are observed.
Another part of the population may be used as a control group, in which no treatment is applied. (The subjects in both groups are called experimental units.)
In many cases, subjects in the control group are given a placebo, which is a harmless, fake treatment that is made to look like the real treatment.
The responses of both groups can then be compared and studied.
Example 1 a. | Distinguishing Between an Observational Study and an Experiment
Determine whether each study is an observational study or an experiment.
Researchers study the effect of vitamin D3 supplementation among patients who were newly diagnosed with a viral infection. To perform the study, researchers give 2700 U.S. adults either a daily vitamin D3 supplement or a placebo for four weeks.
Example 1 b. | Distinguishing Between an Observational Study and an Experiment
Determine whether each study is an observational study or an experiment.
Researchers conduct a study to determine how confident Americans are in the U.S. economy. To perform the study, researchers call 1019 U.S. adults and ask them to rate current U.S. economic conditions and whether the U.S. economy is getting better or worse.
Example 1 | Solution
1 a. Because the study applies a treatment (vitamin D3) to the subjects, the study is an experiment.
1 b. Because the study does not attempt to influence the responses of the subjects (there is no treatment), the study is an observational study.
Try It Yourself 1 |
The Pennsylvania Game Commission conducted a study to determine the percentage of the Pennsylvania elk population in each age and sex class. The commission captured and released elk during each year of the study and found an overall average of 16% branched bulls, 7% spike bulls, 56% adult cows, and 21% calves. Is this study an observational study or an experiment?
Solution 1 | Try It Yourself
This is an observational study.
Data Collection
There are several ways to collect data. Often, the focus of the study dictates the best way to collect data. Here is a brief summary of two methods of data collection.
Simulation
A simulation is the use of a mathematical or physical model to reproduce the conditions of a situation or process. Collecting data often involves the use of computers. Simulations allow you to study situations that are impractical or even dangerous to create in real life, and often they save time and money. For instance, automobile manufacturers use simulations with dummies to study the effects of crashes on humans. Throughout this course, you will have the opportunity to use applets that simulate statistical processes on a computer.
Survey
A survey is an investigation of one or more characteristics of a population. Most often, surveys are carried out on people by asking them questions. The most common types of surveys are done by interview, Internet, phone, or mail. In designing a survey, it is important to word the questions so that they do not lead to biased results, which are not representative of a population. For instance, a survey is conducted on a sample of physicians to determine whether the primary reason for their career choice is financial stability. In designing the survey, it would be acceptable to make a list of reasons and ask each individual in the sample to select their first choice.
Experimental Design
Three key elements of a well-designed experiment are control, randomization, and replication.
Confounding variable
Because experimental results can be ruined by a variety of factors, being able to control these influential factors is important. One such factor is a confounding variable.
A confounding variable occurs when an experimenter cannot tell the difference between the effects of different factors on the variable.
Placebo Effect
The placebo effect occurs when a subject reacts favorably to a placebo when in fact the subject has been given a fake treatment. To help control or minimize the placebo effect, a technique called blinding can be used.
Blinding is a technique in which the subjects do not know whether they are receiving a treatment or a placebo. In a double-blind experiment, neither the experimenter nor the subjects know whether the subjects are receiving a treatment or a placebo. The experimenter is informed after all the data have been collected. This type of experimental design is preferred by researchers.
Randomization
One challenge for experimenters is assigning subjects to groups so the groups have similar characteristics (such as age, height, weight, and so on). When treatment and control groups are similar, experimenters can conclude that any differences between groups are due to the treatment. To form groups with similar characteristics, experimenters use randomization.
Randomization is a process of randomly assigning subjects to different treatment groups.
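Random assignment can be sketched in a few lines. The example below shuffles a list of subjects and splits it in half; the subject IDs and the group size are hypothetical.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# 20 hypothetical subject IDs: S01, S02, ..., S20
subjects = [f"S{i:02d}" for i in range(1, 21)]

# Shuffle a copy so the original roster keeps its order
shuffled = subjects[:]
random.shuffle(shuffled)

# First half receives the treatment, second half is the control group
half = len(shuffled) // 2
treatment_group = shuffled[:half]
control_group = shuffled[half:]

print("treatment:", sorted(treatment_group))
print("control:  ", sorted(control_group))
```

Because chance alone decides who lands in which group, characteristics such as age or weight tend to balance out between the groups as the number of subjects grows.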
In a completely randomized design, subjects are assigned to different treatment groups through random selection. In some experiments, it may be necessary for the experimenter to use blocks, which are groups of subjects with similar characteristics. A commonly used experimental design is a randomized block design. To use a randomized block design, the experimenter divides the subjects with similar characteristics into blocks, and then, within each block, randomly assigns subjects to treatment groups.
Another type of experimental design is a matched-pairs design, in which subjects are paired up according to a similarity.
Sample size, which is the number of subjects in a study, is another important part of experimental design. To improve the validity of experimental results, replication is required.
Replication is the repetition of an experiment under the same or similar conditions.
Example 2 a. | Analyzing an Experimental Design
A company wants to test the effectiveness of a new gum developed to help people quit smoking. Identify a potential problem with each experimental design and suggest a way to improve it.
The company identifies ten adults who are heavy smokers. Five of the subjects are given the new gum and the other five subjects are given a placebo. After two months, the subjects are evaluated and it is found that the five subjects using the new gum have quit smoking.
Example 2 b. | Analyzing an Experimental Design
A company wants to test the effectiveness of a new gum developed to help people quit smoking. Identify a potential problem with each experimental design and suggest a way to improve it.
The company identifies 1000 adults who are heavy smokers. The subjects are divided into blocks according to gender. Females are given the new gum and males are given the placebo. After two months, a significant number of the female subjects have quit smoking.
Example 2 | Solution
a. The sample size being used is not large enough to validate the results of the experiment. The experiment must be replicated to improve the validity.
b. The groups are not similar. The new gum may have a greater effect on women than on men, or vice versa. The subjects can be divided into blocks according to gender, but then, within each block, they should be randomly assigned to be in the treatment group or in the control group.
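The improvement suggested in part b can be sketched as follows: form blocks by gender first, then randomize within each block. The 40 subjects and their IDs are hypothetical; only the two-step procedure comes from the solution above.

```python
import random

random.seed(11)  # fixed seed so the sketch is repeatable

# Hypothetical subjects: (ID, gender), 20 of each gender
subjects = [(f"S{i:03d}", "F" if i % 2 == 0 else "M") for i in range(1, 41)]

# Step 1: divide the subjects into blocks with a shared characteristic
blocks = {}
for subject_id, gender in subjects:
    blocks.setdefault(gender, []).append(subject_id)

# Step 2: within each block, randomly assign treatment or control
assignment = {}
for gender, members in blocks.items():
    random.shuffle(members)
    half = len(members) // 2
    for subject_id in members[:half]:
        assignment[subject_id] = "treatment"
    for subject_id in members[half:]:
        assignment[subject_id] = "control"

# Each block now contributes equally to both groups
for gender, members in blocks.items():
    treated = sum(assignment[s] == "treatment" for s in members)
    print(f"block {gender}: {treated} treatment, {len(members) - treated} control")
```

With this design, a difference between the treatment and control groups can no longer be explained by gender alone, because both genders appear in both groups.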
Sampling Techniques
A census is a count or measure of an entire population. Taking a census provides complete information, but it is often costly and difficult to perform.
A sampling is a count or measure of part of a population and is more commonly used in statistical studies.
To collect unbiased data, a researcher must ensure that the sample is representative of the population.
Even with the best methods of sampling, a sampling error may occur.
A sampling error is the difference between the results of a sample and those of the population.
A random sample is one in which every member of the population has an equal chance of being selected.
A simple random sample is a sample in which every possible sample of the same size has the same chance of being selected.
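The sketch below draws several simple random samples from one population and reports the sampling error of each. The population is simulated for illustration, and the sample size of 30 is an arbitrary choice.

```python
import random
from statistics import mean

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1,000 measurements
population = [random.gauss(50, 10) for _ in range(1000)]
population_mean = mean(population)

# random.sample draws without replacement, and every subset of size 30
# is equally likely, which matches the definition of a simple random sample
errors = []
for _ in range(5):
    sample = random.sample(population, 30)
    errors.append(mean(sample) - population_mean)

for i, error in enumerate(errors, start=1):
    print(f"sample {i}: sampling error = {error:+.2f}")
```

Each sample gives a slightly different error, even though every sample was drawn by the same unbiased method.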
Stratified Sample
When it is important for the sample to have members from each segment of the population, you should use a stratified sample.
Depending on the focus of the study, members of the population are divided into two or more subsets, called strata, that share a similar characteristic such as age, gender, ethnicity, or even political preference.
A sample is then randomly selected from each of the strata. Using a stratified sample ensures that each segment of the population is represented.
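A stratified sample can be sketched by grouping the population into strata and sampling from each one. The student IDs, the three majors, and the choice of five students per stratum are assumptions made for this illustration.

```python
import random

random.seed(5)  # fixed seed so the sketch is repeatable

# Hypothetical population: (student ID, major), 100 students per major
majors = ["Biology", "History", "Math"]
population = [(i, majors[i % 3]) for i in range(1, 301)]

# Divide the population into strata that share a characteristic (major)
strata = {}
for student_id, major in population:
    strata.setdefault(major, []).append(student_id)

# Randomly select the same number of students from every stratum
per_stratum = 5
stratified_sample = {
    major: random.sample(members, per_stratum)
    for major, members in strata.items()
}

for major, chosen in stratified_sample.items():
    print(major, sorted(chosen))
```

Every major is guaranteed to appear in the sample, which a simple random sample of 15 students would not guarantee.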
Cluster Sample
When the population falls into naturally occurring subgroups, each having similar characteristics, a cluster sample may be the most appropriate.
To select a cluster sample, divide the population into groups, called clusters, and select all of the members in one or more (but not all) of the clusters.
Examples of clusters could be different sections of the same course or different branches of a bank.
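In code, cluster sampling selects whole groups rather than individuals. The sections and student names below are hypothetical.

```python
import random

random.seed(8)  # fixed seed so the sketch is repeatable

# Hypothetical clusters: course sections and the students enrolled in each
clusters = {
    "Section 1": ["Ana", "Ben", "Cal"],
    "Section 2": ["Dee", "Eli", "Fay"],
    "Section 3": ["Gus", "Hal", "Ivy"],
    "Section 4": ["Jon", "Kim", "Lou"],
}

# Randomly choose some (but not all) of the clusters
chosen_clusters = random.sample(sorted(clusters), 2)

# The sample consists of every member of each chosen cluster
cluster_sample = [
    student for name in chosen_clusters for student in clusters[name]
]

print(chosen_clusters)
print(cluster_sample)
```

Contrast this with a stratified sample, which takes some members from every group instead of all members from some groups.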
Systematic Sample
A systematic sample is a sample in which each member of the population is assigned a number.
The members of the population are ordered in some way, a starting number is randomly selected, and then sample members are selected at regular intervals from the starting number. (For instance, every 3rd, 5th, or 100th member is selected.)
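A systematic sample is short to express with list slicing. The population size of 500 and the interval of 25 are assumptions chosen for illustration.

```python
import random

random.seed(2)  # fixed seed so the sketch is repeatable

# Members of the population, numbered 1 through 500
population = list(range(1, 501))

interval = 25                       # select every 25th member
start = random.randint(1, interval)  # random starting number within the first interval

# Slice from the starting member, stepping by the interval
systematic_sample = population[start - 1::interval]

print("starting number:", start)
print(systematic_sample)
```

Only the starting number is random. After that, the selection is fully determined, so the method can be biased when the ordering of the population has a pattern that repeats at the same interval.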
A type of sample that often leads to biased studies (so it is not recommended) is a convenience sample.
A convenience sample consists only of members of the population that are easy to access.
Example 3 | Identifying Sampling Techniques
You are doing a study to determine the opinions of students at your school regarding stem cell research. Identify the sampling technique you are using when you select the samples listed. Discuss potential sources of bias (if any).
You divide the student population with respect to majors and randomly select and question some students in each major.
You assign each student a number and generate random numbers. You then question each student whose number is randomly selected.
You select students who are in your biology class.
Example 3 | Solution
Because students are divided into strata (majors) and a sample is selected from each major, this is a stratified sample.
Each sample of the same size has an equal chance of being selected and each student has an equal chance of being selected, so this is a simple random sample.
Because the sample is taken from students who are readily available, this is a convenience sample. The sample may be biased because biology students may be more familiar with stem cell research than other students and may have stronger opinions.
Try It Yourself 2 |
You want to determine the opinions of students regarding stem cell research. Identify the sampling technique you are using when you select these samples.
You select a class at random and question each student in the class.
You assign each student a number and, after choosing a starting number, question every 25th student.
Solution 2 | Try It Yourself
The sample was selected by using the students in a randomly chosen class. This is cluster sampling.
The sample was selected by numbering each student in the school, randomly choosing a starting number, and selecting students at regular intervals from the starting number. This is systematic sampling.
Using Technology in Statistics |
With large data sets, you will find that calculators or computer software programs can help perform calculations and create graphics. We will perform these calculations using RStudio and Python.
Example 1 | Generating a List of Random Numbers
A quality control department inspects a random sample of 15 of the 167 cars that are assembled at an auto plant. How should the cars be chosen?
Generating a List of Random Numbers | Solution Using RStudio
code
library(tidyverse)
library(tictoc)
library(reticulate)

# One-time Python setup (uncomment if needed):
# install_python()
# install_miniconda()
# conda_list()
# conda_create(
#   envname = "reptilia",
#   packages = c("pandas", "pyarrow")
# )
use_miniconda("reptilia")

# To generate integers WITHOUT replacement:
sample(1:167, 15, replace = FALSE)
Generating a List of Random Numbers | Solution Using Python
code
import random

# Choose 15 distinct car numbers between 1 and 167.
# random.sample draws WITHOUT replacement, so no car is selected twice
# (random.randint in a loop could repeat a number).
random_numbers = random.sample(range(1, 168), 15)
print(random_numbers)
How to construct a frequency distribution, including limits, midpoints, relative frequencies, cumulative frequencies, and boundaries, by hand and with technology (R and Python)
How to construct frequency histograms, frequency polygons, relative frequency histograms, and ogives, by hand and with technology (R and Python)
Frequency Distributions
Important characteristics to look for when organizing and describing a data set are its center, its variability (or spread), and its shape.
In this section, you will learn how to organize data sets by grouping the data into intervals called classes and forming a frequency distribution.
Classes, Frequency
A frequency distribution is a table that shows classes or intervals of data entries with a count of the number of entries in each class.
The frequency f of a class is the number of data entries in the class.
Limits, Width, Range
The lower class limit is the least number that can belong to the class.
The upper class limit is the greatest number that can belong to the class.
The class width is the distance between lower (or upper) limits of consecutive classes.
The difference between the maximum and minimum data entries is called the range.
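These definitions can be tied together in a short sketch that builds a frequency distribution with seven classes. The 30 data entries are hypothetical, and the rule used for the class width (divide the range by the number of classes, then move up to the next whole number) is a common convention assumed here rather than something stated above.

```python
import math

# Hypothetical data set with 30 whole-number entries
data = [12, 45, 33, 27, 51, 18, 39, 64, 22, 48,
        57, 30, 41, 15, 36, 70, 25, 53, 44, 19,
        61, 34, 28, 47, 38, 55, 23, 42, 66, 31]
num_classes = 7

# Range: difference between the maximum and minimum data entries
data_range = max(data) - min(data)

# Class width (assumed rule): range / number of classes, moved up to the
# next whole number so that the last class still contains the maximum
class_width = math.floor(data_range / num_classes) + 1

# Lower limits start at the minimum and step by the class width;
# each upper limit is one less than the next lower limit
lower_limits = [min(data) + i * class_width for i in range(num_classes)]
upper_limits = [low + class_width - 1 for low in lower_limits]

# Frequency f of a class: number of entries between its limits
frequencies = [
    sum(low <= x <= high for x in data)
    for low, high in zip(lower_limits, upper_limits)
]

for low, high, f in zip(lower_limits, upper_limits, frequencies):
    print(f"{low}-{high}: {f}")
```

A quick sanity check is that the frequencies add up to the number of data entries.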
Utility Functions | R
summary() - main function for collecting simple descriptive statistics.
str() or glimpse() - provide a synthetic representation of information on a data frame, like its size, column names, types, and values of the first elements.
head() or tail() - allow visualizing the few topmost (head) or bottommost (tail) rows of a command output.
View() or view() - visualize a data frame.
unique() - returns the unique values in a series. Particularly useful when applied to columns as unique(df$col_name).
Utility Functions | Python
.describe() - main function for obtaining descriptive statistics.
.info() - provides particularly useful information like the size of the data frame and, for each column, its name, type, and number of non-null values.
.head() or .tail() - allow visualizing the few topmost (head) or bottommost (tail) rows of a command output.
.unique() - returns the unique values in a series. Particularly useful when applied to columns as df['col_name'].unique()
Utility Functions | R
names() - returns the column names of a data frame with names(df), and the element names of a list.
class() - returns the data type of an R object, like numeric, character, logical, and data frame.
length() - returns the length of an R object.
nrow() or ncol() - return, respectively, the number of rows and columns in a data frame.
Utility Functions | Python
.columns and .index - return, respectively, the list of column names and the list of names of the row index.
.dtypes - returns the list of columns with the corresponding data type.
.size - returns the number of elements in a pandas object (for a data frame, the number of rows times the number of columns).
.shape - returns the number of rows and columns of a data frame.
Example 1 | Constructing a Frequency Distribution from a Data Set
The data set lists the cell phone screen times (in minutes) for 30 U.S. adults on a recent day. Construct a frequency distribution that has seven classes.
Exercise 2.1 | Do the following using R:
Compute the sum 924 + 124 and assign the result to a variable named a.
Compute a * a
code
library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

# Compute the sum and store it in a, then print a
a <- 924 + 124
a
[1] 1048
code
a * a
[1] 1098304
Variables and functions | What’s in a name?
For the sake of readability, it is often preferable to give your variables more informative names.
There are a few different naming conventions that can be used to name your variables:
snake_case: where words are separated by an underscore (_). Example: household_net_income.
camelCase or CamelCase: where each new word starts with a capital letter. Example: householdNetIncome or HouseholdNetIncome.
period.case: where each word is separated by a period (.).
Example 1 | What’s in a name?
This lovely little code snippet can be used to compute your net income.
code
library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

# Set income and taxes:
income <- 100  # replace 100 with your income
taxes <- 20    # replace 20 with how much taxes you pay

# Compute your net income
net_income <- income - taxes

# Voilà
net_income
[1] 80
Exercise 2.2 | Do It Yourself
What happens if you use an invalid character in a variable name? Try e.g., the following:
net income <- income - taxes
net-income <- income - taxes
ca$h <- income - taxes
What happens if you put R code as a comment? For example,
income <- 100
taxes <- 20
net_income <- income - taxes
In R, tables of vectors are called data frames. We can combine the two vectors into a data frame as follows:
code
library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

# create a data frame called bookstore using age and purchase vectors.
bookstore <- data.frame(age, purchase)
bookstore