1 Chapter 1 Where Do Data Come from?

Data may be available at every moment and everywhere.
Entities such as schools, governments, commercial and industrial sectors all have data.
Researchers need to produce or collect new data in order to do research. In their final scientific report, they usually include the results of a statistical analysis.

Case Study: Apple Watch

Read this article: https://appleinsider.com/articles/22/12/27/apple-watch-sensor-has-racial-bias-claims-new-lawsuit

A video: https://www.youtube.com/watch?v=dzXK7kIwp-Q

Q1: What are data in the first place? The textbook does not give a definition.

Q2: Give a few examples of collecting data.

Q3: What are the ways of collecting data?

Q4: How can we store data on a computer?

A few concepts:

Work on textbook Problems 1.3-1.9 (pages 15-16)

In summary, data can be produced through observational studies (such as sample surveys and censuses) and experimental studies. When data are stored in a rectangular table, the rows are individuals (cases or observations) and the columns are characteristics (or variables). The variable that is our target is called a response variable, and variables that have an impact on (or can explain the variation of) the target are called explanatory variables. When a study is a survey, the survey randomly selects a sample from a specific population.

An example of survey through a questionnaire: https://new.censusatschool.org.nz/wp-content/uploads/2021/04/CensusAtSchool-New-Zealand-2021.pdf

2 Chapter 2 Samples, Good and Bad

Case study 1: Financial aid program

The Dean of Students at XYZ University wants to determine how many undergraduates at XYZ are familiar with a new financial aid program offered by the university. There are 15,000 undergraduates at XYZ, so it is too expensive to conduct a census. Instead, the dean decides to conduct a survey using a sample of 150 undergraduates. The sample is obtained by visiting a chemistry class of 150 students.

What is a potential issue with this sampling method?

Case study 2: Amazon product review

Open the page: https://www.amazon.com/Statistics-Concepts-Controversies-David-Moore/product-reviews/1464192936/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

What is a potential issue with this sampling method? What is the population?

Case study 3: A Twitter poll

https://twitter.com/cspan/status/1016523004602474496

How can we get good samples?

Case study 4:

I want a good sample of 5 students from this class. How can I get such a sample?

Case study 5: A Gallup poll

See Example 5 on page 29 of textbook.

Read the scientific principles behind the poll: https://media.gallup.com/PDF/FAQ/HowArePolls.pdf

In summary, convenience sampling and voluntary response sampling produce bad samples, which cause bias in the analysis. Good samples can be obtained by random sampling, such as simple random sampling, which ensures that every possible sample of the same size is equally likely to be chosen. We can use software to select a simple random sample.
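As a minimal sketch of how software can select a simple random sample, the Python snippet below draws 5 students from a class roster. The roster size of 30, the numbering of students, and the seed are assumptions made only for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct individuals; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical roster: students numbered 1 to 30.
roster = list(range(1, 31))
chosen = simple_random_sample(roster, 5, seed=2024)
print("Selected students:", sorted(chosen))
```

Changing or omitting the seed gives a different sample each time, which is exactly the sample-to-sample variation discussed in the next chapter.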

3 Chapter 3 What Do Samples Tell Us?

Class activity:

Step 1.

Let’s treat our class as a population. Number the students 1, 2, 3, …. Remind them to write their numbers down so that they won’t forget them.

Step 2.

Ask the class if they are taking more than 12 credits in this semester. Give 30 seconds for them to calculate their credits. Denote, by $x$, the number of students who are taking more than 12 credits currently.

$x=~~~~$

Let’s use $p$ to represent the proportion or percentage of all students who are taking more than 12 credits. We also call $p$ the population proportion.

$p=~~~~$

We will pretend that the value of $p$ is unknown to us, and we want to estimate it using a simple random sample of 5 students.

Step 3.

Take a simple random sample of 5 students using computer software. Count how many of them responded “Yes”. Divide this number by 5 to get the so-called sample proportion.

Repeat this 4 times.

The following is a computer simulation for the activity (with $N=25$ students and $x = 15$):

## The responses of the 5 selected students are: 
## 
##  N N Y Y Y 
## 
## The proportion is  0.6

## The responses of the 5 selected students are: 
## 
##  Y N N Y Y 
## 
## The proportion is  0.6

## The responses of the 5 selected students are: 
## 
##  Y N N N Y 
## 
## The proportion is  0.4

## The responses of the 5 selected students are: 
## 
##  Y Y N Y Y 
## 
## The proportion is  0.8
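The output above came from the course's own simulation, whose code is not shown. A rough Python sketch of the same idea might look like the following; the seed and the function name are assumptions, so the draws will not match the ones printed above.

```python
import random

def sample_proportion(responses, n, rng):
    """Draw an SRS of size n and return the sample and its proportion of 'Y'."""
    sample = rng.sample(responses, n)
    return sample, sample.count("Y") / n

# Population as in the simulation: N = 25 students, x = 15 answered "Yes".
population = ["Y"] * 15 + ["N"] * 10
rng = random.Random(1)  # arbitrary seed, used only for reproducibility

for _ in range(4):
    sample, p_hat = sample_proportion(population, 5, rng)
    print("Responses:", " ".join(sample), "| proportion:", p_hat)

print("Population proportion p =", population.count("Y") / len(population))
```

Each run of the loop body is one repetition of Step 3; the sample proportions bounce around the population proportion rather than hitting it every time.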

What do the results tell us?

Here is an app showing the sample to sample variation (especially paying attention to small samples and very large samples):

In the above activity, we introduced two important concepts: the population proportion and the sample proportion. While $p$ is used to denote the population proportion, $\hat{p}$ denotes the sample proportion.

In general, we call the population proportion a parameter and the sample proportion a statistic. A parameter is a quantity that describes the population and a statistic is a quantity that describes the sample. To memorize, use the pattern (p-p, s-s).

By definition, a statistic is a numeric quantity computed from the sample; that is, a statistic describes a sample. In contrast, a parameter describes a population.

Here are some examples:

To estimate the proportion of students at SCSU who are smokers, a sample of 200 students is selected and among these students 18 are smokers. Here the relevant statistic is the proportion of students who are smokers among the 200 selected students. The value of this statistic equals 18/200 or 9%.
To estimate the average income of all Minnesota adults, 1000 adults are selected. Among these 1000 adults, the average income is $57,000. Here the average income of all Minnesota adults is the parameter, and the average income of the 1000 selected adults is the statistic, whose value is $57,000.

Do Exercises 3.3 & 3.4, 3.8-3.11.

Previously, in Chapter 2, we saw bad sampling methods that produce biased samples. So, when designing a sample survey, we should first aim for small bias, or high accuracy, by using good sampling methods. The above activity shows another concern when designing a survey, namely variability. Use a large sample whenever possible to reduce variability and thus increase the precision of your statistic!

When estimating the parameter of a population using the statistic, because of the existence of variability in the statistic, our statements about the parameter should take into account such variability. The more variability, the less precision of the statistic as an estimator of the parameter.

In practice, we make confidence statements such as “We are 95% confident that the true value of the parameter is 56% plus/minus 2%,” where 56% is the value of the statistic, 2% is called the margin of error ($ME$, which measures the precision), and 95% is called the confidence level. The margin of error depends on how confident we want to be in such a statement and on the variability of the statistic (which is the sample proportion here). We will introduce a formula for $ME$ in Chapter 21 on page 496. For now, we use a slightly less accurate version of that formula for 95% confidence: $ME=\frac{1}{\sqrt{n}}$, where $n$ is the sample size. The formula suggests that only the sample size, not the population size, matters. This is true as long as the population is at least 20 times as large as the sample. Now, do Exercise 3.5 on page 50 and Exercises 3.6 & 3.7 on page 51.

A note: the margin of error does not mean that we made a mistake; it simply reflects sample-to-sample variability.

Example 1.

To estimate what percent of adults in Minnesota approve a new state-wide policy, 2500 adults are randomly selected. What is the margin of error for 95% confidence? Use the formula we just introduced.
To estimate what percent of school kids in a school district have read a popular book for kids, 64 kids are randomly selected. What is the margin of error for 95% confidence? Use the formula we just introduced.

Solution.

$ME=\frac{1}{\sqrt{n}}=\frac{1}{\sqrt{2500}}=\frac{1}{50}=2\%$
$ME=\frac{1}{\sqrt{n}}=\frac{1}{\sqrt{64}}=\frac{1}{8}=12.5\%$
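The two calculations above can be checked with a few lines of Python. This is only a sketch of the quick formula $ME=\frac{1}{\sqrt{n}}$ for 95% confidence; the function name is an assumption, not something from the textbook.

```python
import math

def quick_margin_of_error(n):
    """Approximate 95% margin of error for a sample proportion: 1 / sqrt(n)."""
    return 1 / math.sqrt(n)

for n in (2500, 64):
    print(f"n = {n}: ME is about {quick_margin_of_error(n):.1%}")
# n = 2500: ME is about 2.0%
# n = 64: ME is about 12.5%
```

Note that the population size never enters the function; only the sample size $n$ does.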

Summary: population/sample, parameter/statistic, population proportion/sample proportion, margin of error, confidence statements, bias/variability, accuracy/precision.

4 Chapter 4 Sample Surveys in the Real World

What are issues with each of the following cases?

Case Study #1:

A state representative wants to know how voters in his district feel about enacting a statewide smoking ban in all enclosed public places, including bars and restaurants. His staff mails a questionnaire to a simple random sample of 800 voters in his district. Of the 800 questionnaires mailed, 152 were returned.

Case Study #2:

In April 2018, a Gallup Poll asked two questions about the amount one pays in federal income taxes.

Question 1: Do you consider the amount of federal income tax you have to pay as too high, about right, or too low?

Question 2: Do you regard the income tax which you will have to pay this year as fair?

In response to the first question, 48% of respondents said “about right,” while in response to the second, 61% of respondents said that the taxes they paid were “fair.”

Sampling errors are errors caused by the act of taking a sample, either because only part of the population is chosen or because a bad sampling method is used. There are two sources of sampling errors:

Random sampling error, which is the only error quantified by the margin of error and which can be controlled by choosing an appropriate sample size.
The use of bad sampling methods, such as voluntary response, which can be avoided.

There are also nonsampling errors that the margin of error does not cover:

Frame error. Sampling begins with a wrong sampling frame. A sampling frame is the actual list of individuals from which a sample is drawn. Ideally, the sampling frame should be the same as the population of interest. Because a complete list of the entire population is rarely available, the sampling frame may not fully represent the population, which leads to errors (bias). Undercoverage is a common frame error. Other frame errors arise from erroneous inclusions and multiple inclusions (for example, a person who has both a cell phone and a landline is more likely to be included in the sample).
Processing error. These are mistakes such as entering wrong responses into a computer.
Response error. This occurs when a subject lies or does not understand the question being asked.
Nonresponse error. This occurs when failing to obtain data from an individual selected for a sample. It is the most serious kind of nonsampling error and there is no simple cure. The response rate is the percent of selected individuals who did respond, while the nonresponse rate is the percent of selected individuals who did not respond. The nonresponse rate for an industrial survey can be as high as 90%.
Error due to wording. This occurs when a survey question leads subjects towards a particular answer choice.

How to handle the nonresponse problem?

This is a statistical issue. Usually a weighting method is used. For sample surveys that involve households, nonresponders can be replaced by other households in the same neighborhood. Nonresponse is higher in cities, so a survey may give more weight to the city residents who did respond. Similarly, if women are overrepresented in a sample, the survey will give more weight to the men.

In reality, sample surveys use more complex methods than simple random sampling. These methods generate the so-called probability samples, which include:

Stratified random sampling. This method first divides the sampling frame into distinct groups of individuals called strata (according to age, gender, race, etc.). Then it takes a separate random sample from each stratum and combines these samples to form the full sample.
Systematic sampling. First number the individuals, say from 1 to 1000. To select 40 individuals, randomly choose one from the first 25 and then choose every 25th individual after that.
Multi-stage sampling. Divide the sampling frame into clusters, which are usually geographical areas such as states, and choose some clusters. Within each chosen cluster there may be smaller clusters, such as counties; choose some of these. Use all individuals in the clusters chosen at the last stage.
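To make the first two methods concrete, here is a small Python sketch. The systematic example uses the numbers from the text (1000 individuals, a sample of 40, so a step of 25); the two strata, their sizes, and the seed are hypothetical and chosen only for illustration.

```python
import random

def systematic_sample(n_population, n_sample, rng):
    """Pick a random start among the first k individuals, then take every k-th."""
    k = n_population // n_sample          # 1000 // 40 = 25
    start = rng.randint(1, k)             # random individual among the first k
    return list(range(start, n_population + 1, k))

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a separate random sample from each stratum and combine them."""
    combined = []
    for members in strata.values():
        combined.extend(rng.sample(members, n_per_stratum))
    return combined

rng = random.Random(7)  # arbitrary seed
print(systematic_sample(1000, 40, rng)[:5])

# Hypothetical strata: individuals 1-600 and individuals 601-1000.
strata = {"stratum A": list(range(1, 601)), "stratum B": list(range(601, 1001))}
print(stratified_sample(strata, 3, rng))
```

In the systematic sample only the starting point is random; once it is chosen, every other selected individual is determined.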

Examples.

A survey was conducted in a large city concerning the reading ability of students in grade 5. To conduct the survey, researchers obtained a sample of students and administered a standardized test of reading ability to the selected students. The researchers selected a random sample of four school divisions within the city; they then picked a random sample of three elementary schools within each of the selected school divisions. Since there were many grade 5 classes in each of these selected schools, a further sample of one class from each school was picked at random and all the students in that selected class were given the reading test.

This is multi-stage sampling.

Mail carriers of a large city are divided into four groups according to sex (male and female) and whether they walk or ride on their routes. Then, 20 are randomly selected from each group and are interviewed to determine whether they have been bitten by a dog in the last year.

This is stratified sampling.

Suppose a researcher is testing a new vaccine on mice to see how well it prevents the flu virus. The lab has 1000 mice available for testing purposes, and each mouse has an identification number.

The researcher needs 30 mice to properly run her study of the new flu vaccine. She is considering five different sampling methods.

Which sampling method (SRS, convenience, cluster, stratified, or systematic) is used in each of the following situations?

The researcher uses a computer program to randomly choose 30 of the 1000 mice to use in her study.
The researcher chooses the first 30 mice she can reach from the door to the mouse room.
Each cage holds five mice. The researcher randomly chooses six cages to use in her study.
The mice are kept in cages on 10 different shelves in the mouse room. The researcher chooses three mice from each shelf.
The researcher generates a list of identification numbers that has been sorted from smallest to largest. She randomly picks one mouse from the list, and then chooses every fourth mouse until she has 30.

A summary: population/sample, parameter/statistic, sampling errors/non-sampling errors, random sampling error (characterized by ME), various sampling methods

5 Chapter 5 Experiments, Good and Bad

In observational studies, investigators simply observe, collect data, and analyze them. A good observational study collects data through random sampling.

In experimental studies, or experiments, investigators deliberately change certain variable(s), called explanatory variables, and see how the result changes in response. A good experiment randomly assigns individuals (called subjects, who may be volunteers) to at least two treatments that are being compared.

Case Study #1: Online vs. Traditional Courses

Colleges and universities offer many courses online. Are online courses better than, or at least no worse than, traditional in-class courses?

Questions to explore:

If an observational study is used, how is it done? What would the collected data look like?
If an experimental study is used, how is it done? Draw a diagram showing the procedure. What would the collected data look like?
Which study design is better in the above situation? Why?

Name three important things (or more formally, principles) of an experiment.

The simplest such experiment is called a randomized comparative experiment.
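The random assignment step of a randomized comparative experiment can be sketched in Python as follows. The 20 subject labels and the seed are hypothetical; the point is only that chance, not the investigator, decides who gets which treatment.

```python
import random

def randomize_two_groups(subjects, seed=None):
    """Shuffle the subjects and split them into treatment and control halves."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subjects labeled 1 to 20.
treatment, control = randomize_two_groups(range(1, 21), seed=5)
print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```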

Textbook Example 3 on page 91, Example 4 on page 92 and Example 5 on page 94.

DIY: Do 5.1 on page 95.

Experiments can be done wrong!!

One-track experiments. This refers to experiments with a single treatment.
Experiments without using randomization to allocate subjects.

If the average response of the treatment group compared to that of the control group is so different that it would rarely occur by chance, then we say that the result is statistically significant. The difference between the two average responses is called the effect of the treatment.

The effect of a treatment can also be defined through proportions. For example, in a well-designed experiment, 68% of covid-19 cases are cured with a new treatment while 46% are cured with an existing treatment. The effect of the treatment is $68\%-46\%$ or 22 percentage points.

Example 1.

Sixty-eight high school students who have never taken the American College Test (ACT) are willing to participate in an experiment. Thirty-four of them are randomly selected to join a 2-week training program, while the remaining 34 prepare for the ACT on their own.

Is this a randomized comparative experiment?
If the average score of the students in the treatment group were 27 and the average score of the control group were 20, what is the effect of the treatment? Is the effect statistically significant?
If the average score of the students in the treatment group were 20.5 and the average score of the control group were 20, what is the effect of the treatment? Is the effect statistically significant?
Suppose instead that a total of 6800 students participate in such a study and 3400 join the 2-week program. If the average score of the students in the treatment group were 20.5 and the average score of the control group were 20, what is the effect of the treatment? Is the effect statistically significant? Is this effect practically important?
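One way to explore what “would rarely occur by chance” means in this example is a small simulation. The sketch below is not the textbook's method: it assumes, purely for illustration, that individual ACT scores vary with a standard deviation of 5 points and that group averages are roughly normally distributed. It then estimates how often two groups with no real treatment effect would differ by at least the observed amount (7 points or 0.5 points).

```python
import math
import random

def chance_of_difference(observed_diff, n_per_group, sd=5.0, reps=20000, seed=0):
    """Estimate how often two groups with NO real treatment effect show a
    difference in average scores at least as large as observed_diff."""
    rng = random.Random(seed)
    # The average of n independent scores has spread sd / sqrt(n).
    se_mean = sd / math.sqrt(n_per_group)
    hits = 0
    for _ in range(reps):
        diff = rng.gauss(0, se_mean) - rng.gauss(0, se_mean)
        if abs(diff) >= observed_diff:
            hits += 1
    return hits / reps

# sd = 5.0 is a made-up value, not taken from the textbook.
for diff, n in [(7, 34), (0.5, 34), (0.5, 3400)]:
    print(f"effect {diff}, {n} per group: chance fraction = "
          f"{chance_of_difference(diff, n):.4f}")
```

Under these assumptions, a 0.5-point difference is common by chance with 34 students per group but rare with 3400 per group, even though the effect itself is equally small in both cases. This is the distinction between statistical significance and practical importance.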

Example 2.

Refer to textbook example 6 on page 98.

Association found in an observational study may not be causal. That is, association does not imply causation. This is because the effect of an explanatory variable (say $X$) on the response variable (say $Y$) may be confounded by the effect of another explanatory variable (say $Z$) or a lurking variable (say $L$) which is associated with both $X$ and $Y$.

Causation can only be established through well-designed controlled experiments.

A summary:

There is a confounding issue in observational studies: The effect of an explanatory variable on a response variable is confounded by one or more other explanatory or lurking variables. Adjustment is needed if confounding is due to other explanatory variables.
There are three principles of any good experiment: Control, randomization, and replication.
Statistical significance is not the same as practical significance.
Experiments that study the effectiveness of medical treatments on actual patients are called clinical trials. Clinical trials are often double-blind (neither subjects nor investigators know which treatment was received) to remove the so-called placebo effect. A placebo is a dummy treatment with no active ingredients. Many patients respond favorably to any treatment, even a placebo.

6 Chapter 6 Experiments in the Real World

Case Study: Caffeine Dependence on page 111

Questions on page 125.

6.1 Equal Treatment for All

In a randomized comparative experiment, all the subjects are treated alike except for the treatments that the experiment is designed to compare. Any other unequal treatment can cause _____.

6.3 Refusal, Nonadherence, and Dropout

Just like sample surveys suffer from _______ due to failure to contact some people selected for the sample and the refusals of others to participate, experiments with human subjects suffer from problems such as refusals, nonadherence or noncompliance, and dropouts, all of which cause ______.

6.4 Can We Generalize?

The most common weakness of experiments is that we can’t generalize the conclusions widely. Some experiments use a special group of subjects, such as college students, and every experiment is performed at a specific place and time. We want to see similar experiments at other places and times confirm important findings. One remedy is meta-analysis, a set of statistical methods for reaching an overall conclusion by combining the results of several studies done in different settings, with different designs, and of different quality.

6.5 Experimental Designs in the Real World

We have talked about experiments that divide the subjects at random into as many groups as there are treatments and then apply each treatment to one of the groups. Such a design is called a completely randomized design (CRD).

Example 8 on page 119:

How many explanatory variables are there?
How many combinations of levels are there?
How many treatments are there?
What is the response variable?

There are other designs:

Matched pairs design. This design compares just two treatments. Choose pairs of subjects that are as closely matched as possible. Assign one of the two treatments to each subject in a pair by tossing a coin. Sometimes the two subjects in each pair can be the same individual who gets both treatments together (for example, each on a different arm or leg) or one after the other with the order of the treatments being randomized for each subject by a coin toss. Example 9 on page 121.

Example:

Does Stay Bright nail polish remain on fingernails longer than Acme nail polish? You design an experiment to find out. A random sample of people was selected to wear both nail polishes. Which hand received which polish was determined randomly. After one week, the difference in the quality of the nail polish of each subject is evaluated. What type of experimental design is this?

Block design. In this design, the random assignment of subjects to treatments is carried out separately within each block. A block is a group of experimental subjects that are known before the experiment to be similar in some way that is expected to affect the response to the treatments. Blocks control the effects of some outside variables, and thus are another form of control.

Read textbook Example 10 on page 122 and Example 11.

A matched pairs design is a special block design.

An advantage of the block design over the completely randomized design is that it allows you to control outside variables that might otherwise obscure the effect of your explanatory variable.
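The difference between a block design and a completely randomized design is where the randomization happens. The following Python sketch randomizes separately within each block; the blocks, subject labels, treatment names, and seed are all hypothetical.

```python
import random

def block_randomize(blocks, treatments, seed=None):
    """Randomly assign treatments separately within each block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, subjects in blocks.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)
        # Deal the shuffled subjects out to the treatments in turn.
        for i, subject in enumerate(shuffled):
            assignment[subject] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical blocks and subject labels.
blocks = {"women": ["W1", "W2", "W3", "W4"], "men": ["M1", "M2", "M3", "M4"]}
result = block_randomize(blocks, ["A", "B"], seed=3)
for subject, (block, treatment) in sorted(result.items()):
    print(subject, block, treatment)
```

Because the shuffle is done block by block, each treatment receives the same number of subjects from every block, so the blocking variable cannot be confused with the treatment effect.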

Self reading (page 124): Statistical Controversies: Is It or Isn’t It a Placebo?

7 Chapter 7 Data Ethics

Case Study: Marijuana and Driving Performance (https://jamanetwork.com/journals/jamapsychiatry/fullarticle/2788264)

The study was conducted by the University of California, San Diego between 1/2017 and 6/2019.

Questions of interest:

What is the relationship between the dose of marijuana and driving performance?
How long is driving impaired after using marijuana?
How can marijuana use be tested by law enforcement in the field?

To answer these questions, study participants were randomized into three groups receiving different doses: 0% THC (i.e., placebo), 5.9% THC, and 13.4% THC.

While it is important to explore these questions, is it ethical to have subjects consume Tetrahydrocannabinol (THC) so that their judgment is impaired?

7.1 First Principles of Data Ethics

Institutional Review Boards. This is a committee that reviews all planned studies to be carried out in an organization in order to protect the subjects from possible harm.
Informed Consent. All human subjects in a study must give their informed consent before data are collected. Before giving their consent, subjects have a right to know how you plan to collect, store, and use their data. Tell subjects that the data collected will be stored in a secure database.
Confidentiality. All individual data must be kept confidential. Only statistical summaries for groups of subjects may be made public.

For situations where a formal ethics review committee does not exist, the principles of the Declaration of Helsinki should be followed. The Declaration of Helsinki is a set of ethical principles regarding human experimentation developed originally in 1964 for the medical community by the World Medical Association.

7.2 Anonymous or Confidential?

Anonymity means that subjects are not known at all to anyone involved in the study, while confidentiality means no individual data is known to the public, except the statistical summaries for groups of subjects. Read 7.3 on page 142 of textbook.

7.3 Who Owns Published Data?

The U.S. Supreme Court has ruled that “data” are facts and cannot be copyrighted. However, compilations of facts are generally copyrightable. Data from a table used to make a graphical presentation or data read from a graph can be used freely without permission.

8 Chapter 8 Measuring

In Chapter 2, we introduced sampling methods for estimating the parameters of a population.

In this chapter, we consider the problem of measuring a characteristic or property (tangible or intangible) of an individual.

Case Study #1: Measuring Weight

Case Study #2: Measuring Intelligence

What is a person’s intelligence? Google it.

Once intelligence is defined, is it measurable in the same way that height or weight is?

8.1 Measurement Basics

To measure something means to assign a number to some property or characteristic of an individual or a thing.

We often use an instrument to make a measurement. When measuring a particular property on different individuals, we use a variable whose values are recorded with a particular unit that we can choose.

8.2 Measurements, Valid and Invalid

A variable is a valid measure of a property if it is relevant or appropriate as a representation of that property.

Example.

Each of the following is a valid measurement of athletic ability:

Time (in seconds) to run a 100-meter dash.
Maximum weight (in pounds) a person can bench press.
Number of sit‑ups a person can do in one minute.

But, the number of times a person goes to the gym per week is not a valid measure of athletic ability.

Other Examples.

GPA is a valid measure of a student’s overall performance, but the total number of credits taken by the student is not.
The number of covid-19 cases in California is much larger than that in Minnesota, so a person is more likely to contract covid-19 in California than in Minnesota. Is this conclusion justified? Explain your reasoning.

The answer is no. California has a much larger population than Minnesota. Even if the likelihood of contracting covid-19 were the same in the two states, the number of cases would be larger in California, so the number of cases is not a valid measure of the likelihood of contracting covid-19. A valid measure is the proportion of the population in each state that has contracted covid-19.

In general, a rate (a fraction, proportion, or percentage) at which something occurs is a more valid measure than a simple count of occurrences.

Example.

Between 1977 and 2017, 1465 convicted criminals were put to death in the United States. Here are data on the number of executions in several states during those years, as well as the estimated June 1, 2017, population of these states:

Find the rate of executions for each of the states just listed, in executions per million population, using the formula

\[\text{rate per million}=\frac{\text{executions}}{\text{population in thousands}}\times 1000\]

Solution.

To two decimal places, the rate of executions, in executions per million population, for

Alabama is $\frac{61}{4875}\cdot 1000=12.51$,
Arkansas is $\frac{31}{3004}\cdot 1000=10.32$.

That is, the rate of executions per million population is 12.51 for Alabama and 10.32 for Arkansas.
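The same rates can be computed with a short Python function. This is a sketch using only the two states whose figures appear in the solution above; the function and variable names are illustrative.

```python
def rate_per_million(executions, population_thousands):
    """Executions per million residents, given the population in thousands."""
    return executions / population_thousands * 1000

# Counts and populations (in thousands) taken from the solution above.
states = {"Alabama": (61, 4875), "Arkansas": (31, 3004)}
for name, (executions, population) in states.items():
    print(f"{name}: {rate_per_million(executions, population):.2f} per million")
# Alabama: 12.51 per million
# Arkansas: 10.32 per million
```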

Validity is simple for measurements of physical properties such as length, weight, and time. When measuring human personality and other vague properties, predictive validity is the most useful way to say whether our measures are valid. A measurement of a property has predictive validity if it can be used to predict success on tasks that are related to the property measured.

Which of the following measures has predictive validity when used to predict a student’s college success?

High school GPA.
Gender
Type of high school attended (private or public)
Political leanings

8.3 Bias and Variance

The systematic error that occurs every time we make a measurement is called bias. A biased measurement process (or instrument) has the tendency to systematically either overestimate or underestimate the true value. A measured value may consist of three parts:

\[\text{measured value = true value + bias + random error}\]

A measurement process has random error if repeated measurements on the same individual give different results. If the random error is small, we say the measurement is reliable. The average of several repeated measurements of the same individual is more reliable (less variable) than a single measurement.
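The following Python sketch simulates the model measured value = true value + bias + random error. All numbers in it (a true weight of 150, a bias of 2, a random-error standard deviation of 1, averages of 10 readings, and the seed) are hypothetical. It compares single readings with averages of repeated readings.

```python
import random
import statistics

def measure(true_value, bias, error_sd, rng):
    """One reading: measured value = true value + bias + random error."""
    return true_value + bias + rng.gauss(0, error_sd)

def averaged_reading(true_value, bias, error_sd, k, rng):
    """Average of k repeated readings of the same individual."""
    return statistics.mean(measure(true_value, bias, error_sd, rng) for _ in range(k))

rng = random.Random(11)  # arbitrary seed
# Hypothetical scale: true weight 150, reads 2 too high, random error SD 1.
singles = [measure(150, 2, 1, rng) for _ in range(2000)]
averages = [averaged_reading(150, 2, 1, 10, rng) for _ in range(2000)]

print("single readings: mean", round(statistics.mean(singles), 2),
      "variance", round(statistics.variance(singles), 3))
print("averages of 10:  mean", round(statistics.mean(averages), 2),
      "variance", round(statistics.variance(averages), 3))
```

In this sketch the averages vary much less than the single readings, but both center near 152 rather than 150: averaging improves reliability without removing bias.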

Reliability and validity are different qualities. Bias and lack of reliability are different kinds of error.

Example.

Professor Robinson has two teaching assistants who grade homework for a Statistics 101 course. He gives each TA the same student’s paper to grade and has each TA grade the paper according to the same rubric (a clear scoring guide).

Professor Robinson is doing this to try to guarantee the scores given by the TAs are ______.

(a) not biased.
(b) predictive.
(c) reliable.
(d) valid.

The correct one is (c). A measurement process has a random error if repeated measurements on the same individual give different results. If the random error is small, we say the measurement is reliable.

The same paper is graded by both assistants in order to reduce the likelihood of a random error. If the assistants give different scores, the paper can be double‑checked. Checking homework this way allows more accurate rating of assignments in accordance with the rubric. Thus, Professor Robinson tries to guarantee that the scores given by the TAs are reliable.

A quantity called variance can be used to determine if the random error is small. The variance of $n$ repeated measurements on the same individual is computed as follows:

Find the arithmetic average of these $n$ measurements.
Compute the difference between each observation and the arithmetic average and square each of these differences.
Divide the sum of all squared differences by $n-1$. The result is the variance.

When a measurement process has a small variance, it is said to be reliable.

Example.

The volume of a ball is measured five times. The measurements are 23, 22, 21, 22, and 22, respectively. What is the variance?

The same individual measures the volume the next day using a different method and gets these values: 22, 23, 21, 22, 21, and 23. What is the variance of this new method?

Which method is more reliable?

Solution.

Method 1

Step 1: The average of the five values is $\frac{23+22+21+22+22}{5}=22$.

Step 2: The differences between the observed values and the average are 1, 0, $-1$, 0, and 0.

Step 3: Adding the squared differences gives $1^2+0^2+(-1)^2+0^2+0^2=2$. Dividing this result by $5-1$ or 4 gives 0.5. This is the variance of the measurements using the first method.

Method 2

Step 1: The average of the six values is $\frac{22+23+21+22+21+23}{6}=22$.

Step 2: The differences between the observed values and the average are $0$, 1, $-1$, $0$, $-1$, and 1.

Step 3: Adding the squared differences gives $0^2+1^2+(-1)^2+0^2+(-1)^2+1^2=4$. Dividing this result by $6-1$ or 5 gives 0.8. This is the variance of the measurements using the second method.

Since the variance of the first method is smaller, the first method is more reliable. That is, measurements using the first method have smaller random error.
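The three steps can be checked in Python. The sketch below writes the steps out by hand and compares the result with `statistics.variance` from the standard library, which also divides by $n-1$; the function name `sample_variance` is illustrative.

```python
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

method_1 = [23, 22, 21, 22, 22]
method_2 = [22, 23, 21, 22, 21, 23]

for name, values in [("Method 1", method_1), ("Method 2", method_2)]:
    # statistics.variance uses the same n - 1 divisor, so the two should agree.
    print(name, sample_variance(values), statistics.variance(values))
```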

A perfectly reliable measurement process has no variance.

8.4 Improving Reliability, Reducing Bias

Use the average of repeated measurements to improve the reliability of a measurement.
Bias can be reduced by using a more accurate measurement process or instrument, just as bias in sampling is reduced by using a better sampling method. Before talking about bias, validity should be assured.

9 Chapter 9 Do the Numbers Make Sense?

Case Study:

A basketball team took 50 free throws in a game last week and made 61% of them. Do the numbers make sense?

9.1 Percentage Change

If a quantity changes from $a$ to $b$, then the percentage change (increase or decrease) is

\[\text{percentage change} = \frac{b(\text{ending value}) - a(\text{starting value})}{a(\text{starting value})}\cdot 100\%\]

Example.

Unemployment rate climbing from 4% to 5% indicates an increase of one percentage point, but a 25% increase. Why?
Unemployment rate dropping from 5% to 4% indicates a decrease of one percentage point, but a 20% decrease. Why?

Solution.

$5\% - 4\% = 1\%$, $\text{percentage change} = \frac{\text{ending value} - \text{starting value}}{\text{starting value}}\cdot 100\%=\frac{5\%-4\%}{4\%}\cdot 100\%=\frac{1\%}{4\%}\cdot 100\%=\frac{1}{4}\cdot 100\%=25\%$, which means that the unemployment rate increases by 25%.
$4\% - 5\% = -1\%$, $\text{percentage change} = \frac{\text{ending value} - \text{starting value}}{\text{starting value}}\cdot 100\%=\frac{4\%-5\%}{5\%}\cdot 100\%=\frac{-1\%}{5\%}\cdot 100\%=\frac{-1}{5}\cdot 100\%=-20\%$, which means that the unemployment rate decreases by 20%.
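The percentage-change formula can also be checked with a few lines of Python. This is a minimal sketch; the function name `percentage_change` is an assumption made for this illustration.

```python
def percentage_change(start, end):
    """Percentage change from a starting value to an ending value."""
    return 100 * (end - start) / start

# Unemployment rate going from 4% to 5%, and from 5% back to 4%
print(percentage_change(4, 5))   # 25.0
print(percentage_change(5, 4))   # -20.0
```

A positive result is an increase and a negative result is a decrease. The two answers differ because the starting value in the denominator is different.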

Exercises.

Suppose Brian is in the market for a used textbook and the campus bookstore is having a sale. If the initial price of the used book is $87 and the discounted price is $58, what is the percentage change in the book price? Round your answer to two places after the decimal. What does the number mean?
My local Sam’s Club gas price is $3.50. The price was $2.35 at this time last year. What is the percentage change in the Sam’s Club gas price? Round your answer to two places after the decimal. What does the number mean?
One day the number of new covid-19 cases was 250 in county A and 320 in county B. The number increased by 20% the next day in county A but decreased by 20% in county B. What was the number of new covid-19 cases in each county the next day?
At store A, the price of a jacket was reduced by 10% on July $1^{st}$ and then reduced by a further 20% the next day. At store B, the price of the same jacket was reduced by 25% over those two days. If the original price of the jacket was $120 at the two stores, what was the price of the jacket at each store on July $2^{nd}$?

Solution.

Answers: -33.33% (decrease by 33.33%), 48.94% (increase by 48.94%), 300 & 256, 86.40 & 90

9.2 The Difference between “as large as” and “larger than”

To find how many times $a$ is as large as $b$, we do $\frac{a}{b}$.
To find how many times $a$ is larger than $b$, we do $\frac{a-b}{b}$ or $\frac{a}{b}-1$.

For example,

8 is 4 times as large as 2, but is 3 times larger than 2.
24 is 3 times as large as 8, but is twice larger than 8.
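The two phrases correspond to two different calculations, as this short Python sketch shows. The function names are made up for this illustration.

```python
def times_as_large(a, b):
    """How many times a is as large as b."""
    return a / b

def times_larger(a, b):
    """How many times a is larger than b."""
    return (a - b) / b

print(times_as_large(8, 2), times_larger(8, 2))     # 4.0 3.0
print(times_as_large(24, 8), times_larger(24, 8))   # 3.0 2.0
```

The second number is always one less than the first.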

9.3 The Difference between “Percentage Points Higher” and “Percent Higher”

8% is ___ percentage points higher than 5%.
8% is ___ percent higher than 5%.
15% is ___ percentage points higher than 10%.
15% is ___ percent higher than 10%.
30% is ___ percentage points higher than 25%.
30% is ___ percent higher than 25%.

Solution.

Answer: 3,60,5,50,5,20
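The difference between the two phrases can be checked with a small Python sketch. The function names are assumptions made for this illustration: "percentage points higher" is a plain difference, while "percent higher" is a percentage change relative to the smaller rate.

```python
def percentage_points_higher(a, b):
    """Difference between two percentages, in percentage points."""
    return a - b

def percent_higher(a, b):
    """Relative difference of percentage a over percentage b, in percent."""
    return 100 * (a - b) / b

for a, b in [(8, 5), (15, 10), (30, 25)]:
    print(a, b, percentage_points_higher(a, b), percent_higher(a, b))
```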

9.4 More Exercises

12 is ___ times as large as 4.
12 is ___ times larger than 4.
The price of gold was $1060 per ounce on December 31 and has risen 12.25% since that time. What is the price per ounce now?
The net asset value of a mutual fund has decreased from $2,700 on December 31 to $1,330 now. The percent decrease in value is about ___% (to the nearest hundredth).
A student reports that, of a simple random sample of 20 college undergraduate students, x% were working at least two jobs. Here $x$ cannot be

Which of the following statements do you think could possibly be true?

A basketball team took 45 free throws in a game last week and made 74% of them.
Yesterday, it was 20 (Fahrenheit) in Chicago. Today, it warmed up to 40. This is a 50% increase in the temperature.
My weight decreased by 5% last year but then increased by 5% in the first 2 months of this year. Thus, my overall weight from the beginning of last year until now is unchanged.
The number of students enrolled in a statistics program increased by 24% last year and ended with 60.
None of the above

In the last half of 2018, the price of Apple, Inc. (AAPL) common stock dropped from $124.20 per share to $101.40 per share. What percent decrease is this? Round your answer to the nearest hundredth in percentage.

Solution.

Answer: 3, 2, 1189.85, 50.74, 24, e, 18.36
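One way to check whether a reported percentage makes sense is to list every percentage that a whole number of successes could produce. The sketch below assumes that percentages are reported rounded to the nearest whole number; the function name `achievable_percentages` is introduced here for illustration.

```python
def achievable_percentages(n):
    """Whole-number percentages that k successes out of n attempts can round to."""
    return {round(100 * k / n) for k in range(n + 1)}

print(61 in achievable_percentages(50))   # False: 50 free throws cannot give 61%
print(74 in achievable_percentages(45))   # False: 45 free throws cannot give 74%
print(24 in achievable_percentages(20))   # False: a sample of 20 cannot give 24%
```

With 20 individuals, for example, every possible percentage is a multiple of 5.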

10 Chapter 10 Graphs, Good and Bad

Case Study 1: President Biden job approval

https://news.gallup.com/poll/395378/biden-job-approval-dips-new-low.aspx

10.1 Pie Charts and Bar Graphs

For a bar graph to be accurate, the bars should have equal width. There should be no unmarked gap in the quantitative axis; that is, the quantitative axis should have an origin.
Avoid fancy bar graphs, such as the following:

If you have categorical data, use https://goodcalculators.com/pie-chart-calculator/ to create pie charts.
Use data from: https://new.censusatschool.org.nz/explore/ to create bar graphs
What advantage does a bar graph have over a pie chart? It is easier to compare the heights of the bars on the bar graph than it is to compare the size of the angles on the pie chart.
Warning on using a pie chart: It is not correct to display data in a pie chart, if the percentages do not represent parts of a whole.

Example.

According to the National Household Survey on Drug Use and Health, when asked in 2016, 23.5% of those aged 18 to 25 years used cigarettes in the past month, 5.2% used smokeless tobacco, 23.2% used illicit drugs, and 38.4% engaged in binge alcohol drinking. Is it correct to display these data in a pie chart?
You have the average SAT score of entering freshmen for five universities. The best graphical display for these data would be a

pie chart.
bar graph.
line graph.
side‑by‑side bar graph.

10.2 Beware the Pictogram

Pictograms are tempting, but not suggested.

10.3 Change over Time: Line Graphs

Observing the same quantity over time produces a time series. For example,

Monthly unemployment rate
Daily sales of a product

If you want to show how a quantity (e.g., price of cable television) has changed over time, use a line graph that connects data points by lines, with time on the horizontal axis and the measured quantity on the vertical axis.

It is not acceptable for a line graph to have time intervals that are not equally spaced.

How can we examine a line graph?

Look for an overall pattern.
Look for striking deviations from the overall pattern.
Pay attention to a possible regular pattern that repeats each year (seasonal variation). Many series of regular measurements over time are seasonally adjusted by removing expected seasonal variation before the data are published.

Example 6 on page 222 of textbook

10.4 Watch Those Scales!

Scales can be manipulated to exaggerate a trend.

10.5 Making Good Graphs

Read the article here https://www.callingbullshit.org/tools/tools_misleading_axes.html to learn about good and bad graphs.

Attitudes on same‑sex marriage have changed over time, but they also differ according to age. The figure below shows change in the attitudes on same‑sex marriage for four generational cohorts. The y‑axis in this figure shows the proportion in each generational cohort who favor same‑sex marriage.

We can gain some insights from this graph:

There is an increasing trend over time of people who support same‑sex marriage.
On average, a higher proportion of younger people support same‑sex marriage than older people.
There is no data displayed from 2002 for any cohort, and no data from 2001 for Millennials.

The following side‑by‑side bar graph shows educational attainment, by sex, for those aged 25 and older. These data were collected in the 2017 Current Population Survey.

Comparing educational attainment for males and females, we can say that

a larger percentage of males than females are high school graduates or less.
a larger percentage of females than males are more than high school graduates.

Do consumers prefer trucks, SUVs, and minivans to passenger cars? The data below give the sales and leases of new cars and trucks (in thousands of vehicles) in the United States from 1996 to 2010. (The definition of “truck” includes SUVs and minivans.)

Year	Cars	Trucks
1996	10550	8130
1997	10510	8430
1998	10990	9080
1999	11410	11010
2000	11710	10990
2001	11060	10750
2002	10250	10498
2003	9860	10212
2004	10100	10194
2005	9942	10546
2006	10118	1060
2007	9943	1022
2008	8833	7482
2009	7193	5860
2010	7530	7020

Plot two line graphs on the same axes to compare the change in car and truck sales over time. We observe the following:

Sales and leases for both cars and trucks were at the highest in 1999 and 2000. Later years show an overall trend of gradually decreasing sales and leases until 2007 when the trend decreases sharply for both cars and trucks before beginning to rise again. Overall, trucks have much lower sales and leases than cars, except for the period 2002 to 2007 when truck sales and leases were only slightly higher than cars.

Women were allowed to enter the Boston Marathon in 1972. The time (in minutes, rounded to the nearest minute) for each winning woman from 1972 to 2018 appears in the table below.

Year	Time (minutes)
1972	190
1973	186
1974	167
1975	162
1976	167
1977	168
1978	165
1979	155
1980	154
1981	147
1982	150
1983	143
1984	149
1985	154
1986	145
1987	146
1988	145
1989	144
1990	145
1991	144
1992	144
1993	145
1994	142
1995	145
1996	147
1997	146
1998	143
1999	143
2000	146
2001	144
2002	141
2003	145
2004	144
2005	145
2006	144
2007	149
2008	145
2009	152
2010	146
2011	143
2012	152
2013	146
2014	139
2015	145
2016	149
2017	142
2018	160

Make a graph of the winning times. Give a brief description of the pattern of Boston Marathon winning times over these years. Have times stopped improving in recent years? Describe the line graph.

Here are some insights from the graph just created:

Average winning times have held fairly steady since the mid nineties, although there may be a slight increase in recent years.
There was a large overall decrease in winning times from 1972 until the early eighties. After the early eighties, the improvement in winning times has slowed considerably.
Winning times have generally improved slightly since the early eighties, but there has been more fluctuation recently.

10.6 More Exercises

A recent report on the religious affiliation of Hispanics says 55% are Catholic, 22% Protestant, 18% unaffiliated, and 4% other.

Explain why these numbers do not add up to 100%.
We can illustrate these numbers with a

line graph

bar graph

pie chart

Both (b) and (c)

None of these

In a college, 20% of Physics students are female, 25% of Mathematics students are female, 60% of Statistics students are female, 50% of Biology students are female, 40% of Chemistry students are female, and 80% of other students are female. We can illustrate these numbers with a

line graph

bar graph

pie chart

Both (b) and (c)

None of these

Using data on the average national cost of regular grade gasoline by month since 2006, to show clearly how the cost of gasoline has changed over time, the best choice of graph is a

line graph.

bar graph.

pie chart.

histogram.

scatterplot.

11 Chapter 11 Displaying Distributions with Graphs

Case Study: Apartment Rental Prices

The average rental prices for one-bedroom and two-bedroom apartments in the country’s 100 most populated cities are given below: (Source: https://www.rent.com/research/average-rent-price-report/)

5404 3849 3756 3715 3549 3124 3084 2962 2942 2883 2763 2750 2739 2660 2641 2623 2585 2451 2394 2362 2287 2212 2200 2186 2147 2058 2045 2044 2013 1945 1891 1882 1870 1867 1759 1756 1705 1701 1683 1681 1658 1658 1650 1645 1635 1630 1581 1570 1567 1546 1535 1526 1523 1518 1451 1451 1434 1432 1426 1422 1420 1400 1367 1342 1323 1306 1295 1283 1268 1264 1233 1226 1207 1203 1202 1199 1173 1164 1141 1133 1133 1113 1082 1079 1076 1070 1067 1067 1041 1029 1009 990 911 885 869 860 776 760 709 633

It is difficult to see patterns in the data. A picture is worth a thousand words. How can we know the distribution of the rental prices?

11.1 Histograms

In Chapter 10, we introduced bar graphs and pie charts for categorical data. When dealing with quantitative data such as the average rental prices, we can group similar values together and report the number of values in each group. We then apply the idea behind a bar graph to these groups. Such a graph is called a histogram.

A histogram consists of bars that correspond to intervals of the same width. Specifically, when constructing a histogram,

We divide the range of the data into intervals of equal width. We often use 5-25 intervals. The boundaries of each interval may be made nice (such as multiples of 5 or 10). Each interval is left closed and right open.
Then, we count the number of values (or calculate the percentage of values) in each interval. The numbers are called frequencies or relative frequencies.
Finally, we create a bar that has height equal to the corresponding count (or percentage).

Example 1.

The IQ’s of 50 adults are listed below:

112, 118, 75, 119, 98, 98, 103, 119, 74, 125, 108, 138, 108, 104, 84, 119, 112, 99, 88, 89, 97, 95, 84, 99, 116, 98, 83, 88, 110, 95, 80, 91, 98, 113, 98, 105, 62, 88, 104, 82, 107, 112, 98, 119, 86, 106, 106, 87, 80, 100

Choose appropriate intervals and make a histogram

using counts
using percentages

Solution.

The sorted IQ’s are:

62, 74, 75, 80, 80, 82, 83, 84, 84, 86, 87, 88, 88, 88, 89, 91, 95, 95, 97, 98, 98, 98, 98, 98, 98, 99, 99, 100, 103, 104, 104, 105, 106, 106, 107, 108, 108, 110, 112, 112, 112, 113, 116, 118, 119, 119, 119, 119, 125, 138

The data range from 62 to 138.

We choose 8 intervals of equal width. Since $140 - 60 = 80$, each interval has width 10. The intervals are

$[60, 70)$, meaning “$\ge$ 60, but $<$ 70”, which contains 1 value, or 2%, of the given IQ values;
$[70, 80)$, which contains ____________ of the given IQ values;
$[80, 90)$, which contains ____________ of the given IQ values;
$[90, 100)$, which contains ____________ of the given IQ values;
$[100, 110)$, which contains ____________ of the given IQ values;
$[110, 120)$, which contains ____________ of the given IQ values;
$[120, 130)$, which contains ____________ of the given IQ values;
$[130, 140)$, which contains ____________ of the given IQ values.

Now, we are ready to make the histogram.
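Counting by hand is a good exercise, but the counts can be double-checked with a short Python sketch. The function name `histogram_counts` is introduced here for illustration; it uses left-closed, right-open intervals as described above and prints a rough text version of the histogram.

```python
iq = [112, 118, 75, 119, 98, 98, 103, 119, 74, 125,
      108, 138, 108, 104, 84, 119, 112, 99, 88, 89,
      97, 95, 84, 99, 116, 98, 83, 88, 110, 95,
      80, 91, 98, 113, 98, 105, 62, 88, 104, 82,
      107, 112, 98, 119, 86, 106, 106, 87, 80, 100]

def histogram_counts(values, start, width, n_bins):
    """Count the values falling in [start, start + width), [start + width, ...), etc."""
    counts = [0] * n_bins
    for v in values:
        index = (v - start) // width   # which interval the value belongs to
        counts[index] += 1
    return counts

counts = histogram_counts(iq, start=60, width=10, n_bins=8)
for i, count in enumerate(counts):
    low = 60 + 10 * i
    percent = 100 * count / len(iq)
    print(f"[{low}, {low + 10}): {'*' * count} {count} ({percent:.0f}%)")
```

Each row of stars plays the role of one bar, drawn sideways.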

11.2 Interpreting Histograms

Look for an overall pattern.
Describe the center (the value that separates bottom 50% from top 50%) and the variability (range).
Describe the shape (how many peaks? symmetric? right/left skewed?). A distribution is said to be skewed to the right (left) if it has a long tail on the right (left) side. Book example 3 on page 248.
Look for striking deviations from the pattern.
Pay attention to outliers (values that fall outside the overall pattern).

Example 2.

The scores of three classes taught by different instructors are given below:

Class 1: 59, 93, 78, 82, 89, 100, 82, 47, 81, 68, 79, 45, 94, 81, 83, 86, 83, 98, 81, 81, 83, 76, 82, 81, 87, 94, 80, 92, 73, 92, 87, 92, 81, 96, 89, 90, 89, 81, 90, 97

Class 2: 70, 49, 53, 64, 47, 56, 50, 48, 43, 48, 51, 78, 41, 49, 74, 42, 51, 49, 47, 59, 67, 52, 45, 62, 42, 42, 47, 48, 48, 42, 49, 50, 56, 64, 48, 51

Class 3: 88, 89, 97, 85, 85, 91, 90, 88, 84, 86, 92, 81, 78, 98, 79, 85, 86, 93, 86, 82, 89, 92, 89, 77, 91, 85, 87, 87, 85, 85, 89, 88, 89, 95, 88, 86, 76, 96

Make a histogram for each class using intervals of $[40, 50)$, $[50, 60)$, …
Interpret each histogram.

11.3 Stemplots

A stemplot (or stem-and-leaf plot) has stems (or classes) and leaves. The choice of stems and leaves depends on the data.

Example 3.

The scores of three classes taught by different instructors are given below:

Class 1: 59, 93, 78, 82, 89, 100, 82, 47, 81, 68, 79, 45, 94, 81, 83, 86, 83, 98, 81, 81, 83, 76, 82, 81, 87, 94, 80, 92, 73, 92, 87, 92, 81, 96, 89, 90, 89, 81, 90, 97

Class 2: 70, 49, 53, 64, 47, 56, 50, 48, 43, 48, 51, 78, 41, 49, 74, 42, 51, 49, 47, 59, 67, 52, 45, 62, 42, 42, 47, 48, 48, 42, 49, 50, 56, 64, 48, 51

Class 3: 88, 89, 97, 85, 85, 91, 90, 88, 84, 86, 92, 81, 78, 98, 79, 85, 86, 93, 86, 82, 89, 92, 89, 77, 91, 85, 87, 87, 85, 85, 89, 88, 89, 95, 88, 86, 76, 96

Make a stemplot for each class using the tens digits as stems and ones digits as leaves.
Interpret each stemplot.

Solution.

The stemplot for Class 1 may look like

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    4 | 57
##    5 | 9
##    6 | 8
##    7 | 3689
##    8 | 01111111222333677999
##    9 | 00222344678
##   10 | 0

The stemplot for Class 2 may look like

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   4 | 1222235777888889999
##   5 | 0011123669
##   6 | 2447
##   7 | 048

The stemplot for Class 3 may look like

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   7 | 6789
##   8 | 124555555666677888899999
##   9 | 0112235678
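A stemplot of whole-number scores can also be built with a few lines of Python. This is a minimal sketch, not the software that produced the output above; the function name `stemplot` is an assumption. It takes the ones digit as the leaf and everything in front of it as the stem, and is shown here with the Class 2 scores.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines: stem = value without its ones digit, leaf = ones digit."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    lines = []
    for stem in range(min(rows), max(rows) + 1):    # keep stems that have no leaves
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

class_2 = [70, 49, 53, 64, 47, 56, 50, 48, 43, 48, 51, 78,
           41, 49, 74, 42, 51, 49, 47, 59, 67, 52, 45, 62,
           42, 42, 47, 48, 48, 42, 49, 50, 56, 64, 48, 51]

print("\n".join(stemplot(class_2)))
```

Sorting the values first puts the leaves on each stem in increasing order.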

Example 4.

The IQs of the 50 adults from Example 1 are used again here.

Make a stemplot using the tens digits (together with the hundreds digit, when there is one) as stems and ones digits as leaves.
Interpret the stemplot.

Solution.

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    6 | 2
##    7 | 45
##    8 | 002344678889
##    9 | 155788888899
##   10 | 0344566788
##   11 | 02223689999
##   12 | 5
##   13 | 8

Example 4.

Radon is a carcinogen—a naturally occurring radioactive gas whose decay products are also radioactive—known to cause lung cancer in high concentrations and estimated to cause several thousand lung cancer deaths per year in the United States.

The average radon levels (pCi/L) of 50 states are given below:

State	Radon level (pCi/L)
Alabama	10.7
Alaska	3.9
Arizona	1.9
Arkansas	2.5
California	2.3
Colorado	6.8
Connecticut	3.4
Delaware	2.4
Florida	1.8
Georgia	2.3
Idaho	7.3
Illinois	5.3
Indiana	4.7
Iowa	6.1
Hawaii	NA
Kansas	4.9
Kentucky	7.4
Louisiana	1.1
Maine	5.9
Maryland	5.4
Massachusetts	3.9
Michigan	3.5
Minnesota	4.6
Mississippi	1.2
Missouri	4.3
Montana	7.4
Nebraska	5.2
Nevada	3.4
New Hampshire	5.6
New Jersey	4.4
New Mexico	3.9
New York	4.2
North Carolina	4.0
North Dakota	6.0
Ohio	7.8
Oklahoma	2.5
Oregon	3.1
Pennsylvania	8.6
Rhode Island	4.3
South Carolina	2.4
South Dakota	9.6
Tennessee	4.8
Texas	2.1
Utah	4.4
Vermont	3.7
Virginia	3.6
Washington	7.5
West Virginia	6.1
Wisconsin	5.7
Wyoming	5.0

Make a stemplot and a histogram for the radon levels. The intervals for the histogram are $[1, 2), [2, 3), \cdots, [10, 11)$.

Solution.

## 
##   The decimal point is at the |
## 
##    1 | 1289
##    2 | 1334455
##    3 | 144567999
##    4 | 0233446789
##    5 | 0234679
##    6 | 0118
##    7 | 34458
##    8 | 6
##    9 | 6
##   10 | 7

Example 5.

The average rental prices for one-bedroom and two-bedroom apartments in the country’s 100 most populated cities are given below:

5404, 3849, 3756, 3715, 3549, 3124, 3084, 2962, 2942, 2883, 2763, 2750, 2739, 2660, 2641, 2623, 2585, 2451, 2394, 2362, 2287, 2212, 2200, 2186, 2147, 2058, 2045, 2044, 2013, 1945, 1891, 1882, 1870, 1867, 1759, 1756, 1705, 1701, 1683, 1681, 1658, 1658, 1650, 1645, 1635, 1630, 1581, 1570, 1567, 1546, 1535, 1526, 1523, 1518, 1451, 1451, 1434, 1432, 1426, 1422, 1420, 1400, 1367, 1342, 1323, 1306, 1295, 1283, 1268, 1264, 1233, 1226, 1207, 1203, 1202, 1199, 1173, 1164, 1141, 1133, 1133, 1113, 1082, 1079, 1076, 1070, 1067, 1067, 1041, 1029, 1009, 990, 911, 885, 869, 860, 776, 760, 709, 633

Rewrite the data in thousands and use rounding when possible. For example, 5404 is written as 5.404 thousand and rounded to 5.4 thousand. 3756 is written as 3.756 thousand and rounded to 3.8 thousand.
Make a stemplot for the updated data using the integer parts as stems and tenth digits as leaves.
Interpret the stemplot.

Solution.

## 
##   The decimal point is 3 digit(s) to the right of the |
## 
##   0 | 67889999
##   1 | 000011111111112222222233333334444444555555566666677777778899999
##   2 | 000112223445666778899
##   3 | 0115788
##   4 | 
##   5 | 4

The symbol “5 | 4” represents 5400 and “3 | 0115788” represents 3000, 3100, 3100, 3500, 3700, 3800, and 3800.

11.4 A Class Activity

Step 1. Pass a piece of paper to collect the length of each student’s last name.

Step 2. Show the collected data to the class.

Step 3. Tell the class to make a histogram using intervals $[1.5, 2.5)$, $[2.5, 3.5)$, $[3.5, 4.5)$, $\cdots$. What is the advantage of choosing these intervals rather than other intervals such as $[1, 2)$, $[2, 3)$, $[3, 4)$, $\cdots$?

Step 4. Interpret the graph by telling

what the shape is,
where the center is located,
what the variability is,
how many peaks there are, and
whether there is any outlier.

Step 5. Make a stemplot for the collected data. To start, first add a decimal point to each value, followed by a 0. For example, after adding a decimal point, a value of 5 becomes 5.0 and 7 becomes 7.0. The stems for the stemplot are chosen to be the integer parts and leaves to be all 0’s. Add a key to your stemplot (an example on page 253 of textbook).

Step 6. Interpret the graph by telling something about shape, center, variability, peaks, and possible outliers.

12 Chapter 12 Describing Distributions with Numbers

Case Study: https://pubmed.ncbi.nlm.nih.gov/32048801/

12.1 Median and Quartiles

When we have a set of values, we can sort them from smallest to largest for ease of reading.

The center of the data is the very middle value or the average of the middle two values. Such a center splits all the data into lower and upper halves and is thus called the median of the data.
The median of the values in the lower half is called the first quartile (or lower quartile), denoted by $Q_1$.
The median of the values in the upper half is called the third quartile (or upper quartile), denoted by $Q_3$.

Example 1.

The numbers of study hours of 25 students are:

0 0 0 0 2 2 2 2 3 3 3 3 3 3 4 4 5 6 6 6 6 6 7 7 7

Find the median, first quartile and third quartile.

Solution.

Since the data are already in order, the median is the middle value (the 13th) or 3.

To find the quartiles, we split the data in half:

The lower half: 0 0 0 0 2 2 2 2 3 3 3 3 with median 2 (the average of the two middle values), which is the first quartile.
The upper half: 3 4 4 5 6 6 6 6 6 7 7 7 with median 6 (the average of the two middle values), which is the upper quartile.

Example 2.

The lifetimes (in years) of a litter of 16 pythons are:

17 19 20 20 23 24 24 25 26 26 27 27 28 28 29 33

Find the median, first quartile and third quartile.

Solution.

Since the data are already in order, the median is the average of the two middle values 25 & 26, or 25.5.

To find the quartiles, we split the data in halves:

The lower half: 17 19 20 20 23 24 24 25 with median $(20+23)/2=21.5$, which is the first quartile.
The upper half: 26 26 27 27 28 28 29 33 with median $(27+28)/2=27.5$, which is the upper quartile.
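The procedure used in Examples 1 and 2 can be sketched in Python. The function names are made up for this illustration. The sketch follows the rule used in these notes: when the number of values is odd, the median itself is left out of both halves. Software packages may use slightly different rules and report slightly different quartiles.

```python
def middle(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def median_and_quartiles(values):
    """Return (Q1, median, Q3) using the lower-half / upper-half rule."""
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]          # values before the median position
    upper_half = data[(n + 1) // 2:]     # values after the median position
    return middle(lower_half), middle(data), middle(upper_half)

study_hours = [0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 3, 3, 3,
               3, 4, 4, 5, 6, 6, 6, 6, 6, 7, 7, 7]
pythons = [17, 19, 20, 20, 23, 24, 24, 25, 26, 26, 27, 27, 28, 28, 29, 33]

print(median_and_quartiles(study_hours))   # (2.0, 3, 6.0)
print(median_and_quartiles(pythons))       # (21.5, 25.5, 27.5)
```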

12.2 The Five-Number Summary and Boxplots

For a set of values, we can find the minimum, maximum, median, and the first and third quartiles. These 5 numbers are called the five-number summary. These numbers can be displayed in a graph called the boxplot.

A boxplot looks like:

Example 3.

Create a boxplot if the 5-number summary is 2, 6, 13, 18, 20.

Example 4.

The following shows side-by-side boxplots.

A summary of the results:

Section 1 had the highest median score and section 3 had the lowest.
All 4 sections tend to have the same variability in scores.
One student in section 4 had an extremely high score relative to all other students.

12.3 Mean, Variance, and Standard Deviation

The average of a numeric sample is also called the mean of the sample. The mean of a sample is denoted by $\bar{x}$.

The variance of the sample is a way to measure the degree of consistency, spread, or variation. To find the variance, we subtract the mean from each value in the sample, square the differences, add these squared differences, and divide the sum by $n-1$, where $n$ is the number of values or the sample size. The variance of a sample is denoted by $s^2$.

Since the variance has a measurement unit that is the square of the original measurement unit, another measure of consistency is defined, which is the square root of the variance. This new measure is called the standard deviation, which is roughly a typical deviation of the sample values from the mean. The standard deviation of a sample is denoted by $s$.

Example 5.

The number of siblings of 5 students is given below:

\[2, 0, 3, 2, 3\] Find the mean, variance, and standard deviation.

Solution.

The mean is $\bar{x}=(2+0+3+2+3)/5=2$.

The variance is

\[s^2=\frac{(2-2)^2+(0-2)^2+(3-2)^2+(2-2)^2+(3-2)^2}{5-1}=\frac{0+4+1+0+1}{4}=1.5\]

The standard deviation is $s=\sqrt{1.5}=1.2247.$
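The same numbers can be obtained with Python's built-in `statistics` module, whose `variance` and `stdev` functions divide by $n-1$ as in the definition above. This is offered only as a quick check of the hand calculation.

```python
import statistics

siblings = [2, 0, 3, 2, 3]

mean = statistics.mean(siblings)
variance = statistics.variance(siblings)   # sample variance (divides by n - 1)
sd = statistics.stdev(siblings)            # square root of the sample variance

print(mean, variance, round(sd, 4))
```

The printed values should agree with the mean 2, variance 1.5, and standard deviation of about 1.2247 found by hand.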

Notes:

The variance and standard deviation can never be negative.
A small standard deviation tells us that the data are consistent and the deviation of the data around the mean is small.
When all values in the data are the same, the variance and standard deviation are both 0. On the other hand, when there is a lot of inconsistency within the data, we are going to get a large variance and standard deviation.
The standard deviation of a sample always has the same unit as that of the original values in the sample.
The mean, variance, and standard deviation all can be heavily affected by outliers. Removing outliers from a sample will decrease the variance and standard deviation.

12.4 Mean or Median?

Both mean and median describe the center of data. When there is any outlier in data, use median since the mean is sensitive to outliers while the median is not.

Example 6.

Find the mean and median of the following two sets of data.

20, 25, 30, 35, 40, 54
20, 25, 30, 35, 40, 540

Solution.

The mean is $\frac{20+25+30+35+40+54}{6}=34$. The median is 32.5, since there are two middle values (30 & 35) and we take the average of them.
The mean is $\frac{20+25+30+35+40+540}{6}=115$. The median is still 32.5, since there are two middle values (30 & 35) and we take the average of them.

This example shows that the mean can be heavily affected by an outlier.
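A quick Python check of Example 6, using the built-in `statistics` module, makes the contrast easy to see. The variable names are chosen here for illustration.

```python
import statistics

without_outlier = [20, 25, 30, 35, 40, 54]
with_outlier = [20, 25, 30, 35, 40, 540]

for data in (without_outlier, with_outlier):
    print(statistics.mean(data), statistics.median(data))
```

Only the first number on each line (the mean) changes when 54 is replaced by 540; the median stays at 32.5.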

13 Chapter 13 Normal Distributions

Case Study: https://pubmed.ncbi.nlm.nih.gov/4014058/

Go to or Click: https://ww2.amstat.org/censusatschool/ Click “Random Sampler” and get a sample of say 400 students from Minnesota.

13.1 Density Curves

Let’s continue the case study. If we are interested in knowing the distribution of a variable (say the number of text messages sent by a student), we can use data on that variable and make a histogram. The height of each bar equals the number of students falling in the corresponding interval. Now, each bar can be rescaled so that the total area of all bars equals 1. This is done by dividing the original height by the sample size and then by the common width of the intervals. Software can do this for you!
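The rescaling described above can be sketched in Python. The counts and the interval width below are hypothetical numbers made up for this illustration, and the function name `density_heights` is an assumption.

```python
def density_heights(counts, width):
    """Rescale histogram counts so that the bars have total area 1."""
    n = sum(counts)                                   # sample size
    return [count / (n * width) for count in counts]

counts = [3, 7, 12, 5, 3]   # hypothetical counts for five intervals
width = 10                  # hypothetical common width of the intervals

heights = density_heights(counts, width)
total_area = sum(height * width for height in heights)

print(heights)
print(round(total_area, 10))  # 1.0
```

Each bar's area (height times width) is now the proportion of observations in that interval, which is why the areas add up to 1.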

In addition, software can superimpose a curve on the rescaled histogram. This curve can capture the overall shape of the rescaled histogram. Such a curve is called a density curve. Check out this app: https://scsu.shinyapps.io/censusAtSchool

For a curve to be a density curve, it must lie on or above the x-axis and the total area between the curve and the x-axis must be 1.

What is the use of the density curve? It not only shows the overall shape of the distribution of the data, but also shows the proportion of observations in any interval (not just those used for creating the histogram) by the area under the curve and above the interval.

13.2 The Center and Variability of a Density Curve

We can find, on the x-axis, a point through which a vertical line divides the area under the density curve in half. Such a point shows the center of a distribution and is called the median of the distribution.

We can find, on the x-axis, another point at which the curve would balance if the area under the curve were made of uniformly solid material. Such a point is called the mean of the distribution that the density curve describes. Refer to textbook page 298 for the locations of the mean and median of a distribution.

The variability of a density describes how wide a density curve is. The following are a few density curves:

The black density curve is the widest so it shows a distribution that has more variability than the distributions shown by the other two density curves. The blue density curve describes a distribution that has the smallest variability.

13.3 Normal Distributions

If the density curve looks like a bell-shaped, symmetric curve, the density is called a normal density. If a random variable can be modeled with a normal density, we say the random variable has a normal distribution. The center of the normal curve corresponds to the mean of the normal distribution. Another feature of the normal distribution is its standard deviation, which measures the spread of the distribution.

The following is a tool that allows us to calculate the area of a given region under a normal density curve.

13.4 The 68–95–99.7 Rule

If a random variable $X$ has a normal distribution, then, no matter what the mean and standard deviation are,

with probability about 68%, the random variable falls within one standard deviation above or below the mean;
with probability about 95%, the random variable falls within two standard deviations above or below the mean;
with probability about 99.7%, the random variable falls within three standard deviations above or below the mean.
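The rule can be checked numerically with `NormalDist` from Python's built-in `statistics` module. This is only a sketch; the helper name `prob_within` is introduced here, and the second distribution (mean 100, standard deviation 15) is just one example of a different normal distribution.

```python
from statistics import NormalDist

def prob_within(dist, k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

standard = NormalDist()           # mean 0, standard deviation 1
another = NormalDist(100, 15)     # a different mean and standard deviation

for k in (1, 2, 3):
    print(k, round(prob_within(standard, k), 4), round(prob_within(another, k), 4))
```

Both columns should show about 0.6827, 0.9545, and 0.9973, which round to the 68%, 95%, and 99.7% of the rule.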

The rule is depicted in the following graph:

Based on the graph on this web page, we infer that

The probability that the random variable falls between the point that is two standard deviations below the mean and the point that is one standard deviation above the mean is approximately 0.8185 (13.59% + 34.13% + 34.13% = 81.85% or 0.8185).

There is an online normal-calculator that allows us to find probability when a population has a normal distribution. Here it is: https://onlinestatbook.com/2/calculators/normal_dist.html

Open the link and try to change the mean and standard deviation to see how the normal density curve changes. Now, fill in blanks:

When I only change the mean, the _____ (center or spread) of the curve remains the same but the _____ (center or spread) changes.
When I only change the standard deviation, the _____ (center or spread) of the curve remains the same but the _____ (center or spread) changes.
I really _____ (like or dislike) normal curves.

The answers are: spread, center, center, spread, answer varies

Example 1.

The IQ of human beings has a normal distribution with mean 100 and standard deviation 15.

What percent of people have IQ greater than 130?

To use the online calculator, first fill in the mean and standard deviation. Now, click “Above” and fill in 130. Finally, press “Calculate”. The answer should be 0.0228, which means 2.28% of people have IQ greater than 130.

What percent of people have IQ above 130? (This question is essentially the same as the previous one.)
What percent of people have IQ between 110 and 125? (Your answer should be 0.2047 or 20.47%)
What percent of people have IQ between 90 and 105?
What percent of people have IQ below 95?

The answers are: 2.28%, 2.28%, 20.47%, 37.81%, 36.94%
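If the online calculator is not available, the same areas can be computed with `NormalDist` from Python's built-in `statistics` module. This is a sketch of an alternative tool, not the calculator used above; `cdf(x)` gives the area below `x`.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

above_130 = 1 - iq.cdf(130)                   # area above 130
between_110_125 = iq.cdf(125) - iq.cdf(110)   # area between 110 and 125
between_90_105 = iq.cdf(105) - iq.cdf(90)     # area between 90 and 105
below_95 = iq.cdf(95)                         # area below 95

for p in (above_130, between_110_125, between_90_105, below_95):
    print(f"{100 * p:.2f}%")
```

An area "above" a value is found as 1 minus the area below it, because the total area under the curve is 1.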

Example 2.

The heights of adult men are normally distributed with mean 69.2 inches and standard deviation 2.66 inches.

What percent of men are taller than 72 inches?
What percent of men are taller than 75 inches?
What percent of men are shorter than 65 inches?
What percent of men are between 68 inches and 73 inches tall?
What is the 25th percentile?
What is the third quartile?
What is the 95th percentile?

The answers are: 14.63%, 1.46%, 5.72%, 59.75%, 67.41 inches, 70.99 inches, 73.58 inches
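The percentile questions in Example 2 go in the opposite direction: an area is given and a height is wanted. In Python, `NormalDist.inv_cdf` from the built-in `statistics` module does this. The sketch below is an alternative to an online calculator, using the mean and standard deviation stated in the example.

```python
from statistics import NormalDist

heights = NormalDist(mu=69.2, sigma=2.66)   # heights of adult men, in inches

q1 = heights.inv_cdf(0.25)    # 25th percentile (first quartile)
q3 = heights.inv_cdf(0.75)    # 75th percentile (third quartile)
p95 = heights.inv_cdf(0.95)   # 95th percentile

print(round(q1, 2), round(q3, 2), round(p95, 2))
```

Each result is a height in inches such that the stated proportion of men are shorter than it.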

13.5 Standard Scores

The standard (or standardized) score or $z$-score for any observation is defined as

\[\text{standard score} = \frac{\text{observation}-\text{mean}}{\text{standard deviation}}\]

Such standardization removes the original unit; that is, standard scores are unitless. This makes comparison between values from different normal distributions easier.

Example 3.

Tom took both an SAT test and an ACT test, with 1270 in SAT and 26 in ACT. The distribution of SAT scores is normal with mean 1060 and standard deviation 217. The distribution of ACT scores is also normal with mean 18 and standard deviation 6.

Which score is relatively better?

Solution.

The z-score of Tom’s SAT is $z=\frac{1270-1060}{217}=0.9677$, which means that Tom’s SAT score is 0.9677 standard deviations higher than the mean SAT score.

The z-score of Tom’s ACT is $z=\frac{26-18}{6}=1.33$, which means that Tom’s ACT score is 1.33 standard deviations higher than the mean ACT score.

Since the $z$-score of his ACT score is higher, Tom did relatively better on the ACT.

When we know the mean and standard deviation of a variable, we can tell how far away any observation is from the mean in terms of standard deviations. A standard score is just a re-scaled distance between the original score and the mean, expressed in standard deviations.
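The comparison in Example 3 can be sketched in a few lines of Python. The function name `standard_score` is made up for this illustration; it simply applies the formula above to the numbers given for Tom.

```python
def standard_score(observation, mean, sd):
    """Return how many standard deviations the observation lies from the mean."""
    return (observation - mean) / sd


# Tom's scores and the two distributions from Example 3.
z_sat = standard_score(1270, mean=1060, sd=217)
z_act = standard_score(26, mean=18, sd=6)

print(f"SAT z-score: {z_sat:.4f}")
print(f"ACT z-score: {z_act:.4f}")

# The larger z-score marks the relatively better result.
better = "ACT" if z_act > z_sat else "SAT"
print(f"Relatively better score: {better}")
```

Because both results are now expressed in standard deviations, they can be compared directly even though the SAT and ACT use different scales.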

13.6 Percentiles of Normal Distributions

The $c$-th percentile of a distribution is a value such that $c$ percent of the observations lie below it and the rest lie above it. For example, if your SAT score is at the 90th percentile, then you have scored higher than 90% of all SAT takers.

This inverse normal calculator: https://onlinestatbook.com/2/calculators/inverse_normal_dist.html allows you to find a cutoff when the area of a tail region (called a tail probability) under a normal distribution is given. That is, knowing the area of a left or right region under a normal density curve, you can find the corresponding cutoff on the number line.

Use the calculator https://onlinestatbook.com/2/calculators/inverse_normal_dist.html to do the following:

What is the 25th percentile (also called the first quartile and denoted by $Q_1$)? That is, 25% of people have IQ below what?

Setting “Area” to 0.25, choosing “Below”, and then clicking “Recalculate” gives 89.88, which is the cutoff on the number line. It means 25% of people have IQ below 89.88.

What is the 75th percentile (also called the third quartile and denoted by $Q_3$)? That is, 75% of people have IQ below what? or 25% of people have IQ above what? (keep in mind: the area under any normal curve always equals 1.)
To be among the top 1% smartest people, what IQ must you have?

Setting “Area” to 0.01, choosing “Above”, and then clicking “Recalculate” gives 134.8952, which is the cutoff on the number line. It means 1% of people have IQ above 134.8952.

What is the 95th percentile? That is, 95% of people have IQ below what? or 5% of people have IQ above what?

The answers are: 89.88, 110.12, 134.8952, 124.67
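As a cross-check on the inverse normal calculator, here is a minimal sketch that assumes Python 3.8 or later. The `inv_cdf` method of `statistics.NormalDist` takes the area to the left and returns the cutoff, so an "Above" question is first converted into a "Below" question. The variable names are only illustrative.

```python
from statistics import NormalDist

# IQ model: normal with mean 100 and standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

# inv_cdf takes the area to the LEFT of the cutoff and returns the cutoff.
q1 = iq.inv_cdf(0.25)             # 25th percentile (first quartile)
q3 = iq.inv_cdf(0.75)             # 75th percentile (third quartile)
top_1_percent = iq.inv_cdf(0.99)  # 1% above the cutoff means 99% below it
p95 = iq.inv_cdf(0.95)            # 95th percentile

print(f"Q1 = {q1:.2f}")
print(f"Q3 = {q3:.2f}")
print(f"Top 1% cutoff = {top_1_percent:.4f}")
print(f"95th percentile = {p95:.2f}")
```

Since the total area under the curve is 1, an area of 0.01 above the cutoff is the same as an area of 0.99 below it.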

14 Chapter 14 Describing Relationships: Scatterplots and Correlation

Case Study: SAT Scores

Refer to https://worldpopulationreview.com/state-rankings/sat-scores-by-state

The graph is reproduced here: https://codap.concord.org/app/static/dg/en/cert/index.html#shared=https%3A%2F%2Fcfm-shared.concord.org%2FcqGyANlSJdegACkrRlVk%2Ffile.json

What does the graph suggest?
What can you say based on the plot?

14.1 Scatterplots

A scatterplot is a plot of points, each of which has an x coordinate and a y coordinate.

Let’s plot the SAT Math score versus the participation rate.

Which variable is the response variable?
Which variable is the explanatory variable?
Can you explain why Minnesota has the highest average SAT Math score?

To interpret a scatterplot, we look for

an overall pattern
- direction: positive or negative association
- form: linear (straight-line) or curved association
- strength: strong or weak association
striking deviations (outliers) from that pattern.

The average SAT Math score appears to be linearly associated with the participation rate.

14.2 Multiple Variables

We can consider the association between two quantitative variables within subgroups determined by a third variable that is categorical.

Each row in the following data is for one iris plant. Iris is a genus of flowering plants with 310 accepted species, known for their showy flowers.

**Iris Data**
Plant	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
7	4.6	3.4	1.4	0.3	setosa
8	5.0	3.4	1.5	0.2	setosa
9	4.4	2.9	1.4	0.2	setosa
10	4.9	3.1	1.5	0.1	setosa
11	5.4	3.7	1.5	0.2	setosa
12	4.8	3.4	1.6	0.2	setosa
13	4.8	3.0	1.4	0.1	setosa
14	4.3	3.0	1.1	0.1	setosa
15	5.8	4.0	1.2	0.2	setosa
16	5.7	4.4	1.5	0.4	setosa
17	5.4	3.9	1.3	0.4	setosa
18	5.1	3.5	1.4	0.3	setosa
19	5.7	3.8	1.7	0.3	setosa
20	5.1	3.8	1.5	0.3	setosa
21	5.4	3.4	1.7	0.2	setosa
22	5.1	3.7	1.5	0.4	setosa
23	4.6	3.6	1.0	0.2	setosa
24	5.1	3.3	1.7	0.5	setosa
25	4.8	3.4	1.9	0.2	setosa
26	5.0	3.0	1.6	0.2	setosa
27	5.0	3.4	1.6	0.4	setosa
28	5.2	3.5	1.5	0.2	setosa
29	5.2	3.4	1.4	0.2	setosa
30	4.7	3.2	1.6	0.2	setosa
31	4.8	3.1	1.6	0.2	setosa
32	5.4	3.4	1.5	0.4	setosa
33	5.2	4.1	1.5	0.1	setosa
34	5.5	4.2	1.4	0.2	setosa
35	4.9	3.1	1.5	0.2	setosa
36	5.0	3.2	1.2	0.2	setosa
37	5.5	3.5	1.3	0.2	setosa
38	4.9	3.6	1.4	0.1	setosa
39	4.4	3.0	1.3	0.2	setosa
40	5.1	3.4	1.5	0.2	setosa
41	5.0	3.5	1.3	0.3	setosa
42	4.5	2.3	1.3	0.3	setosa
43	4.4	3.2	1.3	0.2	setosa
44	5.0	3.5	1.6	0.6	setosa
45	5.1	3.8	1.9	0.4	setosa
46	4.8	3.0	1.4	0.3	setosa
47	5.1	3.8	1.6	0.2	setosa
48	4.6	3.2	1.4	0.2	setosa
49	5.3	3.7	1.5	0.2	setosa
50	5.0	3.3	1.4	0.2	setosa
51	7.0	3.2	4.7	1.4	versicolor
52	6.4	3.2	4.5	1.5	versicolor
53	6.9	3.1	4.9	1.5	versicolor
54	5.5	2.3	4.0	1.3	versicolor
55	6.5	2.8	4.6	1.5	versicolor
56	5.7	2.8	4.5	1.3	versicolor
57	6.3	3.3	4.7	1.6	versicolor
58	4.9	2.4	3.3	1.0	versicolor
59	6.6	2.9	4.6	1.3	versicolor
60	5.2	2.7	3.9	1.4	versicolor
61	5.0	2.0	3.5	1.0	versicolor
62	5.9	3.0	4.2	1.5	versicolor
63	6.0	2.2	4.0	1.0	versicolor
64	6.1	2.9	4.7	1.4	versicolor
65	5.6	2.9	3.6	1.3	versicolor
66	6.7	3.1	4.4	1.4	versicolor
67	5.6	3.0	4.5	1.5	versicolor
68	5.8	2.7	4.1	1.0	versicolor
69	6.2	2.2	4.5	1.5	versicolor
70	5.6	2.5	3.9	1.1	versicolor
71	5.9	3.2	4.8	1.8	versicolor
72	6.1	2.8	4.0	1.3	versicolor
73	6.3	2.5	4.9	1.5	versicolor
74	6.1	2.8	4.7	1.2	versicolor
75	6.4	2.9	4.3	1.3	versicolor
76	6.6	3.0	4.4	1.4	versicolor
77	6.8	2.8	4.8	1.4	versicolor
78	6.7	3.0	5.0	1.7	versicolor
79	6.0	2.9	4.5	1.5	versicolor
80	5.7	2.6	3.5	1.0	versicolor
81	5.5	2.4	3.8	1.1	versicolor
82	5.5	2.4	3.7	1.0	versicolor
83	5.8	2.7	3.9	1.2	versicolor
84	6.0	2.7	5.1	1.6	versicolor
85	5.4	3.0	4.5	1.5	versicolor
86	6.0	3.4	4.5	1.6	versicolor
87	6.7	3.1	4.7	1.5	versicolor
88	6.3	2.3	4.4	1.3	versicolor
89	5.6	3.0	4.1	1.3	versicolor
90	5.5	2.5	4.0	1.3	versicolor
91	5.5	2.6	4.4	1.2	versicolor
92	6.1	3.0	4.6	1.4	versicolor
93	5.8	2.6	4.0	1.2	versicolor
94	5.0	2.3	3.3	1.0	versicolor
95	5.6	2.7	4.2	1.3	versicolor
96	5.7	3.0	4.2	1.2	versicolor
97	5.7	2.9	4.2	1.3	versicolor
98	6.2	2.9	4.3	1.3	versicolor
99	5.1	2.5	3.0	1.1	versicolor
100	5.7	2.8	4.1	1.3	versicolor
101	6.3	3.3	6.0	2.5	virginica
102	5.8	2.7	5.1	1.9	virginica
103	7.1	3.0	5.9	2.1	virginica
104	6.3	2.9	5.6	1.8	virginica
105	6.5	3.0	5.8	2.2	virginica
106	7.6	3.0	6.6	2.1	virginica
107	4.9	2.5	4.5	1.7	virginica
108	7.3	2.9	6.3	1.8	virginica
109	6.7	2.5	5.8	1.8	virginica
110	7.2	3.6	6.1	2.5	virginica
111	6.5	3.2	5.1	2.0	virginica
112	6.4	2.7	5.3	1.9	virginica
113	6.8	3.0	5.5	2.1	virginica
114	5.7	2.5	5.0	2.0	virginica
115	5.8	2.8	5.1	2.4	virginica
116	6.4	3.2	5.3	2.3	virginica
117	6.5	3.0	5.5	1.8	virginica
118	7.7	3.8	6.7	2.2	virginica
119	7.7	2.6	6.9	2.3	virginica
120	6.0	2.2	5.0	1.5	virginica
121	6.9	3.2	5.7	2.3	virginica
122	5.6	2.8	4.9	2.0	virginica
123	7.7	2.8	6.7	2.0	virginica
124	6.3	2.7	4.9	1.8	virginica
125	6.7	3.3	5.7	2.1	virginica
126	7.2	3.2	6.0	1.8	virginica
127	6.2	2.8	4.8	1.8	virginica
128	6.1	3.0	4.9	1.8	virginica
129	6.4	2.8	5.6	2.1	virginica
130	7.2	3.0	5.8	1.6	virginica
131	7.4	2.8	6.1	1.9	virginica
132	7.9	3.8	6.4	2.0	virginica
133	6.4	2.8	5.6	2.2	virginica
134	6.3	2.8	5.1	1.5	virginica
135	6.1	2.6	5.6	1.4	virginica
136	7.7	3.0	6.1	2.3	virginica
137	6.3	3.4	5.6	2.4	virginica
138	6.4	3.1	5.5	1.8	virginica
139	6.0	3.0	4.8	1.8	virginica
140	6.9	3.1	5.4	2.1	virginica
141	6.7	3.1	5.6	2.4	virginica
142	6.9	3.1	5.1	2.3	virginica
143	5.8	2.7	5.1	1.9	virginica
144	6.8	3.2	5.9	2.3	virginica
145	6.7	3.3	5.7	2.5	virginica
146	6.7	3.0	5.2	2.3	virginica
147	6.3	2.5	5.0	1.9	virginica
148	6.5	3.0	5.2	2.0	virginica
149	6.2	3.4	5.4	2.3	virginica
150	5.9	3.0	5.1	1.8	virginica
Note:
The data are freely available online.

Here is a picture of an iris.

The following is a scatterplot with different colors indicating different species.

Some observations based on the graph:

For setosas, the scatterplot of “Sepal width” versus “Sepal length” indicates a moderately strong linear (straight-line) relationship.
The relationship is not as strong for the other two species.

14.3 Correlation

The correlation coefficient (or just correlation) can be used to quantify how strong a linear relationship is. It also indicates the direction of the relationship. Textbook page 330 introduces the formula and gives an example. We will use software to calculate correlations.

Correlation is usually written as $r$. Its value is always between $-1$ and 1, inclusive, with 1 and $-1$ indicating a perfect straight-line relationship and 0 indicating no straight-line relationship. A positive $r$ indicates a positive linear relationship and negative $r$ indicates a negative linear relationship.

The CODAP online tool: https://codap.concord.org/

For centuries, people have associated intelligence with brain size. A recent study used magnetic resonance imaging to measure the brain size of several individuals. The IQ and brain size (in units of 10000 pixels) of six individuals are provided below.


Brain size	IQ
100	140
90	90
95	100
92	135
88	80
106	103

Make a scatterplot of these data using Excel or the CODAP software. What is the form, direction, and strength of the association? Are there any outliers? An outlier is an observation that falls far from the overall pattern.

Choose the best answer from the following.

The scatterplot shows a weak, positive, straight‑line association with two possible outliers.
The scatterplot shows a weak, negative, straight‑line association with two possible outliers.
The scatterplot shows a strong, positive, straight‑line association with two possible outliers.
The scatterplot shows a strong, negative, straight‑line association with two possible outliers.

What would you estimate the correlation $r$ to be?

$r=0.38$
$r=-0.38$
$r=0.83$
$r=-0.83$

More about the correlation between two quantitative variables:

Like the mean, variance, and standard deviation, the correlation can be strongly affected by a few outliers in the data. Use CODAP to check this (Add an outlier and recalculate the correlation).
Correlation does not change when changing the unit of either variable. Use CODAP to check this (Make a data table with two variables. Add two more variables that are multiples of the first two variables).
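The two checks above can also be sketched outside CODAP. The Python below is only an illustration: it computes $r$ directly from sums of squares for the brain size data, rescales one variable, and then adds one extra point. The extra point (brain size 120, IQ 60) is made up for this demonstration and is not part of the study.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Brain size (in units of 10000 pixels) and IQ of the six individuals.
brain_size = [100, 90, 95, 92, 88, 106]
iq = [140, 90, 100, 135, 80, 103]

r = correlation(brain_size, iq)

# Change of unit: express brain size in pixels instead of 10000 pixels.
brain_pixels = [10000 * b for b in brain_size]
r_rescaled = correlation(brain_pixels, iq)

# Hypothetical outlier (NOT in the data): large brain size, low IQ.
r_with_outlier = correlation(brain_size + [120], iq + [60])

print(f"r for the original data:       {r:.2f}")
print(f"r after changing the unit:     {r_rescaled:.2f}")
print(f"r with the made-up outlier:    {r_with_outlier:.2f}")
```

The first two values agree because rescaling a variable does not change $r$, while the single made-up point is enough to change $r$ noticeably.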

Exercises.

Which of the following is NOT true of the correlation $r$?

It cannot be greater than 1 or less than $-1$.
It measures the strength of the straight‑line relationship between two quantitative variables.
A correlation of 1 or $-1$ can only happen if there is a perfect straight‑line relationship between two quantitative variables.
Correlation is 0 when there is no association between the variables.
Correlation changes when the explanatory and response variables are switched.

Which of these is NOT true of the correlation $r$ between the weight in pounds and gas mileage in miles-per-gallon for a sample of pickup trucks?

(a) $r$ is measured in pounds.
(b) $r$ must take a value between $-1$ and 1.
(c) If heavier pickup trucks tend to also get lower gas mileage, then $r < 0$.
(d) $r$ would not change if we measured these trucks in kilograms instead of pounds.
(e) Both (b) and (d) are correct.

15 Chapter 15 Describing Relationships: Regression, Prediction, and Causation

Case Study: Housing Prices

What are the factors affecting the price of a house? Name a few.

15.1 Correlation and Regression

Correlation measures the direction and strength of a straight-line relationship between two quantitative variables. It does not require us to specify which variable is the response variable and which is the explanatory variable.

Regression draws a line to describe a straight-line relationship. It does require us to specify which variable is the response variable and which is the explanatory variable. The line, called the least-squares regression line, is the one that makes the sum of the squared vertical distances of the data points from the line as small as possible. See a graph here:

The equation of the regression line is called the regression equation, which can be used to make a prediction for the value of the response variable if we know the value of the explanatory variable. For example, if a regression equation is $y=25 - 3.2x$, then the $y$ value for $x = 5$ is predicted to be $25-3.2(5)=25-16=9$.

Extrapolation (i.e., prediction outside the range of the data) can be risky, because the pattern observed in the data may be different outside the range of the data.

“Regression analysis is the hydrogen bomb of the statistics arsenal.” ― Charles Wheelan, author of the best-selling Naked Statistics: Stripping the Dread from the Data.

Let’s look at some houses posted for sale on https://www.zillow.com/ in St. Cloud. We choose the first 10 houses. We will create a data table with https://codap.concord.org/. Click “Launch CODAP” and choose “Create New Document”. Click “Tables” at the upper-left corner and choose “New”. Give the new data table a meaningful name. Click “AttributeName” to rename it to something meaningful. Click the table area and then click the “+” sign to add more columns if needed. Now, click the six dots to insert cases one by one.

Once the data table is finished, click the “Graph” icon, drag one variable from the data table to the y-axis, and drag another variable to the x-axis. Now, you have a scatterplot. Click the ruler at the right margin of the graph and check the boxes “Least Squares Line” and “Plotted Equation”. The plotted equation is the regression equation. The $R^2$ or $r^2$ tells what percent of the total variation in the values of the response variable can be explained by the explanatory variable (or by the equation or by the model). If you feel this explanation is unclear, you can think of it as the percent of the story between the two variables that is told by the data. For example, an $R^2$ of 0.85 means “100% of the data only tell 85% of the story.”

Example 1.

The eye widths (in centimeters) and eyelash lengths (in centimeters) for a sample of 22 mammals are given below:

**The Eye Width and Eyelash Length of 22 Mammals**
Eye width	Eyelash length
0.57	0.13
0.39	0.23
0.74	0.23
0.58	0.41
1.08	0.43
0.72	0.24
0.7	0.36
1.52	0.32
0.92	0.61
1.75	0.42
1.89	0.6
1.98	0.38
1.98	0.47
2.02	0.95
2.2	0.76
2.41	0.89
2.54	0.86
2.6	1.21
2.64	0.76
2.92	0.8
3.09	0.67
4.08	1.66
Note:
From Statistics: Concepts & Controversies by David S. Moore and William I. Notz

Let’s look at the scatterplot:

The plot suggests a moderately strong, positive correlation between eyelash length and eye width.

Use the CODAP software to verify that the equation of the least-squares regression line fitted to the data is $(eyelash ~length) = 0.0530 + 0.3109\cdot (eye~ width)$. The slope 0.3109 indicates that as the eye width of a mammal increases by one centimeter, the average eyelash length increases by 0.3109 centimeter.
What is the fitted value of the eyelash length for the mammal with eye width of 1.89 centimeters? The fitted value of the eyelash length for this mammal is \[(eyelash ~length) = 0.0530 + 0.3109\cdot (1.89)=0.0530+0.5876=0.6406 ~\text{cm}\] The difference (observed minus fitted) between the observed value and the fitted value is called a residual. The residual for the mammal considered equals $0.6-0.6406=-0.0406$, meaning that the fitted value overestimates the observed value by 0.0406 cm.
What is the predicted value of the eyelash length for a (new) mammal with eye width of 4 centimeters? The predicted value of the eyelash length for this mammal is \[(eyelash ~length) = 0.0530 + 0.3109\cdot (4)=0.0530+1.2436=1.2966 ~\text{cm}\]

The $R^2$ value for a regression is always between 0 and 1, inclusive, while the correlation $r$ is always between $-1$ and 1, inclusive. If we know the value of the correlation, the $R^2$ is simply the square of it. $R^2$ does not tell us the direction of the straight-line relationship, but $r$ does.
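For readers who want to check the CODAP output another way, the following is a minimal Python sketch of a least-squares fit to the eye width and eyelash length data. It uses the standard formulas slope = $S_{xy}/S_{xx}$ and intercept = $\bar{y} - \text{slope}\cdot\bar{x}$; the function names `least_squares` and `predict` are made up for this illustration and are not part of CODAP.

```python
from math import sqrt

# Eye width and eyelash length (both in centimeters) for the 22 mammals.
eye_width = [0.57, 0.39, 0.74, 0.58, 1.08, 0.72, 0.70, 1.52, 0.92, 1.75, 1.89,
             1.98, 1.98, 2.02, 2.20, 2.41, 2.54, 2.60, 2.64, 2.92, 3.09, 4.08]
eyelash_length = [0.13, 0.23, 0.23, 0.41, 0.43, 0.24, 0.36, 0.32, 0.61, 0.42, 0.60,
                  0.38, 0.47, 0.95, 0.76, 0.89, 0.86, 1.21, 0.76, 0.80, 0.67, 1.66]


def least_squares(xs, ys):
    """Return (intercept, slope, r) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r


intercept, slope, r = least_squares(eye_width, eyelash_length)
r_squared = r ** 2  # R^2 is the square of the correlation


def predict(width):
    """Fitted or predicted eyelash length (cm) for a given eye width (cm)."""
    return intercept + slope * width


fitted_189 = predict(1.89)         # fitted value for the mammal with eye width 1.89
residual_189 = 0.60 - fitted_189   # observed minus fitted
predicted_4 = predict(4.0)         # prediction for a new mammal with eye width 4

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
print(f"fitted at 1.89 cm = {fitted_189:.4f}, residual = {residual_189:.4f}")
print(f"predicted at 4 cm = {predicted_4:.4f}")
```

The intercept and slope should agree with 0.0530 and 0.3109 after rounding. Small differences in later decimals are expected because the hand calculations above use the rounded coefficients.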

More examples of interpreting the slope of a fitted regression line:

Suppose we have a linear regression model that relates the price of a used car to its age, and we find that the slope of the fitted regression line is -1,000. This means that for every year of age, the price of the car decreases by $1,000. So, if a car is 5 years older than another car, we would expect it to be $5,000 cheaper (assuming all other factors are equal).
Suppose we have a linear regression model that relates the amount of fertilizer used to the yield of a crop, and we find that the slope of the fitted regression line is 20. This means that for every additional unit of fertilizer used, the yield of the crop increases by 20 units on average. So, if one farmer uses 100 units of fertilizer and another farmer uses 120 units of fertilizer, we would expect the second farmer to have a yield that is $20 \cdot 20 = 400$ units higher (assuming all other factors are equal).
Suppose we have a linear regression model that relates the height of a basketball player to their points scored in a game, and we find that the slope of the fitted regression line is 3. This means that for every additional inch in height, the player scores 3 more points on average. So, if two players are the same age and have the same amount of experience, but one is 6 inches taller than the other, we would expect the taller player to score 18 more points on average (assuming all other factors are equal).

In general, the slope of a fitted regression line represents the change in the response variable (y) that is associated with a one-unit increase in the predictor variable (x). This interpretation can be applied to many different contexts, as long as we have a linear relationship between the variables and the assumptions of the regression model are met.

More examples of interpreting the $R^2$ value:

Suppose we have a linear regression model that relates the height and weight of a group of people, and we find that the R-squared value is 0.85. This means that 85% of the variation in weight can be explained by the variation in height. The remaining 15% of the variation may be due to other factors that are not included in the model, such as muscle mass, genetics, or lifestyle.
Suppose we have a linear regression model that relates the amount of time spent studying to a student’s exam score, and we find that the R-squared value is 0.50. This means that 50% of the variation in exam scores can be explained by the variation in study time. The remaining 50% of the variation may be due to other factors that are not included in the model, such as prior knowledge, test anxiety, or luck.
Suppose we have a linear regression model that relates the number of hours worked per week to a person’s income, and we find that the R-squared value is 0.70. This means that 70% of the variation in income can be explained by the variation in hours worked per week. The remaining 30% of the variation may be due to other factors that are not included in the model, such as education, experience, or industry.

In general, R-squared represents the proportion of variation in the response variable that is explained by the variation in the explanatory variable in the regression model. A high $R^2$ value (closer to 1) indicates that a large proportion of the variation is explained by the model, while a low $R^2$ value (closer to 0) indicates that the model is not a good fit for the data and other factors may be influencing the response variable. However, it’s important to note that $R^2$ alone does not necessarily indicate the strength of the relationship or the validity of the model. Other factors, such as the significance of the coefficients and the assumptions of the model, should also be considered when interpreting the results.

Examples of using a fitted regression equation for prediction of the response variable when the value of the explanatory variable is given:

Suppose we have a simple linear regression model that relates the number of years of experience to the salary of a software developer, and we find that the fitted regression equation is $\text{Salary} = 50,000 + 5,000 \cdot \text{Experience}$. This means that for every additional year of experience, the salary is expected to increase by $\$5,000$. If a software developer has 7 years of experience, we can use the equation to predict their salary: $\text{Salary} = 50,000 + 5,000 \cdot 7 = \$85,000.$
Suppose we have a simple linear regression model that relates the size of a house (in square feet) to its selling price, and we find that the fitted regression equation is $\text{Price} = 100,000 + 200 \cdot \text{Size}$. This means that for every additional square foot of living space, the selling price is expected to increase by $\$200$. If a house has 2,000 square feet of living space, we can use the equation to predict its selling price: $\text{Price} = 100,000 + 200 \cdot 2,000 = \$500,000.$
Suppose we have a simple linear regression model that relates the age of a car to its resale value, and we find that the fitted regression equation is $\text{Value} = 10,000 - 500 \cdot \text{Age}$. This means that for every additional year of age, the resale value is expected to decrease by $\$500$. If a car is 5 years old, we can use the equation to predict its resale value: $\text{Value} = 10,000 - 500 \cdot 5 = \$7,500$.

In general, we can use a fitted simple regression equation to predict the response variable ($y$) for a given value of the predictor variable ($x$). We simply plug in the value of $x$ into the equation and solve for $y$. However, it’s important to note that the accuracy of the prediction depends on the validity of the regression model and the assumptions behind it. In addition, extrapolation (i.e., predictions for values of $x$ that are far outside the range of the observed data) may be less reliable.

Like the mean, variance, standard deviation, and the correlation coefficient, the $R^2$ and slope can be strongly affected by outliers. Play the app: https://scsu.shinyapps.io/Regression/

15.2 The Question of Causation (Cause and Effect)

If changes in one variable cause changes in another variable, the two variables are said to have a causal (cause-and-effect) relationship.

There are three big facts about statistical evidence for cause and effect:

A strong relationship between two variables does not always indicate a cause-and-effect tie, no matter how strong.
The relationship between two variables is often influenced by other variables lurking in the background. Those variables are called confounding variables.
The best evidence for causation comes from randomized comparative experiments.

Refer to graphs on textbook page 355:

graph (a) shows an association that is explained by a direct cause-and-effect link,
graph (b) illustrates common response due to a lurking variable which creates an association,
graph (c) illustrates confounding (both the explanatory variable $x$ and the lurking variable $z$ have influence on $y$, and $x$ and $z$ are themselves associated, so we cannot tell how strong the direct effect of $x$ on $y$ is).

In summary, the observed association between two variables may be due to

direct causation
common response
confounding
two or more of these

Example 2.

Which of the following relationships are causal? For those that are not, what could be a confounding variable?

(a) cigarette smoking and death rate from lung cancer
(b) availability of guns in a nation and the nation’s homicide rate from guns
(c) number of smart phones per person in a nation and the nation’s life expectancy
(d) number of firefighters and loss due to fire

Solution.

The two variables in each pair are associated. Whether the associations in (a) and (b) are causal is unclear, and the issue might be more political than scientific. Although a cause-and-effect relationship very likely exists in both cases, neither tobacco companies nor weapon manufacturers would want the corresponding association to be causal. If the association in (a) or (b) is not causal, genetic factors and the standards for owning guns could be confounding variables for the two cases, respectively.

Neither association in (c) and (d) is causal. Wealth would be a confounding variable in (c) and size of fire would be a confounding variable in (d).

Even when an association is not causal, it can be used for prediction as long as the patterns found in past data continue to hold true.

15.3 Evidence for Causation

Despite the difficulties, it is sometimes possible to build a strong case for causation in the absence of controlled experiments.

The criteria for establishing causation when we cannot do an experiment:

The association is strong (such as smoking and lung cancer).
The association is consistent (e.g., association between smoking and lung cancer was found by many studies of different kinds of people in many countries).
Higher doses are associated with stronger responses (people who smoke more cigarettes get lung cancer more often).
The alleged cause precedes the effect in time (lung cancer develops after years of smoking).
The alleged cause is plausible (experiments with animals show that tars from cigarette smoke do cause cancer).

16 Chapter 16 The Consumer Price Index and Government Statistics

Case Study: Who made more?

Michael Jordan’s 1997-1998 salary: $33,140,000

LeBron James’ 2018-2019 salary: $35,654,150

Some further information: During 1997-1998, the gas price was about $1.00. During 2018-2019, the price was about $2.25. That is, a dollar in 2018 did not have the same buying power as a dollar in 1997, so salaries in 1997 cannot be directly compared with salaries in 2018.

In the National Basketball Association (NBA), the mean salary rose from $2,160,000 in 1997 to $7,430,431 in 2018.

To be fair, an adjustment must be made whenever dollar values from different years are compared. Governments use the Consumer Price Index (CPI) for this purpose.

Before introducing the CPI, we first introduce index numbers.

16.1 Index Numbers

The value of a particular variable relative to its value at a base period is measured by an index number. It’s calculated by the following formula:

\[\text{index number}=\frac{\text{value}}{\text{base value}}\cdot 100\]

For example, if the price of gas ($2.50) on Jan 1, 2022 is used as the base with the index 100, then the price of gas ($4.20) on Aug 20, 2022 has an index of 168, since

\[\frac{4.20}{2.50}\cdot 100=1.68\cdot 100=168\] The index number for a base period is always 100.
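The index number formula is short enough to sketch as a small Python function. The name `index_number` is made up for this illustration; the prices are the gas prices from the example above.

```python
def index_number(value, base_value):
    """Value expressed relative to the base value, on a scale where the base is 100."""
    return value / base_value * 100


# Gas price on Aug 20, 2022 ($4.20) relative to the base price on Jan 1, 2022 ($2.50).
gas_index = index_number(4.20, 2.50)

# The base period compared with itself always gives an index of 100.
base_index = index_number(2.50, 2.50)

print(f"Gas price index on Aug 20, 2022: {gas_index:.0f}")
print(f"Index for the base period: {base_index:.0f}")
```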

An index number that describes the total cost of a collection of goods and services (known as a market basket) is called a fixed market basket price index.

Refer to Example 2 on textbook page 373 and Exercises 16.14 & 16.15.

16.2 The Consumer Price Index (CPI)

Watch this video: https://www.youtube.com/watch?v=ReRYI86Ovms

The CPI

can be thought of as a fixed market basket price index for the collection of ALL the goods and services that consumers (who live or work in urban areas) buy. However, it is NOT a true fixed market basket price index because of adjustments for changing buying habits, new products, and improved quality,
can be used to change dollar amounts at one time into the amount at another time that has the same buying power, and
is determined using data from several large sample surveys (such as the Consumer Expenditure Survey of more than 30,000 households and the Telephone Point-of-Purchase Survey).

To convert dollars at time X to dollars at time Y, use the following formula:

\[\text{Dollars in year Y} = \frac{\text{CPI for year Y}}{\text{CPI for year X}}\cdot \text{Dollars in year X}\]

Example 1.

The part of the Consumer Price Index (CPI) that measures the cost of college tuition (1982‑84=100) was 748.4 in October 2018. The overall CPI was 252.9 that month.

Explain exactly what the index number 748.4 tells us about the rise in college tuition between the base period and October 2018.
Which one rose faster? College tuition or consumer prices in general? How can this be proven?

Solution.

Tuition that cost $100 in the base period cost $748.40 in October 2018.
College tuition rose much faster than consumer prices in general: since the base period, college tuition increased by 648.4%, while overall consumer prices increased by 152.9%.

Example 2.

Yankee center fielder Joe DiMaggio was paid $32,000 in 1940 and $100,000 in 1950.

Using the table of the annual average Consumer Price Index (1982–84 = 100) available from https://media.saplinglearning.com/priv/he/stats/scc10e/tables/table16-1.pdf, express DiMaggio’s 1940 salary in 1950 dollars.
By what percentage did DiMaggio’s real income change in the decade?

Solution.

According to the table, the CPI was 14 in 1940 and 24.1 in 1950, so DiMaggio’s 1940 salary in 1950 dollars is $\frac{24.1}{14}\cdot 32,000 = \$55,086$.
Since $\frac{100,000-55,086}{55,086}=0.8154$, DiMaggio’s real income increased 81.54% in the decade.

Example 3.

When Sonia started college in 2014, she set a goal of making $50,000 when she graduated. What must Sonia earn in 2018 in order to have the same buying power that $50,000 had in 2014? Use the table of the annual average Consumer Price Index (1982–84 = 100) available from https://media.saplinglearning.com/priv/he/stats/scc10e/tables/table16-1.pdf.

Solution.

According to the table, the CPI was 236.7 in 2014 and 251.1 in 2018, so Sonia must earn $\frac{251.1}{236.7}\cdot 50,000 = \$53,042$.
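The conversions in Examples 2 and 3 follow the same pattern, which can be sketched as one small Python function. The name `convert_dollars` is made up for this illustration; the CPI values are the ones quoted from the table in the solutions above.

```python
def convert_dollars(amount, cpi_from, cpi_to):
    """Convert a dollar amount from one year to another with the same buying power."""
    return amount * cpi_to / cpi_from


# Example 2: DiMaggio's 1940 salary expressed in 1950 dollars.
dimaggio_1940_in_1950 = convert_dollars(32_000, cpi_from=14, cpi_to=24.1)

# Change in real income: compare the 1950 salary with the converted 1940 salary.
real_change = (100_000 - dimaggio_1940_in_1950) / dimaggio_1940_in_1950

# Example 3: what Sonia must earn in 2018 to match $50,000 in 2014.
sonia_2018 = convert_dollars(50_000, cpi_from=236.7, cpi_to=251.1)

print(f"DiMaggio's 1940 salary in 1950 dollars: ${dimaggio_1940_in_1950:,.0f}")
print(f"Change in real income, 1940 to 1950: {real_change:.2%}")
print(f"Sonia's target in 2018 dollars: ${sonia_2018:,.0f}")
```

The key design choice is that both salaries are put into dollars of the same year before they are compared.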

Example 4.

Tuition for Michigan residents at the University of Michigan has increased as follows:

**Tuition for Michigan Residents at the University of Michigan**
Year	Tuition ($)
1998	6098
2000	6513
2002	7411
2004	8202
2006	9723
2008	11037
2010	11837
2012	11994
2014	13486
2016	14402
2018	15262
Note:
From Statistics: Concepts & Controversies by David S. Moore and William I. Notz

Use annual average CPIs from https://media.saplinglearning.com/priv/he/stats/scc10e/tables/table16-1.pdf to restate each year’s tuition in constant 1998 dollars. Make two line graphs on the same axes, one showing actual dollar tuition for these years and the other showing constant dollar tuition. Then explain what your graphs show.

Solution.

More Examples.

In-state tuition and fees at Virginia Commonwealth University (VCU) were $9517 for the 2011–2012 academic year. The Consumer Price Index (CPI) for 2011 was 224.9, and the CPI for 2001 was 177.1. What is the 2011–2012 tuition in 2001 dollars? Solution: Let the number be $x$. Then, 177.1:224.9 = x:9517. Solving this equation yields $x=7494$.
A gallon of milk cost $2.48 in 1995, and a gallon of milk cost $3.73 in 2016. Using 1995 as the base year, what is the gallon of milk index number for 2016? Solution: since 3.73/2.48 = 1.50, the gallon of milk index number for 2016 using 1995 as the base year is 150.
The breakfast fixed market basket consists of 1 gallon of milk, 1 pound of sliced bacon, and a dozen eggs. In 1995, the gallon of milk cost $2.48, the pound of sliced bacon cost $1.99, and the dozen eggs cost $0.93. In 2011, the gallon of milk cost $3.57, the pound of sliced bacon cost $4.63, and the dozen eggs cost $1.77. What is the breakfast fixed market basket price index in 2011 using 1995 as the base year? Solution: since (3.57 + 4.63 + 1.77)/(2.48 + 1.99 + 0.93) = 1.846, the breakfast fixed market basket price index in 2011 using 1995 as the base year is 184.6.
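The breakfast basket calculation can be sketched in Python as follows. The dictionaries and the function name `basket_price_index` are made up for this illustration; the prices are the ones given in the example.

```python
# Prices of the same fixed basket in the base year (1995) and in 2011.
basket_1995 = {"milk (1 gallon)": 2.48, "sliced bacon (1 pound)": 1.99, "eggs (1 dozen)": 0.93}
basket_2011 = {"milk (1 gallon)": 3.57, "sliced bacon (1 pound)": 4.63, "eggs (1 dozen)": 1.77}


def basket_price_index(current_prices, base_prices):
    """Total cost of the basket now, relative to its total cost in the base period."""
    return sum(current_prices.values()) / sum(base_prices.values()) * 100


breakfast_index = basket_price_index(basket_2011, basket_1995)
print(f"Breakfast basket price index for 2011 (1995 = 100): {breakfast_index:.1f}")
```

Note that the index compares the total cost of the whole basket, not the average of the three separate price ratios.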

The inflation rate is the percentage change in the consumer price index (CPI) compared with the same month a year earlier. It can also be applied to the price index over a given period of time. For example, the CPI is 163.0 for 1998 and 251.1 for 2018. The percentage change in price index in the period 1998-2018 is $\frac{251.1-163.0}{163.0}\approx 0.54=54\%$. That is, the inflation rate in the period 1998-2018 is 54%.
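The percentage change for 1998-2018 can be checked with a short Python sketch. The function name `percent_change` is made up for this illustration; the two CPI values are the ones quoted in the paragraph above.

```python
def percent_change(old, new):
    """Percentage change from the old value to the new value."""
    return (new - old) / old * 100


# CPI of 163.0 for 1998 and 251.1 for 2018.
inflation_1998_2018 = percent_change(163.0, 251.1)
print(f"Inflation over 1998-2018: {inflation_1998_2018:.0f}%")
```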

US inflation rates: https://www.usinflationcalculator.com/inflation/current-inflation-rates/

US inflation rates on different items: https://www.in2013dollars.com/inflation-cpi-categories

16.3 The Place of Government Statistics

Government statistical offices produce data (such as price indexes and unemployment rates) needed for government policy and decisions by businesses and individuals.

Some countries have a single statistical office, such as Statistics Canada, the Australian Bureau of Statistics, and Statistics Sweden. The U.S. has a decentralized statistical system: there is no single national office of statistics, but rather 13 designated statistical agencies that are embedded in departments. These include the Census Bureau and the Bureau of Economic Analysis in the Commerce Department, and the Bureau of Labor Statistics in the Labor Department. That means the agencies' work has to be coordinated.

Government statistical offices in the United States are reluctant to produce social statistics. Instead, the government funds some universities to conduct their own sample surveys.

17 Chapter 17 Thinking about Chance

Case Study #1: Shared Birthdays

In a class with 30 students, some students may have a shared birthday. Guess the chance of this happening.

Case Study #2: Sex Sequence

Suppose that the chance of giving birth to a boy and the chance of giving birth to a girl are both 0.5. What would be the chance that in a family of four children, the sexes are GBGB? GGGG?

Case Study #3: Playing the US Powerball

What are the odds of winning the Powerball jackpot? https://lottosimulation.com/us/powerball

Video: https://www.cbsnews.com/news/powerball-jackpot-amount-1-5-billion-odds-of-winning/

17.1 The Idea of Probability

When flipping a fair coin, there are only two possible outcomes, heads or tails. The chance of getting each of the two outcomes is the same. That is, if you toss the coin many times, you expect to get each of the two outcomes about 50% of the time (Example 1 on textbook page 406). The number 50% or 0.5 is called the probability of each. Probability is a measure of chance. It is a number between 0 and 1 that describes the proportion of times the outcome would occur in a very long series of repetitions.

Check out the app #17: https://www.statcrunch.com/applets

When rolling a fair die, there are six possible outcomes, 1, 2, 3, 4, 5, and 6. The chance of getting each of the six outcomes is the same. That is, if you roll the die many times, you expect to get each of them about 1/6 of the time. The number 1/6 is the probability of getting each number.

Both situations we discussed imply an idea:

Individual outcomes are uncertain but there is a pattern (or regular distribution of outcomes) in the long run (or a large number of repetitions). Such a phenomenon is said to be random.

The pattern is called the law of large numbers. The law states that in a large number of “independent” repetitions of a random phenomenon (such as coin flipping), proportions (or averages) are likely to become more stable as the number of trials increases.
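A short simulation can illustrate the law of large numbers for coin flipping. This is only a sketch: it uses Python's pseudo-random generator in place of a real coin, and the seed and the function name `proportion_of_heads` are arbitrary choices made for the illustration.

```python
import random


def proportion_of_heads(n_flips, rng):
    """Flip a simulated fair coin n_flips times; return the proportion of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


rng = random.Random(17)  # fixed (arbitrary) seed so the run is reproducible
for n in (10, 100, 1000, 10000, 100000):
    # The proportion tends to settle near 0.5 as n grows
    print(n, proportion_of_heads(n, rng))
```

With only 10 flips the proportion can be far from 0.5; with 100,000 flips it is typically very close to 0.5.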

Example 1.

Toss a fair coin three times and record heads (H) and tails (T) on each toss.

What is the probability of getting the outcome sequence HHH?
What is the probability of getting the outcome sequence TTT?
What is the probability of getting the outcome sequence HTH?
What is the probability of getting exactly two heads?

Solution.

There are eight possible sequences of heads and tails and they are: HHH, HHT, HTH, THH, HTT, THT, TTH, and TTT.

1/8
1/8
1/8
There are three ways of equal chance to get two heads, so the probability of getting exactly two heads is 3/8 or 0.375.
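The counting in this solution can be reproduced by listing all eight sequences and counting the favorable ones. The sketch below assumes, as the solution does, that the eight sequences are equally likely; `probability` is an illustrative helper.

```python
from fractions import Fraction
from itertools import product

# All sequences of H/T for three tosses
outcomes = ["".join(seq) for seq in product("HT", repeat=3)]


def probability(event):
    """Probability of an event when all outcomes are equally likely."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))


p_hhh = probability(lambda o: o == "HHH")
p_two_heads = probability(lambda o: o.count("H") == 2)

print(outcomes)
print(p_hhh, p_two_heads)
```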

Example 2.

What is the probability that a young man aged 20 to 24 will die next year?
What is the probability that a young woman aged 20 to 24 will die next year?
How would the result be used by an insurance company?

Solution.

Based on the 2016 National Center for Health Statistics report, the probability that a young man aged 20 to 24 will die next year is about 0.14%.
Based on the 2016 National Center for Health Statistics report, the probability that a young woman aged 20 to 24 will die next year is about 0.05%, about 1/3 of the chance for young men.
If an insurance company sells many policies to people aged 20-24, it knows that it will have to pay off next year on about 0.14% of the policies sold on young men’s lives and on about 0.05% of the policies sold on young women’s lives. It will charge more to insure a young man because the probability of having to pay is higher.

17.2 Personal Probability

What is the chance of rain tomorrow? There is no unique true answer. Different people will have different answers. Whether it will rain tomorrow is random and this random phenomenon cannot be repeated. Probability that is based on personal judgment but not on data about many repetitions is called personal probability.

A personal probability can be adjusted as more information is available. A branch of statistics called Bayesian statistics makes use of personal probability for making better decisions.

17.3 What are the Odds?

Gamblers often express chance in terms of odds rather than probability. Odds of $A$ to $B$ in favor of an outcome mean that the probability of that outcome is $\frac{A}{A+B}$. So “odds of 3 to 4 in favor of an outcome” is another way of saying “probability $\frac{3}{7}$.” A probability is always between 0 and 1, but odds range from 0 to infinity.

17.4 Probability and Risk

Probability is a useful tool for describing the risk of a bad outcome (such as getting cancer from asbestos, dying in a motor vehicle crash, and getting struck by lightning).

Psychologists have shown that we generally overestimate very small risks and underestimate higher risks. We feel safer when a risk seems under our control than when we cannot control it.

18 Chapter 18 Probability Models

Case Study: Super Bowl Probabilities

18.1 Probability Models

A probability model describes a random phenomenon by telling what outcomes are possible and how to assign probabilities to them. Because the probabilities in a probability model are assigned, there is a possibility of making mistakes. That is, models are not always correct: some are bad, some are good, and some are better. Statisticians always want better models.

When a population consists of individuals that can be classified into different categories, the population is said to be discrete. When an individual is randomly selected from a discrete population, the individual can belong to any of these categories and thus this is a random phenomenon. A probability model used in this situation is usually expressed as a table with possible outcomes (i.e., categories) in one column and the corresponding probabilities in another column. If the probability model is correct, the probabilities in the model should equal the proportions of the different categories in the population.

Example 1.

A politician took a convenience sample from the population of all adults in Minnesota. His sample showed

22% were democrats
55% were republicans
23% were independents

He thus used this as a probability model for describing the whole population.

Another politician took a random sample from the population of all adults in Minnesota. Her sample showed

44% were democrats
40% were republicans
16% were independents

She thus used this as a probability model for describing the whole population.

By the Pew Research Center: https://www.pewresearch.org/religion/religious-landscape-study/state/minnesota/party-affiliation/, the numbers were

46% were democrats
39% were republicans
15% were independents

If we trust Pew, then the first probability model is very wrong, but the second model is very good.

Example 2.

Below are racial groups in the United States (2020 Census) including racial identification of Latinos:

White Americans (61.6%)

Black Americans (12.4%)

Two or more races (10.2%)

Some other race (8.4%)

Asian Americans (6.0%)

Native Americans (1.1%)

Pacific Islander Americans (0.2%)

Source: https://en.wikipedia.org/wiki/Demographics_of_the_United_States#Race

Suppose the numbers are the same for today’s USA. Randomly choose an individual. What probability model will you use here?

Solution.

We would use the following model:

A Probability Model for Racial Groups in USA
(including racial identification of Latinos)
Racial Group	Probability
White Americans	0.616
Black Americans	0.124
Asian Americans	0.06
Native Americans	0.011
Pacific Islander Americans	0.002
Two or more races	0.102
Some other race	0.084
Note:
Reference: https://en.wikipedia.org/wiki/Demographics_of_the_United_States#Race

18.2 Probability Rules

Rule 1. The probability of any event must be a number between 0 and 1.

Rule 2. The sum of probabilities of all possible outcomes must be 1.

Rule 3. The probability that an event does not occur is 1 minus the probability that the event does occur.

Rule 4. If two events have no outcomes in common, the probability that at least one of the two events occurs is the sum of their individual probabilities.

Let’s use the rules to do some problems.

Example 2.

Roll a standard 6-sided die.

What is the probability of rolling a 3?
What is the probability of rolling a 3 or 5?
What is the probability of rolling a number other than a 3?

Solution.

Each whole number between 1 and 6 is equally likely. Thus, by Rule 2, each number has probability $\frac{1}{6}$ of occurring, so the probability of rolling a 3 is $\frac{1}{6}$.
By Rule 4, the probability of rolling a 3 or 5 equals the sum of the probability of rolling each. Since each has probability $\frac{1}{6}$, the desired probability is $\frac{2}{6}$, which reduces to $\frac{1}{3}$.
The probability of rolling a 3 is $\frac{1}{6}$. Thus, by Rule 3, the probability of rolling a number other than a 3 equals 1 minus $\frac{1}{6}$, or $\frac{5}{6}$.
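The three answers can be checked by writing the die's probability model as a table and applying Rules 2, 3, and 4 directly. This is a minimal sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Probability model for a fair six-sided die
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Rule 2: the probabilities of all possible outcomes add up to 1
assert sum(die.values()) == 1

p_3 = die[3]
p_3_or_5 = die[3] + die[5]  # Rule 4: the two events share no outcomes
p_not_3 = 1 - die[3]        # Rule 3: complement of rolling a 3

print(p_3, p_3_or_5, p_not_3)
```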

18.3 Probability and Odds

Speaking of sports or casinos, we often use odds to measure chances of winning a game. Odds are related to probability, but the two are different.

If $p$ is the probability that an event occurs, then the odds of the event occurring are $\frac{p}{1-p}$.

Odds are usually written as $a:b$ or $a$ to $b$. Then, the probability equals $\frac{a}{a+b}$.

Example 3.

What are the odds of getting a 4 when rolling a standard 6-sided die?
If the odds of winning a lottery are 2 to 100, what is the probability of winning the lottery?

Solution.

Since the probability of getting a 4 is $\frac{1}{6}$, the odds of getting a 4 equal $\frac{\frac{1}{6}}{1-\frac{1}{6}}=\frac{\frac{1}{6}}{\frac{5}{6}}=\frac{1}{5}$, written as 1:5 or 1 to 5.
The probability of winning the lottery equals $\frac{2}{2+100}=\frac{2}{102}=\frac{1}{51}$.
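The two conversions used in this solution can be sketched as a pair of small functions. The function names are hypothetical; the formulas are the ones stated above, $\frac{p}{1-p}$ and $\frac{a}{a+b}$.

```python
from fractions import Fraction


def odds_from_probability(p):
    """Odds p/(1-p) as a fraction; Fraction(1, 5) reads as odds of 1 to 5."""
    p = Fraction(p)
    return p / (1 - p)


def probability_from_odds(a, b):
    """Probability a/(a+b) for odds of a to b."""
    return Fraction(a, a + b)


odds_of_a_4 = odds_from_probability(Fraction(1, 6))
p_lottery = probability_from_odds(2, 100)
print(odds_of_a_4, p_lottery)
```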

18.4 Probability and Betting Odds

At the end of February 2019, one website gave the betting odds of 4 to 9 that the Golden State Warriors would win the 2018-2019 NBA finals. This means that a bet of $9 will pay you $4 if the team wins and cost you the $9 you bet if the team loses. The total payout on a win is $\$4+\$9=\$13$. Betting odds of 4 to 9 represent a personal probability of $\frac{9}{4+9}=\frac{9}{13}$ of winning; that is, if you made this bet 13 times, you would expect to win about 9 of those bets.

Example.

Suppose in August 2018 you believed it fair that a bet of $2 should win $10 if the Philadelphia Eagles win Super Bowl 53.

Determine the betting odds and convert them to a probability. This would be your personal probability that the Philadelphia Eagles will win Super Bowl 53.

Solution.

Since you believed it fair that a bet of $2 should win $10 if the Philadelphia Eagles win Super Bowl 53, the betting odds are 10 to 2 and thus your personal probability that the Philadelphia Eagles will win Super Bowl 53 is $\frac{2}{2+10}=\frac{1}{6}$.
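Under the convention of this section (betting odds of "win amount to stake"), the personal probability is the stake divided by the total payout. A minimal sketch, with `personal_probability` as an illustrative name:

```python
from fractions import Fraction


def personal_probability(win_amount, stake):
    """Personal probability implied by betting odds of win_amount to stake:
    the stake divided by the total payout (win_amount + stake)."""
    return Fraction(stake, win_amount + stake)


warriors = personal_probability(win_amount=4, stake=9)  # odds of 4 to 9
eagles = personal_probability(win_amount=10, stake=2)   # odds of 10 to 2
print(warriors, eagles)
```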

More about betting odds: https://sportsbook.fanduel.com/navigation/nfl?tab=super-bowl

Money line odds (aka “American” odds or “U.S.” odds) are popular in the United States. The odds for favorites are accompanied by a minus (-) sign and indicate the amount you need to stake to win $100. On the other hand, the odds for the underdogs are accompanied by a positive (+) sign and indicate the amount won for every $100 staked. (From: https://www.investopedia.com/articles/investing/042115/betting-basics-fractional-decimal-american-moneyline-odds.asp)

If, for example, a football game had a money line of Team A (+150) and Team B (-170), then the bettor immediately knows a couple of things:

Team A is expected to lose, and a bet on it will also pay out more, because it is not favored. Placing a bet of $100 on team A would win $150.
Team B is expected to win, and a bet on it will also pay out less, because it is favored. Placing a $170 bet on team B would gain $100 on payout.

Note that the role of $100 is different when the money line changes from positive to negative.
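The two roles of $100 can be made explicit in code. The sketch below computes the profit on a winning bet (not counting the returned stake) for a given money line; `moneyline_profit` is a hypothetical helper, not part of any betting site's API.

```python
def moneyline_profit(line, stake):
    """Profit on a winning bet, excluding the returned stake.

    Positive line: amount won for every $100 staked.
    Negative line: amount that must be staked to win $100.
    """
    if line == 0:
        raise ValueError("a money line cannot be 0")
    if line > 0:
        return stake * line / 100
    return stake * 100 / -line


print(moneyline_profit(150, 100))   # $100 on Team A (+150)
print(moneyline_profit(-170, 170))  # $170 on Team B (-170)
```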

18.5 Probability Models for Sampling

In Chapter 3, we introduced what a statistic is. A statistic is a numeric quantity associated with the sample. Examples of statistics are sample proportion and mean.

The sampling distribution of a statistic describes what values the statistic takes in repeated samples from the same population and how often it takes those values.

Example.

Randomly choose a sample of two numbers from the set containing 1, 2, 3, and 4. Let $X$ denote the mean of the two selected numbers. What is the probability model of $X$?

Solution.

The possible samples are: (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), and (3, 4), with sample means 1.5, 2, 2.5, 2.5, 3, and 3.5, respectively. The 6 samples are equally likely.

The probability model for the sample mean is

**Probability Model for Sampling**
x	p
1.5	1/6
2.0	1/6
2.5	2/6
3.0	1/6
3.5	1/6
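The table above can be reproduced by listing every possible sample of two numbers and tallying the sample means. This sketch assumes, as in the solution, that the six samples are equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [1, 2, 3, 4]

# Every sample of size 2 drawn without replacement
samples = list(combinations(population, 2))
means = [Fraction(sum(s), len(s)) for s in samples]

# Probability of each mean = share of the samples that produce it
counts = Counter(means)
model = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

for mean, prob in model.items():
    print(float(mean), prob)
```

Fractions are printed in lowest terms, so the probability of a mean of 2.5 appears as 1/3 rather than 2/6.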

18.6 Additional Examples

The Diversity index (or Gini index) of a population, denoted by DI, measures the probability that two people chosen at random will be from different racial and ethnic groups. It is calculated by subtracting the sum of squared proportions of all groups from 1.

The following is a probability model for Minnesota demographics:

Source: https://www.census.gov/library/visualizations/interactive/racial-and-ethnic-diversity-in-the-united-states-2010-and-2020-census.html

The DI for Minnesota can be calculated by

\[1-(76.3\%)^2-(6.9\%)^2-(1.0\%)^2-(5.2\%)^2-(0.0\%)^2-(0.4\%)^2-(4.1\%)^2-(6.1\%)^2=40.48\%.\]
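The same calculation can be sketched in Python. The list below contains only the eight proportions that appear in the formula above, in the same order; `diversity_index` is an illustrative function name.

```python
def diversity_index(proportions):
    """1 minus the sum of squared group proportions."""
    return 1 - sum(p ** 2 for p in proportions)


# Group proportions for Minnesota, in the order used in the formula above
minnesota = [0.763, 0.069, 0.010, 0.052, 0.000, 0.004, 0.041, 0.061]

di = diversity_index(minnesota)
print(f"{di:.2%}")
```

A population made up of a single group would have a DI of 0, and the DI grows as the population is spread more evenly over more groups.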

19 Chapter 19 Simulation

Case Study: Luck or Cheating?

19.1 Where Do Probabilities Come From?

Some probabilities come from many repetitions.
Some probabilities (called personal probabilities) are based on personal judgment.

19.2 Simulation Basics

Using a table of random digits or computer software to imitate chance behavior is called simulation.

Steps of simulation:

Step 1. Simulation starts with a probability model.
Step 2. Assign one or more digits to represent outcomes.
Step 3. Repeat the simulation many times (say 1000 times).

Example 1.

How can we simulate the flipping of a fair coin 10 times using the table of random digits shown above?

Solution.

Step 1: The probability model is $P(H)=0.5, P(T)=0.5$.
Step 2: We start anywhere in the table, say from the very beginning. We look at one digit. If the digit is between 0 and 4, inclusive, we write “H”; otherwise, we write “T”.
Step 3: Repeat Step 2 ten times. We have the sequence HTTHTHTTTT.

Based on the simulation, the probability of heads is estimated to be 3 in 10, or 0.3.

Example 2.

How can we simulate the flipping of an unfair coin 10 times using the table of random digits shown above? Assume the probability of heads is 0.45.

Solution.

Step 1: The probability model is $P(H)=0.45, P(T)=0.55$.
Step 2: We start anywhere in the table, say from the very beginning. We look at two digits. If the two digits represent a number between 0 and 44, inclusive, we write “H”; otherwise, we write “T”.
Step 3: Repeat Step 2 ten times. We have the sequence HTTTTTHTTH.

Based on the simulation, the probability of heads is estimated to be 3 in 10, or 0.3.
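The two examples above can also be carried out with software instead of a printed table. The sketch below is an illustration under stated assumptions: the two-digit numbers come from Python's pseudo-random generator rather than the textbook's table, so the sequences will differ from those above, and the seed and the name `simulate_flips` are arbitrary. For the fair coin, reading two digits and treating 00 to 49 as heads is equivalent to reading one digit and treating 0 to 4 as heads.

```python
import random


def simulate_flips(n_flips, heads_percent, rng):
    """Simulate coin flips with two-digit random numbers (00 to 99).

    A number below heads_percent is recorded as H, otherwise as T,
    so heads_percent=45 gives P(H) = 0.45.
    """
    flips = []
    for _ in range(n_flips):
        two_digits = rng.randrange(100)  # plays the role of two table digits
        flips.append("H" if two_digits < heads_percent else "T")
    return "".join(flips)


rng = random.Random(2024)  # fixed (arbitrary) seed for reproducibility
fair = simulate_flips(10, 50, rng)
unfair = simulate_flips(10, 45, rng)
print(fair, fair.count("H") / 10)
print(unfair, unfair.count("H") / 10)
```

Ten flips give a rough estimate only; repeating the simulation many more times brings the estimated probability of heads closer to the value in the probability model.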

19.3 Thinking about Independence

Skip!!

19.4 More Elaborate Simulations

Read examples 4-5 in the textbook.

20 Chapter 20 The House Edge: Expected Values

Case Study: Better Betting?

There are a total of 38 pockets on the American roulette wheel, ranging from 0 to 36, plus the additional 00 number. ___ of these pockets are red, the other ___ are black while the two slots featuring 0 and 00 are green. The colors alternate completely on the wheel. If two numbers are consecutive, then they have different colors. The sum of all numbers is ____.

If you bet on any single number, your chance of winning is ____ and the odds of winning are ____.
If you bet on black, your chance of winning is ____ and the odds of winning are ____.

If a bet of $1 on red will win $2, how much would you expect to win each time? How much would you expect to win when you play roulette 1000 times?

20.1 Expected Values

The average of 10, 15, 24, and 31 is

\[\frac{10+ 15+ 24+ 31}{4}\] which can be written as

\[\frac{10+15+24+31}{4}=\] \[\frac{10}{4}+\frac{15}{4}+\frac{24}{4}+\frac{31}{4}=\] \[10\cdot\frac{1}{4}+15\cdot \frac{1}{4}+24\cdot \frac{1}{4}+31\cdot\frac{1}{4}\] which is called the weighted average of the numbers 10, 15, 24, and 31, and the numbers $\frac{1}{4}$, $\frac{1}{4}$,$\frac{1}{4}$, and $\frac{1}{4}$ are called the weights. Since the weights are the same, the four numbers are equally weighted.

In many situations, weighted averages may involve different weights. Here is an example: Tom gets 80% for homework assignments, 90% for projects, and 88% for the final exam. According to the class policy, assignments account for 50% of the final grade, projects account for 30% of the final grade, and the final exam accounts for 20% of the final grade. Tom’s final score would be

\[80\%\cdot (0.5)+90\%\cdot (0.3)+88\%\cdot (0.2)\] which is the weighted average of the numbers 80%, 90%, and 88%, with weights 0.5, 0.3, and 0.2, respectively.

The weights in a weighted average must be non-negative and add up to 1. The weight of a number in a weighted average can be viewed as the contribution of that number to the weighted average.
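A weighted average is straightforward to compute once the numbers and weights are listed side by side. The following sketch checks the two conditions on the weights before averaging; `weighted_average` is an illustrative name.

```python
def weighted_average(values, weights):
    """Sum of value * weight; weights must be non-negative and add up to 1."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1) > 1e-9:
        raise ValueError("weights must be non-negative and add up to 1")
    return sum(v * w for v, w in zip(values, weights))


# Equal weights reproduce the ordinary average of 10, 15, 24, and 31
equal = weighted_average([10, 15, 24, 31], [0.25] * 4)

# Tom's final score (in percent) with weights 0.5, 0.3, and 0.2
tom = weighted_average([80, 90, 88], [0.5, 0.3, 0.2])

print(equal, round(tom, 1))
```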

The expected value of a random phenomenon that has numerical outcomes is found by multiplying each outcome by its probability and then adding all the products. That is, the expected value is the weighted average of all possible numerical outcomes with weights being the corresponding probabilities.

Example 1.

Roll a fair die once. What is the expected value of the number showing up?

Solution.

Since each of the six outcomes 1 through 6 has the same probability 1-in-6, the expected value is

\[1\cdot\frac{1}{6}+2\cdot \frac{1}{6}+3\cdot \frac{1}{6}+4\cdot\frac{1}{6}+5\cdot \frac{1}{6}+6\cdot\frac{1}{6}\] which equals 3.5.

Example 2.

In the following table, $X$ represents the number of tires with low pressure on a randomly chosen car.

What is the expected value of the number of tires with low pressure?

Solution.

The expected value is

\[0\cdot (0.17)+1\cdot (0.11)+2\cdot (0.37)+3\cdot (0.23)+4\cdot (0.12)=0+0.11+0.74+0.69+0.48=2.02.\]

Example 3.

A game involving a pair of dice pays you $5 with probability 15/36, costs you $1 with probability 10/36, and costs you $6 with probability 11/36.

What is your expected net result, in dollars, per play?

Solution.

Your expected net result is

\[5\cdot \frac{15}{36}+(-1)\cdot \frac{10}{36}+(-6)\cdot \frac{11}{36}=\frac{75}{36}+\frac{-10}{36}+\frac{-66}{36}=\frac{75-10-66}{36}=-\frac{1}{36}.\]

That is, your expected net result per play is to lose $\frac{1}{36}$ dollar.
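The three examples in this section follow the same recipe, which can be sketched as one function applied to three probability models. The probabilities are entered as exact fractions; the dictionary layout and the name `expected_value` are choices made for this illustration.

```python
from fractions import Fraction


def expected_value(model):
    """Expected value of a model given as {numerical outcome: probability}."""
    if sum(model.values()) != 1:
        raise ValueError("probabilities must add up to 1")
    return sum(outcome * prob for outcome, prob in model.items())


# Example 1: one roll of a fair die
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Example 2: number of tires with low pressure
tires = {0: Fraction(17, 100), 1: Fraction(11, 100), 2: Fraction(37, 100),
         3: Fraction(23, 100), 4: Fraction(12, 100)}

# Example 3: net result of the dice game, in dollars
dice_game = {5: Fraction(15, 36), -1: Fraction(10, 36), -6: Fraction(11, 36)}

print(float(expected_value(die)))
print(float(expected_value(tires)))
print(expected_value(dice_game))
```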

Read examples 1 & 2 on pages 462-463 of the textbook.

21 Chapter 21 What Is a Confidence Interval?

Case Study #1: Social Media Use

What is the most popular social media platform?

In 2018, the Pew Research Center published a report on social media use. A total of 2002 survey respondents of 18 years of age and older were interviewed (1502 on a cell phone, and 500 on a landline). When asked which social media platform they used, 75% of the 2002 survey respondents said they used YouTube.

If we use the number 75% as an estimate of the proportion of people 18 years of age and older who used YouTube, how accurate is it?

This case study is related to something called statistical inference, which is a process of drawing conclusions about a population on the basis of a sample.

21.1 Estimating a Population Proportion with a Single number

The proportion of all adults 18 years of age and older who used YouTube in 2018 is called a population proportion (denoted by $p$) and is a parameter. We estimate a population proportion with a sample proportion (denoted by $\hat{p}$), which is a statistic. A statistic is a number calculated from a sample.

Example 1.

A March 2018 Gallup survey asked a sample of 1041 adults if they wanted stricter laws covering the sale of firearms. A total of 697 of the survey respondents said “yes.” Although the samples in national polls are not simple random samples, they are similar enough that our method gives approximately correct confidence intervals.

What is the parameter $p$ here? Define it clearly.
What is the value of the statistic $\hat{p}$ here?

Solution.

The parameter $p$ is the proportion of all adults who wanted stricter laws covering the sale of firearms.
The statistic $\hat{p}=\frac{\text{count in the sample}}{\text{size of the sample}}=\frac{697}{1041}=0.6695$.

21.2 Estimating a Population Proportion with a Confidence Interval

Due to sample-to-sample variation, a statistic $\hat{p}$ may not have the same value as the population proportion $p$.

To account for this error in the statistic, the estimate is often written as

\[\hat{p}\pm 2\cdot \sqrt{\frac{\hat{p}\cdot (1-\hat{p})}{n}}\]

called a 95% confidence interval, where $\sqrt{\frac{\hat{p}\cdot (1-\hat{p})}{n}}$ is called the standard error and is denoted by $se$.

A note: a more accurate confidence interval uses 1.96 instead of 2, but we will use 2 for simplicity.

Example 2. (Example 1 cont’d)

Find a 95% confidence interval for the population proportion $p$.

Solution.

The standard error is $se=\sqrt{\frac{\hat{p}\cdot (1-\hat{p})}{n}}=\sqrt{\frac{0.6695\cdot (1-0.6695)}{1041}}=0.0146$, so the 95% confidence interval is:

\[0.6695\pm 2\cdot 0.0146=0.6695\pm 0.0292\] or from 0.6403 to 0.6987. The number “0.0292” is the margin of error.

An interpretation of the 95% confidence interval is:

We are 95% confident that between 64.03% and 69.87% of all adults wanted stricter laws covering the sale of firearms.

The number 95% is called the confidence level, which means that:

If all possible samples were selected, 95% of the confidence intervals constructed using the same procedure would capture the true value of the parameter.

If a larger confidence level is used for a confidence interval, the corresponding confidence interval will be wider.
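The interval from Example 2 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: `proportion_ci` uses the multiplier 2 from these notes by default, and `z_multiplier` is an added helper (not from the textbook) that looks up the normal multiplier for other confidence levels with the standard library's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(count, n, multiplier=2.0):
    """Confidence interval p_hat +/- multiplier * se for a proportion."""
    p_hat = count / n
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error
    margin = multiplier * se            # margin of error
    return p_hat - margin, p_hat + margin


def z_multiplier(confidence_level):
    """Normal multiplier for a confidence level, about 1.96 for 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence_level / 2)


# Example 2: 697 of 1041 adults said "yes"
low, high = proportion_ci(697, 1041)

# Same data with a 99% confidence level: the interval gets wider
low99, high99 = proportion_ci(697, 1041, z_multiplier(0.99))

print(low, high)
print(low99, high99)
```

Because the code does not round intermediate values, its endpoints may differ from the worked example in the last decimal place.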

Here is a simulation for confidence intervals: https://www.statcrunch.com/applets/type3&ciprop. Change the sample size or confidence level and summarize your findings.

21.3 More Examples

A 95% confidence interval of (0.53, 0.55) for the proportion of registered voters who support a candidate can be interpreted as “We are 95% confident that between 53% and 55% of all registered voters support the candidate.” The confidence interval can also be written as $0.54 \pm 0.01$, with 1% being the margin of error.
A recent survey reported that 52% of 1021 American adults surveyed say the amount of federal income tax that they pay is too high. Find a 95% confidence interval for the proportion of all American adults who would say that the amount of federal income tax they pay is too high.

Solution.

The sample proportion is: $\hat{p}=0.52$
The margin of error: $2\cdot \sqrt{\frac{\hat{p}\cdot (1-\hat{p})}{n}}=2\cdot \sqrt{\frac{0.52\cdot (1-0.52)}{1021}}=0.0313$
The 95% confidence interval is: $0.52\pm 0.0313$ or from 0.4887 to 0.5513.
An interpretation: We are 95% confident that between 48.87% and 55.13% of American adults would say that the amount of federal income tax they pay is too high.

A student polls her school to see if students in the school district are for or against the new legislation regarding school uniforms. She surveys 600 students and finds that 480 are against the new legislation. Compute a 95% confidence interval for the true percent of students who are against the new legislation, and interpret the confidence interval.
In a sample of 300 students, 68% said they own an iPod and a smartphone. Compute a 95% confidence interval for the true percent of students who own an iPod and a smartphone.
A survey of 500 randomly selected adults found that 38% of them are in favor of a new tax law. Develop a 95% confidence interval for the proportion of all adults who are in favor of the new tax law.
A company produces light bulbs, and they claim that only 2% of their light bulbs are defective. A sample of 200 light bulbs is taken, and 6 of them are found to be defective. Find a 99% confidence interval for the true proportion of defective light bulbs produced by the company.
In a survey of 1000 registered voters, 520 said they would vote for Candidate A in the upcoming election. Find a 90% confidence interval for the proportion of all registered voters who would vote for Candidate A.
In a study of 300 college students, 150 of them said they had experienced anxiety in the past year. Develop a 95% confidence interval for the proportion of all college students who have experienced anxiety in the past year.
A manufacturer claims that 80% of their products are defect-free. A sample of 100 products is taken, and 73 of them are found to be defect-free. Find a 99% confidence interval for the true proportion of defect-free products produced by the manufacturer.
In a survey of 400 residents in a city, 160 of them said they would support a tax increase to fund a new park. Develop a 95% confidence interval for the proportion of all residents who would support the tax increase.
A medical researcher wants to estimate the proportion of patients who experience a certain side effect from a medication. In a sample of 50 patients, 10 of them experience the side effect. Develop a 90% confidence interval for the true proportion of patients who experience the side effect.
In a study of 200 households, 60 of them had at least one pet. Develop a 99% confidence interval for the proportion of all households that have at least one pet.
A school district claims that 85% of its students graduate on time. In a sample of 300 students, 255 of them graduate on time. Find a 95% confidence interval for the true proportion of students who graduate on time in the school district.
In a survey of 1500 adults, 390 of them said they have traveled abroad in the past year. Develop a 95% confidence interval for the proportion of all adults who have traveled abroad in the past year.

22 Chapter 22 What Is a Test of Significance?

We have learned how to use a sample to estimate the parameter of a population. For example, we can estimate each of the following parameters using a sample:

The proportion of all Minnesota teenagers who have used TikTok.
The mean number of times a TikTok user opens TikTok per day.
The mean number of minutes teenagers spend per day on TikTok.
The difference between the proportions of female TikTok users and male TikTok users

Many times, rather than estimate a parameter, we may want to test hypotheses about it. Examples are

More than 55% of all Minnesota teenagers have used TikTok.
TikTok users, on average, open TikTok more than 10 times per day.
Teenagers spend more than 75 minutes per day on TikTok.
There is no difference between the proportions of female TikTok users and male TikTok users.
More women than men use ChatGPT.
Men earn more than women in California.

22.1 Hypotheses

A hypothesis is an assumption, claim, or conjecture about one or more populations. The process of testing a hypothesis is called hypothesis testing, which is also a type of statistical inference.

The following are additional examples of hypotheses in different contexts:

Most first-year college students are interested in a STEM (science, technology, engineering, and mathematics) subject.
Lung cancer is associated with smoking.
Triglyceride is a risk factor of myocardial infarction.
The average age of kids who own their first smart phone is 13.
The level of triglyceride is positively correlated with the average number of hours a person sits per day.

22.2 The Reasoning of Statistical Tests of Significance

How can a hypothesis be tested? Read on.

Adam attempted 30 free throws and made 20. Calculate the sample proportion $\hat{p}$. Is there evidence that Adam’s shooting percentage does not equal 60%?
Bob attempted 30 free throws and made 22. Calculate the sample proportion $\hat{p}$. Is there evidence that Bob’s shooting percentage does not equal 60%?
Cathy attempted 30 free throws and made 24. Calculate the sample proportion $\hat{p}$. Is there evidence that Cathy’s shooting percentage does not equal 60%?
Dina attempted 30 free throws and made 12. Calculate the sample proportion $\hat{p}$. Is there evidence that Dina’s shooting percentage does not equal 60%?

If we rank the four persons by the amount of evidence, we might have A < B < C = D. How can we quantify each piece of evidence?

The following reasoning quantifies the evidence for Cathy.

We first formulate two competing hypotheses:

\[H_0: p = 0.6 ~~~vs. ~~H_a: p \ne 0.6\] where $p$ represents Cathy’s true shooting percentage, $H_0$ is called the null hypothesis, and $H_a$ is called the alternative hypothesis.

For the sake of reasoning, we first assume that the null hypothesis is correct.

Under the null hypothesis, the distribution of $\hat{p}$ is approximately normal with

mean equal to the true shooting percentage (0.6), and
standard deviation equal to $\sqrt{\frac{p(1-p)}{n}}$ or 0.0894.

This distribution is called the sampling distribution of the statistic $\hat{p}$, which is plotted below:

Now, let’s add a vertical line which indicates where our sample proportion (24/30 = 0.8) is located on the number line:

If Cathy had made 12 free throws (equivalent to a sample proportion of 0.4), this would have provided as much evidence against the null hypothesis as 24 free throws, because the expected number of free throws made for 30 attempts is $30\cdot 0.6$ or 18, which is equally distant from 12 and 24.

If Cathy had made more than 24 free throws, this would have provided more evidence against the null hypothesis than just 24 free throws made. If Cathy had made fewer than 12 free throws, this would have provided more evidence against the null hypothesis than just 12 free throws made.

Now, let’s add another vertical line to the graph:

which indicates where the equivalent sample proportion (0.4) to our observed sample proportion (24/30 = 0.8) is located on the number line. By “equivalent”, I mean the two numbers provide equal evidence of rejecting the null hypothesis.

The total area of the two tail regions (under the normal curve of the sampling distribution, above the x-axis, and beyond the two red lines) gives the probability that $\hat{p}>0.8$ or $\hat{p}<0.4$. This probability is called the $p$-value. It indicates how significant the evidence (provided by the data) is against the null hypothesis.

The smaller the $p$-value, the stronger the evidence the data provide against the null hypothesis. In practice, researchers choose the so-called significance level (called $\alpha$) as a benchmark. If the $p$-value is less than or equal to $\alpha$, reject the null hypothesis. The most often used $\alpha$ value is 0.05.
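The two tail areas for Cathy can be computed directly from the approximate sampling distribution described above. This is a minimal sketch that relies on the normal approximation and uses `statistics.NormalDist` from Python's standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0, n, made = 0.6, 30, 24   # null value, attempts, free throws made
p_hat = made / n            # observed sample proportion, 0.8

# Sampling distribution of p_hat under the null hypothesis
sd = sqrt(p0 * (1 - p0) / n)
sampling_dist = NormalDist(mu=p0, sigma=sd)

upper_tail = 1 - sampling_dist.cdf(p_hat)       # P(p_hat > 0.8)
lower_tail = sampling_dist.cdf(2 * p0 - p_hat)  # P(p_hat < 0.4)
p_value = upper_tail + lower_tail

print(round(sd, 4), round(p_value, 4))
```

Under this approximation the $p$-value for Cathy comes out at roughly 0.025, which is below the common significance level of 0.05.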

Example 1.

Let the null and alternative hypotheses be $H_0: p = 0.5 ~~~vs. ~~H_a: p < 0.5$.

For which of the following $p$-values should the null hypothesis be rejected at the significance level 0.05?

0.024
0.120
0.356
0.105

Example 2.

Let the null and alternative hypotheses be $H_0: p = 0.8 ~~~vs. ~~H_a: p \ne 0.8$.

For which of the following $p$-values should the null hypothesis be rejected at the significance level 0.01?

0.031
0.012
0.008
0.145

Example 3.

Let the null and alternative hypotheses be $H_0: p = 0.4 ~~~vs. ~~H_a: p < 0.4$.

If the $p$-value is 0.067, for which of the following $\alpha$-values should the null hypothesis be rejected?

0.01
0.05
0.10
0.001

Answers for the examples above: a, c, c (that is, the first, third, and third choices, respectively).
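The decision rule behind these answers (reject when the $p$-value is less than or equal to $\alpha$) can be checked with a short sketch. The function name `reject_null` is an illustrative choice, not something from the textbook.

```python
def reject_null(p_value: float, alpha: float) -> bool:
    """Return True when the p-value is less than or equal to alpha."""
    return p_value <= alpha


# Example 1: which p-values lead to rejection at alpha = 0.05?
example_1 = [p for p in (0.024, 0.120, 0.356, 0.105) if reject_null(p, 0.05)]

# Example 2: which p-values lead to rejection at alpha = 0.01?
example_2 = [p for p in (0.031, 0.012, 0.008, 0.145) if reject_null(p, 0.01)]

# Example 3: for a p-value of 0.067, which alpha values lead to rejection?
example_3 = [a for a in (0.01, 0.05, 0.10, 0.001) if reject_null(0.067, a)]

print(example_1, example_2, example_3)
```

Note that the direction of the alternative hypothesis does not enter this step; it only affects how the $p$-value itself is computed.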

The general guideline for setting up the null and alternative hypotheses:

Always use the “=” sign under the null hypothesis.
Always use one of the three symbols “<”, “>”, and “$\ne$” under the alternative hypothesis. Which symbol to use depends on the context. Refer to book examples 2 & 3.

22.3 The Procedure of Testing a Population Proportion ($p$)

Step 1. Write down the null hypothesis and alternative hypothesis about $p$.
Step 2. Calculate the value of the test statistic.

\[z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0 (1-p_0)}{n}}}\] where $p_0$ is the value under the null hypothesis, $n$ is the sample size and $\hat{p}=\frac{x}{n}$ with $x$ being the number of favorable events in the sample.

Step 3. Calculate the $p$-value with the standard normal distribution table in the Appendix. The table gives the area under the standard normal curve to the left of $z$ (the shaded region). The $p$-value depends on the alternative hypothesis: if $H_a: p<p_0$, use the area of the shaded region as the $p$-value; if $H_a: p>p_0$, use 1 minus the area of the shaded region as the $p$-value; if $H_a: p\ne p_0$, use twice the area of the shaded region or twice the area of the other region as the $p$-value, whichever is smaller.
Step 4. Make a decision. Compare the $p$-value with a benchmark called the level of significance (denoted by $\alpha$). If the $p$-value is less than or equal to $\alpha$, reject the null hypothesis.
Step 5. Draw a conclusion in the context of the hypothesis testing problem.
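Steps 2 through 4 can be sketched in code. This is only an illustration under stated assumptions: the function name `one_proportion_z_test` and the labels `"less"`, `"greater"`, and `"two-sided"` are hypothetical choices, Python's `statistics.NormalDist` replaces the printed table, and the data in the usage lines (45 favorable events out of 100, $p_0 = 0.5$) are made up.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(x, n, p0, alternative, alpha=0.05):
    """Carry out Steps 2-4 for a test about a population proportion.

    alternative is "less" (p < p0), "greater" (p > p0), or "two-sided" (p != p0).
    Returns the test statistic, the p-value, and the reject/do-not-reject decision.
    """
    # Step 2: test statistic
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

    # Step 3: p-value from the area to the left of z under the standard normal curve
    left_area = NormalDist().cdf(z)
    if alternative == "less":
        p_value = left_area
    elif alternative == "greater":
        p_value = 1 - left_area
    elif alternative == "two-sided":
        p_value = 2 * min(left_area, 1 - left_area)
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

    # Step 4: reject the null hypothesis when the p-value is at most alpha
    return z, p_value, p_value <= alpha


# Hypothetical data: 45 favorable events in a sample of 100, testing p = 0.5
z, p_value, reject = one_proportion_z_test(45, 100, 0.5, "two-sided")
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Steps 1 and 5 (stating the hypotheses and wording the conclusion in context) still have to be done by hand.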

Example 4. Computer Crime

Adults are spending more and more time on the internet, and the number experiencing computer-based or internet-based crime is rising. A 2018 Gallup poll of 1019 adults, aged 18 and older, found that 723 of those in the sample said that they worry about having their personal, credit card, or financial information stolen by computer hackers.

Do the data provide significant evidence that more than 65% of adults aged 18 and older worry about having their personal, credit card, or financial information stolen by computer hackers?

Solution.

We are given $n = 1019, x = 723$, so the sample proportion $\hat{p}=\frac{x}{n}=\frac{723}{1019}=0.7095$.

Step 1. Write down the null hypothesis and alternative hypothesis about $p$.

\[H_0: p=0.65 ~~ vs. ~~ H_a: p>0.65\]

Step 2. Calculate the value of the test statistic.

\[z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0 (1-p_0)}{n}}}=\frac{0.7095-0.65}{\sqrt{\frac{0.65 (1-0.65)}{1019}}}=3.98\]

Step 3. According to the standard normal table, the $p$-value is $1-0.99997=0.00003$, where 0.99997 is the area under the standard normal curve to the left of the value of the test statistic.
Step 4. Since the $p$-value is less than the significance level 0.05, the null hypothesis is rejected.
Step 5. The data provide sufficient evidence that more than 65% of adults aged 18 and older worry about having their personal, credit card, or financial information stolen by computer hackers.

Example 5. ChatGPT

A random sample of 64 college students shows that 20 of them have used ChatGPT.

Do the data provide significant evidence that less than 40% of college students have used ChatGPT?

Solution.

We are given $n = 64, x = 20$, so the sample proportion $\hat{p}=\frac{x}{n}=\frac{20}{64}=0.3125$.

Step 1. Write down the null hypothesis and alternative hypothesis about $p$.

\[H_0: p=0.40 ~~ vs. ~~ H_a: p<0.40\]

Step 2. Calculate the value of the test statistic.

\[z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0 (1-p_0)}{n}}}=\frac{0.3125-0.40}{\sqrt{\frac{0.40 (1-0.40)}{64}}}=-1.43\]

Step 3. According to the standard normal table, the $p$-value is 0.07636.
Step 4. Since the $p$-value is greater than the significance level 0.05, the null hypothesis is NOT rejected.
Step 5. The data do not provide sufficient evidence that less than 40% of college students have used ChatGPT.
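The arithmetic in Examples 4 and 5 can be reproduced as a quick check. This sketch uses Python's `statistics.NormalDist` in place of the printed table and works with unrounded test statistics, so the $p$-values may differ slightly in the last digits from the table values above. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution (mean 0, sd 1)

# Example 4: H0: p = 0.65 vs. Ha: p > 0.65, with x = 723 and n = 1019
z4 = (723 / 1019 - 0.65) / sqrt(0.65 * (1 - 0.65) / 1019)
p_value4 = 1 - std_normal.cdf(z4)  # area to the right of z

# Example 5: H0: p = 0.40 vs. Ha: p < 0.40, with x = 20 and n = 64
z5 = (20 / 64 - 0.40) / sqrt(0.40 * (1 - 0.40) / 64)
p_value5 = std_normal.cdf(z5)      # area to the left of z

print(f"Example 4: z = {z4:.2f}, p-value = {p_value4:.5f}")
print(f"Example 5: z = {z5:.2f}, p-value = {p_value5:.5f}")
```

Both decisions at the 0.05 level agree with the solutions above: reject in Example 4, do not reject in Example 5.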

22.4 More Examples

To test each of the following claims, write down the null and alternative hypotheses.

More than 30% of the registered voters in Santa Clara County voted in the primary election.
The mean GPA of students in American colleges is different from 3.0 (out of 4.0).
College students take less than five years to graduate from college, on the average.
The proportion of adults who smoke in the United States is not equal to 20%.
The proportion of customers who prefer product A over product B is not equal to 50%.
The proportion of employees who are satisfied with their job is greater than 75%.
The proportion of customers who return items purchased online is not equal to 5%.
The proportion of people who prefer to shop online is not equal to 60%.
The proportion of people who have a gym membership is not equal to 40%.
The proportion of customers who are satisfied with the service at a restaurant is greater than 90%.
The proportion of people who believe in climate change is not equal to 75%.

Complete each of the following hypothesis testing problems regarding a population proportion.

A fast-food chain claims that the proportion of customers who order french fries is 60%. A random sample of 500 customers was selected, and 330 of them ordered french fries. Test whether the fast-food chain’s claim is true at a significance level of 0.05.
A shoe store claims that the proportion of customers who buy shoes during a sale is 30%. A random sample of 200 customers was selected, and 60 of them bought shoes during the sale. Test whether the shoe store’s claim is true at a significance level of 0.01.
A survey claims that the proportion of people who support a particular candidate for mayor is 55%. A random sample of 1000 voters was selected, and 600 of them indicated support for the candidate. Test whether the survey’s claim is true at a significance level of 0.05.
A clothing store claims that the proportion of customers who purchase an item of clothing online is 20%. A random sample of 400 customers was selected, and 90 of them purchased an item of clothing online. Test whether the clothing store’s claim is true at a significance level of 0.01.
A study claims that the proportion of people who prefer a particular brand of soda is 40%. A random sample of 300 people was selected, and 120 of them indicated a preference for the brand. Test whether the study’s claim is true at a significance level of 0.05.
A company claims that the proportion of employees who are satisfied with their job is 75%. A random sample of 1000 employees was selected, and 700 of them indicated satisfaction with their job. Test whether the company’s claim is true at a significance level of 0.01.
A manufacturer claims that the proportion of defective products in its production process is 5%. A random sample of 500 products was selected, and 20 of them were found to be defective. Test whether the manufacturer’s claim is true at a significance level of 0.05.
A school district claims that the proportion of students who pass a certain exam is 60%. A random sample of 400 students was selected, and 240 of them passed the exam. Test whether the school district’s claim is true at a significance level of 0.01.
A credit card company claims that the proportion of customers who carry a balance on their account is 40%. A random sample of 1000 customers was selected, and 380 of them carry a balance. Test whether the credit card company’s claim is true at a significance level of 0.05.
A charity claims that the proportion of donors who make a second donation is 30%. A random sample of 200 donors was selected, and 60 of them made a second donation. Test whether the charity’s claim is true at a significance level of 0.01.

22.5 Statistical significance or clinical significance?

Read: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8477766/#:~:text=Clinically%20significant%20findings%20are%20those,both%20subjective%20and%20objective%20terms.

23 Chapter 23 Use and Abuse of Statistical Inference

Skip!!!

Case Study: Boys and Breakfast Cereal
Using Inference Wisely
The Woes of Significance Tests
The Advantages of Confidence Intervals
Significance at the 5% Level Isn’t Magical
Statistical Controversies: Should Hypothesis Tests Be Banned?
Beware of Searching for Significance

24 Chapter 24 Two-Way Tables and the Chi-Square Test

Skip!!!

Case Study: Freedom of Speech and Political Beliefs
Two-Way Tables
Inference for a Two-Way Table
The Chi-Square Test
Using the Chi-Square Test
Simpson’s Paradox

25 Final Exam Review Problems

Quizzes + worksheets

26 Reference

https://www.amstat.org/docs/default-source/amstat-documents/gaisecollege_full.pdf

27 Appendix

The standard normal table:

STAT 103 Lecture Notes

SZ

1/6/2022

1 Chapter 1 Where Do Data Come from?

2 Chapter 2 Samples, Good and Bad

3 Chapter 3 What Do Samples Tell Us?

4 Chapter 4 Sample Surveys in the Real World

5 Chapter 5 Experiments, Good and Bad

6 Chapter 6 Experiments in the Real World

6.1 Equal Treatment for All

6.2 Double-Blind Experiments

6.3 Refusal, Nonadherence, and Dropout

6.4 Can We Generalize?

6.5 Experimental Designs in the Real World

7 Chapter 7 Data Ethics

7.1 First Principles of Data Ethics

7.2 Anonymous or Confidential?

7.3 Who Owns Published Data?

7.4 Behavioral and Social Science Experiments

8 Chapter 8 Measuring

8.1 Measurement Basics

8.2 Measurements, Valid and Invalid

8.3 Bias and Variance

8.4 Improving Reliability, Reducing Bias

8.5 Measuring Psychological and Social Factors

9 Chapter 9 Do the Numbers Make Sense?

9.1 Percentage Change

9.2 The Difference between “as large as” and “larger than”

9.3 The Difference between “Percentage Points Higher” and “Percent Higher”

9.4 More Exercises

10 Chapter 10 Graphs, Good and Bad

10.1 Pie Charts and Bar Graphs

10.2 Beware the Pictogram

10.3 Change over Time: Line Graphs

10.4 Watch Those Scales!

10.5 Making Good Graphs

10.6 More Exercises

11 Chapter 11 Displaying Distributions with Graphs

11.1 Histograms

11.2 Interpreting Histograms

11.3 Stemplots

11.4 A Class Activity

12 Chapter 12 Describing Distributions with Numbers

12.1 Median and Quartiles

12.2 The Five-Number Summary and Boxplots

12.3 Mean, Variance, and Standard Deviation

12.4 Mean or Median?

13 Chapter 13 Normal Distributions

13.1 Density Curves

13.2 The Center and Variability of a Density Curve

13.3 Normal Distributions

13.4 The 68–95–99.7 Rule

13.5 Standard Scores

13.6 Percentiles of Normal Distributions

14 Chapter 14 Describing Relationships: Scatterplots and Correlation

14.1 Scatterplots

14.2 Multiple Variables

14.3 Correlation

15 Chapter 15 Describing Relationships: Regression, Prediction, and Causation

15.1 Correlation and Regression

15.2 The Question of Causation (Cause and Effect)

15.3 Evidence for Causation

16 Chapter 16 The Consumer Price Index and Government Statistics

16.1 Index Numbers

16.2 The Consumer Price Index (CPI)

16.3 The Place of Government Statistics

17 Chapter 17 Thinking about Chance

17.1 The Idea of Probability

17.2 Personal Probability

17.3 What are the Odds?

17.4 Probability and Risk

18 Chapter 18 Probability Models

18.1 Probability Models

18.2 Probability Rules

18.3 Probability and Odds

18.4 Probability and Betting Odds

18.5 Probability Models for Sampling

18.6 Additional Examples

19 Chapter 19 Simulation

19.1 Where Do Probabilities Come From?