What is Statistics?

  • Statistics is the science of planning studies and experiments, collecting and analyzing data, and drawing conclusions from them. A statistical study consists of five steps:

Step 1: Raise a precise question

Step 2: Create a plan to answer the question

Step 3: Collect and organize the data

Step 4: Analyze the data

Step 5: Draw a conclusion about the question

  • Statistics is applied almost everywhere: business, finance, engineering, health science, social science, environmental science, politics, education, and so on.

Learning Goals

  1. Designing Observational Studies and Experiments:
    • Populations; Samples; Sampling Techniques; Experiments; Observational Study
  2. Graphical and Tabular Displays of Data:
    • Frequency Tables, Bar Graphs, Pie Charts, Two-way Tables, Histograms
  3. Summarizing Data Numerically:
    • Characteristics of Data (center, variation, distribution, outliers); Boxplots; Measures of Center (mean, median, mode, midrange); Measures of Variation (range, variance, standard deviation); Measures of Relative Standing (z-score, five number summary)
  4. Computing Probability:
    • Addition Rules, Conditional Probability, Multiplication Rule, Normal Distribution
  5. Correlation/Regression:
    • Scatterplots, Characteristics of Association, Linear Associations, Linear Models, Rate of Change, Slope, Linear Regression Model

Data

  • Data are a collection of facts, such as values or measurements; more formally, a dataset is a set of observed values of qualitative or quantitative variables.

  • Variables
    • A variable is a characteristic whose value can differ from individual to individual.
  • Examples
    • Age
    • Sex
    • Race/ethnicity
    • Height
    • Weight
    • Education
    • Income
    • Marital Status
    • Number of children

Structure of a Data File

person_id   age   sex   income   education
1           25    M     35,000   AA
2           30    F     48,000   BA
3           22    F     25,000   AA
4           28    M     30,000   BA
5           35    M     48,000   MA
6           41    F     60,000   PhD
.           .     .     .        .
1000        20    F     20,000   GED


The dataset has 1,000 person records or cases. A case is also called a unit of observation or an observational unit.

Structure of a Data File…

------------------------------------------------------------------------------------
      name         state    pop2000   pop2010   fed_spend   poverty   homeownership 
---------------- --------- --------- --------- ----------- --------- ---------------
 Autauga County   Alabama    43671     54571      6.068      10.6         77.5      

 Baldwin County   Alabama   140415    182265      6.14       12.2         76.7      

 Barbour County   Alabama    29038     27457      8.752       25           68       

  Bibb County     Alabama    20826     22915      7.122      12.6         82.9      

 Blount County    Alabama    51024     57322      5.131      13.4          82       

 Bullock County   Alabama    11714     10914      9.973      25.3         76.9      

 Butler County    Alabama    21399     20947      9.312       25           69       

 Calhoun County   Alabama   112249    118572      15.44      19.5         70.7      
------------------------------------------------------------------------------------
In this dataset, the county is the unit of observation.

Types of Variables

1. Numerical or Quantitative

  • consist of numbers representing measurements or counts.

- continuous: a subject or observation takes a value from an interval of real numbers, e.g., weight, height, age, etc.

- discrete: a subject or observation takes values from a finite or countable set (typically counts), e.g., population, traffic volume, etc.


2. Categorical (or Qualitative or Attribute)

  • consist of names or labels (not numbers that represent counts or measurements)

- nominal (unordered): the data fall into categories that have no particular order or ranking in relation to each other, e.g., color, gender, nationality, etc.

- ordinal (ordered): values have a natural order or ranking, e.g., temperature (low, medium, high), exam performance (A, B, C, D, F), satisfaction (high, neutral, low), etc.

Data Collection

The first step in conducting research is to identify topics or questions that are to be investigated. A clearly laid out research question is helpful in identifying what subjects or cases should be studied and what variables are important. It is also important to consider how data are collected so that they are reliable and help achieve the research goals.

1. The Goal

Learn about the entire group of individuals

2. The Problem

It is usually impossible to collect data on the entire population

  • expensive
  • time consuming
  • impossible to find everyone
  • not everyone willing to participate
  • population changing constantly - births and deaths

3. The Compromise

Collect data on a smaller group of individuals selected from the population

4. The Challenge

Bias (double-counting and under-counting)

Sampling from a Population

A population is the complete collection of all measurements or data that are being considered. Typically, it is the complete collection of data about which we would like to make inferences.

A census is the collection of data from every member of the population.

A sample is a subcollection of members selected from a population.

Because populations are often very large, a common objective of the use of statistics is to obtain data from a sample and then use those data to form a conclusion about the population.

Example: Identify the Variable, Sample, and Population of a Study

In a poll of \(1000\) randomly selected American adults, \(48\%\) of respondents said that they strongly disapprove of the way Congress is doing its job. The study then made an inference about all American adults.

  1. Define the variable of the study.
  2. Identify the sample.
  3. Identify the population.

Descriptive Statistics and Inferential Statistics

There are two main areas of statistics: descriptive statistics and inferential statistics.

Descriptive statistics is the practice of using tables, graphs, and calculations about a sample to draw conclusions about only the sample.

Inferential Statistics is the practice of using information from a sample to draw conclusions about the entire population.

Statistical Inference

Statistical inference is the process of making judgments about the parameters of a population and the reliability of statistical relationships, typically on the basis of random sampling.
\(\require{AMScd}\) \[ \begin{CD} Sample @> {\text {statistical inference}} >> Population \end{CD} \]

\[\underbrace{\text {sample statistic}}_{\text{investigator knows}} = \underbrace{\text {population parameter}}_{\text{investigator wants to know}} + \underbrace{\text {bias}}_{\text{nonsampling error}} + \underbrace{\text {chance variation}}_{\text{random sampling error}} \]

Sampling Errors

  • Random sampling error - occurs when the sample has been selected with a random method, but there is a discrepancy between the sample result and the true population result. This discrepancy is due to chance variation from sample to sample.

  • Nonsampling error - is the result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased conclusions, or applying statistical methods that are not appropriate for the circumstances.
    • Selection/Sampling bias - occurs when the sample is selected in such a way that it systematically excludes or underrepresents part of the population.
    • Nonresponse bias - occurs when responses are not obtained from all individuals selected for inclusion in a sample.
    • Measurement or response bias - occurs when the data are collected in such a way that the observed values tend to differ from the actual values in some systematic way. Contributing factors include question wording and order, the mode of the survey, and the influence of the interviewer.
  • Nonrandom sampling error - is the result of using a sampling method that is not random, such as a convenience sample or a voluntary response sample.

Definition - Bias

A sampling method that consistently underestimates or overestimates some characteristic of the population is said to be biased.

Sampling bias occurs if the sampling technique favors one group of individuals over another.

  • An online survey conducted to estimate the percentage of Americans who have a Facebook account.
  • The survey is biased because people who go online are favored.
  • People who never go online cannot participate in the poll.

Nonresponse bias happens if individuals refuse to be part of the study or if the researcher cannot track down individuals identified to be in the sample.

Response bias occurs if surveyed people’s answers do not match with what they really think. For example, people might exaggerate how much money they earn, or a researcher might record the information incorrectly.

Response bias can also result from the wording of questions. For example, compare the impact of the following three questions:

  1. Do you brag about your past successes with others?
  2. Do you inspire others by sharing your past successes?
  3. Do you share your past successes with others?

Bias and Precision

Bias

The average difference between the estimator and the true value.

Precision

The standard deviation of the estimator.

\[ \begin{aligned} \text{Mean Squared Error, MSE} &= \text{precision}^2 + \text{bias}^2 \\ \text{Root Mean Squared Error, RMSE} &= \sqrt{\text{MSE}} \end{aligned} \]
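
The decomposition above can be checked numerically. The following is a minimal sketch with made-up numbers, not data from the course: `estimates` is a hypothetical list of values an estimator produced over repeated samples, and `true_value` is an assumed population parameter. When precision is computed as the population (divide-by-n) standard deviation, the MSE equals precision² + bias² exactly.

```python
from statistics import fmean, pstdev

def bias_precision_mse(estimates, true_value):
    """Return (bias, precision, mse) for repeated estimates of one parameter."""
    bias = fmean(estimates) - true_value      # average difference from the true value
    precision = pstdev(estimates)             # standard deviation of the estimator
    mse = fmean((e - true_value) ** 2 for e in estimates)
    return bias, precision, mse

# Hypothetical estimates from five repeated samples; the true value is assumed to be 10.
estimates = [9.0, 11.0, 12.0, 10.0, 13.0]
bias, precision, mse = bias_precision_mse(estimates, true_value=10.0)
rmse = mse ** 0.5

print(f"bias={bias:.3f}  precision={precision:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
print("precision^2 + bias^2 =", round(precision ** 2 + bias ** 2, 3))
```

In this toy case the estimator is both biased (it averages 11 rather than 10) and imprecise, and both contribute to the MSE.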

Parameter and Statistic

A parameter is a numerical measurement describing some characteristic of a population.

A statistic is a numerical measurement describing some characteristic of a sample.

Example:

There are \(17,246,372\) high school students in the U.S. In a study of \(8505\) U.S. high school students \(16\) years of age or older, \(44.5\%\) of them said that they texted while driving at least once during the previous \(30\) days.

  1. Parameter: the percentage of all U.S. high school students aged \(16\) or older who texted while driving at least once during the previous \(30\) days (unknown)
  2. Statistic: \(44.5\%\)

Group Exercise

Open your book.

Homework 2.1: 10, 36, 38, 40, 42, 44, 46, 48

Sampling Methods

Simple Random Sampling (SRS)

A simple random sample of \(n\) subjects is selected in such a way that every possible sample of the same size \(n\) has the same chance of being chosen.

Source: OpenIntro.Org
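
As an illustration of the definition above, here is a minimal sketch of drawing a simple random sample in Python. The sampling frame (ID numbers 1 to 1000), the sample size, and the seed are assumptions made for the example; `random.sample` draws without replacement, so every subset of size \(n\) is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that every size-n subset is equally likely."""
    rng = random.Random(seed)          # seeded only so the example is reproducible
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical sampling frame: ID numbers 1..1000.
frame = list(range(1, 1001))
srs = simple_random_sample(frame, n=10, seed=1)
print(sorted(srs))
```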

Stratified Sampling

  1. The population is divided into non-overlapping, homogeneous subgroups called strata.
  2. Then, SRS is employed to select a certain number or a certain proportion of the whole within each stratum.

Source: OpenIntro.Org

Example: Stratified Sampling

Design a sample to survey \(500\) students using the stratified sampling method.

\[ \text { Strata Sizes } \bbox[white,4px] { \color{black} { \begin{array}{c|c|c|c} \text{Gender} & \text{Undergraduate} & \text{Graduate} & \text{Total} \\ \hline \text{Female} & \text{3355} & \text{4693} & \text{8048} \\ \text{Male} & \text{3734} & \text{6687} & \text{10421} \\ \hline \text{Total} & \text{7089} & \text{11380} & \text{18469} \\ \end{array} } } \] \[ \text { Strata Proportions } \bbox[yellow,4px] { \color{black} { \begin{array}{c|c|c|c} \text{Gender} & \text{Undergraduate} & \text{Graduate} & \text{Total} \\ \hline \text{Female} & \text{3355/18469 = .182} & \text{.254} & \text{.436} \\ \text{Male} & \text{.202} & \text{.362} & \text{.564} \\ \hline \text{Total} & \text{.384} & \text{.616} & \text{1.000} \\ \end{array} } } \]

\[ \text { Sample sizes } \bbox[lightblue,4px] { \color{black} { \begin{array}{c|c|c|c} \text{Gender} & \text{Undergraduate} & \text{Graduate} & \text{Total} \\ \hline \text{Female} & \text{.182(500)=91} & \text{127} & \text{218} \\ \text{Male} & \text{101} & \text{181} & \text{282} \\ \hline \text{Total} & \text{192} & \text{308} & \text{500} \\ \end{array} } } \]
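
The sample sizes in the tables above follow from proportional allocation: each stratum gets its share of the \(500\) interviews in proportion to its share of the population. A short sketch of that arithmetic, using the strata sizes from the example (the function name is just for illustration):

```python
# Strata sizes taken from the example table above.
strata_sizes = {
    ("Female", "Undergraduate"): 3355,
    ("Female", "Graduate"): 4693,
    ("Male", "Undergraduate"): 3734,
    ("Male", "Graduate"): 6687,
}

def proportional_allocation(sizes, total_sample):
    """Allocate total_sample across strata in proportion to stratum size."""
    population = sum(sizes.values())
    return {stratum: round(total_sample * size / population)
            for stratum, size in sizes.items()}

allocation = proportional_allocation(strata_sizes, total_sample=500)
for stratum, n in allocation.items():
    print(stratum, n)
print("total:", sum(allocation.values()))
```

Rounding each stratum separately happens to add up to \(500\) here; in general the rounded counts may be off by one or two and need a small manual adjustment.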

Cluster Sampling

  1. The population is often divided into non-overlapping mutually homogeneous yet internally heterogeneous subgroups called clusters. Cluster sampling is much like SRS, but instead of randomly selecting individuals, SRS is applied to select clusters.
    • In other words, unlike stratified sampling, cluster sampling is most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves don’t look very different from one another. That is, we expect strata to be self-similar (homogeneous), while we expect clusters to be diverse (heterogeneous).
  2. The elements in each cluster are then sampled. If all elements in each sampled cluster are sampled, then this is referred to as a “one-stage” cluster sampling plan.
  3. Sometimes cluster sampling can be a more economical random sampling technique than the alternatives. For example, if neighborhoods represented clusters, this sampling method works best when each neighborhood is very diverse. Because each neighborhood itself encompasses diversity, a cluster sample can reduce the time and cost associated with data collection, because the interviewer would need only go to some of the neighborhoods rather than to all parts of a city, in order to collect a useful sample.
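
The one-stage plan described above can be sketched as follows. The city, its six neighborhoods, and the household labels are hypothetical; the point is only that the random selection is applied to whole clusters, and every element of a selected cluster enters the sample.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then keep every member of each picked cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # SRS applied to cluster names
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical city: 6 neighborhoods with 5 households each.
neighborhoods = {
    f"N{i}": [f"N{i}-house{j}" for j in range(1, 6)] for i in range(1, 7)
}
cluster_sample = one_stage_cluster_sample(neighborhoods, n_clusters=2, seed=7)
for name, households in cluster_sample.items():
    print(name, households)
```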

One-Stage Cluster Sampling

Source: OpenIntro.Org

Multistage Sampling

Multistage (or multistage cluster) sampling is an extension of cluster sampling and involves two or more steps.

  1. The first step is to take a cluster sample.
  2. Then, instead of including all of the individuals in these clusters in the sample, a second sampling method, usually SRS, is employed within each of the selected clusters.

In the neighborhood example, we could first randomly select some number of neighborhoods and then take an SRS from just those selected neighborhoods. As the figure shows, stratified sampling requires observations to be sampled from every stratum, whereas multistage sampling selects observations only from those clusters that were randomly selected in the first step.

It is also possible to have more than two steps in multistage sampling. Each cluster may be naturally divided into subclusters. For example, each neighborhood could be divided into streets. To take a three-stage sample, we could first select some number of clusters (neighborhoods), and then, within the selected clusters, select some number of subclusters (streets). Finally, we could select some number of individuals from each of the selected streets.
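
A two-stage version of the neighborhood example might look like the sketch below. The neighborhoods, household labels, and sample sizes are assumptions for illustration; a third stage (streets within neighborhoods) would add one more level of random selection in the same way.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: SRS of clusters. Stage 2: SRS of individuals within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

# Hypothetical city: 6 neighborhoods with 20 households each.
city = {
    f"N{i}": [f"N{i}-house{j}" for j in range(1, 21)] for i in range(1, 7)
}
multistage = two_stage_sample(city, n_clusters=3, n_per_cluster=4, seed=11)
for name, households in multistage.items():
    print(name, households)
```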

Multistage Sampling

Source: OpenIntro.Org

Nonrandom Sampling

Systematic Sampling

Select every \(k^{th}\) individual from a list of the population, where the position of the first person chosen is randomly selected from the first \(k\) individuals. This will give a non-representative sample if there is a structure to the list.

Source: OpenIntro.Org

Solve:

The human resource department at a certain company wants to conduct a survey regarding worker benefits. The department has an alphabetical list of all \(5465\) employees at the company and wants to conduct a systematic sample of size \(60\).

  1. What is \(k\)?

  2. Determine the individuals who will be administered the survey. Randomly select a number from \(1\) to \(k\). Suppose that we randomly select \(10\). Starting with the first individual selected, the individuals in the survey will be
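
One way to check an answer to the exercise above is the short sketch below. It assumes the common convention of taking \(k\) as the population size divided by the sample size, rounded down; the function name and its parameters are hypothetical.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return (k, positions) for a systematic sample from a list of population_size."""
    k = population_size // sample_size              # assumed convention: round down
    if start is None:
        start = random.Random(seed).randint(1, k)   # random start among the first k
    positions = [start + i * k for i in range(sample_size)]
    return k, positions

# Numbers from the exercise: 5465 employees, sample of 60, random start of 10.
k, positions = systematic_sample(5465, 60, start=10)
print("k =", k)
print("first positions:", positions[:5])
print("last position:", positions[-1])
```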

Nonrandom Sampling

Convenience or Volunteer Sampling

Use the first \(n\) individuals that are available or the individuals who volunteer to participate. This is almost sure to give a non-representative sample which cannot be generalized to the population.

Example: Non-Representative Sample (the 1936 Literary Digest Poll)

Survey

  • Surveyed 10 million people who were magazine subscribers or had telephones.
  • 2.4 million people responded (i.e., a 24% response rate).

Prediction
Landslide victory of Landon.

Election Result
Landslide victory of Roosevelt.

What went wrong with the poll?

  • The sample was drawn from telephone directories, club membership lists, magazine subscriber lists, etc., which consisted mostly of upper-middle-class people and largely excluded the poor and unemployed.
  • The sample suffered from both selection and nonresponse bias.

Exercise: Identifying Sampling Methods

  1. A researcher randomly selects 20 Taco Bell locations and surveys all the employees at those locations.

  2. A news station hosts a call-in survey about whether physician-assisted death should be legalized in all states.

  3. A researcher randomly selects an LED TV out of the first 200 LED TVs on an assembly line and also selects every 200th LED TV after that.

  4. In a study at a community college, 30 instructors are randomly selected from full-time instructors and 50 instructors are selected from part-time instructors.

  5. The City Hall of Spring Hill, Kansas, creates a frame of its 5730 residents and randomly selects 60 residents.

Observational Studies

In observational studies, researchers collect data on specific characteristics by passively monitoring study participants; they do not attempt to modify the individuals being studied. These studies are relatively inexpensive and good for discovering relationships involving rare outcomes. They are generally sufficient only to show associations, not causation.


Types of observational studies:

1. Cross-sectional Studies

  • data are observed, measured, and collected at one point in time, not over a period of time.

2. Retrospective Studies

  • data are collected from a past time period by going back in time (through examinations of records, interviews, and so on).

3. Prospective (or longitudinal or cohort) Studies

  • data are collected in the future from groups that share common factors (such groups are called cohorts).

Observational Studies

Case: why did Fido and Fluffy die?

In early 2007, many dogs and cats died of kidney failure. Should you conduct a retrospective or a prospective study to find out why?


A retrospective study could be appropriate because:

  • The event happened in the past.
  • It may have been a rare event.
  • The retrospective study may provide clues on the cause.

Once possible causes are identified, try a prospective study to verify the causes - as long as it doesn’t kill any pets.

Observational Studies

Case: drinking coffee and longevity

Coffee drinkers may live longer - nytimes.com, May 16, 2012.


Coffee may help you live longer, study suggests - thestar.com, May 17, 2012.


No, drinking coffee probably won’t make you live longer - washingtonpost.com, May 17, 2012.

Observational Study

Case: drinking coffee and longevity

“Association of coffee drinking with total and cause-specific mortality,” New England Journal of Medicine, May 2012.


  • Sample size: 400,000
  • Age range: 50-71 years
  • Period: 1995 - 2008
  • Deaths: 52,000

How would you interpret the result?

Confounding Variable

A confounding variable is a variable that is associated with both the explanatory and response variables. Because of the confounding variable’s association with both variables, we do not know if the response is due to the explanatory variable or due to the confounding variable.


For example, in a study of sunscreen use and skin cancer, sun exposure is a confounding factor because it is associated with both the use of sunscreen and the development of skin cancer. People who are out in the sun all day are more likely to use sunscreen, and people who are out in the sun all day are more likely to get skin cancer.

Lurking Variable

Lurking variables are variables that are not considered in the analysis, but may affect the nature of the relationship between the explanatory variable and the outcome.


20-year survival status of women by smoking status

\[ \begin{array}{l|cc} & \text{Smoker: Yes} & \text{Smoker: No} \\ \hline \text {Dead} & 0.239 & 0.314 \\ \text {Alive} & 0.761 & 0.686 \end{array} \]

Are smokers less likely to die?

Experiments

While observational studies are effective tools for answering certain research questions, experiments are essential to measure the effect of a treatment. In an experiment, we apply some treatment and then proceed to observe its effects on the individuals.

Subject

Entity who is participating in the study.

Treatment Group

The group of subjects that receives the treatment.

Control Group

The group of subjects that receives no treatment or a placebo.

Response Variable

The outcome of interest, measured on each subject.

Factor

A categorical explanatory variable whose effect on the response is studied in the experiment. Each category is called a level.

Experiments

Blinding

When researchers keep the subjects uninformed about their treatment, the study is said to be blind. Its purpose is to reduce the potential for both researchers’ and subjects’ emotional bias.

  • Subjects would not know which experimental group they are assigned to.
  • The researcher (i.e. the person who is measuring the outcome) would not know which treatment is assigned to which experimental unit.

Single-blind: only one type of blinding is applied.
Double-blind: both types of blinding are applied.

Placebo

A placebo is a substance or treatment with no active ingredients. The control group receives the placebo.

Subjects who receive a placebo sometimes report an improvement anyway. This phenomenon, in which the recipient perceives an improvement in condition due to personal expectations rather than the treatment itself, is known as the placebo effect.

Principles of Experimental Design

Well-conducted experiments are built on three main principles.

Direct Control

Researchers assign treatments to cases, and they do their best to control any other differences in the groups. They want the groups to be as identical as possible except for the treatment, so that at the end of the experiment any difference in response between the groups can be attributed to the treatment and not to some other confounding or lurking variable. Direct control refers to variables that the researcher can control, or make the same.

Randomization

Researchers randomize patients into treatment groups to account for variables that cannot be controlled. Randomizing patients into the treatment or control group helps even out the effects of such differences, and it also prevents accidental bias from entering the study.

Replication

In a single study, replication is done by imposing the treatment on a sufficiently large number of subjects or experimental units. Scientists may also replicate the entire experiment on an entirely different population of experimental units to verify earlier findings.

Randomized and Blocked Design

A completely randomized experiment is one in which the subjects or experimental units are randomly assigned to each group in the experiment.

Researchers sometimes know or suspect that another variable, other than the treatment, influences the response. Under these circumstances, they may carry out a blocked experiment. In this design, they first group individuals into blocks based on the identified variable and then randomize subjects within each block to the treatment groups. This strategy is referred to as blocking.

Source: OpenIntro.Org
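
The difference between the two designs can be sketched in code. This is only an illustration under assumed inputs: the patient labels and the "low risk"/"high risk" blocking variable are hypothetical. In the completely randomized design all units are shuffled together; in the blocked design the shuffle happens separately inside each block, so every treatment group receives an equal share of each block.

```python
import random

def completely_randomized(units, groups, seed=None):
    """Shuffle all units, then deal them out to the groups in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {g: shuffled[i::len(groups)] for i, g in enumerate(groups)}

def blocked(units_by_block, groups, seed=None):
    """Randomize separately within each block, then combine across blocks."""
    rng = random.Random(seed)
    assignment = {g: [] for g in groups}
    for block in sorted(units_by_block):
        shuffled = list(units_by_block[block])
        rng.shuffle(shuffled)
        for i, g in enumerate(groups):
            assignment[g].extend(shuffled[i::len(groups)])
    return assignment

# Hypothetical patients, blocked by an assumed risk level.
patients = {
    "low risk": [f"L{i}" for i in range(1, 9)],
    "high risk": [f"H{i}" for i in range(1, 5)],
}
groups = ["treatment", "control"]
all_patients = patients["low risk"] + patients["high risk"]

crd = completely_randomized(all_patients, groups, seed=3)
rbd = blocked(patients, groups, seed=3)
print("completely randomized:", crd)
print("blocked:", rbd)
```

With blocking, each group is guaranteed four low-risk and two high-risk patients; with complete randomization the split between risk levels is left to chance.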

PATRICIA Study

PApilloma TRIal against Cancer In young Adults

The Lancet, Volume 374, Issue 9686, Pages 301 - 314, 25 July 2009

Efficacy of human papillomavirus (HPV)-16/18 AS04-adjuvanted vaccine against cervical infection and precancer caused by oncogenic HPV types (PATRICIA): final analysis of a double-blind, randomised study in young women.

Paavonen et al.

\[ \begin{array}{c|c} {\text{Response Variable} \\ \text {(Acquired an infection)}} & {\text{Explanatory Variable} \\ \text{(Given the HPV vaccine)}} \\ \hline \text{Yes} & \text{Yes} \\ \text{No} & \text{No} \\ \end{array} \]

PATRICIA Study

\[ \bbox[yellow,5px] { \color{black} { \begin{array}{c} {\text{Factor 1} \\ \text{(2 Levels)}} \\ \hline \text{Drug A} \\ \text{Drug B} \end{array} } } \]

\[ \bbox[silver,5px] { \color{black} { \begin{array}{c} {\text{Factor 2} \\ \text{(2 Levels)}} \\ \hline \text{Dose A} \\ \text{Dose B} \end{array} } } \]




\[ \bbox[5px,border:2px solid red] { \begin{array}{c} \text{4 Treatments} \\ \hline \text{Drug A & Dose A}\\ \text{Drug A & Dose B}\\ \text{Drug B & Dose A}\\ \text{Drug B & Dose B} \end{array} } \]

Ischemic Preconditioning

Effect on Muscular Endurance

Can Ischemic Preconditioning improve athletic performance?


1. Experimental units:

40 male teenagers


2. Response Variable:

length of time a wall squat position can be held


3. Control Groups:

2 groups who received 0 lb pressure


4. Control of Extraneous Factors:

Age, sex, athletic ability


5. Randomization

Randomly assigned 10 experimental units to each of 4 treatment groups

Ischemic Preconditioning

Effect on Muscular Endurance

\[ \bbox[yellow,5px] { \color{black} { \begin{array}{c|c} \text{Factor 1} & {\text{Amount of pressure} \\ \text{applied by the} \\ \text{blood pressure cuff}} \\ \hline \text{Level 1} & \text{20 lb} \\ \text{Level 2} & \text{0 lb} \\ \end{array} } } \]

\[ \bbox[silver,5px] { \color{black} { \begin{array}{c|c} \text{Factor 2} & {\text{Length of time pressure} \\ \text{was applied}} \\ \hline \text{Level 1} & \text{10 min} \\ \text{Level 2} & \text{20 min} \\ \end{array} } } \]



\[ \bbox[5px,border:2px solid red] { \begin{array}{c} \text{4 Treatments} \\ \hline \text{20 lb/10 min}\\ \text{20 lb/20 min}\\ \text{0 lb/ 10 min}\\ \text{0 lb/ 20 min} \end{array} } \]
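
The four treatments are all combinations of the two factors' levels, and the 40 experimental units are split at random into four groups of 10. A minimal sketch of that design follows; the subject ID numbers and the seed are assumptions, and the study's actual randomization procedure is not described in the slides.

```python
import random
from itertools import product

# Factor levels from the tables above.
pressures = ["20 lb", "0 lb"]
durations = ["10 min", "20 min"]

# Crossing the two factors gives the 2 x 2 = 4 treatments.
treatments = [f"{p}/{d}" for p, d in product(pressures, durations)]

# Hypothetical subject IDs 1..40, shuffled and cut into four groups of 10.
rng = random.Random(2024)
subjects = list(range(1, 41))
rng.shuffle(subjects)
assignment = {
    t: sorted(subjects[i * 10:(i + 1) * 10]) for i, t in enumerate(treatments)
}

for treatment, ids in assignment.items():
    print(treatment, ids)
```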

Components of a Well-Designed Study

  • There should be a control group and at least one treatment group.
  • Individuals should be randomly assigned to the control and treatment group(s).
  • The sample size should be large enough.
  • A placebo should be used when appropriate.
  • The study should be double-blind when possible. If this is impossible, then the study should be single-blind if possible.

Example: Identifying an Experiment and an Observational Study

Identify whether the study is an experiment or an observational study. Discuss whether the components of a good study were used.

For five years, the author taught an innovative intermediate algebra course in which students learned by working in groups. Then the author compared the proportion of his successful intermediate algebra students who passed trigonometry with the proportion of other professors’ successful intermediate algebra students who passed trigonometry.

Example: Redesign an Observational Study into a Well-Designed Experiment

A researcher wants to determine whether taking vitamin C helps people avoid getting the flu and the common cold. She randomly selects 100 people and asks them whether they take vitamin C and how often they had the flu or a cold in the past year. The researcher analyzes the responses and concludes that vitamin C helps people avoid the flu and colds.

  1. Describe some problems with the observational study. Include in your description at least one possible lurking or confounding variable and identify which type it is.

  2. Redesign the study so that it is a well-designed experiment.

Next


Chapter 3: Graphical and Tabular Displays of Data