Introduction to Statistics

What is Statistics?

Statistics is the science of planning studies and experiments; obtaining data; organizing, summarizing, presenting, analyzing, and interpreting those data; and then drawing conclusions based on them.
Statistics is applied in nearly every field: business, finance, engineering, health science, social science, environmental science, politics, education, and so on.

Learning Goals

Descriptive Statistics
- Describing and summarizing data
Relationships between variables
- Estimating and interpreting regression models
Probability
- Understanding and quantifying randomness
Inference
- Making conclusions based on data from random samples

Data

Data is a collection of facts, such as values or measurements, or, more formally, a set of qualitative or quantitative variables.
Variables
- A variable is a value or characteristic that can differ from individual to individual.
Example
- Age
- Sex
- Race/ethnicity
- Height
- Weight
- Education
- Income
- Marital Status
- Number of children

Structure of a Data File

person_id	age	sex	income	education
1	25	M	35,000	AA
2	30	F	48,000	BA
3	22	F	25,000	AA
4	28	M	30,000	BA
5	35	M	48,000	MA
6	41	F	60,000	PhD
.	.	.	.	.
1000	20	F	20,000	GED

The dataset has 1,000 person records or cases. A case is also called a unit of observation or an observational unit.
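
The row-per-case layout above can be sketched in code. This is a minimal illustration, assuming each record is stored as a Python dictionary; only the first six rows shown in the table are included, and the names `records`, `variables`, and `n_cases` are hypothetical.

```python
# Each dictionary is one case (one person); each key is a variable.
# Only the six rows printed in the table above are reproduced here.
records = [
    {"person_id": 1, "age": 25, "sex": "M", "income": 35_000, "education": "AA"},
    {"person_id": 2, "age": 30, "sex": "F", "income": 48_000, "education": "BA"},
    {"person_id": 3, "age": 22, "sex": "F", "income": 25_000, "education": "AA"},
    {"person_id": 4, "age": 28, "sex": "M", "income": 30_000, "education": "BA"},
    {"person_id": 5, "age": 35, "sex": "M", "income": 48_000, "education": "MA"},
    {"person_id": 6, "age": 41, "sex": "F", "income": 60_000, "education": "PhD"},
]

# person_id only identifies the case; the remaining columns are the variables.
variables = [name for name in records[0] if name != "person_id"]
n_cases = len(records)

print(n_cases, "cases with variables:", variables)
```

Rows correspond to cases and columns to variables, so counting rows gives the number of observational units.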

------------------------------------------------------------------------------------
      name         state    pop2000   pop2010   fed_spend   poverty   homeownership 
---------------- --------- --------- --------- ----------- --------- ---------------
 Autauga County   Alabama    43671     54571      6.068      10.6         77.5      

 Baldwin County   Alabama   140415    182265      6.14       12.2         76.7      

 Barbour County   Alabama    29038     27457      8.752       25           68       

  Bibb County     Alabama    20826     22915      7.122      12.6         82.9      

 Blount County    Alabama    51024     57322      5.131      13.4          82       

 Bullock County   Alabama    11714     10914      9.973      25.3         76.9      

 Butler County    Alabama    21399     20947      9.312       25           69       

 Calhoun County   Alabama   112249    118572      15.44      19.5         70.7      
------------------------------------------------------------------------------------

Here, the county is the unit of observation.

Types of Variables

1. Numerical or Quantitative

consist of numbers representing measurements or counts.

- continuous: a subject or observation takes a value from an interval of real numbers, e.g., weight, height, age, etc.

- discrete: a subject or observation takes values from a finite or countable set, e.g., population, traffic volume, etc.

2. Categorical (or Qualitative or Attribute)

consist of names or labels (not numbers that represent counts or measurements)

- nominal (unordered): the data fall into categories that have no particular order or ranking in relation to each other, e.g., color, gender, nationality, etc.

- ordinal (ordered): values have a natural order or ranking, e.g., temperature (low, medium, high), exam performance (A, B, C, D, F), satisfaction (high, neutral, low), etc.

Data Collection

The first step in conducting research is to identify topics or questions that are to be investigated. A clearly laid out research question is helpful in identifying what subjects or cases should be studied and what variables are important. It is also important to consider how data are collected so that they are reliable and help achieve the research goals.

1. The Goal

Learn about the entire group of individuals

2. The Problem

It is usually impossible to collect data on the entire population:

- expensive
- time consuming
- impossible to find everyone
- not everyone is willing to participate
- the population changes constantly (births and deaths)

3. The Compromise

Collect data on a smaller group of individuals selected from the population

4. The Challenge

Bias (double-counting and under-counting)

Sampling from a Population

A population is the complete collection of all measurements or data that are being considered. Typically, it is the complete collection of data that we would like to make inferences about.

A census is the collection of data from every member of the population.

A sample is a subcollection of members selected from a population.

Because populations are often very large, a common objective of the use of statistics is to obtain data from a sample and then use those data to form a conclusion about the population.

Statistical Inference

Statistical inference is the process of making judgments about the parameters of a population and the reliability of statistical relationships, typically on the basis of random sampling.
\(\require{AMScd}\) \[ \begin{CD} Sample @> {\text {statistical inference}} >> Population \end{CD} \]

\[\underbrace{\text{sample statistic}}_{\text{investigator knows}} = \underbrace{\text{population parameter}}_{\text{investigator wants to know}} + \underbrace{\text{bias}}_{\text{nonsampling error}} + \underbrace{\text{chance variation}}_{\text{random sampling error}} \]

Sampling Errors

Random sampling error - occurs when the sample has been selected with a random method, but there is a discrepancy between a sample result and the true population result.
Nonsampling error - is the result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased conclusions, or applying statistical methods that are not appropriate for the circumstances.
- Selection bias - occurs when the sample is selected in such a way that it systematically excludes or underrepresents part of the population.
- Nonresponse bias - occurs when responses are not obtained from all individuals selected for inclusion in a sample.
- Measurement or response bias - occurs when the data are collected in such a way that the observed values tend to differ from the actual values in some systematic way. Contributing factors include question wording and order, mode of survey, and the influence of the interviewer.
Nonrandom sampling error - is the result of using a sampling method that is not random, such as using a convenience sample or a voluntary response sample.

Bias and Precision

Bias

The average difference between the estimator and the true value.

Precision

The standard deviation of the estimator.

\[ \begin{aligned} \text{Mean Squared Error, MSE} &= \text{precision}^2 + \text{bias}^2 \\ \text{Root Mean Squared Error, RMSE} &= \sqrt{\text{MSE}} \end{aligned} \]
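
The decomposition of MSE into squared precision and squared bias can be checked numerically. The sketch below is only an illustration under stated assumptions: a hypothetical normal population with mean 10 and standard deviation 3, and a deliberately biased estimator (the sample mean shrunk by 10%). None of these choices come from the slides.

```python
import random
import statistics

random.seed(0)

true_mean = 10.0      # hypothetical population parameter (assumption)
n, reps = 20, 5000    # sample size and number of repeated samples (assumptions)

# Deliberately biased estimator: 0.9 times the sample mean.
estimates = []
for _ in range(reps):
    sample = [random.gauss(true_mean, 3.0) for _ in range(n)]
    estimates.append(0.9 * statistics.fmean(sample))

# Bias: average difference between the estimator and the true value.
bias = statistics.fmean(estimates) - true_mean
# Precision: standard deviation of the estimator (population form, over the simulated draws).
precision = statistics.pstdev(estimates)
# MSE computed directly from its definition, and RMSE as its square root.
mse = statistics.fmean([(e - true_mean) ** 2 for e in estimates])
rmse = mse ** 0.5

print(f"bias={bias:.3f}  precision={precision:.3f}")
print(f"MSE={mse:.4f}  precision^2 + bias^2={precision**2 + bias**2:.4f}  RMSE={rmse:.4f}")
```

The two MSE values agree up to floating-point rounding, because the identity holds exactly when the standard deviation is computed over the same set of simulated estimates.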

Parameter and Statistic

A parameter is a numerical measurement describing some characteristic of a population.

A statistic is a numerical measurement describing some characteristic of a sample.

Example:

There are \(17,246,372\) high school students in the U.S. In a study of \(8505\) U.S. high school students \(16\) years of age or older, \(44.5\%\) of them said that they texted while driving at least once during the previous \(30\) days.

Parameter: What percent of the population texted while driving? (unknown)
Statistic: \(44.5\%\)

Sampling Methods

Simple Random Sampling (SRS)

A simple random sample of \(n\) subjects is selected in such a way that every possible sample of the same size \(n\) has the same chance of being chosen.

^{Source: OpenIntro.Org}
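
A simple random sample can be drawn with Python's standard library. This is a minimal sketch, assuming a hypothetical population of 1,000 person IDs and a sample size of 50; `random.Random.sample` draws without replacement, so every subset of size 50 is equally likely.

```python
import random

rng = random.Random(42)             # fixed seed so the sketch is reproducible

population = list(range(1, 1001))   # hypothetical person_id values 1..1000
srs = rng.sample(population, k=50)  # draw 50 distinct individuals

print(sorted(srs)[:10])
```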

Stratified Sampling

The population is divided into non-overlapping, homogeneous subgroups called strata.
Then, SRS is employed to select a certain number or a certain proportion of the whole within each stratum.

^{Source: OpenIntro.Org}
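
Stratified sampling can be sketched as an SRS within each stratum. The example below assumes a hypothetical population of 1,000 people stratified by education level, with a 10% sample taken from every stratum; the population, the strata labels, and the sampling fraction are all illustrative assumptions.

```python
import random
from collections import defaultdict

rng = random.Random(42)

# Hypothetical population: (person_id, education); education defines the stratum.
levels = ["GED", "AA", "BA", "MA", "PhD"]
population = [(i, levels[i % 5]) for i in range(1, 1001)]

# Group individuals into strata.
strata = defaultdict(list)
for unit in population:
    strata[unit[1]].append(unit)

# Take an SRS of the same proportion within every stratum.
fraction = 0.10
stratified_sample = []
for level, members in strata.items():
    k = round(fraction * len(members))
    stratified_sample.extend(rng.sample(members, k))

print(len(stratified_sample), "individuals sampled from", len(strata), "strata")
```

Every stratum contributes to the sample, which is the defining difference from cluster sampling.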

Cluster Sampling

The population is divided into non-overlapping subgroups, called clusters, that are similar to one another yet internally heterogeneous. Cluster sampling is much like SRS, but instead of randomly selecting individuals, SRS is applied to select clusters.
- In other words, unlike stratified sampling, cluster sampling is most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves don't look very different from one another. That is, we expect strata to be self-similar (homogeneous), while we expect clusters to be diverse (heterogeneous).
The elements in each selected cluster are then sampled. If all elements in each selected cluster are included, this is referred to as a "one-stage" cluster sampling plan.
Cluster sampling can sometimes be a more economical random sampling technique than the alternatives. For example, if neighborhoods represent clusters, this sampling method works best when each neighborhood is very diverse. Because each neighborhood itself encompasses diversity, a cluster sample can reduce the time and cost of data collection: the interviewer needs to visit only some of the neighborhoods, rather than all parts of a city, to collect a useful sample.

One-Stage Cluster Sampling

^{Source: OpenIntro.Org}
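
One-stage cluster sampling can be sketched as an SRS of clusters followed by taking everyone in the chosen clusters. The city below is hypothetical: 20 neighborhoods of 30 residents each, with 4 neighborhoods selected.

```python
import random

rng = random.Random(42)

# Hypothetical city: 20 neighborhoods (clusters) with 30 residents each.
clusters = {
    f"neighborhood_{c}": [f"n{c}_resident_{r}" for r in range(30)]
    for c in range(20)
}

# Stage 1 (the only stage): SRS of clusters, not of individuals.
chosen = rng.sample(sorted(clusters), k=4)

# Include every resident of each chosen neighborhood.
cluster_sample = [person for name in chosen for person in clusters[name]]

print(chosen, len(cluster_sample))
```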

Multistage Sampling

"Multistage" or "multistage cluster" sampling is an extension of cluster sampling and involves two (or more) steps.

The first step is to take a cluster sample.
Then, instead of including all of the individuals in these clusters in the sample, a second sampling method, usually SRS, is employed within each of the selected clusters.

In the neighborhood example, we could first randomly select some number of neighborhoods and then take an SRS from just those selected neighborhoods. As seen in the figure, stratified sampling requires observations to be sampled from every stratum, whereas multistage sampling selects observations only from those clusters that were randomly selected in the first step.

It is also possible to have more than two steps in multistage sampling. Each cluster may be naturally divided into subclusters. For example, each neighborhood could be divided into streets. To take a three-stage sample, we could first select some number of clusters (neighborhoods), and then, within the selected clusters, select some number of subclusters (streets). Finally, we could select some number of individuals from each of the selected streets.

Multistage Sampling

^{Source: OpenIntro.Org}
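
A two-stage version of the neighborhood example might look like the sketch below. The cluster sizes and sample sizes are assumptions chosen for illustration: 4 of 20 hypothetical neighborhoods are selected first, and then an SRS of 5 residents is taken within each.

```python
import random

rng = random.Random(42)

# Hypothetical city: 20 neighborhoods, each a list of (neighborhood, resident) pairs.
clusters = {c: [(c, r) for r in range(30)] for c in range(20)}

# Stage 1: SRS of clusters.
stage1 = rng.sample(sorted(clusters), k=4)

# Stage 2: SRS of individuals within each selected cluster.
multistage_sample = []
for c in stage1:
    multistage_sample.extend(rng.sample(clusters[c], k=5))

print(stage1, len(multistage_sample))
```

Only the clusters chosen in the first stage can contribute individuals, in contrast to stratified sampling, where every stratum is represented.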

Nonrandom Sampling

Systematic Sampling

Select every \(k^{th}\) individual from a list of the population, where the position of the first individual chosen is randomly selected from the first \(k\) individuals. This can give a non-representative sample if the list has a structure, such as a repeating pattern that lines up with \(k\).

^{Source: OpenIntro.Org}
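
Systematic sampling reduces to a random starting position plus a fixed step. The sketch assumes a hypothetical ordered list of 1,000 IDs and \(k = 20\).

```python
import random

rng = random.Random(42)

population = list(range(1, 1001))   # hypothetical ordered list of person IDs
k = 20                              # sampling interval (assumption)

start = rng.randrange(k)            # random position among the first k individuals
systematic_sample = population[start::k]   # every k-th individual from there on

print(start, systematic_sample[:5], len(systematic_sample))
```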

Convenience or Volunteer Sampling

Use the first \(n\) individuals that are available or the individuals who volunteer to participate. This is almost sure to give a non-representative sample which cannot be generalized to the population.

Example: Non-Representative Sample

Survey

Before the 1936 U.S. presidential election, the Literary Digest surveyed 10 million people who were subscribers or had telephones.
2.4 million people responded (i.e., a 24% response rate).

Prediction
Landslide victory of Landon.

Election Result
Landslide victory of Roosevelt.

What went wrong with the poll?

The sample was drawn from telephone directories, club membership lists, magazine subscriber lists, and similar sources, which skewed toward upper-middle-class people and largely excluded the poor and unemployed.
The sample suffered from both selection bias and nonresponse bias.

Observational Studies

In observational studies, data on specific characteristics are generally collected by passively monitoring study participants; the observers do not attempt to modify the individuals being studied. These studies are relatively inexpensive and are good for discovering relationships related to rare outcomes, but they are generally only sufficient to show associations.

Types of observational studies:

1. Cross-sectional Studies

data are observed, measured, and collected at one point in time, not over a period of time.

2. Retrospective Studies

data are collected from a past time period by going back in time (through examinations of records, interviews, and so on).

3. Prospective (or longitudinal or cohort) Studies

data are collected in the future from groups that share common factors (such groups are called cohorts).

Case: why did Fido and Fluffy die?

In early 2007, many dogs and cats died of kidney failure. Should you conduct a retrospective or a prospective study to find out why?

A retrospective study could be appropriate because:

- The event happened in the past.
- It may have been a rare event.
- A retrospective study may provide clues about the cause.

Once possible causes are identified, try a prospective study to verify them, provided it doesn't kill any pets.

Case: drinking coffee and longevity

Coffee drinkers may live longer - nytimes.com, May 16, 2012.

Coffee may help you live longer, study suggests - thestar.com, May 17, 2012.

No, drinking coffee probably won't make you live longer - washingtonpost.com, May 17, 2012.

"Association of coffee drinking with total and cause-specific mortality," New England Journal of Medicine, May 2012.

Sample size: 400,000
Age range: 50-71 years
Period: 1995-2008
Deaths: 52,000

How would you interpret the result?

Confounding Variable

A confounding variable is a variable that is associated with both the explanatory and response variables. Because of the confounding variable's association with both variables, we do not know if the response is due to the explanatory variable or due to the confounding variable.

For example, in a study of sunscreen use and skin cancer, sun exposure is a confounding factor because it is associated with both the use of sunscreen and the development of skin cancer. People who are out in the sun all day are more likely to use sunscreen, and people who are out in the sun all day are more likely to get skin cancer.

Lurking Variable

Lurking variables are variables that are not considered in the analysis, but may affect the nature of the relationship between the explanatory variable and the outcome.

20-year survival status of women by smoking status

\[ \begin{array}{c|lcr} & \text{Smoker} \\ & \text{Yes} & \text{No} \\ \hline \text {Dead} & 0.239 & 0.314 \\ \text {Alive} & 0.761 & 0.686 \end{array} \]

Are smokers less likely to die?

Experiments

While observational studies are effective tools for answering certain research questions, experiments are essential to measure the effect of a treatment. In an experiment, we apply some treatment and then proceed to observe its effects on the individuals.

Subject

An individual or entity participating in the study.

Treatment Group

The group of subjects that receives the treatment.

Control Group

The group of subjects that receives no treatment.

Response Variable

The outcome of interest, measured on each subject.

Factor

The categorical variable that explains the outcome of the experiment. Each category is called a level.

Blinding

When researchers keep the subjects uninformed about their treatment, the study is said to be blind. Its purpose is to reduce the potential for both researchers' and subjects' emotional bias.

Subjects would not know which experimental group they are assigned to.
The researcher (i.e. the person who is measuring the outcome) would not know which treatment is assigned to which experimental unit.

Single-blind: only one type of blinding is applied.
Double-blind: both types of blinding are applied.

Placebo

A substance or treatment with no active ingredients. The control group receives the placebo treatment.

The placebo effect is the phenomenon in which the recipient perceives an improvement in condition due to personal expectations rather than the treatment itself.

Principles of Experimental Design

Well-conducted experiments are built on three main principles.

Direct Control

Researchers assign treatments to cases, and they do their best to control any other differences in the groups. They want the groups to be as identical as possible except for the treatment, so that at the end of the experiment any difference in response between the groups can be attributed to the treatment and not to some other confounding or lurking variable. Direct control refers to variables that the researcher can control, or make the same.

Randomization

Researchers randomize patients into treatment groups to account for variables that cannot be controlled. Randomizing patients into the treatment or control group helps even out the effects of such differences, and it also prevents accidental bias from entering the study.
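
Random assignment to treatment and control groups can be sketched by shuffling the list of patients and splitting it in half. The 20 patient labels below are hypothetical, and a 50/50 split is an assumption rather than a requirement.

```python
import random

rng = random.Random(42)

patients = [f"patient_{i}" for i in range(1, 21)]   # hypothetical patient labels

# Shuffle a copy so chance alone decides who ends up in which group.
shuffled = patients[:]
rng.shuffle(shuffled)

half = len(shuffled) // 2
treatment, control = shuffled[:half], shuffled[half:]

print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because assignment does not depend on any patient characteristic, uncontrolled differences tend to balance out between the groups on average.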

Replication

In a single study, replication is done by imposing the treatment on a sufficiently large number of subjects or experimental units. Scientists may also replicate the entire experiment on an entirely different population of experimental units to verify earlier findings.

Randomized and Blocked Design

A completely randomized experiment is one in which the subjects or experimental units are randomly assigned to each group in the experiment.

Researchers sometimes know or suspect that another variable, other than the treatment, influences the response. Under these circumstances, they may carry out a blocked experiment. In this design, they first group individuals into blocks based on the identified variable and then randomize subjects within each block to the treatment groups. This strategy is referred to as blocking.

^{Source: OpenIntro.Org}
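
Blocking can be sketched as randomization carried out separately inside each block. The example assumes 24 hypothetical subjects and a hypothetical blocking variable (risk level, "high" or "low"); within each block, half the subjects go to treatment and half to control.

```python
import random
from collections import defaultdict

rng = random.Random(42)

# Hypothetical subjects with a blocking variable suspected to influence the response.
subjects = [(f"s{i}", "high" if i % 3 == 0 else "low") for i in range(1, 25)]

# Step 1: group subjects into blocks by the identified variable.
blocks = defaultdict(list)
for subject_id, risk in subjects:
    blocks[risk].append(subject_id)

# Step 2: randomize to treatment groups within each block.
assignment = {}
for risk, members in blocks.items():
    rng.shuffle(members)
    half = len(members) // 2
    for subject_id in members[:half]:
        assignment[subject_id] = "treatment"
    for subject_id in members[half:]:
        assignment[subject_id] = "control"

for risk, members in blocks.items():
    n_treated = sum(assignment[s] == "treatment" for s in members)
    print(risk, "block:", n_treated, "treatment,", len(members) - n_treated, "control")
```

Each block is split evenly, so the blocking variable cannot be unevenly distributed between the treatment and control groups.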

PATRICIA Study

PApilloma TRIal against Cancer In young Adults

^{The Lancet, Volume 374, Issue 9686, Pages 301 - 314, 25 July 2009}

Efficacy of human papillomavirus (HPV) - 16/18 AS04-adjuvanted vaccine against cervical infection and precancer caused by oncogenic HPV types (PATRICIA); final analysis of a double-blind, randomised study in young women.

^{Paavonen et al.}

\[ \begin{array}{c|c} {\text{Response Variable} \\ \text {(Acquired an infection)}} & {\text{Explanatory Variable} \\ \text{(Given the HPV vaccine)}} \\ \hline \text{Yes} & \text{Yes} \\ \text{No} & \text{No} \\ \end{array} \]

PATRICIA Study

\[ \bbox[yellow,5px] { \color{black} { \begin{array}{c} {\text{Factor 1} \\ \text{(2 Levels)}} \\ \hline \text{Drug A} \\ \text{Drug B} \end{array} } } \]

\[ \bbox[silver,5px] { \color{black} { \begin{array}{c} {\text{Factor 2} \\ \text{(2 Levels)}} \\ \hline \text{Dose A} \\ \text{Dose B} \end{array} } } \]

\[ \bbox[5px,border:2px solid red] { \begin{array}{c} \text{4 Treatments} \\ \hline \text{Drug A & Dose A}\\ \text{Drug A & Dose B}\\ \text{Drug B & Dose A}\\ \text{Drug B & Dose B} \end{array} } \]
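
The four treatments above are every combination of one level from each factor, which can be enumerated as a Cartesian product. A minimal sketch using the level names from the tables; the variable names are hypothetical.

```python
from itertools import product

factor_1 = ["Drug A", "Drug B"]   # levels of Factor 1
factor_2 = ["Dose A", "Dose B"]   # levels of Factor 2

# Each treatment pairs one level of Factor 1 with one level of Factor 2.
treatments = [f"{drug} & {dose}" for drug, dose in product(factor_1, factor_2)]

for t in treatments:
    print(t)
```

With two factors of two levels each, the number of treatments is 2 x 2 = 4; adding levels or factors multiplies the count accordingly.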

Ischemic Preconditioning

Effect on Muscular Endurance

Can Ischemic Preconditioning improve athletic performance?

1. Experimental units:

40 male teenagers

2. Response Variable:

length of time a wall squat position can be held

3. Control Groups:

2 groups who received 0 lb pressure

4. Control of Extraneous Factors:

Age, sex, athletic ability

5. Randomization

Randomly assigned 10 experimental units to each of 4 treatment groups

\[ \bbox[yellow,5px] { \color{black} { \begin{array}{c|c} \text{Factor 1} & {\text{Amount of pressure} \\ \text{applied by the} \\ \text{blood pressure cuff}} \\ \hline \text{Level 1} & \text{20 lb} \\ \text{Level 2} & \text{0 lb} \\ \end{array} } } \]

\[ \bbox[silver,5px] { \color{black} { \begin{array}{c|c} \text{Factor 2} & {\text{Length of time pressure} \\ \text{was applied}} \\ \hline \text{Level 1} & \text{10 min} \\ \text{Level 2} & \text{20 min} \\ \end{array} } } \]

\[ \bbox[5px,border:2px solid red] { \begin{array}{c} \text{4 Treatments} \\ \hline \text{20 lb/10 min}\\ \text{20 lb/20 min}\\ \text{0 lb/10 min}\\ \text{0 lb/20 min} \end{array} } \]