Textbook Notes

These are my notes so far on the textbook. Chapter 1 is very sketchy, chapter 2 somewhat less so. From chapter 3 on, I start including more detail, including many code examples in R. These are my notes; no guarantee is made as to the accuracy or completeness of the material.

Licensing

The OpenIntro textbook is released under a Creative Commons BY-SA 3.0 license. Because this document reproduces or paraphrases parts of that text, it also is released under the same license. The original document is available at:

http://www.openintro.org

Very Brief Summaries

Some very brief selected summaries. Some how-to examples. These are here only for quick reference. Notes on each full chapter follow.

From Chapter 4

Confidence interval for numeric data (mean)

You need a point estimate (a sample mean), a sample size, a standard deviation (from the population, or else from the sample) and a confidence level. You also need to check the conditions before proceeding with the following.

# the givens:
point_estimate = 4.5 # The sample mean
n = 75 # sample size
sd = 0.225 # The standard deviation
conf_level = 0.95 # The confidence level

# First, calculate the Z_star value
alpha = 1 - conf_level
Z_star = qnorm(conf_level + (alpha/2))

# Next, the standard error
SE = sd / sqrt(n)

# Now, the confidence interval

conf_interval = c(point_estimate - Z_star * SE, point_estimate + Z_star * SE)
conf_interval
## [1] 4.449 4.551

Hypothesis test for numeric data (mean)

You need a point estimate (sample mean), a null value (the value claimed under the null hypothesis), a standard deviation (from the population, or else from the sample), a confidence level, and a decision on whether the test is one- or two-sided. You also need to check the conditions before proceeding with the following.

point_estimate <- 4.5 # sample mean
n <- 75
sd <- 0.225 # the standard deviation
conf_level <- 0.95 # the confidence level
nullval <- 4.47 # the claimed or proposed value for population mean
two.sided <- F # a one-sided test

# First, define a multiplier for one- versus two-sided
multiplier <- if(two.sided) 2 else 1

# Next, the standard error
SE <- sd / sqrt(n)

# Now the Z value
Z <- abs(point_estimate - nullval) / SE

# Now the p-value
pvalue <- multiplier * (1 - pnorm(Z))

# Now the significance level
signif_level = 1 - conf_level

# Output
SE
## [1] 0.02598
Z
## [1] 1.155
pvalue
## [1] 0.1241
# Compare the two
pvalue < signif_level # if TRUE, reject the null hypothesis
## [1] FALSE

Sample size

For a given \(Z^*\) value, standard deviation, and margin of error, the sample size should be:

\[n \ge (\frac{Z^*\sigma}{ME})^2\]

Z_star = 1.96
sd = 1.5
me = 0.1
n = ceiling(((Z_star * sd)/me)^2)
n
## [1] 865

Or, specifying confidence level instead of \(Z^*\):

conf_level = 0.95
alpha = 1 - conf_level
Z_star = qnorm(conf_level + (alpha/2))
sd = 1.5
me = 0.1
n = ceiling(((Z_star * sd)/me)^2)
n
## [1] 865

From Chapter 5

Some notes on the inference function

Chapter 5 says nothing about the inference function, but it is in this unit that it is introduced.

To load the function:

load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/inference.Rdata"))

Here are the parameters for the function.

  • y: response variable, can be numerical or categorical
  • x: explanatory variable, categorical (optional, defaults to NULL)
  • est: parameter to estimate: “mean”, “median”, or “proportion”
  • success: which level of the categorical variable to call “success”, i.e. do inference on
  • order: when x is given, order of levels of x in which to subtract parameters, defaults to NULL
  • method: CLT based (theoretical) or simulation based (randomization/bootstrapping), values can be “theoretical” or “simulation”
  • type: “ci” (confidence interval) or “ht” (hypothesis test)
  • alternative: direction of the alternative hypothesis, values are “left”, “right” and “twosided”
  • null: null value for a hypothesis test, defaults to NULL
  • boot_method: percentile (“perc”) or standard error (“se”)
  • conflevel: confidence level, value between 0 and 1, defaults to 0.95
  • siglevel: significance level, value between 0 and 1 (used only for ANOVA to determine if posttests are necessary), defaults to 0.05
  • nsim: number of simulations, defaults to 1000
  • simdist: TRUE/FALSE - return the simulation distribution, defaults to FALSE
  • seed: value passed to set.seed(), defaults to NULL; specify an integer if you want the simulation to return identical results each time it runs
  • sum_stats: TRUE/FALSE - print summary stats
  • eda_plot: TRUE/FALSE - print EDA plot
  • inf_plot: TRUE/FALSE - print inference plot
  • inf_lines: TRUE/FALSE - print lines on the inference plot for ci bounds or p-value

Here is a very simple example of calling inference for a single mean (hence no x variable is specified). We run a hypothesis test on a very small data set. The confidence level is the default of 0.95.

data <- c(1,2,3,4,5,6)
nullval = 1.8
inference(data, est="mean", method="theoretical", type="ht", alternative="twosided", null=nullval)
## Single mean 
## Summary statistics:
## mean = 3.5 ;  sd = 1.8708 ;  n = 6 
## H0: mu = 1.8 
## HA: mu != 1.8 
## Standard error = 0.7638 
## Test statistic: T = 2.226 
## Degrees of freedom:  5 
## p-value =  0.0766


Now a confidence interval on the same data. There is no inference plot this time, and we also suppress the EDA plot.

data <- c(1,2,3,4,5,6)
nullval = 1.8
inference(data, est="mean", method="theoretical", type="ci", alternative="twosided", null=nullval, eda_plot=F)
## Single mean 
## Summary statistics: mean = 3.5 ;  sd = 1.8708 ;  n = 6 
## Standard error = 0.7638 
## 95 % Confidence interval = ( 1.5367 , 5.4633 )

From Chapter 7

Find the correlation of two variables.

x <- c(10,12,14,10,13,16,20)
y <- c(9,10,12,9,14,14,16)
cor(x, y)
## [1] 0.915

Find \(R^2\)

(cor(x,y))^2
## [1] 0.8372

Chapter 1: Introduction to Data

1.1 Case Study: Using Stents to Prevent Strokes

Stents have been demonstrated to help prevent death or additional heart attacks after an initial heart attack. A study was conducted to see whether they can prevent death or additional strokes after an initial stroke.

  • Treatment group: Patients received a stent and medical management
  • Control group: Patients received medical management but no stent

The results:

            0-30 days           0-365 days
            stroke   no event   stroke   no event
treatment   33       191        45       179
control     13       214        28       199
total       46       405        73       378

A summary statistic is a single number summarizing a large amount of data. For instance, the primary results of the study after one year could be described by two summary statistics: the proportion of people who had a stroke in the treatment and control group.

In the treatment group: \(45/224 = 0.20\), or 20%.
In the control group: \(28/227 = 0.12\), or 12%.

Some notes and precautions:

  • The result was surprising, because stents were expected to help. Statisticians should be careful to pay attention to the data and not their preconceived ideas.
  • It is possible (but not yet determined) that this surprising result may be due simply to random variability.
  • These results cannot be generalized to all patients and all stents.

1.2 Data Basics

email50 = read.csv("os2_data/openintroData/email50.txt",sep="\t")
head(email50[,c("spam","num_char","line_breaks","format","number")])
##   spam num_char line_breaks format number
## 1    0   21.705         551      1  small
## 2    0    7.011         183      1    big
## 3    1    0.631          28      0   none
## 4    0    2.454          61      0  small
## 5    0   41.623        1088      1  small
## 6    0    0.057           5      0  small

A 1 in the “spam” column means the email is spam.

Each row in the table (only a small handful are shown) represents a single email or case. The columns (only a few of which are shown) represent characteristics, called variables, for each of the emails.

The data in the table represents a data matrix.

1.2.2 Types of Variables

  • numerical
      • continuous: a value like a person’s height
      • discrete: a value like a person’s year of birth
  • categorical: identify a case as one of a limited set of types or categories
      • ordinal: there is some natural ordering of the categories, e.g., customer rating from low to high
      • regular: there is no natural ordering of the categories, e.g., “animal”, “vegetable”, or “mineral”

1.2.3 Relationships Between Variables

county = read.csv("os2_data/openintroData/county.txt", sep="\t")
colnames(county)
##  [1] "name"          "state"         "pop2000"       "pop2010"      
##  [5] "fed_spend"     "poverty"       "homeownership" "multiunit"    
##  [9] "income"        "med_income"
plot(county$poverty, county$fed_spend,pch=".",ylim=c(0,30),cex=2)


The example above is based on data for all U.S. counties and shows the relationship between the poverty rate in a county and the federal spending in dollars per capita in that county. The plot indicates that there is a relationship between the two variables.

Here is another plot from the same data:

plot(county$multiunit, county$homeownership,pch=".",cex=2)


The first plot showed a positive association, where an increase in the explanatory variable (on the x axis) tended to correspond to an increase in the response variable (on the y axis). In this plot, we see a relationship between home ownership and the percent of all units that are multiunit. In this case, however, the relationship is negative. As the percent of multiunit homes goes up, the percent of home ownership goes down.

When two variables show some connection with each other, they are called associated or dependent variables. Note however that, so far, we have not found any reason to suggest that one variable causes the other.

If two variables are not associated, they are said to be independent. A pair of variables are either related in some way (associated) or they are not (independent).

1.3 Overview of Data Collection Principles

Each research question refers to a target population. For example, all voters in the U.S., or all swordfish in the Atlantic. A sample represents a subset of the cases in the population, often a very small fraction of those cases.

1.3.2 Anecdotal Evidence

There is a big difference between a statistically valid sample and an anecdote. My favorite anecdote is an acquaintance who knew someone who had a car accident and was thrown from the vehicle, which then burst into flames. The driver suffered serious injuries but eventually recovered. This led the acquaintance to conclude that one should never wear seat belts “because he would have died if he’d been wearing one.” An anecdote is remembered precisely because it is unusual; in other words, because it flies in the face of most data. So relying on anecdotes is always dangerous.

Be careful of data collected in a haphazard fashion. Such evidence may be true and verifiable, but it may represent extraordinary cases.

1.3.3 Sampling from a Population

In general, we always seek to randomly select a sample from a population. The most basic random sample is called a simple random sample, which means selecting from the population on a purely random basis. Even this method, however, can result in a biased sample. For example, if many of the randomly-selected people in a study refuse to cooperate, there is a non-response bias that can skew results. Why do some people refuse, and other people cooperate?

Another common downfall is a convenience bias. A simple example of this is somebody asking all of their immediate acquaintances what they think about a new restaurant, or a political issue, and drawing broad conclusions based on these conveniently-selected cases.

1.3.4 Explanatory and Response Variables

Consider the plot on federal spending versus poverty. If we suspect poverty might affect spending in a county, then poverty is the explanatory variable and spending is the response variable. If there are many variables, it may be possible to consider a number of them as explanatory variables.

Caution: association does not imply causation. Just because one variable is associated with another does not mean that it causes the other.

1.3.5 Introducing Observational Studies and Experiments

Researchers perform an observational study when they collect data in a way that does not directly interfere with how data arise. So for example, they conduct surveys or observe behavior, but they do not provide treatment or direct behavior.

When researchers want to investigate the possibility of a causal connection, they conduct an experiment. Usually there will be both an explanatory and a response variable. The researchers now interfere with the cases. For example, in a medical experiment, they will provide treatment in order to see what effect the treatment has. In a sleep study, they might direct certain behaviors that subjects do or do not perform before sleep.

The cases in an experiment are assigned to groups, each of which receives a different treatment. When individuals are randomly assigned to groups, the experiment is called a randomized experiment.

1.4 Observational Studies and Sampling Strategies

Generally, data in observational studies are collected only by monitoring what occurs, while experiments require the primary explanatory variable in a study be assigned for each subject by the researchers. It is often possible to draw causal conclusions from experiments; it is usually ill-advised in observational studies.

Example: a study may show that, the more often someone uses sunblock, the more often they get skin cancer. But the real cause is probably that people who use sunblock more often also receive more exposure to the sun. So there is a confounding variable, sun exposure.

Observational studies come in two forms: prospective and retrospective studies. A prospective study identifies individuals and collects information as events unfold. Retrospective studies collect data after events have taken place.

1.4.2 Three Sampling Methods

Simple random sampling draws cases on a purely random basis from the population.

Stratified sampling is a divide-and-conquer sampling strategy. The population is divided into groups called strata. The strata are chosen so that similar cases are grouped together, then a second sampling method, usually simple random sampling, is employed within each stratum.

A cluster sample is much like a two-stage simple random sample. We break up the population into many groups, called clusters. Then we sample a fixed number of clusters and collect a simple random sample within each cluster.

1.5 Experiments

1.5.1 Principles of Experimental Design

  • Controlling: A control group receives no treatment, so that it can be compared to the treatment groups
  • Randomization: Subjects are randomly assigned to groups in order to minimize the effect of any variables among the subjects that are not accounted for; for example, some subjects may respond more strongly to a treatment than others for undetermined reasons
  • Replication: To ensure accuracy, the experiment is made large enough to be statistically valid. Another alternative is to repeat the entire experiment.
  • Blocking: If researchers know or suspect that there are variables that affect results, they can assign subjects to blocks within treatment groups. An example might be smoker versus non-smoker.

1.5.2 Reducing Bias in Human Experiments

When human subjects in an experiment know that they are, or are not, receiving treatment, they may respond differently due to subtle and difficult-to-analyze psychological reasons. Therefore, they may be assigned to blind groups, which means that they do not know whether or not they receive treatment. Those in a control group receive placebo treatment, such as an inert pill instead of an active drug.

If the researchers know who is receiving treatment, their behavior may differ, and this again can introduce bias, so an experiment may be double blind, which means that neither the researchers who interact with patients, nor the patients themselves, know who receives treatment and who is in the control group.

It should be kept in mind that in some experiments, such as the stent example, a purely blind experiment is not feasible. To do so would mean performing risky surgery on patients in the control group just to convince them they had received a stent. One thing to realize early in this course is that there are rules and principles and best practices, and you can probably find at least one case where any one of them must be set aside for one reason or another.

1.6 Examining Numerical Data

1.6.1 Scatterplots for Paired Data

A scatterplot provides a case-by-case view of data for two numerical variables.

plot(email50$num_char, email50$line_breaks,col="blue")


The relationship between two variables is not necessarily linear.

cars <- read.csv("os2_data/openintroData/cars.txt", sep="\t")
colnames(cars)
## [1] "type"       "price"      "mpgCity"    "driveTrain" "passengers"
## [6] "weight"
plot(cars$weight, cars$price, col="blue")


1.6.2 Dot Plots and the Mean

Not sure how to show the mean in R; I don’t find dot plots of much use anyway.

stripchart(email50$num_char)

The mean, sometimes called the average, is a common way to measure the center of a distribution of data.

\[\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\]

\(\bar{x}\) is a notation used to denote the mean of a set of values represented by the variable \(x\). So add all of the values in a data set \(x\), then divide by the number of values (\(n\)).
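
A quick check in R, using a small made-up vector (not from the book):

x <- c(2, 4, 6, 8)
sum(x) / length(x) # same result as mean(x)
## [1] 5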

1.6.3 Histograms and Shape

hist(x=email50$num_char)


Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.

When data trail off to the right in this way and have a longer right tail, the shape is said to be right skewed. Data sets with the reverse characteristic–a long, thin tail to the left–are said to be left skewed. We also say that such a distribution has a long left tail. Data sets that show roughly equal trailing off in both directions are called symmetric.

Histograms can have more than one peak. A histogram with one prominent peak is unimodal, a histogram with two is bimodal, and a histogram with more than two is multimodal.

1.6.4 Variance and Standard Deviation

We call the distance of an observation from its mean its deviation. If we sum the squares of all these deviations and divide by \(n-1\), we have the sample variance.

\[\begin{align}\frac{\sum\limits_{i=1}^n (x_i - \bar{x})^2}{n-1}\end{align}\]

The square root of the variance is the standard deviation.

\[\sqrt{\frac{\sum\limits_{i=1}^n (x_i - \bar{x})^2}{n-1}}\]

The reason we divide by \(n-1\) rather than \(n\) is outside the scope of this book.
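
Using the same small made-up vector as above, the by-hand calculation matches R's var and sd:

x <- c(2, 4, 6, 8)
xbar <- mean(x)
sum((x - xbar)^2) / (length(x) - 1) # sample variance; same as var(x)
## [1] 6.666667
sqrt(sum((x - xbar)^2) / (length(x) - 1)) # standard deviation; same as sd(x)
## [1] 2.581989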

1.6.5 Box Plots, Quartiles and the Median

A box plot summarizes a data set using five statistics while also plotting unusual observations.

boxplot(email50$num_char)


  • The heavy black line is the median
  • The bottom boundary of the box is the first quartile
  • The top boundary of the box is the third quartile
  • The interquartile range (IQR) is the distance between the first and third quartiles
  • The whiskers extend to the bottom and top of the data, but no further than 1.5 IQRs from the first and third quartiles.
  • Any points beyond the whiskers are labelled as outliers

An outlier is an observation that appears extreme relative to the rest of the data.

Why it is important to look for outliers

  1. They can identify strong skew in the distribution
  2. They can identify errors in data collection (if a list of heights shows someone as 27 feet tall, we can be sure there was an entry error)
  3. They can provide insight into interesting properties of the data

1.6.6 Robust Statistics

The median and IQR are called robust statistics because they are not strongly affected by outliers. The mean and standard deviation are not robust because outliers can strongly affect them.
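
A small made-up illustration (not from the book): adding a single extreme outlier drags the mean but barely moves the median.

x <- c(2, 3, 4, 5, 6)
mean(x); median(x)
## [1] 4
## [1] 4
x_out <- c(x, 50) # add one extreme outlier
mean(x_out); median(x_out)
## [1] 11.66667
## [1] 4.5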

1.6.7 Transforming Data

When data are very strongly skewed, we sometimes transform them so they are easier to model. An example is applying a logarithm to a set of data so that the resulting histogram does not have bars that are so extremely different from biggest to smallest.

Another example might be when you see a non-linear relationship between two variables (such as a parabola shape). A transformation of some kind might move the data to a linear relationship, which can make additional analysis possible.

This is an important concept in statistics and machine learning, but I don’t think it ever comes up again; it is outside the scope of this course.

1.6.8 Mapping Data

See the book. The topic is creating intensity maps that can show, for example, relative population, or poverty, etc., among geographic regions. This is another subject that won’t come up again after this unit.

1.7 Categorical Data

1.7.1 Contingency Tables and Bar Plots

(Sometimes you eat the bar, and sometimes the bar eats you.)

The following table summarizes two variables, spam (rows) and number (columns), for the full email data set (from which email50 is a sample of 50 cases).

           none   small   big   Total
spam       149    168     50    367
not spam   400    2659    495   3554
Total      549    2827    545   3921

A table that summarizes data for two categorical variables in this way is called a contingency table.

A table that summarizes data for a single variable is called a frequency table:

none   small   big   Total
549    2827    545   3921

A bar plot is a common way to display a single categorical variable.

barplot(table(email50$number))


1.7.2 Row and Column Proportions

The following shows the row proportions for the previous table on spam and number. The row proportions are computed as the counts divided by their row totals.

           none               small                big                Total
spam       149/367 = 0.406    168/367 = 0.458      50/367 = 0.136     1.000
not spam   400/3554 = 0.113   2659/3554 = 0.748    495/3554 = 0.139   1.000
Total      549/3921 = 0.140   2827/3921 = 0.721    545/3921 = 0.139   1.000

The following in contrast shows column proportions.

           none              small               big               Total
spam       149/549 = 0.271   168/2827 = 0.059    50/545 = 0.092     367/3921 = 0.094
not spam   400/549 = 0.729   2659/2827 = 0.941   495/545 = 0.908    3554/3921 = 0.906
Total      1.000             1.000               1.000              1.000

1.7.3 Segmented Bar and Mosaic Plots

A segmented bar plot:

barplot(table(email50[,c("spam", "number")]))


A mosaic plot for one variable:

mosaicplot(table(email50$number))


A mosaic plot for two variables:

mosaicplot(table(email50[,c("number", "spam")]))


1.7.4 The Only Pie Chart You Will See in This Book

The message: pie charts are misleading by nature and should be shunned. Call them mean names and tell them their mommy is calling for them, they’d better run home and see what she wants.

1.7.5 Comparing Numerical Data Across Groups

The side-by-side box plot is a traditional tool for comparing across groups.

countyall = read.csv("os2_data/openintroData/countyComplete.txt", sep="\t")
county2 = countyall[,c("pop2000","pop2010","median_household_income")]
county2 = county2[complete.cases(county2),]
county2$pop_change <- county2$pop2010 - county2$pop2000
county2$pop_gain <- ifelse(county2$pop_change > 0, "yes", "no")

par(mfrow=c(1,2))
boxplot(county2[county2$pop_gain == "yes",]$median_household_income,
  county2[county2$pop_gain == "no",]$median_household_income)

library(openintro)
xlim=c(0,125000)
ylim=c(0,450)
histPlot(county2[county2$pop_gain == "yes",]$median_household_income,hollow=T,breaks=30,border="blue",xlim=xlim,ylim=ylim)
par(new=T)
histPlot(county2[county2$pop_gain == "no",]$median_household_income,hollow=T,breaks=30,border="red",xlim=xlim,ylim=ylim)


1.8 Case Study: Gender Discrimination

Data from the real world is always variable. If you flip a coin ten times today, and ten times tomorrow, you might expect the number of heads to be about five both days, but if you flipped the coin ten times every day for a hundred days, and got exactly five heads each time, it would be very unexpected. Data can vary, in other words, and in fact it will vary. One of the key questions in statistics is to assess whether differences in data can be reasonably attributed to random variation.

1.8.1 Variability Within Data

We consider a study investigating gender discrimination in the 1970s, which is set in the context of personnel decisions within a bank. The research question we hope to answer is, “Are females unfairly discriminated against in promotion decisions made by male managers?”

         promoted   not promoted   total
male     21         3              24
female   14         10             24
total    35         13             48

The study looks at 48 cases, 24 male and 24 female. Of the males, 21 were promoted and 3 were not, a proportion of 0.875. Of the females, 14 were promoted and 10 were not, a proportion of 0.583. The difference between these proportions is 0.875-0.583=0.292.

There is a considerable difference in the two proportions. We need models to assess the difference.

\(H_0\): The independence model states that gender and decision are independent. They have no relationship, and the observed difference between promotion of male and female managers was due to chance.

\(H_A\): The alternative model states that gender and decision are not independent. The difference in promotion rates is not due to chance.

We choose between these two competing claims by assessing if the data conflict so much with \(H_0\) that the independence model cannot be deemed reasonable. If this is the case, and the data support \(H_A\), then we will reject the notion of independence and conclude there was discrimination.

1.8.2 Simulating the Study

If the difference in the two proportions was due to random variation, then we should expect to see such variation again. So we can simulate the results to see what happens. To do this, we can make 48 cards, 24 for men and 24 for women. These cards are thoroughly shuffled, then dealt into two piles, one of which (promoted) has 35 cards and the other (not promoted) has 13. Then we count the number of cards in each pile that are men and the number that are women.

         promoted   not promoted   total
male     18         6              24
female   17         7              24
total    35         13             48

The proportion of men promoted in the simulation is 18/24=0.75. The proportion of women promoted is 17/24=0.708. The difference between these two proportions is 0.75-0.708=0.042. This is in comparison to the difference of 0.292 seen in the actual study.

1.8.3 Checking for Independence

set.seed(8535)
sims = vector()
for(x in 0:100) {
  hiredidx <- sample(0:48, 24)
  menhired = sum(hiredidx<=24)
  womenhired =sum(hiredidx>24)
  menhired.prop = menhired/24
  womenhired.prop = womenhired/24
  sims[x] = menhired.prop - womenhired.prop
}
sims
##   [1]  0.00000  0.16667  0.16667 -0.08333  0.16667  0.16667  0.00000
##   [8] -0.25000  0.16667  0.16667  0.08333  0.00000 -0.25000 -0.08333
##  [15] -0.25000  0.08333 -0.08333  0.25000  0.25000  0.08333  0.16667
##  [22]  0.08333 -0.16667  0.16667  0.08333  0.16667  0.00000 -0.08333
##  [29]  0.25000 -0.08333  0.25000  0.00000  0.16667  0.00000  0.08333
##  [36]  0.00000  0.16667  0.08333 -0.33333  0.00000  0.00000 -0.08333
##  [43] -0.08333  0.33333  0.08333  0.25000  0.00000  0.08333  0.25000
##  [50] -0.08333 -0.16667 -0.16667  0.08333  0.00000  0.00000  0.08333
##  [57]  0.08333  0.16667 -0.08333  0.08333  0.25000  0.08333  0.08333
##  [64] -0.08333  0.08333  0.08333  0.25000  0.08333  0.16667 -0.08333
##  [71]  0.25000  0.33333 -0.25000  0.33333 -0.08333  0.08333  0.00000
##  [78] -0.16667 -0.08333 -0.08333  0.08333  0.08333 -0.08333  0.16667
##  [85]  0.16667  0.16667  0.25000 -0.08333  0.00000 -0.08333  0.00000
##  [92] -0.25000  0.16667  0.25000  0.25000  0.16667  0.08333  0.00000
##  [99]  0.08333 -0.16667
stripchart(sims,method="stack",pch=21,at=0,bg="lightblue",col="blue",offset=.5,cex=1,xlim=c(-0.7,0.7))


The results are shown centered around zero. This means that, overall, we don’t expect to see any difference in hiring rates, if the numbers are truly random. But they are not all exactly zero. We do expect to see some random variation, just as we would not expect a thousand coin flips to be exactly 500 heads and 500 tails.

The results shown here are not identical to the book but they are roughly the same. They show differences in hiring rates for men versus women from the simulation. The difference in the actual study was 0.292. We can see that in only 3 of the 100 simulations did we see a difference that was 0.292 or higher, and in only one case did we see a difference that was -0.292 or lower (which would mean that the proportion of women hired was higher than men).

Chapter 2: Probability

2.1 Defining Probability

2.1.1 Probability

We often frame probability in terms of a random process giving rise to an outcome.

Roll a die: 1, 2, 3, 4, 5, or 6
Flip a coin: heads or tails

The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times.

The tendency of \(\hat{p}\) to stabilize around p is described by the Law of Large Numbers:

As more observations are collected, the proportion \(\hat{p}\) of occurrences with a particular outcome converges to the probability p of that outcome.
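
A quick simulation sketch of this (my own example; the exact values change from run to run, so no output is shown):

flips <- rbinom(10000, size = 1, prob = 0.5) # 10,000 fair coin flips
n <- c(10, 100, 1000, 10000)
cumsum(flips)[n] / n # running proportion of heads; settles near 0.5 as n grows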

2.1.2 Disjoint or mutually exclusive outcomes

Two outcomes are called disjoint or mutually exclusive if they cannot both happen.

Addition Rule of disjoint outcomes

\[P(A_1 \: or \: A_2) = P(A_1) + P(A_2)\]

The addition rule applies both to disjoint outcomes and to disjoint events (sets of outcomes). An example of an outcome is rolling a 1 on a single die roll; an example of an event is rolling either a 1 or a 2.

2.1.3 Probabilities when events are not disjoint

General Addition Rule

If A and B are any two events, disjoint or not, then the probability that at least one of them will occur is: \[P(A \: or \: B) = P(A) + P(B) - P(A \: and \: B)\]
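
A quick worked example of my own, using a standard deck of cards: the probability a card is a diamond or a face card.

p_diamond <- 13/52 # P(diamond)
p_face <- 12/52 # P(face card)
p_both <- 3/52 # P(diamond face card)
p_diamond + p_face - p_both # P(diamond or face card)
## [1] 0.4230769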

2.1.4 Probability distributions

A probability distribution is a table of all disjoint outcomes and their associated probabilities.

Dice sum      2     3     4     5     6     7     8     9     10    11    12
Probability   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Rules for probability distributions

  1. The outcomes listed must be disjoint
  2. Each probability must be between 0 and 1
  3. The probabilities must total 1

2.1.5 Complement of an event

Rolling a die produces a value in the set {1, 2, 3, 4, 5, 6}. This set of all possible outcomes is called the sample space (S) for rolling a die.

The complement of an event A, written \(A^c\), consists of all outcomes not in A. An event and its complement satisfy: \[P(A) + P(A^c)=1\]

# 1 in 6 chance of rolling a 1
# 5 in 6 chance of rolling a 2, 3, 4, 5, or 6
(1/6) + (5/6)
## [1] 1

2.1.6 Independence

Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other.

Multiplication Rule for independent processes

If A and B represent events from two different and independent processes, then the probability that both A and B occur can be calculated as the product of their separate probabilities:

\[P(A \: and \: B) = P(A) \times P(B)\]

# roll a die and flip a coin
# 1 in 6 chance of rolling a 1, 1 in 2 chance of getting H
(1/6) * (1/2)
## [1] 0.08333

2.2 Conditional Probability

2.2.1 Marginal and joint probabilities

A probability table:

                parents used   parents not   Total
student: uses   0.28           0.21          0.49
student: not    0.19           0.32          0.51
Total           0.47           0.53          1.00

Marginal probabilities are those listed in the bottom row or the far-right column: totals for parents only, or for students only (a single variable).

Joint probabilities are outcomes for two or more variables (i.e., everything else).

2.2.2 Defining conditional probability

A conditional probability is defined under a condition. For example, the probability that it will rain today, given that there is thunder in the distance. There are two parts to a conditional probability, the outcome of interest (whether it will rain) and the condition (thunder).

The conditional probability of the outcome of interest A given condition B is:

\[P(A|B) = \frac{P(A \: and \: B)}{P(B)}\]
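
Using the probability table above, a quick check: the probability that the student uses, given that the parents used.

p_both <- 0.28 # P(student uses and parents used), from the table above
p_parents <- 0.47 # P(parents used)
p_both / p_parents # P(student uses | parents used)
## [1] 0.5957447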

2.2.3 Smallpox in Boston 1721

Extended example not covered here.

2.2.4 General Multiplication Rule

If A and B represent two outcomes or events, then \[P(A \: and \: B) = P(A|B) \times P(B)\]

Sum of Conditional Probabilities

Let \(A_1, A_2...A_k\) represent all the disjoint outcomes for a variable or process. Then if B is an event, possibly for another variable or process, we have: \[P(A_1|B) + \dots + P(A_k|B)=1\]

The rule for complements also holds when an event and its complement are conditioned on the same information: \[P(A|B) = 1 - P(A^c | B)\]

2.2.5 Independence considerations in conditional probability

If two processes are independent, then knowing the outcome of one should provide no information about the other.

2.2.6 Tree diagrams

Intro to tree diagrams, see text for examples.

2.2.7 Bayes’ Theorem

Consider the following conditional probability for variable 1 and variable 2 \[P(outcome \: A_1 \: of \: variable \: 1 | outcome \: B \: of \: variable \: 2)\]

Bayes’ theorem states that this conditional probability can be identified as the following fraction: \[\frac{P(B|A_1)P(A_1)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \dots + P(B|A_k)P(A_k)}\]

# Example of data passed in
# For each element: p1 is P(A_i), p2 is P(B | A_i) for the first option of B
bayes_data <- list(Academic=list(p1=0.35, p2=0.25), Sporting=list(p1=0.2, p2=0.7), None=list(p1=0.45, p2=0.05))
# This function computes P(A_a | B) via Bayes' theorem.
# It allows for 2 or more choices for A, but only 2 for B; it would need refactoring to increase that.
# a: index (into data) of the A outcome of interest; b: 1 for the first option of B, anything else for its complement
bayes <- function(data, a, b) {
  p2 = ifelse(b==1, data[[a]][[2]], 1-data[[a]][[2]])
  p1 = data[[a]][[1]]
  numerator = p1 * p2
  denominator = 0
  for(i in 1:length(data)) {
    p2 = ifelse(b==1, data[[i]][[2]], 1-data[[i]][[2]])
    
    denominator = denominator + data[[i]][[1]] * p2
  }
  return(numerator/denominator)
}

# posterior probability of an academic event (element 1), given that the lot is full
bayes(bayes_data, 1, 1)
## [1] 0.35
# posterior probability of no event (element 3), given that the lot is full
bayes(bayes_data, 3, 1)
## [1] 0.09

2.3 Sampling from a small population

Several example questions here, no new definition.

2.4 Random Variables

2.4.1 Expectation

Random variable

A random process or variable with a numeric outcome.

Expected Value of a Discrete Random Variable

If X takes outcomes \(x_1, \dots, x_k\) with probabilities \(P(X = x_1), \dots, P(X = x_k)\), the expected value of X is the sum of each outcome multiplied by its corresponding probability. \[E(X) = x_1 \times P(X = x_1) + \dots + x_k \times P(X = x_k) = \sum_{i=1}^k x_i P(X = x_i)\]
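
A small example of my own: the expected value of a single roll of a fair die.

x <- 1:6 # possible outcomes
p <- rep(1/6, 6) # each outcome equally likely
sum(x * p) # expected value E(X)
## [1] 3.5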

2.4.2 Variability in random variables

General variance formula

If X takes outcomes \(x_1, x_2,\dots x_k\) with probabilities \(P(X=x_1), P(X=x_2),\dots P(X=x_k)\) and expected value \(\mu = E(X)\), then the variance of X, denoted by Var(X) or the symbol \(\sigma^2\), is:

\[\sigma^2 = (x_1 - \mu)^2 \times P(X=x_1) + (x_2 - \mu)^2 \times P(X=x_2) + \dots + (x_k - \mu)^2 \times P(X=x_k)\]
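
Continuing the fair-die example from above:

x <- 1:6
p <- rep(1/6, 6)
mu <- sum(x * p)
sum((x - mu)^2 * p) # variance of one die roll
## [1] 2.916667
sqrt(sum((x - mu)^2 * p)) # standard deviation
## [1] 1.707825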

2.4.3 Linear combinations of random variables

A linear combination of two random variables X and Y is a fancy phrase to describe a combination

\[aX + bY\]

(See text for examples.)

2.4.4 Variability in linear combinations of random variables

The variance of a linear combination of random variables may be computed by squaring the constants, substituting in the variances for the random variables, and computing the result:

\[Var(aX + bY) = a^2 \times Var(X) + b^2 \times Var(Y)\]

This equation is valid as long as the random variables are independent of each other. The standard deviation of the linear combination may be found by taking the square root of the variance.
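
A numeric sketch with made-up variances (assuming X and Y are independent):

a <- 3; b <- -2
var_x <- 4; var_y <- 9
a^2 * var_x + b^2 * var_y # Var(3X - 2Y)
## [1] 72
sqrt(a^2 * var_x + b^2 * var_y) # standard deviation of 3X - 2Y
## [1] 8.485281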

2.5 Continuous distributions

2.5.1 From histograms to continuous distributions

As more and more bars are added to a histogram, it takes on the appearance of a smooth curve. This smooth curve represents the probability density function (also called a density or a distribution).

2.5.2 Probabilities from continuous distributions

The key point here is that the area under a shaded portion of a probability distribution is equal to the probability of the values in that range.

P-value

(This section is obviously out of place. Leaving it here for the moment, pending deciding on where it should go.)

pvalue <- function(pe, nullval, sepe, onesided=FALSE) {
  # returns the p-value: the probability of observing a point estimate at least
  # as far from nullval as pe is, assuming the null hypothesis is true
  # nullval is the null value being tested
  # pe is the point estimate
  # sepe is the standard error for the point estimate
  # onesided=FALSE means a two-sided test, TRUE means one-sided
  x = 1 - pnorm(abs(pe - nullval)/sepe)
  if(onesided==F) {
    x <- x * 2
  }
  x
}

Chapter 3: Distributions of random variables

3.1 Normal distribution

Plot a normal distribution

x <- seq(-4, 4, length=100)
hx <- dnorm(x)
plot(x, hx, type="l", lty=1, xlab="x value", ylab="Density", main="Normal Distribution")
abline(h=0)


3.1.2 Z score

The Z score for an observation, when the population mean and standard deviation are known, is:

\[Z = \frac{x - \mu}{\sigma}\]

mu_sat <- 1500
sd_sat <- 300
x_ann <- 1800
Z_ann <- (x_ann - mu_sat) / sd_sat
Z_ann
## [1] 1

3.1.3 Normal probability table

To get a percentile for a given Z score from the normal probability table, use pnorm:

pnorm(1.96) # returns 0.975
## [1] 0.975

To get a Z score for a given percentile from the normal probability table, use qnorm:

qnorm(0.975) # returns 1.96
## [1] 1.96

3.2 Evaluating the normal approximation

3.2.1 Normal probability plot

The normal probability plot shows how close a distribution is to being nearly normal.

If most points are near the low end, with fewer scattered along the high end, this indicates a skew to the high end (or right). If most points are near the high end, with fewer scattered along the low end, this indicates left skew.

eruption.lm <- lm(eruptions ~ waiting, data=faithful)
eruption.stdres <- rstandard(eruption.lm)
qqnorm(eruption.stdres,
       ylab="Standard Residuals",
       xlab="Normal Scores",
       main="Old Faithful")
qqline(eruption.stdres)


# The following prints the non-standardized residuals
# Either method will do, to check for skew
qqnorm(eruption.lm$residuals)
qqline(eruption.lm$residuals)


3.3 Geometric distribution

3.3.1 Bernoulli distribution

A Bernoulli random variable has exactly two possible outcomes. For such variables:

\[\mu = p\]

where p is the probability of success on any given trial, and:

\[\sigma = \sqrt{p(1 - p)}\]
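
For example, with a made-up success probability of p = 0.3:

p <- 0.3
mu <- p # mean of a Bernoulli variable
sigma <- sqrt(p * (1 - p)) # standard deviation
c(mu, sigma)
## [1] 0.3000000 0.4582576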

3.3.2 Geometric distribution

A geometric distribution gives the probability of the first success on the nth trial. The probability is:

\[(1 - p)^{n-1}p\]

Where n is the number of trials, and p is the probability of success on any trial.

Also:

\[\mu = \frac{1}{p}\] \[\sigma^2 = \frac{1 - p}{p^2}\] \[\sigma = \sqrt{\frac{1 - p}{p^2}}\]
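
A quick check in R with made-up numbers: the probability that the first success comes on the 5th trial when p = 0.3. Note that dgeom counts the number of failures before the first success, not the trial number.

p <- 0.3
n <- 5
(1 - p)^(n - 1) * p # manual calculation
## [1] 0.07203
dgeom(x = n - 1, prob = p) # x = number of failures before the first success
## [1] 0.07203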

3.4 Binomial Distribution

The binomial distribution describes the probability of having exactly k successes in n independent Bernoulli trials with a probability of success p.

The probability of any one particular sequence of k successes and \(n-k\) failures is:

\[p^k(1 - p)^{n-k}\]

The number of ways to choose k successes is:

\[\binom{n}{k} = \frac{n!}{k!(n - k)!}\]

Combining these:

\[\binom{n}{k} p^k(1 - p)^{n-k} = \frac{n!}{k!(n - k)!} p^k(1 - p)^{n-k}\]
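
A quick check of the formula in R (my own numbers), which matches the dbinom call shown below:

n <- 10; k <- 3; p <- 0.5
choose(n, k) * p^k * (1 - p)^(n - k) # P(exactly 3 successes in 10 trials)
## [1] 0.1171875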

Conditions for a binomial distribution:

  1. The trials are independent.
  2. The number of trials is fixed.
  3. Each trial has an equal probability of success.
  4. Each trial has an outcome of success or failure.

In R, dbinom is the density function for the binomial distribution:

dbinom(x = 3, size = 10, prob = 0.5) # probability of 3 heads in 10 coin flips
## [1] 0.1172
dbinom(x = 0:3, size=10, prob = 0.5) # probabilities of 0, 1, 2 or 3 heads in 10 coin flips
## [1] 0.0009766 0.0097656 0.0439453 0.1171875
sum(dbinom(x = 0:3, size=10, prob = 0.5)) # same, but add the probabilities together
## [1] 0.1719

pbinom is the distribution function for the binomial distribution:

pbinom(q = 3, size = 10, prob = 0.5) # probability of 0 to 3 heads in 10 coins flips (cumulative)
## [1] 0.1719
# Note that this equals the value for the last dbinom example

NOTE: I just started playing around with qbinom, my understanding of it may or may not be correct.

qbinom is the quantile function for the binomial distribution:

qbinom(p = .3, size = 10, prob = 0.5) # at least a 30% probability of 4 or fewer heads in 10 flips
## [1] 4
qbinom(p = .35, size = 10, prob = 0.5) # also at least a 35% probability of 4 or fewer heads in 10 flips
## [1] 4
qbinom(p = .4, size = 10, prob = 0.5) # at least a 40% probability 5 or fewer heads in 10 flips
## [1] 5
qbinom(p = .9, size = 10, prob = 0.5) # at least a 90% probability of 7 or fewer heads in 10 flips
## [1] 7
dbinom(x = 7, size = 10, prob = 0.5) # demonstrates that there is about 11% probability of 7 heads in 10 flips
## [1] 0.1172
qbinom(p = 0.3, size = 10, prob = 0.5, lower.tail = FALSE) # at least a 30% probability of 6 failures in 10 flips
## [1] 6

rbinom is a random generator for the binomial distribution:

rbinom(n = 3, size = 10, prob = 0.5) # flip coin 10 times and record heads, 3 times (values change each time)
## [1] 7 8 4

3.4.2 Normal approximation to the binomial distribution

The binomial formula is cumbersome when the sample size (n) is large, particularly when we consider a range of observations. In some cases we may use the normal distribution as an easier and faster way to estimate binomial probabilities.

The binomial distribution with probability of success p is nearly normal when the sample size n is sufficiently large that np and n(1 - p) are both at least 10. The approximate normal distribution has parameters corresponding to the mean and standard deviation of the binomial distribution:

\[\mu = np\]

\[\sigma = \sqrt{np(1 - p)}\]

The normal approximation may be used when computing the range of many possible successes.

3.4.3 The normal approximation breaks down on small intervals

Enough said.

Improving the accuracy of the normal approximation to the binomial distribution

The normal approximation to the binomial distribution for intervals of values is usually improved if cutoff values are modified slightly. The cutoff values for the lower end of a shaded region should be reduced by 0.5, and the cutoff value for the upper end should be increased by 0.5.

NOTE: Not entirely sure what this means.
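
Here is a rough sketch of what the correction seems to do, with my own numbers (not the book's example); the values in the comments are rounded.

# P(X <= 60) for X ~ Binomial(n = 100, p = 0.5)
n <- 100; p <- 0.5
mu <- n * p # 50
sigma <- sqrt(n * p * (1 - p)) # 5
pbinom(60, size = n, prob = p) # exact: about 0.982
pnorm(60, mean = mu, sd = sigma) # plain normal approximation: about 0.977
pnorm(60.5, mean = mu, sd = sigma) # with the +0.5 correction on the upper cutoff: about 0.982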

My note: Use pbinom instead of normal approximation!

Just use the pbinom function in R, then you don’t have to approximate. The approximation is a dinosaur from pre-computer days, as far as I’m concerned.

pbinom(3, size=10, prob=0.5) # probability of 3 or fewer heads in 10 flips
## [1] 0.1719

3.5 More discrete distributions

3.5.1 Negative binomial distribution

The negative binomial distribution describes the probability of observing the kth success on the nth trial. The checks for a binomial distribution must be made.

\[\text{P(the $k^{th}$ success on the $n^{th}$ trial)} = \binom{n-1}{k-1} p^k(1 - p)^{n - k}\]

# manual calculation
n <- 10
k <- 3
p <- 0.5
choose(n-1, k-1) * p^k * (1-p)^(n-k)
## [1] 0.03516
# dnbinom function
# note that x is the number of failures before the target number of successes occurs
dnbinom(x=n-k, size=k, prob = p)
## [1] 0.03516

3.5.2 Poisson distribution

A random variable may follow a Poisson distribution if the event being considered is rare, the population is large, and the events occur independently of each other.

Suppose we are watching for rare events and the number of observed events follows a Poisson distribution with rate \(\lambda\). Then:

\[\text{P(observe k rare events)} = \frac{\lambda^{k}e^{-\lambda}}{k!}\]

Where \(\lambda\) is equal to the mean and \(\sqrt{\lambda}\) is the standard deviation.

Example: A diner gets an average of 75 customers per hour. What is the probability of seeing exactly 70 customers in a given hour?

# manual:
lambda <- 75
k <- 70
(lambda^k * exp(-lambda))/factorial(k)
## [1] 0.04016
# dpois
dpois(x=70, lambda=75)
## [1] 0.04016

Note that the above example might be suspicious. Are the events independent? I.e., is there some external factor such as a parade that causes traffic to drop or increase?

Chapter 4: Foundations for inference

4.1 Variability in estimates

4.1.1 Point estimates

We can estimate a population mean using the sample mean. A population mean is an example of a population parameter. A sample mean is an example of a point estimate. Point estimates vary from one sample to the next, and these form a sampling distribution. The sampling distribution represents the distribution of the point estimates based on samples of a fixed size from a certain population.

The standard deviation associated with an estimate is called the standard error.

Standard error for the sample mean:

\[SE = \frac{\sigma}{\sqrt{n}}\]

We can use the point estimate of the standard deviation from a sample in place of the population standard deviation when the sample size is at least 30 and the population distribution is not strongly skewed.
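
A tiny example with made-up numbers:

sigma <- 8 # standard deviation (population, or else sample)
n <- 64 # sample size
sigma / sqrt(n) # standard error of the sample mean
## [1] 1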

4.2 Confidence intervals

4.2.1 Capturing the population parameter

A plausible range of values for the population parameter is called a confidence interval.

4.2.2 An approximate 95% confidence interval

A confidence interval of roughly 95% is based on two standard errors, which are based on the standard deviation (note that precisely two standard deviations gives a confidence interval that is slightly less than 95%, hence the “roughly”).

Thus a confidence interval of roughly 95% is:

\[\text{point estimate} \pm 2 \times SE\]

The phrase “95% confident” means that, given a sample, we are 95% confident that the population mean falls within the confidence interval based on the sample.

Example 4.10:

mu <- 95.61
SE <- 1.58
num_of_std_errs <- 2
conf_interval <- c(mu - num_of_std_errs * SE, mu + num_of_std_errs * SE)
conf_interval
## [1] 92.45 98.77

NOTE: This example in the book uses exactly 2 standard errors, which corresponds to roughly a 95% confidence interval. Later, the book often uses a value of 1.96, which corresponds to a closer approximation of 95% confidence. 1.96 is not precisely equivalent to two standard errors, due to rounding, but when 1.96 is used, the idea is that you want a 95% confidence interval, rather than an interval based on exactly 2 standard errors.

If this seems confusing, it is similar to saying, on the one hand, “two meters is roughly 6 feet” (which is like saying that “two standard errors correspond to roughly a 95% confidence interval”) and “pi is roughly 3.14159” (which is like saying that “1.96 standard errors correspond to roughly a 95% confidence interval”). The first is a rough approximation based on convenience; the difference of the second is due only to rounding.

4.2.3 A sampling distribution for the mean

Central Limit Theorem, informal description: If a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model.

4.2.4 Changing the confidence level

This section first shows the number 1.96 as corresponding to “roughly a 95%” confidence interval. The value 1.96 can be obtained from the normal probability table by looking up 0.9750, which represents 97.5%. This is equal to 95% + (1-95%)/2. That is because we want the allowable 5% of error to be split equally between the two tails of the distribution, so the upper cutoff sits at the 97.5th percentile.

The value 1.96 (or slightly different, due to rounding) can also be obtained from R:

qnorm(0.9750)
## [1] 1.96

So what if we want a confidence level of 99%? First, determine the percentile needed: \(0.99 + (1 - 0.99)/2 = 0.995\). Now look this value up in the normal probability table, or else use R:

qnorm(0.99 + (1-0.99)/2)
## [1] 2.576

The value cited by the book for a 99% confidence level is 2.58.

Conditions for \(\bar{x}\) being nearly normal and SE being accurate

  • The sample observations are independent. Observations in a simple random sample are independent if they consist of less than 10% of the population.
  • The sample size is large: \(n \ge 30\) is a good rule of thumb
  • The distribution of sample observations is not strongly skewed

Additionally, the larger the sample size, the more skew we can tolerate in the sample.

NOTE: So far, anyway, there is no hard calculation to determine how much skew we can tolerate, based on the sample. This must rely on more sophisticated methods we have not seen.

Confidence interval for any confidence level

If the point estimate follows the normal model with standard error SE, then a confidence interval for the population parameter is:

\[ \text{point estimate} \pm z^{*} SE\]

Where \(z^{*}\) corresponds to the confidence level selected (i.e., this is the value 1.96 for a 95% confidence level).

Exercise 4.17

conf_level <- 0.9
Zstar <- qnorm(conf_level + (1 - conf_level)/2)
point_estimate <- 35.05
sd <- 8.97
n <- 100
SE <- sd / sqrt(n)
conf_interval <- c(point_estimate - Zstar * SE, point_estimate + Zstar * SE)
conf_interval
## [1] 33.57 36.53

Alternative conditions for applying the normal distribution model to the sample mean

If the population of cases is known to be nearly normal and the population standard deviation \(\sigma\) is known, then the sample mean \(\bar{x}\) will follow a nearly normal distribution \(N(\mu, \sigma/\sqrt{n})\) if the sample observations are also independent.

Relaxing the nearly normal condition

As sample size becomes larger, it is reasonable to slowly relax the nearly normal assumption on the data when dealing with small samples. By the time the sample size reaches 30, the data must show strong skew for us to be concerned about the normality of the sampling distribution.

4.3 Hypothesis testing

4.3.1 Hypothesis testing framework

Null and alternative hypotheses

The null hypothesis (\(H_0\)) often represents either a skeptical perspective or a claim to be tested. The alternative hypothesis (\(H_A\)) represents an alternative claim under consideration and is often represented by a range of possible parameter values.

The skeptic will not reject the null hypothesis unless the evidence in favor of the alternative hypothesis is so strong that she rejects \(H_0\) in favor of \(H_A\).

Even if we fail to reject the null hypothesis, we typically do not accept the null hypothesis as true. This is similar to a trial, where the verdict can be “not guilty”, but never “innocent.”

4.3.2 Testing hypotheses using confidence intervals

Example 4.21

Test the hypothesis that the average age of runners has not changed from 2006 to 2012, using a 95% confidence level.

\(H_0\): The average age of runners has not changed from 2006 to 2012, \(\mu_{age}=36.13\).
\(H_A\): The average age of runners has changed from 2006 to 2012, \(\mu_{age}\ne36.13\).

point_estimate <- 35.05
n <- 100
sd <- 8.97
SE <- sd/sqrt(n)
conf_level <- 0.95
Zstar <- qnorm(conf_level + (1-conf_level)/2)
conf_interval <- c(point_estimate - Zstar * SE, point_estimate + Zstar * SE)
conf_interval
## [1] 33.29 36.81

The confidence interval contains the value 36.13, so we do not reject the null hypothesis.

4.3.3 Decision errors

There are two types of errors that can be made in a hypothesis test.

          do not reject H0   reject H0 in favor of HA
H0 true   okay               Type 1 Error
HA true   Type 2 Error       okay

A type 1 error is rejecting the null hypothesis when it should not be rejected. This is “hanging an innocent null.”

A type 2 error is failing to reject the null hypothesis when it should be rejected. This is “letting a guilty null walk.”

Hypothesis tests are based on probabilities. It is possible that the point estimate obtained does not represent the population well, in which case a decision error may occur.

As a general rule of thumb, for those cases where the null hypothesis is actually true, we do not want to incorrectly reject \(H_0\) more than 5% of the time. This corresponds to a significance level of 0.05.

NOTE: This is just 1 minus the confidence level.

4.3.4 Formal testing using p-values

The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true.

My way of putting this is: Assuming that someone has virtually no chance of winning the lottery three days in a row, how skeptical do I become about the fairness of the lottery, if I see someone win three days in a row?

One-sided and two-sided hypothesis tests

A hypothesis test can be one-sided or two-sided. A one-sided test looks for only one of the tails in the distribution. The alternative hypothesis for a one-sided test will have a > or < symbol:

\(H_0: \mu = 7\) \(H_A: \mu > 7\)

A two-sided test looks for both of the tails in the distribution. It will use a \(\ne\) symbol in the alternative hypothesis:

\(H_0: \mu = 7\) \(H_A: \mu \ne 7\)

This is another way of saying that we are testing to see whether the population mean can be significantly greater than, or less than 7.

Always write the null hypothesis as an equality

p-value as a tool in hypothesis testing

The p-value quantifies how strongly the data favor \(H_A\) over \(H_0\). A small p-value (usually < 0.05) corresponds to sufficient evidence to reject \(H_0\) in favor of \(H_A\).

Example 4.34

The average sale price of the 52 Ebay auctions for Wii Mario Kart was $44.17 with a standard deviation of $4.15. Does this provide sufficient evidence to reject the null hypothesis in Exercise 4.32? Use a significance level of \(\alpha = 0.01\).

(This is an example of a one-sided test.)

\(H_0: \mu = 46.99\)
\(H_A: \mu < 46.99\)

mu <- 44.17
null_value <- 46.99
n <- 52
sd <- 4.15
alpha <- 0.01
SE <- sd / sqrt(n)
SE
## [1] 0.5755
Z <- abs(mu - null_value) / SE
Z
## [1] 4.9
pvalue <- 1 - pnorm(Z) # In a two-sided test, this value would be doubled
pvalue
## [1] 4.79e-07
pvalue < alpha # TRUE means reject null hypothesis
## [1] TRUE

Example 4.35

Earlier we talked about a research group investigating whether the students at their school slept longer than 7 hours each night. Let’s consider a second group of researchers who want to evaluate whether the students at their college differ from the norm of 7 hours. Write the null and alternative hypotheses for this investigation.

\(H_0: \text{students sleep 7 hours per night, } \mu = 7\)
\(H_A: \text{students do not sleep 7 hours per night, } \mu \ne 7\)

Caution: one-sided hypotheses are allowed only before seeing data

After observing data, it is tempting to turn a two-sided test into a one-sided test. Hypotheses must be set up before observing data.

4.3.6 Choosing a significance level

The significance level should reflect the consequence of errors. If a type 1 error is costly, then the significance level should be lower (which makes it harder to reject the null hypothesis). If a type 2 error is costly, then the significance level should be higher (which makes it easier to reject the null hypothesis).

Example: A manufacturer makes products that could cause serious injury to consumers. If the null hypothesis of a test is that the risk of injury is very low, and the alternative is that it is higher, then rejecting the null hypothesis when it is in fact correct (a type 1 error) could lead to serious consequences. The significance level should be lower.

4.4 Examining the Central Limit Theorem

The distribution of \(\bar{x}\) is approximately normal. The approximation can be poor if the sample size is small, but it improves with larger sample sizes.

Intuition: If you sample average heights of all Americans, and your sample size is 1 (one person), then the mean of that sample could be wildly variable. If you sample 1000 people, and calculate their average height, then (assuming the sample is random) the value should be quite close to the population parameter. There is almost no chance that the average will be 4 feet or 7 feet.

4.5 Inference for other estimators

The sample mean is not the only point estimate that can be used. For example, sample proportions closely resemble the normal distribution when the sample size is sufficiently large.

A point estimate is unbiased if the sampling distribution of the estimate is centered at the parameter it estimates.

4.5.1 Confidence intervals for nearly normal point estimates

A confidence interval based on an unbiased and nearly normal point estimate is:

\[\text{point estimate } \pm z^* SE\]

4.5.2 Hypothesis testing for nearly normal point estimates

  1. First write the hypotheses in plain language, then set up the mathematical notation
  2. Identify an appropriate point estimate of the parameter of interest
  3. Verify conditions to ensure the standard error is reasonable and the point estimate is nearly normal and unbiased
  4. Compute the standard error. Draw a picture of the distribution of the estimate under the idea that \(H_0\) is true. Shade areas representing the p-value.
  5. Using the picture and normal model, compute the test statistic (Z score) and identify the p-value to evaluate the hypotheses.

4.5.3 Non-normal point estimates

We may apply the ideas of confidence intervals and hypothesis testing to cases where the point estimate or test statistic is not necessarily normal. There are cases when the point estimate is not nearly normal:

  • the sample size is too small for the normal approximation to be valid
  • the standard error estimate may be poor
  • the point estimate tends toward some distribution that is not the normal distribution

This section leaves me hanging; I suspect it is only hinting at larger topics to come.

4.5.4 When to retreat

Statistical tools are unreliable when the following are not true:

  • The individual observations must be independent. This means sample size < 10% of population. In experiments, subjects must be randomized into groups. If independence fails, more advanced techniques must be used, and in some cases, inference is impossible.
  • Other conditions focus on sample size and skew.

4.6 Sample size and power

Margin of error is calculated as:

\[ME = Z^* \times SE = Z^* \times \frac{\sigma}{\sqrt{n}}\]

So to get a sample size for a desired confidence level and margin of error:

\[n \ge (\frac{Z^*\sigma}{ME})^2\]

Example:

conf_level <- 0.95
Zstar <- qnorm(conf_level + (1-conf_level)/2)
me <- 4
sd <- 25
n <- ((Zstar * sd)/me)^2
ceiling(n)
## [1] 151

4.6.2 Power and the type 2 error rate

The power of a test is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. For a two-sided test, this must include both the left and right tails. (The book's example still needs to be added here; a rough sketch follows.)
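Here is a sketch with made-up numbers (not the book's example). For a two-sided Z test, the power is the probability that \(\bar{x}\) lands in either rejection region when the true mean equals some specific alternative value:

alpha <- 0.05
mu0 <- 0        # null value
mu_true <- 0.5  # a hypothetical true mean under H_A
sigma <- 2
n <- 100
SE <- sigma / sqrt(n)
z_crit <- qnorm(1 - alpha/2)
# Rejection regions: xbar > mu0 + z_crit*SE or xbar < mu0 - z_crit*SE.
# Under the alternative, xbar ~ N(mu_true, SE); add both tail areas.
power <- pnorm(mu0 + z_crit * SE, mean = mu_true, sd = SE, lower.tail = FALSE) +
         pnorm(mu0 - z_crit * SE, mean = mu_true, sd = SE)
power  # roughly 0.70 with these numbers

Base R's power.t.test does a t-based version of the same calculation, e.g. power.t.test(n = 100, delta = 0.5, sd = 2, sig.level = 0.05, type = "one.sample").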

4.6.3 Statistical significance versus practical significance

Sometimes, the difference between the point estimate and the null value is large enough to be statistically significant, but has no practical significance. For example, manufacturing tolerances might vary to a statistically significant degree, but if the variation has no impact on product quality, it doesn't matter.

Chapter 5: Inference for numerical data

5.1 Paired data

5.1.1 Paired observations and samples

Two sets of observations are paired if each observation in one set has a special correspondence or connection with exactly one observation in the other data set.

5.1.2 Inference for paired data

Hypothesis testing is exactly the same as before, but we apply it to the differences in the paired observations.

Example 5.2

\(H_0: \mu_{diff}=0\)
\(H_A: \mu_{diff}\ne0\)

s_diff <- 14.26
xbar_diff <- 12.76
nullval <- 0
n_diff <- 73
SE_xbar_diff <- s_diff/sqrt(n_diff)
SE_xbar_diff
## [1] 1.669
Z_diff <- (xbar_diff - nullval) / SE_xbar_diff
Z_diff
## [1] 7.645
pvalue <- 2 * (1 - pnorm(Z_diff))
pvalue
## [1] 2.087e-14

The p-value is so close to zero that it is far less than an assumed significance level of 0.05, so the null hypothesis is rejected.

5.2 Difference of two means

When the data is not paired, we can instead work with the difference of the means for the two sets, rather than the differences of individual paired data points.

5.2.1 Point estimates and standard errors for differences of means

|             | men   | women  |
|-------------|-------|--------|
| \(\bar{x}\) | 87.65 | 102.13 |
| s           | 12.5  | 15.2   |
| n           | 45    | 55     |

Conditions for normality of \(\bar{x}_1-\bar{x}_2\)

If the sample means \(\bar{x}_1\) and \(\bar{x}_2\) each meet the criteria for having a nearly normal sampling distribution and the observations in the two samples are independent, then the difference in sample means will have a sampling distribution that is nearly normal.

\(SE_{\bar{x}_w-\bar{x}_m} = \sqrt{\frac{\sigma^2_w}{n_w} + \frac{\sigma^2_m}{n_m}} \approx \sqrt{\frac{s^2_w}{n_w} + \frac{s^2_m}{n_m}}\)

mu_women <- 102.13
mu_men <- 87.65
s_women <- 15.2
n_women <- 55
s_men <-12.5
n_men <- 45
SE_diff <- sqrt((s_women^2/n_women) + (s_men^2/n_men))
SE_diff
## [1] 2.77

5.2.2 Confidence interval for the difference

The confidence interval is calculated the same way.

conf_level <- 0.95
mu_diff <- mu_women - mu_men
mu_diff
## [1] 14.48
Zstar <- qnorm(conf_level + (1-conf_level)/2)
conf_interval <- c(mu_diff - Zstar * SE_diff, mu_diff + Zstar * SE_diff)
conf_interval
## [1]  9.051 19.909

5.2.3 Hypothesis test based on a difference in means

|            | smoker | nonsmoker |
|------------|--------|-----------|
| mean       | 6.78   | 7.18      |
| st. dev.   | 1.43   | 1.60      |
| samp. size | 50     | 100       |

\(H_0: \mu_n - \mu_s = 0\)
\(H_A: \mu_n - \mu_s \ne 0\)

conf_level <- 0.95
mu_s <-6.78
mu_n <- 7.18
sd_s <- 1.43
sd_n <- 1.60
n_s <- 50
n_n <- 100
SE_diff <- sqrt((sd_s^2/n_s) + (sd_n^2/n_n))
mu_diff <- mu_n - mu_s
Z_diff <- abs(mu_diff - 0)/SE_diff
Z_diff
## [1] 1.551
pvalue <- 2 * (1 - pnorm(Z_diff))
pvalue
## [1] 0.1209
signif_level = 1 - conf_level
pvalue < signif_level # TRUE means reject null hypothesis
## [1] FALSE

We do not reject the null hypothesis.

5.2.4 Summary for inference of the difference of two means

When comparing unpaired data:

  • Each sample mean must meet conditions for normality
  • Samples must be collected independently (not paired)

5.2.5 Examining the standard error formula

Standard error of difference of two sample means:

\[SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{SE^2_{\bar{x}_1} + SE^2_{\bar{x}_2}} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}\]

5.3 One-sample means with the t distribution

The t distribution is helpful when sample sizes are small.

5.3.1 The normality condition

Central Limit Theorem for normal data

The sampling distribution of the mean is nearly normal when the sample observations are independent and come from a nearly normal distribution. This is true for any sample size.

Caution: Checking the normality condition

We should exercise caution when verifying the normality condition for small samples. It is important to not only examine the data but also think about where the data come from. For example, ask: would I expect this distribution to be symmetric, and am I confident that outliers are rare?

5.3.2 Introducing the t distribution

The t distribution has a bell shape, like a normal distribution, but it is not normal. It has one parameter, degrees of freedom. The degrees of freedom (df) describe the precise form of the bell-shaped t-distribution. A low number of df gives the distribution a lower peak and thicker tails; this corresponds to higher error rates. I.e., if a normal distribution covers 99.7% of the curve in three standard deviations, a t distribution with a low number of degrees of freedom will cover significantly less than 99.7% in three standard deviations.

Plot of the normal distribution (solid line) with t distributions (dashed lines) overlaid. The lowest dashed line is df=2, the next is df=6, and the highest (very similar to the normal) is df=20.

normdata <- seq(-4, 4, length=100)
# solid line: the standard normal density
plot(normdata, dnorm(normdata), type="l", lty=1, xlab="x value", ylab="Density",
     main="Normal Distribution", xlim=c(-4,4), ylim=c(0,0.45))
abline(h=0)
# dashed lines: t densities with increasing degrees of freedom
lines(normdata, dt(normdata, df=2), lty=2)
lines(normdata, dt(normdata, df=6), lty=2)
lines(normdata, dt(normdata, df=20), lty=2)


Side note: degrees of freedom

This is separate from the text. I once read a description of degrees of freedom. It went something like this. Imagine that there are three numbers that have a mean of 2. Can you tell me what the numbers are? No, they are all free to vary. Now I tell you the first number is a 1. What are the second two numbers? We can’t say, they are both still free to vary. Now I tell you the second number, it is 2. What is the third number? If the first two are 1 and 2, and we know the mean of all three is 2, then the third number must be 3. It is no longer free to vary. That is why the denominator of a sample standard deviation is n-1; it has n-1 degrees of freedom.

Excerpt from t table (leftmost column is df):

| one tail   | 0.100 | 0.050 | 0.025 | 0.010 | 0.005 |
|------------|-------|-------|-------|-------|-------|
| two tails  | 0.200 | 0.100 | 0.050 | 0.020 | 0.010 |
| 1          | 3.08  | 6.31  | 12.71 | 31.82 | 63.66 |
| 2          | 1.89  | 2.92  | 4.30  | 6.96  | 9.92  |
| 3          | 1.64  | 2.35  | 3.18  | 4.54  | 5.84  |
| …          |       |       |       |       |       |
| 17         | 1.33  | 1.74  | 2.11  | 2.57  | 2.90  |
| 18         | 1.33  | 1.73  | 2.10  | 2.55  | 2.88  |
| 19         | 1.33  | 1.73  | 2.09  | 2.54  | 2.86  |
| 20         | 1.33  | 1.72  | 2.09  | 2.53  | 2.85  |
| …          |       |       |       |       |       |
| 400        | 1.28  | 1.65  | 1.97  | 2.34  | 2.59  |
| 500        | 1.28  | 1.65  | 1.96  | 2.33  | 2.59  |
| \(\infty\) | 1.28  | 1.64  | 1.96  | 2.33  | 2.58  |

Example 5.15

What proportion of the t distribution with 18 degrees of freedom falls below -2.10?

Look at the row for df=18. Since we're looking only for values below -2.10 (as opposed to values both below -2.10 and above 2.10), we want the one-tail values. The third column contains the value 2.10, so we look at the "one tail" header row to get 0.025. About 2.5% of the distribution falls below -2.10 when there are 18 degrees of freedom.

In R:

T <- -2.10
df <- 18
pt(T, df=df)
## [1] 0.02505

Example 5.16

A t distribution with 20 degrees of freedom is shown in the left panel of Figure 5.14. Estimate the proportion of the distribution falling above 1.65.

We identify the row in the t table using the degrees of freedom: df = 20. Then we look for 1.65; it is not listed. It falls between the first and second columns. Since these values bound 1.65, their tail areas will bound the tail area corresponding to 1.65. We identify the one tail area of the first and second columns, 0.050 and 0.10, and we conclude that between 5% and 10% of the distribution is more than 1.65 standard deviations above the mean.

In R:

T <- 1.65
df <- 20
pt(T, df=df, lower.tail=FALSE)
## [1] 0.05728

Example 5.17

A t distribution with 2 degrees of freedom is shown in the right panel of Figure 5.14. Estimate the proportion of the distribution falling more than 3 units from the mean (above or below).
(NOTE: Illustration not shown. It is a two-tailed t distribution.)

As before, first identify the appropriate row: df = 2. Next, find the columns that capture 3; because 2.92 < 3 < 4.30, we use the second and third columns. Finally, we find bounds for the tail areas by looking at the two tail values: 0.05 and 0.10. We use the two tail values because we are looking for two (symmetric) tails.

In R:

T <- 3
df <- 2
2 * pt(T, df=df, lower.tail=F) # multiply by 2 because this is two-tailed 
## [1] 0.09547

5.3.3 The t distribution as a solution to the standard error problem

When to use the t distribution

Use the t distribution for inference of the sample mean when observations are independent and nearly normal. You may relax the nearly normal condition as sample size increases. For example, the data distribution may be moderately skewed when the sample size is at least 30.

(I do not have a clear idea of when to use the t distribution when the sample size is above 30. When that is the case, the t distribution is very similar to the normal distribution. I find this entire section very confusing; the primary reason to use the t distribution is when the sample size is small. But if it is small, then we cannot relax the nearly-normal requirement! So what is the point of the above paragraph?)

To proceed with t distribution for inference about a single mean, check two conditions:

  • Independence of observations: We verify this condition as before. Either we have a simple random sample of less than 10% of the population, or, if it was an experiment or a random process, we check to the best of our ability that the observations were independent.
  • Observations come from a nearly normal distribution. This is difficult to verify with small data sets.

5.3.4 One sample t confidence intervals

Degrees of freedom for a single sample

If the sample has n observations and we are examining a single mean, then we use the t distribution with df = n-1 degrees of freedom.

Finding a t confidence interval for the mean

Based on a sample of n independent and nearly normal observations, a confidence interval for the population mean is

\[\bar{x} \pm t^*_{df} \times SE\] where \(\bar{x}\) is the sample mean, \(t^*_{df}\) corresponds to the confidence level and degrees of freedom, and SE is the standard error as estimated by the sample.

Example 5.20

The FDA’s webpage provides some data on mercury content of fish. Based on a sample of 15 croaker white fish (Pacific), a sample mean and standard deviation were computed as 0.287 and 0.069 ppm (parts per million), respectively. The 15 observations ranged from 0.18 to 0.41 ppm. We will assume these observations are independent. Based on the summary statistics of the data, do you have any objections to the normality condition of the individual observations?

There are no obvious outliers; all observations are within 2 standard deviations of the mean. If there is skew, it is not evident. There are no red flags for the normal model based on this (limited) information, and we do not have reason to believe the mercury content is not nearly normal in this type of fish.

Example 5.21

Estimate the standard error of \(\bar{x}\) = 0.287 ppm using the data summaries in Exercise 5.20. If we are to use the t distribution to create a 90% confidence interval for the actual mean of the mercury content, identify the degrees of freedom we should use and also find \(t^*_{df}\).

conf_level <- 0.9
alpha <- 1 - conf_level
mu <- 0.287
sd <- 0.069
n <- 15
SE <- sd / sqrt(n)
df <- n - 1
df
## [1] 14
tstar <- abs(qt(alpha/2, df=n-1)) # two-tailed, so divide alpha by 2 (half on each side)
tstar
## [1] 1.761

Example 5.22

Using the results of Exercise 5.20 and Example 5.21, compute a 90% confidence interval for the average mercury content of croaker white fish (Pacific).

ci <- c(mu - tstar * SE, mu + tstar * SE)
ci
## [1] 0.2556 0.3184

5.3.5 One sample t tests

Example 5.23

\(H_0: \mu_{diff} = 100\)
\(H_A: \mu_{diff} > 100\)

T score:

mu <- 135.9
nullval <- 100
sd <- 82.2
n <- 30
SE <- sd / sqrt(n)
T <- (mu - nullval) / SE
T
## [1] 2.392

Exercise 5.25

Use the t table in Appendix B.2 on page 410 to identify the p-value. What do you conclude?

pvalue <- 1 - pt(T, df=n-1)
pvalue
## [1] 0.01173

The p-value is less than 0.05, so reject the null hypothesis.

If you have all of the data in the sample, R has a function called t.test. The following has a null hypothesis that the mean is 19, and the alternative hypothesis is that it is not equal to 19:

data <- c(12, 22, 13, 14, 17, 19, 21, 20, 38, 16, 20)
# This tests to see whether the estimated population mean is 19
t.test(data, alternative="two.sided", mu = 19)
## 
##  One Sample t-test
## 
## data:  data
## t = 0.1282, df = 10, p-value = 0.9006
## alternative hypothesis: true mean is not equal to 19
## 95 percent confidence interval:
##  14.53 24.01
## sample estimates:
## mean of x 
##     19.27

The above shows that the estimated mean is 19.27. Our null value, 19, is well within the confidence interval of (14.53, 24.01). The p-value is quite large. We do not reject the null hypothesis.

We can adjust the confidence level, and make the test one-sided:

data <- c(12, 22, 13, 14, 17, 19, 21, 20, 38, 16, 20)
# This tests to see whether the estimated population mean is 15
t.test(data, alternative="greater", mu = 15, conf.level=0.9)
## 
##  One Sample t-test
## 
## data:  data
## t = 2.008, df = 10, p-value = 0.03622
## alternative hypothesis: true mean is greater than 15
## 90 percent confidence interval:
##  16.35   Inf
## sample estimates:
## mean of x 
##     19.27

This time, the alternative hypothesis was that the population mean is greater than 15. The p-value (about 0.036) is below 0.05, so we reject the null hypothesis.

The t.test can also do a paired data test, type ?t.test in R for details.

5.4 The t distribution for the difference of two means

5.4.1 Sampling distributions for the difference in two means

We can use \(\bar{x}_1 - \bar{x}_2\) as a point estimate of \(\mu_1 - \mu_2\).

Using the t distribution for a difference in means

The t distribution can be used for inference when working with the standardized difference of two means if (1) each sample meets the conditions for using the t distribution and (2) the samples are independent.

In that case, the standard error for the difference of two means is

\[SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{SE^2_{\bar{x}_1} + SE^2_{\bar{x}_2}} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}\]

5.4.2 Two sample t test

Summary stats for each exam version are shown in table below. The teacher would like to evaluate whether this difference is so large that it provides convincing evidence that version B was more difficult (on average) than version A.

| Version | \(n\) | \(\bar{x}\) | \(s\) | min | max |
|---------|-------|-------------|-------|-----|-----|
| A       | 30    | 79.4        | 14    | 45  | 100 |
| B       | 27    | 74.1        | 20    | 32  | 100 |

Exercises 5.28-5.30

Construct two-sided hypothesis test to evaluate whether observed difference in sample means, \(\bar{x}_A - \bar{x}_B = 5.3\), might be due to chance.

\(H_0: \mu_{diff} = 0\)
\(H_A: \mu_{diff} \ne 0\)
Where \(\mu_{diff} = \mu_A - \mu_B\)

Standard error:

mu_a <- 79.4
mu_b <- 74.1
n_a <- 30
n_b <- 27
sd_a <- 14
sd_b <- 20
SE <- sqrt((sd_a^2/n_a) + (sd_b^2/n_b))
SE
## [1] 4.62

Test statistic:

nullval <- 0
T <- ((mu_a - mu_b) - nullval)/SE
T
## [1] 1.147

p-value using the book-supplied value for df of 26 (which is the minimum of the degrees of freedom of the two samples):

df <- 26
pvalue = 2 * (1 - pt(T, df=df)) # two-tailed test, so multiply by 2
pvalue
## [1] 0.2618

The p-value is greater than 0.05; do not reject.

NOTE: For Exercise 5.30, the book says, "we could have used df=45.97." That value comes from the Welch-Satterthwaite approximation, computed below.

This computes the Welch-Satterthwaite degrees of freedom for the two samples (it reproduces the book's 45.97):

df_diff <- (sd_a^2/n_a + sd_b^2/n_b)^2 / ( (sd_a^2/n_a)^2/(n_a - 1) + (sd_b^2/n_b)^2/(n_b - 1) )
df_diff
## [1] 45.97

5.4.3 Two sample t confidence interval

Exercise 5.33

In Exercise 5.31, you found that the point estimate, \(\bar{x}_{esc} - \bar{x}_{control} = 7.88\), has a standard error of 1.95. Using df = 8, create a 99% confidence interval for the improvement due to ESCs.

point_est <- 7.88
SE <- 1.95
df <- 8
conf_level <- 0.99
alpha <- 1 - conf_level
tstar <- qt(conf_level + (alpha/2), df)
tstar
## [1] 3.355
conf_interval <- c(point_est - tstar * SE, point_est + tstar * SE)
conf_interval
## [1]  1.337 14.423

5.4.4 Pooled standard deviation estimate

Occasionally, two populations will have standard deviations that are so similar that they can be treated as identical. For example, historical data or a well-understood biological mechanism may justify this strong assumption. In such cases, we can make our t distribution approach slightly more precise by using a pooled standard deviation.

The pooled standard deviation of two groups is a way to use data from both samples to better estimate the standard deviation and standard error. If s1 and s2 are the standard deviations of groups 1 and 2 and there are good reasons to believe that the population standard deviations are equal, then we can obtain an improved estimate of the group variances by pooling their data:

\[s^2_{pooled} = \frac{s^2_1 \times (n_1 - 1) + s^2_2 \times (n_2 - 1)}{n_1 + n_2 - 2}\]

In this case, the degrees of freedom is calculated as:

\[df = n_1 + n_2 - 2\]
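A small sketch of the arithmetic (my own, borrowing the sample sizes and standard deviations from the exam-version example in 5.4.2 purely for illustration; that example is not a case where pooling is justified):

s1 <- 14; n1 <- 30
s2 <- 20; n2 <- 27
s_pooled_sq <- (s1^2 * (n1 - 1) + s2^2 * (n2 - 1)) / (n1 + n2 - 2)
sqrt(s_pooled_sq)                       # pooled standard deviation, about 17.1
n1 + n2 - 2                             # degrees of freedom, 55
sqrt(s_pooled_sq/n1 + s_pooled_sq/n2)   # standard error using the pooled variance, about 4.5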

Caution: Pooling standard deviations should only be done after careful research

A pooled standard deviation is only appropriate when background research indicates the population standard deviations are nearly equal. When the sample size is large and the condition can be adequately checked with data, the benefits of pooling the standard deviations greatly diminish.

For this course, I take this to mean that you can forget about this technique unless a problem specifically directs you to use it.

5.5 Comparing many means with ANOVA

In this section, we will learn a new method called analysis of variance (ANOVA) and a new test statistic called F. ANOVA uses a single hypothesis test to check whether the means across many groups are equal:

\(H_0\): The mean outcome is the same across all groups. In statistical notation, \(\mu_1 = \mu_2 = \dots = \mu_k\) where \(\mu_i\) represents the mean of the outcome for observations in category i.
\(H_A\): At least one mean is different.

Generally we must check three conditions on the data before performing ANOVA:

  • the observations are independent within and across groups,
  • the data within each group are nearly normal, and
  • the variability across the groups is about equal.

5.5.2 Analysis of variance (ANOVA) and the F test

The method of analysis of variance in this context focuses on answering one question: is the variability in the sample means so large that it seems unlikely to be from chance alone? This question is different from earlier testing procedures since we will simultaneously consider many groups, and evaluate whether their sample means differ more than we would expect from natural variation. We call this variability the mean square between groups (MSG) and it has an associated degrees of freedom, \(df_G = k − 1\) when there are k groups.

The MSG can be thought of as a scaled variance formula for means. If the null hypothesis is true, any variation in the sample means is due to chance and shouldn’t be too large. Details of MSG calculations are provided in footnote 29, however, we typically use software for these computations.

The mean square between the groups is, on its own, quite useless in a hypothesis test. We need a benchmark value for how much variability should be expected among the sample means if the null hypothesis is true. To this end, we compute a pooled variance estimate, often abbreviated as the mean square error (MSE), which has an associated degrees of freedom value \(df_E = n − k\). It is helpful to think of MSE as a measure of the variability within the groups.

The F statistic is:

\[F=\frac{MSG}{MSE}\]

We can use the F statistic to evaluate the hypotheses in what is called an F test. A p-value can be computed from the F statistic using an F distribution, which has two associated parameters: \(df_1\) and \(df_2\). For the F statistic in ANOVA, \(df_1\) = \(df_G\) and \(df_2 = df_E\).

Exercise 5.40

For the baseball data, MSG = 0.00252 and MSE = 0.00127. Identify the degrees of freedom associated with MSG and MSE and verify the F statistic is approximately 1.994.

MSG = 0.00252
MSE = 0.00127
F = MSG/MSE
F
## [1] 1.984

The F statistic and the F test

Analysis of variance (ANOVA) is used to test whether the mean outcome differs across 2 or more groups. ANOVA uses a test statistic F , which represents a standardized ratio of variability in the sample means relative to the variability within the groups. If \(H_0\) is true and the model assumptions are satisfied, the statistic F follows an F distribution with parameters \(df_1 = k − 1\) and \(df_2 = n − k\). The upper tail of the F distribution is used to represent the p-value.

5.5.3 Reading an ANOVA table from software

It is common to use software to avoid the tedious process of calculating ANOVA.

Still need to figure more of this out. I did find this: See table on p. 242.

|           | Df  | Sum Sq | Mean Sq | F value | Pr(>F) |
|-----------|-----|--------|---------|---------|--------|
| position  | 3   | 0.0076 | 0.0025  | 1.9943  | 0.1147 |
| Residuals | 323 | 0.4080 | 0.0013  |         |        |

\(s_{pooled} = 0.036\) on df = 323

In this case:

F <- 1.9943
df1 <- 3
df2 <- 323
pf(F, df1=df1, df2=df2, lower.tail=FALSE)
## [1] 0.1147
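If the raw data are available, base R's aov function produces this kind of table directly. A minimal sketch on made-up data (not the book's baseball data):

set.seed(1)
fake <- data.frame(group = rep(c("A", "B", "C"), each = 20),
                   y     = c(rnorm(20, 10), rnorm(20, 11), rnorm(20, 10.5)))
fit <- aov(y ~ group, data = fake)
summary(fit)  # prints Df, Sum Sq, Mean Sq, F value, Pr(>F)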

Also: Apparently the inference function can take care of an ANOVA test. Need to get an example working here.

Here is an example of loading inference:

rm(list=ls(all=TRUE))
load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/inference.Rdata"))
load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/gss.Rdata"))

Here is an example of calling inference. It is from lab 4, and is for an ANOVA test.

inference(y = gss$wordsum, x = gss$class, est = "mean", method = "theoretical", type = "ht", alternative = "greater")
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## Summary statistics:
## n_LOWER = 41, mean_LOWER = 5.073, sd_LOWER = 2.24
## n_MIDDLE = 331, mean_MIDDLE = 6.761, sd_MIDDLE = 1.886
## n_UPPER = 16, mean_UPPER = 6.188, sd_UPPER = 2.344
## n_WORKING = 407, mean_WORKING = 5.749, sd_WORKING = 1.865
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
## 
## Response: y
##            Df Sum Sq Mean Sq F value  Pr(>F)
## x           3    237    78.9    21.7 1.6e-13
## Residuals 791   2870     3.6                
## 
## Pairwise tests: t tests with pooled SD 
##          LOWER MIDDLE  UPPER
## MIDDLE  0.0000     NA     NA
## UPPER   0.0475 0.2396     NA
## WORKING 0.0306 0.0000 0.3671


5.5.4 Graphical diagnostics for an ANOVA analysis

Three conditions to check:

  • Independence: simple random sample, n < 10% population
  • Approximately normal: especially important when the sample size is small
  • Constant variance: variance within groups is about equal from one group to the next

5.5.5 Multiple comparisons and controlling type 1 error rate

If we reject null hypothesis, we want to determine which groups had different means.

Multiple comparisons and the Bonferroni correction for α

The scenario of testing many pairs of groups is called multiple comparisons. The Bonferroni correction suggests that a more stringent significance level is more appropriate for these tests:

\[\alpha^* = \alpha/K\]

where K is the number of comparisons being considered (formally or informally). If there are k groups, then usually all possible pairs are compared and K = k(k−1)/2.

(Need to add examples here.)
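In the meantime, a quick sketch (mine, not the book's): with the k = 4 social classes from the inference example above, the Bonferroni-corrected significance level at \(\alpha = 0.05\) is

alpha <- 0.05
k <- 4                  # number of groups
K <- k * (k - 1) / 2    # number of pairwise comparisons
alpha / K               # corrected significance level, 0.05/6, about 0.0083

With raw data, pairwise.t.test(y, group, p.adjust.method = "bonferroni") adjusts the pairwise p-values instead, which amounts to the same comparison.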

Chapter 6: Inference for categorical data

6.1 Inference for a single proportion

6.1.1 Identifying when the sample proportion is nearly normal

A sample proportion can be described as a sample mean. If we represent each “success” as a 1 and each “failure” as a 0, then the sample proportion is the mean of these numerical outcomes:

\[\hat{p} = \frac{0 + 1 + 1 + \dots + 0}{976} = 0.44\]

(The above is based on an example in which 44% of American public approves of Supreme Court in a poll of 976 adults.)

Conditions for the sampling distribution of \(\hat{p}\) being nearly normal

  1. the sample observations are independent
  2. we expect to see at least 10 successes and 10 failures in our sample, i.e., \(np \ge 10\) and \(n(1 - p) \ge 10\). This is called the success-failure condition.

If these conditions are met, then the sampling distribution of \(\hat{p}\) is nearly normal with mean \(p\) and standard error:

\[SE_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}\]

Typically we do not know the true (population) proportion \(p\), so must substitute some value.

  • For confidence intervals, usually \(\hat{p}\) is used to check success-failure condition and standard error
  • For hypothesis testing, typically the null value (the proportion claimed in the null hypothesis) is used

Reminder on checking independence of observations

If the data come from a simple random sample and the sample size is less than 10% of the population, then the independence assumption is reasonable. If the data come from a random process, we must evaluate the independence condition more carefully.

6.1.2 Confidence intervals for a proportion

Exercise 6.2. Estimate the standard error of \(\hat{p}\) using equation 6.1. Because \(p\) is unknown and standard error is for a confidence interval, use \(\hat{p}\) in place of \(p\). (NOTE: This follows from previous example, with 976 respondents, though text doesn’t make this clear.)

p_hat <- 0.44
n <- 976
SE <- sqrt((p_hat * (1-p_hat))/n)
SE
## [1] 0.01589

Example 6.3. Construct a 95% confidence interval for \(p\), the proportion of Americans who approve of Supreme Court.

conf_level <- 0.95
alpha <- 1 - conf_level
point_estimate <- 0.44 # the proportion of Americans 
Z_star <- qnorm(conf_level + alpha/2)
# SE comes from exercise 6.2
conf_interval <- c(point_estimate - Z_star * SE, point_estimate + Z_star * SE)
conf_interval
## [1] 0.4089 0.4711

These essentially match the book's interval of (0.409, 0.471); the small differences come from the book rounding SE to 0.016.

6.1.3 Hypothesis testing for a proportion

Exercise 6.4. Deborah Toohey is running for Congress, and her campaign manager claims she has more than 50% support from the district’s electorate. Set up a one-sided hypothesis test to evaluate this claim.

\(H_0: p = 0.50\)
\(H_A: p > 0.50\)

Example 6.5. A newspaper collects a simple random sample of 500 likely voters in the district and estimates Toohey’s support to be 52%. Does this provide convincing evidence for the claim of Toohey’s manager at the 5% significance level?

point_estimate <- 0.52 # the sample proportion (Toohey's estimated support)
null_value <- 0.50 # the claimed proportion under the null hypothesis
n <- 500
SE <- sqrt(null_value * (1 - null_value) / n) # SE uses the null value in a hypothesis test
SE
## [1] 0.02236
Z <- abs(point_estimate - null_value) / SE
Z
## [1] 0.8944
p_value <- 1 - pnorm(Z)
p_value
## [1] 0.1855

Do not reject the null hypothesis.

Hypothesis test for a proportion

Set up hypotheses and verify the conditions using the null value, \(p_0\), to ensure \(\hat{p}\) is nearly normal under \(H_0\). If the conditions hold, construct the standard error, again using \(p_0\), and show the p-value in a drawing. Lastly, compute the p-value and evaluate the hypotheses.

Use the null value to ensure that both \(np\) and \(n(1 - p)\) are greater than or equal to 10.

6.1.4 Choosing a sample size when estimating a proportion

The formula for choosing a sample size for a proportion, given a margin of error and \(Z^*\) value:

\[n \ge (\frac{Z^*}{ME})^2 \times p(1 - p)\]

Example 6.6 Find the smallest sample size n so that the margin of error of the point estimate \(\hat{p}\) will be no larger than \(m = 0.04\) when using a 95% confidence interval.

No estimate is given for \(p\) so we use the “worst-case scenario” of \(\hat{p}=0.5\), which gives the largest possible sample size (everything else being equal).

conf_level <- 0.95
alpha <- 1 - conf_level
p <- 0.5
ME <- 0.04
Z_star <- qnorm(conf_level + alpha/2)
Z_star
## [1] 1.96
n <- ceiling((Z_star / ME)^2 * p * (1 - p))
n
## [1] 601

Exercise 6.7 Calculate three different sample sizes, for three different proportions. Confidence level is 90%. Margin of error is 2%. Proportions given in vector p below.

p = c(0.017, 0.062, 0.013)
n = rep(NA, 3)
me <- 0.02
conf_level <- 0.9
alpha <- 1 - conf_level
Z_star <- qnorm(conf_level + alpha/2)
n <- ceiling((Z_star / me)^2 * p * (1 - p))
n
## [1] 114 394  87

6.2 Difference of two proportions

NOTE: This is for unpaired data. This stands to reason; if we’re dealing with proportions, the data cannot be paired.

6.2.1 Sample distribution of the difference of two proportions

Conditions for the sampling distribution of \(\hat{p}_1 - \hat{p}_2\) to be normal

The difference \(\hat{p}_1 - \hat{p}_2\) tends to follow a normal model when

  • each proportion separately follows a normal model, and
  • the samples are independent

The standard error of the difference in sample proportions is

\[SE_{\hat{p}_1-\hat{p}_2} = \sqrt{SE^2_{\hat{p}_1} + SE^2_{\hat{p}_2}} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}\]

where \(p_1\) and \(p_2\) represent the population proportions, and \(n_1\) and \(n_2\) represent the sample sizes. I cannot say I understand this; generally I thought that we do not have the population proportion, only sample proportion.

6.2.2 Intervals and tests for \(p_1 - p_2\)

NOTE: The title of this section is "intervals and tests" but it discusses only confidence intervals. If I understand correctly, you use \(\hat{p}_1 - \hat{p}_2\) as the point estimate to calculate confidence intervals. When it comes to hypothesis tests of \(p_1 = p_2\) (see 6.2.3), you use the "pooled estimate" when computing the standard error.

In the setting of confidence intervals, the sample proportions are used to verify the success-failure condition and also compute standard error, just as was the case with a single proportion.

Exercise 6.12 Hypothesis test on two proportions

\(H_0: p_{diff} = 0.03\)
\(H_A: p_{diff} > 0.03\)

p <- c(0.958, 0.899)
n <- c(1000, 1000)
p_diff <- p[1] - p[2]
nullval <- 0.03
conf_level <- 0.95
alpha <- 1 - conf_level
SE <- sqrt((p[1]*(1-p[1]))/n[1] + (p[2]*(1-p[2]))/n[2])
SE
## [1] 0.01145
Z <- abs(p_diff - nullval) / SE
Z
## [1] 2.533
p_value <- 1 - pnorm(Z)
p_value
## [1] 0.005648

6.2.3 Hypothesis testing when \(H_0: p_1 = p_2\)

Pooled estimate of a proportion

When the null hypothesis is \(p_1 = p_2\), it is useful to find the pooled estimate of the shared proportion: \[\hat{p}=\frac{\text{number of "successes"}}{\text{number of cases}} = \frac{\hat{p}_1n_1 + \hat{p}_2n_2}{n_1 + n_2}\] Here, \(\hat{p}_1n_1\) represents the number of successes in sample 1 since \[\hat{p}_1= \frac{\text{number of successes in sample 1}}{n_1}\]

Similarly, \(\hat{p}_2 n_2\) represents the number of successes in sample 2.

Use the pooled proportion estimate when \(H_0: p_1 = p_2\).

Example from the book:

|          | cancer | no cancer |
|----------|--------|-----------|
| 2,4-D    | 191    | 304       |
| no 2,4-D | 300    | 641       |

\(H_0: p_c - p_n = 0\)
\(H_A: p_c - p_n > 0\)

In this case, a "success" is a dog's exposure to 2,4-D. So the pooled proportion estimate is:

\(\frac{191 + 304}{191 + 300 + 304 + 641} = 0.345\)

n_1 <- 491
n_2 <- 945
# "success" counts as exposure to 2,4-D
rate_1 <- 191/n_1
rate_2 <- 304/n_2
rate_diff <- rate_1 - rate_2
nullval <- 0
conf_level <- 0.95
alpha <- 1 - conf_level
p_hat <- (191 + 304)/(191 + 300 + 304 + 641)
SE <- sqrt((p_hat * (1 - p_hat))/n_1 + (p_hat * (1 - p_hat))/n_2)
SE
## [1] 0.02644
rate_diff
## [1] 0.06731
Z <- abs(rate_diff - nullval) / SE
Z
## [1] 2.546
p_value = 1 - pnorm(Z)
p_value
## [1] 0.005453

The p-value (about 0.0055) is far below 0.05, so we reject the null hypothesis, matching the book's conclusion. My rate_diff is 0.06731 (the book rounds to 0.067) and SE is 0.02644 (the book rounds to 0.026), which leads to a slightly different Z value than the book's, but the same conclusion.

6.3 Testing for goodness of fit using chi-square

In this section, we develop a method for assessing a null model when the data are binned.

Some data used as illustration:

| Race                      | White | Black | Hispanic | Other |
|---------------------------|-------|-------|----------|-------|
| Representation in juries  | 205   | 26    | 25       | 19    |
| Registered voters         | 0.72  | 0.07  | 0.12     | 0.09  |

This technique is commonly used in two circumstances:

  • Given a sample of cases that can be classified into several groups, determine if the sample is representative of the general population.
  • Evaluate whether data resemble a particular distribution, such as a normal distribution or a geometric distribution.

6.3.1 Creating a test statistic for one-way tables

This section does not describe a test statistic! It gives an example about race breakdown on juries. We need a test statistic to indicate whether jury selection is biased based on race.

Also: no mention of what “one-way” means.

Example 6.18 Of the people in the city, 275 served on a jury. If the individuals are randomly selected to serve on a jury, about how many of the 275 people would we expect to be white? How many would we expect to be black?

white_prop = 0.72
black_prop = 0.07
total = 275
expected_white_jurors = white_prop * total
expected_black_jurors = black_prop * total
expected_white_jurors
## [1] 198
expected_black_jurors
## [1] 19.25

The hypothesis test for this data:

\(H_0:\) The jurors are a random sample, i.e, there is no racial bias in who serves on a jury.
\(H_A:\) The jurors are not randomly sampled, i.e., there is racial bias in jury selection.

6.3.2 The chi-square test statistic

We begin by calculating a Z score for each group.

\(Z_1 = \frac{\text{observed white count - null white count}}{\text{SE of observed white count}} = \frac{205-198}{\sqrt{198}} = 0.50\)

\(Z_2 = \frac{\text{observed black count - null black count}}{\text{SE of observed black count}} = \frac{26-19.25}{\sqrt{19.25}} = 1.54\)

\(Z_3 = \frac{\text{observed Hispanic count - null Hispanic count}}{\text{SE of observed Hispanic count}} = \frac{25-33}{\sqrt{33}} = -1.39\)

\(Z_4 = \frac{\text{observed other count - null other count}}{\text{SE of observed other count}} = \frac{19-24.75}{\sqrt{24.75}} = -1.16\)

Now we square each number and sum them:

\((0.5)^2 + (1.54)^2 + (-1.39)^2 + (-1.16)^2 = 5.89\)

This is the test statistic \(X^2\).
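The same calculation in R (my own sketch, using the counts from the table above):

observed <- c(205, 26, 25, 19)
props <- c(0.72, 0.07, 0.12, 0.09)    # registered-voter proportions
expected <- sum(observed) * props     # 198, 19.25, 33, 24.75
Z <- (observed - expected) / sqrt(expected)
sum(Z^2)                              # the X^2 test statistic, about 5.89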

6.3.3 The chi-square distribution and finding areas

The chi-square distribution is sometimes used to characterize data sets and statistics that are always positive and typically right skewed. Recall the normal distribution had two parameters–mean and standard deviation–that could be used to describe its exact characteristics. The chi-square distribution has just one parameter called degrees of freedom (df), which influences the shape, center, and spread of the distribution.

As the degrees of freedom increase in a chi-square distribution:

  • the distribution becomes more symmetric
  • the center moves to the right
  • the variability inflates

The following shows chi-square distributions for 2 degrees of freedom (green), 4 (red) and 9 (blue).

# histogram of simulated chi-square data (df=5) just to set up the plot region,
# then overlay the three density curves
x <- rchisq(100, 5)
hist(x, prob=TRUE, ylim=c(0,0.6), xlim=c(-0.1, 20))
curve( dchisq(x, df=2), col='green', add=TRUE)
curve( dchisq(x, df=4), col='red', add=TRUE )
curve( dchisq(x, df=9), col='blue', add=TRUE)


There is a chi-square table in which values can be looked up. Or use R. The following says that, when there are 2 degrees of freedom and the chi-square cutoff is 2.41, the upper tail probability is roughly 0.3.

pchisq(2.41, df=2, lower.tail=F) # this corresponds to looking up 2.41 in the table in the df=2 row
## [1] 0.2997

To look up in reverse, use qchisq. The following shows that, when there are two degrees of freedom and the probability of the upper tail is 0.3, the chi-square value is roughly 2.41.

qchisq(0.3, df=2, lower.tail=F)
## [1] 2.408

6.3.4 Finding a p-value for a chi-square distribution

We use the \(X^2\) test statistic and the chi-square distribution to determine the p-value.

Example 6.27 How many categories were there in the juror example? How many degrees of freedom would be associated with the chi-square distribution used for \(X^2\)?

There were k=4 categories in the juror example. Therefore, df=3 if \(H_0\) is true.

Example 6.28 If the null hypothesis is true, the test statistic \(X^2 = 5.89\) would be closely associated with a chi-square distribution with three degrees of freedom. Using this distribution and test statistic, identify the p-value.

Here is the chi-square distribution when df=3, and a line at x=5.89:

curve(dchisq(x, df=3), xlim=c(-0.1, 20))
abline(v=5.89)


The p-value:

pvalue = pchisq(5.89,df=3,lower.tail=F)
pvalue
## [1] 0.1171

The p-value is greater than 0.05 (the assumed significance level) so we do not reject the null hypothesis.

Chi-square test for one-way table

Suppose we are to evaluate whether there is convincing evidence that a set of observed counts \(O_1, O_2, \dots, O_k\) in \(k\) categories are unusually different from what might be expected under a null hypothesis. Call the expected counts that are based on the null hypothesis \(E_1, E_2, \dots, E_k\). If each expected count is at least 5 and the null hypothesis is true, then the test statistic below follows a chi-square distribution with \(k-1\) degrees of freedom:

\[X^2 = \frac{(O_1-E_1)^2}{E_1} + \frac{(O_2-E_2)^2}{E_2} + \dots + \frac{(O_k-E_k)^2}{E_k}\]

The p-value for this test statistic is found by looking at the upper tail of this chi-square distribution. We consider the upper tail because larger values of \(X^2\) would provide greater evidence against the null hypothesis.

Conditions for the chi-square test:

  • Independence. Each case that contributes a count to the table must be independent of all other cases in the table.
  • Sample size/distribution. Each particular scenario (cell count) must have at least 5 expected cases.
  • Degrees of freedom. We only apply the chi-square technique when the table is associated with a chi-square distribution of 2 or more degrees of freedom.
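R's built-in chisq.test runs this whole one-way (goodness of fit) test in one call; for the juror data it should reproduce \(X^2 \approx 5.89\), df = 3, and a p-value around 0.12:

observed <- c(205, 26, 25, 19)
props <- c(0.72, 0.07, 0.12, 0.09)
chisq.test(observed, p = props)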

6.3.5 Evaluating goodness of fit for a distribution

This section describes an application of chi-square testing framework. Not covering now, may come back to this.

6.4 Testing for independence in two-way tables

What is so different about one-way and two-way tables?

A one-way table describes counts for each outcome in a single variable. A two-way table describes counts for combinations of outcomes for two variables. When we consider a two-way table, we often would like to know, are these variables related in any way? That is, are they dependent?

An example is Google testing two new search algorithms against the current algorithm. The hypothesis test is really about assessing whether there is statistically significant evidence that the choice of algorithm affects whether a user performs a second search. In other words, the goal is to check whether the search variable is independent of the algorithm variable.

The example at the start of the chapter was racial representation on juries. The outcome was “representation in juries.” Just one outcome. The example in this section is Google conducting research on new search algorithms. The outcomes are “No new search” and “New search”. These outcomes are broken out across current algorithm, test algorithm 1 and test algorithm 2.

Here is the table.

| Search algorithm | current | test 1 | test 2 | Total |
|------------------|---------|--------|--------|-------|
| No new search    | 3511    | 1749   | 1818   | 7078  |
| New search       | 1489    | 751    | 682    | 2922  |
| Total            | 5000    | 2500   | 2500   | 10000 |

6.4.1 Expected counts in two-way tables

The table shows that 7078 of 10,000 people (70.78%) did not conduct a new search. If there is no difference between the algorithms, then this proportion should be about the same for all three algorithms. For example, 5000 users saw the current algorithm, so \(5000 \times 0.7078=3539\) users (as opposed to the 3511 observed). Here is an updated table with expected counts in parentheses.

| Search algorithm | current     | test 1        | test 2        | Total |
|------------------|-------------|---------------|---------------|-------|
| No new search    | 3511 (3539) | 1749 (1769.5) | 1818 (1769.5) | 7078  |
| New search       | 1489 (1461) | 751 (730.5)   | 682 (730.5)   | 2922  |
| Total            | 5000        | 2500          | 2500          | 10000 |

Computing expected counts in a two-way table

To identify the expected count for the \(i^{th}\) row and the \(j^{th}\) column, compute \[\text{Expected Count}_{\text{row } i, \text{col } j} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{table total}}\]

6.4.2 The chi-square test for two-way tables

The chi-square test statistic for a two-way table is found the same way as it is found for a one-way table. For each table count, compute

General formula: \(\frac{(\text{observed count - expected count})^2}{\text{expected count}}\)

Row 1, Col 1: \(\frac{(3511-3539)^2}{3539}=0.222\)

Row 1, Col 2: \(\frac{(1749-1769.5)^2}{1769.5} = 0.237\)

\(\vdots\)

Row 2, Col 3: \(\frac{(682-730.5)^2}{730.5} = 3.220\)

Adding the computed value for each cell gives the chi-square test statistic \(X^2\):

\(X^2 = 0.222 + 0.237 + \dots + 3.220 = 6.120\)

Add R code here doing all of these calculations.
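Here is a first pass at that code (my own sketch, using the counts from the table above):

observed <- matrix(c(3511, 1749, 1818,
                     1489,  751,  682),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("No new search", "New search"),
                                   c("current", "test 1", "test 2")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
X2 <- sum((observed - expected)^2 / expected)
X2                                        # about 6.12
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
pchisq(X2, df = df, lower.tail = FALSE)   # p-value, about 0.047
# chisq.test(observed, correct = FALSE) should report the same statistic and p-value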

Computing degrees of freedom for a two-way table

When applying the chi-square test to a two-way table, we use

\[df = (R - 1) \times (C - 1)\]

where R is the number of rows in the table and C is the number of columns. In the Google example, this would be \((2 - 1) \times (3 - 1) = 2\).

Use two-proportion methods for 2-by-2 contingency tables

When analyzing 2-by-2 contingency tables, use the two-proportion methods introduced in Section 6.2. Still haven’t absorbed this, but I can see right away that df would be 1 using two-way tables, and that won’t work.

6.5 Small sample hypothesis testing for a proportion

6.5.1 When the success-failure condition is not met

If the success-failure condition is not met, we can conduct a hypothesis test using simulation.

Example: a quack (oops, consultant) claims that the complication rate for liver donor surgeries is 10% overall, but she has seen only 3 complications in 62 surgeries she has facilitated. 3/62 = 0.048.

6.5.2 Generating the null distribution and p-value by simulation

The simulation would be to take 1 red card and 9 black ones (the null hypothesis is that the complication rate is 10%), and then take 62 samples of size=1 (with replacement). The example given resulted in 5 complications (red cards) out of 62 for a percentage of 0.081.

Now this needs to be repeated many times and a distribution built from the results.

(Need R code to do exactly this! My way, and using inference.)
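Here is my attempt at the "my way" version (a sketch; the exact p-value will wobble a little with the random seed). Each simulated sample is 62 draws where each draw is a complication with probability 0.10, which is equivalent to the card-drawing scheme above:

set.seed(2014)
n <- 62
p0 <- 0.10        # complication rate under the null hypothesis
reps <- 10000
p_sim <- rbinom(reps, size = n, prob = p0) / n   # simulated sample proportions
mean(p_sim <= 3/62)   # left-tail area / p-value; should land near the book's 0.1222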

In the simulation shown in the book, out of 10,000 repetitions there were 1222 simulated sample proportions with \(\hat{p}_{sim} \le 0.048\). Construct the left-tail area and find the p-value:

\(\text{left tail} = \frac{\text{number of simulations with } \hat{p}_{sim} \le 0.048}{10000}\)

This is 1222/10,000 or 0.1222. If our significance level is 0.05, we do not reject the null hypothesis; we do not have convincing evidence that the quack's claim is true.

One-sided hypothesis test for \(p\) with a small sample

The p-value is always derived by analyzing the null distribution of the test statistic. The normal model poorly approximates the sampling distribution of \(\hat{p}\) when the success-failure condition is not satisfied. As a substitute, we can generate the null distribution using simulated sample proportions (\(\hat{p}_{sim}\)) and use this distribution to compute the tail area, i.e., the p-value.

6.5.3 Generating the exact null distribution and p-value

Example 6.51 Compute the exact p-value to check the consultant's claim that her clients' complication rate is below 10%.

\[\text{p-value} = \sum\limits_{j=0}^3 \binom{n}{j} p^j(1-p)^{n-j}\]

\(= \sum\limits_{j=0}^3 \binom{62}{j} 0.1^j(1 - 0.1)^{62-j}\)

\(=\binom{62}{0}0.1^0(1-0.1)^{62-0} + \binom{62}{1}0.1^1(1-0.1)^{62-1} + \binom{62}{2}0.1^2(1-0.1)^{62-2} + \binom{62}{3}0.1^3(1-0.1)^{62-3}\)

\(=0.0015 + 0.0100 + 0.0340 + 0.0755 = 0.1210\)

This exact p-value is very close to the p-value from the simulations (0.1222) and we come to the same conclusion.
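In R, this exact binomial tail is a one-liner (pbinom is the cumulative distribution function of the binomial):

pbinom(3, size = 62, prob = 0.1)   # about 0.121, matching the sum above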

6.5.4 Using simulation for goodness of fit tests

Chapter 7: Introduction to linear regression

Linear models can be used for prediction or to evaluate whether there is a linear relationship between two numerical variables. Linear regression assumes that the relationship between two variables, x and y, can be modeled by a straight line:

\[y = \beta_0 + \beta_1x\]

But there are cases when the relationship is non-linear, such as when a graph shows a parabola shape.

7.1 Line fitting, residuals and correlation

A new statistic, correlation, is used to identify a linear model.

A note on linear regression in R.

This course does not delve into linear regression using R. If you want to learn more about specifically how to use linear regression in R, try http://www.r-tutor.com/elementary-statistics/simple-linear-regression.

7.1.1 Beginning with straight lines

A scatterplot can be used to make a visual assessment of whether a relationship exists between two variables. Straight lines should only be used when the data appear to have a linear relationship.

7.1.2 Fitting a line by eye

It is possible to make a rough guess where a regression line would cross the y axis and what its slope should be, and so fit a line by eye. We use \(\hat{y}\) to signify an estimate from such a line, e.g. \(\hat{y} = 41 + 0.59x\).

7.1.3 Residuals

Residuals are the leftover variation in the data after accounting for the model fit. If an observation is above the regression line, the vertical distance from the observation to the line is positive. Observations below the line have negative residuals.

Note: Residuals versus Errors

This is not in the textbook. Residuals are the differences between the observed y values and the values predicted by your fitted regression line, which is based on sample statistics. The population has its own regression line, with unknowable parameters that almost certainly do not match the sample estimates. The distances between the data points and that unknowable population line are the errors, as opposed to the residuals.

Residual: difference between observed and expected

The residual of the \(i^{th}\) observation \((x_i, y_i)\) is the difference of the observed response \((y_i)\) and the response we would predict based on the model fit \((\hat{y}_i)\):

\[e_i = y_i - \hat{y}_i\]

We typically identify \(\hat{y}_i\) by plugging \(x_i\) into the model.
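A minimal sketch on made-up data (these x and y values are not from the book): fit a line with lm, then compute the residuals both ways.

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
residuals(fit)      # e_i = y_i - yhat_i
y - fitted(fit)     # the same thing, computed by hand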

7.1.4 Describing linear relationships with correlation

Correlation: Strength of a linear relationship

Correlation, which always takes values between -1 and 1, describes the strength of the linear relationship between two variables. We denote the correlation by \(R\).

\[R = \frac{1}{n-1} \sum\limits_{i=1}^n \frac{x_i - \bar{x}}{s_x} \frac{y_i-\bar{y}}{s_y}\]

Example calculating the correlation by hand, then checking with R's built-in functions

x <- c(10,12,14,10,13,16,20)
y <- c(9,10,12,9,14,14,16)
n = length(x)
sx = sd(x)
sy = sd(y)
xbar = mean(x)
ybar = mean(y)
R = (1/(n-1)) * sum(((x-xbar)/sx) * ((y-ybar)/sy))
#the R value
R
## [1] 0.915
# and now R^2
R^2
## [1] 0.8372
# The cor function also returns the R value
cor(x, y) == R
## [1] TRUE
plot(x,y)


# Now use R functions to test the above
data.lm = lm(y ~ x)
data.stdres <- rstandard(data.lm)
qqnorm(data.stdres,
       ylab="Standard Residuals",
       xlab="Normal Scores",
       main="Sample Data")
qqline(data.stdres)


summary(data.lm)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      1      2      3      4      5      6      7 
## -0.453 -0.879 -0.306 -0.453  2.408  0.268 -0.585 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    2.321      1.964    1.18   0.2904   
## x              0.713      0.141    5.07   0.0039 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.22 on 5 degrees of freedom
## Multiple R-squared:  0.837,  Adjusted R-squared:  0.805 
## F-statistic: 25.7 on 1 and 5 DF,  p-value: 0.00386

7.2 Fitting a line by least squares regression

The least squares regression is a more rigorous approach than fitting a model by eye.

7.2.1 An objective measure for finding the best line

The line determined by this method is called the least squares line.

7.2.2 Conditions for the least squares line

  • Linearity. The data should show a linear trend. If there is a nonlinear trend, an advanced regression method should be applied (not covered in this book).
  • Nearly normal residuals. Generally the residuals must be nearly normal.
  • Constant variability. The variability of points around the least squares line remains roughly constant. I.e., the points are not closely bunched to the line at one end, and more scattered at the other.

7.2.3 Finding the least squares line

The slope of the least squares line can be estimated by

\[b_1 = \frac{s_y}{s_x}R\]

where R is the correlation between the two variables, and \(s_x\) and \(s_y\) are the sample standard deviations of the explanatory variable and response variable, respectively.

If \(\bar{x}\) is the mean of the horizontal variable (from the data) and \(\bar{y}\) is the mean of the vertical variable, then the point \((\bar{x}, \bar{y})\) is on the least squares line.

Once the slope and a point on the line \((\bar{x}, \bar{y})\) are known, the regression line can be written down. Say that \(b_1=-0.0431\), \(\bar{x}=101.8\) and \(\bar{y}=19.94\):

\(y - \bar{y} = b_1(x - \bar{x})\)
\(y - 19.94 = -0.0431(x - 101.8)\)
\(y = 24.3 - 0.0431x\)

Using the sample data from 7.1.4:

slope = (sy/sx) * R
slope # slope for regression line, expected to be 0.713
## [1] 0.7132
yint = (slope * -xbar) + ybar
yint # y-intercept for regression line, expected to be 2.321
## [1] 2.321

7.2.4 Interpreting regression line parameter estimates

The slope describes the estimated difference in the y variable if the explanatory variable x for a case happened to be one unit larger. The intercept describes the average outcome of y if x=0 and the linear model is valid all the way to x=0, which in many applications is not the case.

The point of the previous paragraph is that sometimes a regression line near x=0 makes no sense. An example is a correlation of height to weight. People cannot be zero inches tall or weigh zero pounds.

7.2.5 Extrapolation is treacherous

It is frequently dangerous to extrapolate for values in a linear model beyond the data points given. For example, a line might show average home cost by square footage. But this does not mean that a house with twice the square footage of the largest house in the data set will fit the regression line. Even less does it mean that a house with negative square footage will have a negative cost.

7.2.6 Using \(R^2\) to describe the strength of a fit

The \(R^2\) of a linear model describes the amount of variation in the response that is explained by the least squares line.

7.2.7 Categorical predictors with two levels

7.3 Types of outliers in linear regression

7.4 Inference for linear regression

7.4.1 Midterm elections and unemployment

7.4.2 Understanding regression output from software

7.4.3 An alternative test statistic

NOTE: This is the r.squared field in a linear regression in R.

eruption.lm = lm(eruptions ~ waiting, data=faithful)
summary(eruption.lm)$r.squared
## [1] 0.8115

Type help(summary.lm) in R to get full information on what is returned in the summary of a linear model. Note that the names function does not show everything; for example, names(eruption.lm) above would not show r.squared. It is returned by names(summary(eruption.lm)).

Other Stuff

Manipulating data

Removing records with NA

The following loads GSS data, and then puts only two columns, wrkstat and tvhours, into data, then puts only records from data into data2 where all of the fields in data are complete (no NA).

Note use of complete.cases function as a filter.

load(url("http://bit.ly/dasi_gss_data"))
data <- cbind(gss$wrkstat, gss$tvhours)
# the following removes all rows that have NA in either column
data2 <- as.data.frame(data[complete.cases(data),])
names(data2) <- c("wrkstat", "tvhours")
head(data2)
##   wrkstat tvhours
## 1       4       1
## 2       6       2
## 3       1       3
## 4       1       2
## 5       1       1
## 6       1       3

This is a bit more sophisticated. It begins by putting data into data2 as a data.frame, so at first, it is all records in data. Then it adds the names back in (they default to V1 and V2).

Now a check is made of the number of rows that have an NA in tvhours. Next, data2 is copied into data3, but only where there is no NA in tvhours. So there might still be records where wrkstat has an NA, but not in tvhours column.

So a use case here might be something like: you need rows only where a certain set of columns have no NAs, but you will need to access other columns which may contain NAs, and you don't want those rows thrown out. The first set of columns might be numeric values, the second might be categories, and it's OK if there is a category that has no value (you could call it "Undefined" or "Other").

# this, instead, removes only rows that have NA in tvhours column
data2 <- as.data.frame(data)
names(data2) <- c("wrkstat", "tvhours")
head(data2)
##   wrkstat tvhours
## 1       1      NA
## 2       5      NA
## 3       2      NA
## 4       1      NA
## 5       7      NA
## 6       1      NA
sum(is.na(data2$tvhours))
## [1] 23206
length(data2$tvhours)
## [1] 57061
data3 <- data2[complete.cases(data2$tvhours),]

sum(is.na(data3$tvhours))
## [1] 0
length(data3$tvhours)
## [1] 33855

Subsetting strings

data <- data.frame(x=c("abc", "def", "geh", "abc"))
subset(data, data$x %in% "abc") # returns rows 1 and 4
##     x
## 1 abc
## 4 abc

Combining levels

data <- data.frame(x=c(1,2,3,4,5,6), cat=factor(x=c("one", "one", "two", "two", "three", "four"),levels=c("one","two","three","four"),ordered=T))
# Note that if levels in cat weren't ordered when df is created you could do this:
# data$cat = factor(data$cat, levels=c("one", "two", "three", "four"))
data
##   x   cat
## 1 1   one
## 2 2   one
## 3 3   two
## 4 4   two
## 5 5 three
## 6 6  four
levels(data$cat)
## [1] "one"   "two"   "three" "four"
levels(data$cat) <- c("one", "one", "three", "three") # in effect, changes rows with cat="two" to cat="one", and "four" to "three"
data
##   x   cat
## 1 1   one
## 2 2   one
## 3 3   one
## 4 4   one
## 5 5 three
## 6 6 three

Grouping by numeric values

The following groups the values for Age in a dataframe based on specific groups. One group would be records with an Age value of less than 0 (which would probably mean age was unknown), then 0 up to but not including 18, 18 up to but not including 29, and so on. So you could easily group data by age demographics.

data <- data.frame(Age=c(10,11,18,20,30,60))
data$agecat <- cut(data$Age, c(-Inf, 0, 18, 29, 39, 49, 60, Inf))
data
##   Age  agecat
## 1  10  (0,18]
## 2  11  (0,18]
## 3  18  (0,18]
## 4  20 (18,29]
## 5  30 (29,39]
## 6  60 (49,60]

Random stuff

The following is some random junk I was playing around with.

library(xtable)
d <- data.frame(letter=LETTERS, index=rnorm(52)) 
d.table <- xtable(d[1:5,])
print(d.table,type="html")
|   | letter | index |
|---|--------|-------|
| 1 | A      | 0.86  |
| 2 | B      | 1.36  |
| 3 | C      | -0.13 |
| 4 | D      | 0.22  |
| 5 | E      | 1.16  |