1 Analyzing data

  • This chapter describes a number of techniques for analyzing data in R. Many of the functions described in this chapter are useful for preparing data for further analysis or serve as building blocks for other analyses.

1.1 Summary statistics

  • R includes a variety of functions for calculating summary statistics.

  • Use the mean function to calculate the mean of a vector, the min function to calculate the minimum, and the max function to calculate the maximum.

 library(nutshell)
 data(dow30)
 head(dow30)
## symbol       Date  Open  High   Low Close  Volume Adj.Close
##    MMM 2009-09-21 73.91 74.68 73.91 74.54 2560400     74.54
##    MMM 2009-09-18 75.12 75.25 74.50 74.62 4387900     74.62
##    MMM 2009-09-17 75.34 75.45 74.50 74.89 3371500     74.89
##    MMM 2009-09-16 74.76 75.49 74.50 75.38 2722500     75.38
##    MMM 2009-09-15 74.63 74.88 74.00 74.68 3566900     74.68
##    MMM 2009-09-14 73.72 74.64 73.42 74.56 3466400     74.56
 mean(dow30$Open)
## [1] 36.24574
 max(dow30$Open)
## [1] 122.45
 min(dow30$Open)
## [1] 0.99
  • For each of these functions, the argument na.rm specifies how NA values are treated. By default, if any value in the vector is NA, then the value NA is returned. Specify na.rm=TRUE to ignore missing values:
 mean(c(1,2,3,4,5,NA))
## [1] NA
 mean(c(1,2,3,4,5,NA),na.rm=TRUE)
## [1] 3
  • We can also compute a trimmed mean, which is less sensitive to outliers. To do this, use the trim argument to specify the fraction of observations to drop from each end of the sorted vector before averaging:
 mean(c(-1,0:100,2000))
## [1] 68.43689
 mean(c(-1,0:100,2000), trim=0.1)
## [1] 50
  • To calculate the minimum and maximum at the same time, use the range function:
 range(dow30$Open)
## [1]   0.99 122.45
  • Another useful function is quantile. This function can be used to return the values at different percentiles (specified by the probs argument):
  quantile(dow30$Open, probs=c(0,0.25,0.5,0.75,1.0))
##      0%     25%     50%     75%    100% 
##   0.990  19.655  30.155  51.680 122.450
  • We can get a similar set of values (the minimum, lower hinge, median, upper hinge, and maximum, also known as Tukey's five-number summary) with the fivenum function; note that the hinges may differ slightly from the quartiles returned by quantile:
 fivenum(dow30$Open)
## [1]   0.990  19.650  30.155  51.680 122.450
  • Use the function IQR to return the interquartile range (the difference between the 25th and 75th percentile values):
 IQR(dow30$Open)
## [1] 32.025
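As a quick check, the interquartile range can be reproduced from the quartiles computed earlier (output omitted; from the quantile output above, 51.680 - 19.655 = 32.025):

 # IQR is the 75th percentile minus the 25th percentile
 quantile(dow30$Open, 0.75) - quantile(dow30$Open, 0.25)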
  • The most convenient function for looking at summary information is summary. As an example, let’s take a look at the output of summary for the dow30 data set that we used above:
  summary(dow30)
##      symbol             Date           Open             High       
##  MMM    : 252   2008-09-22:  30   Min.   :  0.99   Min.   :  1.01  
##  AA     : 252   2008-09-23:  30   1st Qu.: 19.66   1st Qu.: 20.19  
##  AXP    : 252   2008-09-24:  30   Median : 30.16   Median : 30.75  
##  T      : 252   2008-09-25:  30   Mean   : 36.25   Mean   : 36.93  
##  BAC    : 252   2008-09-26:  30   3rd Qu.: 51.68   3rd Qu.: 52.45  
##  BA     : 252   2008-09-29:  30   Max.   :122.45   Max.   :122.88  
##  (Other):5970   (Other)   :7302                                    
##       Low             Close            Volume            Adj.Close     
##  Min.   :  0.27   Min.   :  0.75   Min.   :1.336e+06   Min.   :  0.75  
##  1st Qu.: 19.15   1st Qu.: 19.65   1st Qu.:1.111e+07   1st Qu.: 19.38  
##  Median : 29.55   Median : 30.10   Median :1.822e+07   Median : 29.41  
##  Mean   : 35.53   Mean   : 36.24   Mean   :5.226e+07   Mean   : 35.64  
##  3rd Qu.: 50.84   3rd Qu.: 51.58   3rd Qu.:4.255e+07   3rd Qu.: 50.97  
##  Max.   :121.62   Max.   :122.11   Max.   :2.672e+09   Max.   :122.11  
## 

As you can see, summary presents information about each variable in the data frame. For numeric values, it shows the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values. For factors, summary shows the count of the most frequent values. (Less frequent values are grouped into an “Other” category.)

  • A useful (text based) tool for looking at the distribution of a numeric vector is the stem function:
stem(x, scale = 1, width = 80, atom = 1e-08)

The argument x is a numeric vector, scale controls the length of the plot, width controls the width, and atom is a tolerance factor. For example:

 x <- c(11.6, 10.3, 10.8, 10.6, 11.9, 9.6, 9.0, 9.9, 11.2, 10.21, 10.69)
 stem(x)
## 
##   The decimal point is at the |
## 
##    9 | 069
##   10 | 23678
##   11 | 269
 library(nutshell)
 data(field.goals)
 dim(field.goals)
## [1] 982  10
 head(field.goals)
##   home.team week qtr away.team offense defense play.type        player yards stadium.type
##         ARI   14   3       WAS     ARI     WAS    FG good   1-N.Rackers    20          Out
##         ARI    2   4       STL     ARI     STL    FG good   1-N.Rackers    35          Out
##         ARI    7   3       TEN     ARI     TEN    FG good   1-N.Rackers    24          Out
##         ARI   12   2       JAC     JAC     ARI    FG good   10-J.Scobee    30          Out
##         ARI    2   3       STL     ARI     STL    FG good   1-N.Rackers    48          Out
##         ARI    7   4       TEN     TEN     ARI FG aborted 15-C.Hentrich    33          Out

As another example, we’ll look at field goal attempts in the NFL during 2005. Specifically, we’ll look at the attempted distances for missed field goals. To do this, we’ll use the subset function to select only missed field goals and then make a stem-and-leaf plot of the yards for each attempt:

 stem(subset(field.goals,play.type=="FG no")$yards)
## 
##   The decimal point is at the |
## 
##   20 | 0
##   22 | 
##   24 | 
##   26 | 00
##   28 | 0000000
##   30 | 0000000
##   32 | 00000000
##   34 | 000
##   36 | 0000
##   38 | 00000000000000
##   40 | 0000000000
##   42 | 0000000000000000
##   44 | 000000000000
##   46 | 000000000000000000
##   48 | 000000000000000000
##   50 | 000000000000
##   52 | 0000000000000000000
##   54 | 0000
##   56 | 000
##   58 | 00
##   60 | 00
##   62 | 0

1.2 Correlation, covariance and variance

  • Very often, when analyzing data, we want to know if two variables are correlated. Informally, correlation answers the question “when we increase (or decrease) x, does y increase (or decrease), and by how much?” Formally, correlation measures the linear dependence between two random variables. Correlation measures range between -1 and 1; 1 means that one variable is a (positive) linear function of the other, 0 means the two variables aren’t correlated at all, and -1 means that one variable is a negative linear function of the other.

  • The most commonly used correlation measurement is the Pearson correlation statistic: \[\begin{align} cor = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}. \end{align}\]

    where \(\bar{x}\) is the mean of variable \(x\), and \(\bar{y}\) is the mean of variable \(y\).

  • To compute correlations in R, we can use the function cor. Here is an example

 cor(1:10, 2:11)
## [1] 1
 x <- rnorm(10000)
 y <- rnorm(10000)
 cor(x, y)
## [1] 0.03069866
  • Sample covariance is defined as: \[\begin{align} cov = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}), \end{align}\] which is what R’s cov function computes (note the \(n-1\) denominator).
 cov(x,y)
## [1] 0.03050981
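To see where this number comes from, here is a quick sketch that applies the definition directly to the same x and y simulated above (output omitted; the result should match cov(x, y)):

 # sample covariance from the definition, using the n - 1 denominator
 sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)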
  • Sample variance is defined as:
\[\begin{align} var = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2, \end{align}\] again with the \(n-1\) denominator used by R’s var function.
 var(x)
## [1] 1.012606
 var(y)
## [1] 0.9754385
 cov(x,y)/sqrt(var(x)*var(y))
## [1] 0.03069866
 cor(x, y)
## [1] 0.03069866

1.3 Principal components analysis (PCA)

  • The goal of PCA is to replace a large number of correlated variables with a smaller number of uncorrelated variables while capturing as much information in the original variables as possible. These derived variables, called principal components, are linear combinations of the observed variables. Specifically, the first principal component \[\begin{align} PC_1 = a_1X_1 + a_2X_2 + \cdots + a_kX_k. \end{align}\]

    is the weighted combination of the \(k\) observed variables that accounts for the most variance in the original set of variables. The second principal component is the linear combination that accounts for the most variance in the original variables, under the constraint that it’s orthogonal (uncorrelated) to the first principal component. Each subsequent component maximizes the variance accounted for, while at the same time remaining uncorrelated with all previous components. Theoretically, you can extract as many principal components as there are variables. But from a practical viewpoint, you hope that you can approximate the full set of variables with a much smaller set of components.

  • In R, principal components analysis is available through the prcomp and princomp functions in the stats package; the example below uses princomp:

Let’s look at a simple example.

 library(nutshell)
 data(team.batting.00to08)
 dim(team.batting.00to08)
## [1] 270  13
 head(team.batting.00to08)[,1:10] # Ignore the last 3 columns
##   teamID yearID runs singles doubles triples homeruns walks stolenbases caughtstealing
##      ANA   2000  864     995     309      34      236   608          93             52
##      BAL   2000  794     992     310      22      184   558         126             65
##      BOS   2000  792     988     316      32      167   611          43             30
##      CHA   2000  978    1041     325      33      216   591         119             42
##      CLE   2000  950    1078     310      30      221   685         113             34
##      DET   2000  823    1028     307      41      177   562          83             38
 # ?princomp
 batting.pca <- princomp(~singles+doubles+triples+homeruns+walks+
                         hitbypitch+sacrificeflies+stolenbases+
                         caughtstealing, data=team.batting.00to08)
 batting.pca
## Call:
## princomp(formula = ~singles + doubles + triples + homeruns + 
##     walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing, 
##     data = team.batting.00to08)
## 
## Standard deviations:
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7 
## 74.900981 61.871086 31.811398 27.988190 23.788859 12.884291  9.150840 
##    Comp.8    Comp.9 
##  8.283972  7.060503 
## 
##  9  variables and  270 observations.

The princomp function returns a princomp object. We can get information on the importance of each component with the summary function:

 summary(batting.pca, loadings=TRUE)
## Importance of components:
##                            Comp.1     Comp.2     Comp.3      Comp.4
## Standard deviation     74.9009809 61.8710858 31.8113983 27.98819003
## Proportion of Variance  0.4610727  0.3146081  0.0831687  0.06437897
## Cumulative Proportion   0.4610727  0.7756807  0.8588494  0.92322841
##                             Comp.5      Comp.6      Comp.7      Comp.8
## Standard deviation     23.78885885 12.88429066 9.150840397 8.283972499
## Proportion of Variance  0.04650949  0.01364317 0.006882026 0.005639904
## Cumulative Proportion   0.96973790  0.98338107 0.990263099 0.995903002
##                             Comp.9
## Standard deviation     7.060503344
## Proportion of Variance 0.004096998
## Cumulative Proportion  1.000000000
## 
## Loadings:
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## singles         0.313  0.929  0.136         0.136                     
## doubles                       0.437  0.121 -0.877                0.100
## triples                                                  -0.424 -0.775
## homeruns       -0.235         0.383  0.825  0.324                     
## walks          -0.914  0.328 -0.150 -0.182                            
## hitbypitch                                         0.989              
## sacrificeflies                                           -0.321 -0.330
## stolenbases            0.131 -0.758  0.502 -0.307         0.232       
## caughtstealing               -0.208  0.104               -0.813  0.521
##                Comp.9
## singles              
## doubles              
## triples         0.449
## homeruns             
## walks                
## hitbypitch           
## sacrificeflies -0.882
## stolenbases          
## caughtstealing  0.105
  • Use the loadings method to show the contribution of each variable to the components:
 loadings(batting.pca)
## 
## Loadings:
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## singles         0.313  0.929  0.136         0.136                     
## doubles                       0.437  0.121 -0.877                0.100
## triples                                                  -0.424 -0.775
## homeruns       -0.235         0.383  0.825  0.324                     
## walks          -0.914  0.328 -0.150 -0.182                            
## hitbypitch                                         0.989              
## sacrificeflies                                           -0.321 -0.330
## stolenbases            0.131 -0.758  0.502 -0.307         0.232       
## caughtstealing               -0.208  0.104               -0.813  0.521
##                Comp.9
## singles              
## doubles              
## triples         0.449
## homeruns             
## walks                
## hitbypitch           
## sacrificeflies -0.882
## stolenbases          
## caughtstealing  0.105
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.111  0.111  0.111  0.111  0.111  0.111  0.111  0.111
## Cumulative Var  0.111  0.222  0.333  0.444  0.556  0.667  0.778  0.889
##                Comp.9
## SS loadings     1.000
## Proportion Var  0.111
## Cumulative Var  1.000
 #batting.pca$score
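For comparison, here is a minimal sketch of the same analysis using prcomp (stored in a new object, batting.pca2, introduced only for illustration). The reported standard deviations will differ slightly from those above because prcomp divides by \(n-1\) rather than \(n\):

 # the same model fit with prcomp; output omitted
 batting.pca2 <- prcomp(~singles+doubles+triples+homeruns+walks+
                        hitbypitch+sacrificeflies+stolenbases+
                        caughtstealing, data=team.batting.00to08)
 summary(batting.pca2)      # standard deviations and proportions of variance
 # batting.pca2$rotation    # the loadings matrix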
  • There is a plot method for princomp objects that displays a scree plot of the variance against each principal component:
plot(batting.pca)

  • A second useful method for visualizing principal components is the biplot:
 biplot(batting.pca,cex=0.5,col=c("gray50","black"))

2 Probability distributions

  • One convenient use of R is to provide a comprehensive set of statistical tables. Functions are provided to evaluate the cumulative distribution function \(P(X \leq x)\), the probability density function, and the quantile function (given \(q\), the smallest \(x\) such that \(P(X \leq x) > q\)), and to simulate from the distribution.
Distribution   R name   Additional arguments
normal         norm     mean, sd
uniform        unif     min, max
Student’s t    t        df, ncp
Cauchy         cauchy   location, scale
chi-squared    chisq    df, ncp
binomial       binom    size, prob
exponential    exp      rate
F              f        df1, df2, ncp
gamma          gamma    shape, scale
Poisson        pois     lambda

Prefix the name given here by d for the density, p for the CDF, q for the quantile function and r for simulation (random deviates). The first argument is \(x\) for dxxx, \(q\) for pxxx, \(p\) for qxxx and \(n\) for rxxx, where the symbol xxx denotes the specified distribution, such as norm, unif, and so on.
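As a brief illustration of this naming scheme, here is a sketch using the uniform distribution on [0, 1] (output omitted):

 dunif(0.5)    # density at 0.5
 punif(0.5)    # cumulative probability P(X <= 0.5)
 qunif(0.25)   # the 0.25 quantile
 runif(3)      # three random draws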

  • As an example, we’ll start with the normal distribution. As you may remember from statistics classes, the probability density function for the normal distribution is:
\[\begin{align} f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-(x-\mu)^2/(2\sigma^2)} \end{align}\]
  • Use the dnorm function to find the probability density at a given value:
 dnorm(0)
## [1] 0.3989423
 dnorm(5)
## [1] 1.48672e-06
 dnorm(-5)
## [1] 1.48672e-06
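The value of dnorm(0) matches the density formula above evaluated at \(x = 0\) with \(\mu = 0\) and \(\sigma = 1\):

 1 / sqrt(2 * pi)   # equals dnorm(0), approximately 0.3989423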
  • We can plot the normal distribution with the following command:
 plot(dnorm, -3.5, 3.5, main = "Normal Distribution", ylab = 'Density')

  • The distribution function for the normal distribution is pnorm:
 pnorm(0)
## [1] 0.5
 pnorm(5)
## [1] 0.9999997
 pnorm(-5)
## [1] 2.866516e-07
  • We can plot the cumulative normal distribution with a command like this:
 plot(pnorm, -3.5, 3.5, main = "Cumulative Normal Distribution")

  • We can calculate the quantile function for the normal distribution with the function qnorm:
 qnorm(0.5)
## [1] 0
 qnorm(1)
## [1] Inf
 qnorm(0)
## [1] -Inf
 # qnorm is the inverse of pnorm
 qnorm(pnorm(-1))
## [1] -1
 # finding the left and right sides of a 95% confidence interval
 c(qnorm(.025), qnorm(.975))
## [1] -1.959964  1.959964
  • Finally, it is possible to generate random numbers taken from the normal distribution. To do this in R, we can use the function rnorm:
 rnorm(10)
##  [1]  0.40565445 -1.80115452  0.83266950  0.07201365 -0.22783142
##  [6]  2.76128624  0.82836761  0.79889065  0.62772680  1.50361186
 hist(rnorm(20000),breaks=50)

3 Statistical tests

  • Many data problems boil down to statistical tests. For example, you might want to answer a question like:

    • Does this new drug work better than a placebo?
    • Does the new web site design lead to significantly more sales than the old design?
    • Can this new investment strategy yield higher returns than an index fund?

To answer questions like these, you would formulate a hypothesis, design an experiment, collect data, and use a tool like R to analyze the data. This chapter focuses on the tools available in R for answering these questions.

3.1 Continuous data

This section describes tests that apply to continuous random variables.

3.1.1 Normal distribution-based tests

We’ll start off by showing how to use some common statistical tests that assume the underlying data is normally distributed.

3.1.1.1 Comparing means

  • Suppose that you designed an experiment to show that some effect is true, have collected some data, and now want to know whether the data supports your hypothesis. One common question is whether the mean of the experimental data is different from what it would be if the effect were absent (the assumption of no effect is called the null hypothesis). Specifically, suppose that you have a set of observations \(x_1,\ldots,x_n\) with sample mean \(\bar{x}\) and want to know if the sample mean is significantly different from the null hypothesis mean \(\mu_0\). Furthermore, assume that the observations are normally distributed. To test the validity of the hypothesis, you can use a t-test. In R, you would use the function t.test.
 ?t.test
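For reference, the one-sample test statistic that t.test computes is \[\begin{align} t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}, \end{align}\] where \(s\) is the sample standard deviation; under the null hypothesis, \(t\) follows a t distribution with \(n-1\) degrees of freedom.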

Let’s take a look at an example of how you would use the t.test function.

 t.test(rnorm(300))
## 
##  One Sample t-test
## 
## data:  rnorm(300)
## t = 0.19768, df = 299, p-value = 0.8434
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.102937  0.125927
## sample estimates:
##  mean of x 
## 0.01149497

Next, let’s use a real data set, tires.sus. (For more information about tires.sus, please see its help file.)

 library(nutshell)
 data(tires.sus)
 dim(tires.sus)
## [1] 66 27

To begin, let’s extract a vector with the set of values in which we are interested and calculate the sample mean:

 time.to.failure.h <- subset(tires.sus, 
                             Tire_Type=='H'&
                             Speed_At_Failure_km_h==160)$Time_To_Failure
 time.to.failure.h
##  [1] 10.00 16.67 13.58 13.53 16.83  7.62  4.25 10.67  4.42  4.25
 mean(time.to.failure.h)
## [1] 10.182
 t.test(time.to.failure.h, mu = 9)
## 
##  One Sample t-test
## 
## data:  time.to.failure.h
## t = 0.75694, df = 9, p-value = 0.4684
## alternative hypothesis: true mean is not equal to 9
## 95 percent confidence interval:
##   6.649536 13.714464
## sample estimates:
## mean of x 
##    10.182

Here’s an explanation of the output from the t.test function. First, the function shows us the test statistic (\(t = 0.75694\)), the degrees of freedom (\(df = 9\)), and the calculated \(p\)-value for the test (p-value = 0.4684). The \(p\)-value is the probability, assuming the true mean is 9, of observing a sample mean at least as far from 9 as the one we observed (that is, above 10.182 or below 7.818); because this probability is 0.4684, we cannot reject the null hypothesis.
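As a quick check, we can reproduce the reported p-value from the test statistic and degrees of freedom shown above (output omitted; the result is approximately 0.468):

 # two-sided p-value computed directly from the t distribution
 2 * pt(-abs(0.75694), df = 9)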

  • Another common situation is when you have two groups of observations, and you want to know if there is a significant difference between the means of the two groups of observations. You can also use a t-test to compare the means of the two groups. For example
 x <- rnorm(200)
 y <- rnorm(200)
 t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -0.32878, df = 397.04, p-value = 0.7425
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2180601  0.1555739
## sample estimates:
##    mean of x    mean of y 
## -0.039130559 -0.007887479

Let’s pick another example from the tire data. Looking at the characteristics of the different tires that were tested, notice that three of the six tires had the same speed rating: S. Based on this speed rating, we might expect all three tire types to last about the same amount of time in the test:

 time.to.failure.e <- subset(tires.sus, 
                            Tire_Type=='E'&
                            Speed_At_Failure_km_h==180
                            )$Time_To_Failure
 time.to.failure.d <- subset(tires.sus, 
                            Tire_Type=='D'&
                            Speed_At_Failure_km_h==180
                            )$Time_To_Failure
 time.to.failure.b <- subset(tires.sus, 
                            Tire_Type=='B'&
                            Speed_At_Failure_km_h==180
                            )$Time_To_Failure
 t.test(time.to.failure.e, time.to.failure.d)
## 
##  Welch Two Sample t-test
## 
## data:  time.to.failure.e and time.to.failure.d
## t = -2.5042, df = 8.961, p-value = 0.03373
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.82222528 -0.04148901
## sample estimates:
## mean of x mean of y 
##  4.321000  4.752857
 t.test(time.to.failure.e, time.to.failure.b)
## 
##  Welch Two Sample t-test
## 
## data:  time.to.failure.e and time.to.failure.b
## t = -1.4549, df = 16.956, p-value = 0.164
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5591177  0.1027844
## sample estimates:
## mean of x mean of y 
##  4.321000  4.549167

3.1.1.2 Comparing paired data

  • Sometimes, we are provided with paired data. For example, we might have two observations per subject: one before an experiment and one after the experiment. In this case, we would use a paired t-test. We can use the t.test function, specifying paired=TRUE, to perform this test.

As an example of paired data, we can look at the SPECint2006 results. We will compare the unoptimized results (called “baseline”) to the optimized results, to see whether there is a statistically significant difference between the two.

  library(nutshell)
  data(SPECint2006)
 t.test(subset(SPECint2006, Num.Chips==1&Num.Cores==2)$Baseline, 
        subset(SPECint2006, Num.Chips==1&Num.Cores==2)$Result, 
        paired=TRUE)
## 
##  Paired t-test
## 
## data:  subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Baseline and subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Result
## t = -21.804, df = 111, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.957837 -1.631627
## sample estimates:
## mean of the differences 
##               -1.794732

3.1.1.3 Comparing variances of two populations

  • To compare the variances of two samples from normal populations, R includes the var.test function, which performs an F-test:
 x <- rnorm(100, mean = 0, sd = 2)
 y <- rnorm(80, mean = 1, sd = 1)
 var.test(x, y) 
## 
##  F test to compare two variances
## 
## data:  x and y
## F = 3.3107, num df = 99, denom df = 79, p-value = 1.014e-07
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  2.162406 5.018046
## sample estimates:
## ratio of variances 
##           3.310721

3.1.1.4 Comparing means across more than two groups

  • To compare the means across more than two groups, we can use a method called analysis of variance (ANOVA). A simple way to perform these tests is through aov:

As an example, let’s consider the 2006 U.S. Mortality data set. Specifically, we’ll look at differences in age at death by cause of death. Let’s take a look at the summary statistics for age by cause:

 library(nutshell)
 data(mort06.smpl)
 dim(mort06.smpl)
## [1] 243073     36
 tapply(mort06.smpl$age, INDEX = list(mort06.smpl$Cause), FUN = summary)
## $Accidents
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   31.00   48.00   50.88   73.00  108.00       8 
## 
## $`Alzheimer's Disease`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40.00   82.00   87.00   86.07   91.00  109.00 
## 
## $Cancer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   61.00   72.00   70.24   81.00  107.00 
## 
## $`Chronic respiratory diseases`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   70.00   78.00   76.37   84.00  106.00       1 
## 
## $Diabetes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   63.00   75.00   72.43   83.00  104.00       1 
## 
## $`Heart Disease`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   70.00   81.00   77.66   88.00  112.00       4 
## 
## $Homicide
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   22.00   28.00   32.19   42.00   92.00       2 
## 
## $`Influenza and pheumonia`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   76.00   84.00   80.16   90.00  108.00       1 
## 
## $Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   60.00   78.00   70.44   87.00  110.00      10 
## 
## $Suicide
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    8.00   32.00   45.00   46.14   57.00   97.00       2

Now, let’s fit an ANOVA model to the data and show a summary of the model. To do this in R, we simply need to use the aov function:

  fit1 <- aov(age~Cause, data=mort06.smpl)
  fit1
## Call:
##    aov(formula = age ~ Cause, data = mort06.smpl)
## 
## Terms:
##                    Cause Residuals
## Sum of Squares  15727886  72067515
## Deg. of Freedom        9    243034
## 
## Residual standard error: 17.22012
## Estimated effects may be unbalanced
## 29 observations deleted due to missingness
  #head(mort06.smpl[,c('age','Cause')])
  summary(fit1)
##                 Df   Sum Sq Mean Sq F value Pr(>F)    
## Cause            9 15727886 1747543    5893 <2e-16 ***
## Residuals   243034 72067515     297                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 29 observations deleted due to missingness
  • As another example of aov, let’s consider weight gain by women during pregnancy:
  library(nutshell)
  data(births2006.smpl)
  births2006.cln <- births2006.smpl[births2006.smpl$WTGAIN<99 &
                                    !is.na(births2006.smpl$WTGAIN),]
  tapply(X=births2006.cln$WTGAIN, INDEX=births2006.cln$DOB_MM, FUN=mean)
##        1        2        3        4        5        6        7        8 
## 30.94405 31.08356 31.29317 31.33610 31.07242 30.92589 30.57734 30.54855 
##        9       10       11       12 
## 30.25546 30.43985 30.79077 30.85564

It appears that weight gain is slightly higher for births in the winter and spring months, but is this difference statistically significant? Let’s take a look:

  fit2 <- aov(WTGAIN~DOB_MM, births2006.cln)
  fit2
## Call:
##    aov(formula = WTGAIN ~ DOB_MM, data = births2006.cln)
## 
## Terms:
##                   DOB_MM Residuals
## Sum of Squares     14777  73385301
## Deg. of Freedom        1    351465
## 
## Residual standard error: 14.44986
## Estimated effects may be unbalanced
  summary(fit2)
##                 Df   Sum Sq Mean Sq F value Pr(>F)    
## DOB_MM           1    14777   14777   70.77 <2e-16 ***
## Residuals   351465 73385301     209                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.1.1.5 Pairwise t-tests between multiple groups

  • Sometimes, you’re not interested in just whether there is a difference across groups, but would like to know more details about the differences. One way to do this is by performing a t-test between every pair of groups. To do this in R, you can use the pairwise.t.test function:

As an example, let’s return to the tire data that we used in the example above. When we looked at the t.test function, we created three different vectors for the different types of tires. Here, we’ll just use the pairwise t-test to compare all the tires by type:

  pairwise.t.test(tires.sus$Time_To_Failure, tires.sus$Tire_Type)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  tires.sus$Time_To_Failure and tires.sus$Tire_Type 
## 
##   B       C      D       E       H     
## C 0.2219  -      -       -       -     
## D 1.0000  0.5650 -       -       -     
## E 1.0000  0.0769 1.0000  -       -     
## H 2.4e-07 0.0029 2.6e-05 1.9e-08 -     
## L 0.1147  1.0000 0.4408  0.0291  0.0019
## 
## P value adjustment method: holm

As you can see, there is no statistically significant difference between the means of a few pairs of groups (such as C and L, D and L, or D and E), but there is a significant difference between some others (such as B and H, C and H, or E and L).

3.1.1.6 Testing normality

  • To test whether a sample comes from a normal distribution, we can use the Shapiro-Wilk test for normality through the shapiro.test function. For example:
 x <- rnorm(300, mean = 5, sd = 2)
 y <- runif(300, min = 2, max = 4)

A natural first step is to look at the distribution using a histogram or a quantile-quantile plot. Here is some R code to plot both, side by side:

 par(mfcol=c(1,2))
 hist(x)
 qqnorm(x)

 par(mfcol=c(1,2))
 hist(y)
 qqnorm(y)

 shapiro.test(x)
## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.99317, p-value = 0.1895
 shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.95944, p-value = 2.07e-07

3.1.1.7 Testing if a data vector came from an arbitrary distribution

  • We can use the Kolmogorov-Smirnov test to see if a vector came from an arbitrary probability distribution (not just a normal distribution).
    • In R, we can use the ks.test function to do this. For example:
 x <- rnorm(200)
 ks.test(x, pnorm)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.037576, p-value = 0.9403
## alternative hypothesis: two-sided
 ks.test(x, punif)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.495, p-value < 2.2e-16
## alternative hypothesis: two-sided

3.1.1.8 Testing if two data vectors came from the same distribution

  • The Kolmogorov-Smirnov test can also be used to test whether two data vectors came from the same distribution. For example:
 x <- rnorm(100)
 y <- rnorm(80)
 # Do x and y come from the same distribution?
 ks.test(x, y)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.0875, p-value = 0.8567
## alternative hypothesis: two-sided
 x <- rnorm(100)
 y <- rnorm(80, 1, 1)
 ks.test(x, y)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.4575, p-value = 6.682e-09
## alternative hypothesis: two-sided
 x <- rnorm(100)
 y <- runif(80)
 ks.test(x, y)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.57, p-value = 7.572e-14
## alternative hypothesis: two-sided

3.1.1.9 Correlation tests

  • If you’d like to check whether there is a statistically significant correlation between two vectors, you can use the cor.test function. For example
 cor.test(1:8, c(0, 2, 4, 6, 8, 10, 11, 14))
## 
##  Pearson's product-moment correlation
## 
## data:  1:8 and c(0, 2, 4, 6, 8, 10, 11, 14)
## t = 36.148, df = 6, p-value = 2.989e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9868648 0.9996032
## sample estimates:
##      cor 
## 0.997712
 cor.test(1:8, c(5, 3, 8, 1, 7, 0, 0, 3))
## 
##  Pearson's product-moment correlation
## 
## data:  1:8 and c(5, 3, 8, 1, 7, 0, 0, 3)
## t = -1.2232, df = 6, p-value = 0.2671
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8757371  0.3764066
## sample estimates:
##        cor 
## -0.4467689

3.1.2 Distribution-free tests

  • Although many real data sources can be approximated well by a normal distribution, there are many cases where you know that the data is not normally distributed, or do not know the shape of the distribution. Distribution-free (nonparametric) tests make no assumption about the shape of the underlying distribution; they can be more computationally intensive than tests based on a normal distribution, but they remain valid when the data is not normally distributed.

3.1.2.1 Comparing two means

The Wilcoxon test is the distribution-free equivalent to the t-test. To do this in R, we can use the wilcox.test function. As an example, let’s take a look at the same examples we used for t-tests. Let’s start by looking at the times to failure for tires. As above, let’s start by comparing tires of type E to tires of type D:

  wilcox.test( time.to.failure.e,  time.to.failure.d)
## Warning in wilcox.test.default(time.to.failure.e, time.to.failure.d):
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  time.to.failure.e and time.to.failure.d
## W = 14.5, p-value = 0.05054
## alternative hypothesis: true location shift is not equal to 0
  t.test(time.to.failure.e, time.to.failure.d)
## 
##  Welch Two Sample t-test
## 
## data:  time.to.failure.e and time.to.failure.d
## t = -2.5042, df = 8.961, p-value = 0.03373
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.82222528 -0.04148901
## sample estimates:
## mean of x mean of y 
##  4.321000  4.752857

Here’s an explanation of the output. The test function first shows the test statistic (W = 14.5) and the \(p\)-value for the statistic (0.05054). Notice that this is different from the result for the t-test. With the t-test, there was a significant difference between the means of the two groups, but with the Wilcoxon test, the difference between the two groups is not significant at a 95% confidence level.

3.1.2.2 Comparing more than two means

  • The Kruskal-Wallis rank sum test is a distribution-free equivalent of one-way ANOVA:
 kruskal.test(x, ...)

As an example, here is the output for the mortality data that we used as an example for ANOVA statistics:

 summary(aov(age~Cause, data=mort06.smpl))
##                 Df   Sum Sq Mean Sq F value Pr(>F)    
## Cause            9 15727886 1747543    5893 <2e-16 ***
## Residuals   243034 72067515     297                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 29 observations deleted due to missingness
  kruskal.test(age~Cause,data=mort06.smpl)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  age by Cause
## Kruskal-Wallis chi-squared = 34868, df = 9, p-value < 2.2e-16

3.1.2.3 Comparing variances

  • To compare the variance between different groups using a nonparametric test, R includes an implementation of the Fligner-Killeen (median) test through the fligner.test function. Here is the output of fligner.test for the mortality data above:
 fligner.test(age~Cause,data=mort06.smpl)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  age by Cause
## Fligner-Killeen:med chi-squared = 15788, df = 9, p-value < 2.2e-16

3.2 Discrete data

  • There is a different set of tests for looking at the statistical significance of discrete random variables (like counts or proportions), and so there is a different set of functions in R for performing those tests.

3.2.1 Proportion tests

  • If you have a data set with several different groups of observations and are measuring the probability of success in each group (or the fraction of some other characteristic), you can use the function prop.test to measure whether the difference between groups is statistically significant. Specifically, prop.test can be used for testing the null hypothesis that the proportions (probabilities of success) in several groups are the same or that they equal certain given values.

As an example, let’s revisit the field goal data. First, let’s create a new data set containing only good and missed field goals.

 library(nutshell)
 data(field.goals)
 field.goals.goodbad <- field.goals[field.goals$play.type=='FG good' | 
                                    field.goals$play.type=='FG no',]
 head(field.goals.goodbad)
##   home.team week qtr away.team offense defense play.type      player yards stadium.type
## 1       ARI   14   3       WAS     ARI     WAS   FG good 1-N.Rackers    20          Out
## 2       ARI    2   4       STL     ARI     STL   FG good 1-N.Rackers    35          Out
## 3       ARI    7   3       TEN     ARI     TEN   FG good 1-N.Rackers    24          Out
## 4       ARI   12   2       JAC     JAC     ARI   FG good 10-J.Scobee    30          Out
## 5       ARI    2   3       STL     ARI     STL   FG good 1-N.Rackers    48          Out
## 7       ARI    2   2       STL     ARI     STL   FG good 1-N.Rackers    26          Out

Now, let’s create a table of successes and failures by stadium type:

 field.goals.table <- table(field.goals.goodbad$play.type,   
                            field.goals.goodbad$stadium.type)
 field.goals.table
##             
##              Both  In Out
##   FG aborted    0   0   0
##   FG blocked    0   0   0
##   FG good      53 152 582
##   FG no        14  24 125

The table isn’t quite right for prop.test; we need a table with two columns (one with a count of successes and one with a count of failures), and we don’t want to show empty factor levels. Let’s remove the two rows we don’t need and transpose the table:

 field.goals.table.t <- t(field.goals.table[3:4, ])
 field.goals.table.t
##       
##        FG good FG no
##   Both      53    14
##   In       152    24
##   Out      582   125

Now, we’re ready to see if there is a statistically significant difference in success between the three groups. We can simply call prop.test on the field.goals.table.t object to check:

 prop.test(field.goals.table.t)
## 
##  3-sample test for equality of proportions without continuity
##  correction
## 
## data:  field.goals.table.t
## X-squared = 2.3298, df = 2, p-value = 0.312
## alternative hypothesis: two.sided
## sample estimates:
##    prop 1    prop 2    prop 3 
## 0.7910448 0.8636364 0.8231966

As you can see, the difference in success rates between stadium types is not statistically significant.

3.2.2 Binomial tests

  • Often, an experiment consists of a series of identical trials, each of which has only two outcomes. For example, suppose that you wanted to test the hypothesis that the probability that a coin would land on heads was 0.5. You might design an experiment where you flipped the coin 50 times and counted the number of heads. Each coin flip is an example of a Bernoulli trial. The distribution of the number of heads is given by the binomial distribution.

R includes a function binom.test for evaluating such a trial to determine whether to accept or reject the hypothesis.

 binom.test(c(480, 520), p = 0.5)
## 
##  Exact binomial test
## 
## data:  c(480, 520)
## number of successes = 480, number of trials = 1000, p-value =
## 0.2174
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4486329 0.5114851
## sample estimates:
## probability of success 
##                   0.48
 binom.test(c(682, 243), p = 3/4)
## 
##  Exact binomial test
## 
## data:  c(682, 243)
## number of successes = 682, number of trials = 925, p-value =
## 0.3825
## alternative hypothesis: true probability of success is not equal to 0.75
## 95 percent confidence interval:
##  0.7076683 0.7654066
## sample estimates:
## probability of success 
##              0.7372973
 x <- rbinom(1000, 1, 0.3)
 binom.test(sum(x), 1000, p = 0.5)
## 
##  Exact binomial test
## 
## data:  sum(x) and 1000
## number of successes = 286, number of trials = 1000, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2581564 0.3151096
## sample estimates:
## probability of success 
##                  0.286

3.2.3 Tabular data tests

  • A common problem is to look at a table of data and determine whether there is a relationship between two categorical variables. If there were no relationship, the two variables would be statistically independent. In these tests, the null hypothesis is that the two variables are independent; the alternative hypothesis is that they are not independent.

  • Tables of data often come up in experimental contexts: there is one column of data from a test population and one from a control population. In this context, the analyst often wants to calculate the probability that the two sets of data could have come from the same population (which would imply the same proportions in each). This is an equivalent problem, so the same test functions can be used.

To do this in R, we can use the fisher.test or chisq.test function. For example:

A British woman claimed to be able to distinguish whether milk or tea was added to the cup first. To test, she was given 8 cups of tea, in four of which milk was added first. The null hypothesis is that there is no association between the true order of pouring and the woman’s guess, the alternative that there is a positive association (that the odds ratio is greater than 1).

TeaTasting <-
matrix(c(3, 1, 1, 3),
       nrow = 2,
       dimnames = list(Guess = c("Milk", "Tea"),
                       Truth = c("Milk", "Tea")))
TeaTasting
##       Truth
## Guess  Milk Tea
##   Milk    3   1
##   Tea     1   3
fisher.test(TeaTasting)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  TeaTasting
## p-value = 0.4857
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##    0.2117329 621.9337505
## sample estimates:
## odds ratio 
##   6.408309
chisq.test(TeaTasting)
## Warning in chisq.test(TeaTasting): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  TeaTasting
## X-squared = 0.5, df = 1, p-value = 0.4795

As a second example, let’s use the 2006 births data set. We will take a look at the number of male and female babies delivered during July 2006, by delivery method. We’ll take a subset of births during July where the delivery method was known and then tabulate the results:

 births.july.2006 <-births2006.smpl[births2006.smpl$DMETH_REC!="Unknown"
                                    & births2006.smpl$DOB_MM==7, ]
 method.and.sex <- table(births.july.2006$SEX,
                   as.factor(as.character(births.july.2006$DMETH_REC)))
 method.and.sex
##    
##     C-section Vaginal
##   F      5326   12622
##   M      6067   13045

Let’s check whether this difference is statistically significant: could the difference be due to chance, or is there evidence that these two variables (delivery method and sex) are not independent?

 fisher.test(method.and.sex)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  method.and.sex
## p-value = 1.604e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.8678345 0.9485129
## sample estimates:
## odds ratio 
##  0.9072866
 chisq.test(method.and.sex)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  method.and.sex
## X-squared = 18.528, df = 1, p-value = 1.675e-05

As a third example, let’s look only at twin births. (Note that each record represents a single birth, not a single pregnancy.)

 twins.2006 <- births2006.smpl[births2006.smpl$DPLURAL=="2 Twin" &
                               births2006.smpl$DMETH_REC !="Unknown",]
 method.and.sex.twins <- table(twins.2006$SEX,
                          as.factor(as.character(twins.2006$DMETH_REC)))
 method.and.sex.twins
##    
##     C-section Vaginal
##   F      4924    1774
##   M      5076    1860
 fisher.test(method.and.sex.twins)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  method.and.sex.twins
## p-value = 0.67
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9420023 1.0981529
## sample estimates:
## odds ratio 
##   1.017083
 chisq.test(method.and.sex.twins)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  method.and.sex.twins
## X-squared = 0.17451, df = 1, p-value = 0.6761
  • The chisq.test function can also perform tests on multidimensional tables. As an example, let’s build a table showing the number of births by day and month:
 births2006.bydayandmonth <- table(births2006.smpl$DOB_WK,
                                   births2006.smpl$DOB_MM)
 births2006.bydayandmonth
##    
##        1    2    3    4    5    6    7    8    9   10   11   12
##   1 3645 2930 2965 3616 2969 3036 3976 3160 3270 3964 2999 3744
##   2 5649 4737 4779 4853 5712 5033 6263 5127 4850 6167 5043 4544
##   3 6265 5293 5251 5297 6472 5178 5149 7225 5805 6887 5619 5334
##   4 5131 5280 6486 5173 6496 5540 5499 7011 5725 5445 6838 5666
##   5 5127 5271 6574 5162 5347 6863 5780 6945 5822 5538 6165 5570
##   6 4830 5305 6330 5042 4975 6622 5760 5530 7027 5256 5079 6624
##   7 3295 3392 3408 4185 3364 3464 4751 3686 4669 3564 3509 4396

As above, let’s check whether this distribution could have arisen by chance, under the null hypothesis that the day of the week and the month are independent:

 chisq.test(births2006.bydayandmonth)
## 
##  Pearson's Chi-squared test
## 
## data:  births2006.bydayandmonth
## X-squared = 4729.6, df = 66, p-value < 2.2e-16

3.2.4 Distribution-free tabular data tests

  • The Friedman rank sum test is a distribution-free relative of the chi-squared test. In R, this is implemented through the friedman.test function.

As examples, let’s look at some of the same tables we looked at above:

  friedman.test(method.and.sex.twins)
## 
##  Friedman rank sum test
## 
## data:  method.and.sex.twins
## Friedman chi-squared = 2, df = 1, p-value = 0.1573

Just like the chi-squared test, the Friedman rank sum test does not find a statistically significant association, so we cannot reject the hypothesis that the two variables are independent.

4 Power tests

  • When designing an experiment, it’s often helpful to know how much data you need to collect to get a statistically significant sample (or, alternatively, the maximum significance of results that can be calculated from a given amount of data). R provides a set of functions to help you calculate these amounts.

4.1 Experimental design example

  • Suppose that you want to test the efficacy of a new drug for treating depression. A common score used to measure depression is the Hamilton Rating Scale for Depression (HAMD). This measure varies from 0 to 48, where higher values indicate increased depression. Let’s consider two different experimental design questions. First, suppose that you had collected 50 subjects for the study and split them into two groups of 25 people each. What difference in HAMD scores would you need to observe in order for the results to be considered statistically significant?

We assume a standard deviation of 8.9 for this experiment. We’ll also assume that we want a power of .95 for the experiment (meaning that the probability of a Type II error, failing to detect a true difference of the specified size, is .05). To calculate the minimum detectable difference in R, we could use the following expression:

  power.t.test(power=.95, sig.level=.05, sd=8.9, n=25)
## 
##      Two-sample t test power calculation 
## 
##               n = 25
##           delta = 9.26214
##              sd = 8.9
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

According to the output, the difference in means between the two groups would need to be at least 9.26214 to be detected with 95% power at this significance level. For the second design question, suppose instead that we enrolled 100 subjects and split them into two groups of 50 each; the detectable difference shrinks:

  power.t.test(power=.95, sig.level=.05, sd=8.9, n=50)
## 
##      Two-sample t test power calculation 
## 
##               n = 50
##           delta = 6.480487
##              sd = 8.9
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

As you can see, the power functions can be very useful for designing an experiment. They can help you to estimate, in advance, how large a difference you need to see between groups to get statistically significant results.

4.2 t-test design

  • If you are designing an experiment where you will use a t-test to check the significance of the results (typically, an experiment where you calculate the mean value of a random variable for a “test” population and a “control” population), then you can use the power.t.test function to help design the experiment. For example
 power.t.test(power = .90, delta = 1)
## 
##      Two-sample t test power calculation 
## 
##               n = 22.0211
##           delta = 1
##              sd = 1
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
 power.t.test(power = .95, delta = 1)
## 
##      Two-sample t test power calculation 
## 
##               n = 26.98922
##           delta = 1
##              sd = 1
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

4.3 Proportion test design

  • If you are designing an experiment where you will be measuring a proportion, you can use the power.prop.test function:
power.prop.test(n = NULL, p1 = NULL, p2 = NULL, sig.level = 0.05,
                power = NULL,
                alternative = c("two.sided", "one.sided"),
                strict = FALSE, tol = .Machine$double.eps^0.25)

For this function, \(n\) specifies the number of observations (per group), \(p_1\) is the probability of success in one group, and \(p_2\) is the probability of success in the other group.

 power.prop.test(n = 50, p1 = .50, p2 = .75) 
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 50
##              p1 = 0.5
##              p2 = 0.75
##       sig.level = 0.05
##           power = 0.7401659
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
 power.prop.test(n = 100, p1 = .50, p2 = .75)
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 100
##              p1 = 0.5
##              p2 = 0.75
##       sig.level = 0.05
##           power = 0.9600175
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
 power.prop.test(n = 300, p1 = .50, p2 = .75)
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 300
##              p1 = 0.5
##              p2 = 0.75
##       sig.level = 0.05
##           power = 0.9999969
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
 power.prop.test(p1 = .50, p2 = .75, power = .90)
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 76.70693
##              p1 = 0.5
##              p2 = 0.75
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
 power.prop.test(p1 = .50, p2 = .75, power = .95)
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 94.46783
##              p1 = 0.5
##              p2 = 0.75
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

4.4 ANOVA test design

  • If you are designing an experiment where you will be using ANOVA, you can use the power.anova.test function. For example
 # ?power.anova.test
 power.anova.test(groups = 4, n = 5, between.var = 1, within.var = 3)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 4
##               n = 5
##     between.var = 1
##      within.var = 3
##       sig.level = 0.05
##           power = 0.3535594
## 
## NOTE: n is number in each group
 power.anova.test(groups = 4, n = 50, between.var = 1, within.var = 3)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 4
##               n = 50
##     between.var = 1
##      within.var = 3
##       sig.level = 0.05
##           power = 0.9999952
## 
## NOTE: n is number in each group
power.anova.test(groups = 4, between.var = 1, within.var = 3,
                 power = .80)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 4
##               n = 11.92613
##     between.var = 1
##      within.var = 3
##       sig.level = 0.05
##           power = 0.8
## 
## NOTE: n is number in each group
power.anova.test(groups = 4, between.var = 1, within.var = 3,
                 power = .95)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 4
##               n = 18.18245
##     between.var = 1
##      within.var = 3
##       sig.level = 0.05
##           power = 0.95
## 
## NOTE: n is number in each group

5 Regression models

  • A regression model shows how a continuous value (called the response variable, or the dependent variable) is related to a set of other values (called the predictors, stimulus variables, or independent variables). Often, a regression model is used to predict values where they are unknown. For example, warfarin is a drug commonly used as a blood thinner or anticoagulant. A doctor might use a regression model to predict the correct dose of warfarin to give a patient based on several known variables about the patient (such as the patient’s weight). Another example of a regression model might be for marketing financial products. An analyst might estimate the average balance of a credit card customer (which, in turn, affects the expected revenue from that customer).

  • Sometimes, a regression model is simply used to explain a phenomenon, but not to actually predict values. For example, a scientist might suspect that weight is correlated to consumption of certain types of foods, but wants to adjust for a variety of factors, including age, exercise, genetics (and, hopefully, other factors). The scientist could use a regression model to help show the relationship between weight and food consumed by including other variables in the regression. Models can be used for many other purposes, including visualizing trends, analysis of variance tests, and testing variable significance.

5.1 Examples for linear models

  • A linear regression assumes that there is a linear relationship between the response variable and the predictors. Specifically, a linear regression assumes that a response variable \(y\) is a linear function of a set of predictor variables \(x_1, \cdots,x_p.\)
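In other words, the model assumes \[\begin{align} y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \varepsilon, \end{align}\] where the coefficients \(\beta_0,\ldots,\beta_p\) are estimated from the data and \(\varepsilon\) is an error term.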

  • The first example
    • Let’s look at the function lm through a simple regression example. The dataset women in the base installation provides the height and weight for a set of 15 women ages 30 to 39. Suppose you want to predict weight from height. Having an equation for predicting weight from height can help you to identify overweight or underweight individuals. The analysis is provided in the following listing.
fit <- lm(weight ~ height, data=women)
coef(fit)
## (Intercept)      height 
##   -87.51667     3.45000
summary(fit)
## 
## Call:
## lm(formula = weight ~ height, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
predict(fit)
##        1        2        3        4        5        6        7        8 
## 112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 
##        9       10       11       12       13       14       15 
## 140.1833 143.6333 147.0833 150.5333 153.9833 157.4333 160.8833

From the output, we can see that the prediction equation is \[\widehat{Weight} = -87.52 + 3.45 \times Height.\]
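Because the fitted model can also be used for prediction on new data, here is a small sketch (the heights 63 and 70 are arbitrary illustrative values, and output is omitted):

  # predicted weights (in pounds) for two new heights (in inches)
  predict(fit, newdata = data.frame(height = c(63, 70)))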

  plot(women$height,women$weight, 
  xlab="Height (in inches)", 
  ylab="Weight (in pounds)")
  abline(fit)

  • The second example
    • We’re going to look at how different metrics predict the runs scored by a baseball team. Let’s start by loading the data for every team between 2000 and 2008.
 library(nutshell)
 library(lattice)
 data(team.batting.00to08)

Let’s look at scatter plots of runs versus each other variable, so that we can see which variables are likely to be most important. We’ll create a data frame for plotting, using the make.groups function:

  attach(team.batting.00to08)
 forplot <- make.groups(
   singles = data.frame(value=singles, teamID, yearID, runs), 
   doubles = data.frame(value=doubles, teamID, yearID, runs), 
   triples = data.frame(value=triples, teamID, yearID, runs), 
   homeruns = data.frame(value=homeruns, teamID, yearID, runs), 
   walks = data.frame(value=walks, teamID, yearID, runs), 
   stolenbases = data.frame(value=stolenbases, teamID, yearID, runs), 
   caughtstealing = data.frame(value=caughtstealing, teamID, yearID,
                               runs),
   hitbypitch = data.frame(value=hitbypitch, teamID, yearID, runs), 
   sacrificeflies = data.frame(value=sacrificeflies, teamID, yearID,
                               runs))
 detach(team.batting.00to08)

Now, we’ll generate the scatter plots using the xyplot function:

xyplot(runs~value|which, data=forplot, 
      scales=list(relation='free'), 
      pch=19, cex=.2, 
      strip=strip.custom(strip.levels=TRUE, 
      horizontal=TRUE, 
      par.strip.text=list(cex=.8)))

5.1.1 Fitting a model

  • Let’s fit a linear model to the data and assign it to the variable runs.lm. We’ll use the lm function, which fits a linear model using ordinary least squares:
runs.lm <- lm(formula = runs~singles+doubles+triples+homeruns+walks+
             hitbypitch+sacrificeflies+stolenbases+caughtstealing,
             data=team.batting.00to08)

5.1.2 Getting information about a model

  • R doesn’t clutter your screen with lots of information you might not want. Instead, R includes a large set of functions for printing information about model objects. This section describes the functions for getting information about lm objects. Many of these functions may also be used with other types of models; see the help files for more information.

5.1.2.1 Viewing the model

  • For most model functions (including lm), the best place to start is with the print method. If you are working in the R console, simply type the name of the returned object to see the results of print:
 runs.lm
## 
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns + 
##     walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing, 
##     data = team.batting.00to08)
## 
## Coefficients:
##    (Intercept)         singles         doubles         triples  
##     -507.16020         0.56705         0.69110         1.15836  
##       homeruns           walks      hitbypitch  sacrificeflies  
##        1.47439         0.30118         0.37750         0.87218  
##    stolenbases  caughtstealing  
##        0.04369        -0.01533
  • Use the formula function to show the formula used to fit the model:
 formula(runs.lm)
## runs ~ singles + doubles + triples + homeruns + walks + hitbypitch + 
##     sacrificeflies + stolenbases + caughtstealing
  • Use the coef function to get the list of coefficients for a model object:
  coef(runs.lm)
##    (Intercept)        singles        doubles        triples       homeruns 
##  -507.16019759     0.56704867     0.69110420     1.15836091     1.47438916 
##          walks     hitbypitch sacrificeflies    stolenbases caughtstealing 
##     0.30117665     0.37749717     0.87218094     0.04369407    -0.01533245
  • The function summary can be used to get a summary of a linear model object.
 summary(runs.lm)
## 
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns + 
##     walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing, 
##     data = team.batting.00to08)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.902 -11.828  -0.419  14.658  61.874 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -507.16020   32.34834 -15.678  < 2e-16 ***
## singles           0.56705    0.02601  21.801  < 2e-16 ***
## doubles           0.69110    0.05922  11.670  < 2e-16 ***
## triples           1.15836    0.17309   6.692 1.34e-10 ***
## homeruns          1.47439    0.05081  29.015  < 2e-16 ***
## walks             0.30118    0.02309  13.041  < 2e-16 ***
## hitbypitch        0.37750    0.11006   3.430 0.000702 ***
## sacrificeflies    0.87218    0.19179   4.548 8.33e-06 ***
## stolenbases       0.04369    0.05951   0.734 0.463487    
## caughtstealing   -0.01533    0.15550  -0.099 0.921530    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.21 on 260 degrees of freedom
## Multiple R-squared:  0.9144, Adjusted R-squared:  0.9114 
## F-statistic: 308.6 on 9 and 260 DF,  p-value: < 2.2e-16

5.1.2.2 Predicting values using a model

  • Use the residuals function to get the vector of residuals from a linear model fit:
 residuals(runs.lm)[1:12]
##          1          2          3          4          5          6 
## -35.572347 -10.526473  -5.920721  57.798284  -4.279294  -4.668995 
##          7          8          9         10         11         12 
##  29.181937  -7.419778   8.319662  52.299908  32.910063  22.377282
  • Use the fitted function to get a vector of fitted values:
 fitted(runs.lm)[1:16]
##        1        2        3        4        5        6        7        8 
## 899.5723 804.5265 797.9207 920.2017 954.2793 827.6690 849.8181 755.4198 
##        9       10       11       12       13       14       15       16 
## 862.6803 894.7001 874.0899 710.6227 852.8527 867.1590 785.1005 795.8503

5.1.2.3 Analyzing the fit

  • Use the confint function to compute confidence intervals for the coefficients in the fitted model:
 confint(runs.lm)
##                        2.5 %       97.5 %
## (Intercept)    -570.85828008 -443.4621151
## singles           0.51583022    0.6182671
## doubles           0.57449582    0.8077126
## triples           0.81752968    1.4991921
## homeruns          1.37432941    1.5744489
## walks             0.25570041    0.3466529
## hitbypitch        0.16077399    0.5942203
## sacrificeflies    0.49451857    1.2498433
## stolenbases      -0.07349342    0.1608816
## caughtstealing   -0.32152716    0.2908623
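
These are the usual \(t\)-based intervals, so we can reproduce any row by hand from the estimate and standard error reported by summary(runs.lm); a quick sketch for the singles coefficient:

 # estimate +/- t(0.975, df=260) * std. error, values taken from summary(runs.lm)
 0.56705 + c(-1, 1) * qt(0.975, df=260) * 0.02601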
  • Use the influence function to compute the influence that each observation has on the fitted model:
 influence(runs.lm)
  • Try influence.measures for friendlier output:
 influence.measures(runs.lm)
  • Use the anova function to get analysis of variance statistics. For a single linear model, the anova.lm method is used (anova.lmlist is used when several models are compared).
 anova(runs.lm)
                 Df        Sum Sq       Mean Sq       F value     Pr(>F)
singles           1  2.157547e+05  2.157547e+05   400.4654729  0.0000000
doubles           1  3.565879e+05  3.565879e+05   661.8679941  0.0000000
triples           1  2.372170e+02  2.372170e+02     0.4403019  0.5075647
homeruns          1  7.900513e+05  7.900513e+05  1466.4255566  0.0000000
walks             1  1.143772e+05  1.143772e+05   212.2970903  0.0000000
hitbypitch        1  7.396397e+03  7.396397e+03    13.7285592  0.0002580
sacrificeflies    1  1.172572e+04  1.172572e+04    21.7642718  0.0000049
stolenbases       1  3.573245e+02  3.573245e+02     0.6632351  0.4161654
caughtstealing    1  5.238073e+00  5.238073e+00     0.0097225  0.9215298
Residuals       260  1.400776e+05  5.387599e+02            NA         NA

Interestingly, it appears from this table that triples, stolen bases, and times caught stealing are not statistically significant.
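
Keep in mind that this table uses sequential (Type I) sums of squares, so each term is tested given only the terms listed above it. To test a group of terms jointly, one common alternative is to compare nested models with anova; a sketch that drops the two base-running terms (the object name runs.lm.small is just illustrative):

 # refit without stolenbases and caughtstealing, then compare the nested models with an F test
 runs.lm.small <- lm(runs ~ singles + doubles + triples + homeruns + walks +
                     hitbypitch + sacrificeflies, data=team.batting.00to08)
 anova(runs.lm.small, runs.lm)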

  • You can also view the effects from a fitted model. The effects are the uncorrelated single degree of freedom values obtained by projecting the data onto the successive orthogonal subspaces generated by the QR-decomposition during the fitting process. To obtain a vector of orthogonal effects from the model, use the effects function:
effects(runs.lm)[1:20]
##    (Intercept)        singles        doubles        triples       homeruns 
##  -1.270491e+04  -4.644941e+02  -5.971498e+02   1.540185e+01  -8.888483e+02 
##          walks     hitbypitch sacrificeflies    stolenbases caughtstealing 
##  -3.381969e+02   8.600231e+01   1.082853e+02  -1.890303e+01  -2.288684e+00 
##                                                                            
##   4.071693e+01   2.903559e+01   2.025866e+00  -4.445514e-01   3.013846e+00 
##                                                                            
##   1.931702e+01  -3.349570e-01  -3.185715e+01   4.476431e+01  -1.258820e+01
  • To calculate the variance-covariance matrix from the linear model object, use the vcov function:
 vcov(runs.lm)
##                 (Intercept)       singles       doubles       triples
## (Intercept)    1046.4149572 -6.275356e-01 -6.908905e-01 -0.8115627984
## singles          -0.6275356  6.765565e-04 -1.475026e-04  0.0001538296
## doubles          -0.6908905 -1.475026e-04  3.506798e-03 -0.0013459187
## triples          -0.8115628  1.538296e-04 -1.345919e-03  0.0299591843
## homeruns         -0.3190194  2.314669e-04 -3.940172e-04  0.0011510663
## walks            -0.2515630  7.950878e-05 -9.902388e-05  0.0004174548
## hitbypitch       -0.9002974  3.385518e-04 -4.090707e-04  0.0018360831
## sacrificeflies    1.6870020 -1.723732e-03 -2.253712e-03 -0.0051709718
## stolenbases       0.2153275 -3.041450e-04  2.871078e-04 -0.0009794480
## caughtstealing   -1.4370890  3.126387e-04  1.466032e-03 -0.0016038175
##                     homeruns         walks    hitbypitch sacrificeflies
## (Intercept)    -3.190194e-01 -2.515630e-01 -0.9002974059   1.6870019518
## singles         2.314669e-04  7.950878e-05  0.0003385518  -0.0017237324
## doubles        -3.940172e-04 -9.902388e-05 -0.0004090707  -0.0022537124
## triples         1.151066e-03  4.174548e-04  0.0018360831  -0.0051709718
## homeruns        2.582082e-03 -4.007590e-04 -0.0008183475  -0.0005078943
## walks          -4.007590e-04  5.333599e-04  0.0002219440  -0.0010962381
## hitbypitch     -8.183475e-04  2.219440e-04  0.0121132852  -0.0011315622
## sacrificeflies -5.078943e-04 -1.096238e-03 -0.0011315622   0.0367839752
## stolenbases    -2.041656e-06 -1.400052e-04 -0.0001197102  -0.0004636454
## caughtstealing  3.469784e-04  6.008766e-04  0.0001742039  -0.0024880710
##                  stolenbases caughtstealing
## (Intercept)     2.153275e-01  -1.4370889812
## singles        -3.041450e-04   0.0003126387
## doubles         2.871078e-04   0.0014660316
## triples        -9.794480e-04  -0.0016038175
## homeruns       -2.041656e-06   0.0003469784
## walks          -1.400052e-04   0.0006008766
## hitbypitch     -1.197102e-04   0.0001742039
## sacrificeflies -4.636454e-04  -0.0024880710
## stolenbases     3.541716e-03  -0.0050935339
## caughtstealing -5.093534e-03   0.0241794596
  • Use the deviance function to return the deviance of the fitted model. (For a linear model such as runs.lm, this value is simply the residual sum of squares.)
 deviance(runs.lm)
## [1] 140077.6
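
As a quick check, summing the squared residuals reproduces this value:

 # for an ordinary linear model, the deviance equals the residual sum of squares
 sum(residuals(runs.lm)^2)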

5.1.3 Refining the model

  • Often, it is more convenient to use the update function to refit a model. This can save you some typing if you are using R interactively, and it can also save computation time for large data sets. The update function refits a model after a change to the formula (perhaps adding or subtracting a term) or even a change to the data frame.

For example, let’s fit a slightly different model to the data above. We’ll omit the variable sacrificeflies and add 0 to the formula (which tells lm to fit the model without an intercept):

 runs.lm2 <- update(runs.lm,formula=runs ~ singles + doubles +
                       triples + homeruns + walks + hitbypitch +
                       stolenbases + caughtstealing + 0)
 runs.lm2
## 
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns + 
##     walks + hitbypitch + stolenbases + caughtstealing - 1, data = team.batting.00to08)
## 
## Coefficients:
##        singles         doubles         triples        homeruns  
##        0.29823         0.41280         0.95664         1.31945  
##          walks      hitbypitch     stolenbases  caughtstealing  
##        0.21352        -0.07471         0.18828        -0.70334
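
An equivalent, and often shorter, way to write this update uses the . placeholder, which stands for everything already in the old formula. A sketch (the object name runs.lm2b is just illustrative):

 # '.' keeps the existing terms; '- sacrificeflies' drops that term and
 # '- 1' (equivalent to '+ 0') removes the intercept
 runs.lm2b <- update(runs.lm, . ~ . - sacrificeflies - 1)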

5.2 Details about the lm function

  • Now that we’ve seen a simple example of how models work in R, let’s describe in detail what lm does and how you can control it. A linear regression model is appropriate when the response variable (the thing that you want to predict) can be estimated from a linear function of the predictor variables (the information that you know). Technically, we assume that: \[\begin{align} y = c_0+c_1x_1 + \cdots + c_px_p + \varepsilon. \end{align}\]

    where \(y\) is the response variable, \(x_1,\cdots,x_p\) are the predictor variables (or predictors), \(c_1,\cdots,c_p\) are the coefficients for the predictor variables, \(c_0\) is the intercept, and \(\varepsilon\) is the error term.

  • Suppose that you have a matrix of observed predictor variables \(X_{n\times p}\) and a vector of response variables \(Y=(Y_1,\cdots,Y_n)\). We have assumed a linear model, so given a set of coefficients \(c_0\) and \(c=( c_1, \cdots, c_p)^T\), we can calculate a set of estimates \(\hat{Y}\) for the input data \(X\) by calculating \(\hat{Y}=c_0 + Xc\). The differences between the estimates \(\hat{Y}\) and the actual values \(Y\) are called the residuals. You can think of the residuals as a measure of the prediction error; small residuals mean that the predicted values are close to the actual values. We assume that the expected difference between the actual response values and the predicted values (the error term in the model) is 0. This is important to remember: at best, a model is probabilistic.

  • Our goal is to find the set of coefficients \(c_0\) and \(c\) that does the best job of estimating \(Y\) given \(X\); we’d like the estimates \(\hat{Y}\) to be as close as possible to \(Y\). In a classical linear regression model, we find coefficients \(c_0\) and \(c\) that minimize the sum of squared differences between the estimates \(\hat{Y}\) and the observed values \(Y\). Specifically, we want to find values for \(c_0\) and \(c\) that minimize:

\[\begin{align} RSS(c_0, c) = \sum_{i=1}^n\left(Y_i-\hat{Y}_i\right)^2. \end{align}\]

This is called the least squares method for regression, and it is what the lm function in R uses to estimate the coefficients in a linear model.
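
As a small numerical sketch (reusing the women fit from the first example; the object name X is just illustrative), the residuals are simply the observed values minus the fitted values, and the least squares coefficients (intercept included) can be obtained directly from the normal equations \(X^TXc = X^TY\); both agree with what lm reports:

 # residuals are the observed responses minus the model's predictions
 all.equal(as.numeric(residuals(fit)),
           as.numeric(women$weight - predict(fit)))
 # solve the normal equations (X'X) c = X'Y and compare to coef(fit)
 X <- model.matrix(weight ~ height, data=women)
 solve(crossprod(X), crossprod(X, women$weight))
 coef(fit)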

5.2.1 Assumptions of least squares regression

  • Technically, linear regression is not always appropriate. Ordinary least squares (OLS) regression (implemented through lm) is only guaranteed to work when certain properties of the training data are true. Here are the key assumptions:

    1. Linearity. We assume that the response variable \(y\) is a linear function of the predictor variables \(x_1,\cdots,x_p\).
    2. Full rank. There is no exact linear relationship among the predictor variables. (Equivalently, the predictor matrix is not singular.)
    3. Exogeneity of the predictor variables. The expected value of the error term \(\varepsilon\) is 0 for all possible values of the predictor variables.
    4. Homoscedasticity. The variance of the error term \(\varepsilon\) is constant and is not correlated with the predictor variables.
    5. Nonautocorrelation. In a sequence of observations, the error terms are not correlated with each other.
    6. Exogenously generated data. The predictor variables \(x_1,\cdots,x_p\) are generated independently of the process that generates the error term \(\varepsilon\).
    7. Normality. The error term \(\varepsilon\) is normally distributed with mean 0 and standard deviation \(\sigma\).
  • By the way, it’s perfectly OK for there to be a nonlinear relationship between some of the predictor variables. Suppose that one of the variables is age. You could add \(age^2\), \(\log(age)\), or other nonlinear mathematical expressions using age to the model and not violate the assumptions above. You are effectively defining a set of new predictor variables: \(w_1 = age\), \(w_2 = age^2\), \(w_3 = \log(age)\). This doesn’t violate the linearity assumption (because the model is still a linear function of the predictor variables) or the full rank assumption (as long as the relationship between the new variables is not linear).
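
A minimal sketch of this idea, using the built-in mtcars data (the choice of hp as the predictor and the particular transformations are just illustrative; the object name fit.nl is hypothetical):

 # polynomial and log transforms of a predictor keep the model linear in the coefficients;
 # I() protects the arithmetic inside the formula
 fit.nl <- lm(mpg ~ hp + I(hp^2) + log(hp), data=mtcars)
 coef(fit.nl)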

  • If you want to be careful, you can use test functions to check if the OLS assumptions apply:
    • You can test for heteroscedasticity using the function ncvTest in the car (Companion to Applied Regression) package, which implements the Breusch-Pagan test. (Alternatively, you could use the bptest function in the lmtest library, which implements the same test. The lmtest library includes a number of other functions for testing for heteroscedasticity; see the documentation for more details.)
    • You can test for autocorrelation in a model using the function durbinWatsonTest in the car package, which implements the Durbin-Watson test. You can also use the function dwtest in the library lmtest by specifying a formula and a data set. (Alternatively, you could use the function bgtest in the lmtest package, which implements the Breusch-Godfrey test. This function also tests for higher-order disturbances.)
    • You can check that the predictor matrix is not singular by using the singular.ok=FALSE argument in lm.
  • Incidentally, the runs.lm model that we fit to the baseball data above shows no evidence of heteroscedasticity:

 # install.packages('car')
 library(car)
## Loading required package: carData
 ncvTest(runs.lm)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.411893, Df = 1, p = 0.23474

Nor is there a problem with autocorrelation:

 durbinWatsonTest(runs.lm)
##  lag Autocorrelation D-W Statistic p-value
##    1     0.003318923      1.983938   0.834
##  Alternative hypothesis: rho != 0

Or with singularity:

runs.lm <- lm(formula = runs~singles+doubles+triples+homeruns+walks+
             hitbypitch+sacrificeflies+stolenbases+caughtstealing,
             data = team.batting.00to08, singular.ok=FALSE)
runs.lm
## 
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns + 
##     walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing, 
##     data = team.batting.00to08, singular.ok = FALSE)
## 
## Coefficients:
##    (Intercept)         singles         doubles         triples  
##     -507.16020         0.56705         0.69110         1.15836  
##       homeruns           walks      hitbypitch  sacrificeflies  
##        1.47439         0.30118         0.37750         0.87218  
##    stolenbases  caughtstealing  
##        0.04369        -0.01533
summary(runs.lm)
## 
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns + 
##     walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing, 
##     data = team.batting.00to08, singular.ok = FALSE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.902 -11.828  -0.419  14.658  61.874 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -507.16020   32.34834 -15.678  < 2e-16 ***
## singles           0.56705    0.02601  21.801  < 2e-16 ***
## doubles           0.69110    0.05922  11.670  < 2e-16 ***
## triples           1.15836    0.17309   6.692 1.34e-10 ***
## homeruns          1.47439    0.05081  29.015  < 2e-16 ***
## walks             0.30118    0.02309  13.041  < 2e-16 ***
## hitbypitch        0.37750    0.11006   3.430 0.000702 ***
## sacrificeflies    0.87218    0.19179   4.548 8.33e-06 ***
## stolenbases       0.04369    0.05951   0.734 0.463487    
## caughtstealing   -0.01533    0.15550  -0.099 0.921530    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.21 on 260 degrees of freedom
## Multiple R-squared:  0.9144, Adjusted R-squared:  0.9114 
## F-statistic: 308.6 on 9 and 260 DF,  p-value: < 2.2e-16
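
The lmtest alternatives mentioned above are used in much the same way; a sketch, assuming the lmtest package is installed:

 # install.packages('lmtest')
 library(lmtest)
 bptest(runs.lm)   # Breusch-Pagan test for heteroscedasticity
 dwtest(runs.lm)   # Durbin-Watson test for autocorrelation
 bgtest(runs.lm)   # Breusch-Godfrey test (can also check higher-order autocorrelation)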

5.3 Subset selection

  • Modeling functions like lm will include every variable specified in the formula, calculating a coefficient for each one. Unfortunately, this means that lm may calculate coefficients for variables that aren’t needed. You can manually tune a model using diagnostics like summary and lm.influence. However, you can also use some other statistical techniques to reduce the effect of insignificant variables or remove them from a model altogether.

5.3.1 Stepwise variable selection

  • A simple technique for selecting the most important variables is stepwise variable selection. The stepwise algorithm works by repeatedly adding or removing variables from the model, trying to “improve” the model at each step. When the algorithm can no longer improve the model by adding or subtracting variables, it stops and returns the new (and usually smaller) model.

  • Note that “improvement” does not just mean reducing the residual sum of squares (RSS) for the fitted model. Adding an additional variable to a model will not increase the RSS (see a statistics book for an explanation of why), but it does increase model complexity. Typically, AIC (Akaike’s information criterion) is used to measure the value of each additional variable. The AIC is defined as \[AIC = -2\log(L) + k \cdot edf,\] where \(L\) is the likelihood, \(edf\) is the equivalent degrees of freedom (the number of estimated parameters for an ordinary linear model), and \(k = 2\) for the classical AIC.
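
The step function computes this quantity with extractAIC; a quick sketch for runs.lm (the result should match the “Start: AIC” line in the output below):

 # extractAIC returns the equivalent degrees of freedom and the AIC used by step;
 # for an lm this is (up to an additive constant) n*log(RSS/n) + k*edf with k = 2
 extractAIC(runs.lm)
 n <- nobs(runs.lm)
 n * log(deviance(runs.lm) / n) + 2 * length(coef(runs.lm))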

  • In R, you perform stepwise selection through the step function:

  runs.lm.st <- step(runs.lm)
## Start:  AIC=1707.91
## runs ~ singles + doubles + triples + homeruns + walks + hitbypitch + 
##     sacrificeflies + stolenbases + caughtstealing
## 
##                  Df Sum of Sq    RSS    AIC
## - caughtstealing  1         5 140083 1705.9
## - stolenbases     1       290 140368 1706.5
## <none>                        140078 1707.9
## - hitbypitch      1      6338 146416 1717.9
## - sacrificeflies  1     11142 151219 1726.6
## - triples         1     24130 164207 1748.8
## - doubles         1     73379 213457 1819.7
## - walks           1     91626 231703 1841.8
## - singles         1    256054 396132 1986.6
## - homeruns        1    453575 593653 2095.8
## 
## Step:  AIC=1705.92
## runs ~ singles + doubles + triples + homeruns + walks + hitbypitch + 
##     sacrificeflies + stolenbases
## 
##                  Df Sum of Sq    RSS    AIC
## - stolenbases     1       357 140440 1704.6
## <none>                        140083 1705.9
## - hitbypitch      1      6342 146425 1715.9
## - sacrificeflies  1     11179 151262 1724.7
## - triples         1     24173 164256 1746.9
## - doubles         1     75490 215573 1820.3
## - walks           1     94503 234586 1843.1
## - singles         1    257773 397856 1985.8
## - homeruns        1    454587 594670 2094.3
## 
## Step:  AIC=1704.61
## runs ~ singles + doubles + triples + homeruns + walks + hitbypitch + 
##     sacrificeflies
## 
##                  Df Sum of Sq    RSS    AIC
## <none>                        140440 1704.6
## - hitbypitch      1      6390 146830 1714.6
## - sacrificeflies  1     11726 152166 1724.3
## - triples         1     25688 166129 1748.0
## - doubles         1     76598 217038 1820.1
## - walks           1     94655 235095 1841.7
## - singles         1    270589 411030 1992.6
## - homeruns        1    454230 594670 2092.3
  runs.lm.st
## 
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns + 
##     walks + hitbypitch + sacrificeflies, data = team.batting.00to08, 
##     singular.ok = FALSE)
## 
## Coefficients:
##    (Intercept)         singles         doubles         triples  
##      -506.6389          0.5712          0.6823          1.1789  
##       homeruns           walks      hitbypitch  sacrificeflies  
##         1.4734          0.3018          0.3790          0.8868
  summary(runs.lm.st)
## 
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns + 
##     walks + hitbypitch + sacrificeflies, data = team.batting.00to08, 
##     singular.ok = FALSE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.667 -11.354  -0.668  14.719  61.992 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -506.63888   30.87161 -16.411  < 2e-16 ***
## singles           0.57115    0.02542  22.468  < 2e-16 ***
## doubles           0.68227    0.05707  11.954  < 2e-16 ***
## triples           1.17894    0.17030   6.923 3.41e-11 ***
## homeruns          1.47344    0.05062  29.110  < 2e-16 ***
## walks             0.30178    0.02271  13.289  < 2e-16 ***
## hitbypitch        0.37897    0.10976   3.453 0.000647 ***
## sacrificeflies    0.88679    0.18960   4.677 4.67e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.15 on 262 degrees of freedom
## Multiple R-squared:  0.9142, Adjusted R-squared:  0.9119 
## F-statistic: 398.7 on 7 and 262 DF,  p-value: < 2.2e-16

6 Logistic regression

  • Suppose that you were trying to estimate the probability of a certain outcome (which we’ll call A) for a categorical variable with two values. You could try to predict the probability of A as a linear function of the predictor variables, assuming \(Pr(A) = y = c_0 + c_1x_1 + \cdots + c_px_p\). The problem with this approach is that the value of \(y\) is unconstrained, while probabilities are only valid for values between 0 and 1. A good approach for dealing with this problem is to pick a function for \(y\) that varies between 0 and 1 for all possible predictor values. If we use that function as the link function in a generalized linear model, we can build a model that estimates the probability of different outcomes. That is the idea behind logistic regression.

  • In a logistic regression, the relationship between the predictor variables and the probability that an observation is a member of a given class is given by the logistic function \[Pr(A) = \frac{e^{\eta}}{1+e^{\eta}},\] or, equivalently, on the log-odds (logit) scale, \[\log\left(\frac{Pr(A)}{1-Pr(A)}\right)= \eta = c_0 + c_1x_1 + \cdots + c_px_p.\]

  • In R, logistic regression models are fitted with the glm function, using family=binomial(). A quick sketch of the link function follows.
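
The sketch below shows the link function itself; plogis is R’s built-in version of \(e^{\eta}/(1+e^{\eta})\), so any value of the linear predictor is mapped into \((0, 1)\) (the eta values are just illustrative):

 # the inverse logit maps any real-valued linear predictor to a probability
 eta <- seq(-4, 4, by=2)
 cbind(eta, manual=exp(eta) / (1 + exp(eta)), plogis=plogis(eta))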

For the first example, let’s explore the data on infidelity contained in the data frame Affairs, provided with the AER package.

The data frame Affairs contains 9 variables collected on 601 participants and includes how often the respondent engaged in extramarital sexual intercourse during the past year, as well as their gender, age, years married, whether they had children, their religiousness (on a 5-point scale from 1=anti to 5=very), education, occupation (Hollingshead 7-point classification with reverse numbering), and a numeric self-rating of their marriage (from 1=very unhappy to 5=very happy). Let’s look at some descriptive statistics:

 # install.packages('AER')
 data(Affairs, package="AER")
 summary(Affairs)
##     affairs          gender         age         yearsmarried    children 
##  Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171  
##  1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430  
##  Median : 0.000                Median :32.00   Median : 7.000            
##  Mean   : 1.456                Mean   :32.49   Mean   : 8.178            
##  3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000            
##  Max.   :12.000                Max.   :57.00   Max.   :15.000            
##  religiousness     education       occupation        rating     
##  Min.   :1.000   Min.   : 9.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:14.00   1st Qu.:3.000   1st Qu.:3.000  
##  Median :3.000   Median :16.00   Median :5.000   Median :4.000  
##  Mean   :3.116   Mean   :16.17   Mean   :4.195   Mean   :3.932  
##  3rd Qu.:4.000   3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :20.00   Max.   :7.000   Max.   :5.000

Although the number of indiscretions was recorded, your interest here is in the binary outcome (had an affair/didn’t have an affair). You can transform affairs into a dichotomous factor called ynaffair with the following code:

  Affairs$ynaffair[Affairs$affairs > 0] <- 1
  Affairs$ynaffair[Affairs$affairs == 0] <- 0
  Affairs$ynaffair <- factor(Affairs$ynaffair, 
                             levels=c(0,1), labels=c("No","Yes"))
  table(Affairs$ynaffair)
## 
##  No Yes 
## 451 150

This dichotomous factor can now be used as the outcome variable in a logistic regression model:

 fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children + 
                  religiousness + education + occupation +rating,
                 data=Affairs, family=binomial())

Just like lm, the glm function prints nothing when its result is assigned to a variable. As with lm, we can extract the coefficients with coef and get more detailed results about the model object with the summary method:

coef(fit.full)
##   (Intercept)    gendermale           age  yearsmarried   childrenyes 
##    1.37725816    0.28028665   -0.04425502    0.09477302    0.39767213 
## religiousness     education    occupation        rating 
##   -0.32472063    0.02105086    0.03091971   -0.46845426
summary(fit.full)
## 
## Call:
## glm(formula = ynaffair ~ gender + age + yearsmarried + children + 
##     religiousness + education + occupation + rating, family = binomial(), 
##     data = Affairs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5713  -0.7499  -0.5690  -0.2539   2.5191  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.37726    0.88776   1.551 0.120807    
## gendermale     0.28029    0.23909   1.172 0.241083    
## age           -0.04426    0.01825  -2.425 0.015301 *  
## yearsmarried   0.09477    0.03221   2.942 0.003262 ** 
## childrenyes    0.39767    0.29151   1.364 0.172508    
## religiousness -0.32472    0.08975  -3.618 0.000297 ***
## education      0.02105    0.05051   0.417 0.676851    
## occupation     0.03092    0.07178   0.431 0.666630    
## rating        -0.46845    0.09091  -5.153 2.56e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 675.38  on 600  degrees of freedom
## Residual deviance: 609.51  on 592  degrees of freedom
## AIC: 627.51
## 
## Number of Fisher Scoring iterations: 4
  • From the \(p\)-values for the regression coefficients (last column), you can see that gender, presence of children, education, and occupation may not make a significant contribution to the equation. Thus, we use the step function to reduce the model:
 fit.reduce <- step(fit.full, direction = 'both')
## Start:  AIC=627.51
## ynaffair ~ gender + age + yearsmarried + children + religiousness + 
##     education + occupation + rating
## 
##                 Df Deviance    AIC
## - education      1   609.68 625.68
## - occupation     1   609.70 625.70
## - gender         1   610.89 626.89
## - children       1   611.40 627.40
## <none>               609.51 627.51
## - age            1   615.67 631.67
## - yearsmarried   1   618.34 634.34
## - religiousness  1   622.92 638.92
## - rating         1   636.75 652.75
## 
## Step:  AIC=625.68
## ynaffair ~ gender + age + yearsmarried + children + religiousness + 
##     occupation + rating
## 
##                 Df Deviance    AIC
## - occupation     1   610.15 624.15
## - gender         1   611.29 625.29
## - children       1   611.62 625.62
## <none>               609.68 625.68
## + education      1   609.51 627.51
## - age            1   615.78 629.78
## - yearsmarried   1   618.46 632.46
## - religiousness  1   623.27 637.27
## - rating         1   636.93 650.93
## 
## Step:  AIC=624.15
## ynaffair ~ gender + age + yearsmarried + children + religiousness + 
##     rating
## 
##                 Df Deviance    AIC
## - children       1   611.86 623.86
## <none>               610.15 624.15
## - gender         1   613.41 625.41
## + occupation     1   609.68 625.68
## + education      1   609.70 625.70
## - age            1   616.00 628.00
## - yearsmarried   1   619.07 631.07
## - religiousness  1   623.98 635.98
## - rating         1   637.23 649.23
## 
## Step:  AIC=623.86
## ynaffair ~ gender + age + yearsmarried + religiousness + rating
## 
##                 Df Deviance    AIC
## <none>               611.86 623.86
## + children       1   610.15 624.15
## - gender         1   615.36 625.36
## + education      1   611.45 625.45
## + occupation     1   611.62 625.62
## - age            1   618.05 628.05
## - religiousness  1   625.57 635.57
## - yearsmarried   1   626.23 636.23
## - rating         1   639.93 649.93
 summary(fit.reduce)
## 
## Call:
## glm(formula = ynaffair ~ gender + age + yearsmarried + religiousness + 
##     rating, family = binomial(), data = Affairs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5623  -0.7495  -0.5664  -0.2671   2.3975  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.94760    0.61234   3.181 0.001470 ** 
## gendermale     0.38612    0.20703   1.865 0.062171 .  
## age           -0.04393    0.01806  -2.432 0.015011 *  
## yearsmarried   0.11133    0.02983   3.732 0.000190 ***
## religiousness -0.32714    0.08947  -3.656 0.000256 ***
## rating        -0.46721    0.08928  -5.233 1.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 675.38  on 600  degrees of freedom
## Residual deviance: 611.86  on 595  degrees of freedom
## AIC: 623.86
## 
## Number of Fisher Scoring iterations: 4

To get a vector of fitted values (for a logistic regression, these are the predicted probabilities), use the fitted function:

  pre <- fitted(fit.reduce)
  pre[1:12]
##          4          5         11         16         23         29 
## 0.26337147 0.12233613 0.50397675 0.07785260 0.35318719 0.09267826 
##         44         45         47         49         50         55 
## 0.27067889 0.26439500 0.49222646 0.10810251 0.68776326 0.17017760
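
To see how the predicted probability changes with a single predictor, we can pass new data to predict with type="response". A sketch that varies the marriage rating while holding the other predictors at typical values (the object name testdata and the chosen profile are just illustrative):

 # predicted probability of an affair for each marriage rating, other predictors held fixed
 testdata <- data.frame(rating = 1:5,
                        age = mean(Affairs$age),
                        yearsmarried = mean(Affairs$yearsmarried),
                        religiousness = mean(Affairs$religiousness),
                        gender = factor("female", levels=levels(Affairs$gender)))
 predict(fit.reduce, newdata=testdata, type="response")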
  • Let’s take a look at a specific example of logistic regression. In particular, let’s look at the field goal data set. Each time a kicker attempts a field goal, there is a chance that the goal will be successful, and a chance that it will fail. The probability varies according to distance; attempts closer to the end zone are more likely to be successful. To model this relationship, we’ll try to use a logistic regression. To begin, let’s load the data and create a new binary variable for field goals that are either good or bad:
 library(nutshell)
 data(field.goals)
  field.goals.forlr <- transform(field.goals,
           good=as.factor(ifelse(play.type=="FG good","good","bad")))
  dim(field.goals.forlr)
## [1] 982  11
  field.goals.forlr[1:5,]
home.team  week  qtr  away.team  offense  defense  play.type  player       yards  stadium.type  good
ARI          14    3  WAS        ARI      WAS      FG good    1-N.Rackers     20  Out           good
ARI           2    4  STL        ARI      STL      FG good    1-N.Rackers     35  Out           good
ARI           7    3  TEN        ARI      TEN      FG good    1-N.Rackers     24  Out           good
ARI          12    2  JAC        JAC      ARI      FG good    10-J.Scobee     30  Out           good
ARI           2    3  STL        ARI      STL      FG good    1-N.Rackers     48  Out           good

Let’s take a quick look at the percentage of good field goals by distance. We’ll start by tabulating the results with the table function:

 field.goals.table <- table(field.goals.forlr$good,
                            field.goals.forlr$yards)
 field.goals.table
##       
##        18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
##   bad   0  0  1  1  1  1  0  0  0  3  5  5  2  6  7  5  3  0  4  3 11  6
##   good  1 12 24 28 24 29 30 18 27 22 26 32 22 21 30 31 21 25 20 23 29 35
##       
##        40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
##   bad   7  5  6 11  5  9 12 11 10  9  5  8 11 10  3  1  2  1  1  1  1  1
##   good 27 32 21 15 24 16 15 26 18 14 11  9 12 10  2  1  3  0  1  0  0  0
##       
##        62
##   bad   1
##   good  0

We’ll also plot the results (as percentages):

plot(colnames(field.goals.table),
     field.goals.table["good",]/ (field.goals.table["bad",] + 
     field.goals.table["good",]),
     xlab="Distance (Yards)", ylab="Percent Good")

As you can see, field goal percentage tapers off linearly between about 25 and 55 yards (with a few outliers at the end).

To model the probability of a successful field goal using a logistic regression, we would make the following call to glm:

 field.goals.mdl <- glm(formula=good~yards,
                        data=field.goals.forlr,
                        family="binomial")
 field.goals.mdl
## 
## Call:  glm(formula = good ~ yards, family = "binomial", data = field.goals.forlr)
## 
## Coefficients:
## (Intercept)        yards  
##     5.17886     -0.09726  
## 
## Degrees of Freedom: 981 Total (i.e. Null);  980 Residual
## Null Deviance:       978.9 
## Residual Deviance: 861.2     AIC: 865.2
 summary(field.goals.mdl)
## 
## Call:
## glm(formula = good ~ yards, family = "binomial", data = field.goals.forlr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5582   0.2916   0.4664   0.6978   1.3789  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.178856   0.416201  12.443   <2e-16 ***
## yards       -0.097261   0.009892  -9.832   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 978.90  on 981  degrees of freedom
## Residual deviance: 861.22  on 980  degrees of freedom
## AIC: 865.22
## 
## Number of Fisher Scoring iterations: 5

Let’s take a quick look at how well this model fits the data.

 # invert the logit using the coefficients reported by summary(field.goals.mdl)
 fg.prob <- function(y) {
    eta <- 5.178856 - 0.097261 * y
    1 / (1 + exp(-eta))
 }
 plot(colnames(field.goals.table),
      field.goals.table["good",] / (field.goals.table["bad",] + 
      field.goals.table["good",]),
      xlab="Distance (Yards)", ylab="Percent Good")
 # overlay the fitted probability curve on the observed percentages
 lines(15:65, fg.prob(15:65))

7 Exercises

  • Ex1
    • Attach the built-in dataset mtcars and fit the linear regression model \[mpg = hp + wt + hp\times wt + \varepsilon.\]
    • Summarize the fit and explain it.
  • Ex2
    • Attach the dataset mtcars again and take a look at mtcars$vs.
    • Fit a glm (a logistic regression, since vs is binary) with the formula \[vs\sim hp + mpg + gear.\]
    • Summarize the fit and explain it.

8 References

  • Horton, N. and Kleinman, K. (2015). “Using R and RStudio”. CRC Press.
  • Kabacoff, R. I. (2011). “R in Action”. Manning Publications Co.
  • Baeza, S. (2015). “R For Beginners”. CreateSpace Independent Publishing Platform.
  • Adler, J. (2010). “R in a Nutshell: A Desktop Quick Reference”. O’Reilly Media, Inc.