R includes a variety of functions for calculating summary statistics.
Use the mean function to calculate the mean of a vector. We can calculate minima with the min function, or maxima with the max function.
library(nutshell)
data(dow30)
head(dow30)
| symbol | Date | Open | High | Low | Close | Volume | Adj.Close |
|---|---|---|---|---|---|---|---|
| MMM | 2009-09-21 | 73.91 | 74.68 | 73.91 | 74.54 | 2560400 | 74.54 |
| MMM | 2009-09-18 | 75.12 | 75.25 | 74.50 | 74.62 | 4387900 | 74.62 |
| MMM | 2009-09-17 | 75.34 | 75.45 | 74.50 | 74.89 | 3371500 | 74.89 |
| MMM | 2009-09-16 | 74.76 | 75.49 | 74.50 | 75.38 | 2722500 | 75.38 |
| MMM | 2009-09-15 | 74.63 | 74.88 | 74.00 | 74.68 | 3566900 | 74.68 |
| MMM | 2009-09-14 | 73.72 | 74.64 | 73.42 | 74.56 | 3466400 | 74.56 |
mean(dow30$Open)## [1] 36.24574
max(dow30$Open)## [1] 122.45
min(dow30$Open)## [1] 0.99
The na.rm argument specifies how NA values are treated. By default, if any value in the vector is NA, then NA is returned. Specify na.rm=TRUE to ignore missing values:
mean(c(1,2,3,4,5,NA))## [1] NA
mean(c(1,2,3,4,5,NA),na.rm=TRUE)## [1] 3
We can also compute a trimmed mean, which discards a fraction of the most extreme observations before averaging. To do this, we use the trim argument of the mean function to specify the fraction of observations to trim from each end:
mean(c(-1,0:100,2000))## [1] 68.43689
mean(c(-1,0:100,2000), trim=0.1)## [1] 50
To get the minimum and maximum together, use the range function:
range(dow30$Open)## [1] 0.99 122.45
To compute quantiles, use quantile. This function returns the values at different percentiles (specified by the probs argument):
quantile(dow30$Open, probs=c(0,0.25,0.5,0.75,1.0))## 0% 25% 50% 75% 100%
## 0.990 19.655 30.155 51.680 122.450
Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum) is returned by the fivenum function:
fivenum(dow30$Open)## [1] 0.990 19.650 30.155 51.680 122.450
Use IQR to return the interquartile range (the difference between the 25th and 75th percentile values):
IQR(dow30$Open)## [1] 32.025
For an overview of a whole data set, use summary. As an example, let’s take a look at the output of summary for the dow30 data set that we used above:
summary(dow30)## symbol Date Open High
## MMM : 252 2008-09-22: 30 Min. : 0.99 Min. : 1.01
## AA : 252 2008-09-23: 30 1st Qu.: 19.66 1st Qu.: 20.19
## AXP : 252 2008-09-24: 30 Median : 30.16 Median : 30.75
## T : 252 2008-09-25: 30 Mean : 36.25 Mean : 36.93
## BAC : 252 2008-09-26: 30 3rd Qu.: 51.68 3rd Qu.: 52.45
## BA : 252 2008-09-29: 30 Max. :122.45 Max. :122.88
## (Other):5970 (Other) :7302
## Low Close Volume Adj.Close
## Min. : 0.27 Min. : 0.75 Min. :1.336e+06 Min. : 0.75
## 1st Qu.: 19.15 1st Qu.: 19.65 1st Qu.:1.111e+07 1st Qu.: 19.38
## Median : 29.55 Median : 30.10 Median :1.822e+07 Median : 29.41
## Mean : 35.53 Mean : 36.24 Mean :5.226e+07 Mean : 35.64
## 3rd Qu.: 50.84 3rd Qu.: 51.58 3rd Qu.:4.255e+07 3rd Qu.: 50.97
## Max. :121.62 Max. :122.11 Max. :2.672e+09 Max. :122.11
##
As you can see, summary presents information about each variable in the data frame. For numeric values, it shows the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values. For factors, summary shows the count of the most frequent values. (Less frequent values are grouped into an “Other” category.)
To draw a stem-and-leaf plot, use the stem function:
stem(x, scale = 1, width = 80, atom = 1e-08)
The argument x is a numeric vector, scale controls the length of the plot, width controls the width, and atom is a tolerance factor. For example:
x <- c(11.6, 10.3, 10.8, 10.6, 11.9, 9.6, 9.0, 9.9, 11.2, 10.21, 10.69)
stem(x)##
## The decimal point is at the |
##
## 9 | 069
## 10 | 23678
## 11 | 269
library(nutshell)
data(field.goals)
dim(field.goals)## [1] 982 10
head(field.goals)
| home.team | week | qtr | away.team | offense | defense | play.type | player | yards | stadium.type |
|---|---|---|---|---|---|---|---|---|---|
| ARI | 14 | 3 | WAS | ARI | WAS | FG good | 1-N.Rackers | 20 | Out |
| ARI | 2 | 4 | STL | ARI | STL | FG good | 1-N.Rackers | 35 | Out |
| ARI | 7 | 3 | TEN | ARI | TEN | FG good | 1-N.Rackers | 24 | Out |
| ARI | 12 | 2 | JAC | JAC | ARI | FG good | 10-J.Scobee | 30 | Out |
| ARI | 2 | 3 | STL | ARI | STL | FG good | 1-N.Rackers | 48 | Out |
| ARI | 7 | 4 | TEN | TEN | ARI | FG aborted | 15-C.Hentrich | 33 | Out |
As another example, we’ll look at field goal attempts in the NFL during 2005. Specifically, we’ll look at the attempted distances for missed field goals. To do this, we’ll use the subset function to select only missed field goals and then plot the yards for each attempt:
stem(subset(field.goals,play.type=="FG no")$yards)##
## The decimal point is at the |
##
## 20 | 0
## 22 |
## 24 |
## 26 | 00
## 28 | 0000000
## 30 | 0000000
## 32 | 00000000
## 34 | 000
## 36 | 0000
## 38 | 00000000000000
## 40 | 0000000000
## 42 | 0000000000000000
## 44 | 000000000000
## 46 | 000000000000000000
## 48 | 000000000000000000
## 50 | 000000000000
## 52 | 0000000000000000000
## 54 | 0000
## 56 | 000
## 58 | 00
## 60 | 00
## 62 | 0
Very often, when analyzing data, we want to know if two variables are correlated. Informally, correlation answers the question “when we increase (or decrease) x, does y increase (or decrease), and by how much?” Formally, correlation measures the linear dependence between two random variables. Correlation measures range between -1 and 1; 1 means that one variable is a (positive) linear function of the other, 0 means the two variables aren’t correlated at all, and -1 means that one variable is a negative linear function of the other.
The (Pearson) correlation between \(x\) and \(y\) is computed as \[ r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}, \] where \(\bar{x}\) is the mean of variable \(x\), and \(\bar{y}\) is the mean of variable \(y\).
To compute correlations in R, we can use the function cor. Here is an example:
cor(1:10, 2:11)## [1] 1
x <- rnorm(10000)
y <- rnorm(10000)
cor(x, y)## [1] 0.03069866
cov(x,y)## [1] 0.03050981
var(x)## [1] 1.012606
var(y)## [1] 0.9754385
cov(x,y)/sqrt(var(x)*var(y))## [1] 0.03069866
cor(x, y)## [1] 0.03069866
The first principal component is the weighted combination of the \(k\) observed variables that accounts for the most variance in the original set of variables. The second principal component is the linear combination that accounts for the most variance in the original variables, under the constraint that it’s orthogonal (uncorrelated) to the first principal component. Each subsequent component maximizes the variance accounted for, while at the same time remaining uncorrelated with all previous components. Theoretically, you can extract as many principal components as there are variables. But from a practical viewpoint, you hope that you can approximate the full set of variables with a much smaller set of components.
In R, principal components analysis is available through the functions prcomp and princomp in the stats package.
Let’s look at a simple example.
library(nutshell)
data(team.batting.00to08)
dim(team.batting.00to08)## [1] 270 13
head(team.batting.00to08)[,1:10] # Ignore the last 3 columns
| teamID | yearID | runs | singles | doubles | triples | homeruns | walks | stolenbases | caughtstealing |
|---|---|---|---|---|---|---|---|---|---|
| ANA | 2000 | 864 | 995 | 309 | 34 | 236 | 608 | 93 | 52 |
| BAL | 2000 | 794 | 992 | 310 | 22 | 184 | 558 | 126 | 65 |
| BOS | 2000 | 792 | 988 | 316 | 32 | 167 | 611 | 43 | 30 |
| CHA | 2000 | 978 | 1041 | 325 | 33 | 216 | 591 | 119 | 42 |
| CLE | 2000 | 950 | 1078 | 310 | 30 | 221 | 685 | 113 | 34 |
| DET | 2000 | 823 | 1028 | 307 | 41 | 177 | 562 | 83 | 38 |
# ?princomp
batting.pca <- princomp(~singles+doubles+triples+homeruns+walks+
hitbypitch+sacrificeflies+stolenbases+
caughtstealing, data=team.batting.00to08)
batting.pca## Call:
## princomp(formula = ~singles + doubles + triples + homeruns +
## walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing,
## data = team.batting.00to08)
##
## Standard deviations:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 74.900981 61.871086 31.811398 27.988190 23.788859 12.884291 9.150840
## Comp.8 Comp.9
## 8.283972 7.060503
##
## 9 variables and 270 observations.
The princomp function returns a princomp object. We can get information on the importance of each component with the summary function:
summary(batting.pca, loadings=TRUE)## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 74.9009809 61.8710858 31.8113983 27.98819003
## Proportion of Variance 0.4610727 0.3146081 0.0831687 0.06437897
## Cumulative Proportion 0.4610727 0.7756807 0.8588494 0.92322841
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 23.78885885 12.88429066 9.150840397 8.283972499
## Proportion of Variance 0.04650949 0.01364317 0.006882026 0.005639904
## Cumulative Proportion 0.96973790 0.98338107 0.990263099 0.995903002
## Comp.9
## Standard deviation 7.060503344
## Proportion of Variance 0.004096998
## Cumulative Proportion 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## singles 0.313 0.929 0.136 0.136
## doubles 0.437 0.121 -0.877 0.100
## triples -0.424 -0.775
## homeruns -0.235 0.383 0.825 0.324
## walks -0.914 0.328 -0.150 -0.182
## hitbypitch 0.989
## sacrificeflies -0.321 -0.330
## stolenbases 0.131 -0.758 0.502 -0.307 0.232
## caughtstealing -0.208 0.104 -0.813 0.521
## Comp.9
## singles
## doubles
## triples 0.449
## homeruns
## walks
## hitbypitch
## sacrificeflies -0.882
## stolenbases
## caughtstealing 0.105
We can also call the loadings function to show the contribution of each variable to the components:
loadings(batting.pca)##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## singles 0.313 0.929 0.136 0.136
## doubles 0.437 0.121 -0.877 0.100
## triples -0.424 -0.775
## homeruns -0.235 0.383 0.825 0.324
## walks -0.914 0.328 -0.150 -0.182
## hitbypitch 0.989
## sacrificeflies -0.321 -0.330
## stolenbases 0.131 -0.758 0.502 -0.307 0.232
## caughtstealing -0.208 0.104 -0.813 0.521
## Comp.9
## singles
## doubles
## triples 0.449
## homeruns
## walks
## hitbypitch
## sacrificeflies -0.882
## stolenbases
## caughtstealing 0.105
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111
## Cumulative Var 0.111 0.222 0.333 0.444 0.556 0.667 0.778 0.889
## Comp.9
## SS loadings 1.000
## Proportion Var 0.111
## Cumulative Var 1.000
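For comparison, the prcomp function mentioned above fits the same kind of analysis using a singular value decomposition (the approach generally preferred for numerical accuracy); here is a minimal sketch applied to the same variables:
batting.prcomp <- prcomp(~ singles + doubles + triples + homeruns + walks +
                           hitbypitch + sacrificeflies + stolenbases +
                           caughtstealing, data = team.batting.00to08)
summary(batting.prcomp)  # standard deviations and proportion of variance per component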
The principal component scores for each observation are stored in the scores component of the fitted object:
# batting.pca$scores
We can also draw a scree plot of the variance against each principal component, and a biplot:
plot(batting.pca)
biplot(batting.pca,cex=0.5,col=c("gray50","black"))
R includes functions for working with a wide range of probability distributions. The table below lists some common distributions, their R names, and their distribution-specific arguments:
| Distribution | R name | Additional arguments |
|---|---|---|
| normal | norm | mean, sd |
| uniform | unif | min, max |
| Student’s t | t | df, ncp |
| Cauchy | cauchy | location, scale |
| chi-squared | chisq | df, ncp |
| binomial | binom | size, prob |
| exponential | exp | rate |
| F | f | df1, df2, ncp |
| gamma | gamma | shape, scale |
| Poisson | pois | lambda |
Prefix the name given here by d for the density, p for the CDF, q for the quantile function and r for simulation (random deviates). The first argument is \(x\) for dxxx, \(q\) for pxxx, \(p\) for qxxx and \(n\) for rxxx, where the symbol xxx denotes the specified distribution, such as norm, unif, and so on.
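For instance, the same prefix convention applies to the binomial distribution (a quick illustrative sketch):
dbinom(3, size = 10, prob = 0.5)    # P(X = 3) for a Binomial(10, 0.5) variable
pbinom(3, size = 10, prob = 0.5)    # P(X <= 3)
qbinom(0.5, size = 10, prob = 0.5)  # median (smallest x with P(X <= x) >= 0.5)
rbinom(5, size = 10, prob = 0.5)    # five random draws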
For the normal distribution, use the dnorm function to find the probability density at a given value:
dnorm(0)## [1] 0.3989423
dnorm(5)## [1] 1.48672e-06
dnorm(-5)## [1] 1.48672e-06
plot(dnorm, -3.5, 3.5, main = "Normal Distribution", ylab = 'Density')
The cumulative distribution function for the normal distribution is pnorm:
pnorm(0)## [1] 0.5
pnorm(5)## [1] 0.9999997
pnorm(-5)## [1] 2.866516e-07
plot(pnorm, -3.5, 3.5, main = "Cumulative Normal Distribution")
The quantile function (the inverse of the CDF) is qnorm:
qnorm(0.5)## [1] 0
qnorm(1)## [1] Inf
qnorm(0)## [1] -Inf
# qnorm is the inverse of pnorm
qnorm(pnorm(-1))## [1] -1
# finding the left and right sides of a 95% confidence interval
c(qnorm(.025), qnorm(.975))## [1] -1.959964 1.959964
To generate normally distributed random numbers, use rnorm:
rnorm(10)## [1] 0.40565445 -1.80115452 0.83266950 0.07201365 -0.22783142
## [6] 2.76128624 0.82836761 0.79889065 0.62772680 1.50361186
hist(rnorm(20000),breaks=50)
Many data problems boil down to statistical tests. For example, you might want to answer a question like “do these two groups have the same mean?” or “is the mean of this sample different from a reference value?”
To answer questions like these, you would formulate a hypothesis, design an experiment, collect data, and use a tool like R to analyze the data. This chapter focuses on the tools available in R for answering these questions.
This section describes tests that apply to continuous random variables.
We’ll start off by showing how to use some common statistical tests that assume the underlying data is normally distributed.
To perform a t-test in R, use the t.test function (see ?t.test for its full set of arguments). Let’s take a look at an example of how you would use the t.test function.
t.test(rnorm(300))##
## One Sample t-test
##
## data: rnorm(300)
## t = 0.19768, df = 299, p-value = 0.8434
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.102937 0.125927
## sample estimates:
## mean of x
## 0.01149497
Next, let’s try the test on real data: the tires.sus data set. (For more information on tires.sus, please see its help file.)
library(nutshell)
data(tires.sus)
dim(tires.sus)## [1] 66 27
To begin, let’s extract a vector with the set of values in which we are interested and calculate its mean:
time.to.failure.h <- subset(tires.sus,
Tire_Type=='H'&
Speed_At_Failure_km_h==160)$Time_To_Failure
time.to.failure.h## [1] 10.00 16.67 13.58 13.53 16.83 7.62 4.25 10.67 4.42 4.25
mean(time.to.failure.h)## [1] 10.182
t.test(time.to.failure.h, mu = 9)##
## One Sample t-test
##
## data: time.to.failure.h
## t = 0.75694, df = 9, p-value = 0.4684
## alternative hypothesis: true mean is not equal to 9
## 95 percent confidence interval:
## 6.649536 13.714464
## sample estimates:
## mean of x
## 10.182
Here’s an explanation of the output from the t.test function. First, the function shows us the test statistic (\(t = 0.75694\)), the degrees of freedom (\(df = 9\)), and the calculated \(p\)-value for the test (p-value = 0.4684). The \(p\)-value is the probability, if the true mean were 9, of observing a sample mean at least as extreme as the one we saw, that is, higher than 10.182 or lower than 7.818.
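As a sanity check, the test statistic can be reproduced from its definition, \(t = (\bar{x} - \mu_0)/(s/\sqrt{n})\); a short sketch using the same vector:
# should match the t statistic reported by t.test above
(mean(time.to.failure.h) - 9) / (sd(time.to.failure.h) / sqrt(length(time.to.failure.h)))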
We can also use a t-test to compare the means of two groups. For example:
x <- rnorm(200)
y <- rnorm(200)
t.test(x, y)##
## Welch Two Sample t-test
##
## data: x and y
## t = -0.32878, df = 397.04, p-value = 0.7425
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2180601 0.1555739
## sample estimates:
## mean of x mean of y
## -0.039130559 -0.007887479
Let’s pick another example from the tire data. Looking at the characteristics of the different tires that were tested, notice that three of the six tires had the same speed rating: S. Based on this speed rating, we would expect all three tires to last the same amount of time in the test:
time.to.failure.e <- subset(tires.sus,
Tire_Type=='E'&
Speed_At_Failure_km_h==180
)$Time_To_Failure
time.to.failure.d <- subset(tires.sus,
Tire_Type=='D'&
Speed_At_Failure_km_h==180
)$Time_To_Failure
time.to.failure.b <- subset(tires.sus,
Tire_Type=='B'&
Speed_At_Failure_km_h==180
)$Time_To_Failure
t.test(time.to.failure.e, time.to.failure.d)##
## Welch Two Sample t-test
##
## data: time.to.failure.e and time.to.failure.d
## t = -2.5042, df = 8.961, p-value = 0.03373
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.82222528 -0.04148901
## sample estimates:
## mean of x mean of y
## 4.321000 4.752857
t.test(time.to.failure.e, time.to.failure.b)##
## Welch Two Sample t-test
##
## data: time.to.failure.e and time.to.failure.b
## t = -1.4549, df = 16.956, p-value = 0.164
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5591177 0.1027844
## sample estimates:
## mean of x mean of y
## 4.321000 4.549167
When the observations in the two groups form matched pairs, we use a paired t-test. We can use the t.test function, specifying paired=TRUE, to perform this test. As an example of paired data, we can look at the SPECint2006 results. We will compare the unoptimized results (called “baseline”) to the optimized results, to see if there is a statistically significant difference between the results.
library(nutshell)
data(SPECint2006)
t.test(subset(SPECint2006, Num.Chips==1&Num.Cores==2)$Baseline,
subset(SPECint2006, Num.Chips==1&Num.Cores==2)$Result,
paired=TRUE)##
## Paired t-test
##
## data: subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Baseline and subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Result
## t = -21.804, df = 111, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.957837 -1.631627
## sample estimates:
## mean of the differences
## -1.794732
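A paired t-test is equivalent to a one-sample t-test on the within-pair differences; as a quick illustrative check on the same subset:
cpus.2core <- subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)
t.test(cpus.2core$Baseline - cpus.2core$Result)  # same t, df, and p-value as the paired test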
To compare the variances of two samples, use the var.test function, which performs an F-test:
x <- rnorm(100, mean = 0, sd = 2)
y <- rnorm(80, mean = 1, sd = 1)
var.test(x, y) ##
## F test to compare two variances
##
## data: x and y
## F = 3.3107, num df = 99, denom df = 79, p-value = 1.014e-07
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 2.162406 5.018046
## sample estimates:
## ratio of variances
## 3.310721
To compare the means of more than two groups, we can use analysis of variance (ANOVA) through the aov function. As an example, let’s consider the 2006 U.S. Mortality data set. Specifically, we’ll look at differences in age at death by cause of death. Let’s take a look at the summary statistics for age by cause:
library(nutshell)
data(mort06.smpl)
dim(mort06.smpl)## [1] 243073 36
tapply(mort06.smpl$age, INDEX = list(mort06.smpl$Cause), FUN = summary)## $Accidents
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 31.00 48.00 50.88 73.00 108.00 8
##
## $`Alzheimer's Disease`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.00 82.00 87.00 86.07 91.00 109.00
##
## $Cancer
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 61.00 72.00 70.24 81.00 107.00
##
## $`Chronic respiratory diseases`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 70.00 78.00 76.37 84.00 106.00 1
##
## $Diabetes
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 63.00 75.00 72.43 83.00 104.00 1
##
## $`Heart Disease`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 70.00 81.00 77.66 88.00 112.00 4
##
## $Homicide
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 22.00 28.00 32.19 42.00 92.00 2
##
## $`Influenza and pheumonia`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 76.00 84.00 80.16 90.00 108.00 1
##
## $Other
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 60.00 78.00 70.44 87.00 110.00 10
##
## $Suicide
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 8.00 32.00 45.00 46.14 57.00 97.00 2
Now, let’s fit an ANOVA model to the data and show a summary of the model. To do this in R, we simply need to use the aov function:
fit1 <- aov(age~Cause, data=mort06.smpl)
fit1## Call:
## aov(formula = age ~ Cause, data = mort06.smpl)
##
## Terms:
## Cause Residuals
## Sum of Squares 15727886 72067515
## Deg. of Freedom 9 243034
##
## Residual standard error: 17.22012
## Estimated effects may be unbalanced
## 29 observations deleted due to missingness
#head(mort06.smpl[,c('age','Cause')])
summary(fit1)## Df Sum Sq Mean Sq F value Pr(>F)
## Cause 9 15727886 1747543 5893 <2e-16 ***
## Residuals 243034 72067515 297
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 29 observations deleted due to missingness
As another example of aov, let’s consider weight gain by women during pregnancy:
library(nutshell)
data(births2006.smpl)
births2006.cln <- births2006.smpl[births2006.smpl$WTGAIN<99 &
!is.na(births2006.smpl$WTGAIN),]
tapply(X=births2006.cln$WTGAIN, INDEX=births2006.cln$DOB_MM, FUN=mean)## 1 2 3 4 5 6 7 8
## 30.94405 31.08356 31.29317 31.33610 31.07242 30.92589 30.57734 30.54855
## 9 10 11 12
## 30.25546 30.43985 30.79077 30.85564
It appears that weight gain increases slightly during winter months, but is this difference statistically significant? Let’s take a look:
fit2 <- aov(WTGAIN~DOB_MM, births2006.cln)
fit2## Call:
## aov(formula = WTGAIN ~ DOB_MM, data = births2006.cln)
##
## Terms:
## DOB_MM Residuals
## Sum of Squares 14777 73385301
## Deg. of Freedom 1 351465
##
## Residual standard error: 14.44986
## Estimated effects may be unbalanced
summary(fit2)## Df Sum Sq Mean Sq F value Pr(>F)
## DOB_MM 1 14777 14777 70.77 <2e-16 ***
## Residuals 351465 73385301 209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To compare several groups pairwise, with p-values adjusted for multiple comparisons, use the pairwise.t.test function. As an example, let’s return to the tire data that we used in the example above. When we looked at the t.test function, we created three different vectors for the different types of tires. Here, we’ll just use the pairwise t-test to compare all the tires by type:
pairwise.t.test(tires.sus$Time_To_Failure, tires.sus$Tire_Type)##
## Pairwise comparisons using t tests with pooled SD
##
## data: tires.sus$Time_To_Failure and tires.sus$Tire_Type
##
## B C D E H
## C 0.2219 - - - -
## D 1.0000 0.5650 - - -
## E 1.0000 0.0769 1.0000 - -
## H 2.4e-07 0.0029 2.6e-05 1.9e-08 -
## L 0.1147 1.0000 0.4408 0.0291 0.0019
##
## P value adjustment method: holm
As you can see, there is no statistically significant difference between the means of a few pairs of groups (such as C and L, D and L, or D and E), but there is a significant difference between some others (such as B and H, C and H, or E and L).
To test whether a sample comes from a normally distributed population, we can use the Shapiro-Wilk test through the shapiro.test function. For example:
x <- rnorm(300, mean = 5, sd = 2)
y <- runif(300, min = 2, max = 4)
My first instinct is to take a look at the distributions using a histogram or a quantile-quantile plot. Here is some R code to plot both, side by side:
par(mfcol=c(1,2))
hist(x)
qqnorm(x)
par(mfcol=c(1,2))
hist(y)
qqnorm(y)
shapiro.test(x)##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.99317, p-value = 0.1895
shapiro.test(y)##
## Shapiro-Wilk normality test
##
## data: y
## W = 0.95944, p-value = 2.07e-07
To test whether a sample was drawn from a given reference distribution, we can use the Kolmogorov-Smirnov test through the ks.test function. For example:
x <- rnorm(200)
ks.test(x, pnorm)##
## One-sample Kolmogorov-Smirnov test
##
## data: x
## D = 0.037576, p-value = 0.9403
## alternative hypothesis: two-sided
ks.test(x, punif)##
## One-sample Kolmogorov-Smirnov test
##
## data: x
## D = 0.495, p-value < 2.2e-16
## alternative hypothesis: two-sided
x <- rnorm(100)
y <- rnorm(80)
# Do x and y come from the same distribution?
ks.test(x, y)##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.0875, p-value = 0.8567
## alternative hypothesis: two-sided
x <- rnorm(100)
y <- rnorm(80, 1, 1)
ks.test(x, y)##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.4575, p-value = 6.682e-09
## alternative hypothesis: two-sided
x <- rnorm(100)
y <- runif(80)
ks.test(x, y)##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.57, p-value = 7.572e-14
## alternative hypothesis: two-sided
To test whether a correlation is statistically significant, use the cor.test function. For example:
cor.test(1:8, c(0, 2, 4, 6, 8, 10, 11, 14))##
## Pearson's product-moment correlation
##
## data: 1:8 and c(0, 2, 4, 6, 8, 10, 11, 14)
## t = 36.148, df = 6, p-value = 2.989e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9868648 0.9996032
## sample estimates:
## cor
## 0.997712
cor.test(1:8, c(5, 3, 8, 1, 7, 0, 0, 3))##
## Pearson's product-moment correlation
##
## data: 1:8 and c(5, 3, 8, 1, 7, 0, 0, 3)
## t = -1.2232, df = 6, p-value = 0.2671
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8757371 0.3764066
## sample estimates:
## cor
## -0.4467689
The Wilcoxon test is the distribution-free equivalent to the t-test. To do this in R, we can use the wilcox.test function. As an example, let’s take a look at the same examples we used for t-tests. Let’s start by looking at the times to failure for tires. As above, let’s start by comparing tires of type E to tires of type D:
wilcox.test( time.to.failure.e, time.to.failure.d)## Warning in wilcox.test.default(time.to.failure.e, time.to.failure.d):
## cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: time.to.failure.e and time.to.failure.d
## W = 14.5, p-value = 0.05054
## alternative hypothesis: true location shift is not equal to 0
t.test(time.to.failure.e, time.to.failure.d)##
## Welch Two Sample t-test
##
## data: time.to.failure.e and time.to.failure.d
## t = -2.5042, df = 8.961, p-value = 0.03373
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.82222528 -0.04148901
## sample estimates:
## mean of x mean of y
## 4.321000 4.752857
Here’s an explanation of the output. The test function first shows the test statistic (W = 14.5) and the \(p\)-value for the statistic (0.05054). Notice that this is different from the result for the t-test. With the t-test, there was a significant difference between the means of the two groups, but with the Wilcoxon test, the difference between the two groups is not significant at a 95% confidence level.
The Kruskal-Wallis rank sum test is a distribution-free equivalent of ANOVA, available through kruskal.test(x, ...). As an example, here is the output for the mortality data that we used as an example for ANOVA above:
summary(aov(age~Cause, data=mort06.smpl))## Df Sum Sq Mean Sq F value Pr(>F)
## Cause 9 15727886 1747543 5893 <2e-16 ***
## Residuals 243034 72067515 297
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 29 observations deleted due to missingness
kruskal.test(age~Cause,data=mort06.smpl)##
## Kruskal-Wallis rank sum test
##
## data: age by Cause
## Kruskal-Wallis chi-squared = 34868, df = 9, p-value < 2.2e-16
To test whether the variances in the groups are the same without assuming normality, we can use the Fligner-Killeen test through the fligner.test function. Here is the output of fligner.test for the mortality data above:
fligner.test(age~Cause,data=mort06.smpl)##
## Fligner-Killeen test of homogeneity of variances
##
## data: age by Cause
## Fligner-Killeen:med chi-squared = 15788, df = 9, p-value < 2.2e-16
To compare proportions between groups, we can use prop.test to measure whether the difference between groups is statistically significant. Specifically, prop.test tests the null hypothesis that the proportions (probabilities of success) in several groups are the same, or that they equal certain given values. As an example, let’s revisit the field goal data. First, let’s create a new data set containing only good and bad field goals.
library(nutshell)
data(field.goals)
field.goals.goodbad <- field.goals[field.goals$play.type=='FG good' |
field.goals$play.type=='FG no',]
head(field.goals.goodbad)
| | home.team | week | qtr | away.team | offense | defense | play.type | player | yards | stadium.type |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ARI | 14 | 3 | WAS | ARI | WAS | FG good | 1-N.Rackers | 20 | Out |
| 2 | ARI | 2 | 4 | STL | ARI | STL | FG good | 1-N.Rackers | 35 | Out |
| 3 | ARI | 7 | 3 | TEN | ARI | TEN | FG good | 1-N.Rackers | 24 | Out |
| 4 | ARI | 12 | 2 | JAC | JAC | ARI | FG good | 10-J.Scobee | 30 | Out |
| 5 | ARI | 2 | 3 | STL | ARI | STL | FG good | 1-N.Rackers | 48 | Out |
| 7 | ARI | 2 | 2 | STL | ARI | STL | FG good | 1-N.Rackers | 26 | Out |
Now, let’s create a table of successes and failures by stadium type:
field.goals.table <- table(field.goals.goodbad$play.type,
field.goals.goodbad$stadium.type)
field.goals.table##
## Both In Out
## FG aborted 0 0 0
## FG blocked 0 0 0
## FG good 53 152 582
## FG no 14 24 125
The table isn’t quite right for prop.test; we need a table with two columns (one with a count of successes and one with a count of failures), and we don’t want to show empty factor levels. Let’s remove the two rows we don’t need and transpose the table:
field.goals.table.t <- t(field.goals.table[3:4, ])
field.goals.table.t##
## FG good FG no
## Both 53 14
## In 152 24
## Out 582 125
Now, we’re ready to see if there is a statistically significant difference in success between the three groups. We can simply call prop.test on the field.goals.table.t object to check:
prop.test(field.goals.table.t)##
## 3-sample test for equality of proportions without continuity
## correction
##
## data: field.goals.table.t
## X-squared = 2.3298, df = 2, p-value = 0.312
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3
## 0.7910448 0.8636364 0.8231966
As you can see, the results are not significant.
For experiments that consist of a series of success/failure (Bernoulli) trials, R includes the binom.test function, which performs an exact test of whether the underlying probability of success equals a hypothesized value.
binom.test(c(480, 520), p = 0.5)##
## Exact binomial test
##
## data: c(480, 520)
## number of successes = 480, number of trials = 1000, p-value =
## 0.2174
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4486329 0.5114851
## sample estimates:
## probability of success
## 0.48
binom.test(c(682, 243), p = 3/4)##
## Exact binomial test
##
## data: c(682, 243)
## number of successes = 682, number of trials = 925, p-value =
## 0.3825
## alternative hypothesis: true probability of success is not equal to 0.75
## 95 percent confidence interval:
## 0.7076683 0.7654066
## sample estimates:
## probability of success
## 0.7372973
x <- rbinom(1000, 1, 0.3)
binom.test(sum(x), 1000, p = 0.5)##
## Exact binomial test
##
## data: sum(x) and 1000
## number of successes = 286, number of trials = 1000, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.2581564 0.3151096
## sample estimates:
## probability of success
## 0.286
A common problem is to look at a table of data and determine if there is a relationship between two categorical variables. If there were no relationship, the two variables would be statistically independent. In these tests, the null hypothesis is that the two variables are independent; the alternative hypothesis is that they are not.
Tables of data often come up in experimental contexts: there is one column of data from a test population and one from a control population. In this context, the analyst often wants to calculate the probability that the two sets of data could have come from the same population (which would imply the same proportions in each). This is an equivalent problem, so the same test functions can be used.
To do this in R, we can use the fisher.test or chisq.test function. For example:
A British woman claimed to be able to distinguish whether milk or tea was added to the cup first. To test this claim, she was given 8 cups of tea, in four of which milk was added first. The null hypothesis is that there is no association between the true order of pouring and the woman’s guess; the alternative is that there is an association (in the classic one-sided analysis, that the odds ratio is greater than 1, although the call below uses the default two-sided alternative).
TeaTasting <-
matrix(c(3, 1, 1, 3),
nrow = 2,
dimnames = list(Guess = c("Milk", "Tea"),
Truth = c("Milk", "Tea")))
TeaTasting## Truth
## Guess Milk Tea
## Milk 3 1
## Tea 1 3
fisher.test(TeaTasting)##
## Fisher's Exact Test for Count Data
##
## data: TeaTasting
## p-value = 0.4857
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.2117329 621.9337505
## sample estimates:
## odds ratio
## 6.408309
chisq.test(TeaTasting)## Warning in chisq.test(TeaTasting): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: TeaTasting
## X-squared = 0.5, df = 1, p-value = 0.4795
As a second example, let’s use the 2006 births data set. We will take a look at the number of male and female babies delivered during July 2006, by delivery method. We’ll take a subset of births during July where the delivery method was known and then tabulate the results:
births.july.2006 <-births2006.smpl[births2006.smpl$DMETH_REC!="Unknown"
& births2006.smpl$DOB_MM==7, ]
method.and.sex <- table(births.july.2006$SEX,
as.factor(as.character(births.july.2006$DMETH_REC)))
method.and.sex##
## C-section Vaginal
## F 5326 12622
## M 6067 13045
Let’s check whether this difference is statistically significant: is the difference just due to chance (in which case delivery method and sex are independent), or is there a real association between the two variables?
fisher.test(method.and.sex)##
## Fisher's Exact Test for Count Data
##
## data: method.and.sex
## p-value = 1.604e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8678345 0.9485129
## sample estimates:
## odds ratio
## 0.9072866
chisq.test(method.and.sex)##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: method.and.sex
## X-squared = 18.528, df = 1, p-value = 1.675e-05
As a third example, let’s look only at twin births. (Note that each record represents a single birth, not a single pregnancy.)
twins.2006 <- births2006.smpl[births2006.smpl$DPLURAL=="2 Twin" &
births2006.smpl$DMETH_REC !="Unknown",]
method.and.sex.twins <- table(twins.2006$SEX,
as.factor(as.character(twins.2006$DMETH_REC)))
method.and.sex.twins##
## C-section Vaginal
## F 4924 1774
## M 5076 1860
fisher.test(method.and.sex.twins)##
## Fisher's Exact Test for Count Data
##
## data: method.and.sex.twins
## p-value = 0.67
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9420023 1.0981529
## sample estimates:
## odds ratio
## 1.017083
chisq.test(method.and.sex.twins)##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: method.and.sex.twins
## X-squared = 0.17451, df = 1, p-value = 0.6761
The chisq.test function can also perform tests on multidimensional tables. As an example, let’s build a table showing the number of births by day of the week and month:
births2006.bydayandmonth <- table(births2006.smpl$DOB_WK,
births2006.smpl$DOB_MM)
births2006.bydayandmonth##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1 3645 2930 2965 3616 2969 3036 3976 3160 3270 3964 2999 3744
## 2 5649 4737 4779 4853 5712 5033 6263 5127 4850 6167 5043 4544
## 3 6265 5293 5251 5297 6472 5178 5149 7225 5805 6887 5619 5334
## 4 5131 5280 6486 5173 6496 5540 5499 7011 5725 5445 6838 5666
## 5 5127 5271 6574 5162 5347 6863 5780 6945 5822 5538 6165 5570
## 6 4830 5305 6330 5042 4975 6622 5760 5530 7027 5256 5079 6624
## 7 3295 3392 3408 4185 3364 3464 4751 3686 4669 3564 3509 4396
As above, let’s check how likely it is that the observed table would arise by chance if the day of the week and the month of birth were independent:
chisq.test(births2006.bydayandmonth)##
## Pearson's Chi-squared test
##
## data: births2006.bydayandmonth
## X-squared = 4729.6, df = 66, p-value < 2.2e-16
Another distribution-free test is the Friedman rank sum test, available through the friedman.test function. As an example, let’s look at one of the same tables we used above:
friedman.test(method.and.sex.twins)##
## Friedman rank sum test
##
## data: method.and.sex.twins
## Friedman chi-squared = 2, df = 1, p-value = 0.1573
Just like the chi-squared test, the Friedman rank sum test gives us no reason to reject the hypothesis that delivery method and sex are independent.
Suppose we are planning a two-sample experiment with 25 observations in each group, and we assume a standard deviation of 8.9 for the measurements. We’ll also assume that we want a power of .95 for the experiment (meaning that the probability of a Type II error is less than .05). To calculate the minimum statistically significant difference in R, we could use the following expression:
power.t.test(power=.95, sig.level=.05, sd=8.9, n=25)##
## Two-sample t test power calculation
##
## n = 25
## delta = 9.26214
## sd = 8.9
## sig.level = 0.05
## power = 0.95
## alternative = two.sided
##
## NOTE: n is number in *each* group
According to the output, the difference in means between the two groups would need to be at least 9.26214 to be significant at this level.
power.t.test(power=.95, sig.level=.05, sd=8.9, n=50)##
## Two-sample t test power calculation
##
## n = 50
## delta = 6.480487
## sd = 8.9
## sig.level = 0.05
## power = 0.95
## alternative = two.sided
##
## NOTE: n is number in *each* group
As you can see, the power functions can be very useful for designing an experiment. They can help you to estimate, in advance, how large a difference you need to see between groups to get statistically significant results.
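You can also leave n unspecified and solve for the sample size needed to detect a given difference; a sketch using the same standard deviation (the delta of 5 is an arbitrary illustrative target):
# how many observations per group to detect a difference of 5 with 95% power?
power.t.test(delta = 5, sd = 8.9, sig.level = .05, power = .95)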
If you are planning an experiment in which you will use a t-test to check the significance of the results (typically, an experiment where you calculate the mean value of a random variable for a “test” population and a “control” population), then you can use the power.t.test function to help design the experiment. For example:
power.t.test(power = .90, delta = 1)##
## Two-sample t test power calculation
##
## n = 22.0211
## delta = 1
## sd = 1
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.t.test(power = .95, delta = 1)##
## Two-sample t test power calculation
##
## n = 26.98922
## delta = 1
## sd = 1
## sig.level = 0.05
## power = 0.95
## alternative = two.sided
##
## NOTE: n is number in *each* group
For experiments that compare proportions, you can calculate power with the power.prop.test function:
power.prop.test(n = NULL, p1 = NULL, p2 = NULL, sig.level = 0.05,
power = NULL,
alternative = c("two.sided", "one.sided"),
strict = FALSE, tol = .Machine$double.eps^0.25)
For this function, \(n\) specifies the number of observations (per group), \(p_1\) is the probability of success in one group, and \(p_2\) is the probability of success in the other group.
power.prop.test(n = 50, p1 = .50, p2 = .75) ##
## Two-sample comparison of proportions power calculation
##
## n = 50
## p1 = 0.5
## p2 = 0.75
## sig.level = 0.05
## power = 0.7401659
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.prop.test(n = 100, p1 = .50, p2 = .75)##
## Two-sample comparison of proportions power calculation
##
## n = 100
## p1 = 0.5
## p2 = 0.75
## sig.level = 0.05
## power = 0.9600175
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.prop.test(n = 300, p1 = .50, p2 = .75)##
## Two-sample comparison of proportions power calculation
##
## n = 300
## p1 = 0.5
## p2 = 0.75
## sig.level = 0.05
## power = 0.9999969
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.prop.test(p1 = .50, p2 = .75, power = .90)##
## Two-sample comparison of proportions power calculation
##
## n = 76.70693
## p1 = 0.5
## p2 = 0.75
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: n is number in *each* group
power.prop.test(p1 = .50, p2 = .75, power = .95)##
## Two-sample comparison of proportions power calculation
##
## n = 94.46783
## p1 = 0.5
## p2 = 0.75
## sig.level = 0.05
## power = 0.95
## alternative = two.sided
##
## NOTE: n is number in *each* group
To calculate power for an ANOVA experiment, use the power.anova.test function. For example:
# ?power.anova.test
power.anova.test(groups = 4, n = 5, between.var = 1, within.var = 3)##
## Balanced one-way analysis of variance power calculation
##
## groups = 4
## n = 5
## between.var = 1
## within.var = 3
## sig.level = 0.05
## power = 0.3535594
##
## NOTE: n is number in each group
power.anova.test(groups = 4, n = 50, between.var = 1, within.var = 3)##
## Balanced one-way analysis of variance power calculation
##
## groups = 4
## n = 50
## between.var = 1
## within.var = 3
## sig.level = 0.05
## power = 0.9999952
##
## NOTE: n is number in each group
power.anova.test(groups = 4, between.var = 1, within.var = 3,
power = .80)##
## Balanced one-way analysis of variance power calculation
##
## groups = 4
## n = 11.92613
## between.var = 1
## within.var = 3
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
power.anova.test(groups = 4, between.var = 1, within.var = 3,
power = .95)##
## Balanced one-way analysis of variance power calculation
##
## groups = 4
## n = 18.18245
## between.var = 1
## within.var = 3
## sig.level = 0.05
## power = 0.95
##
## NOTE: n is number in each group
A regression model shows how a continuous value (called the response variable, or the dependent variable) is related to a set of other values (called the predictors, stimulus variables, or independent variables). Often, a regression model is used to predict values where they are unknown. For example, warfarin is a drug commonly used as a blood thinner or anticoagulant. A doctor might use a regression model to predict the correct dose of warfarin to give a patient based on several known variables about the patient (such as the patient’s weight). Another example of a regression model might be for marketing financial products. An analyst might estimate the average balance of a credit card customer (which, in turn, affects the expected revenue from that customer).
Sometimes, a regression model is simply used to explain a phenomenon, but not to actually predict values. For example, a scientist might suspect that weight is correlated to consumption of certain types of foods, but wants to adjust for a variety of factors, including age, exercise, genetics (and, hopefully, other factors). The scientist could use a regression model to help show the relationship between weight and food consumed by including other variables in the regression. Models can be used for many other purposes, including visualizing trends, analysis of variance tests, and testing variable significance.
A linear regression assumes that there is a linear relationship between the response variable and the predictors. Specifically, a linear regression assumes that a response variable \(y\) is a linear function of a set of predictor variables \(x_1, \cdots,x_p.\)
Let’s walk through lm with a simple regression example. The dataset women in the base installation provides the height and weight for a set of 15 women ages 30 to 39. Suppose you want to predict weight from height. Having an equation for predicting weight from height can help you to identify overweight or underweight individuals. The analysis is provided in the following listing:
fit <- lm(weight ~ height, data=women)
coef(fit)## (Intercept) height
## -87.51667 3.45000
summary(fit)##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
predict(fit)## 1 2 3 4 5 6 7 8
## 112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333
## 9 10 11 12 13 14 15
## 140.1833 143.6333 147.0833 150.5333 153.9833 157.4333 160.8833
From the output, we can see that the prediction equation is \[\widehat{Weight} = -87.52 + 3.45 \times Height.\]
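To apply this equation to new heights, pass a data frame of new values to predict (the heights below are arbitrary illustrative values):
predict(fit, newdata = data.frame(height = c(60, 65, 70)))  # predicted weights in pounds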
plot(women$height,women$weight,
xlab="Height (in inches)",
ylab="Weight (in pounds)")
abline(fit)
Next, let’s work through a larger example: modeling the number of runs scored by major league baseball teams from their batting statistics, using the team.batting.00to08 data set.
library(nutshell)
library(lattice)
data(team.batting.00to08)
Let’s look at scatter plots of runs versus each other variable, so that we can see which variables are likely to be most important. We’ll create a data frame for plotting, using the make.groups function:
attach(team.batting.00to08)
forplot <- make.groups(
singles = data.frame(value=singles, teamID, yearID, runs),
doubles = data.frame(value=doubles, teamID, yearID, runs),
triples = data.frame(value=triples, teamID, yearID, runs),
homeruns = data.frame(value=homeruns, teamID, yearID, runs),
walks = data.frame(value=walks, teamID, yearID, runs),
stolenbases = data.frame(value=stolenbases, teamID, yearID, runs),
caughtstealing = data.frame(value=caughtstealing, teamID, yearID,
runs),
hitbypitch = data.frame(value=hitbypitch, teamID, yearID, runs),
sacrificeflies = data.frame(value=sacrificeflies, teamID, yearID,
runs))
detach(team.batting.00to08)
Now, we’ll generate the scatter plots using the xyplot function:
xyplot(runs~value|which, data=forplot,
scales=list(relation='free'),
pch=19, cex=.2,
strip=strip.custom(strip.levels=TRUE,
horizontal=TRUE,
par.strip.text=list(cex=.8)))
Next, we fit the model with the lm function, which fits a linear model using ordinary least squares:
runs.lm <- lm(formula = runs~singles+doubles+triples+homeruns+walks+
hitbypitch+sacrificeflies+stolenbases+caughtstealing,
data=team.batting.00to08)
To inspect a fitted model object (such as the one returned by lm), the best place to start is with the print method. If the R console is used, we can simply type the name of the returned object on the console to see the results of print:
runs.lm##
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns +
## walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing,
## data = team.batting.00to08)
##
## Coefficients:
## (Intercept) singles doubles triples
## -507.16020 0.56705 0.69110 1.15836
## homeruns walks hitbypitch sacrificeflies
## 1.47439 0.30118 0.37750 0.87218
## stolenbases caughtstealing
## 0.04369 -0.01533
Use the formula function to show the formula used to fit the model:
formula(runs.lm)## runs ~ singles + doubles + triples + homeruns + walks + hitbypitch +
## sacrificeflies + stolenbases + caughtstealing
Use the coef function to get the list of coefficients for a model object:
coef(runs.lm)## (Intercept) singles doubles triples homeruns
## -507.16019759 0.56704867 0.69110420 1.15836091 1.47438916
## walks hitbypitch sacrificeflies stolenbases caughtstealing
## 0.30117665 0.37749717 0.87218094 0.04369407 -0.01533245
The summary function can be used to get a summary of a linear model object:
summary(runs.lm)##
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns +
## walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing,
## data = team.batting.00to08)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.902 -11.828 -0.419 14.658 61.874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -507.16020 32.34834 -15.678 < 2e-16 ***
## singles 0.56705 0.02601 21.801 < 2e-16 ***
## doubles 0.69110 0.05922 11.670 < 2e-16 ***
## triples 1.15836 0.17309 6.692 1.34e-10 ***
## homeruns 1.47439 0.05081 29.015 < 2e-16 ***
## walks 0.30118 0.02309 13.041 < 2e-16 ***
## hitbypitch 0.37750 0.11006 3.430 0.000702 ***
## sacrificeflies 0.87218 0.19179 4.548 8.33e-06 ***
## stolenbases 0.04369 0.05951 0.734 0.463487
## caughtstealing -0.01533 0.15550 -0.099 0.921530
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.21 on 260 degrees of freedom
## Multiple R-squared: 0.9144, Adjusted R-squared: 0.9114
## F-statistic: 308.6 on 9 and 260 DF, p-value: < 2.2e-16
Use the residuals function to get the vector of residuals from a linear model fit:
residuals(runs.lm)[1:12]## 1 2 3 4 5 6
## -35.572347 -10.526473 -5.920721 57.798284 -4.279294 -4.668995
## 7 8 9 10 11 12
## 29.181937 -7.419778 8.319662 52.299908 32.910063 22.377282
Use the fitted function to get a vector of fitted values:
fitted(runs.lm)[1:16]## 1 2 3 4 5 6 7 8
## 899.5723 804.5265 797.9207 920.2017 954.2793 827.6690 849.8181 755.4198
## 9 10 11 12 13 14 15 16
## 862.6803 894.7001 874.0899 710.6227 852.8527 867.1590 785.1005 795.8503
Use the confint function to compute confidence intervals for the coefficients in the fitted model:
confint(runs.lm)## 2.5 % 97.5 %
## (Intercept) -570.85828008 -443.4621151
## singles 0.51583022 0.6182671
## doubles 0.57449582 0.8077126
## triples 0.81752968 1.4991921
## homeruns 1.37432941 1.5744489
## walks 0.25570041 0.3466529
## hitbypitch 0.16077399 0.5942203
## sacrificeflies 0.49451857 1.2498433
## stolenbases -0.07349342 0.1608816
## caughtstealing -0.32152716 0.2908623
Use the influence function to compute the influence of each observation on the fitted model:
influence(runs.lm)
Use influence.measures for more readable output:
influence.measures(runs.lm)
Use anova to compute an analysis of variance table for the fitted model (the anova.lmlist method is used when several models are compared):
anova(runs.lm)
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| singles | 1 | 2.157547e+05 | 2.157547e+05 | 400.4654729 | 0.0000000 |
| doubles | 1 | 3.565879e+05 | 3.565879e+05 | 661.8679941 | 0.0000000 |
| triples | 1 | 2.372170e+02 | 2.372170e+02 | 0.4403019 | 0.5075647 |
| homeruns | 1 | 7.900513e+05 | 7.900513e+05 | 1466.4255566 | 0.0000000 |
| walks | 1 | 1.143772e+05 | 1.143772e+05 | 212.2970903 | 0.0000000 |
| hitbypitch | 1 | 7.396397e+03 | 7.396397e+03 | 13.7285592 | 0.0002580 |
| sacrificeflies | 1 | 1.172572e+04 | 1.172572e+04 | 21.7642718 | 0.0000049 |
| stolenbases | 1 | 3.573245e+02 | 3.573245e+02 | 0.6632351 | 0.4161654 |
| caughtstealing | 1 | 5.238073e+00 | 5.238073e+00 | 0.0097225 | 0.9215298 |
| Residuals | 260 | 1.400776e+05 | 5.387599e+02 | NA | NA |
Interestingly, in this sequential ANOVA table it appears that triples, stolen bases, and times caught stealing do not add significant explanatory power (although the coefficient for triples was significant in the summary output above, where each term is adjusted for all the others).
Use the effects function to get the orthogonal effects computed during the fit:
effects(runs.lm)[1:20]## (Intercept) singles doubles triples homeruns
## -1.270491e+04 -4.644941e+02 -5.971498e+02 1.540185e+01 -8.888483e+02
## walks hitbypitch sacrificeflies stolenbases caughtstealing
## -3.381969e+02 8.600231e+01 1.082853e+02 -1.890303e+01 -2.288684e+00
##
## 4.071693e+01 2.903559e+01 2.025866e+00 -4.445514e-01 3.013846e+00
##
## 1.931702e+01 -3.349570e-01 -3.185715e+01 4.476431e+01 -1.258820e+01
Use the vcov function to get the variance-covariance matrix of the estimated coefficients:
vcov(runs.lm)## (Intercept) singles doubles triples
## (Intercept) 1046.4149572 -6.275356e-01 -6.908905e-01 -0.8115627984
## singles -0.6275356 6.765565e-04 -1.475026e-04 0.0001538296
## doubles -0.6908905 -1.475026e-04 3.506798e-03 -0.0013459187
## triples -0.8115628 1.538296e-04 -1.345919e-03 0.0299591843
## homeruns -0.3190194 2.314669e-04 -3.940172e-04 0.0011510663
## walks -0.2515630 7.950878e-05 -9.902388e-05 0.0004174548
## hitbypitch -0.9002974 3.385518e-04 -4.090707e-04 0.0018360831
## sacrificeflies 1.6870020 -1.723732e-03 -2.253712e-03 -0.0051709718
## stolenbases 0.2153275 -3.041450e-04 2.871078e-04 -0.0009794480
## caughtstealing -1.4370890 3.126387e-04 1.466032e-03 -0.0016038175
## homeruns walks hitbypitch sacrificeflies
## (Intercept) -3.190194e-01 -2.515630e-01 -0.9002974059 1.6870019518
## singles 2.314669e-04 7.950878e-05 0.0003385518 -0.0017237324
## doubles -3.940172e-04 -9.902388e-05 -0.0004090707 -0.0022537124
## triples 1.151066e-03 4.174548e-04 0.0018360831 -0.0051709718
## homeruns 2.582082e-03 -4.007590e-04 -0.0008183475 -0.0005078943
## walks -4.007590e-04 5.333599e-04 0.0002219440 -0.0010962381
## hitbypitch -8.183475e-04 2.219440e-04 0.0121132852 -0.0011315622
## sacrificeflies -5.078943e-04 -1.096238e-03 -0.0011315622 0.0367839752
## stolenbases -2.041656e-06 -1.400052e-04 -0.0001197102 -0.0004636454
## caughtstealing 3.469784e-04 6.008766e-04 0.0001742039 -0.0024880710
## stolenbases caughtstealing
## (Intercept) 2.153275e-01 -1.4370889812
## singles -3.041450e-04 0.0003126387
## doubles 2.871078e-04 0.0014660316
## triples -9.794480e-04 -0.0016038175
## homeruns -2.041656e-06 0.0003469784
## walks -1.400052e-04 0.0006008766
## hitbypitch -1.197102e-04 0.0001742039
## sacrificeflies -4.636454e-04 -0.0024880710
## stolenbases 3.541716e-03 -0.0050935339
## caughtstealing -5.093534e-03 0.0241794596
Use the deviance function to return the deviance of the fitted model (though this value is just the residual sum of squares in this case, because runs.lm is a linear model):
deviance(runs.lm)## [1] 140077.6
Use the update function to refit a model. This can save you some typing if you are using R interactively, and it can save on computation time for large data sets. The update function refits the model after changing the formula (perhaps adding or subtracting a term) or even after changing the data frame. For example, let’s fit a slightly different model to the data above. We’ll omit the variable sacrificeflies and add 0 to the formula (which means to fit the model with no intercept):
runs.lm2 <- update(runs.lm,formula=runs ~ singles + doubles +
triples + homeruns + walks + hitbypitch +
stolenbases + caughtstealing + 0)
runs.lm2##
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns +
## walks + hitbypitch + stolenbases + caughtstealing - 1, data = team.batting.00to08)
##
## Coefficients:
## singles doubles triples homeruns
## 0.29823 0.41280 0.95664 1.31945
## walks hitbypitch stolenbases caughtstealing
## 0.21352 -0.07471 0.18828 -0.70334
Now let’s look more closely at what lm does and how you can control it. A linear regression model is appropriate when the response variable (the thing that you want to predict) can be estimated from a linear function of the predictor variables (the information that you know). Technically, we assume that:
\[\begin{align}
y = c_0+c_1x_1 + \cdots + c_px_p + \varepsilon.
\end{align}\]
where \(y\) is the response variable, \(x_1,\cdots,x_p\) are the predictor variables (or predictors), \(c_1,\cdots,c_p\) are the coefficients for the predictor variables, \(c_0\) is the intercept, and \(\varepsilon\) is the error term.
Suppose that you have a matrix of observed predictor variables \(X_{n\times p}\) and a vector of response variables \(Y=(Y_1,\cdots,Y_n)\). We have assumed a linear model, so given a set of coefficient \(c_0\) and \(c=( c_1, \cdots, c_p)^T\), we can calculate a set of estimates \(\hat{Y}\) for the input data \(X\) by calculating \(\hat{Y}=c_0 + Xc\). The differences between the estimates \(\hat{Y}\) and the actual values \(Y\) are called the residuals. You can think of the residuals as a measure of the prediction error; small residuals mean that the predicted values are close to the actual values. We assume that the expected difference between the actual response values and the residual values (the error term in the model) is 0. This is important to remember: at best, a model is probabilistic.
Our goal is to find the set of coefficients \(c_0\) and \(c\) that does the best job of estimating \(Y\) given \(X\); we’d like the estimates \(\hat{Y}\) to be as close as possible to \(Y\). In a classical linear regression model, we find coefficients \(c_0\) and \(c\) that minimize the sum of squared differences between the estimates \(\hat{Y}\) and the observed values \(Y\). Specifically, we want to find values for \(c_0\) and \(c\) that minimize \[\sum_{i=1}^{n}\left(Y_i - c_0 - c_1 x_{i1} - \cdots - c_p x_{ip}\right)^2.\]
This is called the least squares method for regression. The lm function in R estimates the coefficients of a linear model by this method.
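To connect the formula to the computation, here is a minimal sketch that computes least squares coefficients directly from the normal equations for a few of the batting variables (lm itself uses a QR decomposition, but the estimates are the same):
# design matrix with an intercept column, built from the batting data used earlier
X <- model.matrix(~ singles + doubles + homeruns, data = team.batting.00to08)
y <- team.batting.00to08$runs
solve(t(X) %*% X, t(X) %*% y)               # normal-equation estimates
coef(lm(runs ~ singles + doubles + homeruns,
        data = team.batting.00to08))        # same values from lm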
Technically, linear regression is not always appropriate. Ordinary least squares (OLS) regression (implemented through lm) is only guaranteed to work well when certain properties of the training data hold. Here are the key assumptions: the response is a linear function of the predictor variables plus an error term (linearity); the matrix of predictor variables has full rank (no predictor is an exact linear combination of the others); the errors have an expected value of zero and are uncorrelated with the predictors; and the errors have constant variance and are not correlated with one another (no heteroscedasticity or autocorrelation).
By the way, it’s perfectly OK for there to be a nonlinear relationship between some of the predictor variables. Suppose that one of the variables is age. You could add \(age^2\), \(\log(age)\), or other nonlinear mathematical expressions using age to the model and not violate the assumptions above. You are effectively defining a set of new predictor variables: \(w_1 = age\), \(w_2 = age^2\), \(w_3 = \log(age)\). This doesn’t violate the linearity assumption (because the model is still a linear function of the predictor variables) or the full rank assumption (as long as the relationship between the new variables is not linear).
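A small sketch of how such transformed terms are written in an R formula, using simulated data (the variables age and outcome here are made up for illustration):
set.seed(1)
d <- data.frame(age = runif(100, 20, 60))
d$outcome <- 2 + 0.5 * d$age + 0.01 * d$age^2 + rnorm(100)
# I() protects arithmetic inside a formula, so age^2 enters as a separate predictor
fit.nonlinear <- lm(outcome ~ age + I(age^2) + log(age), data = d)
summary(fit.nonlinear)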
You can check these assumptions with diagnostic tests.
To test for heteroscedasticity, use ncvTest in the car (Companion to Applied Regression) package, which implements the Breusch-Pagan test. (Alternatively, you could use the bptest function in the lmtest library, which implements the same test. The lmtest library includes a number of other functions for testing for heteroscedasticity; see the documentation for more details.)
To test for autocorrelated errors, use durbinWatsonTest in the car package, which implements the Durbin-Watson test. You can also use the function dwtest in the library lmtest by specifying a formula and a data set. (Alternatively, you could use the function bgtest in the lmtest package, which implements the Breusch-Godfrey test. This function also tests for higher-order disturbances.)
To catch a predictor matrix that is not of full rank, you can specify the singular.ok=FALSE argument in lm, which makes lm stop with an error instead of reporting NA for redundant coefficients.
Incidentally, the example used in “Example: A Simple Linear Model” is not heteroscedastic:
# install.packages('car')
library(car)## Loading required package: carData
ncvTest(runs.lm)## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.411893, Df = 1, p = 0.23474
Nor is there a problem with autocorrelation:
durbinWatsonTest(runs.lm)## lag Autocorrelation D-W Statistic p-value
## 1 0.003318923 1.983938 0.834
## Alternative hypothesis: rho != 0
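The lmtest functions mentioned above can run the same checks directly on the fitted model; a sketch (assuming the lmtest package is installed):
# install.packages('lmtest')
library(lmtest)
bptest(runs.lm)   # Breusch-Pagan test for heteroscedasticity
dwtest(runs.lm)   # Durbin-Watson test for autocorrelated errors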
Or with singularity:
runs.lm <- lm(formula = runs~singles+doubles+triples+homeruns+walks+
hitbypitch+sacrificeflies+stolenbases+caughtstealing,
data = team.batting.00to08, singular.ok=FALSE)
runs.lm##
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns +
## walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing,
## data = team.batting.00to08, singular.ok = FALSE)
##
## Coefficients:
## (Intercept) singles doubles triples
## -507.16020 0.56705 0.69110 1.15836
## homeruns walks hitbypitch sacrificeflies
## 1.47439 0.30118 0.37750 0.87218
## stolenbases caughtstealing
## 0.04369 -0.01533
summary(runs.lm)##
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns +
## walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing,
## data = team.batting.00to08, singular.ok = FALSE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.902 -11.828 -0.419 14.658 61.874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -507.16020 32.34834 -15.678 < 2e-16 ***
## singles 0.56705 0.02601 21.801 < 2e-16 ***
## doubles 0.69110 0.05922 11.670 < 2e-16 ***
## triples 1.15836 0.17309 6.692 1.34e-10 ***
## homeruns 1.47439 0.05081 29.015 < 2e-16 ***
## walks 0.30118 0.02309 13.041 < 2e-16 ***
## hitbypitch 0.37750 0.11006 3.430 0.000702 ***
## sacrificeflies 0.87218 0.19179 4.548 8.33e-06 ***
## stolenbases 0.04369 0.05951 0.734 0.463487
## caughtstealing -0.01533 0.15550 -0.099 0.921530
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.21 on 260 degrees of freedom
## Multiple R-squared: 0.9144, Adjusted R-squared: 0.9114
## F-statistic: 308.6 on 9 and 260 DF, p-value: < 2.2e-16
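If the lmtest and car packages are installed, you can run similar diagnostics directly on the fitted model object. The sketch below is illustrative and assumes both packages are available:
# install.packages(c('lmtest', 'car'))
library(lmtest)
library(car)
bptest(runs.lm)   # Breusch-Pagan test for heteroscedasticity
dwtest(runs.lm)   # Durbin-Watson test for autocorrelated errors
bgtest(runs.lm)   # Breusch-Godfrey test; can also check higher-order serial correlation
vif(runs.lm)      # variance inflation factors; large values warn of near-collinearity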
When you fit a model with lm, it may estimate coefficients for variables that aren’t really needed. You can manually tune a model using diagnostics like summary and lm.influence, but you can also use other statistical techniques to reduce the effect of insignificant variables or to remove them from the model altogether.
A simple technique for selecting the most important variables is stepwise variable selection. The stepwise algorithm works by repeatedly adding or removing variables from the model, trying to “improve” the model at each step. When the algorithm can no longer improve the model by adding or removing variables, it stops and returns the new (and usually smaller) model.
Note that “improvement” does not just mean reducing the residual sum of squares (RSS) for the fitted model. Adding an additional variable to a model will not increase the RSS (see a statistics book for an explanation of why), but it does increase model complexity. Typically, AIC (Akaike’s information criterion) is used to measure the value of each additional variable. The AIC is defined as \[AIC = -2\log(L) + k \cdot edf,\] where \(L\) is the likelihood and \(edf\) is the equivalent degrees of freedom (for ordinary linear models, the number of estimated parameters); by default, step uses \(k = 2\), the classical AIC.
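You can compute these quantities for a fitted model directly: extractAIC returns the equivalent degrees of freedom and the AIC value that step uses internally, while AIC reports the value on the full log-likelihood scale, so the two differ by an additive constant. A quick sketch:
extractAIC(runs.lm)   # equivalent degrees of freedom and the AIC used by step
AIC(runs.lm)          # AIC from the full log-likelihood; differs by a constant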
In R, you perform stepwise selection through the step function:
runs.lm.st <- step(runs.lm)## Start: AIC=1707.91
## runs ~ singles + doubles + triples + homeruns + walks + hitbypitch +
## sacrificeflies + stolenbases + caughtstealing
##
## Df Sum of Sq RSS AIC
## - caughtstealing 1 5 140083 1705.9
## - stolenbases 1 290 140368 1706.5
## <none> 140078 1707.9
## - hitbypitch 1 6338 146416 1717.9
## - sacrificeflies 1 11142 151219 1726.6
## - triples 1 24130 164207 1748.8
## - doubles 1 73379 213457 1819.7
## - walks 1 91626 231703 1841.8
## - singles 1 256054 396132 1986.6
## - homeruns 1 453575 593653 2095.8
##
## Step: AIC=1705.92
## runs ~ singles + doubles + triples + homeruns + walks + hitbypitch +
## sacrificeflies + stolenbases
##
## Df Sum of Sq RSS AIC
## - stolenbases 1 357 140440 1704.6
## <none> 140083 1705.9
## - hitbypitch 1 6342 146425 1715.9
## - sacrificeflies 1 11179 151262 1724.7
## - triples 1 24173 164256 1746.9
## - doubles 1 75490 215573 1820.3
## - walks 1 94503 234586 1843.1
## - singles 1 257773 397856 1985.8
## - homeruns 1 454587 594670 2094.3
##
## Step: AIC=1704.61
## runs ~ singles + doubles + triples + homeruns + walks + hitbypitch +
## sacrificeflies
##
## Df Sum of Sq RSS AIC
## <none> 140440 1704.6
## - hitbypitch 1 6390 146830 1714.6
## - sacrificeflies 1 11726 152166 1724.3
## - triples 1 25688 166129 1748.0
## - doubles 1 76598 217038 1820.1
## - walks 1 94655 235095 1841.7
## - singles 1 270589 411030 1992.6
## - homeruns 1 454230 594670 2092.3
runs.lm.st##
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns +
## walks + hitbypitch + sacrificeflies, data = team.batting.00to08,
## singular.ok = FALSE)
##
## Coefficients:
## (Intercept) singles doubles triples
## -506.6389 0.5712 0.6823 1.1789
## homeruns walks hitbypitch sacrificeflies
## 1.4734 0.3018 0.3790 0.8868
summary(runs.lm.st)##
## Call:
## lm(formula = runs ~ singles + doubles + triples + homeruns +
## walks + hitbypitch + sacrificeflies, data = team.batting.00to08,
## singular.ok = FALSE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.667 -11.354 -0.668 14.719 61.992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -506.63888 30.87161 -16.411 < 2e-16 ***
## singles 0.57115 0.02542 22.468 < 2e-16 ***
## doubles 0.68227 0.05707 11.954 < 2e-16 ***
## triples 1.17894 0.17030 6.923 3.41e-11 ***
## homeruns 1.47344 0.05062 29.110 < 2e-16 ***
## walks 0.30178 0.02271 13.289 < 2e-16 ***
## hitbypitch 0.37897 0.10976 3.453 0.000647 ***
## sacrificeflies 0.88679 0.18960 4.677 4.67e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.15 on 262 degrees of freedom
## Multiple R-squared: 0.9142, Adjusted R-squared: 0.9119
## F-statistic: 398.7 on 7 and 262 DF, p-value: < 2.2e-16
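To confirm that dropping stolenbases and caughtstealing does not meaningfully worsen the fit, you could compare the reduced model with the full model through a nested-model F test; a brief sketch:
# A large p-value suggests the dropped terms add little explanatory power
anova(runs.lm.st, runs.lm)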
Suppose that you were trying to estimate the probability of a certain outcome (which we’ll call A) for a categorical variable with two values. You could try to predict the probability of A as a linear function of the predictor variables, assuming \(Pr(A) = y = c_0 + c_1x_1 + \cdots + c_px_p\). The problem with this approach is that the value of \(y\) is unconstrained, while probabilities are only valid for values between 0 and 1. A good way to deal with this problem is to pick a function for \(y\) that varies between 0 and 1 for all possible predictor values. If we use that function as the link function in a generalized linear model, we can build a model that estimates the probability of different outcomes. That is the idea behind logistic regression.
In a logistic regression, the relationship between the predictor variables and the probability that an observation belongs to a given class is given by the logistic function: \[Pr(A) = \frac{e^{\eta}}{1+e^{\eta}},\] where \(\eta = c_0 + c_1x_1 + \cdots + c_px_p\) is the linear predictor. Equivalently, in terms of log odds, \[\log\left(\frac{Pr(A)}{1-Pr(A)}\right)=c_0 + c_1x_1 + \cdots + c_px_p.\]
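The logistic function is available in R as plogis (the cumulative distribution function of the standard logistic distribution), and its inverse, the logit or log-odds function, as qlogis. A quick sketch:
eta <- c(-5, -1, 0, 1, 5)   # some values of the linear predictor
plogis(eta)                 # equals exp(eta) / (1 + exp(eta)); always between 0 and 1
qlogis(plogis(eta))         # the logit recovers the original linear predictor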
In R, logistic regression models are fitted with the glm function, using the argument family=binomial().
For the first example, let’s explore the data on infidelity contained in the data frame Affairs, provided with the AER package.
The data frame Affairs contains 9 variables collected on 601 participants and includes how often the respondent engaged in extramarital sexual intercourse during the past year, as well as their gender, age, years married, whether they had children, their religiousness (on a 5-point scale from 1=anti to 5=very), education, occupation (Hollingshead 7-point classification with reverse numbering), and a numeric self-rating of their marriage (from 1=very unhappy to 5=very happy). Let’s look at some descriptive statistics:
# install.packages('AER')
data(Affairs, package="AER")
summary(Affairs)## affairs gender age yearsmarried children
## Min. : 0.000 female:315 Min. :17.50 Min. : 0.125 no :171
## 1st Qu.: 0.000 male :286 1st Qu.:27.00 1st Qu.: 4.000 yes:430
## Median : 0.000 Median :32.00 Median : 7.000
## Mean : 1.456 Mean :32.49 Mean : 8.178
## 3rd Qu.: 0.000 3rd Qu.:37.00 3rd Qu.:15.000
## Max. :12.000 Max. :57.00 Max. :15.000
## religiousness education occupation rating
## Min. :1.000 Min. : 9.00 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:14.00 1st Qu.:3.000 1st Qu.:3.000
## Median :3.000 Median :16.00 Median :5.000 Median :4.000
## Mean :3.116 Mean :16.17 Mean :4.195 Mean :3.932
## 3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:6.000 3rd Qu.:5.000
## Max. :5.000 Max. :20.00 Max. :7.000 Max. :5.000
Although the number of indiscretions was recorded, your interest here is in the binary outcome (had an affair/didn’t have an affair). You can transform affairs into a dichotomous factor called ynaffair with the following code:
Affairs$ynaffair[Affairs$affairs > 0] <- 1
Affairs$ynaffair[Affairs$affairs == 0] <- 0
Affairs$ynaffair <- factor(Affairs$ynaffair,
levels=c(0,1), labels=c("No","Yes"))
table(Affairs$ynaffair)##
## No Yes
## 451 150
This dichotomous factor can now be used as the outcome variable in a logistic regression model:
fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children +
religiousness + education + occupation +rating,
data=Affairs, family=binomial())
Just like lm, glm prints nothing when its result is assigned to a variable. As with lm, we can extract the fitted coefficients with coef and get more detailed results about the model object with the summary method:
coef(fit.full)## (Intercept) gendermale age yearsmarried childrenyes
## 1.37725816 0.28028665 -0.04425502 0.09477302 0.39767213
## religiousness education occupation rating
## -0.32472063 0.02105086 0.03091971 -0.46845426
summary(fit.full)##
## Call:
## glm(formula = ynaffair ~ gender + age + yearsmarried + children +
## religiousness + education + occupation + rating, family = binomial(),
## data = Affairs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5713 -0.7499 -0.5690 -0.2539 2.5191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.37726 0.88776 1.551 0.120807
## gendermale 0.28029 0.23909 1.172 0.241083
## age -0.04426 0.01825 -2.425 0.015301 *
## yearsmarried 0.09477 0.03221 2.942 0.003262 **
## childrenyes 0.39767 0.29151 1.364 0.172508
## religiousness -0.32472 0.08975 -3.618 0.000297 ***
## education 0.02105 0.05051 0.417 0.676851
## occupation 0.03092 0.07178 0.431 0.666630
## rating -0.46845 0.09091 -5.153 2.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 675.38 on 600 degrees of freedom
## Residual deviance: 609.51 on 592 degrees of freedom
## AIC: 627.51
##
## Number of Fisher Scoring iterations: 4
Several of the predictors (gender, children, education, and occupation) do not make a statistically significant contribution, so we can use the step function to reduce the model:
fit.reduce <- step(fit.full, direction = 'both')## Start: AIC=627.51
## ynaffair ~ gender + age + yearsmarried + children + religiousness +
## education + occupation + rating
##
## Df Deviance AIC
## - education 1 609.68 625.68
## - occupation 1 609.70 625.70
## - gender 1 610.89 626.89
## - children 1 611.40 627.40
## <none> 609.51 627.51
## - age 1 615.67 631.67
## - yearsmarried 1 618.34 634.34
## - religiousness 1 622.92 638.92
## - rating 1 636.75 652.75
##
## Step: AIC=625.68
## ynaffair ~ gender + age + yearsmarried + children + religiousness +
## occupation + rating
##
## Df Deviance AIC
## - occupation 1 610.15 624.15
## - gender 1 611.29 625.29
## - children 1 611.62 625.62
## <none> 609.68 625.68
## + education 1 609.51 627.51
## - age 1 615.78 629.78
## - yearsmarried 1 618.46 632.46
## - religiousness 1 623.27 637.27
## - rating 1 636.93 650.93
##
## Step: AIC=624.15
## ynaffair ~ gender + age + yearsmarried + children + religiousness +
## rating
##
## Df Deviance AIC
## - children 1 611.86 623.86
## <none> 610.15 624.15
## - gender 1 613.41 625.41
## + occupation 1 609.68 625.68
## + education 1 609.70 625.70
## - age 1 616.00 628.00
## - yearsmarried 1 619.07 631.07
## - religiousness 1 623.98 635.98
## - rating 1 637.23 649.23
##
## Step: AIC=623.86
## ynaffair ~ gender + age + yearsmarried + religiousness + rating
##
## Df Deviance AIC
## <none> 611.86 623.86
## + children 1 610.15 624.15
## - gender 1 615.36 625.36
## + education 1 611.45 625.45
## + occupation 1 611.62 625.62
## - age 1 618.05 628.05
## - religiousness 1 625.57 635.57
## - yearsmarried 1 626.23 636.23
## - rating 1 639.93 649.93
summary(fit.reduce)##
## Call:
## glm(formula = ynaffair ~ gender + age + yearsmarried + religiousness +
## rating, family = binomial(), data = Affairs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5623 -0.7495 -0.5664 -0.2671 2.3975
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.94760 0.61234 3.181 0.001470 **
## gendermale 0.38612 0.20703 1.865 0.062171 .
## age -0.04393 0.01806 -2.432 0.015011 *
## yearsmarried 0.11133 0.02983 3.732 0.000190 ***
## religiousness -0.32714 0.08947 -3.656 0.000256 ***
## rating -0.46721 0.08928 -5.233 1.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 675.38 on 600 degrees of freedom
## Residual deviance: 611.86 on 595 degrees of freedom
## AIC: 623.86
##
## Number of Fisher Scoring iterations: 4
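The coefficients are on the log-odds (logit) scale; exponentiating them expresses each effect as an odds ratio, which is often easier to interpret. A brief sketch:
# Odds ratios for the reduced model
exp(coef(fit.reduce))
For example, exp(0.11133) is roughly 1.12, so the odds of an affair are multiplied by about 1.12 for each additional year of marriage, holding the other predictors constant.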
To get a vector of fitted values, use the fitted function:
pre <- fitted(fit.reduce)
pre[1:12]## 4 5 11 16 23 29
## 0.26337147 0.12233613 0.50397675 0.07785260 0.35318719 0.09267826
## 44 45 47 49 50 55
## 0.27067889 0.26439500 0.49222646 0.10810251 0.68776326 0.17017760
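You can also use predict with type="response" to obtain probabilities for new observations. The sketch below uses hypothetical values: it estimates the probability of an affair at each marital rating for a male respondent, holding the remaining predictors at their sample means:
testdata <- data.frame(rating = 1:5,
                       age = mean(Affairs$age),
                       yearsmarried = mean(Affairs$yearsmarried),
                       religiousness = mean(Affairs$religiousness),
                       gender = factor("male", levels = c("female", "male")))
predict(fit.reduce, newdata = testdata, type = "response")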
library(nutshell)
data(field.goals)
field.goals.forlr <- transform(field.goals,
good=as.factor(ifelse(play.type=="FG good","good","bad")))
dim(field.goals.forlr)## [1] 982 11
field.goals.forlr[1:5,]| home.team | week | qtr | away.team | offense | defense | play.type | player | yards | stadium.type | good |
|---|---|---|---|---|---|---|---|---|---|---|
| ARI | 14 | 3 | WAS | ARI | WAS | FG good | 1-N.Rackers | 20 | Out | good |
| ARI | 2 | 4 | STL | ARI | STL | FG good | 1-N.Rackers | 35 | Out | good |
| ARI | 7 | 3 | TEN | ARI | TEN | FG good | 1-N.Rackers | 24 | Out | good |
| ARI | 12 | 2 | JAC | JAC | ARI | FG good | 10-J.Scobee | 30 | Out | good |
| ARI | 2 | 3 | STL | ARI | STL | FG good | 1-N.Rackers | 48 | Out | good |
Let’s take a quick look at the percentage of good field goals by distance. We’ll start by tabulating the results with the table function:
field.goals.table <- table(field.goals.forlr$good,
field.goals.forlr$yards)
field.goals.table##
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
## bad 0 0 1 1 1 1 0 0 0 3 5 5 2 6 7 5 3 0 4 3 11 6
## good 1 12 24 28 24 29 30 18 27 22 26 32 22 21 30 31 21 25 20 23 29 35
##
## 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
## bad 7 5 6 11 5 9 12 11 10 9 5 8 11 10 3 1 2 1 1 1 1 1
## good 27 32 21 15 24 16 15 26 18 14 11 9 12 10 2 1 3 0 1 0 0 0
##
## 62
## bad 1
## good 0
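The same percentages can be computed directly with prop.table, which divides each column of the table by its column total; a quick sketch:
# Proportion of good attempts at each distance
round(prop.table(field.goals.table, margin = 2)["good", ], 2)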
We’ll also plot the results (as percentages):
plot(colnames(field.goals.table),
field.goals.table["good",]/ (field.goals.table["bad",] +
field.goals.table["good",]),
xlab="Distance (Yards)", ylab="Percent Good")As you can see, field goal percentage tapers off linearly between about 25 and 55 yards (with a few outliers at the end).
To model the probability of a successful field goal using a logistic regression, we would make the following call to glm:
field.goals.mdl <- glm(formula=good~yards,
data=field.goals.forlr,
family="binomial")
field.goals.mdl##
## Call: glm(formula = good ~ yards, family = "binomial", data = field.goals.forlr)
##
## Coefficients:
## (Intercept) yards
## 5.17886 -0.09726
##
## Degrees of Freedom: 981 Total (i.e. Null); 980 Residual
## Null Deviance: 978.9
## Residual Deviance: 861.2 AIC: 865.2
summary(field.goals.mdl)##
## Call:
## glm(formula = good ~ yards, family = "binomial", data = field.goals.forlr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5582 0.2916 0.4664 0.6978 1.3789
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.178856 0.416201 12.443 <2e-16 ***
## yards -0.097261 0.009892 -9.832 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 978.90 on 981 degrees of freedom
## Residual deviance: 861.22 on 980 degrees of freedom
## AIC: 865.22
##
## Number of Fisher Scoring iterations: 5
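As with the affairs model, exponentiating the slope makes it easier to interpret; a brief sketch:
exp(coef(field.goals.mdl)["yards"])   # about 0.91: each extra yard reduces the odds of success by roughly 9%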
Let’s take a quick look at how well this model fits the data.
fg.prob <- function(y) {
  # inverse logit of the fitted linear predictor, using the coefficients from field.goals.mdl
  eta <- 5.178856 - 0.097261 * y
  1 / (1 + exp(-eta))
}
plot(colnames(field.goals.table),
field.goals.table["good",]/ (field.goals.table["bad",] +
field.goals.table["good",]),
xlab="Distance (Yards)", ylab="Percent Good")
lines(15:65, fg.prob(15:65))
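Rather than hard-coding the fitted coefficients in fg.prob, you could overlay the same curve with predict on a grid of distances; a sketch, assuming the plot above is still the active graphics device:
yard.grid <- data.frame(yards = 15:65)
lines(15:65, predict(field.goals.mdl, newdata = yard.grid, type = "response"), lty = 2)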