QG Lecture 10

Exam answers, grades

Grades = c(97, 100, 87, 99, 99, 92, 100, 90, 95, 93, 54, 98, 84, 99, 99, 85, 
    97, 90, 84, 98, 90, 98, 89, 93)
round(summary(Grades))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      54      90      94      92      98     100

Grades.df = data.frame(Student = 1:length(Grades), Grade = Grades)

require(ggplot2)
## Loading required package: ggplot2
p = ggplot(Grades.df, aes(x = Student, y = Grade))
p + geom_point()
p2 = ggplot(Grades.df, aes(x = "", y = Grade))
p2 + geom_boxplot()
p2 + geom_boxplot() + geom_jitter()

Answers to Exam 1.

After today you will know how to perform some common statistical tests with functions in R. You will also learn about graphical inference as a way to check whether what you see in your data is really there.

One Sample Test

Example: Given a sample of FSU student (male) heights in feet, test the hypothesis that the true mean height of the guys at FSU is 6 ft.

ht = c(5.8, 5.9, 5.9, 5.7, 6.3, 6.1, 5.4, 6)
ht.df = data.frame(Student = 1:length(ht), Height = ht)
ggplot(ht.df, aes(x = "", y = Height)) + geom_boxplot() + geom_hline(aes(yintercept = 6), 
    color = "red")

[Figure (chunk dataPlot): boxplot of the height sample with a red reference line at the hypothesized mean of 6 ft]

mean(ht)
## [1] 5.888

To formally test the hypothesis that the true mean height is 6 ft you use the t.test() function. The first argument is the data vector and the second is the hypothesized population mean.

t.test(ht, mu = 6)
## 
##  One Sample t-test
## 
## data:  ht
## t = -1.18, df = 7, p-value = 0.2764
## alternative hypothesis: true mean is not equal to 6
## 95 percent confidence interval:
##  5.662 6.113
## sample estimates:
## mean of x 
##     5.888

The output shows the t-value (-1.1803). The t-value is a statistic computed as

(mean(ht) - 6)/(sd(ht)/sqrt(8))
## [1] -1.18

It also shows the degrees of freedom (7). The degrees of freedom for the t-value is the sample size minus one. There are eight student heights, so df = 7.

Sidenote: degrees of freedom (dof). The degrees of freedom as used here is a statistical term that indicates the number of values in the final calculation of a statistic that are “free” to vary.

Suppose you had n values and the only thing you know about those values is that the mean is 25. The mean is a statistic. You are “free” to choose any value you want for n - 1 of them; the last value is then forced by the requirement that the mean equal 25.
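
A quick illustration in R (the numbers are arbitrary, chosen only to make the point): with n = 4 values constrained to have a mean of 25, you can pick three values freely, but the fourth is then forced.

free = c(20, 30, 28)  # choose any n - 1 = 3 values freely
forced = 4 * 25 - sum(free)  # the last value is fixed by the mean
forced
## [1] 22
mean(c(free, forced))
## [1] 25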

The output from the t.test() function includes the sample mean. The sample mean height is lower than the hypothesized height of 6 ft.

But with only 8 values this amount of “shortness” does not provide enough evidence to reject the null hypothesis that the population height is 6 ft. So you fail to reject the null hypothesis.

This is coded in the p-value, which you can think of as providing evidence in support of the null hypothesis. The smaller the p-value, the less support there is for the null hypothesis.

To compute the p-value you type:

pt(-1.180305, df = 7) * 2
## [1] 0.2764

The pt() function is the cumulative distribution function for the t distribution. The t distribution has the degrees of freedom as a parameter, so you need to include it.

You multiply this probability by 2 since your alternative hypothesis is two-sided. H: mu = 6 vs A: mu not equal 6.

The output provides a 95% confidence interval (CI) about the sample mean. It includes the hypothesized mean height of 6 ft.

The CI is a statistic that gives an estimate of the uncertainty on the sample mean. We state that, based on our sample of 8 guys, our best estimate for the mean height of men at FSU is 5.9 ft with a 95% CI that ranges from 5.7 to 6.1 ft.
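
As a sketch of where the interval comes from, you can reproduce it by hand from the 2.5% and 97.5% quantiles of the t distribution with 7 dof:

mean(ht) + qt(c(0.025, 0.975), df = 7) * sd(ht)/sqrt(8)
## [1] 5.662 6.113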

More about p-values

A p-value is the probability that your data, or data more extreme than observed, could have occurred by chance if the null hypothesis were true. This is what the test is designed to tell you: how unusual are your data, ASSUMING the null is true? A low p-value tells you that your data are unusual.

The p-value is interpreted as evidence against the null hypothesis. A small p-value indicates evidence to reject the null. The interpretation of the p-value as evidence against the null hypothesis is:

   less than 0.01: convincing
        0.01-0.05: moderate 
        0.05-0.15: suggestive, but inconclusive
greater than 0.15: no
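
A hypothetical helper function (not from any package) that maps a p-value onto this scale using cut():

pvalueEvidence = function(p) {
    cut(p, breaks = c(0, 0.01, 0.05, 0.15, 1), 
        labels = c("convincing", "moderate", "suggestive", "none"))
}
pvalueEvidence(0.2764)  # the p-value from the height example
## [1] none
## Levels: convincing moderate suggestive none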

The p-value comes from the tails of the Student t distribution. The distribution describes the sampling variation of the t statistic, which is given by \[ t = \frac{\bar x - \mu_0}{s/\sqrt{n}}, \] where s is the standard deviation of the sample values and n is the sample size. The denominator is the standard error of the mean (the standard deviation divided by the square root of the sample size).

The p-value is computed from the distribution function pt(), representing the Student t distribution.

The distribution function determines the area under the distribution curve to the left of a particular value. The curve is created using the density function, dt().

For example,

curve(dt(x, 7), from = -3, to = 3, lwd = 2)
abline(v = -1.180305, col = "red")
abline(v = 1.180305, col = "red")

[Figure (chunk densityFunction): Student t density with df = 7; red vertical lines at -1.18 and +1.18]

The area under the curve to the left of -1.18 is

pt(-1.180305, 7)
## [1] 0.1382

So 14% of the area lies to the left of -1.180305. The distribution is symmetric so 14% lies to the right of +1.180305. With a two-sided test you add these two fractions to get the p-value.

2 * pt(-1.180305, 7)
## [1] 0.2764
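
As a sanity check, the same p-value comes from numerically integrating the density dt() over the left tail and doubling it (integrate() gives a numerical approximation):

2 * integrate(function(x) dt(x, 7), lower = -Inf, upper = -1.180305)$value
## [1] 0.2764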

Graphical inference

The t-test is an example of statistical inference: drawing conclusions about a population from a sample of data.

This is why statistics is useful: we don't want our conclusions to apply only to a convenient sample of students, but to the university at large.

There are two components to statistical inference: testing (is there a difference?) and estimation (how big is the difference?).

The testing component can be visualized. “Is what we see really there?”

More precisely, is what we see in a plot of the sample an accurate reflection of the entire population?

For example, generate samples with the hypothesized mean value. Then see how these samples compare with your data.

rnorm(8, mean = 6, sd = sd(ht))
## [1] 6.428 6.127 6.375 6.107 6.204 5.685 6.225 6.336

Consider the situation where we try to 'find' our data in a lineup of data generated from the null hypothesis.

By 'find' I mean determine which graph in the set of graphs is drawn with the observed data.

require(nullabor)
fun = null_dist(var = "Height", dist = "normal", params = list(mean = 6, sd = sd(ht)))
inf = lineup(fun, ht.df)
ggplot(inf, aes(x = "", y = Height)) + geom_boxplot() + facet_wrap(~.sample, 
    nrow = 1, ncol = 20)

Can you pick out the accused? A plot of the real data is hidden among 19 innocents: plots of data generated from the null distribution.

If you can spot the real data, then there is some evidence that the accused is different from the innocents.

With a null hypothesis stating that the mean height is 6 ft, the evidence is not strong that the accused is different. You fail to reject the null.

Suppose the null hypothesis states that the mean height is 6.2 ft.

t.test(ht, mu = 6.2)
## 
##  One Sample t-test
## 
## data:  ht
## t = -3.279, df = 7, p-value = 0.01351
## alternative hypothesis: true mean is not equal to 6.2
## 95 percent confidence interval:
##  5.662 6.113
## sample estimates:
## mean of x 
##     5.888

In this case, the sample of students looks unusual (too short) if the true height is 6.2 ft. The p-value is reduced to 0.014.

Note that the CI does not change; it depends only on the sample, not on the hypothesized mean. But also note that the CI does not include 6.2.

Repeat the graphical lineup.

fun = null_dist(var = "Height", dist = "normal", params = list(mean = 6.2, sd = sd(ht)))
## Error: could not find function "null_dist"
inf = lineup(fun, ht.df)
## Error: could not find function "lineup"
ggplot(inf, aes(x = "", y = Height)) + geom_boxplot() + facet_wrap(~.sample, 
    nrow = 1, ncol = 20)
## Error: object 'inf' not found

In this case it is fairly simple to pick out the accused, consistent with the lower p-value.

Graphical inference can be used in a variety of applications. Here is an example from our tornado research, where each innocent is a random distribution of tornado touchdown locations.

http://myweb.fsu.edu/jelsner/TornadoLineup.pdf

Can you pick out the accused?

Two Sample Tests

With two samples of data the null hypothesis is that both come from distributions with the same mean. We assume the two groups are sampled from normal distributions. The test is based on the t statistic: the difference in sample means divided by the standard error of the difference in means (SEDM).

There are two ways to calculate the SEDM. If you assume equal variances then a pooled standard deviation (s) is computed. Under the null hypothesis, the t statistic follows a t distribution with n1 + n2 - 2 degrees of freedom (dof).
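
For reference, the pooled version of the statistic is \[ t = \frac{\bar x_1 - \bar x_2}{s_p\sqrt{1/n_1 + 1/n_2}}, \qquad s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}, \] where s_p is the pooled standard deviation and s_1 and s_2 are the group standard deviations.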

If you don't assume equal variances (the default), the test is called the Welch procedure. In this case the t statistic does not exactly follow a t distribution, but it can be approximated by a t distribution with adjusted dof. The dof with this procedure is generally not an integer.
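
For reference, the Welch statistic uses the unpooled standard error, \[ t = \frac{\bar x_1 - \bar x_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \] with the dof approximated by the Welch-Satterthwaite formula \[ \mathrm{df} \approx \frac{\left( s_1^2/n_1 + s_2^2/n_2 \right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}. \]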

Usually the two methods give similar results unless both the group sizes and the variances differ substantially between the two samples.

Example: We consider the daily energy expenditure for lean and obese women, available as the data set energy in the ISwR package.

require(ISwR)
head(energy)

The first column (expend) gives the energy expenditure in MJ (megajoules) as a numeric vector. The second column (stature) gives the stature as a factor.

str(energy)

Start with side-by-side boxplots.

ggplot(energy, aes(x = stature, y = expend)) + geom_boxplot()

You see a difference in energy expenditure between the two groups. Is this difference statistically significant?

With this data format you use R's model syntax for specifying the t test.

R's model syntax is response ~ explanatory, where “response” refers to the response variable and “explanatory” refers to the explanatory variable. The ~ is the tilde character on your keyboard.

Thus energy expenditure (expend) is your response variable and stature is your explanatory variable.

t.test(energy$expend ~ energy$stature)

The output is a bit different from that of the one-sample test. The confidence interval is for the difference in means.

Here it does not contain the value 0, so you can conclude there is a significant difference.

The small p-value allows you to conclude that there is convincing evidence against the null hypothesis of no difference.

Graphically we should be able to pick out our data from a lineup of graphs.

x = null_permute("stature")
inf = lineup(x, energy)
ggplot(inf, aes(x = stature, y = expend)) + geom_boxplot() + facet_wrap(~.sample, 
    nrow = 1, ncol = 20)

We can.

By default there is no assumption that the variances are the same in the two groups of women, so the Welch procedure is used. You can override this default by adding the argument var.equal = TRUE. What is the p-value under this assumption?

t.test(expend ~ stature, data = energy, var.equal = TRUE)

Note that in this case the dof is a whole number, namely 13 + 9 - 2 = 20. The p-value has dropped slightly and the confidence interval is a little narrower, but overall the changes are slight.

F Test

To test the assumption of equal variances you use the var.test() function.

var.test(expend ~ stature, data = energy)

What is the null hypothesis? What do you conclude?

The CI includes the value of 1 and is quite wide. For small data sets such as this one, the assumption of constant variance is largely a matter of belief. This test is sensitive to small departures from a normal distribution.

The variance test is based on the assumption that the groups are independent. You should not apply this test to paired data.
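
For paired data you would instead use a paired t-test; here is a minimal sketch with made-up before/after measurements (the numbers are hypothetical, for illustration only):

before = c(5.1, 4.8, 6.0, 5.5)  # hypothetical first measurement on each subject
after = c(5.4, 5.0, 6.1, 5.9)  # hypothetical second measurement on the same subjects
t.test(before, after, paired = TRUE)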

The t.test() and var.test() functions are in the stats package as part of the base install of R.

The “classical tests,” formerly collected in the ctest package, are now part of stats; among them are several alternative tests for variance homogeneity, each with its own assumptions, benefits, and drawbacks.

Wilcoxon Test

You can avoid distributional assumptions by using a non-parametric test. The non-parametric alternative to the t test is the Wilcoxon test, also known as the Mann-Whitney U test.

The test statistic W is based on the ranks of the pooled observations: R reports the sum of the ranks in the first group minus the smallest value that sum can take (this is the Mann-Whitney U statistic). It is invoked using the wilcox.test() function. Try it on the energy data.

wilcox.test(energy$expend ~ energy$stature)

Another example: Is there evidence of more or fewer U.S. hurricanes recently? One way to examine this question is to divide the record into two samples and compare the means from both samples.

Read the data into R.

loc = "http://myweb.fsu.edu/jelsner/data/US.txt"
H = read.table(loc, header = TRUE)

We can consider the first half of the record as separate from the second half and ask whether there is a statistical difference in hurricane counts between the two halves. The null hypothesis is that the sample means are the same.

First create a logical vector that divides the record into two equal parts.

early = H$Year <= median(H$Year)
head(early)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
tail(early)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

Then use the test on the U.S. hurricane counts where the explanatory variable is the vector early.

t.test(H$All ~ early)
## 
##  Welch Two Sample t-test
## 
## data:  H$All by early
## t = -0.1631, df = 157.8, p-value = 0.8706
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4915  0.4165
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##               1.675               1.712

The p-value is large so we fail to reject the null hypothesis of no difference.

The 95% confidence interval is around the difference in means. Since the interval contains zero, there is no evidence against the null hypothesis.

Since there are 160 years in the record (length(H$All)) you could take the first 80 years for the first sample (s1) and the next 80 years for the second sample (s2) and then perform the test. Try it now.

s1 = H$All[early]
s2 = H$All[!early]
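
With the two vectors in hand, the direct two-sample call gives the same p-value as the formula interface above (only the sign of the t-value flips because the group order is reversed):

t.test(s1, s2)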

Small counts are not normally distributed

ggplot(H, aes(factor(All))) + geom_bar() + ylab("Number of Years") + xlab("Number of Hurricanes")

[Figure (chunk barplotH): bar plot of the number of years against the number of U.S. hurricanes]

so you might want to use the non-parametric Wilcoxon test instead.

wilcox.test(s1, s2)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  s1 and s2
## W = 3248, p-value = 0.869
## alternative hypothesis: true location shift is not equal to 0
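
As a sketch of where W comes from, you can reproduce it directly from the ranks: R reports the sum of the ranks of the first sample minus the smallest value that sum can take, n1(n1 + 1)/2.

r = rank(c(s1, s2))  # ranks of all 160 counts, with ties averaged
sum(r[seq_along(s1)]) - length(s1) * (length(s1) + 1)/2
## [1] 3248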

The conclusion is the same. The second half of the data set is statistically indistinguishable from the first half.