Some R commands

Lesson 2 covers statistical material for performing two sample \( t \) tests and chi-squared goodness of fit tests. It also covers the R commands and idioms necessary to pull this off. Here are some questions on the R commands.

EDA, or *exploratory data analysis*, is a term to describe the exploration of a data set prior to any formal model fitting. Such explorations can be via statistical summaries or via graphics. For this topic it is useful to know many different ways that such activities can be done.

Let's begin with the simple dataset used in the notes:

```
bottom <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)
```

We can use these variables separately or combine them into a data frame:

```
DF <- data.frame(bottom = bottom, surface = surface)
```

First some questions about data frames. If you are confused, check the comments when you guess wrong.

Using your version of R, make the above data frame and tell me what the output of `nrow(DF)` is:

Is there a difference between `colnames(DF)` and `names(DF)`?

Which of these commands returns the values where the `bottom` value is 0.430 or less?
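If logical indexing is unfamiliar, here is a minimal sketch on a made-up data frame. The name `toy` and its columns `x` and `y` are hypothetical and are not part of the data above:

```r
# Hypothetical data, unrelated to the quiz data set
toy <- data.frame(x = c(1, 5, 3, 8), y = c("a", "b", "c", "d"))

# A comparison on a column gives a logical vector, one entry per row
keep <- toy$x <= 3

# Used as a row index, it keeps only the rows where the condition holds
small_rows <- toy[keep, ]

# Indexing the column itself returns just the matching values
small_x <- toy$x[keep]

# subset() expresses the same row filter without repeating the data frame name
same_rows <- subset(toy, x <= 3)
```

Note the difference between asking for whole rows (`toy[keep, ]`) and asking for the values of one column (`toy$x[keep]`).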

Okay, let's use `DF` to look at numeric summaries. In the notes we see that `summary` will summarize a numeric variable with its so-called 5-number summary (well, technically not if you are pedantic) and also its mean. We can call this same method for a data frame: `summary(DF)`.

Do so. Which variable has the largest maximum?

Calling `mean(DF)` causes a warning, calling `median(DF)` an error. The warning for `mean` (in older versions of R) suggests using `sapply`. What is the output of `sapply(DF, median)`?

(The `sapply` function iterates over the elements of its first argument and applies the function given as its second argument to each one. For data frames, it iterates over each column variable, so the above takes the median of each column. The `sapply` function then tries to put the output into a nice format.)
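To see what "a nice format" means in practice, here is a small sketch on a made-up data frame (the name `toy` and its columns are hypothetical, chosen only for illustration):

```r
# Hypothetical data frame; the column names are made up for illustration
toy <- data.frame(height = c(150, 160, 170), weight = c(50, 65, 80))

# A data frame is a list of columns, so sapply() calls the function
# once per column
col_means <- sapply(toy, mean)

# Every call returned a single number, so the result is simplified
# to a named numeric vector
col_means

# When the function returns several values per column, the result is
# simplified to a matrix with one column per variable
col_ranges <- sapply(toy, range)
col_ranges
```

The names on the result come from the column names of the data frame.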

The two sample t-test is about comparing means. A good graphic to investigate is the parallel, or side-by-side, boxplot. These can be made many different ways in R. We use the `boxplot` function.

Issue the command `boxplot(DF)`. Do you get side-by-side boxplots?

Well, you answered “Yes”, good. This is because data frames are lists and `boxplot` will do the “right thing” for lists.

Data frames can also be treated like matrices. (Huh?) Will `boxplot` do the right thing for matrices? To check, look at the output of `boxplot(as.matrix(DF))`.

The above two questions show that for rectangular data, the `boxplot` function does what we would like with minimal fuss. Good. However, lots of two sample data will not fit into a data frame with each column being a variable, for example when the two samples have different sizes. The alternative storage is to have one column for the values and one column indicating which group. (This generalizes to more than two samples, which leads to ANOVA.)

The `stack` command is used to make this format. (More generally there is the `reshape` function for this type of work and the `reshape2` package.)
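As a rough sketch of why this format helps with unequal sample sizes, `stack` also accepts a named list. The list `samples` and its values below are made up for illustration:

```r
# Two hypothetical samples of different sizes, kept in a named list
samples <- list(a = c(1.2, 3.4, 2.2), b = c(5.1, 4.8))

# stack() accepts a named list as well as a data frame
long <- stack(samples)

# One row per observation: `values` holds the numbers,
# `ind` records which sample each one came from
long

# unstack() reverses the operation; it returns a list here because
# the groups have different lengths
back <- unstack(long)
```

These two samples could not sit side by side as columns of a data frame, but they fit in the stacked form without trouble.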

Run the command

```
st <- stack(DF)
```

What type of storage does R use for `ind`? (Use `class(st$ind)`.)

The output of the `stack` command works with R's formula interface. We can more or less avoid this interface when working with two samples, but it is a *huge* advantage when working with multivariate data. It is one area where R shines compared to other languages when doing statistics.
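Here is a minimal sketch of the formula interface on made-up data. The data frame `long` below is hypothetical; it only mimics the two-column layout that `stack` produces:

```r
# Hypothetical long-format data: one column of measurements,
# one column of group labels
long <- data.frame(
  values = c(4.1, 5.0, 4.6, 6.2, 6.8),
  ind    = factor(c("a", "a", "a", "b", "b"))
)

# Read `values ~ ind` as "values explained by ind".
# aggregate() uses the formula to compute one summary per group
group_means <- aggregate(values ~ ind, data = long, FUN = mean)
group_means

# Modeling functions accept the same formula, e.g. a linear model
fit <- lm(values ~ ind, data = long)
coef(fit)
```

The same formula is reused unchanged across summary, plotting, and modeling functions, which is where the convenience comes from.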

Does the following notation make the same side-by-side boxplot: `boxplot(values ~ ind, data=st)`?

The t-test can be done many ways with `t.test`. Do all of these produce the same output?

```
t.test(bottom, surface)
t.test(DF$bottom, DF$surface)
with(DF, t.test(bottom, surface))
t.test(values ~ ind, data=st)
```

As mentioned in the notes, R uses “generic functions” to allow one function name to dispatch to different functions depending on the arguments you supply. In computer science terms, this kind of dispatch is termed polymorphism in the object oriented literature. (A point I make for those of you who already know that.) Base R has three different ways to achieve this, and there are others provided in add-on packages. The simplest and most common is S3. There the class of the first argument to a function is considered. This is why both `t.test(bottom, surface)` and `t.test(values ~ ind, data=st)` work, as different functions are ultimately consulted. (The first has a numeric variable for the first argument, the second a formula.)
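To make S3 dispatch concrete, here is a toy sketch. The generic `describe` and its methods are hypothetical names invented for this example; they are not part of R or of `t.test`:

```r
# A hypothetical generic; UseMethod() picks a method from the class of `x`
describe <- function(x, ...) UseMethod("describe")

# Called when the first argument is numeric
describe.numeric <- function(x, ...) "numeric method"

# Called when the first argument is a formula
describe.formula <- function(x, ...) "formula method"

# Fallback when no more specific method exists
describe.default <- function(x, ...) "default method"

describe(c(0.43, 0.266))   # dispatches to describe.numeric
describe(values ~ ind)     # dispatches to describe.formula
describe("text")           # falls back to describe.default
```

The caller only ever types `describe`; which function actually runs depends on the class of the first argument, just as with `t.test`.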

The `methods` function will list the different “methods” registered for a generic function. How many are there for `t.test`?

The term “Non-visible” means what? Well, the function is there but can't be seen without extra help. Which of these will find the definition (a bunch of code) for the formula implementation of `t.test`?