Lesson 2 covers statistical material for performing two-sample \( t \) tests and chi-squared goodness-of-fit tests. It also covers the R commands and idioms necessary to pull this off. Here are some questions on the R commands.
EDA, or exploratory data analysis, describes the
exploration of a data set prior to any formal model fitting. Such
explorations can be done via statistical summaries or via graphics. For
this topic it is useful to know several different ways of carrying out
these activities.
Let's begin with the simple dataset used in the notes:
bottom <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)
We can use these variables separately or combine them into a data frame:
DF <- data.frame(bottom = bottom, surface = surface)
First some questions about data frames. If you are confused, check the comments when you guess wrong.
Using your version of R, make the above data frame and tell me what the outputs of
Is there a difference between
Which of these commands returns the values where the
bottom value is 0.430 or less?
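Questions like this one come down to logical indexing. The following is a minimal sketch of the idea using the data above; the condition on surface and the names keep, big_rows, big_rows2 and big_bottom are chosen only for illustration and are not taken from the quiz.

```r
bottom  <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)
DF <- data.frame(bottom = bottom, surface = surface)

# A comparison on a column gives a logical vector, one entry per row
keep <- DF$surface > 0.6

# Used in the row position, it keeps the matching rows (all columns)
big_rows <- DF[keep, ]

# subset() evaluates the condition inside the data frame
big_rows2 <- subset(DF, surface > 0.6)

# The same logical vector can index a single column
big_bottom <- DF$bottom[keep]

print(big_rows)
```

Note the comma in DF[keep, ]: without it, the logical vector would be used to select columns rather than rows.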
Okay, let's use
DF to look at numeric summaries. In the notes we see that
summary will summarize a numeric variable with its so-called
5-number summary (well, technically not if you are pedantic) and also
its mean. We can call this same method for a data frame:
Do so. Which variable has the largest maximum?
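For the pedantic: a short sketch contrasting fivenum, which returns Tukey's five numbers based on hinges, with the quartiles that summary reports. The variable names five, summ and quart are just for this illustration.

```r
bottom <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
five <- fivenum(bottom)

# summary() reports min, 1st quartile, median, mean, 3rd quartile, max
summ <- summary(bottom)

# The quartiles summary() uses come from quantile() with its default type
quart <- quantile(bottom, c(0.25, 0.75))

print(five)
print(summ)
print(quart)
```

On this small sample the lower hinge is 0.43 while the first quartile is 0.45525, so the two summaries are close but not the same thing.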
Calling mean(DF) causes a warning, while calling
median(DF) causes an error. The warning for
mean suggests using
sapply. What is the output of
The sapply function iterates over the object in its first argument
and applies the function given as its second argument to each element. For data frames, it
iterates over each column variable, so the above takes the median of
each column. The
sapply function then tries to put the output into a
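As a sketch of how sapply behaves on the data frame from earlier (the result names meds, rngs, as_list and means are illustrative only):

```r
bottom  <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)
DF <- data.frame(bottom = bottom, surface = surface)

# One number per column, so the result simplifies to a named numeric vector
meds <- sapply(DF, median)

# Two numbers per column (min and max), so the result simplifies to a matrix
rngs <- sapply(DF, range)

# lapply does the same iteration but always returns a list
as_list <- lapply(DF, median)

# For column means specifically, colMeans is a direct alternative
means <- colMeans(DF)

print(meds)
print(rngs)
```

The shape of the result depends on what the applied function returns for each column, which is why it is worth checking with str when in doubt.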
The two-sample t-test is about comparing means. A good graphic to
investigate is parallel, or side-by-side, boxplots. These can be made
many different ways in R. We use the
Issue the command
boxplot(DF). Do you get side-by-side boxplots?
Well, you answered “Yes”; good. This is because data frames are
lists and boxplot will do the “right thing” for lists.
Data frames also behave like matrices in many ways. (Huh?) Will
boxplot do the right thing for
matrices? To check, look at the output of
The above two questions show that for rectangular data, the boxplot
function does what we would like with minimal fuss. Good. However,
lots of two-sample data will not fit into a data frame with each
column being a variable: for example, the two samples may have different
sizes. The alternative storage is to have one column for the values
and one column indicating the group. (This generalizes to more than two samples, which leads to ANOVA.)
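To make the alternative storage concrete, here is a minimal sketch that builds it by hand. It assumes, purely for illustration, that the last surface reading is missing so that the two samples have different sizes; the names surface5 and long are hypothetical.

```r
bottom <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)

# Assumption for illustration only: pretend the sixth surface value was never recorded
surface5 <- c(0.415, 0.238, 0.39, 0.41, 0.605)

# One column of values, one column saying which group each value belongs to
long <- data.frame(
  values = c(bottom, surface5),
  ind    = factor(rep(c("bottom", "surface"),
                      times = c(length(bottom), length(surface5))))
)

# Group sizes no longer need to match
print(table(long$ind))
```

A two-column data.frame(bottom, surface5) would fail here because the columns have different lengths, which is exactly the situation the long layout avoids.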
The stack command is used to make this format. (More generally there
is the reshape function for this type of work and the
Run the command
st <- stack(DF)
What type of storage does R use for
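One way to explore what stack produces is sketched below; str and class are the usual tools for inspecting storage, and unstack reverses the operation. The name back is illustrative.

```r
bottom  <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)
DF <- data.frame(bottom = bottom, surface = surface)

st <- stack(DF)

# Inspect the structure: column names, types, and the first few values
str(st)

# The grouping column records which original column each value came from
print(levels(st$ind))

# unstack() goes back to one column per group
back <- unstack(st)
print(back)
```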
The stacked format works with R's formula interface. We can more or
less avoid this interface when working with two samples, but it is a huge
advantage when working with multivariate data. It is one area where R
shines compared to other languages when doing statistics.
Does the following notation make the same side-by-side boxplot:
boxplot(values ~ ind, data=st)
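Besides comparing the two pictures by eye, the numbers behind them can be compared. This sketch relies on the plot = FALSE argument of boxplot, which returns the computed statistics without drawing; the names b_df, b_formula, same_stats and same_names are illustrative.

```r
bottom  <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)
DF <- data.frame(bottom = bottom, surface = surface)
st <- stack(DF)

# Compute the boxplot statistics without drawing anything
b_df      <- boxplot(DF, plot = FALSE)
b_formula <- boxplot(values ~ ind, data = st, plot = FALSE)

# Compare whisker ends, hinges and medians, and the group labels
same_stats <- isTRUE(all.equal(b_df$stats, b_formula$stats))
same_names <- identical(b_df$names, b_formula$names)

print(c(same_stats = same_stats, same_names = same_names))
```

Dropping plot = FALSE from either call draws the corresponding graphic.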
A t-test can be run with t.test in many ways. Do all of these produce the same output?
t.test(bottom, surface)
t.test(DF$bottom, DF$surface)
with(DF, t.test(bottom, surface))
t.test(values ~ ind, data=st)
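To compare the four calls programmatically rather than by reading the printed output, one can store each result and pull out its components. This is a sketch; the names t1 to t4, results, stats, pvals and labels are illustrative.

```r
bottom  <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)
DF <- data.frame(bottom = bottom, surface = surface)
st <- stack(DF)

t1 <- t.test(bottom, surface)
t2 <- t.test(DF$bottom, DF$surface)
t3 <- with(DF, t.test(bottom, surface))
t4 <- t.test(values ~ ind, data = st)
results <- list(t1, t2, t3, t4)

# Test statistic and p-value from each call
stats <- sapply(results, function(r) unname(r$statistic))
pvals <- sapply(results, function(r) r$p.value)

# The label each result prints for its data
labels <- sapply(results, function(r) r$data.name)

print(stats)
print(pvals)
print(labels)
```

Each result is a list of class "htest", so names(t1) shows which other components are available.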
As mentioned in the notes, R uses “generic functions” to allow one
function name to dispatch to different functions depending on the
arguments you supply. In computer science terms, this kind of dispatch is
termed polymorphism in the object-oriented literature. (A point I make
for those of you who already know that.) Base R has three different
ways to achieve this, and there are others provided in add-on
packages. The simplest and most common is S3. There the class of the
first argument to a function is considered. This is why both
t.test(bottom, surface) and
t.test(values ~ ind, data=st) work, as
different functions are ultimately consulted. (The first has a numeric
variable for the first argument, the second a formula.)
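The S3 mechanism can be sketched with a toy generic. The function describe and its methods below are hypothetical, made up for this illustration; they are not part of R or of the notes.

```r
# A generic function: UseMethod() picks a method from the class of x
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named generic.class
describe.numeric <- function(x, ...) "a numeric vector"
describe.formula <- function(x, ...) "a formula"

# The default method is the fallback when no class-specific method exists
describe.default <- function(x, ...) "something else"

print(describe(c(0.43, 0.266)))   # dispatches to describe.numeric
print(describe(values ~ ind))     # dispatches to describe.formula
print(describe("text"))           # falls back to describe.default
```

t.test is organized the same way, with one method for formulas and a default method for everything else.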
The methods function will list the different “methods” registered for a generic function. How many are there for
The term “Non-visible” means what? Well, the function is there but can't be seen without extra help. Which of these will find the definition (a bunch of code) for the formula implementation of
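For reference, base R offers several tools for locating a method that is not exported from its package. The sketch below shows three of them for the formula method of t.test; the names m, f1, f2 and f3 are illustrative.

```r
# List the methods registered for the generic
m <- methods(t.test)
print(m)

# getAnywhere() searches loaded namespaces as well as the search path
f1 <- getAnywhere("t.test.formula")

# getS3method() looks up a method by generic name and class
f2 <- getS3method("t.test", "formula")

# The ::: operator reaches inside a package namespace directly
f3 <- stats:::t.test.formula

# Printing any of these shows the code of the function
print(f2)
```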