author: Christopher Mecklin
date: August 29, 2013
In chapter 1, we cover methods of descriptive statistics, which serve to describe and summarize data. Most of the statistical methods and mathematics covered in this chapter are pretty basic (mean/median/mode, histograms, etc.), and you have seen most of it in classes prior to this one.
However, what is new to most of you is using the R statistical language and dealing with data structures in it. So this document is primarily concerned with introducing the use of R.
We will access R in one of three ways.
Use the base R console (i.e. only download R, click on the big blue R icon) and enter and run every command one line at a time. I DO NOT recommend doing this.
Use the RStudio IDE along with R. This way, it is much more convenient to enter programs, run portions of them, save programs, save output, save graphs, access the help features, etc. You should either download both R from http://cran.r-project.org and RStudio from http://www.rstudio.com for your computer, if possible, or use RStudio in the Faculty Hall 1st floor computer lab.
Access the RStudio server that I have running on an Amazon cloud server. I DO NOT promise this will be available all semester, and recommend using it only if you don't have a computer of your own to install the software on, for times when the computer lab is not open. To do this, email me to get your username and password. Your username will just be your last name. Then log in at http://www.tinyurl.com/MurrayRStudio
In order to do statistical analysis, you need to get data into R. There are three primary ways we will do this in this class.
Use data sets that are “built-in” to packages. For example, the faithful data set is in the datasets package that is built into R. We will also use the require command to access other packages (and data sets that are in those packages), particularly the fastR package that goes with our textbook.
Manually enter data into an R program or script. This is tedious and is only a good idea for very small examples.
Import data that has been entered and saved in another format. This includes spreadsheets such as Excel, databases, other statistical software packages, and text files that are in .txt or .csv formats. We won't talk about how to accomplish this in detail at this point, although a minimal sketch appears below.
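Just so you've seen it once, here is a minimal hedged sketch of reading a .csv file into a data frame; the file name 'mydata.csv' is hypothetical and stands in for whatever file you have saved.
# hypothetical example: read a comma-separated file with a header row into a data frame
mydata <- read.csv("mydata.csv", header = TRUE)
str(mydata)  #inspect the variables that were imported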
In statistics, we will usually store our data in a 2-dimensional structure, similar to a matrix, where the rows represent the individuals or observations that we have collected data on, while the columns represent the variables that we have measured. For example, I could give a survey to our class with several questions, let's say 10 questions. With 27 students in this class, I'd have a \( 27 \times 10 \) structure to hold the data. Each student would be a row (for example, John Doe or Jane Smith), and each question/variable would be a column.
Some of the variables will be numerical in nature: for example, I might ask you your age or your height or to rank your love of statistics on a scale from 1 to 5. Others will be categorical in nature: for example, your gender or your favorite color.
In R, we will usually store our data sets in a structure called a data frame, which will be as described above. It is similar, but not identical, to a matrix. Later this semester, we will occasionally work with matrices in R and perform matrix arithmetic.
In a data frame, the top row is called the header and contains the names of all of the variables. Each horizontal row after is a data row, which contains the name of the row, followed by the data values for that individual/observation on each variable.
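To make the idea concrete, here is a small sketch that builds a tiny data frame from scratch; the students and values are invented purely for illustration.
# a tiny made-up survey: each row is a student, each column a variable
survey <- data.frame(Age = c(19, 21, 20),
                     Height = c(70, 64, 68),
                     Color = c("blue", "green", "red"))
survey  #printing shows the header of variable names plus one row per student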
Let's look at some R code and output for accessing and manipulating data frames.
# load the fastR package, which also loads several other packages such as mosaic
require(fastR)
# Fisher's iris data set
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
# look at the structure of that data set 150 rows (flowers), 5 columns
# (variables) 4 variables are numerical, 1 is a categorical factor
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# use the snippet command to run any chunk of R code in the text where the
# name of the snippet is in the little box
snippet("iris-str")
##
##
## snippet(iris-str)
## ------- ~~~~~~~~
##
## > str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Let's look at some other examples; these are part of the fastR package.
help(actgpa)
actgpa
## ACT GPA
## 1 30 3.992
## 2 25 3.377
## 3 20 3.009
## 4 24 3.509
## 5 32 3.969
## 6 32 3.917
## 7 31 3.547
## 8 22 3.416
## 9 26 3.287
## 10 32 4.000
## 11 23 3.446
## 12 30 3.905
## 13 25 2.926
## 14 26 3.100
## 15 26 3.446
## 16 19 2.785
## 17 32 3.663
## 18 20 3.368
## 19 24 3.352
## 20 32 3.929
## 21 31 3.620
## 22 28 3.765
## 23 19 1.986
## 24 21 2.836
## 25 26 3.119
## 26 22 2.662
xyplot(GPA ~ ACT, actgpa) #scatterplot
help(students)
# students has 1000 rows, let's not look at all of them
head(students)
## ACT SAT Grad GradGPA HSGPA Cohort
## 1 30 NA TRUE 3.613 3.743 2002
## 2 20 NA TRUE 2.993 2.968 2004
## 3 23 1060 TRUE 3.582 3.507 2002
## 4 30 1420 TRUE 3.513 3.990 2005
## 5 21 1010 TRUE 2.703 3.253 2005
## 6 NA 730 TRUE 3.360 2.621 2001
head(students, n = 12)
## ACT SAT Grad GradGPA HSGPA Cohort
## 1 30 NA TRUE 3.613 3.743 2002
## 2 20 NA TRUE 2.993 2.968 2004
## 3 23 1060 TRUE 3.582 3.507 2002
## 4 30 1420 TRUE 3.513 3.990 2005
## 5 21 1010 TRUE 2.703 3.253 2005
## 6 NA 730 TRUE 3.360 2.621 2001
## 7 23 NA TRUE 2.716 3.100 2003
## 8 25 NA FALSE NA 2.588 2005
## 9 30 NA TRUE 3.397 3.762 2003
## 10 21 NA FALSE NA 3.421 2003
## 11 21 NA FALSE NA 2.622 2005
## 12 24 NA TRUE 3.095 3.458 2003
tail(students)
## ACT SAT Grad GradGPA HSGPA Cohort
## 995 22 NA TRUE 2.725 2.626 2003
## 996 27 NA FALSE NA 3.574 2004
## 997 25 NA TRUE 2.854 3.506 2005
## 998 29 NA TRUE 3.448 3.861 2005
## 999 NA 1160 TRUE 3.727 3.914 2003
## 1000 23 1110 FALSE NA 3.973 2005
One of my favorite data sets for teaching is from the MASS package. It is called UScereal and has nutritional information on 65 different brands of breakfast cereal.
One of the variables, shelf, contains values 1,2,3 that correspond to the bottom, middle, and top shelf at the grocery store. We know that usually cereals that adults might want (bran flakes, granola) are on the top shelf, sugary cereals geared towards kids are on the middle shelf, and cheaper cereals are on the bottom.
I'm going to “recode” the shelf variable into a new variable called newshelf that will say “top”, “middle”, and “bottom” rather than some arbitrary number that I might forget. This will require the car package.
require(MASS) #a different package
str(UScereal)
## 'data.frame': 65 obs. of 11 variables:
## $ mfr : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
## $ calories : num 212 212 100 147 110 ...
## $ protein : num 12.12 12.12 8 2.67 2 ...
## $ fat : num 3.03 3.03 0 2.67 0 ...
## $ sodium : num 394 788 280 240 125 ...
## $ fibre : num 30.3 27.3 28 2 1 ...
## $ carbo : num 15.2 21.2 16 14 11 ...
## $ sugars : num 18.2 15.2 0 13.3 14 ...
## $ shelf : int 3 3 3 1 2 3 1 3 2 1 ...
## $ potassium: num 848.5 969.7 660 93.3 30 ...
## $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
require(car) # yet another package
UScereal$newshelf <- recode(UScereal$shelf, "1='bottom';2='middle';3='top'")
## Warning: NAs introduced by coercion
str(UScereal)
## 'data.frame': 65 obs. of 12 variables:
## $ mfr : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
## $ calories : num 212 212 100 147 110 ...
## $ protein : num 12.12 12.12 8 2.67 2 ...
## $ fat : num 3.03 3.03 0 2.67 0 ...
## $ sodium : num 394 788 280 240 125 ...
## $ fibre : num 30.3 27.3 28 2 1 ...
## $ carbo : num 15.2 21.2 16 14 11 ...
## $ sugars : num 18.2 15.2 0 13.3 14 ...
## $ shelf : int 3 3 3 1 2 3 1 3 2 1 ...
## $ potassium: num 848.5 969.7 660 93.3 30 ...
## $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ newshelf : chr "top" "top" "top" "bottom" ...
We can access individual data values or portions of our data frame by using brackets. The first value in brackets represents the row(s) and the second value the column(s) that you want.
UScereal[5, 2] #row 5, column 2, which is number of calories in Apple Jacks
## [1] 110
UScereal[1:10, 1:3] #first ten rows and three columns
## mfr calories protein
## 100% Bran N 212.1 12.121
## All-Bran K 212.1 12.121
## All-Bran with Extra Fiber K 100.0 8.000
## Apple Cinnamon Cheerios G 146.7 2.667
## Apple Jacks K 110.0 2.000
## Basic 4 G 173.3 4.000
## Bran Chex R 134.3 2.985
## Bran Flakes P 134.3 4.478
## Cap'n'Crunch Q 160.0 1.333
## Cheerios G 88.0 4.800
UScereal[, 1:3] #all rows, first three columns
## mfr calories protein
## 100% Bran N 212.12 12.1212
## All-Bran K 212.12 12.1212
## All-Bran with Extra Fiber K 100.00 8.0000
## Apple Cinnamon Cheerios G 146.67 2.6667
## Apple Jacks K 110.00 2.0000
## Basic 4 G 173.33 4.0000
## Bran Chex R 134.33 2.9851
## Bran Flakes P 134.33 4.4776
## Cap'n'Crunch Q 160.00 1.3333
## Cheerios G 88.00 4.8000
## Cinnamon Toast Crunch G 160.00 1.3333
## Clusters G 220.00 6.0000
## Cocoa Puffs G 110.00 1.0000
## Corn Chex R 110.00 2.0000
## Corn Flakes K 100.00 2.0000
## Corn Pops K 110.00 1.0000
## Count Chocula G 110.00 1.0000
## Cracklin' Oat Bran K 220.00 6.0000
## Crispix K 110.00 2.0000
## Crispy Wheat & Raisins G 133.33 2.6667
## Double Chex R 133.33 2.6667
## Froot Loops K 110.00 2.0000
## Frosted Flakes K 146.67 1.3333
## Frosted Mini-Wheats K 125.00 3.7500
## Fruit & Fibre: Dates Walnuts and Oats P 179.10 4.4776
## Fruitful Bran K 179.10 4.4776
## Fruity Pebbles P 146.67 1.3333
## Golden Crisp P 113.64 2.2727
## Golden Grahams G 146.67 1.3333
## Grape Nuts Flakes P 113.64 3.4091
## Grape-Nuts P 440.00 12.0000
## Great Grains Pecan P 363.64 9.0909
## Honey Graham Ohs Q 120.00 1.0000
## Honey Nut Cheerios G 146.67 4.0000
## Honey-comb P 82.71 0.7519
## Just Right Fruit & Nut K 186.67 4.0000
## Kix G 73.33 1.3333
## Life Q 149.25 5.9701
## Lucky Charms G 110.00 2.0000
## Mueslix Crispy Blend K 238.81 4.4776
## Multi-Grain Cheerios G 100.00 2.0000
## Nut&Honey Crunch K 179.10 2.9851
## Nutri-Grain Almond-Raisin K 208.96 4.4776
## Oatmeal Raisin Crisp G 260.00 6.0000
## Post Nat. Raisin Bran P 179.10 4.4776
## Product 19 K 100.00 3.0000
## Puffed Rice Q 50.00 1.0000
## Quaker Oat Squares Q 200.00 8.0000
## Raisin Bran K 160.00 4.0000
## Raisin Nut Bran G 200.00 6.0000
## Raisin Squares K 180.00 4.0000
## Rice Chex R 97.35 0.8850
## Rice Krispies K 110.00 2.0000
## Shredded Wheat 'n'Bran N 134.33 4.4776
## Shredded Wheat spoon size N 134.33 4.4776
## Smacks K 146.67 2.6667
## Special K K 110.00 6.0000
## Total Corn Flakes G 110.00 2.0000
## Total Raisin Bran G 140.00 3.0000
## Total Whole Grain G 100.00 3.0000
## Triples G 146.67 2.6667
## Trix G 110.00 1.0000
## Wheat Chex R 149.25 4.4776
## Wheaties G 100.00 3.0000
## Wheaties Honey Gold G 146.67 2.6667
UScereal[1:10, ] #first ten rows, all columns
## mfr calories protein fat sodium fibre carbo
## 100% Bran N 212.1 12.121 3.030 393.9 30.303 15.15
## All-Bran K 212.1 12.121 3.030 787.9 27.273 21.21
## All-Bran with Extra Fiber K 100.0 8.000 0.000 280.0 28.000 16.00
## Apple Cinnamon Cheerios G 146.7 2.667 2.667 240.0 2.000 14.00
## Apple Jacks K 110.0 2.000 0.000 125.0 1.000 11.00
## Basic 4 G 173.3 4.000 2.667 280.0 2.667 24.00
## Bran Chex R 134.3 2.985 1.493 298.5 5.970 22.39
## Bran Flakes P 134.3 4.478 0.000 313.4 7.463 19.40
## Cap'n'Crunch Q 160.0 1.333 2.667 293.3 0.000 16.00
## Cheerios G 88.0 4.800 1.600 232.0 1.600 13.60
## sugars shelf potassium vitamins newshelf
## 100% Bran 18.182 3 848.48 enriched top
## All-Bran 15.152 3 969.70 enriched top
## All-Bran with Extra Fiber 0.000 3 660.00 enriched top
## Apple Cinnamon Cheerios 13.333 1 93.33 enriched bottom
## Apple Jacks 14.000 2 30.00 enriched middle
## Basic 4 10.667 3 133.33 enriched top
## Bran Chex 8.955 1 186.57 enriched bottom
## Bran Flakes 7.463 3 283.58 enriched top
## Cap'n'Crunch 16.000 2 46.67 enriched middle
## Cheerios 0.800 1 84.00 enriched bottom
UScereal[c(1:5, 50:55), c("fibre", "sugars")] #a specific set of rows and columns
## fibre sugars
## 100% Bran 30.303 18.18
## All-Bran 27.273 15.15
## All-Bran with Extra Fiber 28.000 0.00
## Apple Cinnamon Cheerios 2.000 13.33
## Apple Jacks 1.000 14.00
## Raisin Nut Bran 5.000 16.00
## Raisin Squares 4.000 12.00
## Rice Chex 0.000 1.77
## Rice Krispies 0.000 3.00
## Shredded Wheat 'n'Bran 5.970 0.00
## Shredded Wheat spoon size 4.478 0.00
some(UScereal) #a random sample of rows
## mfr calories protein fat sodium fibre carbo
## Count Chocula G 110.0 1.000 1.000 180.0 0.000 12.00
## Double Chex R 133.3 2.667 0.000 253.3 1.333 24.00
## Honey Nut Cheerios G 146.7 4.000 1.333 333.3 2.000 15.33
## Just Right Fruit & Nut K 186.7 4.000 1.333 226.7 2.667 26.67
## Oatmeal Raisin Crisp G 260.0 6.000 4.000 340.0 3.000 27.00
## Product 19 K 100.0 3.000 0.000 320.0 1.000 20.00
## Shredded Wheat 'n'Bran N 134.3 4.478 0.000 0.0 5.970 28.36
## Total Corn Flakes G 110.0 2.000 1.000 200.0 0.000 21.00
## Wheat Chex R 149.3 4.478 1.493 343.3 4.478 25.37
## Wheaties G 100.0 3.000 1.000 200.0 3.000 17.00
## sugars shelf potassium vitamins newshelf
## Count Chocula 13.000 2 65.0 enriched middle
## Double Chex 6.667 3 106.7 enriched top
## Honey Nut Cheerios 13.333 1 120.0 enriched bottom
## Just Right Fruit & Nut 12.000 3 126.7 100% top
## Oatmeal Raisin Crisp 20.000 3 240.0 enriched top
## Product 19 3.000 3 45.0 100% top
## Shredded Wheat 'n'Bran 0.000 1 209.0 none bottom
## Total Corn Flakes 3.000 3 35.0 100% top
## Wheat Chex 4.478 1 171.6 enriched bottom
## Wheaties 3.000 1 110.0 enriched bottom
In statistics, we are very concerned with the distribution of a random variable, where a distribution is just a list of the values that the variable took on, and how often those values were taken. We will describe distributions in a variety of ways.
Here, I'll show R code for the frequency distributions of the manufacturer of the cereals and the shelf, along with a contingency table or cross-tab for manufacturer & shelf simultaneously. I'll also show a frequency distribution table for a numerical variable like sugar, which is not as useful.
# dataset$variable specifies which variable from the data set you want
table(UScereal$mfr)
##
## G K N P Q R
## 22 21 3 9 5 5
table(UScereal$newshelf)
##
## bottom middle top
## 18 18 29
table(UScereal$mfr, UScereal$newshelf) #contingency table or cross-tab
##
## bottom middle top
## G 6 7 9
## K 4 7 10
## N 2 0 1
## P 2 1 6
## Q 0 3 2
## R 4 0 1
table(UScereal$sugars)
##
## 0 0.8 1.769912 2 3 4 4.477612
## 4 1 1 2 8 1 1
## 5.681818 6 6.666667 7.462687 8.270677 8.75 8.955224
## 1 1 1 1 1 1 2
## 10.447761 10.666667 11 12 12.121212 13 13.333333
## 1 2 1 9 1 3 3
## 13.432836 14 14.666667 14.925373 15.151515 16 17.045455
## 1 4 1 1 1 4 1
## 17.910448 18.181818 19.402985 20 20.895522
## 1 1 1 2 1
table(UScereal$sugars > 5) #counts how many cereals have more than 5 grams/serving
##
## FALSE TRUE
## 18 47
Let us construct a bar chart to graphically display the frequency distribution of a categorical variable. First, I am going to create a new object which I will name counts (the name is arbitrary; I could call it george or x) to store the frequency distribution. Then I will plot that object, and then make it prettier with labels and colors.
counts <- table(UScereal$newshelf)
barplot(counts)
barplot(counts, main = "Frequency Distribution of Cereal", xlab = "Shelf", col = "lightblue")
Histograms are used to display the frequency distribution of a numerical variable, by creating “bins” (intervals) and counting how often the data values fall in those bins.
A blessing (and a curse) of open-source software such as R is that there are usually multiple ways to do things, such as construct graphs or fit models. We will encounter this reality frequently in graphing. A package called 'graphics' is built into R and has many functions for plotting. You do not need to use a 'require(graphics)' statement, as that package is automatically loaded any time you use R.
Randall Pruim, your textbook author, is a strong proponent of the graphics found in the 'lattice' package. You do need to use either 'require(lattice)' or 'require(fastR)' to access those graphics. Another popular package, which is very powerful but fairly complicated, is the 'ggplot2' package, developed by Hadley Wickham and based on the book The Grammar of Graphics by Leland Wilkinson. We will not use 'ggplot2' graphics in this course: go to http://ggplot2.org/ if you are interested in pursuing these graphics.
OK, so there's a 'hist' function in the base 'graphics' package and a 'histogram' function in the 'lattice' package. Let's compare them for a histogram of the sugar content of cereal.
hist(UScereal$sugars)
histogram(~sugars, data = UScereal) #lattice uses a (y~x|z,data=dataset) format
If you want control over the width of your bins, use the 'breaks' argument to set them. You can also add titles and change colors as we've done before.
histogram(~sugars, data = UScereal, type = "count", breaks = c(0, 5, 10, 15,
20, 25), col = "red", xlab = "Grams per serving", main = "Sugar Content of Cereal")
What's this (y~x|z,data=dataset) format all about? Well, it is used throughout the 'lattice' package. The idea is that the variable on the vertical axis goes in the y slot, the horizontal axis in the x slot, the conditional or grouping variable in the z slot, and the name of your data frame in the dataset slot.
In our example, we are constructing a histogram of the variable 'UScereal$sugars'. Since there is only one variable, there is no y or z in this example. Suppose you want separate histograms for each level of the categorical variable 'newshelf'. We'll just define z to be newshelf. Use the 'subset' argument if you want only one level; for instance, I only want a histogram of the sugar in cereals on the middle shelf.
histogram(~sugars | newshelf, data = UScereal)
histogram(~sugars | newshelf, data = UScereal, subset = newshelf == "middle")
For scatterplots, we have either the 'plot' function (from the graphics package) or 'xyplot' (from lattice). They are very similar and which one you use is a matter of taste. Obviously, we will not leave the y slot empty this time.
Suppose we want to look at the bivariate relationship between the number of calories and the sugar content of our cereals.
plot(calories ~ sugars, data = UScereal)
xyplot(calories ~ sugars, data = UScereal)
xyplot(calories ~ sugars | newshelf, data = UScereal)
xyplot(calories ~ sugars, groups = newshelf, data = UScereal)
The final graph is trying to display information about the categorical variable 'newshelf' as well, but leaves a lot to be desired. Which color represents which shelf is not labeled, the points are too small, and we might wish to use different symbols rather than or along with colors to represent different levels of the shelf variable. The next chunk of code is very complicated, but manually changes many things. 'cex' changes the size of the points and 'pch' the plotting character. To learn more about this, type 'help(par)' into the R Console or go to http://www.statmethods.net/advgraphs/parameters.html
xyplot(calories ~ sugars, groups = newshelf, data = UScereal, auto.key = TRUE,
par.settings = list(superpose.symbol = list(cex = c(1.5, 1.5, 1.5), col = c("black",
"red", "blue"), pch = 1:3)))
It is very easy to compute the mean in R. R will assume your data represents a sample and you are therefore computing the sample statistic \( \bar{x} \), rather than the population parameter \( \mu \). Of course, both are computed with the same formula, \( \bar{x}=\frac{\sum x_i}{n} \).
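As a quick hedged check that 'mean' agrees with the formula, we can build \( \bar{x} \) from its pieces (this assumes the UScereal data from the MASS package is still loaded, as above).
# sum(x)/n should reproduce mean(UScereal$sugars)
sum(UScereal$sugars)/length(UScereal$sugars)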
You can use the 'by' function to separate by levels of another variable. 'median' computes (you guessed it!) the median of your distribution.
mean(UScereal$sugars)
## [1] 10.05
by(UScereal$sugars, UScereal$newshelf, mean)
## UScereal$newshelf: bottom
## [1] 6.295
## --------------------------------------------------------
## UScereal$newshelf: middle
## [1] 12.51
## --------------------------------------------------------
## UScereal$newshelf: top
## [1] 10.86
median(UScereal$sugars)
## [1] 12
by(UScereal$sugars, UScereal$newshelf, median)
## UScereal$newshelf: bottom
## [1] 3.739
## --------------------------------------------------------
## UScereal$newshelf: middle
## [1] 12.5
## --------------------------------------------------------
## UScereal$newshelf: top
## [1] 12
Now, let's set aside our cereal and look at a small data set involving exam scores. I will manually enter these scores into a vector in R.
exam <- c(50, 55, 62, 68, 73, 76, 78, 79, 83, 97, 90, 94, 98)
exam
## [1] 50 55 62 68 73 76 78 79 83 97 90 94 98
mean(exam)
## [1] 77.15
median(exam)
## [1] 78
length(exam) #our sample size n
## [1] 13
The five number summary is a set of five simple statistics that quickly summarize the central tendency, the spread, and the shape of a distribution. The statistics used are:
1. minimum (or the 0th percentile)
2. first quartile \( Q_1 \) (the 25th percentile)
3. median \( M \) (the 50th percentile)
4. third quartile \( Q_3 \) (the 75th percentile)
5. maximum (or the 100th percentile)
We will define the quantile:
Let \( p \in [0,1] \). The \( p \)-quantile of a distribution is a number \( q \) such that the proportion of the distribution that is less than \( q \) is \( p \).
This seems easy enough, but it is tricky to compute in practice. As you will soon see, common quantiles (such as the first and third quartiles of the five-number summary) are computed differently on a TI-84 calculator than in R, with still other methods found in other software packages.
First, we'll discuss how this is computed on your calculator, which corresponds to the discussion found in many (but not all) low-level statistics textbooks.
The minimum and maximum are easy: just take the smallest and largest numbers, i.e. your 1st and \( n \)th order statistics. For the exams, \( Min=50 \) and \( Max=98 \).
The median isn't very hard either. With a sample of \( n=13 \) students, we use the \( \frac{n+1}{2} \) th order statistic. The 7th order statistic is \( M=78 \).
The TI-84 computes the first quartile \( Q_1 \) in this fashion. Take all of the data values less than the median (i.e. exclusive of the median). This is our first six order statistics \( 50,55,62,68,73,76 \). Take the median of those six values. When you do so, \( Q_1=\frac{62+68}{2}=65 \).
The third quartile is found in a similar fashion using the six values greater than the median, again excluding the median. Doing this, \( Q_3=\frac{90+94}{2}=92 \).
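Here is a quick throwaway sketch verifying those TI-84-style quartiles in R; this is my own ad hoc computation, not a built-in R method.
# split the sorted scores around the median, then take the median of each half
sorted <- sort(exam)
M <- sorted[7]  #the median, 78
median(sorted[sorted < M])  #TI-84 style Q1: median of 50,55,62,68,73,76 = 65
median(sorted[sorted > M])  #TI-84 style Q3: median of 79,83,90,94,97,98 = 92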
You are thinking “I bet R has a function for computing these values”. It has several.
summary(exam)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.0 68.0 78.0 77.2 90.0 98.0
quantile(exam)
## 0% 25% 50% 75% 100%
## 50 68 78 90 98
fivenum(exam)
## [1] 50 68 78 90 98
OH NO!!! The first and third quartiles (i.e. the 25th and 75th percentiles, or the 0.25-quantile and 0.75-quantile) are different from what my calculator gave!!! What is R doing???
It turns out there are many ways (type in 'help(quantile)' to learn a bunch of them) to compute quantiles. Here's how R does it.
To find the 25th percentile, we have \( n=13 \) and \( p=0.25 \). Take the \( [(n-1)p]+1 \) th order statistic. In our case, you end up with the 4th order statistic, which is \( Q_1=68 \) (as R said). Similarly with \( p=0.75 \), we use the 10th order statistic, which is \( Q_3=90 \).
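A hedged sketch of that rule for our data, where \( h=[(n-1)p]+1 \) happens to be a whole number (when it isn't, R interpolates between order statistics):
n <- length(exam)
sort(exam)[(n - 1) * 0.25 + 1]  #4th order statistic, Q1 = 68
sort(exam)[(n - 1) * 0.75 + 1]  #10th order statistic, Q3 = 90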
I do not want to get into the minutiae of the various ways of computing this, other than to say that the TI-84 method isn't even one of the optional methods you can ask R to compute with the 'quantile' function.
A common measure of spread, or variability, of a distribution is the range, which is just \( Range=Max-Min \). An obvious flaw of this statistic is that it is completely distorted by outliers. If I tell you after our first exam that the range was 70, that could be caused by a single individual (i.e. 26 out of 27 students got very high scores and one person did terribly).
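As a side note, R's 'range' function returns the minimum and maximum rather than their difference, so a small sketch of computing the range statistic is:
range(exam)  #returns 50 98 for the original scores
diff(range(exam))  #Max - Min = 48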
One solution is to use the interquartile range, \( IQR=Q_3-Q_1 \), which tells us the spread of the middle 50% of the distribution. The major use of this statistic is in determining whether or not outliers exist in the data.
The famous statistician John Tukey devised a simple rule for detecting outliers based on the \( IQR \): any value below \( Q_1-1.5\times IQR \) or above \( Q_3+1.5\times IQR \) is flagged as an outlier. For the purposes of my example, let's use the exam scores and R's version of the five-number summary.
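Here is a hedged sketch of those fences for the original exam scores, using R's quartiles of \( Q_1=68 \) and \( Q_3=90 \):
q1 <- as.numeric(quantile(exam, 0.25))  #68
q3 <- as.numeric(quantile(exam, 0.75))  #90
iqr <- q3 - q1  #22; R also has a built-in IQR function
c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  #fences at 35 and 123
any(exam < 35 | exam > 123)  #FALSE, so no outliers among the original scores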
Tukey constructed a graph known as a boxplot or box-and-whisker display that combines information from the five-number summary and outliers. Bars are drawn representing the median and the first/third quartiles, forming a 'box'. 'Whiskers' are extended to the smallest/largest data values that are NOT outliers; in our case, with no outliers the whiskers extend to the minimum and maximum.
The base graphics package uses the 'boxplot' function while lattice uses 'bwplot'.
boxplot(exam)
bwplot(~exam)
I am going to change one of the exam scores to show you how the plot will change with an outlier. Let's suppose the first student actually got a 25, not a 50. Since this score is below the lower fence of \( Q_1-1.5\times IQR=68-33=35 \), that student will be an outlier. Now, the whisker will extend to 55 and a symbol will be drawn at 25 to let the world know that we have an outlier.
exam[1] <- 25 #change the first element in the exam vector
bwplot(~exam) #redraw the boxplot
A great way of using boxplots is to draw side-by-side boxplots to compare two or more distributions. Let's go back to our breakfast cereals and compare the sugar content from shelf to shelf.
bwplot(newshelf ~ sugars, data = UScereal)
bwplot(sugars ~ newshelf, data = UScereal)
Although the range and the interquartile range can be used to measure the spread or dispersion of a data set, the most common statistics used for this purpose are the variance and its cousin, the standard deviation.
The sample variance is defined with the following formula: \[ s^2=\frac{\sum (x_i-\bar{x})^2}{n-1} \]
The numerator is the sum of squared deviations from the sample mean. Why do we square the deviations?
Notice in the denominator we divide by \( n-1 \) rather than \( n \), so our variance statistic is not quite the 'average squared deviation'. Do you know why we divide by \( n-1 \)?
Incidentally, if our data set represents an entire population rather than a sample, we have different notation and a slightly different formula for variance. \[ \sigma^2=\frac{\sum(x_i-\mu)^2}{n} \]
Although in general we will use technology to compute the variance (either a TI calculator or R), you should be able to compute it 'by hand' given a small data set.
Suppose we have a sample of size \( n=5 \) which is the age of five randomly selected customers in a store. \[ 24 \qquad 45 \qquad 62 \qquad 38 \qquad 35 \]
I will arrange the data into columns, computing the deviations and squared deviations. Notice that \( \bar{x}=40.8 \) years.
| \( x_i \) | \( (x_i-\bar{x}) \) | \( (x_i-\bar{x})^2 \) |
|---|---|---|
| 24 | -16.8 | 282.24 |
| 45 | 4.2 | 17.64 |
| 62 | 21.2 | 449.44 |
| 38 | -2.8 | 7.84 |
| 35 | -5.8 | 33.64 |
The sum of the rightmost column is the sum of squared deviations, the numerator of the formula. Thus \[ s^2=\frac{790.8}{5-1}=197.7 \text{ years}^2 \]
Notice our units get squared, which makes a physical interpretation of the variance difficult. We will typically take the square root of the variance, which we call the standard deviation. We use \( s \) to represent the standard deviation of a sample and \( \sigma \) for a population. In our example, \( s=\sqrt{197.7}=14.06 \) years.
We can double-check our arithmetic.
ages <- c(24, 45, 62, 38, 35)
mean(ages)
## [1] 40.8
var(ages)
## [1] 197.7
sqrt(var(ages))
## [1] 14.06
sd(ages)
## [1] 14.06
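For completeness, here is a quick sketch reproducing the intermediate columns of the table directly in R:
dev <- ages - mean(ages)  #the deviations: -16.8 4.2 21.2 -2.8 -5.8
dev^2  #the squared deviations, matching the rightmost column of the table
sum(dev^2)  #790.8, the numerator of the variance formula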
A nice command that is part of the 'mosaic' package is 'favstats', which computes the mean, five-number summary, standard deviation, sample size, and number of missing data values in one nice wrapper.
favstats(exam)
## min Q1 median Q3 max mean sd n missing
## 25 68 78 90 98 75.23 20.02 13 0
A contingency table or cross-tabulation (shortened to cross-tab) is a frequency distribution table that displays information about two variables simultaneously. Usually these variables are categorical factors, but they can also be numerical variables that have been grouped into categories. For example, we might have one variable represent the sex of a customer in the store and the second variable be age, where age groups such as 18-29, 30-44, 45-64, 65+ are used.
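As a hedged sketch of that idea, here is how a numerical age variable could be grouped with the 'cut' function and cross-tabbed against sex; all of the values below are invented for illustration.
# made-up data: group age into bins, then cross-tab against sex
age <- c(22, 35, 47, 68, 29, 51, 73, 40)
sex <- c("F", "M", "F", "F", "M", "M", "F", "M")
agegroup <- cut(age, breaks = c(17, 29, 44, 64, Inf),
    labels = c("18-29", "30-44", "45-64", "65+"))
table(sex, agegroup)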
Your textbook uses a famous example based on racial patterns in how the death penalty was applied to murderers in Florida in the 1970s. The built-in data set in 'fastR' is called 'deathPenalty'. We are going to create our own data frame in R based on data in a contingency table. Our example (which I took from a different textbook) looks at the sex and the political preference of a sample of college students (probably not in Kentucky).
| | Liberal | Moderate | Conservative |
|---|---|---|---|
| Female | 35 | 36 | 6 |
| Male | 50 | 44 | 21 |
Here's the R Code for creating a data frame for this table.
id <- 1:192 #total of 192 people in the sample
sex <- c(rep("Female", 77), rep("Male", 115)) #rows of the table
politics <- c(rep("Liberal", 35), rep("Moderate", 36), rep("Conservative", 6),
rep("Liberal", 50), rep("Moderate", 44), rep("Conservative", 21)) #columns of the table
mytable <- data.frame(id, sex, politics)
xtabs(~sex + politics, data = mytable)
## politics
## sex Conservative Liberal Moderate
## Female 6 35 36
## Male 21 50 44
Notice that R puts the factor levels in alphabetical order (Conservative, Liberal, Moderate), rather than following the 'liberal is left-wing, conservative is right-wing' idea. It is possible to reorder those levels, but for our purposes I will not worry about that.
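If you did want to reorder them, here is a hedged sketch using the 'levels' argument of 'factor'; I work on a copy so that the alphabetical ordering used in the rest of this example is preserved.
tmp <- mytable  #a copy, so mytable itself is unchanged
tmp$politics <- factor(tmp$politics, levels = c("Liberal", "Moderate", "Conservative"))
xtabs(~sex + politics, data = tmp)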
A graph called a mosaic plot can be used to visualize a contingency table. The concept is that each tile (rectangle) of the mosaic plot has an area that is proportional to the frequency in that cell of the contingency table. For example, since there were only 6 conservative females in the sample and that was the smallest frequency, it will be represented by a small rectangle. You will be able to see in the graph that males outnumbered females and that liberals & moderates outnumbered conservatives in this sample. Also, if there is some association between the variables (i.e. they are NOT independent), we should be able to see that.
mosaicplot(~sex + politics, data = mytable, col = c("red", "blue", "green"))
I added colors, assigning red to conservatives and blue to liberals to follow conventional political stereotypes in the U.S. You can see that the men are more likely to be conservative and less likely to be liberal/moderate than the women. This is based on the relative heights of those tiles.
What would the contingency table and mosaic plot look like if sex and politics were independent (i.e. if exactly the same proportions of men and women fell into the three political categories)? I've created an artificial example to show that.
id <- 1:180
sex <- c(rep("Female", 120), rep("Male", 60)) #rows of the table
politics <- c(rep("Liberal", 40), rep("Moderate", 20), rep("Conservative", 60),
rep("Liberal", 20), rep("Moderate", 10), rep("Conservative", 30)) #columns of the table
mytable <- data.frame(id, sex, politics)
xtabs(~sex + politics, data = mytable)
## politics
## sex Conservative Liberal Moderate
## Female 60 40 20
## Male 30 20 10
mosaicplot(~sex + politics, data = mytable, col = c("red", "blue", "green"))
An interesting phenomenon that sometimes shows up in categorical data is Simpson's Paradox. The definition of Simpson's Paradox is when a trend that appears in different groups of data disappears when these groups are combined or aggregated, and the reverse trend appears in the combined data. There are a handful of famous examples of Simpson's Paradox, including an example involving possible gender bias in graduate school admission (which I will describe below), the imposition of the death penalty in Florida (the example used in your textbook, originally found by the statistician Alan Agresti and colleagues), and examples involving the batting averages of baseball players. The Wikipedia article on Simpson's paradox gives an example involving the baseball players Derek Jeter and David Justice, http://en.wikipedia.org/wiki/Simpson's_paradox (that article also uses the graduate school example).
Let's look at the possibility of gender discrimination in graduate school at the University of California at Berkeley. In the 1970s, someone compiled data on the percentage of students who applied to grad school and were admitted. In 1973, there were 2691 male applicants to the six largest departments, of whom 1198 (44.5%) were admitted, compared with 1835 female applicants of whom only 557 (30.4%) were admitted. Were the committees that selected the graduate students full of male chauvinist pigs, or is there another explanation?
Fortunately, this data set is built into R as part of the datasets package, which is part of base R.
UCBAdmissions #several tables, broken down by departments
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
apply(UCBAdmissions, c(1, 2), sum) #aggregate into a single table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
You will notice that some of the departments (such as A and B) had very few female applicants, while others were roughly equal between the sexes (D and F) or had more females than males (C and E). Interestingly, most of the departments admitted a higher percentage of females than males. For example, department A admitted 89/108=82.4% of the women and only 512/825=62.1% of men, while department F admitted 24/341=7.0% of women and 22/373=5.9% of men.
Obviously, some departments admitted a high percentage of students while others admitted a very low percentage. Most of the departments, when taken individually, admitted a higher percentage of women than men, but when the data were combined, the overall percentage for women was lower because women were more likely to apply to the departments that accepted a low percentage of applicants (humanities) and less likely to apply to the departments that accepted a high percentage of applicants (science, engineering).
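To see the department-by-department rates directly, here is a hedged sketch using the base 'prop.table' function to convert each Gender-by-Dept slice of counts into admission proportions:
# proportion admitted/rejected within each Gender x Dept combination,
# then keep just the "Admitted" rows
round(prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ], 3)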
I also did a 'copy and paste' on the code for some graphs from the help file on the 'UCBAdmissions' data set.
## Plot for aggregated data
mosaicplot(apply(UCBAdmissions, c(1, 2), sum), main = "Student admissions at UC Berkeley",
col = c("lightblue", "hotpink"))
## Plots for individual departments
opar <- par(mfrow = c(2, 3), oma = c(0, 0, 2, 0))
for (i in 1:6) mosaicplot(UCBAdmissions[, , i], xlab = "Admit", ylab = "Sex",
col = c("lightblue", "hotpink"), main = paste("Department", LETTERS[i]))
mtext(expression(bold("Student admissions at UC Berkeley")), outer = TRUE, cex = 1.5)