author: Christopher Mecklin
date: August 29, 2013
In chapter 1, we cover methods of descriptive statistics, which serve to describe and summarize data. Most of the statistical methods and mathematics covered in this chapter are pretty basic (mean/median/mode, histograms, etc.), and you have seen most of it in classes prior to this one.
However, what is new to most of you is using the R statistical language and dealing with data structures in it. So this document is primarily concerned with introducing the use of R.
We will access R in one of three ways.
Use the base R console (i.e. only download R, click on the big blue R icon) and enter and run every command one line at a time. I DO NOT recommend doing this.
Use the RStudio IDE along with R. This way, it is much more convenient to enter programs, run portions of them, save programs, save output, save graphs, access the help features, etc. You should either download both R from http://cran.r-project.org and RStudio from http://www.rstudio.com for your computer, if possible, or use RStudio in the Faculty Hall 1st floor computer lab.
Access the RStudio server that I have running on an Amazon cloud server. I DO NOT promise this will be available all semester, and recommend using it only if you don't have a computer of your own to install the software on, for times when the computer lab is not open. To do this, email me to get your username and password. Your username will just be your last name. Then log in at http://www.tinyurl.com/MurrayRStudio
In order to do statistical analysis, you need to get data into R. There are three primary ways we will do this in this class.
Use data sets that are “built-in” to packages. For example, the faithful data set is in the datasets package that is built into R. We will also use the require command to access other packages (and data sets that are in those packages), particularly the fastR package that goes with our textbook.
Manually enter data into an R program or script. This is tedious and is only a good idea for very small examples.
Import data that has been entered and saved in another format. This includes spreadsheets such as Excel, databases, other statistical software packages, and text files that are in .txt or .csv formats. We won't talk about how to accomplish this in detail at this point, although a minimal sketch appears below.
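Just so you've seen it once, here is a minimal hedged sketch of reading a .csv file into a data frame; the file name 'mydata.csv' is hypothetical and stands in for whatever file you have saved.
# hypothetical example: read a comma-separated file with a header row into a data frame
mydata <- read.csv("mydata.csv", header = TRUE)
str(mydata)  #inspect the variables that were imported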
In statistics, we will usually store our data in a 2-dimensional structure, similar to a matrix, where the rows represent the individuals or observations that we have collected data on, while the columns represent the variables that we have measured. For example, I could give a survey to our class with several questions, let's say 10 questions. With 27 students in this class, I'd have a \( 27 \times 10 \) structure to hold the data. Each student would be a row (for example, John Doe or Jane Smith), and each question/variable would be a column.
Some of the variables will be numerical in nature: for example, I might ask you your age or your height or to rank your love of statistics on a scale from 1 to 5. Others will be categorical in nature: for example, your gender or your favorite color.
In R, we will usually store our data sets in a structure called a data frame, which will be as described above. It is similar, but not identical, to a matrix. Later this semester, we will occasionally work with matrices in R and perform matrix arithmetic.
In a data frame, the top row is called the header and contains the names of all of the variables. Each horizontal row after is a data row, which contains the name of the row, followed by the data values for that individual/observation on each variable.
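To make the idea concrete, here is a small sketch that builds a tiny data frame from scratch; the students and values are invented purely for illustration.
# a tiny made-up survey: each row is a student, each column a variable
survey <- data.frame(Age = c(19, 21, 20),
                     Height = c(70, 64, 68),
                     Color = c("blue", "green", "red"))
survey  #printing shows the header of variable names plus one row per student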
Let's look at some R code and output for accessing and manipulating data frames.
# load the fastR package, which also loads several other packages such as mosaic
require(fastR)
# Fisher's iris data set
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
# look at the structure of that data set 150 rows (flowers), 5 columns
# (variables) 4 variables are numerical, 1 is a categorical factor
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# use the snippet command to run any chunk of R code in the text where the
# name of the snippet is in the little box
snippet("iris-str")
##
##
## snippet(iris-str)
## ------- ~~~~~~~~
##
## > str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Let's look at some other examples; these are part of the fastR package.
help(actgpa)
actgpa
## ACT GPA
## 1 30 3.992
## 2 25 3.377
## 3 20 3.009
## 4 24 3.509
## 5 32 3.969
## 6 32 3.917
## 7 31 3.547
## 8 22 3.416
## 9 26 3.287
## 10 32 4.000
## 11 23 3.446
## 12 30 3.905
## 13 25 2.926
## 14 26 3.100
## 15 26 3.446
## 16 19 2.785
## 17 32 3.663
## 18 20 3.368
## 19 24 3.352
## 20 32 3.929
## 21 31 3.620
## 22 28 3.765
## 23 19 1.986
## 24 21 2.836
## 25 26 3.119
## 26 22 2.662
xyplot(GPA ~ ACT, actgpa) #scatterplot
help(students)
# students has 1000 rows, let's not look at all of them
head(students)
## ACT SAT Grad GradGPA HSGPA Cohort
## 1 30 NA TRUE 3.613 3.743 2002
## 2 20 NA TRUE 2.993 2.968 2004
## 3 23 1060 TRUE 3.582 3.507 2002
## 4 30 1420 TRUE 3.513 3.990 2005
## 5 21 1010 TRUE 2.703 3.253 2005
## 6 NA 730 TRUE 3.360 2.621 2001
head(students, n = 12)
## ACT SAT Grad GradGPA HSGPA Cohort
## 1 30 NA TRUE 3.613 3.743 2002
## 2 20 NA TRUE 2.993 2.968 2004
## 3 23 1060 TRUE 3.582 3.507 2002
## 4 30 1420 TRUE 3.513 3.990 2005
## 5 21 1010 TRUE 2.703 3.253 2005
## 6 NA 730 TRUE 3.360 2.621 2001
## 7 23 NA TRUE 2.716 3.100 2003
## 8 25 NA FALSE NA 2.588 2005
## 9 30 NA TRUE 3.397 3.762 2003
## 10 21 NA FALSE NA 3.421 2003
## 11 21 NA FALSE NA 2.622 2005
## 12 24 NA TRUE 3.095 3.458 2003
tail(students)
## ACT SAT Grad GradGPA HSGPA Cohort
## 995 22 NA TRUE 2.725 2.626 2003
## 996 27 NA FALSE NA 3.574 2004
## 997 25 NA TRUE 2.854 3.506 2005
## 998 29 NA TRUE 3.448 3.861 2005
## 999 NA 1160 TRUE 3.727 3.914 2003
## 1000 23 1110 FALSE NA 3.973 2005
One of my favorite data sets for teaching is from the MASS package. It is called UScereal and has nutritional information on 65 different brands of breakfast cereal.
One of the variables, shelf, contains values 1,2,3 that correspond to the bottom, middle, and top shelf at the grocery store. We know that usually cereals that adults might want (bran flakes, granola) are on the top shelf, sugary cereals geared towards kids are on the middle shelf, and cheaper cereals are on the bottom.
I'm going to “recode” the shelf variable into a new variable called newshelf that will say “top”, “middle”, and “bottom” rather than some arbitrary number that I might forget. This will require the car package.
require(MASS) #a different package
str(UScereal)
## 'data.frame': 65 obs. of 11 variables:
## $ mfr : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
## $ calories : num 212 212 100 147 110 ...
## $ protein : num 12.12 12.12 8 2.67 2 ...
## $ fat : num 3.03 3.03 0 2.67 0 ...
## $ sodium : num 394 788 280 240 125 ...
## $ fibre : num 30.3 27.3 28 2 1 ...
## $ carbo : num 15.2 21.2 16 14 11 ...
## $ sugars : num 18.2 15.2 0 13.3 14 ...
## $ shelf : int 3 3 3 1 2 3 1 3 2 1 ...
## $ potassium: num 848.5 969.7 660 93.3 30 ...
## $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
require(car) # yet another package
UScereal$newshelf <- recode(UScereal$shelf, "1='bottom';2='middle';3='top'")
## Warning: NAs introduced by coercion
str(UScereal)
## 'data.frame': 65 obs. of 12 variables:
## $ mfr : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
## $ calories : num 212 212 100 147 110 ...
## $ protein : num 12.12 12.12 8 2.67 2 ...
## $ fat : num 3.03 3.03 0 2.67 0 ...
## $ sodium : num 394 788 280 240 125 ...
## $ fibre : num 30.3 27.3 28 2 1 ...
## $ carbo : num 15.2 21.2 16 14 11 ...
## $ sugars : num 18.2 15.2 0 13.3 14 ...
## $ shelf : int 3 3 3 1 2 3 1 3 2 1 ...
## $ potassium: num 848.5 969.7 660 93.3 30 ...
## $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ newshelf : chr "top" "top" "top" "bottom" ...
We can access individual data values or portions of our data frame by using brackets. The first value in brackets represents the row(s) and the second value the column(s) that you want.
UScereal[5, 2] #row 5, column 2, which is number of calories in Apple Jacks
## [1] 110
UScereal[1:10, 1:3] #first ten rows and three columns
## mfr calories protein
## 100% Bran N 212.1 12.121
## All-Bran K 212.1 12.121
## All-Bran with Extra Fiber K 100.0 8.000
## Apple Cinnamon Cheerios G 146.7 2.667
## Apple Jacks K 110.0 2.000
## Basic 4 G 173.3 4.000
## Bran Chex R 134.3 2.985
## Bran Flakes P 134.3 4.478
## Cap'n'Crunch Q 160.0 1.333
## Cheerios G 88.0 4.800
UScereal[, 1:3] #all rows, first three columns
## mfr calories protein
## 100% Bran N 212.12 12.1212
## All-Bran K 212.12 12.1212
## All-Bran with Extra Fiber K 100.00 8.0000
## Apple Cinnamon Cheerios G 146.67 2.6667
## Apple Jacks K 110.00 2.0000
## Basic 4 G 173.33 4.0000
## Bran Chex R 134.33 2.9851
## Bran Flakes P 134.33 4.4776
## Cap'n'Crunch Q 160.00 1.3333
## Cheerios G 88.00 4.8000
## Cinnamon Toast Crunch G 160.00 1.3333
## Clusters G 220.00 6.0000
## Cocoa Puffs G 110.00 1.0000
## Corn Chex R 110.00 2.0000
## Corn Flakes K 100.00 2.0000
## Corn Pops K 110.00 1.0000
## Count Chocula G 110.00 1.0000
## Cracklin' Oat Bran K 220.00 6.0000
## Crispix K 110.00 2.0000
## Crispy Wheat & Raisins G 133.33 2.6667
## Double Chex R 133.33 2.6667
## Froot Loops K 110.00 2.0000
## Frosted Flakes K 146.67 1.3333
## Frosted Mini-Wheats K 125.00 3.7500
## Fruit & Fibre: Dates Walnuts and Oats P 179.10 4.4776
## Fruitful Bran K 179.10 4.4776
## Fruity Pebbles P 146.67 1.3333
## Golden Crisp P 113.64 2.2727
## Golden Grahams G 146.67 1.3333
## Grape Nuts Flakes P 113.64 3.4091
## Grape-Nuts P 440.00 12.0000
## Great Grains Pecan P 363.64 9.0909
## Honey Graham Ohs Q 120.00 1.0000
## Honey Nut Cheerios G 146.67 4.0000
## Honey-comb P 82.71 0.7519
## Just Right Fruit & Nut K 186.67 4.0000
## Kix G 73.33 1.3333
## Life Q 149.25 5.9701
## Lucky Charms G 110.00 2.0000
## Mueslix Crispy Blend K 238.81 4.4776
## Multi-Grain Cheerios G 100.00 2.0000
## Nut&Honey Crunch K 179.10 2.9851
## Nutri-Grain Almond-Raisin K 208.96 4.4776
## Oatmeal Raisin Crisp G 260.00 6.0000
## Post Nat. Raisin Bran P 179.10 4.4776
## Product 19 K 100.00 3.0000
## Puffed Rice Q 50.00 1.0000
## Quaker Oat Squares Q 200.00 8.0000
## Raisin Bran K 160.00 4.0000
## Raisin Nut Bran G 200.00 6.0000
## Raisin Squares K 180.00 4.0000
## Rice Chex R 97.35 0.8850
## Rice Krispies K 110.00 2.0000
## Shredded Wheat 'n'Bran N 134.33 4.4776
## Shredded Wheat spoon size N 134.33 4.4776
## Smacks K 146.67 2.6667
## Special K K 110.00 6.0000
## Total Corn Flakes G 110.00 2.0000
## Total Raisin Bran G 140.00 3.0000
## Total Whole Grain G 100.00 3.0000
## Triples G 146.67 2.6667
## Trix G 110.00 1.0000
## Wheat Chex R 149.25 4.4776
## Wheaties G 100.00 3.0000
## Wheaties Honey Gold G 146.67 2.6667
UScereal[1:10, ] #first ten rows, all columns
## mfr calories protein fat sodium fibre carbo
## 100% Bran N 212.1 12.121 3.030 393.9 30.303 15.15
## All-Bran K 212.1 12.121 3.030 787.9 27.273 21.21
## All-Bran with Extra Fiber K 100.0 8.000 0.000 280.0 28.000 16.00
## Apple Cinnamon Cheerios G 146.7 2.667 2.667 240.0 2.000 14.00
## Apple Jacks K 110.0 2.000 0.000 125.0 1.000 11.00
## Basic 4 G 173.3 4.000 2.667 280.0 2.667 24.00
## Bran Chex R 134.3 2.985 1.493 298.5 5.970 22.39
## Bran Flakes P 134.3 4.478 0.000 313.4 7.463 19.40
## Cap'n'Crunch Q 160.0 1.333 2.667 293.3 0.000 16.00
## Cheerios G 88.0 4.800 1.600 232.0 1.600 13.60
## sugars shelf potassium vitamins newshelf
## 100% Bran 18.182 3 848.48 enriched top
## All-Bran 15.152 3 969.70 enriched top
## All-Bran with Extra Fiber 0.000 3 660.00 enriched top
## Apple Cinnamon Cheerios 13.333 1 93.33 enriched bottom
## Apple Jacks 14.000 2 30.00 enriched middle
## Basic 4 10.667 3 133.33 enriched top
## Bran Chex 8.955 1 186.57 enriched bottom
## Bran Flakes 7.463 3 283.58 enriched top
## Cap'n'Crunch 16.000 2 46.67 enriched middle
## Cheerios 0.800 1 84.00 enriched bottom
UScereal[c(1:5, 50:55), c("fibre", "sugars")] #a specific set of rows and columns
## fibre sugars
## 100% Bran 30.303 18.18
## All-Bran 27.273 15.15
## All-Bran with Extra Fiber 28.000 0.00
## Apple Cinnamon Cheerios 2.000 13.33
## Apple Jacks 1.000 14.00
## Raisin Nut Bran 5.000 16.00
## Raisin Squares 4.000 12.00
## Rice Chex 0.000 1.77
## Rice Krispies 0.000 3.00
## Shredded Wheat 'n'Bran 5.970 0.00
## Shredded Wheat spoon size 4.478 0.00
some(UScereal) #a random sample of rows
## mfr calories protein fat sodium fibre carbo
## Count Chocula G 110.0 1.000 1.000 180.0 0.000 12.00
## Double Chex R 133.3 2.667 0.000 253.3 1.333 24.00
## Honey Nut Cheerios G 146.7 4.000 1.333 333.3 2.000 15.33
## Just Right Fruit & Nut K 186.7 4.000 1.333 226.7 2.667 26.67
## Oatmeal Raisin Crisp G 260.0 6.000 4.000 340.0 3.000 27.00
## Product 19 K 100.0 3.000 0.000 320.0 1.000 20.00
## Shredded Wheat 'n'Bran N 134.3 4.478 0.000 0.0 5.970 28.36
## Total Corn Flakes G 110.0 2.000 1.000 200.0 0.000 21.00
## Wheat Chex R 149.3 4.478 1.493 343.3 4.478 25.37
## Wheaties G 100.0 3.000 1.000 200.0 3.000 17.00
## sugars shelf potassium vitamins newshelf
## Count Chocula 13.000 2 65.0 enriched middle
## Double Chex 6.667 3 106.7 enriched top
## Honey Nut Cheerios 13.333 1 120.0 enriched bottom
## Just Right Fruit & Nut 12.000 3 126.7 100% top
## Oatmeal Raisin Crisp 20.000 3 240.0 enriched top
## Product 19 3.000 3 45.0 100% top
## Shredded Wheat 'n'Bran 0.000 1 209.0 none bottom
## Total Corn Flakes 3.000 3 35.0 100% top
## Wheat Chex 4.478 1 171.6 enriched bottom
## Wheaties 3.000 1 110.0 enriched bottom
In statistics, we are very concerned with the distribution of a random variable, where a distribution is just a list of the values that the variable took on, and how often those values were taken. We will describe distributions in a variety of ways.
Here, I'll show R code for the frequency distributions of the manufacturer of the cereals and the shelf, along with a contingency table or cross-tab for manufacturer & shelf simultaneously. I'll also show a frequency distribution table for a numerical variable like sugar, which is not as useful.
# dataset$variable specifies which variable from the data set you want
table(UScereal$mfr)
##
## G K N P Q R
## 22 21 3 9 5 5
table(UScereal$newshelf)
##
## bottom middle top
## 18 18 29
table(UScereal$mfr, UScereal$newshelf) #contingency table or cross-tab
##
## bottom middle top
## G 6 7 9
## K 4 7 10
## N 2 0 1
## P 2 1 6
## Q 0 3 2
## R 4 0 1
table(UScereal$sugars)
##
## 0 0.8 1.769912 2 3 4 4.477612
## 4 1 1 2 8 1 1
## 5.681818 6 6.666667 7.462687 8.270677 8.75 8.955224
## 1 1 1 1 1 1 2
## 10.447761 10.666667 11 12 12.121212 13 13.333333
## 1 2 1 9 1 3 3
## 13.432836 14 14.666667 14.925373 15.151515 16 17.045455
## 1 4 1 1 1 4 1
## 17.910448 18.181818 19.402985 20 20.895522
## 1 1 1 2 1
table(UScereal$sugars > 5) #counts how many cereals have more than 5 grams/serving
##
## FALSE TRUE
## 18 47
Let us construct a bar chart to graphically display the frequency distribution of a categorical variable. First, I am going to create a new object which I will name counts (the name is arbitrary; I could call it george or x) to store the frequency distribution. Then I will plot that object, and then make it prettier with labels and colors.
counts <- table(UScereal$newshelf)
barplot(counts)
barplot(counts, main = "Frequency Distribution of Cereal", xlab = "Shelf", col = "lightblue")
Histograms are used to display the frequency distribution of a numerical variable, by creating “bins” (intervals) and counting how often the data values fall in those bins.
A blessing (and a curse) of open-source software such as R is that there are usually multiple ways to do things, such as construct graphs or fit models. We will encounter this reality frequently in graphing. A package called 'graphics' is built into R and has many functions for plotting. You do not need to use a 'require(graphics)' statement, as that package is automatically loaded any time you use R.
Randall Pruim, your textbook author, is a strong proponent of the graphics found in the 'lattice' package. You do need to use either 'require(lattice)' or 'require(fastR)' to access those graphics. Another popular package, which is very powerful but fairly complicated, is the 'ggplot2' package, developed by Hadley Wickham and based on the book The Grammar of Graphics by Leland Wilkinson. We will not use 'ggplot2' graphics in this course: go to http://ggplot2.org/ if you are interested in pursuing these graphics.
OK, so there's a 'hist' function in the base 'graphics' package and a 'histogram' function in the 'lattice' package. Let's compare them for a histogram of the sugar content of cereal.
hist(UScereal$sugars)
histogram(~sugars, data = UScereal) #lattice uses a (y~x|z,data=dataset) format
If you want control over the width of your bins, use the 'breaks' argument to set them. You can also add titles and change colors as we've done before.
histogram(~sugars, data = UScereal, type = "count", breaks = c(0, 5, 10, 15,
20, 25), col = "red", xlab = "Grams per serving", main = "Sugar Content of Cereal")
What's this (y~x|z,data=dataset) format all about? Well, it is used throughout the 'lattice' package. The idea is that the variable on the vertical axis goes in the y slot, the horizontal axis in the x slot, the conditional or grouping variable in the z slot, and the name of your data frame in the dataset slot.
In our example, we are constructing a histogram of the variable 'UScereal$sugars'. Since there is only one variable, there is no y or z in this example. Suppose you want separate histograms for each level of the categorical variable 'newshelf'. We'll just define z to be newshelf. Use the 'subset' argument if you want only one level; for instance, I only want a histogram of the sugar in cereals on the middle shelf.
histogram(~sugars | newshelf, data = UScereal)
histogram(~sugars | newshelf, data = UScereal, subset = newshelf == "middle")
For scatterplots, we have either the 'plot' function (from the graphics package) or 'xyplot' (from lattice). They are very similar and which one you use is a matter of taste. Obviously, we will not leave the y slot empty this time.
Suppose we want to look at the bivariate relationship between the number of calories and the sugar content of our cereals.
plot(calories ~ sugars, data = UScereal)
xyplot(calories ~ sugars, data = UScereal)
xyplot(calories ~ sugars | newshelf, data = UScereal)
xyplot(calories ~ sugars, groups = newshelf, data = UScereal)
The final graph is trying to display information about the categorical variable 'newshelf' as well, but leaves a lot to be desired. Which color represents which shelf is not labeled, the points are too small, and we might wish to use different symbols rather than or along with colors to represent different levels of the shelf variable. The next chunk of code is very complicated, but manually changes many things. 'cex' changes the size of the points and 'pch' the plotting character. To learn more about this, type 'help(par)' into the R Console or go to http://www.statmethods.net/advgraphs/parameters.html
xyplot(calories ~ sugars, groups = newshelf, data = UScereal, auto.key = TRUE,
par.settings = list(superpose.symbol = list(cex = c(1.5, 1.5, 1.5), col = c("black",
"red", "blue"), pch = 1:3)))
It is very easy to compute the mean in R. R will assume your data represents a sample and you are therefore computing the sample statistic \( \bar{x} \), rather than the population parameter \( \mu \). Of course, both are computed with the same formula, \( \bar{x}=\frac{\sum x_i}{n} \).
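As a quick hedged check that 'mean' agrees with the formula, we can build \( \bar{x} \) from its pieces (this assumes the UScereal data from the MASS package is still loaded, as above).
# sum(x)/n should reproduce mean(UScereal$sugars)
sum(UScereal$sugars)/length(UScereal$sugars)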
You can use the 'by' function to separate by levels of another variable. 'median' computes (you guessed it!) the median of your distribution.
mean(UScereal$sugars)
## [1] 10.05
by(UScereal$sugars, UScereal$newshelf, mean)
## UScereal$newshelf: bottom
## [1] 6.295
## --------------------------------------------------------
## UScereal$newshelf: middle
## [1] 12.51
## --------------------------------------------------------
## UScereal$newshelf: top
## [1] 10.86
median(UScereal$sugars)
## [1] 12
by(UScereal$sugars, UScereal$newshelf, median)
## UScereal$newshelf: bottom
## [1] 3.739
## --------------------------------------------------------
## UScereal$newshelf: middle
## [1] 12.5
## --------------------------------------------------------
## UScereal$newshelf: top
## [1] 12
Now, let's set aside our cereal and look at a small data set involving exam scores. I will manually enter these scores into a vector in R.
exam <- c(50, 55, 62, 68, 73, 76, 78, 79, 83, 97, 90, 94, 98)
exam
## [1] 50 55 62 68 73 76 78 79 83 97 90 94 98
mean(exam)
## [1] 77.15
median(exam)
## [1] 78
length(exam) #our sample size n
## [1] 13
The five number summary is a set of five simple statistics that quickly summarize the central tendency, the spread, and the shape of a distribution. The statistics used are:
1. minimum (or the 0th percentile)
2. first quartile \( Q_1 \) (the 25th percentile)
3. median \( M \) (the 50th percentile)
4. third quartile \( Q_3 \) (the 75th percentile)
5. maximum (or the 100th percentile)
We will define the quantile:
Let \( p \in [0,1] \). The \( p \)-quantile of a distribution is a number \( q \) such that the proportion of the distribution that is less than \( q \) is \( p \).
This seems easy enough, but it is tricky to compute in practice. As you will soon see, common quantiles (such as the first and third quartiles of the five-number summary) are computed differently on a TI-84 calculator than in R, with still other methods found in other software packages.
First, we'll discuss how this is computed on your calculator, which corresponds to the discussion found in many (but not all) low-level statistics textbooks.
The minimum and maximum are easy: just take the smallest and largest numbers, i.e. your 1st and \( n \)th order statistics. For the exams, \( Min=50 \) and \( Max=98 \).
The median isn't very hard either. With a sample of \( n=13 \) students, we use the \( \frac{n+1}{2} \) th order statistic. The 7th order statistic is \( M=78 \).
The TI-84 computes the first quartile \( Q_1 \) in this fashion. Take all of the data values less than the median (i.e. exclusive of the median). This is our first six order statistics \( 50,55,62,68,73,76 \). Take the median of those six values. When you do so, \( Q_1=\frac{62+68}{2}=65 \).
The third quartile is found in a similar fashion using the six values greater than the median, again excluding the median. Doing this, \( Q_3=\frac{90+94}{2}=92 \).
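Here is a quick throwaway sketch verifying those TI-84-style quartiles in R; this is my own ad hoc computation, not a built-in R method.
# split the sorted scores around the median, then take the median of each half
sorted <- sort(exam)
M <- sorted[7]  #the median, 78
median(sorted[sorted < M])  #TI-84 style Q1: median of 50,55,62,68,73,76 = 65
median(sorted[sorted > M])  #TI-84 style Q3: median of 79,83,90,94,97,98 = 92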
You are thinking “I bet R has a function for computing these values”. It has several.
summary(exam)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.0 68.0 78.0 77.2 90.0 98.0
quantile(exam)
## 0% 25% 50% 75% 100%
## 50 68 78 90 98
fivenum(exam)
## [1] 50 68 78 90 98
OH NO!!! The first and third quartiles (i.e. the 25th and 75th percentiles, or the 0.25-quantile and 0.75-quantile) are different from what my calculator gave!!! What is R doing???
It turns out there are many ways (type in 'help(quantile)' to learn a bunch of them) to compute quantiles. Here's how R does it.
To find the 25th percentile, we have \( n=13 \) and \( p=0.25 \). Take the \( [(n-1)p]+1 \) th order statistic. In our case, you end up with the 4th order statistic, which is \( Q_1=68 \) (as R said). Similarly with \( p=0.75 \), we use the 10th order statistic, which is \( Q_3=90 \).
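A hedged sketch of that rule for our data, where \( h=[(n-1)p]+1 \) happens to be a whole number (when it isn't, R interpolates between order statistics):
n <- length(exam)
sort(exam)[(n - 1) * 0.25 + 1]  #4th order statistic, Q1 = 68
sort(exam)[(n - 1) * 0.75 + 1]  #10th order statistic, Q3 = 90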
I do not want to get into the minutiae of the various ways of computing this, other than to say that the TI-84 method isn't even one of the optional methods you can ask R to compute with the 'quantile' function.
A common measure of spread, or variability, of a distribution is the range, which is just \( Range=Max-Min \). An obvious flaw of this statistic is that it is completely distorted by outliers. If I tell you after our first exam that the range was 70, that could be caused by a single individual (i.e. 26 out of 27 students got very high scores and one person did terribly).
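As a side note, R's 'range' function returns the minimum and maximum rather than their difference, so a small sketch of computing the range statistic is:
range(exam)  #returns 50 98 for the original scores
diff(range(exam))  #Max - Min = 48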
One solution is to use the interquartile range, \( IQR=Q_3-Q_1 \), which tells us the spread of the middle 50% of the distribution. The major use of this statistic is in determining whether or not outliers exist in the data.
The famous statistician John Tukey devised a simple rule for detecting outliers based on the \( IQR \): any value below \( Q_1-1.5\times IQR \) or above \( Q_3+1.5\times IQR \) is flagged as an outlier. For the purposes of my example, let's use the exam scores and R's version of the five-number summary.
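Here is a hedged sketch of those fences for the original exam scores, using R's quartiles of \( Q_1=68 \) and \( Q_3=90 \):
q1 <- as.numeric(quantile(exam, 0.25))  #68
q3 <- as.numeric(quantile(exam, 0.75))  #90
iqr <- q3 - q1  #22; R also has a built-in IQR function
c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  #fences at 35 and 123
any(exam < 35 | exam > 123)  #FALSE, so no outliers among the original scores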
Tukey constructed a graph known as a boxplot or box-and-whisker display that combines information from the five-number summary and outliers. Bars are drawn representing the median and the first/third quartiles, forming a 'box'. 'Whiskers' are extended to the smallest/largest data values that are NOT outliers; in our case, with no outliers the whiskers extend to the minimum and maximum.
The base graphics package uses the 'boxplot' function while lattice uses 'bwplot'.
boxplot(exam)
bwplot(~exam)
I am going to change one of the exam scores to show you how the plot will change with an outlier. Let's suppose the first student actually got a 25, not a 50. Since this score is below the lower fence of \( Q_1-1.5\times IQR=68-33=35 \), that student will be an outlier. Now, the whisker will extend to 55 and a symbol will be drawn at 25 to let the world know that we have an outlier.
exam[1] <- 25 #change the first element in the exam vector
bwplot(~exam) #redraw the boxplot
A great way of using boxplots is to draw side-by-side boxplots to compare two or more distributions. Let's go back to our breakfast cereals and compare the sugar content from shelf to shelf.
bwplot(newshelf ~ sugars, data = UScereal)
bwplot(sugars ~ newshelf, data = UScereal)
Although the range and the interquartile range can be used to measure the spread or dispersion of a data set, the most common statistics used for this purpose are the variance and its cousin, the standard deviation.
The sample variance is defined with the following formula: \[ s^2=\frac{\sum (x_i-\bar{x})^2}{n-1} \]
The numerator is the sum of squared deviations from the sample mean. Why do we square the deviations?
Notice in the denominator we divide by \( n-1 \) rather than \( n \), so our variance statistic is not quite the 'average squared deviation'. Do you know why we divide by \( n-1 \)?
Incidentally, if our data set represents an entire population rather than a sample, we have different notation and a slightly different formula for variance. \[ \sigma^2=\frac{\sum(x_i-\mu)^2}{n} \]
Although in general we will use technology to compute the variance (either a TI calculator or R), you should be able to compute it 'by hand' given a small data set.
Suppose we have a sample of size \( n=5 \) which is the age of five randomly selected customers in a store. \[ 24 \qquad 45 \qquad 62 \qquad 38 \qquad 35 \]
I will arrange the data into columns, computing the deviations and squared deviations. Notice that \( \bar{x}=40.8 \) years.
| \( x_i \) | \( (x_i-\bar{x}) \) | \( (x_i-\bar{x})^2 \) |
|---|---|---|
| 24 | -16.8 | 282.24 |
| 45 | 4.2 | 17.64 |
| 62 | 21.2 | 449.44 |
| 38 | -2.8 | 7.84 |
| 35 | -5.8 | 33.64 |
The sum of the rightmost column is the sum of squared deviations, the numerator of the formula. Thus \[ s^2=\frac{790.8}{5-1}=197.7 \text{ years}^2 \]
Notice our units get squared, which makes a physical interpretation of the variance difficult. We will typically take the square root of the variance, which we call the standard deviation. We use \( s \) to represent the standard deviation of a sample and \( \sigma \) for a population. In our example, \( s=\sqrt{197.7}=14.06 \) years.
We can double-check our arithmetic.
ages <- c(24, 45, 62, 38, 35)
mean(ages)
## [1] 40.8
var(ages)
## [1] 197.7
sqrt(var(ages))
## [1] 14.06
sd(ages)
## [1] 14.06
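For completeness, here is a quick sketch reproducing the intermediate columns of the table directly in R:
dev <- ages - mean(ages)  #the deviations: -16.8 4.2 21.2 -2.8 -5.8
dev^2  #the squared deviations, matching the rightmost column of the table
sum(dev^2)  #790.8, the numerator of the variance formula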
A nice command that is part of the 'mosaic' package is 'favstats', which computes the mean, five-number summary, standard deviation, sample size, and number of missing data values in one nice wrapper.
favstats(exam)
## min Q1 median Q3 max mean sd n missing
## 25 68 78 90 98 75.23 20.02 13 0
A contingency table or cross-tabulation (shortened to cross-tab) is a frequency distribution table that displays information about two variables simultaneously. Usually these variables are categorical factors, but they can also be numerical variables that have been grouped into categories. For example, we might have one variable represent the sex of a customer in the store and the second variable be age, where age groups such as 18-29, 30-44, 45-64, 65+ are used.
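As a hedged sketch of that idea, here is how a numerical age variable could be grouped with the 'cut' function and cross-tabbed against sex; all of the values below are invented for illustration.
# made-up data: group age into bins, then cross-tab against sex
age <- c(22, 35, 47, 68, 29, 51, 73, 40)
sex <- c("F", "M", "F", "F", "M", "M", "F", "M")
agegroup <- cut(age, breaks = c(17, 29, 44, 64, Inf),
    labels = c("18-29", "30-44", "45-64", "65+"))
table(sex, agegroup)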
Your textbook uses a famous example based on racial patterns in how the death penalty was applied to murderers in Florida in the 1970s. The built-in data set in 'fastR' is called 'deathPenalty'. We are going to create our own data frame in R based on data in a contingency table. Our example (which I took from a different textbook) looks at the sex and the political preference of a sample of college students (probably not in Kentucky).
| | Liberal | Moderate | Conservative |
|---|---|---|---|
| Female | 35 | 36 | 6 |
| Male | 50 | 44 | 21 |
Here's the R Code for creating a data frame for this table.
id <- 1:192 #total of 192 people in the sample
sex <- c(rep("Female", 77), rep("Male", 115)) #rows of the table
politics <- c(rep("Liberal", 35), rep("Moderate", 36), rep("Conservative", 6),
rep("Liberal", 50), rep("Moderate", 44), rep("Conservative", 21)) #columns of the table
mytable <- data.frame(id, sex, politics)
xtabs(~sex + politics, data = mytable)
## politics
## sex Conservative Liberal Moderate
## Female 6 35 36
## Male 21 50 44
Notice that R puts the factor levels in alphabetical order (Conservative, Liberal, Moderate), rather than following the 'liberal is left-wing, conservative is right-wing' idea. It is possible to reorder those levels, but for our purposes I will not worry about that.
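If you did want to reorder them, here is a hedged sketch using the 'levels' argument of 'factor'; I work on a copy so that the alphabetical ordering used in the rest of this example is preserved.
tmp <- mytable  #a copy, so mytable itself is unchanged
tmp$politics <- factor(tmp$politics, levels = c("Liberal", "Moderate", "Conservative"))
xtabs(~sex + politics, data = tmp)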
A graph called a mosaic plot can be used to visualize a contingency table. The concept is that each tile (rectangle) of the mosaic plot has an area that is proportional to the frequency in that cell of the contingency table. For example, since there were only 6 conservative females in the sample and that was the smallest frequency, it will be represented by a small rectangle. You will be able to see in the graph that males outnumbered females and that liberals & moderates outnumbered conservatives in this sample. Also, if there is some association between the variables (i.e. they are NOT independent), we should be able to see that.
mosaicplot(~sex + politics, data = mytable, col = c("red", "blue", "green"))
I added colors, assigning red to conservatives and blue to liberals to follow conventional political stereotypes in the U.S. You can see that the men are more likely to be conservative and less likely to be liberal/moderate than the women. This is based on the relative heights of those tiles.
What would the contingency table and mosaic plot look like if sex and politics were independent (i.e. if exactly the same proportions of men and women fell into the three political categories)? I've created an artificial example to show that.
id <- 1:180
sex <- c(rep("Female", 120), rep("Male", 60)) #rows of the table
politics <- c(rep("Liberal", 40), rep("Moderate", 20), rep("Conservative", 60),
rep("Liberal", 20), rep("Moderate", 10), rep("Conservative", 30)) #columns of the table
mytable <- data.frame(id, sex, politics)
xtabs(~sex + politics, data = mytable)
## politics
## sex Conservative Liberal Moderate
## Female 60 40 20
## Male 30 20 10
mosaicplot(~sex + politics, data = mytable, col = c("red", "blue", "green"))
An interesting phenomenon that sometimes shows up in categorical data is Simpson's Paradox. The definition of Simpson's Paradox is when a trend that appears in different groups of data disappears when these groups are combined or aggregated, and the reverse trend appears in the combined data. There are a handful of famous examples of Simpson's Paradox, including an example involving possible gender bias in graduate school admission (which I will describe below), the imposition of the death penalty in Florida (the example used in your textbook, originally found by the statistician Alan Agresti and colleagues), and examples involving the batting averages of baseball players. The Wikipedia article on Simpson's paradox gives an example involving the baseball players Derek Jeter and David Justice, http://en.wikipedia.org/wiki/Simpson's_paradox (that article also uses the graduate school example).
Let's look at the possibility of gender discrimination in graduate school at the University of California at Berkeley. In the 1970s, someone compiled data on the percentage of students who applied to grad school and were admitted. In 1973, there were 2691 male applicants to the six largest departments, of whom 1198 (44.5%) were admitted, compared with 1835 female applicants of whom only 557 (30.4%) were admitted. Were the committees that selected the graduate students full of male chauvinist pigs, or is there another explanation?
Fortunately, this data set is built into R as part of the datasets package, which is part of base R.
UCBAdmissions #several tables, broken down by departments
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
apply(UCBAdmissions, c(1, 2), sum) #aggregate into a single table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
You will notice that some of the departments (such as A and B) had very few female applicants, while others were roughly equal between the sexes (D and F) or had more females than males (C and E). Interestingly, most of the departments admitted a higher percentage of females than males. For example, department A admitted 89/108=82.4% of the women and only 512/825=62.1% of men, while department F admitted 24/341=7.0% of women and 22/373=5.9% of men.
Obviously, some departments admitted a high percentage of students while others admitted a very low percentage. Most of the departments, when taken individually, admitted a higher percentage of women than men, but when the data were combined, the overall percentage for women was lower because women were more likely to apply to the departments that accepted a low percentage of applicants (humanities) and less likely to apply to the departments that accepted a high percentage of applicants (science, engineering).
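To see the department-by-department rates directly, here is a hedged sketch using the base 'prop.table' function to convert each Gender-by-Dept slice of counts into admission proportions:
# proportion admitted/rejected within each Gender x Dept combination,
# then keep just the "Admitted" rows
round(prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ], 3)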
I also did a 'copy and paste' on the code for some graphs from the help file on the 'UCBAdmissions' data set.
## Plot for aggregated data
mosaicplot(apply(UCBAdmissions, c(1, 2), sum), main = "Student admissions at UC Berkeley",
col = c("lightblue", "hotpink"))
## Plots for individual departments
opar <- par(mfrow = c(2, 3), oma = c(0, 0, 2, 0))
for (i in 1:6) mosaicplot(UCBAdmissions[, , i], xlab = "Admit", ylab = "Sex",
col = c("lightblue", "hotpink"), main = paste("Department", LETTERS[i]))
mtext(expression(bold("Student admissions at UC Berkeley")), outer = TRUE, cex = 1.5)