nrow(iris)
## [1] 150
The data set contains 150 cases (rows) in total.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
describe(iris)
## iris
##
## 5 Variables 150 Observations
## --------------------------------------------------------------------------------
## Sepal.Length
## n missing distinct Info Mean pMedian Gmd .05
## 150 0 35 0.998 5.843 5.8 0.9462 4.600
## .10 .25 .50 .75 .90 .95
## 4.800 5.100 5.800 6.400 6.900 7.255
##
## lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9
## --------------------------------------------------------------------------------
## Sepal.Width
## n missing distinct Info Mean pMedian Gmd .05
## 150 0 23 0.992 3.057 3.05 0.4872 2.345
## .10 .25 .50 .75 .90 .95
## 2.500 2.800 3.000 3.300 3.610 3.800
##
## lowest : 2 2.2 2.3 2.4 2.5, highest: 3.9 4 4.1 4.2 4.4
## --------------------------------------------------------------------------------
## Petal.Length
## n missing distinct Info Mean pMedian Gmd .05
## 150 0 43 0.998 3.758 3.65 1.979 1.30
## .10 .25 .50 .75 .90 .95
## 1.40 1.60 4.35 5.10 5.80 6.10
##
## lowest : 1 1.1 1.2 1.3 1.4, highest: 6.3 6.4 6.6 6.7 6.9
## --------------------------------------------------------------------------------
## Petal.Width
## n missing distinct Info Mean pMedian Gmd .05
## 150 0 22 0.99 1.199 1.2 0.8676 0.2
## .10 .25 .50 .75 .90 .95
## 0.2 0.3 1.3 1.8 2.2 2.3
##
## lowest : 0.1 0.2 0.3 0.4 0.5, highest: 2.1 2.2 2.3 2.4 2.5
## --------------------------------------------------------------------------------
## Species
## n missing distinct
## 150 0 3
##
## Value setosa versicolor virginica
## Frequency 50 50 50
## Proportion 0.333 0.333 0.333
## --------------------------------------------------------------------------------
We see five different variables:
Sepal.Length
Sepal.Width
Petal.Length
Petal.Width
Species
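All but Species are numerical variables, meaning that the data set has four numerical variables. They are all continuous, since each one is a length or width measurement (in centimeters) that can in principle take any value within a range, even though the values are recorded to one decimal place.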
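As a quick check of the count above, the column types can be inspected directly. This is a minimal sketch using only base R; the name `is_num` is introduced here for illustration.

```r
# Flag each column of the built-in iris data frame as numeric or not.
is_num <- sapply(iris, is.numeric)

names(iris)[is_num]    # the numerical (measurement) variables
names(iris)[!is_num]   # the remaining, non-numerical variable
sum(is_num)            # how many numerical variables there are
```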
describe(iris$Species)
## iris$Species
## n missing distinct
## 150 0 3
##
## Value setosa versicolor virginica
## Frequency 50 50 50
## Proportion 0.333 0.333 0.333
We know from the previous question that four out of the five variables are numerical, which means that the fifth variable (Species) is categorical. This particular categorical variable has three categories:
setosa
versicolor
virginica
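The same categories can also be read straight from the factor itself, without going through describe(). A small sketch using base R only:

```r
# Species is stored as a factor; its levels are the categories.
is.factor(iris$Species)   # confirms the variable is categorical
levels(iris$Species)      # the category labels
nlevels(iris$Species)     # how many categories there are
table(iris$Species)       # number of cases in each category
```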
I chose the HairEyeColor data set for this problem. Based on the details provided by
help(HairEyeColor)
## starting httpd help server ... done
we can see that the data here are best described as “cross sectional”, since many (592) different subjects were observed at a single point in time. All of the variables in the data (hair color, eye color, and sex) are categorical as well, since each one sorts subjects into qualitative groups. Originally, only hair color and eye color were tabulated. However, 18 years after the table was first published in 1974, it was split by sex as well, effectively creating two tables. This is what the data set looks like now:
HairEyeColor
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
But this is what the table would have looked like when it was originally published in 1974, with the two sexes combined:
x <- apply(HairEyeColor, c(1, 2), sum)
x
## Eye
## Hair Brown Blue Hazel Green
## Black 68 20 15 5
## Brown 119 84 54 29
## Red 26 17 14 14
## Blond 7 94 10 16
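The collapsed table can be cross-checked in two ways. The sketch below recomputes it with margin.table(), a base R alternative to the apply() call, and confirms that collapsing over sex does not change the total number of subjects. The name `x_alt` is introduced here for illustration.

```r
# Collapse the three-way table over Sex, keeping Hair (1) and Eye (2).
x <- apply(HairEyeColor, c(1, 2), sum)
x_alt <- margin.table(HairEyeColor, c(1, 2))

all(x == x_alt)      # TRUE if both approaches give the same counts
sum(HairEyeColor)    # total subjects in the three-way table
sum(x)               # total subjects in the collapsed table
```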
I wasn’t sure how to present these data in graphical form, since all of the variables are categorical. However, in 1992, the same year he added the split by sex, Michael Friendly of York University published a paper that provides a solution to this very problem. In it, he describes the mosaic display (among other types of visualization), in which each combination of categories in the data set is represented as a tile whose area is proportional to its frequency. For additional reading, see: Friendly, M. (1992a). Graphical methods for categorical data. SAS User Group International Conference Proceedings, 17, 190–200. http://datavis.ca/papers/sugi/sugi17.pdf
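To make the idea of "area proportional to frequency" concrete, the cell counts can be converted to proportions of the whole table; each proportion is the share of the total plot area that the matching tile should occupy. This is a sketch in base R, and the name `p` is introduced here for illustration.

```r
# Hair-by-eye counts with the sexes combined.
x <- apply(HairEyeColor, c(1, 2), sum)

# Share of all subjects in each hair/eye combination.
p <- prop.table(x)
round(p, 3)

# Example: the share for blond hair with blue eyes.
p["Blond", "Blue"]
```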
We can use a mosaic display to get an easier-to-understand graphical representation of the data as it was originally taken:
require(graphics)
mosaicplot(x)
And this is how it looks in the current data set:
mosaicplot(HairEyeColor)
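The default plots can be made easier to read with a title and shading. The sketch below is one possible variation, not the only way to draw it: the `main` titles are my own wording, and `shade = TRUE` asks mosaicplot() to color each tile by its standardized residual from a model of independence, so tiles that are more or less frequent than independence would predict stand out.

```r
# Hair-by-eye counts with the sexes combined.
x <- apply(HairEyeColor, c(1, 2), sum)

# Two-way display: tile area shows frequency, shading shows
# departure from independence between hair and eye color.
mosaicplot(x, main = "Hair and eye color, sexes combined", shade = TRUE)

# Three-way display of the current data set, split by sex.
mosaicplot(HairEyeColor, main = "Hair and eye color by sex", shade = TRUE)
```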