Part 1

Iris data set questions

(1) How many cases were included in the data?

nrow(iris)
## [1] 150

There are 150 cases in total.

(2) How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
describe(iris)
## iris 
## 
##  5  Variables      150  Observations
## --------------------------------------------------------------------------------
## Sepal.Length 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      150        0       35    0.998    5.843      5.8   0.9462    4.600 
##      .10      .25      .50      .75      .90      .95 
##    4.800    5.100    5.800    6.400    6.900    7.255 
## 
## lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9
## --------------------------------------------------------------------------------
## Sepal.Width 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      150        0       23    0.992    3.057     3.05   0.4872    2.345 
##      .10      .25      .50      .75      .90      .95 
##    2.500    2.800    3.000    3.300    3.610    3.800 
## 
## lowest : 2   2.2 2.3 2.4 2.5, highest: 3.9 4   4.1 4.2 4.4
## --------------------------------------------------------------------------------
## Petal.Length 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      150        0       43    0.998    3.758     3.65    1.979     1.30 
##      .10      .25      .50      .75      .90      .95 
##     1.40     1.60     4.35     5.10     5.80     6.10 
## 
## lowest : 1   1.1 1.2 1.3 1.4, highest: 6.3 6.4 6.6 6.7 6.9
## --------------------------------------------------------------------------------
## Petal.Width 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      150        0       22     0.99    1.199      1.2   0.8676      0.2 
##      .10      .25      .50      .75      .90      .95 
##      0.2      0.3      1.3      1.8      2.2      2.3 
## 
## lowest : 0.1 0.2 0.3 0.4 0.5, highest: 2.1 2.2 2.3 2.4 2.5
## --------------------------------------------------------------------------------
## Species 
##        n  missing distinct 
##      150        0        3 
##                                            
## Value          setosa versicolor  virginica
## Frequency          50         50         50
## Proportion      0.333      0.333      0.333
## --------------------------------------------------------------------------------

We see five different variables:

  • Sepal.Length

  • Sepal.Width

  • Petal.Length

  • Petal.Width

  • Species

All but Species are numerical, so the data set has four numerical variables. All four are continuous: each is a physical measurement (a length or width in centimeters) that can in principle take any value within a range, even though the values are recorded to one decimal place.
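As a quick cross-check on the count above, the column types can also be queried directly. This is only a sketch using base R; the name `is_num` is introduced here for illustration.

```r
# Flag which columns of the built-in iris data frame are numeric
is_num <- sapply(iris, is.numeric)
is_num

# Names of the numeric variables, and how many there are
names(iris)[is_num]
sum(is_num)

# Class of every column (Species is stored as a factor)
sapply(iris, class)
```

Note that `is.numeric()` only separates numerical from non-numerical columns; deciding between continuous and discrete still depends on what each variable measures.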

(3) How many categorical variables are included in the data, and what are they? List the corresponding levels (categories).

describe(iris$Species)
## iris$Species 
##        n  missing distinct 
##      150        0        3 
##                                            
## Value          setosa versicolor  virginica
## Frequency          50         50         50
## Proportion      0.333      0.333      0.333

We know from the previous question that four of the five variables are numerical, which leaves the fifth variable (Species) as the only categorical variable. It has three levels:

  • setosa

  • versicolor

  • virginica
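The same levels can be listed without Hmisc. The following is a minimal base R sketch, assuming Species is stored as a factor (which it is in the built-in iris data):

```r
# Category labels of the factor, in their stored order
levels(iris$Species)

# Number of categories
nlevels(iris$Species)

# Number of cases in each category
table(iris$Species)
```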

Part 2

Pick a dataset and tell/show what category of data it belongs to with an appropriate chart/summary statistics

I chose the HairEyeColor data set for this problem. Based on the details provided by

help(HairEyeColor)
## starting httpd help server ... done

we can see that the data are best described as “cross-sectional,” since many (592) different subjects were observed at a single point in time. All of the variables (hair color, eye color, and sex) are categorical, since each one sorts subjects into qualitative groups. Originally, the table recorded only hair color and eye color. However, 18 years after the data were first reported, the table was split by sex as well, effectively creating two tables. This is what the data set looks like now:

HairEyeColor
## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8
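The structure of the table printed above can be inspected directly, and summing its cells gives the total number of subjects. This is a small sketch in base R; `n_subjects` is a name introduced here for illustration.

```r
# Size of each dimension of the three-way table: Hair, Eye, Sex
dim(HairEyeColor)

# Category labels along each dimension
dimnames(HairEyeColor)

# Every cell is a count of subjects, so the grand total is the sample size
n_subjects <- sum(HairEyeColor)
n_subjects
```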

But this is what it would have looked like when it was originally reported in 1974:

x <- apply(HairEyeColor, c(1, 2), sum)
x
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16
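Collapsing over sex can also be done with `margin.table()`, which keeps the result as a table object. The sketch below (with the illustrative names `hair_eye` and `same`) checks that it agrees with the `apply()` approach and then shows the one-way totals:

```r
# Sum over Sex, keeping dimensions 1 (Hair) and 2 (Eye)
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))

# Compare cell by cell with the apply() result
same <- all(hair_eye == apply(HairEyeColor, c(1, 2), sum))
same

# One-way totals for hair color and for eye color
margin.table(hair_eye, 1)
margin.table(hair_eye, 2)
```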

I wasn’t sure how to present these data graphically, since all of the variables are categorical. However, Michael Friendly of York University, in the same year that he added the split by sex, published a paper that addresses this very problem. In it, he describes the mosaic display (among other types of visualization). Each cell of the table (that is, each combination of categories) is drawn as a tile whose area is proportional to its frequency. For additional reading, see: Friendly, M. (1992a). Graphical methods for categorical data. SAS User Group International Conference Proceedings, 17, 190–200. http://datavis.ca/papers/sugi/sugi17.pdf
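To make the idea of "area proportional to frequency" concrete, the share of the total that each tile would occupy can be computed from the collapsed hair-by-eye table. This is only an illustrative sketch; `hair_eye` and `cell_prop` are names introduced here.

```r
# Collapse over Sex to get the two-way hair-by-eye table
hair_eye <- apply(HairEyeColor, c(1, 2), sum)

# Each cell as a proportion of all subjects; in a mosaic display this
# is the fraction of the total plot area taken up by the cell's tile
cell_prop <- prop.table(hair_eye)
round(cell_prop, 3)

# The proportions add up to 1, i.e. the tiles fill the whole display
sum(cell_prop)
```

The largest proportion belongs to the brown hair, brown eyes cell, so that combination gets the largest tile.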

We can use a mosaic display to get an easier-to-understand graphical representation of the data as it was originally taken:

require(graphics)
mosaicplot(x)

And this is how it looks in the current data set:

mosaicplot(HairEyeColor)
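The default plots can be made a little easier to read with a title and some color. The calls below are a sketch of two options that `mosaicplot()` supports; the titles are my own wording.

```r
# Three-way display; shade = TRUE colors each tile by its residual
# from a model in which the variables are independent
mosaicplot(HairEyeColor, shade = TRUE,
           main = "Hair and eye color by sex")

# Formula interface: plot only Hair and Eye, summing over Sex
mosaicplot(~ Hair + Eye, data = HairEyeColor, color = TRUE,
           main = "Hair and eye color")
```

With shading on, tiles that are more or less frequent than independence would predict stand out, which helps show where hair and eye color are associated.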