categorical

R Markdown

library(ggplot2) 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(scales)

data("HairEyeColor")

The Hair x Eye table comes from a survey of students at the University of Delaware reported by Snee (1974). The split by Sex was added by Friendly (1992a) for didactic purposes.

This data set is useful for illustrating various techniques for the analysis of contingency tables, such as the standard chi-squared test or, more generally, log-linear modelling, and graphical methods such as mosaic plots, sieve diagrams or association plots.

Source: http://euclid.psych.yorku.ca/ftp/sas/vcd/catdata/haireye.sas

Explore the data with the str() and summary() functions

str(HairEyeColor)

##  'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
##   ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
##   ..$ Sex : chr [1:2] "Male" "Female"

summary(HairEyeColor)

## Number of cases in table: 592 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 164.92, df = 24, p-value = 5.321e-23
##  Chi-squared approximation may be incorrect
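The test reported by summary() covers all three factors at once. As a minimal sketch of the standard two-way chi-squared test mentioned earlier, the table can be collapsed over Sex and passed to chisq.test(). The names hair_eye and hair_eye_test are introduced here for illustration only.

```r
# Collapse the 4 x 4 x 2 table over Sex to get a two-way Hair x Eye table.
hair_eye <- margin.table(HairEyeColor, c(1, 2))

# Pearson chi-squared test of independence between hair and eye color.
hair_eye_test <- chisq.test(hair_eye)
hair_eye_test

# Expected counts under independence, useful for checking the usual
# rule of thumb that they should all be at least 5.
round(hair_eye_test$expected, 1)
```

Inspecting the expected counts is one way to judge whether a warning such as "Chi-squared approximation may be incorrect" is a concern for a given table.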

head(HairEyeColor)

## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

data.df<- as.data.frame(HairEyeColor)
str(data.df)

## 'data.frame':    32 obs. of  4 variables:
##  $ Hair: Factor w/ 4 levels "Black","Brown",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Eye : Factor w/ 4 levels "Brown","Blue",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: num  32 53 10 3 11 50 10 30 10 25 ...
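The data frame holds one row per Hair/Eye/Sex combination, with the count in Freq. Some functions expect one row per observation instead. A possible way to expand the counts, assuming a tidyr version that provides uncount(), is sketched below; people.df is a hypothetical name.

```r
library(dplyr)
library(tidyr)

data.df <- as.data.frame(HairEyeColor)

# Repeat each row Freq times; uncount() drops the Freq column by default.
people.df <- data.df %>% uncount(Freq)

nrow(people.df)        # one row per participant
count(people.df, Sex)  # per-sex totals recovered from the expanded rows
```

The expanded frame has 592 rows, matching the number of cases reported by summary().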

Contingency Tables

With categorical variables, we usually want the frequency of each category, and contingency tables show these frequencies. For example, we may want the total count of female and male participants.

To collapse the data to sex by eye color, we can build a two-way table of those two variables and then compute proportion tables from it.

gendereyemix <- xtabs(Freq ~ Sex + Eye, data.frame(HairEyeColor))
prop.table(gendereyemix, 1) # share of each eye color within each sex (rows sum to 1)

##         Eye
## Sex           Brown       Blue      Hazel      Green
##   Male   0.35125448 0.36200717 0.16845878 0.11827957
##   Female 0.38977636 0.36421725 0.14696486 0.09904153

# share of men and women within each eye color (columns sum to 1)

prop.table(gendereyemix, 2)

##         Eye
## Sex          Brown      Blue     Hazel     Green
##   Male   0.4454545 0.4697674 0.5053763 0.5156250
##   Female 0.5545455 0.5302326 0.4946237 0.4843750

# Number of men and women in the mix

margin.table(gendereyemix, 1)

## Sex
##   Male Female 
##    279    313

# Number of participants per eye color

margin.table(gendereyemix, 2)

## Eye
## Brown  Blue Hazel Green 
##   220   215    93    64

qplot(data = data.df, Eye, Freq, geom="boxplot", color=Sex)

Brown and blue are the most common eye colors for both males and females; together they account for 435 of the 592 participants.

qplot(data = data.df, Hair, Freq, geom="boxplot", color=Sex)

Brown is the most common hair color for both males (143 of 279) and females (143 of 313).
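Each box above summarises the four per-category counts inside a group, so it does not show group totals directly. As an alternative sketch (not the plot used in this document), the counts can be summed first and drawn as side-by-side bars with ggplot(). The names eye_sex and p are illustrative, and the .groups argument assumes dplyr 1.0 or later.

```r
library(ggplot2)
library(dplyr)

data.df <- as.data.frame(HairEyeColor)

# Sum over hair color so there is one count per Eye/Sex combination.
eye_sex <- data.df %>%
  group_by(Eye, Sex) %>%
  summarise(Freq = sum(Freq), .groups = "drop")

# Side-by-side bars: bar height is the number of participants.
p <- ggplot(eye_sex, aes(x = Eye, y = Freq, fill = Sex)) +
  geom_col(position = "dodge") +
  labs(y = "Number of participants")
p
```

Swapping Eye for Hair in group_by() and aes() gives the corresponding chart for hair color.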

Suppose we are interested in males with blue eyes and females with blue eyes, each as a percentage of all participants

B_M<-data.df %>% select(Eye, Sex, Freq) %>%filter(Sex=="Male" & Eye=="Blue") %>% summarise(Male_Blue=sum(Freq))

B_F<-data.df %>% select(Eye, Sex, Freq) %>%filter(Sex=="Female" & Eye=="Blue") %>% summarise(Female_Blue=sum(Freq))

TOT<-data.df %>% summarise(TotH=sum(Freq))

male_blue <-B_M/TOT*100

female_blue<- B_F/TOT*100

male_blue

##   Male_Blue
## 1  17.06081

female_blue

##   Female_Blue
## 1    19.25676
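These percentages are relative to all 592 participants, not to each sex separately. As a short sketch, both views can be read from the Sex by Eye contingency table with prop.table(); the names pct_of_total and pct_within_sex are introduced here for illustration.

```r
gendereyemix <- xtabs(Freq ~ Sex + Eye, data = as.data.frame(HairEyeColor))

# Share of the whole sample, in percent (all cells sum to 100).
pct_of_total <- prop.table(gendereyemix) * 100
pct_of_total[, "Blue"]

# Share within each sex, in percent (each row sums to 100).
pct_within_sex <- prop.table(gendereyemix, 1) * 100
pct_within_sex[, "Blue"]
```

The first result reproduces the male_blue and female_blue values; the second answers the different question of how common blue eyes are among males and among females.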

Density plot of different eye colors

qplot(data=data.df, Eye, geom="density", fill=Eye, alpha=I(0.6))