Tables

Title: Main functions for establishing relationships between categorial variables

Synopsis: This document is aimed at helping you to cross tables data.

Known as the “Live Music Capital of the World,” Austin, Texas is also home to the longest-running music series in American television history, Austin City Limits. This dataset includes data on a sample of musicians that performed live on the PBS television series Austin City Limits over the last 10 years. Data on each artist include measures of commercial popularity, such as the number of social media followers on Twitter or Facebook, and their success in winning a Grammy Music Award.

For further details see: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 - Week 4: Bivariate Distributions (Categorical Data) > Lecture Videos

Loading the Live Music capital of the world dataset

  # Loading the dataset
    acl = read.csv("AustinCityLimits.csv")
  str(acl)

## 'data.frame':    116 obs. of  14 variables:
##  $ Artist       : Factor w/ 116 levels "Aimee Mann","Alabama Shakes",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Year         : int  2008 2013 2009 2009 2007 2009 2010 2009 2003 2008 ...
##  $ Month        : Factor w/ 6 levels "December","February",..: 4 2 3 5 4 4 3 4 3 5 ...
##  $ Season       : Factor w/ 2 levels "fall","winter": 1 2 2 1 1 1 2 1 2 1 ...
##  $ Gender       : Factor w/ 2 levels "F","M": 1 1 2 2 1 2 2 2 2 1 ...
##  $ Age          : int  52 24 75 39 33 62 37 35 43 67 ...
##  $ Age.Group    : Factor w/ 4 levels "Fifties or Older",..: 1 4 1 3 3 1 3 3 2 1 ...
##  $ Grammy       : Factor w/ 2 levels "N","Y": 2 1 1 1 2 2 1 1 2 1 ...
##  $ Genre        : Factor w/ 4 levels "Country","Jazz/Blues",..: 4 3 2 3 3 1 3 3 3 2 ...
##  $ BB.wk.top10  : int  0 1 NA 1 1 0 1 NA 1 0 ...
##  $ Twitter      : int  101870 73313 308634 56343 404439 3326 125758 8197 158647 690 ...
##  $ Twitter.100k : int  1 0 1 0 1 0 1 0 1 0 ...
##  $ Facebook     : int  113576 298278 10721 318313 1711685 27321 563505 18955 1381051 1715 ...
##  $ Facebook.100k: int  1 1 0 1 1 0 1 0 1 0 ...

Table basics

Table of counts

Describe individual categorical variables.

    table(acl$Gender)

## 
##  F  M 
## 35 81

Contingency tables

Show the intersections of characteristics between two categorical variables. Show the characteristics of one categorical variable contingent on another categorical variable.

    Gender.Genre = table(acl$Gender, acl$Genre)
    Gender.Genre

##    
##     Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
##   F       7          6              12                10
##   M      11          7              56                 7

Marginal distribution

Marginal distribution (marginal sums), for each variable represent:

Total columns represents the marginal distribution for the row.
(an aditional column summing up the values of each row)
Total row represents the marginal distribution for the columns. (an aditional row summing up the values of each column)

Counts

    # see all margins
    addmargins(Gender.Genre)

##      
##       Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter Sum
##   F         7          6              12                10  35
##   M        11          7              56                 7  81
##   Sum      18         13              68                17 116

    # see only column margins
    addmargins(Gender.Genre,margin=2)

##    
##     Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter Sum
##   F       7          6              12                10  35
##   M      11          7              56                 7  81

    # See only row margins
    addmargins(Gender.Genre,margin=1)

##      
##       Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
##   F         7          6              12                10
##   M        11          7              56                 7
##   Sum      18         13              68                17

Notice that the values in each of the total group match the individual tables of counts for each of the variables.
So if you add up the total of columns is the same value of adding up the total of rows.
If there is a missing value it will not be included in the contingency table.

Conditional distribution

They are devided in:

Table percentages

Usually ignored because they don´t help to answer a specific question.

    # Proportions
    prop.table(Gender.Genre)

##    
##        Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
##   F 0.06034483 0.05172414      0.10344828        0.08620690
##   M 0.09482759 0.06034483      0.48275862        0.06034483

    # Proportions with margins
    addmargins(prop.table(Gender.Genre))

##      
##          Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter        Sum
##   F   0.06034483 0.05172414      0.10344828        0.08620690 0.30172414
##   M   0.09482759 0.06034483      0.48275862        0.06034483 0.69827586
##   Sum 0.15517241 0.11206897      0.58620690        0.14655172 1.00000000

Row Percentages

These are the conditional distributions of columns by rows. In that case the columns are the Dependent Variable. If you add the percentage of rows they add to 1.

  # Gives the proportion of the table by row (rows add up to 1)
  RowPercentage = prop.table(Gender.Genre,1)
  RowPercentage

##    
##        Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
##   F 0.20000000 0.17142857      0.34285714        0.28571429
##   M 0.13580247 0.08641975      0.69135802        0.08641975

  # Marginal distribution of columns
  MarginalColumnPercentage = margin.table(Gender.Genre, 2)/nrow(acl)
  MarginalColumnPercentage

## 
##           Country        Jazz/Blues   Rock/Folk/Indie Singer-Songwriter 
##         0.1551724         0.1120690         0.5862069         0.1465517

  # Puting together the two tables 
  
  RowMarginalPercentage = rbind(RowPercentage,MarginalColumnPercentage)
  RowMarginalPercentage

##                            Country Jazz/Blues Rock/Folk/Indie
## F                        0.2000000 0.17142857       0.3428571
## M                        0.1358025 0.08641975       0.6913580
## MarginalColumnPercentage 0.1551724 0.11206897       0.5862069
##                          Singer-Songwriter
## F                               0.28571429
## M                               0.08641975
## MarginalColumnPercentage        0.14655172

Column percentages

These are the conditional distributions of row by column. In that case the rows are the Dependent Variable. If you add the percentage of columns they add to 1.

  # Gives the proportion of the table by row (rows add up to 1)
  ColumnPercentage = prop.table(Gender.Genre,2)
  ColumnPercentage

##    
##       Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
##   F 0.3888889  0.4615385       0.1764706         0.5882353
##   M 0.6111111  0.5384615       0.8235294         0.4117647

  # Marginal distribution of Row
  MarginalRowPercentage = margin.table(Gender.Genre, 1)/nrow(acl)
  MarginalRowPercentage

## 
##         F         M 
## 0.3017241 0.6982759

  # Puting together the two tables 
  
  ColumnMarginalPercentage = cbind(ColumnPercentage,MarginalRowPercentage)
  ColumnMarginalPercentage

##     Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 0.3888889  0.4615385       0.1764706         0.5882353
## M 0.6111111  0.5384615       0.8235294         0.4117647
##   MarginalRowPercentage
## F             0.3017241
## M             0.6982759

The Question we want to answer:

Is there a relationship between the GENRE of music and if the singer is a FEMALE?

Relationship between two categorical variables

Steps

From the contingency table decide which variable is best fir to be you outcome (dependent variable). Just like in correlation when we need to determine what the outcome variable is.

In our example lets use GENDER as a dependente variable.

Determine the margin distribution from the outcome variable of interest

 # Marginal distribution of Row
  MarginalRowPercentage = margin.table(Gender.Genre, 1)/nrow(acl)
  MarginalRowPercentage

## 
##         F         M 
## 0.3017241 0.6982759

Determine the conditional distribution for that variable of interest

  # Gives the proportion of the table by row (rows add up to 1)
  ColumnPercentage = prop.table(Gender.Genre,2)
  ColumnPercentage

##    
##       Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
##   F 0.3888889  0.4615385       0.1764706         0.5882353
##   M 0.6111111  0.5384615       0.8235294         0.4117647

Compare these distributions probabilites. To discover if the isa a relationshio see wheter of not the marginal distribulgion holdee when compared to the condicional distribuition of the dependant variable.

  # Puting together the two tables 
  
  ColumnMarginalPercentage = cbind(ColumnPercentage,MarginalRowPercentage)
  ColumnMarginalPercentage

##     Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 0.3888889  0.4615385       0.1764706         0.5882353
## M 0.6111111  0.5384615       0.8235294         0.4117647
##   MarginalRowPercentage
## F             0.3017241
## M             0.6982759

If P(A|B) = P(A) then there is no relationship. If we fail to prove that P(A|B) = P(A) then we have a relationship.

Now we must see how gender behaves in it´s normal state of affairs. How gender look withou any consideration for other categorical variables, who GENDER look in terms of it´s distribution of outcomes withou any consideration of GENRE.

To Answer this we turn to our marginal distributions.

Our maginal distribuition for GENDER shows us what we should expect the normal state of affairs should be of a person beeing male or female if we dind´t consider at all GENRE.

To discover if there is a relationship between our two categorical varaibles, we want to see whether or not this marginal distribution holds up when we compare it to the conditional distribution of our dependente variable GENDER (outcome variable, variable of interest).

When we compare the marginal to the conditional distribution of GENDER we can see that things don´t actually match.

Just looking within the third (rock/folk/indie) GENRE we can see that our condition probability of FEMALE GENDER (0.17) does not match out marginal probabilty of GENDER (.30). It is almost half. We we didnt´take into acount GENRE the rate of GENDER woould be about .30 but if we take into accout GENRE the rate of GENDER IS ABOUT .17. We can se that the likelyhood of A FEMALE sing rock/folk/indie is just .17. Our conditional probability of a FEMALE withi in any of outr GENRES does note match out marginal probability, the normal state of affair if we didn´t take into account GENDER intro consideration.

In this case we can see taht P(GENDER|GENRE) <> P(GENDER) so there is a relationship.

So there is absolutely a relationship between GENRE and GENDER.

Your GENDER is somehow related to the GENRE of music you sing. You might say that the probability of a person be a MALE or FEMALE is driveN by the GENRE that sher/he sings.

If you call a participante it is more likely it is a MALE (.698).

But if you call a particpante to sing a Singer-Songwrite type of music it is more likely it wil be a FEMALE (.588)

But to answer our questions we need visutalizations!

Visualizations

Draw a picture based on the conditional distribution that we choose to be our outcome, our dependent variable.
Since we cqan visualizar both probabilites, the one we want to visualize dependes on our perscpective.
We want to use a graph that matches our condition probability conclusion. The graph taht takes advantage of our variable of interest.
So, although you can choose variables it is always good to chose the one that coincides whith oure varia ble of interest.
Once we have COUNTS or PERCENTAGES of a categorical variab le we wanto to use a bar plot.

Now that we know that there is a relatioship between our two categorical variables by showing the disparity between the marginal and conditional probabilities of an event ocurring, we need to tell the rest of the story.

Although the visualization of our contingency table is pretty straightfoward we need to make sure that our story is told from the correct perspective.

  # Contingency table
  Gender.Genre

##    
##     Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
##   F       7          6              12                10
##   M      11          7              56                 7

We want to use a picture to help the reader along. A picture based on the conditional distribution we chose to use previously, in our case, GENDER.

However we can visulize both conditional probabilities, so the one we choose to use depends on our perspective.

Our perspective of the GENDER.

With that in mind, remember that our variable of interest is GENDER. The likelihood of FEMALE or MALE.

So we need to look at our column percentages (conditional distribution) of GENDER based on GENRE, to make it clear.

Here´s the table with our conditional distribution of GENDER based on GENRE.

 # Gives the proportion of the table by row (rows add up to 1)
  ColumnPercentage = prop.table(Gender.Genre,2)
  ColumnPercentage

##    
##       Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
##   F 0.3888889  0.4615385       0.1764706         0.5882353
##   M 0.6111111  0.5384615       0.8235294         0.4117647

   # Marginal distribution of Row
 # MarginalRowPercentage = margin.table(Gender.Genre, 1)/nrow(acl)
 # MarginalRowPercentage
  
  # ColumnMarginalPercentage = cbind(ColumnPercentage,MarginalRowPercentage)
 # ColumnMarginalPercentage

Let´s look at this from the perspective of GENRE. You choose a GENRE of music, as a Country music fan you wonder:

what´s the likelihood of the singer been a FEMALE? Well, it´s 38.9%.
What´s the likelihood of the singer of Rock/Folk/Indie been a FEMALE? Well, it´s 17.6%, almost half. Ouch!

Here´s the graph of counts of FEMALES/MALE based on GENRE.

 barplot(Gender.Genre, main='Count of Female/Male by Genre', xlab='Genre',ylab = 'Counts', legend=T, beside = T)

Here´s the same graph but of proportions of FEMALE/MALE based on GENRE.

    barplot(prop.table(Gender.Genre,2), main='Proportion of Female/Male by Genre', xlab='Genre',ylab = 'Frequency', legend=T, beside = T)

However this graph is redundant, as the sum of both bars (FEMALE and MALE), for each GENRE add up to 100%.

So to make things clear, we might use this graph - The likelihood of FEMALE singer for each GENRE.

Here´s the graph of counts of FEMALES based on GENRE.

    # create a conditional table only for FEMALES
    FirstRow = Gender.Genre[-2,] # Remove the second row
    
    # Now you can plot the percentage of Females By Genre
    barplot(FirstRow, main='Counts of Female by Genre', xlab='Genre',ylab = 'Counts', legend=F, beside = T, horiz=F)

Here´s the graph of proportions of FEMALES based on GENRE.

    # create a conditional table only for FEMALES
    FirstRow = Gender.Genre[-2,] # Remove the second row
    
    # Now you can plot the percentage of Females By Genre
    barplot(prop.table(FirstRow), main='Proportions of Female by Genre', xlab='Genre',ylab = 'Frequency', legend=F, beside = T, horiz=F)

Reverse conditional probability

see: Week 4: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 - Bivariate Distributions (Categorical Data) > Lecture Videos > Reverse Conditional Probability

Practical examples

Contingecy Table - Table of counts

Counts the number of time an item appears on the data set. e.g. Grammy winners

Case 1 - One single variable

  gtab = table(acl$Grammy)
  gtab

## 
##  N  Y 
## 67 49

Case 2 - Two variables

  gtab2 = table(acl$Grammy, acl$Gender)
  gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

Proportions

You can calculate the proportions instead of the counts above.

By one variable

  gtab

## 
##  N  Y 
## 67 49

  prop.table(gtab)

## 
##         N         Y 
## 0.5775862 0.4224138

By two variables

  gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

  prop.table(gtab2)

##    
##             F         M
##   N 0.1810345 0.3965517
##   Y 0.1206897 0.3017241

By Row (Conditional Probability)

You can read the table below as follows.

Within the No Grammy row 31% were female and 68% were males. Within the Grammy row 28% were female and 71% were males.

  gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

  prop.table(gtab2,1)

##    
##             F         M
##   N 0.3134328 0.6865672
##   Y 0.2857143 0.7142857

By Column (Conditional Probability)

You can read the table below as follows.

Within the Females column 40% won a grammy while 60% did not. Within the Males columns 43% won a grammy while 57% did not

  gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

  prop.table(gtab2,2)

##    
##             F         M
##   N 0.6000000 0.5679012
##   Y 0.4000000 0.4320988

Bar Charts

Bar chart across one variable

Frequency table of one Grammy Variable

  gtab

## 
##  N  Y 
## 67 49

  barplot(gtab, main = 'ACL Grammy Winners', xlab='Grammy Winner',ylab = 'Counts')

Bar chart across two variable

Contingency table of counts across the Grammy and the Gender Variables

Stacked bar chart

  gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

  barplot(gtab2, main='Gender by Grammy Winner', xlab='Grammy Winner',ylab = 'Counts', legend=T)

Side by side bar chart

Using parameter ‘beside=T’

  gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

  barplot(gtab2, main='Gender by Grammy Winner', xlab='Grammy Winner',ylab = 'Counts', legend=T, beside = T)

Relative Frequency stacked bar chart

Using parameter ‘beside=T’

  gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

  barplot(prop.table(gtab2,2), main='Gender by Grammy Winner', xlab='Grammy Winner',ylab = 'Counts', legend=T)

Relative Frequency beside bar chart

Using parameter ‘beside=T’

  gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

  barplot(prop.table(gtab2,2), main='Gender by Grammy Winner', xlab='Grammy Winner',ylab = 'Counts', legend=T, beside = T)

dplyr Function (Tabling by groups)

      #install.packages("plyr")
      library(plyr)
      ddply(mtcars,~cyl, summarise, sumMPG = round(sum(mpg,na.rm=TRUE),2), meanDRAT=round(mean(drat,na.rm=TRUE),2),sdWT=round(sd(wt,na.rm=TRUE),2))

##   cyl sumMPG meanDRAT sdWT
## 1   4  293.3     4.07 0.57
## 2   6  138.2     3.59 0.36
## 3   8  211.4     3.23 0.76