Known as the “Live Music Capital of the World,” Austin, Texas is also home to the longest-running music series in American television history, Austin City Limits. This dataset includes data on a sample of musicians that performed live on the PBS television series Austin City Limits over the last 10 years. Data on each artist include measures of commercial popularity, such as the number of social media followers on Twitter or Facebook, and their success in winning a Grammy Music Award.
For further details see: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 - Week 4: Bivariate Distributions (Categorical Data) > Lecture Videos
# Loading the dataset
acl = read.csv("AustinCityLimits.csv")
str(acl)
## 'data.frame': 116 obs. of 14 variables:
## $ Artist : Factor w/ 116 levels "Aimee Mann","Alabama Shakes",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 2008 2013 2009 2009 2007 2009 2010 2009 2003 2008 ...
## $ Month : Factor w/ 6 levels "December","February",..: 4 2 3 5 4 4 3 4 3 5 ...
## $ Season : Factor w/ 2 levels "fall","winter": 1 2 2 1 1 1 2 1 2 1 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 1 2 2 1 2 2 2 2 1 ...
## $ Age : int 52 24 75 39 33 62 37 35 43 67 ...
## $ Age.Group : Factor w/ 4 levels "Fifties or Older",..: 1 4 1 3 3 1 3 3 2 1 ...
## $ Grammy : Factor w/ 2 levels "N","Y": 2 1 1 1 2 2 1 1 2 1 ...
## $ Genre : Factor w/ 4 levels "Country","Jazz/Blues",..: 4 3 2 3 3 1 3 3 3 2 ...
## $ BB.wk.top10 : int 0 1 NA 1 1 0 1 NA 1 0 ...
## $ Twitter : int 101870 73313 308634 56343 404439 3326 125758 8197 158647 690 ...
## $ Twitter.100k : int 1 0 1 0 1 0 1 0 1 0 ...
## $ Facebook : int 113576 298278 10721 318313 1711685 27321 563505 18955 1381051 1715 ...
## $ Facebook.100k: int 1 1 0 1 1 0 1 0 1 0 ...
Describe individual categorical variables.
table(acl$Gender)
##
## F M
## 35 81
Show the intersections of characteristics between two categorical variables. Show the characteristics of one categorical variable contingent on another categorical variable.
Gender.Genre = table(acl$Gender, acl$Genre)
Gender.Genre
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 7 6 12 10
## M 11 7 56 7
Marginal distribution (marginal sums), for each variable represent:
Counts
# see all margins
addmargins(Gender.Genre)
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter Sum
## F 7 6 12 10 35
## M 11 7 56 7 81
## Sum 18 13 68 17 116
# see only column margins
addmargins(Gender.Genre,margin=2)
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter Sum
## F 7 6 12 10 35
## M 11 7 56 7 81
# See only row margins
addmargins(Gender.Genre,margin=1)
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 7 6 12 10
## M 11 7 56 7
## Sum 18 13 68 17
Notice that the values in each of the total group match the individual tables of counts for each of the variables.
So if you add up the total of columns is the same value of adding up the total of rows.
If there is a missing value it will not be included in the contingency table.
They are devided in:
Table percentages
Usually ignored because they don´t help to answer a specific question.
# Proportions
prop.table(Gender.Genre)
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 0.06034483 0.05172414 0.10344828 0.08620690
## M 0.09482759 0.06034483 0.48275862 0.06034483
# Proportions with margins
addmargins(prop.table(Gender.Genre))
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter Sum
## F 0.06034483 0.05172414 0.10344828 0.08620690 0.30172414
## M 0.09482759 0.06034483 0.48275862 0.06034483 0.69827586
## Sum 0.15517241 0.11206897 0.58620690 0.14655172 1.00000000
Row Percentages
These are the conditional distributions of columns by rows. In that case the columns are the Dependent Variable. If you add the percentage of rows they add to 1.
# Gives the proportion of the table by row (rows add up to 1)
RowPercentage = prop.table(Gender.Genre,1)
RowPercentage
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 0.20000000 0.17142857 0.34285714 0.28571429
## M 0.13580247 0.08641975 0.69135802 0.08641975
# Marginal distribution of columns
MarginalColumnPercentage = margin.table(Gender.Genre, 2)/nrow(acl)
MarginalColumnPercentage
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## 0.1551724 0.1120690 0.5862069 0.1465517
# Puting together the two tables
RowMarginalPercentage = rbind(RowPercentage,MarginalColumnPercentage)
RowMarginalPercentage
## Country Jazz/Blues Rock/Folk/Indie
## F 0.2000000 0.17142857 0.3428571
## M 0.1358025 0.08641975 0.6913580
## MarginalColumnPercentage 0.1551724 0.11206897 0.5862069
## Singer-Songwriter
## F 0.28571429
## M 0.08641975
## MarginalColumnPercentage 0.14655172
Column percentages
These are the conditional distributions of row by column. In that case the rows are the Dependent Variable. If you add the percentage of columns they add to 1.
# Gives the proportion of the table by row (rows add up to 1)
ColumnPercentage = prop.table(Gender.Genre,2)
ColumnPercentage
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 0.3888889 0.4615385 0.1764706 0.5882353
## M 0.6111111 0.5384615 0.8235294 0.4117647
# Marginal distribution of Row
MarginalRowPercentage = margin.table(Gender.Genre, 1)/nrow(acl)
MarginalRowPercentage
##
## F M
## 0.3017241 0.6982759
# Puting together the two tables
ColumnMarginalPercentage = cbind(ColumnPercentage,MarginalRowPercentage)
ColumnMarginalPercentage
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 0.3888889 0.4615385 0.1764706 0.5882353
## M 0.6111111 0.5384615 0.8235294 0.4117647
## MarginalRowPercentage
## F 0.3017241
## M 0.6982759
Is there a relationship between the GENRE of music and if the singer is a FEMALE?
Steps
# Marginal distribution of Row
MarginalRowPercentage = margin.table(Gender.Genre, 1)/nrow(acl)
MarginalRowPercentage
##
## F M
## 0.3017241 0.6982759
# Gives the proportion of the table by row (rows add up to 1)
ColumnPercentage = prop.table(Gender.Genre,2)
ColumnPercentage
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 0.3888889 0.4615385 0.1764706 0.5882353
## M 0.6111111 0.5384615 0.8235294 0.4117647
# Puting together the two tables
ColumnMarginalPercentage = cbind(ColumnPercentage,MarginalRowPercentage)
ColumnMarginalPercentage
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 0.3888889 0.4615385 0.1764706 0.5882353
## M 0.6111111 0.5384615 0.8235294 0.4117647
## MarginalRowPercentage
## F 0.3017241
## M 0.6982759
Now we must see how gender behaves in it´s normal state of affairs. How gender look withou any consideration for other categorical variables, who GENDER look in terms of it´s distribution of outcomes withou any consideration of GENRE.
To Answer this we turn to our marginal distributions.
Our maginal distribuition for GENDER shows us what we should expect the normal state of affairs should be of a person beeing male or female if we dind´t consider at all GENRE.
To discover if there is a relationship between our two categorical varaibles, we want to see whether or not this marginal distribution holds up when we compare it to the conditional distribution of our dependente variable GENDER (outcome variable, variable of interest).
When we compare the marginal to the conditional distribution of GENDER we can see that things don´t actually match.
Just looking within the third (rock/folk/indie) GENRE we can see that our condition probability of FEMALE GENDER (0.17) does not match out marginal probabilty of GENDER (.30). It is almost half. We we didnt´take into acount GENRE the rate of GENDER woould be about .30 but if we take into accout GENRE the rate of GENDER IS ABOUT .17. We can se that the likelyhood of A FEMALE sing rock/folk/indie is just .17. Our conditional probability of a FEMALE withi in any of outr GENRES does note match out marginal probability, the normal state of affair if we didn´t take into account GENDER intro consideration.
In this case we can see taht P(GENDER|GENRE) <> P(GENDER) so there is a relationship.
So there is absolutely a relationship between GENRE and GENDER.
Your GENDER is somehow related to the GENRE of music you sing. You might say that the probability of a person be a MALE or FEMALE is driveN by the GENRE that sher/he sings.
If you call a participante it is more likely it is a MALE (.698).
But if you call a particpante to sing a Singer-Songwrite type of music it is more likely it wil be a FEMALE (.588)
But to answer our questions we need visutalizations!
Now that we know that there is a relatioship between our two categorical variables by showing the disparity between the marginal and conditional probabilities of an event ocurring, we need to tell the rest of the story.
Although the visualization of our contingency table is pretty straightfoward we need to make sure that our story is told from the correct perspective.
# Contingency table
Gender.Genre
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 7 6 12 10
## M 11 7 56 7
We want to use a picture to help the reader along. A picture based on the conditional distribution we chose to use previously, in our case, GENDER.
However we can visulize both conditional probabilities, so the one we choose to use depends on our perspective.
Our perspective of the GENDER.
With that in mind, remember that our variable of interest is GENDER. The likelihood of FEMALE or MALE.
So we need to look at our column percentages (conditional distribution) of GENDER based on GENRE, to make it clear.
Here´s the table with our conditional distribution of GENDER based on GENRE.
# Gives the proportion of the table by row (rows add up to 1)
ColumnPercentage = prop.table(Gender.Genre,2)
ColumnPercentage
##
## Country Jazz/Blues Rock/Folk/Indie Singer-Songwriter
## F 0.3888889 0.4615385 0.1764706 0.5882353
## M 0.6111111 0.5384615 0.8235294 0.4117647
# Marginal distribution of Row
# MarginalRowPercentage = margin.table(Gender.Genre, 1)/nrow(acl)
# MarginalRowPercentage
# ColumnMarginalPercentage = cbind(ColumnPercentage,MarginalRowPercentage)
# ColumnMarginalPercentage
Let´s look at this from the perspective of GENRE. You choose a GENRE of music, as a Country music fan you wonder:
what´s the likelihood of the singer been a FEMALE? Well, it´s 38.9%.
What´s the likelihood of the singer of Rock/Folk/Indie been a FEMALE? Well, it´s 17.6%, almost half. Ouch!
Here´s the graph of counts of FEMALES/MALE based on GENRE.
barplot(Gender.Genre, main='Count of Female/Male by Genre', xlab='Genre',ylab = 'Counts', legend=T, beside = T)
Here´s the same graph but of proportions of FEMALE/MALE based on GENRE.
barplot(prop.table(Gender.Genre,2), main='Proportion of Female/Male by Genre', xlab='Genre',ylab = 'Frequency', legend=T, beside = T)
However this graph is redundant, as the sum of both bars (FEMALE and MALE), for each GENRE add up to 100%.
So to make things clear, we might use this graph - The likelihood of FEMALE singer for each GENRE.
Here´s the graph of counts of FEMALES based on GENRE.
# create a conditional table only for FEMALES
FirstRow = Gender.Genre[-2,] # Remove the second row
# Now you can plot the percentage of Females By Genre
barplot(FirstRow, main='Counts of Female by Genre', xlab='Genre',ylab = 'Counts', legend=F, beside = T, horiz=F)
Here´s the graph of proportions of FEMALES based on GENRE.
# create a conditional table only for FEMALES
FirstRow = Gender.Genre[-2,] # Remove the second row
# Now you can plot the percentage of Females By Genre
barplot(prop.table(FirstRow), main='Proportions of Female by Genre', xlab='Genre',ylab = 'Frequency', legend=F, beside = T, horiz=F)
see: Week 4: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 - Bivariate Distributions (Categorical Data) > Lecture Videos > Reverse Conditional Probability
Counts the number of time an item appears on the data set. e.g. Grammy winners
gtab = table(acl$Grammy)
gtab
##
## N Y
## 67 49
gtab2 = table(acl$Grammy, acl$Gender)
gtab2
##
## F M
## N 21 46
## Y 14 35
You can calculate the proportions instead of the counts above.
gtab
##
## N Y
## 67 49
prop.table(gtab)
##
## N Y
## 0.5775862 0.4224138
gtab2
##
## F M
## N 21 46
## Y 14 35
prop.table(gtab2)
##
## F M
## N 0.1810345 0.3965517
## Y 0.1206897 0.3017241
You can read the table below as follows.
Within the No Grammy row 31% were female and 68% were males. Within the Grammy row 28% were female and 71% were males.
gtab2
##
## F M
## N 21 46
## Y 14 35
prop.table(gtab2,1)
##
## F M
## N 0.3134328 0.6865672
## Y 0.2857143 0.7142857
You can read the table below as follows.
Within the Females column 40% won a grammy while 60% did not. Within the Males columns 43% won a grammy while 57% did not
gtab2
##
## F M
## N 21 46
## Y 14 35
prop.table(gtab2,2)
##
## F M
## N 0.6000000 0.5679012
## Y 0.4000000 0.4320988
Frequency table of one Grammy Variable
gtab
##
## N Y
## 67 49
barplot(gtab, main = 'ACL Grammy Winners', xlab='Grammy Winner',ylab = 'Counts')
Contingency table of counts across the Grammy and the Gender Variables
Stacked bar chart
gtab2
##
## F M
## N 21 46
## Y 14 35
barplot(gtab2, main='Gender by Grammy Winner', xlab='Grammy Winner',ylab = 'Counts', legend=T)
Side by side bar chart
Using parameter ‘beside=T’
gtab2
##
## F M
## N 21 46
## Y 14 35
barplot(gtab2, main='Gender by Grammy Winner', xlab='Grammy Winner',ylab = 'Counts', legend=T, beside = T)
Relative Frequency stacked bar chart
Using parameter ‘beside=T’
gtab2
##
## F M
## N 21 46
## Y 14 35
barplot(prop.table(gtab2,2), main='Gender by Grammy Winner', xlab='Grammy Winner',ylab = 'Counts', legend=T)
Relative Frequency beside bar chart
Using parameter ‘beside=T’
gtab2
##
## F M
## N 21 46
## Y 14 35
barplot(prop.table(gtab2,2), main='Gender by Grammy Winner', xlab='Grammy Winner',ylab = 'Counts', legend=T, beside = T)
dplyr Function (Tabling by groups)
#install.packages("plyr")
library(plyr)
ddply(mtcars,~cyl, summarise, sumMPG = round(sum(mpg,na.rm=TRUE),2), meanDRAT=round(mean(drat,na.rm=TRUE),2),sdWT=round(sd(wt,na.rm=TRUE),2))
## cyl sumMPG meanDRAT sdWT
## 1 4 293.3 4.07 0.57
## 2 6 138.2 3.59 0.36
## 3 8 211.4 3.23 0.76