Learning Objectives

In this lesson students will learn to apply categorical data analysis methods to data sets with fundamentally different structures.

  • Work with individual-level raw data and cross-tabulated data
  • Create univariate tables to show marginal distributions
  • Create two-way tables to show joint and conditional distributions
  • Create bar graphs and assess which type of bar graph is best for a given scenario (stacked, dodged, filled)

The tidyverse package is needed for these examples:

library(tidyverse)

Example #1: Immigration Politics

Nine-hundred and ten (910) randomly sampled registered voters from Tampa, FL were asked if they thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country. The results of the survey by political ideology are shown below.

Source: a question from Introduction to Modern Statistics.

Step 0: Install the package

#install.packages("openintro")
library(openintro)

Step 1: Load the Data

data("immigration")
str(immigration)
## tibble [910 × 2] (S3: tbl_df/tbl/data.frame)
##  $ response : Factor w/ 4 levels "Apply for citizenship",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ political: Factor w/ 3 levels "conservative",..: 1 1 1 1 1 1 1 1 1 1 ...

What order are the factor levels currently in? Check the levels.

## CHECK THE LEVELS 

levels(immigration$political)
## [1] "conservative" "liberal"      "moderate"

Step 2: Re-level categories

By default, R orders the levels of a factor alphabetically, but we might not want that. Here we want the levels to run from conservative to liberal.

immigration$political<-as.character(immigration$political)
immigration$political<-factor(immigration$political, 
                              levels = c("conservative","moderate", "liberal"))
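To see what re-leveling changes, here is a minimal sketch on a small made-up vector. The vector `ideology` and the names `f_default` and `f_ordered` are hypothetical and are not part of the immigration data. The `levels` argument of factor() sets the order that table() and plot axes will use.

```r
# Hypothetical toy vector, used only to illustrate level ordering
ideology <- c("liberal", "conservative", "moderate", "liberal")

# Default: levels are sorted alphabetically
f_default <- factor(ideology)
levels(f_default)

# Explicit order: conservative, moderate, liberal
f_ordered <- factor(ideology,
                    levels = c("conservative", "moderate", "liberal"))
levels(f_ordered)

# table() reports the counts in level order
table(f_ordered)
```

The data values are unchanged; only the order in which the categories are displayed differs.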

In this dataset, each row represents an individual respondent. In the following steps we will learn how to use the table() and prop.table() functions.

Step 3: One-way Table

Create a one-way frequency table for the distribution of political affiliation

Motivating Question 1: What percent of these Tampa, FL voters identify themselves as conservatives?

# Table for Political affiliation
# use table() function
tabPol<-table(immigration$political)
tabPol
## 
## conservative     moderate      liberal 
##          372          363          175

We can also use kable to make tables in R Markdown:

library(knitr)
kable(tabPol, col.names = c('Party', 'Count'),
      caption = "Distribution of Political Identities")

Distribution of Political Identities

Party          Count
conservative     372
moderate         363
liberal          175

Step 4: Relative Frequency Table

We might also want to display proportions.

# the prop.table() function must take a table object
prop.table(tabPol)
## 
## conservative     moderate      liberal 
##    0.4087912    0.3989011    0.1923077
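Motivating Question 1 asks for a percent, so it can help to rescale the proportions. The sketch below assumes the counts printed in the one-way table above and rebuilds the table from them so that it runs without the openintro package. The name `pctPol` is introduced here only for illustration.

```r
# Counts copied from the one-way table above
tabPol <- as.table(c(conservative = 372, moderate = 363, liberal = 175))

# Proportions converted to percentages, rounded to one decimal place
pctPol <- round(100 * prop.table(tabPol), 1)
pctPol
```

Read this way, about 40.9% of the sampled voters identify as conservative.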

Step 5: Univariate Bar Graphs

Let’s visualize this distribution.

A. Simple/Vanilla Bar Graph

# create a graph to display the distribution
ggplot(immigration, aes(x=political))+
  geom_bar()

B. Color

Because bars are two-dimensional, the color aesthetic only outlines them.

What is going on in this graph?

## ADD COLOR
ggplot(immigration, aes(x=political, color=political))+
  geom_bar()

C. Fill

## OOPS! Let's use fill!
ggplot(immigration, aes(x=political, fill=political))+
  geom_bar()

D. Proportions

## CHANGE Y-AXIS TO PROPORTIONS
## after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.4.0)
ggplot(immigration, aes(x=political, fill=political))+
  geom_bar(aes(y = after_stat(count / sum(count))))
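The graph above puts proportions between 0 and 1 on the y-axis. If percent labels are preferred, one possible approach is sketched below. It assumes the scales package (installed as a dependency of ggplot2) and uses after_stat(), the current ggplot2 spelling of the older ..count.. notation. The data frame `voters` is rebuilt from the counts in the one-way table and is a stand-in for the immigration data, not the original object.

```r
library(ggplot2)

# Stand-in data: one row per voter, rebuilt from the one-way table counts
voters <- data.frame(
  political = factor(
    rep(c("conservative", "moderate", "liberal"), times = c(372, 363, 175)),
    levels = c("conservative", "moderate", "liberal")
  )
)

# after_stat() turns the bar counts into proportions;
# label_percent() formats the axis ticks as percentages
p <- ggplot(voters, aes(x = political, fill = political)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(y = "Percent of voters")
p
```

The bar heights are the same proportions as in the relative frequency table; only the axis labels change.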

E. Recipe for a Pie Chart

STEP 1: Make a stacked bar graph.

## STEP 1: MAKE A STACKED-BAR GRAPH
ggplot(immigration, aes(x=1, fill=political))+
  geom_bar()

STEP 2: Use polar coordinates

## PLOT IT IN A CIRCLE
ggplot(immigration, aes(x=1, fill=political))+
  geom_bar()+
  coord_polar("y", start=0)+
  theme_void()

Learning by Doing!

Motivating Question 2: What percent of Tampa, FL voters are in favor of the citizenship option?

In small groups work together to answer the question above, by doing the following tasks.

  1. Make a one-way table for the responses to the citizenship option.
# Table for citizenship response
# use table() function
tabResp<-table(immigration$response)
tabResp
## 
## Apply for citizenship          Guest worker     Leave the country 
##                   278                   262                   350 
##              Not sure 
##                    20
# kable 
kable(tabResp, col.names = c('Response', 'Count'),
      caption = "Distribution of Response to Citizenship")

Distribution of Response to Citizenship

Response                Count
Apply for citizenship     278
Guest worker              262
Leave the country         350
Not sure                   20

  2. Make a relative frequency table.
# use prop.table() 
prop.table(tabResp)
## 
## Apply for citizenship          Guest worker     Leave the country 
##            0.30549451            0.28791209            0.38461538 
##              Not sure 
##            0.02197802
  3. Make a univariate bar graph for the response options.
# create a graph to display the distribution
ggplot(immigration, aes(x=response, fill=response))+
  geom_bar()

  4. Make a stacked bar graph of the response options.
# stacked bar graph
ggplot(immigration, aes(x=1, fill=response))+
  geom_bar()

  5. Make a pie chart.
# pie graph
ggplot(immigration, aes(x=1, fill=response))+
  geom_bar()+
  coord_polar("y", start=0)+
  theme_void()

Step 6. Two-way Tables

Motivating Question 3:

What percent of these Tampa, FL voters identify themselves as conservatives and are in favor of the citizenship option?

## conservative and citizen
# Row then col
tabPolResp<-table(immigration$political, immigration$response)
tabPolResp
##               
##                Apply for citizenship Guest worker Leave the country Not sure
##   conservative                    57          121               179       15
##   moderate                       120          113               126        4
##   liberal                        101           28                45        1
## kable
kable(tabPolResp)

                Apply for citizenship   Guest worker   Leave the country   Not sure
conservative                       57            121                 179         15
moderate                          120            113                 126          4
liberal                           101             28                  45          1
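Row and column totals often make a two-way table easier to read. The sketch below assumes the counts shown in the two-way table above and rebuilds the table from them, so it runs without the openintro package. It then uses base R's addmargins() to append the totals and works out the answer to Motivating Question 3.

```r
# Counts copied from the two-way table above
tabPolResp <- as.table(matrix(
  c( 57, 121, 179, 15,
    120, 113, 126,  4,
    101,  28,  45,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    political = c("conservative", "moderate", "liberal"),
    response  = c("Apply for citizenship", "Guest worker",
                  "Leave the country", "Not sure")
  )
))

# Append row and column totals (labelled "Sum")
withTotals <- addmargins(tabPolResp)
withTotals

# Motivating Question 3: conservatives in favor of citizenship,
# as a percent of all 910 respondents
cell <- tabPolResp["conservative", "Apply for citizenship"]
round(100 * cell / sum(tabPolResp), 1)
```

The 57 conservative respondents who chose the citizenship option make up about 6.3% of the whole sample.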

Step 7: Bar Graphs with two dimensions

A. Stacked Bar Graph

## Stacked Bar Graph (Default)
ggplot(data=immigration, aes(x=political, fill=response))+
  geom_bar()

B. Side-by-side Bar Graph

We can use position adjustments to change the type of bar graph.

## Side-by-side Bar Graph 
ggplot(data=immigration, aes(x=political, fill=response))+
  geom_bar(position="dodge")

Step 8: Types of Distributions

A. Joint Distribution

Definition: The probability distribution over all possible pairs of outcomes of the two variables (here, each combination of political ideology and response).

## Joint Distribution
jointD<-prop.table(tabPolResp)
jointD
##               
##                Apply for citizenship Guest worker Leave the country    Not sure
##   conservative           0.062637363  0.132967033       0.196703297 0.016483516
##   moderate               0.131868132  0.124175824       0.138461538 0.004395604
##   liberal                0.110989011  0.030769231       0.049450549 0.001098901

Note: The sum of any proper distribution is 1.

## NOTE: The sum of any distribution is 1
sum(prop.table(tabPolResp))
## [1] 1

We can also do this with kable.

## kable
kable(round(prop.table(tabPolResp),2))

                Apply for citizenship   Guest worker   Leave the country   Not sure
conservative                     0.06           0.13                0.20       0.02
moderate                         0.13           0.12                0.14       0.00
liberal                          0.11           0.03                0.05       0.00
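Each entry of the joint distribution is a cell count divided by the grand total. The sketch below checks this by hand against prop.table(). It assumes the counts from the two-way table above, and the name `jointManual` is introduced here only for illustration.

```r
# Counts copied from the two-way table above
tabPolResp <- as.table(matrix(
  c(57, 121, 179, 15, 120, 113, 126, 4, 101, 28, 45, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    political = c("conservative", "moderate", "liberal"),
    response  = c("Apply for citizenship", "Guest worker",
                  "Leave the country", "Not sure")
  )
))

# Joint distribution by hand: every cell divided by the grand total
n <- sum(tabPolResp)
jointManual <- tabPolResp / n

# Should agree with prop.table() when no margin is given
all.equal(jointManual, prop.table(tabPolResp))
```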

B. Marginal Distribution

Definition: The probability distribution of one variable on its own, without reference to the values of the other variable.

## Marginal Distribution of Political Affiliation
## Row Sums (ie sum over the cols)
sum(jointD[1,])
## [1] 0.4087912
sum(jointD[2,])
## [1] 0.3989011
sum(jointD[3,])
## [1] 0.1923077
## Observe: This matches with 
prop.table(table(immigration$political))
## 
## conservative     moderate      liberal 
##    0.4087912    0.3989011    0.1923077

Learning by Doing!

How would you find the marginal distribution of response to the citizenship option by using the joint distribution?

## INSERT CODE HERE
## Marginal Distribution of Response
## Col Sums (ie sum over the rows)
sum(jointD[,1])
## [1] 0.3054945
sum(jointD[,2])
## [1] 0.2879121
sum(jointD[,3])
## [1] 0.3846154
sum(jointD[,4])
## [1] 0.02197802
## Observe: This matches with 
prop.table(table(immigration$response))
## 
## Apply for citizenship          Guest worker     Leave the country 
##            0.30549451            0.28791209            0.38461538 
##              Not sure 
##            0.02197802
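Summing one row or column at a time works, but base R can compute all the margins in one call. The sketch below uses rowSums() and colSums() on the joint distribution. It assumes the counts from the two-way table above and rebuilds `jointD` from them so that it runs on its own.

```r
# Counts copied from the two-way table above
tabPolResp <- as.table(matrix(
  c(57, 121, 179, 15, 120, 113, 126, 4, 101, 28, 45, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    political = c("conservative", "moderate", "liberal"),
    response  = c("Apply for citizenship", "Guest worker",
                  "Leave the country", "Not sure")
  )
))
jointD <- prop.table(tabPolResp)

# Marginal distribution of political ideology (sum over the columns)
margPol <- rowSums(jointD)
margPol

# Marginal distribution of response (sum over the rows)
margResp <- colSums(jointD)
margResp
```

Both results should match the one-way relative frequency tables computed earlier.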

C. Conditional Distribution

Definition: A probability distribution that describes the probability of an outcome given the occurrence of a particular event.

Motivating Question 4: What percent of these Tampa, FL voters who identify themselves as conservatives are also in favor of the citizenship option? What percent of moderates share this view? What percent of liberals share this view?

## Conditional Distr (Given the row dim)
## margin = 1 for row
marg1<-prop.table(tabPolResp, margin=1)
marg1
##               
##                Apply for citizenship Guest worker Leave the country    Not sure
##   conservative           0.153225806  0.325268817       0.481182796 0.040322581
##   moderate               0.330578512  0.311294766       0.347107438 0.011019284
##   liberal                0.577142857  0.160000000       0.257142857 0.005714286
## Observe that this is a proper distribution
sum(marg1[1,])
## [1] 1
sum(marg1[2,])
## [1] 1
sum(marg1[3,])
## [1] 1
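A conditional distribution can also be built from the joint and marginal distributions: divide each joint probability by the marginal probability of the row being conditioned on. The sketch below checks that this agrees with prop.table(..., margin = 1). It assumes the counts from the two-way table above, and `condManual` is a name introduced here only for illustration.

```r
# Counts copied from the two-way table above
tabPolResp <- as.table(matrix(
  c(57, 121, 179, 15, 120, 113, 126, 4, 101, 28, 45, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    political = c("conservative", "moderate", "liberal"),
    response  = c("Apply for citizenship", "Guest worker",
                  "Leave the country", "Not sure")
  )
))
jointD <- prop.table(tabPolResp)

# Divide each row of the joint distribution by that row's marginal probability
condManual <- sweep(jointD, 1, rowSums(jointD), "/")

# Should agree with the row-conditional table from prop.table()
all.equal(condManual, prop.table(tabPolResp, margin = 1),
          check.attributes = FALSE)

# Motivating Question 4: percent in favor of citizenship within each ideology
round(100 * condManual[, "Apply for citizenship"], 1)
```

Under these counts, roughly 15.3% of conservatives, 33.1% of moderates, and 57.7% of liberals favor the citizenship option.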

We can visualize conditional distributions with a different position adjustment.

## FILLED BAR GRAPH
ggplot(immigration, aes(x = political, fill=response))+
  geom_bar(position="fill")

Learning by Doing!

Does it make sense to condition in the other direction?

Conditioning on the column dimension answers a different question: if we randomly selected a respondent who supported applying for citizenship, what is the probability that they hold a particular political ideology?

## Conditional Distr (Given the col dim)
## margin = 2 for col
marg2<-prop.table(tabPolResp, margin=2)
marg2
##               
##                Apply for citizenship Guest worker Leave the country  Not sure
##   conservative             0.2050360    0.4618321         0.5114286 0.7500000
##   moderate                 0.4316547    0.4312977         0.3600000 0.2000000
##   liberal                  0.3633094    0.1068702         0.1285714 0.0500000

Your turn!

  1. Check that this conditional distribution is proper.
## INSERT CODE HERE
sum(marg2[,1])
## [1] 1
sum(marg2[,2])
## [1] 1
sum(marg2[,3])
## [1] 1
sum(marg2[,4])
## [1] 1
  2. Make a filled bar graph that shows this conditional distribution.
## INSERT CODE HERE
ggplot(immigration, aes(x = response, fill=political))+
  geom_bar(position="fill")

Step 9: Cross-tabulated Data

Cross-tabulated data show the number of respondents who share each combination of characteristics or demographics. This is a common format for survey data because it greatly reduces the size of a dataset: here 910 rows collapse to 12.

crossTab<-immigration%>%
  group_by(political, response)%>%
  summarise(n=n())
## `summarise()` has grouped output by 'political'. You can override using the
## `.groups` argument.
crossTab
## # A tibble: 12 × 3
## # Groups:   political [3]
##    political    response                  n
##    <fct>        <fct>                 <int>
##  1 conservative Apply for citizenship    57
##  2 conservative Guest worker            121
##  3 conservative Leave the country       179
##  4 conservative Not sure                 15
##  5 moderate     Apply for citizenship   120
##  6 moderate     Guest worker            113
##  7 moderate     Leave the country       126
##  8 moderate     Not sure                  4
##  9 liberal      Apply for citizenship   101
## 10 liberal      Guest worker             28
## 11 liberal      Leave the country        45
## 12 liberal      Not sure                  1
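Data that arrive already cross-tabulated can be turned back into a two-way table, or expanded back to one row per respondent. The sketch below shows one way to do both in base R, with xtabs() and row repetition. It assumes the 12 rows printed above and rebuilds them as a plain data frame. The names `twoWay` and `individual` are introduced here only for illustration.

```r
# The 12 cross-tabulated rows shown above, rebuilt as a plain data frame
crossTab <- data.frame(
  political = factor(
    rep(c("conservative", "moderate", "liberal"), each = 4),
    levels = c("conservative", "moderate", "liberal")
  ),
  response = rep(c("Apply for citizenship", "Guest worker",
                   "Leave the country", "Not sure"), times = 3),
  n = c(57, 121, 179, 15, 120, 113, 126, 4, 101, 28, 45, 1)
)

# Cross-tabulated rows -> two-way table (n supplies the cell counts)
twoWay <- xtabs(n ~ political + response, data = crossTab)
twoWay

# Cross-tabulated rows -> one row per respondent (repeat each row n times)
individual <- crossTab[rep(seq_len(nrow(crossTab)), times = crossTab$n),
                       c("political", "response")]
nrow(individual)
```

Once the data are back in table form, prop.table() can be applied exactly as in the earlier steps.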

Graphics can also be made with cross-tabulated data.

## Now we need to specify the height 
## We are using color here to see how the bars are composed 
ggplot(data = crossTab, aes(x=political, y =n , color=political))+
  geom_bar(stat="identity")

Let’s try fill now.

## Fill with color
ggplot(data = crossTab, aes(x=political, y =n , fill=political))+
  geom_bar(stat="identity")

Make filled bar graphs.

## Filled bar graphs with cross-tab data
ggplot(data = crossTab, aes(x=political, y =n , fill=response))+
  geom_bar(stat="identity", position = "fill")

CAUTION

It’s generally NOT advised to use pie charts to make comparisons across distributions!

## Comparing Pies
ggplot(data = crossTab, aes(x=1, y =n , fill=response))+
  geom_bar(stat="identity", position = "fill")+
  facet_grid(.~political)+
  coord_polar("y", start=0)+
  theme_void()

Example #2: Gender Bias

In 1973, UC Berkeley became “one of the first universities to be sued for sexual discrimination” (the difference in admission rates between male and female applicants was statistically significant).

Step 1: Load the data

## UC Berk
data(UCBAdmissions)
str(UCBAdmissions)
##  'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ Admit : chr [1:2] "Admitted" "Rejected"
##   ..$ Gender: chr [1:2] "Male" "Female"
##   ..$ Dept  : chr [1:6] "A" "B" "C" "D" ...

Step 2: Reformat as data.frame

cal<-as.data.frame(UCBAdmissions)

Step 3: Aggregated Bar Graph

ggplot(cal, aes(x=Gender, y= Freq, fill=Admit))+
  geom_bar(stat = "identity", 
           position="fill")
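The aggregated graph can be backed up with numbers. The sketch below collapses the built-in UCBAdmissions table over departments with margin.table() and then conditions on gender. The names `byGender` and `rateByGender` are introduced here only for illustration.

```r
data(UCBAdmissions)

# Collapse over Dept: total counts for each Admit x Gender combination
byGender <- margin.table(UCBAdmissions, c(1, 2))
byGender

# Condition on Gender (the columns): share admitted within each gender
rateByGender <- prop.table(byGender, margin = 2)
round(rateByGender["Admitted", ], 3)
```

In the aggregate, about 44.5% of male applicants were admitted compared with about 30.4% of female applicants.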

Step 4: Separated by Department

ggplot(cal, aes(x=Gender, y= Freq, fill=Admit))+
  geom_bar(stat = "identity", 
           position="fill")+
  facet_grid(.~Dept)
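The faceted graph can also be summarized as a table of admission rates. The sketch below conditions on both gender and department by passing two margins to prop.table(). The name `rateByDept` is introduced here only for illustration.

```r
data(UCBAdmissions)

# Condition on Gender and Dept together, then keep the "Admitted" slice:
# each entry is the admission rate for one gender in one department
rateByDept <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
round(rateByDept, 2)

# Number of departments where women were admitted at a higher rate than men
sum(rateByDept["Female", ] > rateByDept["Male", ])
```

Within departments the pattern reverses: women have the higher admission rate in four of the six departments, even though their aggregate rate is lower.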

Simpson’s Paradox

How does this happen?

“The simple explanation is that women tended to apply to the departments that are the hardest to get into, and men tended to apply to departments that were easier to get into. (Humanities departments tended to have less research funding to support graduate students, while science and engineer departments were awash with money.) So women were rejected more than men. Presumably, the bias wasn’t at Berkeley but earlier in women’s education, when other biases led them to different fields of study than men.”
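The explanation in the quote can be checked against the data. The sketch below looks at where each gender applied and how selective each department was. It uses only the built-in UCBAdmissions table, and the names `apps`, `shareOfApps`, and `deptRate` are introduced here only for illustration.

```r
data(UCBAdmissions)

# Applicants by Gender and Dept (collapse over Admit)
apps <- margin.table(UCBAdmissions, c(2, 3))

# Within each gender, the share of applications going to each department
shareOfApps <- prop.table(apps, margin = 1)
round(shareOfApps, 2)

# Overall admission rate of each department (both genders combined)
deptRate <- prop.table(margin.table(UCBAdmissions, c(1, 3)),
                       margin = 2)["Admitted", ]
round(deptRate, 2)
```

Departments A and B have the highest admission rates and received about half of the men's applications but fewer than one in ten of the women's, which is consistent with the explanation quoted above.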