Learning Objectives

In this lesson students will learn to apply categorical data analysis methods to data sets with fundamentally different structures.

  • Work with individual-level (raw) data and cross-tabulated data
  • Create univariate tables to show marginal distributions
  • Create two-way tables to show joint and conditional distributions
  • Create bar graphs and assess which type of bar graph is best for a given scenario (stacked, dodged, filled)

The tidyverse package is needed for these examples.

library(tidyverse)

Example: Happy Little Trees!

Bob Ross, known for his PBS show The Joy of Painting, was a skilled teacher who guided viewers through creating iconic landscapes like “happy trees,” “almighty mountains,” and “fluffy clouds.”

Over his 11-year career, he painted 381 works, using a consistent set of themes and elements. This large collection of data serves as a foundation for exploring statistical concepts, specifically conditional probability and clustering.

Motivation: https://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/

Step 0: Call the library

library(tidyverse)

Step 1: Load the Data

## BOB ROSS DATA FROM FIVETHIRTYEIGHT
bob<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/ProjectData/elements-by-episode.csv")

Step 2: Data Structure

Look at the structure of the data.

## STRUCTURE 
str(bob)
## 'data.frame':    403 obs. of  69 variables:
##  $ EPISODE           : chr  "S01E01" "S01E02" "S01E03" "S01E04" ...
##  $ TITLE             : chr  "\"A WALK IN THE WOODS\"" "\"MT. MCKINLEY\"" "\"EBONY SUNSET\"" "\"WINTER MIST\"" ...
##  $ APPLE_FRAME       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AURORA_BOREALIS   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BARN              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BEACH             : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ BOAT              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BRIDGE            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BUILDING          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BUSHES            : int  1 0 0 1 0 0 0 1 0 1 ...
##  $ CABIN             : int  0 1 1 0 0 1 0 0 0 0 ...
##  $ CACTUS            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CIRCLE_FRAME      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CIRRUS            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CLIFF             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CLOUDS            : int  0 1 0 1 0 0 0 0 1 0 ...
##  $ CONIFER           : int  0 1 1 1 0 1 0 1 0 1 ...
##  $ CUMULUS           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DECIDUOUS         : int  1 0 0 0 1 0 1 0 0 1 ...
##  $ DIANE_ANDRE       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DOCK              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DOUBLE_OVAL_FRAME : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FARM              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FENCE             : int  0 0 1 0 0 0 0 0 1 0 ...
##  $ FIRE              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FLORIDA_FRAME     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FLOWERS           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FOG               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FRAMED            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GRASS             : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ GUEST             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HALF_CIRCLE_FRAME : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HALF_OVAL_FRAME   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HILLS             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LAKE              : int  0 0 0 1 0 1 1 1 0 1 ...
##  $ LAKES             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LIGHTHOUSE        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MILL              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MOON              : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ MOUNTAIN          : int  0 1 1 1 0 1 1 1 0 1 ...
##  $ MOUNTAINS         : int  0 0 1 0 0 1 1 1 0 0 ...
##  $ NIGHT             : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ OCEAN             : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ OVAL_FRAME        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PALM_TREES        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PATH              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PERSON            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PORTRAIT          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RECTANGLE_3D_FRAME: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RECTANGULAR_FRAME : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RIVER             : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ ROCKS             : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ SEASHELL_FRAME    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SNOW              : int  0 1 0 0 0 1 0 0 0 0 ...
##  $ SNOWY_MOUNTAIN    : int  0 1 0 1 0 1 1 0 0 0 ...
##  $ SPLIT_FRAME       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ STEVE_ROSS        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ STRUCTURE         : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ SUN               : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ TOMB_FRAME        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TREE              : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ TREES             : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ TRIPLE_FRAME      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ WATERFALL         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ WAVES             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ WINDMILL          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ WINDOW_FRAME      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ WINTER            : int  0 1 1 0 0 1 0 0 0 0 ...
##  $ WOOD_FRAMED       : int  0 0 0 0 0 0 0 0 0 0 ...

In this dataset, each row represents an individual painting (one per episode). In the following example we will learn how to use the table() and prop.table() functions.

Step 3: One-way Table

Create a one-way frequency table for paintings with “happy little trees”.

Motivating Question 1: What percent of paintings contain “happy little trees”?

# Table for "happy little trees"
# use table() function
tabTrees<-table(bob$TREES)
tabTrees
## 
##   0   1 
##  66 337
# WHAT DO 0 AND 1 MEAN?

We can also use kable to make tables in R markdown:

library(knitr)
kable(tabTrees, col.names = c('Trees', 'Count'),
      caption = "Distribution of Paintings with Happy Little Trees")
Table: Distribution of Paintings with Happy Little Trees

|Trees | Count|
|:-----|-----:|
|0     |    66|
|1     |   337|

Step 4: Relative Frequency Table

We might also want to display proportions.

# the prop.table() function must take a table object
prop.table(tabTrees)
## 
##         0         1 
## 0.1637717 0.8362283
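
To answer Motivating Question 1 as a percent, the proportions can be rescaled and rounded. The sketch below rebuilds the 0/1 column from the counts printed above (66 and 337) instead of downloading the file, so `treesToy`, `tabToy`, and `pctToy` are illustrative names and not part of the lesson's data.

```r
# Rebuild the 0/1 indicator from the counts shown above
# (assumption: 66 paintings coded 0 and 337 coded 1)
treesToy <- rep(c(0, 1), times = c(66, 337))

# One-way frequency table, then proportions rescaled to percents
tabToy <- table(treesToy)
pctToy <- round(100 * prop.table(tabToy), 1)
pctToy

# For a 0/1 indicator, the mean equals the proportion of 1s
mean(treesToy)
```

Rounded to one decimal place, the two percents are 16.4 and 83.6.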

Step 5: Univariate Bar Graphs

Let’s visualize this distribution.

A. Simple/Vanilla Bar Graph

A simple bar graph:

# create a graph to display the distribution
ggplot(data=bob, aes(x=TREES))+
  geom_bar()

B. Color

Since bars are two-dimensional, the color aesthetic only outlines the bars.

What is going on in this graph?

## ADD COLOR
bob$TREES<-as.factor(bob$TREES)

ggplot(data=bob, aes(x=TREES, color=TREES))+
  geom_bar()

C. Fill

## OOPS! Let's use fill!
ggplot(data=bob, aes(x=TREES, fill=TREES))+
  geom_bar()

### LET'S TRY DIFFERENT COLORS
treePal<-c("brown", "forestgreen")

ggplot(data=bob, aes(x=TREES, fill=TREES))+
  geom_bar()+
  scale_fill_manual(values=treePal)

D. Proportions

## CHANGE Y-AXIS TO PROPORTION
## after_stat() replaces the deprecated ..count.. notation
ggplot(data=bob, aes(x=TREES, fill=TREES))+
  geom_bar(aes(y = after_stat(count/sum(count))))+
  ylab("Proportion")
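
If the axis should read as percents rather than proportions, one option is to relabel the scale. This is a minimal sketch on a toy copy of the TREES column rebuilt from the Step 3 counts; `toyTrees` and `pPct` are made-up names, and the plot is written with `after_stat()`, the current ggplot2 notation for computed variables. It assumes the scales package is available, which is normally installed alongside ggplot2.

```r
library(ggplot2)

# Toy copy of the TREES column, rebuilt from the counts in Step 3 (assumption)
toyTrees <- data.frame(TREES = factor(rep(c(0, 1), times = c(66, 337))))

# Bar heights are each bar's share of all paintings;
# scales::percent formats the axis labels as percents
pPct <- ggplot(toyTrees, aes(x = TREES, fill = TREES)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percent of paintings")
pPct
```

Only the axis labels change here; the bar heights are still proportions between 0 and 1.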

E. Recipe for a Pie Chart

STEP 1: Make a stacked bar graph.

## STEP 1: MAKE A STACKED-BAR GRAPH
ggplot(bob, aes(x=1, fill=TREES))+
  geom_bar()

STEP 2: Use polar coordinates

## PLOT IT IN A CIRCLE
ggplot(bob, aes(x=1, fill=TREES))+
  geom_bar()+
  coord_polar("y", start=0)+
  theme_void()

Learning by Doing!

Motivating Question 2: What percent of paintings contain “almighty mountains”?

In small groups work together to answer the question above, by doing the following tasks.

  1. Make a one-way table for mountains
# Table for mountains
# use table() function
tabMnt<-table(bob$MOUNTAINS)
tabMnt
## 
##   0   1 
## 304  99
# kable 
kable(tabMnt, col.names = c('MOUNTAINS', 'Count'),
      caption = "Distribution of Paintings with Almighty Mountains")
Table: Distribution of Paintings with Almighty Mountains

|MOUNTAINS | Count|
|:---------|-----:|
|0         |   304|
|1         |    99|
  2. Make a relative frequency table.
# use prop.table() 
prop.table(tabMnt)
## 
##         0         1 
## 0.7543424 0.2456576
  3. Make a univariate bar graph for mountains.
bob$MOUNTAINS<-as.factor(bob$MOUNTAINS)

# create a graph to display the distribution
ggplot(bob, aes(x=MOUNTAINS, fill=MOUNTAINS))+
  geom_bar()

  4. Make a stacked bar graph for mountains.
# stacked bar graph
ggplot(bob, aes(x=1, fill=MOUNTAINS))+
  geom_bar()

  5. Make a pie chart.
# pie graph
ggplot(bob, aes(x=1, fill=MOUNTAINS))+
  geom_bar()+
  coord_polar("y", start=0)+
  theme_void()

Step 6. Two-way Tables

Motivating Question 3:

What percent of paintings have both “happy little trees” and “almighty mountains”?

## trees and mountains
# Row then col
tabTreesMnt<-table(bob$TREES, bob$MOUNTAINS)
tabTreesMnt
##    
##       0   1
##   0  61   5
##   1 243  94
## kable
kable(tabTreesMnt)
|   |   0|  1|
|:--|---:|--:|
|0  |  61|  5|
|1  | 243| 94|

Now that we have two dimensions this is getting confusing. Maybe we should make labels.

bob<-bob%>%
  mutate(treeLab=case_when(
    TREES==0 ~ "No Trees", 
    TREES==1 ~ "Happy Little Trees"
  ),
  mntLab=case_when(
    MOUNTAINS==0 ~ "No Mnts", 
    MOUNTAINS==1 ~ "Almighty Mountains"
  )
  )
  
tabTreesMnt2<-table(bob$treeLab, bob$mntLab)
tabTreesMnt2
##                     
##                      Almighty Mountains No Mnts
##   Happy Little Trees                 94     243
##   No Trees                            5      61
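
Notice that the labeled table is ordered alphabetically, so its rows and columns no longer line up with the 0/1 table. One way to control the order is to store the labels as factors with explicit levels. The sketch below works on toy data rebuilt from the two-way counts above; `toy` and `tabOrdered` are illustrative names, and the level order chosen here (0 first) is just one possibility.

```r
# Rebuild the 403 paintings from the two-way counts above (assumption: toy data)
toy <- data.frame(
  TREES     = rep(c(0, 0, 1, 1), times = c(61, 5, 243, 94)),
  MOUNTAINS = rep(c(0, 1, 0, 1), times = c(61, 5, 243, 94))
)

# factor() with explicit levels and labels fixes the display order,
# here matching the layout of the original 0/1 table
toy$treeLab <- factor(toy$TREES, levels = c(0, 1),
                      labels = c("No Trees", "Happy Little Trees"))
toy$mntLab  <- factor(toy$MOUNTAINS, levels = c(0, 1),
                      labels = c("No Mnts", "Almighty Mountains"))

tabOrdered <- table(toy$treeLab, toy$mntLab)
tabOrdered
```

The same factor order also determines the order of bars and legend entries in ggplot.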

Step 7: Bar Graphs with two dimensions

A. Stacked Bar Graph

## Stacked Bar Graph (Default)
ggplot(data=bob, aes(x=treeLab, fill=mntLab))+
  geom_bar()

B. Side-by-side Bar Graph

We can use position adjustments to change the type of bar graph.

## Side-by-side Bar Graph 
ggplot(data=bob, aes(x=treeLab, fill=mntLab))+
  geom_bar(position="dodge")

Step 8: Types of Distributions

A. Joint Distribution

Definition: The probability distribution over all possible pairs of outcomes of the two variables.

## Joint Distribution
jointD<-prop.table(tabTreesMnt2)
jointD
##                     
##                      Almighty Mountains    No Mnts
##   Happy Little Trees         0.23325062 0.60297767
##   No Trees                   0.01240695 0.15136476

Note: The sum of any proper distribution is 1.

## NOTE: The sum of any distribution is 1
sum(prop.table(tabTreesMnt2))
## [1] 1

We can also do this with kable.

## kable
kable(round(prop.table(tabTreesMnt2),2))
|                   | Almighty Mountains| No Mnts|
|:------------------|------------------:|-------:|
|Happy Little Trees |               0.23|    0.60|
|No Trees           |               0.01|    0.15|

B. Marginal Distribution

Definition: Gives the probabilities of various values of the variable without reference to the values of the other variable.

## Marginal Distribution of Happy Little Trees
## Row Sums (ie sum over the cols)
sum(jointD[1,])
## [1] 0.8362283
sum(jointD[2,])
## [1] 0.1637717
## Observe: This matches with 
prop.table(table(bob$treeLab))
## 
## Happy Little Trees           No Trees 
##          0.8362283          0.1637717

Learning by Doing!

How would you find the marginal distribution for mountains in paintings by using the joint distribution?

## INSERT CODE HERE
## Marginal Distribution of Mountains
## Col Sums (ie sum over the rows)
sum(jointD[,1])
## [1] 0.2456576
sum(jointD[,2])
## [1] 0.7543424
## Observe: This matches with 
prop.table(table(bob$mntLab))
## 
## Almighty Mountains            No Mnts 
##          0.2456576          0.7543424
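
Summing one row or column at a time works, but base R can also compute all the margins in one call. The following sketch types in the two-way counts from Step 6 by hand (an assumption made so the example runs on its own); `counts` and `jointToy` are illustrative names.

```r
# Two-way counts from Step 6 (rows: trees, columns: mountains)
counts <- matrix(c(94, 243,
                    5,  61),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Happy Little Trees", "No Trees"),
                                 c("Almighty Mountains", "No Mnts")))
jointToy <- prop.table(as.table(counts))

rowSums(jointToy)     # marginal distribution of trees
colSums(jointToy)     # marginal distribution of mountains
addmargins(jointToy)  # joint table with both margins and the grand total appended
```

The name "marginal" comes from this layout: the sums sit in the margins of the joint table.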

C. Conditional Distribution

Definition: A probability distribution that describes the probability of an outcome given the occurrence of a particular event.

Motivating Question 4: What percent of paintings with “happy little trees” also have “almighty mountains”?

## Conditional Distr (Given the row dim)
## margin = 1 for row
marg1<-prop.table(tabTreesMnt2, margin=1)
marg1
##                     
##                      Almighty Mountains    No Mnts
##   Happy Little Trees         0.27893175 0.72106825
##   No Trees                   0.07575758 0.92424242
## Observe that this is a proper distribution
sum(marg1[1,])
## [1] 1
sum(marg1[2,])
## [1] 1
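
A conditional distribution is the joint distribution divided by the marginal of the variable being conditioned on, for example P(mountains | trees) = P(mountains and trees) / P(trees). The sketch below checks this against `prop.table(..., margin=1)` using the counts from Step 6, typed in by hand; `counts`, `jointToy`, and `condRow` are illustrative names.

```r
# Two-way counts from Step 6 (rows: trees, columns: mountains)
counts <- as.table(matrix(c(94, 243,
                             5,  61),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(c("Happy Little Trees", "No Trees"),
                                          c("Almighty Mountains", "No Mnts"))))
jointToy <- prop.table(counts)

# Divide each row of the joint table by that row's marginal probability
condRow <- sweep(jointToy, MARGIN = 1, STATS = rowSums(jointToy), FUN = "/")
condRow

# Same numbers as conditioning on the row dimension directly
all.equal(as.vector(condRow), as.vector(prop.table(counts, margin = 1)))
```

For the first cell, this is (94/403) / (337/403) = 94/337, which is about 0.279.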

We can visualize conditional distributions with a different position adjustment.

## FILLED BAR GRAPH
ggplot(bob, aes(x = treeLab, fill=mntLab))+
  geom_bar(position="fill")

Learning by Doing!

Does it make sense to condition in the other direction?

Conditioning on the column dimension answers a question like this: if we randomly select a painting that contains mountains, what is the probability that it also contains trees?

## Conditional Distr (Given the col dim)
## margin = 2 for col
marg2<-prop.table(tabTreesMnt2, margin=2)
marg2
##                     
##                      Almighty Mountains    No Mnts
##   Happy Little Trees         0.94949495 0.79934211
##   No Trees                   0.05050505 0.20065789

Your turn!

  1. Check that this conditional distribution is proper.
## INSERT CODE HERE
sum(marg2[,1])
## [1] 1
sum(marg2[,2])
## [1] 1
  2. Make a filled bar graph that shows this conditional distribution.
## INSERT CODE HERE
ggplot(bob, aes(x = mntLab, fill=treeLab))+
  geom_bar(position="fill")

Step 9: Cross-tabulated Data

Cross-tabulated data show the number of observations (for example, survey respondents) that share a combination of characteristics or demographics. This is a common format for survey data, and it greatly reduces the size of a dataset.

crossTab<-bob%>%
  group_by(treeLab, mntLab)%>%
  summarise(n=n())
## `summarise()` has grouped output by 'treeLab'. You can override using the
## `.groups` argument.
crossTab
## # A tibble: 4 × 3
## # Groups:   treeLab [2]
##   treeLab            mntLab                 n
##   <chr>              <chr>              <int>
## 1 Happy Little Trees Almighty Mountains    94
## 2 Happy Little Trees No Mnts              243
## 3 No Trees           Almighty Mountains     5
## 4 No Trees           No Mnts               61
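
It can help to see that the two data structures carry the same information. As a sketch, the cross-tabulated counts printed above are typed in by hand below (an assumption, so the example runs without the download) and then converted in both directions with base R; `crossToy`, `tabFromCross`, and `rawToy` are illustrative names.

```r
# Cross-tabulated counts as printed above (assumption: typed in by hand)
crossToy <- data.frame(
  treeLab = c("Happy Little Trees", "Happy Little Trees", "No Trees", "No Trees"),
  mntLab  = c("Almighty Mountains", "No Mnts", "Almighty Mountains", "No Mnts"),
  n       = c(94, 243, 5, 61)
)

# Cross-tab -> two-way table: xtabs() sums n within each combination
tabFromCross <- xtabs(n ~ treeLab + mntLab, data = crossToy)
tabFromCross

# Cross-tab -> individual-level rows: repeat each row n times
rawToy <- crossToy[rep(seq_len(nrow(crossToy)), times = crossToy$n),
                   c("treeLab", "mntLab")]
nrow(rawToy)
```

Once the data are in table form, `prop.table()` works on `tabFromCross` exactly as it did on the table built from the raw data.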

Graphics can also be made with cross-tabulated data.

## Now we need to specify the height 
## We are using color here to see how the bars are composed 
ggplot(data = crossTab, aes(x=treeLab, y =n , color=treeLab))+
  geom_bar(stat="identity")

Let’s try fill now.

## Fill with color
ggplot(data = crossTab, aes(x=treeLab, y =n , fill=treeLab))+
  geom_bar(stat="identity")

Make filled bar graphs.

## Filled bar graphs with cross-tab data
ggplot(data = crossTab, aes(x=treeLab, y =n , fill=mntLab))+
  geom_bar(stat="identity", position = "fill")

CAUTION

It’s generally NOT advised to use pie charts to make comparisons across distributions!

## Comparing Pies
ggplot(data = crossTab, aes(x=1, y =n , fill=mntLab))+
  geom_bar(stat="identity", position = "fill")+
  facet_grid(.~treeLab)+
  coord_polar("y", start=0)+
  theme_void()

Example #2: Gender Bias

In 1973, UC Berkeley became “one of the first universities to be sued for sexual discrimination”; the difference between the admission rates of men and women was statistically significant.

Step 1: Load the data

## UC Berk
data(UCBAdmissions)
str(UCBAdmissions)
##  'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ Admit : chr [1:2] "Admitted" "Rejected"
##   ..$ Gender: chr [1:2] "Male" "Female"
##   ..$ Dept  : chr [1:6] "A" "B" "C" "D" ...

Step 2: Reformat as data.frame

cal<-as.data.frame(UCBAdmissions)

Step 3: Aggregated Bar Graph

Filled bar graph, aggregated over all departments:

ggplot(cal, aes(x=Gender, y= Freq, fill=Admit))+
  geom_bar(stat = "identity", 
           position="fill")

Step 4: Separated by Department

Faceted:

ggplot(cal, aes(x=Gender, y= Freq, fill=Admit))+
  geom_bar(stat = "identity", 
           position="fill")+
  facet_grid(.~Dept)

Simpson’s Paradox

How does this happen?

“The simple explanation is that women tended to apply to the departments that are the hardest to get into, and men tended to apply to departments that were easier to get into. (Humanities departments tended to have less research funding to support graduate students, while science and engineer departments were awash with money.) So women were rejected more than men. Presumably, the bias wasn’t at Berkeley but earlier in women’s education, when other biases led them to different fields of study than men.”
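
The reversal can be checked numerically with the table functions from the first example. The sketch below uses the built-in UCBAdmissions table; `agg`, `aggRate`, `deptRate`, and `appShare` are illustrative names, not part of the original lesson.

```r
data(UCBAdmissions)

# Aggregated over departments: admission rate within each gender
agg     <- margin.table(UCBAdmissions, margin = c(1, 2))
aggRate <- prop.table(agg, margin = 2)["Admitted", ]
round(aggRate, 3)

# Within each department: admission rate by gender
deptRate <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
round(deptRate, 2)

# Share of each gender's applications that went to each department
appShare <- prop.table(margin.table(UCBAdmissions, margin = c(2, 3)), margin = 1)
round(appShare, 2)
```

In the aggregate, about 44.5% of men and 30.4% of women were admitted, yet within departments the admission rate for women is higher in four of the six (A, B, D, and F). The `appShare` table shows how the applications of each gender were spread across departments, which is the piece the quote above points to.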