In this lesson students will learn to apply categorical data analysis methods to data sets with fundamentally different structures.
The tidyverse package is needed for these examples.
library(tidyverse)
Bob Ross, known for his PBS show The Joy of Painting, was a skilled teacher who guided viewers through creating iconic landscapes like “happy trees,” “almighty mountains,” and “fluffy clouds.”
Over his 11-year career, he painted 381 works, using a consistent set of themes and elements. This large collection of data serves as a foundation for exploring statistical concepts, specifically conditional probability and clustering.
Motivation: https://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/
library(tidyverse)
## BOB ROSS DATA FROM FIVETHIRTYEIGHT
bob<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/ProjectData/elements-by-episode.csv")
Look at the structure of the data.
## STRUCTURE
str(bob)
## 'data.frame': 403 obs. of 69 variables:
## $ EPISODE : chr "S01E01" "S01E02" "S01E03" "S01E04" ...
## $ TITLE : chr "\"A WALK IN THE WOODS\"" "\"MT. MCKINLEY\"" "\"EBONY SUNSET\"" "\"WINTER MIST\"" ...
## $ APPLE_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AURORA_BOREALIS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BARN : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BEACH : int 0 0 0 0 0 0 0 0 1 0 ...
## $ BOAT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BRIDGE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BUILDING : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BUSHES : int 1 0 0 1 0 0 0 1 0 1 ...
## $ CABIN : int 0 1 1 0 0 1 0 0 0 0 ...
## $ CACTUS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CIRCLE_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CIRRUS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CLIFF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CLOUDS : int 0 1 0 1 0 0 0 0 1 0 ...
## $ CONIFER : int 0 1 1 1 0 1 0 1 0 1 ...
## $ CUMULUS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ DECIDUOUS : int 1 0 0 0 1 0 1 0 0 1 ...
## $ DIANE_ANDRE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ DOCK : int 0 0 0 0 0 0 0 0 0 0 ...
## $ DOUBLE_OVAL_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FARM : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FENCE : int 0 0 1 0 0 0 0 0 1 0 ...
## $ FIRE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FLORIDA_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FLOWERS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FOG : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FRAMED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GRASS : int 1 0 0 0 0 0 0 0 0 0 ...
## $ GUEST : int 0 0 0 0 0 0 0 0 0 0 ...
## $ HALF_CIRCLE_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ HALF_OVAL_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ HILLS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LAKE : int 0 0 0 1 0 1 1 1 0 1 ...
## $ LAKES : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LIGHTHOUSE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MILL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MOON : int 0 0 0 0 0 1 0 0 0 0 ...
## $ MOUNTAIN : int 0 1 1 1 0 1 1 1 0 1 ...
## $ MOUNTAINS : int 0 0 1 0 0 1 1 1 0 0 ...
## $ NIGHT : int 0 0 0 0 0 1 0 0 0 0 ...
## $ OCEAN : int 0 0 0 0 0 0 0 0 1 0 ...
## $ OVAL_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PALM_TREES : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PATH : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PERSON : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PORTRAIT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RECTANGLE_3D_FRAME: int 0 0 0 0 0 0 0 0 0 0 ...
## $ RECTANGULAR_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RIVER : int 1 0 0 0 1 0 0 0 0 0 ...
## $ ROCKS : int 0 0 0 0 1 0 0 0 0 0 ...
## $ SEASHELL_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SNOW : int 0 1 0 0 0 1 0 0 0 0 ...
## $ SNOWY_MOUNTAIN : int 0 1 0 1 0 1 1 0 0 0 ...
## $ SPLIT_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ STEVE_ROSS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ STRUCTURE : int 0 0 1 0 0 1 0 0 0 0 ...
## $ SUN : int 0 0 1 0 0 0 0 0 0 0 ...
## $ TOMB_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TREE : int 1 1 1 1 1 1 1 1 0 1 ...
## $ TREES : int 1 1 1 1 1 1 1 1 0 1 ...
## $ TRIPLE_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WATERFALL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WAVES : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WINDMILL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WINDOW_FRAME : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WINTER : int 0 1 1 0 0 1 0 0 0 0 ...
## $ WOOD_FRAMED : int 0 0 0 0 0 0 0 0 0 0 ...
In this dataset the rows represent individual paintings. In the following example we will learn how to use the table() and prop.table() functions.
Create a one-way frequency table for paintings with “happy little trees”.
Motivating Question 1: What percent of paintings contain “happy little trees”?
# Table for "happy little trees"
# use table() function
tabTrees<-table(bob$TREES)
tabTrees
##
## 0 1
## 66 337
# WHAT DO 0 AND 1 MEAN?
We can also use kable to make tables in R Markdown:
library(knitr)
kable(tabTrees, col.names = c('Trees', 'Count'),
caption = "Distribution of Paintings with Happy Little Trees")
Trees | Count |
---|---|
0 | 66 |
1 | 337 |
We might also want to display proportions.
# the prop.table() function must take a table object
prop.table(tabTrees)
##
## 0 1
## 0.1637717 0.8362283
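To answer Motivating Question 1 directly as a percent, the proportions can be multiplied by 100 and rounded. The sketch below rebuilds the one-way table from the counts printed above rather than reading the CSV, so it runs on its own; the name pctTrees is only an illustrative choice.

```r
# Counts copied from the table() output above (0 = no trees, 1 = trees)
tabTrees <- as.table(c("0" = 66, "1" = 337))

# Convert proportions to percents, rounded to one decimal place
pctTrees <- round(100 * prop.table(tabTrees), 1)
pctTrees

# About 83.6 percent of the paintings contain trees
pctTrees["1"]
```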
Let’s visualize this distribution.
A simple bar graph:
# create a graph to display the distribution
ggplot(data=bob, aes(x=TREES))+
geom_bar()
Because bars are two-dimensional, the color aesthetic only outlines them; the fill aesthetic colors their interiors.
What is going on in this graph?
## ADD COLOR
bob$TREES<-as.factor(bob$TREES)
ggplot(data=bob, aes(x=TREES, color=TREES))+
geom_bar()
## OOPS! Let's use fill!
ggplot(data=bob, aes(x=TREES, fill=TREES))+
geom_bar()
### LET'S TRY DIFFERENT COLORS
treePal<-c("brown", "forestgreen")
ggplot(data=bob, aes(x=TREES, fill=TREES))+
geom_bar()+
scale_fill_manual(values=treePal)
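## CHANGE Y-AXIS TO PERCENT
## after_stat() replaces the ..count.. notation, which is deprecated in recent ggplot2
ggplot(data=bob, aes(x=TREES, fill=TREES))+
geom_bar(aes(y = after_stat(count/sum(count))))+
scale_y_continuous(labels = scales::percent)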
STEP 1: Make a stacked bar graph.
## STEP 1: MAKE A STACKED-BAR GRAPH
ggplot(bob, aes(x=1, fill=TREES))+
geom_bar()
STEP 2: Use polar coordinates
## PLOT IT IN A CIRCLE
ggplot(bob, aes(x=1, fill=TREES))+
geom_bar()+
coord_polar("y", start=0)+
theme_void()
Motivating Question 2: What percent of paintings contain “almighty mountains”?
In small groups, work together to answer the question above by completing the following tasks.
# Table for mountains
# use table() function
tabMnt<-table(bob$MOUNTAINS)
tabMnt
##
## 0 1
## 304 99
# kable
kable(tabMnt, col.names = c('MOUNTAINS', 'Count'),
caption = "Distribution of Paintings with Almighty Mountains")
MOUNTAINS | Count |
---|---|
0 | 304 |
1 | 99 |
# use prop.table()
prop.table(tabMnt)
##
## 0 1
## 0.7543424 0.2456576
bob$MOUNTAINS<-as.factor(bob$MOUNTAINS)
# create a graph to display the distribution
ggplot(bob, aes(x=MOUNTAINS, fill=MOUNTAINS))+
geom_bar()
# stacked bar graph
ggplot(bob, aes(x=1, fill=MOUNTAINS))+
geom_bar()
# pie graph
ggplot(bob, aes(x=1, fill=MOUNTAINS))+
geom_bar()+
coord_polar("y", start=0)+
theme_void()
Motivating Question 3: What percent of paintings have both “happy little trees” and “almighty mountains”?
## trees and mountains
# Row then col
tabTreesMnt<-table(bob$TREES, bob$MOUNTAINS)
tabTreesMnt
##
## 0 1
## 0 61 5
## 1 243 94
## kable
kable(tabTreesMnt)
 | 0 | 1 |
---|---|---|
0 | 61 | 5 |
1 | 243 | 94 |
Now that we have two dimensions this is getting confusing. Maybe we should make labels.
bob<-bob%>%
mutate(treeLab=case_when(
TREES==0 ~ "No Trees",
TREES==1 ~ "Happy Little Trees"
),
mntLab=case_when(
MOUNTAINS==0 ~ "No Mnts",
MOUNTAINS==1 ~ "Almighty Mountains"
)
)
tabTreesMnt2<-table(bob$treeLab, bob$mntLab)
tabTreesMnt2
##
## Almighty Mountains No Mnts
## Happy Little Trees 94 243
## No Trees 5 61
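Notice that table() sorts character labels alphabetically, which is why “Almighty Mountains” comes before “No Mnts”. If a different order is wanted, one option is to store the labels as factors with explicit levels. The sketch below is an alternative to case_when(), not the approach used in the rest of this lesson; it rebuilds one row per painting from the counts above so that it runs on its own, and the names trees, mnts, and tabOrdered are illustrative. Changing the order also changes which row is jointD[1,], so any code that indexes by position would need to be adjusted.

```r
# Rebuild one value per painting from the four cell counts shown above
trees <- rep(c(1, 1, 0, 0), times = c(94, 243, 5, 61))
mnts  <- rep(c(1, 0, 1, 0), times = c(94, 243, 5, 61))

# factor() with explicit levels controls the order of rows and columns
treeLab <- factor(trees, levels = c(0, 1),
                  labels = c("No Trees", "Happy Little Trees"))
mntLab  <- factor(mnts, levels = c(0, 1),
                  labels = c("No Mnts", "Almighty Mountains"))

# Rows and columns now follow the level order, not alphabetical order
tabOrdered <- table(treeLab, mntLab)
tabOrdered
```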
## Stacked Bar Graph (Default)
ggplot(data=bob, aes(x=treeLab, fill=mntLab))+
geom_bar()
We can use position adjustments to change the type of bar graph.
## Side-by-side Bar Graph
ggplot(data=bob, aes(x=treeLab, fill=mntLab))+
geom_bar(position="dodge")
Definition (joint distribution): The probability distribution over all possible pairs of outcomes.
## Joint Distribution
jointD<-prop.table(tabTreesMnt2)
jointD
##
## Almighty Mountains No Mnts
## Happy Little Trees 0.23325062 0.60297767
## No Trees 0.01240695 0.15136476
Note: The sum of any proper distribution is 1.
## NOTE: The sum of any distribution is 1
sum(prop.table(tabTreesMnt2))
## [1] 1
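It can help to see that prop.table() with no margin argument is simply each cell count divided by the grand total. The sketch below rebuilds the two-way table from the counts printed above (the object names tab and jointByHand are illustrative) and compares the hand computation with prop.table().

```r
# Two-way counts copied from the labeled table above (filled column by column)
tab <- as.table(matrix(c(94, 5, 243, 61), nrow = 2,
                       dimnames = list(c("Happy Little Trees", "No Trees"),
                                       c("Almighty Mountains", "No Mnts"))))

# Joint distribution by hand: each cell divided by the total number of paintings
jointByHand <- tab / sum(tab)
jointByHand

# Should agree with prop.table()
all.equal(as.numeric(jointByHand), as.numeric(prop.table(tab)))
```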
We can also display this with kable:
## kable
kable(round(prop.table(tabTreesMnt2),2))
 | Almighty Mountains | No Mnts |
---|---|---|
Happy Little Trees | 0.23 | 0.60 |
No Trees | 0.01 | 0.15 |
Definition (marginal distribution): Gives the probabilities of the values of one variable without reference to the values of the other variable.
## Marginal Distribution of Happy Little Trees
## Row Sums (ie sum over the cols)
sum(jointD[1,])
## [1] 0.8362283
sum(jointD[2,])
## [1] 0.1637717
## Observe: This matches with
prop.table(table(bob$treeLab))
##
## Happy Little Trees No Trees
## 0.8362283 0.1637717
How would you find the marginal distribution for mountains in paintings by using the joint distribution?
## INSERT CODE HERE
## Marginal Distribution of Mountains
## Col Sums (ie sum over the rows)
sum(jointD[,1])
## [1] 0.2456576
sum(jointD[,2])
## [1] 0.7543424
## Observe: This matches with
prop.table(table(bob$mntLab))
##
## Almighty Mountains No Mnts
## 0.2456576 0.7543424
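Summing one row or column at a time works, but base R can also compute all the marginals at once. The sketch below rebuilds the joint distribution from the counts above so that it runs on its own; rowSums(), colSums(), and addmargins() are base R functions, and the names treeMarg and mntMarg are illustrative.

```r
# Joint distribution rebuilt from the counts in the labeled two-way table
tab <- as.table(matrix(c(94, 5, 243, 61), nrow = 2,
                       dimnames = list(c("Happy Little Trees", "No Trees"),
                                       c("Almighty Mountains", "No Mnts"))))
jointD <- prop.table(tab)

# Marginal distribution of trees: sum over the columns within each row
treeMarg <- rowSums(jointD)
treeMarg

# Marginal distribution of mountains: sum over the rows within each column
mntMarg <- colSums(jointD)
mntMarg

# addmargins() appends both marginals (labeled "Sum") to the joint table
addmargins(jointD)
```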
Definition (conditional distribution): A probability distribution that describes the probability of an outcome given the occurrence of a particular event.
Motivating Question 4: What percent of paintings with “happy little trees” also have “almighty mountains”?
## Conditional Distr (Given the row dim)
## margin = 1 for row
marg1<-prop.table(tabTreesMnt2, margin=1)
marg1
##
## Almighty Mountains No Mnts
## Happy Little Trees 0.27893175 0.72106825
## No Trees 0.07575758 0.92424242
## Observe that this is a proper distribution
sum(marg1[1,])
## [1] 1
sum(marg1[2,])
## [1] 1
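The conditional distribution ties back to the two earlier definitions: each conditional probability is a joint probability divided by the matching marginal probability. The sketch below is a check under that reading, rebuilt from the counts above so that it runs on its own; condOnTrees is an illustrative name.

```r
# Counts and joint distribution rebuilt from the labeled two-way table
tab <- as.table(matrix(c(94, 5, 243, 61), nrow = 2,
                       dimnames = list(c("Happy Little Trees", "No Trees"),
                                       c("Almighty Mountains", "No Mnts"))))
jointD <- prop.table(tab)

# Divide each row of the joint distribution by that row's marginal probability
condOnTrees <- sweep(jointD, 1, rowSums(jointD), "/")
condOnTrees

# Should agree with prop.table(..., margin = 1)
all.equal(as.numeric(condOnTrees), as.numeric(prop.table(tab, margin = 1)))

# Among paintings with trees, the share that also have mountains is 94 / 337
condOnTrees["Happy Little Trees", "Almighty Mountains"]
```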
We can visualize conditional distributions with a different position adjustment.
## FILLED BAR GRAPH
ggplot(bob, aes(x = treeLab, fill=mntLab))+
geom_bar(position="fill")
Does it make sense to condition in the other direction?
Conditioning on the column dimension answers a different question: if we randomly selected a painting that contained mountains, what is the probability that it also contained trees?
## Conditional Distr (Given the col dim)
## margin = 2 for col
marg2<-prop.table(tabTreesMnt2, margin=2)
marg2
##
## Almighty Mountains No Mnts
## Happy Little Trees 0.94949495 0.79934211
## No Trees 0.05050505 0.20065789
Your turn!
## INSERT CODE HERE
sum(marg2[,1])
## [1] 1
sum(marg2[,2])
## [1] 1
## INSERT CODE HERE
ggplot(bob, aes(x = mntLab, fill=treeLab))+
geom_bar(position="fill")
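The two conditional tables give very different numbers for the same cell, yet both come from the same joint distribution. As a rough check, multiplying each conditional table by the marginal it was conditioned on should recover the joint distribution. The sketch below rebuilds everything from the counts above so that it runs on its own; jointFromRows and jointFromCols are illustrative names.

```r
# Counts, joint distribution, and both conditional distributions
tab <- as.table(matrix(c(94, 5, 243, 61), nrow = 2,
                       dimnames = list(c("Happy Little Trees", "No Trees"),
                                       c("Almighty Mountains", "No Mnts"))))
jointD <- prop.table(tab)
marg1  <- prop.table(tab, margin = 1)  # given the row (trees)
marg2  <- prop.table(tab, margin = 2)  # given the column (mountains)

# The same cell answers two different questions
marg1["Happy Little Trees", "Almighty Mountains"]  # mountains, given trees
marg2["Happy Little Trees", "Almighty Mountains"]  # trees, given mountains

# Conditional times the matching marginal recovers the joint distribution
jointFromRows <- sweep(marg1, 1, rowSums(jointD), "*")
jointFromCols <- sweep(marg2, 2, colSums(jointD), "*")
all.equal(as.numeric(jointFromRows), as.numeric(jointD))
all.equal(as.numeric(jointFromCols), as.numeric(jointD))
```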
Cross-tabulated data show the number of observations (for example, survey respondents) that share a combination of characteristics or demographics. This is a common format for survey data, and it greatly reduces the size of a dataset.
crossTab<-bob%>%
group_by(treeLab, mntLab)%>%
summarise(n=n())
## `summarise()` has grouped output by 'treeLab'. You can override using the
## `.groups` argument.
crossTab
## # A tibble: 4 × 3
## # Groups: treeLab [2]
## treeLab mntLab n
## <chr> <chr> <int>
## 1 Happy Little Trees Almighty Mountains 94
## 2 Happy Little Trees No Mnts 243
## 3 No Trees Almighty Mountains 5
## 4 No Trees No Mnts 61
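Cross-tabulated data can be turned back into a table object, so table-based tools such as prop.table() still apply. One way to do this in base R is xtabs(), which sums a count column over the combinations of the other variables. The sketch below types in the four rows shown above as a plain data frame so that it runs on its own; tabFromCross and backToLong are illustrative names.

```r
# The cross-tabulated counts from the output above, entered by hand
crossTab <- data.frame(
  treeLab = c("Happy Little Trees", "Happy Little Trees", "No Trees", "No Trees"),
  mntLab  = c("Almighty Mountains", "No Mnts", "Almighty Mountains", "No Mnts"),
  n       = c(94, 243, 5, 61)
)

# xtabs() rebuilds the two-way table, using n as the cell counts
tabFromCross <- xtabs(n ~ treeLab + mntLab, data = crossTab)
tabFromCross

# Table tools now work as before, e.g. conditioning on the row dimension
prop.table(tabFromCross, margin = 1)

# as.data.frame() goes the other way: from a table back to one row per cell
backToLong <- as.data.frame(tabFromCross, responseName = "n")
backToLong
```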
Graphics can also be made with cross-tabulated data.
## Now we need to specify the height
## We are using color here to see how the bars are composed
ggplot(data = crossTab, aes(x=treeLab, y =n , color=treeLab))+
geom_bar(stat="identity")
Let’s try fill now.
## Fill with color
ggplot(data = crossTab, aes(x=treeLab, y =n , fill=treeLab))+
geom_bar(stat="identity")
Make filled bar graphs.
## Filled bar graphs with cross-tab data
ggplot(data = crossTab, aes(x=treeLab, y =n , fill=mntLab))+
geom_bar(stat="identity", position = "fill")
It’s generally NOT advised to use pie charts to make comparisons across distributions!
## Comparing Pies
ggplot(data = crossTab, aes(x=1, y =n , fill=mntLab))+
geom_bar(stat="identity", position = "fill")+
facet_grid(.~treeLab)+
coord_polar("y", start=0)+
theme_void()
In 1973, UC Berkeley became “one of the first universities to be sued for sexual discrimination”, after its admissions data showed a statistically significant difference between the admission rates of men and women.
## UC Berk
data(UCBAdmissions)
str(UCBAdmissions)
## 'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
## - attr(*, "dimnames")=List of 3
## ..$ Admit : chr [1:2] "Admitted" "Rejected"
## ..$ Gender: chr [1:2] "Male" "Female"
## ..$ Dept : chr [1:6] "A" "B" "C" "D" ...
cal<-as.data.frame(UCBAdmissions)
Stacked bar:
ggplot(cal, aes(x=Gender, y= Freq, fill=Admit))+
geom_bar(stat = "identity",
position="fill")
Faceted:
ggplot(cal, aes(x=Gender, y= Freq, fill=Admit))+
geom_bar(stat = "identity",
position="fill")+
facet_grid(.~Dept)
How does this happen?
“The simple explanation is that women tended to apply to the departments that are the hardest to get into, and men tended to apply to departments that were easier to get into. (Humanities departments tended to have less research funding to support graduate students, while science and engineer departments were awash with money.) So women were rejected more than men. Presumably, the bias wasn’t at Berkeley but earlier in women’s education, when other biases led them to different fields of study than men.”
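The explanation quoted above can be explored numerically with the same conditional-distribution tools used for the Bob Ross data. The sketch below uses the built-in UCBAdmissions table; the object names overallRate, byDept, and appShare are illustrative. It compares the admission rate by gender when departments are pooled, the admission rate by gender within each department, and the share of each gender's applications that went to each department.

```r
data(UCBAdmissions)

# Pool the departments: Admit x Gender counts, then condition on Gender
byGender    <- margin.table(UCBAdmissions, margin = c(1, 2))
overallRate <- prop.table(byGender, margin = 2)["Admitted", ]
round(overallRate, 2)

# Condition on Gender and Dept together: admission rate within each department
byDept <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
round(byDept, 2)

# Where did each gender apply? Share of applications by department
apps     <- margin.table(UCBAdmissions, margin = c(2, 3))
appShare <- prop.table(apps, margin = 1)
round(appShare, 2)
```

Comparing the pooled rates with the department-level rates shows how the direction of a difference can change once the data are split by a third variable.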