In this lesson students will learn to apply categorical data analysis methods to data sets with fundamentally different structures.
The tidyverse package is needed for these examples.
library(tidyverse)
Nine-hundred and ten (910) randomly sampled registered voters from Tampa, FL were asked if they thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country. The results of the survey by political ideology are shown below.
Source: Questions from Introduction to Modern Statistics.
#install.packages("openintro")
library(openintro)
data("immigration")
str(immigration)
## tibble [910 × 2] (S3: tbl_df/tbl/data.frame)
## $ response : Factor w/ 4 levels "Apply for citizenship",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ political: Factor w/ 3 levels "conservative",..: 1 1 1 1 1 1 1 1 1 1 ...
What order are the factor levels currently in? Check the levels.
## CHECK THE LEVELS
levels(immigration$political)
## [1] "conservative" "liberal" "moderate"
By default, R orders the levels of a factor alphabetically, but we might not want that.
immigration$political<-as.character(immigration$political)
immigration$political<-factor(immigration$political,
levels = c("conservative","moderate", "liberal"))
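The releveling step can be tried on a small example. The sketch below uses a made-up vector (`ideology` is a hypothetical stand-in for `immigration$political`) and shows that `factor()` accepts an existing factor directly, so the `as.character()` step is optional.

```r
# Toy vector standing in for immigration$political (hypothetical values)
ideology <- c("liberal", "conservative", "moderate", "conservative")

# Default behavior: levels are sorted alphabetically
default_f <- factor(ideology)
levels(default_f)

# Explicit levels give the order conservative, moderate, liberal
ordered_f <- factor(ideology,
                    levels = c("conservative", "moderate", "liberal"))
levels(ordered_f)

# Re-leveling an existing factor works the same way;
# the values are unchanged, only the level order differs
releveled <- factor(default_f,
                    levels = c("conservative", "moderate", "liberal"))
identical(levels(releveled), levels(ordered_f))
```

The level order matters later on because `table()` and `ggplot()` both display categories in level order.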
In this dataset the rows represent individuals. In the following example we will learn how to use the table() and prop.table() functions.
Create a one-way frequency table for the distribution of political affiliation
Motivating Question 1: What percent of these Tampa, FL voters identify themselves as conservatives?
# Table for Political affiliation
# use table() function
tabPol<-table(immigration$political)
tabPol
##
## conservative moderate liberal
## 372 363 175
We can also use kable to make tables in R Markdown:
library(knitr)
kable(tabPol, col.names = c('Party', 'Count'),
caption = "Distribution of Political Identities")
Party | Count |
---|---|
conservative | 372 |
moderate | 363 |
liberal | 175 |
We might also want to display proportions.
# the prop.table() function must take a table object
prop.table(tabPol)
##
## conservative moderate liberal
## 0.4087912 0.3989011 0.1923077
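To answer Motivating Question 1 as a percent, the proportions can be scaled by 100 and rounded. This is a minimal sketch that re-enters the counts printed above by hand so it runs without the openintro package; `pctPol` is an illustrative name.

```r
# Counts copied from the one-way table above
tabPol <- as.table(c(conservative = 372, moderate = 363, liberal = 175))

# Convert proportions to percents, rounded to one decimal place
pctPol <- round(100 * prop.table(tabPol), 1)
pctPol

# Percent of respondents who identify as conservative
pctPol["conservative"]
```

About 40.9% of the sampled voters identify as conservative.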
Let’s visualize this distribution.
# create a graph to display the distribution
ggplot(immigration, aes(x=political))+
geom_bar()
Since bars are two-dimensional, the color aesthetic only outlines the bars; the fill aesthetic colors their interiors.
What is going on in this graph?
## ADD COLOR
ggplot(immigration, aes(x=political, color=political))+
geom_bar()
## OOPS! Let's use fill!
ggplot(immigration, aes(x=political, fill=political))+
geom_bar()
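## CHANGE Y-AXIS TO PERCENT
## after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
ggplot(immigration, aes(x=political, fill=political))+
  geom_bar(aes(y = after_stat(count/sum(count))))+
  scale_y_continuous(labels = scales::percent)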
STEP 1: Make a stacked bar graph.
## STEP 1: MAKE A STACKED-BAR GRAPH
ggplot(immigration, aes(x=1, fill=political))+
geom_bar()
STEP 2: Use polar coordinates
## PLOT IT IN A CIRCLE
ggplot(immigration, aes(x=1, fill=political))+
geom_bar()+
coord_polar("y", start=0)+
theme_void()
Motivating Question 2: What percent of Tampa, FL voters are in favor of the citizenship option?
In small groups, work together to answer the question above by completing the following tasks.
# Table for citizenship response
# use table() function
tabResp<-table(immigration$response)
tabResp
##
## Apply for citizenship Guest worker Leave the country
## 278 262 350
## Not sure
## 20
# kable
kable(tabResp, col.names = c('Response', 'Count'),
caption = "Distribution of Response to Citizenship")
Response | Count |
---|---|
Apply for citizenship | 278 |
Guest worker | 262 |
Leave the country | 350 |
Not sure | 20 |
# use prop.table()
prop.table(tabResp)
##
## Apply for citizenship Guest worker Leave the country
## 0.30549451 0.28791209 0.38461538
## Not sure
## 0.02197802
# create a graph to display the distribution
ggplot(immigration, aes(x=response, fill=response))+
geom_bar()
# stacked bar graph
ggplot(immigration, aes(x=1, fill=response))+
geom_bar()
# pie graph
ggplot(immigration, aes(x=1, fill=response))+
geom_bar()+
coord_polar("y", start=0)+
theme_void()
Motivating Question 3: What percent of these Tampa, FL voters identify themselves as conservatives and are in favor of the citizenship option?
## conservative and citizen
# Row then col
tabPolResp<-table(immigration$political, immigration$response)
tabPolResp
##
## Apply for citizenship Guest worker Leave the country Not sure
## conservative 57 121 179 15
## moderate 120 113 126 4
## liberal 101 28 45 1
## kable
kable(tabPolResp)
| | Apply for citizenship | Guest worker | Leave the country | Not sure |
|---|---|---|---|---|
| conservative | 57 | 121 | 179 | 15 |
| moderate | 120 | 113 | 126 | 4 |
| liberal | 101 | 28 | 45 | 1 |
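It can help to see the row and column totals next to the cell counts. One option, sketched here, is base R's `addmargins()`. The counts are re-entered by hand from the table above so the block runs on its own; `counts` and `withTotals` are illustrative names.

```r
# Two-way counts copied from the table above (rows: political, cols: response)
counts <- matrix(
  c( 57, 121, 179, 15,
    120, 113, 126,  4,
    101,  28,  45,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    political = c("conservative", "moderate", "liberal"),
    response  = c("Apply for citizenship", "Guest worker",
                  "Leave the country", "Not sure")
  )
)
tabPolResp <- as.table(counts)

# Append a "Sum" row and a "Sum" column
withTotals <- addmargins(tabPolResp)
withTotals

# Row total for conservatives and the grand total
withTotals["conservative", "Sum"]
withTotals["Sum", "Sum"]
```

The row totals reproduce the one-way table for political ideology, and the column totals reproduce the one-way table for response.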
## Stacked Bar Graph (Default)
ggplot(data=immigration, aes(x=political, fill=response))+
geom_bar()
We can use position adjustments to change the type of bar graph.
## Side-by-side Bar Graph
ggplot(data=immigration, aes(x=political, fill=response))+
geom_bar(position="dodge")
Definition (joint distribution): The probability distribution over all possible pairs of outcomes of the two variables.
## Joint Distribution
jointD<-prop.table(tabPolResp)
jointD
##
## Apply for citizenship Guest worker Leave the country Not sure
## conservative 0.062637363 0.132967033 0.196703297 0.016483516
## moderate 0.131868132 0.124175824 0.138461538 0.004395604
## liberal 0.110989011 0.030769231 0.049450549 0.001098901
Note: The sum of any proper distribution is 1.
## NOTE: The sum of any distribution is 1
sum(prop.table(tabPolResp))
## [1] 1
We can also display the joint distribution with kable.
## kable
kable(round(prop.table(tabPolResp),2))
| | Apply for citizenship | Guest worker | Leave the country | Not sure |
|---|---|---|---|---|
| conservative | 0.06 | 0.13 | 0.20 | 0.02 |
| moderate | 0.13 | 0.12 | 0.14 | 0.00 |
| liberal | 0.11 | 0.03 | 0.05 | 0.00 |
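Motivating Question 3 asks for a single cell of the joint distribution. As a sketch, that cell can be pulled out by row and column name; the counts are re-entered by hand so the block is self-contained, and `p_cons_cit` is an illustrative name.

```r
# Two-way counts copied from the table above
counts <- matrix(
  c( 57, 121, 179, 15,
    120, 113, 126,  4,
    101,  28,  45,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    political = c("conservative", "moderate", "liberal"),
    response  = c("Apply for citizenship", "Guest worker",
                  "Leave the country", "Not sure")
  )
)
tabPolResp <- as.table(counts)

# Joint distribution: every cell divided by the grand total (910)
jointD <- prop.table(tabPolResp)

# Conservative AND in favor of the citizenship option: 57 / 910
p_cons_cit <- jointD["conservative", "Apply for citizenship"]
round(100 * p_cons_cit, 1)
```

About 6.3% of all respondents are conservatives who favor the citizenship option.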
Definition (marginal distribution): The probabilities of the values of one variable, without reference to the values of the other variable.
## Marginal Distribution of Political Affiliation
## Row Sums (ie sum over the cols)
sum(jointD[1,])
## [1] 0.4087912
sum(jointD[2,])
## [1] 0.3989011
sum(jointD[3,])
## [1] 0.1923077
## Observe: This matches with
prop.table(table(immigration$political))
##
## conservative moderate liberal
## 0.4087912 0.3989011 0.1923077
How would you find the marginal distribution of response to the citizenship option by using the joint distribution?
## INSERT CODE HERE
## Marginal Distribution of Response
## Col Sums (ie sum over the rows)
sum(jointD[,1])
## [1] 0.3054945
sum(jointD[,2])
## [1] 0.2879121
sum(jointD[,3])
## [1] 0.3846154
sum(jointD[,4])
## [1] 0.02197802
## Observe: This matches with
prop.table(table(immigration$response))
##
## Apply for citizenship Guest worker Leave the country
## 0.30549451 0.28791209 0.38461538
## Not sure
## 0.02197802
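The row-by-row and column-by-column sums above can also be computed in one call each. The sketch below uses base R's `rowSums()`, `colSums()` and `margin.table()`; the counts are re-entered by hand, and `margPol` and `margResp` are illustrative names.

```r
# Two-way counts copied from the table above
counts <- matrix(
  c( 57, 121, 179, 15,
    120, 113, 126,  4,
    101,  28,  45,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    political = c("conservative", "moderate", "liberal"),
    response  = c("Apply for citizenship", "Guest worker",
                  "Leave the country", "Not sure")
  )
)
tabPolResp <- as.table(counts)
jointD <- prop.table(tabPolResp)

# Marginal distribution of political ideology: sum each row of the joint
margPol <- rowSums(jointD)
margPol

# Marginal distribution of response: sum each column of the joint
margResp <- colSums(jointD)
margResp

# Same marginals computed directly from the counts
all.equal(unname(margPol),
          as.vector(prop.table(margin.table(tabPolResp, 1))))
all.equal(unname(margResp),
          as.vector(prop.table(margin.table(tabPolResp, 2))))
```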
Definition (conditional distribution): A probability distribution that describes the probability of each outcome of one variable, given that the other variable takes a particular value.
Motivating Question 4: What percent of these Tampa, FL voters who identify themselves as conservatives are also in favor of the citizenship option? What percent of moderates share this view? What percent of liberals share this view?
## Conditional Distr (Given the row dim)
## margin = 1 for row
marg1<-prop.table(tabPolResp, margin=1)
marg1
##
## Apply for citizenship Guest worker Leave the country Not sure
## conservative 0.153225806 0.325268817 0.481182796 0.040322581
## moderate 0.330578512 0.311294766 0.347107438 0.011019284
## liberal 0.577142857 0.160000000 0.257142857 0.005714286
## Observe that this is a proper distribution
sum(marg1[1,])
## [1] 1
sum(marg1[2,])
## [1] 1
sum(marg1[3,])
## [1] 1
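To see what `margin=1` does, the row-conditional distribution can be rebuilt by hand: divide each row of counts by its row total. The sketch below re-enters the counts from the table above; `manual` and `builtin` are illustrative names.

```r
# Two-way counts copied from the table above
counts <- matrix(
  c( 57, 121, 179, 15,
    120, 113, 126,  4,
    101,  28,  45,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    political = c("conservative", "moderate", "liberal"),
    response  = c("Apply for citizenship", "Guest worker",
                  "Leave the country", "Not sure")
  )
)
tabPolResp <- as.table(counts)

# By hand: R recycles the row totals down each column,
# so cell [i, j] is divided by the total of row i
manual <- tabPolResp / rowSums(tabPolResp)

# Built-in equivalent
builtin <- prop.table(tabPolResp, margin = 1)

# Largest difference between the two versions (should be essentially 0)
max(abs(manual - builtin))

# Motivating Question 4: percent in favor of citizenship within each group
round(100 * builtin[, "Apply for citizenship"], 1)
```

Within each group, about 15.3% of conservatives, 33.1% of moderates and 57.7% of liberals favor the citizenship option.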
We can visualize conditional distributions with a different position adjustment.
## FILLED BAR GRAPH
ggplot(immigration, aes(x = political, fill=response))+
geom_bar(position="fill")
Does it make sense to condition in the other direction?
Conditioning on the column dimension answers questions such as: if we randomly select a respondent who supports the citizenship option, what is the probability that they hold a particular political ideology?
## Conditional Distr (Given the col dim)
## margin = 2 for col
marg2<-prop.table(tabPolResp, margin=2)
marg2
##
## Apply for citizenship Guest worker Leave the country Not sure
## conservative 0.2050360 0.4618321 0.5114286 0.7500000
## moderate 0.4316547 0.4312977 0.3600000 0.2000000
## liberal 0.3633094 0.1068702 0.1285714 0.0500000
Your turn!
## INSERT CODE HERE
sum(marg2[,1])
## [1] 1
sum(marg2[,2])
## [1] 1
sum(marg2[,3])
## [1] 1
sum(marg2[,4])
## [1] 1
## INSERT CODE HERE
ggplot(immigration, aes(x = response, fill=political))+
geom_bar(position="fill")
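The joint, marginal and conditional distributions are linked: multiplying each column of the column-conditional distribution by that column's marginal probability recovers the joint distribution. The sketch below checks this numerically with base R's `sweep()`; the counts are re-entered by hand, and `condGivenResp` and `rebuilt` are illustrative names.

```r
# Two-way counts copied from the table above
counts <- matrix(
  c( 57, 121, 179, 15,
    120, 113, 126,  4,
    101,  28,  45,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    political = c("conservative", "moderate", "liberal"),
    response  = c("Apply for citizenship", "Guest worker",
                  "Leave the country", "Not sure")
  )
)
tabPolResp <- as.table(counts)

jointD        <- prop.table(tabPolResp)              # P(political and response)
condGivenResp <- prop.table(tabPolResp, margin = 2)  # P(political | response)
margResp      <- colSums(jointD)                     # P(response)

# Multiply column j of the conditional table by P(response = j)
rebuilt <- sweep(condGivenResp, MARGIN = 2, STATS = margResp, FUN = "*")

# Largest difference from the joint distribution (should be essentially 0)
max(abs(rebuilt - jointD))
```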
Cross-tabulated data show the number of respondents who share a combination of characteristics or demographics. This is a common format for survey data because it greatly reduces the size of a dataset: here, 910 rows of individual responses collapse to 12 rows of counts.
crossTab<-immigration%>%
group_by(political, response)%>%
summarise(n=n())
## `summarise()` has grouped output by 'political'. You can override using the
## `.groups` argument.
crossTab
## # A tibble: 12 × 3
## # Groups: political [3]
## political response n
## <fct> <fct> <int>
## 1 conservative Apply for citizenship 57
## 2 conservative Guest worker 121
## 3 conservative Leave the country 179
## 4 conservative Not sure 15
## 5 moderate Apply for citizenship 120
## 6 moderate Guest worker 113
## 7 moderate Leave the country 126
## 8 moderate Not sure 4
## 9 liberal Apply for citizenship 101
## 10 liberal Guest worker 28
## 11 liberal Leave the country 45
## 12 liberal Not sure 1
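Cross-tabulated data can be turned back into a two-way table, so table() based tools such as prop.table() still apply. The sketch below uses base R's `xtabs()` on a hand-built data frame with the same columns as `crossTab` (the row order differs from the tibble above, which does not affect the result); `back` is an illustrative name.

```r
# Two-way counts copied from the output above
counts <- matrix(
  c( 57, 121, 179, 15,
    120, 113, 126,  4,
    101,  28,  45,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    political = c("conservative", "moderate", "liberal"),
    response  = c("Apply for citizenship", "Guest worker",
                  "Leave the country", "Not sure")
  )
)
tabPolResp <- as.table(counts)

# Long format: one row per (political, response) combination, count in column n
crossTab <- as.data.frame(tabPolResp, responseName = "n")
nrow(crossTab)

# Rebuild the two-way table from the counts column
back <- xtabs(n ~ political + response, data = crossTab)
back

# Conditional distribution of response within each political group
round(prop.table(back, margin = 1), 3)
```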
Graphics can also be made with cross-tabulated data.
## Now we need to specify the height
## We are using color here to see how the bars are composed
ggplot(data = crossTab, aes(x=political, y =n , color=political))+
geom_bar(stat="identity")
Let’s try fill now.
## Fill with color
ggplot(data = crossTab, aes(x=political, y =n , fill=political))+
geom_bar(stat="identity")
Make filled bar graphs.
## Filled bar graphs with cross-tab data
ggplot(data = crossTab, aes(x=political, y =n , fill=response))+
geom_bar(stat="identity", position = "fill")
It’s generally NOT advised to use pie charts to make comparisons across distributions!
## Comparing Pies
ggplot(data = crossTab, aes(x=1, y =n , fill=response))+
geom_bar(stat="identity", position = "fill")+
facet_grid(.~political)+
coord_polar("y", start=0)+
theme_void()
In 1973, UC Berkeley became “one of the first universities to be sued for sexual discrimination” (the difference in admission rates between men and women was statistically significant).
## UC Berk
data(UCBAdmissions)
str(UCBAdmissions)
## 'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
## - attr(*, "dimnames")=List of 3
## ..$ Admit : chr [1:2] "Admitted" "Rejected"
## ..$ Gender: chr [1:2] "Male" "Female"
## ..$ Dept : chr [1:6] "A" "B" "C" "D" ...
cal<-as.data.frame(UCBAdmissions)
ggplot(cal, aes(x=Gender, y= Freq, fill=Admit))+
geom_bar(stat = "identity",
position="fill")
ggplot(cal, aes(x=Gender, y= Freq, fill=Admit))+
geom_bar(stat = "identity",
position="fill")+
facet_grid(.~Dept)
How does this happen?
“The simple explanation is that women tended to apply to the departments that are the hardest to get into, and men tended to apply to departments that were easier to get into. (Humanities departments tended to have less research funding to support graduate students, while science and engineer departments were awash with money.) So women were rejected more than men. Presumably, the bias wasn’t at Berkeley but earlier in women’s education, when other biases led them to different fields of study than men.”
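The explanation in the quote can be checked numerically. The sketch below compares admission rates by gender overall and within each department, then counts applicants by gender and department. It uses only the built-in `UCBAdmissions` table; `overallRate`, `byDept` and `applicants` are illustrative names. This reversal between aggregated and grouped comparisons is often called Simpson's paradox.

```r
data(UCBAdmissions)

# Overall: collapse over department, then condition on gender
overall <- margin.table(UCBAdmissions, margin = c(1, 2))   # Admit x Gender
overallRate <- prop.table(overall, margin = 2)["Admitted", ]
round(overallRate, 3)

# By department: condition on each Gender x Dept combination
byDept <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
round(byDept, 2)

# Number of applicants by gender and department
applicants <- margin.table(UCBAdmissions, margin = c(2, 3))
applicants
```

Overall, men are admitted at a higher rate than women (about 0.445 versus 0.304), yet the department-level rates are much closer, and in department A the rate for women is higher. The applicant counts show that men applied mostly to departments A and B, which have the highest admission rates.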