MATH 138: Tables and Bar Graphs for Qualitative Data

Learning Objectives:

Load data into R from github (or URL)
Create tables
Create different types of bar charts (using base R graphics)
Choose appropriate types of tables and charts to tell a data story

Motivating Example #1: Titanic

At 11:40 pm on April 14, 1912, the Titanic hit an iceberg on its maiden voyage from Southampton to New York City. Our data represent the 2201 people on board (passengers and crew combined).

(In this example we will also learn how to install packages.)

Install a package

#install.packages("titanic")
library(titanic)
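
If you are not sure whether the package is already installed, here is a minimal sketch of a more defensive way to load it. The variable name has_titanic is only illustrative, and the code just prints a reminder instead of installing anything for you.

```r
# Check whether the titanic package is installed (gives TRUE or FALSE)
has_titanic <- requireNamespace("titanic", quietly = TRUE)

if (has_titanic) {
  # Safe to load
  library(titanic)
} else {
  # Installing needs an internet connection, so we only print a reminder
  message('Run install.packages("titanic") once, then library(titanic)')
}
```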

Note that we will refer to two datasets:

titanic_test : a subset of the passenger-level data (from the titanic package)
Titanic : data pre-organized into frequencies (built into R’s datasets package)

What is the structure of your data?

#View(titanic_test)
dim(titanic_test)

## [1] 418  11

str(titanic_test)

## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...

# Run the view command in your console
# View(titanic_test)

I’m going to perform a little data transformation in the background so that we have tidy data (i.e., each row represents an individual). I’m calling this new data frame titanicDF. Note that this code is suppressed because it is not the focus of this exercise. Instead, focus on the tables and corresponding plots.
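
The suppressed step is not shown in these notes. As a sketch only, one way to get a data frame with one row per person is to expand the built-in Titanic frequency table. This is an assumption about how a data frame like titanicDF could be built, not necessarily the code used here; titanic_freq and row_id are illustrative names.

```r
# Assumption: start from the built-in Titanic table (datasets package)
titanic_freq <- as.data.frame(Titanic)  # columns: Class, Sex, Age, Survived, Freq

# Repeat each row number Freq times, so every person gets their own row
row_id <- rep(seq_len(nrow(titanic_freq)), times = titanic_freq$Freq)
titanicDF <- titanic_freq[row_id, c("Class", "Sex", "Age", "Survived")]
rownames(titanicDF) <- NULL

# One row per person, one column per variable
dim(titanicDF)
head(titanicDF)
```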

First, let’s look at the distribution of people on board by class (crew included):

## 
##  1st  2nd  3rd Crew 
##  325  285  706  885
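
The counts above can also be reproduced directly from the built-in Titanic table. This is a sketch under that assumption: margin.table() with margin 1 keeps the first dimension (Class) and sums over the others. The name class_counts is illustrative.

```r
# Keep dimension 1 (Class) and sum over Sex, Age, and Survived
class_counts <- margin.table(Titanic, 1)
class_counts

# Bar chart of the one-dimensional distribution
barplot(class_counts, main = "People on Board by Class",
        xlab = "Class", ylab = "Count")
```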

We hypothesize that survival rates may differ depending on class. In the table below, rows are classes and columns show whether the person survived.

##       
##         No Yes
##   1st  122 203
##   2nd  167 118
##   3rd  528 178
##   Crew 673 212
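
Counts alone make the classes hard to compare because the groups have different sizes. Here is a sketch, again assuming the built-in Titanic table, that rebuilds the two-way table and converts it to survival rates within each class. The name class_surv is illustrative.

```r
# Keep dimensions 1 (Class) and 4 (Survived), summing over Sex and Age
class_surv <- margin.table(Titanic, c(1, 4))
class_surv

# Row proportions: each row sums to 1, giving the survival rate per class
round(prop.table(class_surv, 1), 3)
```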

Motivating Example #2: Donner Party

The Donner Party was a group of pioneers that departed Missouri on the Oregon Trail in the spring of 1846. On their journey the group experienced delays and rugged terrain that forced them to travel in extreme winter weather with low food supplies. The group is best known for having resorted to cannibalism.

(In this example we will also learn how to import data.)

This dataset has three named columns:

Age
Sex (Male or Female)
Survived (Survived or Died)

Step 1: Import your data

# Import Data
donner<-read.table("https://raw.githubusercontent.com/kitadasmalley/MATH138/main/FALL_2021/Data/donner.txt",
                   header=TRUE)

Step 2: Look at your data

# Look at the first 6 rows 
head(donner)

##   Age    Sex Survived
## 1  40 Female Survived
## 2  40   Male Survived
## 3  30   Male     Died
## 4  28   Male     Died
## 5  40   Male     Died
## 6  45 Female     Died

# Look at the last 6 rows
tail(donner)

##    Age    Sex Survived
## 39  25   Male     Died
## 40  30   Male     Died
## 41  35   Male     Died
## 42  23   Male Survived
## 43  24   Male     Died
## 44  25 Female Survived
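
To see what header=TRUE does without needing an internet connection, here is a small sketch that reads the first three rows shown by head(donner) from a text string. It assumes the file is whitespace-separated like the printout; donner_txt and donner_small are illustrative names.

```r
# First three rows as shown by head(donner), typed in as plain text
donner_txt <- "Age Sex Survived
40 Female Survived
40 Male Survived
30 Male Died"

# header = TRUE tells read.table that the first line holds the column names
donner_small <- read.table(text = donner_txt, header = TRUE)

# Check column names and types
str(donner_small)
```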

Now it’s your turn! Let’s make some tables!

Step 3: Look at the one-dimensional distribution for survival

A) Create a table

# One dim table 
table_surv<-table(donner$Survived)
table_surv

## 
##     Died Survived 
##       24       20

B) Let’s look at the relative frequency

prop.table(table_surv)

## 
##      Died  Survived 
## 0.5454545 0.4545455
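
Proportions are often easier to read as percentages. A short sketch, rebuilding the counts by hand from the table above so it runs on its own (pct_surv is an illustrative name):

```r
# Counts copied from the one-dimensional table above
table_surv <- as.table(c(Died = 24, Survived = 20))

# Convert proportions to percentages, rounded to one decimal place
pct_surv <- round(100 * prop.table(table_surv), 1)
pct_surv
```

This gives 54.5 for Died and 45.5 for Survived.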

C) We can make a bar chart using this table!

# One dim: Bar Chart Distribution of survival
barplot(table_surv, main="Survival Distribution",
        xlab="Survival")

D) We can even make pie charts

lbls <- paste(names(table_surv), "\n", table_surv, sep="")
pie(table_surv, labels = lbls,
    main="Pie Chart of Survival\n (with sample sizes)")

Step 4: Look at the two-dimensional distribution for sex and survival

A) Create a two-way table

# Create a frequency table
# Row = Sex
# Col = Survived
table_survFM<-table(donner$Sex, donner$Survived)
table_survFM

##         
##          Died Survived
##   Female    5       10
##   Male     19       10

B) Look at distributions using this table

I) These are marginal tables

# Sex frequencies (summed over Survival)
# Use 1 to get row totals (summing across columns)
margin.table(table_survFM, 1)

## 
## Female   Male 
##     15     29

# Survival frequencies (summed over Sex)
# Use 2 to get column totals (summing across rows)
margin.table(table_survFM, 2)

## 
##     Died Survived 
##       24       20
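
If you would rather see both sets of marginal totals alongside the cell counts, addmargins() can append them in one step. A sketch, with the two-way table rebuilt by hand from the counts above so it runs on its own (with_totals is an illustrative name):

```r
# Rebuild the two-way table from the counts shown above
table_survFM <- as.table(matrix(c(5, 19, 10, 10), nrow = 2,
  dimnames = list(Sex = c("Female", "Male"),
                  Survived = c("Died", "Survived"))))

# Add a "Sum" row and a "Sum" column holding the marginal totals
with_totals <- addmargins(table_survFM)
with_totals
```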

II) Find the relative proportions

Table 1: Joint distribution

# cell proportions (joint distribution)
prop.table(table_survFM)

##         
##               Died  Survived
##   Female 0.1136364 0.2272727
##   Male   0.4318182 0.2272727

Table 2: Conditional distribution for survival by sex

# row proportions (conditional distribution for survival by sex)
prop.table(table_survFM, 1)

##         
##               Died  Survived
##   Female 0.3333333 0.6666667
##   Male   0.6551724 0.3448276

Table 3: Conditional distribution for sex by survival

# column proportions (conditional distribution for sex by survival)
prop.table(table_survFM, 2)

##         
##               Died  Survived
##   Female 0.2083333 0.5000000
##   Male   0.7916667 0.5000000
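
The three tables use the same counts and differ only in what each cell is divided by: the grand total, the row total, or the column total. A sketch that does the division by hand, with the table rebuilt from the counts above (joint, by_row, and by_col are illustrative names):

```r
# Rebuild the two-way table from the counts shown above
table_survFM <- as.table(matrix(c(5, 19, 10, 10), nrow = 2,
  dimnames = list(Sex = c("Female", "Male"),
                  Survived = c("Died", "Survived"))))

# Table 1: divide every cell by the grand total (44)
joint <- table_survFM / sum(table_survFM)

# Table 2: divide each row by its row total (15 and 29)
by_row <- sweep(table_survFM, 1, rowSums(table_survFM), "/")

# Table 3: divide each column by its column total (24 and 20)
by_col <- sweep(table_survFM, 2, colSums(table_survFM), "/")

round(joint, 3)
round(by_row, 3)
round(by_col, 3)
```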

Which table tells the most compelling story?

Step 5: Create bar charts (in base R)

Stacked bar chart

barplot(table_survFM, main="Survival Distribution by Sex",
        xlab="Survival")

Add color

# color
barplot(table_survFM, main="Survival Distribution by Sex",
        xlab="Survival", col=c("darkblue", "red"),
        legend=rownames(table_survFM))

Side-by-side Bar Chart

barplot(table_survFM, main="Survival Distribution by Sex",
        xlab="Survival", col=c("darkblue", "red"),
        legend=rownames(table_survFM),
        beside=TRUE)

Filled Bar Chart

prop1<-prop.table(table_survFM,2)

barplot(prop1, main="Survival Distribution by Sex",
        xlab="Survival", col=c("darkblue", "red"),
        legend=rownames(table_survFM))
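
The filled chart above shows the mix of sexes within each survival outcome. To compare survival rates between the sexes instead, one option is to condition on sex and draw one bar per sex. This is a sketch of that alternative, with the table rebuilt from the counts shown earlier; prop_by_sex is an illustrative name.

```r
# Rebuild the two-way table from the counts shown earlier
table_survFM <- as.table(matrix(c(5, 19, 10, 10), nrow = 2,
  dimnames = list(Sex = c("Female", "Male"),
                  Survived = c("Died", "Survived"))))

# Row proportions: survival outcome within each sex
prop_by_sex <- prop.table(table_survFM, 1)

# barplot() draws one bar per column, so transpose to get one bar per sex
barplot(t(prop_by_sex), main = "Survival Proportion within Each Sex",
        xlab = "Sex", ylab = "Proportion",
        col = c("darkblue", "red"),
        legend.text = colnames(table_survFM))
```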

Motivating Example #3: Gender Bias in College Admission

This example uses 1973 UC Berkeley admissions data to look at gender bias in admissions. Berkeley was “One of the first universities to be sued for sexual discrimination” (with a statistically significant difference in admission rates).

(In this example we will also learn how to work with data already contained within R.)

Use data that is built into R

# Let's look at what datasets are available
library(help="datasets")

# we're going to work with the UCBAdmissions dataset
# first let's turn it into a dataframe
data(UCBAdmissions)
ucb<-as.data.frame(UCBAdmissions)
head(ucb)

##      Admit Gender Dept Freq
## 1 Admitted   Male    A  512
## 2 Rejected   Male    A  313
## 3 Admitted Female    A   89
## 4 Rejected Female    A   19
## 5 Admitted   Male    B  353
## 6 Rejected   Male    B  207

Again, I’m going to perform a little data transformation in the background so that we have tidy data (i.e., each row represents an individual).

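Here are the resulting tables and plots. The first table shows counts; the second shows row proportions (the share admitted or rejected within each gender).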

##         
##          Admitted Rejected
##   Female      557     1278
##   Male       1198     1493

##         
##           Admitted  Rejected
##   Female 0.3035422 0.6964578
##   Male   0.4451877 0.5548123
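
The same two tables can be built straight from the frequency data frame, without expanding it to one row per applicant. This is a sketch of that shortcut using xtabs(), not the suppressed code; gender_admit is an illustrative name. Note that the rows come out in factor order (Male, then Female), so the order differs from the printout above.

```r
# Frequency data frame: one row per Admit x Gender x Dept combination
ucb <- as.data.frame(UCBAdmissions)

# Sum Freq over departments to get a Gender x Admit table of counts
gender_admit <- xtabs(Freq ~ Gender + Admit, data = ucb)
gender_admit

# Row proportions: admission rate within each gender
round(prop.table(gender_admit, 1), 3)
```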

This is a famous example of Simpson’s Paradox, a phenomenon in which a trend appears in several different groups but disappears or reverses when the groups are combined.
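
To see the paradox for yourself, compare the combined admission rates above with the rates inside each department. A sketch using the built-in UCBAdmissions table, whose dimensions are Admit, Gender, and Dept (dept_rates is an illustrative name):

```r
# Proportions within each Gender x Dept group (margins 2 and 3),
# then keep only the "Admitted" slice: a Gender x Dept matrix of rates
dept_rates <- prop.table(UCBAdmissions, c(2, 3))["Admitted", , ]
round(dept_rates, 2)

# TRUE where the female admission rate is higher than the male rate
dept_rates["Female", ] > dept_rates["Male", ]
```

Try reading the output department by department and ask whether the pattern matches the combined table.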