At 11:40 pm on April 14, 1912, the Titanic hit an iceberg on its maiden voyage from Southampton to New York City. Our data represent the 2201 people on board (passengers and crew).
(In this example we will also learn how to install packages.)
#install.packages("titanic")
library(titanic)
Note that the titanic package contains two datasets:
#View(titanic_test)
dim(titanic_test)
## [1] 418 11
str(titanic_test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
# Run the view command in your console
# View(titanic_test)
I’m going to perform a little data transformation in the background so that we have tidy data (i.e., each row represents an individual). I’m calling this new dataframe titanicDF. Note that this code is suppressed because it is not the focus of this exercise. Instead, focus on the tables and corresponding plots.
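The transformation itself is not shown in this document. As a rough sketch of what such a step could look like, the code below assumes the counts come from R's built-in `Titanic` table (an assumption; its total of 2201 matches the figure quoted above) and repeats each row of the count table once per person. The intermediate names `titanic_counts` and `row_index` are illustrative only.

```r
# Assumption: start from the built-in Titanic contingency table (datasets package),
# which is not necessarily the source the author used.
titanic_counts <- as.data.frame(Titanic)  # columns: Class, Sex, Age, Survived, Freq

# Repeat each row index Freq times so that every row is one individual
row_index <- rep(seq_len(nrow(titanic_counts)), times = titanic_counts$Freq)
titanicDF <- titanic_counts[row_index, c("Class", "Sex", "Age", "Survived")]
rownames(titanicDF) <- NULL

nrow(titanicDF)         # number of individuals
table(titanicDF$Class)  # distribution by class
```

Combinations with a count of zero simply contribute no rows, so no extra filtering is needed.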
First, let’s look at the distribution of passengers by class:
##
## 1st 2nd 3rd Crew
## 325 285 706 885
We hypothesize that survival rates may differ depending on class.
##
## No Yes
## 1st 122 203
## 2nd 167 118
## 3rd 528 178
## Crew 673 212
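Raw counts are hard to compare because the classes differ in size, so it helps to convert each row to proportions. A minimal sketch, again assuming the built-in `Titanic` table as the source of the counts (the author's own code is hidden, so this is a stand-in); `class_surv` and `class_surv_prop` are illustrative names:

```r
# Assumption: counts taken from the built-in Titanic table (datasets package)
# Keep dimension 1 (Class) and dimension 4 (Survived), summing over Sex and Age
class_surv <- margin.table(Titanic, c(1, 4))
class_surv

# Row proportions: within each class, the share who did and did not survive
class_surv_prop <- prop.table(class_surv, 1)
round(class_surv_prop, 3)
```

Each row of `class_surv_prop` sums to 1, so the "Yes" column can be read directly as the survival rate for that class.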
The Donner Party was a group of pioneers who departed Missouri on the Oregon Trail in the spring of 1846. On their journey the group experienced delays and rugged terrain that forced them to travel in extreme winter weather with low food supplies. The group is well known for having resorted to cannibalism.
(In this example we will also learn how to import data.)
When importing this dataset, we need to tell R that the first row contains the column names:
# Import Data
donner<-read.table("https://raw.githubusercontent.com/kitadasmalley/MATH138/main/FALL_2021/Data/donner.txt",
header=TRUE)
# Look at the first 6 rows
head(donner)
## Age Sex Survived
## 1 40 Female Survived
## 2 40 Male Survived
## 3 30 Male Died
## 4 28 Male Died
## 5 40 Male Died
## 6 45 Female Died
# Look at the last 6 rows
tail(donner)
## Age Sex Survived
## 39 25 Male Died
## 40 30 Male Died
## 41 35 Male Died
## 42 23 Male Survived
## 43 24 Male Died
## 44 25 Female Survived
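To see what the `header=TRUE` argument in the import does, here is a small sketch that reads a few of the rows shown by `head(donner)` from an inline string instead of the URL. The object names `raw_text`, `with_header`, and `without_header` are illustrative.

```r
# A few rows copied from the head(donner) output, with column names on the first line
raw_text <- "Age Sex Survived
40 Female Survived
40 Male Survived
30 Male Died"

# header = TRUE: the first line supplies the column names
with_header <- read.table(text = raw_text, header = TRUE)
names(with_header)

# header = FALSE: the first line is read as data, and R assigns
# the generic names V1, V2, V3
without_header <- read.table(text = raw_text, header = FALSE)
names(without_header)
```

Without the header, the `Age` values are also read as text, because the word "Age" ends up in the same column as the numbers.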
# One dim table
table_surv<-table(donner$Survived)
table_surv
##
## Died Survived
## 24 20
prop.table(table_surv)
##
## Died Survived
## 0.5454545 0.4545455
# One dim: Bar Chart Distribution of survival
barplot(table_surv, main="Survival Distribution",
xlab="Survival")
lbls <- paste(names(table_surv), "\n", table_surv, sep="")
pie(table_surv, labels = lbls,
main="Pie Chart of Survival\n (with sample sizes)")
# Create a frequency table
# Row = Sex
# Col = Survived
table_survFM<-table(donner$Sex, donner$Survived)
table_survFM
##
## Died Survived
## Female 5 10
## Male 19 10
# Sex frequencies (summed over Survival)
# Use 1, to sum over columns
margin.table(table_survFM, 1)
##
## Female Male
## 15 29
# Survival frequencies (summed over Sex)
# Use 2, to sum over rows
margin.table(table_survFM, 2)
##
## Died Survived
## 24 20
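Both sets of marginal totals can also be shown alongside the counts in a single table. The sketch below rebuilds the sex-by-survival counts by hand from the output above (so it runs without downloading the data) and uses base R's `addmargins()`; the name `counts` is illustrative.

```r
# Counts copied from the frequency table above (rows = Sex, columns = Survived)
counts <- as.table(matrix(c(5, 19, 10, 10), nrow = 2,
                          dimnames = list(Sex = c("Female", "Male"),
                                          Survived = c("Died", "Survived"))))

# Append a "Sum" row and a "Sum" column holding the marginal totals
with_totals <- addmargins(counts)
with_totals
```

The "Sum" column reproduces `margin.table(table_survFM, 1)` and the "Sum" row reproduces `margin.table(table_survFM, 2)`.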
Table 1: Joint distribution
# cell percentages (joint distribution)
prop.table(table_survFM)
##
## Died Survived
## Female 0.1136364 0.2272727
## Male 0.4318182 0.2272727
Table 2: Conditional distribution for survival by sex
# row percentages (conditional distribution for survival by sex)
prop.table(table_survFM, 1)
##
## Died Survived
## Female 0.3333333 0.6666667
## Male 0.6551724 0.3448276
Table 3: Conditional distribution for sex by survival
# column percentages (conditional distribution for sex by survival)
prop.table(table_survFM, 2)
##
## Died Survived
## Female 0.2083333 0.5000000
## Male 0.7916667 0.5000000
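The three tables use the same counts and differ only in the denominator: the grand total for the joint distribution, the row totals for survival given sex, and the column totals for sex given survival. A minimal sketch that makes the denominators explicit, using counts typed in from the frequency table above (the object names are illustrative):

```r
# Counts copied from the frequency table (rows = Sex, columns = Survived)
counts <- matrix(c(5, 19, 10, 10), nrow = 2,
                 dimnames = list(Sex = c("Female", "Male"),
                                 Survived = c("Died", "Survived")))

# Table 1: divide every cell by the grand total
joint <- counts / sum(counts)

# Table 2: divide each row by its row total (survival given sex)
by_sex <- sweep(counts, 1, rowSums(counts), "/")

# Table 3: divide each column by its column total (sex given survival)
by_survival <- sweep(counts, 2, colSums(counts), "/")

round(joint, 3)
round(by_sex, 3)
round(by_survival, 3)
```

These should agree with `prop.table(counts)`, `prop.table(counts, 1)`, and `prop.table(counts, 2)` respectively.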
Which table tells the most compelling story?
barplot(table_survFM, main="Survival Distribution by Sex",
xlab="Survival")
# color
barplot(table_survFM, main="Survival Distribution by Sex",
xlab="Survival", col=c("darkblue", "red"),
legend=rownames(table_survFM))
barplot(table_survFM, main="Survival Distribution by Sex",
xlab="Survival", col=c("darkblue", "red"),
legend=rownames(table_survFM),
beside=TRUE)
prop1<-prop.table(table_survFM,2)
barplot(prop1, main="Survival Distribution by Sex",
xlab="Survival", col=c("darkblue", "red"),
legend=rownames(table_survFM))
1973 UC Berkeley Gender Bias in Admissions: “One of the first universities to be sued for sexual discrimination” (with a statistically significant difference)
(In this example we will also learn how to work with data already contained within R.)
# Let's look at what datasets are available
library(help="datasets")
# we're going to work with the UCBAdmissions dataset
# first let's turn it into a dataframe
data(UCBAdmissions)
ucb<-as.data.frame(UCBAdmissions)
head(ucb)
## Admit Gender Dept Freq
## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
## 5 Admitted Male B 353
## 6 Rejected Male B 207
Again, I’m going to perform a little data transformation in the background so that we have tidy data (i.e., each row represents an individual).
Here are the resulting tables and plots:
##
## Admitted Rejected
## Female 557 1278
## Male 1198 1493
##
## Admitted Rejected
## Female 0.3035422 0.6964578
## Male 0.4451877 0.5548123
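The hidden transformation is not reproduced here, but the same two tables can be obtained directly from the `Freq` column without expanding to one row per applicant. A sketch using base R's `xtabs()`; the name `gender_admit` is illustrative, and the row order may differ from the output above depending on how the factor levels are ordered.

```r
ucb <- as.data.frame(UCBAdmissions)

# Sum Freq within each Gender x Admit combination (departments combined)
gender_admit <- xtabs(Freq ~ Gender + Admit, data = ucb)
gender_admit

# Row proportions: admission and rejection rates within each gender
gender_admit_prop <- prop.table(gender_admit, 1)
gender_admit_prop
```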
This is a famous example of Simpson’s Paradox, a phenomenon in which a trend appears in several different groups but disappears or reverses when the groups are combined.
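To see the reversal in the data, the admission rates can be computed once with departments combined and once within each department. A minimal sketch working directly on the built-in `UCBAdmissions` table (dimensions `Admit`, `Gender`, `Dept`); the names `overall` and `dept_rates` are illustrative.

```r
# Overall admission rate by gender (departments combined)
overall <- prop.table(margin.table(UCBAdmissions, c(2, 1)), 1)[, "Admitted"]
round(overall, 3)

# Admission rate by gender within each department:
# proportions are taken within every Gender x Dept combination
dept_rates <- prop.table(UCBAdmissions, c(2, 3))["Admitted", , ]
round(dept_rates, 3)

# Departments in which the female admission rate exceeds the male rate
names(which(dept_rates["Female", ] > dept_rates["Male", ]))
```

Comparing `overall` with the rows of `dept_rates` shows how the combined gap in favor of men does not carry over to every department once applicants are grouped by where they applied.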