Welcome to R! R is one of the most popular statistical programming languages. We are going to use it in this class to model data and learn about different statistical learning algorithms.
The first thing that I like to know when I’m learning a new programming language is how to comment. Commenting your code is useful because it lets you leave notes for your future self, which helps especially when code gets long and messy. In R, we use the ‘#’ sign: everything that follows it on the same line is commented out (this means it is not executed).
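As a small illustration of the commenting rule just described, here is a sketch with made-up variable names (`radius`, `area`, and `hash_text` are not part of the lesson):

```r
# A full-line comment: R ignores this entire line
radius <- 2              # an inline comment: only the text after '#' is ignored
area <- pi * radius^2    # area of a circle with the radius above

# area <- 0              # this assignment is commented out, so it never runs

hash_text <- "a # inside quotes is part of the string, not a comment"
area
```

Because the `area <- 0` line is commented out, `area` still holds the value computed from `radius`.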
# numerics
my_numeric <- 42.5
class(my_numeric)
## [1] "numeric"
# integers (add the L suffix; a bare 2 is stored as numeric)
my_int <- 2L
class(my_int)
## [1] "integer"
# character strings
my_character <- "hello world"
class(my_character)
## [1] "character"
# logic/booleans
my_logical <- TRUE
class(my_logical)
## [1] "logical"
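The four classes above can also be tested and converted. The following is a minimal sketch using base R functions only; the variable names other than `my_numeric` are made up for illustration:

```r
my_numeric <- 42.5
my_whole   <- 2      # no suffix: stored as numeric (double)
my_integer <- 2L     # L suffix: stored as integer

class(my_whole)      # "numeric"
class(my_integer)    # "integer"

# is.* functions test a type, as.* functions convert to it
integer_is_numeric <- is.numeric(my_integer)   # TRUE: integers count as numeric
truncated <- as.integer(my_numeric)            # 42: the decimal part is dropped
as_text   <- as.character(my_numeric)          # "42.5"
from_text <- as.numeric("42.5")                # 42.5
as_flag   <- as.logical("TRUE")                # TRUE
```

Note that `as.integer()` truncates rather than rounds, so use `round()` first if rounding is what you want.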
# Addition
3 + 3
## [1] 6
# Subtraction
4 - 3
## [1] 1
# Multiplication
4 * 3
## [1] 12
# Division
(6 * 4) / 3
## [1] 8
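Beyond the four operations above, R has a few more arithmetic operators that come up often. A short sketch (the variable names are only for illustration):

```r
power     <- 2^3          # exponentiation: 8
remainder <- 7 %% 3       # modulo (remainder after division): 1
quotient  <- 7 %/% 3      # integer division: 2

# Multiplication and division are evaluated before addition and subtraction,
# so parentheses change the result
no_parens   <- 2 + 3 * 4    # 14
with_parens <- (2 + 3) * 4  # 20
```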
At 11:40 pm on April 14, 1912, the Titanic hit an iceberg on its maiden voyage from Southampton to New York City. Our data represent 2201 people: passengers and crew.
(In this example we will also learn how to install packages.)
#install.packages("titanic")
library(titanic)
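The `install.packages()` line above is commented out because a package only needs to be installed once, while `library()` must be run in every new session. One way to automate that is sketched below; `ensure_package` is a hypothetical helper written for this example, not a base R function:

```r
# Hypothetical helper: install a package only when it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Demonstrated with "stats", which ships with R, so nothing is downloaded here
stats_available <- ensure_package("stats")

# For this lesson you would instead run:
# ensure_package("titanic")
# library(titanic)
```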
Note that the titanic package contains two main datasets, titanic_train and titanic_test. Here we look at titanic_test:
#View(titanic_test)
dim(titanic_test)
## [1] 418 11
str(titanic_test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
# Run the view command in your console
# View(titanic_test)
I’m going to perform a little data transformation in the background so that we have tidy data (i.e., each row represents an individual). I’m calling this new dataframe titanicDF. Note that this code is suppressed because it is not the focus of this exercise. Instead, focus on the tables and corresponding plots.
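The suppressed code is not shown, so the following is only a guess at what such a transformation could look like. It assumes the starting point is R's built-in `Titanic` contingency table (which has a Crew category and 2201 people in total) and repeats each combination of categories once per person counted:

```r
# Assumption: start from the built-in Titanic table, not the titanic package
titanic_counts <- as.data.frame(Titanic)   # columns: Class, Sex, Age, Survived, Freq

# Repeat each row index Freq times so that every person gets one row
row_index <- rep(seq_len(nrow(titanic_counts)), times = titanic_counts$Freq)
titanicDF <- titanic_counts[row_index, c("Class", "Sex", "Age", "Survived")]
rownames(titanicDF) <- NULL

dim(titanicDF)           # one row per person, four categorical columns
table(titanicDF$Class)   # counts by class
```

Rows whose `Freq` is zero are dropped automatically, because `rep()` repeats them zero times.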
First, let's look at the distribution of passengers by class:
##
## 1st 2nd 3rd Crew
## 325 285 706 885
We hypothesize that survival rates may differ depending on passenger class.
##
## No Yes
## 1st 122 203
## 2nd 167 118
## 3rd 528 178
## Crew 673 212
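Raw counts are hard to compare because the groups differ in size. A sketch of how the class-by-survival table above could be turned into survival rates, again assuming the built-in `Titanic` table as the source (the object names are illustrative):

```r
# Collapse the 4-way Titanic table to Class (dimension 1) by Survived (dimension 4)
class_by_survival <- margin.table(Titanic, c(1, 4))

# Divide each row by its total to get the proportion surviving within each class
survival_rates <- prop.table(class_by_survival, 1)
round(survival_rates, 3)
```

Each row of `survival_rates` sums to 1, so the "Yes" column can be read directly as the survival rate for that class.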
The Donner Party was a group of pioneers who departed Missouri on the Oregon Trail in the spring of 1846. On their journey, the group experienced delays and rugged terrain that forced them to travel in extreme winter weather with low food supplies. The group is well known for having resorted to cannibalism.
(In this example we will also learn how to import data.)
We will need to name the columns of this dataset:
# Import Data
donner <- read.table(
  "https://online.stat.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/donner/index.txt",
  header = TRUE
)
# Name columns
colnames(donner) <- c("Age", "Sex", "Survived")
# Look at the first 6 rows
head(donner)
## Age Sex Survived
## 1 40 0 1
## 2 40 1 1
## 3 30 1 0
## 4 28 1 0
## 5 40 1 0
## 6 45 0 0
# Look at the last 6 rows
tail(donner)
## Age Sex Survived
## 39 25 1 0
## 40 30 1 0
## 41 35 1 0
## 42 23 1 1
## 43 24 1 0
## 44 25 0 1
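When importing a file, it is worth checking how the `header` argument of `read.table()` treats the first line. The sketch below uses a small inline text (the first three rows printed by `head(donner)`, reused purely for illustration) rather than the course file, and it makes no claim about what that file actually contains:

```r
# Illustrative raw text with no header line
raw_text <- "40 0 1
40 1 1
30 1 0"

# header = FALSE keeps every line as data; col.names supplies the names
no_header <- read.table(text = raw_text, header = FALSE,
                        col.names = c("Age", "Sex", "Survived"))

# header = TRUE uses the first line as column names, so one data row is lost
with_header <- read.table(text = raw_text, header = TRUE)

nrow(no_header)     # 3 rows of data
nrow(with_header)   # 2 rows of data
```

Comparing `nrow()` against the number of records you expect is a quick way to catch a misread header.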
# Create a frequency table
# Row = Sex
# Col = Survived
mytable <- table(donner$Sex, donner$Survived)
mytable
##
## 0 1
## 0 5 10
## 1 19 10
# Sex frequencies (summed over Survival)
# Use margin 1 to get the row totals (summing across columns)
margin.table(mytable, 1)
##
## 0 1
## 15 29
# Survival frequencies (summed over Sex)
# Use margin 2 to get the column totals (summing across rows)
margin.table(mytable, 2)
##
## 0 1
## 24 20
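Both sets of totals can also be shown alongside the counts in one step with `addmargins()`. So that it runs without the download, this sketch rebuilds the frequency table from the counts printed above; `donner_counts` and `with_totals` are illustrative names:

```r
# Rebuild the 2 x 2 table from the printed counts (filled column by column)
donner_counts <- as.table(matrix(
  c(5, 19, 10, 10), nrow = 2,
  dimnames = list(Sex = c("0", "1"), Survived = c("0", "1"))
))

# Append a "Sum" row and a "Sum" column holding the marginal totals
with_totals <- addmargins(donner_counts)
with_totals
```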
Table 1: Joint distribution
# cell proportions (joint distribution)
prop.table(mytable)
##
## 0 1
## 0 0.1136364 0.2272727
## 1 0.4318182 0.2272727
Table 2: Conditional distribution for survival by sex
# row proportions (conditional distribution for survival by sex)
prop.table(mytable, 1)
##
## 0 1
## 0 0.3333333 0.6666667
## 1 0.6551724 0.3448276
Table 3: Conditional distribution for sex by survival
# column proportions (conditional distribution for sex by survival)
prop.table(mytable, 2)
##
## 0 1
## 0 0.2083333 0.5000000
## 1 0.7916667 0.5000000
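The three `prop.table()` calls above differ only in what each count is divided by: the grand total, the row total, or the column total. A sketch that reproduces them by hand from the printed counts (the object names are illustrative):

```r
counts <- matrix(
  c(5, 19, 10, 10), nrow = 2,
  dimnames = list(Sex = c("0", "1"), Survived = c("0", "1"))
)

# Table 1: divide every cell by the grand total (44)
joint <- counts / sum(counts)

# Table 2: divide each cell by its row total (15 or 29)
survival_given_sex <- counts / rowSums(counts)

# Table 3: divide each cell by its column total (24 or 20)
sex_given_survival <- sweep(counts, 2, colSums(counts), "/")

round(survival_given_sex, 3)
```

For example, the top-right cell of Table 2 is 10 / 15, and the top-right cell of Table 3 is 10 / 20.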
Which table tells the most compelling story?
1973 UC Berkeley Gender Bias in Admissions: “One of the first universities to be sued for sexual discrimination” (the difference in admission rates was statistically significant).
(In this example we will also learn how to work with data already contained within R.)
# Let's look at what datasets are available
library(help="datasets")
# we're going to work with the UCBAdmissions dataset
# first let's turn it into a dataframe
data(UCBAdmissions)
ucb <- as.data.frame(UCBAdmissions)
head(ucb)
## Admit Gender Dept Freq
## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
## 5 Admitted Male B 353
## 6 Rejected Male B 207
Again, I’m going to perform a little data transformation in the background so that we have tidy data (i.e., each row represents an individual).
Here are the resulting tables and plots:
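The background code is not shown, so this is only one possible version of it. The sketch repeats each row of `ucb` once per applicant counted in `Freq` and then tabulates gender against admission; `ucb_tidy` and `gender_by_admit` are illustrative names:

```r
ucb <- as.data.frame(UCBAdmissions)

# Repeat each row index Freq times so that every applicant gets one row
row_index <- rep(seq_len(nrow(ucb)), times = ucb$Freq)
ucb_tidy <- ucb[row_index, c("Admit", "Gender", "Dept")]
rownames(ucb_tidy) <- NULL

# Counts and row proportions of admission by gender
gender_by_admit <- table(ucb_tidy$Gender, ucb_tidy$Admit)
gender_by_admit
prop.table(gender_by_admit, 1)
```

The rows may print in a different order than in the tables shown in this lesson, since the order depends on the factor levels.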
##
## Admitted Rejected
## Female 557 1278
## Male 1198 1493
##
## Admitted Rejected
## Female 0.3035422 0.6964578
## Male 0.4451877 0.5548123
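This is a famous example of Simpson’s paradox, a phenomenon in which a trend that appears in several different groups disappears or reverses when the groups are combined. Here, the combined table shows men admitted at a higher rate than women (about 45% versus 30%), but the department-level admission rates do not show the same consistent gap.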
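To see the paradox in the data, compare the combined admission rates with the rates inside each department. A minimal sketch using the built-in `UCBAdmissions` table directly (the object names are illustrative):

```r
# Admitted applicants and total applicants, each as a Gender x Dept matrix
admitted   <- UCBAdmissions["Admitted", , ]
applicants <- apply(UCBAdmissions, c(2, 3), sum)

# Admission rate within each department
dept_rates <- admitted / applicants
round(dept_rates, 2)

# Admission rate with all departments combined
overall_rates <- rowSums(admitted) / rowSums(applicants)
round(overall_rates, 2)

# Number of departments in which the female rate exceeds the male rate
female_higher <- sum(dept_rates["Female", ] > dept_rates["Male", ])
female_higher
```

The combined rate favors men, yet in most of the six departments the female rate is the higher one, which is the reversal that Simpson's paradox describes.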