MATH 138: Lab 1

An Introduction to R and Tables

Welcome to R! R is one of the most popular statistical programming languages. We are going to use it in this class to model data and learn about different statistical learning algorithms.

Learning Objectives:

Use basic operators
Install packages
Find help files
Assess the structure of a dataframe
Create frequency tables
Visually compare frequencies using bar charts

Coding basics

Commenting

The first thing that I like to know when I’m learning a new programming language is how to comment. Commenting your code is useful because it lets you leave notes for your future self, which helps especially when code gets long and messy. In R, a comment starts with the ‘#’ sign: everything after it on that line is ignored when the code runs (it is not executable).
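As a small illustration of the two common comment styles, here is a sketch with a full-line comment, a trailing comment, and a commented-out line. The variable name lab_number is only an example and is not used elsewhere in this lab.

```r
# A full-line comment: R ignores everything after the '#'
lab_number <- 1  # a trailing comment can follow code on the same line

# Commenting out a line keeps it in the script without running it
# lab_number <- 2

lab_number
```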

Data types and Object Assignment

# numerics
my_numeric <- 42.5
class(my_numeric)

## [1] "numeric"

# integers (the L suffix makes the value an integer; a bare 2 is stored as numeric)
my_int <- 2L
class(my_int)

## [1] "integer"

# character strings
my_character <- "hello world"
class(my_character)

## [1] "character"

# logic/booleans
my_logical <- TRUE
class(my_logical)

## [1] "logical"
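One detail that often surprises new users is that R stores a bare whole number as a double, so its class is "numeric" rather than "integer". The sketch below shows the difference and a few of the is.*() checks; it uses only base R.

```r
# A bare whole number is stored as a double, so its class is "numeric"
class(2)

# The L suffix asks for a true integer
class(2L)

# as.integer() converts an existing numeric value
class(as.integer(2))

# is.*() functions test a type and return TRUE or FALSE
is.numeric(2L)     # integers also count as numeric
is.character("2")  # quotes make it a character string
```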

Basic operators

# Addition
3 + 3

## [1] 6

# Subtraction
4 - 3

## [1] 1

# Multiplication
4 * 3

## [1] 12

# Division
(6 * 4) / 3

## [1] 8
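Beyond the four arithmetic operators above, a few other base R operators come up often. This is a brief sketch, not an exhaustive list.

```r
# Exponentiation
2 ^ 3        # 8

# Integer division and remainder (modulo)
7 %/% 2      # 3
7 %% 2       # 1

# Comparison operators return logical values
4 > 3        # TRUE
4 == 3       # FALSE

# Operators work element-wise on vectors
c(1, 2, 3) * 2
```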

Motivating Example #1: Titanic

At 11:40 pm on April 14, 1912, the Titanic hit an iceberg on its maiden voyage from Southampton to New York City. Our data represent the 2201 people on board, passengers and crew.

(In this example we will also learn how to install packages.)

Install a package

#install.packages("titanic")
library(titanic)
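A package only needs to be installed once, but it must be loaded with library() in every new session. One possible pattern for avoiding repeated installs is sketched below; the helper name ensure_package is hypothetical and not part of R or of this lab.

```r
# Hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Usage (uncomment to run; installing needs an internet connection)
# ensure_package("titanic")
# library(titanic)
```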

We will use two Titanic datasets:

titanic_test : a subset of the passenger data, one row per passenger (from the titanic package)
Titanic : data pre-organized into frequencies (built into R’s datasets package)
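Help files are the quickest way to find out what a dataset or function contains. The sketch below lists the usual ways to open one; the interactive calls are commented out, in the same way as View() elsewhere in this lab, because they open a separate help window.

```r
# Open the help page for a function or dataset (run these in your console)
# ?Titanic
# help("table")
# help(package = "titanic")   # index of everything in a package

# Show the arguments a function accepts without opening its help page
args(prop.table)

# Check whether a dataset ships with R's built-in datasets package
"Titanic" %in% data(package = "datasets")$results[, "Item"]
```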

What is the structure of your data?

#View(titanic_test)
dim(titanic_test)

## [1] 418  11

str(titanic_test)

## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...

# Run the view command in your console
# View(titanic_test)
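The same structure checks work on any data frame. As a sketch that runs without installing anything, the example below applies a few more of them to the built-in Titanic frequency table after converting it to a data frame; the name titanic_freq is chosen here for illustration.

```r
# Built-in frequency version of the Titanic data, converted to a data frame
titanic_freq <- as.data.frame(Titanic)

nrow(titanic_freq)            # number of rows
ncol(titanic_freq)            # number of columns
names(titanic_freq)           # column names
sapply(titanic_freq, class)   # class of each column
summary(titanic_freq$Freq)    # min, quartiles, mean, and max of the counts
```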

I’m going to perform a little data transformation in the background so that we have tidy data (i.e., each row represents an individual). I’m calling this new dataframe titanicDF. Note that this code is suppressed because it is not the focus of this exercise. Focus instead on the tables and corresponding plots.

First, let’s look at the distribution of passengers and crew by class:

## 
##  1st  2nd  3rd Crew 
##  325  285  706  885
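The suppressed transformation is not shown in this lab. One way it could be done, assuming titanicDF is built from the built-in Titanic frequency table, is to repeat each row once per person counted in it. This is a sketch under that assumption, not necessarily the code used to produce the output above.

```r
# Assumption: start from the built-in Titanic frequency table
titanic_freq <- as.data.frame(Titanic)

# Repeat each row index Freq times so that every person gets a row
row_index <- rep(seq_len(nrow(titanic_freq)), times = titanic_freq$Freq)
titanicDF <- titanic_freq[row_index, c("Class", "Sex", "Age", "Survived")]
rownames(titanicDF) <- NULL

dim(titanicDF)            # one row per person
table(titanicDF$Class)    # distribution by class
```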

We hypothesize that survival rates may differ depending on class. Rows are classes and columns indicate survival (No/Yes):

##       
##         No Yes
##   1st  122 203
##   2nd  167 118
##   3rd  528 178
##   Crew 673 212
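Raw counts are hard to compare because the classes differ in size, so it can help to convert each row to proportions and plot them. The sketch below works directly from the built-in Titanic table rather than titanicDF, and the plot labels are illustrative choices.

```r
# Collapse the built-in 4-way Titanic table to Class (dim 1) by Survived (dim 4)
class_survival <- margin.table(Titanic, c(1, 4))
class_survival

# Row proportions: share who died or survived within each class
survival_rates <- prop.table(class_survival, 1)
round(survival_rates, 3)

# Side-by-side bars, one group of bars per class
barplot(t(survival_rates), beside = TRUE, legend.text = TRUE,
        xlab = "Class", ylab = "Proportion", main = "Survival by class")
```

Transposing with t() puts the classes on the horizontal axis, so each group of bars compares "No" and "Yes" within one class.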

Motivating Example #2: Donner Party

The Donner Party was a group of pioneers who departed Missouri on the Oregon Trail in the spring of 1846. On their journey, the group experienced delays and rugged terrain, which forced them to travel in extreme winter weather with low food supplies. The group is well known for having resorted to cannibalism.

(In this example we will also learn how to import data.)

This dataset has three columns, which we will need to name:

Age
Sex (1 = male, 0 = female)
Survived (1 = survived, 0 = died)

Import your data

# Import Data
donner <- read.table("https://online.stat.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/donner/index.txt",
                     header = TRUE)

# Name columns
colnames(donner) <- c("Age", "Sex", "Survived")
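With header = TRUE, read.table() uses the first line of the file as column names, so it is worth checking whether that line is really a header. The sketch below shows the alternative for a file with no header line: read with header = FALSE and supply the names through col.names. It uses inline text instead of the URL so that it runs offline; the three rows are the first three shown by head(donner) in this lab, and the name donner_demo is illustrative.

```r
# Assumption: three whitespace-separated columns with no header line
raw_text <- "40 0 1
40 1 1
30 1 0"

# col.names sets the column names at import time
donner_demo <- read.table(text = raw_text,
                          header = FALSE,
                          col.names = c("Age", "Sex", "Survived"))
donner_demo
str(donner_demo)
```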

Look at your data

# Look at the first 6 rows 
head(donner)

##   Age Sex Survived
## 1  40   0        1
## 2  40   1        1
## 3  30   1        0
## 4  28   1        0
## 5  40   1        0
## 6  45   0        0

# Look at the last 6 rows
tail(donner)

##    Age Sex Survived
## 39  25   1        0
## 40  30   1        0
## 41  35   1        0
## 42  23   1        1
## 43  24   1        0
## 44  25   0        1

Now it’s your turn! Let’s make some tables!

This is a joint table of both sex and survival:

# Create a frequency table
# Row = Sex
# Col = Survived
mytable<-table(donner$Sex, donner$Survived)
mytable

##    
##      0  1
##   0  5 10
##   1 19 10
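A table of 0s and 1s is easy to misread. One option is to turn the codes into labeled factors before tabulating, using the coding given in this lab (Sex: 1 = male, 0 = female; Survived: 1 = survived, 0 = died). To stay self-contained, this sketch rebuilds the rows from the cell counts in the joint table above; donner_demo and labeled_table are illustrative names.

```r
# Rebuild the 44 rows from the cell counts in the joint table
counts <- c(5, 10, 19, 10)
donner_demo <- data.frame(
  Sex      = rep(c(0, 0, 1, 1), times = counts),
  Survived = rep(c(0, 1, 0, 1), times = counts)
)

# Attach readable labels to the 0/1 codes
donner_demo$Sex <- factor(donner_demo$Sex, levels = c(0, 1),
                          labels = c("Female", "Male"))
donner_demo$Survived <- factor(donner_demo$Survived, levels = c(0, 1),
                               labels = c("Died", "Survived"))

# dnn names the two dimensions of the table
labeled_table <- table(donner_demo$Sex, donner_demo$Survived,
                       dnn = c("Sex", "Survived"))
labeled_table
```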

These are marginal tables

# Sex frequencies (summed over Survival)
# Use 1, to sum over columns 
margin.table(mytable, 1)

## 
##  0  1 
## 15 29

# Survival frequencies (summed over Sex)
# Use 2, to sum over rows
margin.table(mytable, 2)

## 
##  0  1 
## 24 20
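Both sets of marginal totals can also be shown alongside the joint counts in a single table. The sketch below rebuilds mytable from the counts shown above so that it runs on its own, then uses addmargins() from base R.

```r
# Rebuild the joint table from its counts (matrix() fills column by column)
mytable <- as.table(matrix(c(5, 19, 10, 10), nrow = 2,
                           dimnames = list(Sex = c("0", "1"),
                                           Survived = c("0", "1"))))

# Append a "Sum" row and column holding the marginal totals
addmargins(mytable)

# The same totals computed directly
rowSums(mytable)   # Sex totals
colSums(mytable)   # Survival totals
```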

Find the relative proportions

Table 1: Joint distribution

# cell proportions (joint distribution)
prop.table(mytable)

##    
##             0         1
##   0 0.1136364 0.2272727
##   1 0.4318182 0.2272727

Table 2: Conditional distribution for survival by sex

# row proportions (conditional distribution for survival by sex)
prop.table(mytable, 1)

##    
##             0         1
##   0 0.3333333 0.6666667
##   1 0.6551724 0.3448276

Table 3: Conditional distribution for sex by survival

# column proportions (conditional distribution for sex by survival)
prop.table(mytable, 2)

##    
##             0         1
##   0 0.2083333 0.5000000
##   1 0.7916667 0.5000000
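The three tables differ only in what each cell is divided by: the grand total, its row total, or its column total. The sketch below redoes that arithmetic by hand and checks it against prop.table(); it rebuilds mytable from the counts shown earlier so that it runs on its own.

```r
# Rebuild the joint table from its counts (matrix() fills column by column)
mytable <- as.table(matrix(c(5, 19, 10, 10), nrow = 2,
                           dimnames = list(Sex = c("0", "1"),
                                           Survived = c("0", "1"))))

# Joint: each cell divided by the grand total
joint <- mytable / sum(mytable)

# Conditional on sex: each cell divided by its row total
by_row <- mytable / rowSums(mytable)

# Conditional on survival: each cell divided by its column total
by_col <- sweep(mytable, 2, colSums(mytable), "/")

# Each manual version agrees with prop.table()
all.equal(as.vector(joint),  as.vector(prop.table(mytable)))
all.equal(as.vector(by_row), as.vector(prop.table(mytable, 1)))
all.equal(as.vector(by_col), as.vector(prop.table(mytable, 2)))

# Multiply by 100 if you want percentages instead of proportions
round(100 * prop.table(mytable, 1), 1)
```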

Which table tells the most compelling story?

Motivating Example #3: Gender Bias in College Admission

The 1973 UC Berkeley admissions data are a classic case study of gender bias in admissions: Berkeley was “one of the first universities to be sued for sexual discrimination” (the difference in admission rates between men and women was statistically significant).

(In this example we will also learn how to work with data already contained within R.)

Use data that is built into R

# Let's look at what datasets are available
library(help="datasets")

# we're going to work with the UCBAdmissions dataset
# first let's turn it into a dataframe
data(UCBAdmissions)
ucb<-as.data.frame(UCBAdmissions)
head(ucb)

##      Admit Gender Dept Freq
## 1 Admitted   Male    A  512
## 2 Rejected   Male    A  313
## 3 Admitted Female    A   89
## 4 Rejected Female    A   19
## 5 Admitted   Male    B  353
## 6 Rejected   Male    B  207

Again, I’m going to perform a little data transformation in the background so that we have tidy data (i.e., each row represents an individual).

Here are the resulting tables and plots

##         
##          Admitted Rejected
##   Female      557     1278
##   Male       1198     1493

##         
##           Admitted  Rejected
##   Female 0.3035422 0.6964578
##   Male   0.4451877 0.5548123

This is a famous example of Simpson’s Paradox, a phenomenon in which a trend appears in several different groups but disappears or reverses when the groups are combined.
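To see the paradox in these data, compare the combined admission rates with the rates inside each department. The sketch below works directly from the built-in UCBAdmissions table; the object names overall and by_dept are illustrative, and the row order (Male, Female) follows the built-in table rather than the output shown above.

```r
# Dimension order in UCBAdmissions is Admit (1), Gender (2), Dept (3)

# Combined admission rate by gender, summed over departments
overall <- prop.table(margin.table(UCBAdmissions, c(2, 1)), 1)
round(overall, 3)

# Admission rate by gender within each department
by_dept <- prop.table(UCBAdmissions, c(2, 3))["Admitted", , ]
round(by_dept, 3)

# Departments where the female admission rate exceeds the male rate
names(which(by_dept["Female", ] > by_dept["Male", ]))
```

Combined, men are admitted at a higher rate, yet within four of the six departments (A, B, D, and F) the female rate is the higher one. The combined gap largely reflects which departments applicants chose.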