Instructions

  1. Go to Kaggle.com (owned by Google). Create a free account.

  2. Sign up for the Titanic: Machine Learning through Disaster competition located here: https://www.kaggle.com/c/titanic/data?select=train.csv

  3. Download the train.csv data.

  4. Open the train.csv file in R. To do so, use something like mydata <- read.csv(‘D:/train.csv’) but replace ‘D:’ with the directory where you saved the file. You can read up on Assignment Operators in R (i.e. “<-”). You can also try Session -> Set Working Directory -> Choose Directory, and then File -> Import Dataset -> From Text (base)

Then answer the following questions.

(Upload your work as a .pdf file only. Make sure for this assignment and all assignments that you show all R code.)

#  Clear the workspace
rm(list=ls())    # remove all objects from environment
cat("\f")        # Clear the console
graphics.off()   # Clear all graphs
#gc()            # Clear unused memory

?read.csv   # see na.strings options. a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields (but not in character variables)... 

train <- read.csv("/Users/arvindsharma/Library/CloudStorage/Dropbox/WCAS/Data Analysis/Data Analysis - Fall 2023/Data Analysis - Fall 2023 (shared files)/Week_1-2/titanic/train.csv", 
                  na.strings = c("") # na.strings=c("", ".", "NA") more generally
                  ) 

head(train)     # check if data imported correctly  - first few rows seem right
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500  <NA>        S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250  <NA>        S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500  <NA>        S
## 6           330877  8.4583  <NA>        Q
tail(train)     # check if data imported correctly  - last few rows seem right
##     PassengerId Survived Pclass                                     Name    Sex
## 886         886        0      3     Rice, Mrs. William (Margaret Norton) female
## 887         887        0      2                    Montvila, Rev. Juozas   male
## 888         888        1      1             Graham, Miss. Margaret Edith female
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
## 890         890        1      1                    Behr, Mr. Karl Howell   male
## 891         891        0      3                      Dooley, Mr. Patrick   male
##     Age SibSp Parch     Ticket   Fare Cabin Embarked
## 886  39     0     5     382652 29.125  <NA>        Q
## 887  27     0     0     211536 13.000  <NA>        S
## 888  19     0     0     112053 30.000   B42        S
## 889  NA     1     2 W./C. 6607 23.450  <NA>        S
## 890  26     0     0     111369 30.000  C148        C
## 891  32     0     0     370376  7.750  <NA>        Q
# View(train)     # manually eyeball rest of the data
# typeof(train$Cabin)

table(train$Cabin)
## 
##             A10             A14             A16             A19             A20 
##               1               1               1               1               1 
##             A23             A24             A26             A31             A32 
##               1               1               1               1               1 
##             A34             A36              A5              A6              A7 
##               1               1               1               1               1 
##            B101            B102             B18             B19             B20 
##               1               1               2               1               2 
##             B22             B28              B3             B30             B35 
##               2               2               1               1               2 
##             B37             B38             B39              B4             B41 
##               1               1               1               1               1 
##             B42             B49              B5             B50     B51 B53 B55 
##               1               2               2               1               2 
## B57 B59 B63 B66         B58 B60             B69             B71             B73 
##               2               2               1               1               1 
##             B77             B78             B79             B80         B82 B84 
##               2               1               1               1               1 
##             B86             B94         B96 B98            C101            C103 
##               1               1               4               1               1 
##            C104            C106            C110            C111            C118 
##               1               1               1               1               1 
##            C123            C124            C125            C126            C128 
##               2               2               2               2               1 
##            C148              C2         C22 C26     C23 C25 C27             C30 
##               1               2               3               4               1 
##             C32             C45             C46             C47             C49 
##               1               1               1               1               1 
##             C50             C52             C54         C62 C64             C65 
##               1               2               1               1               2 
##             C68              C7             C70             C78             C82 
##               2               1               1               2               1 
##             C83             C85             C86             C87             C90 
##               2               1               1               1               1 
##             C91             C92             C93             C95             C99 
##               1               2               2               1               1 
##               D         D10 D12             D11             D15             D17 
##               3               1               1               1               2 
##             D19             D20             D21             D26             D28 
##               1               2               1               2               1 
##             D30             D33             D35             D36             D37 
##               1               2               2               2               1 
##             D45             D46             D47             D48             D49 
##               1               1               1               1               1 
##             D50             D56              D6              D7              D9 
##               1               1               1               1               1 
##             E10            E101             E12            E121             E17 
##               1               3               1               2               1 
##             E24             E25             E31             E33             E34 
##               2               2               1               2               1 
##             E36             E38             E40             E44             E46 
##               1               1               1               2               1 
##             E49             E50             E58             E63             E67 
##               1               1               1               1               2 
##             E68             E77              E8           F E69           F G63 
##               1               1               2               1               1 
##           F G73              F2             F33             F38              F4 
##               2               3               3               1               2 
##              G6               T 
##               4               1

1 2 way cross tabulation (review)

You can provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex).

table command helps create a one way and two way frequency table.

?table      # table uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.

table(train$Survived) # one way table 
## 
##   0   1 
## 549 342
table(train$Sex)      # one way table 
## 
## female   male 
##    314    577
Survival_Sex_Table <- table(train$Survived , 
                            train$Sex
                            ) # two way table - first variable on rows, second on columns 


print(Survival_Sex_Table)  # Survived on rows, Gender in columns
##    
##     female male
##   0     81  468
##   1    233  109
colnames(Survival_Sex_Table) = c("Female", "Male") 
rownames(Survival_Sex_Table) = c("Died", "Survived")

# dim(Survival_Sex_Table)   # 2 (row) by 2 (column) table 

options(digits=4)

#calculate frequency table of proportions 
?prop.table
print(Survival_Sex_PropTable <- prop.table(x      = Survival_Sex_Table,
                                           margin = 2 #  for a matrix 1 indicates rows, 2 indicates columns
                                           )
      )
##           
##            Female   Male
##   Died     0.2580 0.8111
##   Survived 0.7420 0.1889
  • Of the total 891 Titanic passengers from this training data subset, 342 survived. This is a survival rate of about 38.4%.

  • There were 233 surviving women and 109 surviving men. It is most helpful to put these numbers into context by examining the survival rate.

    • The survival rate for women versus men was 74.2% and 18.9% respectively.

This is a dramatic difference at a glance. Use of a statistical test (e.g. T-test, Bayesian, etc.) could help to determine if this is a significant difference by mathematical standards, but there is a clear practical difference in survival rate by sex.

2 3 way cross tabulation

2.1 R

table(train$Survived,
      train$Sex, 
      train$Pclass
      )
## , ,  = 1
## 
##    
##     female male
##   0      3   77
##   1     91   45
## 
## , ,  = 2
## 
##    
##     female male
##   0      6   91
##   1     70   17
## 
## , ,  = 3
## 
##    
##     female male
##   0     72  300
##   1     72   47

2.2 Excel - Pivot Tables

I can create these pivot tables within 30 seconds as well in Excl itself whihc can be useful for exploratory data analysis.

Make sure your numbers match.