Instructions
Go to Kaggle.com (owned by Google). Create a free account.
Sign up for the Titanic: Machine Learning through Disaster competition located here: https://www.kaggle.com/c/titanic/data?select=train.csv
Download the train.csv data.
Open the train.csv file in R. To do so, use something like mydata <- read.csv(‘D:/train.csv’) but replace ‘D:’ with the directory where you saved the file. You can read up on Assignment Operators in R (i.e. “<-”). You can also try Session -> Set Working Directory -> Choose Directory, and then File -> Import Dataset -> From Text (base)…
Then answer the following questions.
(Upload your work as a .pdf file only. Make sure for this assignment and all assignments that you show all R code.)
# Clear the workspace
rm(list=ls()) # remove all objects from environment
cat("\f") # Clear the console
graphics.off() # Clear all graphs
#gc() # Clear unused memory
?read.csv # see na.strings options. a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields (but not in character variables)...
train <- read.csv("/Users/arvindsharma/Library/CloudStorage/Dropbox/WCAS/Data Analysis/Data Analysis - Fall 2023/Data Analysis - Fall 2023 (shared files)/Week_1-2/titanic/train.csv",
na.strings = c("") # na.strings=c("", ".", "NA") more generally
)
head(train) # check if data imported correctly - first few rows seem right
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 <NA> S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 <NA> S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 <NA> S
## 6 330877 8.4583 <NA> Q
tail(train) # check if data imported correctly - last few rows seem right
## PassengerId Survived Pclass Name Sex
## 886 886 0 3 Rice, Mrs. William (Margaret Norton) female
## 887 887 0 2 Montvila, Rev. Juozas male
## 888 888 1 1 Graham, Miss. Margaret Edith female
## 889 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female
## 890 890 1 1 Behr, Mr. Karl Howell male
## 891 891 0 3 Dooley, Mr. Patrick male
## Age SibSp Parch Ticket Fare Cabin Embarked
## 886 39 0 5 382652 29.125 <NA> Q
## 887 27 0 0 211536 13.000 <NA> S
## 888 19 0 0 112053 30.000 B42 S
## 889 NA 1 2 W./C. 6607 23.450 <NA> S
## 890 26 0 0 111369 30.000 C148 C
## 891 32 0 0 370376 7.750 <NA> Q
# View(train) # manually eyeball rest of the data
# typeof(train$Cabin)
table(train$Cabin)
##
## A10 A14 A16 A19 A20
## 1 1 1 1 1
## A23 A24 A26 A31 A32
## 1 1 1 1 1
## A34 A36 A5 A6 A7
## 1 1 1 1 1
## B101 B102 B18 B19 B20
## 1 1 2 1 2
## B22 B28 B3 B30 B35
## 2 2 1 1 2
## B37 B38 B39 B4 B41
## 1 1 1 1 1
## B42 B49 B5 B50 B51 B53 B55
## 1 2 2 1 2
## B57 B59 B63 B66 B58 B60 B69 B71 B73
## 2 2 1 1 1
## B77 B78 B79 B80 B82 B84
## 2 1 1 1 1
## B86 B94 B96 B98 C101 C103
## 1 1 4 1 1
## C104 C106 C110 C111 C118
## 1 1 1 1 1
## C123 C124 C125 C126 C128
## 2 2 2 2 1
## C148 C2 C22 C26 C23 C25 C27 C30
## 1 2 3 4 1
## C32 C45 C46 C47 C49
## 1 1 1 1 1
## C50 C52 C54 C62 C64 C65
## 1 2 1 1 2
## C68 C7 C70 C78 C82
## 2 1 1 2 1
## C83 C85 C86 C87 C90
## 2 1 1 1 1
## C91 C92 C93 C95 C99
## 1 2 2 1 1
## D D10 D12 D11 D15 D17
## 3 1 1 1 2
## D19 D20 D21 D26 D28
## 1 2 1 2 1
## D30 D33 D35 D36 D37
## 1 2 2 2 1
## D45 D46 D47 D48 D49
## 1 1 1 1 1
## D50 D56 D6 D7 D9
## 1 1 1 1 1
## E10 E101 E12 E121 E17
## 1 3 1 2 1
## E24 E25 E31 E33 E34
## 2 2 1 2 1
## E36 E38 E40 E44 E46
## 1 1 1 2 1
## E49 E50 E58 E63 E67
## 1 1 1 1 2
## E68 E77 E8 F E69 F G63
## 1 1 2 1 1
## F G73 F2 F33 F38 F4
## 2 3 3 1 2
## G6 T
## 4 1
You can provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex).
table command helps create a one way and two way frequency table.
?table # table uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
table(train$Survived) # one way table
##
## 0 1
## 549 342
table(train$Sex) # one way table
##
## female male
## 314 577
Survival_Sex_Table <- table(train$Survived ,
train$Sex
) # two way table - first variable on rows, second on columns
print(Survival_Sex_Table) # Survived on rows, Gender in columns
##
## female male
## 0 81 468
## 1 233 109
colnames(Survival_Sex_Table) = c("Female", "Male")
rownames(Survival_Sex_Table) = c("Died", "Survived")
# dim(Survival_Sex_Table) # 2 (row) by 2 (column) table
options(digits=4)
#calculate frequency table of proportions
?prop.table
print(Survival_Sex_PropTable <- prop.table(x = Survival_Sex_Table,
margin = 2 # for a matrix 1 indicates rows, 2 indicates columns
)
)
##
## Female Male
## Died 0.2580 0.8111
## Survived 0.7420 0.1889
Of the total 891 Titanic passengers from this training data subset, 342 survived. This is a survival rate of about 38.4%.
There were 233 surviving women and 109 surviving men. It is most helpful to put these numbers into context by examining the survival rate.
This is a dramatic difference at a glance. Use of a statistical test (e.g. T-test, Bayesian, etc.) could help to determine if this is a significant difference by mathematical standards, but there is a clear practical difference in survival rate by sex.
table(train$Survived,
train$Sex,
train$Pclass
)
## , , = 1
##
##
## female male
## 0 3 77
## 1 91 45
##
## , , = 2
##
##
## female male
## 0 6 91
## 1 70 17
##
## , , = 3
##
##
## female male
## 0 72 300
## 1 72 47
I can create these pivot tables within 30 seconds as well in Excl itself whihc can be useful for exploratory data analysis.
Make sure your numbers match.