Instructions
Go to Kaggle.com (owned by Google). Create a free account.
Sign up for the Titanic - Machine Learning from Disaster competition located here: https://www.kaggle.com/c/titanic/data?select=train.csv
Download the train.csv data.
Open the train.csv file in R. To do so, use something like mydata <- read.csv("D:/train.csv"), but replace "D:/" with the directory where you saved the file. You can read up on assignment operators in R (i.e., "<-"). You can also try Session -> Set Working Directory -> Choose Directory, and then File -> Import Dataset -> From Text (base)…
Then answer the following questions.
(Upload your work as a .pdf file only. Make sure for this assignment and all assignments that you show all R code.)
# Clear the workspace
rm(list=ls()) # remove all objects from environment
cat("\f") # Clear the console
graphics.off() # Clear all graphs
#gc() # Clear unused memory
?read.csv # see na.strings options. a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields (but not in character variables)...
train <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Data Analysis/Data Analysis - Fall 2023/Data Analysis - Fall 2023 (shared files)/Week 1/titanic/train.csv",
na.strings = c("") # na.strings=c("", ".", "NA") more generally
)
head(train[,c(2,3,4)], n=2) # check if data imported correctly - first few rows seem right
## Survived Pclass Name
## 1 0 3 Braund, Mr. Owen Harris
## 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
tail(train[,c(2,3,4)], n=2) # check if data imported correctly - last few rows seem right
## Survived Pclass Name
## 890 1 1 Behr, Mr. Karl Howell
## 891 0 3 Dooley, Mr. Patrick
You can provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)).
The table command creates one-way and two-way frequency tables.
?table # table uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
Survival_Sex_Table <- table(train$Survived ,
train$Sex
) # two way table - first variable on rows, second on columns
colnames(Survival_Sex_Table) <- c("Female", "Male") # relabel columns; table() orders the levels as female, male
rownames(Survival_Sex_Table) <- c("Died", "Survived") # relabel rows; 0 = died, 1 = survived
print(Survival_Sex_Table) # Survived on rows, Sex in columns
##
## Female Male
## Died 81 468
## Survived 233 109
Of the total 891 Titanic passengers from this training data subset, 342 survived. This is a survival rate of about 38.4%.
There were 233 surviving women and 109 surviving men. It is most helpful to put these numbers into context by examining the survival rate within each sex: 233 of 314 women survived (about 74.2%), compared with 109 of 577 men (about 18.9%).
This is a dramatic difference at a glance. A statistical test (e.g., a chi-squared test of independence, a two-proportion test, or a Bayesian comparison) could help determine whether the difference is statistically significant, but there is a clear practical difference in survival rate by sex.
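The rates behind this comparison can be computed directly from the counts in the table above. The following is a minimal sketch that re-enters those counts by hand so that it runs without train.csv. The object names (sex_tab, rates_by_sex, overall_rate, sex_test) are illustrative, and the chi-squared test is only one reasonable choice for a 2x2 table, not necessarily the test the assignment has in mind.

```r
# Counts copied from the Survived-by-Sex table above (filled column by column)
sex_tab <- matrix(c(81, 233, 468, 109), nrow = 2,
                  dimnames = list(Outcome = c("Died", "Survived"),
                                  Sex = c("Female", "Male")))

# Column proportions: share of each sex that died / survived
rates_by_sex <- prop.table(sex_tab, margin = 2)
print(round(rates_by_sex, 3))

# Overall survival rate across both sexes
overall_rate <- sum(sex_tab["Survived", ]) / sum(sex_tab)
print(round(overall_rate, 3))

# Chi-squared test of independence between survival and sex
sex_test <- chisq.test(sex_tab)
print(sex_test)
```

With the real data, the same calls should work on table(train$Survived, train$Sex) in place of the hand-entered matrix.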
table(train$Survived,
train$Sex,
train$Pclass
)
## , , = 1
##
##
## female male
## 0 3 77
## 1 91 45
##
## , , = 2
##
##
## female male
## 0 6 91
## 1 70 17
##
## , , = 3
##
##
## female male
## 0 72 300
## 1 72 47
I can also create these pivot tables within about 30 seconds in Excel, which can be useful for exploratory data analysis.
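To turn the three-way counts into survival rates by sex and class, one option is prop.table over the Sex and Pclass margins. This sketch re-enters the counts printed above rather than reading train.csv. The dimension labels and object names (class_tab, cell_rates, survival_rates) are added for readability and are not part of the original output.

```r
# Counts copied from the three-way table above.
# array() fills the first dimension fastest: Survived, then Sex, then Pclass.
class_tab <- array(c(3, 91, 77, 45,
                     6, 70, 91, 17,
                     72, 72, 300, 47),
                   dim = c(2, 2, 3),
                   dimnames = list(Survived = c("0", "1"),
                                   Sex = c("female", "male"),
                                   Pclass = c("1", "2", "3")))

# Proportions within each Sex x Pclass group (each 0/1 pair sums to 1)
cell_rates <- prop.table(class_tab, margin = c(2, 3))

# Keep only the survival rates: rows are Sex, columns are Pclass
survival_rates <- cell_rates["1", , ]
print(round(survival_rates, 3))

# Flattened view of the raw counts, similar to a pivot table
print(ftable(as.table(class_tab), row.vars = c("Sex", "Pclass")))
```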
Make sure your numbers match.
Table could look like -
In other words, the probability of A happening is the same whether or not B has occurred, and vice versa.
P(A ∩ B) = P(A) · P(B)
This equation states that the probability of the intersection of events A and B is equal to the product of the probabilities of each individual event.
Alternatively, the definition can be expressed as:
P(A∣B) = P(A) and P(B∣A) = P(B)
where P(A∣B) denotes the conditional probability of A given B, and P(B∣A) denotes the conditional probability of B given A. If these conditional probabilities are equal to the individual probabilities P(A) and P(B), respectively, then events A and B are independent.
The probability of the intersection of mutually exclusive events is zero: P(A ∩ B) = 0.
Now, let’s consider an experiment of rolling a six-sided fair die:
Sample Space (Outcomes): {1, 2, 3, 4, 5, 6}
Now, let’s look at examples for each scenario:
Example: Rolling an even number (Event A) and rolling an odd number (Event B) are mutually exclusive because a single outcome cannot be both even and odd. If you roll a 2 (Event A), you cannot roll a 3 (Event B) simultaneously.
Example: Rolling an even number (Event A) and rolling a prime number (Event B) are not mutually exclusive because an outcome like 2 satisfies both events. So, they can occur together.
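The two die examples above can be checked by listing the outcomes in each event and looking at the intersection. This is a small sketch assuming equally likely outcomes; the helper prob() and the variable names are made up for illustration.

```r
# Sample space for one roll of a fair six-sided die
omega <- 1:6

# Probability of an event (a subset of omega) when outcomes are equally likely
prob <- function(event) length(event) / length(omega)

even  <- c(2, 4, 6)
odd   <- c(1, 3, 5)
prime <- c(2, 3, 5)

# Even vs. odd: the intersection is empty, so the events are mutually exclusive
p_even_and_odd <- prob(intersect(even, odd))
print(p_even_and_odd)

# Even vs. prime: the outcome 2 is in both, so they are not mutually exclusive
p_even_and_prime <- prob(intersect(even, prime))
print(p_even_and_prime)
```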
Example: Tossing a fair coin twice. The outcome of the first toss (Event A: Heads) does not affect the outcome of the second toss (Event B: Tails), and vice versa. The events are independent.
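One way to confirm the coin example is to enumerate the four equally likely outcomes of two tosses and compare P(A and B) with P(A) · P(B). The sketch below assumes a fair coin; the names tosses, in_A, and in_B are illustrative.

```r
# All four equally likely outcomes of two fair coin tosses
tosses <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                      stringsAsFactors = FALSE)

# Event A: first toss is heads; Event B: second toss is tails
in_A <- tosses$first == "H"
in_B <- tosses$second == "T"

p_A  <- mean(in_A)         # P(A)
p_B  <- mean(in_B)         # P(B)
p_AB <- mean(in_A & in_B)  # P(A and B)

# Independence holds when P(A and B) equals P(A) * P(B)
print(c(p_A = p_A, p_B = p_B, p_AB = p_AB))
print(isTRUE(all.equal(p_AB, p_A * p_B)))
```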
Example:
Event A: Rolling a number less than 4; Event B: Rolling an odd number.
If event A occurs (rolling a number less than 4), the chance of event B (rolling an odd number) rises from 1/2 to 2/3, since two of the three numbers in A are odd. So, A and B are not independent.
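The same check can be written out for this pair of events by computing P(B∣A) as P(A and B) / P(A) and comparing it with P(B). This is a sketch for a fair die; the variable names are chosen here for illustration.

```r
omega <- 1:6                 # fair six-sided die
A <- omega[omega < 4]        # rolling a number less than 4: 1, 2, 3
B <- omega[omega %% 2 == 1]  # rolling an odd number: 1, 3, 5

p_A  <- length(A) / length(omega)
p_B  <- length(B) / length(omega)
p_AB <- length(intersect(A, B)) / length(omega)

# Conditional probability P(B | A) = P(A and B) / P(A)
p_B_given_A <- p_AB / p_A
print(c(p_B = p_B, p_B_given_A = p_B_given_A))

# The events are independent only if P(A and B) equals P(A) * P(B)
print(isTRUE(all.equal(p_AB, p_A * p_B)))
```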
Alternatively, consider drawing cards from a deck without replacement. If the first card is an Ace (Event A), the probability that the second card is also an Ace (Event B) changes, because there is now one fewer Ace in the deck (3 of the remaining 51 cards rather than 4 of 52). Therefore, these events are not independent.
These examples illustrate the concepts of independence and mutual exclusivity in the context of probability and provide insight into how these concepts manifest in different scenarios.