Titanic

Download / Load Data

mydata=read.csv('C:/Users/lfult/Downloads/titanic/train.csv', 
                na.strings = c("", "NA")) #Look at na.strings!

Q1. Variable Types and Measurements

Q1a. Classify

Classify each of the following variables:

PassengerId, Age, Fare, Pclass

For each variable, identify:

Type (quantitative or qualitative)
Level of measurement (nominal, ordinal, interval, ratio)

Then answer: Which of these variables can be meaningfully averaged, and why?

data.frame(Variable=c("PassengerId", "Age", "Fare", "Pclass"), 
           Type=c("Qualitative", "Quantitative", "Quantitative", "Qualitative"),
           Level=c("Nominal","Ratio","Ratio","Ordinal"), 
           Average=c("No","Yes","Yes","No"))

##      Variable         Type   Level Average
## 1 PassengerId  Qualitative Nominal      No
## 2         Age Quantitative   Ratio     Yes
## 3        Fare Quantitative   Ratio     Yes
## 4      Pclass  Qualitative Ordinal      No

Why

cat("Averages only make sense when the intervals between adjacent values are equal.\n",
    "If summing doesn't make sense, then neither does averaging. The data must be at least interval level (interval or ratio).\n")

## Averages only make sense when the intervals between adjacent values are equal.
##  If summing doesn't make sense, then neither does averaging. The data must be at least interval level (interval or ratio).
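
A small sketch of why averaging coded ordinal values can mislead. The two groups and the relabeling below are invented for illustration and are not taken from the Titanic data. Any order-preserving relabeling of the categories is equally valid for ordinal data, yet it can reverse which group has the higher mean, while the comparison of medians stays the same.

```r
# Invented ordinal codes (1 < 2 < 3) for two made-up groups.
group_a <- c(1, 1, 3)
group_b <- c(2, 2, 2)

# Hypothetical order-preserving relabeling: 1 -> 1, 2 -> 2, 3 -> 10.
relabel <- function(x) c(1, 2, 10)[x]

# Means: group_a is lower before relabeling, higher after.
c(a = mean(group_a), b = mean(group_b))
c(a = mean(relabel(group_a)), b = mean(relabel(group_b)))

# Medians: group_a is lower both before and after relabeling.
c(a = median(group_a), b = median(group_b))
c(a = median(relabel(group_a)), b = median(relabel(group_b)))
```

Because the conclusion from the means depends on an arbitrary choice of codes, the mean is not a meaningful summary for ordinal variables such as Pclass.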

Q1b. Missing

Which variable in the dataset has the most missing observations? Cabin

Then briefly explain (1–2 sentences): Why might this variable have missing values in a real-world dataset?

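library(ggplot2)  # needed for ggplot(), aes(), geom_col(), coord_flip(), labs()

# Count the missing values in each column
missing_data <- data.frame(Variable = names(mydata), Missing = colSums(is.na(mydata)))

# Horizontal bar chart, ordered by number of missing values
ggplot(missing_data, aes(x = reorder(Variable, Missing), y = Missing)) +
  geom_col() + coord_flip() + 
  labs(title = "Missing Values by Variable",  x = "Variable",  y = "Number of Missing Values")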

cat("Data can be missing completely at random (MCAR), missing at random (MAR),\n",
    "or missing not at random (MNAR). In this case, grouped or shared lodging\n",
    "could produce MNAR gaps, alongside MCAR or MAR issues.\n")

## Data can be missing completely at random (MCAR), missing at random (MAR),
##  or missing not at random (MNAR). In this case, grouped or shared lodging
##  could produce MNAR gaps, alongside MCAR or MAR issues.
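
The variable with the most missing values can also be read off programmatically, and missingness can be cross-tabulated against an observed variable as a rough check on the MCAR assumption. The sketch below uses a toy data frame with invented values as a stand-in for `mydata`; only the column names are borrowed from the dataset.

```r
# Toy stand-in for the Titanic data (all values are invented).
toy <- data.frame(
  Pclass = c(1, 1, 2, 3, 3, 3),
  Age    = c(38, NA, 27, 22, NA, 4),
  Cabin  = c("C85", "C123", NA, NA, NA, NA)
)

# Count missing values per column and pick the column with the most.
n_missing <- colSums(is.na(toy))
most_missing <- names(which.max(n_missing))
most_missing

# Cross-tabulate missingness against an observed variable. If the share of
# missing Cabin values changes with Pclass, that is evidence against MCAR.
table(CabinMissing = is.na(toy$Cabin), Pclass = toy$Pclass)
```

Running the same two steps on `mydata` would show whether Cabin is recorded more often for some passenger classes than others.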

Q2. Missing Data

Impute missing values for the following variables:

Age, Embarked, Fare

Why is your chosen method appropriate for each variable? (1 sentence per variable)

mydata$Age[is.na(mydata$Age)] = median(mydata$Age, na.rm = TRUE)
mydata$Fare[is.na(mydata$Fare)] = median(mydata$Fare, na.rm = TRUE)
mydata$Embarked[is.na(mydata$Embarked)]=names(which.max(table(mydata$Embarked)))

cat("For Age, the median is robust to the outliers that a quick boxplot\n",
    "check reveals. For Fare, the median is robust to outliers, and the\n",
    "distribution is heavily right-skewed. For Embarked, the mode is the\n",
    "only available measure of center for nominal data.\n")

## For Age, the median is robust to the outliers that a quick boxplot
##  check reveals. For Fare, the median is robust to outliers, and the
##  distribution is heavily right-skewed. For Embarked, the mode is the
##  only available measure of center for nominal data.
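
The robustness argument can be seen on a small invented example. The fares and ports below are made up, and `stat_mode` is a hypothetical helper written for this sketch (base R has no built-in function for the statistical mode).

```r
# Invented fares with one extreme value, mimicking a right-skewed variable.
fares <- c(7, 8, 8, 10, 13, 26, 512)

mean(fares)    # pulled far upward by the single extreme fare
median(fares)  # stays at a typical value

# Hypothetical helper: most frequent non-missing value of a vector.
stat_mode <- function(x) {
  counts <- table(x)                 # table() drops NA by default
  names(counts)[which.max(counts)]
}

# Mode imputation for an invented nominal variable.
ports <- c("S", "C", "S", NA, "Q", "S")
ports[is.na(ports)] <- stat_mode(ports)
ports
```

Imputing with the mean here would assign a fare larger than six of the seven observed values, which is why the median is the safer fill for skewed variables.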

Q3. Descriptives

Install and load the psych R package.

Compute descriptive statistics for:

Age, Fare, SibSp

psych::describe(mydata[c("Age", "Fare", "SibSp")])

##       vars   n  mean    sd median trimmed   mad  min    max  range skew
## Age      1 891 29.36 13.02  28.00   28.83  8.90 0.42  80.00  79.58 0.51
## Fare     2 891 32.20 49.69  14.45   21.38 10.24 0.00 512.33 512.33 4.77
## SibSp    3 891  0.52  1.10   0.00    0.27  0.00 0.00   8.00   8.00 3.68
##       kurtosis   se
## Age       0.97 0.44
## Fare     33.12 1.66
## SibSp    17.73 0.04

Then answer: Which variable appears most skewed, and how can you tell?

cat("Fare has the highest skew of 4.77.")

## Fare has the highest skew of 4.77.
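
For intuition about what a skew statistic measures, here is a minimal sketch of the moment coefficient of skewness computed by hand. `moment_skew` is a hypothetical helper and the vectors are invented; `psych::describe` may apply a small-sample adjustment, so its values need not match this formula exactly.

```r
# Hypothetical helper: moment coefficient of skewness, m3 / m2^(3/2),
# where m2 and m3 are the second and third central moments.
moment_skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 1, 2, 2, 3, 50)

moment_skew(symmetric)   # zero for a symmetric sample
moment_skew(right_tail)  # positive: long right tail

# Informal check: in right-skewed data the mean exceeds the median.
mean(right_tail) > median(right_tail)
```

The same informal check applies to the table above: Fare has a mean of 32.20 against a median of 14.45, consistent with its large positive skew.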

Q4. Cross-Tabs

Create a cross-tabulation of survival and sex.

(myt=addmargins(table(mydata$Survived, mydata$Sex)))

##      
##       female male Sum
##   0       81  468 549
##   1      233  109 342
##   Sum    314  577 891

Then compute row proportions.

prop.table(table(mydata$Survived, mydata$Sex), 1)

##    
##        female      male
##   0 0.1475410 0.8524590
##   1 0.6812865 0.3187135

What is the survival rate for males versus females? Provide percentages and briefly interpret your results (2–3 sentences).

Females were nearly 4 times as likely to survive as males (74.2% vs. 18.9%).

cat(c("Males",myt[2,2]/myt[3,2]*100), c("Females",myt[2,1]/myt[3,1]*100))

## Males 18.8908145580589 Females 74.2038216560509
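
Survival rates within each sex are column proportions, so they can also be obtained with `prop.table(..., margin = 2)`. The sketch below rebuilds the 2 x 2 table from the counts shown in the cross-tabulation so that it runs without the data file; the object names are chosen for illustration.

```r
# Rebuild the 2 x 2 table from the counts in the cross-tabulation.
# matrix() fills column by column: female first, then male.
tab <- matrix(c(81, 233, 468, 109), nrow = 2,
              dimnames = list(Survived = c("0", "1"),
                              Sex = c("female", "male")))

# margin = 2 divides each column by its total: proportions within each sex.
col_props <- prop.table(tab, margin = 2)
survival_pct <- round(100 * col_props["1", ], 1)
survival_pct  # female 74.2, male 18.9

# Ratio of the female to the male survival rate.
rate_ratio <- unname(col_props["1", "female"] / col_props["1", "male"])
rate_ratio
```

Row proportions (margin = 1) answer a different question: among survivors, what share were female. The survival rate by sex needs the column version.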

Q5. Visualization

Create a notched boxplot comparing Age and Survival:

boxplot(
  Age ~ Survived,
  data = mydata,
  notch = TRUE,
  horizontal = TRUE,
  col = c("orange", "skyblue"),
  main = "Age by Survival Status",
  xlab = "Age",
  ylab = "Survived",
  names = c("No", "Yes")
)

Which group has the higher median age? Neither. They are the same.

c(median(mydata$Age[mydata$Survived==1]), median(mydata$Age[mydata$Survived==0]))

## [1] 28 28

median(mydata$Age[mydata$Survived==1])==median(mydata$Age[mydata$Survived==0])

## [1] TRUE

Do the notches overlap? Yes

What does this suggest about differences between the groups? Overlapping notches indicate no strong evidence that the median ages differ between those who survived and those who did not.
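
Notch overlap can be checked numerically rather than by eye. `boxplot.stats()` returns the notch limits in `$conf`, computed as the median plus or minus 1.58 times the hinge spread divided by the square root of n. The sketch below uses simulated ages (invented, with an arbitrary seed) in place of `mydata`. Overlap is a rough visual guide, not a formal hypothesis test.

```r
set.seed(1)  # arbitrary seed for the simulated ages below
ages_no  <- rnorm(200, mean = 30, sd = 12)  # invented "did not survive" ages
ages_yes <- rnorm(150, mean = 29, sd = 13)  # invented "survived" ages

# boxplot.stats() returns the lower and upper notch limits in $conf.
notch_no  <- boxplot.stats(ages_no)$conf
notch_yes <- boxplot.stats(ages_yes)$conf

# The same limits by hand: median +/- 1.58 * (hinge spread) / sqrt(n).
hinges <- fivenum(ages_no)[c(2, 4)]
manual_no <- median(ages_no) +
  c(-1.58, 1.58) * diff(hinges) / sqrt(length(ages_no))

# Two intervals overlap when each one starts before the other ends.
notches_overlap <- notch_no[1] <= notch_yes[2] && notch_yes[1] <= notch_no[2]
notches_overlap
```

On the real data the two calls would be `boxplot.stats(mydata$Age[mydata$Survived == 0])$conf` and the same with `Survived == 1`.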

Q6. Individual Data Sample

Set a seed using the last four digits of your student ID. Mine is 4142.

set.seed(4142); sample_ids <- sample(mydata$PassengerId, 10); subset_data <- mydata[mydata$PassengerId %in% sample_ids, ]

set.seed(4142) 
sample_ids <- sample(mydata$PassengerId, 10)
subset_data <- mydata[mydata$PassengerId %in% sample_ids, ]
mean(subset_data$Survived)

## [1] 0.2

What is the survival rate in your sampled subset? How does it compare to the full dataset?

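cat("The sample survival rate is 0.2, compared with 0.38 in the full dataset.\n",
    "With only 10 passengers, sampling variability is large, so the sample\n",
    "rate is an imprecise estimate of the full-data rate.\n")

## The sample survival rate is 0.2, compared with 0.38 in the full dataset.
##  With only 10 passengers, sampling variability is large, so the sample
##  rate is an imprecise estimate of the full-data rate.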
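
To put a number on how much a sample of 10 can vary, the sketch below computes the standard error of a sample proportion and simulates many samples. It assumes a true rate of 0.38 (the full-data rate) and draws with replacement via `rbinom()`, which is a simplification of sampling 10 distinct passengers from 891.

```r
p <- 0.38   # full-data survival rate
n <- 10     # sample size used in the exercise

# Theoretical standard error of a sample proportion: sqrt(p * (1 - p) / n).
se <- sqrt(p * (1 - p) / n)
se

# Simulate many samples of size n (with replacement, a simplifying
# assumption) and look at the spread of the resulting sample rates.
set.seed(4142)
sim_rates <- rbinom(5000, size = n, prob = p) / n
quantile(sim_rates, c(0.025, 0.975))

# Share of simulated samples with a rate at or below the observed 0.2.
mean(sim_rates <= 0.2)
```

A standard error of roughly 0.15 means a sample rate of 0.2 is only a little more than one standard error below 0.38, so the gap is well within ordinary sampling variation.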

Submission Requirements

Submit a PDF file only that includes:

All knitted R code
All outputs
Clear answers to each question

Responses without code or explanation will not receive full credit.

HW1 BC

Sith Fulton

Today