mydata=read.csv('C:/Users/lfult/Downloads/titanic/train.csv',
na.strings = c("", "NA")) #Look at na.strings!
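To see what na.strings changes, here is a minimal sketch on a made-up three-row CSV (the id and cabin columns and their values are invented for illustration, not taken from train.csv). By default, read.csv treats only the string NA as missing, so an empty character field is read as an empty string rather than NA.

```r
# Hypothetical CSV text: row 2 has an empty cabin, row 3 has the literal string NA.
csv_text <- "id,cabin\n1,C85\n2,\n3,NA\n"

# Default behaviour: only "NA" is converted to a missing value.
default_read <- read.csv(text = csv_text)

# With na.strings, empty strings are converted to NA as well.
strict_read <- read.csv(text = csv_text, na.strings = c("", "NA"))

n_default <- sum(is.na(default_read$cabin))
n_strict  <- sum(is.na(strict_read$cabin))
cat("Missing with defaults:", n_default, "\n")
cat("Missing with na.strings:", n_strict, "\n")
```

Without this argument, blank text fields such as Cabin would be undercounted in any later is.na() check.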
Classify each of the following variables:
PassengerId, Age, Fare, Pclass
For each variable, identify:
Type (quantitative or qualitative) and level of measurement (nominal, ordinal, interval, ratio)
Then answer: Which of these variables can be meaningfully averaged, and why?
data.frame(Variable=c("PassengerId", "Age", "Fare", "Pclass"),
Type=c("Qualitative", "Quantitative", "Quantitative", "Qualitative"),
Level=c("Nominal","Ratio","Ratio","Ordinal"),
Average=c("No","Yes","Yes","No"))
## Variable Type Level Average
## 1 PassengerId Qualitative Nominal No
## 2 Age Quantitative Ratio Yes
## 3 Fare Quantitative Ratio Yes
## 4 Pclass Qualitative Ordinal No
Why
cat("Averaging is meaningful only when the distance between adjacent values is fixed.\n",
"If summing doesn't make sense, neither does averaging, so the data must be at least interval level.\n")
## Averaging is meaningful only when the distance between adjacent values is fixed.
## If summing doesn't make sense, neither does averaging, so the data must be at least interval level.
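One way to illustrate the point is to relabel an ordinal variable while keeping its order. The sketch below uses two invented groups of class codes (not Titanic data) and a hypothetical relabel() helper that swaps the label 3 for 10. The ordering of categories is unchanged, yet the comparison of means flips while the comparison of medians does not.

```r
# Invented ordinal codes for two small groups (illustration only).
group_a <- c(1, 1, 3)
group_b <- c(2, 2, 2)

# Hypothetical order-preserving relabeling: category 3 is renamed 10.
relabel <- function(x) ifelse(x == 3, 10, x)

# Means depend on the arbitrary labels, so the conclusion changes.
means_before <- c(a = mean(group_a), b = mean(group_b))
means_after  <- c(a = mean(relabel(group_a)), b = mean(relabel(group_b)))

# Medians depend only on order, so the conclusion is stable.
medians_before <- c(a = median(group_a), b = median(group_b))
medians_after  <- c(a = median(relabel(group_a)), b = median(relabel(group_b)))

print(rbind(means_before, means_after, medians_before, medians_after))
```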
Which variable in the dataset has the most missing observations? Cabin
Then briefly explain (1–2 sentences): Why might this variable have missing values in a real-world dataset?
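library(ggplot2) # needed for ggplot(), aes(), geom_bar(), coord_flip(), labs()
# Count the NA values in each column
missing_data <- data.frame(Variable = names(mydata), Missing = colSums(is.na(mydata)))
# Horizontal bar chart, ordered by number of missing values
ggplot(missing_data, aes(x = reorder(Variable, Missing), y = Missing)) +
geom_bar(stat = "identity") + coord_flip() +
labs(title = "Missing Values by Variable", x = "Variable", y = "Number of Missing Values")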
Then briefly explain (1–2 sentences): Why might this variable have missing values in a real-world dataset?
cat("Data can be missing completely at random (MCAR), missing at random (MAR),\n",
"or missing not at random (MNAR). For Cabin, grouped or shared lodging may\n",
"have left no cabin on record (MNAR), and MCAR or MAR gaps are also possible.\n")
## Data can be missing completely at random (MCAR), missing at random (MAR),
## or missing not at random (MNAR). For Cabin, grouped or shared lodging may
## have left no cabin on record (MNAR), and MCAR or MAR gaps are also possible.
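A rough way to probe the mechanism is to compare the missing rate across an observed variable. The sketch below uses a small invented data frame (the toy values are assumptions, not rows from train.csv). If the rate differs sharply between groups, MCAR becomes doubtful, although this check alone cannot separate MAR from MNAR.

```r
# Invented toy data: passenger class and cabin, with some cabins missing.
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Cabin  = c("C85", "E46", NA, NA, "D10", NA, NA, NA, "G6", NA)
)

# Proportion of missing Cabin values within each class.
miss_by_class <- tapply(is.na(toy$Cabin), toy$Pclass, mean)
print(round(miss_by_class, 2))
```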
Impute missing values for the following variables:
Age, Embarked, Fare
Why is your chosen method appropriate for each variable? (1 sentence per variable)
mydata$Age[is.na(mydata$Age)] = median(mydata$Age, na.rm = TRUE)
mydata$Fare[is.na(mydata$Fare)] = median(mydata$Fare, na.rm = TRUE)
mydata$Embarked[is.na(mydata$Embarked)]=names(which.max(table(mydata$Embarked)))
Why is your chosen method appropriate for each variable? (1 sentence per variable)
cat("For Age, the median is robust to the outliers that a quick boxplot\n",
"check reveals. For Fare, the median is robust to outliers, and the\n",
"distribution is heavily skewed. For Embarked, the mode is the only\n",
"available measure of center for nominal data.\n")
## For Age, the median is robust to the outliers that a quick boxplot
## check reveals. For Fare, the median is robust to outliers, and the
## distribution is heavily skewed. For Embarked, the mode is the only
## available measure of center for nominal data.
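The reasoning can be checked on a few invented values (the ages and ports vectors below are made up for illustration). A single large value pulls the mean well away from the bulk of the data but barely moves the median, and for a nominal variable the most frequent category is found with table().

```r
# Invented ages with one large value.
ages <- c(22, 25, 28, 30, 31, 80)
age_mean   <- mean(ages)    # pulled upward by the value 80
age_median <- median(ages)  # stays near the middle of the data

# Invented embarkation ports with one missing value.
ports <- c("S", "C", "S", "Q", "S", NA)

# table() drops NA by default, so which.max() picks the most frequent port.
mode_value <- names(which.max(table(ports)))
ports[is.na(ports)] <- mode_value

cat("Mean:", age_mean, " Median:", age_median, " Mode:", mode_value, "\n")
```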
Install and load the psych R package.
Compute descriptive statistics for:
Age, Fare, SibSp
psych::describe(mydata[c("Age", "Fare", "SibSp")])
## vars n mean sd median trimmed mad min max range skew
## Age 1 891 29.36 13.02 28.00 28.83 8.90 0.42 80.00 79.58 0.51
## Fare 2 891 32.20 49.69 14.45 21.38 10.24 0.00 512.33 512.33 4.77
## SibSp 3 891 0.52 1.10 0.00 0.27 0.00 0.00 8.00 8.00 3.68
## kurtosis se
## Age 0.97 0.44
## Fare 33.12 1.66
## SibSp 17.73 0.04
Then answer: Which variable appears most skewed, and how can you tell?
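cat("Fare is the most skewed: its skew of 4.77 is the farthest from 0,\n",
"and its mean (32.20) is well above its median (14.45).\n")
## Fare is the most skewed: its skew of 4.77 is the farthest from 0,
## and its mean (32.20) is well above its median (14.45).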
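For intuition about what the skew column measures, here is a minimal sketch of the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). The moment_skew() helper and both vectors are assumptions made for illustration. psych::describe may apply a slightly different small-sample adjustment, so its values need not match this formula exactly.

```r
# Hypothetical helper: moment-based skewness, ignoring missing values.
moment_skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Invented data: one symmetric vector and one with a long right tail.
symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 1, 2, 2, 3, 50)

skew_symmetric  <- moment_skew(symmetric)
skew_right_tail <- moment_skew(right_tail)
print(c(symmetric = skew_symmetric, right_tail = skew_right_tail))
```

A positive value goes together with a mean above the median, which is the pattern Fare shows in the table above.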
Create a cross-tabulation of survival and sex.
(myt=addmargins(table(mydata$Survived, mydata$Sex)))
##
## female male Sum
## 0 81 468 549
## 1 233 109 342
## Sum 314 577 891
Then compute row proportions.
prop.table(table(mydata$Survived, mydata$Sex), 1)
##
## female male
## 0 0.1475410 0.8524590
## 1 0.6812865 0.3187135
What is the survival rate for males versus females? Provide percentages and briefly interpret your results (2–3 sentences).
Females were nearly 4 times as likely to survive as males (74.2% vs. 18.9%).
cat(c("Males",myt[2,2]/myt[3,2]*100), c("Females",myt[2,1]/myt[3,1]*100))
## Males 18.8908145580589 Females 74.2038216560509
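Row proportions (margin 1) describe the sex split within each survival status, whereas the survival rate by sex is a column proportion (margin 2). As a sketch, the counts from the cross-tabulation above can be re-entered by hand and passed to prop.table() with margin 2. The object names are chosen here for illustration.

```r
# Counts copied from the cross-tabulation: rows are Survived, columns are Sex.
tab <- matrix(
  c(81, 233,    # female: died, survived
    468, 109),  # male:   died, survived
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# Column proportions: each column sums to 1, so row "1" is the survival rate.
survival_rate <- prop.table(tab, 2)["1", ]
rate_ratio <- survival_rate[["female"]] / survival_rate[["male"]]

print(round(100 * survival_rate, 1))
cat("Female-to-male ratio of survival rates:", round(rate_ratio, 2), "\n")
```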
Create a notched boxplot comparing Age and Survival:
boxplot(
Age ~ Survived,
data = mydata,
notch = TRUE,
horizontal = TRUE,
col = c("orange", "skyblue"),
main = "Age by Survival Status",
xlab = "Age",
ylab = "Survived",
names = c("No", "Yes")
)
Which group has the higher median age? Neither. They are the same.
c(median(mydata$Age[mydata$Survived==1]), median(mydata$Age[mydata$Survived==0]))
## [1] 28 28
median(mydata$Age[mydata$Survived==1])==median(mydata$Age[mydata$Survived==0])
## [1] TRUE
Do the notches overlap? Yes
What does this suggest about differences between the groups? Overlapping notches suggest no statistically significant difference in median age between those who survived and those who did not.
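The notch overlap can also be checked numerically. Per the boxplot.stats documentation, the notches extend to the median plus or minus 1.58 times the hinge spread divided by sqrt(n). The sketch below recomputes that on two invented age vectors. The notch_limits() helper and the data are assumptions for illustration, not the imputed Titanic ages.

```r
# Hypothetical helper: notch limits from Tukey's five-number summary.
notch_limits <- function(x) {
  x <- x[!is.na(x)]
  five <- fivenum(x)  # min, lower hinge, median, upper hinge, max
  five[3] + c(-1.58, 1.58) * (five[4] - five[2]) / sqrt(length(x))
}

# Invented ages for two groups.
group_a <- c(19, 22, 24, 27, 28, 28, 30, 33, 36, 41, 54)
group_b <- c(18, 21, 25, 26, 28, 29, 31, 34, 35, 40, 47)

notch_a <- notch_limits(group_a)
notch_b <- notch_limits(group_b)

# Two intervals overlap when each one starts before the other ends.
notches_overlap <- notch_a[1] <= notch_b[2] && notch_b[1] <= notch_a[2]
print(rbind(notch_a, notch_b))
cat("Notches overlap:", notches_overlap, "\n")
```

Overlap is only a rough visual guide, so a formal test is still needed for a firm conclusion.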
Set a seed using the last four digits of your student ID. Mine is 4142.
set.seed(4142)
sample_ids <- sample(mydata$PassengerId, 10)
subset_data <- mydata[mydata$PassengerId %in% sample_ids, ]
mean(subset_data$Survived)
## [1] 0.2
What is the survival rate in your sampled subset? How does it compare to the full dataset?
cat("The sample survival rate is 0.20, compared with 0.38 in the full dataset.\n",
"A sample of 10 is too small to estimate the population rate reliably.\n")
## The sample survival rate is 0.20, compared with 0.38 in the full dataset.
## A sample of 10 is too small to estimate the population rate reliably.
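To put a number on "too small", the sketch below rebuilds a 0/1 survival vector from the totals in the cross-tabulation (342 survivors, 549 non-survivors) and looks at how much the rate varies across repeated samples of 10. The seed 1 and the 5000 repetitions are arbitrary choices for illustration.

```r
# Rebuild the outcome vector from the table totals: 342 survived, 549 did not.
survived <- rep(c(1, 0), times = c(342, 549))
p_full <- mean(survived)

# Approximate standard error of a proportion estimated from n = 10.
se_n10 <- sqrt(p_full * (1 - p_full) / 10)

# Simulate many samples of size 10 and record each sample's survival rate.
set.seed(1)
sample_rates <- replicate(5000, mean(sample(survived, 10)))

cat("Full-data rate:", round(p_full, 3), "\n")
cat("Approximate SE for n = 10:", round(se_n10, 3), "\n")
cat("SD of simulated sample rates:", round(sd(sample_rates), 3), "\n")
```

With a standard error of roughly 0.15, a sample rate of 0.2 is well within ordinary sampling variation around 0.38.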
Submission Requirements
Submit a PDF file only that includes:
All knitted R code, all outputs, and clear answers to each question
Responses without code or explanation will not receive full credit.