(This is my first attempt at using RPubs. Please forgive any errors.)
Q1A: Variable Types and Levels of Measurement:
PassengerId Type: Qualitative/Categorical Level: Nominal (i.e., no
mathematical meaning, just a label)
Age Type: Quantitative/Numerical Level: Ratio (i.e., a true zero exists)
my_data <- read.csv("C:/Users/leonedo/Downloads/train.csv")
head(my_data)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
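Relating to Q1A, here is a minimal sketch of how the levels of measurement could be made explicit in R. By default read.csv() stores PassengerId as an integer even though it is only a label. The data frame `toy` is a hypothetical stand-in built from the six rows printed above, not part of the original analysis.

```r
# Toy data frame copied from the first six rows printed by head(my_data);
# the name `toy` is illustrative only.
toy <- data.frame(
  PassengerId = 1:6,
  Survived    = c(0, 1, 1, 1, 0, 0),
  Age         = c(22, 38, 26, 35, 35, NA)
)

# How R stores each column right after reading
sapply(toy, class)

# Nominal label: store as character so it is never averaged by accident
toy$PassengerId <- as.character(toy$PassengerId)

# Nominal outcome: store as a factor with readable labels
toy$Survived <- factor(toy$Survived, levels = c(0, 1),
                       labels = c("Died", "Survived"))

# Age stays numeric (ratio level)
sapply(toy, class)
```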
Q1B: Most NAs by Column:
I admit I have more to learn about returning just “Age” and “177”
programmatically, but the output below shows that Age has the most NAs
(177). Cabin has the most blank entries, but those are empty strings
rather than NA, so is.na() does not count them.
# List every column with its count of NAs
colSums(is.na(my_data))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
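One possible way to return only the column with the most NAs and its count is to combine colSums() with which.max(). This is a sketch on a hypothetical data frame `toy` built from the six rows printed by head() earlier; on the full data the same calls should pick out Age, given the counts above. The blank-string count is a separate step because empty strings are not NA.

```r
# Toy data frame copied from the first six rows shown by head(my_data);
# `toy`, `na_counts` and `blank_counts` are illustrative names.
toy <- data.frame(
  PassengerId = 1:6,
  Age         = c(22, 38, 26, 35, 35, NA),
  Cabin       = c("", "C85", "", "C123", "", "")
)

# Count NAs per column, then keep only the largest
na_counts   <- colSums(is.na(toy))
most_na_col <- names(na_counts)[which.max(na_counts)]  # first column if tied
most_na_n   <- max(na_counts)
most_na_col
most_na_n

# Count blank strings per column (these are not NA)
blank_counts <- sapply(toy, function(col) sum(!is.na(col) & col == ""))
blank_counts
```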
Q2: Imputation:
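Age, SibSp, and Parch are ratio-level variables, so I imputed them with
the median. Only Age actually has missing values (177 in the counts
above), so the SibSp and Parch lines leave the data unchanged.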
#Impute Age with median
my_data$Age[is.na(my_data$Age)] <- median(my_data$Age, na.rm = TRUE)
#Impute SibSp with median
my_data$SibSp[is.na(my_data$SibSp)] <- median(my_data$SibSp, na.rm = TRUE)
#Impute Parch with median
my_data$Parch[is.na(my_data$Parch)] <- median(my_data$Parch, na.rm = TRUE)
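The three imputation lines repeat the same pattern, so it could be wrapped in a small helper. This is only a sketch: `impute_median` is a hypothetical name, and the ages are the six values printed by head() earlier.

```r
# Hypothetical helper: replace NAs in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Ages from the first six rows of the data (the sixth is missing)
ages    <- c(22, 38, 26, 35, 35, NA)
imputed <- impute_median(ages)
imputed

# Confirm that nothing is left to impute
anyNA(imputed)
```

With a data frame, the same helper could be applied to several columns at once, for example `my_data[cols] <- lapply(my_data[cols], impute_median)` with `cols <- c("Age", "SibSp", "Parch")`.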
Q3: Observations on Age, SibSp, and Parch:
Age: Mean of approximately 29.7 years, ranging from infants (minimum
0.42) up to 80 years old.
SibSp: Interestingly, most passengers traveled without siblings or
spouses (the median is 0).
Parch: Most traveled without parents or children (the median is 0).
library(psych)
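# Note: re-reading the file discards the median imputation from Q2,
# so Age is summarized over its 714 non-missing values below
my_data <- read.csv("C:/Users/leonedo/Downloads/train.csv")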
describe(my_data$Age)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 714 29.7 14.53 28 29.27 13.34 0.42 80 79.58 0.39 0.16 0.54
describe(my_data$SibSp)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.52 1.1 0 0.27 0 0 8 8 3.68 17.73 0.04
describe(my_data$Parch)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.38 0.81 0 0.18 0 0 6 6 2.74 9.69 0.03
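The claim that most passengers traveled alone can be checked directly by computing the share of zeros. A small sketch, assuming the SibSp and Parch values from the six rows printed by head() earlier; `share_zero` is a hypothetical helper, and the shares on the full data will differ from this toy sample.

```r
# Values from the first six rows of the data
sib_sp <- c(1, 1, 0, 1, 0, 0)
parch  <- c(0, 0, 0, 0, 0, 0)

# Hypothetical helper: proportion of entries equal to zero
share_zero <- function(x) mean(x == 0, na.rm = TRUE)

share_zero(sib_sp)
share_zero(parch)

# Full frequency table of the counts
table(sib_sp)
```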
Q4: Observations of Cross-Tabulation of Survived and Sex:
Historically, I believe the protocol for evacuations in any disaster was
“Women and children first.” This cross-tabulation clearly shows its
effect on survival by sex. Nearly 6 times as many males as females did
NOT survive (468 vs. 81). Of the passengers who survived, over twice as
many were women (233 vs. 109).
cross_tab <- table(my_data$Survived, my_data$Sex)
cat("Cross-tabulation (Rows: Survived, Columns: Sex):\n")
## Cross-tabulation (Rows: Survived, Columns: Sex):
print(cross_tab)
##
## female male
## 0 81 468
## 1 233 109
cat("\n")
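Raw counts can be hard to compare because there were more men than women on board, so column proportions may be clearer. A sketch using prop.table(), with the table rebuilt from the counts printed above so that it runs on its own; `survival_by_sex` is an illustrative name.

```r
# Rebuild the cross-tabulation from the counts printed above
cross_tab <- as.table(matrix(
  c(81, 233, 468, 109), nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
))

# margin = 2 divides each column by its total, giving rates within each sex
survival_by_sex <- prop.table(cross_tab, margin = 2)
round(survival_by_sex, 3)

# Ratios quoted in the observations
cross_tab["0", "male"] / cross_tab["0", "female"]   # deaths, male vs. female
cross_tab["1", "female"] / cross_tab["1", "male"]   # survivors, female vs. male
```

Read this way, roughly 74% of women survived compared with roughly 19% of men.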
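Q5: Observations of Boxplot:
1. Median age does not appear to differ much between those who survived
and those who did not. The notches drawn by notch = TRUE give an
informal check: non-overlapping notches would suggest the medians differ.
2. On its own, age does not appear to have been a strong factor in
survival.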
# Create the boxplot
boxplot(Age ~ Survived, data = my_data,
        notch = TRUE,
        horizontal = TRUE,
        main = "Age Distribution by Survival Status",
        xlab = "Age (years)",
        ylab = "Survived? (0 = No, 1 = Yes)",
        col = c("orange", "blue"),
        names = c("Died (0)", "Survived (1)"))
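The visual impression from the boxplot could be backed up with numbers. A sketch that computes the median age per group with tapply() and extracts the notch limits that boxplot() calculates, using only the six rows printed by head() earlier; with so few rows the values are illustrative, not a result for the full data.

```r
# Age and Survived from the first six rows of the data
age      <- c(22, 38, 26, 35, 35, NA)
survived <- c(0, 1, 1, 1, 0, 0)

# Median age within each survival group, ignoring the missing age
group_medians <- tapply(age, survived, median, na.rm = TRUE)
group_medians

# boxplot() returns its statistics invisibly; plot = FALSE skips the drawing.
# Each column of $conf holds the lower and upper notch limits of one group.
bp <- boxplot(age ~ survived, plot = FALSE)
bp$conf
```

If the two notch intervals overlap on the full data, that would be consistent with observation 1.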