(This is my first attempt at using RPubs. Please forgive any errors.)

Q1A: Variable Types and Levels of Measurement:
PassengerId. Type: Qualitative/Categorical. Level: Nominal (i.e., no mathematical meaning, just a label).

Age. Type: Quantitative/Numerical. Level: Ratio (i.e., a true zero point exists).

my_data <- read.csv("C:/Users/leonedo/Downloads/train.csv")
head(my_data)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

Q1B: Most NAs by Column:
I still have more to learn about returning just “Age” and “177,” but from the output below I can see that Age has the most NAs (177). Additionally, Cabin has the most missing/blank data: its missing values are empty strings rather than NA, so is.na() reports 0 for it.

#List every column with its count of NAs
colSums(is.na(my_data))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
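One possible way to return just the column name and its count, rather than reading the full listing, is sketched below. The data frame `toy` and its values are made up for illustration and are not the Titanic file; the same two steps should apply to `my_data`. The second part counts empty strings separately, since `is.na()` does not flag them.

```r
# Toy data frame (hypothetical values) standing in for the Titanic data
toy <- data.frame(
  Age   = c(22, NA, 26, NA, 35),
  Cabin = c("", "C85", "", "", ""),
  Fare  = c(7.25, 71.28, 7.93, 53.10, 8.05),
  stringsAsFactors = FALSE
)

# Count NAs per column, then keep only the largest entry (name and count)
na_counts <- colSums(is.na(toy))
most_na <- na_counts[which.max(na_counts)]
print(most_na)

# Count empty strings per column; is.na() does not flag these
blank_counts <- sapply(toy, function(x) sum(!is.na(x) & x == ""))
print(blank_counts[which.max(blank_counts)])
```

Alternatively, passing `na.strings = c("", "NA")` to `read.csv()` should make blank cells load as NA, so a single `colSums(is.na(...))` call would cover both cases.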

Q2: Imputation:
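Age, SibSp, and Parch are ratio-level variables, so I imputed their missing values with the median. Per the NA counts above, only Age actually has missing values (177), so the SibSp and Parch steps leave the data unchanged.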

#Impute Age with median
my_data$Age[is.na(my_data$Age)] <- median(my_data$Age, na.rm = TRUE)

#Impute SibSp with median
my_data$SibSp[is.na(my_data$SibSp)] <- median(my_data$SibSp, na.rm = TRUE)

#Impute Parch with median
my_data$Parch[is.na(my_data$Parch)] <- median(my_data$Parch, na.rm = TRUE)
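The three imputation steps above repeat the same pattern, so they could be folded into one small helper. This is only a sketch: `impute_median` is a hypothetical helper name, and `toy` is a made-up data frame used so the example runs on its own.

```r
# Hypothetical helper: replace NAs in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data (made-up values); only Age contains NAs here
toy <- data.frame(
  Age   = c(22, NA, 26, 35, NA),
  SibSp = c(1, 1, 0, 1, 0),
  Parch = c(0, 0, 0, 0, 0)
)

# Apply the helper to each column of interest
cols <- c("Age", "SibSp", "Parch")
toy[cols] <- lapply(toy[cols], impute_median)

# Confirm no NAs remain in the imputed columns
print(colSums(is.na(toy[cols])))
```

Columns without NAs pass through unchanged, so it is safe to include them in `cols`.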

Q3: Observations on Age, SibSp, and Parch:
Age: The mean is approximately 29.7 years, with ages ranging from infants (0.42) up to 80.
SibSp: Interestingly, most passengers traveled without siblings or spouses (median 0).
Parch: Most traveled without parents or children (median 0).

library(psych)
#Re-reading the CSV restores the original (un-imputed) data, so Age below has n = 714
my_data <- read.csv("C:/Users/leonedo/Downloads/train.csv")
describe(my_data$Age)
##    vars   n mean    sd median trimmed   mad  min max range skew kurtosis   se
## X1    1 714 29.7 14.53     28   29.27 13.34 0.42  80 79.58 0.39     0.16 0.54
describe(my_data$SibSp)
##    vars   n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.52 1.1      0    0.27   0   0   8     8 3.68    17.73 0.04
describe(my_data$Parch)
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.38 0.81      0    0.18   0   0   6     6 2.74     9.69 0.03

Q4: Observations of Cross-Tabulation of Survived and Sex:
Historically, I believe the evacuation protocol in a disaster was “women and children first.” This cross-tabulation clearly shows the effect of that on survival by sex. Nearly 6 times as many males as females did NOT survive (468 vs. 81), and of the passengers who survived, over twice as many were women (233 vs. 109).

cross_tab <- table(my_data$Survived, my_data$Sex)
cat("Cross-tabulation (Rows: Survived, Columns: Sex):\n")
## Cross-tabulation (Rows: Survived, Columns: Sex):
print(cross_tab)
##    
##     female male
##   0     81  468
##   1    233  109
cat("\n")
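Raw counts can be hard to compare because there were more men than women aboard. A minimal sketch of converting the table to within-sex proportions with `prop.table()` follows. The counts are copied from the cross-tabulation output above and typed in by hand so the example runs without the CSV.

```r
# Rebuild the cross-tabulation from the counts printed above
cross_tab <- as.table(matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
))

# margin = 2 divides each column by its total, so each sex sums to 1
by_sex <- prop.table(cross_tab, margin = 2)
print(round(by_sex, 3))
```

Read this way, about 74% of women survived (233 of 314) compared with about 19% of men (109 of 577).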

Q5: Observations of Boxplot:
1. Median age does not appear to differ significantly between those who survived and those who did not.
2. Age alone does not appear to have been a strong factor in survival.

# Create the boxplot
boxplot(my_data$Age ~ my_data$Survived, 
        notch = TRUE, 
        horizontal = TRUE,
        main = "Age Distribution by Survival Status",
        xlab = "Age (years)",
        ylab = "Survived? (0 = No, 1 = Yes)",
        col = c("orange", "blue"),
        names = c("Died (0)", "Survived (1)"))
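The visual impression from the boxplot could be backed up numerically. The sketch below compares group medians with `tapply()` and runs a Wilcoxon rank-sum test, which is one possible choice of test and not something the assignment specifies. The data frame `toy` holds made-up ages, so its results say nothing about the real passengers; the same calls should work on `my_data`.

```r
# Toy data (made-up ages) with the same two columns used in the boxplot
toy <- data.frame(
  Survived = c(0, 0, 0, 0, 1, 1, 1, 1),
  Age      = c(22, 35, 54, 2, 38, 26, 27, 14)
)

# Median age within each survival group
medians <- tapply(toy$Age, toy$Survived, median)
print(medians)

# Rank-based comparison of the two age distributions;
# exact = FALSE uses the normal approximation
test <- wilcox.test(Age ~ Survived, data = toy, exact = FALSE)
print(test$p.value)
```

A large p-value would be consistent with the overlapping boxes, though it would not by itself show that age had no effect.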