HW1

read.csv("C:/R/train.csv")

Q1a:
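PassengerId: qualitative, nominal (an arbitrary identifier whose order carries no meaning)
Age: quantitative, ratio (it has a true zero, so ratios of ages are meaningful)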
Q1b:

mydata <- read.csv('C:/R/train.csv')
missing_values <- colSums(is.na(mydata))
print(missing_values)

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Age has the most missing observations (177); no other variable shows any.
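One caveat about these counts: `is.na()` does not flag empty strings, so a character column such as Cabin or Embarked would report 0 even if some of its fields were blank in the file. The sketch below illustrates this on a made-up three-row sample (not the real data); the `csv_text`, `default_read`, and `strict_read` names are hypothetical, and whether the real file contains blank fields is not checked here.

```r
# Hypothetical three-row sample; the values are made up for illustration.
csv_text <- "Age,Cabin,Embarked
22,,S
NA,C85,
35,,Q"

# Default import: empty character fields stay as "" and are not counted as NA.
default_read <- read.csv(text = csv_text)
print(colSums(is.na(default_read)))

# Assumption: blank fields should count as missing, so list "" in na.strings.
strict_read <- read.csv(text = csv_text, na.strings = c("", "NA"))
print(colSums(is.na(strict_read)))
```

If the blank fields matter for the question, the same `na.strings` argument could be passed when reading train.csv.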

Q2
SibSp and Parch have no missing values, so I impute missing observations only for Age, using the median.

mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm = TRUE)
missing_values <- colSums(is.na(mydata))
print(missing_values)

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

library(psych)
describe(mydata$Age)

describe(mydata$SibSp)

describe(mydata$Parch)

table(mydata$Survived, mydata$Sex)

##    
##     female male
##   0     81  468
##   1    233  109

I noticed that
1. There are far more men (577) than women (314).
2. About 81% of men died, while only about 26% of women died.
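The percentages can be checked directly from the counts in the table. This is a minimal sketch that re-enters the four counts by hand; the `counts` and `pct` names are only illustrative, and on the real data the same result should come from `prop.table(table(mydata$Survived, mydata$Sex), margin = 2)`.

```r
# Counts copied from the Survived-by-Sex table (rows: 0 = died, 1 = survived).
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# Total passengers of each sex.
print(colSums(counts))

# Column percentages: share of each sex that died or survived.
pct <- round(100 * prop.table(counts, margin = 2), 1)
print(pct)
```

Using `margin = 2` normalizes within each sex, which is what the comparison of death rates between men and women needs.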

Q5

boxplot(Age ~ Survived, data = mydata, notch = TRUE, horizontal = TRUE)

I noticed that
1. Passengers who died have more outliers in age.
2. The median ages of passengers who survived and passengers who died are close.
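Both observations can also be checked numerically instead of by eye. The sketch below uses a small made-up data frame (`toy`, not the real data) and assumes the default boxplot rule, where points more than 1.5 times the box length beyond the hinges count as outliers; on the real data, `mydata$Age` and `mydata$Survived` would take the place of the toy columns.

```r
# Hypothetical ages and survival flags, made up for illustration.
toy <- data.frame(
  Age = c(2, 22, 24, 26, 28, 30, 71, 19, 25, 27, 29, 33),
  Survived = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# Split ages by survival group.
age_by_group <- split(toy$Age, toy$Survived)

# Median age in each group.
medians <- sapply(age_by_group, median)
print(medians)

# Points that boxplot() would draw beyond the whiskers, per group.
outliers <- lapply(age_by_group, function(x) boxplot.stats(x)$out)
n_outliers <- sapply(outliers, length)
print(n_outliers)
```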

Yinda Chen

2023-09-08