read.csv("C:/R/train.csv")
Q1a:
PassengerId: quantitative & ordinal
Age: quantitative & interval
Q1b:
mydata <- read.csv('C:/R/train.csv')
missing_values <- colSums(is.na(mydata))
print(missing_values)
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Age has the most missing observations (177); every other column reports 0.
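One thing that may be worth checking: is.na() only counts values that read.csv() actually parsed as NA. With the default na.strings = "NA", blank fields in text columns are read as empty strings, so they would show up as 0 here. Below is a small sketch on a made-up three-row CSV (the toy data and the names csv_text, toy_default and toy_blank_na are my own, not from train.csv) showing how the counts change if blanks are also treated as missing.

```r
# Hypothetical toy CSV: the Cabin field is blank in two rows, Age is NA in one.
csv_text <- "Name,Cabin,Age\nA,C85,22\nB,,NA\nC,,38"

# Default behaviour: blank text fields stay as "" and are not counted as NA.
toy_default <- read.csv(text = csv_text)
print(colSums(is.na(toy_default)))

# Treat empty strings as missing as well.
toy_blank_na <- read.csv(text = csv_text, na.strings = c("", "NA"))
print(colSums(is.na(toy_blank_na)))
```

If the same option were passed when reading train.csv, text columns such as Cabin or Embarked might report missing values too; I have not verified that here.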
Q2
SibSp and Parch have no missing values, so I impute missing observations only for Age, using the median of the observed ages.
mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm = TRUE)
missing_values <- colSums(is.na(mydata))
print(missing_values)
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
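To make the imputation step easier to follow, here is a minimal sketch of the same median replacement on a small made-up vector (the values and the name toy_age are illustrative only, not taken from the data set).

```r
# Hypothetical ages with two missing entries.
toy_age <- c(22, NA, 38, 26, NA)

# Median of the observed values only (22, 26, 38).
age_median <- median(toy_age, na.rm = TRUE)

# Replace each NA with that median; observed values are left unchanged.
toy_age[is.na(toy_age)] <- age_median

print(age_median)
print(toy_age)
```

Median imputation keeps the centre of the distribution in place, but it does shrink the spread, since every imputed value sits exactly at the median.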
Q3
library(psych)
describe(mydata$Age)
describe(mydata$SibSp)
describe(mydata$Parch)
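If the psych package is not installed, similar summary statistics can be obtained with base R. This is only a sketch on a made-up data frame (toy and stats are hypothetical names), and it covers fewer statistics than describe() reports.

```r
# Hypothetical toy data with the same three column names.
toy <- data.frame(
  Age   = c(22, 38, 26, 35),
  SibSp = c(1, 1, 0, 1),
  Parch = c(0, 0, 0, 2)
)

# One column of summary statistics per variable.
stats <- sapply(toy, function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    median = median(x), min = min(x), max = max(x))
})

print(round(stats, 2))
```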
Q4
table(mydata$Survived, mydata$Sex)
##
## female male
## 0 81 468
## 1 233 109
I noticed that
1. There are far more men (577) than women (314).
2. 81% of men died (468 of 577), while only 26% of women died (81 of 314).
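The percentages can be reproduced from the counts in the table. The sketch below re-enters the four counts by hand (tab is a name I made up); on the real data, prop.table(table(mydata$Survived, mydata$Sex), margin = 2) should give the same result.

```r
# Counts copied from the cross-table above (rows: Survived, columns: Sex).
tab <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# Totals per sex.
print(colSums(tab))

# Column proportions: share of each sex that died (row "0") or survived (row "1").
props <- prop.table(tab, margin = 2)
print(round(props, 2))
```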
Q5
boxplot(mydata$Age ~ mydata$Survived, notch = TRUE, horizontal = TRUE)
I noticed that
1. The passengers who died show more outliers.
2. The median ages of the survivors and of those who died are close.
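Both observations can also be checked numerically rather than read off the plot. The sketch below uses made-up ages and survival flags (age, survived, medians and n_outliers are hypothetical names); boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses by default to flag outliers.

```r
# Hypothetical toy data: ages and survival flags (0 = died, 1 = survived).
age      <- c(22, 38, 26, 35, 28, 2, 80, 28, 28, 30)
survived <- c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1)

# Median age within each group.
medians <- tapply(age, survived, median)

# Number of points flagged as outliers within each group.
n_outliers <- tapply(age, survived, function(x) length(boxplot.stats(x)$out))

print(medians)
print(n_outliers)
```

On the real data the same two calls would take mydata$Age and mydata$Survived in place of the toy vectors.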