Load the data from train.csv.
mydata <- read.csv("C:/Users/dingz/Downloads/train.csv")
str(mydata)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
class(mydata$PassengerId)
## [1] "integer"
class(mydata$Age)
## [1] "numeric"
PassengerId is a qualitative variable with a nominal level of measurement because it serves only as an identifier: its numeric values carry no order or magnitude. Age is a quantitative variable with a ratio level of measurement because it has a true zero point.
missing_variable <- colSums(is.na(mydata))
x <- names(missing_variable)[which.max(missing_variable)]
print(x)
## [1] "Age"
sum(is.na(mydata$Age))
## [1] 177
Age has the most missing observations (NA values): 177.
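The count above covers only values that R stores as NA. The str() output shows that Cabin contains empty strings (""), which is.na() does not flag. The following is a minimal sketch of counting both kinds of gaps, using a small hypothetical data frame (`toy`) rather than train.csv itself:

```r
# Hypothetical toy data frame mimicking two columns of train.csv.
toy <- data.frame(
  Age   = c(22, NA, 26, NA),
  Cabin = c("", "C85", "", ""),
  stringsAsFactors = FALSE
)

# Count NA values plus empty strings in each column.
# `%in%` returns FALSE (not NA) for missing values, so the sum is always defined.
missing_or_blank <- sapply(toy, function(col) sum(is.na(col) | col %in% ""))
print(missing_or_blank)
```

An alternative, under the same assumption that blanks mean "missing", would be to pass na.strings = c("", "NA") to read.csv() so that empty fields are read as NA from the start.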
mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE)
mydata$SibSp[is.na(mydata$SibSp)] <- median(mydata$SibSp, na.rm=TRUE)
mydata$Parch[is.na(mydata$Parch)] <- median(mydata$Parch, na.rm=TRUE)
summary(mydata$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 22.00 28.00 29.36 35.00 80.00
summary(mydata$SibSp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.523 1.000 8.000
summary(mydata$Parch)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3816 0.0000 6.0000
library(psych)
describe(mydata$Age)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 29.36 13.02 28 28.83 8.9 0.42 80 79.58 0.51 0.97 0.44
describe(mydata$SibSp)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.52 1.1 0 0.27 0 0 8 8 3.68 17.73 0.04
describe(mydata$Parch)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.38 0.81 0 0.18 0 0 6 6 2.74 9.69 0.03
hist(mydata$Age, main = "Age", ylab = "People" , col = "pink")
hist(mydata$SibSp, main = "Siblings/Spouses", ylab = "People" , col = "purple")
hist(mydata$Parch, main = "Parents/Children", ylab = "People" , col= "blue")
Age is slightly right-skewed (skewness 0.51), with tails somewhat heavier than those of a normal distribution (kurtosis 0.97).
SibSp is strongly right-skewed (skewness 3.68), with very heavy tails (kurtosis 17.73): most passengers have 0 or 1 siblings/spouses and a few have many.
Parch is strongly right-skewed (skewness 2.74), with heavy tails (kurtosis 9.69): most passengers have 0 parents/children aboard.
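To make the skewness and kurtosis figures less of a black box, here is a minimal base-R sketch of the moment-based formulas. It assumes the "type 3" definitions, which I believe match the default of psych::describe(); treat that correspondence as an assumption. The helper names `skew_type3` and `excess_kurtosis_type3` are hypothetical.

```r
# Moment-based skewness (assumed to match psych's default, type = 3).
skew_type3 <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)
  m2 <- sum(d^2) / n   # second central moment
  m3 <- sum(d^3) / n   # third central moment
  (m3 / m2^1.5) * ((n - 1) / n)^1.5
}

# Excess kurtosis: 0 for a normal distribution, positive for heavier tails.
excess_kurtosis_type3 <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)
  m2 <- sum(d^2) / n
  m4 <- sum(d^4) / n   # fourth central moment
  (m4 / m2^2) * (1 - 1 / n)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)
right_tailed <- c(0, 0, 0, 1, 8)
print(skew_type3(symmetric))      # symmetric data: skewness near zero
print(skew_type3(right_tailed))   # long right tail: positive skewness
print(excess_kurtosis_type3(symmetric))
```

A positive skewness means the long tail is on the right, which is the pattern seen in all three histograms.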
table(mydata$Survived, mydata$Sex)
##
## female male
## 0 81 468
## 1 233 109
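Women had a much higher survival rate than men: 233 of 314 women survived (about 74%), compared with 109 of 577 men (about 19%).
Far more men died (468) than women (81).
This is consistent with women being given priority for evacuation when the Titanic sank; the table by itself does not show how children or the elderly fared.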
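Because there are many more men than women in the data, raw counts are harder to compare than proportions. As a small sketch, the counts from the table above can be re-entered by hand and converted to column proportions with prop.table(); the object name `tab` is just an illustration.

```r
# Counts copied from the table(mydata$Survived, mydata$Sex) output.
tab <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# margin = 2 divides each cell by its column total,
# giving the share of each sex that died (row "0") or survived (row "1").
p <- prop.table(tab, margin = 2)
print(round(p, 3))
```

On the full data, the same result should come from prop.table(table(mydata$Survived, mydata$Sex), margin = 2).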
boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=TRUE, ylab = "Survived", xlab = "Age", main = "Age by Survival Status")
describe(mydata$Age[mydata$Survived==1])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 342 28.29 13.76 28 28.19 10.38 0.42 80 79.58 0.21 0.43 0.74
describe(mydata$Age[mydata$Survived==0])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 549 30.03 12.5 28 29.09 8.9 1 74 73 0.8 1.29 0.53
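The age distributions of survivors and victims were not very different: both groups have a median of 28, which is also the value used to fill in the 177 missing ages and so likely pulls both medians toward it. Survivors span a slightly wider range (0.42 to 80, versus 1 to 74 for victims) and have a larger standard deviation (13.76 versus 12.5), which suggests that relatively more very young and very old passengers are among the survivors. This is consistent with the suggestion in Q4 that women, children and the elderly were evacuated first when the Titanic sank.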
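One way to check the claim about young and old passengers more directly would be to group ages into bands and compare survival rates per band. The sketch below uses a small hypothetical data frame (`toy`), not train.csv, and the cut-offs of 12 and 60 years are arbitrary assumptions chosen only for illustration.

```r
# Hypothetical toy data; values are NOT taken from train.csv.
toy <- data.frame(
  Age      = c(2, 8, 15, 22, 28, 35, 47, 63, 71, 80),
  Survived = c(1, 1, 0, 0, 1, 0, 0, 0, 0, 1)
)

# Assumed age bands: (0, 12] child, (12, 60] adult, (60, Inf) elderly.
toy$AgeBand <- cut(
  toy$Age,
  breaks = c(0, 12, 60, Inf),
  labels = c("child", "adult", "elderly")
)

# The mean of a 0/1 indicator is the survival rate within each band.
rate <- tapply(toy$Survived, toy$AgeBand, mean)
print(rate)
```

Applied to mydata, it may be worth running this on the ages before imputation, since the 177 filled-in values all land in the same band.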