Load the data from train.csv.

mydata <- read.csv("C:/Users/dingz/Downloads/train.csv")
str(mydata)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Q1 a. What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?

class(mydata$PassengerId)
## [1] "integer"
class(mydata$Age)
## [1] "numeric"

PassengerId is a qualitative variable with a nominal level of measurement because it is only an identifier: although R stores it as an integer, its values carry no meaningful quantity or order. Age is a quantitative variable with a ratio level of measurement because it has a true zero point.

Q1 b. Which variable has the most missing observations? You could have Googled for “Count NA values in R for all columns in a dataframe” or something like that.

missing_variable <- colSums(is.na(mydata))
x <- names(missing_variable)[which.max(missing_variable)]
print(x)
## [1] "Age"
sum(is.na(mydata$Age))
## [1] 177

Age has the most missing observations, with 177 NA values.
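Note that `is.na()` only counts values that were read as `NA`. Character columns such as Cabin hold blanks as empty strings (see the `str()` output above), and those are not included in the count. The following is a minimal sketch of counting both kinds of missing entries; `toy` is a small made-up data frame standing in for the real file, not the Titanic data.

```r
# Hypothetical toy data frame standing in for train.csv (values are made up).
toy <- data.frame(
  Age   = c(22, NA, 26, NA),
  Cabin = c("", "C85", "", ""),
  Sex   = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# NA values per column (the same count used in the answer above).
na_count <- colSums(is.na(toy))

# Empty strings per column; comparing NA with "" gives NA, so drop those with na.rm.
blank_count <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

# Total missing-like entries per column.
missing_total <- na_count + blank_count
print(missing_total)
```

An alternative is to pass `na.strings = c("", "NA")` to `read.csv()` so that blanks become `NA` when the file is loaded.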

Q2 Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode.

mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE)
mydata$SibSp[is.na(mydata$SibSp)] <- median(mydata$SibSp, na.rm=TRUE)
mydata$Parch[is.na(mydata$Parch)] <- median(mydata$Parch, na.rm=TRUE)
summary(mydata$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   28.00   29.36   35.00   80.00
summary(mydata$SibSp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.523   1.000   8.000
summary(mydata$Parch)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3816  0.0000  6.0000
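Q2 also mentions the column mode, which applies to nominal variables. Base R has no function for the statistical mode (`mode()` returns the storage type), so the sketch below defines one. The helper names `stat_mode` and `impute` and the example vectors are assumptions made for illustration.

```r
# Hypothetical helper: most frequent non-missing value (ties go to the first one seen).
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: replace NAs with a fill value computed from the observed data.
impute <- function(x, fill_fun) {
  x[is.na(x)] <- fill_fun(x)
  x
}

embarked <- c("S", "C", NA, "S", "Q")   # made-up nominal example
ages     <- c(22, NA, 26, 35, NA)       # made-up numeric example

embarked_filled <- impute(embarked, stat_mode)
ages_filled     <- impute(ages, function(v) median(v, na.rm = TRUE))
```

Passing the fill function as an argument keeps the median and mode cases in one code path.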

Q3 Install the psych package in R: install.packages("psych"). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)). Please comment on what you observe from the summary statistics.

library(psych)
describe(mydata$Age)
##    vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## X1    1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44
describe(mydata$SibSp)
##    vars   n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.52 1.1      0    0.27   0   0   8     8 3.68    17.73 0.04
describe(mydata$Parch)
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.38 0.81      0    0.18   0   0   6     6 2.74     9.69 0.03
hist(mydata$Age, main = "Age", ylab = "People" , col = "pink")

hist(mydata$SibSp, main = "Siblings/Spouses", ylab = "People" , col = "purple")

hist(mydata$Parch,  main = "Parents/Children", ylab = "People" , col= "blue")

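Age is slightly right-skewed (skewness 0.51) with a kurtosis of 0.97, so its tails are a little heavier than those of a normal distribution. Because the 177 missing ages were replaced by the median in Q2, the distribution is also concentrated around 28.

SibSp is strongly right-skewed (skewness 3.68) and heavy-tailed (kurtosis 17.73): the median is 0, so at least half of the passengers had no siblings or spouse aboard, while a few had as many as 8.

Parch is also strongly right-skewed (skewness 2.74) and heavy-tailed (kurtosis 9.69): the median is 0 and the maximum is 6.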
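To make the two statistics concrete, here is a minimal base-R sketch of moment-based skewness and excess kurtosis. The function names and the `counts` vector are made up for illustration, and `describe()` may use a slightly different small-sample scaling, so the results need not match its output to every decimal.

```r
# Moment-based skewness: third central moment over the 1.5 power of the variance.
skewness_m <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared variance, minus 3
# (3 is the value for a normal distribution).
excess_kurtosis_m <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

counts <- c(0, 0, 0, 0, 1, 1, 2, 8)  # made-up, SibSp-like: mostly zeros, one large value
skewness_m(counts)          # positive: long right tail
excess_kurtosis_m(counts)   # positive: heavier tails than a normal distribution
```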

Q4 Provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)). What do you notice?

table(mydata$Survived, mydata$Sex)
##    
##     female male
##   0     81  468
##   1    233  109

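Women survived at a much higher rate than men: 233 of 314 women (about 74%) survived, compared with 109 of 577 men (about 19%).

Most of those who died were men: 468 men versus 81 women.

This is consistent with women being given priority during the evacuation. The table says nothing about age, so it cannot by itself show whether children or the elderly were also prioritized.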
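Raw counts are hard to compare because there were far more men than women aboard. A short sketch using `prop.table()` converts the counts into within-sex proportions; the matrix `tab` is rebuilt by hand from the table output above so the example runs without the data file.

```r
# Counts copied from the cross-tabulation above (rows: Survived, columns: Sex).
tab <- matrix(c(81, 233, 468, 109), nrow = 2,
              dimnames = list(Survived = c("0", "1"), Sex = c("female", "male")))

# Column proportions: within each sex, the share who died (0) and survived (1).
rates <- prop.table(tab, margin = 2)
print(round(rates, 3))
```

With the real data, `prop.table(table(mydata$Survived, mydata$Sex), margin = 2)` gives the same result directly.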

Q5. Provide notched boxplots for Survived and Age (e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?

boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=TRUE, ylab = "Survived", xlab = "Age", main = "Age by survival status")

describe(mydata$Age[mydata$Survived==1])
##    vars   n  mean    sd median trimmed   mad  min max range skew kurtosis   se
## X1    1 342 28.29 13.76     28   28.19 10.38 0.42  80 79.58 0.21     0.43 0.74
describe(mydata$Age[mydata$Survived==0])
##    vars   n  mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 549 30.03 12.5     28   29.09 8.9   1  74    73  0.8     1.29 0.53

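The age distributions of survivors and non-survivors are similar. Both groups have a median of 28, so the notches overlap and the plot gives no evidence that the median ages differ. The shared median is partly an artifact of Q2, where 177 missing ages were replaced with 28. Survivors were slightly younger on average (mean 28.29 vs. 30.03) and their ages were somewhat more spread out (SD 13.76 vs. 12.50; range 0.42 to 80 vs. 1 to 74). That wider spread is compatible with more very young and very old passengers among the survivors, but the boxplot alone does not confirm that children and the elderly were evacuated first; comparing survival rates across age groups would test that directly.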
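The notches in a notched boxplot mark an approximate interval around each median; when two notches do not overlap, that is usually read as evidence that the medians differ. The sketch below shows how those limits can be obtained with `boxplot.stats()`. The two age vectors are simulated for illustration and are not the Titanic values.

```r
# Made-up ages for two groups; these are NOT the Titanic values.
set.seed(1)
age_a <- round(rnorm(200, mean = 28, sd = 13))
age_b <- round(rnorm(300, mean = 30, sd = 12))

# boxplot.stats() returns the notch limits in $conf:
# median +/- 1.58 * IQR / sqrt(n), with the IQR taken between the hinges.
notch_a <- boxplot.stats(age_a)$conf
notch_b <- boxplot.stats(age_b)$conf

# Two intervals overlap when each one starts before the other ends.
notches_overlap <- notch_a[1] <= notch_b[2] && notch_b[1] <= notch_a[2]
notches_overlap
```

Applied to the real data, the same call would be `boxplot.stats(mydata$Age[mydata$Survived == 1])$conf` and its counterpart for `Survived == 0`.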