# Data directory "Titanic"
mydata <- read.csv('/Users/pin.lyu/Desktop/titanic/test.csv')
a) What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?
# Print "PassengerId"
str(mydata$PassengerId)
## int [1:418] 892 893 894 895 896 897 898 899 900 901 ...
# Print "Age"
str(mydata$Age)
## num [1:418] 34.5 47 62 27 22 14 30 26 18 21 ...
Answer: “PassengerId” is qualitative. Although it is stored as an integer, the ID only labels each passenger and its numeric value carries no order or magnitude, so it is nominal. “Age” is quantitative and is measured on a ratio scale: it has a true zero (an age of 0 means no time has elapsed since birth), so ratios are meaningful, e.g. a 40-year-old is twice as old as a 20-year-old. A missing age is a different matter; it is recorded as NA, not as 0.
# Number of 0s in "SibSp"
table(mydata$SibSp)
##
## 0 1 2 3 4 5 8
## 283 110 14 4 4 1 2
# Number of 0s in "Parch"
table(mydata$Parch)
##
## 0 1 2 3 4 5 6 9
## 324 52 33 3 2 1 1 2
# Frequency table of "Age" (table() omits NAs by default)
table(mydata$Age)
##
## 0.17 0.33 0.75 0.83 0.92 1 2 3 5 6 7 8 9 10 11.5 12
## 1 1 1 1 1 3 2 1 1 3 1 2 2 2 1 2
## 13 14 14.5 15 16 17 18 18.5 19 20 21 22 22.5 23 24 25
## 3 2 1 1 2 7 13 3 4 8 17 16 1 11 17 11
## 26 26.5 27 28 28.5 29 30 31 32 32.5 33 34 34.5 35 36 36.5
## 12 1 12 7 1 10 15 6 6 2 6 1 1 5 9 1
## 37 38 38.5 39 40 40.5 41 42 43 44 45 46 47 48 49 50
## 3 3 1 6 5 1 5 5 4 1 9 3 5 5 3 5
## 51 53 54 55 57 58 59 60 60.5 61 62 63 64 67 76
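## 1 3 2 6 3 1 1 3 1 2 1 2 3 1 1
Answer: From the tables above, “SibSp” has 283 zeros and “Parch” has 324 zeros, but a zero is a valid observation (no siblings/spouses or no parents/children aboard), not a missing value. Both tables sum to 418, so neither variable has missing entries. The “Age” table contains no zeros; its counts sum to 332 of the 418 rows, so 86 ages are missing (NA), which table() omits by default. Hence, “Age” is the variable in this data set with the most missing entries.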
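Because table() drops NAs unless told otherwise, zeros and missing values are easy to confuse. The following is a minimal sketch of counting them separately. It uses a small made-up data frame, `toy`, as a stand-in for the Titanic columns; the values are hypothetical.

```r
# Hypothetical toy data standing in for the Titanic columns
toy <- data.frame(
  Age   = c(34.5, NA, 62, NA, 22),
  SibSp = c(0, 1, 0, 0, 2),
  Parch = c(0, 0, 0, 1, 0)
)

# Missing values per column: is.na() flags NAs, colSums() counts the TRUEs
na_counts <- colSums(is.na(toy))

# Zeros per column: a zero is an observed value, so NAs are skipped here
zero_counts <- colSums(toy == 0, na.rm = TRUE)

print(na_counts)
print(zero_counts)

# table() can also report NAs when asked explicitly
print(table(toy$Age, useNA = "ifany"))
```

On the real data, the same idea would be colSums(is.na(mydata[, c("Age", "SibSp", "Parch")])).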
Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode. To do so, use something like this: mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE). You can read up on indexing in a data frame (i.e. “dataframe_name$variable_name”).
# Calculate Median
median_age <- median(mydata$Age, na.rm=TRUE)
# Turning N/As into median in "Age"
mydata$Age[is.na(mydata$Age)] <- median_age
summary(mydata)
## PassengerId Pclass Name Sex
## Min. : 892.0 Min. :1.000 Length:418 Length:418
## 1st Qu.: 996.2 1st Qu.:1.000 Class :character Class :character
## Median :1100.5 Median :3.000 Mode :character Mode :character
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 Length:418
## 1st Qu.:23.00 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :27.00 Median :0.0000 Median :0.0000 Mode :character
## Mean :29.60 Mean :0.4474 Mean :0.3923
## 3rd Qu.:35.75 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :76.00 Max. :8.0000 Max. :9.0000
##
## Fare Cabin Embarked
## Min. : 0.000 Length:418 Length:418
## 1st Qu.: 7.896 Class :character Class :character
## Median : 14.454 Mode :character Mode :character
## Mean : 35.627
## 3rd Qu.: 31.500
## Max. :512.329
## NA's :1
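Answer: The summary no longer lists any NA's under “Age”, which means that we have successfully replaced the missing ages with the median (27). The minimum of 0.17 is an observed age of an infant and is not affected by the imputation. Next, we will apply the same procedure to “SibSp” and “Parch”; since neither has missing values, their summaries should stay the same.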
# Same process on "SibSp"
median_SibSp <- median(mydata$SibSp, na.rm = TRUE)
mydata$SibSp[is.na(mydata$SibSp)] <- median_SibSp
# Same process on "Parch"
median_Parch <- median(mydata$Parch, na.rm = TRUE)
mydata$Parch[is.na(mydata$Parch)] <- median_Parch
summary(mydata)
## PassengerId Pclass Name Sex
## Min. : 892.0 Min. :1.000 Length:418 Length:418
## 1st Qu.: 996.2 1st Qu.:1.000 Class :character Class :character
## Median :1100.5 Median :3.000 Mode :character Mode :character
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 Length:418
## 1st Qu.:23.00 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :27.00 Median :0.0000 Median :0.0000 Mode :character
## Mean :29.60 Mean :0.4474 Mean :0.3923
## 3rd Qu.:35.75 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :76.00 Max. :8.0000 Max. :9.0000
##
## Fare Cabin Embarked
## Min. : 0.000 Length:418 Length:418
## 1st Qu.: 7.896 Class :character Class :character
## Median : 14.454 Mode :character Mode :character
## Mean : 35.627
## 3rd Qu.: 31.500
## Max. :512.329
## NA's :1
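The question also allows imputing with the column mode, and base R has no built-in function for the statistical mode. As a sketch under that assumption, the helpers `col_mode` and `impute` below are hypothetical names introduced for illustration, and `toy` is made-up data rather than the Titanic file.

```r
# Hypothetical helper: most frequent non-missing value of a vector
col_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: replace NAs with fill_fun() of the observed values
impute <- function(x, fill_fun = median) {
  x[is.na(x)] <- fill_fun(x[!is.na(x)])
  x
}

# Made-up data with a few missing entries
toy <- data.frame(
  Age   = c(20, NA, 30, 40, NA),
  SibSp = c(0, 0, NA, 1, 0)
)

toy$Age   <- impute(toy$Age, median)      # median of 20, 30, 40
toy$SibSp <- impute(toy$SibSp, col_mode)  # most frequent observed value

print(toy)
```

Wrapping the replacement in one function avoids repeating the same two lines per column, and the fill function can be swapped depending on the level of measurement.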
Install the psych package in R: install.packages('psych'). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)).
# Load the "psych" package
library(psych)
# Descriptive stats of "Age", "SibSp", and "Parch"
describe(mydata$Age)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 418 29.6 12.7 27 28.8 7.41 0.17 76 75.83 0.66 0.88 0.62
describe(mydata$SibSp)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 418 0.45 0.9 0 0.28 0 0 8 8 4.14 26.03 0.04
describe(mydata$Parch)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 418 0.39 0.98 0 0.16 0 0 9 9 4.62 30.86 0.05
Provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)). What do you notice?
# "Survived" is not in test.csv, so load the training set
mydata2 <- read.csv('/Users/pin.lyu/Desktop/titanic/train.csv')
table(mydata2$Survived, mydata2$Sex)
##
## female male
## 0 81 468
## 1 233 109
Answer: Survival differs sharply between the two sexes. Of the 314 women, 233 survived (about 74%), whereas of the 577 men only 109 survived (about 19%). This makes sense because women and children were given priority to board the lifeboats.
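Raw counts can hide the size of the gap because there are more men than women in the data. A small sketch of turning the cross-tabulation into survival rates per sex follows; the counts are copied from the table output above, and the object names `tab` and `rates` are illustrative.

```r
# Counts copied from the cross-tabulation above
tab <- as.table(matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
))

# margin = 2 divides each cell by its column total,
# giving the share of each sex that died (0) or survived (1)
rates <- prop.table(tab, margin = 2)

print(round(rates, 3))
```

With the data frame loaded, the equivalent call would be prop.table(table(mydata2$Survived, mydata2$Sex), margin = 2).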
Provide notched boxplots for Survived and Age (e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?
boxplot(mydata2$Age ~ mydata2$Survived,
        notch = TRUE,
        horizontal = TRUE,
        xlab = "Age",
        ylab = "Survived & Died",
        main = "Survival Stats Across Different Age",
        col = "skyblue",
        border = "navy")
Answer: The plot compares the age distributions of passengers who died (0) and passengers who survived (1); it is not split by sex. Most of the passengers who survived were roughly 20 to 40 years old, with a median age of around 28. The plot also suggests that a good number of children survived. However, this interpretation is limited by missing data: the plot uses the training set (mydata2), whose missing ages were not imputed, so boxplot() silently leaves those passengers out.
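The notches give a rough interval around each median; R's boxplot documentation describes non-overlapping notches as strong evidence that two medians differ. As a sketch of where the notch comes from, the code below compares the value returned by boxplot.stats() with a manual calculation of median ± 1.58 × (hinge spread) / √n. The ages are simulated and are not the Titanic data.

```r
set.seed(1)

# Simulated ages, NOT the Titanic data
ages <- round(runif(200, min = 1, max = 70))

# boxplot.stats() returns the notch limits in $conf
stats <- boxplot.stats(ages)

# Manual notch: median +/- 1.58 * (upper hinge - lower hinge) / sqrt(n)
hinges <- fivenum(ages)[c(2, 4)]
manual_notch <- median(ages) +
  c(-1.58, 1.58) * diff(hinges) / sqrt(length(ages))

print(stats$conf)
print(manual_notch)
```

Because the notch width shrinks with √n, a group with many observations gets a narrow notch even when its ages are widely spread.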