HW1_Aritra

#Clearing the global environment
rm(list = ls())

#Loading the dataset
df_t = read.csv('/Users/aritraray/Downloads/train.csv')

Q1 a. What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?

Age is quantitative and PassengerId is qualitative.
The level of measurement for Age is ratio. Ages can be ordered (one person is older or younger than another), differences between ages are meaningful, and there is a true zero, so ratios are meaningful too: one person can be twice as old as another.
The level of measurement for PassengerId is nominal. Although it is stored as a number, it only labels each record uniquely; its order and magnitude carry no meaning.

Q1b. Which variable has the most missing observations?

Cabin has the most missing values (687), followed by Age (177) and Embarked (2).

#Replace empty and single-space strings with NA values
df_t[df_t == "" | df_t == " "] <- NA

#Count the number of NA values per variable in the data frame
colSums(is.na(df_t))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
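
As an alternative to replacing blanks after loading, they can be turned into NA while the file is read. The sketch below is a minimal illustration, assuming the `na.strings` argument of `read.csv` is acceptable for this assignment; the inline CSV and the names `csv_text`, `toy`, and `na_counts` are made up for the example and are not part of train.csv.

```r
# Toy CSV standing in for train.csv (values are made up for illustration)
csv_text <- "PassengerId,Age,Cabin,Embarked
1,22,,S
2,,C85,
3,26, ,Q"

# na.strings converts empty and single-space fields to NA while reading
toy <- read.csv(text = csv_text, na.strings = c("", " ", "NA"))

# Missing values per column, sorted so the column with the most NAs comes first
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
na_counts
```

Sorting the counts makes the answer to "which variable has the most missing observations" the first element of the result.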

Q2. Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode.

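There are no missing values for SibSp and Parch, so only Age needs imputation. Age is a ratio variable, so its missing values are replaced with the column median.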

#Finding the median of Age and replacing its NA values with it
df_t$Age[is.na(df_t$Age)] <- median(df_t$Age, na.rm=TRUE)
#Not needed here because SibSp and Parch have no NA values
#df_t$SibSp[is.na(df_t$SibSp)] <- median(df_t$SibSp, na.rm=TRUE)
#df_t$Parch[is.na(df_t$Parch)] <- median(df_t$Parch, na.rm=TRUE)
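
The question also mentions the column mode, which applies to nominal variables. Base R has no built-in function for the statistical mode, so the sketch below defines a hypothetical helper, `col_mode`, and applies it to a made-up character vector rather than to the real data. It is one possible approach, not the method used in this homework.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
# Returns a character string; ties go to the first level in sorted order.
col_mode <- function(x) {
  counts <- table(x)  # table() drops NA values by default
  names(counts)[which.max(counts)]
}

# Toy vector standing in for a nominal column (made-up values)
embarked <- c("S", "C", NA, "S", "Q", NA, "S")

# Replace the missing entries with the mode
embarked[is.na(embarked)] <- col_mode(embarked)
embarked
```

Because `names()` returns character strings, a numeric column would need the result converted back with `as.numeric()` before assignment.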

Q3. Provide descriptive statistics for Age, SibSp, and Parch.

library(psych) # Use psych::describe rather than testthat::describe

describe(df_t[c("Age", "SibSp", "Parch")])

##       vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## Age      1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44
## SibSp    2 891  0.52  1.10      0    0.27 0.0 0.00   8  8.00 3.68    17.73 0.04
## Parch    3 891  0.38  0.81      0    0.18 0.0 0.00   6  6.00 2.74     9.69 0.03

Q4. Provide a cross-tabulation of Survived and Sex. What do you notice?

As seen below, 468 of the 577 males died (about 81%).
In contrast, only about 26% of females died (81 of 314), so survival was strongly associated with sex in this sample.

#Created a table, assigned headers and added margins
table_1 = table(df_t$Survived, df_t$Sex)
names(dimnames(table_1)) <- c("Survived", "Gender")
table_1 <- addmargins(table_1)
table_1

##         Gender
## Survived female male Sum
##      0       81  468 549
##      1      233  109 342
##      Sum    314  577 891
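
The percentages can be checked directly from the counts. The sketch below rebuilds the 2x2 table by hand from the output above (the object names `counts` and `pct` are chosen for this example) and uses `prop.table` with `margin = 2` to get the share of each outcome within each sex.

```r
# Counts copied from the cross-tabulation (rows: Survived, columns: Gender)
counts <- matrix(c(81, 233, 468, 109), nrow = 2,
                 dimnames = list(Survived = c("0", "1"),
                                 Gender = c("female", "male")))

# margin = 2 divides each cell by its column total
pct <- round(100 * prop.table(counts, margin = 2), 1)
pct
```

With the real data, the same call could be applied to the table before `addmargins` is used, so that the Sum row and column are not included in the proportions.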

Q5. Provide notched boxplots for Survived and Age. What do you notice?

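The notches, which mark an approximate 95% confidence interval around each median, overlap for the two groups. This suggests that the median age is similar for passengers who survived and those who died, with no strong evidence of a difference.
Several outliers appear among passengers older than about 55 who did not survive. Note that the 177 missing ages were imputed with the median in Q2, which narrows the boxes and likely causes more points to be flagged as outliers.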

#Created a horizontal notched boxplot of Age by survival status
boxplot(Age ~ Survived, data = df_t, notch=TRUE,
        ylab='Survival status',
        xlab='Age',
        col=c("lightyellow","lightgrey"),
        main='Boxplot of age by survival status',
        horizontal=TRUE,
        pch=10,
        names = c("Died", "Survived"))
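
The notch limits can also be inspected numerically instead of being read off the plot. The sketch below is a minimal illustration on simulated ages (made-up values, not the Titanic data); it relies on `boxplot.stats`, whose `conf` element holds the lower and upper notch limits, computed as the median plus or minus 1.58 times the hinge spread divided by the square root of n.

```r
# Simulated ages for two groups (made-up values, not the Titanic data)
set.seed(1)
age_died     <- rnorm(200, mean = 30, sd = 12)
age_survived <- rnorm(150, mean = 29, sd = 13)

# boxplot.stats()$conf holds the lower and upper notch limits
notch_died     <- boxplot.stats(age_died)$conf
notch_survived <- boxplot.stats(age_survived)$conf

# Two intervals overlap when each lower limit lies below the other's upper limit
notches_overlap <- notch_died[1] <= notch_survived[2] &&
  notch_survived[1] <= notch_died[2]
notches_overlap
```

For the real data, the same check could be run on `df_t$Age` split by `df_t$Survived`.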

Aritra

2023-09-10